
Cluster Analysis

What is Cluster Analysis?


Cluster analysis (CA) is a class of techniques used to classify objects or
observations into relatively homogeneous groups, called clusters.
Objects in each cluster tend to be similar to each other and dissimilar
to objects in the other clusters.
The groups themselves stand apart from one another.
CA seeks to discover the number and composition of the groups.
Clusters are suggested by the data, not defined by prior information.
Applications
Market segmentation: Consumers may be clustered on the basis of
benefits sought from the purchase of a product. Each cluster would
consist of consumers who are relatively homogeneous in terms of the
benefits they seek.
Understanding buyer behaviors: CA can be used to identify the
homogeneous groups of buyers.
Identifying new product opportunities: By clustering brands and
products, competitive sets within the market can be determined.
Selecting test markets: By grouping cities into homogeneous clusters,
it is possible to select comparable cities to test various marketing
strategies.
An Example
Assume a hypothetical situation of daily expenditures (in $) on food (X1) and
travel (X2) of five persons:

Person:       a    b    c    d    e
Food (X1):    2    8    9    1    8.5
Travel (X2):  4    2    3    5    1

[Figure: scatter plot of the five persons, with food expenditure (X1) on the horizontal axis and travel expenditure (X2) on the vertical axis]

The figure suggests that the five observations form two clusters. The first consists of
persons a and d, and the second consists of b, c and e. It can be noted that the
observations in each cluster are similar to one another with respect to
expenditures on food and travel, and that the two clusters are quite distinct from
each other.
Continued
Conclusions concerning the number of clusters and their membership
were reached through a visual inspection. This is possible because
only two variables were involved in grouping the observations.
Can a procedure be devised for similarly grouping observations when
there are more than two variables or attributes?
Measures of Distance for Variables
When the grouping is based on variables, it is natural to employ the
familiar concept of distance.

Let i and j be two points with coordinates (X1i, X2i) and (X1j, X2j).
The Euclidean distance between the two points is:

D(i, j) = √[(X1i − X1j)² + (X2i − X2j)²]
Continued
An observation i is closer (more similar) to j than to observation k if
D(i , j) < D(i , k).
An alternative measure is the squared Euclidean distance:

D²(i, j) = (X1i − X1j)² + (X2i − X2j)²

Another measure is the city block (Manhattan) distance, defined as

D(i, j) = |X1i − X1j| + |X2i − X2j|

Continued
The distance measures can be extended to more than two variables.
For example, the Euclidean distance between an observation (X1i, X2i, …, Xki)
and another (X1j, X2j, …, Xkj) is

D(i, j) = √[(X1i − X1j)² + (X2i − X2j)² + … + (Xki − Xkj)²]

All three measures of distance depend on the units in which the variables
are measured, and are dominated by whichever variable takes numerically
larger values. For this reason, the variables are often standardized so that
they have mean 0 and variance 1 before cluster analysis is applied.
Alternatively, weights w1, w2, …, wk reflecting the importance of the
variables could be used and a weighted measure of distance calculated.
For example,

D(i, j) = √[w1(X1i − X1j)² + w2(X2i − X2j)² + … + wk(Xki − Xkj)²]
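
As a concrete illustration (not part of the original text), the Python sketch below computes these distance measures; the function names and the optional weights argument are illustrative assumptions.

```python
import numpy as np

def euclidean(x, y, weights=None):
    """Euclidean distance, optionally weighted by variable importance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

def squared_euclidean(x, y):
    """Squared Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum((x - y) ** 2))

def city_block(x, y):
    """City block (Manhattan) distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

def standardize(data):
    """Scale each variable (column) to mean 0 and variance 1."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

# Persons a and b from the example: (food, travel)
print(euclidean([2, 4], [8, 2]))          # 6.325
print(squared_euclidean([2, 4], [8, 2]))  # 40.0
print(city_block([2, 4], [8, 2]))         # 8.0
```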
Clustering Methods
Hierarchical clustering methods
We do not need to specify in advance how many clusters need to be
extracted.
Example: Nearest Neighbor Method, Furthest Neighbor Method, and Average
Linkage Method.
Non-hierarchical clustering methods
Need to specify in advance how many clusters need to be extracted.
Example: K-means approach
Nearest Neighbor Method
Begin with each observation forming a separate cluster. Merge that
pair of observations that are nearest one another, leaving (n − 1)
clusters for the next step. Next, merge into one cluster that pair of
clusters that are nearest one another, leaving (n − 2) clusters for the
next step. Continue in this fashion, reducing the number of clusters
by one at each step, until a single cluster is formed consisting of all n
observations. At each step, keep track of the distance at which the
clusters are formed.
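
A minimal Python sketch of this procedure, assuming Euclidean distance and with an illustrative function name, might look as follows:

```python
import numpy as np

def nearest_neighbour_clustering(data, labels):
    """Single-linkage (nearest neighbour) agglomeration: start with one
    cluster per observation and repeatedly merge the two nearest clusters,
    recording the distance at which each merge occurs."""
    data = np.asarray(data, dtype=float)
    clusters = [[i] for i in range(len(data))]   # every observation starts alone
    history = []
    while len(clusters) > 1:
        best_i, best_j, best_d = None, None, float('inf')
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: minimum pairwise distance between members
                d = min(np.linalg.norm(data[p] - data[q])
                        for p in clusters[i] for q in clusters[j])
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        merged = clusters[best_i] + clusters[best_j]
        history.append((''.join(labels[k] for k in sorted(merged)), round(best_d, 3)))
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (best_i, best_j)] + [merged]
    return history   # list of (new cluster, merge distance) pairs
```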
Nearest Neighbor Method

[Figure: scatter plot of the five persons, as in the earlier example]

Assume a hypothetical situation of daily expenditures on food (X1) and travel (X2) of five persons:

Person:       a    b    c    d    e
Food (X1):    2    8    9    1    8.5
Travel (X2):  4    2    3    5    1
Continued
The distance between a and b is

D(a, b) = √[(2 − 8)² + (4 − 2)²] = √40 = 6.325

Similarly, the distance between each pair of observations (a, c), (a, d),
etc. can be obtained as follows:

Cluster     a(2,4)   b(8,2)   c(9,3)   d(1,5)   e(8.5,1)
a(2,4)       0       6.325    7.071    1.414    7.159
b(8,2)                0       1.414    7.616    1.118
c(9,3)                         0       8.246    2.062
d(1,5)                                  0       8.500
e(8.5,1)                                         0
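
Assuming NumPy and SciPy are available, this table can be reproduced with a few lines (a sketch, not part of the original example):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

labels = ['a', 'b', 'c', 'd', 'e']
data = np.array([[2, 4], [8, 2], [9, 3], [1, 5], [8.5, 1]])  # (food, travel)

# Symmetric matrix of pairwise Euclidean distances
dist = squareform(pdist(data, metric='euclidean'))
print(np.round(dist, 3))
```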
Continued
Observations b and e are nearest (most similar). So, they are grouped
in the same cluster.

Next, the distance between the cluster (be) and another observation
is the smaller of the distances between that observation, on the one
hand, and b and e, on the other. For example,
D(be, a) = min{D(b, a); D(e, a)} = min{6.325; 7.159} = 6.325.
Continued
The four clusters remaining at the end of this step and the distances
between these clusters are as follows:

Cluster    be       a        c        d
be          0     6.325    1.414    7.616
a                   0      7.071    1.414
c                            0      8.246
d                                     0

Two pairs of clusters are closest to one another at distance 1.414;
these are (a, d) and (be, c). We arbitrarily select (a, d) as the new cluster.
Continued
The distance between (be) and (ad) is
D(be, ad) = min{D(be, a), D(be, d)} = min{6.325, 7.616} = 6.325.
The distance between c and (ad) is
D(c, ad) = min{D(c, a), D(c, d)} = min{7.071, 8.246} = 7.071.
The three clusters remaining at this step and the distances between
these clusters are
Cluster    be       ad       c
be          0      6.325    1.414
ad                   0      7.071
c                             0
Continued
We merge (be) with c to form the cluster (bce).

The distance between the two remaining clusters is


D(ad, bce) = min{D(ad, be), D(ad, c)} = min{6.325, 7.071} = 6.325.
The grouping of these two clusters occurs at a distance of 6.325, a
much greater distance than that at which the earlier groupings took
place.
Cluster    bce      ad
bce         0      6.325
ad                   0
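
For comparison, SciPy's built-in single-linkage routine (assuming SciPy is installed) reproduces the same sequence of merge distances and the same two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

data = np.array([[2, 4], [8, 2], [9, 3], [1, 5], [8.5, 1]])  # a, b, c, d, e

# Single-linkage agglomeration; each row gives the two merged clusters,
# the distance of the merge, and the size of the resulting cluster.
merges = linkage(pdist(data), method='single')
print(np.round(merges, 3))   # merge distances: 1.118, 1.414, 1.414, 6.325

# Cutting the tree into two clusters recovers {a, d} and {b, c, e}
print(fcluster(merges, t=2, criterion='maxclust'))
```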
K-Means Clustering Algorithm
Step 1: Specify the number of clusters and, arbitrarily or deliberately,
the members of each cluster.
Step 2: Calculate each cluster's centroid, and the distance between
each observation and each centroid. If an observation is nearer the
centroid of a cluster other than the one to which it currently belongs,
re-assign it to the nearer cluster.
Step 3: Repeat Step 2 until all observations are nearest the centroid
of the cluster to which they belong.
Step 4: If the number of clusters cannot be specified with confidence
in advance, repeat Steps 1 to 3 with a different number of clusters
and evaluate the results.
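
A minimal NumPy sketch of Steps 1 to 3 is given below; the function name and interface are illustrative, it re-assigns all observations in each pass, and it assumes no cluster ever becomes empty during the iterations.

```python
import numpy as np

def k_means(data, initial_assignment, max_iter=100):
    """Start from an initial cluster assignment, then alternate between
    computing cluster centroids and re-assigning each observation to the
    nearest centroid, until no observation changes cluster."""
    data = np.asarray(data, dtype=float)
    assignment = np.asarray(initial_assignment)
    for _ in range(max_iter):
        # Centroid of each cluster = mean of its members
        centroids = np.array([data[assignment == k].mean(axis=0)
                              for k in np.unique(assignment)])
        # Re-assign every observation to its nearest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = np.unique(assignment)[distances.argmin(axis=1)]
        if np.array_equal(new_assignment, assignment):   # converged
            break
        assignment = new_assignment
    return assignment, centroids
```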
An Example

[Figure: scatter plot of the five persons, as in the earlier example]

Assume a hypothetical situation of daily expenditures on food (X1) and travel (X2) of five persons:

Person:       a    b    c    d    e
Food (X1):    2    8    9    1    8.5
Travel (X2):  4    2    3    5    1
Continued
Suppose we want to form two clusters.
We begin by arbitrarily assigning a, b and d to Cluster 1, and c and e
to Cluster 2.
Calculation of cluster centroid:
It is the point with coordinates equal to the average values of the
variables for the observations in that cluster.
Centroid of Cluster 1: ((2+8+1)/3, (4+2+5)/3) = (3.67, 3.67) = c1, say.
Centroid of Cluster 2: ((9+8.5)/2, (3+1)/2) = (8.75, 2) = c2, say.
The cluster's centroid can be considered as the center of the
observations in the cluster.
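
These centroids can be verified with a short NumPy snippet (the labels and variable names are only for illustration):

```python
import numpy as np

# (food, travel) for persons a, b, c, d, e
points = {'a': (2, 4), 'b': (8, 2), 'c': (9, 3), 'd': (1, 5), 'e': (8.5, 1)}

cluster1 = np.array([points[p] for p in ('a', 'b', 'd')])
cluster2 = np.array([points[p] for p in ('c', 'e')])

print(cluster1.mean(axis=0))   # c1 ≈ (3.67, 3.67)
print(cluster2.mean(axis=0))   # c2 = (8.75, 2.0)
```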
Continued
We now calculate the distance between a and the two centroids:

D(a, c1) = √[(2 − 3.67)² + (4 − 3.67)²] = 1.70
D(a, c2) = √[(2 − 8.75)² + (4 − 2)²] = 7.04

a is closer to the centroid of Cluster 1, to which it is currently assigned.
Therefore, a is not reassigned.
Next, we calculate the distance between b and the two cluster centroids:

D(b, c1) = √[(8 − 3.67)² + (2 − 3.67)²] = 4.64
D(b, c2) = √[(8 − 8.75)² + (2 − 2)²] = 0.75
Continued
Since b is closer to Cluster 2's centroid than to that of Cluster 1, it is
reassigned to Cluster 2.
The new cluster centroids are calculated as c1 = (1.5, 4.5) and c2 = (8.5, 2).
Continued
The distances of the observations from the new cluster centroids (CC)
are calculated as follows:

Observation   Distance from CC 1   Distance from CC 2
a                   0.707                6.801
b                   6.964                0.500
c                   7.649                1.118
d                   0.707                8.078
e                   7.826                1.000

Every observation is now nearest to the centroid of the cluster to which it
belongs, so the K-means procedure stops.
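
As a cross-check, assuming scikit-learn is available, the same grouping is obtained by seeding K-means with the centroids of the arbitrary initial assignment; scikit-learn iterates in batch (Lloyd) fashion rather than relocating one observation at a time, but it converges to the same two clusters here.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[2, 4], [8, 2], [9, 3], [1, 5], [8.5, 1]])   # a, b, c, d, e

# Seed K-means with the centroids of the arbitrary initial assignment
# ({a, b, d} and {c, e}) and iterate to convergence.
init_centroids = np.array([[3.67, 3.67], [8.75, 2.0]])
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(data)

print(km.labels_)            # a and d in one cluster; b, c and e in the other
print(km.cluster_centers_)   # final centroids (1.5, 4.5) and (8.5, 2.0)
```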
Cluster Analysis Procedure
Step 1: Formulate the problem.
Step 2: Select a distance measure.
Step 3: Select a clustering procedure.
Step 4: Decide the number of clusters.
Step 5: Interpret and profile clusters.
Step 6: Validate the obtained results.
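
Purely as an illustration of how Steps 2 to 5 might fit together in code (the choices of standardization, Euclidean distance, and single linkage are assumptions, not a prescription):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_and_profile(data, n_clusters):
    """Illustrative pipeline: standardize the variables, apply single-linkage
    hierarchical clustering on Euclidean distances, cut the tree at the chosen
    number of clusters, and profile each cluster by its mean on the original
    variables."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean(axis=0)) / data.std(axis=0)         # Step 2: standardized distances
    tree = linkage(z, method='single', metric='euclidean')    # Step 3: clustering procedure
    labels = fcluster(tree, t=n_clusters, criterion='maxclust')  # Step 4: number of clusters
    profiles = {k: data[labels == k].mean(axis=0)              # Step 5: cluster profiles
                for k in np.unique(labels)}
    return labels, profiles
```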
Case Example
A major Indian FMCG company wants to map the profile of its target customers in
terms of lifestyle, attitudes, and perception. With the help of the marketing
research team, the company's manager prepares a set of 15 statements which
they feel measure many of the variables of interest. These 15 statements are
given below. The respondents had to agree or disagree (1 = strongly agree, 2 =
agree, 3 = neither agree nor disagree, 4 = disagree, 5 = strongly disagree) with each
statement.
1. I prefer to use email rather than write a letter.
2. I feel that quality products are always priced high.
3. I think twice before I buy anything.
4. Television is a major source of entertainment.
5. A car is a necessity rather than a luxury.
6. I prefer fast food and ready-to-use products.
Continued
7. People are more health conscious today.
8. Entry of foreign companies has increased the efficiency of Indian companies.
9. Women are active participants in purchase decisions.
10. I believe politicians can play a positive role.
11. I enjoy watching movies.
12. If I get a chance, I would like to settle abroad.
13. I always buy branded products.
14. I frequently go out on weekends.
15. I prefer to pay by credit card rather than in cash.
