
Cluster Analysis

B. B. Misra
Cluster Analysis
• Large databases are usually unlabeled; grouping and analyzing
such data is a complex task.
• Clustering: The process of organizing objects into groups
whose members are similar in some way.
• Clustering is the process of grouping the data into classes or
clusters, so that
– objects within a cluster have high similarity in comparison to one
another
– but are very dissimilar to objects in other clusters.
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a
distance function, typically metric: d(i, j)
• Separate “quality” function to measure the “goodness” of a cluster.
• The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables.
• Weights should be associated with different variables based on
applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Requirements of Clustering
• Scalability:
– Many algorithms perform well on datasets of a few hundred objects
but may give biased results on large datasets. Design of highly scalable
clustering algorithms is required.
• Ability to deal with different types of attributes:
– Interval-based (numerical), binary, categorical (nominal), ordinal data
or mixture of these data types.
• Discovery of clusters with arbitrary shape:
– Algorithms using Euclidean or Manhattan distance tend to find
spherical clusters of similar size and density. Clusters may have any
shape, and algorithms should detect such arbitrary shapes.
• Minimal requirements for domain knowledge to determine
input parameters
– Some algorithms require input parameters, e.g. the number of clusters.
Such parameters are difficult to determine for high-dimensional data
and strongly influence the quality of the clustering.
Requirements of Clustering cntd.
• Ability to deal with noise and outliers
– Real-life data contain outliers, missing values, and unknown or erroneous
entries. Algorithms sensitive to such data produce poor-quality clusters.
• Insensitivity to the order of input records:
– Some algorithms produce different clusters depending on the order of the
input data, but an algorithm should ideally produce the same clusters in
whatever order the input data is presented.
• High dimensionality
– Many algorithms work well in two or three dimensions, and the human eye
can judge up to three dimensions. When data is sparse and highly skewed,
finding clusters in high dimensions is challenging.
• Incorporation of user-specified constraints
– Real-world applications may need to perform clustering under various kinds
of constraints.
• Interpretability and usability
– Clustering results should be interpretable, comprehensible, and usable. They
may be tied to semantic interpretations and applications; the application goal
may influence the selection of clustering features and methods.
Types of Data in Cluster Analysis
Data Structures
• Data matrix (object-by-variable structure)
– (two modes) n objects and p variables, an n-by-p matrix
– Ex. n different persons with p different features such as age, height, weight, etc.

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

• Dissimilarity matrix (object-by-object structure)
– (one mode) an n-by-n table
– d(i, j) is the difference or dissimilarity between objects i and j.
– d(i, j) is nonnegative, and close to 0 when objects i and j are highly
similar or near each other.
– d(i, j) = d(j, i) and d(i, i) = 0

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Interval-scaled variables
• Interval-scaled variables are continuous measurements
– e.g. weight, height, latitude, longitude etc.
• Measurement unit can affect clustering analysis.
– For example, changing measurement units from meters to inches for
height, or from kilograms to pounds for weight, may lead to a very
different clustering structure.
• Expressing a variable in smaller units (e.g. cm instead of km)
leads to a larger range and a larger effect on the resulting
clustering structure.
• To avoid dependence on the choice of measurement units, the
data should be standardized, which gives equal weight to all
variables.
Standardization of Interval-scaled variables
• Convert the original measurements to unitless variables. Given
measurements for a variable f, standardization is done as
follows.
First find the mean absolute deviation, $s_f$:
$$s_f = \frac{1}{n}\left(|x_{1f}-m_f| + |x_{2f}-m_f| + \cdots + |x_{nf}-m_f|\right)$$
where $x_{1f}, x_{2f}, \ldots, x_{nf}$ are n measurements of f, and $m_f$ is the
mean value of f, i.e. $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$.
Then calculate the z-score, or standardized measurement:
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
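As a quick illustration, here is a minimal Python sketch of this standardization (NumPy assumed available; the sample heights are made-up values):

```python
import numpy as np

def standardize(x):
    """Z-score a variable f using the mean absolute deviation s_f,
    as defined above: z_if = (x_if - m_f) / s_f."""
    x = np.asarray(x, dtype=float)
    m_f = x.mean()                    # mean value of f
    s_f = np.abs(x - m_f).mean()      # mean absolute deviation
    return (x - m_f) / s_f

heights_cm = [170, 165, 180, 175]     # illustrative measurements
print(standardize(heights_cm))        # [-0.5 -1.5  1.5  0.5]
```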
Dissimilarity measures of Interval-scaled variables
• Popular distance measures are:
• Euclidean distance:
$$d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{in}-x_{jn})^2}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are two data objects.
• Manhattan or city block distance:
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{in}-x_{jn}|$$
• Minkowski distance or $L_p$ norm:
$$d(i,j) = \left(|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{in}-x_{jn}|^p\right)^{1/p}$$
where p is a positive integer.
• Weighted Euclidean distance: when a weight is assigned to each variable
based on its importance, the distance is found as
$$d(i,j) = \sqrt{w_1 (x_{i1}-x_{j1})^2 + w_2 (x_{i2}-x_{j2})^2 + \cdots + w_n (x_{in}-x_{jn})^2}$$
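A minimal Python sketch of these distances (NumPy assumed; the sample vectors are arbitrary):

```python
import numpy as np

def minkowski(i, j, p=2):
    """Minkowski (L_p) distance: p=1 gives Manhattan, p=2 Euclidean."""
    diff = np.abs(np.asarray(i, float) - np.asarray(j, float))
    return (diff ** p).sum() ** (1.0 / p)

def weighted_euclidean(i, j, w):
    """Euclidean distance with a weight per variable."""
    diff = np.asarray(i, float) - np.asarray(j, float)
    return np.sqrt((np.asarray(w, float) * diff ** 2).sum())

x, y = (1, 7, 2), (3, 4, 6)
print(minkowski(x, y, p=1))                  # Manhattan: 2+3+4 = 9
print(minkowski(x, y, p=2))                  # Euclidean: sqrt(29) ~ 5.39
print(weighted_euclidean(x, y, (1, 1, 2)))   # sqrt(4+9+32) ~ 6.71
```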
Binary Variables
• For binary variables with equal weight, a 2x2 contingency table is built:

                object j
               1     0    sum
object i   1   q     r    q+r
           0   s     t    s+t
         sum  q+s   r+t    p

where
q = number of variables equal to 1 for both objects i and j,
r = number of variables with 1 for object i and 0 for object j,
s = number of variables with 0 for object i and 1 for object j,
t = number of variables equal to 0 for both objects i and j,
p = q+r+s+t, the total number of variables.
• Symmetric binary variable: both of its states carry the same weight,
with no preference for either outcome, e.g. gender (male or female).
• Asymmetric binary variable: the outcomes of the states are not
equally important, e.g. the positive or negative outcome of a disease test.
Binary Variables cntd.
• Symmetric binary dissimilarity:
The dissimilarity between objects i and j is
$$d(i,j) = \frac{r+s}{q+r+s+t}$$
• Asymmetric binary dissimilarity:
Let 1 represent HIV positive and 0 HIV negative. Given two asymmetric
binary variables, the agreement of two 1s (a positive match) is considered more
significant than that of two 0s (a negative match).
Such binary variables are also called "monary" (as if having one state).
Here the negative matches, t, are considered unimportant and ignored:
$$d(i,j) = \frac{r+s}{q+r+s}$$
• Jaccard coefficient, or asymmetric binary similarity:
The similarity between two objects i and j is computed as
$$sim(i,j) = \frac{q}{q+r+s} = 1 - d(i,j)$$
Ex. Dissimilarity between binary variables

Relational table: patients described by binary attributes

Name   Gender  Fever  Cough  Test1  Test2  Test3  Test4
Jack   M       Y      N      P      N      N      N
Mary   F       Y      N      P      N      P      N
Jim    M       Y      Y      N      N      N      N
…      …       …      …      …      …      …      …

In the table, name is the object identifier, gender is a symmetric
attribute, and the rest are asymmetric attributes. For the asymmetric
attributes, let Y (yes) and P (positive) be set to 1 and N (no or negative) to 0.

Relational table: patient attributes converted to binary bits

Name   Gender  Fever  Cough  Test1  Test2  Test3  Test4
Jack   M       1      0      1      0      0      0
Mary   F       1      0      1      0      1      0
Jim    M       1      1      0      0      0      0
…      …       …      …      …      …      …      …

Let the distance be calculated based on the asymmetric variables only,
using d(i, j) = (r+s)/(q+r+s):
d(Jack, Mary) = (0+1)/(2+0+1) = 0.33
d(Jack, Jim) = (1+1)/(1+1+1) ≈ 0.67
d(Mary, Jim) = (2+1)/(1+2+1) = 0.75
These measurements suggest that Mary and Jim are unlikely to have a
similar disease (the highest dissimilarity among the three pairs), whereas
Jack and Mary are the most likely to have a similar disease.
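The computation above is easy to reproduce; a short Python sketch using the asymmetric-attribute bit vectors from the table:

```python
def asymmetric_binary_d(i, j):
    """d(i, j) = (r+s)/(q+r+s): negative matches (t) are ignored."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    return (r + s) / (q + r + s)

# Asymmetric attributes (Fever, Cough, Test1..Test4) from the table
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(asymmetric_binary_d(jack, mary))  # 0.333...
print(asymmetric_binary_d(jack, jim))   # 0.666...
print(asymmetric_binary_d(mary, jim))   # 0.75
```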
Categorical Variables
• A categorical variable is a generalization of the binary variable; it can
have more than two states. Ex. map-color has five states: red, yellow,
green, pink, and blue.
• Let the number of states of a categorical variable be M. The states
can be denoted by letters, symbols, or a set of integers, such as 1,
2, …, M (such integers used for data handling, do not represent
specific ordering).
• The dissimilarity between two objects i and j is computed as
d(i, j) = (p - m)/p,
where m is the number of matches (i.e. the number of variables for which
i and j are in the same state), and p is the total number of variables.
• Weights can be assigned to increase the effects of m or to assign
greater weight to the matches in variables having a larger number
of states.
Ex. Dissimilarity between categorical variables

object      test1          test2      test3
identifier  (categorical)  (ordinal)  (ratio-scaled)
1           code-A         excellent  445
2           code-B         fair       22
3           code-C         good       164
4           code-A         excellent  1210

• In the table, only attribute test1 is categorical.
• Here the total number of variables p = 1, so d(i, j) = 0 if objects i and j
match, and 1 otherwise. The one-mode dissimilarity matrix for the four
objects of test1 ($o_i$ being the ith object identifier) is:

$$d = \begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ 1 & 1 & 0 & \\ 0 & 1 & 1 & 0 \end{bmatrix}$$

• Categorical variables can be encoded by asymmetric binary variables by
creating a new binary variable for each of the M states, i.e. the state an
object takes is set to 1 and the rest of the states are set to 0.
• Ex. map-color: out of the five states (red, yellow, green, pink, and blue),
yellow is set to 1 and the rest are set to 0.
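A short Python sketch of the simple-matching dissimilarity d(i, j) = (p - m)/p (the object vectors here carry only test1, so p = 1):

```python
def categorical_d(i, j):
    """Simple matching: d(i, j) = (p - m) / p."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

test1 = ["code-A", "code-B", "code-C", "code-A"]
for a in range(1, 4):
    for b in range(a):
        print(f"d({a+1},{b+1}) = {categorical_d([test1[a]], [test1[b]])}")
# d(2,1)=1.0, d(3,1)=1.0, d(3,2)=1.0, d(4,1)=0.0, d(4,2)=1.0, d(4,3)=1.0
```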
Ordinal Variables
• Discrete ordinal variable: resembles a categorical variable, but
the M states of the ordinal value are ordered in a meaningful
sequence.
– Useful for the subjective assessment of qualities that cannot be measured
objectively, e.g. professional ranks enumerated in a sequential
order: Asst. Prof., Assoc. Prof., Full Prof.
• Continuous ordinal variable: A set of continuous data of an
unknown scale; i.e. relative ordering is essential but not the
actual magnitude.
– Ex. the relative ranking in a sport (e.g. gold, silver, bronze) matters more
than the actual values of a particular measure.
– Interval-scaled quantities may be discretized by splitting the value
range into a finite number of classes.
– The values of an ordinal variable can be mapped to ranks. Let an
ordinal variable f have $M_f$ states, which represent the ranking $1, \ldots, M_f$.
Ordinal Variables cntd.
• Dissimilarity calculation for ordinal variables is similar to that for
interval-valued variables.
• Let f be a variable from a set of ordinal variables describing n objects.
The dissimilarity calculation proceeds as follows:
1. Ranking: the value of f for the ith object is $x_{if}$, and f has $M_f$
ordered states, representing the ranking $1, \ldots, M_f$. Replace each
$x_{if}$ by its corresponding rank, $r_{if} \in \{1, \ldots, M_f\}$.
2. Normalization: each ordinal variable may have a different number of
states, so map the variable onto [0.0, 1.0] as
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
3. Distance calculation: then compute the dissimilarity using any
of the interval-valued methods.
Ex. Dissimilarity between ordinal variables

• Consider attribute test2 of the previous table.
• Let the three attribute values {fair, good, excellent} be ranked as {1, 2, 3}.

After ranking the 2nd attribute:

object identifier  test2 (rank)
1                  3
2                  1
3                  2
4                  3

• Normalize using $z_{if} = \frac{r_{if}-1}{M_f-1}$, where $r_{if}$ is the respective rank
value and the total number of states is $M_f = 3$.

After ranking and normalization of the 2nd attribute:

object identifier  test2 (normalized)
1                  1
2                  0
3                  0.5
4                  1

• The one-mode dissimilarity matrix obtained from the Euclidean distance is then

$$d = \begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ 0.5 & 0.5 & 0 & \\ 0 & 1 & 0.5 & 0 \end{bmatrix}$$
Ratio-Scaled Variables
• A ratio-scaled variable makes a positive measurement on a
nonlinear scale, such as an exponential scale, approximately
following the formula $Ae^{Bt}$ or $Ae^{-Bt}$,
where A and B are positive constants, and t typically represents time.
Ex. the growth of a bacteria population or the decay of a radioactive
element.
• Dissimilarity can be computed in three ways:
1. Treat them as interval-scaled variables; not a good choice, as the
scale may be distorted.
2. Logarithmic transformation: let $x_{if}$ be the value of variable f for
object i. Apply the transformation $y_{if} = \log(x_{if})$ for each
object, then treat the $y_{if}$ values as interval-scaled and find the distance.
3. Treat $x_{if}$ as continuous ordinal data and treat the ranks as
interval-valued.
Ex. Dissimilarity between ratio-scaled variables

• Consider the previous table; test3 is ratio-scaled.
• Let us use the 2nd approach, logarithmic transformation.

After logarithmic transformation of the 3rd attribute:

object identifier  test3 (log10)
1                  2.65
2                  1.34
3                  2.21
4                  3.08

• Using the Euclidean distance measure on the transformed values,
the dissimilarity matrix is

$$d = \begin{bmatrix} 0 & & & \\ 1.31 & 0 & & \\ 0.44 & 0.87 & 0 & \\ 0.43 & 1.74 & 0.87 & 0 \end{bmatrix}$$
Variables of mixed types
• The dissimilarity measures discussed so far apply to variables of a
single type.
• Real databases may describe objects by a mixture of variable
types. A database may contain all six variable types mixed
– (i.e. any combination of interval-scaled, symmetric binary, asymmetric
binary, categorical, ordinal, and ratio-scaled variable types).
• Let there be p variables of mixed types. The dissimilarity d(i, j)
between objects i and j is defined as
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where the indicator $\delta_{ij}^{(f)} = 0$ if either
i. $x_{if}$ or $x_{jf}$ is missing (i.e. there is no measurement of variable f for
object i or j), or
ii. $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary;
otherwise $\delta_{ij}^{(f)} = 1$.
Variables of mixed types cntd.
$d_{ij}^{(f)}$ is calculated depending on the type of f:
i. If f is interval-based: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$,
where h runs over all nonmissing objects for variable f.
ii. If f is binary or categorical: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$.
iii. If f is ordinal: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if}-1}{M_f-1}$, and
treat $z_{if}$ as interval-scaled.
iv. If f is ratio-scaled: either perform a logarithmic transformation
and treat the transformed data as interval-scaled; or treat f
as continuous ordinal data, compute $r_{if}$ and $z_{if}$, and then
treat $z_{if}$ as interval-scaled.
Note. All steps are the same as discussed before, except (i) interval-based,
where the values are normalized so they map onto the interval [0.0, 1.0].
Ex. Dissimilarity of variables of mixed types

object      test1          test2      test3
identifier  (categorical)  (ordinal)  (ratio-scaled)
1           code-A         excellent  445
2           code-B         fair       22
3           code-C         good       164
4           code-A         excellent  1210

• Consider the previous table and all three of its variables.
• The procedures followed for test1 and test2 remain the same as before,
giving the dissimilarity matrices

$$d^{(1)} = \begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ 1 & 1 & 0 & \\ 0 & 1 & 1 & 0 \end{bmatrix} \qquad d^{(2)} = \begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ 0.5 & 0.5 & 0 & \\ 0 & 1 & 0.5 & 0 \end{bmatrix}$$

• For the test3 variable, the log values obtained earlier (2.65, 1.34, 2.21, 3.08)
need to be normalized. Here $\max_h x_{hf} = 3.08$ and $\min_h x_{hf} = 1.34$.
The dissimilarity matrix for test3, using $d_{ij}^{(3)} = \frac{|x_{if}-x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, is

$$d^{(3)} = \begin{bmatrix} 0 & & & \\ 0.75 & 0 & & \\ 0.25 & 0.50 & 0 & \\ 0.25 & 1.00 & 0.50 & 0 \end{bmatrix}$$

• Now use the dissimilarity matrices of the three variables to compute
$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$.

Ex. $d(2,1) = \frac{1(1) + 1(1) + 1(0.75)}{1+1+1} = 0.92$, $\quad d(3,1) = \frac{1(1) + 1(0.5) + 1(0.25)}{1+1+1} = 0.58$

$$d = \begin{bmatrix} 0 & & & \\ 0.92 & 0 & & \\ 0.58 & 0.67 & 0 & \\ 0.08 & 1.00 & 0.67 & 0 \end{bmatrix}$$

d(4,1) is lowest: objects 1 and 4 are the most similar.
d(4,2) is highest: objects 2 and 4 are the most dissimilar.
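To make the combination step concrete, here is a sketch that averages the three per-variable dissimilarity matrices exactly as in the formula (all $\delta$ values are 1 here, since nothing is missing and no variable is asymmetric binary):

```python
import numpy as np

# Per-variable dissimilarity matrices from the worked example
d_test1 = np.array([[0, 1, 1, 0], [1, 0, 1, 1],
                    [1, 1, 0, 1], [0, 1, 1, 0]], float)
d_test2 = np.array([[0, 1, .5, 0], [1, 0, .5, 1],
                    [.5, .5, 0, .5], [0, 1, .5, 0]], float)
d_test3 = np.array([[0, .75, .25, .25], [.75, 0, .5, 1],
                    [.25, .5, 0, .5], [.25, 1, .5, 0]], float)

d_f = np.stack([d_test1, d_test2, d_test3])  # shape (p, n, n)
delta = np.ones_like(d_f)                    # delta_ij^(f) = 1 for all f, i, j

d = (delta * d_f).sum(axis=0) / delta.sum(axis=0)
print(np.round(d, 2))
# [[0.   0.92 0.58 0.08]
#  [0.92 0.   0.67 1.  ]
#  [0.58 0.67 0.   0.67]
#  [0.08 1.   0.67 0.  ]]
```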
Vector Objects
• Information retrieval, text document clustering, and biological
taxonomy compare and cluster complex objects containing large
numbers of symbolic entities (e.g. keywords and phrases in text
documents).
• Traditional distance measures are not used here.
• Let s(x, y) be a similarity function for comparing two vectors x and y.
• One popular approach is the cosine measure:
$$s(x,y) = \frac{x^t \cdot y}{\|x\|\,\|y\|}$$
where $x^t$ is the transpose of x, $\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}$ is the
Euclidean norm of vector x, and s is the cosine of the angle between
vectors x and y.
• The cosine measure is invariant to rotation and dilation but not to
translation and general linear transformation.
Vector Objects cntd.
• Ex. Nonmetric similarity between two objects using the cosine measure.
• Let the two vectors be
– x = (1, 1, 0, 0) and
– y = (0, 1, 1, 0).
• The similarity between x and y is
$$s(x,y) = \frac{x^t \cdot y}{\|x\|\,\|y\|} = \frac{1\cdot 0 + 1\cdot 1 + 0\cdot 1 + 0\cdot 0}{\sqrt{1^2+1^2+0^2+0^2}\,\sqrt{0^2+1^2+1^2+0^2}} = \frac{1}{\sqrt{2}\sqrt{2}} = 0.5$$
• Tanimoto coefficient or Tanimoto distance:
• It is the ratio of the number of attributes shared by x and y to the
number of attributes possessed by x or y:
$$s(x,y) = \frac{x^t \cdot y}{x^t \cdot x + y^t \cdot y - x^t \cdot y}$$
• This function is frequently used in information retrieval and biological
taxonomy.
• There is no universal standard to guide the selection of a similarity (distance)
function, or of data normalization, for cluster analysis.
• The user should refine the selection of such measures to generate meaningful
and useful clusters for the application.
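Both similarity functions take only a few lines in Python; a sketch reproducing the example values (NumPy assumed):

```python
import numpy as np

def cosine_sim(x, y):
    """s(x, y) = x.y / (||x|| ||y||)"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def tanimoto_sim(x, y):
    """s(x, y) = x.y / (x.x + y.y - x.y)"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (x @ x + y @ y - x @ y)

x, y = (1, 1, 0, 0), (0, 1, 1, 0)
print(cosine_sim(x, y))    # 0.5, as computed above
print(tanimoto_sim(x, y))  # 1 / (2 + 2 - 1) = 0.333...
```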
Major Clustering Methods
Major Clustering Approaches
• Partitioning approach:
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or
objects) using some criterion
– Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches cntd.
• Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
• Model-based:
– A model is hypothesized for each of the clusters, and the idea is to
find the best fit of the data to the given model
– Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
– Based on the analysis of frequent patterns
– Typical methods: pCluster
• User-guided or constraint-based:
– Clustering by considering user-specified or application-specific
constraints
– Typical methods: COD (obstacles), constrained clustering
Distance between Clusters
• Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min d(tip, tjq) over tip ∈ Ki, tjq ∈ Kj
• Complete link: largest distance between an element in one
cluster and an element in the other, i.e., dis(Ki, Kj) = max d(tip, tjq)
• Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg d(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = d(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e.,
dis(Ki, Kj) = d(Mi, Mj)
– Medoid: one chosen, centrally located object in the cluster
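A hedged sketch of these inter-cluster distances (SciPy's cdist is assumed for the pairwise distance matrix; the two clusters are toy data):

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_link(Ki, Kj):   return cdist(Ki, Kj).min()   # min over all pairs
def complete_link(Ki, Kj): return cdist(Ki, Kj).max()   # max over all pairs
def average_link(Ki, Kj):  return cdist(Ki, Kj).mean()  # mean over all pairs
def centroid_dist(Ki, Kj):
    return float(np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)))

Ki = np.array([[0.0, 0.0], [0.0, 1.0]])
Kj = np.array([[3.0, 0.0], [4.0, 0.0]])
print(single_link(Ki, Kj))    # 3.0
print(complete_link(Ki, Kj))  # sqrt(17) ~ 4.12
print(centroid_dist(Ki, Kj))  # |(0, 0.5) - (3.5, 0)| ~ 3.54
```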
Density-Based Methods
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-
connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Some of the density based methods are:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
DBSCAN
• Density-Based Spatial Clustering of Applications with
Noise (DBSCAN)
• The algorithm grows regions with sufficiently high density
into clusters and discovers clusters of arbitrary shape in
spatial databases with noise.
• It defines a cluster as a maximal set of density-connected
points.
DBSCAN: Definitions
• ε-neighborhood: the neighborhood within a
radius ε of a given object is called the ε-
neighborhood of the object.
• Core object: if the ε-neighborhood of an object
contains at least a minimum number, MinPts, of
objects, then the object is called a core object.
• Ex. Let ε = 1 cm and MinPts = 3; m and p are core objects
because their ε-neighborhoods contain at least 3
points, but q is not a core object.
• Directly density-reachable: given a set of objects
D, an object p is directly density-reachable from
object q if p is within the ε-neighborhood of q,
and q is a core object.
• Ex. q is directly density-reachable from m, m is
directly density-reachable from p, and p is directly
density-reachable from m.
DBSCAN: Definitions
• Density-reachable: an object p is density-reachable from object q with
respect to ε and MinPts in a set of objects D if there is a chain of objects p1,
…, pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from
pi with respect to ε and MinPts, for 1 ≤ i < n, pi ∈ D.
• Density-connected: an object p is density-connected to object q with respect
to ε and MinPts in a set of objects D if there is an object o ∈ D such that
both p and q are density-reachable from o with respect to ε and MinPts.
Example:
Let MinPts = 3.
• m, p, o, and r are core objects, each having at least 3 objects in its ε-
neighborhood.
• q is directly density-reachable from m.
• m is directly density-reachable from p, and vice versa.
• q is (indirectly) density-reachable from p, because q is directly
density-reachable from m and m is directly density-reachable from
p. However, p is not density-reachable from q, because q is not a
core object.
• Similarly, r and s are density-reachable from o, and o is density-
reachable from r, but not from s.
• o, r, and s are all density-connected.
DBSCAN
• A density-based cluster is a set of density-connected objects
that is maximal with respect to density-reachability.
• Every object not contained in any cluster is considered to be
noise.
• DBSCAN searches for clusters by checking the ε-neighborhood
of each point in the database.
• If the ε-neighborhood of a point p contains at least MinPts points,
a new cluster with p as a core object is created.
• DBSCAN then iteratively collects directly density-reachable
objects from these core objects, which may involve the merge
of a few density-reachable clusters.
• The process terminates when no new point can be added to
any cluster.
DBSCAN
• Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: a cluster with core and border points and an outlier; Eps = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
• Arbitrarily select a point p.
• Retrieve all points density-reachable from p w.r.t. Eps and
MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable
from p, and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been
processed.
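A minimal usage sketch with scikit-learn's DBSCAN implementation (the dataset and parameter values are illustrative; eps corresponds to ε and min_samples to MinPts):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two half-moons: non-spherical clusters that distance-to-centroid
# methods split incorrectly but DBSCAN recovers.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # label -1 marks noise points
print("clusters:", len(set(labels) - {-1}))
print("noise points:", int(np.sum(labels == -1)))
```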
DBSCAN: Sensitive to Parameters
DBSCAN
Drawbacks:
• Discovery of quality clusters depends on the parameter settings (ε,
MinPts).
• It is difficult to set these parameters for real-world, high-dimensional
data.
• A slight difference in the settings may make a significant difference in
the clustering.
• Real high-dimensional data often have skewed distributions; a single
global density parameter may not characterize the intrinsic clustering
structure.
OPTICS
• Ordering Points To Identify the Clustering Structure
(OPTICS)
• Motivation
– Very different local densities may be needed to reveal
clusters in different regions
– Clusters A, B, C1, C2, and C3 cannot be detected using one
global density parameter
– A global density parameter can detect either A, B, C
or C1, C2, C3
OPTICS
• Core-distance
– The core-distance of an object p is the smallest ε′ value
that makes p a core object.
– If p is not a core object, the core-distance of p is
undefined.
• Reachability-distance
– The reachability-distance of an object q with respect to another
object p is the greater of the core-distance of p and the
Euclidean distance between p and q:
max(core-distance(p), Euclidean(p, q))
– If p is not a core object, the reachability-distance between p and
q is undefined.
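A brief usage sketch with scikit-learn's OPTICS (illustrative data: two blobs of very different densities, the situation the motivation slide describes):

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# One tight and one diffuse cluster: no single DBSCAN eps fits both.
X, _ = make_blobs(n_samples=[200, 200], centers=[[0, 0], [6, 6]],
                  cluster_std=[0.3, 1.5], random_state=0)

opt = OPTICS(min_samples=5).fit(X)
# Reachability distances in the computed cluster ordering;
# "valleys" in this sequence correspond to clusters.
reach = opt.reachability_[opt.ordering_]
print(np.round(reach[:10], 3))
```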