B. B. Misra
Cluster Analysis
• Large databases are usually unlabeled, so grouping or analyzing
such data is a complex task.
• Clustering: The process of organizing objects into groups
whose members are similar in some way.
• Clustering is the process of grouping the data into classes or
clusters, so that
– objects within a cluster have high similarity in comparison to one
another
– but are very dissimilar to objects in other clusters.
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns
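One rough way to check the two criteria above is to compare the average distance between objects in the same cluster with the average distance between objects in different clusters. The sketch below assumes Euclidean distance as the dissimilarity measure; the function name `mean_intra_inter` and the sample points are hypothetical.

```python
import math
from itertools import combinations

def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Same label: the pair is inside one cluster; otherwise it spans two.
        (intra if lp == lq else inter).append(math.dist(p, q))
    if not intra or not inter:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
```

A small within-cluster mean together with a large between-cluster mean corresponds to high intra-class similarity and low inter-class similarity.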
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a
distance function, typically metric: d(i, j)
• Separate “quality” function to measure the “goodness” of a cluster.
• The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables.
• Weights should be associated with different variables based on
applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Requirements of Clustering
• Scalability:
– Many algorithms perform well on small datasets of a few hundred objects
but may give biased results on large datasets. Highly scalable clustering
algorithms are required.
• Ability to deal with different types of attributes:
– Interval-based (numerical), binary, categorical (nominal), ordinal data
or mixture of these data types.
• Discovery of clusters with arbitrary shape:
– Algorithms that use Euclidean or Manhattan distance tend to find
spherical clusters of similar size and density. Clusters may have any
shape, so algorithms should be able to detect arbitrary shapes.
• Minimal requirements for domain knowledge to determine
input parameters
– Some algorithms require input parameters, e.g. the number of clusters.
Such parameters are difficult to determine for high-dimensional data, and
they influence the quality of the clusters.
Requirements of Clustering (contd.)
• Able to deal with noise and outliers
– Real-life data contain outliers, missing values, and unknown or erroneous data.
Algorithms that are sensitive to such data produce poor-quality clusters.
• Insensitive to order of input records:
– Some algorithms produce different clusters depending on the order of the input
data. An algorithm should produce the same clusters in whatever order the
input data are presented.
• High dimensionality
– Many algorithms work well in two or three dimensions, and the human eye can
judge clusters in up to three dimensions. Finding clusters in high-dimensional
data is challenging, especially when the data are sparse and highly skewed.
• Incorporation of user-specified constraints
– Real-world applications may need to perform clustering under various kinds
of constraints.
• Interpretability and usability
– Clustering results should be interpretable, comprehensible, and usable. They may
be tied to semantic interpretations and applications, and the application goal may
influence the selection of clustering features and methods.
Types of Data in Cluster Analysis
Data Structures
• Data matrix (object-by-variable structure)
– (two modes) n objects and p variables, stored as an n-by-p matrix
– Ex. n different persons with p different features such as age, height, weight, etc.
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
• Dissimilarity matrix (object-by-object structure)
– (one mode) n-by-n table
$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
– d(i, j) is the difference or dissimilarity between objects i and j.
– d(i, j) is nonnegative, and close to 0 when objects i and j are highly similar or near each other.
– d(i, j) = d(j, i) and d(i, i) = 0
Interval-scaled variables
• Interval-scaled variables are continuous measurements
– e.g. weight, height, latitude, longitude etc.
• Measurement unit can affect clustering analysis.
– For example, changing measurement units from meters to inches for
height, or from kilograms to pounds for weight, may lead to a very
different clustering structure.
• Expressing a variable in smaller units (e.g. cm instead of km)
leads to a larger range for that variable and so gives it a larger
effect on the resulting clustering structure.
• To avoid dependence on the choice of measurement units, the
data should be standardized, which gives equal weight to all
variables.
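The effect of measurement units can be seen in a small sketch. The three people and their (height, weight) values below are hypothetical, and Euclidean distance is assumed as the dissimilarity measure. Changing height from metres to centimetres changes which person is nearest to A.

```python
import math

def nearest_to_a(people):
    """Return the name of the person closest to 'A' by Euclidean distance."""
    others = {name: v for name, v in people.items() if name != "A"}
    return min(others, key=lambda name: math.dist(people["A"], others[name]))

# Hypothetical data: (height, weight in kg).
people_m = {"A": (1.60, 60.0), "B": (1.90, 61.0), "C": (1.62, 64.0)}
# Same people, with height converted from metres to centimetres.
people_cm = {name: (h * 100.0, w) for name, (h, w) in people_m.items()}

nearest_in_metres = nearest_to_a(people_m)        # weight dominates the distance
nearest_in_centimetres = nearest_to_a(people_cm)  # height dominates the distance
```

With height in metres, weight differences dominate and B is nearest to A; with height in centimetres, height differences dominate and C is nearest.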
Standardization of Interval-scaled variables
• Convert original measurements to unitless variables. Given
measurements for a variable f, standardization is done as
follows.
Find mean absolute deviation, sf:
sf = (|x1f-mf|+|x2f-mf|+…+|xnf-mf|)/n
where x1f, x2f, …, xnf are n measurements of f, and mf is the
mean value of f, i.e. mf = (x1f+ x2f + …+ xnf)/n.
Then calculate Z-score or standardized measurement:
zif = (xif - mf )/sf
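The two steps above can be sketched in a few lines. The function name `standardize` and the sample heights are hypothetical; the scale is the mean absolute deviation sf defined above, not the standard deviation.

```python
def standardize(values):
    """Return standardized measurements z_if = (x_if - m_f) / s_f."""
    n = len(values)
    m_f = sum(values) / n                         # mean value of f
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    if s_f == 0:
        raise ValueError("all measurements are identical, so s_f is zero")
    return [(x - m_f) / s_f for x in values]

# Hypothetical heights, first in centimetres and then in metres.
z_cm = standardize([150.0, 160.0, 170.0, 180.0])
z_m = standardize([1.5, 1.6, 1.7, 1.8])
```

Both calls give the same standardized values up to floating-point rounding, which is why the result no longer depends on the measurement unit.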
Dissimilarity measures of Interval-scaled variables
• Popular distance measures are
• Euclidean distance:
$$
d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{in}-x_{jn})^2}
$$
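Dissimilarity matrix for four objects (lower triangle):
$$
\begin{bmatrix}
0    &      &      &   \\
1.31 & 0    &      &   \\
0.44 & 0.87 & 0    &   \\
0.43 & 1.74 & 0.87 & 0
\end{bmatrix}
$$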
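A minimal sketch of how Euclidean distances fill a lower-triangular dissimilarity matrix. The function names and the sample points are hypothetical, and the points are unrelated to the numbers shown above.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two objects with the same number of variables."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(objects):
    """Return the lower triangle of the n-by-n matrix, row i holding d(i,0..i)."""
    return [[euclidean(objects[i], objects[j]) for j in range(i + 1)]
            for i in range(len(objects))]

# Hypothetical two-variable objects.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = dissimilarity_matrix(points)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.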
Variables of mixed types
• The dissimilarity measures discussed so far apply to variables of
the same type.
• Real databases may describe objects by a mixture of variable
types. A database may contain all six variable types mixed
– (i.e. any combination of interval-scaled, symmetric binary, asymmetric
binary, categorical, ordinal, and ratio-scaled variable types).
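• Suppose the data contain p variables of mixed types. The dissimilarity
d(i, j) between objects i and j is
$$
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$
where d_ij^(f) is the contribution of variable f to the dissimilarity, and the
indicator δ_ij^(f) is 0 if x_if or x_jf is missing, or if x_if = x_jf = 0 and
variable f is asymmetric binary; otherwise δ_ij^(f) = 1.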
• Consider the previous table and all of its variables.
  1   code-A   excellent   445
  2   code-B   fair        22
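A minimal sketch of a mixed-type dissimilarity computed as a weighted average, d(i, j) = Σ δ_ij^(f) d_ij^(f) / Σ δ_ij^(f), where the indicator δ is 0 for variables that should be skipped. The per-variable rules here are assumptions for illustration: categorical and binary variables contribute 0 on a match and 1 otherwise, and interval variables contribute the absolute difference divided by a supplied range. The function name, the `kinds` labels, and the sample objects are hypothetical.

```python
def mixed_dissimilarity(x_i, x_j, kinds, ranges):
    """Weighted-average dissimilarity over variables of mixed types."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        a, b = x_i[f], x_j[f]
        if a is None or b is None:
            continue  # delta = 0: a missing value contributes nothing
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # delta = 0: joint absence is not informative
        if kind in ("categorical", "binary", "asymmetric_binary"):
            d_f = 0.0 if a == b else 1.0
        elif kind == "interval":
            d_f = abs(a - b) / ranges[f]  # assumes ranges[f] = max - min of f
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        numerator += d_f
        denominator += 1.0  # delta = 1 for this variable
    if denominator == 0:
        raise ValueError("no variable is comparable for this pair")
    return numerator / denominator

# Hypothetical objects: one categorical and two interval variables.
kinds = ["categorical", "interval", "interval"]
ranges = [None, 20.0, 5.0]
obj_a = ("red", 10.0, None)   # third measurement is missing
obj_b = ("blue", 15.0, 3.0)
d_ab = mixed_dissimilarity(obj_a, obj_b, kinds, ranges)
```

Here the categorical variable contributes 1, the first interval variable contributes 5/20 = 0.25, and the third variable is skipped, so the result is 1.25 / 2.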
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
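The three point types in the figure can be sketched as follows, assuming the usual DBSCAN definitions: a core point has at least MinPts points (itself included) within distance Eps, a border point is not core but lies within Eps of a core point, and an outlier is neither. The function name and the sample points are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'outlier'."""
    # Eps-neighbourhood of each point, counting the point itself.
    neighbours = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                  for p in points]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")
        else:
            labels.append("outlier")
    return labels

# Hypothetical points: a dense group, one point on its edge, one far away.
points = [(0.0, 0.0), (0.4, 0.0), (0.0, 0.4), (-0.4, 0.0), (0.0, -0.4),
          (1.2, 0.0), (5.0, 5.0)]
labels = classify_points(points, eps=1.0, min_pts=5)
```

The five central points are core, (1.2, 0.0) is a border point because it is within Eps of the core point (0.4, 0.0), and (5.0, 5.0) is an outlier.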
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Core-distance
– The core-distance of an object p is the smallest ε′ value
that makes p a core object.
– If p is not a core object, the core-distance of p is
undefined
• Reachability-distance
– The reachability-distance of an object q with respect to another
object p is the greater value of the core-distance of p and the
Euclidean distance between p and q
• Max(core-distance(p), Euclidean(p,q))
– If p is not a core object, the reachability-distance between p and
q is undefined.
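The two definitions above can be sketched as follows. The sketch assumes that a point counts as its own neighbour, so the core-distance is the distance to the MinPts-th nearest point (p included), and that "undefined" is represented by `None`. The function names and the sample points are hypothetical.

```python
import math

def core_distance(p, points, eps, min_pts):
    """Smallest radius that makes p a core object, or None if p is not core."""
    dists = sorted(math.dist(p, q) for q in points)  # includes p itself at 0.0
    if len(dists) < min_pts:
        return None
    d = dists[min_pts - 1]           # distance to the MinPts-th nearest point
    return d if d <= eps else None   # undefined when p is not core within eps

def reachability_distance(q, p, points, eps, min_pts):
    """max(core-distance(p), Euclidean(p, q)), or None if p is not core."""
    cd = core_distance(p, points, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(p, q))

# Hypothetical points on a line; the last one is isolated.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
p = points[0]
cd_p = core_distance(p, points, eps=2.5, min_pts=3)
rd_near = reachability_distance(points[1], p, points, eps=2.5, min_pts=3)
rd_far = reachability_distance(points[3], p, points, eps=2.5, min_pts=3)
cd_isolated = core_distance(points[4], points, eps=2.5, min_pts=3)
```

For a point q closer to p than the core-distance, the reachability-distance equals the core-distance of p; for a point farther away, it equals the Euclidean distance.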