Jing Gao
SUNY Buffalo
Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering,
etc.
Hierarchical Clustering
• Agglomerative approach
Initialization:
– Each object is a cluster
Iteration:
– Merge the two clusters that are most similar to each other
– Repeat until all objects are merged into a single cluster
[Figure: objects a, b, c, d, e merged step by step into ab, de, cde, and finally abcde]
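The initialization and iteration steps can be sketched in a few lines of Python. This is a naive illustration, not an efficient implementation: it assumes distinct 1-D points with absolute difference as the distance, and the names `agglomerative`, `linkage`, and `history` are chosen for this example only.

```python
from itertools import combinations

def agglomerative(points, linkage=min):
    """Naive agglomerative clustering of distinct 1-D points.

    `linkage` reduces all pairwise distances between two clusters to a
    single number (min = single link, max = complete link).
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    # Initialization: each object is its own cluster.
    clusters = [(p,) for p in points]
    history = []
    # Iteration: merge the two closest clusters until one cluster remains.
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage(abs(x - y) for x in a for y in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        history.append((a, b, d))
    return history

history = agglomerative([1.0, 2.0, 6.0, 7.0, 20.0])
for a, b, d in history:
    print(a, "+", b, "at distance", d)
```

Passing `linkage=max` instead of the default `min` changes only how the distance between two clusters is computed; the merge loop stays the same.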
Hierarchical Clustering
Dendrogram
• A tree that shows how clusters are merged/split
hierarchically
• Each node on the tree is a cluster; each leaf node is a
singleton cluster
Dendrogram
• A clustering of the data objects is obtained by cutting
the dendrogram at the desired level; each connected
component then forms a cluster
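A minimal sketch of cutting a dendrogram, assuming the tree is stored as a list of merges with heights. The encoding `(left_leaves, right_leaves, height)`, the function name `cut_dendrogram`, and the heights themselves are hypothetical choices for this example; the sketch also assumes merge heights never decrease from a child to its parent.

```python
def cut_dendrogram(leaves, merges, level):
    """Return the clusters obtained by cutting the dendrogram at `level`."""
    parent = {leaf: leaf for leaf in leaves}

    def find(x):
        # Follow parent links to the representative of x's component.
        while parent[x] != x:
            x = parent[x]
        return x

    # Keep only the merges that happen at or below the cut level.
    for left, right, height in merges:
        if height <= level:
            parent[find(left[0])] = find(right[0])

    # Each connected component of the remaining tree is one cluster.
    groups = {}
    for leaf in leaves:
        groups.setdefault(find(leaf), []).append(leaf)
    return sorted(groups.values())

leaves = ["a", "b", "c", "d", "e"]
merges = [  # illustrative heights, not taken from the slides
    (("a",), ("b",), 1.0),
    (("d",), ("e",), 1.5),
    (("c",), ("d", "e"), 2.0),
    (("a", "b"), ("c", "d", "e"), 4.0),
]
print(cut_dendrogram(leaves, merges, level=1.7))
```

Raising `level` yields fewer, larger clusters; lowering it yields more, smaller ones.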
Agglomerative Clustering Algorithm
Starting Situation
• Start with clusters of individual points and a distance matrix
[Figure: points p1, p2, ..., p12 and their distance matrix, with rows and columns indexed by p1, p2, p3, p4, p5, ...]
Intermediate Situation
• After some merging steps, we have some clusters
• Choose the two clusters that have the smallest
distance (largest similarity) to merge
[Figure: clusters C1 to C5 over points p1, ..., p12, and the distance matrix indexed by C1 to C5]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update
the distance matrix.
[Figure: clusters C1 to C5 over points p1, ..., p12, and the distance matrix indexed by C1 to C5]
After Merging
• The question is “How do we update the distance matrix?”
[Figure: distance matrix after the merge, indexed by C1, C2 U C5, C3, C4; the entries in the row and column of C2 U C5 are marked "?"]
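One common way to fill in the "?" entries is to derive each new distance from the two old rows, so the original points never need to be revisited. The sketch below assumes three update rules (MIN, MAX, and size-weighted group average); the function name `merge_and_update`, the dictionary layout, and all numbers are hypothetical.

```python
def merge_and_update(dist, sizes, i, j, rule="min"):
    """Merge clusters i and j; return the updated distances and sizes.

    dist:  maps frozenset({a, b}) to the distance between clusters a and b
    sizes: maps cluster name to its number of points
    """
    merged = f"{i} U {j}"
    # Distances between clusters not involved in the merge are unchanged.
    new_dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
    for k in sizes:
        if k in (i, j):
            continue
        d_ki = dist[frozenset((k, i))]
        d_kj = dist[frozenset((k, j))]
        if rule == "min":          # single link
            d = min(d_ki, d_kj)
        elif rule == "max":        # complete link
            d = max(d_ki, d_kj)
        elif rule == "average":    # group average, weighted by cluster size
            d = (sizes[i] * d_ki + sizes[j] * d_kj) / (sizes[i] + sizes[j])
        else:
            raise ValueError(f"unknown rule: {rule}")
        new_dist[frozenset((k, merged))] = d
    new_sizes = {k: n for k, n in sizes.items() if k not in (i, j)}
    new_sizes[merged] = sizes[i] + sizes[j]
    return new_dist, new_sizes

def key(a, b):
    return frozenset((a, b))

# Illustrative distances and sizes, not taken from the slides.
dist = {key("C1", "C2"): 4.0, key("C1", "C3"): 7.0, key("C1", "C5"): 6.0,
        key("C2", "C3"): 5.0, key("C2", "C5"): 1.0, key("C3", "C5"): 2.0}
sizes = {"C1": 2, "C2": 3, "C3": 2, "C5": 1}

new_dist, new_sizes = merge_and_update(dist, sizes, "C2", "C5", rule="average")
print(new_dist[key("C1", "C2 U C5")], new_sizes["C2 U C5"])
```

Only the row and column of the merged cluster are recomputed, which matches the "?" cells in the matrix above.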
How to Define Inter-Cluster Distance
[Figure: two clusters of points and the distance matrix indexed by p1, p2, p3, p4, p5, ...; what is the distance between the clusters?]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• ……
MIN or Single Link
• Inter-cluster distance
– The distance between two clusters is represented by the
distance of the closest pair of data objects belonging to
different clusters.
– Determined by one pair of points, i.e., by one link in the
proximity graph
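As a small illustration of this definition, the sketch below computes the single-link distance between two clusters of 2-D points using Euclidean distance. The function name `single_link` and the sample points are made up for the example.

```python
from math import dist

def single_link(cluster_a, cluster_b):
    """MIN distance: the closest pair with one point in each cluster."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
print(single_link(a, b))  # 3.0, set by the pair (1, 0) and (4, 0)
```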
[Figure: single-link (MIN) clustering of six points and its dendrogram; leaf order 3, 6, 2, 5, 4, 1; merge heights between 0 and 0.2]
Strength of MIN
Limitations of MIN
MAX or Complete Link
• Inter-cluster distance
– The distance between two clusters is represented by the
distance of the farthest pair of data objects belonging to
different clusters
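A matching sketch for complete link, again assuming 2-D points and Euclidean distance; `complete_link` and the sample points are illustrative names and values.

```python
from math import dist

def complete_link(cluster_a, cluster_b):
    """MAX distance: the farthest pair with one point in each cluster."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
print(complete_link(a, b))  # 9.0, set by the pair (0, 0) and (9, 0)
```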
MAX
[Figure: complete-link (MAX) clustering of six points and its dendrogram; leaf order 3, 6, 4, 1, 2, 5; merge heights between 0 and 0.4]
Strength of MAX
Limitations of MAX
• Tends to break large clusters
[Figure: original points]
Limitations of MAX
[Figure: clustering of six points and its dendrogram; leaf order 3, 6, 4, 1, 2, 5; merge heights between 0 and 0.25]
Group Average
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
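For reference, group average is usually defined as the mean of all pairwise distances between points in the two clusters. The slides do not spell the definition out here, so the sketch below assumes that standard form; `group_average` and the sample points are illustrative.

```python
from math import dist

def group_average(cluster_a, cluster_b):
    """Mean distance over all pairs with one point in each cluster."""
    total = sum(dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

a = [(0.0, 0.0), (0.0, 3.0)]
b = [(4.0, 0.0)]
print(group_average(a, b))  # pairwise distances 4 and 5 average to 4.5
```

Because every pair contributes, a single noisy point shifts the result less than it would under MIN or MAX.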
Centroid Distance
• Inter-cluster distance
– The distance between two clusters is represented by the
distance between the centers of the clusters
– Determined by cluster centroids
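A minimal sketch of the centroid distance, assuming each centroid is the coordinate-wise mean of its cluster and distances are Euclidean. The names `centroid` and `centroid_distance` and the sample points are chosen for this example.

```python
from math import dist

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_distance(cluster_a, cluster_b):
    """Distance between the centers of the two clusters."""
    return dist(centroid(cluster_a), centroid(cluster_b))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
print(centroid(a), centroid(b))   # (0.5, 0.0) (6.5, 0.0)
print(centroid_distance(a, b))    # 6.0
```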
Ward’s Method
Comparison
[Figure: clusterings of the same six points under MIN, MAX, Group Average, and Ward’s Method]
Time and Space Requirements
Strengths
Problems and Limitations
Take-away Message