Clustering
Submitted To: Dr. V. K. Pathak
Submitted By: Saurabh Jain, III B.Tech CSE
Topics Covered:
1. Clustering – Basic Idea
2. Problem
3. Solution Based on a Greedy Algorithm
4. Example of the Algorithm's Implementation
5. Applications of Clustering
When Does Clustering Arise?
• A set of photographs
• A set of documents
• A set of microorganisms
What Is Clustering?
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
Motivation for Clustering
Input: a set of objects and the distances between them.
Objects can be images, web pages, people, species, etc.
Distance function: a numeric value specifying the "closeness" of two objects. Increasing distance corresponds to decreasing similarity.
Goal: group the objects into clusters, where each cluster is a set of similar objects.
Formalising the Clustering Problem
Let U be the set of n objects labelled p1, p2, ..., pn.
For every pair pi and pj we have a distance d(pi, pj).
We require d(pi, pi) = 0, d(pi, pj) > 0 if i ≠ j, and d(pi, pj) = d(pj, pi).
Given a positive integer k, a k-clustering of U is a partition of U into k non-empty subsets or "clusters" C1, C2, ..., Ck.
The spacing of a clustering is the smallest distance between any pair of points lying in different clusters.
[Figure: the objects p1, p2, ..., pn of U partitioned into clusters C1, C2, ..., Ck]
How Hard Is Clustering?
• Suppose we are given n points and would like to cluster them into k clusters.
– How many possible clusterings are there? Approximately k^n / k! (each of the n points can go to any of the k clusters, and the k! orderings of the clusters are indistinguishable).
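To make the count concrete, here is a short sketch (the function name `stirling2` is my own). The exact number of k-clusterings of n labelled points is the Stirling number of the second kind S(n, k), which the rough estimate k^n / k! approximates:

```python
from math import factorial

def stirling2(n, k):
    """Stirling number of the second kind: the number of ways to
    partition n labelled objects into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # Recurrence: S(n, k) = k * S(n-1, k) + S(n-1, k-1),
    # computed row by row starting from S(0, 0) = 1.
    prev = [1] + [0] * k
    for _ in range(n):
        row = [0] * (k + 1)
        for j in range(1, k + 1):
            row[j] = j * prev[j] + prev[j - 1]
        prev = row
    return prev[k]

print(stirling2(10, 4))              # exact count: 34105
print(4**10 // factorial(4))         # rough estimate k^n / k!: 43690
```

Even for n = 10 and k = 4 there are tens of thousands of clusterings, so brute-force enumeration is hopeless; this motivates the greedy approach below.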
Clustering Criteria: Maximum Spacing
• The spacing between clusters is defined as the minimum distance between any pair of points lying in different clusters. The goal is a k-clustering of maximum spacing.
[Figure: a clustering with k = 4, with the spacing marked between the two closest clusters]
Greedy Clustering Algorithm
Intuition: greedily cluster objects in increasing order of distance.
Let C be a set of n clusters, with each object in U in its own cluster.
Calculate the distance between each pair of objects.
Process pairs of objects in increasing order of distance.
Let (p, q) be the next pair, with p ∈ Cp and q ∈ Cq.
If Cp ≠ Cq, add the new cluster Cp ∪ Cq to C and delete Cp and Cq from C.
Stop when there are k clusters in C.
This is the same as Kruskal's algorithm, except that we do not add the last k − 1 edges of the MST.
Key observation: this procedure is precisely Kruskal's algorithm for the minimum-cost spanning tree, except that we stop when there are k connected components.
We stop the algorithm before it adds its last k − 1 edges. This is equivalent to computing the full minimum spanning tree and then deleting the k − 1 most expensive edges.
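The steps above can be sketched in Python using a union-find structure, exactly as in Kruskal's algorithm (the function name and the distance-function interface are my own, for illustration):

```python
from itertools import combinations

def greedy_k_clustering(points, dist, k):
    """Single-link k-clustering: run Kruskal's algorithm and stop
    as soon as k connected components (clusters) remain."""
    parent = {p: p for p in points}

    def find(p):
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Process pairs of objects in increasing order of distance.
    edges = sorted(combinations(points, 2), key=lambda e: dist(*e))
    clusters = len(points)          # initially every point is its own cluster
    for p, q in edges:
        if clusters == k:           # stopping condition: k clusters remain
            break
        rp, rq = find(p), find(q)
        if rp != rq:                # p and q lie in different clusters: merge
            parent[rp] = rq
            clusters -= 1

    groups = {}
    for p in points:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())

# Tiny usage example with a hypothetical distance table.
pts = ["A", "B", "C"]
d = {("A", "B"): 1.0, ("A", "C"): 5.0, ("B", "C"): 2.0}
dist = lambda p, q: d.get((p, q), d.get((q, p)))
print(greedy_k_clustering(pts, dist, 2))  # A and B merge; C stays alone
```

Sorting the edges dominates the cost, so the sketch runs in O(n² log n) time for n points.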
[Figure: an example weighted graph on vertices A, B, C, D, E, F]
Minimum spanning tree:
[Figure: the minimum spanning tree of the example graph, with edge weights 1, 2, 2, 3, 3]
[Figure: the MST with its two most expensive edges marked]
For K = 3, three clusters are formed after removing the K − 1 = 2 most expensive edges.
[Figure: the resulting three clusters]
EXAMPLE
Problem: assume that the database D is given by the table below. Follow the single-link technique to find the clusters in D, using the Euclidean distance measure.
      x     y
p1  0.40  0.53
p2  0.22  0.38
p3  0.35  0.32
p4  0.26  0.19
p5  0.08  0.41
p6  0.45  0.30
Solution:
Step 1. Plot the objects in n-dimensional space (where n is the number of attributes). In our case we have two attributes, x and y, so we plot the objects p1, p2, ..., p6 in 2-dimensional space.
[Figure: scatter plot of the six points p1–p6]
Step 2. Calculate the distance from each object (point) to all other points, using the Euclidean distance measure, and place the numbers in a distance matrix.
Distance Matrix
p1    0
p2    0.24  0
p3    0.22  0.15  0
p4    0.37  0.20  0.15  0
p5    0.34  0.14  0.28  0.29  0
p6    0.24  0.25  0.10  0.22  0.39  0
      p1    p2    p3    p4    p5    p6
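The Step 2 computation can be reproduced with a few lines of Python (a minimal sketch; note that values computed directly from the coordinates may differ from the slides' matrix in the last digit, since the slides round some entries slightly differently):

```python
from math import dist  # Euclidean distance, Python 3.8+

# Coordinates from the example table.
points = {"p1": (0.40, 0.53), "p2": (0.22, 0.38), "p3": (0.35, 0.32),
          "p4": (0.26, 0.19), "p5": (0.08, 0.41), "p6": (0.45, 0.30)}

# Print the lower-triangular Euclidean distance matrix.
names = sorted(points)
for i, a in enumerate(names):
    row = [f"{dist(points[a], points[b]):.2f}" for b in names[:i + 1]]
    print(f"{a}  " + "  ".join(row))
print("      " + "    ".join(names))
```

The smallest off-diagonal entry is d(p3, p6), which is why p3 and p6 are merged first in Step 3.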
Step 3. Identify the two clusters with the shortest distance in the matrix and merge them. Re-compute the distance matrix, as those two clusters are now a single cluster and no longer exist by themselves.
[Figure: the plot with p3 and p6 merged into one cluster]
p1        0
p2        0.24  0
(p3, p6)  0.22  0.15  0
p4        0.37  0.20  0.15  0
p5        0.34  0.14  0.28  0.29  0
          p1    p2    (p3, p6)    p4    p5
[Figure: the plot with clusters (p2, p5) and (p3, p6) marked]
p1        0
(p2, p5)  0.24  0
(p3, p6)  0.22  0.15  0
p4        0.37  0.20  0.15  0
          p1    (p2, p5)    (p3, p6)    p4
dist((p3, p6), (p2, p5)) = MIN( dist(p3, p2), dist(p6, p2), dist(p3, p5), dist(p6, p5) )
                         = MIN( 0.15, 0.25, 0.28, 0.39 )   // from the original matrix
                         = 0.15
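This single-link rule is easy to express directly in Python (a sketch; the function name and the dictionary of distances, taken from the example's original matrix, are for illustration only):

```python
def single_link(c1, c2, d):
    """Single-link distance between two clusters: the minimum
    pairwise distance across the two clusters."""
    return min(d[frozenset((p, q))] for p in c1 for q in c2)

# The four relevant entries from the original distance matrix.
d = {frozenset(("p3", "p2")): 0.15, frozenset(("p6", "p2")): 0.25,
     frozenset(("p3", "p5")): 0.28, frozenset(("p6", "p5")): 0.39}

print(single_link(("p3", "p6"), ("p2", "p5"), d))  # -> 0.15
```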
[Figure: the final clustering for K = 3 (the required number of clusters)]
Stopping Condition in the Above Algorithm: stop merging as soon as the number of clusters in C equals K.
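Putting the whole example together, a minimal agglomerative single-link sketch (my own code, not the slides') runs the merges on the six points and stops at K = 3. Note that two candidate merges are tied at distance ≈ 0.14, so the exact composition of the final clusters can depend on tie-breaking, even though their number is always 3:

```python
from itertools import combinations
from math import dist

points = {"p1": (0.40, 0.53), "p2": (0.22, 0.38), "p3": (0.35, 0.32),
          "p4": (0.26, 0.19), "p5": (0.08, 0.41), "p6": (0.45, 0.30)}
k = 3

# Start with every point in its own cluster.
clusters = [{p} for p in points]

def link(c1, c2):
    """Single-link (minimum pairwise) distance between two clusters."""
    return min(dist(points[p], points[q]) for p in c1 for q in c2)

# Stopping condition: merge until exactly k clusters remain.
while len(clusters) > k:
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
    print(f"merge {sorted(clusters[i])} + {sorted(clusters[j])} "
          f"at {link(clusters[i], clusters[j]):.2f}")
    clusters[i] |= clusters[j]
    del clusters[j]

print([sorted(c) for c in clusters])
```

The first merge is always (p3, p6), matching Step 3 of the worked example; the trace then shows which of the tied ≈ 0.14 merges the implementation picks.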