
Presentation on

Clustering
Submitted To: Dr. V. K. Pathak

Submitted By: Saurabh Jain, III B.Tech CSE
Topics Covered:
1. Clustering – Basic Idea
2. Problem
3. Solution Based on a Greedy Algorithm
4. Example Implementation of the Algorithm
5. Applications of Clustering
WHEN DOES CLUSTERING ARISE?
A set of photographs
A set of documents
A set of microorganisms
What Is Clustering?
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Motivation for Clustering
• INPUT: a set of objects and the distances between them.
• Objects can be images, web pages, people, species, etc.
• Distance function: a numeric value specifying the "closeness" of two objects. Increasing distance corresponds to decreasing similarity.
• Goal: group the objects into clusters, where each cluster is a set of similar objects.
Formalising the Clustering Problem
Let U be the set of n objects, labelled p1, p2, …, pn.
For every pair pi and pj, we have a distance d(pi, pj).
We require d(pi, pi) = 0, d(pi, pj) > 0 if i ≠ j, and d(pi, pj) = d(pj, pi).
Given a positive integer k, a k-clustering of U is a partition of U into k non-empty subsets or “clusters” C1, C2, …, Ck.
The spacing of a clustering is the smallest distance between any pair of points lying in different clusters.
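In symbols (a restatement of the definition above, not an addition from elsewhere): spacing = min { d(p, q) : p ∈ Ci, q ∈ Cj, i ≠ j }.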
[Figure: the objects p1, p2, …, pn of U partitioned into the clusters C1, C2, …, Ck]
How hard is clustering?
• Suppose we are given n points and would like to cluster them into k clusters.
– How many possible clusterings? The exact count is the Stirling number of the second kind, S(n, k), which grows roughly like k^n / k!.
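Not on the original slides, but to make the count concrete: a minimal Python sketch of the standard recurrence for S(n, k):

```python
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labelled items into k non-empty
    clusters, via S(n, k) = k*S(n-1, k) + S(n-1, k-1)."""
    if n == k:
        return 1          # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(10, 3))   # 9330 -- already large for only 10 points
```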
Clustering Criteria: Maximum Spacing
• Spacing between clusters is defined as the minimum distance between any pair of points in different clusters.
• Clustering of maximum spacing: given an integer k, find a k-clustering of maximum spacing.

[Figure: a clustering with k = 4, with the spacing marked between the two closest clusters]
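To make the objective concrete, here is a minimal Python sketch of the spacing computation; the names (spacing, clusters, points) are illustrative, not from the slides:

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def spacing(clusters, points):
    """Spacing of a clustering: the minimum distance over all pairs of
    points that lie in two different clusters."""
    return min(
        dist(points[p], points[q])
        for ca, cb in combinations(clusters, 2)  # every pair of clusters
        for p in ca
        for q in cb
    )
```

A maximum-spacing k-clustering is then a partition that maximizes this value.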
Greedy Clustering Algorithm
Intuition: greedily cluster objects in increasing order of distance.
Let C be a set of n clusters, with each object in U in its own cluster.
Calculate the distance between each pair of objects.
Process pairs of objects in increasing order of distance.
Let (p, q) be the next pair, with p ∈ Cp and q ∈ Cq.
If Cp ≠ Cq, add the new cluster Cp ∪ Cq to C and delete Cp and Cq from C.
Stop when there are k clusters in C.
Key observation: this procedure is precisely Kruskal's algorithm for a minimum-cost spanning tree, except that we stop when there are k connected components.
We stop the algorithm before it adds its last k − 1 edges. This is equivalent to building the full minimum spanning tree and then deleting the k − 1 most expensive edges.
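A minimal Python sketch of this greedy procedure (Kruskal's algorithm with an early stop), assuming the objects are coordinate tuples under Euclidean distance; the union-find details and names are illustrative:

```python
from itertools import combinations
from math import dist

def greedy_k_clustering(points, k):
    """Max-spacing k-clustering: process pairs in increasing order of
    distance, merging clusters until only k remain (Kruskal, stopped early).
    Returns a cluster representative for each point index."""
    parent = list(range(len(points)))

    def find(i):                              # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs, sorted by increasing distance (the "edges").
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    clusters = len(points)                    # each point starts alone
    for p, q in edges:
        if clusters == k:                     # stop at k components
            break
        rp, rq = find(p), find(q)
        if rp != rq:                          # merge two different clusters
            parent[rp] = rq
            clusters -= 1
    return [find(i) for i in range(len(points))]
```

Sorting all pairs costs O(n² log n); each merge is then near-constant thanks to the union-find.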
Undirected Graph G = (V, E)

[Figure: an undirected weighted graph on the vertices A, B, C, D, E, F, with edge weights between 1 and 6]
Minimum Spanning Tree

[Figure: the minimum spanning tree of G, with edge weights 1, 2, 2, 3, 3]
[Figure: the MST with its most expensive edges (weight 3) marked "Expensive Edge"]
K = 3

3 clusters are formed after removing the k − 1 = 3 − 1 = 2 most expensive edges.

[Figure: the three clusters that remain after deleting the two most expensive edges from the MST]
EXAMPLE
Problem: Assume that the database D is given by the table below. Follow the single-link technique to find the clusters in D. Use the Euclidean distance measure.

      x     y
p1  0.40  0.53
p2  0.22  0.38
p3  0.35  0.32
p4  0.26  0.19
p5  0.08  0.41
p6  0.45  0.30
Solution:
Step 1. Plot the objects in n-dimensional space (where n is the number of attributes). In our case we have 2 attributes, x and y, so we plot the objects p1, p2, …, p6 in 2-dimensional space.

[Figure: scatter plot of the points p1–p6 in the x–y plane]
Step 2. Calculate the distance from each object (point) to all other points, using the Euclidean distance measure, and place the numbers in a distance matrix:

d(p1, p2) = ( |x_p1 − x_p2|^2 + |y_p1 − y_p2|^2 )^(1/2)

Distance Matrix
        p1     p2     p3     p4     p5     p6
p1      0
p2     0.24    0
p3     0.22   0.15    0
p4     0.37   0.20   0.15    0
p5     0.34   0.14   0.28   0.29    0
p6     0.23   0.25   0.11   0.22   0.39    0
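A short Python sketch (an illustration, not from the slides) that regenerates this matrix; the printed values should agree with the table above up to rounding:

```python
from math import dist  # Euclidean distance between two coordinate tuples

pts = {"p1": (0.40, 0.53), "p2": (0.22, 0.38), "p3": (0.35, 0.32),
       "p4": (0.26, 0.19), "p5": (0.08, 0.41), "p6": (0.45, 0.30)}

names = list(pts)
for a in names:
    # One row of the symmetric distance matrix, rounded to 2 decimals.
    row = [f"{dist(pts[a], pts[b]):.2f}" for b in names]
    print(a, *row)
```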
Step 3. Identify the two clusters with the shortest distance in the matrix and merge them together. Re-compute the distance matrix, as those two clusters are now a single cluster and no longer exist by themselves.

Looking at the distance matrix above, we see that p3 and p6 have the smallest distance of all: 0.11. So we merge those two into a single cluster and re-compute the distance matrix.

[Figure: scatter plot with p3 and p6 circled as one cluster]
Updated Distance Matrix
           p1    p2   (p3,p6)   p4    p5
p1          0
p2        0.24    0
(p3,p6)   0.22  0.15     0
p4        0.37  0.20   0.15     0
p5        0.34  0.14   0.28   0.29    0

dist( (p3, p6), p1 ) = MIN( dist(p3, p1), dist(p6, p1) )
                     = MIN( 0.22, 0.23 )   // from the original matrix
                     = 0.22
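This MIN rule is the single-link distance between clusters. A one-function Python sketch (an illustration, not from the slides):

```python
from math import dist

pts = {"p1": (0.40, 0.53), "p3": (0.35, 0.32), "p6": (0.45, 0.30)}

def single_link(ca, cb, points):
    """Single-link distance: the minimum pairwise distance between the
    members of cluster ca and the members of cluster cb."""
    return min(dist(points[p], points[q]) for p in ca for q in cb)

print(round(single_link(["p3", "p6"], ["p1"], pts), 2))  # 0.22, as above
```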
Step 4. Repeat Step 3 until the desired number of clusters remains (see the stopping condition below).

a. Looking at the last distance matrix above, we see that p2 and p5 have the smallest distance of all: 0.14. So we merge those two into a single cluster and re-compute the distance matrix.

[Figure: scatter plot with clusters (p2, p5) and (p3, p6) circled]
Updated Distance Matrix
           p1   (p2,p5)  (p3,p6)   p4
p1          0
(p2,p5)   0.24     0
(p3,p6)   0.22   0.15      0
p4        0.37   0.20    0.15      0

dist( (p3, p6), (p2, p5) ) = MIN( dist(p3, p2), dist(p6, p2), dist(p3, p5), dist(p6, p5) )
                           = MIN( 0.15, 0.25, 0.28, 0.39 )   // from the original matrix
                           = 0.15
 
[Figure: the final clustering of the points for K = 3]

K = 3 (number of clusters)
STOPPING CONDITION IN THE ABOVE ALGORITHM
We stop the procedure once we obtain k connected components.
K is the number of clusters the user would like to have. In the above example, K = 3.
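As a cross-check of the worked example (a sketch assuming SciPy is available; not part of the original slides), a library single-link clustering stopped at K = 3 clusters:

```python
from scipy.cluster.hierarchy import linkage, fcluster

pts = [(0.40, 0.53), (0.22, 0.38), (0.35, 0.32),
       (0.26, 0.19), (0.08, 0.41), (0.45, 0.30)]

Z = linkage(pts, method="single")                 # single-link agglomeration
labels = fcluster(Z, t=3, criterion="maxclust")   # stop once K = 3 clusters remain
print(labels)                                     # one cluster label per point p1..p6
```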
Applications
• Routing in mobile ad hoc networks.
• Identifying patterns in gene expression.
• Document categorization for web search.
• Similarity searching in medical image databases.
• Skycat: clustering 10^9 sky objects into stars, quasars, and galaxies.
ANY QUESTIONS?
