[Figure: example document hierarchy with branches "Professional Documents" and "Culture Documents"]
Agglomerative (bottom-up):
Precondition: start with each document as a separate cluster.
Postcondition: eventually all documents belong to the same cluster.
Divisive (top-down):
Precondition: start with all documents belonging to the same cluster.
Postcondition: eventually each document forms a cluster of its own.
Prasad
L17HierCluster
[Figure: example dendrogram over documents d1-d5: d1,d2 merge; d4,d5 merge; d3 joins {d4,d5}; finally {d1,d2} and {d3,d4,d5} merge]
[Figure: documents d3-d6 with the cluster centroid and an outlier marked]
Complete-link: similarity of the furthest points, i.e., the least cosine-similar pair.
Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar.
Average-link: average cosine similarity between all pairs of elements.
Makes tighter, spherical clusters that are typically preferable. After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
sim((ci U cj), ck) = min(sim(ci, ck), sim(cj, ck))
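The update rule can be checked in a few lines of Python; the cosine similarity and the toy vectors below are illustrative:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def complete_link_sim(cluster_a, cluster_b):
    # Complete link: similarity of the two *least* similar members.
    return min(cosine(u, v) for u in cluster_a for v in cluster_b)

ci = [(1.0, 0.0)]
cj = [(0.8, 0.2)]
ck = [(0.0, 1.0), (0.3, 0.9)]

# Constant-time update after merging ci and cj...
merged_sim = min(complete_link_sim(ci, ck), complete_link_sim(cj, ck))
# ...equals recomputing complete-link similarity from scratch:
assert abs(merged_sim - complete_link_sim(ci + cj, ck)) < 1e-12
```

This constant-time update is what keeps the per-iteration cost low, as discussed under computational complexity below.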
Computational Complexity
In the first iteration, all HAC methods must compute the similarity of every pair of the n individual instances, which is O(n2). In each of the subsequent n-2 merging iterations, they compute the distance between the most recently created cluster and all other existing clusters. To maintain overall O(n2) performance, computing the similarity to each cluster must be done in constant time.
Done naively, HAC is often O(n3); with a priority queue it can be brought down to O(n2 log n).
The centroid of this cluster will be a dense vector; the medoid of this cluster will be a sparse vector (an actual document of the cluster).
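A toy illustration of the dense-centroid vs. sparse-medoid contrast, with documents as sparse term-index-to-weight dicts; the data and names are illustrative:

```python
def centroid(docs, vocab_size):
    """Componentwise mean: nonzero wherever *any* member is nonzero."""
    c = [0.0] * vocab_size
    for d in docs:
        for term, w in d.items():
            c[term] += w / len(docs)
    return c

def medoid(docs):
    """The actual member closest to all others: stays a sparse document."""
    def dist(a, b):
        terms = set(a) | set(b)
        return sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in terms) ** 0.5
    return min(docs, key=lambda d: sum(dist(d, e) for e in docs))

# Three sparse docs over a 6-term vocabulary, each using only 2 terms.
docs = [{0: 1.0, 1: 1.0}, {2: 1.0, 3: 1.0}, {4: 1.0, 5: 1.0}]
c = centroid(docs, 6)   # all 6 components become nonzero (dense)
m = medoid(docs)        # still has only 2 nonzero terms (sparse)
```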
Labeling
Common heuristic: list the 5-10 most frequent terms in the centroid vector.
Drop stop-words; stem.
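A minimal sketch of this heuristic, assuming the centroid is a term-to-weight dict; the stop-word list and term weights below are illustrative:

```python
def label_cluster(centroid_weights, stopwords, k=5):
    """Label a cluster with the k heaviest non-stopword terms in its centroid."""
    terms = [(w, t) for t, w in centroid_weights.items() if t not in stopwords]
    return [t for w, t in sorted(terms, reverse=True)[:k]]

centroid_weights = {"the": 0.9, "of": 0.8, "genome": 0.5,
                    "dna": 0.4, "sequencing": 0.3, "cell": 0.2}
stopwords = {"the", "of", "a", "and"}
label = label_cluster(centroid_weights, stopwords, k=3)
# label == ["genome", "dna", "sequencing"]
```

A stemmer (not shown) would normally be applied to the terms before counting, per the slide's advice.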
The measured quality of a clustering depends on both the document representation and the similarity measure used
Purity example
(The counts are how many members of each gold class, x, o, d, fall in each cluster.)
Cluster I: Purity = (1/6) * max(5, 1, 0) = 5/6
Cluster II: Purity = (1/6) * max(1, 4, 1) = 4/6
Cluster III: Purity = (1/5) * max(2, 0, 3) = 3/5
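The purity numbers above can be reproduced with a few lines of Python, encoding each cluster as a list of gold-class labels (x, o, d):

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of gold-class labels per cluster.
    Overall purity = (1/N) * sum over clusters of the majority-class count."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n

# The three clusters from the example: 6, 6, and 5 points.
clusters = [["x"] * 5 + ["o"],
            ["x"] + ["o"] * 4 + ["d"],
            ["x"] * 2 + ["d"] * 3]
# Per-cluster purities 5/6, 4/6, 3/5; overall = (5 + 4 + 3) / 17.
print(purity(clusters))
```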
Rand Index
Number of point pairs in each case:

                                  Same class in ground truth   Different classes in ground truth
Same cluster in clustering                    A                              B
Different clusters in clustering              C                              D

RI = (A + D) / (A + B + C + D)
P = A / (A + B)
R = A / (A + C)
Example (the 17 points from the purity example): A = 20, B = 20, C = 24, D = 72, so RI = (20 + 72) / 136 ~ 0.68.
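A pair-counting sketch reproduces these numbers from the same 17 points used in the purity example; the data encoding below is mine:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Classify every pair of points as A (same class, same cluster),
    B (different class, same cluster), C (same class, different cluster),
    or D (different class, different cluster)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            a += 1
        elif not same_class and same_cluster:
            b += 1
        elif same_class and not same_cluster:
            c += 1
        else:
            d += 1
    return (a + d) / (a + b + c + d)

# Gold classes and cluster assignments for the 17-point purity example.
truth = (["x"] * 5 + ["o"]) + (["x"] + ["o"] * 4 + ["d"]) + (["x"] * 2 + ["d"] * 3)
pred = [1] * 6 + [2] * 6 + [3] * 5
print(rand_index(truth, pred))   # (20 + 72) / 136
```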
Cluster labeling
Clusters have clean descriptions in terms of noun phrase co-occurrence
Multi-lingual docs
E.g., Canadian government docs: every doc appears in English and in an equivalent French version.
We must cluster by concepts rather than by language.
Simplest approach: pad docs in one language with dictionary equivalents in the other;
thus each doc has a representation in both languages.
Feature selection
Which terms should be used as axes for the vector space? IDF is a form of feature selection,
but it can exaggerate noise, e.g., mis-spellings.
Better to use the highest-weight mid-frequency words, the most discriminating terms. Pseudo-linguistic heuristics, e.g.:
drop stop-words
stemming/lemmatization
use only nouns/noun phrases
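A minimal sketch of document-frequency-based selection; the toy corpus and the band limits are illustrative, not canonical values:

```python
from collections import Counter

def mid_frequency_terms(docs, low=0.25, high=0.6):
    """Keep terms whose document frequency falls in a middle band:
    very rare terms are noisy (e.g., mis-spellings), very frequent
    terms don't discriminate between documents."""
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return {t for t, c in df.items() if low <= c / n <= high}

docs = [["the", "cat"], ["the", "cat"], ["the", "dog"],
        ["the", "dog"], ["the", "catt"]]      # "catt": a one-off typo
selected = mid_frequency_terms(docs)
# Drops the ubiquitous "the" (df = 1.0) and the typo "catt" (df = 0.2).
```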
Evaluation of clustering
Perhaps the most substantive issue in data mining in general: how do you measure goodness? Most measures focus on computational efficiency (time and space). For the application of clustering to search, measure retrieval effectiveness instead.
Approaches to evaluating
Anecdotal
Ground truth comparison
Cluster retrieval
Microeconomic / utility
Anecdotal evaluation
Probably the commonest (and surely the easiest):
"I wrote this clustering algorithm, and look what it found!"
No benchmarks, so no comparison is possible. Any clustering algorithm will pick up the easy structure, like a partition by language. Generally of unclear scientific value.
For the docs given, the static prior taxonomy may be incomplete or wrong in places.
Microeconomic viewpoint
Anything, including clustering, is only as good as the economic utility it provides. For clustering: the net economic gain produced by one approach (vs. another approach). Strive for a concrete optimization problem. Examples:
recommendation systems
clock time for interactive search (expensive)
Evaluation
Compare two IR algorithms:
1. Send query, present ranked results.
2. Send query, cluster results, present clusters.
Adapted from Lectures by Prabhaker Raghavan, Christopher Manning and Thomas Hoffmann
L18LSI
Today's topic
Latent Semantic Indexing
Term-document matrices are very large, but the number of topics that people talk about is small (in some sense): clothes, movies, politics, ... Can we represent the term-document space by a lower-dimensional latent space?
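As an illustration of the idea (not the lecture's own example), a truncated SVD of a toy term-document matrix yields a low-dimensional latent space, assuming NumPy:

```python
import numpy as np

# Toy term-document matrix: 5 terms x 4 docs.
# Docs 0-1 are about one topic, docs 2-3 about another.
A = np.array([[1, 1, 0, 0],   # "clothes"
              [1, 1, 0, 0],   # "fashion"
              [0, 0, 1, 1],   # "politics"
              [0, 0, 1, 1],   # "election"
              [0, 1, 1, 0]],  # a term shared across topics
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # keep only the top-2 "latent topics"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# A_k is the best rank-2 approximation of A in the Frobenius norm;
# each doc is now a point in a 2-dimensional latent space: Vt[:k, :].
```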
A (right) eigenvector v of a square matrix S, with eigenvalue lambda, satisfies Sv = lambda*v, v != 0.
Matrix-vector multiplication

S = | 30  0  0 |
    |  0 20  0 |
    |  0  0  1 |

has eigenvalues 30, 20, 1 with corresponding eigenvectors

v1 = (1, 0, 0)^T,  v2 = (0, 1, 0)^T,  v3 = (0, 0, 1)^T

On each eigenvector, S acts as a multiple of the identity matrix: but as a different multiple on each. Any vector, say x = (2, 4, 6)^T, can be viewed as a combination of the eigenvectors: x = 2*v1 + 4*v2 + 6*v3.
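This can be checked numerically (NumPy assumed; the matrix and vectors are exactly those from the example):

```python
import numpy as np

S = np.diag([30.0, 20.0, 1.0])
v1, v2, v3 = np.eye(3)          # the three eigenvectors, as rows
x = np.array([2.0, 4.0, 6.0])   # x = 2*v1 + 4*v2 + 6*v3

# S scales each eigenvector by its own eigenvalue...
assert np.allclose(S @ v1, 30 * v1)
assert np.allclose(S @ v2, 20 * v2)
# ...so Sx follows by linearity, component by component:
assert np.allclose(S @ x, 2 * 30 * v1 + 4 * 20 * v2 + 6 * 1 * v3)
```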
Eigenvectors for distinct eigenvalues are orthogonal: if S*v_{1,2} = lambda_{1,2}*v_{1,2} and lambda_1 != lambda_2, then v1 . v2 = 0.
All eigenvalues of a real symmetric matrix are real.
All eigenvalues of a positive semidefinite matrix are non-negative: if w^T S w >= 0 for all w in R^n, then Sv = lambda*v implies lambda >= 0.
Example
Let

S = | 2 1 |
    | 1 2 |

Real, symmetric. Then

|S - lambda*I| = | 2-lambda     1     | = (2 - lambda)^2 - 1 = 0
                 |    1      2-lambda |

The eigenvalues are 1 and 3 (nonnegative, real). The eigenvectors (1, -1)^T and (1, 1)^T are orthogonal (and real). Plug the eigenvalues in and solve for the eigenvectors.
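The hand computation can be confirmed with NumPy's symmetric eigensolver (a check, not part of the original slides):

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: solver for symmetric matrices

# Real, nonnegative eigenvalues 1 and 3, as computed by hand
# (eigh returns them in ascending order).
assert np.allclose(eigvals, [1.0, 3.0])
# The two eigenvectors (columns of eigvecs) are orthogonal.
assert abs(eigvecs[:, 0] @ eigvecs[:, 1]) < 1e-12
```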
Eigen/diagonal Decomposition
Let S be a square matrix with m linearly independent eigenvectors (a "non-defective" matrix). Theorem: there exists an eigen decomposition S = U*Lambda*U^-1, where Lambda is a diagonal matrix (cf. the matrix diagonalization theorem); the decomposition is unique for distinct eigenvalues. The columns of U are the eigenvectors of S, and the diagonal elements of Lambda are the eigenvalues of S.
And S = U*Lambda*U^-1.
S = | 2 1 |;   lambda_1 = 1, lambda_2 = 3.
    | 1 2 |

The eigenvectors (1, -1)^T and (1, 1)^T form

U = |  1 1 |
    | -1 1 |

Recall U*U^-1 = I. Inverting, we have

U^-1 = | 1/2 -1/2 |
       | 1/2  1/2 |

Then, S = U*Lambda*U^-1 = |  1 1 | * | 1 0 | * | 1/2 -1/2 |
                          | -1 1 |   | 0 3 |   | 1/2  1/2 |
Example continued

Let's divide U (and multiply U^-1) by sqrt(2). Then,

S = |  1/sqrt(2) 1/sqrt(2) | * | 1 0 | * | 1/sqrt(2) -1/sqrt(2) |
    | -1/sqrt(2) 1/sqrt(2) |   | 0 3 |   | 1/sqrt(2)  1/sqrt(2) |
             Q                  Lambda          Q^-1 = Q^T

(Q is orthogonal, so Q^-1 = Q^T.)
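The whole decomposition can be verified numerically (NumPy assumed; Q and Lambda are exactly the matrices above):

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])

# Columns of Q: the unit eigenvectors (1,-1)/sqrt(2) and (1,1)/sqrt(2).
Q = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)
Lam = np.diag([1.0, 3.0])

# For a symmetric S, Q is orthogonal, so Q^-1 = Q^T ...
assert np.allclose(Q.T, np.linalg.inv(Q))
# ... and the decomposition reproduces S exactly:
assert np.allclose(Q @ Lam @ Q.T, S)
```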