
Clustering methods: Part 3

Cluster validation

Pasi Fränti
15.4.2014

Speech and Image Processing Unit


School of Computing
University of Eastern Finland
Part I:

Introduction

Cluster validation
Supervised classification:
Class labels known for ground truth
Accuracy, precision, recall
Oranges: Precision = 5/5 = 100%, Recall = 5/7 = 71%
Apples: Precision = 3/5 = 60%, Recall = 3/3 = 100%

Cluster analysis:
No class labels
Validation is needed to:
Compare clustering algorithms
Solve the number of clusters
Avoid finding patterns in noise
Measuring clustering validity

Internal index:
Validate without external info
With different numbers of clusters
Solve the number of clusters

External index:
Validate against ground truth
Compare two clusterings (how similar)
Clustering of random data

[Figure: random points in the unit square and the partitions produced by DBSCAN, K-means, and Complete Link; each algorithm imposes a clustering even though the data has no structure.]
Cluster validation process

1. Distinguishing whether non-random structure actually exists in the data (one cluster).
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the number of clusters.
Cluster validation process

Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988]

How to be quantitative: employ the measures.
How to be objective: validate the measures!

INPUT: Data Set (X) → Clustering Algorithm → Partitions P, Codebook C → Validity Index → m*
(evaluated for different numbers of clusters m)

Part II:

Internal indexes
Internal indexes

Ground truth is rarely available, but unsupervised validation must still be done.
Minimize (or maximize) an internal index:
Variances within clusters and between clusters
Rate-distortion method
F-ratio
Davies-Bouldin index (DBI)
Bayesian Information Criterion (BIC)
Silhouette coefficient
Minimum description length principle (MDL)
Stochastic complexity (SC)
Mean square error (MSE)

The more clusters, the smaller the MSE.
A small knee-point appears near the correct value.
But how to detect it?

Knee-point between 14 and 15 clusters.
Mean square error (MSE)

[Figure: SSE as a function of the number of clusters K (2 to 30), with example clusterings shown for 5 and 10 clusters.]
From MSE to cluster validity

Minimize within-cluster variance (MSE): intra-cluster variance is minimized.
Maximize between-cluster variance: inter-cluster variance is maximized.
Jump point of MSE
(rate-distortion approach)

First derivative of powered MSE values:

J(k) = MSE(k)^(-d/2) - MSE(k-1)^(-d/2)

[Figure: jump values as a function of the number of clusters (0 to 100) for data set S2; the biggest jump is at 15 clusters.]
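A minimal sketch of how the jump statistic could be computed, assuming scikit-learn's KMeans is used to approximate MSE(k) for each candidate k (any clustering algorithm would do); the largest J(k) suggests the number of clusters:

```python
# Sketch of the rate-distortion jump method, assuming scikit-learn's KMeans
# approximates MSE(k) well enough; any clustering algorithm could be substituted.
import numpy as np
from sklearn.cluster import KMeans

def jump_points(X, k_max):
    """J(k) = MSE(k)^(-d/2) - MSE(k-1)^(-d/2) for k = 2..k_max."""
    n, d = X.shape
    mse = {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        mse[k] = km.inertia_ / n            # within-cluster sum of squares per point
    power = -d / 2.0
    return {k: mse[k] ** power - mse[k - 1] ** power for k in range(2, k_max + 1)}

# Usage: jumps = jump_points(X, 30); k_best = max(jumps, key=jumps.get)
```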
Sum-of-squares based indexes

SSW / k                           — Ball and Hall (1965)
k² |W|                            — Marriot (1971)
(SSB / (k-1)) / (SSW / (N-k))     — Calinski & Harabasz (1974)
log(SSB / SSW)                    — Hartigan (1975)
d·log(√(SSW / (d·N²))) + log(k)   — Xu (1997)

(d is the dimension of the data; N is the size of the data; k is the number of clusters)

SSW = sum of squares within the clusters (= MSE)
SSB = sum of squares between the clusters
Variances

Within clusters:   SSW(C, k) = Σ_{i=1..N} ||x_i − c_{p(i)}||²

Between clusters:  SSB(C, k) = Σ_{j=1..k} n_j ||c_j − x̄||²

Total variance of the data set:

σ²(X) = SSW + SSB = Σ_{i=1..N} ||x_i − c_{p(i)}||² + Σ_{j=1..k} n_j ||c_j − x̄||²
F-ratio variance test

Variance-ratio F-test: measures the ratio of the between-groups variance against the within-groups variance (original F-test).

F-ratio (WB-index):

F = k · SSW / SSB = k · Σ_{i=1..N} ||x_i − c_{p(i)}||² / Σ_{j=1..k} n_j ||c_j − x̄||²

where SSB = σ²(X) − SSW.
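A small sketch of SSW, SSB and the F-ratio (WB-index) following the definitions above; `labels` and `centroids` are assumed to come from any clustering algorithm, with labels coded as integers 0..k-1:

```python
# Sketch of SSW, SSB and the F-ratio (WB-index); labels are integers 0..k-1
# and centroids is a (k, d) array produced by any clustering algorithm.
import numpy as np

def ssw_ssb(X, labels, centroids):
    """Within-cluster (SSW) and between-cluster (SSB) sums of squares."""
    grand_mean = X.mean(axis=0)
    ssw = np.sum((X - centroids[labels]) ** 2)
    sizes = np.bincount(labels, minlength=len(centroids))
    ssb = np.sum(sizes[:, None] * (centroids - grand_mean) ** 2)
    return ssw, ssb

def f_ratio(X, labels, centroids):
    """WB-index F = k * SSW / SSB; smaller values indicate better clustering."""
    ssw, ssb = ssw_ssb(X, labels, centroids)
    return len(centroids) * ssw / ssb
```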
Calculation of F-ratio

[Figure: components of the F-test as a function of the number of clusters (2 to 25): the divider (between-cluster term), the nominator (k·MSE), and the total F-ratio.]
F-ratio for data set S1

[Figure: F-ratio (×10⁵) for 25 down to 5 clusters for the PNN and IS algorithms; the minimum indicates the number of clusters.]
F-ratio for data set S2

[Figure: F-ratio (×10⁵) for 25 down to 5 clusters for PNN and IS; minimum marked.]
F-ratio for data set S3

[Figure: F-ratio for 25 down to 5 clusters for PNN and IS; minimum marked.]
F-ratio for data set S4

[Figure: F-ratio for 25 down to 5 clusters for PNN and IS; the two methods give minima at 16 and 15 clusters.]
Extension of the F-ratio for S3

[Figure: F-ratio for 25 down to 5 clusters for PNN and IS; besides the minimum, another knee point is visible.]
Sum-of-squares based indexes (as functions of the number of clusters m)

MSE, SSW/SSB
SSW / m
log(SSB / SSW)
(SSB / (m−1)) / (SSW / (n−m))
d·log(√(SSW / (d·n²))) + log(m)
m · SSW / SSB
Davies-Bouldin index (DBI)

Minimize intra-cluster variance
Maximize the distance between clusters
Cost function is a weighted sum of the two:

R_{j,k} = (MAE_j + MAE_k) / d(c_j, c_k)

DBI = (1/M) · Σ_{j=1..M} max_{k≠j} R_{j,k}
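For quick experiments a close relative of this index is available in scikit-learn as `davies_bouldin_score` (it uses the mean Euclidean distance to the centroid as the within-cluster dispersion rather than MAE); a sketch of sweeping it over candidate k:

```python
# Sweep of the Davies-Bouldin index over candidate k using scikit-learn;
# note sklearn's dispersion is the mean distance to the centroid, not MAE.
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def dbi_curve(X, k_range):
    """DBI for each candidate number of clusters; the minimum suggests k."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)
    return scores
```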
Davies-Bouldin index (DBI)

[Figure: measured values of MSE, DBI and F-test for data set S2 as a function of the number of clusters (2 to 25); the minimum point of DBI and F-test indicates the number of clusters.]
Silhouette coefficient
[Kaufman & Rousseeuw, 1990]

Cohesion: measures how closely related the objects in a cluster are.
Separation: measures how distinct or well-separated a cluster is from the other clusters.
Silhouette coefficient

Cohesion a(x): average distance of x to all other vectors in the same cluster.
Separation b(x): average distance of x to the vectors in other clusters; take the minimum over the clusters.

Silhouette s(x):

s(x) = (b(x) − a(x)) / max{a(x), b(x)}

s(x) ∈ [−1, +1]: −1 = bad, 0 = indifferent, +1 = good

Silhouette coefficient (SC):

SC = (1/N) · Σ_{i=1..N} s(x_i)
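A minimal sketch, assuming scikit-learn is available: `silhouette_score` computes the average s(x) defined above over all points, so sweeping it over candidate k and taking the maximum is one way to pick the number of clusters:

```python
# Average silhouette for each candidate k (the maximum suggests the number
# of clusters); silhouette_score implements the s(x) defined above.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_curve(X, k_range):
    return {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                          random_state=0).fit_predict(X))
            for k in k_range}
```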
Silhouette coefficient

[Figure: a(x) is the average distance of x to the points in its own cluster (cohesion); b(x) is the average distance to the points in the other clusters, minimized over clusters (separation).]
Performance of the silhouette coefficient
Bayesian information criterion (BIC)

BIC = Bayesian Information Criterion

BIC = L(θ) − (1/2) · m · log n

L(θ): log-likelihood function of all models
n: size of the data set
m: number of clusters

Under a spherical Gaussian assumption, we get the formula of BIC in partitioning-based clustering:

BIC = Σ_{i=1..m} [ n_i·log n_i − n_i·log n − (n_i·d/2)·log(2π) − (n_i/2)·log Σ_i − (n_i − m)/2 ] − (1/2)·m·log n

d: dimension of the data set
n_i: size of the i-th cluster
Σ_i: covariance of the i-th cluster

Knee point detection on BIC

Original BIC = F(m); knee score SD(m) = F(m−1) + F(m+1) − 2·F(m)
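A hedged sketch of the spherical-Gaussian BIC and the knee score above; it follows the reconstructed formula (hard partitions, a per-cluster variance estimate), so treat it as illustrative rather than a reference implementation:

```python
# Illustrative sketch of the spherical-Gaussian BIC above (hard partitions,
# per-cluster variance estimate) and of the knee score SD(m).
import numpy as np

def bic_spherical(X, labels, centroids):
    n, d = X.shape
    m = len(centroids)
    bic = 0.0
    for i in range(m):
        Xi = X[labels == i]
        ni = len(Xi)
        if ni <= 1:
            continue                              # skip degenerate clusters
        var_i = np.sum((Xi - centroids[i]) ** 2) / (ni * d)
        bic += (ni * np.log(ni) - ni * np.log(n)
                - 0.5 * ni * d * np.log(2 * np.pi)
                - 0.5 * ni * np.log(var_i)
                - 0.5 * (ni - m))
    return bic - 0.5 * m * np.log(n)

def knee_scores(bic_by_m):
    """SD(m) = F(m-1) + F(m+1) - 2*F(m); assumes consecutive m values as keys."""
    ms = sorted(bic_by_m)
    return {m: bic_by_m[m - 1] + bic_by_m[m + 1] - 2 * bic_by_m[m] for m in ms[1:-1]}
```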


Internal indexes

Internal indexes: soft partitions

Comparison of the indexes: K-means

Comparison of the indexes: Random Swap
Part III:

Stochastic complexity for binary data
Stochastic complexity

Principle of minimum description length (MDL): find the clustering C that can be used for describing the data with minimum information.

Data = Clustering + description of the data.
Clustering is defined by the centroids.
Data is defined by:
which cluster (partition index)
where in the cluster (difference from the centroid)
Solution for binary data

SC = Σ_{j=1..M} n_j · Σ_{i=1..d} h(n_ij / n_j) − Σ_{j=1..M} n_j · log(n_j / N) + (d/2) · Σ_{j=1..M} log max(1, n_j)

where

h(p) = −p·log p − (1−p)·log(1−p)

This can be simplified to:

SC = Σ_{j=1..M} n_j · Σ_{i=1..d} h(n_ij / n_j) − Σ_{j=1..M} n_j · log n_j + (d/2) · Σ_{j=1..M} log max(1, n_j)
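A sketch of the simplified SC formula, assuming X is an N×d 0/1 matrix, `labels` assigns each vector to one of M clusters, and logarithms are taken in base 2 (the base only scales the value):

```python
# Sketch of the simplified stochastic complexity for binary data; X is an
# (N x d) 0/1 matrix, labels maps each vector to a cluster 0..M-1, logs in base 2.
import numpy as np

def binary_entropy(p):
    """h(p) = -p log p - (1-p) log(1-p), with 0*log(0) treated as 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def stochastic_complexity(X, labels, M):
    N, d = X.shape
    sc = 0.0
    for j in range(M):
        Xj = X[labels == j]
        nj = len(Xj)
        if nj == 0:
            continue
        nij = Xj.sum(axis=0)                       # ones per attribute in cluster j
        sc += nj * binary_entropy(nij / nj).sum()  # intra-cluster description
        sc -= nj * np.log2(nj)                     # partition (cluster index) cost
        sc += 0.5 * d * np.log2(max(1, nj))        # centroid description cost
    return sc
```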
Number of clusters by stochastic complexity (SC)

[Figure: SC as a function of the number of clusters (50 to 90) for repeated K-means and RLS.]
Part IV:

External indexes
Pair-counting measures

Measure the number of point pairs that are in:

Same class both in P and G:
a = (1/2) · Σ_{i=1..K} Σ_{j=1..K'} n_ij · (n_ij − 1)

Same class in P but different in G:
b = (1/2) · ( Σ_{j=1..K'} n_{·j}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )

Different classes in P but same in G:
c = (1/2) · ( Σ_{i=1..K} n_{i·}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )

Different classes both in P and G:
d = (1/2) · ( N² + Σ_{i=1..K} Σ_{j=1..K'} n_ij² − Σ_{i=1..K} n_{i·}² − Σ_{j=1..K'} n_{·j}² )
Rand and Adjusted Rand index
[Rand, 1971] [Hubert and Arabie, 1985]

Agreement: a, d
Disagreement: b, c

RI(P, G) = (a + d) / (a + b + c + d)

ARI = (RI − E(RI)) / (1 − E(RI))
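A compact sketch of the pair counts and the Rand index computed from the contingency matrix n_ij (rows = ground truth G, columns = clustering P); the adjusted version is also directly available as `sklearn.metrics.adjusted_rand_score`:

```python
# Sketch of the pair counts and Rand index from the contingency matrix n_ij
# (rows = classes of G, columns = clusters of P); for the adjusted version,
# sklearn.metrics.adjusted_rand_score(labels_true, labels_pred) can be used.
import numpy as np

def pair_counts(n):
    N = n.sum()
    sum_sq = (n ** 2).sum()
    rows = (n.sum(axis=1) ** 2).sum()             # ground-truth marginals squared
    cols = (n.sum(axis=0) ** 2).sum()             # clustering marginals squared
    a = 0.5 * (sum_sq - N)                        # same pair in both P and G
    b = 0.5 * (cols - sum_sq)                     # same in P, different in G
    c = 0.5 * (rows - sum_sq)                     # same in G, different in P
    d = 0.5 * (N ** 2 + sum_sq - rows - cols)     # different in both
    return a, b, c, d

def rand_index(n):
    a, b, c, d = pair_counts(n)
    return (a + d) / (a + b + c + d)
```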
External indexes

If true class labels (ground truth) are known, the validity of a clustering can be verified by comparing the class labels and clustering labels.

n_ij = number of objects in class i and cluster j

Rand statistics
Visual example

Pointwise measures

a = (1/2) · Σ_{i=1..K} Σ_{j=1..K'} n_ij · (n_ij − 1)
b = (1/2) · ( Σ_{j=1..K'} n_{·j}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )
c = (1/2) · ( Σ_{i=1..K} n_{i·}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )
d = (1/2) · ( N² + Σ_{i=1..K} Σ_{j=1..K'} n_ij² − Σ_{i=1..K} n_{i·}² − Σ_{j=1..K'} n_{·j}² )
Rand index (example)

Vectors assigned to:                  Same cluster   Different clusters
Same cluster in ground truth                20              24
Different clusters in ground truth          20              72

Rand index = (20 + 72) / (20 + 24 + 20 + 72) = 92/136 = 0.68

Adjusted Rand = (to be calculated) = 0.xx
External indexes

Pair counting
Information theoretic
Set matching
Pair-counting measures

Agreement: a, d
Disagreement: b, c

a = (1/2) · Σ_{i=1..K} Σ_{j=1..K'} n_ij · (n_ij − 1)
b = (1/2) · ( Σ_{j=1..K'} n_{·j}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )
c = (1/2) · ( Σ_{i=1..K} n_{i·}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )
d = (1/2) · ( N² + Σ_{i=1..K} Σ_{j=1..K'} n_ij² − Σ_{i=1..K} n_{i·}² − Σ_{j=1..K'} n_{·j}² )

Rand index:           RI(P, G) = (a + d) / (a + b + c + d)
Adjusted Rand index:  ARI = (RI − E(RI)) / (1 − E(RI))
Information-theoretic measures

Based on the concept of entropy:

MI(P, G) = Σ_{i=1..K} Σ_{j=1..K'} (n_ij / N) · log( N·n_ij / (n_i·n_j) )

Mutual Information (MI) measures the information that two clusterings share, and Variation of Information (VI) is the complement of MI.

[Figure: Venn diagram of the entropies H(P) and H(G), the conditional entropies H(P|G) and H(G|P), the mutual information MI, and VI(P, G).]

n_i: size of cluster P_i
n_j: size of cluster G_j
n_ij: number of shared objects in P_i and G_j
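MI can be computed directly from the same contingency matrix; normalized and adjusted variants exist in scikit-learn (`normalized_mutual_info_score`, `adjusted_mutual_info_score`). A small sketch using the natural logarithm:

```python
# Sketch of MI(P, G) from the contingency matrix (natural logarithm); normalized
# and adjusted variants are available in scikit-learn.
import numpy as np

def mutual_information(n):
    N = n.sum()
    ni = n.sum(axis=1, keepdims=True)     # marginal cluster sizes of one clustering
    nj = n.sum(axis=0, keepdims=True)     # marginal cluster sizes of the other
    expected = ni * nj                    # outer product of the marginals
    mask = n > 0                          # 0 * log(0) terms contribute nothing
    return np.sum((n[mask] / N) * np.log(N * n[mask] / expected[mask]))
```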
Set-matching measures

Categories:
Point level
Cluster level

Three problems:
How to measure the similarity of two clusters?
How to pair clusters?
How to calculate the overall similarity?
Similarity of two clusters

Jaccard:        J  = |P_i ∩ G_j| / |P_i ∪ G_j|
Sorensen-Dice:  SD = 2·|P_i ∩ G_j| / (|P_i| + |G_j|)
Braun-Banquet:  BB = |P_i ∩ G_j| / max(|P_i|, |G_j|)

Example with clusters P1 (n1 = 1000), P2 (n2 = 250) and P3 (n3 = 200):

Criterion                  P2, P3    P2, P1
Shared objects (H/NVD/CSI)   200       250
J                           0.80      0.25
SD                          0.89      0.40
BB                          0.80      0.25
Pairing

Matching problem in a weighted bipartite graph.

[Figure: bipartite graph between the clusters G1, G2, G3 of G and P1, P2, P3 of P.]
Pairing

Matching or pairing?
Algorithms:
Greedy
Optimal pairing
Normalized Van Dongen

Matching based on the number of shared objects.

[Figure: clustering P shown as big circles, clustering G by the shape of the objects.]

NVD = ( 2N − Σ_{i=1..K} max_j n_ij − Σ_{j=1..K'} max_i n_ij ) / (2N)
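A one-function sketch of NVD from the contingency matrix, matching each cluster to the one it shares most objects with:

```python
# Sketch of the Normalized Van Dongen criterion; n is the contingency matrix
# between the two clusterings (0 means identical partitions, larger is worse).
import numpy as np

def nvd(n):
    N = n.sum()
    return (2 * N - n.max(axis=1).sum() - n.max(axis=0).sum()) / (2.0 * N)
```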
Pair Set Index (PSI)

Similarity of two clusters:

S_ij = n_ij / max(|P_i|, |G_j|)

[Figure: two examples of paired clusters, one with S = 100%, one with S = 50%.]

Total similarity: S = Σ_i S_{i,j}, where j is the index of the cluster paired with P_i.

Optimal pairing using the Hungarian algorithm.
Pair Set Index (PSI)

Adjustment for chance:

Max(S) = min(K, K')

E(S) = Σ_{i=1..min(K,K')} ( n_i · m_i / N ) / max(n_i, m_i)

where the cluster sizes are sorted: in P, n1 > n2 > … > nK; in G, m1 > m2 > … > mK'.

Transformation (Max(S) → 1, E(S) → 0):

PSI = (S − E(S)) / (max(K, K') − E(S))   if S ≥ E(S)
PSI = 0                                   if S < E(S)
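A hedged sketch of PSI as reconstructed above: pairwise similarities S_ij, optimal pairing with the Hungarian algorithm (SciPy's `linear_sum_assignment`), then the adjustment for chance; details of the published index may differ slightly:

```python
# Hedged sketch of the Pair Set Index as reconstructed above: similarities
# S_ij, optimal pairing by the Hungarian algorithm, adjustment for chance.
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_set_index(n):
    """n: contingency matrix, rows = clusters of G, columns = clusters of P."""
    N = n.sum()
    sizes_g = n.sum(axis=1)
    sizes_p = n.sum(axis=0)
    S = n / np.maximum.outer(sizes_g, sizes_p)      # S_ij = n_ij / max(|G_i|, |P_j|)
    rows, cols = linear_sum_assignment(-S)          # maximize total similarity
    s_total = S[rows, cols].sum()
    g_sorted = np.sort(sizes_g)[::-1]               # sorted cluster sizes
    p_sorted = np.sort(sizes_p)[::-1]
    k_min = min(len(sizes_g), len(sizes_p))
    e_s = np.sum(g_sorted[:k_min] * p_sorted[:k_min] / N
                 / np.maximum(g_sorted[:k_min], p_sorted[:k_min]))
    k_max = max(len(sizes_g), len(sizes_p))
    return max(0.0, (s_total - e_s) / (k_max - e_s))
```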

Properties of PSI

Symmetric
Normalized to the number of clusters
Normalized to the size of clusters
Adjusted for chance
Range in [0, 1]
Number of clusters can be different
Random partitioning

Changing the number of clusters in P from 1 to 20.

[Figure: ground truth G with three clusters (boundaries at 1000, 2000, 3000); P obtained by randomly partitioning the data into two clusters.]
Linearity property

Enlarging the first cluster:

G    1000   2000   3000
P1   1250   2000   3000
P2          2000   3000
P3          2500   3000
P4                 3000

Wrongly labeling some part of each cluster:

G    1000   2000   3000
P1    900   1900   2900
P2    800   1800   2800
P3    500   1500   2500
P4    334   1333   2333

Cluster size imbalance

G1   1   1000   2000   3000
P1        800   1800   3000

G2   1   1000   2000   2500
P2        800   1800   2500

Number of clusters

G1   1000   2000
P1    800   1800

G2   1000   2000   3000
P2    800   1800   2800


Part V:

Cluster-level measures
Comparing partitions of centroids

Point-level differences vs. cluster-level mismatches.
Centroid index (CI)
[Fränti, Rezaei, Zhao, Pattern Recognition, 2014]

Given two sets of centroids C and C', find the nearest-neighbour mappings (C → C'):

q_i = arg min_{1≤j≤K2} ||c_i − c'_j||²,   i = 1, …, K1

Detect the prototypes with no mapping:

orphan(c'_j) = 1 if q_i ≠ j for all i, and 0 otherwise

Centroid index = number of zero-mappings:

CI1(C, C') = Σ_{j=1..K2} orphan(c'_j)
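A minimal sketch of CI1 and its symmetric version CI2, assuming the two sets of centroids are given as NumPy arrays of shape (K, d):

```python
# Sketch of the centroid index; C and C_prime are (K, d) NumPy arrays of centroids.
import numpy as np

def centroid_index(C, C_prime):
    """CI1(C, C'): number of centroids in C' that receive no mapping."""
    dists = ((C[:, None, :] - C_prime[None, :, :]) ** 2).sum(axis=2)
    q = dists.argmin(axis=1)                 # nearest-neighbour mapping C -> C'
    orphan = np.ones(len(C_prime), dtype=bool)
    orphan[q] = False                        # mark every centroid that was hit
    return int(orphan.sum())

def centroid_index_symmetric(C, C_prime):
    """CI2 = max(CI1(C, C'), CI1(C', C))."""
    return max(centroid_index(C, C_prime), centroid_index(C_prime, C))
```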
Example of the centroid index (data set S2)

[Figure: mapping counts between two sets of centroids; a value of 1 indicates that the same cluster is found, 0 indicates a zero-mapping. The index value equals the count of zero-mappings; here CI = 2.]
Example of the centroid index

[Figure: one region contains two clusters but only one centroid is allocated; elsewhere three centroids are mapped into one cluster.]
Adjusted Rand vs. centroid index

[Figure: three clusterings compared against the ground truth: a merge-based (PNN) result with ARI = 0.91 and CI = 0, and random swap and K-means results with ARI = 0.88 and 0.82 but CI = 1; a high ARI does not guarantee correct cluster-level structure.]
Centroid index properties

Mapping is not symmetric (C → C' ≠ C' → C).

Symmetric centroid index:

CI2(C, C') = max{ CI1(C, C'), CI1(C', C) }

Pointwise variant (Centroid Similarity Index):
Matching of clusters is based on CI.
Similarity of the matched clusters:

CSI = (S12 + S21) / 2, where S12 = Σ_{i=1..K1} |C_i ∩ C'_j| / N and S21 = Σ_{j=1..K2} |C'_j ∩ C_i| / N
Centroid index

Distance to the ground truth (2 clusters):

1 ↔ GT: CI = 1, CSI = 0.50
2 ↔ GT: CI = 1, CSI = 0.50
3 ↔ GT: CI = 1, CSI = 0.50
4 ↔ GT: CI = 1, CSI = 0.50

[Figure: pairwise CI and CSI values between the four clustering results (CSI values ranging from 0.53 to 0.87).]
Mean squared errors
Clustering quality (MSE)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 179.76 176.92 173.64 179.73 168.92 164.64 164.78 161.47
House 6.67 6.43 6.28 6.20 6.27 5.96 5.91 5.87
Miss America 5.95 5.83 5.52 5.92 5.36 5.28 5.21 5.10
House 3.61 3.28 2.50 3.57 2.62 2.83 - 2.44
Birch1 5.47 5.01 4.88 5.12 4.73 4.64 - 4.64
Birch2 7.47 5.65 3.07 6.29 2.28 2.28 - 2.28
Birch3 2.51 2.07 1.92 2.07 1.96 1.86 - 1.86
S1 19.71 8.92 8.92 8.92 8.93 8.92 8.92 8.92
S2 20.58 13.28 13.28 15.87 13.44 13.28 13.28 13.28
S3 19.57 16.89 16.89 16.89 17.70 16.89 16.89 16.89
S4 17.73 15.70 15.70 15.71 17.52 15.70 15.71 15.70
Adjusted Rand Index
Adjusted Rand Index (ARI)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 0.38 0.40 0.39 0.37 0.43 0.52 0.50 1
House 0.40 0.40 0.44 0.47 0.43 0.53 0.53 1
Miss America 0.19 0.19 0.18 0.20 0.20 0.20 0.23 1
House 0.46 0.49 0.52 0.46 0.49 0.49 - 1
Birch 1 0.85 0.93 0.98 0.91 0.96 1.00 - 1
Birch 2 0.81 0.86 0.95 0.86 1 1 - 1
Birch 3 0.74 0.82 0.87 0.82 0.86 0.91 - 1
S1 0.83 1.00 1.00 1.00 1.00 1.00 1.00 1.00
S2 0.80 0.99 0.99 0.89 0.98 0.99 0.99 0.99
S3 0.86 0.96 0.96 0.96 0.92 0.96 0.96 0.96
S4 0.82 0.93 0.93 0.94 0.77 0.93 0.93 0.93
Normalized Mutual Information
Normalized Mutual Information (NMI)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 0.77 0.78 0.78 0.77 0.80 0.83 0.82 1.00
House 0.80 0.80 0.81 0.82 0.81 0.83 0.84 1.00
Miss America 0.64 0.64 0.63 0.64 0.64 0.66 0.66 1.00
House 0.81 0.81 0.82 0.81 0.81 0.82 - 1.00
Birch 1 0.95 0.97 0.99 0.96 0.98 1.00 - 1.00
Birch 2 0.96 0.97 0.99 0.97 1.00 1.00 - 1.00
Birch 3 0.90 0.94 0.94 0.93 0.93 0.96 - 1.00
S1 0.93 1.00 1.00 1.00 1.00 1.00 1.00 1.00
S2 0.90 0.99 0.99 0.95 0.99 0.93 0.99 0.99
S3 0.92 0.97 0.97 0.97 0.94 0.97 0.97 0.97
S4 0.88 0.94 0.94 0.95 0.85 0.94 0.94 0.94
Normalized Van Dongen
Normalized Van Dongen (NVD)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 0.45 0.42 0.43 0.46 0.38 0.32 0.33 0.00
House 0.44 0.43 0.40 0.37 0.40 0.33 0.31 0.00
Miss America 0.60 0.60 0.61 0.59 0.57 0.55 0.53 0.00
House 0.40 0.37 0.34 0.39 0.39 0.34 - 0.00
Birch 1 0.09 0.04 0.01 0.06 0.02 0.00 - 0.00
Birch 2 0.12 0.08 0.03 0.09 0.00 0.00 - 0.00
Birch 3 0.19 0.12 0.10 0.13 0.13 0.06 - 0.00
S1 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00
S2 0.11 0.00 0.00 0.06 0.01 0.04 0.00 0.00
S3 0.08 0.02 0.02 0.02 0.05 0.00 0.00 0.02
S4 0.11 0.04 0.04 0.03 0.13 0.04 0.04 0.04
Centroid Index
C-Index (CI2)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 74 63 58 81 33 33 35 0
House 56 45 40 37 31 22 20 0
Miss America 88 91 67 88 38 43 36 0
House 43 39 22 47 26 23 --- 0
Birch 1 7 3 1 4 0 0 --- 0
Birch 2 18 11 4 12 0 0 --- 0
Birch 3 23 11 7 10 7 2 --- 0
S1 2 0 0 0 0 0 0 0
S2 2 0 0 1 0 0 0 0
S3 1 0 0 0 0 0 0 0
S4 1 0 0 0 1 0 0 0
Centroid Similarity Index
Centroid Similarity Index (CSI)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 0.47 0.51 0.49 0.45 0.57 0.62 0.63 1.00
House 0.49 0.50 0.54 0.57 0.55 0.63 0.66 1.00
Miss America 0.32 0.32 0.32 0.33 0.38 0.40 0.42 1.00
House 0.54 0.57 0.63 0.54 0.57 0.62 --- 1.00
Birch 1 0.87 0.94 0.98 0.93 0.99 1.00 --- 1.00
Birch 2 0.76 0.84 0.94 0.83 1.00 1.00 --- 1.00
Birch 3 0.71 0.82 0.87 0.81 0.86 0.93 --- 1.00
S1 0.83 1.00 1.00 1.00 1.00 1.00 1.00 1.00
S2 0.82 1.00 1.00 0.91 1.00 1.00 1.00 1.00
S3 0.89 0.99 0.99 0.99 0.98 0.99 0.99 0.99
S4 0.87 0.98 0.98 0.99 0.85 0.98 0.98 0.98
High quality clustering

Method MSE
GKM Global K-means 164.78
RS Random swap (5k) 164.64
GA Genetic algorithm 161.47
RS8M Random swap (8M) 161.02
GAIS-2002 GAIS 160.72
+ RS1M GAIS + RS (1M) 160.49
+ RS8M GAIS + RS (8M) 160.43
GAIS-2012 GAIS 160.68
+ RS1M GAIS + RS (1M) 160.45
+ RS8M GAIS + RS (8M) 160.39
+ PRS GAIS + PRS 160.33
+ RS8M + PRS GAIS + RS (8M) + PRS 160.28
Centroid index values

Main algorithm:  RS8M | GAIS (2002), +RS1M, +RS8M | GAIS (2012), +RS1M, +RS8M, +PRS, +RS8M+PRS

RS8M --- 19 19 19 23 24 24 23 22
GAIS (2002) 23 --- 0 0 14 15 15 14 16
+ RS1M 23 0 --- 0 14 15 15 14 13
+ RS8M 23 0 0 --- 14 15 15 14 13
GAIS (2012) 25 17 18 18 --- 1 1 1 1
+ RS1M 25 17 18 18 1 --- 0 0 1
+ RS8M 25 17 18 18 1 0 --- 0 1
+ PRS 25 17 18 18 1 0 0 --- 1
+ RS8M + PRS 24 17 18 18 1 1 1 1 ---
Summary of external indexes
(existing measures)
Part VI:

Efficient implementation
Strategies for efficient search

Brute force: solve the clustering for every possible number of clusters.
Stepwise: as in brute force, but start from the previous solution and iterate less.
Criterion-guided search: integrate the cost function directly into the optimization function.
Brute force search strategy
Search for each number of clusters separately. (100 %)

Stepwise search strategy
Start from the previous result. (30-40 %)

Criterion-guided search
Integrate with the cost function! (3-6 %)
Stopping criterion for the stepwise search strategy

[Figure: evaluation function value as a function of the iteration number, showing the starting point f(1), the halfway value f(k/2), the current value f(k), and the estimated f(3k/2); the search is stopped when the estimated further improvement, relative to f(1) − f(k), falls below a threshold T_min.]
Comparison of search strategies

[Figure: percentage as a function of data dimensionality (2 to 15) for DLS, CA, Stepwise/FCM, Stepwise/LBG-U, and Stepwise/K-means.]
Open questions

Iterative algorithm (K-means or Random Swap) with criterion-guided search, or hierarchical algorithm?

A potential topic for an MSc or PhD thesis!
Literature

1. G.W. Milligan and M.C. Cooper, "An examination of procedures for determining the number of clusters in a data set", Psychometrika, Vol. 50, 1985, pp. 159-179.
2. E. Dimitriadou, S. Dolnicar and A. Weingessel, "An examination of indexes for determining the number of clusters in binary data sets", Psychometrika, Vol. 67, No. 1, 2002, pp. 137-160.
3. D.L. Davies and D.W. Bouldin, "A cluster separation measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227, 1979.
4. J.C. Bezdek and N.R. Pal, "Some new indexes of cluster validity", IEEE Transactions on Systems, Man and Cybernetics, 28(3), 302-315, 1998.
5. H. Bischof, A. Leonardis and A. Selb, "MDL principle for robust vector quantization", Pattern Analysis and Applications, 2(1), 59-72, 1999.
6. P. Fränti, M. Xu and I. Kärkkäinen, "Classification of binary vectors by using DeltaSC distance to minimize stochastic complexity", Pattern Recognition Letters, 24(1-3), 65-73, January 2003.
7. G.M. James and C.A. Sugar, "Finding the number of clusters in a dataset: An information-theoretic approach", Journal of the American Statistical Association, vol. 98, 397-408, 2003.
8. P.K. Ito, "Robustness of ANOVA and MANOVA test procedures", in: P.R. Krishnaiah (ed.), Handbook of Statistics 1: Analysis of Variance. North-Holland Publishing Company, 1980.
9. I. Kärkkäinen and P. Fränti, "Dynamic local search for clustering with unknown number of clusters", Int. Conf. on Pattern Recognition (ICPR'02), Québec, Canada, vol. 2, 240-243, August 2002.
10. D. Pelleg and A. Moore, "X-means: Extending K-means with efficient estimation of the number of clusters", Int. Conf. on Machine Learning (ICML), 727-734, San Francisco, 2000.
11. S. Salvador and P. Chan, "Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms", IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI), 576-584, Boca Raton, Florida, November 2004.
12. M. Gyllenberg, T. Koski and M. Verlaan, "Classification of binary vectors by stochastic complexity", Journal of Multivariate Analysis, 63(1), 47-72, 1997.
13. X. Hu and L. Xu, "A comparative study of several cluster number selection criteria", Int. Conf. Intelligent Data Engineering and Automated Learning (IDEAL), 195-202, Hong Kong, 2003.
14. L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, London, 1990. ISBN 10: 0471878766.
15. M. Halkidi, Y. Batistakis and M. Vazirgiannis, "Cluster validity methods: part 1", SIGMOD Record, Vol. 31, No. 2, pp. 40-45, 2002.
16. R. Tibshirani, G. Walther and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic", Journal of the Royal Statistical Society B, 63, Part 2, pp. 411-423, 2001.
17. T. Lange, V. Roth, M. Braun and J.M. Buhmann, "Stability-based validation of clustering solutions", Neural Computation, Vol. 16, pp. 1299-1323, 2004.
18. Q. Zhao, M. Xu and P. Fränti, "Sum-of-squares based clustering validity index and significance analysis", Int. Conf. on Adaptive and Natural Computing Algorithms (ICANNGA'09), Kuopio, Finland, LNCS 5495, 313-322, April 2009.
19. Q. Zhao, M. Xu and P. Fränti, "Knee point detection on Bayesian information criterion", IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI), Dayton, Ohio, USA, 431-438, November 2008.
20. W.M. Rand, "Objective criteria for the evaluation of clustering methods", Journal of the American Statistical Association, 66, 846-850, 1971.
21. L. Hubert and P. Arabie, "Comparing partitions", Journal of Classification, 2(1), 193-218, 1985.
22. P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: Cluster level similarity measure", Pattern Recognition, 2014 (accepted).
