
Clustering methods: Part 3

Cluster validation

Pasi Fränti
15.4.2014

Speech and Image Processing Unit


School of Computing
University of Eastern Finland
Part I:

Introduction

Cluster validation
Supervised classification:
Class labels known for ground truth
Accuracy, precision, recall
Oranges: Precision = 5/5 = 100%, Recall = 5/7 = 71%
Apples: Precision = 3/5 = 60%, Recall = 3/3 = 100%

Cluster analysis:
No class labels
Validation is needed to:
Compare clustering algorithms
Solve the number of clusters
Avoid finding patterns in noise
Measuring clustering validity

Internal index:
Validate without external info
With different numbers of clusters
Solve the number of clusters

External index:
Validate against ground truth
Compare two clusterings (how similar)
Clustering of random data

[Figure: random points in the unit square and the partitions produced by DBSCAN, K-means, and Complete Link; each algorithm imposes a clustering even though the data has no structure.]
Cluster validation process

1. Distinguishing whether non-random structure actually exists in the data (one cluster).
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the number of clusters.
Cluster validation process

Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988]

How to be quantitative: employ the measures.
How to be objective: validate the measures!

INPUT: Data Set (X) → Clustering Algorithm → Partitions P, Codebook C → Validity Index → m*
(evaluated for different numbers of clusters m)

Part II:

Internal indexes
Internal indexes

Ground truth is rarely available, but unsupervised validation must still be done.
Minimize (or maximize) an internal index:
Variances within clusters and between clusters
Rate-distortion method
F-ratio
Davies-Bouldin index (DBI)
Bayesian Information Criterion (BIC)
Silhouette coefficient
Minimum description length principle (MDL)
Stochastic complexity (SC)
Mean square error (MSE)

The more clusters, the smaller the MSE.
A small knee-point appears near the correct value.
But how to detect it?

Knee-point between 14 and 15 clusters.
Mean square error (MSE)

[Figure: SSE as a function of the number of clusters K (2 to 30), with example clusterings shown for 5 and 10 clusters.]
From MSE to cluster validity

Minimize within-cluster variance (MSE): intra-cluster variance is minimized.
Maximize between-cluster variance: inter-cluster variance is maximized.
Jump point of MSE
(rate-distortion approach)

First derivative of powered MSE values:

J(k) = MSE(k)^(-d/2) - MSE(k-1)^(-d/2)

[Figure: jump values as a function of the number of clusters (0 to 100) for data set S2; the biggest jump is at 15 clusters.]
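A minimal sketch of how the jump statistic could be computed, assuming scikit-learn's KMeans is used to approximate MSE(k) for each candidate k (any clustering algorithm would do); the largest J(k) suggests the number of clusters:

```python
# Sketch of the rate-distortion jump method, assuming scikit-learn's KMeans
# approximates MSE(k) well enough; any clustering algorithm could be substituted.
import numpy as np
from sklearn.cluster import KMeans

def jump_points(X, k_max):
    """J(k) = MSE(k)^(-d/2) - MSE(k-1)^(-d/2) for k = 2..k_max."""
    n, d = X.shape
    mse = {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        mse[k] = km.inertia_ / n            # within-cluster sum of squares per point
    power = -d / 2.0
    return {k: mse[k] ** power - mse[k - 1] ** power for k in range(2, k_max + 1)}

# Usage: jumps = jump_points(X, 30); k_best = max(jumps, key=jumps.get)
```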
Sum-of-squares based indexes

SSW / k                           — Ball and Hall (1965)
k² |W|                            — Marriot (1971)
(SSB / (k-1)) / (SSW / (N-k))     — Calinski & Harabasz (1974)
log(SSB / SSW)                    — Hartigan (1975)
d·log(√(SSW / (d·N²))) + log(k)   — Xu (1997)

(d is the dimension of the data; N is the size of the data; k is the number of clusters)

SSW = sum of squares within the clusters (= MSE)
SSB = sum of squares between the clusters
Variances

Within clusters:   SSW(C, k) = Σ_{i=1..N} ||x_i − c_{p(i)}||²

Between clusters:  SSB(C, k) = Σ_{j=1..k} n_j ||c_j − x̄||²

Total variance of the data set:

σ²(X) = SSW + SSB = Σ_{i=1..N} ||x_i − c_{p(i)}||² + Σ_{j=1..k} n_j ||c_j − x̄||²
F-ratio variance test

Variance-ratio F-test: measures the ratio of the between-groups variance against the within-groups variance (original F-test).

F-ratio (WB-index):

F = k · SSW / SSB = k · Σ_{i=1..N} ||x_i − c_{p(i)}||² / Σ_{j=1..k} n_j ||c_j − x̄||²

where SSB = σ²(X) − SSW.
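A small sketch of SSW, SSB and the F-ratio (WB-index) following the definitions above; `labels` and `centroids` are assumed to come from any clustering algorithm, with labels coded as integers 0..k-1:

```python
# Sketch of SSW, SSB and the F-ratio (WB-index); labels are integers 0..k-1
# and centroids is a (k, d) array produced by any clustering algorithm.
import numpy as np

def ssw_ssb(X, labels, centroids):
    """Within-cluster (SSW) and between-cluster (SSB) sums of squares."""
    grand_mean = X.mean(axis=0)
    ssw = np.sum((X - centroids[labels]) ** 2)
    sizes = np.bincount(labels, minlength=len(centroids))
    ssb = np.sum(sizes[:, None] * (centroids - grand_mean) ** 2)
    return ssw, ssb

def f_ratio(X, labels, centroids):
    """WB-index F = k * SSW / SSB; smaller values indicate better clustering."""
    ssw, ssb = ssw_ssb(X, labels, centroids)
    return len(centroids) * ssw / ssb
```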
Calculation of F-ratio

[Figure: components of the F-test as a function of the number of clusters (2 to 25): the divider (between-cluster term), the nominator (k·MSE), and the total F-ratio.]
F-ratio for data set S1

[Figure: F-ratio (×10⁵) for 25 down to 5 clusters for the PNN and IS algorithms; the minimum indicates the number of clusters.]
F-ratio for data set S2

[Figure: F-ratio (×10⁵) for 25 down to 5 clusters for PNN and IS; minimum marked.]
F-ratio for data set S3

[Figure: F-ratio for 25 down to 5 clusters for PNN and IS; minimum marked.]
F-ratio for data set S4

[Figure: F-ratio for 25 down to 5 clusters for PNN and IS; the two methods give minima at 16 and 15 clusters.]
Extension of the F-ratio for S3

[Figure: F-ratio for 25 down to 5 clusters for PNN and IS; besides the minimum, another knee point is visible.]
Sum-of-squares based indexes (as functions of the number of clusters m)

MSE, SSW/SSB
SSW / m
log(SSB / SSW)
(SSB / (m−1)) / (SSW / (n−m))
d·log(√(SSW / (d·n²))) + log(m)
m · SSW / SSB
Davies-Bouldin index (DBI)

Minimize intra-cluster variance
Maximize the distance between clusters
Cost function is a weighted sum of the two:

R_{j,k} = (MAE_j + MAE_k) / d(c_j, c_k)

DBI = (1/M) · Σ_{j=1..M} max_{k≠j} R_{j,k}
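For quick experiments a close relative of this index is available in scikit-learn as `davies_bouldin_score` (it uses the mean Euclidean distance to the centroid as the within-cluster dispersion rather than MAE); a sketch of sweeping it over candidate k:

```python
# Sweep of the Davies-Bouldin index over candidate k using scikit-learn;
# note sklearn's dispersion is the mean distance to the centroid, not MAE.
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def dbi_curve(X, k_range):
    """DBI for each candidate number of clusters; the minimum suggests k."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)
    return scores
```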
Davies-Bouldin index (DBI)

[Figure: measured values of MSE, DBI and F-test for data set S2 as a function of the number of clusters (2 to 25); the minimum point of DBI and F-test indicates the number of clusters.]
Silhouette coefficient
[Kaufman & Rousseeuw, 1990]

Cohesion: measures how closely related the objects in a cluster are.
Separation: measures how distinct or well-separated a cluster is from the other clusters.
Silhouette coefficient

Cohesion a(x): average distance of x to all other vectors in the same cluster.
Separation b(x): average distance of x to the vectors in other clusters; take the minimum over the clusters.

Silhouette s(x):

s(x) = (b(x) − a(x)) / max{a(x), b(x)}

s(x) ∈ [−1, +1]: −1 = bad, 0 = indifferent, +1 = good

Silhouette coefficient (SC):

SC = (1/N) · Σ_{i=1..N} s(x_i)
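A minimal sketch, assuming scikit-learn is available: `silhouette_score` computes the average s(x) defined above over all points, so sweeping it over candidate k and taking the maximum is one way to pick the number of clusters:

```python
# Average silhouette for each candidate k (the maximum suggests the number
# of clusters); silhouette_score implements the s(x) defined above.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_curve(X, k_range):
    return {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                          random_state=0).fit_predict(X))
            for k in k_range}
```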
Silhouette coefficient

[Figure: a(x) is the average distance of x to the points in its own cluster (cohesion); b(x) is the average distance to the points in the other clusters, minimized over clusters (separation).]
Performance of the silhouette coefficient
Bayesian information criterion (BIC)

BIC = Bayesian Information Criterion

BIC = L(θ) − (1/2) · m · log n

L(θ): log-likelihood function of all models
n: size of the data set
m: number of clusters

Under a spherical Gaussian assumption, we get the formula of BIC in partitioning-based clustering:

BIC = Σ_{i=1..m} [ n_i·log n_i − n_i·log n − (n_i·d/2)·log(2π) − (n_i/2)·log Σ_i − (n_i − m)/2 ] − (1/2)·m·log n

d: dimension of the data set
n_i: size of the i-th cluster
Σ_i: covariance of the i-th cluster

Knee point detection on BIC

Original BIC = F(m); knee score SD(m) = F(m−1) + F(m+1) − 2·F(m)
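A hedged sketch of the spherical-Gaussian BIC and the knee score above; it follows the reconstructed formula (hard partitions, a per-cluster variance estimate), so treat it as illustrative rather than a reference implementation:

```python
# Illustrative sketch of the spherical-Gaussian BIC above (hard partitions,
# per-cluster variance estimate) and of the knee score SD(m).
import numpy as np

def bic_spherical(X, labels, centroids):
    n, d = X.shape
    m = len(centroids)
    bic = 0.0
    for i in range(m):
        Xi = X[labels == i]
        ni = len(Xi)
        if ni <= 1:
            continue                              # skip degenerate clusters
        var_i = np.sum((Xi - centroids[i]) ** 2) / (ni * d)
        bic += (ni * np.log(ni) - ni * np.log(n)
                - 0.5 * ni * d * np.log(2 * np.pi)
                - 0.5 * ni * np.log(var_i)
                - 0.5 * (ni - m))
    return bic - 0.5 * m * np.log(n)

def knee_scores(bic_by_m):
    """SD(m) = F(m-1) + F(m+1) - 2*F(m); assumes consecutive m values as keys."""
    ms = sorted(bic_by_m)
    return {m: bic_by_m[m - 1] + bic_by_m[m + 1] - 2 * bic_by_m[m] for m in ms[1:-1]}
```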


Internal indexes

Internal indexes: soft partitions

Comparison of the indexes: K-means

Comparison of the indexes: Random Swap
Part III:

Stochastic complexity for binary data
Stochastic complexity

Principle of minimum description length (MDL): find the clustering C that can be used for describing the data with minimum information.

Data = Clustering + description of the data.
Clustering is defined by the centroids.
Data is defined by:
which cluster (partition index)
where in the cluster (difference from the centroid)
Solution for binary data

SC = Σ_{j=1..M} n_j · Σ_{i=1..d} h(n_ij / n_j) − Σ_{j=1..M} n_j · log(n_j / N) + (d/2) · Σ_{j=1..M} log max(1, n_j)

where

h(p) = −p·log p − (1−p)·log(1−p)

This can be simplified to:

SC = Σ_{j=1..M} n_j · Σ_{i=1..d} h(n_ij / n_j) − Σ_{j=1..M} n_j · log n_j + (d/2) · Σ_{j=1..M} log max(1, n_j)
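A sketch of the simplified SC formula, assuming X is an N×d 0/1 matrix, `labels` assigns each vector to one of M clusters, and logarithms are taken in base 2 (the base only scales the value):

```python
# Sketch of the simplified stochastic complexity for binary data; X is an
# (N x d) 0/1 matrix, labels maps each vector to a cluster 0..M-1, logs in base 2.
import numpy as np

def binary_entropy(p):
    """h(p) = -p log p - (1-p) log(1-p), with 0*log(0) treated as 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def stochastic_complexity(X, labels, M):
    N, d = X.shape
    sc = 0.0
    for j in range(M):
        Xj = X[labels == j]
        nj = len(Xj)
        if nj == 0:
            continue
        nij = Xj.sum(axis=0)                       # ones per attribute in cluster j
        sc += nj * binary_entropy(nij / nj).sum()  # intra-cluster description
        sc -= nj * np.log2(nj)                     # partition (cluster index) cost
        sc += 0.5 * d * np.log2(max(1, nj))        # centroid description cost
    return sc
```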
Number of clusters by stochastic complexity (SC)

[Figure: SC as a function of the number of clusters (50 to 90) for repeated K-means and RLS.]
Part IV:

External indexes
Pair-counting measures

Measure the number of point pairs that are in:

Same class both in P and G:
a = (1/2) · Σ_{i=1..K} Σ_{j=1..K'} n_ij · (n_ij − 1)

Same class in P but different in G:
b = (1/2) · ( Σ_{j=1..K'} n_{·j}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )

Different classes in P but same in G:
c = (1/2) · ( Σ_{i=1..K} n_{i·}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )

Different classes both in P and G:
d = (1/2) · ( N² + Σ_{i=1..K} Σ_{j=1..K'} n_ij² − Σ_{i=1..K} n_{i·}² − Σ_{j=1..K'} n_{·j}² )
Rand and Adjusted Rand index
[Rand, 1971] [Hubert and Arabie, 1985]

Agreement: a, d
Disagreement: b, c

RI(P, G) = (a + d) / (a + b + c + d)

ARI = (RI − E(RI)) / (1 − E(RI))
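A compact sketch of the pair counts and the Rand index computed from the contingency matrix n_ij (rows = ground truth G, columns = clustering P); the adjusted version is also directly available as `sklearn.metrics.adjusted_rand_score`:

```python
# Sketch of the pair counts and Rand index from the contingency matrix n_ij
# (rows = classes of G, columns = clusters of P); for the adjusted version,
# sklearn.metrics.adjusted_rand_score(labels_true, labels_pred) can be used.
import numpy as np

def pair_counts(n):
    N = n.sum()
    sum_sq = (n ** 2).sum()
    rows = (n.sum(axis=1) ** 2).sum()             # ground-truth marginals squared
    cols = (n.sum(axis=0) ** 2).sum()             # clustering marginals squared
    a = 0.5 * (sum_sq - N)                        # same pair in both P and G
    b = 0.5 * (cols - sum_sq)                     # same in P, different in G
    c = 0.5 * (rows - sum_sq)                     # same in G, different in P
    d = 0.5 * (N ** 2 + sum_sq - rows - cols)     # different in both
    return a, b, c, d

def rand_index(n):
    a, b, c, d = pair_counts(n)
    return (a + d) / (a + b + c + d)
```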
External indexes

If true class labels (ground truth) are known, the validity of a clustering can be verified by comparing the class labels and clustering labels.

n_ij = number of objects in class i and cluster j

Rand statistics
Visual example

Pointwise measures

a = (1/2) · Σ_{i=1..K} Σ_{j=1..K'} n_ij · (n_ij − 1)
b = (1/2) · ( Σ_{j=1..K'} n_{·j}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )
c = (1/2) · ( Σ_{i=1..K} n_{i·}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )
d = (1/2) · ( N² + Σ_{i=1..K} Σ_{j=1..K'} n_ij² − Σ_{i=1..K} n_{i·}² − Σ_{j=1..K'} n_{·j}² )
Rand index (example)

Vectors assigned to:                  Same cluster   Different clusters
Same cluster in ground truth                20              24
Different clusters in ground truth          20              72

Rand index = (20 + 72) / (20 + 24 + 20 + 72) = 92/136 = 0.68

Adjusted Rand = (to be calculated) = 0.xx
External indexes

Pair counting
Information theoretic
Set matching
Pair-counting measures

Agreement: a, d
Disagreement: b, c

a = (1/2) · Σ_{i=1..K} Σ_{j=1..K'} n_ij · (n_ij − 1)
b = (1/2) · ( Σ_{j=1..K'} n_{·j}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )
c = (1/2) · ( Σ_{i=1..K} n_{i·}² − Σ_{i=1..K} Σ_{j=1..K'} n_ij² )
d = (1/2) · ( N² + Σ_{i=1..K} Σ_{j=1..K'} n_ij² − Σ_{i=1..K} n_{i·}² − Σ_{j=1..K'} n_{·j}² )

Rand index:           RI(P, G) = (a + d) / (a + b + c + d)
Adjusted Rand index:  ARI = (RI − E(RI)) / (1 − E(RI))
Information-theoretic measures

Based on the concept of entropy:

MI(P, G) = Σ_{i=1..K} Σ_{j=1..K'} (n_ij / N) · log( N·n_ij / (n_i·n_j) )

Mutual Information (MI) measures the information that two clusterings share, and Variation of Information (VI) is the complement of MI.

[Figure: Venn diagram of the entropies H(P) and H(G), the conditional entropies H(P|G) and H(G|P), the mutual information MI, and VI(P, G).]

n_i: size of cluster P_i
n_j: size of cluster G_j
n_ij: number of shared objects in P_i and G_j
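MI can be computed directly from the same contingency matrix; normalized and adjusted variants exist in scikit-learn (`normalized_mutual_info_score`, `adjusted_mutual_info_score`). A small sketch using the natural logarithm:

```python
# Sketch of MI(P, G) from the contingency matrix (natural logarithm); normalized
# and adjusted variants are available in scikit-learn.
import numpy as np

def mutual_information(n):
    N = n.sum()
    ni = n.sum(axis=1, keepdims=True)     # marginal cluster sizes of one clustering
    nj = n.sum(axis=0, keepdims=True)     # marginal cluster sizes of the other
    expected = ni * nj                    # outer product of the marginals
    mask = n > 0                          # 0 * log(0) terms contribute nothing
    return np.sum((n[mask] / N) * np.log(N * n[mask] / expected[mask]))
```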
Set-matching measures

Categories:
Point level
Cluster level

Three problems:
How to measure the similarity of two clusters?
How to pair clusters?
How to calculate the overall similarity?
Similarity of two clusters

Jaccard:        J  = |P_i ∩ G_j| / |P_i ∪ G_j|
Sorensen-Dice:  SD = 2·|P_i ∩ G_j| / (|P_i| + |G_j|)
Braun-Banquet:  BB = |P_i ∩ G_j| / max(|P_i|, |G_j|)

Example with clusters P1 (n1 = 1000), P2 (n2 = 250) and P3 (n3 = 200):

Criterion                  P2, P3    P2, P1
Shared objects (H/NVD/CSI)   200       250
J                           0.80      0.25
SD                          0.89      0.40
BB                          0.80      0.25
Pairing

Matching problem in a weighted bipartite graph.

[Figure: bipartite graph between the clusters G1, G2, G3 of G and P1, P2, P3 of P.]
Pairing

Matching or pairing?
Algorithms:
Greedy
Optimal pairing
Normalized Van Dongen

Matching based on the number of shared objects.

[Figure: clustering P shown as big circles, clustering G by the shape of the objects.]

NVD = ( 2N − Σ_{i=1..K} max_j n_ij − Σ_{j=1..K'} max_i n_ij ) / (2N)
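A one-function sketch of NVD from the contingency matrix, matching each cluster to the one it shares most objects with:

```python
# Sketch of the Normalized Van Dongen criterion; n is the contingency matrix
# between the two clusterings (0 means identical partitions, larger is worse).
import numpy as np

def nvd(n):
    N = n.sum()
    return (2 * N - n.max(axis=1).sum() - n.max(axis=0).sum()) / (2.0 * N)
```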
Pair Set Index (PSI)

Similarity of two clusters:

S_ij = n_ij / max(|P_i|, |G_j|)

[Figure: two examples of paired clusters, one with S = 100%, one with S = 50%.]

Total similarity: S = Σ_i S_{i,j}, where j is the index of the cluster paired with P_i.

Optimal pairing using the Hungarian algorithm.
Pair Set Index (PSI)

Adjustment for chance:

Max(S) = min(K, K')

E(S) = Σ_{i=1..min(K,K')} ( n_i · m_i / N ) / max(n_i, m_i)

where the cluster sizes are sorted: in P, n1 > n2 > … > nK; in G, m1 > m2 > … > mK'.

Transformation (Max(S) → 1, E(S) → 0):

PSI = (S − E(S)) / (max(K, K') − E(S))   if S ≥ E(S)
PSI = 0                                   if S < E(S)
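A hedged sketch of PSI as reconstructed above: pairwise similarities S_ij, optimal pairing with the Hungarian algorithm (SciPy's `linear_sum_assignment`), then the adjustment for chance; details of the published index may differ slightly:

```python
# Hedged sketch of the Pair Set Index as reconstructed above: similarities
# S_ij, optimal pairing by the Hungarian algorithm, adjustment for chance.
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_set_index(n):
    """n: contingency matrix, rows = clusters of G, columns = clusters of P."""
    N = n.sum()
    sizes_g = n.sum(axis=1)
    sizes_p = n.sum(axis=0)
    S = n / np.maximum.outer(sizes_g, sizes_p)      # S_ij = n_ij / max(|G_i|, |P_j|)
    rows, cols = linear_sum_assignment(-S)          # maximize total similarity
    s_total = S[rows, cols].sum()
    g_sorted = np.sort(sizes_g)[::-1]               # sorted cluster sizes
    p_sorted = np.sort(sizes_p)[::-1]
    k_min = min(len(sizes_g), len(sizes_p))
    e_s = np.sum(g_sorted[:k_min] * p_sorted[:k_min] / N
                 / np.maximum(g_sorted[:k_min], p_sorted[:k_min]))
    k_max = max(len(sizes_g), len(sizes_p))
    return max(0.0, (s_total - e_s) / (k_max - e_s))
```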

Properties of PSI

Symmetric
Normalized to the number of clusters
Normalized to the size of clusters
Adjusted for chance
Range in [0, 1]
Number of clusters can be different
Random partitioning

Changing the number of clusters in P from 1 to 20.

[Figure: ground truth G with three clusters (boundaries at 1000, 2000, 3000); P obtained by randomly partitioning the data into two clusters.]
Linearity property

Enlarging the first cluster:

G    1000   2000   3000
P1   1250   2000   3000
P2          2000   3000
P3          2500   3000
P4                 3000

Wrongly labeling some part of each cluster:

G    1000   2000   3000
P1    900   1900   2900
P2    800   1800   2800
P3    500   1500   2500
P4    334   1333   2333

Cluster size imbalance

G1   1   1000   2000   3000
P1        800   1800   3000

G2   1   1000   2000   2500
P2        800   1800   2500

Number of clusters

G1   1000   2000
P1    800   1800

G2   1000   2000   3000
P2    800   1800   2800


Part V:

Cluster-level measures
Comparing partitions of centroids

Point-level differences vs. cluster-level mismatches.
Centroid index (CI)
[Fränti, Rezaei, Zhao, Pattern Recognition, 2014]

Given two sets of centroids C and C', find the nearest-neighbour mappings (C → C'):

q_i = arg min_{1≤j≤K2} ||c_i − c'_j||²,   i = 1, …, K1

Detect the prototypes with no mapping:

orphan(c'_j) = 1 if q_i ≠ j for all i, and 0 otherwise

Centroid index = number of zero-mappings:

CI1(C, C') = Σ_{j=1..K2} orphan(c'_j)
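A minimal sketch of CI1 and its symmetric version CI2, assuming the two sets of centroids are given as NumPy arrays of shape (K, d):

```python
# Sketch of the centroid index; C and C_prime are (K, d) NumPy arrays of centroids.
import numpy as np

def centroid_index(C, C_prime):
    """CI1(C, C'): number of centroids in C' that receive no mapping."""
    dists = ((C[:, None, :] - C_prime[None, :, :]) ** 2).sum(axis=2)
    q = dists.argmin(axis=1)                 # nearest-neighbour mapping C -> C'
    orphan = np.ones(len(C_prime), dtype=bool)
    orphan[q] = False                        # mark every centroid that was hit
    return int(orphan.sum())

def centroid_index_symmetric(C, C_prime):
    """CI2 = max(CI1(C, C'), CI1(C', C))."""
    return max(centroid_index(C, C_prime), centroid_index(C_prime, C))
```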
Example of the centroid index (data set S2)

[Figure: mapping counts between two sets of centroids; a value of 1 indicates that the same cluster is found, 0 indicates a zero-mapping. The index value equals the count of zero-mappings; here CI = 2.]
Example of the centroid index

[Figure: one region contains two clusters but only one centroid is allocated; elsewhere three centroids are mapped into one cluster.]
Adjusted Rand vs. centroid index

[Figure: three clusterings compared against the ground truth: a merge-based (PNN) result with ARI = 0.91 and CI = 0, and random swap and K-means results with ARI = 0.88 and 0.82 but CI = 1; a high ARI does not guarantee correct cluster-level structure.]
Centroid index properties

Mapping is not symmetric (C → C' ≠ C' → C).

Symmetric centroid index:

CI2(C, C') = max{ CI1(C, C'), CI1(C', C) }

Pointwise variant (Centroid Similarity Index):
Matching of clusters is based on CI.
Similarity of the matched clusters:

CSI = (S12 + S21) / 2, where S12 = Σ_{i=1..K1} |C_i ∩ C'_j| / N and S21 = Σ_{j=1..K2} |C'_j ∩ C_i| / N
Centroid index

Distance to the ground truth (2 clusters):

1 ↔ GT: CI = 1, CSI = 0.50
2 ↔ GT: CI = 1, CSI = 0.50
3 ↔ GT: CI = 1, CSI = 0.50
4 ↔ GT: CI = 1, CSI = 0.50

[Figure: pairwise CI and CSI values between the four clustering results (CSI values ranging from 0.53 to 0.87).]
Mean squared errors
Clustering quality (MSE)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 179.76 176.92 173.64 179.73 168.92 164.64 164.78 161.47
House 6.67 6.43 6.28 6.20 6.27 5.96 5.91 5.87
Miss America 5.95 5.83 5.52 5.92 5.36 5.28 5.21 5.10
House 3.61 3.28 2.50 3.57 2.62 2.83 - 2.44
Birch1 5.47 5.01 4.88 5.12 4.73 4.64 - 4.64
Birch2 7.47 5.65 3.07 6.29 2.28 2.28 - 2.28
Birch3 2.51 2.07 1.92 2.07 1.96 1.86 - 1.86
S1 19.71 8.92 8.92 8.92 8.93 8.92 8.92 8.92
S2 20.58 13.28 13.28 15.87 13.44 13.28 13.28 13.28
S3 19.57 16.89 16.89 16.89 17.70 16.89 16.89 16.89
S4 17.73 15.70 15.70 15.71 17.52 15.70 15.71 15.70
Adjusted Rand Index
Adjusted Rand Index (ARI)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 0.38 0.40 0.39 0.37 0.43 0.52 0.50 1
House 0.40 0.40 0.44 0.47 0.43 0.53 0.53 1
Miss America 0.19 0.19 0.18 0.20 0.20 0.20 0.23 1
House 0.46 0.49 0.52 0.46 0.49 0.49 - 1
Birch 1 0.85 0.93 0.98 0.91 0.96 1.00 - 1
Birch 2 0.81 0.86 0.95 0.86 1 1 - 1
Birch 3 0.74 0.82 0.87 0.82 0.86 0.91 - 1
S1 0.83 1.00 1.00 1.00 1.00 1.00 1.00 1.00
S2 0.80 0.99 0.99 0.89 0.98 0.99 0.99 0.99
S3 0.86 0.96 0.96 0.96 0.92 0.96 0.96 0.96
S4 0.82 0.93 0.93 0.94 0.77 0.93 0.93 0.93
Normalized Mutual Information
Normalized Mutual Information (NMI)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 0.77 0.78 0.78 0.77 0.80 0.83 0.82 1.00
House 0.80 0.80 0.81 0.82 0.81 0.83 0.84 1.00
Miss America 0.64 0.64 0.63 0.64 0.64 0.66 0.66 1.00
House 0.81 0.81 0.82 0.81 0.81 0.82 - 1.00
Birch 1 0.95 0.97 0.99 0.96 0.98 1.00 - 1.00
Birch 2 0.96 0.97 0.99 0.97 1.00 1.00 - 1.00
Birch 3 0.90 0.94 0.94 0.93 0.93 0.96 - 1.00
S1 0.93 1.00 1.00 1.00 1.00 1.00 1.00 1.00
S2 0.90 0.99 0.99 0.95 0.99 0.93 0.99 0.99
S3 0.92 0.97 0.97 0.97 0.94 0.97 0.97 0.97
S4 0.88 0.94 0.94 0.95 0.85 0.94 0.94 0.94
Normalized Van Dongen
Normalized Van Dongen (NVD)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 0.45 0.42 0.43 0.46 0.38 0.32 0.33 0.00
House 0.44 0.43 0.40 0.37 0.40 0.33 0.31 0.00
Miss America 0.60 0.60 0.61 0.59 0.57 0.55 0.53 0.00
House 0.40 0.37 0.34 0.39 0.39 0.34 - 0.00
Birch 1 0.09 0.04 0.01 0.06 0.02 0.00 - 0.00
Birch 2 0.12 0.08 0.03 0.09 0.00 0.00 - 0.00
Birch 3 0.19 0.12 0.10 0.13 0.13 0.06 - 0.00
S1 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00
S2 0.11 0.00 0.00 0.06 0.01 0.04 0.00 0.00
S3 0.08 0.02 0.02 0.02 0.05 0.00 0.00 0.02
S4 0.11 0.04 0.04 0.03 0.13 0.04 0.04 0.04
Centroid Index
C-Index (CI2)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 74 63 58 81 33 33 35 0
House 56 45 40 37 31 22 20 0
Miss America 88 91 67 88 38 43 36 0
House 43 39 22 47 26 23 --- 0
Birch 1 7 3 1 4 0 0 --- 0
Birch 2 18 11 4 12 0 0 --- 0
Birch 3 23 11 7 10 7 2 --- 0
S1 2 0 0 0 0 0 0 0
S2 2 0 0 1 0 0 0 0
S3 1 0 0 0 0 0 0 0
S4 1 0 0 0 1 0 0 0
Centroid Similarity Index
Centroid Similarity Index (CSI)
Data set
KM RKM KM++ XM AC RS GKM GA
Bridge 0.47 0.51 0.49 0.45 0.57 0.62 0.63 1.00
House 0.49 0.50 0.54 0.57 0.55 0.63 0.66 1.00
Miss America 0.32 0.32 0.32 0.33 0.38 0.40 0.42 1.00
House 0.54 0.57 0.63 0.54 0.57 0.62 --- 1.00
Birch 1 0.87 0.94 0.98 0.93 0.99 1.00 --- 1.00
Birch 2 0.76 0.84 0.94 0.83 1.00 1.00 --- 1.00
Birch 3 0.71 0.82 0.87 0.81 0.86 0.93 --- 1.00
S1 0.83 1.00 1.00 1.00 1.00 1.00 1.00 1.00
S2 0.82 1.00 1.00 0.91 1.00 1.00 1.00 1.00
S3 0.89 0.99 0.99 0.99 0.98 0.99 0.99 0.99
S4 0.87 0.98 0.98 0.99 0.85 0.98 0.98 0.98
High quality clustering

Method MSE
GKM Global K-means 164.78
RS Random swap (5k) 164.64
GA Genetic algorithm 161.47
RS8M Random swap (8M) 161.02
GAIS-2002 GAIS 160.72
+ RS1M GAIS + RS (1M) 160.49
+ RS8M GAIS + RS (8M) 160.43
GAIS-2012 GAIS 160.68
+ RS1M GAIS + RS (1M) 160.45
+ RS8M GAIS + RS (8M) 160.39
+ PRS GAIS + PRS 160.33
+ RS8M + PRS GAIS + RS (8M) + PRS 160.28
Centroid index values

Main algorithm:  RS8M | GAIS (2002), +RS1M, +RS8M | GAIS (2012), +RS1M, +RS8M, +PRS, +RS8M+PRS

RS8M --- 19 19 19 23 24 24 23 22
GAIS (2002) 23 --- 0 0 14 15 15 14 16
+ RS1M 23 0 --- 0 14 15 15 14 13
+ RS8M 23 0 0 --- 14 15 15 14 13
GAIS (2012) 25 17 18 18 --- 1 1 1 1
+ RS1M 25 17 18 18 1 --- 0 0 1
+ RS8M 25 17 18 18 1 0 --- 0 1
+ PRS 25 17 18 18 1 0 0 --- 1
+ RS8M + PRS 24 17 18 18 1 1 1 1 ---
Summary of external indexes
(existing measures)
Part VI:

Efficient implementation
Strategies for efficient search

Brute force: solve the clustering for every possible number of clusters.
Stepwise: as in brute force, but start from the previous solution and iterate less.
Criterion-guided search: integrate the cost function directly into the optimization function.
Brute force search strategy
Search for each number of clusters separately. (100 %)

Stepwise search strategy
Start from the previous result. (30-40 %)

Criterion-guided search
Integrate with the cost function! (3-6 %)
Stopping criterion for the stepwise search strategy

[Figure: evaluation function value as a function of the iteration number, showing the starting point f(1), the halfway value f(k/2), the current value f(k), and the estimated f(3k/2); the search is stopped when the estimated further improvement, relative to f(1) − f(k), falls below a threshold T_min.]
Comparison of search strategies

[Figure: percentage as a function of data dimensionality (2 to 15) for DLS, CA, Stepwise/FCM, Stepwise/LBG-U, and Stepwise/K-means.]
Open questions

Iterative algorithm (K-means or Random Swap) with criterion-guided search, or hierarchical algorithm?

A potential topic for an MSc or PhD thesis!
Literature

1. G.W. Milligan and M.C. Cooper, "An examination of procedures for determining the number of clusters in a data set", Psychometrika, Vol. 50, 1985, pp. 159-179.
2. E. Dimitriadou, S. Dolnicar and A. Weingessel, "An examination of indexes for determining the number of clusters in binary data sets", Psychometrika, Vol. 67, No. 1, 2002, pp. 137-160.
3. D.L. Davies and D.W. Bouldin, "A cluster separation measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227, 1979.
4. J.C. Bezdek and N.R. Pal, "Some new indexes of cluster validity", IEEE Transactions on Systems, Man and Cybernetics, 28(3), 302-315, 1998.
5. H. Bischof, A. Leonardis and A. Selb, "MDL principle for robust vector quantization", Pattern Analysis and Applications, 2(1), 59-72, 1999.
6. P. Fränti, M. Xu and I. Kärkkäinen, "Classification of binary vectors by using DeltaSC distance to minimize stochastic complexity", Pattern Recognition Letters, 24(1-3), 65-73, January 2003.
7. G.M. James and C.A. Sugar, "Finding the number of clusters in a dataset: An information-theoretic approach", Journal of the American Statistical Association, vol. 98, 397-408, 2003.
8. P.K. Ito, "Robustness of ANOVA and MANOVA test procedures", in: P.R. Krishnaiah (ed.), Handbook of Statistics 1: Analysis of Variance. North-Holland Publishing Company, 1980.
9. I. Kärkkäinen and P. Fränti, "Dynamic local search for clustering with unknown number of clusters", Int. Conf. on Pattern Recognition (ICPR'02), Québec, Canada, vol. 2, 240-243, August 2002.
10. D. Pelleg and A. Moore, "X-means: Extending K-means with efficient estimation of the number of clusters", Int. Conf. on Machine Learning (ICML), 727-734, San Francisco, 2000.
11. S. Salvador and P. Chan, "Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms", IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI), 576-584, Boca Raton, Florida, November 2004.
12. M. Gyllenberg, T. Koski and M. Verlaan, "Classification of binary vectors by stochastic complexity", Journal of Multivariate Analysis, 63(1), 47-72, 1997.
13. X. Hu and L. Xu, "A comparative study of several cluster number selection criteria", Int. Conf. Intelligent Data Engineering and Automated Learning (IDEAL), 195-202, Hong Kong, 2003.
14. L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, London, 1990. ISBN 10: 0471878766.
15. M. Halkidi, Y. Batistakis and M. Vazirgiannis, "Cluster validity methods: part 1", SIGMOD Record, Vol. 31, No. 2, pp. 40-45, 2002.
16. R. Tibshirani, G. Walther and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic", Journal of the Royal Statistical Society B, 63, Part 2, pp. 411-423, 2001.
17. T. Lange, V. Roth, M. Braun and J.M. Buhmann, "Stability-based validation of clustering solutions", Neural Computation, Vol. 16, pp. 1299-1323, 2004.
18. Q. Zhao, M. Xu and P. Fränti, "Sum-of-squares based clustering validity index and significance analysis", Int. Conf. on Adaptive and Natural Computing Algorithms (ICANNGA'09), Kuopio, Finland, LNCS 5495, 313-322, April 2009.
19. Q. Zhao, M. Xu and P. Fränti, "Knee point detection on Bayesian information criterion", IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI), Dayton, Ohio, USA, 431-438, November 2008.
20. W.M. Rand, "Objective criteria for the evaluation of clustering methods", Journal of the American Statistical Association, 66, 846-850, 1971.
21. L. Hubert and P. Arabie, "Comparing partitions", Journal of Classification, 2(1), 193-218, 1985.
22. P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: Cluster level similarity measure", Pattern Recognition, 2014 (accepted).
