You are on page 1of 9

Data Clustering

Theory, Algorithms,
and Applications
Guojun Gan
York University
Toronto, Ontario, Canada
Chaoqun Ma
Hunan University
Changsha, Hunan, People's Republic of China
Jianhong Wu
York University
Toronto, Ontario, Canada
SlaJTL
Society for Industrial and Applied Mathematics

American Statistical Association


Philadelphia, PennsylvaniaAlexandria, V irginia
Contents
List of Figures
xiii
List of Tables
xv
List of Algorithms
xvii
Preface
xix
I
1
2
Clustering, Data, and Similarity Measures
Data Clustering
1 . 1 Definition of Data Clustering
1 . 2The Vocabulary of Clustering
1 . 2. 1 Records and Attributes
1 . 2. 2Distances and Similarities
1 . 2. 3Clusters, Centers, and Modes
1 . 2. 4Hard Clustering and Fuzzy Clustering
1 . 2. 5Validity Indices
1 . 3Clustering Processes
1 . 4Dealing with Missing Values
1 . 5Resources for Clustering
1 . 5. 1 Surveys and Reviews on Clustering
1 . 5. 2Books on Clustering
1 . 5. 3Journals
1 . 5. 4Conference Proceedings
1 . 5. 5Data Sets
1 . 6Summary
Data Types
2. 1 Categorical Data
2. 2Binary Data
2. 3Transaction Data
2. 4Symbolic Data
2. 5Time Series
2. 6Summary
1
3
3
5
5
5
6
7
8
8
1 0
1 2
1 2
1 2
1 3
1 5
1 7
1 7
1 9
1 9
21
23
23
24
24
vi
Contents
3 Scale Conversion
25
3. 1 Introduction
25
3. 1 . 1 Interval to Ordinal
25
3. 1 . 2Interval to Nominal
27
3. 1 . 3Ordinal to Nominal
28
3. 1 . 4Nominal to Ordinal
28
3. 1 . 5Ordinal to Interval
29
3. 1 . 6Other Conversions
29
3. 2 Categorization of Numerical Data
30
3. 2. 1 Direct Categorization
30
3. 2. 2Cluster-based Categorization
31
3. 2. 3Automatic Categorization
37
3. 3 Summary
41
4 Data Standardization and Transformation
43
4. 1 Data Standardization
43
4. 2 Data Transformation
46
4. 2. 1 Principal Component Analysis
46
4. 2. 2SVD
48
4. 2. 3The Karhunen-L&ve Transformation
49
4. 3 Summary
51
5 Data Visualization
53
5. 1 Sammon's Mapping 53
5. 2 MDS
54
5. 3 SOM
56
5. 4 Class-preserving Projections
59
5. 5 Parallel Coordinates 60
5. 6 Tree Maps
61
5. 7 Categorical Data Visualization62
5. 8 Other Visualization Techniques 65
5. 9 Summary 65
6 Similarity and Dissimilarity Measures 67
6. 1 Prel im inaries 67
6. 1 . 1 Proximity Matrix 68
6. 1 . 2Proximity Graph69
6. 1 . 3Scatter Matrix 69
6. 1 . 4Covariance Matrix70
6. 2 Measures for Numerical Data 71
6. 2. 1 Euclidean Distance 71
6. 2. 2Manhattan Distance71
6. 2. 3Maximum Distance 72
6. 2. 4Minkowski Distance 72
6. 2. 5Mahalanobis Distance72
Contents vii
6. 2. 6Average Distance
6. 2. 7Other Distances
73
74
6. 3 Measures for Categorical Data74
6. 3. 1 The Simple Matching Distance 76
6. 3. 2Other Matching Coefficients 76
6. 4 Measures for Binary Data 77
6. 5 Measures for Mixed-type Data79
6. 5. 1 A General Similarity Coefficient 79
6. 5. 2A General Distance Coefficient 80
6. 5. 3A Generalized Minkowski Distance81
6. 6 Measures for Time Series Data83
6. 6. 1 The Minkowski Distance 84
6. 6. 2Time Series Preprocessing 85
6. 6. 3Dynamic Time Warping87
6. 6. 4Measures Based on Longest Common Subsequences 88
6. 6. 5Measures Based on Probabilistic Models90
6. 6. 6Measures Based on Landmark Models 91
6. 6. 7Evaluation92
6. 7 Other Measures92
6. 7. 1 The Cosine Similarity Measure93
6. 7. 2A Link-based Similarity Measure93
6. 7. 3Support 94
6. 8 Similarity and Dissimilarity Measures between Clusters 94
6. 8. 1 The Mean-based Distance94
6. 8. 2The Nearest Neighbor Distance95
6. 8. 3The Farthest Neighbor Distance95
6. 8. 4The Average Neighbor Distance 96
6. 8. 5Lance-Williams Formula 96
6. 9 Similarity and Dissimilarity between Variables 98
6. 9. 1 Pearson's Correlation Coefficients 98
6. 9. 2Measures Based on the Chi-square Statistic1 01
6. 9. 3Measures Based on Optimal Class Prediction 1 03
6. 9. 4Group-based Distance1 05
6. 1 0 Summary
1 06
II Clustering Algorithms
1 07
7 Hierarchical Clustering Techniques
1 09
7. 1 Representations of Hierarchical Clusterings1 09
7. 1 . 1 ii-tree
1 1 0
7. 1 . 2Dendrogram1 1 0
7. 1 . 3Banner
1 1 2
7. 1 . 4Pointer Representation
1 1 2
7. 1 . 5Packed Representation1 1 4
7. 1 . 6Icicle Plot
1 1 5
7. 1 . 7Other Representations1 1 5
viii
Contents
7. 2
Agglomerative Hierarchical Methods
1 1 6
7. 2. 1 The Single-link Method
1 1 8
7. 2. 2The Complete Link Method1 20
7. 2. 3The Group Average Method
1 22
7. 2. 4The Weighted Group Average Method
1 25
7. 2. 5The Centroid Method
1 26
7. 2. 6The Median Method1 30
7. 2. 7Ward's Method
1 32
7. 2. 8Other Agglomerative Methods 1 37
7. 3 Divisive Hierarchical Methods1 37
7. 4 Several Hierarchical Algorithms 1 38
7. 4. 1 SLINK
1 38
7. 4. 2Single-link Algorithms Based an Minimum Spanning Trees 1 40
7. 4. 3CLINK
1 41
7. 4. 4BIRCH
1 44
7. 4. 5CURE 1 44
7. 4. 6DIANA 1 45
7. 4. 7DISMEA 1 47
7. 4. 8Edwards and Cavalli-Sforza Method 1 47
7. 5 Summary 1 49
8 Fuzzy Clustering Algorithms 1 51
8. 1 Fuzzy Sets 1 51
8. 2 Fuzzy Relations 1 53
8. 3 Fuzzy k-means 1 54
8. 4 Fuzzy k-modes 1 56
8. 5 The c-means Method1 58
8. 6 Summary 1 59
9 Center-based Clustering Algorithms 1 61
9. 1 The k-means Algorithm 1 61
9. 2 Variations of the k-means Algorithm1 64
9. 2. 1 The Continuous k-means Algorithm 1 65
9. 2. 2The Compare-means Algorithm1 65
9. 2. 3The Sort-means Algorithm 1 66
9. 2. 4
Acceleration of the k-means Algorithm with the
kd-tree1 67
9. 2. 5Other Acceleration Methods 1 68
9. 3
The Trimmed k-means Algorithm 1 69
9. 4 The x-means Algorithm 1 70
9. 5
The k-harmonic Means Algorithm 1 71
9. 6
The Mean Shift Algorithm 1 73
9. 7 MEC
1 75
9. 8
The k-modes Algorithm (Huang)1 76
9. 8. 1 Initial Modes Selection 1 78
9. 9
The k-modes Algorithm (Chaturvedi et al. )
1 78
Contents ix
1 0
9. 1 0The k-probabilities Algorithm
9. 1 1 The k-prototypes Algorithm
9. 1 2Summary
Search-based Clustering Algorithms
1 79
1 81
1 82
1 83
1 0. 1 Genetic Algorithms1 84
1 0. 2 The Tabu Search Method1 85
1 0. 3 Variable Neighborhood Search for Clustering 1 86
1 0. 4 Al-Sultan's Method1 87
1 0. 5 Tabu Searchbased Categorical Clustering Algorithm1 89
1 0. 6 J-means1 90
1 0. 7 GKA 1 92
1 0. 8 The Global k-means Algorithm 1 95
1 0. 9 The Genetic k-modes Algorithm 1 95
1 0. 9. 1 The Selection Operator 1 96
1 0. 9. 2The Mutation Operator 1 96
1 0. 9. 3The k-modes Operator1 97
1 0. 1 0 The Genetic Fuzzy k-modes Algorithm 1 97
1 0. 1 0. 1 String Representation 1 98
1 0. 1 0. 2Initialization Process 1 98
1 0. 1 0. 3Selection Process 1 99
1 0. 1 0. 4Crossover Process1 99
1 0. 1 0. 5Mutation Process 200
1 0. 1 0. 6Termination Criterion 200
1 0. 1 1 SARS
200
1 0. 1 2 Summary 202
1 1 Graph-based Clustering Algorithms 203
1 1 . 1 Chameleon
203
1 1 . 2 CACTUS
204
1 1 . 3 A Dynamic Systembased Approach 205
1 1 . 4 ROCK 207
1 1 . 5 Summary 208
1 2 Grid-based Clustering Algorithms
209
1 2. 1 STING
209
1 2. 2 OptiGrid
21 0
1 2. 3 GRIDCL US
21 2
1 2. 4 GDILC
21 4
1 2. 5 WaveCluster
21 6
1 2. 6 Summary 21 7
1 3 Density-based Clustering Algorithms 21 9
1 3. 1 DB SCAN
21 9
1 3. 2 BRIDGE
221
1 3. 3 DB CLASD
222
Contents
1 4
1 3. 4DENCLUE
1 3. 5CUBN
1 3. 6Summary
Model-based Clustering Algorithms
223
225
226
227
1 4. 1 Introduction
227
1 4. 2 Gaussian Clustering Models
230
1 4. 3
Model-based Agglomerative Hierarchical Clustering 232
1 4. 4 The EM Algorithm
235
1 4. 5 Model-based Clustering
237
1 4. 6 COOLCAT
240
1 4. 7 STUCCO
241
1 4. 8 Summary
242
1 5 Subspace Clustering
243
1 5. 1 CLIQUE
244
1 5. 2 PROCLUS
246
1 5. 3 ORCLUS
249
1 5. 4 ENCLUS
253
1 5. 5 FINDIT
255
1 5. 6 MAFIA
258
1 5. 7 DOC
259
1 5. 8 CLTree 261
1 5. 9 PART
262
1 5. 1 0 SUBCAD 264
1 5. 1 1 Fuzzy Subspace Clustering270
1 5. 1 2 Mean Shift for Subspace Clustering
275
1 5. 1 3 Summary 285
1 6 Miscell aneous Algorithms 287
1 6. 1 Time Series Clustering Algorithms
287
1 6. 2 Streaming Algorithms 289
1 6. 2. 1 LSEARCH 290
1 6. 2. 2Other Streaming Algorithms
293
1 6. 3 Transaction Data Clustering Algorithms
293
1 6. 3. 1 Largeltem 294
1 6. 3. 2CLOPE 295
1 6. 3. 3OAK296
1 6. 4 Summary 297
1 7
Evaluation of Clustering Algorithms 299
1 7. 1 Introduction299
1 7. 1 . 1 Hypothesis Testing301
1 7. 1 . 2External Criteria302
1 7. 1 . 3Internal Criteria 303
1 7. 1 . 4Relative Criteria304
Contentsxi
1 7. 2Evaluation of Partitional Clustering305
1 7. 2. 1 Modified Hubert's P Statistic 305
1 7. 2. 2The Davies-Bouldin Index305
1 7. 2. 3Dunn's Index307
1 7. 2. 4The SD Validity Index307
1 7. 2. 5The S_Dbw Validity Index308
1 7. 2. 6 The RMSSTD Index 309
1 7. 2. 7The RS Index31 0
1 7. 2. 8The Calinski-Harabasz Index31 0
1 7. 2. 9Rand's Index31 1
1 7. 2. 1 0 Average of Compactness31 2
1 7. 2. 1 1 Distances between Partitions31 2
1 7. 3Evaluation of Hierarchical Clustering31 4
1 7. 3. 1 Testing Absence of Structure 31 4
1 7. 3. 2Testing Hierarchical Structures31 5
1 7. 4Validity Indices for Fuzzy Clustering31 5
1 7. 4. 1 The Partition Coefficient Index31 5
1 7. 4. 2The Partition Entropy Index31 6
1 7. 4. 3The Fukuyama-Sugeno Index31 6
1 7. 4. 4Validity Based on Fuzzy Similarity31 7
1 7. 4. 5A Compact and Separate Fuzzy Validity Criterion 31 8
1 7. 4. 6A Partition Separation Index 31 9
1 7. 4. 7An Index Based on the Mini-max Filter Concept and Fuzzy
Theory3 I 9
1 7. 5 Summary
320
III Applications of Clustering321
1 8 Clustering Gene Expression Data323
1 8. 1 Background 323
1 8. 2Applications of Gene Expression Data Clustering324
1 8. 3Types of Gene Expression Data Clustering325
1 8. 4Some Guidelines for Gene Expression Clustering325
1 8. 5Similarity Measures for Gene Expression Data326
1 8. 5. 1 Euclidean Distance326
1 8. 5. 2Pearson's Correlation Coefficient326
1 8. 6 A Case Study
328
1 8. 6. 1 C++ Code 328
1 8. 6. 2Results334
1 8. 7 Summary 334
IV MATLAB and C++ for Clustering341
1 9 Data Clustering in MATLAB343
1 9. 1 Read and Write Data Files343
1 9. 2Handle Categorical Data 347
xii
Contents
1 9. 3 M-files, MEX-files, and MAT-files 349
1 9. 3. 1 M-files 349
1 9. 3. 2MEX-files 351
1 9. 3. 3MAT-files354
1 9. 4 Speed up MATLAB354
1 9. 5Some Clustering Functions 355
1 9. 5. 1 Hierarchical Clustering 355
1 9. 5. 2k-means Clustering359
1 9. 6 Summary 362
20 Clustering in C/C++363
20. 1 The STL 363
20. 1 . 1 The vector Class363
20. 1 . 2The list Class 364
20. 2 C/C++ Program Compilation366
20. 3Data Structure and Implementation367
20. 3. 1 Data Matrices and Centers 367
20. 3. 2Clustering Results 368
20. 3. 3The Quick Sort Algorithm 369
20. 4 Summary 369
ASome Clustering Algorithms371
BThe kd-tree Data Structure
375
C MATLAB Codes
377
C. 1
The MATLAB Code for Generating Subspace Clusters 377
C. 2
The MATLAB Code for the k-modes Algorithm
. 379
C. 3
The MATLAB Code for the MSSC Algorithm 381
D C++ Codes
385
D. 1
The C++ Code for Converting Categorical Values to Integers
385
D. 2
The C++ Code for the FSC Algorithm 388
Bibliography
397
Subject Index
443
Author Index
455

You might also like