Clustering

Data Mining:
Concepts and Techniques

Chapter 7
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj
© 2006 Jiawei Han and Micheline Kamber. All rights reserved.
March 28, 2014 Data Mining: Concepts and Techniques 1
Chapter 7. Cluster Analysis

1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
Cluster: a collection of data objects
    Similar to one another within the same cluster
    Dissimilar to the objects in other clusters
Cluster analysis
    Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Clustering: Rich Applications and Multidisciplinary Efforts

Pattern Recognition
Spatial Data Analysis
    Create thematic maps in GIS by clustering feature spaces
    Detect spatial clusters or support other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
    Document classification
    Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters with
    high intra-class similarity
    low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

Properties
    $d(i,j) \ge 0$
    $d(i,i) = 0$
    $d(i,j) = d(j,i)$
    $d(i,j) \le d(i,k) + d(k,j)$
Also, one can use weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures
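As a rough illustration of the Euclidean distance and its properties, the following Python sketch computes d for three small vectors and checks symmetry and the triangle inequality. The function name `euclidean` and the sample objects `i`, `j`, `k` are assumptions made for this example, not part of the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical objects with p = 2 attributes, chosen only for illustration.
i, j, k = (1.0, 2.0), (4.0, 6.0), (0.0, 0.0)

print(euclidean(i, j))                      # sqrt(3^2 + 4^2) = 5.0
print(euclidean(i, i) == 0.0)               # d(i,i) = 0
print(euclidean(i, j) == euclidean(j, i))   # symmetry
print(euclidean(i, j) <= euclidean(i, k) + euclidean(k, j))  # triangle inequality
```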
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:

1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no assignment changes
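The four steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the names `kmeans` and `sq_dist`, the round-robin initial partition, the `max_iter` cap, and the rule that an emptied cluster keeps its previous centroid are all choices made for this example.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance (enough for nearest-centroid comparisons)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, max_iter=100, seed=0):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of objects")
    rng = random.Random(seed)

    # Step 1: partition the objects into k nonempty subsets
    # (assumption: shuffle, then deal objects round-robin into clusters).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for pos, n in enumerate(order):
        labels[n] = pos % k

    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))

        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]

        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: two visually separate groups of 2-D objects.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which cluster receives label 0 depends on the random initial partition, so only the grouping, not the label values, is meaningful.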
The K-Means Clustering Method
Example (figure omitted: scatter plots on 0 to 10 axes, K = 2)
    Arbitrarily choose K objects as initial cluster centers
    Assign each object to the most similar center
    Update the cluster means
    Reassign
    Update the cluster means
    Reassign
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers!
    An object with an extremely large value may substantially distort the distribution of the data.
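The sensitivity to outliers follows from the centroid being a mean. The short Python sketch below uses made-up coordinates (an assumption for illustration only) to show how a single extreme object pulls a cluster centroid away from the other members.

```python
def centroid(points):
    """Mean point of a nonempty list of equal-length numeric vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

# Hypothetical compact cluster, then the same cluster plus one extreme object.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
with_outlier = cluster + [(100.0, 100.0)]

print(centroid(cluster))       # (1.5, 1.5)
print(centroid(with_outlier))  # (21.2, 21.2)
```

The centroid of the second list lies far from every one of the four original objects, so nearest-centroid assignment in the next iteration can change for many of them.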
What Is Outlier Discovery?
What are outliers?
    A set of objects that are considerably dissimilar from the remainder of the data
    Example: Sports: Michael Jordan, Wayne Gretzky, ...
Problem: Define and find outliers in large data sets
Applications:
    Credit card fraud detection
    Telecom fraud detection
    Customer segmentation
    Medical analysis
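One simple way to make "considerably dissimilar" concrete is a distance-based rule: flag an object if too few other objects lie within a given radius of it. The Python sketch below is only an illustration under that assumption; the function name `distance_based_outliers` and the parameters `radius` and `min_neighbors` are hypothetical, and the quadratic scan would not scale to large data sets as written.

```python
import math

def distance_based_outliers(points, radius, min_neighbors):
    """Return the objects with fewer than min_neighbors other objects within radius."""
    outliers = []
    for a_idx, a in enumerate(points):
        neighbors = sum(
            1
            for b_idx, b in enumerate(points)
            if b_idx != a_idx and math.dist(a, b) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(a)
    return outliers

# Hypothetical data: a tight group of four objects and one distant object.
data = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
print(distance_based_outliers(data, radius=2.0, min_neighbors=2))  # [(10, 10)]
```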
Summary
Cluster analysis groups objects based on their similarity and has wide applications
Measures of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
There are still many open research issues in cluster analysis