
Data Clustering

Statistics seminar

Matt Collinge
4 May 2005
Abstract
I will cover the basic features of data clustering,
focusing on the aspects that are most relevant to data
analysis. I will briefly discuss how clustering can be used
for data compression. I will review the properties of basic
K-means clustering as a data modeling technique and
discuss its shortcomings. Then I will present a series of
improvements to the K-means algorithm, and put it in
context as a maximum likelihood method. I will touch on
some general issues in clustering analysis, such as the
choice of clustering algorithm and how to decide if a
clustering is significant. Along the way, I will discuss some
astrophysical applications of clustering analysis.
Clustering Definition
• Basic idea: grouping together similar objects
• More formally, clusters are “connected
regions of a multi-dimensional space
containing a relatively high density of points,
separated from other such regions by a
region containing a relatively low density of
points”

Caution: The notion of proximity/similarity is problem-dependent.
Motivation for Clustering
• Data compression (lossy)
– Represent a complex signal as an arrangement
of templates
• Data modeling
– Useful for classification
– Outliers may be interesting
– Can reveal structure we otherwise might miss,
especially in high-dimensional data spaces
– Type of machine learning
Clustering Terminology
• Exclusive (hard) or not (soft)
– E.g., grouping people according to age, as
opposed to spoken language
• Extrinsic (supervised) or intrinsic
(unsupervised)
– Pre-labeled data v. unlabeled data
• Hierarchical or partitional
– The type of structure imposed on the data
Vector Quantization (VQ)
• Lossy compression based on block coding, with
applications in image and voice compression
• Maps a vector to a codeword drawn from a
predesigned codebook with the goal of
minimizing distortion
– For K=64 codewords (requiring 6 bits to specify)
used to represent 4-byte (32-bit) floating-point
numbers, compression is about 80% (6 bits vs. 32)
– E.g., a color picture: 2D array of RGB triplets
• How do you generate the codebook?
VQ codebook generation
• Start with a training set
• For N training vectors and K codewords, recommend N/K > 1000
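Codebooks are typically built with a K-means-style (Lloyd/LBG) iteration over the training set, which is the algorithm introduced on the next slides. Below is a minimal sketch using SciPy's vector-quantization utilities, with made-up Gaussian "signal blocks" standing in for real training data; the K=64 and the N/K > 1000 rule of thumb follow the slides, everything else is illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

rng = np.random.default_rng(0)

# Stand-in training set: 100,000 four-dimensional "signal blocks" (made-up data),
# so that N/K = 100,000/64 ~ 1600 > 1000, as recommended above.
training = rng.normal(size=(100_000, 4))

# Build a K = 64 codebook by clustering the training vectors
# (a K-means / Lloyd-style iteration, as described on the following slides).
codebook, _ = kmeans2(training, 64, minit='points', seed=1)

# Encode: each vector is replaced by the index of its nearest codeword
# (6 bits for K = 64); decoding is just a table lookup into the codebook.
data = rng.normal(size=(1_000, 4))
indices, distortion = vq(data, codebook)
reconstructed = codebook[indices]
```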
K-means clustering
• Unsupervised machine learning
• Put N data points in an I-dimensional space
into K clusters
– Clusters parametrized by their means: m(k)
– Data: x(n) (assumed to be real)
– (Each m and x is an I-dimensional vector)
• Need a distance metric, such as the squared Euclidean distance:
  d(x, m) = ½ Σ_i (m_i − x_i)²
K-means algorithm
• A ‘competitive learning’ algorithm
• Assign each data point to the nearest
mean (which may be randomly initialized).
– Equivalently, determine ‘responsibilities’:
  r_k(n) = 1 if m(k) is the closest mean to x(n), 0 otherwise
• Update the means to reflect responsibilities:
  m(k) = Σ_n r_k(n) x(n) / R(k)
– where R(k) = Σ_n r_k(n) is the total responsibility of mean k
• Means with no responsibilities stay put.
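A minimal sketch of the algorithm just described, assuming the squared-Euclidean distance and NumPy arrays; initialization at randomly chosen data points and the fixed iteration count are illustrative choices, not prescriptions from the talk.

```python
import numpy as np

def kmeans(x, K, n_iter=50, seed=0):
    """Hard K-means: x has shape (N, I); returns means (K, I) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    # Initialize the means, e.g. at K randomly chosen data points.
    means = x[rng.choice(len(x), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: give each point responsibility 1 for its nearest mean.
        d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        r = d2.argmin(axis=1)
        # Update step: each mean moves to the centroid of the points it owns;
        # means with no responsibilities stay put.
        for k in range(K):
            if np.any(r == k):
                means[k] = x[r == k].mean(axis=0)
    return means, r
```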


Example: K=2
Example: K=4 (take 1)
Example: K=4 (take 2)
K-means: Questions
• Does it always converge? Yes.
• To the same answer? No. Depends on the initial conditions (I.C.).
• What is the right choice for the distance
metric? Depends on the problem.
• What value of K should we choose?
• How do we choose among multiple
clustering solutions?
• What if the clusters look different?
• What about outliers/borderline cases?
Examples of Misbehavior
Soft K-means: A simple upgrade
• New definition of responsibilities:
  r_k(n) = exp(−β d(m(k), x(n))) / Σ_k' exp(−β d(m(k'), x(n)))
– β is the ‘stiffness’; it has an associated length scale σ ≡ 1/√β
• Update step looks the same, but is it?
  m(k) = Σ_n r_k(n) x(n) / R(k)
– where R(k) = Σ_n r_k(n), now a sum of fractional responsibilities
This is mixture density modeling
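For concreteness, here is a sketch of one soft K-means (version 1) iteration as defined above, assuming the squared-Euclidean distance d(m, x) = ½|m − x|²; it follows the MacKay-style update, not any particular implementation from the talk.

```python
import numpy as np

def soft_kmeans_step(x, means, beta):
    """One soft K-means (version 1) iteration.

    x: data, shape (N, I); means: shape (K, I); beta: stiffness (sigma = 1/sqrt(beta)).
    """
    # Soft responsibilities: r[n, k] proportional to exp(-beta * d(m(k), x(n))),
    # normalized so that each row sums to 1.
    d = 0.5 * ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    r = np.exp(-beta * d)
    r /= r.sum(axis=1, keepdims=True)
    # Update: each mean is the responsibility-weighted average of all the data.
    R = r.sum(axis=0)                      # total responsibility of each mean
    new_means = (r.T @ x) / R[:, None]
    return new_means, r
```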


Example: Soft K-means with K=4
Soft K-means: irresponsible
means are still competitive
Soft K-means v2.0

• New parameter: π_k is the cluster ‘weight’; now we also have varying σ_k
• We have to update
  σ_k² = Σ_n r_k(n) |x(n) − m(k)|² / (I R(k))
and
  π_k = R(k) / Σ_k' R(k')
This is the maximum-likelihood (ML) algorithm for spherical Gaussians
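A sketch of the corresponding update, written as a single EM iteration for a mixture of K spherical Gaussians; the σ_k and π_k formulas are the standard maximum-likelihood updates, and the log-space responsibilities are just for numerical stability.

```python
import numpy as np

def spherical_em_step(x, means, sigma, pi):
    """One update of soft K-means v2.0, i.e. EM for K spherical Gaussians.

    x: (N, I) data; means: (K, I); sigma: (K,) widths; pi: (K,) mixing weights.
    """
    N, I = x.shape
    # Responsibilities: posterior probability of each cluster for each point.
    d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)        # (N, K)
    log_r = np.log(pi) - I * np.log(sigma) - d2 / (2.0 * sigma ** 2)
    log_r -= log_r.max(axis=1, keepdims=True)                          # stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # Updates: weighted means, per-cluster width, and mixing weights.
    R = r.sum(axis=0)
    means = (r.T @ x) / R[:, None]
    d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    sigma = np.sqrt((r * d2).sum(axis=0) / (I * R))
    pi = R / N
    return means, sigma, pi
```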


Soft K-means v3.0
• Why stop there?

• Axis-aligned Gaussians: each cluster now has its own width σ_k,i in each dimension i.
Soft K-means v3.0 in action: K=2

Hooray!
K=4? Uh-oh.

• Small cluster has formed. v2.0 and v3.0 are pathological in this respect.
• This is overfitting, a fatal flaw of ML.
AutoClass
• K-means clustering implementation
• Supplemented with Bayesian maximum a posteriori (MAP)
method
• First tested on IRAS Low Resolution Spectral Atlas
– 5425 spectra with 100 usable channels between 7-24 microns
– “Our very first attempts to apply AutoClass to the spectral data did
not produce very good results…”
• Results
– 77 classes, according to variation of (subtle) spectral features,
significantly different from previous classifications
– Example: objects with differences in silicate emission feature later
shown to have different average Galactic coordinates
– Revealed some calibration problems
• Was it worth it? Maybe.
Clustering Tendency
• Can we tell if a data set is ripe for clustering
analysis?
– Test against alternatives: spatial randomness,
regularity
– Requires knowledge of sampling window
• Example tests
– Number of points in uniform cells, vs Poisson or
uniform
– Distribution of point-to-point distances, nearest
neighbor distances
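As an illustration of the second kind of test, here is a sketch of a Monte-Carlo comparison of the mean nearest-neighbor distance against complete spatial randomness; the unit-square sampling window and the choice of statistic are illustrative assumptions, not choices from the talk.

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nn_distance(points):
    """Mean nearest-neighbor distance of a point set."""
    d, _ = cKDTree(points).query(points, k=2)   # column 1 is the nearest neighbor
    return d[:, 1].mean()

def nn_tendency_test(points, n_mc=999, seed=0):
    """Compare the observed statistic to Poisson (random) realizations
    drawn in the same window (here: the unit square, for illustration)."""
    rng = np.random.default_rng(seed)
    observed = mean_nn_distance(points)
    null = np.array([mean_nn_distance(rng.random(points.shape))
                     for _ in range(n_mc)])
    # Fraction of random realizations at least as 'clustered' (small distances)
    # as the data; a small value suggests clustering, a large one regularity.
    return observed, (null <= observed).mean()
```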
Cluster Validity
How many clusters are in the data?
• Multiple trials with different I.C. (but no
guarantees)
• Guard against overlapping and overfitting; test the
statistical separation of nearby clusters
Is a given clustering an acceptable
representation? Which of two clusterings is
preferable?
• Same old question. Test goodness of fit,
perform ML, . . .
Friends-of-Friends
• Cluster-finding algorithm often employed in
analyzing cosmological simulations
• Two basic parameters
– Minimum number of particles N_min
– Linking length h_link, with corresponding overdensity δ_min
• All pairs with separation < h_link are linked; all
mutually linked groups with N > N_min are clusters
• Problems
– No information about hierarchy/substructure
– No single value of δmin sufficient to resolve all structures
of interest
– Filaments can create artificial links
– Small clusters may not be bound
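A minimal sketch of the linking step just described, using a KD-tree for the pair search and connected components for the mutually linked groups; the names h_link and n_min mirror the slide's parameters, and periodic boundaries (usual in cosmological boxes) are ignored here.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def friends_of_friends(positions, h_link, n_min):
    """Label particles by FoF group; particles not in a group get label -1."""
    n = len(positions)
    # Find all pairs of particles separated by less than (about) h_link.
    pairs = cKDTree(positions).query_pairs(r=h_link, output_type='ndarray')
    adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                     shape=(n, n))
    # Mutually linked sets of particles = connected components of the link graph.
    n_groups, labels = connected_components(adj, directed=False)
    # Keep only groups with at least n_min members.
    counts = np.bincount(labels, minlength=n_groups)
    return np.where(counts[labels] >= n_min, labels, -1)
```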
Nonparametric Bayes Classifier (NBC)
• Basic idea: use training sets, plus prior
expectations, to classify data
– Our case: classify astronomical point sources as ‘stars’ or ‘quasars’
• Probability of x given class k (the class-conditional density): p(x|C_k)
– Could represent pdf as a histogram derived from the training set
– Instead, use ‘kernel density estimate’ of the pdf: analogous, but
more sophisticated
• Incorporate prior information via Bayes’ Rule:
  P(C_k|x) = p(x|C_k) P(C_k) / Σ_j p(x|C_j) P(C_j)
• Two classes, so if P(C_1|x) > 0.5, x is assigned to class C_1
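Here is a sketch of this two-class decision rule, using SciPy's gaussian_kde as a stand-in for the fast kernel density estimates described on the next slide; the function and argument names (classify, prior_quasar, ...) are illustrative, not from Richards et al.

```python
import numpy as np
from scipy.stats import gaussian_kde

def classify(test, train_star, train_quasar, prior_quasar):
    """Two-class NBC decision: returns True where a test object is called a quasar.

    Colour arrays have shape (I, N): one row per colour (u-g, g-r, r-i, i-z),
    one column per object, as gaussian_kde expects.
    """
    # Kernel density estimates of the class-conditional densities p(x | C_k).
    p_star = gaussian_kde(train_star)(test)
    p_quasar = gaussian_kde(train_quasar)(test)
    # Bayes' rule: posterior probability of the quasar class.
    post_quasar = (prior_quasar * p_quasar /
                   (prior_quasar * p_quasar + (1.0 - prior_quasar) * p_star))
    return post_quasar > 0.5, post_quasar
```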
Kernel Density Estimation
• Nonparametric estimator:
  p̂(x) = (1/N) Σ_n K_h(z_n)
– K_h(z) is the kernel function (normalized)
– h is the bandwidth; h = σ² for a Gaussian kernel
– z is the distance from a point in the test set to a point in the training set
• Naively, N kernel estimations is an N² computation; for
large data sets, fast algorithms are needed
– Can use tricks learned from N-body, e.g., trees
– Sufficient to bound the density for each class, and compute until the
bounds separate
• Choice of bandwidth is critical (analogy: choice of
histogram bin size)
– Goal: minimize difference between true underlying pdf and kernel
density estimate
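For reference, a naive O(N²) implementation of the Gaussian-kernel estimator above, which makes the role of the bandwidth h explicit; the tree-based accelerations are not shown.

```python
import numpy as np

def kde_gaussian_naive(test, train, h):
    """Naive Gaussian-kernel density estimate at each test point.

    test: (M, I), train: (N, I); h is the bandwidth (the kernel variance,
    h = sigma^2, as in the slide above).
    """
    M, I = test.shape
    N = train.shape[0]
    # All pairwise squared distances: this is the O(M*N) cost that the fast
    # tree-based methods are designed to avoid.
    d2 = ((test[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    # Normalized I-dimensional Gaussian kernel with variance h per dimension.
    kernel = np.exp(-d2 / (2.0 * h)) / (2.0 * np.pi * h) ** (I / 2.0)
    return kernel.sum(axis=1) / N
```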
Input and Output
• Training sets: 468,149 stars and 16,713 quasars
• Represent objects by u-g, g-r, r-i, i-z colors
• Bayesian prior P(C1)=0.88
• Test set: 831,600 UVX (u-g<1) point sources
– 113,674 (13.7%) quasars
– 717,926 (86.3%) stars
• Contamination remains, due to overlap of star and
quasar training sets
– Exclude “quasars” for which KDE stellar density > 0.01
– 100,563 quasars
– Estimated 95% efficiency
Recap
• Clustering can be used for data
compression
• Different application, different approach:
supervised classification, unsupervised
modeling
– Many choices of proximity measure
– Many choices of clustering algorithm
• Useful for exploratory data analysis
• Ultimately, one of a battery of tools to find
patterns in data
References
Cheeseman, P., & Stutz, J. 1996, in Advances in Knowledge Discovery and Data
Mining, eds. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy
(AAAI Press / MIT Press)
  - description of AutoClass
Jain, A. K., & Dubes, R. C. 1988, Algorithms for Clustering Data
(Englewood Cliffs: Prentice Hall)
  - from a computer science perspective
MacKay, D. J. C. 2003, Information Theory, Inference, and Learning Algorithms
(Cambridge: Cambridge University Press)
  - Chapters 20 and 22
Richards, G. T., et al. 2004, astro-ph/0408505
  - photometric selection of quasars from SDSS
Cluster this, Clusterman
