Statistics seminar
Matt Collinge
4 May 2005
Abstract
I will cover the basic features of data clustering,
focusing on the aspects that are most relevant to data
analysis. I will briefly discuss how clustering can be used
for data compression. I will review the properties of basic
K-means clustering as a data modeling technique and
discuss its shortcomings. Then I will present a series of
improvements to the K-means algorithm, and put it in
context as a maximum likelihood method. I will touch on
some general issues in clustering analysis, such as the
choice of clustering algorithm and how to decide if a
clustering is significant. Along the way, I will discuss some
astrophysical applications of clustering analysis.
Clustering Definition
• Basic idea: grouping together similar objects
• More formally, clusters are “connected
regions of a multi-dimensional space
containing a relatively high density of points,
separated from other such regions by a
region containing a relatively low density of
points”
• Axis-aligned Gaussians.
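To make the axis-aligned Gaussian idea concrete, here is a minimal sketch of soft K-means in which every cluster keeps its own variance along each coordinate axis. It assumes that "v3.0" means the usual EM iteration for a mixture of axis-aligned Gaussians; the slides' own equations did not survive, so this is an illustration and not the speaker's exact formulation. The function name, the `init_means` option, and the `min_var` floor are hypothetical choices made for this example.

```python
import math
import random


def soft_kmeans_axis_aligned(points, k, n_iter=50, seed=0,
                             init_means=None, min_var=1e-6):
    """Fit k axis-aligned Gaussians to `points` (a list of d-vectors) with EM."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Initial means: supplied by the caller, or k randomly chosen data points.
    if init_means is not None:
        means = [list(m) for m in init_means]
    else:
        means = [list(p) for p in rng.sample(points, k)]
    # Every cluster starts with the overall per-axis variance of the data.
    centre = [sum(p[j] for p in points) / n for j in range(d)]
    var0 = [max(sum((p[j] - centre[j]) ** 2 for p in points) / n, min_var)
            for j in range(d)]
    variances = [list(var0) for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []

    for _ in range(n_iter):
        # Assignment step: soft responsibility of each cluster for each point,
        # computed from log densities for numerical stability.
        resp = []
        for p in points:
            log_terms = []
            for c in range(k):
                lt = math.log(weights[c])
                for j in range(d):
                    v = variances[c][j]
                    lt -= 0.5 * math.log(2 * math.pi * v)
                    lt -= (p[j] - means[c][j]) ** 2 / (2 * v)
                log_terms.append(lt)
            top = max(log_terms)
            unnorm = [math.exp(lt - top) for lt in log_terms]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])

        # Update step: responsibility-weighted mean and per-axis variance.
        for c in range(k):
            r_c = sum(r[c] for r in resp)
            if r_c < 1e-12:
                continue  # assumption: leave an empty cluster unchanged
            weights[c] = r_c / n
            for j in range(d):
                m = sum(r[c] * p[j] for r, p in zip(resp, points)) / r_c
                v = sum(r[c] * (p[j] - m) ** 2
                        for r, p in zip(resp, points)) / r_c
                means[c][j] = m
                # Assumption: floor the variance so that a cluster cannot
                # collapse onto a single point.
                variances[c][j] = max(v, min_var)

    return means, variances, weights, resp
```

Calling `soft_kmeans_axis_aligned(points, 2)` returns the fitted means, per-axis variances, mixing weights, and the responsibility of each cluster for each point. The variance floor is a pragmatic guard: without it, a maximum likelihood fit can drive a cluster's variance toward zero around a single data point.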
Soft K-means v3.0 in action: K=2
Hooray!
K=4? Uh-oh.