Clustering
Unsupervised classification, that is, without
the class attribute
Want to discover the classes
Hierarchical vs. Partitional
[Figure: scatter plots of points labeled 1 and 2; panel labels: Complete-Link, Seeds]
Data Mining and Knowledge
Discovery 13
Assign Instances to Clusters
Cobweb
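\[ CU(C_1, C_2, \ldots, C_k) = \frac{1}{k} \sum_{l} \Pr[C_l] \sum_{i} \sum_{j} \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right) \]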
Without k it would always be best for each
instance to have its own cluster, overfitting!
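The role of the 1/k factor can be checked numerically. The following is a minimal sketch, assuming the standard category utility for nominal attributes; the function name `category_utility` and the tuple-based data layout are illustrative choices, not taken from the slides. It compares a sensible two-cluster partition of three weather instances with the partition that gives every instance its own cluster.

```python
from collections import Counter


def category_utility(clusters):
    """Category utility of a partition of nominal instances.

    clusters: list of clusters, each a list of equal-length tuples.
    Computes (1/k) * sum_l Pr[C_l] * sum_i sum_j
             (Pr[a_i = v_ij | C_l]^2 - Pr[a_i = v_ij]^2).
    """
    instances = [x for cluster in clusters for x in cluster]
    n, k = len(instances), len(clusters)
    n_attrs = len(instances[0])

    def sum_squared_probs(group):
        # Sum over attributes i and values j of Pr[a_i = v_ij]^2 within group.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(x[i] for x in group)
            total += sum((c / len(group)) ** 2 for c in counts.values())
        return total

    baseline = sum_squared_probs(instances)
    weighted = sum(len(c) / n * (sum_squared_probs(c) - baseline) for c in clusters)
    return weighted / k


# Three instances from the weather data (Play attribute removed).
a = ("Sunny", "Hot", "High", "FALSE")
e = ("Rainy", "Cool", "Normal", "FALSE")
f = ("Rainy", "Cool", "Normal", "TRUE")

cu_grouped = category_utility([[a], [e, f]])
cu_singletons = category_utility([[a], [e], [f]])
print(cu_grouped, cu_singletons)
```

With the division by k, grouping the two similar instances scores higher than three singleton clusters. Multiplying each result by its own k (2 and 3 respectively) reverses the order, which is the overfitting effect the slide warns about.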
The Weather Problem
Outlook Temp. Humidity Windy Play
Sunny Hot High FALSE No
Sunny Hot High TRUE No
Overcast Hot High FALSE Yes
Rainy Mild High FALSE Yes
Rainy Cool Normal FALSE Yes
Rainy Cool Normal TRUE No
Overcast Cool Normal TRUE Yes
Sunny Mild High FALSE No
Sunny Cool Normal FALSE Yes
Rainy Mild Normal FALSE Yes
Sunny Mild Normal TRUE Yes
Overcast Mild High TRUE Yes
Overcast Hot Normal FALSE Yes
Rainy Mild High TRUE No
Weather Data (without Play)
Label instances: a, b, ..., n
Start by putting the first instance in its own cluster
Add another instance in its own cluster
[Figure: candidate trees over instances a, b, c; label: Highest utility]
[Figure: tree with leaves a, b, c, d and a node containing e and f]
Look at the instances:
Rainy Cool Normal FALSE
Rainy Cool Normal TRUE
Quite similar!
[Figure: tree with leaves a, b, c, d and a node containing e, f, g]
Merged into a single cluster before h is added
[Figure: tree over instances a through h]
[Figure: tree over instances a through n]
What next?
What do a, b, c, d, h, k, and l have in common?
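\[ CU(C_1, C_2, \ldots, C_k) = \frac{1}{k} \sum_{l} \Pr[C_l] \frac{1}{2\sqrt{\pi}} \sum_{i} \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right) \]
where \sigma_{il} is the standard deviation of attribute i in cluster l and \sigma_i is its standard deviation over all instances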
Problems with zero variance!
The acuity parameter imposes a minimum
variance
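To see why zero variance is a problem, note that a cluster whose instances share the same value for an attribute has a standard deviation of 0, so 1/sigma is undefined. The sketch below assumes the standard numeric form of category utility and applies the acuity as a floor on every standard deviation; the function name `numeric_category_utility` and the example values are hypothetical.

```python
import math
import statistics


def numeric_category_utility(clusters, acuity):
    """Category utility for numeric attributes with an acuity floor.

    clusters: list of clusters, each a list of equal-length numeric tuples.
    acuity:   minimum standard deviation (must be > 0).
    """
    instances = [x for cluster in clusters for x in cluster]
    n, k = len(instances), len(clusters)
    n_attrs = len(instances[0])

    def inv_sigma_sum(group):
        total = 0.0
        for i in range(n_attrs):
            sigma = statistics.pstdev([x[i] for x in group])
            # Acuity imposes a minimum standard deviation, avoiding 1/0.
            total += 1.0 / max(sigma, acuity)
        return total

    baseline = inv_sigma_sum(instances)
    weighted = sum(len(c) / n * (inv_sigma_sum(c) - baseline) for c in clusters)
    return weighted / (2.0 * math.sqrt(math.pi) * k)


# The first cluster has zero variance; acuity keeps the result finite.
cu = numeric_category_utility([[(1.0,), (1.0,)], [(5.0,), (7.0,)]], acuity=0.5)
print(cu)
```

Without the `max(sigma, acuity)` floor, the first cluster would raise a division-by-zero error, and any single-instance cluster would do the same.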
Given some data, how can you determine the parameters:
$\mu_A$: mean for cluster A
$\sigma_A$: standard deviation for cluster A
$\mu_B$: mean for cluster B
$\sigma_B$: standard deviation for cluster B
$p_A$: probability of being in cluster A
Problems
If we knew which instance came from each
cluster we could estimate these values
If we knew the parameters we could calculate
the probability that an instance belongs to
each cluster
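\[ \Pr[A \mid x] = \frac{\Pr[x \mid A] \, \Pr[A]}{\Pr[x]} = \frac{f(x; \mu_A, \sigma_A) \, p_A}{\Pr[x]} \]
\[ f(x; \mu_A, \sigma_A) = \frac{1}{\sqrt{2\pi} \, \sigma_A} \, e^{-\frac{(x - \mu_A)^2}{2 \sigma_A^2}} \]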
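As a minimal sketch of this calculation for a two-cluster mixture, the code below evaluates the normal density and the posterior probability of cluster A. It assumes that Pr[x] is the sum of the two weighted densities and that cluster B has prior 1 - p_A; the function names are illustrative.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density f(x; mu, sigma)."""
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2.0 * math.pi) * sigma)


def posterior_a(x, mu_a, sigma_a, mu_b, sigma_b, p_a):
    """Pr[A | x] for a mixture of two normal clusters A and B."""
    weighted_a = normal_pdf(x, mu_a, sigma_a) * p_a
    weighted_b = normal_pdf(x, mu_b, sigma_b) * (1.0 - p_a)
    # Pr[x] is the sum over both clusters, so the posteriors sum to 1.
    return weighted_a / (weighted_a + weighted_b)


# A point halfway between two equally likely, equally wide clusters.
print(posterior_a(0.0, mu_a=-1.0, sigma_a=1.0, mu_b=1.0, sigma_b=1.0, p_a=0.5))
```

For the symmetric case in the last line the posterior is 0.5; moving x toward either mean shifts the probability toward that cluster.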
EM Algorithm
Expectation Maximization (EM)
Start with initial values for the parameters
Calculate the cluster probabilities for each instance
Re-estimate the values for the parameters
Repeat
General purpose maximum likelihood
estimate algorithm for missing data
Can also be used to train Bayesian networks
(later)
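The four steps can be sketched for the two-cluster, one-attribute normal mixture as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function names, the fixed iteration count, and the `min_sigma` floor (added to avoid a collapsing variance) are all hypothetical choices, and the data values are made up for the example.

```python
import math


def normal_pdf(x, mu, sigma):
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2.0 * math.pi) * sigma)


def em_two_gaussians(data, mu_a, sigma_a, mu_b, sigma_b, p_a,
                     n_iter=50, min_sigma=1e-3):
    """EM for a mixture of two 1-D normal clusters A and B."""
    n = len(data)
    for _ in range(n_iter):
        # Expectation: probability that each instance belongs to cluster A.
        w = []
        for x in data:
            wa = normal_pdf(x, mu_a, sigma_a) * p_a
            wb = normal_pdf(x, mu_b, sigma_b) * (1.0 - p_a)
            w.append(wa / (wa + wb))
        # Maximization: re-estimate parameters using the weights.
        # Assumes neither cluster ends up with total weight 0.
        sum_a = sum(w)
        sum_b = n - sum_a
        mu_a = sum(wi * x for wi, x in zip(w, data)) / sum_a
        mu_b = sum((1.0 - wi) * x for wi, x in zip(w, data)) / sum_b
        var_a = sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / sum_a
        var_b = sum((1.0 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data)) / sum_b
        sigma_a = max(math.sqrt(var_a), min_sigma)
        sigma_b = max(math.sqrt(var_b), min_sigma)
        p_a = sum_a / n
    return mu_a, sigma_a, mu_b, sigma_b, p_a


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mu_a, sigma_a, mu_b, sigma_b, p_a = em_two_gaussians(
    data, mu_a=0.0, sigma_a=1.0, mu_b=6.0, sigma_b=1.0, p_a=0.5)
print(mu_a, sigma_a, mu_b, sigma_b, p_a)
```

On this well-separated data the estimated means settle near 1 and 5 with p_A near 0.5. In practice the loop would usually stop when the log-likelihood stops improving instead of after a fixed number of iterations.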
Beyond Normal Models
More than one class:
Straightforward
More than one numeric attribute
Easy if assume attributes independent
If dependent attributes, treat them jointly
using the bivariate normal
Nominal attributes
No more normal distribution!
STEP 3: Prune
Generating Item Sets
How do we generate minimum coverage item
sets in a scalable manner?
Total number of item sets is huge
Grows exponentially in the number of attributes
Need an efficient algorithm:
Start by generating minimum coverage 1-item sets
Use those to generate 2-item sets, etc
Why do we only need to consider minimum
coverage 1-item sets?
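The level-wise idea works because coverage can only shrink as items are added: if a 1-item set falls below the minimum coverage, every item set containing it does too. The sketch below is an Apriori-style search over the weather data written under that assumption; the function name `frequent_item_sets`, the `attribute=value` string encoding, and the minimum coverage of 2 are illustrative choices, not taken from the slides.

```python
from itertools import combinations

WEATHER = """\
Sunny Hot High FALSE No
Sunny Hot High TRUE No
Overcast Hot High FALSE Yes
Rainy Mild High FALSE Yes
Rainy Cool Normal FALSE Yes
Rainy Cool Normal TRUE No
Overcast Cool Normal TRUE Yes
Sunny Mild High FALSE No
Sunny Cool Normal FALSE Yes
Rainy Mild Normal FALSE Yes
Sunny Mild Normal TRUE Yes
Overcast Mild High TRUE Yes
Overcast Hot Normal FALSE Yes
Rainy Mild High TRUE No
"""
ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
transactions = [
    frozenset(f"{a}={v}" for a, v in zip(ATTRS, line.split()))
    for line in WEATHER.strip().splitlines()
]


def frequent_item_sets(transactions, min_coverage):
    """Return {item_set: coverage} for all sets meeting min_coverage."""
    def coverage(item_set):
        return sum(1 for t in transactions if item_set <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if coverage(frozenset([i])) >= min_coverage}
    result = {}
    while current:
        for s in current:
            result[s] = coverage(s)
        # Join frequent k-item sets into candidate (k+1)-item sets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        # Keep a candidate only if all of its k-item subsets are frequent
        # and it meets the minimum coverage itself.
        current = {c for c in candidates
                   if all(frozenset(sub) in current
                          for sub in combinations(c, len(c) - 1))
                   and coverage(c) >= min_coverage}
    return result


sets = frequent_item_sets(transactions, min_coverage=2)
one_item_sets = [s for s in sets if len(s) == 1]
print(len(one_item_sets), sets[frozenset({"windy=FALSE", "play=No"})])
```

The subset check in the pruning step avoids counting coverage for candidates that cannot possibly qualify, which is where the saving over brute-force enumeration comes from.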
If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high
Meets min. coverage and accuracy
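The two rules can be checked against the weather data directly. The sketch below assumes the usual definitions: coverage is the number of instances matching both antecedent and consequent, and accuracy is that number divided by the number of instances matching the antecedent. The function name `rule_stats` is illustrative, and the minimum thresholds are left to the caller because the slides do not state them.

```python
WEATHER = """\
Sunny Hot High FALSE No
Sunny Hot High TRUE No
Overcast Hot High FALSE Yes
Rainy Mild High FALSE Yes
Rainy Cool Normal FALSE Yes
Rainy Cool Normal TRUE No
Overcast Cool Normal TRUE Yes
Sunny Mild High FALSE No
Sunny Cool Normal FALSE Yes
Rainy Mild Normal FALSE Yes
Sunny Mild Normal TRUE Yes
Overcast Mild High TRUE Yes
Overcast Hot Normal FALSE Yes
Rainy Mild High TRUE No
"""
ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [dict(zip(ATTRS, line.split())) for line in WEATHER.strip().splitlines()]


def rule_stats(rows, antecedent, consequent):
    """Return (coverage, accuracy) of the rule: if antecedent then consequent."""
    def matches(row, conditions):
        return all(row[attr] == value for attr, value in conditions.items())

    matching_if = [r for r in rows if matches(r, antecedent)]
    matching_both = [r for r in matching_if if matches(r, consequent)]
    coverage = len(matching_both)
    accuracy = coverage / len(matching_if) if matching_if else 0.0
    return coverage, accuracy


antecedent = {"windy": "FALSE", "play": "No"}
print(rule_stats(rows, antecedent, {"outlook": "Sunny"}))
print(rule_stats(rows, antecedent, {"humidity": "High"}))
```

Both rules cover the same two instances (the first and eighth rows) and are correct on both of them, so each has coverage 2 and accuracy 1.0.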
FP-tree header table (item, head of node links): F, C, A, B, M, P
[Figure: FP-tree with nodes F:4, C:3, A:3, M:2, P:2, M:1, B:1 (three nodes), C:1, P:1]
Frequent Pattern (P:3)
Paths:
<F:4, C:3, A:3, M:2, P:2> occurs twice
<C:1, B:1, P:1> occurs once
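The count P:3 follows from the two paths: each path contributes the count stored at its P node. As a small sketch, the code below takes the prefix of each path (the items above P) together with that count and tallies how often each item occurs with P. The variable and function names are illustrative, and the paths are copied from the listing above, not derived from a transaction database.

```python
from collections import Counter

# Prefix paths leading to P, each with the count of its P node.
paths_to_p = [
    (["F", "C", "A", "M"], 2),  # from <F:4, C:3, A:3, M:2, P:2>
    (["C", "B"], 1),            # from <C:1, B:1, P:1>
]


def conditional_item_counts(prefix_paths):
    """Count how often each item occurs together with the suffix item."""
    counts = Counter()
    for items, count in prefix_paths:
        for item in items:
            counts[item] += count
    return counts


support_p = sum(count for _, count in paths_to_p)
counts = conditional_item_counts(paths_to_p)
print(support_p, dict(counts))
```

With a hypothetical minimum support of 3, only C would remain frequent in combination with P, since it is the only item that appears on both paths.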
Rule Generation
Mining complete set of association rules
has some problems
May be a large number of frequent item
sets
May be a huge number of association rules
Output: EA:2
Mining with Taxonomies
Taxonomy:
Clothes Footwear
Support determined by:    Counting    Intersecting    Counting      Intersecting
Algorithm:                Apriori*    Partition       FP-Growth*    Eclat
Apriori-like algorithms:  AprioriTID, DIC
No algorithm dominates the others!
* Have discussed
Applications
Market basket analysis
Classic marketing application
Classification
One single item
consequent