Clustering
With your host, the self-appointed King of Clustering
Kai Larsen
Cluster Analysis
Source: http://www.vias.org/science_cartoons/cluster_analysis.html
http://www.abdn.ac.uk/zoologymuseum/images/kingdoms.jpg
[Figure: Writing Skills and Salary, with English Majors and Business Majors as groups]
Unsupervised Classification
Training Data
case 1: inputs, ?
case 2: inputs, ?
case 3: inputs, ?
case 4: inputs, ?
case 5: inputs, ?
new case
Training Data
case 1: inputs, cluster 1
case 2: inputs, cluster 3
case 3: inputs, cluster 2
case 4: inputs, cluster 1
case 5: inputs, cluster 2
new case
# of classes is unknown
Description
For example, segmenting existing customers into groups and associating a
distinct profile with each group could help future marketing strategies.
From the Internet: There are three customer types, each of which needs to
be sold to very differently. These are: the Financier, the Techie and the
User.
From Kai: There are two kinds of students: those with BI experience and
those without.
Caveat:
There is no guarantee that the resulting clusters will be meaningful or useful. You
have to carefully consider them.
K-means (iterative)
Hierarchical (one-shot)
k-means Clustering
Assignment
Reassignment
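The assignment and reassignment steps can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function name `kmeans`, the starting centroids, and the sample points are all hypothetical, and reading "Reassignment" as "move the centroids, then assign every case again" is an assumption.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment: attach each point to its nearest centroid (Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # nothing moved, so another reassignment would change nothing
        centroids = new_centroids  # Reassignment happens on the next pass
    return centroids, clusters

# Hypothetical data: two obvious groups and two hand-picked starting centroids.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, clusters = kmeans(points, [(0, 0), (10, 0)])
print(centroids)
print(clusters)
```

The number of clusters is fixed by how many starting centroids are supplied, which is why k has to be chosen before the iterations begin.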
Euclidean Distance
Between points $(U_1, V_1)$ and $(U_2, V_2)$:
$L_2 = \left((U_1 - U_2)^2 + (V_1 - V_2)^2\right)^{1/2}$
(generally leads to spherical clusters)
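As a quick check of the formula, here is a small Python sketch. The function name `euclidean` and the sample points are hypothetical; the arithmetic follows the L2 definition above.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two points (U1, V1) and (U2, V2)."""
    (u1, v1), (u2, v2) = p, q
    return math.sqrt((u1 - u2) ** 2 + (v1 - v2) ** 2)

# A 3-4-5 right triangle: the distance is the hypotenuse.
print(euclidean((0, 0), (3, 4)))
```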
Hierarchical
        Red1    Red2    Red3    Red4
Red1            1.12    .5      2.7
Red2    1.12
Red3    .5                      2.24
Red4    2.7             2.24
Hierarchical
Create a table with all distances
between people or cases
We get the following table of differences:
          Red1/3    Red2    Red4
Red1/3              1.03    2.46
Red2      1.03
Red4      2.46
Hierarchical
Create a table with all distances
between people or cases
We get the following table of differences:
            Red1/2/3    Red4
Red1/2/3                2.28
Red4        2.28
Hierarchical
Create a table with all distances
between people or cases
We get the following table of differences:
              Red1/2/3/4
Red1/2/3/4
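The merge-and-recompute procedure walked through above can be sketched in Python. This is an illustration under stated assumptions: the coordinates are hypothetical (they are not the slide's red dots), distances are Euclidean, and the distance between two clusters is taken as the mean pairwise distance between their members (average linkage), which the slides do not confirm.

```python
import math
from itertools import combinations

def average_linkage(a, b, points):
    """Mean pairwise distance between the members of clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(name,) for name in sorted(points)]
    history = []
    while len(clusters) > 1:
        # Find the closest pair in the current distance table.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        d = average_linkage(a, b, points)
        # Replace the pair with the merged cluster, then repeat.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        history.append((a, b, round(d, 2)))
    return history

# Hypothetical coordinates, chosen only for illustration.
dots = {"Red1": (0, 0), "Red2": (1, 0), "Red3": (0, 0.5), "Red4": (3, 0)}
for step in agglomerate(dots):
    print(step)
```

Each entry in the history is one join in the dendrogram: the two clusters merged and the distance at which they merged.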
Result
Manhattan Distance
Between points $(U_1, V_1)$ and $(U_2, V_2)$:
$L_1 = |U_1 - U_2| + |V_1 - V_2|$
In teams of two
1. Using Manhattan Distance, create a table with all distances between red dots
2. Create a dendrogram
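For checking a hand-built table, step 1 can be sketched in Python. The coordinates below are hypothetical stand-ins, since the red dots' real positions are only on the slide; `manhattan` and `distance_table` are illustrative names.

```python
from itertools import combinations

def manhattan(p, q):
    """City-block (L1) distance between two points (U1, V1) and (U2, V2)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def distance_table(points):
    """Map each unordered pair of labels to its Manhattan distance."""
    return {(a, b): manhattan(points[a], points[b])
            for a, b in combinations(sorted(points), 2)}

# Hypothetical coordinates, not the dots from the exercise.
dots = {"Red1": (1, 1), "Red2": (2, 3), "Red3": (4, 1)}
for pair, d in distance_table(dots).items():
    print(pair, d)
```

Only one triangle of the table is stored, because the distance from Red1 to Red2 equals the distance from Red2 to Red1.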
Tribe Creation
Source: http://wiki.na-mic.org/Wiki/index.php/Progress_Report:DTI_Clustering
Association Rules
Transactions
A B C
A C D
B C D
A D E
B C E

Rule         Support      Confidence
A → D        2/5 (.40)    2/3 (.67)
C → A        2/5 (.40)    2/4 (.50)
A → C        2/5 (.40)    2/3 (.67)
B & C → D    1/5 (.20)    1/3 (.33)

Support
Probability that two items co-occur
(# transactions with both A and D) / (All transactions)

Confidence
Conditional probability that transaction contains D, given that it contains A
(# transactions with both A and D) / (# transactions with A)
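The support and confidence figures can be recomputed with a short Python sketch. It assumes the five transactions read as A B C, A C D, B C D, A D E, and B C E, and the rules as A → D, C → A, A → C, and B & C → D; the function names are illustrative.

```python
transactions = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]

def support(lhs, rhs, transactions):
    """Share of all transactions that contain every item in lhs and rhs."""
    both = set(lhs) | set(rhs)
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Share of the transactions containing lhs that also contain rhs.

    Assumes at least one transaction contains lhs.
    """
    lhs_items = set(lhs)
    both = lhs_items | set(rhs)
    with_lhs = sum(lhs_items <= t for t in transactions)
    return sum(both <= t for t in transactions) / with_lhs

# Each rule is (left-hand items, right-hand item); "BC" means B and C together.
for lhs, rhs in [("A", "D"), ("C", "A"), ("A", "C"), ("BC", "D")]:
    print(f"{'&'.join(lhs)} -> {rhs}: "
          f"support={support(lhs, rhs, transactions):.2f}, "
          f"confidence={confidence(lhs, rhs, transactions):.2f}")
```

Note that A → C and C → A share the same support but differ in confidence, because the denominator changes from transactions with A (3) to transactions with C (4).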
Size of box = transaction counts
Color of link = confidence level of rule
Thickness of link = confidence
Barbie Candy
1.
2.
3.
4.
5.
6.
7.
8.
Conclusions
Clustering provides another way to understand data
Its results need to jibe with human understanding, unless we use the
clusters directly for predictive analysis
Market basket analysis is now an industry standard