
Quiz

What is the most natural thing done by human beings (other than autonomic functions, e.g. breathing)?
How often does the average human do it?

Clustering
With your host, the self-appointed King of Clustering, Kai Larsen

Cluster Analysis

Source: http://www.vias.org/science_cartoons/cluster_analysis.html

http://www.abdn.ac.uk/zoologymuseum/images/kingdoms.jpg

Can we use this information?

Writing Skills

English Majors

Business Majors

Salary

Unsupervised Classification
Training Data
case 1: inputs, ?
case 2: inputs, ?
case 3: inputs, ?
case 4: inputs, ?
case 5: inputs, ?
new case

Training Data
case 1: inputs, cluster 1
case 2: inputs, cluster 3
case 3: inputs, cluster 2
case 4: inputs, cluster 1
case 5: inputs, cluster 2
new case

What and Why?


What:

Classification with an unknown target

Number of classes is unknown

Increase between-class distance, decrease within-class distance


Why:

Description
For example, segmenting existing customers into groups and associating a
distinct profile with each group could help future marketing strategies.
From the Internet: There are three customer types, each of which needs to
be sold to very differently. These are: the Financier, the Techie and the
User.
From Kai: There are two kinds of students, those with BI experience and
those without.
Caveat:

There is no guarantee that the resulting clusters will be meaningful or useful. You
have to carefully consider them.

Two basic types of cluster analysis

K-means (iterative)
Hierarchical (one-shot)

k-means Clustering

Assignment


Reassignment
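
The assignment and reassignment steps can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slides: the function name `kmeans`, the sample points, and the choice of starting centroids are all assumptions made for the example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment: attach each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Move each centroid to the mean of its members; the next pass
        # through the loop then reassigns points to the moved centroids.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# Hypothetical data: two loose groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centers)
```

The number of clusters is fixed by how many starting centroids are passed in, and different starting centroids can lead to different final clusters.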


Example K-means Clustering


Andromeda Galaxy
Source: www.freewebs.com/bnip1/andromedakmeans.htm

Euclidean Distance

Points: (U1, V1) and (U2, V2)
$L_2 = \left((U_1 - U_2)^2 + (V_1 - V_2)^2\right)^{1/2}$
(generally leads to spherical clusters)
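
As a quick check of the formula, here is a small Python sketch. The function name `euclidean` and the two sample points are assumptions chosen so the result is easy to verify by hand (a 3-4-5 right triangle).

```python
def euclidean(p, q):
    """L2 distance between two points (U1, V1) and (U2, V2)."""
    (u1, v1), (u2, v2) = p, q
    return ((u1 - u2) ** 2 + (v1 - v2) ** 2) ** 0.5

# Hypothetical points: the legs differ by 3 and 4, so the distance is 5.
d = euclidean((0, 0), (3, 4))
print(d)  # 5.0
```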

Hierarchical

Create a table with all distances between people or cases.

We get the following table of distances:

        Red1   Red2   Red3   Red4
Red1     -     1.12   0.5    2.7
Red2    1.12    -
Red3    0.5            -     2.24
Red4    2.7           2.24    -

Now, starting with the shortest distances between dots, we cluster items.

Hierarchical
Create a table with all distances between people or cases.

We get the following table of distances:

          Red1/3   Red2   Red4
Red1/3      -      1.03   2.46
Red2       1.03     -
Red4       2.46            -

Now, starting with the shortest distances between dots, we cluster items.

Hierarchical
Create a table with all distances between people or cases.

We get the following table of distances:

            Red1/2/3   Red4
Red1/2/3       -       2.28
Red4          2.28      -

Now, starting with the shortest distances between dots, we cluster items.

Hierarchical
Create a table with all distances between people or cases.

We get the following table of distances:

              Red1/2/3/4
Red1/2/3/4        -

Now, starting with the shortest distances between dots, we cluster items.

Result
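
The merge sequence walked through above can be sketched in Python. Two things here are assumptions rather than facts from the slides: the coordinates are hypothetical (picked so that several pairwise distances resemble the first table, since the slides give no coordinates), and the distance between clusters is measured between their centroids, which the slides do not state.

```python
import math
from itertools import combinations

def centroid(cluster, coords):
    """Mean position of the named dots in a cluster."""
    xs, ys = zip(*(coords[name] for name in cluster))
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def agglomerate(coords):
    """Repeatedly merge the two closest clusters; return (members, distance) per merge."""
    clusters = [(name,) for name in coords]
    merges = []
    while len(clusters) > 1:
        def gap(pair):
            # Assumed linkage: distance between the two cluster centroids.
            return math.dist(centroid(pair[0], coords), centroid(pair[1], coords))
        a, b = min(combinations(clusters, 2), key=gap)
        merged = tuple(sorted(a + b))
        merges.append((merged, gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges

# Hypothetical coordinates, not taken from the slides.
dots = {"Red1": (0.0, 0.0), "Red2": (0.5, 1.0), "Red3": (0.5, 0.0), "Red4": (2.5, 1.0)}
merges = agglomerate(dots)
for members, distance in merges:
    print("/".join(members), round(distance, 2))
```

Reading the printed merges from top to bottom gives the order in which branches join in a dendrogram. A different linkage rule (for example, nearest or farthest member) can change both the distances and the merge order.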


Manhattan Distance

Points: (U1, V1) and (U2, V2)
$L_1 = |U_1 - U_2| + |V_1 - V_2|$
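
A small Python sketch of the same formula follows. The function name `manhattan` and the sample points are assumptions for illustration only.

```python
def manhattan(p, q):
    """L1 distance between two points (U1, V1) and (U2, V2)."""
    (u1, v1), (u2, v2) = p, q
    return abs(u1 - u2) + abs(v1 - v2)

# Hypothetical points: 3 steps across plus 4 steps up.
d = manhattan((0, 0), (3, 4))
print(d)  # 7
```

For comparison, the straight-line (Euclidean) distance between these two points is 5, so the Manhattan distance is never the shorter of the two.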


In teams of two
1. Using Manhattan distance, create a table with all distances between red dots
2. Create a dendrogram


Life is often messy


Tribe Movement


Tribe Creation

How many clusters?


Flow Clustering Example


Source: http://wiki.na-mic.org/Wiki/index.php/Progress_Report:DTI_Clustering

Ancient Chinese Classification of Animals:


"Animals are divided into:
a) those that belong to the Emperor
b) embalmed ones
c) those that are trained
d) suckling pigs
e) mermaids
f) fabulous ones
g) stray dogs
h) those that are included in this classification
i) those that tremble as if they were mad
j) innumerable ones
k) those drawn with a very fine camel's hair brush
l) others
m) those that have just broken a flower vase
n) those that resemble flies from a distance."
from Other Inquisitions: 1937-1952 by Jorge Luis Borges


For the Marketing Buffs


Market Basket Analysis
(a quick intro)

Association Rules

Transactions:
A B C
A C D
B C D
A D E
B C E

Rule         Support      Confidence
A → D        2/5 (.40)    2/3 (.67)
C → A        2/5 (.40)    2/4 (.50)
A → C        2/5 (.40)    2/3 (.67)
B & C → D    1/5 (.20)    1/3 (.33)

Support: probability that two items co-occur.
Support(A → D) = (# transactions with both A and D) / (all transactions)

Confidence: conditional probability that a transaction contains D, given that it contains A.
Confidence(A → D) = (# transactions with both A and D) / (# transactions with A)
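
The support and confidence figures for the four rules can be recomputed from the five transactions on the slide. This is a minimal Python sketch; the function names `support` and `confidence` are chosen for the example and are not from any particular library.

```python
from fractions import Fraction

# The five transactions from the slide.
transactions = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]

def support(lhs, rhs):
    """Share of all transactions that contain every item on both sides of the rule."""
    both = sum(1 for t in transactions if lhs | rhs <= t)
    return Fraction(both, len(transactions))

def confidence(lhs, rhs):
    """Share of the transactions containing lhs that also contain rhs."""
    both = sum(1 for t in transactions if lhs | rhs <= t)
    has_lhs = sum(1 for t in transactions if lhs <= t)
    return Fraction(both, has_lhs)

# The four rules from the slide, written as (left-hand side, right-hand side).
rules = [({"A"}, {"D"}), ({"C"}, {"A"}), ({"A"}, {"C"}), ({"B", "C"}, {"D"})]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```

Note that A → C and C → A have the same support but different confidence, because confidence divides by the count of the left-hand side only. `Fraction` prints 2/4 in lowest terms as 1/2.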

Size of box = transaction counts
Color of link = confidence level of rule
Thickness of link = confidence

Barbie Candy
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, lower it on the other.
6. Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.
Conclusions
Clustering provides another way to understand data
Its results need to jibe with human understanding, unless we use the clusters directly for predictive analysis
Market basket analysis is now an industry standard


Let's Submit to Titanic


Create Kaggle Account


Invite team members
Download train and test files from Kaggle
Save files as .xlsx
Import files into SQL Server
Run prediction with multiple models
Figure out which is best based on cross-validation
Use that model to predict
Upload results
Note: gendermodel.csv has submission format example
You need the same column names and number of rows
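
The "which is best based on cross-validation" step can be illustrated outside SQL Server with a short Python sketch. Everything in it is an assumption made for the example: the rows are made up and are not the Kaggle data, the two "models" are deliberately trivial rules, and the helper names (`k_fold_accuracy`, `fit_majority`, `fit_by_sex`) are hypothetical.

```python
def k_fold_accuracy(rows, fit, k=5):
    """Average held-out accuracy of the model built by `fit` over k folds."""
    scores = []
    for i in range(k):
        test = rows[i::k]  # every k-th row, starting at i, is held out
        train = [r for j, r in enumerate(rows) if j % k != i]
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k

def fit_majority(train):
    """Baseline: always predict the most common label in the training rows."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_by_sex(train):
    """Predict the most common training label among rows with the same sex."""
    by_sex = {}
    for x, y in train:
        by_sex.setdefault(x["sex"], []).append(y)
    table = {s: max(set(ys), key=ys.count) for s, ys in by_sex.items()}
    return lambda x: table[x["sex"]]

# Made-up rows of (features, survived); NOT the real Titanic data.
rows = [({"sex": "female"}, 1), ({"sex": "male"}, 0)] * 10

results = {
    "majority": k_fold_accuracy(rows, fit_majority),
    "by_sex": k_fold_accuracy(rows, fit_by_sex),
}
best = max(results, key=results.get)
print(results, "best:", best)
```

The model with the highest cross-validated accuracy is the one to refit on the full training file and use for the uploaded predictions. Real data will not separate as cleanly as these made-up rows.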
