What Is Data?
Values of qualitative or quantitative variables
belonging to a set of items
Represented in a structure, e.g., tabular, tree
or graph structure
Typically the results of measurements
As an abstract concept can be viewed as the
lowest level of abstraction from which
information and then knowledge are derived
What Is Information?
Knowledge communicated or received
concerning a particular fact or circumstance
Conceptually, information is the message
(utterance or expression) being conveyed
Cannot be predicted
Can resolve uncertainty
What Is Knowledge?
Familiarity with someone or something,
which can include facts, information,
descriptions, or skills acquired through
experience or education
Implicit knowledge: practical skill or expertise
Explicit knowledge: theoretical
understanding of a subject
Data Systems
A data system answers queries based on
data acquired in the past
Base data: the rawest data, not derived
from anything else
Knowledge: information derived from the
base data
Aggregate queries
What is the average GPA of all students at this
school?
Queries
A precise request for information
Subjects in databases and information
retrieval
Databases: structured queries on structured
(e.g., relational) data
Information retrieval: unstructured queries on
unstructured (e.g., text, image) data
Important assumptions
Information needs
Query languages
Data-driven Exploration
What should be the next strategy of a
company?
A lot of data: sales, human resources, production,
tax, service cost, …
Data-driven Thinking
Starting with some simple queries
New queries are raised by consuming the
results of previous queries
No ultimate query in design!
But many queries can be answered using DB/IR
techniques
Machine Learning
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E
Tom M. Mitchell
Essentially, learn the distribution of data
(Figure: the KDD process: Data → Selection → Target data → Preprocessing → Preprocessed data → Transformation → Transformed data → Data mining → Patterns → Interpretation/Evaluation)
Unit cost of storage:
1956: $10,000/MB
1980: $193/MB
1990: $9/MB
2000: $6.9/GB
2010: $0.08/GB
2013: $0.06/GB
Source: http://ns1758.ca/winch/winchest.html
In Q2 2015
Facebook has 1.49 billion active users
WeChat has 600 million active users, 100 million
outside China
LinkedIn has 380 million active users
Twitter has 304 million active users
Velocity
Google processes 24+ petabytes of data per
day
Facebook gets 10+ million new photos
uploaded every hour
Facebook members like or leave a comment
3+ billion times per day
YouTube users upload 1+ hour of video
every second
400+ million tweets per day
Veracity
1 in 3 business leaders don't trust the
information they use to make decisions
Assuming a slowly growing total cost budget,
tradeoff between data volume and data
quality
Loss of veracity in combining different types
of information from different sources
Loss of veracity in data extraction,
transformation, and processing
Variety
Integrating data capturing different aspects
of a data object
Vancouver Canucks: game video, technical
statistics, social media, …
Different pieces are in different formats
Four V-challenges
Volume: massive scale and growth, 40% per
year in global data generated
Velocity: real time data generation and
consumption
Variety: heterogeneous data, mainly
unstructured or semi-structured, from many
sources
Veracity: trustworthiness and quality of data
Datafication
Extract data about an object or event in a
quantified way so that it can be analyzed
Different from digitalization
Important techniques
Data aggregation
Extended datafication
Data holders
Data specialists
Big-data mindset leaders
A capable company may play 2 or 3 roles at
the same time
What is most important, big-data mindset,
skills, or data itself?
Privacy
"Big data analytics have the potential to
eclipse longstanding civil rights protections
in how personal information is used in
housing, credit, employment, health,
education, and the marketplace."
Executive Office of the (US) President
Keep in Mind
"Our industry does not respect
tradition; it only respects
innovation."
Satya Nadella
Format
Due to the fast progress in data mining, we
will go beyond the textbook substantially
Active classroom discussion
Open questions and brainstorming
Textbook: Data Mining: Concepts and
Techniques (3rd ed.)
Trying
Assignments and a project
Thinking
Examine everything from a data scientist's angle from
today
Burnt or Burned?
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Transaction Data
Alphabet: a set of items
Example: all products sold in a store
(Example: a transaction table with columns Tid and Item, and its
vertical representation as Tid-lists, e.g., item a: {…, t123, …};
item b: {…, t123, …, t236, …})
Broad applications
Basket data analysis, cross-marketing, catalog
design, sale campaign analysis, web log (click
stream) analysis, …
Frequent Itemsets
Itemset: a set of items
E.g., acm = {a, c, m}
Support of itemsets
Sup(acm) = 3
Transaction database TDB:
f, a, c, d, g, i, m, p
a, b, c, f, l, m, o
b, f, h, j, o
b, c, k, s, p
a, f, c, e, l, p, m, n
A Naïve Attempt
Generate all possible itemsets, test their
supports against the database
How to hold a large number of itemsets in
main memory?
100 items → 2^100 − 1 possible itemsets
Apriori-Based Mining
Generate length (k+1) candidate itemsets
from length k frequent itemsets, and
Test the candidates against DB
Database D (min_sup = 2):
TID  Items
100  a, c, d
200  b, c, e
300  a, b, c, e
400  b, e
Scan D for 1-candidates: a:2, b:3, c:3, d:1, e:3
Freq 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D, counting: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Freq 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D: bce:2
Freq 3-itemsets: bce:2
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
  Ck+1 = candidates generated from Lk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
return ∪k Lk;
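A minimal runnable Python sketch of this candidate-generation-and-test loop (function and variable names such as apriori and min_sup are illustrative, not from the slides):

from itertools import combinations

def apriori(transactions, min_sup):
    # Level 1: count single items
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    k, result = 1, set(Lk)
    while Lk:
        # Join frequent k-itemsets into (k+1)-candidates
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test candidates against the database
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_sup}
        result |= Lk
        k += 1
    return result

# The example above: 4 transactions, min_sup = 2
print(sorted(''.join(sorted(s)) for s in
      apriori([['a','c','d'], ['b','c','e'], ['a','b','c','e'], ['b','e']], 2)))
# -> ['a', 'ac', 'b', 'bc', 'bce', 'be', 'c', 'ce', 'e']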
Counting Array
A 2-dimensional triangular matrix of pair counts can be
implemented using a 1-dimensional array
(Figure: the upper triangle of the pair-count matrix laid out row
by row in a flat array)
Example of Candidate Generation
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
abcd ← abc * abd
acde ← acd * ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
Step 1: self-joining
INSERT INTO Ck
SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
FROM Lk-1 p, Lk-1 q
WHERE p.item1 = q.item1 AND … AND p.itemk-2 = q.itemk-2
  AND p.itemk-1 < q.itemk-1
Step 2: pruning
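A Python rendering of the same two steps, kept close to the SQL above (a sketch; itemsets are assumed to be sorted tuples, and the name gen_candidates is illustrative):

from itertools import combinations

def gen_candidates(Lk_minus_1):
    L = sorted(Lk_minus_1)
    k = len(L[0]) + 1
    # Step 1: self-join -- share the first k-2 items, last items ordered
    Ck = [p[:-1] + (p[-1], q[-1]) for p in L for q in L
          if p[:-1] == q[:-1] and p[-1] < q[-1]]
    # Step 2: prune -- drop candidates having an infrequent (k-1)-subset
    Lset = set(L)
    return [c for c in Ck
            if all(s in Lset for s in combinations(c, k - 1))]

L3 = [('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')]
print(gen_candidates(L3))   # [('a','b','c','d')]; acde is pruned (ade not in L3)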
Method
Candidate itemsets are stored in a hash-tree
A leaf node of hash-tree contains a list of itemsets and
counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction
(Figure: a hash tree over candidate 3-itemsets, e.g., leaves {234, 567},
{145, 136}, {124, 457, 125, 458}, {345, 356, 357, 689, 367, 368, 159};
the subset function matches transaction 1 2 3 5 6 by recursively hashing
on 1+2356, 12+356, 13+56, …)
Association Rules
Rule: c → am
Support: 3 (i.e., the support of acm)
Confidence: 75% (i.e., sup(acm) / sup(c))
Transaction database TDB:
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
Given a minimum support threshold and a minimum
confidence threshold, find all association rules whose
support and confidence pass the thresholds
A transaction with 100 items contains
(100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30
frequent itemsets
Bottleneck: candidate-generation-and-test
(Figure: the itemset lattice over {A, B, C, D}, from {} up to ABCD)
(Figures: the itemset lattice over {a, b, c, d}, used to illustrate the search space)
Projected Databases
To test whether X ∪ {y} is frequent, we can use
the X-projected database
The sub-database of transactions containing X
Check whether item y is frequent in the X-projected
database
(Figure: the itemset lattice over {a, b, c, d})
TID  Items bought              (ordered) freq items
100  f, a, c, d, g, i, m, p    f, c, a, m, p
200  a, b, c, f, l, m, o       f, c, a, b, m
300  b, f, h, j, o             f, b
400  b, c, k, s, p             c, b, p
500  a, f, c, e, l, p, m, n    f, c, a, m, p
Header table (f-list): f, c, a, b, m, p
(Figure: the resulting FP-tree; paths from the root:
f:4 → c:3 → a:3 → m:2 → p:2, with a side branch a:3 → b:1 → m:1;
f:4 → b:1; and root → c:1 → b:1 → p:1)
Benefits of FP-tree
Completeness
Never break a long pattern in any transaction
Preserve complete information for freq pattern mining
No need to scan the database again
Compactness
Reduce irrelevant info: infrequent items are removed
Items in frequency descending order (f-list): the more
frequently occurring, the more likely to be shared
Never larger than the original database (not counting
node-links and the count fields)
m-projected FP-tree: root → f:3 → c:3 → a:3
(derived from the paths containing m in the full FP-tree shown earlier)
Recursive Mining
Patterns having m but no p can be mined
recursively
Optimization: enumerate patterns from a
single-branch FP-tree
Enumerate all combinations
Support = that of the last item
m, fm, cm, am
fcm, fam, cam
fcam
(m-projected FP-tree: root → f:3 → c:3 → a:3)
(Figure: an FP-tree r with a single prefix path a1:n1 → a2:n2 → a3:n3 above
branches b1:m1, c1:k1, c2:k2, c3:k3 can be split into the prefix path and the
multi-branch part r1)
FP-growth
Pattern-growth: recursively grow frequent patterns
by pattern and database partitioning
Algorithm
For each frequent item, construct its projected database,
and then its projected FP-tree
Repeat the process on each newly created projected
FP-tree
Until the resulting FP-tree is empty, or contains only one
path; a single path generates all the combinations, each
of which is a frequent pattern
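A compact runnable Python sketch of pattern growth by database projection (it mines projected transaction lists directly rather than an explicit FP-tree node-link structure; names such as fp_growth are illustrative):

from collections import defaultdict

def fp_growth(db, min_sup, suffix=()):
    # db: list of (items, count) pairs for the current projected database
    counts = defaultdict(int)
    for items, cnt in db:
        for i in set(items):
            counts[i] += cnt
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    for i, sup in freq.items():
        yield suffix + (i,), sup
        # i-projected database: from each transaction containing i, keep only
        # the items preceding i in frequency-descending (f-list) order
        proj = []
        for items, cnt in db:
            if i in items:
                prefix = [j for j in items
                          if freq.get(j, 0) > freq[i]
                          or (freq.get(j, 0) == freq[i] and j < i)]
                if prefix:
                    proj.append((prefix, cnt))
        yield from fp_growth(proj, min_sup, suffix + (i,))

tdb = [list('facdgimp'), list('abcflmo'), list('bfhjo'),
       list('bcksp'), list('afcelpmn')]
patterns = dict(fp_growth([(t, 1) for t in tdb], 3))
print(patterns[('m',)], patterns[('m', 'f', 'c', 'a')])   # 3 3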
Scaling up by DB Projection
What if an FP-tree cannot fit into memory?
Database projection
Partition a database into a set of projected
databases
Construct and mine FP-tree once the projected
database can fit into main memory
Heuristic: Projected database shrinks quickly in many
applications
Tran. DB: fcamp, fcabm, fb, cbp, fcamp
m-proj DB: fca, fcab, fca
am-proj DB: fc, fc, fc
cm-proj DB: f, f, f
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: {}
(Figure: partitioning the database into projected databases, then recursively
projecting each one)
Other factors
No candidate generation nor candidate test
Database compression using FP-tree
No repeated scan of entire database
Basic operations: counting local frequent items
and building FP-trees; no pattern search or
pattern matching
Building FP-trees
A stack of FP-trees
Redundant information
Transaction abcd appears in the a-, ab-, abc-, ac-,
…, c-projected databases and FP-trees
Non-focused mining
A manager may be only interested in patterns
involving some items (s)he manages
A user is often interested in patterns satisfying
some constraints
(Example: a transaction database over items A, B, C, D with min_sup = 2;
legible transactions include ABD, ABC, AD, ABCD, CD)
(Figure: the itemset lattice over {A, B, C, D})
Length  Frequent itemsets
1       A, B, C, D
Max-Patterns
A frequent itemset X is a max-pattern if no superset of X is frequent
Same example (min_sup = 2):
Length-1 frequent itemsets: A, B, C, D
Max-patterns: ABC, ABD
(Figure: the lattice annotated with supports, min_sup = 2:
A:4, B:4, C:3, D:4; AB:3, AC:2, AD:3, BC:2, BD:2, CD:2; ABC:2, ABD:2
Max-patterns: ABC:2, ABD:2)
TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f
Constraint-based mining
User flexibility: provides constraints on what is to be mined
System optimization: pushes constraints into the mining process for efficiency
Dimension/level constraint
in relevance to region, price, brand, customer category
Interestingness constraint
strong rules: support and confidence
Optimization
Mining frequent patterns with constraint C
Sound: only find patterns satisfying the constraints C
Complete: find all patterns satisfying the constraints C
A naïve solution
Constraint test as a post-processing step
Anti-Monotonicity
Anti-monotonicity: if an itemset S violates the
constraint, so does any of its supersets
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Example
C: range(S.profit) ≤ 15 (item profits as in the table below)
Itemset ab violates C
So does every superset of ab
Anti-monotonic Constraints
Constraint                            Antimonotone
v ∈ S                                 no
S ⊇ V                                 no
S ⊆ V                                 yes
min(S) ≤ v                            no
min(S) ≥ v                            yes
max(S) ≤ v                            yes
max(S) ≥ v                            no
count(S) ≤ v                          yes
count(S) ≥ v                          no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)            yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)            no
range(S) ≤ v                          yes
range(S) ≥ v                          no
avg(S) θ v, θ ∈ {=, ≤, ≥}             convertible
support(S) ≥ ξ                        yes
support(S) ≤ ξ                        no
Monotonicity
Monotonicity: if an itemset S satisfies the
constraint, so does any of its supersets
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Example
C: range(S.profit) ≥ 15
Itemset ab satisfies C
So does every superset of ab
Monotonic Constraints
Constraint                            Monotone
v ∈ S                                 yes
S ⊇ V                                 yes
S ⊆ V                                 no
min(S) ≤ v                            yes
min(S) ≥ v                            no
max(S) ≤ v                            no
max(S) ≥ v                            yes
count(S) ≤ v                          no
count(S) ≥ v                          yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)            yes
range(S) ≤ v                          no
range(S) ≥ v                          yes
avg(S) θ v, θ ∈ {=, ≤, ≥}             convertible
support(S) ≥ ξ                        no
support(S) ≤ ξ                        yes
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Convertible Constraints
Let R be an order of items
Convertible anti-monotone
If an itemset S violates a constraint C, so does every
itemset having S as a prefix w.r.t. R
Ex. avg(S) ≥ v w.r.t. item-value-descending order
Convertible monotone
If an itemset S satisfies constraint C, so does every
itemset having S as a prefix w.r.t. R
Ex. avg(S) ≤ v w.r.t. item-value-descending order
Convertible Constraints
Constraint                           Convertible      Convertible  Strongly
                                     anti-monotone    monotone     convertible
avg(S) ≤ v, ≥ v                      Yes              Yes          Yes
median(S) ≤ v, ≥ v                   Yes              Yes          Yes
sum(S) ≤ v (items of any sign, v ≥ 0)  Yes            No           No
sum(S) ≤ v (items of any sign, v ≤ 0)  No             Yes          No
sum(S) ≥ v (items of any sign, v ≥ 0)  No             Yes          No
sum(S) ≥ v (items of any sign, v ≤ 0)  Yes            No           No
TDB (min_sup = 2)
TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e
Item profits: a: 40, f: 30, g: 20, d: 10, b: 0, h: -10, c: -20, e: -30
Order R by profit descending: <a, f, g, d, b, h, c, e>
C (e.g., avg(S.profit) ≥ v) is convertible anti-monotone w.r.t. R
Misleading Patterns
Play basketball ⇒ eat cereal [40%, 66.7%]
            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000
Misleading: overall, 75% (3750/5000) of students eat cereal,
higher than the rule's 66.7% confidence
Evaluation Criteria
Objective interestingness measures
Examples: support, patterns formed by mutually
independent items
Domain independent
Subjective measures
Examples: domain knowledge, templates/
constraints
Lift
lift = P(A ∪ B) / (P(A) P(B))
Cell notation:
            Basketball   Not basketball  Sum (row)
Cereal      f11 = 2000   f10 = 1750      f1+ = 3750
Not cereal  f01 = 1000   f00 = 250       f0+ = 1250
Sum (col.)  f+1 = 3000   f+0 = 2000      5000
E.g., lift(basketball, cereal) = (2000/5000) / ((3000/5000)(3750/5000)) ≈ 0.89
Property of Lift
Two contingency tables, each with 1000 objects:
Table 1: f11 = 880, f10 = 50, f01 = 50, f00 = 20 (sums 930/70)
Table 2: f11 = 20, f10 = 50, f01 = 50, f00 = 880 (sums 70/930)
Lift rates the second pair (rarely co-occurring) far higher than the
first (mostly co-occurring): 0.88/0.93² ≈ 1.02 versus 0.02/0.07² ≈ 4.08
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
GSP
GSP (Generalized Sequential Pattern) mining
Outline of the method
Initially, every item in DB is a candidate of length-1
For each level (i.e., sequences of length-k) do
Scan database to collect support count for each candidate
sequence
Generate candidate length-(k+1) sequences from length-k
frequent sequences using the Apriori property
Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan the database once, counting support for each candidate (min_sup = 2):
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
(Figure: the 51 length-2 candidates generated from the six frequent items
a–f: 36 sequence candidates <aa>, <ab>, …, <ff> and 15 element candidates
<(ab)>, <(ac)>, …, <(ef)>)
Without the Apriori property, 8 × 8 + 8 × 7 / 2 = 92 candidates
Apriori prunes 44.57% of the candidates
(Example: <(bd)cba> is a sequential pattern with support 2, min_sup = 2,
contained in sequences 10 and 50)
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Bottlenecks of GSP
A huge set of candidates
1,000 frequent length-1 sequences generate
1000 × 1000 + 1000 × 999 / 2 = 1,499,500 length-2 candidates!
Multiple scans of the database
A length-100 sequential pattern needs 2^100 ≈ 10^30 candidate sequences!
PrefixSpan
Projection-based
But only prefix-based projection: fewer projections
and quickly shrinking sequences
Example: for sequence <a(abc)(ac)d(cf)>:
the <a>-projected suffix is <(abc)(ac)d(cf)>
the <aa>-projected suffix is <(_bc)(ac)d(cf)>
the <ab>-projected suffix is <(_c)(ac)d(cf)>
Divide the search space by prefix:
the ones having prefix <a>; the ones having prefix <b>; …; the ones having prefix <f>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Completeness of PrefixSpan
SDB
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Length-2 sequential patterns with prefix <a>:
<aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Each is grown from the <a>-projected database; e.g., the <af>-projected
database is mined next
Efficiency of PrefixSpan
No candidate sequence needs to be
generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing
projected databases
Can be improved by bi-level projections
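A minimal runnable Python sketch of prefix-based projection (illustrative names; the slides' databases contain itemsets like (abc), while this sketch handles the simpler case of single-item events):

from collections import Counter

def prefixspan(db, min_sup, prefix=()):
    # db holds the current projected suffixes; count each item once per sequence
    counts = Counter(i for s in db for i in set(s))
    for item, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        yield prefix + (item,), sup
        # Project on the item: keep the suffix after its first occurrence
        proj = [s[s.index(item) + 1:] for s in db if item in s]
        yield from prefixspan([s for s in proj if s], min_sup, prefix + (item,))

db = [list('caabc'), list('abcb'), list('cabc'), list('abbca')]
for pattern, sup in prefixspan(db, 3):
    print(pattern, sup)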
Effectiveness
Redundancy due to anti-monotonicity
A pattern <abcd> leads to 15 sequential patterns (all its
non-empty subsequences) of the same support
Remedy: closed sequential patterns and sequential
generators
Product information
(Product-id, category,
manufacturer, made-in,
stock-price, …)
Sales information
(customer-id, product-id, #units, unit-price,
sales-representative, …)
Business queries:
Which categories of products are most popular with customers?
Find pairs (customer groups, most popular products)
Egocentric Analysis
How am I different from (more often than
not, better than) others?
In what aspects am I good?
http://img03.deviantart.net/a670/i/2010/219/a/e/glee___egocentric_by_gleeondoodles.jpg
Dimensions
An aspect or feature of a situation, problem, or
thing; a measurable extent of some kind
(dictionary definition)
Dimensions/attributes are used to model
complex objects in a divide-and-conquer
manner
Objects are compared in selected dimensions/
attributes
Multi-dimensional Analysis
Find interesting patterns in multi-dimensional
subspaces
Michael Jordan is outstanding in subspaces (total
points, total rebounds, total assists) and (number of
games played, total points, total assists)
OLAP
Conceptually, we may explore all possible
subspaces for interesting patterns
OLAP
Aggregates and group-bys are frequently used in
data analysis and summarization
SELECT time, altitude, AVG(temp)
FROM weather GROUP BY time, altitude;
In TPC, 6 standard benchmarks have 83 queries,
aggregates are used 59 times, group-bys are used 20
times
OLAP Operations
Roll up (drill-up): summarize data by
climbing up hierarchy or by dimension
reduction
(Day, Store, Product type, SUM(sales)) →
(Month, City, *, SUM(sales))
Roll Up
http://www.tutorialspoint.com/dwh/images/rollup.jpg
Drill Down
http://www.tutorialspoint.com/dwh/images/drill_down.jpg
Other Operations
Dice: pick specific values or ranges on some
dimensions
Pivot: rotate a cube changing the order of
dimensions in visual analysis
http://en.wikipedia.org/wiki/File:OLAP_pivoting.png
Dice
http://www.tutorialspoint.com/dwh/images/dice.jpg
Relational Representation
If there are n dimensions, there are 2^n
possible aggregation columns
Roll up by model, by year, by color in a table
Difficulties
Many group bys are needed
6 dimensions → 2^6 = 64 group-bys
DATA CUBE
SALES:
Model  Year  Color  Sales
Chevy  1990  red    5
Chevy  1990  white  87
Chevy  1990  blue   62
Chevy  1991  red    54
Chevy  1991  white  95
Chevy  1991  blue   49
Chevy  1992  red    31
Chevy  1992  white  54
Chevy  1992  blue   71
Ford   1990  red    64
Ford   1990  white  62
Ford   1990  blue   63
Ford   1991  red    52
Ford   1991  white  9
Ford   1991  blue   55
Ford   1992  red    27
Ford   1992  white  62
Ford   1992  blue   39
CUBE of SALES adds the ALL rows for every combination of dimensions, for example:
Chevy  1990  ALL   154
Chevy  ALL   ALL   508
Ford   ALL   ALL   433
ALL    1990  blue  125
ALL    ALL   ALL   941
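A small runnable Python sketch of what the CUBE operator computes: every data row contributes to 2^k aggregate cells, one per subset of dimensions replaced by ALL (names such as cube are illustrative):

from itertools import product

def cube(rows, dims, measure):
    out = {}
    for row in rows:
        # Each row contributes to 2^k cells (keep or aggregate each dimension)
        for mask in product([False, True], repeat=len(dims)):
            key = tuple(row[d] if keep else 'ALL'
                        for d, keep in zip(dims, mask))
            out[key] = out.get(key, 0) + row[measure]
    return out

sales = [
    {'model': 'Chevy', 'year': 1990, 'color': 'red',   'sales': 5},
    {'model': 'Chevy', 'year': 1990, 'color': 'white', 'sales': 87},
    {'model': 'Chevy', 'year': 1990, 'color': 'blue',  'sales': 62},
]
c = cube(sales, ['model', 'year', 'color'], 'sales')
print(c[('Chevy', 1990, 'ALL')])   # 154
print(c[('ALL', 'ALL', 'ALL')])    # 154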
Semantics of ALL
ALL is a set
Model.ALL = ALL(Model) = {Chevy, Ford }
Year.ALL = ALL(Year) = {1990,1991,1992}
Color.ALL = ALL(Color) = {red,white,blue}
OLTP vs. OLAP
                    OLTP                        OLAP
users               clerk, IT professional      knowledge worker
function            day-to-day operations       decision support
DB design           application-oriented        subject-oriented
data                current, detailed           historical, summarized
usage               repetitive                  ad-hoc
access              read/write                  lots of scans
unit of work        short, simple transaction   complex query
# records accessed  tens                        millions
# users             thousands                   hundreds
DB size             100MB-GB                    100GB-TB
metric              transaction throughput      query throughput, response time
Subject-Oriented
Organized around major subjects, such as
customer, product, sales
Focusing on the modeling and analysis of
data for decision makers, not on daily
operations or transaction processing
Providing a simple and concise view around
particular subject issues by excluding data
that are not useful in the decision support
process
Integrated
Integrating multiple, heterogeneous data sources
Relational databases, flat files, on-line transaction
records
Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational systems
Operational databases: current value data
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Nonvolatile
A physically separate store of data
transformed from the operational
environment
Operational updates of data do not occur in
the data warehouse environment
Do not require transaction processing, recovery,
and concurrency control mechanisms
Require only two operations in data accessing
Initial loading of data
Access of data
Star Schema
A central fact table, keyed by the dimension keys, holds the measures
Dimension tables:
time: time_key, day, day_of_the_week, month, quarter, year
branch: branch_key, branch_name, branch_type
item: item_key, item_name, brand, type, supplier_type
location: location_key, street, city, state_or_province, country
Snowflake Schema
Like the star schema, but dimension tables are partially normalized:
time: time_key, day, day_of_the_week, month, quarter, year
branch: branch_key, branch_name, branch_type
item: item_key, item_name, brand, type, supplier_key → supplier: supplier_key, supplier_type
location: location_key, street, city_key → city: city_key, city, state_or_province, country
Fact Constellation
Multiple fact tables share dimension tables:
sales fact table with measures, keyed by time, item, branch, location
shipping fact table: item_key, time_key, shipper_key, from_location,
to_location, dollars_cost, units_shipped
shipper: shipper_key, shipper_name, location_key, shipper_type
time, branch, item (item_key, item_name, brand, type, supplier_type), and
location (location_key, street, city, province_or_state, country) as before
Query types
Point query: looking up one specific tuple (rare)
Range query: returning the aggregate of a
(large) set of tuples, with group by
Complex queries: need specific algorithms and
index structures, will be discussed later
Bitmap Index
For n tuples, a bitmap index has n bits per distinct value and
can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
From a bit to the row-id: the j-th bit of the p-th byte → row-id = p * 8 + j
(Example: a cust/gender table with rows Jack, Cathy, Nancy and a bit
vector such as 1 0 0 for one gender value)
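A tiny runnable Python sketch of the idea, using integers as bit vectors (names such as build_bitmap are illustrative):

def build_bitmap(values):
    # One bit vector (a Python int) per distinct value; bit i = row-id i
    index = {}
    for rid, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << rid)
    return index

gender = build_bitmap(['M', 'F', 'F'])      # Jack, Cathy, Nancy
city   = build_bitmap(['Van', 'Van', 'Sea'])
# AND the bit vectors to answer "female AND Vancouver"
hits = gender['F'] & city['Van']
print([rid for rid in range(3) if hits >> rid & 1])   # [1] -> Cathy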
Bit-Sliced Index
A sale amount can be written as an integer
number of pennies, and then be represented
as a binary number of N bits
24 bits is good for up to $167,772.15,
appropriate for many stores
Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
Tuples satisfying C are identified by a bitmap B
Cost Comparison
Traditional value-list index (B+ tree) is costly
in both I/O and CPU time
Not good for OLAP
(Figure: a table with 100 attributes A1 … A100, each tuple holding values
x1 … x100 and z1 … z100, stored horizontally as whole tuples or vertically
as per-attribute lists)
Typical in OLAP
Horizontal storage (no index): search all tuples O(100n),
where n is the number of tuples
Vertical storage: search 3 lists O(3n), 3% of the
horizontal storage method
MOLAP
(Figure: a 3-D array with dimensions Date (1Qtr–4Qtr), Product (TV, PC, VCR),
and Country (U.S.A, Canada, Mexico), including sum margins along each dimension)
Easy to implement
Fast retrieval
Many entries may be empty if data is sparse
Costly in space
Dimensions and Measure
Dimensions: Store, Product, Season; Measure: Sales
Example fact rows: (S1, P1, Spring, …), (S1, P2, Spring, 12), (S2, P1, Fall, …)
Cubing aggregates the measure over all dimension combinations
(Figure: the cuboid lattice for dimensions time, item, location, supplier:
0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: (time,item), (time,location), (time,supplier), (item,location),
(item,supplier), (location,supplier)
3-D cuboids: (time,item,location), (time,item,supplier),
(time,location,supplier), (item,location,supplier)
4-D (base) cuboid: (time, item, location, supplier))
Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
E.g., (9/15, milk, Urbana, Dairy_land) is a base cell; (9/15, milk, Urbana, *),
(*, milk, Urbana, *), (*, milk, Chicago, *), and (*, milk, *, *) are aggregate
cells and ancestors of it
(Figure: multiway array aggregation: the cuboid lattice all, A, B, C, AB, AC,
BC, ABC computed over an array chunked along a0–a3 and b0–b2)
Iceberg Cube
In a data cube, many aggregate cells are
trivial
Having a too-small aggregate value
Iceberg query: compute only the cells whose aggregate
passes a threshold
BUC
Once a base table (A, B, C) is sorted by A-B-C, the
aggregates (*,*,*), (A,*,*), (A,B,*), and (A,B,C) can be
computed with one scan and 4 counters
To compute other aggregates, we can sort the base
table in some other orders
Example
Threshold: sum() >= 300
Location   Year  Color   Amount
Vancouver  2015  Yellow  300
Victoria   2014  Red     400
Seattle    2015  Green   120
Vancouver  2014  Green   260
Seattle    2015  Red     160
Vancouver  2014  Yellow  280
Vancouver  2015  Red     160
Sorted by Location:
Location   Year  Color   Amount
Seattle    2015  Green   120
Seattle    2015  Red     160
Vancouver  2015  Yellow  300
Vancouver  2014  Yellow  280
Vancouver  2015  Red     160
Vancouver  2014  Green   260
Victoria   2014  Red     400
Sum(Seattle, *, *) = 280
Sum(Vancouver, *, *) = 1000
Sum(Victoria, *, *) = 400
(The following steps re-sort the same rows, e.g., by Location then Year, and
by Location then Color, computing the remaining group-bys with one scan each)
Clustering
Community Detection
http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-socialmedia-1-728.jpg?cb=1308736811
What Is Clustering?
Group data into clusters
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Unsupervised learning: no predefined classes
Outliers
Cluster 1
Cluster 2
Requirements of Clustering
Scalability
Ability to deal with various types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge
to determine input parameters
Data Matrix
For memory-based clustering
Also called object-by-variable structure
An n × p matrix, one row per object and one column per variable:
[ x_11 … x_1f … x_1p ]
[  ⋮        ⋮        ]
[ x_i1 … x_if … x_ip ]
[  ⋮        ⋮        ]
[ x_n1 … x_nf … x_np ]
210
Dissimilarity Matrix
For memory-based clustering
Also called object-by-object structure
Proximities of pairs of objects
d(i, j): dissimilarity between objects i and j
Nonnegative
0
d (2,1)
Close to 0: similar
0
d (3,1) d (3,2) 0
"
"
"
d (n,1) d (n,2) ! ! 0
211
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued Variables
Continuous measurements of a roughly
linear scale
Weight, height, latitude and longitude
coordinates, temperature, etc.
Standardization
Calculate the mean absolute deviation
s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)
where m_f = (1/n)(x_1f + x_2f + … + x_nf)
Standardized measurement (z-score): z_if = (x_if − m_f) / s_f
Minkowski distance:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)  (q > 0)
If q = 2, d is the Euclidean distance
If q = 1, d is the Manhattan distance
If q = ∞, d is the Chebyshev distance
Weighted distance:
d(i, j) = (w_1|x_i1 − x_j1|^q + w_2|x_i2 − x_j2|^q + … + w_p|x_ip − x_jp|^q)^(1/q)
Chebyshev Distance / Manhattan Distance
When n = 2, the Chebyshev distance is the chess (king-move) distance
Pictures from Wikipedia and http://brainking.com/images/rules/chess/02.gif
Binary Variables
Contingency table for binary data:
                 Object j
                 1      0      Sum
Object i   1     q      r      q+r
           0     s      t      s+t
Sum              q+s    r+t    p
Symmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
Nominal Variables
A generalization of the binary variable in that
it can take more than 2 states, e.g., red,
yellow, blue, green
Method 1: simple matching
d(i, j) = (p − m) / p
m: # of matches, p: total # of variables
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled variables
Replace x_if by its rank r_if ∈ {1, …, M_f}
Map the range of each variable onto [0, 1] by
replacing the i-th object in the f-th variable by
z_if = (r_if − 1) / (M_f − 1)
Ratio-scaled Variables
Ratio-scaled variable: a positive
measurement on a nonlinear scale
E.g., approximately on an exponential scale, such
as A e^(Bt)
Clustering Methods
K-means
Arbitrarily choose k objects as the initial
cluster centers
Until no change, do
(Re)assign each object to the cluster to which
the object is the most similar, based on the
mean value of the objects in the cluster
Update the cluster means, i.e., calculate the
mean value of the objects for each cluster
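A minimal runnable Python sketch of the loop just described (illustrative names; real use would add better initialization such as k-means++):

import random

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)        # arbitrary initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sum((a - b) ** 2
                    for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: recompute each cluster mean
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:                    # no change: converged
            break
        centers = new
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(pts, 2)
print(sorted(centers))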
K-Means: Example (K = 2)
(Figure: arbitrarily choose K objects as initial cluster centers; assign each
object to the most similar center; update the cluster means; reassign objects;
update the means again; repeat until the assignment no longer changes)
A Problem of K-means
Sensitive to outliers
May substantially distort the distribution of the data
K-medoids: use the most centrally located object in a cluster
(Figure: an outlier pulls a cluster mean away from the bulk of its points)
Swapping Cost
Measure whether o' is better than o as a medoid
Use the squared-error criterion
E = Σ_{i=1..k} Σ_{p ∈ C_i} d(p, o_i)²
Compute E_{o'} − E_o
Negative: swapping brings benefit
PAM: Example (K = 2)
(Figure: arbitrarily choose k objects as initial medoids; assign each remaining
object to the nearest medoid (total cost = 20); randomly select a non-medoid
object O_random; compute the total cost of swapping a medoid O with O_random
(total cost = 26 here, so this swap does not improve quality); swap only if
quality improves; loop until no change)
Hierarchy
An arrangement or classification of things
according to inclusiveness
A natural way of abstraction, summarization,
compression, and simplification for
understanding
Typical setting: organize a given set of
objects to a hierarchy
No or very little supervision
Some heuristic guidance on the quality
of the hierarchy
Hierarchical Clustering
Group data objects into a tree of clusters
Top-down versus bottom-up
(Figure: objects a, b, c, d, e; agglomerative (AGNES) merges a,b → ab and
d,e → de, then c,de → cde, then ab,cde → abcde over steps 0–4; divisive
(DIANA) runs the same hierarchy in reverse, steps 4–0)
Dendrogram
Show how to merge clusters
hierarchically
Decompose data objects into a multilevel nested partitioning (a tree of
clusters)
A clustering of the data objects: cutting
the dendrogram at the desired level
Each connected component forms a cluster
Distance Measures
Minimum distance: d_min(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} d(p, q)
Maximum distance: d_max(C_i, C_j) = max_{p ∈ C_i, q ∈ C_j} d(p, q)
Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
Average distance: d_avg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p ∈ C_i} Σ_{q ∈ C_j} d(p, q)
Challenges
Hard to choose merge/split points
Never undo merging/splitting
Merging/splitting decisions are critical
BIRCH
Balanced Iterative Reducing and Clustering
using Hierarchies
CF (Clustering Feature) tree: a hierarchical
data structure summarizing object
information
Clustering objects reduces to clustering the leaf nodes of the
CF tree
CF = (N, LS, SS): N = number of points, LS = Σ_{i=1..N} o_i (linear sum),
SS = Σ_{i=1..N} o_i² (componentwise square sum)
Example: the five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8) give
CF = (5, (16,30), (54,190))
CF-tree in BIRCH
Clustering features
Summarize the statistics for a cluster
Many cluster quality measures (e.g., radius, diameter)
can be derived
Additivity: CF1 + CF2 = (N1 + N2, L1 + L2, SS1 + SS2)
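A tiny runnable Python check of the CF definition and its additivity, using the example points above (illustrative names):

def cf(points):
    # CF = (N, LS, SS) with componentwise linear and square sums
    n = len(points)
    ls = tuple(map(sum, zip(*points)))
    ss = tuple(sum(x * x for x in dim) for dim in zip(*points))
    return n, ls, ss

def cf_add(a, b):
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

c1 = cf([(3, 4), (2, 6), (4, 5)])
c2 = cf([(4, 7), (3, 8)])
print(cf_add(c1, c2))   # (5, (16, 30), (54, 190)) -- matches the slide's CF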
CF Tree
(Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6; the
root and non-leaf nodes hold entries CF1, CF2, … each with a child pointer;
leaf nodes hold CF entries and prev/next pointers linking the leaves)
Parameters of a CF-tree
Branching factor: the maximum number of
children
Threshold: max diameter of sub-clusters
stored at the leaf nodes
BIRCH Clustering
Phase 1: scan the DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve the
inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
(Figure: with MinPts = 3 and Eps = 1 cm, q is directly density-reachable
from core point p)
Density-Based Clustering
Density-reachable
If p1 → p2, p2 → p3, …, p(n−1) → pn are each directly density-reachable,
then pn is density-reachable from p1
Density-connected
If points p and q are both density-reachable from o, then p and q
are density-connected
DBSCAN
A cluster: a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial
databases with noise
(Figure: core, border, and outlier points, Eps = 1 cm, MinPts = 5)
Biclustering
Clustering both objects and attributes
simultaneously
Four requirements
Only a small set of objects in a cluster (bicluster)
A bicluster only involves a small number of
attributes
An object may participate in multiple biclusters
or no biclusters
An attribute may be involved in multiple
biclusters, or no biclusters
Application Examples
Recommender systems
Objects: users
Attributes: items
Values: user ratings
Microarray data
Objects: genes
Attributes: samples/conditions
Values: expression levels
(Figure: an n × m gene-by-sample matrix with entries w_11 … w_nm)
(Example: customers b6, b12, b36, b99 all rate products a1, a33, a86 with
the same value 60: a bicluster)
AllElectronics is highly interested in finding a group of customers who all
like the same group of products. Such a cluster is a submatrix in the
customer-product matrix, where all elements have a high value. Using such a
cluster, AllElectronics can make recommendations in two directions. First,
the company can recommend products to new customers who are similar to the
customers in the cluster. Second, the company can recommend to customers new
products that are similar to those involved in the cluster.
Figure 11.6: A bi-cluster with constant values on rows (e.g., a row of 10s,
a row of 20s, a row of 50s, a row of 0s)
(Figure: a bi-cluster with coherent values, e.g., defined using
multiplication: w_ij = c_i · c_j)
Pattern-based Clusters
pScore: the similarity between two objects r_x, r_y on two attributes a_u, a_v
pScore( [ r_x.a_u  r_x.a_v ; r_y.a_u  r_y.a_v ] ) = | (r_x.a_u − r_y.a_u) − (r_x.a_v − r_y.a_v) |
δ-pCluster (δ ≥ 0): every such 2 × 2 submatrix has pScore ≤ δ
Maximal pCluster
If (R, D) is a δ-pCluster, then every sub-cluster (R', D') with
R' ⊆ R and D' ⊆ D is a δ-pCluster
An anti-monotonic property
A large pCluster is accompanied by many
small pClusters: inefficacious
CLIQUE
Clustering In QUEst
Automatically identify subspaces of a high
dimensional data space
Both density-based and grid-based
Identify clusters:
Determine dense units in all subspaces of interest, and
connected dense units in those subspaces
CLIQUE: An Example
(Figure: salary ($10,000s, 0–7) vs. age and vacation (weeks, 0–7) vs. age,
ages 20–60; dense units are found in each 2-D subspace, and their
intersection identifies a candidate cluster in the 3-D
(age, salary, vacation) space)
Fuzzy Clustering
Each point x_i takes a probability w_ij to belong to a cluster C_j
Requirements
For each point x_i, Σ_{j=1..k} w_ij = 1
For each cluster C_j, 0 < Σ_i w_ij < m (no empty and no all-absorbing cluster)
Critical Details
Optimization on the sum of the squared error (SSE):
SSE(C_1, …, C_k) = Σ_{j=1..k} Σ_{i=1..m} w_ij^p dist(x_i, c_j)²
Computing centroids:
c_j = Σ_{i=1..m} w_ij^p x_i / Σ_{i=1..m} w_ij^p
Updating the fuzzy pseudo-partition:
w_ij = (1 / dist(x_i, c_j)²)^(1/(p−1)) / Σ_{q=1..k} (1 / dist(x_i, c_q)²)^(1/(p−1))
When p = 2:
w_ij = (1 / dist(x_i, c_j)²) / Σ_{q=1..k} (1 / dist(x_i, c_q)²)
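A runnable Python sketch of one fuzzy c-means iteration implementing the two updates above (illustrative names; a tiny epsilon guards against zero distances):

def fcm_step(points, centers, p=2):
    # One round: update memberships w_ij, then recompute centroids c_j
    k, n, dims = len(centers), len(points), len(points[0])
    d2 = [[sum((a - b) ** 2 for a, b in zip(x, c)) or 1e-12 for c in centers]
          for x in points]
    w = [[(1 / d2[i][j]) ** (1 / (p - 1)) /
          sum((1 / d2[i][q]) ** (1 / (p - 1)) for q in range(k))
          for j in range(k)] for i in range(n)]
    centers = [tuple(sum(w[i][j] ** p * points[i][d] for i in range(n)) /
                     sum(w[i][j] ** p for i in range(n))
                     for d in range(dims)) for j in range(k)]
    return w, centers

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
w, c = fcm_step(pts, [(0.0, 0.5), (5.0, 5.0)])
print([round(x, 3) for x in w[0]])   # memberships of the first point sum to 1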
Choice of P
When p → 1, FCM behaves like traditional k-means
When p is larger, the cluster centroids approach the
global centroid of all data points
The partition becomes fuzzier as p increases
Effectiveness
Is a Clustering Good?
Feasibility
Applying any clustering methods on a uniformly
distributed data set is meaningless
Quality
Are the clustering results meeting users' interests?
Clustering patients into clusters corresponding to
various diseases or sub-phenotypes is meaningful
Clustering patients into clusters corresponding to
male or female is not meaningful
Major Tasks
Assessing clustering tendency
Are there non-random structures in the data?
Hopkins Statistic
Null hypothesis: the data set D is generated by a
uniform distribution in a space
Sample n points, p_1, …, p_n, uniformly from
the space containing D
For each point p_i, find its nearest neighbor in D;
let x_i be the distance between p_i and that neighbor:
x_i = min_{v ∈ D} dist(p_i, v)
Hopkins Statistic
Sample n points, q_1, …, q_n, uniformly from D
For each q_i, find its nearest neighbor in D − {q_i};
let y_i be the distance between q_i and that neighbor:
y_i = min_{v ∈ D, v ≠ q_i} dist(q_i, v)
H = Σ_{i=1..n} y_i / (Σ_{i=1..n} x_i + Σ_{i=1..n} y_i)
Explanation
If D is uniformly distributed, then Σ_{i=1..n} x_i and
Σ_{i=1..n} y_i would be close to each other, and thus H ≈ 0.5
If D is highly skewed (clustered), the y_i values would be
substantially smaller than the x_i values, and H would be close to 0
Determining the number of clusters: many methods exist
A simple rule of thumb: set k = √(n/2), so each cluster has
√(2n) points on average
Plot the sum of within-cluster variances with respect to k;
find the first (or the most significant) turning point
A Cross-Validation Method
Divide the data set D into m parts
Use m − 1 parts to find a clustering
Use the remaining part as the test set to test
the quality of the clustering
For each point in the test set, find the closest
centroid or cluster center
Use the squared distances between all points in the
test set and the corresponding centroids to measure
how well the clustering model fits the test set
C is a clustering on D
C(oi) is the cluster-id of oi in C
BCubed Precision and Recall
Let L(o_i) be the ground-truth category of o_i and C(o_i) its cluster-id
Correctness(o_i, o_j) = 1 if (L(o_i) = L(o_j)) ⇔ (C(o_i) = C(o_j)), else 0
BCubed precision:
Precision BCubed = (1/n) Σ_{i=1..n} [ Σ_{o_j: i ≠ j, C(o_i) = C(o_j)} Correctness(o_i, o_j) / |{o_j | i ≠ j, C(o_i) = C(o_j)}| ]
BCubed recall:
Recall BCubed = (1/n) Σ_{i=1..n} [ Σ_{o_j: i ≠ j, L(o_i) = L(o_j)} Correctness(o_i, o_j) / |{o_j | i ≠ j, L(o_i) = L(o_j)}| ]
Intrinsic Methods
When the ground truth of a data set is not available, we have to use an
intrinsic method to assess the clustering quality
Silhouette Coefficient
No ground truth is assumed
Suppose a data set D of n objects is partitioned
into k clusters, C1, , Ck
For each object o,
Calculate a(o), the average distance between o and
every other object in the same cluster
compactness of a cluster, the smaller, the better
Calculate b(o), the minimum average distance from
o to all objects in a cluster that o does not belong
to: degree of separation from other clusters; the
larger, the better
Silhouette Coefficient
a(o) = Σ_{o' ∈ C_i, o' ≠ o} dist(o, o') / (|C_i| − 1)
b(o) = min_{C_j: o ∉ C_j} { Σ_{o' ∈ C_j} dist(o, o') / |C_j| }
Then
s(o) = (b(o) − a(o)) / max{a(o), b(o)}
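A short runnable Python sketch of the coefficient for a single object (illustrative names):

def silhouette(o, own, others, dist):
    # a(o): mean distance to the rest of o's cluster (compactness)
    a = sum(dist(o, x) for x in own if x != o) / (len(own) - 1)
    # b(o): smallest mean distance to another cluster (separation)
    b = min(sum(dist(o, x) for x in c) / len(c) for c in others)
    return (b - a) / max(a, b)

d = lambda p, q: sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5
c1, c2 = [(0, 0), (0, 1), (1, 0)], [(8, 8), (9, 9)]
print(round(silhouette((0, 0), c1, [c2], d), 3))   # close to 1: well placed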
Classification
Model Construction
(Figure: training data → classification algorithm → classifier (model))
Training Data:
Name  Rank        Years  Tenured
Mike  Ass. Prof   3      No
Mary  Ass. Prof   7      Yes
Bill  Prof        2      Yes
Jim   Asso. Prof  7      Yes
Dave  Ass. Prof   6      No
Anne  Asso. Prof  3      No
Classifier (model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Model Application
(Figure: the classifier is validated on testing data, then applied to unseen data)
Testing Data:
Name     Rank        Years  Tenured
Tom      Ass. Prof   2      No
Merlisa  Asso. Prof  7      No
George   Prof        5      Yes
Joseph   Ass. Prof   7      Yes
Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised/Unsupervised Learning
Supervised learning (classification)
Supervision: objects in the training data set have
labels
New data is classified based on the training set
Data Preparation
Data cleaning
Preprocess data in order to reduce noise and
handle missing values
Data transformation
Generalize and/or normalize data
Measurements of Quality
Prediction accuracy
Speed and scalability
Construction speed and application speed
Decision Tree
A node in the tree: a test of some attribute
A branch: a possible value of the attribute
Classification
Start at the root
Test the attribute
Move down the tree branch
Example tree for PlayTennis:
Outlook = Sunny → Humidity: High → No; Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain → Wind: Strong → No; Weak → Yes
Training Dataset
Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No
Appropriate Problems
Instances are represented by attribute-value
pairs
Extensions of decision trees can handle
real-valued attributes
Entropy
Measures homogeneity of examples
Entropy(S) ≡ Σ_{i=1..c} −p_i log2 p_i
Information Gain
The expected reduction in entropy caused
by partitioning the examples according to an
attribute
Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
Example (using the 14-example PlayTennis table above)
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94
Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
= Entropy(S) − (8/14) Entropy(S_Weak) − (6/14) Entropy(S_Strong)
= 0.94 − (8/14) × 0.811 − (6/14) × 1.00 = 0.048
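A runnable Python sketch reproducing these numbers (illustrative names):

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    # Expected entropy reduction when splitting on the given attribute values
    n = len(labels)
    g = entropy(labels)
    for v in set(values):
        sub = [l for a, l in zip(values, labels) if a == v]
        g -= len(sub) / n * entropy(sub)
    return g

wind = ['Weak','Strong','Weak','Weak','Weak','Strong','Strong',
        'Weak','Weak','Weak','Strong','Strong','Weak','Strong']
play = ['No','No','Yes','Yes','Yes','No','Yes',
        'No','Yes','Yes','Yes','Yes','Yes','No']
print(round(entropy(play), 3))      # 0.94
print(round(gain(wind, play), 3))   # 0.048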
Natural Bias
The information gain measure favors
attributes with many values
An extreme example
Attribute date may have the highest
information gain
A very broad decision tree of depth one
Inapplicable to any future data
Alternative Measures
Gain ratio: penalize attributes like date by
incorporating split information
SplitInformation(S, A) ≡ −Σ_{i=1..c} (|S_i| / |S|) log2(|S_i| / |S|)
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
Measuring Inequality
Lorenz Curve
X-axis: quintiles
Y-axis: cumulative share of income earned by the plotted quintile
Gap between the actual line and the mythical (equality) line: the
degree of inequality
Gini index of a split:
gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
Inductive Bias
The set of assumptions that, together with
the training data, deductively justifies the
classification to future instances
Preferences of the classifier construction
Overfitting
A decision tree T may overfit the training data
if there exists an alternative tree T' such that T
has a higher accuracy than T' over the training
examples, but T' has a higher accuracy than T
over the entire distribution of data
Why overfitting?
Noisy data
Bias in training data
(Figure: accuracy of T versus T' on the training data and on all data)
Holdout Method
Partition the available labeled data set into
two disjoint subsets: the training set and the
test set
50-50
2/3 for training and 1/3 for testing
Cross-Validation
Each record is used the same number of times for
training and exactly once for testing
K-fold cross-validation
Partition the data into k equal-sized subsets
In each round, use one subset as the test set, and use
the rest subsets together as the training set
Repeat k times
The total error is the sum of the errors in k rounds
Leave-one-out: k = n
Utilize as much data as possible for training
Computationally expensive
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Fallout
Type I error (false positive): a negative
object is classified as positive
Fallout: the type I error rate, FP / (FP + TN)
F Measure
How can we summarize precision and recall into one metric?
Using the harmonic mean between the two:
F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FP + FN)
Fβ measure:
Fβ = (β² + 1) r p / (β² p + r) = (β² + 1) TP / ((β² + 1) TP + β² FN + FP)
β = 0: F is the precision
β = ∞: F is the recall
0 < β < ∞: F is a tradeoff between the precision and the recall
Weighted Accuracy
A more general metric:
Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
Measure    w1      w2   w3  w4
Recall     1       1    0   0
Precision  1       0    1   0
Fβ         β² + 1  β²   1   0
Accuracy   1       1    1   1
ROC Curve
Receiver Operating Characteristic (ROC)
1-dimensional data set containing 2
classes. Any points located at x > t is
classified as positive
ROC Curve
(TP, FP):
(0,0): declare everything to be the negative class
(1,1): declare everything to be the positive class
(1,0): ideal
Diagonal line: random guessing
Below diagonal line: prediction is opposite of the true class
Cost-Sensitive Learning
In some applications, misclassifying some
classes may be disastrous
Tumor detection, fraud detection
(Example cost matrix: predicting an actual Yes correctly costs −1,
missing an actual Yes costs 100)
Oversampling
Replicate the positive examples until the
training set has an equal number of positive
and negative examples
For noisy data, may cause overfitting
Errors in Classification
Bias: the difference between the real class
boundary and the decision boundary of a
classification model
Variance: variability in the training data set
Intrinsic noise in the target class: the target
class can be non-deterministic; instances
with the same attribute values can have
different class labels
One or More?
What if a medical doctor is not sure about a case?
Joint-diagnosis: using a group of doctors carrying
different expertise
Wisdom of the crowd is often more accurate
Ensemble Classifiers
Step 1: create multiple data sets D1, D2, …, Dt from the original training data D
Step 2: build multiple classifiers C1, C2, …, Ct
Step 3: combine the classifiers:
C*(x) = Vote(C1(x), …, Ct(x))
Example: with 25 independent base classifiers, each with error rate ε = 0.35,
the ensemble errs only when at least 13 base classifiers err:
Σ_{i=13..25} (25 choose i) 0.35^i × 0.65^(25−i) = 0.06
Bootstrap
Given an original training set T, derive a
training set T' by repeatedly, uniformly
sampling with replacement
If T has n tuples, each tuple has a probability
p = 1 − (1 − 1/n)^n of being selected in T'
When n → ∞, p → 1 − 1/e ≈ 0.632
Bootstrap
Use a bootstrap sample as the training set,
use the tuples not in the training set as the
test set
.632 bootstrap: compute the overall
accuracy by combining the accuracies of
each bootstrap sample with the accuracy
computed from a classifier using the whole
data set as the training set:
Acc_{.632 bootstrap} = (1/k) Σ_{i=1..k} (0.632 × Acc_i + 0.368 × Acc_all)
Bagging
Run bootstrap k times to obtain k base classifiers
A test instance is assigned to the class that
receives the highest number of votes
Strength: reduce the variance of base classifiers
good for unstable base classifiers
Unstable classifiers: sensitive to minor perturbations in
the training set, e.g., decision trees, associative
classifiers, and ANN
Boosting
Assign a weight to each training example
Initially, each example is assigned a weight 1/n
AdaBoost
Each base classifier carries an importance
score related to its error rate
Error rate: ε_i = (1/N) Σ_{j=1..N} w_j I(C_i(x_j) ≠ y_j)
w_j: weight, I(p) = 1 if p is true
Importance score: α_i = (1/2) ln((1 − ε_i) / ε_i)
Weight update after round j:
w_i^(j+1) = (w_i^(j) / Z_j) × exp(−α_j) if C_j(x_i) = y_i
w_i^(j+1) = (w_i^(j) / Z_j) × exp(α_j) if C_j(x_i) ≠ y_i
where Z_j is the normalization factor so that Σ_i w_i^(j+1) = 1
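A runnable Python sketch of one boosting round implementing these formulas (illustrative names; weights already sum to 1 here, so the 1/N factor is folded into them):

from math import log, exp

def adaboost_round(w, preds, labels):
    # Error, importance score, and the weight update from the slide
    err = sum(wi for wi, p, y in zip(w, preds, labels) if p != y)
    alpha = 0.5 * log((1 - err) / err)
    w = [wi * (exp(-alpha) if p == y else exp(alpha))
         for wi, p, y in zip(w, preds, labels)]
    z = sum(w)                       # normalization factor Z_j
    return [wi / z for wi in w], alpha

w = [1 / 5] * 5                      # initial weights 1/n
w, alpha = adaboost_round(w, [1, 1, -1, -1, -1], [1, 1, 1, -1, -1])
print(round(alpha, 3), [round(x, 3) for x in w])  # the misclassified point gains weight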
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)
(The 14-example PlayTennis table shown earlier is reused here as the
training data for naïve Bayes)
Smoothing
Suppose an attribute has n different values: a1, …, an
Assume a small enough value ε > 0
Let Pi be the frequency of ai:
Pi = # tuples having ai / total # of tuples
Estimate P(ai) = ε × (1/n) + (1 − ε) × Pi
(the estimates stay positive and still sum to 1)
The error of a decision threshold x̂ between the two classes:
Err = ∫_0^x̂ P(Crocodile | X) dX + ∫_x̂^1 P(Alligator | X) dX
Cons
A (too) strong assumption: independent
attributes
Associative Classification
Mine possible association rules (PRs) of the
form condset ⇒ c
Condset: a set of attribute-value pairs
c: class label
Build classifier
Organize rules according to decreasing
precedence based on confidence and support
Classification
Use the first matching rule to classify an
unknown case
Instance-based Methods
Instance-based learning
Store training examples and delay the processing until a
new instance must be classified ("lazy evaluation")
Typical approaches
K-nearest neighbor approach
Instances represented as points in a Euclidean space
Case-based reasoning
Use symbolic representations and knowledge-based inference
(Figure: a query point x_q surrounded by + and − training examples; its class
is decided by its k nearest neighbors)
KNN Methods
For continuous-valued target functions, return the
mean value of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Give greater weight to closer neighbors, e.g.,
w ≡ 1 / d(x_q, x_i)²
Outlier Detection
http://i.imgur.com/ckkoAOp.gif
http://i.stack.imgur.com/tRDGU.png
Outlier Analysis
One person's noise is another person's signal
Outliers: the objects considerably dissimilar
from the remainder of the data
Examples: credit card fraud, Michael Jordan,
intrusions, etc.
Applications: credit card fraud detection, telecom
fraud detection, intrusion detection, customer
segmentation, medical analysis, etc.
Types of Outliers
Three kinds: global, contextual and collective
outliers
A data set may have multiple types of outlier
One object may belong to more than one type of
outlier
Contextual Outliers
An outlier object deviates significantly based on a
selected context
Ex. Is 10°C in Vancouver an outlier? (it depends: summer or
winter?)
Collective Outliers
A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
Application example: intrusion detection when a
number of computers keep sending denial-of-service
packets to each other
Understandability
Understand why these are outliers: Justification of
the detection
Specify the degree of an outlier: the unlikelihood of
the object being generated by a normal mechanism
Supervised Methods
Modeling outlier detection as a classification problem
Samples examined by domain experts used for training & testing
Challenges
Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)
Unsupervised Methods
Assume the normal objects are somewhat
"clustered" into multiple groups, each having some
distinct features
An outlier is expected to be far away from any
groups of normal objects
Weakness: Cannot detect collective outlier effectively
Normal objects may not share any strong patterns, but
the collective outliers may share high similarity in a small
area
Challenges
Hard to distinguish noise from outliers
Costly, since it clusters first, but there are far fewer outliers
than normal objects
Semi-Supervised Methods
In many applications, the number of labeled data is often
small
Labels could be on outliers only, normal objects only, or both
Proximity-based Methods
An object is an outlier if the nearest
neighbors of the object are far away, i.e., the
proximity of the object significantly
deviates from the proximity of most of the
other objects in the same data set
Clustering-based Methods
Normal data belong to large and dense
clusters, whereas outliers belong to small or
sparse clusters, or do not belong to any
clusters
Challenges
Since there are many clustering methods,
there are many clustering-based outlier
detection methods as well
Clustering is expensive: straightforward
adaptation of a clustering method for outlier
detection can be costly and does not scale
up well to large data sets
Example
Statistical methods (also known as model-based
methods) assume that the normal
data follow some statistical model
The data not following the model are outliers
Parametric Methods
Assumption: the normal data is generated by
a parametric distribution with parameter Θ
The probability density function of the
parametric distribution, f(x | Θ), gives the
probability that object x is generated by the
distribution
The smaller this value, the more likely x is an
outlier
For a univariate Gaussian, the log-likelihood is
ln L(μ, σ²) = Σ_{i=1..n} ln f(x_i | (μ, σ²))
            = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) Σ_{i=1..n} (x_i − μ)²
Maximizing it gives the estimates
μ̂ = x̄ = (1/n) Σ_{i=1..n} x_i
σ̂² = (1/n) Σ_{i=1..n} (x_i − x̄)²
Example
Daily average temperatures: {24.0, 28.9, 28.9,
29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
With n = 10: μ̂ = 28.61, σ̂² = 2.29, σ̂ = √2.29 ≈ 1.51
Then (24 − 28.61) / 1.51 ≈ −3.04, and |−3.04| > 3, so 24.0 is
an outlier, since the region μ ± 3σ contains 99.7% of the data
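A runnable Python sketch of this z-score check (illustrative names; direct computation gives slightly different rounding than the slide's figures):

def zscores(data):
    n = len(data)
    mu = sum(data) / n
    sigma = (sum((x - mu) ** 2 for x in data) / n) ** 0.5   # MLE estimates
    return mu, sigma, [(x - mu) / sigma for x in data]

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
mu, sigma, z = zscores(temps)
print(round(mu, 2), round(sigma, 2), round(z[0], 2))
# 28.61 1.54 -2.98 (the slide reports sigma = 1.51 and z = -3.04)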
Grubb's test: an object x is an outlier if its z-score z = |x − x̄| / s satisfies
z ≥ ((N − 1) / √N) × √( t²_{α/(2N),N−2} / (N − 2 + t²_{α/(2N),N−2}) )
where t²_{α/(2N),N−2} is the value taken by a t-distribution at a
significance level of α/(2N), and N is the number of objects in the data set
Non-parametric Method
Not assume an a-priori statistical model,
instead, determine the model from the input
data
Not completely parameter-free, but the number and
nature of the parameters are flexible and not fixed
in advance
Histogram
A transaction in the amount of $7,500 is an
outlier, since only 0.2% transactions have an
amount higher than $5,000
Challenges
Hard to choose an appropriate bin size for
histogram
Too small a bin size → normal objects fall in empty or
rare bins: false positives
Too big a bin size → outliers fall in frequent
bins: false negatives
Depth-based Methods
Organize data objects in layers with various
depths
The shallow layers are more likely to contain
outliers
Distance-based Outliers
A DB(p, D)-outlier is an object O in a dataset
T such that at least a fraction p of the objects
in T lie at a distance greater than distance D
from O
The larger D, the more outlying
The larger p, the more outlying
Intuition
Compare outliers to their local
neighborhoods, instead of the global data
distribution
The density around an outlier object is
significantly different from the density around
its neighbors
Use the relative density of an object against
its neighbors as the indicator of the degree
of the object being an outlier
One-Class Model
A classifier is built to describe only the normal class
Learn the decision boundary of the normal class
using classification methods such as SVM
Any samples that do not belong to the normal class
(not within the decision boundary) are declared as
outliers
Advantage: can detect new outliers that may not
appear close to any outlier objects in the training set
Extension: Normal objects may belong to multiple
classes
Example
Detect outlier customers in the context of
customer groups
Contextual attributes: age group, postal code
Behavioral attributes: the number of transactions per
year, annual total transaction amount
Method
Locate c's context;
Compare c with the other customers in the same
group; and
Use a conventional outlier detection method
Collective Outliers
Objects as a group deviate significantly from
the entire data
Examine the structure of the data set, i.e., the
relationships between multiple data objects
The structures are often not explicitly defined,
and have to be discovered as part of the outlier
detection process
Data subspaces
Local behavior and patterns of data
Angle-based Outliers