Classification
Tomi Yiu
CS 632 Advanced Database Systems
April 5, 2001
Papers
Outline
Classification problem
General decision tree model
Decision tree classifiers
SLIQ
SPRINT
VFDT (Hoeffding Tree Algorithm)
Classification Problem
A set of attributes
A class label
A Training set
Age  Car Type  Risk
23   Family    High
17   Sports    High
43   Sports    High
68   Family    Low
32   Truck     Low
20   Family    High
Classification Models
Neural networks
Statistical models
linear/quadratic discriminants
Decision trees
Genetic models
A Decision Tree
Age < 25
High
Low
Splitting Index
SLIQ - Overview
Data Structure
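Age attribute list      Car Type attribute list      Class list
Age  Class List Index   Car Type  Class List Index   Index  Class  Leaf
23   1                  Family    1                  1      High   N1
17   2                  Sports    2                  2      High   N1
43   3                  Sports    3                  3      High   N1
68   4                  Family    4                  4      Low    N1
32   5                  Truck     5                  5      Low    N1
20   6                  Family    6                  6      High   N1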
Pre-sorting
After Pre-sorting
Age attribute list      Car Type attribute list      Class list
Age  Class List Index   Car Type  Class List Index   Index  Class  Leaf
17   2                  Family    1                  1      High   N1
20   6                  Sports    2                  2      High   N1
23   1                  Sports    3                  3      High   N1
32   5                  Family    4                  4      Low    N1
43   3                  Truck     5                  5      Low    N1
68   4                  Family    6                  6      High   N1
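A minimal Python sketch of the pre-sorting step, assuming the six-record training set from the Classification Problem slide. The function name `build_sliq_lists` and the dictionary layout of the class list are illustrative choices, not taken from the slides. Only the numeric attribute list is sorted; the class list stays in record order, and each attribute-list entry carries the index of its class-list entry.

```python
# Training set from the slides: (age, car_type, risk).
TRAINING_SET = [
    (23, "Family", "High"),
    (17, "Sports", "High"),
    (43, "Sports", "High"),
    (68, "Family", "Low"),
    (32, "Truck", "Low"),
    (20, "Family", "High"),
]


def build_sliq_lists(records):
    """Build SLIQ-style lists with 1-based class-list indexes (hypothetical helper)."""
    # Class list: one entry per record, all records start at the root leaf N1.
    class_list = {
        index: {"class": risk, "leaf": "N1"}
        for index, (_, _, risk) in enumerate(records, start=1)
    }
    # Numeric attribute list: (value, class-list index), sorted once by value.
    age_list = sorted((age, index) for index, (age, _, _) in enumerate(records, start=1))
    # Categorical attribute list: no sorting needed.
    car_list = [(car, index) for index, (_, car, _) in enumerate(records, start=1)]
    return age_list, car_list, class_list


age_list, car_list, class_list = build_sliq_lists(TRAINING_SET)
```

After the call, `age_list` is ordered by age while `class_list` is still addressed by the original record index.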
Node Split
Class Histogram
Evaluate Split
for each attribute A do
    traverse attribute list of A
    for each value v in the attribute list do
        find the corresponding class and leaf node l in the class list
        update the class histogram in the leaf l
        if A is a numeric attribute then
            compute splitting index for test (A <= v) for leaf l
    if A is a categorical attribute then
        for each leaf of the tree do
            find subset of A with the best split
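The numeric branch of the loop above can be sketched in Python for a single leaf. This is a simplified illustration, not the full algorithm: it assumes the gini index as the splitting index, handles only one leaf, and the names `gini` and `best_numeric_split` are hypothetical. The class histogram is kept as two counters (records at or below the candidate value, and records above it) that are updated in one pass over the pre-sorted attribute list.

```python
from collections import Counter


def gini(counts):
    """Gini index of a class histogram (assumed splitting index)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_numeric_split(attr_list, class_list):
    """Scan a pre-sorted [(value, index)] list once; return (threshold, score).

    The returned threshold v describes the test `A <= v`; a lower score is better.
    """
    above = Counter(class_list[index] for _, index in attr_list)
    below = Counter()
    n = len(attr_list)
    best = (None, float("inf"))
    for pos, (value, index) in enumerate(attr_list):
        label = class_list[index]
        below[label] += 1   # update the histogram for this record
        above[label] -= 1
        if pos + 1 == n:
            break           # every record on one side is not a split
        if attr_list[pos + 1][0] == value:
            continue        # cannot split between equal values
        n_below = pos + 1
        score = (n_below * gini(below) + (n - n_below) * gini(above)) / n
        if score < best[1]:
            best = (value, score)
    return best


# Pre-sorted Age list and class labels for the slide's training set.
AGE_LIST = [(17, 2), (20, 6), (23, 1), (32, 5), (43, 3), (68, 4)]
CLASSES = {1: "High", 2: "High", 3: "High", 4: "Low", 5: "Low", 6: "High"}
threshold, score = best_numeric_split(AGE_LIST, CLASSES)
```

On this data the best test is `Age <= 23`, which separates the same records as the `Age < 25` node in the decision tree slide.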
Example of Evaluating Splits
Initial Histogram
Age attribute list      Class list
Age  Index              Index  Class  Leaf
17   2                  1      High   N1
20   6                  2      High   N1
23   1                  3      High   N1
32   5                  4      Low    N1
43   3                  5      Low    N1
68   4                  6      High   N1
Age attribute list      Class list
Age  Index              Index  Class  Leaf
17   2                  1      High   N2
20   6                  2      High   N2
23   1                  3      High   N1
32   5                  4      Low    N1
43   3                  5      Low    N1
68   4                  6      High   N2

New values: N2, N3, N3
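A small Python sketch of how the leaf column of the class list could be updated once a split has been chosen. It assumes the split at N1 is the `Age < 25` test from the decision tree slide, with N2 as the child for records that pass the test and N3 for the rest. The function name `update_class_list` and its parameters are hypothetical.

```python
def update_class_list(attr_list, class_list, leaf, test, child_if_true, child_if_false):
    """Reassign every record currently at `leaf` to one of its two children."""
    for value, index in attr_list:
        entry = class_list[index]
        if entry["leaf"] == leaf:
            entry["leaf"] = child_if_true if test(value) else child_if_false


# Pre-sorted Age attribute list: (age, class-list index).
age_list = [(17, 2), (20, 6), (23, 1), (32, 5), (43, 3), (68, 4)]
class_list = {
    1: {"class": "High", "leaf": "N1"},
    2: {"class": "High", "leaf": "N1"},
    3: {"class": "High", "leaf": "N1"},
    4: {"class": "Low", "leaf": "N1"},
    5: {"class": "Low", "leaf": "N1"},
    6: {"class": "High", "leaf": "N1"},
}

update_class_list(age_list, class_list, "N1", lambda age: age < 25, "N2", "N3")
leaves = {index: entry["leaf"] for index, entry in class_list.items()}
```

Only the class list is rewritten; the attribute list itself is never split or copied in this scheme.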
MDL Principle
Encoding Scheme
Performance (Scalability)
SPRINT - Overview
An Example of Attribute Lists
Age attribute list      Car Type attribute list
Age  Class  rid         Car Type  Class  rid
17   High   1           family    High   0
20   High   5           sports    High   1
23   High   0           sports    High   2
32   Low    4           family    Low    3
43   High   2           truck     Low    4
68   Low    3           family    High   5
Categorical attributes
Evaluate numeric attributes
Evaluate categorical attributes
Attribute List
Car Type  Class  rid
family    High   0
sports    High   1
sports    High   2
family    Low    3
truck     Low    4
family    High   5

Count Matrix
          H  L
family    2  1
sports    2  0
truck     0  1
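The count matrix above can be built in one pass over the categorical attribute list, and candidate subsets can then be scored from the matrix alone. The sketch below is an illustration under assumptions: it uses the gini index as the splitting index and enumerates every subset, which is only practical when the attribute has few distinct values. The names `count_matrix` and `best_subset_split` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Car Type attribute list: (value, class, rid).
CAR_TYPE_LIST = [
    ("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
    ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5),
]


def gini(counts):
    """Gini index of a class histogram (assumed splitting index)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def count_matrix(attr_list):
    """Map each attribute value to a per-class record count."""
    matrix = defaultdict(Counter)
    for value, label, _rid in attr_list:
        matrix[value][label] += 1
    return matrix


def best_subset_split(matrix):
    """Try every proper, non-empty subset of values; return (subset, score)."""
    values = sorted(matrix)
    total = sum(matrix.values(), Counter())
    n = sum(total.values())
    best = (None, float("inf"))
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = sum((matrix[v] for v in subset), Counter())
            right = total - left
            n_left = sum(left.values())
            score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
            if score < best[1]:
                best = (set(subset), score)
    return best


matrix = count_matrix(CAR_TYPE_LIST)
subset, score = best_subset_split(matrix)
```

Each subset and its complement describe the same partition, so an implementation that cares about cost would evaluate only half of them.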
Non-splitting attribute
Parallelizing Classification
Process 0
Car Type  Class  rid    Age  Class  rid
family    High   0      17   High   1
sports    High   1      20   High   5
sports    High   2      23   High   0

Process 1
Car Type  Class  rid    Age  Class  rid
family    Low    3      32   Low    4
truck     Low    4      43   High   2
family    High   5      68   Low    3
Example of Histograms in Parallel Classification

Process 0
Age  Class  rid
17   High   1
20   High   5
23   High   0

         H  L
Cbelow   0  0
Cabove   4  2

Process 1
Age  Class  rid
32   Low    4
43   High   2
68   Low    3

         H  L
Cbelow   3  0
Cabove   1  2
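The starting histograms above follow from the class counts of each process's section of the sorted Age list: Cbelow holds the counts of all sections on lower-numbered processes, and Cabove holds everything else. The sketch below simulates this in a single Python process; a real parallel run would exchange the local counts between processes instead. The function name `initial_histograms` is hypothetical.

```python
from collections import Counter

# Sorted Age attribute list, split into contiguous sections: (age, class, rid).
PARTITIONS = [
    [(17, "High", 1), (20, "High", 5), (23, "High", 0)],   # Process 0
    [(32, "Low", 4), (43, "High", 2), (68, "Low", 3)],     # Process 1
]


def initial_histograms(partitions, classes=("High", "Low")):
    """Return one (c_below, c_above) pair of class counts per process."""
    local = [Counter(label for _, label, _ in part) for part in partitions]
    total = sum(local, Counter())
    below = Counter()
    result = []
    for counts in local:
        above = total - below  # this section plus all sections after it
        result.append((
            {c: below[c] for c in classes},
            {c: above[c] for c in classes},
        ))
        below = below + counts
    return result


histograms = initial_histograms(PARTITIONS)
```

Each process can then scan its own section independently, moving one record at a time from Cabove to Cbelow.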
Data Streams
Issues
Incremental learning methods
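Hoeffding Bound
\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
R is the range of the observed variable, n the number of independent observations, and 1 - \delta the confidence that the true mean is at least the observed mean minus \epsilon.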
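A short Python sketch of the Hoeffding bound and of how it could drive a split decision. The bound is epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the observed variable, n the number of observations, and delta the allowed error probability. The function names and the `tie_threshold` parameter with its default of 0.05 are assumptions for illustration, not values from the slides.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold=0.05):
    """Split when the best attribute leads by more than epsilon.

    Also split when epsilon has shrunk below tie_threshold, so that two
    near-equal attributes do not block the split indefinitely.
    """
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon or epsilon < tie_threshold


eps_1000 = hoeffding_bound(1.0, 1e-7, 1000)
eps_100000 = hoeffding_bound(1.0, 1e-7, 100000)
```

The bound shrinks with the square root of n, so a leaf that cannot decide between two attributes after 1,000 examples may be able to after 100,000.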
Ties
Computation of G()
Memory
Poor attributes
Initialization
Performance Examples
Performance Nodes
Conclusion
SLIQ
SPRINT
VFDT