[Figure: ROC plot residue; TPR axis, with a label marking the worthless (diagonal) classifier]

• The bootstrap
• Bagging
• Boosting
Holdout

• Test set and training set disjoint: Test set (%) + Training set (%) = 100%
• Select more (66%?) training instances than test instances
• Holdout: the test set is held out from the data used for training
• Problem: what if the dataset is small?

[Figure: success rate vs. training set size (%), from 0 to 100%; two methods compared: the PRISM classification-rule learning algorithm and the naive Bayesian classifier]
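The holdout split above can be sketched in a few lines of Python; the toy dataset of 100 instances and the 66% training fraction are illustrative assumptions:

```python
import random

def holdout_split(dataset, train_fraction=0.66, seed=0):
    """Split a dataset into disjoint training and test sets.

    Training set (%) + test set (%) = 100%; by convention more
    instances (here 66%) go to training than to testing.
    """
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))           # toy dataset of 100 instances
train, test = holdout_split(data)
print(len(train), len(test))      # 66 34
```

The shuffle before cutting matters: without it, any ordering in the dataset (e.g. by class) would make the two sets unrepresentative.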
[Figure: K-fold cross-validation — the data are split into folds; in each round one fold serves as test data and the remaining folds as training data, yielding one performance estimate per fold]
K-fold cross-validation: split up dataset D into K partitions (called folds).

for k = 1, . . . , K:
  1. Train model using K−1 of the folds (exclude fold k)
  2. Evaluate using the remaining fold k (holdout)
  3. Gather performance results

Notation:
• Dataset D, with N = |D|
• f̂^(−k): trained model based on dataset D after removal of fold k
• Indexing function κ : {1, . . . , N} → {1, . . . , K}, which associates fold κ(i) to instance i
• Success rate with 0–1 loss function L:

  σ = (1/N) Σ_{i=1}^{N} L(c_i, f̂^(−κ(i))(x_i))
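The K-fold procedure can be sketched in plain Python; the majority-class learner and the toy labels are illustrative assumptions standing in for any classifier and dataset:

```python
from collections import Counter

def kfold_indices(n, K):
    """Indexing function kappa: assign each instance i a fold in 1..K."""
    return [i % K + 1 for i in range(n)]

def majority_class(labels):
    """Toy 'model': predicts the most common training label for any x."""
    return Counter(labels).most_common(1)[0][0]

def cross_validate(instances, labels, K):
    n = len(instances)
    kappa = kfold_indices(n, K)
    correct = 0
    for k in range(1, K + 1):
        # train on all folds except k (the model trained without fold k)
        train_labels = [labels[i] for i in range(n) if kappa[i] != k]
        prediction = majority_class(train_labels)
        # evaluate on the held-out fold k with 0-1 loss
        correct += sum(1 for i in range(n)
                       if kappa[i] == k and labels[i] == prediction)
    return correct / n   # success rate sigma

xs = list(range(10))
ys = ['a'] * 7 + ['b'] * 3   # 70% majority class 'a'
print(cross_validate(xs, ys, K=5))   # 0.7
```

Each instance is tested exactly once, by the one model that never saw it during training — that is what κ encodes.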
How many folds?
[Figure: success rate (%) vs. number of folds (0–20) for Naive Bayes and ID3; success rates range roughly from 50% to 90%]

Stratification

• Folds not necessarily representative of the whole dataset
• Solution:
  – check whether the folds are representative
  – if not: select new folds (at random)
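A common way to make folds representative is to stratify by class label when assigning instances to folds. A minimal sketch, assuming round-robin dealing per class (one common choice, not the only one):

```python
from collections import defaultdict

def stratified_folds(labels, K):
    """Assign each instance a fold 1..K so that every fold gets a
    near-proportional share of each class (stratification)."""
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    fold_of = [0] * len(labels)
    for c, indices in by_class.items():
        # deal the instances of class c over the folds round-robin
        for j, i in enumerate(indices):
            fold_of[i] = j % K + 1
    return fold_of

labels = ['a'] * 8 + ['b'] * 4
folds = stratified_folds(labels, K=4)
# each of the 4 folds receives 2 'a' instances and 1 'b' instance
for k in range(1, 5):
    members = [labels[i] for i in range(len(labels)) if folds[i] == k]
    print(k, members.count('a'), members.count('b'))
```

With stratification the class proportions in every fold mirror those of the whole dataset by construction, so the "check and redraw" loop is rarely needed.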
The bootstrap

[Figure: the dataset is resampled into a training set; the unselected instances form the test set]

• Yields estimation of uncertainty in learning
• Basic idea:
  1. sample N = |D| times from the dataset with replacement: the result is a training set of size N
  2. instances not selected are taken as the test set
• Expected size of the test set?
• Solution: the probability that a given instance is never selected is (1 − 1/N)^N
  – Newton's binomial theorem:

    (1 − 1/N)^N = Σ_{k=0}^{N} (N choose k) (−1/N)^k
                = Σ_{k=0}^{N} (−1)^k · N(N−1)···(N−k+1) / (k! N^k)
                ≈ Σ_{k=0}^{N} (−1)^k / k!
                ≈ e^{−1}

  so the expected test-set size is approximately e^{−1} N ≈ 0.368 N
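The expected test-set fraction e^{−1} ≈ 0.368 can be checked empirically. A sketch; the dataset size, number of repetitions, and seed are arbitrary choices:

```python
import math
import random

def bootstrap_split(dataset, rng):
    """Sample N times with replacement for the training set; the
    instances never drawn form the test set."""
    n = len(dataset)
    train = [dataset[rng.randrange(n)] for _ in range(n)]
    chosen = set(train)
    test = [x for x in dataset if x not in chosen]
    return train, test

rng = random.Random(42)
n = 1000
fractions = []
for _ in range(100):
    _, test = bootstrap_split(list(range(n)), rng)
    fractions.append(len(test) / n)

avg = sum(fractions) / len(fractions)
print(round(avg, 3), round(math.exp(-1), 3))   # both close to 0.368
```

This 0.368/0.632 split is also why the technique is known as the 0.632 bootstrap: about 63.2% of the distinct instances end up in the training set.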
Bagging: representation

• Bootstrap aggregating
• M_i: multiple experts; output:

  f̂*_m(x) = (1/m) Σ_{k=1}^{m} f̂_k(x)

  is a Monte-Carlo estimate of the true function f* = lim_{m→∞} f̂*_m
• Typical use: voting or averaging

[Figure: models M1, M2, . . . , Mm combined into a single output by voting or averaging]

Bagging: learning

• Use m bootstrap samples to learn m models
• Bias–variance decomposition: bias is fixed for a specific technique, so try to reduce the variance of learning
• Random samples ⇒ slightly different models
• Result: expert models for a particular part of the feature space
• Bagging and naive Bayesian networks?
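Bagging can be sketched by combining bootstrap sampling with the averaging formula f̂*_m(x) = (1/m) Σ f̂_k(x). The 1-nearest-neighbour base learner and the toy y = x² data are illustrative assumptions:

```python
import random

def bag_predict(train_x, train_y, x, m=25, seed=0):
    """Bagged prediction: average the outputs of m models, each fit
    on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(train_x)
    preds = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        sx = [train_x[i] for i in idx]
        sy = [train_y[i] for i in idx]
        # f_k: 1-nearest-neighbour fit on the bootstrap sample
        nearest = min(range(n), key=lambda j: abs(sx[j] - x))
        preds.append(sy[nearest])
    return sum(preds) / m   # averaging (the Monte-Carlo estimate)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]   # y = x^2 on a small grid
print(bag_predict(xs, ys, x=2.4))
```

Each bootstrap sample misses some instances, so each f̂_k is slightly different; the average smooths out that variance, which is exactly the point of bagging.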
Boosting algorithm
• Solution: boosting – learn models that complement each other
• Output: weighted combination of model results
• w(x) are the weights attached to instances

[Figure: repeated weighted sampling produces models M1, M2, . . . , Mm; after each model the instance weights are updated]

Boosting(M, D)
{
  for each x ∈ D do
    w(x) ← initial-value
  for k ← 1, . . . , m do
    E ← Sample(D)
    Mk ← LearnModel(E)
    ε ← ErrorRate(D, Mk)
    if ε ≠ 0 and ε < 0.5 then
      M ← M ∪ {Mk}
      for each x ∈ D do
        if x is classified correctly by Mk then
          w(x) ← w(x) · ε/(1 − ε)
      Renormalise(D)
}
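The reweighting step w(x) ← w(x) · ε/(1 − ε) followed by renormalisation can be sketched as follows; the four instances, uniform initial weights, and the set of correctly classified instances are illustrative assumptions:

```python
def reweight(weights, correct, epsilon):
    """Boosting weight update: down-weight correctly classified
    instances by epsilon/(1 - epsilon), then renormalise so the
    weights sum to 1 again."""
    assert 0 < epsilon < 0.5
    factor = epsilon / (1 - epsilon)
    new = {x: (w * factor if x in correct else w)
           for x, w in weights.items()}
    total = sum(new.values())
    return {x: w / total for x, w in new.items()}

# four instances, uniform initial weights; the model gets x1..x3 right
w = {'x1': 0.25, 'x2': 0.25, 'x3': 0.25, 'x4': 0.25}
w = reweight(w, correct={'x1', 'x2', 'x3'}, epsilon=0.25)
print({k: round(v, 3) for k, v in w.items()})
# the misclassified x4 now carries half of the total weight
```

Since ε < 0.5, the factor ε/(1 − ε) is below 1: correct instances shrink, so after renormalisation the misclassified instances dominate the next sample, which is how later models come to complement earlier ones.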
Classify(M, x)
{
  for each c ∈ C do
    w(c) ← 0
  for k ← 1, . . . , m do
    c ← MostLikelyClass(Mk , x)
    w(c) ← w(c) + wk
  return the class c with the highest weight w(c)
}

where wk is the weight attached to model Mk.
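Classify amounts to a weighted vote. A minimal sketch, where each model is reduced to its fixed prediction for x and wk is its model weight (both illustrative assumptions):

```python
def classify(predictions, model_weights):
    """Weighted vote: each model adds its weight w_k to the class it
    predicts; return the class with the highest total weight."""
    w = {}
    for pred, wk in zip(predictions, model_weights):
        w[pred] = w.get(pred, 0.0) + wk
    return max(w, key=w.get)

# three models predict for the same x, with weights w_k
print(classify(['yes', 'no', 'yes'], [0.2, 0.5, 0.4]))   # 'yes' (0.6 > 0.5)
```

Note that two weak votes for 'yes' outweigh one stronger vote for 'no': the combination depends on the total weight per class, not on any single model.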