
Evaluation

• ROC curves
• Hold out method
• Cross-validation
• The bootstrap
• Bagging
• Boosting

Optimal performance: ROC

• Output of probabilistic classifier:

      c_max = argmax_C P(C | E)

  Refinements may not yield the best performance.

• Alternative: Receiver Operating Characteristic (ROC): determine a threshold d, such that

      C = c    if P(c | E) ≥ d
      C = ¬c   otherwise

[Figure: ROC curve plotting TPR against FPR; the points on the curve are labelled with the thresholds 10, 20, ..., 90; the diagonal marks a worthless classifier]
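A minimal sketch of this thresholding rule and of the (FPR, TPR) point it induces, assuming a binary problem in which probs[i] = P(c | E_i) and labels[i] ∈ {0, 1} (these names are illustrative, not from the slides):

def roc_point(probs, labels, d):
    """Classify as positive when P(c | E) >= d and return the (FPR, TPR) pair."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_positive = p >= d            # threshold rule: C = c iff P(c | E) >= d
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0   # false positive rate
    return fpr, tpr

Sweeping d from 1.0 down to 0.0 traces the ROC curve from (0, 0) to (1, 1).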

Area under the ROC curve


When comparing various techniques:

• actual performance for particular thresholds (cut-off points) may vary

• the area under the ROC curve,

      A_f = ∫_0^1 f(x) dx,

  offers a good measure for comparison, where f describes the relationship between FPR and TPR for the classifier (a small sketch follows below)
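A sketch of the corresponding area computation with the trapezoidal rule, assuming the ROC points were collected with the roc_point sketch above:

def auc(points):
    """Approximate A_f = the integral of f from 0 to 1 with the trapezoidal rule,
    given ROC points (FPR, TPR)."""
    pts = sorted(points)                       # order by increasing FPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

An area of 1.0 corresponds to a perfect ranking; 0.5 (the diagonal) to a worthless classifier.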

Evaluation problem sketch

[Diagram: the Initial Model (Assumptions) and the Training Data enter a Training Process, which yields a Trained Model; the Trained Model and Fresh Data enter an Application Process, which yields the Performance]

• Fresh data differs from the training set (which affects performance)

• Overfitting

• Bias-variance decomposition
Solution 1: holdout method

[Diagram: the Initial Model (Assumptions) and the Training Data enter a Training Process, which yields a Trained Model; the Trained Model and the Test Data enter a Testing Process, which yields a Tested Model and its Performance]

• Test set and training set are disjoint

• Select more (66%?) training instances than test instances

• Holdout: the test set

• Problem: what if the dataset is small?
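A minimal sketch of the holdout split (the function name and the 66% default are illustrative; the slides only suggest using more training than test instances):

import random

def holdout_split(dataset, train_fraction=0.66, seed=0):
    """Split a dataset into disjoint training and test (holdout) sets."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    n_train = int(round(train_fraction * len(data)))
    return data[:n_train], data[n_train:]      # training set, holdout test set

The two returned lists are disjoint by construction; varying the training fraction gives curves like the one in the next figure.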
Size of holdout

[Figure: success rate (%) against training set size (%), from 0 to 100%, where Test set (%) + Training set (%) = 100%]

Two methods compared:

• PRISM classification-rule learning algorithm

• Naive Bayesian classifier

Solution 2: cross-validation

[Diagram: the holdout pipeline is repeated once per fold: the Initial Model (Assumptions) and the Training Data of a fold enter a Training Process yielding a Trained Model, which is then evaluated on that fold's Test Data to obtain a Performance estimate]

K-fold cross-validation: split up dataset D into K partitions (called folds)

for k = 1, . . . , K:

1. Train the model using K − 1 of the folds (exclude fold k)
2. Evaluate using the remaining fold k (holdout)
3. Gather performance results

Performance results

K-fold cross-validation:

• Dataset D, with N = |D|

• f̂^(−k): trained model based on dataset D after removal of fold k

• Indexing function κ : {1, . . . , N} → {1, . . . , K}, which associates fold κ(i) with instance i

• Success rate with 0–1 loss function L:

      σ = (1/N) Σ_{i=1}^{N} L(c_i, f̂^(−κ(i))(x′_i))
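A minimal sketch of the procedure and of the success rate σ, assuming instances are (x_i, c_i) pairs and train(data) is a hypothetical function returning a model with a predict(x) method:

import random

def k_fold_indices(n, k, seed=0):
    """Assign each of n instances to one of k folds (the indexing function kappa)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    kappa = [0] * n
    for fold, i in enumerate(idx):
        kappa[i] = fold % k
    return kappa

def cross_validate(dataset, train, k=10, seed=0):
    """K-fold cross-validation; returns the success rate sigma under 0-1 loss."""
    n = len(dataset)
    kappa = k_fold_indices(n, k, seed)
    correct = 0
    for fold in range(k):
        train_part = [ex for i, ex in enumerate(dataset) if kappa[i] != fold]
        model = train(train_part)                  # f^(-fold): trained without fold k
        for i, (x, c) in enumerate(dataset):
            if kappa[i] == fold and model.predict(x) == c:
                correct += 1                       # instance i correctly classified
    return correct / n                             # success rate sigma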
How many folds?

[Figure: success rate (%) against the number of folds (1-20), for naive Bayes and ID3]

• Common choices: 5 and 10 (5-fold and 10-fold cross-validation)

• Leave-one-out method:
  – N = K (N-fold cross-validation)
  – almost all available data is used
  – easy to implement, no random sampling
  – computationally expensive

Stratification

• Folds are not necessarily representative of the whole dataset

• Solution:
  – check whether the folds are representative
  – if not: select new folds (at random)

• Further elimination of the effect of the random selection of folds: run cross-validation repeatedly

Result: M-time stratified K-fold cross-validation
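One possible sketch of stratified fold selection (assuming the class of instance i is labels[i]): the instances of each class are dealt over the folds in turn, so every fold roughly mirrors the class proportions of the whole dataset.

import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign instances to k folds so each fold reflects the overall class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    kappa = [0] * len(labels)
    for c, indices in by_class.items():
        rng.shuffle(indices)
        for j, i in enumerate(indices):            # deal instances of class c round-robin
            kappa[i] = j % k
    return kappa

Running the cross-validation M times with different seeds and averaging the M success rates gives M-time stratified K-fold cross-validation.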

Solution 3: bootstrapping

[Diagram: the Original Dataset is sampled with replacement to form the Training Dataset; the instances that are never selected form the Test Dataset]

• Yields an estimation of the uncertainty in learning

• Basic idea:

  1. sample N = |D| times from the dataset with replacement: the result is a training set of size N
  2. instances not selected are taken as the test set

• Expected size of the test set?

Bootstrap: test-set size

• Probability that instance i is not selected in a single draw:

      P(i) = 1 − 1/N

• After taking N samples: (1 − 1/N)^N. Note that:

  – e^x = Σ_{k=0}^{N} x^k / k! + R_N(x), i.e.

        e^{−1} = Σ_{k=0}^{N} (−1)^k / k! + R_N

  – Newton's binomial theorem:

        (1 − 1/N)^N = Σ_{k=0}^{N} (N choose k) (−1/N)^k
                    = Σ_{k=0}^{N} (−1)^k N(N−1)···(N−k+1) / (k! N^k)
                    ≈ Σ_{k=0}^{N} (−1)^k / k!

⇒ the expected size of the test set: e^{−1} · N ≈ 0.37N
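A minimal sketch of the sampling scheme; for a reasonably large dataset, len(test) / len(dataset) indeed comes out close to e^{−1} ≈ 0.37:

import random

def bootstrap_split(dataset, seed=0):
    """Sample N = |D| instances with replacement as the training set;
    the instances never drawn become the test set."""
    rng = random.Random(seed)
    n = len(dataset)
    drawn = [rng.randrange(n) for _ in range(n)]   # N draws with replacement
    chosen = set(drawn)
    train = [dataset[i] for i in drawn]
    test = [dataset[i] for i in range(n) if i not in chosen]
    return train, test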


Bootstrap: practical

• Effect of the random selection of the test set: the result may not be representative

• Solution:

  – adjust the error rate:

        ε = 0.63 · ε_test + 0.37 · ε_training

  – repeat the process a number of times, and compute the average outcome

• Usage: in the context of learning multiple models (bagging)
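As a small sketch, the adjusted error rate and its average over repeated bootstrap rounds (each round supplying a pair of test and training error rates):

def adjusted_error(err_test, err_training):
    """Combine test and training error with the slide's 0.63 / 0.37 weights."""
    return 0.63 * err_test + 0.37 * err_training

def averaged_bootstrap_error(rounds):
    """Average the adjusted error over repeated rounds given as (err_test, err_training) pairs."""
    return sum(adjusted_error(t, tr) for t, tr in rounds) / len(rounds)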
Committee of experts

• Hypothesis space

• No single hypothesis fits all possible (or all parts of a) dataset best

• Combining many weak models may give rise to a strong model ("Many hands make light work")

Bagging: representation

• Bootstrap aggregating

• Use m bootstrap samples to learn m models Mi: multiple experts; the output

      f̂*_m(x) = (1/m) Σ_{k=1}^{m} f̂_k(x)

  is a Monte-Carlo estimate of the true function f = lim_{m→∞} f̂_m

• Typical use:

[Diagram: the models M1, M2, ..., Mm produce individual outputs, which are combined by Voting or Averaging into a single Output]

Bagging: learning

• Bias-variance decomposition – the bias is fixed for a specific technique; try to reduce the variance of learning

• Random samples ⇒ slightly different models

• Result: expert models for a particular part of the feature space

• Bagging and naive Bayesian networks?
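A minimal sketch of bagging with averaging (regression), reusing the bootstrap_split sketch above; train(data) is again a hypothetical function returning a model with a predict(x) method:

def bag_models(dataset, train, m=25, seed=0):
    """Learn m models, each from its own bootstrap sample of the dataset."""
    models = []
    for k in range(m):
        sample, _ = bootstrap_split(dataset, seed=seed + k)
        models.append(train(sample))
    return models

def bagged_predict(models, x):
    """Monte-Carlo estimate f*_m(x) = (1/m) * sum_k f_k(x)."""
    return sum(model.predict(x) for model in models) / len(models)

For classification, the averaging step is replaced by a (majority) vote over the models' predicted classes.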
Boosting

• Limitation of bagging: the models are learnt completely separately

• Solution: boosting – learn models that complement each other

• Output: weighted combination of the model results

[Diagram: a chain of weighted samples, each used to learn one of the models M1, M2, ..., Mm; the weights produced after one model shape the sample for the next]

Boosting algorithm

• Dataset D with instances x ∈ D; set of models M

• Correctly classified instances: their weight is multiplied by ε/(1 − ε), where ε is the error rate

• w(x) are the weights attached to the instances

Boosting(M, D)
{
   for each x ∈ D do
      w(x) ← initial-value
   for k ← 1, . . . , m do
      E ← Sample(D)
      Mk ← LearnModel(E)
      ε ← ErrorRate(D, Mk)
      if ε ≠ 0 and ε < 0.5 then
         M ← M ∪ {Mk}
         for each x ∈ D do
            if x is classified correctly by Mk then
               w(x) ← w(x) · ε/(1 − ε)
         Renormalise(D)
}
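A sketch of the same loop in Python (instances as (x, c) pairs; train(data) again a hypothetical model-building function with a predict(x) method; sampling according to the weights plays the role of Sample(D)):

import random

def boost(dataset, train, m=10, seed=0):
    """Each round reweights the instances so later models focus on what earlier
    models got wrong; models with error rate 0 or >= 0.5 are rejected."""
    rng = random.Random(seed)
    n = len(dataset)
    w = [1.0 / n] * n                                  # initial weights w(x)
    committee = []                                     # accepted models with error rates
    for k in range(m):
        idx = rng.choices(range(n), weights=w, k=n)    # E <- weighted sample of D
        model = train([dataset[i] for i in idx])       # Mk <- LearnModel(E)
        wrong = {i for i, (x, c) in enumerate(dataset) if model.predict(x) != c}
        eps = sum(w[i] for i in wrong)                 # (weighted) error rate on D
        if eps == 0 or eps >= 0.5:
            continue                                   # reject this model
        committee.append((model, eps))
        for i in range(n):
            if i not in wrong:                         # correctly classified by Mk
                w[i] *= eps / (1.0 - eps)
        total = sum(w)                                 # renormalise the weights
        w = [wi / total for wi in w]
    return committee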

Application of multiple models

• Models are combined by weighted votes (classification) or a weighted average (regression)

• Contribution of each model Mk:

      wk = − log( εk / (1 − εk) )

  Note that wk → ∞ if εk ↓ 0, where εk is the error rate of model Mk

• C: class variable with values c

Classify(M, x)
{
   for each c ∈ C do
      w(c) ← 0
      for k ← 1, . . . , m do
         if c = MostLikelyClass(Mk, x) then
            w(c) ← w(c) + wk
   return the c with the highest w(c)
}
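The same weighted vote as a small sketch, reusing the (model, εk) pairs returned by the boosting sketch above:

import math

def classify(committee, x, classes):
    """Weighted vote: each model votes for its most likely class with weight
    w_k = -log(eps_k / (1 - eps_k)); the class with the highest total wins."""
    votes = {c: 0.0 for c in classes}
    for model, eps in committee:
        w_k = -math.log(eps / (1.0 - eps))             # large weight for accurate models
        votes[model.predict(x)] += w_k                 # predict(x) = MostLikelyClass(Mk, x)
    return max(votes, key=votes.get)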
