
Evaluation

• ROC curves
• Hold out method
• Cross-validation
• The bootstrap
• Bagging
• Boosting

Optimal performance: ROC

• Output of probabilistic classifier:

      c_max = argmax_C P(C | E)

  Refinements may not yield the best performance.

• Alternative: Receiver Operating Characteristic (ROC): determine a threshold d, such that

      C = c    if P(c | E) ≥ d
      C = ¬c   otherwise

[Figure: ROC curve plotting TPR against FPR; the points on the curve are labelled with the thresholds 10, 20, ..., 90; the diagonal marks a worthless classifier]
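A minimal sketch of this thresholding rule and of the (FPR, TPR) point it induces, assuming a binary problem in which probs[i] = P(c | E_i) and labels[i] ∈ {0, 1} (these names are illustrative, not from the slides):

def roc_point(probs, labels, d):
    """Classify as positive when P(c | E) >= d and return the (FPR, TPR) pair."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_positive = p >= d            # threshold rule: C = c iff P(c | E) >= d
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0   # false positive rate
    return fpr, tpr

Sweeping d from 1.0 down to 0.0 traces the ROC curve from (0, 0) to (1, 1).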

Area under the ROC curve


When comparing various techniques:

• actual performance for particular thresholds (cut-off points) may vary

• the area under the ROC curve,

      A_f = ∫_0^1 f(x) dx,

  offers a good measure for comparison, where f describes the relationship between FPR and TPR for the classifier (a small sketch follows below)
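A sketch of the corresponding area computation with the trapezoidal rule, assuming the ROC points were collected with the roc_point sketch above:

def auc(points):
    """Approximate A_f = the integral of f from 0 to 1 with the trapezoidal rule,
    given ROC points (FPR, TPR)."""
    pts = sorted(points)                       # order by increasing FPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

An area of 1.0 corresponds to a perfect ranking; 0.5 (the diagonal) to a worthless classifier.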

Evaluation problem sketch

[Diagram: the Initial Model (Assumptions) and the Training Data enter a Training Process, which yields a Trained Model; the Trained Model and Fresh Data enter an Application Process, which yields the Performance]

• Fresh data differs from the training set (which affects performance)

• Overfitting

• Bias-variance decomposition
Solution 1: holdout method

[Diagram: the Initial Model (Assumptions) and the Training Data enter a Training Process, which yields a Trained Model; the Trained Model and the Test Data enter a Testing Process, which yields a Tested Model and its Performance]

• Test set and training set are disjoint

• Select more (66%?) training instances than test instances

• Holdout: the test set

• Problem: what if the dataset is small?
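A minimal sketch of the holdout split (the function name and the 66% default are illustrative; the slides only suggest using more training than test instances):

import random

def holdout_split(dataset, train_fraction=0.66, seed=0):
    """Split a dataset into disjoint training and test (holdout) sets."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    n_train = int(round(train_fraction * len(data)))
    return data[:n_train], data[n_train:]      # training set, holdout test set

The two returned lists are disjoint by construction; varying the training fraction gives curves like the one in the next figure.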
Size of holdout

[Figure: success rate (%) against training set size (%), from 0 to 100%, where Test set (%) + Training set (%) = 100%]

Two methods compared:

• PRISM classification-rule learning algorithm

• Naive Bayesian classifier

Solution 2: cross-validation

[Diagram: the holdout pipeline is repeated once per fold: the Initial Model (Assumptions) and the Training Data of a fold enter a Training Process yielding a Trained Model, which is then evaluated on that fold's Test Data to obtain a Performance estimate]

K-fold cross-validation: split up dataset D into K partitions (called folds)

for k = 1, . . . , K:

1. Train the model using K − 1 of the folds (exclude fold k)
2. Evaluate using the remaining fold k (holdout)
3. Gather performance results

Performance results

K-fold cross-validation:

• Dataset D, with N = |D|

• f̂^(−k): trained model based on dataset D after removal of fold k

• Indexing function κ : {1, . . . , N} → {1, . . . , K}, which associates fold κ(i) with instance i

• Success rate with 0–1 loss function L:

      σ = (1/N) Σ_{i=1}^{N} L(c_i, f̂^(−κ(i))(x′_i))
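A minimal sketch of the procedure and of the success rate σ, assuming instances are (x_i, c_i) pairs and train(data) is a hypothetical function returning a model with a predict(x) method:

import random

def k_fold_indices(n, k, seed=0):
    """Assign each of n instances to one of k folds (the indexing function kappa)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    kappa = [0] * n
    for fold, i in enumerate(idx):
        kappa[i] = fold % k
    return kappa

def cross_validate(dataset, train, k=10, seed=0):
    """K-fold cross-validation; returns the success rate sigma under 0-1 loss."""
    n = len(dataset)
    kappa = k_fold_indices(n, k, seed)
    correct = 0
    for fold in range(k):
        train_part = [ex for i, ex in enumerate(dataset) if kappa[i] != fold]
        model = train(train_part)                  # f^(-fold): trained without fold k
        for i, (x, c) in enumerate(dataset):
            if kappa[i] == fold and model.predict(x) == c:
                correct += 1                       # instance i correctly classified
    return correct / n                             # success rate sigma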
How many folds?

[Figure: success rate (%) against the number of folds (1-20), for naive Bayes and ID3]

• Common choices: 5 and 10 (5-fold and 10-fold cross-validation)

• Leave-one-out method:
  – N = K (N-fold cross-validation)
  – almost all available data is used
  – easy to implement, no random sampling
  – computationally expensive

Stratification

• Folds are not necessarily representative of the whole dataset

• Solution:
  – check whether the folds are representative
  – if not: select new folds (at random)

• Further elimination of the effect of the random selection of folds: run cross-validation repeatedly

Result: M-time stratified K-fold cross-validation
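One possible sketch of stratified fold selection (assuming the class of instance i is labels[i]): the instances of each class are dealt over the folds in turn, so every fold roughly mirrors the class proportions of the whole dataset.

import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign instances to k folds so each fold reflects the overall class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    kappa = [0] * len(labels)
    for c, indices in by_class.items():
        rng.shuffle(indices)
        for j, i in enumerate(indices):            # deal instances of class c round-robin
            kappa[i] = j % k
    return kappa

Running the cross-validation M times with different seeds and averaging the M success rates gives M-time stratified K-fold cross-validation.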

Solution 3: bootstrapping

[Diagram: the Original Dataset is sampled with replacement to form the Training Dataset; the instances that are never selected form the Test Dataset]

• Yields an estimation of the uncertainty in learning

• Basic idea:

  1. sample N = |D| times from the dataset with replacement: the result is a training set of size N
  2. instances not selected are taken as the test set

• Expected size of the test set?

Bootstrap: test-set size

• Probability that instance i is not selected in a single draw:

      P(i) = 1 − 1/N

• After taking N samples: (1 − 1/N)^N. Note that:

  – e^x = Σ_{k=0}^{N} x^k / k! + R_N(x), i.e.

        e^{−1} = Σ_{k=0}^{N} (−1)^k / k! + R_N

  – Newton's binomial theorem:

        (1 − 1/N)^N = Σ_{k=0}^{N} (N choose k) (−1/N)^k
                    = Σ_{k=0}^{N} (−1)^k N(N−1)···(N−k+1) / (k! N^k)
                    ≈ Σ_{k=0}^{N} (−1)^k / k!

⇒ the expected size of the test set: e^{−1} · N ≈ 0.37N
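A minimal sketch of the sampling scheme; for a reasonably large dataset, len(test) / len(dataset) indeed comes out close to e^{−1} ≈ 0.37:

import random

def bootstrap_split(dataset, seed=0):
    """Sample N = |D| instances with replacement as the training set;
    the instances never drawn become the test set."""
    rng = random.Random(seed)
    n = len(dataset)
    drawn = [rng.randrange(n) for _ in range(n)]   # N draws with replacement
    chosen = set(drawn)
    train = [dataset[i] for i in drawn]
    test = [dataset[i] for i in range(n) if i not in chosen]
    return train, test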


Bootstrap: practical

• Effect of the random selection of the test set: the result may not be representative

• Solution:

  – adjust the error rate:

        ε = 0.63 · ε_test + 0.37 · ε_training

  – repeat the process a number of times, and compute the average outcome

• Usage: in the context of learning multiple models (bagging)
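As a small sketch, the adjusted error rate and its average over repeated bootstrap rounds (each round supplying a pair of test and training error rates):

def adjusted_error(err_test, err_training):
    """Combine test and training error with the slide's 0.63 / 0.37 weights."""
    return 0.63 * err_test + 0.37 * err_training

def averaged_bootstrap_error(rounds):
    """Average the adjusted error over repeated rounds given as (err_test, err_training) pairs."""
    return sum(adjusted_error(t, tr) for t, tr in rounds) / len(rounds)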
Committee of experts

• Hypothesis space

• No single hypothesis fits all possible (or all parts of a) dataset best

• Combining many weak models may give rise to a strong model ("Many hands make light work")

Bagging: representation

• Bootstrap aggregating

• Use m bootstrap samples to learn m models Mi: multiple experts; the output

      f̂*_m(x) = (1/m) Σ_{k=1}^{m} f̂_k(x)

  is a Monte-Carlo estimate of the true function f = lim_{m→∞} f̂_m

• Typical use:

[Diagram: the models M1, M2, ..., Mm produce individual outputs, which are combined by Voting or Averaging into a single Output]

Bagging: learning

• Bias-variance decomposition – the bias is fixed for a specific technique; try to reduce the variance of learning

• Random samples ⇒ slightly different models

• Result: expert models for a particular part of the feature space

• Bagging and naive Bayesian networks?
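A minimal sketch of bagging with averaging (regression), reusing the bootstrap_split sketch above; train(data) is again a hypothetical function returning a model with a predict(x) method:

def bag_models(dataset, train, m=25, seed=0):
    """Learn m models, each from its own bootstrap sample of the dataset."""
    models = []
    for k in range(m):
        sample, _ = bootstrap_split(dataset, seed=seed + k)
        models.append(train(sample))
    return models

def bagged_predict(models, x):
    """Monte-Carlo estimate f*_m(x) = (1/m) * sum_k f_k(x)."""
    return sum(model.predict(x) for model in models) / len(models)

For classification, the averaging step is replaced by a (majority) vote over the models' predicted classes.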
Boosting

• Limitation of bagging: the models are learnt completely separately

• Solution: boosting – learn models that complement each other

• Output: weighted combination of the model results

[Diagram: a chain of weighted samples, each used to learn one of the models M1, M2, ..., Mm; the weights produced after one model shape the sample for the next]

Boosting algorithm

• Dataset D with instances x ∈ D; set of models M

• Correctly classified instances: their weight is multiplied by ε/(1 − ε), where ε is the error rate

• w(x) are the weights attached to the instances

Boosting(M, D)
{
   for each x ∈ D do
      w(x) ← initial-value
   for k ← 1, . . . , m do
      E ← Sample(D)
      Mk ← LearnModel(E)
      ε ← ErrorRate(D, Mk)
      if ε ≠ 0 and ε < 0.5 then
         M ← M ∪ {Mk}
         for each x ∈ D do
            if x is classified correctly by Mk then
               w(x) ← w(x) · ε/(1 − ε)
         Renormalise(D)
}
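A sketch of the same loop in Python (instances as (x, c) pairs; train(data) again a hypothetical model-building function with a predict(x) method; sampling according to the weights plays the role of Sample(D)):

import random

def boost(dataset, train, m=10, seed=0):
    """Each round reweights the instances so later models focus on what earlier
    models got wrong; models with error rate 0 or >= 0.5 are rejected."""
    rng = random.Random(seed)
    n = len(dataset)
    w = [1.0 / n] * n                                  # initial weights w(x)
    committee = []                                     # accepted models with error rates
    for k in range(m):
        idx = rng.choices(range(n), weights=w, k=n)    # E <- weighted sample of D
        model = train([dataset[i] for i in idx])       # Mk <- LearnModel(E)
        wrong = {i for i, (x, c) in enumerate(dataset) if model.predict(x) != c}
        eps = sum(w[i] for i in wrong)                 # (weighted) error rate on D
        if eps == 0 or eps >= 0.5:
            continue                                   # reject this model
        committee.append((model, eps))
        for i in range(n):
            if i not in wrong:                         # correctly classified by Mk
                w[i] *= eps / (1.0 - eps)
        total = sum(w)                                 # renormalise the weights
        w = [wi / total for wi in w]
    return committee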

Application of multiple models

• Models are combined by weighted votes (classification) or a weighted average (regression)

• Contribution of each model Mk:

      wk = − log( εk / (1 − εk) )

  Note that wk → ∞ if εk ↓ 0, where εk is the error rate of model Mk

• C: class variable with values c

Classify(M, x)
{
   for each c ∈ C do
      w(c) ← 0
      for k ← 1, . . . , m do
         if c = MostLikelyClass(Mk, x) then
            w(c) ← w(c) + wk
   return the c with the highest w(c)
}
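The same weighted vote as a small sketch, reusing the (model, εk) pairs returned by the boosting sketch above:

import math

def classify(committee, x, classes):
    """Weighted vote: each model votes for its most likely class with weight
    w_k = -log(eps_k / (1 - eps_k)); the class with the highest total wins."""
    votes = {c: 0.0 for c in classes}
    for model, eps in committee:
        w_k = -math.log(eps / (1.0 - eps))             # large weight for accurate models
        votes[model.predict(x)] += w_k                 # predict(x) = MostLikelyClass(Mk, x)
    return max(votes, key=votes.get)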
