
Machine Learning Tutorial

CB, GS, REC

Section 1

Machine Learning basic concepts

Machine Learning Tutorial for the UKP lab

June 10, 2011

This presentation includes some slides, slide parts and text taken
from online materials created by the following people:
- Greg Grudic
- Alexander Vezhnevets
- Hal Daumé III

What is Machine Learning?

"The goal of machine learning is to build computer
systems that can adapt and learn from their
experience." (Tom Dietterich)


A Generic System

[Diagram: inputs x_1 ... x_N enter a System with hidden variables h_1, ..., h_K, producing outputs y_1 ... y_M]

Input variables: x = (x_1, x_2, ..., x_N)
Hidden variables: h = (h_1, h_2, ..., h_K)
Output variables: y = (y_1, y_2, ..., y_M)

When are ML algorithms NOT needed?


When the relationships between all system variables
(input, output, and hidden) are completely understood!
This is NOT the case for almost any real system!


The Sub-Fields of ML

Supervised Learning
Reinforcement Learning
Unsupervised Learning


Supervised Learning
Given: training examples

  {(x_1, f(x_1)), (x_2, f(x_2)), ..., (x_P, f(x_P))}

for some unknown function (system) y = f(x)

Find f(x)
Predict y = f(x), where x is not in the training set
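A minimal sketch of this setting in Python (my own illustration, not from the slides): scikit-learn's DecisionTreeClassifier stands in for the learner, and the tiny data set is made up.

# Minimal supervised-learning sketch: learn y = f(x) from labeled pairs,
# then predict for an x that was not in the training set.
from sklearn.tree import DecisionTreeClassifier

# Training examples (x_i, f(x_i)); here x is a 2-dimensional feature vector.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [0, 1, 1, 0]                 # the unknown f(x), observed on the sample

f_hat = DecisionTreeClassifier().fit(X_train, y_train)

x_new = [[0.9, 0.1]]                   # x not in the training set
print(f_hat.predict(x_new))            # predicted y = f_hat(x_new)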


Model, model quality
Definition: A computer program is said to learn
from experience E
with respect to some class of tasks T
and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.
Learned hypothesis: model of problem/task T
Model quality: accuracy/performance measured by P


Data / Examples / Sample / Instances

Data: experience E in the form of examples / instances
  characteristic of the whole input space
  representative sample
  independent and identically distributed (no bias in selection / observations)

Good example
  1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)
  probably i.i.d.
  representative?
  if annotation is involved it is always a question of compromises

Definitely bad example
  all abstracts that have John Smith as an author

Instances have to be comparable to each other


Data / Examples / Sample / Instances

Example: a set of queries and a set of top retrieved documents
(characterized via tf, idf, tf*idf, PRank, BM25 scores) for each
  try predicting relevance for reranking!
  the top retrieved set is dependent on the underlying IR system!
    issues with representativeness, but for reranking this is fine
  the characterization is dependent on the query (exc. PRank), i.e. only certain pairs
  (for the same Q) are meaningfully comparable (c.f. independent examples for the same Q)
    we have to normalize the features per query to have the same mean/variance (see the sketch below)
    we have to form pairs and compare e.g. the diff of feature values

Toy example:
  Q = learning:    rank 1: tf = 15, rank 100: tf = 2
  Q = overfitting: rank 1: tf = 2,  rank 10:  tf = 0
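A small sketch of the per-query normalization mentioned above (my own illustration; the data echoes the toy example, all names are hypothetical): each feature is z-scored within its query so that values become comparable across queries.

# Per-query feature normalization: z-score each feature within its query,
# so that documents are only compared against other results for the same Q.
from collections import defaultdict
from statistics import mean, pstdev

# (query, rank, tf) triples, echoing the toy example above
rows = [("learning", 1, 15.0), ("learning", 100, 2.0),
        ("overfitting", 1, 2.0), ("overfitting", 10, 0.0)]

by_query = defaultdict(list)
for q, rank, tf in rows:
    by_query[q].append(tf)

normalized = []
for q, rank, tf in rows:
    mu, sigma = mean(by_query[q]), pstdev(by_query[q]) or 1.0   # guard against zero variance
    normalized.append((q, rank, (tf - mu) / sigma))

print(normalized)   # tf now has the same mean/variance within every query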

Features
The available examples (experience) have to be
described to the algorithm in a consumable format
  Here: examples are represented as vectors of pre-defined features
  E.g. for credit risk assessment, typical features can be: income range,
  debt load, employment history, real estate properties, criminal record,
  city of residence, etc.

Common feature types (see the encoding sketch below)
  binary   (criminal record, Y/N)
  nominal  (city of residence, X)
  ordinal  (income range, 0-10K, 10-20K, ...)
  numeric  (debt load, $)
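A hedged sketch of turning such features into a numeric vector (my own illustration; the encodings chosen, one-hot for nominal and index for ordinal, are one common option, and the category lists are invented).

# Encoding the credit-risk features above as a feature vector.
CITIES = ["Darmstadt", "Frankfurt", "Berlin"]          # nominal (hypothetical list)
INCOME_RANGES = ["0-10K", "10-20K", "20-50K", "50K+"]  # ordinal (order matters!)

def encode(criminal_record, city, income_range, debt_load):
    vec = [1.0 if criminal_record else 0.0]             # binary
    vec += [1.0 if city == c else 0.0 for c in CITIES]  # nominal -> one-hot
    vec.append(float(INCOME_RANGES.index(income_range)))  # ordinal -> rank index
    vec.append(float(debt_load))                         # numeric as-is
    return vec

print(encode(False, "Frankfurt", "10-20K", 12500))
# [0.0, 0.0, 1.0, 0.0, 1.0, 12500.0]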


Machine Learning Tutorial

CB, GS, REC

Section 2

Experimental practice
By now you've learned what machine learning is; in the supervised approach you
need (carefully selected / prepared) examples that you describe through features;
the algorithm then learns a model of the problem based on the examples (usually
some kind of optimization is performed in the background); and as a result,
improvement is observed in terms of some performance measure.

Machine Learning Tutorial for the UKP lab

June 10, 2011

Model parameters
2 kinds of parameters
  one the user sets for the training procedure in advance - hyperparameter
    the degree of the polynomial to fit in regression
    the number/size of hidden layers in a neural network
    the number of instances per leaf in a decision tree
  one that actually gets optimized through the training - parameter
    regression coefficients
    network weights
    size/depth of the decision tree (in Weka; other implementations might allow you to control that)
  we usually do not talk about the latter, but refer to hyperparameters as parameters

Hyperparameters
  the fewer the algorithm has, the better
    Naive Bayes the best? No parameters!
    usually algorithms with better discriminative power are not parameter-free
  typically set to optimize performance (on a validation set, or through cross-validation)
    manual, grid search, simulated annealing, gradient descent, etc. (see the sketch below)
  common pitfall:
    selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same
    cross-validation results - the reported numbers are optimistically biased
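A hedged sketch of hyperparameter tuning via grid search with cross-validation, using scikit-learn on a stock data set (an assumption for illustration, not the credit-risk example): the tuned model is finally scored on a separate blind test set, avoiding the pitfall above.

# Grid search with CV on the training part only; report on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

grid = {"min_samples_leaf": [1, 2, 5, 10], "max_depth": [2, 4, 8, None]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=10)
search.fit(X_tr, y_tr)                      # hyperparameters chosen via 10-fold CV

print("best hyperparameters:", search.best_params_)
print("blind test accuracy :", search.score(X_te, y_te))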

Cross-validation, illustration

The data set X = {x_1, ..., x_k} is split into folds X_1, ..., X_5.

[Figure: folds X_1 ... X_5; in each iteration one fold (here X_3) serves as the Test set and the remaining folds form the Train set]

The result is an average over all iterations.


Cross-Validation
n-fold CV: common practice for making (hyper)parameter estimation more robust
  round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
  typical: random splits, without replacement (each instance tests exactly once)
  the other way: random subsampling cross-validation

n-fold CV: common practice to report average performance, performance deviation, etc.
  "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (Bengio and Grandvalet 2004)
  bad practice? problem: training sets largely overlap, test errors are also dependent
    tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
  5x2 CV is a better option: do 2-fold CV and repeat 5 times, calculate the average: less overlap in training sets

Folding via natural units of processing for the given task
  typically, document boundaries - best practice is doing it yourself (see the sketch below)!
  the ML package / CSV representation is not aware of e.g. document boundaries!
  the PPI case
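One way to do the folding yourself along document boundaries is group-wise cross-validation; the sketch below uses scikit-learn's GroupKFold with invented document ids, as one possible implementation (the slides do not prescribe a tool).

# Folding by natural units (e.g. document boundaries) rather than by instance:
# GroupKFold keeps all instances of one document in the same fold, which a
# plain CSV-level random split would not guarantee.
from sklearn.model_selection import GroupKFold

instances = list(range(10))                       # 10 instances (e.g. sentences)
doc_ids   = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]        # which document each came from

for train_idx, test_idx in GroupKFold(n_splits=4).split(instances, groups=doc_ids):
    test_docs  = {doc_ids[i] for i in test_idx}
    train_docs = {doc_ids[i] for i in train_idx}
    assert not (test_docs & train_docs)           # no document spans both sets
    print("test docs:", sorted(test_docs))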


Cross-Validation
Ideally the valid settings are:
  take off-the-shelf algorithms, avoid parameter tuning and compare
  results, e.g. via cross-validation
    n.b. you probably do the folding yourself, trying to minimize biases!
  do parameter tuning (n.b. selecting/tuning your features is also tuning!)
    but then normally you have to have a blind set (from the beginning)
    e.g. have a look at shared tasks, e.g. CoNLL - a practical way to learn
    experimental best practice; align with the predefined standards (you might even
    benefit from comparative results, etc.)

You might want to do something different
  be aware of these & the consequences

The ML workflow
Common ML experimenting pipeline
1. define the task
   instance, target variable/labels, collect and label/annotate data
   credit risk assessment: instance = 1 credit request, label = good/bad credit
   (~ credits that ran out in the previous year)
2. define and collect/calculate features, define train / validation
   (development) (test!) / test (evaluation) data
3. pick a learning algorithm (e.g. decision tree), train model
   train on the training set
   optimize/set model hyperparameters (e.g. number of instances / leaf, use
   pruning, ...) according to performance on the validation data
     cross-validation: use all training data as validation data
   test model accuracy on the (blind) test set
4. ready to use the model to predict unseen instances with an expected
   accuracy similar to that seen on the test set
   (a minimal end-to-end sketch follows below)
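A compact end-to-end sketch of steps 2-4, using a stock scikit-learn data set as a stand-in (an assumption for illustration; the slides' credit-risk data is not available): split into train/validation/test, tune one decision-tree hyperparameter on the validation part, then report test accuracy once.

# Train / validation / test workflow with a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)                       # steps 1-2: data + features
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_leaf, best_acc = None, -1.0
for leaf in (1, 2, 5, 10, 20):                                    # step 3: tune on validation
    model = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    acc = model.fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_leaf, best_acc = leaf, acc

final = DecisionTreeClassifier(min_samples_leaf=best_leaf, random_state=0)
final.fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))              # step 4: expected accuracy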

Try this in Weka

=== Run information ===
Relation:    segment
Instances:   1500
Attributes:  20
Test mode:   split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances      290    96.6667 %
Incorrectly Classified Instances     10     3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances      281    93.6667 %
Incorrectly Classified Instances     19     6.3333 %

Model complexity
Fitting a polynomial regression:

  y(x) = \sum_{n=0}^{M} w_n x^n

By, for instance, least squares:

  w^* = \arg\min_w \sum_{j=1}^{P} \left( y_j - \sum_{n=0}^{M} w_n x_j^n \right)^2

[Figure: least-squares polynomial fits of degree M = 0, 1, 3 and 9 to the same sample, x ranging from 0 to 1]
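A short sketch of the polynomial fit above, using numpy.polyfit as the least-squares solver; the data-generating function (a noisy sine) and the noise level are my own assumptions for illustration.

# Fit polynomials of degree M = 0, 1, 3, 9 and compare the training error.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy sample

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x, y, deg=M)          # least-squares fit of degree M
    y_hat = np.polyval(coeffs, x)
    print(f"M={M}: training RMSE = {np.sqrt(np.mean((y - y_hat) ** 2)):.3f}")
# Training error keeps shrinking as M grows (M = 9 interpolates the 10 points);
# that says nothing about error on new x, which is exactly the overfitting issue.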

Data size and model complexity


Important concept: the discriminative power of the algorithm
  linear vs. nonlinear model
  some theoretical aspects:
    a 1-hidden-layer NN with unlimited hidden nodes can
    approximate any smooth function/surface arbitrarily well


Data size and model complexity

Overfitting: the model perfectly learns to classify the training data, but
has no (bad) generalization ability
  results in high test error (useless model)
  typical for small sample sizes and powerful models

Underfitting: the model is not capable of learning the (complex)
patterns in the training set

Reasons for underfitting and overfitting:
  lack of discriminative power
  small sample size
  noise in the data /labels or features/
  generalization ability of the algorithm
    has to be chosen wrt. sample size

Size (complexity) of the learnt model
  grows with data size
  if the data is consistent, this is OK

Predictions: the confusion matrix

TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n

Good prediction: TP + TN
Error: FP (false alarm) + FN (miss)

Evaluation measures
Accuracy
  The rate of correct predictions made by the model over a data set (cf. coverage).
  (TP+TN) / (TP+FN+FP+TN)

Error rate
  The rate of incorrect predictions made by the model over a data set (cf. coverage).
  (FP+FN) / (TP+FN+FP+TN)

[Root]?[Mean|Absolute][Squared]?Error
  The difference between the predicted and actual values, e.g.

  RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( f(x_i) - y_i )^2 }

Algorithms (e.g. those in Weka) typically optimize these
  there might be a mismatch between the optimization objective and the actual evaluation measure
  optimizing different measures is a research field of its own (e.g. in ML for IR, a.k.a. learning to rank)
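The measures above are straightforward to compute from counts and predictions; a plain-Python sketch with invented numbers:

# Accuracy and error rate from confusion counts; RMSE from predictions.
from math import sqrt

TP, FP, TN, FN = 40, 5, 50, 5
accuracy   = (TP + TN) / (TP + FN + FP + TN)
error_rate = (FP + FN) / (TP + FN + FP + TN)

y_true = [3.0, -0.5, 2.0, 7.0]          # regression case for RMSE
y_pred = [2.5,  0.0, 2.0, 8.0]
rmse = sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))

print(accuracy, error_rate, rmse)       # 0.9 0.1 0.612...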


Evaluation measures
TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n

Precision
  Fraction of correctly predicted positives among all predicted positives
  TP/(TP+FP)

Recall
  Fraction of correctly predicted positives among all actual positives
  TP/(TP+FN)

F measure
  weighted harmonic mean of Precision and Recall (usually equally weighted, beta = 1)

  F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}

Only makes sense for a subset of classes (usually measured for a single class)
  measured over all classes (micro-averaged), it equals the accuracy
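The same counts give precision, recall and F_beta directly; a plain-Python sketch with invented counts (beta = 1 reproduces the usual F1):

# Precision, recall and F-measure from confusion counts.
def precision_recall_f(TP, FP, FN, beta=1.0):
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall    = TP / (TP + FN) if TP + FN else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

print(precision_recall_f(TP=8, FP=2, FN=4))   # (0.8, 0.666..., 0.727...)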

Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
  A sequence of tokens with the same label is treated as a single instance
  John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG
  before_O joining_O IBM_ORG.
  Why? We need complete phrases to be identified correctly
  How? With an external evaluation script, e.g. conlleval for NER

Example tagging:
  John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG
  before_O joining_O IBM_ORG.

Multiple penalty:
  3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  2 FPs: Johns Hopkins (PER) and University (ORG)
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 0.67, F(ORG) = 0.5
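A simplified sketch of the phrase-level counting above (same spirit as conlleval, but much reduced: no BIO prefixes; consecutive tokens with one non-O label form one phrase, exactly as the slide defines it). The tag lists encode the Johns Hopkins example.

# Phrase-level TP/FP/FN: a phrase counts as TP only if span and label match exactly.
def phrases(tags):
    spans, start = set(), None
    for i, t in enumerate(tags + ["O"]):              # sentinel "O" closes the last phrase
        if start is not None and t != tags[start]:
            spans.add((start, i, tags[start]))
            start = None
        if t != "O" and start is None:
            start = i
    return spans

gold = ["PER", "O", "O", "O", "ORG", "ORG", "ORG", "O", "O", "ORG"]
pred = ["PER", "O", "O", "O", "PER", "PER", "ORG", "O", "O", "ORG"]

g, p = phrases(gold), phrases(pred)
print("TP:", len(g & p), "FP:", len(p - g), "FN:", len(g - p))   # TP: 2 FP: 2 FN: 1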

Loss types
1. The real loss function given to us by the world. Typically involves notions of money saved,
   time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this
   function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance
   assessments, etc. We can perform these evaluations, but they are slow and costly. They
   require humans in the loop.
3. Automatic correlation-driven functions. Typical examples are Bleu, Rouge, word error rate,
   mean average precision. These require humans at the front of the loop, but after that are
   cheap and quick. Typically some effort has been put into showing correlation between these
   and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for
   parsing, chunking and named-entity recognition), alignment error rate (for word alignment)
   and perplexity (for language modeling). These also require humans at the front of the loop,
   but differ from (3) in that they are not actually compared with higher-up tasks.

Be careful what you are optimizing! Some measures (typically of Type 4)
become dysfunctional when you are optimizing them!
  phrase P/R/F, e.g. in NER
  readability measures


Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
  Gold: John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.

Example tagging 1:
  John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
  3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  2 FPs: Johns Hopkins (PER) and University (ORG)
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 0.67, F(ORG) = 0.5

Example tagging 2:
  John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
  3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  0 FP
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 1.0, F(ORG) = 0.67

Optimizing phrase-F can encourage / prefer systems that do not mark entities!
  most likely, this is bad!!

ROC curve
ROC: Receiver Operating Characteristic curve
Curve that depicts the relation between recall (sensitivity) and false
positives (1-specificity)

[Figure: ROC curves; y-axis: sensitivity (recall), x-axis: false positives, FP / (FP+TN); best-case and worst-case curves shown]

Evaluation measures
Area under the ROC curve, AUC
  As you vary the decision threshold, you can plot the recall vs. the false
  positive rate
  The area under the curve measures how accurately your model
  separates positives from negatives
    perfect ranking: AUC = 1.0
    random decision: AUC = 0.5

Similarly (e.g. in IR): area under the P/R curve
  when there are too many (true) negatives
  correctly identifying negatives is not interesting anyway
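AUC can also be computed without drawing the curve: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties count as 1/2). A plain-Python sketch with invented scores:

# Pairwise-ranking view of AUC.
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0  ]
print(auc(scores, labels))    # 0.8125: a good but not perfect ranking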


Evaluation measures (Ranking)

Precision@K
  number of true positives in the top K predictions / ranks

MAP
  The average of precisions computed at the position of each of the positives in the
  ranked list (P=0 for positives not ranked at all)

NDCG
  For graded relevance / ranking
  Highly relevant documents appearing lower in a search result list are
  penalized, as the graded relevance value is reduced logarithmically proportionally
  to the position of the result.
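A plain-Python sketch of Precision@K and average precision (the per-query quantity that MAP averages over queries); the ranked relevance labels are invented.

# Precision@K and average precision for one ranked list (1 = relevant).
def precision_at_k(relevances, k):
    return sum(relevances[:k]) / k

def average_precision(relevances, n_relevant_total):
    hits, ap = 0, 0.0
    for i, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            ap += hits / i                  # precision at each relevant position
    return ap / n_relevant_total            # positives never retrieved count as P=0

ranked = [1, 0, 1, 1, 0]                    # relevance of ranks 1..5
print(precision_at_k(ranked, 3))            # 2/3
print(average_precision(ranked, n_relevant_total=4))
# (1/1 + 2/3 + 3/4) / 4 = 0.604..., since one relevant doc is never retrieved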


Learning curve
Measures how the accuracy / error of the model changes with
  sample size
  iteration number

Smaller sample
  worse accuracy
  bias in the estimate more likely (representative sample?)
  variance in the estimate

[Figure: typical learning curve; accuracy increases with sample size and flattens out]

If it looks different:
  you are plotting error vs. size/iteration
  you are doing something wrong!
  overfitting (iteration, not sample size)!
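A rough sketch of producing such a curve: train on growing prefixes of the training data and score a fixed test set each time. Data set and classifier (scikit-learn's GaussianNB) are arbitrary choices for illustration.

# Learning curve: test accuracy as a function of training-sample size.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n in (10, 25, 50, 100, 200, len(X_tr)):
    model = GaussianNB().fit(X_tr[:n], y_tr[:n])     # train on the first n examples
    print(f"n={n:4d}  test accuracy = {model.score(X_te, y_te):.3f}")
# Accuracy should generally increase (and stabilize) with n; a curve that goes
# down with more data usually signals an error in the setup.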

Data or Algorithm?
Compare the accuracy of various machine learning algorithms with a
varying amount of training data (Banko & Brill, 2001):
  Winnow
  perceptron
  naive Bayes
  memory-based learner

Features:
  bag of words: words within a window of the target word
  collocations containing specific words and/or parts of speech

Training corpus: 1 billion words
  from a variety of English texts
  (news articles, literature, scientific abstracts, etc.)

Take home messages (up until now)

Supervised learning: based on a set of labeled examples (x, f(x)), learn the
input-output mapping, i.e. f(x)

3 factors of successful machine learning models
  much data
  good features
  a well-suited learning algorithm

ML workflow
1. problem definition
2. feature engineering; experimental setup /train, validation, test/
3. selection of learning algorithm, (hyper)parameter tuning, training a final model
4. predict unseen examples & fill tables / draw figures for the paper - test

Careful with
  data representation (i.i.d., comparability, ...)
  experimental setup (cross-validation, blind testing, ...)
  data size and algorithm selection (+ overfitting, underfitting, ...)
  evaluation measures
