Section 1
A Generic System
[Figure: a generic system mapping input variables x1, x2, ..., xN through internal states h1, h2, ..., hK to output variables y1, y2, ..., yM.]

Input variables: x = (x1, x2, ..., xN)
The Sub-Fields of ML
Supervised Learning
Reinforcement Learning
Unsupervised Learning
Supervised Learning
Given: training examples {(x1, f(x1)), (x2, f(x2)), ..., (xP, f(xP))} for some unknown function f
Good example:
1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)
probably i.i.d.
representative?
if annotation is involved, it is always a question of compromises
Toy example:
Q = learning, rank 1: tf = 15, rank 100: tf = 2
Q = overfitting, rank 1: tf = 2, rank 10: tf = 0
Features
The available examples (experience) have to be
described to the algorithm in a consumable format
Here: examples are represented as vectors of pre-defined features
E.g. for credit risk assessment, typical features can be: income range,
debt load, employment history, real estate properties, criminal record,
city of residence, etc.
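A minimal sketch of this idea (the feature names and values below are made up for illustration, not taken from the lecture): one credit-risk instance becomes a vector by fixing a feature order.

    # Hypothetical feature set for credit risk assessment; a real system
    # would define its own features and encodings.
    FEATURES = ["income_range", "debt_load", "employment_years",
                "real_estate_count", "criminal_record", "city_risk_score"]

    def to_vector(instance: dict) -> list:
        """Describe a raw instance to the algorithm as a fixed-order vector."""
        return [instance.get(name, 0) for name in FEATURES]

    applicant = {"income_range": 3, "debt_load": 0.4, "employment_years": 7,
                 "real_estate_count": 1, "criminal_record": 0,
                 "city_risk_score": 2}
    print(to_vector(applicant))  # -> [3, 0.4, 7, 1, 0, 2]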
Section 2
Experimental practice
By now you've learned what machine learning is: in the supervised approach you
need (carefully selected / prepared) examples that you describe through features;
the algorithm then learns a model of the problem based on the examples (usually
some kind of optimization is performed in the background); and as a result,
improvement is observed in terms of some performance measure
Model parameters
2 kinds of parameters:
the ones the user sets in advance for the training procedure: hyperparameters
the degree of the polynomial to fit in regression
in a neural network: the number/size of hidden layers
the number of instances per leaf in a decision tree
the ones the algorithm learns from the data during training: model parameters
we usually do not talk about the latter, but refer to hyperparameters simply as parameters
Hyperparameters
the fewer the algorithm has, the better
Naive Bayes the best? No parameters!
but usually algorithms with better discriminative power are not parameter-free
typically set to optimize performance (on a validation set, or through cross-validation)
manual search, grid search, simulated annealing, gradient descent, etc.
common pitfall:
selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same cross-validation results - the tuning has already seen those folds, so the estimate is optimistically biased
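A hedged sketch of avoiding that pitfall, using scikit-learn (the lecture itself uses Weka; the dataset and parameter grid here are placeholders): tune with an inner CV, but report scores from an outer CV on folds the tuning never saw.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    # inner 5-fold CV selects the hyperparameter (min instances per leaf)
    inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         param_grid={"min_samples_leaf": [2, 4, 8, 16]}, cv=5)
    # outer 10-fold CV estimates performance of the whole tuning procedure
    outer_scores = cross_val_score(inner, X, y, cv=10)
    print(outer_scores.mean())  # report this, not the inner CV score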
[Figure: n-fold cross-validation with folds X1 ... X5 - in each round one fold serves as the Test set and the remaining folds form the Train set.]
Cross-Validation
n-fold CV: common practice for making (hyper)parameter estimation more robust
round robin training/testing n times, with (n-1)/n data to train and 1/n data to evaluate the model
typical: random splits, without replacement (each instance is tested exactly once)
the other way: random subsampling cross-validation
Folding via natural units of processing for the given task
typically document boundaries; best practice is doing the folding yourself!
the ML package / CSV representation is not aware of e.g. document boundaries!
The PPI (protein-protein interaction) case
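A minimal sketch of folding by natural units (assuming each instance carries a document id; the names are illustrative): whole documents, not individual instances, are assigned to folds, so sentences from one document never end up in both train and test.

    from collections import defaultdict

    def document_folds(doc_ids, n_folds):
        """Assign each document (not each instance) to a fold, round robin."""
        fold_of_doc = {d: i % n_folds
                       for i, d in enumerate(sorted(set(doc_ids)))}
        folds = defaultdict(list)
        for idx, d in enumerate(doc_ids):
            folds[fold_of_doc[d]].append(idx)  # instance index -> its fold
        return [folds[k] for k in range(n_folds)]

    # six instances from three documents -> [[0, 1], [2, 3], [4, 5]]
    print(document_folds(["d1", "d1", "d2", "d2", "d3", "d3"], n_folds=3))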
Cross-Validation
Ideally the valid settings are:
take off-the-shelf algorithms, avoid parameter tuning and compare results, e.g. via cross-validation
n.b. you probably do the folding yourself, trying to minimize biases!
The ML workflow
Common ML experimenting pipeline
1. define the task
instance, target variable/labels, collect and label/annotate data
credit risk assessment: 1 credit request, good/bad credit, ~s ran out in the previous year
2. feature engineering; experimental setup /train, validation, test/
3. selection of learning algorithm, (hyper)parameter tuning, training a final model
4. ready to use the model to predict unseen instances with an expected accuracy similar to that seen on test
split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances      290    96.6667 %
Incorrectly Classified Instances     10     3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances      281    93.6667 %
Incorrectly Classified Instances     19     6.3333 %

The only change between the two runs is the minimum number of instances per leaf (-M), yet test accuracy drops from 96.7% to 93.7%: hyperparameter settings matter.
Model complexity
Fitting a polynomial regression:

    y(x, w) = Σ_{n=0}^{M} w_n x^n

    w* = argmin_w Σ_{j=1}^{P} ( y_j - y(x_j, w) )^2

[Figure: degree-M polynomial fits (t vs. x, for M = 0, 1, 3, 9) to the same data points; the low-degree fits underfit, while M = 9 passes through every point and overfits.]
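A small sketch of the same effect in code (the synthetic data and noise level are assumptions): fit degree-M polynomials to a few noisy samples of sin(2πx) and compare train vs. test RMSE.

    import numpy as np

    rng = np.random.default_rng(0)
    x_tr = np.linspace(0.0, 1.0, 10)
    x_te = np.linspace(0.0, 1.0, 100)
    y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(scale=0.2, size=x_tr.size)
    y_te = np.sin(2 * np.pi * x_te)

    for M in (0, 1, 3, 9):
        w = np.polyfit(x_tr, y_tr, deg=M)  # least-squares coefficients w*
        rmse_tr = np.sqrt(np.mean((np.polyval(w, x_tr) - y_tr) ** 2))
        rmse_te = np.sqrt(np.mean((np.polyval(w, x_te) - y_te) ** 2))
        # train RMSE keeps falling with M; test RMSE rises again for M = 9
        print(f"M={M}: train RMSE {rmse_tr:.3f}, test RMSE {rmse_te:.3f}")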
Evaluation measures
Accuracy
The rate of correct predictions made by the model over a data set (cf. coverage).
(TP+TN) / (TP+FN+FP+TN)
Error rate
The rate of incorrect predictions made by the model over a data set (cf. coverage).
(FP+FN) / (TP+FN+FP+TN)
[Root]?[Mean|Absolute][Squared]?Error
The difference between the predicted and actual values, e.g.

    RMSE = sqrt( (1/P) Σ_{j=1}^{P} ( f(x_j) - y_j )^2 )
Evaluation measures
TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n
Precision
Fraction of correctly predicted positives among all predicted positives
TP/(TP+FP)
Recall
Fraction of correctly predicted positives among all actual positives
TP/(TP+FN)
F measure
weighted harmonic mean of Precision and Recall (usually equally weighted, β = 1)

    F_β = (1 + β^2) (precision · recall) / (β^2 · precision + recall)

Only makes sense for a subset of classes (usually measured for a single class)
For all classes, it equals the accuracy
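A minimal sketch computing the measures above from raw counts (pure Python, no library assumptions):

    def prf(tp, fp, fn, beta=1.0):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0.0:
            return precision, recall, 0.0
        b2 = beta ** 2
        f = (1 + b2) * precision * recall / (b2 * precision + recall)
        return precision, recall, f

    # e.g. TP = 1, FP = 1, FN = 1 -> P = 0.5, R = 0.5, F1 = 0.5
    print(prf(tp=1, fp=1, fn=1))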
Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
A sequence of tokens with the same label is treated as a single instance
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG
before_O joining_O IBM_ORG.
Why? We need complete phrases to be identified correctly
How? With external evaluation script, e.g. conlleval for NER
Example tagging:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG
before_O joining_O IBM_ORG.
Multiple penalty:
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5
Loss types
Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Example tagging 1:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5
Example tagging 2:
John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
0 FPs
1 FN: Johns Hopkins University (ORG)
F(PER) = 1.0, F(ORG) = 0.67
Optimizing phrase-F can encourage / prefer systems that do not mark entities!
Most likely, this is bad!
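A minimal sketch of the phrase-level matching idea (the lecture uses the external conlleval script; this simplified version assumes plain per-token labels without BIO prefixes):

    def spans(tags):
        """Merge runs of identical non-O tags into (start, end, label) phrases."""
        out, start = [], None
        for i, t in enumerate(tags + ["O"]):  # sentinel flushes the last run
            if start is not None and (t == "O" or t != tags[start]):
                out.append((start, i, tags[start]))
                start = None
            if t != "O" and start is None:
                start = i
        return set(out)

    gold = ["PER", "O", "O", "O", "ORG", "ORG", "ORG", "O", "O", "ORG"]
    pred = ["PER", "O", "O", "O", "PER", "PER", "ORG", "O", "O", "ORG"]
    g, p = spans(gold), spans(pred)
    # exact-boundary, exact-label matches only: TP = 2, FP = 2, FN = 1
    print(len(g & p), len(p - g), len(g - p))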
ROC curve
ROC: Receiver Operating Characteristic curve
Curve that depicts the relation between recall (sensitivity) and false positives (1 - specificity)
[Figure: ROC curves plotting Sensitivity (Recall) against False Positives, FP / (FP+TN); a best-case and a worst-case curve are marked.]
Evaluation measures
Area under ROC curve, AUC
As you vary the decision threshold, you can plot the recall vs. false positive rate
The area under the curve measures how accurately your model separates positives from negatives
perfect ranking: AUC = 1.0
random decision: AUC = 0.5
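A minimal sketch (made-up scores, assuming no ties) of tracing the ROC curve by sweeping the threshold and integrating with the trapezoidal rule:

    def roc_auc(scores, labels):
        pairs = sorted(zip(scores, labels), reverse=True)
        P = sum(labels)
        N = len(labels) - P
        tpr, fpr, tp, fp = [0.0], [0.0], 0, 0
        for _, y in pairs:  # lower the threshold past one score at a time
            if y:
                tp += 1
            else:
                fp += 1
            tpr.append(tp / P)  # recall / sensitivity
            fpr.append(fp / N)  # false positive rate, FP / (FP + TN)
        return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
                   for i in range(1, len(fpr)))

    print(roc_auc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]))  # 0.666...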
MAP
The average of the precisions computed at the rank of each positive in the ranked list (P = 0 for positives not ranked at all)
NDCG
For graded relevance / ranking
Highly relevant documents appearing lower in a search result list should be penalized: the graded relevance value is reduced logarithmically, proportionally to the position of the result.
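A minimal sketch of average precision for one ranked list (MAP then averages this over queries; the relevance judgments are made up):

    def average_precision(relevance, total_relevant):
        """relevance: 1/0 per rank; unranked positives contribute P = 0."""
        hits, precisions = 0, []
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)  # precision at this positive
        precisions += [0.0] * (total_relevant - hits)  # positives not ranked
        return sum(precisions) / total_relevant

    # relevant docs at ranks 1 and 3, one relevant doc never retrieved:
    print(average_precision([1, 0, 1, 0, 0], total_relevant=3))  # ~0.556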
Learning curve
Measures how the accuracy / error of the model changes with
sample size
iteration number
Smaller sample:
worse accuracy more likely (representative sample)
bias in the estimate
variance in the estimate
??
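A small sketch of producing learning-curve numbers (scikit-learn and the dataset are stand-ins, not from the lecture): train on growing subsets, evaluate on a fixed test set.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X_tr, X_te, y_tr, y_te = train_test_split(*load_iris(return_X_y=True),
                                              test_size=50, random_state=0)
    for n in (10, 25, 50, 100):
        model = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
        # accuracy generally improves (and its variance shrinks) with n
        print(n, round(model.score(X_te, y_te), 3))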
Data or Algorithm?
Compare the accuracy of various machine learning algorithms with a varying amount of training data (Banko & Brill, 2001):
Winnow
perceptron
naïve Bayes
memory-based learner
Features:
bag of words: words within a window of the target word
collocations containing specific words and/or part of speech
ML workflow
1. problem definition
2. feature engineering; experimental setup /train, validation, test/
3. selection of learning algorithm, (hyper)parameter tuning, training a final model
4. predict unseen examples & fill tables / draw figures for the paper - test
Careful with