Also known as:
- Memory-based learning
- Instance or exemplar based learning
- Similarity-based methods
- Case-based reasoning
Hamming(x, x') = \sum_{i=1}^{d} \delta(x_i, x'_i)    (1)

\delta(x_i, x'_i) = \begin{cases} 0 & \text{if } x_i = x'_i \\ 1 & \text{if } x_i \ne x'_i \end{cases}    (2)
For a vector with a mixture of symbolic and numeric values, the above definition of per-feature distance is used for symbolic features, while for numeric ones we can use the scaled absolute difference

\delta(x_i, x'_i) = \frac{|x_i - x'_i|}{\max_i - \min_i}    (3)
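A minimal sketch of this per-feature distance for mixed vectors; the function name and the way feature ranges are passed in are illustrative assumptions, not part of the original slides:

```python
def per_feature_distance(x, x_prime, numeric_ranges):
    """Distance between two instances with mixed symbolic/numeric features.

    numeric_ranges: dict mapping a numeric feature index i to (min_i, max_i);
    features not listed are treated as symbolic (overlap/Hamming metric).
    """
    total = 0.0
    for i, (a, b) in enumerate(zip(x, x_prime)):
        if i in numeric_ranges:            # numeric: scaled absolute difference
            lo, hi = numeric_ranges[i]
            total += abs(a - b) / (hi - lo)
        else:                              # symbolic: 0 if equal, 1 otherwise
            total += 0.0 if a == b else 1.0
    return total

# Example: feature 2 is numeric with observed range [0, 10]
d = per_feature_distance(("NN", "the", 3.0), ("NN", "a", 7.0), {2: (0.0, 10.0)})
```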
With information-gain feature weighting, the weight of a feature is the reduction in class entropy obtained by knowing its value:

w_i = H(Y) - \sum_{v \in V_i} P(v)\, H(Y \mid v)

where
w_i is the weight of the i-th feature
Y is the set of class labels
V_i is the set of possible values for the i-th feature
P(v) is the probability of value v
H(Y) = -\sum_{y \in Y} P(y) \log_2 P(y) is the class entropy
H(Y \mid v) is the conditional class entropy given that feature value = v
Numeric values need to be temporarily discretized for this to work
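A rough sketch of how such information-gain weights could be computed for one feature from parallel lists of feature values and class labels (the helper names are made up for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y) over a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Weight of one feature: H(Y) minus the P(v)-weighted conditional entropies."""
    n = len(labels)
    gain = entropy(labels)
    for v, count in Counter(feature_values).items():
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        gain -= (count / n) * entropy(subset)
    return gain
```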
The χ² statistic for a problem with k classes and m values for feature F:

\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{m} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}    (7)
where
O_{ij} is the observed number of instances with the i-th class label and the j-th value of feature F
E_{ij} is the expected number of such instances if the null hypothesis is true: E_{ij} = \frac{n_{\cdot j}\, n_{i\cdot}}{n_{\cdot\cdot}}
n_{ij} is the frequency count of instances with the i-th class label and the j-th value of feature F
- n_{\cdot j} = \sum_{i=1}^{k} n_{ij}
- n_{i\cdot} = \sum_{j=1}^{m} n_{ij}
- n_{\cdot\cdot} = \sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij}
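A small sketch of the same computation on a k × m contingency table of counts n_ij, using numpy (the function name is only illustrative):

```python
import numpy as np

def chi_squared(counts):
    """counts[i, j] = number of instances with class i and feature value j."""
    n = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # n_i.
    col = counts.sum(axis=0, keepdims=True)   # n_.j
    expected = row * col / n                  # E_ij under the null hypothesis
    return ((expected - counts) ** 2 / expected).sum()

# Example: 2 classes, 3 feature values
print(chi_squared(np.array([[10, 2, 1], [3, 8, 6]])))
```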
So far, all the instances in the neighborhood are weighted equally when computing the majority class. We may want to treat the votes from very close neighbors as more important than votes from more distant ones. A variety of distance weighting schemes have been proposed to implement this idea; one simple scheme is sketched below.
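One common scheme (not necessarily the one used in the original course) is inverse-distance weighting of the neighbors' votes; a minimal sketch:

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """neighbors: list of (distance, class_label) for the k nearest instances.

    Each neighbor votes with weight 1 / (distance + eps), so closer
    neighbors count for more than distant ones.
    """
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + eps)
    return max(votes, key=votes.get)

# Example: the two closest neighbors outvote one distant neighbor
print(weighted_vote([(0.1, "NOUN"), (0.2, "NOUN"), (2.5, "VERB")]))
```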
Kernels
6: α_i ← α_i + 1
7: b ← b + y^(k)
8: return (α, b)
A kernel computes a dot product in a transformed feature space, κ(x, x') = φ(x) · φ(x'), where the map φ projects the vectors in the original feature space onto the transformed feature space.
- It can also be thought of as a similarity function in the input object space.
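A minimal sketch of a dual (kernelized) perceptron consistent with the update steps shown above; the default kernel is a plain dot product, binary labels in {−1, +1} are assumed, and all names are illustrative:

```python
def kernel_perceptron(X, y, kernel=lambda a, b: sum(ai * bi for ai, bi in zip(a, b)), epochs=10):
    """Dual perceptron: alpha[i] counts how often example i triggered an update."""
    n = len(X)
    alpha, b = [0] * n, 0.0
    for _ in range(epochs):
        for k in range(n):
            # the prediction uses only kernel evaluations with the training points
            score = sum(alpha[i] * y[i] * kernel(X[i], X[k]) for i in range(n)) + b
            if y[k] * score <= 0:      # mistake: strengthen example k's vote
                alpha[k] += 1
                b += y[k]
    return alpha, b
```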
SVM
For linearly separable training instances ((x_1, y_1), ..., (x_n, y_n)) find the hyperplane (w, b) that solves the optimization problem:

minimize_{w,b}  \frac{1}{2}\|w\|^2    (15)
subject to  y_i (w \cdot x_i + b) \ge 1 \quad \forall i \in 1..n
[Figure: two-dimensional scatter plot illustrating linearly separable training instances]
Soft margin
SVM with a soft margin works by relaxing the requirement that all data points lie outside the margin. For each offending instance there is a "slack variable" ξ_i which measures how much it would have to move to obey the margin constraint.
minimize_{w,b}  \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i    (16)
subject to  y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i \in 1..n
where
\xi_i = \max(0,\, 1 - y_i (w \cdot x_i + b))
The hyper-parameter C trades off minimizing the norm of the weight
vector versus classifying correctly as many examples as possible.
As the value of C tends towards infinity the soft-margin SVM
approximates the hard-margin version.
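As a rough illustration (not the optimizer used in the course), the objective can be minimized in its equivalent hinge-loss form, with ξ_i substituted by max(0, 1 − y_i(w·x_i + b)), using stochastic sub-gradient descent; the learning rate and epoch count below are arbitrary assumptions:

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=100):
    """Approximately minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            grad_w = w / n                   # regularizer split across the n per-example steps
            grad_b = 0.0
            if margin < 1:                   # inside the margin: hinge loss is active
                grad_w = grad_w - C * y[i] * X[i]
                grad_b = -C * y[i]
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```

In this sketch, increasing C makes margin violations more costly, which is the sense in which the soft-margin problem approaches the hard-margin one.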
Stroppa and Chrupala (UdS) ML for NLP 2010 28 / 35
Dual form
The dual formulation is in terms of support vectors, where SV is the set of their
indices:
f(x, \alpha^*, b^*) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i^* \, (x_i \cdot x) + b^* \right)    (17)

The weights in this decision function are the \alpha_i^*.
maximize  W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j)    (18)
subject to  \sum_{i=1}^{n} y_i \alpha_i = 0, \quad \alpha_i \ge 0 \quad \forall i \in 1..n
The weights together with the support vectors determine (w, b):

w = \sum_{i \in SV} \alpha_i y_i x_i    (19)
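A short sketch of recovering the primal weight vector and evaluating the decision function from a dual solution; the arrays alpha, y, X, b are placeholders for an already-solved model:

```python
import numpy as np

def primal_weights(alpha, y, X):
    """w = sum over support vectors of alpha_i * y_i * x_i (eq. 19)."""
    sv = alpha > 0                       # support vectors have non-zero dual weight
    return (alpha[sv, None] * y[sv, None] * X[sv]).sum(axis=0)

def decision(x, alpha, y, X, b):
    """sign(sum_i y_i alpha_i (x_i . x) + b), the dual-form classifier (eq. 17)."""
    return np.sign(np.sum(alpha * y * (X @ x)) + b)
```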
Weight averaging
Show that the above algorithm performs weight averaging.
Hints:
- In the standard perceptron algorithm, the final weight vector (and bias) is the sum of the updates made at each step.
- In the averaged perceptron, the final weight vector should instead be the mean of the partial sums of the updates.

Naive weight averaging: the final weights are the mean of the partial sums:

w = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{i} f(x^{(j)})    (22)
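As an illustration of the naive averaging idea (a sketch, assuming binary labels in {−1, +1}; all names are made up): keep a running sum of the weight vector after every step and divide by the number of steps at the end.

```python
import numpy as np

def averaged_perceptron(X, y, epochs=5):
    """Perceptron whose returned weights are the mean of the partial sums (naive averaging)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    w_sum, b_sum, steps = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:   # mistake: standard perceptron update
                w += y[i] * X[i]
                b += y[i]
            w_sum += w                        # accumulate the current partial sum
            b_sum += b
            steps += 1
    return w_sum / steps, b_sum / steps       # averaged weights and bias
```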
For this exercise you should formulate the Viterbi algorithm for decoding a second-order HMM.