Linear Separators
Binary classification can be viewed as separating two classes in feature space with a hyperplane wTx + b = 0: points with wTx + b > 0 are labeled +1, points with wTx + b < 0 are labeled -1, so the classifier is f(x) = sign(wTx + b).
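As a quick illustration (a minimal Python sketch with made-up values for w and b, not part of the original slides), the classifier simply evaluates the sign of wTx + b:

import numpy as np

# Hypothetical separator: w and b are made-up values for illustration.
w = np.array([2.0, -1.0])
b = -0.5

def f(x):
    # Predict +1 on one side of the hyperplane wTx + b = 0, -1 on the other.
    return np.sign(w @ x + b)

print(f(np.array([1.0, 0.0])))   # w.x + b = 1.5  -> +1.0
print(f(np.array([0.0, 2.0])))   # w.x + b = -2.5 -> -1.0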
Linear Separators
Which of the linear separators is optimal?
Classification Margin
The distance from an example xi to the separator is r = |wTxi + b| / ||w||. The examples closest to the hyperplane are the support vectors, and the margin ρ of the separator is the distance between the support vectors.
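A minimal numeric sketch of this distance in Python (the vectors are assumptions chosen only for illustration):

import numpy as np

w = np.array([2.0, -1.0])    # hypothetical weight vector
b = -0.5                     # hypothetical bias
xi = np.array([1.0, 1.0])    # an example point

# Distance from xi to the hyperplane wTx + b = 0:  r = |wTxi + b| / ||w||
r = abs(w @ xi + b) / np.linalg.norm(w)
print(r)    # |2 - 1 - 0.5| / sqrt(5) ≈ 0.2236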
Linear SVM Mathematically
Assume the training set is separated by a hyperplane with margin ρ. Then for each training example (xi, yi):
wTxi + b ≤ -ρ/2 if yi = -1
wTxi + b ≥ ρ/2 if yi = 1
or, equivalently, yi(wTxi + b) ≥ ρ/2.
For every support vector xs this inequality is an equality, so after rescaling w and b by ρ/2 the distance from each support vector to the hyperplane is r = ys(wTxs + b) / ||w|| = 1 / ||w||, and the margin can be expressed through the rescaled w and b as ρ = 2r = 2 / ||w||.
We can therefore formulate the quadratic optimization problem:
Find w and b such that ρ = 2 / ||w|| is maximized and for all (xi, yi): yi(wTxi + b) ≥ 1.
Which can be reformulated as:
Find w and b such that Φ(w) = wTw is minimized and for all (xi, yi): yi(wTxi + b) ≥ 1.
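A sketch of what the maximized margin looks like numerically, assuming a tiny separable toy set and using scikit-learn's SVC with a very large C to approximate the hard-margin problem (the data and the library call are assumptions, not part of the slides):

import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (assumed for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates: minimize wTw subject to yi(wTxi + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
print(w, clf.intercept_[0])
print(2.0 / np.linalg.norm(w))   # margin rho = 2 / ||w||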
Solving the Optimization Problem
Find w and b such that Φ(w) = wTw is minimized and for all (xi, yi): yi(wTxi + b) ≥ 1.
This is a quadratic function to optimize subject to linear constraints. The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every inequality constraint in the primal problem:
Find α1…αN such that Q(α) = Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and (1) Σαiyi = 0, (2) αi ≥ 0 for all αi.
The Optimization Problem Solution
Given a solution α1…αN to the dual problem, the solution to the primal is:
w = Σαiyixi
b = yk - ΣαiyixiTxk for any k with αk > 0
Each non-zero αi indicates that the corresponding xi is a support vector. The classifying function is:
f(x) = ΣαiyixiTx + b
Notice that it relies on an inner product between the test point x and the support vectors xi; we will return to this later.
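A short sketch (assumed toy data; scikit-learn exposes the dual coefficients) showing that the classifier can be evaluated from the support vectors alone:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i for each support vector
sv = clf.support_vectors_          # the support vectors x_i
b = clf.intercept_[0]

x_test = np.array([2.5, 2.5])
# f(x) = sum_i alpha_i y_i (x_iT x) + b  (only support vectors contribute)
f = np.sum(alpha_y * (sv @ x_test)) + b
print(f, clf.decision_function([x_test])[0])   # the two values agree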
Soft Margin Classification
If the training set is not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
Soft Margin Classification Mathematically
The old formulation:
Find w and b such that Φ(w) = wTw is minimized and for all (xi, yi): yi(wTxi + b) ≥ 1.
The modified formulation incorporates slack variables:
Find w, b, and ξi such that Φ(w) = wTw + CΣξi is minimized and for all (xi, yi): yi(wTxi + b) ≥ 1 - ξi, ξi ≥ 0.
The parameter C trades off margin size against training error and can be viewed as a way to control overfitting.
Soft Margin Classification: Solution
The dual problem is identical to the separable case (it would not be identical if the 2-norm penalty for slack variables, CΣξi², were used in the primal objective; we would then need additional Lagrange multipliers for the slack variables):
Find α1…αN such that
Q(α) = Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
The solution is again
w = Σαiyixi
b = yk(1 - ξk) - ΣαiyixiTxk for any k with αk > 0
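A brief sketch of the role of C on a noisy, non-separable toy set (the data and C values are assumptions; scikit-learn's SVC solves this soft-margin dual internally):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not perfectly separable, so slack is needed.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates margin violations (more support vectors, wider margin);
    # large C penalizes slack heavily (fewer support vectors, narrower margin).
    print(C, clf.n_support_.sum(), 2.0 / np.linalg.norm(clf.coef_[0]))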
Theoretical Justification for Maximum Margins
The class of optimal linear separators has VC dimension h bounded from above as
h ≤ min(⌈D²/ρ²⌉, m0) + 1
where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality.
Linear SVMs: Overview
The most important training points are the support vectors; they define the hyperplane.
Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
Find α1…αN such that
Q(α) = Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
f(x) = ΣαiyixiTx + b
Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what should we do if the dataset is too hard to separate linearly? One idea is to map the data to a higher-dimensional space.
(Figure: 1D examples plotted along the x axis, before and after the mapping.)
General idea: the original feature space can always be mapped to some
higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
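A small sketch of an explicit feature map on made-up 1D data (not from the slides): the classes are not separable by any threshold on x, but become linearly separable after mapping φ(x) = (x, x²):

import numpy as np

# 1D data: class +1 sits in the middle, class -1 on both sides.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Map each point to (x, x^2); in this 2D space a horizontal line separates the classes.
phi = np.column_stack([x, x ** 2])
pred = np.where(phi[:, 1] < 1.0, 1, -1)   # rule: second coordinate below 1
print(np.array_equal(pred, y))            # True: the mapped data is separable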
What Functions Are Kernels?
A kernel function is a function that corresponds to an inner product in some expanded feature space: K(xi, xj) = φ(xi)Tφ(xj). For some functions K(xi, xj), checking directly that K(xi, xj) = φ(xi)Tφ(xj) for some φ can be cumbersome.
Mercer's theorem:
Every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
K = [ K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn)
      K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn)
      …
      K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) ]
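A quick numeric check of this condition in Python (a sketch assuming the Gaussian kernel; the random points are made up): the Gram matrix of a valid kernel is symmetric with no negative eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # six random points in R^3 (assumed data)

def gaussian_kernel(a, b, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# Gram matrix K[i, j] = K(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                     # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # positive semi-definite (up to round-off)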
Examples of Kernel Functions
Linear: K(xi, xj) = xiTxj; the feature mapping φ(x) is x itself.
Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p; the corresponding φ(x) has C(d+p, p) dimensions, where d is the input dimensionality.
Gaussian (radial-basis function): K(xi, xj) = exp(-||xi - xj||² / 2σ²); the corresponding φ(x) is infinite-dimensional.
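A minimal Python sketch of these kernel functions (the parameter values p and sigma are assumptions chosen for illustration):

import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, p=2):
    # (1 + aTb)^p: inner product in a feature space with C(d+p, p) dimensions.
    return (1 + a @ b) ** p

def gaussian_kernel(a, b, sigma=1.0):
    # exp(-||a - b||^2 / (2 sigma^2)): inner product in an infinite-dimensional feature space.
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

a = np.array([1.0, 2.0])
b = np.array([0.5, -1.0])
print(linear_kernel(a, b), polynomial_kernel(a, b), gaussian_kernel(a, b))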
SVM applications
SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
SVMs can be applied to complex data types beyond feature vectors (e.g., graphs, sequences, relational data) by designing kernel functions for such data.
SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. 97], principal component analysis [Schölkopf et al. 99], etc.
The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g., SMO [Platt 99] and [Joachims 99].
Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a trial-and-error manner.