
Support Vector Machines (and Kernel Methods in general)


Machine Learning March 23, 2010

Last Time
Multilayer Perceptron / Logistic Regression Networks
Neural Networks: Error Backpropagation

Today
Support Vector Machines
Note: we'll rely on some math from Optimality Theory that we won't derive.

Maximum Margin
Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary.

Are these really equally valid?



Max Margin
How can we pick which is best? Maximize the size of the margin.

[Figure: a small-margin separator vs. a large-margin separator]

Are these really equally valid?



Support Vectors
Support Vectors are those input points (vectors) closest to the decision boundary.
1. They are vectors.
2. They "support" the decision hyperplane.

Support Vectors
Define this as a decision problem. The decision hyperplane:

w^T x + b = 0

No fancy math, just the equation of a hyperplane.

Support Vectors
x_i are the data; t_i ∈ {−1, +1} are the labels.

Aside: why do some classifiers use t_i ∈ {0, 1} and others t_i ∈ {−1, +1}?

Simplicity of the math and of the interpretation. For probability density function estimation, {0, 1} has a clear correlate. For classification, a decision boundary at 0 is more easily interpretable than one at 0.5.

Support Vectors
Define this as a decision problem. The decision hyperplane:

w^T x + b = 0

Decision function:

D(x_i) = sign(w^T x_i + b)
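A minimal sketch of the decision function in NumPy (the helper name `decide` and the weight vector, bias, and query point are made-up values for illustration, not from the lecture):

```python
import numpy as np

def decide(x, w, b):
    """Decision function D(x) = sign(w^T x + b)."""
    return np.sign(w @ x + b)

# Hypothetical 2-D example.
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])
print(decide(x, w, b))  # 1.0 -> predict class +1
```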

Support Vectors
Define this as a decision problem. The decision hyperplane:

w^T x + b = 0

Margin hyperplanes:

w^T x + b = ε
w^T x + b = −ε

Support Vectors
The decision hyperplane:

w^T x + b = 0

Scale invariance:

c w^T x + c b = 0

Support Vectors
The decision hyperplane:

w^T x + b = 0

Scale invariance:

c w^T x + c b = 0

Margin hyperplanes:

w^T x + b = ε
w^T x + b = −ε

Support Vectors
The decision hyperplane:

w^T x + b = 0

Scale invariance:

c w^T x + c b = 0

This scaling does not change the decision hyperplane or the support vector hyperplanes, but it lets us eliminate a variable from the optimization by fixing the margin hyperplanes to:

w^T x + b = 1
w^T x + b = −1
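A quick numerical check of the scale-invariance claim (a sketch with made-up values): scaling w and b by the same positive constant c leaves sign(w^T x + b), and hence every decision, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b, c = np.array([1.0, -2.0]), 0.5, 3.7   # arbitrary illustrative weights and scale
X = rng.normal(size=(5, 2))                 # a few random 2-D points

original = np.sign(X @ w + b)
rescaled = np.sign(X @ (c * w) + c * b)
print(np.array_equal(original, rescaled))   # True: same decisions for every point
```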

What are we optimizing?


We will represent the size of the margin in terms of w. This will allow us to simultaneously:
1. Identify a decision boundary
2. Maximize the margin


How do we represent the size of the margin in terms of w?


1. There must be at least one point that lies on each support hyperplane.
Proof outline: if not, we could define a larger-margin support hyperplane that does touch the nearest point(s).

[Figure: the two support hyperplanes, with the nearest points x1 and x2 on them]



How do we represent the size of the margin in terms of w?


1. There must be at least one point that lies on each support hyperplane.
2. Thus:

   w^T x1 + b = 1
   w^T x2 + b = −1

3. And, subtracting:

   w^T (x1 − x2) = 2


How do we represent the size of the margin in terms of w?


The vector w is perpendicular to the decision hyperplane w^T x + b = 0.
If the dot product of two vectors equals zero, the two vectors are perpendicular: any two points x_a and x_b on the hyperplane satisfy w^T x_a + b = 0 = w^T x_b + b, so w^T (x_a − x_b) = 0.

Recall: w^T (x1 − x2) = 2


How do we represent the size of the margin in terms of w?


The margin is the projection of x1 − x2 onto w, the normal of the hyperplane.

w^T x + b = 0
w^T (x1 − x2) = 2


Aside: Vector Projection


u · v = ||u|| ||v|| cos(θ)

cos(θ) = adjacent / hypotenuse

Here the hypotenuse is ||v|| and the "adjacent" side is the projection we want:

projection = ||v|| cos(θ) = ||v|| · (u · v) / (||u|| ||v||) = (u · v) / ||u||


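A small NumPy sketch of the scalar projection formula above; the two vectors are arbitrary illustrative values.

```python
import numpy as np

u = np.array([3.0, 4.0])   # direction we project onto
v = np.array([2.0, 1.0])   # vector being projected

scalar_projection = (u @ v) / np.linalg.norm(u)   # ||v|| cos(theta) = (u . v) / ||u||
print(scalar_projection)  # 2.0
```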

How do we represent the size of the margin in terms of w?


The margin is the projection of x1 − x2 onto w, the normal of the hyperplane.

w^T (x1 − x2) = 2

Projection: w^T (x1 − x2) / ||w||

Size of the margin: 2 / ||w||

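A minimal numerical check of the margin formula (the weight vector is a made-up example): the distance between the hyperplanes w^T x + b = 1 and w^T x + b = −1 is 2 / ||w||.

```python
import numpy as np

w = np.array([3.0, 4.0])          # illustrative weight vector, ||w|| = 5
margin = 2.0 / np.linalg.norm(w)  # size of the margin between the support hyperplanes
print(margin)                     # 0.4
```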

Maximizing the margin


Goal: maximize the margin

max 2/||w||   ⟺   min ||w||
where t_i (w^T x_i + b) ≥ 1

Linear separability of the data by the decision boundary:

w^T x_i + b ≥ 1    if t_i = 1
w^T x_i + b ≤ −1   if t_i = −1
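As a sketch (with made-up points, labels, and weights), the two cases above collapse into the single vectorized constraint t_i (w^T x_i + b) ≥ 1:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 0.0], [-2.0, -1.0]])  # illustrative points
t = np.array([1, 1, -1])                               # their labels
w, b = np.array([1.0, 1.0]), -1.0                      # a candidate separator

print(t * (X @ w + b) >= 1)  # [ True  True  True ] -> all margin constraints hold
```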

Max Margin Loss Function

If constrained optimization, then Lagrange multipliers.

min ||w||  where  t_i (w^T x_i + b) ≥ 1

Optimize the primal:

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]


Max Margin Loss Function

Optimize the primal:

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]

Partial with respect to b:

∂L(w, b)/∂b = 0   ⟹   Σ_{i=0}^{N−1} α_i t_i = 0


Max Margin Loss Function

Optimize the primal:

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]

Partial with respect to w:

∂L(w, b)/∂w = 0   ⟹   w − Σ_{i=0}^{N−1} α_i t_i x_i = 0   ⟹   w = Σ_{i=0}^{N−1} α_i t_i x_i


Max Margin Loss Function

Optimize the primal:

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]

∂L(w, b)/∂w = 0   ⟹   w = Σ_{i=0}^{N−1} α_i t_i x_i

Now we have to find the α_i. Substitute w back into the loss function.


Max Margin Loss Function

Construct the dual. Substituting w = Σ_{i=0}^{N−1} α_i t_i x_i into

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]

gives

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

where α_i ≥ 0 and Σ_{i=0}^{N−1} α_i t_i = 0


Dual formulation of the error

Optimize this quadratic program to identify the Lagrange multipliers, and thus the weights:

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

where α_i ≥ 0

There exist (extremely) fast approaches to quadratic optimization in C, C++, Python, Java, and R.


Quadratic Programming

minimize  f(x) = (1/2) x^T Q x + c^T x
subject to (one or more of)  A x ≤ k,  B x = l

If Q is positive semidefinite, then f(x) is convex. If f(x) is convex, then every local minimum is a global minimum.

The dual fits this form:

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

where α_i ≥ 0
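As a hedged sketch (not the lecture's reference implementation), the dual can be mapped onto the standard QP form above and handed to an off-the-shelf solver; the cvxopt package is assumed to be available, and the helper name `fit_hard_margin_svm` is made up. Maximizing W(α) is rewritten as minimizing (1/2) α^T Q α − 1^T α with Q_ij = t_i t_j (x_i · x_j), subject to α_i ≥ 0 and Σ_i α_i t_i = 0.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes the cvxopt package is installed

def fit_hard_margin_svm(X, t):
    """Solve the hard-margin SVM dual as a QP. X is (N, d); t is (N,) with entries in {-1, +1}."""
    N = X.shape[0]
    Q = (t[:, None] * X) @ (t[:, None] * X).T        # Q_ij = t_i t_j (x_i . x_j)
    P, q = matrix(Q), matrix(-np.ones(N))            # minimize (1/2) a^T Q a - 1^T a
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))   # -a_i <= 0, i.e. a_i >= 0
    A, b = matrix(t.reshape(1, -1).astype(float)), matrix(0.0)  # sum_i a_i t_i = 0
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    w = ((alpha * t)[:, None] * X).sum(axis=0)       # w = sum_i a_i t_i x_i
    sv = alpha > 1e-6                                 # support vectors: a_i > 0
    b_offset = np.mean(t[sv] - X[sv] @ w)             # from t_i (w . x_i + b) = 1 on the SVs
    return w, b_offset, alpha
```

In practice one would usually reach for a dedicated SVM package rather than a general QP solver, but the mapping above is the whole reduction.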

Support Vector Expansion


Independent of the dimension of x!

New decision function:

D(x) = sign(w^T x + b)
     = sign((Σ_{i=0}^{N−1} α_i t_i x_i)^T x + b)
     = sign(Σ_{i=0}^{N−1} α_i t_i (x_i^T x) + b)

When α_i is non-zero, x_i is a support vector.
When α_i is zero, x_i is not a support vector.
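A small sketch of the expansion above (the helper name and the α threshold are illustrative): prediction needs only dot products between training points and the query, plus the multipliers.

```python
import numpy as np

def decide_from_alphas(x, alpha, t, X, b):
    """D(x) = sign( sum_i alpha_i t_i (x_i . x) + b ); only support vectors contribute."""
    sv = alpha > 1e-6                        # alpha_i > 0  <=>  x_i is a support vector
    return np.sign(np.sum(alpha[sv] * t[sv] * (X[sv] @ x)) + b)
```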

Kuhn-Tucker Conditions

In constrained optimization, at the optimal solution:

constraint × Lagrange multiplier = 0

α_i (1 − t_i (w^T x_i + b)) = 0,   so if α_i ≠ 0 then t_i (w^T x_i + b) = 1

Only points on the margin (support) hyperplanes contribute to the solution!


Visualization of Support Vectors

[Figure: data points with α = 0 versus support vectors with α > 0]


Interpretability of SVM parameters


What else can we tell from the alphas?
If α_i is large, then the associated data point is quite important: it is either an outlier or a genuinely critical example.

But this only gives us the best solution for linearly separable data sets.


Basis of Kernel Methods


W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

w = Σ_{i=0}^{N−1} α_i t_i x_i

The decision process doesn't depend on the dimensionality of the data, so we can map the data into a higher-dimensional space.
Note: data points only appear within a dot product. The error is based on the dot product of data points, not the data points themselves.

Basis of Kernel Methods


Since data points only appear within a dot product, we can map to another space through a replacement:

x_i · x_j  →  Φ(x_i) · Φ(x_j)

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

becomes

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (Φ(x_i) · Φ(x_j))

The error is based on the dot product of data points, not the data points themselves.
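A sketch of the substitution (the RBF kernel choice and the helper names are illustrative, not prescribed by the slides): replace every dot product x_i · x_j with a kernel k(x_i, x_j) = Φ(x_i) · Φ(x_j), computed without ever forming Φ explicitly.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2), an implicit dot product Phi(a) . Phi(b)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_decision(x, alpha, t, X, b, kernel=rbf_kernel):
    """Kernelized decision function: sign( sum_i alpha_i t_i k(x_i, x) + b )."""
    sv = alpha > 1e-6
    return np.sign(sum(a * ti * kernel(xi, x) for a, ti, xi in zip(alpha[sv], t[sv], X[sv])) + b)
```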

Learning Theory bases of SVMs


Theoretical bounds on testing error:
The upper bound doesn't depend on the dimensionality of the space.
The bound is improved by maximizing the margin associated with the decision boundary.


Why we like SVMs


They work
Good generalization

Easily interpreted
The decision boundary is based on the data, in the form of the support vectors.
Not so in multilayer perceptron networks.

Principled bounds on testing error from Learning Theory (VC dimension)



SVM vs. MLP


SVMs have many fewer parameters
SVM: maybe just a kernel parameter
MLP: number and arrangement of nodes, plus the learning rate η

SVM: convex optimization task
MLP: the likelihood is non-convex, so there are local minima, e.g. for the MLP empirical risk

R(θ) = (1/N) Σ_{n=0}^{N−1} (1/2) [y_n − g(Σ_k w_kl g(Σ_j w_jk g(Σ_i w_ij x_{n,i})))]^2

Soft margin classification

There can be outliers on the other side of the decision boundary, or ones that force a small margin.
Solution: introduce slack variables ξ_i and a penalty term into the objective:

min ||w|| + C Σ_{i=0}^{N−1} ξ_i
where t_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

L(w, b) = (1/2) w·w + C Σ_{i=0}^{N−1} ξ_i − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) + ξ_i − 1]

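A hedged sketch of soft-margin training with an off-the-shelf library (scikit-learn is assumed here; it is not the lecture's implementation): the parameter C is exactly the penalty weight above, and the tiny data set is made up to illustrate the call.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

X = np.array([[2.0, 2.0], [3.0, 0.0], [0.5, 0.8],
              [-2.0, -1.0], [-1.0, -2.0], [-0.4, -0.6]])
t = np.array([1, 1, 1, -1, -1, -1])   # illustrative labels

clf = SVC(kernel='linear', C=1.0)     # C weights the slack penalty term
clf.fit(X, t)
print(clf.support_vectors_)           # the training points with non-zero alpha
print(clf.predict([[1.0, 1.0]]))      # decision for a new point
```

Larger C penalizes margin violations more heavily (approaching the hard-margin case); smaller C tolerates more slack.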

Soft Max-Margin Dual

min ||w|| + C Σ_{i=0}^{N−1} ξ_i
where t_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

L(w, b) = (1/2) w·w + C Σ_{i=0}^{N−1} ξ_i − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) + ξ_i − 1]

Still quadratic programming!

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

where 0 ≤ α_i ≤ C and Σ_{i=0}^{N−1} α_i t_i = 0

Soft margin example

Points are allowed within the margin, but a cost is introduced.

[Figure: points x1 and x2 inside the margin, each paying a slack penalty ξ_i]

Hinge Loss

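A small NumPy sketch of the hinge loss named above (data and weights are made up): each point pays max(0, 1 − t_i (w^T x_i + b)), which is zero outside the margin and grows linearly with the violation.

```python
import numpy as np

X = np.array([[2.0, 2.0], [0.2, 0.1], [-2.0, -1.0]])  # illustrative points
t = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -1.0

hinge = np.maximum(0.0, 1.0 - t * (X @ w + b))  # per-point hinge loss / slack
print(hinge)  # [0.  1.7 0. ] -> only the point inside the margin is penalized
```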

Probabilities from SVMs

Support Vector Machines are discriminant functions.
Discriminant functions: f(x) = c
Discriminative models: f(x) = argmax_c p(c|x)
Generative models: f(x) = argmax_c p(x|c)p(c)/p(x)

No (principled) probabilities from SVMs: SVMs are not based on probability distribution functions of class instances.

Efficiency of SVMs

Not especially fast.
Training: roughly n^3
  Quadratic programming efficiency
Evaluation: roughly n
  Need to evaluate against each support vector (potentially n of them)


Good Bye
Next time:
The Kernel Trick -> Kernel Methods, or: how can we use SVMs on data that are not linearly separable?

