
Support Vector Machines (and Kernel Methods in general)


Machine Learning March 23, 2010

Last Time
Multilayer Perceptron / Logistic Regression Networks
Neural Networks: Error Backpropagation

Today
Support Vector Machines
Note: we'll rely on some math from Optimality Theory that we won't derive.

Maximum Margin
Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary.

Are these really equally valid?



Max Margin
How can we pick which is best? Maximize the size of the margin.

[Figure: a small-margin separator vs. a large-margin separator]

Are these really equally valid?



Support Vectors
Support Vectors are those input points (vectors) closest to the decision boundary.
1. They are vectors.
2. They "support" the decision hyperplane.

Support Vectors
Define this as a decision problem. The decision hyperplane:

w^T x + b = 0

No fancy math, just the equation of a hyperplane.

Support Vectors
x_i are the data; t_i ∈ {−1, +1} are the labels.

Aside: why do some classifiers use t_i ∈ {0, 1} and others t_i ∈ {−1, +1}?

Simplicity of the math and of the interpretation. For probability density function estimation, {0, 1} has a clear correlate. For classification, a decision boundary at 0 is more easily interpretable than one at 0.5.

Support Vectors
Define this as a decision problem. The decision hyperplane:

w^T x + b = 0

Decision function:

D(x_i) = sign(w^T x_i + b)
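A minimal sketch of the decision function in NumPy (the helper name `decide` and the weight vector, bias, and query point are made-up values for illustration, not from the lecture):

```python
import numpy as np

def decide(x, w, b):
    """Decision function D(x) = sign(w^T x + b)."""
    return np.sign(w @ x + b)

# Hypothetical 2-D example.
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])
print(decide(x, w, b))  # 1.0 -> predict class +1
```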

Support Vectors
Define this as a decision problem. The decision hyperplane:

w^T x + b = 0

Margin hyperplanes:

w^T x + b = ε
w^T x + b = −ε

Support Vectors
The decision hyperplane:

w^T x + b = 0

Scale invariance:

c w^T x + c b = 0

Support Vectors
The decision hyperplane:

w^T x + b = 0

Scale invariance:

c w^T x + c b = 0

Margin hyperplanes:

w^T x + b = ε
w^T x + b = −ε

Support Vectors
The decision hyperplane:

w^T x + b = 0

Scale invariance:

c w^T x + c b = 0

This scaling does not change the decision hyperplane or the support vector hyperplanes, but it lets us eliminate a variable from the optimization by fixing the margin hyperplanes to:

w^T x + b = 1
w^T x + b = −1
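A quick numerical check of the scale-invariance claim (a sketch with made-up values): scaling w and b by the same positive constant c leaves sign(w^T x + b), and hence every decision, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b, c = np.array([1.0, -2.0]), 0.5, 3.7   # arbitrary illustrative weights and scale
X = rng.normal(size=(5, 2))                 # a few random 2-D points

original = np.sign(X @ w + b)
rescaled = np.sign(X @ (c * w) + c * b)
print(np.array_equal(original, rescaled))   # True: same decisions for every point
```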

What are we optimizing?


We will represent the size of the margin in terms of w. This will allow us to simultaneously:
1. Identify a decision boundary
2. Maximize the margin


How do we represent the size of the margin in terms of w?


1. There must be at least one point that lies on each support hyperplane.
Proof outline: if not, we could define a larger-margin support hyperplane that does touch the nearest point(s).

[Figure: the two support hyperplanes, with the nearest points x1 and x2 on them]



How do we represent the size of the margin in terms of w?


1. There must be at least one point that lies on each support hyperplane.
2. Thus:

   w^T x1 + b = 1
   w^T x2 + b = −1

3. And, subtracting:

   w^T (x1 − x2) = 2


How do we represent the size of the margin in terms of w?


The vector w is perpendicular to the decision hyperplane w^T x + b = 0.
If the dot product of two vectors equals zero, the two vectors are perpendicular: any two points x_a and x_b on the hyperplane satisfy w^T x_a + b = 0 = w^T x_b + b, so w^T (x_a − x_b) = 0.

Recall: w^T (x1 − x2) = 2


How do we represent the size of the margin in terms of w?


The margin is the projection of x1 − x2 onto w, the normal of the hyperplane.

w^T x + b = 0
w^T (x1 − x2) = 2


Aside: Vector Projection


u · v = ||u|| ||v|| cos(θ)

cos(θ) = adjacent / hypotenuse

Here the hypotenuse is ||v|| and the "adjacent" side is the projection we want:

projection = ||v|| cos(θ) = ||v|| · (u · v) / (||u|| ||v||) = (u · v) / ||u||


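A small NumPy sketch of the scalar projection formula above; the two vectors are arbitrary illustrative values.

```python
import numpy as np

u = np.array([3.0, 4.0])   # direction we project onto
v = np.array([2.0, 1.0])   # vector being projected

scalar_projection = (u @ v) / np.linalg.norm(u)   # ||v|| cos(theta) = (u . v) / ||u||
print(scalar_projection)  # 2.0
```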

How do we represent the size of the margin in terms of w?


The margin is the projection of x1 − x2 onto w, the normal of the hyperplane.

w^T (x1 − x2) = 2

Projection: w^T (x1 − x2) / ||w||

Size of the margin: 2 / ||w||

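A minimal numerical check of the margin formula (the weight vector is a made-up example): the distance between the hyperplanes w^T x + b = 1 and w^T x + b = −1 is 2 / ||w||.

```python
import numpy as np

w = np.array([3.0, 4.0])          # illustrative weight vector, ||w|| = 5
margin = 2.0 / np.linalg.norm(w)  # size of the margin between the support hyperplanes
print(margin)                     # 0.4
```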

Maximizing the margin


Goal: maximize the margin

max 2/||w||   ⟺   min ||w||
where t_i (w^T x_i + b) ≥ 1

Linear separability of the data by the decision boundary:

w^T x_i + b ≥ 1    if t_i = 1
w^T x_i + b ≤ −1   if t_i = −1
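As a sketch (with made-up points, labels, and weights), the two cases above collapse into the single vectorized constraint t_i (w^T x_i + b) ≥ 1:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 0.0], [-2.0, -1.0]])  # illustrative points
t = np.array([1, 1, -1])                               # their labels
w, b = np.array([1.0, 1.0]), -1.0                      # a candidate separator

print(t * (X @ w + b) >= 1)  # [ True  True  True ] -> all margin constraints hold
```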

Max Margin Loss Function

If constrained optimization, then Lagrange multipliers.

min ||w||  where  t_i (w^T x_i + b) ≥ 1

Optimize the primal:

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]


Max Margin Loss Function

Optimize the primal:

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]

Partial with respect to b:

∂L(w, b)/∂b = 0   ⟹   Σ_{i=0}^{N−1} α_i t_i = 0


Max Margin Loss Function

Optimize the primal:

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]

Partial with respect to w:

∂L(w, b)/∂w = 0   ⟹   w − Σ_{i=0}^{N−1} α_i t_i x_i = 0   ⟹   w = Σ_{i=0}^{N−1} α_i t_i x_i


Max Margin Loss Function

Optimize the primal:

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]

∂L(w, b)/∂w = 0   ⟹   w = Σ_{i=0}^{N−1} α_i t_i x_i

Now we have to find the α_i. Substitute w back into the loss function.


Max Margin Loss Function

Construct the dual. Substituting w = Σ_{i=0}^{N−1} α_i t_i x_i into

L(w, b) = (1/2) w·w − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) − 1]

gives

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

where α_i ≥ 0 and Σ_{i=0}^{N−1} α_i t_i = 0


Dual formulation of the error

Optimize this quadratic program to identify the Lagrange multipliers, and thus the weights:

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

where α_i ≥ 0

There exist (extremely) fast approaches to quadratic optimization in C, C++, Python, Java, and R.


Quadratic Programming

minimize  f(x) = (1/2) x^T Q x + c^T x
subject to (one or more of)  A x ≤ k,  B x = l

If Q is positive semidefinite, then f(x) is convex. If f(x) is convex, then every local minimum is a global minimum.

The dual fits this form:

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

where α_i ≥ 0
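As a hedged sketch (not the lecture's reference implementation), the dual can be mapped onto the standard QP form above and handed to an off-the-shelf solver; the cvxopt package is assumed to be available, and the helper name `fit_hard_margin_svm` is made up. Maximizing W(α) is rewritten as minimizing (1/2) α^T Q α − 1^T α with Q_ij = t_i t_j (x_i · x_j), subject to α_i ≥ 0 and Σ_i α_i t_i = 0.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes the cvxopt package is installed

def fit_hard_margin_svm(X, t):
    """Solve the hard-margin SVM dual as a QP. X is (N, d); t is (N,) with entries in {-1, +1}."""
    N = X.shape[0]
    Q = (t[:, None] * X) @ (t[:, None] * X).T        # Q_ij = t_i t_j (x_i . x_j)
    P, q = matrix(Q), matrix(-np.ones(N))            # minimize (1/2) a^T Q a - 1^T a
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))   # -a_i <= 0, i.e. a_i >= 0
    A, b = matrix(t.reshape(1, -1).astype(float)), matrix(0.0)  # sum_i a_i t_i = 0
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    w = ((alpha * t)[:, None] * X).sum(axis=0)       # w = sum_i a_i t_i x_i
    sv = alpha > 1e-6                                 # support vectors: a_i > 0
    b_offset = np.mean(t[sv] - X[sv] @ w)             # from t_i (w . x_i + b) = 1 on the SVs
    return w, b_offset, alpha
```

In practice one would usually reach for a dedicated SVM package rather than a general QP solver, but the mapping above is the whole reduction.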

Support Vector Expansion


Independent of the dimension of x!

New decision function:

D(x) = sign(w^T x + b)
     = sign((Σ_{i=0}^{N−1} α_i t_i x_i)^T x + b)
     = sign(Σ_{i=0}^{N−1} α_i t_i (x_i^T x) + b)

When α_i is non-zero, x_i is a support vector.
When α_i is zero, x_i is not a support vector.
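A small sketch of the expansion above (the helper name and the α threshold are illustrative): prediction needs only dot products between training points and the query, plus the multipliers.

```python
import numpy as np

def decide_from_alphas(x, alpha, t, X, b):
    """D(x) = sign( sum_i alpha_i t_i (x_i . x) + b ); only support vectors contribute."""
    sv = alpha > 1e-6                        # alpha_i > 0  <=>  x_i is a support vector
    return np.sign(np.sum(alpha[sv] * t[sv] * (X[sv] @ x)) + b)
```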

Kuhn-Tucker Conditions

In constrained optimization, at the optimal solution:

constraint × Lagrange multiplier = 0

α_i (1 − t_i (w^T x_i + b)) = 0,   so if α_i ≠ 0 then t_i (w^T x_i + b) = 1

Only points on the margin (support) hyperplanes contribute to the solution!


Visualization of Support Vectors

[Figure: data points with α = 0 versus support vectors with α > 0]


Interpretability of SVM parameters


What else can we tell from the alphas?
If α_i is large, then the associated data point is quite important: it is either an outlier or a genuinely critical example.

But this only gives us the best solution for linearly separable data sets.


Basis of Kernel Methods


W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

w = Σ_{i=0}^{N−1} α_i t_i x_i

The decision process doesn't depend on the dimensionality of the data, so we can map the data into a higher-dimensional space.
Note: data points only appear within a dot product. The error is based on the dot product of data points, not the data points themselves.

Basis of Kernel Methods


Since data points only appear within a dot product, we can map to another space through a replacement:

x_i · x_j  →  Φ(x_i) · Φ(x_j)

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

becomes

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (Φ(x_i) · Φ(x_j))

The error is based on the dot product of data points, not the data points themselves.
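A sketch of the substitution (the RBF kernel choice and the helper names are illustrative, not prescribed by the slides): replace every dot product x_i · x_j with a kernel k(x_i, x_j) = Φ(x_i) · Φ(x_j), computed without ever forming Φ explicitly.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2), an implicit dot product Phi(a) . Phi(b)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_decision(x, alpha, t, X, b, kernel=rbf_kernel):
    """Kernelized decision function: sign( sum_i alpha_i t_i k(x_i, x) + b )."""
    sv = alpha > 1e-6
    return np.sign(sum(a * ti * kernel(xi, x) for a, ti, xi in zip(alpha[sv], t[sv], X[sv])) + b)
```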

Learning Theory bases of SVMs


Theoretical bounds on testing error:
The upper bound doesn't depend on the dimensionality of the space.
The bound is improved by maximizing the margin associated with the decision boundary.


Why we like SVMs


They work
Good generalization

Easily interpreted
The decision boundary is based on the data, in the form of the support vectors.
Not so in multilayer perceptron networks.

Principled bounds on testing error from Learning Theory (VC dimension)



SVM vs. MLP


SVMs have many fewer parameters
SVM: maybe just a kernel parameter
MLP: number and arrangement of nodes, plus the learning rate η

SVM: convex optimization task
MLP: the likelihood is non-convex, so there are local minima, e.g. for the MLP empirical risk

R(θ) = (1/N) Σ_{n=0}^{N−1} (1/2) [y_n − g(Σ_k w_kl g(Σ_j w_jk g(Σ_i w_ij x_{n,i})))]^2

Soft margin classification

There can be outliers on the other side of the decision boundary, or ones that force a small margin.
Solution: introduce slack variables ξ_i and a penalty term into the objective:

min ||w|| + C Σ_{i=0}^{N−1} ξ_i
where t_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

L(w, b) = (1/2) w·w + C Σ_{i=0}^{N−1} ξ_i − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) + ξ_i − 1]

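A hedged sketch of soft-margin training with an off-the-shelf library (scikit-learn is assumed here; it is not the lecture's implementation): the parameter C is exactly the penalty weight above, and the tiny data set is made up to illustrate the call.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

X = np.array([[2.0, 2.0], [3.0, 0.0], [0.5, 0.8],
              [-2.0, -1.0], [-1.0, -2.0], [-0.4, -0.6]])
t = np.array([1, 1, 1, -1, -1, -1])   # illustrative labels

clf = SVC(kernel='linear', C=1.0)     # C weights the slack penalty term
clf.fit(X, t)
print(clf.support_vectors_)           # the training points with non-zero alpha
print(clf.predict([[1.0, 1.0]]))      # decision for a new point
```

Larger C penalizes margin violations more heavily (approaching the hard-margin case); smaller C tolerates more slack.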

Soft Max-Margin Dual

min ||w|| + C Σ_{i=0}^{N−1} ξ_i
where t_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

L(w, b) = (1/2) w·w + C Σ_{i=0}^{N−1} ξ_i − Σ_{i=0}^{N−1} α_i [t_i (w·x_i + b) + ξ_i − 1]

Still quadratic programming!

W(α) = Σ_{i=0}^{N−1} α_i − (1/2) Σ_{i,j=0}^{N−1} α_i α_j t_i t_j (x_i · x_j)

where 0 ≤ α_i ≤ C and Σ_{i=0}^{N−1} α_i t_i = 0

Soft margin example

Points are allowed within the margin, but a cost is introduced.

[Figure: points x1 and x2 inside the margin, each paying a slack penalty ξ_i]

Hinge Loss

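A small NumPy sketch of the hinge loss named above (data and weights are made up): each point pays max(0, 1 − t_i (w^T x_i + b)), which is zero outside the margin and grows linearly with the violation.

```python
import numpy as np

X = np.array([[2.0, 2.0], [0.2, 0.1], [-2.0, -1.0]])  # illustrative points
t = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -1.0

hinge = np.maximum(0.0, 1.0 - t * (X @ w + b))  # per-point hinge loss / slack
print(hinge)  # [0.  1.7 0. ] -> only the point inside the margin is penalized
```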

Probabilities from SVMs

Support Vector Machines are discriminant functions.
Discriminant functions: f(x) = c
Discriminative models: f(x) = argmax_c p(c|x)
Generative models: f(x) = argmax_c p(x|c)p(c)/p(x)

No (principled) probabilities from SVMs: SVMs are not based on probability distribution functions of class instances.

Efficiency of SVMs

Not especially fast.
Training: roughly n^3
  Quadratic programming efficiency
Evaluation: roughly n
  Need to evaluate against each support vector (potentially n of them)


Good Bye
Next time:
The Kernel Trick -> Kernel Methods, or: how can we use SVMs on data that are not linearly separable?

