Last Time
Multilayer Perceptron / Logistic Regression
Neural Networks
Error Backpropagation

Today
Support Vector Machines
Note: we'll rely on some math from Optimality Theory that we won't derive.
Maximum Margin
The Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary.
Max Margin
How can we pick which is best? Maximize the size of the margin.

[Figure: the same data separated with a small margin vs. a large margin]
Support Vectors
Support vectors are those input points (vectors) closest to the decision boundary.
1. They are vectors.
2. They "support" the decision hyperplane.
Support Vectors
Define this as a decision problem.

The decision hyperplane:
$w^T x + b = 0$

No fancy math, just the equation of a hyperplane.
Support Vectors
The $x_i$ are the data; the $t_i \in \{-1, +1\}$ are the labels.
Aside: Why do some classifiers use $t_i \in \{0, 1\}$ and others $t_i \in \{-1, +1\}$?
Simplicity of the math and interpretation.
For probability density function estimation, $\{0, 1\}$ has a clear correlate.
For classification, a decision boundary of 0 is more easily interpretable than 0.5.
Support Vectors
Define this as a decision problem.

The decision hyperplane:
$w^T x + b = 0$

Decision function:
$D(x_i) = \operatorname{sign}(w^T x_i + b)$
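As a quick illustration (my sketch, not from the slides), this decision function in NumPy, assuming the data points are the rows of a matrix X:

```python
import numpy as np

def decision(X, w, b):
    """Evaluate D(x) = sign(w^T x + b) for each row of X."""
    return np.sign(X @ w + b)

# Toy check with an assumed hyperplane w = (1, -1), b = 0.
X = np.array([[2.0, 1.0], [0.0, 3.0]])
w = np.array([1.0, -1.0])
print(decision(X, w, 0.0))  # -> [ 1. -1.]
```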
Support Vectors
Define this as a decision problem.

The decision hyperplane:
$w^T x + b = 0$

Margin hyperplanes:
$w^T x + b = \varepsilon$
$w^T x + b = -\varepsilon$
Support Vectors
The decision hyperplane:
$w^T x + b = 0$

Scale invariance:
$c\,w^T x + c\,b = 0$

This scaling does not change the decision hyperplane or the support vector hyperplanes. But it lets us eliminate a variable from the optimization: choose the scale so that the margin hyperplanes become

$w^T x + b = 1$
$w^T x + b = -1$
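A tiny numeric check (my addition, with arbitrary assumed values of w, b, and c > 0) of this scale invariance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # a few arbitrary points
w, b, c = np.array([1.0, -2.0]), 0.5, 3.7

# Scaling (w, b) by c > 0 leaves every decision unchanged.
print(np.sign(X @ w + b))
print(np.sign(X @ (c * w) + c * b))      # identical signs
```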
[Figure: two points $x_1$ and $x_2$, one on each margin hyperplane $w^T x + b = \pm 1$]
1. $w^T x_1 + b = 1$
2. $w^T x_2 + b = -1$
3. And, subtracting:
$w^T (x_1 - x_2) = 2$
$w^T (x_1 - x_2) = 2$

Projection onto $w$:
$\frac{w^T (x_1 - x_2)}{\|w\|} = \frac{2}{\|w\|}$

Size of the margin:
$\frac{2}{\|w\|}$
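A small check (my illustration, with an assumed hyperplane and points chosen to sit on the two margin hyperplanes) that the projected distance works out to 2/||w||:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -2.0   # assumed hyperplane, ||w|| = 5
x1 = np.array([1.0, 0.0])           # w.x1 + b = 1  -> on the +1 hyperplane
x2 = np.array([1.0 / 3.0, 0.0])     # w.x2 + b = -1 -> on the -1 hyperplane

norm_w = np.linalg.norm(w)
print(w @ (x1 - x2) / norm_w)       # projected distance between the hyperplanes
print(2.0 / norm_w)                 # 2 / ||w|| -- the same value, 0.4
```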
Maximize the margin $\frac{2}{\|w\|}$ subject to the constraints

$t_i (w^T x_i + b) \geq 1$

that is,
$w^T x_i + b \geq 1$ if $t_i = 1$
$w^T x_i + b \leq -1$ if $t_i = -1$
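Equivalently (a standard restatement, not spelled out on the slide), maximizing 2/||w|| is the same as minimizing (1/2)||w||^2, which gives the usual primal form:

```latex
\min_{w,\, b} \; \frac{1}{2}\,\|w\|^2
\quad \text{subject to} \quad
t_i \left( w^T x_i + b \right) \ge 1, \qquad i = 0, \dots, N-1
```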
The Lagrangian:
$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i ((w \cdot x_i) + b) - 1 \right]$

Setting the partial derivatives to zero:
$\frac{\partial L}{\partial b} = -\sum_{i=0}^{N-1} \alpha_i t_i = 0$
$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

Substituting back gives the dual:
$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$
where $\alpha_i \geq 0$

There exist (extremely) fast approaches to quadratic optimization in C, C++, Python, Java, and R.
Quadratic Programming
$\min_x \; f(x) = \frac{1}{2} x^T Q x + c^T x$
subject to (one or more of)
$Ax \leq k$
$Bx = l$

If $Q$ is positive semidefinite, then $f(x)$ is convex. If $f(x)$ is convex, then there is a single global optimum.

The SVM dual has exactly this form:
$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$
where $\alpha_i \geq 0$
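As a rough sketch (my own illustration, not the lecture's code), the hard-margin dual can be handed to a general-purpose solver such as SciPy's SLSQP; a dedicated QP package would be much faster in practice:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, t):
    """Maximize W(alpha) for the hard-margin dual (small problems only)."""
    N = len(t)
    Q = (t[:, None] * t[None, :]) * (X @ X.T)      # Q_ij = t_i t_j (x_i . x_j)

    def neg_W(alpha):                              # minimize -W(alpha)
        return 0.5 * alpha @ Q @ alpha - alpha.sum()

    res = minimize(neg_W, np.zeros(N), method="SLSQP",
                   bounds=[(0, None)] * N,                              # alpha_i >= 0
                   constraints={"type": "eq", "fun": lambda a: a @ t})  # sum_i alpha_i t_i = 0
    return res.x

# Tiny linearly separable example (assumed data).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
alpha = svm_dual(X, t)
w = (alpha * t) @ X                                # w = sum_i alpha_i t_i x_i
print(alpha.round(4), w.round(3))
```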
When $\alpha_i$ is non-zero, $x_i$ is a support vector.
When $\alpha_i$ is zero, $x_i$ is not a support vector.
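Reading the support vectors off the multipliers (a continuation of the sketch above, with the assumed alpha values it returns):

```python
import numpy as np

# alpha as returned by the dual-solver sketch above (assumed values).
alpha = np.array([0.0625, 0.0, 0.0625, 0.0])
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])

sv_mask = alpha > 1e-6           # "non-zero" up to numerical tolerance
print(np.where(sv_mask)[0])      # indices of the support vectors -> [0 2]
print(X[sv_mask])                # the support vectors themselves
```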
$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

$D(x) = \operatorname{sign}\left( \sum_{i=0}^{N-1} \alpha_i t_i (x_i^T x) + b \right)$

Kuhn-Tucker Conditions
In constrained optimization, at the optimal solution:
Constraint * Lagrange Multiplier = 0

$\alpha_i \left( 1 - t_i (w^T x_i + b) \right) = 0$

If $\alpha_i \neq 0$, then $t_i (w^T x_i + b) = 1$.
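A small follow-on sketch (again my illustration, reusing the assumed toy values) recovering b from any support vector via this condition: $t_s (w \cdot x_s + b) = 1$ implies $b = t_s - w \cdot x_s$ when $t_s \in \{-1, +1\}$.

```python
import numpy as np

# Assumed values carried over from the dual-solver sketch above.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.0625, 0.0, 0.0625, 0.0])

w = (alpha * t) @ X              # w = sum_i alpha_i t_i x_i
s = int(np.argmax(alpha))        # index of some support vector
b = t[s] - w @ X[s]              # from t_s (w . x_s + b) = 1
print(w, b)                      # -> [0.25 0.25] 0.0
print(np.sign(X @ w + b))        # D(x) reproduces the labels
```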
But this only gives us the best solution for linearly separable data sets.
$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

The decision process doesn't depend on the dimensionality of the data. We can map to a higher-dimensional data space.

Note: data points only appear within a dot product. The error is based on the dot product of data points, not the data points themselves.
$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

The error is based on the dot product of data points, not the data points themselves. So we can replace

$x_i \cdot x_j \;\rightarrow\; \Phi(x_i) \cdot \Phi(x_j)$
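A hedged illustration of this substitution (the explicit feature map Phi and the data here are my own choices, not the lecture's): the dual only needs the matrix of pairwise products, so swapping $x_i \cdot x_j$ for $\Phi(x_i) \cdot \Phi(x_j)$ is a one-line change.

```python
import numpy as np

def linear_gram(X):
    """Matrix of plain dot products x_i . x_j."""
    return X @ X.T

def phi(X):
    """An assumed explicit feature map: (x1, x2) -> (x1, x2, x1 * x2)."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

def mapped_gram(X):
    """Matrix of Phi(x_i) . Phi(x_j) -- the only change seen by the dual."""
    Z = phi(X)
    return Z @ Z.T

X = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
print(linear_gram(X))
print(mapped_gram(X))   # feed this matrix to the same dual optimization
```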
Easily interpreted: the decision boundary is based on the data, in the form of the support vectors. Not so in multilayer perceptron networks.
Allow some points to violate the margin by introducing slack variables $\xi_i$:

$\min_{w, b, \xi} \; \|w\| + C \sum_{i=0}^{N-1} \xi_i$
where $t_i (w^T x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$

The Lagrangian:
$L(w, b) = \frac{1}{2} w \cdot w + C \sum_{i=0}^{N-1} \xi_i - \sum_{i=0}^{N-1} \alpha_i \left[ t_i ((w \cdot x_i) + b) + \xi_i - 1 \right]$

The dual:
$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} t_i t_j \alpha_i \alpha_j (x_i \cdot x_j)$
where $0 \leq \alpha_i \leq C$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$
Hinge Loss
[Figure: points $x_1$, $x_2$ with a slack $\xi_i$ illustrating the hinge loss]
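A minimal sketch of the hinge loss (my addition, with assumed scores $w \cdot x_i + b$ and labels in {-1, +1}):

```python
import numpy as np

def hinge_loss(scores, t):
    """Per-example hinge loss: max(0, 1 - t_i * (w.x_i + b))."""
    return np.maximum(0.0, 1.0 - t * scores)

scores = np.array([2.0, 0.3, -1.5])    # assumed values of w.x_i + b
t = np.array([1.0, 1.0, 1.0])
print(hinge_loss(scores, t))           # -> [0.  0.7 2.5]
```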
No (principled) probabilities from SVMs
SVMs are not based on probability distribution functions of class instances.
Efficiency of SVMs
Not especially fast.
Training: n^3 (quadratic programming efficiency).
Evaluation: n (need to evaluate against each support vector, potentially n of them).
Good Bye
Next time: The Kernel Trick -> Kernel Methods
or: How can we use SVMs when the data are not linearly separable?