
SUPPORT VECTOR MACHINES
EMARO Group
Ahmad AD
Daravuth Koung
Debaleena Misra
Fernando Nunez Mendoza
Sukumar Karumuri
Yu-Sin Lin

Date: 09.12.16
Outline
Introduction to Machine Learning
Definition
Classification & Techniques
Support Vector Machine
Definition and Application
Mathematical Detail
Computational Example
Comparison with Neural Network
Conclusion
Machine Learning - Definition
What is Machine Learning?

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
- Tom M. Mitchell, Head of M.L. Dept. at CMU

Spam Email Filtering:
E: Collected data (spam and non-spam emails)
T: Marking each email as spam or not
P: Accuracy (correct decisions / total decisions)
Machine Learning - Classification
By type of problems and tasks
Supervised Learning
Unsupervised Learning
Reinforcement Learning
By output of a machine-learned system
Classification (Two-Class & Multiclass)
Regression
Clustering

[Figure: examples of classification, regression, and clustering]
Machine Learning Techniques
Linear regression (R)
Logistic regression (C)
Trees, forests, and jungles (R & C)
Neural networks (R & C)
Bayes methods (R & C)
K-Means clustering (K)
SVMs (C)
(R = regression, C = classification, K = clustering)
[Figure: Bayes classifier, K-Means clustering, neural network, and decision-tree examples]
History of SVM
Support Vector Machine (SVM) is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
Said to start in 1979 with Vladimir Vapnik's paper; the major developments came in the 1990s.
SVM became popular because of its success in handwritten digit recognition (NIST, 1998), where it gave accuracy comparable to that of sophisticated neural networks.
What is SVM useful for?
Pattern recognition:
- Object recognition
- Text categorization
- Face recognition
- Handwriting recognition
Bioinformatics (protein classification, cancer classification)
Stock market performance
Linearly Separable Data
[Figure: a training set of two classes separated by a hyper-plane]
Many possible decision boundaries can separate these points into two classes.
Which one to choose?
The SVM Approach
The SVM algorithm seeks to maximise the margin around the separating hyper-plane, so that the decision boundary is as far away from the data of both classes as possible.
The decision function is fully specified by the subset of training samples closest to the hyper-plane, called the support vectors.
[Figure: support vectors and the margin of width d on either side of the hyper-plane]
Thus, we get the widest margin when |a| is at its minimum.
How to optimise?
The Mathematics behind the SVM

Two classes, labelled +1 / -1.
Objective: obtain the best distance margin.
Though H2 divides the sample space, it has a very low distance margin; H3 is the best option.
For a linearly separable point set
In 2D, the hyperplane is a line of the form Ax + By + C = 0; in the notation used here, A and B are the weights a1 and a2, and C is the bias a0, so the hyperplane is a1*x1 + a2*x2 + a0 = 0.
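As an illustration of the decision rule such a hyperplane defines (a hypothetical sketch; the coefficients below are made up, not taken from the slides):

```python
def classify(x, a, a0):
    """Assign class +1 or -1 according to which side of the
    hyper-plane a.x + a0 = 0 the point x falls on."""
    score = sum(ai * xi for ai, xi in zip(a, x)) + a0
    return 1 if score >= 0 else -1

# Example line x1 + x2 - 1 = 0, i.e. a = (1, 1), a0 = -1:
# points above the line are classed +1, points below are -1.
```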
Required optimization
Distance from the origin to hyperplane 1 (a.x + a0 = 1): |1 - a0| / |a|
For hyperplane 2 (a.x + a0 = -1): |-1 - a0| / |a|
Hence the distance between them (the margin) is 2 / |a|
Goal: minimize |a|
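A quick numeric check of the 2/|a| margin width (a minimal pure-Python sketch, not from the slides):

```python
import math

def margin_width(a):
    """Distance between the hyper-planes a.x + a0 = +1 and a.x + a0 = -1,
    which is 2 / |a| regardless of a0."""
    norm = math.sqrt(sum(ai * ai for ai in a))
    return 2.0 / norm

# For a = (3, 4), |a| = 5, so the margin is 2/5 = 0.4.
```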


Solutions
Lagrange function (solving the constrained problem with Lagrange multipliers)
The quadprog function in MATLAB
Using support vectors: support vectors are the training samples used for the computation of the hyperplane.
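For the hard-margin problem (minimize ½|a|² subject to yi(a·xi + a0) ≥ 1), the Lagrange function referred to above is typically written as:

```latex
L(\mathbf{a}, a_0, \boldsymbol{\lambda})
  = \frac{1}{2}\lVert \mathbf{a} \rVert^{2}
  - \sum_{i=1}^{N} \lambda_i \left[ y_i\,(\mathbf{a}\cdot\mathbf{x}_i + a_0) - 1 \right],
  \qquad \lambda_i \ge 0
```

Setting the derivatives with respect to a and a0 to zero yields the dual quadratic program that quadprog (or any QP solver) can handle.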
Extension to non-linear sample sets

Penalty component, or hinge-loss function
Here, E5, E6, E7, E8 are error margins
Kernel functions (explained in a subsequent slide)
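The hinge loss mentioned above can be written, as a minimal sketch, as:

```python
def hinge_loss(y, fx):
    """Hinge loss for a label y in {-1, +1} and classifier output
    fx = a.x + a0: zero when the point is on the correct side of the
    margin (y * fx >= 1), growing linearly with the violation otherwise."""
    return max(0.0, 1.0 - y * fx)
```

Correctly classified points outside the margin contribute nothing, which is what makes the soft-margin objective depend only on the support vectors and the margin violators.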
Non Linear Classification: Kernel

What if the data is not linearly separable?


Non Linear Classification: Kernel

For such data, non-linear classification is performed.
The Gaussian kernel is generally used.
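A minimal sketch of the Gaussian (RBF) kernel, K(x, z) = exp(-|x - z|² / (2σ²)):

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: equals 1 when x == z and decays
    towards 0 as the points move apart; sigma controls the decay."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```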
Non Linear Classification: Gaussian Kernel
COMPUTATIONAL EXAMPLE

SPAM CLASSIFIER
SPAM CLASSIFIER
Many mail services today provide a spam classifier.
Here a spam-classifier technique based on SVM is presented. (This code was written as part of the Coursera course on Machine Learning.)
SPAM Classifier: Objective
To train a classifier capable of distinguishing between spam and non-spam emails with a certain accuracy.
In other words, we want to predict y = 1 for a spam email and y = 0 for a non-spam email.
SPAM classifier: Preprocessing Steps

Only the body of the dataset emails is considered.
Lower casing
Stripping HTML
Normalizing URLs
Normalizing Email Addresses
Normalizing Numbers
Normalizing Dollars
Word Stemming
Removal of Special Characters
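The actual course code is not reproduced here, but several of the steps above could be sketched with regular expressions (a hypothetical sketch; the placeholder tokens httpaddr, emailaddr, number, dollar are illustrative):

```python
import re

def preprocess_email(body):
    """A minimal sketch of several of the preprocessing steps above."""
    body = body.lower()                                     # lower casing
    body = re.sub(r'<[^<>]+>', ' ', body)                   # strip HTML tags
    body = re.sub(r'(http|https)://\S+', 'httpaddr', body)  # normalize URLs
    body = re.sub(r'\S+@\S+', 'emailaddr', body)            # normalize email addresses
    body = re.sub(r'\d+', 'number', body)                   # normalize numbers
    body = re.sub(r'[$]+', 'dollar', body)                  # normalize dollars
    body = re.sub(r'[^a-z ]', ' ', body)                    # remove special characters
    return body
```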
SPAM classifier: Preprocessing Steps
SPAM classifier: Normalization
SPAM Classifier: Vocabulary List
After preprocessing, a vocabulary list is created.
How do we choose which words to use and which to discard?
Here, only words that occur at least 100 times in the (spam) email dataset are considered.
The list consists of 1899 words.
In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
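The frequency cut-off could be sketched like this (a hypothetical helper, not the course code; the example uses a toy min_count instead of 100):

```python
from collections import Counter

def build_vocabulary(emails, min_count=100):
    """Keep only the words occurring at least min_count times across
    the (already preprocessed) emails, indexed in sorted order."""
    counts = Counter(word for email in emails for word in email.split())
    kept = sorted(word for word, c in counts.items() if c >= min_count)
    return {word: index for index, word in enumerate(kept)}

# On a toy corpus with min_count=2, only the repeated words survive:
vocab = build_vocabulary(["buy now buy", "now or never"], min_count=2)
# -> {'buy': 0, 'now': 1}
```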
Mapping
Once the list is ready, it can be used for mapping the email, i.e. replacing each word with its index in the vocabulary list if applicable, or discarding it otherwise.
SPAM Classifier: Feature Extraction

Converting the email into a feature vector x[i].
The algorithm runs through the mapped data and puts a 1 at the i-th index if the i-th vocabulary word is present in the mapped data.
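The step above can be sketched as (a hypothetical helper; the mapped email is assumed to be a list of vocabulary indices):

```python
def email_to_feature_vector(word_indices, vocab_size):
    """Binary feature vector x: x[i] = 1 if vocabulary word i occurs
    in the mapped email, 0 otherwise."""
    x = [0] * vocab_size
    for i in word_indices:
        x[i] = 1
    return x

# An email mapped to vocabulary indices [0, 2] with a 5-word vocabulary:
# email_to_feature_vector([0, 2], 5) -> [1, 0, 1, 0, 0]
```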
SPAM Classifier: Training
SPAM Classifier: Test Accuracy

Test Accuracy: 98.9%


Brief introduction to NN classification
Linearly separable cases
NNs are heuristic, while SVMs are theoretically founded.
SVM is guaranteed to converge towards the best solution, while an NN may not.
Linearly non-separable cases
Introduce additional neurons in the hidden layer.
Why SVM over NN?

Training is more efficient.
Always gives a global and unique minimum, since it is a convex optimization problem.
Fewer parameters to select: the kernel and the error cost.
Less prone to over-fitting: over-fitting can be controlled by the soft-margin approach.
Weakness of SVM
The choice of kernel function: there is no concrete theory for choosing a kernel function.
Sensitive to noise: a few off-point data can dramatically decrease its performance.
It is a binary classifier: able to classify between only two classes.
Training and testing are slow compared to NN, since it is a constrained quadratic-programming problem.
Conclusion
SVM is a supervised machine-learning technique.
It uses the theory of optimization to find the optimal hyper-plane.
Useful for many applications:
Pattern recognition
Bioinformatics
Stock market performance
Thank you for your attention!
