
Piyush Rai

Machine Learning (CS771A)

Aug 20, 2016

Recall the gradient descent (GD) update rule for (unregularized) logistic regression:

w^(t+1) = w^(t) - η_t Σ_{n=1}^N (μ_n^(t) - y_n) x_n,   where μ_n^(t) = 1 / (1 + exp(-w^(t)T x_n))

The stochastic (SGD) version uses a single randomly chosen example n:

w^(t+1) = w^(t) - η_t (μ_n^(t) - y_n) x_n

where η_t is the learning rate at update t (typically decreasing with t)

Let's replace the predicted label probability μ_n^(t) by the predicted binary label ŷ_n^(t):

w^(t+1) = w^(t) - η_t (ŷ_n^(t) - y_n) x_n

where

ŷ_n^(t) = 1 if μ_n^(t) ≥ 0.5 (equivalently, w^(t)T x_n ≥ 0)
ŷ_n^(t) = 0 if μ_n^(t) < 0.5 (equivalently, w^(t)T x_n < 0)

Thus w^(t) gets updated only when ŷ_n^(t) ≠ y_n (i.e., when w^(t) mispredicts)

Mistake-Driven Learning

Consider the mistake-driven SGD update rule

w^(t+1) = w^(t) - η_t (ŷ_n^(t) - y_n) x_n

Let's, from now on, assume the binary labels to be {-1,+1}, not {0,1}. Then

ŷ_n^(t) - y_n = -2 y_n   if ŷ_n^(t) ≠ y_n
ŷ_n^(t) - y_n = 0        if ŷ_n^(t) = y_n

so, whenever a mistake is made,

w^(t+1) = w^(t) + 2 η_t y_n x_n

For η_t = 0.5, this is akin to the Perceptron (a hyperplane based learning algorithm)

w^(t+1) = w^(t) + y_n x_n

Note: There are other ways of deriving the Perceptron rule (will see shortly)
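To make the equivalence concrete, here is a minimal Python sketch (the function names are illustrative, not from the lecture) showing that the mistake-driven SGD step with η_t = 0.5 coincides with the Perceptron step for labels in {-1,+1}:

```python
def predict(w, x):
    # predicted label in {-1, +1}: sign of w . x (a zero score maps to +1)
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else -1

def mistake_driven_step(w, x, y, lr=0.5):
    # SGD step w <- w - lr * (y_hat - y) * x; a no-op when the prediction is correct
    y_hat = predict(w, x)
    return [wi - lr * (y_hat - y) * xi for wi, xi in zip(w, x)]

def perceptron_step(w, x, y):
    # Perceptron step: w <- w + y * x on a mistake, unchanged otherwise
    if predict(w, x) != y:
        return [wi + y * xi for wi, xi in zip(w, x)]
    return list(w)
```

For any example, both steps produce the same updated weight vector.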


One of the earliest algorithms for binary classification (Rosenblatt, 1958)

Learns a linear hyperplane to separate the two classes

An online algorithm: learns from one example at a time. Also highly scalable for large training datasets (like SGD-based logistic regression and other SGD-based learning algorithms)

Guaranteed to find a separating hyperplane if the data is linearly separable

If the data is not linearly separable:

First make it linearly separable (more when we discuss Kernel Methods)

.. or use multi-layer Perceptrons (more when we discuss Deep Learning)


Hyperplanes

A hyperplane separates a D-dimensional space into two half-spaces (positive and negative)

It is defined by a normal vector w, which is orthogonal to any vector x lying on the hyperplane:

w^T x = 0 (a hyperplane passing through the origin)

With a bias term b, the hyperplane is w^T x + b = 0; b > 0 means moving it parallel along w (b < 0 means in the opposite direction)

A hyperplane based classifier represents the decision boundary by a hyperplane defined by w ∈ R^D

Classification rule: y = sign(w^T x + b)

w^T x + b > 0 ⇒ y = +1
w^T x + b < 0 ⇒ y = -1

Note: Some algorithms that we have already seen (e.g., distance from means, logistic regression) can also be viewed as learning hyperplanes
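The classification rule is a one-liner in code; a minimal sketch (following the earlier convention that a zero score maps to +1):

```python
def classify(w, x, b):
    # decision rule y = sign(w^T x + b), with sign(0) taken as +1
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```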

Notion of Margins

Geometric margin γ_n of an example x_n is its signed distance from the hyperplane:

γ_n = (w^T x_n + b) / ||w||

Margin of a set {x_1, ..., x_N} w.r.t. w is the minimum absolute geometric margin:

γ = min_{1≤n≤N} |γ_n| = min_{1≤n≤N} |w^T x_n + b| / ||w||

The sign of y_n γ_n tells us about the prediction:
Positive if w predicts y_n correctly
Negative if w predicts y_n incorrectly
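These two definitions translate directly to code; a minimal sketch:

```python
import math

def geometric_margin(w, b, x):
    # signed distance of x from the hyperplane w^T x + b = 0
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return score / math.sqrt(sum(wi * wi for wi in w))

def dataset_margin(w, b, X):
    # minimum absolute geometric margin over a set of points
    return min(abs(geometric_margin(w, b, x)) for x in X)
```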

For a hyperplane based model, let's consider the following loss function:

ℓ(w, b) = Σ_{n=1}^N ℓ_n(w, b) = Σ_{n=1}^N max{0, -y_n (w^T x_n + b)}

The loss is zero on a correctly classified example (y_n (w^T x_n + b) > 0); otherwise the model will incur some positive loss when y_n (w^T x_n + b) < 0

Perform stochastic optimization on this loss. On a mistake, the stochastic (sub)gradients are

∂ℓ_n(w, b)/∂w = -y_n x_n
∂ℓ_n(w, b)/∂b = -y_n

Upon every mistake, the update rule for w and b (assuming learning rate η_t = 1) is

w ← w + y_n x_n
b ← b + y_n
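This loss (often called the Perceptron loss) and its mistake-driven update can be sketched as follows (learning rate fixed at 1, as above):

```python
def perceptron_loss(w, b, data):
    # sum over examples of max{0, -y * (w^T x + b)}
    total = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += max(0.0, -y * score)
    return total

def subgradient_step(w, b, x, y):
    # on a mistake (y * score <= 0), move opposite the subgradient:
    # w <- w + y * x, b <- b + y; otherwise leave (w, b) unchanged
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * score <= 0:
        w = [wi + y * xi for wi, xi in zip(w, x)]
        b = b + y
    return w, b
```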


Given: Training data D = {(x_1, y_1), ..., (x_N, y_N)}

Initialize: w_old = [0, ..., 0], b_old = 0

Repeat until convergence:
  For a random (x_n, y_n) ∈ D:
    if sign(w^T x_n + b) ≠ y_n, i.e., y_n (w^T x_n + b) ≤ 0 (i.e., a mistake is made):
      w_new ← w_old + y_n x_n
      b_new ← b_old + y_n

Stopping criteria:
Convergence: all training examples are classified correctly (may overfit, so less common in practice)
One pass: stop after completing one pass over the data (each example seen once); useful, e.g., when examples arrive in a streaming fashion and can't be stored in memory
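The loop above can be sketched as a short Python function (a minimal implementation; the `max_epochs` cap is an added safeguard, not part of the original algorithm):

```python
import random

def train_perceptron(data, max_epochs=100, seed=0):
    """Perceptron training loop: update (w, b) on every mistake until
    all examples are classified correctly or max_epochs is reached."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        order = list(range(len(data)))
        rng.shuffle(order)          # visit examples in random order
        for n in order:
            x, y = data[n]
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:      # mistake (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:           # converged: zero training errors
            break
    return w, b
```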


Let's look at a misclassified positive example (y_n = +1)

Perceptron (wrongly) predicts that w_old^T x_n + b_old < 0

The updates would be

w_new = w_old + y_n x_n = w_old + x_n (since y_n = +1)
b_new = b_old + y_n = b_old + 1

The new score on this example:

w_new^T x_n + b_new = (w_old + x_n)^T x_n + b_old + 1
                    = (w_old^T x_n + b_old) + x_n^T x_n + 1

Since x_n^T x_n + 1 > 0, w_new^T x_n + b_new is less negative than w_old^T x_n + b_old

(Figure: geometric illustration of the update w_new = w_old + x_n)

Now let's look at a misclassified negative example (y_n = -1)

Perceptron (wrongly) predicts that w_old^T x_n + b_old > 0

The updates would be

w_new = w_old + y_n x_n = w_old - x_n (since y_n = -1)
b_new = b_old + y_n = b_old - 1

The new score on this example:

w_new^T x_n + b_new = (w_old - x_n)^T x_n + b_old - 1
                    = (w_old^T x_n + b_old) - x_n^T x_n - 1

Thus w_new^T x_n + b_new is less positive than w_old^T x_n + b_old
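Both derivations are easy to verify numerically; a small sketch with made-up numbers for the positive-example case:

```python
def score(w, b, x):
    # w^T x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def update(w, b, x, y):
    # Perceptron update on a mistake
    return [wi + y * xi for wi, xi in zip(w, x)], b + y

# a misclassified positive example: the old score is negative
w_old, b_old, x = [-1.0, 0.0], 0.0, [2.0, 1.0]
before = score(w_old, b_old, x)      # -2.0
w_new, b_new = update(w_old, b_old, x, +1)
after = score(w_new, b_new, x)       # before + x^T x + 1 = -2 + 5 + 1 = 4.0
```

The new score is exactly the old score plus x^T x + 1, as derived above.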

(Figure: geometric illustration of the update w_new = w_old - x_n)

Convergence of Perceptron

Theorem (Block & Novikoff): If the training data is linearly separable with margin γ by a unit norm hyperplane w* (||w*|| = 1) with b = 0, then the Perceptron converges after at most R²/γ² mistakes during training (assuming ||x|| < R for all x).

Proof: Since ||w*|| = 1, the margin condition says y_n w*^T x_n / ||w*|| = y_n w*^T x_n ≥ γ for all n. Suppose the k-th mistake is made on example n, i.e., y_n w_k^T x_n ≤ 0, and we update w_{k+1} = w_k + y_n x_n. Then:

(1) w_{k+1}^T w* = w_k^T w* + y_n x_n^T w* ≥ w_k^T w* + γ (why is this nice?), so after k mistakes, w_{k+1}^T w* ≥ kγ

(2) ||w_{k+1}||² = ||w_k||² + 2 y_n w_k^T x_n + ||x_n||² ≤ ||w_k||² + R² (using y_n w_k^T x_n ≤ 0 and ||x_n||² < R²), so after k mistakes, ||w_{k+1}||² ≤ kR²

Combining (1) and (2): kγ ≤ w_{k+1}^T w* ≤ ||w_{k+1}|| ||w*|| ≤ √k R, hence k ≤ R²/γ²

Nice Thing: The convergence rate does not depend on the number of training examples N or the data dimensionality D. It depends only on the margin!!!
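The bound is easy to probe empirically; a small sketch (b = 0; the dataset below is made up, separable by the unit-norm w* = [1, 0] with margin γ = 0.5):

```python
def perceptron_mistakes(data, max_epochs=1000):
    # run the Perceptron (no bias) and count mistakes until convergence
    w = [0.0] * len(data[0][0])
    mistakes = 0
    for _ in range(max_epochs):
        clean = True
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean = False
        if clean:                                               # converged
            break
    return mistakes

data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1),
        ([0.5, 1.0], 1), ([-0.5, 1.0], -1)]
R_sq = max(x1 * x1 + x2 * x2 for (x1, x2), _ in data)   # R^2 = 1.25
bound = R_sq / 0.5 ** 2                                  # R^2 / gamma^2 = 5.0
```

The observed mistake count never exceeds the bound, regardless of the order in which examples are visited.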


Perceptron finds one of the many possible hyperplanes separating the data (if one exists)

Among these, a large margin hyperplane leads to good generalization on the test data

Support Vector Machine (SVM)

One of the most popular/influential classification algorithms

Backed by solid theoretical groundings (Cortes and Vapnik, 1995)

A hyperplane based classifier (like the Perceptron)

Additionally uses the Maximum Margin Principle: finds the hyperplane with the maximum separation margin on the training data

A hyperplane based linear classifier defined by w and b

Prediction rule: y = sign(w^T x + b)

Given: Training data {(x_1, y_1), ..., (x_N, y_N)}

Goal: Learn w and b that attain small training error and the maximum margin

For now, assume the entire training data can be correctly classified by (w, b), i.e., zero loss on the training examples (the non-zero loss case comes later)

Assume the hyperplane is scaled such that

w^T x_n + b ≥ +1 for y_n = +1
w^T x_n + b ≤ -1 for y_n = -1

Equivalently, y_n (w^T x_n + b) ≥ 1, with min_{1≤n≤N} |w^T x_n + b| = 1

The hyperplane's margin: γ = min_{1≤n≤N} |w^T x_n + b| / ||w|| = 1 / ||w||

We want to maximize the margin γ = 1/||w||

Our optimization problem would be:

Minimize f(w, b) = ||w||² / 2
subject to y_n (w^T x_n + b) ≥ 1, n = 1, ..., N
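The objective and the constraints translate directly to code; a minimal sketch that compares two feasible hyperplanes on a made-up toy dataset (a smaller objective means a larger margin):

```python
import math

def svm_feasible(w, b, data):
    # check the hard-margin constraints y_n (w^T x_n + b) >= 1
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
               for x, y in data)

def svm_objective(w):
    # f(w) = ||w||^2 / 2
    return sum(wi * wi for wi in w) / 2

def margin(w):
    # the margin 1 / ||w||
    return 1 / math.sqrt(sum(wi * wi for wi in w))

data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
```

Both w = [1, 0] and w = [0.5, 0] (with b = 0) satisfy the constraints on this data, but the latter has the smaller objective and hence the larger margin.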

We can give a slightly more formal justification to this

Recall: margin γ = 1/||w||

Small ||w|| ⇒ regularized/simple solutions (the individual weights w_i don't become too large)

Simple solutions ⇒ good generalization on test data

Next class..

Introduction to kernel methods (nonlinear SVMs)

