
Another Introduction to Support Vector Machines

Andre Guggenberger

ABSTRACT
This paper provides a brief introduction to Support Vector Machines. SVMs are a modern technique in the field of machine learning and have been used successfully in different fields of application. The aim of this paper is not to give a detailed explanation of the theoretical background, but to impart the fundamental functionality and to give a solid understanding of how SVMs work. The paper also outlines what has to be considered when SVMs are applied, which fields of application exist, and what the current fields of research are.

General Terms
Support Vector Machines

Keywords
Support Vector Machines, Maximum Margin, Kernel Trick, Applications, Kernel Types

1. WHAT ARE SVMS?

Consider a typical classification problem. Some input vectors (feature vectors) and some labels are given. The objective of the classification problem is to predict the labels of new input vectors so that the error rate of the classification is minimal. There are many algorithms for solving this kind of problem. Some of them require that the input data is linearly separable, but for many applications this assumption is not appropriate. And even if the assumption holds, most of the time there are many possible solutions for the separating hyperplane; Figure 1 illustrates this. In 1965 Vapnik ([13], [12]) introduced a mathematical approach to solve this kind of optimization problem. The basis of his approach is the projection of the low-dimensional training data into a higher-dimensional feature space, because in this higher-dimensional feature space it is easier to separate the input data. Moreover, through this projection it is possible that training data which could not be separated linearly in the low-dimensional space can be separated linearly in the high-dimensional space. This projection is achieved by using kernel functions; see chapter 3 for more details.

[Figure 1: Positive samples (green boxes) and negative samples (red circles). There are many possible solutions for the hyperplane (from [7]).]

Support Vector Machines are so-called maximum margin classifiers. This means that the resulting hyperplane maximizes the distance between the nearest vectors of different classes, with the assumption that a large margin is better for the generalization ability of the SVM. These nearest vectors are called support vectors, and SVMs consider only these vectors for the classification task; all other vectors can be ignored. Figure 2 illustrates a maximum margin classifier and the support vectors.
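To make the maximum margin idea concrete, the following minimal sketch (assuming the scikit-learn library is available; the toy data and variable names are invented for illustration) trains a linear SVM and inspects the support vectors that define the margin.

# Minimal sketch: fit a linear SVM on a toy 2-D problem and inspect the
# support vectors that define the maximum-margin hyperplane.
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: two clusters with labels +1 / -1.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6)    # a large C approximates a hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the nearest vectors of each class
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("prediction for (1, 1):", clf.predict([[1.0, 1.0]]))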

2. SOME MATHEMATICAL BACKGROUND

To get a better understanding of how support vector machines work, we have to consider some mathematical background. This and the following chapter are based strongly on [8] and [7]; for a detailed introduction see also [12], [13] or [5]. First we have to define a hyperplane:

H(w, b) = {x | w^T x + b = 0}

where w is a weight vector, x is an input vector and b is the bias. Note that w points orthogonal to H. If we multiply w and b by the same positive factor, the resulting decision regions remain unchanged. To get a unique representation of the hyperplane, we have to normalize (w, b) with respect to the training data; this representation is also called the canonical representation. More formally:

min_i |w^T x_i + b| = 1

Using this information we can define the support vectors for two points (x_1, +1) and (x_2, -1):

w^T x_1 + b = +1
w^T x_2 + b = -1

The distance between a point and the hyperplane is:

d(x, H(w, b)) = |w^T x + b| / ||w||

We can now consider our task to maximize this distance with respect to the weight vector w and the bias b. Using the above definitions of the support vectors and the distance we can formulate the optimization problem:

[Figure 2: Maximum margin; the vectors on the dashed line are the support vectors (from [7]).]

(w^T x_1 + b) / ||w|| = 1 / ||w||   and   (w^T x_2 + b) / ||w|| = -1 / ||w||

and therefore

w^T (x_1 - x_2) / ||w|| = 2 / ||w||

The quantity 2 / ||w|| is called the margin. To maximize the margin we have to minimize ||w|| (respectively ||w|| / 2), which is equivalent to minimizing ||w||^2 / 2, but the latter is easier to deal with. We can summarize:

minimize:  ||w||^2 / 2
side condition:  y_i (w^T x_i + b) ≥ 1   for i = 1, 2, ..., n

This kind of optimization problem can be solved using Lagrange multipliers α_i. To do this we formulate the problem using the Lagrangian:

L(w, b, α) = ||w||^2 / 2 - Σ_{i=1}^{n} α_i (y_i (w^T x_i + b) - 1)    (1)

We want to maximize with respect to the α_i and minimize with respect to w and b, and therefore we calculate the first partial derivatives and set them equal to zero:

∂L(w, b, α) / ∂b = 0   and   ∂L(w, b, α) / ∂w = 0

which yields

Σ_{i=1}^{n} α_i y_i = 0    (2)

w = Σ_{i=1}^{n} α_i y_i x_i    (3)

We can use these results in (1) and get the so-called dual problem for calculating the maximizing α_i:


W(α) = Σ_{i=1}^{n} α_i - (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j <x_i, x_j>    (4)

where α_i ≥ 0 and Σ_{i=1}^{n} α_i y_i = 0    (5)
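For illustration only, the dual problem (4)-(5) can be handed to a generic quadratic programming solver. The sketch below assumes NumPy and the cvxopt package and uses placeholder data names; production SVM implementations rely on specialized solvers such as SMO instead.

# Sketch: solve the hard-margin dual (4)-(5) with a generic QP solver.
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_alphas(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    K = X @ X.T                              # Gram matrix <x_i, x_j>
    P = matrix(np.outer(y, y) * K)           # cvxopt minimizes 1/2 a^T P a + q^T a,
    q = matrix(-np.ones((n, 1)))             # so maximizing W(alpha) means q = -1
    G = matrix(-np.eye(n))                   # -alpha_i <= 0, i.e. alpha_i >= 0 from (5)
    h = matrix(np.zeros((n, 1)))
    A = matrix(y.reshape(1, n))              # equality part of (5): sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.asarray(sol['x']).ravel()      # the Lagrange multipliers alpha_i

The support vectors are the training points whose α_i come out clearly greater than zero.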

Now we are ready to summarize and define the classifier. Solve (4) and get the α_i which maximize (4). Using these α_i we can calculate the weight vector w via (3):

w = Σ_{i=1}^{n} α_i y_i x_i

The classifier then is:

f(x_new) = sign( Σ_{i=1}^{n} α_i y_i <x_i, x_new> + b )    (6)

where x_new is a new input vector.
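The classifier (6) is just a weighted sum of inner products with the training points. A small sketch, assuming the multipliers alphas (for example from the QP sketch above), the labels y, the training points X and the bias b are already available:

# Sketch: evaluate the classifier (6) given the Lagrange multipliers.
import numpy as np

def svm_predict(X, y, alphas, b, x_new):
    # f(x_new) = sign( sum_i alpha_i y_i <x_i, x_new> + b )
    return np.sign(np.sum(alphas * y * (X @ x_new)) + b)

def svm_bias(X, y, alphas, eps=1e-8):
    # b can be recovered from any support vector x_s:  b = y_s - w^T x_s
    w = (alphas * y) @ X               # w = sum_i alpha_i y_i x_i, see (3)
    s = np.argmax(alphas > eps)        # index of one support vector
    return y[s] - w @ X[s]

Only the support vectors contribute to the sum, since all other α_i are zero.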

3. THE KERNEL TRICK

The projection from a low-dimensional feature space into a high-dimensional feature space is achieved by so-called kernels. To understand why kernels work we have to consider that in the definition of the classifier and in the dual problem (4) the training points x_i are only used in inner products. If we have a function (also called a feature map) φ : R^g → F, x ↦ φ(x), which maps the input space into the high-dimensional feature space F, it can be a problem to calculate the inner products there (it is often actually not possible) if this feature space is too large. So we replace the inner product by a kernel function K, which is evaluated in R^g but yields the same value as the inner product in the feature space:

K(p, q) = <φ(p), φ(q)>

A kernel function has some special properties (Mercer's theorem): K(p, q) must be symmetric, and K must be positive definite. To get a better understanding we look at the following example, which is presented in [7]. Consider the feature map

φ : (g_1, g_2) ↦ (g_1^2, √2 g_1 g_2, g_2^2)

Then, for p = (g_1, g_2) and q = (h_1, h_2):

<φ(p), φ(q)> = (g_1^2, √2 g_1 g_2, g_2^2) (h_1^2, √2 h_1 h_2, h_2^2)^T
             = g_1^2 h_1^2 + 2 g_1 h_1 g_2 h_2 + g_2^2 h_2^2
             = (g_1 h_1 + g_2 h_2)^2
             = <p, q>^2
             =: K(p, q)

So as we can see, the kernel function delivers the same result as the inner product, but without the necessity to compute the inner product in the high-dimensional space (the kernel function is evaluated in the original space!). More on the mathematical background and the kernel trick can be found in [8] and [9].
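This identity is easy to verify numerically. A small sketch (the point values are chosen arbitrarily) compares the explicit feature map with the kernel K(p, q) = <p, q>^2:

# Sketch: the kernel <p, q>^2 equals the inner product after the explicit
# feature map (g1, g2) -> (g1^2, sqrt(2) g1 g2, g2^2).
import numpy as np

def phi(v):
    g1, g2 = v
    return np.array([g1**2, np.sqrt(2) * g1 * g2, g2**2])

def K(p, q):
    return np.dot(p, q) ** 2              # evaluated in the original 2-D space

p, q = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(p), phi(q)), K(p, q))    # both print 1.0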
Using a kernel function we can reformulate (4) and (6):

W(α) = Σ_{i=1}^{n} α_i - (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j K(x_i, x_j)

f(x_new) = sign( Σ_{i=1}^{n} α_i y_i K(x_i, x_new) + b )

4. KERNEL TYPES

We have seen how the projection into a high-dimensional feature space works. There are many possible kernel functions, but as stated in [4] the following kernel types are common:

1. linear: K(x_i, x_j) = x_i^T x_j
2. polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0
3. sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r)
4. radial basis function (RBF): K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0

As we can see, depending on the kernel type we choose, the kernel parameters (γ, r, d) have to be set. Which kernel type (and which parameters) performs best depends on the application and can be determined by using cross-validation. As a rule of thumb, if the number of features is large (as in the field of text classification), the linear kernel is sufficient. It is also worth considering the bias-variance analysis of the different kernel types presented in [11].
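To make the role of the parameters explicit, these kernel types can be written down directly; the sketch below is a plain NumPy transcription with arbitrary default values for γ, r and d.

# Sketch: the common kernel types from [4] as plain NumPy functions.
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, gamma=1.0, r=0.0, d=3):
    return (gamma * (xi @ xj) + r) ** d

def sigmoid(xi, xj, gamma=1.0, r=0.0):
    return np.tanh(gamma * (xi @ xj) + r)

def rbf(xi, xj, gamma=1.0):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))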

5. DIFFERENT TYPES OF SVMS

In the previous chapters we have looked at the mathematical theory behind SVMs. It is also worth considering that there are several different algorithms for SVMs. Beside the default Lagrangian Support Vector Machine (LSVM) there are others like the Finite Newton Lagrangian SVM (NLSVM) or the Finite Newton Support Vector Machine (NSVM). [10] compares different methods.

6. SOFT MARGINS AND WEIGHTED SVMS

Despite the kernel trick it is possible that no hyperplane exists which separates the input data linearly. In this case it is possible to use the soft margin method, which still chooses a maximum margin hyperplane but allows some classification errors. The optimization task then changes to:

minimize:  ||w||^2 + C Σ_{i=1}^{n} ξ_i
side condition:  y_i (w^T x_i + b) ≥ 1 - ξ_i   for i = 1, 2, ..., n

where the ξ_i are slack variables measuring the degree of misclassification and C is a hyperparameter controlling the influence of the ξ_i. In [2] a detailed analysis of soft margin classifiers is presented, and as expected this analysis states that the choice of the regularization parameter C has a large impact on the overall result. Another extension of the original SVMs is the weighted support vector machine, which is described in [3]. There it is possible to weight the misclassification of the classes differently. Similar to the soft margin method, the minimization task changes to:

minimize:  ||w||^2 + C Σ_{i=1}^{n} s_i ξ_i
side condition:  y_i (w^T x_i + b) ≥ 1 - ξ_i   for i = 1, 2, ..., n

where s_i is a weighting factor for the i-th training sample.
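In practice both ideas appear as hyperparameters of common SVM implementations. The sketch below uses scikit-learn with placeholder data names: C is the soft-margin penalty from above, class_weight weights errors per class, and sample_weight corresponds roughly to the per-sample factors s_i.

# Sketch: soft margin and weighting in scikit-learn (data names are placeholders).
from sklearn.svm import SVC

soft_clf = SVC(kernel="rbf", C=0.5)          # smaller C tolerates more margin violations
weighted_clf = SVC(kernel="rbf", C=1.0,
                   class_weight={-1: 5.0, 1: 1.0})   # per-class weighting of errors
# Per-sample weights can be passed at training time instead:
# weighted_clf.fit(X_train, y_train, sample_weight=s)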

7. SVM APPLICATION

After the previous theoretical chapters we now focus on the applications of Support Vector Machines. SVMs are one of the most promising machine learning algorithms, and there are many examples where SVMs have been used successfully, e.g. text classification, face recognition, OCR and bioinformatics. On these data sets SVMs perform very well and often outperform other traditional techniques. Of course they are no magic bullet, and as exposed in [1] there are still some open issues, like the incorporation of domain knowledge, new ways of model selection, or the interpretation of the results produced by SVMs. A simple method to incorporate prior knowledge into support vector machines is presented in [6], where the hypothesis space is modified rather than the optimization problem.

8. BRIEF GUIDE TO CLASSIFICATION

Finally we focus on applying SVMs to the task of classification and provide a brief guide for using SVMs.

8.1 Preprocessing

SVMs work with numeric values (interval measurement), so we have to convert potential categorical values into numeric ones (binarization). It is also important to scale the data, because data in greater numeric ranges should not dominate those in smaller ones. Note that these issues are also important for other machine learning techniques like neural networks.
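As a sketch of such a preprocessing step with scikit-learn (the column names are invented for illustration): categorical columns are binarized via one-hot encoding and numeric columns are scaled before the SVM sees them.

# Sketch: binarize categorical features and scale numeric ones before an SVM.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),                  # scale numeric ranges
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # binarization
])

model = Pipeline([("prep", preprocess), ("svm", SVC(kernel="rbf"))])
# model.fit(X_train, y_train)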

8.2 Model Selection

We have seen that there are different types of kernels. So we have to choose an appropriate one and use cross-validation to find the best hyperparameters and to prevent overfitting. It is also possible to use cross-validation to test different types of kernels.
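A sketch of kernel and hyperparameter selection by cross-validated grid search (the grid values are arbitrary examples):

# Sketch: choose kernel type and hyperparameters by cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"], "C": [1], "degree": [2, 3], "gamma": [0.1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
# search.fit(X_train, y_train); print(search.best_params_)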

9. REFERENCES

[1] K. P. Bennett and C. Campbell. Support vector machines: hype or hallelujah? SIGKDD Explor. Newsl., 2(2):1-13, 2000.
[2] D.-R. Chen, Q. Wu, Y. Ying, and D.-X. Zhou. Support vector machine soft margin classifiers: Error analysis. J. Mach. Learn. Res., 5:1143-1175, 2004.
[3] S.-X. D. and S.-T. Chen. Weighted support vector machine for classification. 4:3866-3871, 2005.
[4] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification. citeseer.ist.psu.edu/689242.html.
[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, New York, NY, USA, 2000.
[6] Q. V. Le, A. J. Smola, and T. Gärtner. Simpler knowledge-based support vector machines. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 521-528, New York, NY, USA, 2006. ACM.
[7] F. Markowetz. Klassifikation mit Support Vector Machines. http://lectures.molgen.mpg.de/statistik03/docs/Kapitel 16.pdf, 2003.
[8] R. Meir. Support vector machines - an introduction. http://www.ee.technion.ac.il/~rmeir/SVMReview.pdf, 2002.
[9] R. T. and M. I. Jordan. The kernel trick. www.cs.berkeley.edu/~jordan/courses/281Bspring04/lectures/lec3.pdf.
[10] S.-X. Lu and X.-Z. Wang. A comparison among four SVM classification methods: LSVM, NLSVM, SSVM and NSVM. 7:4277-4282, 2004.
[11] G. Valentini and T. G. Dietterich. Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods. J. Mach. Learn. Res., 5:725-775, 2004.
[12] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition, 2000.
[13] V. N. Vapnik and A. Y. Chervonenkis. Theory of Pattern Recognition. 1974.
