
ONLINE SVR TRAINING BY SOLVING THE PRIMAL OPTIMIZATION PROBLEM

D. Brugger*, W. Rosenstiel                         M. Bogdan
Universität Tübingen                               Universität Leipzig
Technische Informatik                              Technische Informatik
Sand 13, 72076 Tübingen                            Johannisgasse 26, 04103 Leipzig

ABSTRACT

Online regression estimation becomes important in presence of drifts and rapid changes in the training data. In this article we propose a new online training algorithm for SVR, called PRIONA, which is based on the idea of computing approximate solutions to the primal optimization problem. We explore different unconstrained optimization methods for the solution of the primal SVR problem and investigate the impact of different buffering strategies. By using a line search PRIONA does not require a priori selection of a learning rate, which facilitates its practical application. Further, PRIONA is shown to perform better in terms of prediction accuracy on various benchmark data sets in comparison to the NORMA and SILK online SVR algorithms.

1. INTRODUCTION

During recent years support vector machines (SVMs) have been successfully applied to solve classification as well as regression problems in various application domains [1]. In the field of neurobiology, for example, SVMs are used to classify recorded EEG signals in the context of brain computer interfaces [2] or to infer spike trains from recorded local field potentials (LFPs) [3]. Besides classification, support vector regression (SVR) is employed to predict planar movements from cortical activity of rhesus monkeys during a center out reaching task [4] or for the closed loop control of stimulus intensities based on recorded LFP signals [5].

In the regression problems just mentioned, training usually occurs offline on previously collected data and the learnt SVR model is subsequently fixed during the prediction phase. This approach is often unsuitable in practice, especially if the distribution of the input patterns or regression targets is non-stationary. During closed loop stimulation, for instance, the SVR model has been observed to become outdated after 1-2 minutes [6]. Under these conditions online SVR training is able to track slow temporal changes in the data by continuously updating the estimated function.

The first online training algorithms for SVM classification and regression incrementally updated the current solution of the learning problem after arrival of the next training pattern [7, 8]. Although the notion of sustaining an optimal solution at all learning steps is attractive, this approach requires an unbounded number of operations during the incremental updates and hence is not suited for adoption in a real-time environment.

By maintaining only an approximate solution of the learning problem, algorithms using stochastic gradient descent are able to limit the computational time per iteration [9]. The framework of stochastic gradient descent was first introduced by the naive online risk minimization algorithm (NORMA) [9], which was later combined with stochastic meta descent (SMD) to adapt step sizes [10]. The SMD extension of NORMA is not included in the empirical comparisons of this article since it is hard to implement and has been outperformed by the sparse implicit learning with kernels (SILK) algorithm [11]. NORMA and SILK are applicable to both regression and classification problems and can be combined with a variety of different loss functions. On the downside, both algorithms need careful adjustment of the learning rate for each data set to ensure rapid convergence.

For online classification it has recently been proposed to use a reorganization of the steps performed by sequential minimal optimization (SMO) for SVM batch training [12]. The so called LaSVM algorithm does not necessitate selection of a learning rate, and the number of operations performed in every iteration is bounded. Essentially, LaSVM maintains an approximate solution to the dual SVM optimization problem during learning. Unfortunately, approximate dual solutions do not always translate into high quality primal solutions. Further, it has been shown for regularized least squares that primal optimization yields a lower objective function value than dual optimization [13].

The online algorithm for SVR proposed in this article is based on the idea of computing approximate primal solutions. Since the next point on the optimization path is determined by a line search, it is not necessary to select suitable learning rates, and the computational time per iteration is easily limited by restricting the input buffer size.

* Supported by the Centre for Integrative Neuroscience, Tübingen.



2. PRIMAL SVR

Let {(x_i, y_i)}_{i=1}^m ∈ ℝ^d × ℝ denote a set of training patterns. Linear SVR estimates the function g(x) = ⟨w, x⟩ + b = f(x) + b by solving the following optimization problem [1]:

$$\min_{w,b} L_\varepsilon(w,b) = \frac{1}{2}\sum_{i=1}^{m} \ell_\varepsilon(\langle w, x_i\rangle + b - y_i) + \frac{\lambda}{2}\|w\|^2 , \qquad (1)$$

where ℓ_ε denotes the ε-insensitive loss

$$\ell_\varepsilon(r) = \max(|r| - \varepsilon, 0) . \qquad (2)$$

The first term in (1) minimizes the ε-insensitive loss across all training patterns, while the second term regularizes the solution. Here we extended the primal SVR formulation of [14] to training with bias term b and used the reparameterization λ = 1/C.

Nonlinear SVR is obtained by introducing a kernel function k(x_i, x_j) and an associated Hilbert space ℋ that fulfills the "reproducing property" [15]:

$$f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}} . \qquad (3)$$

By combining the reproducing property with the "representer theorem" [16]:

$$f(x) = \sum_{i=1}^{m} \beta_i k(x, x_i) , \qquad (4)$$

the optimization problem for nonlinear SVR is given by:

$$\min_{\beta,b} L_\varepsilon(\beta,b) = \frac{1}{2}\sum_{i=1}^{m} \ell_\varepsilon(K_i\beta + b - y_i) + \frac{\lambda}{2}\beta^T K \beta , \qquad (5)$$

where the regularization term ‖w‖² of the linear formulation (1) was replaced by ‖f‖²_ℋ. In equation (5) above, K_i is the i-th row of the kernel matrix K with entries K_ij = k(x_i, x_j). Further, the residual of the i-th input pattern will subsequently be denoted by r_i(β, b) = K_i β + b − y_i.

Since the loss function in equation (2) is not differentiable, the objective function (5) is not amenable to direct application of unconstrained optimization methods. Although it is possible to devise differentiable approximations for the loss function [14], this article concentrates on the ε-insensitive quadratic loss¹ shown in figure 1 in order to simplify the following description of the online algorithm.

[Fig. 1: The ε-insensitive quadratic loss function and the three transitions (I, II, III) to be handled by the line search.]

To indicate the position of the residual relative to the ε-insensitive zone in the loss function we introduce the following sign function:

$$s_i(\beta,b) = \begin{cases} +1, & \varepsilon < r_i(\beta,b) \\ -1, & r_i(\beta,b) < -\varepsilon \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$

If further W(β, b) is a diagonal matrix with i-th entry equal to s_i(β, b)², and 1 denotes the vector with all components equal to one, then the gradient and Hessian of the objective function (5) can be written in compact form as:

$$\nabla L_\varepsilon = \begin{pmatrix} K^T W(\beta,b)\, r(\beta,b) - \varepsilon K^T s(\beta,b) + \lambda K\beta \\ \mathbf{1}^T W(\beta,b)\, r(\beta,b) - \varepsilon\, \mathbf{1}^T s(\beta,b) \end{pmatrix} , \qquad (7)$$

$$\nabla^2 L_\varepsilon = \begin{bmatrix} (K^T W(\beta,b) + \lambda I)K & K W(\beta,b)\mathbf{1} \\ \mathbf{1}^T W(\beta,b) K & \mathbf{1}^T W(\beta,b)\mathbf{1} \end{bmatrix} . \qquad (8)$$

¹This is a special case of the more general loss given in [17] for δ = 1/2 and e_C = ∞.
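To make the quantities above concrete, the following NumPy sketch evaluates the objective (5), gradient (7) and Hessian (8) for a given (β, b) under the ε-insensitive quadratic loss. It is only an illustration of the formulas, not the paper's implementation; the helper names (rbf_kernel, primal_svr_terms) and the RBF kernel choice are assumptions introduced here.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel matrix K_ij = exp(-gamma * ||x_i - z_j||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def primal_svr_terms(K, y, beta, b, eps, lam):
    """Objective (5), gradient (7) and Hessian (8) for the
    eps-insensitive quadratic loss."""
    n = K.shape[0]
    r = K @ beta + b - y                                        # residuals r_i(beta, b)
    s = np.where(r > eps, 1.0, np.where(r < -eps, -1.0, 0.0))   # sign function (6)
    w = s ** 2                                                  # diagonal of W(beta, b)
    one = np.ones(n)

    loss = 0.5 * np.sum(w * (np.abs(r) - eps) ** 2) + 0.5 * lam * beta @ K @ beta
    grad = np.concatenate([
        K.T @ (w * r) - eps * (K.T @ s) + lam * (K @ beta),     # derivative w.r.t. beta
        [one @ (w * r) - eps * (one @ s)],                      # derivative w.r.t. b
    ])
    H = np.zeros((n + 1, n + 1))
    H[:n, :n] = (K.T * w) @ K + lam * K                         # (K^T W + lam I) K
    H[:n, n] = K @ w                                            # K W 1
    H[n, :n] = w @ K                                            # 1^T W K
    H[n, n] = w.sum()                                           # 1^T W 1
    return loss, grad, H
```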

3. PRIMAL ONLINE SVR

Online training is the problem of estimating the mapping f: ℝ^d ↦ ℝ based on a sequence S = ((x_1, y_1), ..., (x_m, y_m)) of input patterns (x_t, y_t) ∈ ℝ^d × ℝ arriving at time point t. In many practical situations the regression targets y_t become available only after prediction on the input pattern x_t.

The primal online algorithm (PRIONA) for SVR proposed in this article computes an approximation to the solution of the primal SVR problem in equation (5) by applying unconstrained optimization to a subset of the input patterns in sequence S stored in a buffer. Although this idea is straightforward, there are several important issues to be clarified to turn it into a working algorithm.

3.1. Buffering strategies

Most online training algorithms for SVMs [9, 10] use the "first in first out" (FIFO) principle to manage the patterns in the input buffer. This strategy is easy to implement, but it is suboptimal from a theoretical point of view. In online classification problems, for example, it makes more sense to remove patterns that are far away from the decision boundary [18], since their removal has the least impact on the current solution. Later this strategy was refined to remove those patterns that have the lowest classification error on a subset of the training data [19].
For regression problems there is no decision boundary, but the concept is transferable by removing patterns with the smallest absolute value of the residual |r_i(β, b)|. Note that a pattern can be safely removed if |r_i(β, b)| < ε, since it does not contribute to the solution. Alternatively, one can base the decision for removing a pattern on the absolute value of the corresponding coefficient β_i.
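The three buffer management strategies just described can be sketched as follows. This is a minimal illustration assuming the buffer holds the current residuals and coefficients of its patterns; the function name and its interface are hypothetical.

```python
import numpy as np

def evict_index(strategy, buf_residuals, buf_beta, oldest_pos=0):
    """Return the buffer position to evict once the buffer is full.

    strategy: 'fifo'     -> remove the oldest pattern,
              'residual' -> remove the pattern with the smallest |r_i|,
              'coef'     -> remove the pattern with the smallest |beta_i|.
    """
    if strategy == 'fifo':
        return oldest_pos
    if strategy == 'residual':
        return int(np.argmin(np.abs(buf_residuals)))
    if strategy == 'coef':
        return int(np.argmin(np.abs(buf_beta)))
    raise ValueError('unknown buffering strategy: %s' % strategy)
```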

3.2. Descent directions

Unconstrained optimization methods commonly use the principle of iterative descent to minimize the objective function value. The next point (β̄, b̄) on the optimization path for the SVR problem is computed from the current solution (β, b) and a particular descent direction. To guarantee a decrease in the objective function value from one iteration to the next, a line search is used to interpolate between (β, b) and (β̄, b̄).
3.2.1. Gradient

The negative gradient of the objective function points in the direction of steepest descent and is the simplest choice for the descent direction:

$$(\bar\beta, \bar b) = (\beta, b) - \nabla L_\varepsilon(\beta, b) . \qquad (9)$$

Under the assumption that the matrix-vector product Kβ in equation (7) is cached and just updated every iteration, the run time complexity to compute the gradient step is only O(n·|sv|), where n is the size of the buffer and |sv| the number of support vectors.

3.2.2. Diagonally scaled gradient

If the objective function is quadratic and the corresponding Hessian is diagonal, it is invertible in linear time. Since many functions are well approximated by a quadratic function locally, it makes sense to multiply the negative gradient by the inverse of the diagonal portion of the Hessian:

$$(\bar\beta, \bar b) = (\beta, b) - D\,\nabla L_\varepsilon(\beta, b) . \qquad (10)$$

Note that it is not necessary to explicitly compute the Hessian, since the entries of the scaling matrix D are given by:

$$D_{ii} = \begin{cases} \left[(K^T W(\beta,b) K)_{ii} + \lambda K_{ii}\right]^{-1} , & 1 \le i \le n \\ \left[\mathbf{1}^T W(\beta,b)\mathbf{1}\right]^{-1} , & i = n+1 . \end{cases} \qquad (11)$$

The run time complexity to compute the scaled gradient step hence is O(n·|sv|).
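As an illustration of the two directions above, the sketch below recomputes the gradient (7) directly and forms the gradient step (9) and the diagonally scaled step (10)-(11); only the diagonal entries of the Hessian (8) are built, as described in the text. The function names, the guard against an empty support vector set, and the absence of the cached Kβ product are simplifying assumptions, not part of the original algorithm.

```python
import numpy as np

def gradient_step(K, y, beta, b, eps, lam):
    """Plain gradient direction, equation (9)."""
    r = K @ beta + b - y
    s = np.where(r > eps, 1.0, np.where(r < -eps, -1.0, 0.0))
    w = s ** 2
    g_beta = K.T @ (w * r) - eps * (K.T @ s) + lam * (K @ beta)
    g_b = np.sum(w * r) - eps * np.sum(s)
    return beta - g_beta, b - g_b

def scaled_gradient_step(K, y, beta, b, eps, lam):
    """Diagonally scaled gradient direction, equations (10)-(11);
    only the diagonal of the Hessian (8) is formed."""
    r = K @ beta + b - y
    s = np.where(r > eps, 1.0, np.where(r < -eps, -1.0, 0.0))
    w = s ** 2
    g_beta = K.T @ (w * r) - eps * (K.T @ s) + lam * (K @ beta)
    g_b = np.sum(w * r) - eps * np.sum(s)
    # Diagonal of the Hessian: sum_j W_jj K_ji^2 + lam K_ii, and 1^T W 1.
    d_beta = (w[:, None] * K ** 2).sum(axis=0) + lam * np.diag(K)
    d_b = w.sum()
    # Guard against an empty support vector set (all residuals inside the tube).
    d_beta = np.where(d_beta > 0, d_beta, 1.0)
    d_b = d_b if d_b > 0 else 1.0
    return beta - g_beta / d_beta, b - g_b / d_b
```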
3.2.3. Newton

For the well known Newton method the descent direction corresponds to the negative gradient multiplied by the inverse of the Hessian matrix. If the Hessian matrix (8) and gradient (7) for the primal SVR problem are appropriately factorized, the Newton step is given by:

$$\begin{pmatrix}\bar\beta\\ \bar b\end{pmatrix} = \begin{pmatrix}\beta\\ b\end{pmatrix} - \nabla^2 L_\varepsilon(\beta,b)^{-1}\nabla L_\varepsilon(\beta,b) = \begin{bmatrix} W(\beta,b)K + \lambda I & W(\beta,b)\mathbf{1}\\ \mathbf{1}^T W(\beta,b)K & \mathbf{1}^T W(\beta,b)\mathbf{1}\end{bmatrix}^{-1}\begin{pmatrix} W(\beta,b)y + \varepsilon s(\beta,b)\\ \mathbf{1}^T W(\beta,b)y + \varepsilon\,\mathbf{1}^T s(\beta,b)\end{pmatrix} . \qquad (12)$$

To invert the Hessian it is sufficient to compute the Cholesky decomposition of K_sv,sv + λI, where K_sv,sv denotes the subset of the kernel matrix with respect to the support vectors. The resulting run time complexity to compute the Newton step is therefore O(|sv|³).

[Fig. 2: Average iteration time in milliseconds for different descent directions in dependence of the buffer size on the abalone data set. Dotted curves illustrate cubic and quadratic run time complexities fitted to the first two iteration times measured for the Newton and scaled gradient directions respectively.]
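A minimal sketch of the Newton direction (12) is given below. It reduces the system to the current support vectors (non-support vectors receive a zero coefficient) and, for brevity, solves the reduced block system with a general solver instead of the Cholesky factorization of K_sv,sv + λI mentioned above; the function name and the handling of an empty support vector set are assumptions made here.

```python
import numpy as np

def newton_step(K, y, beta, b, eps, lam):
    """Newton direction, equation (12), restricted to support vectors."""
    r = K @ beta + b - y
    s = np.where(r > eps, 1.0, np.where(r < -eps, -1.0, 0.0))
    sv = np.flatnonzero(s != 0)
    beta_new = np.zeros_like(beta)
    if sv.size == 0:                     # no support vectors: zero coefficients, keep bias
        return beta_new, b
    K_sv = K[np.ix_(sv, sv)]
    ones = np.ones(sv.size)
    # Reduced block system [K_sv + lam*I, 1; 1^T K_sv, |sv|] (beta_sv, b) = rhs.
    A = np.zeros((sv.size + 1, sv.size + 1))
    A[:-1, :-1] = K_sv + lam * np.eye(sv.size)
    A[:-1, -1] = ones
    A[-1, :-1] = ones @ K_sv
    A[-1, -1] = sv.size
    rhs = np.concatenate([y[sv] + eps * s[sv],
                          [ones @ y[sv] + eps * s[sv].sum()]])
    sol = np.linalg.solve(A, rhs)
    beta_new[sv] = sol[:-1]
    return beta_new, sol[-1]
```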
3.3. Incremental updates

In the online setting it is possible to incrementally update the inverse Hessian required by the Newton method after arrival of the next pattern in S by exploiting the well known Sherman-Morrison-Woodbury formula:

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u} . \qquad (13)$$

This approach has been previously applied in incremental SVM learning [7, 8]. According to (13), exchanging one column and row of the inverse requires O(n²) multiplications. Yet, the worst case complexity of incremental updates is O(n³) if, for example, all the patterns in the buffer leave the set of support vectors upon arrival of the next pattern. Since this worst case scenario occurs frequently in practice, especially for small buffer sizes (figure 2), the possibility of incremental updates will not be considered any further.
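As a quick illustration of (13), the snippet below applies a rank-one update to a cached inverse and checks it against a full re-inversion; it only makes the O(n²) cost of such an update visible and is not part of the PRIONA implementation.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Rank-one update (13): inverse of (A + u v^T) given A^{-1}.
    Costs O(n^2) multiplications instead of O(n^3) for a re-inversion."""
    Au = A_inv @ u                      # O(n^2)
    vA = v @ A_inv                      # O(n^2)
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# Small self-check against direct inversion on a well conditioned test matrix.
rng = np.random.default_rng(0)
A = 5 * np.eye(5) + rng.normal(size=(5, 5))
u, v = rng.normal(size=5), rng.normal(size=5)
assert np.allclose(sherman_morrison_update(np.linalg.inv(A), u, v),
                   np.linalg.inv(A + np.outer(u, v)))
```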
3.4. Line search

The line search identifies the scalar step size ρ ∈ [0, 1] such that L_ε(β(ρ), b(ρ)) in equation (5) is minimized². The objective function in dependence of the step size, denoted by φ(ρ), is a piecewise continuous, quadratic function as shown in figure 3A. Discontinuities occur at points ρ_i if patterns enter or leave the set of support vectors, or equivalently, if the absolute value of the residual exceeds ε. The residual r_i(β(ρ), b(ρ)) depending on the step size is expanded as:

$$\left[K\beta + b\mathbf{1} - y + \rho\left(K(\bar\beta - \beta) + \mathbf{1}(\bar b - b)\right)\right]_i = r_i + \rho u_i , \qquad (14)$$

where we introduced r_i and u_i as shorthands to simplify the notation. For the three cases shown in figure 1 the corresponding transition points are ρ_1 = (sgn(u_i)ε − r_i)/u_i (case I), ρ_2 = (sgn(r_i)ε − r_i)/u_i (case II) and ρ_3 = (−sgn(r_i)ε − r_i)/u_i (case III).

After computing the transition points, the line search originally proposed in [20] for linear SVMs then proceeds by sorting the ρ_i's in non-decreasing order and searching for a zero crossing of the first derivative φ'(ρ) between ρ_i and ρ_{i+1}. A potential minimum is then determined analytically:

$$\rho^* = -\frac{\phi'(0)}{\phi'(1) - \phi'(0)} , \qquad (15)$$

and rejected if ρ* ≤ ρ_i or ρ* > ρ_{i+1}. If no minimum is found, the search continues between the next pair of points (ρ_{i+1}, ρ_{i+2}), after updating the value of φ'(0) and (φ'(1) − φ'(0)). The run time complexity, dominated by the sorting step, is O(n log n).

In some situations the minimizer ρ* cannot be found analytically (figure 3B) and the line search described in [20] fails. Combined with the gradient or scaled gradient descent directions this failure leads to large prediction errors of PRIONA (figure 4A). If the line search terminates without finding a zero crossing of φ', the minimum is located at one of the transition points (figure 3B).

We therefore propose to correct the line search by evaluating φ(ρ) either at the point ρ_i, if ρ* ≤ ρ_i, or ρ_{i+1}, if ρ* > ρ_{i+1}, during the search for the zero crossing. This modification then allows to return the point ρ_i which minimizes the objective function if ρ* cannot be found analytically. One evaluation of φ(ρ) costs O(n²) operations and the overall run time complexity of the corrected line search is O(n³). The corrected line search ensures the convergence of PRIONA even in combination with the gradient and scaled gradient descent directions (figure 4B).

[Fig. 3: The objective function φ(ρ) in dependence of the step size and its first derivative φ'(ρ). A: The optimal step size ρ* corresponds to a zero crossing of φ'(ρ). B: Since φ(ρ) is a piecewise continuous quadratic function the minimum ρ* might be located at one of the transition points ρ_i.]

²β(ρ) = β + ρ(β̄ − β) and b(ρ) = b + ρ(b̄ − b).
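The following sketch conveys the idea of the corrected line search. It is a simplified variant under stated assumptions: instead of incrementally updating φ'(0) and (φ'(1) − φ'(0)) along the sorted transition points as in [20], it evaluates φ and φ' directly on each segment, and it additionally evaluates φ at the segment end points so that a minimizer located at a transition point is returned. This keeps the corrected behaviour (and its O(n³) cost) but is not the paper's exact bookkeeping.

```python
import numpy as np

def corrected_line_search(K, y, beta, b, beta_bar, b_bar, eps, lam):
    """Return the step size rho in [0, 1] minimizing phi(rho) = L_eps(beta(rho), b(rho))."""
    d_beta, d_b = beta_bar - beta, b_bar - b
    r = K @ beta + b - y                               # r_i of equation (14)
    u = K @ d_beta + d_b                               # u_i of equation (14)
    c0, c1 = beta @ K @ d_beta, d_beta @ K @ d_beta    # regularizer terms of phi'

    def phi(rho):
        z = r + rho * u
        bet = beta + rho * d_beta
        return 0.5 * np.sum(np.maximum(np.abs(z) - eps, 0.0) ** 2) \
            + 0.5 * lam * bet @ K @ bet

    def dphi(rho):
        z = r + rho * u
        s = np.where(z > eps, 1.0, np.where(z < -eps, -1.0, 0.0))
        return np.sum((s ** 2) * (z - eps * s) * u) + lam * (c0 + rho * c1)

    # Transition points: rho with r_i + rho*u_i = +-eps, restricted to (0, 1).
    cand = []
    for tgt in (eps, -eps):
        with np.errstate(divide='ignore', invalid='ignore'):
            t = (tgt - r) / u
        cand.extend(t[np.isfinite(t) & (t > 0.0) & (t < 1.0)].tolist())
    knots = np.array(sorted(set([0.0, 1.0] + cand)))

    best_rho, best_val = 0.0, phi(0.0)
    for a, c in zip(knots[:-1], knots[1:]):
        ga, gc = dphi(a), dphi(c)
        if gc != ga:                                   # phi' is linear on [a, c]
            rho = a - ga * (c - a) / (gc - ga)
            if a < rho <= c:
                val = phi(rho)
                if val < best_val:
                    best_rho, best_val = rho, val
        val_c = phi(c)                                 # corrected step: check the knot itself
        if val_c < best_val:
            best_rho, best_val = c, val_c
    return best_rho
```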
4. DATASETS

For the empirical comparison of the online algorithms NORMA, SILK and PRIONA we used eight publicly available benchmark data sets (UCI repository: abalone, housing, mpg, pyrim, triazines; StatLib: cadata, space-ga; Delve: cpusmall), and four data sets (fb1-fb4) from experiments on closed loop stimulation control [6]. In all data sets the patterns were linearly scaled to [-1, +1] and the regression targets to [0, 1].

5. RESULTS

On all data sets we used online SVR training in conjunction with the RBF kernel [1]. The kernel parameter γ, the regularization parameter λ, and the loss function parameter ε were selected offline by tenfold cross validation on a subset of the patterns. During the initial iterations of every online algorithm prediction errors are usually very high. To deal with the non-normal distribution of the squared error (SE) between predicted and true target values, the median of the SE is used to quantify the prediction accuracy. Further, 95% confidence intervals are computed by the bootstrap method with B=1000 bootstrap samples, and significant differences in the prediction accuracy, indicated by an asterisk (*), are assessed by a Wilcoxon test (α = .05).
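The evaluation protocol just described can be sketched in a few lines of NumPy/SciPy. The snippet assumes the per-pattern squared errors of two algorithms are available as paired arrays; the use of a plain percentile bootstrap for the median's confidence interval is an assumption, since the paper does not state which bootstrap variant was used.

```python
import numpy as np
from scipy.stats import wilcoxon

def median_se_with_ci(se, n_boot=1000, alpha=0.05, seed=0):
    """Median squared error and a percentile bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    boot = np.median(rng.choice(se, size=(n_boot, se.size), replace=True), axis=1)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.median(se), (lo, hi)

def compare_algorithms(se_a, se_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test on the squared errors of two algorithms."""
    stat, p = wilcoxon(se_a, se_b)
    return p, p < alpha
```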
[Fig. 4: The current average error, (1/t)Σ_{i=1}^{t}(y_i − f(x_i))², of PRIONA with different descent directions on the pyrim data set. A: A failure of the line search incurs large prediction errors for the gradient and scaled gradient directions. B: Convergence problems are avoided by the corrected line search.]

[Fig. 5: Average iteration time (top) and prediction accuracy (bottom) for PRIONA using different descent directions. Bars represent 95% confidence intervals for the median of the squared prediction errors. The horizontal line in the upper plot indicates the iteration time restriction for data sets fb1-fb4.]

First we studied the impact of the buffering strategies and different descent directions on the performance of PRIONA. Across all data sets the choice of a particular buffering strategy had a negligible influence³ on the prediction accuracy, while the performance impact of the descent direction is substantial (figure 5). In this comparison the optimal buffer size for each data set and descent direction was selected from the set {64, 128, 256, 512}.
Surprisingly, predictions of PRIONA using the gradient and scaled gradient directions are less precise compared to the Newton direction even for larger buffer sizes, where the required average iteration time is higher.

Next we compared the performance of PRIONA using the Newton direction with NORMA and SILK. The learning rate of NORMA and SILK in iteration t was set to η_t = η_0/√t, as suggested in [9], and the initial learning rate η_0 was tuned for each data set across the range 0.1, 0.2, ..., 10. To obtain average iteration times on the same scale, the buffer size of PRIONA was restricted to 64 patterns, while the optimal buffer size for NORMA and SILK was selected from the set {64, 128, 256, 512, m}, where m is the total number of patterns in a data set.

On ten out of twelve data sets PRIONA achieves a lower median SE in comparison to NORMA and SILK (figure 6). Although the difference is small for the abalone, cpusmall, and space-ga data sets, the performance of PRIONA is substantially better on the data sets fb1-fb4, housing, and mpg. In addition, the buffer size restriction of PRIONA leads to average iteration times that are below the 10 ms limit imposed by the experimental environment for the closed loop control of stimulus intensities (fb1-fb4).

6. DISCUSSION

In conclusion, the results presented in this article show that approximate solutions of the primal SVR optimization problem can be successfully employed to build a novel online algorithm (PRIONA). We found that approximate solutions are best computed by using the Newton descent direction, while the choice of a particular buffering strategy turned out to be uncritical with respect to prediction accuracy. On several benchmark data sets PRIONA performed better than NORMA and SILK with comparable average iteration times. It should be noted that PRIONA can also find exact updates by performing several Newton optimization steps after arrival of the next training pattern if iteration time is not limited. Further, we expect the automatic step size selection of PRIONA to facilitate its practical application. In the future we plan to use PRIONA in the context of closed loop microstimulation to compensate for drifts and rapid changes in the recorded local field potentials.

³Detailed results are omitted due to lack of space.
[Fig. 6: Average iteration time (top) and prediction accuracy (bottom) for the online algorithms NORMA, SILK and PRIONA.]
7. REFERENCES

[1] B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.

[2] T. N. Lal, M. Schröder, T. Hinterberger, J. Weston, M. Bogdan, N. Birbaumer, and B. Schölkopf, "Support vector channel selection in BCI," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1003-1010, 2004.

[3] M. J. Rasch, A. Gretton, Y. Murayama, W. Maass, and N. K. Logothetis, "Inferring spike trains from local field potentials," J. Neurophysiol., vol. 99, pp. 1461-1476, 2008.

[4] L. Shpigelman, Y. Singer, R. Paz, and E. Vaadia, "Spikernels: predicting arm movements by embedding population spike rate patterns in inner-product spaces," Neural Computation, vol. 17, pp. 671-690, 2005.

[5] D. Brugger, S. Butovas, M. Bogdan, C. Schwarz, and W. Rosenstiel, "Direct and inverse solution for a stimulus adaptation problem using SVR," in ESANN proceedings, Bruges, 2008, pp. 397-402.

[6] D. Brugger, S. Butovas, M. Bogdan, C. Schwarz, and W. Rosenstiel, "Real-time adaptive microstimulation increases reliability of electrically evoked cortical potentials," Submitted to Nature, 2009.

[7] J. Ma, J. Theiler, and S. Perkins, "Accurate on-line support vector regression," Neural Computation, vol. 15, pp. 2683-2703, 2003.

[8] G. Cauwenberghs and T. Poggio, "Incremental and decremental support vector machine learning," in NIPS, 2000, pp. 409-415.

[9] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Trans. Sig. Proc., vol. 52, no. 8, pp. 2165-2175, 2004.

[10] S. V. N. Vishwanathan, N. N. Schraudolph, and A. J. Smola, "Step size adaptation in reproducing kernel Hilbert space," JMLR, vol. 7, pp. 1107-1133, 2006.

[11] L. Cheng, S. V. N. Vishwanathan, D. Schuurmans, S. Wang, and T. Caelli, "Implicit online learning with kernels," in NIPS, pp. 249-256, MIT Press, 2007.

[12] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, "Fast kernel classifiers with online and active learning," Journal of Machine Learning Research, vol. 6, pp. 1579-1619, 2005.

[13] O. Chapelle, "Training a support vector machine in the primal," Neural Computation, vol. 19, pp. 1135-1178, 2007.

[14] L. Bo, L. Wang, and L. Jiao, "Recursive finite Newton algorithm for support vector regression in the primal," Neural Computation, vol. 19, pp. 1082-1096, 2007.

[15] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, no. 3, pp. 337-404, 1950.

[16] G. S. Kimeldorf and G. Wahba, "A correspondence between Bayesian estimation on stochastic processes and smoothing by splines," The Annals of Mathematical Statistics, vol. 41, no. 2, pp. 495-502, April 1970.

[17] J. L. Rojo-Álvarez, M. Martínez-Ramón, M. de Prado-Cumplido, A. Artés-Rodríguez, and A. R. Figueiras-Vidal, "Support vector method for robust ARMA system identification," IEEE Transactions on Signal Processing, vol. 52, no. 1, pp. 155-164, 2004.

[18] K. Crammer, J. S. Kandola, and Y. Singer, "Online classification on a budget," in NIPS, 2003.

[19] J. Weston, A. Bordes, and L. Bottou, "Online (and offline) on an even tighter budget," in Proc. of the 10th Int. Workshop on Artificial Intelligence and Statistics, 2005, pp. 413-420.

[20] S. S. Keerthi and D. DeCoste, "A modified finite Newton method for fast solution of large scale linear SVMs," JMLR, vol. 6, pp. 341-361, 2005.
