
Neurocomputing 55 (2003) 265–283

www.elsevier.com/locate/neucom
Neural network learning for analog VLSI implementations of support vector machines: a survey

Davide Anguita^a, Andrea Boni^b

^a Department of Biophysical and Electronic Engineering, University of Genova, Via Opera Pia 11A, I-16145 Genova, Italy
^b Department of Information and Communication Technologies, University of Trento, Via Sommarive 14, I-38050 Povo (TN), Italy

Received 6 March 2002; accepted 9 January 2003
Abstract
In the last few years several kinds of recurrent neural networks (RNNs) have been proposed for solving linear and nonlinear optimization problems. In this paper, we provide a survey of RNNs that can be used to solve both the constrained quadratic optimization problem related to support vector machine (SVM) learning, and SVM model selection by automatic hyperparameter tuning. The appeal of this approach is the possibility of implementing such networks on analog VLSI systems with relative ease. We review several proposals that have appeared so far in the literature and test their behavior when applied to a telecommunication application, where special-purpose adaptive hardware is of great interest.
© 2003 Elsevier B.V. All rights reserved.
Keywords: SVM learning; Recurrent networks; Analog VLSI; Quadratic programming
1. Introduction
Since the seminal work of Mead [31], the idea of implementing the neural computation framework on VLSI systems has been pursued with great effort. On one hand, the efforts have concentrated on the realization of devices that could mimic biological functionality (Mead's book goes exactly in this direction); on the other hand, the objective has been the design of dedicated hardware, which implements the mainstream

* Corresponding author. Tel.: +39-010-3532800; fax: +39-010-3532175.
E-mail addresses: anguita@dibe.unige.it (D. Anguita), andrea.boni@ing.unitn.it (A. Boni).
0925-2312/03/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0925-2312(03)00382-5
algorithms of neural computation (see, for example, [10,11,19,20] for some recent
reviews).
One of the most notable examples of this research area is the vast amount of literature and the actual realizations of special-purpose devices implementing the multi-layer perceptron (MLP) and the backpropagation algorithm. Unfortunately, after an initial enthusiasm and a few commercial realizations (e.g. Intel [22] and Siemens [35]), skepticism rose sharply: digital implementations have been criticized for being just hardware accelerators of neural algorithms [32], and analog implementations failed to become an appealing alternative. We argue that there were many causes for this failure: the difficulty of implementing the BP algorithm on silicon [21], the poor performance of the algorithm itself when compared to more advanced optimization methods, which cannot be easily implemented on dedicated hardware [8], and, last but not least, the NP-completeness of its learning process, which prevents finding a global minimum and, therefore, exploiting solid theoretical results [6].
In recent years a new neural algorithm, the support vector machine (SVM), has captured the attention of the neurocomputing world [17]. In a few years, the SVM has established itself as one of the best methods for solving real-world problems [12,13], and a vast amount of literature has emerged, almost resembling the enthusiasm caused by the rediscovery of backpropagation during the 1980s. The advantages of the SVM with respect to previous-generation networks are clear: it is based on a very solid theoretical background [42], and learning is performed by minimizing a quadratic functional subject to linear constraints, for which the global minimum can be easily found [25]. Furthermore, some applications that would benefit from a hardware implementation of SVMs have started to emerge [38,40], raising new interest in special-purpose devices.
We survey in this work a methodology that looks very promising for the analog hardware implementation of SVMs. Collecting from several proposals, some of which date back more than 20 years, we specialize these methods to the solution of the constrained quadratic optimization problem that implements SVM learning.
2. Neural networks for SVM learning
2.1. The constrained quadratic programming problem
We review here the constrained quadratic programming (CQP) problem related to SVM learning. The CQP addressed here refers to classification tasks, but can be used without any relevant modification for regression tasks as well. We do not give any detail about the derivation of the following formulas, which can easily be found elsewhere [18].
Given a set of patterns $\{x_i, y_i\}_{i=1}^{l}$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$, we can write, in matrix notation, the primal CQP problem for SVM:

$$\min_{w, b, \xi} \; E_P = \frac{1}{2}\|w\|^2 + C e^T \xi, \qquad (1)$$
$$Y(X^T w + b e) \geq e - \xi, \qquad (2)$$
$$\xi \geq 0, \qquad (3)$$

where $e_i = 1$, $i = 1 \ldots l$, $X = [\phi(x_1)|\ldots|\phi(x_l)]$, $Y$ is a diagonal matrix with the elements of $y$ along the diagonal, and the superscript $T$ indicates the transpose operator. The nonlinear transformation $\phi: \mathbb{R}^n \to \mathbb{R}^N$ is the implicit mapping from the input space to the feature space, such that $X^T X = K \in \mathbb{R}^{l \times l}$ and $k_{ij} = k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. Note that, in general, $N \gg n$ or, possibly, $N = \infty$.
By introducing $2l$ positive Lagrange multipliers $z, \mu \in \mathbb{R}^l_+$ we can write the Lagrangian

$$L_P = \frac{1}{2}\|w\|^2 + C e^T \xi - z^T [Y(X^T w + b e) - e + \xi] - \mu^T \xi \qquad (4)$$
and obtain the classical dual formulation of the SVM problem:

$$\min_{z} \; E_D = \frac{1}{2} z^T Q z - e^T z, \qquad (5)$$
$$0 \leq z \leq C e, \qquad (6)$$
$$y^T z = 0, \qquad (7)$$

where $Q = YKY$ and, at optimality,

$$\mu = C e - z, \qquad (8)$$
$$w = X Y z, \qquad (9)$$
$$z^T [Y(X^T w + b e) - e + \xi] = 0, \qquad (10)$$
$$\mu^T \xi = 0. \qquad (11)$$
For our purposes, it is convenient to compute the dual of the above problem: introducing $2l + 1$ Lagrange multipliers $z_U, z_L \in \mathbb{R}^l_+$ and $\lambda_0 \in \mathbb{R}$ allows us to write the original primal problem in a slightly different form. The Lagrangian is

$$L_D = \frac{1}{2} z^T Q z - e^T z + (z - Ce)^T z_U - z^T z_L + \lambda_0 y^T z \qquad (12)$$

and the problem becomes

$$\min_{z, z_U} \; E_{PD} = \frac{1}{2} z^T Q z + C e^T z_U, \qquad (13)$$
$$Q z + \lambda_0 y - e \geq -z_U, \qquad (14)$$
$$z_U \geq 0, \qquad (15)$$
$$0 \leq z \leq C e, \qquad (16)$$
$$y^T z = 0, \qquad (17)$$
where, at optimality,

$$z^T [Q z - e + z_U + \lambda_0 y] = 0, \qquad (18)$$
$$(z - Ce)^T z_U = 0, \qquad (19)$$
$$\lambda_0 y^T z = 0. \qquad (20)$$

From this last formulation it is easy to see that $b = \lambda_0$: this will be useful in subsequent sections to avoid the explicit computation of the bias term.
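As a quick concreteness check of the objects entering the dual CQP (5)–(7), the sketch below assembles $K$ and $Q = YKY$ for a small hand-made dataset (the four points, the labels, and the unit kernel width are illustrative assumptions, not taken from the paper) and verifies that $Q$ is symmetric positive semi-definite, so the dual is a convex problem with a well-defined global minimum:

```python
import numpy as np

# Illustrative 4-pattern toy set (not from the paper) with a Gaussian kernel.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Kernel matrix K and the dual Hessian Q = Y K Y of Eq. (5).
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dist)
Q = np.outer(y, y) * K

# Q inherits symmetry and positive semi-definiteness from K, so the
# dual (5)-(7) is a convex CQP whose global minimum can be found.
assert np.allclose(Q, Q.T)
assert np.linalg.eigvalsh(Q).min() > -1e-12
```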
The above CQP can be solved with general [28] and special-purpose [24,33] numerical methods, which have been studied extensively in recent years. These methods, however, are designed for conventional computers and are not suited for dedicated VLSI implementations, in particular analog ones.
A more suitable approach is to map the problem onto a dynamical system described by a set of ordinary differential equations (ODEs) whose stable points correspond to the solutions of the optimization process. An electronic circuit can then be built with the same dynamics described by the set of ODEs. This is the very old idea of analog computers, which was investigated thoroughly when digital computers were not yet mainstream [39].
In general, the ODEs have the form

$$\dot{u} = A F_A(u), \qquad (21)$$

where $\lim_{t \to \infty} u(t) = u^*$ is one, or possibly the unique, solution of the optimization problem. From a connectionist point of view, Eq. (21) can be seen as a recurrent neural network,¹ while from an electronic point of view it is a circuit that, at present, can be built by using, for example, operational amplifiers, diodes and resistors [26]. Note that $A$ is usually a diagonal matrix, with positive entries, that scales the convergence speed of the network and does not affect the final solution; therefore, without any loss of generality, we will assume $A = I$ (the identity matrix) in the rest of the paper. This choice also allows a fair comparison among different methods.
It is worthwhile to briefly mention, at this point, how digital implementations can be approached in a similar way. Through a time discretization, we can write an equation similar to Eq. (21) which can be implemented, in a straightforward way, on a digital computer:

$$u_{k+1} = F_D(u_k). \qquad (22)$$

An obvious way to derive Eq. (22) from Eq. (21) is to apply the well-known Euler method for ODE integration; in this case we obtain

$$u_{k+1} = u_k + \eta F_A(u_k), \qquad (23)$$
¹ A very complete analysis of this approach can be found in [16].
where $\eta$ is the integration step. Unfortunately, the choice of $\eta$ is not trivial, and an extensive a priori analysis must be performed to ensure that $\lim_{k \to \infty} u_k = u^*$ for a particular integration step. For this reason, in some cases, it is more convenient to find $F_D$ directly.
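To illustrate why the choice of $\eta$ matters, the sketch below discretizes a toy gradient flow $\dot{u} = -(Qu - e)$ with Euler's rule; the matrix, the right-hand side and the two step sizes are arbitrary illustrative values. A small $\eta$ converges to the equilibrium $u^* = Q^{-1}e$, while an $\eta$ beyond the stability limit diverges:

```python
import numpy as np

# Toy gradient flow u' = F_A(u) = -(Q u - e); equilibrium u* = Q^{-1} e.
# Q, e and the step sizes are illustrative values, not from the paper.
Q = np.array([[2.0, 0.0], [0.0, 1.0]])
e = np.array([1.0, 1.0])
u_star = np.linalg.solve(Q, e)

def euler(eta, steps):
    """Euler discretization u_{k+1} = u_k + eta * F_A(u_k)."""
    u = np.zeros(2)
    for _ in range(steps):
        u = u + eta * (e - Q @ u)
    return u

# eta below the stability limit (here eta < 1, set by the largest
# eigenvalue of Q) converges; a larger eta makes the iteration diverge.
print(np.linalg.norm(euler(0.1, 200) - u_star))  # tiny residual
print(np.linalg.norm(euler(1.5, 30) - u_star))   # residual blows up
```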
We do not address digital implementations in this work. Some examples exist in the literature [2], but we believe it is not yet clear whether a straightforward implementation of special-purpose optimization methods (e.g. [33,25]) is more or less effective than the above approach when targeting digital VLSI systems.
Note that much work is still needed to achieve actual analog or digital implementations. Analog implementations of Eq. (21) are affected by noise, delays and the physical limits of the electronic technology, all of which can affect the convergence of the system. On the other hand, digital implementations of Eq. (22) must take into account the finite register length [4]. Obviously, the comparison between the two choices is outside the scope of this paper and depends mostly on the actual application [37].
In the following section we address networks of the kind described by Eq. (21), which can be used for analog implementations. In Section 3 some proposals are benchmarked on a simple artificial problem and a nonlinear channel equalization task, in order to evaluate the accuracy of the obtained solutions, their convergence speed, etc. Finally, Section 4 summarizes the results and indicates which method appears to be the best one for implementing SVM learning in analog VLSI.
2.2. Recurrent networks for SVM learning
2.2.1. Primal networks
The Kennedy–Chua network ($A_1$) is one of the first complete proposals for solving nonlinear constrained optimization problems [27]. The link between circuit theory and connectionism was already clear at that time, and it was shown that this recurrent network is equivalent to Chua's canonical circuit [23] and is a generalization of the well-known Tank and Hopfield recurrent network [26]. The first attempt to use this network for implementing SVM learning and, to the best of our knowledge, the first proposal of this kind of approach to SVM learning was presented in [5].
A
T
= [ y|y| I |I ]; (24)
d
T
= [0 : : : 0|C : : : C]; (25)
where I R
ll
is the identity matrix. Then, the evolution of A
1
is described by
F
A
1
(z) =Qz + e A
T
P(Az d); (26)
where P is a projection operator P
i
(x) =max{0; x} and is a positive penalty param-
eter. Note that the number of actual connections needed to implement A
1
is not as
large as Eq. (26) seems to suggest. If we consider, for example, matrix A, we can note
that only 4l values are dierent from zero, on a total of 2l
2
+ 2l entries; furthermore,
270 D. Anguita, A. Boni / Neurocomputing 55 (2003) 265283
they are binary values {1; +1}, therefore no multiplication is needed for computing
a product with matrix A, but only an appropriate sign change and an addition.
The idea behind this network is simple and effective: the first two terms on the right-hand side of Eq. (26) are the antigradient $-\nabla E_D$, while the last term penalizes the trajectory of the solution each time it moves outside the feasibility region.
$A_1$ is called a primal network because only the variables of the original problem appear explicitly. One of the advantages of this kind of network is that the solution is forced to satisfy the constraints during the entire optimization process, albeit with a tolerance that depends on the penalty parameter $\gamma$. This last issue is the most critical one for primal networks, because it also affects the quality of the solution: in fact, the exact minimum is reached only if $\gamma \to \infty$, which is impossible to achieve in practice. However, we will show in the experimental section that good solutions can be obtained with reasonable values of the penalty parameter.
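The behavior of the penalty approach can be sketched with a plain Euler simulation of Eq. (26) on a minimal two-pattern problem (the data, kernel, $C$, $\gamma$ and step size are illustrative assumptions of this sketch; an analog realization would of course not integrate the ODE numerically). For this symmetric problem the dual optimum can be computed by hand, $z_1 = z_2 = 1/(1 - e^{-1}) \approx 1.582$, which the network approaches:

```python
import numpy as np

# Two-pattern toy problem (illustrative): x = (0, 1), y = (+1, -1),
# Gaussian kernel, C = 10, penalty parameter gamma = 10.
x = np.array([0.0, 1.0])
y = np.array([1.0, -1.0])
C, gamma = 10.0, 10.0
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
Q = np.outer(y, y) * K                      # Q = YKY
e = np.ones(2)

# Constraint system A z <= d: the equality y^T z = 0 split into two
# inequalities, plus 0 <= z <= C e, as in Eqs. (24)-(25).
A = np.vstack([y, -y, -np.eye(2), np.eye(2)])
d = np.hstack([0.0, 0.0, np.zeros(2), C * np.ones(2)])

# Euler integration of the Kennedy-Chua dynamics of Eq. (26).
z, eta = np.zeros(2), 0.01
for _ in range(5000):
    z = z + eta * (-Q @ z + e - gamma * A.T @ np.maximum(A @ z - d, 0.0))

print(np.round(z, 3))   # hand-computed optimum: z1 = z2 = 1/(1 - e^{-1}) ≈ 1.582
```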
Many improvements can be applied to $A_1$. The first goes in the direction of improving the quality of the solution: exploiting the nonlinearity of some electronic devices, it is possible to apply a nonlinear penalty instead of a linear one [15]:

$$F_{A_2}(z) = -Qz + e - \gamma A^T P^+(Az - d), \qquad (27)$$

where $P^+_i(x) = x^{1/8} + x$ if $x > 0$ and $0$ otherwise.
Other proposals modify Eq. (27) slightly and take advantage of hybrid continuous/discrete-time solutions. Networks of this kind are not really convergent in the traditional sense, but oscillate in a neighborhood of the solution [36]. Nevertheless, they are very appealing from an electronic implementation point of view. The dynamics of this network, applied to SVM, is the following:

$$F_{A_3}(z) = U(\bar{A}z - \bar{d})(-Qz + e) - \gamma A^T P(Az - d), \qquad (28)$$

where $\bar{A}$ and $\bar{d}$ are defined as in Eqs. (24) and (25) except for the equality constraint,² and $U_i(x) = 1$ if $x \leq 0$ and $0$ otherwise.
Another way to improve the quality of the solution is suggested in [30], where the $A_1$ network is augmented by a second network that starts working when the first one is in the vicinity of the solution. Since the second network is based on additional parameters implementing a Lagrange multiplier method, we will consider this approach in the following section.
2.2.2. Primal-dual networks
A different approach for solving the CQP problem is to implement on our dynamical system the methods for constrained optimization based on Lagrange multipliers [28]. One of the first proposals in this sense is reported in [34], but we will refer to [49], where a detailed and theoretically sound analysis of the Lagrange programming neural network (LPNN) is reported.

² This is necessary to avoid continuous oscillations along the $y^T z = 0$ path.
The main idea of primal-dual methods is to map both the primal and the dual optimization problems onto a dynamical system such that its equilibrium point satisfies the Karush–Kuhn–Tucker (KKT) conditions and is therefore also a solution of the CQP.
The main advantage of the primal-dual approach is that the exact solution is reached without any need for parameter tuning (e.g. of a penalty parameter). Furthermore, from an implementation point of view, it can be technically very difficult to implement the equality constraint due to unavoidable mismatches of electronic devices, while primal-dual networks avoid this problem.
There are, however, some disadvantages: the first is the increased size of the network due to the introduction of the Lagrange multipliers; the second affects the evolution of the variables during learning. In fact, the primal variables satisfy the constraints only at the equilibrium point; therefore, during the evolution of the network, the solution can be infeasible, and this could be unacceptable for some applications.
The network in [49] can be applied to SVM learning by transforming the inequality constraints of the CQP problem into equality constraints: introducing $2l$ slack variables $s^L_i$ and $s^U_i$, the constraint $z_i \leq C$ becomes $z_i - C + (s^U_i)^2 = 0$ and $z_i \geq 0$ becomes $(s^L_i)^2 - z_i = 0$.
Then, we obtain

$$F_{A_4} \begin{pmatrix} z \\ s_U \\ s_L \\ z_U \\ z_L \\ \lambda_0 \end{pmatrix} = \begin{pmatrix} -(Qz - e + \lambda_0 y + z_U - z_L) \\ -2 B_U z_U \\ -2 B_L z_L \\ z - Ce + B_U s_U \\ B_L s_L - z \\ y^T z \end{pmatrix}, \qquad (29)$$

where $B_L$ ($B_U$) is a diagonal matrix with the values $s^L_i$ ($s^U_i$) along the diagonal. This network is quite large, compared to the primal ones, but does not contain any parameter to tune. Furthermore, it can be implemented very easily by linear devices able to compute weighted sums of their inputs; the only exceptions are products of the form $s_i z_i$ and $s_i^2$, which require a multiplication (i.e. a second-order neuron).
It is worthwhile noting, at this point, the absence of explicit constraints in Eq. (29). This is the main advantage of the Lagrangian approach, because it avoids the tolerance issue of electronic implementations, but it can also be a potential source of problems, because the variables can assume, at least in theory, any real value and eventually diverge in time. For this reason, and to accelerate convergence, it is possible to add a penalty term of the form

$$\frac{1}{2}\left[ (y^T z)^2 + \|z - Ce + B_U s_U\|^2 + \|B_L s_L - z\|^2 \right] \qquad (30)$$

to the Lagrangian function, leading to an augmented Lagrangian programming neural network (ALPNN) [30,49]. In our case, using Eq. (18), the ALPNN for SVM learning
can be written as

$$F_{A_5} \begin{pmatrix} z \\ s_U \\ s_L \\ z_U \\ z_L \\ \lambda_0 \end{pmatrix} = \begin{pmatrix} -(Qz - e + \lambda_0 y + z_U - z_L) - [\,y y^T z + (z - Ce + B_U s_U) - (B_L s_L - z)\,] \\ -2 B_U z_U - 2 B_U (z - Ce + B_U s_U) \\ -2 B_L z_L - 2 B_L (B_L s_L - z) \\ z - Ce + B_U s_U \\ B_L s_L - z \\ y^T z \end{pmatrix}. \qquad (31)$$

The effect of the penalty term is twofold: on one hand, it forces the variables to move faster toward their optimum values; on the other hand, it penalizes large deviations, avoiding potential problems in the electronic implementation.
As a last remark, it is interesting to note that primal-dual networks, by making the Lagrange multipliers explicit, directly provide the value of the bias term of the SVM. Primal networks, instead, must resort to additional circuitry that solves, for example, Eq. (10) for any $0 < z_i < C$.
2.3. Projection primal-dual networks
The large number of variables of the previous methods can be reduced by mixing the primal and the dual approaches seen so far. An effective solution was first proposed in [43] and improved in [47], where the equality constraint is handled by a Lagrangian method, while the inequality constraints are handled through a projection method. The network can be described by the following equation:
$$F_{A_6} \begin{pmatrix} z \\ \lambda_0 \end{pmatrix} = \begin{pmatrix} -y y^T z - \beta \left[ 2Qz + \lambda_0 y - e - Q B(z - Qz - \lambda_0 y + e) \right] \\ y^T B(z - Qz - \lambda_0 y + e) \end{pmatrix} \qquad (32)$$

with

$$\beta = \|z - B(z - Qz - \lambda_0 y + e)\|^2 \qquad (33)$$

and $B$ is the projection operator $B_i(x) = \max\{0, \min\{x, C\}\}$.
Network $A_6$ is quite complex: there are many common sub-expressions that simplify its implementation, but the term $\beta$ requires second-order neurons.
Improving on the work described in [9], a simpler network was proposed in [44] and, using the results of [46], adapted to SVM learning [1,41]. The network is described by the following equation (note the similarity to the previous one with $\beta = 1$):

$$F_{A_7} \begin{pmatrix} z \\ \lambda_0 \end{pmatrix} = \begin{pmatrix} -y y^T z - (I + Q)\left[ z - B(z - Qz - \lambda_0 y + e) \right] \\ y^T B(z - Qz - \lambda_0 y + e) \end{pmatrix}. \qquad (34)$$

This last network is one of the most interesting for VLSI implementations because it collects most of the advantages of the previous ones without suffering from their disadvantages. In particular, the number of variables coincides with the number of parameters that identify the SVM; the bias term is provided explicitly; there is no need to tune any parameter; and the topology of the network is quite simple.
It is worth mentioning that projection networks appear to be very promising for a large number of optimization problems. New results have recently appeared in the literature, showing that projection networks are both topologically simple and very efficient in solving linear [48] and nonlinear [45] projection equations, which include constrained linear and quadratic optimization problems as special cases. In particular, the result in [45] allows a user to further simplify the topology of network $A_7$ by getting rid of the term $(I + Q)$.
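The compactness of $A_7$ can be appreciated in a few lines of simulation. The sketch below integrates the $A_7$ dynamics on the same illustrative two-pattern problem used earlier in this survey (data, $C$, and step size are assumptions of the sketch, and the sign conventions were chosen so that the flow is stable); note that $\lambda_0$ converges to the bias $b$, here $b = 0$ by symmetry, at no extra cost:

```python
import numpy as np

# Illustrative two-pattern problem: x = (0, 1), y = (+1, -1), C = 10.
x = np.array([0.0, 1.0])
y = np.array([1.0, -1.0])
C = 10.0
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
Q = np.outer(y, y) * K
e = np.ones(2)
I = np.eye(2)

def B(v):                       # projection onto the box [0, C]^l
    return np.clip(v, 0.0, C)

# Euler integration of the projection primal-dual dynamics (34).
z, lam0, eta = np.zeros(2), 0.0, 0.01
for _ in range(5000):
    p = B(z - Q @ z - lam0 * y + e)
    z, lam0 = (z + eta * (-np.outer(y, y) @ z - (I + Q) @ (z - p)),
               lam0 + eta * (y @ p))

print(np.round(z, 3), lam0)   # z approaches 1/(1 - e^{-1}) in both components
```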
3. Experimental results: two case studies
It is clear from the previous sections that four networks are appealing from an electronic implementation point of view: the two primal networks $A_{1,2}$ and the two projection primal-dual ones $A_{6,7}$. The $A_3$ network depends very tightly on the actual electronic realization, usually through a switched-capacitor technology [36]; therefore we do not include it here, where we perform a simple numerical integration of Eq. (21). $A_{4,5}$, instead, are too critical in practice, due to the absence of any limitation on the values of the state variables, a drawback already pointed out by Platt and Barr [34].
We show in this section the behavior of networks $A_{1,2,6,7}$ on a simple artificial problem and on a telecommunication application. The artificial problem has been built to show the different evolution paths of the SVM parameters for each network. In order to visualize the evolution in time, we restrict the problem to three patterns.
The telecommunication problem consists of the channel equalization application cited in the introduction of this paper [14,38] and aims at showing the behavior of the networks in a setting similar to a real-world one. The size of the problem is limited to a number of patterns compatible with an actual analog VLSI implementation: by converting the continuous-time formulation to a discrete-time one, it would be possible to derive an algorithm for a general-purpose digital computer and, therefore, to tackle much larger problems.
In the following experiments, we integrate networks $A_{1,2,6,7}$ by using Gear's BDF method for stiff ODEs.³ The evolution of each network is stopped when $\|\dot{u}\|^2 \leq 10^{-6}$, when a maximum number of iterations is reached, or when the algorithm fails to produce reliable integration steps.
3.1. Learning paths for the artificial problem
The training set of the artificial problem consists of the three patterns $(x_1, y_1) = (0, +1)$, $(x_2, y_2) = (0.5, -1)$, and $(x_3, y_3) = (1, +1)$ (the targets must alternate in sign so that the equality constraint can hold at the solution reported below), solved with a nonlinear SVM with Gaussian kernel $k(x_i, x_j) = \exp(-(x_i - x_j)^2)$.

³ Visual Numerics IMSL Math Library.
[Fig. 1: four 3-D plots of the learning paths over the axes alpha_1, alpha_2, alpha_3.]
Fig. 1. The learning path of the $A_{1,2,6,7}$ networks for different starting points: the origin and (20, 0, 0), which lies outside of the depicted box. The solution is close to the upper right corner, approximately (7.915, 15.831, 7.915).
The solution is $z^* = (2c, 4c, 2c)^T$ with $c = 1/(\exp(-1) - 4\exp(-1/4) + 3)$, which corresponds approximately to $z^* \approx (7.915, 15.831, 7.915)^T$.
We let the networks evolve from two different starting points, $z_A = (0, 0, 0)^T$ and $z_B = (20, 0, 0)^T$, using a low value of the penalty parameter for the first two networks ($\gamma = 0.2$).
In Fig. 1, the learning path of each network is reported. Note that the starting point $z_B$ lies outside the feasibility region, while $z_A$ satisfies all the constraints.
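The closed-form solution quoted above can be checked numerically: with an alternating sign pattern for the targets (the alternation is what makes $y^T z^* = 0$ hold), $z^* = c\,(2, 4, 2)^T$ satisfies the stationarity condition $Qz^* - e + b y = 0$ exactly for some scalar bias $b$:

```python
import numpy as np

# Patterns of the artificial problem with alternating targets.
x = np.array([0.0, 0.5, 1.0])
y = np.array([1.0, -1.0, 1.0])
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
Q = np.outer(y, y) * K
e = np.ones(3)

c = 1.0 / (np.exp(-1.0) - 4.0 * np.exp(-0.25) + 3.0)
z = c * np.array([2.0, 4.0, 2.0])
print(np.round(z, 3))        # matches (7.915, 15.831, 7.915)

# KKT stationarity Q z - e + b y = 0 for the interior solution,
# with b recovered from the first component.
b = -(Q @ z - e)[0] / y[0]
assert abs(y @ z) < 1e-12
assert np.allclose(Q @ z - e + b * y, 0.0, atol=1e-10)
```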
Fig. 2. The model of the transmitter–channel–receiver.
It is clear from the evolution of $A_1$ how the penalty parameter affects the quality of the solution; in fact, the two paths do not reach the same point. Network $A_2$, by contrast, shows a better behavior despite the use of the same penalty parameter: this is obviously the effect of the nonlinear penalty term.
Networks $A_6$ and $A_7$ show comparable learning paths, but we must remark that $A_7$ reached the equilibrium point after only 75 integration steps, while more than 550 steps were needed for $A_6$.
3.2. Channel equalization
The channel equalization problem is a typical real-world application where special-purpose hardware can be fruitfully used, on the receiver side, in order to estimate one of two symbols $u_n \in \{\pm 1\}$ of an independent sequence emitted by a given source. All the nonlinear effects of the involved components (transmitter, channel and receiver) are modeled as FIR filters, plus Gaussian-distributed white noise $e$ with zero mean and variance $\sigma_e^2$ (see Fig. 2):

$$\bar{x}(n) = \sum_{k=0}^{N} h_k u(n - k),$$
$$\hat{x}(n) = \sum_{p=1}^{P} c_p \bar{x}^p(n),$$
$$x(n) = \hat{x}(n) + e(n). \qquad (35)$$
Classical theory tackles this problem by finding an optimal classifier (the Bayesian maximum likelihood detector) that provides an estimate $\hat{u}(n - D)$ of $y_n = u(n - D)$ through the observation of an $M$-dimensional vector $x_n = [x(n), x(n-1), \ldots, x(n - M + 1)]^T$. Whereas these methods require knowledge of the symbol probabilities, neural network-based approaches have been successfully applied [14,38] to systems where such a distribution is not known. In practice, a classifier is selected on the basis of $l$ previous samples, having the following structure:

$$\{(x_{n-l+1}, u(n - D - l + 1)), \ldots, (x_n, u(n - D))\}. \qquad (36)$$
The main disadvantage of neural network-based methods lies in the difficulty of finding adaptive methods to control the complexity of the network and, as a consequence, its generalization ability (in [38] this topic is discussed in Section 2); the problem becomes harder when one needs to implement in hardware not only the learning algorithm, but the model selection as well, as in the case of the channel equalization application detailed in this section. In this sense, one needs to use simple estimates that can be easily implemented. A recent work [7] suggests a method to perform model selection in a data-dependent way, which has been shown to be quite reliable in practice when applied to SVM classification [3].
The method, called maximal discrepancy, is very appealing from a hardware implementation point of view, because the model selection can be performed by using the same CQP problem as SVM learning, but with a new training set obtained by splitting the original one into two halves and flipping the targets of the second half. Let us call $E$ the error performed by the SVM, after learning with a particular choice of the hyperparameters, on the original dataset, and $\bar{E}$ the error on the new data; then the maximal discrepancy method chooses the SVM corresponding to the minimum of $E + (1 - 2\bar{E})$ (see [3] for more details).
With the maximal discrepancy technique, one can use two replicas of the same hardware: one for model selection and one for SVM training.
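The data manipulation required by maximal discrepancy is trivial to implement, which is precisely its hardware appeal; a minimal sketch follows (function names and all the error values are hypothetical illustrations, not measurements from the paper):

```python
import numpy as np

def max_discrepancy_set(X, y):
    """Same inputs, targets of the second half flipped."""
    y_flip = y.copy()
    y_flip[len(y) // 2:] *= -1
    return X, y_flip

def md_criterion(E, E_bar):
    """Selection score from the text: E + (1 - 2 * E_bar)."""
    return E + (1.0 - 2.0 * E_bar)

# Demonstrate the target flip on a dummy 4-pattern set.
flipped = max_discrepancy_set(np.zeros((4, 1)), np.array([1.0, 1.0, 1.0, 1.0]))[1]
print(flipped)   # second half of the targets changes sign

# Hypothetical (E, E_bar) pairs for three values of C: an overly complex
# machine fits the flipped targets too well (small E_bar, large penalty).
candidates = {0.1: (0.30, 0.45), 0.65: (0.10, 0.42), 10.0: (0.02, 0.05)}
scores = {C: md_criterion(E, Eb) for C, (E, Eb) in candidates.items()}
best_C = min(scores, key=scores.get)
print(best_C)    # the intermediate C wins in this made-up example
```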
For our benchmark purposes, we applied the networks described above to the following channel model [38]:

$$\hat{x}(n) = \bar{x}(n) - 0.9\,\bar{x}^3(n),$$
$$\bar{x}(n) = u(n) + \tfrac{1}{2}\, u(n - 1) \qquad (37)$$

with $\sigma_e^2 = 0.2$, $M = 2$, and $D = 2$. In [38] this channel equalization problem was solved by using a training set of 512 patterns and a 2nd-degree polynomial kernel. We want to make this problem more suitable from a hardware realization point of view; therefore we use only 32 patterns for training and a Gaussian kernel $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2)$. The variance of the RBF kernel has been fixed, because it could be difficult to change it in a VLSI implementation, while the value of the hyperparameter $C$ can easily be modified by varying, for example, the output range of the $P$ neurons. We found that, according to the maximal discrepancy method, the optimal value of this hyperparameter is $C = 0.65$.
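A training set for this benchmark can be generated directly from Eq. (37); the sketch below is an illustrative reading of the model (the handling of the initial filter taps and the random seed are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_channel_data(l=32, sigma2_e=0.2, M=2, D=2):
    """Pairs (x_n, u(n-D)) for the nonlinear channel of Eq. (37)."""
    n_tot = l + M + D + 1          # a few extra symbols to fill the taps
    u = rng.choice([-1.0, 1.0], size=n_tot)
    x_bar = u.copy()
    x_bar[1:] += 0.5 * u[:-1]      # x_bar(n) = u(n) + u(n-1)/2
    x_hat = x_bar - 0.9 * x_bar ** 3
    x = x_hat + rng.normal(0.0, np.sqrt(sigma2_e), size=n_tot)
    # Observation window of length M = 2 and target delayed by D.
    X = np.array([[x[n], x[n - 1]] for n in range(M + D, n_tot)])
    t = np.array([u[n - D] for n in range(M + D, n_tot)])
    return X[-l:], t[-l:]

X, t = make_channel_data()
print(X.shape)   # (32, 2)
```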
The SVM described above produces the discriminating surface depicted on the right side of Fig. 3, which can be compared with the optimal Bayesian receiver on the left side. Both the learning patterns and the actual data constellation without noise are shown in the same plots.
The results obtained by the SVM selected by the maximal discrepancy method are astonishingly good. In fact, if we estimate the actual generalization by means of a test set of approximately 3000 patterns (Fig. 4), we measure a receiver error of 4.1%.
[Fig. 3: two contour plots over the axes x(n) and x(n-1).]
Fig. 3. The central section of the Bayesian optimal receiver (left) and the SVM receiver (right).
[Fig. 4: scatter plot over the axes x(n) and x(n-1).]
Fig. 4. The training and test set of the channel equalization problem.
Table 1
Misclassification error on the test set of the Bayesian optimal receiver and the SVM receivers

Receiver              Error (%)
---------------------------------
Bayesian              3.7
SVM [38]              4.2
SVM (SMO)             4.1
SVM (A_1, γ = 0.2)    4.4
SVM (A_1, γ = 10)     4.1
SVM (A_2, γ = 0.2)    4.3
SVM (A_6)             4.1
SVM (A_7)             4.1
This result is even slightly better than the one obtained in [38], where no model selection was explicitly performed but a much larger training set was used. However, we stress that we are not interested in the optimal solution of the channel equalization problem, but rather in the actual solutions found by the recurrent networks described in the previous sections.
In Table 1, the performance obtained by the optimal receiver, the SVM receiver used in [38], the SVM receiver trained with the SMO algorithm [33], and the four recurrent networks is summarized.
The primal networks are more prone to errors due to the presence of the penalty term. Only by increasing the value of $\gamma$ does the solution become comparable to the one found by the SMO algorithm. Note, however, that the value $\gamma = 10$ caused the network $A_2$ to become unstable, even though, as expected from the results of the previous section, its performance is better than that of $A_1$ for the same penalty value. Networks $A_{6,7}$ show similar (optimal) performance, even though $A_7$ is topologically simpler.
Fig. 5 shows the central sections of the SVM receivers obtained as detailed above. The SVM receiver obtained by $A_1$ is shown for both values of the penalty parameter. In Fig. 6, the evolution of $\|F_A\|^2$ is depicted. It is interesting to note that the two networks that obtained the best classification performance are also the two with the fastest convergence.
4. Discussion of the results
The choice of a specific network for a particular application, among the ones presented in this work, depends on at least three factors, which are tightly interrelated: theoretical soundness, practical considerations, and implementation issues.
From the theoretical point of view, projection networks are the most promising: in fact, primal networks rely on a penalty parameter, which is not provided by theory, and primal-dual networks can be unstable [34]. Furthermore, the results that recently appeared in [45] show that these networks can be implemented with only one layer of neurons, and that convergence is guaranteed even when the matrix $Q$ is positive semi-definite (and not strictly positive definite), which can be the case for SVM learning problems.
[Fig. 5: four contour plots over the axes x(n) and x(n-1).]
Fig. 5. The central section of the SVM receiver, using the $A_{1,2,6,7}$ networks for learning.
From a practical point of view, the choice is mostly application dependent. As described in the previous section, the networks perform comparably with respect to generalization performance, but the speed of convergence and the accuracy of the solution can vary greatly. From the experimental results of the previous section, network $A_7$ proves to be both more accurate and faster to converge.
When targeting actual analog VLSI devices, the implementation issues could outweigh any other aspect. In fact, the number of neurons and connections directly affects the complexity of the device and, therefore, its cost and reliability (see Table 2).
The clear winner in this sense is network $A_7$, which uses only the variables strictly necessary for the minimization problem, without resorting to external circuits. This observation is even more true if we consider the topological simplification reported in [45].
[Fig. 6: semi-logarithmic plot of $\|F\|^2$ versus $t$ for $A_1$ ($\gamma = 0.2$), $A_1$ ($\gamma = 10$), $A_2$, $A_6$, and $A_7$.]
Fig. 6. Comparison of network evolution during learning.
Table 2
Quantitative comparison of network topologies

Network   L neurons   P neurons   H neurons   Weights
A1        l           2l + 2      0           l^2 + 10l
A2        l           2l + 2      0           l^2 + 10l
A3        l           4l + 2      0           l^2 + 10l
A4        5l + 1      0           4l          l^2 + 10l
A5        5l + 1      0           10l         l^2 + 19l
A6        l + 1       l           l + 2       2l^2 + 4l
A7        l + 1       l           0           2l^2 + 4l

L neurons perform a weighted sum of their inputs, P neurons implement projection operators, and H
neurons perform higher-order computations, such as the multiplication of their inputs. The Weights column
indicates the total number of nonzero connections of the network. These figures can change slightly if
common sub-expressions are reused or each neuron output is exploited cleverly.
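The counts in Table 2 can be turned into a quick sizing estimate. Below is a hypothetical helper (the formulas are copied directly from Table 2; the function name and the sample value l = 100 are our own assumptions) that tabulates neurons and connections as a function of the number of training samples l:

```python
def topology(l):
    """Neuron and weight counts from Table 2 as a function of the
    number of training samples l.
    Returns {network: (L_neurons, P_neurons, H_neurons, weights)}."""
    return {
        "A1": (l,         2 * l + 2, 0,      l * l + 10 * l),
        "A2": (l,         2 * l + 2, 0,      l * l + 10 * l),
        "A3": (l,         4 * l + 2, 0,      l * l + 10 * l),
        "A4": (5 * l + 1, 0,         4 * l,  l * l + 10 * l),
        "A5": (5 * l + 1, 0,         10 * l, l * l + 19 * l),
        "A6": (l + 1,     l,         l + 2,  2 * l * l + 4 * l),
        "A7": (l + 1,     l,         0,      2 * l * l + 4 * l),
    }

# Total neuron and connection counts for a training set of 100 samples.
for name, (L, P, H, W) in topology(100).items():
    print(f"{name}: {L + P + H} neurons, {W} weights")
```

For large l, the projection networks A6 and A7 pay roughly twice the connections of A1-A5 (2l^2 + 4l versus l^2 + 10l), but A7 has the smallest total neuron count and needs no external circuitry, which is exactly the trade-off resolved in A7's favor above.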
In Table 3 some of these issues are summarized: the projection network A7 appears
to be the most effective solution for analog VLSI implementations of SVM learning.

Table 3
A summarizing comparison of the networks

Quality index               A1         A2     A6         A7
Topological complexity      Low        Low    High       Low
Need for external circuit   Yes        Yes    No         No
Accuracy                    Moderate   High   High       High
Convergence speed           Moderate   High   Moderate   High

Note that the accuracy and convergence speed of A1 and A2 also depend on the value of the penalty parameter.

5. Conclusions

We have presented a survey of networks that perform SVM learning and are
suitable for implementation in analog VLSI. When benchmarked on a
telecommunication application, the selected methods perform remarkably well and the
differences between their solutions are very small; however, both theoretical and practical
considerations suggest that projection networks are best suited for analog VLSI
implementations. Current work is addressing the actual VLSI realization.
Acknowledgements
We thank S. Ridella for many valuable discussions, and two anonymous reviewers
for their suggestions on how to improve the paper and for pointing us to additional
references.
References
[1] D. Anguita, A. Boni, Improved neural network for SVM learning, IEEE Trans. Neural Networks 13 (5) (2002) 1243–1244.
[2] D. Anguita, A. Boni, S. Ridella, VLSI friendly training algorithms and architectures for support vector machines, Int. J. Neural Systems 10 (3) (2000) 159–170.
[3] D. Anguita, S. Ridella, F. Rivieccio, R. Zunino, Hyperparameter design criteria for support vector classifiers, Neurocomputing, this issue.
[4] D. Anguita, S. Ridella, S. Rovetta, Worst case analysis of weight inaccuracy effects in multilayer perceptrons, IEEE Trans. Neural Networks 10 (2) (1998) 415–418.
[5] D. Anguita, S. Ridella, S. Rovetta, Circuital implementation of support vector machines, Electron. Lett. 34 (16) (1998) 1596–1597.
[6] M. Anthony, P.L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, 1999.
[7] P.L. Bartlett, S. Boucheron, G. Lugosi, Model selection and error estimation, Mach. Learning 48 (2002) 85–113.
[8] R. Battiti, First- and second-order methods for learning: between steepest descent and Newton's method, Neural Comput. 4 (2) (1992) 141–166.
[9] A. Bouzerdom, T.R. Pattison, Neural network for quadratic optimization with bound constraints, IEEE Trans. Neural Networks 4 (2) (1993) 293–304.
[10] G. Cauwenberghs, M. Bayoumi (Eds.), Learning on Silicon: Adaptive VLSI Neural Systems, Kluwer Academic Publishers, Dordrecht, 1999.
[11] G. Cauwenberghs, M. Bayoumi, E. Sanchez-Sinencio (Eds.), Special issue on learning on silicon, Analog Integrated Circuits Signal Process. 18 (2–3) (1999) 113–312.
[12] C.-C. Chang, C.-J. Lin, IJCNN 2001 Challenge: generalization ability and text decoding, Proceedings of the International Conference on Neural Networks, Washington, DC, USA, June 2001, pp. 1031–1036.
[13] M.-W. Chang, B.-J. Chen, C.-J. Lin, EUNITE Network Competition: Electricity Load Forecasting, Technical Report, National Taiwan University, November 2001.
[14] S. Chen, G.J. Gibson, C.F.N. Cowan, P.M. Grant, Adaptive equalization of finite nonlinear channels using multilayer perceptrons, Signal Process. 20 (2) (1990) 107–119.
[15] J. Chen, M.A. Shanblatt, C.-Y. Maa, Improved neural networks for linear and nonlinear programming, Int. J. Neural Systems 2 (4) (1992) 331–339.
[16] A. Cichocki, R. Unbehauen, Neural Networks for Optimization and Signal Processing, John Wiley & Sons, New York, 1993.
[17] C. Cortes, V. Vapnik, Support vector networks, Mach. Learning 20 (1995) 273–297.
[18] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, 2000.
[19] S. Draghici, Neural networks in analog hardware: design and implementation issues, Int. J. Neural Systems 10 (1) (2000) 1–28.
[20] S. Draghici, Guest editorial: new trends in neural network implementations, Int. J. Neural Systems 10 (3) (2000) vii–viii.
[21] P.J. Edwards, A.F. Murray, Analogue Imprecision in MLP Training, World Scientific, Singapore, 1996.
[22] M. Holler, S. Tam, H. Castro, R. Benson, An electrically trainable artificial neural network (ETANN) with 10240 floating gate synapses, Proceedings of the International Joint Conference on Neural Networks, Washington, DC, USA, 1989, pp. 191–196.
[23] J.L. Huertas, A. Rueda, A. Rodriguez-Vazquez, L.O. Chua, Canonical nonlinear programming circuit, Int. J. Circuit Theory Appl. 15 (1) (1987) 71–77.
[24] T. Joachims, Making large-scale SVM learning practical, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, The MIT Press, Cambridge, MA, 1999.
[25] S.S. Keerthi, E.G. Gilbert, Convergence of a generalized SMO algorithm for SVM classifier design, Mach. Learning 46 (2002) 351–360.
[26] M.P. Kennedy, L.O. Chua, Unifying the Tank and Hopfield linear programming network and the canonical nonlinear network of Chua and Lin, IEEE Trans. Circuits and Systems 34 (2) (1987) 210–214.
[27] M.P. Kennedy, L.O. Chua, Neural networks for nonlinear programming, IEEE Trans. Circuits and Systems 35 (5) (1988) 554–562.
[28] D.G. Luenberger, Introduction to Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1973.
[29] C.-Y. Maa, M.A. Shanblatt, Linear and quadratic programming neural network analysis, IEEE Trans. Neural Networks 3 (4) (1992) 580–594.
[30] C.-Y. Maa, M.A. Shanblatt, A two-phase optimization neural network, IEEE Trans. Neural Networks 3 (6) (1992) 1003–1009.
[31] C. Mead, Analog VLSI and Neural Systems, Addison-Wesley, Reading, MA, 1989.
[32] A.R. Omondi, Neurocomputers: a dead end? Int. J. Neural Systems 10 (6) (2000) 475–481.
[33] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, The MIT Press, Cambridge, MA, 1999.
[34] J.C. Platt, A. Barr, Constrained differential optimization, in: Advances in Neural Information Processing Systems (NIPS), American Institute of Physics, 1987, pp. 612–621.
[35] U. Ramacher, SYNAPSE: a neurocomputer that synthesizes neural algorithms on a parallel systolic engine, J. Parallel Distributed Comput. 14 (1992) 306–318.
[36] A. Rodriguez-Vazquez, R. Dominguez-Castro, A. Rueda, J.L. Huertas, E. Sanchez-Sinencio, Nonlinear switched capacitor neural networks for optimization procedures, IEEE Trans. Circuits and Systems 37 (3) (1990) 384–398.
[37] R. Sarpeshkar, Analog versus digital: extrapolating from electronics to neurobiology, Neural Comput. 10 (1998) 1601–1638.
[38] D.J. Sebald, J.A. Bucklew, Support vector machine techniques for non-linear equalization, IEEE Trans. Signal Process. 48 (11) (2000) 3217–3226.
[39] T.E. Stern, Theory of Nonlinear Networks and Systems, Addison-Wesley, Reading, MA, 1965.
[40] S. Still, B. Schölkopf, K. Hepp, R.J. Douglas, Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm, in: Advances in Neural Information Processing Systems 13, The MIT Press, Cambridge, MA, 2000.
[41] Y. Tan, Y. Xia, J. Wang, Neural network realization of support vector methods for pattern classification, International Joint Conference on Neural Networks, Vol. 5, 2000, pp. 411–416.
[42] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[43] X.-Y. Wu, Y.-S. Xia, J. Li, W.-K. Chen, A high performance neural network for solving linear and quadratic programming problems, IEEE Trans. Neural Networks 7 (3) (1996) 67–72.
[44] Y.-S. Xia, A new neural network for solving linear and quadratic programming problems, IEEE Trans. Neural Networks 7 (6) (1996) 1544–1547.
[45] Y.-S. Xia, H. Leung, J. Wang, A projection neural network and its application to constrained optimization problems, IEEE Trans. Circuits and Systems I: Fund. Theory Appl. 49 (4) (2002) 447–458.
[46] Y.-S. Xia, J. Wang, A general methodology for designing globally convergent optimization neural networks, IEEE Trans. Neural Networks 9 (6) (1998) 1331–1343.
[47] Y.-S. Xia, J. Wang, Recurrent neural networks for optimization: the state of the art, in: L.R. Medsker, L.C. Jain (Eds.), Recurrent Neural Networks: Design and Applications, CRC Press, Boca Raton, FL, 2000, pp. 29–45.
[48] Y.-S. Xia, J. Wang, A recurrent neural network for solving linear projection equations, Neural Networks 13 (2000) 337–350.
[49] S. Zhang, A.G. Constantinides, Lagrange programming neural networks, IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing 39 (7) (1992) 441–452.
Davide Anguita graduated in Electronic Engineering in 1989 and obtained the Ph.D.
in Computer Science and Electronic Engineering at the University of Genova, Italy,
in 1993. After working as a research associate at the International Computer Science
Institute, Berkeley, USA, on special-purpose processors for neurocomputing, he
joined the Department of Biophysical and Electronic Engineering at the University
of Genova, where he teaches digital electronics. His current research focuses on
industrial applications of artificial neural networks and kernel methods and their
implementation on digital and analog electronic devices. He is a member of the IEEE
and chair of the Smart Adaptive Systems committee of the European Network on
Intelligent Technologies (EUNITE).

Andrea Boni was born in Genova, Italy, in 1969 and graduated in Electronic Engineering
in 1996. He received a Ph.D. degree in Electronic and Computer Science
in 2000. After working as a research consultant at DIBE, University of Genova, he
joined the Department of Information and Communication Technologies, University
of Trento, Italy. His main scientific interests are the study and development
of digital circuits for advanced information processing, with particular attention to
programmable logic devices, digital signal theory and analysis, statistical signal
processing, statistical learning theory, and support vector machines. The application
fields of these interests include identification and control of non-linear systems,
pattern recognition, time-series forecasting, image and signal processing, and
cryptography.