Multilayer Perceptrons
6.1 Introduction
In this chapter we study an important class of neural networks, namely, multilayer feedfor-
ward networks. Typically, the network consists of a set of sensory units (source nodes)
that constitute the input layer, one or more hidden layers of computation nodes, and an
output layer of computation nodes. The input signal propagates through the network in
a forward direction, on a layer-by-layer basis. These neural networks are commonly
referred to as multilayer perceptrons (MLPs), which represent a generalization of the
single-layer perceptron considered in Chapter 4.
Multilayer perceptrons have been applied successfully to solve some difficult and
diverse problems by training them in a supervised manner with a highly popular algorithm
known as the error back-propagation algorithm. This algorithm is based on the error-
correction learning rule. As such, it may be viewed as a generalization of an equally
popular adaptive filtering algorithm: the ubiquitous least-mean-square (LMS) algorithm
described in Chapter 5 for the special case of a single linear neuron model.
Basically, the error back-propagation process consists of two passes through the differ-
ent layers of the network: a forward pass and a backward pass. In the forward pass, an
activity pattern (input vector) is applied to the sensory nodes of the network, and its effect
propagates through the network, layer by layer. Finally, a set of outputs is produced as
the actual response of the network. During the forward pass the synaptic weights of the
network are all fixed. During the backward pass, on the other hand, the synaptic weights
are all adjusted in accordance with the error-correction rule. Specifically, the actual
response of the network is subtracted from a desired (target) response to produce an error
signal. This error signal is then propagated backward through the network, against the
direction of synaptic connections, hence the name "error back-propagation." The synap-
tic weights are adjusted so as to make the actual response of the network move closer to
the desired response. The error back-propagation algorithm is also referred to in the
literature as the back-propagation algorithm, or simply back-prop. Henceforth, we will
refer to it as the back-propagation algorithm. The learning process performed with the
algorithm is called back-propagation learning.
A multilayer perceptron has three distinctive characteristics:
1. The model of each neuron in the network includes a nonlinearity at the output end.
The important point to emphasize here is that the nonlinearity is smooth (i.e., differentiable
everywhere), as opposed to the hard-limiting used in Rosenblatt’s perceptron. A commonly
used form of nonlinearity that satisfies this requirement is a sigmoidal nonlinearity defined
by the logistic function:

y_j = 1 / (1 + exp(-v_j))

where v_j is the net internal activity level of neuron j, and y_j is the output of the neuron.
The presence of nonlinearities is important because, otherwise, the input-output relation
of the network could be reduced to that of a single-layer perceptron. Moreover, the use
of the logistic function is biologically motivated, since it attempts to account for the
refractory phase of real neurons (Pineda, 1988b).
2. The network contains one or more layers of hidden neurons that are not part of the
input or output of the network. These hidden neurons enable the network to learn com-
plex tasks by extracting progressively more meaningful features from the input patterns
(vectors).
3. The network exhibits a high degree of connectivity, determined by the synapses of
the network. A change in the connectivity of the network requires a change in the population
of synaptic connections or their weights.
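The smooth nonlinearity of characteristic 1 can be made concrete with a short sketch, here in Python with NumPy; the hard-limiter comparison and all function names are ours, not the text's:

```python
import numpy as np

def logistic(v):
    """Smooth sigmoidal nonlinearity y = 1 / (1 + exp(-v)); differentiable everywhere."""
    return 1.0 / (1.0 + np.exp(-v))

def hard_limiter(v):
    """Rosenblatt-style threshold nonlinearity; not differentiable at v = 0."""
    return np.where(v >= 0.0, 1.0, 0.0)

v = np.linspace(-5.0, 5.0, 11)
y = logistic(v)

# The logistic output is confined to the open interval (0, 1), and its
# derivative has the closed form dy/dv = y(1 - y).
dy_dv = y * (1.0 - y)
```

It is this everywhere-defined derivative that the back-propagation derivation later exploits.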
Indeed, it is through the combination of these characteristics together with the ability to
learn from experience through training that the multilayer perceptron derives its computing
power. These same characteristics, however, are also responsible for the deficiencies in
our present state of knowledge on the behavior of the network. First, the presence of a
distributed form of nonlinearity and the high connectivity of the network make the theoreti-
cal analysis of a multilayer perceptron difficult to undertake. Second, the use of hidden
neurons makes the learning process harder to visualize. In an implicit sense, the learning
process must decide which features of the input pattern should be represented by the
hidden neurons. The learning process is therefore made more difficult because the search
has to be conducted in a much larger space of possible functions, and a choice has to be
made between alternative representations of the input pattern (Hinton, 1987).
Research interest in multilayer feedforward networks dates back to the pioneering work
of Rosenblatt (1962) on perceptrons and that of Widrow and his students on Madalines
(Widrow, 1962). Madalines were constructed with many inputs, many Adaline elements
in the first layer, and with various logic devices such as AND, OR, and majority-vote-
taker elements in the second layer; the Adaline was described in Section 5.8. The Madalines
of the 1960s had adaptive first layers and fixed threshold functions in the second (output)
layers (Widrow and Lehr, 1990). However, the tool that was missing in those early days
of multilayer feedforward networks was what we now call back-propagation learning.
The usage of the term "back-propagation" appears to have evolved after 1985. How-
ever, the basic idea of back-propagation was first described by Werbos in his Ph.D. thesis
(Werbos, 1974), in the context of general networks with neural networks representing a
special case. Subsequently, it was rediscovered by Rumelhart, Hinton, and Williams
(1986b), and popularized through the publication of the seminal book entitled Parallel
Distributed Processing (Rumelhart and McClelland, 1986). A similar generalization of
the algorithm was derived independently by Parker in 1985, and interestingly enough, a
roughly similar learning algorithm was also studied by LeCun (1985).
The development of the back-propagation algorithm represents a "landmark" in neural
networks in that it provides a computationally efficient method for the training of multilayer
perceptrons. Although it cannot be claimed that the back-propagation algorithm can provide
a solution for all solvable problems, it is fair to say that it has put to rest the pessimism
about learning in multilayer machines that may have been inferred from the book by
Minsky and Papert (1969).
1. Function Signals. A function signal is an input signal (stimulus) that comes in at
the input end of the network, propagates forward (neuron by neuron) through the
network, and emerges at the output end as an output signal. We refer to it as a
"function signal" for two reasons. First, it is presumed
to perform a useful function at the output of the network. Second, at each neuron
of the network through which a function signal passes, the signal is calculated as
a function of the inputs and associated weights applied to that neuron.
2. Error Signals. An error signal originates at an output neuron of the network, and
propagates backward (layer by layer) through the network. We refer to it as an
“error signal” because its computation by every neuron of the network involves
an error-dependent function in one form or another.
The output neurons (computational nodes) constitute the output layer of the network.
The remaining neurons (computational nodes) constitute hidden layers of the network.
Thus the hidden units are not part of the output or input of the network, hence their
designation as “hidden.” The first hidden layer is fed from the input layer made up of
sensory units (source nodes); the resulting outputs of the first hidden layer are in turn
applied to the next hidden layer; and so on for the rest of the network.
FIGURE 6.2 Illustration of the directions of two basic signal flows in a multilayer
perceptron, namely, forward propagation of function signals and back-propagation of
error signals.
Notation

■ The indices i, j, and k refer to different neurons in the network; with signals propagat-
ing through the network from left to right, neuron j lies in a layer to the right of
neuron i, and neuron k lies in a layer to the right of neuron j when neuron j is a
hidden unit.
■ The iteration n refers to the nth training pattern (example) presented to the network.
■ The symbol ℰ(n) refers to the instantaneous sum of error squares at iteration n. The
average of ℰ(n) over all values of n (i.e., the entire training set) yields the average
squared error ℰ_av.
■ The symbol e_j(n) refers to the error signal at the output of neuron j for iteration n.
■ The symbol d_j(n) refers to the desired response for neuron j and is used to compute
e_j(n).
■ The symbol y_j(n) refers to the function signal appearing at the output of neuron j at
iteration n.
■ The symbol w_ji(n) denotes the synaptic weight connecting the output of neuron i to
the input of neuron j at iteration n. The correction applied to this weight at iteration
n is denoted by Δw_ji(n).
■ The net internal activity level of neuron j at iteration n is denoted by v_j(n); it
constitutes the signal applied to the nonlinearity associated with neuron j.
■ The activation function describing the input-output functional relationship of the
nonlinearity associated with neuron j is denoted by φ_j(·).

We define the instantaneous value of the squared error for neuron j as (1/2)e_j²(n). Correspond-
ingly, the instantaneous value ℰ(n) of the sum of squared errors is obtained by summing
(1/2)e_j²(n) over all neurons in the output layer; these are the only "visible" neurons for which
error signals can be calculated. The instantaneous sum of squared errors of the network
is thus written as

ℰ(n) = (1/2) Σ_{j∈C} e_j²(n)    (6.2)
where the set C includes all the neurons in the output layer of the network. Let N denote
the total number of patterns (examples) contained in the training set. The average squared
error is obtained by summing ℰ(n) over all n and then normalizing with respect to the
set size N, as shown by

ℰ_av = (1/N) Σ_{n=1}^{N} ℰ(n)    (6.3)
The instantaneous sum of error squares ℰ(n), and therefore the average squared error ℰ_av,
is a function of all the free parameters (i.e., synaptic weights and thresholds) of the
network. For a given training set, ℰ_av represents the cost function as the measure of
training-set learning performance. The objective of the learning process is to adjust the free
parameters of the network so as to minimize ℰ_av. To do this minimization we use an
approximation similar in rationale to that we used for the derivation of the LMS algorithm
in Chapter 5. Specifically, we consider a simple method of training in which the weights
are updated on a pattern-by-pattern basis. The adjustments to the weights are made in
accordance with the respective errors computed for each pattern presented to the network.
The arithmetic average of these individual weight changes over the training set is therefore
an estimate of the true change that would result from modifying the weights based on
minimizing the cost function over the entire training set.
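The two error measures of Eqs. (6.2) and (6.3) can be exercised numerically; the desired responses and network outputs below are invented solely for illustration:

```python
import numpy as np

# Hypothetical desired responses d and actual outputs o for N = 4 patterns
# and two output neurons (the set C of Eq. (6.2)).
d = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
o = np.array([[0.8, 0.1], [0.2, 0.7], [0.9, 0.6], [0.1, 0.2]])

e = d - o                          # error signals e_j(n)
E_n = 0.5 * np.sum(e**2, axis=1)   # instantaneous sum of squared errors, Eq. (6.2)
E_av = np.mean(E_n)                # average squared error over the set, Eq. (6.3)
```

Pattern-by-pattern training, described next, uses each ℰ(n) in turn; batch training uses ℰ_av directly.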
Consider then Fig. 6.3, which depicts neuron j being fed by a set of function signals
produced by a layer of neurons to its left. The net internal activity level v_j(n) produced
at the input of the nonlinearity associated with neuron j is therefore

v_j(n) = Σ_{i=0}^{p} w_ji(n) y_i(n)    (6.4)

where p is the total number of inputs (excluding the threshold) applied to neuron j. The
synaptic weight w_j0 (corresponding to the fixed input y_0 = -1) equals the threshold θ_j
applied to neuron j. Hence the function signal y_j(n) appearing at the output of neuron j
at iteration n is

y_j(n) = φ_j(v_j(n))    (6.5)
The back-propagation algorithm applies to w_ji(n) a correction Δw_ji(n) proportional to
the instantaneous gradient ∂ℰ(n)/∂w_ji(n). According to the chain rule, we may express
this gradient as

∂ℰ(n)/∂w_ji(n) = [∂ℰ(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] [∂v_j(n)/∂w_ji(n)]    (6.6)

Differentiating Eq. (6.2) with respect to e_j(n), we get

∂ℰ(n)/∂e_j(n) = e_j(n)    (6.7)

Since the error signal is e_j(n) = d_j(n) - y_j(n), we have

∂e_j(n)/∂y_j(n) = -1    (6.8)

Next, differentiating the function signal y_j(n) = φ_j(v_j(n)) with respect to v_j(n) yields

∂y_j(n)/∂v_j(n) = φ'_j(v_j(n))    (6.9)

where the use of prime (on the right-hand side) signifies differentiation with respect to
the argument. Finally, differentiating Eq. (6.4) with respect to w_ji(n) yields

∂v_j(n)/∂w_ji(n) = y_i(n)    (6.10)

Hence, the use of Eqs. (6.7) through (6.10) in (6.6) yields

∂ℰ(n)/∂w_ji(n) = -e_j(n) φ'_j(v_j(n)) y_i(n)    (6.11)

The correction Δw_ji(n) applied to w_ji(n) is defined by the delta rule

Δw_ji(n) = -η ∂ℰ(n)/∂w_ji(n)    (6.12)
where η is a constant that determines the rate of learning; it is called the learning-rate
parameter of the back-propagation algorithm. The use of the minus sign in Eq. (6.12)
accounts for gradient descent in weight space. Accordingly, the use of Eq. (6.11) in (6.12)
yields

Δw_ji(n) = η δ_j(n) y_i(n)    (6.13)

where the local gradient δ_j(n) is itself defined by

δ_j(n) = e_j(n) φ'_j(v_j(n))    (6.14)
6.3 / Derivation of the Back-Propagation Algorithm
The local gradient points to required changes in synaptic weights. According to Eq. (6.14),
the local gradient δ_j(n) for output neuron j is equal to the product of the corresponding
error signal e_j(n) and the derivative φ'_j(v_j(n)) of the associated activation function.
From Eqs. (6.13) and (6.14) we note that a key factor involved in the calculation of
the weight adjustment Δw_ji(n) is the error signal e_j(n) at the output of neuron j. In this
context, we may identify two distinct cases, depending on where in the network neuron
j is located. In case I, neuron j is an output node. This case is simple to handle, because
each output node of the network is supplied with a desired response of its own, making
it a straightforward matter to calculate the associated error signal. In case II, neuron j is
a hidden node. Even though hidden neurons are not directly accessible, they share responsi-
bility for any error made at the output of the network. The question, however, is to know
how to penalize or reward hidden neurons for their share of the responsibility. This
problem is indeed the credit-assignment problem considered in Section 2.6. It is solved
in an elegant fashion by back-propagating the error signals through the network.
In the sequel, cases I and II are considered in turn.
When neuron j is hidden, we may define the local gradient δ_j(n) as

δ_j(n) = -[∂ℰ(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)]
       = -[∂ℰ(n)/∂y_j(n)] φ'_j(v_j(n))    (6.15)

where, in the second line, we have made use of Eq. (6.9). To calculate the partial derivative
∂ℰ(n)/∂y_j(n), we may proceed as follows. From Fig. 6.4 we see that

ℰ(n) = (1/2) Σ_{k∈C} e_k²(n),    neuron k is an output node    (6.16)

which is a rewrite of Eq. (6.2) except for the use of index k in place of index j. We have
done so in order to avoid confusion with the use of index j that refers to a hidden neuron
FIGURE 6.4 Signal-flow graph highlighting the details of output neuron k connected to
hidden neuron j.
under case II. In any event, differentiating Eq. (6.16) with respect to the function signal
y_j(n), we get

∂ℰ(n)/∂y_j(n) = Σ_k e_k(n) [∂e_k(n)/∂y_j(n)]    (6.17)

Next, we use the chain rule for the partial derivative ∂e_k(n)/∂y_j(n), and thus rewrite Eq.
(6.17) in the equivalent form

∂ℰ(n)/∂y_j(n) = Σ_k e_k(n) [∂e_k(n)/∂v_k(n)] [∂v_k(n)/∂y_j(n)]    (6.18)

However, from Fig. 6.4 we note that

e_k(n) = d_k(n) - y_k(n) = d_k(n) - φ_k(v_k(n)),    neuron k is an output node    (6.19)

Hence

∂e_k(n)/∂v_k(n) = -φ'_k(v_k(n))    (6.20)
We also note from Fig. 6.4 that for neuron k, the net internal activity level is

v_k(n) = Σ_{j=0}^{q} w_kj(n) y_j(n)    (6.21)

where q is the total number of inputs (excluding the threshold) applied to neuron k. Here
again, the synaptic weight w_k0(n) is equal to the threshold θ_k(n) applied to neuron k, and
the corresponding input y_0 is fixed at the value -1. In any event, differentiating Eq. (6.21)
with respect to y_j(n) yields

∂v_k(n)/∂y_j(n) = w_kj(n)    (6.22)
Thus, using Eqs. (6.20) and (6.22) in (6.18), we get the desired partial derivative:

∂ℰ(n)/∂y_j(n) = -Σ_k e_k(n) φ'_k(v_k(n)) w_kj(n)
              = -Σ_k δ_k(n) w_kj(n)    (6.23)

where, in the second line, we have used the definition of the local gradient δ_k(n) given
in Eq. (6.14) with the index k substituted for j.
Finally, using Eq. (6.23) in (6.15), we get the local gradient δ_j(n) for hidden neuron
j, after rearranging terms, as follows:

δ_j(n) = φ'_j(v_j(n)) Σ_k δ_k(n) w_kj(n),    neuron j is hidden    (6.24)
The factor φ'_j(v_j(n)) involved in the computation of the local gradient δ_j(n) in Eq.
(6.24) depends solely on the activation function associated with hidden neuron j. The
remaining factor involved in this computation, namely, the summation over k, depends
on two sets of terms. The first set of terms, the δ_k(n), requires knowledge of the error
signals e_k(n) for all those neurons that lie in the layer to the immediate right of hidden
neuron j, and that are directly connected to neuron j; see Fig. 6.4. The second set of
terms, the w_kj(n), consists of the synaptic weights associated with these connections.
We may now summarize the relations that we have derived for the back-propagation
algorithm. First, the correction Δw_ji(n) applied to the synaptic weight connecting neuron
i to neuron j is defined by the delta rule:

(weight correction Δw_ji(n)) = (learning-rate parameter η) × (local gradient δ_j(n)) × (input signal of neuron j, y_i(n))    (6.25)
Second, the local gradient δ_j(n) depends on whether neuron j is an output node or a hidden
node:

1. If neuron j is an output node, δ_j(n) equals the product of the derivative φ'_j(v_j(n))
and the error signal e_j(n), both of which are associated with neuron j; see Eq. (6.14).
2. If neuron j is a hidden node, δ_j(n) equals the product of the associated derivative
φ'_j(v_j(n)) and the weighted sum of the δ's computed for the neurons in the next
hidden or output layer that are connected to neuron j; see Eq. (6.24).
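A toy numerical sketch of the delta rule and the hidden-neuron gradient of Eq. (6.24), using the logistic derivative; every signal, weight, and rate below is invented for illustration:

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

eta = 0.1                          # learning-rate parameter
y_i = np.array([1.0, 0.5])         # input signals of hidden neuron j
y_j = logistic(0.4)                # output of hidden neuron j (net activity v_j = 0.4)
w_kj = np.array([0.3, -0.2])       # weights from neuron j to the output neurons k
delta_k = np.array([0.05, -0.02])  # local gradients of the output neurons, Eq. (6.14)

# Hidden-neuron local gradient, Eq. (6.24), with the logistic derivative y(1 - y)
delta_j = y_j * (1.0 - y_j) * np.sum(delta_k * w_kj)

# Delta rule, Eq. (6.25): correction = rate parameter * local gradient * input signal
dw_ji = eta * delta_j * y_i
```

Note that the correction to each incoming weight of neuron j scales with the input signal carried by that weight.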
v_j(n) = Σ_{i=0}^{p} w_ji(n) y_i(n)    (6.27)

where p is the total number of inputs (excluding the threshold) applied to neuron j,
w_ji(n) is the synaptic weight connecting neuron i to neuron j, and y_i(n) is the input signal
of neuron j or, equivalently, the function signal appearing at the output of neuron i. If
neuron j is in the first hidden layer of the network, then the index i refers to the ith input
terminal of the network, for which we write
y_i(n) = x_i(n)    (6.28)

where x_i(n) is the ith element of the input vector (pattern). If, on the other hand, neuron
j is in the output layer of the network, the index j refers to the jth output terminal of the
network, for which we write

y_j(n) = o_j(n)    (6.29)

where o_j(n) is the jth element of the output vector (pattern). This output is compared with
the desired response d_j(n), obtaining the error signal e_j(n) for the jth output neuron. Thus
the forward phase of computation begins at the first hidden layer by presenting it with
the input vector, and terminates at the output layer by computing the error signal for each
neuron of this layer.
The backward pass, on the other hand, starts at the output layer by passing the error
signals leftward through the network, layer by layer, and recursively computing the δ
(i.e., the local gradient) for each neuron. This recursive process permits the synaptic
weights of the network to undergo changes in accordance with the delta rule of Eq. (6.25).
For a neuron located in the output layer, the δ is simply equal to the error signal of that
neuron multiplied by the first derivative of its nonlinearity. Hence we use Eq. (6.25) to
compute the changes to the weights of all the connections feeding into the output layer.
Given the δ's for the neurons of the output layer, we next use Eq. (6.24) to compute the
δ's for all the neurons in the penultimate layer and therefore the changes to the weights
of all connections feeding into it. The recursive computation is continued, layer by layer,
until the changes to all the synaptic weights of the network have been made.
Note that for the presentation of each training example, the input pattern is fixed
("clamped") throughout the round-trip process, encompassing the forward pass followed
by the backward pass.
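The backward recursion just described can be sketched for logistic units; the layer sizes, weights, and signals below are invented, and thresholds are omitted to keep the recursion visible:

```python
import numpy as np

def backward_deltas(e_out, ys, Ws):
    """Recursively compute the local gradients delta, layer by layer, moving leftward.

    e_out : error signals at the output layer
    ys    : function-signal vectors, one per computation layer (logistic units)
    Ws    : weight matrices, Ws[l][k, j] connecting layer l to layer l + 1
    """
    # Output layer: delta = e * y * (1 - y), per Eq. (6.33)
    deltas = [e_out * ys[-1] * (1.0 - ys[-1])]
    # Hidden layers: delta_j = y_j (1 - y_j) * sum_k delta_k w_kj, per Eq. (6.24)
    for y, W in zip(reversed(ys[:-1]), reversed(Ws)):
        deltas.append(y * (1.0 - y) * (W.T @ deltas[-1]))
    return deltas[::-1]   # ordered from first hidden layer to output layer

# Hypothetical one-hidden-layer example
y_h = np.array([0.6, 0.4])     # hidden-layer function signals
y_o = np.array([0.7])          # output-layer function signal
W = np.array([[0.5, -0.3]])    # weights from hidden to output neurons
e = np.array([0.2])            # output error signal
deltas = backward_deltas(e, [y_h, y_o], [W])
```

Once the δ's are available, the delta rule of Eq. (6.25) supplies the weight changes at every layer.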
Sigmoidal Nonlinearity
The computation of the δ for each neuron of the multilayer perceptron requires knowledge
of the derivative of the activation function φ(·) associated with that neuron. For this
derivative to exist, we require the function φ(·) to be continuous. In basic terms, differentiability
is the only requirement that an activation function would have to satisfy. An example
of a continuously differentiable nonlinear activation function commonly used in multilayer
perceptrons is the sigmoidal nonlinearity, a particular form of which is defined for neuron
j by the logistic function

y_j(n) = φ_j(v_j(n))
       = 1 / [1 + exp(-v_j(n))],    -∞ < v_j(n) < ∞    (6.30)
where v_j(n) is the net internal activity level of neuron j. According to this nonlinearity,
the amplitude of the output lies inside the range 0 ≤ y_j ≤ 1. Another type of sigmoidal
nonlinearity is the hyperbolic tangent, which is antisymmetric with respect to the origin
and for which the amplitude of the output lies inside the range -1 ≤ y_j ≤ +1. We will
have more to say on this latter form of nonlinearity in Section 6.7. For the time being,
we are going to concentrate on the logistic function of Eq. (6.30) as the source of
nonlinearity.
Differentiating both sides of Eq. (6.30) with respect to v_j(n), we get

φ'_j(v_j(n)) = exp(-v_j(n)) / [1 + exp(-v_j(n))]²    (6.31)

Using Eq. (6.30) to eliminate the exponential term exp(-v_j(n)) from Eq. (6.31), we may
express the derivative φ'_j(v_j(n)) as

φ'_j(v_j(n)) = y_j(n)[1 - y_j(n)]    (6.32)
For a neuron j located in the output layer, we note that y_j(n) = o_j(n). Hence, we may
express the local gradient for neuron j as

δ_j(n) = e_j(n) φ'_j(v_j(n))
       = [d_j(n) - o_j(n)] o_j(n)[1 - o_j(n)],    neuron j is an output node    (6.33)

where o_j(n) is the function signal at the output of neuron j, and d_j(n) is the desired response
for it. On the other hand, for an arbitrary hidden neuron j, we may express the local
gradient as

δ_j(n) = φ'_j(v_j(n)) Σ_k δ_k(n) w_kj(n)
       = y_j(n)[1 - y_j(n)] Σ_k δ_k(n) w_kj(n),    neuron j is a hidden node    (6.34)
Note from Eq. (6.32) that the derivative φ'_j(v_j(n)) attains its maximum value at
y_j(n) = 0.5, and its minimum value (zero) at y_j(n) = 0, or y_j(n) = 1.0. Since the amount
of change in a synaptic weight of the network is proportional to the derivative
φ'_j(v_j(n)), it follows that for a sigmoidal activation function the synaptic weights are
changed the most for those neurons in the network for which the function signals are in
their midrange. According to Rumelhart et al. (1986a), it is this feature of back-propagation
learning that contributes to its stability as a learning algorithm.
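This midrange property of Eq. (6.32) is easy to verify numerically, expressing the derivative directly in the output y:

```python
import numpy as np

y = np.linspace(0.0, 1.0, 101)
phi_prime = y * (1.0 - y)   # logistic derivative expressed in the output, Eq. (6.32)

# The derivative peaks where y = 0.5 (midrange function signals) and vanishes
# at y = 0 and y = 1, so weights feeding saturated neurons barely change.
y_at_peak = y[np.argmax(phi_prime)]
```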
Rate of Learning
The back-propagation algorithm provides an "approximation" to the trajectory in weight
space computed by the method of steepest descent. The smaller we make the learning-
rate parameter η, the smaller will the changes to the synaptic weights in the network be
from one iteration to the next and the smoother will be the trajectory in weight space.
This improvement, however, is attained at the cost of a slower rate of learning. If, on the
other hand, we make the learning-rate parameter η too large so as to speed up the rate
of learning, the resulting large changes in the synaptic weights assume such a form that
the network may become unstable (i.e., oscillatory). A simple method of increasing the
rate of learning and yet avoiding the danger of instability is to modify the delta rule of
Eq. (6.13) by including a momentum term, as shown by¹ (Rumelhart et al., 1986a)

Δw_ji(n) = α Δw_ji(n - 1) + η δ_j(n) y_i(n)    (6.35)
¹ For the special case of the LMS algorithm, which is a linear adaptive filtering algorithm, it has been shown
that use of the momentum constant α reduces the stable range of the learning-rate parameter η, and could thus
lead to instability if η is not adjusted appropriately. Moreover, the misadjustment increases with increasing α
(for details, see Roy and Shynk, 1990).
where α is usually a positive number called the momentum constant. It controls the
feedback loop acting around Δw_ji(n), as illustrated in Fig. 6.5, where z⁻¹ is the unit-delay
operator. Equation (6.35) is called the generalized delta rule²; it includes the delta rule
of Eq. (6.13) as a special case (i.e., α = 0).
In order to see the effect of the sequence of pattern presentations on the synaptic
weights due to the momentum constant α, we rewrite Eq. (6.35) as a time series with
index t. The index t goes from the initial time 0 to the current time n. Equation (6.35)
may be viewed as a first-order difference equation in the weight correction Δw_ji(n). Hence,
solving this equation for Δw_ji(n), we have

Δw_ji(n) = η Σ_{t=0}^{n} α^{n-t} δ_j(t) y_i(t)    (6.36)

which represents a time series of length n + 1. From Eqs. (6.11) and (6.14) we note that
the product δ_j(n) y_i(n) is equal to -∂ℰ(n)/∂w_ji(n). Accordingly, we may rewrite Eq. (6.36)
in the equivalent form

Δw_ji(n) = -η Σ_{t=0}^{n} α^{n-t} ∂ℰ(t)/∂w_ji(t)    (6.37)
Based on this relation, we may make the following insightful observations (Watrous,
1987; Jacobs, 1988; Goggin et al., 1989):

1. The current adjustment Δw_ji(n) represents the sum of an exponentially weighted
time series. For the time series to be convergent, the momentum constant must be restricted
to the range 0 ≤ |α| < 1. When α is zero, the back-propagation algorithm operates without
momentum. Note also that the momentum constant α can be positive or negative, although
it is unlikely that a negative α would be used in practice.
2. When the partial derivative ∂ℰ(t)/∂w_ji(t) has the same algebraic sign on consecutive
iterations, the exponentially weighted sum Δw_ji(n) grows in magnitude, and so the weight
w_ji(n) is adjusted by a large amount. Hence the inclusion of momentum in the back-
propagation algorithm tends to accelerate descent in steady downhill directions.
3. When the partial derivative ∂ℰ(t)/∂w_ji(t) has opposite signs on consecutive iterations,
the exponentially weighted sum Δw_ji(n) shrinks in magnitude, and so the weight w_ji(n) is
adjusted by a small amount. Hence the inclusion of momentum in the back-propagation
algorithm has a stabilizing effect in directions that oscillate in sign.
Thus, the incorporation of momentum in the back-propagation algorithm represents a
minor modification to the weight update, and yet it can have highly beneficial effects on
the learning behavior of the algorithm. The momentum term may also have the benefit of
² For a derivation of the back-propagation algorithm including the momentum constant from first principles,
see Hagiwara (1992).
preventing the learning process from terminating in a shallow local minimum on the error
surface.
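The equivalence of the recursive update of Eq. (6.35) and the exponentially weighted sum of Eq. (6.36) can be checked numerically; the sequence of gradient products below is invented:

```python
import numpy as np

eta, alpha = 0.1, 0.9
g = np.array([0.5, 0.4, 0.45, -0.1])   # hypothetical products delta_j(t) * y_i(t)

# Recursive form, Eq. (6.35): dw(n) = alpha * dw(n-1) + eta * delta_j(n) * y_i(n)
dw = 0.0
for gt in g:
    dw = alpha * dw + eta * gt

# Closed-form exponentially weighted sum, Eq. (6.36)
n = len(g) - 1
dw_sum = eta * sum(alpha**(n - t) * g[t] for t in range(n + 1))

# The two forms agree; with 0 <= |alpha| < 1 the weighting decays and the
# series is convergent, as noted in observation 1 above.
```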
in further adjustments to the synaptic weights and threshold levels. This process is contin-
ued until the last example [x(N), d(N)] in the epoch is accounted for. Let Δw_ji(n) denote
the change applied to synaptic weight w_ji after the presentation of pattern n. Then the net
weight change Δŵ_ji, averaged over the entire training set of N patterns, is given by

Δŵ_ji = (1/N) Σ_{n=1}^{N} Δw_ji(n)
      = -(η/N) Σ_{n=1}^{N} ∂ℰ(n)/∂w_ji
      = -(η/N) Σ_{n=1}^{N} e_j(n) [∂e_j(n)/∂w_ji]    (6.38)

where in the second and third lines we have made use of Eqs. (6.12) and (6.2), respectively.
2. Batch Mode. In the batch mode of back-propagation learning, weight updating is
performed after the presentation of all the training examples that constitute an epoch. For
a particular epoch, we define the cost function as the average squared error of Eqs. (6.2)
and (6.3), reproduced here in the composite form:

ℰ_av = (1/2N) Σ_{n=1}^{N} Σ_{j∈C} e_j²(n)    (6.39)

where the error signal e_j(n) pertains to output neuron j for training example n, as defined
by Eq. (6.1). The error e_j(n) equals the difference between d_j(n) and y_j(n), which
represent the jth element of the desired response vector d(n) and the corresponding value
of the network output, respectively. In Eq. (6.39) the inner summation with respect to j
is performed over all the neurons in the output layer of the network, whereas the outer
summation with respect to n is performed over the entire training set in the epoch at hand.
For a learning-rate parameter η, the adjustment applied to synaptic weight w_ji, connecting
neuron i to neuron j, is defined by the delta rule

Δw_ji = -η ∂ℰ_av/∂w_ji
      = -(η/N) Σ_{n=1}^{N} e_j(n) [∂e_j(n)/∂w_ji]    (6.40)

To calculate the partial derivative ∂e_j(n)/∂w_ji we proceed in the same way as before.
According to Eq. (6.40), in the batch mode the weight adjustment Δw_ji is made only after
the entire training set has been presented to the network.
Comparing Eqs. (6.38) and (6.40), we clearly see that the average weight change Δŵ_ji
made in the pattern mode of training is different from the corresponding value Δw_ji made
in the batch mode, presumably for the same reduction in the average squared error
that results from the presentation of the entire training set. Indeed, Δŵ_ji for the pattern-
to-pattern mode represents an estimate of Δw_ji for the batch mode.
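The contrast between the two modes can be sketched with a single linear neuron standing in for the multilayer perceptron; the data, learning rate, and function names are invented for illustration:

```python
import numpy as np

def pattern_mode_epoch(w, X, d, eta):
    """Update the weight vector after every pattern (stochastic, less storage)."""
    for x, t in zip(X, d):
        e = t - w @ x           # error of a single linear output neuron
        w = w + eta * e * x     # delta rule applied per pattern
    return w

def batch_mode_epoch(w, X, d, eta):
    """Accumulate errors over the epoch, then apply one update, as in Eq. (6.40)."""
    e = d - X @ w
    return w + eta * (X.T @ e) / len(X)

# Deterministic toy epoch: two patterns, two weights
w0 = np.zeros(2)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([1.0, 1.0])
w_pattern = pattern_mode_epoch(w0, X, d, eta=0.5)
w_batch = batch_mode_epoch(w0, X, d, eta=0.5)
```

Even on this tiny epoch the two modes land on different weight vectors, which is the point of the comparison in the text.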
From an “on-line” operational point of view, the pattern mode of training is preferred
over the batch mode, because it requires less local storage for each synaptic connection.
Moreover, given that the patterns are presented to the network in a random manner, the
use of pattern-by-pattern updating of weights makes the search in weight space stochastic
in nature, which, in turn, makes it less likely for the back-propagation algorithm to be
trapped in a local minimum. On the other hand, the use of batch mode of training provides
a more accurate estimate of the gradient vector. In the final analysis, however, the relative
effectiveness of the two training modes depends on the problem at hand (Hertz et al.,
1991).

6.4 / Summary of the Back-Propagation Algorithm
Stopping Criteria
The back-propagation algorithm cannot, in general, be shown to converge, nor are there
well-defined criteria for stopping its operation. Rather, there are some reasonable criteria,
each with its own practical merit, which may be used to terminate the weight adjustments.
To formulate such a criterion, the logical thing to do is to think in terms of the unique
properties of a local or global minimum of the error surface. Let the weight vector w*
denote a minimum, be it local or global. A necessary condition for w* to be a minimum
is that the gradient vector g(w) (i.e., first-order partial derivative) of the error surface with
respect to the weight vector w be zero at w = w*. Accordingly, we may formulate a
sensible convergence criterion for back-propagation learning as follows (Kramer and
Sangiovanni-Vincentelli, 1989):
■ The back-propagation algorithm is considered to have converged when the Euclidean
norm of the gradient vector reaches a sufficiently small gradient threshold.
The drawback of this convergence criterion is that, for successful trials, learning times
may be long. Also, it requires the computation of the gradient vector g(w).
Another unique property of a minimum that we can use is the fact that the cost function
or error measure ℰ_av(w) is stationary at the point w = w*. We may therefore suggest a
different criterion of convergence:

■ The back-propagation algorithm is considered to have converged when the absolute
rate of change in the average squared error per epoch is sufficiently small.
Typically, the rate of change in the average squared error is considered to be small enough
if it lies in the range of 0.1 to 1 percent per epoch; sometimes, a value as small as 0.01
percent per epoch is used.
A variation of this second criterion for convergence of the algorithm is to require that
the maximum value of the average squared error ℰ_av(w) be equal to or less than a sufficiently
small threshold. Kramer and Sangiovanni-Vincentelli (1989) suggest a hybrid criterion
of convergence consisting of this latter threshold and a gradient threshold, as stated here:

■ The back-propagation algorithm is terminated at the weight vector w_final when
‖g(w_final)‖ ≤ ε, where ε is a sufficiently small gradient threshold, or ℰ_av(w_final) ≤ τ,
where τ is a sufficiently small error energy threshold.
Another useful criterion for convergence is as follows. After each learning iteration,
the network is tested for its generalization performance. The learning process is stopped
when the generalization performance is adequate, or when it is apparent that the generaliza-
tion performance has peaked; see Section 6.11 for more details.
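The gradient-norm and rate-of-change criteria above may be combined in a small helper; the function name and default thresholds are placeholders of our own, not values recommended by the text:

```python
import numpy as np

def should_stop(grad, E_av_history, eps=1e-3, min_rel_change=1e-3):
    """Combine two of the convergence tests described above.

    grad         : current gradient vector g(w)
    E_av_history : average squared error per epoch, most recent value last
    """
    # Criterion 1: Euclidean norm of the gradient below a small threshold
    if np.linalg.norm(grad) <= eps:
        return True
    # Criterion 2: absolute rate of change of E_av per epoch sufficiently small
    # (the text cites 0.1 to 1 percent per epoch as typical)
    if len(E_av_history) >= 2:
        prev, curr = E_av_history[-2], E_av_history[-1]
        if abs(prev - curr) <= min_rel_change * prev:
            return True
    return False
```

A generalization-based stopping rule, as in the last criterion of the text, would test held-out performance instead of these training-set quantities.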
FIGURE 6.6 [Architectural layout of the back-propagation algorithm: the lower part
shows the sensitivity network (back-propagation phase), fed by the error signals
e1, e2, . . . and delivering the local gradients δ(1), δ(2), δ(3).]
The layer index l extends from the input layer (l = 0) to the output layer (l = L); in Fig.
6.6 we have L = 3; we refer to L as the depth of the network. The lower part of the
figure accounts for the backward phase, which is referred to as a sensitivity network for
computing the local gradients in the back-propagation algorithm.
While the network of Fig. 6.6 is merely an architectural layout of the back-propagation
algorithm, it is found to have substantial advantages in dynamic situations where the
algorithmic representation becomes cumbersome (Narendra and Parthasarathy, 1990).
Earlier we mentioned that the pattern-by-pattern updating of weights is the preferred
method for on-line implementation of the back-propagation algorithm. For this mode of
operation, the algorithm cycles through the training data {[x(n),d(n)]; n = 1, 2, . . . , N }
as follows.
1. Initialization. Start with a reasonable network configuration, and set all the synaptic
weights and threshold levels of the network to small random numbers that are uniformly
distributed.
2. Presentations of Training Examples. Present the network with an epoch of training
examples. For each example in the set ordered in some fashion, perform the following
sequence of forward and backward computations under points 3 and 4, respectively.
3. Forward Computation. Let a training example in the epoch be denoted by [x(n),d(n)],
with the input vector x(n) applied to the input layer of sensory nodes and the desired
response vector d(n) presented to the output layer of computation nodes. Compute the
activation potentials and function signals of the network by proceeding forward through
the network, layer by layer. The net internal activity level v_j^(l)(n) for neuron j in layer l is

v_j^(l)(n) = Σ_i w_ji^(l)(n) y_i^(l-1)(n)

where y_i^(l-1)(n) is the function signal of neuron i in the previous layer l − 1 at iteration
n and w_ji^(l)(n) is the synaptic weight of neuron j in layer l that is fed from neuron i in layer
l − 1. For i = 0, we have y_0^(l-1)(n) = −1 and w_j0^(l)(n) = θ_j^(l)(n), where θ_j^(l)(n) is the threshold
applied to neuron j in layer l. Assuming the use of a logistic function for the sigmoidal
nonlinearity, the function (output) signal of neuron j in layer l is

y_j^(l)(n) = 1 / [1 + exp(−v_j^(l)(n))]

If neuron j is in the first hidden layer (i.e., l = 1), set

y_j^(0)(n) = x_j(n)

where x_j(n) is the jth element of the input vector x(n). If neuron j is in the output layer
(i.e., l = L), set

y_j^(L)(n) = o_j(n)
Note that this form of initialization is different from that used for the LMS algorithm.
156 6 / Multilayer Perceptrons
4. Backward Computation. Compute the local gradients of the network by proceeding
backward, layer by layer: for neuron j in the output layer L,

δ_j^(L)(n) = e_j(n) o_j(n)[1 − o_j(n)]

and for neuron j in hidden layer l,

δ_j^(l)(n) = y_j^(l)(n)[1 − y_j^(l)(n)] Σ_k δ_k^(l+1)(n) w_kj^(l+1)(n)

Hence, adjust the synaptic weights of the network in layer l according to the generalized
delta rule:

w_ji^(l)(n + 1) = w_ji^(l)(n) + α[w_ji^(l)(n) − w_ji^(l)(n − 1)] + η δ_j^(l)(n) y_i^(l-1)(n)

where η is the learning-rate parameter and α is the momentum constant.
5. Iteration. Iterate the computation by presenting new epochs of training examples
to the network until the free parameters of the network stabilize their values and the
average squared error ℰ_av computed over the entire training set is at a minimum or
acceptably small value. The order of presentation of training examples should be random-
ized from epoch to epoch. The momentum and the learning-rate parameter are typically
adjusted (and usually decreased) as the number of training iterations increases.
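Steps 1 through 5 can be sketched in code for the case of a single hidden layer with logistic units. This is a minimal illustration, not a definitive implementation: the layer sizes, learning rate η = 0.1, momentum α = 0.5, and the ±0.5 initialization range are assumed choices, and the threshold is folded in as a weight on a fixed input of −1, following the convention above.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

class MLP:
    """One-hidden-layer perceptron trained pattern by pattern."""

    def __init__(self, n_in, n_hidden, n_out, eta=0.1, alpha=0.5):
        # Step 1: small uniformly distributed random weights; the extra
        # column holds the threshold, fed by the fixed input y_0 = -1.
        self.W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in + 1))
        self.W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden + 1))
        self.dW1 = np.zeros_like(self.W1)   # previous updates, for momentum
        self.dW2 = np.zeros_like(self.W2)
        self.eta, self.alpha = eta, alpha

    def forward(self, x):
        # Step 3: forward computation, layer by layer
        y0 = np.concatenate(([-1.0], x))
        y1 = logistic(self.W1 @ y0)
        y1b = np.concatenate(([-1.0], y1))
        y2 = logistic(self.W2 @ y1b)
        return y0, y1b, y2

    def train_pattern(self, x, d):
        # Step 4: backward computation and generalized delta rule
        y0, y1b, y2 = self.forward(x)
        e = d - y2                           # error signal
        delta2 = e * y2 * (1.0 - y2)         # local gradients, output layer
        delta1 = (y1b * (1.0 - y1b) * (self.W2.T @ delta2))[1:]  # hidden layer
        # w(n+1) = w(n) + alpha*[w(n) - w(n-1)] + eta*delta*y
        self.dW2 = self.alpha * self.dW2 + self.eta * np.outer(delta2, y1b)
        self.dW1 = self.alpha * self.dW1 + self.eta * np.outer(delta1, y0)
        self.W2 += self.dW2
        self.W1 += self.dW1
        return float(0.5 * e @ e)            # instantaneous squared error

def train(net, X, D, epochs=1000):
    # Step 5: iterate over epochs, shuffling the presentation order each time
    for _ in range(epochs):
        order = rng.permutation(len(X))
        err = sum(net.train_pattern(X[i], D[i]) for i in order) / len(X)
    return err
```

The momentum term is implemented by remembering the previous total update, since w(n) − w(n − 1) is exactly that update.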
6.5 Initialization
The first step in back-propagation learning is, of course, to initialize the network. A good
choice for the initial values of the free parameters (i.e., adjustable synaptic weights and
threshold levels) of the network can be of tremendous help in a successful network design.
In cases where prior information is available, it may be better to use the prior information
to guess the initial values of the free parameters. But how do we initialize the network
if no prior information is available? The customary practice is to set all the free parameters
of the network to random numbers that are uniformly distributed inside a small range of
values, as stated in the summary presented in Section 6.4.
The wrong choice of initial weights can lead to a phenomenon known as premature
saturation (Lee et al., 1991). This phenomenon refers to a situation where the instantaneous
sum of squared errors ℰ(n) remains almost constant for some period of time during the
learning process. Such a phenomenon cannot be considered as a local minimum, because
the squared error continues to decrease after this period is finished. (In more direct terms,
the premature saturation phenomenon corresponds to a “saddle point” in the error surface.)
When a training pattern is applied to the input layer of a multilayer perceptron, the
output values of the network are calculated through a sequence of forward computations
that involves inner products and sigmoidal transformations. This is followed by a sequence
of backward computations that involves the calculation of error signals and pertinent slope
of the sigmoidal activation function, and culminates in synaptic weight adjustments.
Suppose that, for a particular training pattern, the net internal activity level of an output
neuron (computation node) is computed to have a large magnitude. Then, assuming that
the sigmoidal activation function of the neuron has the limiting values −1 and +1, we
find that the corresponding slope of the activation function for that neuron will be very
small, and the output value for the neuron will be close to −1 or +1. In such a situation,
we say that the neuron is in “saturation.” If the output value is close to +1 when the
target value (desired response) is −1, or vice versa, we say that the neuron is “incorrectly
saturated.” When this happens, the adjustment applied to the synaptic weights of the
neuron will be small (even though the magnitude of the associated error signal is large),
and the network may take a long time to escape from this condition (Lee et al., 1991).
At the initial stage of back-propagation learning, depending on the prevalent conditions,
both unsaturated neurons and incorrectly saturated ones may exist in the output layer of
the network. As the learning process continues, the synaptic weights associated with the
unsaturated output neurons change rapidly, because the corresponding error signals and
gradients have relatively large magnitudes, thereby resulting in a reduction in the instanta-
neous sum of squared errors ℰ(n). If, however, at this point in time the incorrectly saturated
output neurons remain saturated for some particular training patterns, then the phenomenon
of premature saturation may arise with ℰ(n) remaining essentially constant.
In Lee et al. (1991), a formula for the probability of premature saturation in back-
propagation learning has been derived for the batch mode of updating, and it has been
verified using computer simulation. The essence of this formula may be summarized as
follows:
1. Incorrect saturation is avoided by choosing the initial values of the synaptic weights
and threshold levels of the network to be uniformly distributed inside a small range
of values.
2. Incorrect saturation is less likely to occur when the number of hidden neurons is
maintained low, consistent with a satisfactory operation of the network.
3. Incorrect saturation rarely occurs when the neurons of the network operate in their
linear regions.
For pattern-by-pattern updating, computer simulation results show similar trends to the
batch mode of operation referred to herein.
Russo (1991) recommends an empirical formula for the initial size of the weights that
helps avoid the saturation of the neurons. This formula is described under point 3 of
Section 6.7.
and

1 XOR 0 = 1
We first recognize that the use of a single neuron with two inputs results in a straight
line for a decision boundary in the input space. For all points on one side of this line, the
neuron outputs 1; for all points on the other side of the line, it outputs 0. The position
and orientation of the line in the input space are determined by the synaptic weights of
the neuron connected to the input nodes, and the threshold applied to the neuron. With
the input patterns (0,0) and (1,1) located on opposite corners of the unit square, and
likewise for the other two input patterns (0,1) and (1,0), it is clear that we cannot construct
a straight line for a decision boundary so that (0,0) and (1,1) lie in one decision region,
and (0,1) and (1,0) lie in the other decision region. In other words, an elementary perceptron
cannot solve the XOR problem.
We may solve the XOR problem by using a single hidden layer with two neurons, as
in Fig. 6.7a (Touretzky and Pomerleau, 1989). The signal-flow graph of the network is
shown in Fig. 6.7b. The following assumptions are made here:
Each neuron is represented by a McCulloch-Pitts model.
Bits 0 and 1 are represented by the levels 0 and +1, respectively.
The top neuron, labeled 1, in the hidden layer is characterized as follows:

w_11 = w_12 = +1

θ_1 = +3/2

The slope of the decision boundary constructed by this hidden neuron is equal to −1,
and positioned as in Fig. 6.8a. The bottom neuron, labeled 2 in the hidden layer, is
FIGURE 6.7 (a) Architectural graph of network for solving the XOR problem. (b) Signal-
flow graph of the network.
FIGURE 6.8 (a) Decision boundary constructed by hidden neuron 1 of the network in
Fig. 6.7. (b) Decision boundary constructed by hidden neuron 2 of the network. (c)
Decision boundaries constructed by the complete network.
characterized as follows:

w_21 = w_22 = +1

θ_2 = +1/2

The orientation and position of the decision boundary constructed by this second hidden
neuron is as shown in Fig. 6.8b.
The output neuron, labeled 3 in Fig. 6.7a, is characterized as follows:

w_31 = −2

w_32 = +1

θ_3 = +1/2
The function of the output neuron is to construct a linear combination of the decision
boundaries formed by the two hidden neurons. The result of this computation is shown
in Fig. 6.8c. The bottom hidden neuron has an excitatory (positive) connection to the output
neuron, whereas the top hidden neuron has a stronger inhibitory (negative) connection to
the output neuron. When both hidden neurons are off, which occurs when the input pattern
is (0,0), the output neuron remains off. When both hidden neurons are on, which occurs
when the input pattern is (1,1), the output neuron is switched off again, because the
inhibitory effect of the larger negative weight connected to the top hidden neuron overpow-
ers the excitatory effect of the positive weight connected to the bottom hidden neuron.
When the top hidden neuron is off and the bottom hidden neuron is on, which occurs
when the input pattern is (0,1) or (1,0), the output neuron is switched on due to the
excitatory effect of the positive weight connected to the bottom hidden neuron. Thus, the
network of Fig. 6.7a does indeed solve the XOR problem.
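The network of Fig. 6.7a can be verified numerically. The sketch below uses hard-threshold (McCulloch-Pitts) neurons with the weights and thresholds given above; the function names are illustrative only.

```python
def mcp(x, w, theta):
    """McCulloch-Pitts neuron: hard threshold on the activity v = w.x - theta."""
    v = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    # Hidden neuron 1: w_11 = w_12 = +1, theta_1 = 3/2 (fires only on (1,1))
    y1 = mcp((x1, x2), (1, 1), 1.5)
    # Hidden neuron 2: w_21 = w_22 = +1, theta_2 = 1/2 (fires on any input with a 1)
    y2 = mcp((x1, x2), (1, 1), 0.5)
    # Output neuron 3: w_31 = -2, w_32 = +1, theta_3 = 1/2
    return mcp((y1, y2), (-2, 1), 0.5)

[xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]  # → [0, 1, 1, 0]
```

The stronger inhibitory weight −2 from the top hidden neuron is what switches the output off again for the input (1,1).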
φ(v) = a tanh(bv)
     = a [1 − exp(−2bv)] / [1 + exp(−2bv)]
     = 2a / [1 + exp(−2bv)] − a     (6.41)

Accordingly, the modifications made in the formulation of the back-propagation algorithm
using this form of sigmoidal nonlinearity are of a minor nature; see Problem 6.9.
Suitable values for the constants a and b are (Guyon, 1991)

a = 1.716

and

b = 2/3
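The equivalence between the hyperbolic-tangent and logistic forms of this sigmoid, and the convenient property that these constants give φ(1) = 1 and φ(−1) = −1, are easy to check numerically (a sketch; note the identity a tanh(bv) = 2a/(1 + exp(−2bv)) − a):

```python
import math

a, b = 1.716, 2.0 / 3.0   # constants suggested by Guyon (1991)

def phi(v):
    """Antisymmetric sigmoid phi(v) = a*tanh(b*v)."""
    return a * math.tanh(b * v)

def phi_logistic_form(v):
    """Algebraically equivalent logistic form 2a/(1 + exp(-2bv)) - a."""
    return 2.0 * a / (1.0 + math.exp(-2.0 * b * v)) - a
```

With a = 1.716 and b = 2/3, φ(1) is within about 10⁻⁴ of 1, which is why target values of ±1 sit comfortably inside the function's range.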
2. It is important that the target values (desired response) are chosen within the range
of the sigmoidal activation function. More specifically, the desired response d_j for neuron
j in the output layer of the multilayer perceptron should be offset by some amount ε away
from the limiting value of the sigmoidal activation function, depending on whether the
limiting value is positive or negative. Otherwise, the back-propagation algorithm tends
to drive the free parameters of the network to infinity, and thereby slow down the learning
process by orders of magnitude. To be specific, consider the asymmetric activation function
of Fig. 6.9a. For the limiting value +a, we set

d_j = a − ε

where ε is an appropriate positive constant. For the choice of a = 1.716 referred to earlier,
we may set ε = 0.716, in which case the target value (desired response) d_j can be chosen
conveniently to be ±1.

FIGURE 6.9 (a) Asymmetric activation (hyperbolic) function. (b) Nonsymmetric
activation (logistic) function.
3. The initialization of the synaptic weights and threshold levels of the network should
be uniformly distributed inside a small range. The reason for making the range small is
to reduce the likelihood of the neurons in the network saturating and producing small
error gradients. However, the range should not be made too small, as it can cause the
error gradients to be very small and the learning therefore to be initially very slow. For
an asymmetric activation function in the form of a hyperbolic tangent for which the
constants a and b are specified above, a possible choice for initialization is to pick random
values for the synaptic weights and threshold levels that are uniformly distributed inside
the range

(−2.4/F_i, +2.4/F_i)

where F_i is the fan-in (i.e., the total number of inputs) of neuron i in the network; in other
words, the weight initialization is done on a neuron-by-neuron basis.
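The fan-in rule can be sketched as follows. The constant 2.4 is the value commonly quoted for the hyperbolic-tangent constants above; whether the threshold input counts toward F_i is an implementation choice here, and the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(fan_in, n_neurons, a=2.4):
    """Initialize one layer's weights uniformly in (-a/F_i, +a/F_i),
    where F_i is the fan-in of each neuron in the layer.
    One extra column holds the threshold, treated as a weight on a
    fixed input of -1 (see the algorithm summary above)."""
    limit = a / fan_in
    return rng.uniform(-limit, limit, size=(n_neurons, fan_in + 1))
```

Because the bound shrinks with fan-in, neurons with many inputs start with proportionally smaller weights, keeping their initial activity levels in the linear region of the sigmoid.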
4. All neurons in the multilayer perceptron should desirably learn at the same rate.
Typically, the last layers tend to have larger local gradients than the layers at the front
end of the network. Hence, the learning-rate parameter η should be assigned a smaller
value in the last layers than in the front-end layers. Neurons with many inputs should have
a smaller learning-rate parameter than neurons with few inputs.
5. For on-line operation, pattern-by-pattern updating rather than batch updating should
be used for weight adjustments. For pattern-classification problems involving a large and
redundant database, pattern-by-pattern updating tends to be orders of magnitude faster
than batch updating. However, pattern-by-pattern updating is harder to parallelize.
6. The order in which the training examples are presented to the network should be
randomized (shuffled) from one epoch to the next. This form of randomization is critical
for improving the speed of convergence. Also, the use of queries may improve the training
efficiency (Baum, 1991).
7. Learning from a set of training examples deals with an unknown input-output
mapping function f(·). In effect, the learning process exploits the information contained
in the examples about the function f(·) to infer an approximate implementation of it. The
process of learning from examples may be generalized to include learning from hints,
which is achieved by allowing prior information that we may have about the function
f(·) to be included in the learning process (Abu-Mostafa, 1990; Al-Mashouq and Reed,
1991). Such information, for example, may include invariance properties, symmetries, or
any other knowledge about the function f(·) that may be used to accelerate the search for
its approximate realization; the idea of invariances was discussed in Section 1.7.
by C_k. Let y_k,j be the kth output of the network produced in response to the prototype x_j,
as shown by

y_k,j = F_k(x_j)     (6.42)

where the function F_k(·) defines the mapping learned by the network from the input to
the kth output. For convenience of presentation, let

y_j = F(x_j)     (6.43)

where F(·) is a vector-valued function. A basic question that we wish to address in this
section is the following:
After the training of a multilayer perceptron, what should the optimum decision
rule be for classifying the m outputs of the network?

Clearly, any reasonable output decision rule ought to be based on knowledge of the vector-
valued function F(·).
In general, all that is certain about the vector-valued function F(·) is that it is a continuous
function that minimizes the mean-squared error defined as the value of the cost (loss)
functional

L(F) = (1/2N) Σ_{j=1}^{N} ‖d_j − F(x_j)‖²     (6.45)

where d_j is the desired (target) output pattern for the prototype x_j, ‖·‖ is the Euclidean
norm of the enclosed vector, and N is the total number of input-output patterns presented
to the network in training. The essence of the mean-squared error criterion of Eq. (6.45)
is the same as the cost function of Eq. (6.3). The vector-valued function F(·) is strongly
dependent on the choice of input-output pairs (x_j, d_j) used to train the network, so that
different values of (x_j, d_j) will indeed lead to different vector-valued functions F(·). Note
that the terminology (x_j, d_j) used here is the same as that of [x(j), d(j)] used previously.
Suppose now that the network is trained with binary target values (which incidentally
correspond to the upper and lower bounds on the network outputs when using the logistic
function), written as follows: the kth element of d_j equals 1 when the prototype x_j belongs
to class C_k, and 0 when it does not.
It is tempting to suppose that a multilayer perceptron classifier trained with the back-
propagation algorithm on a finite set of independently and identically distributed (i.i.d.)
examples may lead to an asymptotic approximation of the underlying a posteriori class
probabilities. This property may be justified on the following grounds (White, 1989;
Richard and Lippmann, 1991):

The law of large numbers is invoked to show that, as the size of the training set,
N, approaches infinity, the weight vector w that minimizes the cost functional L(F) of
Eq. (6.45) approaches the optimum weight vector w* that minimizes the expectation
of the random quantity ‖d − F(w,x)‖², where d is the desired response vector and
F(w,x) is the approximation realized by a multilayer perceptron with weight vector
w and vector x as input (White, 1989). The function F(w,x), showing explicit
dependence on the weight vector w, is the same as F(x) used previously.

The optimum weight vector w* has the property that the corresponding vector of
actual network outputs, F(w*,x), is a mean-squared-error-minimizing approximation
to the conditional expectation E[d|x] (White, 1989); this issue was discussed in
Chapter 2.

For a 1-of-m pattern classification problem, the kth element of the desired response
vector d equals one if the input vector belongs to class C_k and zero otherwise. Under
this condition, the conditional expectation E[d_k|x] equals the a posteriori class
probability P(C_k|x), k = 1, 2, . . . , m (Richard and Lippmann, 1991).
It follows therefore that a multilayer perceptron classifier (using the logistic function for
nonlinearity) does indeed approximate the a posteriori class probabilities, provided that
the size of the training set is large enough and the back-propagation learning process does
not get stuck at a local minimum. Accordingly, we may proceed to answer the question
we posed earlier. Specifically, we may say that the most appropriate output decision
rule is

Classify the vector x as belonging to class C_k if F_k(x) > F_j(x) for all j ≠ k.     (6.47)

A unique largest output value exists with probability one⁴ when the underlying posterior
class distributions are distinct. Hence this decision rule has the advantage of rendering
unambiguous decisions over the common ad hoc rule of selecting class membership based
on the concept of output “firing,” that is, the vector x is assigned membership in a
particular class if the corresponding output value is greater than some fixed threshold
(usually, 0.5 for the logistic form of activation function), which can lead to multiple class
assignments.
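The difference between the two rules is easy to demonstrate. A minimal sketch (the function names are illustrative, not from the text):

```python
import numpy as np

def classify_argmax(outputs):
    """Largest-output rule: assign x to the class whose network output
    is largest -- always a single, unambiguous decision."""
    return int(np.argmax(outputs))

def classify_threshold(outputs, thresh=0.5):
    """Ad hoc 'firing' rule: every class whose output exceeds the
    threshold is assigned -- may yield zero or multiple classes."""
    return [k for k, y in enumerate(outputs) if y > thresh]

y = np.array([0.62, 0.55, 0.08])
classify_argmax(y)       # → 0 (single class)
classify_threshold(y)    # → [0, 1] (ambiguous: two classes fire)
```

The thresholding rule can also assign no class at all when every output falls below the threshold, whereas the largest-output rule cannot.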
In Section 6.6 we pointed out that the binary target values [0,1], corresponding to the
logistic function of Eq. (6.30), are perturbed by a small amount ε as a practical measure,
to avoid the saturation of synaptic weights (due to finite numerical precision) during
training of the network. As a result of this perturbation, the target values are now nonbinary,
and the asymptotic approximations F_k(x) are no longer exactly the a posteriori probabilities
P(C_k|x) of the m classes of interest (Hampshire and Pearlmutter, 1990). Instead, P(C_k|x)
is linearly mapped to the closed interval [ε, 1 − ε], such that P(C_k|x) = 0 is mapped to
an output of ε and P(C_k|x) = 1 is mapped to an output of 1 − ε. Because this linear
mapping preserves relative ordering, it does not affect the result of applying the output
decision rule of Eq. (6.47).
It is also of interest to note that when a decision boundary is formed by thresholding
the outputs of a multilayer perceptron against some fixed values, the overall shape and
orientation of the decision boundary may be explained heuristically (for the case of a
single hidden layer) in terms of the number of hidden neurons and the ratios of synaptic
weights connected to them (Lui, 1989). Such an analysis, however, is not applicable to
a decision boundary formed in accordance with the output decision rule of Eq. (6.47). A
more appropriate approach is to consider the hidden neurons as nonlinear feature detectors
that attempt to map classes from the original input space Rp, where the classes may not
be linearly separable, into the space of hidden-layer activations, where it is more likely
for them to be linearly separable (Yee, 1992).
f(x|C_1) = (1/(2πσ_1²)) exp(−‖x − μ_1‖²/(2σ_1²))     (6.48)

where

μ_1 = mean vector = [0, 0]^T
σ_1² = variance = 1

⁴ Here it is assumed that infinite-precision arithmetic is used; ties are possible with finite precision.
f(x|C_2) = (1/(2πσ_2²)) exp(−‖x − μ_2‖²/(2σ_2²))     (6.49)

where

μ_2 = [2, 0]^T
σ_2² = 4

The two classes are assumed to be equally likely; that is,

P(C_1) = P(C_2) = 0.5
Figure 6.11a shows three-dimensional plots of the two Gaussian distributions defined by
Eqs. (6.48) and (6.49). Figure 6.11b shows individual scatter diagrams for classes C_1
and C_2 and the joint scatter diagram representing the superposition of scatter plots of 540
points taken from each of the two processes. This latter diagram shows clearly that the
two distributions overlap each other significantly, indicating that there is a significant
probability of misclassification.

FIGURE 6.11a Top: the probability density function f(x|C_1); bottom: the probability
density function f(x|C_2).
FIGURE 6.11b Top: scatter plot of class C_1; middle: scatter plot of class C_2; bottom:
joint scatter plot of classes C_1 and C_2 (* = C_1, o = C_2).
The Bayes criterion for minimum probability of error leads to a likelihood ratio test.
For the problem at hand, the likelihood ratio is

Λ(x) = f(x|C_1)/f(x|C_2)     (6.50)

and the threshold of the test is

λ = P(C_2)/P(C_1)     (6.51)

Since the two classes are equally likely, we have

λ = 1

The optimum (Bayesian) decision boundary is therefore defined by

Λ(x) = λ     (6.52)

or, equivalently,

σ_2²‖x − μ_1‖² − σ_1²‖x − μ_2‖² = 2σ_1²σ_2² ln(σ_2²/σ_1²)     (6.53)

which may be rewritten in the form

‖x − x_c‖² = r²     (6.54)

where

x_c = (σ_2²μ_1 − σ_1²μ_2)/(σ_2² − σ_1²)     (6.55)

and

r² = (σ_1²σ_2²/(σ_2² − σ_1²)) [‖μ_1 − μ_2‖²/(σ_2² − σ_1²) + 2 ln(σ_2²/σ_1²)]     (6.56)

Equation (6.54) represents a circle with center x_c and radius r. Let Ω_1 define the region
lying inside this circle. Accordingly, the Bayesian classification rule for the problem at
hand may be stated as follows:

Classify the observation vector x as belonging to class C_1 if x ∈ Ω_1, and to class
C_2 otherwise.

For the particular parameters of this experiment, we have a circular decision boundary
whose center is located at

x_c = [−2/3, 0]^T
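The circular Bayes boundary for this experiment can be checked numerically. The closed-form center and radius below follow from the standard likelihood-ratio derivation for two circularly symmetric Gaussians with equal priors; they are a reconstruction, not formulas quoted from the text.

```python
import numpy as np

def bayes_circle(mu1, var1, mu2, var2):
    """Center and radius of the circular Bayes boundary between two
    circularly symmetric Gaussians with equal priors (var2 > var1)."""
    xc = (var2 * mu1 - var1 * mu2) / (var2 - var1)
    r2 = (var1 * var2 / (var2 - var1)) * (
        np.dot(mu1 - mu2, mu1 - mu2) / (var2 - var1)
        + 2.0 * np.log(var2 / var1))
    return xc, np.sqrt(r2)

# Class-conditional parameters from the experiment
xc, r = bayes_circle(np.array([0.0, 0.0]), 1.0, np.array([2.0, 0.0]), 4.0)
# xc is [-2/3, 0], matching the center quoted above; r comes out near 2.34
```

Note that the center lies on the opposite side of μ_1 from μ_2: the larger variance of class C_2 pushes the boundary back around the tighter class C_1 cluster.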
The experimental results reported in this computer experiment are based on Yee (1992).
TABLE 6.1 Variable Parameters of Multilayer Perceptron
Parameter Symbol Typical Range
In Table 6.2 and subsequent tables, the mean-squared error is computed precisely as
the error functional defined in Eq. (6.45). It should be emphasized that the mean-squared
error is included in these tables only as a matter of record, since a small mean-squared
error does not necessarily imply good generalization (i.e., good performance with data
not seen before).
After convergence of a network trained with a total number of N patterns, the probability
of correct classification can, in theory, be calculated as follows:

P(c; N) = P(c; N|C_1)P(C_1) + P(c; N|C_2)P(C_2)     (6.58)

where P(c; N|C_i) is the conditional probability of correct classification given class C_i,
and R_1(N) is the region in decision space over which the multilayer perceptron (trained
with N patterns) classifies the vector x as belonging to class C_1. Usually, this region is
found experimentally by evaluating the mapping function learned by the network and
applying the output decision rule of Eq. (6.47). Unfortunately, the numerical evaluation
of P(c; N|C_1) and P(c; N|C_2) is problematic, because closed-form expressions describing
the decision boundary R_1(N) cannot be easily found.
Accordingly, we resort to the use of an experimental approach that involves testing
the trained multilayer perceptron against another independent set of input-output pairs
that are again drawn randomly from the distributions for classes C_1 and C_2 with equal
probability. Let A be a random variable that counts the number of patterns out of the N
test patterns that are classified correctly. Then the ratio

p̂_N = A/N

is a random variable that provides the maximum-likelihood unbiased estimate of the actual
classification performance p of the network. Assuming that p is constant over the N
input-output pairs, we may apply the Chernoff bound (Devroye, 1991) to the estimator
p̂_N of p, obtaining

Prob(|p̂_N − p| > ε) < 2 exp(−2ε²N) = δ

The application of the Chernoff bound yields N = 26,500 for ε = 0.01 and δ = 0.01
(i.e., 99 percent certainty that the estimate p̂_N has the given tolerance). We thus picked a
test set of size N = 32,000. The last column of Table 6.2 presents the average probability
of correct classification estimated for this test set size.
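The required test-set size follows directly from solving the bound for N, namely N ≥ ln(2/δ)/(2ε²); a small sketch (the function name is illustrative):

```python
import math

def chernoff_test_size(eps, delta):
    """Smallest N satisfying 2*exp(-2*eps**2*N) <= delta, i.e. the
    test-set size for which Prob(|p_hat - p| > eps) < delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

chernoff_test_size(0.01, 0.01)   # about 26,500, as quoted in the text
```

Because the bound is distribution-free, the same N guarantees the stated tolerance regardless of the true classification performance p.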
The average classification performance presented in Table 6.2 for a multilayer per-
ceptron using two hidden neurons is already reasonably close to the Bayesian performance
P_c = 81.51 percent. On this basis, we may conclude that for the pattern-classification
problem described here, the use of two hidden neurons is adequate. To emphasize this
conclusion, in Table 6.3 we present the results of simulations repeated for the case of
four hidden neurons, with all other parameters held constant. Although the average mean-
squared error in Table 6.3 for four hidden neurons is slightly lower than that in Table
6.2 for two hidden neurons, the average rate of correct classification does not show a
significant improvement. Accordingly, for the remainder of the computer experiment
described here, the number of hidden neurons is held at two.
Optimal Learning and Momentum Constants. For the “optimal” values of the
learning-rate parameter η and momentum constant α, we may use any one of three
definitions:

1. The η and α that, on average, yield convergence to a local minimum in the error
surface of the network with the least number of epochs.

2. The η and α that, for either the worst case or on average, yield convergence to the
global minimum in the error surface with the least number of epochs.

3. The η and α that, on average, yield convergence to the network configuration that
has the best generalization, over the entire input space, with the least number of
epochs.
The terms “average” and “worst-case” used here refer to the distribution of the training
input-output pairs. Definition 3 is the ideal; in practice, however, it is difficult to apply,
since minimizing the mean-squared error is usually the mathematical criterion for opti-
mality during network training and, as stated previously, a lower mean-squared error over
a training set does not necessarily imply good generalization. From a research point of
view, definition 2 has generated more interest than definition 1. For example, in Luo
(1991), rigorous results are presented for the optimal adaptation of the learning-rate
parameter η such that the smallest number of epochs is needed for the multilayer perceptron
to approximate the globally optimum synaptic weight matrix to a desired accuracy, albeit
for the special case of linear neurons. In general, however, heuristic and experimental
procedures presently dominate the optimal selection of η and α, using definition 1. For
the experiment described here, we therefore consider optimality in the sense of defini-
tion 1.
TABLE 6.3 Simulation Results for Multilayer Perceptron Using Four Hidden Neurons

Run Number | Training Set Size | Number of Epochs | Mean-Squared Error | Probability of Correct Classification
FIGURE 6.12 (continued) Learning curves (mean-squared error over epoch versus
number of epochs) for momentum constants α = 0, 0.1, 0.5, and 0.9: (c) η = 0.5;
(d) η = 0.9.
The experimental learning curves shown here suggest the following trends:

While, in general, a smaller learning-rate parameter η results in slower convergence,
it can locate “deeper” local minima in the error surface than a larger η. This finding
is intuitively satisfying, since a smaller η implies that the search for a minimum
should cover more of the error surface than would be the case for a larger η.

For η → 0, the use of α → 1 produces increasing speed of convergence. On the
other hand, for η → 1, the use of α → 0 is required to ensure learning stability.

The use of the constants η = {0.5, 0.9} and α = 0.9 causes oscillations in the mean-
squared error during learning and a higher value for the final mean-squared error at
convergence, both of which are undesirable effects.
FIGURE 6.13 Best learning curves selected from Fig. 6.12.
In Fig. 6.13 we show plots of the “best” learning curves selected from each group of
the learning curves plotted in Fig. 6.12, so as to determine an “overall” best learning
curve. From Fig. 6.13, it appears that the optimal learning-rate parameter η_opt is about
0.1 and the optimal momentum constant α_opt is about 0.5. Thus, Table 6.4 summarizes
the “optimal” values of network parameters used in the remainder of the experiment.
The fact that the final mean-squared error of each curve in Fig. 6.13 does not vary
significantly over the range of η and α suggests a “well-behaved” (i.e., relatively smooth)
error surface for the problem at hand.
Figure 6.14 shows the decision boundaries for the two networks in the ensemble of
20 with the worst and best classification performance; this figure also includes, for refer-
ence, the (circular) Bayesian decision boundary. Both decision boundaries (for the worst
and best classification performance) are convex with respect to the region where they
classify the vector x as belonging to class C_1 or class C_2. Each boundary appears to
comprise two linear segments with a nonlinear segment joining them near their projected
intersection.
Figure 6.15 shows the ensemble-averaged, slowest and fastest learning curves observed
during training of the “optimized” network. Despite the superior final mean-squared error
FIGURE 6.15 Ensemble-averaged learning curves; slowest and fastest performances
for the optimized network.
FIGURE 6.16 Average correct classification probability versus final mean-squared
error over the training set.
of the network with the fastest learning curve (0.2047 versus 0.2477 for the network with
the slowest learning curve), this apparent advantage produces no gain in classification
performance; in fact, the fastest learning network has a marginally worse classification
performance than the slowest learning network (79.78 versus 79.90 percent). A similar
situation occurs when the mean-squared error of the network in the ensemble with the
best classification performance is compared with that of the network with the worst
classification performance: the former has a probability of correct classification of 80.23
percent with a final mean-squared error of 0.2380, whereas the latter has a corresponding
probability of 78.68 percent with a final mean-squared
error of 0.2391, not significantly different from the former. These results reinforce the
notion that a lower mean-squared error over the training set is not in itself a sufficient
condition for a better classification performance.
Indeed, the plot of the experimentally determined probability of correct classification
versus the final mean-squared error shown in Fig. 6.16 displays only a weak negative
correlation of -0.1220 between the two measures of network performance. The ensemble
statistics of the performance measures, average probability of correct classification and
final mean-squared error, computed over the training set are listed in Table 6.5.
6.10 Generalization
In back-propagation learning, we typically start with a training set and use the back-
propagation algorithm to compute the synaptic weights of a multilayer perceptron by
loading (encoding) as many of the training examples as possible into the network. The
hope is that the neural network so designed will generalize. A network is said to generalize
well when the input-output relationship computed by the network is correct (or nearly
so) for input/output patterns (test data) never used in creating or training the network; the
term “generalization” is borrowed from psychology. Here, of course, it is assumed that
the test data are drawn from the same population used to generate the training data.
The learning process (i.e., training of a neural network) may be viewed as a “curve-
fitting” problem. The network itself may be considered simply as a nonlinear input-output
mapping. Such a viewpoint then permits us to look on generalization not as a mystical
property of neural networks but rather simply as the effect of a good nonlinear interpolation
of the input data (Wieland and Leighton, 1987). The network performs useful interpolation
Nonlinear
mapping
output
output
Nonlinear
mapping
Input
( b)
FIGURE 6.17 (a) Properly fitted data (good generalization). (b) Overfitted data (poor
generalization).
N = O(W/ε)   (6.62)

where N is the number of training examples, W is the total number of synaptic weights
in the network, and ε denotes the fraction of classification errors permitted on test data.
Thus, with an error of 10 percent, say, the number of training examples should be
approximately 10 times the number of synaptic weights in the network. Equation (6.62)
is in accordance with Widrow’s rule of thumb for the LMS algorithm, which states that
the settling time for adaptation in adaptive temporal filtering is approximately equal to
the memory time span of an adaptive tapped-delay-line filter divided by the misadjustment
(Widrow and Stearns, 1985). The misadjustment in the LMS algorithm plays a role
somewhat analogous to the accuracy parameter ε in Eq. (6.62).
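As a quick check of this rule of thumb, the sketch below counts the weights of a hypothetical fully connected p-M-1 network (thresholds included) and applies N ≈ W/ε; the layer sizes are arbitrary choices for illustration.

```python
import math

def weight_count(p: int, M: int) -> int:
    """Free parameters in a fully connected p-M-1 network, thresholds included."""
    return (p + 1) * M + (M + 1)

def required_examples(W: int, eps: float) -> int:
    """Rule-of-thumb training-set size N = O(W/eps) of Eq. (6.62)."""
    return math.ceil(W / eps)

W = weight_count(p=16, M=8)         # 16 inputs, 8 hidden neurons, 1 output
N = required_examples(W, eps=0.10)  # 10 percent permitted error
```

For this hypothetical network, W = 145 and N = 1450: roughly ten examples per weight, as the text states.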
6.11 Cross-Validation
The essence of back-propagation learning is to encode an input-output relation, represented
by a set of examples {x,d}, with a multilayer perceptron (MLP) well trained in the sense
that it learns enough about the past to generalize to the future. From such a perspective,
the learning process amounts to a choice of network parameterization for this data set.
More specifically, we may view the MLP selection problem as choosing, within a set of
candidate model structures (parameterizations), the “best” one according to a certain
criterion.
In this context, a standard tool in statistics, known as cross-validation, provides an
appealing guiding principle (Stone, 1974; Janssen et al., 1988). First, as usual, the available
data set is randomly partitioned into a training set and a test set. The training set is further
partitioned into two subsets:
1. A subset used for estimation of the model (i.e., training the network).
2. A subset used for evaluation of the performance of the model (i.e., validation); the
validation subset is typically 10 to 20 percent of the training set.
The motivation here is to validate the model on a data set different from the one used
for parameter estimation. In this way, we may use the validation subset to assess the
performance of various candidate model structures, and thereby choose the "best" one. The particular
model with the best-performing parameter values is then trained on the full training set,
and the generalization performance of the resulting network is measured on the test set.
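The two-step procedure can be sketched concretely. In the sketch below, the candidate "model structures" are, hypothetically, polynomial degrees rather than network parameterizations, but the flow is the one described above: estimate on one subset, select on the validation subset, retrain the winner on the full training set, and measure generalization on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy sine, standing in for the {x, d} examples.
x = rng.uniform(-1, 1, 200)
d = np.sin(np.pi * x) + 0.1 * rng.standard_normal(200)

# Random partition into training and test sets.
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]
# The training set is further split: estimation subset and validation subset (20%).
est, val = train[:120], train[120:]

def mse(deg, fit_idx, eval_idx):
    coef = np.polyfit(x[fit_idx], d[fit_idx], deg)
    pred = np.polyval(coef, x[eval_idx])
    return np.mean((pred - d[eval_idx]) ** 2)

# Step 1: choose the "best" structure (here, a degree) on the validation subset.
best_deg = min(range(1, 10), key=lambda deg: mse(deg, est, val))
# Step 2: retrain the chosen structure on the full training set,
# and measure generalization on the held-out test set.
test_mse = mse(best_deg, train, test)
```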
It is of interest to note that cross-validation follows a two-step procedure, the first step
of which is basically the same as that of the structural risk-minimization procedure
described in Section 2.13.
Cross-validation may be used in different ways for designing large neural networks with
good generalization as the goal, as described next.
Network Complexity
Consider first the problem of choosing network complexity measured in terms of the
number of hidden neurons used in a multilayer perceptron. Statistically, this problem may
be interpreted as that of choosing the size of the parameter set used to model the data set
(Hanson and Salamon, 1990; Smith, 1993). Measured in terms of the ability of the network
to generalize, there is obviously a limit on the size of the network. This follows from the
basic observation that it may not be an optimal strategy to train the network to perfection
on a given data set, because of the ill-posedness of any finite set of data representing a
target function, a condition that is true for both “noisy” and “clean” data. Rather, it
would be better to train the network in order to produce the “best” generalization. To
do so, we may use cross-validation. Specifically, the training data set is partitioned into
estimation and cross-validation subsets, in which case "overtraining" will show up as
poorer performance on the cross-validation set.
[Plot: mean-squared error in generalization versus number of epochs.]

FIGURE 6.18 Illustration of the idea of cross-validation.
The cross-validation set is then used to validate the training after each epoch. When the
classification performance of the network on the cross-validation set fails to improve by
a certain amount (typically, 0.5 percent), the size of the learning-rate parameter is reduced;
typically, a factor of 2 is used for the parameter reduction. After each succeeding epoch,
the learning-rate parameter is further reduced, until once again there is no further improve-
ment in classification performance on the cross-validation set. When that point is reached,
training of the network is halted.
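The stopping schedule just described can be written as a small control loop. The `step` and `validate` functions below are hypothetical stand-ins (a single weight minimizing a quadratic loss, with the negated loss playing the role of the validation score); the 0.5 percent improvement threshold and the reduction factor of 2 are the values quoted in the text.

```python
def train_with_validation(step, validate, eta, min_improve=0.005,
                          factor=2.0, max_epochs=1000):
    """Reduce the learning rate when the validation score stops improving by
    min_improve; halt when a reduced rate again brings no improvement."""
    best = validate()
    reducing = False
    for _ in range(max_epochs):
        step(eta)
        score = validate()
        improved = (score - best) >= min_improve
        if improved:
            best = score
        if reducing and not improved:
            break                 # no improvement at the reduced rate: halt
        if reducing:
            eta /= factor         # keep shrinking after each succeeding epoch
        elif not improved:
            eta /= factor         # first plateau: cut the rate (factor of 2)
            reducing = True
    return eta, best

# Hypothetical "network": one weight w, loss (w - 3)^2, score = -loss.
w = [0.0]
def step(eta):
    w[0] -= eta * 2.0 * (w[0] - 3.0)   # gradient step on the loss
def validate():
    return -(w[0] - 3.0) ** 2          # higher score is better

final_eta, final_score = train_with_validation(step, validate, eta=0.1)
```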
6.12 Approximations of Functions

A question that arises naturally in the study of multilayer feedforward networks is: What
is the minimum number of hidden layers needed for an approximate realization of any
continuous input-output mapping? In this section we address the answer to this question
and related issues. In the context of multilayer perceptrons, it was Cybenko who demonstrated
rigorously for the first time
that a single hidden layer is sufficient to uniformly approximate any continuous function
with support in a unit hypercube; this work was published as a University of Illinois
Technical Report in 1988, and republished as a paper one year later (Cybenko, 1988,
1989). In 1989, two other papers were published independently on multilayer perceptrons
as universal approximators, one by Funahashi and the other by Hornik, Stinchcombe, and
White.
Given this historical background on input-output mapping by multilayer feedforward
networks, we may now embody the answer to the question we raised earlier in the universal
approximation theorem, stated as follows (Cybenko, 1989; Funahashi, 1989; Hornik et
al., 1989):

Let φ(·) be a nonconstant, bounded, and monotone-increasing continuous function. Let
I_p denote the p-dimensional unit hypercube [0, 1]^p, and let C(I_p) denote the space of
continuous functions on I_p. Then, given any function f in C(I_p) and ε > 0, there exist
an integer M and sets of real constants α_i, θ_i, and w_ij, where i = 1, ..., M and
j = 1, ..., p, such that the function

F(x_1, ..., x_p) = Σ_{i=1}^{M} α_i φ(Σ_{j=1}^{p} w_ij x_j − θ_i)   (6.63)

satisfies |F(x_1, ..., x_p) − f(x_1, ..., x_p)| < ε for all (x_1, ..., x_p) in I_p.
This theorem is directly applicable to multilayer perceptrons. We first note that the
logistic function 1/[1 + exp(-u)] used as the nonlinearity in a neuron model for the
construction of a multilayer perceptron is indeed a nonconstant, bounded, and monotone-
increasing function; it therefore satisfies the conditions imposed on the function φ(·).
Next, we note that Eq. (6.63) represents the output of a multilayer perceptron described
as follows:
1. The network has p input nodes and a single hidden layer consisting of M neurons;
the inputs are denoted by x_1, ..., x_p.
2. Hidden neuron i has synaptic weights w_i1, ..., w_ip and threshold θ_i.
3. The network output is a linear combination of the outputs of the hidden neurons,
with α_1, ..., α_M defining the coefficients of this combination.
The universal approximation theorem is an existence theorem in the sense that it
provides the mathematical justification for the approximation of an arbitrary continuous
function as opposed to exact representation. Equation (6.63), which is the backbone of
the theorem, merely generalizes approximations by finite Fourier series. In effect, the
theorem states that a single hidden layer is sufficient for a multilayer perceptron to compute
a uniform ε approximation to a given training set represented by the set of inputs x_1,
..., x_p and a desired (target) output f(x_1, ..., x_p). However, the theorem does not
say that a single hidden layer is optimum in the sense of learning time or ease of
implementation.
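Equation (6.63) is easy to instantiate numerically. The sketch below draws a random hidden layer and fits only the output coefficients α_i by least squares; this illustrates the form of F for p = 1 (it is not Cybenko's construction), and the choices M = 50 and the weight ranges are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Target: a continuous function on the unit interval (p = 1).
f = lambda t: np.sin(2 * np.pi * t)
x = np.linspace(0.0, 1.0, 400)

# F(x) = sum_i alpha_i * sigmoid(w_i * x - theta_i), Eq. (6.63) with p = 1.
M = 50
w = rng.normal(0.0, 20.0, M)          # hidden weights w_i (arbitrary spread)
theta = rng.uniform(-20.0, 20.0, M)   # thresholds theta_i
H = sigmoid(np.outer(x, w) - theta)   # hidden-layer outputs, shape (400, M)

# Fit the output coefficients alpha by least squares; the theorem only
# asserts that suitable parameters exist, it does not construct them.
alpha, *_ = np.linalg.lstsq(H, f(x), rcond=None)
max_err = np.max(np.abs(H @ alpha - f(x)))
```

With a handful of random sigmoids the maximum error over the grid is already small, in the spirit (though not the letter) of the theorem.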
Suppose, then, that a multilayer perceptron is trained with the back-propagation
algorithm and then tested with new data. During training, the network learns specific
points of a target function f in accordance with the training data, and thereby produces
the approximating function F defined in Eq. (6.63). When the network is exposed to test
data that has not been seen before, the network function F acts as an “estimator” of new
points of the target function; that is, F ≈ f. According to Barron (1991, 1992), the total
risk R, defined as the mean integrated squared error between the target function f and the
estimated network function F, is bounded by

R ≤ O(C_f^2/M) + O((Mp/N) log N)   (6.64)

where O(·) denotes "order of," and C_f is the first absolute moment of the Fourier magnitude
distribution of the target function f; that is, C_f quantifies the "regularity" of f. The network
parameters used to define the bound are as follows: M is the total number of hidden
neurons, p is the number of input (source) nodes, and N is the number of training examples.
The total risk R quantifies the ability of the network to generalize to new data. In particular,
the two terms in the bound on the risk R express the trade-off between two conflicting
requirements concerning the size of the hidden layer (Barron, 1991, 1992):
■ The accuracy of best approximation, which requires a large M (number of hidden
neurons) in accordance with the universal approximation theorem
■ The accuracy of the empirical fit to this approximation, which requires a small
ratio M/N (hidden-layer size to training-set size)
The approach taken by Barron in deriving these important results involves a form of
complexity regularization that is closely related to Vapnik’s method of structural risk
minimization, which was discussed in Section 2.13.
The implication of the bound on the mean-squared approximation error is that exponen-
tially large sample sizes are not required to get accurate estimates for multilayer perceptrons
as universal approximators, which makes them all the more important.
The error between the empirical fit and the best approximation may be viewed as an
estimation error. Let ε_0 denote the mean-squared value of this error. It follows therefore
that the bound on ε_0 is on the order of Mp/N, ignoring the logarithmic factor log N, which
is justifiable for a large training set. Reformulating this result, we may thus state that the
size N of the training set for good generalization should be larger than Mp/ε_0, which
interestingly has a mathematical structure similar to that of Eq. (6.62). In other words,
as a rule of thumb, we may state that for good generalization the number of training
examples should desirably be larger than the ratio of the total number of free parameters
in the network to the mean-squared estimation error.
Another interesting result that emerges from the bounds described in (6.64) is that
when the size of the hidden layer is optimized (i.e., the total risk R is minimized with
respect to M) by setting

M ≈ C_f (N/(p log N))^{1/2}

then the total risk R is bounded by O(C_f ((p/N) log N)^{1/2}). A surprising aspect of this
result is that, in terms of the first-order behavior of the total risk R, the rate of convergence
expressed as a function of the training set size N is of order (1/N)^{1/2} (times a logarithmic
factor). In contrast, for traditional smooth functions (i.e., those with bounded norms of
the derivatives of order s for some s > 0), the minimax rate of convergence of the total
risk is of order (1/N)^{2s/(2s+p)}.
The dependence of the rate on p (i.e., the dimension of the input
space) in the exponent is a curse of dimensionality (Duda and Hart, 1973), which severely
restricts the practical application of traditional smooth functions. The curse of dimensionality
does not appear to apply to the class of functions described in Eq. (6.63) (Barron,
1991), confirming another important property of multilayer perceptrons.
Practical Considerations
The universal approximation theorem is important from a theoretical viewpoint, because
it provides the necessary mathematical tool for the viability of feedforward networks with
a single hidden layer as a class of approximate solutions. Without such a theorem, we
could conceivably be searching for a solution that cannot exist. However, the theorem is
only an existence proof; it does not tell us how to construct the multilayer perceptron to
do the approximation.6
Moreover, the universal approximation theorem has limited practical value. The
theorem assumes that the continuous function to be approximated is given and that a
hidden layer of unlimited size is available for the approximation. Both of these assumptions
are violated in most practical applications of multilayer perceptrons.
The problem with multilayer perceptrons using a single hidden layer is that the neurons
therein tend to interact with each other globally. In complex situations this interaction
makes it difficult to improve the approximation at one point without worsening it at some
other point. On the other hand, with two hidden layers the approximation (curve-fitting)
process becomes more manageable. In particular, we may proceed as follows (Chester,
1990; Funahashi, 1989):
1. Local features are extracted in the first hidden layer. Specifically, some neurons in
the first hidden layer are used to partition the input space into regions, and other
neurons in that layer learn the local features characterizing those regions.
2. Global features are extracted in the second hidden layer. Specifically, a neuron in
the second hidden layer combines the outputs of neurons in the first hidden layer
operating on a particular region of the input space, and thereby learns the global
features for that region and outputs zero elsewhere.
This two-stage approximation process is similar in philosophy to the spline technique for
curve fitting (Schumaker, 1981), in the sense that the effects of neurons are isolated and
the approximations in different regions of the input space may be individually adjusted.
Sontag (1992) provides further justification for the use of two hidden layers in the
context of inverse problems. Specifically, the following inverse problem is considered:
given a function f, find a function φ that inverts it, in the sense described below.
This problem arises in inverse kinematics (dynamics), where the observed state x(n) of
a system is a function of current actions u(n) and the previous state x(n - 1) of the
system, as shown by
x(n) = f(x(n − 1), u(n))
It is assumed that f is invertible, so that we may solve for u(n) as a function of x(n) for
any x(n − 1). The function f represents the direct kinematics, whereas the function φ
represents the inverse kinematics. In practical terms, the motivation is to find a function
φ that is computable by a multilayer feedforward network. In general, discontinuous
functions φ are needed to solve the inverse kinematics problem.

Footnote 6: Takahashi (1993) has developed a theory for constructing a multilayer perceptron with two input
(source) nodes, a single hidden layer, and two output neurons. The theory permits the specification of the training
data set and the free parameters of the network so as to generalize and approximate any given continuous mapping
between sets of contours on a plane with any given permissible error.

Interestingly, however,
even if the use of neuron models with discontinuous activation functions is permitted,
one hidden layer is not enough to guarantee the solution of all such inverse problems,
whereas multilayer feedforward networks with two hidden layers are indeed sufficient,
for every possible f, C, and ε (Sontag, 1992).
6.13 Back-Propagation and Differentiation

Consider a multilayer perceptron with a single hidden layer and a single output neuron o.
Let hidden neuron j compute y_j = φ(v_j) with induced local field v_j = Σ_i w_ji x_i, and
let the output neuron compute F = φ(v_o) with v_o = Σ_j w_oj y_j. Applying the chain rule
of calculus, the partial derivatives of the network function F with respect to the synaptic
weights and the inputs are

∂F/∂w_oj = φ'(v_o) y_j   (6.66)

∂F/∂w_ji = φ'(v_o) w_oj φ'(v_j) x_i   (6.67)

∂F/∂x_i = φ'(v_o) Σ_j w_oj φ'(v_j) w_ji   (6.68)

where φ' is the partial derivative of the nonlinearity φ with respect to its input, and x_i is
the ith element of the input vector x. In a similar way, we may derive the equations for
the partial derivatives of a general network with more hidden layers and more neurons
in the output layer.
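These chain-rule derivatives can be checked numerically. The sketch below evaluates the input derivatives of a single-hidden-layer, single-output network with logistic activations and compares them with central finite differences; the network sizes and weights are arbitrary.

```python
import numpy as np

def phi(v):  # logistic nonlinearity
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):
    s = phi(v)
    return s * (1.0 - s)

rng = np.random.default_rng(2)
p, M = 4, 3                       # inputs, hidden neurons
W1 = rng.standard_normal((M, p))  # hidden weights w_ji
w2 = rng.standard_normal(M)       # output weights w_oj
x = rng.standard_normal(p)

def F(x):
    return phi(w2 @ phi(W1 @ x))

# Chain rule: dF/dx_i = phi'(v_o) * sum_j w_oj * phi'(v_j) * w_ji
v_hidden = W1 @ x
v_out = w2 @ phi(v_hidden)
dF_dx = phi_prime(v_out) * (W1.T @ (w2 * phi_prime(v_hidden)))

# Central finite-difference check.
eps = 1e-6
fd = np.array([(F(x + eps * np.eye(p)[i]) - F(x - eps * np.eye(p)[i])) / (2 * eps)
               for i in range(p)])
```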
6.14 Virtues and Limitations of Back-Propagation Learning

Connectionism
The back-propagation algorithm is an example of a connectionist paradigm that relies on
local computations to discover the information-processing capabilities of neural networks.
This form of computational restriction is referred to as the locality constraint, in the sense
that the computation performed by a neuron is influenced solely by those neurons that
are in physical contact with it. The use of local computations in the design of artificial
neural networks is usually advocated for three principal reasons:
1. Artificial neural networks that perform local computations are often held up as
metaphors for biological neural networks.
2. The use of local computations permits a graceful degradation in performance due
to hardware errors, and therefore a fault-tolerant network design.
3. Local computations favor the use of parallel architectures as an efficient method
for the implementation of artificial neural networks.
Taking these three points in reverse order, point 3 is perfectly justified in the case
of back-propagation learning. In particular, the back-propagation algorithm has been
implemented successfully on parallel computers by many investigators, and VLSI architec-
tures have been developed for the hardware realization of multilayer perceptrons (Tomlin-
son et al., 1990; Hammerstrom, 1992). Point 2 is justified so long as certain precautions
are taken in the application of the back-propagation algorithm, such as injecting small
numbers of “transient faults” at each step (Bolt, 1992). As for point 1, relating to the
biological plausibility of back-propagation learning, it has indeed been seriously questioned
on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989):
1. The reciprocal synaptic connections between the neurons of a multilayer perceptron
may assume weights that are excitatory or inhibitory. In the real nervous system,
however, neurons appear usually to be the one or the other. This is one of the most
serious of the unrealistic assumptions made in neural network models.
2. In a multilayer perceptron, hormonal and other types of global communications are
ignored. In real nervous systems, these types of global communication are critical
for state-setting functions, such as arousal, attention, and learning.
3. In back-propagation learning, a synaptic weight is modified by a presynaptic activity
and an error (learning) signal independent of postsynaptic activity. There is evidence
from neurobiology to suggest otherwise.
4. In a neurobiological sense, the implementation of back-propagation learning requires
the rapid transmission of information backward along an axon. It appears highly
unlikely that such an operation actually takes place in the brain.
5. Back-propagation learning implies the existence of a “teacher,” which in the context
of the brain would presumably be another set of neurons with novel properties. The
existence of such neurons is biologically implausible.
However, these neurobiological misgivings do not belittle the engineering importance
of back-propagation learning as a tool for information processing, as evidenced by its
successful application in numerous and highly diverse fields, including the simulation of
neurobiological phenomena.
Hidden Units
Hidden units play a critical role in the operation of multilayer perceptrons with back-
propagation learning in that they act as feature detectors. A novel way in which this
important attribute of back-propagation learning can be exploited is in the “discovery”
of significant features that characterize input patterns of interest. Moreover, because the
learning process can sometimes be performed without an external teacher, the feature
extraction is accomplished without detailed knowledge of the output pattern or the
underlying physics.

[Diagram: an input layer feeding a narrow hidden layer that produces the compressed
data, and an output layer that reconstructs the original data from the compressed
representation.]

FIGURE 6.20 Autoencoder with a single hidden layer. (a) Input-output connections of
the autoencoder. (b) Layout of network for reconstructing the original data.
One of the simplest ways to avoid the need for an external teacher in back-propagation
learning is to learn the identity map over some set of inputs (Rumelhart et al., 1986b).
To do this, we let the input pattern and the target (desired) pattern be exactly the same,
as illustrated in Fig. 6.20. Specifically, the network is constrained to perform the identity
mapping through a narrow channel (hidden layer) of the network, forcing it to develop
an efficient encoding in that channel. There are two interesting aspects to this operation:
6.14 / Virtues and Limitations of Back-Propagation Learning 189
[Diagram: network with an input layer, three hidden layers, and an output layer.]
FIGURE 6.21 Autoencoder with three hidden layers.
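The identity-mapping idea can be sketched with a small NumPy network trained by back-propagation on the classic 8-3-8 encoder task (eight one-hot patterns squeezed through three hidden neurons); the learning rate and epoch count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

X = np.eye(8)             # eight one-hot patterns; the targets equal the inputs
n_hidden = 3              # narrow channel: 8 -> 3 -> 8

W1 = 0.5 * rng.standard_normal((8, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.5 * rng.standard_normal((n_hidden, 8)); b2 = np.zeros(8)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

eta, losses = 2.0, []
for _ in range(5000):
    H = sigmoid(X @ W1 + b1)       # compressed (encoded) representation
    Y = sigmoid(H @ W2 + b2)       # reconstruction of the input
    E = Y - X
    losses.append(np.mean(E ** 2))
    dY = E * Y * (1 - Y)           # back-propagate the reconstruction error
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= eta * H.T @ dY; b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dH; b1 -= eta * dH.sum(axis=0)
```

The reconstruction error falls steadily as the hidden layer is forced to develop an efficient code for the eight patterns.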
Universal Approximation
A multilayer perceptron trained with the back-propagation algorithm manifests itself as
a nested sigmoidal scheme, written in the following compact form for the case of a single
output:

F(x, w) = φ(Σ_j w_oj φ(Σ_k w_jk φ(··· φ(Σ_i w_li x_i) ···)))   (6.69)

where φ(·) is a sigmoidal activation function, w_oj is the synaptic weight from neuron j in
the last hidden layer to the single output neuron o, and so on for the other synaptic
weights, and x_i is the ith element of the input vector x. The weight vector w denotes the
entire set of synaptic weights ordered by layer, then neurons in a layer, and then synapses
in a neuron. The scheme of nested nonlinear functions described in Eq. (6.69) is unusual
in the classical approximation theory. Indeed, it is a universal approximator in the sense that
a multilayer perceptron with back-propagation learning can approximate any continuous
multivariate function to any desired degree of accuracy, provided that sufficiently many
hidden neurons are available (Cybenko, 1988; Funahashi, 1989; Hornik et al., 1989).
In the context of approximation, the use of back-propagation learning offers another
useful property. Intuition suggests that a multilayer feedforward network with smooth
activation functions should have output function derivatives that can also approximate
the derivatives of an unknown input-output mapping. A proof of this result is presented
in Hornik et al. (1990). In fact, it is shown in this paper that multilayer feedforward
networks can approximate functions that are not differentiable in the classical sense, but
possess a generalized derivative as in the case of piecewise differentiable functions. The
approximation results reported by Hornik et al. provide a previously missing theoretical
justification for the use of multilayer feedforward networks in applications that require
the approximation of a function and its derivatives.
Convergence
The back-propagation algorithm is a first-order approximation of the steepest-descent
technique in the sense that it depends on the gradient of the instantaneous error surface
in weight space. The algorithm is therefore stochastic in nature; that is, it has a tendency
to zigzag its way about the true direction to a minimum on the error surface. Indeed,
back-propagation learning is an application of a statistical method known as stochastic
approximation that was originally proposed by Robbins and Monro (1951). Consequently,
it suffers from a slow convergence property. We may identify two fundamental causes
for this property (Jacobs, 1988):
1. The error surface is fairly flat along a weight dimension, which means that the
derivative of the error surface with respect to that weight is small in magnitude. In
such a situation, the adjustment applied to the weight is small and, consequently,
many iterations of the algorithm may be required to produce a significant reduction
in the error performance of the network. Alternatively, the error surface is highly
curved along a weight dimension, in which case the derivative of the error surface
with respect to that weight is large in magnitude. In this second situation, the
adjustment applied to the weight is large, which may cause the algorithm to overshoot
the minimum of the error surface.
2. The direction of the negative gradient vector (i.e., the negative derivative of the
cost function with respect to the vector of weights) may point away from the
minimum of the error surface; hence the adjustments applied to the weights may
induce the algorithm to move in the wrong direction.
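Both effects can be reproduced on a toy quadratic error surface whose curvature differs sharply along its two weight dimensions; the curvatures a and b and the step sizes below are arbitrary illustrative values.

```python
import numpy as np

# E(w) = 0.5 * (a*w1^2 + b*w2^2): flat along w1, highly curved along w2.
a, b = 1.0, 50.0
grad = lambda w: np.array([a * w[0], b * w[1]])

def descend(eta, steps=200):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * grad(w)      # steepest-descent update
    return w

w_small = descend(eta=0.01)   # stable, but crawls along the flat w1 direction
w_large = descend(eta=0.041)  # just past 2/b: overshoots and oscillates along w2
```

With the small step size, 200 iterations still leave w1 far from the minimum; with the larger one, the update repeatedly overshoots the minimum along the curved w2 direction.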
Consequently, the rate of convergence in back-propagation learning tends to be rela-
tively slow, which in turn makes it computationally expensive. According to the empirical
study of Saarinen et al. (1992), the local convergence rates of the back-propagation
algorithm are linear, which is justified on the grounds that the Jacobian (matrix of first-
order partial derivatives) is almost rank-deficient, and so is the Hessian matrix (second-
order partial derivatives of the error surface with respect to the weights); these are
consequences of the intrinsically ill-conditioned nature of neural-network training prob-
lems. Saarinen et al. interpret the linear local convergence rates of back-propagation
learning in two ways:
■ It is a vindication of back-propagation (gradient descent) in the sense that higher-
order methods may not converge much faster while requiring more computational
effort; or
Local Minima
Another peculiarity of the error surface that impacts the performance of the back-propaga-
tion algorithm is the presence of local minima (i.e., isolated valleys) in addition to global
minima. Since back-propagation is basically a hill-climbing technique, it runs the risk of
being trapped in a local minimum, where every small change in synaptic weights increases
the cost function. But somewhere else in the weight space there exists another set of
synaptic weights for which the cost function is smaller than the local minimum in which
the network is stuck. Clearly, it is undesirable to have the learning process terminate at
a local minimum, especially if it is located far above a global minimum.
The issue of local minima in back-propagation learning has been raised in the epilogue
of the enlarged edition of the classic book by Minsky and Papert (1988), where most of
the attention is focused on a discussion of the two-volume book, Parallel Distributed
Processing, by Rumelhart and McClelland (1986). In Chapter 8 of the latter book it is
observed that getting trapped in a local minimum is rarely a practical problem for back-
propagation learning. Minsky and Papert counter by pointing out that the entire history
of pattern recognition shows otherwise. Gori and Tesi (1992) describe a simple example
where, although a nonlinearly separable set of patterns could be learned by the chosen
network with a single hidden layer, back-propagation learning can get stuck in a local
minimum.
What we basically need is a theoretical framework of back-propagation learning that
explains the local-minima problem. This is a difficult task to accomplish. Nevertheless,
some progress has been reported in the literature on this issue. Baldi and Hornik (1989)
have considered the problem of learning in layered linear feedforward neural networks
using back-propagation learning. The main result of their paper is that the error surface
has only one minimum, corresponding to an orthogonal projection onto the subspace
spanned by the first principal eigenvectors of a covariance matrix associated with the
training patterns; all other critical points of the error surface are saddle points. Gori and
Tesi (1992) have considered the more general case of back-propagation learning that
involves the use of nonlinear neurons. The main result of this latter paper is that for
linearly separable patterns, convergence to an optimal solution (i.e., global minimum) is
ensured by using the batch mode of back-propagation learning, and the network exceeds
Rosenblatt’s perceptron in generalization to new examples.
Scaling
In principle, neural networks such as multilayer perceptrons trained with the back-propaga-
tion algorithm offer the potential of universal computing machines. However, for that
potential to be fully realized, we have to overcome the scaling problem, which addresses
the issue of how well the network behaves (e.g., as measured by the time required for
training or the best generalization performance attainable) as the computational task
increases in size and complexity. Among the many possible ways of measuring the size
or complexity of a computational task, the predicate order defined by Minsky and Papert
(1969, 1988) provides the most useful and important measure.
To explain what we mean by a predicate, let ψ(X) denote a function that can have
only two values. Ordinarily, we think of the two values of ψ(X) as 0 and 1. But by taking
the values to be FALSE or TRUE, we may think of ψ(X) as a predicate, that is, a variable
statement whose falsity or truth depends on the choice of argument X. For example, we
may write
CIRCLE(X) = 1 if the figure X is a circle; 0 if the figure X is not a circle   (6.70)
Using the idea of a predicate, Tesauro and Janssens (1988) performed an empirical
study involving the use of a multilayer perceptron trained with the back-propagation
algorithm to learn to compute the parity function. The parity function is a Boolean predicate
defined by
PARITY(X) = 1 if |X| is an odd number; 0 otherwise   (6.71)
and whose order is equal to the number of inputs. The experiments performed by Tesauro
and Janssens appear to show that the time required for the network to learn to compute
the parity function scales exponentially with the number of inputs (i.e., the predicate order
of the computation), and that projections of the use of the back-propagation algorithm to
learn arbitrarily complicated functions may be overly optimistic.
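For reference, the parity predicate itself is trivial to state in code; it is learning the predicate with a network that scales badly. A minimal rendering for binary (0/1) inputs, where sum(X) plays the role of |X|:

```python
def PARITY(X):
    """Eq. (6.71): 1 if the number of active points |X| is odd, else 0."""
    return sum(X) % 2

# Flipping any single input flips the output, which is why the predicate's
# order equals the number of inputs.
```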
It is generally agreed that it is inadvisable for a multilayer perceptron to be fully
connected. In this context, we may therefore raise the following question: Given that a
multilayer perceptron should not be fully connected, how should the synaptic connections
of the network be allocated? This question is of no major concern in the case of small-
scale applications, but it is certainly crucial to the successful application of back-propaga-
tion learning for solving large-scale, real-world problems.
One way of alleviating the scaling problem is to develop insight into the problem at
hand and use it to put ingenuity into the architectural design of the multilayer perceptron.
Specifically, the network architecture and the constraints imposed on synaptic weights of
the network should be designed so as to incorporate prior information about the task into
the makeup of the network. Such a design strategy is well illustrated by the way in which
back-propagation learning has been successfully applied to the optical character recognition
problem (LeCun et al., 1990a); this strategy is discussed in Section 6.22.
Another way of dealing with the scaling problem is to reformulate the back-propagation
learning process with modularity built into the network architecture (Ballard, 1990; Jacobs
et al., 1991a); modularity appears to be an important principle in the architecture of
vertebrate nervous systems (Houk, 1992; Van Essen et al., 1992). A computational system
is said to have a modular architecture if it can be broken down into two or more subsystems
that perform computation on distinct inputs in the absence of communication with each
other. Moreover, the connection process inside each module can be made “dynamic,”
that is, synaptic connections can be made or broken as part of the learning process.
Modular networks are treated in Chapter 12.
HEURISTIC 1. Every adjustable network parameter of the cost function should have its
own individual learning-rate parameter.
Here we note that the back-propagation algorithm may be slow to converge, because
the use of a fixed learning-rate parameter may not suit all portions of the error surface.
In other words, a learning-rate parameter appropriate for the adjustment of one synaptic
weight is not necessarily appropriate for the adjustment of other synaptic weights in the
network. Heuristic 1 recognizes this fact by assigning a different learning-rate parameter
to each adjustable synaptic weight (parameter) in the network.
HEURISTIC 2. Every learning-rate parameter should be allowed to vary from one itera-
tion to the next.
Typically, the error surface behaves differently along different regions of a single
weight dimension. In order to match this variation, heuristic 2 states that the learning-
rate parameter needs to vary from iteration to iteration. It is of interest to note that this
heuristic is well founded in the case of linear units; see Luo (1991).
HEURISTIC 3. When the derivative of the cost function with respect to a synaptic
weight has the same algebraic sign for several consecutive iterations of the algorithm,
the learning-rate parameter for that particular weight should be increased.
The current operating point in weight space may lie on a relatively flat portion of the
error surface along a particular weight dimension. This may, in turn, account for the
derivative of the cost function (i.e., the gradient of the error surface) with respect to that
weight maintaining the same algebraic sign, and therefore pointing in the same direction,
for several consecutive iterations of the algorithm. Heuristic 3 states that in such a situation
the number of iterations required to move across the flat portion of the error surface may
be reduced by increasing the learning-rate parameter appropriately.
HEURISTIC 4. When the algebraic sign of the derivative of the cost function with
respect to a particular synaptic weight alternates for several consecutive iterations of the
algorithm, the learning-rate parameter for that weight should be decreased.
When the current operating point in weight space lies on a portion of the error surface
along a weight dimension of interest that exhibits peaks and valleys (i.e., the surface is
highly curved), then it is possible for the derivative of the cost function with respect to
that weight to change its algebraic sign from one iteration to the next. In order to prevent
the weight adjustment from oscillating, heuristic 4 states that the learning-rate parameter
for that particular weight should be decreased appropriately.
It is noteworthy that the use of a different and time-varying learning-rate parameter
for each synaptic weight in accordance with these heuristics modifies the back-propagation
algorithm in a fundamental way. Specifically, the modified algorithm no longer performs
a steepest-descent search. Rather, the adjustments applied to the synaptic weights are
based on (1) the partial derivatives of the error surface with respect to the weights, and
(2) estimates of the curvatures of the error surface at the current operating point in weight
space along the various weight dimensions.
184 6 / Multilayer Perceptrons
Furthermore, all four heuristics satisfy the locality constraint. Adherence to the locality
constraint limits the domain of usefulness of these heuristics in that error surfaces exist
for which they do not work. Nevertheless, modifications of the back-propagation algorithm
in light of these heuristics, as shown next, do have practical value.
Consider the instantaneous sum of squared errors

    E(n) = (1/2) Σ_j e_j²(n) = (1/2) Σ_j [dj − yj(n)]²                  (6.72)

where yj(n) is the output of neuron j, and dj is the desired (target) response for that neuron.
Although the definition of E(n) in Eq. (6.72) is mathematically similar to that of the cost
function ℰ(n) in Eq. (6.2), the parameter space pertaining to the new cost function E(n) is
assumed to consist of different learning rates. Let ηji(n) denote the learning-rate parameter
assigned to synaptic weight wji(n) at iteration number n. Applying the chain rule to E(n),
we may write
    ∂E(n)/∂ηji(n) = [∂E(n)/∂yj(n)] [∂yj(n)/∂vj(n)] [∂vj(n)/∂ηji(n)]    (6.73)
For convenience of presentation, we reproduce Eqs. (6.4) and (6.12) here as follows:

    vj(n) = Σ_i wji(n) yi(n)                                            (6.74)

    wji(n) = wji(n − 1) − ηji(n) ∂ℰ(n − 1)/∂wji(n − 1)                  (6.75)

Substituting Eq. (6.75) in (6.74), we get

    vj(n) = Σ_i [wji(n − 1) − ηji(n) ∂ℰ(n − 1)/∂wji(n − 1)] yi(n)       (6.76)

Hence, differentiating Eq. (6.76) with respect to ηji(n), and rewriting Eq. (6.5), we have

    ∂vj(n)/∂ηji(n) = −yi(n) ∂ℰ(n − 1)/∂wji(n − 1)                       (6.77)

    ∂yj(n)/∂vj(n) = φj′(vj(n))                                          (6.78)
Next, we evaluate the partial derivative ∂E(n)/∂yj(n). For the case when neuron j lies
in the output layer of the network, the desired response dj(n) is supplied externally.
Accordingly, we may differentiate Eq. (6.72) with respect to yj(n), obtaining the result

    ∂E(n)/∂yj(n) = −ej(n)                                               (6.79)
6.15 / Accelerated Convergence of Back-Propagation Through Learning-Rate Adaptation 195
where ej(n) is the error signal. Thus, using the partial derivatives of Eqs. (6.77), (6.78),
and (6.79) in (6.73), and then rearranging terms, we obtain

    ∂E(n)/∂ηji(n) = ej(n) φj′(vj(n)) yi(n) ∂ℰ(n − 1)/∂wji(n − 1)        (6.80)
Note that the partial derivative ∂ℰ(n − 1)/∂wji(n − 1) on the right-hand side of Eq. (6.80)
refers to the cost function ℰ(n − 1) describing the error surface at time n − 1; the
differentiation is with respect to synaptic weight wji(n − 1).
From Eq. (6.11) we note that the factor

    −φj′(vj(n)) ej(n) yi(n)

equals the partial derivative ∂ℰ(n)/∂wji(n). Using this relation in Eq. (6.80), we may
redefine ∂E(n)/∂ηji(n) simply as

    ∂E(n)/∂ηji(n) = −[∂ℰ(n)/∂wji(n)] [∂ℰ(n − 1)/∂wji(n − 1)]            (6.81)
Equation (6.81) defines the derivative of the error surface with respect to the learning-
rate parameter ηji(n), assuming that neuron j lies in the output layer of the network. In
fact, we can show that this same formula also applies to a neuron j that lies in a hidden
layer of the network; see Problem 6.14. In other words, Eq. (6.81) applies to all neurons
in the network.
We are now ready to formulate a learning-rate update rule that performs steepest
descent on an error surface over the parameter space, where the parameter of interest is
the learning-rate parameter ηji(n). Specifically, we define the adjustment applied to ηji(n)
as

    Δηji(n + 1) = −γ ∂E(n)/∂ηji(n)
                = γ [∂ℰ(n)/∂wji(n)] [∂ℰ(n − 1)/∂wji(n − 1)]             (6.82)

where γ is a positive constant, called the control step-size parameter for the learning-
rate adaptation procedure.
The partial derivatives ∂ℰ(n − 1)/∂wji(n − 1) and ∂ℰ(n)/∂wji(n) refer to the derivative
(i.e., the gradient) of the error surface with respect to the synaptic weight wji (connecting
neuron i to neuron j), evaluated at iterations n − 1 and n, respectively. Accordingly, we
may make two important observations:

■ When the derivative of the error surface with respect to the weight wji has the same
  algebraic sign on two consecutive iterations, the adjustment Δηji(n + 1) has a positive
  value. The adaptation procedure therefore increases the learning-rate parameter for
  the weight wji. Correspondingly, back-propagation learning along that direction
  will be fast.
■ When the derivative of the error surface with respect to the weight wji alternates in
  sign on two consecutive iterations, the adjustment Δηji(n + 1) assumes a negative value.
  In this case, the adaptation procedure decreases the learning-rate parameter for the
  weight wji. Correspondingly, back-propagation learning along that direction will
  be slow.
These two observations are in perfect accord with heuristics 3 and 4, respectively.
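In code, the update of Eq. (6.82) for a single weight reduces to one line (the function and variable names are ours):

```python
def delta_delta_update(eta, grad_now, grad_prev, gamma):
    """Delta-delta learning-rate update of Eq. (6.82): the learning rate
    grows when the error-surface derivative keeps its sign on two
    consecutive iterations, and shrinks when the sign alternates."""
    return eta + gamma * grad_now * grad_prev

# Same sign on consecutive iterations -> learning rate increases
print(delta_delta_update(0.1, 0.5, 0.4, gamma=0.01))   # ~0.102
# Alternating sign -> learning rate decreases
print(delta_delta_update(0.1, 0.5, -0.4, gamma=0.01))  # ~0.098
```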
The learning-rate adaptation procedure described by Eq. (6.82) is called the delta-delta
learning rule (Jacobs, 1988). Although this learning rule satisfies the heuristics mentioned
above, it has some potential problems. If the derivative of the error surface with respect
to a particular weight has the same sign but small magnitudes at two consecutive iterations,
the positive adjustment applied to the learning rate for that weight is very small. On the
other hand, if the derivative of the error surface with respect to a weight has opposite
signs and large magnitudes at two consecutive iterations, the negative adjustment applied
to that weight will be very large. Under these circumstances, it is very difficult to choose
an appropriate value for the step-size parameter y. This limitation of the delta-delta
learning rule is overcome by introducing a further modification, as described next.
Let wji(n) denote the value of the synaptic weight connecting neuron i to neuron j,
measured at iteration n. Let ηji(n) denote the learning-rate parameter assigned to this
weight at this iteration. The learning-rate update rule is now defined as follows:

    Δηji(n + 1) = κ           if S̄ji(n − 1) Dji(n) > 0
                = −β ηji(n)   if S̄ji(n − 1) Dji(n) < 0
                = 0           otherwise                                 (6.83)

where

    Dji(n) = ∂ℰ(n)/∂wji(n)                                              (6.84)

and

    S̄ji(n) = (1 − ξ) Dji(n) + ξ S̄ji(n − 1)                             (6.85)

where ξ is a positive constant. The quantity Dji(n) is the current value of the partial
derivative of the error surface with respect to the weight wji(n). The second quantity S̄ji(n)
is an exponentially weighted sum of the current and past derivatives of the error surface
with respect to wji, with ξ as the base and iteration number n as the exponent. The
learning-rate adaptation procedure described in Eqs. (6.83) to (6.85) is called the delta-
bar-delta learning rule (Jacobs, 1988). Note that if we set the control parameters κ and
β both equal to zero, the learning-rate parameters assume a constant value as in the
standard back-propagation algorithm.
From these defining equations we may make the following observations:

1. The delta-bar-delta learning rule uses a mechanism similar to the delta-delta learning
   rule to satisfy heuristics 3 and 4. Specifically, if for a synaptic weight wji the
   derivative Dji(n) at iteration n and the exponentially weighted sum S̄ji(n − 1) at the
   previous iteration n − 1 have the same sign, then the learning-rate parameter for
   that weight is incremented by the constant κ. If, on the other hand, the quantities
   Dji(n) and S̄ji(n − 1) have opposite signs, then the learning-rate parameter for the
   weight wji is decremented by a proportion, β, of its current value ηji(n). Otherwise,
   the learning-rate parameter remains unchanged.
2. The learning-rate parameter ηji(n) is incremented linearly but decremented exponen-
   tially. A linear increase prevents the learning-rate parameter from growing too fast,
   whereas an exponential decrease means that the learning-rate parameter remains
   positive and that it is reduced rapidly.
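One iteration of Eqs. (6.83) to (6.85) for a single weight may be sketched as follows (the function and variable names are ours):

```python
def delta_bar_delta_step(eta, S_prev, D_now, kappa, beta, xi):
    """One iteration of the delta-bar-delta rule, Eqs. (6.83)-(6.85).
    D_now is the current derivative of the error surface with respect
    to the weight; S_prev is the exponentially weighted sum from the
    previous iteration."""
    if S_prev * D_now > 0:        # persistent sign: linear (additive) increase
        eta += kappa
    elif S_prev * D_now < 0:      # sign flip: exponential (proportional) decrease
        eta -= beta * eta
    S_now = (1.0 - xi) * D_now + xi * S_prev   # Eq. (6.85)
    return eta, S_now

eta, S = 0.1, 0.0
eta, S = delta_bar_delta_step(eta, S, 0.5, kappa=0.01, beta=0.2, xi=0.7)
print(eta)            # 0.1 (S-bar was zero on the first call: no change)
eta, S = delta_bar_delta_step(eta, S, 0.5, kappa=0.01, beta=0.2, xi=0.7)
print(round(eta, 3))  # 0.11 (same sign as the running average: +kappa)
```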
A strict application of back-propagation learning calls for the computation of weight
and parameter updates for each input pattern. For a large network and sizable training
database, this form of updating results in a significant increase in memory storage size
[Footnote: Minai and Williams (1990) describe a further improvement of the delta-bar-delta
learning rule. In particular, it is shown that the advantages of momentum can be retained
without significant drawbacks.]
and computational complexity when the delta-bar-delta rule for learning-rate adaptation
is incorporated into the back-propagation algorithm. To be specific, we have to allocate
additional memory storage for (1) the parameters ηji(n) and Δηji(n) assigned to every
synaptic weight in the network, and (2) the associated partial derivatives
∂ℰ(n − 1)/∂wji(n − 1). All of this may make it difficult to justify the use of the delta-
bar-delta learning rule in practice.
However, we note that the weight changes due to the application of each training-
pattern point are usually small compared to the magnitudes of the weights themselves.
This is merely a manifestation of the principle of minimal disturbance (Widrow and Lehr,
1990). We may therefore simplify implementation of the back-propagation algorithm
incorporating the delta-bar-delta learning rule by exploiting an idea similar to the gradient
reuse method (Hush and Salas, 1988). In particular, we note that when batch updating is
used, the gradient estimate may be formed by averaging the gradient contributions com-
puted from several patterns. Weights and learning-rate parameters are then updated once
every B patterns, where B is the epoch (batch) size. The basic idea of the gradient reuse
method is that gradients are reused several times until the resulting weight updates and
learning-rate updates no longer lead to a reduction in error performance. By so doing,
the number of gradients computed per iteration is significantly reduced, on average.
Moreover, with batch updating a more accurate estimate of the true gradient is computed,
and the update of interest is therefore more likely to be in the correct downhill direction.
Thus, let δj(b)(n) and yi(b)(n) denote the local gradient and the output signal computed
for neuron j and neuron i, respectively; the superscript b refers to the presentation of the
bth pattern point, and the index n refers to the current iteration. From Eqs. (6.11) and
(6.14) we note that the partial derivative of the error surface with respect to the jith
synaptic weight and corresponding to the presentation of the bth pattern point is equal
to −δj(b)(n) yi(b)(n). According to the gradient reuse method, we may then set the average
gradient with respect to the jith weight as follows:
    ∂ℰ(n)/∂wji(n) = −(1/B) Σ_{b=1}^{B} δj(b)(n) yi(b)(n)                (6.86)

where B is the epoch (batch) size over which the gradient averaging is performed. We
may therefore redefine the update formula for the jith synaptic weight as

    wji(n + 1) = wji(n) + α Δwji(n − 1) + ηji(n + 1) Σ_{b=1}^{B} δj(b)(n) yi(b)(n)    (6.87)

Correspondingly, we may reformulate Eq. (6.84) in light of the gradient reuse method as

    Dji(n) = −Σ_{b=1}^{B} δj(b)(n) yi(b)(n)                             (6.88)
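The batch-averaged derivative used above may be sketched for a single weight as follows (the function and variable names are ours):

```python
def averaged_gradient(deltas, outputs):
    """Batch estimate of the error-surface derivative for weight w_ji,
    in the spirit of Eqs. (6.86) and (6.88): the average of
    -delta_j^(b) * y_i^(b) over the B patterns of an epoch."""
    B = len(deltas)
    return -sum(d * y for d, y in zip(deltas, outputs)) / B

# Local gradients and presynaptic outputs for a batch of B = 4 patterns
deltas  = [0.2, -0.1, 0.3, 0.1]
outputs = [1.0,  0.5, 1.0, 2.0]
print(averaged_gradient(deltas, outputs))  # ~ -0.1625
```

Averaging over B patterns gives a smoother gradient estimate than any single pattern, which is why the batch update is more likely to point in the correct downhill direction.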
Computer Experiment

In this computer experiment, we study a two-dimensional classification problem that
involves nonconvex decision regions. The distribution of the pattern classes C1 and C2 is
shown in Fig. 6.22a. Class C1 consists of the pattern points inside the area marked C1, and
class C2 consists of the pattern points inside the area marked C2. The requirement is to
FIGURE 6.22 Pattern classification using a multilayer perceptron: (a) distribution of
classes C1 and C2; (b) the training data of classes C1 and C2 (C1 = 100 points, C2 = 100
points); and (c) the testing data of classes C1 and C2 (C1 = 482 points, C2 = 518 points).
design a neural network classifier that decides whether an input pattern belongs to class
C1 or class C2.
The composition of the neural network used in the study is as follows:
Figure 6.22b depicts the pattern set used to train the network. Figure 6.22c depicts the
test pattern set used to evaluate the performance of the network after training. Both sets
consist of uniformly distributed random points. There are 100 class C1 points and 100
class C2 points in the training set, and there are 482 class C1 points and 518 class C2
points in the test set.
In Fig. 6.23 we present learning curves that compare the use of pattern updates with
batch updates. In particular, Fig. 6.23a compares the error performance of the standard
back-propagation algorithm using pattern updates and batch updates. Figure 6.23b presents
corresponding curves for the modified back-propagation algorithm incorporating the delta-
bar-delta training rule. From Fig. 6.23a we see that for the standard back-propagation
algorithm, the use of pattern updates yields slightly better results than batch updates. On
FIGURE 6.23 Learning curves for pattern update and batch update methods.
(a) Learning curves for the standard back-propagation algorithm for 3000 epochs.
Continuous curve: η0 = 0.75, α = 0.25, κ = 0.0, β = 0.0, ξ = 0.0; dashed curve:
η0 = 0.75, α = 0.25, κ = 0.0, β = 0.0, ξ = 0.0. (b) Learning curves for the modified
back-propagation algorithm for 3000 epochs. Continuous curve: η0 = 0.75, α = 0.05,
κ = 0.002, β = 0.05, ξ = 0.7; dashed curve: η0 = 0.75, α = 0.3, κ = 0.01, β = 0.1,
ξ = 0.7.
the other hand, in Fig. 6.23b we have opposite results in that, for the modified back-
propagation algorithm, the use of batch updates yields a better performance than pattern
updates.
In Fig. 6.24 we compare the conventional and modified forms of the back-propagation
algorithm, using batch updates. In particular, this figure compares their learning curves
for the following values of control parameters.
FIGURE 6.24 (a) Learning curves of the standard and modified back-propagation
algorithms using batch updates; (b) and (c) classification results obtained with the
standard and modified forms of the algorithm, respectively.
1. Standard back-propagation:
       α = 0.75
2. Modified back-propagation (incorporating the delta-bar-delta learning rule):
       η0 = 0.75
       α = 0.75
       β = 0.2
       ξ = 0.7
In Fig. 6.24a we see that the improvement in the error performance of the modified back-
propagation algorithm over that of the standard form of the algorithm is indeed quite
significant.

In Figs. 6.24b and 6.24c we show the classification results obtained using the standard
and modified forms of the back-propagation algorithm for the same sets of parameter
values listed above. The classification results presented here are quite dramatic; they were
obtained after the presentation of 150 training epochs. In the case of standard back-
propagation, we see from Fig. 6.24b that the shapes of class C1 and class C2 have not
fully developed. In sharp contrast, the shapes of class C1 and class C2 can be clearly seen
from the classification results of Fig. 6.24c obtained using the modified back-propagation
algorithm. Note that in Figs. 6.24b and 6.24c, only correctly classified test data points
are shown.
    μ_{A∩B}(x) = min(μ_A(x), μ_B(x))                                    (6.90)

The complement of the fuzzy set A is defined by the membership function

    μ_Ā(x) = 1 − μ_A(x)                                                 (6.91)
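For illustration, the membership operations of Eqs. (6.90) and (6.91) can be realized directly (the function names are ours):

```python
def mu_intersection(mu_a, mu_b):
    """Membership of the fuzzy intersection A ∩ B, Eq. (6.90):
    the pointwise minimum of the two membership grades."""
    return min(mu_a, mu_b)

def mu_complement(mu_a):
    """Membership of the complement of fuzzy set A, Eq. (6.91)."""
    return 1.0 - mu_a

print(mu_intersection(0.7, 0.4))  # 0.4
print(mu_complement(0.7))         # 0.3 (up to floating point)
```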
One other term that we need to introduce is a universe of discourse generated by a kernel
space referring to any prescribed set of objects or constructs. Let K be a kernel space,
and let E be a set that contains K and that is generated from K by a finite application of
set-theoretic operations. Then a universe of discourse is a designated subset of E (Dubois
and Prade, 1980).
Returning to the issue at hand, consider Fig. 6.25, which presents the block diagram
of a hybrid learning system, in which an on-line fuzzy logic controller is used to adapt
the learning parameters of a multilayer perceptron with back-propagation learning. The
objective here is to provide a significant improvement in the rate of convergence of the
learning process (Choi et al., 1992). The fuzzy logic controller in Fig. 6.25 incorporates
human linguistic descriptions about the unknown learning parameters of the back-propaga-
tion algorithm. In particular, it consists of four principal components, as depicted in Fig.
6.26 (Lee, 1990):
1. Fuzzification interface, which performs the following functions:
   ■ A scale mapping that changes the range of values of input variables into a
     corresponding universe of discourse
   ■ Fuzzification that converts nonfuzzy (crisp) input data into suitable linguistic
     values, which may be viewed as labels of fuzzy sets
2. Fuzzy rule base, which consists of a set of linguistic control rules written in the
form:
“IF a set of conditions are satisfied, THEN a set of
consequences are inferred.”
3. Fuzzy inference machine, which is a decision-making logic that employs rules from
the fuzzy rule base to infer fuzzy control actions in response to fuzzified inputs.
4. Defuzzification interface, which performs the following functions:
   ■ A scale mapping that converts the range of values of output variables into a
     corresponding universe of discourse
   ■ Defuzzification that yields a nonfuzzy (crisp) control action from an inferred
     fuzzy control action
A commonly used defuzzification rule is the centroid method, according to which the
defuzzification interface produces a crisp output defined as the center of gravity of the
distribution of possible actions.
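For a discrete set of candidate actions, the centroid rule may be sketched as follows (the function and variable names are ours):

```python
def centroid_defuzzify(actions, memberships):
    """Centroid (center-of-gravity) defuzzification: the crisp output is
    the membership-weighted mean of the candidate control actions."""
    num = sum(a * m for a, m in zip(actions, memberships))
    den = sum(memberships)
    return num / den

# Candidate learning-rate changes and their inferred membership grades
actions     = [-0.01, 0.0, 0.01]
memberships = [ 0.2,  0.5, 0.3]
print(centroid_defuzzify(actions, memberships))  # ~0.001
```

Because every inferred action contributes in proportion to its membership grade, the crisp output varies smoothly as the fuzzified inputs change.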
FIGURE 6.25 Block diagram of the hybrid learning system, in which a fuzzy logic
controller adapts the learning parameters of a multilayer perceptron with
back-propagation learning.
FIGURE 6.26 Basic components of the fuzzy logic controller, mapping a nonfuzzy input
through the fuzzification interface, the fuzzy inference machine, and the defuzzification
interface to a nonfuzzy output.
The main idea behind the fuzzy control of back-propagation learning is the implementa-
tion of heuristics in the form of fuzzy “IF . . . , THEN . . .” rules that are used for the
purpose of achieving a faster rate of convergence. The heuristics are driven by the behavior
of the instantaneous sum of squared errors E given in Eq. (6.72). In the following heuristics,
the change of error (denoted by CE) is an approximation to the gradient of the error
surface, and the change of CE (denoted by CCE) is an approximation to the second-order
gradient information related to the acceleration of convergence (Choi et al., 1992):
■ IF CE is small with no sign changes in several consecutive iterations of the algorithm,
  THEN the value of the learning-rate parameter should be increased.
■ IF sign changes occur in CE in several consecutive iterations of the algorithm, THEN
  the value of the learning-rate parameter should be decreased, regardless of the value
  of CCE.
■ IF CE is small AND CCE is small, with no sign changes for several consecutive
  iterations of the algorithm, THEN the values of both the learning-rate parameter and
  the momentum constant should be increased.
The first two heuristics are recognized as heuristics 3 and 4 as presented in the previous
section, respectively. The last heuristic, involving the momentum constant, is new.
The sign change of the gradient of the error surface is identical to the sign change of
CE. Consider, for example, a situation where we have

    E(n − 2) ≤ E(n − 1)

and

    E(n − 1) > E(n)

We may then write

    CE(n − 1) = E(n − 1) − E(n − 2) ≥ 0

and

    CE(n) = E(n) − E(n − 1) < 0

The implication here is that there has been a sign change from iteration n − 2 to iteration
n of the algorithm.
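In code, CE is simply the first difference of the error sequence (the function name is ours):

```python
def change_of_error(E):
    """CE(n) = E(n) - E(n-1): first difference of the squared-error
    sequence, used as an approximation to the gradient of the error
    surface. (CCE is the first difference of CE in turn.)"""
    return [E[n] - E[n - 1] for n in range(1, len(E))]

E = [0.9, 1.0, 0.6]        # E(n-2) <= E(n-1), and E(n-1) > E(n)
CE = change_of_error(E)
print(CE)                  # first difference nonnegative, second negative:
                           # a sign change, as in the example above
```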
A fuzzy rule base for fuzzy back-propagation learning is given in Tables 6.6 and 6.7.
The associated membership functions, labeled by linguistic terms, are shown in parts (a)
TABLE 6.6 Decision Table for the Fuzzy Logic Control of the Learning-Rate Parameter η*

    CE \ CCE    NB    NS    ZE    PS    PB
    NB          NS    NS    NS    NS    NS
    NS          NS    ZE    PS    ZE    NS
    ZE          ZE    PS    PS    PS    ZE
    PS          NS    ZE    PS    ZE    NS
    PB          NS    NS    NS    NS    NS

* Definitions: NB, negative big; NS, negative small; ZE, zero; PS, positive small;
PB, positive big.
and (b) of Fig. 6.27. Table 6.6 and Fig. 6.27a pertain to changes applied to the learning-
rate parameter η of the back-propagation algorithm, whereas Table 6.7 pertains to changes
applied to the momentum constant α of the algorithm. The details of these two tables are
as follows:

■ The contents of Table 6.6 represent the value of the fuzzy variable Δη, denoting
  the change applied to the learning-rate parameter η, for fuzzified values of CE and
  CCE. For instance, we may read the following fuzzy rule from Table 6.6:

  IF CE is zero, AND IF CCE is negative small, THEN Δη is positive small.
■ The contents of Table 6.7 represent the value of the fuzzy variable Δα, denoting
  the change applied to the momentum constant α, for fuzzified values of CE and
  CCE. For instance, we may read the following fuzzy rule from Table 6.7:

  IF CE is negative small, AND CCE is negative small, THEN Δα is zero.
From the membership functions shown in Fig. 6.27 we note that the universe of discourse
for both CE and CCE is [−0.3, 0.3]; values of CE and CCE outside of this range are
clamped to −0.3 and 0.3, respectively.
Having determined whether Δη, the change in the learning-rate parameter, is positive
small, zero, or negative small, we may then assign an appropriate value to the change
Δη using the membership functions presented in Fig. 6.27b.
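The decision table lends itself directly to a lookup implementation (a sketch; the names are ours, and the entries are those of Table 6.6):

```python
# Fuzzified change applied to the learning-rate parameter, indexed by
# the fuzzified values of CE (rows) and CCE (columns), as in Table 6.6.
LABELS = ["NB", "NS", "ZE", "PS", "PB"]
ETA_TABLE = [
    ["NS", "NS", "NS", "NS", "NS"],   # CE = NB
    ["NS", "ZE", "PS", "ZE", "NS"],   # CE = NS
    ["ZE", "PS", "PS", "PS", "ZE"],   # CE = ZE
    ["NS", "ZE", "PS", "ZE", "NS"],   # CE = PS
    ["NS", "NS", "NS", "NS", "NS"],   # CE = PB
]

def delta_eta_label(ce_label, cce_label):
    """Look up the fuzzy value of the change in the learning-rate parameter."""
    return ETA_TABLE[LABELS.index(ce_label)][LABELS.index(cce_label)]

# The fuzzy rule read from Table 6.6 in the text:
# IF CE is zero, AND IF CCE is negative small, THEN the change is positive small.
print(delta_eta_label("ZE", "NS"))  # PS
```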
Choi et al. (1992) present computer simulation results, comparing the performance
of fuzzy back-propagation learning to that of standard back-propagation learning. The
experiments involved the detection of a constant signal in the presence of additive Laplacian
noise.¹⁰ The results of the computer simulations appear to show that the convergence of
fuzzy back-propagation learning is dramatically faster, and the steady-state value of the
mean-squared error is significantly smaller, than for standard back-propagation learning.

¹⁰ A sample v drawn from a Laplacian noise process is governed by the probability density function

    f(v) = (α/2) exp(−α|v|),   −∞ < v < ∞
■ Network pruning, in which case we start with a large multilayer perceptron with an
adequate performance for the problem at hand, and then prune it by weakening or
eliminating certain synaptic weights in a selective and orderly fashion.
The cascade-correlation learning architecture (Fahlman and Lebiere, 1990) is an
example of the network-growing approach. The procedure begins with a minimal network
that has some inputs and one or more output nodes as indicated by input/output consider-
ations, but no hidden nodes. The LMS algorithm, for example, may be used to train the
network. The hidden neurons are added to the network one by one, thereby obtaining a
multilayer structure. Each new hidden neuron receives a synaptic connection from each
of the input nodes and also from each preexisting hidden neuron. When a new hidden
neuron is added, the synaptic weights on the input side of that neuron are frozen; only
the synaptic weights on the output side are trained repeatedly. The added hidden neuron
then becomes a permanent feature detector in the network. The procedure of adding new
hidden neurons is continued in the manner described here until satisfactory performance
is attained.
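A highly simplified sketch of this growing procedure is given below. It is an illustration only: the input-side weights of each new hidden unit are frozen at random values, and only the output weights are refit by least squares (an LMS-style solution); it omits Fahlman and Lebiere's correlation-based training of the candidate units.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_output_weights(features, d):
    """Least-squares fit of the output weights only, leaving all
    input-side weights untouched."""
    return np.linalg.lstsq(features, d, rcond=None)[0]

def grow_network(X, d, n_hidden):
    """Cascade-style growing sketch: each new hidden unit receives the
    inputs and all preexisting hidden-unit outputs; its incoming weights
    are frozen once created, making it a permanent feature detector."""
    features = X.copy()
    for _ in range(n_hidden):
        w_in = rng.standard_normal(features.shape[1])  # frozen on creation
        h = np.tanh(features @ w_in)                   # new feature detector
        features = np.column_stack([features, h])
    return features, train_output_weights(features, d)

X = rng.standard_normal((20, 2))
d = np.sin(X[:, 0])
features, w_out = grow_network(X, d, n_hidden=3)
print(features.shape)  # (20, 5): 2 inputs plus 3 cascaded hidden features
```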
In yet another network-growing approach described in Lee et al. (1990), a third
level of computation termed the structure-level adaptation is added to the forward
pass (function-level adaptation) and backward pass (parameter-level adaptation). In
this third level of computation the structure of the network is adapted by changing
the number of neurons and the structural relationship among neurons in the network.
The criterion used here is that when the estimation error (after convergence) is larger
than a desired value, a new neuron is added to the network in a position where it
is most needed. The desirable position for the new neuron is determined by monitoring
the learning behavior of the network. In particular, if after a long period of parameter
adaptation (training), the synaptic weight vector pertaining to the inputs of a neuron
continues to fluctuate significantly, it may be inferred that the neuron in question
does not have enough representation power to learn its proper share of the task. The
structure-level adaptation also includes a provision for the possible annihilation of
neurons. Specifically, a neuron is annihilated when it is not a functioning element of
the network or it is a redundant element of the network. This method of network
growing appears to be computationally intensive.
In this section we focus on two network-pruning approaches. In the first approach,
generalization is improved through the use of complexity regularization. This approach
is exemplified by the weight-decay (Hinton, 1987) and the weight-elimination (Weigend
et al., 1991) procedures. In the second approach, synaptic weights are removed from the
network on the basis of their saliencies, the calculations of which involve the use of the
Hessian matrix of the error surface. This latter approach is exemplified by the so-called
optimal brain damage (LeCun et al., 1990) and the optimal brain surgeon (Hassibi et al.,
1993) procedures.
Complexity Regularization
In designing a multilayer perceptron by whatever method, we are in effect building a
nonlinear model of the physical phenomenon responsible for the generation of the
input-output examples used to train the network. Insofar as the network design is statistical
in nature, we need an appropriate measure of the fit between the model (network) and
the observed data. This implies that unless we have some prior information, the design
procedure should include a criterion for selecting the model complexity (i.e., the number of
independently adjusted parameters of the network). Various criteria for model-complexity
selection are described in the statistics literature, important examples of which include the
6.17 / Network-Pruning Techniques 207
minimum description length (MDL) criterion (Rissanen, 1978, 1989), and an information-
theoretic criterion (AIC) (Akaike, 1974). Although these criteria do indeed differ from
each other in their exact details, they share a common form of composition, as described
here (Haykin, 1991; Priestley, 1981):
    Model-complexity criterion = (log-likelihood function)
                                 + (model-complexity penalty)           (6.92)
The basic difference between the various criteria lies in the definition of the model-
complexity penalty term.
In the context of back-propagation learning, or any other supervised learning procedure
for that matter, we may follow a similar approach. Specifically, the learning objective is
to find a weight vector that minimizes the total risk

    R(w) = ℰs(w) + λℰc(w)                                               (6.93)

where ℰs(w) is the standard performance measure, ℰc(w) is the complexity penalty, and
λ is a regularization parameter that balances the two terms.
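For illustration, the structure of the total risk of Eq. (6.93) can be sketched as follows; the function names and the particular choice of a squared-norm penalty are ours (anticipating the weight-decay procedure):

```python
import numpy as np

def total_risk(standard_error, w, lam, penalty):
    """Total risk of Eq. (6.93): a standard performance measure plus a
    regularization parameter lambda times a complexity penalty."""
    return standard_error + lam * penalty(w)

def squared_norm(w):
    """A weight-decay style complexity penalty: squared norm of w."""
    return float(np.sum(w ** 2))

w = np.array([1.0, -0.5, 0.01])
# A larger lambda weights model simplicity more heavily than data fit.
print(total_risk(0.3, w, lam=0.1, penalty=squared_norm))
```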
Weight Decay. In the weight-decay procedure (Hinton, 1987), the complexity regular-
ization is defined as the squared norm of the weight vector w in the network, as shown
by

    ℰc(w) = ||w||² = Σ_{i ∈ C_total} wi²                                (6.94)

where the set C_total refers to all the synaptic weights in the network. This procedure operates
by forcing some of the synaptic weights in the network to take on values close to zero,
while permitting other weights to retain their relatively large values. Accordingly, the
weights of the network are grouped roughly into two categories: those that have a large
influence on the network (model), and those that have little or no influence on it. The
weights in the latter category are referred to as excess weights. In the absence of complexity
In the weight-elimination procedure (Weigend et al., 1991), the complexity regularization
is instead defined by

    ℰc(w) = Σ_{i ∈ C_total} (wi/w0)² / [1 + (wi/w0)²]                   (6.95)

where w0 is a preassigned parameter.
Let R(n) denote the corresponding value of the total risk. Since R(n) can increase or
decrease from one iteration to the next, it is compared to three quantities:

■ Previous risk, R(n − 1)
■ Average risk, defined by
we expect that the smoother this function is, the smaller will be the influence of the
complexity regularization. Thus, using a second-order derivative to approximate the curva-
ture of the approximating function realized by the network (model), the complexity
regularization is defined by

    ℰc(w) = (1/2) Σ_{j=1}^{q} Σ_{i=1}^{p} (∂²yj(w)/∂xi²)²               (6.96)
where y j ( w ) is the output of neuron j in the output layer of the network, p is the number
of source nodes, and q is the number of output neurons. The curvature-driven smoothing
procedure improves generalization by reducing the tendency of the network to overfit the
training data. Moreover, it makes it possible to produce a solution whose properties are
controlled by a single scalar parameter, namely, the regularization parameter A. Bishop
(1990) describes a generalized form of back-propagation learning that can be used to
minimize the total risk R ( w ) of Eq. (6.93) with the complexity regularization defined as
in Eq. (6.96). No network pruning is used here; improvement in generalization performance
is attained by assigning a suitable value to the regularization parameter A.
Using a Taylor series, the perturbation of the cost function about the current operating
point in weight space may be expressed as

    δℰ = Σ_i gi δwi + (1/2) Σ_i Σ_j hji δwi δwj + O(||δw||³)            (6.97)

where gi is the ith component of the gradient vector of ℰ, and hji is the jith component
of the Hessian matrix of ℰ, both measured with respect to the parameters of the network.
That is, we have

    gi = ∂ℰ/∂wi                                                         (6.98)

and

    hji = ∂²ℰ/∂wj∂wi                                                    (6.99)
The requirement is to identify a set of parameters whose deletion from the multilayer perceptron will cause the least increase in the value of the cost function ℰ. To solve this problem in practical terms, we make the following approximations:
1. Extremal Approximation. We assume that parameters are deleted from the network only after the training process has converged. The implication of this assumption is that the parameters have a set of values corresponding to a local minimum or global minimum of the error surface. In such a case, the g_i(n) may be set equal to zero, and the first summation on the right-hand side of Eq. (6.97) may therefore be ignored. Otherwise, the saliency measures (defined later) will be invalid for the
problem at hand.
2. Quadratic Approximation. We assume that the error surface around a local minimum
or global minimum is nearly “quadratic.” Hence the higher-order terms in Eq.
(6.97) may also be neglected.
Under these two assumptions, Eq. (6.97) is approximated simply as

$$\delta\mathscr{E} \simeq \frac{1}{2} \sum_i \sum_j h_{ji}\, \delta w_i\, \delta w_j = \frac{1}{2}\, \delta\mathbf{w}^T \mathbf{H}\, \delta\mathbf{w} \qquad (6.100)$$
where δw is the perturbation applied to the weight vector w, and H is the Hessian matrix containing all the second-order derivatives of ℰ with respect to the elements of the weight vector w. The optimal brain damage (OBD) procedure (LeCun et al., 1990) simplifies the computations by making a further assumption: The Hessian matrix H is a diagonal matrix. On the other hand, no such assumption is made in the optimal brain surgeon (OBS) procedure (Hassibi et al., 1993); accordingly, it contains the OBD procedure as a special case. From here on, we follow the OBS strategy.
The goal of OBS is to set one of the synaptic weights to zero so as to minimize the incremental increase in ℰ given in Eq. (6.100). Let w_i denote this particular synaptic weight. The elimination of this weight is equivalent to the condition

$$\delta w_i + w_i = 0$$

or

$$\mathbf{1}_i^T\, \delta\mathbf{w} + w_i = 0 \qquad (6.101)$$

where 1_i is the unit vector whose elements are all zero, except for the ith element, which is equal to unity. We may now restate the goal of OBS as follows (Hassibi et al., 1993):

Minimize the quadratic form (1/2) δwᵀH δw with respect to the incremental change in the weight vector, δw, subject to the constraint that 1_iᵀ δw + w_i is zero, and then minimize the resultant with respect to the index i.

To solve this constrained optimization problem, we first construct the Lagrangian

$$S = \frac{1}{2}\, \delta\mathbf{w}^T \mathbf{H}\, \delta\mathbf{w} + \lambda\left(\mathbf{1}_i^T\, \delta\mathbf{w} + w_i\right) \qquad (6.102)$$
where λ is the Lagrange multiplier. Then, taking the derivative of the Lagrangian S with respect to δw, applying the constraint of Eq. (6.101), and using matrix inversion, we find that the optimum change in the weight vector w is

$$\delta\mathbf{w} = -\frac{w_i}{\left[\mathbf{H}^{-1}\right]_{ii}}\, \mathbf{H}^{-1}\, \mathbf{1}_i \qquad (6.103)$$

and the corresponding optimum value of the Lagrangian S for the weight w_i is

$$S_i = \frac{w_i^2}{2\left[\mathbf{H}^{-1}\right]_{ii}} \qquad (6.104)$$

The saliency S_i represents the increase in the squared error (performance measure) that results from the deletion of w_i.
The difficult computational aspect of the OBS procedure lies in the inversion of the
Hessian matrix H. The Hessian matrix would have to be nonsingular for its inverse to
be computable. In general, for back-propagation learning the Hessian matrix is always
nonsingular. Moreover, there is experimental evidence suggesting that neural network training problems are intrinsically ill-conditioned, leading to a Hessian matrix that is almost rank-deficient (Saarinen et al., 1991, 1992), which may raise computational difficulties of its own.
Hassibi et al. (1993) compute the Hessian matrix H for the batch mode of learning by
proceeding as follows. First, they derive an approximate formula for the Hessian matrix
evaluated at a local minimum of the error surface. The approximation is made up of a
sum of outer product terms, with each such term requiring structural information of the
network in the form of first-order partial derivatives only; see Problem 6.17. Second, they
use a result in matrix algebra known as the matrix inversion lemma to calculate the inverse matrix H⁻¹. The OBS procedure is thus more precise than the OBD procedure, but computationally more intensive.
In their paper, Hassibi et al. report that on some benchmark problems, the OBS procedure
resulted in smaller networks than those obtained using the weight-decay procedure. It
is also reported that as a result of applying the OBS procedure to the NETtalk multilayer perceptron, involving a single hidden layer and 18,000 weights (Sejnowski and Rosenberg, 1987), the network was pruned to a mere 1,560 weights, which represents a dramatic reduction in the size of the network.
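The complete OBS step described above (compute the saliencies S_i from the inverse Hessian, select the weight of smallest saliency, and correct the remaining weights by Eq. (6.103)) can be sketched as follows. The function name and the toy Hessian are our own illustrative choices.

```python
import numpy as np

def obs_prune_step(w, H):
    """One Optimal Brain Surgeon step: find the weight with the smallest
    saliency S_i = w_i^2 / (2 [H^-1]_ii), Eq. (6.104), and adjust all
    weights by Eq. (6.103) so that the selected weight becomes zero."""
    H_inv = np.linalg.inv(H)          # requires a nonsingular Hessian
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    i = int(np.argmin(saliency))      # weight whose deletion costs least
    dw = -(w[i] / H_inv[i, i]) * H_inv[:, i]   # optimum change, Eq. (6.103)
    return w + dw, i, saliency[i]

# Toy quadratic error surface with a symmetric, positive-definite Hessian
H = np.array([[2.0, 0.0, 0.0],
              [0.0, 4.0, 0.5],
              [0.0, 0.5, 1.0]])
w = np.array([0.05, 1.0, -0.8])
w_new, i, S_i = obs_prune_step(w, H)
print(i, w_new)   # the selected weight is driven exactly to zero
```

Note that, unlike OBD, the correction term redistributes the deleted weight's contribution over the surviving weights through the off-diagonal elements of H⁻¹.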
algorithm of the RLS type differs from the LMS algorithm in that it utilizes information contained in the input data more effectively, going back to the first iteration of the algorithm. The use of an RLS filtering algorithm offers the following advantages:
■ Fast speed of convergence
■ Built-in learning-rate parameter
■ Insensitivity to variations in the condition number of the input data; the condition number refers to the ratio of the largest eigenvalue to the smallest eigenvalue of the covariance matrix of the input data
In a similar sort of way, we may train a multilayer perceptron using an extended RLS
(Kalman) filtering algorithm applied to back-propagation learning (Palmieri et al., 1991;
Shah and Palmieri, 1990). The term "extended" is used here to account for the fact that the neurons of a multilayer perceptron are nonlinear, and would therefore have to be linearized in order to accommodate the application of the standard RLS filtering algorithm.
$$\mathscr{E}(\mathbf{w}) = \frac{1}{2} \sum_{j \in C} \left[ d_j(n) - y_j(n) \right]^2 \qquad (6.105)$$

where d_j(n) and y_j(n) are the desired response and actual response of output neuron j for pattern n, respectively, and the set C includes all the output neurons of the network; the dependence of the cost function ℰ(w) on the weight vector w arises because the output y_j(n) is itself dependent on w. Now we could apply the extended form of the RLS (Kalman) filtering algorithm to the problem at hand by linearizing the cost function ℰ(w) at each working point, as described by Singhal and Wu (1989). Unfortunately, the resulting
computational complexity becomes prohibitive, because such an approach requires the
storage and updating of the error covariance matrix, whose size is the square of the number of synaptic connections in the network. To overcome this problem, we may simplify the
application of extended Kalman filtering by partitioning the global problem into a number
of subproblems, each one at the neuron level (Palmieri et al., 1991; Shah and Palmieri,
1990).
Consider then neuron i, which may be located anywhere in the network. During training,
the behavior of neuron i may be viewed as a nonlinear dynamical system, which in the
context of Kalman filter theory may be described by the state and measurement equations
as follows (Haykin, 1991; Anderson and Moore, 1979):
$$\mathbf{w}_i(n+1) = \mathbf{w}_i(n) \qquad (6.106)$$

$$d_i(n) = \varphi\!\left(\mathbf{x}_i^T(n)\, \mathbf{w}_i(n)\right) + e_i(n) \qquad (6.107)$$

where the iteration n corresponds to the presentation of the nth pattern, x_i(n) is the input vector of neuron i, the activation function φ(·) is responsible for the nonlinearity in the neuron, and e_i(n) is the measurement error at the output of neuron i. The weight vector w_i of the optimum model for neuron i is to be "estimated" through training with examples.
It is assumed that the activation function is differentiable. Accordingly, we may use the Taylor series to expand the measurement equation (6.107) about the current estimate ŵ_i(n); the partial derivative of the measurement with respect to the weight vector, evaluated at this estimate, is

$$\mathbf{q}_i(n) = \left. \frac{\partial \varphi\!\left(\mathbf{x}_i^T(n)\, \mathbf{w}_i(n)\right)}{\partial \mathbf{w}_i(n)} \right|_{\mathbf{w}_i(n) = \hat{\mathbf{w}}_i(n)} \qquad (6.111)$$

where

$$\mathscr{E}_i(n) = \frac{1}{2} \left[ d_i(n) - y_i(n) \right]^2 \qquad (6.112)$$

is the instantaneous squared error at the output of neuron i.
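The neuron-level decoupling can be illustrated with a sketch of one extended-Kalman update for a single logistic neuron, in which the linearized derivative vector plays the role of q_i(n) in Eq. (6.111). This is our own minimal rendering, not the MEKA implementation of Shah and Palmieri; the noise variance `r` and the simulation set-up are assumptions.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def neuron_ekf_update(w_hat, P, x, d, r=0.01):
    """One extended-Kalman update for a single neuron: linearize the
    measurement y = logistic(x^T w) about the current estimate w_hat
    (the derivative vector q corresponds to Eq. (6.111)), then apply
    the standard scalar-measurement Kalman correction."""
    y = logistic(x @ w_hat)
    q = y * (1.0 - y) * x            # derivative of the output wrt w
    s = q @ P @ q + r                # innovation variance
    k = (P @ q) / s                  # Kalman gain
    w_hat = w_hat + k * (d - y)      # correct with the error d - y
    P = P - np.outer(k, q @ P)       # update the error covariance
    return w_hat, P

rng = rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0, 0.5])
w_hat, P = np.zeros(3), np.eye(3)
for _ in range(500):
    x = rng.normal(size=3)
    w_hat, P = neuron_ekf_update(w_hat, P, x, logistic(x @ w_true))
print(np.round(w_hat, 1))   # approaches w_true
```

In the full algorithm, one such recursion runs independently for every neuron in the network, which is what keeps the covariance storage linear rather than quadratic in the total number of synaptic connections.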
Simulation results for fairly difficult pattern-classification problems are presented in Shah and Palmieri (1990), which demonstrate the superior performance of MEKA (the multiple extended Kalman algorithm) compared to the global version of the extended Kalman filter algorithm (Singhal and Wu, 1989).
Conjugate-Gradient Method
In the first-order method of gradient descent, the direction vector (along which an adjust-
ment to the weight vector is made) is simply the negative of the gradient vector. Conse-
quently, the approach to a minimum taken by the solution may follow a zigzag path. The
conjugate-gradient method avoids this problem by incorporating an intricate relationship
between the direction and gradient vectors. The method was first applied to the general
unconstrained minimization problem by Fletcher and Reeves (1964). The conjugate-
gradient method is guaranteed to locate the minimum of any quadratic function of N
variables in at most N steps. For nonquadratic functions, as exemplified by the cost
function used in the training of a multilayer perceptron, the process is iterative rather than
N-step, and a criterion for convergence is required.
Let p(n) denote the direction vector at iteration n of the algorithm. Then the weight vector of the network is updated in accordance with the rule

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\, \mathbf{p}(n) \qquad (6.117)$$

where η(n) is the learning-rate parameter, on which we shall have more to say later. The initial direction vector p(0) is set equal to the negative gradient vector g(n) at the initial point n = 0; that is,

$$\mathbf{p}(0) = -\mathbf{g}(0) \qquad (6.118)$$

Each successive direction vector is then computed as a linear combination of the current gradient vector and the previous direction vector. We thus write

$$\mathbf{p}(n+1) = -\mathbf{g}(n+1) + \beta(n)\, \mathbf{p}(n) \qquad (6.119)$$

where β(n) is a time-varying parameter. There are various rules for determining β(n) in terms of the gradient vectors g(n) and g(n + 1); two alternative rules are the following.

The Fletcher-Reeves formula (Fletcher and Reeves, 1964):

$$\beta(n) = \frac{\mathbf{g}^T(n+1)\, \mathbf{g}(n+1)}{\mathbf{g}^T(n)\, \mathbf{g}(n)} \qquad (6.120)$$
The accuracy of the line search has a profound influence on the performance of the
conjugate-gradient method (Johansson et al., 1990).
The use of the conjugate-gradient method for the supervised training of multilayer
perceptrons has been studied by Kramer and Sangiovanni-Vincentelli (1989), Johansson et al. (1990), and other researchers. The findings reported in the first two of these papers on small Boolean learning problems seem to show that back-propagation learning based on the conjugate-gradient method requires fewer epochs than the standard back-propagation algorithm, but it is computationally more complex.
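Equations (6.117) through (6.120) can be sketched directly for a quadratic cost, where an exact line search is available in closed form and convergence occurs in at most N steps. The function name and the toy matrices below are our own illustrative choices.

```python
import numpy as np

def fletcher_reeves_cg(A, b, w):
    """Minimize f(w) = 0.5 w^T A w - b^T w (A symmetric positive definite)
    with the conjugate-gradient rules (6.117)-(6.120).  For a quadratic
    cost in N variables the minimum is reached in at most N steps."""
    g = A @ w - b                          # gradient at the initial point
    p = -g                                 # Eq. (6.118): p(0) = -g(0)
    for _ in range(len(b)):
        eta = -(g @ p) / (p @ A @ p)       # exact line search along p(n)
        w = w + eta * p                    # Eq. (6.117)
        g_new = A @ w - b
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves, Eq. (6.120)
        p = -g_new + beta * p              # Eq. (6.119)
        g = g_new
    return w

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
w_star = fletcher_reeves_cg(A, b, np.zeros(2))
print(w_star)   # the solution of A w = b, i.e. [0.4, -0.2]
```

For the nonquadratic cost of a multilayer perceptron, the exact line search above would be replaced by an iterative line search, and the loop would run until a convergence criterion is met rather than for exactly N steps.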
Newton’s Method
The conjugate-gradient method uses the Hessian matrix in its derivation, and yet the
algorithm is formulated in such a way that the estimation and storage of the Hessian
matrix are completely avoided. In direct contrast, the Hessian matrix plays a prominent
role in Newton’s method and its variants.
Using the Taylor series expansion, we may approximate an incremental change Δℰ_av(w) in the cost function ℰ_av(w) as a quadratic function of Δw, an incremental change in the weight vector w, as follows:

$$\Delta\mathscr{E}_{av}(\mathbf{w}) = \mathscr{E}_{av}(\mathbf{w} + \Delta\mathbf{w}) - \mathscr{E}_{av}(\mathbf{w}) \qquad (6.124)$$

$$\simeq \mathbf{g}^T \Delta\mathbf{w} + \frac{1}{2}\, \Delta\mathbf{w}^T \mathbf{H}\, \Delta\mathbf{w} \qquad (6.125)$$

where g is the gradient vector and H is the Hessian matrix of ℰ_av(w). Differentiating this quadratic approximation with respect to Δw, the change Δℰ_av(w) is minimized when

$$\mathbf{g} + \mathbf{H}\, \Delta\mathbf{w} = \mathbf{0}$$

that is, when Δw takes the Newton value Δw = −H⁻¹g.
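A minimal sketch of the Newton step: solve g + HΔw = 0 for Δw rather than forming H⁻¹ explicitly. On a purely quadratic cost a single step lands on the minimum; the toy Hessian and minimum below are assumed for illustration only.

```python
import numpy as np

def newton_step(g, H):
    """Solve H dw = -g for the Newton weight change dw = -H^{-1} g,
    without explicitly inverting the Hessian."""
    return np.linalg.solve(H, -g)

H = np.array([[2.0, 0.3], [0.3, 1.0]])   # Hessian (assumed positive definite)
w = np.array([1.0, -2.0])                # current weight vector
w_min = np.array([0.2, 0.5])             # minimum of the toy quadratic cost
g = H @ (w - w_min)                      # gradient of 0.5 (w - w*)^T H (w - w*)
w_new = w + newton_step(g, H)
print(w_new)                             # equals w_min for a quadratic cost
```

For the nonquadratic cost of a multilayer perceptron the step is applied iteratively, and the Hessian must be kept positive definite (or regularized) for the step to be a descent direction.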
FIGURE 6.29 Multilayer perceptron with a single hidden layer and an output layer.
In this figure, node i refers to the input layer, neuron j refers to the hidden layer, and neuron
k refers to the output layer. It is assumed that all the hidden and output neurons of the
network use the same sigmoidal nonlinearity in the form of a logistic function. Let w_ji denote the synaptic weight of hidden neuron j connected to node i in the input layer. Let x_{i|a} denote the ith component of the input pattern (vector), given case a. Then, the net internal activity of neuron j is given by

$$v_{j|a} = \sum_i w_{ji}\, x_{i|a} \qquad (6.130)$$

Consider next the output layer of the network. Let w_kj denote the synaptic weight of output neuron k connected to hidden neuron j. The net internal activity of output neuron k is thus given by

$$v_{k|a} = \sum_j w_{kj}\, y_{j|a} \qquad (6.131)$$

where y_{j|a} = φ(v_{j|a}) is the output of hidden neuron j. The kth output of the network, given case a at the input, is therefore

$$y_{k|a} = \varphi(v_{k|a}) \qquad (6.132)$$
The output y_{k|a} is assigned a probabilistic interpretation as follows. Define

$$P_{k|a} = y_{k|a} \qquad (6.133)$$

as the estimate of the conditional probability that the proposition k is true, given the case a at the input. On this basis, we may interpret 1 − P_{k|a} = 1 − y_{k|a} as the estimate of the conditional probability that the proposition k is false, given the input case a. Correspondingly, let Q_{k|a} denote the actual (true) value of the conditional probability that the proposition k is true, given the input case a. This means that 1 − Q_{k|a} is the actual value of the conditional probability that the proposition k is false, given the input case a. Thus, we may define the relative entropy for the multilayer perceptron of Fig. 6.29 as follows:

$$H_{P \| Q} = \sum_a P_a \sum_k \left[ Q_{k|a} \ln\frac{Q_{k|a}}{P_{k|a}} + \left(1 - Q_{k|a}\right) \ln\frac{1 - Q_{k|a}}{1 - P_{k|a}} \right] \qquad (6.134)$$
where P_a is the a priori probability of occurrence of case a at the input of the multilayer perceptron. The relative entropy is also referred to in the information theory literature as the Kullback-Leibler measure of information (Kullback, 1968). A more detailed discussion of this criterion is presented in Chapter 11. For the present it suffices to say that the relative entropy H_{P‖Q} provides a quantitative measure of how closely the conditional probability P_{k|a} matches Q_{k|a}. When P_{k|a} = Q_{k|a} we have H_{P‖Q} = 0; otherwise, H_{P‖Q} > 0.
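The relative entropy criterion can be evaluated directly from its definition. The sketch below (our own notation, with binary propositions) returns zero when the estimated probabilities match the true ones and a positive value otherwise; the numerical values are hypothetical.

```python
import numpy as np

def relative_entropy(P_a, Q, P):
    """Relative entropy H_{P||Q} of Eq. (6.134): Q[a][k] is the true
    conditional probability that proposition k holds given case a,
    P[a][k] the network's estimate, and P_a[a] the prior of case a."""
    Q = np.asarray(Q); P = np.asarray(P)
    terms = Q * np.log(Q / P) + (1 - Q) * np.log((1 - Q) / (1 - P))
    return float(np.sum(np.asarray(P_a)[:, None] * terms))

P_a = [0.5, 0.5]                      # priors of the two input cases
Q = [[0.9, 0.1], [0.2, 0.8]]          # true conditional probabilities
print(relative_entropy(P_a, Q, Q))    # 0.0 when the estimate matches Q
print(relative_entropy(P_a, Q, [[0.6, 0.4], [0.5, 0.5]]))   # > 0 otherwise
```

Note that the measure is not symmetric in P and Q, and it diverges as any estimate P approaches 0 or 1 while the corresponding true probability does not.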
To perform the supervised learning of the underlying probability distribution, we use gradient descent on the relative entropy H_{P‖Q} in weight space. First, we use the chain rule to express the partial derivative of H_{P‖Q} with respect to the synaptic weight w_kj of output neuron k as follows:

$$\frac{\partial H_{P \| Q}}{\partial w_{kj}} = -\sum_a P_a \left( Q_{k|a} - y_{k|a} \right) y_{j|a} \qquad (6.136)$$

Next, we use Eq. (6.134) to express the partial derivative of H_{P‖Q} with respect to the synaptic weight w_ji of hidden neuron j as follows:

(6.137)

(6.138)

Hence, using Eqs. (6.128) through (6.133) in (6.138) and then substituting the result in Eq. (6.137), we get the desired partial derivative:

(6.139)
where φ′(·) is the derivative of the logistic function φ(·) with respect to its argument. Assuming the use of the same learning-rate parameter η for all weight changes made in the network, the weight changes applied to the synaptic weights w_kj and w_ji are computed, respectively, in accordance with the gradient-descent rules

$$\Delta w_{kj} = -\eta\, \frac{\partial H_{P \| Q}}{\partial w_{kj}} \qquad (6.140)$$

and

$$\Delta w_{ji} = -\eta\, \frac{\partial H_{P \| Q}}{\partial w_{ji}} \qquad (6.141)$$
Note that, according to the assignment described in Eq. (6.133), the a posteriori conditional probability Q_{k|a} plays the role of a desired (target) response for the output of the multilayer perceptron, given that the input data are taken from class a. Thus, using the relative entropy as the criterion for training, and assuming that the multilayer perceptron has enough free parameters and enough training data, and that the training does not get stuck at a local minimum, the outputs of the multilayer perceptron will approximate the a posteriori conditional probabilities of the underlying distribution. It is important for the multilayer perceptron to converge to a global minimum, because it is only then that we are assured of the relative entropy H_{P‖Q} being close to zero, which in turn means that the estimated conditional probability P_{k|a} is a close match for the true conditional probability Q_{k|a}; it is difficult, if not virtually impossible, to guarantee this condition in practice.
6.21 Discussion
Back-propagation learning has indeed emerged as the standard algorithm for the training
of multilayer perceptrons, against which other learning algorithms are often benchmarked.
The back-propagation algorithm derives its name from the fact that the partial derivatives
of the cost function (performance measure) with respect to the free parameters (synaptic
weights and thresholds) of the network are determined by back-propagating the error
signals (computed by the output neurons) through the network, layer by layer. In so doing,
it solves the credit-assignment problem in a most elegant fashion. The computing power of the algorithm lies in its two main attributes:
■ Local method for updating the synaptic weights and thresholds of the multilayer perceptron
■ Efficient method for computing all the partial derivatives of the cost function with respect to these free parameters
For the cost function we may use a criterion based on (1) mean-square error, or
(2) relative entropy. The mean-square error criterion is well suited for solving pattern-
classification and nonlinear regression problems. On the other hand, the relative entropy
criterion is particularly useful for estimating the conditional a posteriori probabilities of
an underlying distribution.
The specific details involved in the design of a multilayer perceptron naturally depend
on the application of interest. We may, however, make two distinctions:
1. In pattern classification, all the neurons of the network are nonlinear. When the
nonlinearity is in the form of the logistic function, the output of neuron k in the
output layer of the network is an asymptotic approximation of the a posteriori
probability of class k, provided that the network has enough free parameters, enough
training data, and does not get stuck in a local minimum.
2. In nonlinear regression, all the neurons in the output layer are linear. This particular
choice is justified as follows. Assuming that all the information content of the
training vectors is transferred to the synaptic weights and thresholds of the neurons,
the residue (approximation error) would ideally be modeled as a realization of white
Gaussian noise. The white property means that the power spectrum of the residue
is constant and therefore independent of frequency. The Gaussian property means
that the amplitude of the residue is Gaussian distributed. Now, the Gaussian distribution occupies the complete amplitude range (−∞, ∞). It follows therefore that the dynamic range of the network output must likewise be (−∞, ∞), which in turn requires that the neurons in the output layer of the network all be linear.
In the context of nonlinear regression, various methods have been proposed for improv-
ing performance by combining two or more neural networks into a single composite
structure. One such method is the basic ensemble method (BEM) that uses the simple
idea of ensemble averaging (Perrone and Cooper, 1993). Suppose that we have an ensemble of K regression estimators (e.g., multilayer perceptrons) trained on different data sets drawn from the same population. Let F_k(x) denote the kth regression estimator, where x is the input vector and k = 1, 2, . . . , K. The ensemble-averaged regression estimator is defined simply as

$$\bar{F}(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} F_k(\mathbf{x}) \qquad (6.142)$$
Perrone and Cooper (1993) demonstrate the generality of this method by showing that it
is applicable to a very wide class of cost functions.
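Equation (6.142) amounts to a one-line average. In the sketch below, hypothetical "estimators" stand in for trained multilayer perceptrons, each corrupted by its own fixed error; the example illustrates why the ensemble average tends to err no more, and usually less, than a typical single estimator.

```python
import numpy as np

# Basic ensemble method (BEM), Eq. (6.142): average K regression
# estimators trained (here, simulated) on different data sets.
def ensemble_average(estimators, x):
    return sum(F(x) for F in estimators) / len(estimators)

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(x)
# Each "estimator" is the true function plus its own fixed error term,
# standing in for the bias of a network trained on one particular data set.
estimators = [
    (lambda x, e=rng.normal(0, 0.2): np.sin(x) + e) for _ in range(10)
]
x = 1.0
single_errors = [abs(F(x) - true_f(x)) for F in estimators]
ensemble_error = abs(ensemble_average(estimators, x) - true_f(x))
print(ensemble_error, np.mean(single_errors))
```

When the individual errors are uncorrelated with zero mean, averaging cancels them in part, which is the intuition behind the BEM result of Perrone and Cooper.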
6.22 Applications
Multilayer perceptrons, trained with the back-propagation algorithm, have been applied
successfully in a variety of diverse areas. These applications include the following:
■ NETtalk: neural networks that learn to pronounce English text (Sejnowski and Rosenberg, 1987)
■ Speech recognition (Cohen et al., 1993; Renals et al., 1992a, b)
■ Optical character recognition (Sackinger et al., 1992; LeCun et al., 1990a)
■ On-line handwritten character recognition (Guyon, 1990)
■ Combining visual and acoustic speech signals for improved intelligibility (Sejnowski et al., 1990)
■ System identification (Narendra and Parthasarathy, 1990)
■ Control (Werbos, 1989, 1992; Nguyen and Widrow, 1989; Jordan and Jacobs, 1990)
■ Steering of an autonomous vehicle (Pomerleau, 1992)
■ Radar target detection and classification (Haykin and Bhattacharya, 1992; Haykin and Deng, 1991; Orlando et al., 1990)
■ Passive sonar detection and classification with a low false-alarm rate (Casselman et al., 1991)
■ Medical diagnosis of heart attacks (Baxt, 1993; Harrison et al., 1991)
■ Modeling of the control of eye movements (Robinson, 1992)
To document all of these applications in full, along with others not mentioned here, would require a whole book. Instead, we have chosen to present detailed discussions of two specific applications: optical character recognition and speech recognition, the needs for both of which are rather pervasive.
[Figure 6.30: architecture of the OCR network. Layer sizes (layer: neurons, connections): layer 5 (output): 10 neurons, 3,000 connections; layer 4: 300 neurons, 1,200 connections; layer 3: 1,200 neurons, 50,000 connections; layer 2: 784 neurons, 3,136 connections; layer 1: 3,136 neurons, 78,400 connections; input: 20 × 20 (= 400) source nodes.]
The network output also provides a measure of the confidence that can be used to reject ambiguous digits. Moreover, since the network has no feedback, the classification is accomplished in a single pass.
Figure 6.31 represents an example depicting all the states of the OCR network for the
case of a handwritten digit 4. The states of the input and four hidden layers are shown as gray levels, whereas the states of the output layer are represented by the size of the black (negative) and white (positive) squares.
Of all the five computation layers of the network, only the output layer is fully connected
with independent synaptic weights. The four hidden layers are carefully constrained to
FIGURE 6.31 Example of the states of the OCR network with a handwritten 4 as input; the output layer (digits 0-9; 3,000 connections) is evaluated by a DSP, and the four hidden layers (130,000 connections) are evaluated by the NN chip. (From E. Sackinger et al., 1992a, with permission of IEEE.)
improve the generalization capability of the network for input patterns that are not used during training. This is accomplished by the use of local receptive fields, shown shaded in Fig. 6.30. Weight sharing is also used in these four layers, which consists of having
several synapses of the network controlled by a single weight. This has the beneficial
effect of reducing the number of “free” parameters, with the result that the network has
many synapses but relatively few free parameters. Owing to the redundant nature of the
input data and the constraints imposed on the network, the learning time is reduced
significantly.
Each neuron in the first hidden layer has 25 inputs connected to a local receptive field made up of a 5 × 5 pixel neighborhood in the input (image) layer. The neurons in this
layer are grouped into four feature maps. Two adjacent neurons, belonging to the same
feature map, have their local receptive fields displaced by one pixel, as illustrated in Fig.
6.32. Weight sharing in this layer is achieved by assigning an identical set of synaptic
weights to all the neurons within each feature map. Hence the characterization of the first
hidden layer is completely determined by just 4 × 25 = 100 free parameters, plus four
bias values. The operation performed by the first hidden layer may thus be viewed
essentially as four separate, two-dimensional, nonlinear convolutions of the four feature
maps with the pixel image.
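The shared-weight feature maps described above can be sketched as nonlinear 2-D convolutions. The kernel values and bias here are illustrative only, and the map dimensions differ from the actual network (its first layer has 3,136 neurons, i.e., four 28 × 28 maps, which implies a padded input; an unpadded 20 × 20 input yields 16 × 16 maps below).

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def feature_map(image, kernel, bias):
    """One feature map of the first hidden layer: every neuron applies
    the SAME 5x5 weight set (weight sharing) to its own 5x5 receptive
    field, displaced by one pixel between adjacent neurons, i.e., a
    nonlinear 2-D convolution of the shared kernel with the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            receptive_field = image[r:r + kh, c:c + kw]
            out[r, c] = logistic(np.sum(kernel * receptive_field) + bias)
    return out

rng = np.random.default_rng(1)
image = rng.random((20, 20))             # 20 x 20 input, as in the OCR network
kernels = rng.normal(0, 0.1, (4, 5, 5))  # four maps -> 4 x 25 shared weights
maps = [feature_map(image, k, 0.0) for k in kernels]
print(maps[0].shape)
```

Note how the parameter count matches the text: four kernels of 25 weights each (plus four biases) characterize the entire layer, regardless of how many neurons share them.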
The second hidden layer is designed to reduce the spatial resolution of the four feature
maps generated by the first layer, resulting in four new feature maps that are a quarter
of the original size. The purpose of this layer is to provide some degree of translational
and rotational invariance, which is accomplished as follows. The neurons in this layer
are, as before, grouped into four feature maps, but now each neuron averages four inputs.
Again, the architecture of local receptive fields and weight sharing is used, with one
important change: The local receptive fields of adjacent neurons in a feature map do not
overlap; rather, they are now displaced by two input units, as illustrated in Fig. 6.33. This
has the effect of reducing the spatial resolution.
The third hidden layer performs feature extraction using a 5 × 5 receptive field in a manner similar to the first hidden layer. However, there are some basic differences to be
FIGURE 6.32 Illustration of the layout of the first hidden layer of the OCR network, showing the overlapping receptive fields of two adjacent neurons on the pixel image. (From E. Sackinger et al., 1992a, with permission of IEEE.)
FIGURE 6.33 Illustration of the layout of the second hidden layer of the OCR network, showing the non-overlapping receptive fields of two adjacent neurons on a feature map (4 in total). (From E. Sackinger et al., 1992a, with permission of IEEE.)
noted. First, the neurons in the third hidden layer are grouped into 12 feature maps.
Second, the inputs from one or two feature maps generated in the second hidden layer
are combined. Most of the neurons in the third hidden layer have 50 inputs, with 25 of
them connected to one feature map and the other 25 connected to a similar spatial area
in another feature map, as illustrated in Fig. 6.34.
The fourth hidden layer, made up of 12 feature maps, performs the same averaging
and subsampling function as explained for the second hidden layer.
Finally, the output layer has 300 inputs and 10 output neurons. This layer is the only
one in the OCR network that is fully connected; that is, it contains 3000 independent
synaptic weights. The output layer classifies the input patterns by computing 10 hyper-
planes in the 300-dimensional feature space generated by the four hidden layers of the
network.
Now that we understand the architecture of the OCR network, we next consider its
performance, for which the error rate is a customary parameter. The error rate measures
how many misclassifications the OCR network makes per 100 input patterns when it is
tested on a test set of data never seen by the network before. Another, even more important, performance measure is the reject rate: the number of patterns per 100 inputs that the network declines to classify because its confidence in the classification is too low.
FIGURE 6.34 Illustration of the layout of the third hidden layer of the OCR network, showing the receptive field of a neuron spanning feature maps of the second hidden layer. (From E. Sackinger et al., 1992a, with permission of IEEE.)
FIGURE 6.35 Recognition accuracy of the ANNA chip; the error and reject rates are
given in percentages for three different chips. (From E. Sackinger et al., 1992a, with
permission of IEEE.)
¹ ANNA is a hybrid analog-digital chip (Sackinger et al., 1992); a description of this chip, used to implement the OCR network, is presented in Chapter 15.
FIGURE 6.36 Recognition accuracy with the ANNA chip after retraining; the error and reject rates are given in percentages for three different chips. (From E. Sackinger et al., 1992a, with permission of IEEE.)
After retraining just the output layer, both the error rate and the reject rate are close to
the original performance of the OCR network without quantization, as measured on the
SUN workstation. The results of Fig. 6.36 also show that "chip-to-chip" matching is sufficiently good for one chip to be replaced by another without adversely affecting the performance of the network.
The special form of multilayer perceptron described here, involving the use of nonlinear convolution in one layer alternating with subsampling in the next, is called a convolutional network; such a network extracts its own features. The architectural layout of the network is similar to that of the neocognitron (Fukushima, 1975, 1988b), the development of which was motivated by neurobiological considerations (Hubel and Wiesel, 1962, 1977). However, there is one important difference between the convolutional network and the neocognitron: Learning in the neocognitron is self-organized, proceeding on a layer-by-layer basis, whereas in a convolutional network learning is supervised, proceeding on an epoch-by-epoch basis.
Speech Recognition
Speech is the principal means of human communication. It is more than just a string of
words; it reflects the moods, the ideas, and the personality of the speaker. For humans,
speech recognition is a natural and simple process. However, to make a computer respond
to even simple spoken commands has proved to be an extremely complex and difficult
task. The use of a computer for solving the speech-recognition problem is hampered by
variations in pronunciations; even single words are rarely pronounced in the same way
twice. The speech-recognition problem is complicated further by concatenation, where
consonants adopt aspects of neighboring consonants and vowels in fluent speech
(Wheddon, 1990).
Another source of difficulty is the size of the vocabulary. The vocabulary size varies
in an inverse manner with the system accuracy and efficiency, in the sense that more
words introduce more confusion and require more time to process (Waibel and Lee, 1990).
In addition to linguistic and vocabulary factors, there is the issue of speaker dependence versus speaker independence to be considered. A speaker-dependent system uses speech from the target speaker for training. Accordingly, it attains good accuracy, but requires an inconvenient training period for each new speaker. On the other hand, a speaker-independent system is trained to handle a variety of speakers. Typically, a speaker-independent system is less accurate than a speaker-dependent system (Waibel and Lee, 1990).
All of these factors make speech recognition a continuing challenge to the speech
research community (Lippmann, 1989a).
Hidden Markov models are widely used for continuous speech recognition by virtue
of their inherent ability to incorporate the sequential and statistical character of the speech
signal (Bahl et al., 1983; Jelinek, 1976). Such a model is formally defined as follows
(Rabiner and Juang, 1986):
A hidden Markov model (HMM) is a doubly stochastic process with an underlying stochastic
process that is not observable (i.e., it is hidden), but can only be observed through another
set of stochastic processes that generates the sequence of observed symbols.
It is assumed that the speech production process passes through a finite set of states

$$Q = \{q_1, q_2, \ldots, q_K\}$$

and that it is governed by statistical laws. Specifically, for each unit of speech (e.g., vocabulary word or phoneme) we have a particular HMM made up of L states q_l ∈ Q, with l = 1, 2, . . . , L, according to a prescribed topology (Bourlard, 1990; Bourlard and Wellekens, 1990). A preprocessor is used to transform the speech signal into a sequence of acoustic vectors,

$$X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$$
with the individual vectors of acoustic parameters being extracted by the preprocessor at
regular intervals, typically every 10 ms. The acoustic vectors include local speech features
such as spectra and related parameters. The transition probabilities, on the other hand,
are used to model the displacement of these features through time. The main limitation
of the HMM approach is the requirement of stringent statistical assumptions that are
unlikely to be valid for speech (Cohen et al., 1993).
Another way in which we may approach speech recognition is to use neural networks,
particularly multilayer perceptrons based on back-propagation learning. The main advan-
tages of this latter approach are a powerful discrimination-based learning procedure, a
flexible architecture that permits easy use of contextual information, and relatively weak hypotheses about statistical distributions. However, the temporal structure of speech signals
remains difficult to handle with neural networks, and in the framework of continuous
speech recognition it is still impossible to recognize a sentence in terms of speech units
with multilayer perceptrons (Bourlard, 1990). This temporal problem is solved very
efficiently in the HMM approach by a technique known as the dynamic time-warping
algorithm, which captures the dynamics of speech (Ney, 1984).
As an alternative, the idea of a hybrid approach has been proposed for continuous
speech recognition, involving the combined use of multilayer perceptrons (MLP) and
hidden Markov models (Morgan and Bourlard, 1990; Bourlard and Wellekens, 1990).
This hybrid approach exploits the advantages of back-propagation learning applied to
MLP, while preserving the HMM formalism to integrate over time and to segment continu-
ous speech signals. Such a combination is referred to as an MLP-HMM speech-recognition
system.
The MLP is used to provide estimates of the conditional probability distribution P(q_j|x_i), which refers to the a posteriori probability of state q_j given the input vector x_i. Let y(j, i) denote the actual value of the jth output produced by x_i. Let d(j, k) denote the desired (target) value of the jth output associated with the class q_k; it is defined by

$$d(j, k) = \begin{cases} 1, & j = k \\ 0, & j \neq k \end{cases} \qquad (6.143)$$
Using the relative entropy between the a posteriori target (desired) distribution and a
posteriori output distribution as the training criterion, and assuming that the network has
enough free parameters, enough training data, and that the training does not get stuck in
a local minimum, then the optimized MLP output y_opt(j, i) approximates the a posteriori
class probability P(q_j | x_i), as shown by

y_opt(j, i) = P(q_j | x_i)    (6.144)
Using Bayes’ rule, we may then compute the required HMM probabilities as follows:
P(x_i | q_j) = P(q_j | x_i) P(x_i) / P(q_j)    (6.145)
data, and minimizing the probability of the data being generated by all incorrect state
sequences.
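The conversion prescribed by Bayes' rule can be sketched numerically. In the fragment below, the posterior and prior values are made up for illustration; since the factor P(x_i) is common to all states, dividing posteriors by priors yields "scaled likelihoods" that can stand in for P(x_i | q_j) inside the HMM:

```python
import numpy as np

# Hypothetical MLP posteriors P(q_j | x_i) for 3 states over 4 frames
# (rows are frames i, columns are states j; each row sums to 1).
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])

# Class priors P(q_j), e.g. relative state frequencies in the training data.
priors = np.array([0.5, 0.3, 0.2])

# Bayes' rule: P(x_i | q_j) = P(q_j | x_i) P(x_i) / P(q_j).
# P(x_i) is state-independent, so posteriors / priors gives likelihoods
# up to a common scale factor per frame.
scaled_likelihoods = posteriors / priors

print(scaled_likelihoods[0])  # frame 0: 0.7/0.5, 0.2/0.3, 0.1/0.2
```

The common factor P(x_i) does not affect the HMM's choice of best state sequence, which is why the scaled form suffices in practice.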
Cohen et al. (1993) have applied a hybrid MLP-HMM approach to context-dependent
modeling of continuous speech, in which the realization of individual phonemes is highly
dependent on phonetic context. For example, the sound of the vowel /ae/ in the words
“map” and “tap” is quite different. Context-dependent effects are referred to as
coarticulation. The use of context-dependent phonetic models has been shown to improve
the recognition accuracy of an HMM significantly (Schwartz et al., 1985). In a context-
dependent HMM, different probability distributions are used for each phoneme in every
different relevant context. The hybrid MLP-HMM system described in Cohen et al. (1993)
uses an MLP with an input layer of 234 nodes, spanning 9 frames (with 26 coefficients
per frame) of various spectrum- and energy-related parameters that are normalized to
have zero mean and unit variance. The hidden layer has 1000 neurons, and the output layer
has 69 neurons, one for each context-independent phonetic class in the SRI DECIPHER™
system, a state-of-the-art continuous speech-recognition system. Both the hidden and
output neurons use sigmoidal nonlinearities. The MLP is purposely trained to estimate
the conditional probability P(q_j | x_i), where q_j is the state (class) associated with the middle
frame of the input x_i. The desired (target) probability distribution is defined as 1 for the
index corresponding to the phoneme class label and 0 for other classes (states). The
experimental results reported in Cohen et al. (1993) and earlier experimental results
reported in Renals et al. (1992b) indicate that the use of neural networks can significantly
improve the performance of a context-independent HMM speech recognition system. The
development of the hybrid MLP-HMM system is based on a general method of decompos-
ing a multilayer classification network with a large output layer into a number of smaller
networks, with no statistical independence assumptions. It has been demonstrated that
this method may be used to estimate likelihoods for context-dependent phoneme models.
Moreover, it appears that scaling is no longer considered to be a problem. It should be
noted, however, that the use of MLP involves more intensive computations during training
and recognition than the corresponding Gaussian mixture models used in standard HMMs.
PROBLEMS
6.1 Figure P6.1 shows a neural network, involving a single hidden neuron, for the
implementation of the XOR pattern; this network may be viewed as an alternative
to that considered in Section 6.6. Show that the network of Fig. P6.1 solves the
XOR problem by constructing (a) decision regions, and (b) a truth table for the
network.
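One concrete way to see that a single hidden neuron (together with direct input-to-output connections) suffices is to trace a truth table by hand. The weights below are illustrative choices, not necessarily those of Fig. P6.1: the hidden neuron realizes AND, and the output neuron computes OR with a strong inhibitory connection from the hidden neuron.

```python
def step(v):
    # McCulloch-Pitts threshold nonlinearity
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    # Hidden neuron: fires only for (1, 1), i.e. AND(x1, x2).
    h = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output neuron: OR(x1, x2) suppressed by the hidden neuron,
    # which leaves exactly XOR.
    y = step(1.0 * x1 + 1.0 * x2 - 2.0 * h - 0.5)
    return y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```

Walking the four input pairs through these two threshold units reproduces the XOR truth table, which is part (b) of the problem.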
6.2 Use the back-propagation algorithm for computing a set of synaptic weights and
thresholds for a neural network structured as in Fig. 6.6 to solve the XOR problem.
Assume the use of a logistic function for the nonlinearity.
FIGURE P6.1
230 6 / Multilayer Perceptrons
6.3 Consider the functions
and
φ(x) = (2/π) tan⁻¹(x)    (ii)
Explain why both of these functions fit the requirements of a sigmoid function. How
do these two functions differ from each other?
6.4 The inclusion of a momentum term in the weight update may be viewed as a
mechanism for satisfying heuristics 3 and 4 that provide guidelines for accelerating
the convergence of the back-propagation algorithm, which was discussed in Section
6.15. Demonstrate the validity of this statement.
6.5 The momentum constant CY is normally assigned a positive value in the range
0 5 CY < 1. Investigate the difference that would be made in the behavior of Eq.
(6.37) with respect to time t if CY was assigned a negative value in the range
-1 < C Y S O .
6.6 Consider the encoding problem in which a set of orthogonal input patterns are
mapped with a set of orthogonal output patterns through a small set of hidden
neurons (Rumelhart and McClelland, 1986). Figure P6.6 shows the basic architecture
for solving this problem. Essentially, the problem is to learn an encoding of a p-bit
pattern into a log₂ p-bit pattern, and then learn to decode this representation into the
output pattern. Construct the mapping generated by the back-propagation algorithm
applied to the network of Fig. P6.6 for the case of identity mapping illustrated below:
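For the 4-2-4 instance of this problem (p = 4, identity mapping of one-hot patterns through log₂ 4 = 2 hidden neurons), a minimal batch back-propagation sketch is given below; the learning rate, seed, and epoch count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.eye(4)      # orthogonal one-hot input patterns
D = np.eye(4)      # identity mapping: each target equals its input

W1 = rng.normal(0.0, 0.5, (4, 2)); b1 = np.zeros(2)   # 4 -> 2 encoder
W2 = rng.normal(0.0, 0.5, (2, 4)); b2 = np.zeros(4)   # 2 -> 4 decoder

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

eta = 1.0          # learning rate, chosen empirically for this toy problem
losses = []
for epoch in range(20000):
    H = sigmoid(X @ W1 + b1)            # forward pass: hidden (code) layer
    Y = sigmoid(H @ W2 + b2)            # forward pass: output layer
    E = Y - D
    losses.append(0.5 * float((E ** 2).sum()))
    # backward pass: local gradients via the logistic derivative y(1 - y)
    dY = E * Y * (1.0 - Y)
    dH = (dY @ W2.T) * H * (1.0 - H)
    W2 -= eta * H.T @ dY; b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dH; b1 -= eta * dH.sum(axis=0)

print(round(losses[0], 3), "->", round(losses[-1], 3))
```

Inspecting the hidden activations H after training shows the learned 2-bit code for the four patterns, which is the mapping the problem asks for.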
6.7 Consider the simple example of a network involving a single weight, for which the
cost function is
ℰ(w) = k₁(w − w₀)² + k₂

where w₀, k₁, and k₂ are constants. A back-propagation algorithm with momentum
is used to minimize ℰ(w).
Explore the way in which the inclusion of the momentum constant a influences
the learning process, with particular reference to the number of epochs required for
convergence versus a.
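The exploration can be sketched directly, since the gradient of ℰ(w) is 2k₁(w − w₀) and k₂ drops out. The constants, tolerance, and starting point below are arbitrary illustrative choices:

```python
def epochs_to_converge(alpha, eta=0.01, k1=1.0, w0=3.0,
                       tol=1e-6, max_epochs=100000):
    """Gradient descent with momentum on E(w) = k1*(w - w0)**2 + k2."""
    w, dw = 0.0, 0.0
    for t in range(1, max_epochs + 1):
        grad = 2.0 * k1 * (w - w0)      # dE/dw; the constant k2 drops out
        dw = alpha * dw - eta * grad    # momentum-augmented weight change
        w += dw
        if abs(w - w0) < tol:
            return t
    return max_epochs

for alpha in (0.0, 0.5, 0.9):
    print(alpha, epochs_to_converge(alpha))
```

Plotting the returned epoch counts against α (including negative values, as in Problem 6.5) makes the accelerating effect of the momentum term visible.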
6.8 Consider a two-layer network containing no hidden neurons. Assume that the network
has p inputs and a single output neuron. Let x_k denote the kth input signal; the output
signal is defined by
where w_ji is the synaptic weight from neuron i to neuron j, and y_i is the output of
neuron i;
(c) Nonlinearity:

φ(u) = 1 / (1 + exp(−u))
6.11 Investigate the use of back-propagation learning using a sigmoidal nonlinearity to
achieve one-to-one mappings, as described here:
1. f(x) = 1/x,       1 ≤ x ≤ 100
2. f(x) = log₁₀ x,   1 ≤ x ≤ 10
3. f(x) = exp(−x),   1 ≤ x ≤ 10
4. f(x) = sin x,     0 ≤ x ≤ π/2
For each mapping, do the following:
(a) Set up two sets of data, one for network training, and the other for testing.
(b) Use the training data set to compute the synaptic weights of the network, assumed
to have a single hidden layer.
(c) Evaluate the computation accuracy of the network by using the test data.
Use a single hidden layer but with a variable number of hidden neurons. Investigate
how the network performance is affected by varying the size of the hidden layer.
6.12 In this problem we use the back-propagation algorithm to solve a difficult nonlinear
prediction problem and compare its performance with that of the LMS algorithm.
The time series to be considered is created using a discrete Volterra model that has
the form
x(n) = Σ_i g_i v(n − i) + Σ_i Σ_j g_ij v(n − i)v(n − j) + · · ·

where g_i, g_ij, . . . are the Volterra coefficients, the v(n) are samples of a white,
independently distributed Gaussian noise sequence, and x(n) is the resultant Volterra
model output. The first summation term is the familiar moving-average (MA) time-
series model, and the remaining summation terms are nonlinear components of ever-
increasing order. In general, the estimation of the Volterra coefficients is considered
to be difficult, primarily because of their nonlinear relationship to the data.
In this problem we consider the simple example
x(n) = v(n) + βv(n − 1)v(n − 2)
The time series has zero mean, and is uncorrelated and therefore has a white spectrum.
However, the time-series samples are not independent of each other, and therefore
a higher-order predictor can be constructed. The variance of the model output is
FIGURE P6.12 Multilayer perceptron operating as a predictor on delayed samples
x(n − 1), x(n − 3), . . . of the time series.
given by

σ_x² = σ_v² + β²σ_v⁴
The white-noise input variance σ_v² is set equal to unity. Hence, with β = 0.5,
the ideal variance of the predictor output is β²σ_v⁴ = 0.25.
Compute the learning curve of the nonlinear predictor, with the variance of
the predictor output x̂(n) plotted as a function of the number of epochs of training
samples up to 100 epochs.
Repeat the experiment using the LMS algorithm designed to perform a linear
prediction on an input of 6 samples. The learning-rate parameter of the algorithm
is set equal to η = 0.9.
The results of the experiment should reveal that initially the back-propagation
algorithm and the LMS algorithm follow a similar path, and then the back-propagation
algorithm continues to improve, finally producing a prediction variance close to the
ideal value σ² = 0.25.
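Before any predictor is trained, the stated properties of the model itself (unit-variance white noise, β = 0.5, output variance σ_v² + β²σ_v⁴ = 1.25, white spectrum) can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
beta, N = 0.5, 200_000

v = rng.standard_normal(N)           # white Gaussian noise, unit variance
x = v.copy()
x[2:] += beta * v[1:-1] * v[:-2]     # x(n) = v(n) + beta*v(n-1)*v(n-2)

var_x = x.var()
# Lag-1 sample autocorrelation: the series is uncorrelated (white spectrum)
# even though its samples are not statistically independent.
r1 = np.mean(x[1:] * x[:-1]) / var_x

print(round(var_x, 3))   # theory: sigma_v^2 + beta^2 * sigma_v^4 = 1.25
print(round(r1, 3))      # theory: 0
```

The near-zero autocorrelation is exactly why a linear (LMS) predictor cannot do better than the noise floor here, while a nonlinear predictor can exploit the higher-order dependence.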
6.13 Consider a fully connected multilayer perceptron consisting of an input layer of two
nodes, a hidden layer of two neurons, and an output layer of one neuron as shown
in Fig. P6.13. Assume the use of a hyperbolic tangent for the nonlinearity. We wish
to use this particular network structure to study some aspects of double back-
propagation learning for improving generalization performance (Drucker and
LeCun, 1992).
(a) Determine the partial derivatives of the instantaneous sum of squared errors ℰ
with respect to all six synaptic weights of the network, and comment on their
compositions. In normal back-propagation, these are the only gradient terms
that we need.
FIGURE P6.13
(b) Suppose that we proceed one more step and calculate the partial derivatives of
ℰ with respect to the input signals. Show that these new partial derivatives
involve all the synaptic weights in the network.
(c) Now formulate a total risk R defined as the sum of ℰ and another function ℰ_b
defined as follows:
where x₁ and x₂ are the input signals, and t₁ and t₂ are the corresponding target
derivatives. Hence, recalculate the partial derivatives of the risk R with respect
to the synaptic weights of the network.
(d) We wish to force the gradients of ℰ with respect to the input signals to become
zero even if the gradients of ℰ with respect to the synaptic weights are nonzero.
What would be the result of such a constraint imposed on the network?
6.14 Equation (6.81), defining the derivative of the error surface with respect to the
learning-rate parameter η_ji(n), was derived assuming that neuron j lies in the output
layer of the multilayer perceptron. Show that this same formula also applies to a
neuron j that lies in a hidden layer of the network.
6.15 Write the fuzzy set rule for the fuzzy back-propagation network represented by
Table 6.6.
6.16 (a) Derive the formula for the saliency S_i given in Eq. (6.104).
(b) Assume that the Hessian matrix of the average squared error of a multilayer
perceptron with respect to its weights may be approximated by a diagonal matrix
as follows:
H = diag[h₁₁, h₂₂, . . . , h_WW]

where W is the total number of weights in the network. Determine the saliency
S_i of weight w_i in the network.
6.17 Consider a multilayer perceptron with a single output neuron, defined by the function
y = F(w,x)
where x is the input vector and w is the vector of synaptic weights in the network.
The average squared error on the training set of size N is defined by

ℰ_av = (1/2N) Σ_{k=1}^N (d_k − y_k)²

where d_k is the desired response for the input vector x_k pertaining to the kth example
and y_k is the actual response of the network produced by the input vector x_k. Show
that the Hessian H, evaluated at a local minimum of the error surface, may be
approximated as follows (Hassibi et al., 1993):

H ≈ (1/N) Σ_{k=1}^N g_k g_kᵀ

where

g_k = ∂F(w, x_k) / ∂w
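The outer-product approximation, and the diagonal-Hessian saliency of Problem 6.16, can be illustrated with a linear "network" F(w, x) = wᵀx, for which g_k = x_k exactly. The weight values are made up, and the saliency is assumed to take the usual optimal-brain-damage form S_i = h_ii w_i²/2:

```python
import numpy as np

rng = np.random.default_rng(2)
N, W = 500, 3

# For F(w, x) = w.T x the gradient g_k = dF/dw = x_k, so the gradient
# vectors are just the input vectors stacked as rows.
X = rng.standard_normal((N, W))
G = X

# Outer-product approximation: H ~ (1/N) sum_k g_k g_k^T
H_approx = (G.T @ G) / N

# Diagonal approximation and the resulting saliencies S_i = h_ii * w_i^2 / 2
w = np.array([1.0, -0.3, 0.05])
h = np.diag(H_approx)
saliency = h * w ** 2 / 2.0

print(np.argsort(saliency))  # weights ranked from least to most salient
```

With inputs of roughly unit variance, h_ii ≈ 1 and the saliency ordering is driven by the weight magnitudes, matching the intuition that small weights are the first candidates for pruning.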
6.18 The use of a momentum term in the weight update described in Eq. (6.35) may be
considered as an approximation to the conjugate-gradient method (Battiti, 1992).
Discuss the validity of this statement.
6.19 In describing the optical character recognition problem in Section 6.20, we mentioned
that the function of the first hidden layer in the multilayer feedforward network
of Fig. 6.28 may be interpreted as that of performing two-dimensional nonlinear
convolutions of the four feature maps (characterizing that layer) with the pixel image.
Justify the validity of this interpretation.
How would you describe the role of the second hidden layer? Justify your answer.
6.20 Consider a multilayer perceptron designed to operate as a pattern classifier, on the
basis of minimizing the mean-square error ℰ defined by

ℰ = (1/2N) Σ_{i=1}^N Σ_{j=1}^K Σ_{k=1}^K n_ik [d(j, k) − y(j, i)]²

where n_ik is the number of times the input vector x_i is classified as belonging to
class q_k, K refers to both the total number of output neurons and classes, and N is
the size of the training set. The desired response d(j, k), pertaining to the jth output
associated with the class q_k, is defined by

d(j, k) = 1 for k = j, and d(j, k) = 0 otherwise

Show that, regardless of the MLP topology, the minimization of ℰ with respect to
the actual outputs of the network yields the optimum values for the outputs given
by

y_opt(j, i) = n_ij / Σ_{k=1}^K n_ik
What is the probabilistic interpretation of this result?
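Because the error is quadratic in each output, the claimed optimum can be checked numerically for a hypothetical table of counts n_ik; the grid search below simply locates the minimum of each per-output quadratic and compares it with the frequency ratio:

```python
import numpy as np

# Hypothetical counts: n[i, k] is the number of times input vector x_i
# was labelled as belonging to class q_k.
n = np.array([[8.0, 2.0, 0.0],
              [1.0, 6.0, 3.0]])

K = n.shape[1]
d = np.eye(K)                    # d(j, k) = 1 if k == j, else 0
ys = np.linspace(0.0, 1.0, 10001)

y_opt = np.zeros((n.shape[0], K))
for i in range(n.shape[0]):
    for j in range(K):
        # E_ij(y) = sum_k n_ik [d(j, k) - y]^2 is quadratic in y;
        # locate its minimum on a fine grid.
        cost = (n[i] * (d[j] - ys[:, None]) ** 2).sum(axis=1)
        y_opt[i, j] = ys[np.argmin(cost)]

print(y_opt)
print(n / n.sum(axis=1, keepdims=True))   # claimed optimum n_ij / sum_k n_ik
```

The two printed arrays agree, which is the empirical counterpart of the result: each optimal output is the relative frequency with which x_i was assigned to class q_j.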
6.21 In Section 6.9 we presented qualitative arguments for the property of a multilayer
perceptron classifier (using a logistic function for the nonlinearity) that its outputs provide
estimates of the a posteriori class probabilities. This property assumes that the size
of the training set is large enough, and that the back-propagation algorithm used to
train the network does not get stuck at a local minimum. Fill in the mathematical
details of this property.