Backpropagation
Let h_j be the sum of inputs coming in from the input layer, into node j of the hidden layer.
Fig. 4.2: An MLP for the XOR function.

h_j = \sum_i x_i v_{ij}    (4.1)
We assume that this hidden node then gets activated depending on the magnitude of its input, h_j. The activation function is typically chosen to be a sigmoid, in which case its derivative is easy to find. In general, if g(\cdot) denotes the activation function, (a) g must be computable, and (b) g should be fairly constant at the extremes but should change rapidly in the middle of the range. This gives a small range over which the neuron changes state (fires or stops firing). A sigmoid function serves as an approximation to a step function: for positive \beta, the output of node j then is
a_j = g(h_j) = \frac{1}{1 + \exp(-\beta h_j)}    (4.2)

with derivative

g'(h_j) = \beta a_j (1 - a_j)    (4.3)
An alternative sigmoidal activation is the hyperbolic tangent,

a = g(h) = \tanh(h) = \frac{\exp(h) - \exp(-h)}{\exp(h) + \exp(-h)}    (4.4)
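As a small illustration, here is a sketch of these activation functions in Python/NumPy (the function names and the default \beta = 1 are our own choices, made for illustration only):

    import numpy as np

    def sigmoid(h, beta=1.0):
        # Eq. (4.2): g(h) = 1 / (1 + exp(-beta * h))
        return 1.0 / (1.0 + np.exp(-beta * h))

    def sigmoid_deriv_from_activation(a, beta=1.0):
        # Eq. (4.3): g'(h) = beta * a * (1 - a), written in terms of a = g(h)
        return beta * a * (1.0 - a)

    def tanh_activation(h):
        # Eq. (4.4): g(h) = tanh(h); its derivative is 1 - tanh(h)**2
        return np.tanh(h)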
For a node k in the output layer, we have its input (h_k) and output (denoted y_k instead of a_k) as follows:
h_k = \sum_j a_j w_{jk}    (4.5)

y_k = g(h_k) = \frac{1}{1 + \exp(-\beta h_k)}    (4.6)
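For concreteness, here is a minimal NumPy sketch of this forward computation through both layers; the array names (x, V, W) and their shapes are assumptions made for illustration:

    import numpy as np

    def forward(x, V, W, beta=1.0):
        # x: (n_inputs,), V: (n_inputs, n_hidden), W: (n_hidden, n_outputs)
        h_hidden = x @ V                                # Eq. (4.1): h_j = sum_i x_i v_ij
        a = 1.0 / (1.0 + np.exp(-beta * h_hidden))      # Eq. (4.2): hidden activations a_j
        h_out = a @ W                                   # Eq. (4.5): h_k = sum_j a_j w_jk
        y = 1.0 / (1.0 + np.exp(-beta * h_out))         # Eq. (4.6): outputs y_k
        return a, y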
After proceeding through the network in the forward direction, we end up predicting a set of outputs, which need to be compared to the true output values (i.e. the targets, t). The values of the computed outputs depend on (a) the current input x, (b) the activation function g(\cdot) of the nodes of the network, and (c) the weights of the network (denoted v for the first layer and w for the second layer). The error then is
E(w) = \frac{1}{2} \sum_{k=1}^{n} (t_k - y_k)^2 = \frac{1}{2} \sum_k \Big( t_k - g\Big( \sum_j w_{jk} a_j \Big) \Big)^2    (4.7)

Notation: i, j and k will be used to indicate the input, hidden, and output nodes, respectively. Then a_j and y_k are the activation levels of the hidden and output nodes.

Let h_k be the input to the output-layer neuron k. Then, summing over all the hidden-layer neurons feeding into output neuron k gives

h_k = \sum_l w_{lk} a_l    (4.8)
We wish to minimize the error E by manipulating the weights w_{jk}. We use the chain rule:

\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial h_k} \frac{\partial h_k}{\partial w_{jk}}    (4.9)
To determine the second term above, we first note that \partial w_{lk} / \partial w_{jk} = 0 for all values of l except l = j.

\frac{\partial h_k}{\partial w_{jk}} = \frac{\partial \big( \sum_l w_{lk} a_l \big)}{\partial w_{jk}} = \sum_l \frac{\partial (w_{lk} a_l)}{\partial w_{jk}} = a_j    (4.10)
The first term in Eq. 4.9 is denoted \delta_o (subscript o for output layer) and is also computed using the chain rule:

\delta_o = \frac{\partial E}{\partial h_k} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial h_k}
Neuron k in the output layer has output (g(\cdot) is the activation function)

y_k = g\big(h_k^{output}\big) = g\Big( \sum_j w_{jk} a_j^{hidden} \Big)

Notation: a is the activation of a hidden neuron; y is the activation of an output neuron. The superscripts to h (hidden, or output) help keep track of which neurons we compute inputs to.

Then

\delta_o = \frac{\partial E}{\partial h_k^{output}} = \frac{\partial E}{\partial g\big(h_k^{output}\big)} \, \frac{\partial g\big(h_k^{output}\big)}{\partial h_k^{output}}    (4.11)

= \frac{\partial}{\partial g\big(h_k^{output}\big)} \Big[ \frac{1}{2} \sum_k \big( t_k - g\big(h_k^{output}\big) \big)^2 \Big] \, g'\big(h_k^{output}\big)    (4.12)

= -\big( t_k - g\big(h_k^{output}\big) \big) \, g'\big(h_k^{output}\big) = (y_k - t_k) \, g'\big(h_k^{output}\big)    (4.13)
The update rule for the weights is now written in analogous form to Eq. 3.36, where the minus sign ensures we go downhill and reduce the error (\eta is the learning rate):

w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}    (4.14)
where, for a sigmoid activation function with the scaling factor \beta set to 1,

\frac{\partial E}{\partial w_{jk}} = \delta_o a_j = (y_k - t_k) y_k (1 - y_k) a_j    (4.15)
In similar fashion we can update the weights v_{ij} which connect the inputs to the hidden nodes. We compute \delta_h as follows (the summation is over the output nodes):
\delta_h = \sum_k \frac{\partial E}{\partial h_k^{output}} \frac{\partial h_k^{output}}{\partial h_j^{hidden}} = \sum_k \delta_o \frac{\partial h_k^{output}}{\partial h_j^{hidden}}    (4.16)
Since

h_k^{output} = \sum_l w_{lk} \, g\big(h_l^{hidden}\big)    (4.17)

we have

\frac{\partial h_k^{output}}{\partial h_j^{hidden}} = \frac{\partial}{\partial h_j^{hidden}} \sum_l w_{lk} \, g\big(h_l^{hidden}\big)    (4.18)

= w_{jk} \, g'\big(h_j^{hidden}\big) = w_{jk} \, a_j (1 - a_j)    (4.19)

so that

\delta_h = a_j (1 - a_j) \sum_k \delta_o w_{jk}    (4.20)
Finally, since \partial h_j^{hidden} / \partial v_{ij} = x_i, the gradient with respect to the first-layer weights is

\frac{\partial E}{\partial v_{ij}} = a_j (1 - a_j) \Big( \sum_k \delta_o w_{jk} \Big) x_i    (4.21)
and the corresponding update rule for the first-layer weights is

v_{ij} \leftarrow v_{ij} - \eta \frac{\partial E}{\partial v_{ij}}    (4.22)
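Putting Eqs. (4.15) and (4.20)-(4.22) together, here is a sketch of one backpropagation step in NumPy; the learning rate eta and the variable names are illustrative assumptions, and a, y are the activations from the forward pass sketched above:

    import numpy as np

    def backprop_step(x, a, y, t, V, W, eta=0.1):
        # Output-layer error term, Eq. (4.15): delta_o = (y_k - t_k) y_k (1 - y_k)
        delta_o = (y - t) * y * (1.0 - y)
        # Hidden-layer error term, Eq. (4.20): delta_h = a_j (1 - a_j) sum_k delta_o w_jk
        delta_h = a * (1.0 - a) * (W @ delta_o)
        # Weight updates, Eqs. (4.14) and (4.22)
        W -= eta * np.outer(a, delta_o)    # dE/dw_jk = delta_o * a_j, Eq. (4.15)
        V -= eta * np.outer(x, delta_h)    # dE/dv_ij = delta_h * x_i, Eq. (4.21)
        return V, W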
Initialization: If a neuron receives n inputs, each of approximately unit variance, then the total input to that neuron is of the form w\sqrt{n}. The weights may then be chosen to be random values in the range -1/\sqrt{n} < w < 1/\sqrt{n}, which means that the total input to a neuron has an approximate magnitude of 1. Note that unit variances can be achieved for the inputs by standardizing them: subtract the mean and then divide by the standard deviation. Hence good practice is to standardize the inputs.
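A sketch of this standardization and weight initialization; a uniform distribution over the stated range is one simple way to realize it, and the names are our own:

    import numpy as np

    def standardize(X):
        # Subtract the mean and divide by the standard deviation, feature by feature
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def init_weights(n_in, n_out, seed=0):
        # Random values in the range -1/sqrt(n) < w < 1/sqrt(n), with n the number of inputs
        rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(n_in)
        return rng.uniform(-bound, bound, size=(n_in, n_out))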
Training: For each input vector (observation) we have a forwards phase, where we predict the output values using the current weights, followed by a backwards phase, where we adjust the weights.
Forwards phase: determine the activation of each neuron j in the hidden layer, and then of each neuron k in the output layer, using

h_j = \sum_i x_i v_{ij}, \qquad a_j = g(h_j) = \frac{1}{1 + \exp(-\beta h_j)}    (4.23)

h_k = \sum_j a_j w_{jk}, \qquad y_k = g(h_k) = \frac{1}{1 + \exp(-\beta h_k)}    (4.24)

Backwards phase: compute the error terms at the output and hidden layers,

\delta_o = (y_k - t_k) \, y_k (1 - y_k)    (4.25)

\delta_h = a_j (1 - a_j) \sum_k w_{jk} \delta_o    (4.26)

and then update the weights:

w_{jk} \leftarrow w_{jk} - \eta \, \delta_o a_j    (4.27)

v_{ij} \leftarrow v_{ij} - \eta \, \delta_h x_i    (4.28)
This process (across all input vectors) is repeated until learning stops.

Note: The inputs are fed in a randomized order in the different iterations, to avoid bias.

Recall: We use the forwards phase as described above, for a new observation.
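A compact sketch of this sequential training loop, tying together the forwards and backwards phases above; the fixed number of epochs stands in for a proper stopping criterion, and all names are illustrative:

    import numpy as np

    def train_sequential(X, T, V, W, eta=0.1, beta=1.0, n_epochs=1000, seed=0):
        # X: (n_samples, n_inputs), T: (n_samples, n_outputs)
        rng = np.random.default_rng(seed)
        for _ in range(n_epochs):
            for idx in rng.permutation(len(X)):          # randomized order of the inputs
                x, t = X[idx], T[idx]
                # Forwards phase, Eqs. (4.23)-(4.24)
                a = 1.0 / (1.0 + np.exp(-beta * (x @ V)))
                y = 1.0 / (1.0 + np.exp(-beta * (a @ W)))
                # Backwards phase, Eqs. (4.25)-(4.28)
                delta_o = (y - t) * y * (1.0 - y)
                delta_h = a * (1.0 - a) * (W @ delta_o)
                W -= eta * np.outer(a, delta_o)
                V -= eta * np.outer(x, delta_h)
        return V, W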
If a linear output is required, the activation of the output nodes can instead be taken as

y_k = g(h_k) = h_k    (4.29)
We can also use a soft-max activation function, which rescales the outputs to lie between 0 and 1:

y_k = g(h_k) = \frac{\exp(h_k)}{\sum_{k'} \exp(h_{k'})}    (4.30)
The update equation continues to be the same for the linear output: \delta_o = t_k - y_k.
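A minimal sketch of these output choices; the shift by max(h) in the soft-max is a standard numerical-stability trick and is not part of Eq. (4.30):

    import numpy as np

    def softmax(h):
        # Eq. (4.30): y_k = exp(h_k) / sum_k' exp(h_k')
        e = np.exp(h - np.max(h))
        return e / e.sum()

    def linear_output(h):
        # Eq. (4.29): y_k = h_k
        return h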
One pass through all training data = one epoch.

Training the MLP normally involves feeding in all of the training data and accumulating the training error before the weights are updated (the batch method). The algorithm above instead operates in sequential (loop) mode. It does the job, but it is slower than the batch method. The sequential method, however, has a better chance (given the random order in which points are used for training) of avoiding local minima when minimizing the training error.
We can speed up the optimization by adding a momentum term:

w_{ij}^{t} \leftarrow w_{ij}^{t-1} - \eta \, \delta_o a_j^{hidden} + \alpha \, \Delta w_{ij}^{t-1}    (4.31)

where the superscripts t and t-1 keep track of iterations, and \Delta w_{ij}^{t-1} is the previous weight change. The parameter \alpha, with 0 < \alpha < 1, controls the momentum (\alpha = 0.9 usually). In general this may be tweaked to avoid large changes in the weights as t becomes large.
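A sketch of the momentum update, written here in terms of a generic gradient; the default values of eta and alpha are the usual illustrative choices rather than anything prescribed by the text:

    def momentum_update(w, grad, prev_dw, eta=0.1, alpha=0.9):
        # dw_t = -eta * dE/dw + alpha * dw_{t-1}, following Eq. (4.31)
        dw = -eta * grad + alpha * prev_dw
        return w + dw, dw   # new weights and the change, to be reused at the next iteration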
There are important issues of over-fitting, validation of the MLP, and deciding thresholds for when to stop learning. Since these impact practically any classification method, we will discuss them later.
Radial Basis Function (RBF) Networks

The RBF network is practically an MLP with one hidden layer. This hidden layer has RBFs describing the activation of the nodes; the output layer could continue to have non-Gaussian (e.g. sigmoidal) activations. A bias input is usually added for the output layer, to deal with the scenario when none of the RBF nodes fire. This last (output) layer is then just a perceptron, and training it is straightforward. What needs attention is how to train the weights into the RBF layer.
The RBF layer and output nodes have different activation functions, but also different
purposes: the RBF layer generates a nonlinear representation of the inputs, while the output
layer uses these nonlinear inputs in a linear classifier (assuming a simple perceptron has been
used). Practically, there are two tasks: locating the RBF nodes in weight space, and using
the activations of the RBF nodes to train the linear outputs. There are several ways to set
up the RBF nodes: assuming that we have good training data, we randomly use some of the
inputs as locations for our nodes. Alternatively, we can use an unsupervised approach like the k-means algorithm to determine the node locations.
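As a small sketch of the first option (picking node locations at random from the training inputs); the function name and use of NumPy are our own:

    import numpy as np

    def choose_centres(X, M, seed=0):
        # Use M randomly chosen training inputs as the RBF node locations
        rng = np.random.default_rng(seed)
        return X[rng.choice(len(X), size=M, replace=False)]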
For each input vector, we compute the activations of all the hidden nodes, as represented by a matrix G. Then G_{ij} describes the activation of hidden node j for input i. The outputs are then y = GW for a set of weights W. We need to minimize the deviation of y from t, the output targets. Then, using the pseudo-inverse (i.e. solving in the least-squares sense),

W = (G^{\top} G)^{-1} G^{\top} t    (4.33)
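In code, a least-squares solver does the same job as the explicit pseudo-inverse of Eq. (4.33) and is numerically preferable; a short NumPy sketch, with G and t as defined in the text:

    import numpy as np

    def rbf_output_weights(G, t):
        # Solve min_W ||G W - t||^2, i.e. W = (G^T G)^{-1} G^T t, Eq. (4.33)
        W, *_ = np.linalg.lstsq(G, t, rcond=None)
        return W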
The value of \sigma is also important, as indicated before. Since we ideally desire good coverage of weight space, the width \sigma of the Gaussians should be a function of the maximum distance d between hidden node locations, and of the number of hidden nodes. If we use M hidden nodes, a common choice is

\sigma = \frac{d}{\sqrt{2M}}    (4.34)

This ensures that, in a relative sense, at least one of the terms is large and hence at least one node should fire.
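A sketch of the hidden-layer activations with this choice of width; the Gaussian form exp(-||x - c||^2 / (2 sigma^2)) is assumed here, since the text does not write it out explicitly, and all names are illustrative:

    import numpy as np

    def rbf_design_matrix(X, centres):
        # X: (n_samples, n_features), centres: (M, n_features)
        M = len(centres)
        # d: maximum distance between the hidden-node locations
        diffs = centres[:, None, :] - centres[None, :, :]
        d = np.sqrt((diffs ** 2).sum(axis=-1)).max()
        sigma = d / np.sqrt(2.0 * M)                          # Eq. (4.34)
        dist2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-dist2 / (2.0 * sigma ** 2))            # G_ij: activation of node j for input i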