
LEARNING AND GENERALIZATION IN

NEURAL NETWORKS

Steve Renals

27 February 1998
COM336/COM648/COM682


Contents

1 THE BACK-PROPAGATION ALGORITHM
  1.1 Introduction
    1.1.1 Units and weights in ANNs
    1.1.2 Neural network architectures
    1.1.3 Learning
    1.1.4 Applications of neural networks
  1.2 Supervised learning in feed-forward networks
    1.2.1 Single layer perceptrons
    1.2.2 Linear Separability
    1.2.3 Multi-layer perceptrons: the back-propagation algorithm
  1.3 Issues in training MLPs by backprop
    1.3.1 Batch and online learning
    1.3.2 When is training complete?
    1.3.3 Will backprop always find the solution?

2 GENERALIZATION
  2.1 Introduction
  2.2 Evaluating Generalization: Training, Test and Validation Sets
  2.3 Training and Generalization
  2.4 Bias and Variance
  2.5 References

3 OVERTRAINING AND HOW TO AVOID IT
  3.1 Introduction
  3.2 Cross-Validation and Early Stopping
  3.3 Regularization

4 CLASSIFICATION USING NEURAL NETWORKS
  4.1 Introduction
  4.2 Estimation of Posterior Probabilities
  4.3 Practical Implications of Posterior Probability Estimation
    4.3.1 Minimum Error-Rate Decisions
    4.3.2 Compensating for Different Priors
    4.3.3 Outputs Sum to One
    4.3.4 Combining Network Outputs
    4.3.5 Confidence and Rejection
  4.4 References


Chapter 1

THE BACK-PROPAGATION ALGORITHM

1.1 Introduction
In this course we are interested in artificial neural networks; biology does not concern us! In the only analogy of the kind that I'll use, the table below gives the correspondences between the terminology used for biological neural networks and that used for artificial neural networks (ANNs):

Biology Computing
Neuron Unit/Node/Cell
Synapse Connection
Synaptic Weight Weight
Neuron Firing Rate Node Output

Don't be fooled into thinking that ANNs tell us anything (detailed) about the brain...

1.1.1 Units and weights in ANNs


The basic idea behind neural networks is that a set of simple processing units are connected
together to produce more complex computations. Figure 1.1 illustrates a unit used in ANNs.
This weighted sum unit operates by taking a weighted sum of its inputs, to produce the activation value of the unit, v:

    v = ∑_i w_i y_i + c                                          (1.1)

where y_i is input i to the unit, w_i is the connection weight of that incoming connection, and c is referred to as the bias of the unit. Note that the bias is equivalent to the weight on a connection whose input is always 1.
In a neural network the inputs to a unit will either be the output of another unit that is connected to it, or an input from the environment (e.g. the input from some sensor). In all the networks that we'll consider in this course connections are directed: the fact that there is a connection from unit a to unit b with weight w does not tell us anything about there being (or not being) a connection from unit b to unit a (and what weight it might have if it does exist).
The output of a unit y is computed by passing the activation through a nonlinear transfer function:

    y = f(v)                                                     (1.2)


Choices for the transfer function f include the step function (where the output is 0 if the activation is below 0 and 1 if it is at or above 0):

    f(v) = 1   if v ≥ 0                                          (1.3)
         = 0   if v < 0                                          (1.4)

and soft step functions such as the S-shaped sigmoid function:

    f(v) = 1 / (1 + exp(-v))                                     (1.5)

Note that in both these cases the bias of the unit can be used to shift the function along the x axis.
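To make this concrete, here is a minimal sketch in Python/NumPy of a single unit computing equations (1.1)-(1.5); the function names are my own, not part of the notes:

```python
import numpy as np

def step(v):
    """Step transfer function: 1 if v >= 0, else 0 (equations (1.3)-(1.4))."""
    return np.where(v >= 0.0, 1.0, 0.0)

def sigmoid(v):
    """S-shaped sigmoid transfer function (equation (1.5))."""
    return 1.0 / (1.0 + np.exp(-v))

def unit_output(y, w, c, f=sigmoid):
    """Weighted sum unit: activation v = sum_i w_i y_i + c, output f(v)."""
    v = np.dot(w, y) + c      # equation (1.1)
    return f(v)               # equation (1.2)

# A two-input unit; the bias c shifts the transfer function along the v axis.
y = np.array([0.5, -1.0])
w = np.array([0.8, 0.3])
print(unit_output(y, w, c=0.1))            # sigmoid output
print(unit_output(y, w, c=0.1, f=step))    # step output
```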
Units in neural networks may be described as:

Input units, which receive some input from the environment (e.g. a pixel map for a network to recognize handwritten characters);

Output units, whose outputs may be observed from the environment (e.g. the class of character input to the network);

Hidden units, which are internal to the network and do not directly interact with the environment.

1.1.2 Neural network architectures


The architecture of a neural network tells us what connections between units are allowed.
Many architectures have been studied and used; here we'll characterize them in terms of some defining characteristics.
Recurrent or feed-forward? In a feed-forward network there is an ordering imposed on the nodes in the network: if there is a connection from unit a to unit b then there cannot be a connection from b to a. There is no such constraint in recurrent networks (figure 1.3(a)), in which any unit can be connected to any other, thus allowing loops or recurrences. The absence of recurrences makes feed-forward networks much easier to train, but also less powerful. A feed-forward structure makes things simpler because there is a simple flow of computation through the network. In a layered feed-forward network (figure 1.3(b)) the weights are organized in layers connecting groups of units together.
Fully or partially connected? In a fully connected network all allowed connections are
implemented; in a partially connected network there is a structuring of connections.
For example in the AT&T handwriting recognition system (part 5 of these notes)
the network connectivity is structured to take account of prior knowledge about the
problem.

Figure 1.1: ANN unit with weighted input summation: inputs y_1, ..., y_n with weights w_1, ..., w_n produce the activation v = w_1 y_1 + w_2 y_2 + ... + w_n y_n and the output y = f(v).


Figure 1.2: Non-linear transfer functions: (a) step function; (b) sigmoid function, each plotted as f(v) against v.

Modularity Many problems can be attacked by constructing a system that consists of


several modules, with relatively sparse interconnections between modules. If the task
can be decomposed into subtasks, then it is possible to train the different modules
independently of each other.

Figure 1.3: Examples of (a) a fully-connected recurrent network and (b) a two layer feed-forward network, with an input-to-hidden layer and a hidden-to-output layer of weights connecting input units, hidden units and output units.

1.1.3 Learning
The principal reason why neural networks have attracted such interest is the existence of learning algorithms for neural networks: algorithms that use data to estimate the optimal weights in a network to perform some task. There are three basic approaches to learning in neural networks:

Supervised learning uses a training set that consists of a set of pattern pairs: an input pattern and the corresponding desired (or target) output pattern. The desired output may be regarded as the network's "teacher" for that input. The basic approach in supervised learning is for the network to compute the output its current weights produce for a given input, and to compare this network output with the desired output. The aim of the learning algorithm is to adjust the weights so as to minimize the difference between the network output and the desired output.


Figure 1.4: Single layer perceptron: input units y_i^I (including a bias unit y_0^I) are connected by weights w_{ki}^{OI} to output units, which compute v_k^O = ∑_i w_{ki}^{OI} y_i^I and y_k^O = f(v_k^O).

Reinforcement learning uses much less supervision. If a network aims to perform some task, then the reinforcement signal is a simple "yes" or "no" at the end of the task to indicate whether the task has been performed satisfactorily.

Unsupervised learning only uses input data; there is no training signal, unlike the previous two approaches. The aim of unsupervised learning is to make sense of some data set, for example by clustering similar patterns together, or by compressing the data.

1.1.4 Applications of neural networks


Classification is the task of assigning inputs to a number of discrete categories or classes. Examples include classifying a handwritten letter as one from A-Z, classifying a speech pattern as the corresponding word, etc.

Function approximation is the task of learning a mapping that generates a vector of numerical outputs from a vector of numerical inputs. Many real world problems fall into this category (e.g. estimating the flow in a pipeline, given inputs from sensors).

Forecasting involves predicting the future from the past. Examples abound, such as financial prediction (stock markets, currency), electricity consumption, etc.

Control problems are those in which the values of input variables are determined in order to achieve the desired outputs. Examples include robot arm control and various automotive systems.

1.2 Supervised learning in feed-forward networks


1.2.1 Single layer perceptrons
The single layer perceptron discussed here is slightly different to Rosenblatt's perceptron. However, as we will see, it is the form that is most generalizable to multi-layered architectures. The single layer perceptron consists of a set of input units connected by a single layer of weights to a set of output units (figure 1.4).
If we have I input units, each of which outputs a value y_i^I, and O output units with output values y_k^O, connected by weights w_{ki}^{OI}, then we can write the behaviour of the network using


the following equations:

    v_k^O = ∑_i w_{ki}^{OI} y_i^I                                (1.6)

    y_k^O = f^O(v_k^O) = f^O(∑_i w_{ki}^{OI} y_i^I)              (1.7)

    f^O(v) = 1 / (1 + exp(-v))                                   (1.8)

where we have assumed that the output unit transfer function f^O(.) is a sigmoid. The activation value of output unit k is represented by v_k^O. Input 0 always has value 1, so the weight w_{k0}^{OI} corresponds to the bias of output unit k.

Exercise: Rewrite the above equations in matrix-vector notation (i.e. representing the weights as a single matrix W, the inputs as a vector y^I, and so on).
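As a hint at the shape of the answer, here is one possible NumPy sketch of the forward pass in matrix-vector form; folding the biases into column 0 of W is a convention assumed by this sketch, not prescribed by the notes:

```python
import numpy as np

def slp_forward(W, y_in):
    """Single layer perceptron forward pass in matrix-vector form:
    v^O = W y^I (equation (1.6)), y^O = sigmoid(v^O) ((1.7)-(1.8)).
    Column 0 of W holds the biases w_{k0}^{OI}; y_in has y_0^I = 1 prepended."""
    v = W @ y_in
    return 1.0 / (1.0 + np.exp(-v))

W = 0.1 * np.random.randn(2, 4)                    # O = 2 outputs, I = 3 inputs + bias
y_in = np.concatenate(([1.0], [0.2, 0.7, -0.4]))   # input 0 always has value 1
print(slp_forward(W, y_in))
```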

The key problem in supervised training is that of credit assignment: given that there is an output error (i.e. the network outputs differ from the desired outputs supplied by the training data), how do we adjust the weights of the network to minimize that output error over all the training data? The basic idea is as follows:

- If an output unit has the desired output value then do not adjust the weights leading into that unit;
- If an output unit has an output less than the desired output, then increment the weights leading into that unit by a small amount (proportional to the difference between desired and actual outputs);
- If an output unit has an output greater than the desired output, then decrement the weights leading into that unit by a small amount (proportional to the difference between desired and actual outputs).

This assumes that all the inputs are positive.

Exercise: How do the rules change if some inputs are negative?

This is rather heuristic. A more principled way of proceeding involves defining an error function for the network. For a single pattern p we define the error E^p:

    E^p = (1/2) ∑_k e_k^2                                        (1.9)

        = (1/2) ∑_k (d_k^O - y_k^O)^2                            (1.10)

    e_k = d_k^O - y_k^O                                          (1.11)

where d_k^O is the desired value for output unit k and e_k is the local error for output unit k. We describe E^p as the sum squared error. We can sum E^p over all patterns to give the overall training set error:

    E = ∑_p E^p                                                  (1.12)

E^p tells us how well the network performs on pattern p. E^p = 0 means that it is "perfect" for this pattern and no weights need adjusting. The larger E^p, the worse the network is doing. We can write e_k in terms of the weights:

    e_k = d_k^O - f^O(∑_i w_{ki}^{OI} y_i^I)                     (1.13)


Thus we can rewrite E^p as:

    E^p = (1/2) ∑_k (d_k^O - f^O(∑_i w_{ki}^{OI} y_i^I))^2       (1.14)

E^p is a function of the weights w_{ki}^{OI}, the input values y_i^I and the desired outputs d_k^O. The inputs and desired outputs are specified by the training set; the task is to optimize the weights given that training set. Thus the learning task becomes the following: adjust the weights w_{ki}^{OI} so that E is minimized.
One way to minimize E (or any function) is by a process called gradient descent. The idea of gradient descent is to look at the error surface (i.e. E in terms of all the w_{ki}^{OI}) and to adjust the weights in the direction of steepest descent. We can express this as:

    Δw_{ki}^{OI} = -η ∂E^p/∂w_{ki}^{OI}                          (1.15)

    w_{ki}^{OI}(new) = w_{ki}^{OI}(old) + Δw_{ki}^{OI}           (1.16)

η is a small positive constant called the step size or learning rate, which governs how much the weights are adjusted. There is a minus sign in (1.15): this is because we want to go "downhill"; without the minus sign we would be performing gradient ascent and would end up at a point of maximum error!
To do this we need to calculate the partial derivatives ∂E^p/∂w_{ki}^{OI}. This is straightforward since we have already written E^p as a function of the weights. Using the chain rule of differentiation:

    ∂E/∂w_{ki}^{OI} = (∂E/∂y_k^O) (∂y_k^O/∂w_{ki}^{OI})                          (1.17)

                    = (∂E/∂y_k^O) (∂y_k^O/∂v_k^O) (∂v_k^O/∂w_{ki}^{OI})          (1.18)

We can write down each of these derivatives:

    ∂E/∂y_k^O = -(d_k^O - y_k^O) = -e_k                                          (1.19)

    ∂y_k^O/∂v_k^O = f'(v_k^O) = y_k^O (1 - y_k^O)                                (1.20)

    ∂v_k^O/∂w_{ki}^{OI} = y_i^I                                                  (1.21)

(Exercise: Verify that each of these derivatives is correct.)

Substituting (1.19), (1.20) and (1.21) into (1.18) we have the derivative we require:

    ∂E/∂w_{ki}^{OI} = -(d_k^O - y_k^O) y_k^O (1 - y_k^O) y_i^I                   (1.22)

So the expression for the weight update is obtained by inserting (1.22) in (1.15):

    Δw_{ki}^{OI} = η (d_k^O - y_k^O) y_k^O (1 - y_k^O) y_i^I                     (1.23)

So we have a learning algorithm for single layer perceptrons:

1. Initialize weight matrix to small random values


2. While E is unsatisfactory

   (a) For each pattern p:

       i. Compute output node activations (v_k^O) using equation (1.6)
       ii. Compute network outputs (y_k^O) using equations (1.7) and (1.8)
       iii. Compute the local output errors (e_k) and the network error for this pattern (E^p) using equations (1.11) and (1.9)
       iv. Compute the gradient of the error E^p with respect to each weight w_{ki}^{OI} and update the weights using equation (1.23)

Exercise: Why are the weights initialized to small random values (rather than initializing all weights to 0)?
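Putting steps 1-2 together, here is a minimal NumPy sketch of the whole training loop (my own illustrative code; a fixed number of epochs stands in for the "while E is unsatisfactory" test):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_slp(X, D, eta=0.5, epochs=1000):
    """Online gradient descent for a single layer perceptron.
    X: (P, I) array of input patterns; D: (P, O) array of desired outputs."""
    P, I = X.shape
    O = D.shape[1]
    W = 0.1 * np.random.randn(O, I)      # small random values, not zeros
    c = 0.1 * np.random.randn(O)         # biases
    for _ in range(epochs):              # stand-in for "while E is unsatisfactory"
        for p in range(P):
            y = sigmoid(W @ X[p] + c)             # steps i-ii: (1.6)-(1.8)
            delta = (D[p] - y) * y * (1.0 - y)    # error times f'(v), from (1.22)
            W += eta * np.outer(delta, X[p])      # step iv: update (1.23)
            c += eta * delta                      # bias: weight on constant input 1
    return W, c

# Example: learn logical OR, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [1]], dtype=float)
W, c = train_slp(X, D)
print(sigmoid(X @ W.T + c))   # outputs approach the targets
```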

1.2.2 Linear Separability

Unfortunately there are some problems that a single-layer perceptron cannot handle: those problems that are not linearly separable. Assuming for the moment that the transfer function is a step function, an output unit defines a hyperplane in input space.[1] If the output unit can orient its hyperplane so that each input is mapped to the correct output, all well and good. But for many problems this is not the case (see figure 1.5), and a single layer perceptron does not have the capability to solve them. The solution is to use multi-layer networks.

Figure 1.5: Examples of (a) linearly separable and (b) not linearly separable classification problems in one-dimensional input space.

Exercise: Satisfy yourself that a single layer perceptron cannot solve the XOR problem.

1.2.3 Multi-layer perceptrons: the back-propagation algorithm

In the single layer perceptron all units were designated as either input or output. If we add another layer of weights to this model, then we find that there is a new set of units that are neither input nor output units; these are usually referred to as hidden units, and can be interpreted as the network's internal representation of a problem. For simplicity, we'll consider a two layer MLP, but everything covered extends to an MLP with more than two layers (figure 1.6).

[1] In one-dimensional input space it would define a point; in two-dimensional input space it would define a line; in three-dimensional input space a plane. Things get harder to visualize in higher dimensions.


Figure 1.6: A multi-layer perceptron (two layers): input units y_i^I are connected to hidden units y_j^H by weights w_{ji}^{HI}, and hidden units to output units y_k^O by weights w_{kj}^{OH}; bias units y_0^I and y_0^H feed the hidden and output layers.

NOTATION

    y_i^I         Output of input unit i
    y_i^H         Output of hidden unit i
    y_i^O         Output of output unit i
    v_i^I         Activation of input unit i
    v_i^H         Activation of hidden unit i
    v_i^O         Activation of output unit i
    f^H(.)        Transfer function for hidden units (e.g. sigmoid), y_i^H = f^H(v_i^H)
    f^O(.)        Transfer function for output units (e.g. sigmoid), y_i^O = f^O(v_i^O)
    w_{ji}^{HI}   Weight from input unit i to hidden unit j
    w_{ji}^{OH}   Weight from hidden unit i to output unit j
    w_{j0}^{HI}   Bias of hidden unit j
    w_{j0}^{OH}   Bias of output unit j
    d_i^O         Desired (target) value for output unit i
    η             Gradient descent learning rate

As for the single layer perceptron (equations (1.6)-(1.8)) we can write down the set of equations that describe the behaviour of the network:

    v_j^H = ∑_i w_{ji}^{HI} y_i^I                                (1.24)

    y_j^H = f^H(v_j^H)                                           (1.25)

    v_k^O = ∑_j w_{kj}^{OH} y_j^H                                (1.26)

    y_k^O = f^O(v_k^O)                                           (1.27)

And we'll assume both transfer functions are sigmoids:

    f^O(v) = f^H(v) = 1 / (1 + exp(-v))                          (1.28)

The process described in equations (1.24)-(1.28) is sometimes referred to as the forward propagation.
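A minimal NumPy sketch of this forward propagation, with the bias units folded into explicit bias vectors (my own convention, not the notes'):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(W_hi, b_h, W_oh, b_o, y_in):
    """Forward propagation for a two layer MLP (equations (1.24)-(1.28)).
    W_hi holds the weights w_{ji}^{HI}, W_oh the weights w_{kj}^{OH};
    b_h and b_o are the biases (the weights from the bias units)."""
    y_h = sigmoid(W_hi @ y_in + b_h)   # (1.24)-(1.25)
    y_o = sigmoid(W_oh @ y_h + b_o)    # (1.26)-(1.27)
    return y_h, y_o
```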


The error function for an MLP is defined by equations (1.9) and (1.11), as for a single layer network. However, the credit assignment problem is more difficult: in a single layer network every weight connects an input to an output, and credit assignment is accomplished by a straightforward differentiation. However, for an MLP the existence of hidden units (which do not have a target) complicates things.
Similarly to (1.14) we can write E^p in terms of the hidden-to-output weights w_{kj}^{OH}:

    E^p = (1/2) ∑_k e_k^2 = (1/2) ∑_k (d_k^O - f^O(∑_j w_{kj}^{OH} y_j^H))^2     (1.29)

The hidden-to-output weights can be adjusted by steepest descent in the same way as we trained the weights of the single layer network (equations (1.17)-(1.23)):

    ∂E/∂w_{kj}^{OH} = (∂E/∂y_k^O) (∂y_k^O/∂v_k^O) (∂v_k^O/∂w_{kj}^{OH})          (1.30)

                    = -(d_k^O - y_k^O) y_k^O (1 - y_k^O) y_j^H                   (1.31)

    Δw_{kj}^{OH} = η (d_k^O - y_k^O) y_k^O (1 - y_k^O) y_j^H                     (1.32)

The same approach can be used for the input-to-hidden weights:

    ∂E/∂w_{ji}^{HI} = (∂E/∂y_j^H) (∂y_j^H/∂w_{ji}^{HI})                          (1.33)

The second term on the right hand side can be written down simply by differentiating (1.24) and (1.25):

    ∂y_j^H/∂w_{ji}^{HI} = (∂y_j^H/∂v_j^H) (∂v_j^H/∂w_{ji}^{HI})                  (1.34)

                        = y_j^H (1 - y_j^H) y_i^I                                (1.35)

This is analogous to (1.20) and (1.21) for the single layer network.
But this leaves us with the problem of computing ∂E/∂y_j^H. This is easy for an output unit, since it is clear exactly how the error varies relative to each output (just differentiate (1.29) with respect to y_k^O). But we don't have an error defined for hidden units: this is the hidden unit credit assignment problem. Each hidden unit has an effect on the error associated with each output unit to which it is connected, so it would make sense for the error signal for a hidden unit to be a weighted combination of the output unit error signals. This is the famous "back-propagation of error", and we can now see that it arises from a slightly more subtle application of the chain rule of differentiation:

    ∂E/∂y_j^H = ∑_k (∂E/∂y_k^O) (∂y_k^O/∂y_j^H)                                  (1.36)

We must sum over all output units, since each hidden unit is connected to all output units and so affects the error of all output units. The required derivatives are now straightforward:

    ∂E/∂y_k^O = -(d_k^O - y_k^O)                                                 (1.37)

    ∂y_k^O/∂y_j^H = y_k^O (1 - y_k^O) w_{kj}^{OH}                                (1.38)

    ∂E/∂y_j^H = -∑_k (d_k^O - y_k^O) y_k^O (1 - y_k^O) w_{kj}^{OH}               (1.39)


And inserting (1.39) and (1.35) in (1.33) we find:

    ∂E/∂w_{ji}^{HI} = -∑_k (d_k^O - y_k^O) y_k^O (1 - y_k^O) w_{kj}^{OH} y_j^H (1 - y_j^H) y_i^I      (1.40)

                    = -( y_j^H (1 - y_j^H) ∑_k (d_k^O - y_k^O) y_k^O (1 - y_k^O) w_{kj}^{OH} ) y_i^I  (1.41)

    Δw_{ji}^{HI} = η ( y_j^H (1 - y_j^H) ∑_k (d_k^O - y_k^O) y_k^O (1 - y_k^O) w_{kj}^{OH} ) y_i^I    (1.42)

We can now write down the algorithm for training an MLP by back-propagation of error:

1. Initialize weight matrix to small random values

2. While E is unsatisfactory

   (a) For each pattern p:

       i. Compute hidden node activations (v_j^H) using equation (1.24)
       ii. Compute hidden node outputs (y_j^H) using equations (1.25) and (1.28)
       iii. Compute output node activations (v_k^O) using equation (1.26)
       iv. Compute network outputs (y_k^O) using equations (1.27) and (1.28)
       v. Compute the local output errors (e_k) and the network error for this pattern (E^p) using equations (1.11) and (1.9)
       vi. Compute the gradient of the error E^p with respect to each hidden-to-output weight w_{kj}^{OH} and update the weights using equation (1.32)
       vii. Compute the gradient of the error E^p with respect to each input-to-hidden weight w_{ji}^{HI} and update the weights using equation (1.42)
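The sketch below implements steps (i)-(vii) for a single pattern in NumPy; the variable names are mine:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(W_hi, b_h, W_oh, b_o, y_in, d, eta=0.5):
    """One online back-propagation update for a two layer MLP,
    following equations (1.24)-(1.28), (1.32) and (1.42)."""
    # Steps i-iv: forward propagation
    y_h = sigmoid(W_hi @ y_in + b_h)            # (1.24)-(1.25)
    y_o = sigmoid(W_oh @ y_h + b_o)             # (1.26)-(1.27)
    # Output layer error signal: (d_k - y_k) f'(v_k)
    delta_o = (d - y_o) * y_o * (1 - y_o)
    # Hidden layer error signal: back-propagated through W_oh, times f'(v_j)
    delta_h = (W_oh.T @ delta_o) * y_h * (1 - y_h)
    # Steps vi-vii: updates (1.32) and (1.42); biases are weights on input 1
    W_oh += eta * np.outer(delta_o, y_h)
    b_o  += eta * delta_o
    W_hi += eta * np.outer(delta_h, y_in)
    b_h  += eta * delta_h
    return 0.5 * np.sum((d - y_o) ** 2)         # E^p, equation (1.9)
```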


1.3 Issues in training MLPs by backprop

1.3.1 Batch and online learning
In the example above we have discussed online learning; that is, the weights are updated after each pattern presentation. An alternative approach is batch learning, where the error function optimized is E (equation (1.12)) rather than E^p (equation (1.9)). Optimizing E simply involves summing the gradients over all patterns:

    ∂E/∂w_{ji} = ∑_p ∂E^p/∂w_{ji}                                (1.43)

The algorithm for batch learning involves passing through the entire training set before updating the weights, thus ensuring that the gradients used for weight update will contain information from the entire training set, unlike the pattern-by-pattern approach of online learning. However, batch learning requires more memory (to hold the incomplete sum of gradients for each weight), and for training sets of more than a few tens of patterns online learning is much faster.
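The difference is easy to see in skeleton form. In this sketch, grad_fn is an assumed helper that returns ∂E^p/∂W for one pattern:

```python
import numpy as np

def batch_epoch(W, grad_fn, patterns, eta=0.5):
    """One epoch of batch learning: accumulate dE^p/dW over all patterns
    (equation (1.43)), then apply a single weight update."""
    grad_sum = np.zeros_like(W)
    for x, d in patterns:
        grad_sum += grad_fn(W, x, d)   # running sum: the extra memory cost
    return W - eta * grad_sum          # one update per pass through the data

def online_epoch(W, grad_fn, patterns, eta=0.5):
    """One epoch of online learning: update the weights after every pattern."""
    for x, d in patterns:
        W = W - eta * grad_fn(W, x, d)
    return W
```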

1.3.2 When is training complete?

The question of when to stop training is not trivial! For now we can say that training continues until the error function is either (a) sufficiently small, or (b) has stopped decreasing. But see part 3 of these notes...

1.3.3 Will backprop always find the solution?

Backprop will always find a solution, but it is only guaranteed to find a local minimum: there may be a better global minimum in some unexplored region of weight space. One way to evaluate the quality of the local minimum is to train the network several times, starting with different random weight matrices, and to compare the performance of the resultant trained networks.


Chapter 2

GENERALIZATION

2.1 Introduction
We can regard feed-forward networks as implementing a function mapping an input vector y^I = (y_1^I, ..., y_n^I) to an output vector y^O = (y_1^O, ..., y_n^O). The parameters of this function are the weights. We can divide the functions implemented by an MLP into classifiers (which map an input vector to a discrete class) and regressors (which map a continuous input vector to a continuous output vector). In the case of classification the output vector usually has the same dimensionality as the number of classes, with each element corresponding to one of the classes. In regression, the output vector corresponds to the continuous valued quantity being estimated.
The classification and regression functions that are estimated using an MLP (or other neural network) are unknown: all we have is a training set that gives input-output examples of the function. The role of neural network training is to identify this "mystery function", given only the training data. What the training process (e.g. backprop) does is to estimate the parameters of the function (i.e. the weights of the network) so that it replicates the data as well as possible and generalizes well to new data. It turns out that obtaining the best generalization performance on new data is not usually the same as replicating the training data as well as possible.
Consider the case when we have very few training patterns but a large network with many weights. It will be relatively easy to manipulate the weights to reproduce the training set, but it does not seem likely that the resulting network will have learned the characteristics of new data. On the other hand, if we have very many training patterns, and we manage to train the network to replicate them, then it would seem more likely that it will respond correctly to new, unseen patterns. Our aim is to make such intuitions precise, and to develop approaches which can maximize the generalization performance of a network.

2.2 Evaluating Generalization: Training, Test and Validation Sets
The obvious way to obtain an estimate of how well a network generalizes is to test its performance on new data. However, we have to be careful which data we use: if we keep on using the same test set, then although this data isn't being used by the training algorithm, the fact that we are trying to get the best performance on this particular data set may cause other factors (e.g. the choice of learning rate) to be tuned to this test set. To make the distinction explicit it is common to work with three data sets:

Training Set This is the data that is used by the training algorithm to adjust the weights of the network.


Validation (Development) Set This data is used during training to assess how well the network is currently performing; the performance of the network on this data may be used to guide the training in some way (e.g. controlling the learning rate, deciding when to stop training, choosing between several trained networks).

Test (Evaluation) Set This is the genuine test data and, ideally, should be used once only, after training is complete.
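One common way to produce these three sets, sketched below, is to shuffle and split a single data set; the 80/10/10 fractions are illustrative, not prescribed by the notes:

```python
import numpy as np

def three_way_split(X, D, frac_train=0.8, frac_val=0.1, seed=0):
    """Shuffle a data set and split it into training, validation and test
    sets; the test set should be used once only, after training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(frac_train * len(X))
    n_val = int(frac_val * len(X))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], D[train]), (X[val], D[val]), (X[test], D[test])
```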

2.3 Training and Generalization


We can think of the neural network training process in the following way. A network with
a given architecture is able to implement a given set of functions, depending on how its
weights are set. For example, the set of functions that a perceptron with a single output unit can implement corresponds to the set of hyperplanes in the input space. Each training
example that is presented to the network may be represented by a certain subset of the
total set of functions available to the network. If the intersection of the function subsets
relating to each data point is non-empty, then the training set can be learned perfectly by the network. (Note that finding a set of weights that corresponds to such a function is not guaranteed: this is the training problem.) In the case of an MLP learning a logic function (e.g. XOR), this is the case after training. However, in pattern recognition it is not usually the case that a network is able to learn exactly all the training data; indeed it is not always desirable to do this.
Generalization is a measure that tells us how well the network performs on the actual
problem once training is complete. This can be measured by looking at the performance
of the network on evaluation data unseen during the training process. By definition, such
data is not available when the network is being trained. Thus, we want to be able to predict
the performance of the network on such data, and train the network so as to maximize the
predicted performance.
Since maximizing the generalization performance of the network involves producing
the best possible performance on new data, a simple approach to estimate the generalization
performance of networks is to evaluate their performance on a validation set. The network
is trained in the usual way, by minimizing the error function with respect to the training
data set. The performance on an unseen test set is estimated using a validation set. If we
want to choose between different networks trained on the same data, then we choose the
one that performs best on the validation set. This procedure is sometimes called a hold-out
method, since the validation set may be regarded as data held out from the training set.
In figure 2.1 we see the result of estimating a small set of training data using two networks, one implementing a straight line (corresponding to a single layer network), the other implementing a much more complex function (corresponding to a multi-layer network with several hidden units). The more complex network is able to approximate the function exactly, whereas the single layer network fits the data reasonably well (but not exactly) with its straight line. However, we want to know how well each network will generalize to unseen data. If our new data were as shown in figure 2.2, then it seems that the more complex function is indeed doing "the right thing" and generalizes well. However, this would be a lucky situation: does the original data really lead us to expect that all the new data will fall on this curve? A more realistic expectation of new data is that illustrated in figure 2.3. In this case, the straight line still represents the data quite well, but the function implemented by the more complex network is not a good estimate for the new data.
This example illustrates some general principles:
1. Good performance on the training data does not necessarily lead to good generalization performance;

2. Simple solutions are better (more likely to generalize well) than complex solutions (unless a large amount of training data indicates otherwise);

3. Larger (more complex) networks require more training data.

Figure 2.1: Estimating a function from training data. The function on the left corresponds to a single unit, the function on the right to a network with several hidden units.

Figure 2.2: A set of new data applied to the two networks from figure 2.1.

Figure 2.3: Another set of new data applied to the two networks from figure 2.1.


2.4 Bias and Variance


There are two basic reasons for poor generalization: (1) too many network parameters compared with the number of training examples (i.e. trying to fit too complex a function to the training data); (2) trying to model the training data with a function that is too simple (compared with the function that generated the training data). In case (1) the network is too flexible: it will have learned every bit of the training data, including all the noise. In case (2) the network is not flexible enough: it will apply too much smoothing, and as well as not modelling the noise in the training data, it will also fail to model some of the genuine variability of the underlying function.
We can gain some insight into this problem by considering the overall generalization error as the sum of two terms: the bias and the variance.

NB: The term bias in this context has nothing to do with the bias or threshold of a unit in a network!

- A model which is too simple, or too inflexible, is said to have a large bias (and a small variance);

- A model which has too much flexibility relative to the training data has a large variance (and a small bias).

Figure 2.4: Trade-off between bias and variance: high bias and low variance (left); low bias
and high variance (right).

As an example of bias and variance, consider figure 2.4. In this case we see the result of two networks trying to model a particular noisy data set. The network on the left has a high bias because it has relatively few parameters compared with the training data: it is limited in what functions it can represent (i.e. biased) and the end result is over-smoothing. The network on the right is very flexible: it is able to model just about every data point exactly; this is a very low bias network, but we say it has a high variance. Bias and variance are complementary quantities, and the best generalization performance will be obtained when we achieve the optimal tradeoff (figure 2.5).
Let's get more specific about what we mean by bias, variance and generalization error. We could use the training set error E_train as an estimate of the generalization error:

    E_train = ∑_{training set} ∑_i (y_i^O - d_i)^2  =  ∑_{training set} (y^O - d)^2      (2.1)


Figure 2.5: Optimal trade-off between bias and variance.

But we have seen that this is not a good predictor for the generalization error. So how can we express the generalization error? If we are performing cross validation, then we can get an estimate of the generalization error from the cross validation set, E_xval:

    E_xval = ∑_{validation set} ∑_i (y_i^O - d_i)^2  =  ∑_{validation set} (y^O - d)^2   (2.2)

The error on the validation set is dependent on the network weights, which are in turn dependent on the training data. So the performance on the validation set depends on which training data set was used to train the network. What we want is an expression for the generalization error that is independent of any particular set of training data. Precisely, we want to express the generalization error E_gen as the average over all possible training data sets:

    E_gen = E_D[∑_i (y_i^O - d_i)^2] = E_D[(y^O - d)^2]          (2.3)

E_D[f_D(x)] is the expected value (or average) of a function f_D(x) over all possible training data sets D, where f_D(x) has a dependence on the training data. E_gen is not a quantity that we can directly estimate, but it is important if we want to get a theoretical understanding of the generalization error.
In equation (2.3) we can expand the term inside the brackets as follows:

    (y^O - d)^2 = (y^O - E_D[y^O] + E_D[y^O] - d)^2                                      (2.4)

                = (y^O - E_D[y^O])^2 + (E_D[y^O] - d)^2 + 2 (y^O - E_D[y^O])(E_D[y^O] - d)   (2.5)

To complete (2.3) we need to take the expectation over all data sets D. This results in the third term above vanishing, since:

    E_D[(y^O - E_D[y^O])(E_D[y^O] - d)]
        = E_D[y^O E_D[y^O]] - E_D[y^O d] - E_D[E_D[y^O] E_D[y^O]] + E_D[E_D[y^O] d]      (2.6)
        = E_D[y^O] E_D[y^O] - E_D[y^O] d - E_D[y^O] E_D[y^O] + E_D[y^O] d                (2.7)
        = 0                                                                              (2.8)

Recall that E_D[E_D[f_D(x)]] = E_D[f_D(x)], since E_D[f_D(x)] does not itself depend on D (we have already averaged over D).


This leaves us with:

    E_gen = E_D[∑_i (y_i^O - d_i)^2]                             (2.9)

          = (E_D[y^O] - d)^2 + E_D[(y^O - E_D[y^O])^2]           (2.10)

where the first term is the (squared) bias and the second is the variance. Let's see what this expression means.


 The bias is the difference of the average (over all data sets) network function y from
the desired function d. If the network was completely flexible and could model every
data set then this quantity would be very low. If the network is inflexible and tends
to produce the same function for each data set, then this quantity will be large.
 The variance is the average of the difference between the network function from the
average network function. If the network is very inflexible (high bias) the variance
will be low since the network will tend to produce the same output for all data sets.
On the other hand a very flexible (low bias) network will have a high variance, since
the network will be able to model each data set exactly (or nearly so), and so the
network will have a quite different function for each data set.
Look back at figures 2.4 and 2.5 in terms of these definitions of bias and variance. There
is a natural tradeoff between bias and variance:
 A function closely fitted to the training data will have a large variance and therefore
can be expected to give a large training error;
 If we smooth the function, the variance will be decreased, but at the cost of increasing
bias, and if this is taken too far the expected error will again be large.
This tradeoff between bias and variance is of crucial importance when using neural com-
puting for practical problems.
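The decomposition in (2.10) can be demonstrated numerically. The sketch below is my own construction: it uses polynomial fits (degree 1 as the inflexible model, degree 9 as the flexible one) rather than networks, and estimates bias^2 and variance by averaging over many simulated training sets:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
true = np.sin(2 * np.pi * x)                    # the unknown "desired" function d

for degree in (1, 9):                           # inflexible vs very flexible model
    fits = []
    for _ in range(200):                        # 200 simulated training sets D
        y_noisy = true + 0.3 * rng.standard_normal(x.shape)
        coeffs = np.polyfit(x, y_noisy, degree)
        fits.append(np.polyval(coeffs, x))      # y^O fitted to this data set
    fits = np.array(fits)
    avg_fit = fits.mean(axis=0)                 # E_D[y^O]
    bias2 = np.mean((avg_fit - true) ** 2)      # (E_D[y^O] - d)^2
    variance = np.mean((fits - avg_fit) ** 2)   # E_D[(y^O - E_D[y^O])^2]
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```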

2.5 References
1. D. W. Patterson, Artificial Neural Networks: Theory and Applications, Prentice Hall, 1996. (Chapter 7)
2. R. P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE Acoustics, Speech and Signal Processing Magazine, 4(2), pp. 4-22, April 1987.
3. D. R. Hush and B. G. Horne, "Progress in Supervised Neural Networks", IEEE Signal Processing Magazine, 10(1), pp. 8-39, January 1993.
4. C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995. (Chapters 1 and 9)


Chapter 3

OVERTRAINING AND HOW TO AVOID IT

3.1 Introduction
The lowest generalization error is achieved by an optimal tradeoff between bias and variance. A function that is too closely fitted to the training set will tend to have a large variance and hence give a large expected error; this is called overtraining. We can decrease the variance by smoothing the function (hence increasing the bias), but if we go too far the large bias will cause the expected generalization error to become large again.
One way to reduce both the bias and variance is to use more flexible networks while simultaneously increasing the size of the training set. Increasing the flexibility of the network will reduce the bias, while adding more training data will decrease the variance, since each extra training data point adds a new constraint on the space of functions available to the network that implement the function described by the training data.
However in many situations we do not have the option of increasing the amount of
training data. If we have some prior information about the unknown function we are trying
to model, then using this information to constrain the network function will not necessarily
increase the bias. For example, if the true function is linear, then constraining the network
to linear functions will not increase the bias since the constrained functions are consistent
with the true network function. The bias-variance model tells us that in some situations
(with limited training data) performance will be better with a constrained model (e.g. a
simple linear network) than a less constrained model (e.g. a multi-layer perceptron) even
though the less flexible model is a special case of the more flexible one.
However, if we do not have this task-specific information we can still make an attack
on the bias-variance problem (or, equivalently, the problem of overtraining). We do this by
methods that reduce the effective complexity of the network. Two important methods for
doing this are cross validation and regularization.

3.2 Cross-Validation and Early Stopping


The basic idea of cross-validation is to hold out a validation set from the training data,
which is used to control when to stop training. The idea of early stopping comes about
from some empirical results that indicate that while the training error will monotonically
decrease as training progresses, the validation set error will only decrease up to a certain
point, after which it will increase again. This is illustrated in figure 3.1. This result implies
that generalization can be improved if the network is tested with the validation set every
epoch or every few epochs (an epoch is a complete pass through the training set), and that


training is stopped at the minimum of the validation error. In practice, this may mean that
training continues, but the weight matrix corresponding to the minimum validation error is
stored as the best performing network.

Figure 3.1: Evolution of the training set and validation set errors as training progresses. t* marks the optimal point, according to the validation set, at which to stop training.

Why does early stopping work? The bias and variance as defined earlier will not change as training progresses: the network does not become any more or less flexible. However, we can regard the effective variance as increasing as training progresses, while the effective bias decreases. It's an active research issue just how these quantities should be mathematically defined. Essentially, the idea is that as the training process continues, so the effective complexity of the model increases, as the network is able to model the training set with increasing accuracy.

Another way of thinking about early stopping is that it stops the network weights being over-tuned to the training data. At the start of training, the network is not at all tuned to the training data. As training progresses, the network becomes increasingly well tuned to every characteristic of the training data. By stopping training early, the network is less likely to have overfit the training data, as it will not have had time to learn the noise as well as the signal. And performance on the validation set is a simple, easy-to-compute way to find the best time to stop training.

Although there are theories that predict how well we can expect these cross-validation approaches to work, there is no convincing theory which explains why early stopping works in terms of quantities like bias and variance. However, although it is not theoretically "pure", early stopping is a commonly used approach to maximizing generalization, and is usually very effective. It can also be used in conjunction with other approaches such as regularization.

An example of early stopping is to be found in the large phoneme classification networks used by the speech group here at Sheffield. These networks have a lot of training data (1-10 million training examples) and a lot of parameters (100,000 to 1 million weights). To prevent overtraining, cross-validation controlled early stopping is used. In this setup training typically involves passing through the entire training set just 10-15 times.
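In skeleton form, cross-validation controlled early stopping might look like the following sketch, where train_step and val_error are assumed helpers (one epoch of training, and the validation set error, respectively):

```python
import copy
import numpy as np

def train_with_early_stopping(net, train_step, val_error, max_epochs=200):
    """A minimal early stopping loop: train, test on the validation set each
    epoch, and keep the weights corresponding to the minimum validation
    error as the best performing network."""
    best_err = np.inf
    best_net = copy.deepcopy(net)
    for epoch in range(max_epochs):
        train_step(net)                     # one pass through the training set
        err = val_error(net)                # validation set error this epoch
        if err < best_err:                  # new minimum of the validation error
            best_err = err
            best_net = copy.deepcopy(net)   # store this weight matrix
    return best_net, best_err
```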


3.3 Regularization
The error function that is optimized when training a neural network is the error on the training set:

    E_train = ∑_{training set} ∑_i (y_i^O - d_i^O)^2  =  ∑_{training set} (y^O - d^O)^2  (3.1)

However, as we have seen, we don't just want to optimize the network performance on the training set: this may lead to a low bias solution, at the expense of a high variance. What we want to do is to achieve good performance on the training set, while limiting the complexity of the network. The technique of regularization encourages this by adding a network complexity term E_W to the error function that is optimized:

    E = E_train + λ E_W                                          (3.2)

The penalty term E_W is sometimes referred to as the regularizer. The overall error function is now a tradeoff between the training set error and a model complexity term, with the tradeoff being controlled by the parameter λ. This is roughly analogous to the bias-variance tradeoff in the previous expression for the generalization error. So what sort of penalty function is E_W? Well, here are two properties we would like it to have:

- Easy to differentiate (so we can still train using back-propagation);
- Minimizing E_W corresponds in some way to minimizing the flexibility of the network.
Two types of regularizer that have been used are based on curvature and on weight decay.
The idea of minimizing the curvature is that high variance networks will typically have functions with lots of maxima and minima (as they try to fit every data point). These points are characterized by high curvature (a high second derivative). If we set E_W to correspond to this curvature:

    E_W = ∑_k ∑_i ∂^2 y_k^O / ∂(y_i^I)^2

then minimizing E_W will result in a low curvature (smoother) function. The derivatives of this type of regularizer can be computed for an MLP, but curvature based regularizers haven't been used much in practice in neural computing (although they are widely used in computer vision and other areas).
Weight decay is a more commonly used regularizer. In weight decay, we define:

    E_W = (1/2) ∑_i w_i^2

where the sum is over all the weights and biases in the network. This has a very simple partial derivative:

    ∂E_W/∂w_i = w_i

Remembering that backprop training uses the negative of this derivative, we have:

    Δw_i = -η ∂E/∂w_i                                            (3.3)

         = -η (∂E_train/∂w_i + λ ∂E_W/∂w_i)                      (3.4)

         = -η ∂E_train/∂w_i - ηλ w_i                             (3.5)


So the effect of this penalty term on the training process is to add a second force (in addition to modelling the training data) that causes the weights to "decay" at a rate proportional to the size of the weight.
One can think of weight decay as putting a spring on the weights. The strength of the spring is controlled by the parameter λ. If the training data is consistently pushing a weight in the same direction, then that force should outweigh the weight decay. However, if the training data is not consistently pushing the weight in one direction, then the weight decay term may start to dominate and the weight will decay to 0. In the latter case, if weight decay is not applied, then the weight value might walk randomly without being well determined by the training data. This is an example of the network being too flexible, and weight decay is a way of enabling the data to determine how best to decrease the flexibility.
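A sketch of the weight decay update of equation (3.5); lam stands in for the tradeoff parameter of equation (3.2):

```python
import numpy as np

def weight_decay_update(W, grad_train, eta=0.5, lam=1e-4):
    """One gradient step with weight decay (equation (3.5)): the usual
    data-fitting term plus a decay force proportional to each weight."""
    return W - eta * grad_train - eta * lam * W

# The decay term alone shrinks every weight towards 0 each step:
W = np.array([1.0, -2.0, 0.5])
print(weight_decay_update(W, np.zeros_like(W), eta=0.5, lam=0.1))
```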


Chapter 4

CLASSIFICATION USING NEURAL NETWORKS

4.1 Introduction
The pattern classification task is to classify an input vector x (= y^I using the notation of part 1 of these notes) into one of M classes (C_1, C_2, ..., C_M). This is a problem that has been studied for many years in the field of statistical pattern recognition. A solution to this problem is to compute the Bayesian posterior probability P(C_i|x) for each class and to assign the input vector to the class with the largest posterior probability.
The posterior probability P(C_i|x) is the conditional probability of the class C_i given the input data x. A method which directly estimates such probabilities may be thought of as a recognition model. Recognition models are discriminative: training consists of moving class boundaries to maximize the correct classification of the training data by the model.
Standard statistical pattern recognition techniques do not usually estimate P(C_i|x) directly, as this can be difficult. Instead they use a generative model. A generative model may be thought of as a machine C_i that generates pattern vectors x with a likelihood p(x|C_i). Training consists of building a machine for each class, using the training data; recognition involves computing the likelihood of each machine generating a test example, and labelling the pattern with the class whose machine was most likely to have generated the data.
Intuitively it seems that neural networks are more like recognition models than generative models when used for classification. Indeed, it turns out that (given some conditions) we can show that feed-forward networks trained as classifiers directly estimate the posterior probability of each class given the data, P(C_i|x).

4.2 Estimation of Posterior Probabilities

When a feed-forward network (e.g. an MLP) is trained as a classifier, a 1-from-M output coding is usually employed. That is, if there are M classes we use M output units, one for each class. The target vector d used in training for a pattern of class i will consist of zeros, except that d_i = 1. If:

- The network is trained using a sum-squared error cost function (i.e. the usual error function),
- There is enough training data,
- The network is flexible enough to represent the desired function, and
- The learning algorithm (e.g. backprop) is powerful enough to find the global minimum,

then it may be proven that when previously unseen patterns are presented to a trained network, the output y_i^O will be an estimate of the posterior probability of the class C_i given the data, P(C_i|x).
The proof of this result is a little involved and I won't reproduce it here; you can look at it in Richard and Lippmann (1991). The proof works by looking at the expected value of each desired output given the input, E[d_i|x]. It turns out that this is equal to the probability P(d_i = 1|x). If we use a 1-from-M output coding, then P(d_i = 1|x) = P(C_i|x), the posterior probability of the class given the data.
Note that the above assumptions are rarely met: to minimize the generalization error we do not usually train to a minimum on the training set, for example. However, experiments with both real and simulated data have indicated that networks trained as classifiers tend to give good posterior probability estimates.
This result implies:

- The elements of the output vector produced by the network in response to an input will sum to 1;
- Each element will be between 0 and 1.

Both of these arise because the outputs of the network are probabilities.
This is a very important result:

- It puts neural networks on a sound statistical footing, so that they can be understood in terms of well-understood pattern recognition approaches;
- Networks that estimate posterior probabilities may be combined in a principled way with other statistical methods;
- We have a practically useful interpretation of the meaning of the real-valued outputs of a 1-from-M classifier;
- A variety of practical implications arise from this result (next section).
This result highlights several fallacies that have been believed about neural network classifiers:

Fallacy 1 Network outputs should be binary: outputs near 0 or 1. When training we have perfect knowledge about which class a pattern belongs to. At recognition this is not so. It may be the case that the best we can say, even given a perfect model of the data, is that a pattern has a certain probability of belonging to a class. In this case, since the network outputs are probability estimates, we should expect real numbers, not binary 0/1 values.

Fallacy 2 A correct/incorrect threshold may be arbitrarily set, e.g. correct when the output is above 0.5, incorrect when below 0.5. Such arbitrary thresholds make no sense when dealing with probabilities: knowing that we are estimating the probability of each class given the data, the logical rule to use for classification is simply to choose the class with the highest probability.

Fallacy 3 Output values substantially different from 0 or 1 indicate that more training is required. This is related to the first fallacy. If the data is confusable, then binary values shouldn't be expected, and outputs not near 0 or 1 may simply be accurate probability estimates.

4.3 Practical Implications of Posterior Probability Estimation

The fact that we can interpret network outputs as probabilities has several practical implications.


4.3.1 Minimum Error-Rate Decisions

It can be shown that the pattern classification error rate is minimized by choosing the class with the highest probability given the data. Thus when an MLP trained as a classifier is used with new data, the optimal classification strategy is simply to pick the class corresponding to the highest output. Any other approach (using thresholds, etc.) will not in the long run result in better performance.

4.3.2 Compensating for Different Priors

As mentioned above, generative models estimate the likelihood of a model generating the data, p(x|C_i). This likelihood may be related to the posterior probability using Bayes' rule:

    P(C_i|x) = p(x|C_i) P(C_i) / p(x)                            (4.1)

where P(C_i) is the prior probability of class C_i and p(x) is a normalizing constant. Generative pattern recognition models estimate both the likelihood p(x|C_i) and the prior P(C_i) to arrive at a posterior probability estimate. Neural networks, on the other hand, are able to estimate the posterior probability directly. The prior probability is simply the probability of each class occurring, without having seen any data. The training data estimate of the priors is simply the relative frequency of each class in the training set.
This decomposition may be used if the prior probabilities are known to be different in the training and test sets. To compensate for a different prior, we simply divide each output by the relative frequency of the corresponding class, and multiply by a new estimate of the prior probability. Bishop gives the following example of when this might be useful:
    Consider the problem of classifying medical images into "normal" and "tumour". When used for screening purposes in the general population we have a very low prior probability for "tumour". To obtain a good variety of tumour images in the training set would therefore require huge numbers of examples. An alternative is to increase artificially the proportion of tumour images in the training set, and then to compensate for the different priors in the test data. The prior probabilities for tumours in the general population can be obtained from medical statistics without having to collect the corresponding images. Correction of the network outputs is then a simple matter of multiplication and division.
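A sketch of this correction; the renormalization at the end is my addition, so that the adjusted outputs again sum to 1:

```python
import numpy as np

def compensate_priors(y_out, train_priors, true_priors):
    """Adjust network outputs for a prior mismatch between training and test
    conditions: divide by the training set relative frequencies, multiply by
    the new prior estimates, then renormalize."""
    scaled = y_out * true_priors / train_priors
    return scaled / scaled.sum()

# E.g. a training set artificially balanced 50/50, but a 1% true tumour prior:
y_out = np.array([0.3, 0.7])    # network estimates: P(tumour|x), P(normal|x)
print(compensate_priors(y_out, np.array([0.5, 0.5]), np.array([0.01, 0.99])))
```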

4.3.3 Outputs Sum to One

Since the outputs of a network trained as a classifier are probability estimates, they will sum to one. This fact can be used as a measure of how well a network has trained. Further, it is possible to use this fact as a constraint in the network architecture: the outputs may be normalized so that they are forced to sum to 1.
It can be shown that the average posterior probability of each class over the training set will correspond to the training set estimate of the class prior probability (the relative frequency of that class). We can use this information as a measure of how well the network is trained, by checking whether the training set posterior probability averages for each output unit do indeed tend to the prior probabilities. If they do not, this is a good indication that the network is not modelling the posterior probabilities well.
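Both checks are straightforward in NumPy; the helper below is my own illustration:

```python
import numpy as np

def posterior_sanity_check(outputs, labels, n_classes):
    """Two quick checks on a trained classifier network: the outputs should
    sum to roughly 1 per pattern, and their training set averages should
    tend to the class relative frequencies (the priors)."""
    row_sums = outputs.sum(axis=1)            # each entry should be near 1
    avg_posteriors = outputs.mean(axis=0)     # per-output-unit average
    priors = np.bincount(labels, minlength=n_classes) / len(labels)
    return row_sums, avg_posteriors, priors   # compare the last two
```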

4.3.4 Combining Network Outputs

It has been shown by several researchers that, rather than using a single network to solve a problem, there can be benefits in breaking a problem down into subunits, each solved by a separate network. If we assume the inputs to each network are independent, then dividing the outputs by the priors (the training set relative frequencies) gives us a likelihood estimate (scaled by p(x), which may be treated as a constant) of the input data being generated by the class. If these are independent, we can combine the scaled likelihoods (by multiplying) and can reconvert to posteriors by multiplying by the relevant prior probabilities.

4.3.5 Confidence and Rejection

Often in pattern recognition applications we only want to make a classification if we are confident of the network's decision. In other cases we may prefer to reject an input vector, rather than risk an incorrect classification. For example, a signature verification system may be more appropriate if it rejects 8% of the inputs, while making a 1% error on the remainder (the accepted inputs), rather than rejecting none and making errors on half of the inputs that would have been rejected, resulting in a 5% error: 5 times higher than if rejection was used. (Rejected inputs may be passed on to a human, for example.) If the network outputs represent posterior probabilities, then we can perform rejection in a principled way, simply by rejecting inputs for which all the posterior probabilities fall below a threshold. Note that this is different to using a threshold to decide class membership: here the threshold is being used to decide whether the minimum error rate classification should be used, or whether the pattern should be rejected.
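A sketch of rejection by thresholding the maximum estimated posterior; the threshold value is illustrative:

```python
import numpy as np

def classify_with_rejection(y_out, threshold=0.9):
    """Minimum error-rate decision with rejection: pick the class with the
    highest posterior estimate, but reject the input if even that maximum
    falls below the confidence threshold. Returns a class index or None."""
    best = int(np.argmax(y_out))
    return best if y_out[best] >= threshold else None

print(classify_with_rejection(np.array([0.05, 0.92, 0.03])))  # -> 1
print(classify_with_rejection(np.array([0.40, 0.35, 0.25])))  # -> None (reject)
```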

4.4 References
- Bishop, chapter 6.
- M. D. Richard and R. P. Lippmann (1991), "Neural network classifiers estimate Bayesian a posteriori probabilities", Neural Computation, 3, 461-483.
