NEURAL NETWORKS
Steve Renals
27 February 1998
COM336/COM648/COM682
Contents

2 GENERALIZATION
2.1 Introduction
2.2 Evaluating Generalization: Training, Test and Validation Sets
2.3 Training and Generalization
2.4 Bias and Variance
2.5 References
Chapter 1
THE BACK-PROPAGATION
ALGORITHM
1.1 Introduction
In this course we are interested in artificial neural networks; biology does not concern
us! In the only analogy of the kind that I'll use, the table below gives the correspondences
between the terminology used for biological neural networks and that used for artificial
neural networks (ANNs):
Biology Computing
Neuron Unit/Node/Cell
Synapse Connection
Synaptic Weight Weight
Neuron Firing Rate Node Output
Don't be fooled into thinking that ANNs tell us anything (detailed) about the brain...
The activation v of a unit is a weighted sum of its inputs:

v = Σ_i w_i y_i + c    (1.1)
where yi is input i to the unit, wi is the connection weight of that incoming connection, and
c is referred to as the bias of the unit. Note that the bias is equivalent to the weight on a
connection whose input is always 1.
In a neural network the inputs to a unit will either be the output of another unit that is
connected to it, or an input from the environment (e.g. the input from some sensor). In all
the networks that we'll consider in this course connections are directed: the fact that there is a
connection from unit a to unit b with weight w does not tell us anything about there being
(or not being) a connection from unit b to unit a (or what weight it might have if it does
exist).
The output of a unit y is computed by passing the activation through a nonlinear transfer
function:
y = f(v)    (1.2)
Choices for transfer function f include the step function (where the output is 0 if the
activation is below 0 and 1 if it is above 0):

f(v) = 1  if v ≥ 0    (1.3)
     = 0  if v < 0    (1.4)

and soft step functions such as the S-shaped sigmoid function:

f(v) = 1 / (1 + exp(-v))    (1.5)
Note that in both these cases the bias of the unit can be used to shift the function along the
x axis.
Units in neural networks may be described as:
Input units, which receive some input from the environment (e.g. a pixel map for a network
to recognize handwritten characters);
Output units, whose values may be observed from the environment (e.g. the class of character
input to the network);
Hidden units, which are internal to the network and do not directly interact with the
environment.
Figure 1.1: A single unit, which computes its output y = f(v) from inputs y1, ..., yn
weighted by w1, ..., wn.
Figure 1.2: Non-linear transfer functions: (a) step function; (b) sigmoid function.
Figure 1.3: Examples of (a) a fully-connected recurrent network and (b) a two layer feed-
forward network.
1.1.3 Learning
The principal reason why neural networks have attracted such interest is the existence of
learning algorithms for neural networks: algorithms that use data to estimate the optimal
weights in a network to perform some task. There are three basic approaches to learning in
neural networks:
Supervised learning uses a training set that consists of a set of pattern pairs: an input
pattern and the corresponding desired (or target) output pattern. The desired output may
be regarded as the network's "teacher" for that input. The basic approach in
supervised learning is for the network to compute the output its current weights produce
for a given input, and to compare this network output with the desired output. The
aim of the learning algorithm is to adjust the weights so as to minimize the difference
between the network output and the desired output.
Figure 1.4: A single layer network, with a bias unit y_0^I whose value is always 1. Each
output unit k computes y_k^O = f(v_k^O) from the inputs y_i^I via the weights w_{ki}^{OI}.
Reinforcement learning uses much less supervision. If a network aims to perform
some task, then the reinforcement signal is a simple "yes" or "no" at the end of the
task to indicate whether the task has been performed satisfactorily.
Unsupervised learning uses only input data; there is no training signal, unlike the previous
two approaches. The aim of unsupervised learning is to make sense of some data set,
for example clustering similar patterns together, or data compression.
Function approximation is the task of learning a mapping that generates a vector of nu-
merical outputs from a vector of numerical inputs. Many real world problems fall
into this category (e.g. estimating the flow in a pipeline, given inputs from sensors).
Forecasting involves predicting the future from the past. Examples abound, such as finan-
cial prediction (stock markets, currency), electricity consumption, etc.
Control problems are those in which the values of input variables are determined in or-
der to achieve the desired outputs. Examples include robot arm control and various
automotive systems.
v_k^O = Σ_i w_{ki}^{OI} y_i^I    (1.6)

y_k^O = f^O(v_k^O) = f^O( Σ_i w_{ki}^{OI} y_i^I )    (1.7)

f^O(v) = 1 / (1 + exp(-v))    (1.8)

where we have assumed that the output unit transfer function f^O(·) is a sigmoid. The
activation value of output unit k is represented by v_k^O. Input 0 always has value 1, so the
weight w_{k0}^{OI} corresponds to the bias of output unit k.
Exercise: Rewrite the above equations in matrix-vector notation (i.e. representing the
weights as a single matrix W, the inputs as a vector y^I, and so on).
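As a concrete sketch of equations (1.6)-(1.8), the sums over i map directly onto a matrix-vector product. The weights, biases and inputs below are invented for illustration, not taken from the notes:

```python
import numpy as np

def sigmoid(v):
    # Equation (1.8): f(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative single layer network: 3 inputs, 2 output units.
W = np.array([[0.5, -0.3, 0.8],    # W[k, i]: weight from input i to output unit k
              [0.1,  0.7, -0.2]])
c = np.array([0.0, 0.1])           # biases (equivalently, weights on a constant-1 input)
yI = np.array([1.0, 0.5, -1.0])    # input vector y^I

vO = W @ yI + c                    # equation (1.6), computed for all k at once
yO = sigmoid(vO)                   # equation (1.7)
```

Note how the explicit sum over i in (1.6) disappears into the matrix product, which is exactly the point of the exercise above.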
The key problem in supervised training is that of credit assignment: given that there
is an output error (i.e. the network outputs differ from the desired outputs supplied by the
training data), how do we adjust the weights of the network to minimize that output error
over all the training data? The basic idea is as follows:
If an output unit has the desired output value then do not adjust the weights leading
into that unit;
If an output unit has an output less than the desired output, then increment the weights
leading into that unit by a small amount (proportional to the difference between de-
sired and actual outputs);
If an output unit has an output greater than the desired output, then decrement the
weights leading into that unit by a small amount (proportional to the difference be-
tween desired and actual outputs);
This assumes that all the inputs are positive.
This is rather heuristic. A more principled way of proceeding involves defining an error
function for the network. For a single pattern p we define the error E_p:

E_p = (1/2) Σ_k e_k^2    (1.9)

    = (1/2) Σ_k (d_k^O - y_k^O)^2    (1.10)

e_k = (d_k^O - y_k^O)    (1.11)
where dkO is the desired value for output unit k and ek is the local error for output unit k. We
describe E p as the sum squared error. We can sum E p over all patterns to give the overall
training set error:
E = Σ_p E_p    (1.12)
E_p tells us how well the network performs on pattern p. E_p = 0 means that it is perfect
for this pattern and no weights need adjusting. The larger E_p, the worse the network is
doing. We can write e_k in terms of the weights:

e_k = d_k^O - f^O( Σ_i w_{ki}^{OI} y_i^I )    (1.13)
Δw_{ki}^{OI} = -η ∂E_p/∂w_{ki}^{OI}    (1.15)

η is a small positive constant called the step size or learning rate which governs how
much the weights are adjusted. There is a minus sign in (1.15); this is because we want to go
downhill: without the minus sign we would be performing gradient ascent and would end up at
a point of maximum error!
To do this we need to calculate the partial derivatives ∂E_p/∂w_{ki}^{OI}; this is straightfor-
ward since we have already written E_p as a function of the weights. Using the chain rule of
differentiation:
∂E/∂w_{ki}^{OI} = (∂E/∂y_k^O) (∂y_k^O/∂w_{ki}^{OI})    (1.17)

              = (∂E/∂y_k^O) (∂y_k^O/∂v_k^O) (∂v_k^O/∂w_{ki}^{OI})    (1.18)
Substituting (1.19), (1.20) and (1.21) into (1.18) we have the derivative we require:
∂E/∂w_{ki}^{OI} = -(d_k - y_k^O) y_k^O (1 - y_k^O) y_i^I    (1.22)
So the expression for the weight update is obtained by inserting (1.22) in (1.15):
Δw_{ki}^{OI} = η (d_k - y_k^O) y_k^O (1 - y_k^O) y_i^I    (1.23)
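A single application of the update rule (1.23) can be sketched as follows. The learning rate, the target values and the zero initialization are my own illustrative choices (biases are omitted for brevity):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

eta = 0.1                           # learning rate (assumed value)
yI = np.array([1.0, 0.0, 1.0])      # inputs to the layer
W = np.zeros((2, 3))                # weights W[k, i], all zero for illustration
d = np.array([1.0, 0.0])            # desired outputs

yO = sigmoid(W @ yI)                # forward pass, equations (1.6)-(1.7)
delta = (d - yO) * yO * (1.0 - yO)  # per-unit error term from (1.22)
W += eta * np.outer(delta, yI)      # weight update (1.23), all k and i at once
```

With all weights zero, both outputs start at 0.5, so the update pushes the weights of the first unit up (target 1) and of the second unit down (target 0), but only on the connections whose input was nonzero.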
1. Initialize the weights to small random values.
2. While E is unsatisfactory: for each training pattern, compute the network outputs and
update the weights using (1.23).
Exercise: Why are the weights initialized to small random values (rather than initial-
izing all weights to 0)?
Figure 1.5: Examples of (a) linearly separable and (b) not linearly separable classification
problems in one-dimensional input space.
Exercise: Satisfy yourself that a single layer perceptron cannot solve the XOR problem.
1 In one-dimensional input space it would define a point; in two-dimensional input space it would define a line;
in three-dimensional input space a plane. Things get harder to visualize in higher dimensions.
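The XOR exercise can also be checked by brute force: since XOR is not linearly separable, no step unit w1*x1 + w2*x2 + c classifies all four patterns correctly, whatever the weights. The grid of candidate weights below is my own sanity check, not part of the notes:

```python
import itertools

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
target = [0, 1, 1, 0]                      # the XOR truth table

def step(v):
    # Step transfer function, equations (1.3)/(1.4)
    return 1 if v >= 0 else 0

vals = [x / 2.0 for x in range(-8, 9)]     # candidate weights/bias in [-4, 4]
solutions = [
    (w1, w2, c)
    for w1, w2, c in itertools.product(vals, repeat=3)
    if all(step(w1 * x1 + w2 * x2 + c) == t for (x1, x2), t in zip(X, target))
]
```

The search comes back empty: every choice of weights misclassifies at least one of the four patterns, in line with the footnote's picture of a single hyperplane boundary.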
Figure 1.6: A multi-layer perceptron. Input units y_i^I feed hidden units y_j^H through the
weights W^{HI}; hidden units feed output units y_k^O through the weights W^{OH}. The bias
units y_0^I and y_0^H always have the value 1.
NOTATION

y_i^I        Output of input unit i
y_i^H        Output of hidden unit i
y_i^O        Output of output unit i
v_i^I        Activation of input unit i
v_i^H        Activation of hidden unit i
v_i^O        Activation of output unit i
f^H(·)       Transfer function for hidden units (e.g. sigmoid), y_i^H = f^H(v_i^H)
f^O(·)       Transfer function for output units (e.g. sigmoid), y_i^O = f^O(v_i^O)
w_{ji}^{HI}  Weight from input unit i to hidden unit j
w_{kj}^{OH}  Weight from hidden unit j to output unit k
w_{j0}^{HI}  Bias of hidden unit j
w_{k0}^{OH}  Bias of output unit k
d_i^O        Desired (target) value for output unit i
η            Gradient descent learning rate
As for the single layer perceptron (equations (1.6)-(1.8)) we can write down the set of
equations that describe the behaviour of the network:

v_j^H = Σ_i w_{ji}^{HI} y_i^I    (1.24)

y_j^H = f^H(v_j^H)    (1.25)

v_k^O = Σ_j w_{kj}^{OH} y_j^H    (1.26)

y_k^O = f^O(v_k^O)    (1.27)
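The forward pass through the two layers is two applications of the single layer computation. The shapes, random initialization and input values below are illustrative assumptions:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative shapes: 2 inputs, 3 hidden units, 1 output unit.
rng = np.random.default_rng(0)
W_HI = rng.normal(scale=0.5, size=(3, 2))  # w[j, i]: input i -> hidden j
b_H = np.zeros(3)                          # hidden biases w_{j0}^{HI}
W_OH = rng.normal(scale=0.5, size=(1, 3))  # w[k, j]: hidden j -> output k
b_O = np.zeros(1)                          # output biases w_{k0}^{OH}

yI = np.array([0.2, -0.7])
yH = sigmoid(W_HI @ yI + b_H)  # hidden layer: y_j^H = f^H(v_j^H)
yO = sigmoid(W_OH @ yH + b_O)  # output layer: y_k^O = f^O(v_k^O), equation (1.27)
```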
The error function for an MLP is defined by equations (1.9) and (1.11), as for a single
layer network. However, the credit assignment problem is more difficult: in a single layer
network every weight connects an input to an output and credit assignment is accomplished
by a straightforward differentiation. However for an MLP the existence of hidden units
(which do not have a target) complicates things.
Similarly to (1.14) we can write E_p in terms of the hidden-to-output weights w_{kj}^{OH}:

E_p = (1/2) Σ_k e_k^2 = (1/2) Σ_k ( d_k - f^O( Σ_j w_{kj}^{OH} y_j^H ) )^2    (1.29)
The hidden-to-output weights can be adjusted by steepest descent in the same way as we
trained the weights of the single layer network (equations (1.17)-(1.23)):

∂E/∂w_{kj}^{OH} = (∂E/∂y_k^O) (∂y_k^O/∂v_k^O) (∂v_k^O/∂w_{kj}^{OH})    (1.30)
∂E/∂w_{ji}^{HI} = (∂E/∂y_j^H) (∂y_j^H/∂w_{ji}^{HI})    (1.33)
The second term on the right hand side can be written down simply by differentiating (1.24)
and (1.25):
This is analogous to (1.20) and (1.21) for the single layer network.
But this leaves us with the problem of computing ∂E/∂y_j^H. This is easy for an output
unit, since it is clear exactly how the error varies relative to each output (just differentiate
(1.29) with respect to y_k^O). But we don't have an error defined for hidden units; this is
the hidden unit credit assignment problem. Each hidden unit has an effect on the error
associated with each output unit to which it is connected, so it would make sense for the
error signal for a hidden unit to be a weighted combination of the output unit error signals.
This is the famous "back-propagation of error", and we can now see that it arises from a
slightly more subtle application of the chain rule of differentiation:
∂E/∂y_j^H = Σ_k (∂E/∂y_k^O) (∂y_k^O/∂y_j^H)    (1.36)
We must sum over all output units, since each hidden unit is connected to all output units
and so affects the error of all output units. The required derivatives are now straightforward:
∂E/∂y_k^O = -(d_k - y_k^O)    (1.37)

∂y_k^O/∂y_j^H = y_k^O (1 - y_k^O) w_{kj}^{OH}    (1.38)

∂E/∂y_j^H = -Σ_k (d_k - y_k^O) y_k^O (1 - y_k^O) w_{kj}^{OH}    (1.39)
∂E/∂w_{ji}^{HI} = -Σ_k (d_k - y_k^O) y_k^O (1 - y_k^O) w_{kj}^{OH} y_j^H (1 - y_j^H) y_i^I    (1.40)

              = -y_j^H (1 - y_j^H) ( Σ_k (d_k - y_k^O) y_k^O (1 - y_k^O) w_{kj}^{OH} ) y_i^I    (1.41)

Δw_{ji}^{HI} = η y_j^H (1 - y_j^H) ( Σ_k (d_k - y_k^O) y_k^O (1 - y_k^O) w_{kj}^{OH} ) y_i^I    (1.42)
We can now write down the algorithm for training an MLP by back-propagation of
error:
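In outline: initialize the weights to small random values, then repeatedly present patterns, forward propagate, back-propagate the errors, and apply (1.23) and (1.42). A compact sketch for a single pattern follows; the shapes, learning rate and initialization are my own illustrative choices, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(W_HI, W_OH, yI, d, eta=0.5):
    """One online update for a two-layer MLP, per equations (1.23) and (1.42)."""
    yH = sigmoid(W_HI @ yI)                       # forward pass, hidden layer
    yO = sigmoid(W_OH @ yH)                       # forward pass, output layer
    delta_O = (d - yO) * yO * (1 - yO)            # output error terms, cf. (1.22)
    delta_H = (W_OH.T @ delta_O) * yH * (1 - yH)  # back-propagated hidden errors, cf. (1.39)
    W_OH += eta * np.outer(delta_O, yH)           # hidden-to-output update (1.23)
    W_HI += eta * np.outer(delta_H, yI)           # input-to-hidden update (1.42)
    return 0.5 * np.sum((d - yO) ** 2)            # E_p, equation (1.9)

rng = np.random.default_rng(1)
W_HI = rng.normal(scale=0.5, size=(3, 2))  # small random initial weights
W_OH = rng.normal(scale=0.5, size=(1, 3))
yI, d = np.array([1.0, 0.0]), np.array([1.0])

errors = [backprop_step(W_HI, W_OH, yI, d) for _ in range(50)]
```

Repeatedly presenting this one pattern drives E_p down, which is the behaviour the algorithm is designed to produce.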
∂E/∂w_{ji} = Σ_p ∂E_p/∂w_{ji}    (1.43)
The algorithm for batch learning involves passing through the entire training set before
updating the weights, thus ensuring that the gradients used for the weight update contain
information from the entire training set, unlike the pattern-by-pattern approach of online
learning. However, batch learning requires more memory (the running sum of gradients
for each weight) and for training sets of more than a few tens of patterns online learning is
much faster.
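The two regimes differ only in where the weight update sits relative to the loop over patterns. A toy one-weight example (the quadratic per-pattern error and the step size are my own choices, standing in for (1.43)):

```python
def grad_p(w, x):
    # Per-pattern gradient of the toy error E_p = (w - x)^2
    return 2.0 * (w - x)

data = [0.0, 1.0, 2.0, 3.0]   # toy 1-D "training set"; the optimum is w = 1.5
eta = 0.05

# Online learning: update after every pattern.
w_online = 10.0
for epoch in range(200):
    for x in data:
        w_online -= eta * grad_p(w_online, x)

# Batch learning: accumulate the full-training-set gradient (1.43), then update once.
w_batch = 10.0
for epoch in range(200):
    g = sum(grad_p(w_batch, x) for x in data)
    w_batch -= eta * g
```

Batch descent settles exactly at the minimum, while online descent ends each epoch near it (it keeps cycling slightly as individual patterns pull the weight around).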
Chapter 2
GENERALIZATION
2.1 Introduction
We can regard feed-forward networks as implementing a function mapping an input vector
y^I = (y_1^I, ..., y_n^I) to an output vector y^O = (y_1^O, ..., y_m^O). The parameters of this
function are the weights. We can divide the functions implemented by an MLP into classi-
fiers (which map an input vector to a discrete class) and regressors (which map a continuous
input vector to a continuous output vector). In the case of classification the output vector
usually has dimensionality equal to the number of classes, with each element corresponding
to one of the classes. In regression, the output vector corresponds to the continuous valued
quantity being estimated.
The classification and regression functions that are estimated using an MLP (or other
neural network) are unknown; all we have is a training set that gives input-output ex-
amples of the function. The role of neural network training is to identify this "mystery
function", given only the training data. What the training process (e.g. backprop) does
is to estimate the parameters of the function (i.e. the weights of the network) so that it
replicates the data as well as possible and generalizes well to new data. It turns out that
obtaining the best generalization performance on new data does not usually correspond to
replicating the training data as well as possible.
Consider the case when we have very few training patterns but a large network with
many weights. It will be relatively easy to manipulate the weights to reproduce the training
set, but it does not seem likely that the resulting network will have learned the charac-
teristics of new data. On the other hand, if we have very many training patterns, and we
manage to train the network to replicate them, then it would seem that it is more likely to
respond correctly to new, unseen patterns. Our aim is to make such intuitions precise, and
to develop approaches which can maximize the generalization performance of a network.
Training Set This is the data that is used by the training algorithm to adjust the weights of
the network.
Validation (Development) Set This data is used during training to assess how well the
network is currently performing; the performance of the network on this data may
be used to guide the training in some way (e.g. controlling the learning rate, deciding
when to stop training, choosing between several trained networks).
Test (Evaluation) Set This is the genuine test data and, ideally, should be used once
only, after training is complete.
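Setting up the three sets is usually a simple shuffle-and-slice over the available pattern pairs. The 60/20/20 proportions below are a common convention, not a rule from these notes:

```python
import random

random.seed(0)
patterns = list(range(100))       # stand-ins for (input, target) pattern pairs
random.shuffle(patterns)          # shuffle so each set is representative

train_set = patterns[:60]         # used by the training algorithm to adjust weights
validation_set = patterns[60:80]  # monitored during training to guide it
test_set = patterns[80:]          # the genuine test data, used once at the end
```

The essential property is that the three sets are disjoint: a pattern that influenced the weights (or the stopping decision) can no longer give an honest estimate of generalization.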
Figure 2.1: Estimating a function from training data. The function on the left corresponds
to a single unit, the function on the right to a network with several hidden units.
Figure 2.2: A set of new data applied to the two networks from figure 2.1.
Figure 2.3: Another set of new data applied to the two networks from figure 2.1.
NB: The term bias in this context has nothing to do with the bias or
threshold of a unit in a network!
A model which is too simple, or too inflexible, is said to have a large bias (and a
small variance);
A model which has too much flexibility relative to the training data has a large vari-
ance (and a small bias).
Figure 2.4: Trade-off between bias and variance: high bias and low variance (left); low bias
and high variance (right).
As an example of bias and variance, consider figure 2.4. In this case we see the result of
two networks trying to model a particular noisy data set. The network on the left has a high
bias because it has relatively few parameters compared with the training data: it is limited
in what functions it can represent (i.e. biased) and the end result is over-smoothing. The
network on the right is very flexible; it is able to exactly model just about every data
point, so this is a very low bias network, but we say it has a high variance. Bias and variance
are complementary quantities and the best generalization performance will be achieved
when we find the optimal tradeoff (figure 2.5).
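The contrast in figure 2.4 can be reproduced with polynomial fits of different flexibility: the flexible model always fits the training data at least as well, which is exactly why training error alone cannot reveal its higher variance. The function, noise level and degrees below are my own toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy targets

def train_error(degree):
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))

err_rigid = train_error(1)     # high bias, low variance: an over-smoothed line
err_flexible = train_error(6)  # low bias, high variance: chases the noise too
```

The high-degree fit wins on the training set, as it must, but on a fresh noisy sample from the same sine function the ranking would typically reverse; that gap is the variance penalty.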
Let's get more specific about what we mean by bias, variance and generalization error.
We could use the training set error E_train as an estimate of the generalization error:
But we have seen that this is not a good predictor for the generalization error. So how can
we express the generalization error? If we are performing cross validation, then we can get
an estimate of the generalization error from the cross validation set, E_xval.
The error on the validation set is dependent on the network weights, which are in turn
dependent on the training data. So the performance on the validation set depends on which
training data set was used to train the network. What we want is an expression for the
generalization error that is independent of any particular set of training data. Precisely, we
want to express the generalization error E_gen as the average over all possible training data
sets:

E_gen = E_D[ (y^O - d)^2 ]    (2.3)

E_D[f_D(x)] is the expected value (or average) of a function f_D(x) over all possible training
data sets D, where f_D(x) has a dependence on the training data. E_gen is not a quantity that
we can directly estimate, but it is important if we want to get a theoretical understanding of
the generalization error.
In equation (2.3) we can expand the term inside the brackets as follows:

(y^O - d)^2 = ( (y^O - E_D[y^O]) + (E_D[y^O] - d) )^2    (2.4)

           = (y^O - E_D[y^O])^2 + (E_D[y^O] - d)^2 + 2 (y^O - E_D[y^O]) (E_D[y^O] - d)    (2.5)
To complete (2.3) we need to take the expectation over all data sets D. This results in the
third term above vanishing, since:

E_D[ (y^O - E_D[y^O]) (E_D[y^O] - d) ]
  = E_D[ y^O E_D[y^O] ] - E_D[ y^O d ] - E_D[ E_D[y^O] E_D[y^O] ] + E_D[ E_D[y^O] d ]    (2.6)
  = E_D[y^O] E_D[y^O] - E_D[y^O] d - E_D[y^O] E_D[y^O] + E_D[y^O] d    (2.7)
  = 0    (2.8)
Recall that E_D[E_D[f_D(x)]] = E_D[f_D(x)], since E_D[f_D(x)] does not depend on D (we
have already averaged over D).
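The vanishing cross-term can be checked numerically: averaging (2.4) over a sample of simulated "training sets" leaves exactly the variance plus the squared bias. The distribution of outputs below is my own toy choice:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 0.7                                        # fixed desired output for some input x
yO = 0.5 + rng.normal(scale=0.1, size=100000)  # network outputs over many training sets D

lhs = np.mean((yO - d) ** 2)                 # E_D[(y^O - d)^2], the generalization error
variance = np.mean((yO - np.mean(yO)) ** 2)  # E_D[(y^O - E_D[y^O])^2]
bias_sq = (np.mean(yO) - d) ** 2             # (E_D[y^O] - d)^2
```

Within this sample the decomposition is an exact algebraic identity (the cross-term averages to zero by construction), so `lhs` equals `variance + bias_sq` up to floating point error.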
2.5 References

1. D. W. Patterson, Artificial Neural Networks: Theory and Applications, Prentice Hall,
1996. (Chapter 7)

2. R. P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE Acoustics
Speech and Signal Processing Magazine, 4(4), October 1987.

3. D. R. Hush and B. G. Horne, "Progress in Supervised Neural Networks", IEEE
Signal Processing Magazine, 10(1), pp. 8-39, January 1993.

4. C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press,
1995. (Chapters 1 and 9)
Chapter 3
3.1 Introduction
The lowest generalization error is achieved by an optimal tradeoff between bias and vari-
ance. A function that is too closely fitted to the training set will tend to have a large variance
and hence give a large expected error; this is called overtraining. We can decrease the
variance by smoothing the function (hence increasing the bias) but if we go too far the large
bias will cause the expected generalization error to become large again.
One way to reduce both the bias and variance is to use more flexible networks while
simultaneously increasing the size of the training set. Increasing the flexibility of the net-
work will reduce the bias, while adding more training data will decrease the variance since
each extra training data point adds a new constraint in the space of functions available to
the network that implement the function described by the training data.
However in many situations we do not have the option of increasing the amount of
training data. If we have some prior information about the unknown function we are trying
to model, then using this information to constrain the network function will not necessarily
increase the bias. For example, if the true function is linear, then constraining the network
to linear functions will not increase the bias since the constrained functions are consistent
with the true network function. The bias-variance model tells us that in some situations
(with limited training data) performance will be better with a constrained model (e.g. a
simple linear network) than a less constrained model (e.g. a multi-layer perceptron) even
though the less flexible model is a special case of the more flexible one.
However, if we do not have this task-specific information we can still make an attack
on the bias-variance problem (or, equivalently, the problem of overtraining). We do this by
methods that reduce the effective complexity of the network. Two important methods for
doing this are cross validation and regularization.
training is stopped at the minimum of the validation error. In practice, this may mean that
training continues, but the weight matrix corresponding to the minimum validation error is
stored as the best performing network.
Figure 3.1: Evolution of the training set and validation set errors as training progresses. t*
marks the optimal point, according to the validation set, at which to stop training.
Why does early stopping work? The bias and variance as defined earlier will not change
as training progresses: the network does not become any more or less flexible. However
we can regard the effective variance as increasing as training progresses, while the effective
bias decreases as training progresses. It's an active research issue just how these quantities
should be mathematically defined. Essentially, the idea is that as the training process con-
tinues so the effective complexity of the model increases, as the network is able to model
the training set with increasing accuracy.
Another way of thinking about early stopping is that it stops the network weights being
over-tuned to the training data. At the start of training, the network is not at all tuned to
the training data. As training progresses so the network becomes increasingly well tuned
to every characteristic of the training data. By stopping training early, the network is less
likely to have overfit the training data, as it will not have had time to learn the noise as
well as the signal. And performance on the validation set is a simple, easy to compute
way to find out the best time to stop training.
Although there are theories that predict how well we can expect these cross-validation
approaches to work, there is no convincing theory which explains why early stopping works
in terms of quantities like bias and variance. However, although it is not theoretically
pure, early stopping is a commonly used approach to maximizing generalization, and is
usually very effective. It can also be used in conjunction with other approaches such as
regularization.
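The keep-the-best-weights variant of early stopping described above is a small bookkeeping loop around training. Here the epoch of training and the U-shaped validation curve are simulated stand-ins (minimum placed at epoch 30 by construction), not real network quantities:

```python
import numpy as np

def train_epoch(weights):
    # Placeholder for one epoch of backprop; here it just perturbs the weights.
    return weights + 0.01

def validation_error(epoch):
    # Simulated U-shaped validation curve with its minimum at epoch 30 (cf. figure 3.1).
    return (epoch - 30) ** 2 / 100.0 + 1.0

weights = np.zeros(5)
best_error, best_weights, best_epoch = np.inf, weights.copy(), 0

for epoch in range(100):
    weights = train_epoch(weights)
    err = validation_error(epoch)
    if err < best_error:
        # Store the weight matrix corresponding to the minimum validation error.
        best_error, best_weights, best_epoch = err, weights.copy(), epoch

# best_weights is the network we keep; training past the minimum only overfits.
```

Note that training is allowed to continue past the minimum; we simply remember the best-performing weights, exactly as the text suggests.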
3.3 Regularization
The error function that is optimized when training a neural network is the error on the
training set:
E_train = Σ_{training set} Σ_i (y_i^O - d_i^O)^2 = Σ_{training set} (y^O - d^O)^2    (3.1)
However, as we have seen, we don't just want to optimize the network performance on
the training set: this may lead to a low bias solution, at the expense of a high variance.
What we want to do is to achieve good performance on the training set, while limiting the
complexity of the network. The technique of regularization encourages this by adding a
network complexity term E_W to the error function that is optimized:

E = E_train + λ E_W    (3.2)

The penalty term E_W is sometimes referred to as the regularizer. The overall error function
is now a tradeoff between the training set error and a model complexity term, with the
tradeoff being controlled by the parameter λ. This is roughly analogous to the bias-variance
tradeoff in the previous expression for generalization error. So what sort of penalty function
is E_W? Well, here are two properties we would like it to have:
Easy to differentiate (so we can still train using back-propagation)
Minimizing EW corresponds in some way to minimizing the flexibility of the net-
work.
Two types of regularizer that have been used are based on curvature and on weight decay.
The idea of minimizing the curvature is that high variance networks will typically have
functions with lots of maxima and minima (as they try to fit every data point). These points
are characterized by high curvature (high second derivative). If we set EW to correspond to
this curvature:
E_W = Σ_k Σ_i ∂²y_k^O / ∂(y_i^I)²
then minimizing E_W will result in a low curvature (smoother) function. The derivatives
of this type of regularizer can be computed for an MLP, but curvature based regularizers
haven't been used much in practice in neural computing (although they are widely used in
computer vision and other areas).
Weight decay is a more commonly used regularizer. In weight decay, we define:

E_W = (1/2) Σ_i w_i²

where the sum is over all the weights and biases in the network. This has a very simple
partial derivative:

∂E_W/∂w_i = w_i

Remembering that backprop training uses the negative of this derivative, we have:

Δw_i = -η ∂E/∂w_i    (3.3)

     = -η ( ∂E_train/∂w_i + λ ∂E_W/∂w_i )    (3.4)

     = -η ∂E_train/∂w_i - η λ w_i    (3.5)
25
COM336/COM648/COM682
So the effect of this penalty term on the training process is to add a second force (in addition
to modelling the training data) that causes the weights to decay at a rate proportional to
the size of the weight.
One can think of weight decay as putting a spring on the weights. The strength of
the spring is controlled by the parameter λ. If the training data is consistently pushing a
weight in the same direction, then that force should outweigh the weight decay. However,
if the training data is not consistently pushing the weight in one direction, then the weight
decay term may start to dominate and the weight will decay to 0. In the latter case, if
weight decay is not applied, then the weight value might walk randomly without being
well determined by the training data. This is an example of the network being too flexible,
and weight decay is a way of enabling the data to determine how best to decrease the
flexibility.
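The "spring" behaviour is easy to see in isolation: when the data gradient contributes nothing, the decay term in (3.5) shrinks the weight geometrically towards 0. The values of η and λ are my own:

```python
eta, lam = 0.1, 0.5    # learning rate and weight decay strength (assumed values)
w = 2.0                # a weight that the training data does not constrain

history = [w]
for step in range(50):
    data_gradient = 0.0  # no consistent force from the training data on this weight
    w += -eta * data_gradient - eta * lam * w  # update rule (3.5)
    history.append(w)
```

Each step multiplies the weight by (1 - ηλ) = 0.95, so it decays steadily toward 0 rather than wandering, which is the behaviour the text describes.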
Chapter 4
CLASSIFICATION USING
NEURAL NETWORKS
4.1 Introduction
The pattern classification task is to classify an input vector x (= y^I using the notation of
part 1 of these notes) into one of M classes (C_1, C_2, ..., C_M). This is a problem that has
been studied for many years in the field of statistical pattern recognition. A solution to
this problem is to compute the Bayesian posterior probability P(C_i|x) for each class and to
assign the input vector to the class with the largest posterior probability.
The posterior probability P(C_i|x) is the conditional probability of the class C_i given the
input data x. A method which directly estimates such probabilities may be thought of as a
recognition model. Recognition models are discriminative: training consists of moving
class boundaries to maximize the correct classification of the training data by the model.
Standard statistical pattern recognition techniques do not usually estimate P(C_i|x) di-
rectly as this can be difficult. Instead they use a generative model. A generative model may
be thought of as a machine for class C_i that generates pattern vectors x with a likelihood
P(x|C_i). Training consists of building a machine for each class, using the training data;
recognition involves computing the likelihood of each machine generating a test example,
and labelling the pattern with the class whose machine was most likely to have generated
the data.
Intuitively it seems that neural networks are more like recognition models than genera-
tive models when used for classification. Indeed, it turns out that (given some conditions)
we can show that feed-forward networks trained as classifiers directly estimate the poste-
rior probability of each class given the data, P(C_i|x).
The learning algorithm (e.g. backprop) is sufficiently powerful to find the global minimum,
then it may be proven that when previously unseen patterns are presented to a trained
network, the output y_i^O will be an estimate of the posterior probability of the class C_i given
the data x.
The proof of this result is a little involved and I won't reproduce it here; you
can look at it in Richard and Lippmann (1991). The proof works by looking at the expected
value of each desired output given the input, E[d_i|x]. It turns out that this is equal to the
probability P(d_i = 1|x). If we use a 1-from-M output coding, then P(d_i = 1|x) = P(C_i|x),
the posterior probability of the class given the data.
Note that the above assumptions are rarely met: to minimize the generalization error
we do not usually train to a minimum on the training set, for example. However, experi-
ments with both real and simulated data have indicated that networks trained as classifiers
tend to give good posterior probability estimates.
This result implies:
The elements of the output vector produced by the network in response to the input
will sum to 1;
Each element will be between 0 and 1.
Both of these arise because the outputs of the network are probabilities.
This is a very important result:
It puts neural networks on a sound statistical footing, so that they can be understood
in terms of well-understood pattern recognition approaches;
Networks that estimate posterior probabilities may be combined in a principled way
with other statistical methods;
We have a practically useful interpretation of the meaning of real-valued outputs of
a 1-from-M classifier;
A variety of practical implications arise from this result (next section)
This result highlights several fallacies that have been believed about neural network
classifiers:
Fallacy 1 Network outputs should be binary outputs near 0 or 1. When training we have
perfect knowledge about which class a pattern belongs to. At recognition time this is
not so. It may be the case that the best we can say, even given a perfect model of the
data, is that a pattern has a certain probability of belonging to a class. In this case,
since the network outputs are probability estimates, we should expect real numbers
rather than binary 0/1 values.
Fallacy 2 A correct/incorrect threshold may be arbitrarily set, e.g. correct when the output
is above 0.5, incorrect when below 0.5. Such arbitrary thresholds make no sense
when dealing with probabilities: knowing that we are estimating the probability of
each class given the data, the logical rule to use for classification is simply to choose
the class with the highest probability.
Fallacy 3 Output values substantially different from 0 or 1 indicate that more training is
required. This is related to the first fallacy. If the data is confusable, then binary
values shouldn't be expected, and outputs not near 0 or 1 may simply be accurate
probability estimates.
separate network. If we assume the inputs to each network are independent, then dividing
by the priors (training set relative frequencies) will give us a likelihood estimate (scaled by
p(x), which may be treated as a constant) of the input data being generated by the class.
If these are independent we can combine the scaled likelihoods (by multiplying) and can
reconvert to posteriors by multiplying by the relevant prior probabilities.
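That recipe can be sketched directly: divide each network's posteriors by the class priors to get scaled likelihoods, multiply across the independent networks, multiply back by the priors, and renormalize. The priors and posterior estimates below are invented for illustration:

```python
import numpy as np

priors = np.array([0.5, 0.3, 0.2])   # training set relative frequencies of the classes

# Posterior estimates P(C_i | x) from two networks on independent input streams.
post_a = np.array([0.6, 0.3, 0.1])
post_b = np.array([0.5, 0.2, 0.3])

scaled_lik_a = post_a / priors       # proportional to p(x_a | C_i)
scaled_lik_b = post_b / priors       # proportional to p(x_b | C_i)

combined = scaled_lik_a * scaled_lik_b * priors  # unnormalized combined posteriors
combined /= combined.sum()                       # renormalize to probabilities
```

Dividing by the priors before multiplying matters: otherwise the priors would be counted once per network rather than once overall.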
4.4 References

Bishop, chapter 6.

M. D. Richard and R. P. Lippmann (1991), "Neural network classifiers estimate
Bayesian a posteriori probabilities", Neural Computation, 3, 461-483.