Note: This document is an excerpt from the Neuralyst™ User's Guide, Chapter 3.
Real Neurons
Let's start by taking a look at a biological neuron. Figure 1 shows such a neuron. A neuron receives signals from other neurons through junctions called synapses; when the combined signal exceeds a certain threshold, the neuron fires, sending a signal on to other neurons. This sounds very simplistic until we recognize that there are approximately one hundred billion (100,000,000,000) neurons in the human brain, each connected to as many as one thousand (1,000) others. The massive number of neurons and the complexity of their interconnections results in a "thinking machine", your brain.
Each neuron has a body, called the soma. The soma is much like the body of any
other cell. It contains the cell nucleus, various bio-chemical factories and other
components that support ongoing activity.
Surrounding the soma are dendrites. The dendrites are receptors for signals
generated by other neurons. These signals may be excitatory or inhibitory. All
signals present at the dendrites of a neuron are combined and the result will
determine whether or not that neuron will fire.
If a neuron fires, an electrical impulse is generated. This impulse starts at the base,
called the hillock, of a long cellular extension, called the axon, and proceeds down
the axon to its ends.
The end of the axon is actually split into multiple ends, called the boutons. The
boutons are connected to the dendrites of other neurons and the resulting
interconnections are the previously discussed synapses. (Actually, the boutons do
not touch the dendrites; there is a small gap between them.) If a neuron has fired,
the electrical impulse that has been generated stimulates the boutons and results in
electrochemical activity which transmits the signal across the synapses to the
receiving dendrites.
At rest, the neuron maintains an electrical potential of about 40-60 millivolts. When
a neuron fires, an electrical impulse is created which is the result of a change in
potential to about 90-100 millivolts. This impulse travels at between 0.5 and 100 meters
per second and lasts for about 1 millisecond. Once a neuron fires, it must rest for
several milliseconds before it can fire again. In some circumstances, the repetition
rate may be as fast as 100 times per second, equivalent to 10 milliseconds per
firing.
Compare this to a very fast electronic computer whose signals travel at about
200,000,000 meters per second (the speed of a signal in a wire is about 2/3 of the speed of light in free space),
whose impulses last for 10 nanoseconds and may repeat such an impulse
immediately in each succeeding 10 nanoseconds continuously. Electronic computers
have at least a 2,000,000 times advantage in signal transmission speed and
1,000,000 times advantage in signal repetition rate.
It is clear that if signal speed or rate were the sole criteria for processing
performance, electronic computers would win hands down. What the human brain
lacks in these, it makes up in numbers of elements and interconnection complexity
between those elements. This difference in structure manifests itself in at least one
important way: the human brain is not as quick as an electronic computer at
arithmetic, but it is many times faster and hugely more capable at recognition of
patterns and perception of relationships.
The human brain differs in another, extremely important, respect beyond speed; it
is capable of "self-programming" or adaptation in response to changing external
stimuli. In other words, it can learn. The brain has developed ways for neurons to
change their response to new stimulus patterns so that similar events may affect
future responses. In particular, the sensitivity to new patterns seems greater in proportion to their importance to survival, or when they are reinforced by repetition.
Neural networks are models of biological neural structures. The starting point for
most neural networks is a model neuron, as in Figure 2. This neuron consists of
multiple inputs and a single output. Each input is modified by a weight, which multiplies the input value. The neuron will combine these weighted inputs and,
with reference to a threshold value and activation function, use these to determine
its output. This behavior follows closely our understanding of how real neurons
work.
While there is a fair understanding of how an individual neuron works, there is still a
great deal of research and mostly conjecture regarding the way neurons organize
themselves and the mechanisms used by arrays of neurons to adapt their behavior
to external stimuli. There are a large number of experimental neural network
structures currently in use reflecting this state of continuing research.
In our case, we will only describe the structure, mathematics and behavior of that
structure known as the backpropagation network. This is the most prevalent and
generalized neural network currently in use. If the reader is interested in finding out
more about neural networks or other networks, please refer to the material listed in
the bibliography.
Multiple layers are then arrayed one succeeding the other so that there is an
input layer, multiple intermediate layers and finally an output layer, as in Figure 3.
Intermediate layers, that is those that have no inputs or outputs to the external
world, are called hidden layers. Backpropagation neural networks are usually fully
connected. This means that each neuron is connected to every output from the
preceding layer or one input from the external world if the neuron is in the first layer
and, correspondingly, each neuron has its output connected to every neuron in the
succeeding layer.
Generally, the input layer is considered a distributor of the signals from the external
world. Hidden layers are considered to be categorizers or feature detectors of such
signals. The output layer is considered a collector of the features detected and
producer of the response. While this view of the neural network may be helpful in
conceptualizing the functions of the layers, you should not take this model too
literally as the functions described may not be so specific or localized.
With this picture of how a neural network is constructed, we can now proceed to
describe the operation of the network in a meaningful fashion.
Neural Network Operation
The output of each neuron is a function of its inputs. In particular, the output of
the jth neuron in any layer is described by two sets of equations:
[Eqn 1]    Uj = Σi (wij · Xi)

and

[Eqn 2]    Yj = Fth(Uj + tj)
For every neuron, j, in a layer, each of the i inputs, Xi, to that layer is multiplied by a
previously established weight, wij. These are all summed together, resulting in the
internal value of this operation, Uj. This value is then biased by a previously
established threshold value, tj, and sent through an activation function, Fth. This
activation function is usually the sigmoid function, which has an input to output
mapping as shown in Figure 4. The resulting output, Yj, is an input to the next layer
or it is a response of the neural network if it is the last layer. Neuralyst allows other
threshold functions to be used in place of the sigmoid described here.
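To make this computation concrete, here is a minimal Processing-style sketch of Equations 1 and 2 for a single neuron. The function names and the specific sigmoid form, 1/(1 + e^-x), are our own illustrative choices, not a description of Neuralyst's internals.

float neuronOutput(float[] X, float[] w, float t) {
  // Eqn 1: Uj is the weighted sum of the inputs.
  float U = 0;
  for (int i = 0; i < X.length; i++) {
    U += w[i] * X[i];
  }
  // Eqn 2: bias the sum by the threshold, then apply the activation function.
  return sigmoid(U + t);
}

// A common sigmoid activation: maps any input smoothly into the range (0, 1).
float sigmoid(float x) {
  return 1.0 / (1.0 + exp(-x));
}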
During training, the network's outputs are compared with the desired outputs, and the errors are used to correct the weights, working backwards from the output layer. This corrective procedure is called backpropagation (hence the name of the neural network) and it is applied continuously and repetitively for each set of inputs and corresponding set of outputs produced in response to the inputs. This procedure
continues so long as the individual or total errors in the responses exceed a
specified level or until there are no measurable errors. At this point, the neural
network has learned the training material and you can stop the training process and
use the neural network to produce responses to new input data.
[There is some heavier going in the next few paragraphs. Skip ahead if you don't
need to understand all the details of neural network learning.]
[Eqn 3]    wij = w'ij + LR · ej · Xi

and

[Eqn 4]    ej = Yj · (1 - Yj) · (dj - Yj)
For the ith input of the jth neuron in the output layer, the weight wij is adjusted by
adding to the previous weight value, w'ij, a term determined by the product of
a learning rate, LR, an error term, ej, and the value of the ith input, Xi. The error
term, ej, for the jth neuron is determined by the product of the actual output, Yj, its
complement, 1 - Yj, and the difference between the desired output, dj, and the
actual output.
Once the error terms are computed and weights are adjusted for the output layer,
the values are recorded and the next layer back is adjusted. The same weight
adjustment process, determined by Equation 3, is followed, but the error term is
generated by a slightly modified version of Equation 4. This modification is:
[Eqn 5]    ej = Yj · (1 - Yj) · Σk (ek · w'jk)
In this version, the difference between the desired output and the actual output is
replaced by the sum of the error terms for each neuron, k, in the layer immediately
succeeding the layer being processed (remember, we are going backwards through
the layers so these terms have already been computed) times the respective pre-
adjustment weights.
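A sketch of Equation 5 in code, with our own function and variable names and the layer bookkeeping simplified:

// Error term for hidden neuron j (Eqn 5). eNext[k] holds the already-computed
// error terms of the succeeding layer; wNext[k] holds the pre-adjustment
// weights from neuron j to each neuron k in that layer.
float hiddenError(float Yj, float[] eNext, float[] wNext) {
  float sum = 0;
  for (int k = 0; k < eNext.length; k++) {
    sum += eNext[k] * wNext[k];
  }
  return Yj * (1 - Yj) * sum;
}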
The learning rate, LR, applies a greater or lesser portion of the respective
adjustment to the old weight. If the factor is set to a large value, then the neural
network may learn more quickly, but if there is a large variability in the input set
then the network may not learn very well or at all. In real terms, setting the learning rate to a large value is analogous to giving a child a spanking, which is inappropriate and counter-productive to learning if the offense is as simple as forgetting to tie their shoelaces. Usually, it is better to set the factor to a small value and edge it upward if learning seems slow.
Neuralyst also provides a momentum factor, which modifies the weight adjustment of Equation 3 as follows:

[Eqn 6]    wij = w'ij + LR · ej · Xi + M · (w'ij - w''ij)
This is similar to Equation 3, with a momentum factor, M, the previous weight, w'ij,
and the next to previous weight, w''ij, included in the last term. This extra term
allows for momentum in weight adjustment. Momentum basically allows a change
to the weights to persist for a number of adjustment cycles. The magnitude of the
persistence is controlled by the momentum factor. If the momentum factor is set to
0, then the equation reduces to that of Equation 3. If the momentum factor is
increased from 0, then increasingly greater persistence of previous adjustments is
allowed in modifying the current adjustment. This can improve the learning rate in
some situations, by helping to smooth out unusual conditions in the training set.
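In code form, the momentum update of Equation 6 might look like this sketch (the names are ours):

// Weight update with momentum (Eqn 6). With M = 0 this reduces to Eqn 3.
// w is the current weight and wPrev the weight before the last adjustment,
// so (w - wPrev) is the previous weight change, which partially persists.
float updateWeight(float w, float wPrev, float LR, float e, float X, float M) {
  return w + LR * e * X + M * (w - wPrev);
}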
[Okay, that's the end of the equations. You can relax again.]
As you train the network, the total error, that is the sum of the errors over all the
training sets, will become smaller and smaller. Once the network reduces the total
error to the limit set, training may stop. You may then apply the network, using the
weights and thresholds as trained.
It is a good idea to set aside some subset of all the inputs available and reserve
them for testing the trained network. By comparing the output of a trained network
on these test sets to the outputs you know to be correct, you can gain greater
confidence in the validity of the training. If you are satisfied at this point, then the
neural network is ready for running.
Usually, no backpropagation takes place in this running mode as was done in the
training mode. This is because there is often no way to be immediately certain of
the desired response. If there were, there would be no need for the processing
capabilities of the neural network! Instead, as the validity of the neural network
outputs or predictions are verified or contradicted over time, you will either be
satisfied with the existing performance or determine a need for new training. In this
case, the additional input sets collected since the last training session may be used
to extend and improve the training data.
The good news is that developing engaging animated systems with code does not require scientific rigor or accuracy, as we've learned throughout this book. We can simply be inspired by the idea of brain function.
In the 1940s, Warren S. McCulloch and Walter Pitts developed the first conceptual model of an artificial neural network. Their work, and the work of many scientists and researchers that followed, was not meant to accurately describe how the biological brain works. Rather, an artificial neural network (which we will now simply refer to as a neural network) was designed as a computational model based on the brain to solve certain kinds of problems.
It's probably pretty obvious to you that there are problems that are incredibly simple for a computer to solve, but difficult for you. Take the square root of 964,324, for example. A quick line of code produces the value 982, a number Processing computed in less than a millisecond.
There are, on the other hand, problems that are incredibly simple for you
or me to solve, but not so easy for a computer. Show any toddler a picture
of a kitten or puppy and they'll be able to tell you very quickly which one
is which. Say hello and shake my hand one morning and you should be
able to pick me out of a crowd of people the next day. But need a machine
to perform one of these tasks? Scientists have already spent entire careers
researching and implementing complex solutions.
Figure 10.2
One of the key elements of a neural network is its ability to learn. A neural
network is not just a complex system, but a complex adaptive system,
meaning it can change its internal structure based on the information
flowing through it. Typically, this is achieved through the adjusting
of weights. In the diagram above, each line represents a connection
between two neurons and indicates the pathway for the flow of
information. Each connection has a weight, a number that controls the
signal between the two neurons. If the network generates a good output
(which we'll define later), there is no need to adjust the weights. However, if the network generates a poor output (an error, so to speak), then the system adapts, altering the weights in order to improve subsequent
results.
There are several strategies for learning, and we'll examine two of them in
this chapter.
So instead, we'll begin our last hurrah in the nature of code with the simplest of all neural networks, in an effort to understand how the overall concepts are applied in code. Then we'll look at some Processing sketches
that generate visual results inspired by these concepts.
10.2 The Perceptron
Invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical
Laboratory, a perceptron is the simplest neural network possible: a
computational model of a single neuron. A perceptron consists of one or
more inputs, a processor, and a single output.
Say we have a perceptron with two inputs; let's call them x1 and x2.

Step 1: Receive inputs.

Input 0: x1 = 12
Input 1: x2 = 4
Step 2: Weight inputs.
Each input that is sent into the neuron must first be weighted, i.e. multiplied by some value (often a number between -1 and 1). When creating a perceptron, we'll typically begin by assigning random weights. Here, let's give the inputs the following weights:
Weight 0: 0.5
Weight 1: -1
Step 3: Sum inputs.

Input 0 * Weight 0: 12 * 0.5 = 6
Input 1 * Weight 1: 4 * -1 = -4

Sum = 6 + -4 = 2

Step 4: Generate output.

The output of the perceptron is generated by passing that sum through an activation function.
Activation functions can get a little bit hairy. If you start reading one of those artificial intelligence textbooks looking for more info about activation functions, you may soon find yourself reaching for a calculus textbook. However, with our friend the simple perceptron, we're going to do something really easy. Let's make the activation function the sign of the sum. In other words, if the sum is a positive number, the output is 1; if it is negative, the output is -1.
Let's review and condense these steps so we can implement them with a code snippet:

1. For every input, multiply that input by its weight.
2. Sum all of the weighted inputs.
3. Compute the output of the perceptron based on that sum passed through an activation function.

Let's assume we have two arrays of numbers, the inputs and the weights. For example:
float[] inputs  = {12, 4};
float[] weights = {0.5, -1};
float sum = 0;
for (int i = 0; i < inputs.length; i++) {
  sum += inputs[i]*weights[i];
}
Once we have the sum we can compute the output.

Step 3: Passing the sum through an activation function
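A sketch of this step, using the sign of the sum as our activation (the function name activate() is our own choice here):

int output = activate(sum);

int activate(float sum) {
  // The output is a +1 if the sum is positive, a -1 otherwise.
  if (sum > 0) return 1;
  else return -1;
}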
We can see how there are two inputs (x and y), a weight for each input (weight x and weight y), as well as a processing neuron that generates the output.
To avoid this dilemma, our perceptron will require a third input, typically referred to as a bias input. A bias input always has the value of 1 and is also weighted. Here is our perceptron with the addition of the bias:
Figure 10.6
Consider the point (0,0). The weighted inputs for x and y will always be 0, no matter what the weights are:

0 * weight for x = 0
0 * weight for y = 0
1 * weight for bias = weight for bias
The output is the sum of the above three values: 0 plus 0 plus the bias's weight. Therefore, the bias, on its own, answers the question as to where (0,0) is in relation to the line. If the bias's weight is positive, (0,0) is above the line; negative, it is below. It biases the perceptron's understanding of the line's position relative to (0,0).
10.4 Coding the Perceptron
Were now ready to assemble the code for a Perceptron class. The only
data the perceptron needs to track are the input weights, and we could use
an array of floats to store these.
class Perceptron {
  float[] weights;
The constructor could receive an argument indicating the number of
inputs (in this case three: x, y, and a bias) and size the array accordingly.
Perceptron(int n) {
  weights = new float[n];
  for (int i = 0; i < weights.length; i++) {
    // The weights are picked randomly to start.
    weights[i] = random(-1,1);
  }
}
A perceptron needs to be able to receive inputs and generate an output. We can package these requirements into a function called feedforward(). In this example, we'll have the perceptron receive its inputs as an array (which should be the same length as the array of weights) and return the output as an integer.
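A sketch of that function, under the assumptions above: sum the weighted inputs, then pass the sum through the sign activation.

int feedforward(float[] inputs) {
  // Sum all the weighted inputs.
  float sum = 0;
  for (int i = 0; i < weights.length; i++) {
    sum += inputs[i]*weights[i];
  }
  // The result is the sign of the sum: a +1 or a -1.
  return activate(sum);
}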
Figure 10.7
With this method, the network is provided with inputs for which there is a known answer. This way the network can find out if it has made a correct guess. If it's incorrect, the network can learn from its mistake and adjust its weights. The process is as follows:

1. Provide the perceptron with inputs for which there is a known answer.
2. Ask the perceptron to guess an answer.
3. Compute the error. (Did it get the answer right or wrong?)
4. Adjust all the weights according to the error.
5. Return to Step 1 and repeat!
Steps 1 through 4 can be packaged into a function. Before we can write the entire function, however, we need to examine Steps 3 and 4 in more detail. How do we define the perceptron's error? And how should we adjust the weights according to this error?
This was also an error calculation. The current velocity acts as a guess and
the error (the steering force) tells us how to adjust the velocity in the right
direction. In a moment, we'll see how adjusting the vehicle's velocity to
follow a target is just like adjusting the weights of a neural network to
arrive at the right answer.
In the case of the perceptron, the output has only two possible
values: +1 or -1. This means there are only three possible errors.
If the perceptron guesses the correct answer, then the guess equals the
desired output and the error is 0. If the correct answer is -1 and we've guessed +1, then the error is -2. If the correct answer is +1 and we've guessed -1, then the error is +2.
Desired   Guess   Error
  -1       -1       0
  -1       +1      -2
  +1       -1      +2
  +1       +1       0
Therefore:

Error = desired output - guess output

The error then determines how each weight changes:

New weight = weight + error * input * learning constant
Notice that a high learning constant means the weight will change more drastically. This may help us arrive at a solution more quickly, but with such large changes in weight it's possible we will overshoot the optimal weights. With a small learning constant, the weights will be adjusted slowly, requiring more training time but allowing the network to make very small adjustments that could improve the network's overall accuracy.
float c = 0.01;
Step 1: Provide the inputs and known answer. These are passed in as arguments to train().
Step 2: Guess according to those inputs.
Step 3: Compute the error. (Is the desired answer different from the guess?)
Step 4: Adjust all the weights according to the error and learning constant.
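Putting these steps together, here is a sketch of train(), consistent with the weight-adjustment line (weights[i] += c * error * inputs[i]) shown later in this chapter:

void train(float[] inputs, int desired) {
  // Step 2: Guess according to those inputs.
  int guess = feedforward(inputs);
  // Step 3: Compute the error (the desired output minus the guess).
  float error = desired - guess;
  // Step 4: Adjust all the weights according to the error and learning constant.
  for (int i = 0; i < weights.length; i++) {
    weights[i] += c * error * inputs[i];
  }
}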
class Perceptron {
  // The Perceptron stores its weights and learning constant.
  float[] weights;
  float c = 0.01;

  Perceptron(int n) {
    weights = new float[n];
    // Weights start off random.
    for (int i = 0; i < weights.length; i++) {
      weights[i] = random(-1, 1);
    }
  }

  // feedforward() and train() as above; the output is a +1 or -1.
}
class Trainer {
  float[] inputs;
  int answer;

  Trainer(float x, float y, int a) {
    inputs = new float[3];
    inputs[0] = x;
    inputs[1] = y;
    // The third input is the bias and is always 1.
    inputs[2] = 1;
    answer = a;
  }
}
Now the question becomes, how do we pick a point and know whether it is above or below a line? Let's start with the formula for a line, where y is calculated as a function of x:
y = f(x)
y = ax + b
float f(float x) {
  return 2*x+1;
}
So, if we make up a point:
float x = random(width);
float y = random(height);
How do we know if this point is above or below the line? The line function f(x) gives us the y value on the line for that x position. Let's call that yline.
float yline = f(x);

if (y < yline) {
  // The answer is -1 if y is above the line.
  // (In Processing, the y-axis points down, so a smaller y is higher up.)
  answer = -1;
} else {
  answer = 1;
}
We can then make a Trainer object with the inputs and the correct answer.
Trainer t = new Trainer(x, y, answer);
ptron.train(t.inputs, t.answer);
Now, it's important to remember that this is just a demonstration. Remember our Shakespeare-typing monkeys? We asked our genetic algorithm to solve for "to be or not to be", an answer we already knew. We did this to make sure our genetic algorithm worked properly. The same reasoning applies to this example. We don't need a perceptron to tell us whether a point is above or below a line; we can do that with simple math. We are using this scenario, one that we can easily solve without a perceptron, to demonstrate the perceptron's algorithm as well as easily confirm that it is working properly.
Let's look at how the perceptron works with an array of many training points.
Example 10.1: The Perceptron
Perceptron ptron;
// 2,000 training points
Trainer[] training = new Trainer[2000];
int count = 0;

// The formula for the line
float f(float x) {
  return 2*x+1;
}

void setup() {
  size(640, 360);
  ptron = new Perceptron(3);
  // Make 2,000 training points. (The coordinate ranges are our choice;
  // the sketch is drawn relative to the window's center.)
  for (int i = 0; i < training.length; i++) {
    float x = random(-width/2, width/2);
    float y = random(-height/2, height/2);
    int answer = 1;
    if (y < f(x)) answer = -1;
    training[i] = new Trainer(x, y, answer);
  }
}

void draw() {
  background(255);
  translate(width/2, height/2);
  // For animation, we are training one point at a time.
  ptron.train(training[count].inputs, training[count].answer);
  count = (count + 1) % training.length;
}
Exercise 10.1
Instead of using the supervised learning model above, can you train the
neural network to find the right weights by using a genetic algorithm?
Exercise 10.2
Visualize the perceptron itself. Draw the inputs, the processing node, and
the output.
We are now going to take significant creative license with the concept of a
neural network. This will allow us to stick with the basics and avoid some
of the highly complex algorithms associated with more sophisticated
neural networks. Here we're not so concerned with following rules outlined in artificial intelligence textbooks; we're just hoping to make something interesting and brain-like.
Remember our good friend the Vehicle class? You know, that one for
making objects with a location, velocity, and acceleration? That could
obey Newton's laws with an applyForce() function and move around the
window according to a variety of steering rules?
What if we added one more variable to our Vehicle class?
class Vehicle {
  Perceptron brain;
  PVector location;
  PVector velocity;
  PVector acceleration;
  //etc...
Here's our scenario. Let's say we have a Processing sketch with an ArrayList of targets and a single vehicle.
Figure 10.9
Let's say that the vehicle seeks all of the targets. According to the principles of Chapter 6, we would next write a function that calculates a steering force towards each target, applying each force one at a time to the object's acceleration. Assuming the targets are an ArrayList of PVector objects, it would look something like:
void seek(ArrayList<PVector> targets) {
  for (PVector target : targets) {
    // Calculate a steering force toward each target, as in Chapter 6.
    // (The single-target seek() and the weight factor are assumed helpers here.)
    PVector force = seek(target);
    force.mult(weight);
    applyForce(force);
  }
}
But what if instead we could ask our brain (i.e. perceptron) to take in all the forces as an input, process them according to the weights of the perceptron inputs, and generate an output steering force? What if we could instead say:
// Ask our brain for a result and apply that as the force!
PVector result = brain.feedforward(forces);
applyForce(result);
Note how these two functions implement nearly identical algorithms, with two differences:

1. Summing PVectors. Instead of a series of numbers added together, each input is a PVector and must be multiplied by the weight and added to a sum according to the mathematical PVector functions.

2. No activation function. Because the output here is the summed PVector itself, applied directly as a steering force, it is not passed through an activation function.
Once the resulting steering force has been applied, it's time to give feedback to the brain, i.e. reinforcement learning. Was the decision to
steer in that particular direction a good one or a bad one? Presumably if
some of the targets were predators (resulting in being eaten) and some of
the targets were food (resulting in greater health), the network would
adjust its weights in order to steer away from the predators and towards
the food.
Let's take a simpler example, where the vehicle simply wants to stay close to the center of the window. We'll train the brain as follows:
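A sketch of that training call, assuming the desired location is the center of the window:

// The error is a PVector pointing from the vehicle's current location
// toward where it wants to be (here, the center of the window).
PVector desired = new PVector(width/2, height/2);
PVector error = PVector.sub(desired, location);
brain.train(forces, error);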
Figure 10.10
Here we are passing the brain a copy of all the inputs (which it will need for error correction) as well as an observation about its environment: a PVector that points from its current location to where it desires to be. This PVector essentially serves as the error: the longer the PVector, the worse the vehicle is performing; the shorter, the better.
The brain can then apply this error vector (which has two error values,
one for x and one for y) as a means for adjusting the weights, just as we
did in the line classification example.
void train(PVector[] forces, PVector error) {
  for (int i = 0; i < weights.length; i++) {
    weights[i] += c*error.x*forces[i].x;
    weights[i] += c*error.y*forces[i].y;
  }
}

The earlier version, for comparison:

void train(float[] inputs, int desired) {
  for (int i = 0; i < weights.length; i++)
    weights[i] += c * error * inputs[i];
}
Because the vehicle observes its own error, there is no need to calculate
one; we can simply receive the error as an argument. Notice how the
change in weight is processed twice, once for the error along the x-axis
and once for the y-axis.
weights[i] += c*error.x*forces[i].x;
weights[i] += c*error.y*forces[i].y;
We can now look at the Vehicle class and see how the steer function uses a
perceptron to control the overall steering force. The new content from this
chapter is highlighted.
Example 10.2: Perceptron steering
class Vehicle {
  Perceptron brain;

  PVector location;
  PVector velocity;
  PVector acceleration;
  float maxforce;
  float maxspeed;

  void update() {
    velocity.add(acceleration);
    velocity.limit(maxspeed);
    location.add(velocity);
    acceleration.mult(0);
  }

  void seek(ArrayList<PVector> targets) {
    // Build the forces array from the targets (as described above),
    // then ask the brain for a steering force and apply it.
    PVector result = brain.feedforward(forces);
    applyForce(result);
  }
}
Exercise 10.3
Visualize the weights of the network. Try mapping each target's corresponding weight to its brightness.
Exercise 10.4
Try different rules for reinforcement learning. What if some targets are
desirable and some are undesirable?
On the left of Figure 10.11, we have classic linearly separable data. Graph
all of the possibilities; if you can classify the data with a straight line, then
it is linearly separable. On the right, however, is non-linearly separable
data. You cant draw a straight line to separate the black dots from the
gray ones.
See how you can draw a line to separate the true outputs from the false
ones?
XOR is the equivalent of OR and NOT AND. In other words, A XOR B only evaluates to true if exactly one of them is true. If both are false or both are true, then we get false. Take a look at the following truth table.

A        B        A XOR B
true     true     false
true     false    true
false    true     true
false    false    false
Figure 10.13
This is not linearly separable. Try to draw a straight line to separate the true outputs from the false ones; you can't!
Instead, here we'll focus on a code framework for building the visual architecture of a network. We'll make Neuron objects and Connection objects from which a Network object can be created and animated to show the feedforward process. This will closely resemble some of the force-directed graph examples we examined in Chapter 5 (toxiclibs).
The primary building block for this diagram is a neuron. For the purpose
of this example, the Neuron class describes an entity with an (x,y) location.
An incredibly simple Neuron class stores and displays the location of a single neuron.

class Neuron {
  PVector location;

  Neuron(float x, float y) {
    location = new PVector(x, y);
  }

  void display() {
    stroke(0);
    fill(0);
    ellipse(location.x, location.y, 16, 16);
  }
}
The Network class can then manage an ArrayList of neurons, as well as have its own location (so that each neuron is drawn relative to the network's center). This is particle systems 101. We have a single element (a neuron) and a network (a system of many neurons).
class Network {
  ArrayList<Neuron> neurons;
  PVector location;

  Network(float x, float y) {
    location = new PVector(x,y);
    neurons = new ArrayList<Neuron>();
  }

  void addNeuron(Neuron n) {
    neurons.add(n);
  }

  void display() {
    pushMatrix();
    translate(location.x, location.y);
    for (Neuron n : neurons) {
      n.display();
    }
    popMatrix();
  }
}
Now we can pretty easily make the diagram above.
Network network;

void setup() {
  size(640, 360);
  // Make a Network with four Neurons. (These coordinates are one
  // possible layout; any positions will do.)
  network = new Network(width/2, height/2);
  Neuron a = new Neuron(-200, 0);
  Neuron b = new Neuron(0, 100);
  Neuron c = new Neuron(0, -100);
  Neuron d = new Neuron(200, 0);
  network.addNeuron(a);
  network.addNeuron(b);
  network.addNeuron(c);
  network.addNeuron(d);
}

void draw() {
  background(255);
  // Show the network.
  network.display();
}
The above yields:
What's missing, of course, is the connection. We can consider a Connection object to be made up of three elements: two neurons (from Neuron a to Neuron b) and a weight.
class Connection {
  // A connection is between two neurons.
  Neuron a;
  Neuron b;
  // A connection has a weight.
  float weight;

  Connection(Neuron from, Neuron to, float w) {
    a = from;
    b = to;
    weight = w;
  }

  void display() {
    stroke(0);
    strokeWeight(weight*4);
    line(a.location.x, a.location.y, b.location.x, b.location.y);
  }
}
Once we have the idea of a Connection object, we can write a function (let's put it inside the Network class) that connects two neurons together, the goal being that in addition to making the neurons in setup(), we can also connect them.
void setup() {
  size(640, 360);
  network = new Network(width/2, height/2);
  // Make four neurons, as before. (Coordinates are one possible layout.)
  Neuron a = new Neuron(-200, 0);
  Neuron b = new Neuron(0, 100);
  Neuron c = new Neuron(0, -100);
  Neuron d = new Neuron(200, 0);
  // Connect them.
  network.connect(a,b);
  network.connect(a,c);
  network.connect(b,d);
  network.connect(c,d);
  // And add them to the network.
  network.addNeuron(a);
  network.addNeuron(b);
  network.addNeuron(c);
  network.addNeuron(d);
}
The Network class therefore needs a new function called connect(), which makes a Connection object between the two specified neurons.
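A sketch of connect(), assuming each new connection starts with a random weight:

void connect(Neuron a, Neuron b) {
  Connection c = new Connection(a, b, random(1));
  // The connection is stored with the "from" neuron, which keeps a list of them.
  a.addConnection(c);
}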
class Neuron {
  PVector location;
  ArrayList<Connection> connections;

  Neuron(float x, float y) {
    location = new PVector(x, y);
    connections = new ArrayList<Connection>();
  }

  // Adding a connection to this neuron
  void addConnection(Connection c) {
    connections.add(c);
  }
The neuron's display() function can draw the connections as well. And finally, we have our network diagram.
void display() {
  stroke(0);
  strokeWeight(1);
  fill(0);
  ellipse(location.x, location.y, 16, 16);
  // Draw all of this neuron's connections too.
  for (Connection c : connections) {
    c.display();
  }
}
void setup() {
  // All our old network setup code, then send an input into the network.
  network.feedforward(random(1));
}
The network, which manages all the neurons, can choose to which neurons it should apply that input. In this case, we'll do something simple and just feed a single input into the first neuron in the ArrayList, which happens to be the left-most one.
class Network {
  // Feed an input into the first (left-most) neuron in the ArrayList.
  void feedforward(float input) {
    Neuron start = neurons.get(0);
    start.feedforward(input);
  }
}

The Neuron class, in turn, needs a variable to accumulate its inputs and a feedforward() function of its own:

class Neuron {
  float sum = 0;

  void feedforward(float input) {
    sum += input;
  }
}
The neuron can then decide whether it should fire, or pass an output
through any of its connections to the next layer in the network. Here we
can create a really simple activation function: if the sum is greater than 1,
fire!
void feedforward(float input) {
  sum += input;
  // A really simple activation function: if the sum is greater than 1, fire!
  if (sum > 1) {
    fire();
    // If we've fired off our output, we can reset our sum to 0.
    sum = 0;
  }
}
Now, what do we do in the fire() function? If you recall, each neuron keeps track of its connections to other neurons. So all we need to do is loop through those connections and feedforward() the neuron's output. For this simple example, we'll just take the neuron's sum variable and make it the output.
void fire() {
  // The Neuron sends the sum out through all of its connections.
  for (Connection c : connections) {
    c.feedforward(sum);
  }
}
Here's where things get a little tricky. After all, our job here is not to actually make a functioning neural network, but to animate a simulation of one. If the neural network were just continuing its work, it would instantly pass those inputs (multiplied by the connection's weight) along to the connected neurons. We'd say something like:
class Connection {
  void feedforward(float val) {
    b.feedforward(val*weight);
  }
}
Let's first think about how we might do that. We know the location of Neuron a; it's the PVector a.location. Neuron b is located at b.location. We need to start something moving from Neuron a by creating another PVector that will store the path of our traveling data.
PVector sender = a.location.get();

Then, in the connection's display() function, we can draw both the connection and the traveling dot:

stroke(0);
line(a.location.x, a.location.y, b.location.x, b.location.y);
fill(0);
ellipse(sender.x, sender.y, 8, 8);
This resembles the following:
Figure 10.16
OK, so that's how we might move something along the connection. But how do we know when to do so? We start this process the moment the Connection object receives the feedforward signal. We can keep track of this process by employing a simple boolean to know whether the connection is sending or not. Before, we had:
void feedforward(float val) {
  b.feedforward(val*weight);
}

Now we have:

void feedforward(float val) {
  // Turn the sending animation on.
  sending = true;
  // Start the animation at the location of Neuron A.
  sender = a.location.get();
  // Store the output for when it is actually time to feed it forward.
  output = val*weight;
}
Notice how our Connection class now needs three new variables. We need a boolean sending that starts as false and that will track whether or not the connection is actively sending (i.e. animating). We need a PVector sender for the location where we'll draw the traveling dot. And since we aren't passing the output along this instant, we'll need to store it in a variable that will do the job later.
void update() {
  if (sending) {
    // As long as we're sending, interpolate our points toward Neuron b.
    sender.x = lerp(sender.x, b.location.x, 0.1);
    sender.y = lerp(sender.y, b.location.y, 0.1);
    // If we're close enough (within one pixel), pass on the output and turn off sending.
    float d = PVector.dist(sender, b.location);
    if (d < 1) {
      b.feedforward(output);
      sending = false;
    }
  }
}
Let's look at the Connection class all together, as well as our new draw() function.
void draw() {
  background(255);
  // The Network now has a new update() method that updates all of the Connection objects.
  network.update();
  network.display();
  // We are choosing to send in an input every 30 frames.
  if (frameCount % 30 == 0) {
    network.feedforward(random(1));
  }
}
class Connection {
  // The Connection's data: two neurons and a weight.
  float weight;
  Neuron a;
  Neuron b;

  // State for the traveling-dot animation.
  boolean sending = false;
  PVector sender;
  float output = 0;

  Connection(Neuron from, Neuron to, float w) {
    weight = w;
    a = from;
    b = to;
  }

  void feedforward(float val) {
    sending = true;               // Turn the animation on.
    sender = a.location.get();    // Start at Neuron a's location.
    output = val*weight;          // Store the output for later delivery.
  }

  void update() {
    if (sending) {
      sender.x = lerp(sender.x, b.location.x, 0.1);
      sender.y = lerp(sender.y, b.location.y, 0.1);
      float d = PVector.dist(sender, b.location);
      // Close enough (within one pixel): deliver the output and stop sending.
      if (d < 1) {
        b.feedforward(output);
        sending = false;
      }
    }
  }

  void display() {
    stroke(0);
    strokeWeight(1+weight*4);
    line(a.location.x, a.location.y, b.location.x, b.location.y);
    if (sending) {
      fill(0);
      strokeWeight(1);
      ellipse(sender.x, sender.y, 16, 16);
    }
  }
}
Exercise 10.5
The network in the above example was manually configured by setting the
location of each neuron and its connections with hard-coded values.
Rewrite this example to generate the network's layout via an algorithm.
Can you make a circular network diagram? A random one? An example of
a multi-layered network is below.
Exercise 10.6
Rewrite the example so that each neuron keeps track of its forward and
backward connections. Can you feed inputs through the network in any
direction?
Exercise 10.7
Instead of lerp() , use moving bodies with steering forces to visualize the
flow of information in the network.
The end
If you're still reading, thank you! You've reached the end of the book. But for as much material as this book contains, we've barely scratched the surface of the world we inhabit and of techniques for simulating it. It's my intention for this book to live as an ongoing project, and I hope to continue adding new tutorials and examples to the book's website as well as expand and update the printed material. Your feedback is truly appreciated, so please get in touch via email at (daniel@shiffman.net) or by contributing to the GitHub repository, in keeping with the open-source spirit of the project. Share your work. Keep in touch. Let's be two with nature.
A Basic Introduction to
Feedforward Backpropagation Neural Networks
David Leverington
Associate Professor of Geosciences
Although the long-term goal of the neural-network community remains the design of autonomous
machine intelligence, the main modern application of artificial neural networks is in the field of pattern
recognition (e.g., Joshi et al., 1997). In the sub-field of data classification, neural-network methods have
been found to be useful alternatives to statistical techniques such as those which involve regression
analysis or probability density estimation (e.g., Holmström et al., 1997). The potential utility of neural
networks in the classification of multisource satellite-imagery databases has been recognized for well
over a decade, and today neural networks are an established tool in the field of remote sensing.
The most widely applied neural network algorithm in image classification remains the feedforward
backpropagation algorithm. This web page is devoted to explaining the basic nature of this classification
routine.
Neural networks are members of a family of computational architectures inspired by biological brains
(e.g., McClelland et al., 1986; Luger and Stubblefield, 1993). Such architectures are commonly called
"connectionist systems", and are composed of interconnected and interacting components called nodes
or neurons (these terms are generally considered synonyms in connectionist terminology, and are used
interchangeably here). Neural networks are characterized by a lack of explicit representation of
knowledge; there are no symbols or values that directly correspond to classes of interest. Rather,
knowledge is implicitly represented in the patterns of interactions between network components (Luger
and Stubblefield, 1993). A graphical depiction of a typical feedforward neural network is given in Figure
1. The term feedforward indicates that the network has links that extend in only one direction. Except
during training, there are no backward links in a feedforward network; all links proceed from input
nodes toward output nodes.
Figure 1: A typical feedforward neural network.
Individual nodes in a neural network emulate biological neurons by taking input data and performing
simple operations on the data, selectively passing the results on to other neurons (Figure 2). The output
of each node is called its "activation" (the terms "node values" and "activations" are used
interchangeably here). Weight values are associated with each vector and node in the network, and these
values constrain how input data (e.g., satellite image values) are related to output data (e.g., land-cover
classes). Weight values associated with individual nodes are also known as biases. Weight values are
determined by the iterative flow of training data through the network (i.e., weight values are established
during a training phase in which the network learns how to identify particular classes by their typical
input data characteristics). A more formal description of the foundations of multi-layer, feedforward,
backpropagation neural networks is given in Section 5.
Once trained, the neural network can be applied toward the classification of new data. Classifications are
performed by trained networks through 1) the activation of network input nodes by relevant data sources
[these data sources must directly match those used in the training of the network], 2) the forward flow of
this data through the network, and 3) the ultimate activation of the output nodes. The pattern of
activation of the network's output nodes determines the outcome of each pixel's classification. Useful
summaries of fundamental neural network principles are given by Rumelhart et al. (1986), McClelland
and Rumelhart (1988), Rich and Knight (1991), Winston (1991), Anzai (1992), Luger and Stubblefield
(1993), Gallant (1993), and Richards and Jia (2005). Parts of this web page draw on these summaries. A
brief historical account of the development of connectionist theories is given in Gallant (1993).
Figure 2: Schematic comparison between a biological neuron and an artificial neuron (after Winston,
1991; Rich and Knight, 1991). For the biological neuron, electrical signals from other neurons are
conveyed to the cell body by dendrites; resultant electrical signals are sent along the axon to be
distributed to other neurons. The operation of the artificial neuron is analogous to (though much
simpler than) the operation of the biological neuron: activations from other neurons are summed at the
neuron and passed through an activation function, after which the value is sent to other neurons.
2 McCulloch-Pitts Networks
Neural computing began with the development of the McCulloch-Pitts network in the 1940's
(McCulloch and Pitts, 1943; Luger and Stubblefield, 1993). These simple connectionist networks, shown
in Figure 3, are stand-alone decision machines that take a set of inputs, multiply these inputs by
associated weights, and output a value based on the sum of these products. Input values (also known as
input activations) are thus related to output values (output activations) by simple mathematical
operations involving weights associated with network links. McCulloch-Pitts networks are strictly
binary; they take as input and produce as output only 0's or 1's. These 1's and 0's can be thought of as excitatory and inhibitory entities, respectively (Luger and Stubblefield, 1993). If the sum of the products
of the inputs and their respective weights is greater than or equal to 0, the output node returns a 1
(otherwise, a 0 is returned). The value of 0 is thus a threshold that must be exceeded or equalled if the
output of the system is to be 1. The above rule, which governs the manner in which an output node maps
input values to output values, is known as an activation function (meaning that this function is used to
determine the activation of the output node). McCulloch-Pitts networks can be constructed to compute logical functions (for example, in the X AND Y case, the weights can be chosen so that no combination of inputs produces a sum of products greater than or equal to 0, except the combination X=Y=1). McCulloch-Pitts networks
do not learn, and thus the weight values must be determined in advance using other mathematical or
heuristic means. Nevertheless, these networks did much to inspire further research into connectionist
models during the 1950's (Luger and Stubblefield, 1993).
3 Perceptrons
The development of a connectionist system capable of limited learning occurred in the late 1950's, when
Rosenblatt created a system known as a perceptron (see Rosenblatt, 1962; Luger and Stubblefield,
1993). Again, this system consists of binary activations (inputs and outputs) (see Figure 4). In common
with the McCulloch-Pitts neuron described above, the perceptron's binary output is determined by
summing the products of inputs and their respective weight values. In the perceptron implementation, a
variable threshold value is used (whereas in the McCulloch-Pitts network, this threshold is fixed at 0): if
the linear sum of the input/weight products is greater than a threshold value (theta), the output of the
system is 1 (otherwise, a 0 is returned). The perceptron output unit is thus said to be, like the McCulloch-Pitts output unit, a linear threshold unit. To summarize, the perceptron classifies input values as either 1 or 0, according
to the following rule, referred to as the activation function:
(eqn 1)
PERCEPTRON OUTPUT = 1 if (sum of products of inputs and weights) > theta
(otherwise, PERCEPTRON OUTPUT = 0)
The perceptron is trained (i.e., the weights and threshold values are calculated) based on an iterative
training phase involving training data. Training data are composed of a list of input values and their
associated desired output values. In the training phase, the inputs and related outputs of the training data
are repeatedly submitted to the perceptron. The perceptron calculates an output value for each set of
input values. If the output of a particular training case is labelled 1 when it should be labelled 0, the
threshold value (theta) is increased by 1, and all weight values associated with inputs of 1 are decreased
by 1. The opposite is performed if the output of a training case is labelled 0 when it should be labelled 1.
No changes are made to the threshold value or weights if a particular training case is correctly classified.
This set of training rules is summarized as:
(eqn 2a)
If OUTPUT is correct, then no changes are made to the threshold or weights
(eqn 2b)
If OUTPUT = 1, but should be 0,
then {theta = theta + 1}
and {weightx = weightx - 1, if inputx = 1}
(eqn 2c)
If OUTPUT = 0, but should be 1,
then {theta = theta - 1}
and {weightx = weightx + 1, if inputx = 1}
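Expressed in code, one application of these training rules might look like the following sketch (the variable names and the surrounding class are our own assumptions; inputs and outputs are binary):

// One application of the perceptron training rules (eqns 2a-2c).
// Assumed fields on the enclosing class: float[] weights; float theta;
void trainCase(int[] inputs, int actual, int desired) {
  if (actual == desired) return;       // eqn 2a: correct, no changes made
  if (actual == 1 && desired == 0) {   // eqn 2b
    theta = theta + 1;
    for (int x = 0; x < weights.length; x++) {
      if (inputs[x] == 1) weights[x] = weights[x] - 1;
    }
  } else {                             // eqn 2c: output was 0, but should be 1
    theta = theta - 1;
    for (int x = 0; x < weights.length; x++) {
      if (inputs[x] == 1) weights[x] = weights[x] + 1;
    }
  }
}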
where the subscript x refers to a particular input-node and weight pair. The effect of the above training
rules is to make it less likely that a particular error will be made in subsequent training iterations. For
example, in equation (2b), increasing the threshold value serves to make it less likely that the same sum
of products will exceed the threshold in later training iterations, and thus makes it less likely that an
output value of 1 will be produced when the same inputs are presented. Also, by modifying only those
weights that are associated with input values of 1, only those weights that could have contributed to the
error are changed (weights associated with input values of 0 are not considered to have contributed to
error). Once the network is trained, it can be used to classify new data sets whose input/output
associations are similar to those that characterize the training data set. Thus, through an iterative training
stage in which the weights and threshold gradually migrate to useful values (i.e., values that minimize or
eliminate error), the perceptron can be said to learn how to solve simple problems.
Figure 4: An example of a perceptron. The system consists of binary activations. Weights are identified by w's, and inputs are identified by i's. A variable threshold value (theta) is used at the output node.
The development of the perceptron was a large step toward the goal of creating useful connectionist
networks capable of learning complex relations between inputs and outputs. In the late 1950's, the
connectionist community understood that what was needed for the further development of connectionist
models was a mathematically-derived (and thus potentially more flexible and powerful) rule for
learning. By the early 1960's, the Delta Rule [also known as the Widrow and Hoff learning rule or the
least mean square (LMS) rule] was invented (Widrow and Hoff, 1960). This rule is similar to the
perceptron learning rule above (McClelland and Rumelhart, 1988), but is also characterized by a
mathematical utility and elegance missing in the perceptron and other early learning rules. The Delta
Rule uses the difference between target activation (i.e., target output values) and obtained activation to
drive learning. For reasons discussed below, the use of a threshold activation function (as used in both
the McCulloch-Pitts network and the perceptron) is dropped; instead, a linear sum of products is used to
calculate the activation of the output neuron (alternative activation functions can also be applied - see
Section 5.2). Thus, the activation function in this case is called a linear activation function, in which the
output node's activation is simply equal to the sum of the network's respective input/weight products. The strengths of the network's connections (i.e., the values of the weights) are adjusted to reduce the
difference between target and actual output activation (i.e., error). A graphical depiction of a simple two-
layer network capable of employing the Delta Rule is given in Figure 5. Note that such a network is not
limited to having only one output node.
Figure 5: A network capable of implementing the Delta Rule. Non-binary values may be used. Weights are identified by w's, and inputs are identified by i's. A simple linear sum of products (represented by the summation symbol at top) is used as the activation function at the output node of the network shown here.
During forward propagation through a network, the output (activation) of a given node is a function of
its inputs. The inputs to a node, which are simply the products of the output of preceding nodes with
their associated weights, are summed and then passed through an activation function before being sent
out from the node. Thus, we have the following:
(Eqn 3a)    Sj = Σi (wij · ai)

and

(Eqn 3b)    aj = f(Sj)

where Sj is the sum of all relevant products of weights and outputs from the previous layer i, wij represents the relevant weights connecting layer i with layer j, ai represents the activations of the nodes in the previous layer i, aj is the activation of the node at hand, and f is the activation function.
Figure 6: Schematic representation of an error function for a network containing only two weights (w1
and w2) (after Luger and Stubblefield, 1993). Any given combination of weights will be associated with
a particular error measure. The Delta Rule uses gradient descent learning to iteratively change network
weights to minimize error (i.e., to locate the global minimum in the error surface).
For any given set of input data and weights, there will be an associated magnitude of error, which is
measured by an error function (also known as a cost function) (Figure 6) (e.g., Oh, 1997; Yam and
Chow, 1997). The Delta Rule employs the error function for what is known as gradient descent learning,
which involves the modification of weights along the most direct path in weight-space to minimize
error; change applied to a given weight is proportional to the negative of the derivative of the error with
respect to that weight (McClelland and Rumelhart 1988, pp.126-130). The error function is commonly
given as the sum of the squares of the differences between all target and actual node activations for the
output layer. For a particular training pattern (i.e., training case), error is thus given by:
(Eqn 4a)    Ep = (1/2) Σn (tj_n - aj_n)²

where Ep is the total error over the training pattern, the factor of 1/2 is a value applied to simplify the function's derivative, n represents all output nodes for a given training pattern, tj_n represents the target value for node n in output layer j, and aj_n represents the actual activation for the same node. This particular error measure is attractive because its derivative, whose value is needed in the employment of the Delta Rule, is easily calculated. Error over an entire set of training patterns (i.e., over one iteration, or epoch) is calculated by summing all Ep:
(Eqn 4b)    E = Σp Ep
where E is total error, and p represents all training patterns. An equivalent term for E in Equation 4b is
sum-of-squares error. A normalized version of Equation 4b is given by the mean squared error (MSE)
equation:
(Eqn 4c)    MSE = (1 / (P · N)) Σp Σn (tj_n - aj_n)²
where P and N are the total number of training patterns and output nodes, respectively. It is the error of
Equations 4b and 4c that gradient descent attempts to minimize (in fact, this is not strictly true if weights
are changed after each input pattern is submitted to the network; see Section 4.1 below; see also
Rumelhart et al., 1986: v1, p.324; Reed and Marks, 1999: pp. 57-62). Error over a given training pattern
is commonly expressed in terms of the total sum of squares (tss) error, which is simply equal to the
sum of all squared errors over all output nodes and all training patterns. The negative of the derivative of
the error function is required in order to perform gradient descent learning. The derivative of Equation
4a (which measures error for a given pattern p), with respect to a particular weight wij_x, is given by the chain rule as:
(Eqn 5a)    dEp/dwij_x = (dEp/daj_z) · (daj_z/dwij_x)
where aj_z is the activation of the node in the output layer that corresponds to the weight wij_x (note: subscripts refer to particular layers of nodes or weights, and the sub-subscripts simply refer to individual weights and nodes within these layers). It follows that
(Eqn 5b)    dEp/daj_z = -(tj_z - aj_z)
and
(Eqn 5c)    daj_z/dwij_x = ai_x
Thus, the derivative of the error over an individual training pattern is given by the product of the
derivatives of Equation 5a:
(Eqn 5d)    dEp/dwij_x = -(tj_z - aj_z) · ai_x
Because gradient descent learning requires that any change in a particular weight be proportional to the
negative of the derivative of the error, the change in a given weight must be proportional to the negative
of equation 5d. Replacing the difference between the target and actual activation of the relevant output
node by d, and introducing a learning rate epsilon, Equation 5d can be re-written in the final form of the
delta rule:
(Eqn 5e)    Δwij_x = epsilon · d · ai_x
The reasoning behind the use of a linear activation function here instead of a threshold activation
function can now be justified: the threshold activation function that characterizes both the McCulloch-Pitts network and the perceptron is not differentiable at the transition between the activations of 0
and 1 (slope = infinity), and its derivative is 0 over the remainder of the function. As such, the threshold
activation function cannot be used in gradient descent learning. In contrast, a linear activation function
(or any other function that is differentiable) allows the derivative of the error to be calculated.
Equation 5e is the Delta Rule in its simplest form (McClelland and Rumelhart, 1988). From Equation 5e
it can be seen that the change in any particular weight is equal to the products of 1) the learning rate
epsilon, 2) the difference between the target and actual activation of the output node [d], and 3) the
activation of the input node associated with the weight in question. A higher value for epsilon will necessarily
result in a greater magnitude of change. Because each weight update can reduce error only slightly,
many iterations are required in order to satisfactorily minimize error (Reed and Marks, 1999). An actual
example of the iterative change in neural network weight values as a function of an error surface is given
in Figures 7 and 8. Figure 7 is a three-dimensional depiction of the error surface associated with a
particular mathematical problem. Figure 8 shows the two-dimensional version of this error surface,
along with the path that weight values took during training. Note that weight values changed such that
the path defined by weight values followed the local gradient of the error surface.
Weights can be updated in two primary ways: batch training, and on-line (also called sequential or
pattern-based) training. In batch mode, the value of dEp/dwij is calculated after each pattern is submitted
to the network, and the total derivative dE/dwij is calculated at the end of a given iteration by summing
the individual pattern derivatives. Only after this value is calculated are the weights updated. As long as
the learning rate epsilon (e) is small, batch mode approximates gradient descent (Reed and Marks,
1999).
On-line mode (also called pattern-mode learning) involves updating the values of weights after each
training pattern is submitted to the network (note that the term can be misleading: on-line mode does not
involve training during the normal feedforward operation of the network; it involves off-line training just
like batch mode). As noted earlier, on-line learning does not involve true gradient descent, since the sum
of all pattern derivatives over a given iteration is never determined for a particular set of weights;
weights are instead changed slightly after each pattern, causing the pattern derivatives to be evaluated
with respect to slightly different weight values. On-line mode is not a simple approximation of the
gradient descent method, since although single-pattern derivatives as a group sum to the gradient, each
derivative has a random deviation that does not have to be small (Reed and Marks, 1999: p.59).
Although error usually decreases after most weight changes, there may be derivatives that cause the
error to increase as well. Unless learning rates are very small, the weight vector tends to jump about the
E(w) surface, mostly moving downhill, but sometimes jumping uphill; the magnitudes of the jumps are
proportional to the learning rate epsilon (Reed and Marks, 1999).
Cyclic, fixed orders of training patterns are generally avoided in on-line learning, since convergence can
be limited if weights converge to a limit cycle (Reed and Marks 1999: p.61). Also, if large numbers of
patterns are in a training dataset, an ordered presentation of the training cases to the network can cause
weights/error to move very erratically over the error surface (with any given series of an individual class's training patterns potentially causing the network to move in weight-space in a direction that is
very different from the overall desired direction). Thus, training patterns are usually submitted at random
in on-line learning. A comparison between the learning curves produced by networks using non-random
and random submission of training data is given in Figure 9. Note that the network using non-random
submission produced an extremely erratic learning curve, compared to the relatively smooth learning
curve produced by the network using random submission. The on-line mode has an advantage over batch mode, in that the more erratic path that the weight values travel is more likely to bounce out of local
minima; pure gradient descent offers no chance to escape a local minimum. Further, the on-line mode is
superior to batch mode if there is a high degree of redundancy in the training data, since, when using a
large training dataset, the network will simply update weights more often in a given iteration, while a
batch-mode network will simply take longer to evaluate a given iteration (Bishop, 1995a, p.264). An
advantage of batch mode is that it can settle on a stable set of weight values, without wandering about
this set.
Figure 9: Learning curves produced by networks using non-random (fixed-order) and random
submission of training data (Leverington, 2001).
A simple example of the employment of the Delta Rule, based on a discussion given in McClelland and Rumelhart (1988), is as follows: imagine that the following inputs and outputs are related as in Table 3.1. Imagine further that a worker wishes to train a network to label each of the four input cases in this table correctly. This problem requires a network with four input nodes and one output node. The four weights (one per input node) are initially set to 0, and an arbitrary learning rate (epsilon) of 0.25 is used in this example.
During the training phase, each training case is presented to the network individually, and weights are
modified according to the Delta Rule. For example, when the first training case is presented to the
network, the sum of products equals 0. Because the desired output for this particular training case is 1,
the error equals 1-0 = 1. Using equation (5e), the changes in the four weights are respectively calculated
to be {0.25, -0.25, 0.25, -0.25}. Since the weights are initially set to {0, 0, 0, 0}, they become {0.25,
-0.25, 0.25, -0.25} after this first training case. Presentation of the second set of training inputs causes
the network to calculate a sum of products of 0, again. Thus, the changes in the four weights in this case
are calculated to be {0.25, 0.25, 0.25, 0.25}, and, once the changes are added to the previously-
determined weights, the new weight values become {0.5, 0, 0.5, 0}. After presentation of the third and
fourth training cases, the weight values become {0, -0.5, 0, 0.5} and {-0.5, 0, 0.5, 0}, respectively. At the
end of this training iteration, the total sum of squared errors = 1² + 1² + (-2)² + (-2)² = 10.
After this first iteration, it is not clear that the weights are changing in a manner that will reduce network
error. In fact, with the last set of weights given above, the network would only produce a correct output
value for the last training case; the first three would be classified incorrectly. However, with repeated
presentation of the same training data to the network (i.e., with multiple iterations of training), it becomes clear that the network's weights do indeed evolve to reduce classification error: error is eliminated altogether by the twentieth iteration. The network has learned to classify all training cases correctly, and
is now ready to be used on new data whose relations between inputs and desired outputs generally match
those of the training data.
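Because Table 3.1 is not reproduced here, the following Python sketch reconstructs the four input/target pairs from the weight changes reported above (an inference, not the original table). Running it reproduces the first-iteration weights of {-0.5, 0, 0.5, 0} and the sum of squared errors of 10, and shows the error falling to zero with repeated iterations:

    import numpy as np

    # Inputs and targets inferred from the worked example above.
    X = np.array([[1, -1,  1, -1],
                  [1,  1,  1,  1],
                  [1,  1,  1, -1],
                  [1, -1, -1,  1]], dtype=float)
    t = np.array([1.0, 1.0, -1.0, -1.0])

    w = np.zeros(4)        # all weights initially set to 0
    epsilon = 0.25         # arbitrary learning rate from the example

    for iteration in range(1, 21):
        sse = 0.0
        for x, target in zip(X, t):
            error = target - np.dot(w, x)   # desired minus actual output
            sse += error ** 2
            w += epsilon * error * x        # Delta Rule update
        print(iteration, sse, w)
    # Iteration 1 prints w = [-0.5, 0, 0.5, 0] and sse = 10.0, matching
    # the hand calculation; the error reaches 0 within twenty iterations.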
The Delta Rule will find a set of weights that solves a network learning problem, provided such a set of weights exists; such a set exists only when the desired outputs are a linear function of the inputs. As was shown by Minsky and Papert (1969), this condition does not hold for many simple problems (e.g., the exclusive-OR function, in which an output of 1 must be produced when exactly one of two inputs is 1, but an output of 0 must be produced when neither or both of the inputs are 1;
see McClelland and Rumelhart 1988, pp. 145-152). Minsky and Papert recognized that a multi-layer
network could convert an unsolvable problem to a solvable problem (note: a multi-layer network
consists of one or more intermediate layers placed between the input and output layers; this accepted
terminology can be somewhat confusing and contradictory, in that the term "layer" in "multi-layer" refers to a row of weights, whereas "layer" in general neural-network usage usually means a row of nodes; see Section 5.1 and, e.g., Vemuri, 1992, p.42). Minsky and Papert also recognized that the
use of a linear activation function (such as that used in the Delta Rule example above, where network
output is equal to the sum of the input/weight products) would not allow the benefits of having a multi-
layer network to be realized, since a multi-layer network with linear activation functions is functionally
equivalent to a simple input-output network using linear activation functions. That is, linear systems
cannot compute more in multiple layers than they can in a single layer (McClelland and Rumelhart,
1988). Based on the above considerations, the questions at the time became 1) what kind of activation
function should be used in a multi-layer network, and 2) how can intermediate layers in a multi-layer
network be taught? The original application of the Delta Rule involved only an input layer and an
output layer. It was generally believed that no general learning rule could be formulated for larger, multi-layer networks. As a result of this view, research on connectionist networks for applications in artificial intelligence was dramatically reduced in the 1970s (McClelland and Rumelhart, 1988; Joshi et
al., 1997).
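The claim above that linear systems compute no more in multiple layers than in one can be verified directly: applying two weight matrices in sequence is equivalent to applying their single matrix product. A brief Python illustration, with arbitrary weight values:

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 4))   # weights: input layer -> intermediate layer
    W2 = rng.normal(size=(1, 3))   # weights: intermediate layer -> output node
    x = rng.normal(size=4)         # an arbitrary input vector

    two_layers = W2 @ (W1 @ x)     # two linear layers applied in sequence
    one_layer = (W2 @ W1) @ x      # the single equivalent layer

    print(np.allclose(two_layers, one_layer))   # True: no power is gained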
Eventually, despite the apprehensions of earlier workers, a powerful algorithm for apportioning error
responsibility through a multi-layer network was formulated in the form of the backpropagation
algorithm (Rumelhart et al., 1986). The backpropagation algorithm employs the Delta Rule, calculating
error at output units in a manner analogous to that used in the example of Section 4.2, while the error at a neuron in the layer directly preceding the output layer is a function of the errors at all of the units that use that neuron's output. The effects of error in the output node(s) are propagated backward through the network after
each training case. The essential idea of backpropagation is to combine a non-linear multi-layer
perceptron-like system capable of making decisions with the objective error function of the Delta Rule
(McClelland and Rumelhart, 1988).
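As a rough sketch of this idea (not an implementation from the sources cited above), the following Python code trains a small two-layer network on the exclusive-OR problem discussed earlier, using the sigmoid activation defined in Eqn 6 below. Bias terms are added because the exclusive-OR function cannot be represented without them, and whether a particular run succeeds depends on the random starting weights:

    import numpy as np

    def sigma(s):
        return 1.0 / (1.0 + np.exp(-s))      # sigmoid activation (Eqn 6, below)

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([0.0, 1.0, 1.0, 0.0])       # exclusive-OR targets

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(3, 2))  # input -> intermediate weights
    b1 = np.zeros(3)
    W2 = rng.normal(scale=0.5, size=3)       # intermediate -> output weights
    b2 = 0.0
    epsilon = 0.5

    for _ in range(10000):
        for x, target in zip(X, t):
            h = sigma(W1 @ x + b1)                 # forward pass
            o = sigma(W2 @ h + b2)
            delta_o = (target - o) * o * (1 - o)   # output error (uses Eqn 7)
            delta_h = delta_o * W2 * h * (1 - h)   # error apportioned backward
            W2 += epsilon * delta_o * h            # weight updates
            b2 += epsilon * delta_o
            W1 += epsilon * np.outer(delta_h, x)
            b1 += epsilon * delta_h

    # If training succeeded, the four outputs approach 0, 1, 1, 0.
    print(sigma(W2 @ sigma(W1 @ X.T + b1[:, None]) + b2))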
A smooth, non-linear activation function is essential in a multi-layer network employing gradient-descent learning. An activation function commonly used in backpropagation networks is the sigma (or sigmoid) function:
    a_jm = 1 / (1 + e^(-S_jm))    (Eqn 6)

where a_jm is the activation of a particular receiving node m in layer j, S_jm is the sum of the products of the activations of all relevant emitting nodes (i.e., the nodes in the preceding layer i) and their respective weights, and w_ij is the set of all weights between layers i and j that are associated with the connections feeding into node m of layer j. This function maps all sums into [0, 1] (Figure 10) (an alternate
version of the function maps activations into [-1, 1]; e.g., Gallant 1993, pp. 222-223). If the sum of the
products is 0, the sigma function returns 0.5. As the sum gets larger the sigma function returns values
closer to 1, while the function returns values closer to 0 as the sum gets increasingly negative. The
derivative of the sigma function with respect to S_jm is conveniently simple, and is given by Gallant (1993, p.213) as:

    d(a_jm)/d(S_jm) = a_jm (1 - a_jm)    (Eqn 7)
The sigma function applies to all nodes in the network except the input nodes, whose activations are set directly to the input values. The sigma function superficially resembles the threshold function (which is used in the perceptron), as shown in Figure 10. Note that the derivative of the sigma function reaches its maximum where the activation equals 0.5, and approaches its minimum as the activation approaches 0 or 1. Thus, the greatest change in weights will occur for nodes with activations near 0.5, while the least change will occur for activations near 0 or 1.
McClelland and Rumelhart (1988) recognize that it is these features of the equation (i.e., the shape of the
function) that contribute to the stability of learning in the network; weights are changed most for units
whose values are near their midrange, and thus for those units that are not yet committed to being either
on or off.
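This midrange property is easy to check numerically; in the short sketch below, the derivative (Eqn 7) peaks at 0.25 where the activation equals 0.5 (i.e., where the summed input is 0) and shrinks as the activation approaches 0 or 1:

    import numpy as np

    def sigma(s):
        return 1.0 / (1.0 + np.exp(-s))   # Eqn 6

    for s in [-4.0, -1.0, 0.0, 1.0, 4.0]:
        a = sigma(s)
        d = a * (1.0 - a)                 # Eqn 7
        print(f"sum = {s:+.1f}   activation = {a:.3f}   derivative = {d:.3f}")
    # The derivative is largest (0.25) at the midrange activation 0.5, so
    # "uncommitted" nodes receive the largest weight changes.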