
3. ARTIFICIAL NEURAL NETWORK


Artificial Neural Networks (ANNs), also called parallel distributed processing systems (PDPs) and connectionist systems, are intended for modeling the organizational principles of the central nervous system. This offers the hope that the biologically inspired computing capabilities of the ANN will allow cognitive and sensory tasks to be performed more easily and more satisfactorily than with conventional serial processors.
3.1 BIOLOGICAL NEURAL NETWORKS
A regulatory or control mechanism is a basic need of all multicellular organisms. These regulatory mechanisms switch on and off the body's multifarious physiological performances, such as transmitting, monitoring, integrating and deciphering the information that is continuously exchanged between the external and the internal environments. This synchronization between the various systems in a multicellular organism is not haphazard but is controlled: the timing and location of one set of activities is correlated with that of another set of activities. This is known as co-ordination.
The co-ordination in time and space between one set of activities and another in multicellular organisms is due solely to the nervous system and the endocrine system. The nervous system is the body's control unit and communication network; it shares the maintenance of homoeostasis of the body with the endocrine system. The nervous system is made up of several millions of nerve cells along with various supporting tissues, forming a series of conducting tissues extending to all parts of the body.
The main function of a nerve cell is to transmit external and internal impulses from the site of the receptor to the central nervous system through sensory nerves, and back to the effector organs via motor nerves. The nervous system provides the quickest means of communication within the body and thus serves as the chief coordinator of all the performances of the body.
The structural and functional unit of the nervous system is the neuron. Fig. 3.1 illustrates a biological neuron. Each neuron structurally consists of three parts, each of which is associated with a specific function:
1. The cell body (soma), or centron or neurocyton, with dendrons (dendrites), which are outgrowths of the cell membrane of the neuron cell body.
2. The axon (neurite), a single long process arising from the axon-hillock of the cell body of the neuron.
3. The axons and the dendrons constitute nerve fibres. The axon gives off branches called collaterals along its course, and near its end it ramifies into non-myelinated terminal branches known as axon terminals, which serve as the third structural unit of a neuron.
3.2 MATHEMATICAL MODELLING OF ANN
With this basic idea, a mathematical model of a neuron can be developed for ANNs, based on two distinct operations performed by neurons. They are:
Synaptic Operation

The synaptic operation provides a confluence between the n-dimensional neural input vector, X, and the n-dimensional synaptic weight vector, W. A dot product is often used as the confluence operation. Thus, the components of the resulting vector, Z, can be expressed as

Z_i = W_i X_i,   i = 1, 2, 3, ..., n    (3.1)
Fig.3.1 A biological neuron
where

W = (W_1 W_2 W_3 ... W_n)^T    (3.2)

is the vector of the synaptic weights,

X = (X_1 X_2 X_3 ... X_n)^T    (3.3)

is the vector of the neural inputs, and

Z = (Z_1 Z_2 Z_3 ... Z_n)^T    (3.4)

is the vector of the weighted neural inputs. Thus, the synaptic operation assigns a relative significance to each incoming input signal X_i according to the past experience stored in W_i.
Somatic Operation
This operation is a two-step process, viz.
Somatic Aggregation: This operation can be expressed as

u = Σ_{i=1}^{n} Z_i = Σ_{i=1}^{n} W_i X_i    (3.5)

where u is the intermediate accumulated sum. Thus, the combined synaptic operation and somatic aggregation operation provide a mapping from the n-dimensional neural input space, X, to the one-dimensional space, u.
Non-linear Operation with Thresholding: A nonlinear operation on u yields the neural output y, given by

y = f[u - W_0]    (3.6)

where f is the non-linear function and W_0 is the threshold.
Thus,

y = f[ Σ_{i=1}^{n} W_i X_i - W_0 ] = f[ Σ_{i=0}^{n} W_i X_i ] = f[v]    (3.7)

provided

X_0 = -1    (3.8)

where v is the final accumulated sum and X_0 is the fixed bias input. Thus, a Neural Processing Unit (NPU) can be schematically represented as in Fig. 3.2.
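The following short Python sketch (an illustrative addition, not part of the original text) implements such an NPU; the function name npu_output is hypothetical, and the sigmoid of eqn (3.12) below is assumed as the nonlinearity f:

```python
import numpy as np

def npu_output(x, w, w0):
    """Output of one neural processing unit, eqns (3.1)-(3.8):
    synaptic operation and somatic aggregation, then thresholding."""
    u = np.dot(w, x)                       # u = sum of W_i * X_i, eqn (3.5)
    v = u - w0                             # threshold acts as a bias input X_0 = -1
    return 1.0 / (1.0 + np.exp(-v))        # nonlinear operation f, eqn (3.6)

# example with n = 3 inputs and arbitrary weights and threshold
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.5])
print(npu_output(x, w, w0=0.1))
```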


3.3 DIFFERENT NETWORK ARCHITECTURES
Although a single neuron processing unit can handle simple pattern classification problems, the strength of neural computation comes from neurons connected in a network. A set of processing units assembled in a closely interconnected network is called an artificial neural network.
Depending on the different modes of interconnection, ANNs can be broadly classified as:
Fig. 3.2 A neural processing unit
1. Multilayer Feed-forward Neural Networks or Static Neural Networks: These are characterized by directed layered graphs. In this case, a number of parameters, such as the structure, connection weights and thresholds, are obtained from learning rather than being predetermined.

Fig. 3.3a Detailed Representation of Multi-layer Structure
2. Laterally Connected Neural Networks: These consist of feed-forward input units and a layer of neurons that are laterally connected to their neighbors.
3. Recurrent or Dynamic Neural Networks: In these networks, neurons are connected in a layered structure, and the neurons in a given layer may receive inputs from neurons in the layers below and/or above it. The output depends not only upon the current inputs but also upon past inputs and/or outputs. These networks have mostly been used for the solution of optimization problems.
4. Hybrid Neural Networks: These networks combine two or more of the features of the
above mentioned networks.
3.4 SUPERVISED LEARNING
The ability of a particular neural network is largely determined by the learning process and the network structure used. Learning procedures are divided into three types: supervised, reinforced and unsupervised. These three types of learning are defined by the type of error signal used to train the weights in the network. In supervised learning, an error scalar is provided for each output unit by an external teacher, while in reinforced learning the network is given only a global punish/reward signal. In unsupervised learning, no external error signal is provided; instead, internal errors are generated between the units, which are then used to modify the weights.
In the supervised learning paradigm, the weights connecting units in the network are set on the basis of detailed error information supplied to the network by an external teacher. In most cases the network is trained using a set of input-output pairs which are examples of the mapping that the network is required to learn to compute. The learning process may, therefore, be viewed as fitting a function, and its performance can thus be judged on whether the network can learn the desired function over the interval represented by the training set, and to what extent the network can successfully generalize away from the points on which it has been trained.
As an example, consider the case of electrode contour optimization, where the input-output training sets are known, i.e. the predetermined electrode contours and the stresses along those contours obtained from electric field computations carried out for such contours. For such a problem, a neural network with supervised learning is needed.
3.5 MULTILAYER FEED-FORWARD NEURAL NETWORKS
Pattern classification problems that are not linearly separable can be solved with MFNNs possessing one or more hidden layers in which the neurons have nonlinear characteristics.
A single neuron can solve simple pattern classification problems, i.e. transformations of sets or functions from the input space to the output space. A two-layer network consisting of two inputs and N outputs can produce N distinct lines in the pattern space, provided the regions formed by the problem are linearly separable.
But in many problems, especially as the dimensionality of the input space grows, the classes are not linearly separable and cannot be handled by a two-layer neural network. This leads to MFNNs.
In MFNNs, neurons are connected in a layered structure, and neurons in a given layer receive inputs from the neurons in the layer immediately below and send their outputs to the neurons in the layer immediately above. Their outputs are a function of only the current inputs and are independent of past inputs and/or outputs.
Although, theoretically, an infinite number of layers may be required to define an arbitrary decision boundary, a three-layer FNN can generate arbitrarily complex decision regions. For this reason, three-layer FNNs are often referred to as universal approximators. The additional layer between the input and the output layers is known as the hidden layer, and the number of units in the hidden layers depends on the nature of the problem.
The term feed-forward implies that all the information in the network flows in the forward
direction, and during normal processing there is no feedback from the outputs to the inputs.
Backpropagation learning algorithm
Fig.3.3 shows a schematic diagram of multilayer feedforward network. Processing elements in
neural networks are commonly known as neurons. The neurons in the network are divided into
three layers: the input layer, the output layer and the hidden layers. It is important to note that
in feedforward networks signals can only propagate from the input layer to the output layer via
one or more hidden layers. It should also be noted that only the nodes in the hidden layers and the output layer, which perform an activation function, are called ordinary neurons. Since the nodes in the input layer simply pass on the signals from the external source to the hidden layer, they are often not regarded as ordinary neurons.

Fig. 3.3b Schematic Multi-layer Structure

The neural network can identify input pattern vectors once the connection weights are adjusted
by means of the learning process. The back-propagation learning algorithm, which is a generalization of the Widrow-Hoff error correction rule, is the most popular method for training ANNs. This learning algorithm is presented below in detail.
Let the net input to a neuron in the input layer be net_i. Then for each neuron in the input layer, the neuron output is given by

O_i = net_i    (3.9)
The net input to a neuron in hidden layer j is

net_j = Σ_{i=1}^{N_i} W_ji O_i    (3.10)

where N_i is the number of neurons in the input layer.
The output of neuron j is

O_j = f(net_j, θ_j)    (3.11)

where f is the activation function.
For a sigmoidal activation function,

O_j = 1 / (1 + e^{-(net_j + θ_j)})    (3.12)

In eqn (3.12), the parameter θ_j serves as a threshold or bias. The effect of a positive θ_j is to shift the activation function to the left along the horizontal axis. These effects are illustrated in Fig. 3.4.

Fig. 3.4 Sigmoidal Activation Function
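As a small numerical sketch of this shift (an added illustration, not from the original; the θ values are arbitrary):

```python
import numpy as np

def sigmoid(net, theta):
    """Eqn (3.12): O_j = 1 / (1 + exp(-(net_j + theta_j)))."""
    return 1.0 / (1.0 + np.exp(-(net + theta)))

for theta in (0.0, 2.0):                   # a positive theta shifts the curve left
    print(theta, sigmoid(0.0, theta))      # at net = 0: 0.5 vs ~0.88
```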
Similarly, for a neuron K in the output layer, the input is given by

net_K = Σ_{j=1}^{N_j} W_Kj O_j    (3.13)

where N_j is the number of neurons in the hidden layer.
The corresponding output is given by

O_K = f(net_K, θ_K)    (3.14)
In the learning phase, or training, of such a network, a pattern is presented as input, and the set of weights in all the connecting links and also all the thresholds in the neurons are adjusted in such a way that the desired outputs t_pK are obtained at the output neurons. Once this adjustment has been accomplished by the network, another pair of input-output patterns is presented and the network is required to learn that association also. In fact, the network is required to find a single set of weights and thresholds that will satisfy all the input-output pairs presented to it.
In general, the outputs O_pK will not be the same as the target values t_pK. For each pattern, the sum of squared errors is

E_p = (1/2) Σ_{K=1}^{N_K} (t_pK - O_pK)^2    (3.15)
where N_K is the number of neurons in the output layer. In the generalized delta rule formulated by Rumelhart et al. for learning the weights and thresholds, the procedure for learning the correct set of weights is to vary the weights in a manner calculated to reduce the error E_p as rapidly as possible. In other words, a gradient search in weight space is carried out on the basis of E_p.
Omitting the subscript p for convenience, eqn (3.15) is written as

E = (1/2) Σ_{K=1}^{N_K} (t_K - O_K)^2    (3.16)
Convergence towards improved values for the weights and thresholds is achieved by taking incremental changes ΔW_Kj proportional to -∂E/∂W_Kj, that is

ΔW_Kj = -η ∂E/∂W_Kj    (3.17)

where η is the learning rate.


Eqn (3.17) can be written as

ΔW_Kj = -η (∂E/∂net_K)(∂net_K/∂W_Kj)    (3.18)

Now,

∂net_K/∂W_Kj = (∂/∂W_Kj) Σ_j W_Kj O_j = O_j    (3.19)
Let

δ_K = -∂E/∂net_K    (3.20)

Therefore, eqn (3.18) becomes

ΔW_Kj = η δ_K O_j    (3.21)
Again,

δ_K = -∂E/∂net_K = -(∂E/∂O_K)(∂O_K/∂net_K)    (3.22)
The two factors on the R.H.S. of eqn (3.22) are obtained as follows:

∂E/∂O_K = (∂/∂O_K)[ (1/2) Σ_K (t_K - O_K)^2 ] = -(t_K - O_K)    (3.23)
Again,

∂O_K/∂net_K = (∂/∂net_K) f(net_K, θ_K) = (∂/∂net_K)[ 1 / (1 + e^{-(net_K + θ_K)}) ] = O_K (1 - O_K)    (3.24)
Therefore, for any output-layer neuron K, δ_K is obtained from eqns (3.22), (3.23) and (3.24) as follows:

δ_K = (t_K - O_K) O_K (1 - O_K)    (3.25)
For the next lower layer, where the weights do not affect the output nodes directly, it can be written that

ΔW_ji = -η ∂E/∂W_ji    (3.26)

ΔW_ji = -η (∂E/∂net_j)(∂net_j/∂W_ji) = -η (∂E/∂net_j) O_i = η δ_j O_i    (3.27)
where

δ_j = -∂E/∂net_j    (3.28)

  = -(∂E/∂O_j)(∂O_j/∂net_j)    (3.29)
Now, as in the case of eqn (3.24),

∂O_j/∂net_j = O_j (1 - O_j)    (3.30)
However, the factor ∂E/∂O_j cannot be evaluated directly. Instead, it is written in terms of quantities which are known and other quantities that can be evaluated. Hence,

-∂E/∂O_j = -Σ_{K=1}^{N_K} (∂E/∂net_K)(∂net_K/∂O_j)
         = -Σ_{K=1}^{N_K} (∂E/∂net_K) (∂/∂O_j)( Σ_j W_Kj O_j )
         = Σ_{K=1}^{N_K} (-∂E/∂net_K) W_Kj
         = Σ_{K=1}^{N_K} δ_K W_Kj    (3.31)
Therefore, from eqns (3.29), (3.30) and (3.31),

δ_j = O_j (1 - O_j) Σ_{K=1}^{N_K} δ_K W_Kj    (3.32)
Thus the deltas at a hidden-layer neuron can be evaluated in terms of the deltas at an upper layer. Hence, starting at the highest layer, i.e. the output layer, the deltas are evaluated using eqn (3.25), and then the errors are propagated backward to the lower layers using eqn (3.32).
Summarizing, and using the subscript p to denote the pattern number,

Δ_p W_Kj = η δ_pK O_pj    (3.33)

and

δ_pK = (t_pK - O_pK) O_pK (1 - O_pK)    (3.34)
for the output-layer neurons, and

Δ_p W_ji = η δ_pj O_pi    (3.35)

and

δ_pj = O_pj (1 - O_pj) Σ_{K=1}^{N_K} δ_pK W_Kj    (3.36)

for the hidden-layer neurons.
It is important to note here that the threshold of each neuron is trained in the same way as the
other weights. The threshold of a neuron is regarded as a modifiable connection weight
between that neuron and a fictitious neuron in the previous layer, which always has an output
value of unity.
The learning procedure therefore consists of the network starting off with a random set of weight values, choosing one of the training-set patterns, using it as the input pattern, and evaluating the outputs in a feedforward manner. The errors at the outputs will generally be quite large, which necessitates changes in the weights. Using the back-propagation procedure, the network calculates Δ_p W_ji for all the W_ji in the network for that particular pattern, and the corrections to the weights are made. This procedure is repeated for all the patterns in the training set to complete the first iteration, and then all the patterns are presented once again in the second iteration. A new set of outputs is obtained and new weights are again evaluated. In a successful learning exercise, the system error will decrease with the number of iterations, and the procedure will converge to a stable set of weights, which will exhibit only small fluctuations in value as further learning is attempted.
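The whole procedure can be condensed into the following illustrative Python sketch (not from the original text; the XOR training set, layer sizes, learning rate, iteration count and random seed are all assumed choices):

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, an arbitrary choice

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# XOR is used here as an assumed toy training set of input-output pairs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

Ni, Nj, Nk = 2, 4, 1                    # input, hidden, output layer sizes (assumed)
eta = 0.5                               # learning rate (assumed)

# Thresholds are trained as extra weights to a fictitious unit of constant
# output 1, as described below; they occupy the last column of each matrix.
Wji = rng.uniform(-0.5, 0.5, (Nj, Ni + 1))
Wkj = rng.uniform(-0.5, 0.5, (Nk, Nj + 1))

for iteration in range(5000):
    for x, t in zip(X, T):
        # forward pass, eqns (3.10)-(3.14)
        oi = np.append(x, 1.0)                   # input-layer outputs + bias unit
        oj = np.append(sigmoid(Wji @ oi), 1.0)   # hidden-layer outputs + bias unit
        ok = sigmoid(Wkj @ oj)                   # output-layer outputs
        # output-layer deltas, eqn (3.34)
        dk = (t - ok) * ok * (1.0 - ok)
        # hidden-layer deltas, eqn (3.36); bias column excluded from the sum
        dj = oj[:-1] * (1.0 - oj[:-1]) * (Wkj[:, :-1].T @ dk)
        # weight (and threshold) updates, eqns (3.33) and (3.35)
        Wkj += eta * np.outer(dk, oj)
        Wji += eta * np.outer(dj, oi)

for x in X:                             # the trained network approximates XOR
    oj = np.append(sigmoid(Wji @ np.append(x, 1.0)), 1.0)
    print(x, sigmoid(Wkj @ oj))
```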
While implementing such a network, it is very important to choose a proper value of η. A large η corresponds to rapid learning but might also result in oscillations. Rumelhart et al. suggested that eqns (3.33) and (3.35) might be modified to include a sort of momentum term, that is

ΔW_ji(n+1) = η δ_pj O_pi + α ΔW_ji(n)    (3.37)

where (n+1) is used to indicate the (n+1)th step and α is a proportionality constant called the momentum constant. The second term in eqn (3.37) specifies that the change in W_ji at the (n+1)th step should be somewhat similar to the change undertaken at the nth step. In this way some inertia is built in, and momentum in the rate of change is conserved to some degree.
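In code, the momentum term of eqn (3.37) only requires storing the previous weight change (a sketch; the function name and the values of η and α are assumptions):

```python
import numpy as np

def momentum_update(W, grad_term, delta_prev, eta=0.5, alpha=0.9):
    """Eqn (3.37): delta_W(n+1) = eta * delta * O + alpha * delta_W(n).
    grad_term stands for the outer product of the deltas and the layer
    outputs; eta and alpha are assumed values."""
    delta_W = eta * grad_term + alpha * delta_prev
    return W + delta_W, delta_W          # new weights and the stored change

# toy usage with arbitrary numbers
W = np.zeros((1, 3))
change = np.zeros_like(W)
grad_term = np.array([[0.10, -0.20, 0.05]])
for n in range(3):
    W, change = momentum_update(W, grad_term, change)
    print(n, W)                          # steps grow while the gradient persists
```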
It is also important to note that the network must not be allowed to start off with a set of equal weights. It has been shown that it is not possible to proceed from such a weight configuration to one of unequal weights, even if the latter corresponds to a smaller system error.
3.6 NORMALISATION OF INPUT-OUTPUT DATA
Scaling of the input-output data has a significant influence on the convergence property and also on the accuracy of the learning process. It is obvious from the sigmoidal activation function given in eqn (3.12) that the range of the output of the network must be within (0,1). Moreover, the input variables should be kept small in order to avoid the saturation effect caused by the sigmoidal function. Thus, the input-output data must be normalized before the training of the neural network is initiated. Two schemes have been tried for scaling the input-output variables, as detailed below.
Scheme 1 of Normalization
In this scheme, the maximum values of the input and output vector components are determined as follows:

net_{i,max} = max_p ( net_i(p) ),   p = 1, ..., NP;  i = 1, ..., N_i    (3.38)

where NP is the number of patterns in the training set,
and

O_{K,max} = max_p ( O_K(p) ),   p = 1, ..., NP;  K = 1, ..., N_K    (3.39)
Normalized by these maximum values, the input and output variables are given as follows:

net_{i,nor}(p) = net_i(p) / net_{i,max},   p = 1, ..., NP;  i = 1, ..., N_i    (3.40)

and

O_{K,nor}(p) = O_K(p) / O_{K,max},   p = 1, ..., NP;  K = 1, ..., N_K    (3.41)
After the normalization the input and output variable range is within (0,1) in this scheme.
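A minimal sketch of Scheme 1 (an added example; it assumes positive-valued data, so that division by the maxima yields values within (0,1]):

```python
import numpy as np

def normalize_scheme1(net, out):
    """Divide every component by its maximum over all NP patterns,
    eqns (3.38)-(3.41). Rows are patterns p, columns are components i or K."""
    return net / net.max(axis=0), out / out.max(axis=0)

# toy data: NP = 4 patterns, 2 input components, 1 output component
net = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0], [4.0, 80.0]])
out = np.array([[5.0], [10.0], [15.0], [20.0]])
print(normalize_scheme1(net, out))       # all values now lie within (0, 1]
```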
Scheme 2 of Normalization
In this scheme of normalization, the output variables are normalized by using eqns (3.39) and (3.41) to get a variable range within (0,1). But the input variables are normalized as follows:

net_{i,nor}(p) = ( net_i(p) - net_{i,av} ) / σ_i,   p = 1, ..., NP;  i = 1, ..., N_i    (3.42)
where net_{i,av} and σ_i are the average value and the standard deviation of the ith component of the input vector, respectively.
In this scheme, after the normalization the input variable range is (-K_1, K_2), where K_1 and K_2 are real positive numbers. These input variables can then easily be made to fall in the range (-1, 1) by dividing them by the greater of the two numbers K_1 and K_2.
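A corresponding sketch of Scheme 2 (added for illustration; σ_i is computed here as the population standard deviation, an assumed choice):

```python
import numpy as np

def normalize_scheme2(net):
    """Standardize each input component per eqn (3.42), then rescale into
    (-1, 1) by dividing by the larger of the two range bounds K1, K2."""
    z = (net - net.mean(axis=0)) / net.std(axis=0)   # eqn (3.42)
    return z / np.abs(z).max(axis=0)                 # divide by max(K1, K2)

net = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0], [4.0, 80.0]])
print(normalize_scheme2(net))            # every column now lies within [-1, 1]
```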
3.7 FASTER TRAINING
Fudge the Derivative Term
The first major improvement to back-propagation is extremely simple: you can fudge the derivative term in the output layer. If you are using the usual back-propagation activation function

1 / (1 + exp(-D·x))

the derivative is

s (1 - s)

where s is the activation value of the output unit, and most often D = 1. The derivative is largest at s = 1/2, and it is here that you will get the largest weight changes. Unfortunately, as you near the values 0 or 1, the derivative term gets close to 0 and the weight changes become very small. In fact, if the network's response is 1 and the target is 0, that is, the network is off by quite a lot, you end up with very small weight changes. It can take a VERY long time for the training process to correct this. More than likely you will get tired of waiting. Fahlman's solution was to add 0.1 to the derivative term, making it

0.1 + s (1 - s)
The solution of Chen and Mars was to drop the derivative term altogether; in effect, the derivative was 1. This method passes back much larger error quotas to the lower layer, so large that a smaller η must be used there. In their experiments on the 10-5-10 encoder problem, they found the best results came when that η was 0.1 times the upper-level η, hence they called their method the "differential step size" method. One tenth is not always the best value, so you must experiment with both the upper- and lower-level etas to get the best results. Besides that, the η you use for the upper layer must also be much smaller than the η you would use without this method.
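The three variants can be compared with a few lines of Python (an added sketch; the function name and the probe value s = 0.99 are arbitrary):

```python
def output_derivative(s, variant="standard"):
    """Derivative term used for the output layer; s is the unit's activation."""
    if variant == "standard":
        return s * (1.0 - s)             # vanishes as s approaches 0 or 1
    if variant == "fahlman":
        return 0.1 + s * (1.0 - s)       # Fahlman: never smaller than 0.1
    if variant == "chen_mars":
        return 1.0                       # Chen and Mars: derivative dropped
    raise ValueError(variant)

# a badly wrong unit (response ~1, target 0) still gets a usable error signal
for v in ("standard", "fahlman", "chen_mars"):
    print(v, output_derivative(0.99, v))  # 0.0099 vs 0.1099 vs 1.0
```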
3.8 ADAPTIVE LEARNING ALGORITHM
To make the learning process converge more rapidly than with the conventional method, in which both the learning rate and the momentum are kept constant during learning, an adaptive learning algorithm has been developed to adapt both the momentum and the learning rate during the learning process.
The proposed adaptation rule for the learning rate is as follows:

η_i(n) = { η_i(n-1) · exp(-k/n),   if R_e(n) < R_e(n-1)
         { η_i(n-1),               if R_e(n) ≥ R_e(n-1)    (3.43)
35
where ) (n
i
is the learning rate at iteration n in between the input layer and the next hidden
layer,
e
R
the root mean square error in training.
] ) (
.
1
[
1
2
1



NP
p
N
k
pk pk
k
e
k
O t
N NP
R
(3.44)
where, k is a constant.
Here the basic idea is to decrease η when R_e(n) < R_e(n-1) and to keep η constant when R_e(n) ≥ R_e(n-1). Note that ΔE(n) = R_e(n) - R_e(n-1) is negative when the error is decreasing, which implies that the connection weights are being updated in the correct direction. It is reasonable to maintain this update direction in the next iteration; in this case we achieve this by decreasing the learning rate in the next iteration. On the other hand, if the connection weights have moved in the opposite direction, causing the error to increase, we should try to ignore this direction in the next iteration by keeping the value of η the same as its value in the previous iteration. The value of the constant k should be selected judiciously to give the best result, and the optimum value of k is problem dependent.
Similar to the learning rate, the proposed adaptation rule for the momentum constant is as follows:

α_i(n) = { α_i(n-1) · [1 - (R/100)],   if R_e(n) < R_e(n-1)
         { α_i(n-1) · [1 + (R/100)],   if R_e(n) ≥ R_e(n-1)    (3.45)
where R is the percentage rate of change of η_i between two successive iterations, and α_i(n) is the momentum at iteration n between the input layer and the next hidden layer.
The learning rates and the momentum constants for the other layers are updated at the nth iteration during the learning process in the same way as those for η_i and α_i, respectively.
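A sketch of the two adaptation rules as reconstructed in eqns (3.43) and (3.45) (added for illustration; the value of the constant k and the treatment of R as a fixed percentage are assumptions):

```python
import math

def adapt_eta_alpha(eta, alpha, Re_now, Re_prev, n, k=0.01, R=5.0):
    """Adaptation of the learning rate (eqn 3.43) and momentum (eqn 3.45).
    k is the problem-dependent constant; R, the percentage rate of change
    of eta between two successive iterations, is fixed here for simplicity."""
    if Re_now < Re_prev:                   # error fell: weights moved correctly
        eta = eta * math.exp(-k / n)       # decrease eta, eqn (3.43)
        alpha = alpha * (1.0 - R / 100.0)  # decrease momentum, eqn (3.45)
    else:                                  # error rose: hold eta constant
        alpha = alpha * (1.0 + R / 100.0)  # increase momentum, eqn (3.45)
    return eta, alpha

print(adapt_eta_alpha(0.5, 0.9, Re_now=0.10, Re_prev=0.12, n=3))
```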


3.9 RESILIENT PROPAGATION ALGORITHM
The basic principle of RPROP is to eliminate the harmful influence of the size of the partial derivative on the weight step. As a consequence, only the sign of the derivative is considered, to indicate the direction of the weight update. The size of the weight change is exclusively determined by a weight-specific, so-called update-value Δ_ij(t):

ΔW_ij(t) = { -Δ_ij(t),   if ∂E(t)/∂W_ij > 0
           { +Δ_ij(t),   if ∂E(t)/∂W_ij < 0
           {  0,         otherwise    (3.46)
where ∂E(t)/∂W_ij denotes the partial derivative of the error with respect to the corresponding weight. The second step of Rprop learning is to determine the new update-values:

Δ_ij(t) = { η⁺ · Δ_ij(t-1),   if (∂E(t-1)/∂W_ij) · (∂E(t)/∂W_ij) > 0
          { η⁻ · Δ_ij(t-1),   if (∂E(t-1)/∂W_ij) · (∂E(t)/∂W_ij) < 0
          { Δ_ij(t-1),        otherwise    (3.47)
where 0 < η⁻ < 1 < η⁺. Thus, every time the partial derivative of the error with respect to the corresponding weight W_ij changes its sign, which indicates that the last update was too big and the algorithm has jumped over a local minimum, the update-value Δ_ij(t) is decreased by the factor η⁻. If the derivative retains its sign, the update-value is slightly increased in order to accelerate convergence in shallow regions. Additionally, in case of a change in sign, there should be no adaptation in the succeeding learning step. In practice this can be achieved by setting ∂E(t)/∂W_ij = 0 in the adaptation rule. Finally, the weight update and the adaptation are performed after the gradient information of the whole pattern set is computed.
The Rprop algorithm requires setting the following parameters: (i) the increase factor is set to η⁺ = 1.2; (ii) the decrease factor is set to η⁻ = 0.5; (iii) the initial update-value is set to Δ_0 = 0.1; (iv) the maximum weight step, which is used in order to prevent the weights from becoming too large, is Δ_max = 50.