
PR Lecture 5-6
Feedforward Neural Networks: The Multilayer Perceptron
Radial Basis Function NNs
Dr. Robi Polikar
Chapter 11 in Alpaydin
PR This Week in CI&PR
 Feedforward Neural Networks: The Multilayer Perceptron (MLP)
 Brief history of artificial neural networks (adapted from Gutierrez)
• Physiological origins
• The birth of multilayered neural networks
 Architecture and notation
 The backpropagation learning algorithm
 Improving the backpropagation
 Choosing the activation function
 Input normalization
 Choosing the target values
 The $1,000,000 question: How many hidden units?
 Initializing weights
 Choosing the learning rates: adaptive learning rate
 Help! I got stuck at the local minimum: The momentum term
 Stopping criteria
 Regularization
 A review of second-order methods: Newton’s method, Levenberg-Marquardt, quick-prop, conjugate gradient, etc.
 Radial Basis Function Networks

References: A: E. Alpaydin, Introduction to Machine Learning; D: Duda, Hart & Stork, Pattern Classification; G: R. Gutierrez-Osuna, Lecture Notes; M: Matlab Documentation, MathWorks; RP: Original graphic created / generated by Robi Polikar. All Rights Reserved © 2001 – 2013. May be used with permission and citation.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Neural Networks
Neural Network, in computer science, highly interconnected network of information-
processing elements that mimics the connectivity and functioning of the human brain. Neural
networks address problems that are often difficult for traditional computers to solve, such as
speech and pattern recognition. They also provide some insight into the way the human brain
works. One of the most significant strengths of neural networks is their ability to learn from a
limited set of examples.
© Encarta, 1993-2007 Microsoft Corporation. All rights reserved.

In computer science and related fields, artificial neural networks are models inspired by
animal central nervous systems (in particular the brain) that are capable of machine
learning and pattern recognition. They are usually presented as systems of interconnected
"neurons" that can compute values from inputs by feeding information through the network.
For example, in a neural network for handwriting recognition, a set of input neurons may be
activated by the pixels of an input image representing a letter or digit. The activations of these
neurons are then passed on, weighted and transformed by some function determined by the
network's designer, to other neurons, etc., until finally an output neuron is activated that
determines which character was read.
Like other machine learning methods, neural networks have been used to solve a wide variety
of tasks that are hard to solve using ordinary rule-based programming, including computer
vision and speech recognition.
http://en.wikipedia.org/wiki/Artificial_neural_network
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Neural Networks
Physiological Origins

[Figure (RP): a feedforward network with d input nodes (x1, …, xd), H hidden-layer nodes (yj), and c output nodes (z1, …, zc); Wij denotes the weights between the input and hidden layers and Wjk the weights between the hidden and output layers (i = 1, 2, …, d; j = 1, 2, …, H; k = 1, 2, …, c).]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Mathematical Origins
1940s
 McCulloch & Pitts, 1943
 Warren McCulloch (a psychiatrist and neuroanatomist) and
Walter Pitts (mathematician) devised the first computational
model of neurons
• A McCulloch-Pitts neuron fires if the sum of its excitatory inputs exceeds a threshold, provided that the neuron does not receive an inhibitory input. They showed that such a network of neurons can construct any logical function.

 D. O. Hebb, 1949 (1904 – 1985) – Father of Cognitive Psychology
 Devised “Hebbian Learning”, where the weight of a connection across a synapse is increased in proportion to the amount of activation that synapse experiences, i.e., a neural pathway is strengthened every time it is used.
• “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.” The Organization of Behavior, Wiley, 1949.

$w_{ij}^{new} = w_{ij}^{old} + \eta\, p_i\, a_j$
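A minimal Matlab sketch of this rule (the learning rate and activation values below are illustrative, not from the slide):

% Hebbian update: strengthen w_ij in proportion to the product of
% presynaptic activity p_i and postsynaptic activity a_j.
eta = 0.1;                        % learning rate (illustrative value)
p   = [0.2; 0.9; 0.5];            % presynaptic activations p_i
a   = [1.0; 0.3];                 % postsynaptic activations a_j
W   = zeros(numel(p), numel(a));  % weights w_ij, initially zero
W   = W + eta * (p * a');         % w_ij_new = w_ij_old + eta * p_i * a_j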
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Origins
1950s
 Frank Rosenblatt, 1958 (1928 – 1971)
 Founder of the perceptron model – a single neuron with adjustable weights and a threshold activation.
• Proved that if two classes are linearly separable, the learning algorithm for the perceptron (the perceptron rule) will converge.
[Figure: a single perceptron J with inputs x_1, …, x_d and weights w_Ji, computing $f(net_J) = f\left(\sum_{i=1}^{d} w_{Ji}\, x_i\right)$.]
 Widrow (1929 - ) & Hoff (1937 - )

 Bernard Widrow and Ted Hoff are the fathers of the famous LMS algorithm, which is used not only to adapt the weights of the perceptron model, but also the weights of an adaptive filter. Hoff is also credited with the invention of the microprocessor / CPU (Intel 4004).
$w_{ij}^{new} = w_{ij}^{old} + \eta\,(d_j - x_i\, w_{ij})\, x_i$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR What do they
look like today?

[Photos (RP, all rights reserved): Ted Hoff, and Bernard Widrow with Nik Kasabov, at IJCNN 2009, Atlanta, GA © 2009.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Origins
1960s and 70s

 Marvin Minsky and Seymour Papert, 1969


 Both pioneers of AI, Minsky and Papert, in their monograph “Perceptrons”, proved the mathematical limitations of the perceptron: it does not work if the data are not linearly separable. In fact, they showed that the perceptron, a single-layer network, cannot even solve the simple XOR problem. They further (incorrectly, but effectively) argued that similar limitations would also apply to multilayered networks.
• This was the kiss of death for ANNs. ANN research and funding came to a screeching halt with this monograph.

The perceptron has shown itself worthy of study despite (and even because of !) its severe limitations. It has
many features to attract attention: its linearity, its intriguing learning theorem; its clear paradigmatic
simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over
to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or
reject) our intuitive judgment that the extension to multilayer systems is sterile.
Minsky and Papert, 1969
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The origins
1980s
 S. Grossberg and G. Carpenter, 1980
 Grossberg and Carpenter (a couple), two of the very few still studying ANNs at the time, described a new unsupervised algorithm called Adaptive Resonance Theory (ART), which was later expanded into a supervised algorithm, ARTMAP. Several variations of ARTMAP have been developed since then.
 John Hopfield, 1982
 He developed the Hopfield network, the first example of recurrent networks, used as an associative memory.
 Teuvo Kohonen, 1982
 Credited with the development of the unsupervised training algorithm known as self-organizing maps (SOMs)
 A. Barto, R.S. Sutton and C. Anderson, 1983
 Popularized reinforcement learning
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Feedforward
Networks
 D. Rumelhart, Hinton and Williams, 1986
 Developed a simple, yet elegant algorithm for training multilayer
networks, learning nonlinearly separable decision boundaries, through
back propagation of errors, a generalization of the LMS algorithm.
 Paul Werbos, 1974
 Widely recognized as the original inventor of the backpropagation
algorithm in his Ph.D. thesis (Harvard) in 1974. Currently at NSF
 Dave Broomhead and David Lowe, 1988
John Moody and Christian Darken, 1989
 Radial Basis Function (RBF)
Neural networks

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Kernel Methods
Support Vector Machines
 Vladimir Vapnik and Alexey Chervonenkis, 1962
 VC dimension of classifiers
 Statistical learning theory
 Linear SVMs (coming soon)

 Isabelle Guyon, Bernhard Boser and V. Vapnik, 1992


 Nonlinear SVMs (coming soon)
 Kernel trick (coming soon)

 B. Schölkopf, A. Smola, C. Burges, N. Cristianini, 1995 – today


 Recent developments
 Kernel regression, Kernel PCA
 Semi-supervised learning

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Bagging, Boosting,
Ensemble Systems
 Leo Breiman (1928-2005), 1994
 Bagging – one of the original ensemble of
classifiers algorithms based on random sampling
with replacement.
 Later, random forests, a clever name for an ensemble of decision trees
 Robert Schapire and Yoav Freund, 1995
 Hedge, boosting, AdaBoost (coming soon!)
 Arguably one of the most influential algorithms in
recent history of machine learning.
 Tin Kam Ho, 1998
 Random subspace methods
 Ludmila Kuncheva, 2004
 Classifier fusion
 Gavin Brown, 2004
 Feature selection, diversity in ensemble systems
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other notable People
 Michael I. Jordan, Zoubin Ghahramani, Daphne Koller
 Bayesian networks, mixture of experts
 Expectation – maximization (EM) algorithm
 Graphical methods
 David Wolpert
 No free lunch theorem, 1997
 Stacked generalization, 1992
 Nitesh Chawla
 Learning from unbalanced data
 SMOTE (Synthetic Minority Oversampling
TEchnique)
 Geoffrey Hinton
 Deep neural networks / deep learning 2000
 Dropout method, 2012
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
The XOR Problem

Input x    Output
(1,1)      0
(0,1)      1
(1,0)      1
(0,0)      0

[Figure (RP): the four XOR points in the x1-x2 plane; the two classes cannot be separated by a single line.]

No linear function can separate the two classes of the XOR problem!
(no set of w0, w1, w2 will satisfy all of the following constraints)

𝑥1 = 0, 𝑥2 = 0, 𝑦 = 0 → 𝑤0 ≤ 0
𝑥1 = 0, 𝑥2 = 1, 𝑦 = 1 → 𝑤2 + 𝑤0 > 0
𝑥1 = 1, 𝑥2 = 0, 𝑦 = 1 → 𝑤1 + 𝑤0 > 0
𝑥1 = 1, 𝑥2 = 1, 𝑦 = 0 → 𝑤1 + 𝑤2 + 𝑤0 ≤ 0

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
 But a multilayered network can:
 Start with the AND function
$y = \begin{cases} 1, & x_1 + x_2 - 1.5 > 0 \\ 0, & x_1 + x_2 - 1.5 < 0 \end{cases}$
[Figure: a single threshold unit with inputs x1 and x2, unit weights, and bias -1.5 implements AND.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
𝑥1 XOR 𝑥2 = (𝑥1 AND ~𝑥2) OR (~𝑥1 AND 𝑥2)

[Figure: a two-layer network implementing x1 XOR x2: two hidden AND nodes compute z1 = x1 AND ~x2 and z2 = ~x1 AND x2, and an OR output node combines them.]
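The construction above can be checked with a few lines of Matlab; the AND threshold (1.5) follows the previous slide, while the remaining weights and the OR threshold (0.5) are the usual choices and are assumptions here:

% XOR(x1,x2) = (x1 AND ~x2) OR (~x1 AND x2), built from threshold units.
step = @(net) double(net > 0);            % hard threshold activation
AND1 = @(x) step( x(1) - x(2) - 0.5);     % z1 = x1 AND ~x2  (assumed weights/bias)
AND2 = @(x) step(-x(1) + x(2) - 0.5);     % z2 = ~x1 AND x2  (assumed weights/bias)
OR   = @(z1,z2) step(z1 + z2 - 0.5);      % output node: z1 OR z2

X = [0 0; 0 1; 1 0; 1 1];
for i = 1:4
    x = X(i,:);
    y = OR(AND1(x), AND2(x));
    fprintf('x = (%d,%d)  ->  XOR = %d\n', x(1), x(2), y);
end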

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The ANN Training Cycle
Stage 1: Network Training – training examples and their desired outputs are presented to an ANN whose weights are to be determined; the learning algorithm determines the synaptic weights, which constitute the network’s “knowledge”.
Stage 2: Network Testing – new data are presented to the trained network, which produces the predicted outputs.
[Diagram (RP): the two-stage training / testing cycle.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Universal Approximator
 Classification can be thought of as a special case of function approximation:
 For a three-class problem, the classifier maps each d-dimensional input x = [x1 … xd] to a c-dimensional target code:
Class 1: [1 0 0]    Class 2: [0 1 0]    Class 3: [0 0 1]
[Diagram: d-dimensional input x → classifier y = f(x) → 1-of-3 (c-dimensional) output.]

 Hence, the problem is – given a set of input/output example pairs of an unknown


function – to determine the output of this function to any general input.
 An algorithm that is capable of approximating any function – however
difficult, large dimensional or complicated it may be – is known as a
universal approximator. The MLP is a universal approximator.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Multilayer Perceptron
Architecture
• What truly separates an MLP from a regular simple perceptron is the nonlinear threshold function f, also known as the activation function. If a linear thresholding function is used, the MLP can be replaced with a series of simple perceptrons, which can then only solve linearly separable problems.

[Figure (RP): an MLP with d input nodes (x1, …, xd), H hidden-layer nodes, and c output nodes (z1, …, zc); Wji are the input-to-hidden weights and Wkj the hidden-to-output weights (i = 1, 2, …, d; j = 1, 2, …, H; k = 1, 2, …, c).]

$y_j = f(net_j) = f\!\left(\sum_{i=1}^{d} w_{ji}\, x_i\right) \qquad\qquad z_k = f(net_k) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right)$

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Computing Node Values

Hidden node J:  $y_J = f(net_J) = f\!\left(\sum_{i=1}^{d} w_{Ji}\, x_i\right)$

Output node K:  $z_K = f(net_K) = f\!\left(\sum_{j=1}^{H} w_{Kj}\, y_j\right)$

[Figure (RP): each node sums its weighted inputs (net) and passes the sum through the activation function f.]
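A minimal Matlab sketch of these two computations for a whole layer at once (the dimensions, random weights, and logistic activation are illustrative assumptions):

% Forward pass through a d-H-c MLP with logistic sigmoid activations.
d = 4; H = 6; c = 3;
f   = @(net) 1 ./ (1 + exp(-net));   % logistic sigmoid, beta = 1
x   = rand(d, 1);                    % one input pattern
Wji = 0.1 * randn(H, d);             % input-to-hidden weights (row j, column i)
Wkj = 0.1 * randn(c, H);             % hidden-to-output weights (row k, column j)

netj = Wji * x;    y = f(netj);      % hidden-layer outputs y_j = f(sum_i w_ji x_i)
netk = Wkj * y;    z = f(netk);      % output-layer outputs z_k = f(sum_j w_kj y_j)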
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Backpropagation
Learning Rule
 The weights are determined through gradient descent minimization of the criterion function J(w), where t_k are the target (desired) outputs and z_k the actual network outputs:

Training error:  $J(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{c}\left(t_k - z_k\right)^2 = \frac{1}{2}\|\mathbf{t} - \mathbf{z}\|^2, \qquad z_k = f(net_k) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right)$

$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta\mathbf{w}(t) \;\Rightarrow\; \Delta\mathbf{w} = -\eta\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$

We need to express 𝐽(𝐰) in terms of w for both output and hidden layer nodes. Output nodes are easy, since we know the functional dependence of 𝐽 on w, through the chain rule:

$\frac{\partial J(\mathbf{w})}{\partial w_{kj}} = \frac{\partial J(\mathbf{w})}{\partial net_k}\cdot\frac{\partial net_k}{\partial w_{kj}} = \frac{\partial J(\mathbf{w})}{\partial z_k}\cdot\frac{\partial z_k}{\partial net_k}\cdot\frac{\partial net_k}{\partial w_{kj}} = -\left(t_k - z_k\right) f'(net_k)\, y_j = -\delta_k\, y_j$

where $\delta_k = \left(t_k - z_k\right) f'(net_k)$ is the output node sensitivity, so

$\Delta w_{kj} = \eta\,\delta_k\, y_j = \eta\left(t_k - z_k\right) f'(net_k)\, y_j = \eta\left(t_k - z_k\right) f(net_k)\left(1 - f(net_k)\right) y_j$

for the logistic sigmoid, since $f'(x) = f(x)\left(1 - f(x)\right)$.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Backpropagation
Learning Rule
 For the hidden layer, things are a little bit more complicated, since we do not know the desired values of the hidden layer node outputs. However, by the appropriate use of the chain rule, we obtain:

$\frac{\partial J(\mathbf{w})}{\partial w_{ji}} = \frac{\partial J(\mathbf{w})}{\partial y_j}\cdot\frac{\partial y_j}{\partial net_j}\cdot\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial J(\mathbf{w})}{\partial y_j}\cdot f'(net_j)\cdot x_i$

$\frac{\partial J(\mathbf{w})}{\partial y_j} = \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}\left(t_k - z_k\right)^2\right] = -\sum_{k=1}^{c}\left(t_k - z_k\right)\frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c}\left(t_k - z_k\right) f'(net_k)\, w_{kj} = -\sum_{k=1}^{c}\delta_k\, w_{kj}$

since $\frac{\partial z_k}{\partial y_j} = \frac{\partial z_k}{\partial net_k}\cdot\frac{\partial net_k}{\partial y_j} = f'(net_k)\, w_{kj}$.

$\Rightarrow\; \Delta w_{ji} = -\eta\,\frac{\partial J}{\partial w_{ji}} = \eta\left[\sum_{k=1}^{c}\delta_k\, w_{kj}\right] f'(net_j)\, x_i = \eta\,\delta_j\, x_i$

where $\delta_j = \left[\sum_{k=1}^{c}\delta_k\, w_{kj}\right] f'(net_j)$ is the hidden layer node sensitivity, and, as before, $y_j = f(net_j) = f\!\left(\sum_{i=1}^{d} w_{ji}\, x_i\right)$, $z_k = f(net_k) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right)$.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP/BP
 The weight update rule is then:
$\Delta w_{ji} = \eta\,\delta_j\, x_i$  for hidden layer weights
$\Delta w_{kj} = \eta\,\delta_k\, y_j$  for output layer weights

 In each case, the parameter 𝛿 represents the sensitivity of the criterion function (error) with respect to the activation of the hidden / output layer node. The sensitivity of a hidden layer node is a weighted sum of the output sensitivities, scaled by 𝑓’(𝑛𝑒𝑡𝑗), where the weights are the hidden-to-output layer weights; the output sensitivities themselves are the errors at the output level, scaled by 𝑓’(𝑛𝑒𝑡𝑘):

$\delta_j = \left[\sum_{k=1}^{c}\delta_k\, w_{kj}\right] f'(net_j) \qquad\qquad \delta_k = \left(t_k - z_k\right) f'(net_k)$

Credit assignment problem (have you seen this somewhere before?)

 The algorithm takes its name – backpropagation – from the fact that during
training, the errors are propagated back, from the output to the hidden layer!
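A self-contained Matlab sketch of one stochastic (per-pattern) backpropagation step using these update rules; the dimensions, learning rate, and logistic activation are illustrative assumptions:

% One backpropagation step for a d-H-c MLP with logistic sigmoids.
d = 4; H = 6; c = 3;  eta = 0.1;
f = @(net) 1 ./ (1 + exp(-net));           % logistic sigmoid (beta = 1)
x = rand(d,1);  t = [1; 0; 0];             % one pattern and its target
Wji = 0.1*randn(H,d);  Wkj = 0.1*randn(c,H);

y = f(Wji * x);   z = f(Wkj * y);          % forward pass (previous slides)

% Sensitivities: f'(net) = f(net)(1 - f(net)) for the logistic sigmoid
delta_k = (t - z) .* z .* (1 - z);         % output node sensitivities
delta_j = (Wkj' * delta_k) .* y .* (1-y);  % hidden sensitivities: back-propagated errors

Wkj = Wkj + eta * (delta_k * y');          % Delta w_kj = eta * delta_k * y_j
Wji = Wji + eta * (delta_j * x');          % Delta w_ji = eta * delta_j * x_i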
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Credit Assignment Problem
 Hidden nodes themselves do not make errors: they just contribute to the errors of the output nodes. The amount contributed is indicated by the sensitivities:

$\delta_k = \left(t_k - z_k\right) f'(net_k) \qquad\qquad \delta_j = \left[\sum_{k=1}^{c}\delta_k\, w_{kj}\right] f'(net_j)$

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Generalizations of MLP/BP
 We have seen the BP for a specific two layer MLP. However, with some
notational and bookkeeping effort, the BP learning rule can be easily
generalized to the following cases…
 Input units including bias units – just add one more input node with 𝑥0 = 1
 Input units connected directly to the output units (as well as hidden nodes)
 There are more than two layers  Deep neural networks
 There are different nonlinearities for each layer
 Each unit has its own nonlinearity
 Each unit has a different learning rate
 The output is a continuous value (i.e., regression problem)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Protocols
 Three major training protocols:

 Stochastic Learning: Instances are drawn randomly from the training data, and
the weights are updated for each chosen instance.

 Batch Learning: The entire training dataset is shown to the network before the weights are updated. Each such presentation of the entire training dataset to the network is called an epoch. In this case, the error 𝐽𝑝 for each pattern is computed and summed before the weights are updated. This is the recommended mode of training an MLP.

 Online Learning: Instances are drawn consecutively from the training data, and
the weights are updated for each instance. This can be sensitive to the order in
which the data are presented.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP- Batch Learning

Initialize w, learning rate 𝜂, # of hidden layer nodes 𝐻, iteration 𝑡 = 0, criterion 𝜃;


do t  t+1 increment epoch
𝑚 ← 0; Δ𝑤𝑗𝑖 ← 0, Δ𝑤𝑘𝑗 ← 0;
do 𝑚 ← 𝑚 + 1
Select pattern 𝒙𝑚
Δ𝑤𝑗𝑖 ← Δ𝑤𝑗𝑖 + 𝜂𝛿𝑗 𝑥𝑖 ; Δ𝑤𝑘𝑗 ← Δ𝑤𝑘𝑗 + 𝜂𝛿𝑘 𝑦𝑗 ;
until 𝑚 = 𝑛
𝑤𝑗𝑖 ← 𝑤𝑗𝑖 + Δ𝑤𝑗𝑖 ; 𝑤𝑘𝑗 ← 𝑤𝑘𝑗 + Δ𝑤𝑘𝑗
until || 𝛻𝐽(𝐰)|| < 𝜃
return 𝐰
end
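A minimal Matlab sketch of this batch protocol; the toy dataset, logistic activations, and the stopping test on the accumulated update (rather than on ||∇J(w)||) are assumptions for illustration:

% Batch backpropagation: accumulate the updates over all n patterns, then apply.
d = 2; H = 5; c = 1; n = 100;  eta = 0.05;  theta = 1e-4;
X = rand(d, n);  T = double(sum(X,1) > 1);        % toy dataset (assumed)
f = @(net) 1 ./ (1 + exp(-net));
Wji = 0.5*randn(H,d);  Wkj = 0.5*randn(c,H);

for epoch = 1:1000
    dWji = zeros(size(Wji));  dWkj = zeros(size(Wkj));
    for m = 1:n                                   % loop over all patterns
        x = X(:,m);  t = T(:,m);
        y = f(Wji*x);  z = f(Wkj*y);              % forward pass
        delta_k = (t - z) .* z .* (1 - z);        % sensitivities (previous slides)
        delta_j = (Wkj'*delta_k) .* y .* (1 - y);
        dWkj = dWkj + eta*(delta_k*y');           % accumulate, do not apply yet
        dWji = dWji + eta*(delta_j*x');
    end
    Wkj = Wkj + dWkj;  Wji = Wji + dWji;          % one update per epoch
    if norm([dWji(:); dWkj(:)]) < theta, break; end   % stop when the update is small
end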

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Datasets used in MLP/BP
 Typically, three sets of data are used in MLP training and testing, all of
which are accompanied by their corresponding correct class information
 Training data: This is the data on which the gradient descent is
performed. That is, the training is done on this data.
 Validation data: A second dataset, which is not used for training, but is used to determine when the training should stop.
 Test data: The dataset with which we assess the generalization performance of the network.

D
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP Bayes
 With a sufficient number of hidden layer nodes, the MLP can approximate any function, and hence can solve any non-linearly separable classification problem.
 In fact, with a sufficient number of hidden nodes, along with plenty of data, it can be shown that the network outputs represent the posterior probabilities of the classes (Richard & Lippmann, 1991).
 Outputs may, however, be forced to represent probabilities by
 Using an exponential activation function at the output layer: $f(net) = e^{net}$
 Using 0 – 1 target vectors
 Normalizing the outputs according to

$z_k = \dfrac{e^{net_k}}{\sum_{i=1}^{c} e^{net_i}}$
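A minimal sketch of this normalization (a softmax over the output activations; the net values are placeholders):

% Softmax normalization: outputs become positive and sum to one,
% so they can be interpreted as posterior probabilities.
netk = [2.1; -0.3; 0.7];                  % example output-layer activations
zk   = exp(netk) ./ sum(exp(netk));       % z_k = e^{net_k} / sum_i e^{net_i}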

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Improving the Backpropagation
Practical Considerations

 Activation Function
 Input Normalization
 Target Values
 Number of Hidden Units
 Initializing Weights
 Learning Rates
 Momentum
 Stopping Criteria
 Regularization

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Practical Considerations
PR Activation Function
 Desirable properties of an activation function
 Nonlinearity – gives the multilayer networks the power of generating
nonlinear decision boundaries;
 Saturation for classification problems – so that the outputs can be limited
between some minimum and maximum limits (-1 and 1, or 0 and 1). Not
necessary for regression (function approximation problems)
 Continuity and smoothness – so that we can take its derivative
 Monotonicity – so that the activation function itself does not introduce
additional local minima
 Linearity for small values of 𝑛𝑒𝑡, to preserve the properties of linear
discriminant functions
 A function that satisfies all of the above is
….(drum roll)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Sigmoid
Activation Function
Logarithmic (logistic) Sigmoid:  $f(net) = \dfrac{1}{1 + e^{-\beta\cdot net}}$

[Plot: the logistic sigmoid for β = 0.1, 0.25, 0.5, 0.75, 1 and 2; larger β gives a steeper transition. Matlab’s logsig function uses β = 1.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Sigmoid
Activation Function
Tangential Sigmoid:  $f(net) = \tanh\!\left(\tfrac{\beta\cdot net}{2}\right) = \dfrac{2}{1 + e^{-\beta\cdot net}} - 1$

[Plot: the tangential sigmoid for β = 0.1, 0.25, 0.5, 0.75, 1 and 2; the output ranges over (−1, 1). Matlab’s tansig function uses β = 2.]
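Both families can be plotted with a few lines of Matlab; the β values below simply reproduce the curves shown on these two slides:

% Logistic and tangential sigmoids for several slope parameters beta.
net   = linspace(-15, 15, 301);
betas = [0.1 0.25 0.5 0.75 1 2];
figure; hold on;
for b = betas
    plot(net, 1 ./ (1 + exp(-b*net)));       % logistic sigmoid: output in (0, 1); logsig is beta = 1
    plot(net, 2 ./ (1 + exp(-b*net)) - 1);   % tangential sigmoid: output in (-1, 1); tansig is beta = 2
end
xlabel('net'); ylabel('f(net)');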

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Normalization
 For stability reasons, individual features need to be in the same ballpark. If two features vary by orders of magnitude in their values, the network cannot learn effectively.

 If the mean and variance of each feature are made zero and one, respectively, this is called standardization. This makes sense if the features are uncorrelated.
 If the relative amplitudes of the features need to be conserved (necessary when the features are related), then a relative-to-maximum or relative-to-norm normalization should be used.

 For the fish example, x1 = length (mm), x2 = weight (kg), with typical values x1 = 985 mm, x2 = 2 kg: a large variation in input values… No good!
 Use smart normalization: use x1 = 0.985 m and x2 = 2 kg, so that both variables are in the same order of magnitude (i.e., normalize only one variable, when it makes sense to do so).

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Preprocessing &
Normalization in Matlab
 Matlab function mapminmax() scales inputs and targets into [-1 1] range
 [Y,PS] = mapminmax(X, YMIN,YMAX) processes X by normalizing the minimum and maximum
values of each row to [YMIN, YMAX]. The default values for YMIN and YMAX are
-1 and +1, respectively. PS is a struct of processing settings that then allows using the exact same
normalization to some other input Z through Y=mapminmax ('apply',Z, PS).
 X = mapminmax('reverse', Y, PS) returns X, given Y and settings PS.

 [Y,PS] = mapstd(X, ymean,ystd) processes matrix X (similar to above functions) but by


transforming the mean and standard dev. of each row to ymean and ystd (defaults 0 and 1).
 X=mapstd(‘reverse’, Y, PS) returns X in the original units.

 The feedforwardnet function includes normalization (mapminmax) of both the inputs and the targets by default. This means that tansig (or purelin), rather than logsig, should be used as the activation function at the output layer.

 [y,ps] = fixunknowns(x) processes matrices by replacing each row containing unknown values (represented by NaN) with two rows of information: the first row contains the original row, with NaN values replaced by the row’s mean; the second row contains 1 and 0 values, indicating which values in the first row were known or unknown. The function fixunknowns is only recommended for input processing.
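A short usage sketch of the two scaling functions described above (the data matrix is illustrative; each row is one feature, each column one pattern):

X = [1000*rand(1,50); 2*rand(1,50)];     % two features on very different scales

[Xn, PS1] = mapminmax(X, -1, 1);         % scale each row (feature) into [-1, 1]
Zn = mapminmax('apply', 100*rand(2,10), PS1);  % apply the SAME scaling to new data
Xback = mapminmax('reverse', Xn, PS1);   % recover the original units

[Xs, PS2] = mapstd(X);                   % zero mean, unit std per row (standardization)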
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other Processing
Functions
 [Y,PS] = processpca(X, maxfrac) applies PCA to X such that
 each row is uncorrelated, the rows are ordered by the amount they contribute to the total variation, and rows whose contribution to the total variation is less than maxfrac are removed. The parameters are saved as PS, such that y = processpca('apply',Z,PS) applies the same transformation to some other matrix Z. The process can be reversed by Z = processpca('reverse',y,ps).
 Other processing functions include:
 removeconstantrows: removes the rows of a matrix that have constant values
 removerows: removes certain rows indicated by an input argument of indices to be removed.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Target Values
 Typically, two output encoding protocols are used for classification problems

 0 1 targets with log-sigmoid (‘logsig’ in Matlab) activation function


• [0 0 0 1 0 0]  Class 4
• [0 1 0 0 0 0]  Class 2
 -1 + 1 targets with tan-sigmoid (‘tansig’ in Matlab) activation function (Matlab’s default)
• [-1 -1 -1 +1 -1 -1]  Class 4
• [-1 +1 -1 -1 -1 -1]  Class 2

 From a practical point of view, it is recommended that the asymptotic values not be used. That is, use 0.05 instead of 0, and 0.95 instead of 1.
 This is because the slope of the activation function (that is, the gradient, which is proportional to Δw) approaches zero at extreme values of the input. This significantly slows down the training.

 For regression (function approximation) problems, actual values are used along
with the linear activation function (‘purelin’ in Matlab) .
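A small sketch of the softened 0/1 encoding recommended above; the labels, the number of classes, and the loop are illustrative:

% Build soft one-hot targets (0.05 / 0.95 instead of 0 / 1) for c classes.
labels = [4 2 1 3];            % class labels of four training patterns
c = 6;                         % number of classes
T = 0.05 * ones(c, numel(labels));
for m = 1:numel(labels)
    T(labels(m), m) = 0.95;    % e.g., class 4 -> [.05 .05 .05 .95 .05 .05]'
end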

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Number of Hidden Units
 Although the number of input and output layer nodes are fixed (number of features and number
of classes, respectively), the number of hidden layer nodes, H, is a user selectable parameter.
 H defines the expressive power of the network. Typically, larger H results in a network that can
solve more complicated problems. However,
 An excessive number of hidden nodes causes overfitting. This is a phenomenon where the training error can be made arbitrarily small, but the network performance on the test data is poor  Poor generalization performance… No good!
 Too few hidden nodes may not be able to solve more complicated problems … No good!
 There is no formal procedure to determine H. Typically, it is determined by a combination of previous expertise, amount of data available, dimensionality, complexity of the problem, and trial and error.
 A common rule of thumb is to choose H such that the total number of weights remains less than N/10, where N is the total number of training data available.
 Note that H, along with d and c, determines the total number of weights, which represents the degrees of freedom of the algorithm.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Initializing the Weights
 To promote uniform learning where all classes are learned approximately at
the same time, the weights must be initialized carefully.
 A typical rule of thumb is to randomly choose the weights from a uniform
distribution according to the following limits:

$-1/\sqrt{d} < w_{ji} < 1/\sqrt{d} \qquad\qquad -1/\sqrt{H} < w_{kj} < 1/\sqrt{H}$

 Another approach is the Nguyen-Widrow initialization, which generates


initial weights and biases such that the active regions of the neurons are
distributed approximately evenly over the input space.
D. Nguyen and B. Widrow, ``Improving the Learning Speed of 2-Layer Neural Networks by Choosing Initial Values of the
Adaptive Weights,'' Proceedings of the International Joint Conference on Neural Networks (IJCNN), 3:21-26, June 1990.

 Initialization to zero is never a good idea…Why?
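A minimal sketch of the uniform initialization above, assuming the 1/√d and 1/√H limits as reconstructed on this slide:

% Uniform weight initialization to promote uniform learning.
d = 10; H = 8; c = 3;
Wji = (2*rand(H, d) - 1) / sqrt(d);   % input-to-hidden weights in (-1/sqrt(d), 1/sqrt(d))
Wkj = (2*rand(c, H) - 1) / sqrt(H);   % hidden-to-output weights in (-1/sqrt(H), 1/sqrt(H))
% Matlab's feedforward layers use the Nguyen-Widrow rule (initnw) by default.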


Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Learning Rate
 The learning rate, in theory, only affects the convergence time; however, since the global minimum is often not found, a poor choice can also result in system divergence.

 A proper way to select the learning rate involves computing the second derivative of the criterion function with respect to each weight, and taking the inverse of this derivative as the learning rate. This dynamic learning rate, however, is computationally expensive. A good starting point is η = 0.1.

$\eta_{opt} = \left(\dfrac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2}\right)^{-1}$

 MATLAB uses an alternate dynamic learning rate update scheme:
 If 𝑒𝑟𝑟𝑜𝑟𝑛𝑒𝑤 > 𝑘 ⋅ 𝑒𝑟𝑟𝑜𝑟𝑜𝑙𝑑  discard the current weight update, and set 𝜂 = 𝑎1 ⋅ 𝜂
 If 𝑒𝑟𝑟𝑜𝑟𝑛𝑒𝑤 < 𝑒𝑟𝑟𝑜𝑟𝑜𝑙𝑑  keep the current weight update, and set 𝜂 = 𝑎2 ⋅ 𝜂
 Typical values: 𝑘 = 1.04, 𝑎1 = 0.7, 𝑎2 = 1.05
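A sketch of this adaptive scheme; the error values are placeholders, the constants are the typical values above:

% Adaptive learning rate: shrink eta (and discard the step) if the error grows
% by more than 4%; grow eta slightly if the error decreases.
k = 1.04;  a1 = 0.7;  a2 = 1.05;
eta = 0.1;                               % suggested starting point
err_old = 0.50;  err_new = 0.55;         % errors before/after a tentative update (assumed)

if err_new > k * err_old
    eta = a1 * eta;                      % discard the weight update, reduce eta
    % ...restore the previous weights here...
elseif err_new < err_old
    eta = a2 * eta;                      % keep the weight update, increase eta
end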
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Problem of Local minima

[Figure (RP): a one-dimensional error surface J(w) with plateaus and multiple local minima.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Momentum
 If there are small plateaus in the error surface, then backpropagation can take a long time,
or even get stuck in small local minima. In order to prevent this, a momentum term is
added, which incorporates the speed at which the weights are learned. This is loosely
related to the momentum in physics – a moving object keeps moving unless prevented by
outside forces.

 The momentum term simply makes the following change to the weight update rule, where  is the momentum constant:

$w^{t+1} = w^{t} + (1 - \alpha)\,\Delta w^{t} + \alpha\,\Delta w^{t-1}$

 If =0, this is the same as the regular backpropagation, where the weight update is determined purely by gradient descent.
 If =1, the gradient descent is completely ignored, and the update is based purely on the ‘momentum’, i.e., the previous weight update. The weight update continues along the direction in which it was moving previously.
 Typical value for  is generally between 0.5 and 0.95
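A one-step sketch of the momentum-modified update; the weight and gradient values are placeholders:

% Momentum: blend the current gradient step with the previous weight change.
alpha = 0.9;  eta = 0.1;
W       = randn(3, 4);                    % current weights (placeholder)
dW_prev = zeros(3, 4);                    % previous weight change, Delta w_(t-1)
grad    = randn(3, 4);                    % dJ/dW at the current step (placeholder)

dW = (1 - alpha) * (-eta * grad) + alpha * dW_prev;  % blended update
W  = W + dW;                              % w_(t+1) = w_t + (1-a)*Dw_t + a*Dw_(t-1)
dW_prev = dW;                             % remember for the next step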

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Batch Learning with Momentum

Initialize w, learning rate 𝜂, # of hidden layer nodes 𝐻,


iteration 𝑡 = 0, criterion 𝜃, momentum 𝛼;
do 𝑡 ← 𝑡 + 1 increment epoch
𝑚 ← 0; Δ𝑤𝑗𝑖 ← 0, Δ𝑤𝑘𝑗 ← 0;
do 𝑚 ← 𝑚 + 1
Select pattern xm
Δ𝑤𝑗𝑖(𝑡 + 1) ← Δ𝑤𝑗𝑖(𝑡) + 𝜂(1 − 𝛼)𝛿𝑗 𝑥𝑖 + 𝛼 Δ𝑤𝑗𝑖(𝑡 − 1);
Δ𝑤𝑘𝑗(𝑡 + 1) ← Δ𝑤𝑘𝑗(𝑡) + 𝜂(1 − 𝛼)𝛿𝑘 𝑦𝑗 + 𝛼 Δ𝑤𝑘𝑗(𝑡 − 1);
until 𝑚 = 𝑛
𝑤𝑗𝑖 ← 𝑤𝑗𝑖 + Δ𝑤𝑗𝑖 ; 𝑤𝑘𝑗 ← 𝑤𝑘𝑗 + Δ𝑤𝑘𝑗
until || 𝛻𝐽(𝐰)|| < 𝜃
return w
end Backpropagation with Momentum

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Stopping Criterion

 In order to prevent overfitting, the algorithm should be stopped before it


reaches its minimum error goal.
 This is because too small an error goal causes the noise in the data to be learned at the expense of the general properties of the data.
 It is difficult, however, to determine the stopping threshold on the error goal.
 The typical approach is to monitor the performance of the network on a validation dataset, and stop training when the error on the validation data reaches a desired level.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Regularization
 Regularization is the smoothing of the error curve so that the optimum
solution can be found more effectively.
 One such technique is weight decay, which prevents the weights from growing too large:

$w^{t+1} = (1 - \varepsilon)\, w^{t}, \qquad 0 < \varepsilon < 1$

 The weights that do not contribute to reducing the criterion function will eventually shrink to zero. They can then be eliminated altogether. Weights that do contribute to 𝐽 will not decay, however, as they keep being updated.
 This is effectively equivalent to using the following criterion function with no separate weight decay, where the second term is the regularization term:

$J_{ef} = J(\mathbf{w}) + \dfrac{2\varepsilon}{\eta}\,\mathbf{w}^{T}\mathbf{w}$
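A minimal sketch of weight decay applied after each backpropagation update; the decay rate and the weight/update matrices are placeholders:

% Weight decay regularization: shrink every weight slightly after each update.
epsilon = 0.01;            % decay rate, 0 < epsilon < 1
W  = randn(5, 3);          % some weight matrix (placeholder)
dW = 0.1 * randn(5, 3);    % backpropagation update for this step (placeholder)

W = W + dW;                % usual update
W = (1 - epsilon) * W;     % decay: weights that are not reinforced shrink toward zero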

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Feedforward
networks in Matlab
 The process for training / testing / using neural networks in Matlab includes
the following steps:
 Collect data (duh!)
 Create the network (duh!)2
 Configure the network
 Initialize the weights and biases
 Train the network
 Validate and use the network
 The main functions to use an MLP in Matlab are:
 feedforwardnet, which creates the network architecture, 𝑛𝑒𝑡
 configure, which sets up the parameters of the 𝑛𝑒𝑡
 train, which configures and trains the network
 sim, which simulates the network by computing the outputs for a given set of test data
 perform, which computes the performance of the network on test data whose labels are known.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a network
 Matlab uses the network object to store all of the information that defines a NN.
 The network object includes the structure of the network (how many layers, how
many nodes in each layer), as well as many configurable parameters, such as the
weights and biases.
 The fundamental building block is the neuron, represented as follows, where p is
input, w is the weight (vector) and b is the bias associated with an extra input of
fixed value 1. Matlab uses
 a weight function to determine how the weights
are applied (for MLPs this is dot product, 𝐰 𝑇 𝐱, for RBF
it can be a distance function, e.g. 𝐰 − 𝐱 ),
 a net input function , which for MLPs simply adds
the bias to the weighted sum of the inputs; and
 a transfer (activation) function to act as
nonlinear thresholding, which for MLPs is typically
logistic or tangential sigmoid. M

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a network
 An abbreviated version of this model is shown as follows:

S: # of neurons
R: # of inputs

 The transfer functions are typically indicated with diagrams indicating the
nature of the function being used, for example:

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
A single-layer network of S logsig neurons having R inputs is shown below
in full detail on the left and with a layer diagram on the right

M
Image Source: Matlab Neural Network Toolbox, User’s Guide
http://www.mathworks.com/products/neuralnet/
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
A two-layer (single hidden layer) network layer diagram

[Figure (M): layer diagram of a two-layer network, annotated with the weights connecting the input to hidden layer 1 (IW: input weights), the weights connecting hidden layer 1 to layer 2 (LW: layer weights), the bias terms of each layer, the number of inputs, the number of hidden layer nodes, the outputs of layers 1 and 2, and the number of output (layer 2) nodes.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
 Similarly, if you have three layers…

𝐚𝟏 = 𝐟 𝟏 𝐈𝐖 𝟏,𝟏 𝐩 + 𝐛𝟏 𝐚𝟐 = 𝐟 𝟐 𝐋𝐖 𝟐,𝟏 𝐚𝟏 + 𝐛𝟐 𝐚𝟑 = 𝐟 𝟑 𝐋𝐖 𝟑,𝟐 𝐚𝟐 + 𝐛𝟑 = 𝐲

𝐚𝟑 = 𝐟 𝟑 𝐋𝐖 𝟑,𝟐 𝐟 𝟐 𝐋𝐖 𝟐,𝟏 𝐟 𝟏 𝐈𝐖 𝟏,𝟏 𝐩 + 𝐛𝟏 + 𝐛𝟐 + 𝐛𝟑 = 𝐲

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR feedforwardnet
feedforwardnet Feedforward neural network (Neural Networks Toolbox)

net = feedforwardnet(hiddenSizes, trainFcn) creates a network net, where hiddenSizes is a row vector of one or more
hidden layer sizes (default = 10) and trainFcn indicating the training function (default trainlm) to be used with this network.

Specialized versions of the feedforward network include fitting (fitnet) and pattern recognition (patternnet) networks. A
variation on the feedforward network is the cascade forward network (cascadeforwardnet) which has additional
connections from the input to every layer, and from each layer to all following layers.

Examples

Here a feedforward neural network is used to solve a simple problem.

[x,t] = simplefit_dataset;
net = feedforwardnet(10)
net = train(net,x,t);
view(net)
y = net(x);
perf = perform(net,y,t)

Note that feedforwardnet() automatically applies removeconstantrows() and mapminmax() to both the inputs and
outputs (target values).

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net = feedforwardnet(10)
feedforwardnet

[Screenshot (M): the network object displayed by net = feedforwardnet(10). Callouts note: the number of inputs is the # of datasets to be used for training, not the size of the input data; the number of outputs is the # of layers to be used as outputs, not the number of classes; the total # of weights; a property used with Simulink only; the connection properties, which define which layers have biases, generate network inputs / outputs, or are just connected to other layers; and the structures of properties for inputs, layers, outputs, etc.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.inputs
 This property holds structures of properties for each of the network's inputs. It is
always an 𝑁𝑖 x 1 cell array of input structures, where 𝑁𝑖 is the number of network
inputs (net.numinputs, which in our case will always be 1).
 To access these properties, type net.inputs{1} which will return the following
properties (your numbers may be different depending on your network)

Always empty matrix for feedforward networks


Default processing functions for the input layer

This property defines the number of elements in the input data


used to configure the input, or zero if the input is unconfigured.

The processing functions (net.inputs{i}.processFcns) and their associated processing parameters


(net.inputs{i}.processParams) are used to define proper processing settings (net.inputs{i}.processSettings) during network
configuration which either happens the first time train is called, or by calling configure directly, so as to best match the
example data. Then whenever the network is trained or simulated, processing occurs consistent with those settings.

To remove default input processing functions: net.inputs{1}.processFcns={}


Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.layers
 This property holds structures of properties for each of the network's layers. It is always an
𝑁𝑙 x 1 cell array of layer structures, where 𝑁𝑙 is the number of network layers
(net.numLayers). The structure defining the properties of the 𝑖 𝑡ℎ layer is located at
net.layers{i}. To see the properties, type: net.layers{1} and net.layers{2} (or more)

Used by self organizing map


type neural networks

Nguyen-Widrow algorithm for weight/bias initialization


Add bias values to the (weighted) inputs

Used by self organizing maps (SOMs)


 A matrix of min and max values for each node
 # of neurons in this layer (0 if the layer is not yet configured)
Used by self organizing maps (SOMs)
 Activation functions used in this layer

To change the transfer function to logsig, for example, you could execute the command:
net.layers{1}.transferFcn = 'logsig'
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.outputs
 This property holds structures of properties for each of the network's outputs. It is always a
1 x 𝑁𝑙 cell array, where 𝑁𝑙 is the number of network outputs (net.numOutputs).

Used by recursive (feedback) neural networks

Default processing functions for the output layer

Also try:
To remove default input processing
functions: net.outputs{2}.processFcns={}

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
net.biases;
PR net.inputWeights;
net.layerWeights
 These properties hold the structures of properties for each of the network’s biases, input or
layer weights.
 For biases, there is one structure for each layer, which include the following properties:
initFcn, learn (whether the biases should be learned), learnFcn (which learning function to
use to learn bias values; default traingdm) , learnParam (.lr : initial learning rate and .mc:
momentum constant), and size.
 For input and layer weights:

Used by time delay neural networks (TDNNs)

# of weights : 0 indicates that the network is not configured yet


Weighting function: dot product = weighted sum

Number of input and output weights are read from the size of the input and target data, in a subsequent configure() or train() function

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Weights & Biases
 The actual values of the weights and biases can be obtained by calling
net.IW, net.LW and net.b.
 Note that due to their cell structures, you actually need to call them with
the following cell array indices:
net.IW{1}; net.LW{2,1} and net.b{1} or net.b{2}

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Network Functions
 A variety of functions control how the network is trained and how it behaves
Adaptation function, generally used for incremental (on-line, one
instance at a time learning)
Type of derivative / gradient to be used
Type of data division to be used for validation: dividerand is random division (default)
Ratio of data to be used for training, validation and test partitions

Defines how the initialization of weights (net.IW, net.LW) and biases (net.b) are to be done
Objective /cost function; mse: mean square error (default)

Lists which performance metrics should be


plotted during training

Training function, generally used for batch learning. trainlm is default


Sets the parameters of the current training function, such as error goal, max # of epochs, min gradient, learning rate, etc. (See next slide)

To change the data partitioning: net.divideFcn=‘dividetrain’; assigns all data to training


net.divideParam.trainRatio=1; net.divideParam.valRatio=0; net.divideParam.testRatio=0 should
achieve the same outcome.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Parameters
for traingdx
 traingdx is a gradient descent back propagation algorithm with momentum
term and adaptive learning rate, with the following parameters and their
default values

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Parameters
for trainlm
 trainlm is the Levenberg-Marquardt backpropagation. It is one of the fastest BP
algorithms, but also the one that requires the most memory. It is the default
training algorithm for feedforward networks, using the following parameters and
default values.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Network Methods
 For any given network object net, the following methods are available

view(net) returns a graphical diagram of the architecture

 Finally, for any given set of test_data, you can obtain the outputs of the network
net by invoking outputs=net(test_data)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training & Testing The
Network
train Train neural network (Neural Network Toolbox)

[net,tr] = train(net,P,T,Pi,Ai) trains a network net according to net.trainFcn and net.trainParam of the network object net,
where P is the network input (matrix or cell array in column format), T is the network targets (in one-hot format), and Pi
and Ai are the input and layer delays (not used for MLPs). It returns the trained network net and training record tr, which
includes the performances.

sim Simulate neural network (Neural Network Toolbox)

[Y,Pf,Af,E,perf] = sim(net,P,Pi,Ai,T) simulates the network net, using input data P (typically test data, again in column
format), target values T, as well as the delays (for TDNNs) Pi and Ai(not used for MLPs). The function returns the network
outputs Y, network errors E and the network performance perf along with final input and layer delays Pf and Af (not used
for MLPs).

If target values for the test data are not known, and we simply want to obtain the outputs of the network for the input P,
use output=net(P);

If target values are known for the test data, the performance of the network can also be obtained with the perform function, by calling perf = perform(net, T, Y), where Y contains the network outputs.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training MLP In Matlab
net = feedforwardnet(10, 'traingdx');
net.inputs{1}.processFcns = {};  %Remove the default input processing functions of
%min/max normalization, fixing unknowns and removing repeat instances
net.outputs{2}.processFcns = {}; %Remove the default output processing functions
%net.divideFcn='';  %This removes the default data partitioning (normally 60%, 20%, 20%)

net.divideParam.trainRatio = TR_ratio;
net.divideParam.valRatio   = V_ratio;
net.divideParam.testRatio  = T_ratio;

net.trainParam.epochs = 1000; % Maximum number of epochs to train


net.trainParam.goal = 0.01; % Performance goal
net.trainParam.max_fail = 10; % Maximum validation failures
net.trainParam.mc = 0.9; % Momentum constant
net.trainParam.min_grad = 1e-10; % Minimum performance gradient
net.trainParam.show = 50; % Epochs between displays
net.trainParam.showCommandLine = 0; % Generate command-line output
net.trainParam.showWindow = 1; % Show training GUI
net.trainParam.time = inf; % Maximum time to train in seconds

%train and simulate the network


[net,train_record,net_outputs,net_errors] = train(net,tr_data,tr_labels);

[Network_output ignore1 ignore2 Network_error Network_perf]= sim(net,ts_data, [], [], ts_labels);

plotconfusion(ts_labels, Network_output) %Graphical output of the confusion matrix

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Test on OCR Data
[Figure: confusion matrix (plotconfusion output) for the 10-class OCR test data. Per-class precision ranges from 84.1% to 100%, per-class recall from 89.1% to 98.9%, and the overall accuracy is 94.8%.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Some issues to consider
 If the network inputs and outputs are preprocessed with the minmax
function, all input and output values will be normalized to [-1 1] range.
 The logsig function is then not suitable for the output layer. Why? (Its output lies in (0, 1) and can never reach negative target values.)
 In this case, use one of the following
• Logsig at the hidden layer, tansig or purelin at the output layer
• Tansig at both hidden and output layers
• Tansig at the hidden and purelin at the output layer.
 Always check your input arguments. Matlab expects the data to be in the
columns – don’t screw this up!
 You can create your own partitioning using the dividerand() function
 Recall: the network created by Matlab is a “neural network object” – similar
to a struct. It has a mindboggling set of parameters. Type “net” at the
command prompt and investigate its components.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nntool

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nntool

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Train / Validation / test

trainmlp_validation.m

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR If you set TR_error=0

Overfitting

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR trainmlp_validation.m Examples
Spiral data
[Figures (RP): the spiral dataset and the MLP classification of the test data for two configurations: one hidden layer with N = 50 nodes and sigmoid + tansig activations trained with traingdx, and one hidden layer with N = 50 nodes and sigmoid + purelin activations trained with traingdx.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR trainmlp_validation.m Examples
[Figures (RP): randomly generated Gaussian data and the MLP-based classification of the test data; one hidden layer with N = 20 nodes, sigmoid activation, trained with traingdx.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Second Order Methods
 The gradient descent is based on the minimization of the criterion function
using the first order derivative

 Methods that make use of second-order derivatives typically find the solution much faster than first-order methods.
 Newton’s Methods
• Levenberg Marquardt Backpropagation
 Quick prop
 Conjugate Gradient Methods
• Fletcher-Reeves
• Polak-Ribiere

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Newton’s Method

 The weight update rule uses the Hessian matrix, which includes the second
derivatives of the criterion function with respect to the weights:

$\mathbf{H} = \dfrac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} =
\begin{bmatrix}
\dfrac{\partial^2 J}{\partial w_1^2} & \dfrac{\partial^2 J}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 J}{\partial w_1 \partial w_k} \\
\dfrac{\partial^2 J}{\partial w_2 \partial w_1} & \dfrac{\partial^2 J}{\partial w_2^2} & \cdots & \dfrac{\partial^2 J}{\partial w_2 \partial w_k} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 J}{\partial w_k \partial w_1} & \dfrac{\partial^2 J}{\partial w_k \partial w_2} & \cdots & \dfrac{\partial^2 J}{\partial w_k^2}
\end{bmatrix}$

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Newton’s Method
 The weight update rule in Newton’s method is then:

$\Delta\mathbf{w} = -\mathbf{H}^{-1}\dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \qquad\Longrightarrow\qquad \mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{H}^{-1}\dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$

 The problem with this method is that


 Inverse of the Hessian is a computationally very expensive operation
 The method assumes the error surface to be quadratic. In practice, this is generally not
true, and may cause divergence.

 Another group of algorithms use an approximation of the Hessian, and have


been shown to work rather well in practice. These algorithms are called quasi-
Newton algorithms. One of the fastest of such algorithms is the Levenberg-
Marquardt algorithm. The LM algorithm in MATLAB
( trainlm) is the default learning algorithm for feedforwardnet().

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Levenberg – Marquardt
Backpropagation
 The LM method uses the 𝐇 = 𝐉 𝑇 𝐉 approximation for the Hessian, with the
gradient computed as 𝐠 = 𝐉 𝑇 𝐞 where 𝐉 is the Jacobian matrix containing
the first derivatives of the network errors (not to be confused with the
criterion function 𝐽), and e is the vector of network errors. The weight
update rule is then

$\mathbf{w}_{k+1} = \mathbf{w}_k - \left(\mathbf{J}^T \mathbf{J} + \mu\,\mathbf{I}\right)^{-1}\mathbf{J}^T \mathbf{e}$

where 𝜇 is a constant which switches the algorithm back and forth between
a regular gradient descent (when 𝜇 is large) and quasi-Newton’s method
(when 𝜇 is zero). Typically, 𝜇 is decreased as the network gets closer to the
solution, since in that region Newton’s algorithm is most efficient.
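A sketch of a single LM step, assuming the Jacobian J of the error vector e with respect to the weights has already been computed (trainlm handles all of this internally; the matrices below are placeholders):

% One LM step: w <- w - (J'J + mu*I)^(-1) J' e
mu = 0.01;                            % blending constant
J  = randn(20, 7);                    % Jacobian of 20 errors wrt 7 weights (placeholder)
e  = randn(20, 1);                    % current error vector (placeholder)
w  = randn(7, 1);                     % current weight vector (placeholder)

dw = -((J'*J + mu*eye(7)) \ (J'*e));  % solve rather than invert explicitly
w  = w + dw;
% Large mu -> behaves like gradient descent; mu -> 0 -> behaves like (quasi-)Newton.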

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Quick-Prop
 Simple and fast second order method that does not even require a
second order derivative calculation !!!
 Assumptions:
 Weights are independent  descent is optimized separately for each weight!
 Error surface is quadratic
 Weight update rule given by

$\Delta w_{k+1} = \dfrac{\left.\dfrac{dJ}{dw}\right|_{k}}{\left.\dfrac{dJ}{dw}\right|_{k-1} - \left.\dfrac{dJ}{dw}\right|_{k}}\;\Delta w_k$

*The difference of two successive first-order derivatives approximates the second-order derivative! This technique is not available in MATLAB.
[Figure: the assumed quadratic error curve, annotated with the gradients dJ_k/dw and dJ_{k-1}/dw.]
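A sketch of the quick-prop update for a single weight; the gradient and step values are placeholders (and, as noted above, this rule is not available in Matlab's toolbox):

% Quick-prop: approximate the second derivative by the difference of
% two successive first derivatives, one weight at a time.
dJdw_prev = -0.30;      % dJ/dw at step k-1 (placeholder)
dJdw_curr = -0.10;      % dJ/dw at step k   (placeholder)
dw_prev   =  0.05;      % previous weight change, Delta w_k (placeholder)

dw_next = dw_prev * dJdw_curr / (dJdw_prev - dJdw_curr);   % Delta w_(k+1)
w = 0.2 + dw_next;      % apply to the weight (placeholder initial value)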

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent
 Another group of second order algorithms that do not require Hessian
computation are the family of Conjugate Gradient Descent methods.

 Instead of iterating along the direction of the steepest gradient descent


only, these methods use alternate directions, along with the gradient
descent directions for faster convergence.

 Pairs of directions (vectors) that satisfy $\Delta\mathbf{w}_{k-1}^{T}\,\mathbf{H}\,\Delta\mathbf{w}_k = 0$ are called H-conjugate, meaning that the search directions $\Delta\mathbf{w}_{k-1}$ and $\Delta\mathbf{w}_k$ are non-interfering with respect to H. If H is proportional to the identity matrix, then conjugate directions are orthogonal to each other.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent
 Conjugate Gradient Descent:
1. Start with the steepest descent direction: 𝛥𝐰0 = −𝛻𝐽(𝐰0 )
2. At the 𝑘th update, perform a line search to determine the optimal distance to move along the current search direction (equivalent to determining the optimum learning rate). Call this amount 𝛼𝑘.
3. Move along this direction by the amount 𝛼𝑘: $\mathbf{w}_{k+1} = \mathbf{w}_k + \alpha_k\,\Delta\mathbf{w}_k$
4. The next search direction is then conjugate to the previous search direction. Compute the conjugate direction by $\Delta\mathbf{w}_k = -\nabla J(\mathbf{w}_k) + \beta_k\,\Delta\mathbf{w}_{k-1}$
5. The various versions of the conjugate gradient descent algorithm are distinguished by the way the constant 𝛽𝑘 is computed:

$\beta_k = \dfrac{\nabla J(\mathbf{w}_k)^T\,\nabla J(\mathbf{w}_k)}{\nabla J(\mathbf{w}_{k-1})^T\,\nabla J(\mathbf{w}_{k-1})} \qquad\qquad \beta_k = \dfrac{\nabla J(\mathbf{w}_k)^T\left[\nabla J(\mathbf{w}_k) - \nabla J(\mathbf{w}_{k-1})\right]}{\nabla J(\mathbf{w}_{k-1})^T\,\nabla J(\mathbf{w}_{k-1})}$

Fletcher-Reeves update – ‘traincgf’                    Polak-Ribiere update – ‘traincgp’
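A sketch of one conjugate-direction computation with both β choices; the gradients are placeholders and the line search for α_k is omitted:

% Conjugate gradient direction update: dw_k = -grad_k + beta_k * dw_(k-1)
grad_prev = randn(10, 1);        % gradient of J at w_(k-1) (placeholder)
grad_curr = randn(10, 1);        % gradient of J at w_k     (placeholder)
dw_prev   = -grad_prev;          % previous search direction

beta_FR = (grad_curr' * grad_curr) / (grad_prev' * grad_prev);               % Fletcher-Reeves ('traincgf')
beta_PR = (grad_curr' * (grad_curr - grad_prev)) / (grad_prev' * grad_prev); % Polak-Ribiere  ('traincgp')

dw_curr = -grad_curr + beta_FR * dw_prev;   % next (conjugate) search direction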

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Gradient Descent vs.
Conjugate Gradient

[Figure: convergence paths of conjugate gradient descent and plain gradient descent on the same error surface.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other Neural Network
structures
 Radial Basis Function (RBF) Networks: Similar in architecture to MLP, however, different
learning rule. Typically used for function approximation, though capable of solving
classification problems as well.
Matlab: newrb (train_data, targets, error_goal, spread)

 Probabilistic Neural Networks (revisited): Almost identical to RBF architecture,


however, simpler learning rule. Network outputs are posterior probabilities of
respective classes. No iterative learning is involved, therefore, PNN training is extremely
fast.
Matlab: newpnn (train_data, targets, spread)

 Learning Vector Quantization (LVQ) Networks: Most commonly used in speech processing applications; its training mechanism uses a winner-take-all type competitive hidden layer.
Matlab: newlvq (range, subclasses, class_priors, learning_rate, learning_rule)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Radial Basis Function Neural Networks
Dr. Robi Polikar
PR Function Approximation
 Constructing complicated functions from simple building blocks
 Lego systems
 Fourier / wavelet transforms
 Logic circuits
 RBF networks

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Function Approximation

[Figure (RP): a scatter of sample points from an unknown function, with a question mark indicating the value to be approximated between them.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Recall: Universal
Approximator
 Classification can be thought of as a special case of function approximation:
 For a three-class problem, the classifier maps each d-dimensional input x = [x1 … xd] to a target code:
Class 1: 1 or [1 0 0]    Class 2: 2 or [0 1 0]    Class 3: 3 or [0 0 1]
[Diagram: d-dimensional input x → classifier y = f(x) → 1-of-3 (c-dimensional) output y.]

 Hence, the problem is – given a set of input/output example pairs of an unknown


function – to determine the output of this function to any general input.
 An algorithm that is capable of approximating any function – however
difficult, large dimensional or complicated it may be – is known as a
universal approximator. The RBF is also a universal approximator.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Radial Basis Function
Neural Networks
 The RBF networks, just like MLP networks, can therefore be used for classification and/or function approximation problems.
 The RBFs, which have a similar architecture to that of MLPs, however, achieve this goal using a different strategy:
[Figure (RP): RBF network structure: an input layer, a nonlinear transformation layer that generates local receptive fields, and a linear output layer.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Nonlinear Receptive Fields
 The hallmark of RBF networks is their use of nonlinear receptive fields
 The receptive fields nonlinearly transform (map) the input feature space,
where the input patterns are not linearly separable, into the hidden unit
space, where the mapped inputs may be linearly separable.
 The hidden unit space often needs to be of a higher dimensionality
 Cover’s Theorem (1965) on the separability of patterns: A complex pattern
classification problem that is nonlinearly separable in a low dimensional space, is
more likely to be linearly separable in a high dimensional space.
 We will see this concept again with SVM soon.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The (you guessed it right) XOR Problem
Consider the following nonlinear functions, which map the input vector x = [x1 x2]ᵀ to the φ1–φ2 space:
φ1(x) = e^(−‖x − t1‖²),   t1 = [1 1]ᵀ
φ2(x) = e^(−‖x − t2‖²),   t2 = [0 0]ᵀ
[Figure: the four XOR points in the original x1–x2 plane]

Input x    φ1(x)     φ2(x)
(1,1)      1         0.1353
(0,1)      0.3678    0.3678
(1,0)      0.3678    0.3678
(0,0)      0.1353    1

[Figure (RP): the same four points plotted in the φ1–φ2 space, where (0,1) and (1,0) map onto a single point and the two classes become linearly separable]
The nonlinear φ functions transformed a nonlinearly separable problem into a linearly separable one!
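A short MATLAB sketch that reproduces the table above (illustrative):

% Map the four XOR inputs through the two Gaussian receptive fields
X  = [1 1; 0 1; 1 0; 0 0];                          % the four XOR patterns (rows)
t1 = [1 1];  t2 = [0 0];                            % receptive field centers
phi1 = exp(-sum(bsxfun(@minus, X, t1).^2, 2));      % phi1(x) = exp(-||x - t1||^2)
phi2 = exp(-sum(bsxfun(@minus, X, t2).^2, 2));      % phi2(x) = exp(-||x - t2||^2)
disp([X phi1 phi2])                                 % reproduces the table above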
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Initial Assessment
 Using nonlinear functions, we can convert a nonlinearly separable problem
into a linearly separable one.
 From a function approximation perspective, this is equivalent to
implementing a complex function (corresponding to the nonlinearly
separable decision boundary) using simple functions (corresponding to the
linearly separable decision boundary)
 Implementing this procedure using a network architecture yields the RBF
network, if the nonlinear mapping functions are radial basis functions.
 Radial Basis Functions:
 Radial: Symmetric around its center
 Basis Functions: Also called kernels, a set of functions whose linear combination
can generate an arbitrary function in a given function space.
 Hence, the radial basis functions are in fact functions of distance: the response of an
RBF with center 𝝁 is maximal when the data point 𝒙 = 𝝁, and decreases
symmetrically as 𝒙 moves away from 𝝁.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Networks
[Network diagram (RP): d input nodes x1 … xd → H hidden-layer RBFs φ1 … φH (receptive fields, first-layer weights U = Xᵀ) → c output nodes z1 … zc. Inset: a radial basis function with μ = 2, σ = 1.]
Hidden units compute a radial basis function of the distance to their center:
φ(net_J) = φ(‖x − u_J‖) = e^(−(‖x − u_J‖/σ)²),    σ: spread constant
Output units use a linear activation function:
z_k = f(net_k) = f(Σ_{j=1}^{H} w_kj · y_j) = Σ_{j=1}^{H} w_kj · y_j
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Principle of Operation
[Network diagram (RP): inputs x1 … xd → hidden RBF units y1 … yH → output units z1 … zc]
Radial basis function of the Euclidean norm (hidden layer):
φ(net_J) = φ(‖x − u_J‖) = e^(−(‖x − u_J‖/σ)²),    σ: spread constant
Linear output layer:
z_K = f(net_K) = f(Σ_{j=1}^{H} w_Kj · y_j) = Σ_{j=1}^{H} w_Kj · y_j
Unknowns: u_ji, w_kj, σ

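A minimal MATLAB sketch of this forward pass, assuming Gaussian receptive fields and randomly chosen parameters (illustrative only):

% Forward pass of an RBF network: d inputs, H hidden RBFs, c linear outputs.
d = 3; H = 5; c = 2;
U     = randn(H, d);                      % rows of U are the RBF centers u_j
W     = randn(c, H);                      % output-layer weights w_kj
sigma = 1.0;                              % spread constant
x     = randn(d, 1);                      % one input pattern

dists = sqrt(sum(bsxfun(@minus, U, x').^2, 2));   % ||x - u_j|| for each hidden unit
y     = exp(-(dists/sigma).^2);                   % hidden-layer outputs (receptive fields)
z     = W*y;                                      % linear output layer: z_k = sum_j w_kj*y_j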
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Principle of Operation
 What do these parameters represent?
 Physical meanings:
• 𝜙: The radial basis function for the hidden layer. This is a simple nonlinear mapping
function (typically Gaussian) that transforms the d- dimensional input patterns to a
(typically higher) H-dimensional space. The complex decision boundary will be
constructed from linear combinations (weighted sums) of these simple building
blocks.
• 𝑢𝑗𝑖 : The weights joining the first to hidden layer. These weights constitute the
center points of the radial basis functions. Also called prototypes of data.
• 𝜎: The spread constant(s). These values determine the spread (extent) of each
radial basis function.
• 𝑤𝑘𝑗: The weights joining the hidden and output layers. These are the weights
used in obtaining the linear combination of the radial basis functions; they
determine the relative amplitudes of the RBFs when they are combined to form the
complex function.
• ‖𝐱 − 𝐮𝐽 ‖: the Euclidean distance between the input 𝐱 and the prototype vector 𝐮𝐉.
Activation of the hidden unit is determined according to this distance through 𝜙

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Principle of Operation
[Figure (RP): a weighted sum of radial basis transfer functions. Individual RBFs φ_J, centered at the training data instances u_J and scaled by their relative weights w_J, add up to the function implemented by the RBF network.]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBFs – In Depth
 RBF neural networks are feed-forward neural networks that consist
of
 a hidden layer of nonlinear kernel units that essentially compute the distance
between the given inputs and preset centers called prototypes – this layer performs
a nonlinear transformation of the inputs into a higher dimensional space (what is
that dimensionality???), where the classes are better separated
 an output layer of linear neurons that computes a weighted sum of contributions
from the kernels to predict the target labels.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR How to Train?
 There are various approaches for training RBF networks.
 Approach 1: Exact RBF – Guarantees correct classification of all training
data instances. Requires N hidden layer nodes, one for each training
instance. No iterative training is involved. RBF centers (u) are fixed as
training data points, spread as variance of the data, and w are obtained by
solving a set of linear equations (In Matlab: newrbe() )
 Approach 2: Fixed centers selected at random. Uses 𝐻 < 𝑁 hidden layer
nodes. No iterative training is involved. Spread is based on Euclidean
metrics, w are obtained by solving a set of linear equations.
 Approach 3: Centers are obtained from unsupervised learning
(clustering). Spreads are obtained as variances of clusters, w are obtained
through LMS algorithm. Clustering (k-means) and LMS are iterative. This
is the most commonly used procedure. Typically provides good results.
 Approach 4: All unknowns are obtained from supervised learning.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
 Exact RBF
 The first layer weights u are set to the training data; 𝑼 = 𝑿𝑻. That is, the
Gaussians are centered at the training data instances.
 The spread is chosen as σ = d_max/√(2N), where d_max is the maximum Euclidean distance
between any two centers, and N is the number of training data points. Note that
H = N for this case.
 The output of the kth RBF output neuron is then
z_k = Σ_{j=1}^{N} w_kj · φ(‖x − u_j‖)   (multiple outputs)        z = Σ_{j=1}^{N} w_j · φ(‖x − u_j‖)   (single output)

 During training, we want the outputs to be equal to our desired targets. Without
loss of any generality, assume that we are approximating a single dimensional
function, and let the unknown true function be 𝑓(𝒙). The desired (target) output
for each input is then 𝑡𝑖 = 𝑓(𝒙𝑖), 𝑖 = 1, 2, … , 𝑁.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
(Cont.)

 We then have a set of linear equations, which can be represented in matrix form:

[φ11 φ12 ⋯ φ1N]   [w1]   [t1]
[φ21 φ22 ⋯ φ2N] ⋅ [w2] = [t2]
[ ⋮    ⋮  ⋱  ⋮ ]   [⋮ ]   [⋮ ]
[φN1 φN2 ⋯ φNN]   [wN]   [tN]

where φ_ij = φ(‖x_i − x_j‖),  (i, j) = 1, 2, …, N

Define:  d = [t1, t2, ⋯, tN]ᵀ,   w = [w1, w2, ⋯, wN]ᵀ,   Φ = {φ_ij}

Then  Φ ⋅ w = d   ⟹   w = Φ⁻¹ d

Is this matrix always invertible?

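A compact MATLAB sketch of the exact-interpolation solution (illustrative; the 1-D target function is an assumed example, and newrbe performs essentially the same computation):

% Exact RBF interpolation: one Gaussian centered at every training point.
x = linspace(0, 2*pi, 25)';                   % N training inputs (1-D example)
t = sin(x);                                   % desired targets t_i = f(x_i)
N = length(x);
dmax  = max(max(abs(bsxfun(@minus, x, x')))); % max distance between centers
sigma = dmax/sqrt(2*N);                       % spread, as chosen on the previous slide

Phi = exp(-(bsxfun(@minus, x, x')/sigma).^2); % Phi(i,j) = phi(||x_i - x_j||)
w   = Phi\t;                                  % solve Phi*w = t  (w = Phi^-1 * t)

xq  = linspace(0, 2*pi, 200)';                % query points
Phq = exp(-(bsxfun(@minus, xq, x')/sigma).^2);% RBF activations at the queries
yq  = Phq*w;                                  % interpolated outputs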
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
(Cont.)

 Micchelli's Theorem (1986)
 If {x_i}, i = 1, …, N, are a distinct set of points in the d-dimensional space, then the
N-by-N interpolation matrix Φ with elements obtained from radial basis
functions, φ_ij = φ(‖x_i − x_j‖), is nonsingular, and hence can be inverted!
 Note that the theorem is valid regardless of the value of N, the choice of the RBF
(as long as it is an RBF), or what the data points may be, as long as they are
distinct!
 A large number of RBFs can be used (with r = ‖x − x_j‖):
• Multiquadrics:           φ(r) = (r² + c²)^(1/2)       for some c > 0, r ∈ ℝ
• Inverse multiquadrics:   φ(r) = 1/(r² + c²)^(1/2)     for some c > 0, r ∈ ℝ
• Gaussian functions:      φ(r) = e^(−r²/(2σ²))         for some σ > 0, r ∈ ℝ
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
(Cont.)

 The Gaussian is the most commonly used RBF (why…?).
 Note that as r → ∞, φ(r) → 0
 Gaussian RBFs are localized functions, unlike the sigmoids used by MLPs
[Figures: "Using Gaussian radial basis functions" vs. "Using sigmoidal radial basis functions"]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Exact RBF Properties
 Using localized functions typically makes RBF networks more suitable for
function approximation problems. Why?
 Since first layer weights are set to input patterns, second layer weights are
obtained from solving linear equations, and spread is computed from the
data, no iterative training is involved !!!
 Guaranteed to correctly classify all training data points!
 However, since we are using as many receptive fields as the number of
data points, the solution is over-determined if the underlying physical process
does not have as many degrees of freedom  Overfitting!
 The importance of σ: too small a σ will also cause overfitting; too large a σ will
fail to characterize rapid changes in the signal.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Too many
Receptive Fields?
 In order to reduce the artificial complexity of the RBF, we need to
use fewer number of receptive fields.
 How about using a subset of training data, say 𝑀 < 𝑁 of them.
 These 𝑀 data points will then constitute 𝑀 receptive field centers.
 How to choose these 𝑀 points…?
 At random  Approach 2.
y_j = φ_ij = φ(‖x_i − x_j‖²) = e^(−(M/d²_max)·‖x_i − x_j‖²),   i = 1, 2, …, N;   j = 1, 2, …, M      σ = d_max/√(2M)
Output layer weights are determined as they were in Approach 1, through solving a
set of M linear equations!

 Unsupervised training: K-means  Approach 3


The centers are selected through self organization of clusters, where the
data is more densely populated. Determining 𝑀 is usually heuristic.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 3
K-Means - Unsupervised
Clustering - Algorithm
 Choose number of clusters, M
 Initialize M cluster centers to the first M training data points: 𝐭𝑘 = 𝐱𝑘 , 𝑘 = 1,2, … , 𝑀.
 Repeat
 At iteration 𝑛, group all patterns to the cluster whose center is closest
C(x) = argmin_k ‖x(n) − t_k(n)‖,   k = 1, 2, …, M        (t_k(n): center of the kth RBF at the nth iteration)
 Compute the centers of all clusters after the regrouping:
t_k = (1/M_k) Σ_{j=1}^{M_k} x_j        (new center for the kth RBF; M_k: number of instances grouped in the kth cluster)

 Until there is no change in cluster centers from one iteration to the next.

An alternate k-means algorithm is given in Haykin (p. 301).


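Illustrative sketch of Approach 3's first two steps (assumed synthetic 1-D data; kmeans requires the Statistics Toolbox). The output weights w then follow from the LMS algorithm on the next slide, or directly from a least-squares solve, w = Phi\d.

% Approach 3: k-means picks M centers, per-cluster variances give the spreads.
x = linspace(0, 2*pi, 200)';               % inputs
d = sin(x) + 0.1*randn(size(x));           % noisy targets
M = 10;                                    % number of receptive fields
[idx, centers] = kmeans(x, M);             % cluster the inputs (Statistics Toolbox)
sigma = zeros(M,1);
for k = 1:M
    sigma(k) = std(x(idx == k)) + eps;     % spread of kth RBF = spread of its cluster
end
% Hidden-layer activations for every training point:
Phi = exp(-bsxfun(@rdivide, bsxfun(@minus, x, centers').^2, 2*(sigma'.^2)));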
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Determining the Output Weights:
PR Approach 3 LMS Algorithm
 The LMS algorithm is used to minimize the cost function E(w) = ½ e²(n), where
e(n) is the error at iteration n, i.e., e(n) = d(n) − yᵀ(n) w(n)
∂E(w)/∂w(n) = e(n) ∂e(n)/∂w,    ∂e(n)/∂w = −y(n)    ⟹    ∂E(w)/∂w(n) = −y(n) e(n)

 Using the steepest (gradient) descent method: 𝐰(𝑛 + 1) = 𝐰(𝑛) + 𝜂𝐲(𝑛)𝑒(𝑛)


Instance based LMS algorithm pseudocode (for single output):
Initialize weights, 𝑤𝒋 to some small random value, 𝑗 = 1,2, … , 𝑀
M Cluster centers
Repeat
obtained through k-means
Choose next training pair (𝒙, 𝑑);
Compute network output at iteration n:   z(n) = Σ_{j=1}^{M} w_j · φ(‖x − x_j‖) = wᵀ · y

Compute error: 𝑒(𝑛) = 𝑑(𝑛) − 𝑧(𝑛)


Update weights:  w(n + 1) = w(n) + η e(n) y(n)
Until weights converge to a steady set of values

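A minimal, self-contained MATLAB sketch of this instance-based LMS loop for the RBF output weights (illustrative; the centers and spread are fixed by assumption):

% Instance-based LMS for the RBF output weights (single output).
x = linspace(0, 2*pi, 200)';                         % training inputs (1-D)
d = sin(x);                                          % desired outputs
c = linspace(0, 2*pi, 10);                           % 10 fixed RBF centers
sigma = 0.7; eta = 0.05;                             % spread and learning rate
Phi = exp(-(bsxfun(@minus, x, c)/sigma).^2);         % y(n) for every instance
w   = 0.01*randn(size(c,2), 1);                      % small random initial weights
for epoch = 1:200
    for n = 1:length(x)
        y = Phi(n,:)';                               % hidden-layer outputs y(n)
        e = d(n) - w'*y;                             % e(n) = d(n) - z(n)
        w = w + eta*e*y;                             % w(n+1) = w(n) + eta*e(n)*y(n)
    end
end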
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 4:
Supervised
RBF Training
 This is the most general form.
 All parameters, receptive field centers (first layer weights), output layer weights
and spread constants, are learned through iterative supervised training using LMS
/ gradient descent algorithm.
Cost function over all N training instances:   ℰ = Σ_{j=1}^{N} e_j²,   with   e_j = d_j − Σ_{i=1}^{M} w_i φ(‖x_j − t_i‖)
Writing G(‖x_j − t_i‖) = φ(‖x_j − t_i‖), the gradient-descent updates for the output weights w_i, the
centers t_i, and the spreads σ_i follow from the partial derivatives of ℰ; G′ denotes the first derivative
of the function with respect to its argument.

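The explicit update equations are not reproduced on the slide; below is a minimal MATLAB sketch of fully supervised training for a single-output Gaussian RBF, obtained by differentiating the (mean-squared) cost directly. It is an illustration under assumed settings, not the exact formulation cited above.

% Fully supervised RBF training (single output, Gaussian kernels), illustrative.
x = linspace(-1, 1, 100)';   d = x.^2;              % toy regression problem
N = length(x);  M = 5;                              % instances and receptive fields
t = linspace(-1, 1, M);  w = 0.1*randn(M,1);  s = 0.5*ones(1,M);
eta = 0.05;
for epoch = 1:500
    D   = bsxfun(@minus, x, t);                               % x_j - t_i   (N x M)
    Phi = exp(-(D.^2) ./ (2*repmat(s.^2, N, 1)));             % phi_i(x_j)  (N x M)
    e   = d - Phi*w;                                          % errors e_j  (N x 1)
    gw  = -Phi'*e / N;                                        % dE/dw_i   (mean-squared cost)
    gt  = -((Phi .* D   )'*e) .* w ./ (s.^2)' / N;            % dE/dt_i
    gs  = -((Phi .* D.^2)'*e) .* w ./ (s.^3)' / N;            % dE/ds_i
    w = w - eta*gw;   t = t - eta*gt';   s = s - eta*gs';     % gradient-descent updates
end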
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF
 Similarities
 Both are universal approximators: they can approximate an arbitrary function of arbitrary
dimensionality and arbitrary complexity, provided that the number of hidden layer units is
sufficiently large and there is sufficient training data.
 Differences
 MLP generates more global decision regions, as opposed to RBF generating more local
decision regions
 MLPs partition the feature space into hyperplanes, whereas RBFs partition the space into
hyperellipsoids
 MLPs are more likely to battle with local minima and flat valleys than RBFs, and hence in
general have longer training times
 Since MLPs generate global decision regions, they do better in extrapolating, that is,
classifying instances that are outside of the feature space represented by the training data. It
should be noted, however, that extrapolating may mean dealing with outliers.
 MLPs typically require fewer parameters than RBFs to approximate a given function with
the same accuracy G
(From R. Gutierrez)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF (Cont.)
 Differences (cont.)
 All parameters of the MLP are trained simultaneously, whereas RBF parameters can be
trained separately in an efficient hybrid manner
 RBFs have one and only one hidden layer, whereas MLPs can have multiple hidden layers.
 The hidden neurons of an MLP compute the inner product between an input vector and the
weight vector. RBFs instead compute the Euclidean distance between the input vector and
the radial basis function centers.
 The hidden layer of an RBF is nonlinear and its output layer is linear, whereas an MLP
typically has both layers as nonlinear. This really is more of a historic preference based on
empirical success. MLPs typically do better on classification type problems, and RBFs
typically do better on regression / function approximation type problems.

G
(From R. Gutierrez)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF

G
(From R. Gutierrez)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Networks
in Matlab
radbas(x) = e^(−x²)
[Figure: the radbas transfer function, with its maximum of 1 at an input of 0]
The RBF neuron accepts a distance between the input p and the weight vector w:  a = radbas(‖w − p‖ b).
As the distance between w and p decreases, the output increases toward 1; hence an RBF neuron acts as a
detector that produces 1 when the input p is identical to its weight vector w. The bias b allows the
sensitivity of the radbas neuron to be adjusted.
[Network diagram: input p → layer 1 (IW{1,1}, bias b1, S1 radbas neurons) → layer 2 (LW{2,1}, bias b2, S2 purelin neurons) → output y]
a1 = radbas(‖IW{1,1} − p‖ .* b1)        a2 = purelin(LW{2,1} a1 + b2) = y
a{1} = radbas(netprod(dist(net.IW{1,1},p),net.b{1}))
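To make the two layers concrete, a short illustrative sketch that simulates a newrb network layer by layer and checks the result against sim:

% Simulate a radial basis network layer-by-layer and compare with sim().
P = [1 2 3];  T = [2.0 4.1 5.9];
net = newrb(P, T);                                   % design the network
p  = 1.5;                                            % a new input
a1 = radbas(dist(net.IW{1,1}, p) .* net.b{1});       % layer 1: radbas of weighted distance
a2 = net.LW{2,1}*a1 + net.b{2};                      % layer 2: purelin (linear)
y  = sim(net, p);                                    % toolbox simulation
disp([a2 y])                                         % the two values should match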
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR newrb
newrb Design radial basis network
net = newrb(P,T,goal,spread,MN,DF)
Radial basis networks can be used to approximate functions. newrb adds neurons to the hidden layer of a radial basis network until it meets the
specified mean squared error goal.
net = newrb(P,T,goal,spread,MN,DF) takes the following arguments – P: R-by-Q matrix of Q input vectors; T: S-by-Q matrix of Q target class vectors;
goal: Mean squared error goal (default = 0.0); spread: Spread of radial basis functions (default = 1.0); MN: Maximum number of neurons (default is
Q); DF: Number of neurons to add between displays (default = 25)
and returns a new radial basis network. The larger the spread is, the smoother the function approximation. Too large a spread means a lot of
neurons are required to fit a fast-changing function. Too small a spread means many neurons are required to fit a smooth function, and the network
might not generalize well. Call newrb with different spreads to find the best value for a given problem.

Examples
Here you design a radial basis network, given inputs P and targets T.
P = [1 2 3]; T = [2.0 4.1 5.9]; net = newrb(P,T);
The network is simulated for a new input.
P = 1.5; Y = sim(net,P)

About the Algorithm


newrb creates a two-layer network. The first layer has radbas neurons, and calculates its weighted inputs with dist and its net input with netprod.
The second layer has purelin neurons, and calculates its weighted input with dotprod and its net inputs with netsum. Both layers have biases.
Initially the radbas layer has no neurons. The following steps are repeated until the network's mean squared error falls below goal.

1. The network is simulated.


2. The input vector with the greatest error is found.
3. A radbas neuron is added with weights equal to that vector.
4. The purelin layer weights are redesigned to minimize error.

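Since the documentation suggests calling newrb with different spreads, a small illustrative sketch that selects the spread with the lowest error on a held-out set (the data and the split are assumptions):

% Try several spread values and keep the one with the lowest validation error.
x  = linspace(0, 3, 60);  t = sin(2*pi*x);          % toy data
iv = 1:3:60;                                        % hold out every 3rd point
it = setdiff(1:60, iv);
spreads  = [0.05 0.1 0.2 0.5 1];
valError = zeros(size(spreads));
for i = 1:length(spreads)
    net = newrb(x(it), t(it), 1e-3, spreads(i));    % train with this spread
    valError(i) = mean((t(iv) - sim(net, x(iv))).^2);
end
[~, best]  = min(valError);
bestSpread = spreads(best);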
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other Variations
 newgrnn – Create a generalized regression neural network. Consists of two
layers: the first is identical to that of the RBF network, whereas the second is a
slight variation of the purelin layer. More suitable for function approximation
problems.
 newpnn – Create a probabilistic neural network. Also has two layers, with
the first being a RBF layer, whereas the second normalizes the first layer outputs
and passes them through a competitive function that picks the largest
output. Most suited for classification problems. This function essentially
creates a version of kNN, with the distances weighted through the radial
basis function.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR trainrbf.m RBF Matlab Demo
%trainrbf: Trains and simulates an RBF and a GRNN network on a synthetic dataset
%Originally written 2003, updated 10/2013 - Robi Polikar
load arb_function3.csv;
X = arb_function3(:,1);
Y = arb_function3(:,2);
N = length(X);
X = X(1:4:N);  Y = Y(1:4:N);
%Sub-sample the data at a rate of 1:5 to create training data
P1 = X(1:5:length(X))';
T1 = Y(1:5:length(Y))';

net_rb   = newrb(P1, T1, 0.001, 1);
net_grnn = newgrnn(P1, T1, 1);
out_rb   = sim(net_rb, X');
out_grnn = sim(net_grnn, X');

subplot(411); plot(X,Y);        grid; title('Original function')
subplot(412); plot(P1,T1);      grid; title('Training Data')
subplot(413); plot(X,out_rb);   grid; title('RBF approximation of the original data')
subplot(414); plot(X,out_grnn); grid; title('GRNN approximation of the original data')

[Figure (RP): the four subplots - original function, training data, RBF approximation, GRNN approximation]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR
Using an error goal of 1 and a spread of 5, the network required 1200 neurons, using 2000 points for training.
[Figure (RP): original function, training data, and the simulation on the original data]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
The Peaks!

z = 3(1 − x)² e^(−x² − (y+1)²) − 10(x/5 − x³ − y⁵) e^(−x² − y²) − (1/3) e^(−(x+1)² − y²)

[Figure: mesh of the training data generated by the peaks function]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
The Peaks!
 RBF parameters
 Error goal: 0.1
 Spread: 1
 Training function: fully supervised
 MLP parameters:
 Number of hidden layers: 2
 Number of nodes in each layer: 25
 Error goal: 0.01 (not reached in 1000 iterations)
 Training function: traingdx or trainlm

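An illustrative sketch of such a comparison (the exact settings of the original rbf_mlp_peaks demo may differ):

% Compare an RBF network and a two-hidden-layer MLP on the peaks surface.
[Xg, Yg, Zg] = peaks(25);                      % built-in peaks function
P = [Xg(:)'; Yg(:)'];  T = Zg(:)';             % 2-D inputs, scalar target

net_rbf = newrb(P, T, 0.1, 1);                 % error goal 0.1, spread 1

net_mlp = feedforwardnet([25 25], 'trainlm');  % 2 hidden layers, 25 nodes each
net_mlp.trainParam.epochs = 1000;
net_mlp.trainParam.goal   = 0.01;
net_mlp = train(net_mlp, P, T);

mse_rbf = mean((T - sim(net_rbf, P)).^2);      % training-set MSE for each model
mse_mlp = mean((T - net_mlp(P)).^2);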
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
rbf_mlp_peaks
The Peaks!

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nnstart
 For a very good introduction, play with nnstart

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Selecting Data

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Setting Up The Network

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training
 Check the dimensionality
 Check out the error histogram,
confusion matrix and ROC curve once
the training is completed

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Performance Plots
[Figure: training performance plot - Best Validation Performance is 0.01575 at epoch 84; mean squared error (mse) vs. 90 epochs for the Train, Validation and Test sets]
[Figure: Error Histogram with 20 Bins - instances vs. errors (Targets - Outputs) for the Training, Validation and Test sets, with the zero-error line marked]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR ROC Curves
(Class Specific)
[Figure: class-specific ROC curves (true positive rate vs. false positive rate, Classes 1-10) for the Training, Validation, Test and All data partitions]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Confusion Matrices - I

[Figure: Training and Validation confusion matrices for the 10-class problem, showing per-class counts and percentages (overall accuracies: 96.1% on training, 89.6% on validation)]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Confusion Matrices - II
[Figure: Test and All-data confusion matrices for the 10-class problem (overall accuracies: 89.4% on test data, 91.5% over all data)]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Evaluate The Network

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Evaluation Results
On Test Data
[Figure: 10-class confusion matrix on the test data (overall accuracy 91.9%) and the corresponding class-specific ROC curves (true positive rate vs. false positive rate)]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR What Else Can it do,
you Ask…?
Prepare to be impressed,
and then click here


Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR …and here is the script…
% Solve a Pattern Recognition Problem with a Neural Network - Script generated by NPRTOOL - Created Thu Oct 10 02:48:04 EDT 2013
% This script assumes these variables are defined: % opt_train - input data. % opt_class - target data.
inputs = opt_train; targets = opt_class;
% Create a Pattern Recognition Network
hiddenLayerSize = 20; net = patternnet(hiddenLayerSize);
% Choose Input and Output Pre/Post-Processing Functions - For a list of all processing functions type: help nnprocess
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
% Setup Division of Data for Training, Validation, Testing - For a list of all data division functions type: help nndivide
net.divideFcn = 'dividerand'; % Divide data randomly
net.divideMode = 'sample'; % Divide up every sample
net.divideParam.trainRatio = 30/100; net.divideParam.valRatio = 35/100; net.divideParam.testRatio = 35/100;
% For help on training function 'trainlm' type: help trainlm; For a list of all training functions type: help nntrain
net.trainFcn = 'trainlm'; % Levenberg-Marquardt
% Choose a Performance Function - For a list of all performance functions type: help nnperformance
net.performFcn = 'mse'; % Mean squared error
% Choose Plot Functions - For a list of all plot functions type: help nnplot
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', 'plotregression', 'plotfit'};
% Train and test the Network
[net,tr] = train(net,inputs,targets); outputs = net(inputs); errors = gsubtract(targets,outputs); performance = perform(net,targets,outputs)
% Recalculate Training, Validation and Test Performance
trainTargets = targets .* tr.trainMask{1};
valTargets = targets .* tr.valMask{1};
testTargets = targets .* tr.testMask{1};
trainPerformance = perform(net,trainTargets,outputs);
valPerformance = perform(net,valTargets,outputs);
testPerformance = perform(net,testTargets,outputs)

% View the Network


view(net)
% Uncomment these lines to enable various plots.
%figure, plotperform(tr) %figure, plottrainstate(tr) %figure, plotconfusion(targets,outputs) %figure, ploterrhist(errors)

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR …And the simulink model

[Figure: Simulink model - a Constant block (x1) feeds the Input of the Pattern Recognition Neural Network (NNET) block, whose Output is y1]

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Midterm Project II
Oct 24
 Option 1 (easier, but close ended – intended for undergraduate students) - Pick 3 of the
more challenging datasets (from the UCI repository) that you used in Midterm Project I, and
design two function approximation (with noise) problems
 Train and test using the following network structures: MLP, RBF, PNN, LVQ, GRNN (which network
is suitable for which classification vs. regression problem)
 Investigate different parameters, learning algorithms, etc. and tabulate your results. Can you make
any generalizations with respect to accuracy, speed, network size, etc.?
 Use proper cross-validation and statistical tests to compare the algorithms on the datasets. Provide
the principles of operation for each of the networks, including PNN, LVQ and GRNN.
 UG students can instead do option 2 for 20% additional credit.

 Option 2 – intended for grad students (open ended problem – publication opportunity)
 You will be given a dataset of EEG data with two classes: Alzheimer’s disease and normal. The dataset
includes raw data as well as 152 different feature sets obtained from 71 subjects. Your goal is to devise
and implement a rigorous approach to determine which feature sets provide the best diagnostic
accuracy, using appropriate cross-validation techniques. Provide a list of these approaches in descending
order of accuracy, also providing sensitivity, specificity and positive predictive value, along with their
confidence intervals.

Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
