00179
1. Introduction
Evolving from our understanding of neurobiological systems, artificial neural networks give computers an amazing capacity to learn complex tasks from examples. They have become an alternative computational approach for problems that do not have algorithmic solutions, or for which the algorithmic solutions are too difficult to express analytically. Their success can be attributed in part to their fault tolerance, parallel processing, and generalization ability. The most popular neural network architecture in use today, discussed in almost every neural network textbook, is the multilayer perceptron (MLP). MLPs have proven to be a powerful computational tool for many problems in pattern recognition, function approximation, and data analysis, to name a few. However, MLPs have some drawbacks when applied directly, without any preprocessing, to high-dimensional data such as in image analysis, image understanding and machine vision. The main problem is that the size of the network grows with the size of the input image, which makes network training a much harder task. Moreover, overfitting may occur and the generalization ability of the network suffers when there are not sufficient training samples. The common approach to circumvent these problems is to use preprocessing techniques to extract lower-dimensional features from the input data. Feature extraction, however, is a computationally expensive process and requires prior knowledge about the data to design the feature extractor.
In the past 20 years, researchers have focused not only on the development of training algorithms for MLPs, but also on the identification of significant network structures and weight constraints that can reduce the number of trainable parameters. Inspired by Hubel and Wiesel's hierarchical vision model of the cortex, Fukushima et al.1 developed the neocognitron, a two-dimensional (2D) neural network architecture for visual pattern recognition. LeCun et al.,2 on the other hand, proposed a series of convolutional neural network (CoNN) architectures, based upon the three structural concepts of local receptive fields, weight sharing and sub-sampling. These networks can easily deal with variability in 2D shapes and possess a certain degree of local invariance to distortion and translation. Consequently, they have attracted considerable interest and gained popularity for solving visual pattern recognition problems such as face detection,3 face recognition,4 facial expression analysis,5 and medical image pattern recognition.6
In Ref. 7, LeCun et al. reported their latest CoNN, which is widely known as LeNet-5, for handwritten digit recognition. The network consists of seven processing layers, where the first four layers are two successive pairs of convolutional and sub-sampling layers with a total of 44 feature maps for feature extraction. The fifth and sixth layers are, respectively, a convolutional layer with 120 neurons and a fully-connected layer with 84 neurons, and the output layer has ten neurons to represent the ten digit classes. Overall, the network has 60,000 trainable parameters, and was trained and tested on the MNIST database8 with an error rate of 0.8%. Calderón et al.9 developed a CoNN structure similar to LeNet-5 that uses Gabor filters as receptive fields for the first convolutional layer; at the output layer, 84 perceptrons are used to represent the output as a grayscale image of size 12 × 7. To improve the performance of their handwritten digit recognition system, they applied a boosting method to their networks and achieved an error rate of 0.68%. Simard and his colleagues,10 on the other hand, used a much simpler CoNN structure for handwritten digit recognition, with four processing layers and a network retina of size 29 × 29. The first layer has five feature maps of size 13 × 13, and the second layer has 50 feature maps. In each layer, the size of the feature map is reduced from n to (n − 3)/2, where n is the original size, and the receptive field size used throughout the network is 5 × 5. The last two layers are equivalent to a two-layer fully-connected MLP with 100 hidden neurons and ten output neurons. By expanding the training set through elastic distortions and using a cross-entropy function as the error function, they achieved an error rate of 0.4%. Gorgevik and Cakmakov11 proposed another approach, combining two neural networks and a support vector machine to implement a three-stage classifier for handwritten digit recognition. First, the digit images are preprocessed for slant correction. Then 292 features are extracted from the image as inputs for the cascade of classifiers. Based on the MNIST database, their three-stage classifier has an error rate of 0.83%. The experimental results of these neural-based approaches show that neural networks can yield state-of-the-art performance. Nevertheless, these networks are still plagued by the problem of a huge number of trainable parameters.
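As a quick check of the map-size rule reported for the network of Simard et al., one can assume (an interpretation, not something the text states) that a 5 × 5 receptive field is moved in steps of two positions without padding. A map of size n then offers (n − 5)/2 + 1 positions along each axis, which is the same as (n − 3)/2:

$$n' = \frac{n - 5}{2} + 1 = \frac{n - 3}{2}, \qquad 29 \;\to\; \frac{29 - 3}{2} = 13 \;\to\; \frac{13 - 3}{2} = 5.$$

Under this assumption the 29 × 29 retina gives first-layer maps of size 13 × 13, as quoted, and second-layer maps of size 5 × 5.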
Recently, we have proposed a new class of convolutional neural networks, known as shunting inhibitory convolutional neural networks (SICoNNets), which can be easily tailored to the user's specifications.12 The key characteristics of these networks are the processing element used for feature extraction and the systematic interconnection schemes between the different hidden layers. The processing elements in the hidden layers are based on the shunting inhibition mechanism, which plays an important role in visual information processing in the cortex.13–15 The reason for using this type of processing element is that shunting inhibitory neurons have been shown to be more computationally powerful than traditional sigmoid-type neurons. Unlike a sigmoid neuron, a single shunting inhibitory neuron can solve linearly nonseparable classification problems by forming nonlinear decision boundaries.16,17 In Ref. 18, the shunting inhibitory convolutional neural network was applied to a two-class pattern classification task, discriminating segmented images of faces from non-faces, and was subsequently developed into a face detection system that can detect and localize faces in complex background scenes.
In this paper, we apply SICoNNets to handwritten digit recognition. The next
section gives a detailed description of the shunting inhibitory convolutional neural
network architecture. Section 3 describes the training algorithms that have been
developed for these networks, followed by the description of the handwritten digit
recognition system in Sec. 4. The experimental results and performance analysis
are presented in Sec. 5, and final concluding remarks are given in Sec. 6.
2. Description of SICoNNet Architecture
The proposed convolutional neural networks, SICoNNets, have a flexible do-it-yourself network architecture in which the following network parameters can be specified: the input size, the receptive field size, the number of layers and/or number of feature maps, the number of outputs, and the connection scheme between layers. The input layer is a 2D array used by the network to receive images from the environment. The input layer is followed by several processing layers, or hidden layers, and each hidden layer is made up of planes of shunting inhibitory neurons, known as feature maps. Each neuron in a feature map receives inputs from a small local neighborhood in the previous layer, its receptive field. However, all the neurons in a feature map share the same set of connection weights [Fig. 1(a)], and each hidden layer has a fixed receptive field size. Since all neurons in a feature map share the same set of weights, the same operation is performed on different parts of the input plane. Hence, the same elementary visual feature is extracted from different positions in the input image. Other feature maps of the same layer operate with different sets of weights to extract different types of local features. In higher layers, the feature maps extract higher-order features by taking their inputs from one or more feature maps in the preceding layer. In each hidden layer, another structural process, namely sub-sampling, is performed to reduce the spatial resolution of the 2D input by shifting the centers of the receptive fields of adjacent neurons by two positions in both directions [see Fig. 1(b)]; as a result, the size of the feature maps is reduced to one quarter in each hidden layer. This introduces a certain degree of invariance to translation and input distortion, as the absolute location of an extracted feature becomes less important in higher layers so long as its approximate position relative to other features is preserved.

Fig. 1. Schematic diagrams illustrating (a) the application of local receptive fields and (b) the movement of a receptive field in the input image (shifted horizontally and vertically by two positions).
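The computation performed by the shunting inhibitory neuron at location (i, j) in the kth feature map of the Lth layer is given by

$$Z_{L,k}(i,j) = \frac{X_{L,k}(i,j)}{a_{L,k}(i,j) + Y_{L,k}(i,j)}, \qquad i, j = 1, \ldots, F_L, \tag{1}$$

where

$$X_{L,k}(i,j) = g_L\!\left( \sum_{m=1}^{S_{L-1}} \left[ C_{L,k} * Z_{L-1,m} \right]_{(2i)(2j)} + b_{L,k}(i,j) \right)$$

and

$$Y_{L,k}(i,j) = f_L\!\left( \sum_{m=1}^{S_{L-1}} \left[ D_{L,k} * Z_{L-1,m} \right]_{(2i)(2j)} + d_{L,k}(i,j) \right).$$

Here $*$ denotes the 2D convolution operator.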
The parameters $C_{L,k}$ and $D_{L,k}$ are the sets of excitatory and inhibitory weights, respectively; $b_{L,k}$ and $d_{L,k}$ are scalar parameters called the biases; $a_{L,k}$ is the passive decay rate of the neuron; $g_L$ and $f_L$ are the activation functions; $S_{L-1}$ is the number of feature maps at the $(L-1)$th layer; and $F_L$ is the size of the feature map at the $L$th layer. In a feature map, all the neurons share the same set of weights, $C_{L,k}$ and $D_{L,k}$, as well as the biases and the passive decay rate parameter. In order to avoid division by zero in (1), $a_{L,k}$ is constrained so that the denominator remains positive:

$$a_{L,k}(i,j) + Y_{L,k}(i,j) > 0. \tag{2}$$
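The computation in (1) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `receptive_field_sum` and `shunting_feature_map` are hypothetical, the activation functions are assumed to be g = tanh and f = exp (the text does not fix them here), receptive fields are applied without flipping the kernel and only where they fit entirely inside the input, and indices are 0-based.

```python
import numpy as np


def receptive_field_sum(weights, prev_maps, stride=2):
    """Sum over all input maps of the weighted receptive-field responses.

    Receptive-field positions are shifted by `stride`, which performs the
    sub-sampling. Only windows lying fully inside the input are used
    (a border-handling assumption).
    """
    r = weights.shape[0]              # receptive field size (r x r)
    n = prev_maps.shape[1]            # input map size (n x n)
    out = (n - r) // stride + 1       # resulting feature map size
    total = np.zeros((out, out))
    for z in prev_maps:               # loop over the S_{L-1} input maps
        for i in range(out):
            for j in range(out):
                patch = z[i * stride:i * stride + r, j * stride:j * stride + r]
                total[i, j] += np.sum(weights * patch)
    return total


def shunting_feature_map(prev_maps, C, D, b, d, a, g=np.tanh, f=np.exp):
    """Feature map Z = X / (a + Y) with shared weights C (excitatory), D (inhibitory)."""
    X = g(receptive_field_sum(C, prev_maps) + b)
    Y = f(receptive_field_sum(D, prev_maps) + d)
    denom = a + Y
    if np.any(denom <= 0.0):
        # Guard for the positivity constraint on the denominator.
        raise ValueError("denominator a + Y must stay positive")
    return X / denom


rng = np.random.default_rng(0)
image = rng.random((1, 29, 29))                 # one 29 x 29 input map
C = rng.normal(scale=0.1, size=(5, 5))          # excitatory weights
D = rng.normal(scale=0.1, size=(5, 5))          # inhibitory weights
Z = shunting_feature_map(image, C, D, b=0.0, d=0.0, a=1.0)
print(Z.shape)  # (13, 13)
```

With f = exp and a > 0 the denominator is always positive, so the guard never fires in this sketch; other choices of f would need the constraint enforced during training.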
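Fig. 2. [Figure with panels (a) and (b) showing connections between Layer 1 and Layer 2; caption text not recoverable.]

Table 1. [Table listing the connections from L1 feature maps (A to D) to L2 feature maps (1 to 8); caption and cell layout not recoverable.]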
$$y = h\!\left( \sum_{i=1}^{S_N} w_i z_i + b \right), \tag{3}$$

where $y$ is the neural response of the sigmoid neuron, $h$ is the output activation function, the $w_i$ are the connection weights, the $z_i$ are the input signals, $S_N$ is the number of input signals, and $b$ is the bias term.
3. Training Algorithm
To train the SICoNNets, a batch training algorithm based on the combination of Rprop,20 Quickprop,21 and SuperSAB22 has been developed and named QRProp. It is a local adaptation technique, in which the temporal behavior of the partial derivative of the weight is used in the computation of the weight update. For comparison, the Levenberg–Marquardt (LM) algorithm is also implemented, where the Jacobian matrix is computed using a modified error-backpropagation rule similar to the one developed by Hagan23 (see Ref. 12 for more details).
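The weight update rule of the QRProp method is given by

$$\mathbf{W}(k+1) = \mathbf{W}(k) + \Delta\mathbf{W}(k) + \boldsymbol{\mu}(k) \odot \Delta\mathbf{W}(k-1), \tag{4}$$

where the product of $\boldsymbol{\mu}(k)$ and $\Delta\mathbf{W}(k-1)$ is taken element-wise.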
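$$\gamma_i(k) = \begin{cases} \max\bigl(0.5\,\gamma_i(k-1),\ \gamma_{\min}\bigr), & \text{if } g_i(k)\,g_i(k-1) < 0, \\ \gamma_i(k-1), & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, n, \tag{5}$$

where $n$ is the number of trainable weights, and $\gamma_{\max}$ and $\gamma_{\min}$ are the upper and lower limits of the step size, respectively; the initial value $\gamma_i(0)$ is set to 0.001, and $\gamma_{\max}$ and $\gamma_{\min}$ are set to 10 and $10^{-10}$, respectively. The local weight update of the $i$th weight is then determined by

$$\Delta w_i(k) = -\operatorname{sgn}\bigl(g_i(k)\bigr)\,\gamma_i(k), \tag{6}$$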
where sgn denotes the signum function. When the current local gradient changes sign with respect to the previous local gradient of the same weight, the stored local gradient is set to zero so as to avoid an update of that weight in the next iteration. Furthermore, when the product of the current and previous local gradients is less than zero and there is an increase in the network error $E$, the $i$th weight update is reverted to the previous weight update and multiplied by an adaptive momentum rate:

$$\text{if } g_i(k)\,g_i(k-1) < 0 \text{ and } E(k) > E(k-1), \text{ then } \Delta w_i(k) = -\mu_i(k)\,\Delta w_i(k-1). \tag{7}$$
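The step-size adaptation in (5) and the sign-based update in (6) can be sketched as below. This is an illustrative sketch only: the function names `adapt_step_size` and `sign_update` are hypothetical, only the branches of (5) that are stated in the text are implemented (halve the step on a gradient sign change, otherwise keep it), and the update is assumed to move against the sign of the gradient.

```python
import numpy as np

GAMMA_INIT = 1e-3    # initial step size given in the text
GAMMA_MIN = 1e-10    # lower limit of the step size given in the text


def adapt_step_size(gamma_prev, grad, grad_prev):
    """Halve the step size (bounded below) where the gradient changed sign."""
    sign_changed = grad * grad_prev < 0
    shrunk = np.maximum(0.5 * gamma_prev, GAMMA_MIN)
    return np.where(sign_changed, shrunk, gamma_prev)


def sign_update(grad, gamma):
    """Local weight update: a step of size gamma against the gradient sign."""
    return -np.sign(grad) * gamma


grad_prev = np.array([-0.2, 0.3, 0.1])   # hypothetical gradients at k-1
grad = np.array([0.4, 0.1, 0.0])         # hypothetical gradients at k
gamma = adapt_step_size(np.full(3, GAMMA_INIT), grad, grad_prev)
delta_w = sign_update(grad, gamma)
print(gamma)
print(delta_w)
```

Only the first weight sees a sign change, so only its step size is halved; the third weight has a zero gradient and therefore receives no update. Because the update depends on the sign of the gradient rather than its magnitude, the step sizes carry all the scale information.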
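The adaptive momentum rate $\mu_i(k)$ of the $i$th weight used in (4) and (7) is computed as the magnitude of the Quickprop step, bounded within the range [0.5, 1.5]:

$$\hat{\mu}_i(k) = \left| \frac{g_i(k)}{g_i(k-1) - g_i(k)} \right|, \tag{8}$$

$$\bar{\mu}_i(k) = \min\bigl(\hat{\mu}_i(k),\ 1.5\bigr), \tag{9}$$

$$\mu_i(k) = \begin{cases} \max\bigl(\bar{\mu}_i(k),\ 0.5\bigr), & \text{if } g_i(k)\,g_i(k-1) < 0, \\ 0, & \text{if } g_i(k)\,g_i(k-1) = 0. \end{cases} \tag{10}$$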
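Moreover, when there is a decrease in the current network error with respect to the previous error, a small percentage of the negative gradient is added to the weight:

$$\mathbf{W}(k+1) = \mathbf{W}(k+1) - \boldsymbol{\alpha}(k) \odot \mathbf{g}(k), \tag{11}$$

where $\boldsymbol{\alpha}(k)$ is a vector of learning rates, which are adapted using a principle similar to that of the SuperSAB method and bounded above by (13):

$$\tilde{\alpha}_i(k) = \begin{cases} \ \vdots & \\ \alpha_i(k-1), & \text{otherwise}, \end{cases} \tag{12}$$

$$\alpha_i(k) = \min\bigl(\tilde{\alpha}_i(k),\ 0.9\bigr). \tag{13}$$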
        $\tilde{\alpha}_i(k) \leftarrow \alpha_i(k-1)$
    end if
    $\alpha_i(k) \leftarrow \min(\tilde{\alpha}_i(k),\ 0.9)$
    $\Delta w_i(k) \leftarrow -\operatorname{sgn}(g_i(k))\,\gamma_i(k)$
    if $g_i(k)\,g_i(k-1) < 0$ and $E(k) > E(k-1)$ then
        $\Delta w_i(k) \leftarrow -\mu_i(k)\,\Delta w_i(k-1)$
    end if
    $w_i(k+1) \leftarrow w_i(k) + \Delta w_i(k) + \mu_i(k)\,\Delta w_i(k-1)$
    if $E(k) < E(k-1)$ then
        $w_i(k+1) \leftarrow w_i(k+1) - \alpha_i(k)\,g_i(k)$
    end if
end while
hidden layer was trained on a set of 5000 handwritten digit patterns, where 500 samples were taken from each digit class of the MNIST training set, based on a five-fold cross-validation procedure. In each fold, 4000 patterns were used for training and 1000 patterns for testing. For analysis purposes, the training mean square error (MSE), training time and number of training epochs were recorded in each fold and averaged across the five folds. As the training time depends on the machine used, we measure it in terms of the gradient descent epoch time unit, or gdeu. One gdeu is defined as the average time taken by the network to perform one gradient descent training epoch on a fixed training set and a fixed-size network, and it remains constant throughout the gradient descent training process. On a PC with a 3 GHz CPU and 2 GB RAM, using MATLAB as the programming language, one gdeu is approximately 42.5 seconds, based on a network with 1366 trainable parameters and a training set of 4000 samples.
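To relate gdeus to wall-clock time on the reference machine described above, a small conversion helper can be used. The function names are hypothetical, and the 250-gdeu budget is only an example value; on other hardware the seconds-per-gdeu constant would have to be measured again.

```python
SECONDS_PER_GDEU = 42.5  # reported for the reference machine and network


def gdeu_to_seconds(gdeus, seconds_per_gdeu=SECONDS_PER_GDEU):
    """Convert a training budget in gdeus to seconds."""
    return gdeus * seconds_per_gdeu


def seconds_to_gdeu(seconds, seconds_per_gdeu=SECONDS_PER_GDEU):
    """Convert measured training time in seconds to gdeus."""
    return seconds / seconds_per_gdeu


budget_s = gdeu_to_seconds(250)
print(budget_s, "seconds, about", round(budget_s / 3600, 2), "hours")
```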
Figure 3 shows that the two training methods converge at different speeds. In terms of the MSE as a function of the number of epochs, Fig. 3(a) shows that the LM algorithm converges faster than QRProp; however, in terms of training time, Fig. 3(b) shows that the MSE of the LM algorithm decreases more slowly than that of QRProp. Moreover, after a certain number of gdeus, the MSE of the LM algorithm remains constant, indicating that the training algorithm has reached a local minimum. On the other hand, the MSE of the QRProp method gradually decreases and becomes smaller than that of the LM algorithm. Another test was conducted to analyze the classification performance of the training algorithms. The results, based on five-fold cross-validation, are shown in Fig. 4. Since the LM algorithm converges in fewer epochs, the trained network yields higher classification accuracy after a few epochs; for instance, at 20 iterations, the trained network achieves a classification accuracy of 96.8% on the 4000 training patterns and 94.9% on the 1000 test patterns. However, the LM algorithm is known to have some shortcomings, such as the computation of the Hessian matrix and its storage. On a large training set of 60,000 samples, it is not possible to train a network with 2722 trainable parameters using the LM algorithm due to the huge amount of memory required to store the Jacobian and Hessian matrices. In contrast, the QRProp method requires only a few gradient and function evaluations to update the weights. Furthermore, when training for a longer period of time, say 250 gdeus, the classification performance achieved by QRProp on the test set is similar to that of the LM algorithm. Therefore, QRProp is chosen to train the SICoNNets for handwritten digit recognition.

Fig. 3. The convergence speed of the training algorithms (LM and QRProp) as a function of (a) number of training epochs and (b) number of gdeus.

Fig. 4. The classification accuracy of the training algorithms (LM and QRProp) versus the number of training epochs and the training time based on (a) the training set and (b) the test set.
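A rough estimate illustrates why the memory demand of LM becomes prohibitive on the full training set. The sketch below assumes ten network outputs, one Jacobian row per (sample, output) pair held in memory at once, and 8-byte floating-point values; none of these details are stated in the text, and the function name `lm_memory_bytes` is hypothetical.

```python
def lm_memory_bytes(n_samples, n_outputs, n_params, bytes_per_value=8):
    """Approximate storage for the LM Jacobian and the approximate Hessian."""
    jacobian_rows = n_samples * n_outputs          # one row per error term
    jacobian = jacobian_rows * n_params * bytes_per_value
    hessian = n_params * n_params * bytes_per_value
    return jacobian, hessian


jac_bytes, hess_bytes = lm_memory_bytes(60_000, 10, 2722)
print(f"Jacobian: {jac_bytes / 1e9:.2f} GB, Hessian: {hess_bytes / 1e6:.1f} MB")
```

Under these assumptions the Jacobian alone needs roughly 13 GB, well beyond the 2 GB of RAM on the machine used for the experiments, whereas a first-order method such as QRProp only has to keep a few vectors of the same length as the weight vector.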
5.2. Classification performance of three SICoNNet architectures
In this experiment, we train and evaluate the classification performance of three different SICoNNet architectures: fully-connected, binary-connected and toeplitz-connected. Each SICoNNet was trained on a set of 10,000 patterns and tested on the entire test set of the MNIST database. The classification rates of the different network architectures are presented in Table 2. Clearly, all three networks achieve classification rates higher than 90%. The best classification rate is 94.1%, achieved with the binary-connected network, followed by the toeplitz-connected network (93.6%) and the fully-connected network (90.2%).
Table 2. Classification accuracy (%) of the three SICoNNets.

Digit      Binary   Toeplitz   Full
0          98.3     97.1       95.9
1          97.9     96.7       96.5
2          95.1     96.6       92.4
3          92.2     95.0       86.6
4          93.0     92.8       89.1
5          91.3     92.2       85.2
6          95.9     95.5       93.7
7          93.4     91.0       89.0
8          93.0     89.3       84.7
9          91.8     90.1       88.2
Overall    94.1     93.6       90.2
Confusion matrix on the test set (rows: actual class; columns: predicted class).

Actual                      Predicted class                           Classification
class      0     1     2     3     4     5     6     7     8     9    rate (%)
0        970     0     1     0     0     0     6     2     1     0    99.0
1          0  1120     2     2     0     0     2     1     8     0    98.7
2          7     1  1001     6     1     0     0     3     5     0    97.8
3          0     0     5   982     0     7     0     6    10     0    97.2
4          1     0     3     0   954     0     4     1     0    19    97.1
5          3     0     2    11     1   862     5     1     4     3    96.6
6          8     3     4     0     2     2   935     2     2     0    97.6
7          1     2     8     6     2     0     0   999     0     6    97.6
8          4     2     1     6     4     5     1     3   945     3    97.0
9          4     4     0    10    11     7     1     5     8   959    95.0
Classification accuracy                                               97.3
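The per-class classification rates follow directly from the confusion matrix: each diagonal entry is divided by the total of its actual-class row. The sketch below assumes the orientation in which rows are actual classes and columns are predicted classes (the reading under which the row totals match the reported rates); the variable names are illustrative.

```python
import numpy as np

# Rows: actual class 0-9; columns: predicted class 0-9.
confusion = np.array([
    [970,    0,    1,   0,   0,   0,   6,   2,   1,   0],
    [  0, 1120,    2,   2,   0,   0,   2,   1,   8,   0],
    [  7,    1, 1001,   6,   1,   0,   0,   3,   5,   0],
    [  0,    0,    5, 982,   0,   7,   0,   6,  10,   0],
    [  1,    0,    3,   0, 954,   0,   4,   1,   0,  19],
    [  3,    0,    2,  11,   1, 862,   5,   1,   4,   3],
    [  8,    3,    4,   0,   2,   2, 935,   2,   2,   0],
    [  1,    2,    8,   6,   2,   0,   0, 999,   0,   6],
    [  4,    2,    1,   6,   4,   5,   1,   3, 945,   3],
    [  4,    4,    0,  10,  11,   7,   1,   5,   8, 959],
])

# Per-class rate: correct predictions divided by the row (actual-class) total.
per_class_rate = 100.0 * np.diag(confusion) / confusion.sum(axis=1)

# Most frequent confusion: largest off-diagonal entry.
off_diagonal = confusion.copy()
np.fill_diagonal(off_diagonal, 0)
actual, predicted = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print(np.round(per_class_rate, 1))
print("most frequent confusion: actual", actual, "predicted", predicted)
```

The largest off-diagonal entry corresponds to digit four predicted as nine, which is the kind of error illustrated in Fig. 5.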
Fig. 5. Examples of digit patterns in the test set that were misclassified: (a) digit four predicted as nine, and (b) digit nine predicted as four.
Classifier                           No. of feature maps/neurons   No. of trainable weights   Error rate (%)
3-layer MLP                          1160                          936,660                    2.95
LeNet-5 (Ref. 7)                     164                           60,000                     0.80
Boosted GCNN (Ref. 9)                176                           63,156                     0.68
CoNN with cross entropy (Ref. 10)    55                            127,540                    0.40
CoNN (Ref. 24)                       -                             18,370                     1.20
SICoNNet                             24                            2,722                      2.70
our knowledge and from the list,8 the most successful classifier reported to date was developed by Simard et al.10 with an error rate of 0.4%. However, their CoNN has the most trainable network parameters, apart from the three-layer MLP, with 127,540 trainable weights. This number of weights is computed from their given network structure, assuming that the 100 hidden neurons in the third hidden layer of the network are fully connected to the 50 feature maps of size 5 × 5 in the second hidden layer, and that each feature map has a single receptive field. Most of the classifiers based on convolutional neural networks have error rates of less than 1%, at the expense of having more than 10,000 trainable weights. Even though the same test set from the MNIST database is used to evaluate the performance of these networks, the size of the training set and the preprocessing applied to the training patterns are different; for example, LeNet-5 and the network implemented by Simard et al. were both trained on an augmented training set with artificially distorted versions of the original digit patterns so as to accommodate all forms of affine transformations. The proposed CoNN, on the other hand, was trained and tested on binary images, and its recognition error rate is lower than that of the MLP, but higher than those of the existing CoNNs. However, it has the fewest trainable weights, with 24 feature maps in the hidden layers behaving as feature detectors. To improve the performance of the proposed CoNN for this pattern recognition task, a large training set with distorted digit patterns can be used, and the network structure can be modified so that another classification layer is added between the last feature extraction layer and the output layer, as with two classification layers
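The figure of 127,540 trainable weights for the network of Simard et al. can be reproduced from the structure described in the text. The sketch below additionally assumes one bias per feature map and one bias per fully-connected neuron, which the text does not state; the helper names are hypothetical.

```python
def conv_layer_params(n_maps, rf_size):
    """Each feature map has a single rf_size x rf_size receptive field plus one bias."""
    return n_maps * (rf_size * rf_size + 1)


def dense_layer_params(n_inputs, n_neurons):
    """Fully-connected layer: one weight per input plus one bias per neuron."""
    return n_neurons * (n_inputs + 1)


layer1 = conv_layer_params(n_maps=5, rf_size=5)        # 5 maps of size 13 x 13
layer2 = conv_layer_params(n_maps=50, rf_size=5)       # 50 maps of size 5 x 5
layer3 = dense_layer_params(n_inputs=50 * 5 * 5, n_neurons=100)
output = dense_layer_params(n_inputs=100, n_neurons=10)

total = layer1 + layer2 + layer3 + output
print(layer1, layer2, layer3, output, total)
```

Almost all of the weights sit in the fully-connected third layer, which is why networks that keep the classification stage small, as the SICoNNet does, end up with far fewer trainable parameters.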
6. Conclusion
In this paper, we proposed the use of a new class of convolutional neural networks for handwritten digit recognition. These networks, known as shunting inhibitory convolutional neural networks, have a flexible network structure with three connection schemes: fully-connected, binary-connected and toeplitz-connected. A hybrid training method (QRProp), derived from existing first-order training algorithms, was used to train the networks for handwritten digit recognition. The performance of QRProp was compared to that of the Levenberg–Marquardt algorithm. Experimental results show that the QRProp method has better convergence speed than the LM algorithm in terms of training time, and achieves similar classification accuracy. Among the three different SICoNNet architectures (binary-, toeplitz-, and fully-connected networks), the binary-connected network has the best recognition rate. Evaluated on the MNIST database, a binary-connected network with 2722 trainable weights achieves a correct classification rate of 97.3%.
References
1. K. Fukushima, S. Miyake and T. Ito, Neocognitron: A neural network model for a mechanism of visual pattern recognition, IEEE Trans. Syst. Man Cybernet. SMC-13(5) (1983) 826–834.
2. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. 1(4) (1989) 541–551.
3. C. Garcia and M. Delakis, A neural architecture for fast and robust face detection, in Proc. Sixteenth Int. Conf. Pattern Recogn., Quebec, Canada 2 (2002) 44–47.
4. S. Lawrence, C. L. Giles, A. C. Tsoi and A. D. Back, Face recognition: a convolutional neural network approach, IEEE Trans. Neural Networks 8(1) (1997) 98–113.
5. B. Fasel, Multiscale facial expression recognition using convolutional neural networks, in Proc. Third Indian Conf. Comput. Vision, Graphics Image Process., Ahmedabad, India (2002).
6. S.-C. B. Lo, J.-S. J. Lin, M. T. Freedman and S. K. Mun, Application of artificial neural networks to medical image pattern recognition: Detection of clustered microcalcifications on mammograms and lung cancer on chest radiographs, J. VLSI Signal Process. Syst. 18(3) (1996) 263–274.
7. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86(11) (1998) 2278–2324.
8. Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist.
9. A. Calderón, S. Roa and J. Victorino, Handwritten digit recognition using convolutional neural networks and Gabor filter, in Proc. Int. Congr. Comput. Intell., Medellin, Colombia (2003).
10. P. Y. Simard, D. Steinkraus and J. C. Platt, Best practices for convolutional neural networks applied to visual documents analysis, Proc. Seventh Int. Conf. Document Anal. Recogn. 2 (2003) 958–962.
11. D. Gorgevik and D. Cakmakov, An efficient three-stage classifier for handwritten digit recognition, Proc. 17th Int. Conf. Pattern Recogn. 4 (2004) 507–510.
12. F. H. C. Tivive and A. Bouzerdoum, Efficient training algorithms for a class of shunting inhibitory convolutional neural networks, IEEE Trans. Neural Networks 16(3) (2005) 541–556.
13. L. J. Borg-Graham, C. Monier and Y. Fregnac, Visual input evokes transient and strong shunting inhibition in visual cortical neurons, Nature 393(6683) (1998) 369–373.
14. J. S. Anderson, M. Carandini and D. Ferster, Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex, J. Neurophysiol. 84 (2000) 909–926.
15. Y. Fregnac, C. Monier, F. Chavane, P. Baudot and L. Graham, Shunting inhibition, a silent step in visual computation, J. Physiol. 97 (2003) 441–451.
16. A. Bouzerdoum, A new class of high-order neural networks with nonlinear decision boundaries, in Proc. Sixth Int. Conf. Neural Inf. Process., Perth 3 (1999) 1004–1009.
17. A. Bouzerdoum, Classification and function approximation using feed-forward shunting inhibitory artificial neural networks, in Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural Networks (2000) 613–618.
18. F. H. C. Tivive and A. Bouzerdoum, A face detection system using shunting inhibitory convolutional neural networks, in Proc. Int. Joint Conf. Neural Networks 4 (2004) 2571–2575.
19. B. Fasel, Robust face analysis using convolutional neural networks, in Proc. Sixteenth Int. Conf. Pattern Recogn., Quebec, Canada 2 (2002) 40–43.
20. M. Riedmiller and H. Braun, A direct adaptive method for faster backpropagation learning: The RPROP algorithm, Proc. IEEE Int. Conf. Neural Networks (1993) 586–591.
21. S. Fahlman, An empirical study of learning speed in back-propagation networks, Carnegie Mellon University, Technical Report CMU-CS 88-162 (1988).
22. T. Tollenaere, SuperSAB: Fast adaptive BP with good scaling properties, Neural Networks 3 (1990) 561–573.
23. M. T. Hagan and M. Menhaj, Training feedforward networks with the Marquardt algorithm, IEEE Trans. Neural Networks 5 (1994) 989–993.
24. E. Poisson, C. V. Gaudin and P.-M. Lallican, Multi-modular architecture based on convolutional neural networks for online handwritten character recognition, Proc. 9th Int. Conf. Neural Inf. Process. 5 (2002) 2444–2448.