
Measuring image classification precision of neural networks with different architectures


Manuel Carbonell - Autonomous University of Barcelona
Master in Modelling for Science and Engineering
Tutor: Ruben Tous
May 31, 2016
www.github.com/manucarbonell/convnet
Abstract
In this article we study artificial neural network models applied to computer vision, and how modifications of their architecture affect training performance and prediction precision.

Contents
1 Introduction and previous concepts
  1.1 Motivation and objectives
  1.2 Artificial Intelligence
    1.2.1 History
  1.3 Feedforward neural networks
    1.3.1 Perceptrons
    1.3.2 Perceptron output with step function
    1.3.3 An example
    1.3.4 Layers
    1.3.5 Network training and sigmoid neurons
  1.4 Learning with gradient descent
    1.4.1 Cost function
    1.4.2 Backpropagation algorithm equations
    1.4.3 Backpropagation algorithm steps
  1.5 Types of layers
    1.5.1 Convolutional layers
    1.5.2 An example
    1.5.3 Pooling layers
    1.5.4 Rectified linear unit layers
    1.5.5 Local response normalization layers
    1.5.6 Softmax layers
  1.6 Why are CNNs so effective on image data?
  1.7 TensorFlow
2 Related work
  2.1 Universal approximation of functions
  2.2 Recurrent Neural Networks
  2.3 Deep Belief Networks (DBNs)
  2.4 Deep Dream
3 Methodology
  3.1 The dataset
  3.2 Data augmentation
  3.3 Hardware setup
  3.4 Implemented and customized programs
4 Results
  4.1 Single softmax layer network
  4.2 Convolution
  4.3 Convolution and pooling
  4.4 Two convolutional and pooling layers
  4.5 Image size augmentation
  4.6 Overfitting
  4.7 Data amount augmentation
  4.8 Batch size
  4.9 Extracted features
  4.10 Retrain ImageNet model
    4.10.1 Bottlenecks
5 Conclusions and future work

1 Introduction and previous concepts

1.1 Motivation and objectives

We first give a general contextualization and introduction to ANNs (artificial neural networks) and their components, and then describe how to train a fully connected network with the gradient descent method. We then check how essential each part of the network is by running the training algorithm with different setups. Along the way we introduce the recently launched library TensorFlow and discuss the obtained results.

1.2 Artificial Intelligence

Artificial neural networks are one of the trending research topics in artificial intelligence (AI). Among the main goals into which the simulation of intelligence is split, namely deduction, knowledge representation, planning, natural language processing (communication), perception, motion and manipulation, and learning, neural networks belong to the last one, more specifically called machine learning (ML). In that branch of AI we study algorithms that improve automatically through experience, by approximating functions according to given data, usually called training data; that is why, when running ANN algorithms, we say that the network is learning or being trained.

Machine learning is split into supervised, unsupervised and reinforcement learning. In unsupervised learning the objective is to infer a function that describes hidden structure in unlabeled data, e.g. finding similarity between data points, clustering data or reducing its dimension. Examples of unsupervised learning are k-means, maximum likelihood estimation and principal component analysis. Supervised learning consists of using labeled training data to estimate a map that returns the right label when receiving new data following the same pattern as the training set. Convolutional neural networks are an example of supervised learning, where for example we classify images by their label according to a labeled training data set. In reinforcement learning, programs are rewarded for taking actions in an environment so as to maximize some notion of cumulative reward.

There is disagreement about whether biological foundations are important for continuing the development of AI, as happens with ANNs, which are inspired by biological neurons, or whether it should be a completely independent research field, in the same way that bird biology contributes little to most aeronautical engineering ideas. Until now AI research has been mostly statistical. In many AI-specific tasks, e.g. recognizing a song, where a fingerprint of the audio frequencies is generated, a purely statistical method produces results similar to those a human would give. But are we then just doing classical statistics? We can point out some differences between classical statistics and AI:

The dimensions of the data: in classical statistics we have low-dimensional data sets, e.g. fewer than 100 dimensions; in AI we can have far more than that.
In classical statistics there is a lot of noise in the data, which might make it difficult to find structure; in AI the noise is not sufficient to hide the structure of the data when it is properly processed.
In classical statistics there is not much structure in the data, and if there is, it can be represented with a simple model with few parameters; in AI the structure is too complicated to be represented by a simple model with few parameters.
Usually the objective in classical statistics is to reveal a structure hidden by noise; in AI the objective is to present a complicated structure in a way that can be learned.
1.2.1 History

The first scientific approach to artificial neural networks was made by Warren McCulloch and Walter Pitts [1] in 1943. Their objective was to mathematically formalize the behavior of brain neurons when we perform logical reasoning or read the inputs of our senses.
In 1949 the neuroscientist Donald Hebb proposed a theory for the adaptation of the neurons in the brain during the learning process, which was named after him: Hebbian learning. He already thought of the idea of weights in the connections between neurons, which appear in present artificial neural network (ANN) models. The model states that the weight between two neurons increases if the two neurons activate simultaneously, and decreases if they activate separately. This is not how convolutional neural networks (CNNs) work, but it was a start.
In 1948 Alan Turing suggested a model of computation called the unorganized machine, thinking of the cortex of a human infant, which is largely random initially but can be trained to perform particular tasks. In this model Turing defined A-type machines, which consisted of randomly connected networks of NAND logic gates, and B-type machines, which were built by taking A-type machines and substituting inter-node connections with structures called connection modifiers, themselves made of A-type nodes. These connection modifiers were supposed to undergo appropriate interference, mimicking education.
Frank Rosenblatt first built an electronic device called the perceptron at the end of the 1950s, which represented a neuron as a logical gate with weights and a bias. The utility of the model, though, was not observable due to the lack of computing resources, which is why the idea of neural networks did not become a trend until recent times. It was in 1975 that Paul Werbos [4] thought of applying the backpropagation algorithm to find an optimal solution for the parameters of a neural network, which greatly improved the problem-solving capability of neural networks. After this advance much research was done in this field again, and the quality of the predictions gradually improved until the present time, when ANNs are the state of the art for image recognition, achieving human-level precision and replacing methods such as support vector machines or random forests, as happened in the ImageNet 2012 contest, where the winning image classification team used a convolutional neural network [2], standing out from all other methods. Next we introduce the components of a present-day convolutional neural network.

1.3 Feedforward neural networks

1.3.1 Perceptrons

An artificial neural network is a graph whose nodes are a special kind of logical gates called perceptrons or artificial neurons, which have parameters that allow their behavior to change so that patterns can be recognized in data sets. They are called neural networks because the original model was inspired by animal neurons, which have the property of gradually changing chemically during synapses and, when connected, manage to extract abstract concepts; the same happens in the artificial neural network model, where values of parameters are changed instead of physical and chemical properties. Perceptrons receive many inputs and compute an output by means of weights and biases. We can imagine perceptrons as decision-taking units that consider different sources of information and give different importance to these sources.

Figure 1: Animal neurons gradually change with synapses.

Figure 2: Perceptron with 3 inputs $x_i$, weights $w_i$ and 1 output $a$.

1.3.2 Perceptron output with step function

We first represent the output $a$ of a perceptron with inputs $\{x_i\}_{i=1}^m$ weighted by $\{w_i\}_{i=1}^m$ and bias $b$ with the step function:

$$a(x_1,\dots,x_m,w_1,\dots,w_m,b) = \begin{cases} 0 & \text{if } \sum_{i=1}^m x_i w_i < b \\ 1 & \text{if } \sum_{i=1}^m x_i w_i \geq b \end{cases} \quad (1)$$
1.3.3 An example

A real-life example of a perceptron could be the decision of whether or not to enroll in a given master's program. The input variables could be:
Whether the subjects are of your interest.
Whether the references you have about job possibilities after the master are good or bad.
Whether the university is close to your place.
You would take a positive decision if the sum of the weighted input variables is greater than a given bias. Let's say you give a weight of 5 to the interest of the subjects, 4 to the quality of the job opportunities and 3 to the closeness of the university. Then, if the bias is for example 5, it is enough for the subjects to have interesting content to enroll in the master, but if the bias is for example 7, then two conditions would have to be positive to enroll.
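To make the computation concrete, the following small Python sketch (an illustration added here, not one of the thesis programs) evaluates the step-function perceptron of equation (1) with the weights of the enrollment example:

import numpy as np

def perceptron(x, w, b):
    """Step-function perceptron: fires (returns 1) when the weighted
    sum of the inputs reaches the bias/threshold b, as in equation (1)."""
    return 1 if np.dot(x, w) >= b else 0

# Weights from the enrollment example: subject interest, job prospects, proximity.
w = np.array([5, 4, 3])

x = np.array([1, 0, 0])          # only the subjects are interesting
print(perceptron(x, w, b=5))     # 1: enroll when the bias is 5
print(perceptron(x, w, b=7))     # 0: a bias of 7 needs two positive inputs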
From now on we will use the following notation for inputs, weights and biases: $w^l := (w^l_{jk})$ is the weight matrix for layer $l$, where each element $w^l_{jk}$ denotes the weight of the connection from the $k$-th neuron in the $(l-1)$-th layer to the $j$-th neuron in the $l$-th layer. $x := (x_1,\dots,x_m)$ denotes the input vector, and $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ is the weighted input to the activation function for neuron $j$ in layer $l$.

Figure 3: Step function.

1.3.4 Layers

Perceptrons are organized in different layers, which represent levels of abstraction of the decision taking, from lower to higher. That means the first layers recognize concrete, non-general patterns in the data, and the last layers give an abstract classification of the data, such as: is the picture a 0 or a 1, or is there an eye in this picture? Many layers together form a neural network.
Figure 4: Neural network with 4 layers: an input layer, two hidden layers and an output layer.

The first layer is called the input layer; it receives the information that has to be processed by the network (in our case, image pixel intensities). Next come the hidden layers. In this example picture we show only two hidden layers, but networks with, for example, 12 hidden layers can be found. Hidden layers process the input layer outputs so that the output layer can give a final result.
1.3.5 Network training and sigmoid neurons

Our purpose is to get the network to give the output we want when we present a given input. For this we proceed with a method called network training or learning. The idea of training the network consists of giving a large amount of input data together with the expected output results, and adapting the network parameters to fit these expected results as well as possible. Let's say we want the network to recognize whether faces appear in a picture; then we give as input many pictures that contain a face and many pictures that don't, together with the labels of the images telling whether a face actually appears. For every picture we gradually modify the weights and biases of our network so that the output more and more frequently matches the label. To do that we use the gradient descent method, which we explain below. After the training, the network should be able to recognize whether there is a face in a new input picture with high accuracy. The perceptrons described above are an intuitive, approximate and simple neuron model that was later developed into sigmoid neurons, which we define below. The output function of a sigmoid neuron is not the one described in (1), since with a step function the behavior of the network would be chaotic when modifying weights and biases. We therefore introduce sigmoid neurons, which compute their output with the sigmoid function, a smooth version of the step function:
$$\sigma(z) = \frac{1}{1+e^{-z}} \quad (2)$$

Figure 5: Sigmoid function.

The reason such a function is chosen for the perceptron output is its smooth shape and the property that makes it similar to the step function: for weighted input values much greater than the bias (i.e. $z \to \infty$) the output is close to 1, and equivalently for $z \to -\infty$ the output is close to 0. The important difference with the step function is that now, when we slightly change the weights and biases of a perceptron, the output also changes only slightly, due to the continuity and smoothness of the sigmoid function. This will allow us to search for the optimal weights and biases of each perceptron, such that we get the targeted output for a given input, by using the gradient descent method. Putting together the definitions of neural network and sigmoid neuron, the activation of the $j$-th neuron in layer $l$ is:

$$a^l_j = \sigma\!\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right) = \sigma(z^l_j) \quad (3)$$

where $w^l_{jk}$ is the weight connecting the activation of the $k$-th neuron of the $(l-1)$-th layer to the $j$-th neuron of the $l$-th layer, and $b^l_j$ is the bias of the $j$-th neuron in the $l$-th layer.
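A minimal NumPy sketch of equations (2) and (3) for a single layer (again an illustration, not the thesis code) could look as follows:

import numpy as np

def sigmoid(z):
    # Logistic sigmoid, equation (2).
    return 1.0 / (1.0 + np.exp(-z))

def layer_activation(a_prev, W, b):
    """Activations of one sigmoid layer, equation (3).
    a_prev: activations of layer l-1, shape (k,)
    W: weight matrix of layer l, shape (j, k)
    b: biases of layer l, shape (j,)"""
    z = W @ a_prev + b
    return sigmoid(z)

# Example: 3 inputs feeding a layer of 2 sigmoid neurons.
rng = np.random.default_rng(0)
a0 = rng.random(3)
W1, b1 = rng.standard_normal((2, 3)), np.zeros(2)
print(layer_activation(a0, W1, b1))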

1.4 Learning with gradient descent

1.4.1 Cost function

We denote the target or desired output of the network when $x$ is the input by $y(x)$, and the neuron output by $a(x,w,b) = a(z)$ (the desired output does not depend on the weights and biases of the neurons, but the neuron output does). We want the training algorithm to determine which weights and biases best approximate the outputs $a(x,w,b)$ to $y(x)$ for all inputs $x$.
We define a cost function (also called a loss function) as a measurement of the goodness of fit of a neural network with weights $w$ and biases $b$ to a target $y(x)$. First we have the quadratic cost function:

$$C(w,b) = \frac{1}{2n}\sum_x \left(a(x,w,b) - y(x)\right)^2 \quad (4)$$

where $n$ is the number of training inputs and the sum is over each input $x$. In the quadratic cost function we can easily observe the main properties that any cost function should have:

It is positive over the whole domain.
The more the outputs of the network differ from the labels, the higher the value taken by the function.
The cost function can be written as an average $C = \frac{1}{n}\sum_x C_x$ over the costs $C_x$ of individual training examples $x$.
It can be written as a function of the outputs of the neural network.

From now on we will write $a$ instead of $a(z)$ and $y$ instead of $y(x)$ to ease the reading.
Cost functions are defined with the objective of finding some weight values $w$ and biases $b$ such that the output $a$ is as frequently as possible the same as the target $y$; equivalently, of finding a minimum of the function $C$ by varying $w$ and $b$. We could use an analytic method, solving the equation that sets the gradient to zero to find local minima and checking which one is the lowest, but since the number of variables is going to be very large and the shape of the function tends to be quite complicated, that method would be too costly and we would probably not get close to the real minimum; that is why we use the gradient descent method instead.
The gradient descent method consists of gradually getting closer to a local or absolute minimum $(w_0, b_0)$ of the function by subtracting the gradient scaled by a small value called the learning rate, based on the fact that in a multidimensional scalar function the gradient vector indicates the direction of maximum growth, so the opposite vector indicates the direction of maximum descent. So in each step of the gradient descent method the weights and biases change following

$$(w,b)_{n+1} = (w,b)_n - \eta\,\nabla C_x(w,b) \quad (5)$$

where $w$ is the weight vector, $b$ the biases, $\eta$ the learning rate, $x$ a fixed input and $C_x(w,b)$ the cost function.
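The update rule (5) can be sketched in a few lines of Python on a toy cost function whose minimum is known (the learning rate and step count below are arbitrary illustrative values):

import numpy as np

def gradient_descent(grad, params, eta=0.1, steps=100):
    """Repeatedly apply the update (5): params <- params - eta * grad(params)."""
    for _ in range(steps):
        params = params - eta * grad(params)
    return params

# Toy cost C(w) = ||w - 3||^2 / 2, whose gradient is (w - 3); minimum at w = 3.
grad = lambda w: w - 3.0
print(gradient_descent(grad, params=np.zeros(2)))  # converges towards [3., 3.]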

1.4.2 Backpropagation algorithm equations

Now the question is: how do we calculate the gradient of this cost function, whose concrete expression we do not even know? The answer is to use the backpropagation algorithm. Before describing it we need to define a couple of equations.
We define the error of a neuron $j$ in layer $l$ as the variation of the cost function with respect to the weighted input plus bias of that neuron:

$$\delta^l_j := \frac{\partial C}{\partial z^l_j} \quad (6)$$

For two matrices $A$, $B$ of the same dimensions $m \times n$, the Hadamard product, or element-wise matrix product, $A \odot B$, is a matrix of the same dimension as the operands, with elements given by

$$(A \odot B)_{i,j} = (A)_{i,j}(B)_{i,j} \quad (7)$$

Claim 1. Let $L$ be the number of layers of a neural network and $j$ one of its neurons; then we have the following equality for the neuron error:

$$\delta^L_j = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j) \quad (8)$$

or, in matrix form, $\delta^L = \nabla_a C \odot \sigma'(z^L)$.

Proof. Let us first apply the definition of error to a neuron in layer $L$:

$$\delta^L_j = \frac{\partial C}{\partial z^L_j} \quad (9)$$

We have to develop the right-hand side of this equality until we reach the right-hand side of equation (8). If we differentiate the cost function applying the chain rule, taking into account that the activations $a^L_k$ of the neurons in layer $L$ depend on $z^L_j$, we get the intermediate step

$$\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k}\frac{\partial a^L_k}{\partial z^L_j} \quad (10)$$

where the sum is over all the neurons in the output layer. The activation of a neuron in a given layer only depends on the input received by that same neuron, not on the other neurons of the layer, so the term $\partial a^L_k / \partial z^L_j$ is zero when $k \neq j$. In consequence we have

$$\delta^L_j = \frac{\partial C}{\partial a^L_j}\frac{\partial a^L_j}{\partial z^L_j} \quad (11)$$

but from (3) we know that $\partial a^L_j / \partial z^L_j = \sigma'(z^L_j)$, which finishes the proof.
Claim 2. The errors of two consecutive neural network layers are related by the following equality:

$$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) \quad (12)$$

where $(w^{l+1})^T$ is the transpose of the weight matrix of layer $l+1$.
Proof. Taking into account the relation between the weighted input of a neuron in a layer and the weighted inputs of the previous layer, we can write

$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k}\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j}\,\delta^{l+1}_k \quad (13\text{--}15)$$

and, from the definitions of $z$ and of the activations,

$$z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj}\,\sigma(z^l_j) + b^{l+1}_k \quad (16)$$

If we differentiate we get

$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\,\sigma'(z^l_j) \quad (17)$$

Substituting back into the previous expression we get

$$\delta^l_j = \sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'(z^l_j) \quad (18)$$

and, in matrix form,

$$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) \quad (19)$$

Claim 3. We have the following equalities for the gradient components:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j; \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j$$

Proof. For the first equality we differentiate the expression $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ with respect to $b^l_j$:

$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\frac{\partial z^l_j}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\cdot 1 = \delta^l_j \quad (20\text{--}21)$$

and for the second:

$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j}\frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j\,\frac{\partial}{\partial w^l_{jk}}\!\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right) = a^{l-1}_k\,\delta^l_j \quad (22\text{--}25)$$

1.4.3 Backpropagation algorithm steps

Now that we have shown all the necessary equations, we can list the steps of the backpropagation algorithm to calculate the gradient of the cost function. We denote by $a^{x,l} = (a^l_1, \dots, a^l_n)$ the vector of neuron activations in layer $l$ when $x$ is the input, where $a^l_j$ is defined in (3).

Input: for all neurons in the input layer, set the neuron values $a^1$ to the corresponding pixel intensities of the example image.

Feedforward: for each layer $l \in \{2, \dots, L\}$ compute
$$z^l = w^l a^{l-1} + b^l \quad \text{and} \quad a^l = \sigma(z^l)$$

Output error $\delta^L$: calculate the output error
$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$

Backpropagation: after calculating the last layer error, we backpropagate it down to the first layers:
$$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) \quad \text{for } l \in \{L-1, \dots, 2\}$$

Output gradient components: we compute the components of the cost function gradient as given in Claim 3:
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j; \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j$$

This process is done for all examples $x$ in a given subset of the training set, usually called a batch, and then the weights and biases are updated (gradient descent step):

$$w^l \to w^l - \frac{\eta}{m}\sum_x \delta^{x,l}\,(a^{x,l-1})^T \quad (26)$$

$$b^l \to b^l - \frac{\eta}{m}\sum_x \delta^{x,l} \quad (27)$$
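The listed steps can be condensed into the following NumPy sketch for a fully connected sigmoid network with quadratic cost (an illustration only; the thesis experiments rely on TensorFlow rather than hand-written backpropagation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """One backpropagation pass for a fully connected sigmoid network with
    quadratic cost, returning the gradients of Claim 3 for a single example."""
    # Feedforward: store all z^l and a^l.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output error, equation (8), with dC/da = (a^L - y) for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [np.outer(delta, activations[-2])]
    grads_b = [delta]
    # Backpropagate the error with equation (12).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads_W.insert(0, np.outer(delta, activations[-l - 1]))
        grads_b.insert(0, delta)
    return grads_W, grads_b

# Tiny 3-4-2 network on one random example.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
gW, gb = backprop(rng.random(3), np.array([1.0, 0.0]), weights, biases)
print([g.shape for g in gW])  # [(4, 3), (2, 4)]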

Notice that the output error is very simple to calculate in the case of a quadratic cost function:

$$\delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y)\,\sigma(z^L)(1-\sigma(z^L)) \quad (28)$$

since

$$\sigma'(z) = \frac{\partial}{\partial z}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \sigma(z)(1-\sigma(z)) \quad (29)$$

The backpropagation algorithm description, concretely equation (8), tells us that the variation of weights and biases in every step depends on the derivative of the output function, which in general will be the sigmoid. Looking at the limits of the sigmoid derivative we see that $\lim_{z\to\infty}\sigma'(z) = \lim_{z\to-\infty}\sigma'(z) = 0$, so for very large or very small values of $z$ the variation of the cost will be very small. This sometimes happens and the learning process gets increasingly slower; when it does, we say that the neuron is saturated. To avoid the learning slowdown caused by neuron saturation, we use the cross-entropy cost function

$$C = -\frac{1}{n}\sum_x \sum_j \left[ y_j \ln a^L_j + (1-y_j)\ln(1-a^L_j) \right] \quad (30)$$

The motivation to use such a function is that, in addition to fulfilling the desired properties of a cost function mentioned before, its partial derivatives do not depend on the derivative of the activation function $\sigma'(z)$, which as we said causes the saturation. Indeed, if we calculate the derivative of the cross-entropy cost with respect to a weight:

$$\frac{\partial C}{\partial w_j} = -\frac{1}{n}\sum_x \left(\frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)}\right)\frac{\partial \sigma}{\partial w_j} \quad (31)$$

$$= -\frac{1}{n}\sum_x \left(\frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)}\right)\sigma'(z)\,x_j \quad (32)$$

$$= \frac{1}{n}\sum_x \frac{\sigma'(z)\,x_j}{\sigma(z)(1-\sigma(z))}\left(\sigma(z)-y\right) \quad (33)$$

$$= \frac{1}{n}\sum_x x_j\left(\sigma(z)-y\right) \quad (34)$$

and with respect to the bias:

$$\frac{\partial C}{\partial b} = \frac{1}{n}\sum_x \left(\sigma(z)-y\right) \quad (35)$$
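As a small numerical illustration of equation (30) (hypothetical helper code, not part of the thesis programs):

import numpy as np

def cross_entropy_cost(a, y):
    """Cross-entropy cost of equation (30) for a batch of outputs.
    a, y: arrays of shape (n_examples, n_outputs), with a in (0, 1)."""
    return -np.mean(np.sum(y * np.log(a) + (1 - y) * np.log(1 - a), axis=1))

a = np.array([[0.9, 0.1], [0.2, 0.8]])   # network outputs
y = np.array([[1.0, 0.0], [0.0, 1.0]])   # targets
print(cross_entropy_cost(a, y))          # small value: outputs close to targets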

In any case, when we use linear neurons, that is, neurons with a non-constant linear activation function, neuron saturation does not happen, because the derivative of the activation is neither zero nor asymptotically close to it, so in that case we could use the quadratic cost function.

1.5 Types of layers

We now have a general idea of how a feedforward neural network works. We focus next on some types of layers contained in feedforward neural networks. We studied several of them; all are implemented in the library we used, but there was not enough time to test all of them.
1.5.1 Convolutional layers

In the previous description we supposed that each perceptron of a layer was connected to all perceptrons of the next layer, but that is not how convolutional networks are generally built. For image recognition, the training performance and the accuracy of the network predictions improve if we use so-called local receptive fields. The relation between two pixels on opposite sides of the image is clearly not as important as the relation between two pixels that are next to each other: when we recognize a pattern in an image we scan the image looking for concrete shapes in parts of the image. This is what characterizes convolutional layers, where a reduced-size region of the input image layer is connected to a single neuron of the hidden layer next to it.

Sliding the local receptive field, also called kernel, to the right by one neuron, or by any number of neurons defined as the stride, we connect the new region obtained with the next hidden neuron, saving its activation, and do so for the whole layer.

So if, for example, we choose a region of $5\times 5$ neurons, the activation described in equation (3) becomes

$$a^{l+1}_{jk} = \sigma\!\left(b + \sum_{m=0}^{4}\sum_{n=0}^{4} w_{m,n}\, a^l_{j+m,\,k+n}\right) \quad (36)$$

The previous operation is called convolution, which gives its name to the network model. We notice that the weights and biases are shared for all local receptive fields, so with this process we check to what degree a feature is present all across the image, and slightly modify it during the learning process. This way we also greatly reduce the number of parameters compared with a fully connected network, and obtain more meaningful information for each neuron in the hidden layer.
We call the map from local receptive fields to a hidden layer a feature map. A convolutional layer can have many feature maps; this way we can recognize different shapes in the images. So to speak, each feature map tells us whether a given pattern is present in a region, with a real value between 0 and 1.
With the kernel defined as above, the output of a convolutional layer would have smaller dimensions than the input, but there are cases when this is not desired. In some cases we extend the edges of the layer with padding values so that the output has the same size and shape as the input.
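The following sketch shows the convolution of equation (36) with shared weights and 'SAME' padding using the current TensorFlow API (the thesis used the 0.8 release, whose calls differ in detail, so this is only an approximation of the actual code):

import numpy as np
import tensorflow as tf

# One 24x24 RGB image (batch of 1) and 64 kernels of 5x5, as in the first
# convolutional layer described in the Results section.
images = tf.constant(np.random.rand(1, 24, 24, 3), dtype=tf.float32)
kernels = tf.Variable(tf.random.truncated_normal([5, 5, 3, 64], stddev=0.05))
biases = tf.Variable(tf.zeros([64]))

# Stride of one neuron in each direction; 'SAME' padding keeps the 24x24 size.
conv = tf.nn.conv2d(images, kernels, strides=[1, 1, 1, 1], padding='SAME')
activations = tf.sigmoid(conv + biases)   # equation (36) at every position
print(activations.shape)                  # (1, 24, 24, 64)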
1.5.2 An example

Let's say, for example, that we want a network to recognize whether an eye is present in a picture. A feature map with a 4-neuron local receptive field, telling whether certain sub-features are present in the right positions, would indicate a case in which an eye is present. The previous layer would have other feature maps to recognize whether those sub-features are present or not, and so on for every lower level of abstraction until we get to the input layer with the image data.
1.5.3 Pooling layers

After a convolutional layer we usually have pooling layers, which simplify the information of the previous layer. A commonly used one is the max-pooling layer, which takes the maximum value of the activation in a given region, say $2\times 2$ neurons, of the previous layer:

$$a^{l+1}_{jk} = \max_{m,n \in \{0,1\}} \left\{ a^l_{2j+m,\,2k+n} \right\} \quad (37)$$

The idea of max-pooling is to summarize in a layer the most relevant information of the feature maps, namely whether they appear or not in an approximate part of the image, since we do not care about the exact position of a feature when we are looking for patterns.
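A corresponding max-pooling sketch with the current TensorFlow API (again only an approximation of the 0.8-era code used in the thesis):

import tensorflow as tf

# 2x2 max-pooling with a stride of 2, as in equation (37): each 2x2 block of
# activations is replaced by its maximum, halving the spatial resolution.
activations = tf.random.uniform([1, 24, 24, 64])   # e.g. a conv layer output
pooled = tf.nn.max_pool2d(activations, ksize=2, strides=2, padding='SAME')
print(pooled.shape)   # (1, 12, 12, 64)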

1.5.4 Rectified linear unit layers

As we explained before, when using a sigmoid output for all the neurons in a layer, it can happen that many neurons become saturated, due to the shape of this output function. Rectifier layers are characterized by having their neurons' activation function defined as

$$f(x) = \max(0, x) \quad (38)$$

Using this kind of output we avoid saturation, so these layers are usually combined with convolutional layers with sigmoid function. A smooth approximation to the rectifier is the analytic function

$$f(x) = \ln(1 + e^x) \quad (39)$$

which is called the softplus function. These layers have the property of accelerating the learning process, that is, achieving a lower cost value in fewer steps.
1.5.5 Local response normalization layers

The local response normalization layers perform a kind of lateral inhibition by normalizing over local input regions. After Krizhevsky's work [3] we know that they should improve classification quality. Their activation function is given by

$$a^i_{x,y} = z^i_{x,y} \Bigg/ \left(k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} (z^j_{x,y})^2\right)^{\beta} \quad (40)$$

where $z^i_{x,y}$ is the activity of a neuron computed by applying kernel $i$ at position $(x,y)$ and then applying the ReLU nonlinearity, and $a^i_{x,y}$ is the response-normalized activity. The sum runs over $n$ adjacent kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer. Details about the other parameters ($k$, $\alpha$, $\beta$) can be found in [3].
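For illustration, TensorFlow exposes this operation directly; the parameter values in this sketch are arbitrary examples, not the ones used in our experiments:

import tensorflow as tf

# Local response normalization over 64 feature maps, equation (40);
# the values of depth_radius, bias, alpha and beta are illustrative only.
activations = tf.random.uniform([1, 24, 24, 64])
normalized = tf.nn.local_response_normalization(
    activations, depth_radius=4, bias=1.0, alpha=0.001, beta=0.75)
print(normalized.shape)   # (1, 24, 24, 64)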

1.5.6 Softmax layers

Softmax layers transform the activations of the previous layer into a probability distribution while keeping the same information. Each neuron of the softmax layer has the following activation function:

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \quad (41)$$

The activations $z^L_j$ of the previous layer are not necessarily between 0 and 1, nor do they sum to 1 over the whole layer, so with the softmax layer we make sure we have a better representation of the probability that the image belongs to a particular class. We will actually not count softmax as a layer that helps train the network, but as a layer that helps make the classification results human-readable.
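Equation (41) in a few lines of NumPy (illustrative only):

import numpy as np

def softmax(z):
    # Equation (41); subtracting the maximum is a standard numerical safeguard.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits), softmax(logits).sum())   # probabilities summing to 1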

1.6 Why are CNNs so effective on image data?

We studied convolutional networks because they have been a major advance for working on image, sound and video data. But why are they mostly used on this kind of data? The most plausible reason is that such data share a common fundamental property: local stationarity and a multi-scale compositional structure, which allows expressing long-range interactions in terms of shorter, localized interactions. That is, video and image data have the particular property that, if we look at the data closely enough, there is usually a smoothness of values; for example, in any part of an image, if we zoom in close enough we see that the color changes gradually from one value to another. This smoothness of values might be an advantage that makes gradient descent training work well. By multi-scale compositional structure we mean that at different scales of observation there are always patterns that let us identify concepts: looking closely, types of edges and points; looking from afar, different complex geometrical shapes.

1.7 TensorFlow

To run the operations needed to train a neural network we used Google's recently launched open source deep learning library TensorFlow. TensorFlow is an open source software library for numerical computation using data flow graphs. It plays a role similar to previous libraries with related purposes, such as Theano or scikit-learn. In each graph, nodes represent mathematical operations, from simple ones like matrix multiplication and addition to more complex ones like convolution or softmax; graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows deploying the computation to one or more CPUs or GPUs. TensorFlow already provides some implemented convolutional networks to classify images and for other prediction tasks.
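A minimal example of such a data flow graph, written with the current TensorFlow API (the 0.8 release used explicit sessions, so this is only an approximation of the style used in the thesis code):

import tensorflow as tf

# A tiny data-flow graph: nodes are operations (matmul, add, softmax),
# edges carry tensors.
x = tf.constant([[1.0, 2.0, 3.0]])          # a 1x3 input tensor
W = tf.ones([3, 10]) * 0.1                  # weights for 10 classes
b = tf.zeros([10])                          # biases
logits = tf.matmul(x, W) + b                # matrix product plus bias
probs = tf.nn.softmax(logits)               # probability over the 10 classes
print(probs.numpy().round(2))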

2 Related work

2.1 Universal approximation of functions

The way artificial neural networks have evolved until today, where they are useful for solving many classification problems with good results, was heuristic; it was not determined analytically, with deductive steps, that neural networks could properly model certain types of data such as audio and video. However, it can be analytically proven that linear combinations of sigmoid functions can uniformly approximate any continuous function, which tells us that we could approximate any data set with neural networks. Details can be found in [6]. In spite of this formalization of the function approximation capability of ANNs, it is accepted that they have a black-box nature in terms of feature extraction. The interpretation of the learned weights and biases is not exactly known, although we could observe that basic shapes that might be present in images are identified as features.

2.2 Recurrent Neural Networks

In our work we used feedforward neural networks throughout, which propagate the activations in one direction, but it is also worth noting that there are other commonly used types of ANNs, such as recurrent neural networks, in which the connections form a directed cyclic graph.

2.3 Deep Belief Networks (DBNs)

The main condition for using CNNs is to have labeled data, which in most real-life cases we do not. Sometimes we have similar kinds of problems that need to be solved in an unsupervised way, and to do this we can use deep belief networks, which are the unsupervised learning version of artificial neural networks. Another inconvenience with CNNs and backpropagation is that the weights and biases can get stuck in a poor local optimum, keeping the model far from good prediction results. To overcome these limitations Smolensky [7] thought of a network that learns hidden patterns in the data. The idea is to have only one visible layer and many hidden ones, and to infer the states of the hidden variables for some visible variable states, being later able to generate new samples of visible variables. In the case of images, we would learn the probability of some features appearing in a given image, without it being labeled.
DBNs are composed of Restricted Boltzmann Machines (RBMs). RBMs are a version of ANNs, simpler than CNNs, that learn a probability distribution over a set of inputs. In the case of image sets, the network learns a set of features given an input image dataset. This can be used to initialize the feature values of deep neural networks. RBMs have only 2 layers: a visible one with $m$ neurons (in this method also called units) and a hidden one with $n$ units, with binary boolean values. As in CNNs, there is a weight matrix $W = (w_{i,j})$ of size $m \times n$, where $w_{i,j}$ determines the weight of the connection between visible unit $v_i$ and hidden unit $h_j$, and also biases, $a_i$ for the visible units and $b_j$ for the hidden units. In RBMs we have a function that associates a scalar value called energy to each configuration of the variables:
$$E(v,h) = -\sum_{i=1}^m a_i v_i - \sum_{j=1}^n b_j h_j - \sum_{i=1}^m \sum_{j=1}^n v_i w_{i,j} h_j \quad (42)$$

or, in matrix notation,

$$E(v,h) = -a^T v - b^T h - v^T W h \quad (43)$$

Learning corresponds to modifying that energy function so that its shape has desirable properties: we would like plausible or desirable configurations to have low energy. We also have a probability distribution for each configuration of the network, which depends on the energy function:

$$P(v,h) = \frac{1}{Z}\,e^{-E(v,h)} \quad (44)$$

where $Z = \sum_{(v,h)} e^{-E(v,h)}$ is a normalizing constant ensuring that the probability distribution sums to 1; the sum is over all possible configurations of visible and hidden units. Plausible configurations should have a higher probability value, that is, a lower energy. In a similar way, the probability of a given visible unit vector is the normalized sum of the exponentials of the energy over all possible hidden unit configurations:

$$P(v) = \frac{1}{Z}\sum_h e^{-E(v,h)} \quad (45)$$

The activations of the visible units, as well as those of the hidden units, are independent within their own layer; that is why these machines are called restricted. For this reason the conditional probability of a configuration of the visible units $v$, given a configuration of the hidden units $h$, is

$$P(v|h) = \prod_{i=1}^m P(v_i|h) \quad (46)$$

and, the other way around, the conditional probability of $h$ given $v$ is

$$P(h|v) = \prod_{j=1}^n P(h_j|v) \quad (47)$$

The individual activation probabilities are given by

$$P(h_j = 1\,|\,v) = \sigma\!\left(b_j + \sum_{i=1}^m w_{i,j} v_i\right) \quad (48)$$

and

$$P(v_i = 1\,|\,h) = \sigma\!\left(a_i + \sum_{j=1}^n w_{i,j} h_j\right) \quad (49)$$

where $\sigma$ is the sigmoid function described in the introduction. Given a training set $V$, the idea is to maximize the product of configuration probabilities $P(v)$ by varying the weights, that is, to find

$$\arg\max_W \prod_{v\in V} P(v) \quad (50)$$

or, equivalently, to maximize the expected log probability of $V$:

$$\arg\max_W \; \mathrm{E}\!\left[\sum_{v\in V} \log P(v)\right] \quad (51)$$
To do that, the Stochastic Maximum Likelihood (SML) or Persistent Contrastive Divergence (PCD) algorithm is commonly used. This is the equivalent of the backpropagation algorithm, likewise performed inside gradient descent to find the optimal weights. The algorithm computes a negative and a positive gradient to calculate the gradient descent step. The procedure for a sample of values of the visible layer is as follows:

Take a training sample $v$, compute the probabilities of the hidden units and sample a hidden activation vector $h$ from this probability distribution.
Compute the positive gradient, which is the outer product $vh^T$.
From $h$, sample a reconstruction $v'$ of the visible units, then resample the hidden activations $h'$ from this (Gibbs sampling step).
Compute the negative gradient $v'h'^T$.
Let the weight update to $w_{i,j}$ be the positive gradient minus the negative gradient, times some learning rate: $\Delta w_{i,j} = \epsilon\,(vh^T - v'h'^T)$.
In a similar way we update the biases $a$ and $b$.

This algorithm is implemented in the scikit-learn class BernoulliRBM and, as an example, we can see the extracted features $W$ when running it with the MNIST data set as input.
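The sketch below reproduces that kind of experiment on a small scale with scikit-learn's BernoulliRBM, using the small digits data set bundled with scikit-learn as a stand-in for MNIST (illustrative code only):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

# Learn RBM features on the small scikit-learn digits set; pixel values are
# scaled to [0, 1] as BernoulliRBM expects.
X = load_digits().data
X = (X - X.min()) / (X.max() - X.min())

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

# Each row of components_ is a learned feature W_j over the 8x8 input pixels.
print(rbm.components_.shape)   # (64, 64)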

Composing many RBMs together, making each hidden layer the visible layer of another RBM, we form a deep belief network, which is able to extract features at different levels of abstraction from the data. To train a deep belief network we would proceed as follows:

Given an input data sample $X$, train a restricted Boltzmann machine on $X$ to obtain its weight matrix $W$, and use it as the weight matrix between the lowest two layers of the network.
Then transform $X$ by the RBM to produce a new data sample $X'$, either by sampling or by computing the mean activation of the hidden units.
Repeat the procedure with $X \leftarrow X'$ for the next pair of layers, until the top two layers of the network are reached.
Finally, fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).

2.4 Deep Dream

Have you ever thought about something or someone for so long that, for a second, you have the feeling you see it even when it is not there? This is another curious application of deep neural networks: the generation of images reminiscent of hallucinations. The idea in Deep Dream is to maximize the activations of certain layers' features in a network that is already trained, and mix the detected features with an input image. To do this, gradient ascent is used, which is the opposite idea to gradient descent: instead of subtracting the gradient at each step we add it, in this case to the input image, to obtain a higher activation value. The result of doing this on an image of Barcelona's skyline from Parc Güell, with a layer of an Inception network that detects the presence of canines and other animals, illustrates these hallucination-like images.

3 Methodology

Our goal was to build our own CNN, inspired by examples that already perform prediction effectively, and to understand their architecture. To do that we use the TensorFlow library and take ideas from its available examples.

3.1 The dataset

The first proposed objective was to train a convolutional network to classify a dataset of food pictures extracted from Instagram and Google Images into the following 10 classes:
Beer (0)
Burger (1)
Coffee (2)
Croissant (3)
Fried Eggs (4)
Other (5)
Paella (6)
Pizza (7)
Sushi (8)
Wine (9)

Figure 8: A sample of our initial data set

The aim of the class Other is to make the model able to tell when a picture does not belong to any of the food classes. The dataset is composed of both Instagram photos and web images. The Instagram photos were obtained from the Instagram API, filtered with user-defined tags and manually purged. As user-defined tags are very noisy, this method proved to be inefficient and very time-consuming. In order to facilitate the generation of more ground-truth annotations and a larger training dataset, we also obtained images from Google Images through the Custom Google Search API. This method, which allowed us to automatically annotate a bigger set of images, turned out to be very useful, as almost all the retrieved images showed the desired food category and minimal manual purging was required. The first model that we are going to build is a single-layer neural network. The images of the dataset have no specific size or format; we store them in a .bin file as records holding 32x32-pixel images with 3 RGB channels. Then, to feed the network, we randomly crop them into 24x24-pixel images to expand the data set size.
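These .bin records can be read back, for instance, with the following NumPy sketch (a hypothetical helper; it assumes the record layout described later in section 3.4, one label byte followed by the pixel bytes, and the pixel ordering inside a record is an assumption):

import numpy as np

# One label byte followed by 32*32*3 pixel bytes, so each record is 3073 bytes.
HEIGHT, WIDTH, CHANNELS = 32, 32, 3
RECORD_BYTES = 1 + HEIGHT * WIDTH * CHANNELS

def read_records(path):
    """Yield (label, image) pairs from a .bin file with the layout above.
    (The exact ordering of pixel bytes inside a record is assumed here.)"""
    raw = np.fromfile(path, dtype=np.uint8).reshape(-1, RECORD_BYTES)
    for record in raw:
        label = int(record[0])
        image = record[1:].reshape(HEIGHT, WIDTH, CHANNELS)
        yield label, image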

3.2 Data augmentation

The data set is enlarged by randomly distorting the training images: each stored picture is randomly cropped, flipped and whitened before being fed to the network, as described in sections 3.4 and 4.7.
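A sketch of such distortions with the current TensorFlow image ops (similar in spirit to the distortion function of ManuNet_input.py, but not the actual thesis code):

import tensorflow as tf

# Random 24x24 crop, random horizontal flip and per-image whitening applied
# to one stored 32x32 RGB image (illustrative TF2 pipeline).
image = tf.random.uniform([32, 32, 3])                 # stand-in for a record
cropped = tf.image.random_crop(image, size=[24, 24, 3])
flipped = tf.image.random_flip_left_right(cropped)
whitened = tf.image.per_image_standardization(flipped)
print(whitened.shape)   # (24, 24, 3)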

3.3 Hardware setup

The models were trained on a high-end server with a quad-core Intel i7-3820 at 3.6 GHz with 64 GB of DDR3 RAM, and 4 NVIDIA Tesla K40 GPU cards with 12 GB of GDDR5 each, connected through PCIe 3.0 in x16 mode (containing two PCIe switches). The machine runs a GNU/Linux system with Linux kernel 3.12 and NVIDIA driver 340.24. We performed experiments with different configurations (downscaling sizes, data augmentation, different number and composition of layers, different layer geometries, etc.). We also tested aspects with no impact on classification accuracy but with practical implications, such as different input formats (TFRecords, compressed NumPy arrays, etc.) or different hardware configurations (one or more CPUs and GPUs, etc.). Our runs include an extensive set of configurations; for brevity, when parameters were shown to be either irrelevant or to have a negligible effect, we use default values. Each experimental configuration was repeated at least 5 times. Unless otherwise stated, we report median values in seconds.

3.4 Implemented and customized programs

Our program contains the following parts:

Data format transformer - folder2bin.py:
This script reads images located in a folder containing as many subfolders as classes of pictures (10) and returns a .bin file with all the image data stored in records of length n = picture width x picture height x number of color channels + 1 bytes, where the first byte is the class label of the picture and the rest of the bytes are the pixel intensities. We also implemented the opposite step in bin2folder.py.
Input reader - ManuNet_input.py:
This script contains functions to read the .bin data files using a queue of image examples, returning tensors of a given batch size containing the image arrays and, separately, the labels. It also contains a function to distort images (randomly crop, flip and whiten them) to enlarge the data set.
Model - ManuNet.py:
The network implementation of ManuNet, a customized version of the CIFAR-10 network [2]. This program allows us to extract our data set from our GitHub repository (www.github.com/manucarbonell/datasets) and then build a convolutional network with different architectures to perform experiments. Depending on the value of a mode parameter, we build a network with:

1 fully connected sigmoid layer
1 convolutional layer followed by pooling
2 convolutional layers with pooling and normalization
2 convolutional layers with pooling and normalization followed by 2 fully connected layers

It contains a function to save summaries of the cost (the value of the cost function) during the training process, which can be observed later with TensorBoard, a platform to visualize the learning in an interactive way. In the end we did not use TensorBoard, since we preferred to generate our own custom graphs and training and evaluation outputs. It also contains the function called during the training process that builds the model and performs the backpropagation algorithm with exponential learning rate decay.
As we said before, in each step of the backpropagation algorithm we update the weights and biases in the direction opposite to the cost function gradient, multiplied by a scalar, the learning rate. If we used a constant learning rate, we would soon stop getting closer to the cost function minimum, as we would keep overshooting it, in the same way that it would be difficult to put the ball in the hole using only the driver when playing golf. So after a given number of epochs NUM_EPOCHS_PER_DECAY we decay the learning rate exponentially using the factor LEARNING_RATE_DECAY_FACTOR; a sketch of this schedule is shown after this list. In the function train() the training step is defined, calculating the loss and applying the computed gradient. The different kinds of layer ops used are described in the TensorFlow library documentation: https://www.tensorflow.org/versions/r0.8/api_docs/python/nn.html.
Network training - ManuNet_train.py:
This script calls the input reader, builds the network graph, calls the training step from the model program and iterates the process, saving the results in a file. The graph is saved in a .ckpt file so it can be read when evaluating the model. During training, the loss value, steps and execution time are saved.

Train network and track precision - ManuNet_train_eval.py:
A modified version of the previous program that saves the prediction precision using both train and test data separately. Model inferences are grouped in a scope to allow reusing variables.

Evaluate model - ManuNet_eval.py:
Returns the precision of our model's inference over test data.

Evaluation sample - ManuNet_eval_sample.py:
Performs model inference over the desired number of batches and saves the images with their predicted and correct labels.

Generate confusion matrix - ManuNet_eval_byclasses.py:
This program performs inference of image labels over the desired number of batches and returns a matrix where each entry a_ij is the portion of examples that were labeled as class i and predicted as class j; this way, values on the diagonal a_ii give the precision for class i.


Extract features - ManuNet_get_features.py:
Performs inference and saves the features extracted by the convolutional layer kernels as images.

4 Results

4.1 Single softmax layer network

The first and simplest possible approach we took was to train the network with a single softmax layer, that is, a single matrix product of the image data with the weights plus the addition of biases. Since our dataset is quite noisy and not very large, and without extracting features at different levels of abstraction, the first result is not going to be very good; if the model learns anything at all, it can already be considered an achievement.

Figure 9: 1 softmax layer ANN.

After running the experiment we can see the results of running stochastic gradient descent with a single-layer network in figure 10. Clearly we are not getting close to a minimum of the cost function, since after some steps the cost is barely decreasing. When checking the precision of the predictions, feeding the network with 600 test images and dividing the number of correct classifications by the total number of classifications, we get a poor score of 44% (44 out of 100 images are classified correctly).

Figure 10: Training loss in each step for a network with no hidden layers.

So, with only a softmax layer receiving the weighted product of the input layer plus biases, the model does learn something, since a random classification would get around 10% precision, but it does not get much better than that. Looking at the evolution of the loss value, we leave this hello-world experiment without spending time on runs with more steps, and move on to the layers that give their name to the networks we are studying. So let's see how the classification improves with a convolutional and, later, a pooling layer.

4.2 Convolution

After the simplest approach of a network with no hidden layers, we see how the model does with only one hidden convolutional layer. The layer is going to extract 64 kernels of 5x5 neurons, with a stride of a single neuron in each direction and a padding which makes the output of the convolutional layer have the same size as the input layer. We can think of the convolutional layer as a prism, since it extracts the presence information of 64 features for every part of the image; this information is saved in a 3d tensor (24x24x64).

Figure 11: Convolution network simple structure.

The loss value continues decreasing over a longer training (in the simplest model it almost stopped decreasing in the first 10.000 steps) and ends up with an average value of 0.3. Training time was 2 hours 56 minutes to complete 50.000 steps, which we chose as a training length sufficient for loss stabilization. The predictions make the expected jump in precision, to 85.5% accuracy, in contrast with the previous model's result.

Figure 12: Loss function values during training, in steps and time

4.3 Convolution and pooling

Now we check the effect of adding a pooling layer after the convolutional layer. This time the cost decreases quite a bit faster and reaches a value close to 0.16 after 50.000 training steps, in contrast with the 0.3 of the previous model. In terms of precision we get up to an 88% score, which is actually not bad taking into account that the images are not as simple as, for example, handwritten digits, and that our network has only 2 layers. If we randomly choose a sample of the classifications we see that effectively more than 8 out of 10 images are classified in the right group. The real label of each picture is shown after "L:" and the prediction computed by the network after "P:".

Figure 13: Convolution network with 1 conv. layer and pooling.

L: 0, P: 0    L: 1, P: 1    L: 2, P: 2    L: 3, P: 3    L: 4, P: 8
L: 5, P: 5    L: 6, P: 6    L: 7, P: 6    L: 8, P: 8    L: 9, P: 9

We can also see the precision results in the confusion matrix, which tells us, for each class, which portion of the predictions were correct and which fell in wrong classes; rows are real class labels and columns are class predictions. For example, we can observe that 15% of the fried eggs were classified as sushi. That might be because of the similarity of colors and shapes (white ovals surrounded by black are present in both classes). So maybe we should raise the number of layers and features detected in our network, or change other parameters, to let the network tell the difference between those classes. Apart from this, there is no notable confusion (greater than 10%) between other classes.
Table 1: Confusion matrix (rows: real class labels, columns: predicted classes)

       0     1     2     3     4     5     6     7     8     9
0   0.87  0.02  0.00  0.02  0.00  0.08  0.02  0.00  0.00  0.00
1   0.00  0.85  0.02  0.04  0.00  0.04  0.00  0.02  0.02  0.00
2   0.08  0.00  0.77  0.00  0.00  0.11  0.02  0.00  0.03  0.00
3   0.02  0.02  0.00  0.78  0.06  0.04  0.00  0.06  0.02  0.00
4   0.02  0.00  0.02  0.05  0.72  0.02  0.00  0.02  0.15  0.00
5   0.01  0.01  0.00  0.00  0.00  0.94  0.01  0.01  0.01  0.00
6   0.00  0.02  0.00  0.03  0.02  0.00  0.90  0.02  0.01  0.00
7   0.00  0.02  0.00  0.02  0.02  0.00  0.05  0.85  0.05  0.00
8   0.01  0.00  0.01  0.01  0.01  0.06  0.01  0.04  0.83  0.00
9   0.00  0.00  0.04  0.00  0.00  0.10  0.00  0.00  0.00  0.86

Figure 15: Training cost depending on time and steps

So, taking a look at the classifications, we could say that with 1 convolutional and pooling layer the network can already classify images with a clear color and shape pattern, but has difficulties with classes that contain many colors and complicated shapes, so clearly the idea of convolution is the main key of the learning process of the network. In terms of time, it took us 2 hours and 24 minutes to train the network using the GPU cluster.
It is worth remarking that in this model we are estimating a function with a total of 32x32 + 5x5x64 = 2624 parameters, so we can imagine the complexity of the computation; this is what we meant by the differences between classical statistics and machine learning.

4.4 Two convolutional and pooling layers

Now that we have seen that the main improvement comes from adding a convolutional layer, let's see whether we get another big improvement after adding a second convolutional layer.

Figure 16: Convolution network with 2 convolution and pooling layers

Again we will extract 64 features, which turned out to be a good amount in our reference work [3]. Comparing the cost during training with that of the network with a single convolutional layer, we can see that this time the learning curve is not as steep at the beginning; it decreases at a more continuous rhythm during the whole training, and with fewer oscillations. So we could say that adding a convolutional layer gives us a more stable training process. In terms of precision, in the first 10.000 steps we get almost the same value (0.87). If we do the training process with the GPU version of TensorFlow we can raise the number of steps to 50.000 to see the difference between 1 and 2 convolutional layers.


Figure 17: Training cost depending on time and steps

Here we have the confusion matrix results when training a network with 2
convolutional layers.
Table 2: Confusion matrix for network with 2 convolution and normalization layers

       0     1     2     3     4     5     6     7     8     9
0   0.86  0.00  0.02  0.00  0.00  0.08  0.02  0.00  0.00  0.02
1   0.00  0.79  0.00  0.04  0.00  0.05  0.02  0.04  0.05  0.00
2   0.10  0.02  0.70  0.03  0.03  0.07  0.00  0.00  0.00  0.05
3   0.02  0.04  0.00  0.82  0.04  0.00  0.02  0.04  0.02  0.00
4   0.02  0.00  0.00  0.04  0.85  0.02  0.00  0.02  0.04  0.00
5   0.01  0.01  0.00  0.00  0.00  0.93  0.02  0.00  0.01  0.00
6   0.00  0.00  0.00  0.05  0.02  0.00  0.91  0.00  0.02  0.00
7   0.02  0.03  0.00  0.01  0.00  0.00  0.06  0.85  0.03  0.00
8   0.00  0.00  0.00  0.02  0.00  0.02  0.00  0.01  0.94  0.00
9   0.00  0.00  0.00  0.00  0.00  0.09  0.05  0.00  0.00  0.86

It takes 2 hours 51 minutes to train the network with 2 convolutional + pooling layers, but the precision results are almost the same. There are some changes in the confusion, but the overall precision score is still 88%, which makes us wonder why so many widely used models stack repeated convolutional layers to improve the classifications. It is possible that our dataset is not big enough for the difference to be noticeable when adding repeated layers. Another parameter we took from previous works is the cropping size, which was 32x32 for all images, but maybe using such small versions of the pictures prevents our network from recognizing some features that need more resolution to be recognized.

4.5 Image size augmentation

So the next approach is to train the network with two convolutional, pooling and normalization layers on a higher-resolution version of the same dataset. We choose a size that does not make the computation too slow but whose difference in resolution is quite noticeable: 48x48 after the cropping step. With this size, the precision of the predictions rises to 89.99%, which is quite a significant improvement. So we can state that higher resolution together with more convolutional layers gives better classification results. It also takes a long computation time to run these experiments: for the last one (2 convolutional layers, 48x48 pixels) it took 8 hours 21 minutes to complete the 50.000 training steps. The confusion matrix shows us that some class predictions are now highly accurate, but for some of them the neural network still lacks a complete understanding of the patterns.

       0     1     2     3     4     5     6     7     8     9
0   0.83  0.00  0.04  0.02  0.02  0.07  0.02  0.00  0.00  0.00
1   0.00  0.82  0.00  0.06  0.00  0.04  0.00  0.04  0.04  0.00
2   0.02  0.02  0.88  0.00  0.00  0.08  0.00  0.00  0.00  0.00
3   0.00  0.00  0.02  0.83  0.04  0.00  0.04  0.02  0.04  0.00
4   0.00  0.00  0.05  0.02  0.81  0.02  0.00  0.02  0.07  0.00
5   0.01  0.00  0.00  0.00  0.00  0.96  0.01  0.00  0.01  0.00
6   0.00  0.02  0.02  0.02  0.02  0.02  0.86  0.05  0.00  0.00
7   0.02  0.03  0.00  0.02  0.00  0.02  0.00  0.90  0.02  0.00
8   0.00  0.00  0.01  0.01  0.00  0.00  0.01  0.00  0.96  0.00
9   0.00  0.00  0.04  0.00  0.00  0.09  0.00  0.00  0.00  0.87

Table 3: Confusion matrix of the 2-convolutional-layer model trained with 48x48 images (rows: real class labels, columns: predicted classes)

4.6 Overfitting

The plotted loss values are calculated with training data, but it would be good to see how the classification precision evolves during the training process, to know whether the network is actually learning the concepts behind the data or only memorizing the training data set.

Figure 18: Precision of classifications with the training data set in green and the test data set in red.

To do that we take a look at the evolution of the precision on training and test data. In figure 18 we can see that there is a point where the precision of the classifications stops improving; that means that our network is overfitting, or that some of the neurons are saturated. To avoid this we can try adding a normalization layer.


4.7 Data amount augmentation

In all those training procedures we used a routine implemented in the CIFAR-10 model which enlarges the data set by randomly cropping a part of each image to obtain a new one. First we cropped 32x32 images into 24x24, and later 64x64 into 24x24. We can see this has a positive effect, because if we run the network with the 64x64 dataset without cropping we drop back down to 87% precision, so it indeed helped to have more examples obtained by distorting the original images.

4.8 Batch size

Initially we set a commonly used batch size of 128 examples per batch, but we wanted to see how this affects training time and classification precision. We compare the training evolution and results for the last network model, which consisted of two convolutional layers, each followed by pooling and normalization. In figure 19 we see that the loss takes slightly less oscillating and lower values from step 30.000 on. We can see it more clearly in a close-up of the last 1.000 steps of training in figure 20.

Figure 19: Training loss with batch size of 128 examples in blue and 64 in red.


Figure 20: Training loss in the last 1000 steps for a batch size of 128 examples in blue and 64 in red.

The big difference comes in training time, roughly half as long for the smaller batch size; we can see the comparison in minutes in figure 21.

Figure 21: Training loss depending on time for batch sizes 128 in blue and 64 in red.

The precision of the predictions stays at 89% in both cases, so the conclusion would be that for our data set it is clearly better to set a smaller batch size such as 64 examples per batch. In future work we could consider other batch sizes, but this one seems to produce a good enough result.

4.9 Extracted features

Here are some features extracted by our network, in this case the 64 kernels of the first convolutional layer (5x5 weights). As usually happens with CNNs, we cannot know why a particular shape or kind of feature is learned during the process, but intuitively, for some of them, it looks like the network learns basic shapes to recognize the boundaries of the objects in the images.

Figure 22: Extracted kernels in first convolutional layer, in a model with 2 convolutional layers.

For each of the previous features we get a tensor with the shape of the image containing the activations for that feature, as explained in section 1.5.1, since we are using padding to obtain an output with the same shape as the input.

Figure 23: Output of the first convolution layer for the model of section 4.4.

For example, in figure 23 we can see the output of some images after convolution for the first extracted feature, for each color channel.

4.10 Retrain ImageNet model

One may ask oneself: is it normal to need to see 1000 images of an elephant before being able to recognize one when seeing it again? Perhaps in the very strange case in which you had just been given your sight and an elephant is the first thing you ever saw; otherwise it should not be necessary. Apparently the same happens with CNN learning: once the network has learned many visual concepts, it becomes easier each time to learn new ones. So, after seeing some results with a self-trained CNN, we move to this approach, which is to retrain a large ImageNet model to recognize the pictures of our data set. This technique is called transfer learning or convolutional network fine-tuning. After learning about the work of Donahue et al. [5], and the option to load an Inception neural network with TensorFlow to use the learned features on your own data set, we checked whether this is a better approach than training a network only with our own data.
4.10.1 Bottlenecks

The idea consists of loading the graph of an ImageNet Inception network which is already trained (concretely, on 1000 classes) and, using the learned features, performing training only for the new classes to recognize, thus avoiding a long training process. To do that we have to adapt the last layer of the trained graph, before the softmax, to the newly added classes. Doing this with the ImageNet Inception v3 model provided by TensorFlow, we achieve a precision of 91.2% in only 17 minutes. So we can definitely state that knowing the patterns of many image classes greatly helps to learn new classes faster and better, as happens with biological neural networks.
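The bottleneck idea can be sketched with the modern tf.keras API as follows (the thesis used the TensorFlow 0.8-era Inception retraining flow, so this is only an analogous illustration, not the actual procedure used):

import tensorflow as tf

# Reuse a pretrained Inception v3 as a fixed feature extractor and train
# only a new softmax layer for the 10 food classes.
base = tf.keras.applications.InceptionV3(weights='imagenet',
                                         include_top=False, pooling='avg')
base.trainable = False                       # freeze the learned features

inputs = tf.keras.Input(shape=(299, 299, 3))
features = base(inputs, training=False)      # the "bottleneck" activations
outputs = tf.keras.layers.Dense(10, activation='softmax')(features)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=...) would then train only
# the new layer on the 10-class food data set.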

5 Conclusions and future work

After various unexpected accuracy values and training times, we can say that, although some of the insights behind convolutional neural networks are still not clear, they do work quite well for image classification. For example, we cannot state that deeper networks will always give better classification results, or that higher image resolution always will; it will be like this sometimes, but not always, as we saw in the results. Also, after looking for documentation on this aspect, we saw that many of the parameters such as batch size or learning rate used in state-of-the-art networks are determined as we proceeded, by trial and error, so it remains an open question why a specific number of layers, and of which depth, works better. We also observed that, as expected, using a previously trained net that already classifies 1000 classes gives much better accuracy for our dataset than networks trained from scratch only with our dataset, while also consuming much less time. So, now that object classification and recognition in images with CNNs is close to being solved, we would continue by exploring other applications such as online user behavior prediction or improving speech recognition, or by looking into which other research areas this machine learning technique can also be useful for.


References

[1] McCulloch, Warren; Pitts, Walter (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics.

[2] Krizhevsky, Alex (2009). Learning Multiple Layers of Features from Tiny Images.

[3] Krizhevsky, A.; Sutskever, I.; Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks.

[4] Werbos, P. J. (1975). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.

[5] Donahue, Jeff; Jia, Yangqing; Vinyals, Oriol; Hoffman, Judy; Zhang, Ning; Tzeng, Eric; Darrell, Trevor (2013). A Deep Convolutional Activation Feature for Generic Visual Recognition.

[6] Cybenko, George (1989). Approximation by Superpositions of a Sigmoidal Function.

[7] Smolensky, Paul (1986). Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory.

