Contents
1 Introduction and previous concepts
1.1 Motivation and objectives
1.2 Artificial Intelligence
1.2.1 History
1.3 Feedforward neural networks
1.3.1 Perceptrons
1.3.2 Perceptron output with step function
1.3.3 An example
1.3.4 Layers
1.3.5 Network training and sigmoid neurons
1.4 Learning with gradient descent
1.4.1 Cost function
1.4.2 Backpropagation algorithm equations
1.4.3 Backpropagation algorithm steps
1.5 Types of layers
1.5.1 Convolutional layers
1.5.2 An example
1.5.3 Pooling Layers
1.5.4 Rectifier linear units layers
1.5.5 Local Response Normalization layers
1.5.6 Softmax layers
1.6 Why are CNN so effective on image data?
1.7 Tensorflow
2 Related work
2.1 Universal approximation of functions
2.2 Recurrent Neural Networks
2.3 Deep Belief Networks (DBNs)
2.4 Deep Dream
3 Methodology
3.1 The dataset
3.2 Data augmentation
3.3 Hardware setup
3.4 Implemented and customized programs
4 Results
4.1 Single softmax layer network
4.2 Convolution
4.3 Convolution and pooling
4.4 Two convolutional and pooling layers
4.5 Image size augmentation
4.6 Overfitting
4.7 Data amount augmentation
4.8 Batch size
4.9 Extracted features
4.10 Retrain ImageNet Model
4.10.1 Bottlenecks
1 Introduction and previous concepts
1.1 Motivation and objectives
1.2 Artificial Intelligence
Artificial neural networks are one of the trending research topics in artificial intelligence (AI). The simulation of intelligence is usually split into several main goals: deduction, knowledge representation, planning, natural language processing (communication), perception, motion and manipulation, and learning. Neural networks belong to the last one, more specifically called machine learning (ML). In that branch of AI we study algorithms that improve automatically through experience by approximating functions from given data, usually called training data. That is why, when running ANN algorithms, we say that the network is learning or being trained.
Machine learning is split into supervised, unsupervised and reinforcement learning. In unsupervised learning the objective is to infer a function that describes hidden structure in unlabeled data, e.g. finding similarities between data points, clustering data or reducing its dimension. Examples of unsupervised learning methods are k-means, maximum likelihood estimation and principal component analysis. Supervised learning consists of using labeled training data to estimate a map that returns the right label when receiving new data that follows the same pattern as the training set. Convolutional neural networks are an example of supervised learning, where for example we classify images by their label according to a labeled training data set. In reinforcement learning, programs take actions in an environment and are rewarded for them, with the aim of maximizing some notion of cumulative reward.
There is disagreement about whether biological foundations are important for the further development of AI, as is the case with ANNs, which are inspired by biological neurons, or whether AI should be a completely independent research field, in the same way that bird biology contributes little to most ideas in aeronautical engineering. Until now AI research has been mostly statistical. In many specific AI tasks, e.g. recognizing a song, where a fingerprint of the audio frequencies is generated, a purely statistical method produces results similar to those a human would give. But are we then just doing classical statistics? We can find some differences between classical statistics and AI:
The dimension of the data. In classical statistics we have low-dimensional data sets, e.g. fewer than 100 dimensions; in AI we can have many more than that.
In classical statistics there is a lot of noise in the data, which might make it difficult to find a structure; in AI the noise is not sufficient to hide the structure of the data when properly processed.
In classical statistics there is not much structure in the data and, if there is, it can be represented with a simple model with few parameters; in AI the structure is too complicated to be represented with a simple model with few parameters.
Usually the objective of classical statistics is to reveal a structure hidden by noise; in AI the objective is to present a complicated structure in a way that can be learned.
1.2.1 History
The first scientific approach to artificial neural networks was made by Warren McCulloch and Walter Pitts [1] in 1943. Their objective was to formalize mathematically the behavior of brain neurons when we perform logical reasoning or when we read the inputs of our senses.
In 1949 neuroscientist Donald Hebb proposed a theory for the adaptation of the neurons in the brain during the learning process, which was named after him: Hebbian learning. He already thought of the idea of weights in the connections between neurons, which appear in present artificial neural network (ANN) models. The model states that the weight between two neurons increases if the two neurons activate simultaneously, and decreases if they activate separately. This is not how convolutional neural networks (CNN) work, but it was a start.
In 1948 Alan Turing suggested a model of computation called the unorganized machine, thinking of the cortex of a human infant, which is largely random initially but can be trained to perform particular tasks. In this model Turing defined A-type machines, which consisted of randomly connected networks of NAND logic gates, and B-type machines, which were built by taking A-type machines and substituting the inter-node connections with structures called connection modifiers, which were made of A-type nodes. These connection modifiers were supposed to undergo appropriate interference, mimicking education.
Frank Rosenblatt first created an electronic device called the perceptron in 1960, which represented a neuron as a logical gate with weights and a bias. The utility of the model, though, could not be observed due to the lack of computing resources, which is why the idea of neural networks did not become popular until recent times.
In 1975 Paul Werbos [4] thought of applying the backpropagation algorithm to find an optimal solution for the parameters of a neural network, which greatly improved its problem-solving capability. After this advance much research was again done in this field, and the quality of the predictions gradually improved up to the present, where ANNs are the state of the art for image recognition, achieving human-level precision and replacing methods such as support vector machines or random forests. This happened in the ImageNet 2012 contest, where the winning team in image classification used a convolutional neural network [2] that stood out from all other methods. Next we introduce the components of a present-day convolutional neural network.
1.3 Feedforward neural networks
1.3.1 Perceptrons
An artificial neural network is a graph whose nodes are a special kind of logical gates called perceptrons or artificial neurons. They have parameters that allow their behavior to change so that patterns can be recognized in data sets.
They are called neural networks because the original model was inspired by animal neurons, which change gradually and chemically when forming synapses and, once connected, are able to extract abstract concepts. The same happens in the artificial neural network model, where the values of parameters are changed instead of physical and chemical properties.
Figure 1: Animal neurons gradually change with synapses
Perceptrons receive many inputs and compute an output by means of weights and biases. We can imagine perceptrons as decision-making units that consider different sources of information and give different importance to each of them.
[Figure: perceptron diagram with inputs x_i, weights w_i and output a]
1.3.2 Perceptron output with step function
1.3.3 An example
1.3.4 Layers
[Figure: example network diagram with hidden layers and output layer]
The first layer is called the input layer, which receives the information that has to be processed by the network (in our case image pixel intensities). Next come the hidden layers. In this example picture we only show two hidden layers, but we can find networks with 12 hidden layers, for example. Hidden layers process the outputs of the input layer to give the output layer a final result.
1.3.5 Network training and sigmoid neurons
Our purpose is to get the network to give the output we want when
we give it a certain input. For this we will proceed with a method called
The reason why such a function is chosen for the perceptron output is its smooth shape and the property that makes it similar to the step function: when the inputs multiplied by the weights are much greater than the bias (i.e. $z \to \infty$) the output is close to 1, and equivalently when $z \to -\infty$ the output is close to 0. The important difference from the step function is that now, when we slightly change the weights and biases of a perceptron, the output also changes slightly, due to the continuity and smoothness of the sigmoid function. This allows us to use the gradient descent method to search for the optimal weights and biases of each perceptron, such that we get the targeted output for a given input. Putting together the definitions of neural network and sigmoid neuron, the activation of the $j$th neuron in layer $l$ is:
\[ a^l_j = \sigma\Big(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\Big) = \sigma(z^l_j) \tag{3} \]
where $w^l_{jk}$ is the weight applied to the activation of the $k$th neuron of the $(l-1)$th layer on its way into the $j$th neuron of the $l$th layer, and $b^l_j$ is the bias of the $j$th neuron in the $l$th layer.
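As an illustration of equation (3), the following minimal sketch computes the activations of one layer of sigmoid neurons in plain Python. The function names and the numeric values of the weights, biases and inputs are hypothetical and chosen only for the example.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_activation(weights, biases, prev_activations):
    """Return a^l from w^l (one row per neuron), b^l and a^(l-1), as in equation (3)."""
    activations = []
    for w_row, b in zip(weights, biases):
        # Weighted input z^l_j = sum_k w^l_jk * a^(l-1)_k + b^l_j
        z = sum(w * a for w, a in zip(w_row, prev_activations)) + b
        activations.append(sigmoid(z))
    return activations

# Hypothetical layer: 2 neurons, each reading 3 activations from the previous layer.
w = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
b = [0.0, 0.1]
a_prev = [1.0, 0.5, 0.25]
a = layer_activation(w, b, a_prev)
```

For very large positive weighted inputs the output approaches 1 and for very negative ones it approaches 0, which is the step-like behavior described above.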
1.4 Learning with gradient descent
1.4.1 Cost function
We denote the target output or desired output of the network when $x$ is the input by $y(x)$, and the neuron output by $a(x, w, b) = a(z)$ (the desired output doesn't depend on the weights and biases of the neurons, but the neuron output does). We want the training algorithm to determine which weights and biases best approximate the outputs $a(x, w, b)$ to $y(x)$ for all inputs $x$.
We define a cost function (also called loss function) as a measurement of the goodness of fit of a neural network with weights and biases $w$ and $b$ to a target $y(x)$. First we have the quadratic cost function:
\[ C(w, b) = \sum_x \frac{1}{2n} \big(a(x, w, b) - y(x)\big)^2 \tag{4} \]
where $n$ is the number of training inputs and the sum is over each input $x$. In the quadratic cost function we can easily observe the main properties that any cost function should have:
It is non-negative on the whole domain of the function.
The more the outputs of the network differ from the labels, the higher the value taken by the function.
The cost function can be written as an average $C = \frac{1}{n}\sum_x C_x$ over cost functions $C_x$ for individual training examples $x$.
It can be written as a function of the outputs of the neural network.
From now on we will write $a$ instead of $a(z)$ and $y$ instead of $y(x)$ to ease the reading.
Cost functions are defined with the objective of finding weights $w$ and biases $b$ such that the output $a$ matches the target $y$ as often as possible or, equivalently, of finding a minimum of the function $C$ by varying $w$ and $b$. We could use an analytic method, setting the gradient equal to zero to find the local minima and checking which one is the lowest. However, since the number of variables is going to be very large and the shape of the function tends to be quite complicated, that method would be too costly and we would probably not get close to the real minimum. That is why we are going to use the gradient descent method instead.
The gradient descent method consists of gradually getting closer to a local or absolute minimum $(w_0, b_0)$ of the function by subtracting the gradient scaled by a small value called the learning rate. It is based on the fact that for a scalar function of several variables the gradient vector indicates the direction of maximum growth, so the opposite vector indicates the direction of maximum descent. In each step of the gradient descent method the weights and biases change according to:
\[ (w, b)_{n+1} = (w, b)_n - \eta \nabla C_x(w, b) \tag{5} \]
where $w$ is the weights vector, $b$ the biases, $\eta$ the learning rate, $x$ a fixed input and $C_x(w, b)$ the cost function.
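The update rule (5) can be sketched on a toy problem. The example below is only an illustration under stated assumptions: a single sigmoid neuron with one input, the quadratic cost (4) averaged over four made-up training pairs, and hypothetical values for the learning rate and the number of steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(w, b, data):
    """Quadratic cost sum_x (a - y)^2 / (2n) of one sigmoid neuron, and its gradient."""
    n = len(data)
    cost = grad_w = grad_b = 0.0
    for x, y in data:
        a = sigmoid(w * x + b)
        cost += (a - y) ** 2 / (2 * n)
        common = (a - y) * a * (1 - a) / n  # derivative of this example's cost w.r.t. z
        grad_w += common * x
        grad_b += common
    return cost, grad_w, grad_b

def gradient_descent(data, eta=1.0, steps=2000):
    """Repeat (w, b) <- (w, b) - eta * gradient, recording the cost at each step."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(steps):
        cost, grad_w, grad_b = cost_and_gradient(w, b, data)
        history.append(cost)
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b, history

# Made-up training set: negative inputs are labeled 0, positive inputs 1.
toy_data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
w, b, history = gradient_descent(toy_data)
```

The recorded `history` lets us check that the cost decreases as the steps proceed, which is what we later observe in the loss plots of real networks.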
1.4.2 Backpropagation algorithm equations
Now the question is: how do we calculate the gradient of this cost function, for which we don't even know a concrete expression? The answer is to use the backpropagation algorithm. Before describing the backpropagation algorithm we need to define a couple of equations.
We define the error of neuron $j$ in layer $l$ as the variation of the cost function with respect to the weighted input plus bias of that neuron:
\[ \delta^l_j := \frac{\partial C}{\partial z^l_j} \tag{6} \]
(7)
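Claim 1. Let $L$ be the number of layers of a neural network and $j$ one of its neurons; then we have the following equality for the neuron error:
\[ \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \tag{8} \]
or, in matrix form,
\[ \delta^L = \nabla_a C \odot \sigma'(z^L) \]
where $\odot$ denotes the elementwise product.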
Proof. Let's first apply the definition of error to a neuron in layer $L$:
\[ \delta^L_j = \frac{\partial C}{\partial z^L_j} \tag{9} \]
We have to develop the right-hand side of this equality until we get the right-hand side of equation (8). If we differentiate the cost function applying the chain rule, and take into account that the activations $a^L_k$ of the neurons in layer $L$ depend on $z^L_j$, we get the intermediate step
\[ \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} \tag{10} \]
where the sum is over all the neurons in the output layer. The activation of a neuron in a given layer depends only on the input received by that same neuron, not on the other neurons of the layer, so the term $\partial a^L_k / \partial z^L_j$ is equal to zero when $k \neq j$. In consequence we have
\[ \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} \tag{11} \]
but from (1.4.3) we know that $\partial a^L_j / \partial z^L_j = \sigma'(z^L_j)$, which gives equation (8).
Claim 2. The errors of two consecutive neural network layers are related by the following equality:
\[ \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \tag{12} \]
where $(w^{l+1})^T$ is the transpose of the weight matrix for layer $l + 1$.
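Proof. Taking into account the relation between the weighted input of a neuron in one layer and the weighted inputs of the previous layer, we can write
\[ \delta^l_j = \frac{\partial C}{\partial z^l_j} \tag{13} \]
\[ = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{14} \]
\[ = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \tag{15} \]
Since $z^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k$, if we differentiate we get
\[ \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j) \tag{16} \]
Substituting back into the previous expression we get
\[ \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j) \tag{17} \]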
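Claim 3. The components of the gradient of the cost function are given by
\[ \frac{\partial C}{\partial b^l_j} = \delta^l_j \tag{18} \]
\[ \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{19} \]
Proof. For the biases, applying the chain rule we have
\[ \frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} \tag{20} \]
\[ = \frac{\partial C}{\partial z^l_j} \cdot 1 = \delta^l_j \tag{21} \]
and for the weights
\[ \frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} \tag{22} \]
\[ = \delta^l_j \frac{\partial z^l_j}{\partial w^l_{jk}} \tag{23} \]
\[ = \delta^l_j \frac{\partial \left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right)}{\partial w^l_{jk}} \tag{24} \]
\[ = a^{l-1}_k \delta^l_j \tag{25} \]
1.4.3 Backpropagation algorithm steps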
Now that we have shown all the necessary equations, we can list the steps of the backpropagation algorithm to calculate the gradient of the cost function. We denote by $a^{x,l} = (a^l_1, ..., a^l_n)$ the vector of neuron activations in layer $l$ when $x$ is the input, where $a^l_j$ is the activation defined above.
Input: For all neurons in the input layer, set the neuron values $a^1$ to the corresponding pixel intensities of the example image.
Feedforward: For each layer $l \in \{2, ..., L\}$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
Output error $\delta^L$: Calculate the output error $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
Backpropagation: After calculating the error of the last layer we backpropagate it to the first layers: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ for $l$ in $\{L - 1, ..., 2\}$.
Output gradient components: We compute the components of the cost function gradient as given in claim 3:
\[ \frac{\partial C}{\partial b^l_j} = \delta^l_j; \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \]
This process is done for all examples $x$ in a given subset of the training set, usually called a batch, and then the weights and biases are updated (gradient descent step), where $m$ is the number of examples in the batch:
\[ w^l \to w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T \tag{26} \]
\[ b^l \to b^l - \frac{\eta}{m} \sum_x \delta^{x,l} \tag{27} \]
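Notice that the output error is very simple to calculate in the case of a quadratic cost function:
\[ \delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y) \odot \sigma(z^L) \odot (1 - \sigma(z^L)) \tag{28} \]
since
\[ \sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \sigma(z)(1 - \sigma(z)) \tag{29} \]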
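The backpropagation steps can be sketched in plain Python for a single training example and the quadratic cost. This is a minimal illustration, not the implementation used in our experiments: the network shape (2 inputs, 3 hidden neurons, 1 output) and all numeric values are hypothetical. The last lines compare one component of the gradient with a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(weights, biases, x):
    """Return the activations of every layer, starting with the input x."""
    activations = [x]
    for w, b in zip(weights, biases):
        z = [sum(wjk * ak for wjk, ak in zip(row, activations[-1])) + bj
             for row, bj in zip(w, b)]
        activations.append([sigmoid(zj) for zj in z])
    return activations

def cost(weights, biases, x, y):
    """Quadratic cost 0.5 * |a^L - y|^2 for one example."""
    output = feedforward(weights, biases, x)[-1]
    return 0.5 * sum((a - t) ** 2 for a, t in zip(output, y))

def backprop(weights, biases, x, y):
    """Gradient of the quadratic cost for one example."""
    activations = feedforward(weights, biases, x)
    # Output error: (a^L - y) * sigma'(z^L), using sigma' = sigma * (1 - sigma).
    delta = [(a - t) * a * (1 - a) for a, t in zip(activations[-1], y)]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = list(delta)
        grad_w[l] = [[dj * ak for ak in activations[l]] for dj in delta]
        if l > 0:
            # Error of the previous layer: (w^T delta) * sigma'(z) of that layer.
            a_prev = activations[l]
            delta = [sum(weights[l][k][j] * delta[k] for k in range(len(delta)))
                     * a_prev[j] * (1 - a_prev[j]) for j in range(len(a_prev))]
    return grad_w, grad_b

# Hypothetical 2-3-1 network and training example.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
x, y = [0.5, -1.0], [1.0]
grad_w, grad_b = backprop(weights, biases, x, y)

# Finite-difference estimate of the derivative with respect to one weight.
eps = 1e-6
weights[0][1][0] += eps
cost_plus = cost(weights, biases, x, y)
weights[0][1][0] -= 2 * eps
cost_minus = cost(weights, biases, x, y)
weights[0][1][0] += eps
numeric = (cost_plus - cost_minus) / (2 * eps)
```

Here `weights[l]` maps the activations of one layer to the next, so the list index is shifted by one with respect to the layer superscript used in the equations.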
\[ C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a^L_j + (1 - y_j) \ln(1 - a^L_j) \right] \tag{30} \]
The motivation for using such a function is that, in addition to fulfilling the desired properties of a cost function mentioned before, the partial derivatives of the cross entropy cost function $C$ don't depend on the derivative of the activation function $\sigma'(z)$, which as we said causes the saturation. Indeed, if we calculate the derivative of the cross entropy cost function with respect to the weight:
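\[ \frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{31} \]
\[ = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \sigma'(z) x_j \tag{32} \]
\[ = \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z)(1 - \sigma(z))} (\sigma(z) - y) \tag{33} \]
\[ = \frac{1}{n} \sum_x x_j (\sigma(z) - y) \tag{34} \]
and with respect to the bias:
\[ \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y) \tag{35} \]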
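Equations (34) and (35) can be checked numerically. The sketch below is an illustration for a single sigmoid neuron with two inputs; the training pairs, the weights and the bias are made-up values. It computes the gradient with the closed formulas, which contain no $\sigma'(z)$ factor, and compares it with a finite-difference estimate of the cross entropy cost.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cross_entropy(w, b, data):
    """Cross entropy cost of a single sigmoid neuron over the data set."""
    total = 0.0
    for x, y in data:
        a = neuron_output(w, b, x)
        total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / len(data)

def cross_entropy_gradient(w, b, data):
    """Gradient from equations (34) and (35)."""
    n = len(data)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in data:
        error = neuron_output(w, b, x) - y  # sigma(z) - y
        for j, xj in enumerate(x):
            grad_w[j] += xj * error / n
        grad_b += error / n
    return grad_w, grad_b

# Made-up data: (input vector, label) pairs.
data = [([0.0, 1.0], 1.0), ([1.0, 0.5], 0.0), ([2.0, -1.0], 1.0)]
w, b = [0.3, -0.2], 0.1
grad_w, grad_b = cross_entropy_gradient(w, b, data)

# Finite-difference estimates for comparison.
eps = 1e-6
numeric_b = (cross_entropy(w, b + eps, data) - cross_entropy(w, b - eps, data)) / (2 * eps)
numeric_w0 = (cross_entropy([w[0] + eps, w[1]], b, data)
              - cross_entropy([w[0] - eps, w[1]], b, data)) / (2 * eps)
```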
In any case, when we use linear neurons, that is, neurons with a linear (non-constant) activation function, neuron saturation doesn't happen, because their derivative is neither zero nor asymptotically close to it, so in that case we could use the quadratic cost function.
1.5 Types of layers
1.5.1 Convolutional layers
Sliding the local receptive field, also called kernel, to the right by one neuron, or by any number of neurons defined as the stride, we connect the new region obtained with the next hidden neuron, saving its activation, and do so for the whole layer.
The previous operation is called convolution, which gives its name to the network model. Notice that the weights and biases are shared by all local receptive fields, so with this process we check to what degree a feature is present all across the image, and we slightly modify that feature during the learning process. This way we also greatly reduce the number of parameters compared with a fully connected network, and get more meaningful information for each neuron in the hidden layer.
We call the map from local receptive fields to a hidden layer a feature map. A convolutional layer can have many feature maps, so that we can recognize different shapes in the images. So to speak, each feature map tells us whether a given pattern is present in a region or not, with a real value between 0 and 1.
With the kernel defined as above, the output of a convolutional layer would have smaller dimensions than the input, but this is not always the case. In some cases we extend the layer at the edges with mean values, which is called padding, so that after filtering we get an output of the same size and shape as the layer.
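The sliding of a kernel with a given stride and padding can be sketched as follows. This is a minimal illustration under several assumptions: a square single-channel image stored as a list of rows, a square kernel, padding with zeros (instead of the mean values mentioned above) and no activation function applied to the weighted sums. The function name and the example values are hypothetical.

```python
def conv2d(image, kernel, bias=0.0, stride=1, padding=0):
    """Slide `kernel` over a square `image` and return the weighted sums plus bias."""
    k = len(kernel)
    size = len(image) + 2 * padding
    # Assumption: the border is filled with zeros.
    padded = [[0.0] * size for _ in range(size)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            padded[r + padding][c + padding] = value
    out_size = (size - k) // stride + 1
    output = []
    for i in range(out_size):
        out_row = []
        for j in range(out_size):
            total = bias
            # The same weights and bias are shared by every local receptive field.
            for m in range(k):
                for n in range(k):
                    total += kernel[m][n] * padded[i * stride + m][j * stride + n]
            out_row.append(total)
        output.append(out_row)
    return output

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

smaller = conv2d(image, ones)                    # 2x2 output, no padding
same_size = conv2d(image, identity, padding=1)   # 4x4 output, same size as the input
strided = conv2d(image, ones, stride=2, padding=1)
```

Without padding a 3x3 kernel turns the 4x4 image into a 2x2 output, while a padding of one neuron keeps the original size, as described in the text.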
1.5.2 An example
The previous layer would have another feature map to recognize whether the above features are present or not, and so on for every lower level of abstraction until we get to the input layer with the image data.
1.5.3 Pooling Layers
After a convolution layer we usually have pooling layers, which simplify the information of the previous layer. A commonly used one is the max-pooling layer, which takes the maximum value of the activations in a given region, say $2 \times 2$ neurons of the previous layer:
\[ a^{l+1}_{jk} = \max\{ a^l_{2j+m,\,2k+n} \}_{m,n \in \{0,1\}} \tag{37} \]
mation of the feature maps, whether they appear or not in an approximate part of the
image, since we don't care about the exact position of a feature when we are
looking for patterns.
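Equation (37) translates almost literally into code. The following sketch assumes a feature map stored as a list of rows with even height and width; the function name and the example values are hypothetical.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum activation of every non-overlapping 2x2 region."""
    rows = len(feature_map) // 2
    cols = len(feature_map[0]) // 2
    return [[max(feature_map[2 * j + m][2 * k + n] for m in (0, 1) for n in (0, 1))
             for k in range(cols)]
            for j in range(rows)]

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
pooled = max_pool_2x2(feature_map)  # [[4, 2], [2, 8]]
```

The 4x4 feature map is reduced to 2x2, keeping only whether each feature is strongly present somewhere in each region.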
1.5.4 Rectifier linear units layers
As we explained before, when using the sigmoid output for all the neurons in a layer, the state of many neurons can become saturated, due to the shape of this output function. Rectifier layers are characterized by having their neurons' activation function defined as
\[ f(x) = \max(0, x) \tag{38} \]
Using this kind of output we avoid saturation, so this kind of layer is usually combined with convolutional layers with a sigmoid function. A smooth approximation to the rectifier is the analytic function
\[ f(x) = \ln(1 + e^x) \tag{39} \]
which is called the softplus function. These layers have the property of accelerating the learning process, that is, achieving a lower cost value in fewer steps.
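Both functions (38) and (39) are one-liners in Python. The short sketch below, with hypothetical function names, also shows that softplus stays close to the rectifier away from zero.

```python
import math

def relu(x):
    """Rectifier f(x) = max(0, x), equation (38)."""
    return max(0.0, x)

def softplus(x):
    """Smooth approximation f(x) = ln(1 + e^x), equation (39)."""
    return math.log(1.0 + math.exp(x))

samples = [-5.0, -1.0, 0.0, 1.0, 5.0]
rectified = [relu(x) for x in samples]
smoothed = [softplus(x) for x in samples]
# Gap between softplus and the rectifier: largest at 0 (ln 2), small far from 0.
gaps = [s - r for s, r in zip(smoothed, rectified)]
```

For positive inputs the rectifier has slope 1 regardless of how large the input is, which is why it doesn't saturate the way the sigmoid does.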
1.5.5 Local Response Normalization layers
\[ a^i_{x,y} = z^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} (z^j_{x,y})^2 \Big)^{\beta} \tag{40} \]
where $z^i_{x,y}$ is the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity, and $a^i_{x,y}$ is the response-normalized activity of $z^i_{x,y}$. The sum runs over $n$ adjacent kernel maps at the same spatial position and $N$ is the total number of kernels in the layer. Details about the other parameters can be found in [3].
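A minimal sketch of equation (40) for a single spatial position follows. The input is assumed to be the list of ReLU activities of the $N$ kernels at one fixed position $(x, y)$. The default values of `n`, `k`, `alpha` and `beta` are an assumption of this sketch (they are the values commonly quoted for this normalization), and the function name is hypothetical.

```python
def local_response_norm(z, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize z[i] by the squared activities of up to n adjacent kernel maps."""
    N = len(z)
    normalized = []
    for i in range(N):
        lower = max(0, i - n // 2)
        upper = min(N - 1, i + n // 2)
        squares = sum(z[j] ** 2 for j in range(lower, upper + 1))
        normalized.append(z[i] / (k + alpha * squares) ** beta)
    return normalized

activities = [1.0, 2.0, 3.0]
# With n=3, k=1, alpha=1, beta=1 the sums are easy to follow by hand:
# 1 / (1 + 1 + 4), 2 / (1 + 1 + 4 + 9), 3 / (1 + 4 + 9)
example = local_response_norm(activities, n=3, k=1.0, alpha=1.0, beta=1.0)
```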
1.5.6 Softmax layers
Softmax layers transform the activations of the previous layer into a probability distribution, keeping the same information as the activations. Each neuron of the softmax layer has the following activation function:
\[ a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \tag{41} \]
The activations of the previous layer $z^L_j$ are not necessarily between 0 and 1, nor do they necessarily sum to 1 over the whole layer, so with the softmax layer we make sure that we have a better representation of the probability that the image belongs to a particular class. We will actually not count softmax as a layer that helps to train the network, but as a layer that helps to make the classification results human readable.
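The softmax activation (41) can be sketched in a few lines. Subtracting the maximum before exponentiating is an addition of this sketch: it doesn't change the result of equation (41) but avoids overflow for large inputs. The function name and the example values are hypothetical.

```python
import math

def softmax(z):
    """Turn a list of real values into a probability distribution."""
    shift = max(z)  # same result as equation (41), but numerically safer
    exps = [math.exp(zj - shift) for zj in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, -1.0, 0.5]
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
```

The output values lie between 0 and 1, sum to 1, and preserve the ordering of the inputs, so the largest input is still the predicted class.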
1.6 Why are CNN so effective on image data?
1.7 Tensorflow
To run the operations needed for training a neural network we used Google's recently launched open source deep learning library Tensorflow. TensorFlow is an open source software library for numerical computation using data flow graphs. It serves purposes similar to those of earlier libraries such as Theano or SciKit-Learn. In each graph, nodes represent mathematical operations, from simple ones like matrix multiplication and addition to more complex ones like convolution or softmax. Graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows computation to be deployed to one or more CPUs or GPUs. Tensorflow already includes some implemented convolutional networks for classifying images and for other prediction tasks.
2 Related work
2.1 Universal approximation of functions
The way artificial neural networks have evolved until today, when they are useful for solving many classification problems with good results, has been heuristic: it was not determined analytically and by deductive steps that neural networks could properly model certain types of data such as audio and video. It can, however, be proven analytically that linear combinations of sigmoid functions can uniformly approximate any continuous function on a compact domain, which tells us that we could approximate any data set with neural networks. Details can be found in [6]. In spite of this formalization of the function approximation capability of ANNs, it is accepted that they have a black box nature in terms of feature extraction. The interpretation of the learned weights and biases is not exactly known, although we can observe that basic shapes that might be present in images are identified as features.
2.2 Recurrent Neural Networks
In our work we always used feedforward neural networks, which propagate the activations in one direction, but it is also important to remark that there are other commonly used types of ANNs, such as recurrent neural networks, in which the connections form a directed cycle.
2.3 Deep Belief Networks (DBNs)
The main condition for using CNNs is to have labeled data, which in most real-life cases is not available. Sometimes we have similar kinds of problems that need to be solved in an unsupervised way, and to do this we can use deep belief networks, which are the unsupervised learning version of artificial neural networks. Another inconvenience of CNNs and backpropagation is that weights and biases can get stuck in a poor local optimum, leaving the model far from good prediction results. To overcome these limitations Smolensky [7] thought of a network that learns hidden patterns in the data. The idea is to have only one visible layer and many hidden ones, and to infer the states of the hidden variables for given states of the visible variables, being later able to generate new samples of the visible variables. In the case of images, we would learn the probability of some features appearing in a given image, without it being labeled.
DBNs are composed of Restricted Boltzmann Machines (RBMs). RBMs are a version of ANNs, simpler than CNNs, that learns a probability distribution over a set of inputs. In the case of image sets, the network learns a set of features given an input image dataset. This can be used to initialize the feature values of deep neural networks. RBMs have only 2 layers, a visible one with $m$ neurons (in this method also called units) and a hidden one with $n$ units, with binary values. As in CNNs, there is a weight matrix $W = (w_{i,j})$ of size $m \times n$, where $w_{i,j}$ determines the weight of the connection between visible unit $v_i$ and hidden unit $h_j$, and there are also biases, $a_i$ for visible units and $b_j$ for hidden units. In RBMs we have a function that associates a scalar value called energy to each configuration of the variables:
\[ E(v, h) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i w_{i,j} h_j \tag{42} \]
in matrix notation,
\[ E(v, h) = -a^T v - b^T h - v^T W h \tag{43} \]
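The energy function can be evaluated directly for a small RBM. The sketch below uses a made-up machine with 2 visible and 3 hidden binary units, and also computes the probability $e^{-E(v,h)}/Z$ of a configuration by brute-force enumeration of all configurations, which is only feasible for such tiny examples. All names and numeric values are hypothetical.

```python
import math
from itertools import product

def rbm_energy(v, h, a, b, W):
    """Energy E(v, h) = -a.v - b.h - v^T W h of one configuration."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - interaction

def joint_probability(v, h, a, b, W):
    """exp(-E(v, h)) / Z, with Z summed over every binary configuration."""
    Z = sum(math.exp(-rbm_energy(vv, hh, a, b, W))
            for vv in product((0, 1), repeat=len(a))
            for hh in product((0, 1), repeat=len(b)))
    return math.exp(-rbm_energy(v, h, a, b, W)) / Z

# Made-up parameters: m = 2 visible units, n = 3 hidden units.
a = [0.5, -0.2]
b = [0.1, 0.0, 0.3]
W = [[1.0, -0.5, 0.2],
     [0.3, 0.4, -0.1]]

energy = rbm_energy([1, 0], [1, 1, 0], a, b, W)  # -0.5 - 0.1 - 0.5 = -1.1
total_probability = sum(joint_probability(vv, hh, a, b, W)
                        for vv in product((0, 1), repeat=2)
                        for hh in product((0, 1), repeat=3))
```

Configurations with lower energy receive a higher probability, and the probabilities of all configurations add up to 1.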
Learning corresponds to modifying that energy function so that its shape has desirable properties. We would like plausible or desirable configurations to have low energy. We also have a probability distribution over the configurations of the network, which depends on the energy function:
\[ P(v, h) = \frac{1}{Z} e^{-E(v,h)} \tag{44} \]
with $Z = \sum_{(v,h)} e^{-E(v,h)}$ being a normalizing constant that ensures that the probability distribution sums to 1. The sum is over all possible configurations of visible and hidden units. Plausible configurations should have a higher probability value, that is, an energy value as low as possible. In a similar way, the probability of a given vector of visible units is the normalized sum of the exponentials of the energy function over all possible configurations of the hidden units:
\[ P(v) = \frac{1}{Z} \sum_h e^{-E(v,h)} \tag{45} \]
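\[ P(h|v) = \prod_{j=1}^{n} P(h_j|v) \tag{47} \]
\[ P(h_j = 1|v) = \sigma\Big( b_j + \sum_{i=1}^{m} w_{i,j} v_i \Big) \tag{48} \]
and
\[ P(v_i = 1|h) = \sigma\Big( a_i + \sum_{j=1}^{n} w_{i,j} h_j \Big) \tag{49} \]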
vV
(51)
vV
By composing many RBMs, making each hidden layer the visible layer of another RBM, we form a deep belief network, which is able to extract features of different levels of abstraction from the data. To train a Deep Belief Network we would proceed as follows:
Given an input data sample $X$, we would train a restricted Boltzmann machine on $X$ to obtain its weight matrix $W$. Then we would use it as the weight matrix between the lower two layers of the network.
Then we would transform $X$ by the RBM to produce a new data sample $X'$, either by sampling or by computing the mean activation of the hidden units.
Next we repeat the procedure with $X \leftarrow X'$ for the next pair of layers, until the top two layers of the network are reached.
Finally we would fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
2.4 Deep Dream
Have you ever thought about something or someone for so long that, for a second, you have the feeling of seeing it even when it's not there? This is another curious application of deep neural networks: the generation of images reminiscent of hallucinations. The idea in Deep Dream is to maximize the activations of the features of certain layers in a network that is already trained, and to mix the detected features with an input image. To do this, gradient ascent is used, which is the opposite idea of gradient descent: instead of subtracting the gradient in each step we add it (in Deep Dream to the input image rather than to the weights and biases), to get a higher activation value. The result of doing this on an image of Barcelona's skyline from Parc Güell, with a layer of the Inception neural net that detects the presence of canines and other animals, is this:
3 Methodology
Our goal was to build our own CNN, taking inspiration from examples that already perform prediction effectively and understanding their architecture. To do that we use the Tensorflow library and get ideas from its available examples.
3.1 The dataset
The aim of the class "other" is to make the model able to tell whether a picture doesn't belong to any of the food classes. The dataset is composed of both Instagram photos and web images. The Instagram photos were obtained from the Instagram API, filtered by user-defined tags and manually purged. As user-defined tags are very noisy, this method proved to be inefficient and very time-consuming. In order to facilitate the generation of more ground truth annotations and a larger training dataset we also obtained images from Google Images through the Custom Google Search API. This method, which allowed us to annotate a bigger set of images automatically, turned out to be very useful, as almost all the retrieved images showed the desired food category and only a minimal manual purge was required. The first model that we are going to build is a single layer neural network. The images of the dataset have no specific size or format; we store them in a .bin file whose records hold 32x32 pixels with 3 RGB channels. Then, to feed the network, we randomly crop them into 24x24 pixel images, to expand the data set size.
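The random cropping step can be illustrated with a short standard-library sketch. It is not the code used in our pipeline: the image is assumed to be a list of 32 rows of 32 RGB tuples, and the function name and the dummy image are hypothetical.

```python
import random

def random_crop(image, crop_size=24, rng=random):
    """Return a crop_size x crop_size window taken at a random position of the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)    # randint includes both endpoints
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]

# Dummy 32x32 RGB image whose pixel values encode their own position.
image = [[(r, c, 0) for c in range(32)] for r in range(32)]
crops = [random_crop(image) for _ in range(5)]
```

Each call can return a different 24x24 window of the same 32x32 record, which is how one stored image yields several training inputs.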
3.2 Data augmentation
3.3 Hardware setup
The models were trained on a high-end server with a quad-core Intel i7-3820 at 3.6 GHz with 64 GB of DDR3 RAM and 4 NVIDIA Tesla K40 GPU cards with 12 GB of GDDR5 each, connected through PCIe 3.0 in x16 mode (containing two PCIe switches). The machine runs a GNU/Linux system, with Linux kernel 3.12 and NVIDIA driver 340.24. We performed experiments with different configurations (downscaling sizes, data augmentation, different number and composition of layers, different layer geometry, etc.). We also tested aspects with no impact on the classification accuracy but with practical implications, such as different input formats (TFRecords, compressed numpy arrays, etc.) or different hardware configurations (one or more CPUs and GPUs, etc.). Our runs include an extensive set of configurations; for brevity, when those parameters were shown to be either irrelevant or to have a negligible effect, we use default values. Each experimental configuration was repeated at least 5 times. Unless otherwise stated, we report median values in seconds.
3.4 Implemented and customized programs
4 Results
4.1 Single softmax layer network
The first and simplest possible approach that weve taken was to train the network with a single softmax layer, with a single matrix product of the image data
with the weights and addition of biases.
Since our dataset is quite noisy and not very
large, without extracting features of different levels of abstraction, the first result is
not going to be very good, if it does learn
something it could be considered already an
achievement.
After running the experiment we can see
the results of running stochastic gradient descent with a single layer network in figure.
Figure 9: 1 softmax layer ANN
Clearly we are not getting close to a minimum of the cost function, since after
some steps the cost barely decreases.
To check the precision of the predictions, we feed the network 600 test images
and divide the number of correct classifications by the total number of
classifications. The result is a poor score of 44% (44 out of 100 images are
classified correctly).
Figure 10: Training loss in each step for a network with no hidden layers.
So with only a softmax layer receiving the product of the weights with the input
layer plus biases, the model does learn something, since a random classification
would get around 10% precision, but it does not get much better than that.
Given the evolution of the loss value, we leave this "hello world" experiment
without spending time on longer runs and move on to the layers that give their
name to the networks we are studying. Let's see how the classification improves
with a convolutional layer and, later, a pooling layer.
4.2 Convolution
After the simplest approach of a network with no hidden layers, we see how the
model does with only one hidden convolutional layer.
The layer applies 64 kernels of 5x5 neurons, with a stride of a single neuron in
each direction and padding chosen so that the output of the convolutional layer
has the same size as the input layer. We can think of the convolutional layer as
a prism, since it extracts presence information for 64 features at every part of
the image; we store this information in a 3D tensor (24x24x64).
Figure 11: Convolution network simple structure
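The 24x24x64 shape follows from the standard output-size formula for a convolution, floor((n + 2p - k) / s) + 1 along each axis. The sketch below checks it for the values stated above; the padding of 2 neurons per side is what a 5x5 kernel with stride 1 needs to keep the size, and the parameter count assumes 3 input channels (RGB) and one bias per kernel. The function names are hypothetical.

```python
def conv_output_size(size, kernel, stride, pad):
    """Output length along one axis: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

def conv_layer_shape(height, width, kernel, stride, pad, n_kernels):
    """Shape (height, width, depth) of the output of a convolutional layer."""
    return (conv_output_size(height, kernel, stride, pad),
            conv_output_size(width, kernel, stride, pad),
            n_kernels)

shape = conv_layer_shape(24, 24, kernel=5, stride=1, pad=2, n_kernels=64)
print(shape)  # (24, 24, 64)

# Without padding the same kernel would shrink each side from 24 to 20.
print(conv_output_size(24, kernel=5, stride=1, pad=0))  # 20

# Trainable parameters, assuming 3 input channels and one bias per kernel.
n_params = 5 * 5 * 3 * 64 + 64
print(n_params)  # 4864
```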
The loss value keeps decreasing over a longer period of training (in the
simplest model it almost stopped decreasing within the first 10,000 steps) and
ends up with an average value of 0.3. Training took 2 hours 56 minutes to
complete 50,000 steps, which we chose as a training length sufficient for the
loss to stabilize. As expected, the precision of the predictions jumps to 85.5%,
in contrast with the result of the previous model.
Figure 12: Loss function values during training, in steps and time
4.3
Now we check the effect of adding a pooling layer after the convolutional layer.
This time the cost decreases considerably faster and reaches a value close to
0.16 after 50,000 training steps, in contrast with the 0.3 of the previous
model. In terms of precision we reach 88%, which is not bad taking into account
that the images are not as simple as, for example, handwritten digits, and that
our network has only 2 layers. If we randomly choose a sample of the
classifications, we see that about 8 out of 10 images are classified in the
right group. The real label of the picture is shown after L: and the prediction
computed by the network after P:.
L: 0, P: 0
L: 1, P: 1
L: 2, P: 2
L: 3, P: 3
L: 4, P: 8
L: 5, P: 5
L: 6, P: 6
L: 7, P: 6
L: 8, P: 8
L: 9, P: 9
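For the ten (label, prediction) pairs listed above, the precision can be computed directly as the number of correct classifications divided by the total. This is a small sketch; the helper name `accuracy` is an assumption.

```python
# (real label, prediction) pairs copied from the sample above.
pairs = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 8),
         (5, 5), (6, 6), (7, 6), (8, 8), (9, 9)]

def accuracy(pairs):
    """Fraction of pairs whose prediction equals the real label."""
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)

print(accuracy(pairs))  # 0.8
```

The two misclassified images in this sample are class 4 (predicted as 8) and class 7 (predicted as 6).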
We can also see the precision results in the confusion matrix, which tells us,
for each class, which portion of the predictions were correct and which fell
into wrong classes; rows are real class labels and columns are class predictions.
For example, we can observe that 15% of the fried eggs were classified as sushi.
That might be because of the similarity of colors and shapes (white ovals
surrounded by black are present in both classes). So maybe we should raise the
number of layers and features detected in our network, or change other
parameters, to let the network tell the difference between those classes. Apart
from this, the only other confusion above 10% is an entry of 0.11 between
classes 2 and 5.
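A confusion matrix of this kind, with rows as real labels, columns as predictions and each row normalized to fractions, can be computed as in the following sketch. The labels and predictions here are hypothetical toy values for 3 classes, not results from our experiments, and `confusion_matrix` is an illustrative helper rather than the code we used.

```python
def confusion_matrix(labels, predictions, n_classes):
    """Row-normalized confusion matrix.

    Entry [i][j] is the fraction of images of real class i predicted as class j.
    """
    counts = [[0] * n_classes for _ in range(n_classes)]
    for label, prediction in zip(labels, predictions):
        counts[label][prediction] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical toy data: 3 classes, 10 images.
labels      = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
predictions = [0, 0, 2, 0, 1, 1, 2, 0, 2, 2]

for row in confusion_matrix(labels, predictions, n_classes=3):
    print(row)
# [0.75, 0.0, 0.25]
# [0.0, 1.0, 0.0]
# [0.25, 0.0, 0.75]
```

With this normalization every row of a class that has at least one image sums to 1, and the diagonal holds the per-class precision.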
Table 1: Confusion matrix (rows L: real class labels, columns P: class predictions)
L\P     0     1     2     3     4     5     6     7     8     9
0    0.87  0.02  0.00  0.02  0.00  0.08  0.02  0.00  0.00  0.00
1    0.00  0.85  0.02  0.04  0.00  0.04  0.00  0.02  0.02  0.00
2    0.08  0.00  0.77  0.00  0.00  0.11  0.02  0.00  0.03  0.00
3    0.02  0.02  0.00  0.78  0.06  0.04  0.00  0.06  0.02  0.00
4    0.02  0.00  0.02  0.05  0.72  0.02  0.00  0.02  0.15  0.00
5    0.01  0.01  0.00  0.00  0.00  0.94  0.01  0.01  0.01  0.00
6    0.00  0.02  0.00  0.03  0.02  0.00  0.90  0.02  0.01  0.00
7    0.00  0.02  0.00  0.02  0.02  0.00  0.05  0.85  0.05  0.00
8    0.01  0.00  0.01  0.01  0.01  0.06  0.01  0.04  0.83  0.00
9    0.00  0.00  0.04  0.00  0.00  0.10  0.00  0.00  0.00  0.86
4.4
Now that we have seen that the main improvement comes from having a
convolutional layer, let's see whether we get another big improvement by adding
a second convolutional layer.
Here we have the confusion matrix obtained when training a network with 2
convolutional layers.
Table 2: Confusion matrix for network with 2 convolution and normalization layers (rows L: real class labels, columns P: class predictions)
L\P     0     1     2     3     4     5     6     7     8     9
0    0.86  0.00  0.02  0.00  0.00  0.08  0.02  0.00  0.00  0.02
1    0.00  0.79  0.00  0.04  0.00  0.05  0.02  0.04  0.05  0.00
2    0.10  0.02  0.70  0.03  0.03  0.07  0.00  0.00  0.00  0.05
3    0.02  0.04  0.00  0.82  0.04  0.00  0.02  0.04  0.02  0.00
4    0.02  0.00  0.00  0.04  0.85  0.02  0.00  0.02  0.04  0.00
5    0.01  0.01  0.00  0.00  0.00  0.93  0.02  0.00  0.01  0.00
6    0.00  0.00  0.00  0.05  0.02  0.00  0.91  0.00  0.02  0.00
7    0.02  0.03  0.00  0.01  0.00  0.00  0.06  0.85  0.03  0.00
8    0.00  0.00  0.00  0.02  0.00  0.02  0.00  0.01  0.94  0.00
9    0.00  0.00  0.00  0.00  0.00  0.09  0.05  0.00  0.00  0.86
4.5
The next approach is to train the network with two convolutional, pooling and
normalization layers on a higher-resolution version of the same dataset. We
choose a size that does not make the computation too slow but for which the
difference in resolution is quite noticeable: 48x48 after the cropping step.
With this size, the precision of the predictions rises to 89.99%, which is a
noticeable improvement. So, in this experiment, higher resolution combined with
more convolutional layers gives better classification results. These experiments
also take a lot of computation time: the last one (2 convolutional layers, 48x48
pixels) took 8 hours 21 minutes to complete the 50,000 training steps. The
confusion matrix shows that predictions for some classes are now highly
accurate, while for others the neural network still lacks a complete
understanding of the patterns.
L\P     0     1     2     3     4     5     6     7     8     9
0    0.83  0.00  0.04  0.02  0.02  0.07  0.02  0.00  0.00  0.00
1    0.00  0.82  0.00  0.06  0.00  0.04  0.00  0.04  0.04  0.00
2    0.02  0.02  0.88  0.00  0.00  0.08  0.00  0.00  0.00  0.00
3    0.00  0.00  0.02  0.83  0.04  0.00  0.04  0.02  0.04  0.00
4    0.00  0.00  0.05  0.02  0.81  0.02  0.00  0.02  0.07  0.00
5    0.01  0.00  0.00  0.00  0.00  0.96  0.01  0.00  0.01  0.00
6    0.00  0.02  0.02  0.02  0.02  0.02  0.86  0.05  0.00  0.00
7    0.02  0.03  0.00  0.02  0.00  0.02  0.00  0.90  0.02  0.00
8    0.00  0.00  0.01  0.01  0.00  0.00  0.01  0.00  0.96  0.00
9    0.00  0.00  0.04  0.00  0.00  0.09  0.00  0.00  0.00  0.87
Table 3: Confusion matrix of 2 convolution layers model trained with 48x48 images (rows L: real class labels, columns P: class predictions)
4.6 Overfitting
The plotted loss values are calculated on training data, but it is also useful
to see how the classification precision evolves during the training process, to
know whether the network is actually learning the concepts behind the data or
only memorizing the training data set.
Figure 18: Precision of classifications with training data set in green and test data set
in red
To do that, we look at the evolution of the precision on training and test
data. In figure 18 we can see that there is a point at which the precision of
the classifications stops improving, which means that our network is
overfitting or that some of the neurons are saturated. To avoid this we can try
adding a normalization layer.
4.7
4.8 Batch size
Initially we set a commonly used batch size of 128 examples per batch, but we
wanted to see how this affects training time and classification precision.
We compare the training evolution and results for the last network model, which
consisted of two convolutional layers, each followed by pooling and
normalization. In figure 19 we see that the loss oscillates slightly less and
takes lower values from step 30,000 on. We can see this more clearly in a
close-up of the last 1,000 steps of training in figure 20.
Figure 19: Training loss with batch size of 128 examples in blue and 64 in red.
Figure 20: Training loss over the last 1000 steps for batch size of 128 examples in blue and
64 in red.
The big difference is in training time, which is about half for the smaller
batch size; figure 21 shows the comparison in minutes.
Figure 21: Training loss depending on time for batch sizes 128 in blue and 64 in red.
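One plausible reading of the halved training time is simple arithmetic: for a fixed number of steps, a batch of 64 processes half as many examples as a batch of 128. The sketch below assumes the same 50,000 training steps used for the previous models and that the time per step grows roughly linearly with the batch size; both are assumptions, not measurements.

```python
def examples_processed(steps, batch_size):
    """Total number of training examples seen after `steps` steps."""
    return steps * batch_size

steps = 50_000  # assumed: same training length as in the previous experiments
for batch_size in (128, 64):
    print(batch_size, examples_processed(steps, batch_size))
# 128 6400000
# 64 3200000
```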
4.9 Extracted features
Here are some of the features extracted by our network, in this case the 64
kernels of the first convolutional layer (5x5 weights). As usually happens with
CNNs, we cannot know why a particular shape and kind of feature is learned
during the process, but intuitively it looks like in some of them the network
learns basic shapes to recognize the boundaries of the objects in the images.
Figure 22: Extracted kernels in first convolutional layer, in a model with 2 convolutional layers.
For each of the previous features, we get a tensor with the shape of the image
containing the activations for that feature, as explained in section 1.5.1,
since we are using padding to get an output with the same shape as the input.
Figure 23: Output of first convolution layer for the model 4.4
For example, in figure 23 we can see the output of some images after convolution with the first extracted feature, for each color channel.
4.10
One may ask oneself: is it normal to need to see 1000 images of an elephant
before being able to recognize one when seeing it again? Maybe in the very
strange case in which you had just been given your sight and an elephant was
the first thing you ever saw; otherwise it should not be necessary. Apparently
the same happens with CNN learning: once the network has learned many visual
concepts, it becomes easier each time to learn new ones. So, after seeing some
results with a CNN trained by ourselves, we move to this approach, which is to
retrain a large ImageNet model to recognize the pictures of our data set. This
technique is called transfer learning or convolutional network fine tuning.
After learning about the work of Donahue et al. [5] and the option to load the
Inception neural network with Tensorflow to use its learned features on your
own data set, we checked whether this is a better approach than training a net
only with our own data.
4.10.1 Bottlenecks
The idea consists of loading the graph of an ImageNet Inception network that is
already trained (concretely, for 1000 classes) and, using the learned features,
training it on the new classes to recognize, thus avoiding a long training
process. To do that we have to adapt the last layer of the trained graph, the
one before softmax, to the newly added classes. Doing this with the ImageNet
Inception v3 model provided by Tensorflow, we achieve a precision of 91.2% in
only 17 minutes. So we can state that, in this case, knowing the patterns of
many image classes greatly helps to learn new classes faster and better, as
happens with biological neural networks.
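The core of this approach can be sketched without Tensorflow: the pretrained network is kept frozen and only a new softmax layer is fitted on its bottleneck outputs. The code below is a toy illustration in plain Python, not the retraining procedure we ran: the 2-dimensional `bottlenecks` vectors are hypothetical stand-ins for the activations of the layer before softmax, and `train_top_layer`, the learning rate and the number of epochs are assumptions chosen for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(weights, biases, features):
    """Linear scores W f + b of the new top layer."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def train_top_layer(bottlenecks, labels, n_classes, lr=0.5, epochs=200):
    """Fit a softmax layer on fixed bottleneck vectors with plain SGD."""
    dim = len(bottlenecks[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for features, label in zip(bottlenecks, labels):
            probs = softmax(logits_for(weights, biases, features))
            for k in range(n_classes):
                # Gradient of the cross-entropy loss with respect to logit k.
                grad = probs[k] - (1.0 if k == label else 0.0)
                biases[k] -= lr * grad
                for j in range(dim):
                    weights[k][j] -= lr * grad * features[j]
    return weights, biases

def predict(weights, biases, features):
    scores = logits_for(weights, biases, features)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical bottleneck vectors for two new classes.
bottlenecks = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
labels = [0, 0, 1, 1]

weights, biases = train_top_layer(bottlenecks, labels, n_classes=2)
predictions = [predict(weights, biases, f) for f in bottlenecks]
print(predictions)  # [0, 0, 1, 1]
```

Because the bottleneck vectors do not change during this training, they can be computed once per image and cached, which is part of why retraining is so much faster than training a full network.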
After several unexpected accuracy values and training times, we can say that
although some of the inner workings of convolutional neural networks are still
not clear, they work quite well for image classification. For example, we cannot
state that deeper networks or higher image resolution will always give better
classification results; this holds sometimes but not always, as we saw in the
results. Also, after looking through the literature on this aspect, we saw that
many of the parameters used in state-of-the-art networks, such as batch size or
learning rate, are determined as we did, by trial and error, so it remains an
open question why a specific number of layers or a specific depth works better.
We also observed that, as expected, using a previously trained net that already
classifies 1000 classes gives better accuracy on our dataset (91.2%) than the
networks trained from scratch only with our dataset, while also consuming much
less time. Now that object classification and recognition in images with CNNs
is close to being solved, we would continue by exploring other applications such
as online user behavior prediction or improving speech recognition, or by
looking into which other research areas this machine learning technique can
also be useful.
References
[1] McCulloch, Warren; Pitts, Walter (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics.
[2] Krizhevsky, Alex (2009). Learning Multiple Layers of Features from Tiny Images.
[3] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks.
[4] Werbos, P. J. (1975). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.
[5] Donahue, Jeff; Jia, Yangqing; Vinyals, Oriol; Hoffman, Judy; Zhang, Ning; Tzeng, Eric; Darrell, Trevor (2013). DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition.
[6] Cybenko, George (1989). Approximation by Superpositions of a Sigmoidal Function.
[7] Smolensky, Paul (1986). Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory.