
Lecture 4:

Backpropagation and
Neural Networks

Fei-Fei Li & Justin Johnson & Serena Yeung — Lecture 4, April 13, 2017
Administrative

Assignment 1 due Thursday April 20, 11:59pm on Canvas

Administrative

Project: TA specialties and some project ideas are posted on Piazza

Administrative

Google Cloud: All registered students will receive an email this week with instructions on how to redeem $100 in credits

Where we are...

scores function: s = f(x; W) = W x

SVM loss: L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1)

data loss + regularization: L = (1/N) sum_i L_i + sum_k W_k^2

want: the gradient of L with respect to W
Optimization

Landscape image is CC0 1.0 public domain


Walking man image is CC0 1.0 public domain

Gradient descent

Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(

In practice: derive the analytic gradient, then check your implementation against the numerical gradient (a "gradient check").

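A minimal sketch of such a gradient check, assuming a centered-difference estimate and a toy function f (neither the helper name nor the test function is prescribed by the slide):

```python
import numpy as np

def numeric_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x (x is modified in place and restored)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        fxph = f(x)                      # f(x + h)
        x[idx] = old - h
        fxmh = f(x)                      # f(x - h)
        x[idx] = old                     # restore
        grad[idx] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# compare against the analytic gradient: for f(x) = sum(x**2), grad = 2x
x = np.random.randn(3, 4)
num = numeric_gradient(lambda v: np.sum(v ** 2), x)
ana = 2 * x
print(np.max(np.abs(num - ana)))         # should be tiny (~1e-8)
```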
Computational graphs

[Computational graph of the regularized SVM classifier: the inputs x and W feed a multiply node (*) producing the scores s; s feeds a hinge loss node giving the data loss; W also feeds a regularization node R; the data loss and R(W) are summed (+) to give the total loss L.]
Convolutional network
(AlexNet)

[Computational graph of a convolutional network (AlexNet): from the input image and the weights, through many layers, to the loss.]

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Neural Turing Machine

[Computational graph of a Neural Turing Machine: from the inputs to the loss, the graph contains a very large number of nodes.]

Figure reproduced with permission from a Twitter post by Andrej Karpathy.
Backpropagation: a simple example

f(x, y, z) = (x + y) z
e.g. x = -2, y = 5, z = -4

Forward pass: q = x + y = 3, f = q * z = -12

Want: df/dx, df/dy, df/dz

Local gradients: dq/dx = 1, dq/dy = 1; df/dq = z = -4, df/dz = q = 3

Chain rule: df/dx = (df/dq)(dq/dx) = -4, and similarly df/dy = (df/dq)(dq/dy) = -4; df/dz = 3
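The same example as a short code sketch (the intermediate q and the variable names are as above; this is an illustration, not code from the slide):

```python
# forward pass
x, y, z = -2, 5, -4
q = x + y          # q = 3
f = q * z          # f = -12

# backward pass, in reverse order of the forward pass
dfdz = q           # d(q*z)/dz = q            -> 3
dfdq = z           # d(q*z)/dq = z            -> -4
dfdx = 1 * dfdq    # dq/dx = 1, chain rule    -> -4
dfdy = 1 * dfdq    # dq/dy = 1, chain rule    -> -4
```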
During the backward pass, each node f in the graph receives an "upstream" gradient dL/d(output) flowing in from above, multiplies it by its local gradient d(output)/d(input), and passes the resulting gradients on to its inputs:

[gradient flowing to an input] = [local gradient] x [upstream gradient]
Another example:

f(w, x) = 1 / (1 + e^(-(w0*x0 + w1*x1 + w2)))
e.g. w0 = 2.00, x0 = -1.00, w1 = -3.00, x1 = -2.00, w2 = -3.00

Forward pass: w0*x0 + w1*x1 + w2 = -2 + 6 - 3 = 1, so e^(-1) ≈ 0.37, 1 + 0.37 = 1.37, and f = 1/1.37 ≈ 0.73.

Working backward through the graph, every step is [local gradient] x [upstream gradient]:
- 1/x gate: (-1/1.37^2) x 1.00 ≈ -0.53
- "+1" gate: 1 x (-0.53) = -0.53
- exp gate: e^(-1) x (-0.53) ≈ -0.20
- "*(-1)" gate: (-1) x (-0.20) = 0.20
- add gates: [1] x [0.2] = 0.2 flows to both inputs (so the gradient on w2 is 0.2)
- multiply gates: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2 (and similarly for w1, x1)
sigmoid function: sigma(x) = 1 / (1 + e^(-x)), with local gradient dsigma/dx = (1 - sigma(x)) sigma(x)

The last four gates above together form a single sigmoid gate, so that whole stretch of the backward pass can be done in one step using only the gate's output: (0.73) * (1 - 0.73) ≈ 0.2, matching the value obtained by backpropagating through each primitive op.
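A minimal sketch of this example in code, following the numbers above; the variable names and the two backward routes (primitive ops vs. sigmoid shortcut) are my own framing:

```python
import math

# forward pass for f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
s = w0 * x0 + w1 * x1 + w2                # 1.0
f = 1.0 / (1.0 + math.exp(-s))            # ~0.73

# backward route 1: primitive ops, each step is local * upstream
dinv = -1.0 / (1.0 + math.exp(-s)) ** 2   # through the 1/u gate: -1/u^2, u = 1.37 -> ~-0.53
# the "+1" gate has local gradient 1, so the value passes through unchanged
dexp = math.exp(-s) * dinv                # through the exp gate: e^u at u = -1    -> ~-0.20
ds   = -1.0 * dexp                        # through the *(-1) gate                 -> ~0.20

# backward route 2: treat the sigmoid as one gate, dsigma = (1 - f) * f
ds_shortcut = (1.0 - f) * f               # ~0.20, same value as ds

# gradients on the inputs (add gate distributes, mul gate switches)
dw2 = ds                                  # ~0.20
dx0, dw0 = w0 * ds, x0 * ds               # ~0.40, ~-0.20
dx1, dw1 = w1 * ds, x1 * ds               # ~-0.60, ~-0.40
```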
Patterns in backward flow

- add gate: gradient distributor — passes the upstream gradient unchanged to all of its inputs
- max gate: gradient router — routes the full upstream gradient to the input that achieved the max; the other inputs get gradient 0
- mul gate: gradient switcher — each input's gradient is the upstream gradient scaled by the other input's value
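The same three rules as a code sketch (my own illustrative helpers; `upstream` is the gradient arriving from above, and the forward inputs x and y are assumed to have been cached):

```python
def add_backward(upstream):
    """add gate: distributor -- both inputs receive the upstream gradient unchanged."""
    return upstream, upstream

def max_backward(x, y, upstream):
    """max gate: router -- the winning input gets the full gradient, the other gets 0."""
    return (upstream, 0.0) if x >= y else (0.0, upstream)

def mul_backward(x, y, upstream):
    """mul gate: switcher -- each input's gradient is scaled by the other input's value."""
    return y * upstream, x * upstream
```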
Gradients add at branches

When a variable feeds into multiple parts of the graph, the gradients flowing back from those branches are summed (the multivariate chain rule).
Gradients for vectorized code (x, y, z are now vectors)

The local gradient is now a Jacobian matrix: the derivative of each element of z with respect to each element of x. The node f still multiplies the upstream gradient by this local Jacobian to produce the gradients on its inputs.
Vectorized operations

A 4096-d input vector passes through f(x) = max(0, x) (elementwise) to give a 4096-d output vector.

Q: What is the size of the Jacobian matrix? [4096 x 4096!]

In practice we process an entire minibatch (e.g. 100 examples) at one time, so the Jacobian would technically be a [409,600 x 409,600] matrix :\

Q2: What does it look like? Since the operation is elementwise, each output element depends only on the corresponding input element, so the Jacobian is diagonal: entry (i, i) is 1 where x_i > 0 and 0 elsewhere. In practice we never form it explicitly; we just scale the upstream gradient elementwise by that mask.
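A sketch of what this means in NumPy (the shapes and random data are illustrative, not assignment code):

```python
import numpy as np

x = np.random.randn(100, 4096)          # a minibatch of 100 inputs
out = np.maximum(0, x)                  # forward: elementwise ReLU

upstream = np.random.randn(100, 4096)   # dL/dout arriving from above
# the Jacobian of elementwise max(0, x) is diagonal, so instead of building
# a [409,600 x 409,600] matrix we just mask the upstream gradient
dx = upstream * (x > 0)                 # dL/dx, same shape as x
```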
A vectorized example: f(x, W) = ||W·x||^2 = sum_i (W·x)_i^2

e.g. with x a 2-d vector and W a 2x2 matrix. Writing q = W·x:

Forward pass: q = W·x, f = ||q||^2 = sum_i q_i^2

Backward pass: df/dq_i = 2 q_i, so the gradient on q is 2q; df/dW_{i,j} = 2 q_i x_j, so the gradient on W is 2 q x^T; and the gradient on x is 2 W^T q.

Always check: the gradient with respect to a variable should have the same shape as the variable.
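A NumPy sketch of this example (the function f(x, W) = ||W·x||^2 and the 2-d sizes follow the slide; the specific numbers are illustrative):

```python
import numpy as np

W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# forward pass
q = W.dot(x)              # intermediate vector, shape (2,)
f = np.sum(q ** 2)        # scalar output

# backward pass
dq = 2.0 * q              # dL/dq, shape (2,)            -- same shape as q
dW = np.outer(dq, x)      # dL/dW = 2 q x^T, shape (2, 2) -- same shape as W
dx = W.T.dot(dq)          # dL/dx = 2 W^T q, shape (2,)   -- same shape as x
```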
Modularized implementation: forward / backward API
Graph (or Net) object (rough pseudo code)

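A rough sketch of what such a Graph/Net object might look like, written here for a simple chain of gates rather than a general graph (the class and method names are mine, not the slide's exact pseudocode):

```python
class Net:
    def __init__(self, gates):
        # gates are assumed to be stored in topologically sorted order
        self.gates = gates

    def forward(self, inputs):
        values = inputs
        for gate in self.gates:            # forward pass: left to right
            values = gate.forward(values)
        return values                      # the loss

    def backward(self):
        grad = 1.0                         # dL/dL = 1
        for gate in reversed(self.gates):  # backward pass: right to left
            grad = gate.backward(grad)     # chain rule at every gate
        return grad                        # gradients on the inputs
```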
Modularized implementation: forward / backward API

Example gate: z = x * y (x, y, z are scalars). In forward(x, y) the gate computes z and caches x and y; in backward(dz) it uses the cached values and the chain rule: dx = y * dz, dy = x * dz.
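A sketch of such a gate as a Python class (the cache-in-forward, chain-rule-in-backward structure follows the slide; the names are mine):

```python
class MultiplyGate:
    def forward(self, x, y):
        z = x * y
        self.x, self.y = x, y      # cache inputs for the backward pass
        return z

    def backward(self, dz):
        dx = self.y * dz           # [local gradient y] x [upstream dz]
        dy = self.x * dz           # [local gradient x] x [upstream dz]
        return dx, dy
```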
Example: Caffe layers

Caffe is licensed under BSD 2-Clause

Caffe Sigmoid Layer

In the backward pass, the local sigmoid gradient is multiplied by top_diff, the upstream gradient arriving from the layer above (chain rule).

Caffe is licensed under BSD 2-Clause

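Roughly what that layer computes, written as a Python sketch rather than the actual Caffe C++ code (the function names here are mine):

```python
import numpy as np

def sigmoid_forward(bottom_data):
    # forward: elementwise sigmoid
    return 1.0 / (1.0 + np.exp(-bottom_data))

def sigmoid_backward(top_data, top_diff):
    # backward: local gradient sigma * (1 - sigma), written in terms of the
    # forward output (top_data), times the upstream gradient (top_diff)
    return top_diff * top_data * (1.0 - top_data)
```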
In Assignment 1: Writing SVM / Softmax

Stage your forward/backward computation! E.g. for the SVM, compute the scores, then the margins, then the loss in separate named steps, and cache intermediates (such as the margins) so the backward pass can reuse them.
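A hedged sketch of what "staged" might look like for a single example (not the assignment's actual code; W, x, and y here are an assumed weight matrix, input vector, and correct-class index):

```python
import numpy as np

def svm_loss_staged(W, x, y, delta=1.0):
    # forward pass, staged so each intermediate can be reused in backward
    scores = W.dot(x)                                    # stage 1: scores
    margins = np.maximum(0, scores - scores[y] + delta)  # stage 2: margins
    margins[y] = 0
    loss = np.sum(margins)                               # stage 3: loss

    # backward pass, reusing the cached margins
    dmargins = np.ones_like(margins)                     # dloss/dmargins
    dscores = (margins > 0).astype(float) * dmargins     # route through the max
    dscores[y] = -np.sum(dscores)                        # correct class collects -1 per violated margin
    dW = np.outer(dscores, x)                            # dloss/dW, same shape as W
    return loss, dW
```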
Summary so far...
- neural nets will be very large: impractical to write down gradient formulas by hand for all parameters
- backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs / parameters / intermediates
- implementations maintain a graph structure, where the nodes implement the forward() / backward() API
- forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
- backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

Next: Neural Networks

Neural networks: without the brain stuff

(Before) Linear score function: f = W x
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

e.g. for CIFAR-10: x is the 3072-d input image, h = max(0, W1 x) is a 100-d hidden layer, and s = W2 h gives the 10 class scores:

x (3072) -> W1 -> h (100) -> W2 -> s (10)
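As a two-line sketch (shapes as above; the matrices are random just to show the computation):

```python
import numpy as np

x = np.random.randn(3072)               # input image as a flat vector
W1 = np.random.randn(100, 3072)         # first layer weights
W2 = np.random.randn(10, 100)           # second layer weights

h = np.maximum(0, W1.dot(x))            # 100-d hidden layer (ReLU)
s = W2.dot(h)                           # 10 class scores
```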
Full implementation of training a 2-layer Neural Network needs ~20 lines:

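A rough sketch in the spirit of that ~20-line implementation (random data, a sigmoid hidden layer, a squared-error loss, and plain gradient descent), not the exact code from the slide:

```python
import numpy as np
from numpy.random import randn

# tiny 2-layer network trained on random data
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # forward pass
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))     # sigmoid hidden layer
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()

    # backward pass (backprop through the graph above)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid local gradient

    # gradient descent update
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```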
In Assignment 2: Writing a 2-layer net

This image by Fotis Bobolas is licensed under CC-BY 2.0

[Biological neuron: impulses are carried toward the cell body along the dendrites, and away from the cell body along the axon to the presynaptic terminals. This image by Felipe Perucho is licensed under CC-BY 3.0.]

The crude mathematical model of a neuron: each input x_i arrives along an "axon" and is scaled by a synaptic strength w_i; the cell body sums sum_i w_i x_i + b and applies an activation function f — here the sigmoid activation function, sigma(x) = 1 / (1 + e^(-x)) — to produce the output "firing rate" carried along the output axon.
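A sketch of this single-neuron model in code (in the spirit of the lecture; the class shape and names are mine):

```python
import numpy as np

class Neuron:
    def __init__(self, weights, bias):
        self.w, self.b = weights, bias

    def forward(self, inputs):
        # weighted sum of inputs plus bias, squashed by the sigmoid
        cell_body_sum = np.dot(self.w, inputs) + self.b
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))
        return firing_rate

neuron = Neuron(weights=np.array([2.0, -3.0]), bias=-3.0)
print(neuron.forward(np.array([-1.0, -2.0])))   # ~0.73, as in the earlier example
```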
Be very careful with your brain analogies!

Biological Neurons:
- Many different types
- Dendrites can perform complex non-linear computations
- Synapses are not a single weight but a complex non-linear dynamical system
- Rate code may not be adequate

[Dendritic Computation. London and Hausser]

Activation functions

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

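For reference, common definitions of the functions listed above (a sketch; the leaky slope 0.01 and the ELU alpha are common defaults, not values specified on the slide):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.where(x > 0, x, 0.01 * x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))
def maxout(x, w1, b1, w2, b2):
    # maxout takes the max of two linear functions of the input
    return np.maximum(w1.dot(x) + b1, w2.dot(x) + b2)
```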
Neural networks: Architectures

A network with one hidden layer is called a "2-layer Neural Net" or "1-hidden-layer Neural Net"; one with two hidden layers is a "3-layer Neural Net" or "2-hidden-layer Neural Net". These are built from "Fully-connected" layers, in which every neuron connects to every neuron in the adjacent layer.
Example feed-forward computation of a neural network

We can efficiently evaluate an entire layer of neurons with a single matrix multiply.
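A sketch of a 3-layer network's forward pass in this style (the layer sizes and random data are illustrative):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))      # activation function (sigmoid)

x = np.random.randn(3, 1)                   # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(np.dot(W1, x) + b1)                  # first hidden layer (4x1)
h2 = f(np.dot(W2, h1) + b2)                 # second hidden layer (4x1)
out = np.dot(W3, h2) + b3                   # output neuron (1x1)
```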
Summary
- We arrange neurons into fully-connected layers
- The abstraction of a layer has the nice property that it
allows us to use efficient vectorized code (e.g. matrix
multiplies)
- Neural networks are not really neural
- Next time: Convolutional Neural Networks

