
Deep Learning Foundations

Jie-Han Chen
2018/08/16 @ National Cheng Kung University and X-Village, Taiwan
Who am I?

● Name: Jie-Han Chen


● My research interests:
○ Artificial General Intelligence
○ Reinforcement Learning
○ Neural Network Architecture Design
● Currently a master's student in CSIE,
National Cheng Kung University, Taiwan.

● LinkedIn: link

2
The revolution
In recent years, deep learning has delivered impressive results
in many domains:

● Computer Vision
● Speech and Text Recognition
● Decision making policy and Control
● Modification and Generation

3
Computer Vision
Object Detection: Redmon et al., YOLO (2016)
Semantic Segmentation: Long et al., FCN (2015)

4
Speech and Text Recognition
Speech Recognition and Chatbot
Language Translation

5
Decision Making Policy and Control
Gaming (AI development)
Robotics

6
Modification and Generation

click!

7
Interesting applications

Quickdraw: https://quickdraw.withgoogle.com/
AutoDraw: https://www.autodraw.com/
8
They were all built by the Magic

9
10
Let’s unveil the mysteries of
Deep Learning

11
You don't need to absorb everything
in this lecture.

Just keep a few things in mind when
training neural networks.

12
Outline
● The inspiration of artificial neural network
● Perceptron & multi-layer perceptrons
● Neural network
● Optimization and learning algorithm
● Tips for training neural network
● Reference
● Resources

13
“At least two ingredients are necessary for the
advancement of a technology:
● Concept
● Implementation”
-- (quoted from Neural Network Design)

14
The inspiration of Artificial Neural Network

From neuroscience, we know that a neuron contains the following elements:

● Cell body
● Dendrites
● Axon
● Synapse

Biological Neurons
15
quoted from Neural Network Design, 2nd edition.
Single Neuron Perceptron (感知器)
● Proposed by Warren McCulloch and Walter Pitts (1943)
● Can compute arithmetic or logical functions

Diagram: the perceptron mirrors a biological neuron: dendrites carry the inputs, synapses correspond to the weights, the cell body sums them, and the output travels along the axon.
16
Single Neuron Perceptron (感知器)

Diagram: the inputs x1, x2, x3 are weighted (w1x1, w2x2, w3x3) and summed together with a bias b (the intercept term), then passed through a transfer function (activation function).
17
Single Neuron Perceptron (感知器)

Diagram: the weighted sum w1x1 + w2x2 + w3x3 + b feeds a Hard Limit Transfer Function (hardlim() or sgn()), whose output jumps from 0 to +1 at zero input.

18
Single Neuron Perceptron (感知器)

w1x1 +1

w2x2
0

w3x3

If wTx is greater than or equal to -b, the output will be


1, otherwise the output will be 0. Thus each neuron
divides the input space into two regions.
19
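A minimal NumPy sketch of this rule (the helper names hardlim and perceptron are mine, not from the slides):

    import numpy as np

    def hardlim(n):
        # hard limit transfer function: 1 if n >= 0, else 0
        return 1 if n >= 0 else 0

    def perceptron(x, w, b):
        # output is 1 exactly when w.x >= -b, i.e. hardlim(w.x + b)
        return hardlim(np.dot(w, x) + b)

    print(perceptron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))  # 1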
Single Neuron Perceptron (感知器)
Figure: the decision boundary in the (x1, x2) plane, separating the region labeled +1 from the region labeled 0.
20
Single Neuron Perceptron (感知器)
● AND operation:

Weights: w1 = 1, w2 = 1, bias b = -1.5

x1 | x2 | output
0  | 0  | 0
0  | 1  | 0
1  | 0  | 0
1  | 1  | 1

21
Figure: the decision boundary of the AND perceptron in the (x1, x2) plane; only the point (1, 1) lies on the +1 side.

22
Single Neuron Perceptron (感知器)
● OR operation:

Weights: w1 = 1, w2 = 1, bias b = -0.5

x1 | x2 | output
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1

23
Figure: the decision boundary of the OR perceptron in the (x1, x2) plane; only the point (0, 0) lies on the 0 side.

24
Single Neuron Perceptron (感知器)
● NOT operation:

Weight: w1 = -0.6, bias b = 0.5

x1 | output
0  | 1
1  | 0

25
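Using the exact weights from the last three slides, a quick check (reuses the perceptron and hardlim helpers sketched above):

    # AND: w = [1, 1], b = -1.5    OR: w = [1, 1], b = -0.5    NOT: w = [-0.6], b = 0.5
    for x1 in (0, 1):
        for x2 in (0, 1):
            x = np.array([x1, x2])
            print(x1, x2,
                  perceptron(x, np.array([1, 1]), -1.5),   # AND column
                  perceptron(x, np.array([1, 1]), -0.5))   # OR column
    for x1 in (0, 1):
        print(x1, perceptron(np.array([x1]), np.array([-0.6]), 0.5))  # NOT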
We can change the transfer function to
build different models.

26
Another kind of Single Neuron Perceptron

Diagram: the same perceptron with a Symmetrical Hard Limit Transfer Function (hardlims()), whose output jumps from -1 to +1 at zero input.

27
Linear Regression
When we change the transfer function to a linear function, the model becomes a form of
linear regression.

Diagram: the same perceptron with the linear transfer function f(x) = x.

P.S. such a linear activation function has also been
applied in ADALINE networks.
28
Logistic Regression

Diagram: the same perceptron with a logistic (sigmoid) transfer function, which turns the model into a form of logistic regression.

29
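A sketch of how swapping only the transfer function changes the model, as these two slides describe (identity for linear regression, sigmoid for logistic regression; the helper names are mine):

    import numpy as np

    def identity(n):                     # linear transfer function f(x) = x
        return n

    def sigmoid(n):                      # logistic transfer function
        return 1.0 / (1.0 + np.exp(-n))

    def neuron(x, w, b, transfer):
        return transfer(np.dot(w, x) + b)

    x, w, b = np.array([1.0, 2.0, 3.0]), np.array([0.2, -0.1, 0.4]), 0.05
    print(neuron(x, w, b, identity))     # linear-regression-style output
    print(neuron(x, w, b, sigmoid))      # logistic-regression-style output, in (0, 1)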
Learning algorithm of Perceptron
We won't cover the Perceptron Learning Algorithm (PLA) here. If you have a strong
interest in the learning algorithm, just note that it differs for different transfer
functions (hardlim vs. hardlims).

● hardlim: see the details in "Neural Network Design", Chapter 4 (formulas 4.34 and 4.35)
● hardlims: see the details in "Learning from Data", Chapter 1 (formula 1.3)
● You can also write down the loss function and use derivatives to derive the
learning algorithm.

30
The disadvantage of Perceptron
The decision boundary of a perceptron is a line, so it cannot
handle linearly inseparable problems, e.g., the XOR logic.

Figure: XOR-labeled points in the (x1, x2) plane; no single decision boundary (line) separates them.
31
Questions?

32
Linearly Separable vs. Not Linearly Separable
We can divide data into two types: linearly separable and not
linearly separable.

● Linearly separable: the data can be separated by a straight line.

● Not linearly separable: the data cannot be separated by a straight line.

33
Linearly Separable

Figure: two example datasets in the (x1, x2) plane, each of which can be separated by a straight line.

34
Linearly Separable

Figure: the same two datasets with a separating line g(x) drawn through each.
35
Not Linearly Separable

Figure: two example datasets that no single straight line can separate; the second one is collinear.
36
There is nothing a single model cannot solve. If there is...

If you cannot solve the classification
problem with a single model...

37
Just combine more models!

38
Not Linearly Separable

Figure: the same two not-linearly-separable datasets (including the collinear one), which we will handle by combining models.
39
Let's introduce some notation before
presenting the concept of a
neural network.

40
Define the notation of neural network

Diagram: each weight wi is renamed W1,i, so the weighted inputs become W1,1x1, W1,2x2, W1,3x3, plus the bias b.

Now we replace the weight vector with a matrix.
W1,i : the weight from the i-th input (source) to the 1st output neuron.
41
Define the notation of neural network

In matrix form, the weighted sum of the inputs is

    [W1,1  W1,2  W1,3] [x1  x2  x3]ᵀ + b  =  Wx + b
42
Define the notation of neural network

Applying the transfer function f to the weighted sum gives the neuron's output:

    f( [W1,1  W1,2  W1,3] [x1  x2  x3]ᵀ + b )  =  f( Wx + b )
43
Define the notation of neural network

Diagram: a single neuron computing f(Wx + b).

The output of a single neuron is a scalar.
44


A layer of Neurons - Single Neuron Perceptron
Diagram: an N-dimensional input (x1, …, xN) feeds S neurons, producing outputs a1, …, aS; the bias is written as a weight W1,0 = b1 on a constant input of 1.
Each output denotes a single model (decision boundary).
45
A layer of Neurons - Single Neuron Perceptron
Diagram: the same layer with an N-dimensional input; notice that we add a constant input of 1 for the bias (W1,0 = b1) here!
46
A layer of Neurons
Diagram: with an N-dimensional input, each of the S neurons computes its own affine combination followed by the transfer function:

    a1 = f(W1x + b1),  a2 = f(W2x + b2),  …,  aS = f(WSx + bS)
47
A layer of Neurons
A layer of neurons can be expressed as a function a = f(Wx + b), where W stacks the
weight rows W1, …, WS and b stacks the biases.

The output of this function is a vector (one element per neuron), and we will
combine the results from different neurons soon.

48
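A small NumPy sketch of a layer of S neurons on an N-dimensional input; the shapes (W is S×N, b has length S) and the sigmoid choice of f are my assumptions for illustration:

    import numpy as np

    N, S = 3, 4                              # input dimension, number of neurons
    rng = np.random.default_rng(0)
    W = rng.normal(size=(S, N))              # one weight row per neuron
    b = rng.normal(size=S)                   # one bias per neuron
    x = rng.normal(size=N)

    a = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # a = f(Wx + b), here with a sigmoid f
    print(a.shape)                           # (4,): the layer outputs a vector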
Two layers of Neurons
We can combine different models into a more powerful single model by adding
another layer of neurons.

Diagram: the outputs of the first layer of neurons feed into a second-layer neuron f.
49
Two layers of Neurons
Diagram: with two layers, the notation gains a layer index: the first layer uses weights W¹ and biases b¹, and the second layer uses W² and b².
50
XOR logic
Diagram: two hidden perceptrons (with biases -0.5 and +1.5 and weights ±1 on x1 and x2) feed an output perceptron with bias -1.5.
Use the AND operation to combine the two models!
51
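One common way to realize this construction (my choice of hidden units: an OR gate and a NAND gate, combined by the AND gate from slide 21; reuses the perceptron helper sketched earlier):

    def xor(x1, x2):
        h_or   = perceptron(np.array([x1, x2]), np.array([ 1,  1]), -0.5)    # OR
        h_nand = perceptron(np.array([x1, x2]), np.array([-1, -1]),  1.5)    # NOT AND
        return perceptron(np.array([h_or, h_nand]), np.array([1, 1]), -1.5)  # AND of the two models

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor(x1, x2))   # 0, 1, 1, 0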
Multi-Layer Perceptron (MLP)

Diagram: input layer → hidden layer → output layer
52


Multi-Layer Perceptron (MLP)

We can also add more hidden layers to an MLP.


53
Multi-Layer Perceptron (MLP)

Diagram: the data flows forward through the layers (feedforward).

The multi-layer perceptron is a kind of neural network called a
feedforward neural network.
54
Feedforward Neural Network

55
Feedforward Neural Network
A feedforward neural network is a series of transformations of the
input data x.
56
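A minimal PyTorch sketch of such a series of transformations (the layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    # x -> linear -> non-linear -> linear: a small feedforward network
    model = nn.Sequential(
        nn.Linear(4, 8),    # first transformation of the input x
        nn.ReLU(),          # non-linear activation
        nn.Linear(8, 2),    # second transformation
    )
    x = torch.randn(1, 4)
    y = model(x)            # forward pass: a series of transformations of x
    print(y.shape)          # torch.Size([1, 2])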
Why do we need
non-linear
activation function?
Why do we need non-linear activation function?
● Reason 1:

A neural network is a series of transformations of the input data. If
every transformation is linear, it is useless to add more hidden
layers: the composition is still a single linear transformation.

58
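A quick NumPy check of Reason 1: stacking two linear layers collapses into a single linear layer (random weights, biases omitted, just for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
    x = rng.normal(size=3)

    two_layers = W2 @ (W1 @ x)       # two "hidden layers" with linear activation
    one_layer  = (W2 @ W1) @ x       # one equivalent linear layer
    print(np.allclose(two_layers, one_layer))   # True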
Why do we need non-linear activation function?
● Reason 2:

A neural network is a function approximator for the ideal target
function f_target in machine learning, and f_target may be
complicated (non-linear).

59
Why do we need non-linear activation function?
● Reason 2 (illustration):

Figure quoted from Hsuan-Tien Lin's Machine Learning Foundations course
60
From Linear to Non-Linear
Use different kinds of activation functions to transform the
linear function into a non-linear function.

Figure: the linear transfer function compared with non-linear transfer functions whose outputs are squashed into [0, +1] or [-1, +1].

61
Activation Function

sigmoid function, tanh function, ReLU function

62
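The three functions on this slide, written out as a NumPy sketch:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # squashes the input into (0, 1)

    def tanh(x):
        return np.tanh(x)                 # squashes the input into (-1, 1)

    def relu(x):
        return np.maximum(0, x)           # 0 for x < 0, identity for x >= 0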
The decision boundary of Neural Network

quoted from "Introductory Overview Lecture The Deep Learning Revolution" in JSM 2018;
you can also play with ConvNetJS: https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
63
Thus, with non-linear transformations,
the neural network can be much more
powerful!

64
Model Capacity
Figure: model capacity from low to high. Linear regression and logistic regression sit at the low end; the multi-layer perceptron and deep learning / neural networks sit at the high end.
65
Model Capacity
There are many kinds of neural networks:
● Feedforward neural network
● Recurrent neural network
● Radial basis function network
● Memory neural network
● ...

They are all called "deep learning" in this era because they
use deep (many) hidden layers.

66
Recurrent neural network

67
What makes deep learning shine among the
many machine learning techniques?

68
Representation Learning
● Rule-based systems and classic machine
learning rely on human domain
knowledge to design features.
● Representation learning: no handcrafted
features; it usually takes raw inputs and
extracts representations automatically
through a series of transformations.
● Thus, representation learning is useful
for processing speech and image data.

69
The image is quoted from Goodfellow et al., Deep Learning.
Questions?

70
Learning in Neural Network
Deep learning is a subfield of machine learning, and we can
learn the weights with an optimization algorithm.

Learning algorithms:

● Gradient-based methods

● Evolutionary methods
● ...

71
Learning in Neural Network
Deep learning is a subfield of machine learning, and we can
learn the weights with an optimization algorithm.

Learning algorithms:

● Gradient-based methods (we focus on this!)

● Evolutionary methods
● ...

72
Gradient descent in Neural Network
In the previous lecture, we learned that gradient descent
can be used to optimize the learning model:

    θ ← θ - η ∇θ J(θ)

where J(θ) is the objective function and θ denotes the
parameters (weights) of the neural network.

73
Gradient descent in Neural Network
There are two questions about doing gradient descent in
neural network.

● How do we compute the gradients for the weights of each


layer?
● The landscape of the loss function is probably not convex.

74
Gradient descent in Neural Network
There are two questions about doing gradient descent in
neural network.

● How do we compute the gradients for the weights of


each layer?
● The landscape of the loss function is probably not convex.

75
How to compute the gradients?

76
The operations in a Neural Network
Let's think about a simplified example. There are 3 kinds of
computation involved in a neural network:

● Addition
● Multiplication
● Activation

Diagram: a perceptron with weighted inputs x1w1 and x2w2 and a bias b.

77
A simplified example of backpropagation
Consider a simple function f(x, y) = x + y. It can be expressed
by a computational graph.

+ z = f(x, y) = x + y

78
A simplified example of backpropagation
Consider a simple function f(x, y) = xy. It can be expressed by
a computational graph too.

* z = f(x, y) = xy

79
A simplified example of backpropagation
Consider the case of activation function:

x f z = f (x)

80
The derivative of activation function

sigmoid function, tanh function


Image Credits:
http://ronny.rest/media/blog/2017/2017_08_16_tanh/tanh_v_sigmoid.jpg
81
The derivative of activation function

ReLU (Rectified Linear Unit) 82


A simplified example of backpropagation
The following is the computational graph of a perceptron, and our objective is to find the
partial derivatives of the cost function C(y) with respect to the weight vector w.

Graph: x1·w1 → a1, x2·w2 → a2, a1 + a2 → a3, f(a3) → y.
83
Chain Rule and Backpropagation
Graph: the same perceptron, with the intermediate values named:

    a1 = w1·x1,  a2 = w2·x2,  a3 = a1 + a2,  y = f(a3)

84
Chain Rule and Backpropagation
Diagram: the same computational graph; the gradient at the output, ∂C/∂y,
is decided by your cost function.


85
Chain Rule and Backpropagation
Diagram: the same computational graph.
We can compute the gradients by the chain rule, and the
purple line (in the original figure) denotes the backward flow of backpropagation.
86
Chain Rule and Backpropagation
87
Chain Rule and Backpropagation
Diagram: the same graph; again, the gradient at the output is decided by your cost function.


88
Chain Rule and Backpropagation
Let's take a closer look at a single
computation unit.

89
Chain Rule and Backpropagation

If we want to compute the gradients
of the current computation unit, we need
two things:

● The gradient of the current output
with respect to the current weights
(the local gradient).
● The cumulative gradient from the
output side (the upstream gradient),
as in the sketch below.

90
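A sketch of these two ingredients on the graph from the previous slides (a1 = w1·x1, a2 = w2·x2, a3 = a1 + a2, y = f(a3)); the sigmoid f and the squared-error cost are my example choices:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x1, x2, w1, w2, target = 1.0, 2.0, 0.5, -0.3, 1.0

    # forward pass (cache the intermediate results)
    a1, a2 = w1 * x1, w2 * x2
    a3 = a1 + a2
    y = sigmoid(a3)
    C = 0.5 * (y - target) ** 2

    # backward pass: local gradient times the cumulative (upstream) gradient
    dC_dy  = y - target              # decided by the cost function
    dC_da3 = dC_dy * y * (1 - y)     # sigmoid'(a3) = y * (1 - y)
    dC_da1 = dC_da3 * 1.0            # the addition node passes the gradient through
    dC_dw1 = dC_da1 * x1             # multiplication node: local gradient is x1
    dC_dw2 = dC_da3 * x2
    print(dC_dw1, dC_dw2)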
What about neural network?
We can consider each neuron in a neural network as a
computation unit.

x1 Cost: C(θ)

x2

91
What about neural network?
Find: the gradient of the cost C(θ) with respect to each weight.

w1 w3
x1 Cost: C(θ)

w2 w4

x2

92
What about neural network?
Find: the gradient of the cost with respect to each weight.

    z = wᵀx + b,   a = σ(z)
    z' = w3·a + b3,   z'' = w4·a + b4
93
What about neural network?
Case 1: Output Layer

    y1 = σ(z'),  y2 = σ(z'')

The gradients at the outputs y1 and y2 are decided by the cost function, so we are done here.
94
What about neural network?
Case 2: Not the Output Layer

Compute the gradients recursively,
until we reach the output layer.

95
Gradient descent in Neural Network
There are two questions about doing gradient descent in
neural network.

● How do we compute the gradients for the weights of each


layer?
● The landscape of the loss function is probably not
convex.

96
Convex vs. Non-Convex

Figure: a convex loss J(θ) with a single minimum (left) and a non-convex loss J(θ) with multiple local minima (right).

You can find the mathematical definition on Wikipedia:


97
https://en.wikipedia.org/wiki/Convex_function
The Visualization of loss landscape in NN

Figure: A Complicated Loss Landscape. Image Credits:
98


https://www.cs.umd.edu/~tomg/projects/landscapes/
Which point (weights) is better?

● The yellow one


● The red one

99
No Absolute Answer!

● The loss landscape is
determined by the training data,
which may not be close to the
actual loss when there are only a
few samples (learning theory).
● Fortunately, a local optimum is
usually good enough to
solve most problems.

100
The issues with gradient descent
● Memory Usage
● Vanishing gradients
● Dead ReLUs
● Exploding gradients in RNNs

101
Memory Usage
Diagram: the same perceptron computational graph.

In order to compute gradients efficiently, we often cache the
forward results (e.g., a3 and x2), which may cause large memory
consumption.
102
Vanishing gradients
Let's look at the sigmoid function:
● The maximum derivative value
of the sigmoid function is 0.25
(checked in the sketch below).

● The gradients flowing through
backpropagation therefore get smaller
and smaller in the early layers
(×0.25 × 0.25 × 0.25 …).

● If the initialized weights are too
big, the outputs (Wx + b) are big,
the sigmoid saturates, and the
gradients become (nearly) zero.

reference:
103
https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
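A one-line check of the 0.25 bound (using sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)), which peaks at x = 0):

    import numpy as np

    x = np.linspace(-10, 10, 1001)
    s = 1.0 / (1.0 + np.exp(-x))
    print((s * (1 - s)).max())   # ~0.25, attained at x = 0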
Vanishing gradients in tanh

If we initialize large weights in the
neural network, tanh also suffers from
vanishing gradients.

105
Dead ReLUs
● If a neuron gets clamped to
zero in the forward pass,
then its weights will get zero
gradients and stop updating.
● Both initialization with large
weights and huge gradient
updates (an aggressive learning
rate) during the training phase
can cause the dead-ReLU
issue.

ReLU (Rectified Linear Unit) 106


Dead ReLUs

Use leaky ReLU instead.

107
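A sketch of leaky ReLU (the 0.01 slope is a common default, not prescribed by the slide; PyTorch also provides nn.LeakyReLU):

    import numpy as np

    def leaky_relu(x, slope=0.01):
        # keep a small gradient for x < 0 so the unit never goes completely dead
        return np.where(x > 0, x, slope * x)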
Exploding gradients in RNNs
Sometimes, the gradients may explode, especially in a vanilla (original) recurrent neural
network. Pascanu et al. addressed this problem and proposed a
solution called gradient clipping to relieve exploding gradients.

A special recurrent unit, the LSTM, can also relieve this issue!

108
Pascanu et al., “On the difficulty of training Recurrent Neural Networks”
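A minimal sketch of gradient clipping in a PyTorch training step; the tiny model, the data, and the max_norm value are placeholders of mine, while clip_grad_norm_ is the utility PyTorch provides for this:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)            # stand-in for a recurrent model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, target = torch.randn(4, 10), torch.randn(4, 1)

    loss = nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the update
    optimizer.step()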
Exploding gradients in RNNs
A simplified recurrent unit: at every step, the state a is multiplied by the same weight b.

109
Exploding gradients in RNNs

Unrolled over four steps, the state becomes

    a1 = a·b,  a2 = a·b²,  a3 = a·b³,  a4 = a·b·b·b·b = a·b⁴

110
Exploding gradients in RNNs
If |b| < 1, the cumulative gradients go to zero. If |b| > 1, the cumulative gradients
go to infinity.

111
The methods to relieve previous issues
● Xavier initialization - relieves vanishing gradients (sketched below)
● Kaiming initialization - relieves dead ReLUs
● Use normalization layers
○ Batch Normalization
○ Layer Normalization
○ …
● Use LSTM and gradient clipping - relieve exploding gradients
The articles about initialization and batch normalization (in Chinese):
https://zhuanlan.zhihu.com/p/25110150
You can also find similar content in Chapter 6 of "Deep Learning from Scratch" (O'Reilly).
112
https://github.com/oreilly-japan/deep-learning-from-scratch
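A sketch of applying the initializations listed above in PyTorch (layer sizes are arbitrary; xavier_uniform_ and kaiming_normal_ are the corresponding functions in torch.nn.init):

    import torch.nn as nn

    tanh_layer = nn.Linear(128, 64)   # a layer followed by sigmoid/tanh
    relu_layer = nn.Linear(64, 10)    # a layer followed by ReLU

    nn.init.xavier_uniform_(tanh_layer.weight)                        # Xavier initialization
    nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')   # Kaiming initialization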
The resources about backpropagation
● Lecture notes of CS231n, Stanford
○ http://cs231n.github.io/optimization-1/
● Hung-Yi Lee’s course video in NTU, Taiwan
○ https://www.youtube.com/watch?v=ibJpTrp5mcE&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=12
● An article from Andrej Karpathy
○ Yes you should understand backprop:
https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

113
Questions?

114
Tips for training deep neural network
● Choose Optimizer
● Data augmentation
● Regularization
● Early Stopping
● Normalization

115
Choose Optimizer
There are many optimizers in deep learning packages; the most
basic one is the SGD (Stochastic Gradient Descent) optimizer.

However, in order to optimize the objective function well,
researchers have designed many useful optimizers that help deep
learning (see also "Improving Deep Neural Networks: Hyperparameter
tuning, Regularization and Optimization").

● Stochastic Gradient Descent (SGD)
● Adagrad
● Adadelta
● RMSProp
● Adam

recommended article:
116
http://ruder.io/optimizing-gradient-descent/index.html
Choose Optimizer
Nowadays, a good optimizer has the following two attributes:

● Adaptive learning rate

● Momentum (moment)

e.g., Adam: Adaptive Moment Estimation

117
Choose Optimizer
What is momentum (moment) in an optimizer?

It uses the previous gradients to help learning.

118
119
Image Credits: Hung-yi Lee's Machine Learning Course (Tips for deep learning)
120
Image Credits: Hung-yi Lee's Machine Learning Course (Tips for deep learning)
Choose Optimizer (the case of a saddle point)

121
reference: http://ruder.io/optimizing-gradient-descent/index.html
Choose Optimizer

The results are quoted from Kingma et al., "Adam: A Method for Stochastic Optimization".
122
Choose Adam by default.

reference: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/ 123
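A sketch of the default choice in PyTorch (the model is a placeholder; lr=1e-3 is a commonly used default for Adam):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)                                    # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # adaptive learning rate + momentum
    # inside the training loop:
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()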


Data augmentation
Increase the generalizability of the model (avoid overfitting) by
providing additional human-modified data, including noise.
(Images: rotation, scaling, translation, flipping, different angles)

124
Data augmentation
Be careful!! DO NOT apply transformations that would
change the correct class.

125
Image from MNIST
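A sketch using torchvision.transforms; the particular transforms and their ranges are my illustrative choices (for MNIST digits, keep rotations small and avoid flips, which could turn a 6 into a 9):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomRotation(10),                       # small rotations only
        transforms.RandomAffine(0, translate=(0.1, 0.1)),    # small shifts
        transforms.ToTensor(),
    ])
    # pass `augment` as the `transform` argument of a torchvision dataset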
Regularization
Many strategies used in machine learning are explicitly designed to reduce the
testing error, possibly at the expense of increased training error. These strategies
are known collectively as regularization (e.g., L1 regularization, L2 regularization).

The most popular one in deep learning is dropout.

126
Regularization - Dropout

The images are quoted from Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR 2014)
127
Regularization - Dropout

The images are quoted from Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR 2014)
128
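A sketch of dropout in a small PyTorch MLP (the 0.5 drop probability and the layer sizes are illustrative):

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),    # randomly zeroes activations, during training only
        nn.Linear(256, 10),
    )
    # model.train() enables dropout; model.eval() disables it at test time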
Early Stopping

129
Quoted from Ian Goodfellow et al., “Deep Learning” in Section 7.8
Early Stopping

130
Quoted from Ian Goodfellow et al., “Deep Learning” in Section 7.8
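A minimal early-stopping sketch; model, train_one_epoch, evaluate, the patience value, and the checkpoint path are all hypothetical placeholders of mine:

    import torch

    best_val, patience, wait = float('inf'), 5, 0
    for epoch in range(100):
        train_one_epoch(model)               # hypothetical training step
        val_loss = evaluate(model)           # hypothetical validation-set evaluation
        if val_loss < best_val:
            best_val, wait = val_loss, 0
            torch.save(model.state_dict(), 'best.pt')   # keep the best weights so far
        else:
            wait += 1
            if wait >= patience:             # stop when validation stops improving
                break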
Normalization
● Handcrafted Data Normalization
● Normalization by using neural network computation unit

131
Normalization

132
This slide was borrowed from Andrew Ng’s Machine Learning Course
Normalization
● Normalization by using neural network computation unit
○ Batch Normalization
○ Layer Normalization

Figure: the computation graph of Batch Normalization.
Image Credits: https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
133
Normalization
● Normalization by using neural network computation unit
○ Batch Normalization
○ Layer Normalization

Notice! The behavior of a normalization layer may be different
in training and testing.

134
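A sketch showing the train/test difference for a batch normalization layer in PyTorch (layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(20, 50), nn.BatchNorm1d(50), nn.ReLU(), nn.Linear(50, 2))
    x = torch.randn(8, 20)

    net.train()           # BatchNorm uses the statistics of the current mini-batch
    y_train = net(x)

    net.eval()            # BatchNorm switches to its running (population) estimates
    y_test = net(x)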
Resources about CNN
CNN: Convolutional Neural Network. The CNN is a popular neural network architecture
in deep learning, especially for computer vision tasks. Here are some nice
resources:

● CS231n, Stanford: http://cs231n.stanford.edu/


● ConvNetJS: https://cs.stanford.edu/people/karpathy/convnetjs/index.html
● Article: An Intuitive Explanation of Convolutional Neural Networks:
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
● Intuitively Understanding Convolutions for Deep Learning:
https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1
● Intuitively Understanding Convolutions for Deep Learning (Chinese)
http://bangqu.com/nNMB58.html#utm_source=Facebook_PicSee&utm_medium=Social

135
Implementation

136
Pytorch Installation on Google Colab

https://gist.github.com/JIElite/2a0643cb256cc96517ad1cbc2280dbf8
137
Reference
Book
● Deep Learning
● Neural Network Design
● Learning from Data

Course
● CS229, Stanford
● CS231n, Stanford
● Hsuan-Tien Lin’s Machine Learning Foundations, National Taiwan University
● Hung-yi Lee’s Machine Learning, National Taiwan University

And some papers and articles


138
Resources
● Deep Learning(Nature):
○ https://www.evl.uic.edu/creativecoding/courses/cs523/slides/week3/DeepLearning_LeCun.pdf
● Pytorch examples:
○ https://github.com/jcjohnson/pytorch-examples
● CS230 code example:
○ https://github.com/cs230-stanford/cs230-code-examples
● Introductory Overview Lecture The Deep Learning Revolution:
○ http://www.cs.cmu.edu/~rsalakhu/jsm2018.html
● Why softmax is named “soft”max?
○ http://neuralnetworksanddeeplearning.com/chap3.html#softmax
139
Resources
Tensorflow playground: https://playground.tensorflow.org/

140
