Jie-Han Chen
2018/08/16 @ National Cheng Kung University and X-Village, Taiwan
Who am I?
● LinkedIn: link
The revolution
In recent years, deep learning has delivered impressive results in many domains:
● Computer Vision
● Speech and Text Recognition
● Decision Making Policy and Control
● Modification and Generation
Computer Vision
(figures: object detection, semantic segmentation)
Speech and Text Recognition
(figures: speech recognition and chatbot, language translation)
Decision Making Policy and Control
(figures: gaming (AI development), robotics)
Modification and Generation
Interesting applications
● Quickdraw: https://quickdraw.withgoogle.com/
● AutoDraw: https://www.autodraw.com/
They were all built by the same "magic".
Let's unveil the mysteries of Deep Learning
You don't need to absorb every piece of content in this lecture.
Outline
● The inspiration of artificial neural networks
● Perceptron & multi-layer perceptrons
● Neural networks
● Optimization and learning algorithms
● Tips for training neural networks
● Reference
● Resources
“At least two ingredients are necessary for the
advancement of a technology:
● Concept
● Implementation”
-- (quoted from Neural Network Design)
The inspiration of Artificial Neural Network
From neuroscience, we know that a neuron contains the following components:
Biological Neurons
(figure quoted from Neural Network Design, 2nd edition)
Single Neuron Perceptron (感知器)
● The underlying neuron model was proposed by Warren McCulloch and Walter Pitts (1943)
● It can compute arithmetic or logical functions
(figure: a biological neuron with synapse (突觸), axon (軸突), and dendrite (樹突))
Single Neuron Perceptron (感知器)
The perceptron mirrors the biological neuron: the inputs x1, x2, x3 (arriving at the "dendrites") are weighted and summed with a bias b (in the "cell body"), and the result is passed through a hard limit transfer function (hardlim() or sgn()) to produce the output ("axon"):

a = hardlim(w1x1 + w2x2 + w3x3 + b), where hardlim(n) = +1 if n ≥ 0, and 0 otherwise

(figure: a biological neuron side by side with the perceptron diagram)
Single Neuron Perceptron (感知器)
(figure: the weighted sum w1x1 + w2x2 + w3x3 switches the output between +1 and 0, splitting the input space into a region labeled +1 and a region labeled 0)
Single Neuron Perceptron (感知器)
● AND operation: w1 = 1, w2 = 1, b = -1.5

x1  x2  output
0   0   0
0   1   0
1   0   0
1   1   1

(figure: the decision boundary x1 + x2 = 1.5 in the x1-x2 plane, with label +1 on one side and label 0 on the other)
Single Neuron Perceptron (感知器)
● OR operation: w1 = 1, w2 = 1, b = -0.5

x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   1

(figure: the decision boundary x1 + x2 = 0.5 in the x1-x2 plane, with label +1 on one side and label 0 on the other)
Single Neuron Perceptron (感知器)
● NOT operation: w1 = -0.6, b = 0.5

x1  output
0   1
1   0
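These gates are easy to verify in code. A minimal NumPy sketch (my own, using exactly the weights from the slides):

```python
import numpy as np

def hardlim(n):
    """Hard limit transfer function: 1 if n >= 0, else 0."""
    return (n >= 0).astype(int)

def perceptron(x, w, b):
    """Single-neuron perceptron: a = hardlim(w . x + b)."""
    return hardlim(np.dot(x, w) + b)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(perceptron(inputs, np.array([1, 1]), -1.5))   # AND -> [0 0 0 1]
print(perceptron(inputs, np.array([1, 1]), -0.5))   # OR  -> [0 1 1 1]
print(perceptron(np.array([[0], [1]]), np.array([-0.6]), 0.5))  # NOT -> [1 0]
```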
We can change the transfer function to build different models.
Another kind of Single Neuron Perceptron
(figure: the same weighted sum with bias b, but passed through a symmetric hard limit transfer function whose output is +1 or -1 instead of +1 or 0)
Linear Regression
If we change the transfer function to the linear function f(x) = x, the model becomes a form of linear regression; if we use the logistic (sigmoid) function instead, it becomes logistic regression.
(figures: the same weighted-sum neuron with a linear transfer function, and with a logistic transfer function)
Learning algorithm of the Perceptron
We won't cover the Perceptron Learning Algorithm (PLA) in detail here. If you have a strong interest in it, note that the learning algorithm differs with the transfer function, hardlim versus hardlims:
● hardlim: see the details in "Neural Network Design", Chapter 4 (equations 4.34, 4.35)
● hardlims: see the details in "Learning from Data", Chapter 1 (equation 1.3)
● You can also write down the loss function and derive the learning rule by taking derivatives. A sketch of the hardlim rule follows below.
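For reference, a minimal sketch of the hardlim update rule (equations 4.34-4.35 in Neural Network Design): the error e = t - a is folded back into the weights and bias.

```python
import numpy as np

def hardlim(n):
    return 1 if n >= 0 else 0

def pla_step(w, b, x, t):
    """One perceptron learning step: w_new = w + e*x, b_new = b + e, with e = t - a."""
    e = t - hardlim(np.dot(w, x) + b)   # error: target minus actual output
    return w + e * x, b + e
```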
The disadvantage of the Perceptron
The decision boundary of a perceptron is a line, so it cannot handle linearly inseparable problems, e.g., the XOR logic function.
(figure: XOR points in the x1-x2 plane, which no single linear decision boundary can separate)
Questions?
Linearly Separable vs. Not Linearly Separable
We can divide data into two types: linearly separable and not linearly separable.
Linearly Separable
(figures: two 2-D datasets in the x1-x2 plane, each of which a straight line can separate)
Linearly Separable
(figures: the same datasets with a separating line g(x) drawn in each)
Not Linearly Separable
(figures: two datasets that no single line can separate; one of them shows collinear (共線) points)
If you cannot solve the classification problem with a single model...

...just give it more models!
Not Linearly Separable
(figures: the same inseparable datasets, now handled by combining multiple decision boundaries)
Let's introduce some notation before presenting the concept of a neural network.
Define the notation of neural network
Now we replace the weight vector with a matrix: W1,i denotes the weight from the i-th input (source) to the 1st output neuron.
(figure: the earlier neuron redrawn with weights W1,1, W1,2, W1,3 and bias b)
Define the notation of neural network
In matrix form, the weighted sum is

W1,1x1 + W1,2x2 + W1,3x3 + b = [W1,1 W1,2 W1,3][x1 x2 x3]^T + b = Wx + b
Define the notation of neural network
Applying the transfer function f to the weighted sum gives the neuron's output:

a = f(W1,1x1 + W1,2x2 + W1,3x3 + b) = f(Wx + b)
Define the notation of neural network
The same notation extends to N-dimensional input data x = (x1, ..., xN): a neuron computes f(W1,1x1 + ... + W1,NxN + b).
(figure: a neuron over an N-dimensional input)
A layer of Neurons - Single Neuron Perceptron
Notice: we add the bias as an input here, a constant input 1 whose weight is W1,0 = b1.
(figure: a layer of S neurons over N-dimensional input data, producing outputs a1, a2, ..., aS)
A layer of Neurons
(figure: S neurons sharing the same N-dimensional input; the i-th neuron has its own weight row Wi and bias bi)

a1 = f(W1x + b1)
a2 = f(W2x + b2)
...
aS = f(WSx + bS)
A layer of Neurons
A layer of neurons can be expressed by a single function:

a = f(Wx + b)

where W is an S×N weight matrix, b is a length-S bias vector, and f is applied element-wise.
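A minimal NumPy sketch of that function (the sigmoid choice is my own; the slides leave f generic):

```python
import numpy as np

def layer(x, W, b, f):
    """One layer of S neurons: a = f(Wx + b), with f applied element-wise."""
    return f(W @ x + b)

sigmoid = lambda n: 1.0 / (1.0 + np.exp(-n))
x = np.random.randn(4)       # N = 4 inputs
W = np.random.randn(3, 4)    # S = 3 neurons
b = np.random.randn(3)
print(layer(x, W, b, sigmoid))  # 3 activations
```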
Two layers of Neurons
We can combine different models into a single, more powerful model by adding another layer of neurons.
(figure: a second layer stacked on top of the first layer's outputs)
Two layers of Neurons
(figure: superscripts index the layer: layer 1 has weights W¹, biases b¹, and transfer function f1; layer 2 has weights W², biases b², and transfer function f2)

a¹ = f1(W¹x + b¹),  a² = f2(W²a¹ + b²)
XOR logic
XOR is not linearly separable, but a two-layer network computes it by combining two linear models. One hidden neuron computes OR(x1, x2) (weights 1, 1, bias b = -0.5), the other computes NAND(x1, x2) (weights -1, -1, bias b = +1.5), and the output neuron merges them (weights 1, 1, bias b = -1.5). Use the AND operation to combine two models!
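A minimal NumPy sketch of this two-layer XOR network with the slide's weights (the OR/NAND reading of the hidden layer is my reconstruction of the figure):

```python
import numpy as np

hardlim = lambda n: (n >= 0).astype(int)

def xor_net(x):
    """XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2))."""
    W1 = np.array([[ 1.0,  1.0],    # hidden neuron 1: OR
                   [-1.0, -1.0]])   # hidden neuron 2: NAND
    b1 = np.array([-0.5, 1.5])
    W2 = np.array([1.0, 1.0])       # output neuron: AND
    b2 = -1.5
    h = hardlim(W1 @ x + b1)
    return hardlim(W2 @ h + b2)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x)))  # -> 0, 1, 1, 0
```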
Multi-Layer Perceptron (MLP)
(figure: input data flows forward through successive layers; this is the feedforward computation)
Feedforward Neural Network
(figure: an input x propagated through the layers of a feedforward network)
Why do we need a non-linear activation function?
A composition of linear maps is still linear: W²(W¹x + b¹) + b² = (W²W¹)x + (W²b¹ + b²). Without a non-linear activation, stacking layers adds no capacity beyond a single linear layer.
(figure: hard limit transfer functions with outputs +1 and 0)
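A quick NumPy check of that collapse argument (my example, not from the slides):

```python
import numpy as np

x = np.random.randn(4)
W1, b1 = np.random.randn(3, 4), np.random.randn(3)
W2, b2 = np.random.randn(2, 3), np.random.randn(2)

two_linear_layers = W2 @ (W1 @ x + b1) + b2             # stacked, no activation
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)       # equivalent single layer
print(np.allclose(two_linear_layers, one_linear_layer))  # True
```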
Activation Function
(figure: common activation functions)
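Three common choices, sketched in NumPy (sigmoid and ReLU both appear later in this lecture; tanh is my addition):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # 0 for negatives, identity otherwise
```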
The decision boundary of Neural Network
(figure quoted from "Introductory Overview Lecture: The Deep Learning Revolution", JSM 2018)
You can also play with ConvNetJS: https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Thus, with non-linear transformations, the neural network becomes much more powerful!
Model Capacity
(figure: models ordered by capacity, from linear regression & logistic regression at the low end to multi-layer perceptrons, deep learning & neural networks at the high end)
Model Capacity
There are many kinds of neural networks:
● Feedforward neural networks
● Recurrent neural networks
● Radial basis function networks
● Memory networks
● ...
Recurrent neural network
(figure: a recurrent neural network, whose hidden state feeds back into itself across time steps)
What makes deep learning shine among the many machine learning techniques?
Representation Learning
● Rule-based systems and classic machine learning rely on human domain knowledge to design features.
● Representation learning needs no handcrafted features: it usually takes raw inputs and extracts representations automatically through a series of transformations.
● Thus, representation learning is especially useful for processing speech data and image data.
(The image is quoted from Goodfellow et al., Deep Learning.)
Questions?
Learning in Neural Network
Deep learning is a subfield of machine learning, and we can learn the weights with an optimization algorithm. The learning algorithm we focus on here is gradient descent.
Gradient descent in Neural Network
In the previous lecture, we learned that gradient descent can be used to optimize a learning model.
Gradient descent in Neural Network
There are two questions about doing gradient descent in a neural network:
● How do we compute the gradients?
● Will gradient descent find a good solution when the objective is non-convex?
How to compute the gradients?
The operations in Neural Network
Let's think about a simplified example. There are 3 kinds of computation involved in a neural network:
● Addition
● Multiplication
● Activation
(figure: a neuron computing the weighted sum x1w1 + x2w2 + b before the activation)
A simplified example of backpropagation
Consider a simple function f(x, y) = x + y. It can be expressed by a computational graph: x and y feed a "+" node that outputs z = f(x, y) = x + y. The local gradients are ∂z/∂x = 1 and ∂z/∂y = 1.
A simplified example of backpropagation
Consider a simple function f(x, y) = xy. It can be expressed by a computational graph too: x and y feed a "*" node that outputs z = f(x, y) = xy. The local gradients swap the operands: ∂z/∂x = y and ∂z/∂y = x.
A simplified example of backpropagation
Consider the case of an activation function: x feeds a node f that outputs z = f(x), with local gradient ∂z/∂x = f'(x).
The derivative of activation function
(figure: the derivative of the activation function, used at the activation node during the backward pass)
Chain Rule and Backpropagation
The whole neuron is a computational graph:

a1 = w1x1,  a2 = w2x2,  a3 = a1 + a2,  y = f(a3)

(figure: x1, w1 and x2, w2 feed two "*" nodes; their outputs a1, a2 feed a "+" node producing a3, which feeds f to produce y)
Chain Rule and Backpropagation
We can compute the gradients by the chain rule, e.g., ∂y/∂w1 = (∂y/∂a3)(∂a3/∂a1)(∂a1/∂w1) = f'(a3) · 1 · x1. Walking the graph backward from y toward the inputs (the purple line in the figure) is exactly the behavior of backpropagation.
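To make this concrete, a minimal sketch of one forward and backward pass through this exact graph (the slides leave f unspecified; sigmoid is my assumption):

```python
import numpy as np

def forward_backward(x1, w1, x2, w2):
    # Forward pass through the computational graph.
    a1 = w1 * x1                    # multiplication node
    a2 = w2 * x2                    # multiplication node
    a3 = a1 + a2                    # addition node
    y = 1.0 / (1.0 + np.exp(-a3))   # activation node (sigmoid assumed)

    # Backward pass: chain rule from the output back to the weights.
    dy_da3 = y * (1.0 - y)          # sigmoid'(a3)
    dy_dw1 = dy_da3 * 1.0 * x1      # '+' passes the gradient through; '*' swaps operands
    dy_dw2 = dy_da3 * 1.0 * x2
    return y, dy_dw1, dy_dw2

print(forward_backward(1.0, 0.5, -1.0, 0.3))
```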
What about neural network?
We can consider each neuron in a neural network as a computational unit.
(figure: a small two-layer network with inputs x1, x2 and cost C(θ))
What about neural network?
Find: the gradient of the cost with respect to each weight, e.g., ∂C(θ)/∂w1.
(figure: the same network, with weights w1, w2 in the first layer and w3, w4 in the second)
What about neural network?
Find: ∂C(θ)/∂w

z = w^T x + b,  a = σ(z)
z' = w3·a + b3,  z'' = w4·a + b4

(figure: the hidden neuron's activation a feeds the two next-layer neurons with pre-activations z' and z'')
What about neural network?
Case 1: Output Layer
y1 = σ(z'), y2 = σ(z'')
By the chain rule, ∂C/∂z' = σ'(z') · ∂C/∂y1, and ∂C/∂y1 is decided by the cost function, so this case is done.
What about neural network?
Case 2: Not an Output Layer
The gradient flows back from the next layer: ∂C/∂z = σ'(z) · [w3 · ∂C/∂z' + w4 · ∂C/∂z'']. Applying this rule recursively from the output layer back toward the inputs is backpropagation.
Gradient descent in Neural Network
Back to the second question about doing gradient descent in a neural network: will it find a good solution?
Convex vs. Non-Convex
(figures: J(θ) versus θ: a convex bowl with a single global minimum, and a non-convex curve with several local minima; a neural network's objective is non-convex)
No Absolute Answer!
The issues with gradient descent
● Memory usage
● Vanishing gradients
● Dead ReLUs
● Exploding gradients in RNNs
Memory Usage
(figure: the earlier computational graph)
To compute the gradients efficiently, we often cache the forward results (e.g., a3 and x2, which the backward pass reuses), and this caching can consume a large amount of memory.
Vanishing gradients
Let's look at the sigmoid function:
● Since σ'(x) = σ(x)(1 - σ(x)), the maximum derivative value of the sigmoid is 0.25 (at x = 0).
● Every sigmoid a gradient passes through on the way back scales it by at most 0.25, so gradients shrink rapidly with depth.
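A tiny numeric illustration of that shrinkage bound (my example):

```python
# The gradient through k stacked sigmoids is scaled by at most 0.25**k.
for k in (1, 5, 10, 20):
    print(k, 0.25 ** k)   # 0.25, ~9.8e-4, ~9.5e-7, ~9.1e-13
```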
Dead ReLUs
● If a neuron gets clamped to zero in the forward pass, its weights get zero gradients and stop updating.
● Both initialization with large weights and huge gradient updates (an aggressive learning rate) during the training phase can cause the dead ReLU issue.
Exploding gradients in RNNs
Sometimes the gradients may explode, especially in a vanilla recurrent neural network (RNN). Pascanu et al. addressed this problem and proposed a solution called gradient clipping to relieve exploding gradients.
(Pascanu et al., "On the difficulty of training Recurrent Neural Networks")
Exploding gradients in RNNs
A simplified recurrent unit multiplies its state by the same weight b at every step:

a1 = a·b,  a2 = a·b²,  a3 = a·b³,  a4 = a·b⁴ (= a*b*b*b*b)

If |b| < 1, the cumulative gradients go to zero; if |b| > 1, the cumulative gradients go to infinity.
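A minimal sketch of Pascanu et al.'s gradient clipping (in PyTorch, torch.nn.utils.clip_grad_norm_ does the same job):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50
print(clip_gradient(g))      # rescaled to norm 5 -> [3. 4.]
```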
The methods to relieve the previous issues
● Xavier initialization - relieves vanishing gradients
● Kaiming initialization - relieves dead ReLUs
● Use normalization layers
○ Batch Normalization
○ Layer Normalization
○ …
● Use LSTM or gradient clipping - relieves exploding gradients
An article about initialization and batch normalization (Chinese): https://zhuanlan.zhihu.com/p/25110150
You can also find similar content in Chapter 6 of "Deep Learning from Scratch" (O'Reilly): https://github.com/oreilly-japan/deep-learning-from-scratch
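A minimal sketch of the two initialization schemes (the 1/n_in and 2/n_in variances are the standard choices; the helper names are mine):

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier/Glorot: variance ~ 1/n_in keeps the signal stable through sigmoid/tanh layers."""
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def kaiming_init(n_in, n_out):
    """Kaiming/He: variance ~ 2/n_in compensates for ReLU zeroing half the activations."""
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
```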
The resources about backpropagation
● Lecture notes of CS231n, Stanford
○ http://cs231n.github.io/optimization-1/
● Hung-yi Lee's course video at NTU, Taiwan
○ https://www.youtube.com/watch?v=ibJpTrp5mcE&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=12
● An article from Andrej Karpathy
○ Yes you should understand backprop: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
Questions?
Tips for training deep neural networks
● Choose Optimizer
● Data augmentation
● Regularization
● Early Stopping
● Normalization
Choose Optimizer
There are many optimizers in deep learning packages; the most basic one is the Stochastic Gradient Descent (SGD) optimizer. However, to optimize the objective function well, researchers have designed many refined optimizers that help deep learning:
● Stochastic Gradient Descent (SGD)
● Adagrad
● Adadelta
● RMSProp
● Adam
See also the course "Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization".
Recommended article: http://ruder.io/optimizing-gradient-descent/index.html
Choose Optimizer
Nowadays, a better optimizer typically has the following two attributes: momentum and an adaptive, per-parameter learning rate.

What is moment/momentum in an optimizer? Momentum keeps an exponentially decaying average of past gradients and adds it to the current update, which damps oscillations and keeps the update moving across plateaus.
(images credit: Hung-yi Lee's Machine Learning Course, "Tips for deep learning")
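A minimal sketch of the classic momentum update (γ = 0.9 is a common default; the exact form varies across libraries):

```python
def momentum_step(w, v, grad, lr=0.01, gamma=0.9):
    """v accumulates an exponentially decaying average of past gradients."""
    v = gamma * v - lr * grad
    return w + v, v
```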
Choose Optimizer (the case of a saddle point)
(animation: how different optimizers escape a saddle point; reference: http://ruder.io/optimizing-gradient-descent/index.html)
Choose Optimizer
(figure: training curves quoted from Kingma et al., "Adam: A Method for Stochastic Optimization")
Choose Adam by default.
Data augmentation
Be careful! DO NOT apply transformations that would change the correct class; for example, rotating an MNIST "6" by 180° turns it into a "9".
(image from MNIST)
Regularization
Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization (e.g., L1 regularization, L2 regularization).
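A minimal sketch of the L2 case (my example): a penalty λ·Σw² is added to the data loss, which pushes weights toward zero.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-4):
    """Total loss = data loss + lambda * sum of squared weights."""
    return data_loss + lam * sum(np.sum(w ** 2) for w in weights)
```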
Regularization - Dropout
(images quoted from Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", JMLR 2014)
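A minimal sketch of inverted dropout at training time (the formulation is mine; Srivastava et al. describe the equivalent scale-at-test-time variant):

```python
import numpy as np

def dropout(a, p=0.5, training=True):
    """Zero each activation with probability p; rescale so the expectation is unchanged."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)
    return a * mask
```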
Early Stopping
Stop training when the validation error stops improving, and keep the parameters from the point of lowest validation error.
(figures quoted from Ian Goodfellow et al., "Deep Learning", Section 7.8)
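A minimal sketch of the early-stopping loop with a patience counter (train_epoch, validate, and copy_params are hypothetical helpers):

```python
def train_with_early_stopping(model, patience=5):
    best_val, best_params, waited = float("inf"), None, 0
    while waited < patience:
        train_epoch(model)           # hypothetical: one pass over the training data
        val_loss = validate(model)   # hypothetical: loss on the validation set
        if val_loss < best_val:      # improvement: remember these parameters
            best_val, best_params, waited = val_loss, model.copy_params(), 0
        else:
            waited += 1              # no improvement this epoch
    return best_params               # parameters at the lowest validation error
```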
Normalization
● Handcrafted data normalization
● Normalization by a neural network computation unit
(figure borrowed from Andrew Ng's Machine Learning Course)
Normalization
● Normalization by a neural network computation unit:
○ Batch Normalization
○ Layer Normalization
(image credit: the computation graph of Batch Normalization, from https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html)
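A minimal sketch of batch normalization at training time (the standard formula; γ and β are learned, ε avoids division by zero):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```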
Resources about CNN
CNN (Convolutional Neural Network) is a popular neural network architecture in deep learning, especially for computer vision tasks. Here are some nice resources:
Implementation
Pytorch Installation on Google Colab
https://gist.github.com/JIElite/2a0643cb256cc96517ad1cbc2280dbf8
Reference
Books
● Deep Learning (Goodfellow, Bengio, and Courville)
● Neural Network Design (Hagan, Demuth, Beale, and De Jesús)
● Learning from Data (Abu-Mostafa, Magdon-Ismail, and Lin)
Courses
● CS229, Stanford
● CS231n, Stanford
● Hsuan-Tien Lin's Machine Learning Foundations, National Taiwan University
● Hung-yi Lee's Machine Learning, National Taiwan University