
Deep Learning Foundations

Jie-Han Chen
2018/08/16 @ National Cheng Kung University and X-Village, Taiwan
Who am I?

● Name: Jie-Han Chen


● My research interests:
○ Artificial General Intelligence
○ Reinforcement Learning
○ Neural Network Architecture Design
● Currently a master's student in CSIE,
National Cheng Kung University, Taiwan.

● LinkedIn: link

2
The revolution
In recent years, deep learning has delivered impressive results
in many domains:

● Computer Vision
● Speech and Text Recognition
● Decision making policy and Control
● Modification and Generation

3
Computer Vision
Object Detection: Redmon et al., YOLO (2016)
Semantic Segmentation: Long et al., FCN (2015)

4
Speech and Text Recognition
Speech Recognition and Chatbot
Language Translation

5
Decision Making Policy and Control
Gaming (AI development)
Robotics

6
Modification and Generation

click!

7
Interesting applications

Quickdraw: https://quickdraw.withgoogle.com/
AutoDraw: https://www.autodraw.com/
8
They were all built by the Magic

9
10
Let’s unveil the mysteries of
Deep Learning

11
You don't need to absorb everything
in this lecture.

Just keep a few things in mind when
training neural networks.

12
Outline
● The inspiration of artificial neural network
● Perceptron & multi-layer perceptrons
● Neural network
● Optimization and learning algorithm
● Tips for training neural network
● Reference
● Resources

13
“At least two ingredients are necessary for the
advancement of a technology:
● Concept
● Implementation”
-- (quoted from Neural Network Design)

14
The inspiration of Artificial Neural Network

From neuroscience, we know that a neuron contains the following elements:

● Cell body
● Dendrites
● Axon
● Synapse

Biological Neurons
15
quoted from Neural Network Design, 2nd edition.
Single Neuron Perceptron (感知器)
● Proposed by Warren McCulloch and Walter Pitts (1943)
● Can compute arithmetic or logical functions

Diagram: the perceptron mirrors a biological neuron: dendrites carry the inputs, synapses correspond to the weights, the cell body sums them, and the output travels along the axon.
16
Single Neuron Perceptron (感知器)

Diagram: the inputs x1, x2, x3 are weighted (w1x1, w2x2, w3x3) and summed together with a bias b (the intercept term), then passed through a transfer function (activation function).
17
Single Neuron Perceptron (感知器)

Diagram: the weighted sum w1x1 + w2x2 + w3x3 + b feeds a Hard Limit Transfer Function (hardlim() or sgn()), whose output jumps from 0 to +1 at zero input.

18
Single Neuron Perceptron (感知器)

w1x1 +1

w2x2
0

w3x3

If wTx is greater than or equal to -b, the output will be


1, otherwise the output will be 0. Thus each neuron
divides the input space into two regions.
19
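A minimal NumPy sketch of this rule (the helper names hardlim and perceptron are mine, not from the slides):

    import numpy as np

    def hardlim(n):
        # hard limit transfer function: 1 if n >= 0, else 0
        return 1 if n >= 0 else 0

    def perceptron(x, w, b):
        # output is 1 exactly when w.x >= -b, i.e. hardlim(w.x + b)
        return hardlim(np.dot(w, x) + b)

    print(perceptron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))  # 1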
Single Neuron Perceptron (感知器)
Figure: the decision boundary in the (x1, x2) plane, separating the region labeled +1 from the region labeled 0.
20
Single Neuron Perceptron (感知器)
● AND operation:

Weights: w1 = 1, w2 = 1, bias b = -1.5

x1 | x2 | output
0  | 0  | 0
0  | 1  | 0
1  | 0  | 0
1  | 1  | 1

21
Figure: the decision boundary of the AND perceptron in the (x1, x2) plane; only the point (1, 1) lies on the +1 side.

22
Single Neuron Perceptron (感知器)
● OR operation:

Weights: w1 = 1, w2 = 1, bias b = -0.5

x1 | x2 | output
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1

23
Figure: the decision boundary of the OR perceptron in the (x1, x2) plane; only the point (0, 0) lies on the 0 side.

24
Single Neuron Perceptron (感知器)
● NOT operation:

Weight: w1 = -0.6, bias b = 0.5

x1 | output
0  | 1
1  | 0

25
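Using the exact weights from the last three slides, a quick check (reuses the perceptron and hardlim helpers sketched above):

    # AND: w = [1, 1], b = -1.5    OR: w = [1, 1], b = -0.5    NOT: w = [-0.6], b = 0.5
    for x1 in (0, 1):
        for x2 in (0, 1):
            x = np.array([x1, x2])
            print(x1, x2,
                  perceptron(x, np.array([1, 1]), -1.5),   # AND column
                  perceptron(x, np.array([1, 1]), -0.5))   # OR column
    for x1 in (0, 1):
        print(x1, perceptron(np.array([x1]), np.array([-0.6]), 0.5))  # NOT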
We can change the transfer function to
build different models.

26
Another kind of Single Neuron Perceptron

Diagram: the same perceptron with a Symmetrical Hard Limit Transfer Function (hardlims()), whose output jumps from -1 to +1 at zero input.

27
Linear Regression
When we change the transfer function to a linear function, the model becomes a form of
linear regression.

Diagram: the same perceptron with the linear transfer function f(x) = x.

P.S. such a linear activation function has also been
applied in ADALINE networks.
28
Logistic Regression

Diagram: the same perceptron with a logistic (sigmoid) transfer function, which turns the model into a form of logistic regression.

29
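A sketch of how swapping only the transfer function changes the model, as these two slides describe (identity for linear regression, sigmoid for logistic regression; the helper names are mine):

    import numpy as np

    def identity(n):                     # linear transfer function f(x) = x
        return n

    def sigmoid(n):                      # logistic transfer function
        return 1.0 / (1.0 + np.exp(-n))

    def neuron(x, w, b, transfer):
        return transfer(np.dot(w, x) + b)

    x, w, b = np.array([1.0, 2.0, 3.0]), np.array([0.2, -0.1, 0.4]), 0.05
    print(neuron(x, w, b, identity))     # linear-regression-style output
    print(neuron(x, w, b, sigmoid))      # logistic-regression-style output, in (0, 1)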
Learning algorithm of Perceptron
We won't cover the Perceptron Learning Algorithm (PLA) here. If you have a strong
interest in the learning algorithm, just note that it differs for different transfer
functions (hardlim vs. hardlims).

● hardlim: see the details in "Neural Network Design", Chapter 4 (formulas 4.34 and 4.35)
● hardlims: see the details in "Learning from Data", Chapter 1 (formula 1.3)
● You can also write down the loss function and use derivatives to derive the
learning algorithm.

30
The disadvantage of Perceptron
The decision boundary of a perceptron is a line, so it cannot
handle linearly inseparable problems, e.g., the XOR logic.

Figure: XOR-labeled points in the (x1, x2) plane; no single decision boundary (line) separates them.
31
Questions?

32
Linearly Separable vs. Not Linearly Separable
We can divide data into two types: linearly separable and not
linearly separable.

● Linearly separable: the data can be separated by a straight line.

● Not linearly separable: the data cannot be separated by a straight line.

33
Linearly Separable

Figure: two example datasets in the (x1, x2) plane, each of which can be separated by a straight line.

34
Linearly Separable

Figure: the same two datasets with a separating line g(x) drawn through each.
35
Not Linearly Separable

Figure: two example datasets that no single straight line can separate; the second one is collinear.
36
There is nothing a single model cannot solve. If there is...

If you cannot solve the classification
problem with a single model...

37
Just combine more models!

38
Not Linearly Separable

Figure: the same two not-linearly-separable datasets (including the collinear one), which we will handle by combining models.
39
Let's introduce some notation before
presenting the concept of a
neural network.

40
Define the notation of neural network

Diagram: each weight wi is renamed W1,i, so the weighted inputs become W1,1x1, W1,2x2, W1,3x3, plus the bias b.

Now we replace the weight vector with a matrix.
W1,i : the weight from the i-th input (source) to the 1st output neuron.
41
Define the notation of neural network

In matrix form, the weighted sum of the inputs is

    [W1,1  W1,2  W1,3] [x1  x2  x3]ᵀ + b  =  Wx + b
42
Define the notation of neural network

Applying the transfer function f to the weighted sum gives the neuron's output:

    f( [W1,1  W1,2  W1,3] [x1  x2  x3]ᵀ + b )  =  f( Wx + b )
43
Define the notation of neural network

Diagram: a single neuron computing f(Wx + b).

The output of a single neuron is a scalar.
44


A layer of Neurons - Single Neuron Perceptron
Diagram: an N-dimensional input (x1, …, xN) feeds S neurons, producing outputs a1, …, aS; the bias is written as a weight W1,0 = b1 on a constant input of 1.
Each output denotes a single model (decision boundary).
45
A layer of Neurons - Single Neuron Perceptron
Diagram: the same layer with an N-dimensional input; notice that we add a constant input of 1 for the bias (W1,0 = b1) here!
46
A layer of Neurons
Diagram: with an N-dimensional input, each of the S neurons computes its own affine combination followed by the transfer function:

    a1 = f(W1x + b1),  a2 = f(W2x + b2),  …,  aS = f(WSx + bS)
47
A layer of Neurons
A layer of neurons can be expressed as a function a = f(Wx + b), where W stacks the
weight rows W1, …, WS and b stacks the biases.

The output of this function is a vector (one element per neuron), and we will
combine the results from different neurons soon.

48
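A small NumPy sketch of a layer of S neurons on an N-dimensional input; the shapes (W is S×N, b has length S) and the sigmoid choice of f are my assumptions for illustration:

    import numpy as np

    N, S = 3, 4                              # input dimension, number of neurons
    rng = np.random.default_rng(0)
    W = rng.normal(size=(S, N))              # one weight row per neuron
    b = rng.normal(size=S)                   # one bias per neuron
    x = rng.normal(size=N)

    a = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # a = f(Wx + b), here with a sigmoid f
    print(a.shape)                           # (4,): the layer outputs a vector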
Two layers of Neurons
We can combine different models into a more powerful single model by adding
another layer of neurons.

Diagram: the outputs of the first layer of neurons feed into a second-layer neuron f.
49
Two layers of Neurons
Diagram: with two layers, the notation gains a layer index: the first layer uses weights W¹ and biases b¹, and the second layer uses W² and b².
50
XOR logic
Diagram: two hidden perceptrons (with biases -0.5 and +1.5 and weights ±1 on x1 and x2) feed an output perceptron with bias -1.5.
Use the AND operation to combine the two models!
51
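One common way to realize this construction (my choice of hidden units: an OR gate and a NAND gate, combined by the AND gate from slide 21; reuses the perceptron helper sketched earlier):

    def xor(x1, x2):
        h_or   = perceptron(np.array([x1, x2]), np.array([ 1,  1]), -0.5)    # OR
        h_nand = perceptron(np.array([x1, x2]), np.array([-1, -1]),  1.5)    # NOT AND
        return perceptron(np.array([h_or, h_nand]), np.array([1, 1]), -1.5)  # AND of the two models

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor(x1, x2))   # 0, 1, 1, 0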
Multi-Layer Perceptron (MLP)

Diagram: input layer → hidden layer → output layer
52


Multi-Layer Perceptron (MLP)

We can also add more hidden layers to an MLP.


53
Multi-Layer Perceptron (MLP)

Diagram: the data flows forward through the layers (feedforward).

The multi-layer perceptron is a kind of neural network called a
feedforward neural network.
54
Feedforward Neural Network

55
Feedforward Neural Network
A feedforward neural network is a series of transformations of the
input data x.
56
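A minimal PyTorch sketch of such a series of transformations (the layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    # x -> linear -> non-linear -> linear: a small feedforward network
    model = nn.Sequential(
        nn.Linear(4, 8),    # first transformation of the input x
        nn.ReLU(),          # non-linear activation
        nn.Linear(8, 2),    # second transformation
    )
    x = torch.randn(1, 4)
    y = model(x)            # forward pass: a series of transformations of x
    print(y.shape)          # torch.Size([1, 2])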
Why do we need
non-linear
activation function?
Why do we need non-linear activation function?
● Reason 1:

A neural network is a series of transformations of the input data. If
every transformation is linear, it is useless to add more hidden
layers: the composition is still a single linear transformation.

58
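A quick NumPy check of Reason 1: stacking two linear layers collapses into a single linear layer (random weights, biases omitted, just for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
    x = rng.normal(size=3)

    two_layers = W2 @ (W1 @ x)       # two "hidden layers" with linear activation
    one_layer  = (W2 @ W1) @ x       # one equivalent linear layer
    print(np.allclose(two_layers, one_layer))   # True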
Why do we need non-linear activation function?
● Reason 2:

A neural network is a function approximator for the ideal target
function f_target in machine learning, and f_target may be
complicated (non-linear).

59
Why do we need non-linear activation function?
● Reason 2 (illustration):

Figure quoted from Hsuan-Tien Lin's Machine Learning Foundations course
60
From Linear to Non-Linear
Use different kinds of activation functions to transform the
linear function into a non-linear function.

Figure: the linear transfer function compared with non-linear transfer functions whose outputs are squashed into [0, +1] or [-1, +1].

61
Activation Function

sigmoid function, tanh function, ReLU function

62
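The three functions on this slide, written out as a NumPy sketch:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # squashes the input into (0, 1)

    def tanh(x):
        return np.tanh(x)                 # squashes the input into (-1, 1)

    def relu(x):
        return np.maximum(0, x)           # 0 for x < 0, identity for x >= 0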
The decision boundary of Neural Network

quoted from "Introductory Overview Lecture The Deep Learning Revolution" in JSM 2018;
you can also play with ConvNetJS: https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
63
Thus, with non-linear transformations,
the neural network can be much more
powerful!

64
Model Capacity
Figure: model capacity from low to high. Linear regression and logistic regression sit at the low end; the multi-layer perceptron and deep learning / neural networks sit at the high end.
65
Model Capacity
There are many kinds of neural networks:
● Feedforward neural network
● Recurrent neural network
● Radial basis function network
● Memory neural network
● ...

They are all called "deep learning" in this era because they
use deep (many) hidden layers.

66
Recurrent neural network

67
What makes deep learning shine among the
many machine learning techniques?

68
Representation Learning
● Rule-based systems and classic machine
learning rely on human domain
knowledge to design features.
● Representation learning: no handcrafted
features; it usually takes raw inputs and
extracts representations automatically
through a series of transformations.
● Thus, representation learning is useful
for processing speech and image data.

69
The image is quoted from Goodfellow et al., Deep Learning.
Questions?

70
Learning in Neural Network
Deep learning is a subfield of machine learning, and we can
learn the weights with an optimization algorithm.

Learning algorithms:

● Gradient-based methods

● Evolutionary methods
● ...

71
Learning in Neural Network
Deep learning is a subfield of machine learning, and we can
learn the weights with an optimization algorithm.

Learning algorithms:

● Gradient-based methods (we focus on this!)

● Evolutionary methods
● ...

72
Gradient descent in Neural Network
In the previous lecture, we learned that gradient descent
can be used to optimize the learning model:

    θ ← θ - η ∇θ J(θ)

where J(θ) is the objective function and θ denotes the
parameters (weights) of the neural network.

73
Gradient descent in Neural Network
There are two questions about doing gradient descent in
neural network.

● How do we compute the gradients for the weights of each


layer?
● The landscape of the loss function is probably not convex.

74
Gradient descent in Neural Network
There are two questions about doing gradient descent in
neural network.

● How do we compute the gradients for the weights of


each layer?
● The landscape of the loss function is probably not convex.

75
How to compute the gradients?

76
The operations in a Neural Network
Let's think about a simplified example. There are 3 kinds of
computation involved in a neural network:

● Addition
● Multiplication
● Activation

Diagram: a perceptron with weighted inputs x1w1 and x2w2 and a bias b.

77
A simplified example of backpropagation
Consider a simple function f(x, y) = x + y. It can be expressed
by a computational graph.

+ z = f(x, y) = x + y

78
A simplified example of backpropagation
Consider a simple function f(x, y) = xy. It can be expressed by
a computational graph too.

* z = f(x, y) = xy

79
A simplified example of backpropagation
Consider the case of activation function:

x f z = f (x)

80
The derivative of activation function

sigmoid function, tanh function


Image Credits:
http://ronny.rest/media/blog/2017/2017_08_16_tanh/tanh_v_sigmoid.jpg
81
The derivative of activation function

ReLU (Rectified Linear Unit) 82


A simplified example of backpropagation
The following is the computational graph of a perceptron, and our objective is to find the
partial derivatives of the cost function C(y) with respect to the weight vector w.

Graph: x1·w1 → a1, x2·w2 → a2, a1 + a2 → a3, f(a3) → y.
83
Chain Rule and Backpropagation
Graph: the same perceptron, with the intermediate values named:

    a1 = w1·x1,  a2 = w2·x2,  a3 = a1 + a2,  y = f(a3)

84
Chain Rule and Backpropagation
Diagram: the same computational graph; the gradient at the output, ∂C/∂y,
is decided by your cost function.


85
Chain Rule and Backpropagation
Diagram: the same computational graph.
We can compute the gradients by the chain rule, and the
purple line (in the original figure) denotes the backward flow of backpropagation.
86
Chain Rule and Backpropagation
87
Chain Rule and Backpropagation
Diagram: the same graph; again, the gradient at the output is decided by your cost function.


88
Chain Rule and Backpropagation
Let's take a closer look at a single
computation unit.

89
Chain Rule and Backpropagation

If we want to compute the gradients
of the current computation unit, we need
two things:

● The gradient of the current output
with respect to the current weights
(the local gradient).
● The cumulative gradient from the
output side (the upstream gradient),
as in the sketch below.

90
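A sketch of these two ingredients on the graph from the previous slides (a1 = w1·x1, a2 = w2·x2, a3 = a1 + a2, y = f(a3)); the sigmoid f and the squared-error cost are my example choices:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x1, x2, w1, w2, target = 1.0, 2.0, 0.5, -0.3, 1.0

    # forward pass (cache the intermediate results)
    a1, a2 = w1 * x1, w2 * x2
    a3 = a1 + a2
    y = sigmoid(a3)
    C = 0.5 * (y - target) ** 2

    # backward pass: local gradient times the cumulative (upstream) gradient
    dC_dy  = y - target              # decided by the cost function
    dC_da3 = dC_dy * y * (1 - y)     # sigmoid'(a3) = y * (1 - y)
    dC_da1 = dC_da3 * 1.0            # the addition node passes the gradient through
    dC_dw1 = dC_da1 * x1             # multiplication node: local gradient is x1
    dC_dw2 = dC_da3 * x2
    print(dC_dw1, dC_dw2)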
What about neural network?
We can consider each neuron in a neural network as a
computation unit.

x1 Cost: C(θ)

x2

91
What about neural network?
Find: the gradient of the cost C(θ) with respect to each weight.

w1 w3
x1 Cost: C(θ)

w2 w4

x2

92
What about neural network?
Find: the gradient of the cost with respect to each weight.

    z = wᵀx + b,   a = σ(z)
    z' = w3·a + b3,   z'' = w4·a + b4
93
What about neural network?
Case 1: Output Layer

    y1 = σ(z'),  y2 = σ(z'')

The gradients at the outputs y1 and y2 are decided by the cost function, so we are done here.
94
What about neural network?
Case 2: Not the Output Layer

Compute the gradients recursively,
until we reach the output layer.

95
Gradient descent in Neural Network
There are two questions about doing gradient descent in
neural network.

● How do we compute the gradients for the weights of each


layer?
● The landscape of the loss function is probably not
convex.

96
Convex vs. Non-Convex

Figure: a convex loss J(θ) with a single minimum (left) and a non-convex loss J(θ) with multiple local minima (right).

You can find the mathematical definition on Wikipedia:


97
https://en.wikipedia.org/wiki/Convex_function
The Visualization of loss landscape in NN

Figure: A Complicated Loss Landscape. Image Credits:
98


https://www.cs.umd.edu/~tomg/projects/landscapes/
Which point (weights) is better?

● The yellow one


● The red one

99
No Absolute Answer!

● The loss landscape is
determined by the training data,
which may not be close to the
actual loss when there are only a
few samples (learning theory).
● Fortunately, a local optimum is
usually good enough to
solve most problems.

100
The issues with gradient descent
● Memory Usage
● Vanishing gradients
● Dead ReLUs
● Exploding gradients in RNNs

101
Memory Usage
Diagram: the same perceptron computational graph.

In order to compute gradients efficiently, we often cache the
forward results (e.g., a3 and x2), which may cause large memory
consumption.
102
Vanishing gradients
Let's look at the sigmoid function:
● The maximum derivative value
of the sigmoid function is 0.25
(checked in the sketch below).

● The gradients flowing through
backpropagation therefore get smaller
and smaller in the early layers
(×0.25 × 0.25 × 0.25 …).

● If the initialized weights are too
big, the outputs (Wx + b) are big,
the sigmoid saturates, and the
gradients become (nearly) zero.

reference:
103
https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
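A one-line check of the 0.25 bound (using sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)), which peaks at x = 0):

    import numpy as np

    x = np.linspace(-10, 10, 1001)
    s = 1.0 / (1.0 + np.exp(-x))
    print((s * (1 - s)).max())   # ~0.25, attained at x = 0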
Vanishing gradients in tanh

If we initialize large weights in the
neural network, tanh also suffers from
vanishing gradients.

105
Dead ReLUs
● If a neuron gets clamped to
zero in the forward pass,
then its weights will get zero
gradients and stop updating.
● Both initialization with large
weights and huge gradient
updates (an aggressive learning
rate) during the training phase
can cause the dead-ReLU
issue.

ReLU (Rectified Linear Unit) 106


Dead ReLUs

Use leaky ReLU instead.

107
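A sketch of leaky ReLU (the 0.01 slope is a common default, not prescribed by the slide; PyTorch also provides nn.LeakyReLU):

    import numpy as np

    def leaky_relu(x, slope=0.01):
        # keep a small gradient for x < 0 so the unit never goes completely dead
        return np.where(x > 0, x, slope * x)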
Exploding gradients in RNNs
Sometimes, the gradients may explode, especially in a vanilla (original) recurrent neural
network. Pascanu et al. addressed this problem and proposed a
solution called gradient clipping to relieve exploding gradients.

A special recurrent unit, the LSTM, can also relieve this issue!

108
Pascanu et al., “On the difficulty of training Recurrent Neural Networks”
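A minimal sketch of gradient clipping in a PyTorch training step; the tiny model, the data, and the max_norm value are placeholders of mine, while clip_grad_norm_ is the utility PyTorch provides for this:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)            # stand-in for a recurrent model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, target = torch.randn(4, 10), torch.randn(4, 1)

    loss = nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the update
    optimizer.step()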
Exploding gradients in RNNs
A simplified recurrent unit: at every step, the state a is multiplied by the same weight b.

109
Exploding gradients in RNNs

Unrolled over four steps, the state becomes

    a1 = a·b,  a2 = a·b²,  a3 = a·b³,  a4 = a·b·b·b·b = a·b⁴

110
Exploding gradients in RNNs
If |b| < 1, the cumulative gradients go to zero. If |b| > 1, the cumulative gradients
go to infinity.

111
The methods to relieve previous issues
● Xavier initialization - relieves vanishing gradients (sketched below)
● Kaiming initialization - relieves dead ReLUs
● Use normalization layers
○ Batch Normalization
○ Layer Normalization
○ …
● Use LSTM and gradient clipping - relieve exploding gradients
The articles about initialization and batch normalization (in Chinese):
https://zhuanlan.zhihu.com/p/25110150
You can also find similar content in Chapter 6 of "Deep Learning from Scratch" (O'Reilly).
112
https://github.com/oreilly-japan/deep-learning-from-scratch
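A sketch of applying the initializations listed above in PyTorch (layer sizes are arbitrary; xavier_uniform_ and kaiming_normal_ are the corresponding functions in torch.nn.init):

    import torch.nn as nn

    tanh_layer = nn.Linear(128, 64)   # a layer followed by sigmoid/tanh
    relu_layer = nn.Linear(64, 10)    # a layer followed by ReLU

    nn.init.xavier_uniform_(tanh_layer.weight)                        # Xavier initialization
    nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')   # Kaiming initialization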
The resources about backpropagation
● Lecture notes of CS231n, Stanford
○ http://cs231n.github.io/optimization-1/
● Hung-Yi Lee’s course video in NTU, Taiwan
○ https://www.youtube.com/watch?v=ibJpTrp5mcE&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=12
● An article from Andrej Karpathy
○ Yes you should understand backprop:
https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

113
Questions?

114
Tips for training deep neural network
● Choose Optimizer
● Data augmentation
● Regularization
● Early Stopping
● Normalization

115
Choose Optimizer
There are many optimizers in deep learning packages; the most
basic one is the SGD (Stochastic Gradient Descent) optimizer.

However, in order to optimize the objective function well,
researchers have designed many useful optimizers that help deep
learning (see also "Improving Deep Neural Networks: Hyperparameter
tuning, Regularization and Optimization").

● Stochastic Gradient Descent (SGD)
● Adagrad
● Adadelta
● RMSProp
● Adam

recommended article:
116
http://ruder.io/optimizing-gradient-descent/index.html
Choose Optimizer
Nowadays, a good optimizer has the following two attributes:

● Adaptive learning rate

● Momentum (moment)

e.g., Adam: Adaptive Moment Estimation

117
Choose Optimizer
What is momentum (moment) in an optimizer?

It uses the previous gradients to help learning.

118
119
Image Credits: Hung-yi Lee's Machine Learning Course (Tips for deep learning)
120
Image Credits: Hung-yi Lee's Machine Learning Course (Tips for deep learning)
Choose Optimizer (the case of a saddle point)

121
reference: http://ruder.io/optimizing-gradient-descent/index.html
Choose Optimizer

The results are quoted from Kingma et al., "Adam: A Method for Stochastic Optimization".
122
Choose Adam by default.

reference: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/ 123
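A sketch of the default choice in PyTorch (the model is a placeholder; lr=1e-3 is a commonly used default for Adam):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)                                    # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # adaptive learning rate + momentum
    # inside the training loop:
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()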


Data augmentation
Increase the generalizability of the model (avoid overfitting) by
providing additional human-modified data, including noise.
(Images: rotation, scaling, translation, flipping, different angles)

124
Data augmentation
Be careful!! DO NOT apply transformations that would
change the correct class.

125
Image from MNIST
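A sketch using torchvision.transforms; the particular transforms and their ranges are my illustrative choices (for MNIST digits, keep rotations small and avoid flips, which could turn a 6 into a 9):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomRotation(10),                       # small rotations only
        transforms.RandomAffine(0, translate=(0.1, 0.1)),    # small shifts
        transforms.ToTensor(),
    ])
    # pass `augment` as the `transform` argument of a torchvision dataset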
Regularization
Many strategies used in machine learning are explicitly designed to reduce the
testing error, possibly at the expense of increased training error. These strategies
are known collectively as regularization (e.g., L1 regularization, L2 regularization).

The most popular one in deep learning is dropout.

126
Regularization - Dropout

The images are quoted from Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR 2014)
127
Regularization - Dropout

The images are quoted from Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR 2014)
128
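A sketch of dropout in a small PyTorch MLP (the 0.5 drop probability and the layer sizes are illustrative):

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),    # randomly zeroes activations, during training only
        nn.Linear(256, 10),
    )
    # model.train() enables dropout; model.eval() disables it at test time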
Early Stopping

129
Quoted from Ian Goodfellow et al., “Deep Learning” in Section 7.8
Early Stopping

130
Quoted from Ian Goodfellow et al., “Deep Learning” in Section 7.8
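A minimal early-stopping sketch; model, train_one_epoch, evaluate, the patience value, and the checkpoint path are all hypothetical placeholders of mine:

    import torch

    best_val, patience, wait = float('inf'), 5, 0
    for epoch in range(100):
        train_one_epoch(model)               # hypothetical training step
        val_loss = evaluate(model)           # hypothetical validation-set evaluation
        if val_loss < best_val:
            best_val, wait = val_loss, 0
            torch.save(model.state_dict(), 'best.pt')   # keep the best weights so far
        else:
            wait += 1
            if wait >= patience:             # stop when validation stops improving
                break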
Normalization
● Handcrafted Data Normalization
● Normalization by using neural network computation unit

131
Normalization

132
This slide was borrowed from Andrew Ng’s Machine Learning Course
Normalization
● Normalization by using neural network computation unit
○ Batch Normalization
○ Layer Normalization

Figure: the computation graph of Batch Normalization.
Image Credits: https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
133
Normalization
● Normalization by using neural network computation unit
○ Batch Normalization
○ Layer Normalization

Notice! The behavior of a normalization layer may be different
in training and testing.

134
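A sketch showing the train/test difference for a batch normalization layer in PyTorch (layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(20, 50), nn.BatchNorm1d(50), nn.ReLU(), nn.Linear(50, 2))
    x = torch.randn(8, 20)

    net.train()           # BatchNorm uses the statistics of the current mini-batch
    y_train = net(x)

    net.eval()            # BatchNorm switches to its running (population) estimates
    y_test = net(x)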
Resources about CNN
CNN: Convolutional Neural Network. The CNN is a popular neural network architecture
in deep learning, especially for computer vision tasks. Here are some nice
resources:

● CS231n, Stanford: http://cs231n.stanford.edu/


● ConvNetJS: https://cs.stanford.edu/people/karpathy/convnetjs/index.html
● Article: An Intuitive Explanation of Convolutional Neural Networks:
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
● Intuitively Understanding Convolutions for Deep Learning:
https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1
● Intuitively Understanding Convolutions for Deep Learning (Chinese)
http://bangqu.com/nNMB58.html#utm_source=Facebook_PicSee&utm_medium=Social

135
Implementation

136
Pytorch Installation on Google Colab

https://gist.github.com/JIElite/2a0643cb256cc96517ad1cbc2280dbf8
137
Reference
Book
● Deep Learning
● Neural Network Design
● Learning from Data

Course
● CS229, Stanford
● CS231n, Stanford
● Hsuan-Tien Lin’s Machine Learning Foundations, National Taiwan University
● Hung-yi Lee’s Machine Learning, National Taiwan University

And some papers and articles


138
Resources
● Deep Learning(Nature):
○ https://www.evl.uic.edu/creativecoding/courses/cs523/slides/week3/DeepLearning_LeCun.pdf
● Pytorch examples:
○ https://github.com/jcjohnson/pytorch-examples
● CS230 code example:
○ https://github.com/cs230-stanford/cs230-code-examples
● Introductory Overview Lecture The Deep Learning Revolution:
○ http://www.cs.cmu.edu/~rsalakhu/jsm2018.html
● Why softmax is named “soft”max?
○ http://neuralnetworksanddeeplearning.com/chap3.html#softmax
139
Resources
Tensorflow playground: https://playground.tensorflow.org/

140
