Multi-GPU / multi-core / multi-machine training
Data parallelism: split the data into shards and run the same model on each shard in parallel.
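As an illustration only (not from the workshop materials), a toy NumPy sketch of synchronous data parallelism: each simulated worker computes the gradient of the same linear model on its own data shard, and the averaged gradient is applied once:

```python
# Toy sketch of synchronous data parallelism (illustrative only): each "worker"
# computes gradients of the same linear model on its own shard of the batch;
# the gradients are then averaged and applied in a single update.
import numpy as np

def grad_mse(w, X, y):
    # Gradient of 0.5 * ||Xw - y||^2 / n for a linear model.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(128, 10)), rng.normal(size=128)
w = np.zeros(10)

shards = np.array_split(np.arange(len(y)), 4)           # pretend: 4 GPUs/machines
grads = [grad_mse(w, X[idx], y[idx]) for idx in shards]  # each worker, same model
w -= 0.1 * np.mean(grads, axis=0)                        # average and update once
```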
Computing infrastructure, researchers, environment, and policy as a launchpad
“No two stem cells are identical, even if they are genetic clones …. Computer scientists analysed thousands of the images using deep
learning programs and found relationships between the locations of cellular structures. They then used that information to predict where
the structures might be when the program was given just a couple of clues, such as the position of the nucleus. The program ‘learned’ by
comparing its predictions to actual cells”
Reinforcement learning: AlphaGo defeats Lee Sedol (2016), two decades after Deep Blue (1997). Automated learning!
Impact of deep learning in NLP
Language model → Class-based language model → Neural language models
word2vec [Mikolov et al. 2013]
A brief history of neural networks:
- 1949: Donald Hebb (Canadian psychologist) proposes Hebb's rule: "Neurons that fire together wire together" - the cornerstone of connectionism, an attempt to explain how neurons learn.
- 1969: Marvin Minsky and Seymour Papert publish the book "Perceptrons": layered neurons are expressive, but nobody knew how to learn them; learning with bigger and more hidden layers was infeasible.
- 1982: John Hopfield (Caltech physicist) notices the analogy between neurons and spin glasses.
- 1985: The autoencoder (AE) is re-invented.
- 1986: Resurgence of neural networks and connectionism; "machine learning = neural networks".
- mid-1990s: The heat for neural networks was over.
- 2006: Hinton, Osindero, and Teh publish "A fast learning algorithm for deep belief nets".
- Stacked sparse AE: Google learns 'cat' from 10 million YouTube videos.
Some historical motivations
Our brain is made up of a network of inter-connected neurons and has a highly parallel architecture.
ANNs (artificial neural networks) are motivated by biological neural systems.
Two groups of ANN researchers:
- Those who use ANNs to study/model the brain.
- Those who use the brain as motivation to design ANNs as effective learning machines, which need not be true models of the brain. Deviation from the brain model is not necessarily bad, e.g., airplanes do not fly like birds.
So far, most, if not all, use the brain's architecture as motivation to build computational models.
[Figure: a single artificial neuron with inputs $x_1, x_2, \ldots, x_M$, weights $w_1, w_2, \ldots, w_M$, a bias weight $w_0$, a summation unit, and a plot of the sigmoid activation.]

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{d\sigma(z)}{dz} = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$

Neural activation: $h(\boldsymbol{x}) = f(\boldsymbol{w}^\top \boldsymbol{x} + b)$
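As a concrete illustration (not code from the slides), a minimal NumPy sketch of the sigmoid, its derivative, and the neuron activation h(x) = f(w^T x + b):

```python
# Minimal sketch of a single sigmoid neuron h(x) = sigma(w^T x + b).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma(z) * (1 - sigma(z))

def neuron(x, w, b):
    return sigmoid(w @ x + b)     # h(x) = sigma(w^T x + b)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron(x, w, b=0.2))
```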
𝐽(𝒘) is a nonconvex, high-dimensional function with possibly many
local minima.
At a point $\boldsymbol{x} = (x_1, x_2)$, the gradient vector of the function $f(\boldsymbol{x})$ w.r.t. $\boldsymbol{x}$ is
$$\nabla f(\boldsymbol{x}) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right),$$
and gradient descent moves from $\boldsymbol{x}$ towards the minimizer $\boldsymbol{x}^*$ along the direction $-\nabla f(\boldsymbol{x})$. [Source: Wikipedia]
$$w_i \leftarrow w_i + \eta\,(y_t - \hat{y}_t)\,x_{ti}$$
Compare with standard (batch) gradient descent:
- Less computation required in each iteration.
- Approaches the minimum in a stochastic sense.
- Uses a smaller learning rate $\eta$, hence needs more iterations.
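A short sketch of this stochastic update for a single sigmoid unit on a made-up toy dataset:

```python
# Sketch of the stochastic update w_i <- w_i + eta * (y_t - y_hat_t) * x_ti
# for a single sigmoid unit, one training example at a time (illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # toy binary targets

w, eta = np.zeros(3), 0.1
for t in rng.permutation(len(y)):          # one randomly chosen example per step
    y_hat = sigmoid(w @ X[t])
    w += eta * (y[t] - y_hat) * X[t]       # stochastic gradient step
```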
[Figure: a two-layer network with inputs $x_1, \ldots, x_m$, hidden units $z_1, z_2, \ldots$ (4 nodes), and outputs $y_1, \ldots, y_K$.]

For a hidden-layer weight $w_{ij}^h$ (connecting input $x_i$ to hidden unit $z_j$), the gradient descent update is
$$w_{ij}^h \leftarrow w_{ij}^h - \eta\,\frac{\partial J_t(\boldsymbol{w})}{\partial w_{ij}^h}.$$
By the chain rule,
$$\frac{\partial J_t(\boldsymbol{w})}{\partial w_{ij}^h} = \frac{\partial J_t}{\partial z_j} \times \frac{\partial z_j}{\partial w_{ij}^h}, \qquad \frac{\partial z_j}{\partial w_{ij}^h} = z_j(1 - z_j)\,x_i,$$
and, since $y_k = \sum_j w_{jk}^o z_j$ so that $\partial y_k / \partial z_j = w_{jk}^o$,
$$\frac{\partial J_t}{\partial z_j} = \sum_{k=1}^{K} \frac{\partial J_t}{\partial y_k} \times \frac{\partial y_k}{\partial z_j} = \sum_{k=1}^{K} (-\delta_k^o)\,w_{jk}^o.$$
Writing $\frac{\partial J_t(\boldsymbol{w})}{\partial w_{ij}^h} = -\delta_j^h\,x_i$ with the hidden-unit error $\delta_j^h = z_j(1 - z_j) \sum_{k=1}^{K} w_{jk}^o \delta_k^o$, the update becomes
$$w_{ij}^h \leftarrow w_{ij}^h + \eta\,\delta_j^h\,x_i.$$
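A minimal NumPy sketch of one such update for a network with sigmoid hidden units, linear outputs, and squared-error loss (the notation follows the derivation above; the toy data are made up):

```python
# One backprop update for hidden-layer weights W_h and output-layer weights W_o
# (sigmoid hidden units z, linear outputs y, loss J_t = 0.5 * ||y - y_hat||^2).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
m, H, K = 5, 4, 2                     # inputs, hidden units, outputs
W_h = rng.normal(scale=0.1, size=(m, H))
W_o = rng.normal(scale=0.1, size=(H, K))
x, y = rng.normal(size=m), rng.normal(size=K)
eta = 0.05

# Forward pass
z = sigmoid(x @ W_h)                  # hidden activations z_j
y_hat = z @ W_o                       # linear outputs y_k

# Backward pass (errors delta as in the derivation above)
delta_o = y - y_hat                   # output errors delta_k^o
delta_h = z * (1 - z) * (W_o @ delta_o)   # delta_j^h = z_j(1-z_j) sum_k w_jk^o delta_k^o

# Gradient-descent updates
W_o += eta * np.outer(z, delta_o)     # w_jk^o <- w_jk^o + eta * delta_k^o * z_j
W_h += eta * np.outer(x, delta_h)     # w_ij^h <- w_ij^h + eta * delta_j^h * x_i
```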
Momentum helps accelerate SGD along the relevant directions and dampens oscillations:
$$\boldsymbol{v} \leftarrow \alpha\boldsymbol{v} - \eta\,\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta} + \alpha\boldsymbol{v}), \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \boldsymbol{v}$$
(the gradient is evaluated at the look-ahead point $\boldsymbol{\theta} + \alpha\boldsymbol{v}$, i.e., Nesterov momentum).
AdaGrad accumulates squared gradients and scales the learning rate per parameter:
$$\boldsymbol{g} \leftarrow \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad \boldsymbol{\gamma} \leftarrow \boldsymbol{\gamma} + \boldsymbol{g} \odot \boldsymbol{g}, \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\eta}{\delta + \sqrt{\boldsymbol{\gamma}}} \odot \boldsymbol{g}$$
where $\delta$ is a small constant (e.g., $10^{-7}$) for numerical stability.
- Parameters with large partial derivatives get a rapid decrease in their effective learning rates.
- Parameters with small partial derivatives get a relatively small decrease in their effective learning rates.
- Weakness: the learning rate always decreases!
RMSProp replaces the accumulation with an exponentially decaying average:
$$\boldsymbol{g} \leftarrow \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad \boldsymbol{\gamma} \leftarrow \beta\boldsymbol{\gamma} + (1 - \beta)\,\boldsymbol{g} \odot \boldsymbol{g}, \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\eta}{\delta + \sqrt{\boldsymbol{\gamma}}} \odot \boldsymbol{g}$$
RMSProp with (Nesterov) momentum:
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\boldsymbol{v} \;\;\text{(interim look-ahead)}, \qquad \boldsymbol{g} \leftarrow \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad \boldsymbol{\gamma} \leftarrow \beta\boldsymbol{\gamma} + (1 - \beta)\,\boldsymbol{g} \odot \boldsymbol{g}$$
$$\boldsymbol{v} \leftarrow \alpha\boldsymbol{v} - \frac{\eta}{\delta + \sqrt{\boldsymbol{\gamma}}} \odot \boldsymbol{g}, \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \boldsymbol{v}$$
Adam keeps decaying averages of both the gradient and its square, with bias correction:
$$t \leftarrow t + 1, \qquad \boldsymbol{g} \leftarrow \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$$
$$\boldsymbol{s} \leftarrow \beta_1\boldsymbol{s} + (1 - \beta_1)\,\boldsymbol{g}, \qquad \boldsymbol{\gamma} \leftarrow \beta_2\boldsymbol{\gamma} + (1 - \beta_2)\,\boldsymbol{g} \odot \boldsymbol{g}$$
$$\hat{\boldsymbol{s}} \leftarrow \frac{\boldsymbol{s}}{1 - \beta_1^t}, \qquad \hat{\boldsymbol{\gamma}} \leftarrow \frac{\boldsymbol{\gamma}}{1 - \beta_2^t}, \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\,\frac{\hat{\boldsymbol{s}}}{\delta + \sqrt{\hat{\boldsymbol{\gamma}}}}$$
Suggested default values: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\delta = 10^{-8}$.
Should always be used as the first choice!
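A minimal NumPy sketch of the Adam update above (illustrative, not the workshop's code), applied to a toy quadratic objective:

```python
# One Adam step; grad_J is a placeholder for the gradient of the objective.
import numpy as np

def adam_step(theta, grad_J, s, gamma, t, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    t += 1
    g = grad_J(theta)
    s = beta1 * s + (1 - beta1) * g                 # first-moment estimate
    gamma = beta2 * gamma + (1 - beta2) * g * g     # second-moment estimate
    s_hat = s / (1 - beta1 ** t)                    # bias correction
    gamma_hat = gamma / (1 - beta2 ** t)
    theta = theta - eta * s_hat / (delta + np.sqrt(gamma_hat))
    return theta, s, gamma, t

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta, s, gamma, t = np.ones(3), np.zeros(3), np.zeros(3), 0
for _ in range(1000):
    theta, s, gamma, t = adam_step(theta, lambda th: 2 * th, s, gamma, t)
print(theta)   # close to zero
```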
Tricks and tips for DNN
Overfitting problem - quick fixes
Performance is measured by the error on "unseen" (held-out) data.
Minimizing the error on the training data alone does not generalize well.
Causes: too many hidden nodes and overtraining.
Possible quick fixes:
- Use cross-validation or early stopping, e.g., stop training when the validation error starts to grow.
- Weight decay: also minimize the magnitude of the weights, keeping them small (since the sigmoid function is almost linear near 0, small weights make the decision surfaces less non-linear and smoother).
- Keep the number of hidden nodes small!
[Figure: "proper fit" vs. "over fit" decision boundaries.]
Overfitting: the tendency of the network to "memorize" all training samples, leading to poor generalization.
L2 regularization (weight decay):
$$\Omega(\boldsymbol{\theta}) = \sum_k \sum_{i,j} \left(w_{i,j}^{(k)}\right)^2 = \sum_k \left\|\boldsymbol{W}^{(k)}\right\|_F^2, \qquad \text{gradient: } \nabla_{\boldsymbol{W}^{(k)}} \Omega(\boldsymbol{\theta}) = 2\boldsymbol{W}^{(k)}$$
- Apply on weights ($\boldsymbol{W}$) only, not on biases ($\boldsymbol{b}$).
- Bayesian interpretation: the weights follow a Gaussian prior.

L1 regularization:
$$\Omega(\boldsymbol{\theta}) = \sum_k \sum_{i,j} \left|w_{i,j}^{(k)}\right|$$
- Optimization is now much harder – subgradient.
- Apply on weights ($\boldsymbol{W}$) only, not on biases ($\boldsymbol{b}$).
- Bayesian interpretation: the weights follow a Laplacian prior.
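A small sketch of how these penalties enter a gradient step, applied to the weights only (illustrative; the lambda and eta values are placeholders):

```python
# L2 penalty lambda * ||W||_F^2 and L1 penalty lambda * sum |w_ij| added to a
# gradient step; biases are left unpenalized, as noted above.
import numpy as np

def step_with_l2(W, b, grad_W, grad_b, eta=0.01, lam=1e-4):
    W = W - eta * (grad_W + lam * 2 * W)        # gradient of lam * ||W||_F^2 is 2*lam*W
    b = b - eta * grad_b                         # no penalty on biases
    return W, b

def step_with_l1(W, b, grad_W, grad_b, eta=0.01, lam=1e-4):
    W = W - eta * (grad_W + lam * np.sign(W))    # subgradient of lam * sum |w_ij|
    b = b - eta * grad_b
    return W, b
```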
Testing (with dropout rate $\mu$): keep all units and rescale the activations,
$$\boldsymbol{z}^{(l)} \leftarrow \boldsymbol{z}^{(l)} \times (1 - \mu), \qquad \boldsymbol{z}^{(l+1)} = f\!\left(\boldsymbol{W}^{(l)\top} \boldsymbol{z}^{(l)} + \boldsymbol{b}^{(l)}\right)$$
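A small sketch of dropout at training time and the test-time rescaling above (mu is assumed to be the drop probability, so 1 - mu is the keep probability):

```python
# Dropout for one layer: random zeroing at training time, rescaling at test time.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer(z, W, b, mu, training, rng):
    if training:
        mask = (rng.random(z.shape) >= mu).astype(z.dtype)  # keep with prob 1 - mu
        z = z * mask
    else:
        z = z * (1.0 - mu)                                   # test-time rescaling
    return sigmoid(W.T @ z + b)

rng = np.random.default_rng(0)
z = rng.normal(size=4); W = rng.normal(size=(4, 3)); b = np.zeros(3)
print(layer(z, W, b, mu=0.5, training=False, rng=rng))
```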
What is an autoencoder?
Simply a neural network that tries to copy its input to its output.
- Input $\boldsymbol{x} = (x_1, x_2, \ldots, x_N)^\top$.
- An encoder function $f$ parameterized by $\boldsymbol{\theta}$: $f_{\boldsymbol{\theta}}: \boldsymbol{x} \mapsto \boldsymbol{z}$.
- A coding representation $\boldsymbol{z} = (z_1, z_2, \ldots, z_K)^\top = f_{\boldsymbol{\theta}}(\boldsymbol{x})$.
- A decoder function $g$ parameterized by $\boldsymbol{\varphi}$: $g_{\boldsymbol{\varphi}}: \boldsymbol{z} \mapsto \boldsymbol{r}$.
- An output, also called the reconstruction: $\boldsymbol{r} = (r_1, r_2, \ldots, r_N)^\top = g_{\boldsymbol{\varphi}}(\boldsymbol{z}) = g_{\boldsymbol{\varphi}}(f_{\boldsymbol{\theta}}(\boldsymbol{x}))$.
- A loss function $\mathcal{J}$ that computes a scalar $\mathcal{J}(\boldsymbol{x}, \boldsymbol{r})$ measuring how good a reconstruction $\boldsymbol{r}$ is of the given input $\boldsymbol{x}$, e.g., the mean squared error loss:
$$\mathcal{J}(\boldsymbol{x}, \boldsymbol{r}) = \sum_{n=1}^{N} (x_n - r_n)^2$$
- Learning objective: $\operatorname{argmin}_{\boldsymbol{\theta}, \boldsymbol{\varphi}} \mathcal{J}(\boldsymbol{x}, \boldsymbol{r})$.
The input and output values are the same.
[Figure: network with input layer $x_1, \ldots, x_N$, code layer $z_1, \ldots, z_K$, and reconstruction layer $r_1, \ldots, r_N$.]
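A minimal NumPy sketch of this autoencoder with a sigmoid encoder, a linear decoder, and one gradient step on the squared-error loss (illustrative; sizes and data are made up):

```python
# Tiny autoencoder: code z = f_theta(x), reconstruction r = g_phi(z), loss J(x, r).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N, K, eta = 8, 3, 0.1
W_e, b_e = rng.normal(scale=0.1, size=(N, K)), np.zeros(K)   # theta (encoder)
W_d, b_d = rng.normal(scale=0.1, size=(K, N)), np.zeros(N)   # phi (decoder)
x = rng.normal(size=N)

z = sigmoid(x @ W_e + b_e)        # code z = f_theta(x)
r = z @ W_d + b_d                 # reconstruction r = g_phi(z)
J = np.sum((x - r) ** 2)          # loss J(x, r)

# Backprop of J through decoder and encoder
dr = 2 * (r - x)
dW_d, db_d = np.outer(z, dr), dr
dz = W_d @ dr
da = dz * z * (1 - z)
dW_e, db_e = np.outer(x, da), da

for P, dP in [(W_e, dW_e), (b_e, db_e), (W_d, dW_d), (b_d, db_d)]:
    P -= eta * dP                 # one step towards argmin_{theta, phi} J(x, r)
```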
Contractive autoencoder
[Figure: application examples, e.g., galaxy classification and diabetes recognition from retina images, with accuracy compared against human performance.]
- CNNs have achieved a great deal of success (especially in computer vision).
- However, they are reaching saturation: any further improvements tend to be marginal.
- Next? Deep generative models.
Rough translation: we can infer the meaning of a word if we know which other words are commonly used alongside it.
Language model (aka n-gram model):
- Distributions over sequences of symbolic tokens in a natural language.
- Use the probability chain rule.
- Estimate by counting the occurrences of n-grams.

Class-based language model:
- Reduce computation: cluster words into categories and use the category ID for the context.
- Still a symbolic representation: with a one-hot representation, the distance between any two words is always exactly √2!

Neural language models:
- Use distributed representations for words.
- Similar words will stay closer.
Fun: https://ronxin.github.io/wevi/
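For readers who want to try this in code, a tiny usage sketch with the gensim library (an assumption: the slides do not prescribe a toolkit; parameter names follow gensim version 4 or later):

```python
# Train a toy skip-gram word2vec model and inspect its distributed representations.
from gensim.models import Word2Vec

sentences = [
    ["deep", "learning", "learns", "distributed", "representations"],
    ["similar", "words", "stay", "close", "in", "the", "embedding", "space"],
    ["word2vec", "is", "a", "neural", "embedding", "model"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["learning"][:5])            # first 5 dims of the learned 50-d vector
print(model.wv.most_similar("learning"))   # nearest words by cosine similarity
```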
Motivation:
Learn continuous feature representations for nodes in large networks (e.g., social networks)
Define a flexible notion of a node’s neighbourhood
Key ideas:
Consider a node in a network as a word in a document.
Apply Word2vec to learn a distributed representation for a node
Question: “How can we define the context (neighbourhood) for a node?”
o Homophily: nodes are organized based on communities they belong to
o Structural equivalence: nodes are organized based on their structural roles
Node2vec develops a biased random walk approach to explore both types of neighbourhood
of a given node
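To make the "nodes as words" idea concrete, a simplified sketch that generates uniform random walks (DeepWalk-style) and feeds them to Word2Vec; the real node2vec walk is biased by its return and in-out parameters p and q, which are omitted here:

```python
# Uniform random walks over a toy graph, treated as "sentences" for Word2Vec.
import random
from gensim.models import Word2Vec

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def random_walk(graph, start, length, rng):
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)
walks = [random_walk(graph, node, 10, rng) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1)
print(model.wv.most_similar("a"))   # nodes with similar neighbourhoods
```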
Question-answering systems. End-to-end systems. Multimodal linguistic regularities.
- topic2vec (Niu and Dai, 2015): distributed representation for topics.
- item2vec (Barkan and Koenigstein, 2016): distributed representation of items in recommender systems.
- med2vec (Choi et al., 2016): distributed representations of ICD codes.
- node2vec (Grover and Leskovec, 2016): distributed representation for nodes in a network.
- paper2vec (Ganguly and Pudi, 2017): distributed representations of textual and graph-based information.
- sub2vec (Adhikari et al., 2017): distributed representation for subgraphs.
- cat2vec (Wen et al., 2017): distributed representation for categorical values.
- fis2vec (2017): distributed representation for frequent itemsets.
Density estimation (explicit):
- Assumption: the true, but unknown, data distribution $P(x)$.
- Data: $D = \{x_1, x_2, \ldots, x_n\}$ where $x_i \sim P(x)$.
- Goal: estimate a density $\hat{P}(x)$ from the data $D$ such that $\hat{P}(x)$ is as 'close' to $P(x)$ as possible.

Sample generation (implicit, e.g., GAN):
- Assumption: the true, but unknown, data distribution $P(x)$.
- Data: $D = \{x_1, x_2, \ldots, x_n\}$ where $x_i \sim P(x)$.
- Goal: estimate a generator $G(z)$ from the data $D$ such that samples generated by $G(z)$ are 'indistinguishable' from samples drawn from $P(x)$, where $z \sim \mathcal{N}(0, 1)$.

[Goodfellow, NIPS Tutorial 2016]
Deep Generative Models
Explicit Density Estimation
Restricted Boltzmann Machines
Deep Boltzmann Machines
[Figure: bipartite graph of an RBM with visible units $v_1, v_2, \ldots, v_N$, hidden units, connection weights $\mathbf{W}$, and biases $\mathbf{a}$, $\mathbf{b}$.]

Variables:
- $\boldsymbol{v} = (v_n)_{n=1}^{N} \in \{0,1\}^N$: visible/observed.
- $\boldsymbol{h} = (h_k)_{k=1}^{K} \in \{0,1\}^K$: hidden/latent.

Parameters:
- $\boldsymbol{a} = (a_n)_{n=1}^{N} \in \mathbb{R}^N$ and $\boldsymbol{b} = (b_k)_{k=1}^{K} \in \mathbb{R}^K$: visible and hidden biases.
- $\boldsymbol{W} = (w_{nk}) \in \mathbb{R}^{N \times K}$: connection weights.
$$\frac{\partial}{\partial \psi} \log \mathcal{L}(\boldsymbol{v}; \psi) = \mathbb{E}_{p(\boldsymbol{v},\boldsymbol{h};\psi)}\!\left[\frac{\partial E(\boldsymbol{v},\boldsymbol{h};\psi)}{\partial \psi}\right] - \mathbb{E}_{p(\boldsymbol{h} \mid \boldsymbol{v};\psi)}\!\left[\frac{\partial E(\boldsymbol{v},\boldsymbol{h};\psi)}{\partial \psi}\right]$$
The posterior factorizes: $p(\boldsymbol{h} \mid \boldsymbol{v}; \psi) = \prod_{k=1}^{K} p(h_k \mid \boldsymbol{v}; \psi)$.
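Because the model expectation in this gradient is intractable, it is commonly approximated with contrastive divergence; a minimal CD-1 sketch for a binary RBM (illustrative, not the workshop's code; the energy is assumed to be E(v,h) = -a·v - b·h - v^T W h):

```python
# One CD-1 update: positive phase uses the factorized posterior p(h|v), and a
# single Gibbs step v0 -> h0 -> v1 -> h1 approximates the model expectation.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N, K, eta = 6, 4, 0.05
W, a, b = rng.normal(scale=0.1, size=(N, K)), np.zeros(N), np.zeros(K)
v0 = rng.integers(0, 2, size=N).astype(float)     # one visible configuration

# Positive phase
ph0 = sigmoid(v0 @ W + b)
h0 = (rng.random(K) < ph0).astype(float)

# Negative phase (one Gibbs step)
pv1 = sigmoid(W @ h0 + a)
v1 = (rng.random(N) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + b)

# Parameter updates: data statistics minus (approximate) model statistics
W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
a += eta * (v0 - v1)
b += eta * (ph0 - ph1)
```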
[Figure: an RBM maps between the data space and a latent data representation.]
RBM variants:
- Covariance RBM (cRBM) (Hinton, 2010)
- Mean-covariance RBM (mcRBM) = cRBM + Gaussian RBM (Ranzato et al., 2010)
- The mean-product of Student's t-distribution models (mPoT) (Ranzato et al., 2010)
- Spike-and-slab RBM (ssRBM) (Courville et al., 2011)
- Convolutional RBMs (Desjardins and Bengio, 2008)
- Beta-Bernoulli process RBMs (Honglak Lee et al., 2012)
- An infinite RBM (Hugo Larochelle, 2015); infinite RBMs (Jun Zhu's group)
- Privacy-preserving RBMs (Li et al., 2014)
- Mixed-variate RBM, ordinal Boltzmann machine, cumulative RBM (Truyen et al., 2012-14)
- Graph-induced RBMs (Tu et al., 2015)
- Non-negative RBMs (Tu et al., 2013)
[Figure: a stack of pretrained RBMs (RBM-1, RBM-2, RBM-3) can be unrolled into a deep autoencoder (encoder/decoder), or used to form a Deep Boltzmann Machine or a Deep Belief Net.]
Deep Boltzmann Machine (DBM) [Salakhutdinov and Hinton, 2008, 2012]
[Figure: image tags generated by the DBM for example images, e.g., "nature, flower, red, green"; "blue, green, yellow, colours"; "chocolate, cake".]
DBM Applications [Ruslan Salakhutdinov, Deep Learning Summer School, 2015]
Storytelling
Recap: given data $D = \{x_1, \ldots, x_n\}$ with $x_i \sim P(x)$ for a true but unknown distribution $P(x)$, we either estimate a density $\hat{P}(x)$ that is as 'close' as possible to $P(x)$ (explicit density estimation), or estimate a generator $G(z)$, $z \sim \mathcal{N}(0,1)$, whose samples are 'indistinguishable' from samples drawn from $P(x)$ (implicit generation, e.g., GAN).
The log-likelihood and its bound:
$$\log p_{\boldsymbol{\theta}}(\boldsymbol{x}) = D_{KL}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\right) + \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x})$$
where $\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x})$ is the (variational) lower bound:
$$\log p_{\boldsymbol{\theta}}(\boldsymbol{x}) \geq \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[-\log q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) + \log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\right]$$
The bound can also be written as:
$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x}) = -D_{KL}\!\left(q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_{\boldsymbol{\theta}}(\boldsymbol{z})\right) + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\right]$$
We want to differentiate and optimize the lower bound w.r.t. both the variational parameters $\boldsymbol{\phi}$ and the generative parameters $\boldsymbol{\theta}$.
The gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x})$ can be derived easily, but $\nabla_{\boldsymbol{\phi}} \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x})$ is problematic; for example, the (score-function) approximation
$$\nabla_{\boldsymbol{\phi}}\,\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\right] = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\,\nabla_{\boldsymbol{\phi}} \log q_{\boldsymbol{\phi}}(\boldsymbol{z})\right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\!\left(\boldsymbol{x} \mid \boldsymbol{z}^{(l)}\right) \nabla_{\boldsymbol{\phi}} \log q_{\boldsymbol{\phi}}\!\left(\boldsymbol{z}^{(l)}\right)$$
exhibits very high variance. Moreover, we cannot backpropagate through the random sampling of $\boldsymbol{z}$ directly, so the reparameterization trick is used instead.
[Figure: VAE architecture with forward and backward passes. The encoder ($q_{\boldsymbol{\phi}}$) maps $\boldsymbol{x}$ to $\boldsymbol{\mu}(\boldsymbol{x})$ and $\Sigma(\boldsymbol{x})$; the decoder ($p_{\boldsymbol{\theta}}$) produces a reconstruction with loss $\|\boldsymbol{x} - \hat{\boldsymbol{x}}\|^2$ plus $KL\!\left(\mathcal{N}(\boldsymbol{\mu}(\boldsymbol{x}), \Sigma(\boldsymbol{x})) \,\|\, \mathcal{N}(0, \boldsymbol{I})\right)$. Left: sampling $\boldsymbol{z} \sim \mathcal{N}(\boldsymbol{\mu}(\boldsymbol{x}), \Sigma(\boldsymbol{x}))$ directly blocks the backward pass; right: with the reparameterization trick the noise $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I})$ enters outside the network, so gradients flow through $\boldsymbol{\mu}$ and $\Sigma$. Also shown: sampling from $p(x \mid z)$ vs. inference $q(z \mid x)$ on the data manifold. Source: http://deeplearning.jp/cvae/]
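A small sketch of the reparameterization trick for the diagonal Gaussian case shown in the figure (the mu and sigma values are made up for illustration):

```python
# Instead of sampling z ~ N(mu(x), Sigma(x)) directly, sample eps ~ N(0, I) and
# set z = mu + sigma * eps, so gradients can flow back into mu and sigma.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, -1.2])        # encoder output mu(x)
sigma = np.array([0.5, 0.8])      # encoder output: diagonal standard deviations

eps = rng.standard_normal(2)      # noise, independent of the parameters
z = mu + sigma * eps              # z ~ N(mu, diag(sigma^2)), differentiable in mu, sigma
```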
To summarize:
- $\boldsymbol{x} \sim p_{data}$: $D(\boldsymbol{x})$ tries to be near 1.
- $\boldsymbol{z} \sim p_{\boldsymbol{z}}$: $D$ tries to push $D(G(\boldsymbol{z}))$ near 0.
- $\boldsymbol{z} \sim p_{\boldsymbol{z}}$: $G$ tries to push $D(G(\boldsymbol{z}))$ near 1.
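These three conditions are captured by the standard minimax objective of Goodfellow et al. (2014), added here for reference since only the informal summary appears on the slide:

$$\min_G \max_D V(D, G) = \mathbb{E}_{\boldsymbol{x} \sim p_{data}}\!\left[\log D(\boldsymbol{x})\right] + \mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}}\!\left[\log\bigl(1 - D(G(\boldsymbol{z}))\bigr)\right]$$

The discriminator $D$ maximizes $V$, while the generator $G$ minimizes it.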
[Figure: real images vs. generated images.]
Super-resolution GAN
Interactive GAN
https://www.tensorflow.org/
https://github.com/tensorflow/tensorflow
Example: a computational graph for $f = u \cdot z$ with $u = x + y$.
Forward pass: $u = x + y = 5$, $z = -2$, hence $f = u z = -10$.
Backward pass (reverse-mode differentiation):
$$\frac{\partial f}{\partial f} = 1, \qquad \frac{\partial f}{\partial u} = \frac{\partial f}{\partial f} \cdot z = 1 \cdot (-2) = -2, \qquad \frac{\partial f}{\partial z} = \frac{\partial f}{\partial f} \cdot u = 1 \cdot 5 = 5$$
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial x} = -2 \cdot 1 = -2, \qquad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial y} = -2 \cdot 1 = -2$$
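The same toy graph, computed and differentiated by hand in a short Python sketch; the concrete values x = 2 and y = 3 are an assumption, since the slide only fixes x + y = 5:

```python
# Forward and backward passes for f = (x + y) * z, mirroring what automatic
# differentiation (e.g., in TensorFlow) does on this computational graph.
x, y, z = 2.0, 3.0, -2.0

# Forward pass
u = x + y            # u = 5
f = u * z            # f = -10

# Backward pass (reverse mode): start from df/df = 1 and apply the chain rule
df_df = 1.0
df_du = df_df * z    # -2
df_dz = df_df * u    #  5
df_dx = df_du * 1.0  # -2, since du/dx = 1
df_dy = df_du * 1.0  # -2, since du/dy = 1
print(f, df_dx, df_dy, df_dz)   # -10.0 -2.0 -2.0 5.0
```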
A few practical lessons …
Summary for practical use - Debugging
Make sure our model is able to (over)fit on a small dataset.
If not, investigate the following situations:
- Are some of the units saturated, even before the first update?
  o Scale down the initialization of your parameters for these units.
  o Properly normalize the inputs.
- Is the training error fluctuating significantly / bouncing up and down?
  o Decrease the learning rate.

Parameter initialization
The initialization of parameters affects convergence and numerical stability.
- Biases: initialize all biases to 0.
- Weights:
  o Cannot initialize the weights to 0 with tanh activation, since the gradients would be 0 (saddle points).
  o The initial parameters need to break symmetry between different units; otherwise all hidden units would act the same.

Useful references:
- Practical Recommendations for Gradient-Based Training of Deep Architectures – Yoshua Bengio, 2012
- Stochastic Gradient Descent Tricks – Leon Bottou
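To make the initialization advice above concrete, a minimal sketch; the 1/sqrt(fan_in) scaling is a common heuristic and an assumption here, not something the slides prescribe:

```python
# Zero biases, small random weights to break symmetry between hidden units.
import numpy as np

def init_layer(fan_in, fan_out, rng):
    W = rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
    b = np.zeros(fan_out)          # biases: all zeros
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(784, 256, rng)
W2, b2 = init_layer(256, 10, rng)
```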
Summary
- What is deep learning, plus a few application examples.
- Deep neural networks: key intuitions and techniques.
- Deep autoencoders.
- Convolutional networks (CNN).
- Neural embedding: word2vec, node2vec, and recent embedding methods.
- Deep generative models: RBMs and deep Boltzmann machines; directed generative models: VAE, GAN.
- TensorFlow.
- A few practical tips.
Some further thoughts
Why does deep learning work?
Can learn and represent complex, nonlinear relationships
Strong and flexible priors (layered representation)
Hierarchical structures
Appendix
Node2vec [Grover and Leskovec, KDD’16]
Experimental results:
DBM Applications [Ruslan Salakhutdinov, Deep Learning Summer School, 2015]
3D object generation: [Figure: real vs. generated samples.]
Pattern completion: [Figure: damaged images, completed images, true images.]