Deep Learning
Song Han
Stanford University
May 25, 2017
Intro
Deep Learning is Changing Our Lives
Self-Driving Car Machine Translation
AlphaGo Smart Robots
Models are Getting Larger
Image recognition (Microsoft): from 8 layers and 1.4 GFLOP at ~16% error to 152 layers and 22.6 GFLOP at ~3.5% error — a 16x larger model.
Speech recognition (Baidu): from 80 GFLOP and 7,000 hrs of training data at ~8% error to 465 GFLOP and 12,000 hrs of data at ~5% error — 10x the training ops.
The First Challenge: Model Size
The Second Challenge: Speed
The Third Challenge: Energy Efficiency
Where is the Energy Consumed?
Table 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude (= 1000x) more energy expensive than simple arithmetic.
To achieve this goal, we present a method to prune network connections in a manner that preserves the original accuracy. After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer.
Where is the Energy Consumed?
How to make deep learning more efficient?
Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
Application as a Black Box
Algorithm (e.g., the Spec 2006 benchmark, treated as a black box) → Hardware (CPU)
Open the Box before Hardware Design
Algorithm (?) → Hardware (?PU)
Agenda
Algorithm: Inference, Training
Hardware: Inference, Training
Hardware 101: the Family
Number representation:
Int16: sign bit + 15-bit magnitude (values up to ~6x10^4)
Int8: sign bit + 7-bit magnitude (values up to 127)
Fixed point: sign, integer, and fraction fields separated by a radix point
Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014.
Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare Library.
Part 1: Algorithms for Efficient Inference
1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation
Pruning Neural Networks
Analogy: drop the negligible term, e.g. -0.01x^2 + x + 1 ≈ x + 1.
AlexNet: 60 million connections pruned to 6 million — 10x fewer connections.
[Plot: accuracy loss vs. fraction of parameters pruned away (40%–100%). Pruning alone loses accuracy quickly as sparsity grows; pruning followed by retraining preserves accuracy to much higher sparsity.]
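A minimal numpy sketch of the idea above: threshold the smallest-magnitude weights to zero, then keep training only the surviving weights so accuracy recovers. The gradient here is a random stand-in for real SGD, and the sparsity schedule is illustrative.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights, np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Toy iterative prune-and-"retrain" loop (the gradient is a stand-in for SGD).
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
for sparsity in (0.5, 0.7, 0.9):                   # gradually increase sparsity
    W, mask = prune_by_magnitude(W, sparsity)
    fake_gradient = rng.normal(scale=1e-3, size=W.shape)
    W = (W - fake_gradient) * mask                 # update only surviving weights
print(f"final sparsity: {(W == 0).mean():.2%}")
```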
Pruning happens in the human brain: a newborn has about 50 trillion synapses, a one-year-old about 1,000 trillion, and an adolescent about 500 trillion.
[Han et al. ICLR16]
Trained Quantization (Weight Sharing)
Pruning: fewer weights (Train Connectivity → Prune Connections → Train Weights; 9x-13x size reduction at the same accuracy).
Quantization: fewer bits per weight (Generate Code Book → Quantize the Weights with Code Book → Retrain Code Book; a further 27x-31x reduction at the same accuracy).
Huffman Encoding: Encode Weights and Encode Index.
Weight sharing example: the weights 2.09, 2.12, 1.92, 1.87 all map to one shared centroid, 2.0.
-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02
-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04
Trained
e 2: Representing the matrix sparsity Quantization
with relative index. Padding filler zero to prev
gradient
-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02
-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04
Trained
e 2: Representing the matrix sparsity Quantization
with relative index. Padding filler zero to prev
gradient
-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02
-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04
Trained
e 2: Representing the matrix sparsity Quantization
with relative index. Padding filler zero to prev
gradient
-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02
-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04
Trained
e 2: Representing the matrix sparsity Quantization
with relative index. Padding filler zero to prev
gradient
-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02
-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04
Trained
e 2: Representing the matrix sparsity Quantization
with relative index. Padding filler zero to prev
gradient
-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02
-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04
Trained
e 2: Representing the matrix sparsity Quantization
with relative index. Padding filler zero to prev
gradient
-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02
-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04
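A minimal numpy sketch of weight sharing with a trained codebook: cluster the weights with a tiny 1-D k-means, replace each weight by its centroid, and fine-tune the codebook by summing ("reducing") the gradients of all weights that share a centroid. The cluster count, learning rate and gradients are illustrative, not the paper's exact procedure.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means for weight sharing: returns centroids and assignments."""
    centroids = np.linspace(values.min(), values.max(), k)   # linear init
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = values[assign == c].mean()
    return centroids, assign

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
centroids, idx = kmeans_1d(W.ravel(), k=16)        # 16 clusters -> 4-bit indices
W_shared = centroids[idx].reshape(W.shape)         # weights replaced by centroids

# Codebook fine-tuning: gradients of weights in the same cluster are grouped by
# cluster index and summed ("reduced") to update the shared centroid.
grad = rng.normal(scale=1e-3, size=W.size)
lr = 0.1
for c in range(len(centroids)):
    centroids[c] -= lr * grad[idx == c].sum()
```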
[Plots: distribution of weight values before and after pruning and trained quantization.]
Pruning, Trained Quantization, Huffman Coding [Han et al. ICLR16]
AlexNet on ImageNet
Huffman Coding
Pipeline applied to the original network: Train Connectivity → Prune Connections → Train Weights (9x-13x size reduction, same accuracy) → Generate Code Book → Quantize the Weights with Code Book (27x-31x reduction, same accuracy) → Huffman Encoding of the weights and the indices (35x-49x reduction, same accuracy).
Figure 1: The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate to between 27x and 31x. Huffman coding gives more compression: between 35x and 49x. The compression rate already includes the meta-data for the sparse representation. The compression scheme doesn't incur any accuracy loss.
Pruning, Trained Quantization, Huffman Coding
[Han et al. ICLR16]
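Because the quantized cluster indices are far from uniformly distributed, Huffman-coding them saves additional bits. A standard-library sketch (the index distribution below is made up for illustration):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from a list of symbols."""
    freq = Counter(symbols)
    # each heap entry: (frequency, tie-breaker, {symbol: partial code})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Quantized weights are small cluster indices with a skewed distribution.
indices = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
code = huffman_code(indices)
bits = sum(len(code[s]) for s in indices)
print(f"{bits} bits with Huffman coding vs {2 * len(indices)} bits at fixed 2 bits/symbol")
```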
SqueezeNet Fire module: Input (64 channels) → 1x1 Conv "Squeeze" (16 channels) → two parallel "Expand" branches, a 1x1 Conv (64 channels) and a 3x3 Conv (64 channels) → Concat/Eltwise → Output (128 channels).
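A sketch of the Fire module with the channel sizes above (64 → squeeze 16 → expand 64 + 64 → concat 128), assuming PyTorch is available; layer names are illustrative:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Sketch of a SqueezeNet Fire module: squeeze with a 1x1 conv, then expand
    with parallel 1x1 and 3x3 convs and concatenate along the channel axis."""
    def __init__(self, in_ch=64, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

x = torch.randn(1, 64, 32, 32)
print(Fire()(x).shape)   # torch.Size([1, 128, 32, 32])
```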
Network      Approach           Size      Ratio   Top-1 Accuracy   Top-5 Accuracy
AlexNet      Deep Compression   6.9 MB    35x     57.2%            80.3%
SqueezeNet   Deep Compression   0.47 MB   510x    57.5%            80.3%
Iandola et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016
Quantizing the Weight and Activation
Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
Quantization Result
Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
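The figures from Qiu et al. are not reproduced here; as a generic illustration of linear fixed-point quantization of weights or activations (not necessarily their exact scheme), a numpy sketch:

```python
import numpy as np

def quantize_fixed_point(x, total_bits=8, frac_bits=4):
    """Linear fixed-point quantization: round to a grid of 2**-frac_bits,
    then clip to the representable signed integer range."""
    scale = 2.0 ** frac_bits
    qmin = -2 ** (total_bits - 1)
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale    # dequantized value used in computation

w = np.random.default_rng(0).normal(scale=0.2, size=5)
print(w)
print(quantize_fixed_point(w, total_bits=8, frac_bits=6))
```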
Low Rank Approximation for Conv
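The slide's figure is not reproduced here. As a generic illustration of the low-rank idea — factor a weight matrix into two thin matrices so the multiply-add count drops — a numpy sketch using truncated SVD (shown for a fully-connected layer; decomposing conv kernels works analogously but needs reshaping):

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by two thin matrices via truncated SVD, reducing
    multiply-adds per input vector from m*n to rank*(m + n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # m x rank
    B = Vt[:rank, :]                    # rank x n
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
A, B = low_rank_factorize(W, rank=64)
x = rng.normal(size=512)
err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(f"relative error at rank 64: {err:.3f}")
```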
Binary / Ternary Net: Motivation
[Histograms: the normalized full-precision weight distribution (count vs. weight value) is quantized with a threshold ±t into an intermediate ternary weight whose values are -1, 0 and +1.]
Trained Ternary Quantization
Pipeline: the normalized full-precision weight is quantized with a threshold ±t into an intermediate ternary weight, which is then scaled by two trained coefficients, Wp for positives and Wn for negatives, giving a final ternary weight with values in {-Wn, 0, +Wp}. During feed-forward and back-propagation, gradients flow both to the latent full-precision weights (gradient1) and to the scaling coefficients (gradient2); at inference time only the final ternary weights are used.
There are two ways to set the threshold: 1) use a constant threshold t for all layers, and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on the CIFAR-10 and ImageNet datasets, and use the second one to explore a wider range of sparsities in section 5.1.1.
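A numpy sketch of the TTQ forward quantization step described above: normalize, threshold at ±t, and scale by the positive and negative coefficients. In the paper Wp and Wn are learned during training; they are fixed constants here for illustration:

```python
import numpy as np

def ternarize(w, t=0.05, Wp=1.0, Wn=1.0):
    """Forward pass of trained ternary quantization: normalize the full-precision
    weights, threshold at +/-t, scale positives by Wp and negatives by -Wn."""
    w_norm = w / np.max(np.abs(w))
    ternary = np.zeros_like(w)
    ternary[w_norm > t] = Wp
    ternary[w_norm < -t] = -Wn
    return ternary

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1000)
q = ternarize(w, t=0.05, Wp=0.8, Wn=1.2)
print("sparsity:", np.mean(q == 0))
```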
Weight Evolution during Training
Figure 2: Ternary weight values (above) and the distribution of negatives, zeros and positives (below) over training epochs, for different layers of ResNet-20 on CIFAR-10.
Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17
Visualization of the TTQ Kernels
3x3 DIRECT Convolutions: Compute Bound
A 3x3 filter slides over the input feature map; each input element is reused by the filter across K output channels, giving 9 x K FMAs per input — good data reuse.
[Diagram: 3x3 filter applied to an input tile.]
3x3 WINOGRAD Convolutions: Transform Data to Reduce Math Intensity
Each 4x4 input tile is data-transformed, the 3x3 filter is filter-transformed, the two are multiplied point-wise (and accumulated over input channels C), and an output transform produces the output tile.
See A. Lavin & S. Gray, Fast Algorithms for Convolutional Neural Networks.
Julien Demouth, Convolution Optimization: Winograd, NVIDIA.
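The 1-D building block, F(2,3), shows the trick: two outputs of a 3-tap filter computed with 4 multiplications instead of 6, using the transform matrices from Lavin & Gray. A numpy sketch that checks the result against direct convolution:

```python
import numpy as np

# 1-D Winograd F(2,3): 2 outputs of a 3-tap correlation with 4 multiplies.
BT = np.array([[1, 0, -1,  0],
               [0, 1,  1,  0],
               [0, -1, 1,  0],
               [0, 1,  0, -1]], dtype=float)      # data transform
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)        # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)       # output transform

d = np.array([1.0, 2.0, 3.0, 4.0])   # one 4-element input tile
g = np.array([0.5, 1.0, -1.0])       # 3-tap filter

winograd = AT @ ((G @ g) * (BT @ d))            # 4 elementwise multiplies
direct = np.array([g @ d[0:3], g @ d[1:4]])     # 6 multiplies
print(winograd, direct)                          # the two results match
```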
Speedup of Winograd Convolution
[Bar chart: relative performance of cuDNN 5 vs. cuDNN 3 on the VGG-16 convolution layers (conv 1.1 through conv 5.0), batch size 1; the Winograd path reaches up to about 2x.]
Hardware for Efficient Inference
A common goal: minimize memory access.
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
Google TPU
The Matrix Unit: 65,536 (256x256)
8-bit multiply-accumulate units
700 MHz clock rate
Peak: 92T operations/second
65,536 * 2 * 700M
>25X as many MACs vs GPU
>100X as many MACs vs CPU
4 MiB of on-chip Accumulator
memory
24 MiB of on-chip Unified Buffer
(activation memory)
3.5X as much on-chip memory
vs GPU
Two 2133MHz DDR3 DRAM
channels
8 GiB of off-chip weight DRAM
memory
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
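The peak-throughput arithmetic quoted above (65,536 MACs x 2 ops per MAC x 700 MHz ≈ 92 T operations/second) can be checked directly:

```python
# 256x256 matrix unit, 2 ops per multiply-accumulate, 700 MHz clock.
macs, ops_per_mac, clock_hz = 256 * 256, 2, 700e6
print(f"peak = {macs * ops_per_mac * clock_hz / 1e12:.1f} T ops/s")  # ~91.8, i.e. ~92 TOPS
```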
Google TPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
Inference Datacenter Workload
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
Roofline Model: Identify Performance Bottleneck
Williams, Waterman, and Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM 52.4 (2009): 65-76.
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
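The roofline itself is just the minimum of the compute peak and the bandwidth-limited ceiling. A sketch using the ~92 T ops/s peak quoted above and assuming roughly 34 GB/s of DRAM bandwidth from the two DDR3-2133 channels listed earlier (2 channels x 2133 MT/s x 8 bytes); the operational-intensity points are illustrative, not measured TPU data:

```python
def attainable_gops(ops_per_byte, peak_gops, peak_gb_per_s):
    # Roofline: capped either by peak compute or by intensity x memory bandwidth.
    return min(peak_gops, ops_per_byte * peak_gb_per_s)

peak_gops = 92_000          # ~92 T ops/s peak (from the matrix unit above)
peak_bw = 34                # GB/s, assumed from 2 x DDR3-2133 channels
for oi in (5, 50, 500, 5000):   # operational intensity: ops per DRAM byte
    print(f"{oi:5d} ops/byte -> {attainable_gops(oi, peak_gops, peak_bw):8.0f} G ops/s")
```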
TPU Roofline
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
Log Rooflines for CPU, GPU, TPU
Star = TPU
Triangle = GPU
Circle = CPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
Linear Rooflines for CPU, GPU, TPU
Star = TPU
Triangle = GPU
Circle = CPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
Why so far below Rooflines?
Low latency requirement => can't batch more => low ops/byte
Challenge: hardware that can infer on a compressed model
[Han et al. ISCA16]
EIE: the First DNN Accelerator for
Sparse, Compressed Model
EIE performs sparse matrix x sparse vector multiplication with rows interleaved across processing elements (PEs). Example: input activations ~a = (a0, a1, 0, a3), an 8x4 sparse weight matrix whose rows are distributed over PE0-PE3 (row i to PE i mod 4), and output ~b followed by ReLU:

    PE0:  w0,0  w0,1  0     w0,3     ->  b0  ->  b0
    PE1:  0     0     w1,2  0        ->  b1  ->  b1
    PE2:  0     w2,1  0     w2,3     ->  b2  ->  0
    PE3:  0     0     0     0        ->  b3  ->  b3
          0     0     w4,2  w4,3     ->  b4  ->  0
          w5,0  0     0     0        ->  b5  ->  b5
          0     0     0     w6,3     ->  b6  ->  b6
          0     w7,1  0     0        ->  b7  ->  0

ReLU zeroes the negative outputs (b2, b4, b7). Non-zero weights are stored per PE in a compressed sparse column format; Column Pointer: 0 1 2 3.
Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016, Hotchips 2016
[Han et al. ISCA16]
Dataflow
The same sparse matrix x sparse vector example is processed step by step: for each non-zero input activation a_j, every PE multiplies a_j by the non-zero weights it stores in column j and accumulates into the corresponding output entries; zero activations and zero weights are skipped entirely.
Rule of thumb: 0 * A = 0 and W * 0 = 0.
Compression Acceleration Regularization
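In software terms, the dataflow amounts to a CSC-style sparse matrix-vector product that skips zero activations as well as zero weights. A numpy sketch of that skip-zero idea (the values are illustrative, and the real hardware stores 4-bit relative indices rather than plain row indices):

```python
import numpy as np

def sparse_matvec_relu(values, row_idx, col_ptr, a, n_rows):
    """Sparse matrix (CSC) x sparse vector: columns whose activation is zero are
    skipped entirely (W * 0 = 0), and only stored non-zeros are multiplied (0 * A = 0)."""
    b = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0.0:                        # skip zero activations
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[k]] += values[k] * aj  # only non-zero weights stored
    return np.maximum(b, 0.0)                # ReLU

# Tiny example in the spirit of the EIE figure (weight values made up):
W = np.array([[0.9, 0.2, 0.0, 0.4],
              [0.0, 0.0, 0.7, 0.0],
              [0.0, -0.5, 0.0, 0.3],
              [0.0, 0.0, 0.0, 0.0]])
a = np.array([1.0, 2.0, 0.0, -1.0])

col_ptr, row_idx, values = [0], [], []       # build CSC storage
for j in range(W.shape[1]):
    for i in range(W.shape[0]):
        if W[i, j] != 0.0:
            row_idx.append(i); values.append(W[i, j])
    col_ptr.append(len(values))

print(sparse_matvec_relu(values, row_idx, col_ptr, a, W.shape[0]))
print(np.maximum(W @ a, 0.0))   # dense reference, same result
```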
[Han et al. ISCA16]
EIE Architecture
Weight decode: the compressed DNN model stores each weight as a 4-bit virtual weight (codebook index); a weight look-up table expands it to a 16-bit real weight for the ALU (e.g., 2.09, 1.92 => shared centroid 2.0).
Index decode: the 4-bit relative index stored in the sparse format is accumulated into an absolute index that addresses the output memory for the prediction result.
Rule of thumb: 0 * A = 0 and W * 0 = 0.
Compression Acceleration Regularization
[Han et al. ISCA16]
EIE processing element (PE) pipeline: Pointer Read (even and odd pointer SRAM banks, start/end addresses) → Sparse Matrix Access (sparse matrix SRAM bank, sparse decoder) → Arithmetic Unit (relative index accumulated into an absolute address, bypass, accumulation) → Act R/W (source and destination activation registers, ReLU).
Speedup on EIE
Figure 6: Speedups of GPU, mobile GPU and EIE compared with CPU running the uncompressed DNN model (there is no batching in all cases). Benchmarks: Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM. EIE's geometric-mean speedup over the dense CPU baseline is 189x.

Energy Efficiency on EIE
Figure 7: Energy efficiency of GPU, mobile GPU and EIE compared with CPU running the uncompressed DNN model (no batching in all cases). EIE's geometric-mean energy efficiency over the dense CPU baseline is about 24,207x.

Figure 5: Layout of one PE in EIE under the TSMC 45nm process.
Table II: Implementation results of one PE in EIE, broken down by component type and by module; total power 9.157 mW, total area 638,024 um^2; the critical path of EIE is 1.15 ns.

Evaluation methodology: a custom cycle-accurate C++ simulator models the RTL behavior of the accelerator (each module is abstracted as an object with propagate and update methods, corresponding to combinational logic and flip-flops); it is used for design-space exploration and RTL verification. The RTL of EIE was implemented in Verilog, synthesized with Synopsys Design Compiler (DC) under the TSMC 45nm process, and placed and routed with the Synopsys IC Compiler (ICC). Cacti [25] provided SRAM area and energy numbers; the toggle rate was annotated from RTL simulation onto the gate-level netlist (dumped as a switching activity interchange format, SAIF, file) and power was estimated using Prime-Time PX.
Comparison baselines: three off-the-shelf computing units — a CPU (Intel Core i7-5930K, a Haswell-E class processor used in the NVIDIA DIGITS Deep Learning DevBox), a GPU, and a mobile GPU.

Table III: Benchmarks from state-of-the-art DNN models:
Layer    Size           Weight%   Act%    FLOP%   Description
Alex-6   9216 x 4096    9%        35.1%   3%      Compressed AlexNet [1] for large-scale image classification
Alex-7   4096 x 4096    9%        35.3%   3%
Alex-8   4096 x 1000    25%       37.5%   10%
VGG-6    25088 x 4096   4%        18.3%   1%      Compressed VGG-16 [3] for large-scale image classification and object detection
VGG-7    4096 x 4096    4%        37.5%   2%
VGG-8    4096 x 1000    23%       41.1%   9%
NT-We    4096 x 600     10%       100%    10%     Compressed NeuralTalk [7]

Comparison: Throughput
[Log-scale chart comparing throughput across platforms: Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE (ASIC, 45nm, 64 PEs), and EIE (ASIC, 28nm, 256 PEs). A companion log-scale chart compares energy efficiency across the same platforms.]
Part 3: Efficient Training Algorithms
1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
Moore's law made CPUs 300x faster than in 1990. But it's over.
Parameter Server: p' = p + Δp
Model replicas (workers) each process a shard of the data, push parameter updates Δp to the parameter server, and pull back the updated parameters p.
Large Scale Distributed Deep Networks, Jeff Dean et al., 2012.
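A toy sketch of the parameter-server update p' = p + Δp: several workers compute updates on their data shards and the server applies them. The "gradients" here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.normal(size=1000)                 # parameters held by the parameter server

def worker_gradient(params, shard_seed):
    """Stand-in for one worker's gradient computed on its data shard."""
    return np.random.default_rng(shard_seed).normal(scale=1e-3, size=params.shape)

lr = 0.1
for _ in range(10):
    # each worker pulls the current parameters and pushes back an update delta_p
    deltas = [-lr * worker_gradient(p, shard) for shard in range(4)]
    for dp in deltas:                     # server applies p' = p + delta_p
        p = p + dp
```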
Model Parallel
Split up the model, i.e. the network.
Convolution is a 6D loop over multiple 3D kernels K_uvkj, input maps A_xyk and output maps B_xyj:
    Forall output map j
      For each input map k
        For each pixel x, y
          For each kernel element u, v
            B[x,y,j] += A[x-u, y-v, k] * K[u,v,k,j]
Fully-connected layer: b_i = sum_j W_ij * a_j — output activations b = weight matrix W times input activations a.
Data parallel
Run multiple training examples in parallel
Limited by batch size
Model parallel
Split model over multiple processors
By layer
Conv layers by map region
Fully connected layers by output activation
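As a sketch of the last point above — splitting a fully-connected layer by output activation — each worker holds a slice of the weight matrix's rows and computes its slice of the output (done on one machine here; in practice each slice would live on a different device):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 512))      # weight matrix of a fully-connected layer
a = rng.normal(size=512)              # input activations

# Model parallelism by output activation: split W's rows over 4 "workers".
workers = np.array_split(np.arange(W.shape[0]), 4)
b_parts = [W[rows] @ a for rows in workers]    # each computes its output slice
b = np.concatenate(b_parts)

assert np.allclose(b, W @ a)          # identical to the unsplit computation
```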
Mixed Precision
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Mixed Precision Training (Volta training method)
Keep a master copy of the weights in FP32 (Master-W).
FWD: an FP16 copy W of the weights and FP16 activations produce FP16 activations.
BWD-A: the FP16 weights and the FP16 activation gradients produce FP16 activation gradients for the layer below.
BWD-W: the FP16 activations and the FP16 activation gradients produce FP16 weight gradients.
Weight Update: the FP16 weight gradients are applied to the FP32 master weights, producing the updated Master-W (F32); an FP16 copy is taken for the next iteration.
Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017
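A minimal numpy sketch of the FP32-master-weights idea: compute forward and backward in FP16, apply the update to an FP32 master copy, and re-cast to FP16 each iteration. Loss scaling and Tensor Cores, which real mixed-precision training uses, are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 100)).astype(np.float32)
y = rng.normal(size=(256, 1)).astype(np.float32)

master_W = rng.normal(scale=0.1, size=(100, 1)).astype(np.float32)  # FP32 master weights
lr = 1e-3

for step in range(100):
    W16 = master_W.astype(np.float16)          # FP16 copy of the weights
    x16, y16 = X.astype(np.float16), y.astype(np.float16)
    pred = x16 @ W16                            # FWD in FP16
    err = pred - y16
    grad16 = (x16.T @ err) / len(X)             # BWD in FP16 (weight gradients)
    master_W -= lr * grad16.astype(np.float32)  # weight update on the FP32 master copy

print(float(np.mean((X @ master_W - y) ** 2)))  # training loss on the toy problem
```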
Inception V1
[Training curves for Inception V1 with mixed precision.]
ResNet-50
[Training results for ResNet-50 with mixed precision: 71.36% Top-1, 90.84% Top-5.]
AlexNet
Mode                        Top-1 accuracy, %    Top-5 accuracy, %
FP32                        58.62                81.25
Mixed precision training    58.12                80.71
FP16 training
Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017
Model Distillation
Knowledge from one or more large teacher models is transferred (distilled) into a smaller student model.
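A numpy sketch of a distillation loss in the spirit of Hinton et al.'s soft targets: cross-entropy against the teacher's temperature-softened outputs, blended with ordinary cross-entropy against the hard labels. The temperature, blend weight and logits below are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target cross-entropy against the teacher (at temperature T),
    blended with ordinary cross-entropy against the hard labels."""
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    soft = -np.mean(np.sum(p_teacher * np.log(p_student_T + 1e-12), axis=-1)) * T * T
    p_student = softmax(student_logits)
    hard = -np.mean(np.log(p_student[np.arange(len(labels)), labels] + 1e-12))
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 10)) * 3    # stand-in teacher logits
student = rng.normal(size=(8, 10))        # stand-in student logits
labels = rng.integers(0, 10, size=8)
print(distillation_loss(student, teacher, labels))
```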
DSD: Dense Sparse Dense Training
Dense → Sparse (Pruning) → Dense (Re-Dense)
Figure 1: Dense-Sparse-Dense training flow. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting.
DSD produces the same model architecture but can find a better optimization solution, arrive at a better local minimum, and achieve higher prediction accuracy across a wide range of deep neural networks: CNNs / RNNs / LSTMs.
Algorithm 1: Workflow of DSD training. Initialization: W(0) with W(0) ~ N(0, Σ). Output: W(t). Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
Initial Dense Phase
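A numpy sketch of the dense-sparse-dense schedule: train dense, prune to a fixed sparsity and retrain under the mask, then free the pruned weights and train dense again. The training step is a placeholder for real SGD:

```python
import numpy as np

def train(W, mask, steps=100, lr=1e-2, rng=None):
    """Stand-in for SGD: only weights where mask is True are updated."""
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(steps):
        grad = rng.normal(scale=1e-2, size=W.shape)   # placeholder gradient
        W = (W - lr * grad) * mask
    return W

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))                 # initial dense phase: train all weights
dense_mask = np.ones_like(W, dtype=bool)
W = train(W, dense_mask, rng=rng)

# Sparse phase: prune the smallest weights, keep sparsity fixed, retrain the rest.
sparsity = 0.7
thresh = np.quantile(np.abs(W), sparsity)
sparse_mask = np.abs(W) > thresh
W = train(W * sparse_mask, sparse_mask, rng=rng)

# Re-dense phase: free the pruned weights (restored from zero) and train everything again.
W = train(W, dense_mask, rng=rng)
```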
DSD: Intuition
Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
[Han et al. ICLR 2017]
DSD results table columns: Network, Domain, Dataset, Type, Baseline, DSD, Abs. Imp., Rel. Imp.
Image captioning examples (Baseline vs. Sparse vs. DSD):
1. Baseline: a boy in a red shirt is climbing a rock wall. Sparse: a young girl is jumping off a tree. DSD: a young girl in a pink shirt is swinging on a swing.
2. Baseline: a basketball player in a red uniform is playing with a ball. Sparse: a basketball player in a blue uniform is jumping over the goal. DSD: a basketball player in a white uniform is trying to make a shot.
3. Baseline: two dogs are playing together in a field. Sparse: two dogs are playing in a field. DSD: two dogs are playing in the grass.
4. Baseline: a man and a woman are sitting on a bench. Sparse: a man is sitting on a bench with his hands in the air. DSD: a man is sitting on a bench with his arms folded.
5. Baseline: a person in a red jacket is riding a bike through the woods. Sparse: a car drives through a mud puddle. DSD: a car drives through a forest.
6. Baseline: a boy is swimming in a pool. Sparse: a small black dog is jumping into a pool. DSD: a black and white dog is swimming in a pool.
7. Baseline: a group of people are standing in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are walking in a park.
8. Baseline: two girls in bathing suits are playing in the water. Sparse: two children are playing in the sand. DSD: two children are playing in the sand.
9. Baseline: a man in a red shirt and jeans is riding a bicycle down a street. Sparse: a man in a red shirt and a woman in a wheelchair. DSD: a man and a woman are riding on a street.
10. Baseline: a group of people sit on a bench in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are standing in a fountain.
11. Baseline: a man in a black jacket and a black jacket is smiling. Sparse: a man and a woman are standing in front of a mountain. DSD: a man in a black jacket is standing next to a man in a black shirt.
12. Baseline: a group of football players in red uniforms. Sparse: a group of football players in a field. DSD: a group of football players in red and white uniforms.
13. Baseline: a dog runs through the grass. Sparse: a dog runs through the grass. DSD: a white and brown dog is running through the grass.
Baseline model: Andrej Karpathy, Neural Talk model zoo.
CPUs for Training
CPUs Are Targeting Deep Learning
Intel Knights Landing (2016)
7 TFLOPS FP32
245W TDP
29 GFLOPS/W (FP32)
14nm process
Knights Mill: next gen Xeon Phi optimized for deep learning
Intel announced the addition of new vector instructions for deep learning
(AVX512-4VNNIW and AVX512-4FMAPS), October 2016
Slide Source: Sze et al Survey of DNN Hardware, MICRO16 Tutorial.
Image Source: Intel. Data Source: Next Platform.
GPUs for Training
GPUs Are Targeting Deep Learning
Nvidia Volta GV100 (2017)
15 FP32 TFLOPS
120 Tensor TFLOPS
16GB HBM2 @ 900GB/s
300W TDP
12nm process
21B Transistors
die size: 815 mm2
300GB/s NVLink
a new instruction that performs 4x4x4 FMA mixed-precision operations per clock
12X increase in throughput for the Volta V100 compared to the Pascal P100
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Pascal vs. Volta
Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations.
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Pascal vs. Volta
Left: Tesla V100 trains the ResNet-50 deep neural network 2.4x faster than Tesla P100.
Right: Given a target latency per image of 7ms, Tesla V100 is able to perform inference using the
ResNet-50 deep neural network 3.7x faster than Tesla P100.
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
The GV100 SM is partitioned
into four processing blocks,
each with:
8 FP64 Cores
16 FP32 Cores
16 INT32 Cores
two of the new mixed-precision
Tensor Cores for deep learning
a new L0 instruction cache
one warp scheduler
one dispatch unit
a 64 KB Register File.
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Tesla Product Tesla K40 Tesla M40 Tesla P100 Tesla V100
GPU GK110 (Kepler) GM200 (Maxwell) GP100 (Pascal) GV100 (Volta)
GPU Boost Clock 810/875 MHz 1114 MHz 1480 MHz 1455 MHz
Peak FP32 TFLOP/s* 5.04 6.8 10.6 15
Peak Tensor Core TFLOP/s* - - - 120
Memory Interface 384-bit GDDR5 384-bit GDDR5 4096-bit HBM2 4096-bit HBM2
Memory Size Up to 12 GB Up to 24 GB 16 GB 16 GB
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
Google Cloud TPU
Cloud TPU delivers up to 180 teraflops to train and run machine learning models; a large model can train in an afternoon using just one eighth of a TPU pod.
Future
Outlook: the Focus for Computation
Mobile Computing, Cognitive Computing, Brain-Inspired Computing
Thank you!
stanford.edu/~songhan