
Efficient Methods and Hardware for

Deep Learning

Song Han
Stanford University
May 25, 2017
Intro

Song Han, PhD Candidate, Stanford
Bill Dally, Chief Scientist, NVIDIA; Professor, Stanford

2
Deep Learning is Changing Our Lives
Self-Driving Car Machine Translation

This image is licensed under CC-BY 2.0 This image is in the public domain

This image is in the public domain This image is licensed under CC-BY 2.0

3
AlphaGo Smart Robots

3
Models are Getting Larger

IMAGE RECOGNITION: 16X model growth
2012  AlexNet: 8 layers, 1.4 GFLOP, ~16% error
2015  ResNet (Microsoft): 152 layers, 22.6 GFLOP, ~3.5% error

SPEECH RECOGNITION: 10X growth in training ops
2014  Deep Speech 1 (Baidu): 80 GFLOP, 7,000 hrs of data, ~8% error
2015  Deep Speech 2 (Baidu): 465 GFLOP, 12,000 hrs of data, ~5% error

Dally, NIPS2016 workshop on Efficient Methods for Deep Neural Networks

4
The First Challenge: Model Size

Hard to distribute large models through over-the-air update

App icon is in the public domain


This image is licensed under CC-BY 2.0
Phone image is licensed under CC-BY 2.0

5
The Second Challenge: Speed

Error rate Training time


ResNet18: 10.76% 2.5 days
ResNet50: 7.02% 5 days
ResNet101: 6.21% 1 week
ResNet152: 6.16% 1.5 weeks

Such long training time limits ML researchers' productivity

Training time benchmarked with fb.resnet.torch using four M40 GPUs

6
The Third Challenge: Energy Efficiency

AlphaGo: 1920 CPUs and 280 GPUs,


$3000 electric bill per game This image is in the public domain
This image is in the public domain

on mobile: drains battery


on data-center: increases TCO
Phone image is licensed under CC-BY 2.0 This image is licensed under CC-BY 2.0

7
Where is the Energy Consumed?

larger model => more memory reference => more energy

8
Where is the Energy Consumed?

larger model => more memory reference => more energy

Operation               Energy [pJ]    Relative Energy Cost
32 bit int ADD          0.1            1
32 bit float ADD        0.9            9
32 bit Register File    1              10
32 bit int MULT         3.1            31
32 bit float MULT       3.7            37
32 bit SRAM Cache       5              50
32 bit DRAM Memory      640            6400

Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.

This image is in the public domain

"To achieve this goal, we present a method to prune network connections in a manner that preserves the original accuracy. After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer."

9
Where is the Energy Consumed?

larger model => more memory reference => more energy

[Same energy table as the previous slide (Figure 1: energy table for 45nm CMOS process [7]; memory access is 3 orders of magnitude more energy expensive than simple arithmetic), overlaid with the question:]

how to make deep learning more efficient?

Battery images are in the public domain
Image 1, image 2, image 2, image 4

10
Improve the Efficiency of Deep Learning
by Algorithm-Hardware Co-Design

11
Application as a Black Box

Spec 2006
Algorithm
This image is in
the public domain

Hardware
This image is in

CPU
the public domain

12
Open the Box before Hardware Design

Algorithm
?
This image is in
the public domain

Hardware
This image is in

?PU
the public domain

Breaks the boundary between algorithm and hardware

13
Agenda

Inference Training
Agenda

Algorithm

Inference Training

Hardware
Agenda

Algorithm

Inference Training

Hardware
Agenda

Algorithm

Algorithms for Algorithms for


Efficient Inference Efficient Training

Inference Training

Hardware for Hardware for


Efficient Inference Efficient Training

Hardware
Hardware 101: the Family

Hardware

General Purpose* Specialized HW

CPU GPU FPGA ASIC


latency throughput programmable fixed
oriented oriented logic logic
* including GPGPU
Hardware 101: Number Representation
(-1)^S x (1.M) x 2^E

Format        S / E / M bits    Range               Accuracy
FP32          1 / 8 / 23        10^-38 - 10^38      .000006%
FP16          1 / 5 / 10        6x10^-5 - 6x10^4    .05%
Int32         1 / - / 31        0 - 2x10^9
Int16         1 / - / 15        0 - 6x10^4
Int8          1 / - / 7         0 - 127
Fixed point   S / I / F         (radix point sits between the integer and fractional bits)

Dally, High Performance Hardware for Machine Learning, NIPS2015


Hardware 101: Number Representation

Operation: Energy (pJ), Area (µm²)


8b Add 0.03 36
16b Add 0.05 67
32b Add 0.1 137
16b FP Add 0.4 1360
32b FP Add 0.9 4184
8b Mult 0.2 282
32b Mult 3.1 3495
16b FP Mult 1.1 1640
32b FP Mult 3.7 7700
32b SRAM Read (8KB) 5 N/A
32b DRAM Read 640 N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014
Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library.
Agenda

Algorithm

Algorithms for Algorithms for


Efficient Inference Efficient Training

Inference Training

Hardware for Hardware for


Efficient Inference Efficient Training

Hardware
Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Pruning Neural Networks

[Lecun et al. NIPS89]


[Han et al. NIPS15]

Pruning Trained Quantization Huffman Coding 24


[Han et al. NIPS15]

Pruning Neural Networks

-0.01x^2 + x + 1

60 Million -> 6M: 10x fewer connections

Pruning Trained Quantization Huffman Coding 25


[Han et al. NIPS15]

Pruning Neural Networks

[Chart: accuracy loss (0.5% to -4.5%) vs. parameters pruned away (40% to 100%).]

Pruning Trained Quantization Huffman Coding 26


[Han et al. NIPS15]

Pruning Neural Networks

Pruning
[Chart: accuracy loss (0.5% to -4.5%) vs. parameters pruned away (40% to 100%), pruning only.]

Pruning Trained Quantization Huffman Coding 27


[Han et al. NIPS15]

Retrain to Recover Accuracy

Pruning Pruning+Retraining
[Chart: accuracy loss vs. parameters pruned away (40% to 100%), comparing pruning alone with pruning followed by retraining.]

Pruning Trained Quantization Huffman Coding 28


[Han et al. NIPS15]

Iteratively Retrain to Recover Accuracy

Pruning Pruning+Retraining Iterative Pruning and Retraining


[Chart: accuracy loss vs. parameters pruned away (40% to 100%), comparing pruning, pruning + retraining, and iterative pruning and retraining.]

Pruning Trained Quantization Huffman Coding 29
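To make the pruning-and-retraining loop concrete, here is a minimal numpy sketch of the idea (not the paper's released code): prune by magnitude against a percentile threshold, then keep the pruned connections at zero while "retraining". The toy gradient step stands in for real SGD.

import numpy as np

def prune_by_magnitude(W, sparsity):
    """Zero out the smallest-magnitude weights; returns the pruned W and the mask."""
    threshold = np.percentile(np.abs(W), sparsity * 100)
    mask = (np.abs(W) > threshold).astype(W.dtype)
    return W * mask, mask

# Iterative pruning + retraining (toy gradient steps stand in for real training).
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
for target_sparsity in (0.5, 0.7, 0.9):                     # prune gradually, not all at once
    W, mask = prune_by_magnitude(W, target_sparsity)
    for _ in range(100):                                    # "retrain" to recover accuracy
        grad = rng.normal(size=W.shape).astype(np.float32)  # placeholder gradient
        W -= 1e-3 * grad
        W *= mask                                           # pruned connections stay at zero
    print(f"target sparsity {target_sparsity:.0%}: {np.mean(W == 0):.2%} zeros")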


[Han et al. NIPS15]

Pruning RNN and LSTM

Image captioning:
Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Neural Image Caption Generator, Vinyals et al.
Convolutional Networks for Visual Recognition and Description, Donahue et al.
Visual Representation for Image Caption Generation, Chen and Zitnick

*Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions"; figure copyright IEEE, 2015; reproduced for educational purposes.
Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016.

Pruning Trained Quantization Huffman Coding 30


[Han et al. NIPS15]

Pruning RNN and LSTM


Original: a basketball player in a white uniform is playing with a ball
Pruned 90%: a basketball player in a white uniform is playing with a basketball

Original: a brown dog is running through a grassy field
Pruned 90%: a brown dog is running through a grassy area

Original: a man is riding a surfboard on a wave
Pruned 90%: a man in a wetsuit is riding a wave on a beach

Original: a soccer player in red is running in the field
Pruned 95%: a man in a red shirt and black and white black shirt is running through a field

Pruning Trained Quantization Huffman Coding 31


Pruning Happens in Human Brain

1000 Trillion
Synapses
50 Trillion 500 Trillion
Synapses Synapses

This image is in the public domain This image is in the public domain This image is in the public domain

Newborn 1 year old Adolescent


Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013.

Pruning Trained Quantization Huffman Coding 32


[Han et al. NIPS15]

Pruning Changes Weight Distribution

Before Pruning After Pruning After Retraining

Conv5 layer of Alexnet. Representative for other network layers as well.

Pruning Trained Quantization Huffman Coding 33


Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
[Han et al. ICLR16]

Trained Quantization

2.09, 2.12, 1.92, 1.87

2.0

Pruning Trained Quantization Huffman Coding 35


[Han et al. ICLR16]

Trained Quantization
Quantization: less bits per weight

[Pipeline excerpt: cluster the weights -> generate code book -> quantize the weights with the code book -> retrain the code book. Same accuracy; pruning alone gives a 9x-13x reduction, pruning plus quantization 27x-31x, and adding Huffman coding 35x-49x.]

Example: 2.09, 2.12, 1.92, 1.87 -> 2.0

32 bit -> 4 bit quantization: 8x less memory footprint
Pruning Trained Quantization Huffman Coding 36
[Han et al. ICLR16]

Trained Quantization

weights (32-bit float)           cluster index (2-bit uint)      centroids      fine-tuned centroids
 2.09  -0.98   1.48   0.09        3  0  2  1                     3:  2.00        1.96
 0.05  -0.14  -1.08   2.12        1  1  0  3                     2:  1.50        1.48
-0.91   1.92   0.00  -1.03        0  3  1  0                     1:  0.00       -0.04
 1.87   0.00   1.53   1.49        3  1  2  2                     0: -1.00       -0.97

gradient
-0.03  -0.01   0.03   0.02
-0.01   0.01  -0.02   0.12
-0.01   0.02   0.04   0.01
-0.07  -0.02   0.01  -0.02

During fine-tuning the weight gradients are grouped by cluster index and reduced (summed); each summed gradient, scaled by the learning rate (lr), updates the corresponding shared centroid.

Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom).

Pruning Trained Quantization Huffman Coding 37
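A minimal numpy sketch of the weight-sharing step in Figure 3, under the assumption of plain 1-D k-means with linear centroid initialization: each weight is replaced by a 2-bit cluster index plus a 4-entry code book, and the shared centroids are fine-tuned with the reduced (summed) gradients of their groups.

import numpy as np

def kmeans_1d(w, k, iters=20):
    """Plain Lloyd's k-means on a flat weight vector; returns centroids and assignments."""
    centroids = np.linspace(w.min(), w.max(), k)      # linear initialization
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = w[assign == j].mean()
    return centroids, assign

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

centroids, idx = kmeans_1d(W.ravel(), k=4)            # 2-bit indices -> 4 centroids
W_shared = centroids[idx].reshape(W.shape)            # every weight becomes a centroid
print("quantized weights:\n", np.round(W_shared, 2))

# Centroid fine-tuning: group the weight gradients by cluster index and reduce (sum).
grad = np.full(W.size, 0.01, dtype=np.float32)        # placeholder gradient
for j in range(len(centroids)):
    centroids[j] -= 1e-2 * grad[idx == j].sum()       # lr * reduced gradient
print("fine-tuned code book:", np.round(centroids, 2))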
[Han et al. ICLR16]

Before Trained Quantization:


Continuous Weight
Count

Weight Value
Pruning Trained Quantization Huffman Coding 44
[Han et al. ICLR16]

After Trained Quantization:


Discrete Weight
Count

Weight Value
Pruning Trained Quantization Huffman Coding 45
[Han et al. ICLR16]

After Trained Quantization:


Discrete Weight after Training
Count

Weight Value
Pruning Trained Quantization Huffman Coding 46
[Han et al. ICLR16]

How Many Bits do We Need?

Pruning Trained Quantization Huffman Coding 47


[Han et al. ICLR16]

How Many Bits do We Need?

Pruning Trained Quantization Huffman Coding 48


[Han et al. ICLR16]

Pruning + Trained Quantization Work Together

Pruning Trained Quantization Huffman Coding 49


[Han et al. ICLR16]

Pruning + Trained Quantization Work Together

AlexNet on ImageNet

Pruning Trained Quantization Huffman Coding 50


[Han et al. ICLR16]

Huffman Coding

Huffman Encoding: encode weights, encode indices. Same accuracy, 35x-49x total reduction.

In-frequent weights: use more bits to represent
Frequent weights: use less bits to represent
Pruning Trained Quantization Huffman Coding 51
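A small illustrative sketch of Huffman coding over the quantized cluster indices using Python's heapq; frequent indices get short codes, infrequent ones longer codes. The index stream is a made-up example.

import heapq, itertools
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code {symbol: bitstring}; frequent symbols get shorter codes."""
    freq = Counter(symbols)
    counter = itertools.count()                       # unique tie-breaker for the heap
    heap = [(f, next(counter), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

indices = [0, 0, 0, 0, 1, 1, 2, 3]                    # quantized cluster indices
codes = huffman_code(indices)
total_bits = sum(len(codes[s]) for s in indices)
print(codes, f"-> {total_bits} bits vs {2 * len(indices)} bits at fixed 2-bit width")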
[Han et al. ICLR16]

Summary of Deep Compression


Published as a conference paper at ICLR 2016

Pruning: less number of weights (train connectivity -> prune connections -> train weights): 9x-13x reduction from the original size, same accuracy
Quantization: less bits per weight (cluster the weights -> generate code book -> quantize the weights with code book -> retrain code book): 27x-31x reduction, same accuracy
Huffman Encoding (encode weights, encode index): 35x-49x reduction, same accuracy

Figure 1: The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate: between 27x and 31x. Huffman coding gives more compression: between 35x and 49x. The compression rate already includes the meta-data for sparse representation. The compression scheme doesn't incur any accuracy loss.

Pruning Trained Quantization Huffman Coding 52
[Han et al. ICLR16]

Results: Compression Ratio

Network      Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy
LeNet-300    1070KB          27KB              40x                 98.36%              98.42%
LeNet-5      1720KB          44KB              39x                 99.20%              99.26%
AlexNet      240MB           6.9MB             35x                 80.27%              80.30%
VGGNet       550MB           11.3MB            49x                 88.68%              89.09%
GoogleNet    28MB            2.8MB             10x                 88.90%              88.92%
ResNet-18    44.6MB          4.0MB             11x                 89.24%              89.28%

Can we make compact models to begin with?

Compression Acceleration Regularization 53


SqueezeNet

Vanilla Fire module:
Input (64 channels)
  -> 1x1 Conv "Squeeze" (16 channels)
  -> 1x1 Conv "Expand" (64) and 3x3 Conv "Expand" (64), in parallel
  -> Concat/Eltwise output (128 channels)


Iandola et al, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016
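A sketch of the Fire module above written in PyTorch (the framework choice is ours; the original SqueezeNet was released in Caffe): the 1x1 squeeze layer cuts the channel count before the parallel 1x1/3x3 expand layers.

import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: 1x1 squeeze, then parallel 1x1 / 3x3 expand, concatenated."""
    def __init__(self, in_ch=64, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)   # 64 + 64 = 128 channels

y = Fire()(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 128, 32, 32])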

Compression Acceleration Regularization 54


Compressing SqueezeNet

Network      Approach           Size     Ratio   Top-1 Accuracy   Top-5 Accuracy
AlexNet      -                  240MB    1x      57.2%            80.3%
AlexNet      SVD                48MB     5x      56.0%            79.4%
AlexNet      Deep Compression   6.9MB    35x     57.2%            80.3%
SqueezeNet   -                  4.8MB    50x     57.5%            80.3%
SqueezeNet   Deep Compression   0.47MB   510x    57.5%            80.3%

Iandola et al, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016

Compression Acceleration Regularization 55


Results: Speedup

[Bar chart: average speedup of the pruned model on CPU, GPU and mGPU.]

Compression Acceleration Regularization 56


Results: Energy Efficiency

[Bar chart: average energy efficiency gain of the pruned model on CPU, GPU and mGPU.]

Compression Acceleration Regularization 57


Deep Compression Applied to Industry

Deep
Compression

F8

Compression Acceleration Regularization 58


Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Quantizing the Weight and Activation

Train with float

Quantizing the weight and activation:
  Gather the statistics for weight and activation
  Choose proper radix point position

Fine-tune in float format
Convert to fixed-point format

Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
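A minimal numpy sketch of the dynamic fixed-point idea: gather statistics, try candidate radix-point positions, and keep the one with the smallest quantization error. The exhaustive search shown here is an illustrative choice, not necessarily Qiu et al.'s exact procedure.

import numpy as np

def quantize_fixed_point(x, total_bits=8):
    """Pick the fractional-bit count that minimizes error, then quantize to that grid."""
    best = None
    for frac_bits in range(total_bits):
        scale = 2.0 ** frac_bits
        qmax = 2 ** (total_bits - 1) - 1                           # signed integer range
        q = np.clip(np.round(x * scale), -qmax - 1, qmax) / scale
        err = np.mean((q - x) ** 2)
        if best is None or err < best[0]:
            best = (err, frac_bits, q)
    return best[2], best[1]

w = np.random.default_rng(0).normal(scale=0.2, size=1000)
wq, frac = quantize_fixed_point(w, total_bits=8)
print(f"chosen radix point: {frac} fractional bits, mse = {np.mean((wq - w) ** 2):.2e}")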
Quantization Result

Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Low Rank Approximation for Conv

Zhang et al Efficient and Accurate Approximations of Nonlinear Convolutional Networks CVPR15


Low Rank Approximation for Conv

Zhang et al Efficient and Accurate Approximations of Nonlinear Convolutional Networks CVPR15


Low Rank Approximation for FC

Novikov et al Tensorizing Neural Networks, NIPS15
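As a concrete (generic) instance of low-rank approximation, the sketch below factorizes a fully-connected layer with a truncated SVD, replacing one m x n matrix by an m x r and an r x n matrix; the CVPR'15 and NIPS'15 methods cited above use more elaborate decompositions, so treat this as the baseline idea only.

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 1024, 1024, 64
# A synthetic weight matrix that is approximately low rank (real FC layers often are).
W = (rng.normal(size=(m, r)) @ rng.normal(size=(r, n))) / np.sqrt(r) \
    + 0.01 * rng.normal(size=(m, n))
x = rng.normal(size=n)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]            # m x r: first thin layer
B = Vt[:r, :]                   # r x n: second thin layer

y_full = W @ x                  # m*n multiplies (~1.0M)
y_lowrank = A @ (B @ x)         # r*(m+n) multiplies (~0.13M): 8x fewer
print("relative error:", np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))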


Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Binary / Ternary Net: Motivation

[Figure: weight-count histograms during sparse training, recovering the zero weights (S) and training on dense (D); and quantization of the normalized full-precision weight into an intermediate ternary weight, where values with |w| < t map to 0 and the rest to -1 or +1.]
Trained Ternary Quantization

[Figure: TTQ training flow. The full-precision weight is normalized to [-1, 1], quantized with threshold +/-t into an intermediate ternary weight {-1, 0, +1}, and scaled by trained coefficients Wn and Wp into the final ternary weight {-Wn, 0, Wp}. Gradients (gradient1, gradient2) from the loss flow back through feed-forward / back-propagation to both the full-precision weights and the scaling coefficients; only the final ternary weights are used at inference time.]

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17
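A minimal numpy sketch of the TTQ quantization step: normalize, threshold at +/-t into {-1, 0, +1}, then scale by the two coefficients Wp and Wn (constants here; in training they are learned parameters with their own gradients).

import numpy as np

def ttq_quantize(w, t=0.05, Wp=1.0, Wn=1.0):
    """Trained ternary quantization: full-precision w -> {-Wn, 0, +Wp}."""
    w_norm = w / np.max(np.abs(w))            # normalize to [-1, 1]
    ternary = np.zeros_like(w_norm)
    ternary[w_norm > t] = 1.0
    ternary[w_norm < -t] = -1.0
    return np.where(ternary > 0, Wp * ternary, Wn * ternary)

w = np.random.default_rng(0).normal(scale=0.02, size=(64, 64)).astype(np.float32)
wq = ttq_quantize(w, t=0.05, Wp=0.8, Wn=1.2)
print(np.unique(wq), f"sparsity: {np.mean(wq == 0):.1%}")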

Pruning Trained Quantization Huffman Coding


Weight Evolution during Training

Δ_l = t × max(|w_l|)    (9)

and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on the CIFAR-10 and ImageNet datasets and use the second one to explore a wider range of sparsities in section 5.1.1.

[Plot: evolution of the ternary weight values (Wn and Wp, roughly in the range -3 to 3) and of the percentage of negative / zero / positive weights (0-100%) over 150 training epochs, for layers res1.0/conv1, res3.2/conv2, and linear.]

Figure 2: Ternary weights value (above) and distribution (below) with iterations for different layers
of ResNet-20 on CIFAR-10.

4
Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17
Visualization of the TTQ Kernels

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17

Pruning Trained Quantization Huffman Coding


Error Rate on ImageNet

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17

Pruning Trained Quantization Huffman Coding


Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
3x3 DIRECT Convolutions
Compute Bound

9xC FMAs/Output: Math Intensive


[Figure: a 3x3 filter sliding over the image tensor to produce 4 neighboring outputs.]

9xK FMAs/Input: Good Data Reuse

Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

73
3x3 WINOGRAD Convolutions
Transform Data to Reduce Math Intensity

4x4 Tile
Data Transform (over C)
Filter Transform
Point-wise multiplication
Output Transform

Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs


Winograd convolution: we need 16xC FMAs for 4 outputs: 2.25x fewer FMAs

See A. Lavin & S. Gray, Fast Algorithms for Convolutional Neural Networks
Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

74
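A numpy sketch of F(2x2, 3x3) Winograd convolution using the standard transform matrices published by Lavin & Gray: transform the 4x4 data tile and the 3x3 filter, multiply element-wise (16 multiplies instead of 36), and transform back to the 2x2 output. The final check compares against direct convolution.

import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile (CNN-style correlation)."""
    U = G @ g @ G.T               # transformed filter, 4x4
    V = B_T @ d @ B_T.T           # transformed data, 4x4
    return A_T @ (U * V) @ A_T.T  # 16 element-wise multiplies, then output transform

rng = np.random.default_rng(0)
d, g = rng.normal(size=(4, 4)), rng.normal(size=(3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
print(np.allclose(winograd_2x2_3x3(d, g), direct))   # True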
Speedup of Winograd Convolution
VGG16, Batch Size 1 Relative Performance

cuDNN 3 cuDNN 5

[Bar chart: relative performance (0 to 2.00x) of cuDNN 5 vs. cuDNN 3 for each VGG-16 conv layer: conv 1.1, 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2, 5.0.]

Measured on Maxwell TITAN X

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

75
Agenda

Algorithm

Algorithms for Algorithms for


Efficient Inference Efficient Training

Inference Training

Hardware for Hardware for


Efficient Inference Efficient Training

Hardware
Hardware for Efficient Inference
a common goal: minimize memory access

             Eyeriss         DaDiannao       TPU               EIE
             MIT             CAS             Google            Stanford
             RS Dataflow     eDRAM           8-bit Integer     Compression / Sparsity

"This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs."

[The slide is overlaid with excerpts from the EIE paper: the sparse matrix read unit, arithmetic unit, and activation read/write unit; the layout of one PE under the TSMC 45nm process (Figure 5); and Table II, the per-PE power and area breakdown with a 1.15 ns critical path.]

Compression Acceleration Regularization 77
Google TPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

78
Google TPU
The Matrix Unit: 65,536 (256x256)
8-bit multiply-accumulate units
700 MHz clock rate
Peak: 92T operations/second
65,536 * 2 * 700M
>25X as many MACs vs GPU
>100X as many MACs vs CPU
4 MiB of on-chip Accumulator
memory
24 MiB of on-chip Unified Buffer
(activation memory)
3.5X as much on-chip memory
vs GPU
Two 2133MHz DDR3 DRAM
channels
8 GiB of off-chip weight DRAM
memory

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

79
Google TPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

80
Inference Datacenter Workload

Name    LOC    Layers (FC / Conv / Vector / Pool = Total)    Nonlinear function   Weights   TPU Ops/Weight Byte   TPU Batch Size   % Deployed
MLP0    0.1k   5 / - / - / - = 5                             ReLU                 20M       200                   200              61% (MLPs)
MLP1    1k     4 / - / - / - = 4                             ReLU                 5M        168                   168
LSTM0   1k     24 / - / 34 / - = 58                          sigmoid, tanh        52M       64                    64               29% (LSTMs)
LSTM1   1.5k   37 / - / 19 / - = 56                          sigmoid, tanh        34M       96                    96
CNN0    1k     - / 16 / - / - = 16                           ReLU                 8M        2888                  8                5% (CNNs)
CNN1    1k     4 / 72 / - / 13 = 89                          ReLU                 100M      1750                  32

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

81
Roofline Model: Identify Performance Bottleneck

An insightful visual performance model. [Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," CACM 52.4 (2009): 65-76.]
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

82
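The roofline bound is simply the lesser of peak compute and memory bandwidth times operational intensity. A tiny sketch using the TPU's 92 TOPS peak from the slides above; the ~34 GB/s weight bandwidth is an assumption based on the two DDR3-2133 channels, so treat both numbers as illustrative.

def roofline(ops_per_byte, peak_ops=92e12, peak_bw_bytes=34e9):
    """Attainable ops/s = min(peak compute, bandwidth * operational intensity)."""
    return min(peak_ops, peak_bw_bytes * ops_per_byte)

# Operational intensities (TPU ops per weight byte) from the workload table above.
for name, intensity in [("MLP0", 200), ("LSTM0", 64), ("CNN0", 2888)]:
    print(f"{name}: {roofline(intensity) / 1e12:.1f} TOPS attainable")
# MLP0 and LSTM0 are memory bound (6.8 and 2.2 TOPS); only CNN0 reaches the 92 TOPS peak.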
TPU Roofline

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

83
Log Rooflines for CPU, GPU, TPU

Star = TPU
Triangle = GPU
Circle = CPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

84
Linear Rooflines for CPU, GPU, TPU

Star = TPU
Triangle = GPU
Circle = CPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

85
Why so far below Rooflines?
Low latency requirement => Can't batch more => low ops/byte

How to Solve this?


Less memory footprint => need to compress the model

Challenge:
Hardware that can infer on compressed model

86
[Han et al. ISCA16]
EIE: the First DNN Accelerator for
Sparse, Compressed Model

Compression Acceleration Regularization 87


[Han et al. ISCA16]
EIE: the First DNN Accelerator for
Sparse, Compressed Model

0*A=0 W*0=0 2.09, 1.92=> 2

Sparse Weight Sparse Activation Weight Sharing


90% static sparsity 70% dynamic sparsity 4-bit weights

10x less computation 3x less computation

5x less memory footprint 8x less memory footprint

Compression Acceleration Regularization 88


EIE: Reduce Memory Access by Compression

logically:

[Figure: the weight matrix W (8x4) multiplies the sparse activation vector ~a = (0, a1, 0, a3); the result passes through ReLU to give the sparse output ~b. Rows of W are interleaved across the four processing elements: PE0 owns rows 0 and 4, PE1 rows 1 and 5, PE2 rows 2 and 6, PE3 rows 3 and 7.]

physically (PE0's slice, stored column-major):

Virtual Weight    W0,0   W0,1   W4,2   W0,3   W4,3
Relative Index    0      1      2      0      0
Column Pointer    0      1      2      3

Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016, Hotchips 2016
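A small numpy sketch of the relative-indexed storage shown above for one PE's slice: non-zeros in column-major order, a relative index counting the zeros since the previous non-zero, and a pointer to the start of each column. The weight values are placeholders standing in for w0,0, w0,1, w4,2, w0,3, w4,3.

import numpy as np

def encode_pe_slice(W_pe):
    """Relative-indexed CSC for one PE's rows: values, relative indices, column pointers."""
    values, rel_index, col_ptr = [], [], [0]
    zeros_since_last = 0
    for col in W_pe.T:                       # walk the PE's sub-matrix column-major
        for w in col:
            if w == 0:
                zeros_since_last += 1        # (EIE caps this at 15 and pads with filler zeros)
            else:
                values.append(w)
                rel_index.append(zeros_since_last)
                zeros_since_last = 0
        col_ptr.append(len(values))
    return values, rel_index, col_ptr

# PE0's slice of the 8x4 matrix above: rows 0 and 4 (placeholder values).
W_pe0 = np.array([[3.0, 2.0, 0.0, 1.0],      # w0,0  w0,1  0     w0,3
                  [0.0, 0.0, 4.0, 5.0]])     # 0     0     w4,2  w4,3
print(encode_pe_slice(W_pe0))
# -> values [3, 2, 4, 1, 5], relative index [0, 1, 2, 0, 0], column pointers [0, 1, 2, 3, 5]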
[Han et al. ISCA16]

Dataflow

[Same sparse matrix-vector example as the previous slide, stepped through in the animation: each non-zero activation aj is broadcast to all PEs, and each PE multiplies it by the non-zero weights it holds in column j, accumulating into its slice of the output ~b.]

rule of thumb:
0*A=0 W*0=0
Compression Acceleration Regularization 90
[Han et al. ISCA16]

EIE Architecture

[Figure 1 of the paper: the efficient inference engine works on the compressed deep neural network model. A 4-bit virtual (encoded) weight is decoded to a 16-bit real weight through a look-up table, and a 4-bit relative index is accumulated into an absolute index / memory address; the ALU then multiply-accumulates against the input and writes the prediction result.]

rule of thumb:   0 * A = 0    W * 0 = 0    2.09, 1.92 => 2
Compression Acceleration Regularization 100
[Han et al. ISCA16]

Micro Architecture for each PE

[Figure: one PE. An activation queue (value + index) and leading-non-zero detection feed the pipeline; even/odd pointer SRAM banks supply the column start/end address (Pointer Read); the sparse-matrix SRAM bank supplies encoded weights and relative indices (Sparse Matrix Access); a weight decoder, address accumulator, bypass path and adder form the Arithmetic Unit; ReLU and the source/destination activation register files handle Act R/W. The original figure shades blocks by SRAM, registers, and combinational logic.]

Compression Acceleration Regularization 101


[Han et al. ISCA16]

Speedup on EIE
[Figure 6 of the EIE paper: speedups of GPU, mobile GPU and EIE compared with a CPU running the uncompressed DNN model, on the nine benchmark layers Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, NT-LSTM (no batching in all cases). EIE reaches up to 1018x over CPU on individual layers; its geometric-mean speedup is 189x over CPU and 13x over GPU.]

Compression Acceleration Regularization 102
[Han et al. ISCA16]


Energy Efficiency on EIE

[Figure 7 of the EIE paper: energy efficiency of GPU, mobile GPU and EIE compared with a CPU running the uncompressed DNN model (no batching in all cases). EIE's gains reach 119,797x over CPU on individual layers; the geometric mean is about 24,000x over CPU and 3,400x over GPU.]

[The slide also carries extraction residue from the paper: Table II (implementation results of one EIE PE: power and area breakdown by component and module, 1.15 ns critical path, TSMC 45nm), Table III (the benchmark layers, drawn from compressed AlexNet, VGG-16 and NeuralTalk models), and evaluation-methodology text describing the CPU (Core i7 5930k), GPU (GeForce GTX Titan X) and mobile GPU baselines.]

Compression Acceleration Regularization 103
[Han et al. ISCA16]

Comparison: Throughput
EIE

Throughput (Layers/s in log scale)


1E+06

ASIC
1E+05
ASIC
ASIC
1E+04

GPU
1E+03 ASIC
1E+02
CPU mGPU
1E+01 FPGA

1E+00
Core-i7 5930k TitanX Tegra K1 A-Eye DaDianNao TrueNorth EIE EIE
22nm 28nm 28nm 28nm 28nm 28nm 45nm 28nm
CPU GPU mGPU FPGA ASIC ASIC ASIC ASIC
64PEs 256PEs

Compression Acceleration Regularization 104


[Han et al. ISCA16]

Comparison: Energy Efficiency


EIE
Energy Efficiency (Layers/J in log scale)
1E+06

1E+05
ASIC ASIC
1E+04
ASIC ASIC
1E+03

1E+02

1E+01 GPU mGPU

1E+00 CPU FPGA


Core-i7 5930k TitanX Tegra K1 A-Eye DaDianNao TrueNorth EIE EIE
22nm 28nm 28nm 28nm 28nm 28nm 45nm 28nm
CPU GPU mGPU FPGA ASIC ASIC ASIC ASIC
64PEs 256PEs

Compression Acceleration Regularization 105


Agenda

Algorithm

Algorithms for Algorithms for


Efficient Inference Efficient Training

Inference Training

Hardware for Hardware for


Efficient Inference Efficient Training

Hardware
Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
Moore's law made CPUs 300x faster than in 1990
But it's over

C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011


Data Parallel Run multiple inputs in parallel

Dally, High Performance Hardware for Machine Learning, NIPS2015


Data Parallel Run multiple inputs in parallel

Doesn't affect latency for one input


Requires P-fold larger batch size
For training requires coordinated weight update
Dally, High Performance Hardware for Machine Learning, NIPS2015
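A toy numpy sketch of synchronous data parallelism: each worker computes a gradient on its own shard of the (P-fold larger) batch, and the coordinated weight update averages the gradients, which a real system would do with an all-reduce. The quadratic toy loss is only for illustration.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 10))
num_workers, lr = 4, 0.1

def local_gradient(W, shard):
    """Stand-in for forward/backward on one worker's data shard."""
    return shard.T @ (shard @ W) / len(shard)      # gradient of a toy quadratic loss

shards = [rng.normal(size=(32, 10)) for _ in range(num_workers)]   # P-fold larger batch
grads = [local_gradient(W, s) for s in shards]                     # computed in parallel
W -= lr * np.mean(grads, axis=0)                   # coordinated (all-reduced) update
print("updated W norm:", np.linalg.norm(W))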
Parameter Update
One method to achieve scale is parallelization

Parameter Server:   p' = p + Δp

[Figure: Model Workers send their Δp up to the parameter server and receive the updated parameters p'; each worker trains on its own Data Shard.]

Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012
Model Parallel
Split up the Model i.e. the network

Dally, High Performance Hardware for Machine Learning, NIPS2015


Model-Parallel Convolution by output region (x,y)

Kernels: multiple 3D kernels K_uvkj        Input maps: A_xyk        Output maps: B_xyj

6D Loop:
  Forall output map j
    For each input map k
      For each pixel x, y
        For each kernel element u, v
          B_xyj += A_(x-u)(y-v)k * K_uvkj

[Figure: the output maps are tiled into (x, y) regions, one region per processor.]

Dally, High Performance Hardware for Machine Learning, NIPS2015


Model-Parallel Convolution By output map j
(filter)

Kernels: multiple 3D kernels K_uvkj        Input maps: A_xyk        Output maps: B_xyj

6D Loop:
  Forall output map j
    For each input map k
      For each pixel x, y
        For each kernel element u, v
          B_xyj += A_(x-u)(y-v)k * K_uvkj

[Figure: each processor handles a different output map j, i.e. a different filter.]

Dally, High Performance Hardware for Machine Learning, NIPS2015


Model Parallel Fully-Connected Layer (M x V)

bi
= Wij
x aj

Output activations

Input activations
weight matrix

Dally, High Performance Hardware for Machine Learning, NIPS2015


Model Parallel Fully-Connected Layer (M x V)

bi Wij

bi
= Wij
x aj

Output activations

Input activations
weight matrix

Dally, High Performance Hardware for Machine Learning, NIPS2015


Hyper-Parameter Parallel
Try many alternative networks in parallel

Dally, High Performance Hardware for Machine Learning, NIPS2015


Summary of Parallelism
Lots of parallelism in DNNs
16M independent multiplies in one FC layer
Limited by overhead to exploit a fraction of this

Data parallel
Run multiple training examples in parallel
Limited by batch size

Model parallel
Split model over multiple processors
By layer
Conv layers by map region
Fully connected layers by output activation

Easy to get 16-64 GPUs training one model in parallel


Dally, High Performance Hardware for Machine Learning, NIPS2015
Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
Mixed Precision

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Mixed Precision Training

VOLTA TRAINING METHOD

  W (F16), Actv (F16)             -> FWD    -> Actv (F16)
  W (F16), Actv Grad (F16)        -> BWD-A  -> Actv Grad (F16)
  Actv (F16), Actv Grad (F16)     -> BWD-W  -> W Grad (F16)
  Master-W (F32), W Grad (F16)    -> Weight Update (F32) -> Updated Master-W (F32)

Forward, activation-gradient and weight-gradient passes run in F16; the weight update is applied to an F32 master copy of the weights, which is rounded back to F16 for the next iteration.

5

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017
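A minimal manual mixed-precision step in PyTorch matching the diagram above: FP16 weights and activations for forward/backward, an FP32 master copy for the update, and a constant loss scale. This is an illustrative sketch (modern frameworks package the same recipe as automatic mixed precision), and it assumes a CUDA device for FP16 math.

import torch

device = "cuda"                                               # FP16 compute assumes a GPU
model = torch.nn.Linear(512, 10).to(device).half()            # FP16 weights/activations
master = [p.detach().clone().float() for p in model.parameters()]   # FP32 master copy
loss_scale, lr = 1000.0, 0.01

x = torch.randn(64, 512, device=device, dtype=torch.float16)
y = torch.randint(0, 10, (64,), device=device)

logits = model(x)                                             # FP16 forward
loss = torch.nn.functional.cross_entropy(logits.float(), y)   # loss computed in FP32
(loss * loss_scale).backward()                                # scaled FP16 backward

with torch.no_grad():
    for p, m in zip(model.parameters(), master):
        m -= lr * p.grad.float() / loss_scale                 # unscale, update FP32 master
        p.copy_(m.half())                                     # round back to FP16
        p.grad = None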
Inception V1
INCEPTION V1

12

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017
ResNet
RESNET50

13

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017
AlexNet Results

Mode                                Top1 accuracy, %    Top5 accuracy, %
FP32                                58.62               81.25
Mixed precision training            58.12               80.71
FP16 training, no loss scaling      54.89               78.12
FP16 training, loss scale = 1000    57.76               80.76

Inception-V3 Results

Mode                                Top1 accuracy, %    Top5 accuracy, %
FP32                                71.75               90.52
Mixed precision training            71.17               90.10
FP16 training, loss scale = 1       71.17               90.33
FP16 training, loss scale = 1,
  FP16 master weight storage        70.53               90.14

Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=1024, no augmentation, 1 crop, 1 model

ResNet-50 Results (scale loss function by 100x)

Mode                                Top1 accuracy, %    Top5 accuracy, %
FP32                                73.85               91.44
Mixed precision training            73.6                91.11
FP16 training                       71.36               90.84
FP16 training, loss scale = 100     74.13               91.51

Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=512, no augmentation, 1 crop, 1 model

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017


Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
Model Distillation

Teacher model 1 Teacher model 2 Teacher model 3


(Googlenet) (Vggnet) (Resnet)

Knowledge Knowledge
Knowledge

student
model

student model has much smaller model size


Softened outputs reveal the dark knowledge

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network


Softened outputs reveal the dark knowledge

Method: Divide score by a temperature to get a much softer


distribution

Result: Start with a trained model that classifies 58.9% of the


test frames correctly. The new model converges to 57.0%
correct even when it is only trained on 3% of the data

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network
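A short PyTorch sketch of the temperature trick: the student matches the teacher's softened distribution via KL divergence at temperature T, optionally blended with the ordinary hard-label loss. T, alpha and the random logits are placeholder choices.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target KL at temperature T, blended with the ordinary hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

teacher_logits = torch.randn(8, 10)        # would come from the GoogleNet/VGG/ResNet teachers
student_logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()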


Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
DSD: Dense Sparse Dense Training

(Under review as a conference paper at ICLR 2017)

Dense -> Pruning / Sparsity Constraint -> Sparse -> Re-Dense / Increase Model Capacity -> Dense

Figure 1: Dense-Sparse-Dense Training Flow. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting.

DSD produces the same model architecture but can find a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks on CNNs / RNNs / LSTMs.

Algorithm 1: Workflow of DSD training. Initialization: W(0) with W(0) ~ N(0, Σ). Output: W(t). It begins with the initial dense phase.

Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
DSD: Intuition

learn the trunk first then learn the leaves

Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
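A compact numpy sketch of the three DSD phases for one weight tensor; the placeholder training loop, 50% sparsity and step counts are illustrative, not the paper's schedule.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128)).astype(np.float32)

def train(W, mask=None, steps=100, lr=1e-3):
    """Placeholder training loop; a real one would use task gradients."""
    for _ in range(steps):
        W -= lr * rng.normal(size=W.shape).astype(np.float32)
        if mask is not None:
            W *= mask                          # sparse phase: pruned weights stay at zero
    return W

W = train(W)                                   # 1) initial dense phase
thresh = np.percentile(np.abs(W), 50)          # 2) sparse phase: prune under a sparsity constraint
mask = (np.abs(W) > thresh).astype(W.dtype)
W = train(W * mask, mask=mask)
W = train(W)                                   # 3) re-dense phase: pruned weights are free to recover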
[Han et al. ICLR 2017]

DSD is General Purpose:


Vision, Speech, Natural Language

Network      Domain    Dataset     Type    Baseline    DSD      Abs. Imp.    Rel. Imp.

GoogleNet Vision ImageNet CNN 31.1% 30.0% 1.1% 3.6%


VGG-16 Vision ImageNet CNN 31.5% 27.2% 4.3% 13.7%
ResNet-18 Vision ImageNet CNN 30.4% 29.3% 1.1% 3.7%
ResNet-50 Vision ImageNet CNN 24.0% 23.2% 0.9% 3.5%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD


The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from Caffe Model Zoo. ResNet18, ResNet50 are from fb.resnet.torch.

Compression Acceleration Regularization 133


[Han et al. ICLR 2017]

DSD is General Purpose:


Vision, Speech, Natural Language

Network      Domain    Dataset     Type    Baseline    DSD      Abs. Imp.    Rel. Imp.

GoogleNet Vision ImageNet CNN 31.1% 30.0% 1.1% 3.6%


VGG-16 Vision ImageNet CNN 31.5% 27.2% 4.3% 13.7%
ResNet-18 Vision ImageNet CNN 30.4% 29.3% 1.1% 3.7%
ResNet-50 Vision ImageNet CNN 24.0% 23.2% 0.9% 3.5%
NeuralTalk Caption Flickr-8K LSTM 16.8 18.5 1.7 10.1%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD


The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from Caffe Model Zoo. ResNet18, ResNet50 are from fb.resnet.torch.

Compression Acceleration Regularization 134


[Han et al. ICLR 2017]

DSD is General Purpose:


Vision, Speech, Natural Language

Network       Domain    Dataset     Type    Baseline    DSD      Abs. Imp.    Rel. Imp.

GoogleNet Vision ImageNet CNN 31.1% 30.0% 1.1% 3.6%


VGG-16 Vision ImageNet CNN 31.5% 27.2% 4.3% 13.7%
ResNet-18 Vision ImageNet CNN 30.4% 29.3% 1.1% 3.7%
ResNet-50 Vision ImageNet CNN 24.0% 23.2% 0.9% 3.5%
NeuralTalk Caption Flickr-8K LSTM 16.8 18.5 1.7 10.1%
DeepSpeech Speech WSJ93 RNN 33.6% 31.6% 2.0% 5.8%
DeepSpeech-2 Speech WSJ93 RNN 14.5% 13.4% 1.1% 7.4%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD


The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from Caffe Model Zoo. ResNet18, ResNet50 are from fb.resnet.torch.

Compression Acceleration Regularization 135


https://songhan.github.io/DSD
DSD on Caption Generation

Baseline: a boy in a red shirt is climbing a rock wall.
Sparse: a young girl is jumping off a tree.
DSD: a young girl in a pink shirt is swinging on a swing.

Baseline: a basketball player in a red uniform is playing with a ball.
Sparse: a basketball player in a blue uniform is jumping over the goal.
DSD: a basketball player in a white uniform is trying to make a shot.

Baseline: two dogs are playing together in a field.
Sparse: two dogs are playing in a field.
DSD: two dogs are playing in the grass.

Baseline: a man and a woman are sitting on a bench.
Sparse: a man is sitting on a bench with his hands in the air.
DSD: a man is sitting on a bench with his arms folded.

Baseline: a person in a red jacket is riding a bike through the woods.
Sparse: a car drives through a mud puddle.
DSD: a car drives through a forest.

Figure 3: Visualization of DSD training improves the performance of image captioning.


Baseline model: Andrej Karpathy, Neural Talk model zoo.
Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017

"... the forest from the background. The good performance of DSD training generalizes beyond these examples; more image caption results generated by DSD training are provided in the supplementary material."

137

A. Supplementary Material: More Examples of DSD Training Improves the Performance of NeuralTalk Auto-Caption System (Images from Flickr-8K Test Set)
DSD on Caption Generation

Baseline: a boy is swimming in a pool.
Sparse: a small black dog is jumping into a pool.
DSD: a black and white dog is swimming in a pool.

Baseline: a group of people are standing in front of a building.
Sparse: a group of people are standing in front of a building.
DSD: a group of people are walking in a park.

Baseline: two girls in bathing suits are playing in the water.
Sparse: two children are playing in the sand.
DSD: two children are playing in the sand.

Baseline: a man in a red shirt and jeans is riding a bicycle down a street.
Sparse: a man in a red shirt and a woman in a wheelchair.
DSD: a man and a woman are riding on a street.

Baseline: a group of people sit on a bench in front of a building.
Sparse: a group of people are standing in front of a building.
DSD: a group of people are standing in a fountain.

Baseline: a man in a black jacket and a black jacket is smiling.
Sparse: a man and a woman are standing in front of a mountain.
DSD: a man in a black jacket is standing next to a man in a black shirt.

Baseline: a group of football players in red uniforms.
Sparse: a group of football players in a field.
DSD: a group of football players in red and white uniforms.

Baseline: a dog runs through the grass.
Sparse: a dog runs through the grass.
DSD: a white and brown dog is running through the grass.
Baseline model: Andrej Karpathy, Neural Talk model zoo.
Agenda

Algorithm

Algorithms for Algorithms for


Efficient Inference Efficient Training

Inference Training

Hardware for Hardware for


Efficient Inference Efficient Training

Hardware
CPUs for Training
CPUs Are Targeting Deep Learning
Intel Knights Landing (2016)

7 TFLOPS FP32

16GB MCDRAM 400 GB/s

245W TDP

29 GFLOPS/W (FP32)

14nm process

Knights Mill: next gen Xeon Phi optimized for deep learning
Intel announced the addition of new vector instructions for deep learning
(AVX512-4VNNIW and AVX512-4FMAPS), October 2016
Slide Source: Sze et al Survey of DNN Hardware, MICRO16 Tutorial.
Image
Image Source:
Source: Intel,
Intel, Data DataNext
Source: Source: Next Platform
Platform
2
GPUs for Training
GPUs Are Targeting Deep Learning
Nvidia PASCAL GP100 (2016)

10/20 TFLOPS FP32/FP16


16GB HBM 750 GB/s
300W TDP
67 GFLOPS/W (FP16)
16nm process
160GB/s NV Link

Slide Source: Sze et al Survey of DNN Hardware, MICRO16 Tutorial.


Data Source: NVIDIA
Source: Nvidia 3
GPUs for Training

Nvidia Volta GV100 (2017)

15 FP32 TFLOPS
120 Tensor TFLOPS
16GB HBM2 @ 900GB/s
300W TDP
12nm process
21B Transistors
die size: 815 mm2
300GB/s NVLink

Data Source: NVIDIA


What's new in Volta: Tensor Core
4x4x4 matrix-multiply and accumulate

a new instruction that performs 4x4x4 FMA mixed-precision operations per clock
12X increase in throughput for the Volta V100 compared to the Pascal P100

8

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Pascal v.s. Volta

Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Pascal v.s. Volta

Left: Tesla V100 trains the ResNet-50 deep neural network 2.4x faster than Tesla P100.
Right: Given a target latency per image of 7ms, Tesla V100 is able to perform inference using the
ResNet-50 deep neural network 3.7x faster than Tesla P100.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
The GV100 SM is partitioned
into four processing blocks,
each with:

8 FP64 Cores
16 FP32 Cores
16 INT32 Cores
two of the new mixed-precision
Tensor Cores for deep learning
a new L0 instruction cache
one warp scheduler
one dispatch unit
a 64 KB Register File.

https://devblogs.nvidia.com/parallelforall/
cuda-9-features-revealed/
Tesla Product Tesla K40 Tesla M40 Tesla P100 Tesla V100
GPU GK110 (Kepler) GM200 (Maxwell) GP100 (Pascal) GV100 (Volta)

GPU Boost Clock 810/875 MHz 1114 MHz 1480 MHz 1455 MHz
Peak FP32 TFLOP/s* 5.04 6.8 10.6 15
Peak Tensor Core - - - 120
TFLOP/s*
Memory Interface 384-bit GDDR5 384-bit GDDR5 4096-bit HBM2 4096-bit HBM2

Memory Size Up to 12 GB Up to 24 GB 16 GB 16 GB

TDP 235 Watts 250 Watts 300 Watts 300 Watts


Transistors 7.1 billion 8 billion 15.3 billion 21.1 billion
GPU Die Size 551 mm 601 mm 610 mm 815 mm
Manufacturing 28 nm 28 nm 16 nm FinFET+ 12 nm FFN
Process
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
GPU / TPU

https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/

Google Cloud TPU

Cloud TPU delivers up to 180 teraflops to train and run machine learning models.

source: Google Blog

149
Google Cloud TPU

A TPU pod built with 64 second-generation TPUs delivers up to 11.5 petaflops of machine learning acceleration.

"One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs; now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod." (Google Blog, Introducing Cloud TPUs)

150
Wrap-Up

Algorithm

Algorithms for Algorithms for


Efficient Inference Efficient Training

Inference Training

Hardware for Hardware for


Efficient Inference Efficient Training

Hardware
Future

Smart Low Latency Privacy Mobility Energy-Efficient

152
Outlook: the Focus for Computation

PC Era -> Mobile-First Era -> AI-First Era

Mobile Computing -> Cognitive Computing -> Brain-Inspired Computing

Sundar Pichai, Google IO, 2016

153
Thank you!
stanford.edu/~songhan
