Low Power Design Methodologies

ELEN 468 Lecture 29 1
ELEN 468
Advanced Logic Design
Lecture 29
Low Power Design
Power Dissipation
[Figure: power (watts, log scale from 0.1 to 100) versus year (1971 to 2000) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6]
Power increases despite Vdd decrease
Courtesy, Intel
Power Density
[Figure: power density (W/cm2, log scale from 1 to 10000) versus year (1970 to 2010) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, with reference levels marked for hot plate, nuclear reactor, and rocket nozzle]
Courtesy, Intel
Why Power Increased
Growing die size, fast frequency scaling
[Figure: clock frequency (MHz, log scale from 10 to 10000) versus year (1985 to 2005)]
Gate Power Dissipation
Leakage power
Dynamic power
Short circuit power
Dynamic Power
Occurs at each switching
$P_d = C_L V_{dd}^2 f_p$
f_p: switching frequency
[Figure: two inverter diagrams with nodes out and Vdd, regions labeled saturation and linear]
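As a minimal numeric sketch of the dynamic power relation P_d = C_L Vdd^2 f_p, the snippet below evaluates it at two supply voltages. The capacitance, voltage, and frequency values are illustrative assumptions, not figures from the lecture.

```python
def dynamic_power(c_load_farads, vdd_volts, switch_freq_hz):
    """Dynamic power P_d = C_L * Vdd^2 * f_p, in watts."""
    return c_load_farads * vdd_volts ** 2 * switch_freq_hz


# Hypothetical operating point: 1 nF of switched capacitance,
# 1.5 V supply, 400 MHz switching frequency.
p_nominal = dynamic_power(1e-9, 1.5, 400e6)
# Same capacitance and frequency, supply lowered to 1.2 V.
p_low_vdd = dynamic_power(1e-9, 1.2, 400e6)

print(f"nominal:  {p_nominal:.3f} W")
print(f"at 1.2 V: {p_low_vdd:.3f} W ({p_low_vdd / p_nominal:.0%} of nominal)")
```

Because the supply voltage enters squared, lowering it is the strongest single lever in this expression.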
Leakage Power
Static
Leakage current $= a\,V_{dd}$
Leakage current $= b/V_t$
Killer to CMOS technology
[Figure: two inverter diagrams with nodes out and Vdd, regions labeled saturation and linear, and leakage paths marked]
Short Circuit Power
During switching, there is a short moment when both PMOS and NMOS are partially on
$P_s = Q\,(V_{dd} - V_t)^3\, t_r\, f_p$
t_r: rise time
[Figure: two inverter diagrams with nodes out and Vdd, one for input rising and one for input falling]
Where Does Power Go?
[Figure: two bar charts of power percentages (0% to 100%), each split into core transistor leakage, gate leakage, cache leakage, and active power]
Scalable X86 CPU Design for 90nm
Low VT devices are <1% of total non-memory transistor width
[J. Schultz and C. Webb, ISSCC 2004]
Total chip power based on ITRS roadmap
In 2004, we are just breaking even
[Kim, et al, Computer 2003]
Energy Performance Space
Every design is a point on a 2-D plane
[Figure: designs shown as points on a plane with axes labeled performance and energy]
Low Power Design
Reduce dynamic power
α (activity factor): clock gating, sleep mode
C: small transistors (esp. on clock), short wires
VDD: lowest suitable voltage
f: lowest suitable frequency
Reduce static power
Selectively use low Vt devices
Power gating, MTCMOS
Stacked devices
Body bias
Clock Gating
Gate off clock to idle functional units
e.g., floating point units
need logic to generate disable signal
increases complexity of control logic
consumes power
timing critical to avoid clock glitches at OR gate output
additional gate delay on clock signal
gating OR gate can replace a buffer in the clock distribution tree
[Figure: register clocked by the OR of clock and disable, feeding a functional unit]
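The OR-gate scheme above can be illustrated with a small discrete-time sketch. It assumes a rising-edge-triggered register and a disable signal that changes only while the clock is high, which is the timing condition that keeps the OR output free of glitches. The waveform encoding (two samples per cycle) and the helper names are hypothetical.

```python
def gated_clock(clk, disable):
    """OR-gate clock gating: the gated clock is held high while disable is 1."""
    return [c | d for c, d in zip(clk, disable)]


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges that would clock the register."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# Eight clock cycles, two samples per cycle (high phase first, then low).
clk = [1, 0] * 8
# Disable is asserted at sample 4 and released at sample 12; the clock is
# high at both instants, so the OR output never glitches.
disable = [0] * 4 + [1] * 8 + [0] * 4

gclk = gated_clock(clk, disable)
print("edges seen without gating:", rising_edges(clk))
print("edges seen with gating:   ", rising_edges(gclk))
```

Every suppressed edge is a cycle in which the idle unit's register and the clock wiring behind the gate do not switch.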
Active Power Reduction - Supply Voltage Reduction
Static
Pros:
Always active in saving
Cons:
Additional power delivery network
Needs special care of interface between power domains: signals close to Vt cause excessive leakage and reduced noise margins
Dynamic
Adjusting operation voltage and frequency to performance requirements:
High performance: high Vdd & frequency
Power saving: low Vdd & frequency
Pros:
Doesn't limit performance
Cons:
Penalty of transition between different power states can be high (in performance and power)
Additional control logic
[Figure: blocks labeled slow, fast, and slow, assigned high and low supply voltages]
Voltage Islands (Multi-Vdd)
Allow both macro and cell voltage assignment
Allow different voltage islands in the same circuit row
Lift unnatural layout restrictions
Minimal placement disturbance
[Figure: Vddh/Vddl layout styles compared: Lackey+ ICCAD02, Usami+ JSSC98, and GVI DAC03]
Level Converter
Interface circuit when Vddl drives Vddh to avoid leakage
[Figure: a VddL gate driving a VddH gate, with one transistor marked "weak on!"]
[Figure: conventional dual supply level converter (Vddh, Vddl, IN, OUT) and new single supply level converter (Vddh, IN, OUT)]
Adjacency Metrics for Clustering
Logic adjacency metric (LAM): Vddl fanin cone of level shifter without going through Vddh
[Figure: two diagrams of Vddh and Vddl cells, the first with level converters LC1, LC2, LC3 and the second with LC2, LC3]
Physical adjacency metric (PAM): for each candidate Vddl cell, compute total size of its neighbor Vddl cells
LAM to guide logic aware voltage assignment
PAM to guide placement aware voltage re-assignment
Level Converter Optimizations
Logic replacement (or gate sizing)
[Figure: two MUX/decoder (DEC) circuits, MUX 1 and MUX 2, with inputs A and B and output Z, showing level converter (LC) positions]
LC/Buffer co-optimization
Placement to Form Voltage Islands with Power Grid Co-design
Based on Vddl and Vddh cell placement after voltage assignment, define Vddl/Vddh power grids on demand
Detailed placement to form Vddl/Vddh voltage islands that can hit their corresponding power supplies
[Figure: power grids on demand, with alternating Vddl and Vddh rails]
Example of Voltage Islands
Vddl = 1.2V
Vddh = 1.5V
No timing degradation, no area increase!
- IBM Cu11
- 0.13um
- 400 MHz
(courtesy IBM)
Dynamic Frequency and Voltage Scaling
Always run at the lowest supply voltage that meets the timing constraints
DFS (dynamic frequency scaling) saves only power
DVS (dynamic voltage scaling) + DFS saves both energy and power
A DVS+DFS system requires the following:
A programmable clock generator (PLL)
PLL from 200MHz to 700MHz in increments of 33MHz
A supply regulation loop that sets the minimum VDD necessary for operation at the desired frequency
32 levels of VDD from 1.1V to 1.6V
An operating system that sets the required frequency + supply voltage to meet the task completion deadlines
heavier load: ramp up VDD, when stable speed up clock
lighter load: slow down clock, when PLL locks onto new rate, ramp down VDD
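The claim that DFS saves only power while DVS + DFS saves both energy and power can be checked with a short sketch based on the dynamic power model P = C Vdd^2 f. The capacitance, the cycle count, and the assumption that 1.1 V is enough to sustain 350 MHz are all hypothetical.

```python
def power_and_energy(c_eff, vdd, freq_hz, cycles):
    """Return (power in W, energy in J) for a task of `cycles` clock cycles,
    using the dynamic power model P = C * Vdd^2 * f."""
    power = c_eff * vdd ** 2 * freq_hz
    runtime = cycles / freq_hz          # seconds needed to finish the task
    return power, power * runtime


C_EFF = 1e-9     # hypothetical effective switched capacitance (F)
CYCLES = 7e8     # hypothetical task length in clock cycles

settings = {
    "full speed (1.6 V, 700 MHz)": power_and_energy(C_EFF, 1.6, 700e6, CYCLES),
    "DFS only   (1.6 V, 350 MHz)": power_and_energy(C_EFF, 1.6, 350e6, CYCLES),
    "DVS + DFS  (1.1 V, 350 MHz)": power_and_energy(C_EFF, 1.1, 350e6, CYCLES),
}

for name, (power, energy) in settings.items():
    print(f"{name}: power = {power:.3f} W, energy = {energy:.3f} J")
```

Halving the frequency alone halves the power but doubles the run time, so the energy per task is unchanged. Only the lower supply voltage reduces the energy.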

Leakage Reduction Techniques
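[Figure: four leakage reduction techniques
stack effect: stacked transistors of widths W_u and W_l below the pullup (Vdd), with intermediate node Vx
dual Vt partitioning: high Vt devices and low Vt devices
variable threshold (VTCMOS): well biases labeled Vnwell, Vdd, Vpwell, and 0
multi-threshold (MTCMOS): low Vt logic between virtual Vdd and virtual Gnd, with HVT sleep transistors to Vdd and ground]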
Natural Transistor Stacks
Reduce the leakage by stacking the devices
How?
Reduced Vds
Negative Vgs
Negative Vbs
Design with Dual Vth
Dual Vth design
Two flavors of transistors: slow high Vth, fast low Vth
Low Vth devices are faster, but have 10X leakage
Dual Vth evaluation
Impacts of Variable VT
Reducing the VT increases the sub-threshold leakage current (exponentially)
$V_T = V_{T0} + \gamma\left(\sqrt{\left|-2\phi_F + V_{SB}\right|} - \sqrt{\left|-2\phi_F\right|}\right)$
where VT0 is the threshold voltage at VSB = 0, VSB is the source-bulk (substrate) voltage, and γ is the body-effect coefficient
But, reducing VT decreases gate delay (increases performance)
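A small sketch of the body-effect relation V_T = V_T0 + γ(√|-2φ_F + V_SB| - √|-2φ_F|) shows how the threshold moves with substrate bias. The parameter values (V_T0 = 0.43 V, γ = 0.4 V^0.5, φ_F = -0.3 V) are illustrative NMOS-like assumptions and are not taken from the slides. The sketch passes the magnitude of the reverse bias as a positive V_SB in the formula, which corresponds to driving the substrate below the source.

```python
import math


def threshold_voltage(v_sb, v_t0=0.43, gamma=0.4, phi_f=-0.3):
    """Body-effect model:
    V_T = V_T0 + gamma * (sqrt(|-2*phi_F + V_SB|) - sqrt(|-2*phi_F|)).
    Default parameters are assumed, illustrative values."""
    return v_t0 + gamma * (
        math.sqrt(abs(-2 * phi_f + v_sb)) - math.sqrt(abs(-2 * phi_f))
    )


# Sweep the reverse-bias magnitude from 0 V to 2.5 V in 0.5 V steps.
for step in range(6):
    v_sb = 0.5 * step
    print(f"V_SB = {v_sb:.1f} V -> V_T = {threshold_voltage(v_sb):.3f} V")
```

With these assumed parameters the threshold rises monotonically with the reverse bias, which is the mechanism that body biasing exploits to trade speed for leakage.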
Variable VT through Body Bias
[Figure: threshold voltage (axis from 0.4 to 0.9) versus VSB (V), from -2.5 to 0]
For NMOS, the substrate is normally tied to ground (VSB = 0)
A negative bias on VSB causes VT to increase
Adjusting the substrate bias at runtime is called adaptive body-biasing (ABB) or dynamic threshold scaling (DTS)
Requires a triple well fab process
[Figure: substrate bias terminals VSB,p and VSB,n]
Forward/Reverse Body Biasing
RBB (Reverse Body Bias): zero body bias in active mode, a deep reverse bias in standby mode.
Disadvantages:
Increases PN junction reverse leakage
Technology scaling worsens short channel effects and weakens the Vth modulation capability
FBB (Forward Body Bias): high Vth in standby mode, forward body biasing to achieve better current drive in active mode.
Disadvantages:
Larger junction capacitance
High body effect for stacked devices
Implementation of Dynamic Vth Scaling (DTS)
The lowest Vth is delivered (NBB, no body bias) if the highest performance is required.
When the performance demand is low, clock frequency is lowered and Vth is raised via RBB to reduce the run time leakage power dissipation.
How?
When critical path replica frequency is less than reference CLK, adjust bias to decrease Vth.
Otherwise adjust bias to increase Vth.
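The feedback rule above can be sketched as a simple bang-bang loop. The replica model below uses an alpha-power-law expression with made-up constants, and the step size and Vth limits are likewise hypothetical. It is meant only to show the direction of each adjustment, not the actual controller.

```python
def replica_frequency(vth, vdd=1.2, k=1.0e9, alpha=1.3):
    """Hypothetical critical-path replica model (alpha-power law):
    the replica runs faster as Vth is lowered."""
    return k * (vdd - vth) ** alpha / vdd


def dts_step(vth, f_ref, step=0.005, vth_min=0.20, vth_max=0.50):
    """One iteration of the rule: replica slower than the reference clock
    means lower Vth; otherwise raise Vth (more reverse body bias)."""
    if replica_frequency(vth) < f_ref:
        vth -= step
    else:
        vth += step
    # vth_min models the no-body-bias (NBB) limit, vth_max the deepest RBB.
    return min(max(vth, vth_min), vth_max)


def settle(f_ref, vth=0.20, iterations=200):
    """Run the loop from the lowest Vth and return the value it settles near."""
    for _ in range(iterations):
        vth = dts_step(vth, f_ref)
    return vth


for f_ref in (900e6, 700e6, 400e6):
    print(f"reference {f_ref / 1e6:.0f} MHz -> Vth near {settle(f_ref):.3f} V")
```

When the reference clock is lowered, the loop settles at a higher Vth, which is where the leakage saving comes from. At the highest demand it stays pinned at the lowest Vth.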
Power Gating Using Sleep Transistors
Or can reduce leakage by gating the supply rails when the circuit is in sleep mode
in normal mode, sleep = 0 and the sleep transistors must present as small a resistance as possible (via sizing)
in sleep mode, sleep = 1, the transistor stack effect reduces leakage by orders of magnitude
Or can eliminate leakage by switching off the power supply (but lose the memory state)
Example of Power Gating
[Figure: layout with embedded power switches, rows of standard cells, and power switch control signals]
Can reduce power 1000X
Smaller voltage swing (IR drop on sleep transistors)
Lower performance
Increased noise coupling
Local power grid design
Power Dissipation on Variation Tolerance
Conventional variation tolerance
Using large timing safety margin
Implies aggressive timing target
Greater power dissipation
Observation
Near-worst-case variations occur rarely
Safety margin is applied continuously to guard against the small chance of variations
Poor power efficiency
Question:
Can we deal with errors instead of preventing them from occurring by conservative binning/clocking?
How fast can we speed up the circuit with the error rate in a manageable range?
Fault tolerant system
Begin with reference values
Introduce redundancy
Hardware: Triple Modular Redundancy
Time: Repeated process
Information: Code
Software: various algorithms
How about for delay fault?
How do we detect (maybe correct?) errors?
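The hardware redundancy option listed above, triple modular redundancy, can be sketched as a bitwise two-of-three majority vote over the outputs of three copies of a module. The word width and the injected error are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)


correct = 0b10110010
faulty = correct ^ 0b00001000   # one module suffers a single-bit error

voted = majority_vote(correct, faulty, correct)
print(f"voted output: {voted:08b}")
```

The vote masks an error in any single module. It does not help when two copies fail in the same bit position.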
Delay fault tolerant system
Delay fault detection
Redundant timing margin in signal path
+: Second sampling at increased clock period
-: Decreased delay of reference signal between pipeline registers
[Figure: timing diagram with sampling instants t1 and t2, the timing margin, and the 2nd sampling point]
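The second-sampling idea above can be sketched as a classification of path delays against the two sampling instants. The sketch assumes the second sample is taken one timing margin after the first (t2 = t1 + margin) and uses made-up delays in picoseconds.

```python
def classify_path(delay, t1, margin):
    """Classify one signal path against a main sample at t1 and a second
    sample at t2 = t1 + margin."""
    t2 = t1 + margin
    if delay <= t1:
        return "ok"          # both samples see the settled value
    if delay <= t2:
        return "detected"    # the two samples can differ, so the fault is flagged
    return "undetected"      # too slow even for the second sample


T1_PS = 1000        # hypothetical main sampling instant (ps)
MARGIN_PS = 200     # hypothetical timing margin (ps)

results = {d: classify_path(d, T1_PS, MARGIN_PS) for d in (800, 950, 1100, 1300)}
for delay, outcome in results.items():
    print(f"path delay {delay} ps: {outcome}")
```

Only delays that fall inside the margin window are caught, which is why the size of the margin bounds how aggressively the main clock can be shortened.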
Delay fault tolerant system
Delay fault removal
Reference signal (S_R)
Reprocessing at slower clock period (t)
[Figure: timing diagram with sampling instants t1 and t2, the timing margin, and the reference signal S_R]
Delay fault tolerant system: Example
RAZOR*
Dynamic Voltage Scaling Design
Reduce supply voltage down to a manageable failure rate
[Figure: timing diagram with sampling instants t1 and t2 and the timing margin]
* Razor: a low-power pipeline based on circuit-level timing speculation, D. Ernst et al, 36th Annual IEEE/ACM International Symposium on Microarchitecture 2003
RAZOR continued
Implemented at 120MHz clock frequency
But for high speed circuits:
Managing two clocks
Minimum path delay constraint
Delay of MUX
Parity coding
Parity generation based on output correlation
Avoid well-correlated outputs for pairing
[Figure: timing diagram showing the timing margin]
Now, let's look at delay distribution(s)
Clock speed achieved for contained error rate
Parity coding (continued)
Complexity
Example: C499 ISCAS Benchmark
Recently Proposed Design
Fault detection
Partial hardware and time redundancy
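[Figure: pipeline stage between latches L_n and L_n+1 with gates g_0 to g_m divided into FL and BL; a duplicate BL' taken from gate g_i drives L'_n+1; the timing margin is marked]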
Proposed Design
Fault removal
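Pipeline flush & reprocessing at lower clock
[Figure: pipeline stage between latches L_n and L_n+1 with gates g_0 to g_m divided into FL and BL; a duplicate BL' taken from gate g_i drives L'_n+1]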
Proposed Design
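Division of FL and BL
[Figure: block diagram from primary inputs (PI) to primary outputs (PO) with FL, BL, latches, a second BL, and a comparison block (CP) producing the Error? signal]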
Proposed Design
Considerations
The effects on the original circuit should be
minimal.
Maximize delay fault detection coverage
Minimize added complexity
Proposed Design
First, POs to BL
Gate with longest delay to gate with shortest delay
For the gates connected to BL,
Choose the gate with maximum delay
Then, any gate whose number of fanouts > number of fanins
Proposed Design
Delay fault detection coverage
d_FL: delay from PI to any gate in FL
d_i: delay from PI to any gate in original circuit
$C_F = 1 - \frac{\max\{d_{FL}\}}{\max\{d_i\}}$
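A minimal sketch of the coverage metric C_F = 1 - max{d_FL} / max{d_i}. The gate delays below are hypothetical values in picoseconds, with FL taken to be the three fastest gates.

```python
def fault_detection_coverage(fl_delays, all_delays):
    """C_F = 1 - max(d_FL) / max(d_i), with delays measured from the
    primary inputs."""
    return 1.0 - max(fl_delays) / max(all_delays)


# Hypothetical PI-to-gate delays (ps) for every gate in the original circuit.
all_delays = [120, 250, 400, 610, 800]
# Hypothetical subset of those gates that ends up in FL.
fl_delays = [120, 250, 400]

c_f = fault_detection_coverage(fl_delays, all_delays)
print(f"C_F = {c_f:.2f}")
```

Under this definition, moving more of the late-arriving gates into BL raises the coverage, at the cost of duplicating more logic.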
Proposed Design
Delay simulation
SPICE simulation
TSMC 0.18um tech. Vcc=1.6V
Gate delay for rising and falling signal
Load: inverter
Different input combinations are considered
Delay simulation
Randomly generated test vectors
10^6 ~ 10^8 according to number of primary inputs (PI)
Proposed Design
Area complexity
N_gate: Number of gates in the original circuit
N_ff: Number of flip-flops in each pipeline, (N_PI + N_PO)/2
N_gate_BL: Number of gates in BL
N_gate_CP: Number of gates in comparison block
N_Latch: Number of latches = Number of connections between FL and BL
w: Complexity ratio of flip-flop to gate
$C_A = \frac{N_{gate\_BL} + N_{gate\_CP} + N_{Latch}}{N_{gate} + w\,N_{ff}}$
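The area complexity ratio C_A = (N_gate_BL + N_gate_CP + N_Latch) / (N_gate + w N_ff) can be evaluated as follows. All counts and the flip-flop weight w are hypothetical round numbers, not data for any benchmark.

```python
def area_complexity(n_gate, n_ff, n_gate_bl, n_gate_cp, n_latch, w):
    """C_A = (N_gate_BL + N_gate_CP + N_Latch) / (N_gate + w * N_ff)."""
    return (n_gate_bl + n_gate_cp + n_latch) / (n_gate + w * n_ff)


# Hypothetical circuit: 200 gates, 40 primary inputs, 32 primary outputs.
n_pi, n_po = 40, 32
n_ff = (n_pi + n_po) // 2          # N_ff = (N_PI + N_PO) / 2

c_a = area_complexity(
    n_gate=200,
    n_ff=n_ff,
    n_gate_bl=40,    # gates duplicated in BL
    n_gate_cp=10,    # gates in the comparison block
    n_latch=26,      # one latch per FL-to-BL connection
    w=5,             # assumed flip-flop-to-gate complexity ratio
)
print(f"C_A = {c_a:.2f}")
```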
Fault Coverage vs. Complexity
[Figure: four panels plotting added complexity C_A (vertical axis) against fault detection coverage C_F (horizontal axis); the one surviving panel title is "Fault Detection Coverage vs. Added Complexity: C499"]
Complexity
Effective complexity penalty
Depends on application
More than half of area is cache
Speed critical part: integer unit
$C_{AE} = C_A \times \frac{\text{Applicable area}}{\text{Total chip area}} \le 0.5\,C_A$
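The effective penalty scales the added complexity by the fraction of the chip to which the technique applies. A minimal sketch, using a hypothetical C_A of 0.2 and a hypothetical applicable area of 40 out of 100 area units:

```python
def effective_complexity(c_a, applicable_area, total_area):
    """C_AE = C_A * (applicable area / total chip area)."""
    return c_a * applicable_area / total_area


# Hypothetical chip: the technique applies to 40 of 100 area units.
c_ae = effective_complexity(c_a=0.2, applicable_area=40.0, total_area=100.0)
print(f"C_AE = {c_ae:.3f}")
```

If more than half of the die is cache, the applicable fraction is below 0.5, so the chip-level penalty stays below half of C_A.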
Estimation of Complexity
[Figure: Intel Pentium 4 processor on 90 nm process, with labeled blocks: data cache, align mux, registers, ALUs & AGU]
Conclusion
Delay fault tolerant design is proposed
Possible operation clock frequency gain is estimated from modeling and experiments
Delay fault detection coverage and complexity are analyzed for optimal implementation
It shows that a 10% clock frequency gain is possible with the proposed design at a moderate (8-25%) complexity increase
