Pipelining: Why Wait - . - ?

PIPELINING
why wait . . . ?
... let's solve a "real problem"
device: washer
function: fill, agitate, spin
washerPD = 30 mins
clean,dry laundry device: dryer

function: heat, fast spin
dryerPD = 60 mins
one load at a time
everyone knows that the real

reason that students put off
doing laundry so long is not
because they procrastinate,
step 1:
are lazy,
or even not because they are
working with their
computation slides
the fact is, doing one load

at a time is not smart step 2:
total = washerPD + dryerPD = 90 mins

doing N loads of laundry
here's how 0th year E.E's do laundry, step 1:

the "combinational" way
step 2:
because they did not hear

of pipelining yet ! step 3:
step 4:
.....
total = N*(washerPD + dryerPD)

= N*90 mins
doing N loads... the 1st year E.E. way
if students "pipeline" the

step 1:
laundry process
step 2:
actually, it's more like N*60 + 30

if we account for the startup step 3:
transient correctly
.....
when doing pipeline analysis, we're
mostly interested in the "steady
state" where we assume we have an
infinite supply of inputs
total = N*max(washerPD , dryerPD)

= N*60 mins
some definitions
latency:
the delay from when an input is established until the
output associated with that input becomes valid
assuming that the
(0th year's laundry = 90 mins) wash is started
(1st year's laundry = 120 mins) as soon as
possible and waits
(wet) in the
implies that 0th's in a six hour wait gets 4 loads done,
washer until dryer
while 1st's gets 5 and goes home half an hour earlier
is available
throughput:
the rate of which inputs or outputs are processed
(0th year's laundry = 1/90 mins-1= 0.011 mins-1)
(1st year's laundry = 1/60 mins-1= 0.016 mins-1)
okay, back to circuits...
for combinational logic:

F
latency = tPD
X H P(X) throughput = 1/tPD
we can't get the answer
G faster, but are we making
effective use of our
hardware at all times?
X
F(X)
G(X)
P(X)
F and G are "idle", just holding their outputs

stable while H performs its computation
pipelined circuits
use registers to hold H's input stable!
now F and G can be working on input
Xi+1.
because of the 2-stage pipeline :
a valid input X during clock cycle j,

P(X) is valid during clock j+2.
suppose F, G, H have propagation delays of 15, 20, 25 ns
and we are using ideal zero-delay registers:
latency throughput
unpipelined 45 1/45
2-stage pipeline 50 1/25

worse better
pipeline diagrams
clock cycle
i i+1 i+2 i+3
pipeline stages
input Xi Xi+1 Xi+2 Xi+3 ...
F reg F(Xi) F(Xi+1) F(Xi+2)

...
G reg G(Xi) G(Xi+1) G(Xi+2)
H reg H(Xi) H(Xi+1) H(Xi+2)
the results associated with a particular set of input data

move diagonally through the diagram,
progressing through one pipeline stage each clock cycle
pipeline conventions
definition:
a K-stage pipeline ('K-pipeline") is an acyclic circuit having
exactly K registers on every path from an input to an output
a combinational circuit is thus a 0-stage pipeline
convention:
every pipeline stage, hence every K-stage pipeline, has a
register on its output (not on its input)
always:
the clock common to all registers must have a period
sufficient to cover propagation over combinational paths
PLUS (input) register tPD PLUS (output) register tSETUP
the latency of a K-pipeline is K times the

period of the clock common to all registers
the throughput of a K-pipeline is the
frequency of the clock
ill-formed pipelines
consider a bad job of pipelining:
for what value of K is the above circuit a K-pipeline?

none
problem:
successive inputs get mixed: e.g., B(A(Xi+1 ), Yi)
this happened because some paths from inputs to outputs
had 2 registers, and some had only 1!
can this happen on a well-formed K pipeline?
a pipelining methodology
step 1: STRATEGY:
draw a line that crosses every
output in the circuit, and focus your attention on placing
select one endpoint as an pipelining registers around the
origin slowest circuit elements
(bottlenecks)
step 2:
continue to draw new lines
from the origin across various
circuit connections such that
these new lines partition the
inputs from the outputs
adding a pipeline register at

every point where a separating
line crosses a connection will
always generate a valid
pipeline
pipeline example
observations:
• 1-pipeline improves neither
latency nor throughput
• troughput is improved by
breaking long combinational
paths, allowing faster
clock
• too many stages cost
LATENCY THROUGHPUT latency while not
improving throughput
0-pipe 4 1/4 • back-to-back registers
are often required to keep
1-pipe 4 1/4 pipeline well-formed
2-pipe 4 1/2
3-pipe 6 1/2
considering pipelining
• advantages
– higher throughput than
the corresponding combinatorial device
– different parts of the logic
work on different parts of the problem
• disadvantages
– generally, increases latency
– only as good as the weakest link
is there a way around this "weak link" problem?

how do 1st year EE's a.d.2010 laundry
they work around the bottleneck:
first they find a place
with twice as many dryers as washers
step 1:
step 2: throughput = 1/30 loads/min
step 3:
latency = 90 min
step 4:
step 5:
circuit interleaving
one way to overcome
a pipeline bottleneck
is to replicate
the critical element
as many time as needed
and alternate inputs
between the various copies
N-1 registers
latency = 2 clocks
N-way interleaving is equivalent

to N pipeline stages
combining pipelining and interleaving
combining interleaving
with pipelining
moves the bottleneck
from the C-element
to the F-element
here, C' interleaves two C-elements

with a propagation delay of 8 ns
the resulting C' circuit has

a throughput of 4 ns,
this can be considered and latency of 8ns.
as an extra pipelining stage
that passes through the middle of the C' module
assignment B4
Pipeline a combinational encryptor X 5 1 3 3 1
for throughput! 0
The device takes an integer value X 2 4 7
and computes an encrypted version C(X).
1 5 3 8 5
The propagation delay of each module
is given in ms. 6 9 11
(contamination delays are zero). 13
5 10 3 12 1
before monday, march 8, 9:00 C(X)
• what is the latency and throughput
of the unpipelined device? From: student@tue.nl
• give the locations for registers To: computation@ics.ele.tue.nl
(ideal, zero-delay) by edge numberSubject : B4
after maximizing the throughput!
use as few as possible registers! 27 0.40
• give the latency and throughput 4 9 10 drawing in
of your pipelined device! 30 0.80 attachment
attachment <student_B3.xxx>
multiplication (positive numbers)
multiplicand A3 A2 A1 A0
multiplier B3 B2 B1 B0
x
ABi called a "partial A3B0 A2B0 A1B0 A0B0
product
A3B1 A2B1 A1B1 A0B1
A3B2 A2B2 A1B2 A0B2
+ A3B3 A2B3 A1B3 A0B3
multiplying N-bit number by M-bit number gives (N+M)-bit result
easy part:
forming partial products (just an AND gate since Bi is either 0 or 1)
hard part:
adding partial products column by column with carry
multiplication
multiplicand A13 0A2 0A1 1A0

multiplier B1
3 0B2 1B1 1B0
x
ABi called a "partial A31B0 A02B0 01B0 A10B0
A
product
A31B1 A20B1 A01B1 A
10B1
A3B2 A2B2 A1B2 A0B2 +
+ A3B3 A2B3 A11B3 A10B3 0 1 1
1 0 0 1
multiplying N-bit number by M-bit number gives (N+M)-bit result +
1 1 0 0 0 1 1
easy part:
forming partial products (just an AND gate since Bi is either 0 or 1)
hard part:
adding M N-bit partial products
sequential multiplier
assume the multiplicand (A) has N bits

and the multiplier (B) has M bits.
init: P 0, load A&B
repeat M times {
P P + (BLSB ==1 ? A : 0)
shift P/B right one bit
}
done: (N+M)-bit result in P/B
a sequential circuit with a single N-bit adder

can proces one partial product at a time
and then cycle the circuit M times
sequential multiplier (64-bit ALU)
after initialization
multiplicand
(product register at 0)
and loading the operands;
63 31 0
E
<< repeat 32 times:
>>
{if 1==LSB in multiplier,
then add multiplicand;
shift multiplier 1 right;
shift multiplicand 1 left;
64 - bit
add zero }
ALU multiplier
63 0 31 0
product register E E
<<
>>
LSB
finite state machine
multiplicand
31 0
E
repeat 32 times:
shift multiplier 1 right;
shift product 1 right;
32 - bit add zero }
ALU multiplier
63 0 31 0
product register E E
<<
<<
>> >>
LSB
multiplicand
31 0
E
repeat 32 times:
shift content of
product register 1 right;
32 - bit add zero }
ALU multiplier
63 0
product register E
<<
>>
LSB

a combinational multiplier
A3 A2 A1 A0
B0
tPD = 10*tPD,FA
FA FA FA FA
(follow the path A3 A2 A1 A0
from A0 to P7) B1
FA FA FA FA
A3 A2 A1 A0
B2
FA FA FA FA
A3 A2 A1 A0
B3
FA FA FA FA
P7 P6 P5 P4 P3 P2 P1 P0
pipelined multiplier
A3 A2 A1 A0
B0
"carry save" FA FA FA FA
configuration
A3 A2 A1 A0
B1
FA FA FA FA
A3 A2 A1 A0
B2
FA FA FA FA
A3 A2 A1 A0
B3
FA FA FA FA
FA FA FA FA
P7 P6 P5 P4 P3 P2 P1 P0
summary
• latency (L) = time it takes for given input to effect an output
• throughput (T) = rate at which new outputs appear
• for combinational circuits: L = tPD of device, T = 1/L
• for K-stage pipeline (K > 0):
– always have registers on output(s)
– K registers on every path from input to output
– T = (tPD,reg + tPD,slowest pipeline stage + tSETUP)-1
• to increase throughput: split the slowest stage
• no further splitting possible, use replication/interleaving
– L = KxT
• pipelined latency ≥ combinational latency
• pipelining can be combined chapter
with circuit interleaving 4.5-p332
en 3.3:

Pipelining: Why Wait - . - ?

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pipelining: Why Wait - . - ?

Uploaded by

Copyright:

Available Formats

PIPELINING

clean,dry laundry device: dryer

everyone knows that the real

the fact is, doing one load

total = washerPD + dryerPD = 90 mins

here's how 0th year E.E's do laundry, step 1:

because they did not hear

total = N*(washerPD + dryerPD)

if students "pipeline" the

actually, it's more like N*60 + 30

total = N*max(washerPD , dryerPD)

for combinational logic:

F and G are "idle", just holding their outputs

a valid input X during clock cycle j,

2-stage pipeline 50 1/25

input Xi Xi+1 Xi+2 Xi+3 ...

F reg F(Xi) F(Xi+1) F(Xi+2)

H reg H(Xi) H(Xi+1) H(Xi+2)

the results associated with a particular set of input data

the latency of a K-pipeline is K times the

for what value of K is the above circuit a K-pipeline?

adding a pipeline register at

is there a way around this "weak link" problem?

step 2: throughput = 1/30 loads/min

N-way interleaving is equivalent

here, C' interleaves two C-elements

the resulting C' circuit has

multiplying N-bit number by M-bit number gives (N+M)-bit result

multiplicand A13 0A2 0A1 1A0

assume the multiplicand (A) has N bits

done: (N+M)-bit result in P/B

a sequential circuit with a single N-bit adder

finite state machine

You might also like