
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-27, NO. 9, SEPTEMBER 1978, 855

Effective Pipelining of Digital Systems

J. ROBERT JUMP, MEMBER, IEEE, AND SUDHIR R. AHUJA, MEMBER, IEEE

Abstract—This paper presents quantitative techniques for the evaluation and comparison of pipelined digital systems. They are based on three measures of effectiveness: delay, average time/operation, and average cost/operation. Moreover, the techniques do not assume that there is an unbounded stream of operations to be performed, although this case is considered. The use of the analysis methods to compare different ways of pipelining a given algorithm is illustrated by an investigation of the pipelining of general four-neighbor cellular arrays. The methods can also be used to evaluate different algorithms for performing the same operation. This is illustrated by comparing three array algorithms for integer multiplication.

Index Terms—Cellular arrays, cost-effectiveness, logic design, multiplication, parallelism, pipelining.

Manuscript received February 2, 1976; revised August 10, 1976. This work was supported by the National Science Foundation under Grants GJ750, GJ36471, and DCR7414283.
J. R. Jump is with the Department of Electrical Engineering, Rice University, Houston, TX 77001.
S. R. Ahuja is with Bell Laboratories, Holmdel, NJ 07733.
0018-9340/78/0900-0855$00.75 © 1978 IEEE

I. INTRODUCTION

PIPELINING is a well-known hardware design technique for utilizing parallelism to increase the computation rate of a digital system (e.g., [1], [3], [4], [12]). The effectiveness of this technique is highly dependent on the structure of the algorithm to be executed and the structure of its input data. For certain types of computations the use of pipelining can result in a significant increase in performance with only a modest increase in cost. For other computations a gain in performance may not be possible, or may only be realized at considerable cost. The goal of this paper is to develop analytical techniques that can be used to help evaluate the effectiveness of pipelined systems.

Pipelining can be used to realize an algorithm whenever that algorithm can be divided into a fixed number of steps that are to be executed in sequence. A pipelined realization of such an algorithm consists of several hardware stages separated by registers. There is one stage for each step of the algorithm and they are interconnected in the same order that the steps are executed, as illustrated in Fig. 1.

This organization can only be used effectively when the algorithm is applied repeatedly to a stream of input data. In this case, the first operands are sent to stage 1 and step 1 is executed. When step 1 is completed, the intermediate results, produced as the output of stage 1, are sent to stage 2 as input for step 2. At the same time, the second set of operands are brought into stage 1 and step 1 is repeated for these new data. Thus, both stages 1 and 2 are operating concurrently during the second phase of execution. If there is a total of n stages, then after n executions of the algorithm have been initiated, all n stages will be operating concurrently, each realizing a different step of a different execution of the algorithm.

Since all stages of an n-stage pipeline can operate concurrently, a pipelined system has the potential to operate at n times the computation rate of an equivalent nonpipelined system. Two factors will prevent the full realization of this n-fold gain. First, the rate will be determined by the slowest stage. If this stage requires t time units, then at most one execution of the algorithm can be completed by the system every t units of time. Even when the propagation times of all n stages are equal, they are usually greater than 1/n times the time per operation of a nonpipelined system due to the additional time required to transfer intermediate results between adjacent stages. Second, there is a delay between the initiation of the first execution and its completion; during this time no results are produced by the system. If the total number of executions to be performed is small, then this delay can significantly increase the average time per execution.

The major advantage of pipelining over other parallel design techniques is that frequently the same improvement in performance can be obtained for less cost using pipelining. This happens because an n-stage pipeline is obtained by partitioning a nonpipelined system into n smaller subsystems and then adding registers between the stages to hold the intermediate results. Thus, the dominant additional cost of a pipelined system is due to the added registers, and this is frequently small compared to the cost of the stages.

The cost advantage of a pipelined system may be diminished in certain situations. For example, if the increase in system performance is small due to reasons such as those described above, it may not merit the increased cost. Also, as the number of stages is increased, they become smaller and the cost of registers between stages becomes significant. On the other hand, if only a small number of large stages is used, the performance increase is limited.

This paper presents a quantitative analysis of the factors described above. This is done using the following three measures of effectiveness: time/operation, delay, and cost/operation. It is assumed that one complete execution of the realized algorithm corresponds to a single operation
such as multiplication or division. Then time/operation is defined as the reciprocal of the average number of operations completed per unit of time. Delay is the time between the initiation and completion of a single operation. Note that since different operations are overlapped, the time/operation will normally be much less than the delay. Cost/operation is a measure that takes into account both the cost of the system and its computation rate. It is defined as the cost per unit of time divided by the average number of completed operations per unit of time. The cost per unit of time is defined as the total cost of using the pipeline to perform a given number of operations, divided by the total time required to perform these operations. It could be obtained by dividing the total purchase or construction cost of the system by the total time it is to be used. Alternatively, it could be obtained directly from the rental rate if the equipment is rented.

Time per operation measures the computation rate of the system and is the measure of effectiveness most frequently applied to evaluate pipelined systems. It is an important measure in applications, such as real-time computation, where it is necessary to achieve a maximum computation rate. Once a sequence of operations has been initiated, the delay indicates how much time is required before the first operation is completed. After this, the results of the following operations are produced at the maximum rate. Hence, delay may also be viewed as a measure of system latency.

Cost/operation is the most complete measure of system effectiveness since it depends on the delay and time/operation as well as the system cost. It will be seen that even though the total cost of a pipelined system may be much larger than that of a nonpipelined system, its cost/operation can be less.

The next section presents closed-form expressions for the time/operation, cost/operation, and delay of a general pipelined system. These expressions are then used to investigate the problem of comparing different ways of pipelining the same algorithm. Davidson and Larson [5] have also investigated this problem. The next section can be viewed as an extension of their work that provides a somewhat more analytical comparison of the time and cost tradeoffs that are possible.

The results of Section II are applied to a specific class of algorithms in Section III. This class can be characterized as being iterative in two dimensions. Such algorithms can be realized as two-dimensional cellular arrays by performing a time-iteration to space-iteration transformation. They are particularly well-suited for pipelining since the division into stages can be done easily in many different ways by partitioning the array along its rows, columns, and diagonals. The analysis of these algorithms shows that in most cases, partitioning an array along its diagonals not only gives the best time/operation but also the lowest cost/operation.

Section IV illustrates how the analysis techniques developed in Section II can be used to compare pipelined systems that perform the same operation but use different algorithms. In particular, three array algorithms for integer multiplication are compared.

[Fig. 1. Linear pipelined algorithm: stages 1 through N separated by registers, with the input entering a register ahead of stage 1 and the output produced by stage N.]

II. GENERAL ANALYSIS

Consider the general pipelined system illustrated in Fig. 1. It is assumed that each stage is realized as a combinational logic circuit, and that the registers between stages consist of clocked flip-flops with a common clock signal. Although only one register is shown between two stages, it is assumed that it contains enough flip-flops to hold all of the information passed between the stages. The important parameters for this system are given below.

N — the number of stages.
N_R — the total number of flip-flops in registers between stages.
K_P — the total cost per second of the N stages, exclusive of the register cost.
K_R — the cost per second of a single flip-flop.
T_S — the maximum of all stage propagation delays, exclusive of the register delays.
T_R — the sum of the maximum propagation delay and setup time of a flip-flop.
M — the average number of operations to be performed.

The performance measures for a system are denoted by the following symbols.

T(M) — time per operation where M operations are performed.
η(M) — cost per operation where M operations are performed.
τ — time per operation when the number of operations performed is large (M → ∞).
η — cost per operation when the number of operations performed is large (M → ∞).
δ — delay of the system.

The minimum clock period of the system is (T_S + T_R), the maximum time for a signal to propagate from the input of a register to the output of the stage following the register. Once the system starts execution, it requires N clock periods before the first result is produced. Hence, the delay δ of the system is given by

    δ = N(T_S + T_R).

The time/operation T(M) is defined as the average number of time units required to perform a single operation, given that M operations are to be performed. Now the first result is produced in N(T_S + T_R) seconds and the remaining (M − 1) results are produced at the rate of one every (T_S + T_R) seconds. Hence, the total time to perform M complete operations is N(T_S + T_R) + (M − 1)(T_S + T_R). Thus, the average time per operation is given by

    T(M) = [N(T_S + T_R) + (M − 1)(T_S + T_R)]/M = (T_S + T_R)(M + N − 1)/M.

The cost of the system is obtained by summing the costs of the stages and the registers. Hence, the cost per unit of time is K_P + K_R·N_R. The cost/operation, η(M), for a sequence of M operations is given by

    η(M) = (K_P + K_R·N_R)T(M) = (K_P + K_R·N_R)(T_S + T_R)(M + N − 1)/M.

If the number of operations M is quite large, then the expressions for T(M) and η(M) can be simplified as follows:

    τ = lim_{M→∞} T(M) = T_S + T_R,
    η = lim_{M→∞} η(M) = (K_P + K_R·N_R)(T_S + T_R).

τ and η are the quantitative measures most frequently used to analyze a pipelined system (e.g., [5], [9], [12]). However, they can be inaccurate when M is the same order of magnitude as N, the number of stages.

In [5], Davidson and Larson use a measure of effectiveness that is quite similar to η. The only difference is that their measure is (operations/time)/cost instead of (cost/time)/(operations/time). If the total useful lifetime of the system is divided by their measure, the result would be η. When using these two measures to compare the effectiveness of two pipelined systems, they both give the same results if the lifetimes of the systems are equal.

In [3], Chen considers the performance of pipelined systems and takes into account the number of operations performed, so that his results are valid for small M. Although he was primarily interested in evaluating the efficiency of different forms of parallelism, there is a similarity of his measures to η(M) and T(M). Specifically, η(M) and T(M) can be defined in terms of η and τ as follows:

    η(M) = η·(M + N − 1)/M  and  T(M) = τ·(M + N − 1)/M.

The factor M/(M + N − 1) is used by Chen as a measure of efficiency. For pipelined systems it represents the loss of parallelism that results from the delay experienced before the first output is produced.
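The closed-form measures above are straightforward to compute. The following sketch (modern Python, for illustration only; the function names are ours and the demo parameter values are invented) evaluates δ, T(M), and η(M) together with the limiting values τ and η, and checks Chen's relation T(M)·M/(M + N − 1) = τ:

```python
# Illustrative sketch of the Section II measures. Symbol names follow the
# text (N stages, NR flip-flops, costs KP and KR per second, delays TS and
# TR); the demo numbers are invented.

def delay(N, TS, TR):
    # delta = N(TS + TR): N clock periods elapse before the first result
    return N * (TS + TR)

def time_per_op(M, N, TS, TR):
    # T(M) = (TS + TR)(M + N - 1)/M
    return (TS + TR) * (M + N - 1) / M

def cost_per_op(M, N, NR, KP, KR, TS, TR):
    # eta(M) = (KP + KR*NR) * T(M)
    return (KP + KR * NR) * time_per_op(M, N, TS, TR)

def tau(TS, TR):
    # limiting time/operation for an unbounded stream of operations
    return TS + TR

def eta(NR, KP, KR, TS, TR):
    # limiting cost/operation
    return (KP + KR * NR) * (TS + TR)

if __name__ == "__main__":
    N, NR, KP, KR, TS, TR = 8, 64, 100.0, 0.5, 10.0, 2.0
    for M in (8, 80, 800):
        # Chen's efficiency factor M/(M + N - 1) relates T(M) back to tau
        assert abs(time_per_op(M, N, TS, TR) * M / (M + N - 1)
                   - tau(TS, TR)) < 1e-9
        print(M, time_per_op(M, N, TS, TR),
              cost_per_op(M, N, NR, KP, KR, TS, TR))
```

As M grows, T(M) approaches τ and η(M) approaches η; the results that follow quantify this transient.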
The expressions derived above can be used to compare two different pipelined realizations of the same algorithm. Suppose that one realization has parameters T_S^(1), N^(1), and N_R^(1) and the other has parameters T_S^(2), N^(2), and N_R^(2). Each of the other parameters has the same value for both realizations. This corresponds to the case where a given nonpipelined system is partitioned in two different ways, resulting in a different number of stages and flip-flops, and different stage delays. The relative effectiveness of the two systems is given by the following results, where δ^(i), τ^(i), η^(i), T^(i)(M), and η^(i)(M) are the measures for realization i, i = 1, 2.

Result 2.1: δ^(2) ≥ δ^(1) iff N^(2)(T_S^(2) + T_R) ≥ N^(1)(T_S^(1) + T_R).

Result 2.2: τ^(2) > τ^(1) iff T_S^(2) > T_S^(1); τ^(2) = τ^(1) iff T_S^(2) = T_S^(1).

Result 2.3: Assume that T_S^(2) > T_S^(1). Then

    η^(2) > η^(1) iff K_P > K_R·[N_R^(1)(T_S^(1) + T_R) − N_R^(2)(T_S^(2) + T_R)] / (T_S^(2) − T_S^(1)).

If T_S^(2) = T_S^(1), then η^(2) ≥ η^(1) iff N_R^(2) ≥ N_R^(1).

Result 2.4: Assume that τ^(2) > τ^(1). Then

    T^(2)(M) ≥ T^(1)(M) iff M ≥ [τ^(1)(N^(1) − 1) − τ^(2)(N^(2) − 1)] / (τ^(2) − τ^(1)).

If τ^(2) = τ^(1), then T^(2)(M) ≥ T^(1)(M) iff N^(2) ≥ N^(1).

Result 2.5: Assume that η^(2) > η^(1). Then

    η^(2)(M) ≥ η^(1)(M) iff M ≥ [η^(1)(N^(1) − 1) − η^(2)(N^(2) − 1)] / (η^(2) − η^(1)).

If η^(2) = η^(1), then η^(2)(M) ≥ η^(1)(M) iff N^(2) ≥ N^(1). These results can be verified easily using the expressions for delay, time/operation, and cost/operation.

Results 2.1–2.5 can now be used to determine whether a given pipelined system is cost effective. For this purpose, a nonpipelined version of the same system is viewed as a pipelined system with only one stage, formed by cascading the stages of the pipelined system. Let system 1 be the pipelined version and system 2 the nonpipelined version. Then N^(1) > 1, N^(2) = 1, T_S^(2) > T_S^(1), and N_R^(1) > N_R^(2). It will be assumed that the stage delay T_S^(1) is greater than the flip-flop delay T_R. In practice, the delay of the nonpipelined system is usually divided equally among the stages of the pipelined system, so that T_S^(1) is approximately equal to T_S^(2)/N^(1). Since N^(1) ≥ 2, it will be assumed that T_S^(2) ≥ 2T_S^(1).

A simple sufficient condition for system 1 to be better than system 2 can now be derived from Results 2.2 and 2.3. Indeed, note that

    [N_R^(1)(T_S^(1) + T_R) − N_R^(2)(T_S^(2) + T_R)] / (T_S^(2) − T_S^(1))
      < [N_R^(1)(T_S^(1) + T_R) − N_R^(2)(T_S^(1) + T_R)] / (T_S^(2) − T_S^(1))
      = (N_R^(1) − N_R^(2))(T_S^(1) + T_R) / (T_S^(2) − T_S^(1))
      < 2(N_R^(1) − N_R^(2))T_S^(1) / (T_S^(2) − T_S^(1))
      ≤ 2(N_R^(1) − N_R^(2)).

Hence, the following result gives a loose bound for K_P if the pipelined system is to be more cost effective than the nonpipelined system.

Result 2.6: If N^(2) = 1, N^(1) > 1, T_S^(2) > T_S^(1) > T_R, and T_S^(2) ≥ 2T_S^(1), then τ^(2) > τ^(1) and η^(2) > η^(1) whenever 2(N_R^(1) − N_R^(2))K_R < K_P. This result shows that pipelining is advantageous whenever there is a large number of operations to be performed and the added cost of the registers needed to realize the pipelining is less than one-half the cost of the stages.

These results support the comments in Section I to the effect that pipelining can be a very effective form of parallelism. Result 2.6 indicates that in most cases where pipelining is possible and the number of operations to be performed is large, pipelining will result in a decrease in both the time per operation and the cost per operation. The lower bound on the number of operations needed to realize this improvement is given in the following result.

Result 2.7: If N^(1) > 1, N^(2) = 1, and τ^(2) > τ^(1), then

    T^(2)(M) > T^(1)(M) iff M > (N^(1) − 1) / (τ^(2)/τ^(1) − 1).

If, in addition, η^(2) > η^(1), then

    η^(2)(M) > η^(1)(M) iff M > (N^(1) − 1) / (η^(2)/η^(1) − 1).

Result 2.7 deals with the minimum number of operations necessary to guarantee an advantage, in computation time or cost, for a pipelined system. Of course, this advantage is minimal for the minimum value of M given by Result 2.7 and approaches its maximum, represented by τ or η, as M approaches infinity. Hence, it is also of interest to know how large M must be in order to achieve a given level of performance. The final result of this section is an answer to this question. The result follows easily from the expressions for τ, η, T(M), and η(M).

Result 2.8: Let k be a number such that 0 < k < 1. Then

    η/η(M) ≥ k and τ/T(M) ≥ k iff M ≥ k(N − 1)/(1 − k).

Note that to achieve an increase in performance greater than 0.9 times the maximum possible increase would require that the average number of operations be on the order of 9 times the number of stages. If the number of operations is approximately equal to the number of stages, then the gain in performance is only one-half the maximum possible gain.

The most important difference between the results presented in this section and similar results in the report by Davidson and Larson [5] is that this paper investigates how the number of operations performed can affect the performance and effectiveness of pipelined systems. In [5] it is assumed that this number is so large that the transient effect of loading the pipeline is insignificant.

In order to determine the effectiveness of a pipelined system, Davidson and Larson compare an expression for (operations/time)/cost of a pipelined system to one for an equivalent nonpipelined system. They conclude that pipelining is attractive when the stage delay is large relative to the register delay and the stage cost is large relative to the register cost. Results 2.2, 2.3, and 2.6 are a more quantitative statement of this observation. Results 2.4, 2.5, 2.7, and 2.8 extend this type of analysis by removing the assumption that the number of operations performed is very large.
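Results 2.7 and 2.8 lend themselves to a quick numerical check. The sketch below (illustrative Python; the function names are ours and the sample numbers are invented) computes the break-even operation count of Result 2.7 and the operation count needed to reach a fraction k of peak performance from Result 2.8:

```python
# Hedged sketch of Results 2.7 and 2.8. N1 is the stage count of the
# pipelined system; tau1 and tau2 are the limiting time/operation of the
# pipelined and nonpipelined systems. Names and numbers are ours.

def min_ops_for_speedup(N1, tau1, tau2):
    # Result 2.7: T2(M) > T1(M) iff M > (N1 - 1)/(tau2/tau1 - 1)
    return (N1 - 1) / (tau2 / tau1 - 1)

def min_ops_for_fraction(N, k):
    # Result 2.8: tau/T(M) >= k iff M >= k(N - 1)/(1 - k)
    return k * (N - 1) / (1 - k)

if __name__ == "__main__":
    # 10-stage pipeline with a 12 ns clock versus a 102 ns nonpipelined delay
    print("break-even M:", min_ops_for_speedup(10, 12.0, 102.0))
    # reaching 90% of peak takes on the order of 9x the stage count
    print("M for 90% of peak:", min_ops_for_fraction(10, 0.9))
    print("M for 50% of peak:", min_ops_for_fraction(10, 0.5))
```

With these sample numbers the pipeline already wins for a handful of operations, but sustaining 90 percent of its peak rate requires roughly nine times as many operations as stages, matching the remark after Result 2.8.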
III. PIPELINED ARRAYS

This section applies the analysis techniques developed in the previous section to a class of algorithms that are particularly well-suited for pipelining. This class consists of those algorithms that can be realized as two-dimensional cellular arrays where each cell is a combinational logic circuit connected to its four nearest neighbors, as shown in Fig. 2. Arrays such as these have received considerable attention from researchers interested in the design of LSI semiconductor devices [11].

The main step in pipelining any system is to partition it into stages. This step is especially easy for arrays since their regular structure can be partitioned in many different and natural ways. For example, an array can be divided into stages by inserting registers between rows or columns or along the diagonals of the array.

Four possibilities for pipelining an array are illustrated in Fig. 3. Row or column pipelining is accomplished by inserting registers between certain rows or columns as shown in Fig. 3(a) and (b). Registers can also be placed between both rows and columns, partitioning the array into rectangular subarrays as shown in Fig. 3(c). This will be called block pipelining. Finally, inserting registers along the diagonals perpendicular to the flow of information as illustrated in Fig. 3(d) will be called slice pipelining. Note that any method of placing registers in an array will produce a well-formed pipeline if all paths from an array input terminal to a cell or array output terminal pass through the same number of registers. This number is indicated in the cells of Fig. 3.
[Fig. 2. Four-neighbor array: an m × n array of cells, each connected to its four nearest neighbors, with n inputs along one edge and m inputs along the other.]

The set of all those cells that are i registers away from an input terminal form stage i of the pipeline. In order to ensure that all paths from an array input to a cell or array output pass through an equal number of registers, it is necessary to add registers outside the array. These outboard registers delay certain bits so that all bits of an input operand can be presented to the system during one clock period, and all bits of an output result are available during one clock period.

Given an m × n array, let l and k denote the number of rows and columns, respectively, in a block and let p denote the width of a slice as shown in Fig. 3(c) and (d). Then block pipelining includes both row (k = n) and column (l = m) pipelining as special cases. Block and slice pipelining are identical if the block consists of only one cell (l = k = 1) and the slice is one cell wide (p = 1). This special case will be called maximal pipelining since it results in the maximum computation rate (minimum τ).

The main result of this section is that slice pipelining is always as good as or better than block pipelining. Indeed, it will be shown that, given any block pipelined array, the same array can be slice pipelined in such a way that its measures delay, time/operation, and cost/operation are all less than or equal to the corresponding measures for the block pipelined array. It will also be shown that the cost/operation of a maximally pipelined array is almost always less than the cost/operation of the same array without pipelining. Thus, there is almost always an advantage to be gained by pipelining an array if it is to be used to perform repeated operations on a large stream of data. Moreover, this advantage is usually greatest for slice pipelining.

In order to analyze the effect of pipelining two-dimensional arrays, the following notation will be used.

m — the number of rows in the array.
n — the number of columns in the array.
j — the number of vertical lines between two adjacent cells.
i — the number of horizontal lines between two adjacent cells.
K_C — the cost per unit of time of a cell.
T_C — the maximum propagation delay of a cell.

Then K_P, the total cost of the stages, is given by mnK_C. The total number of flip-flops N_R, for both block and slice pipelining, is given by N(mi + nj), where N is the number of stages. Note that the number of flip-flops between any two adjacent stages is fixed and equal to mi + nj due to the regular structure of the array.

Now consider a block pipelined array. If the block size is l × k, then the number of stages is given by

    N^(b) = ⌈m/l⌉ + ⌈n/k⌉ − 1

and the delay per stage is

    T_S^(b) = (k + l − 1)T_C.

For a slice pipelined array with slice width p, the number of stages is

    N^(s) = ⌈(m + n − 1)/p⌉

and the delay per stage is

    T_S^(s) = pT_C.

Given a block pipelined array with block size l × k, consider slice pipelining the same array with a slice width of p = k + l − 1. Then T_S^(b) = T_S^(s), and the following result shows that N^(s) ≤ N^(b). Hence, it will follow that the slice pipelined system is as fast as the block pipelined array (T^(s) = T^(b)) but the cost/operation and delay of the slice pipelined system will be less than or equal to those of the block pipelined system.
[Fig. 3. (a) Row pipelining. (b) Column pipelining. (c) Block pipelining. (d) Slice pipelining. The number in each cell indicates how many registers separate it from the array inputs.]
Lemma 3.1: Let m, n, k, and l be nonzero positive integers with k ≤ m and l ≤ n. Then

    ⌈m/k⌉ + ⌈n/l⌉ − 1 ≥ ⌈(m + n − 1)/(k + l − 1)⌉.

Proof: Let m = q₁k + r₁ and n = q₂l + r₂, where 0 ≤ r₁ < k and 0 ≤ r₂ < l. Then

    ⌈m/k⌉ + ⌈n/l⌉ − 1 = q₁ + q₂ − 1 + ⌈r₁/k⌉ + ⌈r₂/l⌉.

If r₁ = 0 or r₂ = 0, then ⌈r₁/k⌉ + ⌈r₂/l⌉ ≤ 1; if r₁ > 0 and r₂ > 0, then ⌈r₁/k⌉ + ⌈r₂/l⌉ = 2. In either case,

    ⌈m/k⌉ + ⌈n/l⌉ − 1 ≥ m/k + n/l − 1.

Next consider the expression

    (m/k + n/l − 1) − (m + n − 1)/(k + l − 1).

By simple algebraic manipulation, it is seen to be equal to

    [(m − k)l(l − 1) + (n − l)k(k − 1)] / [kl(k + l − 1)].

Since 1 ≤ k ≤ m and 1 ≤ l ≤ n, this expression is greater than or equal to 0. Hence,

    m/k + n/l − 1 ≥ (m + n − 1)/(k + l − 1).

Note that

    a/b ≥ ⌈a/b⌉ − (b − 1)/b

for any positive integers a and b. Hence,

    m/k + n/l − 1 ≥ (m + n − 1)/(k + l − 1) ≥ ⌈(m + n − 1)/(k + l − 1)⌉ − (k + l − 2)/(k + l − 1).

But

    (k + l − 2)/(k + l − 1) < 1.

Since ⌈m/k⌉ + ⌈n/l⌉ − 1 and ⌈(m + n − 1)/(k + l − 1)⌉ are integers, it follows that

    ⌈m/k⌉ + ⌈n/l⌉ − 1 ≥ ⌈(m + n − 1)/(k + l − 1)⌉

and the lemma is proved.

Theorem 3.1: Let δ^(b), T^(b), and η^(b) be the delay, time/operation, and cost/operation for a block pipelined array. Then the same array can be slice pipelined with resulting measures δ^(s), T^(s), and η^(s) in such a way that δ^(b) ≥ δ^(s), T^(b) ≥ T^(s), and η^(b) ≥ η^(s).

Proof: If the block size is l × k, then pick the slice width p to be k + l − 1. Then T_S^(b) = T_S^(s) and N^(s) ≤ N^(b) by Lemma 3.1. Hence, δ^(b) ≥ δ^(s), T^(b) ≥ T^(s), and η^(b) ≥ η^(s) by Results 2.1–2.3.

Theorem 3.1 establishes the superiority of slice pipelining an array when there is a large number of operations to be performed. If this number of operations is small, then Results 2.4 and 2.5 can be used to compare the two systems, as illustrated in the following theorem.

Theorem 3.2: Let T^(b)(M) and η^(b)(M) be the time/operation and cost/operation of a block pipelined array. Then the same array can be slice pipelined to give a system with measures T^(s)(M) and η^(s)(M) in such a way that T^(b)(M) ≥ T^(s)(M) and η^(b)(M) ≥ η^(s)(M) for all M ≥ 1.

Proof: If the block size of the original array is l × k, then pick the slice width of the slice pipelined array to be k + l − 1. Then N^(b) ≥ N^(s) as before. Also, T_S^(b) = T_S^(s), so that τ^(b) = τ^(s). Hence, T^(b)(M) ≥ T^(s)(M) for any M ≥ 1 by Result 2.4. Similarly, if η^(b) = η^(s), then η^(b)(M) ≥ η^(s)(M) for any M ≥ 1. Hence, assume that η^(b) > η^(s). Then η^(b)(M) ≥ η^(s)(M) iff

    M ≥ [η^(s)(N^(s) − 1) − η^(b)(N^(b) − 1)] / (η^(b) − η^(s)).

But the right side of this inequality is less than or equal to zero, so that η^(b)(M) ≥ η^(s)(M) for all M ≥ 1.

The previous results compare slice and block pipelining and show that slice pipelining is always at least as effective. However, it may be that a pipelined array is not more effective than the same nonpipelined array. This can happen if there is a small number of operations to be performed, resulting in an increased average time/operation. It can also happen if the cost of the added registers is large relative to the cost of the cells, resulting in an increased cost/operation. In general, the only way to determine whether pipelining will improve performance is to perform the calculations indicated by Results 2.1–2.5. However, the following result shows that under reasonable assumptions, a decrease in η can always be realized by pipelining an array.

Result 3.1: Given an m × n array with i horizontal and j vertical lines between adjacent cells, let η^(0) and T^(0) be the cost/operation and time/operation for the nonpipelined array and η^(m) and T^(m) the corresponding measures for the maximally pipelined array. If K_C ≥ K_R, T_C > T_R, n ≥ 2i, and m ≥ 2j, then η^(m) < η^(0) and T^(m) < T^(0).
[Fig. 4. Basic cell: a gated full adder with inputs x (multiplicand bit), y (multiplier bit), u (previous sum), and v (carry from an earlier cell), producing S = xy ⊕ u ⊕ v and C = xyu + xyv + uv.]

Proof: For maximal pipelining, the number of stages N^(m) is equal to m + n − 1 and T_S^(m), the delay per stage, is T_C. For the nonpipelined array, N^(0) = 1 and

    T_S^(0) = (m + n − 1)T_C.

Hence, T^(m) < T^(0) by Result 2.2. Also, using Result 2.3, it can be seen that η^(m) < η^(0) iff

    K_P/K_R = mnK_C/K_R
      > [N_R^(m)(T_S^(m) + T_R) − N_R^(0)(T_S^(0) + T_R)] / (T_S^(0) − T_S^(m))
      = [(m + n − 1)(mi + nj)(T_C + T_R) − (mi + nj)((m + n − 1)T_C + T_R)] / [(m + n − 1)T_C − T_C]
      = (mi + nj)(m + n − 2)T_R / [(m + n − 2)T_C]
      = (mi + nj)T_R/T_C.

Therefore η^(m) < η^(0) iff K_C/K_R > (i/n + j/m)T_R/T_C. Due to the hypothesis of this result, K_C/K_R ≥ 1 and (i/n + j/m)T_R/T_C < 1. Hence, η^(m) < η^(0).

Since a cell is usually more complex than a flip-flop, it is not unreasonable to expect that K_C ≥ K_R and T_C > T_R. Moreover, the number of lines between cells is frequently small relative to the number of rows or columns. Hence, Result 3.1 shows that a reduction in both time and cost per operation can be realized by pipelining most arrays of interest.

IV. PIPELINING ARRAY MULTIPLIERS

The techniques developed in Section II can also be used to evaluate the effectiveness of pipelining different algorithms that realize the same operation. To illustrate this, three algorithms for integer multiplication will be compared. All three algorithms can be naturally implemented as iterative arrays where each cell is the gated full adder shown in Fig. 4. The different algorithms are realized by using different interconnection patterns for the arrays. Each of these three patterns is uniform throughout the array, so the algorithms are well-suited for LSI implementation.
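The basic cell's behavior can be modeled directly. In the sketch below (illustrative Python), the cell is treated as a gated full adder with logic reconstructed from Fig. 4 — S = xy ⊕ u ⊕ v and C = xyu + xyv + uv — so that the partial-product bit xy is added to an incoming sum bit u and carry bit v:

```python
# Model of the basic cell of Fig. 4, assuming it is a gated full adder:
# the partial-product bit x*y is added to the incoming sum u and carry v.
# Bit values are 0/1 integers.

def cell(x, y, u, v):
    p = x & y                        # gated partial-product bit
    s = p ^ u ^ v                    # sum output S
    c = (p & u) | (p & v) | (u & v)  # carry output C (majority of p, u, v)
    return s, c

if __name__ == "__main__":
    # sanity check: p + u + v == 2*c + s for every input combination
    for x in (0, 1):
        for y in (0, 1):
            for u in (0, 1):
                for v in (0, 1):
                    s, c = cell(x, y, u, v)
                    assert (x & y) + u + v == 2 * c + s
    print("cell behaves as a gated full adder")
```

The three multipliers that follow differ only in how copies of this one cell are interconnected, which is what makes them attractive for uniform LSI layouts.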
The first algorithm is the standard shift-and-add algorithm. Its array realization is shown in Fig. 5(a). Note that the rows of this array form ripple-carry adders, and shifting of the partial products is achieved by staggering the rows. Fig. 5(b) shows the same array drawn as a rectangle and maximally pipelined. Several arrays based on this algorithm have appeared in the literature (e.g., [6], [10]).

The second multiplication algorithm is the Guild algorithm, illustrated by the array in Fig. 6. This array was developed by Guild [8], and the possibilities for pipelining the array have been investigated by Hallin and Flynn [9] and Deverell [6]. Although the array uses the same number of cells as the shift-and-add array, its maximum delay from an input terminal to an output terminal is only 2n − 1 cell delays for an n-bit multiplier. The maximum delay for the shift-and-add array is 3n − 2 cell delays.

The third algorithm uses the carry-save principle to sum the partial products. This is illustrated by the array in Fig. 7(a). Note that this array differs from the shift-and-add array in that the carry output of a cell in one row is saved and added in the next row. Also, the carry-save algorithm requires that a final addition be performed to complete the multiplication. This is usually performed in a fast carry-look-ahead adder. However, in order to maintain the uniformity of the array, this final addition will be realized by a second copy of the same array used to sum the partial products. This is illustrated in Fig. 7(b), where the carry-save array has been redrawn as a rectangle and maximally pipelined. Examples of other arrays using the carry-save principle can be found in [2] and [7].

[Fig. 5. (a) Shift-and-add multiplier. (b) Pipelined shift-and-add multiplier.]

[Fig. 6. Guild array.]

In order to compare the three algorithms it will be assumed that they are all maximally pipelined as shown in Figs. 5(b), 6, and 7(b). Recall that this means that the stage delay T_S is equal to T_C. Hence, τ = T_C + T_R for all three arrays. Since all three arrays have the same time/operation τ, the cost/operation η(M) will be used to analyze the relative effectiveness of the arrays.

To compute η(M), the number of cells, flip-flops, and stages must be counted for each of the three arrays. Expressions for these are given in Table I, where n is the number of bits in the multiplier and multiplicand and N_C is the number of cells.

TABLE I
COUNT OF STAGES, FLIP-FLOPS, AND CELLS

Algorithm      | N      | N_R              | N_C
Shift-And-Add  | 3n − 2 | 9n² − 7n         | n²
Guild          | 2n − 1 | 6n² − 3n − 1     | n²
Carry-Save     | 2n     | 8.5n² − 4.5n + 1 | 2n²

The following expressions for η(M) are obtained.

Shift-And-Add Array:

    η_SA(M) = (K_C·n² + K_R(9n² − 7n))(T_C + T_R)·(M + 3n − 3)/M.

Guild Array:

    η_G(M) = (K_C·n² + K_R(6n² − 3n − 1))(T_C + T_R)·(M + 2n − 2)/M.

Carry-Save Array:

    η_CS(M) = (2K_C·n² + K_R(8.5n² − 4.5n + 1))(T_C + T_R)·(M + 2n − 1)/M.
864 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-27, NO. 9, SEPTEMBER 1978
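As a check on these expressions, the three cost/operation formulas can be evaluated numerically. The following Python sketch encodes the Table I counts and the (M + s - 1)/M pipeline-fill factor; the K_C, K_R, T_C, T_R values are illustrative assumptions, not figures from the paper:

```python
def eta(M, Kc, Kr, Tc, Tr, cells, flipflops, stages):
    """Average cost/operation of a maximally pipelined array:
    hardware cost times clock period, spread over M operations,
    including the (stages - 1) extra cycles needed to fill the pipe."""
    return (Kc * cells + Kr * flipflops) * (Tc + Tr) * (M + stages - 1) / M

# Cell, flip-flop, and stage counts are taken from Table I (n = operand bits).
def eta_sa(M, n, **k):  # shift-and-add array
    return eta(M, cells=n*n, flipflops=9*n*n - 7*n, stages=3*n - 2, **k)

def eta_g(M, n, **k):   # Guild array
    return eta(M, cells=n*n, flipflops=6*n*n - 3*n - 1, stages=2*n - 1, **k)

def eta_cs(M, n, **k):  # carry-save array
    return eta(M, cells=2*n*n, flipflops=8.5*n*n - 4.5*n + 1, stages=2*n, **k)

if __name__ == "__main__":
    params = dict(Kc=4.0, Kr=1.0, Tc=1.0, Tr=0.2)  # illustrative values only
    for name, f in (("shift-and-add", eta_sa), ("Guild", eta_g), ("carry-save", eta_cs)):
        print(f"{name:14s} eta(M=64, n=16) = {f(64, 16, **params):.1f}")
```

Evaluating these for any positive parameter values illustrates Result 4.1: the Guild array always yields the smallest cost/operation of the three.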

Fig. 7. (a) Carry-save multiplier. (b) Pipelined carry-save multiplier.

Fig. 8. Maximum M/n for η_CS(M) ≤ η_SA(M), plotted against the number of bits n for K_C/K_R = 2 and K_C/K_R = 4.

Fig. 9. Maximum K_C/K_R for η_CS(M) ≤ η_SA(M), plotted against the number of multiplier bits n, for M = 4n.

Direct comparison of these expressions gives the following result.

Result 4.1: For any positive n, M, K_C, K_R, T_C, and T_R, η_G(M) < η_SA(M) and η_G(M) < η_CS(M).

Hence, the Guild array is the most effective of the three in all cases. This is to be expected, since the Guild array has no more cells than the other two but has fewer stages and flip-flops.

To utilize LSI chip area more effectively, it may be desirable to implement one of the two rectangular arrays, either the shift-and-add or the carry-save array. Selecting the better of these two is not as easy, since the choice depends on the values of n, K_C/K_R, and M. The possibilities are indicated by the graphs in Figs. 8 and 9. For a given
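The curves in Fig. 8 can be reproduced directly from the expressions for η_SA(M) and η_CS(M): the common (T_C + T_R) factor cancels, so the break-even point is the solution of a linear inequality in M. A minimal Python sketch, normalizing K_R = 1 and using the Table I counts:

```python
import math

def hardware_costs(n, ratio):
    """Hardware costs of the shift-and-add and carry-save arrays
    (Table I counts), with Kr normalized to 1 and Kc = ratio."""
    Kc, Kr = ratio, 1.0
    C_sa = Kc * n*n + Kr * (9*n*n - 7*n)
    C_cs = 2*Kc * n*n + Kr * (8.5*n*n - 4.5*n + 1)
    return C_sa, C_cs

def max_M(n, ratio):
    """Largest M with eta_cs(M) <= eta_sa(M), or None if the carry-save
    array is never more expensive.  With s_sa = 3n - 2 and s_cs = 2n stages,
    eta_cs(M) <= eta_sa(M)  <=>  M*(C_cs - C_sa) <= C_sa*(3n - 3) - C_cs*(2n - 1)."""
    C_sa, C_cs = hardware_costs(n, ratio)
    num = C_sa * (3*n - 3) - C_cs * (2*n - 1)
    den = C_cs - C_sa
    if den <= 0:
        return None  # carry-save is cheaper per operation for every M
    return max(math.floor(num / den), 0)

if __name__ == "__main__":
    for ratio in (2, 4):  # the Kc/Kr values plotted in Fig. 8
        pts = {n: max_M(n, ratio) for n in (8, 16, 32)}
        print(f"Kc/Kr = {ratio}: {pts}")
```

Consistent with Fig. 8, the admissible M shrinks as K_C/K_R grows: the carry-save array buys its shorter pipeline with extra cells, so a higher relative cell cost narrows the range of M over which it beats the shift-and-add array.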

K_C/K_R and n, the graphs in Fig. 8 give the maximum M for η_CS(M) to be less than or equal to η_SA(M). Similarly, if M and n are given, then the graphs in Fig. 9 give the maximum K_C/K_R for η_CS(M) to be less than or equal to η_SA(M).

The shape of these curves can be explained by observing that, for n ≥ 8, the carry-save array has more cells but fewer flip-flops and stages than the shift-and-add array. Thus, as K_C/K_R increases for fixed M and n, the cost of the carry-save array increases relative to that of the shift-and-add array. The carry-save array has less delay than the shift-and-add array, which tends to decrease its average cost/operation relative to the shift-and-add array. However, as M is increased for fixed K_C/K_R and n, the effect of the delay on η(M) is minimized.

Note that the cells surrounded by the dashed line in the carry-save array in Fig. 7(b) are not needed. They were included to keep the array completely uniform. If they were removed from the array, the cost/operation would be given by the following expression:

η_CS(M) = (K_C (1.5n^2 + 0.5n) + K_R (7n^2 - 2n))(T_C + T_R)((M + 2n - 1)/M)

This is still greater than η_G(M) for all positive M, n, K_C, K_R, T_C, and T_R. If this expression were used, the curves in Figs. 8 and 9 would be shifted up, but their general shape would not be changed.

V. SUMMARY

This paper has presented quantitative techniques for the analysis of pipelining in digital systems, using the three measures of effectiveness: delay, time/operation, and cost/operation. Of the three, cost/operation is felt to be the most important, since it takes the other two into account. It was shown that frequently many different ways exist to pipeline an algorithm and that the three measures can be used to make a comparative analysis for the best choice. This analysis was demonstrated by showing that slice pipelining an array is always better than block pipelining. Finally, it was shown how the relative effectiveness of pipelining different algorithms can be evaluated, by comparing three algorithms for integer multiplication.

Two features of the analysis methods presented in this paper should be noted. First, in the comparison of two pipelined systems, it is the ratios K_C/K_R and T_C/T_R, rather than the absolute values of the parameters, that are important. Thus, it is not always necessary to know the actual costs or times in order to reach some general conclusions about the relative effectiveness of different pipelined systems. This was illustrated in Section III, where it was shown that slice pipelining is never worse than block pipelining.

Second, the analysis is not based on the assumption that M, the number of operations to be performed, approaches infinity. The effectiveness of a system for small M may be quite different from what it is for large M. Indeed, the comparison of the shift-and-add and carry-save algorithms in Section IV showed that the carry-save algorithm was better for most M of reasonable size. However, an analysis assuming that M approaches infinity would lead to the opposite conclusion.

REFERENCES

[1] S. F. Anderson, J. G. Earle, R. E. Goldschmidt, and D. M. Powers, "The IBM System/360 Model 91 floating point execution unit," IBM J. Res. Develop., vol. 11, pp. 34-53, Jan. 1967.
[2] D. P. Burton and D. R. Noaks, "High speed iterative multiplier," Electron. Lett., vol. 4, p. 262, June 28, 1968.
[3] T. C. Chen, "Parallelism, pipelining, and computer efficiency," Comput. Design, vol. 10, pp. 69-74, Jan. 1971.
[4] L. W. Cotten, "Maximum-rate pipeline systems," in Proc. 1969 Spring Joint Comput. Conf., pp. 581-586, 1969.
[5] E. S. Davidson and A. G. Larson, "Pipelining and parallelism in cost-effective processor design," Res. Rep., Digital Systems Lab., Stanford University, Stanford, CA, 1973.
[6] J. Deverell, "Pipeline iterative arithmetic arrays," Res. Rep. R74-93, Twickenham College of Technology, Twickenham, England, 1973.
[7] R. DeMori, "Suggestions for an I.C. fast parallel multiplier," Electron. Lett., vol. 5, p. 50, Feb. 6, 1969.
[8] H. H. Guild, "Fully iterative fast array for binary multiplication and addition," Electron. Lett., vol. 5, p. 263, 1969.
[9] T. G. Hallin and M. J. Flynn, "Pipelining of arithmetic functions," IEEE Trans. Comput., vol. C-21, pp. 880-886, Aug. 1972.
[10] J. C. Hoffman, B. Lacaze, and P. Csillag, "Multiplier parallele a circuits logiques iteratifs," Electron. Lett., vol. 9, p. 178, May 3, 1968.
[11] R. C. Minnick, "A survey of microcellular research," J. Assoc. Comput. Mach., vol. 14, pp. 203-241, Apr. 1967.
[12] L. E. Shar and E. S. Davidson, "A multiminiprocessor system implemented through pipelining," Computer, vol. 7, pp. 42-51, Feb. 1974.

J. Robert Jump (S'58-M'61) was born in Kansas City, MO, on February 15, 1937. He received the B.S. and M.S. degrees in electrical engineering from the University of Cincinnati, Cincinnati, OH, in 1960 and 1962, respectively, and the M.S. and Ph.D. degrees in computer and communication sciences from the University of Michigan, Ann Arbor, in 1965 and 1968, respectively. His industrial experience includes work in digital systems design at Radiation, Inc., Orlando, FL; Avco Corporation, Cincinnati, OH; and IBM, Owego, NY. At the present time, he is an Associate Professor of Electrical Engineering at Rice University, Houston, TX.

Sudhir R. Ahuja (M'77) was born in Meerut, India, on April 25, 1950. He received the B.Tech. (Hons.) degree in electrical engineering from the Indian Institute of Technology, Bombay, in 1972, and the M.S. and Ph.D. degrees in electrical engineering from Rice University, Houston, TX, in 1974 and 1977, respectively. He is currently working in the Interactive Computer Systems Research Department at Bell Laboratories, Holmdel, NJ. His research interests are in the field of parallel and pipelined processors, large memory systems, and multiprocessor systems. Dr. Ahuja is a member of Sigma Xi.
