Professional Documents
Culture Documents
Multiplication
Parts I. Number Representation
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28.
Chapters
Numbers and Arithmetic Representing Signed Numbers Redundant Number Systems Residue Number Systems Basic Addition and Counting Carry-Lookahead Adders Variations in Fast Adders Multioperand Addition Basic Multiplication Schemes High-Radix Multipliers Tree and Array Multipliers Variations in Multipliers Basic Division Schemes High-Radix Dividers Variations in Dividers Division by Convergence Floating-Point Reperesentations Floating-Point Operations Errors and Error Control Precise and Certifiable Arithmetic Square-Rooting Methods The CORDIC Algorithms Variations in Function Evaluation Arithmetic by Table Lookup High-Throughput Arithmetic Low-Power Arithmetic Fault-Tolerant Arithmetic Past, Present, and Future
Elementary Operations
III. Multiplication
IV. Division
V. Real Arithmetic
May 2007
Slide 1
Revised
Sep. 2001
Revised
Sep. 2003
Revised
Oct. 2005
Revised
May 2007
May 2007
Slide 2
III Multiplication
Review multiplication schemes and various speedup methods Multiplication is heavily used (in arith & array indexing) Division = reciprocation + multiplication Multiplication speedup: high-radix, tree, . . . Bit-serial, modular, and array multipliers
Well, well, for a rabbit, youre not very good at multiplying, are you?
May 2007 Computer Arithmetic, Multiplication Slide 4
May 2007
Slide 6
p2k1p2k2
a x x 0 a 20 x 1 a 21 x 2 a 22 x 3 a 23 p
Fig. 9.1 Multiplication of two 4-bit unsigned binary numbers in dot notation.
May 2007 Computer Arithmetic, Multiplication Slide 7
Multiplication Recurrence
Fig. 9.1
a x x 0 a 20 x 1 a 21 x 2 a 22 x 3 a 23 p
Preferred
Multiplication with right shifts: top-to-bottom accumulation p(j+1) = (p(j) + xj a 2k) 21 |add| |shift right| with p(0) = 0 and p(k) = p = ax + p(0)2k
Multiplication with left shifts: bottom-to-top accumulation p(j+1) = 2 p(j) + xkj1a |shift| |add|
May 2007
with
Left-shift algorithm ======================= a 1 0 1 0 x 1 0 1 1 ======================= p(0) 0 0 0 0 (0) 2p 0 0 0 0 0 +x3a 1 0 1 0 p(1) 0 1 0 1 0 (1) 2p 0 1 0 1 0 0 +x2a 0 0 0 0 p(2) 0 1 0 1 0 0 2p(2) 0 1 0 1 0 0 0 +x1a 1 0 1 0 p(3) 0 1 1 0 0 1 0 (3) 2p 0 1 1 0 0 1 0 0 +x0a 1 0 1 0 p(4) 0 1 1 0 1 1 1 0 =======================
Fig. 9.2 Examples of sequential multiplication with right and left shifts.
Slide 9
May 2007
Slide 11
Mux
k
xj
xj a cout
Adder
k
Fig. 9.4 Hardware realization of the sequential multiplication algorithm with additions and right shifts.
May 2007 Computer Arithmetic, Multiplication Slide 12
Fig. 9.5 Combining the loading and shifting of the double-width register holding the partial product and the partially used multiplier.
May 2007 Computer Arithmetic, Multiplication Slide 13
(j)
Multiplicand a 0
0
Mux
k
Fig. 9.6 Hardware realization of the sequential multiplication algorithm with left shifts and additions.
May 2007 Computer Arithmetic, Multiplication Slide 14
============================ a 1 0 1 1 0 x 0 1 0 1 1 ============================ p(0) 0 0 0 0 0 1 0 1 1 0 +x0a 2p(1) 1 1 0 1 1 0 p(1) 1 1 0 1 1 0 1 0 1 1 0 +x1a 2p(2) 1 1 0 0 0 1 0 (2) p 1 1 0 0 0 1 0 +x2a 0 0 0 0 0 2p(3) 1 1 1 0 0 0 1 0 (3) p 1 1 1 0 0 0 1 0 +x3a 1 0 1 1 0 2p(4) 1 1 0 0 1 0 0 1 0 p(4) 1 1 0 0 1 0 0 1 0 +x4a 0 0 0 0 0 2p(5) 1 1 1 0 0 1 0 0 1 0 (5) p 1 1 1 0 0 1 0 0 1 0 ============================
Slide 15
============================ a 1 0 1 1 0 x 1 0 1 0 1 ============================ p(0) 0 0 0 0 0 1 0 1 1 0 +x0a 2p(1) 1 1 0 1 1 0 p(1) 1 1 0 1 1 0 0 0 0 0 0 +x1a 2p(2) 1 1 1 0 1 1 0 (2) p 1 1 1 0 1 1 0 +x2a 1 0 1 1 0 2p(3) 1 1 0 0 1 1 1 0 (3) p 1 1 0 0 1 1 1 0 +x3a 0 0 0 0 0 2p(4) 1 1 1 0 0 1 1 1 0 p(4) 1 1 1 0 0 1 1 1 0 +(x4a) 0 1 0 1 0 0 0 0 1 1 0 1 1 1 0 2p(5) (5) p 0 0 0 1 1 0 1 1 1 0 ============================
Slide 16
May 2007
Booths Recoding
Table 9.1 Radix-2 Booths recoding Explanation xi xi1 yi 0 0 0 No string of 1s in sight 0 1 1 End of string of 1s in x 1 Beginning of string of 1s in x 1 0 1 1 0 Continuation of string of 1s in x Example 1 0 0 1 1 1 0 1 (1) 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 0 1 0 Operand x Recoded version y
============================ a 1 0 1 1 0 x 1 0 1 0 1 Multiplier 1 1 1 1 1 y Booth-recoded ============================ p(0) 0 0 0 0 0 0 1 0 1 0 +y0a 2p(1) 0 0 1 0 1 0 (1) p 0 0 1 0 1 0 1 0 1 1 0 +y1a 2p(2) 1 1 1 0 1 1 0 (2) p 1 1 1 0 1 1 0 +y2a 0 1 0 1 0 2p(3) 0 0 0 1 1 1 1 0 p(3) 0 0 0 1 1 1 1 0 +y3a 1 0 1 1 0 2p(4) 1 1 1 0 0 1 1 1 0 (4) p 1 1 1 0 0 1 1 1 0 y4a 0 1 0 1 0 0 0 0 1 1 0 1 1 1 0 2p(5) p(5) 0 0 0 1 1 0 1 1 1 0 ============================
Slide 18
Row i
Software aspects:
Optimizing compilers replace multiplications by shifts/adds/subs Produce efficient code using as few registers as possible Find the best code by a time/space-efficient algorithm
Hardware aspects:
Synthesize special-purpose units such as filters y[t] = a0x[t] + a1x[t 1] + a2x[t 2] + b1y[t 1] + b2y[t 2]
May 2007 Computer Arithmetic, Multiplication Slide 19
Shift, add
Shift
Ri: Register that contains i times (R1) This notation is for clarity; only one register other than R1 is needed
May 2007
Slide 20
Shift, add
Shift
Shift, subtract
R3 shift-left 3 R1 R7 shift-left 4 + R1
May 2007
Slide 21
R3 shift-left 3 R1 R7 shift-left 4 + R7
May 2007
Slide 22
10 High-Radix Multipliers
Chapter Goals Study techniques that allow us to handle more than one multiplier bit in each cycle (two bits in radix 4, three in radix 8, . . .) Chapter Highlights High radix gives rise to difficult multiples Recoding (change of digit-set) as remedy Carry-save addition reduces cycle time Implementation and optimization methods
May 2007
Slide 24
May 2007
Slide 25
a x
r0 x 0 a 20 0 x1 a 21 r1 1 r2 x 2 a 22 2 r3 x 3 a 23
3
Preferred
Multiplication with right shifts in radix r : top-to-bottom accumulation p(j+1) = (p(j) + xj a r k) r 1 |add| |shift right| with p(0) = 0 and p(k) = p = ax + p(0)r k
Multiplication with left shifts in radix r : bottom-to-top accumulation p(j+1) = r p(j) + xkj1a |shift| |add|
May 2007
with
a x x 0 a 20 x 1 a 21 x 2 a 22 x 3 a 23 p
Fig. 9.1
Number of cycles is halved, but now the difficult multiple 3a must be dealt with
Product
Slide 27
May 2007
Multiplier 3a 0 a 2a Mux
2-bit shifts
00 01 10 11
x i+1
xi
k + 1 cycles, rather than k One extra cycle not too bad, but we would like to avoid it if possible Solving this problem for radix 4 may also help when dealing with even higher radices
To the adder
Fig. 10.2 The multiple generation part of a radix-4 multiplier with precomputation of 3a.
May 2007
Slide 28
a x (x x ) 1 0 (x 3x 2) p
Slide 29
Fig. 10.4 The multiple generation part of a radix-4 multiplier based on replacing 3a with 4a (carry into next higher radix-4 multiplier digit) and a.
xi Set if c) xi+1(xi x i+1 = x i = 1 or if x i+1 = c = 1
Mux control ---------------0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 0 Set carry -----------0 0 0 0 0 1 1 1
Slide 30
a 2a a
10 11
00 01
x i+1
May 2007
1 0 1 0 1 1 1 0 1 1 1 1 0 0 1 0
1 1
May 2007
a x (x x ) 1 0 (x 3x 2) p
Fig. 10.5 Example of radix-4 multiplication with modified Booths recoding of the 2scomplement multiplier.
Slide 32
Multiplicand
non0
a
1
2a
Enable 0 Select
Mux 0, a, or 2a
k+1
Add/subtract control
Fig. 10.6 The multiple generation part of a radix-4 multiplier based on Booths recoding.
May 2007 Computer Arithmetic, Multiplication Slide 33
0
Mux
2a 0 a
Mux
x i+1
xi
CSA
Adder
New Cumulative Partial Product
Fig. 10.7 Radix-4 multiplication with a carry-save adder used to combine the cumulative partial product, xia, and 2xi+1a into two numbers.
May 2007 Computer Arithmetic, Multiplication Slide 34
New PP
Fig. 10.8 Radix-2 multiplication with the upper half of the cumulative partial product in stored-carry form.
May 2007
i/2
Fig. 10.9 Radix-4 multiplication with a carry-save adder used to combine the stored-carry cumulative partial product and zi/2a into two numbers.
May 2007 Computer Arithmetic, Multiplication Slide 36
Fig. 10.10 Booth recoding and multiple selection logic for high-radix or parallel multiplication.
0, a, or 2a
Selective Complement
k+2
0, a, a, 2a, or 2a
0
Mux
2a 0 a
Mux
x i+1
xi
CSA CSA
New Cumulative Partial Product
Adder
FF
2-Bit Adder
8a 0
Mux
4a 0
Mux
x i+3 2a 0
Mux
4-Bit Shift
x i+2 a x i+1 xi
CSA
CSA
Fig. 10.12 Radix-16 multiplication with the upper half of the cumulative partial product in carry-save form.
May 2007
CSA
Multiplier
Remove this mux & CSA and replace the 4-bit shift (adder) with a 3-bit shift (adder) to get a radix-8 multiplier (cycle time will remain the same, though)
0
Mux
4a 0
Mux
x i+3 2a 0
Mux
4-Bit Shift
x i+2 a x i+1 xi
CSA
CSA
A radix-16 multiplier design becomes a radix-256 multiplier if radix-4 Booths recoding is applied first (the muxes are replaced by Booth recoding and multiple selection logic)
May 2007
CSA
Fig. 10.12
FF
All multiples
...
Adder
Partial product
Partial product
Speed up
Fig. 10.13 High-radix multipliers as intermediate between sequential radix-2 and full-tree multipliers.
May 2007 Computer Arithmetic, Multiplication Slide 41
PH1
State latches
State flip-flops
CLK
State latches
3a
3a
This radix-64 multiplier runs at the clock rate of a radix-8 design (2X speed)
Node 3
Beat-1 Input
Adder
Node 2
Beat-3 Input
Any VLSI circuit computing the product of two k-bit integers must satisfy the following constraints: AT grows at least as fast as k3/2 AT2 is at least proportional to k2 The preceding radix-2b implementations are suboptimal, because: AT = AT2 =
May 2007
AT 2 = O((k3/b) log2b)
AT- or AT 2Optimal
O(k 2) O(k 3)
Intermediate designs do not yield better AT or AT 2 values; The multipliers remain asymptotically suboptimal for any b By the AT measure (indicator of cost-effectiveness), slower radix-2 multipliers are better than high-radix or tree multipliers Thus, when an application requires many independent multiplications, it is more cost-effective to use a large number of slower multipliers High-radix multiplier latency can be reduced from O((k/b) log b + log k) to O(k/b + log k) through more effective pipelining (Chapter 11)
May 2007 Computer Arithmetic, Multiplication Slide 45
May 2007
Slide 47
a
All multiples
...
Multiplier ...
Adder
Partial product
MultipleForming Circuits
a a a
Partial product
. . .
Adder Basic binary High-radix or partial tree Adder Full Economize tree
Speed up
Fig. 10.13 High-radix multipliers as intermediate between sequential radix-2 and full-tree multipliers.
Redundant result Redundant-to-Binary Converter Higher-order product bits Some lower-order product bits are generated directly
Fig. 11.1
May 2007
...
Large tree of carry-save adders Logdepth
...
Small tree of carry-save adders
Adder Product
Logdepth
Adder Product
. . .
Partial-Products Reduction Tree
Redundant result
Fig. 11.1
3. Redundant-to-binary converter
Redundant-to-Binary Converter Higher-order product bits Some lower-order product bits are generated directly
Slide 50
May 2007
Dadda Tree (4 FAs + 2 HAs + 6-Bit Adder) 3 4 3 2 1 FA FA -------------------1 3 2 2 3 2 1 FA HA HA FA ---------------------2 2 2 2 1 2 1 6-Bit Adder ---------------------1 1 1 1 1 1 1 1 1 2
[0, 6] [1, 6]
[1, 7]
[2, 8]
[3, 9]
[4, 10]
[5, 11]
[6, 12]
7-bit CSA
[2, 8] [1,8]
7-bit CSA
[5, 11] [3, 11]
7-bit CSA
[6, 12] [2, 8] [3, 12]
CSA trees are quite irregular, causing some difficulties in VLSI realization Thus, our motivation to examine alternate methods for partial products reduction
May 2007
7-bit CSA
[3,9] [2,12] [3,12] The index pair [i, j] means that bit positions from i up to j are involved.
10-bit CSA
[3,12] [4,13] [4,12]
10-bit CPA
Ignore [4, 13] 3 2 1 0
Slide 52
FA
FA
FA
FA
FA Level-2 carries
FA
FA
FA
FA Level-3 carries
FA
FA
FA
FA
Slide 53
May 2007
4-to-2
4-to-2
4-to-2
4-to-2
4-to-2
4-to-2
4-to-2
Fig. 11.5 Tree multiplier with a more regular structure based on 4-to-2 reduction modules. Due to its recursive structure, a binary tree is more regular than a 3-to-2 reduction tree when laid out in VLSI
May 2007 Computer Arithmetic, Multiplication Slide 54
Even if 4-to-2 reduction is implemented using two CSA levels, design regularity potentially makes up for the larger number of logic levels Similarly, using Booths recoding may not yield any advantage, because it introduces irregularity
M u l t i p l i c a n d
...
Redundant-to-binary converter
Fig. 11.6 Layout of a partial-products reduction tree composed of 4-to-2 reduction modules. Each solid arrow represents two numbers.
May 2007 Computer Arithmetic, Multiplication
Magnitude positions --------xk2 yk2 zk2 xk3 yk3 zk3 xk4 yk4 zk4 ... ... ...
x x x x x x x x x x x x x x x x x x x x x x x x
Five redundant copies removed FA FA FA FA FA FA
Fig. 11.7 Sharing of full adders to reduce the CSA width in a signed tree multiplier.
Computer Arithmetic, Multiplication Slide 56
May 2007
a. Unsigned
a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 ---------------------------a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 a4 x4 a3 x4 a2 x4 a1 x4 a0 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0
b. 2's-complement
c. Baugh-Wooley
a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 ---------------------------_ _ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 _ _ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 _ x3 a1 x3 a0 x3 a2 _ _ _ a4 x4 a3 x4 a2 x4 a1 x4 a0 x4 a4 a4 1 x4 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0
d. Modified B-W
__
__
a4 a3 a2 a1 a0 x x x x x 4 3 2 1 0 --------------------------__ a x a x a x a x a x 4 0 3 0 2 0 1 0 0 0 a x a x a x a x 3 x1 a2 x1 a1 x1 0 1 a 2 2 1 2 0 2 a x __3 a0 x3 1 a x 0 4
Slide 57
a4 a3 a2 a1 a0 c. Baugh-Wooley x4 x3 x2 x1 x0 Fig. 11.8 ---------------------------_ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 _ a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 a4x0 = a4(1 x0) a4 _ _ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 = a4x0 a4 a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 _ _ _ _ a x a x a x a x a x 4 4 3 4 2 4 1 4 0 4 a4 a4x0 a4 a4 1 x4 x4 a4 -------------------------------------------------------In next column p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0
d. Modified B-W
a4x0 = (1 a4x0) 1 = (a4x0) 1 1 (a4x0) 1
__
__
In next column
May 2007
a x a x a4 x1 __ 4 2 3 2 a x __3 a3 x3 a2 x3 __ __ 4 a x a x a x a x 4 4 3 4 2 4 1 4 1 1 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0
Computer Arithmetic, Multiplication Slide 58
a a a a a 4 3 2 1 0 x x x x x 4 3 2 1 0 --------------------------__ a x a x a x a x a x 4 0 3 0 2 0 1 0 0 0 a x a x a x a x 0 1 a3 x1 a2 x1 a1 x1 2 2 1 2 0 2 a x __3 a0 x3 1 a x 0 4
a. Unsigned
a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 ---------------------------a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 a4 x4 a3 x4 a2 x4 a1 x4 a0 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0
b. 2's-complement
c. Baugh-Wooley
a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 ---------------------------_ _ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 _ _ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 _ x3 a1 x3 a0 x3 a2 _ _ _ a4 x4 a3 x4 a2 x4 a1 x4 a0 x4 a4 a4 1 x4 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0
d. Modified B-W
__
__
a4 a3 a2 a1 a0 x x x x x 4 3 2 1 0 --------------------------__ a x a x a x a x a x 4 0 3 0 2 0 1 0 0 0 a x a x a x a x 3 x1 a2 x1 a1 x1 0 1 a 2 2 1 2 0 2 a x __3 a0 x3 1 a x 0 4
Slide 59
High-radix versus partial-tree multipliers: The difference is quantitative, not qualitative For small h, say 8 bits, we view the multiplier of Fig. 11.9 as high-radix When h is a significant fraction of k, say k/2 or k/4, then we tend to view it as a partial-tree multiplier
Sum Carry
...
CSA Tree
. o o o o o o o o k-by-k fractional . o o o o o o o o multiplication --------------------------------. o o o o o o o|o Max error = 8/2 + 7/4 . o o o o o o|o o + 6/8 + 5/16 + 4/32 . o o o o o|o o o + 3/64 + 2/128 . o o o o|o o o o + 1/256 = 7.004 ulp . o o o|o o o o o . o o|o o o o o o . o|o o o o o o o Mean error = . |o o o o o o o o 1.751 ulp --------------------------------. o o o o o o o o|o o o o o o o o
Removing the dots at the right does not lead to much loss of precision.
May 2007 Computer Arithmetic, Multiplication Slide 61
ulp
Variable compensation
. . . . . . . . o o o o o o o o o o o o o o o o o| o o| o o| o o| o o| o o| x-1o| y-1 |
Constant and variable error compensation for truncated multipliers. Max error = +4 ulp Max error 3 ulp Mean error = ? ulp
May 2007
x1 a
x0 a CSA
0
a3 x1 a4 x1 a3 x2 a4 x2
a4 x0 0 a3 x0 0 a 2 x0 0 a 1 x0 0 a0 x0 p0 a2 x1 a1 x1 a0 x1 p1 a2 x2 a1 x2 a0 x 2 p2 a2 x3 a1 x3 a 0 x3 p3 a2 x4 a1 x4 a 0 x4 p4
a4 x3 a3 x4 a4 x4
0 p9 p8 p7 p6 p5
Fig. 11.10 A basic array multiplier uses a one-sided CSA tree and a ripple-carry adder.
May 2007
p8
p7
p6
p5
Slide 64
p4
FA
p9
May 2007
p4 p5
Slide 65
p8
p7
p6
.. .
Mux
i i
Bi
Level i
Mux
i i+1 i i+1
Bi+1
Mux
k
Bi
Dots in row i
.. .
Mux
k
Dots in row i + 1
i + 1 Conditional bits of the final product
[k, 2k1]
k1
i+1
i1
Bi+1
All remaining bits of the final product produced only 2 gate levels after pk1
Computer Arithmetic, Multiplication Slide 66
May 2007
...
h inputs
...
CSA Tree
CSA
Latch
Sum Carry FF Adder h-Bit Adder Lower part of the cumulative partial product
Sum Carry
CSA
With latches after every FA level, the maximum throughput is achieved Latches may be inserted after every h FA levels for an intermediate design Example: 3-stage pipeline Fig. 11.17 Pipelined 5 5 array multiplier using latched FA blocks. The small shaded boxes are latches.
May 2007
Latched FA with AND gate FA FA FA FA
Latch
p9
p8
p7
p6
p5
p4
p3
p2
p1
p0
Slide 68
12 Variations in Multipliers
Chapter Goals Learn additional methods for synthesizing fast multipliers as well as other types of multipliers (bit-serial, modular, etc.) Chapter Highlights Building a multiplier from smaller units Performing multiply-add as one operation Bit-serial and (semi)systolic multipliers Using a multiplier for squaring is wasteful
May 2007 Computer Arithmetic, Multiplication Slide 69
May 2007
Slide 70
aH xH a L xL a L xH a H xL a H xH p
aL xL
Fig. 12.1 Divide-and-conquer (recursive) strategy for synthesizing a 2b 2b multiplier from b b multipliers.
May 2007 Computer Arithmetic, Multiplication Slide 71
2b 2c 2b 4c gb hc
May 2007
use b c multipliers and (3; 2)-counters use b c multipliers and (5?; 2)-counters use b c multipliers and (?; 2)-counters
Computer Arithmetic, Multiplication Slide 73
xH
[4, 7]
aL
[0, 3]
xH
[4, 7]
aH
[4, 7]
L [0, 3]
aL
[0, 3]
xL
[0, 3]
Multiply
[12,15] [8,11]
Multiply
[8,11] [4, 7]
Multiply
[8,11] 8 [4, 7]
Multiply
[4, 7] [0, 3]
Add
[4, 7]
12
Add
[8,11]
Add
[4, 7]
12
Add
[8,11]
Add
[12,15]
p [12,15]
May 2007
p[8,11]
p [4, 7]
p [0, 3]
Slide 74
y ax z
cin p
Fig. 12.4 Additive multiply module with 2 4 multiplier (ax) plus 4-bit and 2-bit additive inputs (y and z). b c AMM b-bit and c-bit multiplicative inputs b-bit and c-bit additive inputs (b + c)-bit output (2b 1) (2c 1) + (2b 1) + (2c 1) = 2b+c 1
May 2007 Computer Arithmetic, Multiplication Slide 75
a [0, 3]
[0, 1] [2, 5]
x [0, 1]
*
x [0, 1] a [0, 3]
a [4, 7]
[6, 9] [4, 5]
x [2, 3]
[2, 3] [4,7]
x [2, 3] a [0, 3]
a [4, 7]
[8, 11] [6, 7]
x [4, 5]
[4,5] [6, 9]
a [0, 3]
[8, 9] [8, 11] [6, 7]
x [6, 7]
x [4, 5] a [4, 7]
[8, 9] [10,13] [10,11]
a [4, 7]
p[4,5] p [6, 7]
p[0, 1] p[2, 3]
Fig. 12.5 An 8 8 multiplier built of 4 2 AMMs. Inputs marked with an asterisk carry 0s.
May 2007 Computer Arithmetic, Multiplication Slide 76
a [0,3]
x [0, 1]
*
x [2, 3]
*
This design is more regular than that in Fig. 12.5 and is easily expandable to larger configurations; its latency, however, is greater
x [4, 5]
*
x [6, 7] p[0, 1]
p [8, 9]
p [6, 7]
p[4,5]
p[2, 3]
Fig. 12.6 Alternate 8 8 multiplier design based on 4 2 AMMs. Inputs marked with an asterisk carry 0s.
Slide 77
s s s 2 1 0
Bit-serial multiplier
(Must follow the k-bit inputs with k 0s; alternatively, view the product as being only k bits wide)
a a a 2 0 1 x x x 2 0 1
p p p 2 0 1
What goes inside the box to make a bit-serial multiplier? Can the circuit be designed to support a high clock rate?
May 2007 Computer Arithmetic, Multiplication Slide 78
FA
Carry
FA
FA
Fig. 12.7 Semi-systolic circuit for 4 4 multiplication in 8 clock cycles. This is called semisystolic because it has a large signal fan-out of k (k-way broadcasting) and a long wire spanning all k positions
May 2007 Computer Arithmetic, Multiplication Slide 79
+d
Original delays Adjusted delays Fig. 12.8 Example of retiming by delaying the inputs to CL and advancing the outputs from CL by d units
May 2007 Computer Arithmetic, Multiplication Slide 80
a3
a0
Sum
FA
Carry
FA
FA
Fig. 12.7
a3 Multiplicand (parallel in) a1 a2 a0 x0 x1 x2 x3 Multiplier (serial in) LSB-first
Sum
FA
Carry
FA
FA
Cut 3
May 2007
Cut 2
Cut 1
a3
a0
Sum
FA
Carry
FA
FA
Fig. 12.7
a3 Multiplicand (parallel in) a1 a2 a0 x0 x1 x2 x3
FA
Carry
FA
FA
ai xi ai x(i - 1)
a(i - 1) x(i - 1) a x
ai xi
c in
ai xi
xi a(i - 1)
1 0 Mux
s out
Already output
(a) Structure of the bit-matrix p(i - 1) ai xi ai x(i - 1) xi a(i - 1) 2p (i ) obtain (i ) p (b) Reduction after each input bit
Shift right to
Fig. 12.12 The cellular structure of the bit-serial multiplier based on the cell in Fig. 12.11.
May 2007
Mod-15 CPA 4
Slide 84
...
. . . CSA Tree
Table
x4 x4 p9 p8
x4 x3 x3 x4 p7
x4 x2 x3 x3 x2 x4 p6
p2
p1
p0 Simplify
x4 x3 x4 p9 p8
x4 x2
x4 x1 x3 x2 x3 p6
x4 x0 x3 x1 p5
x3 x0 x2 x1 x2 p4
x2 x0
x1 x0 x1 p2
x0
x1x0 x1x0
p3 0 x0
p7
Fig. 12.18
May 2007
Multiply-add versus multiply-accumulate Multiply-accumulate units often have wider additive inputs
(c) (d)
May 2007
Fig. 12.19 Dot-notation representations of various methods for performing a multiply-add operation in hardware.
Slide 87