ECEN 6263 Advanced VLSI Design
Carry Save Adder Implementation

Now that we have seen that the carry save adder trees are most efficiently implemented by
putting together the (3,2) blocks, we must still address the issue of how to implement the
(3,2) block (carry save adder) efficiently. Functionally, the carry save adder is identical to
the full adder. The full adder is usually implemented with a reduced delay from Cin to
Cout because the carry chain is the critical delay path in adders. Unfortunately, there is no
single carry chain in the carry save adder trees in multipliers. Thus, it does not pay to
make the delay shorter for one input by sacrificing delay on other inputs for carry save
adders. Instead, carry save adders are normally implemented by treating the 3 inputs
equally and trying to minimize delay from each input to the outputs. We have
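$$S = A \oplus B \oplus C = A\overline{B}\,\overline{C} + \overline{A}B\overline{C} + \overline{A}\,\overline{B}C + ABC$$
$$C = AB + AC + BC$$
[Figure: gate symbols for the sum S (3-input XOR of A, B, C) and the carry C (AND terms AB, AC, BC combined by an OR)]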
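As a quick sanity check of these two equations, the following Python sketch treats the (3,2) block as a bitwise operation on integers. The function name `carry_save_add` is only illustrative and does not come from these notes. It checks that the sum and carry outputs together preserve the arithmetic total A + B + C.

```python
from itertools import product


def carry_save_add(a: int, b: int, c: int) -> tuple:
    """Bitwise (3,2) carry save adder: returns (sum word, carry word)."""
    s = a ^ b ^ c                         # S = A xor B xor C
    carry = (a & b) | (a & c) | (b & c)   # C = AB + AC + BC (majority)
    return s, carry


# Single-bit check: the two outputs encode the number of ones, A + B + C = 2C + S.
for a, b, c in product((0, 1), repeat=3):
    s, carry = carry_save_add(a, b, c)
    assert a + b + c == 2 * carry + s

# Multi-bit check: the carry word has weight 2, so it is shifted left by one bit.
s, carry = carry_save_add(13, 7, 9)
assert s + (carry << 1) == 13 + 7 + 9
```

Note that no carry propagates between bit positions here, which is why all three inputs can be treated equally.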
As we can see from the expanded version of the exclusive-or function for the sum, S, both
the uncomplemented and complemented forms are required for each input (there is a
transmission gate XOR circuit that does not require the complemented inputs, but we won’t
consider it here). If we want to avoid putting extra inverters in our carry paths to produce
the complemented inputs, the best thing to do is to have each carry save adder produce
both uncomplemented and complemented outputs, which can then be used as inputs by the
next stage of carry save adders. Due to symmetries in the logic functions for C and S,
producing $C$, $\overline{C}$, $S$ and $\overline{S}$ does not take as much circuitry as one
might think. The idea is to find common sub-functions for which we may use the same
transistors to implement parts of more than one output function.
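$$S = A \oplus B \oplus C = A \oplus (B \oplus C) = \overline{A}(B \oplus C) + A\,\overline{(B \oplus C)}$$
$$\overline{S} = \overline{A \oplus B \oplus C} = A \oplus \overline{(B \oplus C)} = A(B \oplus C) + \overline{A}\,\overline{(B \oplus C)}$$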
Carry Save Adder Implementation December 11, 2004 page 1 of 10

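$$\begin{aligned}
C &= AB + BC + AC \\
  &= AB + (A + \overline{A})BC + A(B + \overline{B})C \\
  &= AB + \overline{A}BC + A\overline{B}C \\
  &= AB + (\overline{A}B + A\overline{B})C
\end{aligned}$$
$$\begin{aligned}
\overline{C} &= \overline{AB + BC + AC} \\
  &= (\overline{A} + \overline{B})(\overline{B} + \overline{C})(\overline{A} + \overline{C}) \\
  &= \overline{A}\,\overline{B}\,\overline{A} + \overline{A}\,\overline{B}\,\overline{C} + \overline{A}\,\overline{C}\,\overline{A} + \overline{A}\,\overline{C}\,\overline{C} + \overline{B}\,\overline{B}\,\overline{A} + \overline{B}\,\overline{B}\,\overline{C} + \overline{B}\,\overline{C}\,\overline{A} + \overline{B}\,\overline{C}\,\overline{C} \\
  &= \overline{A}\,\overline{B} + \overline{A}\,\overline{C} + \overline{B}\,\overline{C} \\
  &= \overline{A}\,\overline{B} + \overline{A}(B + \overline{B})\overline{C} + (A + \overline{A})\overline{B}\,\overline{C} \\
  &= \overline{A}\,\overline{B} + \overline{A}B\overline{C} + A\overline{B}\,\overline{C} \\
  &= \overline{A}\,\overline{B} + (\overline{A}B + A\overline{B})\overline{C}
\end{aligned}$$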
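The factored forms of the sum and carry can be checked exhaustively over the eight input combinations. The Python sketch below is only a verification aid, and the helper names `inv` and `check_factorizations` are illustrative. It confirms that the sub-function A xor B appears in both the carry and its complement, gated by C in one and by the complement of C in the other, and that B xor C plays the same role for the sum.

```python
from itertools import product


def inv(x: int) -> int:
    """Logical complement of a single bit."""
    return 1 - x


def check_factorizations() -> bool:
    """Return True if every factored form matches its defining function."""
    for a, b, c in product((0, 1), repeat=3):
        shared = (inv(a) & b) | (a & inv(b))   # common part of the carry: A xor B
        carry = (a & b) | (b & c) | (a & c)
        if carry != (a & b) | (shared & c):
            return False
        if inv(carry) != (inv(a) & inv(b)) | (shared & inv(c)):
            return False

        bc = b ^ c                             # common part of the sum: B xor C
        s = a ^ b ^ c
        if s != (inv(a) & bc) | (a & inv(bc)):
            return False
        if inv(s) != (a & bc) | (inv(a) & inv(bc)):
            return False
    return True


assert check_factorizations()
```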
In both cases, we see that the functions have
1. Common sub-functions.
2. A common part that is gated by an input in one function and by its complement in the other.
These two properties allow the transistors for the common part to be shared. Consider full
CMOS gates for $f$ and $\overline{f}$ with a common part, C, which is gated by $I$.
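[Figure: full CMOS gates for $f$ and $\overline{f}$ sharing the common block C in one network and C' (the dual of C) in the complementary network; the shared blocks connect to the two outputs through transistors gated by $I$ and $\overline{I}$, and the uncommon parts remain separate]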
Here it is obvious why the gating signals $I$, $\overline{I}$ must be complementary to avoid
shorting $f$ to $\overline{f}$!

Cleverly extending this idea gives the folded transistor full CMOS design for the carry
save adder shown below [2].
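[Figure: folded transistor full CMOS carry save adder from [2], with inputs A, B, C and their complements and outputs SUM, $\overline{\text{SUM}}$, CARRY, $\overline{\text{CARRY}}$]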
Note that a double benefit of reducing area and increasing speed (by reducing transistor
loading) is obtained by using this technique.

The transistor count may be further reduced by using logic gate design styles that eliminate
the pMOS pull-up block, which is made possible when synthesizing both $f$ and $\overline{f}$.
Common blocks in $f$ and $\overline{f}$ may still be shared as above.
[Figure: CVSL gate [2] and CPL gate [1], each built from nMOS blocks for $f$ and $\overline{f}$]
In both cases, the $f$ and $\overline{f}$ blocks are synthesized with nMOSFETs only (no pMOSFETs).
CVSL eliminates having to duplicate $f$ and $\overline{f}$ with pMOSFETs by using cross-coupled
pMOSFETs, which force $f$ and $\overline{f}$ to opposite values. The problem is that the cross-couple
is slow, as we saw last semester. Consider switching $f$ from high to low. At the beginning
of the switching transient the pMOS cross-couple has not yet switched, so we have
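[Figure: CVSL gate at the start of the switching transient, with the nMOS pull-down on $f$ turned on while its pMOS pull-up is still on, and a sketch of $f$ versus t showing when the pull-up turns off]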
The nMOSFET that just turned on must fight the pMOSFET that is still turned on to bring
the $f$ output low enough to turn on the other pMOSFET, which then causes the first pMOSFET
to turn off. This can take a considerable amount of time, so typical CVSL gates are not
much faster than full CMOS gates even though the input gate load is 1/2 that of full CMOS.
The Complementary Pass Logic (CPL) method overcomes the speed problem by using
inverters as level detectors for the two nMOS pass transistor blocks. There is no cross-couple
circuit and no fighting of logic levels. However, an nMOS pass circuit is notoriously
slow at passing high logic levels. This can be compensated by adjusting the inverter
crossover voltage, $V_{inv}$, to a lower than usual value, as discussed for partial swing logic
last semester. In fact, CPL is just non-full swing pass transistor logic where both a logic
function and its complement are implemented simultaneously. This is very useful for
arithmetic circuits such as multipliers and adders. CPL gates, as originally presented in
[1], can be improved somewhat. The AND/NAND gate should be changed as follows.
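[Figure: original AND/NAND (pass inputs $A$, $B$, $\overline{B}$, $\overline{A}$) and improved AND/NAND (pass inputs $A$, 0, 1, $\overline{A}$); in both, the pass FETs are gated by $B$ and $\overline{B}$ and the outputs are $AB$ and $\overline{AB}$]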
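Reading the two gates as 2:1 pass-transistor multiplexers controlled by B makes the change easier to follow. The Python sketch below is a logic-level model only and no electrical behavior is represented. It assumes, based on the input labels in the figure, that the improved gate replaces the B and complemented-B pass inputs with constant 0 and 1. The function names are hypothetical.

```python
from itertools import product


def mux(sel: int, when_one: int, when_zero: int) -> int:
    """Two nMOS pass FETs: one gated by sel, the other by its complement."""
    return when_one if sel else when_zero


def cpl_and_nand_original(a: int, b: int) -> tuple:
    """Original form: B drives both FET gates and a pass input."""
    and_out = mux(b, a, b)            # passes A when B=1, passes B (=0) when B=0
    nand_out = mux(b, 1 - a, 1 - b)   # passes A-bar when B=1, passes B-bar (=1) when B=0
    return and_out, nand_out


def cpl_and_nand_improved(a: int, b: int) -> tuple:
    """Improved form (assumed): B drives FET gates only; constants are passed."""
    and_out = mux(b, a, 0)            # passes A when B=1, passes ground when B=0
    nand_out = mux(b, 1 - a, 1)       # passes A-bar when B=1, passes the supply when B=0
    return and_out, nand_out


# Both forms compute the same logic; only the loading on B differs.
for a, b in product((0, 1), repeat=2):
    expected = (a & b, 1 - (a & b))
    assert cpl_and_nand_original(a, b) == expected
    assert cpl_and_nand_improved(a, b) == expected
```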
The revised form has a much smaller load on the B input, and is much faster. As usual, the
inverters do not need to be included in every gate; they are inserted where needed to prevent
$n^2$ delay through $n$ transistors in series. For example, two 2-input XOR gates can be
cascaded to make a 3-input XOR gate and an inverter need not be inserted between the
two XOR gates.
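The quadratic growth can be illustrated with the Elmore delay of a uniform RC ladder, a common first-order model for series pass transistors. The sketch below assumes each pass FET contributes a resistance R and each internal node a capacitance C, which gives a delay of RC n(n+1)/2. The unit values and the buffer delay `t_buf` are hypothetical numbers chosen only to show the trend.

```python
def elmore_chain_delay(n: int, r: float = 1.0, c: float = 1.0) -> float:
    """Elmore delay to the far end of n identical series RC sections."""
    # The capacitance at node k (1-based) is charged through k resistors.
    return sum(k * r * c for k in range(1, n + 1))


def buffered_delay(n: int, segment: int, t_buf: float = 1.5) -> float:
    """Delay of n series FETs with a buffer after every `segment` FETs."""
    full, rest = divmod(n, segment)
    return full * (elmore_chain_delay(segment) + t_buf) + elmore_chain_delay(rest)


# Eight FETs in series: 36 RC units unbuffered, 18 when buffered every two FETs.
assert elmore_chain_delay(8) == 36.0
assert buffered_delay(8, 2) == 18.0

# Two FETs in series: buffering after each FET costs more than it saves.
assert elmore_chain_delay(2) == 3.0
assert buffered_delay(2, 1) == 5.0
```

Under these assumed numbers a long chain benefits from inserted inverters, while a chain of only two pass FETs, as in the cascaded XOR, does not.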
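[Figure: two cascaded 2-input XOR gates forming a 3-input XOR of A, B, C, and the equivalent CPL pass network with outputs $S$ and $\overline{S}$]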
The three-input XOR gate can be used to produce the sum output for the carry save adder.
The CPL three-input XOR gate has the same number of transistors as the folded CVSL
three-input XOR gate [2]. The structure of the circuits is almost the same, which can be
seen by redrawing the CPL gate upside down and explicitly putting in the transistors in the
inverters.
[Figure: Sum Circuits. CPL and folded CVSL versions, each with inputs A, B, C and their complements and outputs $S$, $\overline{S}$]
The CPL gate has the advantage of being faster than the CVSL gate by about a factor of 2.
The CPL gate has the disadvantage of dissipating DC power, whereas the CVSL gate does
not.
The carry output must also be implemented. The CPL carry circuit in [1] has 12 pass
FETs in it. We can improve the speed of this circuit by putting in explicit connections to
power and ground (1’s and 0’s) as we did earlier for the AND/NAND gate.

[Figure: Carry Circuits. CPL and folded CVSL versions, each with inputs A, B, C and their complements and outputs $C_{out}$, $\overline{C_{out}}$]
It is interesting to note that the folded CVSL carry circuit from [2], which has only 6 pass
FETs in it, cannot be made into a CPL circuit. When A = B = 1 in the CVSL circuit, a
parallel combination of pass FETs controlled by $C$ and $\overline{C}$ gives a valid logic 0,
but in CPL it does not.
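[Figure: the parallel pass FETs controlled by $C$ and $\overline{C}$ with A = B = 1, giving a valid 0 in the CVSL circuit and an invalid level in the CPL circuit]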
The above circuits are optimized implementations for the (3,2) carry save adder building
block cell. It is also possible to optimize other building block cells, for example the (4,2)
compressor. The (4,2) compressor has 4 explicit inputs plus one hidden carry for a total of
5 inputs. The sum bit output of the (4,2) compressor is the exclusive or of all 5 inputs. If
the (4,2) compressor is made from two (3,2) blocks, then the 5-input XOR gets implemented
by four 2-input XOR gates in series. A tree of XOR gates would be faster [3].
Similarly, a tree of gates can be found for the other (4,2) outputs which would be faster
than obtained from two cascaded (3,2) circuits.
[Figure: 5-input XOR from cascaded (3,2) compressors (left) and 5-input XOR as an optimized tree (right), with inputs A, B, C, D, E]
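The comparison between the cascaded and tree forms can be sketched at the logic level. The Python model below builds a (4,2) compressor from two (3,2) blocks and checks that its sum bit equals a 5-input XOR computed as a three-level tree. The function names and the particular grouping of inputs in `xor_tree` are assumptions made for illustration, not the circuit of [3].

```python
from itertools import product


def csa(a: int, b: int, c: int) -> tuple:
    """(3,2) block: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a: int, b: int, c: int, d: int, cin: int) -> tuple:
    """(4,2) compressor from two cascaded (3,2) blocks: (sum, carry, cout)."""
    s1, cout = csa(a, b, c)        # cout is the hidden carry to the next column
    s, carry = csa(s1, d, cin)     # the sum path passes through four XORs in series
    return s, carry, cout


def xor_tree(a: int, b: int, c: int, d: int, e: int) -> int:
    """5-input XOR as a tree: three XOR levels instead of four."""
    return ((a ^ b) ^ (d ^ e)) ^ c


for bits in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    # Arithmetic identity: five input bits equal sum + 2 * (carry + cout).
    assert sum(bits) == s + 2 * (carry + cout)
    # The tree produces the same sum bit with one less XOR level.
    assert s == xor_tree(*bits)
```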
The CPL gate for the XOR tree might look like the following. Note that it is necessary to
add the inverters before the internal XOR outputs can be used to control the gate of a pass
FET.
[Figure: CPL 5-input XOR tree with inputs A, B, C, D, E and their complements and outputs $S$, $\overline{S}$]

Power reduction for CPL. As with all forms of non-full swing pass transistor logic, CPL
suffers from DC power consumption because the inverter input voltage does not get high
enough to completely turn off the pFET in the inverter. The DC power consumption can
be eliminated by adding an extra pull-up pFET to the inverter inputs. The function of the
added pFET is to eventually bring the inverter input voltage high enough to completely
turn off the inverter pFET. The extra gate and drain load from the pFET does slow down
the circuit, but it is still faster than CVSL.
Another way to eliminate DC power consumption is to have a special IC fabrication process
that has been optimized for CPL [1]. The problem is that the transistor switching
threshold voltages, $V_{Tn}$ and $V_{Tp}$, are chosen to implement full swing logic in normal IC fab
processes. We want to choose $V_{Tn}$ and $V_{Tp}$ to eliminate the power consumption for
non-full swing logic.
A high voltage in the inverter input is
$$V_{IH} = V_{dd} - V_{Tn(\text{pass})}$$
where the switching threshold of the nMOS pass FET has been modified by the body
effect so that
$$V_{Tn(\text{pass})} \approx 1.5\,V_{Tn}$$
for most processes. The pFET turns off at
$$V_{poff} = V_{dd} - V_{Tp}$$
so that the high input turn off margin is
$$V_{MH} = V_{IH} - V_{poff} = V_{Tp} - V_{Tn(\text{pass})}$$
If $V_{Tn}$ and $V_{Tp}$ are chosen to be the same magnitude, which is the case for most IC
fabrication processes, then the high turn off margin is negative, which is the cause of the DC
power consumption. The low input to the inverter goes all the way to zero, so that the low
input turn off margin is
$$V_{ML} = V_{noff} - V_{IL} = V_{Tn} - 0 = V_{Tn}$$
which is the same as for full swing logic.
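A short numeric example makes the sign of the high margin concrete. The Python sketch below evaluates the margin expressions above for assumed values Vdd = 5 V and threshold magnitudes of 0.8 V. These numbers are hypothetical and not taken from the notes, and the function name is illustrative.

```python
def cpl_inverter_margins(vdd: float, vtn: float, vtp: float,
                         body_factor: float = 1.5) -> tuple:
    """Return (V_MH, V_ML) for a CPL inverter input, thresholds as magnitudes."""
    vtn_pass = body_factor * vtn   # pass FET threshold raised by body effect
    v_ih = vdd - vtn_pass          # highest level an nMOS pass FET can deliver
    v_poff = vdd - vtp             # input level at which the inverter pFET turns off
    v_mh = v_ih - v_poff           # equals vtp - vtn_pass
    v_ml = vtn                     # low input reaches 0 V
    return v_mh, v_ml


# Assumed values: Vdd = 5 V, equal threshold magnitudes of 0.8 V.
v_mh, v_ml = cpl_inverter_margins(vdd=5.0, vtn=0.8, vtp=0.8)
assert v_mh < 0     # roughly -0.4 V: the pFET never fully turns off
assert v_ml == 0.8
```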
To optimize the fabrication process for non-full swing logic, the high turn off margin,
$V_{MH}$, should be increased to a positive number. If $V_{Tn}$ is decreased to accomplish this,
then the low turn off margin, $V_{ML}$, is decreased. Increasing $V_{Tp}$ does raise $V_{MH}$ without
lowering $V_{ML}$. However, if we want to increase $V_{MH}$ to be as big as $V_{ML}$, this would
require
$$V_{Tp} = V_{Tn} + V_{Tn(\text{pass})}.$$
Such a large $V_{Tp}$ would make the p-channel devices very slow.
There is another way to make
$$V_{Tp} = V_{Tn} + V_{Tn(\text{pass})}.$$
That is to make the nMOS pass FETs differently than the regular nMOSFETs in the
inverter. A “native” nMOSFET is easy to fabricate with a threshold
$$V'_{Tn} \approx 0.$$
If the native nMOSFET is used for the pass transistors, body effect increases the threshold
to only a few tenths of a volt. Thus, it is possible to satisfy
$$V_{Tp} = V_{Tn} + V'_{Tn(\text{pass})}$$
without increasing $V_{Tp}$ very much.
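To see how much the native pass FET helps, the condition above can be evaluated for both cases. The sketch below uses assumed values: a regular threshold of 0.8 V, a regular pass threshold of 1.5 times that, and a native pass threshold of 0.3 V after body effect. All three numbers are hypothetical and chosen only to illustrate the comparison.

```python
def required_vtp(vtn: float, vtn_pass: float) -> float:
    """Magnitude of V_Tp needed so that V_MH equals V_ML = V_Tn."""
    return vtn + vtn_pass


VTN = 0.8                 # assumed regular nMOS threshold (V)
VTN_PASS = 1.5 * VTN      # regular pass FET threshold with body effect
VTN_NATIVE_PASS = 0.3     # assumed native pass FET threshold with body effect

standard = required_vtp(VTN, VTN_PASS)           # about 2.0 V
native = required_vtp(VTN, VTN_NATIVE_PASS)      # about 1.1 V
assert native < standard
```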
[1] K. Yano et al., “A 3.8-ns 16X16-b Multiplier Using Complementary Pass-Transistor
Logic,” IEEE J. Solid-State Circuits, vol. 25, pp. 388-394, Apr. 1990.
[2] P. Song and G. De Micheli, “Circuit and Architecture Trade-offs for High-Speed
Multiplication,” IEEE J. Solid-State Circuits, vol. 26, pp. 1184-1198, Sep. 1991.
[3] N. Nagamatsu et al., “A 15-ns 32X32-b CMOS Multiplier with an Improved Parallel
Structure,” IEEE J. Solid-State Circuits, vol. 25, pp. 494-497, Apr. 1990.
