CSLA Implementation Technique to Minimize the Area, Power and Delay
A Project Report
submitted in partial fulfillment of the requirements for the award of the degree of
MASTER OF TECHNOLOGY
in
VLSI & EMBEDDED SYSTEMS
by
G.BHAGYA SRI (13MK1D6805)
under the esteemed guidance of
Prof. P.BALA MURALI KRISHNA
CERTIFICATE
This is to certify that a project report entitled CSLA IMPLEMENTATION
TECHNIQUE TO MINIMIZE THE AREA, POWER AND DELAY being submitted by
GUTTIKONDA BHAGYA SRI (13MK1D6805) in partial fulfillment of the requirements for
the award of the degree of Master of Technology in VLSI & EMBEDDED SYSTEMS to
Jawaharlal Nehru Technological University, Kakinada, during the year 2013-2015 of SRI
MITTAPALLI INSTITUTE OF TECHNOLOGY FOR WOMEN, GUNTUR.
PROJECT GUIDE
Prof. P.Bala Murali Krishna
Professor
Department of ECE
SMITW
EXTERNAL EXAMINER
ACKNOWLEDGEMENT
It is an honor to express my deep sense of gratitude to our principal and my project
guide, Prof. P. Bala Murali Krishna, Department of ECE, Sri Mittapalli Institute of Technology
for Women, Guntur, for his valuable guidance, constant encouragement, and every scientific
and personal concern throughout the course of the investigation and the successful completion of this
work.
I wish to extend my sincere thanks to G. Suseelamma, Head of the Department of ECE, Sri
Mittapalli Institute of Technology for Women, Guntur, for her constant support and encouragement,
and for enabling me to do a work of this magnitude.
My sincere thanks go to the teaching and non-teaching staff members of the ECE Department of Sri
Mittapalli Institute of Technology for Women, Guntur.
Lastly, I bow to my affectionate parents for their love and blessings, which have sustained me
in completing this project work successfully.
BY
G. BHAGYA SRI
(13MK1D6805)
CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.2 Objective
3.2 Operation
3.5.5 Multiplexer
4.4 Applications
4.5 Advantages
5.1 Conclusion
REFERENCES
ABSTRACT
With the advancements in semiconductor technology, there has been an increased emphasis on
low-power design techniques over the last few decades. Reversible computing has been proposed
by several researchers as a possible alternative to address the energy dissipation problem. This
project describes the design of the Mach-Zehnder interferometer and reviews its applications in
emerging optical communication networks. The Mach-Zehnder interferometer is used to measure the
relative phase shift between two collimated beams from a coherent light source. Using this basic
principle, a number of devices have been designed; a few of these, such as optical sensors, all-optical
switches, optical add-drop multiplexers, and an implementation of the sum function, are discussed in this
project.
LIST OF FIGURES
Fig. 3.6 (a) Proposed CS adder design, where n is the input operand bit-width
Fig. 3.8 A Carry Select Adder with 1 level using n/2-bit RCA
Fig. 3.10 Proposed SQRT-CSLA for n = 16. All intermediate and output
Fig. 4.4 (a) Simulation Waveform Result of 16-bit Ripple Carry Adder
Fig. 4.7 (a) Simulation Waveform Result of 32-bit Ripple Carry Adder
LIST OF TABLES
CHAPTER 1
INTRODUCTION
This chapter introduces VLSI, the objective of the project, the existing
system, the proposed system, and the project outline.
revolution started with the introduction of the first microprocessor, the 4004, by Intel
in 1971 and the 8080 in 1974. Today many companies such as Texas Instruments,
Infineon, Alliance Semiconductors, Cadence, Synopsys, Celox Networks, Cisco,
Micron Tech, National Semiconductors, ST Microelectronics, Qualcomm, Lucent,
Mentor Graphics, Analog Devices, Intel, Philips, Motorola and many other firms
are dedicated to various fields in VLSI such as
Programmable Logic Devices, Hardware Description Languages, design tools, and
Embedded Systems. The names of the integration levels below are a holdover from an
outdated 1980s taxonomy, evidently influenced by the names of the frequency bands (HF, VHF
and UHF). Sources disagree on what is measured (gates or transistors):
SSI Small-Scale Integration (0 to 10^2)
MSI Medium-Scale Integration (10^2 to 10^3)
LSI Large-Scale Integration (10^3 to 10^5)
VLSI Very Large-Scale Integration (10^5 to 10^7)
ULSI Ultra Large-Scale Integration (>= 10^7)
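As a rough illustration of the taxonomy above, the ranges can be turned into a small lookup. This is only a sketch: the function name integration_level is hypothetical, the ranges are read as powers of ten (10^2, 10^3, 10^5, 10^7), and treating each upper bound as exclusive is an assumption, since sources disagree on the boundaries and on whether gates or transistors are counted.

```python
def integration_level(device_count: int) -> str:
    """Classify a chip by device count using the ranges listed above."""
    if device_count < 0:
        raise ValueError("device count must be non-negative")
    # Assumption: each upper bound is exclusive; the text leaves the
    # boundary cases ambiguous.
    thresholds = [
        (10**2, "SSI"),
        (10**3, "MSI"),
        (10**5, "LSI"),
        (10**7, "VLSI"),
    ]
    for upper_bound, name in thresholds:
        if device_count < upper_bound:
            return name
    return "ULSI"
```

For example, a device with about a million gates falls in the VLSI range under these assumptions.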
VLSI Technology Inc. was a company which designed and manufactured custom
and semi-custom ICs. The company was based in Silicon Valley, with headquarters
at 1109 McKay Drive in San Jose, California. Along with LSI Logic, VLSI
Technology defined the leading edge of the application-specific integrated circuit
(ASIC) business, which accelerated the push of powerful embedded systems into
affordable products. The company was founded in 1979 by a trio from Fairchild
Semiconductor by way of Synertek (Jack Balletto, Dan Floyd, and Gunnar Wetlesen)
and by Doug Fairbairn of Xerox PARC and Lambda (later VLSI Design) magazine.
Alfred J. Stein became the CEO of the company in 1982. Subsequently VLSI built
its first fab in San Jose; eventually a second fab was built in San Antonio, Texas.
VLSI had its initial public offering in 1983, and was listed on the stock market as
(NASDAQ: VLSI). The company was later acquired by Philips and survives to this
day as part of NXP Semiconductors.
The first semiconductor chips held two transistors each. Subsequent advances
added more and more transistors, and, as a consequence, more individual functions
or systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making
it possible to fabricate one or more logic gates on a single device. Now known
retrospectively as small-scale integration (SSI), improvements in technique led to
devices with hundreds of logic gates, known as medium-scale integration (MSI).
Further improvements led to large scale integration (LSI), i.e. systems with at least a
thousand logic gates. Current technology has moved far past this mark and today's
microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used.
But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use.
As of early 2008, billion transistor processors are commercially available. This is
expected to become more commonplace as semiconductor fabrication moves from
the current generation of 65nm processes to the next 45nm generations (while
experiencing new challenges such as increased variation across process corners). A
notable example is Nvidia's 280 series GPU. This GPU is unique in the fact that
almost all of its 1.4 billion transistors are used for logic, in contrast to the Itanium,
whose large transistor count is largely due to its 24MB L3 cache. Current designs, as
opposed to the earliest devices, use extensive design automation and automated logic
synthesis to lay out the transistors, enabling higher levels of complexity in the
resulting logic functionality. Certain high performance logic blocks like the SRAM
(Static Random Access Memory) cell, however, are still designed by hand to ensure
the highest efficiency (sometimes by bending or breaking established design rules to
obtain the last bit of performance by trading stability).
What is VLSI?
VLSI stands for "Very Large Scale Integration". This is the field which involves
packing more and more logic devices into smaller and smaller areas.
Simply put, an integrated circuit is many transistors on one chip.
VLSI covers the design and manufacturing of extremely small, complex circuitry using
modified semiconductor material.
An integrated circuit (IC) may contain millions of transistors, each a few µm in
size.
Why VLSI?
Integration improves the design: lower parasitics mean higher speed and lower
power consumption, and the circuit is physically smaller. Integration also reduces
manufacturing cost, since (almost) no manual assembly is needed. The course will cover basic theory and
techniques of digital VLSI design in CMOS technology. Topics include: CMOS
devices and circuits, fabrication processes, static and dynamic logic structures, chip
layout, simulation and testing, low power techniques, design tools and
methodologies, VLSI architecture.
We use full custom techniques to design basic cells and regular structures such as
data path and memory. There is an emphasis on modern design issues in
interconnect and clocking. We will also use several case studies to explore recent
real world VLSI designs (e.g. Pentium, Alpha, PowerPC, StrongARM, etc.) and
papers from the recent research literature. On-campus students will design small test
circuits using various CAD tools. Circuits will be verified and analyzed for
performance with various simulators. Some final project designs will be fabricated
and returned to students the following semester for testing.
Very-large-scale-integration (VLSI) is the process of creating integrated circuits
by combining thousands of transistor based circuits into a single chip. VLSI began in
the 1970s when complex semiconductor and communication technologies were
being developed. The microprocessor is a VLSI device. The term is no longer as
common as it once was, as chips have increased in complexity into the hundreds of
millions of transistors.
Applications of VLSI
I. Electronic system in cars.
II. Digital electronics control VCRs.
III. Transaction processing system, ATM.
IV. Personal computers and Workstations.
V. Medical electronic systems.
I. Electronic systems now perform a wide variety of tasks in daily life. Electronic
systems in some cases have replaced mechanisms that operated mechanically,
hydraulically, or by other means; electronics are usually smaller, more flexible, and
easier to service. In other cases, electronic systems have created totally new
applications. Electronic systems perform a variety of tasks, some of them visible,
some more hidden: Personal entertainment systems such as portable MP3 players
and DVD players perform sophisticated algorithms with remarkably little energy.
Electronic systems in cars operate stereo systems and displays; they also control fuel
injection systems, adjust suspensions to varying terrain, and perform the control
functions required for anti-lock braking systems.
II. Digital electronics compress and decompress video, even at high definition data
rates, on the fly in consumer electronics. Low cost terminals for Web browsing still
require sophisticated electronics, despite their dedicated function.
IV. Personal computers and workstations provide word-processing, financial
analysis, and games. Computers include both central processing units and special-purpose hardware for disk access, faster screen display, etc.
V. Medical electronic systems measure bodily functions and perform complex
processing algorithms to warn about unusual conditions. The availability of these
complex systems, far from overwhelming consumers, only creates demand for even
more complex systems.
systems is its variety: as systems become more complex, we build not a few general-purpose
computers but an ever wider range of special-purpose systems. Our ability
to do so is a testament to our growing mastery of both integrated circuit
manufacturing and design, but the increasing demands of customers continue to test
the limits of design and manufacturing.
1.2 Objective
The main objective of this study is to identify redundant logic operations and
data dependencies in the CSLA so as to provide a parallel path for carry propagation, which helps to
reduce the overall adder delay. The CSLA has two units: 1) the sum and carry
generator (SCG) unit and 2) the sum and carry selection unit. Accordingly, we
remove all redundant logic operations and sequence the logic operations based on their
data dependence.
only concern but also smaller area and less power become major concerns for design
of digital circuits.
CHAPTER 2
LITERATURE REVIEW
Adders are of fundamental importance in a wide variety of digital
systems. Several types of fast adders exist, but adding fast using low area and power
is still challenging. In digital adders, the speed of addition is limited by the time
required to propagate a carry through the adder. The CSLA is therefore used in many
computational systems to alleviate the problem of carry propagation delay. Many
papers have been published on this topic, with several examples of such adders and many
efficient implementations.
A number of modifications have been suggested by researchers to improve the
performance of the carry select adder. Reference [1] proposes a logic formulation for the
CSLA by removing all the redundant logic operations from the conventional CSLA
design. In this design the carry select (CS) operation is scheduled before the calculation
of the final sum. Reference [3] presents various architectures of the CSLA and
analyzes them for speed and area. A power-area efficient gate-level modified design is
implemented in [15, 4, 8] by minimizing
the logic operations in comparison with the conventional CSLA design. An analysis of the
16-bit conventional CSLA and the Binary to Excess-1 Converter (BEC) CSLA is
presented in [7], and a D-latch based CSLA architecture is proposed in this project.
An area-delay optimized architecture of 16-bit, 32-bit and 64-bit CSLAs is
proposed and analyzed in [5, 6]. Reference [16] presents the simulation and
performance evaluation of a 16-bit modified architecture of the square-root CSLA
(SQRT-CSLA). Area-delay-power based simulation of a redundant-logic-optimized
modified CSLA design with respect to the conventional CSLA design is shown in
[9, 10, 11, 12]. A modified design for 16-bit, 32-bit and 64-bit CSLAs that does not use a
multiplexer architecture is proposed in [19]. That paper also shows a comparative
analysis of the proposed architecture with the conventional architecture. A logic
converter unit (LCU) based modified adder architecture is proposed in [20] for
optimized area-delay-power parameters. The modified architectures find applications
in high performance VLSI system architectures in the development of modern
electronic devices and gadgets. An efficient adder architecture essentially
improves the overall performance of complex systems. The different sections of
the proposed work are arranged as follows: Section II presents the architecture of the 64-bit
CSLA and the design of its building blocks using gate-level logic. Section III presents
the simulation and synthesis results, together with a comparative analysis
of the design for dynamic power consumption on different FPGAs. Section IV
presents the conclusion based on the present design simulation analysis. Finally,
this paper concludes with the acknowledgement and the references.
In 1962, O.J. Bedrij [1] described an extremely fast digital adder with sum
selection and multiple-radix carry. He compared the amount of hardware and the
logical delay for a 100-bit ripple-carry adder and a carry-select adder. The problem
of carry-propagation delay was overcome by independently generating multiple-radix
carries and using these carries to select between simultaneously generated
sums. In this adder system, the addend and augend were divided into subaddend and
subaugend sections that were added twice to produce two sub sums. One addition
was done with a carry digit forced into each section, and the other addition
combined the operands without the forced carry digit. The selection of the correct
sub sum from each of the adder sections depended upon whether or not there
actually was a carry into that adder section.
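The sum-selection scheme described above can be sketched at the word level in Python. This is an illustrative model rather than Bedrij's circuit: the function name carry_select_add, the 16-bit width, and the 4-bit section size are assumptions chosen for the example. In hardware the two sub-sums of every section are produced in parallel; the loop below only models the order in which the selections resolve.

```python
def carry_select_add(a: int, b: int, width: int = 16, section: int = 4):
    """Add two unsigned `width`-bit integers section by section.

    Returns (sum, carry_out). Assumes `width` is a multiple of `section`.
    """
    if width % section != 0:
        raise ValueError("width must be a multiple of section")
    mask = (1 << section) - 1
    carry = 0      # actual carry into the current section
    result = 0
    for shift in range(0, width, section):
        sub_a = (a >> shift) & mask
        sub_b = (b >> shift) & mask
        # Both candidate sub-sums depend only on the operands, not on `carry`.
        sum_without_carry = sub_a + sub_b        # carry-in assumed 0
        sum_with_carry = sub_a + sub_b + 1       # carry digit forced in
        # Selection step: the real carry picks one of the two sub-sums.
        chosen = sum_with_carry if carry else sum_without_carry
        result |= (chosen & mask) << shift
        carry = chosen >> section                # carry out of this section
    return result, carry
```

Calling carry_select_add(0xFFFF, 1) exercises the longest selection chain, since the carry has to be passed through every section.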
Bedrij (1962) [1] proposed that the problem of carry propagation delay be
overcome by independently generating multiple-radix carries and using these carries
to select between simultaneously generated sums. Ramkumar et al. (2010) proposed a
BEC method to reduce the maximum delay of carry propagation in the final stage of
the carry save adder [2]. Ramkumar and Harish (2011) [7] proposed the BEC technique, which
is a simple and efficient gate-level modification that significantly reduces the area and
power of the square-root CSLA.
There are many carry select adder approaches available, but most of them use
ripple carry adders. T.Y. Chang and M.J. Hsiao [3] suggested that, instead of using
dual ripple carry adders, a carry select adder scheme using an add-one circuit to
replace one ripple carry adder requires 29.2% fewer transistors with a speed penalty
of 5.9% for bit length n = 64. If speed is important for this 64-bit adder, then two of
the carry-select adder blocks can be substituted by the proposed scheme with a 6.3%
area saving and the same speed.
Youngjoon Kim and Lee-Sup Kim [4] suggested that a carry-select adder
could be implemented using a single ripple carry adder and an add-one circuit
instead of dual ripple-carry adders. They proposed a new add-one circuit using
a first-zero finding circuit and multiplexers to reduce the area and power with no
speed penalty. For n = 64 bits, this new carry-select adder requires 38% fewer
transistors than the dual ripple-carry carry select adder and 29% fewer
transistors than Chang's carry-select adder using a single ripple carry adder. This new
64-bit adder, using a 0.25 µm CMOS technology, had a 3.45 ns delay at a 2.5 V power
supply. Behnam Amelifard et al. [6] suggested a new adder called the carry select adder
with sharing (CSAS), which was area efficient but had a larger delay. M. Alioto et al.
[5] suggested using variable block sizing depending on the multiplexer delay.
B. Ramkumar, H.M. Kittur, and P.M. Kannan [7] suggested a very simple
approach to improve the speed of addition. Based on this approach, 16-, 32- and 64-bit
adder architectures were developed and compared with conventional fast adder
architectures. In many parallel multipliers, to speed up the final addition, a CLA
arranged in the form of a carry select adder (CSLA) was used. But due to the
structure of the CSLA it occupied more chip area, because it uses multiple pairs of
RCAs to generate the partial sum and carry by considering Cin = 0 and Cin = 1. Thus
the complexity of the final adder structure was high. So they replaced the RCA
(CLA) with Cin = 1 with BEC logic, which reduced the area and delay of
the final adder structure.
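The idea of replacing the Cin = 1 adder with BEC logic can be illustrated with a short bit-level sketch. It is a minimal model under stated assumptions, not the cited authors' netlist: the helper names (to_bits, to_int, bec, bec_csla_block) are hypothetical, bit vectors are stored least significant bit first, and the Cin = 0 ripple carry adder is stood in for by ordinary integer addition.

```python
def to_bits(value, width):
    """Return `width` bits of `value`, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def to_int(bits):
    """Inverse of to_bits: combine LSB-first bits into an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def bec(bits):
    """Binary to excess-1 converter: adds 1 to an LSB-first bit vector."""
    out = []
    lower_all_ones = 1  # AND of all lower-order input bits (1 for bit 0)
    for bit in bits:
        out.append(bit ^ lower_all_ones)  # flip while every lower bit is 1
        lower_all_ones &= bit
    return out


def bec_csla_block(a, b, cin, n=4):
    """n-bit CSLA block: one adder for cin = 0, a BEC for cin = 1."""
    # Stand-in for the RCA with cin = 0: n sum bits plus the carry-out bit.
    sum0 = to_bits(a + b, n + 1)
    # The (n+1)-bit BEC produces the cin = 1 result from the cin = 0 result.
    sum1 = bec(sum0)
    chosen = sum1 if cin else sum0  # multiplexer driven by the real carry
    return to_int(chosen[:n]), chosen[n]
```

The BEC needs only one XOR and one AND per bit, which is where the area saving over a second ripple carry adder comes from in this model.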
SQRT CSLA using ordinary BEC-1 converter with only a slight increase in the
delay. This work evaluates the performance of the proposed designs in terms of
delay, area, and power by hand with logical effort and through Cadence Virtuoso.
The results analysis shows that the proposed CSLA structure is better than the SQRT
CSLA with ordinary BEC-1 converter.
proposed work, generally in a Wallace multiplier the partial products are reduced as
soon as possible, and a carry select adder is used in the final carry propagation path. In
this project, modification is done at the gate level to reduce area and power
consumption. The Modified Square Root Carry Select Adder (MCSLA) is designed
using common Boolean logic and then compared with the respective regular CSLA
architectures, and this MCSLA is implemented in a Wallace tree multiplier. This
work gives a reduced area compared to the normal Wallace tree multiplier. Finally, an
area-efficient Wallace tree multiplier is designed using the common Boolean logic based
square root carry select adder.
Ramkumar and Harish (2011) [4] proposed the BEC technique, which is a simple and
efficient gate-level modification that significantly reduces the area and power of the square
root CSLA. Veena Nair in 2013 suggested a new approach in which a D-latch
with an enable signal is used instead of the BEC [6]. Based on this approach, 16-, 32- and 64-bit
adder architectures were developed and compared with conventional fast adder
architectures. The new structure reduces the delay.
know about the MOS transistor. MOS stands for Metal Oxide Semiconductor; the MOS
field effect transistor (MOSFET) is the basic element in the design of a
large scale integrated circuit. It is a voltage controlled device. These
transistors are formed as a "sandwich" consisting of a semiconductor layer, usually a
slice, or wafer, from a single crystal of silicon; a layer of silicon dioxide (the oxide);
and a layer of metal. These layers are patterned in a manner which permits
transistors to be formed in the semiconductor material (the "substrate"). The MOS
transistor consists of three regions: source, drain, and gate.
The source and drain regions are quite similar, and are labeled depending on to
what they are connected. The source is the terminal, or node, which acts as the
source of charge carriers; charge carriers leave the source and travel to the drain. In
the case of an N channel MOSFET (NMOS), the source is the more negative of the
terminals; in the case of a P channel device (PMOS), it is the more positive of the
terminals. The area under the gate oxide is called the "channel". Below is figure of a
MOS Transistor.
The transistor needs a gate voltage for the channel to
form. When no channel is formed, the transistor is said to be in the cut-off
region. The gate voltage at which the transistor starts conducting (a channel begins to
form between the source and the drain) is called the threshold voltage. Above the
threshold, and while the drain-source voltage is small, the transistor is said to be in the
linear region. The transistor is said to go into the
saturation region when the channel pinches off near the drain, so that the drain current no
longer increases appreciably with the drain-source voltage. CMOS technology is made up of
both NMOS and PMOS transistors.
Complementary Metal Oxide Semiconductor (CMOS) logic devices are the most
common devices used today in the high-density, high-transistor-count
circuits found in everything from complex microprocessor integrated circuits to
signal processing and communication circuits.
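The three operating regions can be summarized with the usual first-order (long-channel) textbook conditions for an NMOS device. The sketch below is an illustration only: the function name nmos_region is hypothetical, and the boundary conditions (cut-off when Vgs is at or below Vth, saturation when Vds is at least Vgs - Vth) are the standard simplified model assumed here, not something taken from this report.

```python
def nmos_region(vgs: float, vds: float, vth: float) -> str:
    """Return the operating region of an NMOS transistor.

    vgs: gate-source voltage, vds: drain-source voltage,
    vth: threshold voltage (all in volts, vds assumed non-negative).
    """
    if vgs <= vth:
        return "cutoff"        # no channel has formed
    if vds < vgs - vth:
        return "linear"        # channel extends from source to drain
    return "saturation"        # channel pinched off near the drain
```

With a 0.7 V threshold, for instance, a gate voltage of 1.5 V and a drain voltage of 0.2 V gives the linear region, while raising the drain voltage to 1.2 V moves the device into saturation.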
A whole wafer is processed at a time. Different parts of each die are made
P-type or N-type by doping (small amounts of other atoms are intentionally introduced,
for example by implantation). Interconnections are made with metal; the insulation used is typically SiO2,
and SiN is also used. New materials (low-k dielectrics) are being investigated. CMOS
fabrication uses the p-well process, the n-well process, or the twin-tub process. All the devices on
the wafer are made at the same time. After the circuitry has been placed on the chip,
the chip is overglassed (with a passivation layer) to protect it; only those areas which
connect to the outside world (the pads) are left uncovered. The wafer finally
passes to a test station, where test probes send test signal patterns to the chip and monitor
the output of the chip. The yield of a process is the percentage of dies which pass this
testing. The wafer is then scribed and separated into the individual chips. These
are then packaged, and the chips are binned according to their performance.
involved constraints one can impose on a design using the software packages
supported by the vendors. Bit stream generation: FPGAs are typically configured at
power up time from some sort of external permanent storage device, typically a flash
memory. Once the place and route process is finished, the resulting choices for the
configuration of each programmable element in the FPGA chip, be it logic or
interconnect, must be stored in a file to program the flash. Of these four phases, only
the first one is human labor intensive. Somebody has to type in the HDL code,
which can be tedious and error prone for complicated designs involving, for
example, lots of digital signal processing. This is the reason for the appearance, in
recent years, of alternative flows which include a preliminary phase in which the
user can draw blocks at a higher level of abstraction and rely on the software tool for
the generation of the HDL. Some of these tools also include the capability of
simulating blocks which will become HDLs with other blocks which provide stimuli
and processing to make the simulation output easier to interpret. The concept of
hardware co-simulation is also becoming widely used. In co-simulation, stimuli are
sent to a running FPGA hosting the design to be tested and the outputs of the design
are sent back to a computer for display (typically through a Joint Test Action Group
(JTAG), or Ethernet connection). The advantage of co-simulation is that one is
testing the real system, therefore suppressing all possible misinterpretations present
in a pure simulator. In other cases, co-simulation may be the only way to simulate a
complex design in a reasonable amount of time.
The standard FPGA design flow starts with design entry using schematics or a
hardware description language (HDL), such as Verilog HDL or VHDL. In this step,
you create the digital circuit that is implemented inside the FPGA. The flow then
proceeds through compilation, simulation, programming, and verification in the
FPGA hardware. We first define the relevant terminology in the field and then
describe the recent evolution of field-programmable devices (FPDs). The three main categories of FPDs are
delineated: Simple PLDs (SPLDs), Complex PLDs (CPLDs) and Field-Programmable Gate Arrays (FPGAs).
In general terms FPGAs are best at tasks that use short word length integer or
fixed point data, and exhibit a high degree of parallelism, but they are not so good at
high precision floating-point arithmetic (although they can still outperform
conventional processors in many cases). The implications of shipping data to the
FPGA from the CPU and vice versa must also come under consideration, for if that
outweighs any improvement in the kernel then implementing the algorithm in an
FPGA may be an exercise in futility. FPGAs are best suited to integer arithmetic.
Unfortunately, the vast majority of scientific codes rely heavily on 64 bit IEEE
floating point arithmetic (often referred to as double precision floating point
arithmetic). It is not unreasonable to suggest that in order to get the most out of
FPGAs computational scientists must perform a thorough numerical analysis of their
code, and ideally reimplement it using fixed point arithmetic or lower precision
floating-point arithmetic. Scientists who have been used to continual performance
increases provided by each new generation of processor are not easily convinced that
the large amount of effort required for such an exercise will be sufficiently
rewarded. That said the recent development of efficient floating point cores has gone
some way towards encouraging scientists to use FPGAs.
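To make the fixed-point alternative concrete, the following sketch converts values to a fixed-point integer representation, multiplies them, and converts back. It is a minimal illustration under assumptions: the helper names (to_fixed, from_fixed, fixed_mul) and the choice of 8 or 12 fractional bits are hypothetical, and only non-negative values are considered.

```python
def to_fixed(x: float, frac_bits: int) -> int:
    """Quantize x to an integer with `frac_bits` fractional bits."""
    return round(x * (1 << frac_bits))


def from_fixed(q: int, frac_bits: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


def fixed_mul(qa: int, qb: int, frac_bits: int) -> int:
    """Multiply two fixed-point values that share the same format.

    The raw product carries 2 * frac_bits fractional bits, so it is
    shifted right to return to the original format (truncating).
    """
    return (qa * qb) >> frac_bits
```

The quantization error of to_fixed is at most half of one least significant step, that is 2^-(frac_bits + 1), which is the kind of bound a numerical analysis of the code would need to check against the accuracy the application requires.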
real world applications, then the wider acceptance of FPGAs will move a step closer.
At present there is very little performance data available for 64-bit floating-point
intensive algorithms on FPGAs. To give an indication of expected performance we
have therefore used data taken from the Xilinx floating point cores (v3) datasheet.
It is useful to measure the area, performance and power consumption gap between field
programmable gate arrays (FPGAs) and standard cell application-specific integrated
circuits (ASICs) for the following reasons: I. In the early stages of system design,
when system architects choose their implementation medium, they often choose
between FPGAs and ASICs. Such decisions are based on the differences in cost
(which is related to area), performance and power consumption between these
implementation media, but to date there have been few attempts to quantify these
differences. A system architect can use these measurements to assess whether
implementation in an FPGA is feasible. II. These measurements can also be useful
for those building ASICs that contain programmable logic, by quantifying the
impact of leaving part of a design to be implemented in the programmable fabric.
III. FPGA makers seeking to improve FPGAs can gain insight from quantitative
measurements of these metrics, particularly when it comes to understanding the
benefit of less programmable (but more efficient) hard heterogeneous blocks such as
block memory, multipliers/accumulators and multiplexers that modern FPGAs often
employ.
terms of LUTs and IOs can be routed. This is determined by estimates such as those
derived from Rent's rule or by experiments with existing designs.
implement it in hardware, and this difficulty is a direct result of I/O issues. As noted
above, for a design to work in hardware, access is required to resources that are
external to the FPGA, such as memory, and an FPGA is, by its very nature, unaware
of the components to which it is connected. If you want to retrieve a value from
main memory and use it on the FPGA then you need to instantiate a memory
controller. While systems such as the Cray XD1 provide cores for communicating
with memory, such cores are still complex and unfamiliar to software programmers.
Our early experiences with VHDL have indicated that it should only be used for
FPGA development if you are in a position to work closely with experienced
hardware designers throughout the development process.
CHAPTER 3
DESIGN APPROACH
Low-power, area-efficient, and high-performance VLSI systems are increasingly
used in portable and mobile devices, multistandard wireless receivers, and
biomedical instrumentation [1], [2]. An adder is the main component of an
arithmetic unit. A complex digital signal processing (DSP) system involves several
adders. An efficient adder design essentially improves the performance of a complex
DSP system. A ripple carry adder (RCA) uses a simple design, but carry propagation
delay (CPD) is the main concern in this adder. Carry look-ahead and carry select
(CS) methods have been suggested to reduce the CPD of adders. A conventional
carry select adder (CSLA) is an RCA-RCA configuration that generates a pair of
sum words and output-carry bits corresponding to the anticipated input carry (cin = 0
and 1) and selects one out of each pair for the final sum and final output carry [3]. A
conventional CSLA has less CPD than an RCA, but the design is not attractive since
it uses a dual RCA. Few attempts have been made to avoid the dual use of RCAs in
CSLA design. Kim and Kim [4] used one RCA and one add-one circuit instead of
two RCAs, where the add-one circuit is implemented using a multiplexer (MUX).
He et al. [5] proposed a square-root (SQRT)-CSLA to implement large bit-width
adders with less delay. In a SQRT-CSLA, CSLAs with increasing size are connected
in a cascading structure. The main objective of SQRT-CSLA design is to provide a
parallel path for carry propagation that helps to reduce the overall adder delay. A
binary to excess-1 converter (BEC)-based CSLA was suggested in [6]. The BEC-based CSLA involves less logic
resources than the conventional CSLA, but it has marginally higher delay. A CSLA
based on common Boolean logic (CBL) is also proposed in [7] and [8]. The CBL-based
CSLA of [7] involves significantly less logic resource than the conventional
CSLA but it has longer CPD, which is almost equal to that of the RCA. To
overcome this problem, a SQRT-CSLA based on CBL was proposed in [8].
However, the CBL-based SQRT-CSLA design of [8] requires more logic resource
and delay than the BEC-based SQRT-CSLA of [6]. We observe that logic
optimization largely depends on the availability of redundant operations in the
formulation, whereas adder delay mainly depends on data dependence. In the
existing designs, logic is optimized without giving any consideration to the data
dependence.
The main contribution of this work is a logic formulation based on data dependence
and an optimized carry generator (CG) and CS design. Based on the proposed logic
formulation, we have derived an efficient logic design for the CSLA. Due to the optimized
logic units, the proposed CSLA involves a significantly lower area-delay product (ADP) than the existing
CSLAs. We have shown that the SQRT-CSLA using the proposed CSLA design
involves nearly 32% less ADP and consumes 33% less energy than that of the
corresponding SQRT-CSLA.
In a carry select adder with variable block sizes, each block should
have a delay, from addition inputs A and B to the carry out, equal to that of the
multiplexer chain leading into it, so that the carry out is calculated just in time. The
delay is derived from uniform sizing, where the ideal number of full-adder
elements per block is equal to the square root of the number of bits being added,
since that will yield an equal number of MUX delays.
However, the carry select adder is not area efficient because it uses multiple pairs
of ripple carry adders to generate the partial sum and carry for each assumed carry input,
and the final sum and carry are then selected by multiplexers (MUX). To
overcome this problem, the CSLA is modified by using n-bit Binary to
Excess-1 code converters (BEC) to improve the speed of addition. The logic can be
implemented with any type of adder to further improve the speed. We use the Binary
to Excess-1 Converter (BEC) instead of a ripple carry adder in the regular CSLA to
achieve lower area and power consumption. The main advantage of this BEC logic
comes from its smaller number of logic gates compared with the Full Adder (FA) structure. The
modified design has reduced area and power compared with the regular
SQRT-CSLA, at the cost of an increase in delay. Therefore, an improved CSLA was
designed with a D-latch replacing the BEC in the modified CSLA. This design
reduces the delay, thereby increasing the speed and making it a high-speed
carry select adder. The factors which are desirable in adders are as follows:
High speed, Low power consumption
Area efficient
Robustness and noise stability
Insensitivity to process variables
Less internal activity when activity is low
Depending on the requirements of the adder, the designer has to consider all these
parameters while choosing an adder structure. What makes this decision even
harder is that most of these parameters are usually not independent of each other.
Trade-offs between the desired parameters make this decision a multi-dimensional
optimization problem. For a high-performance system, a multi-dimensional optimization
problem for a non-linear system that usually has hundreds of variables is
unfortunately impossible to solve within the limited design time.
The idea of this thesis is to explore the area, power consumption and time delay
of different adder structures. This gives a good understanding of the different
structures and makes the decision easier for designers.
The Ripple Carry Adder (RCA) provides the most compact design but takes
the longest computing time. For an N-bit RCA, the delay is linearly proportional to N.
Thus, for large values of N, the RCA gives the highest delay of all adders. The Carry
Look Ahead Adder (CLA) gives fast results but consumes a large area. For an N-bit adder, the CLA is fast for N ≤ 4, but for large values of N its delay increases more
than that of other adders. So, for a higher number of bits, the CLA gives a higher delay than other
adders due to the large fan-in and the large number of logic gates.
The Carry Select Adder (CSA) provides a compromise between small area but
longer delay RCA and a large area with shorter delay CLA. In rapidly growing
mobile industry, faster units are not the only concern but also smaller area and less
power become major concerns for design of digital circuits. In mobile electronics,
reducing area and power consumption are key factors in increasing portability and
battery life. Even in servers and desktop computers power dissipation is an
important design constraint. The design of area- and power-efficient high-speed data path
logic systems is one of the most substantial areas of research in VLSI system
design.
In the present work, the designs of 8-bit adder topologies such as the ripple carry adder,
carry look ahead adder, carry skip adder, carry select adder, carry increment adder,
carry save adder and carry bypass adder are presented. Performance issues like area, power dissipation and
propagation delay for all the adders are analyzed in 0.12 µm 6-metal-layer CMOS
technology using the Microwind tool, which tightly integrates mixed-signal implementation with digital implementation, circuit simulation, transistor-level extraction and verification. In digital adders, the speed of addition is limited by the time required
to propagate a carry through the adder. The sum for each bit position in an
elementary adder is generated sequentially only after the previous bit position has
been summed and a carry propagated into the next position. The CSLA is used in
many computational systems to alleviate the problem of carry propagation delay by
independently generating multiple carries and then selecting a carry to generate the sum
[1].
However, the CSLA is not area efficient because it uses multiple pairs of Ripple
Carry Adders (RCA) to generate the partial sum and carry by considering carry inputs
Cin = 0 and Cin = 1; the final sum and carry are then selected by multiplexers
(MUX). The basic idea of this work is to use a simple combinational circuit instead of the
RCA with Cin = 1 and the multiplexer in the regular CSLA to achieve lower area and
power. The main advantage of this logic is that it consumes less power than the
n-bit Full Adder (FA) structure. The SQRT CSLA has been developed by using
the simple combinational circuit and compared with the regular SQRT CSLA. A regular
CSLA uses two copies of the carry evaluation block, one with a block carry input of
zero and the other with a block carry input of one. The regular CSLA suffers from the
disadvantage of occupying more chip area. The modified CSLA reduces the area and
power compared with the regular CSLA, with an increase in delay, by the use of the Binary
to Excess-1 converter. This project proposes a scheme which reduces the delay, area
and power compared with the regular and modified CSLA by the use of D-latches.
3.2 Operation
The Carry Select Adder (CSA) is one of the fastest adders used in many data-processing processors to perform fast arithmetic functions. The carry-select adder
partitions the adder into several groups, each of which performs two additions in
parallel. Therefore, two copies of a ripple-carry adder act as the carry evaluation block per
select stage. One copy evaluates the carry chain assuming the block carry-in is zero,
while the other assumes it to be one. Once the carry signals are finally computed, the
correct sum and carry-out signals are simply selected by a set of multiplexers.
The 4-bit adder block is an RCA. In digital adders, the speed of addition is limited by
the time required to propagate a carry through the adder. The sum for each bit
position in an elementary adder is generated sequentially only after the previous bit
position has been summed and a carry propagated into the next position. The CSLA
is used in many computational systems to alleviate the problem of carry propagation
delay by independently generating multiple carries and then selecting a carry to
generate the sum. However, the CSLA is not area efficient because it uses multiple
pairs of Ripple Carry Adders (RCA) to generate the partial sum and carry for each
assumed carry input, and the final sum and carry are then selected by the
multiplexers (MUX).
The carry-select adder generally consists of two ripple carry adders and a
multiplexer. Adding two n-bit numbers with a carry-select adder is done with two
adders (therefore two ripple carry adders) in order to perform the calculation twice,
one time with the assumption of the carry being zero and the other assuming one.
After the two results are calculated, the correct sum, as well as the correct carry, is
then selected with the multiplexer once the correct carry is known. The number of
bits in each carry select block can be uniform or variable. In the uniform case, the
optimal delay occurs for a block size equal to the square root of n. In the variable case, each block should have a
delay, from addition inputs A and B to the carry out, equal to that of the
multiplexer chain leading into it, so that the carry out is calculated just in time. The
delay is derived from uniform sizing, where the ideal number of full-adder elements
per block is equal to the square root of the number of bits being added, since that
will yield an equal number of MUX delays. Two 4-bit ripple carry adders are
multiplexed together, where the resulting carry and sum bits are selected by the
carry-in. Since one ripple carry adder assumes a carry-in of 0, and the other assumes
a carry-in of 1, selecting which adder had the correct assumption via the actual
carry-in yields the desired result. A 16-bit carry-select adder with a uniform block
size of 4 can be created with three of these blocks and a 4-bit ripple carry adder.
Since the carry-in is known at the beginning of computation, a carry select block is not
needed for the first four bits. The delay of this adder will be four full adder delays
plus three MUX delays. A 16-bit carry-select adder with variable block sizes can be similarly
created, for example with block sizes of 2-2-3-4-5. This break-up is ideal when the
full-adder delay is equal to the MUX delay, which is unlikely. The total delay is two
full adder delays and four MUX delays.
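The block-level operation described above can be sketched as a small behavioural model. This is only an illustration under stated assumptions, not the design synthesized in this report: the function names `ripple_carry_block` and `carry_select_add` are hypothetical, operands are unsigned integers, and the block size is uniform.

```python
def ripple_carry_block(a_bits, b_bits, cin):
    """Add two little-endian bit lists with a chain of full adders."""
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | ((a ^ b) & carry)
    return sum_bits, carry


def carry_select_add(a, b, width=16, block=4, cin=0):
    """Carry-select addition of two unsigned integers with uniform blocks."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result = []
    carry = cin
    for lo in range(0, width, block):
        a_blk = a_bits[lo:lo + block]
        b_blk = b_bits[lo:lo + block]
        if lo == 0:
            # The carry-in of the first block is known, so one RCA is enough.
            s, carry = ripple_carry_block(a_blk, b_blk, carry)
        else:
            # Both candidates are computed without waiting for the real carry.
            s0, c0 = ripple_carry_block(a_blk, b_blk, 0)
            s1, c1 = ripple_carry_block(a_blk, b_blk, 1)
            # Multiplexer: the incoming carry picks one candidate pair.
            s, carry = (s1, c1) if carry else (s0, c0)
        result.extend(s)
    total = sum(bit << i for i, bit in enumerate(result))
    return total, carry


print(carry_select_add(0xFFFF, 1))  # (0, 1): sum wraps to 0 with carry-out 1
```

In hardware the two candidate additions of every block run in parallel, so only the multiplexer selection lies on the path between blocks; the sequential loop here models the function, not the timing.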
Addition is the heart of computer arithmetic, and the arithmetic unit is often the
workhorse of a computational circuit. Adders are a necessary component of a data
path, e.g. in microprocessors or signal processors. There are many ways to design
an adder. The Ripple Carry Adder (RCA) provides the most compact design but
takes the longest computing time. For an N-bit RCA, the delay is linearly proportional
to N. Thus, for large values of N, the RCA gives the highest delay of all adders. The
Carry Look Ahead Adder (CLA) gives fast results but consumes a large area. For an
N-bit adder, the CLA is fast for N ≤ 4, but for large values of N its delay increases
more than that of other adders. So, for a higher number of bits, the CLA gives a higher delay than
other adders due to the large fan-in and the large number of logic
gates. The Carry Select Adder (CSA) provides a compromise between the small-area but
longer-delay RCA and the large-area, shorter-delay CLA. In the rapidly growing
mobile industry, faster units are not the only concern; smaller area and lower
power have also become major concerns in the design of digital circuits. In mobile electronics,
reducing area and power consumption are key factors in increasing portability and
battery life. Even in servers and desktop computers, power dissipation is an
important design constraint. The design of area- and power-efficient high-speed data
path logic systems is one of the most substantial areas of research in VLSI system
design. In digital adders, the speed of addition is limited by the time required to
propagate a carry through the adder. The sum for each bit position in an elementary
adder is generated sequentially only after the previous bit position has been summed
and a carry propagated into the next position. Among the various adders, the CSA is
intermediate regarding speed and area.
Code converters are essential in digital systems. Here we give
the truth table for the binary to excess-1 converter. The excess-1 code is obtained
by adding one to the binary value. The detailed structures of the 5-bit BEC without
carry (BEC) and with carry (BECWC) are shown in Fig. 3.3. The BEC takes n
inputs and generates n outputs; the BECWC takes n inputs and generates n + 1 outputs to
provide the carry output as the selection input of the next-stage MUX used in the final
adder design. The function tables of the BEC and BECWC are shown in Table 3.1.
Large bit-sized multipliers require multiple BECs, and each of them requires the
selection input from the carry output of the preceding BEC.
have been suggested for efficient implementation of the SCG unit. We made a study
of the logic designs suggested for the SCG unit of conventional and BEC-based
CSLAs of [6] by suitable logic expressions. The main objective of this study is to
identify redundant logic operations and data dependence. Accordingly, we remove
all redundant logic operations and sequence logic operations based on their data
dependence.
Fig. 3.4 (a) Conventional CSLA; n is the input operand bit-width. (b) The logic
operations of the RCA are shown in split form, where HSG, HCG, FSG, and
FCG represent half-sum generation, half-carry generation, full-sum generation,
and full-carry generation, respectively.
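Suppose two n-bit operands are added in the conventional CSLA; then RCA-1
and RCA-2 generate the n-bit sums ($s^0$ and $s^1$) and output-carries ($c^0_{out}$ and $c^1_{out}$)
corresponding to input-carry (cin = 0 and cin = 1), respectively. Logic expressions of
RCA-1 and RCA-2 of the SCG unit of the n-bit CSLA are given as
\[ s_0^0(i) = A(i) \oplus B(i), \quad c_0^0(i) = A(i) \cdot B(i) \]
\[ s_1^0(i) = s_0^0(i) \oplus c_1^0(i-1), \quad c_1^0(i) = c_0^0(i) + s_0^0(i) \cdot c_1^0(i-1), \quad c^0_{out} = c_1^0(n-1) \]
and
\[ s_0^1(i) = A(i) \oplus B(i), \quad c_0^1(i) = A(i) \cdot B(i) \]
\[ s_1^1(i) = s_0^1(i) \oplus c_1^1(i-1), \quad c_1^1(i) = c_0^1(i) + s_0^1(i) \cdot c_1^1(i-1), \quad c^1_{out} = c_1^1(n-1) \]
where $c_1^0(-1) = 0$, $c_1^1(-1) = 1$, and $0 \le i \le n-1$. ........ 1
In the BEC-based CSLA, the RCA calculates the n-bit sum $s_1^0$ and $c^0_{out}$
corresponding to cin = 0. The BEC unit receives $s_1^0$ and $c^0_{out}$ from the RCA and generates the (n + 1)-bit
excess-1 code. The most significant bit (MSB) of the BEC output represents $c^1_{out}$, and the n
least significant bits (LSBs) represent $s_1^1$:
\[ s_1^1(0) = \overline{s_1^0(0)}, \quad c_1^1(0) = s_1^0(0) \]
\[ s_1^1(i) = s_1^0(i) \oplus c_1^1(i-1), \quad c_1^1(i) = s_1^0(i) \cdot c_1^1(i-1) \]
\[ c^1_{out} = c_1^0(n-1) \oplus c_1^1(n-1) \]
........ 2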
The selected carry word is added with the half-sum (s0) to generate the final-sum
(s). Using this method, one can have three design advantages:
1. Calculation of
2. The n-bit select unit is required instead of the (n+1) bit; and
3. Small output-carry delay.
All these features result in an area-delay and energy-efficient design for the
CSLA. We have removed all the redundant logic operations of (2) and rearranged
the logic expressions of (2) based on their dependence. The proposed logic formulation
for the CSLA is given as
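\[ s_0(i) = A(i) \oplus B(i), \quad c_0(i) = A(i) \cdot B(i) \]
\[ c_1^0(i) = c_1^0(i-1) \cdot s_0(i) + c_0(i) \quad \text{with initial carry } 0 \]
\[ c_1^1(i) = c_1^1(i-1) \cdot s_0(i) + c_0(i) \quad \text{with initial carry } 1 \]
\[ c(i) = c_1^0(i) \ \text{if } c_{in} = 0 \]
\[ c(i) = c_1^1(i) \ \text{if } c_{in} = 1 \]
.......... 3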
Fig. 3.5 Structure of the BEC-based CSLA; n is the input operand bit-width.
Fig. 3.6 (a) Proposed CS adder design, where n is the input operand bit-width,
and [] represents delay (in the unit of inverter delay), n = max (t, 3.5n + 2.7).
(b) Gate-level design of the HSG. (c) Gate-level optimized design of (CG0) for
input-carry = 0. (d) Gate-level optimized design of (CG1) for input-carry = 1.
(e) Gate-level design of the CS unit. (f) Gate-level design of the final sum
generation (FSG) unit.
The logic diagram of the HSG unit is shown in Fig. 3.6 (b). The logic circuits of
CG0 and CG1 are optimized to take advantage of the fixed input-carry bits. The
optimized designs of CG0 and CG1 are shown in Fig. 3.6 (c) and (d), respectively.
The CS unit selects one final carry word from the two carry words available at its
input line using the control signal cin. It selects $c_1^0$ when cin = 0; otherwise, it selects $c_1^1$.
The CS unit can be implemented using an n-bit 2-to-1 MUX. However, we find from
the truth table of the CS unit that carry words $c_1^0$ and $c_1^1$ follow a specific bit
pattern. If $c_1^0(i) = 1$, then $c_1^1(i) = 1$, irrespective of $s_0(i)$ and $c_0(i)$, for $0 \le i \le n-1$. This
feature is used for logic optimization of the CS unit. The optimized design of the CS
unit is shown in Fig. 3.6 (e), which is composed of n AND-OR gates. The final
carry word c is obtained from the CS unit. The MSB of c is sent to output as cout,
and the (n - 1) LSBs are XORed with the (n - 1) MSBs of the half-sum (s0) in the FSG [shown
in Fig. 3.6 (f)] to obtain the (n - 1) MSBs of the final-sum (s). The LSB of s0 is XORed
with cin to obtain the LSB of s.
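The data flow through the HSG, CG0, CG1, CS and FSG units can be checked with a short bit-level model. This is a minimal sketch of the formulation in (3), not the gate-level netlist of Fig. 3.6: the function name `proposed_csla` is hypothetical, and the AND-OR expression used for the CS unit, c(i) = c0_1(i) OR (cin AND c1_1(i)), is an assumption that is consistent with the bit pattern noted above (whenever the cin = 0 carry is 1, the cin = 1 carry is also 1).

```python
def proposed_csla(a, b, cin, n):
    """Bit-level model of an n-bit single-stage CSLA (HSG, CG0, CG1, CS, FSG)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]

    # HSG: half-sum and half-carry for every bit position.
    s0 = [x ^ y for x, y in zip(a_bits, b_bits)]
    c0 = [x & y for x, y in zip(a_bits, b_bits)]

    def carry_word(initial_carry):
        # Carry generator: c(i) = c(i-1) AND s0(i) OR c0(i).
        word, prev = [], initial_carry
        for i in range(n):
            prev = (prev & s0[i]) | c0[i]
            word.append(prev)
        return word

    c0_1 = carry_word(0)  # CG0: carry word for input-carry 0
    c1_1 = carry_word(1)  # CG1: carry word for input-carry 1

    # CS unit in AND-OR form (assumed gate expression, see lead-in).
    c = [c0_1[i] | (cin & c1_1[i]) for i in range(n)]

    # FSG: LSB uses cin, the remaining bits use the selected carry word.
    s = [s0[0] ^ cin] + [s0[i] ^ c[i - 1] for i in range(1, n)]
    cout = c[n - 1]
    return sum(bit << i for i, bit in enumerate(s)), cout


print(proposed_csla(9, 7, 0, 4))  # (0, 1): 9 + 7 = 16 in 4 bits
```

Because both carry words are generated before cin is needed, cin only has to pass through the CS and FSG units.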
We have considered all the gates to be made of 2-input AND, 2-input OR, and
inverter (AOI). A 2-input XOR is composed of 2 AND, 1 OR, and 2 NOT gates. The
area and delay of the 2-input AND, 2-input OR, and NOT gates are taken from the
Synopsys Armenia Educational Department (SAED) 90-nm standard cell library
datasheet for theoretical estimation. The area and delay of a design are calculated
using the following relations:
\[ A = a \cdot N_a + r \cdot N_o + i \cdot N_i \]
\[ T = n_a \cdot T_a + n_o \cdot T_o + n_i \cdot T_i \]
.......... 4
where (Na, No, Ni) and (na, no, ni), respectively, represent the (AND, OR, NOT)
gate counts of the total design and its critical path, and (a, r, i) and (Ta, To, Ti),
respectively, represent the area and delay of one (AND, OR, NOT) gate. We have
calculated the AOI gate counts of each design for area and delay estimation. The
area and delay of each design are calculated from the AOI gate counts (Na, No, Ni),
(na, no, ni), and the cell details. The delay of each
intermediate and output signal of the proposed n-bit CSLA design of Fig. 3.6 is
shown in square brackets against each signal. We can observe that the proposed
n-bit single-stage CSLA involves 6n fewer AOI gates than the CSLA
of [6] and takes 2.7 and 6.6 units less delay to calculate the final-sum and output-carry, respectively.
Compared with the CBL-based CSLA of [7], the proposed CSLA design involves n
more AOI gates, and it takes (n - 4.7) units less delay to calculate the output-carry.
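The estimation procedure of (4) amounts to two weighted sums over gate counts, which can be sketched as follows. The per-gate area and delay numbers in `LIB` are placeholders chosen only to make the example run; they are not values from the SAED 90-nm datasheet, and the names `Cell`, `LIB`, `estimate` and `hsg_counts` are hypothetical. The HSG gate count uses the decomposition stated above (one XOR = 2 AND + 1 OR + 2 NOT).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    area: float   # area of one gate (arbitrary units in this sketch)
    delay: float  # delay of one gate (arbitrary units in this sketch)


# Placeholder cell data (assumption); substitute datasheet values in practice.
LIB = {"AND": Cell(2.0, 1.0), "OR": Cell(2.0, 1.0), "NOT": Cell(1.0, 0.5)}


def estimate(total_counts, path_counts, lib=LIB):
    """Return (A, T): area over all gates, delay over the critical path."""
    area = sum(lib[g].area * k for g, k in total_counts.items())
    delay = sum(lib[g].delay * k for g, k in path_counts.items())
    return area, delay


def hsg_counts(n):
    """AOI gate counts of an n-bit half-sum generator (n XOR + n AND)."""
    total = {"AND": 3 * n, "OR": n, "NOT": 2 * n}
    # Critical path of one XOR: NOT, then AND, then OR.
    path = {"AND": 1, "OR": 1, "NOT": 1}
    return total, path


print(estimate(*hsg_counts(4)))  # (40.0, 2.5) with the placeholder cell data
```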
In this work the following adder structures are used:
Ripple Carry Adder
Carry Save Adder
Carry Look-Ahead Adder
Carry Increment adder
Carry Skip Adder
Carry Bypass Adder
Carry Select Adder
t = (n-1) tc + ts
The well-known adder architecture, ripple carry adder is composed of cascaded
full adders for n-bit adder, as shown in figure 3.7. It is constructed by cascading full
adder blocks in series. The carry out of one stage is fed directly to the carry-in of the
next stage. For an n-bit parallel adder it requires n full adders.
Not very efficient when operands with a large number of bits are used.
Delay increases linearly with bit length.
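The cascaded structure and the linear delay growth can be illustrated with a short model. It is a sketch under simple assumptions: `full_adder`, `ripple_carry_adder` and `rca_delay` are hypothetical names, and the delay helper just evaluates t = (n-1) tc + ts for carry delay tc and sum delay ts per full adder.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_adder(a, b, n, cin=0):
    """n-bit ripple carry adder built by cascading n full adders."""
    carry = cin
    total = 0
    for i in range(n):
        # The carry-out of stage i feeds the carry-in of stage i + 1.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def rca_delay(n, t_carry, t_sum):
    """Worst-case delay t = (n - 1) * tc + ts of an n-bit RCA."""
    return (n - 1) * t_carry + t_sum


print(ripple_carry_adder(200, 100, 8))  # (44, 1): 300 = 256 + 44
print(rca_delay(8, 2, 3))               # 17
```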
Fig. 3.8 A Carry Select Adder with 1 level using n/2- bit RCA
Because of the multiplexers, a larger area is required.
It has a smaller delay than the Ripple Carry Adder (half the delay of the RCA).
Hence the Carry Select Adder is the usual choice when working with a smaller number of
bits.
Let ai and bi be the augends and addend inputs, ci the carry input, si and ci+1, the
sum and carry-out to the ith bit position. If the auxiliary functions, pi and gi are
called the propagate and generate signals, the sum output respectively are defined as
follows.
As we increase the number of bits in the carry look-ahead adder, the complexity
increases because the number of gates in the expression for Ci+1 increases. So in
practice it is not desirable to use the traditional CLA shown above, because it
increases the area required and the power too.
Instead, we use smaller carry look-ahead adders in levels to create
a larger CLA. Commonly, the smaller CLA is taken as a 4-bit CLA, so we can
define carry look-ahead over a group of 4 bits. Hence we now redefine the terms
carry generate as [Group Generated Carry] g [i, i+3] and carry propagate as
[Group Propagated Carry] p [i, i+3], which are defined below.
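A 4-bit look-ahead group and its use in a 16-bit adder can be sketched as below. The sketch assumes the standard definitions p_i = a_i XOR b_i, g_i = a_i AND b_i and c_(i+1) = g_i OR (p_i AND c_i), and the standard expansions of the group signals; the function names `cla4` and `cla16` are hypothetical.

```python
def cla4(a, b, cin):
    """4-bit carry look-ahead block.

    Returns (sum, group_generate, group_propagate, carry_out).
    """
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate signals
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]  # generate signals
    # Every carry is written directly in terms of p, g and cin (no ripple).
    c = [
        cin,
        g[0] | (p[0] & cin),
        g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin),
        g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin),
    ]
    group_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    group_p = p[3] & p[2] & p[1] & p[0]
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    cout = group_g | (group_p & cin)
    return s, group_g, group_p, cout


def cla16(a, b, cin=0):
    """16-bit adder built from four 4-bit groups."""
    total, carry = 0, cin
    for k in range(4):
        # Group carry: C(k+1) = G(k) OR (P(k) AND C(k)). A second-level
        # look-ahead unit would flatten this recurrence in hardware.
        s, _, _, carry = cla4((a >> 4 * k) & 0xF, (b >> 4 * k) & 0xF, carry)
        total |= s << (4 * k)
    return total, carry


print(cla16(1234, 4321))  # (5555, 0)
```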
X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)
The 4-bit BEC is used with a 2:1 multiplexer. One input of the 2:1 MUX is the
output of the 4-bit BEC, and the other input is the output of the 4-bit full adder with input
carry equal to zero. The selection line is the carry of the previous stage, which selects one of
the inputs as the output; if Cin = 1, the output is the 4-bit BEC output.
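Table 3.2 Functional table of the 4-bit BEC
B3 B2 B1 B0    X3 X2 X1 X0
0  0  0  0     0  0  0  1
0  0  0  1     0  0  1  0
0  0  1  0     0  0  1  1
0  0  1  1     0  1  0  0
0  1  0  0     0  1  0  1
0  1  0  1     0  1  1  0
0  1  1  0     0  1  1  1
0  1  1  1     1  0  0  0
1  0  0  0     1  0  0  1
1  0  0  1     1  0  1  0
1  0  1  0     1  0  1  1
1  0  1  1     1  1  0  0
1  1  0  0     1  1  0  1
1  1  0  1     1  1  1  0
1  1  1  0     1  1  1  1
1  1  1  1     0  0  0  0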
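The 4-bit BEC and its pairing with the 2:1 multiplexer can be modelled in a few lines. This is an illustrative sketch: the names `bec4` and `bec_mux` are hypothetical, and the expression used for X3, B3 XOR (B0 AND B1 AND B2), is the usual 4-bit incrementer term and is stated here as an assumption.

```python
def bec4(b):
    """4-bit binary to excess-1 converter: returns (b + 1) modulo 16."""
    b0, b1, b2, b3 = ((b >> i) & 1 for i in range(4))
    x0 = b0 ^ 1
    x1 = b0 ^ b1
    x2 = b2 ^ (b0 & b1)
    x3 = b3 ^ (b0 & b1 & b2)  # assumed standard incrementer term
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def bec_mux(rca_sum, cin):
    """2:1 MUX: cin = 1 selects the BEC output, cin = 0 the RCA sum.

    rca_sum is the 4-bit sum computed with an assumed carry-in of zero.
    """
    return bec4(rca_sum) if cin else rca_sum


print(bec_mux(0b0111, 1))  # 8, i.e. 0b1000
```

The BEC replaces the second ripple carry adder: instead of adding the operands again with carry-in 1, it increments the sum already produced for carry-in 0.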
3.5.5 Multiplexer
In electronics, a multiplexer (or MUX) is a device that selects one of several
analog or digital input signals and forwards the selected input into a single line.
A multiplexer of 2^n inputs has n select lines, which are used to select which input line
to send to the output. Multiplexers are mainly used to increase the amount of data
that can be sent over the network within a certain amount of time and bandwidth. A
multiplexer is also called a data selector. An electronic multiplexer makes it possible
for several signals to share one device or resource, for example one A/D converter or
one communication line, instead of having one device per input signal.
In digital circuit design, the selector wires are of digital value. In the case of a 2-to-1 multiplexer with inputs A and B, a logic value of 0 on the selector connects A to the output, while a logic value
of 1 connects B to the output. In larger multiplexers, the number of selector pins
is equal to ⌈log2(n)⌉, where n is the number of inputs. A 2-to-1 multiplexer has the Boolean
equation Z = (A · S') + (B · S), where A and B are the two inputs, S is the selector input, and Z is the output.
The first class consists of the very slow ripple-carry adder with the smallest area.
In the second class, the carry-skip, carry-select adders with multiple levels have
small area requirements and shortened computation times. From the third class, the
carry-look ahead adder and from the fourth class, the parallel prefix adder represents
the fastest addition schemes with the largest area complexities.
Addition, one of the most frequently used arithmetic
operations, is employed to build advanced operations such as multiplication and
division. Theoretical research has found that the lower bound on the critical path
delay of the adder has complexity O (log n), where n is the adder width. The design
of high performance adders has been extensively studied [10] [15], and several
adders have achieved logarithmic delays. Whereas theoretical bounds indicate that
no traditional adder can achieve sub-logarithmic delay, it has been shown that
speculative adders can achieve sub-logarithmic delays by neglecting rare input
patterns that exercise the critical paths [2, 11, 13]. Furthermore, by augmenting
speculative adders with error detection and recovery, one can construct reliable
variable-latency adders whose average performance is very close to speculative
adders [3, 6, 12, and 17].
Speculative adders are built upon the observation that the critical path is rarely
activated in traditional adders. In traditional adders, each output depends on all
previous (lower or equal significance) bits. In particular, the most significant output
depends on all the n bits, where n is the adder width. In contrast, in speculative
adders [2, 6, 11, 13, 17], each output only depends on the previous k bits rather than
all previous bits, where k is much smaller than n. However, the cumulative error
grows linearly with the adder width since each speculative output can independently
be in error. Moreover, the calculation of each speculative output requires an
individual k-bit adder; hence, such designs also incur large area overhead and large
fanout at the primary inputs. Techniques such as effective sharing [17] can mitigate
but not eliminate fanout and area problems. Although the speculative adder in [18]
can mitigate the area problem, it incurs a fairly high error rate that limits its
application.
For applications where errors cannot be tolerated, a reliable variable latency adder
can be built upon the speculative adder by adding error detection and recovery [3, 6,
12, 17]. For the vast majority of input combinations, the speculative adder produces
correct results; when error detection flags an error, error recovery provides correct
results in one or more extra cycles. Ideally, the average performance of the variable
latency adder should be similar to the speculative one. However, existing variable
latency adders have several drawbacks. When error detection indicates no error, the
actual delay is the longer of the speculative adder and error detection. The delay of
error detection is always longer than the speculative adder [6] [17]. Hence, the
benefit of speculation is limited by the delay of error detection [3] [12]. Besides, the
circuitry for error detection and recovery incurs nontrivial area overhead. Finally,
variable latency adders are mostly restricted for random inputs [3, 12, and 17]. This
thesis first describes a novel function speculation technique, called speculative carry
select addition (SCSA). The key idea is to segment the chain of propagate signals in
addition into blocks of the same size. Specifically, the input bits of addends are
segmented into blocks, and the carry bits between blocks are selectively truncated to
0. SCSA is less susceptible to errors, since it is only applied for blocks instead of
individual outputs.
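One possible reading of this block-wise speculation is sketched below. It is an interpretation for illustration only, not the SCSA circuit itself: each block takes as carry-in the carry-out that the previous block would produce if its own carry-in were truncated to 0, so a carry never travels across more than one block boundary. The names `scsa_add` and `is_exact` are hypothetical.

```python
def scsa_add(a, b, width=16, block=4):
    """Speculative block-wise addition (result truncated to `width` bits)."""
    mask = (1 << block) - 1
    result = 0
    spec_carry = 0  # carry into the first block
    for lo in range(0, width, block):
        x = (a >> lo) & mask
        y = (b >> lo) & mask
        result |= ((x + y + spec_carry) & mask) << lo
        # Speculation: the next block sees this block's carry-out computed
        # as if this block's own carry-in were 0.
        spec_carry = (x + y) >> block
    return result


def is_exact(a, b, width=16, block=4):
    """Error detection by comparison with the exact sum."""
    return scsa_add(a, b, width, block) == (a + b) & ((1 << width) - 1)


print(is_exact(0x1234, 0x1111))  # True: no carry crosses a block boundary
print(is_exact(0x00FF, 0x0001))  # False: the carry must cross two blocks
```

The second call shows the rare failing pattern: a block whose bits all propagate receives a real carry, which the truncated speculation cannot see.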
Finally, the previous variable latency and speculative adders are mainly designed
for unsigned random inputs, so this thesis proposes the modified variable latency
and speculative adders suitable for both random and Gaussian inputs. With modified
speculative adder and error detection block, the variable latency adder still achieves
high performance when 2's complement Gaussian inputs present. This shows that the
variable latency adder design is feasible for practical applications.
clear picture of which adder suits best in which type of situation during design
process. Hence below we present both the theoretical and practical comparisons of
all three adders which were taken into consideration.
Fig. 3.10 Proposed SQRT-CSLA for n = 16. All intermediate and output signals
are labeled with delay
We have considered the cascaded configuration of (2-bit RCA and 2-bit, 3-bit, 4-bit, 6-bit, 7-bit, and 8-bit CSLAs) and (2-bit RCA and 2-bit, 3-bit, 4-bit, 6-bit, 7-bit, 8-bit, 9-bit,
11-bit, and 12-bit CSLAs), respectively, for the 32-bit SQRT-CSLA and the 64-bit
SQRT-CSLA to optimize adder delay. To demonstrate the advantage of the
proposed CSLA design in SQRT-CSLA, we have estimated the area and delay of
SQRT-CSLA using the proposed CSLA design and the BEC-based CSLA of [6] and
the CBL-based CSLA of [7] for bit-widths 16 and 32.
CHAPTER 4
RESULTS ANALYSIS
In this chapter, the synthesis and simulation results of the proposed method are reported.
Fig 4.1 (a) Simulation Waveform Result of 8-bit Ripple Carry Adder
Table 4.1 Device Utilization summary of 8-bit Ripple Carry Adder
Device utilization summary
Selected Device: 3s1600efg484-4
Number of Slices: 9 out of 14752
Number of 4 input LUTs: 16 out of 29504
Number of IOs: 26
Number of bonded IOBs: 26 out of 376 (6%)

Selected Device: 3s1600efg484-4
Number of Slices: 9 out of 14752
Number of 4 input LUTs: 16 out of 29504
Number of IOs: 26
Number of bonded IOBs: 26 out of 376 (6%)

Advanced HDL Synthesis Report
Macro Statistics
# Xors: 9
1-bit xor2: 8
8-bit xor2: 1

Selected Device: 3s1600efg484-4
Number of Slices: 14 out of 14752
Number of 4 input LUTs: 26 out of 29504
Number of IOs: 26
Number of bonded IOBs: 26 out of 376 (6%)
Fig 4.4 (a) Simulation Waveform Result of 16-bit Ripple Carry Adder
Table 4.5 Device Utilization summary of 16-bit Ripple Carry Adder
Device utilization summary
Selected Device: 3s1600efg484-4
Number of Slices: 18 out of 14752
Number of 4 input LUTs: 32 out of 29504
Number of IOs: 50
Number of bonded IOBs: 50 out of 376 (13%)

Selected Device: 3s1600efg484-4
Number of Slices: 18 out of 14752
Number of 4 input LUTs: 32 out of 29504
Number of IOs: 50
Number of bonded IOBs: 50 out of 376 (13%)

Advanced HDL Synthesis Report
Macro Statistics
# Xors: 17
1-bit xor2: 16
16-bit xor2: 1

Selected Device: 3s1600efg484-4
Number of Slices: 34 out of 14752
Number of 4 input LUTs: 63 out of 29504
Number of IOs: 50
Number of bonded IOBs: 50 out of 376 (13%)
Fig 4.7 (a) Simulation Waveform Result of 32-bit Ripple Carry Adder
Table 4.9 Device Utilization summary of 32-bit Ripple Carry Adder
Device utilization summary
Selected Device: 3s1600efg484-4
Number of Slices: 37 out of 14752
Number of 4 input LUTs: 64 out of 29504
Number of IOs: 98
Number of bonded IOBs: 98 out of 376 (26%)

Selected Device: 3s1600efg484-4
Number of Slices: 50 out of 14752
Number of 4 input LUTs: 91 out of 29504
Number of IOs: 98
Number of bonded IOBs: 98 out of 376 (26%)

Advanced HDL Synthesis Report
Macro Statistics
# Xors: 33
1-bit xor2: 32
32-bit xor2: 1

Selected Device: 3s1600efg484-4
Number of Slices: 76 out of 14752
Number of 4 input LUTs: 140 out of 29504
Number of IOs: 98
Number of bonded IOBs: 98 out of 376 (26%)
As for the transistor count of the 32-bit carry select adder, the transistor count of our
proposed area-efficient carry select adder can be reduced to be very close to that of the
ripple carry adder; however, the transistor count of the conventional carry select
adder is nearly double that of the proposed design. This result shows that
sharing common Boolean logic terms can indeed achieve superior performance in
terms of transistor count. As the input bit width of the conventional carry select
adder increases to 32 bits, the power consumption of the conventional carry select
adder is 3.3 times larger than that of our proposed area-efficient carry select
adder.
The delay of the 8-bit, 16-bit, 32-bit, and 64-bit proposed SQRT
CSLA is reduced by 4.6%, 49.3%, 44.5%, and 59.08%, respectively, when compared
to the regular SQRT CSLA. The power reduction of the proposed design when compared to the
regular SQRT CSLA for 8-bit, 16-bit, 32-bit and 64-bit widths is 10.8%, 17.73%, 20.01% and
21.9%, respectively.
Functional verification (simulation) and synthesis (conversion of the high-level description
into RTL) of all the adders are performed, and the results are summarized.
SQRT-CSLA (CBL) [7]    SQRT-CSLA proposed
Width (n)   Delay (ns)   Area (µm²)   ADP (µm²·µs)   EADP (%)
16          7.38         1813.71      13.39          161.41
32          14.58        3627.42      52.89          280.64
64          28.98        7254.84      210.25         436.55

16          3.0          1706.80      5.12           --
32          3.85         3608.98      13.89          --
64          5.27         7435.46      39.18          --

16          2.0          1574.54      4.10           --
32          2.75         2989.99      11.89          --
64          4.28         6553.24      32.10          --
Design                 Width (n)   Delay (ns)   Area (µm²)   Power (µW)
SQRT-CSLA (BEC) [7]    16          5.61         2890.52      30.5673
                       32          6.56         6100.34      60.2537
                       64          8.37         12613.2      113.6457
SQRT-CSLA (CBL)        16          10.45        1722.96      12.8662
                       32          18.72        2765.38      17.7900
                       64          35.10        5530.56      91.1744
SQRT-CSLA proposed     16          5.55         1813.45      19.6652
                       32          6.59         3735.36      38.1886
                       64          8.35         7603.89      70.62442
Device utilization (row labels not recoverable from the source)
Used   Available   Utilization
91     29504       1%
51     14752       1%
51     51          100%
                   0%
97     376         1%
Word size   Adder           Delay (ns)   Area   Leakage power (µW)   Switching power (µW)   Total power (µW)   Power-delay product (10^-15)   Area-delay product (10^-25)
8 bit       Regular CSLA    1.719        991    0.007                101.9                  203.9              350.5                          1703.5
8 bit       Modified CSLA   1.958        895    0.006                94.2                   188.4              368.8                          1752.5
16 bit      Regular CSLA    2.775        2272   0.017                263.7                  527.4              1463.8                         6304.8
16 bit      Modified CSLA   3.048        1929   0.013                235.9                  471.8              1438.0                         5879.6
32 bit      Regular CSLA    5.137        4783   0.036                563.6                  1127.3             5790.9                         24570.2
32 bit      Modified CSLA   5.482        3985   0.027                484.9                  969.9              5316.9                         21848.5
64 bit      Regular CSLA    9.174        9916   0.075                1212.4                 2425.0             22245.9                        90969.3
64 bit      Modified CSLA   9.519        8183   0.057                1025.0                 2050.1             19514.9                        77893.9
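The product columns and the percentage savings can be recomputed from the delay, area, and total power figures of the Regular/Modified CSLA comparison. The sketch below is a hedged illustration: it assumes that, in each pair of rows, the entry with the smaller delay and larger area is the regular CSLA, and that the power-delay and area-delay products are plain products of the tabulated values. The names `REGULAR`, `MODIFIED`, `reduction_percent`, and `summarize` are hypothetical.

```python
# (delay in ns, area, total power in uW) per word size, copied from the
# Regular/Modified CSLA comparison table. The regular/modified assignment
# of each row is an assumption (regular = smaller delay, larger area).
REGULAR = {8: (1.719, 991, 203.9), 16: (2.775, 2272, 527.4),
           32: (5.137, 4783, 1127.3), 64: (9.174, 9916, 2425.0)}
MODIFIED = {8: (1.958, 895, 188.4), 16: (3.048, 1929, 471.8),
            32: (5.482, 3985, 969.9), 64: (9.519, 8183, 2050.1)}


def reduction_percent(regular: float, modified: float) -> float:
    """Percentage by which the modified value is lower than the regular one."""
    return 100.0 * (regular - modified) / regular


def summarize(width: int) -> dict:
    """Savings of the modified CSLA relative to the regular CSLA for one width."""
    d_r, a_r, p_r = REGULAR[width]
    d_m, a_m, p_m = MODIFIED[width]
    return {
        "delay_overhead_%": -reduction_percent(d_r, d_m),
        "area_reduction_%": reduction_percent(a_r, a_m),
        "power_reduction_%": reduction_percent(p_r, p_m),
        "pdp_reduction_%": reduction_percent(d_r * p_r, d_m * p_m),
        "adp_reduction_%": reduction_percent(d_r * a_r, d_m * a_m),
    }


if __name__ == "__main__":
    for width in sorted(REGULAR):
        row = summarize(width)
        print(width, {key: round(value, 2) for key, value in row.items()})
```

A negative reduction means the modified design is worse on that metric for that word size, so the output shows directly where the delay penalty outweighs the area and power savings.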
4.4 Applications
Arithmetic Logic units
High Speed Multiplication
Advanced Microprocessor Design
Digital Signal Processing
4.5 Advantages
Low Power Consumption
Less Area (Less Complexity)
Higher Speed Compared to the Regular CSLA
CHAPTER 5
CONCLUSION & FUTURE SCOPE
5.1 Conclusion
In this project, a simple approach has been used to reduce the area and
power of the SQRT CSLA architecture. The number of gates has been reduced,
which offers a significant advantage in area and power reduction. The
simulation results indicate that the modified SQRT CSLA has a larger delay,
whereas the area and power of the 32-bit modified SQRT CSLA are significantly
reduced. The delay calculations used here can be carried out with the Mentor
Graphics tools.
REFERENCES
[1] B. Ramkumar and Harish M. Kittur, Low-Power and Area-Efficient Carry
Select Adder, IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 20, no. 2, February 2012.
[2] G. A. Ruiz and M. Granda, An area-efficient static CMOS carry-select adder
based on a compact carry look-ahead unit, Microelectronics Journal, vol. 35,
pp. 939-944, 2004, Elsevier Ltd.
[3] O. J. Bedrij, Carry-select adder, IRE Transactions on Electronic
Computers, pp. 340-344, 1962.
[4] Y. Kim and L. S. Kim, 64-bit carry-select adder with reduced area,
Electron. Lett., vol. 37, no. 10, pp. 614-615, May 2001.
[5] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Upper
Saddle River, NJ: Prentice-Hall, 2001.
[6] Cadence, Encounter User Guide, Version 6.2.4, March 2008.
[7] T. Y. Chang and M. J. Hsiao, Carry-select adder using single ripple-carry
adder, Electronics Letters, vol. 34, no. 22, pp. 2101-2103, Oct. 1998.
[8] Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs.
[9] Review on Carry Skip Adder and Gray/Black Cell Function, Lecture 18,
Datapath Subsystems, Chapter 10. Copyright 2005 Pearson Addison-Wesley.
[10] Gray Yeap and Gilbert, Practical Low Power Digital VLSI Design, Kluwer
Academic Publishers, 1998.
[11] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd ed.
New York, NY, USA: Oxford Univ. Press, 2010.
[12] K. K. Parhi, VLSI Digital Signal Processing. New York, NY, USA: Wiley,
1998.
[17] R. Uma, Vidya Vijayan, M. Mohanapriya, and Sharon Paul, Area, Delay and
Power Comparison of Adder Topologies, International Journal of VLSI Design and
Communication Systems (VLSICS), vol. 3, no. 1, February 2012.
[18] S. Manju and V. Sornagopal, An efficient SQRT architecture of carry
select adder design by common Boolean logic, in Proc. VLSI ICEVENT, 2013,
pp. 1-5.
[20] Y. He, C. H. Chang, and J. Gu, An area efficient 64-bit square root
Carry-Select Adder for low power applications, in Proc. IEEE Int. Symp.
Circuits Syst., 2005, vol. 4, pp. 4082-4085.
PG Student,
Department of ECE,
Sri Mittapalli Institute of Technology for Women,
Guntur, Andhra Pradesh, India.
Professor,
Department of ECE,
Sri Mittapalli Institute of Technology for Women,
Guntur, Andhra Pradesh, India.
ABSTRACT:
Carry Select Adder (CSLA) is one of the fastest adders used in many
data-processing processors to perform fast arithmetic functions. From the
structure of the CSLA, it is clear that there is scope for reducing the area
and power consumption in the CSLA. This work uses a simple and efficient
gate-level modification to significantly reduce the area and power of the
CSLA. Based on this modification, 8-bit, 16-bit, 32-bit, and 64-bit
square-root CSLA (SQRT CSLA) architectures have been developed and compared
with the regular SQRT CSLA architecture. The proposed design has reduced area
and power as compared with the regular SQRT CSLA, with only a slight increase
in the delay. This work evaluates the performance of the proposed designs in
terms of delay, area, power, and their products by hand with logical effort
and through custom design and layout in 0.18-µm CMOS process technology. The
results analysis shows that the proposed CSLA structure is better than the
regular SQRT CSLA.
Keywords:
SQRT CSLA, area efficient, CSLA, low power, delay efficient.
I. INTRODUCTION:
The design of area- and power-efficient high-speed data path logic systems is
one of the most substantial areas of research in VLSI system design. In
digital adders, the speed of addition is limited by the time required to
propagate a carry through the adder. The sum for each bit position in an
elementary adder is generated sequentially only after the previous bit
position has been summed and a carry propagated into the next position. The
CSLA is used in many computational systems to alleviate the problem of carry
propagation delay by independently generating multiple carries and then
selecting a carry to generate the sum [1].
However, the CSLA [3] is not area efficient because it uses multiple pairs of
Ripple Carry Adders (RCA) to generate the partial sum and carry for both
carry-input values (0 and 1); the final sum and carry are then selected by
multiplexers (mux). The basic idea of this work is to use a Binary to Excess-1
Converter (BEC) instead of the RCA with carry input 1 in the regular CSLA to
achieve lower area and power consumption [2]-[4]. The main advantage of the
BEC logic comes from its smaller number of logic gates compared with the n-bit
Full Adder (FA) structure.
This paper covers the delay and area evaluation methodology of the basic adder
blocks and presents the detailed structure and function of the BEC logic. The
SQRT CSLA has been chosen for comparison with the proposed design as it has a
more balanced delay and requires lower power and area [5], [6]. The delay and
area evaluation methodology of the regular and modified SQRT CSLA is also
presented. The rest of the paper is organised as follows. In Section II, logic
formulation is presented. In Section III, the proposed adder design is
explained. In Section IV, the proposed scheme is compared to the previously
proposed ones and results are shown. Finally, Section V concludes this paper.
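The BEC substitution described in the introduction can be sketched behaviourally. The Python model below is an illustration under stated assumptions, not the authors' implementation: each group computes its sum once with a ripple carry adder for carry-in 0, a BEC adds 1 to that result to obtain the carry-in 1 case, and the incoming carry selects between the two. The group sizes (2, 2, 3, 4, 5) for a 16-bit adder and all function names are assumptions made for the example, and for simplicity every group, including the first, uses the same RCA-plus-BEC structure.

```python
def to_bits(value: int, width: int) -> list:
    """Little-endian list of the lowest `width` bits of value."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: list) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits: list, b_bits: list, cin: int = 0) -> tuple:
    """Chain of full adders over little-endian bit lists; returns (sum_bits, cout)."""
    out, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def bec(bits: list) -> list:
    """Binary to Excess-1 Converter: X0 = NOT B0, Xi = Bi XOR (B0 AND ... AND B(i-1))."""
    out, lower_all_ones = [], 1
    for b in bits:
        out.append(b ^ lower_all_ones)
        lower_all_ones &= b
    return out


def modified_csla_add(a: int, b: int, width: int = 16,
                      group_sizes: tuple = (2, 2, 3, 4, 5), cin: int = 0) -> tuple:
    """Carry select addition with one RCA (cin = 0) and one BEC per group."""
    assert sum(group_sizes) == width
    a_bits, b_bits = to_bits(a, width), to_bits(b, width)
    result, carry, pos = [], cin, 0
    for size in group_sizes:
        sum0, c0 = ripple_carry_add(a_bits[pos:pos + size], b_bits[pos:pos + size], 0)
        plus_one = bec(sum0 + [c0])            # (size + 1)-bit BEC gives the cin = 1 result
        sum1, c1 = plus_one[:-1], plus_one[-1]
        group_sum, carry = (sum1, c1) if carry else (sum0, c0)   # multiplexer
        result += group_sum
        pos += size
    return from_bits(result), carry


if __name__ == "__main__":
    print(modified_csla_add(12345, 40000))
```

The BEC never overflows its (size + 1)-bit input here, because the RCA result with carry-in 0 is at most two less than the largest (size + 1)-bit value.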
January 2016
Page 120
Here, (Na, No, Ni) and (na, no, ni), respectively, represent the (AND, OR,
NOT) gate counts of the total design and of its critical path, and (a, r, i)
and (Ta, To, Ti), respectively, represent the area and delay of one (AND, OR,
NOT) gate. We have calculated the AOI (AND, OR, NOT) gate counts of each
design for area and delay estimation. The area and delay of each design are
calculated from the AOI gate counts (Na, No, Ni) and (na, no, ni) and the cell
details of Table I. For the proposed n-bit CSLA design of Fig. 3, the delay of
each intermediate and output signal is shown in square brackets against the
signal. We can observe that the proposed n-bit single-stage CSLA involves 6n
fewer AOI gates than the CSLA of [6] and takes 2.7 and 6.6 units less delay to
calculate the final sum and the output carry, respectively. Compared with the
CBL-based CSLA of [7], the proposed CSLA design involves n more AOI gates, and
it takes (n - 4.7) units less delay to calculate the output carry.
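The AOI-based estimate can be written as two weighted sums. The sketch below assumes the forms A = a·Na + r·No + i·Ni for area and T = na·Ta + no·To + ni·Ti for delay, which are inferred from the symbol definitions in the text rather than quoted from it. The per-gate values are placeholders (one unit per gate), not the cell data of Table I, and the names `GateCounts`, `estimate_area`, and `estimate_delay` are hypothetical. The example counts are for a one-bit full adder built from AND, OR, and NOT gates, where each XOR is expanded into 2 AND, 1 OR, and 2 NOT gates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCounts:
    """Numbers of AND, OR and NOT gates (whole design or critical path)."""
    and_gates: int
    or_gates: int
    not_gates: int


def weighted_sum(counts: GateCounts, per_gate: tuple) -> float:
    """Multiply each gate count by its per-gate cost and add the results."""
    cost_and, cost_or, cost_not = per_gate
    return (counts.and_gates * cost_and
            + counts.or_gates * cost_or
            + counts.not_gates * cost_not)


def estimate_area(total: GateCounts, cell_area: tuple) -> float:
    """Assumed form: A = a*Na + r*No + i*Ni, with cell_area = (a, r, i)."""
    return weighted_sum(total, cell_area)


def estimate_delay(critical_path: GateCounts, cell_delay: tuple) -> float:
    """Assumed form: T = na*Ta + no*To + ni*Ti, with cell_delay = (Ta, To, Ti)."""
    return weighted_sum(critical_path, cell_delay)


# Full adder: sum = (a XOR b) XOR cin, carry = a.b + cin.(a XOR b).
# Two XORs contribute 4 AND, 2 OR, 4 NOT; the carry adds 2 AND and 1 OR.
FA_TOTAL = GateCounts(and_gates=6, or_gates=3, not_gates=4)
# Longest path runs through both XORs: NOT -> AND -> OR, twice.
FA_CRITICAL_PATH = GateCounts(and_gates=2, or_gates=2, not_gates=2)

UNIT_COST = (1.0, 1.0, 1.0)  # placeholder: one unit of area and delay per gate

if __name__ == "__main__":
    print("area units:", estimate_area(FA_TOTAL, UNIT_COST))
    print("delay units:", estimate_delay(FA_CRITICAL_PATH, UNIT_COST))
```

Replacing `UNIT_COST` with the per-gate area and delay of a real cell library turns the same gate counts into technology-specific estimates.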
Fig.4. Proposed SQRT-CSLA for n = 16. All intermediate and output signals are labelled with delay
IV. RESULTS & DISCUSSION:
In this section, we present the experimental results. In Section IV-A, the
proposed method is compared with the conventional methods. Functional
verification (simulation) and synthesis (conversion of the high-level
description into RTL) are performed for all the adders, and the results are
summarized. After the simulation waveforms are observed, synthesis is
performed to calculate delay and area; from these, the speed and power of the
CSLAs are calculated, and the regular, modified, and improved CSLAs are
compared in terms of delay, area, and power.
The area is the total cell area of the design, and the total power is the sum
of the leakage power, internal power, and switching power. The percentage
reductions in cell area, total power, power-delay product, and area-delay
product as a function of bit size are shown.
A. PERFORMANCE COMPARISON:
In this section, the proposed method is compared with the 32-bit ripple carry
adder.
Area-Delay Estimation Method: The comparison of the proposed system with the
32-bit RCA is shown in Table 1. The delay is calculated by adding up the
number of gates in the longest path of the logic block that contributes the
maximum delay. The area evaluation is done by counting the total number of AOI
gates required for each logic block. The main disadvantage of the regular CSLA
is its high area usage, which can be overcome by using the modified CSLA.
Table 1 shows the area and delay of the AND, OR, and NOT gates given in the
90-nm standard cell library datasheet, for the proposed system compared with
the 32-bit RCA. Here, we show the power-delay and critical-path-delay results.
V. CONCLUSION:
A simple approach is proposed in this paper to reduce the area and power of
the SQRT CSLA architecture. The reduced number of gates offers a great
advantage in the reduction of area and total power. The compared results show
that the modified SQRT CSLA has a slightly larger delay, but the area and
power of the 32-b modified SQRT CSLA are significantly reduced, by 17.4% and
15.4%, respectively. The power-delay product and the area-delay product of the
proposed design show a decrease for the 16- and 32-b sizes, which indicates
the success of the method and not a mere trade-off of delay for power and
area. The modified CSLA architecture is therefore low area, low power, simple,
and efficient for VLSI hardware implementation. It would be interesting to
test the design of the modified SQRT CSLA.
REFERENCES:
[1] B. Ramkumar and Harish M. Kittur, Low-Power and Area-Efficient Carry
Select Adder, IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 20, no. 2, February 2012.
[2] G. A. Ruiz and M. Granda, An area-efficient static CMOS carry-select adder
based on a compact carry look-ahead unit, Microelectronics Journal, vol. 35,
pp. 939-944, 2004, Elsevier Ltd.
[3] O. J. Bedrij, Carry-select adder, IRE Transactions on Electronic
Computers, pp. 340-344, 1962.