CSLA Implementation Technique to Minimize the Area, Power and Delay
A Project Report
submitted in partial fulfillment of the requirements for the award of the degree of
MASTER OF TECHNOLOGY
in
VLSI & EMBEDDED SYSTEMS
by
G.BHAGYA SRI (13MK1D6805)
under the esteemed guidance of
Prof. P.BALA MURALI KRISHNA

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING


SRI MITTAPALLI INSTITUTE OF TECHNOLOGY FOR WOMEN
(Approved by AICTE, New Delhi & Affiliated to JNTU, Kakinada)
NH-5, TUMMALAPALEM, GUNTUR-522233, A.P.
2013-2015

SRI MITTAPALLI INSTITUTE OF TECHNOLOGY FOR WOMEN


(Approved by AICTE, New Delhi & Affiliated to JNTU, Kakinada)
NH5, TUMMALAPALEM, GUNTUR-522233, A.P.

Department of Electronics and Communication Engineering

CERTIFICATE
This is to certify that the project report entitled "CSLA IMPLEMENTATION
TECHNIQUE TO MINIMIZE THE AREA, POWER AND DELAY" is being submitted by
GUTTIKONDA BHAGYA SRI (13MK1D6805) of SRI MITTAPALLI INSTITUTE OF TECHNOLOGY
FOR WOMEN, GUNTUR, in partial fulfillment of the requirements for the award of the
degree of Master of Technology in VLSI & EMBEDDED SYSTEMS to Jawaharlal Nehru
Technological University, Kakinada, during the year 2013-2015.

PROJECT GUIDE
Prof. P.Bala Murali Krishna
Professor
Department of ECE
SMITW

HEAD OF THE DEPARTMENT


G. Suseelamma
Associate Professor
Department of ECE
SMITW

EXTERNAL EXAMINER

ACKNOWLEDGEMENT

I express my sincere thanks to Sri M.V. Koteswara Rao, Chairman, and Sri
M.B.V. Satyanarayana, Secretary and Correspondent of Sri Mittapalli Institute of Technology for
Women, Guntur, for providing the facilities to carry out this project.

It is an honor to express my deep sense of gratitude to our principal and my project
guide, Prof. P. Bala Murali Krishna, Department of ECE, Sri Mittapalli Institute of Technology
for Women, Guntur, for his valuable guidance, constant encouragement, and for every scientific
and personal concern throughout the course of the investigation and the successful completion of
this work.

I wish to extend my sincere thanks to G. Suseelamma, Head of the Department of ECE, Sri
Mittapalli Institute of Technology for Women, Guntur, for her constant support and encouragement,
and for enabling me to do a work of this magnitude.

My sincere thanks go to the teaching and non-teaching staff members of the ECE Department of Sri
Mittapalli Institute of Technology for Women, Guntur.

Lastly, I bow to my affectionate parents for their love and blessings, which have sustained me
in completing this project work successfully.

BY
G. BHAGYA SRI
(13MK1D6805)

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES

CHAPTER 1: INTRODUCTION
1.1 Introduction to VLSI
1.2 Objective
1.3 Existing System
1.3.1 Existing System Disadvantages
1.4 Proposed System
1.5 Project Outline

CHAPTER 2: LITERATURE REVIEW
2.1 CMOS Technology
2.1.1 CMOS Transmission Gate
2.1.2 Fabrication Technology
2.2 FPGA Design Flow
2.3 FPGA Performance
2.4 Basic FPGA Architecture
2.5 FPGA Design and Programming
2.6 VHDL & Verilog

CHAPTER 3: DESIGN APPROACH
3.1 Overview of Carry Select Adder
3.2 Operation
3.3 Why we replaced Regular CSLA with Modified CSLA?
3.4 Logic Formulation
3.4.1 Logic Expressions of the SCG Unit of the Conventional CSLA
3.4.2 Logic Expression of the SCG Unit of the BEC-Based CSLA
3.5 Proposed Adder Design
3.5.1 Ripple Carry Adder (RCA)
3.5.2 Carry Select Adders (CSLA)
3.5.3 Carry Look Ahead Adders (CLA)
3.5.4 Binary to Excess-1 Converter
3.5.5 Multiplexer
3.6 Analysis of Adders
3.7 Square Root CSLA (SQRT-CSLA)

CHAPTER 4: RESULTS ANALYSIS
4.1 Performance Evaluation
4.1.1 Ripple Carry Adder (8-bit)
4.1.2 CSA (8-bit)
4.1.3 Proposed CSA (8-bit)
4.2.1 Ripple Carry Adder (16-bit)
4.2.2 CSA (16-bit)
4.2.3 Proposed CSA (16-bit)
4.3.1 Ripple Carry Adder (32-bit)
4.3.2 CSA (32-bit)
4.3.3 Proposed CSA (32-bit)
4.2 Performance Comparison
4.3 Synthesis Report
4.4 Applications
4.5 Advantages

CHAPTER 5: CONCLUSION AND FUTURE SCOPE
5.1 Conclusion
5.2 Future Scope

REFERENCES

ABSTRACT
With the advancements in semiconductor technology, there has been an increased emphasis on
low-power design techniques over the last few decades. Reversible computing has been proposed
by several researchers as a possible alternative to address the energy dissipation problem. This
project describes the design of the Mach-Zehnder Interferometer and reviews its applications in
emerging optical communication networks. The Mach-Zehnder Interferometer is used to measure the
relative phase shift between two collimated beams from a coherent light source. Using this basic
principle, a number of devices have been designed; a few of these, such as optical sensors,
all-optical switches, the optical add-drop multiplexer and an implementation of the sum function,
are discussed in this project.

LIST OF FIGURES

Fig. 2.1 MOS Transistor
Fig. 3.1 Block diagram of regular CSLA
Fig. 3.2 Block diagram of modified CSLA
Fig. 3.3 The 5-bit Binary to Excess-1 Code Converter:
    (a) BEC (without carry)
    (b) BECWC (with carry)
Fig. 3.4 (a) Conventional CSLA; n is the input operand bit-width
    (b) The logic operations of the RCA shown in split form
Fig. 3.5 Structure of the BEC-based CSLA; n is the input operand bit-width
Fig. 3.6 (a) Proposed CS adder design, where n is the input operand bit-width
    (b) Gate-level design of the HSG
    (c) Gate-level optimized design of (CG0) for input-carry = 0
    (d) Gate-level optimized design of (CG1) for input-carry = 1
    (e) Gate-level design of the CS unit
    (f) Gate-level design of the final sum generation (FSG)
Fig. 3.7 A 4-bit Ripple Carry Adder
Fig. 3.8 A Carry Select Adder with 1 level using n/2-bit RCA
Fig. 3.9 34-BIT CLA Logic equations
Fig. 3.10 Proposed SQRT-CSLA for n = 16. All intermediate and output signals are labeled with delay
Fig. 4.1 (a) Simulation Waveform Result of 8-bit Ripple Carry Adder
Fig. 4.1 (b) RTL diagram of 8-bit Ripple Carry Adder
Fig. 4.2 (a) Simulation Waveform Result of 8-bit CSA
Fig. 4.2 (b) RTL diagram of 8-bit CSA
Fig. 4.3 (a) Simulation Waveform Result of 8-bit Proposed CSA
Fig. 4.3 (b) Design Summary of 8-bit Proposed CSA
Fig. 4.3 (c) RTL diagram of 8-bit Proposed CSA
Fig. 4.4 (a) Simulation Waveform Result of 16-bit Ripple Carry Adder
Fig. 4.4 (b) RTL diagram of 16-bit Ripple Carry Adder
Fig. 4.5 (a) Simulation Waveform Result of 16-bit CSA
Fig. 4.5 (b) RTL diagram of 16-bit CSA
Fig. 4.6 (a) Simulation Waveform Result of 16-bit Proposed CSA
Fig. 4.6 (b) Design Summary of 16-bit Proposed CSA
Fig. 4.6 (c) RTL diagram of 16-bit Proposed CSA
Fig. 4.7 (a) Simulation Waveform Result of 32-bit Ripple Carry Adder
Fig. 4.7 (b) RTL diagram of 32-bit Ripple Carry Adder
Fig. 4.8 (a) Simulation Waveform Result of 32-bit CSA
Fig. 4.8 (b) RTL diagram of 32-bit CSA
Fig. 4.9 (a) Simulation Waveform Result of 32-bit Proposed CSA
Fig. 4.9 (b) Design Summary of 32-bit Proposed CSA
Fig. 4.9 (c) RTL diagram of 32-bit Proposed CSA

LIST OF TABLES

Table 3.1 Truth table
Table 3.2 Functional table of the 4-bit BEC
Table 3.3 Categorization of adders w.r.t. delay time and capacity
Table 3.4 Theoretical Comparison of Area Occupied
Table 3.5 Theoretical Comparison of Time Required
Table 3.6 Theoretical Area Delay Product (AxT)
Table 3.7 Comparison of Time Required (Simulated value)
Table 4.1 Device Utilization Summary of 8-bit Ripple Carry Adder
Table 4.2 Device Utilization Summary of 8-bit CSA
Table 4.3 Synthesis Report of 8-bit Proposed CSA
Table 4.4 Device Utilization Summary of 8-bit Proposed CSA
Table 4.5 Device Utilization Summary of 16-bit Ripple Carry Adder
Table 4.6 Device Utilization Summary of 16-bit CSA
Table 4.7 Synthesis Report of 16-bit Proposed CSA
Table 4.8 Device Utilization Summary of 16-bit Proposed CSA
Table 4.9 Device Utilization Summary of 32-bit Ripple Carry Adder
Table 4.10 Device Utilization Summary of 32-bit CSA
Table 4.11 Synthesis Report of 32-bit Proposed CSA
Table 4.12 Device Utilization Summary of 32-bit Proposed CSA
Table 4.13 Theoretical Estimation
Table 4.14 Comparison of Post-Layout Synthesis Results
Table 4.15 Design Summary
Table 4.16 Comparison of the Regular and Modified SQRT CSLA

CHAPTER 1
INTRODUCTION
This chapter gives an introduction to VLSI and presents the objective, the existing
system, the proposed system and the project outline.

1.1 Introduction to VLSI


VLSI Design presents state-of-the-art papers in VLSI design, computer aided design,
design analysis, design implementation, simulation and testing. Its scope also includes
papers that address technical trends, pressing issues, and educational aspects in VLSI
Design. The Journal provides a dynamic high quality international forum for original
papers and tutorials by academic, industrial, and other scholarly contributors in VLSI
Design.
The development of microelectronics spans a time shorter than the average life
expectancy of a human, and yet it has seen as many as four generations. The early
1960s saw the low-density fabrication processes classified under Small Scale
Integration (SSI), in which the transistor count was limited to about 10. This rapidly
gave way to Medium Scale Integration (MSI) in the late 1960s, when around 100
transistors could be placed on a single chip. It was the time when the cost of research
began to decline and private firms started entering the competition, in contrast to the
earlier years when the main burden was borne by the military. Transistor-Transistor
Logic (TTL), offering higher integration densities, outlasted other IC families like ECL
and became the basis of the first integrated circuit revolution. It was the production of
this family that gave impetus to semiconductor giants like Texas Instruments, Fairchild
and National Semiconductor.
The early 1970s marked the growth of the transistor count to about 1000 per chip,
called Large Scale Integration (LSI). By the mid-1980s, the transistor count on a single
chip had already exceeded 100,000, and hence came the age of Very Large Scale
Integration, or VLSI. Though many improvements have been made and the transistor
count is still rising, further names of generations like ULSI are generally avoided. It was
during this time that TTL lost the battle to the MOS family owing to the same problems
that had pushed vacuum tubes into obsolescence: power dissipation and the limit it
imposed on the number of gates that could be placed on a single die. The second age of
the integrated circuit revolution started with the introduction of the first
microprocessor, the 4004, by Intel in 1971, followed by the 8080 in 1974. Today many
companies like Texas Instruments, Infineon, Alliance Semiconductor, Cadence,
Synopsys, Celox Networks, Cisco, Micron Tech, National Semiconductor,
ST Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog Devices, Intel,
Philips, Motorola and many other firms are dedicated to the various fields in "VLSI"
such as Programmable Logic Devices, Hardware Description Languages, design tools
and embedded systems.
The term VLSI is a holdover from the 1980s, from a now outdated taxonomy of
integration levels that was evidently modeled on the naming of frequency bands (HF,
VHF and UHF). Sources disagree on what is measured (gates or transistors):
SSI Small-Scale Integration (0 to 10^2)
MSI Medium-Scale Integration (10^2 to 10^3)
LSI Large-Scale Integration (10^3 to 10^5)
VLSI Very Large-Scale Integration (10^5 to 10^7)
ULSI Ultra Large-Scale Integration (>= 10^7)
VLSI Technology Inc. was a company which designed and manufactured custom
and semi-custom ICs. The company was based in Silicon Valley, with headquarters
at 1109 McKay Drive in San Jose, California. Along with LSI Logic, VLSI
Technology defined the leading edge of the application-specific integrated circuit
(ASIC) business, which accelerated the push of powerful embedded systems into
affordable products. The company was founded in 1979 by a trio from Fairchild
Semiconductor by way of Synertek (Jack Balletto, Dan Floyd and Gunnar Wetlesen) and by
Doug Fairbairn of Xerox PARC and Lambda (later VLSI Design) magazine.
Alfred J. Stein became the CEO of the company in 1982. Subsequently VLSI built
its first fab in San Jose; eventually a second fab was built in San Antonio, Texas.
VLSI had its initial public offering in 1983, and was listed on the stock market as
(NASDAQ: VLSI). The company was later acquired by Philips and survives to this
day as part of NXP Semiconductors.
The first semiconductor chips held two transistors each. Subsequent advances
added more and more transistors, and, as a consequence, more individual functions
or systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making
it possible to fabricate one or more logic gates on a single device. Now known
retrospectively as small-scale integration (SSI), improvements in technique led to
devices with hundreds of logic gates, known as medium-scale integration (MSI).
Further improvements led to large scale integration (LSI), i.e. systems with at least a
thousand logic gates. Current technology has moved far past this mark and today's
microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used.
But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use.
As of early 2008, billion transistor processors are commercially available. This is
expected to become more commonplace as semiconductor fabrication moves from
the current generation of 65nm processes to the next 45nm generations (while
experiencing new challenges such as increased variation across process corners). A
notable example is Nvidia's 280 series GPU. This GPU is unique in the fact that
almost all of its 1.4 billion transistors are used for logic, in contrast to the Itanium,
whose large transistor count is largely due to its 24MB L3 cache. Current designs, as
opposed to the earliest devices, use extensive design automation and automated logic
synthesis to lay out the transistors, enabling higher levels of complexity in the
resulting logic functionality. Certain high performance logic blocks like the SRAM
(Static Random Access Memory) cell, however, are still designed by hand to ensure
the highest efficiency (sometimes by bending or breaking established design rules to
obtain the last bit of performance by trading stability).

What is VLSI?
VLSI stands for "Very Large Scale Integration". This is the field which involves
packing more and more logic devices into smaller and smaller areas.
Simply put, an integrated circuit is many transistors on one chip.
VLSI covers the design and manufacturing of extremely small, complex circuitry using
modified semiconductor material.
An integrated circuit (IC) may contain millions of transistors, each a few µm in
size.

Why VLSI?
Integration improves the design: lower parasitics mean higher speed and lower
power consumption, and the result is physically smaller. Integration also reduces
manufacturing cost, with (almost) no manual assembly. The course will cover basic theory and
techniques of digital VLSI design in CMOS technology. Topics include: CMOS
devices and circuits, fabrication processes, static and dynamic logic structures, chip
layout, simulation and testing, low power techniques, design tools and
methodologies, VLSI architecture.
We use full custom techniques to design basic cells and regular structures such as
data path and memory. There is an emphasis on modern design issues in
interconnect and clocking. We will also use several case studies to explore recent
real world VLSI designs (e.g. Pentium, Alpha, PowerPC, StrongARM, etc.) and
papers from the recent research literature. On-campus students will design small test
circuits using various CAD tools. Circuits will be verified and analyzed for
performance with various simulators. Some final project designs will be fabricated
and returned to students the following semester for testing.
Very-large-scale-integration (VLSI) is the process of creating integrated circuits
by combining thousands of transistor based circuits into a single chip. VLSI began in
the 1970s when complex semiconductor and communication technologies were
being developed. The microprocessor is a VLSI device. The term is no longer as
common as it once was, as chips have increased in complexity into the hundreds of
millions of transistors.

Applications of VLSI
I. Electronic system in cars.
II. Digital electronics control VCRs.
III. Transaction processing system, ATM.
IV. Personal computers and Workstations.
V. Medical electronic systems.

I. Electronic systems now perform a wide variety of tasks in daily life. Electronic
systems in some cases have replaced mechanisms that operated mechanically,
hydraulically, or by other means; electronics are usually smaller, more flexible, and
easier to service. In other cases, electronic systems have created totally new
applications. Electronic systems perform a variety of tasks, some of them visible,
some more hidden: Personal entertainment systems such as portable MP3 players
and DVD players perform sophisticated algorithms with remarkably little energy.
Electronic systems in cars operate stereo systems and displays; they also control fuel
injection systems, adjust suspensions to varying terrain, and perform the control
functions required for anti-lock braking systems.
II. Digital electronics compress and decompress video, even at high definition data
rates, on the fly in consumer electronics. Low cost terminals for Web browsing still
require sophisticated electronics, despite their dedicated function.
III. Personal computers and workstations provide word-processing, financial
analysis, and games. Computers include both central processing units and special-purpose
hardware for disk access, faster screen display, etc.
IV. Medical electronic systems measure bodily functions and perform complex
processing algorithms to warn about unusual conditions. The availability of these
complex systems, far from overwhelming consumers, only creates demand for even
more complex systems.

The growing sophistication of applications continually pushes the design and
manufacturing of integrated circuits and electronic systems to new levels of
complexity. And perhaps the most amazing characteristic of this collection of
systems is its variety: as systems become more complex, we build not a few general
purpose computers but an ever wider range of special purpose systems. Our ability
to do so is a testament to our growing mastery of both integrated circuit
manufacturing and design, but the increasing demands of customers continue to test
the limits of design and manufacturing.

1.2 Objective
The main objective of this study is to identify redundant logic operations and
data dependencies in the carry select adder (CSLA) so as to provide a parallel path for
carry propagation, which helps to reduce the overall adder delay. The CSLA has two
units: 1) the sum and carry generator (SCG) unit and 2) the sum and carry selection unit.
Accordingly, we remove all redundant logic operations and sequence the logic operations
based on their data dependency.

1.3 Existing System


In digital adders, the speed of addition is limited by the time required to
propagate a carry through the adder. The sum for each bit position in an elementary
adder is generated sequentially only after the previous bit position has been summed
and a carry propagated into the next position. Early designs used the carry look-ahead
adder to overcome this delay: it produces all the carries at the same time, but it
requires more circuitry. These adders were later replaced by carry select adders using
dual RCAs.
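
The sequential carry dependency described above can be illustrated with a small behavioral model. This is only a sketch in Python, not the hardware description used in this project; the function names (full_adder, ripple_carry_add, to_bits, from_bits) are illustrative assumptions. Each bit position has to wait for the carry of the previous one, which is why the loop cannot be parallelized.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a_bits, b_bits, cin=0):
    """Add two equal-length bit lists (least significant bit first).

    The carry produced at position i is consumed at position i + 1,
    so the positions are evaluated strictly one after another.
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same bit width")
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(value, width):
    """Integer to bit list, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Bit list (least significant bit first) to integer."""
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    s, c = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
    print(from_bits(s), c)  # 4-bit sum and carry-out of 11 + 6
```

For 4-bit operands 11 and 6 the model returns the sum bits for 1 together with a carry-out of 1, that is, 17.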

1.3.1 Existing System Disadvantages


The Ripple Carry Adder (RCA) provides the most compact design but takes the
longest computing time. For an N-bit RCA, the delay is linearly proportional to N.
Thus, for large values of N the RCA gives the highest delay of all adders. The Carry
Look Ahead Adder (CLA) gives fast results but consumes a large area. For a higher
number of bits, however, the CLA gives a higher delay than other adders due to its
large fan-in and large number of logic gates. The Carry Select Adder (CSA)
provides a compromise between the small area but longer delay of the RCA and the
large area but shorter delay of the CLA. In the rapidly growing mobile industry, faster
units are not the only concern; smaller area and lower power have also become major
concerns in the design of digital circuits.
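
To make the delay trade-off concrete, the following Python sketch compares the critical path of an RCA with that of a carry select adder built from equal-sized blocks. It is a simplified unit-delay model and not a result from this project: the function names, the default delays t_fa = t_mux = 1 and the assumption of uniform blocks are all illustrative. In the model, every block computes its candidate results in parallel, and the carry then passes through one multiplexer per remaining block.

```python
def rca_delay(n_bits, t_fa=1):
    """Critical path of an N-bit ripple carry adder: N full-adder delays."""
    return n_bits * t_fa


def uniform_csla_delay(n_bits, block_bits, t_fa=1, t_mux=1):
    """Critical path of a carry select adder with equal-sized blocks.

    Assumed model: all blocks ripple internally at the same time
    (block_bits full-adder delays), then the selected carry crosses
    one multiplexer for each block after the first.
    """
    if n_bits % block_bits != 0:
        raise ValueError("n_bits must be a multiple of block_bits")
    n_blocks = n_bits // block_bits
    return block_bits * t_fa + (n_blocks - 1) * t_mux


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, rca_delay(n), uniform_csla_delay(n, 4))
```

Under these assumptions the RCA delay grows linearly with N, while the carry select delay grows only with the number of blocks; the price is the duplicated adder hardware in each block.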

1.4 Proposed System


In this technique, a single ripple carry adder is used instead of dual ripple
carry adders to improve the area, power and delay. A carry-select adder can be
implemented by using a single ripple carry adder and an add-one circuit instead of
using dual ripple-carry adders. This project proposes an add-one circuit using the
first-zero finding circuit and multiplexers to reduce the area and power with no speed
penalty. For bit length n = 64, this carry-select adder requires 38 percent fewer
transistors than the dual ripple-carry carry-select adder and 29 percent fewer transistors
than Chang's carry-select adder using a single ripple carry adder.
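
The idea behind the add-one circuit can be sketched behaviorally: adding one to a binary number inverts every bit from the least significant bit up to and including the first zero, and leaves the higher bits unchanged. The Python model below is only an illustration of that principle, not the transistor-level circuit; the names add_one_first_zero and select_sum are assumptions introduced for this example.

```python
def add_one_first_zero(bits):
    """Increment a bit list (least significant bit first).

    Bits are inverted up to and including the first zero. If there is
    no zero, the result wraps to all zeros with a carry-out of 1.
    """
    out = list(bits)
    for i, bit in enumerate(bits):
        out[i] = bit ^ 1
        if bit == 0:
            return out, 0
    return out, 1


def select_sum(sum0_bits, carry0, cin):
    """Pick the block result for the actual carry-in.

    sum0_bits and carry0 come from the single ripple carry adder with
    carry-in 0; the carry-in-1 result is derived by the add-one step,
    and cin acts as the multiplexer select signal.
    """
    if cin == 0:
        return list(sum0_bits), carry0
    sum1_bits, inc_carry = add_one_first_zero(sum0_bits)
    return sum1_bits, carry0 | inc_carry


if __name__ == "__main__":
    print(add_one_first_zero([1, 1, 0, 1]))  # 11 + 1, bits LSB first
```

Only one ripple carry adder result is needed per block; the second candidate is obtained from it rather than from a duplicate adder.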

1.5 Project Outline


The project is organized into five chapters, namely Introduction, Literature
Review, Proposed Concept, Simulation and Synthesis Result Analysis, and
Conclusion and Future Scope. Chapter 2 contains the complete details of the
introduction to VLSI and the literature review. Chapter 3 describes the Logic
Formulation, the Proposed Adder Design and the Analysis of Adders. Chapter 4 explains
the simulation and synthesis results. Chapter 5 presents the conclusion of the proposed
work and the scope for enhancing the project in the future.

CHAPTER 2
LITERATURE REVIEW
Adders are of fundamental importance in a wide variety of digital systems.
Several types of fast adders exist, but fast addition with low area and power
is still challenging. In digital adders, the speed of addition is limited by the time
required to propagate a carry through the adder. The CSLA is therefore used in many
computational systems to alleviate the problem of carry propagation delay. Many
papers have been published on this topic, with several examples of such adders and
many efficient implementations.
A number of modifications have been suggested by researchers to improve the
performance of the carry select adder. Reference [1] proposes a logic formulation for
the CSLA by removing all the redundant logic operations from the conventional CSLA
design. In this design, the carry select (CS) operation is scheduled before the
calculation of the final sum. Reference [3] presents various architectures of the CSLA
and analyzes them for speed and area. A power-area efficient gate-level modified design
is implemented in [15, 4, 8] by minimizing the logic operations in comparison with the
conventional CSLA design. An analysis of the 16-bit conventional CSLA and the Binary
to Excess-1 Converter (BEC) based CSLA is presented in [7], and a D-latch based CSLA
architecture is proposed in this project. An area-delay optimized architecture of 16-bit,
32-bit and 64-bit CSLA is proposed and analyzed in [5, 6]. Reference [16] presents the
simulation and performance evaluation of a 16-bit modified architecture of the
Square-Root CSLA (SQRT-CSLA). An area-delay-power based simulation of a
redundant-logic optimized modified design of the CSLA with respect to the conventional
CSLA design is shown in [9, 10, 11, 12]. A modified design for 16-bit, 32-bit and 64-bit
CSLA that does not use a multiplexer architecture is proposed in [19]. That paper also
shows a comparative analysis of the proposed architecture with the conventional
architecture. A logic converter unit (LCU) based modified adder architecture is proposed
in [20] for optimized area-delay-power parameters. The modified architectures find
applications in high-performance VLSI system architectures in the development of
modern electronic devices and gadgets. An efficient adder architecture essentially
improves the overall performance of complex systems. The different sections of the
proposed work are arranged as follows: Section II presents the architecture of the 64-bit
CSLA and the design of its building blocks using gate-level logic. Section III presents
the simulation and synthesis results. This section also shows the comparative analysis
of the design for dynamic power consumption on different FPGAs. Section IV
presents the conclusion based on the present design simulation analysis. Finally,
this paper is concluded with the acknowledgement and the references.
In 1962, O.J. Bedrij [1] described an extremely fast digital adder with sum
selection and multiple-radix carry. He compared the amount of hardware and the
logical delay for a 100-bit ripple-carry adder and a carry-select adder. The problem
of carry-propagation delay was overcome by independently generating multiple-radix
carries and using these carries to select between simultaneously generated
sums. In this adder system, the addend and augend were divided into subaddend and
subaugend sections that were added twice to produce two sub sums. One addition
was done with a carry digit forced into each section, and the other addition
combined the operands without the forced carry digit. The selection of the correct
sub sum from each of the adder sections depended upon whether or not there
actually was a carry into that adder section.
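
The sum-selection scheme described above can be modeled in a few lines of Python. This is a behavioral sketch only, with the illustrative name carry_select_add and assumed parameters width and block; it uses integer arithmetic for each section instead of gate-level adders. Each section is added twice, once without and once with a forced carry, and the actual carry into the section selects one of the two sub sums.

```python
def carry_select_add(a, b, width=16, block=4, cin=0):
    """Carry-select addition of two unsigned integers.

    The operands are split into sections of `block` bits. Both sub sums
    of a section depend only on the operand bits, so in hardware all
    sections can compute them at the same time; only the selection
    depends on the incoming carry.
    """
    if width % block != 0:
        raise ValueError("width must be a multiple of block")
    mask = (1 << block) - 1
    result = 0
    carry = cin
    for shift in range(0, width, block):
        sub_a = (a >> shift) & mask
        sub_b = (b >> shift) & mask
        total0 = sub_a + sub_b        # sub sum without forced carry
        total1 = sub_a + sub_b + 1    # sub sum with forced carry
        chosen = total1 if carry else total0
        result |= (chosen & mask) << shift
        carry = chosen >> block       # carry into the next section
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(1234, 4321))  # (sum, carry-out)
```

Calling carry_select_add(0xFFFF, 1) returns a sum of 0 with a carry-out of 1, showing the selected carry passing through every section.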
Bedrij (1962) [1] proposed that the problem of carry propagation delay be
overcome by independently generating multiple-radix carries and using these carries
to select between simultaneously generated sums. Ramkumar et al. (2010) proposed a
BEC method to reduce the maximum delay of carry propagation in the final stage of the
carry save adder [2]. Ramkumar and Harish (2011) [7] proposed the BEC technique,
a simple and efficient gate-level modification that significantly reduces the area and
power of the square root CSLA.

There are many carry select adder approaches available, but most of them use
ripple carry adders. T.Y. Chang and M.J. Hsiao [3] suggested that, instead of using
dual ripple carry adders, a carry select adder scheme using an add-one circuit to
replace one ripple carry adder requires 29.2% fewer transistors with a speed penalty
of 5.9% for bit length n = 64. If speed is important for this 64-bit adder, then two of
the carry-select adder blocks can be substituted by the proposed scheme with a 6.3%
area saving and the same speed.

Youngjoon Kim and Lee-Sup Kim [4] suggested that a carry-select adder
could be implemented by using a single ripple carry adder and an add-one circuit
instead of using dual ripple-carry adders. They proposed a new add-one circuit using
the first-zero finding circuit and multiplexers to reduce the area and power with no
speed penalty. For n = 64 bits, this new carry-select adder requires 38% fewer
transistors than the dual ripple-carry carry-select adder and 29% fewer
transistors than Chang's carry-select adder using a single ripple carry adder. This new
64-bit adder using a 0.25 µm CMOS technology had a 3.45 ns delay time at a 2.5 V power
supply. Behnam Amelifard et al. [6] suggested a new adder called the carry select adder
with sharing (CSAS), which was area efficient but had a larger delay. M. Alioto et al.
[5] suggested using variable block sizes depending on the multiplexer delay.

B. Ramkumar, H.M. Kittur and P.M. Kannan [7] suggested a very simple
approach to improve the speed of addition. Based on this approach, 16-bit, 32-bit and
64-bit adder architectures were developed and compared with conventional fast adder
architectures. In many parallel multipliers, a CLA arranged in the form of a Carry
Select Adder (CSLA) was used to speed up the final addition. But due to its structure,
the CSLA occupied more chip area, because it uses multiple pairs of RCAs to generate
the partial sum and carry by considering Cin=0 and Cin=1. Thus
the complexity of the final adder structure was high. So they replaced the RCA
(CLA) with Cin=1 with BEC logic, which reduced the area and delay of
the final adder structure.
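
The replacement of the Cin=1 adder by BEC logic can be sketched as follows. In a Binary to Excess-1 Converter, output bit X0 is the inverse of input bit B0, and each higher output bit Xi is Bi XORed with the AND of all lower input bits. The Python model below is a behavioral illustration under that formulation, not the gate-level netlist of the cited design; the names bec and becsla_block are assumptions made for this example.

```python
def bec(value, width):
    """Binary to Excess-1 Converter for a `width`-bit input.

    X0 = NOT B0, and Xi = Bi XOR (B0 AND B1 AND ... AND B(i-1)).
    The result equals (value + 1) modulo 2**width.
    """
    out = 0
    all_lower_ones = 1  # AND of the lower input bits (1 for bit 0)
    for i in range(width):
        bit = (value >> i) & 1
        out |= (bit ^ all_lower_ones) << i
        all_lower_ones &= bit
    return out


def becsla_block(a, b, cin, block=4):
    """One carry-select block using a single adder and a BEC.

    The block sum for Cin=0 (with its carry as the top bit) feeds a
    (block + 1)-bit BEC, which yields the Cin=1 result; cin selects
    between the two. Returns (sum, carry_out).
    """
    mask = (1 << block) - 1
    total0 = (a & mask) + (b & mask)   # stands in for the RCA with Cin=0
    total1 = bec(total0, block + 1)    # BEC output replaces the second RCA
    chosen = total1 if cin else total0
    return chosen & mask, chosen >> block


if __name__ == "__main__":
    print([bec(v, 4) for v in range(16)])  # 4-bit BEC functional table
```

Because the two block operands sum to at most 2^(block+1) - 2, the (block + 1)-bit BEC never overflows, so its top bit is a valid carry-out for the Cin=1 case.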

In 2013, Mugilvannan L. and Ramasamy S. [5] investigated the Carry Select Adder
(CSLA), one of the fastest adders used in many data-processing processors to perform
fast arithmetic functions. From the structure of the CSLA, it is clear that there is
scope for reducing its area and power consumption. Their work uses a simple and
efficient transistor-level modification of the BEC-1 converter to significantly reduce
the area and power of the CSLA. Based on this modification, a 16-bit square root CSLA
(SQRT CSLA) architecture was developed and compared with the SQRT CSLA architecture
using the ordinary BEC-1 converter. The proposed design has reduced area and power
compared with the SQRT CSLA using the ordinary BEC-1 converter, with only a slight
increase in delay. The work evaluates the performance of the proposed designs in terms
of delay, area and power by hand, using logical effort, and through Cadence Virtuoso.
The analysis of the results shows that the proposed CSLA structure is better than the
SQRT CSLA with the ordinary BEC-1 converter.

In 2013, Vijayalakshmi V., Seshadri R. and Ramakrishnan S. [6] studied the VLSI
design of a carry look-ahead adder (CLAA) based 32-bit unsigned integer multiplier and
of a carry select adder (CSLA) based 32-bit unsigned integer multiplier. Both multiplier
designs multiply two 32-bit unsigned integer values and give a 64-bit product. The CLAA
based multiplier has a delay time of 99 ns for the multiplication operation, and the
CSLA based multiplier has nearly the same delay time. However, the area needed for the
CLAA multiplier is reduced to 31% by the CSLA based multiplier. These multipliers are
implemented using Altera Quartus II, and the timing diagrams are viewed through
AvanWaves.

In 2014, Pandey S., Khan A.A. and Sarma R. [1] compared the design of an 8T adder
based Carry Select Adder (CSA) with that of a 10T adder based CSA. Using both adder
designs, a 4-bit CSA architecture was developed and compared with the 28T adder based
4-bit CSA. The 10T CSA design has reduced delay, power and area compared with the 28T
CSA, with a slight tradeoff in area compared with the 8T CSA. The analysis shows that
the 10T CSA is better than both the 8T adder based CSA and the 28T CSA. The work
evaluates the performance of the 10T CSA design in terms of power, delay and area
using a 180 nm CMOS process technology with the Cadence Virtuoso tool and the Spectre
simulator.

In 2014, Paradhasaradhi.D, Prashanthi.M, and Vivek.N [2] described how the Carry
Select Adder (CSLA) provides a good compromise between cost and performance in
carry propagation adder design. A Square Root Carry Select Adder using RCA is
introduced, but it carries some speed penalty, and the conventional CSLA is still
area-consuming due to its dual ripple carry adder structure. In a Wallace
multiplier, the partial products are reduced as soon as possible and a carry select
adder is used on the final carry propagation path. In that work, modification is
done at gate level to reduce area and power consumption. The Modified Square Root
Carry Select Adder (MCSLA) is designed using Common Boolean Logic and compared with
the respective regular CSLA architectures, and the MCSLA is then implemented in a
Wallace tree multiplier. The result is an area-efficient Wallace tree multiplier,
built on a common Boolean logic based square root carry select adder, with reduced
area compared with a normal Wallace tree multiplier.

In 2014, Naaz.S.A.A, Pradeep.M.N.N, Bhairannawar.S, and Halvi.S [4] presented a
study in the field of communication and signal processing applications. Every such
application demands high-throughput arithmetic operations. One of the key
arithmetic operations is multiplication, which takes the maximum execution time, so
the development of efficient multipliers has been a subject of interest for
decades. There is therefore a need for an efficient multiplier that achieves higher
performance for real-time signal processing applications. The authors present a
modular design of a Vedic multiplier using a carry select adder. The delay of the
proposed multiplier is reduced due to the high-speed carry select adder. The
proposed multiplier is applied to a parallel FIR filter, and the combinational
delay is observed to be lower for the proposed multiplier than for the existing
architecture.

Ramkumar and Harish (2011) [4] proposed the BEC technique, a simple and efficient
gate-level modification that significantly reduces the area and power of the square
root CSLA. Veena Nair (2013) suggested a new approach in which a D-latch with an
enable signal is used instead of the BEC [6]. Based on this approach, 16-, 32- and
64-bit adder architectures were developed and compared with conventional fast adder
architectures. The new structure reduces the delay of the adder.

2.1 CMOS Technology


At present, most chips are designed in CMOS technology. CMOS stands for
Complementary Metal Oxide Semiconductor, and it uses both NMOS and PMOS
transistors. To understand CMOS better, we first need to know about the MOS
transistor.

MOS Transistor: MOS stands for Metal Oxide Semiconductor. The MOS field effect
transistor is the basic element in the design of a large scale integrated circuit.
It is a voltage controlled device. These transistors are formed as a "sandwich"
consisting of a semiconductor layer, usually a slice, or wafer, from a single
crystal of silicon; a layer of silicon dioxide (the oxide); and a layer of metal.
These layers are patterned in a manner which permits transistors to be formed in
the semiconductor material (the "substrate"). The MOS transistor consists of three
regions: Source, Drain and Gate.

The source and drain regions are quite similar, and are labeled according to what
they are connected to. The source is the terminal, or node, which acts as the
source of charge carriers; charge carriers leave the source and travel to the drain.
In the case of an N channel MOSFET (NMOS), the source is the more negative of the
terminals; in the case of a P channel device (PMOS), it is the more positive of the
terminals. The area under the gate oxide is called the "channel". Fig. 2.1 shows a
MOS transistor.

The transistor needs a gate voltage to be applied before a channel can form. When
no channel is formed, the transistor is said to be in the cut-off region. The gate
voltage at which the transistor starts conducting (a channel begins to form between
the source and the drain) is called the threshold voltage. Above the threshold, and
for small drain-to-source voltages, the transistor is said to be in the linear
region. The transistor is said to go into the saturation region when the
drain-to-source voltage is large enough to pinch off the channel at the drain end,
so that the drain current no longer increases appreciably with the drain voltage.
CMOS technology is made up of both NMOS and PMOS transistors. Complementary Metal
Oxide Semiconductor (CMOS) logic devices are the most common devices used today in
the high density, high transistor count circuits found in everything from complex
microprocessor integrated circuits to signal processing and communication circuits.

The CMOS structure is popular because of its inherently low power requirements,
high operating clock speed, and ease of implementation at the transistor level. The
complementary p-channel and n-channel transistor networks are used to connect the
output of the logic device to either the VDD or the VSS power supply rail for a
given input logic state. The MOSFET transistors can be treated as simple switches.
The switch must be on (conducting) to allow current to flow between the source and
drain terminals. In CMOS, each node has only one driver, but a gate can drive many
other gates. In CMOS technology, the output always drives another CMOS gate input.
The charge carriers in PMOS transistors are holes and the charge carriers in NMOS
transistors are electrons. The mobility of electrons is about twice that of holes,
so for equally sized transistors the output rise and fall times differ. To make
them equal, the W/L ratio of the PMOS transistor is made about twice that of the
NMOS transistor. This way, the PMOS and NMOS transistors have the same drive
strength. In a standard cell library, the length L of a transistor is always
constant, and the width W is varied to give different drive strengths for each
gate. The resistance is proportional to L/W, so increasing the width decreases the
resistance.
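
The sizing argument above can be checked with a small numerical sketch. It assumes a
first-order model in which the on-resistance is proportional to L divided by
(mobility x W); the function name and the normalized values are illustrative
assumptions, not process data.

```python
def on_resistance(length, width, mobility, k=1.0):
    """First-order on-resistance model: R = k * L / (mobility * W).

    k lumps together the oxide capacitance and gate overdrive (assumed equal
    for both transistor types in this sketch).
    """
    return k * length / (mobility * width)


# Normalized, illustrative values (assumptions).
MU_N = 2.0   # electron mobility, taken as twice the hole mobility
MU_P = 1.0   # hole mobility
L = 1.0      # channel length, fixed as in a standard cell library
W_N = 1.0    # NMOS width

r_n = on_resistance(L, W_N, MU_N)
r_p_same_width = on_resistance(L, W_N, MU_P)      # PMOS with the NMOS width
r_p_doubled = on_resistance(L, 2 * W_N, MU_P)     # PMOS with twice the width

print(r_n, r_p_same_width, r_p_doubled)
```

With equal widths the PMOS resistance is twice the NMOS resistance in this model;
doubling the PMOS width brings the two back to the same value.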

Fig. 2.1 MOS TRANSISTOR


Power Dissipation in CMOS ICs: A large percentage of the power dissipation in
CMOS ICs is due to the charging and discharging of capacitances. Reducing power
dissipation is the central issue in most low power CMOS IC designs.
The main sources of power dissipation are:
a. Dynamic Switching Power
Due to charging and discharging of circuit capacitances, a low to high output
transition draws energy from the power supply. A high to low transition
dissipates the energy stored in the CMOS transistor.
b. Short Circuit Current
It occurs when the rise/fall time at the input of the gate is larger than the
output rise/fall time.
c. Leakage Current Power
One cause is reverse bias diode leakage on the transistor drains: this happens in
CMOS design when one transistor is off, and the active transistor charges up/down
the drain using the bulk potential of the other transistor.
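
As a rough illustration of the dynamic switching component, the sketch below uses
the standard first-order estimate P = alpha * C * Vdd^2 * f. The activity factor,
load capacitance, supply voltage and clock frequency are assumed example values,
not measurements from this project.

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Average switching power in watts: alpha * C * Vdd^2 * f.

    alpha  : switching activity factor (fraction of cycles with a 0->1 transition)
    c_load : switched capacitance in farads
    vdd    : supply voltage in volts
    freq   : clock frequency in hertz
    """
    return alpha * c_load * vdd ** 2 * freq


# Assumed example values: 1 pF load, 1.8 V supply, 100 MHz clock, 10% activity.
p_full = dynamic_power(alpha=0.1, c_load=1e-12, vdd=1.8, freq=100e6)
p_half_vdd = dynamic_power(alpha=0.1, c_load=1e-12, vdd=0.9, freq=100e6)

print(f"{p_full * 1e6:.2f} uW at 1.8 V, {p_half_vdd * 1e6:.2f} uW at 0.9 V")
```

Because the supply voltage enters squared, halving it cuts the estimated switching
power to one quarter, which is why supply scaling dominates low power design.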

2.1.1 CMOS Transmission Gate


A PMOS transistor is connected in parallel with an NMOS transistor to form a
transmission gate. The transmission gate simply passes the value at its input to
its output. It uses both NMOS and PMOS transistors because a PMOS transistor
transmits a strong 1 and an NMOS transistor transmits a strong 0.
The advantages of using a transmission gate are:
It shows better characteristics than a switch.
The resistance of the circuit is reduced, since the transistors are connected in
parallel.

2.1.2 Fabrication Technology


Fabrication starts from silicon of extremely high purity, which is chemically
purified and then grown into large crystals. The crystals are sliced into wafers;
wafer diameters are currently 150 mm, 200 mm and 300 mm, the wafer thickness is
less than 1 mm, and the surface is polished to optical smoothness. The wafer is
then ready for processing. Each wafer yields many chips, and the chip die size
varies from about 5 mm x 5 mm to 15 mm x 15 mm.

A whole wafer is processed at a time. Different parts of each die are made P-type
or N-type by doping (a small amount of other atoms is intentionally introduced, for
example by implantation). Interconnections are made with metal, and the insulation
used is typically SiO2; SiN is also used, and new materials (low-k dielectrics) are
being investigated. CMOS fabrication can use a p-well process, an n-well process or
a twin-tub process. All the devices on the wafer are made at the same time. After
the circuitry has been placed on the chip, the chip is overglassed (with a
passivation layer) to protect it; only those areas which connect to the outside
world (the pads) are left uncovered. The wafer finally passes to a test station,
where test probes send test signal patterns to each chip and monitor its output.
The yield of a process is the percentage of dies which pass this testing. The wafer
is then scribed and separated into the individual chips. These are then packaged,
and the chips are binned according to their performance.

2.2 FPGA Design Flow


The designer facing a design problem must go through a series of steps between
initial ideas and final hardware. This series of steps is commonly referred to as the
design flow. First, after all the requirements have been spelled out, a proper digital
design phase must be carried out. It should be stressed that the tools supplied by the
different FPGA vendors to target their chips do not help the designer in this phase.
They only enter the scene once the designer is ready to translate a given design into
working hardware. The most common flow nowadays used in the design of FPGAs
involves the following subsequent phases:
Design entry: This step consists in transforming the design ideas into some form of
computerized representation. This is most commonly accomplished using Hardware
Description Languages (HDLs). The two most popular HDLs are Verilog and the
Very High Speed Integrated Circuit HDL (VHDL) [2]. It should be noted that an
HDL, as its name implies, is only a tool to describe a design that pre-existed in the
mind, notes, and sketches of a designer. It is not a tool to design electronic circuits.
Another point to note is that HDLs differ from conventional software programming
languages in the sense that they don't support the concept of sequential execution of
statements in the code. This is easy to understand if one considers the alternative
schematic representation of an HDL file: what one sees in the upper part of the
schematic cannot be said to happen before or after what one sees in the lower part.
Synthesis: The synthesis tool receives the HDL and a choice of FPGA vendor and
model. From these two pieces of information, it generates a netlist which uses the
primitives proposed by the vendor in order to satisfy the logic behavior specified in
the HDL files. Most synthesis tools go through additional steps such as logic
optimization, register load balancing, and other techniques to enhance timing
performance, so the resulting netlist can be regarded as a very efficient
implementation of the HDL design.
Place and route: The placer takes the synthesized netlist and chooses a place for
each of the primitives inside the chip. The router's task is then to interconnect
all these primitives together, satisfying the timing constraints. The most obvious
constraint for a design is the frequency of the system clock, but there are more
involved constraints one can impose on a design using the software packages
supported by the vendors.
Bit stream generation: FPGAs are typically configured at power-up time from some
sort of external permanent storage device, typically a flash memory. Once the place
and route process is finished, the resulting choices for the configuration of each
programmable element in the FPGA chip, be it logic or interconnect, must be stored
in a file to program the flash.
Of these four phases, only the first one is human labor intensive. Somebody has to
type in the HDL code, which can be tedious and error prone for complicated designs
involving, for example, lots of digital signal processing. This is the reason for
the appearance, in recent years, of alternative flows which include a preliminary
phase in which the user can draw blocks at a higher level of abstraction and rely
on the software tool for the generation of the HDL. Some of these tools can also
simulate blocks which will become HDL together with other blocks which provide
stimuli and processing to make the simulation output easier to interpret. The
concept of hardware co-simulation is also becoming widely used. In co-simulation,
stimuli are sent to a running FPGA hosting the design to be tested and the outputs
of the design are sent back to a computer for display (typically through a Joint
Test Action Group (JTAG) or Ethernet connection). The advantage of co-simulation is
that one is testing the real system, therefore suppressing all possible
misinterpretations present in a pure simulator. In other cases, co-simulation may
be the only way to simulate a complex design in a reasonable amount of time.
The standard FPGA design flow starts with design entry using schematics or a
hardware description language (HDL), such as Verilog HDL or VHDL. In this step,
you create the digital circuit that is implemented inside the FPGA. The flow then
proceeds through compilation, simulation, programming, and verification in the
FPGA hardware. FPGAs belong to the wider family of field-programmable devices
(FPDs), which has three main categories: Simple PLDs (SPLDs), Complex PLDs (CPLDs)
and Field-Programmable Gate Arrays (FPGAs).

2.3 FPGA Performance


While the headline performance increase offered by FPGAs is often very large
(>100 times for some algorithms), it is important to consider a number of factors
when assessing their usefulness for accelerating a particular application. Firstly,
is it practical to implement the whole application on an FPGA? The answer to this
is likely to be no, particularly for floating-point intensive applications, which
tend to swallow up a large amount of logic. If it is either impractical or
impossible to implement the whole application on an FPGA, the next best option is
to implement those kernels within the application that are responsible for the
majority of the run time, which may be determined by profiling. Next, the real
speedup of the whole application must be estimated once the kernel has been
implemented in an FPGA. Even if that kernel was originally responsible for 90% of
the runtime, the total speedup that you can achieve for your application cannot
exceed 10 times (even if you achieve a 1000 times speedup for the kernel). This is
an example of Amdahl's law, that long-time irritant of the HPC software engineer.
Once such an estimate has been made, one must decide if the potential gain is
worthwhile given the complexity of instantiating the algorithm on an FPGA.
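
The 90% example above follows directly from Amdahl's law. The short sketch below
evaluates it; the function names are illustrative, and the figures are the ones
quoted in the text (90% of the runtime in the kernel, a 1000 times kernel speedup).

```python
def amdahl_speedup(fraction, kernel_speedup):
    """Overall speedup when `fraction` of the runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def amdahl_limit(fraction):
    """Upper bound on the overall speedup as the kernel speedup grows without limit."""
    return 1.0 / (1.0 - fraction)


overall = amdahl_speedup(fraction=0.9, kernel_speedup=1000.0)
limit = amdahl_limit(fraction=0.9)

print(f"overall speedup: {overall:.2f}x, upper bound: {limit:.1f}x")
```

The remaining 10% of the runtime that stays on the host dominates: a 1000 times
faster kernel gives an overall speedup just under 10, and no kernel speedup can push
it past that bound.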

In general terms FPGAs are best at tasks that use short word length integer or
fixed point data, and exhibit a high degree of parallelism, but they are not so good at
high precision floating-point arithmetic (although they can still outperform
conventional processors in many cases). The implications of shipping data to the
FPGA from the CPU and vice versa must also come under consideration, for if that
outweighs any improvement in the kernel then implementing the algorithm in an
FPGA may be an exercise in futility. FPGAs are best suited to integer arithmetic.
Unfortunately, the vast majority of scientific codes rely heavily on 64 bit IEEE
floating point arithmetic (often referred to as double precision floating point
arithmetic). It is not unreasonable to suggest that in order to get the most out of
FPGAs computational scientists must perform a thorough numerical analysis of their
code, and ideally reimplement it using fixed point arithmetic or lower precision
floating-point arithmetic. Scientists who have been used to continual performance
increases provided by each new generation of processor are not easily convinced that
the large amount of effort required for such an exercise will be sufficiently
rewarded. That said the recent development of efficient floating point cores has gone
some way towards encouraging scientists to use FPGAs.

If the performance of such cores can be demonstrated by accelerating a number of
real world applications, then the wider acceptance of FPGAs will move a step closer.
At present there is very little performance data available for 64-bit floating-point
intensive algorithms on FPGAs. To give an indication of expected performance we
have therefore used data taken from the Xilinx floating point cores (v3) datasheet.

It is useful to measure the area, performance and power consumption gap between
field programmable gate arrays (FPGAs) and standard cell application-specific
integrated circuits (ASICs) for the following reasons:
I. In the early stages of system design, when system architects choose their
implementation medium, they often choose between FPGAs and ASICs. Such decisions
are based on the differences in cost (which is related to area), performance and
power consumption between these implementation media, but to date there have been
few attempts to quantify these differences. A system architect can use these
measurements to assess whether implementation in an FPGA is feasible.
II. These measurements can also be useful for those building ASICs that contain
programmable logic, by quantifying the impact of leaving part of a design to be
implemented in the programmable fabric.
III. FPGA makers seeking to improve FPGAs can gain insight from quantitative
measurements of these metrics, particularly when it comes to understanding the
benefit of less programmable (but more efficient) hard heterogeneous blocks such as
block memory, multipliers/accumulators and multiplexers that modern FPGAs often
employ.

2.4 Basic FPGA Architecture


The most common FPGA architecture consists of an array of configurable logic
blocks (CLBs), I/O pads, and routing channels. Generally, all the routing channels
have the same width (number of wires). Multiple I/O pads may fit into the height of
one row or the width of one column in the array. An application circuit must be
mapped into an FPGA with adequate resources. While the number of CLBs and
I/Os required is easily determined from the design, the number of routing tracks
needed may vary considerably even among designs with the same amount of logic.
(For example, a crossbar switch requires much more routing than a systolic array
with the same gate count.) Since unused routing tracks increase the cost (and
decrease the performance) of the part without providing any benefit, FPGA
manufacturers try to provide just enough tracks so that most designs that will fit in
terms of LUTs and IOs can be routed. This is determined by estimates such as those
derived from Rent's rule or by experiments with existing designs.

2.5 FPGA Design and Programming


To define the behavior of the FPGA, the user provides a hardware description
language (HDL) or a schematic design. The HDL form might be easier to work with
when handling large structures because it's possible to just specify them numerically
rather than having to draw every piece by hand. On the other hand, schematic entry
can allow for easier visualization of a design. Then, using an electronic design
automation tool, a technology-mapped netlist is generated. The netlist can then be
fitted to the actual FPGA architecture using a process called place-and-route, usually
performed by the FPGA company's proprietary place-and-route software. The user
will validate the map, place and route results via timing analysis, simulation, and
other verification methodologies. Once the design and validation process is
complete, the binary file generated (also using the FPGA company's proprietary
software) is used to (re)configure the FPGA. The source files are fed to a software
suite from the FPGA/CPLD vendor that through different steps will produce a file.
This file is then transferred to the FPGA/CPLD via a serial interface or to an
external memory device like an EEPROM.
The most common HDLs are VHDL and Verilog, although in an attempt to reduce
the complexity of designing in HDLs, which have been compared to the equivalent
of assembly languages, there are moves to raise the abstraction level through the
introduction of alternative languages.

Advantages of Using Hardware Description Languages (HDLs) to Design FPGA Devices

Using Hardware Description Languages (HDLs) to design high-density FPGA devices
has the following advantages:
I. Top-Down Approach for Large Projects: Designers use HDLs to create
complex designs. The top-down approach to system design works well for
large HDL projects that require many designers working together. After the
design team determines the overall design plan, individual designers can
work independently on separate code sections.
II. Functional Simulation Early in the Design Flow: You can verify design
functionality early in the design flow by simulating the HDL description.
Testing your design decisions before the design is implemented at the
Register Transfer Level (RTL) or gate level allows you to make any
necessary changes early on.
III. Synthesis of HDL Code to Gates: Synthesizing your hardware description to
target the FPGA implementation:
a. Decreases design time by allowing a higher-level design specification, rather
than specifying the design from the FPGA base elements.
b. Reduces the errors that can occur during a manual translation of a hardware
description to a schematic design.
c. Allows you to apply the automation techniques used by the synthesis tool
(such as machine encoding styles and automatic I/O insertion) during
optimization to the original HDL code. This results in greater optimization
and efficiency.
IV. Early Testing of Various Design Implementations: HDLs allow you to test
different design implementations early in the design flow. Use the synthesis tool to
perform the logic synthesis and optimization into gates. Additionally, Xilinx FPGA
devices allow you to implement your design at your computer. Since the synthesis
time is short, you have more time to explore different architectural possibilities at
the Register Transfer Level (RTL). You can reprogram Xilinx FPGA devices to test
several design implementations. You can retarget RTL code to new FPGA devices
with minimum recoding.

2.6 VHDL & Verilog


Both VHDL and Verilog are well established hardware description languages.
They have the advantage that the user can define high-level algorithms and low-level
optimizations (gate-level and switch-level) in the same language. A basic example of
VHDL code, the evaluation of the Fibonacci series, is shown below, and it is a good
example of the points made above. The code itself is reasonably straightforward for
a software programmer to understand, provided that he/she understands that this is a
truly parallel language and all lines are executing at once. It is also straightforward
to simulate a simple design of this nature. However, it is surprisingly difficult to
implement it in hardware, and this difficulty is a direct result of I/O issues. As
noted above, for a design to work in hardware, access is required to resources that
are external to the FPGA, such as memory, and an FPGA is, by its very nature, unaware
of the components to which it is connected. If you want to retrieve a value from
main memory and use it on the FPGA then you need to instantiate a memory
controller. While systems such as the Cray XD1 provide cores for communicating
with memory, such cores are still complex and unfamiliar to software programmers.
Our early experiences with VHDL have indicated that it should only be used for
FPGA development if you are in a position to work closely with experienced
hardware designers throughout the development process.


CHAPTER-3
DESIGN APPROACH
Low-power, area-efficient, and high-performance VLSI systems are increasingly
used in portable and mobile devices, multistandard wireless receivers, and
biomedical instrumentation [1], [2]. An adder is the main component of an
arithmetic unit. A complex digital signal processing (DSP) system involves several
adders. An efficient adder design essentially improves the performance of a complex
DSP system. A ripple carry adder (RCA) uses a simple design, but carry propagation
delay (CPD) is the main concern in this adder. Carry look-ahead and carry select
(CS) methods have been suggested to reduce the CPD of adders. A conventional
carry select adder (CSLA) is an RCA-RCA configuration that generates a pair of
sum words and output carry bits corresponding to the anticipated input carry
(cin = 0 and 1) and selects one out of each pair for the final sum and final output
carry [3]. A conventional CSLA has less CPD than an RCA, but the design is not
attractive since it uses a dual RCA. Few attempts have been made to avoid the dual
use of RCA in CSLA design. Kim and Kim [4] used one RCA and one add-one circuit
instead of two RCAs, where the add-one circuit is implemented using a multiplexer
(MUX). He et al. [5] proposed a square-root (SQRT)-CSLA to implement large
bit-width adders with less delay. In a SQRT-CSLA, CSLAs with increasing size are
connected in a cascading structure. The main objective of SQRT-CSLA design is to
provide a parallel path for carry propagation that helps to reduce the overall
adder delay. A binary to excess-1 converter (BEC)-based CSLA was suggested in [6].
The BEC-based CSLA involves fewer logic resources than the conventional CSLA, but
it has marginally higher delay. A CSLA based on common Boolean logic (CBL) is also
proposed in [7] and [8]. The CBL-based CSLA of [7] involves significantly fewer
logic resources than the conventional CSLA, but it has a longer CPD, which is almost
equal to that of the RCA. To overcome this problem, a SQRT-CSLA based on CBL was
proposed in [8]. However, the CBL-based SQRT-CSLA design of [8] requires more logic
resources and delay than the BEC-based SQRT-CSLA of [6]. We observe that logic
optimization largely depends on the availability of redundant operations in the
formulation, whereas adder delay mainly depends on data dependence. In the
existing designs, logic is optimized without giving any consideration to the data
dependence. In this brief, we analyze the logic operations involved in conventional
and BEC-based CSLAs to study the data dependence and to identify redundant logic
operations. Based on this analysis, we propose a logic formulation for the CSLA.

The main contribution in this brief is a logic formulation based on data dependence
and an optimized carry generator (CG) and CS design. Based on the proposed logic
formulation, we have derived an efficient logic design for the CSLA. Due to the
optimized logic units, the proposed CSLA involves a significantly lower area-delay
product (ADP) than the existing CSLAs. We have shown that the SQRT-CSLA using the
proposed CSLA design involves nearly 32% less ADP and consumes 33% less energy than
the corresponding SQRT-CSLA.

3.1 Overview of Carry Select Adder


The carry select adder generally consists of two ripple carry adders and a
multiplexer. Adding two n-bit numbers with a carry select adder is done with two
adders (therefore two ripple carry adders) in order to perform the calculation twice,
once with the assumption of the carry being zero and once assuming it is one.
After the two results are calculated, the correct sum, as well as the correct carry,
is selected with the multiplexer once the correct carry is known. The number of
bits in each carry select block can be uniform or variable. In the uniform case, the
optimal delay occurs for a block size of $\lfloor \sqrt{n} \rfloor$. When variable,
the block size should have a delay, from addition inputs A and B to the carry out,
equal to that of the multiplexer chain leading into it, so that the carry out is
calculated just in time. The $O(\sqrt{n})$ delay is derived from uniform sizing,
where the ideal number of full-adder elements per block is equal to the square root
of the number of bits being added, since that will yield an equal number of MUX
delays.
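
The selection scheme described above can be sketched as a bit-level behavioral
model. This is only an illustration in Python, not the HDL used in this project;
the function names, the little-endian bit lists and the default of a 16-bit adder
with uniform 4-bit blocks are assumptions made for the example.

```python
def ripple_carry_add(a_bits, b_bits, cin):
    """Add two equal-length little-endian bit lists; return (sum_bits, carry_out)."""
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)            # full-adder sum
        carry = (a & b) | (carry & (a ^ b))       # full-adder carry
    return sum_bits, carry


def carry_select_add(a, b, width=16, block=4, cin=0):
    """Uniform-block carry select addition; return (sum, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result = 0
    carry = cin
    for lo in range(0, width, block):
        hi = min(lo + block, width)
        # Both candidate results are computed without waiting for the real carry.
        s0, c0 = ripple_carry_add(a_bits[lo:hi], b_bits[lo:hi], 0)
        s1, c1 = ripple_carry_add(a_bits[lo:hi], b_bits[lo:hi], 1)
        # The incoming carry acts as the multiplexer select line.
        chosen, carry = (s1, c1) if carry else (s0, c0)
        for i, bit in enumerate(chosen):
            result |= bit << (lo + i)
    return result, carry


print(carry_select_add(0xFFFF, 0x0001))  # (0, 1): sum wraps to 0 with carry out 1
```

In hardware the two ripple carry additions of every block run in parallel, so only
the multiplexer select has to wait for the carry from the previous block.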

However, the carry select adder is not area efficient because it uses multiple pairs
of Ripple Carry Adders to generate partial sums and carries for both values of the
carry input, and then the final sum and carry are selected by the multiplexers
(mux). To overcome this problem, the CSLA is modified by using n-bit Binary to
Excess-1 Converters (BEC) to reduce the area. The logic can be implemented with any
type of adder to further improve the speed. We use the Binary to Excess-1 Converter
(BEC) instead of the ripple carry adder with cin = 1 in the regular CSLA to achieve
lower area and power consumption. The main advantage of this BEC logic comes from
its smaller number of logic gates compared with the Full Adder (FA) structure. The
modified design has reduced area and power compared with the regular SQRT-CSLA,
with an increase in the delay. Therefore, an improved CSLA was designed with a
D-latch replacing the BEC in the modified CSLA. This design reduces the delay,
thereby increasing the speed and making it a high speed Carry Select Adder.

The factors which are desirable in adders are as follows:
High speed, low power consumption
Area efficient
Robustness and noise stability
Insensitivity to process variables
Less internal activity when activity is low

Depending on the requirements of the adder, the designer has to consider all these
parameters while choosing an adder structure. What makes this decision even harder
is that most of these parameters are usually not independent of each other. The
tradeoffs between the desired parameters make this decision a multi-dimensional
optimization problem. For a high performance system, a multi-dimensional
optimization problem for a non-linear system that usually has hundreds of variables
is unfortunately impossible to solve within the limited design time.
The idea for this thesis is to explore the area, power consumption and time delay
of different adder structures. This will give us a good understanding of the
different structures and make the decision easier for designers.
The Ripple Carry Adder (RCA) provides the most compact design but takes the
longest computing time. For an N-bit RCA, the delay is linearly proportional to N.
Thus for large values of N the RCA gives the highest delay of all adders. The Carry
Look Ahead Adder (CLA) gives fast results but consumes a large area. For an N-bit
adder, the CLA is fast for N ≤ 4, but for large values of N its delay increases more
than that of other adders. So for a higher number of bits, the CLA gives higher
delay than other adders due to the large fan-in and the large number of logic gates.
The Carry Select Adder (CSA) provides a compromise between the small area but
longer delay of the RCA and the large area but shorter delay of the CLA. In the
rapidly growing mobile industry, faster units are not the only concern; smaller area
and lower power are also major concerns in the design of digital circuits. In mobile
electronics, reducing area and power consumption are key factors in increasing
portability and battery life. Even in servers and desktop computers, power
dissipation is an important design constraint. The design of area and power
efficient high-speed data path logic systems is one of the most substantial areas of
research in VLSI system design.
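
The delay tradeoff between the RCA and the carry select adder can be illustrated
with a simple unit-delay model. The model is an assumption made for this sketch:
every full adder and every multiplexer is given one unit of delay, all blocks of a
uniform carry select adder ripple in parallel, and wiring and fan-out effects are
ignored. The function names are illustrative.

```python
import math


def rca_delay(n_bits, t_fa=1.0):
    """Ripple carry adder: the carry passes through every full adder in turn."""
    return n_bits * t_fa


def uniform_csla_delay(n_bits, block, t_fa=1.0, t_mux=1.0):
    """Uniform carry select adder: one block ripple, then a chain of multiplexers."""
    n_blocks = math.ceil(n_bits / block)
    return block * t_fa + (n_blocks - 1) * t_mux


def best_block_size(n_bits):
    """Block size that minimizes the modeled delay (smallest size on ties)."""
    return min(range(1, n_bits + 1), key=lambda k: uniform_csla_delay(n_bits, k))


for n in (16, 64):
    k = best_block_size(n)
    print(n, rca_delay(n), k, uniform_csla_delay(n, k))
```

Under these assumptions the RCA delay grows linearly with the word length, while the
best uniform block size comes out at the square root of the word length and the
carry select delay grows roughly as twice that square root.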
In the present work, the design of 8-bit adder topologies such as the ripple carry
adder, carry look ahead adder, carry skip adder, carry select adder, carry increment
adder, carry save adder and carry bypass adder is presented. Performance issues like
area, power dissipation and propagation delay for all the adders are analyzed in a
0.12 µm, 6-metal-layer CMOS technology using the Microwind tool, which tightly
integrates mixed-signal implementation with digital implementation, circuit
simulation, transistor-level extraction and verification. In digital adders, the
speed of addition is limited by the time required to propagate a carry through the
adder. The sum for each bit position in an elementary adder is generated
sequentially only after the previous bit position has been summed and a carry
propagated into the next position. The CSLA is used in many computational systems to
alleviate the problem of carry propagation delay by independently generating
multiple carries and then selecting a carry to generate the sum [1].
However, the CSLA is not area-efficient because it uses multiple pairs of Ripple
Carry Adders (RCAs) to generate partial sums and carries for carry inputs Cin = 0
and Cin = 1; the final sum and carry are then selected by multiplexers (mux). The
basic idea of this work is to use a simple combinational circuit instead of the RCA
with Cin = 1 and the multiplexer in the regular CSLA to achieve lower area and
power. The main advantage of this logic is its lower power consumption compared
with the n-bit Full Adder (FA) structure. The SQRT CSLA has been developed using
this simple combinational circuit and compared with the regular SQRT CSLA. A
regular CSLA uses two copies of the carry evaluation block, one with the block carry
input set to zero and the other with the block carry input set to one, and therefore
suffers from the disadvantage of occupying more chip area. The modified CSLA, which
uses a Binary to Excess-1 Converter, reduces the area and power compared with the
regular CSLA at the cost of an increase in delay. This project proposes a scheme
that uses D-latches to reduce the delay, area and power compared with both the
regular and the modified CSLA.

3.2 Operation
The Carry Select Adder (CSA) is one of the fastest adders and is used in many
data-processing processors to perform fast arithmetic functions. The carry-select
adder partitions the adder into several groups, each of which performs two additions
in parallel. Two copies of a ripple-carry adder therefore act as the carry evaluation
block in each select stage. One copy evaluates the carry chain assuming the block
carry-in is zero, while the other assumes it to be one. Once the carry signals are
finally computed, the correct sum and carry-out signals are simply selected by a set
of multiplexers. The 4-bit adder block is an RCA. However, the CSLA is not
area-efficient because it uses multiple pairs of Ripple Carry Adders (RCAs) to
generate partial sums and carries for carry inputs Cin = 0 and Cin = 1; the final sum
and carry are then selected by the multiplexers (MUX).
The carry-select adder generally consists of two ripple carry adders and a
multiplexer. Adding two n-bit numbers with a carry-select adder is done with two
adders (therefore two ripple carry adders) in order to perform the calculation twice,
once with the assumption of the carry being zero and once assuming one. After the
two results are calculated, the correct sum, as well as the correct carry, is selected
with the multiplexer once the correct carry is known. The number of bits in each
carry select block can be uniform or variable. In the uniform case, the optimal delay
occurs for a block size of ⌊√n⌋. In the variable case, each block should have a
delay, from the addition inputs A and B to the carry out, equal to that of the
multiplexer chain leading into it, so that the carry out is calculated just in time.
The O(√n) delay is derived from uniform sizing, where the ideal number of full-adder
elements per block is equal to the square root of the number of bits being added,
since that yields an equal number of MUX delays.
In the basic 4-bit building block, two 4-bit ripple carry adders are multiplexed
together, and the resulting carry and sum bits are selected by the carry-in. Since one
ripple carry adder assumes a carry-in of 0 and the other assumes a carry-in of 1,
selecting the adder whose assumption was correct via the actual carry-in yields the
desired result. A 16-bit carry-select adder with a uniform block size of 4 can be
created with three of these blocks and a 4-bit ripple carry adder. Since the carry-in
is known at the beginning of the computation, a carry select block is not needed for
the first four bits. The delay of this adder is four full adder delays plus three MUX
delays.
A 16-bit carry-select adder with variable block sizes can be created similarly, for
example with block sizes of 2-2-3-4-5. This break-up is ideal when the full-adder
delay is equal to the MUX delay, which is unlikely. The total delay is two full adder
delays and four MUX delays.
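
The block-wise operation and the delay counts quoted above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the function names (`ripple_block`, `carry_select_add`, `carry_select_delay`) are hypothetical, the first block is a plain RCA, and the delay model simply counts one unit per full adder and one unit per MUX.

```python
def ripple_block(a_bits, b_bits, cin):
    """Ripple-carry add two LSB-first bit lists; return (sum_bits, carry_out)."""
    sum_bits, carry = [], cin
    for x, y in zip(a_bits, b_bits):
        sum_bits.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return sum_bits, carry


def carry_select_add(a, b, block_sizes, cin=0):
    """Behavioural carry-select addition; block_sizes are listed LSB block first."""
    width = sum(block_sizes)
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    out, carry, pos = [], cin, 0
    for k, size in enumerate(block_sizes):
        xa, xb = a_bits[pos:pos + size], b_bits[pos:pos + size]
        if k == 0:
            # The carry-in of the first block is known, so a single RCA suffices.
            s, carry = ripple_block(xa, xb, carry)
        else:
            # Both candidates are computed; the real carry acts as the MUX select.
            s0, c0 = ripple_block(xa, xb, 0)
            s1, c1 = ripple_block(xa, xb, 1)
            s, carry = (s1, c1) if carry else (s0, c0)
        out.extend(s)
        pos += size
    return sum(bit << i for i, bit in enumerate(out)), carry


def carry_select_delay(block_sizes, t_fa=1, t_mux=1):
    """Carry-out delay in a simple unit model (t_fa and t_mux are assumptions)."""
    t = block_sizes[0] * t_fa
    for size in block_sizes[1:]:
        # A block's MUX can switch once its RCAs and the incoming carry are both ready.
        t = max(size * t_fa, t) + t_mux
    return t
```

With `t_fa = t_mux = 1`, `carry_select_delay([4, 4, 4, 4])` gives 7 (four full adder delays plus three MUX delays) and `carry_select_delay([2, 2, 3, 4, 5])` gives 6 (two full adder delays plus four MUX delays).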
Addition is the heart of computer arithmetic, and the arithmetic unit is often the
workhorse of a computational circuit. Adders are a necessary component of a data
path, e.g. in microprocessors or signal processors. There are many ways to design
an adder. As noted earlier in this chapter, the RCA is the most compact design but
its delay grows linearly with the number of bits N, whereas the CLA is fast for
N ≤ 4 but consumes a large area and slows down for large N because of its large
fan-in and large number of logic gates. Among the various adders, the CSA is
intermediate regarding speed and area, which makes it attractive for mobile
electronics as well as for servers and desktop computers, where area and power
dissipation are important design constraints alongside speed.

3.3 Why we replaced Regular CSLA with Modified CSLA?


A regular CSLA has two ripple carry adders (RCAs) in each module, which perform
the addition for the two possible values of the carry.
Using two RCAs in each module increases the number of transistors.
The increase in the number of transistors leads to an increase in area and power
consumption.
The second RCA in each module can be replaced by a binary to excess-1 converter,
which performs the same operation with fewer transistors. This leads to the modified
CSLA, which is area-efficient and consumes less power.

Fig. 3.1 Block diagram of regular CSLA


Fig. 3.2 Block diagram of modified CSLA

Code converters are essential in digital systems. The excess-1 code is obtained by
adding one to the binary value. The detailed structures of the 5-bit BEC without
carry (BEC) and with carry (BECWC) are shown in Fig. 3.3. The BEC takes n inputs
and generates n outputs; the BECWC takes n inputs and generates n+1 outputs, where
the extra carry output serves as the selection input of the next-stage mux used in
the final adder design. The function table of the BEC and BECWC is shown in
Table 3.1.


Table 3.1 Truth table

Multipliers with large bit widths require multiple BECs, each of which requires its
selection input from the carry output of the preceding BEC.

Fig. 3.3 The 5-bit Binary to Excess-1 Code Converter: (a) BEC (without carry),
(b) BECWC (with carry).

3.4 Logic Formulation


The CSLA has two units: 1) the sum and carry generator unit (SCG) and 2) the
sum and carry selection unit [9]. The SCG unit consumes most of the logic resources
of CSLA and significantly contributes to the critical path. Different logic designs
have been suggested for efficient implementation of the SCG unit. We made a study
of the logic designs suggested for the SCG unit of conventional and BEC-based
CSLAs of [6] by suitable logic expressions. The main objective of this study is to
identify redundant logic operations and data dependence. Accordingly, we remove
all redundant logic operations and sequence logic operations based on their data
dependence.

Fig. 3.4 (a) Conventional CSLA; n is the input operand bit-width. (b) The logic
operations of the RCA are shown in split form, where HSG, HCG, FSG, and
FCG represent half-sum generation, half-carry generation, full-sum generation,
and full-carry generation, respectively.

3.4.1 Logic Expressions of the SCG Unit of the Conventional CSLA


The SCG unit of the conventional CSLA [3], shown in Fig. 3.4 (a), is composed of
two n-bit RCAs, where n is the adder bit-width. The logic operation of the n-bit RCA
shown in Fig. 3.4 (b) is performed in four stages:

Half-sum generation (HSG);

Half-carry generation (HCG);

Full-sum generation (FSG); and

Full carry generation (FCG).

Suppose two n-bit operands are added in the conventional CSLA. Then RCA-1 and
RCA-2 generate the n-bit sums ($s^0$ and $s^1$) and output-carries ($c^0_{out}$ and
$c^1_{out}$) corresponding to the input-carries (cin = 0 and cin = 1), respectively.
The logic expressions of RCA-1 and RCA-2 of the SCG unit of the n-bit CSLA are given
as

$$
\begin{aligned}
s_0^0(i) &= A(i) \oplus B(i), \qquad c_0^0(i) = A(i) \cdot B(i) \\
s_1^0(i) &= s_0^0(i) \oplus c_1^0(i-1) \\
c_1^0(i) &= c_0^0(i) + s_0^0(i) \cdot c_1^0(i-1), \qquad c_{out}^0 = c_1^0(n-1) \\
s_0^1(i) &= A(i) \oplus B(i), \qquad c_0^1(i) = A(i) \cdot B(i) \\
s_1^1(i) &= s_0^1(i) \oplus c_1^1(i-1) \\
c_1^1(i) &= c_0^1(i) + s_0^1(i) \cdot c_1^1(i-1), \qquad c_{out}^1 = c_1^1(n-1)
\end{aligned}
\tag{1}
$$

where $c_1^0(-1) = 0$, $c_1^1(-1) = 1$, and $0 \le i \le n-1$.
As stated above, the main idea of this work is to use a BEC instead of the RCA with
Cin = 1 in order to reduce the area and power consumption of the regular CSLA. To
replace the n-bit RCA, an (n+1)-bit BEC is required.
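
The four-stage view of the RCA (HSG, HCG, FSG, FCG) and the two-RCA SCG unit can be sketched behaviourally as follows. This is an illustrative model only, not the report's VHDL; the function names are hypothetical, and operands are passed as Python integers with an explicit bit-width `n`.

```python
def rca_stages(a, b, n, cin):
    """n-bit RCA written as the four stages HSG, HCG, FSG and FCG."""
    # HSG and HCG: half-sum and half-carry of every bit position.
    half_sum = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    half_carry = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    total, carry = 0, cin
    for i in range(n):
        total |= (half_sum[i] ^ carry) << i               # FSG
        carry = half_carry[i] | (half_sum[i] & carry)     # FCG
    return total, carry


def conventional_scg(a, b, n):
    """SCG unit of the conventional CSLA: RCA-1 (cin = 0) and RCA-2 (cin = 1)."""
    return rca_stages(a, b, n, 0), rca_stages(a, b, n, 1)


def select(scg_outputs, cin):
    """Sum and carry selection unit: pick the pair matching the real carry-in."""
    return scg_outputs[cin]
```

For example, `select(conventional_scg(9, 7, 4), 1)` returns `(1, 1)`, since 9 + 7 + 1 = 17 = 16 + 1.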

3.4.2 Logic Expression of the SCG Unit of the BEC-Based CSLA


The RCA, as shown in Fig. 3.2, calculates the n-bit sum $s_1^0$ and $c_{out}^0$
corresponding to cin = 0. The BEC unit receives $s_1^0$ and $c_{out}^0$ from the RCA
and generates the (n + 1)-bit excess-1 code. The most significant bit (MSB) of the
BEC output represents $c_{out}^1$, and the n least significant bits (LSBs) represent
$s_1^1$. The logic expressions of the BEC unit are

$$
\begin{aligned}
s_1^1(0) &= \overline{s_1^0(0)}, \qquad c_1^1(0) = s_1^0(0) \\
s_1^1(i) &= s_1^0(i) \oplus c_1^1(i-1) \\
c_1^1(i) &= s_1^0(i) \cdot c_1^1(i-1) \\
c_{out}^1 &= c_1^0(n-1) \oplus c_1^1(n-1)
\end{aligned}
\tag{2}
$$

for $1 \le i \le n-1$.

Since the carry words depend only on the half-sum, the half-carry and cin, the
required carry word can be selected before the final sum is calculated. The selected
carry word is added with the half-sum ($s_0$) to generate the final-sum (s). Using
this method, one can have three design advantages:
1. Calculation of $s_1^0$ is avoided in the SCG unit;
2. An n-bit select unit is required instead of an (n + 1)-bit one; and
3. The output-carry delay is small.
All these features result in an area-delay and energy-efficient design for the
CSLA. We have removed all the redundant logic operations of (2) and rearranged the
logic expressions of (2) based on their dependence. The proposed logic formulation
for the CSLA is given as

$$
\begin{aligned}
s_0(i) &= A(i) \oplus B(i), \qquad c_0(i) = A(i) \cdot B(i) \\
c_1^0(i) &= c_1^0(i-1) \cdot s_0(i) + c_0(i) \quad \text{for } c_1^0(-1) = 0 \\
c_1^1(i) &= c_1^1(i-1) \cdot s_0(i) + c_0(i) \quad \text{for } c_1^1(-1) = 1 \\
c(i) &= c_1^0(i) \quad \text{if } c_{in} = 0 \\
c(i) &= c_1^1(i) \quad \text{if } c_{in} = 1
\end{aligned}
\tag{3}
$$
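
As a cross-check of the BEC expressions above, the following sketch derives the cin = 1 result from the cin = 0 result of a single RCA. It is a behavioural model under the assumption that the BEC carry chain is a plain incrementer; the helper names (`rca0`, `bec_unit`, `bec_csla`) are hypothetical.

```python
def rca0(a, b, n):
    """n-bit RCA with cin = 0; returns (LSB-first sum bits, carry-out)."""
    bits, carry = [], 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        bits.append(x ^ y ^ carry)
        carry = (x & y) | ((x ^ y) & carry)
    return bits, carry


def bec_unit(s10, cout0):
    """Derive the cin = 1 sum bits and carry-out from the cin = 0 results."""
    s11 = [s10[0] ^ 1]       # LSB is inverted
    c11 = s10[0]             # carry out of the LSB of the incrementer
    for i in range(1, len(s10)):
        s11.append(s10[i] ^ c11)
        c11 = s10[i] & c11
    return s11, cout0 ^ c11


def bec_csla(a, b, n, cin):
    """Single-stage BEC-based CSLA: one RCA, one BEC and a final selection."""
    s10, cout0 = rca0(a, b, n)
    s11, cout1 = bec_unit(s10, cout0)
    bits, cout = (s11, cout1) if cin else (s10, cout0)
    return sum(bit << i for i, bit in enumerate(bits)), cout
```

Because the cin = 0 carry-out and the incrementer carry can never both be 1, the final XOR in `bec_unit` could equally be written as an OR.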


Fig. 3.5 Structure of the BEC-based CSLA; n is the input operand bit-width.

3.5 Proposed Adder Design


The proposed CSLA is based on the logic formulation given in (3), and its
structure is shown in Fig. 3.6 (a). It consists of one HSG unit, one FSG unit, one CG
unit, and one CS unit. The CG unit is composed of two CGs (CG0 and CG1)
corresponding to input-carry 0 and 1. The HSG receives two n-bit operands (A
and B) and generates the half-sum word $s_0$ and the half-carry word $c_0$, of width
n bits each. Both CG0 and CG1 receive $s_0$ and $c_0$ from the HSG unit and generate
two n-bit full-carry words $c_1^0$ and $c_1^1$ corresponding to input-carry 0 and 1,
respectively.

Fig. 3.6 (a) Proposed CS adder design, where n is the input operand bit-width,
and [*] represents delay (in the unit of inverter delay), η = max (t, 3.5n + 2.7).
(b) Gate-level design of the HSG. (c) Gate-level optimized design of (CG0) for
input-carry = 0. (d) Gate-level optimized design of (CG1) for input-carry = 1.
(e) Gate-level design of the CS unit. (f) Gate-level design of the final sum
generation (FSG) unit.


The logic diagram of the HSG unit is shown in Fig. 3.6 (b). The logic circuits of
CG0 and CG1 are optimized to take advantage of the fixed input-carry bits. The
optimized designs of CG0 and CG1 are shown in Fig. 3.6 (c) and (d), respectively.
The CS unit selects one final carry word from the two carry words available at its
input lines using the control signal cin. It selects $c_1^0$ when cin = 0; otherwise,
it selects $c_1^1$. The CS unit can be implemented using an n-bit 2-to-1 MUX.
However, we find from the truth table of the CS unit that the carry words $c_1^0$ and
$c_1^1$ follow a specific bit pattern. If $c_1^0(i) = 1$, then $c_1^1(i) = 1$,
irrespective of $s_0(i)$ and $c_0(i)$, for $0 \le i \le n-1$. This feature is used for
logic optimization of the CS unit. The optimized design of the CS unit is shown in
Fig. 3.6 (e), which is composed of n AND-OR gates. The final carry word c is obtained
from the CS unit. The MSB of c is sent to the output as cout, and the (n − 1) LSBs are
XORed with the (n − 1) MSBs of the half-sum ($s_0$) in the FSG [shown in Fig. 3.6 (f)]
to obtain the (n − 1) MSBs of the final-sum (s). The LSB of $s_0$ is XORed with cin to
obtain the LSB of s.
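
The data flow through the HSG, CG0/CG1, CS and FSG units can be sketched as a behavioural model. The AND-OR selection used below, c(i) = c0(i) OR (cin AND c1(i)), is one form consistent with the bit pattern noted above; it is an assumption and does not claim to reproduce the exact gate arrangement of Fig. 3.6 (e). The function name is hypothetical.

```python
def proposed_csla(a, b, n, cin):
    """Behavioural sketch of the single-stage proposed CSLA."""
    # HSG: half-sum and half-carry words.
    s0 = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    c0 = [((a >> i) & (b >> i)) & 1 for i in range(n)]

    def carry_word(initial_carry):
        # CG0 / CG1: full-carry word for a fixed input-carry.
        word, prev = [], initial_carry
        for i in range(n):
            prev = (prev & s0[i]) | c0[i]
            word.append(prev)
        return word

    carry0, carry1 = carry_word(0), carry_word(1)

    # CS: AND-OR selection, valid because carry0[i] = 1 implies carry1[i] = 1.
    c = [carry0[i] | (cin & carry1[i]) for i in range(n)]

    # FSG: LSB uses cin, the remaining bits use the selected carry word.
    s_bits = [s0[0] ^ cin] + [s0[i] ^ c[i - 1] for i in range(1, n)]
    return sum(bit << i for i, bit in enumerate(s_bits)), c[n - 1]
```

Note that no full-sum word is computed before the selection: only carry words are generated in parallel, and the sum is formed once from the selected carry word.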

We have considered all the gates to be made of 2-input AND, 2-input OR, and
inverter (AOI) gates. A 2-input XOR is composed of 2 AND, 1 OR, and 2 NOT gates.
The area and delay of the 2-input AND, 2-input OR, and NOT gates are taken from the
Synopsys Armenia Educational Department (SAED) 90-nm standard cell library
datasheet for theoretical estimation. The area and delay of a design are calculated
using the following relations:

$$
\begin{aligned}
A &= a \cdot N_a + r \cdot N_o + i \cdot N_i \\
T &= n_a \cdot T_a + n_o \cdot T_o + n_i \cdot T_i
\end{aligned}
\tag{4}
$$

where ($N_a$, $N_o$, $N_i$) and ($n_a$, $n_o$, $n_i$), respectively, represent the
(AND, OR, NOT) gate counts of the total design and of its critical path, and
(a, r, i) and ($T_a$, $T_o$, $T_i$), respectively, represent the area and delay of
one (AND, OR, NOT) gate. We have calculated the AOI gate counts of each design, and
the area and delay of each design are estimated from the AOI gate counts
($N_a$, $N_o$, $N_i$), ($n_a$, $n_o$, $n_i$) and the cell details. The delay of each
intermediate and output signal of the proposed n-bit CSLA design of Fig. 3.6 is
shown in square brackets against each signal. We can observe that the proposed
n-bit single-stage CSLA involves 6n fewer AOI gates than the CSLA of [6] and takes
2.7 and 6.6 units less delay to calculate the final-sum and the output-carry,
respectively. Compared with the CBL-based CSLA of [7], the proposed CSLA design
involves n more AOI gates, and it takes (n − 4.7) units less delay to calculate the
output-carry.
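
The area and delay relations above amount to two weighted sums over gate counts, which can be sketched as follows. The cell values used here are placeholders chosen only to make the arithmetic easy to follow; they are not the SAED 90-nm datasheet figures. The assumed XOR critical path (NOT, then AND, then OR) corresponds to the usual AND-OR-NOT realization and is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    area: float
    delay: float


def estimate(total_counts, path_counts, cells):
    """Area from total gate counts, delay from critical-path gate counts."""
    area = sum(total_counts[g] * cells[g].area for g in cells)
    delay = sum(path_counts[g] * cells[g].delay for g in cells)
    return area, delay


# PLACEHOLDER cell data (hypothetical units, not taken from any datasheet).
cells = {"and": Cell(2.0, 2.0), "or": Cell(2.0, 2.0), "not": Cell(1.0, 1.0)}

# A 2-input XOR built from 2 AND, 1 OR and 2 NOT gates.
xor_total = {"and": 2, "or": 1, "not": 2}
xor_path = {"and": 1, "or": 1, "not": 1}   # assumed path: NOT -> AND -> OR

xor_area, xor_delay = estimate(xor_total, xor_path, cells)
```

With these placeholder values the XOR costs 8.0 area units and 5.0 delay units; substituting real cell data changes only the `cells` dictionary.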
In this work the following adder structures are used:
Ripple Carry Adder
Carry Save Adder
Carry Look-Ahead Adder
Carry Increment adder
Carry Skip Adder
Carry Bypass Adder
Carry Select Adder

3.5.1 Ripple Carry Adder (RCA)


The ripple carry adder is constructed by cascading full adder (FA) blocks in
series, as shown in Fig. 3.7. One full adder is responsible for the addition of two
binary digits at any stage of the ripple carry, and the carry-out of one stage is fed
directly to the carry-in of the next stage, so an n-bit parallel adder requires n full
adders. Even though this simple adder can be used to add numbers of unrestricted bit
length, it is not very efficient when numbers with many bits are used. One of its
most serious drawbacks is that the delay increases linearly with the bit length. The
worst-case delay of the RCA occurs when a carry signal transition ripples through
all stages of the adder chain from the least significant bit to the most significant
bit, and is approximated by

t = (n-1) tc + ts

where tc is the carry delay of one full adder stage and ts is the sum delay of the
final stage.

Fig. 3.7 A 4-bit Ripple Carry Adder



Not very efficient when numbers with a large number of bits are used.
Delay increases linearly with bit length.

3.5.2 Carry Select Adders (CSLA)


In the carry select adder scheme, blocks of bits are added in two ways: one assuming
a carry-in of 0 and the other a carry-in of 1. This results in two precomputed sum
and carry-out signal pairs ($s^0_{i-1:k}$, $c^0_i$; $s^1_{i-1:k}$, $c^1_i$). Later, as
the block's true carry-in ($c_k$) becomes known, the correct signal pair is selected.
Generally, multiplexers are used to propagate carries.

Fig. 3.8 A Carry Select Adder with 1 level using n/2- bit RCA
Because of the multiplexers, a larger area is required.
It has a smaller delay than the Ripple Carry Adder (about half the delay of the RCA).
Hence the Carry Select Adder is preferred when working with a smaller number of
bits.

3.5.3 Carry Look Ahead Adders (CLA)


Carry Look Ahead Adder can produce carries faster due to carry bits generated in
parallel by an additional circuitry whenever inputs change. This technique uses carry
bypass logic to speed up the carry propagation.

Fig. 3.9 4-bit CLA logic equations



Let ai and bi be the augend and addend inputs, ci the carry input, and si and ci+1
the sum and carry-out of the ith bit position. The auxiliary functions pi and gi,
called the propagate and generate signals, and the sum and carry outputs are defined
as follows.
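
In the standard carry look-ahead formulation, the signals named above can be written as shown below. The XOR form of the propagate signal is assumed here; some texts use the OR form, which produces the same carries. The expansion of the first few carries shows why each carry can be computed directly from the inputs and $c_0$.

$$
\begin{aligned}
p_i &= a_i \oplus b_i, \qquad g_i = a_i \cdot b_i \\
s_i &= p_i \oplus c_i \\
c_{i+1} &= g_i + p_i \cdot c_i \\
c_1 &= g_0 + p_0 c_0 \\
c_2 &= g_1 + p_1 g_0 + p_1 p_0 c_0 \\
c_3 &= g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0
\end{aligned}
$$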

As the number of bits in the carry look-ahead adder increases, the complexity
increases because the number of gates in the expression for ci+1 increases. In
practice it is therefore not desirable to use the traditional CLA shown above,
because it increases both the space required and the power.
Instead, smaller carry look-ahead adders are used in levels to create a larger
CLA. The smaller CLA is commonly taken to be a 4-bit CLA, so carry look-ahead can
be defined over a group of 4 bits. Accordingly, carry generate is redefined as the
group generated carry g [i, i+3] and carry propagate as the group propagated carry
p [i, i+3], which are defined below.
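
A 4-bit look-ahead group with its group generate and group propagate signals can be sketched as below, using the standard textbook expressions. The names `cla4` and `cla16` are hypothetical. In this behavioural model the group carries are evaluated in a loop; in hardware a second-level look-ahead unit would compute them in parallel from the group signals.

```python
def cla4(a, b, cin):
    """4-bit carry look-ahead group; returns (sum, group_generate, group_propagate)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    # Every carry is written directly in terms of p, g and cin (no rippling).
    c = [
        cin,
        g[0] | (p[0] & cin),
        g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin),
        g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin),
    ]
    total = sum((p[i] ^ c[i]) << i for i in range(4))
    group_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    group_p = p[3] & p[2] & p[1] & p[0]
    return total, group_g, group_p


def cla16(a, b, cin=0):
    """16-bit adder built from four 4-bit groups using the group signals."""
    total, carry = 0, cin
    for k in range(4):
        s, gg, gp = cla4((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF, carry)
        total |= s << (4 * k)
        carry = gg | (gp & carry)   # carry into the next group
    return total, carry
```

The group carry-out depends only on the group generate, the group propagate and the group carry-in, which is what allows larger adders to be assembled in levels.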

3.5.4 Binary to Excess-1 Converter


The main idea of this work is to use a BEC instead of the RCA with Cin = 1 in
order to reduce the area and power consumption of the regular CSLA. To replace the
n-bit RCA, an (n+1)-bit BEC is required. The function table of a 4-bit BEC is given
in Table 3.2. The basic function of the CSLA is obtained by using the 4-bit BEC
together with the mux. One input of the 2:1 mux receives (B3, B2, B1, B0) and the
other input of the mux is the BEC output. This produces the two possible partial
results in parallel, and the mux selects either the BEC output or the direct inputs
according to the control signal Cin. The importance of the BEC logic stems from the
large silicon area reduction when CSLAs with a large number of bits are designed.
The Boolean expressions of the 4-bit BEC are listed below (the functional symbols
are ~ NOT, & AND, ^ XOR):

X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)
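
The Boolean expressions of the 4-bit BEC can be checked directly against the "add one" behaviour. The sketch below is a bit-level model with hypothetical function names; `bec_mux` mirrors the 2:1 mux that chooses between the direct inputs and the BEC output.

```python
def bec4(b):
    """4-bit binary to excess-1 converter, written gate by gate."""
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x0 = b0 ^ 1                 # X0 = ~B0
    x1 = b0 ^ b1                # X1 = B0 ^ B1
    x2 = b2 ^ (b0 & b1)         # X2 = B2 ^ (B0 & B1)
    x3 = b3 ^ (b0 & b1 & b2)    # X3 = B3 ^ (B0 & B1 & B2)
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def bec_mux(b, cin):
    """2:1 mux: pass B through when cin = 0, select the BEC output when cin = 1."""
    return bec4(b) if cin else b
```

For every 4-bit input the result equals (B + 1) modulo 16, so the input 1111 wraps around to 0000.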


In the 4-bit BEC with a 2:1 multiplexer, one input of the 2:1 MUX is the output of
the 4-bit BEC and the other is the output of the 4-bit full adder with input carry
equal to zero. The selection line is the carry of the previous stage, which selects
one of the inputs as the output: if Cin = 1 the output is the 4-bit BEC output,
otherwise it is the adder output.
Table 3.2 Functional table of the 4-bit BEC
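B3 B2 B1 B0    X3 X2 X1 X0
0  0  0  0     0  0  0  1
0  0  0  1     0  0  1  0
0  0  1  0     0  0  1  1
0  0  1  1     0  1  0  0
0  1  0  0     0  1  0  1
0  1  0  1     0  1  1  0
0  1  1  0     0  1  1  1
0  1  1  1     1  0  0  0
1  0  0  0     1  0  0  1
1  0  0  1     1  0  1  0
1  0  1  0     1  0  1  1
1  0  1  1     1  1  0  0
1  1  0  0     1  1  0  1
1  1  0  1     1  1  1  0
1  1  1  0     1  1  1  1
1  1  1  1     0  0  0  0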

3.5.5 Multiplexer
In electronics, a multiplexer (or MUX) is a device that selects one of several
analog or digital input signals and forwards the selected input into a single line.
A multiplexer with 2^n inputs has n select lines, which are used to select which
input line to send to the output. Multiplexers are mainly used to increase the amount
of data that can be sent over the network within a certain amount of time and
bandwidth. A
multiplexer is also called a data selector. An electronic multiplexer makes it possible
for several signals to share one device or resource, for example one A/D converter or
one communication line, instead of having one device per input signal.


In digital circuit design, the selector wires are of digital value. In the case of a
2-to-1 multiplexer, a logic value of 0 connects the first input to the output, while
a logic value of 1 connects the second input to the output. In larger multiplexers,
the number of selector pins is equal to $\lceil \log_2 n \rceil$, where n is the
number of inputs. A 2-to-1 multiplexer has the Boolean equation
$Z = (A \cdot \overline{S}) + (B \cdot S)$, where A and B are the two inputs, S is the
selector input, and Z is the output.

Addition is the most common and most often used arithmetic operation in
microprocessors, digital signal processors and, especially, digital computers. It
also serves as a building block for the synthesis of all other arithmetic operations.
Therefore, for the efficient implementation of an arithmetic unit, the binary adder
structure becomes a very critical hardware unit. Any book on computer arithmetic
shows that there exists a large number of different circuit architectures with
different performance characteristics that are widely used in practice. Although
much research dealing with binary adder structures has been done, studies based on
their comparative performance analysis are few.

In this project, qualitative evaluations of the classified binary adder architectures
are given. Among the large number of adders, we wrote VHDL (VHSIC Hardware
Description Language) code for the ripple-carry, carry-select and carry-look-ahead
adders to emphasize the common performance properties belonging to their classes.
In the following section, we give a brief description of the studied adder
architectures. With respect to asymptotic delay time and area complexity, the binary
adder architectures can be categorized into four primary classes, as given in
Table 3.3. The results given in the table are the highest-exponent terms of the exact
formulas, which are very complex for large operand bit lengths.

The first class consists of the very slow ripple-carry adder with the smallest area.
In the second class, the carry-skip, carry-select adders with multiple levels have
small area requirements and shortened computation times. From the third class, the
carry-look ahead adder and from the fourth class, the parallel prefix adder represents
the fastest addition schemes with the largest area complexities.


Table 3.3 Categorization of adders w.r.t delay time and capacity

Cell-based design techniques, such as standard cells and FPGAs, together with
versatile hardware synthesis, are prerequisites for high productivity in ASIC design.
In the majority of digital signal processing (DSP) applications the critical
operations are addition, multiplication and accumulation. Addition is an
indispensable operation for any digital system, DSP or control system. Therefore, the
fast and accurate operation of a digital system is greatly influenced by the
performance of the resident adders. Adders are also a very significant component in
digital systems because of their widespread use in other basic digital operations
such as subtraction, multiplication and division. Hence, improving the performance
of the digital adder would extensively advance the execution of binary operations
inside a circuit comprised of such blocks. Many different adder architectures for
speeding up binary addition have been studied and proposed over the last decades.
For cell-based design techniques they can be well characterized with respect to
circuit area and speed as well as suitability for logic optimization and synthesis.
The Ripple Carry Adder (RCA) [1] [2] is the simplest but slowest adder, with O(n)
area and O(n) delay, where n is the operand size in bits. Carry Look-Ahead (CLA)
adders [3] [4] have O(n log n) area and O(log n) delay, but typically suffer from
irregular layout.

Addition, one of the most frequently used arithmetic operations, is employed to
build advanced operations such as multiplication and division. Theoretical research
has found that the lower bound on the critical path delay of an adder has complexity
O(log n), where n is the adder width. The design of high-performance adders has been
extensively studied [10] [15], and several adders have achieved logarithmic delays.
Whereas theoretical bounds indicate that no traditional adder can achieve
sub-logarithmic delay, it has been shown that speculative adders can achieve
sub-logarithmic delays by neglecting rare input patterns that exercise the critical
paths [2, 11, 13]. Furthermore, by augmenting speculative adders with error
detection and recovery, one can construct reliable variable-latency adders whose
average performance is very close to that of speculative adders [3, 6, 12, 17].

Speculative adders are built upon the observation that the critical path is rarely
activated in traditional adders. In traditional adders, each output depends on all
previous (lower or equal significance) bits. In particular, the most significant output
depends on all the n bits, where n is the adder width. In contrast, in speculative
adders [2, 6, 11, 13, 17], each output only depends on the previous k bits rather than
all previous bits, where k is much smaller than n. However, the cumulative error
grows linearly with the adder width since each speculative output can independently
be in error. Moreover, the calculation of each speculative output requires an
individual k-bit adder; hence, such designs also incur large area overhead and large
fanout at the primary inputs. Techniques such as effective sharing [17] can mitigate
but not eliminate fanout and area problems. Although the speculative adder in [18]
can mitigate the area problem, it incurs a fairly high error rate that limits its
application.

For applications where errors cannot be tolerated, a reliable variable latency adder
can be built upon the speculative adder by adding error detection and recovery [3, 6,
12, 17]. For the vast majority of input combinations, the speculative adder produces
correct results; when error detection flags an error, error recovery provides correct
results in one or more extra cycles. Ideally, the average performance of the variable
latency adder should be similar to the speculative one. However, existing variable
latency adders have several drawbacks. When error detection indicates no error, the
actual delay is the longer of the speculative adder and error detection. The delay of
error detection is always longer than the speculative adder [6] [17]. Hence, the
benefit of speculation is limited by the delay of error detection [3] [12]. Besides, the
circuitry for error detection and recovery incurs nontrivial area overhead. Finally,
variable latency adders are mostly restricted to random inputs [3, 12, 17]. This
thesis first describes a novel function speculation technique, called speculative carry
select addition (SCSA). The key idea is to segment the chain of propagate signals in
addition into blocks of the same size. Specifically, the input bits of addends are
segmented into blocks, and the carry bits between blocks are selectively truncated to
0. SCSA is less susceptible to errors, since it is only applied for blocks instead of
individual outputs.

A single individual adder is required to compute all outputs of a block instead of
each output, which mitigates the area overhead problem. An analytical model to
determine the error rate of SCSA is formulated, and the accurate relation between
the block size and output error is developed. A high performance speculative adder
design is presented for low error rates (e.g. 0.01% and 0.25%). Secondly, this thesis
describes a reliable variable latency adder design that augments the speculative
adder with error detection and recovery. The speculative adder produces correct
results in a single cycle in most cases, and error recovery provides correct results in
an extra cycle in worst cases. The performance of the variable latency adder is close
to that of the speculative adder. This approach has two advantages. First, the critical
path delay of the error detection block is lower or comparable to that of the
speculative adder. Second, the error detection and recovery circuitry incurs low area
overhead by using intermediate results from the speculative adder.

Finally, the previous variable latency and speculative adders are mainly designed
for unsigned random inputs, so this thesis proposes the modified variable latency
and speculative adders suitable for both random and Gaussian inputs. With modified
speculative adder and error detection block, the variable latency adder still achieves
high performance when 2's complement Gaussian inputs are present. This shows that the
variable latency adder design is feasible for practical applications.


3.6 Analysis of Adders


In our project we compared three different adders: the Ripple Carry Adder, the
Carry Select Adder and the Carry Look Ahead Adder. The basic purpose of our
experiment was to understand the time and power trade-offs between the different
adders, which gives a clear picture of which adder is best suited to which situation
during the design process. Below we present both the theoretical and the practical
comparisons of the three adders that were taken into consideration.

Table 3.4 Theoretical Comparison of Area Occupied

Table 3.5 Theoretical Comparison of Time Required

Table 3.6 Theoretical Area Delay Product (AxT)


Table 3.7 Comparison of Time Required (Simulated value)

3.7 Square Root CSLA (SQRT-CSLA)


The multipath carry propagation feature of the CSLA is fully exploited in the
SQRT-CSLA [5], which is composed of a chain of CSLAs. CSLAs of increasing
size are used in the SQRT-CSLA to extract the maximum concurrence in the carry
propagation path. Using the SQRT-CSLA design, large-size adders are implemented
with significantly less delay than a single-stage CSLA of same size. However, carry
propagation delay between the CSLA stages of SQRT-CSLA is critical for the
overall adder delay.

Fig. 3.10 Proposed SQRT-CSLA for n = 16. All intermediate and output signals
are labeled with delay

Due to early generation of output-carry with the multipath carry propagation
feature, the proposed CSLA design is more favorable than the existing CSLA
designs for area-delay [10] efficient implementation of the SQRT-CSLA. A 16-bit
SQRT-CSLA design using the proposed CSLA is shown in Fig. 3.10, where the 2-bit
RCA, 2-bit CSLA, 3-bit CSLA, 4-bit CSLA, and 5-bit CSLA are used. We have
considered the cascaded configurations of (2-bit RCA and 2-bit, 3-bit, 4-bit, 6-bit,
7-bit, and 8-bit CSLAs) and (2-bit RCA and 2-bit, 3-bit, 4-bit, 6-bit, 7-bit, 8-bit, 9-bit,
11-bit, and 12-bit CSLAs), respectively, for the 32-bit SQRT-CSLA and the 64-bit
SQRT-CSLA to optimize adder delay. To demonstrate the advantage of the
proposed CSLA design in the SQRT-CSLA, we have estimated the area and delay of
the SQRT-CSLA using the proposed CSLA design, the BEC-based CSLA of [6], and
the CBL-based CSLA of [7] for bit-widths 16 and 32.
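
The stage configurations listed above can be checked with a short behavioral model. This is only a sketch: the stage widths are copied from the text, while SQRT_CSLA_STAGES and sqrt_csla_add are hypothetical names, and each stage is modeled with integer arithmetic rather than gates.

```python
# Stage widths (LSB stage first) as listed in the text; the first stage is an RCA.
SQRT_CSLA_STAGES = {
    16: [2, 2, 3, 4, 5],
    32: [2, 2, 3, 4, 6, 7, 8],
    64: [2, 2, 3, 4, 6, 7, 8, 9, 11, 12],
}


def sqrt_csla_add(a, b, cin, n):
    """Chain the stages, passing each stage's carry-out to the next stage."""
    result, carry, offset = 0, cin, 0
    for width in SQRT_CSLA_STAGES[n]:
        mask = (1 << width) - 1
        x, y = (a >> offset) & mask, (b >> offset) & mask
        # Each stage precomputes both outcomes; the incoming carry selects one.
        candidates = [x + y, x + y + 1]
        chosen = candidates[carry]
        result |= (chosen & mask) << offset
        carry = chosen >> width
        offset += width
    return result, carry
```

The widths in each configuration add up to the adder width (16, 32, and 64), and the later stages are wider because their select signal arrives later.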


CHAPTER 4
RESULTS ANALYSIS
In this section, the synthesis and simulation results of the proposed method are reported.

4.1 Performance Evaluation:


The proposed system is implemented using Xilinx software, and the simulation
waveforms of each module are shown below.

4.1.1 Ripple Carry Adder (8-bit):

Fig 4.1 (a) Simulation Waveform Result of 8-bit Ripple Carry Adder
Table 4.1 Device Utilization summary of 8-bit Ripple Carry Adder

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 9 out of 14752
Number of 4 input LUTs   : 16 out of 29504
Number of IOs            : 26
Number of bonded IOBs    : 26 out of 376 (6%)


Fig 4.1 (b) RTL Diagram of 8-bit Ripple Carry Adder

4.1.2 CSA (8-bit):

Fig 4.2 (a) Simulation Waveform Result of 8-bit CSA


Table 4.2 Device Utilization summary of 8-bit CSA

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 9 out of 14752
Number of 4 input LUTs   : 16 out of 29504
Number of IOs            : 26
Number of bonded IOBs    : 26 out of 376 (6%)

Fig 4.2 (b) RTL diagram of 8 bit CSA

4.1.3 Proposed CSA (8-bit):

Fig 4.3 (a) Simulation Waveform Result of 8-bit Proposed CSA


Fig 4.3 (b) Design Summary of 8-bit Proposed CSA


Table 4.3 Synthesis Report of 8-bit Proposed CSA

HDL Synthesis Report
Macro Statistics
# Xors          : 9
 1-bit xor2     : 8
 8-bit xor2     : 1

Advanced HDL Synthesis Report
Macro Statistics
# Xors          : 9
 1-bit xor2     : 8
 8-bit xor2     : 1

Low Level Synthesis
Optimizing unit <prop8>...
Mapping all equations...
Building and optimizing final netlist...
Found area constraint ratio of 100 (+ 5) on block prop8, actual ratio is 0.

Table 4.4 Device Utilization summary of 8-bit Proposed CSA

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 14 out of 14752
Number of 4 input LUTs   : 26 out of 29504
Number of IOs            : 26
Number of bonded IOBs    : 26 out of 376 (6%)

Fig 4.3 (c) RTL diagram of 8 bit proposed CSA

4.2.1 Ripple Carry Adder (16-bit):

Fig 4.4 (a) Simulation Waveform Result of 16-bit Ripple Carry Adder
Table 4.5 Device Utilization summary of 16-bit Ripple Carry Adder

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 18 out of 14752
Number of 4 input LUTs   : 32 out of 29504
Number of IOs            : 50
Number of bonded IOBs    : 50 out of 376 (13%)


Fig 4.4 (b) RTL diagram of 16-bit Ripple Carry Adder

4.2.2 CSA (16-bit):

Fig 4.5 (a) Simulation Waveform Result of 16-bit CSA


Table 4.6 Device Utilization summary of 16-bit CSA

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 18 out of 14752
Number of 4 input LUTs   : 32 out of 29504
Number of IOs            : 50
Number of bonded IOBs    : 50 out of 376 (13%)

Fig 4.5 (b) RTL diagram of 16 bit CSA

4.2.3 Proposed CSA (16-bit):

Fig 4.6 (a) Simulation Waveform Result of 16-bit Proposed CSA


Fig 4.6 (b) Design Summary of 16-bit Proposed CSA


Table 4.7 Synthesis Report of 16-bit Proposed CSA

HDL Synthesis Report
Macro Statistics
# Xors          : 17
 1-bit xor2     : 16
 16-bit xor2    : 1

Advanced HDL Synthesis Report
Macro Statistics
# Xors          : 17
 1-bit xor2     : 16
 16-bit xor2    : 1

Low Level Synthesis
Optimizing unit <prop16>...
Mapping all equations...
Building and optimizing final netlist...
Found area constraint ratio of 100 (+ 5) on block prop16, actual ratio is 0.
Final Macro Processing...

Table 4.8 Device Utilization summary of 16-bit Proposed CSA

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 34 out of 14752
Number of 4 input LUTs   : 63 out of 29504
Number of IOs            : 50
Number of bonded IOBs    : 50 out of 376 (13%)

Fig 4.6 (c) RTL diagram of 16 bit Proposed CSA

4.3.1 Ripple Carry Adder (32-bit):

Fig 4.7 (a) Simulation Waveform Result of 32-bit Ripple Carry Adder
Table 4.9 Device Utilization summary of 32-bit Ripple Carry Adder

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 37 out of 14752
Number of 4 input LUTs   : 64 out of 29504
Number of IOs            : 98
Number of bonded IOBs    : 98 out of 376 (26%)

Fig 4.7 (b) RTL diagram of 32-bit Ripple Carry Adder

4.3.2 CSA (32-bit):

Fig 4.8 (a) Simulation Waveform Result of 32-bit CSA


Table 4.10 Device Utilization summary of 32-bit CSA

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 50 out of 14752
Number of 4 input LUTs   : 91 out of 29504
Number of IOs            : 98
Number of bonded IOBs    : 98 out of 376 (26%)

Fig 4.8 (b) RTL diagram of 32-bit CSA

4.3.3 Proposed CSA (32-bit):

Fig 4.9 (a) Simulation Waveform Result of 32-bit Proposed CSA


Fig 4.9 (b) Design Summary of 32-bit Proposed CSA


Table 4.11 Synthesis Report of 32-bit Proposed CSA

HDL Synthesis Report
Macro Statistics
# Xors          : 33
 1-bit xor2     : 32
 32-bit xor2    : 1

Advanced HDL Synthesis Report
Macro Statistics
# Xors          : 33
 1-bit xor2     : 32
 32-bit xor2    : 1

Low Level Synthesis
Optimizing unit <prop32>...
Mapping all equations...
Building and optimizing final netlist...
Found area constraint ratio of 100 (+ 5) on block prop32, actual ratio is 0.

Table 4.12 Device Utilization summary of 32-bit Proposed CSA

Device utilization summary
Selected Device          : 3s1600efg484-4
Number of Slices         : 76 out of 14752
Number of 4 input LUTs   : 140 out of 29504
Number of IOs            : 98
Number of bonded IOBs    : 98 out of 376 (26%)

Fig 4.9 (c) RTL diagram of 32-bit Proposed CSA


The simulated Verilog (.v) files are imported into the synthesis tool and the
corresponding values of delay and area are noted. The synthesis reports contain area
and delay values for the different sized adders. A similar design flow is followed for
both the regular and modified SQRT CSLA of different sizes.

For the 32-bit carry select adder, the transistor count of our proposed area-efficient
carry select adder can be reduced to be very close to that of the carry ripple adder,
whereas the transistor count of the conventional carry select adder is nearly double
that of the proposed design. This result shows that sharing the common Boolean logic
term can indeed achieve superior performance in terms of transistor count. As the input
bit number of the conventional carry select adder increases to 32 bits, its power
consumption becomes 3.3 times larger than that of our proposed area-efficient carry
select adder.

It is clear that the delay of the 8-bit, 16-bit, 32-bit, and 64-bit proposed SQRT
CSLA is reduced by 4.6%, 49.3%, 44.5%, and 59.08%, respectively, when compared
to the regular SQRT CSLA. The power reduction of the proposed design when compared
to the regular 8-bit, 16-bit, 32-bit, and 64-bit SQRT CSLA is 10.8%, 17.73%, 20.01%,
and 21.9%, respectively.

We perform the simulation and synthesis of all the adders and summarize the
results. Functional verification (simulation) and synthesis (in which the high-level
description is converted into RTL) are performed for all the adders.

After the simulation waveforms are observed, synthesis is performed to calculate
delay and area, from which the speed and power of the CSLAs are calculated. A
comparison of the regular, modified and improved CSLA is then made in terms of
delay, area and power.

4.2 Performance Comparison


In this section, the proposed method is compared with a 32-bit ripple
carry adder. Here, we show the results for power delays and critical path delays. The
area indicates the total cell area of the design, and the total power is the sum of the
leakage power, internal power and switching power. The percentage reductions in the
cell area, total power, power-delay product and area-delay product as a function of
the bit size are shown.

4.3 Synthesis Report


The synthesis report shows the design summary, including the number of LUTs, I/O
buffers, slice registers and flip-flop pairs, along with the theoretical and layout
estimations.
Area and delay comparison: Table 4.2 exhibits the simulation results of both the
CSLA structures in terms of delay, area and power. Table 4.3 shows the device
utilization summary. Table 4.4 shows that the proposed SQRT CSLA has fewer
gates and hence less area.


Table 4.13 Theoretical Estimation

Design                 Width (n)  Delay (ns)  Area (µm²)  ADP (µm²·µs)  EADP (%)
SQRT-CSLA [6]          16         7.38        1813.71     13.39         161.41
SQRT-CSLA [6]          32         14.58       3627.42     52.89         280.64
SQRT-CSLA [6]          64         28.98       7254.84     210.25        436.55
SQRT-CSLA (CBL) [7]    16         3.0         1706.80     5.12          --
SQRT-CSLA (CBL) [7]    32         3.85        3608.98     13.89         --
SQRT-CSLA (CBL) [7]    64         5.27        7435.46     39.18         --
SQRT-CSLA proposed     16         2.0         1574.54     4.10          --
SQRT-CSLA proposed     32         2.75        2989.99     11.89         --
SQRT-CSLA proposed     64         4.28        6553.24     32.10         --

Table 4.14 Comparison of post-layout synthesis results

Design                 Width (n)  Delay (ns)  Area (µm²)  Power (µW)
SQRT-CSLA (conv)       16         5.61        2890.52     30.5673
SQRT-CSLA (conv)       32         6.56        6100.34     60.2537
SQRT-CSLA (conv)       64         8.37        12613.2     113.6457
SQRT-CSLA (BEC) [7]    16         10.45       1722.96     12.8662
SQRT-CSLA (BEC) [7]    32         18.72       2765.38     17.7900
SQRT-CSLA (BEC) [7]    64         35.10       5530.56     91.1744
SQRT-CSLA (CBL)        16         5.55        1813.45     19.6652
SQRT-CSLA (CBL)        32         6.59        3735.36     38.1886
SQRT-CSLA (CBL)        64         8.35        7603.89     70.62442
SQRT-CSLA proposed     16         5.55        1813.45     19.6652
SQRT-CSLA proposed     32         6.59        3735.36     38.1886
SQRT-CSLA proposed     64         8.35        7603.89     70.62442

Table 4.15 Design Summary

Logic utilization                                   Used   Available   Utilization
Number of occupied slices                           91     29504       1%
Number of 4 input LUTs                              51     14752       1%
Number of slices containing only related logic      51     51          100%
Number of slices containing only unrelated logic                       0%
Number of bonded IOBs                               97     376         1%

Table 4.16 Comparison of the Regular and Modified SQRT CSLA

Power is given in µW as leakage, switching and total; PDP is the power-delay product
(10^-15) and ADP is the area-delay product (10^-25).

Word size  Adder          Delay (ns)  Area   Leakage  Switching  Total    PDP       ADP
8 bit      Regular CSLA   1.719       991    0.007    101.9      203.9    350.5     1703.5
8 bit      Modified CSLA  1.958       895    0.006    94.2       188.4    368.8     1752.5
16 bit     Regular CSLA   2.775       2272   0.017    263.7      527.4    1463.8    6304.8
16 bit     Modified CSLA  3.048       1929   0.013    235.9      471.8    1438.0    5879.6
32 bit     Regular CSLA   5.137       4783   0.036    563.6      1127.3   5790.9    24570.2
32 bit     Modified CSLA  5.482       3985   0.027    484.9      969.9    5316.9    21848.5
64 bit     Regular CSLA   9.174       9916   0.075    1212.4     2425.0   22245.9   90969.3
64 bit     Modified CSLA  9.519       8183   0.057    1025.0     2050.1   19514.9   77893.9
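
As a worked check of how the product columns of Table 4.16 relate to the delay, area and total power columns, the following sketch recomputes them. It assumes that, for each word size, the entry with the smaller delay and larger area is the regular CSLA; the names TABLE, products and reduction are illustrative only.

```python
# (delay in ns, area, total power in uW) per word size, copied from Table 4.16.
TABLE = {
    8:  {"regular": (1.719, 991, 203.9),   "modified": (1.958, 895, 188.4)},
    16: {"regular": (2.775, 2272, 527.4),  "modified": (3.048, 1929, 471.8)},
    32: {"regular": (5.137, 4783, 1127.3), "modified": (5.482, 3985, 969.9)},
    64: {"regular": (9.174, 9916, 2425.0), "modified": (9.519, 8183, 2050.1)},
}


def products(delay, area, power):
    """Return (power-delay product, area-delay product) for one table entry."""
    return power * delay, area * delay


def reduction(regular, modified):
    """Percentage reduction of the modified value relative to the regular value."""
    return 100.0 * (regular - modified) / regular
```

For example, products(*TABLE[8]["regular"]) gives approximately 350.5 and 1703.5, which agrees with the tabulated 8-bit entries, and the 64-bit entries correspond to an area reduction of roughly 17.5% and a total power reduction of roughly 15.5%.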

Area-Delay Estimation Method: The comparison of the proposed system with the 32-bit
RCA is shown in Table 4.4.
The delay can be calculated by adding up the number of gates in the longest path
of the logic block that contributes the maximum delay. The area evaluation is done by
counting the total number of AOI gates required for each logic block. The main
disadvantage of the regular CSLA is its high area usage, which can be overcome by using
the modified CSLA. Table 4.1 shows the area and delay of the AND, OR, and NOT gates
given in the 90-nm standard cell library datasheet of the proposed system compared with
the 32-bit RCA.

4.4 Applications
Arithmetic Logic units
High Speed Multiplication
Advanced Microprocessor Design
Digital Signal Processing


4.5 Advantages
Low Power Consumption
Less Area (Less Complexity)
Higher Speed Compared to Regular CSLA


CHAPTER 5
CONCLUSION & FUTURE SCOPE
5.1 Conclusion
In this Project, a simple approach has been used to reduce the area and power of
the SQRT CSLA architecture that we have implemented. In this work, the number of
gates has been reduced, and this offers a great advantage in area and power
reduction. The simulation results indicate that the modified SQRT CSLA suffers
from a larger delay, whereas in the 32-bit modified SQRT CSLA the area and power
are significantly reduced. The delay calculations used here can be computed using
the Mentor Graphics tool.

5.2 Future Scope


Nowadays the Carry Select Adder (CSLA) is used in many data-processing
processors to perform fast arithmetic functions. The speed of the SQRT CSLA is greater
than that of the Modified SQRT CSLA, but the Modified SQRT CSLA has reduced area and
power compared to the SQRT CSLA. So, the SQRT CSLA can be replaced by the Modified
SQRT CSLA where area and power are more important constraints than speed.


REFERENCES
[1] B. Ramkumar and Harish M. Kittur, "Low-Power and Area-Efficient Carry Select
Adder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20,
no. 2, February 2012.
[2] G. A. Ruiz and M. Granda, "An area-efficient static CMOS carry-select adder based
on a compact carry look-ahead unit," Microelectronics Journal, vol. 35, pp. 939-944,
2004, Elsevier Ltd.
[3] O. J. Bedrij, "Carry-select adder," IRE Transactions on Electronic Computers,
pp. 340-344, 1962.
[4] Y. Kim and L. S. Kim, "64-bit carry-select adder with reduced area," Electron.
Lett., vol. 37, no. 10, pp. 614-615, May 2001.
[5] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Upper Saddle
River, NJ: Prentice-Hall, 2001.
[6] Cadence, "Encounter user guide," Version 6.2.4, March 2008.
[7] T. Y. Chang and M. J. Hsiao, "Carry-select adder using single ripple-carry adder,"
Electronics Letters, vol. 34, no. 22, pp. 2101-2103, Oct. 1998.
[8] Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs.
[9] Review on Carry Skip Adder and Gray/Black Cell Function, Lecture 18, Datapath
Subsystems, Chapter 10. Copyright 2005 Pearson Addison-Wesley. All rights
reserved.
[10] Gray Yeap and Gilbert, Practical Low Power Digital VLSI Design, Kluwer
Academic Publishers, 1998.
[11] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd ed.
New York, NY, USA: Oxford Univ. Press, 2010.
[12] K. K. Parhi, VLSI Digital Signal Processing. New York, NY, USA: Wiley, 1998.
[13] J. M. Rabaey, Digital Integrated Circuits, IEEE Trans. on VLSI Systems, 2003.
[14] O. J. Bedrij, "Carry-select adder," IRE Trans. Electron. Comput., pp. 340-344,
1962.
[15] I-Chyn Wey, Cheng-Chen Ho, Yi-Sheng Lin, and Chien-Chang Peng, "An
Area-Efficient Carry Select Adder Design by Sharing the Common Boolean Logic
Term," in Proceedings of the International MultiConference of Engineers and Computer
Scientists 2012, Vol. II, IMECS 2012, March 14-16, 2012, Hong Kong.
[16] A. P. Chandrakasan, N. Verma, and D. C. Daly, "Ultralow-power electronics for
biomedical applications," Annu. Rev. Biomed. Eng., vol. 10, pp. 247-274, Aug. 2008.
[17] R. Uma, Vidya Vijayan, M. Mohanapriya, and Sharon Paul, "Area, Delay and
Power Comparison of Adder Topologies," International Journal of VLSI Design and
Communication Systems (VLSICS), vol. 3, no. 1, February 2012.
[18] S. Manju and V. Sornagopal, "An efficient SQRT architecture of carry select
adder design by common Boolean logic," in Proc. VLSI ICEVENT, 2013, pp. 1-5.
[19] Basant Kumar Mohanty and Sujit Kumar Patel, "Area-Delay-Power Efficient
Carry-Select Adder," IEEE Transactions on Circuits and Systems II: Express Briefs,
vol. 61, no. 6, June 2014.
[20] Y. He, C. H. Chang, and J. Gu, "An area efficient 64-bit square root Carry-Select
Adder for low power applications," in Proc. IEEE Int. Symp. Circuits Syst., 2005,
vol. 4, pp. 4082-4085.

ISSN No: 2348-4845

International Journal & Magazine of Engineering,


Technology, Management and Research
A Peer Reviewed Open Access International Journal

CSLA Implementation Technique to Minimise the Area,


Power and Delay
Bhagya Sri Gutthikonda

PG Student,
Department of ECE,
Sri Mittapalli Institute of Technology for Women,
Guntur, Andhra Pradesh, India.

ABSTRACT:
Carry Select Adder (CSLA) is one of the fastest adders
used in many data-processing processors to perform fast
arithmetic functions. From the structure of the CSLA, it is
clear that there is scope for reducing the area and power
consumption in the CSLA. This work uses a simple and
efficient gate-level modification to significantly reduce
the area and power of the CSLA. Based on this modification, 8-bit, 16-bit, 32-bit, and
64-bit square-root CSLA (SQRT CSLA) architectures have been developed and compared
with the regular SQRT CSLA architecture. The proposed
design has reduced area and power as compared with the
regular SQRT CSLA, with only a slight increase in the delay. This work evaluates the
performance of the proposed
designs in terms of delay, area, power, and their products
by hand with logical effort and through custom design and
layout in 0.18-µm CMOS process technology. The results
analysis shows that the proposed CSLA structure is better
than the regular SQRT CSLA.

Keywords:
SQRT CSLA, area efficient, CSLA, low power, delay efficient.

I. INTRODUCTION:
The design of area- and power-efficient high-speed data path
logic systems is one of the most substantial areas of research
in VLSI system design. In digital adders, the speed
of addition is limited by the time required to propagate a
carry through the adder. The sum for each bit position in an
elementary adder is generated sequentially only after the
previous bit position has been summed and a carry propagated
into the next position. The CSLA is used in many
computational systems to alleviate the problem of carry
propagation delay by independently generating multiple
carries and then selecting a carry to generate the sum [1].

Volume No: 3 (2016), Issue No: 1 (January)


www.ijmetmr.com

P.Bala Murali Krishna

Professor,
Department of ECE,
Sri Mittapalli Institute of Technology for Women,
Guntur, Andhra Pradesh, India.
However, the CSLA [3] is not area efficient because it uses
multiple pairs of Ripple Carry Adders (RCA) to generate
partial sums and carries by considering the carry input, and then
the final sum and carry are selected by the multiplexers
(mux). The basic idea of this work is to use a Binary to
Excess-1 converter (BEC) instead of the RCA with Cin = 1 in the
regular CSLA to achieve lower area and power consumption
[2]-[4]. The main advantage of this BEC logic comes from
the smaller number of logic gates than the n-bit Full Adder
(FA) structure.

This brief is structured as follows. This paper deals with the
delay and area evaluation methodology
of the basic adder blocks, and also presents the detailed
structure and function of the BEC logic. The SQRT
CSLA has been chosen for comparison with the proposed
design as it has a more balanced delay and requires lower
power and area [5], [6]. The delay and area evaluation
methodology of the regular and modified SQRT CSLA are
presented.

The rest of the paper is organised as follows. In
Section II, logic formulation is presented. In Section III,
the proposed adder design is explained. In Section IV, the
proposed scheme is compared to the previously proposed
ones and results are shown. Finally, Section V concludes
this paper.

II. LOGIC FORMULATION:


The CSLA has two units: 1) the sum and carry generator unit (SCG) and 2) the sum and
carry selection unit
[9]. The SCG unit consumes most of the logic resources
of the CSLA, as shown in Fig. 1, and significantly contributes
to the critical path. Different logic designs have been
suggested for efficient implementation of the SCG unit.
We made a study of the logic designs suggested for the
SCG unit of conventional and BEC-based CSLAs of [6]
by suitable logic expressions. The main objective of this
study is to identify redundant logic operations and data
dependence. Accordingly, we remove all redundant logic
operations and sequence logic operations based on their
data dependence which are discussed below.

January 2016
Page 120


Fig. 1. (a) Conventional CSLA; n is the input operand bit-width. (b) The logic
operations of the RCA.

Fig.2. Structure of the BEC-based CSLA; n is the input operand bit-width.

A. Logic Expressions of the SCG Unit of the Conventional CSLA:

The SCG unit of the conventional CSLA, as shown in
Fig. 1(a) [3], is composed of two n-bit RCAs, where n is
the adder bit-width. The logic operation of the n-bit RCA
shown in Fig. 1(b) is performed in four stages:
1) Half-sum generation (HSG);
2) Half-carry generation (HCG);
3) Full-sum generation (FSG); and
4) Full-carry generation (FCG).
Suppose two n-bit operands are added in the conventional
CSLA; then RCA-1 and RCA-2 generate the n-bit sums (s^0
and s^1) and output-carries (c^0_out and c^1_out) corresponding
to input-carry (cin = 0 and cin = 1), respectively. Logic
expressions of RCA-1 and RCA-2 of the SCG unit of the
n-bit CSLA are given as
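
A form of these expressions consistent with the four stages named above is sketched here. It is a reconstruction from the standard ripple-carry recurrence, not a copy of the original equations; the superscript k (k = 0 for RCA-1, k = 1 for RCA-2) and the boundary term for i = -1 are notational assumptions.

```latex
\begin{align}
  s_0(i) &= A(i) \oplus B(i), & c_0(i) &= A(i) \cdot B(i)
      && \text{(HSG, HCG)} \\
  s_1^k(i) &= s_0(i) \oplus c_1^k(i-1), & c_1^k(-1) &= k
      && \text{(FSG)} \\
  c_1^k(i) &= c_0(i) + s_0(i) \cdot c_1^k(i-1), & c_{out}^k &= c_1^k(n-1)
      && \text{(FCG)}
\end{align}
```

Here 0 <= i <= n - 1, and the half-sum and half-carry words s_0 and c_0 are identical for both RCAs, which is the redundancy that the later logic formulation removes.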

The selected carry word is added with the half-sum (s0)
to generate the final-sum (s). Using this method, one can
have three design advantages:

1. Calculation of s_1^0 is avoided in the SCG unit;
2. An n-bit select unit is required instead of an (n+1)-bit one;
and
3. Small output-carry delay.
All these features result in an area-delay and energy-efficient
design for the CSLA. We have removed all the
redundant logic operations of (2) and rearranged the logic
expressions of (2) based on their dependence. The proposed
logic formulation for the CSLA is given as

As stated above, the main idea of this work is to use the BEC
instead of the RCA with Cin = 1 in order to reduce the area
and power consumption of the regular CSLA. To replace
the n-bit RCA, an (n+1)-bit BEC is required.
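
The substitution described above can be sketched behaviorally. The example below is an assumption-laden illustration rather than the authors' circuit: bec and bec_csla_block are hypothetical names, and the RCA with Cin = 0 is modeled by integer addition.

```python
def bec(x, width):
    """Binary to Excess-1 Converter: gate-style increment of a `width`-bit value."""
    out, all_lower_ones = 0, 1
    for i in range(width):
        bit = (x >> i) & 1
        # Bit i toggles exactly when every lower bit is 1 (bit 0 always toggles).
        out |= (bit ^ all_lower_ones) << i
        all_lower_ones &= bit
    return out


def bec_csla_block(a, b, cin, n):
    """n-bit block: one RCA (Cin = 0) plus an (n+1)-bit BEC instead of a second RCA."""
    r0 = a + b                    # stands in for the RCA with Cin = 0: n sum bits + carry
    r1 = bec(r0, n + 1)           # BEC output equals the Cin = 1 result
    chosen = r1 if cin else r0    # multiplexer driven by the real carry-in
    return chosen & ((1 << n) - 1), chosen >> n
```

The BEC needs n + 1 bits because it operates on the n sum bits together with the carry-out of the Cin = 0 adder.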

B. Logic Expression of the SCG Unit of the BEC-Based CSLA


III. PROPOSED ADDER DESIGN:


The proposed CSLA is based on the logic formulation given in (4), and its structure is
shown in Fig. 3(a), where n
is the input operand bit-width, and [*] represents delay (in
the unit of inverter delay), n = max (t, 3.5n + 2.7).


It consists of one HSG unit, one FSG unit, one CG unit,
and one CS unit. The CG unit is composed of two CGs
(CG0 and CG1) corresponding to input-carry 0 and 1.
The HSG receives two n-bit operands (A and B) and generates
the half-sum word s0 and half-carry word c0 of width n
bits each. Both CG0 and CG1 receive s0 and c0 from the
HSG unit and generate two n-bit full-carry words C_1^0 and
C_1^1 corresponding to input-carry 0 and 1, respectively.
The logic diagram of the HSG unit is shown in Fig. 3(b).
The logic circuits of CG0 and CG1 are optimized to take
advantage of the fixed input-carry bits. The optimized
designs of CG0 and CG1 are shown in Fig. 3(c) and (d),
respectively.
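
As a rough behavioral sketch of the units in Fig. 3 (the HSG and CG units described above, together with the CS and FSG units of the same figure), the data flow can be modeled as follows. The function name proposed_csla and the list-based bit representation are assumptions made for illustration, and the carry recurrence is the standard generate/propagate form rather than the exact optimized gate networks of Fig. 3(c) and (d).

```python
def proposed_csla(a, b, cin, n):
    """Behavioral sketch of the HSG -> CG0/CG1 -> CS -> FSG data flow for n-bit inputs."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]

    # HSG: half-sum and half-carry words.
    s0 = [x ^ y for x, y in zip(a_bits, b_bits)]
    c0 = [x & y for x, y in zip(a_bits, b_bits)]

    # CG0 / CG1: full-carry words for a fixed input-carry of 0 and 1.
    def carry_word(fixed_cin):
        word, prev = [], fixed_cin
        for i in range(n):
            prev = c0[i] | (s0[i] & prev)
            word.append(prev)
        return word

    c1_0, c1_1 = carry_word(0), carry_word(1)

    # CS: AND-OR selection; valid because c1_0[i] = 1 implies c1_1[i] = 1.
    c = [c1_0[i] | (cin & c1_1[i]) for i in range(n)]

    # FSG: the LSB uses cin, the other bits use the selected carry of the lower bit.
    s = [s0[0] ^ cin] + [s0[i] ^ c[i - 1] for i in range(1, n)]
    total = sum(bit << i for i, bit in enumerate(s))
    return total, c[n - 1]
```

Only carry words are duplicated in this structure; the half-sum word is computed once and the final sum is formed after the carry selection, so no second sum word is needed.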

Fig. 3. (a) Proposed CS adder design. (b) Gate-level design
of the HSG. (c) Gate-level optimized design of (CG0) for
input-carry = 0. (d) Gate-level optimized design of (CG1)
for input-carry = 1. (e) Gate-level design of the CS unit.
(f) Gate-level design of the final-sum generation (FSG)
unit.

The CS unit selects one final carry word from the two
carry words available at its input line using the control
signal cin. It selects C_1^0 when cin = 0; otherwise, it
selects C_1^1. The CS unit can be implemented using an
n-bit 2-to-1 MUX. However, we find from the truth table
of the CS unit that the carry words C_1^0 and C_1^1 follow a
specific bit pattern. If C_1^0 (i) = 1, then C_1^1 (i) = 1,
irrespective of s0(i) and c0(i), for 0 ≤ i ≤ n − 1. This feature
is used for logic optimization of the CS unit. The optimized
design of the CS unit is shown in Fig. 3(e), which is
composed of n AND-OR gates. The final carry word c is
obtained from the CS unit. The MSB of c is sent to output
as cout, and the (n − 1) LSBs are XORed with the (n − 1) MSBs
of half-sum (s0) in the FSG [shown in Fig. 3(f)] to obtain
the (n − 1) MSBs of final-sum (s). The LSB of s0 is XORed
with cin to obtain the LSB of s.

We have considered all
the gates to be made of 2-input AND, 2-input OR, and
inverter (AOI). A 2-input XOR is composed of 2 AND, 1
OR, and 2 NOT gates. The area and delay of the 2-input
AND, 2-input OR, and NOT gates (shown in Table I) are
taken from the Synopsys Armenia Educational
Department (SAED) 90-nm standard cell library datasheet
for theoretical estimation. The area and delay of a design
are calculated using the following relations:
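
A form of these relations consistent with the symbol definitions that follow (a reconstruction, assuming the area and delay are weighted sums of the gate counts) is:

```latex
\begin{align}
  A &= a \cdot N_a + r \cdot N_o + i \cdot N_i \\
  T &= n_a \cdot T_a + n_o \cdot T_o + n_i \cdot T_i
\end{align}
```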

Here, (Na, No, Ni) and (na, no, ni), respectively, represent
the (AND, OR, NOT) gate counts of the total design
and its critical path, and (a, r, i) and (Ta, To, Ti), respectively,
represent the area and delay of one (AND, OR, NOT)
gate. We have calculated the AOI gate counts of each
design for area and delay estimation. The area and delay of
each design are calculated from the AOI gate counts (Na,
No, Ni), (na, no, ni), and the cell details of Table I. The
delay of each intermediate and output signal of the proposed
n-bit CSLA design
of Fig. 3 is shown in square brackets against each signal.
We can observe that the proposed n-bit single-stage
CSLA involves 6n fewer AOI gates than
the CSLA of [6] and takes 2.7 and 6.6 units less delay
to calculate the final-sum and output-carry, respectively.
Compared with the CBL-based CSLA of [7], the proposed CSLA design
involves n more AOI gates, and it takes (n − 4.7) units less
delay to calculate the output-carry.
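
The gate-count based estimation can be sketched in a few lines. The cell numbers below are placeholders chosen for illustration and are NOT the SAED 90-nm values of Table I; Cell, CELLS, XOR_AOI, estimate_area and estimate_delay are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    area: float   # cell area (arbitrary units in this sketch)
    delay: float  # cell delay (arbitrary units in this sketch)


# Placeholder numbers, not the library values referred to in the text.
CELLS = {"and": Cell(2.0, 1.0), "or": Cell(2.0, 1.0), "not": Cell(1.0, 0.5)}

# A 2-input XOR counted as 2 AND, 1 OR and 2 NOT gates, as stated in the text.
XOR_AOI = {"and": 2, "or": 1, "not": 2}


def estimate_area(total_counts):
    """Area: sum over gate types of (cell area x gate count of the whole design)."""
    return sum(CELLS[gate].area * count for gate, count in total_counts.items())


def estimate_delay(critical_path_counts):
    """Delay: sum over gate types of (cell delay x gate count on the critical path)."""
    return sum(CELLS[gate].delay * count for gate, count in critical_path_counts.items())
```

With these placeholder cells, estimate_area(XOR_AOI) evaluates to 8.0, and a NOT-AND-OR critical path through the XOR gives estimate_delay({"and": 1, "or": 1, "not": 1}) equal to 2.5.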

A. Extension Concept of Multistage CSLA (SQRT-CSLA)
The multipath carry propagation feature of the CSLA is
fully exploited in the SQRT-CSLA [5], which is composed
of a chain of CSLAs. CSLAs of increasing size are used
in the SQRT-CSLA to extract the maximum concurrence
in the carry propagation path. Using the SQRT-CSLA design, large-size adders are implemented with significantly
less delay than a single-stage CSLA of same size. However, carry propagation delay between the CSLA stages of
SQRT-CSLA is critical for the overall adder delay.

Fig.4. Proposed SQRT-CSLA for n = 16. All intermediate and output signals are labelled with delay


Due to early generation of output-carry with the multipath
carry propagation feature, the proposed CSLA design
is more favourable than the existing CSLA designs for
area-delay [10] efficient implementation of the SQRT-CSLA.
A 16-bit SQRT-CSLA design using the proposed CSLA
is shown in Fig. 4, where the 2-bit RCA, 2-bit CSLA,
3-bit CSLA, 4-bit CSLA, and 5-bit CSLA are used. We
have considered the cascaded configurations of (2-bit RCA
and 2-, 3-, 4-, 6-, 7-, and 8-bit CSLAs) and (2-bit RCA
and 2-bit, 3-bit, 4-bit, 6-bit, 7-bit, 8-bit, 9-bit, 11-bit, and
12-bit CSLAs), respectively, for the 32-bit SQRT-CSLA
and the 64-bit SQRT-CSLA to optimize adder delay. To
demonstrate the advantage of the proposed CSLA design
in the SQRT-CSLA, we have estimated the area and delay
of the SQRT-CSLA using the proposed CSLA design, the
BEC-based CSLA of [6], and the CBL-based CSLA of [7]
for bit-widths 16 and 32.

IV. RESULTS & DISCUSSION:
In this section, we present the experimental results. In
Section IV-A, the proposed method is compared with the
conventional methods. We perform the simulation and
synthesis of all the adders and summarize the results.
Functional verification (simulation) and synthesis (in which the
high-level description is converted into RTL) are performed
for all the adders. After the simulation waveforms are
observed, synthesis is performed
to calculate delay and area, from which the speed
and power of the CSLAs are calculated. A comparison
of the regular, modified and improved CSLA is then made in terms
of delay, area and power.

Table II exhibits the simulation results of both the CSLA
structures in terms of delay, area and power.

Table I: path delays comparison

The area indicates the total cell area of the design, and the
total power is the sum of the leakage power, internal power
and switching power. The percentage reductions in the cell
area, total power, power-delay product and area-delay
product as a function of the bit size are shown.

Table II: Comparison of the Regular and Modified SQRT CSLA

A. PERFORMANCE COMPARISON:
In this section, the proposed method is compared with a
32-bit ripple carry adder.

Area-Delay Estimation
Method: The comparison of the proposed system with the 32-bit
RCA is shown in Table I. The delay can be calculated by
adding up the number of gates in the longest path of the logic
block that contributes the maximum delay. The area evaluation
is done by counting the total number of AOI gates
required for each logic block. The main disadvantage of
the regular CSLA is its high area usage, which can be overcome by
using the modified CSLA. Table I shows the area and delay
of the AND, OR, and NOT gates given in the 90-nm standard cell
library datasheet of the proposed system compared with the
32-bit RCA. Here, we show the results for power delays and
critical path delays.


Area and delay comparison: Table III shows that the
proposed SQRT CSLA has fewer gates and
hence less area.


Table-III: Theoretical Estimation

Table-V: Comparison of post-layout synthesis result

B. Synthesis & Simulation Report

In this section, the synthesis and simulation results of the
proposed method are reported.
Synthesis Report: The synthesis
report shows the design summary, including the number of
LUTs, I/O buffers, slice registers and flip-flop pairs, as shown
in Table III.

Table -IV: Design Summary

V. CONCLUSION:
A simple approach is proposed in this paper to reduce the area and
power of the SQRT CSLA architecture. The reduced number
of gates in this work offers a great advantage in the
reduction of area and also of total power. The compared
results show that the modified SQRT CSLA has a slightly
larger delay, but the area and power of the 32-b modified
SQRT CSLA are significantly reduced by 17.4% and
15.4%, respectively. The power-delay product and also
the area-delay product of the proposed design show a
decrease for the 16- and 32-b sizes, which indicates the success of
the method and not a mere trade-off of delay for power and
area. The modified CSLA architecture is therefore low
area, low power, simple and efficient for VLSI hardware
implementation. It would be interesting to test the design
of the modified SQRT CSLA.

REFERENCES:
[1] B. Ramkumar and Harish M. Kittur, "Low-Power and Area-Efficient
Carry Select Adder," IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 20,
no. 2, February 2012.
[2] G. A. Ruiz and M. Granda, "An area-efficient static CMOS
carry-select adder based on a compact carry look-ahead unit,"
Microelectronics Journal, vol. 35, pp. 939-944, 2004,
Elsevier Ltd.
[3] O. J. Bedrij, "Carry-select adder," IRE Transactions on
Electronic Computers, pp. 340-344, 1962.
[4] Y. Kim and L. S. Kim, "64-bit carry-select adder with
reduced area," Electron. Lett., vol. 37, no. 10, pp. 614-615,
May 2001.
[5] J. M. Rabaey, Digital Integrated Circuits: A Design
Perspective. Upper Saddle River, NJ: Prentice-Hall, 2001.
[6] Cadence, "Encounter user guide," Version 6.2.4, March
2008.
[7] T. Y. Chang and M. J. Hsiao, "Carry-select adder using
single ripple-carry adder," Electronics Letters, vol. 34,
no. 22, pp. 2101-2103, Oct. 1998.
[8] Behrooz Parhami, Computer Arithmetic: Algorithms and
Hardware Designs.
[9] Review on Carry Skip Adder and Gray/Black Cell
Function, Lecture 18, Datapath Subsystems, Chapter 10.
Copyright 2005 Pearson Addison-Wesley. All rights
reserved.
[10] Gray Yeap and Gilbert, Practical Low Power Digital
VLSI Design, Kluwer Academic Publishers, 1998.