
CHAPTER 1

INTRODUCTION
Recently there has been a trend to implement DSP functions using field
programmable gate arrays (FPGAs). While application specific integrated circuits (ASICs) are
the traditional solution for high performance applications, their high development costs and
time-to-market prohibit such solutions in certain cases. DSP processors offer high
programmability, but the sequential execution nature of their architecture can adversely affect
their throughput. The rising popularity of the FPGA is thus due to the balance that FPGAs
provide the designer in terms of flexibility, cost, and time-to-market. Digital filter structures,
which are used extensively in applications such as speech processing, image and video
processing, and telecommunications, are commonly implemented using FPGAs.

In signal processing, there are many instances in which an input signal
to a system contains extra unnecessary content or additional noise which can degrade the quality
of the desired portion. In such cases we may remove or filter out the unwanted components. For
example, in the case of the telephone system, there is no reason to transmit very high frequencies
since most speech falls within the band of 400 to 3,400 Hz. Therefore, in this case, all
frequencies above and below that band are filtered out. The frequency band between 400 and
3,400 Hz, which isn't filtered out, is known as the passband, and the frequency band that is
blocked out is known as the stopband.








1.3 FINITE IMPULSE RESPONSE:
A finite impulse response (FIR) filter is a filter structure that can be used to implement almost
any sort of frequency response digitally. An FIR filter is usually implemented by using a series
of delays, multipliers, and adders to create the filter's output.
Figure below shows the basic block diagram for an FIR filter of length N. The delays result in
operating on prior input samples. The hk values are the coefficients used for multiplication, so
that the output at time n is the summation of all the delayed samples multiplied by the
appropriate coefficients.
The difference equation that defines the output of an FIR filter in terms of its input is:

y[n] = b0 x[n] + b1 x[n-1] + ... + bN x[n-N] = Σ (i = 0 to N) bi x[n-i]
where:
x[n] is the input signal,
y[n] is the output signal,
bi are the filter coefficients, and
N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side, which are
commonly referred to as taps.
This equation can also be expressed as a convolution of the coefficient sequence bi with the input
signal:

y[n] = (b * x)[n]
That is, the filter output is a weighted sum of the current and a finite number of previous values
of the input. Also, the response of the filter depends upon the values of the filter coefficients and
the input applied.



Figure: The logical structure of an FIR filter

The process of selecting the filter's length and coefficients is called filter design. The goal is to
set those parameters such that certain desired stopband and passband parameters will result from
running the filter. Most engineers utilize a program such as MATLAB to do their filter design.
But whatever tool is used, the results of the design effort should be the same:
A frequency response plot, like the one shown in Figure 1, which verifies that the filter
meets the desired specifications, including ripple and transition bandwidth.
The filter's length and coefficients.
The longer the filter (more taps), the more finely the response can be tuned.
With the length, N, and coefficients, float h[N] = { ... }, decided upon, the implementation of the
FIR filter is fairly straightforward. Listing 1 shows how it could be done in C. Running this code
on a processor with a multiply-and-accumulate instruction (and a compiler that knows how to
use it) is essential to achieving a large number of taps.
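Listing 1 itself is not reproduced in this copy; the following is a minimal sketch of such a
routine, assuming a direct-form implementation with a software delay line (the names fir, h,
and z are illustrative):

#include <stddef.h>

#define NTAPS 64                      /* filter length (number of taps) */

static const float h[NTAPS] = { 0 };  /* coefficients from the design tool */
static float z[NTAPS];                /* delay line of past input samples */

/* Process one input sample and return one output sample. */
float fir(float input)
{
    /* Shift the delay line and insert the new sample. */
    for (size_t i = NTAPS - 1; i > 0; i--)
        z[i] = z[i - 1];
    z[0] = input;

    /* Multiply-and-accumulate across all taps; this inner loop is what
       a hardware MAC instruction accelerates. */
    float acc = 0.0f;
    for (size_t i = 0; i < NTAPS; i++)
        acc += h[i] * z[i];
    return acc;
}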

1.4 Approach to Designing an FIR Filter:
Filters are signal conditioners. Each functions by accepting an input signal, blocking prespecified
frequency components, and passing the original signal minus those components to the output.
For example, a typical phone line acts as a filter that limits frequencies to a range considerably
smaller than the range of frequencies human beings can hear. That's why listening to CD-quality
music over the phone is not as pleasing to the ear as listening to it directly.
A digital filter takes a digital input, gives a digital output, and consists of digital components. In
a typical digital filtering application, software running on a digital signal processor (DSP) reads
input samples from an A/D converter, performs the mathematical manipulations dictated by
theory for the required filter type, and outputs the result via a D/A converter.
An analog filter, by contrast, operates directly on the analog inputs and is built entirely with
analog components, such as resistors, capacitors, and inductors.
There are many filter types, but the most common are lowpass, highpass, bandpass, and
bandstop. A lowpass filter allows only low frequency signals (below some specified cutoff)
through to its output, so it can be used to eliminate high frequencies. A lowpass filter is handy, in
that regard, for limiting the uppermost range of frequencies in an audio signal; it's the type of
filter that a phone line resembles.
A highpass filter does just the opposite, by rejecting only frequency components below some
threshold. An example highpass application is cutting out the audible 60Hz AC power "hum",
which can be picked up as noise accompanying almost any signal in the U.S.
The designer of a cell phone or any other sort of wireless transmitter would typically place an
analog bandpass filter in its output RF stage, to ensure that only output signals within its narrow,
government-authorized range of the frequency spectrum are transmitted.
Engineers can use bandstop filters, which pass both low and high frequencies, to block a
predefined range of frequencies in the middle.

Frequency response
Simple filters are usually defined by their responses to the individual frequency components that
constitute the input signal. There are three different types of responses. A filter's response to
different frequencies is characterized as passband, transition band, or stopband. The passband
response is the filter's effect on frequency components that are passed through (mostly)
unchanged.
Frequencies within a filter's stopband are, by contrast, highly attenuated. The transition band
represents frequencies in the middle, which may receive some attenuation but are not removed
completely from the output signal.
In the figure below, which shows the frequency response of a lowpass filter, ωp is the passband
ending frequency, ωs is the stopband beginning frequency, and As is the amount of attenuation in
the stopband. Frequencies between ωp and ωs fall within the transition band and are attenuated to
some lesser degree.


Figure: The response of a lowpass filter to various input frequencies
Given these individual filter parameters, one of numerous filter design software packages can
generate the required signal processing equations and coefficients for implementation on a DSP.
Before we can talk about specific implementations, however, some additional terms need to be
introduced.
Ripple is usually specified as a peak-to-peak level in decibels. It describes how little or how
much the filter's amplitude varies within a band. Smaller amounts of ripple represent more
consistent response and are generally preferable.
Transition bandwidth describes how quickly a filter transitions from a passband to a stopband, or
vice versa. The more rapid this transition, the narrower the transition band, and the more
difficult the filter is to achieve. Though an almost instantaneous transition to full attenuation is
typically desired, real-world filters don't often have such ideal frequency response curves.
There is, however, a tradeoff between ripple and transition bandwidth, so that decreasing either
will only serve to increase the other.



1.5 Properties of FIR Filter:
An FIR filter has a number of useful properties which sometimes make it preferable to an infinite
impulse response (IIR) filter. FIR filters:
Are inherently stable. This is due to the fact that all the poles are located at the origin and
thus are located within the unit circle.
Require no feedback. This means that any rounding errors are not compounded by
summed iterations. The same relative error occurs in each calculation. This also makes
implementation simpler.
Can be designed to be linear phase, which means the phase change is proportional to
the frequency. This is usually desired for phase-sensitive applications, for example
crossover filters and mastering, where transparent filtering is adequate.
The main disadvantage of FIR filters is that considerably more computation power is required
compared with a similar IIR filter. This is especially true when low frequencies (relative to the
sample rate) are to be affected by the filter.


1.6 Filter Design Techniques:
To design a filter means to select the coefficients such that the system has specific
characteristics. The required characteristics are stated in filter specifications. Most of the time
filter specifications refer to the frequency response of the filter. There are different methods to
find the coefficients from frequency specifications:
1. Window design method
2. Frequency Sampling method
3. Weighted least squares design
4. Minimax design
5. Equiripple design.
Software packages like MATLAB, GNU Octave, Scilab, and SciPy provide convenient ways to
apply these different methods.
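As a rough illustration of the first of these methods, a windowed-sinc lowpass design can be
sketched in C as below; the Hamming window and the function name design_lowpass are
illustrative choices, not a prescription:

#include <math.h>

/* Window design method: truncated ideal lowpass (sinc) response
   multiplied by a Hamming window. fc is the normalized cutoff
   (0 < fc < 0.5, as a fraction of the sample rate); n is the number
   of taps (odd gives a symmetric, linear-phase filter). */
void design_lowpass(float h[], int n, float fc)
{
    const float pi = 3.14159265358979f;
    int mid = (n - 1) / 2;

    for (int k = 0; k < n; k++) {
        int m = k - mid;
        /* Ideal lowpass impulse response, with the m = 0 limit. */
        float ideal = (m == 0) ? 2.0f * fc
                               : sinf(2.0f * pi * fc * m) / (pi * m);
        /* The Hamming window tapers the truncated sinc to control ripple. */
        float w = 0.54f - 0.46f * cosf(2.0f * pi * k / (n - 1));
        h[k] = ideal * w;
    }
}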
Some of the time, the filter specification refers to the time-domain shape of the input signal the
filter is expected to "recognize". The optimum such filter is the matched filter: sampling that
shape and using those samples directly as the coefficients of the filter gives the filter an impulse
response that is the time-reverse of the expected input signal.






















Chapter 2
Design of High Performance 64-bit MAC Unit
Introduction:
A design of a high performance 64-bit Multiplier-and-Accumulator (MAC) unit is implemented
in this project. The MAC unit performs an important operation in many digital signal processing
(DSP) applications. The multiplier is designed using a modified Wallace multiplier and the adder
is implemented with a carry save adder.

The MAC unit is an essential component in many digital signal processing (DSP) applications
involving multiplications and/or accumulations, and is used in high performance digital
signal processing systems. DSP applications include filtering, convolution, and inner
products. Many digital signal processing methods use transforms such as the discrete
cosine transform (DCT) or the discrete wavelet transform (DWT). Because these are basically
accomplished by repetitive application of multiplication and addition, the speed of the
multiplication and addition arithmetic determines the execution speed and performance of the
entire calculation. Multiply-and-accumulate operations are typical for digital filters.
Therefore, the functionality of the MAC unit enables high-speed filtering and other processing
typical for DSP applications. Since the MAC unit operates completely independently of the CPU,
it can process data separately and thereby reduce CPU load. Applications like optical
communication systems, which are based on DSP, require extremely fast processing of huge
amounts of digital data. The Fast Fourier Transform (FFT) also requires addition and
multiplication. A 64-bit unit can handle wider operands and address more memory.






MAC OPERATION:

The Multiplier-Accumulator (MAC) operation is the key operation not only in DSP applications
but also in multimedia information processing and various other applications. As mentioned
above, the MAC unit consists of a multiplier, an adder, and a register/accumulator. In this
project, we used a 64-bit modified Wallace multiplier. The MAC inputs are obtained from
memory and given to the multiplier block, which is useful in a 64-bit digital signal processor.
Each input fed from memory is 64 bits wide; when the inputs are given to the multiplier it
computes their product, so the output is 128 bits. The multiplier output is given as an input to
the carry save adder, which performs the addition. The function of the MAC unit is given by the
following equation:

F = Σ Pi × Qi      (1)

The output of the carry save adder is 129 bits, i.e., one extra bit for the carry (128 bits + 1 bit).
The output is then given to the accumulator register. The accumulator register used in this design
is Parallel In Parallel Out (PIPO). Since the words are wide and the carry save adder produces all
of its output values in parallel, a PIPO register is used, where the input bits are taken in parallel
and the output is taken in parallel. The output of the accumulator register is taken out or fed back
as one of the inputs to the carry save adder.
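Behaviorally, this recurrence can be modeled in a few lines of C; the sketch below uses the
GCC/Clang unsigned __int128 extension to hold the 128-bit product, and its names are
illustrative rather than the actual VHDL entity names:

#include <stdint.h>

/* Behavioral model of the 64-bit MAC: acc += p * q each cycle.
   unsigned __int128 is a GCC/Clang extension wide enough to hold the
   128-bit product plus the accumulated carry (the "129th bit"). */
static unsigned __int128 accumulator = 0;

void mac_step(uint64_t p, uint64_t q)
{
    unsigned __int128 product =
        (unsigned __int128)p * (unsigned __int128)q;  /* 64 x 64 -> 128 bits */
    accumulator += product;                           /* accumulate */
}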














MODIFIED WALLACE MULTIPLIER:

A modified Wallace multiplier is an efficient hardware implementation of a digital circuit
multiplying two integers. In conventional Wallace multipliers, many full adders and half adders
are used in the reduction phase. Half adders do not reduce the number of partial product bits,
so minimizing the number of half adders used in the reduction reduces the complexity. Hence, a
modification to the Wallace reduction is made in which the delay is the same as for the
conventional Wallace reduction. The modified reduction method greatly reduces the number of
half adders with a very slight increase in the number of full adders.
The reduced complexity Wallace multiplier reduction consists of three stages. In the first stage
the N x N product matrix is formed and, before passing on to the second phase, rearranged into
the shape of an inverted pyramid. During the second phase the rearranged product matrix is
grouped into non-overlapping groups of three rows, as shown in the figure; a single bit or a pair
of bits in a group is passed on to the next stage, while three bits are given to a full adder. The
number of rows in each stage of the reduction phase is calculated by the formula

r(i+1) = 2*floor(ri / 3) + (ri mod 3)
If ri mod 3 = 0, then r(i+1) = 2*ri / 3

A half adder is used only when the number of rows calculated from the above equation does not
match the number of rows actually found in a stage of the second phase. The final product of the
second stage is reduced to a height of two rows and passed on to the third stage. During the third
stage the output of the second stage is given to the carry propagation adder to generate the final
output.

Thus, the 64-bit modified Wallace multiplier is constructed, and the total number of stages in the
second phase is 10. As per the equation, the number of rows in each of the 10 stages was
calculated, and the use of half adders was restricted to the 10th stage. The total number of half
adders used in the second phase is 8, and the total number of full adders used during the second
phase is slightly increased compared with the conventional Wallace multiplier.
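The stage count quoted above can be checked directly against the row recurrence; a small C
program starting from the 64 rows of a 64 x 64 product matrix reproduces the 10 stages:

#include <stdio.h>

/* Row-count recurrence of the reduced-complexity Wallace reduction:
   r(i+1) = 2*floor(ri/3) + (ri mod 3), starting from 64 rows. */
int main(void)
{
    int r = 64, stages = 0;
    printf("%d", r);
    while (r > 2) {
        r = 2 * (r / 3) + r % 3;   /* integer division floors */
        stages++;
        printf(" -> %d", r);
    }
    printf("\n%d stages\n", stages);  /* prints 10, as stated above */
    return 0;
}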








CARRY SAVE ADDER:

In this design a 128-bit carry save adder is used, since the output of the multiplier is 128 bits
(2N). The carry save adder reduces an addition of 3 numbers to an addition of 2 numbers, and its
propagation delay is 3 gates regardless of the number of bits. The carry save adder contains n
full adders, each computing a single sum bit and carry bit based only on the corresponding bits
of the three input numbers. The entire sum can then be calculated by shifting the carry sequence
left by one place, appending a 0 as the most significant bit of the partial sum sequence, and
adding the partial sum sequence with a ripple carry unit, resulting in an (n + 1)-bit value. The
ripple carry unit refers to the arrangement where the carry out of one stage is fed directly to the
carry in of the next stage. Since depicting a 128-bit carry save adder is infeasible, a typical 8-bit
carry save adder is shown in the figure. When the sum of two 128-bit binary numbers is to be
computed, 128 half adders are used at the first stage instead of 128 full adders; the carry save
unit then comprises 128 half adders, each of which computes a single sum and carry bit based
only on the corresponding bits of the two input numbers. If x and y are the two 128-bit numbers,
the partial sum and carry bits S and C are:
Si = xi XOR yi
Ci = xi AND yi
If the two numbers were instead added with a ripple carry adder alone, each sum bit would have
to wait for the previous carry bit to be produced, so the delay would equal that of n full adders.
A carry save adder, however, produces all of its output values in parallel, making the total
computation time less than that of ripple carry adders. Hence a Parallel In Parallel Out (PIPO)
register is used as the accumulator in the final stage.
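The carry save principle is easy to model in software; the following C sketch (using 64-bit words
for brevity instead of 128) shows the 3:2 compression step and the single carry-propagating add
that completes the sum:

#include <stdint.h>

/* One 3:2 carry-save step: compress three operands into a sum word and
   a carry word with no carry propagation between bit positions. */
void csa(uint64_t x, uint64_t y, uint64_t z, uint64_t *s, uint64_t *c)
{
    *s = x ^ y ^ z;                           /* per-bit sum */
    *c = ((x & y) | (x & z) | (y & z)) << 1;  /* per-bit carry, shifted left */
}

/* A single conventional (carry-propagating) add finishes the sum. */
uint64_t add3_carry_save(uint64_t x, uint64_t y, uint64_t z)
{
    uint64_t s, c;
    csa(x, y, z, &s, &c);
    return s + c;
}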



















CHAPTER 4
INTRODUCTION TO VLSI DOMAIN


4.1 VLSI DESIGN:
The complexity of the VLSI circuits being designed and used today makes the manual approach
to design impractical. Design automation is the order of the day. With the rapid technological
developments in the last two decades, the status of VLSI technology is characterized by the
following:
A steady increase in the size and hence the functionality of the ICs.
A steady reduction in feature size and hence an increase in the speed of operation as well as gate
or transistor density.
A steady improvement in the predictability of circuit behavior.
A steady increase in the variety and size of software tools for VLSI design.
The above developments have resulted in a proliferation of approaches to VLSI design.


4.2 HISTORY OF VLSI:
VLSI began in the 1970s when complex semiconductor and communication technologies
were being developed. The microprocessor is a VLSI device. The term is no longer as common
as it once was, as chips have increased in complexity into the hundreds of millions of transistors.
This is the field which involves packing more and more logic devices into smaller and
smaller areas. VLSI circuits can now be put into a small space a few millimeters across. VLSI
circuits are everywhere: our computers, our cars, our brand new state-of-the-art digital cameras,
our cell phones, and what have you.


4.3 VARIOUS INTEGRATIONS:
In the early days of integrated circuits, only a few transistors could be placed on a chip as the
scale used was large because of the contemporary technology, and manufacturing yields were
low by today's standards. As the degree of integration was small, the design was done easily.
Over time, millions, and today billions of transistors could be placed on one chip, and to make a
good design became a task to be planned thoroughly.

4.3.1 SSI Technology:
The first integrated circuits contained only a few transistors. Called "small-scale
integration" (SSI), digital circuits containing transistors numbering in the tens provided a few
logic gates, while early linear ICs such as the Plessey SL201 or the Philips TAA320
had as few as two transistors. The term Large Scale Integration was first used by IBM scientist
Rolf Landauer when describing the theoretical concept; from there came the terms SSI, MSI,
VLSI, and ULSI.

4.3.2 MSI Technology:
The next step in the development of integrated circuits, taken in the late 1960s, introduced
devices which contained hundreds of transistors on each chip, called "medium-scale integration" (MSI).
They were attractive economically because while they cost little more to produce than SSI
devices, they allowed more complex systems to be produced using smaller circuit boards, less assembly
work (because of fewer separate components), and a number of other advantages.

4.3.3 LSI Technology:
Further development, driven by the same economic factors, led to "large-scale
integration" (LSI) in the mid 1970s, with tens of thousands of transistors per chip.
Integrated circuits such as 1K-bit RAMs, calculator chips, and the first microprocessors, that
began to be manufactured in moderate quantities in the early 1970s, had under 4000 transistors.
True LSI circuits, approaching 10,000 transistors, began to be produced around 1974, for
computer main memories and second-generation microprocessors.

4.3.4 VLSI:
The final step in this development, starting in the early 1980s and continuing through the
present, was "very-large-scale integration" (VLSI), which has grown to several billion transistors
per chip as of 2009. In 1986 the first one-megabit RAM chips were introduced, containing more
than one million transistors. Microprocessor chips passed the million-transistor mark in 1989
and the billion-transistor mark in 2005. The trend continues largely unabated, with chips
introduced in 2007 containing tens of billions of memory transistors.


VLSI DESIGN FLOW:














Fig 4.1 VLSI design flow (Start -> Design Entry -> Pre-layout Simulation -> Logic Synthesis ->
System Partitioning -> Floor Planning -> Placement -> Routing -> Circuit Extraction ->
Post-layout Simulation -> Finish)

4.4 ULSI, WSI, SOC and 3D-IC:
To reflect further growth of complexity, the term ULSI, which stands for "ultra-large-scale
integration", was proposed for chips of complexity of more than 1 million transistors. Wafer-scale
integration (WSI) is a system of building very large integrated circuits that uses an entire silicon
wafer to produce a single "super-chip"; through a combination of large size and reduced packaging,
WSI can dramatically reduce costs for some systems.
A system-on-a-chip (SOC) is an integrated circuit in which all the components needed for a
computer or other system are included on a single chip. The design of such a device can be complex and
costly, and building disparate components on a single piece of silicon may compromise the efficiency of
some elements. However, these drawbacks are offset by lower manufacturing and assembly costs and by
a greatly reduced power budget: because signals among the components are kept on-die, much less power
is required.
A three-dimensional integrated circuit (3D-IC) has two or more layers of active electronic
components that are integrated both vertically and horizontally into a single circuit, with less
power consumption.

4.5 VLSI DESIGN FLOW AND ITS DESCRIPTION:
The design at the behavioral level is to be elaborated in terms of known and
acknowledged functional blocks. It forms the next detailed level of design description. Once
again the design is to be tested through simulation and iteratively corrected for errors. The
elaboration can be continued one or two steps further. It leads to a detailed design description in
terms of logic gates and transistor switches.
Optimization
The circuit at the gate level in terms of the gates and flip-flops can be redundant in
nature. The same can be minimized with the help of minimization tools. The step is not shown
separately in the figure. The minimized logical design is converted to a circuit in terms of the
switch level cells from standard libraries provided by the foundries. The cell based design
generated by the tool is the last step in the logical design process; it forms the input to the first
level of physical design.
Simulation
The design descriptions are tested for their functionality at every level: behavioral, data
flow, and gate. One has to check whether all the functions are carried out as expected and
rectify any that are not. All such activities are carried out by the simulation tool. The tool also
has an editor to carry out any corrections to the source code. Simulation involves testing the
design for all its functions, functional sequences, timing constraints, and specifications.
Normally, testing and simulation at all the levels, behavioral to switch level, are carried out by a
single tool, identified as the scope of the simulation tool in Figure 4.2.

Fig 4.2 Scope of simulation tool
4.6 Synthesis
With the availability of design at the gate (switch) level, the logical design is complete.
The corresponding circuit hardware realization is carried out by a synthesis tool. Two common
approaches are as follows:
The circuit is realized through an FPGA. The gate level design description is the starting point
for the synthesis here. The FPGA vendors provide an interface to the synthesis tool. Through the
interface the gate level design is realized as a final circuit. With many synthesis tools, one can
directly use the design description at the data flow level itself to realize the final circuit through
an FPGA. The FPGA route is attractive for limited volume production or a fast development
cycle.
The circuit is realized as an ASIC. A typical ASIC vendor will have its own library of basic
components like elementary gates and flip-flops. Eventually the circuit is realized by
selecting such components and interconnecting them in conformity with the required design. This
constitutes the physical design. Being an elaborate and costly process, a physical design may call
for an intermediate functional verification through the FPGA route: the circuit realized through
the FPGA is tested as a prototype, providing another opportunity for testing the design closer to
the final circuit.
Physical Design
A fully tested and error-free design at the switch level can be the starting point for a
physical design [Baker & Boyce, Wolf]. It is to be realized as the final circuit using (typically) a
million components in the foundry's library. The step-by-step activities in the process are
described briefly as follows:
System partitioning: The design is partitioned into convenient compartments or functional
blocks. Often it would have been done at an earlier stage itself and the software design prepared
in terms of such blocks. Interconnection of the blocks is part of the partition process.
Floor planning: The positions of the partitioned blocks are planned and the blocks are arranged
accordingly. The procedure is analogous to the planning and arrangement of domestic furniture
in a residence. Blocks with I/O pins are kept close to the periphery; those which interact
frequently or through a large number of interconnections are kept close together, and so on.
Partitioning and floor planning may have to be carried out and refined iteratively to yield best
results.
Placement: The selected components from the ASIC library are placed in position on the
silicon floor; this is done for each of the blocks above.
Routing: The components placed as described above are interconnected within each block by
suitably routing the interconnects. Once the routing is complete, the physical design can be taken
as complete. The final mask for the design can be made at this stage and the ASIC manufactured
in the foundry.
Post Layout Simulation
Once the placement and routing are completed, the performance specifications like
silicon area, power consumed, path delays, etc., can be computed. An equivalent circuit can be
extracted at the component level and performance analysis carried out. This constitutes the final
stage called verification. One may have to go through the placement and routing activity once
again to improve performance.
Critical Subsystems
The design may have critical subsystems. Their performance may be crucial to the overall
performance; in other words, to improve the system performance substantially, one may have to
design such subsystems afresh. The design here may imply redefinition of the basic feature size
of the component, component design, placement of components, or routing done separately and
specifically for the subsystem. A set of masks used in the foundry may have to be done afresh for
the purpose.










CHAPTER 5
TOOLS AND HDL USED
5.1 ROLE OF HDL

An HDL provides the framework for the complete logical design of the ASIC. All the
activities coming under the purview of an HDL are shown enclosed in bold dotted lines. Verilog
and VHDL are the two most commonly used HDLs today. Both have constructs with which the
design can be fully described at all the levels. There are additional constructs available to
facilitate setting up of the test bench, spelling out test vectors, and observing the outputs from
the designed unit.
IEEE has brought out standards for the HDLs, and the software tools conform to them.
Verilog as an HDL was introduced by Cadence Design Systems; they placed it into the public
domain in 1990, and it was established as a formal IEEE Standard in 1995. A revised version
was brought out in 2001. However, most of the simulation tools available today conform only to
the 1995 version of the standard. VHDL, used by a substantial number of VLSI designers today,
is the HDL used in this project for modeling the design.
We have used Xilinx ISE 9.2i for simulation and synthesis purposes. We implemented
the prescribed design in VHDL, a well-known industry and IEEE standard HDL.

5.2 Different Versions of Verilog
o Verilog-95
o Verilog-2001
o Verilog-2005
o SystemVerilog




5.3 NEEDS OF (VERILOG) HDL
o Interoperability.
o Technology independence.
o Design reuse.
o Several levels of abstraction.
o Readability.
o Standard language.
o Widely supported.


5.4 BRIEF HISTORY
o Verilog was invented by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at
Automated Integrated Design Systems (later renamed Gateway Design Automation).
o In 1985 it was used as a hardware modeling language.
o Gateway Design Automation was purchased by Cadence Design Systems in 1990.
o Cadence transferred Verilog into the public domain under the Open Verilog International
(OVI) organization.
o IEEE Standard 1364-1995 is commonly referred to as Verilog-95.

5.4.1 Related Standards
o Verilog-95 doesn't support (2's complement) signed nets and variables; signed operations
have to be performed using awkward bit-level manipulations.
o In Verilog-2001, such operations can be described more succinctly by built-in operators: +, -,
/, *, >>>. It also adds a generate/endgenerate construct.
o SystemVerilog is a superset of Verilog-2005, with many new features and capabilities to
aid design verification and design modeling.



5.5 VERILOG FEATURES

o Case sensitive.
o Verilog supports concurrency and sequentiality.
o Verilog syntax is similar to C syntax.
o A Verilog design consists of a hierarchy of modules.



5.6 LEVELS OF ABSTRACTION
Verilog supports many possible styles of design description, which differ primarily in
how closely they relate to the HW.
It is possible to describe a circuit in a number of ways.
Switch level.
Gate level.
Data flow level.
Behavioral level.

Switch level description
This is the lowest level of abstraction provided by Verilog. A module can be implemented in
terms of:
Switches (PMOS and NMOS).
Storage nodes.

Gate level description
The module is implemented in terms of logic gates.
Design at this level is similar to describing a design in terms of logic gate levels.
For large circuits, a low-level description quickly becomes impractical.



Dataflow level description
The circuit is described in terms of how data moves through the system.
In the dataflow style you describe how information flows between registers in the
system.
The combinational logic is described at a relatively high level, while the placement and
operation of the registers is specified quite precisely.

Fig 5.1 Data flow of a Verilog description
The behavior of the system over time is defined by the registers.
The lower level descriptions must be created or obtained.
The behavioral description can be provided in the form of subprograms (functions or
procedures).

Behavioral level description
The circuit is described in terms of its operation over time.
Representations might include, e.g., state diagrams, timing diagrams, and algorithmic
descriptions.
The concept of time may be expressed precisely using delays (e.g., A = #10 B;).
If no actual delay is used, only the order of sequential operations is defined.
At the lower levels of abstraction (e.g., RTL), synthesis tools ignore detailed timing
specifications.
The actual timing results depend on the implementation technology and the efficiency of
the synthesis tools.
There are few tools for behavioral synthesis.

General format:
always @(sensitivity_list)
begin
    // procedural statements, optionally with delay/wait controls
end












CHAPTER 6
SOFTWARE TOOLS

6.1 SOFTWARE TOOL-XILINX:
Xilinx ISE is a software tool produced by Xilinx for the synthesis and analysis of HDL
designs, which enables the developer to synthesize ("compile") their designs, perform timing
analysis, examine RTL diagrams, simulate a design's reaction to different stimuli, and configure
the target device with the programmer.
Xilinx was founded in 1984 by two semiconductor engineers, Ross Freeman and Bernard
Vonderschmitt, who were both working for integrated circuit and solid-state device manufacturer
Zilog Corp.
While working for Zilog, Freeman wanted to create chips that acted like a blank tape,
allowing users to program the technology themselves. At the time, the concept was paradigm-
changing. "The concept required lots of transistors and, at that time, transistors were considered
extremely precious; people thought that Ross's idea was pretty far out", said Xilinx Fellow Bill
Carter, who, hired in 1984 as the first IC designer, was the company's eighth employee.
Xilinx ISE is used to run programs written in the VHDL language. It has various versions
like Xilinx 9.2i, Xilinx 10.1, Xilinx 10.5, etc., and provides various pre-defined libraries and
packages.
6.2 VERSION 9.2I:
New Device Support:
This release supports the new Spartan-3A DSP family.
New Software Features:
Following are the new features in this release.
Operating System Support:
Support for the Windows Vista Business 32-bit operating system (supported, but with limited
testing).
Support for the Windows XP Professional 64-bit operating system.
Support for Red Hat Enterprise WS 5.0 32-bit and 64-bit operating systems (supported, but with
limited testing).
WHY XILINX ONLY?
There are many software tools that can run VHDL programs, such as Cadence, but compared
with these tools Xilinx is cost effective.
6.3 A BRIEF TUTORIAL: IMPLEMENTING VHDL DESIGNS USING XILINX ISE
This tutorial shows how to create, implement, simulate, and synthesize VHDL designs
for implementation in FPGA chips using Xilinx ISE 9.2i and ModelSim Xilinx Edition III v6.2g.
1. Launch Xilinx ISE from either the shortcut on your desktop or from your start menu
under programs ->Xilinx ISE 9.2i -> Project Navigator.
2. Start a new project by clicking File -> New Project..





3. In the resulting window, verify the Top-Level Source Type is VHDL. Change the
Project Location to a suitable directory and give it whatever name you
choose, e.g. lab3.


4. The next window shows the details of the project and the target chip. We will be
synthesizing designs into real chips, so it is important to match the target chip with the
particular board/chip you will be using. Beginning labs will be done in a Spartan-2E
XC2S200E chip that comes in a PQ208 package with a speed grade of 6, as shown.

5. Since we are starting a new design, the next couple of pop-up windows aren't relevant;
just click Next, Next, and Finish.
6. You should now be in the main Project Navigator window. Select Project -> New
Source from the menu.

7. In the resulting pop-up window specify a VHDL Module source and give the file a name.
I tend to just use the same name as the project itself, e.g. Lab 3. Click Next.

8. The next pop-up window allows you to specify your inputs and outputs through the
wizard if you so desire. In this tutorial we will build a 2-to-1 multiplexer, so we can
specify the inputs and outputs as shown below. Here, the default entity and
architecture names have also been changed. Once all inputs and outputs are entered, click
Next and then Finish.

9. You can see that the wizard has used STD_LOGIC as the default type for your signals
and also filled in the basic entity and architecture details for you.

10. Now you can fill in the rest of your code for your design. In this case, we can code the
multiplexer as shown below. Make sure to save your code frequently.
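The completed code appears only as a screenshot in the original; a minimal sketch of what it
might look like, assuming A is the select line and using the port names assigned later in this
tutorial (the entity and architecture names are illustrative), is:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity mux2to1 is
    port ( A  : in  STD_LOGIC;    -- select line
           I0 : in  STD_LOGIC;    -- data input 0
           I1 : in  STD_LOGIC;    -- data input 1
           Y  : out STD_LOGIC );  -- output
end mux2to1;

architecture Behavioral of mux2to1 is
begin
    -- route I0 to the output when the select line is '0', otherwise I1
    Y <= I0 when A = '0' else I1;
end Behavioral;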


11. Once the code is entered, we can proceed with a simulation of the design: set the source
view to Behavioral Simulation, check the syntax, and then run the simulation.

12. Then we can view the simulation output. To obtain the synthesis report, change the source
view to Synthesis/Implementation; from the window that opens we can view the synthesis
report, an RTL schematic diagram, and a technology schematic diagram.






CHAPTER 7
HARDWARE TOOLS
A field-programmable gate array (FPGA) is a semiconductor device that can be configured by the
customer or designer after manufacturing, hence the name "field-programmable". FPGAs are
programmed using a logic circuit diagram or a source code in a hardware description language (HDL) to
specify how the chip will work. They can be used to implement any logical function that an application-
specific integrated circuit (ASIC) could perform, but the ability to update the functionality after shipping
offers advantages for many applications.
FPGAs contain programmable logic components called "logic blocks", and a hierarchy of
reconfigurable interconnects that allow the blocks to be "wired together", somewhat like a one-chip
programmable breadboard. Logic blocks can be configured to perform complex combinational functions,
or merely simple logic gates like AND and XOR. In most FPGAs, the logic blocks also include memory
elements, which may be simple flip-flops or more complete blocks of memory.
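To make the lookup-table idea concrete, a 3-input LUT can be modeled behaviorally in C as an
8-bit truth table indexed by the inputs; the configuration value shown is an illustrative choice
implementing the 3-input majority function:

#include <stdint.h>

/* Behavioral model of a 3-input LUT: the 8 configuration bits store the
   truth table, and evaluating the function is an indexed bit lookup. */
static inline int lut3(uint8_t config, int a, int b, int c)
{
    int index = (a << 2) | (b << 1) | c;  /* inputs form the table index */
    return (config >> index) & 1;
}

/* Example: config = 0xE8 implements the 3-input majority function. */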
7.1 HISTORY
The FPGA industry sprouted from programmable read-only memory (PROM) and programmable
logic devices (PLDs). PROMs and PLDs both had the option of being programmed in batches in a factory
or in the field (field programmable); however, the programmable logic was hard-wired between logic gates.
Xilinx co-founders Ross Freeman and Bernard Vonderschmitt invented the first commercially
viable field programmable gate array in 1985: the XC2064. The XC2064 had programmable gates and
programmable interconnects between gates, the beginnings of a new technology and market. It
boasted a mere 64 configurable logic blocks (CLBs), with two 3-input lookup tables (LUTs). More
than 20 years later, Freeman was entered into the National Inventors Hall of Fame for his invention.


7.2 ARCHITECTURE
The most common FPGA architecture consists of an array of configurable logic blocks (CLBs), I/O
pads, and routing channels. Generally, all the routing channels have the same width (number of wires).
Multiple I/O pads may fit into the height of one row or the width of one column in the array.
An application circuit must be mapped into an FPGA with adequate resources. While the
number of CLBs and I/Os required is easily determined from the design, the number of routing tracks
needed may vary considerably even among designs with the same amount of logic.

Fig 7.1 Internal Structure of FPGA
7.3 APPLICATIONS
Applications of FPGAs include digital signal processing, software-defined radio, aerospace and
defense systems, ASIC prototyping, medical imaging, computer vision, speech recognition, cryptography,
bioinformatics, computer hardware emulation, radio astronomy and a growing range of other areas.





SPECIFICATIONS OF SPARTAN-3 FPGA


Figure 4.2 Image of Spartan-3E FPGA kit

The Spartan-3 family of Field-Programmable Gate Arrays is specifically designed to
meet the needs of high volume, cost-sensitive consumer electronic applications. The
eight-member family offers densities ranging from 50,000 to five million system gates.


The Spartan-3 family builds on the success of the earlier Spartan-IIE family by increasing
the amount of logic resources, the capacity of internal RAM, the total number of I/Os, and the
overall level of performance as well as by improving clock management functions. Numerous
enhancements derive from the Virtex-II platform technology. These Spartan-3 FPGA
enhancements, combined with advanced process technology, deliver more functionality and
bandwidth per dollar than was previously possible, setting new standards in the programmable
logic industry.

Because of their exceptionally low cost, Spartan-3 FPGAs are ideally suited to a wide
range of consumer electronics applications, including broadband access, home networking,
display/projection and digital television equipment.

The Spartan-3 family is a superior alternative to mask programmed ASICs. FPGAs avoid
the high initial cost, the lengthy development cycles, and the inherent inflexibility of
conventional ASICs. Also, FPGA programmability permits design upgrades in the field with no
hardware replacement necessary, an impossibility with ASICs.

4.2.2 FEATURES OF SPARTAN 3E
Low-cost, high-performance logic solution for high-volume, consumer-oriented applications
Densities up to 74,880 logic cells

SelectIO interface signaling
o Up to 633 I/O pins
o 622+ Mb/s data transfer rate per I/O
o 18 single-ended signal standards
o 8 differential I/O standards including LVDS, RSDS
o Termination by Digitally Controlled Impedance
o Signal swing ranging from 1.14V to 3.465V
o Double Data Rate (DDR) support
o DDR, DDR2 SDRAM support up to 333 Mbps

Logic resources
o Abundant logic cells with shift register capability
o Wide, fast multiplexers
o Fast look-ahead carry logic
o Dedicated 18 x 18 multipliers
o JTAG logic compatible with IEEE 1149.1/1532

SelectRAM hierarchical memory
o Up to 1,872 Kbits of total block RAM
o Up to 520 Kbits of total distributed RAM
o Digital Clock Manager (up to four DCMs)
o Clock skew elimination
o Frequency synthesis
o High resolution phase shifting
o Eight global clock lines and abundant routing
o Fully supported by Xilinx ISE and WebPACK software development systems

4.2.3 ARCHITECTURAL OVERVIEW OF SPARTAN-3E

Figure 4.3 Architectural overview of Spartan-3E FPGA kit

The Spartan-3 family architecture consists of five fundamental programmable functional
elements:
Configurable Logic Blocks (CLBs) contain RAM-based Look-Up Tables (LUTs) to implement logic
and storage elements that can be used as flip-flops or latches. CLBs can be programmed to
perform a wide variety of logical functions as well as to store data.
Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the internal logic
of the device. Each IOB supports bidirectional data flow plus 3-state operation.
Digital Clock Manager (DCM) blocks provide self-calibrating, fully digital solutions for
distributing, delaying, multiplying, dividing, and phase shifting clock signals.
Block RAM provides data storage in the form of 18-Kbit dual-port blocks.
Multiplier blocks accept two 18-bit binary numbers as inputs and calculate the product.

7.4 A BRIEF TUTORIAL: DOWNLOADING THE SOURCE CODE INTO THE FPGA

1. Now let's look at the flow for actually synthesizing and implementing the design in the
FPGA prototyping boards. Close ModelSim and go back to the Xilinx ISE environment.
In the Sources subwindow, change the selection in the dropdown box from Behavioral
Simulation to Synthesis/Implementation.



2. To properly synthesize the design we need to specify which pins on the chip all the inputs
and outputs should be assigned to. In general, of course, we could assign the signals just
about any way we want. Since we will be using specific prototype boards, we need to
make sure our pin assignments match the switches, buttons, and LEDs so we can test our
design. We will be starting with Digilab 2E boards that are connected to Digilab DIO2
input/output boards. The I/O board has already been programmed and configured to have
the following connections:



3. To assign specific pins, expand the User Constraints selection under the Process
subwindow and double-click on Assign Package Pins.


4. A new application called Xilinx PACE should be launched.

a. In the Design Object List subwindow you should see a listing of all the input and
output signals from our design.

Here is where we can specify which pin locations we want for each signal. Simply
enter the pin numbers from the tables shown above, making sure to use a
capital letter P in front of the pin specification. Let's assign our signals as:
A -> P163 (Switch 1)
I0 -> P164 (Switch 2)
I1 -> P166 (Switch 3)
Y -> P149 (LED 0)

Once all pins have been assigned, save your constraints by selecting File -> Save
from the menu bar and exit Xilinx PACE.
5. Back in Xilinx ISE, in the Process subwindow, double-click on the Synthesize - XST
selection and wait for the process to complete. Then double-click on the Implement
Design selection and wait for the process to complete. Then double-click on the
Generate Programming File selection and wait for the process to complete. If all goes
well, you should have green check marks for the whole design.

6. There is a lot of information you can obtain through all of the objects listed in the
Processes subwindow, but let us proceed to downloading the design onto the prototyping
board for testing. First make sure the prototyping board is connected to the PC and has
power on. Also make sure the slide switch on the FPGA board by the parallel port is set
to JTAG (as opposed to Port). Then select Configure Device (iMPACT) underneath
the Generate Programming File selection. You should see the following window:

7. Now you need to specify which bitstream file to use to configure the device. For this
tutorial we want to select the mux.bit file and click Open.

You will probably get the message below. Just click Yes.

You will also get a warning message saying the JTAG clock was updated in the bitstream
file (which is good), so just click OK. There is a way to correct for that in the original
design flow, but Xilinx automatically catches it here, so I don't usually bother.

8. You should now see the Spartan XC2S200E chip in the main window. Right click on the
chip to prepare for downloading the bitstream file.

Select Program on the resulting window.


9. Click OK.


If all goes well you should get the Programming Succeeded message.



10. Now just test and verify your design on the actual FPGA board!












SIMULATION RESULTS





















SYNTHESIS REPORT
=========================================================================
* Final Report *
=========================================================================
Final Results
RTL Top Level Output File Name : topmodule_mac.ngr
Top Level Output File Name : topmodule_mac
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : No

Design Statistics
# IOs : 10

Cell Usage :
# BELS : 45
# GND : 1
# INV : 2
# LUT2 : 2
# LUT3 : 6
# LUT3_D : 2
# LUT3_L : 2
# LUT4 : 23
# LUT4_L : 5
# MUXF5 : 2
# FlipFlops/Latches : 19
# FDCE : 19
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 9
# IBUF : 1
# OBUF : 8
=========================================================================

Device utilization summary:
---------------------------

Selected Device : 3s500efg320-5

Number of Slices: 21 out of 4656 0%
Number of Slice Flip Flops: 19 out of 9312 0%
Number of 4 input LUTs: 42 out of 9312 0%
Number of IOs: 10
Number of bonded IOBs: 10 out of 232 4%
Number of GCLKs: 1 out of 24 4%

---------------------------
Partition Resource Summary:
---------------------------

No Partitions were found in this design.

---------------------------


=========================================================================
TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.
FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT
GENERATED AFTER PLACE-and-ROUTE.

Clock Information:
------------------
-----------------------------------+------------------------+-------+
Clock Signal | Clock buffer(FF name) | Load |
-----------------------------------+------------------------+-------+
clk | BUFGP | 19 |
-----------------------------------+------------------------+-------+

Asynchronous Control Signals Information:
----------------------------------------
-----------------------------------+------------------------+-------+
Control Signal | Buffer(FF name) | Load |
-----------------------------------+------------------------+-------+
rst | IBUF | 19 |
-----------------------------------+------------------------+-------+

Timing Summary:
---------------
Speed Grade: -5

Minimum period: 4.588ns (Maximum Frequency: 217.958MHz)
Minimum input arrival time before clock: No path found
Maximum output required time after clock: 8.868ns
Maximum combinational path delay: No path found

Timing Detail:
--------------
All values displayed in nanoseconds (ns)

=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 4.588ns (frequency: 217.958MHz)
Total number of paths / destination ports: 148 / 38
-------------------------------------------------------------------------
Delay: 4.588ns (Levels of Logic = 4)
Source: accumulator_4 (FF)
Destination: accumulator_7 (FF)
Source Clock: clk rising
Destination Clock: clk rising

Data Path: accumulator_4 to accumulator_7
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FDCE:C->Q 11 0.514 0.823 accumulator_4 (accumulator_4)
LUT3:I2->O 1 0.612 0.509 csa/st2[5].fa2/cout1_SW3_SW1_SW0 (N31)
LUT4:I0->O 1 0.612 0.000 csa/st2[5].fa2/cout1_SW3_F (N53)
MUXF5:I0->O 1 0.278 0.360 csa/st2[5].fa2/cout1_SW3 (N11)
LUT4:I3->O 2 0.612 0.000 csa/st2[7].fa2/Mxor_s_xo<0>1 (acc<7>)
FDCE:D 0.268 accumulator_7
----------------------------------------
Total 4.588ns (2.896ns logic, 1.692ns route)
(63.1% logic, 36.9% route)

=========================================================================
Timing constraint: Default OFFSET OUT AFTER for Clock 'clk'
Total number of paths / destination ports: 117 / 8
-------------------------------------------------------------------------
Offset: 8.868ns (Levels of Logic = 6)
Source: accumulator_4 (FF)
Destination: fpgaout<7> (PAD)
Source Clock: clk rising

Data Path: accumulator_4 to fpgaout<7>
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FDCE:C->Q 11 0.514 0.823 accumulator_4 (accumulator_4)
LUT3:I2->O 1 0.612 0.509 csa/st2[5].fa2/cout1_SW3_SW1_SW0 (N31)
LUT4:I0->O 1 0.612 0.000 csa/st2[5].fa2/cout1_SW3_F (N53)
MUXF5:I0->O 1 0.278 0.360 csa/st2[5].fa2/cout1_SW3 (N11)
LUT4:I3->O 2 0.612 0.410 csa/st2[7].fa2/Mxor_s_xo<0>1 (acc<7>)
LUT4:I2->O 1 0.612 0.357 mac<7>1 (fpgaout_7_OBUF)
OBUF:I->O 3.169 fpgaout_7_OBUF (fpgaout<7>)
----------------------------------------
Total 8.868ns (6.409ns logic, 2.459ns route)
(72.3% logic, 27.7% route)

=========================================================================


Total REAL time to Xst completion: 28.00 secs
Total CPU time to Xst completion: 28.01 secs


Total memory usage is 261200 kilobytes






















CONCLUSION

Optimized and synthesizable VHDL code was developed for the implementation of the 64-bit
MAC unit. Each module was tested with sample vectors and the outputs were correct, with
minimal delay. Since the delay of the 64-bit unit is small, this design can be used in systems that
require high performance from processors operating on a large number of bits.



FUTURE SCOPE

The future scope of this project is to design a 128-bit MAC unit. This will be even faster,
but at the expense of some additional hardware. In addition, an 8-tap filter can be designed
with the present architecture.

