Project Documentation

1. INTRODUCTION
1.1. INTRODUCTION:
A filter is a component that passes a certain band of frequencies and attenuates the others. Filters are basic building blocks of Digital Signal Processing (DSP) applications. Digital filters fall into two classes: Finite Impulse Response (FIR) filters and Infinite Impulse Response (IIR) filters.
An FIR filter is a digital filter whose output depends on a finite number of input samples. Its impulse response settles to zero after a finite number of sample intervals, whereas the impulse response of an IIR filter can continue indefinitely.
In this project we designed an FIR filter that uses fewer resources and has less delay by applying the Distributed Arithmetic (DA) algorithm. The direct method of implementing an FIR filter, Multiply and Accumulate (MAC), consumes a large area (many resources) and is expensive to implement on an FPGA. DA overcomes this drawback with a multiplier-less architecture, and it is especially well suited to LUT-based FPGA architectures.
The main problem with DA is that the LUT size increases exponentially with the order of the filter. To overcome this problem, a hardware-efficient DA architecture is used that reduces the LUT size by modifying the architecture of the filter while maintaining high performance.
CVR College of Engineering (VLSI)
1.2. FIR Filter:
An FIR filter is described by a single polynomial: its transfer function has only a numerator polynomial B(z). An FIR filter needs a much higher order than an IIR filter to meet an equivalent specification, which results in a longer delay.
$$H(z) = \frac{B(z)}{z^{N}}$$
$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] + \cdots + b_N x[n-N]$$
N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side, which are commonly referred to as taps. This equation can also be expressed as a convolution of the coefficient sequence $b_i$ with the input signal:
$$y[n] = \sum_{i=0}^{N} b_i\, x[n-i]$$
That is, the filter output is a weighted sum of the current and a finite number of previous values of the input.
1.3. Block Diagram of FIR filter:
Fig 1.1: Block diagram of FIR filter

1.4. Spartan-3E features:
The Spartan-3E family reduces system cost by offering the lowest cost-per-logic of any FPGA family, supporting the lowest-cost configuration solutions including commodity serial (SPI) and parallel flash memories, and efficiently integrating the functions of many chips into a single FPGA.

Advanced, Low-Cost Features
Five devices with 100K to 1.6M system gates
From 66 to 376 I/Os with package and density migration
Up to 648K bits of block RAM and up to 231K bits of distributed RAM
Up to 36 embedded 18x18 multipliers for high-performance DSP applications
Up to eight Digital Clock Managers

Cost-Saving System Interfaces and Solutions
Support for Xilinx Platform Flash as well as commodity serial (SPI) and byte-wide flash memory for configuration
Easy-to-implement interfaces to DDR memory
Support for 18 common I/O standards, including PCI-X, mini-LVDS, and RSDS

Industry-Leading Design Tools and IP
ISE design tools to shorten design and verification time
Hundreds of pre-verified, pre-optimized Intellectual Property (IP) cores and reference designs
ChipScope Pro™ system-debugging environment

Easy-to-Use, Low-Cost FPGA Development Systems
Complete Spartan-3E Standard Kit available for only $149 USD
Includes XC3S500E FPGA, SPI Flash, 32Mb DDR memory, and support for USB 2.0
1.5. CONCLUSION:
In this chapter we discussed the FIR filter and its block diagram, and described the features of the Spartan-3E FPGA.
2. LITERATURE SURVEY
2.1 INTRODUCTION:
A signal carries information from a source to a destination, and there are different types of signals. Filters play an essential role in Digital Signal Processing (DSP). A filter is a system that passes certain frequency components and rejects others, and it is designed to meet a specification of the desired properties of the system. An FPGA is a programmable device that is often used as a prototyping platform for implementing such algorithms.
2.2. Signal:
In the field of communications, signal processing and in electrical engineering more generally, a signal is any time-varying or spatially varying quantity. In the physical world, any quantity measurable through time or over space can be taken as a signal. Within a complex society, any set of human information or machine data can also be taken as a signal. Such information or machine data must all be part of systems existing in the physical world, either living or non-living.
Despite the complexity of such systems, their outputs and inputs can often be represented as simple quantities measurable through time or across space. In the latter half of the 20th century, electrical engineering itself separated into several disciplines, specializing in the design and analysis of physical signals and systems on the one hand, and in the functional behavior and conceptual structure of complex human and machine systems on the other. These engineering disciplines have led the way in the design, study, and implementation of systems that take advantage of signals as simple measurable quantities in order to facilitate the transmission, storage and manipulation of information.
2.2.1. Definition of the signal:

In information theory, a signal is a codified message, that is, the sequence of states in a communication channel that encodes a message. In the context of signal processing, arbitrary binary data streams are not considered as signals, but only analog and digital signals that are representations of analog physical quantities.
In a communication system, a transmitter encodes a message into a signal, which is carried to a receiver by the communications channel. For example, the words "Mary had a little lamb" might be the message spoken into a telephone. The telephone transmitter converts the sounds into an electrical voltage signal. The signal is transmitted to the receiving telephone by wires, and at the receiver it is reconverted into sounds. In telephone networks, signaling, for example common channel signaling, refers to phone number and other digital control information rather than the actual voice signal.
Signals can be categorized in various ways. The most common distinction is between the discrete and continuous spaces that the functions are defined over, for example discrete and continuous time domains. Discrete-time signals are often referred to as time series in other fields. Continuous-time signals are often referred to as continuous signals even when the signal functions are not continuous; an example is a square-wave signal.
A second important distinction is between discrete-valued and continuous-valued signals. Digital signals are sometimes defined as discrete-valued sequences of quantized values that may or may not be derived from an underlying continuous-valued physical process. In other contexts, digital signals are defined as the continuous-time waveform signals in a digital system, representing a bit-stream. In the first case, a signal that is generated by means of a digital modulation method is considered as converted to an analog signal, while it is considered as a digital signal in the second case.
2.2.2. Types of Signals:
2.2.2.1. Discrete-time and continuous-time signal:
If the quantities of a signal are defined only on a discrete set of times, we call it a discrete-time signal. In other words, a discrete-time real (or complex) signal can be seen as a function from the set of integers to the set of real (or complex) numbers. The frequency response of a discrete signal is usually analyzed with the Z-transform, where discrete signals are denoted by u(k) with k = ..., -1, 0, 1, 2, 3, ...
A continuous-time real (or complex) signal is any real-valued (or complex-valued) function which is defined for all time t in an interval, most commonly an infinite interval. Continuous signals have a continuous frequency spectrum, and the Fourier Transform (FT) is used to obtain their frequency response. Continuous signals are denoted by u(t), where t is continuous.
2.2.2.2. Analog and Digital signal:
There are mainly two types of signals encountered in practice, analog and digital. In short, the difference between them is that digital signals are discrete and quantized, as defined below, while analog signals possess neither property.
DISCRETIZATION:
One of the fundamental distinctions between different types of signals is between continuous and discrete time. In the mathematical abstraction, the domain of a continuous-time (CT) signal is the set of real numbers (or some interval thereof), whereas the domain of a discrete-time (DT) signal is the set of integers (or some interval). What these integers represent depends on the nature of the signal. DT signals often arise via sampling of CT signals. An audio signal, for example, consists of a continually fluctuating voltage on a line that can be digitized by an ADC circuit, wherein the circuit reads the voltage level on the line, say, every 50 µs. The resulting stream of numbers is stored as digital data and forms a discrete-time signal. Computers and other digital devices are restricted to discrete time.
QUANTIZATION:
If a signal is to be represented as a sequence of numbers, it is impossible to maintain arbitrarily high precision - each number in the sequence must have a finite number of digits. As a result, the values of such a signal are restricted to belong to a finite set; in other words, it is quantized.
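The two steps described above, discretization and quantization, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1 kHz sine input, the 8-bit word length, the 50 µs sampling period and the helper names `sample` and `quantize` are hypothetical choices, not part of the project.

```python
import math

def sample(signal, period_s, count):
    """Discretization: evaluate a continuous-time function at t = n * period_s."""
    return [signal(n * period_s) for n in range(count)]

def quantize(value, bits, full_scale=1.0):
    """Quantization: map a value to the nearest of 2**bits integer codes
    covering [-full_scale, full_scale), clamping at the ends of the range."""
    levels = 2 ** (bits - 1)
    code = round(value / full_scale * levels)
    return max(-levels, min(levels - 1, code))

# Hypothetical input: a 1 kHz sine wave read every 50 microseconds.
tone = lambda t: math.sin(2 * math.pi * 1000.0 * t)
samples = sample(tone, 50e-6, 20)               # discrete in time
codes = [quantize(s, bits=8) for s in samples]  # discrete in value as well
```

After both steps every element of `codes` is one of only 256 possible integers, which is what makes the signal digital rather than merely discrete-time.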
2.3 Filters in signal processing:
In signal processing, a filter is a device or process that removes some unwanted component or feature from a signal. In general, it takes an input that is a function of time and produces an output that is a function of time (usually delayed from the input).
Filtering is a class of signal processing, the defining feature of filters being the complete or partial suppression of some aspect of the signal. Most often, this means removing some frequencies and not others in order to suppress interfering signals and reduce background noise. However, filters do not exclusively act in the frequency domain; especially in the field of image processing, many other targets for filtering exist. There are many different bases for classifying filters, and these overlap in many different ways, so there is no simple hierarchical classification. Filters may be:

analog or digital
discrete-time (sampled) or continuous-time
linear or non-linear
passive or active type of continuous-time filter
Infinite impulse response (IIR) or finite impulse response (FIR) type of discrete-time or digital filter.
2.3.1. Analog Filter:

Analog filters are a basic building block of signal processing much used in electronics. Amongst their many applications are the separation of an audio signal before application to bass, mid-range and tweeter loudspeakers; the combining and later separation of multiple telephone conversations onto a single channel; the selection of a chosen radio station in a radio receiver and rejection of others. Passive linear electronic analogue filters are those filters which can be described with linear differential equations (linear); they are composed of capacitors, inductors and, sometimes, resistors (passive) and are designed to operate on continuously varying (analogue) signals. There are many linear filters which are not analogue in implementation (digital filter), and there are many electronic filters which may not have a passive topology both of which may have the same transfer function of the filters described in this article. Analogue filters are most
often used in wave filtering applications, that is, where it is required to pass particular frequency components and to reject others from analog (continuous-time) signals.
2.3.2. Digital Filters:

In electronics, computer science and mathematics, a digital filter is a system that performs mathematical operations on a sampled, discrete-time signal to reduce or enhance certain aspects of that signal. This is in contrast to the other major type of electronic filter, the analog filter, which is an electronic circuit operating on continuous-time analog signals. An analog signal may be processed by a digital filter by first being digitized and represented as a sequence of numbers, then manipulated mathematically, and then reconstructed as a new analog signal. In an analog filter, the input signal is "directly" manipulated by the circuit.
A digital filter system usually consists of an analog-to-digital converter (to sample the input signal), a microprocessor (often a specialized digital signal processor), and a digital-to-analog converter. Software running on the microprocessor can implement the digital filter by performing the necessary mathematical operations on the numbers received from the ADC. In some high-performance applications, an FPGA or ASIC is used instead of a general-purpose microprocessor.
Digital filters may be more expensive than an equivalent analog filter due to their increased complexity, but they make practical many designs that are impractical or impossible as analog filters. Since digital filters use a sampling process and discrete-time processing, they experience latency (the difference in time between the input and the response), which is almost irrelevant in analog filters. Digital filters are commonplace and an essential element of everyday electronics such as radios, cell phones, and stereo receivers.
2.3.3. Passive filter:
Passive implementations of linear filters are based on combinations of resistors (R), inductors (L) and capacitors (C). These types are collectively known as passive filters, because they do not depend upon an external power supply and/or they do not contain active components such as transistors. Inductors block high-frequency signals and conduct low-frequency signals, while capacitors do the reverse. A filter in which the signal passes through an inductor, or in which a capacitor provides a path to ground, presents less attenuation to low-frequency signals than high-frequency signals and is a low-pass filter. If the signal passes through a capacitor, or has a path to ground through an inductor, then the filter presents less attenuation to high-frequency signals than low-frequency signals and is a high-pass filter. Resistors on their own have no frequency-selective properties, but are added to inductors and capacitors to determine the time constants of the circuit, and therefore the frequencies to which it responds.
The inductors and capacitors are the reactive elements of the filter. The number of elements determines the order of the filter. In this context, an LC tuned circuit being used in a band-pass or band-stop filter is considered a single element even though it consists of two components. At high frequencies (above about 100 megahertz), sometimes the inductors consist of single loops or strips of sheet metal, and the capacitors consist of adjacent strips of metal. These inductive or capacitive pieces of metal are called stubs.
2.3.4. Active Filter:

Active filters are implemented using a combination of passive and active (amplifying) components, and require an outside power source. Operational amplifiers are frequently used in active filter designs. These can have high Q, and can achieve resonance without the use of inductors. However, their upper frequency limit is limited by the bandwidth of the amplifiers used.
2.3.5. Linear continuous-time filter:

A linear continuous-time circuit is perhaps the most common meaning of filter in the signal processing world, and simply "filter" is often taken to be synonymous. These are filters that are designed to remove certain frequencies and allow others to pass. Such a filter is, of necessity, a linear filter. Any non-linearity will result in the output signal containing frequency components which were not present in the input signal. The modern design methodology for linear continuous-time filters is called network synthesis. Some important filter families designed in this way are:
Chebyshev filter, has the best approximation to the ideal response of any filter for a specified order and ripple.
Butterworth filter, has a maximally flat frequency response.
Bessel filter, has a maximally flat phase delay.
Elliptic filter, has the steepest cutoff of any filter for a specified order and ripple.
The difference between these filter families is that they all use a different polynomial function to approximate the ideal filter response. This results in each having a different transfer function. An older methodology, now largely obsolete but still occasionally encountered, is the image parameter method. Filters designed by this methodology are archaically called "wave filters". Some important filters designed by this method are:

Constant k filter, the original and simplest form of wave filter.
M-derived filter, a modification of the constant k filter with improved cutoff steepness and impedance matching.
2.3.6. Terminology to classify linear filter:

Some terms used to describe and classify linear filters:
The frequency response can be classified into a number of different band forms describing which frequencies the filter passes (the pass band) and which it rejects (the stop band):
Low-pass filter: low frequencies are passed, high frequencies are attenuated.
High-pass filter: high frequencies are passed, low frequencies are attenuated.
Band-pass filter: only frequencies in a frequency band are passed.
Band-stop filter or band-reject filter: only frequencies in a frequency band are attenuated.
Notch filter: rejects just one specific frequency, an extreme band-stop filter.
Comb filter: has multiple regularly spaced narrow pass bands giving the band form the appearance of a comb.
All-pass filter: all frequencies are passed, but the phase of the output is modified.

Cutoff frequency is the frequency beyond which the filter will not pass signals. It is usually measured at a specific attenuation such as 3 dB.
Roll-off is the rate at which attenuation increases beyond the cutoff frequency.
Transition band is the (usually narrow) band of frequencies between a pass band and a stop band.
Ripple is the variation of the filter's insertion loss in the pass band.
The order of a filter is the degree of the approximating polynomial and in passive filters corresponds to the number of elements required to build it. Increasing the order increases roll-off and brings the filter closer to the ideal response.
2.3.7. FIR Filter:
A Finite Impulse Response (FIR) filter is a type of digital filter. The impulse response, the filter's response to a Kronecker delta input, is finite because it settles to zero in a finite number of sample intervals. This is in contrast to Infinite Impulse Response (IIR) filters, which have internal feedback and may continue to respond indefinitely. The impulse response of an Nth-order FIR filter lasts for N + 1 samples, and then dies to zero. The difference equation that defines the output of an FIR filter in terms of its input is:
$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] + \cdots + b_N x[n-N]$$
where:

x[n] is the input signal,
y[n] is the output signal,
$b_i$ are the filter coefficients, and
N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side, which are commonly referred to as taps.
This equation can also be expressed as a convolution of the coefficient sequence $b_i$ with the input signal:
$$y[n] = \sum_{i=0}^{N} b_i\, x[n-i]$$
That is, the filter output is a weighted sum of the current and a finite number of previous values of the input.
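The weighted sum can be written out directly in Python. The sketch below is illustrative only: the coefficient values are hypothetical, the function name `fir_filter` is not from the project, and input samples before the start of the sequence are assumed to be zero.

```python
def fir_filter(b, x):
    """Direct evaluation of y[n] = sum over i of b[i] * x[n - i].
    Samples before the start of x are taken as zero (an assumption)."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, coeff in enumerate(b):
            if n - i >= 0:          # skip taps that reach before the first sample
                acc += coeff * x[n - i]
        y.append(acc)
    return y

# Hypothetical 3rd-order (4-tap) filter driven by a unit impulse.
b = [1, 2, 3, 4]
impulse = [1, 0, 0, 0, 0, 0]
response = fir_filter(b, impulse)
```

Feeding in a unit impulse returns the N + 1 coefficients followed by zeros, which is the finite impulse response the section describes.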
2.3.8. IIR filters:

Infinite Impulse Response (IIR) is a property of signal processing systems. Systems with this property are known as IIR systems or, when dealing with filter systems, as IIR filters. IIR systems have an impulse response function that is non-zero over an infinite length of time. This is in contrast to FIR filters, which have fixed-duration impulse responses. The simplest analog IIR filter is an RC filter made up of a single resistor (R) feeding into a node shared with a single capacitor (C). This filter has an exponential impulse response characterized by an RC time constant.
IIR filters may be implemented as either analog or digital filters. In digital IIR filters, the output feedback is immediately apparent in the equations defining the output. Note that, unlike with FIR filters, in designing IIR filters it is necessary to carefully consider the "time zero" case in which the outputs of the filter have not yet been clearly defined. Design of digital IIR filters is heavily dependent on that of their analog counterparts, because there are plenty of resources, works and straightforward design methods concerning analog feedback filter design while there are hardly any for digital IIR filters. As a result, when a digital IIR filter is going to be implemented, usually an analog filter (e.g. Chebyshev filter, Butterworth filter, Elliptic filter) is first designed and is then converted to a digital filter by applying discretization techniques such as the bilinear transform or impulse invariance. Digital filters are often described and implemented in terms of the difference equation that defines how the output signal is related to the input signal:
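$$y[n] = \frac{1}{a_0}\left(b_0 x[n] + b_1 x[n-1] + \cdots + b_P x[n-P] - a_1 y[n-1] - a_2 y[n-2] - \cdots - a_Q y[n-Q]\right)$$
where:

P is the feedforward filter order,
$b_i$ are the feedforward filter coefficients,
Q is the feedback filter order,
$a_i$ are the feedback filter coefficients,
x[n] is the input signal, and
y[n] is the output signal.
A more condensed form of the difference equation is:
$$y[n] = \frac{1}{a_0}\left(\sum_{i=0}^{P} b_i\, x[n-i] - \sum_{j=1}^{Q} a_j\, y[n-j]\right)$$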
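A short Python sketch may help to show how the feedback terms produce a response that never settles exactly to zero. The names `b` (feedforward coefficients) and `a` (feedback coefficients, with `a[0]` as the normalizing term) follow the usual textbook convention and are assumptions here, as are the first-order example values and the zero initial conditions.

```python
def iir_filter(b, a, x):
    """y[n] = (1/a[0]) * (sum_i b[i]*x[n-i] - sum_{j>=1} a[j]*y[n-j]),
    with x and y taken as zero before n = 0 (an assumed initial condition)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, bi in enumerate(b):          # feedforward part
            if n - i >= 0:
                acc += bi * x[n - i]
        for j in range(1, len(a)):          # feedback part uses earlier outputs
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y.append(acc / a[0])
    return y

# Hypothetical first-order example: y[n] = x[n] + 0.5 * y[n-1].
impulse = [1.0] + [0.0] * 7
response = iir_filter(b=[1.0], a=[1.0, -0.5], x=impulse)
```

Each output is half of the previous one, so the impulse response decays but stays non-zero for every n, in contrast to an FIR filter whose response ends after N + 1 samples.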
2.4. FPGA:
FPGAs offer an opportunity to accelerate your digital signal processing application up to 1000 times over a traditional DSP microprocessor.
Microprocessors are slow:

Digital signal processing has traditionally been done using enhanced microprocessors. While the high volume of generic product provides a low cost solution, the performance falls seriously short for many applications. Until recently, the only alternatives were to develop custom hardware (typically board level or ASIC designs), buy expensive fixed function processors (e.g. an FFT chip), or use an array of microprocessors.
FPGAs accelerate DSP:

Recent increases in Field Programmable Gate Array (FPGA) performance and size offer a new hardware acceleration opportunity. An FPGA is an array of programmable logic cells interconnected by a matrix of wires and programmable switches. Each cell performs a simple logic function defined by a user's program. An FPGA has a large number (64 to over 20,000) of these cells available to use as building blocks in complex digital circuits. Custom hardware has never been so easy to develop.
Performance up to 1000x:
The ability to manipulate the logic at the gate level means you can construct a custom processor to efficiently implement the desired function. By simultaneously performing all of the algorithm's sub-functions, the FPGA can outperform a DSP by as much as 1000:1.
Fig 2.1. Comparison of DSP and FPGA.

DSP performance is limited by the serial instruction stream. FPGAs are a better solution in the region above the curve.
FPGA DSPs are flexible:
Like microprocessors, many FPGAs can be infinitely reprogrammed in-circuit in only a fraction of a second. Design revisions, even for a fielded product, can be implemented quickly and painlessly. Hardware can also be reduced by taking advantage of reconfiguration.
Highly integrated:
The programmable logic in an FPGA can absorb much of the interface and glue logic associated with microprocessors. The tighter integration can make a product smaller, lighter, cheaper and lower power.
Competitively priced:
FPGAs are a generic product customized at the point of use. They enjoy the cost advantages of high production volumes. There are also none of the NRE charges or fabrication delays associated with ASIC development, which helps to get a product to market on time. The FPGA's flexibility eliminates the long design cycle associated with ASICs. With FPGAs there are no delays for prototypes or early production volume. Design revisions are easily implemented, often taking less than a day. The devices are fully tested by the manufacturer, eliminating production test development.
2.5. CONCLUSION:
In this chapter we discussed signals and their types, filters and their types, and the use of FPGAs in Digital Signal Processing.
3. DESIGN METHODOLOGY
3.1. INTRODUCTION:
A Finite Impulse Response (FIR) filter is a type of digital filter. The direct implementation of the FIR filter requires a large number of resources. Distributed Arithmetic (DA) reduces the resource count by replacing multiplications with additions and shifts. The proposed DA algorithm reduces the ROM size further by using multiplexers; this LUT-less algorithm removes the use of ROM memory altogether.
3.2. DIRECT IMPLEMENTATION OF FIR FILTER:
An FIR filter is generally designed using the Multiply and Accumulate (MAC) principle, in which the input samples are multiplied by the filter coefficients and the products are added. The MAC principle is common in Digital Signal Processing algorithms.
The following expression describes the MAC operation:
$$y = h_{K-1}\,x_0 + h_{K-2}\,x_1 + \cdots + h_0\,x_{K-1}$$
i.e.
$$y = \sum_{k=0}^{K-1} h_{K-1-k}\, x_k$$
Note a few points:

$h = [h_0, h_1, h_2, \ldots, h_{K-1}]$ is a vector of constant values
$x = [x_0, x_1, x_2, \ldots, x_{K-1}]$ is a vector of input variables
Each $h_k$ is of M bits
Each $x_k$ is of N bits
y should be large enough to accommodate the result

A numerical example:
$$h = [32,\ 45,\ 78,\ 23], \qquad x = [42,\ 20,\ -22,\ 67] \qquad (K = 4)$$
$$y = 32 \times 42 + 45 \times 20 + 78 \times (-22) + 23 \times 67$$
$$y = 1344 + 900 - 1716 + 1541 = 2069$$
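The same multiply-and-accumulate arithmetic can be checked with a short Python sketch. The operand pairs are the ones that appear in the worked arithmetic above; the function name `mac` is hypothetical, and the negative sign on 22 is inferred from the partial product 1716 being subtracted.

```python
def mac(h, x):
    """Multiply-and-accumulate: one multiplication and one addition per tap."""
    acc = 0
    for coeff, sample in zip(h, x):
        acc += coeff * sample
    return acc

# Operand pairs taken from the partial products 1344, 900, -1716 and 1541.
h = [32, 45, 78, 23]
x = [42, 20, -22, 67]
y = mac(h, x)
```

With K = 4 the loop performs four multiplications, which is exactly the cost that the distributed arithmetic approach in the following sections sets out to avoid.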
Fig 3.1. Block diagram of 1-tap filter using direct implementation.
Fig 3.2. Block diagram of 4-tap FIR filter using direct implementation.
The direct implementation follows the Multiply and Accumulate (MAC) operation: each filter coefficient is multiplied directly by the corresponding input variable, and the products are added to obtain the final result. In a 1-tap filter, the filter coefficient h0 is multiplied by the variable x0 and the product is assigned to the output. In a 4-tap filter, the four filter coefficients are multiplied by their corresponding variables, and the outputs of the four multipliers are added and assigned to the result. This method requires four multipliers, which consume many resources. To reduce resource utilization and improve speed we use the Distributed Arithmetic (DA) algorithm, which is a multiplier-less architecture.
3.3. IMPLEMENTING FIR FILTER USING DISTRIBUTED ARITHMETIC:
Distributed arithmetic is a bit-level rearrangement of a multiply-accumulate to avoid the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate that is well suited to FPGA designs. It can also be extended to other sum functions such as complex multiplies, Fourier transforms and so on. In most of the multiply-accumulate applications in signal processing, one of the multiplicands for each product is a constant. Usually each multiplication uses a different constant. Using our most compact multiplier, the scaling accumulator, we can construct a multiple-product-term parallel multiply-accumulate function in a relatively small space if we are willing to accept a serial input. In this case, we feed four parallel scaling accumulators with unique serialized data. Each multiplies that data by a possibly unique constant, and the resulting products are summed in an adder tree as shown below.
Fig 3.3. 4-tap FIR filter using DA algorithm.
If we stop to consider that the scaling accumulator multiplier is really just a sum of vectors, then it becomes obvious that we can rearrange the circuit. Here, the adder tree combines the 1-bit partial products before they are accumulated by the scaling accumulator. All we have done is rearrange the order in which the 1xN partial products are summed. Now, instead of individually accumulating each partial product and then summing the results, we postpone the accumulate function until after we have summed all the 1xN partials at a particular bit time. This simple rearrangement of the order of the adds has effectively replaced N multiplies followed by an N-input add with a series of N-input adds followed by a multiply. This arithmetic manipulation directly eliminates N-1 adders in an N-product-term multiply-accumulate function. For larger numbers of product terms, the savings become significant.
Fig 3.4. Block diagram of 4-tap filter using LUT-less algorithm.
Further hardware savings are available when the coefficients Cn are constants. If that is true, then the adder tree shown above becomes a Boolean logic function of the 4 serial inputs. The combined 1xN products and adder tree is reduced to a four input look up table. The sixteen entries in the table are sums of the constant coefficients for all the possible serial input combinations. The table is made wide enough to accommodate the largest sum without overflow. Negative table values are sign extended to the width of the table, and the input to the scaling accumulator should be sign extended to maintain negative sums.
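The look-up table and scaling accumulator described above can be modelled in Python as a behavioural sketch. It is not the project's hardware description: the function names, the 8-bit word length and the coefficient and sample values are assumptions chosen for illustration, and the accumulator here takes the most significant bit first and shifts left, which is one of several equivalent orderings.

```python
def build_lut(coeffs):
    """Table of 2**K entries: entry `addr` is the sum of the coefficients
    whose bit is set in `addr` (bit k of addr selects coeffs[k])."""
    K = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(2 ** K)]

def da_sum_of_products(coeffs, samples, nbits):
    """Bit-serial distributed arithmetic for nbits-wide two's complement samples."""
    lut = build_lut(coeffs)
    mask = (1 << nbits) - 1
    words = [s & mask for s in samples]      # two's complement bit patterns
    acc = 0
    for n in range(nbits - 1, -1, -1):       # most significant bit first
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> n) & 1) << k      # one bit from every serial input
        term = lut[addr]
        if n == nbits - 1:                   # the sign bit carries negative weight
            term = -term
        acc = 2 * acc + term                 # shift the running sum, then add
    return acc

coeffs = [32, 45, 78, 23]      # hypothetical constant coefficients
samples = [42, 20, -22, 67]    # hypothetical 8-bit signed inputs
result = da_sum_of_products(coeffs, samples, nbits=8)
```

The loop contains only table reads, additions and doublings (shifts); no multiplication of a coefficient by a sample is performed, yet `result` equals the direct sum of products for the same operands.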
Fig 3.5. Block diagram which explains MUX operations.
Obviously the serial inputs limit the performance of such a circuit. As with most hardware applications, we can obtain more performance by using more hardware. In this case, more than one bit sum can be computed at a time by duplicating the LUT and adder tree as shown here. The second bit computed will have a different weight than the first, so some shifting is required before the bit sums are combined. In this 2-bit-at-a-time implementation, the odd bits are fed to one LUT and adder tree, while the even bits are simultaneously fed to an identical tree. The odd-bit partials are left shifted to properly weight the result and added to the even partials before accumulating the aggregate. Since two bits are taken at a time, the scaling accumulator has to shift the feedback by 2 places.
Fig 3.6. Block diagram which explains MUX operations for a larger number of inputs.
This paralleling scheme can be extended to compute more than two bits at a time. In the extreme case, all input bits can be computed in parallel and then combined in a shifting adder tree. No scaling accumulator is needed in this case, since the output from the adder tree is the entire sum of products. This fully parallel implementation has a data rate that matches the serial clock, which can be greater than 100 MS/s in today's FPGAs.
Fig 3.7. Block diagram which explains shifting and addition operations.
Most often, we have more than 4 product terms to accumulate. Increasing the size of the LUT might look attractive until you consider that the LUT size grows exponentially. Considering the construction of the logic we stuffed into the LUT, it becomes obvious that we can combine the results from the LUTs in an adder tree. The area of the circuit grows by roughly $2n-1$ when adder trees are used to expand it, rather than the $2^n$ growth experienced by increasing the LUT size. For FPGAs, the most efficient use of the logic occurs when we use the natural LUT size (usually a 4-LUT, although an 8-LUT would make sense if we were using an 8-input block RAM) for the LUTs and then add the outputs of the LUTs together in an adder tree, as shown below:
Fig 3.8. Block diagram of 8-tap FIR filter.
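The trade-off between one large table and several small tables joined by an adder can be illustrated with a short Python sketch. The helper names `lut_words` and `partial_sum`, the 8-tap coefficient values and the example bit slice are hypothetical.

```python
def lut_words(taps, lut_inputs=None):
    """Table words needed: one 2**taps table, or several 2**lut_inputs tables."""
    if lut_inputs is None:
        return 2 ** taps
    tables = -(-taps // lut_inputs)          # ceiling division
    return tables * 2 ** lut_inputs

def partial_sum(coeffs, bits):
    """What one table returns: the sum of the coefficients whose input bit is 1."""
    return sum(c for c, b in zip(coeffs, bits) if b)

# Hypothetical 8-tap filter and one bit slice taken across its eight inputs.
coeffs = [3, -1, 4, 1, -5, 9, 2, -6]
bits = [1, 0, 1, 1, 0, 1, 0, 1]

single = partial_sum(coeffs, bits)                       # one 8-input table
split = (partial_sum(coeffs[:4], bits[:4])               # first 4-input table
         + partial_sum(coeffs[4:], bits[4:]))            # second table, then an adder
```

For 8 taps, `lut_words(8)` gives 256 words for a single table while `lut_words(8, 4)` gives 32 words for two 4-input tables, and both arrangements return the same partial sum for every bit slice.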
3.4. MATHEMATICAL ANALYSIS OF DISTRIBUTED ARITHMETIC:

General equation of FIR filter is
y = hk xnk
k =0
n 1
-----1
Page 29
Let xk be a N-bits scaled twos complement number i.e.
| xk | < 1 xk: {bk0, bk1, bk2, bk(N-1) }

where bk0 is the sign bit We can express xk as xk = bk 0 + bkn 2 n
n =1 N 1
-----2
Now by substituting (2) in (1), we get

K N 1 y = Ak bk 0 + bkn 2 n k =1 n =1
y = ( bk 0 Ak ) + ( Ak bkn ) 2 n
k =1 k =1 n =1
K N 1
----3
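And now, expanding the inner sum over n in the second term of

$$y = -\sum_{k=1}^{K} ( b_{k0} A_k ) + \sum_{k=1}^{K} \left[ \sum_{n=1}^{N-1} ( b_{kn} A_k ) \, 2^{-n} \right]$$

we get

$$y = -\sum_{k=1}^{K} ( b_{k0} A_k ) + \sum_{k=1}^{K} \left[ ( A_k b_{k1} ) 2^{-1} + ( A_k b_{k2} ) 2^{-2} + \cdots + ( A_k b_{k(N-1)} ) 2^{-(N-1)} \right]$$

Now by expanding the sums over k we get the following equation

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ ( b_{11} A_1 ) 2^{-1} + ( b_{12} A_1 ) 2^{-2} + \cdots + ( b_{1(N-1)} A_1 ) 2^{-(N-1)} \right] \\
& + \left[ ( b_{21} A_2 ) 2^{-1} + ( b_{22} A_2 ) 2^{-2} + \cdots + ( b_{2(N-1)} A_2 ) 2^{-(N-1)} \right] \\
& \;\; \vdots \\
& + \left[ ( b_{K1} A_K ) 2^{-1} + ( b_{K2} A_K ) 2^{-2} + \cdots + ( b_{K(N-1)} A_K ) 2^{-(N-1)} \right]
\end{aligned}$$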
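By collecting the terms that share a common power of two, we can rearrange the equation in the following fashion.

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ ( b_{11} A_1 ) + ( b_{21} A_2 ) + \cdots + ( b_{K1} A_K ) \right] 2^{-1} \\
& + \left[ ( b_{12} A_1 ) + ( b_{22} A_2 ) + \cdots + ( b_{K2} A_K ) \right] 2^{-2} \\
& + \cdots \\
& + \left[ ( b_{1(N-1)} A_1 ) + ( b_{2(N-1)} A_2 ) + \cdots + ( b_{K(N-1)} A_K ) \right] 2^{-(N-1)}
\end{aligned}$$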

Finally the equation is reduced in the following way.

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{n=1}^{N-1} \left[ b_{1n} A_1 + b_{2n} A_2 + \cdots + b_{Kn} A_K \right] 2^{-n} \tag{4}$$

Equation (4) is the final formula of distributed arithmetic.
For ROM construction, equation (4) is regrouped in the following fashion.

$$y = \sum_{k=1}^{K} A_k ( -b_{k0} ) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$

The bracketed term $\sum_{k=1}^{K} A_k b_{kn}$ has only $2^K$ possible values, i.e.

$$\sum_{k=1}^{K} A_k b_{kn} = f_n ( b_{1n}, b_{2n}, \ldots, b_{Kn} ) \tag{5}$$

Equation (5) can be pre-calculated for all possible values of $b_{1n}, b_{2n}, \ldots, b_{Kn}$. We can store these values in a look-up table of $2^K$ words addressed by the K bits $b_{1n} b_{2n} \ldots b_{Kn}$.
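The derivation can be checked numerically. The sketch below is an illustrative Python model, assuming exact rational arithmetic and hypothetical coefficient and sample values. It builds the table of pre-calculated sums from equation (5) and evaluates the output by subtracting the sign-bit term and accumulating the remaining terms weighted by 2^-n, as in equation (4). The names `to_bits`, `build_table` and `da_sum` are illustrative only.

```python
from fractions import Fraction

def to_bits(x, n_bits):
    """Bits [b0, b1, ..., b_(N-1)] of a two's complement fraction; b0 is the sign bit."""
    scaled = Fraction(x) * (1 << (n_bits - 1))
    half = 1 << (n_bits - 1)
    if scaled.denominator != 1 or not -half <= scaled < half:
        raise ValueError("x is not representable in n_bits")
    word = int(scaled) & ((1 << n_bits) - 1)
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def build_table(A):
    """Equation (5): the 2**K pre-calculated sums; address bit k-1 holds b_kn."""
    K = len(A)
    return [sum((A[k] for k in range(K) if (addr >> k) & 1), Fraction(0))
            for addr in range(1 << K)]

def da_sum(A, xs, n_bits):
    """Equation (4): sign-bit term subtracted, other terms weighted by 2**-n."""
    table = build_table(A)
    bits = [to_bits(x, n_bits) for x in xs]

    def address(n):
        return sum(bits[k][n] << k for k in range(len(A)))

    y = -table[address(0)]
    for n in range(1, n_bits):
        y += table[address(n)] / (1 << n)
    return y

# Hypothetical coefficients A_k and 4-bit samples x_k with |x_k| < 1.
A = [Fraction(1, 2), Fraction(-1, 4), Fraction(3, 8), Fraction(1, 8)]
xs = [Fraction(3, 8), Fraction(-5, 8), Fraction(7, 8), Fraction(-3, 4)]
assert da_sum(A, xs, 4) == sum(a * x for a, x in zip(A, xs))
```

The table has 2^K = 16 words for K = 4, and no multiplication of a coefficient by a sample appears anywhere in `da_sum`.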
3.5. Block Diagram of FIR filter using DA algorithm:

In our project we design a 4-tap FIR filter. The original LUT-based DA implementation of the FIR filter is shown in the following figure.
Fig 3.9. Block diagram of 4-tap FIR filter using DA-based algorithm.
The block diagram of the LUT-based DA implementation of the FIR filter consists of three units: the shift register unit, the DA-LUT unit and the adder/shifter unit. The four input signals, each four bits wide, are loaded into parallel-in serial-out (PISO) shift registers. The output of each PISO register is a single bit. The pre-calculated sums of the filter coefficients are stored in the look-up table, and the four PISO output bits together select one value from it. The output of the look-up table is given to the adder and shifter unit, which adds this value to the left-shifted previous output and passes the result to the output. This process is repeated for four clock cycles, after which the required output is available.
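The data path described above can be modelled in a few lines. This is a behavioural Python sketch, not the VHDL of the project; it assumes unsigned 4-bit inputs shifted out MSB first and integer coefficients, and it does not model the 16-bit width of the hardware adder. The names `piso_bits` and `da_fir_output` are hypothetical.

```python
def piso_bits(value, width=4):
    """Yield the bits of an unsigned value MSB first, as a PISO register would."""
    for i in reversed(range(width)):
        yield (value >> i) & 1

def da_fir_output(h, x, width=4):
    """One filter output sum(h[k] * x[k]) using a LUT and shift-accumulate only."""
    # DA-LUT unit: every possible sum of coefficients, addressed by one bit per tap.
    lut = [sum(h[k] for k in range(len(h)) if (addr >> k) & 1)
           for addr in range(1 << len(h))]
    # Shift register unit: one serial bit stream per input sample.
    streams = [piso_bits(v, width) for v in x]
    acc = 0
    for _ in range(width):                      # one iteration per clock cycle
        addr = sum(next(s) << k for k, s in enumerate(streams))
        acc = (acc << 1) + lut[addr]            # adder/shifter unit
    return acc

h = [1, 2, 3, 4]        # hypothetical coefficients
x = [5, 0, 15, 9]       # hypothetical 4-bit input samples
assert da_fir_output(h, x) == sum(hk * xk for hk, xk in zip(h, x))
```

After four iterations the accumulator holds the same value that four multiplications and three additions would produce.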
3.5.1. Shift Register unit:
A serial-in/parallel-out shift register is similar to the serial-in/serial-out shift register in that it shifts data into internal storage elements and shifts data out at the serial-out (data-out) pin. It is different in that it makes all the internal stages available as outputs. Therefore, a serial-in/parallel-out shift register converts data from serial format to parallel format. If four data bits are shifted in by four clock pulses via a single wire at data-in, the data becomes available simultaneously on the four outputs QA to QD after the fourth clock pulse.
Fig 3.10. Serial-in/parallel-out shift register with 4 stages.
The practical application of the serial-in/parallel-out shift register is to convert data from serial format on a single wire to parallel format on multiple wires. For example, the four outputs (QA QB QC QD) could be used to illuminate four LEDs (Light Emitting Diodes).
Fig 3.11. Serial-in/parallel-out shift register in detail.
The above details of the serial-in/parallel-out shift register are fairly simple. It looks like a serial-in/serial-out shift register with taps added to each stage output. Serial data shifts in at SI (Serial Input). After a number of clocks equal to the number of stages, the first data bit in appears at SO (QD) in the above figure. In general, there is no SO pin. The last stage (QD above) serves as SO and is cascaded to the next package if it exists. Note that serial-in/serial-out shift registers come in greater than 8-bit lengths, from 18 to 64 bits. It is not practical to offer a 64-bit serial-in/parallel-out shift register requiring that many output pins. See the waveforms below for the above shift register.
Fig 3.12. Serial-in/parallel-out register waveforms.
The shift register has been cleared prior to any data by CLR', an active low signal, which clears all type D flip-flops within the shift register. Note the serial data pattern 1011 presented at the SI input. This data is synchronized with the clock CLK. This would be the case if it is being shifted in from something like another shift register, for example, a parallel-in/serial-out shift register (not shown here). On the first clock at t1, the data 1 at SI is shifted from D to Q of the first shift register stage. After t2 this first data bit is at QB. After t3 it is at QC. After t4 it is at QD. Four clock pulses have shifted the first data bit all the way to the last stage QD. The second data bit, a 0, is at QC after the 4th clock. The third data bit, a 1, is at QB. The fourth data bit, another 1, is at QA. Thus, the serial data input pattern 1011 is contained in (QD QC QB QA). It is now available on the four outputs.
It will be available on the four outputs from just after clock t4 to just before t5. This parallel data must be used or stored between these two times, or it will be lost as it is shifted out of the QD stage on the following clocks t5 to t8, as shown above.
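The waveform description can be reproduced with a small behavioural model. The following Python sketch assumes a register cleared by CLR' before shifting and represents the four stages as a list [QA, QB, QC, QD]; the function name `sipo_shift` is an illustrative choice.

```python
def sipo_shift(serial_bits, stages=4):
    """Shift one bit in per clock; return [QA, QB, QC, QD] after each clock."""
    q = [0] * stages                 # all stages cleared by CLR' beforehand
    history = []
    for bit in serial_bits:
        q = [bit] + q[:-1]           # QA takes SI, every stage passes its bit on
        history.append(list(q))
    return history

states = sipo_shift([1, 0, 1, 1])    # serial pattern 1011, first bit first
for t, (qa, qb, qc, qd) in enumerate(states, start=1):
    print(f"after t{t}: QA={qa} QB={qb} QC={qc} QD={qd}")

# After t4 the pattern is held in (QD QC QB QA) = 1 0 1 1.
qa, qb, qc, qd = states[-1]
assert (qd, qc, qb, qa) == (1, 0, 1, 1)
```

A fifth clock would push the first bit out of QD, which is why the parallel word has to be captured between t4 and t5.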
3.5.2. Look Up Table unit:

Binary data is stored in solid-state memory devices. The storage "cells" within these devices are easily addressed by driving the "address" lines of the device with the proper binary value(s). Suppose we had a ROM circuit written, or programmed, with certain data, such that the address lines of the ROM served as inputs and the data lines of the ROM served as outputs, generating the characteristic response of a particular logic function. Theoretically, we could program this ROM chip to emulate whatever logic function we wanted without having to alter any wire connections or gates. Consider the following example of a 4 x 2 bit ROM (a very small memory!) programmed with the functionality of a half adder:
Fig 3.13. Functionality of Half Adder.
If this ROM has been written with the above data (representing a half-adder's truth table), driving the A and B address inputs will cause the respective memory cells in the ROM chip to be enabled, thus outputting the corresponding data as the Σ (Sum) and Cout bits. Unlike the half-adder circuit built of gates or relays, this device can be set up to perform any logic function at all with two inputs and two outputs, not just the half-adder function. To change the logic function, all we would need to do is write a different table of data to another ROM chip. We could even use an EPROM chip which could be re-written at will, giving the ultimate flexibility in function.
It is vitally important to recognize the significance of this principle as applied to digital circuitry. Whereas the half-adder built from gates or relays processes the input bits to arrive at a specific output, the ROM simply remembers what the outputs should be for any given combination of inputs. This is not much different from the "times tables" memorized in grade school: rather than having to calculate the product of 5 times 6 (5 + 5 + 5 + 5 + 5 + 5 = 30), school-children are taught to remember that 5 x 6 = 30, and then expected to recall this product from memory as needed. Likewise, rather than the logic function depending on the functional arrangement of hard-wired gates or relays (hardware), it depends solely on the data written into the memory (software). Such a simple application, with definite outputs for every input, is called a look-up table, because the memory device simply "looks up" what the output(s) should be for any given combination of input states.
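The look-up principle can be illustrated with a small table. The Python sketch below models the 4 x 2 bit ROM as a dictionary holding the standard half-adder truth table and compares it with the gate-level equations; the names `HALF_ADDER_ROM` and `rom_read` are assumptions made for this example.

```python
# 4 x 2 bit ROM: address lines (A, B) select a stored word (Sum, Cout).
HALF_ADDER_ROM = {
    (0, 0): (0, 0),
    (0, 1): (1, 0),
    (1, 0): (1, 0),
    (1, 1): (0, 1),
}

def rom_read(rom, a, b):
    """Return the stored word; nothing is computed, the result is only recalled."""
    return rom[(a, b)]

# The ROM agrees with the gate-level half adder (XOR for Sum, AND for Cout).
for a in (0, 1):
    for b in (0, 1):
        assert rom_read(HALF_ADDER_ROM, a, b) == (a ^ b, a & b)

# Writing a different table changes the function without changing any "wiring".
AND_OR_ROM = {(a, b): (a & b, a | b) for a in (0, 1) for b in (0, 1)}
print(rom_read(HALF_ADDER_ROM, 1, 1), rom_read(AND_OR_ROM, 1, 0))
```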
3.5.3. Adder and Shifter unit:

The adder and shifter unit consists mainly of two blocks, a shifter and an accumulator. The input to the adder and shifter unit is the output of the LUT. This input is added to the left-shifted previous output and the sum is assigned to the output. Here we use a 16-bit adder. This process is repeated k times to obtain the final output, where k is the number of input bits. We designed a 4-tap filter whose inputs are 4 bits wide, so it requires four clock cycles to get the required output. The adder and shifter unit is what eliminates the multiplication, replacing it with a shift and accumulate process.
3.6. Proposed Distributed Arithmetic:

In the original LUT-based DA implementation of the FIR filter, each entry in the lower half of the LUT is the corresponding entry of the upper half plus h[3]. The lower half contains the locations where b3 = 1 and the upper half the locations where b3 = 0. To avoid this wastage of memory we use the proposed DA, where the LUT size is reduced by half at the cost of an additional 2-to-1 multiplexer and a full adder, as shown in the following figure. The fourth input bit, b3, is given to the multiplexer: if b3 is one, h[3] is the output of the multiplexer, and if b3 is zero, zero is the output of the multiplexer. The output of the multiplexer is added to the output of the LUT and the sum is then given to the adder/shifter unit.
Fig 3.14. Block diagram of 4-tap FIR filter using proposed DA algorithm.
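The equivalence between the full 16-word LUT and the halved LUT with a multiplexer can be verified exhaustively. This is a minimal Python sketch with hypothetical coefficient values; `full_lut_output` and `half_lut_output` are illustrative names, and the address bit order (b0 as least significant bit) is an assumption.

```python
def full_lut_output(h, b):
    """Original DA: one 16-word LUT addressed by (b0, b1, b2, b3)."""
    lut16 = [sum(h[k] for k in range(4) if (addr >> k) & 1) for addr in range(16)]
    return lut16[b[0] | (b[1] << 1) | (b[2] << 2) | (b[3] << 3)]

def half_lut_output(h, b):
    """Proposed DA: 8-word LUT on (b0, b1, b2) plus a 2-to-1 mux driven by b3."""
    lut8 = [sum(h[k] for k in range(3) if (addr >> k) & 1) for addr in range(8)]
    mux = h[3] if b[3] else 0                     # multiplexer output
    return lut8[b[0] | (b[1] << 1) | (b[2] << 2)] + mux   # extra adder

h = [7, -3, 5, 2]                                 # hypothetical coefficients
for addr in range(16):
    bits = [(addr >> k) & 1 for k in range(4)]
    assert half_lut_output(h, bits) == full_lut_output(h, bits)
```

The stored table shrinks from 16 words to 8, while one multiplexer and one adder are added in front of the adder/shifter unit.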
3.7. LUT-less Distributed Arithmetic:

The LUT reduction procedure discussed above can be applied repeatedly to obtain a LUT-less DA architecture. The LUT-less DA architecture is shown below:
Fig 3.15. Block diagram of 4-tap FIR filter using LUT less algorithm.
In this procedure all LUTs are replaced by multiplexers and full adders, so that the memory usage is eliminated completely. The output of each parallel-in serial-out register is given to a multiplexer: if the bit is one, the respective constant (filter coefficient) is selected, otherwise zero is selected. The outputs of the multiplexers are added and the sum is given to the adder/shifter unit of the filter.
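A behavioural model of the LUT-less data path needs no table at all. The Python sketch below assumes unsigned 4-bit inputs processed MSB first and hypothetical coefficients; the function name `lutless_fir_output` is illustrative and the 16-bit adder width is not modelled.

```python
def lutless_fir_output(h, x, width=4):
    """Compute sum(h[k] * x[k]) with one multiplexer per tap and no LUT."""
    acc = 0
    for i in reversed(range(width)):              # MSB first, one bit per clock
        # Each multiplexer passes its coefficient when the PISO bit is 1, else 0.
        mux_outputs = [h[k] if (x[k] >> i) & 1 else 0 for k in range(len(h))]
        acc = (acc << 1) + sum(mux_outputs)       # adders, then adder/shifter unit
    return acc

h = [1, 2, 3, 4]        # hypothetical coefficients
x = [5, 0, 15, 9]       # hypothetical 4-bit input samples
assert lutless_fir_output(h, x) == sum(hk * xk for hk, xk in zip(h, x))
```

The memory is traded for a chain of adders, so the combinational path per clock cycle becomes longer.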
3.8. VLSI implementation methods:

At the engineering level, digital VLSI chips are classified by the approach used to implement the circuit. Several design styles can be considered for chip implementation of specified algorithms or logic functions. Each design style has its own merits and demerits, and thus a proper choice has to be made by designers in order to provide the functionality at low cost.
3.8.1. PLD (PROGRAMMABLE LOGIC DEVICE):

PLDs are standard ICs that are available in standard configurations from a catalog of parts and are sold in very high volume to many different customers. PLDs may be configured or programmed to create a part customized to a specified application, and so they also belong to the family of ASICs. PLDs use different technologies to allow programming of the device. There are four types of PLDs:
1. Programmable Logic Array (PLA)
2. Programmable Array Logic (PAL)
3. Complex Programmable Logic Device (CPLD)
4. Field Programmable Gate Array (FPGA)
1. Programmable Logic Array (PLA):

A Programmable Logic Array is a small PLD that contains two levels of logic, an AND-plane and an OR-plane, where both levels are programmable.
2. Programmable Array Logic (PAL):

A Programmable Array Logic is a small PLD that has a programmable AND-plane followed by a fixed OR-plane.
3. Complex Programmable Logic Device (CPLD):

A Complex Programmable Logic Device is a PLD that consists of an arrangement of multiple PLA/PAL-like blocks on a single chip.
4. Field Programmable Gate Array (FPGA):

A Field Programmable Gate Array is a PLD that offers a much higher logic capacity than a CPLD.
3.8.2. Features of PLD:

1. No customized mask layers or logic cells.
2. Fast design turnaround.
3. Single large blocks of programmable interconnect.
3.8.3. FPGA (Field Programmable Gate Array):
3.8.3.1. Introduction:
Field Programmable Gate Arrays are specific integrated circuits that can be easily programmed by the user. The FPGA contains versatile functions, configurable interconnects and an input/output interface to adapt to the user specification. FPGAs allow rapid prototyping using custom logic structures, and are very popular for limited-production products. Modern FPGAs are extremely dense, with a complexity of several million gates, which enables the emulation of very complex hardware such as parallel microprocessors or a mixture of processors and signal processing. One key advantage of FPGAs is their ability to be reprogrammed, in order to create completely different hardware by modifying the logic gate array. FPGAs exist not only as simple components, but also as CPU ram-blocks in system-on-chip designs. An FPGA consists of slices, where each slice consists of 2 look-up tables and 2 D flip-flops.
3.8.3.2. Look Up Table (LUT):

A Look Up Table (LUT) is a one-bit wide memory array, where the address lines of the memory are the inputs of the logic block and the one-bit output from the memory is the input for the next block. A LUT with n inputs corresponds to a 2^n x 1-bit memory and can realize any logic function of its n inputs by programming the logic function's truth table directly into the memory.
Fig 3.16 LUTs in FPGA.
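The idea that an n-input LUT is a 2^n x 1-bit memory holding a truth table can be sketched as follows. This Python model is only an illustration; the names `program_lut` and `lut_read`, the input bit ordering, and the example majority function are assumptions, not part of any FPGA vendor tool.

```python
def program_lut(func, n):
    """Return the 2**n one-bit memory contents for an n-input Boolean function."""
    memory = []
    for addr in range(1 << n):
        inputs = [(addr >> i) & 1 for i in range(n)]   # input i is address bit i
        memory.append(func(*inputs) & 1)
    return memory

def lut_read(memory, *inputs):
    """Drive the address lines with the inputs and return the stored bit."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return memory[addr]

# Hypothetical example: a 4-input LUT programmed as "at least three inputs high".
majority = program_lut(lambda a, b, c, d: int(a + b + c + d >= 3), 4)
assert len(majority) == 16                  # 2**4 x 1 bit
assert lut_read(majority, 1, 1, 1, 0) == 1
assert lut_read(majority, 1, 0, 0, 1) == 0
```

Reprogramming the same 16 bits with another truth table yields a different logic function with no change to the surrounding structure.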
3.8.3.3. Classification of FPGA:

FPGAs are classified based on switching technology.
1. SRAM based FPGAs
2. ANTIFUSE based FPGAs
XILINX and ALTERA are the leading manufacturers of SRAM based FPGAs. ACTEL, QUICKLOGIC and CYPRESS are the leading manufacturers of ANTIFUSE based FPGAs.
3.8.3.4. FPGA Design flow:

The first step in implementing a design on an FPGA is the system specification. Specifications refer to the kinds of inputs and outputs and the range of values that the kit can accept. Based on these system specifications we move on to the next step, the architecture, which describes the interconnections between all the blocks involved in our design. Each block in the architecture, along with its interconnections, is modeled in either VHDL or Verilog, depending on our preference. All these blocks are then simulated and the outputs are verified for correct functioning. From simulation we move to the next step, synthesis. This is a very important step in knowing whether our design can be implemented on an FPGA kit or not. Synthesis converts our VHDL code into its functional components, which are vendor specific. After performing synthesis we can look at the RTL schematic and the technology schematic. We can also see the timing delays that will be present in the FPGA if the design is implemented on it. Place and route is the next step, in which the tool places all the components on the FPGA die for optimum performance in terms of both area and speed. The interconnections that will be made are also visible in this part of the implementation flow. In the post place and route simulation step, the tool takes into account the actual delays that will be present in the implementation on the FPGA kit and performs the simulation with them. Delays here mean electrical loading effects, wiring delays and stray capacitances.
1. System specifications (initials)
2. Architecture (block diagram)
3. VHDL module (coding)
4. Simulation (functional verification)
5. Synthesis (generating a netlist)
6. Place and route (placing on the FPGA die and interconnections)
7. Post place and route simulation (timing verification)
8. Generating bit map file (configuring the FPGA)
9. Download on to FPGA
Fig 3.17. FPGA implementation design flow.
After post place and route comes generation of the bit-map file, which means converting the VHDL code into bit streams that are used to configure the FPGA kit. A bit file is generated after we perform this step. The final step is downloading the bit-map file onto the FPGA board, which is done by connecting the computer to the FPGA board with a JTAG (Joint Test Action Group) cable; JTAG is an IEEE standard. The bit-map file contains the whole design, which is placed on the FPGA die; the outputs can now be observed on the FPGA LEDs or multiplexed seven-segment displays. This step completes the whole process of implementing our design on an FPGA.
3.8.3.5. Characteristics of FPGA:
1. None of the mask layers are customized.
2. There is a method for programming the basic logic cells and the interconnect.
3. The core is a regular array of programmable basic logic cells that can implement combinational as well as sequential logic (flip-flops).
4. A matrix of programmable interconnect surrounds the basic logic cells.
5. Programmable I/O cells surround the core.
6. Design turnaround is a few hours.
3.9. Applications of FPGA:
1. Device controllers
2. Random logic
3. Emulation of hardware
4. Integrating multiple SPLDs
3.10. Conclusion:
This chapter discussed design methodologies, the implementation of the FIR filter using different methods, and the FPGA design flow.
4. DESIGN ANALYSIS
4.1. INTRODUCTION:
After implementation of the design, the next step is to analyze it. This chapter discusses the analysis of the design after implementing the FIR filter with all four algorithms. The flow chart of the algorithm is presented, and the synthesis and simulation reports are discussed in view of the performance of the design.
4.2. Flow chart:
Fig 4.1. Flow chart.

4.3. Simulation:
4.3.1. Direct implementation of FIR filter:
Fig 4.2. Simulation report of direct implementation of FIR filter.
In direct implementation we use the Multiply and Accumulate (MAC) operation. The filter coefficients are taken as constants and the input samples as variables, and the samples are multiplied by the coefficients according to the FIR equation. The output is obtained in one clock cycle, but the combinational delay is larger. Here the clock period is 20 ns. For the first 20 ns reset is one, so the output is zero.
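For comparison with the DA variants, the direct MAC form can be written as a short reference model. This Python sketch assumes that samples before the first input are zero and uses hypothetical coefficients; `fir_mac` is an illustrative name and the model does not represent clocking or reset.

```python
def fir_mac(h, samples):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n-k], with x[m] = 0 for m < 0."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, coeff in enumerate(h):
            if n - k >= 0:
                acc += coeff * samples[n - k]    # one multiply-accumulate per tap
        outputs.append(acc)
    return outputs

h = [1, 2, 3, 4]                                 # hypothetical 4-tap coefficients
# An impulse at the input reproduces the coefficients at the output.
assert fir_mac(h, [1, 0, 0, 0, 0]) == [1, 2, 3, 4, 0]
print(fir_mac(h, [5, 0, 15, 9]))
```

Every output sample costs one multiplication per tap, which is the resource cost that the DA-based structures avoid.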
4.3.2. One tap filter:
Fig 4.3. Simulation report of one-tap filter.
We apply the input to the Parallel In Serial Out (PISO) register. After each clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSB of the input from the PISO register. Based on the output of the PISO we select a value from the Look Up Table (LUT). This value is added to the left-shifted previous output, and the result is given to the output.
For the one-tap filter we have one filter coefficient and one 4-bit input. We store the pre-calculated values (in this case h0 and 0000) in the ROM and give the input to the PISO register. At first reset is one, so the output is zero. After reset becomes zero, on the clk event we load the input port with the input. During the first clock cycle we get the MSB of the input from the PISO register. Based on this bit we decide whether to select h0 or 0000: if the bit from the PISO is zero we select 0000, and if it is one we select h0. The output of the LUT is added to the left-shifted previous output. During the first clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the output of the LUT. During the second clock cycle we get the second MSB from the PISO; based on this we select a value from the LUT, which is added to the left-shifted previous output. After four clock cycles we get the required output.
4.3.3. Implementation of FIR filter using DA algorithm:
Fig 4.4. Simulation report of DA algorithm based FIR filter.
We apply the inputs to the Parallel In Serial Out (PISO) registers. After each clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSBs of the inputs from the PISO registers. Based on the outputs of the PISO registers we select a value from the Look Up Table (LUT). This value is added to the left-shifted previous output, and the result is given to the output.
For the 4-tap filter we have four filter coefficients of 16 bits each and four 4-bit inputs. We store the pre-calculated values in the ROM and give the inputs to the PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on the clk event we load the input ports with the inputs. During the first clock cycle we get the MSBs of all four inputs from the PISO registers. Based on the outputs of the PISO registers we select a value from the LUT. The output of the LUT is added to the left-shifted previous output. During the first clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the output of the LUT. During the second clock cycle we get the second MSBs from the PISO registers; based on these we select a value from the LUT, which is added to the left-shifted previous output. After four clock cycles we get the required output.
4.3.4. Implementation of FIR filter using Proposed DA algorithm:
Fig 4.5. Simulation report of proposed DA based FIR filter.
We apply the inputs to the Parallel In Serial Out (PISO) registers. After each clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSBs of the inputs from the PISO registers. The output of one PISO register is given to the MUX and the remaining three outputs from the other PISO registers are given to the 3-input LUT. Based on the outputs of these three PISO registers we select a value from the Look Up Table (LUT). The sum of the LUT and MUX outputs is added to the left-shifted previous output, and the result is given to the output.
For the 4-tap filter we have four filter coefficients of 16 bits each and four 4-bit inputs. We give the inputs to the PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on the clk event we load the input ports with the inputs. During the first clock cycle we get the MSBs of the inputs from the PISO registers. Three of them are given to the LUT and the remaining one is given to the MUX. The outputs of the MUX and the LUT are added, and this result is added to the left-shifted previous output. During the first clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the sum of the LUT and MUX outputs. During the second clock cycle we get the second MSBs from the PISO registers; based on these we select values from the LUT and the MUX, and their sum is added to the left-shifted previous output. After four clock cycles we get the required output.
4.3.5. Implementation of FIR filter using LUT-Less algorithm:
Fig 4.6. Simulation report of LUT-less based FIR filter.
We apply the inputs to the Parallel In Serial Out (PISO) registers. After each clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSBs of the inputs from the PISO registers. The output of each PISO register is fed to one of four MUXes. The outputs of all the MUXes are added and the result is given to the output.
For the 4-tap filter we have four filter coefficients of 16 bits each and four 4-bit inputs. We give one input to each of the four PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on the clk event we load the input ports with the inputs. During the first clock cycle we get the MSBs of the inputs from the PISO registers. The output of each PISO register is given to its own MUX. In each MUX we decide, based on this bit, whether to select the filter coefficient or 0000: if the bit from the PISO is zero we select 0000, and if it is one we select the corresponding filter coefficient. The outputs of the MUXes are added, and the result is added to the left-shifted previous output. During the first clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the sum of the MUX outputs. During the second clock cycle we get the second MSBs from the PISO registers; based on these we get values from the MUXes, which are added together, and the result is added to the left-shifted previous output. After four clock cycles we get the required output.
4.4. Synthesis report:
4.4.1. Direct implementation:
=========================================================================
*                       Advanced HDL Synthesis                          *
=========================================================================
Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# Multipliers                        : 7
 4x16-bit multiplier                 : 7
# Adders/Subtractors                 : 6
 20-bit adder                        : 6
# Registers                          : 80
 Flip-Flops                          : 80
=========================================================================
Timing Summary
=========================================================================
Speed Grade: -5
Minimum period: No path found
Minimum input arrival time before clock: 11.542ns
Maximum output required time after clock: 6.216ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
Total number of paths / destination ports: 77620 / 160
4.4.2. DA algorithm based FIR filter:
=========================================================================
*                       Advanced HDL Synthesis                          *
=========================================================================
Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# ROMs                               : 1
 16x16-bit ROM                       : 1
# Adders/Subtractors                 : 2
 16-bit adder                        : 1
 3-bit adder                         : 1
# Registers                          : 53
 Flip-Flops                          : 53
# Comparators                        : 4
 4-bit comparator not equal          : 4
=========================================================================
Timing Summary
Speed Grade: -5
Minimum period: 9.602ns (Maximum Frequency: 104.140MHz)
Minimum input arrival time before clock: 10.130ns
Maximum output required time after clock: 6.280ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 9.602ns (frequency: 104.140MHz)
Total number of paths / destination ports: 7762 / 68
4.4.3. Proposed DA algorithm:
=========================================================================
*                       Advanced HDL Synthesis                          *
=========================================================================
Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# ROMs                               : 1
 8x16-bit ROM                        : 1
# Adders/Subtractors                 : 3
 16-bit adder                        : 2
 3-bit adder                         : 1
# Registers                          : 53
 Flip-Flops                          : 53
# Comparators                        : 4
 4-bit comparator not equal          : 4
Timing Summary
=========================================================================
Speed Grade: -5
Minimum period: 10.188ns (Maximum Frequency: 98.154MHz)
Minimum input arrival time before clock: 10.672ns
Maximum output required time after clock: 6.280ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 10.188ns (frequency: 98.154MHz)
Total number of paths / destination ports: 8655 / 68
4.4.4. LUT-Less algorithm:
=========================================================================
*                       Advanced HDL Synthesis                          *
=========================================================================
Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# Adders/Subtractors                 : 5
 16-bit adder                        : 4
 3-bit adder                         : 1
# Registers                          : 53
 Flip-Flops                          : 53
# Comparators                        : 4
 4-bit comparator not equal          : 4
=========================================================================
Timing Summary
Speed Grade: -5
Minimum period: 12.615ns (Maximum Frequency: 79.268MHz)
Minimum input arrival time before clock: 13.055ns
Maximum output required time after clock: 6.280ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 12.615ns (frequency: 79.268MHz)
Total number of paths / destination ports: 64845 / 68
4.5. Conclusion:
In this chapter the simulation reports and synthesis reports were observed.
5. RESULT
This project comprises several chapters devoted to the design of an FIR filter using four different algorithms. Here the comparison tables are presented to give an overview of the resource usage and the timing. A simple overview of the project, including the scope, motivation and objectives, has been discussed. Finally an FIR filter is designed and the output is observed on the Spartan III FPGA kit.
Resource comparison table:

Resource           Specification      Direct           Distributed   Proposed DA   LUT-less
                                      implementation   arithmetic    algorithm     algorithm
ROM                16*16 bit          -                1             -             -
ROM                8*16 bit           -                -             1             -
Adder/subtractor   16-bit adder       -                1             2             4
Adder/subtractor   3-bit adder        -                1             1             1
Registers          Flip-flop          -                53            53            53
Comparator         4-bit comparator   -                4             4             4
Delay comparison:

Technique                 Time delay (ns)   Frequency (MHz)
Direct implementation     --                --
Distributed arithmetic    9.602             104.14
Proposed DA algorithm     10.188            98.154
LUT-less algorithm        12.615            79.268
6. CONCLUSION
This project presents the proposed DA architectures for the FIR filter. The architectures reduce the memory usage by half at every iteration of LUT reduction, at the cost of a limited decrease in the system frequency. We also divide high-order filters into several groups of small filters, and hence the LUT size can be reduced as well. To obtain a fast implementation of the FIR filter, the proposed DA algorithm is adopted.
We have successfully implemented a highly efficient 4-tap FIR filter, using both the original DA architecture and the proposed DA architecture, on a Spartan III FPGA kit. The results show that the proposed DA architecture is hardware-efficient for FPGA implementation.
7. FUTURE SCOPE
The speed of the filter can be further increased using the pipelining principle, where parallel processing can be implemented. Using pipelining, the filter can also be extended to higher orders. A 70-tap filter can be implemented using a symmetrical structure, which reduces it to 35 taps. The 35-tap filter can then be divided into 7 smaller filters of 5 taps each, where each 5-tap DA-LUT unit could be implemented by a 4-input LUT with an additional 2-to-1 multiplexer and a full adder. Thus our 4-tap design can be extended to 70 taps and to filters of even higher order.
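The arithmetic behind this extension can be sketched as follows. The Python model below assumes a symmetric 70-tap impulse response, h[k] = h[69-k], with hypothetical integer coefficients; it shows only the pre-addition (folding) and the grouping into 7 blocks of 5 taps, not the LUT and multiplexer structure of each block. The function name `folded_fir_output` is illustrative.

```python
def folded_fir_output(half_h, window, group=5):
    """One output of a symmetric FIR using half the coefficients, in groups of `group` taps."""
    n = len(window)
    assert n == 2 * len(half_h), "window must hold one sample per tap"
    # Symmetry: taps k and n-1-k share a coefficient, so their samples are pre-added.
    pre_added = [window[k] + window[n - 1 - k] for k in range(len(half_h))]
    # Each group of `group` taps would map to one small DA unit.
    partial_sums = [
        sum(half_h[k] * pre_added[k] for k in range(start, start + group))
        for start in range(0, len(half_h), group)
    ]
    return sum(partial_sums), len(partial_sums)

half_h = list(range(1, 36))                     # hypothetical 35 distinct coefficients
h = half_h + half_h[::-1]                       # full symmetric 70-tap response
window = [(3 * i) % 7 - 3 for i in range(70)]   # hypothetical 70 most recent samples

y, n_groups = folded_fir_output(half_h, window)
assert n_groups == 7
assert y == sum(hk * xk for hk, xk in zip(h, window))
```

Folding halves the number of coefficient terms, and the grouping keeps each table small because the partial sums are simply added.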
APPENDIX-A
Xilinx FPGA
Field Programmable Gate Arrays (FPGAs) are specific integrated circuits that can be easily programmed by the user. The FPGA contains versatile functions, configurable interconnects and an input/output interface to adapt to the user specifications. FPGAs allow rapid prototyping using custom logic structures, and are very popular for limited-production products. Modern FPGAs are extremely dense, with a complexity of several million gates, which enables the emulation of very complex hardware such as parallel microprocessors or a mixture of processors and signal processing. One key advantage of FPGAs is their ability to be reprogrammed, in order to create completely different hardware by modifying the logic gate array.
Advantages of FPGA:
1. Short turnaround time
2. Design independent
3. Flexibility
Classification of FPGAs:
The FPGAs are classified based on switching technology.
1. SRAM based FPGAs
2. ANTIFUSE based FPGAs
XILINX and ALTERA are the leading manufacturers of SRAM based FPGAs.
ACTEL, QUICKLOGIC and CYPRESS are the leading manufacturers of ANTIFUSE based FPGAs.
XILINX FPGA
Xilinx is a developer of FPGA and CPLD devices that are used in numerous applications within telecommunications, consumer, defense, and other fields. Xilinx offers device families for glue logic (CoolRunner, CoolRunner-II), low-cost (Spartan), and high-end (Virtex) applications. Xilinx also provides application-optimized FPGA series: LX (for logic), SX (for signal processing) and FX (for fully featured).
Xilinx develops IP (intellectual property) cores designed in HDL which allow designers to minimize time to market. These IP cores range from simple functions (such as BCD encoders, counters, etc.) to complex systems (such as multi-gigabit networking cores and custom embedded microcontrollers like the fully-featured Micro blaze soft microprocessor, and the compact Pico blaze microcontroller.) In addition, Xilinx Design Services (XDS) can create custom cores.
Xilinx offers Electronic Design Automation (EDA) tools for use with its devices. Chief among these is ISE, which offers a complete EDA flow. Domain-specific tools include Xilinx's Embedded Developer's Kit (EDK), which is aimed primarily at designers wishing to use the embedded PowerPC 405 core in the Virtex-II Pro and Virtex-4, or Xilinx's own soft microprocessor/microcontroller, in their designs. Other domain-specific tools include Xilinx's System Generator for DSP, which provides seamless simulation and implementation of high-performance DSP designs on Xilinx FPGAs. A design is implemented on a Xilinx FPGA using the device's basic logic blocks: the desired functionality is built from the components available in each logic block.
Basic Logic Block

In a Xilinx FPGA the basic blocks available for a design are the Configurable Logic Blocks (CLBs). Each CLB consists of slices, and each slice contains two LUTs and two D flip-flops.
Look-up Table (LUT):

The Look-up Table (LUT) is a one-bit-wide memory array in which the address lines of the memory are the inputs of the logic block and the one-bit output of the memory is the LUT output. A LUT with n inputs corresponds to a 2^n x 1 bit memory and can realize any logic function of its n inputs by programming the function's truth table directly into the memory. A LUT can therefore implement any combinational circuit that has a single output.
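The LUT behaviour described above can be modelled in a few lines of software. This is a minimal sketch rather than a description of the Xilinx hardware. The names make_lut and read_lut are hypothetical, and the choice of the first input as the most significant address bit is an assumption made for the example.

```python
from itertools import product

def make_lut(func, n_inputs):
    """Program a 2**n x 1-bit memory with the truth table of func.
    Rows are generated in address order 00..0 to 11..1."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]

def read_lut(lut, *inputs):
    """Use the input bits as the memory address (first input is the MSB)."""
    address = 0
    for bit in inputs:
        address = (address << 1) | (bit & 1)
    return lut[address]

# Example: the sum bit of a full adder, a 3-input, single-output function.
full_adder_sum = make_lut(lambda a, b, cin: a ^ b ^ cin, 3)

print(full_adder_sum)                    # [0, 1, 1, 0, 1, 0, 0, 1]
print(read_lut(full_adder_sum, 1, 0, 1)) # 0
```

Changing the function passed to make_lut changes only the stored bits, not the structure, which is why the same LUT hardware can realize any single-output function of its inputs.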
XILINX FPGA DESIGN FLOW:
The Xilinx FPGA design flow is shown in Figure A-1. The first step in implementing a design on an FPGA is the specification, which consists of the number of inputs, the number of outputs, and the range of values that the kit can accept. The architecture is designed on the basis of these specifications and describes the interconnections between all the blocks involved in the design.
Every block in the architecture, along with its interconnections, is modeled in either VHDL or Verilog, depending on the requirement. All these blocks are then simulated and the outputs are verified for correct functionality. Simulation can be done at various levels of abstraction: besides functional simulation, these include post-synthesis simulation and post place-and-route simulation. Simulation needs a set of test vectors to check the functionality.
Once the functional simulation is correct, the next step is synthesis. Synthesis converts the HDL description into a netlist, which gives information about the functional hardware elements. The synthesis step gives two views of the design, one technology independent and the other technology dependent. The technology-independent view gives the design information in terms of gates and other components that do not depend on any technology. The technology-dependent view gives the hardware information in terms of LUTs and other components that are specific to Xilinx technology.
Place and route is the next step, in which the tool places all the components on the FPGA die for optimum performance in terms of both area and speed. After the components are placed, the interconnections between them are routed. In the post place-and-route simulation step, the tool takes into account the actual delays in the design and performs the simulation with these delays included. The delays arise from electrical loading, wiring delays, and stray capacitances. After place and route, the bitmap file is generated: the implemented VHDL/Verilog design is converted into a bitstream that is used to configure the FPGA kit, and a .bit file is produced. The final step is downloading the bitmap file onto the FPGA board, which is done by connecting the computer to the board with a JTAG (Joint Test Action Group) cable. JTAG is an IEEE standard. The bitmap file contains the whole design that is to be used on the FPGA board.
APPENDIX-B
Spartan-3 Starter Kit Introduction:
The Xilinx Spartan-3 Starter Kit provides a low-cost, easy-to-use development and evaluation platform for Spartan-3 FPGA designs.
Figure-B-1 shows the Spartan-3 Starter Kit board, which includes the following components and features:
200,000-gate Xilinx Spartan-3 XC3S400 FPGA in a 256-ball thin Ball Grid Array package (XC3S400FT256) [1]
Table B-1 shows the device information of the Xilinx Spartan-3 FPGA.
Family Name        Device Name   Capacity        Package           Speed Grade
Xilinx Spartan-3   XC3S400       200,000 gates   Ball Grid Array   -4/-5
Table B-1: Device Information of Spartan-3
The components and capacity of the Xilinx Spartan-3 FPGA are shown below.
4,320 logic cell equivalents
Twelve 18K-bit block RAMs (216K bits)
Twelve 18x18 hardware multipliers
Four Digital Clock Managers (DCMs)
Up to 173 user-defined I/O signals
2Mbit Xilinx XCF02S Platform Flash, in-system programmable configuration PROM [2]
1Mbit non-volatile data or application code storage available after FPGA configuration
Jumper options allow FPGA application to read PROM data or FPGA configuration from other sources[3]
1M-byte of Fast Asynchronous SRAM (bottom side of board) [4]
Two 256Kx16 ISSI IS61LV25616AL-10T 10 ns SRAMs
Configurable memory architecture
- Single 256Kx32 SRAM array, ideal for MicroBlaze code images
- Two independent 256Kx16 SRAM arrays
Individual chip select per device
Individual byte enables
3-bit, 8-color VGA display port[5]
9-pin RS-232 Serial Port [6]
DB9 9-pin female connector (DCE connector)
Maxim MAX3232 RS-232 transceiver/translator[7]
Uses straight-through serial cable to connect to computer or workstation serial port
Second RS-232 transmit and receive channel available on board test points[8]
PS/2-style mouse/keyboard port[9]
Four-character, seven-segment LED display[10]
Eight slide switches[11]
Eight individual LED outputs[12]
Four momentary-contact push button switches[13]
50 MHz crystal oscillator clock source (as in Figure B-2) [14]
Socket for an auxiliary crystal oscillator clock source[15]
FPGA configuration mode selected via jumper settings[16]
Push button switch to force FPGA reconfiguration (FPGA configuration happens automatically at power-on)[17]
LED indicates when FPGA is successfully configured[18]
Three 40-pin expansion connection ports to extend and enhance the Spartan-3 Starter Kit Board[19][20][21]
See www.xilinx.com/s3board for compatible expansion cards
Compatible with Digilent, Inc. peripheral boards: https://digilent.us/Sales/boards.cfm#Peripheral
FPGA serial configuration interface signals available on the A2 and B1 connectors: PROG_B, DONE, INIT_B, CCLK
JTAG port [22] for low-cost download cable [23]
Digilent JTAG download/debugging cable connects to PC parallel port [23]
JTAG download/debug port compatible with the Xilinx Parallel Cable IV and MultiPRO Desktop Tool [24]
AC power adapter input for included international unregulated +5V power supply [25]
Power-on indicator LED [26].
On-board 3.3V [27], 2.5V [28], and 1.2V [29] regulators
Component Locations:
Figure B-2 indicates the component locations on the top side and bottom side of the board, respectively.
Figure B-2: XILINX Spartan-3 Starter Kit
