1. INTRODUCTION
1.1. INTRODUCTION:
A filter is a component that passes a certain band of frequencies and rejects other frequency components. Filters are basic components in Digital Signal Processing (DSP) applications. There are two types of digital filter: the Finite Impulse Response (FIR) filter and the Infinite Impulse Response (IIR) filter.
An FIR filter is a digital filter whose impulse response has a finite number of samples: the impulse response settles to zero after the final sample of the interval. In an IIR filter, by contrast, the impulse response extends over an infinite number of samples.
In this project we designed an FIR filter with fewer resources and less delay using the Distributed Arithmetic (DA) algorithm. If we use the direct method, i.e. Multiply and Accumulate (MAC), to implement the FIR filter, it consumes much area (resources) and is expensive to implement on an FPGA. DA, a multiplier-less architecture, overcomes this drawback and is a very efficient solution especially suited to LUT-based FPGA architectures.
The main problem of DA is that the LUT size increases exponentially with the order of the filter. To overcome this problem, a hardware-efficient DA architecture is used, which reduces the LUT size by modifying the architecture of the filter to achieve high performance.
1.2. FIR Filter:
The transfer function of an FIR filter has a single polynomial of coefficients. An FIR filter needs a much higher-order polynomial than an IIR filter to realize an equivalent response, which results in a longer delay.
$$H(z) = \frac{B(z)}{z^{N}}$$
N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side of its difference equation, and these are commonly referred to as taps. The difference equation can also be expressed as a convolution of the coefficient sequence $b_i$ with the input signal:
$$y[n] = \sum_{i=0}^{N} b_i\, x[n-i]$$
That is, the filter output is a weighted sum of the current and a finite number of previous values of the input.
1.3.
Five devices with 100K to 1.6M system gates
From 66 to 376 I/Os with package and density migration
Up to 648K bits of block RAM and up to 231K bits of distributed RAM
Up to 36 embedded 18x18 multipliers for high-performance DSP applications
Up to eight Digital Clock Managers
Easy-to-implement interfaces to DDR memory
Support for 18 common I/O standards, including PCI-X, mini-LVDS, and RSDS
1.4. CONCLUSION:
In this chapter we discussed about FIR filter and its block diagram. The Spartan-3 pro-
2. LITERATURE SURVEY
2.1 INTRODUCTION:
A signal carries information from a source to a destination, and there are different types of signals. Filters play an essential role in Digital Signal Processing (DSP). A filter is a system that passes certain frequency components and rejects others. Filters are designed to meet specifications for the desired properties of the system. An FPGA is a prototyping device used to implement simpler algorithms.
2.2. Signal:
In the field of communications, signal processing, and in electrical engineering more generally, a signal is any time-varying or spatially varying quantity. In the physical world, any quantity measurable through time or over space can be taken as a signal. Within a complex society, any set of human information or machine data can also be taken as a signal. Such information or machine data must all be part of systems existing in the physical world, either living or non-living.
CVR College of Engineering (VLSI) Page 6
Despite the complexity of such systems, their outputs and inputs can often be represented as simple quantities measurable through time or across space. In the latter half of the 20th century, electrical engineering itself separated into several disciplines, specializing on the one hand in the design and analysis of physical signals and systems, and on the other in the functional behavior and conceptual structure of complex human and machine systems. These engineering disciplines have led the way in the design, study, and implementation of systems that take advantage of signals as simple measurable quantities in order to facilitate the transmission, storage and manipulation of information.
not be derived from an underlying continuous-valued physical process. In other contexts, digital signals are defined as the continuous-time waveform signals in a digital system, representing a bit-stream. In the first case, a signal that is generated by means of a digital modulation method is considered as converted to an analog signal, while it is considered as a digital signal in the second case.
2.2.2.1. Discrete-time and continuous time signal:
If the quantities of a signal are defined only on a discrete set of times, we call it a discrete-time signal. In other words, a discrete-time real (or complex) signal can be seen as a function from the set of integers to the set of real (or complex) numbers. Discrete signals are analyzed in the frequency domain, usually with the Z-transform, to obtain their frequency response; discrete signals are denoted by u(k), where k = ..., -1, 0, 1, 2, 3, ...
A continuous-time real (or complex) signal is any real-valued (or complex-valued) function which is defined for all time t in an interval, most commonly an infinite interval. Continuous signals have a continuous frequency spectrum. The Fourier Transform (FT) is used to obtain their frequency response; continuous signals are denoted by u(t), where t is continuous.
There are mainly two types of signals encountered in practice, analog and digital. In short, the difference between them is that digital signals are discrete and quantized, as defined below, while analog signals possess neither property.
DISCRETIZATION:
One of the fundamental distinctions between different types of signals is between continuous and discrete time. In the mathematical abstraction, the domain of a continuous-time (CT) signal is the set of real numbers (or some interval thereof), whereas the domain of a discrete-time (DT) signal is the set of integers (or some interval). What these integers represent depends on the nature of the signal. DT signals often arise via sampling of CT signals. An audio signal, for example, consists of a continually fluctuating voltage on a line that can be digitized by an ADC circuit, wherein the circuit reads the voltage level on the line, say, every 50 µs. The resulting stream of numbers is stored as digital data on a discrete-time signal. Computers and other digital devices are restricted to discrete time.
QUANTIZATION:
If a signal is to be represented as a sequence of numbers, it is impossible to maintain arbitrarily high precision - each number in the sequence must have a finite number of digits. As a result, the values of such a signal are restricted to belong to a finite set; in other words, it is quantized.
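As an illustration of discretization followed by quantization, the following minimal Python sketch samples a sine wave and rounds each sample to a finite set of integer codes. The function name `sample_and_quantize`, the 1 kHz tone, the 50 microsecond sample period and the 4-bit word length are assumptions chosen for the example, not values taken from this project.

```python
import math

def sample_and_quantize(signal, sample_period, num_samples, num_bits, full_scale=1.0):
    """Sample a continuous-time function and quantize each sample.

    signal: callable t -> value, expected within [-full_scale, full_scale].
    Returns signed integer codes representable in num_bits bits.
    """
    levels = 2 ** (num_bits - 1)              # codes span -levels .. levels - 1
    codes = []
    for n in range(num_samples):
        t = n * sample_period                 # discretization: only t = n*T is kept
        code = round(signal(t) / full_scale * levels)   # quantization to an integer
        code = max(-levels, min(levels - 1, code))      # clip to the finite code set
        codes.append(code)
    return codes

# Hypothetical input: 1 kHz sine, sampled every 50 microseconds, 4-bit codes
tone = lambda t: math.sin(2 * math.pi * 1000.0 * t)
codes = sample_and_quantize(tone, 50e-6, 20, 4)
print(codes)
```

Each printed value belongs to the finite set -8 .. 7, which is what restricts a quantized signal to a finite number of digits per sample.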
In signal processing, a filter is a device or process that removes some unwanted component or feature from a signal. In general, it takes an input that is a function of time and produces an output that is a function of time (usually delayed from the input).
Filtering is a class of signal processing, the defining feature of filters being the complete or partial suppression of some aspect of the signal. Most often, this means removing some frequencies and not others in order to suppress interfering signals and reduce background noise. However, filters do not exclusively act in the frequency domain; especially in the field of image processing many other targets for filtering exist. There are many different bases for classifying filters, and these overlap in many different ways, so there is no simple hierarchical classification. Filters may be:
analog or digital
discrete-time (sampled) or continuous-time
linear or non-linear
passive or active type of continuous-time filter
infinite impulse response (IIR) or finite impulse response (FIR) type of discrete-time or digital filter.
often used in wave filtering applications, that is, where it is required to pass particular frequency components and to reject others from analog (continuous-time) signals.
Passive implementations of linear filters are based on combinations of resistors (R), inductors (L) and capacitors (C). These types are collectively known as passive filters, because they do not depend upon an external power supply and/or they do not contain active components such as transistors. Inductors block high-frequency signals and conduct low-frequency signals, while capacitors do the reverse. A filter in which the signal passes through an inductor, or in which a capacitor provides a path to ground, presents less attenuation to low-frequency signals than high-frequency signals and is a low-pass filter. If the signal passes through a capacitor, or has a path to ground through an inductor, then the filter presents less attenuation to high-frequency signals than low-frequency signals and is a high-pass filter. Resistors on their own have no frequency-selective properties, but are added to inductors and capacitors to determine the time-constants of the circuit, and therefore the frequencies to which it responds. The inductors and capacitors are the reactive elements of the filter. The number of elements determines the order of the filter. In this context, an LC tuned circuit being used in a band-pass or band-stop filter is considered a single element even though it consists of two components. At high frequencies (above about 100 megahertz), sometimes the inductors consist of single loops or strips of sheet metal, and the capacitors consist of adjacent strips of metal. These inductive or capacitive pieces of metal are called stubs.
Chebyshev filter: has the best approximation to the ideal response of any filter for a specified order and ripple.
Butterworth filter: has a maximally flat frequency response.
Bessel filter: has a maximally flat phase delay.
Elliptic filter: has the steepest cutoff of any filter for a specified order and ripple.
The difference between these filter families is that they all use a different polynomial function to approximate the ideal filter response. This results in each having a different transfer function. An older methodology, now largely obsolete but still occasionally encountered, is the image parameter method. Filters designed by this methodology are archaically called "wave filters". Some important filters designed by this method are:
Constant k filter: the original and simplest form of wave filter.
M-derived filter: a modification of the constant k with improved cutoff steepness and impedance matching.
The forms describing which frequencies the filter passes (the pass band) and which it rejects (the stop band):
Low-pass filter: low frequencies are passed, high frequencies are attenuated.
High-pass filter: high frequencies are passed, low frequencies are attenuated.
Band-pass filter: only frequencies in a frequency band are passed.
Band-stop filter or band-reject filter: only frequencies in a frequency band are attenuated.
Notch filter: rejects just one specific frequency; an extreme band-stop filter.
Comb filter: has multiple regularly spaced narrow pass bands, giving the band form the appearance of a comb.
All-pass filter: all frequencies are passed, but the phase of the output is modified.
Cutoff frequency is the frequency beyond which the filter will not pass signals. It is
band.
Ripple is the variation of the filter's insertion loss in the pass band.
The order of a filter is the degree of the approximating polynomial and in passive filters corresponds to the number of elements required to build it. Increasing order increases roll-off and brings the filter closer to the ideal response.
A Finite Impulse Response (FIR) filter is a type of digital filter. The impulse response, the filter's response to a Kronecker delta input, is finite because it settles to zero in a finite number of sample intervals. This is in contrast to Infinite Impulse Response (IIR) filters, which have internal feedback and may continue to respond indefinitely. The impulse response of an Nth-order FIR filter lasts for N + 1 samples, and then dies to zero. The difference equation that defines the output of an FIR filter in terms of its input is:
$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] + \cdots + b_N x[n-N]$$
where:
x[n] is the input signal,
y[n] is the output signal,
$b_i$ are the filter coefficients, and
N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side, and these are commonly referred to as taps.
This equation can also be expressed as a convolution of the coefficient sequence $b_i$ with the input signal:
$$y[n] = \sum_{i=0}^{N} b_i\, x[n-i]$$
That is, the filter output is a weighted sum of the current and a finite number of previous values of the input.
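The weighted-sum view of the FIR difference equation can be sketched in a few lines of Python. This is only an illustration: the function name `fir_filter` and the coefficient values are hypothetical, and samples before time zero are assumed to be zero.

```python
def fir_filter(b, x):
    """Compute y[n] = sum_{i=0}^{N} b[i] * x[n - i], taking x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, coeff in enumerate(b):
            if n - i >= 0:                 # samples before the start are treated as zero
                acc += coeff * x[n - i]
        y.append(acc)
    return y

# Hypothetical 2nd-order filter: N = 2, so N + 1 = 3 taps
b = [3, 2, 1]
impulse = [1, 0, 0, 0, 0, 0]               # Kronecker delta input
response = fir_filter(b, impulse)
print(response)
```

Feeding in a Kronecker delta returns the coefficients themselves followed by zeros, which shows that the impulse response lasts N + 1 samples and then dies to zero.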
single capacitor (C). This filter has an exponential impulse response characterized by an RC time constant. IIR filters may be implemented as either analog or digital filters. In digital IIR filters, the output feedback is immediately apparent in the equations defining the output. Note that, unlike with FIR filters, in designing IIR filters it is necessary to carefully consider the "time zero" case in which the outputs of the filter have not yet been clearly defined. Design of digital IIR filters is heavily dependent on that of their analog counterparts, because there are plenty of resources, works and straightforward design methods concerning analog feedback filter design while there are hardly any for digital IIR filters. As a result, when a digital IIR filter is going to be implemented, an analog filter (e.g. Chebyshev filter, Butterworth filter, Elliptic filter) is usually designed first and then converted to a digital filter by applying discretization techniques such as the bilinear transform or impulse invariance. Digital filters are often described and implemented in terms of the difference equation that defines how the output signal is related to the input signal:
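$$y[n] = \frac{1}{a_0}\left(b_0 x[n] + b_1 x[n-1] + \cdots + b_P x[n-P] - a_1 y[n-1] - a_2 y[n-2] - \cdots - a_Q y[n-Q]\right)$$
where:
P is the feed forward filter order,
$b_i$ are the feed forward filter coefficients,
Q is the feedback filter order,
$a_i$ are the feedback filter coefficients,
x[n] is the input signal, and
y[n] is the output signal.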
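A difference equation with feed forward and feedback terms can be sketched as follows. The names `b` (feed forward coefficients) and `a` (feedback coefficients, with `a[0]` as a normalizing term) follow the common textbook convention and are assumptions here, as is the one-pole example filter.

```python
def iir_filter(b, a, x):
    """Direct evaluation of
    a[0]*y[n] = sum_i b[i]*x[n-i] - sum_{j>=1} a[j]*y[n-j],
    with all signals taken as zero before n = 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, bi in enumerate(b):          # feed forward part (past inputs)
            if n - i >= 0:
                acc += bi * x[n - i]
        for j in range(1, len(a)):          # feedback part (past outputs)
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y.append(acc / a[0])
    return y

# Hypothetical one-pole filter: y[n] = 0.5*x[n] + 0.5*y[n-1]
response = iir_filter([0.5], [1.0, -0.5], [1, 0, 0, 0])
print(response)
```

Because each output feeds back into the next one, the response to a single impulse keeps halving and never reaches exactly zero, in contrast to the FIR case.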
2.4. FPGA:
FPGAs offer an opportunity to accelerate your digital signal processing application up to 1000 times over a traditional DSP microprocessor.
these cells available to use as building blocks in complex digital circuits. Custom hardware has never been so easy to develop.
Performance up to 1000x:
The ability to manipulate the logic at the gate level means you can construct a custom processor to efficiently implement the desired function. By simultaneously performing all of the algorithm's subfunctions, the FPGA can outperform a DSP by as much as 1000:1.
Like microprocessors, many FPGAs can be infinitely reprogrammed in-circuit in only a fraction of a second. Design revisions, even for a fielded product, can be implemented quickly and painlessly. Hardware can also be reduced by taking advantage of reconfiguration.
Highly integrated:
The programmable logic in an FPGA can absorb much of the interface and glue logic associated with microprocessors. The tighter integration can make a product smaller, lighter, cheaper and lower power.
Competitively priced:
FPGAs are a generic product customized at the point of use. They enjoy the cost advantages of high production volumes. There are also none of the NRE charges or fabrication delays associated with ASIC development, which helps get you to market on time. The FPGA's flexibility eliminates the long design cycle associated with ASICs. With FPGAs there are no delays for prototypes or early production volume. Design revisions are easily implemented, often taking less than a day. The devices are fully tested by the manufacturer, eliminating production test development.
2.5. CONCLUSION:
In this chapter we discussed signals, different types of signals, filters, different types of filters, and FPGAs in Digital Signal Processing.
3. DESIGN METHODOLOGY
3.1. INTRODUCTION:
A Finite Impulse Response (FIR) filter is a type of digital filter. The direct implementation of the FIR filter requires a large number of resources. To reduce the number of resources, Distributed Arithmetic is used, which replaces multiplications by additions and shifts. To reduce the ROM size, the proposed DA algorithm uses multiplexers. The LUT-less algorithm uses multiplexers to remove the usage of ROM memory altogether.
3.2.
Generally FIR filter is designed using Multiply and Accumulate (MAC) principle where the filter coefficients undergo multiplication and additions. The MAC principle is common in Digital Signal Processing algorithms.
$$y = h_{K-1} x_0 + h_{K-2} x_1 + \cdots + h_0 x_{K-1}$$
i.e.
$$y = \sum_{k=0}^{K-1} h_{n-k}\, x_k, \qquad n = K-1$$
$h = [h_0, h_1, h_2, \ldots, h_{K-1}]$ is a vector of constant values; each $h_k$ is of M bits.
$x = [x_0, x_1, x_2, \ldots, x_{K-1}]$ is a vector of input values; each $x_k$ is of N bits.
A numerical example:
h = [32, 42, 45, 23], x = [42, 20, 22, 67] (K = 4)
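The numerical example can be worked through with a short multiply-accumulate loop. This is a sketch that assumes the pairing of the first expansion (the last coefficient multiplies the first input); the function name `mac` is hypothetical.

```python
def mac(h, x):
    """Multiply and accumulate: y = h[K-1]*x[0] + h[K-2]*x[1] + ... + h[0]*x[K-1]."""
    K = len(h)
    acc = 0
    for k in range(K):
        acc += h[K - 1 - k] * x[k]      # one multiplication and one addition per tap
    return acc

h = [32, 42, 45, 23]
x = [42, 20, 22, 67]
y = mac(h, x)                           # 23*42 + 45*20 + 42*22 + 32*67
print(y)                                # 4934
```

With K = 4 the loop performs four multiplications, which is the cost that the distributed arithmetic approach sets out to avoid.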
Fig 3.2. Block diagram of 4-tap FIR filter using direct implementation.
In the direct implementation we follow the Multiply and Accumulate (MAC) operation: each filter coefficient is multiplied directly by its input variable, and the products are added to get the final result. In a 1-tap filter, the filter coefficient h0 is multiplied by the variable x0 and the result is assigned to the output. In a 4-tap filter, the filter coefficients are multiplied by the corresponding variables, and the results of the four multipliers are added and assigned to the output. This method requires four multipliers, which consume many resources. To reduce resource utilization and improve speed we follow the Distributed Arithmetic (DA) algorithm, which is a multiplier-less architecture.
3.3. IMPLEMENTING FIR FILTER USING DISTRIBUTED ARITHMETIC:
Distributed arithmetic is a bit-level rearrangement of a multiply-accumulate to avoid the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate that is well suited to FPGA designs. It can also be extended to other sum functions such as complex multiplies, Fourier transforms and so on. In most of the multiply-accumulate applications in signal processing, one of the multiplicands for each product is a constant. Usually each multiplication uses a different constant. Using our most compact multiplier, the scaling accumulator, we can construct a multiple-product-term parallel multiply-accumulate function in a relatively small space if we are willing to accept a serial input. In this case, we feed four parallel scaling accumulators with unique serialized data. Each multiplies that data by a possibly unique constant, and the resulting products are summed in an adder tree as shown below.
Fig 3.3. 4-tap FIR filter using DA algorithm.
If we stop to consider that the scaling accumulator multiplier is really just a sum of vectors, then it becomes obvious that we can rearrange the circuit. Here, the adder tree combines the 1-bit partial products before they are accumulated by the scaling accumulator. All we have done is rearrange the order in which the 1xN partial products are summed. Now, instead of individually accumulating each partial product and then summing the results, we postpone the accumulate function until after we've summed all the 1xN partials at a particular bit time. This simple rearrangement of the order of the adds has effectively replaced N multiplies followed by an N-input add with a series of N-input adds followed by a multiply. This arithmetic manipulation directly eliminates N-1 adders in an N product term multiply-accumulate function. For larger numbers of product terms, the savings become significant.
Fig 3.4. block diagram of 4- tap filter using LUT less algorithm.
Further hardware savings are available when the coefficients Cn are constants. If that is true, then the adder tree shown above becomes a Boolean logic function of the 4 serial inputs. The combined 1xN products and adder tree is reduced to a four input look up table. The sixteen entries in the table are sums of the constant coefficients for all the possible serial input combinations. The table is made wide enough to accommodate the largest sum without overflow. Negative table values are sign extended to the width of the table, and the input to the scaling accumulator should be sign extended to maintain negative sums.
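The sixteen-entry table described above can be generated directly from the constant coefficients. The sketch below is illustrative only: the function name `build_da_lut`, the bit-to-coefficient ordering, and the reuse of the coefficient values from the earlier numerical example are assumptions.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is 1.
    Bit k of addr (LSB = bit 0) corresponds to coeffs[k]."""
    K = len(coeffs)
    lut = []
    for addr in range(2 ** K):
        total = 0
        for k in range(K):
            if (addr >> k) & 1:
                total += coeffs[k]
        lut.append(total)
    return lut

lut = build_da_lut([32, 42, 45, 23])
width = max(lut).bit_length()           # table width needed to avoid overflow (unsigned case)
print(len(lut), lut[0b0101], max(lut), width)
```

For these four positive coefficients the largest entry is their total, 142, so an 8-bit wide table is enough; negative coefficients would additionally need sign extension as the text notes.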
Fig 3.5. block diagram which explains MUX operations.
Obviously the serial inputs limit the performance of such a circuit. As with most hardware applications, we can obtain more performance by using more hardware. In this case, more than one bit sum can be computed at a time by duplicating the LUT and adder tree as shown here. The second bit computed will have a different weight than the first, so some shifting is required before the bit sums are combined. In this 2-bit-at-a-time implementation, the odd bits are fed to one LUT and adder tree, while the even bits are simultaneously fed to an identical tree. The odd bit partials are left shifted to properly weight the result and added to the even partials before accumulating the aggregate. Since two bits are taken at a time, the scaling accumulator has to shift the feedback by 2 places.
Fig 3.6. block diagram which explains MUX operations for more number of inputs
This paralleling scheme can be extended to compute more than two bits at a time. In the extreme case, all input bits can be computed in parallel and then combined in a shifting adder tree. No scaling accumulator is needed in this case, since the output from the adder tree is the entire sum of products. This fully parallel implementation has a data rate that matches the serial clock, which can be greater than 100 MS/s in today's FPGAs.
Fig 3.7. Block diagram which explains shifting and addition operations.
Most often, we have more than 4 product terms to accumulate. Increasing the size of the LUT might look attractive until you consider that the LUT size grows exponentially. Considering the construction of the logic we stuffed into the LUT, it becomes obvious that we can combine the results from the LUTs in an adder tree. The area of the circuit grows by roughly 2n - 1 using adder trees to expand it, rather than the 2^n growth experienced by increasing LUT size. For FPGAs, the most efficient use of the logic occurs when we use the natural LUT size (usually a 4-LUT, although an 8-LUT would make sense if we were using an 8-input block RAM) for the LUTs and then add the outputs of the LUTs together in an adder tree, as shown below:
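The idea of splitting a long filter into several natural-size LUTs whose outputs are added can be sketched as follows. The function names, the 8-tap coefficient set and the input bit pattern are hypothetical; the point is only that the partitioned result equals the single-table result while using far fewer table words.

```python
def build_lut(coeffs):
    """Small LUT: entry addr is the sum of coeffs[k] for every set bit k of addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(2 ** len(coeffs))]

def partitioned_bit_sum(coeffs, bits, lut_inputs=4):
    """Sum of coeffs[k]*bits[k] using one small LUT per group of taps."""
    total = 0
    words = 0
    for start in range(0, len(coeffs), lut_inputs):
        lut = build_lut(coeffs[start:start + lut_inputs])
        words += len(lut)
        addr = 0
        for k, bit in enumerate(bits[start:start + lut_inputs]):
            addr |= bit << k
        total += lut[addr]              # the adder tree combines the LUT outputs
    return total, words

coeffs = [3, 1, 4, 1, 5, 9, 2, 6]       # hypothetical 8-tap filter
bits = [1, 0, 1, 1, 0, 1, 0, 1]         # one bit from each serialized input
total, words = partitioned_bit_sum(coeffs, bits)
print(total, words, 2 ** len(coeffs))   # 23 32 256
```

Two 4-input LUTs need 32 words in total, against 256 words for one 8-input table, at the cost of one extra adder.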
Fig 3.8. .
3.4.
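$$y = \sum_{k=0}^{n-1} h_k\, x_{n-k} \qquad (1)$$
Writing the K coefficients as $A_k$ and each input $x_k$ as an N-bit two's-complement fraction with bits $b_{kn}$, where $b_{k0}$ is the sign bit:
$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n} \qquad (2)$$
Substituting (2) into the sum of products $y = \sum_{k=1}^{K} A_k x_k$ gives
$$y = -\sum_{k=1}^{K} (b_{k0} A_k) + \sum_{k=1}^{K} \sum_{n=1}^{N-1} (A_k b_{kn})\, 2^{-n} \qquad (3)$$
And now
$$y = -\sum_{k=1}^{K} (b_{k0} A_k) + \sum_{k=1}^{K} \left[ \sum_{n=1}^{N-1} (b_{kn} A_k)\, 2^{-n} \right]$$
$$y = -\sum_{k=1}^{K} (b_{k0} A_k) + \sum_{k=1}^{K} \left[ (A_k b_{k1})\, 2^{-1} + (A_k b_{k2})\, 2^{-2} + \cdots + (A_k b_{k(N-1)})\, 2^{-(N-1)} \right]$$
$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ (b_{11} A_1)\, 2^{-1} + (b_{12} A_1)\, 2^{-2} + \cdots + (b_{1(N-1)} A_1)\, 2^{-(N-1)} \right] \\
& + \left[ (b_{21} A_2)\, 2^{-1} + (b_{22} A_2)\, 2^{-2} + \cdots + (b_{2(N-1)} A_2)\, 2^{-(N-1)} \right] \\
& + \cdots \\
& + \left[ (b_{K1} A_K)\, 2^{-1} + (b_{K2} A_K)\, 2^{-2} + \cdots + (b_{K(N-1)} A_K)\, 2^{-(N-1)} \right]
\end{aligned}$$
By taking common multiples into consideration we can rearrange the equation in the following fashion.
$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ (b_{11} A_1) + (b_{21} A_2) + \cdots + (b_{K1} A_K) \right] 2^{-1} \\
& + \cdots \\
& + \left[ (b_{1(N-1)} A_1) + (b_{2(N-1)} A_2) + \cdots + (b_{K(N-1)} A_K) \right] 2^{-(N-1)}
\end{aligned}$$
Finally the equation is reduced in the following way.
$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{n=1}^{N-1} \left[ b_{1n} A_1 + b_{2n} A_2 + \cdots + b_{Kn} A_K \right] 2^{-n} \qquad (4)$$
For each bit position n, the bracketed term in (4) is
$$\sum_{k=1}^{K} A_k b_{kn} \qquad (5)$$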
Expression (5) can be pre-calculated for all possible values of $b_{1n}, b_{2n}, \ldots, b_{Kn}$. We can store these in a look-up table of $2^K$ words addressed by K bits, i.e. $b_{1n} b_{2n} \ldots b_{Kn}$.
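To check that a table of pre-calculated sums reproduces the sum of products, the following sketch evaluates the bit-serial form with exact fractions and compares it against the direct computation. It assumes N-bit two's-complement fractional inputs with the sign bit first; the function names and the test values are hypothetical.

```python
from fractions import Fraction

def to_bits(x, N):
    """Two's-complement bits [b0, ..., b_{N-1}] of a fraction x in [-1, 1),
    so that x = -b0 + sum_{n>=1} b_n * 2**-n (x must be a multiple of 2**-(N-1))."""
    code = int(x * 2 ** (N - 1)) & (2 ** N - 1)
    return [(code >> (N - 1 - n)) & 1 for n in range(N)]

def da_sum(A, xs, N):
    K = len(A)
    # Look-up table of 2**K words: one pre-calculated sum per bit pattern
    lut = [sum(A[k] for k in range(K) if (pattern >> k) & 1)
           for pattern in range(2 ** K)]
    bits = [to_bits(x, N) for x in xs]

    def address(n):
        # K-bit address formed from bit n of every input
        return sum(bits[k][n] << k for k in range(K))

    y = -lut[address(0)]                        # sign-bit term enters with a minus sign
    for n in range(1, N):
        y += lut[address(n)] * Fraction(1, 2 ** n)
    return y

A = [3, -2, 5, 1]
xs = [Fraction(3, 8), Fraction(-1, 2), Fraction(1, 4), Fraction(-7, 8)]
y = da_sum(A, xs, N=4)
direct = sum(a * x for a, x in zip(A, xs))
print(y, direct)                                # 5/2 5/2
```

No multiplication by a coefficient appears in `da_sum`: each step is a table read followed by a scaling by a power of two, which in hardware is a shift.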
Fig:3.9. Block diagram of 4-tap FIR filter using DA based Algorithm
The block diagram of the LUT-based DA implementation of the FIR filter consists of three units: the shift register unit, the DA-LUT unit and the adder/shifter unit. The four input signals, each of four bits, are given to parallel-in serial-out (PISO) shift registers. The output of each PISO register is a single-bit value. The pre-calculated sums of the filter coefficients are stored in the look-up table and, depending on the outputs of the four PISO registers, a value is selected from the look-up table. The output of the look-up table is given to the adder/shifter unit, which adds this value to the left-shifted previous output and gives it to the output. This process is repeated for four clock cycles; after four clock cycles we get the required output.
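The three units described above (shift registers, DA-LUT, adder/shifter) can be modelled cycle by cycle. This is a behavioural sketch for unsigned inputs, not the project's VHDL; the function name, coefficient values and input values are hypothetical, and each coefficient is paired with the input of the same index.

```python
def da_fir_bit_serial(coeffs, inputs, num_bits):
    """Bit-serial DA: one LUT read and one shift-add per clock cycle, MSB first."""
    K = len(coeffs)
    # DA-LUT unit: pre-calculated sums of coefficients for every K-bit address
    lut = [sum(coeffs[k] for k in range(K) if (a >> k) & 1) for a in range(2 ** K)]
    acc = 0
    for cycle in range(num_bits):
        bit_pos = num_bits - 1 - cycle              # PISO registers emit the MSB first
        addr = 0
        for k in range(K):
            addr |= ((inputs[k] >> bit_pos) & 1) << k
        acc = (acc << 1) + lut[addr]                # left-shift previous output, add LUT word
    return acc

coeffs = [2, 3, 1, 4]
inputs = [5, 9, 12, 3]                              # four 4-bit unsigned inputs
y = da_fir_bit_serial(coeffs, inputs, num_bits=4)
print(y)                                            # 61
```

After four iterations the accumulator holds 2*5 + 3*9 + 1*12 + 4*3 = 61, matching the multiply-accumulate result without using a multiplier.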
A serial-in/parallel-out shift register is similar to the serial-in/serial-out shift register in that it shifts data into internal storage elements and shifts data out at the serial-out (data-out) pin. It is different in that it makes all the internal stages available as outputs. Therefore, a serial-in/parallel-out shift register converts data from serial format to parallel format. If four data bits are shifted in by four clock pulses via a single wire at data-in, the data becomes available simultaneously on the four outputs QA to QD after the fourth clock pulse.
Fig 3.10. Serial in parallel out shift register with 4 stages.
The practical application of the serial-in/parallel-out shift register is to convert data from serial format on a single wire to parallel format on multiple wires. Perhaps, we will illuminate four LEDs (Light Emitting Diodes) with the four outputs (QA QB QC QD).
Fig 3.11. serial in parallel out shift register in detail.
The above details of the serial-in/parallel-out shift register are fairly simple. It looks like a serial-in/serial-out shift register with taps added to each stage output. Serial data shifts in at SI (Serial Input). After a number of clocks equal to the number of stages, the first data bit in appears at SO (QD) in the above figure. In general, there is no SO pin. The last stage (QD above) serves as SO and is cascaded to the next package if it exists. Note that serial-in/serial-out shift registers come in greater than 8-bit lengths of 18 to 64 bits. It is not practical to offer a 64-bit serial-in/parallel-out shift register requiring that many output pins. See waveforms below for above shift register.
Fig 3.12. Serial in parallel out register waveforms.
The shift register has been cleared prior to any data by CLR', an active-low signal, which clears all type D flip-flops within the shift register. Note the serial data pattern 1011 presented at the SI input. This data is synchronized with the clock CLK. This would be the case if it is being shifted in from something like another shift register, for example, a parallel-in/serial-out shift register (not shown here). On the first clock at t1, the data 1 at SI is shifted from D to Q of the first shift register stage. After t2 this first data bit is at QB. After t3 it is at QC. After t4 it is at QD. Four clock pulses have shifted the first data bit all the way to the last stage QD. The second data bit, a 0, is at QC after the 4th clock. The third data bit, a 1, is at QB. The fourth data bit, another 1, is at QA. Thus, the serial data input pattern 1011 is contained in (QD QC QB QA). It is now available on the four outputs.
It will be available on the four outputs from just after clock t4 to just before t5. This parallel data must be used or stored between these two times, or it will be lost due to shifting out of the QD stage on the following clocks t5 to t8 as shown above.
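The clock-by-clock behaviour of the 4-stage serial-in/parallel-out register can be reproduced with a small behavioural model. The function name `sipo_shift` is hypothetical; the register is assumed to start cleared, as it would after CLR'.

```python
def sipo_shift(serial_bits, stages=4):
    """Return the register state [QA, QB, QC, QD] after each clock pulse."""
    q = [0] * stages                    # all stages cleared before any data arrives
    history = []
    for bit in serial_bits:
        q = [bit] + q[:-1]              # QA takes SI; every other stage takes its neighbour
        history.append(list(q))
    return history

history = sipo_shift([1, 0, 1, 1])      # serial pattern 1011, first bit first
for t, state in enumerate(history, start=1):
    print(f"after t{t}: QA QB QC QD = {state}")
```

After the fourth clock the state is QA=1, QB=1, QC=0, QD=1, so reading (QD QC QB QA) gives back the pattern 1011; a fifth clock would push the first bit out of QD.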
If this ROM has been written with the above data (representing a half-adder's truth table), driving the A and B address inputs will cause the respective memory cells in the ROM chip to be enabled, thus outputting the corresponding data as the Σ (Sum) and Cout bits. Unlike the half-adder circuit built of gates or relays, this device can be set up to perform any logic function at all with two inputs and two outputs, not just the half-adder function. To change the logic function, all we would need to do is write a different table of data to another ROM chip. We could even use an EPROM chip which could be re-written at will, giving the ultimate flexibility in function. It is vitally important to recognize the significance of this principle as applied to digital circuitry. Whereas the half-adder built from gates or relays processes the input bits to arrive at a specific output, the ROM simply remembers what the outputs should be for any given combination of inputs. This is not much different from the "times tables" memorized in grade school: rather than having to calculate the product of 5 times 6 (5 + 5 + 5 + 5 + 5 + 5 = 30), school-children are taught to remember that 5 x 6 = 30, and then expected to recall this product from memory as needed. Likewise, rather than the logic function depending on the functional arrangement of hard-wired gates or relays (hardware), it depends solely on the data written into the memory (software). Such a simple application, with definite outputs for every input, is called a look-up table, because the memory device simply "looks up" what the output(s) should be for any given combination of input states.
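The contrast between computing a function and looking it up can be shown with a half-adder stored as a table. The dictionary below stands in for the ROM contents and follows the standard half-adder truth table; the names `HALF_ADDER_ROM`, `rom_read` and `half_adder_gates` are hypothetical.

```python
# Address = (A, B); stored word = (Sum, Cout)
HALF_ADDER_ROM = {
    (0, 0): (0, 0),
    (0, 1): (1, 0),
    (1, 0): (1, 0),
    (1, 1): (0, 1),
}

def rom_read(rom, a, b):
    """The ROM does no arithmetic: it returns whatever word was written at the address."""
    return rom[(a, b)]

def half_adder_gates(a, b):
    """Gate-level equivalent: Sum = A XOR B, Cout = A AND B."""
    return (a ^ b, a & b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, rom_read(HALF_ADDER_ROM, a, b), half_adder_gates(a, b))
```

Writing different words into the table changes the logic function without touching `rom_read`, which is the property the DA look-up table relies on.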
Fig 3.14. Block diagram of 4-tap FIR filter using proposed DA algorithm.
The LUT reduction procedure discussed above will be further developed to obtain LUT-less DA architecture. The LUT-less DA architecture is as shown below:
Fig 3.15. Block diagram of 4-tap FIR filter using LUT less algorithm.
In this procedure all LUTs are replaced by multiplexers and full adders, so that memory usage is eliminated completely. The output of each parallel-in serial-out register is given to the select input of a multiplexer: when that bit is one the respective constant (coefficient) value is output, otherwise zero is output. The outputs of the multiplexers are given to the adder/shifter unit of the filter.
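A behavioural sketch of the multiplexer-based (LUT-less) scheme is given below for unsigned inputs. Each serialized input bit selects either its coefficient or zero, and the selected values are added before the shift-accumulate step. The function name and the example values are hypothetical.

```python
def da_fir_lutless(coeffs, inputs, num_bits):
    """LUT-less DA: 2:1 multiplexers and adders replace the look-up table."""
    acc = 0
    for cycle in range(num_bits):
        bit_pos = num_bits - 1 - cycle          # serial bits arrive MSB first
        partial = 0
        for h, x in zip(coeffs, inputs):
            select = (x >> bit_pos) & 1
            mux_out = h if select else 0        # multiplexer: coefficient or zero
            partial += mux_out                  # adders combine the multiplexer outputs
        acc = (acc << 1) + partial              # adder/shifter unit
    return acc

coeffs = [2, 3, 1, 4]
inputs = [5, 9, 12, 3]
y = da_fir_lutless(coeffs, inputs, num_bits=4)
print(y)                                        # 61
```

The result is the same as with a look-up table, but no table of 2^K words has to be stored; the trade-off is K multiplexers and an adder chain per bit step.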
A Programmable Logic Array is a small PLD that contains two levels of logic, an AND-plane and an OR-plane, where both levels are programmable.
3.8.3.1. Introduction:
Field Programmable Gate Arrays are specific integrated circuits that can be easily programmed by the user. The FPGA contains versatile functions, configurable interconnects and an input/output interface to adapt to the user specification. FPGAs allow rapid prototyping using custom logic structures, and are very popular for limited-production products. Modern FPGAs are extremely dense, with a complexity of several million gates, which enables the emulation of very complex hardware such as parallel microprocessors or a mixture of processor and signal processing. One key advantage of FPGAs is their ability to be reprogrammed, in order to create completely different hardware by modifying the logic gate array. FPGAs exist not only as simple components, but also as CPU ram-blocks in system-on-chip designs. An FPGA consists of slices, where each slice consists of 2 look-up tables and 2 D flip-flops.
implementation on the kit. Delays here mean electrical loading effect, wiring delays, stray capacitances.
Fig:3.17. Design flow steps: System specifications (initials), Architecture (block diagram), VHDL module (coding), Simulation (functional verification), Synthesis (timing verification), Configuring FPGA (download on to FPGA).
After post place and route comes generation of the bit-map file, which means converting the VHDL code into a bit stream used to configure the FPGA kit. A bit file is generated after we perform this step. The final step is downloading the bit-map file on to the FPGA board, which is done by connecting the computer to the FPGA board with a JTAG (Joint Test Action Group) cable; JTAG is an IEEE standard. The bit-map file contains the whole design, which is placed on the FPGA die; the outputs can now be observed from the FPGA LEDs or multiplexed seven-segment displays. This step completes the whole process of implementing our design on an FPGA.
None of the mask layers are customized.
There is a method for programming the basic logic cells and the interconnect.
The core is a regular array of programmable basic logic cells that can implement combinational as well as sequential logic (flip-flops).
A matrix of programmable interconnect surrounds the basic logic cells.
Programmable I/O cells surround the core.
Design turnaround is a few hours.
3.9.
Applications of FPGA:
3.10. Conclusion:
This chapter discussed design methodologies, the implementation of the FIR filter by different methods, and the FPGA design flow.
4. DESIGN ANALYSIS
4.1.
INTRODUCTION:
After implementing the design, the next step is to analyze it. This chapter discusses the analysis of the design after implementing the FIR filter with all four algorithms. The flow chart of the algorithm is presented, and the synthesis and simulation reports are discussed in view of the performance of the design.
4.2.
Flow chart:
4.3.1.
Simulation:
Fig 4.2. Simulation report of Direct Implementation of FIR filter.
In the direct implementation we use the Multiply and Accumulate (MAC) operation. The filter coefficients are constants and the inputs are variables. We multiply the inputs by the filter coefficients according to the FIR equation. The output is obtained in one clock cycle, but the delay through the multipliers and adders is larger. The clock period is 20 ns. For the first 20 ns reset is one, so the output is zero.
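The MAC behaviour described above can be sketched in software. This is only an illustrative Python model, not our VHDL design; the coefficient values, the input values and the names `mac_direct` and `simulate` are assumptions made for the example.

```python
def mac_direct(inputs, coeffs):
    """Direct form: sum of products of the inputs and the constant coefficients."""
    acc = 0
    for x, h in zip(inputs, coeffs):
        acc += x * h              # one multiply-accumulate per tap
    return acc


def simulate(inputs, coeffs, reset_cycles=1, cycles=3):
    """Output seen in each clock cycle; while reset is one the output is zero."""
    return [0 if clk < reset_cycles else mac_direct(inputs, coeffs)
            for clk in range(cycles)]


coeffs = [1, 2, 3, 4]             # hypothetical coefficients h0..h3
inputs = [3, 5, 7, 9]             # hypothetical 4-bit inputs
print(simulate(inputs, coeffs))   # [0, 70, 70]
```

The full result is available in the first cycle after reset, which is the behaviour the simulation report shows; the cost in hardware is one multiplier per product.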
Fig 4.3. Simulation report of one-tap filter.
We apply the input to the Parallel In Serial Out (PISO) register. After each clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSB of the input from the PISO register. Based on the output of the PISO we select a value from the Look Up Table (LUT). This value is added to the left-shifted previous output, and the result is given to the output.
For the one-tap filter we have one filter coefficient and one 4-bit input. We store the pre-calculated values (in this case h0 and 0000) in the ROM and give the input to the PISO register. At first reset is one, so the output is zero. After reset becomes zero, on a clk event we load the input port with the input. During the first clock cycle we get the MSB of the input from the PISO register. Based on the output of the PISO we decide whether to select h0 or 0000: if the bit from the PISO is zero we select 0000, and if it is one we select h0. The output of the LUT is added to the left-shifted previous output. During the first clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the output of the LUT. During the second clock cycle we get the second MSB from the PISO; based on this we select a value from the LUT, which is added to the left-shifted previous output. After four clock cycles we get the required output.
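The shift-and-add sequence for the one-tap filter can be followed cycle by cycle in a small Python model. This is a sketch that assumes an unsigned 4-bit input; the function name `one_tap_da` and the example values are hypothetical.

```python
def one_tap_da(x, h0, width=4):
    """Bit-serial product h0 * x for an unsigned `width`-bit input x."""
    lut = (0, h0)                 # pre-calculated ROM contents: 0000 and h0
    acc = 0                       # cleared while reset is one
    for cycle in range(width):
        bit = (x >> (width - 1 - cycle)) & 1   # PISO shifts out the MSB first
        acc = (acc << 1) + lut[bit]            # left-shift previous output, add LUT value
    return acc


# x = 1011 (11), h0 = 5: the accumulator goes 5, 10, 25, 55 over four cycles
print(one_tap_da(0b1011, 5))      # 55
```

No multiplier is used: each cycle needs only a 2-entry selection, a shift and an addition.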
Fig 4.4. Simulation report of DA algorithm based FIR filter.
We apply the inputs to the Parallel In Serial Out (PISO) registers. After each clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSBs of the inputs from the PISO registers. Based on the outputs of the PISO registers we select a value from the Look Up Table (LUT). This value is added to the left-shifted previous output, and the result is given to the output.
For the 4-tap filter we have four filter coefficients of 16 bits each and four 4-bit inputs. We store the pre-calculated values in the ROM and give the inputs to the PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on a clk event we load the input ports with the inputs. During the first clock cycle we get the MSBs of all four inputs from the PISO registers. Based on the outputs of the PISO registers we select a value from the LUT. The output of the LUT is added to the left-shifted previous output. During the first clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the output of the LUT. During the second clock cycle we get the second MSBs from the PISO registers; based on these we select a value from the LUT, which is added to the left-shifted previous output. After four clock cycles we get the required output.
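As an illustration of the DA scheme for four taps, the following Python sketch builds the 16-entry table of pre-calculated coefficient sums and addresses it with one bit from each input per cycle. It assumes unsigned 4-bit inputs; the names `build_lut` and `da_fir`, the bit ordering of the LUT address and the example values are assumptions, not taken from our VHDL code.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is one."""
    n = len(coeffs)
    return [sum(h for k, h in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_fir(inputs, coeffs, width=4):
    """Bit-serial sum of products using one LUT for all taps."""
    lut = build_lut(coeffs)       # 2**4 = 16 entries for a 4-tap filter
    acc = 0
    for cycle in range(width):
        shift = width - 1 - cycle                 # MSB first
        addr = 0
        for k, x in enumerate(inputs):
            addr |= ((x >> shift) & 1) << k       # one bit from each PISO register
        acc = (acc << 1) + lut[addr]              # shift previous output, add LUT value
    return acc


coeffs = [1, 2, 3, 4]             # hypothetical coefficients
inputs = [3, 5, 7, 9]             # hypothetical 4-bit inputs
print(len(build_lut(coeffs)), da_fir(inputs, coeffs))   # 16 70
```

The table size doubles with every added tap, which is the LUT growth problem that the proposed architecture addresses.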
Fig 4.5. Simulation report of proposed DA based FIR filter.
We apply the inputs to the Parallel In Serial Out (PISO) registers. After each clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSBs of the inputs from the PISO registers. The output of one PISO register is given to the MUX and the remaining three outputs from the other PISO registers are given to the 3-tap LUT. Based on the outputs of the three PISO registers we select a value from the Look Up Table (LUT). This value is added to the left-shifted previous output, and the result is given to the output.
For the 4-tap filter we have four filter coefficients of 16 bits each and four 4-bit inputs. We give the inputs to the PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on a clk event we load the input ports with the inputs. During the first clock cycle we get the MSBs of the inputs from the PISO registers. Three are given to the LUT and the remaining one is given to the MUX. The outputs of the MUX and the LUT are added, and this result is added to the left-shifted previous output. During the first clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the sum of the LUT and MUX outputs. During the second clock cycle we get the second MSBs from the PISO registers; based on these we select values from the LUT and the MUX, and their sum is added to the left-shifted previous output. After four clock cycles we get the required output.
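The split into a 3-tap LUT and a multiplexer can be modelled as below. This is a sketch under assumptions: unsigned 4-bit inputs, the fourth input chosen as the one routed to the MUX, and hypothetical names and values.

```python
def proposed_da_fir(inputs, coeffs, width=4):
    """Bit-serial sum of products: 8-entry LUT for three taps, 2:1 MUX for the fourth."""
    lut = [sum(h for k, h in enumerate(coeffs[:3]) if (addr >> k) & 1)
           for addr in range(8)]                  # half the size of the 16-entry LUT
    h3 = coeffs[3]
    acc = 0
    for cycle in range(width):
        shift = width - 1 - cycle                 # MSB first
        bits = [(x >> shift) & 1 for x in inputs]
        addr = bits[0] | (bits[1] << 1) | (bits[2] << 2)
        mux = h3 if bits[3] else 0                # MUX selects h3 or 0000
        acc = (acc << 1) + lut[addr] + mux        # extra adder combines LUT and MUX
    return acc


coeffs = [1, 2, 3, 4]             # hypothetical coefficients
inputs = [3, 5, 7, 9]             # hypothetical 4-bit inputs
print(proposed_da_fir(inputs, coeffs))            # 70
```

The result is the same as with the full 16-entry table, but the ROM is halved at the cost of one additional adder in the accumulation path.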
Fig 4.6. Simulation report of LUT-less based FIR filter.
We apply the inputs to the Parallel In Serial Out (PISO) registers. After each clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSBs of the inputs from the PISO registers. The output of each PISO register is fed to one of four MUXes. The outputs of all the MUXes are added and the result is given to the output.
For the 4-tap filter we have four filter coefficients of 16 bits each and four 4-bit inputs. We give one input to each of the four PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on a clk event we load the input ports with the inputs. During the first clock cycle we get the MSBs of the inputs from the PISO registers. The output of each PISO register is given to its own MUX. In each MUX we decide, based on the input bit, whether to select the filter coefficient or 0000: if the bit from the PISO is zero we select 0000, and if it is one we select the corresponding filter coefficient. The outputs of the MUXes are added, and the result is added to the left-shifted previous output. During the first clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the sum of the MUX outputs. During the second clock cycle we get the second MSBs from the PISO registers; based on these we get values from the MUXes, which are added, and the result is added to the left-shifted previous output. After four clock cycles we get the required output.
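A software model of the LUT-less structure replaces the table with one multiplexer per tap. The sketch below again assumes unsigned 4-bit inputs; the name `lutless_fir` and the example values are hypothetical.

```python
def lutless_fir(inputs, coeffs, width=4):
    """Bit-serial sum of products with one 2:1 MUX per tap and no LUT."""
    acc = 0
    for cycle in range(width):
        shift = width - 1 - cycle                 # MSB first
        # each MUX passes its coefficient when the PISO bit is one, else 0000
        mux_outputs = [h if (x >> shift) & 1 else 0
                       for x, h in zip(inputs, coeffs)]
        acc = (acc << 1) + sum(mux_outputs)       # adders combine the MUX outputs
    return acc


coeffs = [1, 2, 3, 4]             # hypothetical coefficients
inputs = [3, 5, 7, 9]             # hypothetical 4-bit inputs
print(lutless_fir(inputs, coeffs))                # 70
```

Here the memory disappears completely, while the number of adders in each cycle grows with the number of taps.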
4.4.
=========================================================================
Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# Multipliers                  : 7
 4x16-bit multiplier           : 7
# Adders/Subtractors           : 6
 20-bit adder                  : 6
# Registers                    : 80
 Flip-Flops                    : 80
=========================================================================
Timing Summary
=========================================================================
Speed Grade: -5
Minimum period: No path found
Minimum input arrival time before clock: 11.542ns
Maximum output required time after clock: 6.216ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
Total number of paths / destination ports: 77620 / 160
=========================================================================
Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# ROMs                         : 1
 16x16-bit ROM                 : 1
# Adders/Subtractors           : 2
 16-bit adder                  : 1
 3-bit adder                   : 1
# Registers                    : 53
 Flip-Flops                    : 53
# Comparators                  : 4
 4-bit comparator not equal    : 4
=========================================================================
Timing Summary
=========================================================================
Speed Grade: -5
Minimum period: 9.602ns (Maximum Frequency: 104.140MHz)
Minimum input arrival time before clock: 10.130ns
Maximum output required time after clock: 6.280ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 9.602ns (frequency: 104.140MHz)
Total number of paths / destination ports: 7762 / 68
=========================================================================
Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# ROMs                         : 1
 8x16-bit ROM                  : 1
# Adders/Subtractors           : 3
 16-bit adder                  : 2
 3-bit adder                   : 1
# Registers                    : 53
 Flip-Flops                    : 53
# Comparators                  : 4
 4-bit comparator not equal    : 4
=========================================================================
Timing Summary
=========================================================================
Speed Grade: -5
Minimum period: 10.188ns (Maximum Frequency: 98.154MHz)
Minimum input arrival time before clock: 10.672ns
Maximum output required time after clock: 6.280ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 10.188ns (frequency: 98.154MHz)
Total number of paths / destination ports: 8655 / 68
=========================================================================
Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# Adders/Subtractors           : 5
 16-bit adder                  : 4
 3-bit adder                   : 1
# Registers                    : 53
 Flip-Flops                    : 53
# Comparators                  : 4
 4-bit comparator not equal    : 4
=========================================================================
Timing Summary
=========================================================================
Speed Grade: -5
Minimum period: 12.615ns (Maximum Frequency: 79.268MHz)
Minimum input arrival time before clock: 13.055ns
Maximum output required time after clock: 6.280ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 12.615ns (frequency: 79.268MHz)
Total number of paths / destination ports: 64845 / 68
4.5.
Conclusion:
In this chapter the simulation reports and synthesis reports were observed.
5. RESULT
This project comprises several chapters devoted to the design of an FIR filter using four different algorithms. Here comparison tables are presented to give an overview of the resource usage and the timing. A simple overview of the project, including its scope, motivation and objectives, was discussed. Finally, an FIR filter was designed and the output was observed on the Spartan III FPGA kit.
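Resource comparison:
Specification | Direct implementation | Distributed arithmetic algorithm | Proposed DA algorithm | LUT-less algorithm
ROM | | 1 | 1 (8*16bit) |
16-bit adder | | 1 | 2 | 4
3-bit adder | | 1 | 1 | 1
Flip-flops | | 53 | 53 | 53
Comparators | | 4 | 4 | 4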
Delay comparison:
Techniques | Direct implementation | Distributed Arithmetic | Proposed DA algorithm | LUT-less algorithm
Frequency (MHz) | --
6. CONCLUSION
This project presents the proposed DA architectures for the FIR filter. The architectures reduce the memory usage by half at every iteration of LUT reduction, at the cost of a limited decrease in the system frequency. We also divide high-order filters into several groups of small filters, which further reduces the LUT size. To obtain a fast implementation of the FIR filter, the proposed DA algorithm is adopted.
We have successfully implemented a highly efficient 4-tap FIR filter, using both the original DA architecture and the proposed DA architecture, on the Spartan III FPGA kit. The results show that the proposed DA architecture is hardware-efficient for FPGA implementation.
7. FUTURE SCOPE
The speed of the filter can be further increased using the pipelining principle, where parallel processing can be implemented. With pipelining the filter can also be extended to higher orders. A 70-tap filter can be implemented using a symmetrical structure, which reduces it to 35 taps. The 35-tap filter can then be divided into 7 smaller filters of 5 taps each, where each 5-tap DA-LUT unit can be implemented by a 4-input LUT with an additional 2*1 multiplexer and a full adder. Thus our 4-tap filter can be extended to 70 taps and higher.
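The symmetric folding and the grouping into small filters can be checked with a short Python sketch. It only illustrates the arithmetic, not a hardware design; the coefficient values, the sample values and the names `fir_folded` and `split_groups` are assumptions made for the example.

```python
def fir_folded(window, half_coeffs):
    """Symmetric FIR: add the two samples that share a coefficient, then weight once."""
    n = len(window)
    return sum(h * (window[k] + window[n - 1 - k])
               for k, h in enumerate(half_coeffs))


def split_groups(coeffs, group_size):
    """Divide a coefficient list into consecutive groups of `group_size` taps."""
    return [coeffs[i:i + group_size] for i in range(0, len(coeffs), group_size)]


half = list(range(1, 36))                     # hypothetical 35 distinct coefficients
full = half + half[::-1]                      # symmetric 70-tap coefficient set
window = [(7 * i) % 16 for i in range(70)]    # arbitrary 4-bit samples

direct = sum(h * x for h, x in zip(full, window))
print(fir_folded(window, half) == direct)     # True

groups = split_groups(half, 5)
print(len(groups))                            # 7 groups of 5 taps
# LUT entries: one 35-input table versus seven 5-input tables
print(2 ** 35, len(groups) * 2 ** 5)
```

Seven 5-input tables need 224 entries in total, compared with 2^35 entries for a single table, and each 5-input table can be reduced once more to a 4-input LUT plus a multiplexer and an adder as described above.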
APPENDIX-A
Xilinx FPGA
Field Programmable Gate Arrays (FPGAs) are specific integrated circuits that can be easily programmed by the user. An FPGA contains versatile functions, configurable interconnects and an input/output interface that adapt to the user specifications. FPGAs allow rapid prototyping using custom logic structures and are very popular for limited-production products. Modern FPGAs are extremely dense, with a complexity of several million gates, which enables the emulation of very complex hardware such as parallel microprocessors or a mixture of processors and signal processing. One key advantage of FPGAs is their ability to be reprogrammed in order to create completely different hardware by modifying the logic gate array.
Advantages of FPGA:
1. Short turnaround time
2. Design independent
3. Flexibility
Classification of FPGAs:
The FPGAs are classified based on switching technology.
XILINX and ALTERA are the leading manufacturers of SRAM-based FPGAs.
ACTEL, QUICKLOGIC and CYPRESS are the leading manufacturers of ANTIFUSE-based FPGAs.
XILINX FPGA
Xilinx is a developer of FPGA and CPLD devices that are used in numerous applications within telecommunications, consumer, defense and other fields. Xilinx offers device families for glue logic (CoolRunner, CoolRunner-II), low-cost (Spartan) and high-end (Virtex) applications. Xilinx also provides application-oriented optimized FPGA series: LX (for Logic), SX (for Signal Processing) and FX (for Fully Featured).
Xilinx develops IP (intellectual property) cores designed in HDL, which allow designers to minimize time to market. These IP cores range from simple functions (such as BCD encoders and counters) to complex systems (such as multi-gigabit networking cores and custom embedded microcontrollers like the fully featured MicroBlaze soft microprocessor and the compact PicoBlaze microcontroller). In addition, Xilinx Design Services (XDS) can create custom cores.
Xilinx offers Electronic Design Automation (EDA) tools for use with its devices. Chief among these is ISE, which offers a complete EDA flow. Domain-specific tools include Xilinx's Embedded Developer's Kit (EDK), which is aimed primarily at designers wishing to use the embedded PowerPC 405 core in the Virtex-II Pro and Virtex-4, or Xilinx's own soft microprocessor/microcontroller, in their designs. Other domain-specific tools include Xilinx's System Generator for DSP, which provides seamless simulation and implementation of high-performance DSP designs on Xilinx's FPGAs. A design in a Xilinx FPGA is implemented using the basic logic blocks of the FPGA. The desired functionality is built from the components available in the logic block.
The Xilinx FPGA design flow is shown in Figure A-1. The first step in implementing a design on an FPGA is the specification. The specification consists of the number of inputs, the number of outputs and the range of values that the kit can take in. Based on the specification the architecture is designed. The architecture describes the interconnections between all the blocks involved in the design.
Each block in the architecture, along with its interconnections, is modeled in either VHDL or Verilog depending on the requirement. All these blocks are then simulated and the outputs are verified for correct functionality. Simulation can be done at various levels of abstraction. Other simulations include post-synthesis simulation and post place and route simulation. Simulation needs a set of test vectors for checking the functionality.
Once the functional simulation is correct, the next step is synthesis. Synthesis converts the HDL description into a netlist. The netlist gives information about the functional hardware elements. The synthesis step gives two views of the design: one is technology dependent and the other is technology independent. The technology-independent view of the synthesis gives the design information in terms of gates and other components that do not depend on any technology. The technology-dependent view of the synthesis gives the hardware information in terms of LUTs and other components that depend on the Xilinx technology.
Place & Route is the next step, in which the tool places all the components on the FPGA die for optimum performance in terms of both area and speed. After the components are placed, the interconnections between them are routed. In the post place and route simulation step the tool considers the actual delays involved in the design and performs the simulation with these delays. These delays are due to electrical loading effects, wiring delays and stray capacitances. After post place and route simulation comes generation of the bit-map file, which means converting the VHDL/Verilog code into a bit stream that is used to configure the FPGA kit. A .bit file is generated in this step. The final step is downloading the bit-map file onto the FPGA board, which is done by connecting the computer to the FPGA board with a JTAG (Joint Test Action Group) cable; JTAG is an IEEE standard. The bit-map file contains the whole design that is to be used on the FPGA board.
APPENDIX-B
Spartan-3 Starter Kit Introduction:
The Xilinx Spartan-3 Starter Kit provides a low-cost, easy-to-use development and evaluation platform for Spartan-3 FPGA designs.
Figure-B-1 shows the Spartan-3 Starter Kit board, which includes the following components and features:
400,000-gate Xilinx Spartan-3 XC3S400 FPGA in a 256-ball thin Ball Grid Array package (XC3S400FT256) [1]
Table B-1 shows the device information of the Xilinx Spartan-3 FPGA.
The components and capacity of the Xilinx Spartan-3 FPGA are shown below.
2Mbit Xilinx XCF02S Platform Flash, in-system programmable configuration PROM [2]
1Mbit non-volatile data or application code storage available after FPGA configuration
Jumper options allow the FPGA application to read PROM data or FPGA configuration from other sources [3]
- Single 256Kx32 SRAM array, ideal for MicroBlaze code images
Second RS-232 transmit and receive channel available on board test points[8]
Push button switch to force FPGA reconfiguration (FPGA configuration happens automatically at power-on)[17]
Three 40-pin expansion connection ports to extend and enhance the Spartan-3 Starter Kit Board[19][20][21]
Compatible with Digilent, Inc. peripheral boards: https://digilent.us/Sales/boards.cfm#Peripheral
JTAG download/debug port compatible with the Xilinx Parallel Cable IV and MultiPRO Desktop Tool [24].
supply[25].
On-board 3.3V [27], 2.5V [28], and 1.2V [29] regulators
Component Locations:
Figure B-2 indicates the component locations on the top side and bottom side of the board, respectively.