A Flexible Implementation of High-Performance FIR

INSTITUT FÜR THEORETISCHE NACHRICHTENTECHNIK UND INFORMATIONSVERARBEITUNG, UNIVERSITÄT HANNOVER
A Flexible Implementation of High-Performance FIR Filters on Xilinx FPGAs

Tien-Toan Do, Holger Kropp, Carsten Reuter, Peter Pirsch
PRELIMINARY VERSION
To appear in: Proceedings of the 8th International Workshop on Field Programmable Logic and Applications, September 1998
Laboratorium für Informationstechnologie, University of Hannover, Schneiderberg 32, 30167 Hannover, Germany
{toan, kropp, reuter}@mst.uni-hannover.de http://www.mst.uni-hannover.de/

Abstract. Finite impulse-response filters (FIR filters) are very commonly used in digital signal processing applications and are traditionally implemented using ASICs or DSP processors. For FPGA implementation they are a challenging subject, because of the high throughput rate and large computational power required under real-time constraints. Indeed, the limited resources on an FPGA, i.e., logic blocks and flip-flops, and furthermore the high routing delays, require compact implementations of the circuits. Hence, in lookup table-based FPGAs, e.g. Xilinx FPGAs, FIR filters have usually been implemented using distributed arithmetic. However, such filters can only be used where the filter coefficients are constant. In this paper, we present approaches for a more flexible FPGA implementation of FIR filters. Using pipelined multipliers which are carefully adapted to the underlying FPGA structure, our FIR filters do not require a predefinition of the filter coefficients. Combining pipelined multipliers and parallel distributed arithmetic results in different trade-offs between hardware cost and flexibility of the filters. We show that clock frequencies of up to 50 MHz are achievable using Xilinx XC40xx-5 FPGAs.
Introduction
Belonging to the so-called low-level DSP algorithms, finite impulse-response filtering represents a substantial part of digital signal processing. Low-level DSP algorithms are characterized by their high regularity, but they also require high computational performance. Moreover, if the processing has to be performed under real-time conditions, those algorithms have to sustain high throughput rates. An N-tap FIR filtering algorithm can be expressed, like many other DSP algorithms, as an arithmetic sum of products:
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) \qquad (1)$$
where y(i) and x(i) are the response and the input at time i, respectively, and h(k), for k = 0, 1, ..., N - 1, are the filter coefficients. Hence, the implementation of an N-tap FIR filter as expressed in equation (1) requires N multiplications, which are very costly in terms of hardware and computation time. However, in the many cases of digital signal processing where symmetric FIR filters are required, the number of multiplications can be reduced. For the coefficients of such filters, the following relation holds [1]:

$$h(k) = h(N-k-1), \quad k = 0, 1, 2, \ldots, N-1 \qquad (2)$$
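Equations (1) and (2) can be illustrated with a minimal software sketch. It is not taken from the paper: the function names, the zero-padding of samples before time 0, and the example coefficients are assumptions made for illustration. It shows that pre-adding the two samples that share a coefficient gives the same output with about half the multiplications.

```python
def fir_direct(h, x, i):
    """Equation (1): y(i) = sum over k of h(k) * x(i - k); x(j) is taken as 0 outside the list."""
    return sum(h[k] * x[i - k] for k in range(len(h)) if 0 <= i - k < len(x))


def fir_symmetric(h, x, i):
    """Same output, using h(k) = h(N - k - 1) from relation (2) to pre-add paired samples."""
    n = len(h)

    def sample(j):
        # Assumption: samples outside the recorded range are zero.
        return x[j] if 0 <= j < len(x) else 0

    acc = 0
    for k in range(n // 2):
        # One multiplication serves the two taps k and N - k - 1.
        acc += h[k] * (sample(i - k) + sample(i - (n - 1 - k)))
    if n % 2:
        # Odd N: the centre tap has no partner.
        acc += h[n // 2] * sample(i - n // 2)
    return acc


h = [1, 3, 5, 7, 7, 5, 3, 1]  # hypothetical symmetric 8-tap coefficients
x = [10, 0, 255, 17, 128, 64, 3, 200, 99, 1]
assert all(fir_direct(h, x, i) == fir_symmetric(h, x, i) for i in range(len(x)))
```

For the 8-tap example the folded form uses four multiplications per output instead of eight.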
Utilizing relation (2) can almost halve the number of required multiplications. Thus, only symmetric FIR filters are considered here. Further, filters whose coefficients are constant can be implemented at low hardware cost using bit-plane structures, distributed arithmetic (DA) [2] or lookup-table multipliers (LUTMULT) instead of conventional hardware multipliers. Especially for FPGAs where lookup tables (LUTs) are the underlying logic blocks, e.g., Xilinx FPGAs [3], DA techniques [4] and LUTMULT are a convenient way to realize FIR filters with constant coefficients at low cost. Nevertheless, such filters are unsuitable if the filter coefficients have to be varied frequently. This is the case, e.g., when emulating algorithms, where the influence of variations of the algorithm parameters on the quality of the processed signals must be investigated. Hence, in this paper, we present approaches leading to an efficient, flexible and modular realization of symmetric FIR filters on Xilinx XC40xx FPGAs. FPGA implementations of pipelined filters using parallel distributed arithmetic, together with implementation results, are discussed in Section 2. In Section 3, the alternative approach, an implementation using conventional hardware multipliers which are carefully adapted to the underlying FPGA structure, is considered. Section 4 provides concluding remarks.
Distributed-Arithmetic FIR Filters
In essence, distributed arithmetic (DA) is a computation technique that performs multiplication using lookup table-based schemes [5]. DA techniques permit computations in the form of a sum of products, as expressed in equation (1), to be decomposed into repetitive lookup-table procedures, the results of which are then accumulated to produce the final result. Since Xilinx XC4000 FPGAs are based on lookup tables, distributed arithmetic is a convenient way to implement multiply-intensive algorithms like FIR filters, provided that one of the multiplication operands is constant. The bits of the other operand are then used as address lines for looking up a table. This table is, in fact, a memory, e.g. ROM or RAM, which stores the possible products of the first operand and the possible values of the second operand (Fig. 1). FPGA implementation of FIR filters using serial distributed arithmetic has been proposed in [4] and [6], where implementation results are also described. We realize fully parallel DA FIR filters on the Xilinx XC4000, as depicted in Figure 1, where an 8-tap, 8-bit symmetric filter is sketched. To ensure a compact realization of the circuit, the LUT sizes are tailored to the precision required for the output data. So, for a given precision, the LUT sizes are not uniform [6], but depend on the positions of the individual bits, i.e. LUTs for the less significant bits are smaller. Furthermore, in order to obtain high performance, the filters are pipelined after every 4-bit adder, whose delay amounts to about 18 ns to 20 ns on an XC4000-5. The 8-tap, 8-bit symmetric FIR filter depicted in Figure 1 requires 140 CLBs and can run at frequencies of up to 50 MHz on an XC4000-5. The latency of this filter is 14 clock cycles.
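As a rough plausibility check on these timing figures, the clock rate of a pipelined datapath is bounded by its slowest stage. Taking the quoted 4-bit adder delay as that stage delay, and ignoring register and routing overhead (an assumption; the text does not break the delay down further), the bound works out as:

```latex
f_{\mathrm{clk}} \le \frac{1}{t_{\mathrm{stage}}}, \qquad
\frac{1}{20\,\mathrm{ns}} = 50\,\mathrm{MHz}, \qquad
\frac{1}{18\,\mathrm{ns}} \approx 55.6\,\mathrm{MHz}
```

A stage delay of 20 ns therefore corresponds to the 50 MHz figure quoted for the XC4000-5.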
[Block diagram. Recoverable labels: in (8 bit), registers (REG), 8-bit + 8-bit adders with 9-bit sums, lookup tables (LUT), adders, out (17 bit).]
Fig. 1. Distributed-Arithmetic FIR Filter on Xilinx XC4000
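The lookup-table principle behind such a DA filter can be sketched in software. This is a simplified model under stated assumptions, not the circuit of Fig. 1: samples are treated as unsigned, every bit plane uses the same full-width table (the design described here uses non-uniform LUT sizes), and the names `build_lut` and `da_sum_of_products` are hypothetical.

```python
def build_lut(coeffs):
    """Table of partial sums: entry `addr` holds the sum of coeffs[t] for every set bit t of addr."""
    return [sum(c for t, c in enumerate(coeffs) if (addr >> t) & 1)
            for addr in range(1 << len(coeffs))]


def da_sum_of_products(coeffs, samples, bits):
    """Compute sum(coeffs[t] * samples[t]) for unsigned samples that fit in `bits` bits."""
    lut = build_lut(coeffs)  # fixed once the (constant) coefficients are known
    acc = 0
    for b in range(bits):
        # Bit b of every sample forms the LUT address for this bit plane.
        addr = sum(((s >> b) & 1) << t for t, s in enumerate(samples))
        # Weight the looked-up partial sum by 2**b and accumulate.
        acc += lut[addr] << b
    return acc


coeffs = [3, -1, 4, 2]        # hypothetical constant coefficients
samples = [17, 300, 0, 511]   # e.g. four pre-added 9-bit sample pairs
assert da_sum_of_products(coeffs, samples, bits=9) == sum(
    c * s for c, s in zip(coeffs, samples))
```

In a serial DA filter the loop over `b` runs over successive clock cycles with a single table. In a fully parallel version each bit plane gets its own table and the accumulation becomes an adder tree. In both cases the table contents depend on the coefficients, which is why changing a coefficient means rebuilding the tables.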
While fully precise filtering requires the data stored in every LUT to be 10 bits wide and outputs 19-bit data, the maximum absolute error (= 1024) caused by our 8-bit, 8-tap FIR filter depicted in Fig. 1 is nearly the same as that caused by the corresponding Xilinx DA filter (= 1022), in which the LUTs are uniformly wide and require 36 CLBs. The number of CLBs for all the LUTs in our design is 27. Hence, high-performance digital filters can be implemented at low hardware cost on LUT-based FPGAs using the DA technique. The main drawback of this approach is that the DA technique requires the predefinition of the filter coefficients. In many FPGA applications for DSP, e.g. hardware emulation of DSP algorithms, filters are needed which allow frequent and flexible modification of the filter coefficients.
FIR Filters with Conventional Hardware Multipliers
Though multipliers are costly, they are unavoidable for filters whose coefficients have to be varied frequently. Hence, we have investigated an efficient FPGA implementation of FIR filters using pipelined array multipliers. For processing at a sample rate comparable to that of the above DA filter, the 8-by-9 multipliers of the filter are two-row pipelined [7], as illustrated in Figure 2, and their structure is adapted and carefully mapped onto the target architecture, i.e. the Xilinx FPGA. Further, for the same precision as for the above DA filter, the eight rightmost product bits from the multiplication are cut off (max. absolute error = 1023). The filter has a latency of 13 clock cycles and requires 390 CLBs. The achievable frequency for this filter on an XC4000-5 is about 45 MHz to 50 MHz. In comparison with a parallel distributed-arithmetic FIR filter (Fig. 1), the hardware cost for the FIR filter with conventional hardware multipliers (Fig. 2) is increased, while the performance is about the same.

[Block diagram. Recoverable labels: in (8 bit), registers (REG), 8-bit + 8-bit adders with 9-bit sums, multipliers (MULT) with coefficient inputs (coeff.), adders ADD3 and ADD4, out (11 bit).]
Fig. 2. FIR Filter with conventional multipliers on Xilinx XC4000
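The data flow of the multiplier-based filter can be modelled in a few lines. This is a behavioural sketch only: the function name and argument layout are hypothetical, truncation is modelled as an arithmetic right shift of each product, and signedness and pipelining are not modelled because the text does not specify them at this level.

```python
def fir_multiplier_based(coeffs_half, window, drop_bits=8):
    """One output sample of a symmetric FIR filter with run-time coefficients.

    coeffs_half: the first N/2 coefficients (the rest follow from symmetry).
    window:      the N most recent input samples, newest first.
    drop_bits:   number of least significant product bits that are cut off.
    """
    n = len(window)
    assert n == 2 * len(coeffs_half), "sketch covers even N only"
    acc = 0
    for k, c in enumerate(coeffs_half):
        pre_add = window[k] + window[n - 1 - k]  # 8 bit + 8 bit gives 9 bit
        product = c * pre_add                    # 8-by-9 multiplication
        acc += product >> drop_bits              # discard the low product bits
    return acc


window = [255] * 8
print(fir_multiplier_based([10, 20, 30, 40], window))  # truncated result
print(fir_multiplier_based([0, 0, 0, 256], window))    # new coefficients, same hardware model
```

Unlike a lookup-table scheme, the coefficients here are ordinary operands, so they can change from one call to the next without rebuilding any table. The cost is one full multiplier per coefficient.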
Because the achievable frequencies for the above DA filters and filters with conventional hardware multipliers are about the same, they can be combined in a hybrid approach leading to different trade-offs between hardware cost and flexibility.
Conclusions
Using Xilinx XC40xx-5 FPGAs, clock frequencies of up to about 45 MHz to 50 MHz are achievable for FIR filters. While the DA approach leads to low-cost implementations of FIR filters on lookup table-based FPGAs, FIR filters with conventional hardware multipliers are more flexible. In spite of their high cost, such filters are desirable in many cases where the filter coefficients have to be varied frequently. An example is the hardware emulation of algorithms, where the influence of variations of the algorithm parameters, e.g., filter coefficients, on the processing has to be investigated. Combining the above approaches leads to different trade-offs between hardware cost and flexibility.
Acknowledgment
One of the authors, T.-T. Do, receives a scholarship from the German Academic Exchange Service (Deutscher Akademischer Austauschdienst - DAAD). He is grateful to this organization for supporting his research.
References
1. A. V. Oppenheim, R. W. Schafer: Digital Signal Processing, Prentice Hall (1975)
2. P. Pirsch: Architectures for Digital Signal Processing, John Wiley & Sons (1997)
3. Xilinx Inc.: The Programmable Logic Data Book (1996)
4. L. Mintzer: FIR Filters with Field-Programmable Gate Arrays, IEEE Journal of VLSI Signal Processing (August 1993) 119-128
5. C. S. Burrus: Digital Filters Structures described by Distributed Arithmetic, IEEE Trans. on Circuits and Systems (1977) 674-680
6. Xilinx Inc.: Core Solutions (May 1997)
7. T.-T. Do, H. Kropp, M. Schwiegershausen, P. Pirsch: Implementation of Pipelined Multipliers on Xilinx FPGAs - A Case Study, 7th International Workshop on Field-Programmable Logic and Applications, Proceedings (1997)
