FIR Filter Realization Via Deferred End-Around Carry Modular Addition

This article has been accepted for inclusion in a future issue of this journal.
Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS
Armin Belghadr and Ghassem Jaberipur
Manuscript received November 7, 2017; revised January 3, 2018; accepted January 21, 2018. This work was supported in part by IPM under Grant CS1396-2-03 and in part by Shahid Beheshti University. This paper was recommended by Associate Editor M. Mozaffari Kermani. (Corresponding author: Ghassem Jaberipur.)
A. Belghadr is with the Department of Computer Science and Engineering, Shahid Beheshti University, Tehran 1983963113, Iran (e-mail: a_belghadr@sbu.ac.ir).
G. Jaberipur is with the Department of Computer Science and Engineering, Shahid Beheshti University, Tehran 1983963113, Iran, and also with the School of Computer Science, Institute for Research in Fundamental Sciences, Tehran 1953833511, Iran (e-mail: jaberipur@sbu.ac.ir).
Digital Object Identifier 10.1109/TCSI.2018.2798595
1549-8328 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

Abstract— Hardware realization of FIR filters that are based on residue number systems leads to increased speed and reduced power, where besides the popular Mersenne numbers, several moduli of the form $2^n \pm \delta$ ($\delta \ge 3$) are commonly used. However, additional weighted $2^i$ ($i > 1$) end-around carries (EACs) slow down and complicate the required modular adders in comparison to modulo-($2^n - 1$) adders. For example, for $\delta = 3$, the modular sum is obtained via $A + B \mp 3c_{out}$, where $A$ and $B$ are modulo-($2^n \pm 3$) operands and $c_{out}$ is the carry-out of the binary addition $A + B$. In this paper, a new multioperand modular adder is proposed, where the key improvement is that all the required EAC additions (e.g., $+3c_{out}$) are postponed until after the last filter tap, whereby tens of addition operations take place without the EAC secondary addition; hence considerable savings in time, area consumption, and power dissipation. The proposed deferred EAC addition scheme has been applied to three previous relevant works. The corresponding synthesis results showed 11%–32%, 27%–29%, and 21%–37% reductions in delay, area, and power measures, respectively. This is achieved despite the area and power overhead of the few stages appended to the pipelined architecture of the filter, which are nevertheless shown to become less significant as the number of filter taps grows.

Index Terms— FIR filter, digital signal processing, residue number system, modular adder.

I. INTRODUCTION

Finite and infinite impulse response (FIR and IIR) filtering are quite common in digital signal processing (DSP). The former, in both transposed and direct forms, is also widely used in wireless communication and many other applications [1]. This is chiefly due to the linear phase and inherent stability of FIR filters. The key to such stability, contrary to the IIR case, is the lack of feedback, so that any bounded input results in a bounded output [2]. Hardware realization of a $k$-tap FIR filter is based on (1), where $x(t)$ is the temporal input signal and the coefficients $a_i$ ($0 \le i \le k - 1$) are provided either as constants or dynamic variables. Each tap includes a multiplication and an addition operation whose circuits can be considered as a pipeline stage.

$$y(t) = \sum_{i=0}^{k-1} a_i \, x(t - i) \tag{1}$$

Introduction of the residue number system (RNS) as a vehicle for implementation of the aforementioned addition and multiplication operations has been shown [3]–[14] to gain advantages in terms of speed, area consumption and power dissipation over the conventional binary FIR filter realizations. An RNS is recognized by a set of often mutually prime moduli as its bases, so that a wide-word multiplier or adder is broken down into narrower RNS computation channels that operate in parallel; hence better performance and commonly lower power dissipation, which is believed to be due to broken carry chains. However, the overhead of residue generation and the final RNS-to-binary conversion must be considered in evaluations. On the other hand, performance of an RNS operation is dictated by the slowest computing channel, which generally corresponds to the largest modulus. Therefore, it is desirable to set up a moduli set containing small moduli that lead to balanced speed of the corresponding channels. However, in order to accommodate the required dynamic range, one needs to increase the number of moduli, besides the commonly used moduli $2^n - 1$, $2^n$, and $2^n + 1$, and the less balanced $2^{n \pm p} - 1$. This calls, as a viable solution, for inclusion of moduli of the form $2^n \pm \delta$ ($\delta \ge 3$).

Several FIR filter realizations have been reported that in particular reiterate the benefits of RNS implementation. For example, three of the most recent relevant contributions (i.e., [9], [12], and [14]) and an older one [13] are notable and are briefly described below.
• In [9] a constant-coefficient (i.e., the $a_i$ constants in (1)) RNS-FIR filter architecture is proposed. The filter benefits from an improved version of a multiplication algorithm based on level-constrained common subexpression elimination [15]. It is shown that careful use of RNS leads to delay and energy efficiency. In terms of figures of merit, this work shows 22% performance improvement, up to 14% reduction in power consumption and also 12% area reduction, in comparison with 2's complement implementations of the FIR filter.
• In [12] the required multipliers for the utilized moduli set $\{2^n - 1, 2^n, 2^{n+1} - 1\}$ are implemented via preloaded
look-up tables (LUTs), where improved figures of merit are evident for the proposed RNS-FIR filters with respect to previous relevant works. Since the moduli set is fixed, the operation speed directly depends on the value of $n$. That is, should there be a need for larger dynamic ranges, slower filters result.
• The work of [14] shows that the RNS-FIR structures, in comparison to their binary counterparts, are more tolerant to delay variations, in the context of process-variation-prone nanoscale VLSI implementations. The RNS benefit is highlighted by reporting around 60% reduction in normalized delay variation against the normal 2's complement realizations.
• The RNS-FIR filter of [13] focuses on moduli sets with only prime moduli, which enables the use of isomorphic modular multipliers. As such, the authors show power efficiency of the RNS realization, wherein up to 50% power reduction has been achieved at the same clock rate as in the traditional binary implementations.

Moduli selection is of utmost importance in general DSP applications due to its significant impact on design parameters, so a simple methodology does not guarantee a suitable moduli set. For example, in order to speed up the operations via higher parallelism, it is possible to employ a large number of small moduli to fulfill a required dynamic range. This in turn undesirably escalates the complexity of the required reverse conversion circuitry. To better explore the vast design space induced by several possible moduli sets, Re et al. [16] have proposed a tool for automatic generation of the hardware description of RNS-FIR filters. This design tool is capable of choosing an appropriate moduli set according to a number of important criteria as listed in [2], as well as the required dynamic range and the target technology library specifications. For instance, the contents of Table I (borrowed from [17]), which are generated by the aforementioned tool, illustrate the moduli sets selected for different dynamic ranges, within an experiment aiming at the best area-delay tradeoff. Similar studies on proper moduli-set selection for efficient FIR filter implementations have also been reported in [4] and [18].

TABLE I. Selected Moduli Sets for an FIR Area-Delay Tradeoff Scenario [17]

While the aforementioned moduli selection study of [16] focuses on specific prime moduli, there are other FIR filter designs that are based on small parametric moduli sets such as $\tau = \{2^n, 2^n - 1, 2^n + 1\}$ [4], [9]–[11], [19], [20]. For example, the widest dynamic range in Table I can be accommodated by an instance of $\tau$ with $n = 16$, for which very efficient modular adders [21], [22] and multipliers [23], [24] are available. Nevertheless, to cover such large dynamic ranges with even smaller $n$ (i.e., bit-width), 2-level RNS architectures have been proposed [11], where a higher level of parallelism and fast small-sized modular arithmetic units are achieved, at the cost of doubling the effort for forward and reverse conversions.

In order to expand the dynamic range, while keeping the width of RNS channels reasonably low, moduli of the form $2^n \pm \delta$, for $\delta > 1$, can be useful in general parametric designs. Such moduli can also be found in the relevant designs with fixed moduli, such as the proposed 48-bit case of Table I. For example, $11 = 8 + 3$, $19 = 16 + 3$, $23 = 32 - 9$, $29 = 32 - 3$. Such an endeavor requires hardware realizations of the corresponding modular adders and multipliers. However, while we have not encountered any dedicated realization for the case of $2^n + \delta$ ($\delta > 1$), the fastest modulo-($2^n - \delta$) adder, tailored for $\delta = 2^q + 1$ (e.g., modulo-11, modulo-23, and modulo-29) [25], is slower than modular adders for the popular moduli set $\tau$. Although this modulo-($2^n - \delta$) adder can be considered as latency-compatible with the $\tau$-adders, the results in [25] show considerable additional area and power measures. On the other hand, in case of arbitrary $\delta$-values, no single-adder realization has been reported for modulo-($2^n - \delta$) or modulo-($2^n + \delta$) addition, whereas commonly one $n$-bit half carry-save adder (HCSA) and two parallel adders followed by an $n$-bit multiplexer (MUX) are required [26]. The need for an extra adder and a bulky multiplexer array stems from the necessity to incorporate the impact of the $\delta$-folded end-around carry (EAC) signal into the final addition result (see Section II).

In this paper, we propose a new multi-operand modular adder design via postponing the EAC handling of the individual addition operations, while keeping track of the EACs via an incrementor that is off the critical delay path (CDP). This leads to elimination of the aforementioned MUX array. Consequently, saving the delay and area of one $n$-bit MUX array within the CDP is expected, in addition to reduction of the consumed area via replacing one adder with an incrementor, and the corresponding power savings. Moreover, the usual additional complexities for the case of modulo-($2^n + \delta$) adders are avoided via implementing them as modulo-($2^{n+1} - \delta'$) adders, where $\delta' = 2^n - \delta$. As a common application of multi-operand modular addition, we study the impact of the proposed deferred EAC addition method in consecutive modulo-($2^n \pm \delta$) additions that occur in RNS realization of FIR filters with constant coefficients.

In Section II the RNS-FIR filter architecture and a commonly used general modular adder structure are introduced. The proposed FIR filter architecture with the new addition approach is presented in Section III. Section IV provides analytical and synthesis-based evaluations and comparison results, followed by our concluding remarks in Section V.

II. RNS-FIR FILTER ARCHITECTURE

Recalling (1), as the main equation for implementation of an FIR filter, its RNS realization is provided by the architecture of
Fig. 1, where the $l$ gray boxes represent the parallel modulo-$m_j$ ($1 \le j \le l$) channels that realize (2).

$$y_j(t) = \left| \sum_{i=0}^{k-1} \Big| \, |a_i|_{m_j} \, |x(t - i)|_{m_j} \Big|_{m_j} \right|_{m_j} \tag{2}$$

The residual factors $x_j = |x(t - i)|_{m_j}$ are prepared via the left forward conversion blocks in Fig. 1. However, the $|a_i|_{m_j}$ constants can be obtained via offline residue generators. In addition, an RNS-to-binary conversion unit (the rightmost block) composes the temporal outputs $y_j = |y(t)|_{m_j}$ into the $y(t)$ signals of (1).

Fig. 1. General view of an RNS-FIR filter.

Fig. 2 reveals the inside of each gray box (see Fig. 1), as the bulk of RNS-FIR filter computations, where there are $k$ consecutive filter taps, each of which contains a modular multiplier, a modular adder (except for the leading tap), and a pipeline register. It is well known that the pipelining clock should be set equal to the delay of one tap [2]. To reduce such clock period, researchers have taken advantage of fast LUT-based multipliers [12], [13]. To keep the size of the required LUTs reasonably low, the corresponding modulus should be adequately small. Therefore, as was discussed in Section I, one way of accommodating the required dynamic range is to employ as many moduli of the form $2^n \pm \delta$ as necessary.

Fig. 2. Modulo-$m_j$ $k$-tap FIR filter in transposed form.

The conventional realization of $|A + B|_{2^n - \delta}$ [26] implements (3), a realization of which is given in Fig. 3, as a compact version of the one in [26]. The provided fast parallel architecture consists of an array of half-adders/pseudo-half-adders that is marked as HCSA, where the other blocks (two adders and the multiplexer) are self-explanatory. However, since the figures of merit of this architecture are not compatible with those of the moduli set $\tau$, and given the corresponding considerations on different realizations of modulo-($2^n - \delta$) adders in Section I, we propose a new multi-operand architecture for such modular adders that works for arbitrary $\delta$ values with better impact in realization of RNS-FIR filters.

$$|A + B|_{2^n - \delta} = \begin{cases} |A + B + \delta|_{2^n} & \text{if } A + B + \delta \ge 2^n \\ A + B & \text{if } A + B + \delta < 2^n \end{cases} \tag{3}$$

Fig. 3. Detailed implementation of the modular adder (Equation 3).

The case of modulo-($2^n + \delta$) adders can be handled via (4), to be similarly realized by Fig. 3, where $\delta' = 2^n - \delta$. For example, modulo-19 adders can be realized as modulo-($32 - 13$) instead of modulo-($16 + 3$).

$$|A + B|_{2^n + \delta} = |A + B|_{2^{n+1} - \delta'} \tag{4}$$

A. Previous Works of Interest

In order to show the impact of our proposed method (as will be described in Section III) on the previously reported architectures for RNS-FIR filter realization, we briefly describe two selected works [12], [13] here. These works will be accordingly modified in the next section and their figures of merit will be compared with those of ours in Section IV.

The RNS-FIR filter of [13] is considered as an effective RNS realization that, as is reiterated in [27], leads to low-power realization of DSP applications. The main innovation of this work is to take advantage of the isomorphism technique [28] in implementation of the required multipliers, which reduces the complexity of modular multiplication into a modular addition and three LUT retrievals. Regarding the employed moduli set, however, all the moduli should be prime numbers.

Reddy and Sahoo [12] have proposed two approaches for constant-coefficient RNS-FIR realization, for the moduli set $\{2^n - 1, 2^n, 2^{n+1} - 1\}$, which are called "RNS1" and "RNS2." Both architectures take advantage of LUTs for binary-to-RNS conversion and radix-16 partial product generation to implement the required multipliers. The main difference between the two architectures is that in each tap of RNS1 (see Fig. 6) the radix-16 partial products are added to those of the previous tap, via non-modular adders with carry-in (i.e., the single-bit EAC from the previous tap), while in RNS2 (see Fig. 8) the partial products are reduced to one via CSAs and modular adders and added to that of the previous tap similar to RNS1.

III. PROPOSED DEFERRED EAC FIR FILTER ARCHITECTURE

Prominent impact of modular addition in the performance of RNS-FIR filters is evident through (2). The heavily required
modulo-$m_j$ addition, where $m_j = 2^n \pm \delta_j$, can be described via (5) and (6) (with $\delta$ representing an arbitrary $\delta_j$). These equations are elaborated versions of (3) and (4), where $W = A + B + \delta = w_n w_{n-1} \ldots w_0$ ($w_{n+1} w_n \ldots w_0$, in case of $2^n + \delta$) represents the interim sum.

$$|A + B|_{2^n - \delta} = \begin{cases} |W|_{2^n} & \text{if } w_n = 1 \\ W - \delta & \text{if } w_n = 0 \end{cases} \;=\; w_{n-1} \ldots w_0 - \delta \, \overline{w_n} \tag{5}$$

$$|A + B|_{2^n + \delta} = |A + B|_{2^{n+1} - \delta'} = w_n \ldots w_0 - \delta' \, \overline{w_{n+1}} \tag{6}$$

Direct realization of (5) (and likewise for (6)) requires the costly $-\delta \, \overline{w_n}$ operation in each tap. However, these subtractions (on $w_n = 0$ instances) can be avoided by accumulating the number of required subtractions in a register until after the last tap, when the accumulated value times $\delta$ is subtracted (i.e., only one subtraction). More details of this endeavor are depicted by Fig. 4 and Fig. 5a (the latter bearing the details of one tap of Fig. 4), supported by the following additional explanations. To count the required number of $-\delta$ operations, we use an incrementor per tap (i.e., the +1 boxes after each tap in Fig. 4). The incrementor within tap $j$ receives the current count and its clock is triggered if $w_n = 0$ in tap $j - 1$ (see also Fig. 5a). Let $p$ denote the value of the total count (i.e., the output of the last incrementor in the $(k + 1)$th tap). The correction term $-p\delta$ is obtained via an LUT that is preloaded with $k$ integers $|-p\delta|_{2^n - \delta}$, where $0 \le p < k$. This is indicated by a multiplication box in the $(k + 1)$th tap in Fig. 4, which is followed by a modulo-($2^n - \delta$) adder (see Fig. 3), as is shown by the gray crossed circle in the $(k + 2)$th tap of Fig. 4. To appreciate the simplicity of the new architecture (see Fig. 5a), referred to hereafter as "New1", and for ease of comparison, we provide Fig. 5b. This figure contains the structure of each tap of Fig. 2 based on the adder of Fig. 3 (i.e., a general modular adder [26]). To evaluate the figures of merit of [13], which was briefly described at the end of Section II, as one of the reference works, we consider the latter modular adder as if it had been utilized in [13], due to lack of adequate information therein.

Fig. 4. Proposed New1 modulo-$m_j$ $k$-tap FIR filter using deferred end-around carry modular addition.
Fig. 5. FIR filter tap comparison, a) Proposed architecture used in Fig. 4, b) Conventional architecture used in Fig. 2 and [13].

Functionality of the proposed realization of the RNS-FIR filter, regardless of its multipliers' detailed architecture, can also be demonstrated with a simple numerical example, as follows, for $k = 5$. This is with the understanding that the number of taps is commonly in the order of tens for high frequency selectivity in real applications [2]. Therefore, the overhead of our two extra stages becomes negligible.

Example 1 (5-Tap Case): Equation (7) describes an instance of (2), for $k = 5$.

$$|y(5)|_{m_j} = \left| \begin{aligned} & \Big| |a_0|_{m_j} \times |x(5)|_{m_j} \Big|_{m_j} + \Big| |a_1|_{m_j} \times |x(4)|_{m_j} \Big|_{m_j} + \Big| |a_2|_{m_j} \times |x(3)|_{m_j} \Big|_{m_j} \\ & + \Big| |a_3|_{m_j} \times |x(2)|_{m_j} \Big|_{m_j} + \Big| |a_4|_{m_j} \times |x(1)|_{m_j} \Big|_{m_j} \end{aligned} \right|_{m_j} \tag{7}$$
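The deferral described above can be checked with a small behavioral sketch. The Python model below is only an illustration under stated assumptions, not the circuit of Fig. 4: the function names are hypothetical, the tap products are assumed to be already reduced to the range $[0, m - 1]$, and the final modulo-($2^n - \delta$) adder stage is modeled with a plain `%` rather than the adder of Fig. 3. It compares a channel that applies the correction of (5) in every tap with one that keeps only $w_{n-1} \ldots w_0$, counts the $w_n = 0$ events in $p$, and adds the LUT entry $|-p\delta|_{2^n - \delta}$ once at the end.

```python
def modular_add_conventional(a, b, n, delta):
    """Modulo-(2**n - delta) addition following (3)/(5), corrected in every tap."""
    w = a + b + delta                    # interim sum W = A + B + delta (n+1 bits)
    w_n = (w >> n) & 1                   # most significant bit of W
    if w_n == 1:
        return w & ((1 << n) - 1)        # |W| mod 2**n: drop w_n, keep w_{n-1}..w_0
    return w - delta                     # w_n = 0: the per-tap "-delta" correction


def fir_channel_conventional(products, n, delta):
    """Accumulate tap products with a fully corrected modular adder per tap."""
    acc = products[0]                    # the leading tap has no adder
    for r in products[1:]:
        acc = modular_add_conventional(acc, r, n, delta)
    return acc


def fir_channel_deferred(products, n, delta):
    """Accumulate tap products while deferring all "-delta" corrections.

    Hypothetical behavioral model: the running value stays within n bits and
    is congruent to (true sum + p * delta) modulo m = 2**n - delta.
    """
    m = (1 << n) - delta
    mask = (1 << n) - 1
    # Correction LUT holding |-p * delta| mod m for 0 <= p < k (k = number of taps).
    lut = [(-p * delta) % m for p in range(len(products))]
    acc = products[0]
    p = 0                                # incrementor: counts the w_n = 0 events
    for r in products[1:]:
        w = acc + r + delta              # same interim sum W as in (5)
        w_n = (w >> n) & 1
        if w_n == 0:
            p += 1                       # remember that a "-delta" is still owed
        acc = w & mask                   # always keep w_{n-1}..w_0, never subtract
    # Single deferred correction; "%" stands in for the final modular adder stage.
    return (acc + lut[p]) % m


if __name__ == "__main__":
    # Illustrative 5-tap case with m = 23 = 2**5 - 9; the products are arbitrary.
    products = [20, 22, 5, 1, 2]
    print(fir_channel_conventional(products, 5, 9))
    print(fir_channel_deferred(products, 5, 9))
    print(sum(products) % 23)
```

All three printed values should agree. In this model the intermediate register may hold a value that is not a fully reduced residue; only the value after the final correction is guaranteed to lie in $[0, m - 1]$.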
Table II contains the steps of computation of $|y(5)|_{m_j}$ (for $m_j = 23$, and arbitrary values for $a_i$ and $x(5 - i)$) based on the proposed architecture of Fig. 4 (delivered in Stage 7), and that of Fig. 2 (Tap 5). Note that decimal values are used instead of the actual binary, for ease of tracing.

TABLE II. Numerical Example to Demonstrate Functionality of the Proposed Modular FIR Filter Architecture

To better demonstrate the impact of the proposed deferred EAC modular addition on RNS-FIR filter realization, without loss of generality, we apply our technique to two other quite recently reported architectures, RNS1 and RNS2 of [12], that were briefly described at the end of Section II. Consequently, Figs. 6 and 8 depict the corresponding two new designs New2 and New3, respectively, where details of each tap are illustrated by Figs. 7a (i.e., our technique applied on RNS1), 7b (RNS1 of [12]), 9a (ours on RNS2), and 9b (RNS2), respectively.

Fig. 6. Proposed New2 modulo-($2^8 - 1$) $k$-tap FIR filter, based on RNS1 of [12] and using deferred end-around carry modular addition.
Fig. 7. FIR filter tap comparison, a) Proposed architecture used in Fig. 6, b) Architecture used in RNS1 of [12].

IV. EVALUATION AND COMPARISON

In order to show the potential benefits of the proposed addition technique, our comparison baseline contains three of the best previous RNS-FIR realizations; namely two from [12] and one from [13]. The main influential components for comparing the figures of merit of the proposed and the reference works are the corresponding tap architectures (see Figs. 5, 7, and 9), and less importantly the extra stages in Figs. 4, 6, and 8.
Fig. 8. Proposed New3 modulo-($2^8 - 1$) $k$-tap FIR filter, based on RNS2 of [12] and using deferred end-around carry modular addition.
Fig. 9. FIR filter tap comparison, a) Proposed architecture used in Fig. 8, b) Architecture used in RNS2 of [12].

For evaluation and comparison purposes, the required hardware descriptions (i.e., all the required memory units, registers, buffers, adders and multipliers), for the circuits under test, regarding the three comparison sets, are implemented and simulated to verify their correct functioning. The corresponding HDL codes are mapped to the TSMC 90 nm CMOS standard cell library, using the Synopsys Design Compiler. Note that there are other components (such as binary-to-RNS and reverse converters) that are exactly the same in all designs and thus are not taken into account in our evaluations. Regarding the evaluation of power dissipation, it is worth mentioning that the power measures are extracted based on simulation of synthesis results with back-annotation of toggling activity, where uniformly distributed sample input values are applied.

Two general scenarios are considered in our evaluation experiments. One is comparing the results of single tap realizations (see Subsection A, below). The other mainly aims to consider the impact of the extra stages, and thus compares the corresponding implementations in one similar channel and in the span of all taps (including the extra stages), whose number varies from 16 to 1024 (see Subsection B). The exact measures resulting from our experiments are compiled in Tables III-VII. However, for ease of comparison and better reading, we provide eight plots (see Figs. 10-17) based on the contents of these tables. Note that since the architectures RNS1 and RNS2 of the reference work [12] cannot be realized for other moduli besides $2^n - 1$, we do the same for New2 and New3. However, for comparison with [13], note that neither their design nor ours is dependent on the value of $\delta$. Therefore, for synthesis purposes, we have considered the widest possible registers that may be required depending on $\delta$, $n$, and $k$. In other words, hardware realization for the modulo-($2^n \pm \delta$) filter tap/channel consumes the same amount of hardware and follows the same structure for any value of $\delta$, as long as it provides sufficient space for its registers. This is possible due to the realization of the multipliers using LUTs and also the use of a general structure for modular adders. However, in actual implementations, where
the moduli set is clearly known (i.e., all $\delta_j$ values are determined), register sizes can be precisely set with no additional memory allocation. Moreover, there are two restrictions in [13]; namely the moduli of the form $2^n \pm \delta$ should be prime and with $n \le 6$ to allow for reasonable LUT sizes. Although our contribution does not enforce such limitations, for fair comparison, our New1 design observes the same restrictions as in [13].

TABLE III. Area (mm²) and AT (Area-Time Product) Comparison for Single Tap
TABLE IV. Power (µW) and Energy (PDP) Comparison for Single Tap

A. Tap Comparison

Structural difference of the proposed tap architecture with those of the reference works can be captured by examining Figs. 5, 7 and 9, where critical delay paths (highlighted red) contain the same multiplier architecture within each comparison set.

We should reiterate that, in our experiments, all adders of [12] and [13] have been realized via parallel prefix architecture (see also Section III), while all other basic building blocks are implemented as were proposed in the original papers.

Comparison Set 1: Regarding the comparison of [13] and New1 designs, the Adder1 and one $n$-bit multiplexer of the former design are to be compared against the incrementor (and the associated register) of the latter. Therefore, obvious delay improvement (due to lack of multiplexer), and area/power reduction (due to lack of the Adder1 and multiplexer) are expected for the proposed work. This expectation is confirmed by the synthesis results that are reflected in Tables III and IV.

Comparison Sets 2 and 3: In the comparisons between architectures in [12] and the corresponding proposed circuits, components of the critical delay paths are the modular and non-modular adder blocks of the reference work and the simple non-modular adders and incrementors of the proposed architectures. In the proposed realizations, lower
area and power consumptions are expected, due to lack of carry-in supplements in the utilized parallel prefix adder, and replacement of large modular adder blocks with non-modular simple adders. The curves in Figs. 10-13 compare the area, area-time product (AT), power, and energy (power-delay product or PDP) measures, respectively, for the three comparison sets (i.e., [13] vs. New1, RNS1 of [12] vs. New2, and RNS2 of [12] vs. New3) and only for $n = 8$. These are drawn based on the data of Tables III and IV, which are extracted from repeated experiments with different time constraints applied to the synthesis process. Note that the curves related to the proposed designs cover a range of higher working frequencies (see the left-most part of the curves) not shared with those of the reference works, where the synthesis tool was not able to produce circuits as fast.

Fig. 10. Area comparison of FIR filter tap (New designs vs. references).
Fig. 11. AT comparison of FIR filter tap (New designs vs. references).
Fig. 12. Power comparison of FIR filter tap (New designs vs. references).
Fig. 13. Energy comparison of FIR filter tap (New designs vs. references).

The least delay values acquired by the synthesis tool show 13.2%, 16.8%, and 48.9% speed-up against the reference works [13], RNS1 [12], and RNS2 [12], respectively. Greater impacts are evident through the curves of Fig. 10 for area reduction (27.3%, 27.8%, and 29.2%) and Fig. 12 for power saving (21.2%, 28.7%, and 37.2%). Note that the latter reported area and power improvements regard the highest frequency achieved by both designs in each comparison set, not those of the highest frequency experienced by each design. To better capture the exact figures of merit, the values are

TABLE V. Comparison of Figures of Merit for Single Tap of FIR Filter (Improvement Figures in the 2's Complement Column Regard the Percentage of New2 Measures Over the 2's Complement Experience)
summarized in Table V. Acknowledgement of RNS advantages in FIR filter realization by previous studies was addressed in Section I. Nevertheless, we provide design parameters for a plain 2's complement FIR tap implementation with the same technology, obtained based on a behavioral description for the synthesis tool. The results are compared against the fastest RNS alternative (i.e., New2) in Table V. Superiority of the latter is clearly evident.

Although we have run our experiments only for $n = 8$ and $n = 16$, it is easy to analytically conclude that the same superiority of the new designs in all three comparison sets could be experienced for larger values of $n$. The reason is that the impact of $n$ is equally sensed in the depth (i.e., the number of node levels) of the parallel prefix adders utilized in both designs of each comparison set. That is, doubling $n$ adds one parallel prefix level with the effect of additional delay of two simple 2-input gates, and extra area consumption and power dissipation due to the almost doubled size of circuits.

B. Overhead of Extra Stages

The synthesis results show that the delays of the two extra terminating stages in Fig. 4 (i.e., 0.48 ns and 0.36 ns) are less than that of a single tap and thus do not violate the pipeline clocking that is based on one filter tap delay. Similarly, extra circuitry exists at the final stages of the reference architectures RNS1 and RNS2 of [12] that must be evaluated in this part. The improved area and power measures that were reported in Tables III and IV regard one tap comparison. However, in order to consider the overhead of the added final stages of the proposed and reference architectures, modulo-($2^n \pm \delta$) filter channels ($2^n - 1$ for the comparison with reference works RNS1 and RNS2) with varying number of taps (including the final taps) are realized and compared for the proposed and reference designs.

Tables VI and VII provide the area/AT and power/energy measures, respectively, which are obtained for the highest working frequency that is experienced by both designs in each comparison set, for operands of width $n = 8, 16$ and seven different numbers of taps from 16 to 1024. For ease of comparison, we use the tables' contents for $n = 8$ and different numbers of taps to draw the plots in Figs. 14-17 that demonstrate the improvement percentage (regarding the proposed design vs. the referenced work in each comparison set) in area, AT, power, and energy, respectively. Convergence of

Fig. 14. Area improvement for modulo-($2^8 - \delta$) channel of the realized filters for the three comparison sets.
Fig. 15. AT improvement for modulo-($2^8 - \delta$) channel of the realized filters for the three comparison sets.
Fig. 16. Power improvement for modulo-($2^8 - \delta$) channel of the realized filters for the three comparison sets.
Fig. 17. PDP improvement for modulo-($2^8 - \delta$) channel of the realized filters for the three comparison sets.
the percentage curves to constant values towards the high number of taps indicates that the overhead of the extra stages becomes negligible as the number of taps grows, as is the case for high frequency selectivity applications. Also note that the improvement curves related to the comparison set of RNS1 [12] and New2 show less variation against different numbers of taps. This is due to the extra circuitry employed at the final stage of the RNS1 filter architecture.

TABLE VI. Area and AT (Area-Time Product) Values for Modulo-($2^n - \delta$) FIR Filter Channel Designs
TABLE VII. Power and Energy (Power-Delay Product) Values for Modulo-($2^n - \delta$) FIR Filter Channel Designs

V. CONCLUSION

It is well known that introduction of the residue number system into hardware realization of FIR filters is advantageous. On the other hand, different studies on the problem of RNS moduli selection concluded that the popular moduli set $\tau = \{2^n, 2^n + 1, 2^n - 1\}$ may not be adequate to cover the required dynamic range in such applications. Therefore, the use of repeated moduli of the form $2^n \pm 1$, with larger $n$, and additional moduli of the form $2^n \pm \delta$ has been experienced in RNS realization of FIR filters. Regarding the required modulo-($2^n \pm \delta$) adders, however, the existing realizations for arbitrary $\delta$-values are by far not adequately compatible with $\tau$-adders, particularly in timing balance. This can result in longer tap delay, a problem for which we were motivated to look for a solution.

We reviewed the previous RNS-FIR filter realizations and focused on their multi-operand modular adder architectures, which we found not very efficient. Therefore, we proposed a new technique for realization of such adders that is based on postponing the end-around carry (EAC) addition. We studied the impact of this method in consecutive modulo-($2^n \pm \delta$) additions that occur in FIR filter RNS realization. A few
extra stages were added to the sequence of taps for EAC correction, with no penalty on the pipeline clocking time. Our synthesis results showed reductions of 11-32%, 27-29%, and 21-37% in delay, area, and power measures, respectively. These improvements were achieved despite the area and power cost of the stages added to the pipelined architecture of the filter, which was shown to become less significant as the number of filter taps grows.

As for future work, we plan to study the impact of the deferred EAC technique on RNS multipliers, especially in RNS-FIR filter applications.

REFERENCES

[1] L. Tan and J. Jiang, Digital Signal Processing: Fundamentals and Applications, 2nd ed. Orlando, FL, USA: Academic, 2013, p. 876.
[2] A. Nannarelli and M. Re, “Residue number systems: A survey,” Tech. Univ. Denmark, Kongens Lyngby, Denmark, Tech. Rep. 2008-04, 2008.
[3] W. L. Freking and K. K. Parhi, “Low-power FIR digital filters using residue arithmetic,” in Proc. Conf. Rec. 31st Asilomar Conf. Signals, Syst. Comput., vol. 1. Nov. 1997, pp. 739–743.
[4] R. Conway and J. Nelson, “Improved RNS FIR filter architectures,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 51, no. 1, pp. 26–28, Jan. 2004.
[5] G. C. Cardarilli, A. Del Re, A. Nannarelli, and M. Re, “Low power and low leakage implementation of RNS FIR filters,” in Proc. Conf. Rec. 39th Asilomar Conf. Signals, Syst. Comput., 2005, pp. 1620–1624.
[6] T. K. Shahana, R. K. James, B. R. Jose, K. P. Jacob, and S. Sasi, “Performance analysis of FIR digital filter design: RNS versus traditional,” in Proc. ISCIT, Sydney, NSW, Australia, 2007, pp. 1–5.
[7] W. Jenkins and B. Leon, “The use of residue number systems in the design of finite impulse response digital filters,” IEEE Trans. Circuits Syst., vol. CS-24, no. 4, pp. 191–201, Apr. 1977.
[8] G. C. Cardarilli, A. Nannarelli, and M. Re, “Residue number system for low-power DSP applications,” in Proc. Conf. Rec. 41st Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, USA, 2007, pp. 1412–1416.
[9] P. Patronik, K. Berezowski, S. J. Piestrak, J. Biernat, and A. Shrivastava, “Fast and energy-efficient constant-coefficient FIR filters using residue number system,” in Proc. ISLPED, Fukuoka, Japan, 2011, pp. 385–390.
[10] D. Živaljević, N. Stamenković, and V. Stojanović, “Digital filter implementation based on the RNS with diminished-1 encoded channel,” in Proc. TSP, Prague, Czech Republic, 2012, pp. 662–666.
[11] N. I. Chervyakov, P. A. Lyakhov, and K. S. Shulzhenko, “FIR filters in two-stage residue number system,” in Proc. EnT, Moscow, Russia, 2014, pp. 26–29.
[12] K. S. Reddy and S. K. Sahoo, “An approach for fixed coefficient RNS-based FIR filter,” Int. J. Electron., vol. 104, no. 8, pp. 1–19, 2017.
[13] G. C. Cardarilli, A. Nannarelli, and M. Re, “Reducing power dissipation in FIR filters using the residue number system,” in Proc. 43rd IEEE Midwest Symp. Circuits Syst., vol. 1. Lansing, MI, USA, Aug. 2000, pp. 320–323.
[14] I. Kouretas and P. Vassilis, “Delay-variation-tolerant FIR filter architectures based on the residue number system,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, pp. 2223–2226.
[15] J. H. Choi, N. Banerjee, and K. Roy, “Variation-aware low-power synthesis methodology for fixed-point FIR filters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 1, pp. 87–97, Jan. 2009.
[16] A. Del Re, A. Nannarelli, and M. Re, “A tool for automatic generation of RTL-level VHDL description of RNS FIR filters,” in Proc. Eur. Conf. Exhib. Design, Autom. Test, vol. 1. 2004, pp. 686–687.
[17] G. C. Cardarilli, A. Del Re, A. Nannarelli, and M. Re, “Impact of RNS coding overhead on FIR filters performance,” in Proc. Conf. Rec. 41st Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, USA, 2007, pp. 1426–1429.
[18] Y. Liu and E. M.-K. Lai, “Moduli set selection and cost estimation for RNS-based FIR filter and filter bank design,” Design Autom. Embed. Syst., vol. 9, no. 2, pp. 123–139, Jun. 2004.
[19] S. Negovan, “Digital FIR filter architecture based on the residue number system,” Facta Univ.-Ser., Electron. Energetics, vol. 22, pp. 125–140, Apr. 2009.
[20] A. Lindahl and L. Bengtsson, “A low-power FIR filter using combined residue and radix-2 signed-digit representation,” in Proc. DSD, 2005, pp. 42–47.
[21] L. Kalampoukas, D. Nikolos, C. Efstathiou, H. T. Vergos, and J. Kalamatianos, “High-speed parallel-prefix modulo 2^n − 1 adders,” IEEE Trans. Comput., vol. 49, no. 7, pp. 673–680, Jul. 2000.
[22] G. Jaberipur and S. Nejati, “Balanced minimal latency RNS addition for moduli set 2^n − 1, 2^n, 2^n + 1,” in Proc. 18th Int. Conf. Syst., Signals, Image Process., Jun. 2011, pp. 159–165.
[23] H. T. Vergos and C. Efstathiou, “Design of efficient modulo 2^n + 1 multipliers,” IET Comput. Digit. Techn., vol. 1, no. 1, pp. 49–57, Jan. 2007.
[24] G. Jaberipur and B. Parhami, “Efficient realisation of arithmetic algorithms with weighted collection of posibits and negabits,” IET Comput. Digit. Techn., vol. 6, no. 5, pp. 259–268, Sep. 2012.
[25] S. H. F. Langroudi and G. Jaberipur, “Modulo-(2^n − 2^q − 1) parallel prefix addition via excess-modulo encoding of residues,” in Proc. ARITH, Lyon, France, 2015, pp. 121–128.
[26] H. T. Vergos and C. Efstathiou, “On the design of efficient modular adders,” J. Circuits Syst. Comput., vol. 14, no. 5, pp. 965–972, 2005.
[27] G. C. Cardarilli, A. Nannarelli, and M. Re, “RNS applications in digital signal processing,” in Embedded Systems Design with Special Arithmetic and Number Systems. Cham, Switzerland: Springer, 2017, pp. 181–215.
[28] I. M. Vinogradov, An Introduction to the Theory of Numbers. New York, NY, USA: Pergamon, 1955.

Armin Belghadr received the B.S. degree in computer hardware engineering and the M.S. degree in computer architecture from Shahid Beheshti University, Tehran, Iran, in 2011 and 2013, respectively. He is currently pursuing the Ph.D. degree in computer architecture with the Department of Computer Science and Engineering, Shahid Beheshti University. He focuses on teaching and research in the mainstreams of computer-aided design, computer arithmetic, and 3-D field programmable gate arrays. His research interests include computer arithmetic and particularly residue and redundant number systems.

Ghassem Jaberipur received the B.S. degree in electrical engineering from the Sharif University of Technology in 1974, the M.S. degree in engineering from UCLA in 1976, the M.S. degree in computer science from the University of Wisconsin, Madison, in 1979, and the Ph.D. degree in computer engineering from the Sharif University of Technology in 2004. He is currently a Professor of computer engineering with the Department of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran. He is also with the School of Computer Science, Institute for Research in Fundamental Sciences, Tehran. His main research interest is in computer arithmetic. He is recognized as one of the 50 distinguished graduates for years 1966–2016 in the Sharif University of Technology.
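The deferred EAC idea summarized in the Conclusion can be illustrated in software. The following Python sketch is a behavioral model only, not the authors' circuit: it assumes a modulus of the form 2^n − δ with 0 < δ < 2^n, uses the congruence 2^n ≡ δ (mod 2^n − δ), and models the modular multiplier with the `%` operator. The function names (`fold_eac`, `fir_channel_immediate`, `fir_channel_deferred`) and the carry counter are hypothetical and chosen for illustration.

```python
def fold_eac(value, n, delta):
    """Fold all bits of weight >= 2^n back into the low n bits.

    Relies on 2^n being congruent to delta modulo (2^n - delta).
    Assumes 0 < delta < 2^n, so each pass strictly shrinks the value.
    """
    mask = (1 << n) - 1
    while value >> n:
        value = (value & mask) + delta * (value >> n)
    return value


def fir_channel_immediate(coeffs, samples, n, delta):
    """Reference scheme: the EAC correction is applied after every tap."""
    m = (1 << n) - delta
    acc = 0
    for a, x in zip(coeffs, samples):
        product = (a * x) % m  # stand-in for a modular multiplier (assumption)
        acc = fold_eac(acc + product, n, delta)
    return acc % m


def fir_channel_deferred(coeffs, samples, n, delta):
    """Deferred scheme: taps only do plain n-bit additions and record
    their carry-outs; the weighted EAC is added once after the last tap."""
    m = (1 << n) - delta
    mask = (1 << n) - 1
    low = 0      # n-bit running sum, no correction applied per tap
    carries = 0  # number of carry-outs postponed so far
    for a, x in zip(coeffs, samples):
        product = (a * x) % m  # stand-in for a modular multiplier (assumption)
        s = low + product
        carries += s >> n      # carry-out of the plain binary addition
        low = s & mask
    # Terminating stage: add delta once per deferred carry, then reduce.
    return fold_eac(low + delta * carries, n, delta) % m
```

Both functions return the same residue as a direct evaluation of the sum of products modulo 2^n − δ; the deferred version simply moves the δ-weighted correction out of the per-tap loop. For a modulus of the form 2^n + δ the correction sign changes, which this sketch does not cover.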
