
Linköping Studies in Science and Technology

Thesis No. 1030


STUDIES ON IMPLEMENTATION OF
LOW POWER FFT PROCESSORS
Weidong Li
LiU-Tek-Lic-2003:29
Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden
Linköping, June 2003
Studies on Implementation of
Low Power FFT Processors
Copyright © 2003 Weidong Li
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping
Sweden
ISBN 91-7373-692-9 ISSN 0280-7971
To the memory of my father.
Abstract
In the last decade, the interest in high-speed wireless and cable-based communication has increased. Orthogonal Frequency Division Multiplexing (OFDM) is a strong candidate and has been suggested or standardized for those communication systems. One key component in OFDM-based systems is the FFT processor, which performs the efficient modulation/demodulation.
There are many FFT architectures. Among them, the pipeline architectures are suitable for real-time communication systems. This thesis presents the implementation of pipeline FFT processors with low power consumption.
We select the meet-in-the-middle design methodology for the implementation of the FFT processors. A resource analysis for the pipeline architectures is presented. This resource analysis determines the number of memories, butterflies, and complex multipliers needed to meet the specification.
We present a wordlength optimization method for the pipeline architectures. We show that the high-radix butterfly can be efficiently implemented with the carry-save technique, which reduces the hardware complexity and the delay. We also present an efficient implementation of a complex multiplier using distributed arithmetic (DA). The implementation of low voltage memories is also discussed.
Finally, we present a 16-point butterfly using constant multipliers that reduces the total number of complex multiplications. The FFT processor using the 16-point butterflies is a competitive candidate for low power applications.
Acknowledgement
I would like to thank my supervisor, Professor Lars Wanhammar, for his support and guidance of this research. I would also like to thank the whole Electronics Systems group at Linköping University for their help in discussions on research as well as other matters. I would like to express my gratitude to Oscar Gustafsson, Henrik Ohlsson, and Per Löwenberg for the proofreading.
Finally, and most importantly, I would like to thank my family, relatives, and friends, especially A Phung, for their boundless support and encouragement.
This work was financially supported by the Swedish Foundation for Strategic Research (SSF) under the INTELECT program.
Table of Contents
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1. DFT and FFT ..................................................................... 2
1.2. OFDM Basics .................................................................... 3
1.3. Power Consumption .......................................................... 6
1.4. Thesis Outline ................................................................... 7
1.5. Contributions ..................................................................... 8
2. FFT ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1. Cooley-Tukey FFT Algorithms ......................................... 9
2.1.1. Eight-Point DFT ....................................................... 10
2.1.2. Basic Formula ........................................................... 12
2.1.3. Generalized Formula ................................................ 13
2.2. Sande-Tukey FFT Algorithms ........................................ 18
2.3. Prime Factor FFT Algorithms ......................................... 20
2.4. Other FFT Algorithms ..................................................... 23
2.4.1. Split-Radix FFT Algorithm ...................................... 23
2.4.2. Winograd Fourier Transform Algorithm .................. 26
2.5. Performance Comparison ................................................ 26
2.5.1. Multiplication Complexity ....................................... 27
2.5.2. Addition Complexity ................................................ 29
2.6. Other Issues ..................................................................... 30
2.6.1. Scaling and Rounding Issue ..................................... 31
2.6.2. IDFT Implementation ............................................... 35
2.7. Summary ......................................................................... 36
3. LOW POWER TECHNIQUES . . . . . . . . . . . . . . . . . . . . . . 37
3.1. Power Dissipation Sources .............................................. 37
3.1.1. Short-Circuit Power .................................................. 37
3.1.2. Leakage Power ......................................................... 38
3.1.3. Switching Power ....................................................... 39
3.2. Low Power Techniques ................................................... 40
3.2.1. System Level ............................................................ 40
3.2.2. Algorithm Level ....................................................... 42
3.2.3. Architecture Level .................................................... 44
3.2.4. Logic Level ............................................................... 47
3.2.5. Circuit Level ............................................................. 50
3.3. Low Power Guidelines .................................................... 51
3.4. Summary ......................................................................... 52
4. FFT ARCHITECTURES . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1. General-Purpose Programmable DSP Processors ........... 53
4.2. Programmable FFT Specific Processors ......................... 54
4.3. Algorithm-Specific Processors ........................................ 56
4.3.1. Radix-2 Multipath Delay Commutator ..................... 57
4.3.2. Radix-2 Single-Path Delay Feedback ....................... 58
4.3.3. Radix-4 Multipath Delay Commutator ..................... 59
4.3.4. Radix-4 Single-Path Delay Commutator .................. 60
4.3.5. Radix-4 Single-Path Delay Feedback ....................... 61
4.3.6. Radix-2² Single-Path Delay Commutator ................ 62
4.4. Summary ......................................................................... 63
5. IMPLEMENTATION OF FFT PROCESSORS . . . . . . . . 65
5.1. Design Method ................................................................ 65
5.2. High-level Modeling of an FFT Processor ...................... 67
5.2.1. Resource Analysis .................................................... 67
5.2.2. Validation of the High-Level Model ........................ 70
5.2.3. Wordlength Optimization ......................................... 71
5.3. Subsystems ...................................................................... 72
5.3.1. Memory .................................................................... 73
5.3.2. Butterfly .................................................................... 79
5.3.3. Complex Multiplier .................................................. 83
5.4. Final FFT Processor Design ............................................ 93
5.5. Summary ......................................................................... 96
6. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7. REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
1
INTRODUCTION
The Fast Fourier Transform (FFT) is one of the most used algorithms in digital signal processing. The FFT, which facilitates efficient transformation between the time domain and the frequency domain for a sampled signal, is used in many applications, e.g., radar, communications, sonar, and speech signal processing.
In the last decade, the interest in high-speed wireless and cable-based communication has increased. The Orthogonal Frequency Division Multiplexing (OFDM) technique, which is a special Multicarrier Modulation (MCM) method, has been demonstrated to be an efficient and reliable approach for high-speed data transmission. The immunity to multipath fading channels and the capability for parallel signal processing make it a promising candidate for the next generation of wide-band communication systems. The modulation and demodulation of OFDM-based communication systems can be efficiently implemented with an FFT, which has made the FFT valuable for those communication systems. The OFDM-based communication systems have high performance requirements in both throughput and power consumption. These performance requirements necessitate an application-specific integrated circuit (ASIC) solution for the FFT implementation. This thesis addresses the problem of designing efficient application-specific FFT processors for OFDM-based wide-band communication systems.
In this chapter, we give a short review of the DFT and the FFT. Then an introduction to OFDM and power consumption is presented. Finally, the outline of the thesis is described.
1.1. DFT and FFT
The Discrete Fourier Transform (DFT) for an N-point data sequence $\{x(k)\}$, $k = 0, 1, \ldots, N-1$, is defined as

$$X(n) = \sum_{k=0}^{N-1} x(k)\, W_N^{nk} \qquad (1.1)$$

for $n = 0, 1, \ldots, N-1$, where $W_N = e^{-j2\pi/N}$ is the primitive Nth root of unity. The number N is also called the transform length. The indices k and n are referred to as the time-domain and frequency-domain index, respectively.

The inverse DFT (IDFT) for a data sequence $\{X(n)\}$ ($n = 0, 1, \ldots, N-1$) is

$$x(k) = \frac{1}{N} \sum_{n=0}^{N-1} X(n)\, W_N^{-nk} \qquad (1.2)$$

for $k = 0, 1, \ldots, N-1$.

Direct computation of an N-point DFT according to Eq. (1.1) requires $N(N-1)$ complex additions and $N(N-1)$ complex multiplications. The complexity of computing an N-point DFT is therefore $O(N^2)$. With the contribution from Cooley and Tukey [13], the complexity of computing an N-point DFT can be reduced to $O(N \log N)$. Cooley and Tukey's approach, and the later developed algorithms that reduce the complexity of the DFT computation, are called fast Fourier transform (FFT) algorithms.

Among the FFT algorithms, two are especially noteworthy. One is the split-radix algorithm, published in 1984, which treats the even part and the odd part with different radices. The other is the Winograd Fourier Transform Algorithm (WFTA), published in 1976, which requires the least known number of multiplications among practical algorithms for moderate-length DFTs.

Many implementation approaches for the FFT have been proposed since the discovery of the FFT algorithms. Due to the high computation workload and the intensive memory access, the implementation of FFT algorithms is still a challenging task.
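To make the complexity gap concrete, the following small Python sketch (an illustration added here, not part of the original text; the helper name dft_direct is ours) evaluates Eq. (1.1) directly in O(N²) operations and checks it against an O(N log N) FFT:

```python
import numpy as np

def dft_direct(x):
    """Direct evaluation of Eq. (1.1): an O(N^2) matrix-vector product."""
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)  # W_N^{nk}
    return W @ x

x = np.random.randn(1024) + 1j * np.random.randn(1024)
assert np.allclose(dft_direct(x), np.fft.fft(x))  # the FFT computes the same X(n)
```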
1.2. OFDM Basics
OFDM is a special MCM technique. The idea of MCM is to divide the transmission bandwidth into many narrow subchannels (subcarriers), which transmit data in parallel [5].
The principle of MCM is shown in Fig. 1.1. The high-rate data stream at $M f_{sym}$ bits/s is grouped into blocks with M bits per block at a rate of $f_{sym}$. A block is called a symbol. A symbol allocates $m_k$ bits of the M bits for modulation of carrier k at $f_{c,k}$, and in total M bits for modulation of the N carriers. This results in N subchannels, which send symbols at a rate of $f_{sym}$.
In conventional MCM, the N subchannels are non-overlapping. Each subchannel has its own modulator and demodulator. This leads to inefficient usage of the spectrum and excessive hardware requirements.
The OFDM technique can overcome those drawbacks. With OFDM, the spectrum can be used more efficiently since overlapping of the subchannels is allowed. The overlapping does not cause interference between subchannels, due to the orthogonal modulation.
[Figure 1.1. A multicarrier modulation system: a serial-to-parallel converter distributes the M-bit symbols (input rate $M f_{sym}$ b/s, $f_{sym}$ symbols/s) over N modulators at carriers $f_{c,0}, \ldots, f_{c,N-1}$; after the channel (with noise), matching demodulators and a parallel-to-serial converter recover the data.]
The orthogonality can be explained in the frequency domain. The symbol rate is $f_{sym}$, i.e., each symbol is sent during a symbol time T (which is equal to $1/f_{sym}$). The frequency spacing between adjacent subchannels is set to 1/T Hz, and the carrier signals can then be expressed as follows:

$$f_k = f_0 + \frac{k}{T}, \qquad 0 \le k \le N-1 \qquad (1.3)$$

$$g_k(t) = \begin{cases} e^{j 2\pi f_k t} & 0 \le t < T \\ 0 & \text{otherwise} \end{cases} \qquad (1.4)$$

where $f_0$ is the system base frequency and $g_k$ is the signal for carrier k at frequency $f_k$. If the frequency of subcarrier k and the base function are chosen according to Eq. (1.3) and Eq. (1.4), its spectrum is a sinc function with zeros at $f_0 + l/T$ (l an integer), except at $l = k$, i.e., at $f_k$. This means that the selected functions cause no interference to the other subchannels.

[Figure 1.2. Spectrum overlapping of subcarriers for OFDM (adjacent subcarrier spectra spaced 1/T apart).]

This orthogonality can also be found in the time domain. For two carrier signals, $g_k$ and $g_l$, the integral over a symbol time is

$$\int_0^T g_k(t)\, g_l^*(t)\, dt = \begin{cases} T & k = l \\ 0 & \text{otherwise} \end{cases} \qquad (1.5)$$

which shows that the two carriers are orthogonal.

OFDM overcomes the inefficient implementation of the modulator and demodulator of conventional MCM. From Fig. 1.1, the transmitted signal x(t) is the sum of the symbols sent in all subchannels, i.e.,

$$x(t) = \sum_{k=0}^{N-1} S_k\, g_k(t) = e^{j 2\pi f_0 t} \sum_{k=0}^{N-1} S_k\, e^{j 2\pi k t / T}$$
where $S_k$ is the modulated signal of the $m_k$ bits to be transmitted by subchannel k. This is an N-point Inverse Discrete Fourier Transform (IDFT) followed by a baseband modulation (with $e^{j 2\pi f_0 t}$). The IDFT can be computed efficiently by an Inverse Fast Fourier Transform (IFFT) algorithm. Hence, the OFDM modulator can be implemented with one IFFT processor and one baseband modulator for all N subcarriers, instead of the N modulators of conventional MCM. In a similar way, the OFDM demodulator can be implemented more efficiently than that of conventional MCM. The simplified OFDM system based on the FFT is shown in Fig. 1.3.
In reality, interference between the subchannels exists due to non-ideal channel characteristics and frequency offsets in the transmitters and receivers. This interference affects the performance of the OFDM system. The frequency offset can, in most cases, be compensated.
Other issues, for instance intersymbol interference, can be reduced by techniques like the cyclic prefix.
[Figure 1.3. OFDM system based on FFT: input → IFFT → D/A → mixing with $e^{j2\pi f_0 t}$ → channel with noise → mixing with $e^{-j2\pi f_0 t}$ → A/D → FFT → output.]
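As an illustration of the FFT-based modulator/demodulator in Fig. 1.3, here is a minimal Python sketch (ours, not from the thesis), assuming an ideal noiseless channel and ignoring the baseband mixing with $e^{j2\pi f_0 t}$:

```python
import numpy as np

N = 64                                               # number of subcarriers (illustrative)
rng = np.random.default_rng(0)
S = np.exp(1j * np.pi / 2 * rng.integers(0, 4, N))   # QPSK symbols S_k, one per subcarrier

tx = np.fft.ifft(S)   # OFDM modulation: one IFFT replaces N separate modulators
rx = np.fft.fft(tx)   # OFDM demodulation: one FFT replaces N separate demodulators

assert np.allclose(rx, S)  # orthogonality: every S_k is recovered without interference
```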
1.3. Power Consumption
The famous Moore's Law has predicted the exponential increase in circuit integration and clock frequency during the last three decades. Table 1.1 shows the expectations for the near future from the Semiconductor Industry Association.

| Year | 2003 | 2004 | 2005 | 2010 |
|---|---|---|---|---|
| Feature size | 107 nm | 90 nm | 80 nm | 45 nm |
| ASIC usable Mega transistors/cm² (auto layout) | 142 | 178 | 225 | 714 |
| ASIC maximum functions per chip (Mega transistors/chip) | 810 | 1020 | 1286 | 4081 |
| Package cost (cents/pin), maximum/minimum | 1.24/0.70 | 1.17/0.66 | 1.11/0.61 | 0.98/0.49 |
| On-chip, local clock (MHz) | 3088 | 3990 | 5173 | 11511 |
| Supply V_dd (V) (high performance) | 1.0 | 1.0 | 0.9 | 0.6 |
| Power consumption, high performance with heatsink (W) | 150 | 160 | 170 | 218 |
| Power consumption, battery (W) (hand-held) | 2.8 | 3.2 | 3.2 | 3.0 |

Table 1.1. Technology roadmap from the International Technology Roadmap for Semiconductors (ITRS).

The power consumption per device decreases as the feature size and the power supply voltage are reduced. However, according to the table above, the power consumption increases or remains almost the same as technology advances. This is due to the potential workload increase.

During the last decade, power consumption has grown from a secondary constraint to one of the main constraints in the design of integrated circuits. In portable applications, low power consumption has long been the main constraint. Several other factors, for instance more functionality, higher workload, and longer operation time, contribute to making power consumption and energy efficiency even more critical. In high performance applications, where the power consumption traditionally was a secondary
constraint, the low power techniques gain more ground due to the steadily increasing cost of cooling and packaging. Besides those factors, the increasing power consumption has resulted in higher on-chip temperatures, which in turn reduce the reliability. The delivery of the power supply to the chip has also raised many problems, like power rail design, noise immunity, IR-drop, etc. Therefore the low power techniques are important for current and future integrated circuits.
1.4. Thesis Outline
In this thesis we summarize some implementation aspects of a low power FFT processor for an OFDM communication system. The system specification for the FFT processor has been defined as:
• Transform length is 1024
• Transform time is less than 40 µs (continuous operation)
• Continuous I/O
• 25.6 Msamples/sec. throughput
• Complex 24-bit I/O data
• Low power
In chapter 2, we introduce several FFT algorithms, which are the starting point for the implementation. The basic idea of the FFT algorithms, i.e., divide and conquer, is demonstrated through a few examples. Several FFT algorithms and their performance are also given.
An overview of low power techniques is given in chapter 3. Different techniques are introduced at the different abstraction levels. The main focus of the low power techniques is the reduction of dynamic power consumption. A general guideline is found at the end of that chapter.
The choice of FFT architecture is important for the implementation. A few architectures, including the pipeline architectures, are introduced in chapter 4. The pipeline architectures are discussed in more detail since they are the dedicated architectures for our target application.
In chapter 5, more detailed implementation steps for the FFT processors are provided. Both the design method and the design of the FFT processors are discussed in this chapter.
The conclusions for the FFT processor implementation are given in chapter 6.
1.5. Contributions
The main contributions of this thesis are:
• A method for minimizing the wordlengths in the pipelined FFT architectures, as outlined in Section 5.2.3.
• An approach to construct efficient high-radix butterflies, presented in Section 5.3.2.2.
• A complex multiplier using distributed arithmetic and the overturned-stairs tree, given in Section 5.3.3.
• A 16-point butterfly with constant multipliers, which reduces the total number of complex multiplications, described in Section 5.4.
• Various generators for different components, for instance the ripple-carry adder, the Brent-Kung adder, the complex multiplier, etc. These are found in Chapter 5.
2
FFT ALGORITHMS
In FFT processor design, the mathematical properties of the FFT must be exploited for an efficient implementation, since the selection of the FFT algorithm has a large impact on the implementation in terms of speed, hardware complexity, power consumption, etc. This chapter focuses on a review of FFT algorithms.
2.1. Cooley-Tukey FFT Algorithms
The technique for efficient computation of DFTs is based on the divide and conquer approach. This technique works by recursively breaking down a problem into two or more sub-problems of the same (or a related) type. The sub-problems are then independently solved, and their solutions are combined to give a solution to the original problem. This technique can be applied to DFT computation by dividing the data sequence into smaller data sequences until the DFTs for the small data sequences can be computed efficiently.
Although the technique was described in 1805 [26], it was not applied to DFT computation until 1965 [13]. Cooley and Tukey demonstrated the simplicity and efficiency of the divide and conquer approach for DFT computation and made the FFT algorithms widely accepted. We first give a simple example of the divide and conquer approach. Then a basic and a generalized FFT formulation are given.
2.1.1. Eight-Point DFT
In this section, we illustrate the idea of the divide and conquer
approach and show why dividing is also conquering for DFT
computation.
Let us consider an 8-point DFT, i.e., $N = 8$ and data sequence $\{x(k)\}$, $k = 0, 1, \ldots, 7$. The DFT of $\{x(k)\}$ is given by

$$X(n) = \sum_{k=0}^{7} x(k)\, W_8^{nk} \qquad (2.1)$$

for $n = 0, 1, \ldots, 7$.

One way to break down a long data sequence into shorter ones is to group the data sequence according to the indices. Let $\{x_o(l)\}$ and $\{x_e(l)\}$ ($l = 0, 1, 2, 3$) be two sequences; the grouping of $\{x(k)\}$ into $\{x_o(l)\}$ and $\{x_e(l)\}$ can be done intuitively by separating the members with odd and even indices:

$$x_o(l) = x(2l+1) \qquad (2.2)$$

$$x_e(l) = x(2l) \qquad (2.3)$$

for $l = 0, 1, 2, 3$.

The DFT for $\{x(k)\}$ can be rewritten

$$X(n) = \sum_{l=0}^{3} x_o(l)\, W_8^{n(2l+1)} + \sum_{l=0}^{3} x_e(l)\, W_8^{n(2l)} = W_8^{n} \sum_{l=0}^{3} x_o(l)\, W_4^{nl} + \sum_{l=0}^{3} x_e(l)\, W_4^{nl} = W_8^{n} X_o(n) + X_e(n) \qquad (2.4)$$

where $W_8^{n(2l)} = \left(e^{-j2\pi/8}\right)^{n(2l)} = \left(e^{-j2\pi/4}\right)^{nl} = W_4^{nl}$, and $X_o(n)$ and $X_e(n)$ are 4-point DFTs of $\{x_o(l)\}$ and $\{x_e(l)\}$, respectively.

Eq. (2.4) shows that the computation of an 8-point DFT can be decomposed into two 4-point DFTs and summations. The direct computation of an 8-point DFT requires $8 \cdot (8-1) = 56$ complex additions and $8 \cdot (8-1) = 56$ complex multiplications. The computation of two 4-point DFTs requires $2 \cdot 4 \cdot (4-1) = 24$ complex additions and $2 \cdot 4 \cdot (4-1) = 24$ complex multiplications. With an additional $8 - 1 = 7$ complex multiplications for $W_8^{n} X_o(n)$ and 8 complex additions, it requires in total 31 complex multiplications and 32 complex additions for the 8-point DFT computation according to Eq. (2.4). Only two 4-point DFTs are needed for the 8-point DFT due to the fact that $X_o(n) = X_o(n-4)$ and $X_e(n) = X_e(n-4)$ for $n \ge 4$. Furthermore, the number of complex multiplications for $W_8^{n} X_o(n)$ can be reduced from 7 to 3, since $W_8^{n} = -W_8^{n-4}$ for $n \ge 4$. The total number of complex additions and complex multiplications is then 32 and 27, respectively. This is shown in Fig. 2.1.

The above 8-point DFT example shows that the decomposition of a long data sequence into smaller data sequences reduces the computation complexity.
[Figure 2.1. An 8-point DFT computation with two 4-point DFTs: the even-indexed inputs feed one 4-point DFT and the odd-indexed inputs the other, whose outputs are combined via multiplications with $W_8^n$.]
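The even/odd decomposition of Eq. (2.4), applied recursively, already gives a complete radix-2 DIT FFT. A short Python sketch of this recursion (ours, not from the thesis):

```python
import numpy as np

def fft_dit(x):
    """Radix-2 DIT FFT using the split of Eq. (2.4); len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return x
    Xe = fft_dit(x[0::2])                            # DFT of even-indexed samples
    Xo = fft_dit(x[1::2])                            # DFT of odd-indexed samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # twiddle factors W_N^n
    # X(n) = Xe(n) + W^n Xo(n); the second half uses W^{n+N/2} = -W^n
    return np.concatenate([Xe + W * Xo, Xe - W * Xo])

x = np.random.randn(8) + 1j * np.random.randn(8)
assert np.allclose(fft_dit(x), np.fft.fft(x))
```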
2.1.2. Basic Formula
The 8-point DFT example illustrates the principle of the Cooley-Tukey FFT algorithm. We now introduce a more mathematical formulation of the FFT algorithm.

Let N be a composite number, i.e., $N = r_1 r_0$. The index k can be expressed by a two-tuple $(k_1, k_0)$ as

$$k = r_0 k_1 + k_0 \qquad (0 \le k_0 < r_0,\ 0 \le k_1 < r_1) \qquad (2.5)$$

In a similar way, the index n can be described by $(n_1, n_0)$ as

$$n = r_1 n_1 + n_0 \qquad (0 \le n_1 < r_0,\ 0 \le n_0 < r_1) \qquad (2.6)$$

The term $W_N^{nk}$ can be factorized as

$$W_N^{nk} = W_N^{(r_1 n_1 + n_0)(r_0 k_1 + k_0)} = W_N^{r_1 r_0 n_1 k_1}\, W_{r_1}^{n_0 k_1}\, W_{r_0}^{n_1 k_0}\, W_N^{n_0 k_0} = W_{r_1}^{n_0 k_1}\, W_{r_0}^{n_1 k_0}\, W_N^{n_0 k_0} \qquad (2.7)$$

where $W_N^{r_1 r_0 n_1 k_1} = W_N^{N n_1 k_1} = e^{-j 2\pi N n_1 k_1 / N} = e^{-j 2\pi n_1 k_1} = 1$.

With Eq. (2.7), Eq. (1.1) can be rewritten

$$X(n_1, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} x(k_1, k_0)\, W_{r_1}^{n_0 k_1}\, W_{r_0}^{n_1 k_0}\, W_N^{n_0 k_0} = \sum_{k_0=0}^{r_0-1} \left[ \left( \sum_{k_1=0}^{r_1-1} x(k_1, k_0)\, W_{r_1}^{n_0 k_1} \right) W_N^{n_0 k_0} \right] W_{r_0}^{n_1 k_0} \qquad (2.8)$$

Eq. (2.8) indicates that the DFT computation can be performed in three steps:
1. Compute $r_0$ different $r_1$-point DFTs (inner parentheses).
2. Multiply the results with $W_N^{n_0 k_0}$.
3. Compute $r_1$ different $r_0$-point DFTs (outer parentheses).
The $r_0$ $r_1$-point DFTs require $r_0 r_1 (r_1 - 1)$, or $N(r_1 - 1)$, complex multiplications and additions. The second step requires N complex multiplications. The final step requires $N(r_0 - 1)$ complex multiplications and additions. Therefore the total number of complex multiplications using Eq. (2.8) is $N(r_0 + r_1 - 1)$, and the number of complex additions is $N(r_0 + r_1 - 2)$. This is a reduction from $O(N^2)$ to $O(N(r_1 + r_0))$. The decomposition of the DFT thus reduces the computation complexity of the DFT.

The numbers $r_0$ and $r_1$ are called radices. If $r_0$ and $r_1$ are both equal to r, the number system is called a radix-r system. Otherwise, it is called a mixed-radix system. The multiplications with $W_N^{n_0 k_0}$ are called twiddle factor multiplications.

Example 2.1. For $N = 8$, we apply the basic formula by decomposing $N = 8 = 4 \cdot 2$ with $r_0 = 2$ and $r_1 = 4$. This results in the 8-point DFT example given in the section above, which is shown in Fig. 2.1. It is a mixed-radix FFT algorithm.

A closer study of the given 8-point DFT example shows that the input data need not be stored in memory after the computation of the two 4-point DFTs. This reduces the total memory size, which is important for memory-constrained systems. An algorithm with this property is called an in-place algorithm.
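The three steps of Eq. (2.8) can be demonstrated directly. The following Python sketch (our illustration; dft and fft_two_factor are hypothetical helpers) computes an 8-point DFT as in Example 2.1, with r1 = 4 and r0 = 2:

```python
import numpy as np

def dft(x):
    """Direct small-point DFT."""
    n = np.arange(len(x))
    return np.exp(-2j * np.pi * np.outer(n, n) / len(x)) @ x

def fft_two_factor(x, r1, r0):
    """N = r1*r0 point DFT computed by the three steps of Eq. (2.8)."""
    N = r1 * r0
    x = np.asarray(x).reshape(r1, r0)            # x[k1, k0], since k = r0*k1 + k0
    X1 = np.apply_along_axis(dft, 0, x)          # step 1: r0 different r1-point DFTs -> X1[n0, k0]
    n0, k0 = np.meshgrid(np.arange(r1), np.arange(r0), indexing="ij")
    X1 = X1 * np.exp(-2j * np.pi * n0 * k0 / N)  # step 2: twiddle factors W_N^{n0*k0}
    X2 = np.apply_along_axis(dft, 1, X1)         # step 3: r1 different r0-point DFTs -> X2[n0, n1]
    return X2.T.reshape(N)                       # X(n) with n = r1*n1 + n0

x = np.random.randn(8) + 1j * np.random.randn(8)
assert np.allclose(fft_two_factor(x, r1=4, r0=2), np.fft.fft(x))
```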
2.1.3. Generalized Formula
If $r_0$ and/or $r_1$ are not prime, a further reduction of the computation complexity can be achieved by applying the divide and conquer approach recursively to the $r_1$-point and/or $r_0$-point DFTs [7].

Let $N = r_{p-1} r_{p-2} \cdots r_0$. The indices k and n can then be written as

$$k = r_0 r_1 \cdots r_{p-2}\, k_{p-1} + \cdots + r_0 k_1 + k_0 \qquad (2.9)$$

$$n = r_{p-1} r_{p-2} \cdots r_1\, n_{p-1} + \cdots + r_{p-1} n_1 + n_0 \qquad (2.10)$$

where $k_i, n_{p-i-1} \in [0, r_i - 1]$ for $i = 0, 1, \ldots, p-1$.
The factorization of $W_N^{nk}$ can be expressed as

$$W_N^{nk} = W_N^{n(r_0 r_1 \cdots r_{p-2}\, k_{p-1} + \cdots + r_0 k_1 + k_0)} = W_N^{r_0 r_1 \cdots r_{p-2}\, n k_{p-1}} \cdots W_N^{r_0\, n k_1}\, W_N^{n k_0} = W_{r_{p-1}}^{n k_{p-1}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.11)$$

where $W_N^{r_0 r_1 \cdots r_i\, n k_{i+1}} = W_{N/(r_0 r_1 \cdots r_i)}^{n k_{i+1}}$.

Eq. (2.1) can then be written

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \left( \sum_{k_{p-1}=0}^{r_{p-1}-1} x(k_{p-1}, k_{p-2}, \ldots, k_0)\, W_{r_{p-1}}^{n_0 k_{p-1}} \right) W_{r_{p-1} r_{p-2}}^{n k_{p-2}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.12)$$

Note that the inner sum can be recognized as an $r_{p-1}$-point DFT over $k_{p-1}$ for each $n_0$. Define

$$x_1(n_0, k_{p-2}, \ldots, k_0) = \sum_{k_{p-1}=0}^{r_{p-1}-1} x(k_{p-1}, k_{p-2}, \ldots, k_0)\, W_{r_{p-1}}^{n_0 k_{p-1}} \qquad (2.13)$$

With Eq. (2.13), the index $k_{p-1}$ is replaced by $n_0$. Equation (2.12) can now be rewritten as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \sum_{k_{p-2}=0}^{r_{p-2}-1} x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n k_{p-2}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.14)$$
The term $W_{r_{p-1} r_{p-2}}^{n k_{p-2}}$ depends only on $n \bmod (r_{p-1} r_{p-2}) = r_{p-1} n_1 + n_0$, so it can be factorized as

$$W_{r_{p-1} r_{p-2}}^{n k_{p-2}} = W_{r_{p-1} r_{p-2}}^{(r_{p-1} n_1 + n_0) k_{p-2}} = W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}}\, W_{r_{p-2}}^{n_1 k_{p-2}}$$

The inner sum over $k_{p-2}$ in Eq. (2.14) can then be written as

$$\sum_{k_{p-2}=0}^{r_{p-2}-1} x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n k_{p-2}} = \sum_{k_{p-2}=0}^{r_{p-2}-1} \left[ x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}} \right] W_{r_{p-2}}^{n_1 k_{p-2}} \qquad (2.15)$$

which can be done through twiddle factor multiplications

$$x_1'(n_0, k_{p-2}, \ldots, k_0) = x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}} \qquad (2.16)$$

and $r_{p-2}$-point DFTs

$$x_2(n_0, n_1, k_{p-3}, \ldots, k_0) = \sum_{k_{p-2}=0}^{r_{p-2}-1} x_1'(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-2}}^{n_1 k_{p-2}} \qquad (2.17)$$

Eq. (2.14) can then be rewritten as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \sum_{k_{p-3}=0}^{r_{p-3}-1} x_2(n_0, n_1, k_{p-3}, \ldots, k_0)\, W_{r_{p-1} r_{p-2} r_{p-3}}^{n k_{p-3}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.18)$$

This process, from Eq. (2.14) to Eq. (2.17), can be repeated $p-2$ times, until the index $k_0$ is replaced by $n_{p-1}$.
$$x_{p-1}(n_0, n_1, \ldots, n_{p-1}) = \sum_{k_0=0}^{r_0-1} x_{p-2}'(n_0, \ldots, n_{p-2}, k_0)\, W_{r_0}^{n_{p-1} k_0} \qquad (2.19)$$

Eq. (2.14) can then be expressed as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = x_{p-1}(n_0, n_1, \ldots, n_{p-1}) \qquad (2.20)$$

Eq. (2.20) reorders the output data to natural order. This process is called unscrambling. The unscrambling process requires a special addressing mode that converts the address $(n_0, \ldots, n_{p-1})$ to $(n_{p-1}, \ldots, n_0)$. In the case of a radix-2 number system, each $n_i$ represents a bit; the addressing for unscrambling then reverses the address bits and is hence called bit-reverse addressing. In the case of a radix-r (r > 2) number system, it is called digit-reverse addressing.

Example 2.2. 8-point DFT. Let $N = 2 \cdot 2 \cdot 2$. The factorization of $W_N^{nk}$ can be expressed as

$$W_N^{nk} = \left( W_2^{n_0 k_2} \right) \left( W_4^{n_0 k_1}\, W_2^{n_1 k_1} \right) \left( W_8^{(2 n_1 + n_0) k_0}\, W_2^{n_2 k_0} \right) \qquad (2.21)$$

By using the generalized formula, the 8-point DFT can be computed with the following sequential equations [7]:

$$x_1(n_0, k_1, k_0) = \sum_{k_2=0}^{1} x(k_2, k_1, k_0)\, W_2^{n_0 k_2} \qquad (2.22)$$

$$x_1'(n_0, k_1, k_0) = x_1(n_0, k_1, k_0)\, W_4^{n_0 k_1} \qquad (2.23)$$

$$x_2(n_0, n_1, k_0) = \sum_{k_1=0}^{1} x_1'(n_0, k_1, k_0)\, W_2^{n_1 k_1} \qquad (2.24)$$

$$x_2'(n_0, n_1, k_0) = x_2(n_0, n_1, k_0)\, W_8^{(2 n_1 + n_0) k_0} \qquad (2.25)$$

$$x_3(n_0, n_1, n_2) = \sum_{k_0=0}^{1} x_2'(n_0, n_1, k_0)\, W_2^{n_2 k_0} \qquad (2.26)$$

$$X(n_2, n_1, n_0) = x_3(n_0, n_1, n_2) \qquad (2.27)$$

where Eq. (2.22) corresponds to the $W_2^{n_0 k_2}$ term in Eq. (2.21), Eq. (2.23) corresponds to the $W_4^{n_0 k_1}$ term, and so on. The result is shown in Fig. 2.2.

[Figure 2.2. 8-point DFT with the Cooley-Tukey algorithm (inputs in bit-reversed order, outputs in natural order).]
The recursive use of the divide and conquer approach for an 8-point DFT is shown in Fig. 2.3. As illustrated in the figure, the inputs are divided into smaller and smaller groups. This class of algorithms is called decimation-in-time (DIT) algorithms.

[Figure 2.3. The divide and conquer approach for DFT.]
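Bit-reverse addressing is simple to express in software. A small Python sketch (ours) that generates the unscrambling order for an 8-point radix-2 FFT, matching the input order in Fig. 2.2:

```python
def bit_reverse(n, bits):
    """Reverse the bit pattern of an index, e.g. 3 = 011 -> 110 = 6 for bits = 3."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

print([bit_reverse(n, 3) for n in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```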
2.2. Sande-Tukey FFT Algorithms
Another class of algorithms is called decimation-in-frequency (DIF) algorithms; they divide the outputs into smaller and smaller DFTs. This kind of algorithm is also called the Sande-Tukey FFT algorithm.
The computation of the DFT with a DIF algorithm is similar to the computation with a DIT algorithm. For the sake of simplicity, we do not derive the DIF algorithm but illustrate it with an example.
Example 2.3. 8-point DFT. The factorization of $W_N^{nk}$ can be expressed as

$$W_N^{nk} = \left( W_2^{k_2 n_0}\, W_8^{(2 k_1 + k_0) n_0} \right) \left( W_2^{k_1 n_1}\, W_4^{k_0 n_1} \right) \left( W_2^{k_0 n_2} \right) \qquad (2.28)$$
The sequential equations can be constructed in a similar way as those in Eq. (2.22) through Eq. (2.27). The result is shown in Fig. 2.4.

[Figure 2.4. 8-point DFT with the DIF algorithm (inputs in natural order, outputs in bit-reversed order).]

The computation of the DFT with DIF algorithms can be expressed with sequential equations similar to those of the DIT algorithms. By using the same notation for the indices n and k as in Eq. (2.9) and Eq. (2.10), the computation of an N-point DFT with the DIF algorithm is

$$x_i(n_0, \ldots, n_{i-1}, k_{p-i-1}, \ldots, k_0) = \left[ \sum_{k_{p-i}=0}^{r_{p-i}-1} x_{i-1}(n_0, \ldots, n_{i-2}, k_{p-i}, \ldots, k_0)\, W_{r_{p-i}}^{n_{i-1} k_{p-i}} \right] W_{N/(r_{p-1} \cdots r_{p-i+1})}^{n_{i-1} (r_{p-i-2} \cdots r_0\, k_{p-i-1} + \cdots + r_0 k_1 + k_0)} \qquad (2.29)$$

where $x_0(k_{p-1}, k_{p-2}, \ldots, k_0) = x(k_{p-1}, k_{p-2}, \ldots, k_0)$ and $i = 1, 2, \ldots, p$. The unscrambling process is done by

$$X(n_{p-1}, \ldots, n_0) = x_p(n_0, \ldots, n_{p-1}) \qquad (2.30)$$

Comparing Fig. 2.4 with Fig. 2.2, we find that the signal-flow graph (SFG) for DFT computation with the DIF algorithm is the transposition of that with the DIT algorithm. Hence, many properties of the DIT and DIF algorithms are the same. For instance, the computation
workload for the DIT and DIF algorithms is the same, and the unscrambling process is required for both DIF and DIT algorithms. However, there are clear differences between the DIF and DIT algorithms, e.g., the position of the twiddle factor multiplications: the DIF algorithms have the twiddle factor multiplications after the DFTs, while the DIT algorithms have them before the DFTs.
2.3. Prime Factor FFT Algorithms
In the Cooley-Tukey and Sande-Tukey algorithms, twiddle factor multiplications are required for the DFT computation. If the factors in the decomposition of N are relatively prime, there exists another type of FFT algorithm, the prime factor FFT algorithm, which avoids the twiddle factor multiplications.

In the Cooley-Tukey and Sande-Tukey algorithms, the indices n and k are expressed with Eq. (2.9) and Eq. (2.10). This representation of the index numbers is called index mapping. If $r_1$ and $r_0$ are relatively prime, i.e., the greatest common divisor gcd($r_1$, $r_0$) = 1, there exists another index mapping, the so-called Good's mapping [19]. An index n can be expressed as

$$n = \left( r_0 \left( n_1 r_0^{-1} \bmod r_1 \right) + r_1 \left( n_0 r_1^{-1} \bmod r_0 \right) \right) \bmod N \qquad (2.31)$$

where $N = r_1 r_0$, $0 \le n_1 < r_1$, $0 \le n_0 < r_0$, $r_0^{-1}$ is the multiplicative inverse of $r_0$ modulo $r_1$, i.e., $(r_0 r_0^{-1}) \bmod r_1 = 1$, and $r_1^{-1}$ is the multiplicative inverse of $r_1$ modulo $r_0$. This mapping is a variant of the Chinese Remainder Theorem.
Example 2.4. Construct the index mapping for the 15-point DFT inputs according to Good's mapping.

We have $N = 3 \cdot 5$ with $r_1 = 3$ and $r_0 = 5$. $r_1^{-1}$ is 2, since $(r_1 r_1^{-1}) \bmod r_0 = (3 \cdot 2) \bmod 5 = 1$, and $r_0^{-1} = 2$. The index can be computed according to

$$k = \left( 5 \left( 2 k_1 \bmod 3 \right) + 3 \left( 2 k_0 \bmod 5 \right) \right) \bmod 15 \qquad (2.32)$$

The mapping can be illustrated with an index matrix:

0  6  12  3  9
10 1  7  13 4
5  11 2  8  14

Figure 2.5. Good's mapping for 15-point DFT inputs (rows indexed by $k_1$, columns by $k_0$).

The mapping for the outputs is simpler. It can be constructed by $n = (r_0 n_1 + r_1 n_0) \bmod N$ ($0 \le n_1 < r_1$, $0 \le n_0 < r_0$).

Example 2.5. Construct the index mapping for the 15-point DFT outputs.

We have $N = 3 \cdot 5$ with $r_1 = 3$ and $r_0 = 5$. The index mapping for the outputs can be constructed by $n = (5 n_1 + 3 n_0) \bmod 15$ for $0 \le n_1 < 3$ and $0 \le n_0 < 5$. The result is shown in Fig. 2.6:

0  3  6  9  12
5  8  11 14 2
10 13 1  4  7

Figure 2.6. Index mapping for 15-point DFT outputs (rows indexed by $n_1$, columns by $n_0$).
The computation with prime factor FFT algorithms is similar to the computation with the Cooley-Tukey algorithm. It can be divided into two steps:
1. Compute $r_0$ different $r_1$-point DFTs, i.e., perform column-wise DFTs on the input index matrix.
2. Compute $r_1$ different $r_0$-point DFTs, i.e., perform row-wise DFTs on the output index matrix.

Example 2.6. 15-point DFT with the prime factor mapping FFT algorithm.

The input and output index matrices can be constructed as shown in Fig. 2.5 and Fig. 2.6. Following the computation steps above, the 15-point DFT can be performed by five 3-point DFTs followed by three 5-point DFTs.

The 15-point DFT with the prime factor mapping FFT algorithm is shown in Fig. 2.7.
[Figure 2.7. 15-point FFT with prime factor mapping: five 3-point DFTs followed by three 5-point DFTs, with inputs and outputs ordered according to Fig. 2.5 and Fig. 2.6.]
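The index matrices of Fig. 2.5 and Fig. 2.6 can be generated mechanically from Eq. (2.31). A Python sketch (ours; requires Python 3.8+ for pow with a negative exponent):

```python
import numpy as np

r1, r0 = 3, 5
N = r1 * r0
r0_inv = pow(r0, -1, r1)   # multiplicative inverse of r0 modulo r1 (= 2)
r1_inv = pow(r1, -1, r0)   # multiplicative inverse of r1 modulo r0 (= 2)

# Input map, Eq. (2.32): rows indexed by k1, columns by k0
k_map = np.array([[(r0 * (r0_inv * k1 % r1) + r1 * (r1_inv * k0 % r0)) % N
                   for k0 in range(r0)] for k1 in range(r1)])
# Output map: n = (r0*n1 + r1*n0) mod N
n_map = np.array([[(r0 * n1 + r1 * n0) % N
                   for n0 in range(r0)] for n1 in range(r1)])

print(k_map)  # [[0 6 12 3 9], [10 1 7 13 4], [5 11 2 8 14]]  (Fig. 2.5)
print(n_map)  # [[0 3 6 9 12], [5 8 11 14 2], [10 13 1 4 7]]  (Fig. 2.6)
```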
The prime factor mapping based FFT algorithm above is also an in-place algorithm.
Swapping the input and output index matrices gives another FFT algorithm, which does not need twiddle factor multiplications outside the butterflies either.
Although the prime factor FFT algorithms are similar to the Cooley-Tukey and Sande-Tukey FFT algorithms, they are derived from convolution-based DFT computations [19] [49] [31]. This later led to the Winograd Fourier Transform Algorithm (WFTA) [61].
2.4. Other FFT Algorithms
In this section, we discuss two other FFT algorithms. One is the split-radix FFT algorithm (SRFFT), and the other is the Winograd Fourier Transform Algorithm (WFTA).
2.4.1. Split-Radix FFT Algorithm
Split-radix FFT algorithms (SRFFT) were proposed nearly simultaneously by several authors in 1984 [17] [18]. The algorithms belong to the FFT algorithms with twiddle factors. As a matter of fact, the split-radix FFT algorithms are based on an observation about the Cooley-Tukey and Sande-Tukey FFT algorithms: different decompositions can be used for different parts of an algorithm. This gives the possibility to select the most suitable algorithm for each part in order to reduce the computational complexity.
For instance, the signal-flow graph (SFG) for a 16-point radix-2 DIF FFT algorithm is shown in Fig. 2.8.

[Figure 2.8. Signal-flow graph for a 16-point DIF FFT algorithm.]

The SRFFT algorithms exploit this idea by using both a radix-2 and a radix-4 decomposition in the same FFT algorithm. Obviously, all twiddle factors are equal to 1 for the even indexed outputs with a radix-2 FFT computation, i.e., no twiddle factor multiplication is required there. In the radix-4 FFT computation there is no such general rule (see Fig. 2.9). For the odd indexed outputs, a radix-4 decomposition increases the computational efficiency, because the four-point DFT is the largest multiplication-free butterfly; the radix-4 FFT is more efficient than the radix-2 FFT from the multiplication complexity point of view. Consequently, the DFT computation uses different radix FFT
algorithms for odd and even indexed outputs. This reduces the
number of complex multiplications and additions/subtractions. A
16-point SRFFT is shown in Fig. 2.10.
[Figure 2.9. Radix-4 DIF algorithm for a 16-point DFT.]

[Figure 2.10. SFG for a 16-point DFT with the SRFFT algorithm, mixing radix-2 and radix-2/4 butterflies.]
Although the SRFFT algorithms are derived from observations on the radix-2 and radix-4 FFT algorithms, they cannot be derived by index mapping. This could be the reason that the algorithms were discovered so late [18]. The SRFFT can also be generalized to lengths $N = p^k$, where p is a prime number [18].
2.4.2. Winograd Fourier Transform Algorithm
The Winograd Fourier Transform Algorithm (WFTA) [61] uses the cyclic convolution method to compute the DFT. This is based on Rader's idea [49] for prime-length DFT computation.
The computation of an N-point DFT (N being a product of two co-prime numbers $r_1$ and $r_0$) with WFTA can be divided into five steps: two pre-addition steps, two post-addition steps, and a multiplication step in the middle, as illustrated in Fig. 2.11. The number of arithmetic operations depends on N; the number of multiplications is $O(N)$.

[Figure 2.11. General structure of WFTA: $r_1$-point input additions, $r_0$-point input additions, N-point multiplications, $r_0$-point output additions, and $r_1$-point output additions.]

The aim of Winograd's algorithm is to minimize the number of multiplications, and WFTA succeeds in reducing it to the smallest number known. However, the minimization of the multiplications results in a complicated computation ordering and a large increase in the other arithmetic operations, e.g., additions. Furthermore, the irregularity of WFTA makes it impractical for most real applications.
2.5. Performance Comparison
For the algorithm implementation, the computation load is of great concern. Usually, the numbers of additions and multiplications are two important measures of the computation workload. We compare the discussed algorithms from the addition and multiplication complexity points of view.
Because of the restrictions on the transform length for the prime factor based algorithms and WFTA, the comparison is not strictly for the same transform length, but rather for nearby transform lengths.
2.5.1. Multiplication Complexity
Since multiplication has a large impact on the speed and power consumption, the multiplication complexity is important for the selection of FFT algorithms.
In many DFT computations, both complex multiplications and real multiplications are required. For the purpose of comparison, the counting is based on the number of real multiplications. A complex multiplication can be realized directly with 4 real multiplications and 2 real additions, as shown in Fig. 2.12 (a). With a simple transformation, the number of real multiplications can be reduced to 3, but the number of real additions increases to 3, as shown in Fig. 2.12 (b). We consider a complex multiplication as 3 real multiplications and 3 real additions in the following analysis.
[Figure 2.12. Realization of a complex multiplication: (a) direct realization, (b) transformed realization.]

For a DFT with transform length $N = 2^n$, the number of complex multiplications can be estimated as half the total number of butterfly operations, i.e., $N \log_2(N) / 2$. This number is overestimated: for example, a complex multiplication with a twiddle factor $W_N^k$ does not require any multiplications when k is a multiple of N/4. Furthermore, it requires only 2 real multiplications and 2 additions when k is an odd multiple of N/8. Taking these simplifications into account, the number of real multiplications for a
DFT with the radix-2 algorithm and transform length $N = 2^n$ is $M = (3N/2) \log_2(N) - 5N + 8$ [25]. The radix-4 algorithm for a DFT with transform length $N = 4^n$ requires $M = (9N/8) \log_2(N) - 43N/12 + 16/3$ real multiplications [18]. For the split-radix FFT algorithm, the number of real multiplications is $M = N \log_2(N) - 3N + 4$ for a DFT with $N = 2^n$ [18].

If the transform length is a product of two or more co-prime numbers, there is no simple analytic expression for the number of real multiplications. However, there are lower bounds that can be attained by algorithms for those transform lengths, and these lower bounds can be computed [18].

As mentioned previously, WFTA has been proven to have the lowest possible number of multiplications for transform lengths less than 16; it requires the lowest number of multiplications of the existing algorithms.

From the multiplication complexity point of view, the most attractive algorithm is WFTA, followed by the prime factor algorithm, the split-radix algorithm, and the fixed-radix algorithms. The number of real multiplications for various FFT algorithms on complex data is shown in the following table [18].

| N    | Radix-2 | Radix-4 | SRFFT | PFA  | WFTA |
|------|---------|---------|-------|------|------|
| 16   | 24      | 20      | 20    |      |      |
| 60   |         |         |       | 200  | 136  |
| 64   | 264     | 208     | 196   |      |      |
| 240  |         |         |       | 1100 | 632  |
| 256  | 1800    | 1392    | 1284  |      |      |
| 504  |         |         |       | 2524 | 1572 |
| 512  | 4360    |         | 3076  |      |      |
| 1008 |         |         |       | 5804 | 3548 |
| 1024 | 10248   | 7856    | 7172  |      |      |

Table 2.1. Multiplication complexity for various FFT algorithms.
2.5.2. Addition Complexity
In a radix-2 or radix-4 FFT algorithm, the addition and subtraction operations are used for realizing the butterfly operations and the complex multiplications. Since subtraction has the same complexity as addition, we consider a subtraction equivalent to an addition.

The additions for the butterfly operations are the larger part of the addition complexity. For each radix-2 butterfly operation (a 2-point DFT), the number of real additions is four, since each complex addition/subtraction requires two real additions. For a transform length $N = 2^n$, a DFT requires N/2 radix-2 DFTs for each stage, so the total number of real additions is $4 (N/2) \log_2(N)$, or $2nN$, with the radix-2 FFT algorithms. For a transform length $N = 4^n$, a DFT requires N/4 radix-4 DFTs for each stage. Each radix-4 DFT requires 8 complex additions/subtractions, i.e., 16 real additions. The total number of real additions is $16 (N/4) \log_4(N)$, or $4nN$. Both radix-2 and radix-4 FFT algorithms thus require the same number of butterfly additions for a DFT with a transform length that is a power of 4.

The number of additions required for the complex multiplications is smaller than that for the butterfly operations. Nevertheless, it cannot be ignored. As described previously, a complex multiplication generally requires 3 additions. The exact total number [25] is $A = (7N/2) \log_2(N) - 5N + 8$ for a DFT with transform length $N = 2^n$ using the radix-2 algorithm. The number of additions for a DFT with transform length $N = 4^n$ is $A = (25N/8) \log_2(N) - 43N/12 + 16/3$ for the radix-4 algorithm [18]. The split-radix algorithm has the best result for addition complexity: $A = 3N \log_2(N) - 3N + 4$ additions for an $N = 2^n$ DFT [18].
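These addition counts can be checked the same way (our sketch); the values agree with Table 2.2 below:

```python
from math import log2

def adds_radix2(N): return 7 * N / 2 * log2(N) - 5 * N + 8              # [25]
def adds_radix4(N): return 25 * N / 8 * log2(N) - 43 * N / 12 + 16 / 3  # [18]
def adds_srfft(N):  return 3 * N * log2(N) - 3 * N + 4                  # [18]

for N in (16, 64, 256, 1024):
    print(N, adds_radix2(N), round(adds_radix4(N)), adds_srfft(N))
# 16: 152, 148, 148 ... 1024: 30728, 28336, 27652 -- matching Table 2.2
```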
From the addition complexity point of view, WFTA is a poor choice. In fact, the irregularity and the increase in addition complexity make WFTA less attractive for practical implementation. The number of real additions for various FFT algorithms on complex data is given in the following table [18].

| N    | Radix-2 | Radix-4 | SRFFT | PFA   | WFTA  |
|------|---------|---------|-------|-------|-------|
| 16   | 152     | 148     | 148   |       |       |
| 60   |         |         |       | 888   | 888   |
| 64   | 1032    | 976     | 964   |       |       |
| 240  |         |         |       | 4812  | 5016  |
| 256  | 5896    | 5488    | 5380  |       |       |
| 504  |         |         |       | 13388 | 14540 |
| 512  | 13566   |         | 12292 |       |       |
| 1008 |         |         |       | 29548 | 34668 |
| 1024 | 30728   | 28336   | 27652 |       |       |

Table 2.2. Addition complexity for various FFT algorithms.

2.6. Other Issues

Many issues are related to the implementation of FFT algorithms, e.g., scaling and rounding considerations, inverse FFT implementation, parallelism of FFT algorithms, in-place and/or in-order computation, regularity of FFT algorithms, etc. We discuss the first two issues in more detail.
2.6.1. Scaling and Rounding Issue
In hardware it is not possible to implement an algorithm with infinite accuracy. To obtain sufficient accuracy, the scaling and rounding effects must be considered.
Without loss of generality, we assume that the input data {x(n)} are scaled, i.e., |x(n)| < 1/2 for all n. To avoid overflow of the number range, we apply the safe scaling technique [58], which ensures that an overflow cannot occur. We take the 16-point DFT with the radix-2 DIF FFT algorithm (see Fig. 2.8) as an example.
The basic operation for the radix-2 DIF FFT algorithm consists of a radix-2 butterfly operation and a complex multiplication, as shown in Fig. 2.13.

[Figure 2.13. Basic operation for the radix-2 DIF FFT algorithm.]

For two numbers u and v with |u| < 1/2 and |v| < 1/2, we have

$$|U| = |u + v| \le |u| + |v| < 1 \qquad (2.33)$$

$$|V| = |(u - v)\, W_N^p| = |u - v| \le |u| + |v| < 1 \qquad (2.34)$$

where the magnitude of the twiddle factor $W_N^p$ is equal to 1.

To retain the magnitude range, the results must be scaled by a factor 1/2. After scaling, rounding is applied in order to keep the same input and output wordlengths. This introduces an error, which is
called quantization noise. The quantization noise for a real number is modeled as an additive white noise source with zero mean and a variance of $\Delta^2/12$, where $\Delta$ is the weight of the least significant bit.

[Figure 2.14. Model for scaling and rounding of the radix-2 butterfly.]

The additive noise for U and V, respectively, is complex. Assume that the quantization noises for U and V are $Q_U$ and $Q_V$, respectively. For $Q_U$, we have

$$E\{Q_U\} = E\{Q_{U,re} + j Q_{U,im}\} = E\{Q_{U,re}\} + j E\{Q_{U,im}\} = 0 \qquad (2.35)$$

$$\operatorname{Var}\{Q_U\} = E\{Q_{U,re}^2 + Q_{U,im}^2\} = E\{Q_{U,re}^2\} + E\{Q_{U,im}^2\} = \frac{2\Delta^2}{12} \qquad (2.36)$$

Since the quantization noise is independent of the twiddle factor multiplication, we have

$$E\{Q_V W_N^p\} = E\{Q_V\}\, E\{W_N^p\} = 0 \qquad (2.37)$$

$$\operatorname{Var}\{Q_V W_N^p\} = E\{(Q_V W_N^p)(Q_V W_N^p)^*\} = E\{Q_V Q_V^*\} = \frac{2\Delta^2}{12} \qquad (2.38)$$
After the analysis of the basic radix-2 butterfly operation, we consider the scaling and quantization effects in an 8-point DIF FFT algorithm. The noise propagation path for the output X(0) is highlighted with bold solid lines in Fig. 2.15.

[Figure 2.15. Noise propagation.]

For the sake of clarity, we assume that $\Delta$ is equal for each stage, i.e., the internal wordlength is the same for all stages. If we analyze backwards for X(0), i.e., from stage l back to stage 1, it is easy to find that the noise from stage l-1 is scaled with 1/2 to stage l, and that stage l-1 has exactly twice as many noise sources as stage l. Generally, if the transform length is N and the number of stages is n, where $N = 2^n$, the variance of a noise source from stage l is scaled with $(1/2)^{2(n-l)}$ and the number of noise sources in stage l is $2^{n-l}$. Hence the total quantization noise variance for an output X(k) is

$$\operatorname{Var}\{Q_{X(k)}\} = \sum_{l=1}^{n} 2^{n-l} \cdot \frac{1}{2^{2(n-l)}} \cdot \frac{\Delta^2}{6} = \frac{\Delta^2}{6} \sum_{l=1}^{n} \frac{1}{2^{n-l}} = \frac{\Delta^2}{6} \left( 2 - \frac{1}{2^{n-1}} \right) \qquad (2.39)$$
The output variance for an output X(k), if the input sequence is zero-mean white noise with variance $\sigma^2$, can be derived by the following equation:

$$\operatorname{Var}\{X(k)\} = E\{X(k)\, X^*(k)\} = E\left\{ \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \cdot \frac{1}{N} \sum_{m=0}^{N-1} x^*(m)\, W_N^{-mk} \right\} = \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} E\{x(n)\, x^*(m)\}\, W_N^{(n-m)k} = \frac{1}{N^2} \sum_{n=0}^{N-1} E\{x(n)\, x^*(n)\} = \frac{1}{N^2}\, N \sigma^2 = \frac{\sigma^2}{N} \qquad (2.40)$$

where $E\{x(n)\, x^*(m)\} = 0$ for $n \ne m$ from the white noise assumption, and the factor 1/N reflects the safe scaling with 1/2 in each of the n stages. If $\Delta_{in}$ is the weight of the least significant bit for the real or imaginary part of the input, the input variance is equal to $2\Delta_{in}^2/12$. The signal-to-noise ratio (SNR) for the output X(k) is therefore [60]

$$\mathrm{SNR} = \frac{(2\Delta_{in}^2/12)/N}{\dfrac{\Delta^2}{6}\left(2 - \dfrac{1}{2^{n-1}}\right)} = \frac{\Delta_{in}^2}{\Delta^2} \cdot \frac{1}{2(2^n - 1)}$$

For a radix-r DIF FFT algorithm, a similar analysis [60] yields

$$\mathrm{SNR} = \frac{(2\Delta_{in}^2/12)/N}{\dfrac{\Delta^2}{6}\left(2 - \dfrac{1}{r^{n-1}}\right)} = \frac{\Delta_{in}^2}{\Delta^2} \cdot \frac{1}{2N - r}$$

This result, which is based on the white noise assumption, can be used to determine the required internal wordlength.

The finite wordlength effect of finite-precision coefficients is more complicated. Simulation is typically used to determine the wordlength of the coefficients.
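As an illustration of how the SNR expression can guide the internal wordlength choice, here is a sketch with hypothetical parameters (the wordlengths count fractional bits, so the LSB weights are $\Delta = 2^{-b}$; the function name is ours):

```python
from math import log10

def fft_snr_db(wl_in, wl_int, N, r=2):
    """Predicted output SNR of a safe-scaled radix-r DIF FFT (white-noise input assumed)."""
    d_in = 2.0 ** (-wl_in)               # input LSB weight, Delta_in
    d = 2.0 ** (-wl_int)                 # internal LSB weight, Delta
    snr = (d_in / d) ** 2 / (2 * N - r)  # SNR = (Delta_in/Delta)^2 / (2N - r)
    return 10 * log10(snr)

# 1024-point radix-2 FFT, 24-bit input: each extra internal bit buys about 6 dB
print(round(fft_snr_db(24, 36, 1024), 1))  # ~39.1 dB
```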
2.6.2. IDFT Implementation
An OFDM system requires both the DFT and the IDFT for the signal processing. The IDFT implementation is therefore also critical for the OFDM system.

There are various approaches to the IDFT implementation. The straightforward one is to compute the IDFT directly according to Eq. (1.2), which has a computation complexity of $O(N^2)$. This approach is obviously not efficient.

The second approach is similar to the FFT computation. If we ignore the scaling factor 1/N, the only difference between the DFT and the IDFT is the twiddle factor, which is $W_N^{-nk}$ instead of $W_N^{nk}$. This can easily be realized by changing the read addresses of the twiddle factor ROM(s) for the twiddle factor multiplications. It also requires a reordering of the input when a radix-r DFT is used. This approach adds an overhead to each butterfly operation and changes the access order of the coefficient ROM.

The third approach converts the computation of the IDFT into a computation of a DFT, as shown by the following equation:

$$x(k) = \frac{1}{N} \sum_{n=0}^{N-1} X(n)\, e^{j 2\pi n k / N} = \frac{1}{N} \sum_{n=0}^{N-1} \left[ X_{re}(n) + j X_{im}(n) \right] e^{j 2\pi n k / N} = \left( \frac{1}{N} \sum_{n=0}^{N-1} \left[ X_{re}(n) - j X_{im}(n) \right] e^{-j 2\pi n k / N} \right)^* = \left( \frac{1}{N} \sum_{n=0}^{N-1} X^*(n)\, e^{-j 2\pi n k / N} \right)^*$$

where the sum within the parentheses is, by definition, a DFT (of the sequence $X^*(n)$) and $a^*$ is the conjugate of a.
The conjugation of a complex number can be realized by swapping the real and imaginary parts; the swap equals conjugation followed by a multiplication with j, and when a swap is applied both before and after the DFT, the extra factors of j cancel. Hence, the IDFT can be computed with a DFT by adding two swaps and one scaling: swap the real and imaginary parts of the input before the DFT computation, swap the real and imaginary parts of the output from the DFT, and scale with the factor 1/N.
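A minimal Python sketch of this third approach (ours), checking the swap-DFT-swap-scale procedure against a reference IFFT:

```python
import numpy as np

def swap(z):
    """Swap real and imaginary parts of a complex array."""
    return z.imag + 1j * z.real

def idft_via_dft(X):
    """IDFT via a forward DFT: swap input, DFT, swap output, scale by 1/N."""
    return swap(np.fft.fft(swap(X))) / len(X)

X = np.random.randn(16) + 1j * np.random.randn(16)
assert np.allclose(idft_via_dft(X), np.fft.ifft(X))
```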
2.7. Summary
In this chapter we discussed the most commonly used FFT algorithms, i.e., the Cooley-Tukey and Sande-Tukey algorithms. Each computation step was given in detail for the Cooley-Tukey algorithms. Other algorithms, like the prime factor algorithm, the split-radix algorithm, and WFTA, were also discussed.
We compared the different algorithms in terms of the number of additions and multiplications. Some other aspects, for instance memory requirements, will be discussed later.
3
LOW POWER TECHNIQUES
Low power consumption has emerged as a major challenge in the
design of integrated circuits.
In this chapter, we discuss the basic principles for power
consumption in standard CMOS circuits. Afterwards, a review of
low-power techniques for CMOS circuits is given.
3.1. Power Dissipation Sources
In CMOS circuits, the main contributions to the power consumption
are from short-circuit, leakage, and switching currents. In the
following subsections, we introduce them separately.
3.1.1. Short-Circuit Power
In a static CMOS circuit, there are two complementary networks: the p-network (pull-up) and the n-network (pull-down). The logic functions of the two networks are complementary. Normally, when the input and output states are stable, only one network is turned on and conducts the output either to the power supply node or to the ground node, while the other network is turned off and blocks the current from flowing. Short-circuit current exists during the transitions, when one network is turned on and the other network is still active. For example, when the input signal to an inverter is switching from 0 to $V_{dd}$, there exists a short time interval where the input voltage is larger than $V_{tn}$ but less than $V_{dd} - |V_{tp}|$. During this time interval, both the
PMOS transistor (p-network) and the NMOS transistor (n-network) are turned on, and the short-circuit current flows through both kinds of transistors from the power supply line to ground.
The exact analysis of the short-circuit current even in a simple inverter [6] is complex, but it can be studied by simulation using SPICE. It is observed that the short-circuit current is proportional to the slope of the input signals, the output loads, and the transistor sizes [54]. The short-circuit current typically consumes less than 10% of the total power in a well-designed circuit [54].
3.1.2. Leakage Power
There are two contributions to the leakage currents: one from the currents that flow through the reverse-biased diodes, and the other from the currents that flow through non-conducting transistors. The leakage currents are proportional to the leakage area and depend exponentially on the threshold voltage. The leakage currents depend on the technology and cannot be modified by the designers, except in some logic styles.

[Figure 3.1. Leakage current types: (a) reverse-biased diode current, (b) subthreshold leakage current.]

The leakage current is in the order of pico-amperes with current technology, but it will increase as the threshold voltage is reduced. In some cases, like large RAMs, the leakage current is one of the main concerns. The leakage current is currently not a severe problem in most digital designs. However, the power consumed by the leakage current can be as large as the power consumed by the switching
current for 0.06 µm technology. The usage of multiple threshold voltages can reduce the leakage current in deep-submicron technologies.
3.1.3. Switching Power
The switching currents are due to the charging and discharging of the node capacitances. The node capacitances mainly include gate, overlap, and interconnection capacitances.
The power consumed by the switching current [63] can be expressed
as

P = α C_L f V_dd^2 / 2    (3.1)

where α is the switching activity factor, C_L is the capacitance
load, f is the clock frequency, and V_dd is the power supply voltage.
The equation shows that the switching power depends on a few
quantities that are readily observable and measurable in CMOS
circuits. It is applicable to almost every digital circuit and gives
guidance for low power design.
The power consumed by switching current is the dominant part
of the power consumption. Reducing the switching current is the
focus of most low power design techniques.
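As a simple illustration of Eq. (3.1), the C sketch below evaluates the switching power of a node; the activity factor, load, frequency, and voltage values are assumed purely for illustration and are not taken from any particular design.

    #include <stdio.h>

    /* Eq. (3.1): P = alpha * C_L * f * Vdd^2 / 2 */
    static double switching_power(double alpha, double c_load,
                                  double freq, double vdd)
    {
        return 0.5 * alpha * c_load * freq * vdd * vdd;
    }

    int main(void)
    {
        /* assumed values: activity 0.2, 10 pF load, 50 MHz clock */
        double p1 = switching_power(0.2, 10e-12, 50e6, 3.3);
        double p2 = switching_power(0.2, 10e-12, 50e6, 1.5);
        printf("P(3.3 V) = %.3f mW, P(1.5 V) = %.3f mW\n",
               1e3 * p1, 1e3 * p2);
        return 0;
    }

Note the quadratic dependence on V_dd: scaling the supply from 3.3 V to 1.5 V reduces the switching power by almost a factor of five.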
3.2. Low Power Techniques
Low power techniques can be discussed at various levels of
abstraction: system level, algorithm and architecture level, logic
level, circuit level, and technology level. Fig. 3.2 shows some
examples of techniques at the different levels.
In the following, we give an overview of the different low power
techniques, organized by abstraction level.
3.2.1. System Level
A system typically consists of both hardware and software
components, which affect the power consumption.
The system design includes hardware/software partitioning,
hardware platform selection (application-specific or general-purpose
processors), resource sharing (scheduling) strategy, etc. The system
design usually has the largest impact on the power consumption, and
hence the low power techniques applied at this level have the
greatest potential for power reduction.
At the system level, it is hard to find the best solution for low
power in the large design space, and there is a shortage of accurate
power analysis tools at this level. However, if, for example,
instruction-level power models for a given processor are available,
software power optimization can be performed [56]. It is observed
Figure 3.2. Low-power design methodology at different abstraction
levels: system (partitioning, power-down), algorithm (parallelism,
pipelining), architecture (voltage scaling), logic (logic styles and
manipulation, data encoding), circuit (energy recovery, transistor
sizing), and technology (threshold reduction, double-threshold
devices).
that faster code and frequent use of the cache are most likely to
reduce the power consumption. The order of instructions also has an
impact on the internal switching within the processor and hence on
the power consumption.
Power-down and clock gating are two of the most used low power
techniques at the system level. Non-active hardware units are shut
down to save power. The clock drivers, which often consume 30-40% of
the total power, can be gated to reduce the switching activity, as
illustrated in Fig. 3.3.

Figure 3.3. Clock gating.
The power-down can be extended to the whole system. This is called
sleep mode and is widely used in low power processors. The StrongARM
SA-1100 processor has three power states, and the average power
varies for each state [29]. These power states can be utilized by
the software through the Advanced Configuration and Power Interface
(ACPI). In recent years, power management has gained a lot of
attention in operating system design. For example, the Microsoft
desktop operating systems support advanced power management (APM).

Figure 3.4. Power states for the StrongARM SA-1100 processor
(RUN 400 mW, IDLE 50 mW, SLEEP 160 µW).
The system is designed for peak performance. However, the
computation requirement is time-varying. Adapting the clock frequency
and/or dynamically scaling the voltage to match the performance
constraints is another low power technique. The lower performance
requirement during certain time intervals can be used to reduce the
power supply voltage. This requires either a feedback mechanism
(load monitoring and voltage control) or predetermined timing to
activate the voltage down-scaling.
Another less explored domain for low power design is using
asynchronous design techniques. Asynchronous designs have many
attractive features, like non-global clocking, automatic power-down,
no spurious transitions, and low peak currents. The power consumption
can be reduced further by combining the asynchronous design technique
with other low power techniques, for instance dynamic voltage
scaling [42]. This is illustrated in Fig. 3.5.

Figure 3.5. Asynchronous design with dynamic voltage scaling.
3.2.2. Algorithm Level
The selection of algorithm has a large impact on the power
consumption. For example, using the fast Fourier transform instead of
direct computation of the DFT reduces the number of operations by a
factor of 102.4 for a 1024-point Fourier transform, and the power
consumption is likely to be reduced by a similar factor.
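As a quick check of this factor, the C sketch below compares the rough operation counts N^2 (direct DFT) and N log2(N) (FFT); it is only arithmetic on the counts, not an implementation of either transform.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double n = 1024.0;
        /* direct DFT ~ N^2 operations, FFT ~ N*log2(N) operations */
        double ratio = (n * n) / (n * log2(n));
        printf("reduction factor for N = %.0f: %.1f\n", n, ratio); /* 102.4 */
        return 0;
    }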
The task of algorithm design is to select the most energy-efficient
algorithm that just satisfies the constraints. The cost of an
algorithm includes a computation part and a communication/storage
part. The complexity measure for an algorithm includes the number of
operations and the cost of
communication/storage. Reducing the number of operations, the cost
per operation, and the long-distance communication are key issues in
algorithm selection.
One important technique for low power at the algorithm level is
algorithmic transformations [45] [46]. This technique exploits the
complexity, concurrency, regularity, and locality of an algorithm.
Reducing the complexity of an algorithm reduces the number of
operations and hence the power consumption. The possibility of
increasing the concurrency in an algorithm allows the use of other
techniques, e.g., voltage scaling, to reduce the power consumption.
The regularity and locality of an algorithm affect the control and
communication in the hardware.
The loop unrolling technique [9] [10] is a transformation that
aims to enhance the speed. This technique can be used for reducing
the power consumption. With loop unrolling, the critical path can be
reduced and hence voltage scaling can be applied to reduce the
power consumption. In Fig. 3.6, the unrolling reduces the critical
path and gives a voltage reduction of 26% [10]. This reduces the
power consumption by 20% even though the capacitance load increases
by 50% [10]. Furthermore, this technique can be combined with other
techniques at the architecture level, for instance pipelining and
interleaving, to save more power.
In some cases, like wave digital filters, faster algorithms,
combined with voltage scaling, can be chosen for energy-efficient
applications [58].
Figure 3.6. (a) Original signal flow graph. (b) Unrolled signal
flow graph.
3.2.3. Architecture Level
As the algorithm is selected, the architecture can be determined for
the given algorithm.
As we can see from Eq. (3.1), an efficient way to reduce the
dynamic power consumption is voltage scaling. When the supply voltage
is reduced, the power consumption is reduced quadratically. However,
this also increases the gate delay. The delay of a minimum-size
inverter (0.35 µm standard CMOS technology) increases as the supply
voltage is reduced, as shown in Fig. 3.7. To compensate for the
increased delay, we use low power techniques like parallelism and
pipelining [11].
We demonstrate an example of architecture transformation.
Figure 3.7. Delay vs. supply voltage for an inverter.
Example 3.1. Parallelism [11].
The use of two parallel datapaths is equivalent to interleaving of
two computational tasks. A datapath that determines the largest of C
and (A + B) is shown in Fig. 3.8. It requires an adder and a
comparator. The original clock frequency is 40 MHz [11].

Figure 3.8. Original datapath.

In order to maintain the throughput while reducing the power supply
voltage, we use a parallel architecture. The parallel architecture
with twice the amount of resources is shown in Fig. 3.9. The clock
frequency can be halved, from 40 MHz to 20 MHz, since two tasks are
executed concurrently. This allows the supply voltage to be scaled
down from 5 V to 2.9 V [11]. Since extra routing is required to
distribute computations to the two parallel units, the capacitance
load is increased by a factor of 2.15 [11]. Still, this gives a
significant power saving [11]:
P_par = C_par V_par^2 f_par = (2.15 C_orig)(0.58 V_orig)^2 (f_orig / 2) ≈ 0.36 P_orig

Figure 3.9. Parallel implementation.
Example 3.2. Pipelining [11].
Pipelining is another method for increasing the throughput. By
adding a pipeline register after the adder in Fig. 3.8, the
throughput can be increased from 1/(T_add + T_comp) to
1/max(T_add, T_comp). If T_add is equal to T_comp, this increases the
throughput by a factor of 2. With this enhancement, the supply
voltage can also in this case be scaled down to 2.9 V (the gate delay
doubles) [11]. The effective capacitance increases by a factor of
1.15 because of the inserted latches [11]. The power consumption for
the pipelined datapath [11] is
P_pipe = C_pipe V_pipe^2 f_pipe = (1.15 C_orig)(0.58 V_orig)^2 f_orig ≈ 0.39 P_orig

Figure 3.10. Pipeline implementation.
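The normalized power figures of Examples 3.1 and 3.2 can be reproduced with the small C sketch below; it only evaluates P = C V^2 f with the factors quoted from [11] and is not a model of the datapaths themselves.

    #include <stdio.h>

    int main(void)
    {
        /* P = C * V^2 * f, normalized so that P_orig = 1 */
        double p_par  = 2.15 * 0.58 * 0.58 * 0.5; /* 2.15*C, 0.58*V, f/2 */
        double p_pipe = 1.15 * 0.58 * 0.58;       /* 1.15*C, 0.58*V, f   */
        printf("P_par  = %.2f P_orig\n", p_par);  /* ~0.36 */
        printf("P_pipe = %.2f P_orig\n", p_pipe); /* ~0.39 */
        return 0;
    }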
One benefit of pipelining is the low area overhead in comparison
with parallel datapaths: the area overhead equals the area of the
inserted latches. Another benefit is that the amount of glitches can
be reduced.
Further power savings can be obtained by more parallelism and/or
pipelining. However, since the delay increases significantly as the
voltage approaches the threshold voltage, and since the capacitance
load for routing and/or pipeline registers increases, there exists an
optimal power supply voltage. Reducing the supply voltage below this
optimum increases the power consumption.
Locality is also an important issue in architecture trade-offs.
On-chip communication through long buses requires a significant
amount of power, so reducing such communication is important.
3.2.4. Logic Level
The power consumption depends on the switching activity factor,
which in turn depends on the statistical characteristics of the data.
However, most low power techniques from the system level down to the
architecture level do not concentrate on this issue. The low power
techniques at the logic level, however, focus mainly on reducing the
switching activity factor by exploiting signal correlations and, of
course, on reducing the node capacitances.
As we know from clock gating, the clock input to a non-active
functional block does not change when gated, which reduces the
switching of the clock network. Precomputation [1] uses the same
concept to reduce the switching activity factor: a selective
precomputation of the output of a circuit is done before the output
is required, and this reduces the switching activity by gating some
of the inputs to the circuit. This is illustrated in Fig. 3.11. The
input data is partitioned into two parts, corresponding to registers
R_1 and R_2. One part, R_1, is processed in the precomputation block
g one clock cycle before the main computation A is performed. The
result from g decides the gating of R_2. Power can then be saved by
reducing the switching activity factor in A.

Figure 3.11. A precomputation structure for low power.
An example of precomputation for low power is the comparator. The
comparator takes the MSBs of the two numbers to register R_1 and the
remaining bits to R_2. The comparison of the MSBs is performed in g.
If the two MSBs are not equal, the output from g gates the remaining
inputs. In this way, only a small portion of the inputs to the
comparator's main block A (a subtractor) changes. Therefore the
switching activity is reduced.
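The following C sketch models this scheme for a 16-bit unsigned comparator; the function name and wordlength are our own choices, and the code reflects the behavior of the gating, not the gate-level structure.

    #include <stdint.h>

    /* Returns 1 if a > b. The MSBs are compared first (block g); the
       main block A (comparison of the remaining bits) is exercised
       only when the MSBs are equal, mimicking the input gating. */
    int compare_precomputed(uint16_t a, uint16_t b)
    {
        int msb_a = (a >> 15) & 1;
        int msb_b = (b >> 15) & 1;
        if (msb_a != msb_b)          /* g decides; R2 inputs stay gated */
            return msb_a > msb_b;
        return (a & 0x7FFF) > (b & 0x7FFF);  /* main block A */
    }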
Gate reorganization [12] [32] [57] is a technique to restructure
the circuit. This can be decomposition of a complex gate into simple
gates, composition of simple gates into a complex gate, duplication
of a gate, or deletion/addition of wires. The decomposition of a
complex gate and duplication of a gate help to separate the critical
and non-critical paths and to reduce the size of gates in the
non-critical path, and, hence, the power consumption. In some cases,
the decomposition of a complex gate increases the circuit speed and
gives more room for
power supply voltage scaling. The composition of simple gates can
reduce the power consumption if the complex gate reduces the
charging/discharging of frequently switching nodes. The deletion of
wires reduces the capacitance load and circuit size. The addition of
wires helps to provide an intermediate circuit that may eventually
lead to a better one.
Encoding defines the way data bits are represented in the circuit.
The encoding is usually optimized for reduction of delay or area. In
low power design, the encoding is optimized for reduction of
switching activity, since various encoding schemes have different
switching properties.
In a counter design, counters with binary and Gray code have the
same functionality. For an N-bit counter with binary code, a full
counting cycle requires 2(2^N - 1) transitions [63]. A full counting
cycle for a Gray-coded N-bit counter requires only 2^N transitions.
For instance, the full counting cycle for a 2-bit binary-coded
counter is 00, 01, 10, 11, and back to 00, which requires 6
transitions. The full counting cycle for a 2-bit Gray-coded counter
is 00, 01, 11, 10, and back to 00, which requires 4 transitions. The
binary-coded counter has twice as many transitions as the Gray-coded
counter when N is large. A binary-coded counter therefore consumes
more power than a Gray-coded counter under the same conditions.
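The transition counts can be verified with the short C sketch below, which simply sums the Hamming distances between consecutive codes over one full counting cycle; the helper functions are our own.

    #include <stdio.h>

    static unsigned gray(unsigned x) { return x ^ (x >> 1); }

    /* Count bit transitions over one full cycle of an n-bit counter. */
    static unsigned transitions(unsigned n_bits, int use_gray)
    {
        unsigned period = 1u << n_bits;
        unsigned count = 0, prev = use_gray ? gray(0) : 0;
        for (unsigned i = 1; i <= period; i++) {     /* wraps back to 0 */
            unsigned code = use_gray ? gray(i % period) : (i % period);
            for (unsigned d = code ^ prev; d; d >>= 1)
                count += d & 1u;
            prev = code;
        }
        return count;
    }

    int main(void)
    {
        /* prints "binary: 6, Gray: 4" for N = 2, matching 2(2^N - 1)
           and 2^N */
        printf("binary: %u, Gray: %u\n",
               transitions(2, 0), transitions(2, 1));
        return 0;
    }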
As we can see from the previous example, the logic coding style has
a large impact on the number of transitions. Traditionally, the logic
coding style is used to enhance speed performance. Careful choice of
coding style is important to meet the speed requirement and minimize
the power consumption. This can be applied to finite state machines,
where the states can be coded with different schemes.
A bus is an on-chip communication channel that has a large
capacitance. As the on-chip transfer rate increases, the buses
contribute a significant portion of the total power. Bus encoding is
a technique that exploits the properties of the transmitted signal to
reduce the power consumption. For instance, adding an extra bit that
selects between the inverted and non-inverted bits at the receiver
end can save power [53]. Low-swing techniques can also be applied to
buses [27].
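A minimal C sketch of this bus-invert idea [53] is shown below, assuming an 8-bit bus; the function names are hypothetical, and the receiver is assumed to re-invert the word whenever the extra bit is set.

    #include <stdint.h>

    static int popcount8(uint8_t x)
    {
        int n = 0;
        for (; x; x >>= 1) n += x & 1;
        return n;
    }

    /* Decide what to drive onto the bus: if more than half of the
       lines would toggle, send the inverted word and set the flag.
       'prev' is the word currently on the wires. */
    uint8_t bus_invert_encode(uint8_t prev, uint8_t next, int *invert)
    {
        *invert = popcount8(prev ^ next) > 4;
        return *invert ? (uint8_t)~next : next;
    }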
3.2.5. Circuit Level
At the circuit level, the potential power savings are often smaller
than at the higher abstraction levels. However, they cannot be
ignored. The power savings can be significant since the basic cells
are frequently used. A few percent improvement of a D flip-flop can
significantly reduce the power consumption in deeply pipelined
systems.
In CMOS circuits, the dynamic power consumption is caused by
transitions. Spurious transitions typically consume between 10% and
40% of the switching activity power in typical combinational logic
[20]. In some cases, like array multipliers, the amount of spurious
transitions is large. To reduce the spurious transitions, the delays
of signals from registers that converge at a gate should be roughly
equal. This can be done by buffer insertion and device sizing [33].
The inserted buffers increase the total load capacitance but can
still reduce the spurious transitions. This technique is called path
balancing.
Many logic gates have inputs that are logically equivalent, i.e.,
swapping the inputs does not modify the logic function of the gate.
Examples are NAND, NOR, and XOR gates. However, from the power
consumption point of view, the order of the inputs does affect the
power consumption. For instance, the A-input, which is near the
output in a two-input NAND gate (Fig. 3.12), consumes less power than
the B-input close to ground for the same switching activity factor.
Pin ordering assigns the more frequently switching signals to the
input pins that consume less power. In this way, the power
consumption is reduced without cost. However, the statistics of the
switching activity factors for the different pins must be known in
advance, and this limits the use of pin ordering [63].

Figure 3.12. NAND gate.
Different logic styles have different electrical characteristics.
The selection of logic style affects the speed and power
consumption. In most cases, the standard CMOS logic is a good
starting point for speed and power trade-off. In some cases, for
instance the XOR/XNOR implementation, other logic styles, like
complementary pass-transistor logic (CPL), are efficient. CPL
implements a full adder with fewer transistors than standard CMOS.
The evaluation of the full adder is done only in an NMOS transistor
network, which also gives a small layout.

Figure 3.13. CPL logic network.
Transistor sizing affects both delay and power consumption.
Generally, a gate of smaller size has smaller capacitance and
consumes less power. This is paid for by a larger delay. Minimizing
the transistor sizes while meeting the speed requirement is a
trade-off. Typically, transistor sizing uses static timing analysis
to find those gates (whose slack time is larger than 0) whose size
can be reduced. Transistor sizing is generally applicable in
different technologies.
3.3. Low Power Guidelines
Several approaches to reduce the power consumption have been
briefly discussed. Below we summarize some of the most commonly used
low power techniques.

Reduce the number of operations. The selection of algorithm and/or
architecture has a significant impact on the power consumption.

Power supply voltage scaling. Voltage scaling is an efficient way
to reduce the power consumption. Since the throughput is reduced as
the voltage is reduced, this may need to be compensated for with
parallelism and/or pipelining techniques.
Reduce the number of chips. I/Os between chips can consume much
power due to the large capacitive loads. Reducing the number of chips
is a promising approach to reduce the power consumption.
Power management. In many systems, the most power-consuming parts
are often idle. For example, in a laptop computer, the display and
hard disk can consume more than 50% of the total power. Using power
management strategies to shut down these components when they are
idle for a long time can achieve good power savings.
Reduce the effective capacitance. The effective capacitance can be
reduced by several approaches, for example compact layout and
efficient logic styles.

Reduce the number of transitions. Minimizing the number of
transitions, especially glitches, is important.
3.4. Summary
In this chapter we discussed some low power techniques that are
applicable at different abstraction levels.
4
FFT ARCHITECTURES
Not only have several variations of the FFT algorithm been
developed since Cooley and Tukey's publication, but also various
implementations. Generally, the FFT can be implemented in software,
in general-purpose digital signal processors, in application-specific
processors, or in algorithm-specific processors.

Implementations in software on general-purpose computers can be
found in the literature and are still being explored in some
projects, for instance the FFTW project in the Laboratory for
Computer Science at MIT [28]. Software implementations are not
suitable for our target application as the power consumption is too
high.

Since it is hard to summarize all other implementations, we will
concentrate on algorithm-specific architectures and only give a brief
overview of some FFT architectures.
4.1. General-Purpose Programmable DSP Processors
Many commercial programmable DSP processors include special
instructions for the FFT computation. Although the performance varies
from one processor to another, most of them have a Harvard
architecture from an architectural point of view. A processor with
Harvard architecture has separate buses for program and data.
A typical programmable DSP processor has on-chip data and program
memories, an address generator, a program controller, a MAC, an ALU,
and I/O interfaces, as illustrated in Fig. 4.1.

Figure 4.1. General-purpose programmable DSP processor.

The computation of the FFT with a general-purpose DSP processor
does not differ much from the software computation of the FFT on a
general-purpose computer.
Computing the FFT with a general-purpose DSP processor requires
three steps: first the data input, then the FFT/IFFT computation, and
finally the data output. In some DSP processors, for instance TI's
TMS320C3x, bit-reversed addressing is available to accelerate the
unscrambling of the data output. Typical FFT/IFFT execution times are
about 1 ms [2] [41] [55], which is far slower than more specialized
implementations. Implementations with general-purpose programmable
DSP processors are therefore not applicable due to the throughput
requirement.
4.2. Programmable FFT-Specific Processors
Several programmable FFT processors have been developed for
FFT/IFFT computations. These processors are 5 to 10 times faster than
the general-purpose programmable DSP processors.
The programmable FFT-specific processors have specific butterflies
and at least one complex multiplier [65]. The butterfly is usually
radix-2 or radix-4. There is often an on-chip coefficient ROM, which
stores the sine and cosine coefficients. This type of programmable
FFT-specific processor is often provided with windowing functions in
either the time or the frequency domain.

Zarlink's (formerly Plessey) PDSP16515A processor performs
decimation-in-time, radix-4, forward or inverse fast Fourier
transforms [65]. Data are loaded into an internal workspace RAM in
normal sequential order, processed, and then read out in correct
order. The processor has two internal workspace RAMs, one output
buffer, and one coefficient ROM.
Although the PDSP16515A processor accelerates the FFT computation,
it is still hard to meet the throughput requirement with a single
processor due to the slow I/O. The processor requires 98 µs to
perform a 1024-point FFT with a system clock of 40 MHz. Using a
multiple-processor configuration can achieve a higher throughput, but
the power consumption is then substantially higher.

A recently released FFT-specific processor from DoubleBW Systems
B.V. has a higher throughput (100 Msamples/s) [16], but consumes 8 W
at 3.3 V.
Figure 4.2. FFT-specific processor PDSP16515A.
4.3. Algorithm-Specific Processors
Non-programmable, algorithm-specific processors can also be
designed for the computation of FFT algorithms. These processors are
designed mostly for fixed-length FFTs. The architecture of an
algorithm-specific FFT processor is therefore optimized with respect
to memory structure, control units, and processing elements.

There are mainly three types of algorithm-specific processors:
fully parallel FFT processors, column FFT processors, and pipelined
FFT processors.

All three types of algorithm-specific processors represent
different mappings of the FFT signal-flow graph to hardware
structures. The hardware structure in a fully parallel FFT processor
is an isomorphic mapping of the signal-flow graph [3]. For example,
the signal-flow graph for an 8-point FFT algorithm is shown in Fig.
4.3. The 8-point fully parallel FFT processor requires 24 complex
adders and 5 complex multipliers. The hardware requirement is
excessive, and, hence, not power efficient.
To reduce the hardware complexity, a column or a pipelined FFT
processor can be used. A set of processing elements in a column FFT
processor [21] computes one stage at a time. The results are fed back
to the same set of processing elements to compute the next stage. For
long transform lengths, the routing between the processing elements
is complex and difficult.
Figure 4.3. Signal-flow graph for an 8-point FFT (W^n denotes
multiplication by the twiddle factor W_8^n).
For a pipelined FFT processor, each stage has its own set of
processing elements. All stages compute as soon as data are
available. Pipelined FFT processors have features like simplicity,
modularity, and high throughput. These features are important for
real-time, in-place applications where the input data often arrive in
natural sequential order. We therefore select the pipeline
architecture for our FFT processor implementation.
The most common groups of pipelined FFT architectures are:

- Radix-2 multipath delay commutator (R2MDC)
- Radix-2 single-path delay feedback (R2SDF)
- Radix-4 multipath delay commutator (R4MDC)
- Radix-4 single-path delay commutator (R4SDC)
- Radix-4 single-path delay feedback (R4SDF)
- Radix-2² single-path delay commutator (R2²SDC)
We will discuss these pipeline architectures in more detail.
4.3.1. Radix-2 Multipath Delay Commutator
The Radix-2 Multipath Delay Commutator (R2MDC) architecture is the
most straightforward approach to implementing the radix-2 FFT
algorithm with a pipeline architecture [48]. An 8-point R2MDC FFT is
shown in Fig. 4.4.

Figure 4.4. An 8-point DIF R2MDC architecture.

When a new frame arrives, the first four input data are multiplexed
to the top-left delay elements in the figure and the next four input
data go directly to the butterfly. In this way the first input sample
is delayed by four samples and arrives at the butterfly
simultaneously with input sample x(4). This completes the start-up of
the first stage of the pipeline. The outputs from the first-stage
butterfly and the multiplier are then fed into the multipath delay
commutator
between stage 1 and stage 2. There are two paths (multipath) with
delay elements and one switch (commutator). The multipath delay
commutator alleviates the data dependency problem. The first and
second outputs from the upper side of the butterfly are fed into the
two upper delay elements. After this, the switch changes and the
third and fourth outputs from the upper output of the first butterfly
are sent directly to the butterfly at stage 2. However, the first and
second outputs from the multiplier at the first stage are now delayed
by the upper delay elements, which makes them arrive together with
the fifth and sixth outputs from the top.

The butterfly and the multiplier are idle half the time, waiting
for new inputs. Hence the utilization of the butterfly and the
multiplier is 50%. The total number of delay elements is 4 + 2 + 2 +
1 + 1 = 10 for the 8-point FFT. The total number of delay elements
for an N-point FFT can be derived in a similar way and is
N/2 + N/2 + N/4 + ... + 2, i.e., 3N/2 - 2. Each stage (except the
last one) has one multiplier, so the number of multipliers is
log_2(N) - 1.
4.3.2. Radix-2 Single-Path Delay Feedback
Herbert L. Groginsky and George A. Works introduced a feedback
mechanism in order to minimize the number of delay elements [22]. In
the proposed architecture, one half of the outputs from each stage
are fed back to the input data buffer while the input data are sent
directly to the butterfly. This architecture is called Radix-2
Single-path Delay Feedback (R2SDF). Fig. 4.5 shows the principle of
an 8-point R2SDF FFT.

Figure 4.5. An 8-point DIF R2SDF FFT.
The delay elements at the first stage save four input samples
before the computation starts. During the execution they store one
output from the butterfly of the first stage while the other output
is immediately transferred to the next stage. Thus, in the next half
frame, when the delay elements are filled with fresh input samples,
the results of the previous frame are sent to the next stage. The
butterfly is provided with a feedback loop. The modified butterfly is
shown on the right side of Fig. 4.5. When the mux is 0, the butterfly
is idle and data passes by. When the mux is 1, the butterfly
processes the incoming samples. Because of the feedback mechanism,
the number of delay elements is reduced from 3N/2 - 2 to N - 1
(N/2 + N/4 + ... + 1), which is minimal. The number of multipliers is
exactly the same as for the R2MDC architecture, i.e., log_2(N) - 1.
The utilization of the multipliers and butterflies remains the same,
namely 50%.
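To make the schedule concrete, the C sketch below models one R2SDF stage at the behavioral level; the twiddle-factor multiplication between stages is omitted and the names are our own, so it is a software model under these assumptions rather than the hardware design.

    #include <complex.h>

    typedef struct {
        int d;                /* feedback buffer length (N/2 at stage 1) */
        int t;                /* sample index within a 2d-sample block   */
        double complex *buf;  /* feedback buffer, d entries              */
    } r2sdf_stage;

    /* Push one sample in, get one sample out (the first d outputs are
       filler until the buffer holds valid data). */
    double complex r2sdf_step(r2sdf_stage *s, double complex in)
    {
        int idx = s->t % s->d;
        double complex stored = s->buf[idx], out;

        if (s->t < s->d) {              /* fill phase: butterfly idle    */
            s->buf[idx] = in;
            out = stored;               /* previous block's lower output */
        } else {                        /* compute phase                 */
            out = stored + in;          /* upper output, passed onward   */
            s->buf[idx] = stored - in;  /* lower output, fed back        */
        }
        s->t = (s->t + 1) % (2 * s->d);
        return out;
    }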
4.3.3. Radix-4 Multipath Delay Commutator
This architecture is similar to R2MDC. Input data are separated by
a 4-to-1 multiplexer and 3N/2 delay elements at the first stage. A
4-path delay commutator is used between two stages. Computation takes
place only when the last quarter of the data is multiplexed to the
butterfly. The utilization of the butterflies and the multipliers
is 25%. The length of the FFT has to be 4^n. A length-64 DIF Radix-4
Multipath Delay Commutator (R4MDC) FFT is shown in Fig. 4.6.

Figure 4.6. A 64-point DIF R4MDC FFT.

Each stage (except the last stage) has 3 multipliers, and the
R4MDC FFT thus requires in total 3(log_4(N) - 1) multipliers for an
N-point FFT, which is more than the R2MDC or R2SDF. Moreover, the
memory requirement is 5N/2 - 4, which is the largest among the three
architectures discussed so far. From the view of hardware and
utilization, it is not a good structure.
4.3.4. Radix-4 Single-Path Delay Commutator
To increase the utilization of the butteries, G. Bi and E. V. Jones [4]
proposed a simplied radix-4 buttery. In the simplied radix-4
buttery, only one output is produced in comparison with 4 in the
conventional buttery. To provide the same four outputs, the
buttery works four times instead of just one. Due to this modi-
cation the buttery has a utilization of or 100%. To accom-
modate this change we must provide the same four data at four
different times to the buttery. A few more delay elements are
required with this architecture. Furthermore, the simplied buttery
needs additional control signals, and so do the commutators. The
number of multipliers is , which is less than the R4MDC
FFT architecture. The utilization of the multiplier is 75% due to the
fact that at least one-fourth of the data are multiplied with the trivial
twiddle factor 1 (no multiplication is needed). The structure of a
16-point DIF Radix-4 Single-Path Delay Commutator (R4SDC) FFT is
shown in Fig. 4.7.

Figure 4.7. A 16-point DIF R4SDC FFT.

The main benefit of this architecture is the improved utilization
of the butterflies. The cost for R4SDC is the increased number of
delay elements.
4.3.5. Radix-4 Single-Path Delay Feedback
Radix-4 single-path delay feedback (R4SDF) [15] [62] is a radix-4
version of R2SDF. Since we use the radix-4 algorithm, we can reduce
the number of multipliers to log_4(N) - 1, compared to log_2(N) - 1
for R2SDF. But the utilization of the butterflies is reduced to 25%.
The radix-4 SDF butterflies also become more complicated than the
radix-2 SDF butterflies. A 64-point DIF R4SDF FFT is illustrated in
Fig. 4.8.

Figure 4.8. A 64-point DIF R4SDF FFT.
The radix-4 SDF butterfly is shown in Fig. 4.9. The data are sent
to the butterfly for processing when the mux is 1; otherwise the data
are shifted into a delay-line with a length of 3N/4 (first stage).

Figure 4.9. Radix-4 SDF butterfly.
4.3.6. Radix-2² Single-Path Delay Commutator
The Radix-2² Single-path Delay Commutator (R2²SDC) architecture
[24] uses a modified radix-4 DIF FFT algorithm. It has the same
butterfly structure as the radix-2 DIF FFT, but places the
multipliers at the same positions as in the radix-4 DIF FFT.
Basically, two kinds of radix-2 SDF butterflies are used to achieve
the same outputs (but not in the same order) as a radix-4 butterfly.
By reducing the radix from 4 to 2, we increase the utilization of the
butterflies from 25% to 50%, and we reduce the number of multipliers
compared to the conventional radix-2 algorithm. This approach is
based on a 4-point DFT. The outputs are bit-reversed instead of
4-reversed as in a conventional radix-4 algorithm.

Figure 4.10. A 64-point DIF R2²SDC FFT.
4.4. Summary
In this chapter, several FFT implementation classes were discussed.
The programmable DSP and FFT-specific processors cannot meet the
requirements of both high throughput and low power applications.
Algorithm-specific implementations, especially pipelined FFT
architectures, are better in this respect.
5
IMPLEMENTATION OF
FFT PROCESSORS
In this chapter, we discuss the implementation of FFT processors.
In VLSI design, the design method is an important guide for the
implementation. We follow the meet-in-the-middle design method.
5.1. Design Method
As the transistor feature size is scaled down, more and more
functionalities can be integrated in a single chip. High speed, high
complexity, and short design time are several requirements for VLSI
designs. This requires that the design methodology must cope with
the increasing complexity using a systematic approach. A design
methodology is the overall strategy to organize and solve the design
tasks at the different steps of the design process [24].
The bottom-up methodology, which builds the system by assembling
existing building blocks, can hardly keep up with the performance and
communication requirements of current systems. Hence the bottom-up
methodology is not suitable for the design of complex systems.
In the top-down design methodology, the system requirements and
organization are developed by successive decomposition. Typically, a
high-level design language is used to define the system
functionality. After a number of decomposition steps, the system is
described in an HDL, which can be used for automatic logic
synthesis. A drawback with this design approach is that the result
relies heavily on the synthesis tools. If the final result fails to
meet the performance requirements, the whole design has to be redone.
In the meet-in-the-middle methodology, the specification-synthesis
process is carried out in an essentially top-down fashion, but the
actual design of the building blocks is performed bottom-up. This is
illustrated in Fig. 5.1. The design process is therefore divided into
two almost independent parts that meet in the middle. The circuit
design phase can be shortened by using efficient circuit design tools
or even automatic logic synthesis tools. Often, some of the building
blocks are already available in a circuit library.

Figure 5.1. The meet-in-the-middle methodology.
In our target application, the requirements for the FFT processor
are specified. The design process starts with the creation of a
functional specification of the FFT processor. This results in a
high-level model. The high-level model is then validated by a
testbench for the FFT algorithm. The testbench can be reused for
successive models. After the system functionality is validated by
simulation, the functional specification is mapped into an
architectural specification.

In the architectural specification, the detailed computation
process is mapped to hardware. Different functionalities are
partitioned and mapped into hardware or software components. The
detailed communication between different components is decided in the
architecture specification. After the architecture model is created,
the model needs to be simulated for performance and validation.
Basically, the software and hardware designs are
separated after this architectural partitioning. Since the FFT
processor is completely implemented in hardware, the partitioning
of software and hardware is not necessary.
Once an architecture is selected, the individual hardware blocks
are refined by adding implementation details and constraints. In this
phase, we apply the bottom-up design methodology: the subblocks are
built from cells and combined into larger blocks.
5.2. High-level Modeling of an FFT Processor
High-level modeling serves two purposes: to create a cycle-true
model of the algorithm and hardware architecture, and to simulate,
validate, and optimize the high-level model.

Since the whole FFT processor is implemented in hardware, software
and hardware co-design is not needed. As mentioned previously, we do
not need to determine the system specification since it is given. The
system specification for the FFT processor has been defined as:

- Transform length: 1024
- Transform time: less than 40 µs (continuous operation)
- Continuous I/O
- Throughput: 25.6 Msamples/s
- Complex 24-bit I/O data

According to the meet-in-the-middle design methodology, the
high-level design is a top-down process. We start with the resource
analysis.
5.2.1. Resource Analysis
The high-level design can be divided into several tasks:

- Architecture selection
- Partitioning
- Scheduling
- RTL model generation
- Validation of models

The first three tasks are closely related, and their aim is to
allocate resources that meet the system specification. In an ASIC
implementation, the resources are constrained; hence a resource
analysis is required.

There are many possible architectures for FFT processors. Among
them, the pipelined FFT architectures are particularly suitable for
real-time applications since they easily accommodate the sequential
nature of sampling.

A pipelined FFT architecture can be divided into a datapath and a
control part. Since the control part is much simpler than the
datapath with respect to both hardware and power consumption, the
resource analysis concentrates on the datapath.

The datapath for the FFT processor consists of memories,
butterflies, and complex multipliers. We discuss them separately.
5.2.1.1. Butterflies
From the specification, the computation time for the 1024-point FFT
processor is

t_FFT = 40 µs = 4 × 10^-5 s    (5.2)

With the radix-2 algorithm, the number of butterfly operations is
(N/r) log_r(N) = (N/2) log_2(N) = 5120. A butterfly can be
implemented with parallel adders/subtractors using one clock cycle.
Hence the minimum number of butterflies is

N_BF = ⌈ No_BFop · t_BFop / t_FFT ⌉ = ⌈ 5120 / (4 × 10^-5 × 25.6 × 10^6) ⌉ = 5    (5.3)

This is optimal under the assumption that ALL data are available to
ALL stages, which is impossible for continuous data streams. Each
butterfly has to be idle 50% of the time in order to reorder the
incoming data. The allocation of butterfly operations from two stages
to the same butterfly is not possible with as-soon-as-possible (ASAP)
scheduling. Therefore the number of butterflies is 10, i.e., equal to
the number of stages.
By a similar argument, the number of butterflies for a radix-4
pipeline architecture is also equal to the number of stages.
5.2.1.2. Complex Multipliers
The number of complex multiplications is

N_cmult = (N/r)(r - 1)(log_r(N) - 1)    (5.4)

where N is the transform length and r is the radix. It does not
include the complex multiplications within the r-point DFTs.

For the radix-2 algorithm, the number of complex multiplications is
about 4068. A complex multiplication can be computed in either one
clock cycle or two clock cycles (pipelined). The minimum number of
complex multipliers, assuming fast complex multipliers (one complex
multiplication per clock cycle), is

N_cmult,min = ⌈ N_cmult · t_cmult / t_FFT ⌉ = ⌈ 4068 / (4 × 10^-5 × 25.6 × 10^6) ⌉ = 4    (5.5)

Since resource sharing between two stages is not possible for
pipeline architectures, the number of complex multipliers is 9, i.e.,
each stage except the last has its own complex multiplier.

For the radix-4 algorithm, the number of complex multipliers is 4.
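The resource numbers above can be reproduced with a short C sketch; the timing figures come from the specification, the operation counts from the text, and the computation is just the ceiling divisions of Eqs. (5.3) and (5.5).

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double n = 1024.0, t_fft = 40e-6, f_clk = 25.6e6;

        double bf_ops = (n / 2.0) * log2(n);            /* radix-2: 5120 */
        double n_bf   = ceil(bf_ops / (t_fft * f_clk)); /* Eq. (5.3): 5  */

        double cmults = 4068.0;                         /* from the text */
        double n_cm   = ceil(cmults / (t_fft * f_clk)); /* Eq. (5.5): 4  */

        printf("butterflies >= %.0f, complex multipliers >= %.0f\n",
               n_bf, n_cm);
        return 0;
    }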
5.2.1.3. Memories
The memory requirement increases linearly with the transform
length. For a 1024-point FFT processor, the memory dissipates more
power than the complex multipliers.

The size of the memories is determined by the maximum amount of
live data, which is determined by the architecture. In general, the
architectures with feedback are efficient in terms of memory
utilization.
5.2.2. Validation of the High-Level Model
After the resource analysis, the next step is to model the FFT
algorithm at high level. For fast evaluation, the algorithm is
described in a high-level programming language, like C or Matlab.

The validation of the high-level model is done through simulation
and comparison. The interface between model and testbench is plain
text files: the input data is stored in a text file and read in by
the model, and the output data from the model is also saved in a text
file. This gives freedom in the construction of the model and the
testbench: the model can be written in C, Matlab, or VHDL, and the
testbench can also be written in C or Matlab. Moreover, it is easy to
convert from floating-point arithmetic to fixed-point arithmetic. The
same testbench can be reused by changing the arithmetic of the
input/output files.
Architecture   Memory requirement [words]            Memory utilization
R2MDC          N/2 + N/2 + ... + 2 = 3N/2 - 2        66%
R2SDF          N/2 + N/4 + ... + 1 = N - 1           100%
R4MDC          3N/4 + 3N/16 + ... + 12 = 5N/2 - 4    40%
R4SDF          3N/4 + 3N/16 + ... + 3 = N - 1        100%
R4SDC          3N/2 + 3N/8 + ... + 6 = 2N - 2        50%
R2²SDF         N/2 + N/4 + ... + 1 = N - 1           100%

Table 5.1. Memory requirement and utilization for pipelined
architectures.
Figure 5.2. Testbench.
5.2.3. Wordlength Optimization
In pipelined FFT architectures, most research effort has been
devoted to regular, modular implementations, which use a fixed
wordlength for both data and coefficients in every stage. The
possibility of using different wordlengths, which the pipeline
architecture provides, is often ignored in order to achieve modular
solutions.

Based on the observation that the wordlength for different stages
in a pipelined FFT processor can vary, we proposed a wordlength
optimization method [34]. We first tune the wordlength of the data
memory (data RAM) at each stage separately to make sure that the
precision requirement is met, and then we adjust the wordlength of
the coefficient ROM at each stage. Because our focus is on reducing
the power consumption of the data memory, the strategy is that the
larger the RAM block in a stage, the shorter its wordlength should
be. The conventional uniform wordlength scheme for both data memory
and coefficient ROM is also simulated. To obtain the optimal
wordlength profile, numerous design iterations have been performed.
Figure 5.3. Wordlength optimization for pipelined FFTs.
Two types of test vectors are used in our simulation: sine waves
and random numbers. The sine wave stimulus is sensitive to the
precision of the coefficient representation, and the random samples
are effective stimuli for checking the precision of the butterfly
calculations. To make the obtained results highly reliable, 100,000
sets of random samples are generated and fed into the simulator. The
optimization results for 1024-point pipelined FFT architectures are
shown in Table 5.2.
5.3. Subsystems
Once the RTL model of the FFT processor is created and validated,
the subsystems can be constructed according to the meet-in-the-middle
design methodology.

For the subsystem design, there are two design methods: the
semi-custom method and the full-custom method. The semi-custom design
method has a shorter design time. The RTL description in an HDL can
be synthesized with a synthesis tool, and the synthesis result is fed
to a place-and-route tool for the final layout. However, this design
methodology relies on the synthesis and place-and-route tools. The
designer has less control over the design process. Moreover, most
synthesis tools use static timing analysis and do not consider the
interconnections during synthesis. The designers have to increase the
timing margins during synthesis to meet the speed requirement after
place and route. The resulting designs are often unnecessarily large.
In our case, the impact of power supply voltage scaling is hard to
predict since the characterization of cells is done at
Architecture   Memory size,       Memory size,             Saving
               fixed wordlength   optimized wordlength
R2MDC          42952 bits         42824 bits               0%
R2SDF          28644 bits         28580 bits               0%
R4MDC          71552 bits         61488 bits               14%
R4SDF          28644 bits         24708 bits               14%
R4SDC          57288 bits         49176 bits               14%
R2²SDF         28672 bits         28580 bits               0%

Table 5.2. Wordlength optimization.
the nominal supply voltage. We therefore select full-custom design
for the FFT processor, but use the semi-custom design method for the
control path, where the timing is not critical.

In the following we introduce the subsystem design for the FFT
processor. The main subsystems are memories, butterflies, and complex
multipliers.

Figure 5.4. Datapath for a stage in a pipelined FFT processor.
5.3.1. Memory
In many DSP processors, the memory contributes a significant
portion of the area and power consumption of the whole processor. In
the 1024-point FFT processor, the memory is the most significant part
in terms of both area and power consumption. Hence the low power
design of the memories is a key issue for the FFT processor.
5.3.1.1. RAM
The data are stored in RAMs. There are mainly two types of RAM:
static RAM (SRAM) and dynamic RAM (DRAM). Since DRAM often requires a
special process and is not
available in a standard CMOS technology, and SRAM is more suitable
for low-voltage operation, we select SRAM for the data storage. An
overview of an SRAM is shown in Fig. 5.5.

Figure 5.5. Overview of a SRAM.
The SRAM consists of four parts [30]:

- memory cell array
- decoders
- sense amplifiers
- periphery circuits

We discuss the implementation of the first three parts, which are
the main parts of the SRAM.
Memory array
The memory array dominates the SRAM area. The keys to the design of
the memory array are the cell area and the noise immunity of the
bit-lines.

The memory cell is the basic building block of the SRAM, and the
size of the memory cell is of importance. Even though a 4-transistor
(4-T) memory cell has a smaller area than a 6-transistor (6-T) cell,
its leakage current at low voltage is considerably larger. We
therefore select a 6-T memory cell. A typical 6-T memory cell is
shown in Fig. 5.6.

Figure 5.6. SRAM cell.
The stability of the memory cell during the read and write
operations determines the device sizing [52]. The cell designer has
to consider process variations, short-channel effects, soft error
rate, and low supply voltage [23]. The width ratio W_n/W_a is
determined by the read operation, and a larger ratio means less
chance that the SRAM cell changes its state during a read operation.
The read stability can be measured by the static noise margin (SNM).
The width ratio W_p/W_a affects the write operation, and a larger
ratio means that data is more difficult to write into the cell. The
width of the access NMOS transistor, W_a, is set to minimum size for
minimal cell area. Normally, the value of W_n/W_a is 1~2 and that of
W_p/W_a is 2~3.
Example 6. SNM for a SRAM cell.
The SNM can be simulated with SPICE. The SNM for a SRAM cell with a
width ratio W_n/W_a of 2 in a standard 0.35 µm CMOS technology is
shown in Fig. 5.7.

Figure 5.7. SNM of a SRAM cell vs. power supply voltage.

In order to reduce the power consumption and speed up the read
access, most SRAMs read data through a pair of bit-lines with a small
voltage swing. The swing between the two bit-lines is usually about
100 mV to 300 mV, which is sensitive to noise. As the power supply
voltage decreases, the effect of noise becomes more important. To
reduce noise from outside, the memory array is surrounded by a
guard-ring, which reduces the substrate-coupling noise. To avoid
coupling noise from nearby bit-line pairs, we use a twisted bit-line
layout. Thus the coupling from nearby bit-line pairs does not affect
the differential swing of the bit-lines.

Figure 5.8. Noise reduction for memory array.
Decoder
The decoder can be realized using a hierarchical structure, which
reduces both the delay and the activity factor. The row decoder can
use either a NOR-NAND decoder or a tree decoder. The tree decoder
requires fewer transistors, but suffers from speed degradation due to
the serial connection of pass-transistors, which increases the delay
(it becomes worse at lower power supply voltages). The NOR-NAND
decoder has a regular layout, but requires more transistors. For
small decoders the tree decoder is preferred, and the NOR-NAND
decoder is preferred for larger decoders.

For a large decoder, a word-line enable signal is added to the
decoder. It controls the width of the word-line pulse and reduces the
glitches of the word-line drivers. This reduces the power consumption
of the decoder.
Sense Amplifier
The sense amplifier is used to amplify the small-swing bit-line
signals during the read operation. To achieve fast access, the sense
amplifier is designed with high gain. This high-gain requirement in
turn requires a high current and hence a high power consumption for
the sense amplifier. One way to reduce the power consumption is to
reduce the active time of the sense amplifier. This can be achieved
by using a pulsed sense enable signal.

At low supply voltage, the current-mode sense amplifier is less
suitable. We therefore modified an STC D-flip-flop [64] to form a
two-stage latch-type sense amplifier. The sense amplifier is
functional at supply voltages as low as 0.9 V.

Figure 5.9. STC D-flip-flop.
The simulated waveforms for the read operation are shown in Fig.
5.10. The access time is 11 ns using a standard 0.35 µm CMOS
technology with typical process at 85°C. The power consumption for
the sense amplifier is 59.5 µW per bit at 50 MHz.

Figure 5.10. Read operation.

The simulated waveforms for the write operation are shown in Fig.
5.11. The total power consumption is 83.4 µW per bit at 50 MHz.

Figure 5.11. Write operation.
The pulse widths for the word-line signal and the sense enable
signal must be selected carefully. A short pulse width dissipates
less power, but the pulse needs to be sufficiently long to guarantee
the read operation under process variations, low power supply
voltage, etc.

Among the periphery circuits, the I/O drivers have large capacitive
loads. Reducing the short-circuit current is an important issue for
the I/O drivers. Avoiding simultaneous switching of the PMOS and NMOS
transistors is an efficient technique for reducing the short-circuit
current.
5.3.1.2. Implementation
A 256-word × 26-bit SRAM with separate I/O (Fig. 5.12) has been
implemented with the techniques discussed above. The SRAM, which runs
at 1.5 V and 50 MHz, consumes 2.6 mW. A module generator for the SRAM
is under development.

Figure 5.12. SRAM macro (1.27 × 0.33 mm²).
5.3.2. Butterfly
The butterfly is one of the characteristic building blocks in an
FFT processor. The butterfly consists mainly of adders/subtractors.
Hence we discuss the implementation of the adder/subtractor first and
the complete butterfly later.

5.3.2.1. Adder design
The adder is one of the fundamental arithmetic components. There
are many adder structures [47].
The ripple-carry adder (RCA) is constructed from full adders. The
RCA is the slowest among the different implementations. However, it
is simple and consumes a small amount of power for 16-bit adder
implementations. If the wordlength is small, the RCA is a suitable
choice for the butterfly.

When the speed is important, for instance for the vector merge
adder in the multiplier, the RCA cannot meet the speed requirement.
In these cases, other carry-accelerating adder structures are
attractive. We select the Brent-Kung adder for the high-speed adder
implementation. The Brent-Kung adder has a short delay and a regular
structure. It will be discussed later in the complex multiplier
design.
RCA implementation
We have developed a program that generates the schematic and the
layout for the RCA. A CMOS full-adder layout is shown in Fig. 5.13. A
3-bit RCA layout with sign-extension is shown in Fig. 5.14.

Figure 5.13. CMOS full-adder. Size: 18.7 × 15.0 µm².

Figure 5.14. Layout of 3-bit ripple-carry adder.
5.3.2.2. High-radix butterfly architecture
The use of a higher radix tends to reduce the memory access rate
and the arithmetic workload, and hence the power consumption [39]
[60]. Efficient design of high-radix butterflies is therefore
important. In practice, the commonly used high-radix butterflies are
radix-4 and radix-8 butterflies. Butterflies with a radix higher than
8 are often decomposed into lower-radix butterflies.

A conventional butterfly is often based on an isomorphic mapping of
the signal-flow graph. The signal-flow graph for a radix-4 butterfly
is shown in Fig. 5.15. The butterfly requires 8 complex
adders/subtractors and has a delay of 2 additions/subtractions.

Figure 5.15. Signal-flow graph for 4-point DFT.

To reduce the complexity, we proposed a carry-save based butterfly
[36]. The computation of a radix-4 butterfly is divided into two
steps. The first step is a 4-2 compression with
addition/subtraction-controlled inputs. The second step is a normal
addition. The delay is changed from two additions/subtractions to
one addition and one 4-2 compression. This implementation reduces the
hardware since a fast adder is more complex than a 4-2 compressor.
The total delay is also reduced since the delay of a 4-2 compressor
is smaller. The implementation of a radix-4 butterfly with carry-save
adders is shown in Fig. 5.16. In the figure, only real additions are
shown, so it appears more complicated than Fig. 5.15, where the
additions are complex additions.

Figure 5.16. Parallel radix-4 butterfly.
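For illustration, one bit-slice of a 4-2 compressor, i.e., two chained full adders, can be modeled in C as below; this is a behavioral sketch of the arithmetic, not the circuit used in [36].

    /* One bit-slice of a 4-2 compressor: a + b + c + d + cin
       = sum + 2*(carry + cout), built from two full adders. */
    typedef struct { int sum, carry, cout; } slice42;

    slice42 compress42(int a, int b, int c, int d, int cin)
    {
        int s1 = a ^ b ^ c;                        /* first full adder  */
        int c1 = (a & b) | (a & c) | (b & c);
        int s2 = s1 ^ d ^ cin;                     /* second full adder */
        int c2 = (s1 & d) | (s1 & cin) | (d & cin);
        slice42 r = { s2, c2, c1 };  /* sum and carry go to the fast adder */
        return r;
    }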
Carry-save radix-4 butterfly implementation.
A carry-save radix-4 butterfly (with a wordlength of 15 for the
real and imaginary parts of the input) was described in VHDL and
synthesized using AMS 0.8 µm standard CMOS technology [36]. The
synthesis result shows that the area saving can be up to 21% for the
carry-save radix-4 butterfly. The delay can be reduced by 22%.

The radix-2/4 split-radix butterfly and the radix-8 butterfly can
also be implemented using the carry-save technique.
Architecture    Area        Delay @ 3.3 V, 25 °C
Conventional    10504.16    12.32 ns
Carry-save      8266.48     9.59 ns
Table 5.3. Performance comparison for two radix-4 butterflies.
Figure 5.16. Parallel radix-4 butterfly, built from (4,2)-counters, inverters, and fast adders.
5.3.3. Complex Multiplier
There is no question that the complex multipliers are among the critical units in FFT processors. From the speed point of view, the complex multiplier is the slowest part of the data path. With pipelining, the throughput can be increased while the latency remains the same. From the power consumption point of view, complex multipliers accounted for about 70% to 80% of the total power consumption in previous FFT implementations [39] [60]. This has been reduced to less than 50% of the total power consumption because the power consumption of the memories increases with the transform length of the FFT [37]. Hence the complex multipliers are key components in FFT design.
A straightforward implementation (see Fig. 5.17 (a)) of a complex multiplication requires four real multiplications, one addition, and one subtraction. However, the number of multiplications can be reduced to three by using a transformation, at the cost of extra pre- and post-additions (see Fig. 5.17 (b)). A more efficient way to reduce the cost of the multiplication is to utilize distributed arithmetic [35] [58].
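A behavioral sketch of the two realizations (helper names are illustrative; with a fixed coefficient, the factors C_R + C_I and C_I − C_R in the three-multiplication form can be precomputed):

```python
def cmul_4(cr, ci, xr, xi):
    # Direct form: four multiplications, one addition, one subtraction.
    return cr*xr - ci*xi, cr*xi + ci*xr

def cmul_3(cr, ci, xr, xi):
    # Transformed form: three multiplications plus extra pre-/post-
    # additions; (cr + ci) and (ci - cr) are precomputable for a fixed
    # coefficient such as an FFT twiddle factor.
    t = cr * (xr + xi)
    return t - (cr + ci) * xi, t + (ci - cr) * xr

print(cmul_4(3, 4, 2, -1))  # (10, 5)
print(cmul_3(3, 4, 2, -1))  # (10, 5)
```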
5.3.3.1. Distributed Arithmetic
Distributed arithmetic (DA) uses precomputed partial sums for an efficient computation of inner products of a constant vector and a variable vector [14].
Figure 5.17. Realization of a complex multiplication: (a) with four real multiplications; (b) with three real multiplications.
Let C_R + jC_I and X_R + jX_I be two complex numbers, of which C_R + jC_I is the coefficient and X_R + jX_I is a variable complex number. In the case of a complex multiplication, we have

Z_R + jZ_I = (C_R + jC_I)(X_R + jX_I) = (C_R X_R - C_I X_I) + j(C_R X_I + C_I X_R)    (5.1)

Hence, a complex multiplication can be considered as two inner products of two vectors of length two. We will realize the real and imaginary parts separately.
For the sake of simplicity, we consider only the first inner product in Eq. (5.1), i.e., the real part. The complex coefficient C_R + jC_I is assumed to be fixed, and two's-complement representation is used for both the coefficient and the data. The data is scaled so that |Z_R + jZ_I| is less than 1. The inner product Z_R can be rewritten

Z_R = C_R \left( -x_{R0} + \sum_{k=1}^{W_d - 1} x_{Rk} 2^{-k} \right) - C_I \left( -x_{I0} + \sum_{k=1}^{W_d - 1} x_{Ik} 2^{-k} \right)

where x_{Rk} and x_{Ik} are the kth bits in the real and imaginary parts, respectively. By interchanging the order of the two summations we get

Z_R = -(C_R x_{R0} - C_I x_{I0}) + \sum_{k=1}^{W_d - 1} (C_R x_{Rk} - C_I x_{Ik}) 2^{-k}    (5.2)

which can be written as

Z_R = -F_k(x_{R0}, x_{I0}) + \sum_{k=1}^{W_d - 1} F_k(x_{Rk}, x_{Ik}) 2^{-k}    (5.3)

where F_k(x_{Rk}, x_{Ik}) = C_R x_{Rk} - C_I x_{Ik}.
F_k is a function of two binary variables, i.e., the kth bits in X_R and X_I. Since F_k can take on only four values, it can be computed and stored in a look-up table.
In the same way, the corresponding binary function for the imaginary part is G_k(x_{Rk}, x_{Ik}) = C_R x_{Ik} + C_I x_{Rk}.
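A behavioral sketch of Eq. (5.3), assuming W_d-bit two's-complement fractions given as bit lists with the sign bit first (helper names are illustrative):

```python
def to_bits(x, wd):
    # Encode a fraction in [-1, 1) as wd two's-complement bits, MSB first.
    n = round(x * 2**(wd - 1)) % 2**wd
    return [(n >> (wd - 1 - k)) & 1 for k in range(wd)]

def da_real_part(cr, ci, xr_bits, xi_bits):
    # Eq. (5.3): Z_R = -F(x_R0, x_I0) + sum_k F(x_Rk, x_Ik) * 2^-k,
    # with F read from a four-entry look-up table.
    F = {(0, 0): 0.0, (0, 1): -ci, (1, 0): cr, (1, 1): cr - ci}
    z = -F[(xr_bits[0], xi_bits[0])]                  # sign bit, weight -1
    for k in range(1, len(xr_bits)):
        z += F[(xr_bits[k], xi_bits[k])] * 2**(-k)
    return z

cr, ci, xr, xi, wd = 0.75, -0.5, 0.375, -0.25, 8
print(da_real_part(cr, ci, to_bits(xr, wd), to_bits(xi, wd)))  # 0.15625
print(cr*xr - ci*xi)                                           # 0.15625
```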
Further reduction of the look-up table can be done by Offset
Binary Coding [14].
5.3.3.2. Offset Binary Coding
The offset binary coding can be applied to distributed arithmetic by using the following expression for the data:

x = -(x_0 - \bar{x}_0) 2^{-1} + \sum_{i=1}^{W_d - 1} (x_i - \bar{x}_i) 2^{-i-1} - 2^{-W_d}    (5.4)

where \bar{b} denotes the inverse of bit b.
Without any loss of generality, we assume that the magnitudes of C_R, C_I, X_R, and X_I are all less than 1 and that the wordlength of X_R and X_I is W_d. Substituting Eq. (5.4) into Z_R + jZ_I = (C_R X_R - C_I X_I) + j(C_R X_I + C_I X_R), the complex multiplication can be written

Z_R + jZ_I = \left\{ -F(x_{R0}, x_{I0}) 2^{-1} + \sum_{i=1}^{W_d - 1} F(x_{Ri}, x_{Ii}) 2^{-i-1} + F(0, 0) 2^{-W_d} \right\} + j \left\{ -G(x_{R0}, x_{I0}) 2^{-1} + \sum_{i=1}^{W_d - 1} G(x_{Ri}, x_{Ii}) 2^{-i-1} + G(0, 0) 2^{-W_d} \right\}    (5.5)

The functions F and G can be expressed as follows:

F(x_{Ri}, x_{Ii}) = C_R (x_{Ri} - \bar{x}_{Ri}) - C_I (x_{Ii} - \bar{x}_{Ii})    (5.6)

G(x_{Ri}, x_{Ii}) = C_I (x_{Ri} - \bar{x}_{Ri}) + C_R (x_{Ii} - \bar{x}_{Ii})    (5.7)
In Eqs. (5.6) and (5.7), each factor (x_i - \bar{x}_i) is either 1 or -1. Hence the partial product, i.e., the function F (G), for each bit is of the form ±C_R ± C_I. All possible partial products are tabulated in Table 5.4. Obviously, it is sufficient to store only the two coefficients (C_R - C_I) and (C_R + C_I), since -(C_R - C_I) and -(C_R + C_I) can easily be generated from them by inverting all bits and adding 1 in the least-significant position.
The complex multiplier with distributed arithmetic is illustrated in Fig. 5.18. The accumulators, which add the partial products, are the same as in a real multiplication. The partial product generation is only slightly more complicated than for a real multiplier. Hence, the complexity of the complex multiplier in terms of chip area corresponds to approximately two real multipliers.
x_{Ri}   x_{Ii}   F(x_{Ri}, x_{Ii})   G(x_{Ri}, x_{Ii})
0        0        -(C_R - C_I)        -(C_R + C_I)
0        1        -(C_R + C_I)         (C_R - C_I)
1        0         (C_R + C_I)        -(C_R - C_I)
1        1         (C_R - C_I)         (C_R + C_I)
Table 5.4. Partial product generation.
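A behavioral sketch of Eq. (5.5), real part, with the partial products taken from Table 5.4; only (C_R − C_I) and (C_R + C_I) need to be stored, their negations being formed by two's-complement negation:

```python
def da_obc_real(cr, ci, xr_bits, xi_bits):
    # Eq. (5.5), real part: partial products from Table 5.4, weights
    # 2^(-i-1) (negated for the sign bit), plus the constant term
    # F(0,0)*2^(-W_d). Only (cr - ci) and (cr + ci) are stored.
    d_minus, d_plus = cr - ci, cr + ci
    F = {(0, 0): -d_minus, (0, 1): -d_plus,
         (1, 0):  d_plus,  (1, 1):  d_minus}          # Table 5.4
    wd = len(xr_bits)
    z = -F[(xr_bits[0], xi_bits[0])] * 2**-1          # sign-bit term
    for i in range(1, wd):
        z += F[(xr_bits[i], xi_bits[i])] * 2**(-i - 1)
    return z + F[(0, 0)] * 2**(-wd)                   # constant term

xr_bits = [0, 0, 1, 1, 0, 0, 0, 0]   # 0.375 in two's complement
xi_bits = [1, 1, 1, 0, 0, 0, 0, 0]   # -0.25
print(da_obc_real(0.75, -0.5, xr_bits, xi_bits))  # 0.15625
print(0.75 * 0.375 - (-0.5) * (-0.25))            # 0.15625
```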
Figure 5.18. Block schematic for the complex multiplier: partial product generation for F and G from X_R and X_I, followed by two accumulators producing Z_R and Z_I.
5.3.3.3. Implementation Considerations
Multipliers can be divided into three types: bit-parallel, bit-serial, and digit-serial. Although a bit-serial or digit-serial multiplier has less chip area than a bit-parallel multiplier, it requires a higher-speed clock than the bit-parallel one for the same throughput. To achieve high throughput, a bit-serial or digit-serial multiplier often needs several parallel units, which increases the activity factor for the local clock. To meet the speed requirement, we therefore select a bit-parallel multiplier. The complex multiplier with DA is shown in Fig. 5.18.
The selection of the pre-computed values from Table 5.4, which corresponds to the partial product generation in the real (imaginary) datapath, can be realized with a 2:1 multiplexer and an XOR gate as shown in Fig. 5.19.
An alternative is to use a 4:1 multiplexer circuit. The benefit of this implementation is that the delay is reduced since the generation of the select signal (X_{Ri} ⊕ X_{Ii}) is not required. Hence the delay for the partial product generation is reduced.
For the accumulator design, the selection of structure is important. The usual structures for accumulators are array, carry-save, and tree structures. The tree structure is the fastest. It is also suitable for our low power strategy, i.e., designing a faster circuit and using voltage scaling to reduce the power consumption. We select the tree structure for the accumulator.
Figure 5.19. Circuits for partial product generation: (a) with a 2:1 multiplexer selected by X_{Ri} ⊕ X_{Ii}; (b) with a 4:1 multiplexer selected by X_{Ri} and X_{Ii}.
The fastest, i.e., lowest height, multi-operand tree is the Wallace tree. The Wallace tree has complex wiring and is therefore difficult to optimize, and the layout becomes irregular. The overturned-stairs tree [40], which has a regular layout and the same height as the Wallace tree when the data wordlength is less than 19, is used in the design of the complex multipliers.
The overturned-stairs adder tree was suggested by Mou and Jutand [40]. The main features of the overturned-stairs adder tree are:
- A recursive structure that yields regular routing and simplifies the design of the layout generator.
- A low tree height, i.e., O(N^{1/p}), where p depends on the type of overturned-stairs tree.
There are several types of overturned-stairs adder trees [40]. The first-order overturned-stairs adder tree, which has the same speed bound as the Wallace tree when the number of operands is less than 19, is chosen.
The construction of the overturned-stairs tree is illustrated in Fig. 5.20. The trees of height 1 to 3 are shown in Fig. 5.20. When the height is more than three, we can construct the tree with only three building blocks: body, root, and connector. The body can be constructed recursively according to Fig. 5.20. The body of height j (j > 2) consists of a body of height j - 1, a branch of height j - 2, and a connector. The branch of height j - 2 is formed by using j - 2 carry-save adders (CSAs) on top of each other with proper interconnections [40]. The connector connects three feed-throughs from the body of height j - 1 and two outputs from the branch of height j - 2 to construct the body of height j. A root (CSA) is connected to the outputs of the connector to form the whole tree of height j + 1.
Since there are only three feed-throughs between the body of height j - 1 and the body of height j in an overturned-stairs tree, the routing planning in the accumulator design is also easy.
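A generic behavioral sketch of carry-save reduction of multiple operands to two words followed by a vector merge addition; a simple 3:2 reduction order is assumed here, whereas the overturned-stairs tree differs only in how the CSAs are interconnected:

```python
def csa(a, b, c):
    # Carry-save adder on integer words: a bitwise full-adder with no
    # carry propagation; returns (sum word, carry word).
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def reduce_to_two(operands):
    # Repeatedly compress three words into two until two remain; the
    # final pair is summed by a fast carry-propagate (vector merge) adder.
    ops = list(operands)
    while len(ops) > 2:
        s, cy = csa(ops.pop(), ops.pop(), ops.pop())
        ops += [s, cy]
    return ops[0] + ops[1]

partial_products = [13, 7, 42, 5, 19, 66, 3]
print(reduce_to_two(partial_products) == sum(partial_products))  # True
```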
The full-adder is essential for the accumulator. The choice of full-adder has a large impact on the performance of the accumulator. We compared several full-adders and found the one most suitable for our implementation. However, a large number of new adder cells have recently been proposed [51], and they should be evaluated in future work.
The first type of full-adder is the conventional static CMOS adder. When the voltage is as low as 1.5 V, the conventional static CMOS full-adder, with its large stack height, is too slow. Furthermore, it is not competitive from a power consumption point of view.
Figure 5.20. Overturned-stairs trees: trees of height 1 to 3, a branch of n CSAs, a body of height 2, and the construction of a tree of height j + 1 from a body of height j (body of height j - 1, branch of height j - 2, connector, and root).
A second type of full-adder is a full-adder with transmission gates (TG). This full-adder realizes the XOR gates with transmission gates, and both the power consumption and the chip area are smaller than those of a conventional static CMOS full-adder.
A third type of full-adder is the Reusens full-adder [50]. This full-adder is fast and compact but requires buffers for the outputs. The buffer insertion is usually considered a drawback since it introduces delay and increases the power consumption. However, in the accumulator the buffer insertion is necessary anyway in order to drive the long interconnections. There is no direct path from V_DD or V_SS in this full-adder, which tends to reduce the power consumption.

Figure 5.21. Conventional static CMOS full-adder.
Figure 5.22. Transmission-gate full-adder.
5.3.3.4. Accumulator Implementation
After the selection of the structure and the adder cell, the accumulator can be implemented.
A program for the automatic generation of overturned-stairs adder trees has been developed. The program can handle different wordlengths for the data and the coefficient. The generated structural VHDL-code can be validated by applying random test vectors in a testbench.
A handcrafted accumulator using the overturned-stairs tree in a 0.35 μm standard CMOS technology is shown in Fig. 5.24. The worst case delay is 26 ns at 1.5 V and 25 °C according to SPICE simulation. The power consumption for this complex multiplier is 15 mW at 1.5 V and 72.6 mW at 3.3 V, both running at 25 MHz.
Adder type     Transistor count    Delay (ns) @ 1.5 V    Power (μW) @ 1.5 V
Static CMOS    24                  4.2                   4.3
TG             16                  3.5                   2.5
Reusens        16                  3.2                   2.1
Table 5.5. Comparison of full-adders in 0.35 μm technology.
Figure 5.23. Reusens full-adder.
5.3.3.5. Brent-Kung Adder Implementation
The Brent-Kung adder is used as the vector merge adder. The Brent-Kung adder belongs to the class of prefix adders, which use the propagate and generate properties of the carry bit in a full-adder to accelerate the carry propagation.
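A behavioral sketch of the generate/propagate formulation; the associative operator below is applied serially here for readability, whereas a Brent-Kung network evaluates all prefixes in a logarithmic-depth tree:

```python
def prefix_add(a_bits, b_bits):
    # Carries via (generate, propagate) pairs, LSB first. The operator
    # op() is associative, which is what allows a Brent-Kung network
    # to evaluate all prefixes in O(log n) levels.
    def op(lo, hi):                       # (g,p) o (g',p') = (g'|p'g, pp')
        return hi[0] | (hi[1] & lo[0]), lo[1] & hi[1]

    gp = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]
    carries, acc = [0], gp[0]
    for x in gp[1:]:
        carries.append(acc[0])            # carry into this bit position
        acc = op(acc, x)
    s = [a ^ b ^ c for (a, b), c in zip(zip(a_bits, b_bits), carries)]
    return s, acc[0]                      # sum bits and carry-out

# 11 + 6 = 17 with 4-bit words, LSB first
print(prefix_add([1, 1, 0, 1], [0, 1, 1, 0]))  # ([1, 0, 0, 0], 1)
```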
A program for schematic generation of the Brent-Kung adder has been developed, and the layout generator is under construction.
The generated schematic of a 32-bit Brent-Kung adder is illustrated in Fig. 5.25. The layout of a 32-bit Brent-Kung adder is shown in Fig. 5.26.
Figure 5.24. Accumulator layout. Size: 704.3 × 479.5 μm².
Figure 5.25. Block diagram for a 32-bit Brent-Kung adder.
5.4. Final FFT Processor Design
After the design of the components and the selection of the FFT architecture, we apply the meet-in-the-middle methodology to combine the components into the complete implementation.
An observation is that a large portion of the total power is consumed by the computation of complex multiplications in the FFT processor. We have implemented a complex multiplier that consumes 72.6 mW with a power supply voltage of 3.3 V at 25 MHz in a standard 0.35 μm CMOS technology. A 1024-point FFT processor requires four complex multipliers and hence consumes 290 mW @ 3.3 V, 25 MHz. Even with bypass techniques for trivial complex multiplications, the power consumption for the computation of the complex multiplications is still more than 210 mW. Hence the reduction of the number of complex multiplications is vital.
Using high radix butterflies can reduce the number of complex multiplications outside the butterflies. However, it is not common to use high radix butterflies in VLSI implementations due to two main drawbacks: the number of complex multiplications within the butterflies increases if the radix is larger than 4, and the routing complexity increases as well. Overcoming these two drawbacks is the key to using high radix butterflies.
Figure 5.26. 32-bit Brent-Kung adder. Size: 0.25 × 0.16 mm².
As is well known, adders consume much less power than multipliers with the same wordlength. This is because an adder has less hardware and far fewer glitches. We have implemented a 32-bit Brent-Kung adder (real) that consumes 1.5 mW @ 3.3 V, 25 MHz, which is much less than a 17 × 13-bit complex multiplier (72.6 mW @ 3.3 V, 25 MHz). Therefore it is efficient to replace the complex multipliers with constant multipliers where possible.
We use constant multipliers in the design of the 16-point butterfly in order to reduce the number of complex multipliers. For a 16-point FFT butterfly, there are three types of non-trivial complex multiplications within the butterfly, i.e., multiplications with W_16^1, W_16^2, and W_16^3. The multiplications with W_16^1 and W_16^3 can share coefficients since cos(π/8) = sin(3π/8) and sin(π/8) = cos(3π/8). We can therefore use constant multipliers, which reduces the complexity. The implementation of a multiplication with W_16^1 is illustrated in Fig. 5.27.

Figure 5.27. Complex multiplication with W_16^1, using the constants cos(p/8) + sin(p/8), cos(p/8) - sin(p/8), and cos(p/8) (p = π).

The selection of the FFT algorithm affects the number and the positions of the constant multipliers. For the 16-point DFT, the radix-4 FFT and SRFFT algorithms are more efficient than the radix-2 FFT algorithm in terms of the number of multiplications. Moreover, both the radix-2 and the split-radix algorithms require three multipliers (two multipliers with W_16^2 and one multiplier with W_16^1) while the radix-4 algorithm requires only two multipliers (one multiplier with W_16^2 and one multiplier with W_16^1/W_16^3). Hence the 16-point butterfly based on radix-4 is more efficient and is selected for our implementation.
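A behavioral sketch of the structure in Fig. 5.27, assuming the standard three-multiplication form with the precomputed constants cos(π/8) + sin(π/8), cos(π/8) − sin(π/8), and cos(π/8):

```python
import math

C = math.cos(math.pi / 8)        # "p" in the figure stands for pi
S = math.sin(math.pi / 8)
CPS, CMS = C + S, C - S          # precomputed constant coefficients

def mul_w16_1(xr, xi):
    # (xr + j*xi) * W_16^1, where W_16^1 = cos(pi/8) - j*sin(pi/8),
    # using three constant multiplications.
    t = C * (xr + xi)
    return t - CMS * xi, t - CPS * xr   # (real, imaginary)

# Check against a direct complex multiplication
x = complex(0.3, -0.7)
zr, zi = mul_w16_1(x.real, x.imag)
print(abs(complex(zr, zi) - complex(C, -S) * x) < 1e-12)  # True
```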
By replacing the complex multiplications with constant multiplications within the 16-point butterfly, the power consumption for the complex multiplications within the 16-point butterfly is reduced to 10 mW @ 3.3 V, 25 MHz. The number of non-trivial complex multiplications can be reduced to 1776. The total number of complex multipliers is reduced to two for a 1024-point FFT due to the use of the 16-point butterflies. The number of non-trivial complex multiplications required for a 1024-point FFT with different algorithms is shown in Table 5.6.
In the 1024-point FFT processor, there are only two complex multipliers and two constant multipliers, which consume less than 160 mW. Hence, a power saving of more than 20% for the computation of the complex multiplications can be achieved. This is less than the theoretical saving of 35% (the ratio of the numbers of complex multiplications) due to the computation of the complex multiplications within the 16-point butterfly.
To cope with the complex routing associated with high radix butterflies, it is better to divide the 16-point butterfly into four stages, since the radix-2 butterfly has the simplest routing.
As mentioned in the resource analysis, the most memory efficient architectures are those with single-path feedback, since they give the minimum data memory, e.g., only N - 1 words for an N-point FFT.
Algorithm             R2FFT    R4FFT    SRFFT    Our approach
No. of Comp. Mult.    3586     2732     2390     1776
Table 5.6. The number of non-trivial complex multiplications for different FFT architectures.
The radix-4 algorithm can be decomposed into a radix-2 algorithm as done in [24]. Hence the mapping of the 16-point butterfly can be done with four pipelined radix-2 butterflies. Each butterfly has its own feedback memory. The 16-point butterfly is illustrated in Fig. 5.28.
The power consumption for the data memory is estimated at 300 mW (the power consumption for memories of 128 words or more is given by the vendor, and that of the smaller memories is estimated through linear approximation down to 32 words). The butterflies consume about 30 mW.
The total power consumption for the three main subsystems is 490 mW. Assuming 15% overhead for, for instance, the clock buffers and communication buses, the power consumption for the FFT processor is estimated at about 550 mW at 3.3 V [38].
The 1024-point FFT processor can also run at 1.5 V, which gives further power savings. The total power consumption of the 1024-point FFT processor is less than 200 mW at 1.5 V in a 0.35 μm standard CMOS process. The memories contribute 55% of the total power consumption, the computation units for butterfly operations and complex multiplications 37%, and others 8%.
5.5. Summary
In this chapter, we have discussed the implementation of a 1024-point FFT processor.
A resource analysis gave a starting point for the implementation. We proposed a wordlength optimization method for the pipelined FFT architectures. This method gave a memory saving of up to 14%.
Figure 5.28. 16-point butterfly: four pipelined butterfly elements between input and output, each with its own feedback memory, and a constant multiplier.
We discussed the implementation of the subblocks, i.e., the butterflies, the memories, and the complex multipliers. We proposed high radix butterflies using the carry-save technique, which is efficient in terms of delay and area. We constructed a complex multiplier using DA and the overturned-stairs tree, which is area efficient. All these subblocks can operate at a low power supply voltage and are suitable for voltage scaling.
Finally, we discussed the implementation of an FFT processor using a 16-point butterfly. The use of the proposed 16-point butterfly reduces the number of complex multiplications and retains the minimum memory requirement, which is power efficient.
6
CONCLUSIONS
This thesis discussed the essential parts of low power pipelined FFT processor design.
The selection of the FFT algorithm is an important starting point for the FFT processor implementation. An FFT algorithm with fewer multiplications and additions is attractive.
The selection of the low power strategy affects the FFT hardware design. Supply voltage scaling is an efficient low power technique and was used for the FFT processor design.
After the selection of the FFT algorithm and the low power strategy, it is important to reduce the hardware complexity. The wordlengths in the stages of the pipelined FFT processor may differ and can therefore be optimized. A simulation-based method has been developed for wordlength optimization of the pipelined FFT architectures. In some cases, the wordlength optimization can reduce the size of the memories by up to 14% compared with using a uniform wordlength in each stage. This also results in a power saving of 14% for the memories. The reduction of the wordlength also reduces the power consumption in the complex multipliers and the butterflies proportionally.
For the detailed design, we proposed a carry-save technique for the implementation of the butterflies. This technique is generally applicable to high-radix butterflies. The proposed high-radix butterflies reduce both the area and the delay by more than 20%. In the complex multiplier design, we use distributed arithmetic to reduce the hardware complexity. We select the overturned-stairs tree for the realization of the complex multiplier; it has a regular structure and the same performance as the Wallace tree when the data wordlength is less than 19. Simulation shows that the complex multiplier operates at up to 30 MHz at 1.5 V. The power consumption is 15 mW at 25 MHz with a 1.5 V power supply voltage. In the SRAM design, we modified an STC D-flip-flop to form a two-stage sense amplifier. The sense amplifier can be operated at a low power supply voltage.
With the optimized wordlengths, the data memory size is reduced by 10%. Using the proposed 16-point butterfly, the number of complex multiplications can be reduced, resulting in a power saving of more than 20% for the complex multiplications. With all these efforts, the total power consumption of the 1024-point pipelined FFT processor, with a continuous throughput of 25 Msamples/s and an equivalent wordlength of 12 bits, is less than 200 mW at 1.5 V in a 0.35 μm standard CMOS process. The memories contribute 55% of the total power consumption, the computation units for butterfly operations and complex multiplications 37%, and others 8%. The memories consume the most significant part of the total power consumption, which indicates that optimization of the memory structure could be important for the implementation of low power FFT processors.
REFERENCES
[1] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, "Precomputation-based sequential logic optimization for low power," IEEE Trans. on VLSI Systems, Vol. 2, No. 4, pp. 426–436, Dec. 1994.
[2] Analog Devices Inc., ADSP-21060 SHARC Super Harvard Architecture Computer, Norwood, MA, 1993.
[3] A. Antola, R. Negrini, and N. Scarabottolo, "Arrays for discrete Fourier transform," Proc. European Signal Processing Conf. (EUSIPCO), Amsterdam, Netherlands, Sep. 1988, Vol. 2, pp. 915–918.
[4] G. Bi and E. V. Jones, "A pipeline FFT processor for word-sequential data," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 37, No. 12, pp. 1982–1985, Dec. 1989.
[5] J. A. C. Bingham, "Multicarrier modulation for data transmission: An idea whose time has come," IEEE Commun. Mag., Vol. 28, pp. 5–14, May 1990.
[6] L. Bisdounis, O. Koufopavlou, and S. Nikolaidis, "Accurate evaluation of CMOS short-circuit power dissipation for short channel devices," Intern. Symp. on Low Power Electronics & Design, Monterey, CA, Aug. 1996, pp. 181–192.
[7] E. O. Brigham, The Fast Fourier Transform and Its Applications, Prentice Hall, 1988.
[8] C. S. Burrus, "Index mappings for multidimensional formulation of DFT and convolution," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-25, No. 3, pp. 239–242, June 1977.
[9] A. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, 1995.
[10] A. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, "Optimizing power using transformations," IEEE Trans. on Computer-Aided Design, Vol. 14, No. 1, pp. 12–31, Jan. 1995.
[11] A. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, Vol. 27, No. 4, pp. 473–484, April 1992.
[12] S. Chang, M. Marek-Sadowska, and K. Cheng, "Perturb and simplify: Multilevel Boolean network optimizer," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 15, No. 12, pp. 1494–1504, Dec. 1996.
[13] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, Vol. 19, pp. 297–301, April 1965.
[14] A. Croisier, D. J. Esteban, M. E. Levilion, and V. Riso, "Digital Filter for PCM Encoded Signals," U.S. Patent 3,777,130, Dec. 1973.
[15] A. M. Despain, "Fourier transform computers using CORDIC iterations," IEEE Trans. on Computers, Vol. C-23, No. 10, pp. 993–1001, 1974.
[16] DoubleBW Systems B.V., PowerFFT processor data sheet, Delft, the Netherlands, March 2002.
[17] P. Duhamel and H. Hollmann, "Split-radix FFT algorithm," Electronics Letters, Vol. 20, No. 1, pp. 14–16, Jan. 1984.
[18] P. Duhamel and M. Vetterli, "Fast Fourier transforms: A tutorial review and a state of the art," Signal Processing, Vol. 19, No. 4, pp. 259–299, April 1990.
[19] I. J. Good, "The interaction algorithm and practical Fourier analysis," J. Royal Statist. Soc., Ser. B, Vol. 20, pp. 361–372, 1958.
[20] A. Ghosh, S. Devadas, K. Keutzer, and J. White, "Estimation of average switching activity in combinational and sequential circuits," in Proc. of the 29th Design Automation Conf., June 1992, pp. 253–259.
[21] S. F. Gorman and J. M. Wills, "Partial column FFT pipelines," IEEE Trans. on Circuits and Systems-II, Vol. 42, No. 6, June 1995.
[22] H. L. Groginsky and G. A. Works, "A pipeline fast Fourier transform," IEEE Trans. on Computers, Vol. C-19, No. 11, pp. 1015–1019, 1970.
[23] D. Hang and Y. Kim, "A deep sub-micron SRAM cell design and analysis methodology," in Proc. of Midwest Symp. on Circuits and Systems, Dayton, Ohio, USA, Aug. 2001, pp. 858–861.
[24] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proc. of the 10th Intern. Parallel Processing Symp. (IPPS), Honolulu, Hawaii, USA, 1996, pp. 766–770.
[25] M. T. Heideman and C. S. Burrus, "On the number of multiplications necessary to compute a length-2^n DFT," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 1, Feb. 1986.
[26] M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the history of the FFT," IEEE Acoustics, Speech, and Signal Processing Magazine, Vol. 1, pp. 14–21, Oct. 1984.
[27] M. Hiraki, H. Kojima, et al., "Data-dependent logic swing internal bus architecture for ultralow-power LSIs," IEEE Journal of Solid-State Circuits, Vol. 30, No. 4, pp. 397–402, April 1995.
[28] http://theory.lcs.mit.edu/~fftw
[29] Intel Corp., SA-1100 Microprocessor Technical Reference Manual, Santa Clara, CA, USA, 1998.
[30] K. Itoh et al., "Trends in low-power RAM circuit technologies," Proc. of the IEEE, pp. 524–543, April 1995.
[31] D. P. Kolba and T. W. Parks, "A prime factor FFT algorithm using high-speed convolution," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-25, No. 4, pp. 281–294, Aug. 1977.
[32] S. Krishnamoorthy and A. Khouja, "Efficient power analysis of combinational circuits," in Proc. of Custom Integrated Circuits Conf., San Diego, California, USA, 1996, pp. 393–396.
[33] C. Lemonds and S. S. Mahant Shetti, "A low power 16 by 16 multiplier using transition reduction circuitry," in Proc. of the Intern. Workshop on Low Power Design, Napa, California, USA, April 1994, pp. 139–142.
[34] W. Li, Y. Ma, and L. Wanhammar, "Word length estimation for memory efficient pipeline FFT/IFFT processors," ICSPAT, Orlando, Florida, USA, Nov. 1999, pp. 326–330.
[35] W. Li and L. Wanhammar, "A complex multiplier using overturned-stairs adder tree," in Proc. of Intern. Conf. on Electronic Circuits and Systems (ICECS), Paphos, Cyprus, Sep. 1999, Vol. 1, pp. 21–24.
[36] W. Li and L. Wanhammar, "Efficient radix-4 and radix-8 butterfly elements," in Proc. of NorChip Conf., Oslo, Norway, Nov. 1999, pp. 262–267.
[37] W. Li and L. Wanhammar, "A pipeline FFT processor," in Proc. of IEEE Workshop on Signal Processing Systems (SiPS), Taipei, Nov. 1999, pp. 654–662.
[38] W. Li and L. Wanhammar, "An FFT processor based on 16-point module," in Proc. of NorChip Conf., Stockholm, Sweden, Nov. 2001, pp. 125–130.
[39] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science and Technology, Thesis No. 618, Linköping University, Sweden, May 1997.
[40] Z. Mou and F. Jutand, "Overturned-stairs adder trees and multiplier design," IEEE Trans. on Computers, Vol. C-41, pp. 940–948, 1992.
[41] Motorola Inc., DSP96002 IEEE Floating-Point Dual-Port Processor User's Manual, Phoenix, AZ, 1989.
[42] L. Nielsen, C. Niessen, J. Sparsø, and K. van Berkel, "Low-power operation using self-timed circuits and adaptive scaling of the supply voltage," IEEE Trans. on VLSI Systems, Vol. 2, No. 4, pp. 391–397, Dec. 1994.
[43] E. Nordhamn, Design of an Application Specific FFT Processor, Linköping Studies in Science and Technology, Thesis No. 324, Linköping University, Sweden, June 1992.
[44] M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing," Journal of the Association for Computing Machinery, Vol. 15, No. 2, pp. 252–264, April 1968.
[45] M. Potkonjak and J. Rabaey, "Algorithm selection: A quantitative optimization-intensive approach," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 18, No. 5, pp. 524–532, May 1999.
[46] J. Rabaey, L. Guerra, and R. Mehra, "Design guidance in the power dimension," in Proc. of Intern. Conf. on Acoustics, Speech, and Signal Processing, Detroit, Michigan, USA, May 1995, Vol. 5, pp. 2837–2840.
[47] J. Rabaey and M. Pedram (Eds.), Low Power Design Methodologies, Kluwer, 1996.
[48] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice Hall, 1975.
[49] C. M. Rader, "Discrete Fourier transforms when the number of data samples is prime," Proc. of the IEEE, Vol. 56, pp. 1107–1108, June 1968.
[50] P. P. Reusens, High Performance VLSI Digital Signal Processing Architecture and Chip Design, Ph.D. thesis, Cornell University, Aug. 1983.
[51] M. Sayed and W. Badawy, "Performance analysis of single-bit full adder cells using 0.18, 0.25, and 0.35 μm CMOS technologies," in Proc. of IEEE Intern. Symp. on Circuits and Systems (ISCAS), Vol. 3, Scottsdale, Arizona, USA, May 2002, pp. 559–563.
[52] E. Seevinck et al., "Static-noise margin analysis of MOS SRAM cells," IEEE Journal of Solid-State Circuits, Vol. SC-22, No. 5, pp. 748–754, Oct. 1987.
[53] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Trans. on VLSI Systems, Vol. 3, No. 1, pp. 49–58, March 1995.
[54] H. J. M. Veendrick, "Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits," IEEE Journal of Solid-State Circuits, Vol. SC-19, pp. 468–473, Aug. 1984.
[55] Texas Instruments Incorporated, An Implementation of FFT, DCT, and Other Transforms on the TMS320C30, Application report SPRA113, Dallas, Texas, 1997.
[56] V. Tiwari, S. Malik, and P. Ashar, "Compilation techniques for low energy: An overview," in Proc. of 1994 IEEE Symp. on Low Power Electronics, San Diego, California, USA, Oct. 1994, pp. 38–39.
[57] Q. Wang and S. Vrudhula, "Multi-level logic optimization for low power using local logic transformations," Proc. of Intern. Conf. on Computer-Aided Design, San Jose, California, USA, 1996, pp. 270–277.
[58] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[59] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, second edition, 1993.
[60] T. Widhe, Efficient Implementation of FFT Processing Elements, Linköping Studies in Science and Technology, Thesis No. 619, Linköping University, Sweden, June 1997.
[61] S. Winograd, "On computing the discrete Fourier transform," Proc. Nat. Acad. Sci. USA, Vol. 73, No. 4, pp. 1005–1006, April 1976.
[62] E. H. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementation," IEEE Trans. on Computers, Vol. C-33, No. 5, pp. 414–426, 1984.
[63] G. Yeap, Practical Low Power Digital VLSI Design, Kluwer, 1998.
[64] J. Yuan, High Speed CMOS Circuit Technique, Linköping Studies in Science and Technology, Thesis No. 132, Linköping University, Sweden, 1988.
[65] Zarlink Semiconductor Inc., PDSP16515A Stand Alone FFT Processor, Advance Information, April 1999.