NTTs

Number Theoretic Transforms
G.A. Jullien
Extract from Number Theoretic Techniques in Digital Signal Processing
Book Chapter
Advances in Electronics and Electron Physics, Academic Press
Inc., (Invited), vol. 80, Chapter 2, pp. 69-163, 1991.
Introduction
This chapter discusses the application of number theory to the intensive computations required in
many Digital Signal Processing (DSP) systems. Number theory was always looked upon as one of
the last vestiges of pure mathematical pursuit, where the idea of applying the theories to practical
purposes was irrelevant. Over the last few decades we have been discovering very practical
applications for number theory; among these are applications in coding theory, communications,
physics and digital information (Schroeder (1985)). Our specific interest in this work is the
application of number theory to the mechanics of arithmetic computations themselves. Modern day
computer technology is heavily dependent on the manipulation of numbers; in fact the automation
of arithmetic operations was the quest of all the early computer pioneers, from Pascal to Babbage to
von Neumann. The involvement of number theory in this endeavor, however, has been strangely
limited. The use of number theory in helping to perform the basic arithmetic operations has not
been accepted by the commercial world. Perhaps the reason for this is the pervasiveness of the binary
number system in the mechanization of computer arithmetic. A large number of hardware designers
have little or no knowledge that alternative ways of computing arithmetic exist!
There is one field in which the high speed arithmetic manipulation of numbers is of paramount
importance. This is the field of Digital Signal Processing (DSP), and it is here that alternative
arithmetic hardware may find a niche. This is indeed the subject of this paper.
Why is it that some DSP systems are so computationally intensive? Typically the DSP system we
are discussing will be involved with the processing of streams of data that are indeterminately long;
the requirement of the system is to produce processed output data at the same rate that the input
data is entering the processor. This is referred to as real-time processing. Often the arithmetic
manipulations are simple in form (cascades of additions and multiplications in a well defined
structure) but the numbers of operations that have to be computed every second can be enormous. It
will be constructive at this point to consider an example to illustrate the processing power required
by a typical DSP system.
Let us consider a basic problem; the filtering of television pictures in real-time. We will assume that
a two-dimensional filter is required and is given by the following equation:
$$y_{m,n} = \sum_{i=-2}^{2} \sum_{j=-2}^{2} w_{i,j} \, x_{m-i,n-j}$$
Here $w_{i,j}$ is the filter kernel, and represents the weights on a two-dimensional averaging procedure
around each output sample ($y_{m,n}$). The averaging is based on a neighborhood of input samples
$\{x_{i,j}\}$ that extends 5 samples in each dimension and is centered on the output pixel position. The double
summation, in fact, represents two-dimensional convolution. Convolution is a basic operation to all
linear signal processing problems since it represents the mapping function of an input signal to an
output signal through a linear system. The real time constraint is important since, as we will see, an
extremely large computational requirement will be necessary in order to construct the filter.
In order to see the speed of operation required by our filter, we will perform some elementary
calculations based on typical requirements for the filter. We assume a 525 line scan with 500
picture elements (pixels) in each line. We also assume that each primary colour is to be filtered
independently with the 5 x 5 filter. With an interlaced system we need to filter 30 images every
second. Each output pixel (made up of three primary colour sub-pixels) requires 75
multiply/accumulate operations (M/A Ops) to implement the filter. There are 262,500 pixels in each
image and so $1.96875 \times 10^7$ M/A Ops are required for each image. Since we are required to filter 30
images per second, the total number of M/A Ops per second is $5.9 \times 10^8$. Assuming that an M/A
Op is considered a basic operation in the processor used to filter the image, then we need a machine
that delivers 590 Mega Operations per second (MOPS). In these terms we require a super
computer to deliver this performance!
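The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch that restates the numbers assumed in the text (525 lines, 500 pixels per line, three colours, a 5 x 5 kernel and 30 images per second); the variable names are ours.

```python
# Workload estimate for the 5 x 5 television filter, using the figures from the text.
LINES = 525
PIXELS_PER_LINE = 500
COLOURS = 3
KERNEL_TAPS = 5 * 5
IMAGES_PER_SECOND = 30

pixels_per_image = LINES * PIXELS_PER_LINE               # pixels in one image
ma_ops_per_pixel = COLOURS * KERNEL_TAPS                 # multiply/accumulates per output pixel
ma_ops_per_image = pixels_per_image * ma_ops_per_pixel
ma_ops_per_second = ma_ops_per_image * IMAGES_PER_SECOND

print(pixels_per_image, ma_ops_per_pixel, ma_ops_per_image, ma_ops_per_second)
```

The last value, 590,625,000, is the figure the text rounds to 590 MOPS.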
Clearly the implementation has to be somewhat different from a large mainframe supercomputer; it
would certainly be difficult to fit one into a normal sized television set!
The filtering operation has the advantage of not requiring the full features of a general purpose
computer (the filtering operation does not have to be performed within the typical multi-user,
complex operating system environment of a main frame machine) and the computations are simply
repeated arithmetic operations in a predetermined structure. In fact, it is rather pointless to talk about
the computational requirements in normal general computer terms (MIPs, MOPs, MFLOPs, etc.),
but rather to talk in direct terms of the data flow requirements; that is, the throughput rate of the
filter in samples per second (in this example, a sample is a pixel). For our example filter we have to
filter pixels at a rate of $7.875 \times 10^6$ pixels per second. If we provide arithmetic hardware which
contains 75 Multiply/Accumulate circuits, then each circuit has to operate within 120 ns. If we use
currently available integrated circuits that operate within 60 ns, then we can build a circuit with only
38 integrated circuits for the basic arithmetic operations, and perform two operations, sequentially,
with one circuit. We have therefore reduced the original requirement of a supercomputer to a
system containing a relatively small number of 'chips'.
It is the purpose of this paper to discuss ways in which number theoretic techniques can be used to
perform DSP operations, such as this example filter, by reducing the amount of hardware involved
in the circuitry. Since we are going to be directly concerned with the implementation of DSP
arithmetic, it might be helpful to finish off our example by considering the arithmetic details of each
Multiply/Accumulate operation.
The representation of each primary colour pixel by 8-bits is more than adequate for very high
definition of the image and the quantization of each filter weight by 8-bits is also very generous.
This naturally leads to a 16-bit multiplier structure. Since 25 such products are accumulated
for each output pixel, an extra 5 bits are required for this. With 21 bits we can, in
fact, compute each output pixel with no error in the computation process. Since 21-bits is a
pessimistic upper bound for the computational dynamic range (only required if both the filter
coefficients and the input pixels are at their maximum values - a very rare occurrence), we can easily
specify a 20-bit range for accurate computation.
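The dynamic range argument can be verified numerically. The sketch below assumes unsigned 8-bit pixels and weights (the text does not state the number format, so this is our assumption) and compares the 16 + 5 bit bound with the width actually needed in the worst case.

```python
# Dynamic range of one output pixel: 25 accumulated 8-bit x 8-bit products.
# Assumption: pixels and weights are unsigned 8-bit magnitudes.
PIXEL_BITS = 8
WEIGHT_BITS = 8
TAPS = 25

max_pixel = 2**PIXEL_BITS - 1
max_weight = 2**WEIGHT_BITS - 1
worst_case_sum = TAPS * max_pixel * max_weight    # every pixel and weight at its maximum

product_bits = PIXEL_BITS + WEIGHT_BITS           # width of a single product
growth_bits = (TAPS - 1).bit_length()             # ceil(log2(25)) extra bits for the sum
bound_bits = product_bits + growth_bits           # the 21-bit bound quoted in the text
exact_bits = worst_case_sum.bit_length()          # width needed for the worst case itself

print(worst_case_sum, bound_bits, exact_bits)
```

Under these assumptions the worst case does need all 21 bits, so a 20-bit range relies on the observation that the all-maximum case is very rare.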
This calculation leads us to an important observation that we do not need floating point arithmetic.
We can, in fact, perform error-free computation within the dynamic range of a typical floating point
standard mantissa length - no manipulation of the exponent being required at all! This example
represents a fine-grained computation (one in which successive parts of the computation are of a
limited dynamic range) and is typical of many DSP algorithms. The limitations we are able to
impose on the arithmetic calculations will be a great help in reducing the hardware required in the
very high-speed throughput systems to be discussed in this paper.
For the purposes of this paper we will restrict ourselves to the computation of linear filter and
transform algorithms which have, for the one-dimensional case, the inner product form:
$$y_n = \sum_{m=0}^{N-1} x_m \, w_{f(n,m)}$$
In the case of linear filtering (convolution) $f(n,m) = n-m$; for the case of a linear transform with the
convolution property, $w_{f(n,m)}$ is a power of a primitive $N$th root of unity. For the DFT, $w_{f(n,m)} = e^{-i\frac{2\pi}{N}nm}$.
Although this limitation appears overly restrictive, the general inner product form above probably
accounts for the vast majority of digital signal processing functions implemented commercially, and
also leads us to a body of literature that uses number theory to indirectly compute the convolution
form of the inner product formulation. We will also restrict our investigation of the application of
number theoretic techniques to those that use finite field arithmetic directly in the computations.
This will not cover those algorithms whose structure is derived via number theoretic techniques, and
whose computation is performed over the usual complex field (e.g. Winograd Fourier Transforms);
there is a vast amount of literature already available on this separate subject, and the reader is
directed to the work by McClellan and Rader (1979) as a starting point.
Our discussion in this review will concentrate on two areas: the use of finite ring/field operations
for computing ordinary integer computations (Residue Number System) for high speed signal
processing and the use of algorithms defined over finite rings/fields in which not only the arithmetic
is performed over such finite structures, but also the properties of the algorithm depend on the
properties of the finite field/ring over which the algorithm is defined. This is an important
differentiation. Our sole concern here is to examine arithmetic and general computations over finite
mathematical systems.
Our starting point should be the recognized starting point for the modern field of digital signal
processing, the Fast Fourier Transform (FFT). The FFT is not a new transform, but really a name
for a set of algorithms that compute the Discrete Fourier Transform (DFT) with far fewer
arithmetic operations. In particular, multiplications, traditionally a hardware and/or time consuming
arithmetic operation, are reduced considerably. With the publishing of the first FFT algorithm
(Cooley and Tukey (1965)) digital signal processing began to emerge as a field in its own right.
Only a few years after that landmark paper, the first applications of finite field transforms to
computing convolution appeared. We will not attempt to retrace the history here, but will discuss
these two discoveries as initial steps on the road to the application of number theory to the digital
processing of signals. The discovery of efficient ways of computing the DFT allowed both the
computation of spectra, and the implementation of finite impulse response filters to be carried out
with much reduced computational time. The most interesting use of the FFT, in the initial years, was
in the latter application. The ability of the DFT to indirectly compute the inner product form for FIR
filters comes from its convolution property. The result is surprising because it seems, on the
surface, to involve more work in that three transforms have to be computed. It is also surprising that
a transform with a basis set involving trigonometrical functions can be used to efficiently compute
convolutions between, say, integer sequences, but it does indeed work.
In order to examine this remarkable property we will briefly discuss FIR filters and then introduce
the DFT.
A. Finite Impulse Response Digital Filtering using DFTs
Finite impulse response (FIR) digital filters produce an output based on a weighted sum of present
and past inputs. Mathematically this is a weighted averaging scheme represented by:
$$y_k = \sum_{n=0}^{N-1} h_n \, x_{k-n}$$
where $\{h_n\}$ is the weighting sequence, and $\{x_n\}$ is the input sequence.
The weighting sequence, $\{h_n\}$, characterizes the filter, and it is the determination of the sequence that
corresponds to the design of the filter. The sequence is sometimes referred to as the impulse
response based on the equivalent characterization of analog filters. In the digital case the impulse
function is a perfectly realizable sequence {1, 0, 0, 0, .....}. If we use this impulse function as the x
sequence, the output of the filter will be the $\{h_n\}$ sequence. We can also define a frequency
response for the filter by using an input sequence that is a complex sinusoid, $x_n = e^{i\Omega n}$. Here, the
frequency is simply an angular increment, Ω, which locates the value of the input, at sample n, on
the unit circle at an angle of Ωn. As with continuous systems, we can invoke the use of a transform
to map convolution into multiplication. The transform we use is the z-transform (Jury (1964))
which is defined as:
$$X(z) = \sum_{n=0}^{N-1} x_n \, z^{-n}$$
Using this definition, and the convolution property, we are able to define a transfer function for the
FIR filter as:
$$\frac{Y(z)}{X(z)} = H(z)$$
By letting $z = e^{i\Omega}$, we can find the magnitude and phase shift (polar coordinates) of $H(e^{i\Omega})$, and
hence the frequency response ($H(e^{i\Omega})$ vs $\Omega$). We note that a simple inversion procedure exists; we
simply examine the coefficients of H(z) and assign them to delayed samples of the impulse
response. H(z) can also be obtained by a polynomial division of Y(z) by X(z); for the FIR filter we
know that this division will always have finite length and so H(z) (for z an indeterminate complex
number) has only zeros. For filters which have an infinite length response (infinite impulse
response (IIR) filters), H(z) contains poles as well as zeros; i.e. we represent H(z) as a rational
function of polynomials. Closed solutions exist for determining the impulse response for IIR
filters based on the roots of H(z) (Jury (1964)). We will not consider IIR filters here since our
interest is in implementing a finite convolution process. IIR filters possess the property that they
are able to implement arbitrary frequency magnitude responses with far fewer multiplications
than equivalent FIR filters. They have unfortunate side-effects, however, of: possible instability due
to inexact multipliers; possibility of producing limit cycle oscillations; possibility of overflow
oscillations; difficulty in implementing linear phase responses. Although a great deal of effort has
been spent on studying such filters (see, for example, Rabiner and Gold (1975)), it is not clear that
the implementation efficiency suitably compensates for the problems. Perhaps the best solution is
to concentrate on efficient implementation strategies for FIR filters. This is the main subject of this
paper.
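As a brief illustration of the FIR definitions in this section, the following sketch evaluates the weighted sum directly and confirms that an impulse input reproduces the weighting sequence. The weights are hypothetical values chosen only for the example, and the input is assumed to be zero before the first sample.

```python
def fir_filter(h, x):
    """Direct-form FIR: y[k] = sum over n of h[n] * x[k-n], taking x[k] = 0 for k < 0."""
    y = []
    for k in range(len(x)):
        acc = 0
        for n, h_n in enumerate(h):
            if k - n >= 0:
                acc += h_n * x[k - n]
        y.append(acc)
    return y


h = [1, 2, 3, 2, 1]                  # hypothetical integer weights
impulse = [1, 0, 0, 0, 0, 0, 0]      # the digital impulse sequence
response = fir_filter(h, impulse)    # reproduces h, followed by zeros
print(response)
```

Because the weights and inputs are integers, every output is computed without error, which is the property the later sections aim to preserve in faster algorithms.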
II. Discrete Fourier Transforms

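The DFT is defined as a transformation that maps a sequence $x_n \xrightarrow{D} X_k$ where:
$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-j2\pi nk/N} = \sum_{n=0}^{N-1} x_n \, e^{jW_N nk} \qquad (1)$$
with $W_N = -\frac{2\pi}{N}$ (negative, so that the two forms of (1) agree).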
In terms of the Fourier transform, this may be regarded as an approximation to the Fourier integral,
but it turns out that it is a transform in its own right with exact properties. Two of the properties that
are important to the subject of this paper are the inverse and convolution property. In fact we need
the first to prove the second.
A. Inverse and Convolution Property

One important property is that the transform is reversible, i.e. the mapping $X_k \xrightarrow{D^{-1}} x_n$ exists, with
an inverse given by:
$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{-jW_N nk} \qquad (2)$$
The proof that the inverse exists yields the basis for the second important property, the convolution
property.
The easiest way to prove that (2) is the inverse of (1) is by direct substitution. We will make the
substitution using an appropriate dummy variable, m; the result should yield $x_n = x_m$ for all $m = n$.
$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} \left[ \sum_{m=0}^{N-1} x_m \, e^{jW_N mk} \right] e^{-jW_N nk}$$
Interchanging the order of summation and gathering terms we find:
$$x_n = \frac{1}{N} \sum_{m=0}^{N-1} x_m \left[ \sum_{k=0}^{N-1} e^{jW_N k(m-n)} \right]$$
or
$$x_n = \frac{1}{N} \sum_{m=0}^{N-1} x_m \left[ \sum_{k=0}^{N-1} W_N(k[m-n]) \right] \qquad (3)$$
where we write $W_N(x)$ for $e^{jW_N x}$. The function $W_N(k[m-n])$ displays a rather interesting property.
$W_N(k[m-n])$ is a function on the unit circle in the complex plane with N unique values:
$1, e^{j[m-n]W_N}, e^{j2[m-n]W_N}, \ldots, e^{j(N-1)[m-n]W_N}$ for k = 0, 1, 2, ..., N-1. For $[m-n] = KN$, where K is some integer,
all values of the function are 1 and so the sum, $\sum_{k=0}^{N-1} W_N(k[m-n])$, evaluates to N. For $[m-n] \neq KN$
we can invoke the fact that the function yields a finite geometric series with the sum:
$$\sum_{k=0}^{N-1} W_N(k[m-n]) = \frac{1 - e^{jN[m-n]W_N}}{1 - e^{j[m-n]W_N}} = 0.$$
Thus, the only non-zero values generated by (3) are for $[m-n] = KN$, and for these values we obtain
the result $x_n = x_{n+KN}$. If we restrict both the sample and transformation domain to be periodic, with
period N, then we have proved that the transform is invertible.
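The invertibility argument can be checked numerically. The sketch below is a plain O(N^2) implementation written for illustration only; it writes the exponents directly in terms of 2π/N, and the function names are ours. `kernel_sum` evaluates the bracketed sum in (3) for a given difference d = m-n.

```python
import cmath


def dft(x):
    """Forward DFT with kernel exp(-j*2*pi*n*k/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def idft(X):
    """Inverse DFT: conjugate kernel and a 1/N scale factor."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]


def kernel_sum(N, d):
    """Sum over k of exp(-j*2*pi*k*d/N): N when d is a multiple of N, otherwise 0."""
    return sum(cmath.exp(-2j * cmath.pi * k * d / N) for k in range(N))


x = [3, 1, 4, 1, 5, 9, 2, 6]
roundtrip = idft(dft(x))
print([round(v.real) for v in roundtrip])
print(abs(kernel_sum(8, 0)), abs(kernel_sum(8, 3)), abs(kernel_sum(8, 16)))
```

The round trip returns the input only to within floating point error, a point that becomes important when exact integer convolution is required.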
Based on this result for the existence of an exact inverse for the DFT, we can now proceed to
examine the convolution property.
We start by considering the result of multiplying two transform domain sequences together.
Assume that the sample domain sequences are $x_n$ and $y_n$ with transform domain sequences of $X_k$
and $Y_k$. The result of multiplying the transform domain sequences is a new sequence $Z_k = X_k \cdot Y_k$.
Using the DFT of the x sequence explicitly, we have:
$$Z_k = \sum_{n=0}^{N-1} x_n \, W_N(kn) \cdot Y_k \qquad (4)$$
Inverting equation (4) yields:
$$z_m = \frac{1}{N} \sum_{k=0}^{N-1} \left( \sum_{n=0}^{N-1} x_n \, W_N(kn) \cdot Y_k \right) W_N(-km) \qquad (5)$$
We now interchange the order of summation to find:
$$z_m = \sum_{n=0}^{N-1} x_n \left( \frac{1}{N} \sum_{k=0}^{N-1} Y_k \, W_N(k[n-m]) \right) \qquad (6)$$
The inner expression of equation (6), in parentheses, is the sequence $y_{m-n}$. Since {y} is periodic,
we can retain values from a single period by forming $y_{m \oplus_N (-n)}$. Note that here we are using the
notation $\oplus_N$ to indicate addition modulo N. This notation will be expanded later.
From (6) we obtain the final result for the output sequence:
$$z_m = \sum_{n=0}^{N-1} x_n \, y_{m \oplus_N (-n)} \qquad (7)$$
zm is therefore the convolution of the sequence {x} with the sequence {y}. It is the convolution of
two periodic sequences, and is often called cyclic, or periodic, convolution. As noted before, we only
need to know the samples of a single period in order to perform the calculation. If we need to
determine the periodic convolution of two sequences, then we can perform the calculation indirectly
by taking DFTs of each sequence, multiplying in the transform domain, and inverting the result. If
we require the convolution of two aperiodic sequences, it turns out that we can still use periodic
convolution, providing that we are prepared to perform several periodic convolutions with sub
samples of the sequences, and assemble the final sequence from partial outputs of the periodic
convolutions; these are the so-called overlap/save and overlap/add procedures. We are not able to
give full details of such procedures here, but there are many fine text books which detail the two
approaches (e.g. Rabiner and Gold (1975)).
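The convolution property can be illustrated by computing a cyclic convolution both ways. The sketch below uses a plain O(N^2) DFT rather than a fast algorithm, and the two short sequences are arbitrary values chosen for the example; the function names are ours.

```python
import cmath


def dft(x, sign=-1):
    """Plain O(N^2) DFT; sign=-1 is the forward kernel, sign=+1 the unscaled inverse."""
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def cyclic_convolution_direct(x, y):
    """Equation (7): z[m] = sum over n of x[n] * y[(m - n) mod N]."""
    N = len(x)
    return [sum(x[n] * y[(m - n) % N] for n in range(N)) for m in range(N)]


def cyclic_convolution_dft(x, y):
    """Transform both sequences, multiply point by point, then invert."""
    N = len(x)
    Z = [a * b for a, b in zip(dft(x), dft(y))]
    return [v / N for v in dft(Z, sign=+1)]


x = [1, 2, 3, 4]
y = [1, 0, 2, 0]
direct = cyclic_convolution_direct(x, y)
indirect = [round(v.real) for v in cyclic_convolution_dft(x, y)]
print(direct, indirect)
```

Both routes give the same sequence once the indirect result is rounded; the indirect route only pays off when the transforms themselves are computed efficiently.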
The use of an indirect computation of convolution is advantageous when the amount of computation
involved is considerably less than that required using the direct convolution sum. An N point
periodic convolution requires $N^2$ multiplications and $N^2 - N$ binary (two-operand) additions. Can we
reduce this by an indirect computation involving DFTs? The answer is yes, but only if the
transform can be computed efficiently. An algorithm that efficiently computes a DFT is called a fast
algorithm, and the computation of a DFT with such an algorithm is called a Fast Fourier Transform
(FFT). Note that the FFT is not a separate transform, just a fast (or efficient) way of computing a
DFT.
For simplicity we will discuss the development of fixed radix decimation algorithms that use a
power of two as the radix. These were the first Fast algorithms to be developed, and still represent
the major implementation procedure for commercial hardware and software solutions.
III. Fast Fourier Transform (FFT) Algorithms
We will briefly examine the classical decimation in time algorithm to appreciate the magnitude of
savings that can occur in computing the DFT using a fast algorithm.
A. Decimation in Time FFT Algorithm
There are two basic decimation (sequence slicing) procedures: decimation over the sample domain
(decimation in time, DIT) or decimation over the transform domain (decimation in frequency, DIF).
We will consider the first procedure, with the assumption that N contains some power of 2 as a factor.
The transformation $x_n \xrightarrow{D} X_k$ can be written:
$$X_k = \sum_{n=0}^{N-1} x_n \, W_N(nk) \qquad (8)$$
and we can decompose the computation by computing (8) for the even and odd samples, separately,
and then put the two results back together. This yields:
$$X_k = \sum_{n=0}^{N/2-1} x_{2n} \, W_{N/2}(nk) + W_N(k) \cdot \sum_{n=0}^{N/2-1} x_{2n+1} \, W_{N/2}(nk) \qquad (9)$$
We have now reduced the computational requirement for an N-point DFT to that required for two
$\frac{N}{2}$-point DFTs and an extra N multiplications. If we are interested in reducing multiplications then
we have reduced the requirement from $N^2$ multiplications to $\frac{N^2}{2} + N$ multiplications. Although this
may not represent a large reduction, we are able to repeat the process by decimating each of the
$\frac{N}{2}$-point DFTs to two $\frac{N}{4}$-point DFTs. We can continue in this way until we run out of factors of N.
The greatest decimation occurs for N, a power of 2, in which the basic computational element
reduces to a 2-point DFT (which can be computed using real numbers). In this case the number of
multiplications reduces to $\frac{N}{2}\log_2 N$ (in fact some of these multiplications are trivial, but we will use
this figure as an upper bound). The calculation reduces to the calculation of $\frac{N}{2}\log_2 N$ two-sample
DFTs with a multiplication by a power of W. This multiplier is popularly referred to as a 'twiddle
factor'.
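The even/odd decomposition can be written as a short recursive routine. This is a minimal sketch of a radix-2 decimation in time algorithm, not an optimized implementation; it assumes N is a power of 2 and reuses each twiddle product for two outputs, which is what brings the count down to N/2 multiplications per stage. The function names are ours, and `dft_direct` is included only as a reference for checking.

```python
import cmath


def dft_direct(x):
    """Reference O(N^2) DFT used only to check the fast routine."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def fft_dit(x):
    """Recursive radix-2 decimation in time FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    if N % 2:
        raise ValueError("length must be a power of 2")
    even = fft_dit(x[0::2])            # N/2-point DFT of the even samples
    odd = fft_dit(x[1::2])             # N/2-point DFT of the odd samples
    X = [0] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]   # twiddle factor times odd term
        X[k] = even[k] + t             # upper output of the two-point combination
        X[k + N // 2] = even[k] - t    # lower output reuses the same product
    return X


x = [1, 2, 3, 4, 0, 0, 0, 0]
fast = fft_dit(x)
slow = dft_direct(x)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The recursion returns the outputs in natural order because it reorders the inputs at each level; in-place formulations trade this for a scrambled input or output order.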
The basic computational element is shown in the signal flow graph in Fig 1 below, and is often
referred to as a 'butterfly' based on its structure.
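[Figure 1: signal flow graph of the butterfly. Inputs A and B; B is multiplied by the twiddle factor $W_N^r$; outputs $C = A + B \cdot W_N^r$ and $D = A - B \cdot W_N^r$.]
Figure 1 Butterfly calculation with twiddle factor multiplier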
[Figure 2: signal flow graph with inputs $x_0, \ldots, x_7$ in natural order, three stages of butterflies with twiddle factors $W_8^0$, $W_8^1$, $W_8^2$ and $W_8^3$, and outputs in the order $X_0, X_4, X_2, X_6, X_1, X_5, X_3, X_7$.]
Figure 2 FFT Signal Flow Graph for 8-point DFT transform
Notice that there is only one complex multiplication and two complex additions per butterfly. An
example is shown in Figure 2 for an 8-point transform using $4\log_2 8 = 4 \times 3 = 12$ butterfly
calculations. Appropriate pairs of shaded nodes represent the addition/subtraction network of the
butterfly calculation.
Note that although there are 12 multiplications indicated, in fact only 5 of these are non-trivial (i.e.
not a multiplication by unity). Before we leave this brief introduction to fast algorithms it can be
observed that the output sequence has been scrambled in the process of reducing the number of
multiplications, and that the structure is not the same from stage to stage. Algorithms are available
that do not scramble the output data (or require the input data to be scrambled) and there are also
algorithms that maintain a constant structure, but not together (a sort of Heisenberg uncertainty
principle for FFTs!).
It seems to be a law of nature that as one attempts to reduce the number of arithmetic operations (or
at least the most costly, multiplication) the structure suffers. This comes back to haunt us when we
examine VLSI implementations of such algorithms.
There are many works available on the myriad of fast algorithms that have been discovered since the
initial algorithm was published (for example Blahut (1985)). It has been our intent here simply to
point out that these algorithms exist and to briefly discuss one of the earliest.
B. Use of the DFT in indirect computation of convolution
Based on the fact that DFTs can be computed very efficiently, it becomes clear that the operation of
convolution can also be computed much more efficiently than directly implementing the convolution
sum. Assuming that we wish to compute circular convolution, then an N-point convolution can be
computed using 2 DFTs and an inverse DFT. Since an inverse DFT is the same complexity as a
forward DFT, the number of multiplications involved is $3N\log_2 N + N$. Relative to the $N^2$
multiplications of the direct sum, the multiplications for an N-point circular convolution are reduced
by the factor $\frac{1}{N}\{1 + 3\log_2 N\}$. For the case of
filtering with a fixed set of weights, we can store the DFT of the weight sequence and obtain a
greater reduction of $\frac{1}{N}\{1 + 2\log_2 N\}$ multiplications. These savings are reduced if one is
interested in aperiodic convolution (the normal filtering operation) but for large filter kernels (large
N) the savings can be significant enough to warrant using this indirect approach.
We note one unfortunate fact: even if we are convolving only real sequences, it is still necessary to
use complex arithmetic. In fact our reduction calculations above are predicated on the need for
complex sequence convolution. A further unfortunate effect is that, since the roots of unity are
obtained from transcendental functions, we are not able to compute convolutions without error; even
the convolution between sequences of integers, which in the direct form can be easily computed
without invoking any errors, will have to be computed using approximations to the roots, and will, in
general, invoke errors in the final convolution sequence.
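The effect of approximating the roots of unity can be seen by convolving two integer sequences through a floating point DFT. The sketch below uses arbitrary five-digit integers and a plain O(N^2) DFT; the names are ours. The indirect result is a complex sequence that only approximates the exact integers, and rounding recovers them here only because the error stays well below one half.

```python
import cmath


def dft(x, sign=-1):
    """Plain O(N^2) DFT in double precision; sign=+1 gives the unscaled inverse."""
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def cyclic_convolution_direct(x, y):
    """Exact integer cyclic convolution."""
    N = len(x)
    return [sum(x[n] * y[(m - n) % N] for n in range(N)) for m in range(N)]


x = [31415, 92653, 58979, 32384, 62643, 38327, 95028, 84197]
y = [27182, 81828, 45904, 52353, 60287, 47135, 26624, 97757]

exact = cyclic_convolution_direct(x, y)
Z = [a * b for a, b in zip(dft(x), dft(y))]
approx = [v / len(x) for v in dft(Z, sign=+1)]

max_error = max(abs(a - e) for a, e in zip(approx, exact))   # size of the numerical error
recovered = [round(a.real) for a in approx]                   # rounding back to integers
print(max_error, recovered == exact)
```

With a larger dynamic range or longer sequences the error grows, and rounding can no longer be relied upon; this is the motivation for transforms whose roots of unity are exact.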
There are mathematical systems in which roots of unity can be generated and used in calculations
without any approximations. In fact, approximations are meaningless. The mathematical systems
are finite algebras, and we will devote the remainder of this chapter to discussing their use in the
computation of DSP algorithms.
It is important to establish some basic concepts and theories relating to the subject of finite algebra;
the following section is included for that purpose.
IV. Finite Algebras
For our specific purposes we will be using algebraic structures that emulate normal integer
computations. The entities to be discussed are groups, rings and fields which contain a finite
number of integers as their elements and allow arithmetic operations that are modifications of
normal integer arithmetic operations. This will not be an exhaustive review; the reader is referred to
standard texts on number theory (e.g. Dudley, 1969, and Vinogradov, 1954) as well as texts on
modern algebra (e.g. Burton, 1970, and Schilling and Piper, 1975).
As a starting point, we will formally define the notion of a ring and field.
A. Rings and Fields
Definition 1:
If R is a non empty set on which there are defined binary operations of addition, +, and
multiplication, •, such that the following postulates (I) - (VI) hold, we say that R is a ring,
(I) Commutative law of addition a+b = b+a
(II) Associative law (a+b)+c = a+(b+c)
(III) Distributive law a•(b+c) = a•b + a•c
(IV) Existence of an element, denoted by the symbol 0, of R such that a+0 = a for every a ∈ R
(V) Existence of an additive inverse. For each a ∈ R, there exists x ∈ R such that a+x = 0
(VI) Closure, where it is understood that a, b, c are arbitrary elements of R in the above
postulates
A ring in which multiplication is a commutative operation is called a commutative ring. A ring with
identity is a ring in which there exists an identity element, 1, for the operation of multiplication,
a•1 = 1•a = a for all a ∈ R.
The ring of integers is a well known example of a commutative ring with identity. In abstract
algebra, elements of a ring are not necessarily the integers, even not necessarily numbers; e.g. they
can be polynomials. Given a ring R with identity 1, an element a ∈ R is said to be invertible, or to
be a unit, whenever a possesses an inverse with respect to multiplication. The multiplicative inverse,
$a^{-1}$, is an element such that $a^{-1}$•a = 1. The set of all invertible elements of a ring is a group with
respect to the operation of multiplication and is called a multiplicative group.
Definition 2:
If R is a ring and 0 ≠ a ∈ R, then a is called a divisor of zero if there exists some b ≠ 0, b ∈ R, such
that a•b = 0.
Definition 3:
An integral domain is a commutative ring with identity which has no divisors of zero.
Definition 4:
A ring, F, is said to be a field provided that the set F-{0} is a commutative group under
multiplication.
Viewed otherwise: a field is a commutative ring with identity in which each non-zero element
possesses an inverse under multiplication. It follows (McCoy, 1972) that every field is an integral
domain. The rings (fields) with a finite number of elements are called finite rings (fields). A ring of
integers modulo m, denoted here as $Z_m$, is an example of a finite ring.
In every finite field with p elements, the nonzero elements form a multiplicative group. This
multiplicative group of order p-1 is cyclic; i.e. it contains an element, $\alpha$, whose powers exhaust the
entire group. This element is called a generator, or a primitive root of unity, and the period or order
of $\alpha$ is p-1. The order of any element x in the multiplicative group is the least positive integer t such
that $x^t = 1$, $x^s \neq 1$, $s \in [1,t)$. The order t is a divisor of p-1 and x is called a primitive t-th root of
unity.
For every element x of the set F-{0} the mapping defined by $x = \alpha^t$, $t \in \{0, 1, \ldots, p-2\}$, is an
isomorphism between the additive group of integers with addition modulo p-1 and the multiplicative
group of the field F.
The integer t is called the index of x relative to the base $\alpha$, denoted $\mathrm{ind}_\alpha\, x$.
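These ideas are easy to explore for a small prime field. The sketch below works in the integers modulo 7, a modulus chosen only for illustration, and uses brute-force search; the function names are ours. It finds the order of each element, lists the generators, and builds the index table that maps multiplication modulo 7 onto addition modulo 6.

```python
def order(x, p):
    """Least positive t with x**t = 1 (mod p); x must be nonzero modulo the prime p."""
    t, v = 1, x % p
    while v != 1:
        v = (v * x) % p
        t += 1
    return t


def generators(p):
    """Elements whose powers exhaust the nonzero residues modulo p."""
    return [a for a in range(1, p) if order(a, p) == p - 1]


def index_table(alpha, p):
    """Map each nonzero x to the index t such that alpha**t = x (mod p)."""
    return {pow(alpha, t, p): t for t in range(p - 1)}


p = 7
print([order(a, p) for a in range(1, p)])   # every order divides p - 1
print(generators(p))
ind = index_table(3, p)
print(ind)
```

With the table in hand, a product modulo 7 can be obtained by adding indices modulo 6 and looking up the result, which is the isomorphism described above.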
1. The ring of residue classes
In order to describe the system, the notion of congruence should be introduced. The basic theorem,
known as the division algorithm will be established first.
Division Algorithm
For given integers a and b, b positive, there exist two unique integers, q and r, such that $a = b \cdot q + r$,
$r \in [0, b)$. It is clear that q is the integer part (floor) of the quotient $\frac{a}{b}$. The quantity r is the least
non-negative (integer) remainder of the division of a by b and is designated as the residue of a modulo b,
or $|a|_b$. We will say that b divides a (written $b \mid a$) if there exists an integer k such that $a = b \cdot k$.
Definition 5:
Two integers c and d are said to be congruent modulo m, written
$c \equiv d \pmod{m}$, if and only if $m \mid (c-d)$.
Since $m \mid 0$, $c \equiv c \pmod{m}$ by definition.
An alternative definition of congruence can be stated as follows:
Two integers c and d are congruent modulo m if and only if they leave the same remainder when
divided by m, $|c|_m = |d|_m$.
Definition 6:
A set of integers containing exactly those integers which are congruent, modulo m, to a fixed integer
is called a residue class, modulo m.
The residue classes (mod m) form a commutative ring with identity with respect to the modulo m
addition and multiplication, traditionally known as the ring of integers modulo m or the residue ring,
and denoted $Z_m$. The ring of residue classes (mod m) contains exactly m distinct elements. The
ring of residue classes (mod m) is a field if and only if m is a prime number, because only then do multiplicative
inverses, denoted $a^{-1}$ (mod m) or $|a^{-1}|_m$, exist for each nonzero $a \in Z_m$. Thus, for prime m, the nonzero classes
of $Z_m$ form a cyclic multiplicative group of order m-1, {1, 2, ..., m-1}, with multiplication modulo
m, isomorphic to the additive group {0, 1, ..., m-2} with addition modulo m-1.
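The difference between a prime and a composite modulus shows up immediately when the inverses are tabulated. The sketch below uses brute-force search and two small moduli chosen for illustration; the function name is ours.

```python
def inverse_table(m):
    """Multiplicative inverses in Z_m, for those nonzero elements that have one."""
    table = {}
    for a in range(1, m):
        for b in range(1, m):
            if (a * b) % m == 1:
                table[a] = b
                break
    return table


print(inverse_table(7))   # prime modulus: every nonzero element appears
print(inverse_table(6))   # composite modulus: only elements coprime to 6 appear
```

For the modulus 6 the elements 2, 3 and 4 have no inverse; each is a divisor of zero, for example 2 times 3 is 0 modulo 6.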
If m is composite, $Z_m$ is not a field. Multiplicative inverses do not exist for those nonzero elements
$a \in Z_m$ for which gcd(a, m) ≠ 1. Euler's totient function, denoted $\varphi(m)$, is a widely used
number theoretic function and is defined as the number of positive integers less than m and
relatively prime to it. It follows that the number of invertible elements of $Z_m$ is equal to $\varphi(m)$. If m
has the prime power factorization $m = m_1^{e_1} m_2^{e_2} \cdots m_L^{e_L}$ then we can write:
$$\varphi(m) = m \left[1 - \frac{1}{m_1}\right]\left[1 - \frac{1}{m_2}\right] \cdots \left[1 - \frac{1}{m_L}\right]$$
The Euler-Fermat theorem states that if c is an integer, m is a positive integer and
(c, m) = 1, then $c^{\varphi(m)} \equiv 1 \pmod{m}$.
The important consequence of this theorem is that there is an upper limit on the order of any
invertible element $a \in Z_m$. Specifically, the order t is a divisor of $\varphi(m)$.
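Both the totient formula and the Euler-Fermat theorem can be checked for small moduli. The sketch below counts coprime residues directly, evaluates the product formula from the distinct prime factors, and tests the theorem for every unit; the function names are ours and the counting routine assumes m is greater than 1.

```python
from math import gcd


def totient_by_count(m):
    """Number of positive integers less than m that are relatively prime to m (m > 1)."""
    return sum(1 for a in range(1, m) if gcd(a, m) == 1)


def prime_factors(m):
    """Distinct prime factors of m by trial division."""
    factors, p = [], 2
    while p * p <= m:
        if m % p == 0:
            factors.append(p)
            while m % p == 0:
                m //= p
        p += 1
    if m > 1:
        factors.append(m)
    return factors


def totient_by_formula(m):
    """m multiplied by (1 - 1/p) for each distinct prime factor p, in integer arithmetic."""
    result = m
    for p in prime_factors(m):
        result = result // p * (p - 1)
    return result


def euler_fermat_holds(m):
    """Check c**phi(m) = 1 (mod m) for every c in [1, m) coprime to m."""
    phi = totient_by_formula(m)
    return all(pow(c, phi, m) == 1 for c in range(1, m) if gcd(c, m) == 1)


for m in (12, 15, 36):
    print(m, totient_by_count(m), totient_by_formula(m), euler_fermat_holds(m))
```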
2. Galois Fields
For any prime m and any positive integer n, there exists a finite field with $m^n$ elements. This
(essentially unique) field is commonly denoted by the symbol $GF(m^n)$ and is called a Galois field
in honor of the French mathematician Evariste Galois. Since any finite field with $m^n$ elements is a
simple algebraic extension of the field $Z_m$, a brief review of the basic concepts of the extensions of
a given field will be presented.
Let F be a field. Then any field K containing F is an extension of F. If $\lambda$ is algebraic over F, i.e. if $\lambda$
is a root of some irreducible polynomial $f(x) \in F[x]$ such that $f(\lambda) = 0$, then the extension field
arising from a field F by the adjunction of a root $\lambda$ is called a simple algebraic extension, denoted
$F(\lambda)$.
Each element of $F(\lambda)$ can be uniquely represented as a polynomial:
\[ a_0 + a_1 \lambda + \cdots + a_{n-1} \lambda^{n-1}, \quad a_i \in F. \]
This unique representation closely resembles the representation of a vector in terms of the vectors
of a basis
\[ 1, \lambda, \ldots, \lambda^{n-1}. \]
The vector space concepts are sometimes applied to the extension fields, and $F(\lambda)$ is considered as a
vector space of dimension n over F.
The field of complex numbers is an example of an extension of the field of real numbers; it is
generated by adjoining a root $j = \sqrt{-1}$ of the irreducible polynomial $x^2 + 1$.
If f(x) is an irreducible polynomial of degree n over $Z_m$, m a prime, then the Galois field with $m^n$
elements, $GF(m^n)$, is usually defined as the quotient field $Z_m[x] / (f(x))$, i.e., the field of residue
classes of polynomials of $Z_m[x]$ reduced modulo f(x). All fields containing $m^n$ elements are
isomorphic to each other. In particular, $Z_m[x] / (f(x))$ is isomorphic to the simple algebraic
extension $Z_m(\lambda)$, where $\lambda$ is a root of $f(x) = 0$.
Now that we have some basic background in finite algebraic structures, we can return to the
problem of indirectly computing convolution via transforms that exhibit the cyclic convolution
property. We will expand our definition of such transforms from the DFT to the NTT (Number
Theoretic Transform).
V. Number Theoretic Transforms
Number Theoretic Transforms (NTT's) were discovered independently by a number of researchers
(Nicholson 1971; Pollard 1971; Rader 1972) as a generalization of the standard DFT with respect
to the ring over which the transform is defined. The transform retains the same cyclic convolution
property as the DFT, but allows error free computation with the promise of much faster hardware
solutions.
The NTT has the same form as the DFT, with the exception that it is computed over a finite ring, or
field, rather than over the field of complex numbers.
\[ X_k = \sum_{n=0}^{N-1}{}_{\oplus_p} \left[ x_n \otimes_p \alpha^{nk} \right] \tag{10} \]
where the ring, or field, is R(p) or GF(p) respectively. Note that a new notation has been introduced
here. In order to explicitly identify arithmetic operations over finite fields, or rings, we will use the
symbols $\oplus_p$ and $\otimes_p$ to represent addition and multiplication modulo p, respectively, and $\sum_{\oplus_p}$ to
represent summation modulo p. Where it is clear that a single ring or field is being referred to, then
the subscript may be dropped (except on the summation). This notation will be used throughout the
rest of the paper, and will be extended where appropriate.
The NTT that has probably been studied more than any other is the Fermat Number Transform
(FNT). This, in large part, is due to the ability to use modified binary arithmetic to perform the
computations. It is not clear, given the present body of knowledge on VLSI implementations, that
this ability to use modified binary hardware is necessarily a desirable attribute of number theoretic
algorithms. It is the intent of this paper to present alternative forms of implementation of number
theoretic techniques that do not have to be predicated on the existence of standard computational
elements. This opens up the possibility of using a wider variety of number theoretic transforms. Let
us, however, start off by exploring number theoretic transforms based on Fermat numbers.
A. Fermat Number Transforms
Although the NTT allows convolution to be computed without error (unlike the DFT which
introduces errors due to the requirement for arithmetic with coefficients obtained from
transcendental functions), the use of such a finite field (or ring) transform is only appropriate when
the hardware has no greater complexity than normal computations over the reals. Satisfying this
requirement can be a problem when the field modulus is not chosen with the implementation in
mind. Among the earliest implementations considered (and still occupying a small, but current,
research interest) are Fermat Number Transforms (FNT's).
The FNT is initially defined over a Galois field $GF(F_t)$, where the modulus $F_t = 2^{2^t} + 1$ is the t'th
Fermat number (FN) and the FN is prime (Fermat numbers are not prime for all values of t). This
transform brings about three main implementation simplicities (Agarwal and Burrus, 1974):
1 The arithmetic is very similar to ordinary binary arithmetic because the modulus is close to a
power of 2.
2 Simple forms can be found for the multiplicative group generator that reduces the
multiplications required to compute the transform to modified bit-pattern shifts.
3 A standard Fast algorithm, obtained from an appropriate FFT, can be used to compute the
transform.
The idea of being able to compute over a finite field but with only slightly modified binary hardware
is obviously appealing, albeit constraining.
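The FNT pair is defined as:
\[ X_k = \sum_{n=0}^{N-1}{}_{\oplus_{F_t}} \left[ x_n \otimes_{F_t} \alpha^{nk} \right] \tag{11} \]
\[ x_n = N^{-1} \otimes_{F_t} \sum_{k=0}^{N-1}{}_{\oplus_{F_t}} \left[ X_k \otimes_{F_t} \alpha^{-nk} \right] \tag{12} \]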
There are some points that we should note:
1 Since $N \mid (F_t - 1)$, it is clear that we can generate a maximally composite $N = 2^{2^t}$ and so use
one of the fast algorithms available for the DFT to optimize a computational structure for the FNT.
2 If we restrict N to be even, $N^{-1}$ will always exist.
3 The first four Fermat Numbers are prime, and 3 is a primitive root for all four of these
primes.
From 1 and 3 we find that $3^{2^{2^t}} \equiv 1$. If we wish to find a generator, $\alpha$, for a multiplicative group of
order $2^b$ (this will allow a transform of length $2^b$), then $\alpha^{2^b} \equiv 3^{2^{2^t}} \equiv 1$; hence $\alpha \equiv 3^{2^{(2^t - b)}}$. This
last expression allows us to determine the relationship between suitable values of $\alpha$ (for ease of
implementation) and possible transform lengths. The easiest value of $\alpha$ to choose is clearly $\alpha = 2$.
This will allow multiplication by powers of $\alpha$ to be replaced by shifts modulo $F_t$. Since $F_t$ is close
to a power of 2, the hardware should be only moderately more complex than shifts modulo a power
of 2 (binary shifts).
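The relationship between the primitive root 3 and a generator of order $2^b$ can be checked directly for the small Fermat primes. The sketch below is our own illustration in Python; the helper names `order`, `fermat` and `generator_of_order` are hypothetical and do not appear in the original text.

```python
def order(a, m):
    """Multiplicative order of a modulo m (assumes gcd(a, m) == 1)."""
    k, x = 1, a % m
    while x != 1:
        x = (x * a) % m
        k += 1
    return k

def fermat(t):
    """The t'th Fermat number F_t = 2^(2^t) + 1."""
    return 2 ** (2 ** t) + 1

def generator_of_order(t, b):
    """alpha = 3^(2^(2^t - b)) mod F_t; it has order 2^b when 3 is a primitive root of F_t."""
    return pow(3, 2 ** (2 ** t - b), fermat(t))

for t in (2, 3):                       # F_2 = 17 and F_3 = 257 are both prime
    Ft = fermat(t)
    assert order(3, Ft) == Ft - 1      # 3 is a primitive root
    for b in range(1, 2 ** t + 1):
        assert order(generator_of_order(t, b), Ft) == 2 ** b

print(generator_of_order(2, 3))        # generator of order 8 over GF(17)
```

For t = 2 and b = 3 the sketch yields the generator 9, i.e. $3^2$ modulo 17.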
It turns out that for $\alpha = 2$, the order is $2^{t+1}$. This can be verified, using ordinary arithmetic, from:
\[ 2^{2^{t+1}} = (2^{2^t} + 1)(2^{2^t} - 1) + 1 \equiv 1 \pmod{F_t} \]
The largest Fermat prime is F4, for which the order is 32; the arithmetic is modified 17-bit binary.
If we examine $F_5$ and $F_6$, we find that they are composite numbers of the form $F_t = K \cdot 2^{t+2} + 1$.
We have to pause at this moment to reflect on the fact that $F_t$, $t \in \{5, 6\}$, is not a prime number.
Clearly we will not be computing over a field, so will the NTT still work? It turns out that the
conditions for the existence of the transform still hold when computing over a ring. The restriction
is that the transform has to exist over each field constructed from each of the prime factors of the
composite modulus. The transform length, N, therefore satisfies $N \mid (F_t^{(i)} - 1)\ \forall i$, where
$F_t = \prod_i F_t^{(i)}$ (this assumes no powers of primes as factors). For $F_5$ and $F_6$ we can show that
$N = 2^{t+2}$ is the maximum transform length supported by these Fermat numbers; as a necessary
condition, we can see that $2^{t+2} \mid (F_t - 1)$. Although we are only computing over a ring of integers
$Z_{F_t}$, $2^{t+2} \nmid (K \cdot 2^{t+2} + 1)$, and so $N^{-1}$ exists. The length supported by $\alpha = 2$ is $2^{t+1}$, as before.
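The ring case can be illustrated with $F_5$. The sketch below is our own Python illustration; it assumes the well-known factorization $F_5 = 641 \times 6700417$ (which the code re-checks), and the helper name `order` is hypothetical.

```python
from math import gcd

def order(a, m):
    """Multiplicative order of a modulo m (assumes gcd(a, m) == 1)."""
    k, x = 1, a % m
    while x != 1:
        x = (x * a) % m
        k += 1
    return k

F5 = 2 ** 32 + 1
factors = (641, 6700417)               # prime factors of F5, verified on the next line
assert factors[0] * factors[1] == F5

# The transform length must divide (factor - 1) for every prime factor.
max_length = gcd(factors[0] - 1, factors[1] - 1)
orders_of_2 = [order(2, p) for p in factors]
n_inverse = pow(max_length, -1, F5)    # exists because F5 is odd (Python 3.8+)

print(max_length, orders_of_2, order(2, F5))
```

The maximum length comes out as $2^{t+2} = 128$, while the generator 2 has order $2^{t+1} = 64$ modulo each prime factor and hence modulo $F_5$ itself.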
For all the Fermat Numbers considered here, it turns out that the transform length, $N = 2^{t+2}$, can be
achieved by using a value of $\alpha = \sqrt{2}$ (Agarwal and Burrus (1974)); i.e. $\alpha^2 = 2$. The expression for
$\alpha = \sqrt{2}$ is given by:
\[ \alpha = 2^{2^{t-2}} \left( 2^{2^{t-1}} - 1 \right) \]
which is clearly representable by a 2-bit (2 ones in a field of zeros) generator. For a fast DIT or
DIF algorithm, only the first, or last, stage will require odd powers of $\alpha$, and so most of the
multiplications will only require single-bit multipliers. Given the use of $\alpha = \sqrt{2}$ and the fifth FN,
we have a dynamic range of just over 32 bits, with a transform length of N = 128. It is clear that
there is an unfortunate link between dynamic range and transform length, assuming the requirement
for a fast algorithm.
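The expression for $\sqrt{2}$ can be verified numerically for t = 2 to 5. This is a minimal Python sketch of our own; the function names `sqrt2_generator` and `order` are hypothetical.

```python
def order(a, m):
    """Multiplicative order of a modulo m (assumes gcd(a, m) == 1)."""
    k, x = 1, a % m
    while x != 1:
        x = (x * a) % m
        k += 1
    return k

def sqrt2_generator(t):
    """Return (alpha, F_t) with alpha = 2^(2^(t-2)) * (2^(2^(t-1)) - 1) mod F_t, for t >= 2."""
    Ft = 2 ** (2 ** t) + 1
    alpha = (2 ** (2 ** (t - 2)) * (2 ** (2 ** (t - 1)) - 1)) % Ft
    return alpha, Ft

for t in range(2, 6):
    alpha, Ft = sqrt2_generator(t)
    assert (alpha * alpha) % Ft == 2          # alpha really is a square root of 2
    assert order(alpha, Ft) == 2 ** (t + 2)   # and supports length 2^(t+2)
    print(t, alpha, order(alpha, Ft))
```

For t = 2 the generator is 6 (since 36 = 2 modulo 17), and for t = 5 the order is 128, matching the transform length quoted above.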
It might be helpful to take an example of convolution over $GF(F_2)$ in order to reinforce the results
presented here. In order to illustrate the difference between performing convolution using an NTT
and an 8-point DFT, we will choose $\alpha = 3^2 = 9$ in order that the NTT may also generate an 8-point
transform length.
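The transform pair is:
\[ X_k = \sum_{n=0}^{7}{}_{\oplus_{17}} \left[ x_n \otimes_{17} 9^{nk} \right] \]
\[ x_n = 8^{-1} \otimes_{17} \sum_{k=0}^{7}{}_{\oplus_{17}} \left[ X_k \otimes_{17} 9^{-nk} \right] \]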
We can concisely show the calculation using matrix notation (the modulo arithmetic is implied):
\[ X = T x \tag{13} \]
where T is the transformation matrix, whose elements are given by:
\[ T_{kn} = 9^{kn} \]
Note that the corresponding DFT transformation matrix has elements given by:
\[ T_{kn} = e^{-j \pi k n / 4} \]
We can immediately see the major difference in the computation of both transforms. Although the
DFT uses normal arithmetic, the calculations are over the complex field, with elements that are not
representable in a finite machine. The NTT, although requiring finite field (modulo) arithmetic,
requires calculations over an integer field using elements that are defined without error.
As an example calculation, in order to compare the two approaches, we will cyclically convolve the
sequence {x} = {1, 2, 3, 4, 0, 0, 0, 0} with itself. This example is chosen to illustrate several points,
as can be seen if we compute the cyclic convolution directly:
\[ y_k = \sum_{n=0}^{7} x_n \cdot x_{[k \oplus_8 (-n)]} \quad \text{or} \quad \{y\} = \{1, 4, 10, 20, 25, 24, 16, 0\} \]
If we now compute the aperiodic (normal) convolution we find:
\[ y_k = \sum_{n=0}^{7} x_n \cdot x_{k-n} \quad \text{or} \quad \{y\} = \{1, 4, 10, 20, 25, 24, 16, 0\} \]
We can observe that both the computed sequences are identical, which shows that it is possible to
generate normal aperiodic convolution from periodic convolution by padding the sequences to be
convolved with zeros. This forms the basis for the classical techniques of overlap-add and overlap-
save for using cyclic convolution to compute normal convolution (Gold and Rader (1969)). As a
second point, we also notice that some of the elements of the convolution output lie outside the
elements of the field over which the NTT is to be computed, GF(17). This will be discussed after
the example calculations.
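The equivalence of the two convolution sums for zero-padded sequences can be reproduced with a few lines of Python. This is a sketch of our own; the function names `cyclic_convolve` and `linear_convolve` are hypothetical.

```python
def cyclic_convolve(x, h):
    """Cyclic convolution: the index k - i is taken modulo the sequence length."""
    n = len(x)
    return [sum(x[i] * h[(k - i) % n] for i in range(n)) for k in range(n)]

def linear_convolve(x, h):
    """Ordinary (aperiodic) convolution of two finite sequences."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

short = [1, 2, 3, 4]
padded = short + [0, 0, 0, 0]

print(linear_convolve(short, short))   # aperiodic result, 7 samples
print(cyclic_convolve(padded, padded)) # same values, followed by a final zero
print(cyclic_convolve(short, short))   # without padding the tail wraps around
```

Without the zero padding, the last three samples of the aperiodic result wrap around and add to the first three, which is why the padded length of 8 is used in the example.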
We will generate the forward transform as the matrix multiplication of equation (13).
\[
\begin{pmatrix} 10 \\ 16 \\ 6 \\ 11 \\ 15 \\ 13 \\ 7 \\ 15 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 9 & 13 & 15 & 16 & 8 & 4 & 2 \\
1 & 13 & 16 & 4 & 1 & 13 & 16 & 4 \\
1 & 15 & 4 & 9 & 16 & 2 & 13 & 8 \\
1 & 16 & 1 & 16 & 1 & 16 & 1 & 16 \\
1 & 8 & 13 & 2 & 16 & 9 & 4 & 15 \\
1 & 4 & 16 & 13 & 1 & 4 & 16 & 13 \\
1 & 2 & 4 & 8 & 16 & 15 & 13 & 9
\end{pmatrix}
\boxtimes_{17}
\begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
\]
The symbol $\boxtimes_{17}$ indicates matrix multiplication over GF(17). Since we are convolving the
sequence with itself, we generate the transform domain point-wise multiplication (over GF(17)) as
below, and feed this result to the input of the inverse transform calculation.
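\[
\begin{pmatrix} 15 \\ 1 \\ 2 \\ 2 \\ 4 \\ 16 \\ 15 \\ 4 \end{pmatrix}
=
\begin{pmatrix} 10 \\ 16 \\ 6 \\ 11 \\ 15 \\ 13 \\ 7 \\ 15 \end{pmatrix}
\otimes_{17}
\begin{pmatrix} 10 \\ 16 \\ 6 \\ 11 \\ 15 \\ 13 \\ 7 \\ 15 \end{pmatrix}
\]
\[
\begin{pmatrix} 8 \\ 15 \\ 12 \\ 7 \\ 13 \\ 5 \\ 9 \\ 0 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 2 & 4 & 8 & 16 & 15 & 13 & 9 \\
1 & 4 & 16 & 13 & 1 & 4 & 16 & 13 \\
1 & 8 & 13 & 2 & 16 & 9 & 4 & 15 \\
1 & 16 & 1 & 16 & 1 & 16 & 1 & 16 \\
1 & 15 & 4 & 9 & 16 & 2 & 13 & 8 \\
1 & 13 & 16 & 4 & 1 & 13 & 16 & 4 \\
1 & 9 & 13 & 15 & 16 & 8 & 4 & 2
\end{pmatrix}
\boxtimes_{17}
\begin{pmatrix} 15 \\ 1 \\ 2 \\ 2 \\ 4 \\ 16 \\ 15 \\ 4 \end{pmatrix}
\]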
The inverse transform requires a final multiplication by $8^{-1} = 15$ to obtain the cyclic convolution
output.
\[
\begin{pmatrix} 1 \\ 4 \\ 10 \\ 3 \\ 8 \\ 7 \\ 16 \\ 0 \end{pmatrix}
= 8^{-1} \otimes_{17}
\begin{pmatrix} 8 \\ 15 \\ 12 \\ 7 \\ 13 \\ 5 \\ 9 \\ 0 \end{pmatrix}
\]
We note that some of the elements of the output convolution do not equal the original calculation
using the direct convolution sum. They are, in fact, the result of finding the least positive residue,
modulo 17, of the correct output. In fact the results are not wrong in the accepted sense; they are
simply a mapping of the correct values onto GF(17). We will see later that such results can still be
used, and in fact will provide a technique for removing some of the problems of the linkage of
dynamic range with transform length.
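The complete GF(17) calculation can be reproduced with a direct (not fast) evaluation of the transform pair. The following Python sketch is our own; the function names `ntt` and `intt` are hypothetical, and the modular inverses rely on the three-argument `pow` of Python 3.8 or later.

```python
def ntt(x, alpha, p):
    """Direct evaluation of X_k = sum_n x_n * alpha^(n*k) modulo p."""
    n = len(x)
    return [sum(x[i] * pow(alpha, i * k, p) for i in range(n)) % p
            for k in range(n)]

def intt(X, alpha, p):
    """Inverse transform: uses alpha^(-1) and a final multiplication by N^(-1)."""
    n = len(X)
    alpha_inv = pow(alpha, -1, p)
    n_inv = pow(n, -1, p)
    return [(n_inv * sum(X[k] * pow(alpha_inv, i * k, p) for k in range(n))) % p
            for i in range(n)]

p, alpha = 17, 9
x = [1, 2, 3, 4, 0, 0, 0, 0]

X = ntt(x, alpha, p)               # forward transform
Y = [(v * v) % p for v in X]       # point-wise product in the transform domain
y = intt(Y, alpha, p)              # cyclic convolution, reduced modulo 17

print(X)
print(Y)
print(y)
```

The output samples 3, 8 and 7 are the least positive residues modulo 17 of the true values 20, 25 and 24.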
It will be constructive at this point to examine the use of the DFT in performing the same
convolution. Because we are performing the calculations over the complex field, we will treat the
'real' and 'imaginary' parts of the sequences and W elements as the elements of a 2-tuple; the
interaction between the 'real' and 'imaginary' parts will use the usual rules of complex arithmetic.
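The forward transform is shown below, with each element written as a (real, imaginary) 2-tuple:
\[
\begin{pmatrix} (10.00, 0.00) \\ (-0.41, -7.24) \\ (-2.00, 2.00) \\ (2.41, -1.24) \\ (-2.00, 0.00) \\ (2.41, 1.24) \\ (-2.00, -2.00) \\ (-0.41, 7.24) \end{pmatrix}
=
\begin{pmatrix}
(1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) \\
(1.00, 0.00) & (0.71, -0.71) & (0.00, -1.00) & (-0.71, -0.71) & (-1.00, 0.00) & (-0.71, 0.71) & (0.00, 1.00) & (0.71, 0.71) \\
(1.00, 0.00) & (0.00, -1.00) & (-1.00, 0.00) & (0.00, 1.00) & (1.00, 0.00) & (0.00, -1.00) & (-1.00, 0.00) & (0.00, 1.00) \\
(1.00, 0.00) & (-0.71, -0.71) & (0.00, 1.00) & (0.71, -0.71) & (-1.00, 0.00) & (0.71, 0.71) & (0.00, -1.00) & (-0.71, 0.71) \\
(1.00, 0.00) & (-1.00, 0.00) & (1.00, 0.00) & (-1.00, 0.00) & (1.00, 0.00) & (-1.00, 0.00) & (1.00, 0.00) & (-1.00, 0.00) \\
(1.00, 0.00) & (-0.71, 0.71) & (0.00, -1.00) & (0.71, 0.71) & (-1.00, 0.00) & (0.71, -0.71) & (0.00, 1.00) & (-0.71, -0.71) \\
(1.00, 0.00) & (0.00, 1.00) & (-1.00, 0.00) & (0.00, -1.00) & (1.00, 0.00) & (0.00, 1.00) & (-1.00, 0.00) & (0.00, -1.00) \\
(1.00, 0.00) & (0.71, 0.71) & (0.00, 1.00) & (-0.71, 0.71) & (-1.00, 0.00) & (-0.71, -0.71) & (0.00, -1.00) & (0.71, -0.71)
\end{pmatrix}
\begin{pmatrix} (1.00, 0.00) \\ (2.00, 0.00) \\ (3.00, 0.00) \\ (4.00, 0.00) \\ (0.00, 0.00) \\ (0.00, 0.00) \\ (0.00, 0.00) \\ (0.00, 0.00) \end{pmatrix}
\]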
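The transform domain input to the inverse transform is the point-wise multiplication of the forward
transform with itself:
\[
\begin{pmatrix} (100.00, 0.00) \\ (-52.28, 6.00) \\ (0.00, -8.00) \\ (4.28, -6.00) \\ (4.00, 0.00) \\ (4.28, 6.00) \\ (0.00, 8.00) \\ (-52.28, -6.00) \end{pmatrix}
=
\begin{pmatrix} (10.00, 0.00) \\ (-0.41, -7.24) \\ (-2.00, 2.00) \\ (2.41, -1.24) \\ (-2.00, 0.00) \\ (2.41, 1.24) \\ (-2.00, -2.00) \\ (-0.41, 7.24) \end{pmatrix}
\cdot
\begin{pmatrix} (10.00, 0.00) \\ (-0.41, -7.24) \\ (-2.00, 2.00) \\ (2.41, -1.24) \\ (-2.00, 0.00) \\ (2.41, 1.24) \\ (-2.00, -2.00) \\ (-0.41, 7.24) \end{pmatrix}
\]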
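The inverse transform is computed below with a final division by 8.
\[
\begin{pmatrix} (8.00, 0.00) \\ (32.00, 0.00) \\ (80.00, 0.00) \\ (160.00, 0.00) \\ (200.00, 0.00) \\ (192.00, 0.00) \\ (128.00, 0.00) \\ (0.00, 0.00) \end{pmatrix}
=
\begin{pmatrix}
(1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) & (1.00, 0.00) \\
(1.00, 0.00) & (0.71, 0.71) & (0.00, 1.00) & (-0.71, 0.71) & (-1.00, 0.00) & (-0.71, -0.71) & (0.00, -1.00) & (0.71, -0.71) \\
(1.00, 0.00) & (0.00, 1.00) & (-1.00, 0.00) & (0.00, -1.00) & (1.00, 0.00) & (0.00, 1.00) & (-1.00, 0.00) & (0.00, -1.00) \\
(1.00, 0.00) & (-0.71, 0.71) & (0.00, -1.00) & (0.71, 0.71) & (-1.00, 0.00) & (0.71, -0.71) & (0.00, 1.00) & (-0.71, -0.71) \\
(1.00, 0.00) & (-1.00, 0.00) & (1.00, 0.00) & (-1.00, 0.00) & (1.00, 0.00) & (-1.00, 0.00) & (1.00, 0.00) & (-1.00, 0.00) \\
(1.00, 0.00) & (-0.71, -0.71) & (0.00, 1.00) & (0.71, -0.71) & (-1.00, 0.00) & (0.71, 0.71) & (0.00, -1.00) & (-0.71, 0.71) \\
(1.00, 0.00) & (0.00, -1.00) & (-1.00, 0.00) & (0.00, 1.00) & (1.00, 0.00) & (0.00, -1.00) & (-1.00, 0.00) & (0.00, 1.00) \\
(1.00, 0.00) & (0.71, -0.71) & (0.00, -1.00) & (-0.71, -0.71) & (-1.00, 0.00) & (-0.71, 0.71) & (0.00, 1.00) & (0.71, 0.71)
\end{pmatrix}
\begin{pmatrix} (100.00, 0.00) \\ (-52.28, 6.00) \\ (0.00, -8.00) \\ (4.28, -6.00) \\ (4.00, 0.00) \\ (4.28, 6.00) \\ (0.00, 8.00) \\ (-52.28, -6.00) \end{pmatrix}
\]
\[
\begin{pmatrix} (1.00, 0.00) \\ (4.00, 0.00) \\ (10.00, 0.00) \\ (20.00, 0.00) \\ (25.00, 0.00) \\ (24.00, 0.00) \\ (16.00, 0.00) \\ (0.00, 0.00) \end{pmatrix}
= \frac{1}{8}
\begin{pmatrix} (8.00, 0.00) \\ (32.00, 0.00) \\ (80.00, 0.00) \\ (160.00, 0.00) \\ (200.00, 0.00) \\ (192.00, 0.00) \\ (128.00, 0.00) \\ (0.00, 0.00) \end{pmatrix}
\]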
The calculations were performed at great precision, and so the effect of quantization noise is not
seen in this limited precision format. The quantization noise is present in both the inexact
representation of the W coefficients, and in the scaling procedures used to keep the data within the
dynamic range of the computational system. Even at large precision, zero elements become non-
zero, albeit very small, values. The fact that the input sequence is purely real, shows up in the
hermitian symmetry of the input to the inverse transform (even symmetry in real part about the
center sample (sample 4) and odd symmetry in the imaginary part). Although we have explicitly
shown all details of the calculations (not normally shown in most texts) an examination of the DFT
calculation compared to the GF(17) NTT calculation, demonstrates the simplicity and error free
behavior of computing indirect convolution over finite fields.
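For comparison, the same convolution can be computed with a direct DFT in floating-point arithmetic. The sketch below is our own Python illustration using only the standard `cmath` module; the function name `dft` is hypothetical, and the size of the residual error printed will depend on the machine arithmetic.

```python
import cmath

def dft(x, inverse=False):
    """Direct DFT; the inverse uses the conjugate kernel and a final division by N."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[i] * cmath.exp(sign * 2j * cmath.pi * i * k / n) for i in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

x = [1, 2, 3, 4, 0, 0, 0, 0]
X = dft(x)
y = dft([v * v for v in X], inverse=True)   # point-wise product, then inverse

rounded = [round(v.real) for v in y]
max_err = max(abs(v - r) for v, r in zip(y, rounded))

print(rounded)
print(max_err)   # small residual caused by inexact coefficients and rounding
```

The rounded output recovers the exact integer convolution, but only because the residual error is far smaller than one half; the NTT needs no such rounding step.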
We are now in a position to summarize the features about each approach to computing indirect
convolution. This is presented in Table I below:
Table I Comparison of DFT and FNT integer sequence convolution
                         Comments
Feature                  DFT                              FNT
Transform Length         No restriction                   Restricted by requirement $N \mid (F_t^{(i)} - 1)\ \forall i$
Dynamic Range            No restriction                   Restricted by requirement $|X_k| < F_t / 2$
Arithmetic               Ordinary complex arithmetic      Modulo integer arithmetic
Coefficients             Transcendental roots of unity    Integer elements with simple structure
Precision                Limited by wordlength of the     Error free
                         computer system
Algorithmic Structure    All 'fast' algorithms            Transform length for 'fast' algorithms restricted
There are probably more issues than those covered in Table I, but the essential points are there. We
see that the DFT offers no restrictions on dynamic range, transform length and type of fast
algorithm to be employed, but inherently introduces errors both in the inability to represent the
roots of unity with perfect precision, and the fact that in computing the transform there is
computational number growth that has to be checked by scaling. On the other hand the FNT has
some fairly severe algebraic constraints that limit the type of transform that can be computed, but
offers completely error free computation of convolution, with the ability to perform the calculations
with modified integer arithmetic and with simply structured coefficients.
How are we to set about removing, or at least relaxing, some of the transform length and dynamic
range constraints of the original FNT? We will start by considering algebraic 'tricks' that can be
played, followed by the use of multi-dimensional algorithms that can be used to dramatically
increase transform length with no increase in the dynamic range requirement. We will now expand
our horizons to examine general forms of NTT's.
B. Indirect convolution using general NTT's
In order to look more carefully at the properties of the NTT, let us introduce it in a formal way:
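The general form of the Number Theoretic Transform, over a ring, R, is given by:
T:
\[ X_k = \sum_{n=0}^{N-1}{}_{\oplus_p} \left[ x_n \otimes_p \alpha^{nk} \right] ; \quad k \in \{0, 1, 2, \ldots, N-1\} \tag{14} \]
where $\alpha$ is a primitive Nth root of unity in R. The inverse transform has the form:
T^{-1}:
\[ x_n = N^{-1} \otimes_p \sum_{k=0}^{N-1}{}_{\oplus_p} \left[ X_k \otimes_p \alpha^{-kn} \right] ; \quad n \in \{0, 1, 2, \ldots, N-1\} \tag{15} \]
with the restriction that $N^{-1} \in R$.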
Over the complex field, the DFT, with $\alpha = e^{-j 2\pi / N}$, is the only transform that exists of this form, and
possesses the cyclic convolution property as we demonstrated earlier. The complex field can
support a DFT of any length, since N'th roots of unity exist for all N. In a finite algebraic system
(finite ring or field; here the term 'ring' implies a commutative ring with identity), the length of the
transform depends on the choice of the modulus, p, and the generator, $\alpha$.
As part of this formal introduction of NTT's, we will consider the conditions for the existence of a
transform. Nicholson (1971) was the first to present these conditions for existence, in his treatise
on the algebraic theory of the 'generalized DFT'. His algebraic system for the generalization was a
commutative ring $\hat{R}$ with identity and without zero divisors (i.e. an integral domain). The necessary
and sufficient conditions to generate an NTT of length N are:
1) That $\alpha$ is a primitive N'th root of unity in $\hat{R}$, i.e. $\alpha^N = 1$, $\alpha^k \neq 1$, $k = 1, 2, \ldots, N-1$. If we denote
by $\hat{R}_{\otimes}$ a multiplicative (cyclic) group of $\hat{R}$, $\alpha$ is any generator of $\hat{R}_{\otimes}$ of order N.
2) That the multiplicative inverse of N, $N^{-1}$, belongs to $\hat{R}$.
For the special case of a field, GF(p), the maximum number of elements in $\hat{R}_{\otimes}$ is p-1. More
general results for the existence of NTT's over finite commutative rings have been presented by
Dubois and Venetsanopoulos (1978). Let $\hat{R}$ be a finite commutative ring with identity. Then $\hat{R}$
decomposes uniquely as a direct sum of local rings, $\{R_i\}$, which we will write as $\hat{R} \cong \bigoplus_{i=1}^{L} R_i$
where $\cong$ indicates isomorphism. (A local ring is a commutative ring with identity which has a unique
maximal ideal I; any element $a \notin I$ is invertible in a local ring.) Under this decomposition, an element
$r \in \hat{R}$ has an L-tuple representation $(r_1, \ldots, r_L)$. $\hat{R}$ supports a generalized DFT of length N if and only if:
1) Each $R_i$ contains a primitive Nth root of unity $\alpha_i$.
2) $N^{-1}$ exists in $\hat{R}$.
A primitive root of unity in $R_i$ is any generator of a multiplicative group of order N. If $I_i$ is the
maximal ideal of $R_i$, then $R_i / I_i$ is a finite (Galois) field of $m_i^{n_i}$ elements (Burton (1970)). Thus for
any finite ring R, a necessary and sufficient condition can be derived, that R supports a length N
generalized DFT if and only if $N \mid \gcd(m_i^{n_i} - 1)$, $\forall\, i \in \{1, 2, \ldots, L\}$. The notation $a \mid b$ means a
divides b; gcd is the greatest common divisor.
Practical considerations dictate a selection of rings/fields that support transforms whose parameters
lead to efficient implementation of modular arithmetic, either in hardware or software. Most of the
reported work on Number Theoretic Transforms has supposed that the hardware will be
implemented using the binary number system. In conventional binary arithmetic, residue
reduction is particularly easy when the modulus is of the form $2^k \pm 1$. The choice $2^k$ does not admit
useful transforms since N = 1. Also, in order to simplify multiplications in the binary number
system, the cyclic group generator, $\alpha$, is chosen as a power of 2. When this constraint has to be
fulfilled, the transform length is usually not close to the maximum length attainable. In addition to
the above, the transform length is normally chosen to be highly composite, so that a 'fast' algorithm
exists for the implementation.
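For the ring of integers modulo M, the residue fields of the local rings are the prime fields GF($p_i$) for the distinct primes $p_i$ dividing M, so the existence condition reduces to $N \mid \gcd(p_i - 1)$. The following Python sketch, which is our own and uses the hypothetical names `prime_factors` and `max_ntt_length`, evaluates this bound by trial division (adequate only for small moduli).

```python
from functools import reduce
from math import gcd

def prime_factors(m):
    """Distinct prime factors of m by trial division (fine for small moduli)."""
    primes, p = [], 2
    while p * p <= m:
        if m % p == 0:
            primes.append(p)
            while m % p == 0:
                m //= p
        p += 1
    if m > 1:
        primes.append(m)
    return primes

def max_ntt_length(m):
    """Largest N with N | (p_i - 1) for every distinct prime p_i dividing m."""
    return reduce(gcd, (p - 1 for p in prime_factors(m)))

for modulus in (17, 2 ** 4, 2 ** 7 - 1, 2 ** 32 + 1):
    print(modulus, prime_factors(modulus), max_ntt_length(modulus))
```

The power-of-two modulus gives a maximum length of 1, in line with the remark that $2^k$ does not admit useful transforms, while $2^{32} + 1$ gives 128.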
A considerable effort has been made to provide rings and fields which allow for adequate dynamic
range and satisfy the above constraints, so that the conflicts hidden in these constraints are
minimized. In particular, Rader (1972b) proposed transforms defined over the ring of integers
modulo Mersenne numbers, $M = 2^p - 1$, p prime. These transforms are referred to as Mersenne
Number Transforms (MNT). In the ring of integers modulo a Mersenne number, 2 is a pth root of
unity and -2 is a 2pth root of unity. Thus the transform can be computed without general
multiplication, although the use of 'fast' algorithms is precluded because the order of the transform
is not a power of 2 and not even highly composite. And, as we have already discussed, both Rader
(1972a) and Agarwal and Burrus (1974) proposed to compute the NTT with a modulus of the form
of the t'th Fermat number, $F_t = 2^b + 1$, $b = 2^t$. Agarwal and Burrus (1974) show that an FNT with
$\alpha = 2$ allows a transform length $N = 2^{t+1}$ and an FNT with $\alpha = 2^{2^{t-2}} (2^{2^{t-1}} - 1)$ (known as $\sqrt{2}$,
since $\alpha^2 \equiv 2 \pmod{F_t}$) allows $N = 2^{t+2}$. Thus an FFT type algorithm can be used. The hardware
implementation of a 64-point FNT is described by McClellan (1976).
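The Mersenne case can be checked in the same way. The Python sketch below is our own illustration with hypothetical helper names `order` and `rotate_left`; it confirms the orders of 2 and -2 for a few small exponents and shows that multiplication by a power of 2 modulo $2^p - 1$ is a cyclic rotation of a p-bit word.

```python
def order(a, m):
    """Multiplicative order of a modulo m (assumes gcd(a, m) == 1)."""
    k, x = 1, a % m
    while x != 1:
        x = (x * a) % m
        k += 1
    return k

def rotate_left(x, k, p):
    """Rotate the p-bit word x left by k places (0 <= k < p)."""
    mask = (1 << p) - 1
    return ((x << k) | (x >> (p - k))) & mask

for p in (7, 11, 13):
    M = 2 ** p - 1
    assert order(2, M) == p            # 2 is a p'th root of unity
    assert order(M - 2, M) == 2 * p    # -2 is a 2p'th root of unity
    # Multiplying by 2^k modulo M is just a rotation of the p-bit word.
    assert all(rotate_left(x, k, p) == (x * 2 ** k) % M
               for x in range(M) for k in range(p))
    print(p, M, order(2, M), order(M - 2, M))
```

The resulting lengths p and 2p are prime or twice a prime, which is why a radix-2 fast algorithm is not available for the MNT.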
The main disadvantage of both the MNT and the FNT is the rigid relationship between the dynamic
range and attainable transform lengths. For example, with a 32-bit word machine using $F_5 = 2^{32} + 1$,
N = 64 for $\alpha = 2$ and N = 128 for $\alpha = \sqrt{2}$. There is also a limited choice of possible
wordlengths. This point is especially significant when an FNT is used and may result in a mismatch
of wordlength and dynamic range required for the particular convolution, because of the large
spacing between Fermat numbers.
As a modification to the original FNT and MNT, Agarwal and Burrus (1974) introduced a modified
FNT by using ring moduli that had the same implementation properties as a Fermat number;
namely $2^b + 1$, $b \neq 2^t$. Such moduli have small factors and so yield small transform lengths. The
transform length can be artificially increased by using multi-dimensional transform techniques, as
we will see later. Probably the most interesting idea along the lines of modifying the ring modulus,
while retaining implementation efficiency, was proposed by Nussbaumer (1976b), as pseudo-Fermat
and pseudo-Mersenne number transforms. A pseudo-MNT is defined with a ring modulus
of $M_M = (2^p - 1)/q$, p composite and q some factor of $2^p - 1$, and a pseudo-FNT is defined with a
ring modulus of $M_F = (2^b + 1)/s$, $b \neq 2^t$ and s some factor of $2^b + 1$. Because $2^p - 1$ and $2^b + 1$,
defined as above, are not prime, transforms will have a short length. Thus, if $2^b + 1$ and $2^p - 1$ contain
small factors, these can be divided out in the pseudo-MNT, or pseudo-FNT, to yield longer
transform lengths. The arithmetic implementation of these transforms can still be performed using
modified binary arithmetic over the modulus $2^p - 1$, or modulus $2^b + 1$, for the pseudo-MNT or
pseudo-FNT, respectively, followed by a final reduction modulo $M_M$ or $M_F$. The difficult
arithmetic operation (that is not performed easily with binary arithmetic) is therefore relegated to a
very small part of the overall computation budget.
In a complete search for algebraic properties that lead to useful transforms, and yet are still easily
implemented by modified binary arithmetic, the search broadened to include field, or ring, moduli
with a 3-bit form (3 'ones' in a field of 'zeros') as compared to the 2-bit structure of Fermat and
Mersenne numbers. These moduli are of the form $2^n \pm 2^m \pm 1$, and have been investigated by
Pollard (1976) and Liu et al.(1976). Other rings have also been considered, such as the ring of
Eisenstein integers (Dubois and Venetsanopoulos (1978)). The driving force behind all of the
investigations reported, has been the use of modified binary arithmetic elements with multiplication
by elements of the transform matrix kept as simple as possible, whilst yielding highly composite
transform lengths that are reasonably long. It is amazing that algebraic structures exist that allow all
of these requirements to be met! In the quest for more degrees of freedom in the algebraic
structures that had already been discovered, the use of extension fields, and rings, provided an
important tool.
C. NTT's over extension fields
Extension fields can be of any dimension; in order to see how such fields are built, we will discuss
the construction of a second degree extension field. The concept is a familiar one, even to those not
versed in modern algebra or number theory. We simply have to consider the relationship
between the field of real numbers, R, and the field of complex numbers, C. The field C is in fact
built from R by adjoining the solution of a polynomial that is irreducible over R. The polynomial
equation $x^2 + 1 = 0$ has no solution in R. We usually refer to the solution as $i \notin R$, and we
'build' the complex field as a polynomial field in i by the construction C = {S : +, *} where S = a
+ ib with $a, b \in R$. The 'complex' operators + and * are defined by reducing the polynomial that
results from the ordinary algebraic application of addition and multiplication, + and * respectively,
by the irreducible polynomial, $i^2 + 1 = 0$. In the case of addition, the resulting polynomial is first
order and so no reduction is applied, i.e. addition operates point-wise. In the case of multiplication,
the resulting polynomial is of second order in i, and so it is reduced. This gives rise to the extra
arithmetic operations involved in complex multiplication compared to complex addition, i.e. four
'real' multiplications and two 'real' additions for complex multiplication, compared to only two 'real'
additions for complex addition. The specific, well known, details are shown in Table II below.
Table II Decomposition of complex arithmetic operations to real operations
Operation              Resulting Polynomial                 Reduction                                            Result
(a + i b) + (c + i d)  i (b + d) + (a + c)                  [ i (b+d) + (a+c) ] mod ($i^2 + 1$)                  (a + c) + i (b + d)
(a + i b) * (c + i d)  $i^2$(b*d) + i(a*d+b*c) + (a*c)      [ $i^2$(b*d) + i(a*d+b*c) + (a*c) ] mod ($i^2 + 1$)  (a*c - b*d) + i (a*d + b*c)
Note that the addition operator in the polynomial elements of C is formal, in that it is never
implemented. We could just as easily write the element (a+ib) as (a,b). The same concept of
building fields can be applied to a Galois field, GF(p), and the degree of the extension field, d,
produces a field of $p^d$ elements, $GF(p^d)$. In order to build the field, we use an irreducible d'th order
polynomial over GF(p). It has been shown by Pollard (1971) that the cyclic convolution property
exists over such extension fields, and that the transform length satisfies $N \mid (p^d - 1)$. A polynomial of degree
d of the form $f(x) = \sum_{i=0}^{d} a_i x^i$ with $a_i \in GF(p)$ and $a_d \neq 0$ is defined to be irreducible (Gaal (1971))
if it cannot be expressed as a product of two polynomials of positive degree over GF(p). The
quotient field $GF(p)[x] / (f(x))$ is the required Galois field with $p^d$ elements. Up to isomorphism, the
field $GF(p^d)$ depends only on the degree of the polynomial and not on its particular form.
A special case is d=2, for which we generate a field $GF(p^2)$ which can also be written GF(p) [x],
where $x^2 - r$ is the irreducible polynomial with which the field is built. In order to reinforce the
ideas, we will go over this case with an example.
We will take p = 7; the base field is given by GF(7) = {0, 1, 2, 3, 4, 5, 6 ; $\oplus_7$, $\otimes_7$}. We will assume
an irreducible polynomial $x^2 + 1 = 0$. In order to examine the assumption we create Table III as
follows.
Table III. ($x^2 + 1$) over GF(7)
x                   0   1   2   3   4   5   6
$x^2 + 1$           1   2   5   10  17  26  37
($x^2 + 1$) Mod 7   1   2   5   3   3   5   2
We note that none of the results of applying the polynomial, $x^2 + 1$, are zero, and so the polynomial
is indeed irreducible over GF(7). As a contrast to this, we see that, in Table IV below, the same
polynomial defined over GF(5) is reducible.
Table IV. ($x^2 + 1$) over GF(5)
x                   0   1   2   3   4
$x^2 + 1$           1   2   5   10  17
($x^2 + 1$) Mod 5   1   2   0   0   2
Formally we define -1 as a quadratic nonresidue over GF(7) and as a quadratic residue over
GF(5).
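The two tables can be regenerated by evaluating the polynomial at every element of the base field; a quadratic with no root in GF(p) is irreducible over GF(p). This is a minimal Python sketch of our own, with hypothetical function names.

```python
def table_row(p):
    """Values of (x^2 + 1) mod p for x = 0, 1, ..., p-1."""
    return [(x * x + 1) % p for x in range(p)]

def roots_of_x2_plus_1(p):
    """Elements of GF(p) at which x^2 + 1 vanishes."""
    return [x for x in range(p) if (x * x + 1) % p == 0]

for p in (7, 5):
    roots = roots_of_x2_plus_1(p)
    verdict = "reducible" if roots else "irreducible"
    print(p, table_row(p), roots, verdict)
```

Over GF(5) the roots 2 and 3 give the factorization $x^2 + 1 = (x - 2)(x - 3)$ modulo 5.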
The determination of -1 as a quadratic residue, or nonresidue, holds special significance for the
computation of complex sequences over finite fields. The following theorem tells us which form of
primes support which type of residue or nonresidue. This is a classical theorem, but we will present
a proof using indices (Baraniecka and Jullien (1980)) that is normally not used. We will use
indices later, and so an introduction here will be useful. A field GF(p) possesses a multiplicative
group $GM(p-1) = \{1, 2, \ldots, p-1 : \otimes_p\} = GF(p) - \{0\}$, with p-1 non-zero elements. The entire group
can be generated by repeatedly multiplying a generator, $\alpha$, by itself p-1 times. We therefore have a
unique mapping between the power of the generator and any element in the group. Thus $\alpha^k \sim e$
where e is an element in GM(p-1). We call k the index of e. Note that $k \in GA(p-1) = \{0, 1, \ldots,
p-2 : \oplus_{p-1}\}$, an additive group. Indices are the finite field equivalent of logarithms and can be used to
map multiplication into addition.
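The index mapping can be illustrated over GF(7). The sketch below is our own Python illustration; the function names are hypothetical, and the generator 3 is an assumed choice (it is a primitive root of 7, which the table construction checks).

```python
def index_table(g, p):
    """Map each nonzero element e of GF(p) to its index k, where g^k = e (mod p)."""
    table, e = {}, 1
    for k in range(p - 1):
        table[e] = k
        e = (e * g) % p
    if len(table) != p - 1:
        raise ValueError("g is not a generator of the multiplicative group")
    return table

def multiply_via_indices(a, b, g, p):
    """Multiply nonzero a and b by adding their indices modulo p-1."""
    index = index_table(g, p)
    power = {k: e for e, k in index.items()}     # inverse mapping: index -> element
    return power[(index[a] + index[b]) % (p - 1)]

p, g = 7, 3
print(index_table(g, p))
print(multiply_via_indices(4, 5, g, p), (4 * 5) % p)
```

The element 6, which is -1 modulo 7, has index 3 = (p-1)/2, a fact that is used in the proof of Theorem 1.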
Theorem 1
-1 is a quadratic residue over GF(p) for all odd primes, p, of the form p = 4K+1, and -1 is a
quadratic nonresidue over GF(p) for all odd primes, p, of the form p = 4K+3.
Proof
Clearly p = 4K+1 or p = 4K+3 defines the form of any odd number and therefore of any odd prime.
Let t be the index of $\sqrt{-1}$, and s be the index of $-1$. Then $2t \equiv s \pmod{p-1}$. A solution will only
exist for s divisible by 2; $2^{-1} \pmod{p-1}$ does not exist since $2 \mid (p-1)$. $-1$ is therefore a quadratic
residue for s even and a quadratic nonresidue for s odd. $(\sqrt{-1})^4 = (-1)^2 \equiv 1 \pmod{p}$, and so
$4t \equiv 2s \equiv p-1 \pmod{p-1}$. Since $-1 \not\equiv 1 \pmod{p}$, this gives $s = (p-1)/2$, which is even, so that the
congruence has a solution, only for p = 4K+1.
∎
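Theorem 1 can be checked by brute force for small primes. The Python sketch below is our own; the function names are hypothetical, and the bound of 200 is an arbitrary choice. It compares a direct search for a square root of -1 with the parity of s = (p-1)/2, the index of -1 used in the proof.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def minus_one_is_residue(p):
    """True if some x in GF(p) satisfies x^2 = -1 (mod p)."""
    return any((x * x + 1) % p == 0 for x in range(1, p))

for p in (q for q in range(3, 200) if is_prime(q)):
    s = (p - 1) // 2                   # index of -1 for any generator
    assert minus_one_is_residue(p) == (s % 2 == 0) == (p % 4 == 1)

print([p for p in range(3, 60) if is_prime(p) and not minus_one_is_residue(p)])
```

The primes printed are those of the form 4K+3, which are the ones that support the emulation of complex arithmetic through the polynomial $x^2 + 1$.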
Although we can find other quadratic non-residues, with which to build second degree extension
fields, -1 carries special significance in that it allows the emulation of complex arithmetic over a
Galois Field. If we compute transforms over such fields, we can convolve sequences of complex
numbers; a useful property in many communication filtering problems. We see that not all primes are
created equal when it comes to emulating complex arithmetic.
The development of the extension field operations of multiplication and addition uses the same
principle as the development of complex field operations from base real field addition and
multiplication; this is shown in Table V (similar to Table II). In the case of a finite field, the
polynomial reduction uses arithmetic defined over the base field GF(p). We illustrate the operation
using a quadratic nonresidue for -1 over $GF(p^2)$.
Table V. Extension field arithmetic operations decomposed as base field operations for a
quadratic nonresidue of -1
Operation | Result
$(a,b) \oplus_p (c,d)$ | $(\,[a \oplus_p c],\ [b \oplus_p d]\,)$
$(a,b) \otimes_p (c,d)$ | $(\,\{[a \otimes_p c] \oplus_p [-b \otimes_p d]\},\ \{[a \otimes_p d] \oplus_p [b \otimes_p c]\}\,)$
Note in the table that we have introduced the notation, $\oplus_p$ and $\otimes_p$, for extension field operations
of addition and multiplication, respectively; the base field modulus is used for the subscript. The
degree of the extension field is obtained from the number of elements within each n-tuple (in this
case 2); we have explicitly shown each element of $GF(p^2)$ as a two-tuple rather than using the formal
operator, +. Note, also, that all implemented operations are base field operations.
The result in Table V is a specific result for the quadratic nonresidue of -1. In general, for a
quadratic nonresidue of q, we can generate Table VI.
As an example of multiplication over $GF(7^2)$, consider $(2,4) \otimes_7 (5,3)$. The result is:
$(2,4) \otimes_7 (5,3) = (\{[2 \otimes_7 5] \oplus_7 [-4 \otimes_7 3]\},\ \{[2 \otimes_7 3] \oplus_7 [4 \otimes_7 5]\}) = (\{3 \oplus_7 2\},\ \{6 \oplus_7 6\}) = (5, 5)$
Table VI. Extension field arithmetic operations decomposed as base field operations for a
quadratic nonresidue of q
Operation | Result
$(a,b) \oplus_p (c,d)$ | $(\,[a \oplus_p c],\ [b \oplus_p d]\,)$
$(a,b) \otimes_p (c,d)$ | $(\,\{[a \otimes_p c] \oplus_p [q \otimes_p b \otimes_p d]\},\ \{[a \otimes_p d] \oplus_p [b \otimes_p c]\}\,)$
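The two rows of the extension field table translate directly into code. The following is a minimal sketch, assuming elements are stored as Python 2-tuples; the function names `ext_add` and `ext_mul` are hypothetical, and the default `q = -1` corresponds to the 'complex' case.

```python
def ext_add(u, v, p):
    """Component-wise addition of two elements of GF(p^2)."""
    return ((u[0] + v[0]) % p, (u[1] + v[1]) % p)


def ext_mul(u, v, p, q=-1):
    """Multiply (a,b) by (c,d) in GF(p^2), where q is a quadratic nonresidue mod p."""
    (a, b), (c, d) = u, v
    return ((a * c + q * b * d) % p, (a * d + b * c) % p)


# The worked example over GF(7^2) with q = -1:
print(ext_mul((2, 4), (5, 3), 7))   # (5, 5)
```

Only base field operations (integer arithmetic reduced modulo p) appear inside the two functions, as the text points out.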
There are perhaps two questions uppermost in our minds right now: why does such a structure
form a field, and why are such fields useful in implementing NTT's ?
To answer the first question we need to explore the existence of all the field axioms. We do not
offer a formal proof here except to point out that we need to prove: closure, the rules associated with
the operations of addition and multiplication, identities, and the existence of the inverse of every
element. The reason such fields are useful is two-fold. Firstly, the fields have $p^d$ elements, compared
to the p elements of the base field; this means that transform lengths satisfy $N \mid (p^d - 1)$, as was
pointed out earlier, and so the algebraic link between transform length and modulus of implementation
has been relaxed.
To answer the second question, in the case of second degree extension fields based on non-
quadratic residues of -1, the field provides a natural system for implementing convolution between
sequences of complex numbers. The form of a number theoretic transform over extension fields is
identical to that of a base field transform, with the exception that the elements are n-tuples, and that
the arithmetic follows the rules of polynomial addition and multiplication with reduction by the
irreducible polynomial. Details of complex NTT's can be found in Vanwormhoud (1978) and
Baraniecka (1980).
In order to reinforce the ideas presented above, let us take the previous example of convolution, but
this time we will compute the transform over GF(31); note that 31 is a Mersenne prime, and of form
4K+3 which supports the quadratic nonresidue of -1. First we need to determine the possible
transform lengths. The length satisfies $N \mid (31^2 - 1)$, or $N \mid 960$, and clearly N = 8 is a valid transform
length. We can determine suitable generators for an 8-element multiplicative cyclic group by
performing a linear search. In general, the search for an N'th order group over a field $GF(p^2)$ can be
facilitated by noting that (Baraniecka (1980)):
i) We need to evaluate only N/2 powers of $\alpha$, because $\alpha^{N/2} \equiv -1 \pmod{p}$
ii) We only need to find an element $\bar{\alpha}$ whose order is KN, where K is any positive integer,
since $\alpha = \bar{\alpha}^K$.
A suitable generator is $\alpha = (4, 4)$; this can be demonstrated by generating the series
$\{\alpha, \alpha^2, \ldots, \alpha^8\}$. The series is generated in Table VII, below, with the two-tuple of each element in
the series placed vertically.
Table VII. 8th Order multiplicative group over $GF(31^2)$
Power:        α^1  α^2  α^3  α^4  α^5  α^6  α^7  α^8
1st element:    4    0   27   30   27    0    4    1
2nd element:    4    1    4    0   27   30   27    0
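The entries of Table VII can be regenerated with a short loop. This is a sketch only; `cmul` and `powers` are names introduced here, and the multiplication rule is the 'complex' rule for a quadratic nonresidue of -1.

```python
def cmul(u, v, p):
    """'Complex' multiplication in GF(p^2), valid when -1 is a quadratic nonresidue mod p."""
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % p, (a * d + b * c) % p)


def powers(alpha, n, p):
    """Return [alpha^1, alpha^2, ..., alpha^n]."""
    out, cur = [], (1, 0)
    for _ in range(n):
        cur = cmul(cur, alpha, p)
        out.append(cur)
    return out


print(powers((4, 4), 8, 31))
# [(4, 4), (0, 1), (27, 4), (30, 0), (27, 27), (0, 30), (4, 27), (1, 0)]
```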
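We note that $\alpha^8 = (1,0)$ and that $\alpha^n \neq (1,0)$, n < 8, which satisfies the criterion for the multiplicative
group. The convolution is shown below.
We start with the forward transform, the elements are shown as 2-tuples; the symbol $\odot_{31}$ indicates
matrix multiplication over the extension field $GF(31^2)$.
$$
\begin{pmatrix} (10,0)\\ (24,27)\\ (29,29)\\ (9,21)\\ (29,0)\\ (9,10)\\ (29,2)\\ (24,4) \end{pmatrix}
=
\begin{pmatrix}
(1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0)\\
(1,0) & (4,4) & (0,1) & (27,4) & (30,0) & (27,27) & (0,30) & (4,27)\\
(1,0) & (0,1) & (30,0) & (0,30) & (1,0) & (0,1) & (30,0) & (0,30)\\
(1,0) & (27,4) & (0,30) & (4,4) & (30,0) & (4,27) & (0,1) & (27,27)\\
(1,0) & (30,0) & (1,0) & (30,0) & (1,0) & (30,0) & (1,0) & (30,0)\\
(1,0) & (27,27) & (0,1) & (4,27) & (30,0) & (4,4) & (0,30) & (27,4)\\
(1,0) & (0,30) & (30,0) & (0,1) & (1,0) & (0,30) & (30,0) & (0,1)\\
(1,0) & (4,27) & (0,30) & (27,27) & (30,0) & (27,4) & (0,1) & (4,4)
\end{pmatrix}
\odot_{31}
\begin{pmatrix} (1,0)\\ (2,0)\\ (3,0)\\ (4,0)\\ (0,0)\\ (0,0)\\ (0,0)\\ (0,0) \end{pmatrix}
$$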
The input to the inverse calculation is the transform domain point-wise multiplication shown below.
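$$
\begin{pmatrix} (7,0)\\ (2,25)\\ (0,8)\\ (12,6)\\ (4,0)\\ (12,25)\\ (0,23)\\ (2,6) \end{pmatrix}
=
\begin{pmatrix} (10,0)\\ (24,27)\\ (29,29)\\ (9,21)\\ (29,0)\\ (9,10)\\ (29,2)\\ (24,4) \end{pmatrix}
\otimes_{31}
\begin{pmatrix} (10,0)\\ (24,27)\\ (29,29)\\ (9,21)\\ (29,0)\\ (9,10)\\ (29,2)\\ (24,4) \end{pmatrix}
$$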
The inverse transform is computed as below.
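$$
\begin{pmatrix} (8,0)\\ (1,0)\\ (18,0)\\ (5,0)\\ (14,0)\\ (6,0)\\ (4,0)\\ (0,0) \end{pmatrix}
=
\begin{pmatrix}
(1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0)\\
(1,0) & (4,27) & (0,30) & (27,27) & (30,0) & (27,4) & (0,1) & (4,4)\\
(1,0) & (0,30) & (30,0) & (0,1) & (1,0) & (0,30) & (30,0) & (0,1)\\
(1,0) & (27,27) & (0,1) & (4,27) & (30,0) & (4,4) & (0,30) & (27,4)\\
(1,0) & (30,0) & (1,0) & (30,0) & (1,0) & (30,0) & (1,0) & (30,0)\\
(1,0) & (27,4) & (0,30) & (4,4) & (30,0) & (4,27) & (0,1) & (27,27)\\
(1,0) & (0,1) & (30,0) & (0,30) & (1,0) & (0,1) & (30,0) & (0,30)\\
(1,0) & (4,4) & (0,1) & (27,4) & (30,0) & (27,27) & (0,30) & (4,27)
\end{pmatrix}
\odot_{31}
\begin{pmatrix} (7,0)\\ (2,25)\\ (0,8)\\ (12,6)\\ (4,0)\\ (12,25)\\ (0,23)\\ (2,6) \end{pmatrix}
$$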
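with the final reduction by the multiplicative inverse of 8: $8^{-1} = 4$.
$$
\begin{pmatrix} (1,0)\\ (4,0)\\ (10,0)\\ (20,0)\\ (25,0)\\ (24,0)\\ (16,0)\\ (0,0) \end{pmatrix}
= 8^{-1} \otimes_{31}
\begin{pmatrix} (8,0)\\ (1,0)\\ (18,0)\\ (5,0)\\ (14,0)\\ (6,0)\\ (4,0)\\ (0,0) \end{pmatrix}
$$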
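The whole calculation (forward transforms, point-wise product, inverse transform and scaling by the inverse of N) can be sketched as below. This is an illustrative direct matrix-style evaluation, not a fast algorithm; all function names are assumptions made for the example, and `pow(N, -1, P)` needs Python 3.8 or later.

```python
P = 31  # base field modulus, of the form 4K+3


def cmul(u, v):
    """'Complex' multiplication in GF(31^2)."""
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % P, (a * d + b * c) % P)


def cpow(u, n):
    """u raised to the non-negative integer power n."""
    r = (1, 0)
    for _ in range(n):
        r = cmul(r, u)
    return r


def ntt(x, alpha):
    """Direct NTT: X[k] = sum over n of x[n] * alpha^(k*n)."""
    N = len(x)
    out = []
    for k in range(N):
        acc = (0, 0)
        for n in range(N):
            t = cmul(x[n], cpow(alpha, (k * n) % N))
            acc = ((acc[0] + t[0]) % P, (acc[1] + t[1]) % P)
        out.append(acc)
    return out


def cyclic_convolve(x, h, alpha, alpha_inv):
    """Cyclic convolution via forward NTTs, point-wise product and inverse NTT."""
    N = len(x)
    Z = [cmul(a, b) for a, b in zip(ntt(x, alpha), ntt(h, alpha))]
    n_inv = pow(N, -1, P)  # 8^-1 = 4 (mod 31)
    return [cmul((n_inv, 0), z) for z in ntt(Z, alpha_inv)]


x = [(1, 0), (2, 0), (3, 0), (4, 0)] + [(0, 0)] * 4
print(ntt(x, (4, 4)))
print(cyclic_convolve(x, x, (4, 4), (4, 27)))
```

The inverse generator (4, 27) is the seventh power of (4, 4), so no separate search is needed for it.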
We can note similarities between the calculation over $GF(31^2)$ and the DFT calculation. The
transform matrices for both the forward and reverse transform have the same symmetry; if numbers
close to 31 are replaced by their negative equivalent, the similarity is easy to see. We notice the
same Hermitian symmetry in the forward transform domain as the DFT example, and, because the
modulus is larger than the largest convolution output sample, the convolution is determined exactly
rather than with least positive residues for some of the samples, as was the case with the calculation
over GF(17). Note also that convolution over GF(31) only yields a maximum transform length of
30 against the maximum transform length of 960 for $GF(31^2)$. We note that the largest power of 2
transform length for GF(31) is 2 whereas the largest power of 2 transform length for $GF(31^2)$ is
64. This gives a dramatic example of the usefulness of extension fields for relaxing the relationship
between dynamic range and transform length. We can find, quite easily, the maximum power of two
transform length for an NTT defined over $GF(p^2)$, p of form 4K+3. We simply have to express the
prime in a different way. This is presented in the following theorem (Baraniecka (1980)).
Theorem 2
Given a prime $p = 4K+3 = q \cdot 2^r - 1$, where r is any positive integer and (q, 2) = 1; an NTT defined
over $GF(p^2)$ can have a transform length $N = 2^B$ where the maximum value of B is r+1.
Proof
The order of the cyclic subgroup used to define the NTT has to divide $p^2 - 1$. Therefore
$N \mid (q^2 2^{2r} - q\,2^{r+1})$. Since q is odd then $N \mid y\,2^{r+1}$ where y is odd. Therefore we can always find an
$N = 2^B$, where the maximum value of B = r+1.

*
We can try this out in the previous example. $p = 31 = 1 \cdot 2^5 - 1$; r = 5 and so the maximum
transform length is $N = 2^6 = 64$, which checks with our calculation from $p^2 - 1 = 31^2 - 1 = 960$.
Since one of the reasons for using extension fields is the ability to handle larger transform lengths,
particularly transforms of maximum length $N = 2^B$, then the above theorem is important from an
implementation perspective.
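Theorem 2 can be checked numerically by comparing the largest power of two dividing $p^2 - 1$ with $2^{r+1}$. The sketch below does this for a handful of primes of the form 4K+3; the function names and the list of primes are chosen purely for illustration.

```python
def max_pow2_length(p):
    """Largest N = 2**B that divides p*p - 1."""
    m, B = p * p - 1, 0
    while m % 2 == 0:
        m //= 2
        B += 1
    return 2 ** B


def r_of(p):
    """Exponent r in p = q * 2**r - 1 with q odd."""
    m, r = p + 1, 0
    while m % 2 == 0:
        m //= 2
        r += 1
    return r


for p in (7, 11, 19, 23, 31, 127):
    print(p, max_pow2_length(p), 2 ** (r_of(p) + 1))
```

For p = 31 both columns give 64, in agreement with the example in the text.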
We can always convolve 2 'real' sequences with a third 'real' sequence using the complex extension
field transform, by creating a complex sequence consisting of elements comprised of a sample from
each of the sequences; thus if the 2 sequences are $\{x_n\}$ and $\{y_n\}$ then the sequence to be convolved
is the complex sequence $\{(x_n, y_n)\}$. This works since the sequence being convolved with has the
form $\{(x_n, 0)\}$, which operates on the two parts of the complex sequence independently. This also
works for DFT indirect convolution. The 2 juxtaposed sequences do not have to be from separate
sources, they can be from the same source at different sampling times. We can illustrate this below
using the same NTT as in the previous example. We will assume that all three sequences are the
example sequence {1, 2, 3, 4, 0, 0, 0, 0}.
The forward transform is computed as below:
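$$
\begin{pmatrix} (10,10)\\ (28,20)\\ (0,27)\\ (19,30)\\ (29,29)\\ (30,19)\\ (27,0)\\ (20,28) \end{pmatrix}
=
\begin{pmatrix}
(1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0)\\
(1,0) & (4,4) & (0,1) & (27,4) & (30,0) & (27,27) & (0,30) & (4,27)\\
(1,0) & (0,1) & (30,0) & (0,30) & (1,0) & (0,1) & (30,0) & (0,30)\\
(1,0) & (27,4) & (0,30) & (4,4) & (30,0) & (4,27) & (0,1) & (27,27)\\
(1,0) & (30,0) & (1,0) & (30,0) & (1,0) & (30,0) & (1,0) & (30,0)\\
(1,0) & (27,27) & (0,1) & (4,27) & (30,0) & (4,4) & (0,30) & (27,4)\\
(1,0) & (0,30) & (30,0) & (0,1) & (1,0) & (0,30) & (30,0) & (0,1)\\
(1,0) & (4,27) & (0,30) & (27,27) & (30,0) & (27,4) & (0,1) & (4,4)
\end{pmatrix}
\odot_{31}
\begin{pmatrix} (1,1)\\ (2,2)\\ (3,3)\\ (4,4)\\ (0,0)\\ (0,0)\\ (0,0)\\ (0,0) \end{pmatrix}
$$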
The input to the inverse calculation is the transform domain multiplication shown below. Note that
one of the vectors is taken from the previous convolution example transform domain output
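representing the transform domain output of the sequence $\{(x_n, 0)\}$.
$$
\begin{pmatrix} (7,7)\\ (8,27)\\ (23,8)\\ (6,18)\\ (4,4)\\ (18,6)\\ (8,23)\\ (27,8) \end{pmatrix}
=
\begin{pmatrix} (10,10)\\ (28,20)\\ (0,27)\\ (19,30)\\ (29,29)\\ (30,19)\\ (27,0)\\ (20,28) \end{pmatrix}
\otimes_{31}
\begin{pmatrix} (10,0)\\ (24,27)\\ (29,29)\\ (9,21)\\ (29,0)\\ (9,10)\\ (29,2)\\ (24,4) \end{pmatrix}
$$
$$
\begin{pmatrix} (8,8)\\ (1,1)\\ (18,18)\\ (5,5)\\ (14,14)\\ (6,6)\\ (4,4)\\ (0,0) \end{pmatrix}
=
\begin{pmatrix}
(1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0)\\
(1,0) & (4,27) & (0,30) & (27,27) & (30,0) & (27,4) & (0,1) & (4,4)\\
(1,0) & (0,30) & (30,0) & (0,1) & (1,0) & (0,30) & (30,0) & (0,1)\\
(1,0) & (27,27) & (0,1) & (4,27) & (30,0) & (4,4) & (0,30) & (27,4)\\
(1,0) & (30,0) & (1,0) & (30,0) & (1,0) & (30,0) & (1,0) & (30,0)\\
(1,0) & (27,4) & (0,30) & (4,4) & (30,0) & (4,27) & (0,1) & (27,27)\\
(1,0) & (0,1) & (30,0) & (0,30) & (1,0) & (0,1) & (30,0) & (0,30)\\
(1,0) & (4,4) & (0,1) & (27,4) & (30,0) & (27,27) & (0,30) & (4,27)
\end{pmatrix}
\odot_{31}
\begin{pmatrix} (7,7)\\ (8,27)\\ (23,8)\\ (6,18)\\ (4,4)\\ (18,6)\\ (8,23)\\ (27,8) \end{pmatrix}
$$
$$
\begin{pmatrix} (1,1)\\ (4,4)\\ (10,10)\\ (20,20)\\ (25,25)\\ (24,24)\\ (16,16)\\ (0,0) \end{pmatrix}
= 8^{-1} \otimes_{31}
\begin{pmatrix} (8,8)\\ (1,1)\\ (18,18)\\ (5,5)\\ (14,14)\\ (6,6)\\ (4,4)\\ (0,0) \end{pmatrix}
$$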
Note that both the elements in the 2-tuple output sequence produce the correct convolved sequence.
In general the juxtaposed sequence will be from different sequences or from the same sequence
sampled at different times.
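The independence of the two juxtaposed sequences can be seen with a direct cyclic convolution over $GF(31^2)$ (used here in place of the transform, purely to keep the sketch short). The second sequence `ys` is a hypothetical choice that differs from the example in the text so that the two outputs can be told apart; all names are assumptions for illustration.

```python
P = 31


def cmul(u, v):
    """'Complex' multiplication in GF(31^2)."""
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % P, (a * d + b * c) % P)


def cyclic_convolve_direct(x, h):
    """y[n] = sum over q of x[q] * h[(n - q) mod N], computed over GF(31^2)."""
    N = len(x)
    out = []
    for n in range(N):
        acc = (0, 0)
        for q in range(N):
            t = cmul(x[q], h[(n - q) % N])
            acc = ((acc[0] + t[0]) % P, (acc[1] + t[1]) % P)
        out.append(acc)
    return out


xs = [1, 2, 3, 4, 0, 0, 0, 0]          # first 'real' sequence (from the text)
ys = [2, 0, 1, 0, 0, 0, 0, 0]          # second 'real' sequence (hypothetical)
hs = [1, 2, 3, 4, 0, 0, 0, 0]          # convolving 'real' sequence
packed = list(zip(xs, ys))             # {(x_n, y_n)}
h = [(v, 0) for v in hs]               # second element zero
print(cyclic_convolve_direct(packed, h))
```

The first elements of the output are the convolution of `xs` with `hs`, and the second elements are the convolution of `ys` with `hs`.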
One possible problem is the fact that the generator is no longer of a simple form; in fact we have
two elements to manipulate in the generator. Using the rules of the extension field arithmetic, we
have increased the additive complexity by 2 and the multiplicative complexity by more than 4 (4
multiplications and 2 additions).
Let us examine the form of the generator for both types of prime field moduli. Although primes of
the form p = 4K+3 allow complex arithmetic rules for second degree extension fields, $GF(p^2)$, we
cannot find a generator for a multiplicative group of order $N = 2^B$ (where B is maximum) that has a
simple form (i.e. only one element). This has been shown by Baraniecka (1980), which we will
express in the following theorem.
Theorem 3
The generator, $\alpha$, of the multiplicative group $GF(p^2) - \{0\}$, of order $N = 2^B$, has to have the general
polynomial form $\alpha = (\gamma, \beta)$ with $\gamma, \beta \neq 0$ for B = r+1, where $p = q \cdot 2^r - 1$, q odd.
Proof
Assume $\beta = 0$, then $\alpha$ has maximum multiplicative order p-1 < 2p+1; hence $\beta \neq 0$. Let
$\alpha = (\gamma, \beta) = (c + d\sqrt{\delta})^q$ have order $2^{r+1}$, where $x^2 - \delta = 0$ is the irreducible polynomial used to
build the extension field. Then $(c + d\sqrt{\delta})^{q 2^r} = -1$, since $\alpha^{N/2} = -1$. The above can be written as:
$(c + d\sqrt{\delta}) \cdot (c + d\sqrt{\delta})^{q 2^r - 1} = -1$. If we employ the binomial theorem and the property
$\delta^{(p-1)/2} = -1$ (Vinogradov (1954)), hence $(\sqrt{\delta})^p = -$ [...]

*
We now examine primes of the form p = 4K + 1. We can represent such primes in another way:
$p = s \cdot 2^t + 1$, where s is odd. The largest multiplicative group, with a power-of-2 order, has an
order $N = 2^{t+1}$. Baraniecka (1980) has shown, for such primes, that the generator can always be
simplified to (0, a), a single non-zero element form. The following theorem generates this result:
Theorem 4
Let $p = s \cdot 2^t + 1$, (s, 2) = 1, be an odd prime number. Then:
i) If g is a generator for the multiplicative group GF(p) - {0}, then $x^2 - g$ is an irreducible
polynomial in GF(p)[x].
ii) If g is as in i), then $\sqrt{g}$ has multiplicative order $s \cdot 2^{t+1}$ in $GF(p^2)$, where
$GF(p^2) = \{a + b\sqrt{g} : a, b \in GF(p)\}$.
iii) We can find a generator $\sqrt{\rho}$ of a cyclic subgroup of order $2^{t+1}$ in $GF(p^2)$, where $\rho = g^{rs}$
with (r, 2) = 1 and $x^2 - \rho$ an irreducible polynomial in GF(p)[x].
Proof:
i) Suppose $x^2 - g$ is reducible. Then there exists $\delta \in GF(p)$ such that $\delta^2 = g$. From Fermat's
theorem (Dudley (1969)) $\delta^{p-1} = 1 = [\delta^2]^{(p-1)/2} = g^{(p-1)/2} = -1$. This is not the case for an odd prime,
hence $x^2 - g$ is irreducible.
ii) $(\sqrt{g})^{s \cdot 2^{t+1}} = g^{s \cdot 2^t} = 1$. We now have to prove that $s \cdot 2^{t+1}$ is the smallest order of $\sqrt{g}$.
Suppose $(\sqrt{g})^v = 1$ for some $1 \le v < s \cdot 2^{t+1}$, then if $2 \mid v$, so that v = 2u, $1 \le u < s \cdot 2^t$, we have
$(\sqrt{g})^{2u} = g^u = 1$ which is impossible. If [...]
iii) From ii), $(\sqrt{g})^{s \cdot 2^{t+1}} = 1$, therefore $(\sqrt{g})^s$ has order $2^{t+1}$. Since s is odd, $(\sqrt{g})^s \notin GF(p)$
and so there exists an irreducible polynomial, $x^2 - g^s$. We also know (Dudley (1969)) that for some
r, such that $(r, 2^{t+1}) = 1$, $(\sqrt{g})^{rs}$ has order $2^{t+1}$. It is clear (from the prime factor decomposition
of $2^{t+1}$) that we only need to show (r, 2) = 1. We now set $\rho = g^{rs}$. Since rs is odd, $x^2 - \rho$ is an
irreducible polynomial in GF(p)[x].
The theorem is important because it not only tells us that a simple generator form exists, but also
how to find it! This is not the case for the other form of primes. We have, of course, only gained a
doubling of the transform length from that offered by the base field, but we should not lose any of
the efficiency inherent in using extension fields for p = 4K + 3, since a 'complex' multiplication by
the elements of the transform matrix only requires 2 'real' multiplications (plus a 'real' addition). We
must note here that in producing the first element of the 2-tuple (when the power of $\alpha$ has a zero
2nd element) a multiplication by a constant, $\rho$, is required in the calculation. It is generally assumed
that multiplication by constants is free, and although this seems a strange statement to make, we will
demonstrate this fact in the section on hardware implementation.
The rules of arithmetic used in $GF(p^2)$, p = 4K + 1, are shown in Table VIII below, where $j = \sqrt{\rho}$.
Table VIII. Arithmetic rules for $GF(p^2)$, p = 4K + 1
Operation | Resulting Polynomial | Reduction | Result
$(a + jb) + (c + jd)$ | $j(b + d) + (a + c)$ | $[\,j(b + d) + (a + c)\,] \bmod (j^2 - \rho)$ | $(a + c) + j(b + d)$
$(a + jb) * (c + jd)$ | $j^2(b*d) + j(a*d + b*c) + (a*c)$ | $[\,j^2(b*d) + j(a*d + b*c) + (a*c)\,] \bmod (j^2 - \rho)$ | $(a*c + \rho*b*d) + j(a*d + b*c)$
We see the constant $\rho$ in the multiplication result for the first element of the resulting 2-tuple.
Let us take an example to see how the transform over this new type of field modulus works; we can
find a suitable prime from tabulated data such as Abramowitz and Stegun (1968). We can find the
tables under the section on combinatorial analysis, in which a table of least positive and negative
primitive roots, along with the factorization of p-1, for p a prime, is given. Our task is to find a
prime, p, such that the factorization of p-1 has a power-of-2 as one of the factors. As an example we
will take p = 29, for which $p - 1 = 2^2 \cdot 7$. This means that a transform of order 8 is available over the
extension field. Using the least positive primitive root from the tables, g = 2, and so $\sqrt{2}$ will
generate a multiplicative sub-group of order $2^3 \cdot 7 = 56$. The generator of the multiplicative group of
order 8 is given by $\alpha = \sqrt{\rho} = (\sqrt{2})^{7r}$ where (r, 2) = 1. Mapping to indices, we are searching for an
element $\rho$ subject to the constraint $\mathrm{ind}_2\,\rho = 7r$. The only indices which satisfy this constraint are
7 x 1 = 7 and 7 x 3 = 21. We can compute the associated values of $\rho$ from $\rho = 2^{\mathrm{ind}_2 \rho}$. We therefore
obtain $\rho = 12$ and $\rho = 17$ as the only two values of $\rho$. These yield four values of $\alpha$: $\alpha = \sqrt{12}$,
giving $8\sqrt{2} = (0, 8)$ or $21\sqrt{2} = (0, 21)$, and $\alpha = \sqrt{17}$, giving $9\sqrt{2} = (0, 9)$ or $20\sqrt{2} = (0, 20)$. The
elements of the multiplicative sub-group generated by each of these generators are shown in Table IX
below. The elements are arranged vertically with the first element of the 2-tuple above the second.
Table IX. Multiplicative sub-groups over $GF(29^2)$
Power:       α^1  α^2  α^3  α^4  α^5  α^6  α^7  α^8
1st group:     0   12    0   28    0   17    0    1
               8    0    9    0   21    0   20    0
2nd group:     0   17    0   28    0   12    0    1
               9    0    8    0   20    0   21    0
3rd group:     0   17    0   28    0   12    0    1
              20    0   21    0    9    0    8    0
4th group:     0   12    0   28    0   17    0    1
              21    0   20    0    8    0    9    0
We note from the table that the lower two sub-groups are the inverse of the upper two sub-groups
in that s x t = (1, 0), where s, t are elements of the respective groups. In forming a transform we
can use one of the sub-groups to generate the transform matrix for the forward transform, and the
other to generate the matrix for the inverse transform. This is demonstrated below using the same
convolution example as used earlier, with the first and third sub-groups used to generate the
forward and inverse transforms, respectively.
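The search for the simple-form generators described above can be sketched as follows. The code follows part iii) of Theorem 4 for p = 29, g = 2, s = 7, and looks for elements of the form (0, b) whose square is one of the values obtained from odd multiples of s; the helper names `mul` and `order` are assumptions made for the illustration.

```python
p, g, s = 29, 2, 7            # p = s * 2**2 + 1, g a primitive root of GF(p)


def mul(u, v):
    """(a + b*sqrt(g)) * (c + d*sqrt(g)) in GF(p^2)."""
    (a, b), (c, d) = u, v
    return ((a * c + g * b * d) % p, (a * d + b * c) % p)


def order(u):
    """Multiplicative order of a non-zero element u."""
    n, cur = 1, u
    while cur != (1, 0):
        cur = mul(cur, u)
        n += 1
    return n


gens = []
for r in (1, 3):                                   # odd r, giving indices 7 and 21
    rho = pow(g, r * s, p)                         # 12 and 17
    gens += [(0, b) for b in range(1, p) if (g * b * b) % p == rho]

print(sorted(gens))
print([order(a) for a in sorted(gens)])
```

Each generator found has a zero first element, which is the single non-zero element form discussed in the text, and each has order 8.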
The forward transform is computed as below:
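$$
\begin{pmatrix} (10,10)\\ (25,2)\\ (3,3)\\ (7,15)\\ (27,27)\\ (20,14)\\ (22,22)\\ (10,2) \end{pmatrix}
=
\begin{pmatrix}
(1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0)\\
(1,0) & (0,8) & (12,0) & (0,9) & (28,0) & (0,21) & (17,0) & (0,20)\\
(1,0) & (12,0) & (28,0) & (17,0) & (1,0) & (12,0) & (28,0) & (17,0)\\
(1,0) & (0,9) & (17,0) & (0,8) & (28,0) & (0,20) & (12,0) & (0,21)\\
(1,0) & (28,0) & (1,0) & (28,0) & (1,0) & (28,0) & (1,0) & (28,0)\\
(1,0) & (0,21) & (12,0) & (0,20) & (28,0) & (0,8) & (17,0) & (0,9)\\
(1,0) & (17,0) & (28,0) & (12,0) & (1,0) & (17,0) & (28,0) & (12,0)\\
(1,0) & (0,20) & (17,0) & (0,21) & (28,0) & (0,9) & (12,0) & (0,8)
\end{pmatrix}
\odot_{29}
\begin{pmatrix} (1,1)\\ (2,2)\\ (3,3)\\ (4,4)\\ (0,0)\\ (0,0)\\ (0,0)\\ (0,0) \end{pmatrix}
$$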
The input to the inverse calculation is the transform domain multiplication shown below. The
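rightmost vector was generated from the forward transform of the sequence:
{(1,0), (2,0), (3,0), (4,0), (0,0), (0,0), (0,0), (0,0)}.
$$
\begin{pmatrix} (13,13)\\ (2,11)\\ (9,9)\\ (8,28)\\ (4,4)\\ (9,0)\\ (20,20)\\ (1,10) \end{pmatrix}
=
\begin{pmatrix} (10,10)\\ (25,2)\\ (3,3)\\ (7,15)\\ (27,27)\\ (20,14)\\ (22,22)\\ (10,2) \end{pmatrix}
\otimes_{29}
\begin{pmatrix} (10,0)\\ (8,23)\\ (3,0)\\ (23,21)\\ (27,0)\\ (8,6)\\ (22,0)\\ (23,8) \end{pmatrix}
$$
$$
\begin{pmatrix} (8,8)\\ (3,3)\\ (22,22)\\ (15,15)\\ (26,26)\\ (18,18)\\ (12,12)\\ (0,0) \end{pmatrix}
=
\begin{pmatrix}
(1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0) & (1,0)\\
(1,0) & (0,20) & (17,0) & (0,21) & (28,0) & (0,9) & (12,0) & (0,8)\\
(1,0) & (17,0) & (28,0) & (12,0) & (1,0) & (17,0) & (28,0) & (12,0)\\
(1,0) & (0,21) & (12,0) & (0,20) & (28,0) & (0,8) & (17,0) & (0,9)\\
(1,0) & (28,0) & (1,0) & (28,0) & (1,0) & (28,0) & (1,0) & (28,0)\\
(1,0) & (0,9) & (17,0) & (0,8) & (28,0) & (0,20) & (12,0) & (0,21)\\
(1,0) & (12,0) & (28,0) & (17,0) & (1,0) & (12,0) & (28,0) & (17,0)\\
(1,0) & (0,8) & (12,0) & (0,9) & (28,0) & (0,21) & (17,0) & (0,20)
\end{pmatrix}
\odot_{29}
\begin{pmatrix} (13,13)\\ (2,11)\\ (9,9)\\ (8,28)\\ (4,4)\\ (9,0)\\ (20,20)\\ (1,10) \end{pmatrix}
$$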
With the final multiplication:
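$$
\begin{pmatrix} (1,1)\\ (4,4)\\ (10,10)\\ (20,20)\\ (25,25)\\ (24,24)\\ (16,16)\\ (0,0) \end{pmatrix}
= 8^{-1} \otimes_{29}
\begin{pmatrix} (8,8)\\ (3,3)\\ (22,22)\\ (15,15)\\ (26,26)\\ (18,18)\\ (12,12)\\ (0,0) \end{pmatrix}
$$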
We can see that each element of the transform matrix, for both the forward and reverse transform, is
of the simple form, but the zero generator element is exchanged between the first and second place
in the 2-tuple for adjacent matrix elements. It appears as though we may need to multiplex the
arithmetic hardware between the adjacent inner product partial product calculations in the transform
matrix/vector multiplication for both the forward and reverse transforms. An interesting result
occurs if we consider implementing the transform using a fast algorithm of the type discussed in
section III, A.
Consider the decimation-in-time algorithm, as shown in the 8-point transform flow graph of Fig 2.
Only in the final stage do we require multiplications by odd powers of $\alpha$, the other stages only use
even powers of $\alpha$ and for these we only require a real multiplication. Thus the hardware only
changes at the final stage.
We will illustrate the operation of the fast algorithm on the previous convolution example using the
8-point DIT fast algorithm to compute the transforms. The forward transform uses the 'twiddle
factors' shown in Table Xa below, and the inverse transform uses the 'twiddle factors' in Table Xb;
the 2-tuple is arranged vertically.
Table Xa                      Table Xb
n          1    2    3        n          1    2    3
W_8^n      0   12    0        W_8^n      0   17    0
           8    0    9                  20    0   21
The Fast NTT (FNTT) is displayed below in columns of 2-tuples, within vertical lines, with bold
face italics indicating the result after multiplication. If no twiddle factor is present in the signal flow
graph of Fig 2, a multiplier of unity is used for consistency, and the italics are normal face. Since
this is a numerical equivalent of the flow graph of Fig 2, the operation proceeds from left to right,
rather than the reverse for the matrix multiplication notation. The forward transform is shown below
with the standard input sequence {(1,1), (2,2), (3,3), (4,4), (0,0), (0,0), (0,0), (0,0)}.
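Input | Multiply | 1st Stage Butterfly | Multiply | 2nd Stage Butterfly | Multiply | 3rd Stage Butterfly | Shuffle
(1,1) | (1,1) | (1,1) | (1,1) | (4,4) | (4,4) | (10,10) | (10,10)
(2,2) | (2,2) | (2,2) | (2,2) | (6,6) | (6,6) | (27,27) | (25,2)
(3,3) | (3,3) | (3,3) | (3,3) | (27,27) | (27,27) | (3,3) | (3,3)
(4,4) | (4,4) | (4,4) | (4,4) | (27,27) | (5,5) | (22,22) | (7,15)
(0,0) | (0,0) | (1,1) | (1,1) | (8,8) | (8,8) | (25,2) | (27,27)
(0,0) | (0,0) | (2,2) | (2,2) | (21,21) | (17,23) | (20,14) | (20,14)
(0,0) | (0,0) | (3,3) | (7,7) | (23,23) | (23,23) | (7,15) | (22,22)
(0,0) | (0,0) | (4,4) | (19,19) | (12,12) | (13,21) | (10,2) | (10,2)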
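The output of the transform is multiplied with the transform of the convolving sequence:
{(1,0), (2,0), (3,0), (4,0), (0,0), (0,0), (0,0), (0,0)}.
$$
\begin{pmatrix} (10,10)\\ (25,2)\\ (3,3)\\ (7,15)\\ (27,27)\\ (20,14)\\ (22,22)\\ (10,2) \end{pmatrix}
\times
\begin{pmatrix} (10,0)\\ (8,23)\\ (3,0)\\ (23,21)\\ (27,0)\\ (8,6)\\ (22,0)\\ (23,8) \end{pmatrix}
\Rightarrow
\begin{pmatrix} (13,13)\\ (2,11)\\ (9,9)\\ (8,28)\\ (4,4)\\ (9,0)\\ (20,20)\\ (1,10) \end{pmatrix}
$$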
The result of the multiplication is now used as the input to the inverse transform:
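Input | Multiply | 1st Stage Butterfly | Multiply | 2nd Stage Butterfly | Multiply | 3rd Stage Butterfly | Shuffle
(13,13) | (13,13) | (17,17) | (17,17) | (17,17) | (17,17) | (8,8) | (8,8)
(2,11) | (2,11) | (11,11) | (11,11) | (20,20) | (20,20) | (26,26) | (3,3)
(9,9) | (9,9) | (0,0) | (0,0) | (17,17) | (17,17) | (22,22) | (22,22)
(8,28) | (8,28) | (9,9) | (9,9) | (2,2) | (5,5) | (12,12) | (15,15)
(4,4) | (4,4) | (9,9) | (9,9) | (25,25) | (25,25) | (3,3) | (26,26)
(9,0) | (9,0) | (22,11) | (22,11) | (25,27) | (7,7) | (18,18) | (18,18)
(20,20) | (20,20) | (18,18) | (16,16) | (22,22) | (22,22) | (15,15) | (12,12)
(1,10) | (1,10) | (7,18) | (3,16) | (19,24) | (22,22) | (0,0) | (0,0)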
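and the result is finally multiplied by $8^{-1} = 11$.
$$
\begin{pmatrix} (8,8)\\ (3,3)\\ (22,22)\\ (15,15)\\ (26,26)\\ (18,18)\\ (12,12)\\ (0,0) \end{pmatrix}
\times\ 8^{-1} \Rightarrow
\begin{pmatrix} (1,1)\\ (4,4)\\ (10,10)\\ (20,20)\\ (25,25)\\ (24,24)\\ (16,16)\\ (0,0) \end{pmatrix}
$$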
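A recursive radix-2 decimation-in-time transform over $GF(29^2)$ can be sketched as below. This is not a stage-by-stage copy of the flow graph of Fig 2 (the recursion orders its butterflies differently and needs no explicit shuffle), but it computes the same transform; all names are assumptions made for the illustration.

```python
P, G = 29, 2   # elements are a + b*sqrt(2) stored as (a, b)


def mul(u, v):
    (a, b), (c, d) = u, v
    return ((a * c + G * b * d) % P, (a * d + b * c) % P)


def add(u, v):
    return ((u[0] + v[0]) % P, (u[1] + v[1]) % P)


def sub(u, v):
    return ((u[0] - v[0]) % P, (u[1] - v[1]) % P)


def fntt(x, w):
    """Radix-2 decimation-in-time NTT; w must have multiplicative order len(x)."""
    n = len(x)
    if n == 1:
        return list(x)
    w2 = mul(w, w)                       # even power of w: zero second element
    even, odd = fntt(x[0::2], w2), fntt(x[1::2], w2)
    out, tw = [None] * n, (1, 0)
    for k in range(n // 2):
        t = mul(tw, odd[k])              # twiddle multiplication
        out[k] = add(even[k], t)         # butterfly, using w^(n/2) = -1
        out[k + n // 2] = sub(even[k], t)
        tw = mul(tw, w)
    return out


x = [(1, 1), (2, 2), (3, 3), (4, 4)] + [(0, 0)] * 4
print(fntt(x, (0, 8)))
```

Only the outermost level of the recursion multiplies by odd powers of the generator; the inner levels use its square, whose powers all have a zero second element.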
D. Quadratic Residue Rings
While we are on the subject of implementation simplicities, it will be useful to examine some work
by Nussbaumer (1976a) on Fermat Number Transforms. Very often it is required to be able to
convolve complex sequences, rather than the real sequences that have been the subject of our
discussion so far. We have already seen that extension fields, $GF(p^2)$, where p = 4K+3, can
support finite field equivalents of arithmetic operations over the complex numbers. The only
problem is that we have to perform these operations over the base field (the same way that complex
arithmetic is computed over the reals) and this involves some unfortunate complexities with regard
to the implementation of multiplication. Addition is component-wise, but multiplication involves a
'cross-coupling' of the parts of the 2-tuple defining the field elements, including the requirement for
four base field multiplications and 2 base field additions, as was discussed earlier in the definition
of second degree extension fields.
Nussbaumer discovered a simplification in this requirement based on what would initially appear to
be a wrong way of doing things. He considered the case of Fermat primes, where the prime is of the
form p = 4K + 1. This prime will not support a 'complex' extension field, since -1 is a quadratic
residue; we can, however, form a ring of 2-tuples, CR(Ft), where the ring operators take the form of
complex addition and complex multiplication on the elements of the 2-tuples, but all base operations
are carried out modulo Ft. What Nussbaumer did was to effectively define an extension ring (he did
not call it that) which is isomorphic to this complex ring, but where the operations of addition and
multiplication are both component-wise. We will define this Quadratic Residue ring as
$QR(F_t) = \{S : \oplus, \otimes\}$, where the set S consists of elements that are 2-tuples, $a = (a^\circ, a^*)$, $a \in S$, and the
operations, $\oplus$ and $\otimes$, are both component-wise. We can map the results of this ring to $CR(F_t)$ in
the following way:
Let $\alpha = \alpha_r + i\alpha_i$; $\alpha_r, \alpha_i \in GF(F_t)$, $\alpha \in CR(F_t)$, then we define the mapping
$a^\circ = \alpha_r \oplus_{F_t} (j \otimes_{F_t} \alpha_i)$ and $a^* = \alpha_r \oplus_{F_t} (-j \otimes_{F_t} \alpha_i)$, where $j = \sqrt{-1} = 2^{2^{t-1}}$. Clearly
$a^\circ, a^* \in GF(F_t)$. The inverse mapping can be found to be $\alpha_r = 2^{-1} \otimes (a^\circ \oplus a^*)$ and
$\alpha_i = 2^{-1} \otimes j^{-1} \otimes (a^\circ \oplus (-a^*))$ (we have dropped the subscript, $F_t$, for clarity). Let $b = (b^\circ, b^*)$
map to $\beta = \beta_r + i\beta_i$, apply point-wise addition and multiplication to a and b, and confirm the
mapping to $CR(F_t)$:
Addition
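$c = a \oplus b \triangleq (a^\circ \oplus b^\circ,\ a^* \oplus b^*)$
$c^\circ = a^\circ \oplus b^\circ = \alpha_r \oplus (j \otimes \alpha_i) \oplus \beta_r \oplus (j \otimes \beta_i) = \alpha_r \oplus \beta_r \oplus j \otimes (\alpha_i \oplus \beta_i)$
$c^* = a^* \oplus b^* = \alpha_r \oplus (-j \otimes \alpha_i) \oplus \beta_r \oplus (-j \otimes \beta_i) = \alpha_r \oplus \beta_r \oplus (-j) \otimes (\alpha_i \oplus \beta_i)$
Now apply the inverse mapping to the result over $CR(F_t)$, $\gamma = \gamma_r + i\gamma_i$:
$\gamma_r = 2^{-1} \otimes (c^\circ \oplus c^*) = 2^{-1} \otimes (\alpha_r \oplus \beta_r \oplus \alpha_r \oplus \beta_r) = \alpha_r \oplus \beta_r$
$\gamma_i = 2^{-1} \otimes j^{-1} \otimes (c^\circ \oplus (-c^*)) = 2^{-1} \otimes j^{-1} \otimes [\,j \otimes (\alpha_i \oplus \beta_i \oplus \alpha_i \oplus \beta_i)\,] = \alpha_i \oplus \beta_i$.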
This verifies the mapping for addition.
Multiplication
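$c = a \otimes b \triangleq (a^\circ \otimes b^\circ,\ a^* \otimes b^*)$
$c^\circ = a^\circ \otimes b^\circ = [\alpha_r \oplus (j \otimes \alpha_i)] \otimes [\beta_r \oplus (j \otimes \beta_i)] = [(\alpha_r \otimes \beta_r) \oplus -(\alpha_i \otimes \beta_i)] \oplus j[(\alpha_r \otimes \beta_i) \oplus (\alpha_i \otimes \beta_r)]$
$c^* = a^* \otimes b^* = [\alpha_r \oplus (-j \otimes \alpha_i)] \otimes [\beta_r \oplus (-j \otimes \beta_i)] = [(\alpha_r \otimes \beta_r) \oplus -(\alpha_i \otimes \beta_i)] \oplus -j[(\alpha_r \otimes \beta_i) \oplus (\alpha_i \otimes \beta_r)]$
Now apply the inverse mapping to the result over $CR(F_t)$, $\gamma = \gamma_r + i\gamma_i$:
$\gamma_r = 2^{-1} \otimes (c^\circ \oplus c^*) = 2^{-1} \otimes 2 \otimes [(\alpha_r \otimes \beta_r) \oplus -(\alpha_i \otimes \beta_i)] = [(\alpha_r \otimes \beta_r) \oplus -(\alpha_i \otimes \beta_i)]$
$\gamma_i = 2^{-1} \otimes j^{-1} \otimes (c^\circ \oplus (-c^*)) = 2^{-1} \otimes j^{-1} \otimes 2 \otimes j \otimes [(\alpha_r \otimes \beta_i) \oplus (\alpha_i \otimes \beta_r)] = [(\alpha_r \otimes \beta_i) \oplus (\alpha_i \otimes \beta_r)]$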
This verifies the mapping for multiplication.
For notational convenience we will refer to the $a^\circ$ term as the normal element and the $a^*$ term as the
conjugate element.
It might be useful to reinforce the idea behind this quadratic residue mapping with a simple scalar
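example, using GF(17). The parameters to be used are $j = 2^2 = 4$; $j^{-1} = 13$; $2^{-1} = 9$.
Let $\alpha = (13, 5)$, $\beta = (9, 11)$. Computing over the ring, CR(17), yields:
$\alpha \oplus \beta = (13 \oplus 9,\ 5 \oplus 11) = (5, 16)$ and
$\alpha \otimes \beta = [(13 \otimes 9) \oplus -(5 \otimes 11),\ (13 \otimes 11) \oplus (5 \otimes 9)] = (11, 1)$.
Now let us compare performing this calculation using the QR mapping:
$\alpha = (13, 5) \overset{QR(17)}{\Rightarrow} a = (13 \oplus (4 \otimes 5),\ 13 \oplus -(4 \otimes 5)) = (16, 10)$
$\beta = (9, 11) \overset{QR(17)}{\Rightarrow} b = (9 \oplus (4 \otimes 11),\ 9 \oplus -(4 \otimes 11)) = (2, 16)$
Now we compute the addition and multiplication over QR(17), component-wise:
$a \oplus b = (16 \oplus 2,\ 10 \oplus 16) = (1, 9)$; $a \otimes b = (16 \otimes 2,\ 10 \otimes 16) = (15, 7)$
Finally we map back to CR(17):
$(1, 9) \overset{CR(17)}{\Rightarrow} [9 \otimes (1 \oplus 9),\ 9 \otimes 13 \otimes (1 \oplus -9)] = (5, 16)$
$(15, 7) \overset{CR(17)}{\Rightarrow} [9 \otimes (15 \oplus 7),\ 9 \otimes 13 \otimes (15 \oplus -7)] = (11, 1)$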
and we see that we obtain the same result as the direct complex calculation over CR(17).
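The three steps of the scalar example (map to the quadratic residue ring, operate component-wise, map back) can be sketched as follows. The function names are hypothetical; the constants are those of the GF(17) example, and `pow(x, -1, P)` requires Python 3.8 or later.

```python
P, J = 17, 4                              # J*J = 16 = -1 (mod 17)
INV2, INVJ = pow(2, -1, P), pow(J, -1, P)  # 9 and 13


def to_qr(z):
    """Map (real, imag) in CR(17) to (normal, conjugate) in QR(17)."""
    re, im = z
    return ((re + J * im) % P, (re - J * im) % P)


def from_qr(a):
    """Inverse mapping from (normal, conjugate) back to (real, imag)."""
    n, c = a
    return ((INV2 * (n + c)) % P, (INV2 * INVJ * (n - c)) % P)


def qr_add(a, b):
    return ((a[0] + b[0]) % P, (a[1] + b[1]) % P)


def qr_mul(a, b):
    return ((a[0] * b[0]) % P, (a[1] * b[1]) % P)


a, b = to_qr((13, 5)), to_qr((9, 11))
print(a, b)                        # (16, 10) (2, 16)
print(from_qr(qr_add(a, b)))       # (5, 16)
print(from_qr(qr_mul(a, b)))       # (11, 1)
```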
The method seems involved, in that there are 3 distinct steps. Note, however, that if the
middle step (the component-wise calculation) contains many multiplications and additions, the
overhead associated with steps 1 and 3 can be relatively small. We therefore concentrate on the
features of step 2, bearing this assumption in mind. There are two main features to the calculation
over QR(Ft):
1) The multiplication operation requires 2 base field multiplications compared to 4 base field
multiplications and 2 base field additions for the calculation over CR(Ft).
2) The calculations can be carried out without any interaction between the normal and
conjugate channels.
The first feature has advantages in minimizing implementation hardware, the second feature
provides advantages in testing and fault tolerance of a complex sequence processor.
A more comprehensive example is probably in order at this point. We will consider the complex
convolution of two 4-point sequences over GF(17). The sequences and their convolution sum are
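shown in equation (16) below, where the symbol, $\circledast$, represents the cyclic convolution operator:
$$
\begin{pmatrix} 15 & 4\\ 15 & 8\\ 2 & 8\\ 2 & 4 \end{pmatrix}
=
\begin{pmatrix} 1 & 3\\ 2 & 2\\ 3 & 1\\ 0 & 0 \end{pmatrix}
\circledast
\begin{pmatrix} 1 & 1\\ 1 & 1\\ 0 & 0\\ 0 & 0 \end{pmatrix}
\qquad (16)
$$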
We will now consider the computation of this convolution using the FNTT defined over GF(17).
We can compute this by transforming each real and imaginary sequence of both inputs separately.
We then multiply these transformed sequences, using the rules of complex arithmetic, and invert the
real and imaginary sequences of this result separately. The inverted two sequences will be the real
and imaginary components of the convolution sum (Nussbaumer (1976a)).
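The transform of the real component of the first input sequence is shown in equation (17); $\odot_{17}$
represents matrix multiplication over GF(17).
$$
\begin{pmatrix} 6\\ 7\\ 2\\ 6 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 1 & 1\\ 1 & 13 & 16 & 4\\ 1 & 16 & 1 & 16\\ 1 & 4 & 16 & 13 \end{pmatrix}
\odot_{17}
\begin{pmatrix} 1\\ 2\\ 3\\ 0 \end{pmatrix}
\qquad (17)
$$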
The transformed sequences and their multiplication, using the rules of complex arithmetic, are
shown in equation (18).
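$$
\begin{pmatrix} 0 & 7\\ 12 & 14\\ 0 & 0\\ 14 & 12 \end{pmatrix}
=
\begin{pmatrix} 6 & 6\\ 7 & 11\\ 2 & 2\\ 6 & 10 \end{pmatrix}
\otimes_{17}
\begin{pmatrix} 2 & 2\\ 14 & 14\\ 0 & 0\\ 5 & 5 \end{pmatrix}
\qquad (18)
$$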
We now invert the real and imaginary parts of the result as separate sequences. The inverse
transform for the real component is shown in equation (19).
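$$
\begin{pmatrix} 9\\ 9\\ 8\\ 8 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 1 & 1\\ 1 & 4 & 16 & 13\\ 1 & 16 & 1 & 16\\ 1 & 13 & 16 & 4 \end{pmatrix}
\odot_{17}
\begin{pmatrix} 0\\ 12\\ 0\\ 14 \end{pmatrix}
\qquad (19)
$$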
The final sequence is the combination of the inverse of the real and imaginary components of the
transform domain multiplication (shown in equation (20) with the final multiplication by $4^{-1} = 13$).
$$
\begin{pmatrix} 15 & 4\\ 15 & 8\\ 2 & 8\\ 2 & 4 \end{pmatrix}
= 4^{-1} \otimes_{17}
\begin{pmatrix} 9 & 16\\ 9 & 15\\ 8 & 15\\ 8 & 16 \end{pmatrix}
\qquad (20)
$$
The final result agrees with the original convolution sum in equation (16).
Now we have to compute the convolution using the mapping to QR(17). First we perform the
mapping, as shown below:
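$$
\begin{pmatrix} 1 & 3\\ 2 & 2\\ 3 & 1\\ 0 & 0 \end{pmatrix}
\overset{QR(17)}{\Rightarrow}
\begin{pmatrix} 13 & 6\\ 10 & 11\\ 7 & 16\\ 0 & 0 \end{pmatrix}
\qquad
\begin{pmatrix} 1 & 1\\ 1 & 1\\ 0 & 0\\ 0 & 0 \end{pmatrix}
\overset{QR(17)}{\Rightarrow}
\begin{pmatrix} 5 & 14\\ 5 & 14\\ 0 & 0\\ 0 & 0 \end{pmatrix}
$$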
Now we compute the convolution for the normal and conjugate sequences separately. The
computation for the normal sequence is shown below:
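$$
\begin{pmatrix} 13\\ 10\\ 7\\ 0 \end{pmatrix}
\overset{FNTT(17)}{\Rightarrow}
\begin{pmatrix} 13\\ 0\\ 10\\ 12 \end{pmatrix}
\qquad
\begin{pmatrix} 5\\ 5\\ 0\\ 0 \end{pmatrix}
\overset{FNTT(17)}{\Rightarrow}
\begin{pmatrix} 10\\ 2\\ 0\\ 8 \end{pmatrix}
$$
$$
\begin{pmatrix} 11\\ 0\\ 0\\ 11 \end{pmatrix}
=
\begin{pmatrix} 13\\ 0\\ 10\\ 12 \end{pmatrix}
\otimes_{17}
\begin{pmatrix} 10\\ 2\\ 0\\ 8 \end{pmatrix}
\qquad
\begin{pmatrix} 11\\ 0\\ 0\\ 11 \end{pmatrix}
\overset{IFNTT(17)}{\Rightarrow}
\begin{pmatrix} 14\\ 13\\ 0\\ 1 \end{pmatrix}
$$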
The computation for the conjugate sequence is shown below:
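$$
\begin{pmatrix} 6\\ 11\\ 16\\ 0 \end{pmatrix}
\overset{FNTT(17)}{\Rightarrow}
\begin{pmatrix} 16\\ 14\\ 11\\ 0 \end{pmatrix}
\qquad
\begin{pmatrix} 14\\ 14\\ 0\\ 0 \end{pmatrix}
\overset{FNTT(17)}{\Rightarrow}
\begin{pmatrix} 11\\ 9\\ 0\\ 2 \end{pmatrix}
$$
$$
\begin{pmatrix} 6\\ 7\\ 0\\ 0 \end{pmatrix}
=
\begin{pmatrix} 16\\ 14\\ 11\\ 0 \end{pmatrix}
\otimes_{17}
\begin{pmatrix} 11\\ 9\\ 0\\ 2 \end{pmatrix}
\qquad
\begin{pmatrix} 6\\ 7\\ 0\\ 0 \end{pmatrix}
\overset{IFNTT(17)}{\Rightarrow}
\begin{pmatrix} 16\\ 0\\ 4\\ 3 \end{pmatrix}
$$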
Now all it remains for us to do is to compute the inverse mapping from the quadratic residue ring:
$$
\begin{pmatrix} 14 & 16\\ 13 & 0\\ 0 & 4\\ 1 & 3 \end{pmatrix}
\overset{CR(17)}{\Rightarrow}
\begin{pmatrix} 15 & 4\\ 15 & 8\\ 2 & 8\\ 2 & 4 \end{pmatrix}
$$
The result matches that obtained directly and using the complex ring transform method. Note the
complete independence of the normal and conjugate calculations once the mapping to the quadratic
ring is performed. We are able to compute the forward transforms, transform domain multiplication
and inverse transform without invoking any interaction between normal and conjugate channels.
Although the quadratic residue ring has been built from Galois Fields based on Fermat numbers,
the only requirement that such a ring exist is that the base field have a prime modulus of the form
p = 4K+1, since we have already discovered that such a field contains a quadratic residue for -1.
The ramifications of computations over quadratic rings go beyond the simplification of
convolution via Fermat NTT's, and it seems that Nussbaumer did not realize the importance of this
finding at the time. We will return to quadratic rings later.
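The independence of the normal and conjugate channels can be sketched as below for the 4-point example. For brevity each channel uses a direct cyclic convolution over GF(17) in place of the FNTT, point-wise product and inverse FNTT; the end result of each channel is the same. All function names are assumptions made for the illustration.

```python
P, J = 17, 4
INV2, INVJ = pow(2, -1, P), pow(J, -1, P)


def to_qr(z):
    """(real, imag) -> (normal, conjugate)."""
    return ((z[0] + J * z[1]) % P, (z[0] - J * z[1]) % P)


def from_qr(a):
    """(normal, conjugate) -> (real, imag)."""
    return ((INV2 * (a[0] + a[1])) % P, (INV2 * INVJ * (a[0] - a[1])) % P)


def cyclic(x, h):
    """Direct cyclic convolution of two scalar sequences over GF(17)."""
    N = len(x)
    return [sum(x[q] * h[(n - q) % N] for q in range(N)) % P for n in range(N)]


x = [(1, 3), (2, 2), (3, 1), (0, 0)]
h = [(1, 1), (1, 1), (0, 0), (0, 0)]
xq, hq = [to_qr(z) for z in x], [to_qr(z) for z in h]

# The two channels never interact until the final inverse mapping.
normal = cyclic([a[0] for a in xq], [b[0] for b in hq])
conjugate = cyclic([a[1] for a in xq], [b[1] for b in hq])
print(normal, conjugate)
print([from_qr(c) for c in zip(normal, conjugate)])
```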
E. Multi-Dimensional Mapping
Another technique used to remove the tight coupling of transform length with the algebraic
properties of the number of elements in the field, is that of multi-dimensional mapping. This was
first pointed out by Rader (1972b) and expanded upon by Agarwal and Burrus (1974). The basic
concept is that a one-dimensional convolution can always be implemented by rewriting the one-
dimensional input sequence as a multi-dimensional sequence. The convolution can then be
indirectly computed via multi-dimensional transforms which, in turn, can be computed as a series of
short one-dimensional transforms. The final step is a mapping back to the one-dimensional output
sequence. As an example for a two-dimensional mapping of an original one-dimensional sequence,
consider the cyclic convolution of equation (21):
$$
y_n = \sum_{q=0}^{N-1} x_q \, h_{[n \oplus_N (-q)]} \qquad (21)
$$
We assume that we can write N = L.M where L and M are integers. A change of variables can now
be made:
$$
\left.
\begin{aligned}
n &= l + mL \\
q &= k + pL
\end{aligned}
\right\}
\qquad k, l = 0, 1, \ldots, L-1; \quad p, m = 0, 1, \ldots, M-1
$$
and the convolution now becomes:
$$
y_{l+mL} = \sum_{k=0}^{L-1} \sum_{p=0}^{M-1} x_{k+pL} \, h_{[l \oplus mL \oplus (-k) \oplus (-pL)]} \qquad (22)
$$
where we have dropped the subscript on the modulo N addition operator. Let us now define two-
dimensional arrays for y, x and h. We will keep the same notation as used by Agarwal and Burrus
(1974). Thus:
$$
\hat{X}_{l,m} = x_{l+mL}; \qquad \hat{H}_{l,m} = h_{l+mL}; \qquad \hat{Y}_{l,m} = y_{l+mL} \qquad (23)
$$
and the convolution can be written as equation (24):
$$
\hat{Y}_{l,m} = \sum_{k=0}^{L-1} \sum_{p=0}^{M-1} \hat{H}_{l \oplus (-k),\, m \oplus (-p)} \, \hat{X}_{k,p} \qquad (24)
$$
This is a two-dimensional cyclic convolution, and we can compute it indirectly using two-
dimensional NTT's.
Two-dimensional NTT's can be calculated using one-dimensional NTT's along the rows and then
along the columns of the intermediate results. Clearly two-dimensional convolution is a sort of
overlay of column-wise, followed by row-wise (or vice-versa) one-dimensional cyclic convolution.
If we examine the decomposition of the original one-dimensional sequence, we find that increasing
values of the m-index (row index) define a sampling of the original signal by a reduction factor of
L and thus preserve the cyclic nature of the sequence (this new sequence has period M rather than
N). Increasing values of the l-index (column index) are contiguous samples of only a segment of
the original sequence. Thus, although cyclic convolution will work for the rows it will not work for
the columns, since this sequence is not a periodic sub-sampling of the original signal. We must
therefore compute aperiodic convolution along the columns, and this means invoking one of the two
techniques (overlap-add or overlap-save; see Gold and Rader (1969), chapter by T. Stockham, Jr.)
available for computing aperiodic convolution from cyclic convolution. Another way of looking at
the problem is to consider that although the indices of the 2-D sequences are computed over finite
rings, R(L) and R(M), the formation of these rings from the original index sequence was over
R(N).
The overlap-save technique involves appending at least (L-1) zero samples to the original column
sequences of the $\hat{X}$ array; the $\hat{H}$ array is augmented by the periodic extension of the original {h}
sequence, as indicated in the index mapping of equation (23). The final result will have L correct
values and L-1 incorrect values per column (Agarwal and Burrus (1974)). Normally, in order to
compute fast convolution, we will need to append L zeros to the $\hat{X}$ columns (rather than L-1
zeros), requiring a total 2-dimensional array of 2L x M. Two of the rows of the final result will be
found to be dependent (except for a cyclic shift) because of this redundancy of one extra row added
to the 2-D arrays.
The 2-D NTT is defined in equation (25), where the Nth order generator ($\alpha$) has been replaced by an
Mth order generator ($\alpha^{L}$) and a 2Lth order generator ($\alpha^{M/2}$).
$$
\hat{\hat{X}}_{l,m} = \sum_{k=0}^{2L-1} \sum_{p=0}^{M-1} \hat{X}_{k,p} \, \alpha^{\left(\frac{M}{2}kl + Lpm\right)} \qquad (25)
$$
Consider taking 1-D transforms along the columns (p index) of the input, $\hat{X}_{k,p}$.
$$
\hat{X}^{[k]}_{m} = \sum_{p=0}^{M-1} \hat{X}_{k,p} \, \alpha^{Lpm}
$$
The index [k] corresponds to the column of the 1-D transform result. We can now compute the 2-D
NTT by taking 1-D transforms along the rows (k index) of this modified intermediate result, as
shown in equation (26).
$$
\hat{\hat{X}}_{l,m} = \sum_{k=0}^{2L-1} \hat{X}^{[k]}_{m} \, \alpha^{\frac{M}{2}kl} \qquad (26)
$$
By computing 2-D transforms $\hat{\hat{X}}_{l,m}$ and $\hat{\hat{H}}_{l,m}$, we can form the 2-D cyclic convolution by
multiplication in the transform domain, followed by inverse transformation.
$$
\hat{Y}_{k,p} = (2N)^{-1} \sum_{l=0}^{2L-1} \sum_{m=0}^{M-1} \hat{\hat{X}}_{l,m} \, \hat{\hat{H}}_{l,m} \, \alpha^{-\left(\frac{M}{2}lk + Lmp\right)} \qquad (27)
$$
It now remains to unscramble the resulting 2-D sequence back into a 1-D sequence; this result is
the required 1-D cyclic convolution. The entire process appears quite involved, but it allows the use
of small length transforms to implement much longer length convolutions. As before, we will take
an example to illustrate the procedure, and as before we will consider the cyclic convolution of the
sequence {1, 2, 3, 4, 0, 0, 0, 0} with itself.
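The whole procedure can be sketched in a few lines of Python. This is a minimal sketch, assuming the parameters of the GF(29) example worked below (L = 2, M = 4, the element 12 of order 4 used for both index directions, and a final scaling by the inverse of 2N = 16). The function names and the array layout (natural index order, with the valid outputs in the bottom L rows) are choices made for this sketch and need not match the ordering of the printed arrays.

```python
# Sketch only: 1-D cyclic convolution via a 2-D mapping and 2-D NTTs over GF(p).
P = 29        # field modulus of the worked example
L, M = 2, 4   # N = L * M = 8
ROOT = 12     # element of order 4 in GF(29): 12**2 = 144 = -1 (mod 29)


def ntt(vec, root, p):
    """Direct O(n^2) NTT; root must have order len(vec) in GF(p)."""
    n = len(vec)
    return [sum(v * pow(root, i * k, p) for k, v in enumerate(vec)) % p
            for i in range(n)]


def ntt2(arr, root_l, root_m, p):
    """2-D NTT: 1-D NTTs along the m index, then along the l index."""
    rows = [ntt(r, root_m, p) for r in arr]
    cols = [ntt(list(c), root_l, p) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def conv_via_2d_map(x, h, p=P, l_len=L, m_len=M, root_l=ROOT, root_m=ROOT):
    """Length l_len*m_len cyclic convolution of x and h, modulo p."""
    n = l_len * m_len
    # X array: x[k + q*L], with L rows of zeros appended (2L x M).
    x_hat = [[x[k + q * l_len] for q in range(m_len)] for k in range(l_len)]
    x_hat += [[0] * m_len for _ in range(l_len)]
    # H array: periodic extension of h, row d holds h[(d - L + q*L) mod N].
    h_hat = [[h[(d - l_len + q * l_len) % n] for q in range(m_len)]
             for d in range(2 * l_len)]
    xt = ntt2(x_hat, root_l, root_m, p)
    ht = ntt2(h_hat, root_l, root_m, p)
    prod = [[(a * b) % p for a, b in zip(ra, rb)] for ra, rb in zip(xt, ht)]
    # Inverse 2-D NTT: inverse roots, then scale by (2N)^-1 (Python 3.8+).
    inv = ntt2(prod, pow(root_l, -1, p), pow(root_m, -1, p), p)
    scale = pow(2 * n, -1, p)
    y_hat = [[(scale * v) % p for v in row] for row in inv]
    # Unravel the bottom L rows: y[l + m*L] = y_hat[L + l][m].
    return [y_hat[l_len + i % l_len][i // l_len] for i in range(n)]


x = [1, 2, 3, 4, 0, 0, 0, 0]
print(conv_via_2d_map(x, x))  # [1, 4, 10, 20, 25, 24, 16, 0]
```

The direct O(n^2) transform is used only to keep the sketch short; any fast length-4 NTT could be substituted without changing the result.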
In order to demonstrate the ability of this technique to allow longer length convolutions than
possible with a direct procedure, we will consider the convolution over GF(29). We have already
determined that we can only achieve a direct one-dimensional length 8 transform using a second
degree extension field, GF($29^2$). We will decompose the length as N = 8 = 4 x 2. If we let L = 2 and
M = 4, then we can compute the convolution using 4 x 4 arrays, with a series of length 4 one-dimensional
convolutions. The parameters for the example are $\alpha^{L} = \alpha^{M/2} = \alpha^{2} = 12$ ($\alpha = \sqrt{12}$
does not exist in GF(29)), and $(2N)^{-1} = 16^{-1} = 20$. The sequence {1, 2, 3, 4, 0, 0, 0, 0} is mapped
into the $\hat{H}$ and $\hat{X}$ arrays as below:
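$$
\hat{H} = \begin{pmatrix} 0 & 3 & 1 & 0 \\ 0 & 4 & 2 & 0 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 4 & 2 \end{pmatrix},
\qquad
\hat{X} = \begin{pmatrix} 3 & 1 & 0 & 0 \\ 4 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}
$$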
Note the cyclic extension of the $\hat{H}$ array and the 2 rows of zeros appended to the $\hat{X}$ array. The
next step is to compute the two-dimensional NTT of each array. We first show the intermediate step
after taking 1-dimensional NTT's of each column in the two arrays:
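$$
\left\{ \hat{H}^{[k]}_{m} \right\} = \begin{pmatrix} 0 & 7 & 10 & 3 \\ 0 & 22 & 3 & 4 \\ 0 & 28 & 27 & 28 \\ 0 & 13 & 22 & 23 \end{pmatrix},
\qquad
\left\{ \hat{X}^{[k]}_{m} \right\} = \begin{pmatrix} 7 & 3 & 0 & 0 \\ 22 & 25 & 0 & 0 \\ 28 & 28 & 0 & 0 \\ 13 & 6 & 0 & 0 \end{pmatrix}
$$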
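Followed by 1-dimensional NTT's of the rows:
$$
\hat{\hat{H}} = \begin{pmatrix} 20 & 9 & 0 & 0 \\ 0 & 10 & 6 & 13 \\ 25 & 2 & 0 & 2 \\ 0 & 3 & 15 & 11 \end{pmatrix},
\qquad
\hat{\hat{X}} = \begin{pmatrix} 10 & 14 & 4 & 0 \\ 18 & 3 & 26 & 12 \\ 27 & 16 & 0 & 11 \\ 19 & 27 & 7 & 28 \end{pmatrix}
$$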
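The next step is to form the product, $\hat{\hat{H}} \otimes_{29} \hat{\hat{X}}$, and invert the result. This is shown below:
$$
\hat{\hat{H}} \otimes_{29} \hat{\hat{X}} = \begin{pmatrix} 26 & 10 & 0 & 0 \\ 0 & 1 & 11 & 11 \\ 8 & 3 & 0 & 22 \\ 0 & 23 & 18 & 18 \end{pmatrix}
$$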
and the inverse transform is computed by 1-D transforms on the columns and rows with a final
multiplication by $16^{-1}$. The final inverse transform with multiplication is shown below:
$$
\hat{Y} = 16^{-1} \otimes_{29} \begin{pmatrix} 6 & 11 & 4 & 28 \\ 0 & 6 & 1 & 7 \\ 16 & 15 & 23 & 24 \\ 6 & 1 & 7 & 0 \end{pmatrix}
= \begin{pmatrix} 4 & 17 & 22 & 9 \\ 0 & 4 & 20 & 24 \\ 1 & 10 & 25 & 16 \\ 4 & 20 & 24 & 0 \end{pmatrix}
$$
We now unravel the bottom two rows of the result array to find the output sequence:
{1, 4, 10, 20, 25, 24, 16, 0}.

F. Extension of dynamic range

Table XI Primes (10 bits or less) and transform lengths

    p      2^q
    97     32
    193    64
    257    256
    449    64
    577    64
    641    128
    673    32
    769    256

An inverse problem to that discussed above is the extension of dynamic range for a given transform
length. This seems trivial, in that we can probably find a large transform length for a large dynamic
range, and simply reduce the transform length by using powers of the multiplicative group
generator. That, however, is not the exercise we wish to perform here. Suppose that we find small
(10 bits or less) primes that have desirable properties for the construction of fields in which
reasonable power-of-two transform lengths are possible, and our goal is to produce a transform of
such a length. For example, primes, p, which have the property $2^q \mid (p-1)$, will allow transform lengths of
$2^q$. Table XI shows some small primes and their associated transform length, $2^q$. The transform
lengths shown are quite 'respectable' in terms of usefulness in DSP, but the dynamic range afforded
by the size of the prime is probably inadequate. Can we 'stick' several of these transforms together
to form a larger dynamic range? The answer is a resounding yes, and we will show, in the following
sections, that this approach has much larger ramifications than one would initially imagine.
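The entries of Table XI can be reproduced with a short Python sketch. The helper names are hypothetical; the search for an element of order 2^q is a simple brute-force method chosen for brevity, not a method taken from the text.

```python
# Sketch only: power-of-two transform lengths supported by small primes.
TABLE_XI = {97: 32, 193: 64, 257: 256, 449: 64,
            577: 64, 641: 128, 673: 32, 769: 256}


def max_pow2_length(p):
    """Largest power of two dividing p - 1."""
    n = p - 1
    return n & -n  # isolates the lowest set bit of p - 1


def root_of_order(p, length):
    """Find an element of exact order `length` (a power of two >= 2) in GF(p)."""
    for g in range(2, p):
        w = pow(g, (p - 1) // length, p)   # w**length = 1 by Fermat's theorem
        if pow(w, length // 2, p) == p - 1:  # exact order iff w**(length/2) = -1
            return w
    raise ValueError("no element of the requested order")


for p, length in TABLE_XI.items():
    print(p, max_pow2_length(p), root_of_order(p, length))
```

Each printed line gives the prime, its largest power-of-two transform length, and one generator usable for a transform of that length.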
G. Binary Implementations
The interesting fact is that, aside from the flurry of academic and general theoretical interest, all of
these efforts have led to very little in the way of special purpose hardware being built to implement
these transforms. It appears that most of the implementations are on general purpose computers
(software implementations) with dubious usefulness. The only hardware approach, using a single
base field, has been discussed by McClellan (1976) using a Fermat Number Transform. A special
coding scheme was used to implement the modulo $F_t$ computation. An interesting modification to
binary arithmetic was also proposed by Leibowitz (1976). Both of these approaches have important
ramifications for the efficient binary implementation of Fermat Number Transform operations. It is
appropriate to briefly discuss these techniques here, for contrast with the non-binary
implementations to be discussed later.
1. McClellan Approach
The representation of Fermat numbers in the binary representation is only 50% efficient, since only
one of the field¹ elements requires the most significant bit to be one. In the obvious mapping, this
element is the largest in the field. The coding scheme introduced by McClellan maps this isolated
case to zero. The full mapping is described below.
The representation of an element, B, in GF($F_t$) is by t+1 bits $\{b_t, b_{t-1}, \ldots, b_0\}$ with the following
mapping:
1 If $b_t = 1$, then B = 0.
2 If $b_t = 0$, then
$$
B = \sum_{i=1}^{t} s_{t-i} \, 2^{t-i}, \qquad \text{where} \quad
s_j = \begin{cases} 1 & \text{if } b_j = 1 \\ -1 & \text{if } b_j = 0 \end{cases}
$$
Using the example given by McClellan for $F_4$:
10000 represents zero
01010 represents $2^3 - 2^2 + 2 - 1 = 5$
00011 represents $-2^3 - 2^2 + 2 + 1 = -9$ ($\equiv 8 \bmod 17$)
Over GF($F_t$) this representation is valid for all elements. The representation of zero as a special case
($b_t = 1$) allows simple hardware for arithmetic with zero as one of the elements. General addition
involves an ordinary binary addition followed by an adjustment based on the state of the output
carry. This is the same complexity as 1's complement binary arithmetic. Multiplication by powers
of 2 (required in the formation of the forward or inverse FNT) is a simple cyclic shift with the
addition of a logical inversion as bits are fed back into the least significant position.
¹ We will assume that the Fermat Number chosen is prime.
We see that ordinary binary arithmetic elements can be used, with slight modifications to the
circuitry. General multiplication, of course, retains the same complexity that it has for binary
arithmetic. We note that all general arithmetic operations are performed with only t bits rather than
the t+1 bits of the full representation. McClellan's hardware only computed the transform and not
the transform domain multiplication.
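The McClellan coding and the shift-with-inversion rule for multiplication by 2 can be illustrated with a small Python sketch for t = 4 (modulus 17). The function names and the use of bit strings (most significant bit first) are assumptions made for this sketch; it models the arithmetic only, not McClellan's hardware.

```python
# Sketch only: McClellan's +/-1 weighted coding over GF(2**t + 1), with t = 4.
T = 4
MODULUS = 2**T + 1  # 17


def mcclellan_decode(bits, t=T):
    """Decode a (t+1)-character bit string, written b_t first."""
    if len(bits) != t + 1:
        raise ValueError("expected t + 1 bits")
    if bits[0] == "1":          # b_t = 1 is the special code for zero
        return 0
    value = 0
    for j in range(t):          # bit b_j sits at string position t - j
        sign = 1 if bits[t - j] == "1" else -1
        value += sign * 2**j
    return value % (2**t + 1)


def mcclellan_times_two(bits):
    """Multiply by 2: cyclic left shift, inverting the bit fed back to b_0."""
    if bits[0] == "1":
        return bits             # zero stays zero
    body = bits[1:]             # b_{t-1} ... b_0
    fed_back = "0" if body[0] == "1" else "1"
    return "0" + body[1:] + fed_back


print(mcclellan_decode("10000"))                       # 0
print(mcclellan_decode("01010"))                       # 5
print(mcclellan_decode("00011"))                       # 8
print(mcclellan_decode(mcclellan_times_two("01010")))  # 10
```

The inversion on feedback works because $2^t \equiv -1$ modulo $2^t + 1$, so the weight that wraps around changes sign.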
2. Leibowitz Approach
Leibowitz introduced a modification that reduced the complexity of the arithmetic hardware. This
diminished-1 representation involves a simple translation by one from the normal binary
representation. This contrasts with McClellan's use of symmetrical weightings {-1, 1}. In the
diminished-1 representation the element represented is simply one more than the binary code, thus the codes
{0, 1, 2, ..., $2^t$} represent the elements {1, 2, 3, ..., 0}; note that this mapping places 0 as the only element that requires $b_t = 1$,
and so corresponds with McClellan's mapping of that element. The other elements are, in general,
mapped to different positions than McClellan's mapping. Leibowitz demonstrates that simpler
implementations are possible, compared to McClellan's mapping, and also that general
multiplication can be carried out as a simple modification of a binary multiplication.
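A minimal Python sketch of the diminished-1 coding for t = 4 (modulus 17) follows. The function names are hypothetical. The addition rule used here (add the t-bit codes, then add the complement of the carry) is the commonly quoted diminished-1 rule and is stated as an assumption, since the text does not spell it out.

```python
# Sketch only: diminished-1 coding over GF(2**t + 1), with t = 4.
T = 4
MODULUS = 2**T + 1   # 17
ZERO_CODE = 1 << T   # the only code with b_t = 1


def to_dim1(value):
    """Element -> code: the code is the element minus one (0 maps to 2**t)."""
    return (value - 1) % MODULUS


def from_dim1(code):
    """Code -> element: the element is the code plus one."""
    return (code + 1) % MODULUS


def add_dim1(a, b):
    """Add two diminished-1 codes (assumed rule: add, then add NOT carry)."""
    if a == ZERO_CODE:
        return b
    if b == ZERO_CODE:
        return a
    s = a + b
    carry = s >> T
    return (s & (ZERO_CODE - 1)) + (1 - carry)


print([from_dim1(c) for c in range(ZERO_CODE + 1)])  # 1, 2, ..., 16, 0
print(from_dim1(add_dim1(to_dim1(9), to_dim1(12))))  # 4, since 21 mod 17 = 4
```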
We see that general multiplication is more complicated than binary multiplication (where the
majority of computational complexity is centered) and that these representations only work for
fields based on Fermat primes.
We will now introduce an alternative to the single modulus NTT that has been used to great effect
in the implementation of convolution hardware, and which opens the door to some very interesting
VLSI implementations; this alternative is the Residue Number System (RNS). Rather than finding
special large moduli that allow modified binary arithmetic hardware to be used, we will restrict the
modulus to be small enough that look-up tables can be used to implement the arithmetic (these
look-up tables are basically unminimized truth tables). We are no longer restricted as to the form of
the modulus, just the size for practical hardware solutions. We can remove the modulus size
problem by several parallel computations over a direct sum ring. We will start by digressing,
somewhat, into the theory of the RNS; this knowledge will then be applied to the implementation of
flexible NTTs.
VI. References
Abramowitz, M. and Stegun, I.A. Eds. (1968). "Handbook of Mathematical Functions with
Formulas, Graphs and Mathematical Tables." National Bureau of Standards, Applied
Mathematics Series, 55, 7th printing.
Agarwal R.C. and Burrus, C. S. (1974). "Fast convolution using Fermat number transforms with
applications to digital filtering." IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-
22, no. 2, April, pp. 87-97.
Baraniecka, A.Z. (1980). "Digital Filtering using Number Theoretic Techniques". Ph. D.
dissertation, University of Windsor, Windsor, Ontario, Canada.
Baraniecka, A.Z. and Jullien, G.A. (1980). "Residue Number System Implementations of Number
Theoretic Transforms in Complex Residue Rings." IEEE Trans. on Acoustics, Speech, and
Signal Processing, Vol. ASSP-28, No.3, June, pp.285-291.
Baugh, R.A. and Day, E.C. (1961). "Electronic sign evaluation for residue number systems." Tech.
Rep. No. TR-60-597-32, RCA, Camden, HJ, and Burlington, MA, Jan.
Bayoumi, M.A., Jullien, G.A., Miller, W.C. (1987). "A VLSI Implementation of Residue Adders",
IEEE Trans. on Circuits and Systems, Vol. CAS-34, No.3.
Blahut, R.E. (1985). "Fast Algorithms for Digital Signal Processing." Addison-Wesley Publishing
Co., Inc., Reading, MA.
Burton, D.M. (1970). "A First Course in Rings and Ideals." Addison-Wesley Publishing Co., Inc.,
Carhoun, D.O., Johnson, B.L., Redinbo, G.R. (1983). “A Synthesis Algorithm for Recursive Finite
Field FIR Digital Filters.” Proc. of the 1983 Int. Symp. on Circuits and Systems, Vol. 2, pp.
689-693.
Cheney, P.W. (1961). "A Digital Correlator Based on the Residue Number System." IRE
Transactions on Electronic Computers, vol. EC-11, March, pp. 63-70.
Cooley, J.W. and Tukey, J.W. (1965). “An Algorithm for the Machine Computation of Complex
Fourier Transforms.”, Mathematics of Computation, April, pp. 297-301.
Dillinger, T. E. (1988). “VLSI Engineering.” Prentice-Hall, ISBN 0-13-942731-7 025, New
Jersey.
Dubois, E. and Venetsanopoulos, A.N. (1978). "The Discrete Fourier Transform over Finite Rings
with Application to Fast Convolution." IEEE Transactions on Computers, Vol. C-17, No. 7,
July, pp. 586-593.
Dudley, U. (1969). "Elementary Number Theory." W.H. Freeman and Co.
Gold, B, and Rader, C.M. (1969). "Digital Processing of Signals." McGraw-Hill Book Company,
New York.
Garner, H.L. (1959). "The residue number system." IRE Trans. Electronic Computers, vol. EC-8,
June, pp. 140-147.
Hatamian, M., Cash, G.L. (1986). "High Speed Signal Processing, Pipelining, and Signal
Processing," Proceedings of the Int. Conf. on Acoustics, Speech, and Signal Processing,
April, pp.1173-1176.
Jenkins W.K, and Leon, B.J. (1977). “The Use of Residue Number Systems in the Design of
Finite Impulse Response Digital Filters”, IEEE Trans. on Circuits and Systems, Vol. CAS-
24, April, pp. 191-201.
Jenkins W.K, Paul, D.F., Davidson, E.S. (1985). “A Custom Designed Integrated Circuit for the
Realization of Residue Number Digital Filters.” Proc. 1985 Int. Conf. on ASSP, Tampa, Fl.,
March, pp. 220-223.
Jenkins, W.K. and Altman, E.J. (1988). "Self-checking properties of residue number error checkers
based on mixed radix conversion." IEEE Transactions on Circuits and Systems, Vol. CAS-
35, No. 2, February, pp. 159-167.
Jenkins, W.K. and Krogmeier, J.V. (1987a). "The design of dual-mode complex signal processors
based on quadratic modular number codes." IEEE Transactions on Circuits and Systems,
CAS vol. 34, no. 4, April, pp 354-364.
Jenkins, W.K. and Lao, S.F. (1987b). "The design of an RNS digital filter module using the IBM
MVISA design system." Proceedings of the International Symposium on Circuits and
Systems, Philadelphia, PA, May, pp 122 - 125.
Jenkins, W.K. (1975). "Composite number theoretic transforms for digital filtering." Proc. 9th
Asilomar Conference on Cir., Sys. and Comp., Nov., pp. 458-462.
Jullien, G.A. ( 1978). "Residue Number Scaling and Other Operations Using ROM Arrays." IEEE
Trans on Computers, Vol. C-27, No.4, April, pp. 325-336.
Jullien, G.A. ( 1980). "Implementation of Multiplication, Modulo a Prime Number, with
Applications to Number Theoretic Transforms." IEEE Trans on Computers, Vol. C-29,
No.10, October, pp. 899-905.
Jullien, G.A., Krishnan, R. and Miller, W. C. (1987). "Complex digital signal processing over finite
fields." IEEE Transactions on Circuits and Systems, CAS vol. 34, no. 4, April, pp 365-337.
Jullien, G.A., Bird, P.D., Carr, J.T., Taheri, M., Miller, W.C. (1989). "An Efficient Bit-Level
Systolic Cell Design for Finite Ring Digital Signal Processing Applications.” Journal of
VLSI Signal Processing, Vol I-3, (In Print)
Jury, E.I (1964). “Theory and Application of the z-Transform Method.” John Wiley and Sons,
Inc., New York.
Kung, H.T. and Leiserson, C.E. (1978). "Systolic Arrays (for VLSI)," Sparse Matrix Proc. 1978,
Academic Press, Orlando, Fla., pp.256-282.
Leibowitz, L. M. (1976). "A Simplified Binary Arithmetic for the Fermat Number Transform."
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, no. 5, October, pp. 356-359.
Liu, K.Y., Reed, I.S. and Truong, T.K. (1976). "Fast Number Theoretic Transforms for Digital
Filtering." Electronic Letters, Vol. 12, November, pp. 644-646.
McCanny, J.V., McWhirter, J.G. (1982). "Implementation of Signal Processing Functions using 1-
bit Systolic Arrays." Electronic Letters, Vol. 18, No. 6, March, pp. 241-243.
McClellan, J.H. (1976). "Hardware Realization of a Fermat Number Transform." IEEE
Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-24, June, pp. 216-225.
McClellan, J.H. and Rader, C.M. (1979). “Number Theory in Digital Signal Processing.”
Prentice-Hall, Inc., Englewood Cliffs, NJ.
McCoy, N. (1972). "Fundamentals of Abstract Algebra." Allyn and Bacon Inc.
Nagpal, H.K., Jullien, G.A., Miller, W.C. ( 1983). "Processor Architectures for Two-Dimensional
Convolvers Using a Single Multiplexed Computational Element with Finite Field Arithmetic."
IEEE Trans on Computers, Vol. C-32, No.11, November, pp. 989-1000.
Nicholson, P.J. (1971). "Algebraic Theory of Finite Fourier Transforms." J. Comput., Sys. Sci.,
Vol. 5, pp. 524-547.
Nussbaumer, H. (1976a). "Digital Filtering using Complex Mersenne Transforms." IBM J.
Research and Development, Vol. 20, May, pp. 282-284.
Nussbaumer, H. (1976b). "Complex Convolution Via Fermat Number Transforms." IBM J.
Research and Development, Vol. 20, September, pp. 498-504.
Peled, A. and Liu, B. (1974). "A new hardware realization of digital filters." IEEE Trans. on
Acoustics, Speech, and Signal Processing, vol. ASSP-22, Dec., pp 456-462.
Pollard, J.M. (1971). "The Fast Fourier Transform in a Finite Field.” Math. Comp., Vol. 25, April,
pp. 365-374.
Pollard, J.M. (1976). "Implementation of Number Theoretic Transforms.” Electronics Letters, Vol.
12, July, pp.378-379.
Rader, C.M. (1972a). "The Number Theoretic DFT and Exact Discrete Convolution.” IEEE Arden
House Workshop on Digital Signal Processing, Jan. (Verbal Report)
Rader, C.M. (1972b). "Discrete Convolutions via Mersenne Transforms.” IEEE Transactions on
Computers, Vol. C-21, December, pp. 1269-1273.
Rabiner, L.R. and Gold, B. (1975). "Theory and Application of Digital Signal Processing.”
Prentice-Hall, Englewood Cliffs, N.J.
Schilling, O. F. G., and Piper, W. S. (1975). "Basic Abstract Algebra.” Allyn & Bacon, Boston.
Schroeder, M.R. (1985). "Number Theory in Science and Communication.” Second Edition,
Springer Series in Information Sciences, Springer-Verlag, New York.
Senoy, A.P., and Kumaresan, R. (1987). " Residue to Binary Conversion for RNS Arithmetic
Using only Modular Look-Up Tables.” Submitted for publication to IEEE trans. Circuits and
Systems.
Soderstrand, M.A.. (1977). “A High-Speed Low-Cost Recursive Digital Filter Using Residue
Number Arithmetic”, Proc. IEEE, Vol. 65, No. 7, July, pp. 1065-1067.
Soderstrand, M.A., Jenkins, W.K., Jullien, G.A. and Taylor, F.J. (eds) (1986). "Residue Number
System Arithmetic: Modern Applications in Digital Signal Processing.” IEEE Press, New
York, NY.
Soderstrand, M.A., Vernia, and Chang, J-H. (1983). "An Improved Residue Number System
Digital-to-Analog Converter." IEEE Trans. on Circuits and Systems, Vol. CAS-30,
December 1983, pp 903-907.
Soderstrand, M.A., Chang, B. (1986). “Design of a High Performance FIR Digital Filter on a
CMOS Semi-Custom VLSI Chip.” Proc. 1986 ISMM Int. Conf. on Mini and Micro
Computers, Beverly Hills, CA., Feb.
Svoboda, A. and Valach, M. (1955). "Operational circuits." Stroje Na Zpracovani Informaci, vol. 3,
Nakl. CSAV, Praha.
Svoboda, A. (1957). "Rational numerical system of residual classes." Stroje Na Zpracovani
Informaci, vol. 5, pp. 9-37, Prague.
Svoboda, A. (1958). "The numerical system of residual classes in mathematical machines." Proc.
Congreso Internacional De Automatica, Madrid, Spain, October. (Also Information
Processing (Proc. of UNESCO Conference, June 1959), pp. 419-422, 1960.)
Szabo, N.S. and Tanaka, R. I. (1967). "Residue Arithmetic and Its Applications to Computer
Technology.” McGraw-Hill Publishing Co.
Taheri, M., Jullien, G.A. and Miller, W.C. (1988). "High Speed Signal Processing Using Systolic
Arrays Over Finite Rings." IEEE Transactions on Selected Areas in Communications, VLSI
in Communications III, Vol. 6, No. 3, April, pp. 504-512.
Tanaka, R.I. (1962). "Modular arithmetic techniques." Tech. Rep. 2-38-62-1A, ASTDR, Lockheed
Missiles and Space Co., Nov.
Taylor, F.J. and Huang, C.H. (1982). "An Autoscale Residue Multiplier." IEEE Trans Computer.
vol. C-31, no. 4, April, pp. 321-325.
Taylor, F.J. and Ramnarayanan, A.S. (1981). "An Efficient Residue-to-Decimal Converter." IEEE
Trans. Circuits and Systems, vol. CAS 28, no. 12, December, pp. 1164-1169.
Vanwormhoud, M.C. (1978). "Structural Properties of Complex Residue Rings Applied to Number
Theoretic Fourier Transforms.”. IEEE Trans. Acoust., Speech and Signal Processing, Vol.
ASSP-26, Feb., pp. 99-104.
Vinogradov, I.M. (1954). "Elements of Number Theory.” New York, Dover Publishing Co.