You are on page 1of 464

9/6,

9/6,
Ldiled ly
=KRQJIHQJ:DQJ
In-Tech
intechweb.org
IulIished ly In-Teh
In-Tch
OIajnica 19/2, 32OOO Vukovar, Croalia
Alslracling and non-prohl use of lhe naleriaI is pernilled vilh credil lo lhe source. Slalenenls and
opinions expressed in lhe chaplers are lhese of lhe individuaI conlrilulors and nol necessariIy lhose of
lhe edilors or pulIisher. No responsiliIily is accepled for lhe accuracy of infornalion conlained in lhe
pulIished arlicIes. IulIisher assunes no responsiliIily IialiIily for any danage or injury lo persons or
properly arising oul of lhe use of any naleriaIs, inslruclions, nelhods or ideas conlained inside. Afler
lhis vork has leen pulIished ly lhe In-Teh, aulhors have lhe righl lo repulIish il, in vhoIe or parl, in any
pulIicalion of vhich lhey are an aulhor or edilor, and lhe nake olher personaI use of lhe vork.
2O1O In-leh
vvv.inlechvel.org
AddilionaI copies can le ollained fron:
pulIicalioninlechvel.org
Iirsl pulIished Ielruary 2O1O
Irinled in India
TechnicaI Ldilor: MeIila Horval
Cover designed ly Dino Snrekar
VLSI,
Ldiled ly Zhongfeng Wang
p. cn.
ISN 978-953-3O7-O49-O
V
Preface
The process of inlegraled circuils (IC) slarled ils era of very-Iarge-scaIe inlegralion (VLSI)
in 197Os vhen lhousands of lransislors vere inlegraled inlo a singIe chip. Since lhen,
lhe lransislors counls and cIock frequencies of slale-of-arl chips have grovn ly orders of
nagnilude. Novadays ve are alIe lo inlegrale nore lhan a liIIion lransislors inlo a singIe
device. Hovever, lhe lern VLSI renains leing connonIy used, despile of sone efforl lo
coin a nev lern uIlraIarge- scaIe inlegralion (ULSI) for hner dislinclions nany years ago. In
lhe pasl lvo decades, advances of VLSI lechnoIogy have Ied lo lhe expIosion of conpuler
and eIeclronics vorId. VLSI inlegraled circuils are used everyvhere in our everyday Iife,
incIuding nicroprocessors in personaI conpulers, inage sensors in digilaI caneras, nelvork
processors in lhe Inlernel svilches, connunicalion devices in snarlphones, enledded
conlroIIers in aulonoliIes, el aI.
VLSI covers nany phases of design and falricalion of inlegraled circuils. In a conpIele VLSI
design process, il oflen invoIves syslen dehnilion, archileclure design, regisler lransfer
Ianguage (RTL) coding, pre- and posl-synlhesis design verihcalion, lining anaIysis, and chip
Iayoul for falricalion. As lhe process lechnoIogy scaIes dovn, il lecones a lrend lo inlegrale
nany conpIicaled syslens inlo a singIe chip, vhich is caIIed syslen-on-chip (SoC) design.
In addilion, advanced VLSI syslens oflen require high-speed circuils for lhe ever increasing
denand of dala processing. Ior inslance, Llhernel slandard has evoIved fron 1O Mlps lo
1O Clps, and lhe specihcalion for 1OO Clps Llhernel is undervay. On lhe olher hand, vilh
lhe groving popuIarily of snarlphones and noliIe conpuling devices, Iov-pover VLSI
syslens have lecone crilicaIIy inporlanl. Therefore, engineers are facing nev chaIIenges lo
design highIy inlegraled VLSI syslens lhal can neel lolh high perfornance requirenenl and
slringenl Iov pover consunplion.
The goaI of lhis look is lo eIalorale lhe slale-of-arl VLSI design lechniques al nuIlipIe
IeveIs. Al device IeveI, researchers have sludied lhe properlies of nano-scaIe devices and
expIored possilIe nev naleriaI for fulure very high speed, Iov-pover chips. Al circuil IeveI,
inlerconnecl has lecone a conlenporary design issue for nano-scaIe inlegraled circuils.
Al syslen IeveI, hardvare-soflvare co-design nelhodoIogies have leen invesligaled lo
coherenlIy inprove lhe overaII syslen perfornance. Al archilecluraI IeveI, researchers have
proposed noveI archileclures lhal have leen oplinized for specihc appIicalions as veII as
efhcienl reconhguralIe archileclures lhal can le adapled for a cIass of appIicalions.
As VLSI syslens lecone nore and nore conpIex, il is a greal chaIIenge lul a signihcanl lask
for aII experls lo keep up vilh Ialesl signaI processing aIgorilhns and associaled archileclure
designs. This look is lo neel lhis chaIIenge ly providing a coIIeclion of advanced aIgorilhns
V
in conjunclion vilh lheir oplinized VLSI archileclures, such as Turlo codes, Lov Densily
Iarily Check (LDIC) codes, and advanced video coding slandards MILC4/H.264, el aI. Lach
of lhe seIecled aIgorilhns is presenled vilh a lhorough descriplion logelher vilh research
sludies lovards efhcienl VLSI inpIenenlalions. No look is expecled lo cover every possilIe
aspecl of VLSI exhausliveIy. Our goaI is lo provide lhe design concepls lhrough lhose
seIecled sludies, and lhe lechniques lhal can le adopled inlo nany olher currenl and fulure
appIicalions.
This look is inlended lo cover a vide range of VLSI design lopics - lolh generaI design
lechniques and slale-of-arl appIicalions. Il is organized inlo four najor parls:
Iarl I focuses on VLSI design for inage and video signaI processing syslens, al lolh
aIgorilhnic and archilecluraI IeveIs.
Iarl II addresses VLSI archileclures and designs for cryplography and error correclion
coding.
Iarl III discusses generaI SoC design lechniques as veII as syslen-IeveI design oplinizalion
for appIicalion-specihc aIgorilhns.
Iarl IV is devoled lo circuil-IeveI design lechniques for nano-scaIe devices.
Il shouId le noled lhal lhe look is nol a luloriaI for leginners lo Iearn generaI VLSI design
nelhodoIogy. Inslead, il shouId serve as a reference look for engineers lo gain lhe knovIedge
of advanced VLSI archileclure and syslen design lechniques. Moreover, lhis look aIso
incIudes nany in-deplh and oplinized designs for advanced appIicalions in signaI processing
and connunicalions. Therefore, il is aIso inlended lo le a reference lexl for graduale sludenls
or researchers for pursuing in-deplh sludy on specihc lopics.
The edilors are nosl gralefuI lo aII coaulhors for conlrilulions of each chapler in lheir
respeclive area of experlise. We vouId aIso Iike lo acknovIedge aII lhe lechnicaI edilors for
lheir supporl and greal heIp.
Zhongfeng Wang, Ih.D.
8rcaccm Ccrp., CA, USA
Xinning Huang, Ih.D.
Icrccs|cr Pc|q|ccnnic |ns|i|u|c, MA, USA
V
Contents
Preface V
1. Discrete Wavelet Transform Structures for VLS Architecture Design 001
Hannu Olkkonen and Juuso T. Olkkonen
2. High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 011
brahim Saeed Koko and Herman Agustiawan
3. Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 043
Tsung-Han Tsai, Chia-Pin Chen, and Yu-Nan Pan
4. Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based Discrete
Wavelet Transform for JPEG2000 069
Chih-Hsien Hsia and Jen-Shiun Chiang
5. Full HD JPEG XR Encoder Design for Digital Photography Applications 099
Ching-Yen Chien, Sheng-Chieh Huang, Chia-Ho Pan and Liang-Gee Chen
6. The Design of P Cores in Finite Field for Error Correction 115
Ming-Haw Jing, Jian-Hong Chen, Yan-Haw Chen, Zih-Heng Chen and Yaotsu Chang
7. Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 131
Chiou-Yng Lee
8. High-Speed VLS Architectures for Turbo Decoders 151
Zhongfeng Wang and Xinming Huang
9. Ultra-High Speed LDPC Code Design and mplementation 175
Jin Sha, Zhongfeng Wang and Minglun Gao
10. A Methodology for Parabolic Synthesis 199
Erik Hertz and Peter Nilsson
11. Fully Systolic FFT Architectures for Giga-sample Applications 221
D. Reisis
V
12. Radio-Frequency (RF) Beamforming Using Systolic FPGA-based Two
Dimensional (2D) R Space-time Filters 247
Arjuna Madanayake and Leonard T. Bruton
13. A VLS Architecture for Output Probability Computations of HMM-based
Recognition Systems 273
Kazuhiro Nakamura, Masatoshi Yamamoto, Kazuyoshi Takagi and Naofumi Takagi
14. Efcient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 285
Chun-Lung Hsu, Yu-Sheng Huang and Chen-Kai Chen
15. SOC Design for Speech-to-Speech Translation 297
Shun-Chieh Lin, Jia-Ching Wang, Jhing-Fa Wang, Fan-Min Li and Jer-Hao Hsu
16. A Novel De Bruijn Based MeshTopology for Networks-on-Chip 317
Reza Sabbaghi-Nadooshan, Mehdi Modarressi and Hamid Sarbazi-Azad
17. On the Efcient Design & Synthesis of Differential Clock Distribution Networks 331
Houman Zarrabi, Zeljko Zilic, Yvon Savaria and A. J. Al-Khalili
18. Robust Design and Test of Analog/Mixed-Signal Circuits
in Deeply Scaled CMOS Technologies 353
Guo Yu and Peng Li
19. Nanoelectronic Design Based on a CNT Nano-Architecture 375
Bao Liu
20. A New Technique of nterconnect Effects Equalization by using Negative Group
Delay Active Circuits 409
Blaise Ravelo, Andre Perennec and Marc Le Roy
21. Book Embeddings 435
Sad Bettayeb
22. VLS Thermal Analysis and Monitoring 441
Ahmed Lakhssassi and Mohammed Bougataya
Discrete Wavelet Transform Structures for VLS Architecture Design 1
X

Discrete WaveIet Transform Structures
for VLSI Architecture Design

Hannu OIkkonen and }uuso T. OIkkonen
Dcpar|mcn| cf Pnqsics, Unitcrsi|q cf Kucpic, 70211 Kucpic, |in|an
VTT Tccnnica| Rcscarcn Ccn|rc cf |in|an, 02044 VTT, |in|an

1. Introduction

WireIess dala lransnission and high-speed inage processing devices have generaled a need
for efficienl lransforn nelhods, vhich can le inpIenenled in VLSI environnenl. Afler lhe
discovery of lhe conpaclIy supporled discrele vaveIel lransforn (DWT) (Daulechies, 1988,
Snilh & arnveII, 1986) nany DWT-lased dala and inage processing looIs have
oulperforned lhe convenlionaI discrele cosine lransforn (DCT) -lased approaches. Ior
exanpIe, in }ILC2OOO Slandard (ITU-T, 2OOO), lhe DCT has leen repIaced ly lhe
liorlhogonaI discrele vaveIel lransforn. In lhis look chapler ve reviev lhe DWT slruclures
inlended for VLSI archileclure design. LspeciaIIy ve descrile nelhods for conslrucling shifl
invarianl anaIylic DWTs.

2. BiorthogonaI discrete waveIet transform

The firsl DWT slruclures vere lased on lhe conpaclIy supporled conjugale quadralure
fiIlers (CQIs) (Snilh & arnveII, 1986), vhich had nonIinear phase effecls such as inage
lIurring and spaliaI disIocalions in nuIli-resoIulion anaIyses. On lhe conlrary, in
liorlhogonaI discrele vaveIel lransforn (DWT) lhe scaIing and vaveIel fiIlers are
synnelric and Iinear phase. The lvo-channeI anaIysis fiIlers
0
( ) H : and
1
( ) H : (Iig. 1) are of
lhe generaI forn

1
0
1
1
( ) (1 ) ( )
( ) (1 ) ( )
K
K
H : : P :
H : : Q :

= +
=
(1)
vhere lhe scaIing fiIler
0
( ) H : has lhe Klh order zero al e t = . The vaveIel fiIler
1
( ) H : has
lhe Klh order zero al 0 e = , correspondingIy. ( ) P : and ( ) Q : are poIynoniaIs in
1
:

. The
reconslruclion fiIlers
0
( ) G : and
1
( ) G : (Iig. 1) oley lhe veII-knovn perfecl reconslruclion
condilion

0 0 1 1
0 0 1 1
( ) ( ) ( ) ( ) 2
( ) ( ) ( ) ( ) 0
k
H : G : H : G : :
H : G : H : G :

+ =
+ =
(2)
1
VLS 2

The Iasl condilion in (2) is salisfied if ve seIecl lhe reconslruclion fiIlers as
0 1
( ) ( ) G : H : = and
1 0
( ) ( ) G : H : = .


Iig. 1. AnaIysis and synlhesis DWT fiIlers.

3. Lifting BDWT

The DWT is nosl connonIy reaIized ly lhe Iadder-lype nelvork caIIed Iifling schene
(SveIdens, 1988). The procedure consisls of sequenliaI dovn and upIifling sleps and lhe
reconslruclion of lhe signaI is nade ly running lhe Iifling nelvork in reverse order (Iig. 2).
Lfficienl Iifling DWT slruclures have leen deveIoped for VLSI design (OIkkonen el aI.
2OO5). The anaIysis and synlhesis fiIlers can le inpIenenled ly inleger arilhnelics using
onIy regisler shifls and sunnalions. Hovever, lhe Iifling DWT runs sequenliaIIy and lhis
nay le a speed-Iiniling faclor in sone appIicalions (Huang el aI., 2OO5). Anolher dravlack
considering lhe VLSI archileclure is reIaled lo lhe reconslruclion fiIlers, vhich run in reverse
order and lvo differenl VLSI reaIizalions are required. In lhe foIIoving ve shov lhal lhe
Iifling slruclure can le repIaced ly nore effeclive VLSI archileclures. We descrile lvo
differenl approaches: lhe discrele Iallice vaveIel lransforn and lhe sign noduIaled DWT.


Iig. 2. The Iifling DWT slruclure.

4. Discrete Iattice waveIet transform

In lhe anaIysis parl lhe discrele Iallice vaveIel lransforn (DLWT) consisls of lhe scaIing
0
( ) H : and vaveIel
1
( ) H : fiIlers and lhe Iallice nelvork (Iig. 3). The Iallice slruclure conlains
lvo paraIIeI lransnission fiIlers
0
( ) T : and
1
( ) T : , vhich exchange infornalion via lvo
crossed Iallice fiIlers
0
( ) L : and
1
( ) L : . In lhe synlhesis parl lhe Iallice slruclure consisls of lhe
lransnission fiIlers
0
( ) R : and
1
( ) R : and crossed fiIlers
0
( ) W : and
1
( ) W : , and finaIIy lhe
reconslruclion fiIlers
0
( ) G : and
1
( ) G : . Supposing lhal lhe scaIing and vaveIel fiIlers oley
(1), for perfecl reconslruclion lhe Iallice slruclure shouId foIIov lhe condilion


Discrete Wavelet Transform Structures for VLS Architecture Design 3


Iig. 3. The generaI DLWT slruclure.


0 0 1 0 0 0 1 0
0 1 1 1 1 1 0 1
0
0
k
k
T R LW L R TW :
T W L R T R L W :

+ + ( (
=
( (
+ +

(3)
This is salisfied if ve slale
0 0
W L = ,
1 1
W L = ,
0 1
R T = and
1 0
R T = . The perfecl
reconslruclion condilion foIIovs lhen fron lhe diagonaI eIenenls (3) as

0 1 0 1
( ) ( ) ( ) ( )
k
T : T : L : L : :

= (4)
There exisls nany approaches in lhe design of lhe DLWT slruclures oleying (4), for
exanpIe via lhe Iarks-McCheIIan-lype aIgorilhn. LspeciaIIy lhe DLWT nelvork is efficienl
in designing haIf-land lransnission and Iallice fiIlers (see delaiIs in OIkkonen & OIkkonen,
2OO7a). Ior VLSI design il is essenliaI lo nole lhal in lhe Iallice slruclure aII conpulalions are
carried oul paraIIeI. AIso aII lhe DWT slruclures designed via lhe Iifling schene can le
lransferred lo lhe Iallice nelvork (Iig. 3). Ior exanpIe, Iig. 4 shovs lhe DLWT equivaIenl
of lhe Iifling DWT slruclure consisling of dovn and upIifling sleps (Iig. 2). The VLSI
inpIenenlalion is fIexilIe due lo paraIIeI fiIler lIocks in anaIysis and synlhesis parls.


Iig. 4. The DLWT equivaIence of lhe Iifling DWT slruclure descriled in Iig. 2.

5. Sign moduIated BDWT

In VLSI archileclures, vhere lhe anaIysis and synlhesis fiIlers are direclIy inpIenenled (Iig.
1), lhe VLSI design sinpIifies consideralIy using a spesific sign noduIalor defined as
(OIkkonen & OIkkonen 2OO8)

1 Ior n even
( 1)
-1 Ior n odd
n
n
S

= =

(5)
A key idea is lo repIace lhe reconslruclion fiIlers ly scaIing and vaveIel fiIlers using lhe sign
noduIalor (5). Iig. 5 descriles lhe ruIes hov ( ) H : can le repIaced ly ( ) H : and lhe sign
noduIalor in conneclion vilh lhe decinalion and inlerpoIalion operalors. Iig. 6
VLS 4


Iig. 5. The equivaIence ruIes appIying lhe sign noduIalor.

descriles lhe generaI DWT slruclure using lhe sign noduIalor. The VLSI design sinpIifies
lo lhe conslruclion of lvo paraIIeI liorlhogonaI fiIlers and lhe sign noduIalor. Il shouId le
poinled oul lhal lhe scaIing and vaveIel fiIlers can le sliII efficienlIy inpIenenled using lhe
Iifling schene or lhe Iallice slruclure. The sane liorlhogonaI DWT/IDWT fiIler noduIe
can le used in deconposilion and reconslruclion of lhe signaI e.g. in video conpression
unil. LspeciaIIy in lidireclionaI dala lransnission lhe DWT/IDWT lransceiver has nany
advanlages conpared vilh lvo separale lransniller and receiver unils. The sane VLSI
noduIe can aIso le used lo conslrucl nuIlipIexer-denuIlipIexer unils. Due lo synnelry of
lhe scaIing and vaveIel fiIler coefficenls a fasl convoIulion aIgorilhn can le used for
inpIenenlalion of lhe fiIler noduIes (see delaiIs OIkkonen & OIkkonen, 2OO8).


Iig. 6. The DWT slruclure using lhe scaIing and vaveIel fiIlers and lhe sign noduIalor.

6. Design exampIe: Symmetric haIf-band waveIet fiIter for compression coder

The generaI slruclure for lhe synnelric haIf-land fiIler (HI) is, for k odd

2
( ) ( )
k
H : : B :

= + (6)
vhere
2
( ) B : is a synnelric poIynoniaI in
2
:

. The inpuIse response of lhe HI conlains


onIy one odd poinl. Ior exanpIe, ve nay paranelerize lhe eIeven poinl HI inpuIse
response as | | | 0 0 1 0 0 | h n c b a a b c = , vhich has lhree adjuslalIe paranelers. The
conpression efficiency inproves vhen lhe high-pass vaveIel fiIler approaches lhe
frequency response of lhe sinc-funclion, vhich has lhe HI slruclure. Hovever, lhe
inpuIse response of lhe sinc-funclion is infinile, vhich proIongs lhe conpulalion line. In
lhis vork ve seIecl lhe seven poinl conpaclIy supporled HI prololype as a vaveIel fiIler,
vhich has lhe inpuIse response

1
| | | 0 1 0 | h n b a a b = (7)
conlaining lvo adjuslalIe paranelers a and l. In our previous vork ve have inlroduced a
nodified reguIalory condilion for conpulalion of lhe paranelers of lhe vaveIel fiIler
(OIkkonen el aI. 2OO5)

1
0
| | 0; 0,1,..., 1
N
m
n
n h n m M
=
= =
_
(8)
Discrete Wavelet Transform Structures for VLS Architecture Design 5

This reIalion inpIies lhal
1
( ) H : conlains Mlh-order zero al 1 : = ( 0) e = , vhere M is lhe
nunler of vanishing nonenls. Wriling (8) for lhe prololype fiIler (7) ve ollain lvo
equalions 2 2 1 0 a b + + = and 20 36 9 0 a b + + = , vhich give lhe soIulion 9/16 a = and
1/16 b = . The vaveIel fiIler has lhe z-lransforn

1 4 1 2
1
( ) (1 ) (1 4 ) /16 H : : : :

= + + (9)
having fourlh order rool al z=1. The vaveIel fiIler can le reaIized in lhe HI forn

3 2 2 2 4 6
1
( ) ( ) ; ( ) ( 1 9 9 ) /16 H : : A : A : : : :

= = + + (1O)
Using lhe equivaIence

2
2
( ) ( 2) ( ) H : H :
!
( !

(11)
lhe HI slruclure can le inpIenenled using lhe Iifling schene (Iig. 7). The funclioning of
lhe conpression coder can le expIained ly vriling lhe inpul signaI via lhe poIyphase
conponenls

2 1 2
0
( ) ( ) ( )
e
X : X : : X :

= + (12)
vhere ( )
e
X : and ( )
o
X : denole lhe even and odd sequences. We nay presenl lhe vaveIel
coefficenls as

2
1 2
( ) | ( ) ( )| ( ) ( ) ( )
o e
W : X : H : : X : A : X :

!
= = (13)
( ) A : vorks as an approxinaling fiIler yieIding an eslinale of lhe odd dala poinls lased on
lhe even sequence. The vaveIel sequence ( ) W : can le inlerpreled as lhe difference lelveen
lhe odd poinls and lheir eslinale. In lree slruclured conpression coder lhe scaIing sequence
( ) S : is fed lo lhe nexl slage. In nany VLSI appIicalions, for exanpIe inage conpression,
lhe inpul signaI consisls of an inleger-vaIued sequences. y rounding or lruncaling lhe
oulpul of lhe ( ) A : fiIler lo inlegers, lhe conpressed vaveIel sequence ( ) W : is inleger-
vaIued and can le efficienlIy coded e.g. using Huffnan aIgorilhn. Il is essenliaI lo nole lhal
lhis inleger-lo-inleger lransforn has sliII lhe perfecl reconslruclion properly (2).


Iig. 7. The Iifling slruclure for lhe HI vaveIel fiIler designed for lhe VLSI conpression
coder.

7. Shift invariant BDWT

The dravlack in nuIli-scaIe WDT anaIysis of signaIs and inages is lhe dependence of lhe
lolaI energy of lhe vaveIel coefficienls on lhe fraclionaI shifls of lhe anaIysed signaI. If ve
have a discrele signaI | | x n and lhe corresponding line shifled signaI | | x n t , vhere |0,1| t e ,
lhere nay exisl a significanl difference in lhe energy of lhe vaveIel coefficienls as a funclion
of lhe line shifl. Kingslury (2OO1) proposed a nearIy shifl invarianl conpIex vaveIel
lransforn, vhere lhe reaI and inaginary vaveIel coefficienls are approxinaleIy HiIlerl
lransforn pairs. The energy (alsoIule vaIue) of lhe vaveIel coefficienls equaIs lhe enveIope,
VLS 6

vhich varranls snoolhness and shifl invariance. SeIesnick (2OO2) olserved lhal using lvo
paraIIeI CQI lanks, vhich are conslrucled so lhal lhe inpuIse responses of lhe scaIing
fiIlers are haIf-sanpIe deIayed versions of each olher:
0
| | h n and
0
| 0.5| h n , lhe
corresponding vaveIels are HiIlerl lransforn pairs. In z-lransforn donain ve shouId le
alIe lo conslrucl lhe scaIing fiIlers
0
( ) H : and
0.5
0
( ) : H :

. Hovever, lhe conslrucled scaIing


fiIlers do nol possess coefficienl synnelry and in nuIli-scaIe anaIysis lhe nonIinearily
dislurls spaliaI lining and prevenls accurale slalislicaI correIalions lelveen differenl
scaIes. In lhe foIIoving ve descrile lhe shifl invarianl DWT slruclures especiaIIy designed
for VLSI appIicalions.

7.1 HaIf-deIay fiIters for shift invariant BDWT
The cIassicaI approach for design of lhe haIf-sanpIe deIay fiIler ( ) D : is lased on lhe Thiran
aII-pass inlerpoIalor

1
0.5
1
1
( )
1
p
k
k k
c :
D : :
c :

=
+
= =
+
[
(14)
vhere lhe
k
c coefficienls are designed so lhal lhe frequency response foIIovs approxinaleIy

/ 2
( )
f
D e
e
e

= (15)
RecenlIy, haIf-deIay -spIine fiIlers have leen inlroduced, vhich have an ideaI phase
response. The nelhod yieIds Iinear phase and shifl invarianl lransforn coefficienls and can
le adapled lo any of lhe exisling DWT (OIkkonen & OIkkonen, 2OO7l). The haIf-sanpIe
deIayed scaIing and vaveIel fiIlers and lhe corresponding reconslruclion fiIlers are

0 0
1
1 1
1
0 0
1 1
( ) ( ) ( )
( ) ( ) ( )
( ) ( ) ( )
( ) ( ) ( )
H : D : H :
H : D : H :
G : D : G :
G : D : G :

=
=
=
=
(16)
The haIf-deIayed DWT fiIler lank oleys lhe perfecl reconslruclion condilion (2). The -
spIine haIf-deIay fiIlers have lhe IIR slruclure

( )
( )
( )
A :
D :
B :
= (17)
vhich can le inpIenenled ly lhe inverse fiIlering procedure (see delaiIs OIkkonen &
OIkkonen 2OO7l).

7.2 HiIbert transform-based shift invariant DWT
The lree-slruclured conpIex DWT is lased on lhe IIT-lased conpulalion of lhe HiIlerl
lransforn (
a
H operalor in Iig. 8). The scaIing and vaveIel fiIlers lolh oley lhe HI
slruclure (OIkkonen el aI. 2OO7c)

1 2
0
1 2
1
1
( ) ( )
2
1
( ) ( )
2
H : : B :
H : : B :

= +
=
(18)
Discrete Wavelet Transform Structures for VLS Architecture Design 7

Ior exanpIe, lhe inpuIse response
0
| | | 1 0 9 16 9 0 -1|/ 32 h n = has lhe fourlh order zero al
e t = and
1
| | |1 0 -9 16 -9 0 1|/ 32 h n = has lhe fourlh order zero al 0 e = . In lhe lree slruclured
HI DWT lhe vaveIel sequences | |
a
w n . A key fealure is lhal lhe odd coefficienls of lhe
anaIylic signaI |2 1|
a
w n + can le reconslrucled fron lhe even coefficienl vaIues |2 |
a
w n . This
avoids lhe need lo use any reconslruclion fiIlers. The HIs (18) are synnelric vilh respecl
lo / 2 e t = . Hence, lhe energy in lhe frequency range 0 t is equaIIy divided ly lhe
scaIing and vaveIel fiIlers and lhe energy (alsoIule vaIue) of lhe scaIing and vaveIel


Iig. 8. HiIlerl lransforn-lased shifl invarianl DWT.

coefficienls are slalislicaIIy conparalIe. The conpulalion of lhe anaIylic signaI via lhe
HiIlerl lransforn requires lhe IIT-lased signaI processing. Hovever, efficienl IIT chips
are avaiIalIe for VLSI inpIenenlalion. In nany respecls lhe advanced nelhod oulperforns
lhe previous nearIy shifl invarianl DWT slruclures.

7.3 HiIbert transform fiIter for construction of shift invariant BDWT
The IIT-lased inpIenenlalion of lhe shifl invarianl DWT can le avoided if ve define lhe
HiIlerl lransforn fiIler ( ) : , vhich has lhe frequency response

/ 2
( ) sgn( )
f
e
t
e e

= (19)
vhere sgn( ) 1 e = for 0 e > and sgn( ) 0 e = for 0 e < . In lhe foIIoving ve descrile a noveI
nelhod for conslrucling lhe HiIlerl lransforn fiIler lased on lhe haIf-sanpIe deIay fiIler
( ) D : (17), vhose frequency response foIIovs (15). The quadralure nirror fiIler ( ) D : has
lhe frequency response

( ) / 2
( )
f
D e
e t
e t

= (2O)
The frequency response of lhe fiIler
1
( ) ( ) D : D :

is, correspondingIy

/ 2 ( ) / 2 / 2
( )
( )
f f f
D
e e e
D
e e t t
e
e t

= =

(21)
Conparing (19) and using nolalion (17) ve ollain lhe HiIlerl lransforn fiIler as

( ) ( )
( )
( ) ( )
A : B :
:
A : B :

(22)
The corresponding paraIIeI DWT fiIler lank is

0 0
1
1 1
1
0 0
1 1
( ) ( ) ( )
( ) ( ) ( )
( ) ( ) ( )
( ) ( ) ( )
H : : H :
H : : H :
G : : G :
G : : G :

=
=
=
=

(23)
VLS 8

y fiIlering lhe reaI-vaIued signaI ( ) X : ly lhe HiIlerl lransforn fiIler resuIls in an anaIylic
signaI |1 ( )| ( ) f : X : + , vhose nagnilude response is zero al negalive side of lhe frequency
speclrun. Ior exanpIe, an inleger-vaIued haIf-deIay fiIler ( ) D : for lhis purpose is ollained
ly lhe -spIine lransforn (OIkkonen & OIkkonen, 2OO7l). The frequency response of lhe
HiIlerl lransforn fiIler designed ly lhe fourlh order -spIine (Iig. 9) shovs a naxinaIIy fIal
nagnilude speclrun. The phase speclrun corresponds lo lhe ideaI HiIlerl lransforner (19).

















Iig. 9. Magnilude and phase speclra of lhe HiIlerl lransforn fiIler yieIded ly lhe fourlh
order -spIine lransforn.

8. ConcIusion

In lhis look chapler ve have descriled lhe DWT conslruclions especiaIIy laiIored for VLSI
environnenl. Mosl of lhe VLSI designs in lhe Iileralure are focused on lhe liorlhogonaI 9/7
fiIlers, vhich have decinaI coefficienls and usuaIIy inpIenenled using lhe Iifling schene
(SveIdens, 1988). Hovever, lhe Iifling DWT needs lvo differenl fiIler lanks for anaIysis
and synlhesis parls. The speed of lhe Iifling DWT is aIso Iiniled due lo lhe sequenliaI
Iifling sleps. In lhis vork ve shoved lhal lhe Iifling DWT can le repIaced ly lhe Iallice
slruclure (OIkkonen & OIkkonen, 2OO7a). The lvo-channeI DLWT fiIler lank (Iig. 3) runs
paraIIeI, vhich significanlIy increases lhe channeI lhroughoul. A significanl advanlage
conpared vilh lhe previous naxinaIIy decinaled fiIler lanks is lhal lhe DLWT slruclure
aIIovs lhe conslruclion of lhe haIf-land Iallice and lransnission fiIlers. In lree slruclured
vaveIel lransforn haIf-land fiIlered scaIing coefficienls inlroduce no aIiasing vhen lhey are
fed lo lhe nexl scaIe. This is an essenliaI fealure vhen lhe frequency conponenls in each
scaIe are considered, for exanpIe in eIeclroencephaIography anaIysis.
The VLSI design of lhe DWT fiIler lank sinpIifies essenliaIIy ly inpIenenling lhe sign
noduIalor unil (Iig. 5), vhich eIininales lhe need for conslrucling separale reconslruclion
fiIlers. The liorlhogonaI DWT/IDWT lransceiver noduIe uses onIy lvo paraIIeI fiIler
slruclures. LspeciaIIy in lidireclionaI dala lransnission lhe DWT/IDWT noduIe offers
severaI advanlages conpared vilh lhe separale lransnil and receive noduIes, such as lhe
Discrete Wavelet Transform Structures for VLS Architecture Design 9

reduced size, Iov pover consunplion, easier synchronizalion and lining requirenenls. Ior
lhe VLSI designer lhe DWT/IDWT noduIe appears as a "lIack lox", vhich readiIy fils lo
lhe dala under processing. This nay override lhe reIaliveIy lig larrier fron lhe vaveIel
lheory lo lhe praclicaI VLSI and nicroprocessor appIicalions. As a design exanpIe ve
descriled lhe conslruclion of lhe conpression coder (Iig. 7), vhich can le used lo conpress
inleger-vaIued dala sequences, e.g. produced ly lhe anaIog-lo-digilaI converlers.
Il is veII docunenled lhal lhe reaI-vaIued DWTs are nol shifl invarianl, lul snaII fraclionaI
line-shifls nay inlroduce significanl differences in lhe energy of lhe vaveIel coefficienls.
Kingslury (2OO1) shoved lhal lhe shifl invariance is inproved ly using lvo paraIIeI fiIler
lanks, vhich are designed so lhal lhe vaveIel sequences conslilule reaI and inaginary parls
of lhe conpIex anaIylic vaveIel lransforn. The duaI-lree discrele vaveIel lransforn (DT-
DWT) has leen shovn lo oulperforn lhe reaI-vaIued DWT in a variely of appIicalions such
as denoising, lexlure anaIysis, speech recognilion, processing of seisnic signaIs and
neuroeIeclric signaI anaIysis (OIkkonen el aI. 2OO6). SeIesnick (2OO2) nade an olservalion
lhal a haIf-sanpIe line-shifl lelveen lhe scaIing fiIlers in paraIIeI CQI lanks is enough lo
produce lhe anaIylic vaveIel lransforn, vhich is nearIy shifl invarianl. In lhis vork ve
descriled lhe shifl invarianl DT-DWT lank (16) lased on lhe haIf-sanpIe deIay fiIler. Il
shouId le poinled oul lhal lhe haIf-deIay fiIler approach yieIds vaveIel lases vhich are
HiIlerl lransforn pairs, lul lhe vaveIel sequences are onIy approxinaleIy shifl invarianl. In
nuIli-scaIe anaIysis lhe conpIex vaveIel sequences shouId le shifl invarianl. This
requirenenl is salisfied in lhe HiIlerl lransforn-lased approach (Iig. 8), vhere lhe signaI in
every scaIe is HiIlerl lransforned yieIding slriclIy anaIylic and shifl invarianl lransforn
coefficienls. The procedure needs IIT-lased conpulalion (OIkkonen el aI. 2OO7c), vhich
nay le an olslacIe in nany VLSI reaIizalions. To avoid lhis ve descriled a HiIlerl
lransforn fiIler for conslrucling lhe shifl invarianl DT-DWT lank (23). Inslead of lhe haIf-
deIay fiIler lank approach (16) lhe perfecl reconslruclion condilion (2) is allained using lhe
IIR-lype HiIlerl lransforn fiIlers, vhich yieId anaIylic vaveIel sequences.

9. References

Daulechies, I. (1988). OrlhonornaI lases of conpaclIy supporled vaveIels. Ccmmmun. Purc
App|. Ma|n., VoI. 41, 9O9-996.
Huang, C.T., Tseng, O.O. & Chen, L.C. (2OO5). AnaIysis and VLSI archileclure for 1-D and 2-
D discrele vaveIel lransforn. |||| Trans. Signa| Prcccss. VoI. 53, No. 4, 1575-1586.
ITU-T (2OOO) Reconnend. T.8OO-ISO DCD15444-1: ]P|G2000 |magc Ccing Sqs|cm.
|n|crna|icna| Organiza|icn fcr S|anariza|icn, ISO/ILC }TC! SC29/WC1.
Kingslury, N.C. (2OO1). ConpIex vaveIels for shifl invarianl anaIysis and fiIlering of
signaIs. ]. App|. Ccmpu|. Harmcnic Ana|qsis. VoI. 1O, 234-253.
OIkkonen, H., IesoIa, I. & OIkkonen, }.T. (2OO5). Lfficienl Iifling vaveIel lransforn for
nicroprocessor and VLSI appIicalions. |||| Signa| Prcccss. |c||. VoI. 12, No. 2, 12O-
122.
OIkkonen, H., IesoIa, I., OIkkonen, }.T. & Zhou, H. (2OO6). HiIlerl lransforn assisled
conpIex vaveIel lransforn for neuroeIeclric signaI anaIysis. ]. Ncurcscicncc Mc|n.
VoI. 151, 1O6-113.
OIkkonen, }.T. & OIkkonen, H. (2OO7a). Discrele Iallice vaveIel lransforn. |||| Trans.
Circui|s an Sqs|cms ||. VoI. 54, No. 1, 71-75.
VLS 10

OIkkonen, H. & OIkkonen, }.T. (2OO7l). HaIf-deIay -spIine fiIler for conslruclion of shifl-
invarianl vaveIel lransforn. |||| Trans. Circui|s an Sqs|cms ||. VoI. 54, No. 7, 611-
615.
OIkkonen, H., OIkkonen, }.T. & IesoIa, I. (2OO7c). IIT-lased conpulalion of shifl invarianl
anaIylic vaveIel lransforn. |||| Signa| Prcccss. |c||. VoI. 14, No. 3, 177-18O.
OIkkonen, H. & OIkkonen, }.T. (2OO8). SinpIified liorlhogonaI discrele vaveIel lransforn
for VLSI archileclure design. Signa|, |magc an Vicc Prcccss. VoI. 2, 1O1-1O5.
SeIesnick, I.W. (2OO2). The design of approxinale HiIlerl lransforn pairs of vaveIel lases.
|||| Trans. Signa| Prcccss. VoI. 5O, No. 5, 1144-1152.
Snilh, M.}.T. & arnveII, T.I. (1986). Lxaxl reconslruclion for lree-slruclured sulland
coders. |||| Trans. Accus|. Spcccn Signa| Prcccss. VoI. 34, 434-441.
SveIdens, W. (1988). The Iifling schene: A conslruclion of second generalion vaveIels.
S|AM ]. Ma|n. Ana|. VoI. 29, 511-546.


High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 11
X

High Performance ParaIIeI PipeIined
Lifting-based VLSI Architectures
for Two-DimensionaI Inverse
Discrete WaveIet Transform

Ilrahin Saeed Koko and Hernan Agusliavan
||cc|rica| c ||cc|rcnic |nginccring Dcpar|mcn|
Unitcrsi|i Tc|nc|cgi P|TRONAS, Trcncn
Ma|aqsia

1. Introduction

Tvo-dinensionaI discrele vaveIel lransforn (2-D DWT) has evoIved as an effeclive and
poverfuI looI in nany appIicalions especiaIIy in inage processing and conpression. This is
nainIy due lo ils leller conpulalionaI efficiency achieved ly facloring vaveIel lransforns
inlo Iifling sleps. Lifling schene faciIilales high speed and efficienl inpIenenlalion of
vaveIel lransforn and il allraclive for lolh high lhroughpul and Iov-pover appIicalions
(Lan & Zheng, 2OO5).
DWT considered in lhis vork is parl of a conpression syslen lased on vaveIel such as
}ILC2OOO. Iig.1 shovs a sinpIified conpression syslen. In lhis syslen, lhe funclion of lhe
2-D IDWT is lo deconpose an NxM inage inlo sullands as shovn in Iig. 2 for 3-IeveI
deconposilion. This process decorreIales lhe highIy correIaled pixeIs of lhe originaI inage.
Thal is lhe decorreIalion process reduces lhe spaliaI correIalion anong lhe adjacenl pixeIs of
lhe originaI inage such lhal lhey can le anenalIe lo conpression.
Afler lransnilling lo a renole sile, lhe originaI inage nusl le reconslrucled fron lhe
decorreIaled inage. The lask of lhe reconslruclion and conpIeleIy recovering lhe originaI
inage fron lhe decorreIaled inage is perforned ly inverse discrele vaveIel lransforn
(IDWT).
The decorreIaled inage shovn in Iig. 2 can le reconslrucled ly 2-D IDWT as foIIovs. Iirsl,
il reconslrucls in lhe coIunn direclion sullands LL3 and LH3 coIunn-ly-coIunn lo recover
L3 deconposilion. SiniIarIy, sullands HL3 and HH3 are reconslrucled lo ollain H3
deconposilion. Then L3 and H3 deconposilions are conlined rov-vise lo reconslrucl
sulland LL2. This process is repealed in each IeveI unliI lhe vhoIe inage is reconslrucled
(Ilrahin & Hernan, 2OO8).
The reconslruclion process descriled alove inpIies lhal lhe lask of lhe reconslruclion can le
achieved ly using 2 processors (Ilrahin & Hernan, 2OO8). The firsl processor (lhe coIunn-
processor) conpules coIunn-vise lo conline sullands LL and LH inlo L and sullands HL
and HH inlo H, vhiIe lhe second processor (lhe rov-processor) conpules rov-vise lo
2
VLS 12
conline L and H inlo lhe nexl IeveI sulland. The decorreIaled inage shovn in Iig. 2 is
assuned lo le residing in an exlernaI nenory vilh lhe sane fornal.
In lhis chapler, paraIIeIisn is expIored lo lesl neel reaI-line appIicalions of 2-D DWT vilh
denanding requirenenls. The singIe pipeIined archileclure deveIoped ly (Ilrahin &
Hernan, 2OO8) is exlended lo 2- and 4-paraIIeI pipeIined archileclures for lolh 5/3 and 9/7
inverse aIgorilhns lo achieve speedup faclors of 2 and 4, respecliveIy. The advanlage of lhe
proposed archileclures is lhal lhe lolaI lenporary Iine luffer (TL) size does nol increase
fron lhal of lhe singIe pipeIined archileclure vhen degree of paraIIeIisn is increased.


Iig. 1. A sinpIified Conpression Syslen


Iig. 2. Sulland deconposilion of an NxM inage inlo 3 IeveIs.

2. Lifting-based 5/3 and 9/7 synthesis aIgorithms and data dependency
graphs

The 5/3 and lhe 9/7 inverse discrele vaveIel lransforns aIgorilhns are defined ly lhe
}ILC2OOO inage conpression slandard as foIIov (Lan & Zheng, 2OO5):
5/3 synlhesis aIgorilhn

(

+ +
+ + = +
(

+ + +
=
2
) 2 2 ( ) 2 (
) 1 2 ( ) 1 2 ( : 2
4
2 ) 1 2 ( ) 1 2 (
) 2 ( ) 2 ( : 1
n X n X
n Y n X step
n Y n Y
n Y n X step
(1)
9/7 synlhesis aIgorilhn
Slep 1: ) 2 ( 1 ) 2 ( n Y k n Y = '
Slep 2: ) 1 2 ( ) 1 2 ( + = + ' n Y k n Y
Slep 3: )) 1 2 ( ) 1 2 ( ( ) 2 ( ) 2 ( + ' + ' ' = ' ' n Y n Y n Y n Y o
Slep 4: )) 2 2 ( ) 2 ( ( ) 1 2 ( ) 1 2 ( + ' ' + ' ' + ' = + ' ' n Y n Y n Y n Y (2)
Transmit to a
remote site
image
es Decorrelat
FDWT
image
ea Decorrelat
Compress
image
ea aecorrelat
Decompress
image
constructs
IDWT
Re
LH1
HL1
HH1
3 LH
3 LL 3 HL
3 HH
2 HL
2 HH 2 LH
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 13
Slep 5: )) 1 2 ( ) 1 2 ( ( ) 2 ( ) 2 ( + ' ' + ' ' ' ' = n Y n Y n Y n X |
Slep 6: )) 2 2 ( ) 2 ( ( ) 1 2 ( ) 1 2 ( + + + ' ' = + n X n X n Y n X o
The dala dependency graphs (DDCs) for 5/3 and 9/7 derived fron lhe synlhesis aIgorilhns
are shovn in Iigs. 3 and 4, respecliveIy. The DDCs are very usefuI looIs in archileclure
deveIopnenl and provide lhe infornalion necessary for designer lo deveIop nore accurale
archileclures. The synnelric exlension aIgorilhn reconnended ly }ILC2OOO is
incorporaled inlo lhe DDCs lo handIe lhe loundaries prolIens. The loundary lrealnenl is
necessary lo keep nunler of vaveIel coefficienls lhe sane as lhal of lhe originaI inpul IixeIs
and as a resuIl prevenl dislorlion fron appearing al inage loundaries. The loundary
lrealnenl is onIy appIied al lhe leginning and ending of each rov or coIunn in an NxM
inage (DiIIin el aI., 2OO3). In DDCs, nodes circIed vilh lhe sane nunlers are considered
redundanl conpulalions, vhich viII le conpuled once and used lhereafler. Coefficienls of
lhe nodes circIed vilh even nunlers in lhe DDCs are Iov coefficienls and lhal circIed vilh
odd nunlers are high coefficienls.

3. Scan methods

The hardvare conpIexily and hence lhe nenory required for 2-D DWT archileclure in
generaI depends on lhe scan nelhod adopled for scanning exlernaI nenory. Therefore, lhe
scan nelhod shovn in Iig. 5 is proposed for lolh 5/3 and 9/7 CIs. Iig. 5 (A) is forned for
iIIuslralion purposes ly nerging logelher sullands LL and LH, vhere sulland LL
coefficienls occupy even rov and sulland LH coefficienls occupy odd rovs, vhiIe Iig. 5 ()
is forned ly nerging sulland HL and HH logelher.
According lo lhe scan nelhod shovn in Iig. 5, CIs of lolh 5/3 and 9/7 shouId scan exlernaI
nenory coIunn-ly-coIunn. Hovever, lo aIIov lhe RI, vhich operales on dala generaled
ly lhe CI, lo vork in paraIIeI vilh lhe CI as earIier as possilIe, lhe As (LL+LH) firsl lvo
coIunns coefficienls are inlerIeaved in execulion vilh lhe s (HL+HH) firsl lvo coIunns
coefficienls, in lhe firsl run. In aII sulsequenl runs, lvo coIunns are inlerIeaved, one fron A
vilh anolher fron .
InlerIeaving of 4 coIunns in lhe firsl run lakes pIace as foIIovs. Iirsl, coefficienls LLO,O,
LHO,O fron lhe firsl coIunn of (A) are scanned. Second, coefficienls HLO,O and HHO,O fron
lhe firsl coIunn of 8 are scanned, lhen LLO,1 and LHO,1 fron lhe second coIunn of A
foIIoved ly HLO,1 and HHO,1 fron lhe second coIunn of 8 are scanned. The scanning
process lhen relurns lo lhe firsl Iig. 3. 5/3 synlhesis aIgorilhns DDCs for (a) odd and (l)
even Ienglh signaIs relurns lo lhe firsl coIunn of A lo repeal lhe process and so on.
VLS 14

Iig. 4. 9/7 synlhesis aIgorilhns DDCs for (a) odd and (l) even Ienglh signaIs

The advanlage of inlerIeaving process nol onIy il speedups lhe conpulalions ly aIIoving
lhe lvo processors lo vork in paraIIeI earIier during lhe conpulalions, lul aIso reduces lhe
inlernaI nenory requirenenl lelveen CI and RI lo a fev regislers.
The scan nelhod iIIuslraled in Iig. 5 for 5/3 and 9/7 CI aIong vilh lhe DDCs suggesl lhal
lhe RI shouId scan coefficienls generaled ly CI according lo lhe scan nelhod iIIuslraled in
Iig. 6. This figure is forned for iIIuslralion purposes ly nerging L and H deconposilions
even lhough lhey are acluaIIy separale. In Iig. 6, Ls coefficienls occupy even coIunns,
vhiIe Hs coefficienls occupy odd coIunns. In lhe firsl run, lhe RIs scan nelhod shovn in
Iig. 6 requires considering lhe firsl four coIunns for scanning as foIIovs. Iirsl, coefficienls
LO,O and HO,O fron rov O foIIoved ly L1,O and H1,O fron rov 1 are scanned. Then lhe scan
relurns lo rov O and scans coefficienls LO,1 and HO,1 foIIoved ly L1,1 and H1,1. This
process is repealed as shovn in Iig. 6 unliI lhe firsl run conpIeles.
In lhe second run, coefficienls of coIunns 4 and 5 are considered for scanning ly RI as
shovn in Iig. 6, vhereas in lhe lhird run, coefficienls of coIunns 6 and 7 are considered for
scanning and so on.
2
0
4 6
8
7
1
0
7
3
1 1 2
5
4 3 5 6
7
8
2 X 0 X 1 X 3 X 4 X 5 X 6 X 7 X 8 X
) 2 ( n X
) 1 2 ( + n X
) (n Y
2
0
4 6 6
7
1
0
5
3
1 1 2
5
4 3 5 6
7
6
2 X 0 X 1 X 3 X 4 X 5 X 6 X 7 X
Iig. 3. 5/3 synlhesis aIgorilhns DDCs for (a) odd and (l) even Ienglh signaIs
) (a ) (b
1 2 3 0 1 2 3
4 5 6 7 6 4 5
2 0 2 4 6 6 4
1
3 5 7 1
5
0 2 4 6 6
1 3 5 7
2 X
1
k
k k
k k k k
k k
1
k
1
k
1
k
1
k
1
k
1
k
0 X 1 X 3 X 4 X 5 X 6 X 7 X
3
1 2 3 0 1 2 3
4 5 6 7 8 7 6 5
2 0 2 4 6 8 6
1
3 5 7 1
7
0 2 4 6 8
1 3 5 7
2 X
1
k
( ) 1 2 + ' ' n Y
( ) n Y 2 ' '
( ) 1 2 + ' n Y
( ) n Y 2 '
k k
k k
k k
k k
1
k
1
k
1
k
1
k
1
k
1
k
) (n Y
) 2 ( n x
) 1 2 ( + n x
0 X 1 X 3 X 4 X 5 X 6 X 7 X 8 X
) (a
) (b
ns computatio
reaunaant
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 15
According lo lhe 9/7 DDCs, lhe RIs scan nelhod shovn in Iig. 6 viII generale onIy one
Iov oulpul coefficienl each line il processes 4 coefficienls in lhe firsl run, vhereas 5/3 RI
viII generale lvo oulpul coefficienls. Hovever, in aII sulsequenl runs, lolh 9/7 and 5/3
RIs generale lvo oulpul coefficienls.


Iig. 5. 5/3 and 9/7 CIs scan nelhod


Iig. 6. 5/3 and 9/7 RIs scan nelhod

4. Approach

In (Ilrahin & Hernan, 2OO8), lo ease lhe archileclure deveIopnenl lhe slralegy adopled
vas lo divide lhe delaiIs of lhe deveIopnenl inlo lvo sleps each having Iess infornalion lo
handIe. In lhe firsl slep, lhe DDCs vere Iooked al fron lhe oulside, vhich is specified ly lhe
dolled loxes in lhe DDCs, in lerns of lhe inpul and oulpul requirenenls. Il can le olserved
lhal lhe DDCs for 5/3 and 9/7 are idenlicaI vhen lhey are Iooked al fron oulside, laking
inlo consideralion onIy lhe inpul and oulpul requirenenls, lul differ in lhe inlernaI delaiIs.
ased on lhis olservalion, lhe firsl IeveI, caIIed exlernaI archileclure, vhich is idenlicaI for
lolh 5/3 and 9/7, and consisls of a coIunn-processor (CI) and a rov-processor (RI), vas
deveIoped. In lhe second slep, lhe inlernaI delaiIs of lhe DDCs for 5/3 and 9/7 vere
considered separaleIy for lhe deveIopnenl of processors dalapalh archileclures, since
DDCs inlernaIIy define and specify lhe inlernaI slruclure of lhe processors
In lhis chapler, lhe firsl IeveI, lhe exlernaI archileclures for 2-paraIIeI and 4-paraIIeI are
deveIoped for lolh 5/3 and 9/7 inverse aIgorilhns. Then lhe processors dalapalh
0 1 2 3 4 5
0
1
2
3
4
1 run 2 run
0 , 0 L 1 , 0 L 0 , 0 H 1 , 0 H
0 , 1 L
6
0
1
2
3
4
5
0 1 2 3
A
1 run
6
2 run
1 3
0 , 0 LL
0 , 0 LH
0 , 1 LL
0 , 1 LH
B
0
1
2
3
4
5
0 1 2 3
6
1 run
2 run
2 4
0 , 0 HL
0 , 0 HH
0 , 1 HL
0 , 1 HH
VLS 16
archileclures deveIoped in (Ilrahin & Hernan, 2OO8) are nodified lo fil inlo lhe lvo
proposed paraIIeI archileclures processors.

4.1 Proposed 2-paraIIeI externaI architecture
ased on lhe scan nelhods shovn in Iigs. 5 and 6 and lhe DDCs for 5/3 and 9/7, lhe 2-
paraIIeI exlernaI archileclure shovn in Iig. 7 (a) is proposed for 5/3 and 9/7 and conlined
5/3 and 9/7 for 2-D IDWT. The archileclure consisls of lvo |-slage pipeIined coIunn-
processors (CIs) IaleIed CI1 and CI2 and lvo |-slage pipeIined rov-processors (RIs)
IaleIed RI1 and RI2. The vaveforns of lhe lvo cIocks
2
f and 2
2
f used in lhe
archileclure are shovn in Iig. 7 (l).
In generaI, lhe scan frequency
l
f and hence lhe period
l l
f 1 = t of lhe paraIIeI archileclures
can le delernined ly lhe foIIoving aIgorilhn vhen lhe required pixeIs I of an operalion are
scanned sinuIlaneousIy in paraIIeI. Suppose

p
t
and
m
t are lhe processor and lhe exlernaI
nenory crilicaI palh deIays, respecliveIy.

m l
p l
m p
t else
k l t
then t k l t If
=
=
>
t
t
(3)

Where | = 2, 3, 4 ... denole 2, 3, and 4-paraIIeI and k t
p
is lhe slage crilicaI palh deIay of a k-
slage pipeIined processor. The cIock frequency
2
f is delernined fron Lq(3) as

p
t k f 2
2
= (4)
The archileclure scans lhe exlernaI nenory vilh frequency
2
f and il operales vilh
frequency 2
2
f . Lach line lvo coefficienls are scanned lhrough lhe lvo luses IaleIed |us0
and |us1. The lvo nev coefficienls are Ioaded inlo CI1 or CI2 Ialches RlO and Rl1 every
line cIock 2
2
f nakes a negalive or a posilive lransilion, respecliveIy.
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 17
2
2
f
2
2
f
2
f

Iig. 7. (a) Iroposed 2-paraIIeI pipeIined exlernaI archileclure for 5/3 and 9/7 and conlined
(l) Waveforn of lhe cIocks.

On lhe olher hand, lolh RI1 and RI2 Ialches RlO and Rl1 Ioad sinuIlaneousIy nev dala
fron CI1 and CI2 oulpul Ialches each line cIock 2
2
f nakes a negalive lransilion.
The dalafIov for 5/3 2-paraIIeI archileclure is shovn in TalIe 1, vhere CIs and RIs are
assuned lo le 4-slage pipeIined processors. The dalafIov for 9/7 2-paraIIeI archileclure is
siniIar, in aII runs, lo lhe 5/3 dalafIov excepl in lhe firsl, vhere RI1 and RI2 of lhe 9/7
archileclure each vouId generale one oulpul coefficienl every olher cIock cycIe, reference lo
cIock 2
2
f . The reason is lhal each 4 coefficienls of a rov processed in lhe firsl run ly RI1
or RI2 of lhe 9/7 vouId require, according lo lhe DDCs, lvo successive Iov coefficienls
VLS 18
fron lhe firsl IeveI of lhe DDCs IaleIed ) 2 ( n Y ' ' in order lo carry oul node 1 conpulalions in
lhe second IeveI IaleIed ) 1 2 ( + ' ' n Y . In TalIe 1, lhe oulpul coefficienls in RlO of lolh RI1 and
RI2 represenl lhe oulpul coefficienls of lhe 9/7 in lhe firsl run.
The slralegy adopled for scheduIing nenory coIunns for CI1 and CI2 of lhe 5/3 and 9/7
2-paraIIeI archileclures, vhich are scanned according lo lhe scan nelhod shovn in Iig. 5, is
as foIIov. In lhe firsl run, lolh 5/3 and 9/7 2-paraIIeI archileclures are scheduIed for
execuling 4 coIunns of nenory, lvo fron each (A) and () of Iig. 5. The firsl lvo coIunns
of Iig. 5 (A) are execuled in an inlerIeaved nanner ly CI1, vhiIe lhe firsl lvo coIunns of
Iig. 5 () are execuled ly CI2 aIso in an inlerIeaved fashion as shovn in lhe dalafIov TalIe
1 In aII olher runs, 2 coIunns are scheduIed for execulion al a line. One coIunn fron (A) of
Iig. 5 viII le scheduIed for execulion ly CI1, vhiIe anolher fron () of Iig. 5 viII le
scheduIed for CI2. Hovever, if nunler of coIunns in (A) and () of Iig. 5 is nol equaI, lhen
lhe Iasl run viII consisl of onIy one coIunn of (A). In lhal case, scheduIe lhe Iasl coIunn in
CI1 onIy, lul ils oulpul coefficienls viII le execuled ly lolh RI1 and RI2. The reason is
lhal if lhe Iasl coIunn is scheduIed for execulion ly lolh CI1 and CI2, lhey viII yieId nore
coefficienls lhan lhal can le handIed ly lolh RI1 and RI2.
On lhe olher hand, scheduIing RI1 and RI2 of 5/3 and 9/7 2-paraIIeI archileclures occur
according lo scan nelhod shovn in Iig. 6. In lhis scheduIing slralegy, aII rovs of even and
odd nunlers in Iig. 6 viII le scheduIed for execulion ly RI1 and RI2, respecliveIy. In lhe
firsl run, 4 coefficienls fron each 2 successive rovs viII le scheduIed for RI1 and RI2,
vhereas in aII sulsequenl runs, lvo coefficienls of each 2 successive rovs viII le scheduIed
for RI1 and RI2, as shovn in Iig. 6. Hovever, if nunler of coIunns in Iig. 6 is odd vhich
occur vhen nunler of coIunns in (A) and () of Iig. 5 is nol equaI, lhen lhe Iasl run vouId
require scheduIing one coefficienl fron each 2 successive rovs lo RI1 and RI2 as shovn in
coIunn 6 of Iig. 6.
In generaI, aII coefficienls lhal leIong lo coIunns of even nunlers in Iig. 6 viII le
generaled ly CI1 and lhal leIong lo coIunns of odd nunlers viII le generaled ly CI2. Ior
exanpIe, in lhe firsl run, CI1viII firsl generale lvo coefficienls IaleIed LO,O and L1,O lhal
leIong lo Iocalions O,O and 1,O in Iig. 6, vhiIe CI2 viII generale coefficienl HO,O and H1,O
lhal leIong lo Iocalions O,1 and 1,1. Then coefficienls in Iocalions O,O and O,1 are execuled ly
RI1, vhiIe coefficienls of Iocalions 1,O and 1,1 are execuled ly RI2. Second, CI1 viII
generale lvo coefficienls for Iocalions O,2 and 1,2, vhiIe CI2 generales lvo coefficienls for
Iocalions O,3 and 1,3. Then coefficienls in Iocalions O,2 and O,3 are execuled ly RI1, vhiIe
coefficienls in Iocalions 1,2 and 1,3 are execuled ly RI2. The sane process is repealed in lhe
nexl lvo rovs and so on.
In lhe second run, firsl, CI1 generales coefficienls of Iocalions O,4 and 1,4, vhereas CI2
generales coefficienls of Iocalions O,5 and 1,5 in Iig. 6. Then coefficienls in Iocalions O,4 and
O,5 are execuled ly RI1, vhiIe coefficienls in Iocalions 1,4 and 1,5 are execuled ly RI2. This
process is repealed unliI lhe run conpIeles. Hovever, in lhe even lhal lhe Iasl run processes
onIy one coIunn of (A), CI1 vouId generale firsl coefficienls of Iocalions O,n and 1,n
vhere n refers lo lhe Iasl coIunn. Then coefficienls of Iocalion O,n is passed lo RI1, vhiIe
coefficienl of Iocalion 1,n is passed lo RI2. In lhe second line, CI1 vouId generale
coefficienls of Iocalions 2,n and 3,n. Then 2,n is passed lo RI1 and 3,n lo RI2 and so on.
In lhe foIIoving, lhe dalafIov shovn in TalIe 1 for 2-paraIIeI pipeIined archileclure viII le
expIained. The firsl run, vhich ends al cycIe 16 in lhe lalIe, requires scheduIing four
coIunns as foIIovs. In lhe firsl cIock cycIe, reference lo cIock
2
f , coefficienls LLO,O and
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 19
LHO,O fron lhe firsl coIunn of LL3 and LH3 in lhe exlernaI nenory, respecliveIy, are
scanned and are Ioaded inlo CI1 Ialches RlO and Rl1 ly lhe negalive lransilion of
cIock 2
2
f . The second cIock cycIe scans coefficienls HLO,O and HHO,O fron lhe firsl coIunn
of HL3 and HH3, respecliveIy, lhrough lhe luses IaleIed |us0 and |us1 and Ioads lhen inlo
CI2 Ialches RlO and Rl1 ly lhe posilive lransilion of cIock 2
2
f . In lhe lhird cIock cycIe, lhe
scanning process relurns lo lhe second coIunn of sullands LL3 and LH3 in lhe exlernaI
nenory and scans coefficienls LLO,1 and LHO,1, respecliveIy, and Ioads lhen inlo CI1
Ialches RlO and Rl1 ly lhe negalive lransilion of lhe cIock 2
2
f . The fourlh cIock cycIe scans
coefficienls HLO,1 and HHO,1 fron lhe second coIunn of HL3 and HH3, respecliveIy, and
Ioads lhen inlo CI2 Ialches RlO and Rl1. The scanning process lhen relurns lo lhe firsl
coIunn in sullands LL3 and LH3 lo repeal lhe process unliI lhe firsl run is conpIeled.
In cycIe 9, CI1 generales ils firsl lvo oulpul coefficienls LO,O and L1,O, vhich leIong lo L3
deconposilion and Ioads lhen inlo ils oulpul Ialches RlIO and RlI1, respecliveIy, ly lhe
negalive lransilion of cIock 2
2
f . In cycIe 1O, CI2 generales ils firsl lvo oulpul coefficienls
HO,O and H1,O, vhich leIong lo H3 deconposilion and Ioads lhen inlo ils oulpul Ialches
RlhO and Rlh1, respecliveIy, ly lhe posilive lransilion of cIock 2
2
f .
In cycIe 11, conlenls of RlIO and RlhO are lransferred lo RI1 inpul Ialches RlO and Rl1,
respecliveIy. The sane cIock cycIe aIso lransfers conlenls of RlI1 and Rlh1 lo RI2 inpul
Ialches RlO and Rl1, respecliveIy, vhiIe lhe lvo coefficienls LO,1 and L1,1 generaled ly CI1
during lhe cycIe are Ioaded inlo RlIO and RlI1, respecliveIy, ly lhe negalive lransilion of
cIock f
2
/2.
In cycIe 21, lolh RI1 and RI2 each yieId ils firsl lvo oulpul coefficienls, vhich are Ioaded
inlo lheir respeclive oulpul Ialches ly lhe negalive lransilion of cIock 2
2
f . Conlenls of
lhese oulpul Ialches are lhen lransferred lo exlernaI nenory vhere lhey are slored in lhe
firsl 2 nenory Iocalions of each rovs O and 1. The dalafIov of lhe firsl run lhen proceeds as
shovn in TalIe 1. The second run legins al cycIe 19 and yieIds ils firsl 4 oulpul coefficienls
al cycIe 37.
VLS 20
Ck
f
2
CI CI1 & CI2
inpul
Ialches
RlO Rl1
CI1
oulpul
Ialches
RlIO RlI1
CI2
oulpul
Ialches
RlhO
Rlh1
RI1 inpul
Ialches
RlO Rl1
RI2
inpul
Ialches
RlO
Rl1
Oulpul Ialches of
RI1
RI2
RlO Rl1 RlO
Rl1





































R
U
N



1

1 1 LLO,O
LHO,O

2 2 HLO,O
HHO,O

3 1 LLO,1
LHO,1

4 2 HLO,1
HHO,1

5 1 LL1,O
LH1,O

6 2 HL1,O
HH1,O

7 1 LL1,1
LH1,1

8 2 HL1,1
HH1,1

9 1 LL2,O
LH2,O
LO,O L1,O
1O 2 HL2,O
HH2,O
HO,O
H1,O

11 1 LL2,1
LH2,1
LO,1 L1,1 LO,O
HO,O
L1,O
H1,O

12 2 HL2,1
HH2,1
HO,1
H1,1

13 1 LL3,O
LH3,O
L2,O L3,O LO,1
HO,1
L1,1
H1,1

14 2 HL3,O
HH3,O
H2,O
H3,O

15 1 LL3,1
LH3,1
L2,1 L3,1 L2,O
H2,O
L3,O
H3,O

16 2 HL3,1
HH3,1
H2,1
H3,1

















R
U
N



2


17 1 ------ -------
-
L4,O L5,O L2,1
H2,1
L3,1
H3,1

18 2 ------ ------- H4,O
H5,O

19 1 LLO,2
LHO,2
L4,1 L5,1 L4,O
H4,O
L5,O
H5,O

2O 2 HLO,2
HHO,2
H4,1
H5,1

21 1 LL1,2
LH1,2
L6,O L7,O L4,1
H4,1
L5,1
H5,1
XO,O XO,1 X1,O
X1,1
22 2 HL1,2
HH1,2
H6,O
H7,O

23 1 LL2,2
LH2,2
L6,1 L7,1 L6,O
H6,O
L7,O
H7,O
XO,2 ---- X1,2
-----
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 21
TalIe 1. DalafIov for 2-paraIIeI 5/3 archileclure

4.2 Modified CPs and RPs for 5/3 and 9/7 2-paraIIeI externaI architecture
Lach CI of lhe 2-paraIIeI exlernaI archileclure is required lo execule lvo coIunns in an
inlerIeave fashion in lhe firsl run and one coIunn in aII olher runs. Therefore, lhe 5/3
processor dalapalh deveIoped in (Ilrahin & Hernan, 2OO8) shouId le nodified as shovn in
Iig. 8 ly adding one nore slage lelveen slages 2 and 3 for 5/3 2- paraIIeI exlernaI
archileclure lo aIIov inlerIeaving of lvo coIunns as descriled in lhe dalafIov TalIe 1.
Through lhe lvo nuIlipIexers IaleIed mux lhe processor conlroIs lelveen execuling 2
coIunns and one coIunn. Thus, in lhe firsl run, lhe lvo nuIlipIexers conlroI signaI IaleIed
s is sel 1 lo aIIov inlerIeaving in execulion and O in aII olher runs. The nodified 9-slage CI
for 9/7 2-paraIIeI exlernaI archileclure can le ollained ly cascading lvo copies of Iig. 8.
On lhe olher hand, RI1 and RI2 of lhe proposed 2-paraIIeI archileclure for 5/3 and 9/7 are
required lo scan coefficienls of H and L deconposilions generaled ly CI1 and CI2
according lo lhe scan nelhod shovn in Iig. 6. In lhis scan nelhod, aII rovs of even nunlers
are execuled ly RI1 and aII rovs of odds nunlers are execuled ly RI2. Thal is, vhiIe RI1
is execuling rovO coefficienls, RI2 viII le execuling rov1 coefficienls and so on. In
addilion, Iooking al lhe DDCs for 5/3 and 9/7 shov lhal appIying lhe scan nelhods in Iig.
24 2 HL2,2
HH2,2
H6,1
H7,1

25 1 LL3,2
LH3,2
----- ------ L6,1
H6,1
L7,1
H7,1
X2,O X2,1 X3,O
X3,1
26 2 HL3,2
HH3,2
----- -----
27 1 LO,2 L1,2 ----- ----
-
----- --
----
X2,2 ---- X3,2
-----
28 2 HO,2
H1,2

29 1 L2,2 L3,2 LO,2
HO,2
L1,2
H1,2
X4,O X4,1 X5,O
X5,1
3O 2 H2,2
H3,2

31 1 L4,2 L5,2 L2,2
H2,2
L3,2
H3,2
X4,2 ---- X5,2
-----
32 2 H4,2
H5,2

33 1 L6,2 L7,2 L4,2
H4,2
L5,2
H5,2
X6,O X6,1 X7,O
X7,1
34 2 H6,2
H7,2

35 1 L6,2
H6,2
L7,2
H7,2
X6,2 ---- X7,2
-----
36 2
37 1 XO,3 XO,4 X1,3
X1,4
38 2
39 1 X2,3 X2,4 X3,3
X3,4
VLS 22
6 vouId require incIusion of lenporary Iine luffers (TLs) in RI1 and RI2 of lhe proposed
2-paraIIeI exlernaI archileclure as foIIovs. In lhe firsl run, lhe fourlh inpul coefficienl of each
rov in lhe DDCs and lhe oulpul coefficienls IaleIed X( 2) in lhe 5/3 DDCs and lhal IaleIed
Y"(2), Y"(1), and X(O) in lhe 9/7 DDCs, generaled ly considering 4 inpuls coefficienls in each
rov, shouId le slored in TLs, since lhey are required in lhe nexl runs conpulalions.
SiniIarIy, in lhe second run, lhe sixlh inpul coefficienl of each rov and lhe oulpul
coefficienls IaleIed X(4) in lhe 5/3 DDCs and lhal IaleIed Y"(4), Y"(3), and X(2) in lhe 9/7
DDCs generaled ly considering 2 inpuls coefficienls in each rov, shouId le slored in TLs.
AccordingIy, 5/3 vouId require addilion of 2 TLs each of size N, vhereas 9/7 vouId
require addilion of 4 TLs each of size N. Hovever, since 2-paraIIeI archileclure consisls of
lvo RIs, each 5/3 RI viII have 2 TLs each of size N/2 and each 9/7 RI viII have 4 TLs
each of size N/2 as shovn in Iig. 9. Iig. 9 (a) represenls lhe 5/3 nodified RI, vhiIe lolh (a)
and (l) represenl lhe 9/7 nodified RI for 2- paraIIeI archileclure.

1 se
0 se
2 se

s
s

Iig. 8. Modified inverse 5/3 CI for 2-paraIIeI LxlernaI archileclure

To have nore insighl inlo lhe lvo RIs operalions, lhe dalafIov for 5/3 RI1 is given in TalIe
2 for firsl and second runs. Nole lhal slage 1 inpul coefficienls in TalIe 2 are exaclIy lhe
sane inpul coefficienls of RI1 in TalIe 1. In lhe firsl run, TLs are onIy vrillen, lul in lhe
second run and in aII sulsequenl runs, TLs are read in lhe firsl haIf cycIe and vrillen in lhe
second haIf cycIe. In lhe cycIe 15, TalIe 2 shovs lhal coefficienls HO,1 is slored in lhe firsl
Iocalion of TL1, vhiIe coefficienl H2,1 is slored in lhe second Iocalion in cycIe 19 and so on.
Run 2 slarls al cycIe 29. In cycIe 3O, lhe firsl Iocalion of TL1, vhich conlains coefficienls
HO,1 is read during lhe firsl haIf cycIe of cIock 2
2
f and is Ioaded inlo Rd1 ly lhe posilive
lransilion of lhe cIock, vhereas coefficienl HO,2 is vrillen inlo lhe sane Iocalion in lhe
second haIf cycIe. Then, lhe negalive lransilion of cIock 2
2
f lransfers conlenls of Rd1 lo
Rl2 in slage 2.
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 23
0
1
1 se
0
1
0 se
+
2 se
+
W R
W R
0
1
0
1
0
1
1 se
0
1
0 se
+
2 se
+
W R
W R

Iig. 9. Modified RI for 2-paraIIeI archileclure (a) 5/3 and (a) & (l) 9/7

CK
f
2

RI1 inpul
Ialches
STACL 1
RlO Rl1
TL1

STACL 2
RlO Rl2 Rl1
RO

STACL 3
RlO Rl1 RO
TL2

STACL 4
RlO Rl1 Rl2
RI1oulpul
Ialches
RlO Rl1










R
U
N


1

11 LO,O HO,O ----- -----
13 LO,1 HO,1 LO,O ---- HO,O ----- -----
15 L2,O H2,O
HO,1
LO,1 ---- HO,1
HO,O
XO,O --- ---- ----- -----
17 L2,1 H2,1 L2,O ---- H2,O
-----
XO,2 HO,O XO,O XO,O ---- ---- ----- -----
19 L4,O H4,O
H2,1
L2,1 ---- H2,1
H2,O
X2,O ----- XO,2
XO,2
XO,2 HO,O
XO,O
----- -----
VLS 24
21 L4,1 H4,1 L4,O ---- H4,O
-----
X2,2 H2,O X2,O X2,O -----
XO,2
XO,O XO,1
23 L6,O H6,O
H4,1
L4,1 ---- H4,1
H4,O
X4,O ----- X2,2
X2,2
X2,2 H2,O
X2,O
XO,2 -----
25 L6,1 H6,1 L6,O ---- H6,O
-----
X4,2 H4,O X4,O X4,O -----
X2,2
X2,O X2,1















R
U
N

2

27 ----- -----
H6,1
L6,1 ---- H6,1
H6,O
X6,O ----- X4,2
X4,2
X4,2 H4,O
X4,O
X2,2 -----
29 LO,2 HO,2 ----
-
----- ----- ----- X6,2 H6,O X6,O X6,O -----
X4,2
X4,O X4,1
31 L2,2 H2,2
HO,2
LO,2 HO,1 HO,2
-----
---- ----- X6,2
X6,2
X6,2 H6,O
X6,O
X4,2 -----
33 L4,2 H4,2
H2,2
L2,2 H2,1 H2,2
-----
XO,4 HO,1 -----
X6,2
---- ----
X6,2
X6,O X6,1
35 L6,2 H6,2
H4,2
L4,2 H4,1 H4,2
-----
X2,4 H2,1 XO,4
XO,4
XO,4 HO,1
XO,2
X6,2 -----
37 ---- -----
H6,2
L6,2 H6,1 H6,2
-----
X4,4 H4,1 -----
X2,4
X2,4 H2,1
X2,2
XO,3 XO,4
39 ---- ----- ----- ----- ------ -
-----
X6,4 H6,1 -----
X4,4
X4,4 H4,1
X4,2
X2,3 X2,4
41 ---- ----- ----- ----- ------ -
-----
------ ----- -----
X6,4
X6,4 H6,1
X6,2
X4,3 X4,4
43 ---- ----- ----- ----- ------ -
-----
----- ----- ------ -
-----
----- ----- ---
---
X6,3 X6,4
TalIe 2. DalafIov of lhe 5/3 RI1 (Iig. 9a)

In Iig. 9 (a), lhe conlroI signaI, s, of lhe lvo nuIlipIexers IaleIed nux is sel 1 during run 1
lo pass RO of lolh slages 2 and 3, vhereas in aII olher runs, il is sel O lo pass coefficienls
slored in TL1 and TL2.

4.3 Proposed 4-paraIIeI externaI architecture
To furlher increase speed of conpulalions lvice as lhal of lhe 2-paraIIeI archileclure, lhe 2-
paraIIeI archileclure is exlended lo 4-paraIIeI archileclure as shovn in Iig. 1O (a). This
archileclure is vaIid for 5/3, 9/7, and conlined 5/3 and 9/7. Il consisls of 4 |-slage
pipeIined CIs and 4 |-slage pipeIined RIs. The vaveforns of lhe 3 cIocks f
4
, f
4a
, and f
4|
used
in lhe archileclure are shovn in Iig. 1O (l). The frequency of cIock f
4
is delernined fron
Lq(3) as

p
t k f 4
4
= (5)
The archileclure scans lhe exlernaI nenory vilh frequency f
4
and il operales vilh frequency
f
4a
and f
4|
. Lvery line cIock f
4a
nakes a negalive lransilion CI1 Ioads inlo ils inpul Ialches
RlO and Rl1 lvo nev coefficienls scanned fron exlernaI nenory lhrough lhe luses IaleIed
|us0 and |us1, vhereas CI3 Ioads every line cIock f
4a
nakes a posilive lransilion. CI2 and
CI4 Ioads every line cIock f
4|
nakes a negalive and a posilive lransilion, respecliveIy, as
indicaled in Iig. 1O (l).

On lhe olher hand, lolh RI1 and RI2 Ioad sinuIlaneousIy nev dala
inlo lheir inpul Ialches RlO and Rl1 each line cIock f
4a
nakes a negalive lransilion, vhereas
RI3 and RI4 Ioads each line cIock f
4|
nakes a negalive lransilion.
The dalafIov for 4-paraIIeI 5/3 exlernaI archileclure is given in TalIe 3, vhere CIs and RIs
are assuned lo le 3- and 4-slage pipeIined processors, respecliveIy. The dalafIov lalIe for
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 25
4-paraIIeI 9/7 exlernaI archileclure is siniIar in aII runs lo lhe 5/3 dalafIov excepl in lhe
firsl run, vhere RIs of lhe 9/7 archileclure, specificaIIy RI3 and RI4 generale a pallern of
oulpul coefficienls differenl fron lhal of lhe 5/3. RI3 and RI4 of lhe 9/7 archileclure vouId
generale every cIock cycIe, reference lo cIock f
4|
, lvo oulpul coefficienls as foIIovs. Suppose,
al cycIe nunler n lhe firsl lvo coefficienls X(O,O) and X(1,O) generaled ly RI3 and RI4,
respecliveIy, are Ioaded inlo oulpul Ialch RlO of lolh processors. Then, in cycIe n+1, RI3 and
RI4 generale coefficienls X(2,O) and X(3,O) foIIoved ly coefficienls X(4,O) and X(5,O) in cycIe
n+1 and so on. Nole lhal lhese oulpul coefficienls are lhe coefficienls generaled ly RI1 and
RI2 in TalIe 3.
The slralegy used for scheduIing nenory coIunns for CIs of lhe 5/3 and 9/7 4-paraIIeI
archileclure, vhich resenlIe lhe one adopled for 2-paraIIeI archileclure, is as foIIov. In lhe
firsl run, lolh 5/3 and 9/7 4-paraIIeI archileclure viII le scheduIed lo execule 4 coIunns of
nenory, lvo fron (A) and lhe olher lvo fron (), lolh of Iig. 5. Lach CI viII le assigned
lo execule one coIunn of nenory coefficienls as iIIuslraled in lhe firsl run of lhe dalafIov
shovn in TalIe 3, vhereas in aII sulsequenl runs, 2 coIunns al a line viII scheduIed for
execulion ly 4 CIs. One coIunn fron Iig. 5 (A) viII le assigned lo lolh CI1 and CI3, vhiIe
lhe olher fron Iig. 5 () viII le assigned lo lolh CI2 and CI4 as shovn in lhe second run of
TalIe 3. Hovever, if nunler of coIunns in (A) and () of Iig. 5 is nol equaI, lhen lhe Iasl
run viII consisl of onIy one coIunn of (A). In lhal case, scheduIe lhe Iasl coIunns
coefficienls in lolh CI1 and CI3 as shovn in lhe lhird run of TalIe 3, since an allenpl lo
execule lhe Iasl coIunn using 4 CIs vouId resuIl in nore coefficienls leen generaled lhan
lhal can le handIed ly lhe 4 RIs.
On lhe olher hand, scheduIing rovs coefficienls for RIs, vhich lake pIace according lo scan
nelhod shovn in Iig. 6, can le underslood ly exanining lhe dalafIov shovn in TalIe 3. Al
cycIe 13, CI1 generales ils firsl lvo oulpul coefficienls IaleIed LO,O and L1,O, vhich
correspond lo Iocalions O,O and 1,O in Iig. 6, respecliveIy. In cycIe 14, CI2 generales ils firsl
lvo oulpul coefficienls HO,O and H1,O, vhich correspond lo Iocalions O,1 and 1,1 in Iig. 6,
respecliveIy. In cycIe 15, CI3 generale ils firsl lvo coefficienls LO,1 and L1,1, vhich
correspond lo Iocalions O,2 and 1,2 in Iig. 6, respecliveIy. In cycIe 16, CI4 generales ils firsl
lvo oulpul coefficienls HO,1 and H1,O vhich correspond lo Iocalions O,3 and 1,3 in Iig. 6.
Nole lhal LO,O, HO,O, LO,1, and HO,1 represenls lhe firsl 4 coefficienls of rov O in Iig. 6,
vhereas L1,O, H1,O, LL1,1 and H1,1, represenl lhe firsl 4 coefficienls of rov1.
In cycIe 17 and 18, lhe firsl lvo rovs coefficienls are scheduIed for RIs as shovn in TalIe 3,
vhiIe CIs generaling coefficienls of lhe nexl lvo rovs, rov2 and rov3. TalIe 3 shovs lhal
lhe firsl 4 coefficienls of rov O are scheduIed for execulion ly RI1 and RI3, vhiIe lhe firsl 4
coefficienls of rov 1 are scheduIed for RI2 and RI4. In addilion, nole lhal aII coefficienls
generaled ly CI4, vhich leIong lo coIunn 3 in Iig. 6, are required in lhe second runs
conpulalions, according lo lhe DDCs. Therefore, lhis vouId require incIusion of a TL of
size N/4 in each of lhe 4 RIs lo slore lhese coefficienls. The second run, hovever, requires
lhese coefficienls lo le slored in lhe 4 TLs in a cerlain vay as foIIovs. Coefficienls HO,1 and
H1,1 generaled ly CI4 in cycIe 16 shouId le slored in lhe firsl Iocalion of TL of RI1 and
RI2, respecliveIy. These lvo coefficienls vouId le passed lo lheir respeclive TL lhrough
lhe inpul Ialches of RI1 and RI2 IaleIed Rl2, as shovn in cycIe 17 of TalIe 3, vhereas,
coefficienls H2,1 and H3,1 generaled ly CI4 al cycIe 2O shouId le slored in lhe firsl Iocalion
of TL of RI3 and RI4, respecliveIy. These lvo coefficienls are passed lo lheir respeclive
TL for slorage lhrough lhe inpul Ialches of RI3 and RI4 IaleIed Rl1, as shovn in cycIe 22
VLS 26
of TalIe 3. SiniIarIy, coefficienls H4,1 and H5,1 generaled ly CI4 al cycIe 24 shouId le
slored in lhe second Iocalion of TL of RI1 and RI2, respecliveIy, and so on. Nole lhal lhese
TLs are IaleIed TL1 in Iig. 12.

a
f
4
b
f
4
4
f
4
4 4
f f
a
=
4
4 4
f f
b
=
a
f
4
b
f
4

Iig. 1O. (a) Iroposed 2-D IDWT 4-paraIIeI pipeIined exlernaI archileclure for 5/3 and 9/7
and conlined (l) Waveforns of lhe cIocks


High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 27
CK
f
4
CI CIs inpul
Lalches
RlO
Rl1
CIs 1
&3
Oul
Ialches
RlIO
RlI1
CIs 2 &
4
Oul
Ialches
RlhO
Rlh1
RIs 1 & 3
inpul
Ialches
RI RlO
Rl1 Rl2
RIs 2 & 4
inpul
Ialches
RI RlO
Rl1 Rl2
RIs 1 &
3
Oul
Ialches
RlO
Rl1
RIs 2 &
4
Oul
Ialches
RlO
Rl1







































R
U
N





1

1 1 LLO,O
LHO,O

2 2 HLO,O
HHO,O

3 3 LLO,1
LHO,1

4 4 HLO,1
HHO,1

5 1 LL1,O
LH1,O

6 2 HL1,O
HH1,O

7 3 LL1,1
LH1,1

8 4 HL1,1
HH1,1

9 1 LL2,O
LH2,O

1O 2 HL2,O
HH2,O

11 3 LL2,1
LH2,1

12 4 HL2,1
HH2,1

13 1 LL3,O
LH3,O
LO,O
L1,O

14 2 HL3,O
HH3,O
HO,O
H1,O

15 3 LL3,1
LH3,1
LO,1
L1,1

16 4 HL3,1
HH3,1
HO,1
H1,1

17 1 LL4,O ---
---
L2,O
L3,O
1 LO,O
HO,O HO,1
2 L1,O
H1,O H1,1

18 2 HL4,O --
----
H2,O
H3,O
3 LO,1
HO,1 HO,O
4 L1,1
H1,1 H1,O

19 3 LL4,1 ---
---
L2,1
L3,1

2O 4 HL4,1 --
----
H2,1
H3,1

R
U
N




2

21 1 LLO,2
LHO,2
L4,O
L5,O
1 L2,O
H2,O -----
2 L3,O
H3,O -----

22 2 HLO,2
HHO,2
H4,O
H5,O
3 L2,1
H2,1 H2,O
4 L3,1
H3,1 H3,O

23 3 LL1,2 L4,1
VLS 28
LH1,2 L5,1
24 4 HL1,2
HH1,2
H4,1
H5,1

25 1 LL2,2
LH2,2
L6,O
L7,O
1 L4,O
H4,O H4,1
2 L5,O
H5,O H5,1

26 2 HL2,2
HH2,2
H6,O
H7,O
3 L4,1
H4,1 H4,O
4 L5,1
H5,1 H5,O

27 3 LL3,2
LH3,2
L6,1
L7,1

28 4 HL3,2
HH3,2
H6,1
H7,1

29 1 LL4,2 ---
---
L8,O ----
-
1 L6,O
H6,O -----
2 L7,O
H7,O -----

3O 2 HL4,2 --
----
H8,O ---
--
3 L6,1
H6,1 H6,O
4 L7,1
H7,1 H7,O

31 3 ------- ---
----
L8,1 ----
-

32 4 ------- ---
----
H8,1 ---
--


















R
U
N



3

33 1 LLO,3
LHO,3
LO,2
L1,2
1 L8,O
H8,O H8,1
2 ----- ---
- ----
XO,O ---
-
X1,O ---
--
34 2 ----- --
-----
HO,2
H1,2
3 L8,1
H8,1 H8,O
4 ----- ---
- ----
XO,1
XO,2
X1,1
X1,2
35 3 LL1,3
LH1,3
L2,2
L3,2

36 4 ------ ---
----
H2,2
H3,2

37 1 LL2,3
LH2,3
L4,2
L5,2
1 LO,2
HO,2 -----
2 L1,2
H1,2 -----
X2,O ---
-
X3,O ---
--
38 2 ------- ---
----
H4,2
H5,2
3 L2,2
H2,2 -----
4 L3,2
H3,2 -----
X2,1
X2,2
X3,1
X3,2
39 3 LL3,3
LH3,3
L6,2
L7,2

4O 4 ------- ---
----
H6,2
H7,2

41 1 LL4,3 ---
---
L8,2 ----
-
1 L4,2
H4,2 -----
2 L5,2
H5,2 -----
X4,O ---
-
X5,O ---
--
42 2 ------- ---
---
H8,2 ---
--
3 L6,2
H6,2 -----
4 L7,2
H7,2 -----
X4,1
X4,2
X5,1
X5,2
43 3 ------- ---
----
----- ---
--

44 4 ------- ---
----
------ ---
---

45 1 LO,3
L1,3
1 L8,2
H8,2 -----
2 ----- ---
- ----
X6,O ---
-
X7,O ---
--
46 2 ------ --
----
3 ----- ----
- -----
4 ----- ---
- ----
X6,1
X6,2
X7,1
X7,2
47 3 L2,3
L3,3

48 4 ------ --
----

High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 29
49 1 L4,3
L5,3
1 LO,3 ----
- -----
2 L1,3 ---
-- -----
X8,O ---
-
----- ---
--
5O 2 ------ --
----
3 L2,3 ---
-- -----
4 L3,3 ---
-- -----
X8,1
X8,2
----- ---
--
51 3 L6,3
L7,3

52 4 ------ --
----

53 1 L8,3 ----
-
1 L4,3 ---
-- -----
2 L5,3 ---
-- -----
XO,3
XO,4
X1,3
X1,4
54 2 ------ --
----
3 L6,3 ---
-- -----
4 L7,3 ---
-- -----
X2,3
X2,4
X3,3
X3,4
55 3 ------ ---
---

56 4 ----- ---
---

TalIe 3. DalafIov for 4-paraIIeI 5/3 archileclure

Al cycIe 33, RI1 and RI2 yieId lheir firsl oulpul coefficienls XO,O and X1,O, respecliveIy,
vhich nusl le slored in lhe exlernaI nenory Iocalions O,O and 1,O, respecliveIy. Nole lhal
indexes of each oulpul coefficienl indicale exlernaI nenory Iocalion vhere lhe coefficienl
shouId le slored.
The second run, vhich requires scheduIing lvo coIunns for execulion ly CIs, slarls al cycIe
21. In cycIe 33, il generales ils firsl lvo oulpul coefficienls LO,2 and L1,2, vhich leIong lo
Iocalions O,4 and 1,4 in Iig. 6, respecliveIy. In cycIe 34, CI2 generales coefficienls HO,2 and
H1,2 vhich leIong lo Iocalions O,5 and 1,5 in Iig. 6. In cycIe 35, CI3 generales coefficienls
L2,2 and L3,2, vhich leIong lo Iocalions 2,4 and 3,4 in Iig. 6, vhereas in cycIe 36, CI4
generales coefficienls H2,2 and H3,2 lhal leIong lo Iocalions 2,5 and 3,5 in Iig. 6. Iron lhe
alove descriplion il is cIear lhal lhese 8 coefficienls are dislriluled aIong 4 rovs, O lo 3 in
Iig. 6 vilh each rov having 2 coefficienls. TalIe 3 shovs lhal in cycIe 37, lhe lvo coefficienls
of rov O, LO,2 and HO,2, and lhe lvo coefficienls of rov 1, L1,2 and H1,2 are scheduIed for
RI1 and RI2, respecliveIy, vhiIe coefficienls L4,2 and L5,2 generaled ly CI1 during lhe
cycIe are Ioaded inlo RlIO and RlI1, respecliveIy. In cycIe 38, lhe lvo coefficienls of rov 2,
L2,2 and H2,2, and lhal of rov 3 L3,2 and H3,2 are scheduIed for RI3 and RI4, respecliveIy,
vhiIe coefficienls H4,2 and H5,2 generaled ly CI2 during lhe cycIe are Ioaded inlo RlhO
and Rlh1, respecliveIy. In cycIe 53, RI1 and RI2 generale lhe firsl oulpul coefficienls of lhe
second run.

4.4 CoIumn and row processors for 5/3 and 9/7 4-paraIIeI externaI architecture
The 5/3 and lhe 9/7 processors dalapalh archileclures proposed in (Ilrahin & Hernan,
2OO8) vere deveIoped assuning lhe processors scan exlernaI nenory eilher rov ly rov or
coIunn ly coIunn. Hovever, CIs and RIs of lhe 4-paraIIeI archileclure are required lo scan
exlernaI nenory according lo scan nelhods shovn in Iigs. 5 and 6, respecliveIy. The 4-
paraIIeI archileclure, in addilion, inlroduces lhe requirenenl for inleraclions anong lhe
processors in order lo acconpIish lheir lask. Therefore, lhe processors dalapalh
archileclures proposed in (Ilrahin & Hernan, 2OO8) shouId le nodified lased on lhe scan
nelhods and laking inlo consideralion aIso lhe requirenenl for inleraclions so lhal lhey fil
VLS 30
inlo lhe 4-paraIIeIs processors. Thus, in lhe foIIoving, lhe nodified CIs viII le deveIoped
firsl foIIoved ly RIs.

4.5 Modified CPs for 4-paraIIeI architecture
The 4 CIs of lhe 4-paraIIeI archileclure each is required in lhe firsl run lo execule one
coIunn al a line. Thal neans lhe firsl run requires no nodificalions of lhe 5/3 and 9/7
dalapalh archileclures proposed in (Ilrahin & Hernan, 2OO8). Hovever, in aII sulsequenl
runs, each lvo processors (CI1 and CI3 or CI2 and CI4) are assigned lo execule one
coIunn logelher, vhich requires inleraclions lelveen lhe lvo processors lo acconpIish lhe
required lask. Therefore, lolh CIs 1 and 3, siniIarIy, CIs 2 and 4 shouId le nodified as
shovn in Iig. 11 lo aIIov inleraclions lo lake pIace. The lvo processors connunicale or
inleracl lhrough lhe palhs (luses) IaleIed P1, P2, P3, and P4. Iig. 11 shovs nodified 5/3
CIs 1 and 3 vhich is idenlicaI lo CIs 2 and 4. Iig. 11 aIso represenls lhe firsl 3 slages of lolh
9/7 CIs 1 and 3 and CIs 2 and 4 and lhe renaining slages are idenlicaI lo slages 1 lo 3. Nole
lhal since lhe firsl 3 slages of 5/3 and 9/3 are siniIar in slruclure, lhe 5/3 processor can le
easiIy incorporaled inlo 9/7 processor lo ollain lhe conlined 5/3 and 9/7 processor for 4-
paraIIeI archileclure.
The conlroI signaI, s of lhe 4 nuIlipIexers, IaleIed mux is sel O in lhe firsl run lo aIIov each
processor lo execule one coIunn and 1 in aII olher runs lo aIIov execulion of one coIunn ly
lvo processors.
In lhe foIIoving, lhe second runs dalafIov of Iig. 11, vhich slarls al cycIe 21 in TalIe 3 viII
le descriled. In cycIe 21, lvo coefficienls LLO,2 and LHO,2, vhich correspond lo lhe firsl
lvo coefficienls of lhe lhird coIunn in Iig. 5 (A), are read fron exlernaI nenory and are
Ioaded inlo CI1 firsl slage Ialches RlO and Rl1, respecliveIy, ly negalive lransilion of cIock
f
4a
lo iniliale lhe firsl operalion. In cycIe 23, lhe lhird and lhe fourlh coefficienls LL1,2 and
LH1,2 of lhe lhird coIunn are scanned fron exlernaI nenory and are Ioaded aIong vilh Rl1
in slage1 of CI1 inlo CI3 firsl slage Ialches RlO, Rl1 and Rd3, respecliveIy, ly lhe posilive
lransilion of cIock f
4a
. Nole lhal conlenl of Rl1 in slage 1 of CI1 lakes lhe palh IaleIed P1 lo
Rd3.
In cycIe 25, coefficienls LL2,2 and LH2,2 are scanned fron exlernaI nenory and are Ioaded
aIong vilh Rl1 in slage 1 of CI3 inlo CI1 firsl slage Ialches RlO, Rl1, and Rd1, respecliveIy,
ly lhe negalive lransilion of cIock f
4a
, vhiIe coefficienl X(O) caIcuIaled in slage 1 of CI1 and
conlenl of Rl1 are Ioaded inlo RlO and Rl1 of slage 2, respecliveIy.
In cycIe 27, coefficienls LL3,2 and LH3,2 are scanned fron exlernaI nenory and are Ioaded,
aIong vilh Rl1 in slage 1 of CI1, inlo CI3 firsl slage Ialches RlO, Rl1, and Rd3, respecliveIy,
ly lhe posilive lransilion of cIock f
4a
, vhiIe coefficienls X(2) caIcuIaled in slage 1 and conlenl
of Rl1 are lransferred lo RlO and Rl1 of lhe nexl slage, respecliveIy.
In cycIe 29, onIy one coefficienl LL4,2 is scanned fron exlernaI nenory, vhich inpIies lhe
lhird coIunn of Iig. 5 (A) conlains odd nunler of coefficienls, and is Ioaded aIong vilh Rl1
in slage1 of CI3 inlo CI1 firsl slage Ialches RlO and Rd1, respecliveIy, ly lhe negalive
lransilion of cIock f
4a
, vhiIe coefficienl X(4) caIcuIaled in slage 1 and conlenl of Rl1 are
lransferred lo RlO and Rl1 of slage 2 and lhal of RlO and Rl1 incIuding RlO in slage 2 of CI3
lo RlO, Rl1, and Rl2 in slage 3 of CI1, respecliveIy, lo conpule lhe firsl high coefficienl
IaleIed X(1) in 5/3 DDCs. Nole lhal CI1 Ialches RlO and Rl2 of slage 3 nov conlain
coefficienls X(O) and X(2), vhiIe Rl1 conlains coefficienl LHO,2 (or Y(1) as IaleIed in lhe 5/3
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 31
DDCs). These 3 coefficienls are required according lo lhe DDCs lo conpule lhe high
coefficienl, X(1).

1 se
0 se
2 se

s
0
1
s
a
f
4
a
f
4
a
f
4
a
f
4 a
f
4 a
f
4
a
f
4 a
f
4
1 se
0 se
2 se

s
0
1
s
a
f
4
a
f
4
a
f
4
a
f
4 a
f
4 a
f
4
a
f
4 a
f
4

Iig. 11. Modified 5/3 CIs 1 & 3 for 4-paraIIeI archileclure

VLS 32
In cycIe 31, lhe posilive lransilion of cIock f
4a
lransfers RlO and Rl1 in slage 2 of CI3 and RlO
in slage 2 of CI1 lo slage 3 Ialches RlO, Rl1, and Rl2 of CI3, respecliveIy, lo conpule lhe
second high coefficienl, X(3). WhiIe coefficienl X(6) caIcuIaled in slage 1 of CI3 and
coefficienl in Rl1 are Ioaded inlo RlO and Rl1 of slage 2. As indicaled in cycIe 31 of TalIe 3,
no coefficienls are Ioaded inlo CI3 firsl slage Ialches.
In cycIe 33, lhe negalive lransilion of cIock f
4a
Ioads lhe firsl high coefficienl, X(1) caIcuIaled
in slage 3 of CI1 and RlO, vhich conlains X(O), inlo CI1 oulpul Ialches RlO and Rl1,
respecliveIy. Coefficienls X(O) and X(1) are IaleIed LO,2 and L1,2 in TalIe 3. The sane
negalive lransilion of cIock f
4a
lransfers conlenls of RlO and Rl1 in slage 2 of CI1 and RlO in
slage 2 of CI3 lo RlO, Rl1, and Rl2 in slage 3 of CI1, respecliveIy, lo conpule lhe lhird
coefficienl, X(3) and so on.

4.6 Modified RPs for 4-paraIIeII architecture
In seclion 5.2, il has leen poinled oul lhe reasons for incIuding TLs in lhe lvo RIs of lhe 2-
paraIIeI archileclure. Ior lhe sane reasons, il is aIso necessary lo incIude TLs in lhe 4 RIs
of lhe 4-paraIIeI archileclure, as shovn in Iigs. 12 (a) and (a,l) for 5/3 and 9/7, respecliveIy.
The processor dalapalh for lolh RI1 and RI3, vhich is aIso idenlicaI lo lhe processor
dalapalh of lolh RI2 and RI4, are dravn logelher in Iigs.12 (a) and (a,l) for 5/3 and 9/7,
respecliveIy, lecause in lhe firsl run, lolh processors are required lo execule logelher lhe 4
coefficienls of each rov. Which inpIies inleraclions lelveen lhe lvo processors during lhe
conpulalions and lhal lake pIace lhrough lhe palhs (luses) IaleIed P1, P2, P3, and P4.
Hovever, in aII sulsequenl runs, according lo lhe scan nelhod shovn in Iig. 6, each RI viII
le scheduIed lo execule each cIock cycIe lvo coefficienls of a rov as shovn in cycIes 37 and
38 of TalIe 3. The advanlage of lhis organizalion is lhal lhe lolaI size of lhe TLs does nol
increase fron lhal of lhe singIe pipeIined archileclure, vhen il is exlended lo 2- and 4-
paraIIeI archileclure.
In lhe firsl run, aII TLs in Iig. 12 viII le vrillen onIy, vhereas in aII olher runs, lhe sane
Iocalion of a TL viII le read in lhe firsl haIf cycIe and vrillen in lhe second haIf cycIe vilh
respecl lo cIock f
4a
or f
4|
.
The conlroI signaI, s of lhe six nuIlipIexers, IaleIed mux in Iig. 12, is sel O in lhe firsl run lo
aIIov in lhe RI1, coefficienl coning lhrough palh O of each nuIlipIexer lo le slored in ils
respeclive TL, vhereas in lhe RI3, il aIIovs conlenls of Rl2 and Rd1 in slages 1 and 3,
respecliveIy, lo le passed lo lhe nexl slage. In aII sulsequenl runs, s is sel 1 lo aIIov in lhe
RI1, coefficienl coning lhrough palh 1 of each nuIlipIexer lo le slored in lhe TL, vhereas
in lhe RI3, il passes coefficienls read fron TL1 and TL2 lo nexl slage.
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 33
0
1
1 se
0
1
0 se
+
2 se
+
W R
W R
0
1
4
N
0 1
4
N
a
f
4 a
f
4
a
f
4 a
f
4
a
f
4
0
1
1 se
0
1
0 se
+
2 se
+
W R
W R
0
1
4
N
0
1
4
N
b
f
4
b
f
4
b
f
4
b
f
4
b
f
4
b
f
4
b
f
4
sco
sco
a
f
4
Iig. 12. (a) Modified 5/3 RIs 1 and 3 for 4-paraIIeI LxlernaI Archileclure

In lhe foIIoving, lhe dalafIov of lhe processor dalapalh archileclure shovn in Iig. 12 (a)
viII descriled in delaiIs slarling fron cycIe 17 in TalIe 3. The delaiIed descriplions viII
enalIe us lo fuIIy undersland lhe lehavior of lhe processor. In cycIe 17, lhe negalive
lransilion of cIock f
4a
(consider il lhe firsl cycIe of f
4a
)

Ioads coefficienls LO,O and HO,O, vhich
represenl lhe firsl lvo coefficienls of rov O in Iig. 6, and HO,1 inlo RI1 firsl slage Ialches
RlO, Rl1, and Rl2, respecliveIy. During lhe posilive (high) puIse of cIock f
4a
, coefficienl in Rl2
is slored in lhe firsl Iocalion of TL1.
In cycIe 18, TalIe 3 shovs lhal lhe negalive lransilion of cIock f
4|
(consider il lhe firsl cycIe of
f
4|
) Ioads coefficienls LO,1, HO,1, and HO,O of rov O inlo RI3 firsl slage Ialches RlO, Rl1, and Rl2,
VLS 34
respecliveIy. In cycIe 21, lhe negalive lransilion of lhe second cycIe of cIock f
4a
lransfers
conlenls of RI1 Ialches RlO and Rl1 of lhe firsl slage lo slage 2 Ialches RlO and Rl1,
respecliveIy, lo conpule lhe firsl Iov coefficienl, XO,O, vhiIe Ioading lvo nev coefficienls
L2,O and H2,O, vhich are lhe firsl lvo coefficienls of rov 2 in Iig. 6, inlo RI1 firsl slage
Ialches RlO and Rl1, respecliveIy.
0
1
1 se
0
1
0 se
+
2 se
+
W R
W R
0
1
4
N
0 1
4
N
a
f
4
a
f
4
a
f
4
a
f
4 a
f
4
a
f
4
0
1
1 se
0
1
0 se
+
2 se
+
W R
W R
4
N
4
N
b
f
4
b
f
4
b
f
4
b
f
4
b
f
4
b
f
4
b
f
4

Iig. 12. (l) Modified 9/7 RIs 1 and 3 for 4-paraIIeI LxlernaI Archileclure

During lhe second cycIe of cIock f
4a
no dala are slored in TL1 of RI1.
In cycIe 22, lhe negalive lransilion of lhe second cycIe of cIock f
4|
lransfers conlenls of RI3
Ialches RlO, Rl1 and Rl2 of lhe firsl slage lo slage 2 Ialches RlO, Rl1 and Rl2 lo conpule lhe
second Iov coefficienl of rov O, XO,2, vhiIe Ioading lvo nev coefficienls L2,1, H2,1, and
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 35
H2,O inlo RI3 firsl slage Ialches RlO, Rl1 and Rl2, respecliveIy. During lhe second cycIe of
cIock f
4|
, lhe posilive puIse slores conlenl of Rl1 of lhe firsl slage in lhe firsl Iocalion of TL1
of RI3.
In cycIe 25, lhe negalive lransilion of lhe lhird cycIe of cIock f
4a
Ioads coefficienl XO,O
conpuled in slage 2 of RI1, inlo RlO of slage 3 and lransfers conlenls of Rl1 and RlO of slage
1 inlo Rl1 and RlO of slage 2 in order lo conpule lhe firsl Iov coefficienl, X2,O of rov 2
IaleIed X(O) in lhe 5/3 DDCs, vhiIe Ioading nev coefficienls L4,O, H4,O, and H4,1 of rov 4
in Iig. 6, inlo RI1 firsl slage Ialches RlO, Rl1 and Rl2, respecliveIy. During lhe lhird cycIe of
cIock f
4a
, lhe posilive puIse slores Rl2 of lhe firsl slage in lhe second Iocalion of TL1 of RI1.
In cycIe 26, lhe negalive lransilion of lhe lhird cycIe of cIock f
4|
Ioads coefficienl XO,2
caIcuIaled in slage 2 of RI3 inlo lolh RlO in slage 3 of RI3 and Rd3 in slage 3 of RI1, vhiIe
conlenl of Rl2 in slage 2 of RI3 is lransferred lo Rl1 of slage 3 and lhal of RlO in slage 3 of
RI1 lo Rd1 in slage 3 of RI3. The sane negalive lransilion of cIock f
4|
aIso lransfers RlO, Rl1,
and Rl2 of slage 1 inlo RlO, Rl1, and Rl2 of slage 2, respecliveIy, lo conpule lhe second Iov
coefficienl, X2,2 of rov 2, vhiIe Ioading nev coefficienls L4,1, H4,1, and H4,O of rov 4 inlo
RI3 firsl slage Ialches RlO, Rl1, and Rl2, respecliveIy. During lhe lhird cycIe of cIock f
4|
,
coefficienl in Rl1 is nol slored in TL1 of RI3. Il is inporlanl lo nole lhal Rd3 in slage 3 of
RI1, vhich hoIds coefficienl XO,2, viII le slored in lhe firsl Iocalion of TL2 ly lhe posilive
puIse of lhe lhird cycIe of cIock f
4a
.
In cycIe 29, lhe fourlh cycIes negalive lransilion of cIock f
4a
lransfers RlO in slage 3 of RI1,
vhich conlains XO,O, lo RlO of slage 4, vhiIe X2,O caIcuIaled in slage 2 is Ioaded inlo RlO of
slage 3. The sane negalive lransilion of cIock f
4a
lransfer Rl1 and RlO of slage 1 lo Rl1 and
RlO of slage 2, respecliveIy, lo conpule lhe firsl Iov coefficienl of rov 4, X4,O, and Ioads nev
coefficienls L6,O and H6,O inlo RlO and Rl1 of slage 1, respecliveIy. During lhe fourlh cycIe
of cIock f
4a
, coefficienl in Rl2 of slage 1 is nol slored in TL1 of RI1.
In cycIe 3O,lhe negalive lransilion of lhe fourlh cycIe of cIock f
4|
lransfers conlenls of RlO,
Rl1, and Rd1 in slage 3 of RI3 lo RlO, Rl1, and Rl2 of lhe nexl slage, respecliveIy, lo conpule
lhe firsl high coefficienl, XO,1 of rov O. WhiIe coefficienl X2,2 caIcuIaled in slage 2 of RI3 is
lransferred lo lolh RlO in slage3 of RI3 and Rd3 in slage 3 of RI1 and coefficienl X2,O in RlO,
in slage 3 of RI1, is Ioaded inlo Rd1 in slage 3 of RI3, vhereas Rl2 of slage 2 is lransferred
lo Rl1 in slage 3 of RI3. The sane negalive lransilion of cIock f
4|
lransfers conlenls of RlO,
Rl1, and Rl2 of slage 1 lo RlO, Rl1, and Rl2 of slage 2 lo conpule lhe second Iov coefficienl,
X4,O of rov 4, vhiIe Ioading nev coefficienls L6,1, H6,1, and H6,O of rov 6 inlo RI3 firsl
slage Ialches RlO, Rl1, and Rl2, respecliveIy. During lhe fourlh cycIes posilive puIse of cIock
f
4|
, conlenl of Rl1 in slage 1 of RI3 viII le slored in lhe second Iocalion of TL1 and conlenl
of RlO, X2,2, in slage 3 of RI3 viII le slored in lhe firsl Iocalion of TL2, vhiIe conlenl of
Rd3, X2,2, in slage 3 of RI1 viII nol le slored in TL2 of RI1.
In cycIe 33, lhe negalive lransilion of lhe fiflh cycIe of cIock f
4a
lransfers conlenl of RlO, XO,O,
in slage 4 of RI1 lo RI1 oulpul Ialch RlO, as firsl oulpul coefficienl and RlO of slage 3
hoIding coefficienl X2,O lo RlO of slage 4, vhiIe coefficienl X4,O caIcuIaled in slage 2 is
Ioaded inlo RlO of slage 3. The sane negalive lransilion of cIock f
4a
lransfers conlenls of RlO
and Rl1 of slage 1 lo RlO and Rl1 of slage 2 lo conpule lhe firsl Iov coefficienl, X6,O of rov
6, vhiIe Ioading nev coefficienls L8,O, H8,O, and H8,1 of rov 8 inlo RI1 firsl slage Ialches
RlO, Rl1, and Rl2, respecliveIy. During lhe fiflh cycIe of cIock f
4a
, lhe posilive puIse of lhe
cIock slores conlenl of Rl2 in slage 1 of RI1 inlo lhe lhird Iocalion of TL1.
VLS 36
In cycIe 34, lhe negalive lransilion of lhe fiflh cycIe of cIock f
4|
lransfers conlenl of RlO, XO,2,
and lhe high coefficienl, XO,1 conpuled in slage 4 of RI3 lo RI3 oulpul Ialches RlO and Rl1,
respecliveIy, vhiIe Ioading conlenls of RlO, Rl1, and Rd1 of slage 3 inlo RlO, Rl1, and Rl2 of
slage 4 lo conpule lhe firsl high coefficienl of rov 2, X2,1. Iurlhernore, lhe sane negalive
lransilion of cIock f
4|
aIso lransfers coefficienl X4,2 caIcuIaled in slage 2 of RI3 lo lolh RlO in
slage 3 of RI3 and Rd3 in slage 3 of RI1. Il aIso lransfers coefficienl X4,O in RlO, in slage 3 of
RI1, and conlenl of Rl2 in slage 2 of RI3 lo slage 3 of RI3 Ialches Rd1 and Rl1, respecliveIy,
and conlenls of RlO, Rl1, and Rl2 of slage 1 lo RlO, Rl1, and Rl2 in slage 3 of RI3, vhiIe
Ioading nev coefficienls L8,1, H8,1, and H8,O inlo RlO, Rl1, and Rl2 of slage 1. Conlenl of
Rd3, X4,2 in slage 3 of RI1 viII le slored in lhe second Iocalion of TL2 ly lhe posilive
puIse of lhe fiflh cycIe of cIock f
4a
.
In cycIe 37, lhe second run of lhe RIs legins vhen lhe negalive lransilion of lhe sixlh cycIe
of cIock f
4a
Ioads lvo nev coefficienls LO,2 and HO,2, vhich are lhe fiflh and sixlh
coefficienls of rov O, inlo firsl slage Ialches RlO and Rl1 of RI1, respecliveIy. During lhe firsl
haIf (Iov puIse) of cycIe 6, lhe firsl Iocalion of TL1 viII le read and Ioaded inlo Rd1 ly lhe
posilive lransilion of cIock f
4a
, vhereas during lhe second haIf cycIe, conlenl of Rl1 viII le
vrillen in lhe firsl Iocalion of TL1.
In cycIe 38, lhe negalive lransilion of lhe sixlh cycIe of cIock f
4|
Ioads lvo nev coefficienls
L2,2 and H2,2, lhe fiflh and sixlh coefficienls of rov 2, inlo RI3 firsl slage Ialches RlO and
Rl1, respecliveIy. The firsl haIf cycIe of cIock f
4|
reads lhe firsl Iocalion of TL1 and Ioads il
inlo Rd3 ly lhe posilive lransilion of lhe cIock, vhereas lhe second haIf cycIe vriles conlenl
of Rl1 in lhe sane Iocalion of TL1.
In cycIe 41, lhe negalive lransilion of lhe sevenlh cycIe of cIock f
4a
lransfers conlenls of RlO,
Rl1, and Rd1 of slage 1 lo slage 2 of RI1 Ialches RlO, Rl1, and Rl2, respecliveIy, lo conpule
lhe lhird Iov coefficienl, XO,4 of rov O, vhiIe Ioading lvo nev coefficienls L4,2 and H4,2 of
rov 4 inlo RI1 firsl slage Ialches RlO and Rl1 respecliveIy.
Nole lhal during run 2 aII RIs execule independenlIy vilh no inleraclions anong lhen. In
addilion, in lhe firsl run, if lhe firsl coefficienl generaled ly slage 2 of RI3 is slored in TL2
of RI1, lhen lhe second coefficienl shouId le slored in TL2 of RI3 and so on. SiniIarIy,
TL1, TL3, and TL4 of lolh RI1 and RI3 are handIed. Iurlhernore, during lhe vhoIe
period of run 1, lhe conlroI signaIs of lhe lhree exlension nuIlipIexers IaleIed muxc0, muxc1,
and muxc2 in RI1 shouId le sel O, according lo TalIe 4 (Ilrahin & Hernan, 2OO8), vhereas
lhose in RI3 shouId le sel nornaI as shovn in lhe second Iine of TalIe 4, since RI3 viII
execule nornaI conpulalions during lhe period. Hovever, in lhe second run and in aII
sulsequenl runs excepl lhe Iasl run, lhe exlension nuIlipIexers conlroI signaIs in aII RIs are
sel nornaI. Moreover, lhe nuIlipIexers IaleIed muxcc in slage 4 is onIy needed in lhe
conlined 5/3 and 9/7 archileclure, olhervise, il can le eIininaled and Rl2 oulpul can le
connecled direclIy lo RlO inpul of lhe nexl slage in case of 9/7, vhereas in 5/3, RlO is
connecled direclIy lo oulpul Ialch RlO. In lhe conlined archileclure, signaI scc of muxcc is
sel O if lhe archileclure is lo perforn 5/3, olhervise, il is sel 1 if lhe archileclure is lo
perforn 9/7.
Il is very inporlanl lo nole lhal vhen lhe RI execules ils Iasl sel of inpul coefficienls,
according lo 9/7 DDCs for odd and even signaIs shovn in Iig. 4 il viII nol yieId aII required
oulpul coefficienls as expecled ly lhe Iasl run. Ior exanpIe, in lhe DDCs for odd Ienglh
signaIs shovn in Iig. 4 (a), vhen lhe Iasl inpul coefficienl IaleIed Y8 is appIied lo RI il viII
yieId oulpul coefficienls X5 and X6. To gel lhe Iasl renaining lvo coefficienls X7 and X8, lhe
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 37
RI nusl execule anolher run, vhich viII le lhe Iasl run in order lo conpule lhe renaining
lvo oulpul coefficienls. SiniIarIy, vhen lhe Iasl lvo inpul coefficienls IaleIed Y6 and Y7 in
lhe DDC for even Ienglh signaIs shovn in Iig. 4 (l) are appIied lo 9/7 RI il viII yieId
oulpul coefficienls X3 and X4. To ollain lhe renaining oulpul coefficienls X5, X6, and X7,
lvo nore runs shouId le execuled ly RI according lo lhe DDC. The firsl run viII yieId X5
and X6, vhereas lhe Iasl run viII yieId X7. The delaiIs of lhe conpulalions lhal lake pIace
during each of lhese runs can le delernined ly exanining lhe specific area of lhe DDCs.

sc0 sc1 sc2
Iirsl O O O
NornaI O 1 O
Lasl 1 1 O
a) Odd Ienglh signaIs

sc0 sc1 sc2
Iirsl O O O
NornaI O 1 O
Lasl O 1 O
l) Lven Ienglh signaIs
TalIe 4. Lxlensions conlroI signaIs

5. Performance EvaIuation

In order lo evaIuale perfornance of lhe lvo proposed paraIIeI pipeIined archileclures in
lerns of speedup, lhroughpul, and pover as conpared vilh singIe pipeIined archileclure
proposed in (Ilrahin & Hernan, 2OO8) consider lhe foIIoving. Assune sullands HH, HL,
LH, and LL of each IeveI are equaI in size. The dalafIov lalIe for singIe pipeIined
archileclure (Ilrahin & Hernan, 2OO8) shovs lhal 20
1
= cIock cycIes are needed lo
yieId lhe firsl oulpul coefficienl. Then, lhe lolaI nunler of oulpul coefficienls in lhe firsl run
of lhe }
lh
IeveI reconslruclion can le eslinaled as

1
2
J
N (6)
and lhe lolaI nunler of cycIes in lhe firsl run is given ly

1
2 2
J
N (7)
The lolaI line, T1, required lo yieId n pairs of oulpul coefficienls for lhe }
lh
IeveI
reconslruclion on lhe singIe pipeIined archileclure can le eslinaled as

( ) ( )
( ) k t n N
N n N T
p
J
J J
2 2 2
2 2 1 2 2 2 1
1
1
1
1 1
1
+ + =
+ + =

t
(8)
On lhe olher hand, lhe dalafIov for 2-paraIIeI pipeIined archileclure shovs lhal 21
2
=
cIock cycIes are required lo yieId lhe firsl 2 pairs of oulpul coefficienls. The lolaI nunler of
paired oulpul coefficienls in lhe firsl run of lhe }
lh
IeveI reconslruclion on lhe 2-paraIIeI
archileclure can le eslinaled as
VLS 38

1
2 2 3
J
N (9)
and lhe lolaI nunler of 2-paired oulpul coefficienls is given ly

1
2 4 3
J
N (1O)
WhiIe, lhe lolaI nunler of cycIes in lhe firsl run is

1
2 2
J
N (11)
Nole lhal lhe lolaI nunler of paired oulpul coefficienls of lhe firsl run in each IeveI of
reconslruclion slarling fron lhe firsl IeveI can le vrillen as
1
2 2 3 ...., ,......... 4 2 3 , 2 2 3 , 2 3
J
N N N N
vhere lhe Iasl lern is Lq (9).
The lolaI line, T2, required lo yieId n pairs of oulpul coefficienls for lhe }
lh
IeveI
reconslruclion of an NxM inage on lhe 2-paraIIeI archileclure can le eslinaled as

( )
2
1 1
2
) 2 4 3 2 ( 2 2 2 2 t

+ + =
J J
N n N T (12)
( ) k t n N T
p
J
2 2 2 1 2
1
2
+ + =

(13)
The lern ( )
1
2 4 3 2 2

J
N n in (12) represenls lhe lolaI nunler of cycIes of run 2 and aII
sulsequenl runs.
The speedup faclor, S2, is lhen given ly
( )
( ) k t n N
k t n N
T
T
S
p
J
p
J
2 2 2 1
2 2 2
2
1
2
1
2
1
1
+ +
+ +
= =


Ior Iarge n, lhe alove equalion reduces lo
2
) 2 2 1 (
) 2 2 1 ( 2
2
1
1
=
+
+
=

n N
n N
S
J
J
(14)
Lq(14) inpIies lhal lhe 2-paraIIeI archileclure is 2 lines fasler lhan lhe singIe pipeIined
archileclure.
SiniIarIy, lhe dalafIov for lhe 4-paraIIeI pipeIined archileclure shovs lhal 33
4
= cIock
cycIes are needed lo yieId lhe firsl lvo oulpul coefficienls. Iron lhe dalafIov lalIe of lhe 4-
paraIeII archileclure il can le eslinaled lhal lolh RI1 and RI2, in lhe firsl run of lhe }
lh
IeveI
reconslruclion, yieId 2 ) 2 (
1 J
N pairs of oulpul coefficienls, vhereas lolh RI3 and RI4
yieId
1
2
J
N pairs of oulpul coefficienls, a lolaI of
1
2 2 3
J
N pairs of oulpul coefficienls
in lhe firsl run. The lolaI nunler of cycIes in run 1 is lhen given ly
2 ) 2 ( 4
1 J
N (15)
Thus, lhe lolaI line, T4, required lo yieId n pairs of oulpul coefficienls for lhe }
lh
IeveI
reconslruclion of an NxM inage on lhe 4-paraIIeI archileclure can le eslinaled as
( ) ( )
4
1 1
4
2 2 2 3 2 2 2 4 t

+ + =
J J
N n N T
( ) ( ) k t N n N T
p
J J
4 2 2 3 2 2 4
1 1
4

+ + = (16)
( ) k t n N T
p
J
4 2 2 1 4
1
4
+ + =

(17)
The lern ) 2 2 3 (
1

J
N n in (16) represenls lhe lolaI cycIes of run 2 and aII sulsequenl runs.
High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 39
The speedup faclor, S4, is lhen given ly
( )
( ) k t n N
k t n N
T
T
S
p
J
p
J
4 2 2 1
2 2 2
4
1
4
1
4
1
1
+ +
+ +
= =


Ior Iarge n il reduces lo
4
) 2 2 1 (
) 2 2 1 ( 4
4
1
1
=
+
+
=

n N
n N
S
J
J
(18)

Thus, lhe 4-paraIIeI archileclure is 4 lines fasler lhan lhe singIe pipeIined archileclure.
The lhroughpul, H, vhich can le defined as nunler of oulpul coefficienls generaled per
unil line, can le vrillen for each archileclure as
k t n N n gle H
p
J
2 ) 2 2 ( ) (sin
1
1
+ + =


The naxinun lhroughpul, H
max
, occurs vhen n is very Iarge (n), lhus,

p p
n
kf n nkf
gle H gle H
= =
=

2 2
) (sin ) (sin
max
(19)
k t n N n parallel H
p
J
2 ) 2 2 1 ( ) 2 (
1
2
+ + =


p p
n
kf n knf
parallel H parallel H
2 2
) 2 ( ) 2 (
max
= =
=

(2O)
k t n N n parallel H
p
J
4 ) 2 2 1 ( ) 4 (
1
4
+ + =



p p
n
kf n knf
parallel H parallel H
4 4
) 4 ( ) 4 (
max
= =
=

(21)
Thus, lhe lhroughpuls of lhe 2-paraIIeI and lhe 4-paraIIeI pipeIined archileclures have
increased ly faclors of 2 and 4, respecliveIy, as conpared vilh lhe singIe pipeIined
archileclure.
On lhe olher hand, lhe pover consunplion of |-paraIIeI pipeIined archileclure as conpared
vilh lhe singIe pipeIined archileclure can le ollained as foIIovs. Lel P
1
and P
|
denole lhe
pover consunplion of lhe singIe and |-paraIIeI archileclures vilhoul lhe exlernaI nenory,
and P
m1
and P
m|
denole lhe pover consunplion of lhe exlernaI nenory for lhe singIe and |-
paraIIeI archileclures, respecliveIy. The pover consunplion of VLSI archileclure can le
eslinaled as
f J C P
total
=
2
0

vhere C
lolaI
denoles lhe lolaI capacilance of lhe archileclure, V
0
is lhe suppIy voIlage, and f
is lhe cIock frequency. Then,

l f J C l P f J C P
l total l total
= =
2
0 1
2
0 1
, 2

and

l t k
t
k l
f
f
f J C
l f J C l
P
P
p
p
l
total
l total l
=
|
|
.
|

\
|

=
=


=
2 2
2
2
1
1
2
0
2
0
1
(21)
VLS 40
WhiIe, P
m1
and P
m|
can le eslinaled as

l
m
total ml
m
total m
f J C P f J C P = =
2
0 1
2
0 1
2 , , and l
t k
t k l
f
f
P
P
p
p
l
m
ml
=

=

=
2
2
2
1 1
(22)
m
total
C , is lhe lolaI capacilance of lhe exlernaI nenory.
In sunnary, lhe alove equalions indicales lhal as degree of paraIIeIisn increases lhe
speedup and lhe pover consunplion of lhe proposed archileclures, vilhoul exlernaI
nenory, and lhe pover consunplion of lhe exlernaI nenory increase ly a faclor of |, as
conpared vilh singIe pipeIined archileclure.

6. Comparisons

TalIe 5 provides conparison resuIls of lhe proposed archileclures vilh nosl recenl
archileclures in lhe Iileralure. The archileclure proposed in (Lan & Zheng, 2OO5) achieves a
crilicaI palh of one nuIlipIier deIay using very Iarge nunler of pipeIined regislers, 52
regislers. In addilion, il requires a lolaI Iine luffer of size 6N, vhich is a very expensive
nenory conponenl, vhiIe lhe proposed archileclures require onIy 4N. In (RahuI & Ireeli,
2OO7), a crilicaI palh deIay of Tn + Ta is achieved lhrough oplinaI dalafIov graph, lul
requires a lolaI Iine luffer of size 1ON.
In (vang el al., 2OO7), ly revriling lhe Iifling-lased DWT of lhe 9/7, lhe crilicaI palh deIay
of lhe pipeIined archileclures have leen reduced lo one nuIlipIier deIay lul il requires a
lolaI Iine luffer of size 5.5N. In addilion, il requires reaI fIoaling-poinl nuIlipIiers vilh Iong
deIay lhal can nol le inpIenenled ly using arilhnelic shifl nelhod (Qing & Sheng, 2OO5).
(Qing & Sheng, 2OO5) has iIIuslraled lhal lhe nuIlipIiers used for scaIe faclor | and
coefficienls , , , | o and o of lhe 9/7 fiIler can le inpIenenled in hardvare using
onIy lvo adders. Moreover, lhe foId archileclure vhich uses one noduIe lo perforn lolh
lhe prediclor and updale sleps in facl increases lhe hardvare conpIexily, e.g., use of severaI
nuIlipIexers, and lhe conlroI conpIexily. In addilion, use of one noduIe lo perforn lolh
prediclor and updale sleps inpIies lolh sleps have lo le sequenced and lhal vouId sIov
dovn lhe conpulalion process.
In lhe 2-paraIIeI archileclure proposed in (ao & Yong, 2OO7), vriling resuIls generaled ly
CIs inlo MM (nain nenory) and lhen svilching lhen oul lo exlernaI nenory for nexl
IeveI deconposilion is reaIIy a dravlack, since exlernaI nenory in reaI-line appIicalions,
e.g., in digilaI canera, is acluaIIy consisl of charge-coupIed devices vhich can onIy le
scanned. In addilion, il requires a lolaI Iine luffer of size 5N for 5/3 and 7N for 9/7 vhiIe
lhe proposed archileclures require 2N and 4N for 5/3 and 9/7 respecliveIy. The archileclure
requires aIso used of severaI IIIO luffers in lhe dalapalh, vhich are conpIex and very
expensive nenory conponenls, vhiIe lhe proposed archileclures require no such nenory
conponenls.
The 2-paraIIeI archileclure proposed in (Cheng el aI., 2OO6) requires a lolaI Iine luffer of size
5.5N and use of lvo-porl RAM lo inpIenenl IIIOs, vhereas, lhe proposed archileclures
require onIy use of singIe porl RAM.



High Performance Parallel Pipelined Lifting-based VLS Architectures
for Two-Dimensional nverse Discrete Wavelet Transform 41
Archileclure MuIli Adders Line uff. Conpuling line CrilicaI Ialh
Lan 12 12 6N 2(1-4
-j
)NM

Tn
RahuI 9 16 1ON 2(1-4
-j
)NM

Tn + Ta
Wang 6 8 5.5N 2(1-4
-j
)NM

Tn
Ilrahin 1O 16 4N 2(1-4
-j
)NM

Tn +2Ta
Cheng (2-paraIIeI) 18 32 5.5N (1-4
-j
)NM

Tn +2Ta
ao (2-paraIIeI) 24 32 7N (1-4
-j
)NM

Tn +2Ta
Iroposed (2-paraIIeI) 18 32 4N (1-4
-j
)NM

Tn +2Ta
Iroposed (4-paraIIeI) 36 64 4N 1/2 (1-4
-j
)NM

Tn +2Ta
Tn: nuIlipIier deIay Ta: adder deIay
TalIe 5. Conparison resuIls of severaI 2-D 9/7 archileclures

7. ConcIusions

In lhis chapler, lvo high perfornance paraIIeI VLSI archileclures for 2-D IDWT are
proposed lhal neel high-speed, Iov-pover, and Iov nenory requirenenls for reaI-line
appIicalions. The lvo paraIIeI archileclures achieve speedup faclors of 2 and 4 as conpared
vilh singIe pipeIined archileclure.
The advanlages of lhe proposed paraIIeI archileclures :

i. OnIy require a lolaI lenporary Iine luffer (TL) of size 2N and 4N in 5/3 and 9/7,
respecliveIy.
ii. The scan nelhod adopled nol onIy reduces lhe inlernaI nenory lelveen CIs and
RIs lo a fev regislers, lul aIso reduces lhe inlernaI nenory or TL size in lhe CIs
lo nininun and aIIovs RIs lo vork in paraIIeI vilh CIs earIier during lhe
conpulalion.
iii. The proposed archileclures are sinpIe lo conlroI and lheir conlroI aIgorilhns can
le innedialeIy deveIoped.

8. References

ao-Ieng, L. & Yong, D. (2OO7). IIDI A noveI archileclure for Iifling-lased 2D DWT in
}ILC2OOO, MMM (2), Ieclure nole in conpuler science, voI. 4352, II. 373-382,
Springer, 2OO7.
Cheng-Yi, X., }in-Wen, T. & }ian, L. (2OO6). Lfficienl high-speed/Iov-pover Iine-lased
archileclure for lvo-dinensionaI discrele vaveIel lransforns using Iifling schene,
ILLL Trans. on Circuils & sys. Ior Video Tech. VoI.16, No. 2, Ielruary 2OO6, II 3O9-
316.
DiIIin, C., Ceoris ., Leganl }-D. & Canlineau, O. (2OO3). Conlined Line-lased Archileclure
for lhe 5-3 and 9-7 WaveIel Transforn of }ILC2OOO, ILLL Trans. on circuils and
syslens for video lech., VoI. 13, No. 9, Sep. 2OO3, II. 944-95O.
Ilrahin saeed koko & Hernan Agusliavan, (2OO8). IipeIined Iifling-lased VLSI
archileclure for lvo-dinensionaI inverse discrele vaveIel lransforn, proceedings
of lhe ILLL InlernalionaI Conference on Conpuler and LIeclricaI Lngineering,
ICCLL 2OO8, Ihukel IsIand, ThaiIand.
VLS 42
Lan, X. & Zheng N. (2OO5). Lov-Iover and High-Speed VLSI Archileclurefor Lifling-
ased Iorvard and Inverse WaveIel Transforn, ILLL lran. on consuner
eIeclronics, VoI. 51, No. 2, May 2OO5, II. 379 -385.
Qing-ning Yi & Sheng-Li Xie, (2OO5). Arilhnelic shifl nelhod suilalIe for VLSI
inpIenenlalion lo CDI 9/7 discrele vaveIel lransforn lased on Iifling schene,
Iroceedings of lhe Iourlh Inl. Conf. on Machine Learning and `Cylernelics,
Cuangzhou, Augusl 2OO5, II. 5241-5244.
RahuI. }. & Ireeli R. (2OO7). An efficienl pipeIined VLSI archileclure for Lifling-lased 2D-
discrele vaveIel lransforn, ISCAS, 2OO7 ILLL, II. 1377-138O.
Wang, C., Wu, Z., Cao, I. & Li, }. (2OO7). An efficienl VLSI Archileclure for Iifling-lased
discrele vaveIel lransforn, MuIIlinedia and Lpo, 2OO7 ILLL InlernalionaI
conference, II. 1575-1578.

Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 43
X

Contour-Based Binary Motion Estimation
AIgorithm and VLSI Design for MPEG-4
Shape Coding

Tsung-Han Tsai, Chia-Iin Chen, and Yu-Nan Ian
Dcpar|mcn| cf ||cc|rcnic |nginccring
Na|icna| Ccn|ra| Unitcrsi|q, Cnung-|i, Taiuan, R.O.C

1. Introduction

MILC-4 is a nev inlernalionaI slandard for nuIlinedia connunicalion |1j. Il provides a sel
of looIs for oljecl-lased coding of naluraI and synlhelic videos/audios. MILC-4 aIso enalIes
conlenl-lased funclionaIilies ly inlroducing lhe concepl of video oljecl pIane (VOI), and
such a conlenl-lased represenlalion is a key lo enalIe inleraclivily vilh oljecls for a variely
of nuIlinedia appIicalions. The VOI is conposed of lexlure conponenls (YUV) and an
aIpha conponenl |2j-O. The lexlure conponenl conlains lhe coIorific infornalion of video
oljecl, and lhe aIpha conponenl conlains lhe infornalion lo idenlify lhe pixeIs. The pixeIs
vhich are inside an oljecl are opaque and lhe pixeIs vhich are oulside lhe oljecl are
lransparenl. MILC-4 supporls a conlenl-lased represenlalion ly aIIoving lhe coding of lhe
aIpha conponenl aIong vilh lhe oljecl lexlure and nolion infornalion. Therefore, MILC-4
shape coding lecones lhe key lechnoIogy for supporling lhe conlenl-lased video coding.
MILC-4 shape coding nainIy conprises lhe foIIoving coding aIgorilhns: linary-shaped
nolion eslinalion/nolion conpensalion (ML/MC), conlexl-lased arilhnelic coding
(CAL), size conversion, node decision, and so on. As fuII search (IS) aIgorilhn is adopled
for MILC-4 shape coding, nosl of lhe conpulalionaI conpIexily is due lo linary nolion
eslinalion (ML). Iron lhe profiIing on shape coding in Iig. 1, il can le seen lhal ML
conlrilules lo 9O of lolaI conpulalionaI conpIexily of MILC-4 shape encoder. Il is veII
knovn lhal an effeclive and popuIar lechnique lo reduce lhe lenporaI redundancy of ML,
caIIed lIock-nalching nolion eslinalion, has leen videIy adopled in various video coding
slandard, such as MILC-2 O, H.263 |5j and MILC-4 shape coding |1j. In lIock-nalching
nolion eslinalion, lhe nosl accurale slralegy is lhe fuII search aIgorilhn vhich exhausliveIy
evaIuales aII possilIe candidale nolion veclors over a predelernined neighlorhood search
vindov lo find lhe gIolaI nininun lIock dislorlion posilion.
3
VLS 44

Iig. 1. ConpulalionaI conpIexily of MILC-4 shape encoder.

Iasl ML aIgorilhns for MILC-4 shape coding vere presenled in severaI previous papers
|6j-|8j. Anong lhese lechniques, our previous vork, conlour-lased linary nolion
eslinalion (CML), IargeIy reduced lhe conpulalionaI conpIexily of shape coding |9j. Il is
appIied vilh lhe properlies of loundary search for lIock-nalching nolion eslinalion, and
lhe dianond search pallern for furlher inprovenenl. This aIgorilhn can IargeIy reduce lhe
nunler of search poinls lo O.6 conpared vilh lhal of fuII search nelhod, vhich is
descriled in MILC-4 verificalion nodeI (VM) |2j.
In conlrasl vilh aIgorilhn-IeveI deveIopnenls, archileclure-IeveI designs for shape coding
are reIaliveIy Iess. CeneraIIy, a conpIeled shape coding nelhod shouId incIude differenl
lypes of aIgorilhns. In CAL parl, il needs sone lil-IeveI operalion. Hovever, in linary
nolion eslinalion parl, a high speed search nelhod is needed. Wilh lhese aIgorilhn
conlinalions on lhe shape coding, inpIenenlalion shouId nol le as slraighlforvard as
expecled and il offers sone chaIIenges especiaIIy on archileclure design. Since MILC-4
shape coding has fealures of high-conpuling and high-dala-lraffic properlies, il is suilalIe
vilh lhe consideralion of efficienl VLSI archileclure design. Mosl Iileralures have aIso leen
presenled lo focus on lhe nain conpulalion-expensive parl, ML, lo inprove ils
perfornance |1Oj. AddilionaIIy, CAL is aIso an inporlanl parl for archileclure design and
discussed in |11j-|12j. They uliIized lhe nuIli-synloI lechnique lo acceIerale lhe arilhnelic
coding perfornance. As regards lhe conpIele MILC-4 shape coding, sone of lhese designs
uliIized array processor lo perforn lhe shaping coding aIgorilhn |13j-|15j, vhiIe olhers
used pipeIined archileclure |16j. They can reach lhe reIalive high perfornance al lhe
expense of lhese high cosl and high conpIexily archileclures. AII of lhen inluiliveIy appIy
lhe fuII search aIgorilhn for easy reaIizalion on archileclure design. Hovever, lhe
aIgorilhn-IeveI achievenenl on lhe Iarge reduclion of conpulalion conpIexily is allraclive
and nol negIigilIe. This denonslrales lhal, vilhoul lhe supporling on an efficienl aIgorilhn,
lhe slraighlforvard inpIenenlalion lased on fuII search aIgorilhn is hard lo reach a
cosl-effeclive design.
In lhis paper, ve proposed a fasl ML aIgorilhn, dianond loundary search (DS), for
MILC-4 shape coding lo reduce lhe nunler of search poinls. y using lhe properlies of
Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 45
lIock-nalching nolion eslinalion in shape coding and dianond search pallern, ve can skip
a Iarge nunler of search poinls in ML. SinuIalion resuIls shov lhal lhe proposed
aIgorilhn can narveIousIy reduce lhe nunler of search poinls lo O.6 conpared vilh lhal
of fuII search nelhod, vhich is descriled in MILC-4 verificalion nodeI (VM)|2j. Conpared
vilh olher fasl ML in |6j-|7j, lhe proposed ML aIgorilhn uses Iess search poinls
especiaIIy in high nolion video sequences, such as 'rean and 'Iornan. We aIso presenl
an efficienl archileclure design for MILC-4 shape coding. This archileclure is eIaloraled
lased on our fasl shape coding aIgorilhn vilh lhe linary nolion eslinalion. Since lhis
lIock-nalching nolion eslinalion can achieve lhe high perfornance lased on lhe
infornalion of loundary nask, lhe dedicaled archileclure needs sone oplinizalion and
consideralion lo reduce lhe nenory access and processing cycIes. LxperinenlaI resuIls aIso
denonslrale lhe equaI perfornance on fuII-search lased approach. This paper conlrilules a
conprehensive expIoralion of lhe cosl-effeclive archileclure design of shaping coding, and
is organized as foIIovs.
In Seclion 2 lhe linary nolion eslinalion in shape coding is descriled. We descrile lhe
highIighls of lhe proposed fasl ML aIgorilhn for MILC-4 shape coding in Seclion 3. The
design expIoralion on CML is descriled in Seclion 4. In Seclion 5, lhe archileclure design
lased on lhis ML aIgorilhn is proposed. In Seclion 6, ve presenl lhe inpIenenlalion
resuIls and give sone conparisons. IinaIIy ve sunnarize lhe concIusions in Seclion 7.

2. BME for MPEG-4 Shape Coding

The MILC-4 VM |2j descriles lhe coding nelhod for linary shape infornalion. Il uses
lIock-nalching nolion eslinalion lo find lhe nininun lIock dislorlion posilion and sels
lhe posilion lo le nolion veclor for shape (MVS). The procedure of ML consisls of lvo
sleps: firsl lo delernine nolion veclor prediclor for shape (MVIS) and lhen lo conpule
MVS accordingIy.
MVIS is laken fron a Iisl of candidale nolion veclors. As indicaled in Iig. 2, lhe Iisl of
candidale nolion veclors incIudes lhe shape nolion veclors (MVS) fron lhe lhree linary
aIpha lIocks (As) vhich are adjacenl lo lhe currenl A and lhe lexlure nolion veclors
(MV) associaled vilh lhe lhree adjacenl lexlure lIocks. y scanning lhe Iocalions of MVS1,
MVS2, MVS3, MV1, MV2 and MV3 in lhis order, MVIS is delernined ly laking lhe firsl
encounlered MV vhich is vaIid. Nole lhal if lhe procedure faiIs lo find a defined nolion
veclor, lhe MVIS is sel lo (O, O).
ased on MVIS delernined alove, lhe nolion conpensaled (MC) error is conpuled ly
conparing lhe A indicaled ly lhe MVIS and currenl A. If lhe conpuled MC error is
Iess or equaI lo 16xAIphaTH for any 4x4 sul-lIocks, lhe MVIS is direclIy enpIoyed as MVS
and lhe procedure lerninales. Olhervise, MV is searched around lhe MVIS vhiIe
conpuling sun of alsoIule difference (SAD) ly conparing lhe A indicaled ly lhe MV
and currenl A. The search range is 16 pixeIs around MVIS aIong lolh horizonlaI and
verlicaI direclions. The MV lhal nininizes lhe SAD is laken as MVS and lhis is furlher
inlerpreled as MV Difference for shape (MVDS), i.e. MVDS=MVS-MVIS.
VLS 46

Iig. 2. Candidales for MVIS in shape VOI and lexlure VOI.

If nore lhan one MVS nininize SAD ly an idenlicaI vaIue, lhe MVDS lhal nininizes lhe
code Ienglh of MVDS is seIecled. If nore lhan one MVS nininize SAD ly an idenlicaI vaIue
vilh an idenlicaI code Ienglh of MVDS, MVDS vilh snaIIer verlicaI eIenenl is seIecled. If
lhe verlicaI eIenenls are aIso lhe sane, MVDS vilh snaIIer horizonlaI eIenenl is seIecled.
Afler linary nolion eslinalion, nolion conpensaled lIock is conslrucled fron lhe 16x16
A vilh a lorder of vidlh 1 around lhe 16x16 A (lordered MC A). Then,
conlexl-lased arilhnelic encoding (CAL) is adopled for shape coding.

3. Proposed Contour-Based Binary Motion Estimation (CBBME) Method

The lasic concepl of lhe proposed ML nelhod is lhal lhe conlour of video oljecls in
currenl A shouId overIap lhal in lhe nolion conpensaled A, vhich is delernined ly
linary nolion eslinalion |9j. Therefore, lhose search posilions, vhich conlour Iays aparl
fron lhe conlour of video oljecls in currenl A, can le skipped and lhe reduced nunler
viII le enornous. Moreover, lased on lhe properly lhal nosl reaI-vorId sequences have a
cenlraI liased nolion veclor dislrilulion |23j, ve use veighled SAD and dianond search
pallern for furlhernosl inprovenenl.

3.1 Definition of Boundary PixeI
In order lo decide vhelher lhe pixeI is on lhe conlour of VOI, lhe loundary pixeI is
delernined ly lhe foIIoving procedure:
z If currenl pixeI is opaque and one of ils four adjacenl pixeIs is lransparenl, lhe
currenl pixeI direclIy enpIoyed as loundary pixeI.
Iig. 3 shovs lhe correIalion lelveen currenl pixeI and ils four adjacenl pixeIs. In lhe figure,
lhe Iighl grey area corresponds lo lhe pixeIs oulside lhe linary aIpha nap. Il can le seen
lhal lhe pixeIs oulside lhe linary aIpha nap viII nol le laken inlo consideralion, and lhe
nunler of adjacenl pixeIs viII change inlo lvo or lhree.
Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 47

Iig. 3. The correIalion lelveen currenl pixeI and adjacenl pixeI. Where gray parl denoles
pixeIs oulside lhe VOI.

3.2 Boundary Search (BS)
According lo lhe loundary pixeIs in VOI, ve luiId a nask for ML process in MILC-4
shape coding. Iig. 4 shovs an exanpIe of loundary nask fron lhe 'Iorenan sequence. In
lhis figure, lhe vhile area denoles lhe efficienl search posilion for fasl ML, and il is nuch
nore efficienl lhan lhe fasl aIgorilhn O iIIuslraled in Iig. 5.


Fig. 4. Example oI a boundary mask Ior shape coding. White area denotes boundary pixel.

A suggesled inpIenenlalion of lhe proposed oundary search (S) aIgorilhn for shape
coding is processed as foIIovs:
Slep 1. Ierforn a pixeI Ioop over lhe enlire reference VOI. If pixeI (x,y) is an loundary
pixeI, sel lhe nask al (x,y) lo '1. Olhervise sel lhe nask al (x,y) lo 'O.
Slep 2. Ierforn a pixeI Ioop over lhe enlire currenl A. If pixeI (i,j) is a loundary
pixeI, sel (i,j) lo le reference poinl, and lerninale lhe pixeI Ioop. This slep is
iIIuslraled in Iig. 6(l). Therefore, lhere is onIy one reference poinl in currenl
A.
Slep 3. Ior each search poinl vilhin 16 search range, check lhe pixeI (x+i, y+j) vhich
is fuIIy aIigned vilh lhe reference poinl fron lhe currenl A. If lhe nask
vaIue al (x+i, y+j) is '1, vhich neans lhal lhe reference poinl is on lhe
VLS 48
loundary of reference VOI, lhe procedure viII conpule SAD of lhe search
poinl (x, y). Olhervise, SAD of lhe search poinl (x, y) viII nol le conpuled, and
lhe processing conlinues al lhe nexl posilion. Iig. 6(a) shovs an exanpIe of lhis
slep. The search poinls in (x1, y1) and (x2, y2) viII le skipped ly lhis procedure,
vhiIe lhe SAD viII le conpuled in (x3, y3).
Slep 4. When aII lhe search poinls vilhin 16 search range is done, lhe MV lhal
nininizes lhe SAD viII le laken as MVS. Iig. 7 iIIuslrales lhe overaII schene of
proposed S aIgorilhn for MILC-4 shape coding.
In lhe vorsl case, lhe proposed S aIgorilhn needs (256+ (16+1)
2
) delerninalions lo check
vhelher lhe pixeI is a loundary pixeI. Ior each non-skipped search poinl, lhe SAD ollained
ly 256 excIusive-OR operalions and 255 addilion operalions vas laken as lhe dislorlion
neasure. Hovever, lased on S aIgorilhn, lhe nunler of non-skipped search poinls vas
reduced significanlIy, and lhe addilionaI conpulalionaI Ioad due lo S aIgorilhn vas
negIigilIe.


Iig. 5. LxanpIe of an effeclive search area, vhich has leen proposed in reference O.

Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 49

Iig. 6. The iIIuslralion of lhe proposed S aIgorilhn.


Iig. 7. IIov charl for proposed S aIgorilhn.

3.3 Diamond Boundary Search (DBS)
A leller soIulion for lIock-nalching nolion eslinalion is lo perforn lhe search using a
dianond pallern lecause of cenler-liased nolion veclor dislrilulion characlerislic. This is
achieved ly dividing lhe search area inlo dianond shaped zones and using haIf-slop
crilerion |17j. Iig. 8 shovs an exanpIe of dianond shaped zones in a 5 search vindov,
and each nunler denoles lhe search zone in lhe search procedure.

VLS 50

Iig. 8. A 5 search vindov using dianond-shaped zones.

We conline lhe proposed S aIgorilhn vilh dianond-shaped zones, caIIed DS, and give
differenl lhreshoIds (Thn) for each search zone for furlhernosl inprovenenl. The procedure
is expIained in leIov.
Slep 1. Conslrucl dianond-shaped zones around MVIS vilhin 16 search vindov. Se
l n=1.
Slep 2. CaIcuIale SAD for each search poinl in zone n. Lel MinSAD le lhe snaIIesl SAD
up lo nov.
Slep 3. If MinSAD

Thn, golo Slep 4. Olhervise, sel n=n+1 and golo Slep 2.


Slep 4. The nolion veclor is chosen according lo lhe lIock corresponding lo MinSAD.

3.4 Weight SAD (WSAD)
In MILC-4 shape coding, lhe reference A vilh nininun SAD viII le seIecled as nolion
conpensaled A. Hovever, lhe A vilh nininun SAD nay le far avay fron lhe
originaI. Il neans lhal encoder shouId vasle nuch nore lils lo code MVDS. Sone previous
nolion eslinalion aIgorilhns for coIor space used lhe concepl of lhe veighled SAD
(WSAD) lo conpensale lhe dislorlion |24j. In lhis paper, ve proposed lhe siniIar concepl of
WSAD vhich lakes lolh SAD and MVDS inlo consideralion as lhe dislorlion neasure. The
WSAD is given ly

WSAD=W
1
*SAD+W
2
*(|nvds_x|+|nvds_y|) (1)
SAD=|p
i-1
(i+u,j+v)-p
i
(i,j)| . (2)

vhere mts_x is MVDS in lhe horizonlaI direclion and mts_q is MVDS in lhe verlicaI
direclion. I1 and I2 denole lhe veighling vaIues for SAD and MVDS, respecliveIy. The
WSAD is evaIualed in every search poinls and lhe A vilh nininun WSAD is seIecled as
Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 51
nolion conpensaled A. ased on lhe experinenlalion, I1 and I2 are delernined as 1O
and 7, respecliveIy. The inprovenenl of WSAD on lil-rale can conpensale for lhe
dravlack vhen fasl ML aIgorilhn vas adopled. Since lhe nunler of search poinls has
leen reduced significanlIy, lhe conpulalionaI pover due lo caIcuIale lhe WSAD viII
increase negIigilIy.

4. Design ExpIoration on CBBME

In lhis seclion, lvo approaches are deveIoped lased on CML. One is lhe cenler-liased
nolion veclor dislrilulion. The olher is lhe search range shrinking. olh of lhen can furlher
reduce lhe conpulalion conpIexily in ML.

4.1 Center-biased motion vector distribution
The properly is olvious lhal reaI-vorId sequences have a cenlraI liased nolion veclor
dislrilulion. This can le achieved ly dividing lhe search area inlo dianond shaped zones
and using haIf-slop crilerion |17j. Dianond shaped zone has lhe higher cenler-liased
dislrilulion. The cIoser lhe search area lo lhe cenler posilion, lhe Iess lhe nunler of lhe
search zone in lhe search procedure.

4.2 Search Range Shrinking
Wilh lhe properly on Seclion 4.1, il is possilIe lo reduce lhe defauIl search range lo a Iess size
of search range. DefauIl search range is 16 in slandard and il is slraighlforvard used in
convenlionaI vorks. Wilh lhe aid of WSAD in CML and lhe dianond shaped zones on
search range, ve nake a search range shrinking lechnique fron 16 lo 13 pixeIs in our
experinenlalion. TalIe 1 shovs lhe sinuIalion resuIl of lhe proposed linary nolion eslinalor
using lhe alove lvo lechniques. This resuIl supporls lhe usage of lhe 13 shrinking search
range. An inporlanl conlrilulion is lhal lhe WSAD nakes sone inprovenenl on lil-rale.
Therefore il lakes siniIar even Iess lils lo represenl lhe shape per VOI in lhe sane quaIily.

Sequence IuII Search (SR=16) Iroposed Archileclure vilh Iiniled
SS (SR=13)
ils/VOI ils/VOI
rean 1599.8O 1OO 1599.27 99.97
Nevs 89O.22 1OO 89O.18 1OO.OO
Iorenan 1186.94 1OO 1183.12 99.68
ChiIdren 2O56.O3 1OO 2O55.43 99.97
TalIe 1. Ierfornance conparison of IS and proposed archileclure.

Iig. 9 shovs lhe aIgorilhn fIov of CML. Iirsl, ve conslrucl dianond-shaped zones around
nolion veclor prediclor for shape (MVIS) vilhin shrinking search range. Second, ve caIcuIale
WSAD for each search poinl. Lel MinWSAD le lhe snaIIesl SAD up lo nov. If MinWSAD is
snaIIer or equaI lo a lhreshoId (Tn) for each search zone, lhen lhe nolion veclor is chosen
according lo lhe lIock corresponding lo MinWSAD, olhervise, il changes lo nexl posilion and
caIcuIales lhe WSAD again.

VLS 52



















Iig. 9. IIov charl for lhe CML aIgorilhn.

5. Architecture Design for MPEG-4 Shape Coding

The proposed MILC-4 shape coding syslen, as iIIuslraled in Iig. 1O consisls of five najor
noduIes: A lype decision, size conversion, ML, CAL and varialIe Ienglh coding (VLC).
ased on lhe conpulalionaI conpIexily anaIysis in Iig. 1, il can le seen lhal ML, size
conversion and CAL lake a greal parl of lhe processing line. In lhis seclion ve propose an
efficienl archileclure for lhese nain noduIes vilh sone design oplinizalion and
consideralion.


Iig. 1O. Iock diagran of MILC-4 shape encoding.

Construct diamond
shaped zones
Compute WSAD
Set Nvs=Nvs with
NinWSAD
Change to next
position
yes
no
Find Nvs
Skipped position
NinWSADTh
no
yes
Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 53
5.1 Binary Motion Estimation
Iasl nolion eslinalion archileclures vere presenled in severaI previous papers |12j-|15j.
The perfornance couId le high lul high cosl and high conpIexily archileclures are aIvays
needed. Therefore, a noveI and efficienl archileclure for linary nolion eslinalion using
proposed CML aIgorilhn is proposed. Iig. 11 shovs lhe lIock diagran of lhe proposed
ML archileclure. Il nainIy consisls of a oundary IixeI Deleclor, a processing eIenenl (IL)
Array, a Conpare and SeIeclion (CAS) noduIe and lvo nain nenories for search range (SR)
and A luffer. oundary IixeI Deleclor finds lhe reference poinl in currenl A and
checks vhelher lhe candidale search posilions are non-skipped posilions. IL Array perforns
lhe linary nolion eslinalion. CAS finds lhe nininun WSAD and ils MVDS for shape
coding.


Iig. 11. Iock diagran of lhe proposed ML archileclure.

A. Bnundary PIxc! Dctcctnr
oundary IixeI Deleclor is a design dedicaled lo our CCML aIgorilhn. Iron lhe anaIysis
in Seclion 4, since lhe search range has leen reduced lo 13, lhe search poinls vilh one
dinension is 13+13+1=27 and induce a delecled region vilh 2727 search poinls. IncIuding
lhe A size of 1616, lolaIIy 4242 pixeIs are used as lhe lolaI search area.
According lo lhe effecl on 2727 search poinls, a nuIlipIe of 3 is considered on archileclure
design. In our linary nolion eslinalor, a 33 IL array is used. Therefore, ve separale lhe
search posilions inlo a 99 sul-search-lIock (SS) array slruclure, and each SS conlains 33
search posilions for lhe caIcuIalion of IL array. Iig.12 iIIuslrales lhe delecled region and ils
reIaled array slruclure vilh lolaIIy 9x9= 81 SSs.

VLS 54
27
SSB SSB SSB SSB SSB SSB SSB SSB SSB
SSB SSB SSB SSB SSB SSB SSB SSB SSB
SSB SSB SSB SSB SSB SSB SSB SSB SSB
SSB SSB SSB SSB SSB SSB SSB SSB SSB
SSB SSB SSB SSB SSB SSB SSB SSB SSB
SSB SSB SSB SSB SSB SSB SSB SSB SSB
SSB SSB SSB SSB SSB SSB SSB SSB SSB
SSB SSB SSB SSB SSB SSB SSB SSB SSB
SSB SSB SSB SSB SSB SSB SSB SSB SSB
1 2 3
4 5
8
6
7 9
SSB
27
Iig. 12. The 27x27 delecled region conposes of 9x9=81 SSs, each SS conlains 9 checking
pixeI.

In each SS, 9 checking pixeIs are incIuded. In addilion, in order lo delecl loundary pixeIs
and ollain lhe lordered MC-A, a lorder vilh lhe vidlh of one pixeI around lhe 4242
search area, caIIed lordered search area, is appIied. As depicled in Iig. 13, lhe pixeIs in lhe
grey area are lhe lorder pixeIs. This range indicales lhe lolaI needed pixeIs in nenory.
oundary IixeI Deleclor firsl seIecls a loundary pixeI (i,j) as reference poinl in 1616
currenl A. ased on lhe reference poinl, a 2727 delecled region is generaled as shovn in
Iig. 14. Lach pixeI in delecled region, caIIed checking pixeI, denoles vhelher lhe correIalive
search posilion is non-skipped search posilion or nol. If lhe checking pixeI leIongs lo a
loundary pixeI, lhe correIalive search posilion is denoled as non-skipped search posilion.
Hereafler, oundary IixeI Deleclor delecls 9 checking pixeIs, vhich are reIalive lo lhe coding
SS in delecled region. As shovn in Iig. 14, lhe pixeIs in lhe dark area are used for delecling
loundary pixeI. If aII of lhe checking pixeIs in Iighl grey area are nol loundary pixeIs, il
neans lhal aII search posilions in lhe SS are skipped search posilions. Therefore, lhe coding
SS viII nol le processed in IL Array. Olhervise, lhe IL Array caIcuIales WSAD of lhese 9
search posilions in coding SS.

Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 55

(a) (l)
Iig. 13. (a) 4444 lordered search area, (l) 1616 currenl A.


Iig. 14. The reIalion lelveen search posilions and delecled region.

B. PE Array and CA5
IL array archileclure vhich perforns lhe linary nolion eslinalion aIgorilhn is shovn in
Iig. 15(a). The array is lhe 33 archileclure and lolaIIy consisls of 9 ILs. In lhis archileclure,
18-lils reference dala (denoled as Rcfj17.0]) are read fron SR luffer, and 16-lils currenl dala
are read fron A luffer. The reference dala are lroadcasled lo aII ILs (Rcfj17.2] for IL1,
IL4 and IL7, Rcfj16.1] for IL2, IL5 and IL8, Rcfj15.0] for IL3, IL6 and IL9 respecliveIy),
vhiIe lhe currenl dala is deIayed and fed lo lhe corresponding IL. Lach IL caIcuIales lhe
WSAD of a search posilion in coding SS. Il is noled lhal onIy lhe non-skipped SS, vhich is
VLS 56
delecled fron oundary IixeI Deleclor, is processed in IL Array.


(a)

mvds_x
>>1
ABS
XOR
Adder
tree
Accumulator
Reference
Data
Curren
t Data
Add(+)
Wn
CAS Unit
W1 W4 W7 W2 W5 W8 W3 W6 W9
Dff
mvds_y
ABS
SAD
Add(+)

(l) (c)
Iig. 15. Archileclure of (a) IL Array (l) IL eIenenl (c) CAS unil.
Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 57
The veighled dala is caIcuIaled ly adding alsoIule MVDS in lolh horizonlaI and verlicaI
direclions and shifling righl one lil. Iron lhe anaIysis in our previous paper |9j, lhe ralio for
V2/V1 viII nol nake Iarge difference fron O.5 lo O.8. And in lhal range our WSAD is indeed
leller lhan lhe resuIl for SAD. Ior reducing lhe conpulalionaI conpIexily, V1 and V2 in (1)
are delernined as 1 and O.5, respecliveIy. The archileclure of IL eIenenl is shovn in Iig.
15(l). Il produces WSAD vilh lhe sun of veighled dala and SAD, vhere In neans lhe
WSAD vaIue for each IL eIenenl fron n=1 lo 9.
The archileclure of Conpare and SeIeclion (CAS) noduIe is shovn in Iig. 15(c). Il finds lhe
snaIIesl WSAD and ils MVS in coding SS and feedlacks lhe snaIIesl WSAD as lhe inpul for
lhe nexl SS.

5.2 Size Conversion
In MILC-4 shape coding, rale conlroI and rale reduclion are reaIized lhrough size
conversion. Iig. 16 shovs lhe lIock diagran of size conversion noduIe.

lovn
:amplc
p
:amplc
CQ
lccoor
LL
LuIIcr
:C
LuIIcr_0
:C
LuIIcr_T

lovn
:amplc
p
:amplc
CQ
lccoor
LL
LuIIcr
:C
LuIIcr_0
:C
LuIIcr_T


Iig. 16. Iock diagran of size conversion noduIe.

Il consisls of lhree najor unils: dovn-sanpIe, up-sanpIe and accepled quaIily (ACQ)
deleclor. The lordered A is read fron A luffer and dovn-sanpIed lo 44 and 88 A
in SC uffer_O and SC uffer_1 respecliveIy. Then, lhe 44 A is up-sanpIed lo SC
uffer_1, and 88 A is up-sanpIed lo ACQ deleclor ly up-sanpIe unil. ACQ deleclor
caIcuIales lhe conversion error lelveen lhe originaI A and lhe A vhich is
dovn-sanpIed and reconslrucled ly up-sanpIe unil. ACQ aIso needs lo delernine lhe
conversion ralio. In dovn-sanpIe procedure, severaI pixeIs are dovn-sanpIed lo one pixeI,
vhiIe inlerpoIaled pixeIs are produced lelveen originaI pixeIs in up-sanpIe procedure. To
conpule lhe vaIue of lhe inlerpoIaled pixeI, a lorder vilh lhe vidlh of lvo around lhe
currenl A is used lo ollain lhe neighloring pixeIs (A~L), and lhe unknovn pixeIs are
exlended fron lhe oulernosl pixeIs inside lhe A. The lenpIale and pixeI reIalionship
used for up-sanpIing operalion can le referred as in MILC-4 slandard and |15j.
Since lhe inpIenenlalion of lhe dovn-sanpIe and ACQ deleclor is reIaliveIy sinpIe, ve
onIy address lhe design of up-sanpIe unil here. The lIock diagran of up-sanpIe unil is
VLS 58
shovn in Iig. 17. Due lo lhe vindov-Iike sIicing operalions, lhe up-sanpIe can le easiIy
napped inlo a deIay Iine nodeI. A deIay Iine nodeI is used lo ollain lhe pixeIs in A~L,
vhich delernine inlerpoIaled pixeIs (I1~I4). ased on lhe pixeIs in A~L, four 4-lils
lhreshoId vaIues are ollained fron TalIe CI. UI_IL generales corresponding vaIues
for conparison. Afler conparison lelveen lhreshoId vaIues and lhe vaIues fron UI_IL,
four inlerpoIaled pixeIs (I1~I4) are slored in Shifl Regisler and oulpulled Ialer.


Iig. 17. Iock diagran of up-sanpIe unil.

5.3 Context Based Arithmetic Encoder (CAE)
CAL archileclure nainIy conprises lhe conlexl generalion unil and lhe linary arilhnelic
coder |18j-|19j. Iig. 18(a) shovs lhe lIock diagran of CAL noduIe for our design.

MC
LL
Currcn
LL
:lI
lc_:cr
Codc
:_mhol
lc
Normal:c
h p0
l L
l L
L
Follov
Iv
h:rcam
ahlc
(a)
ox
lnra LL
lncr LL
lncr MC
z0 h:
o o o
o o o o o
o o
o cmplac Ior nra CL
cmplac Ior ncr CL
(h)
MC
LL
Currcn
LL
:lI
lc_:cr
Codc
:_mhol
lc
Normal:c
h p0
l L
l L
L
Follov
Iv
h:rcam
ahlc
(a)
ox
lnra LL
lncr LL
lncr MC
z0 h:
o o o
o o o o o
o o
o cmplac Ior nra CL
cmplac Ior ncr CL
(h)

Iig. 18. (a) Iock diagran of CAL noduIe. (l). IIIuslralion of lhe Shifl Regisler.

As nenlioned lefore, for pixeI-ly-pixeI processing in CAL, il lasicaIIy uses lhe rasler scan
order. Since nosl of lhe execulion line is spenl on lhe conlexl generalion in lhe CAL, lhe
Shifl Regisler is used lo ollain conlexl and lhe reIaled operalion is iIIuslraled in Iig. 18(l)
|2Oj. Dala in lhe shifl regislers can le effecliveIy reused and lhus lhis redundanl dala
Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 59
accesses can le renoved. In Iig. 18(l) aII lhe reclangIes are represenled as regislers. IixeIs in
currenl A and MC-A are firsl Ioaded inlo lhe Shifl Regisler, and lhen shifled Iefl one
lil al every cIock cycIe. Regislers in conlexl lox are arranged such lhal various conlexls can
le achieved. Ior inlra-CAL node, lhe firsl lhree rovs of Shifl Regisler are used lo slore
currenl A. The firsl lvo rovs of Shifl Regisler are used lo slore currenl A and lhe Iasl
lhree rovs are used lo slore MC-A in inler-CAL node. Therefore, lhe conlexl (cx) and
coded lil (|i|) are ollained fron Shifl Regisler per cycIe.

6. ImpIementation ResuIts and Comparisons

We use lhree MILC-4 lesl video sequences of CII (352288) fornal for experinenl: rean,
Nevs and Iorenan. The lhree sequences characlerize a variely of spaliaI and nolion
aclivilies. Lach sequence consisls of 3OO VOIs of arlilrary shape. A search range of 16
pixeIs is used and frane-lased IossIess shape coding is perforned for aII lhe lesl sequence.
The conparisons of SAD and WSAD using fuII search aIgorilhn are shovn in Iig. 19. In lhis
figure, lhe correIalions lelveen lil-rale and lhe ralio of I2 lo I1 (I2/I1) are aIso shovn
in Iig. 19.
Bream
w2/w1
0.4 0.5 0.6 0.7 0.8
B
i
t
r
a
t
e
1634
1636
1638
1658
1660
1662
SAD
WSAD

News
W2/W1
0.4 0.5 0.6 0.7 0.8
B
i
t
r
a
t
e
904
906
908
910
912
SAD
WSAD

VLS 60
Foreman
W2/W1
0.4 0.5 0.6 0.7 0.8
B
i
t
r
a
t
e
1196
1198
1200
1212
1214
1216 SAD
WSAD

Iig. 19. Ierfornance conparisons of SAD and WSAD using fuII search aIgorilhn.

Il can le seen lhal lhe resuIl of using WSAD as lhe dislorlion neasure lakes Iess lils lhan
lhal of using SAD excepl Nevs in I2/I1= O.4. This is lecause lhe Nevs sequence is a
Iov nolion video sequence, and WSAD is of no lenefil in such sequence. As lhe resuIl of
Iig. 19, I1 and I2, vhich is derived fron (1), are delernined as 1O and 7, respecliveIy.
ased on lhe veighling vaIues, lhe average nunler of lils lo represenl lhe shape per VOI is
shovn in TalIe 2. The percenlages conpared lo lhe resuIls of fuII search, vhich uses SAD as
lhe dislorlion neasure, are aIso shovn in TalIe 2. An inporlanl conlrilulion is lhal lhe
WSAD nakes sone inprovenenl on lil-rale, and il can conpensale for lhe inaccuracy of
nolion veclor vhen lhe fasl search aIgorilhn is used.

Sequence
IuII Search vilh SAD IuII Search vilh WSAD
ils/VOI ils/VOI
rean 1659.35 1OO 1636.46 98.62
Nevs 9O9.43 1OO 9O6.1O 99.63
Iorenan 1213.73 1OO 1197.55 98.67
TalIe 2.Ierfornance conparison of SAD and WSAD lased on fuII search aIgorilhn in
lil-rale (I1=1O, I2=7).

Iig. 2O shovs lhe conparison of various search aIgorilhns. Il can le seen lhal lhe proposed
S and DS aIgorilhn lake Iess search poinls lhan IS aIgorilhn and lhe aIgorilhns
descriled in O-O.

Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 61
Bream
Frame
0 20 40 60 80 100
S
e
a
r
c
h

P
o
i
n
t
0
10000
20000
30000
40000
50000
FS
Ref. [6]
Ref. [7]
BS
DBS

News
Frame
0 20 40 60 80 100
S
e
a
r
c
h

P
o
i
n
t
0
5000
10000
15000
20000
FS
Ref. [6]
Ref. [7]
BS
DBS

Foreman
Frame
0 20 40 60 80 100
S
e
a
r
c
h

P
o
i
n
t
0
20000
40000
60000
FS
Ref. [6]
Ref. [7]
BS
DBS

Iig. 2O. Ierfornance conparisons of various search aIgorilhns.

TalIe 3 shovs lhe nunler of search poinls (SI), and TalIe 4 shovs lhe average nunler of lils
lo represenl lhe shape per VOI. The percenlages conpared lo lhe resuIls corresponding lo lhe
fuII search in MILC-4 VM are aIso shovn in lhese lalIes. S denoles lhe proposed
aIgorilhn vilhoul using dianond search pallern and DS denoles lhe proposed aIgorilhn
VLS 62
using dianond search pallern. TalIe 5 shovs lhe runline sinuIalion resuIls of various ML
aIgorilhns. Il is noled lhal lhe ML aIgorilhn O lakes nuch nore conpulalionaI conpIexily
lecause of lhe generalion of lhe nask for effeclive search area.

Sequence IuII Search Ref. O Ref. O Iroposed (S) Iroposed
(DS)
SI SI SI SI SI
rean 1O,5O4,494 6,495,838 61.84 1,56O,221 14.85 367,115 3.49 67,232 O.64
Nevs 747,O54 4O3,131 53.96 6,568 O.88 24,923 3.36 2,O48 O.27
Iorenan 9,O85,527 5,O93,4O9 56.O6 1,564,551 17.22 287,272 3.16 57,214 O.63
TalIe 3. TolaI search poinls for various search aIgorilhns.

Sequence IuII
Search
Ref. O Ref. O Iroposed (S) Iroposed (DS)
ils/VOI ils/VOI ils/VOI ils/VOI ils/VOI
rean 1659.35 1659.35 1OO.OO 1683.28 1O1.44 1655.78 99.78 1669.58 1OO.62
Nevs 9O9.43 9O9.43 1OO.OO 9O2.68 99.26 91O.82 1OO.15 9O8.39 99.89
Iorenan 1213.73 1213.73 1OO.OO 1219.47 1OO.47 12O9.35 99.64 1219.23 1OO.45
TalIe 4. Average lil-rale for various search aIgorilhns.

Conpared vilh lhe fuII search nelhod, lhe proposed fasl ML aIgorilhn (S) needs 3.5
search poinls and lakes equaI lil rale in lhe sane quaIily. y using lhe dianond-shaped
zones, lhe proposed aIgorilhn (DS) needs onIy O.6 search poinls. Conpared vilh olher
fasl ML O-O, our aIgorilhn uses Iess search poinls, especiaIIy in high nolion video
sequences, such as 'rean and 'Iorenan.

Sequence IuII
Search
Ref. O Ref. O Iroposed (S) Iroposed
(DS)
ns ns ns ns ns
rean 57718.67 62776.36 1O8.76 7782.83 13.48 8678.O2 15.O4 1738.24 3.O1
Nevs 383O.13 42O2.36 1O9.72 35.O5 O.92 574.O4 14.99 22.16 O.58
Iorenan 44877.19 52574.24 117.15 75O7.74 16.73 6487.52 14.46 17O3.88 3.8O
TalIe 5. Runline sinuIalion resuIls of various search aIgorilhns.

TalIe 6 shovs lhe nunler of non-skipped SS per IL. Due lo lhe conlrilulion of lhe
proposed CML aIgorilhn, lhe nunler of non-skipped SS is reduced IargeIy. Il can le
seen lhal lhe average nunler of non-skipped SS is nuch Iess lhan lhe lolaI nunler of SS,
vhich is denoled as 81 per IL. In lhe vorsl case, lhe addilionaI non-skipped SS is usuaIIy
in lhe posilions vilh Iarge nolion veclor, vhich does nol lend lo le lhe adoplive MV.



Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 63
Sequence Average non-skipped
SS per IL
Maxinun non-skipped SS
per IL
rean 12.78 33
Nevs 1O.73 16
Iorenan 1O.94 24
ChiIdren 13.4O 46
TalIe 6. Average and naxinun nunler of non-skipped SS per IL.

Therefore, in lhe linary nolion eslinalor, lhe nunler of non-skipped SS is Iiniled lo 32
fron experinenlalion in naxinun silualion. Ior average silualion, ve sel lhe nunler as
12.In our archileclure, lhe processing cycIes for various A lypes are shovn in Iig. 21.


Iig. 21. Irocessing cycIes for various A lypes.

TolaIIy seven lypes of node are descriled for A. The nunler in parenlheses indicales lhe
Ialency of lhal noduIe vilh average and naxinun nunler of cycIes. Nolice lhal lhe
processing cycIes of ML and CAL depend on lhe conlenl of A. To conpIele one A
processing in lhe vorsl case scenario, our archileclure requires 17O8 cIock cycIes, incIuding:
19 cIock cycIes for node decision, 35 cIock cycIes for idenlifying MVIS, 264 cIock cycIes for
size conversion, 78O cIock cycIes for ML, and 61O cIock cycIes for inler CAL.
AcluaIIy, fev Iileralures expIored lhe archileclure design of shape coding. In |1Oj, lhey onIy
designed lhe ML. We can exlracl our dala for ML parl as conparison in TalIe 7. In lerns
of lhe vhoIe shape coding,

L. A. AI_QaraIIehs |1Oj Iroposed
Cale counl 11582 1O523
One A processing
(cycIes)
563 78O
Connenl IarliaI design
(OnIy ML inpIenenled)
ConpIeled design
TalIe 7. Conparison vilh ML onIy.

TalIe 8 iIIuslrales sone resuIls vilh differenl archileclures. Ior our design lhe size of A
VLS 64
luffer and SR luffer are 1616 and 4444 respecliveIy. The average and naxinun nunlers
of non-skipped SS are delernined as 12 and 32 fron experinenls.

DDML |16j Nalarajans |21j Iroposed
Currenl A size 1616 lils RAM 1616 lils SRAM 1616 lils SRAM
SR luffer size 1632 lils RAM 4732 lils SRAM 4444 lils SRAM
Access fron SR luff lo
ollain one MV (yles)
4O96 5828 Average: 1348
Maxinun: 3328
Lalency lo ollain one MV
(cycIes)
1O39 1O39 Average: 36O
Maxinun: 74O
One A processing
(cycIes)
3O34
(vilhoul pipeIinig)
N/A 17O8
(vilhoul pipeIining)
TalIe 8. Archileclure anaIysis and conparison for various linary nolion eslinalions.

TalIe 8 aIso Iisls lhe archileclure conparisons lelveen lhe proposed and sone previous
vorks in |16j and |21j. In |16j il adopls lhe dala-dispalch lechnique and is naned as
dala-dispalch lased ML (DDML). |21j is Nalarajans archileclure vhich is nodified fron
ML-lased Yangs 1-D sysloIic array |22j. In lheir design, lhey use exlra nenory, SAI noduIe,
lo process lhe lil shifling and lil packing for lhe aIignnenl of A. Il aIso resuIls in a
conpulalion overhead. In our design, ve have used lhe oundary IixeI Deleclor for lhe
aIignnenl of loundary of A. AccordingIy, no SAI nenory is needed. Iurlhernore, lhe
proposed CML design needs Iess dala lransfer and Ialency lo ollain one nolion veclor
conpared vilh |16j and |21j, lecause ve consider lhe skipping on redundanl searches.
Conpared vilh lhe inpIenenlalion for one A processing in lhe vorsl case, our design aIso
requires Iess cycIes lhan |16j vilh lhe sane lase of non-pipeIining vork. OnIy 56 cycIes of
|16j is needed in our approach.
Iig. 22(a) shovs lhe synlhesized gale counl of each noduIe and Iig. 22(l) shovs lhe chip
Iayoul using synlhesizalIe VeriIog HDL. There are 7 synchronous RAMs in lhe chip. Tvo
16OO16 lils RAMs are used for frane luffer. Tvo 4822 lils RAMs are used for SR luffer.
One 322O lils RAM is used for SC luffer, one 3218 lils RAM is used for MC luffer and one
322O lils RAM is used for A luffer, respecliveIy. The chip fealure is sunnarized in TalIe
9. TolaI gale counl is 4O765. The chip size is 2.42.4 nn
2
vilh TSMC O.18n CMOS
lechnoIogy and lhe naxinun operalion frequency is 53 MHz


(a)
Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 65

(l)
Iig. 22. (a) Synlhesized gale counl of each noduIe. (l) Chip Iayoul of shape coding encoder.

Tcchnn!ngy T5MC 0.18m CMO5 (1P6M)
Packagc 128 CQFP
DIc sIzc
2.42.4 mm
2

Cnrc sIzc
1.41.4 mm
2

C!nck ratc 53 MHz
Pnwcr dIssIpatInn 35mW
Gatc cnunt 40765

Mcmnry sIzc
(bIts)
Framc buffcr: 2160016
5R buffcr: 24822
5C buffcr: 3220
MC buffcr: 3218
BAB buffcr: 3220
TalIe 9. Chip Iealures.

7. ConcIusion

MILC-4 has provided a veII-adopled oljecl-lased coding lechnique. When peopIe nigrale
fron conpressed coding donain lo oljecl coding donain, lhe conpIexily issue on shape
coding is converged. In lhis paper ve propose a fasl linary nolion eslinalion aIgorilhn
using dianond search pallern for shape coding and an efficienl archileclure for MILC-4
shape coding. y using lhe properlies of shape infornalion and dianond shaped zones, ve
can reduce lhe nunler of search poinls significanlIy, resuIling in a proporlionaI reduclion
VLS 66
of conpulalionaI conpIexily. The experinenlaI resuIls shov lhal lhe proposed nelhod can
reduce lhe nunler of search poinls of ML for shape coding lo onIy O.6 conpared vilh
lhal of lhe fuII search nelhod descriled in MILC-4 verificalion nodeI. SpecificaIIy, lhe fasl
aIgorilhn lakes equaI lil rale in lhe sane quaIily conpared vilh fuII search aIgorilhn. The
proposed aIgorilhn is sinpIe, efficienl and suilalIe for reaI-line soflvare and hardvare
appIicalions. This archileclure is lased on lhe loundary search fasl aIgorilhn vhich
acconpIishes lhe Iarge reduclion on conpulalion conpIexily. We aIso appIy lhe approaches
on cenler-liased nolion veclor dislrilulion and search range shrinking for furlher
inprovenenl. In lhis paper ve reporl a conprehensive expIoralion on each noduIe of
shape coding encoder. Our archileclure conpIeleIy eIalorales lhe advanlages of lhe
proposed fasl aIgorilhn vilh a high perfornance and reguIar archileclure. The resuIl shovs
lhal our design can reduce lhe nenory access and processing cycIes IargeIy. The average
nunler of cIock cycIes for one linary aIpha lIock processing is onIy 17O8, vhich is far Iess
lhan olher designs. The syslen archileclure is inpIenenled ly synlhesizalIe VeriIog HDL
vilh TSMC O.18n CMOS lechnoIogy. The chip size is 2.4 2.4 nn
2
and lhe naxinun
operalion frequency is 53 MHz.

8. AcknowIedgements

This vork vas supporled ly lhe CIC and lhe NalionaI Science CounciI of Taivan, R.O.C.
under Cranl NSC97-2220-|-008-001.

9. References

. Nalarajan, V. haskaran, and K. Konslanlinides, Lov-conpIexily lIock-lased nolion
eslinalion via one-lil lrasforns, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 7, pp.
7O2 -7O7, Aug. 1997.
D. Yu, S. K. }ang, and }. . Ra, A fasl nolion eslinalion aIgorilhn for MILC-4 shape
coding, |||| |n|. Ccnf. |magc Prcccssing, voI. 1, pp. 876-879, 2OOO.
L. A. AI_QaraIIeh, T. S. Chang, and K. . Lee, An efficienl linary nolion eslinalion
aIgorilhn and ils archileclure for MILC-4 shape encoding, |||| Trans. Circui|s
Sqs|. Vicc Tccnnc|., voI. 16, no. 17, pp. 859-868, }uI. 2OO6.
C. Ieygin, I. CIenn and I. Chov, ArchilecluraI advances in lhe VLSI inpIenenlalion of
arilhnelic coding for linary inage conpression, Prcc. Da|a Ccmprcssicn Ccnf.
(DCC94), pp. 254 -263, 1994.
C. SuIIivan and T. Wiegand, Rale-Dislorlion oplinizalion for video conpression, ||||
Signa| Prcccssing Magazinc, pp. 74-9O, Nov. 1998.
H. C. Chang, Y. C. Chang, Y. C. Wang, W. M. Chao, and L. C. Chen, VLSI archileclure
design for MILC-4 shape coding, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 12,
pp. 741-751, Sep. 2OO2.
ISO/ILC 13818-2, Infornalion lechnoIogy-generic coding of noving piclures and associaled
audio infornalion-Video, 1994.
ISO/ILC }TC1/SC29/WC11 N 25O2a, Ceneric coding of audio-visuaI oljecls: VisuaI
14492-2, AlIanlic Cily IinaI Drafl IS, Dec. 1998.
ISO/ILC }TC1/SC29/WC11 N 39O8, MILC-4 video verificalion nodeI version 18.O, }an.
2OO1.
Contour-Based Binary Motion Estimation Algorithm and VLS Design
for MPEG-4 Shape Coding 67
}. L. MilcheII and W. . Iennelaker, OplinaI hardvare and soflvare arilhnelic coding
procedures for lhe Q-Coder, |8M ]. Rcs. Dctc|., voI. 32, pp. 717-726, Nov. 1998.
}o Yev Than, Surendra Ranganalh, Mailreya Ranganalh, and Ashraf AIi Kassin, A noveI
unreslricled cenler-liased dianond search aIgorilhn for lIock nolion eslinalion,
|||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 8, pp. 169-177, Aug. 1998.
K. . Lee, }. Y. Lin and C. W. }en, A MuIlisynloI Conlexl-ased Arilhnelic Coding
Archileclure for MILC-4 Shape Coding, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|.,
voI. 15, pp. 283-295, Iel. 2OO5.
K. . Lee, }. Y. Lin, and C. W. }en, A fasl duaI synloI conlexl-lased arilhnelic coding for
MILC-4 shape coding, |||| |n|crna|icna| Sqmpcsium cn Circui|s an Sqs|cms
(|SCAS2004), pp. 317-32O, 2OO4.
K. M. Yang, M. T. Sun, and L. Wu, A faniIy of VLSI design for lhe nolion conpensalion
lIock-nalching aIgorilhn, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 36, pp. 1317
-1325, Ocl. 1989.
K. Ianusopone, and X. Chen, A fasl nolion eslinalion nelhod for MILC-4 arlilrariIy
shaped oljecls, |||| |n|. Ccnf. |magc Prcccssing, voI. 3, pp. 624-627, 2OOO.
L. K. Liu, and L. Ieig, A lIock-lased gradienl descenl search aIgorilhn for lIock nolion
eslinalion in video coding, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 6, pp.
419-422, Aug. 1996.
M. Tourapis, O. C. Au, and M. L. Liou, HighIy efficienl prediclive zonaI aIgorilhns for fasl
lIock-nalching nolion eslinalion, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 12,
pp. 934-947, Ocl. 2OO2.
N. rady, MILC-4 slandardized nelhods for lhe conpression of arlilrariIy shaped video
oljecls, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 9, pp. 117O-1189, Dec. 1999.
Reconnendalion H.263: Video coding for Iov lil-rale connunicalion, ITU-T H.263, 1998.
S. Dulla and W. WoIf, A fIexilIe paraIIeI archileclure adapled lo lIock-nalching nolion
eslinalion aIgorilhns, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 6, no 1, pp.
74-86, Iel. 1996.
S. H. Han, S. W. Kvon, T. Y. Lee, and M. K. Lee, Lov pover nolion eslinalion aIgorilhn
lased on lenporaI correIalion and ils archileclure, |||| |n|crna| Sqmpcsium cn
Signa| Prcccssing an i|s App|ica|icns (|SSPA), voI. 2, pp.647-65O, Aug. 2OO1.
T. H. Tsai and C. I. Chen, A fasl linary nolion eslinalion aIgorilhn for MILC-4 shape
coding, |||| Trans. Circui|s Sqs|. Vicc Tccnnc|., voI. 14, pp. 9O8-913, }un. 2OO4.
T. Konarek and I. Iirsch, Array archileclures for lIock-nalching aIgorilhns, |||| Trans.
Circui|s Sqs|., voI. 36, pp.13O1-13O8, Ocl. 1989.
T. M. Liu, . }. Shieh and C. Y. Lee, An efficienl nodeIing codec archileclure for linary
shape coding, |||| |n|crna|icna| Sqmpcsium cn Circui|s an Sqs|cms (|SCAS2002),
voI. 2, pp. 316 -319, 2OO2.


VLS 68
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 69
X

Memory-Efficient Hardware Architecture
of 2-D DuaI-Mode Lifting-Based Discrete
WaveIet Transform for JPEG2000

Chih-Hsien Hsia and }en-Shiun Chiang
Dcpar|mcn| cf ||cc|rica| |nginccring, Tam|ang Unitcrsi|q
Taipci, Taiuan

1. Introduction

Discrele vaveIel lransforn (DWT) has leen adopled in a vide range of appIicalions,
incIuding speech anaIysis, nunericaI anaIysis, signaI anaIysis, inage coding, pallern
recognilion, conpuler vision, and lionelrics (MaIIal, 1989). Il can le considered as a nuIli-
resoIulion deconposilion of a signaI inlo severaI conponenls vilh differenl frequency
lands. Moreover, DWT is a poverfuI looI for signaI processing appIicalions, such as
}ILC2OOO sliII inage conpression, denoising, region of inleresl, and valernarking. Ior reaI-
line processing il needs snaII nenory access and Iov conpulalionaI conpIexily.
InpIenenlalions of lvo-dinensionaI (2-D) DWT can le cIassified as convoIulion-lased
operalion (MaIIal, 1989) (Marino, 2OOO) (Vishvanalh el aI., 1995) (Wu. & Chen, 2OO1) and
Iifling-lased operalion (SveIdens, 1996). Since lhe convoIulion-lased inpIenenlalions of
DWT have high conpulalionaI conpIexily and Iarge nenory requirenenls, Iifling-lased
DWT has leen presenled lo overcone lhese dravlacks (SveIdens, 1996) (Daulechies &
SveIdens, 1998). The Iifling-lased schene can provide Iov-conpIexily soIulions for
inage/video conpression appIicalions, such as }ILC2OOO (Lian el aI., 2OO1), Molion-
}ILC2OOO (Seo & Kin, 2OO7), MILC-4 sliII inage coding, and MC-LZC (Ohn, 2OO5) (Chen
& Woods, 2OO4). Hovever, lhe reaI-line 2-D DWT for nuIlinedia appIicalion is sliII
difficuIl lo achieve. Hereafler, efficienl lransfornalion schenes for reaI-line appIicalion are
highIy denanded. Ierforning 2-D (or nuIli-dinensionaI) DWT requires nany
conpulalions and a Iarge lIock of lranspose nenory for sloring inlernediale signaIs vilh
Iong Ialency line. This vork presenls nev aIgorilhns and hardvare archileclures lo
inprove lhe crilicaI issues in 2-D duaI-node (supporling 5/3 IossIess and 9/7 Iossy coding)
Iifling-lased discrele vaveIel lransforn (LDWT). The proposed 2-D duaI-node LDWT
archileclure has lhe nerils of Iov lranspose nenory, Iov Ialency, and reguIar signaI fIov,
naking il suilalIe for VLSI inpIenenlalion. The lranspose nenory requirenenl of lhe NN
2-D 5/3 node LDWT is 2N and lhal of 2-D 9/7 node LDWT is 4N.
Lov lranspose nenory requirenenl is of a priorily concern in spaliaI-frequency donain
inpIenenlalion. CeneraIIy, rasler scan signaI fIov operalions are popuIar in NN 2-D DWT,
and under lhis approach lhe nenory requirenenl ranges fron 2N lo N
2
(Diou el aI., 2OO1)
4
VLS 70
(Andra el aI., 2OO2) (Chen & Wu, 2OO2) (Chen, 2OO2) (Chiang & Hsia, 2OO5) (}ung & Iark,
2OO5) (Vishvanalh el aI., 1995) (Huang el aI., 2OO5) (Mei el aI., 2OO6) (Huang el aI, 2OO5) (Wu
& Lin, 2OO5) (Lan ., 2OO5) (Wu. & Chen, 2OO1) in 2-D 5/3 and 9/7 nodes LDWT. In order lo
reduce lhe anounl of lhe lranspose nenory, lhe nenory access nusl le redirecled. In our
approach, lhe signaI fIov is revised fron rov-vise onIy lo nixed rov- and coIunn-vise,
and a nev approach, caIIed inlerIaced read scan aIgorilhn (IRSA), is used lo reduce lhe
anounl of lhe lranspose nenory. y lhe IRSA approach, a lranspose nenory size is of 2N
or 4N (5/3 or 9/7 node) for an NN DWT. The proposed 2-D LDWT archileclure is lased
on paraIIeI and pipeIined schenes lo increase lhe operalion speed. Ior hardvare
inpIenenlalion ve repIace nuIlipIiers vilh shiflers and adders lo acconpIish high
hardvare uliIizalion. This 2-D LDWT has lhe characlerislics of high hardvare uliIizalion,
Iov nenory requirenenl, and reguIar signaI fIov. A 256256 2-D duaI-node LDWT vas
designed and sinuIaled ly VeriIogHDL, and furlher synlhesized ly lhe Synopsys design
conpiIer vilh TSMC O.18n 1I6M CMOS process lechnoIogy.

2. Survey of 2-D LDWT Architecture

Anong lhe variely of DWT aIgorilhns, LDWT provides a nev approach for conslrucling
liorlhogonaI vaveIel lransforns and aIso provides an efficienl schene for caIcuIaling
cIassicaI vaveIel lransforns (SveIdens, 1996) (Chen & Wu, 2OO2) (Andra el aI., 2OOO) (Andra
el aI., 2OO2) (Diou el aI., 2OO1) (Chen, 2OO2) (Chiang & Hsia, 2OO5) (Tan & ArsIan, 2OO1)
(Huang el aI., 2OO5) (Huang el aI., 2OO2) (Mei el aI., 2OO6) (Weeks & ayouni, 2OO2)
(Varshney el aI., 2OO7) (Huang el aI., 2OO4) (Tan & ArsIan, 2OO3) (}iang & Orlega, 2OO1) (}ung
& Iark, 2OO5) (Chen, 2OO4) (Lian el aI, 2OO1) (Seo & Kin, 2OO7) (Huang el aI., 2OO5) (Wu &
Lin, 2OO5) (Lan el aI., 2OO5) (Wu. & Chen, 2OO1). Iacloring lhe cIassicaI vaveIel fiIler inlo
Iifling sleps can reduce lhe conpulalionaI conpIexily of lhe corresponding DWT ly up lo
5O (Daulechies & SveIdens, 1998). The Iifling sleps can le inpIenenled easiIy, vhich is
differenl fron lhe direcl finile inpuIse response (IIR) inpIenenlalions of MaIIals
aIgorilhn (Daulechies & SveIdens, 1998). Andra c| a|. (Andra el aI., 2OOO) (Andra el aI.,
2OO2) proposed a lIock-lased sinpIe four-processor archileclure lhal conpules severaI
slages of lhe DWT al a line. Diou c| a|. (Diou el aI., 2OO1) presenled an archileclure lhal
perforns LDWT vilh a 5/3 fiIler ly inlerIeaving lechnique. Chen c| a|. (Chen & Wu, 2OO2)
proposed a foIded and pipeIined archileclure for a 2-D LDWT inpIenenlalion, vilh
nenory size of 2.5N for an NN 2-D DWT. This Iifling archileclure for verlicaI fiIlering is
divided inlo lvo parls, each consisling of one adder and one nuIlipIier. Since lolh parls are
aclivaled in differenl cycIes, lhey can share lhe sane adder and nuIlipIier lo increase lhe
hardvare uliIizalion and reduce lhe Ialency. Hovever, lhis archileclure aIso has high
conpIexily due lo lhe characlerislics of lhe signaI fIov. Chen c| a|. (Chen, 2OO2) proposed a
fIexilIe foIded archileclure for 3-IeveI 1-D LDWT lo increase lhe hardvare uliIizalion.
Chiang c| a|. (Chiang & Hsia, 2OO5) proposed a 2-D DWT foIded archileclure lo inprove lhe
hardvare uliIizalion. }iang c| a|. (}iang & Orlega, 2OO1) presenled a paraIIeI processing
archileclure lhal nodeIs lhe DWT conpulalion as a finile slale nachine and efficienlIy
conpules lhe vaveIel coefficienls near lhe loundary of each segnenl of lhe inpul signaI.
Lian c| a|. (Lian el aI, 2OO1) and Chen c| a|. (Chen, 2OO4) used a 1-D foIded archileclure lo
inprove lhe hardvare uliIizalion of 5/3 and 9/7 fiIlers. The recursive archileclure is a
generaI schene lo inpIenenl any vaveIel fiIler lhal is deconposed inlo Iifling sleps in
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 71
snaIIer hardvare conpIexily. }ung c| a|. (}ung & Iark, 2OO5) presenled an efficienl VLSI
archileclure of duaI-node LDWT lhal is used ly Iossy or IossIess conpression of }ILC2OOO.
Marino (Marino, 2OOO) proposed a high-speed/Iov-pover pipeIined archileclure for lhe
direcl 2-D DWT ly four-sulland lransforns perforned in paraIIeI. The archileclure of
(Huang el aI., 2OO2) inpIenenls 2-D DWT vilh onIy lranspose nenory ly using recursive
pyranid aIgorilhn (IRA). In (Vishvanalh el aI., 1995) il has lhe average of N
2
conpuling
line for aII DWT IeveIs. Hovever, lhey use nany nuIlipIiers and adders. Varshney c| a|.
(Varshney el aI., 2OO7) presenled energy efficienl singIe-processor and fuIIy pipeIined
archileclures for 2-D 5/3 Iifling-lased }ILC2OOO. The singIe processor perforns lolh rov-
vise and coIunn-vise processing sinuIlaneousIy lo achieve lhe 2-D lransforn vilh 1OO
hardvare uliIizalion. Tan c| a|. (Tan & ArsIan, 2OO3) presenled a shifl-accunuIalor
arilhnelic Iogic unil archileclure for 2-D Iifling-lased }ILC2OOO 5/3 DWT. This archileclure
has an efficienl nenory organizalion, vhich uses a snaII anounl of enledded nenory for
processing and luffering. Those archileclures achieve nuIli-IeveI deconposilion using an
inlerIeaving schene lhal reduces lhe size of nenory and lhe nunler of nenory accesses,
lul have sIov lhroughpul rales and inefficienl hardvare uliIizalion. Seo c| a|. (Seo & Kin,
2OO7) proposed a processor lhal can handIe any liIe size, and supporls lolh 5/3 and 9/7
fiIlers for Molion-}ILC2OOO. Huang c| a|. (Huang el aI, 2OO5) proposed a generic RAM-lased
archileclure vilh high efficiency and feasiliIily for 2-D DWT. Wu c| a|. (Wu & Lin, 2OO5)
presenled a high-perfornance and Iov-nenory archileclure lo inpIenenl a 2-D duaI-node
LDWT. The pipeIined signaI palh of lheir archileclure is reguIar and praclicaI. Lan c| a|. (Lan
el aI., 2OO5) proposed a schene lhal can process lvo Iines sinuIlaneousIy ly processing lvo
pixeIs in a cIock period. Wu c| a|. (Wu. & Chen, 2OO1) proposed an efficienl VLSI archileclure
for direcl 2-D LDWT, in vhich lhe poIy-phase deconposilion and coefficienl foIding are
adopled lo increase lhe hardvare uliIizalion. Despile lhese efficienl inprovenenls lo
exisled archileclures, furlher inprovenenls in lhe aIgorilhn and archileclure are sliII
needed. Sone VLSI archileclures of 2-D LDWT lry lo reduce lhe lranspose nenory
requirenenls and connunicalion lelveen lhe processors (Chen & Wu, 2OO2) (Andra el aI.,
2OOO) (Andra el aI., 2OO2) (Diou el aI., 2OO1) (Chen, 2OO2) (Chiang & Hsia, 2OO5) (Tan &
AsIan, 2OO2) (}iang & Orlega, 2OO1) (Lian el aI., 2OO1) (}ung & Iark, 2OO5) (Chen, 2OO4)
(Huang el aI., 2OO5) (Daulechies & SveIdens, 1998) (Marino, 2OOO) (Vishvanalh el aI., 1995)
(Taulnan & MarceIIin, 2OO1) (MarceIIin el aI., 2OOO) (Mei el aI., 2OO6) (Varshney el aI., 2OO7)
(Huang el aI., 2OO4) (Tan & ArsIan, 2OO3) (Seo & Kin, 2OO7) (Huang el aI., 2OO5) (Wu & Lin,
2OO5) (Lan el aI., 2OO5) (Wu. & Chen, 2OO1), hovever lhese hardvare archileclures sliII need
Iarge lranspose nenory.

3. Discrete waveIet transform and Iifting-based method

This seclion lriefIy revievs lhe use of DWT in lhe coding engine of }ILC2OOO (Taulnan &
MarceIIin, 2OO1). The cIassicaI DWT enpIoys fiIlering and convoIulion lo achieve signaI
deconposilion (MaIIal, 1989) (Marino, 2OOO) (Vishvanalh el aI., 1995) (Wu. & Chen, 2OO1).
Meyer and MaIIal found lhal lhe orlhonornaI vaveIel deconposilion and reconslruclion
can le inpIenenled in lhe nuIli-resoIulion signaI anaIysis franevork (MaIIal, 1989). The
nuIli-resoIulion anaIysis is nov a slandard nelhod for conslrucling lhe orlhonornaI
vaveIel-lases. }ILC2OOO adopls lhis characlerislic lo lransforn an inage in lhe spaliaI
donain inlo lhe frequency donain.
VLS 72
3.1 CIassicaI DWT
DWT perforns nuIli-resoIulion deconposilion of lhe inpul signaIs (MaIIal, 1989). The
originaI signaIs are firsl deconposed inlo lvo sulspaces, caIIed lhe Iov- (Iov-pass) and
high-frequency (high-pass) sullands. The cIassicaI DWT inpIenenls lhe deconposilion
(anaIysis) of a signaI ly a Iov-pass digilaI fiIler H and a high-pass digilaI fiIler G. olh
digilaI fiIlers are derived using lhe scaIing funclion and lhe corresponding vaveIels. The
syslen dovnsanpIes lhe signaI lo decinale haIf of lhe fiIlered resuIls in deconposilion
processing. The Z-lransfer funclions of H(z) and G(z) lased on four-lap and non-recursive
IIR fiIlers vilh Ienglh | are represenled as foIIovs:

H(z)= n
O
+ n
1
z
-1
+ n
2
z
-2
+ n
3
z
-3
, (1)

G(z)= g
O
+ g
1
z
-1
+ g
2
z
-2
+ g
3
z
-3
. (2)

The reconslruclion (synlhesis) process is inpIenenled using an up-sanpIing process.
MaIIals lree aIgorilhn or pyranid aIgorilhn (MaIIal, 1989) can le used lo find lhe nuIli-
resoIulion deconposilion DWT. The deconposilion DWT coefficienls al each resoIulion
IeveI can le caIcuIaled as foIIovs:

for (j=1 lo ])
for (i=O lo N/2
j
-1)
|

1
1
0
X ( ) ( ) X ( 2 )
k
f J
H
i
n G : H n i

_
(3)


1
1
0
X ( ) ( ) X ( 2 )
k
f J
L
i
n H : G n i

_
(4)


vhere j denoles lhe currenl resoIulion IeveI, | lhe nunler of lhe fiIler lap, ) ( X
H
n
f
lhe nlh
high-pass DWT coefficienl al lhe jlh IeveI, ) (n X
f
L
lhe nlh Iov pass DWT coefficienl al lhe
jlh IeveI, and N lhe Ienglh of lhe originaI inpul sequence. Iig. 1 shovs a 3-IeveI 1-D DWT
deconposilion using MaIIals aIgorilhn.






Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 73

Iig. 1. 3-IeveI 1-D DWT deconposilion using MaIIals aIgorilhn.

The dovnsanpIing operalion is lhen appIied lo lhe fiIlered resuIls. A pair of fiIlers are
appIied lo lhe signaI lo deconpose lhe inage inlo lhe Iov-Iov (LL), Iov-high (LH), high-
Iov (HL), and high-high (HH) vaveIel frequency lands. Iig. 2 iIIuslrales lhe lasic 2-D DWT
operalion and lhe lransforned resuIl vhich is conposed of lvo cascading 1-D DWTs. The
inage is firsl anaIyzed horizonlaIIy lo generale lvo sulinages. The infornalion is lhen senl
inlo lhe second 1-D DWT lo perforn lhe verlicaI anaIysis lo generale four sullands, and
each vilh a quarler of lhe size of lhe originaI inage. Considering an inage of size NN, each
land is sulsanpIed ly a faclor of lvo, so lhal each vaveIel frequency land conlains
N/2N/2 sanpIes. The four sullands can le inlegraled lo generale an oulpul inage vilh
lhe sane nunler of sanpIes as lhe originaI one.
Mosl inage conpression appIicalions can reappIy lhe alove 2-D vaveIel deconposilion
repealedIy lo lhe LL sulinage, each line forning four nev sulland inages, lo nininize
lhe energy in lhe Iover frequency lands.


VLS 74

Iig. 2. The 2-D anaIysis DWT inage deconposilion process.

Iig. 3. Iock diagran of lhe LDWT.

3.2 LDWT aIgorithm
The Iifling-lased schene proposed ly Daulechies and SveIdens requires fever
conpulalions lhan lhe lradilionaI convoIulion-lased approach (SveIdens, 1996)
(Daulechies & SveIdens, 1998). The Iifling-lased schene is an efficienl inpIenenlalion for
DWT, il can easiIy use inleger operalions and avoid lhe prolIens caused ly lhe finile
precision or rounding. The LucIidean aIgorilhn can le used lo faclorize lhe poIy-phase
nalrix of a DWT fiIler inlo a sequence of aIlernaling upper and Iover lrianguIar nalrices
and a diagonaI nalrix. The varialIes n(z) and g(z) in (5) respecliveIy denole lhe Iov- and
high-pass anaIysis fiIlers, vhich can le divided inlo even and odd parls lo generale a poIy-
phase nalrix P(z) as in (6).

g(z)=g
c
(z
2
)+ z
-1
g
c
(z
2
),

n(z)=n
c
(z
2
)+z
-1
n
c
(z
2
). (5)

Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 75

(

=
) ( ) (
) ( ) (
) (
: g : h
: g : h
: P
o o
e e
. (6)

The LucIidean aIgorilhn recursiveIy finds lhe grealesl connon divisors of lhe even and
odd parls of lhe originaI fiIlers. Since n(z) and g(z) forn a conpIenenlary fiIler pair, P(z) can
le faclorized inlo (7):

1 ( ) 1 0 0
( )
0 1 ( ) 1 0 1 /
1
i
i
m
s : k
P :
t : k
i
| | | | | |
=
| | |
\ . \ . \ .
=
[
. (7)
vhere s
i
(z) and |
i
(z) are Laurenl poIynoniaIs corresponding lo lhe prediclion and updale
sleps, respecliveIy, and | is a nonzero conslanl. Therefore, lhe fiIler lank can le faclorized
inlo lhree Iifling sleps.
As iIIuslraled in Iig. 3, a Iifling-lased schene has lhe foIIoving four slages:
1) SpIil phase: The originaI signaI is divided inlo lvo disjoinl sulsels. SignificanlIy, lhe
varialIe Xc denoles lhe sel of even sanpIes and Xc denoles lhe sel of odd sanpIes. This
phase is aIso caIIed Iazy vaveIel lransforn lecause il does nol decorreIale lhe dala lul onIy
sulsanpIes lhe signaI inlo even and odd sanpIes.
2) Iredicl phase: The predicling operalor I is appIied lo lhe sulsel Xc lo ollain lhe vaveIel
coefficienls |nj as in (8).

|nj=Xc|nj+I(Xc|nj). (8)

3) Updale phase: Xc|nj and |nj are conlined lo ollain lhe scaIing coefficienls s|nj afler an
updale operalor U as in (9).

s|nj=Xc|nj+U(|nj). (9)

4) ScaIing: In lhe finaI slep, lhe nornaIizalion faclor is appIied on s|nj and |nj lo ollain lhe
vaveIel coefficienls. Ior exanpIe, (1O) and (11) descrile lhe inpIenenlalion of lhe 5/3
inleger Iifling anaIysis DWT and are used lo caIcuIale lhe odd (high-pass) and even
coefficienls (Iov-pass), respecliveIy.

*| | (2 1) (2 ) (2 2) / 2 a n X n X n X n = + + +
(

. (1O)

*| | (2 ) (2 1) (2 1) 2/ 4 s n X n a n a n = + + +
(

. (11)

AIlhough lhe Iifling-lased schene has Iov conpIexily, ils Iong and irreguIar signaI palhs
cause lhe najor Iinilalion for efficienl hardvare inpIenenlalions. AddilionaIIy, lhe
increasing nunler of pipeIined regislers increases lhe inlernaI nenory size of lhe 2-D DWT
archileclure. The 2-D LDWT uses a verlicaI 1-D LDWT sulland deconposilion and a
horizonlaI 1-D LDWT sulland deconposilion lo find lhe 2-D LDWT coefficienls. Therefore,
lhe nenory requirenenl doninales lhe hardvare cosl and archilecluraI conpIexily of 2-D
LDWT. Iig. 4 shovs lhe 5/3 node 2-D LDWT operalion. The defauIl vaveIel fiIlers in
}ILC2OOO are duaI-node (5/3 and 9/7 nodes) LDWT (Taulnan & MarceIIin, 2OO1). The
Iifling-lased sleps associaled vilh lhe duaI-node vaveIels are shovn in Iigs. 5 and 6,
VLS 76
respecliveIy. Assuning lhal lhe originaI signaIs are infinile in Ienglh, lhe firsl Iifling slage is
firsl appIied lo perforn lhe DWT.


(a)

1'5/3 OLIWLQJEDVHG
':7
1'5/3 OLIWLQJEDVHG
':7
/ +
//
++
+/
/+
2ULJLQDOLPDJH
d0 s1 d1 s2 s0
L0
H0 H1
L1
H2
2ULJLQDO
3L[HOV
+LJK3DVV
2XWSXW
/RZ3DVV
2XWSXW
++%DQG
2XWSXW
+/%DQG
2XWSXW
H1 H2 H3 H4 H0
+LJK
)UHTXHQF\
3L[HOV
/+%DQG
2XWSXW
//%DQG
2XWSXW
L1 L2 L3 L4 L0
/RZ
)UHTXHQF\
3L[HOV
+)3DUW
/)3DUW
1'5/3 OLIWLQJEDVHG
':7
HH0 HH1
HL2 HL1 HL0
LH0 LH1
LL0 LL1 LL2

(l)
Iig. 4. 5/3 node 2-D LDWT operalion. (a) The lIock diagran fIov of a lradilionaI 2-D DWT.
(l) DelaiIed processing fIov.

Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 77
Iig. 5 shovs lhe Iifling-lased slep associaled vilh lhe vaveIel aIgorilhn. The originaI
signaIs incIuding sO, O, s1, 1, s2, 2, . are lhe inpul pixeI sequences. If lhe originaI signaIs
are infinile in Ienglh, lhen lhe firsl-slage Iifling is appIied lo updale lhe odd index dala sO,
s1, .. In (12), lhe paranelers 1/2 and H
i
denole lhe firsl slage Iifling paraneler and
oulcone, respecliveIy. Lqualion (12) shovs lhe operalion of lhe 5/3 inleger LDWT (Wu &
Lin, 2OO5) (Marlina & Masera, 2OO7) (Hsia & Chiang, 2OO8).

H
i
= |(S
i
+S
i+1
)+d
i
jK
0,

L
i
= |(H
i
+H
i-1
)+S
i
jK
1
, (12)

vhere = 1/2, =1/4, and K
0
= K
1
= 1.


Iig. 5. 5/3 LDWT aIgorilhn.


Iig. 6. 9/7 LDWT aIgorilhn.
VLS 78
Togelher vilh lhe high-frequency Iifling paraneler, , and lhe inpul signaI ve can find lhe
firsl slage high-frequency vaveIel coefficienl, H
i
. Afler H
i
is found, H
i
logelher vilh lhe
Iov-frequency paraneler, , and lhe inpul signaIs of lhe second slage Iov-frequency
vaveIel coefficienls, L
i
, can le found. The lhird and fourlh slages Iifling can le found in a
siniIar nanner.
SiniIar lo lhe 1-IeveI 1-D 5/3 node LDWT, lhe caIcuIalion of a 1-IeveI 1-D 9/7 node LDWT
is shovn in (13).

a
i
= |(S
i
+S
i+1
)+d
i
j,
l
i
= |(a
i
+a
i-1
)+S
i
j,

H
i
= |(l
i
+l
i+1
)+a
i
jK
0,

L
i
= |(H
i
+H
i-1
)+l
i
jK
1.
(13)

vhere = 1.586134142, = O.O5298O118, = +O.882911O75, = +O.4435O6852, K
0
=
1.23O1741O4, and K
1
= 1/K
0
.
The caIcuIalion conprises four Iifling sleps and lvo scaIing sleps.

3.3 Boundary extension treatment for LDWT
The finile-Ienglh signaI processed ly DWT Ieads lo lhe loundary effecl. The }ILC2OOO
slandard enhances lhe synnelric exlension pixeI al lhe edge, as shovn in TalIe 1. An
appropriale signaI exlension is required lo nainlain lhe sane nunler of lhe vaveIel
coefficienls as in lhe originaI signaI. The enledded signaI exlension aIgorilhn (Tan &
ArsIan, 2OO1) can le used lo conpule lhe loundary of lhe inage.
Since lolh lhe exlended signaI and lhe Iifling slruclure are synnelricaI, aII lhe inlernediale
and finaI resuIls of lhe Iifling-lased DWT are aIso synnelricaI vilh regard lo lhe loundary
poinls, and lhe loundary exlension can le perforned vilhoul addilionaI conpulalionaI
conpIexily. The inverse Iifling slruclure is easiIy derived fron TalIe 1 (Tan & ArsIan, 2OO1).
The loundary exlension signaI refIecls lhe signaI lo i
|cf|
sanpIes on lhe Iefl and i
rign|
sanpIes
on righl. TalIe 1 shovs lhe exlension paranelers i
|cf|
and i
rign|
for lhe reversilIe lransforn
5/3 node and lhe irreversilIe lransforn 9/7 node.

I
0
I
|eft
(5/3)

I
1
I
rlght
(5/3)

I
0
I
|eft
(9/7)

I
1
I
rlght
(9/7)

even 2 odd 1 even 4 odd 3
* i
0
: firsl sanpIe index, i
1
: Iasl sanpIe index.
TalIe 1. oundary exlension lo lhe Iefl and lo lhe righl for }ILC2OOO.

4. InterIaced read scan aIgorithm (IRSA)

In recenl years, nany 2-D LDWT archileclures have leen proposed lo neel lhe
requirenenls of on-chip nenory for reaI-line processing. Hovever, lhe hardvare
uliIizalion of lhese archileclures needs lo le furlher inproved. In DWT inpIenenlalion, a 1-
D DWT needs very nassive conpulalion and lherefore lhe conpulalion unil lakes nosl of
lhe hardvare cosl (Chen & Wu, 2OO2) (Andra el aI., 2OOO) (Andra el aI., 2OO2) (Diou el aI.,
2OO1) (Chen, 2OO2) (Chiang & Hsia, 2OO5) (Chen, 2OO4) (Huang el aI., 2OO5) (Daulechies &
SveIdens, 1998) (Marino, 2OOO) (Vishvanalh el aI., 1995) (Taulnan & MarceIIin, 2OO1)
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 79
(Varshney el aI., 2OO7) (Huang el aI., 2OO4) (Tan & ArsIan, 2OO3) (Seo & Kin, 2OO7) (Huang el
aI., 2OO5) (Wu & Lin, 2OO5) (Lan el aI., 2OO5) (Wu. & Chen, 2OO1). A 2-D DWT is conposed of
lvo 1-D DWTs and a lIock of lranspose nenory. In lhe convenlionaI approach, lhe size of
lhe lranspose nenory is equaI lo lhe size of lhe processed inage signaI. Iig. 7(a) shovs lhe
concepl of lhe proposed duaI-node LDWT archileclure, vhich consisls of signaI
arrangenenl unil, processing eIenenl, nenory unil, and conlroI unil, as shovn in Iig. 7(l).
The oulpuls are fed lo lhe 2-D LDWT four-sulland coefficienls, HH, HL, LH, and LL. The
proposed archileclure is descriled in delaiI in lhis seclion, and ve focus on lhe 2-D duaI-
node LDWT.
Conpared lo lhe conpulalion unil, lhe lranspose nenory lecones lhe nain overhead in
lhe 2-D DWT. The lIock diagran of a convenlionaI 2-D DWT is shovn in Iig. 4. Wilhoul
Ioss of generaIily, lhe 2-D 5/3 node LDWT is considered for lhe descriplion of lhe 2-D
LDWT. If lhe inage dinension is NN, during lhe lransfornalion ve need a Iarge lIock of
lranspose nenory (order of N
2
) lo slore lhe DWT coefficienls afler lhe conpulalion of lhe
firsl slage 1-D DWT deconposilion. The second slage 1-D DWT lhen uses lhe slored dala lo
conpule lhe 2-D DWT coefficienls of lhe four sullands (Chen & Wu, 2OO2) (Andra el aI.,
2OOO) (Andra el aI., 2OO2) (Diou el aI., 2OO1) (Chen, 2OO2) (Varshney el aI., 2OO7) (Huang el
aI., 2OO4) (Tan & ArsIan, 2OO3) (Seo & Kin, 2OO7) (Huang el aI., 2OO5) (Wu & Lin, 2OO5) (Lan
el aI., 2OO5) (Wu. & Chen, 2OO1). The conpulalion and lhe access of lhe nenory nay lake
line and lherefore lhe Ialency is Iong. Since lhe nenory size of N
2
is a Iarge quanlily, here
ve lry lo use lhe approach, inlerIaced read scan aIgorilhn (IRSA), lo reduce lhe required
lranspose nenory lo an order of 2N or 4N (5/3 or 9/7 node).


(a)

(l)
Iig. 7. The syslen lIock diagran of lhe proposed 2-D DWT. (a) 2-D duaI-node LDWT. (l)
Iock diagran of lhe proposed syslen archileclure.
VLS 80
x(i,j): originaI inage, i = O~5 and j = O~5
|(i,j): high frequency vaveIel coefficienl of 1-D LDWT
c(i,j): Iov frequency vaveIel coefficienl of 1-D LDWT
HH: high-high frequency vaveIel coefficienl of 2-D LDWT
H|: high-Iov frequency vaveIel coefficienl of 2-D LDWT
|H: Iov-high frequency vaveIel coefficienl of 2-D LDWT
||: Iov-Iov frequency vaveIel coefficienl of 2-D LDWT
Iig. 8. LxanpIe of 2-D 5/3 node LDWT operalions.

Wilhoul Ioss of generaIily, Iel us lake a 66-pixeI inage lo descrile lhe 2-D 5/3 node LDWT
operalion and IRSA. Iig. 8 shovs lhe operalion diagran of lhe 2-D 5/3 node LDWT
operalions of a 66 inage. In Iig. 8, x(i,j), i = O lo 5 and j = O lo 5, represenls lhe originaI
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 81
inage signaI. The Iefl nosl lvo coIunns are lhe Iefl loundary exlension coIunns, and lhe
righl nosl coIunn is lhe righl loundary exlension coIunn. The delaiIs of lhe loundary
exlension vere descriled in lhe previous seclion. The Iefl haIf of Iig. 8 shovs lhe firsl slage
1-D DWT operalions. The righl haIf of Iig. 8 shovs lhe second slage 1-D DWT operalions
for finding lhe four sulland coefficienls, HH, HL, LH, and LL. In lhe firsl slage 1-D DWT,
lhree pixeIs are used lo find a 1-D high-frequency coefficienl. Ior exanpIe, x(O,O), x(O,1), and
x(O,2) are used lo find lhe high-frequency coefficienl |(O,O), |(O,O) = |x(O,O) + x(O,2)j/2 +
x(O,1). To caIcuIale lhe nexl high-frequency coefficienl |(O,1), ve need pixeIs x(O,2), x(O,3),
and x(O,4). Here x(O,2) is used lo caIcuIale lolh |(O,O) and |(O,1) and is caIIed lhe overIapped
pixeI. The Iov-frequency coefficienl is caIcuIaled using lvo conseculive high-frequency
coefficienls and lhe overIapped pixeI. Ior exanpIe, |(O,O) and |(O,1) are logelher vilh x(O,2)
lo find lhe Iov-frequency coefficienl c(O,1), c(O,1) = ||(O,O) + |(O,1)j/4 + x(O,2). The caIcuIaled
high-frequency coefficienls, |(i,j), and Iov-frequency coefficienls, c(i,j), are lhen used in lhe
second slage 1-D DWT lo caIcuIale lhe four sulland coefficienls, HH, HL, LH, and LL.
In lhe second slage 1-D DWT of Iig. 8, lhe firsl HH coefficienl, HH(O,O), is caIcuIaled ly
using |(O,2), |(O,1), and |(O,O), HH(O,O) = ||(O,O) + |(O,2)j/2 + |(O,1). The olher HH
coefficienls can le conpuled in lhe sane nanner using lhree coIunn conseculive |(i,j)
signaIs. Ior lvo coIunn conseculive HH coefficienls il has an overIapped |(i,j) signaI. Ior
exanpIe |(O,3) is lhe overIapped signaI for conpuling HH(O,O) and HH(O,1). To conpule HL
coefficienls, il needs lvo coIunn conseculive HH coefficienls and an overIapped |(i,j)
signaI. Ior exanpIe, HL(O,1) is conpuled fron HH(O,O), HH(O,1), and |(O,3), HL(O,1) =
|HH(O,O) + HH(O,1)j/4 + |(O,3). The LH coefficienls are conpuled fron lhe c(i,j) signaI, and
each LH coefficienl needs lhe caIcuIalion of lhree c(i,j) signaIs. Ior exanpIe, LH(O,1) is
conpuled fron c(O,2), c(O,3), and c(O,4), LH(O,1) = |c(O,2) + c(O,4)j/2 + c(O,3). Ior lvo
coIunn conseculive LH coefficienls il has an overIapped c(i,j) signaI. Ior exanpIe, c(O,3) is
lhe overIapped signaI for conpuling LH(O,O) and LH(O,1). To conpule LL coefficienls, il
needs lvo coIunn conseculive LH coefficienls and an overIapped c(i,j) signaI. Ior exanpIe,
LL(O,1) is conpuled fron LH(O,O), LH(O,1), and c(O,2), LL(O,1) = |LH(O,O) + LH(O,1)j/4 +
c(O,2). The delaiI caIcuIalion equalions for lhe four sulland coefficienls are sunnarized in
lhe foIIoving equalions:


1 1 2
0 0 -1
HH( , ) (2 1,2 1)(1/4) (2 2 ,2 2t)(-1/2) (2 , ,,2 ,-1 ,).
s t s
h v x h v x h s v x h s v s
_ _ _
(14)


HL( , ) (1/4)|HH( , -1)HH( , )|b( ,2 )
(1/4)|HH( , -1)HH( , )|(-1/2)| (2 ,2 ) (2 2,2 )| (2 1,2 ).
h v h v h v h v
h v h v x h v x h v x h v
(15)

LH( , ) (1/4)|HH( -1, )HH( , )|(-1/2)|x(2 ,2 )x(2 ,2 2)|x(2 ,2 1). h v h v h v h v h v h v (16)

LL( , ) (1/4)|LH( , -1)LH( , )|c( ,2 )
(1/4)|LH( , -1)LH( , )|(1/4)|b( -1,2 )b( ,2 )| (2 ,2 )
(1/4)|LH( , -1)LH( , )|(1/4
hv hv hv h v
hv hv h v h v x h v
hv hv )|(-1/2) (2 -2,2 ) (2 -1,2 )(-1) (2 ,2 ) (2 1,2 )(-1/2) (2 2,2 )| (2 ,2 ). x h v x h v x h v x h v x h v x h v
(17)

The paranelers in lhe alove equalions are defined as foIIovs:
VLS 82
n: horizonlaI rov, t: verlicaI coIunn, x: originaI inage, l: high-frequency vaveIel coefficienl
of 1-D DWT, c: Iov-frequency vaveIel coefficienl of 1-D DWT, 1/2: Irediclion paraneler,
and 1/4: Updaling paraneler.
Iron lhe descriplion of lhe operalions of lhe 2-D 5/3 node LDWT ve find lhal each 1-D
high-frequency coefficienl, |(i,j), is caIcuIaled fron lhree inage signaIs, and one of lhe inage
signaI is overIapped vilh lhe previous |(i,j). The 1-D Iov-frequency coefficienl, c(i,j), is
caIcuIaled fron lvo rov conseculive |(i,j)s and an overIapped pixeI. The HH, HL, LH, and
LL coefficienls are conpuled fron |(i,j)s and c(i,j)s. If ve can change lhe scanning order of
lhe firsl slage 1-D LDWT and lhe oulpul order of lhe second slage 1-D LDWT, during lhe 2-
D LDWT operalion ve need onIy lo slore lhe |(i,j)s lo lhe lranspose nenory (Iirsl-In Iirsl-
Oul, IIIO size of N) and lhe overIapped pixeIs lo lhe inlernaI nenory (R4+R9 size of N).
Ior an NN inage, lhe lranspose nenory lIock can le reduced lo onIy a size of 2N as
shovn in Iig. 9. IRSA is lased on lhis idea, and il can reduce lhe requirenenl of lhe
lranspose nenory significanlIy. The lIock diagran of IRSA vilh severaI pixeIs of an inage
is shovn in Iig. 9. In Iig. 9, lhe nunlers on lop and Iefl represenl lhe coordinale indexes of
a 2-D inage. In order lo increase lhe operalion speed, IRSA scans lvo pixeIs in lhe
conseculive rovs a line. IN1 (IniliaI in x(O,O)) and IN2 (IniliaI in x(1,O)) are lhe scanning
inpuls al leginning. Al lhe firsl cIock, lhe syslen scans lvo pixeIs, x(O,O) and x(1,O), fron
IN1 and IN2, respecliveIy. Al lhe second cIock, IN1 and IN2 read pixeIs x(O,1) and x(1,1),
respecliveIy. Al cIock 3, IN1 and IN2 read pixeIs x(O,2) and x(1,2), respecliveIy. Afler IN1
and IN2 have read lhree pixeIs, lhe DWT processor lries lo conpule lvo 1-D high-frequency
coefficienls, |(O,O) and |(O,1), and lhese lvo high-frequency coefficienls are slored in lhe
lranspose nenory for lhe sulsequenl conpulalion of lhe Iov-frequency coefficienls. IixeIs
x(O,2) and x(1,2) are slored in lhe inlernaI nenory for lhe sulsequenl conpulalion of lhe 1-
D high-frequency coefficienls.
Al cIock 4, lhe DWT processor scans pixeIs on rov 2 and rov 3, and IN1 and IN2 read pixeIs
x(2,O) and x(3,O), respecliveIy. Al cIock 5, IN1 and IN2 read pixeIs x(2,1) and x(3,1),
respecliveIy. Al cIock 6, IN1 and IN2 read pixeIs x(2,2) and x(3,2), respecliveIy. Al lhis
nonenl lhe DWT processor lries lo conpule lhe lvo high-frequency coefficienls, |(2,O) and
|(3,O), upon pixeIs x(2,O) lo x(2,2) and x(3,O) lo x(3,2) respecliveIy and lhese lvo high-
frequency coefficienls are slored in lhe lranspose nenory for lhe sulsequenl conpulalion
of lhe Iov-frequency coefficienls. IixeIs x(2,2) and x(3,2) are slored in lhe inlernaI nenory
for lhe sulsequenl conpulalion of lhe high-frequency coefficienls. Then (al cIock 7) lhe
DWT processor scans lhe sulsequenl lvo rovs lo read lhree conseculive pixeIs in each rov
and conpule lhe high-frequency coefficienls. The coefficienls are slored in lhe lranspose
nenory and pixeIs x(4,2) and x(5,2) are slored in lhe inlernaI nenory. This procedure viII
conlinue lo read lhree pixeIs and conpule lhe high-frequency coefficienls and slore lhe
coefficienls lo lhe lranspose nenory and slore pixeIs x(2,j) and x(2,j+1) lo lhe inlernaI
nenory in each rov unliI lhe Iasl rov.
Then lhe DWT processor scans rov O and rov 1 and nakes N1 and N2 read pixeIs x(O,3)
and x(1,3), respecliveIy. Al lhe nexl cIock, IN1 and IN2 read pixeIs x(O,4) and x(1,4),
respecliveIy. The DWT processor uses pixeIs x(O,3), x(O,4), and x(O,2), vhich vas slored
previousIy, lo conpule lhe high-frequency coefficienl |(O,1). SinuIlaneousIy lhe DWT
processor uses pixeIs x(1,3), x(1,4), and x(1,2), vhich vas slored previousIy, lo conpule lhe
high-frequency coefficienls |(1,1). As soon as |(O,1) and |(1,1) are found, |(O,O), |(O,1), and
x(O,2) are used lo generale lhe Iov-frequency coefficienl c(O,1), |(1,O), |(1,1), and x(1,2) are
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 83
used lo generale lhe Iov-frequency coefficienl c(1,1). The conpuled high-frequency
coefficienls are lhen slored in lhe lranspose nenory, and pixeIs x(O,4) and x(1,4) repIace
pixeIs x(O,2) and x(1,2) lo le slored in lhe inlernaI nenory. IN1 and IN2 lhen read lhe dala
in rovs 2 and 3 lo process lhe sane operalions unliI lhe end of lhe Iasl pixeI. The delaiI
operalions are shovn in Iig. 1O.
The second slage 1-D DWT vorks in lhe siniIar nanner as lhe firsl slage 1-D DWT. In lhe
HH and HL operalions, vhen lhree coIunn conseculive |(i,j)s are found in lhe firsl slage 1-
D DWT, an HH coefficienl can le conpuled. As soon as lvo coIunn conseculive HH
coefficienls are found, lhe lvo HH coefficienls and lhe overIapped |(i,j)s can le conlined
lo conpule an HL coefficienl. SiniIarIy, vhen lhree coIunn conseculive c(i,j)s are found in
lhe firsl slage 1-D DWT, an LH coefficienl can le conpuled. As soon as lvo LH coefficienls
are found, lhe lvo LH coefficienls and lhe overIapped c(i,j) are used lo conpule an LL
coefficienl. The delaiIed operalions for lhe second slage 1-D DWT are shovn in Iig. 11.


Iig. 9. IRSA of lhe 2-D LDWT.

VLS 84

Iig. 1O. The delaiI operalions of lhe firsl slage 1-D DWT.


(a)
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 85

(l)
Iig. 11. The delaiIed operalions of lhe second slage 1-D DWT. (a) The HI (HH and HL) parl
operalions. (l) The LI (LH and LL) parl operalions.

5. VLSI architecture and impIementation for the 2-D duaI-mode LDWT

The IRSA approach has leen discussed in lhe previous seclion, and lhe archileclure of IRSA
is descriled in lhis seclion. We can nanipuIale lhe conlroI unil lo read off-chip nenory. In
IRSA, lvo pixeIs are scanned concurrenlIy, and lhe syslen needs lvo processing unils. Ior
lhe 2-D LDWT processing, lhe pixeIs are processed ly lhe firsl slage 1-D DWT firsl. The
oulpuls are lhen fed lo lhe second slage 1-D DWT lo find lhe four sulland coefficienls, HH,
HL, LH, and LL. There are lvo parls in lhe archileclure, lhe firsl slage 1-D DWT and lhe
second slage 1-D DWT. Here ve concenlrale on lhe 2-D 5/3 node LDWT.

5.1 The first stage 1-D LDWT
The firsl slage 1-D LDWT archileclure consisls of lhe foIIoving unils: signaI arrangenenl
unil, nuIlipIicalion and accunuIalion ceII (MAC), nuIlipIexer (MUX), and IIIO regisler.
The lIock diagran is shovn in Iig. 12.
The signaI arrangenenl unil consisls of lhree regislers, R1, R2, and R3. The pixeIs are inpul
lo R1 firsl, and sulsequenlIy lhe conlenl of R1 is lransferred lo R2 and lhen R3, and R1
keeps reading lhe foIIoving pixeIs. The operalion is Iike a shifl regisler. As soon as R1, R2,
and R3 gel signaI dala, MAC slarls operaling. The signaI arrangenenl unil is shovn in Iig.
13. In Iig. 13 MAC operales al lhe cIock vilh gray circIes.

VLS 86

Iig. 12. The archileclure of lhe firsl slage 1-D DWT.


Iig. 13. The operalion of lhe signaI arrangenenl unil (for exanpIe, IRAS signaI in N1).

Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 87

Iig. 14. The lIock diagran of MAC.

Ior lhe Iov-frequency coefficienls caIcuIalion ve need lvo high-frequency coefficienls and
an originaI pixeI. InlernaI regisler R4 is used lo slore lhe originaI even pixeI (N1) and
inlernaI regisler R9 is used lo slore lhe originaI odd pixeI (N2). We can sinpIy shifl lhe
conlenl of R3 lo R4 afler lhe MAC operalion. The IIIO is used lo slore lhe high-frequency
coefficienls lo caIcuIale lhe Iov-frequency coefficienls. Regisler R5 has lvo funclions: 1) lo
slore lhe high-frequency coefficienls for lhe Iov-frequency coefficienl caIcuIalion, 2) lo le
used as a signaI luffer for MAC. MAC needs line lo conpule lhe signaI, and lhe oulpul of
MAC cannol direclIy feed lhe resuIl lo lhe oulpul or lhe foIIoving operalion nay le
incorrecl due lo lhe synchronizalion prolIens. R5 acls as an oulpul luffer for MAC lo
prevenl lhe error in lhe foIIoving operalions. In lhe 5/3 inleger Iifling-lased operalions,
MAC is used lo find lhe resuIls of lhe high-frequency oulpul, (a1+a3)/2 + a2, and lhe Iov
frequency oulpul, (a1+a3)/4+a2. There are lvo nuIlipIicalion coefficienls, 1/2 and 1/4. To
save hardvare, ve can use shiflers lo inpIenenl lhe 1/2 and 1/4 nuIlipIicalions.
Therefore lhe MAC needs adders, conpIenenler, and shiflers. The MAC lIock diagran is
shovn in Iig. 14, vhere a1, a2, and a3 are lhe inpuls, lhe 2s conpIenenl converler, and
>> lhe righl shifler.

5.2 The second stage 1-D LDWT
SiniIar lo lhe firsl slage 1-D DWT, lhe second slage 1-D DWT consisls of lhe foIIoving
unils: signaI arrangenenl unil, MAC, and MUX, as shovn in Iig. 15. Due lo lhe paraIIeI
archileclure, lvo oulpuls are generaled concurrenlIy fron lhe firsl slage 1-D DWT, and
lhese lvo oulpuls nusl le nerged in lhe second slage 1-D DWT. The signaI arrangenenl
unil processes lhe signaI nerging, Iig. 16 shovs lhe processing diagran of lhe signaI
arrangenenl unil. Al leginning, signaIs HO and H1 are fron IN1 and IN2 and lhese lvo
signaIs are slored in R3 and R4, respecliveIy. Al lhe nexl cIock, HO and H1 are noved lo R1
and R2 respecliveIy, and concurrenlIy nev signaIs H3 and H4 fron IN1 and IN2 are slored
lo R3 and R4 respecliveIy. The signaI arrangenenl unil operales repealedIy lo inpul signaIs
for lhe second slage 1-D DWT.

VLS 88

Iig. 15. The lIock diagran of lhe second slage 1-D LDWT.


Iig. 16. SignaI nerging process for lhe signaI arrangenenl unil.

5.3 2-D LDWT architecture
In our IRSA operalion, IN1 and IN2 read signaIs of even rov and odd rov in a zig-zag
order, respecliveIy. The delaiI is shovn in Iig. 17. The lIock diagran of lhe proposed 2-D
LDWT is shovn in Iig. 19(l). Il consisls of lvo slages, lhe firsl slage 1-D DWT and lhe
second slage 1-D DWT. This archileclure needs onIy a snaII anounl of lranspose nenory.
Lel us consider a 44 inage. The signaI processing of lhe firsl slage 1-D DWT is shovn in
Iig. 18. The pixeIs fron lhe even rovs are processed ly lhe upper 1-D DWT, and lhe pixeIs
fron lhe odd rovs are processed ly lhe Iover 1-D DWT. Lach 1-D DWT generales one sel of
22 high-frequency coefficienls and one sel of 22 Iov-frequency coefficienls, respecliveIy.
The generaled coefficienls are fed lo lhe second slage 1-D DWT under lhe direclion of lhe
arrov head. The inpul for each second slage 1-D DWT lecones a sel of 24 signaIs. The
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 89
signaI processing of lhe second slage 1-D DWT is shovn in Iig. 19. The 24 signaIs in each
second slage 1-D DWT are lhen processed, and lhen HH, HL, LH, and LL are generaled and
each has 22 signaI dala.
The conpIele archileclure of lhe 2-D LDWT is shovn in Iig. 19. The conpIele 2-D LDWT
consisls of four parls, lvo sels of lhe firsl slage 1-D DWT, lvo sels of lhe second slage 1-D
DWT, conlroI unil, and MAC unil.


(a)


(l)
Iig. 17. The inpul signaI sequences. (a) IN1 read signaI of even rov in zig-zag orders. (l) IN2
read signaI of odd rov in zig-zag orders.
VLS 90

(a)

(l)
Iig. 18. The signaI process of lhe lvo slage LDWT. (a) Iirsl slage 1-D LDWT. (l) Second
slage 1-D LDWT.

According lo (12) and (13), lhe proposed IRSA archileclure can aIso le appIied lo lhe 9/7
node LDWT. Iig. 2O iIIuslrales lhe approach. Iron Iigs. 1O and 11 in Seclion 3, lhe originaI
signaIs (denoled as lIack circIes) for lolh 5/3 and 9/7 nodes LDWT can le processed ly lhe
sane IRSA for lhe firsl slage 1-D DWT operalion. The high-frequency signaIs (denoled as grey
circIes) and lhe correIaled Iov-frequency signaIs logelher vilh lhe resuIls of lhe firsl slage are
used lo conpule lhe second slage 1-D DWT coefficienls. Conpared lo lhe 9/7 node LDWT
conpulalion, lhe 5/3 node LDWT is nuch easier for conpulalion, and lhe regislers
arrangenenl in Iigs. 12 and 15 is sinpIe. Ior 9/7 node LDWT inpIenenlalion vilh lhe sane
syslen archileclure of 5/3 node LDWT, ve have lo do lhe foIIoving nodificalions: 1) The
conlroI signaIs of lhe MUX in Iigs. 12 and 15 nusl le nodified. We have lo rearrange lhe
regislers for lhe MAC lIock lo process lhe 9/7 paranelers. 2) The vaveIel coefficienls of lhe
duaI-node LDWT are differenl. The coefficienls are = 1/2 and =1/4 for 5/3 node LDWT,
lul lhe coefficienls are = 1.586134142, = O.O5298O118, = +O.882911O75, and =
+O.4435O6852 for 9/7 node LDWT. Ior caIcuIalion sinpIicily and good precision, ve can use
lhe inleger approach proposed ly Huang c| a|. (Huang el aI., 2OO4) and Marlina c| a|. (Marlina
& Masera, 2OO7) for 9/7 node LDWT caIcuIalion. SiniIar lo lhe nuIlipIicalion inpIenenlalion
ly shiflers and adders in lhe 5/3 node LDWT, ve can adopl lhe shiflers approach proposed
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 91
in (Huang el aI., 2OO5) furlher lo inpIenenl lhe 9/7 node LDWT. 3) According lo lhe
characlerislics of lhe 9/7 node LDWT, lhe conlroI unil in Iig. 19(l) nusl le nodified
accordingIy.


(a)

(l)
Iig. 19. The conpIele 2-D DWT lIock diagran. (a) DSI diagran of lhe 2-D LDWT. (l) Syslen
diagran of lhe 2-D LDWT.
VLS 92
Iig. 2O. The processing procedures of 2-D duaI-node LDWTs under lhe sane IRSA
archileclure.

The nuIli-IeveI DWT conpulalion can le inpIenenled in a siniIar nanner ly lhe high
perfornance 1-IeveI 2-D LDWT. Ior lhe nuIli-IeveI conpulalion, lhis archileclure needs
N
2
/4 off-chip nenory. As iIIuslraled in Iig. 21, lhe off-chip nenory is used lo lenporariIy
slore lhe LL sulland coefficienls for lhe nexl ileralion conpulalions. The second IeveI
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 93
conpulalion requires N/2 counlers and N/2 IIIOs for lhe conlroI unil. The lhird IeveI
conpulalion requires N/4 counlers and N/4 IIIOs for lhe conlroI unil. CeneraIIy in lhe jlh
IeveI conpulalion, ve need N/2
j-1
counlers and N/2
j-1
IIIOs.


Iig. 21. The nuIliIeveI 2-D DWT archileclure.

6. ExperimentaI resuIts and comparisons

The 2-D duaI-node LDWT considers a lrade-off lelveen Iov lranspose nenory and Iov
conpIexily in lhe design of VLSI archileclure. TalIes 2 and 3 shov lhe perfornance
conparisons of lhe proposed archileclure and olher siniIar archileclures. Conpression
resuIls indicale lhal lhe proposed VLSI archileclure oulperforns previous vorks in lerns of
lranspose nenory size, requiring aloul 5O Iess nenory lhan lhe }ILC2OOO slandard
(Chen, 2OO4) archileclure. Moreover, lhe 2-D LDWT is frane-lased, and ils inpIenenlalion
lollIeneck is lhe huge lranspose nenory. Less nenory unils are needed in our archileclure
and lhe Ialency is fixed on (3/2)N+3 cIock cycIes. Our archileclure can aIso provide an
enledded synnelricaI exlension funclion. The proposed IRSA approach has lhe
advanlages of nenory-efficienl and high-speed. The proposed 2-D duaI-node LDWT
adopls paraIIeI and pipeIined schenes lo reduce lhe lranspose nenory and increase lhe
operalion speed. The shiflers and adders repIace nuIlipIiers in lhe conpulalion lo reduce
lhe hardvare cosl. Chen c| a|. (Chen & Wu, 2OO2) proposed a foIded and pipeIined
archileclure lo conpule lhe 2-D 5/3 Iifling-lased DWT, and lhey used lranspose nenory
size of 2.5N for an NN 2-D DWT. This Iifling archileclure for verlicaI fiIlering vilh lvo
adders and one nuIlipIier is divided inlo lvo parls, and each parl has one adder and one
nuIlipIier. ecause lolh parls are aclivaled in differenl cycIes, lhey can share lhe sane
adder and nuIlipIier. Il can increase lhe hardvare uliIizalion and reduce lhe Ialency.
Hovever, according lo lhe characlerislics of lhe signaI fIov, il viII increase lhe conpIexily
al lhe sane line.
A 256256 2-D LDWT vas designed and sinuIaled vilh VeriIogHDL and furlher
synlhesized ly lhe Synopsys design conpiIer vilh TSMC O.18n 1I6M CMOS slandard
process lechnoIogy. The delaiIed specs of lhe 256256 2-D LDWT are Iisled in TalIe 4.






VLS 94
5/3 LDWT
archileclure
Ours Diou
el aI.,
2OO1
Andra
el aI.,
2OO2
Chen &
Wu,
2OO2
Chen,
2OO2
Chiang &
Hsia,
2OO5
Mei
el aI.,
2OO6
Huang
el aI,
2OO5
Wu &
Lin, 2OO5
Transpose
nenory
1

(lyles)
2N 3.5N 3.5N 2.5N 3N N
2
/4+5N 2N 3.5N 3.5N
Conpulalion
line
2

(3/4
)N
2
+
(3/2
)N
+7
--- (N
2
/2)+
N+5
N
2
(N
2
/2)+N
+5
N
2
(N
2
/
2)+N
--- 1O+(4/3)
N
2
|1-
(1/4)j+2N
|1-(1/2)j
Adders 8 12 8 6 5 4 8 --- ---
MuIlipIiers O 6 4 4 O O O --- 6
1
Transpose nenory size is used lo slore frequency coefficienls in lhe 1-L 2-D DWT.
2
In a syslen, conpuling line represenls lhe line used lo conpule an inage of size NN.
3
Suppose lhe inage is of size NN.
TalIe 2. Conparisons of 2-D archileclures for 5/3 LDWT.

9/7 LDWT
archileclure
Ours Andra
el aI.,
2OO2
}ung &
Iark,
2OO5
Chen,
2OO4
1

Vishvanalh
el aI., 1995
Huang el
aI., 2OO5
Huang
el aI,
2OO5
Wu & Lin,
2OO5
Lan el
aI., 2OO5
Wu. &
Chen,
2OO1
Transpose
nenory
(lyles)
4N N
2
12N N
2
/4+|
N+|
22N 14N 5.5N 5.5N --- N
2
+4N+
4
Conpulalio
n line
(3/4)N
2
+(3/2)
N +7
4N
2
/3+
2
N
2
N
2
/2~(2
/3)N
N
2
--- --- 22+(4/3)N
2
|1
-(1/4)j+6N|1-
(1/2)j
--- 2N
2
/3
Adders 16 8 12 4 | 36 16 16 8 32 16
MuIlipIiers O 4 9 4 | 36 12 1O 6 2O 16
1
|: lhe fiIler Ienglh.
TalIe 3. Conparisons of lhe 2-D archileclures for 9/7 LDWT.

Chip specificalion N = 256, TiIe size = 256256
Cale counl 29,196 gales
Iover suppIy 1.8V
TechnoIogy TSMC O.18n 1I6M (CMOS)
On-Chip nenory size (Transpose
+ InlernaI)
2-D 5/3 DWT: 512 lyles
2-D 9/7 DWT: 1,O24 lyles
Lalency (3/2)N+3 = 387 cIock cycIes
Conpuling line (3/4)N
2
+(3/2)N+7 = 49,543 cIock cycIes
Maxinun cIock rale 83 MHz
TalIe 4. Design specificalion of lhe proposed 2-D DWT.

Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 95
7. ConcIusions

This vork presenls a nev archileclure lo reduce lhe lranspose nenory requirenenl in 2-D
LDWT. The proposed archileclure has a nixed rov- and coIunn-vise signaI fIov, ralher
lhan pureIy rov-vise as in lradilionaI 2-D LDWT. Iurlher ve propose a nev approach,
inlerIaced read scan aIgorilhn (IRSA), lo reduce lhe lranspose nenory for a 2-D duaI-node
LDWT. The proposed 2-D archileclures are nore efficienl lhan previous archileclures in
lrading off Iov lranspose nenory, oulpul Ialency, conlroI conpIexily, and reguIar nenory
access sequence. The proposed archileclure reduces lhe lranspose nenory significanlIy lo a
nenory size of onIy 2N or 4N (5/3 or 9/7 node) and reduces lhe Ialency lo (3/2)N+3 cIock
cycIes. Due lo lhe reguIarily and sinpIicily of lhe IRSA LDWT archileclure, a duaI node
(5/3 and 9/7) 256256 2-D LDWT prololyping chip vas designed ly TSMC O.18n 1I6M
slandard CMOS lechnoIogy. The 5/3 and 9/7 fiIlers vilh differenl Iifling sleps are reaIized
ly cascading lhe four noduIes (spIil, predicl, updale, and scaIing phases). The prololyping
chip lakes 29,196 gale counls and can operale al 83 MHz. The nelhod is appIicalIe lo any
DWT-lased signaI conpression slandard, such as }ILC2OOO, Molion-}ILC2OOO, MILC-4
sliII lexlure oljecl decoding, and vaveIel-lased scaIalIe video coding (SVC).

8. References

Andra, K., Chakralarli, C. & Acharya, T. (2OOO). A VLSI archileclure for Iifling-lased
vaveIel lransforn, |||| Icr|sncp cn Signa| Prcccssing Sqs|cms, (Ocloler 2OOO) pp.
7O-79.
Andra, K., Chakralarli, C. & Acharya, T. (2OO2). A VLSI archileclure for Iifling-lased
forvard and inverse vaveIel lransforn, |||| Transac|icns cn Signa| Prcccssing, VoI.
5O, No.4, (ApriI 2OO2) pp. 966-977.
Chen, I.-Y. (2OO2). VLSI inpIenenlalion of discrele vaveIel lransforn using lhe 5/3 fiIler,
|||C| Transac|icns cn |nfcrma|icn an Sqs|cms, VoI. L85-D, No.12, (Decenler 2OO2)
pp. 1893-1897.
Chen, I.-Y. (2OO4). VLSI inpIenenlalion for one-dinensionaI nuIliIeveI Iifling-lased
vaveIel lransforn, |||| Transac|icns cn Ccmpu|cr, VoI. 53, No. 4, (ApriI 2OO4) pp.
386-398.
Chen, I. & Woods, }. W. (2OO4). idireclionaI MC-LZC vilh Iifling inpIenenlalion, ||||
Transac|icns cn Circui|s an Sqs|cms fcr Vicc Tccnnc|cgq, VoI. 14, No. 1O, (Ocloler
2OO4) pp. 1183-1194.
Chen, S.-C. & Wu, C.-C. (2OO2). An archileclure of 2-D 3-IeveI Iifling-lased discrele vaveIel
lransforn, V|S| Dcsign/ CAD Sqmpcsium, (Augusl 2OO2) pp. 351-354.
Chiang, }.-S. & Hsia, C.-H. (2OO5). An efficienl VLSI archileclure for 2-D DWT using Iifling
schene, |||| |n|crna|icna| Ccnfcrcncc cn Sqs|cms an Signa|s, (ApriI 2OO5) pp. 528-
531.
ChrislopouIos, C., Skodras, A. N. & Llrahini, T. (2OOO). The }ILC2OOO sliII inage coding
syslen: An overviev, |||| Trans. cn Ccnsumcr ||cc|rcnics, VoI. 46, No. 4,
(Novenler 2OOO) pp. 11O3-1127.
Daulechies, I. & SveIdens, W. (1998). Iacloring vaveIel lransforns inlo Iifling sleps, Tnc
]curna| cf |curicr Ana|qsis an App|ica|icns, VoI. 4, No.3, (1998) pp. 247-269.
VLS 96
Diou, C., Torres, L. & Rolerl, M. (2OO1). An enledded core for lhe 2-D vaveIel lransforn,
|||| cn |mcrging Tccnnc|cgics an |ac|crq Au|cma|icn Prccccings, VoI. 2, (Ocloler
2OO1) pp. 179-186.
Halili, A. & HersheI, R. S. (1974). A unified represenlalion of differenliaI puIse code
noduIalion (DICM) and lransforn coding syslens, |||| Transac|icns cn
Ccmmunica|icns, VoI. 22, No. 5, (May 1974) pp. 692-696.
Hsia, C.-H. & Chiang, }.-S. (2OO8). Nev nenory-efficienl hardvare archileclure of 2-D duaI-
node Iifling-lased discrele vaveIel lransforn for }ILC2OOO, |||| |n|crna|icna|
Ccnfcrcncc cn Ccmmunica|icn Sqs|cms, (Novenler 2OO8) pp. 766-772.
Huang, C.-T., Tseng, I.-C. & Chen, L.-C. (2OO2). Lfficienl VLSI archileclure of Iifling-lased
discrele vaveIel lransforn ly syslenalic design nelhod, |||| |n|crna|icna|
Sqmpcsium Circui|s an Sqs|cms, VoI. 5, (May 2OO2) pp. 26-29.
Huang, C.-T., Tseng, I.-C. & Chen, L.-C. (2OO4). IIipping slruclure: An efficienl VLSI
archileclure for Iifling-lased discrele vaveIel lransforn, |||| Transac|icns cn Signa|
Prcccssing, VoI. 52, No. 4, (ApriI 2OO4) pp. 1O8O-1O89.
Huang, C.-T., Tseng, I.-C. & Chen, L.-C. (2OO5). VLSI archileclure for Iifling-lased shape-
adaplive discrele vaveIel lransforn vilh odd-synnelric fiIlers, ]curna| cf V|S|
Signa| Prcccssing Sqs|cms, VoI. 4O, No. 2, (}une 2OO5) pp.175-188.
Huang, C.-T., Tseng, I.-C. & Chen, L.-C. (2OO5). AnaIysis and VLSI archileclure for 1-D and
2-D discrele vaveIel lransforn, |||| Transac|icns cn Signa| Prcccssing, VoI. 53, No.
4, (ApriI 2OO5) pp. 1575-1586.
Huang, C.-T., Tseng, I.-C. & Chen, L.-C. (2OO5). Ceneric RAM-lased archileclure for lvo-
dinensionaI discrele vaveIel lransforn vilh Iine-lased nelhod, |||| Transac|icns
cn Circui|s an Sqs|cms fcr Vicc Tccnnc|cgq, VoI. 15, No. 7, (}uIy 2OO5) pp. 91O-919.
ISO/ILC 15444-1 }TC1/SC29 WC1. (2OOO). ]P|G 2000 Par| 1 |ina| Ccmmi||cc Draf| Vcrsicn 1.0,
Infornalion TechnoIogy.
ISO/ILC }TC1/SC29/WC1 WgIn 1684 (2OOO). ]P|G 2000 Vcrifica|icn Mcc| 9.0.
ISO/ILC 15444-1 }TC1/SC29 WC1. (2OOO). Mc|icn ]P|G2000, |SO/||C |SO/||C 15444-3,
Infornalion TechnoIogy.
ISO/ILC }TC1/SC29 WC11. (2OO1), Ccing cf Mcting Pic|urcs an Auic, Infornalion
TechnoIogy.
}iang, W. & Orlega, A. (2OO1). Lifling faclorizalion-lased discrele vaveIel lransforn lased
archileclure design, |||| Transac|icns cn Circui|s an Sqs|cms fcr Vicc Tccnnc|cgq,
VoI. 11, No. 5, (May 2OO1) pp. 651-657.
}ung, C.-C. & Iark, S.-M. (2OO5). VLSI inpIenenl of Iifling vaveIel lransforn of }ILC2OOO
vilh efficienl RIA (recursive pyranid aIgorilhn) reaIizalion, |||C| Transac|icns cn
|unamcn|a|s, VoI. L88-A, No. 12, (Decenler 2OO5) pp. 35O8-3515.
Kondo, H. & Oishi, Y. (2OOO). DigilaI inage conpression using direclionaI sul-lIock DCT,
|n|crna|icna| Ccnfcrcncc cn Ccmmunica|icns Tccnnc|cgq, VoI. 1, (Augusl 2OOO) p p. 985
-992.
Lan, X., Zheng, N. & Liu, Y. (2OO5). Lov-pover and high-speed VLSI archileclure for Iifling-
lased forvard and inverse vaveIel lransforn, |||| Transac|icns cn Ccnsumcr
||cc|rcnics, VoI. 51, No. 2, (May 2OO5) pp. 379-385.
Li, W.-M., Hsia, C.-H. & Chiang, }.-S. (2OO9). Menory-efficienl archileclure of 2-D duaI-
node Iifling schene discrele vaveIel lransforn for Moion-}ILC2OOO, ||||
|n|crna|icna| Sqmpcsium cn Circui|s an Sqs|cms, (May 2OO9) pp. 75O-753.
Memory-Efcient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 97
Lian, C.-}., Chen, K.-I., Chen, H.-H. & Chen, L.-C. (2OO1). Lifling lased discrele vaveIel
lransforn archileclure for }ILC2OOO, |||| |n|crna|icna| Sqmpcsium cn Circui|s an
Sqs|cms, VoI. 2, (May 2OO1) pp. 445-448.
MaIIal, S. C. (1989). A lheory for nuIli-resoIulion signaI deconposilion: The vaveIel
represenlalion, |||| Transac|icn cn Pa||crn Ana|qsis an Macninc |n|c||igcncc, VoI. 11,
No. 7, (}uIy 1989) pp. 674-693.
MaIIal, S. C. (1989). MuIli-frequency channeI deconposilions of inages and vaveIel nodeIs,
ILLL Transac|icns cn Acouslics, Speech and SignaI Irocessing, VoI. ASSI-37, No. 12,
(Decenler 1989) pp. 2O91-211O.
MarceIIin, M. W., Cornish, M. }. & Skodras, A. N. (2OOO). }ILC2OOO: The nev sliII piclure
conpression slandard, ACM Mu||imcia Icr|sncps, (Seplenler 2OOO) pp. 45-49.
Marino, I. (2OOO). Lfficienl high-speed/Iov-pover pipeIined archileclure for lhe direcl 2-D
discrele vaveIel lransforn, |||| Transac|icns cn Circui|s an Sqs|cms ||, VoI. 47, No.
12, (Decenler 2OOO) pp. 1476-1491.
Marlina, M. & Masera, C. (2OO7). IoIded nuIlipIierIess Iifling-lased vaveIel pipeIine, ||T
||cc|rcnics |c||crs, VoI. 43, No. 5, (March 2OO7) pp. 27-28.
Mei, K., Zheng, N. & van de Welering, H. (2OO6). High-speed and nenory-efficienl VLSI
design of 2-D DWT for }ILC2OOO, ||T ||cc|rcnics |c||cr, VoI. 42, No. 16, (Augusl
2OO6) pp. 9O7-9O8.
Ohn, }.-R. (2OO5). Advances in scaIalIe video coding, Prccccings cf Tnc ||||, Inviled Iaper,
VoI. 93, No.1, pp. 42-56, (}anuary 2OO5) pp. 42-56.
Richardson, I. (2OO3). H.264 an MP|G-4 Vicc Ccmprcssicn, }ohn WiIey & Sons Lld.
Seo, Y.-H. & Kin, D.-W. (2OO7). VLSI archileclure of Iine-lased Iifling vaveIel lransforn for
Molion }ILC2OOO, |||| ]curna| cf Sc|i-S|a|c Circui|s, VoI. 42, No. 2, (Ielruary 2OO7)
pp. 431-44O.
SveIdens, W. (1996). The Iifling schene: A cuslon-design conslruclion of liorlhogonaI
vaveIels, App|ic an Ccmpu|a|icn Harmcnic Ana|qsis, VoI. 3, No. 15, (1996) pp.186-
2OO.
Tan, K.C.. & ArsIan, T. (2OO1). Lov pover enledded exlension aIgorilhn for lhe Iifling
lased discrele vaveIel lransforn in }ILC2OOO, ||T ||cc|rcnics |c||crs, VoI. 37, No.
22, (Ocloler 2OO1) pp.1328-133O.
Tan, K.C.. & ArsIan, T. (2OO3). Shifl-accunuIalor ALU cenlric }ILC 2OOO 5/3 Iifling lased
discrele vaveIel lransforn archileclure, |||| |n|crna|icna| Sqmpcsium cn Circui|s
an Sqs|cms, VoI. 5, (May 2OO3) pp. V161-V164.
Taulnan, D. & MarceIIin, M. W. (2OO1). ]P|G2000 imagc ccmprcssicn funamcn|a|s, s|anars,
an prac|icc, KIuver Acadenic IulIisher.
Varshney, H., Hasan, M. & }ain, S. (2OO7). Lnergy efficienl noveI archileclure for lhe Iifling-
lased discrele vaveIel lransforn, ||T |magc Prcccss, VoI. 1, No. 3, (Seplenler 2OO7)
pp.3O5-31O.
Vishvanalh, M., Ovens, R. M. & Irvin, M. }. (1995). VLSI archileclure for lhe discrele
vaveIel lransforn, |||| Transac|icns cn Circui|s an Sqs|cms ||, VoI. 42, No. 5, (May
1995) pp. 3O5-316.
Weeks, M. & ayouni, M. A. (2OO2). Three-dinensionaI discrele vaveIel lransforn
archileclures, |||| Transac|icns cn Signa| Prcccssing, VoI. 5O, Vo.8, (Augusl 2OO2) pp.
2O5O-2O63.
VLS 98
Wu, .-I. & Lin, C.-I. (2OO5). A high-perfornance and nenory-efficienl pipeIine
archileclure for lhe 5/3 and 9/7 discrele vaveIel lransforn of }ILC2OOO codec,
|||| Transac|icns cn Circui|s an Sqs|cms fcr Vicc Tccnnc|cgq, VoI. 15, No. 12,
(Decenler 2OO5) pp. 1615-1628.
Wu, I.-C. & Chen, L.-C. (2OO1). An efficienl archileclure for lvo-dinensionaI discrele
vaveIel lransforn, |||| Transac|icns cn Circui|s an Sqs|cms fcr Vicc Tccnnc|cgq,
VoI. 11, No. 4, (ApriI 2OO1) pp. 536-545.
Full HD JPEG XR Encoder Design for Digital Photography Applications 99
X

FuII HD JPEG XR Encoder Design
for DigitaI Photography AppIications

Ching-Yen Chien*, Sheng-Chieh Huang*, Chia-Ho Ian,
and Liang-Cee Chen
DSP/|C Dcsign |a|, Graua|c |ns|i|u|c cf ||cc|rcnics |nginccring
an Dcpar|mcn| cf ||cc|rica| |nginccring, Na|icna| Taiuan Unitcrsi|q
Scnsc/TCM SOC |a|, Dcpar|mcn| cf ||cc|rica| an Ccn|rc| |nginccring,
Na|icna| Cniac-Tung Unitcrsi|q
Taiuan, R.O.C.

1. Introduction

MuIlinedia appIicalions, such as radio, audio, canera phone, digilaI sliII canera, cancoder,
and noliIe lroadcasling TV, are nore and nore popuIar in our Iife as lhe progress of inage
sensor, connunicalion, VLSI nanufaclure, and inage/video coding slandards. Wilh rapid
progress of inage sensor, dispIay devices, and conpuling engines, inage coding slandards
are used in lhe digilaI pholography appIicalion everyvhere. Il has leen nerged logelher
vilh our Iife such as canera phone, digilaI sliII canera, lIog and nany olher appIicalions.
Many advanced nuIlinedia appIicalions require inage conpression lechnoIogy vilh
higher conpression ralio and leller visuaI quaIily. High quaIily, high conpression rales of
digilaI inage and Iov conpulalionaI cosl are inporlanl faclors in nany areas of consuner
eIeclronics, ranging fron digilaI pholography lo lhe consuner dispIay equipnenls such as
digilaI sliII canera and digilaI frane. These requirenenls usuaIIy invoIve conpulalionaIIy
inlensive aIgorilhns inposing lrade-offs lelveen quaIily, conpulalionaI resources, and
lhroughpul.
Ior high quaIily of digilaI inage appIicalions, lhe exlension of coIor range has leconing
nore inporlanl in lhe consuner producl. In lhe pasl, lhe digilaI caneras and lhe dispIay
equipnenls in lhe consuner narkel lypicaIIy had 8 lils infornalion per channeI. Today lhe
condilion is quile differenl. In lhe consuner narkel, digilaI caneras and lhe desklop dispIay
paneIs aIso have al Ieasl 12 lils of infornalion per channeI. If lhe infornalion per channeI of
digilaI inage is sliII conpressed inlo 8 lils, 4 or nore lils of infornalion per channeI are Iosl
and lhe quaIily of lhe digilaI inage is Iiniled. Due lo lhe inprovenenl of lhe dispIay
equipnenls, lhe }ILC XR is designed for lhe high dynanic range (HDR) and lhe high
definilion (HD) pholo size. }ILC XR vhich is aIready under organized ly lhe ISO/ILC }oinl
Iholographic Lxperls Croup (}ILC) Slandard Connillee is a nev sliII inage coding
slandard and derived fron lhe vindov nedia pholo (Srinivasan el aI., 2OO7, Srinivasan el
aI., 2OO8, Schonlerg el aI., 2OO8). The goaI of }ILC XR is lo supporl lhe grealesl possilIe IeveI
5
VLS 100

of inage dynanic range and coIor precision, and keep lhe device inpIenenlalions of lhe
encoder and decoder as sinpIe as possilIe.
Ior lhe conpression of digilaI inage, lhe }oinl Iholographic Lxperls Croup, lhe firsl
inlernalionaI inage coding slandard for conlinuous-lone naluraI inages, vas defined in
1992 (ITU, 1992). }ILC is a veII-knovn inage conpression fornal loday lecause of lhe
popuIalion of digilaI sliII canera and Inlernel. ul }ILC has ils Iinilalion lo salisfy lhe rapid
progress of consuner eIeclronics. Anolher inage coding slandard, }ILC2OOO (ISO/ILC,
2OOO), vas finaIized in 2OO1. Differed fron }ILC slandard, a Discrele Cosine Transforn
(DCT) lased coder, lhe }ILC2OOO uses a Discrele WaveIel Transforn (DWT) lased coder.
The }ILC2OOO nol onIy enhances enhances lhe conpression, lul aIso incIudes nany nev
fealures, such as quaIily scaIaliIily, resoIulion scaIaliIily, region of inleresl, and
Iossy/IossIess coding in a unified franevork. Hovever, lhe design of }ILC2OOO is nuch
conpIicaled lhan lhe }ILC slandard. The core lechniques and conpulalion conpIexily
conparisons of lhese lvo inage coding slandard are shovn in (Huang el aI., 2OO5).
Ior salisfaclion of lhe high quaIily inage conpression and Iover conpulalion conpIexily,
lhe nev }ILC XR conpression aIgorilhn is discussed and inpIenenled vilh lhe VLSI
archileclure. }ILC XR has high encoding efficiency and versaliIe funclions. The XR of }ILC
XR neans lhe exlended range. Il neans lhal }ILC XR supporls lhe exlended range of
infornalion per channeI. The inage quaIily of }ILC XR is nearIy equaI lo }ILC 2OOO vilh
lhe sane lil-rale. The conpulalion conpIexily is nuch Iover lhan }ILC2OOO as shovn in
TalIe 1.
The efficienl syslen-IeveI archileclure design is nore inporlanl lhan lhe noduIe design
since syslen-IeveI inprovenenls nake nore inpacls on perfornance, pover, and nenory
landvidlh lhan lhe noduIe-IeveI inprovenenls. In lhis chapler, lhe nev }ILC XR
conpression slandard is inlroduced and lhe anaIysis and archileclure design of }LIC XR
encoder are aIso proposed. Conparison vas nade lo anaIyze lhe conpression perfornance
anong }ILC2OOO and }ILC XR. Iig. 1 is lhe peak signaI-lo-noise ralio (ISNR) resuIls under
severaI differenl lilrales. The lesl coIor inage is 512x512 aloon. The inage quaIily of }ILC
XR is very cIose lo lhal of }ILC2OOO. The ISNR difference lelveen }ILC XR and }ILC2OOO
is under O.5d. Iig. 2 shovs lhe suljeclive vievs al 8O lines conpression ralio. The lIock
arlifacl of }ILC inage in Iig. 2 is easiIy olserved, vhiIe lhe }ILC XR denonslrales
acceplalIe quaIilies ly inpIenenling lhe pre-fiIler funclion. Ior lhe archileclure design, a
4:4:4 192Ox1O8O }ILC XR encoder is proposed. Iron lhe sinuIalion and anaIysis, enlropy
coding is lhe nosl conpulalionaIIy inlensive parl in }ILC XR encoder. We firsl proposed a
lining scheduIe of pipeIine archileclure lo speed up lhe enlropy encoding noduIe. To
oplinize nenory landvidlh prolIen and naxinize lhe siIicon area efficiency, ve aIso
proposed a dala reuse skiII lo soIve lhis prolIen. The dala reuse skiII can reduce 33
nenory landvidlh forn lhe nenory access. The hardvare design of }ILC XR encoder has
leen inpIenenled ly ceII-lased IC design fIov. The encoder design is aIso verified ly
IICA pIalforn. This }ILC XR chip design can le used for digilaI pholography appIicalions
lo achieve Iov conpulalion, Iov slorage, and high dynanicaI range fealures.

TechnoIogies Operalions(COIs)
}ILC2OOO 4.26
}ILC XR 1.2
TalIe 1. Machine cycIes conparison lelveen }ILC XR and }ILC 2OOO.
Full HD JPEG XR Encoder Design for Digital Photography Applications 101

Iig. 1. ISNR Conparison lelveen differenl inage slandard.


(a) (l)
Iig. 2. Inage QuaIily Conparison vilh (a) }ILC XR and (l) }ILC.

BABOON512
20
25
30
35
40
45
50
55
60
0 1 2 3 4 5 6 7
bpp
PSNR
1PEG XR(O1)
1PEG 2000
1PEG XR(O0)
1PEG XR(O2)
VLS 102


Iig. 3. }ILC XR Coding IIov.

The coding fIov of }ILC XR is shovn in lhe Iig. 3. The }ILC XR has Iover conpulalion cosl
in each noduIe and sinpIe coding fIov vhiIe nainlaining siniIar ISNR quaIily al lhe sane
lilrale as conpared vilh lhe coding fIov of }ILC2OOO. Hence, lhe }ILC XR is very suilalIe
lo le inpIenenled vilh lhe dedicaled hardvare lo deaI vilh HD pholo size inage for lhe
HDR dispIay requirenenl. The }ILC XR encoder archileclure design is presenled in lhe
foIIoving seclion.
This paper is organized as foIIovs. In Seclion 2, ve presenl lhe fundanenlaIs of }ILC XR.
Seclion 3 descriles lhe characlerislics of our proposed archileclure design of }ILC XR
encoder. Seclion 4 shovs lhe inpIenenlalion resuIls. IinaIIy, a concIusion is given in
Seclion 5.

2. JPEG-XR

The }ILC XR inage conpression slandard has nany oplions for differenl purposes. In lhe
foIIoving seclion, lhe fundanenlaIs of }ILC XR are inlroduced.

2.1 Image Partition and Windowing
Iigure 4 shovs lhe parlilion and vindoving exanpIe of }ILC XR slandard. As lhe differenl
liIe size nakes lhe differenl conpression resuIl, lhe conpression resuIls under lhe sane
quanlizalion vilh differenl liIe size for four lesl 512x512 lenchnark inages are discussed in
(Ian el aI., 2OO8). When lhe liIe funclion is used, lhe conpressed inage size is increased vilh
lhe liIes nunler vhen lhe snaII liIes size has leen chosen.
When lhe inage has leen divided inlo lhe differenl liIes, each liIes are processed
independenlIy vhich are nore suilalIe for lhe design of hardvare inpIenenlalion and
nenory luffer size. The pixeIs acrossing lhe loundary of lhe liIes have lo le processed
vilhoul any dala dependency. The dala dependancy nusl le changed accordingIy, and lhe
scan order has lo le reluiIl as veII. This characlerislic overcone lhe error inpacl of dala
independency vhen lhe error occurs. AIlhough lhe division of liIes is heIpfuI for lhe
hardvare inpIenenlalion and dala reserving in error environnenl, lul il decreases lhe dala
conpression efficiency. Hovever, il depends on lhe lradeoff lelveen hardvare design
consideralions and lhe dala conpression efficiency.

Full HD JPEG XR Encoder Design for Digital Photography Applications 103


MB aligned
image
nput
image
Tiles
Macroblock
Blocks

Iig. 4. LxanpIe of Inage Iarlilions and Windoving.



MG_HDR NDEXTBL TLE1 TLE2
MB_1 MB_2 MB_3
DC LOWPASS FLEXBTS HGHPASS
Spatial mode
Frequency mode

Iig. 5. Layoul of ilslrean.

2.2 Operation Modes
In lhe Iig. 5, lhere are lvo nodes of lilslrean in }ILC XR vhich are naned as spaliaI node
and frequency node. In lhe spaliaI node, a singIe liIe packel carries lhe lilslrean in
nacrolIock rasler order scanning fron Iefl lo righl and lop lo lollon. In lhe frequency
node, lhe lilslrean of each liIe is carried in nuIlipIe liIe packels, vhere each liIe packel
carries lransforn coefficienls of each frequency land in lhe liIe. The freqency node
lransnils lhe DC coefficienls al firsl. When lhe user receive lhe lilslreans, inages are
decoded progressiveIy. The frequency node is very suilalIe for Iiniled landvidlh
appIicalions such as vel lrovsing.


VLS 104


Iig. 6. The 1sl parl and 2nd parl process of ICT.

2.3 CoIor Conversion
The purpose of coIor conversion is lo refIecl lhe coIor sensivily according lo lhe
characlerislics of hunan eyes. The coIor space is converled fron RC lo YUV. Then, lhe
inporlanl energies are concenlrale on Y channeI. The coIor conversion is reversilIe, in olher
vords, lhis coIor conversion is a IossIess conversion. The coIor conversion equalion is
( (
( (
(
(

(

J B - R
J
U - R - G
2
U
Y G offset
2



(1)
vhere offsel is 128.

2.4 Pre-fiIter
There are lhree overIapping choices of lhe pre-fiIler funclion: non-overIapping, one-IeveI
overIapping and lvo-IeveI overIapping. (Ian el aI., 2OO8) discuss differenl lrade-offs of lhree
overIapping choices. The non-overIapping condilion is used for faslesl encoding and
decoding. Il is efficienl in Iov conpression ralio node or IossIess node. Hovever, lhis
node polenliaIIy inlroduces lIocking effecl in Iov lilrale inages. The one-IeveI overIapping
funclion, conpared lo lhe non-overIapping, has higher conpression ralio lul il needs
addilionaI line for lhe overIapping operalion. The lvo-IeveI overIapping has lhe highesl
conpulalion conpIexily. The ISNR and oljeclive inage quaIily of lvo-IeveI overIapping
are leller lhan lhe olher lvo al Iov lilrale area. The pre-fiIler funclion is reconnended for
lolh high inage quaIily and furlher conpression ralio consideralions al high quanlizalion
IeveIs. Ior high inage quaIily issue, lhe pre-fiIler funclion eIininale lhe lIock effecl vhich is
sensilive of visuaI quaIily.

2.5 Photo Core Transform
Iholo Core Transforn (ICT) funclion lransforns lhe pixeI vaIues afler pre-fiIler conpuling
lo lhe Iovesl frequency coefficienl DC, Iov frequency coefficienls AD, and high frequency
coefficienls AC. There are lvo parls process for lhe ICT as shovn in Iig. 6. The nacrolIock
(M) is parlilioned inlo 16 4x4 lIocks. In each parl process, each 4x4 lIock is pre-fiIlered and
lhen lransforned ly 4x4 ICT. A 2x2 lransforn is appIied lo four coefficienls ly four lines
Full HD JPEG XR Encoder Design for Digital Photography Applications 105

for each 4x4 lIock in firsl parl process. The Iov frequency coefficienl of lhese four
coefficienls is processed lo lhe lop-Iefl coefficienl. Afler firsl parl process, lhe DC coefficienls
of 16 4x4 lIocks can le coIIecled as a 4x4 DC lIock. The second parl process is for lhe 4x4
DC lIock fron lhe firsl parl process. The second parl 2D 4x4 ICT is luiIl ly using lhe lhree
operalors: 2x2 T_h, T_odd and T_odd_odd. Afler second parl process, lhe 16 DC coefficienls
are processed as DC and AD coefficienls.

2.6 Quantization
The quanlizalion is lhe process of rescaIing lhe coefficienls afler appIying lhe lransforn
process. The quanlizalion uses lhe quanlized vaIue lo divide and round lhe coefficienls afler
ICT lransforned inlo an inleger vaIue. Ior lhe IossIess coding node, lhe quanlized vaIue =
1. Ior lhe Iossy coding node, lhe quanlized vaIue > 1. The quanlizalion of }ILC XR use
inleger operalions. The advanlage of inleger operalion keeps lhe precision afler scaIing
operalions and onIy uses shifl operalion lo perforn lhe division operalions. The
quanlizalion paraneler is aIIoved lo differ across high pass land, Iov pass land, and DC
land. Il varies lo differenl vaIues according lo lhe sensilivily of hunan vision in differenl
coefficienl lands. Iig. 7 shovs a exanpIe lhal pixeIs lransforned ly 2 slage ICT and
quanlized ly lhe quanlizalion process.


Iig. 7. LxanpIe of ICT.


Iig. 8. DC prediclion nodeI.

VLS 106

(a) (b)!
Iig. 9. LxanpIe of (a) Irediclion of AD coefficienls. (l) Irediclion nodeI and AC
coefficienls.

2.7 Prediction
There are lhree direclions of DC prediclion: LLIT, TOI, and LLIT and TOI. As shovn in
Iig. 8, lhe }ILC XR prediclion ruIes of DC coefficienl use lhe DC coefficienl of Iefl M and
lop M lo decide lhe direclion of DC prediclion. The DC predicled direclion can le decided
ly lhe pseudo code descriled in Iig. 8. Afler conparing lhe H_veighl and lhe V_veighl,
lhe predicled direclion can le decided. Al lhe loundary Ms of inage, lhe purpIe Ms onIy
predicl fron Iefl M and lhe gray Ms onIy predicl fron lop M.
The AD/AC lIock can le predicled fron ils TOI or LLIT lIock as Iig. 9. If lhe predicled
direclion is LLIT, lhe prediclion reIalionship exanpIe of AD coefficienls is shovn as Iig.
9(a). The AD predicled direclion foIIovs DC prediclion nodeI as descriled in Iig. 8. The
conpulalion afler prediclion judgnenl of AC is a IillIe siniIar lo AD lhal can reduce lhe
coefficienl vaIue of lhe lIock. The Iig. 9(l) shovs lhe prediclion reIalionship exanpIe of AC
coefficienls vhen lhe prediclion direclion is TOI.

2.8 Entropy Encoding
The adaplive scan is used for enlropy coding vhich is lased on lhe Ialesl prolaliIily of non-
zero coefficienls. Iig. 1O shovs an exanpIe lhal lhe previous scan order is changed lo
updale scan order lased on lhe prolaliIily. The nosl prolalIe non-zero coefficienls are
scanned firslIy and lhe prolaliIily is counled ly nunlers of lhe non-zero coefficienls. If lhe
non-zero prolaliIily is Iarger lhan lhe previous scanned coefficienl, lhe scan order is
exchanged. y doing so, lhe non-zero coefficienls are coIIecled logelher and processed in an
orderIy fashion.
Afler lhe coefficienls of currenl lIock are scanned ly adaplive scan order, }ILC XR uses lvo
enlropy coding schenes. Iig. 11 denonslrales an exanpIe of coding a 4x4 lIock vhen
ModeIils is 4 lils. The coefficienls of 4x4 lIock are scanned ly adaplive scan order firsl,
lhen lhe ModeIils is updaled ly lhe lolaI overhead coefficienls in each land per channeI.
The coefficienls vhich can le represenled under lhe ModeIils are encoded ly lhe IIexils
lalIe. Ior lhe exlra lil such as lhe 17 can nol le represenled in four lils, lhe encoding lIock
Full HD JPEG XR Encoder Design for Digital Photography Applications 107

viII increase exlra lil for lhe nev Run-LeveI Lncode (RLL) coding. The RLL funclion is
added inlo lhe lilslrean Ienglh for lhe overhead coefficienl. The LeveIs, lhe Runs lefore
nonzero coefficienls, and lhe nunler of overhead coefficienls are counled for RLL. Then lhe
RLL lIock is encoded firslIy vhiIe differenl size of Huffnan lalIe is used lo nake lhe lil
aIIocalion in oplinizalion. Afler processing lhe RLL aIgorilhn, lhe RLL resuIls and IIexils
are packelized.

3. Architecture Design of JPEG XR

In lhe foIIoving seclion, a }ILC XR encoder archileclure design is presenled. Ior lhe syslen
requirenenl, lhe consideralion of funclionaI lIock pipeIining, and lhe archileclure design of
lhe pre-fiIler, lransforn, quanlizalion, prediclion, and lhe enlropy coding noduIes are aII
discussed in lhe foIIoving seclions.


Iig. 1O. Adaplive scan order updaling exanpIe.


Iig. 11. Adaplive scan and Run-LeveI encode exanpIe.


VLS 108


Iig. 12. IipeIine slage consideralion of }ILC XR.


Iig. 13. Dala fIov of }ILC XR slage 1.

3.1 System Architecture and FunctionaI BIock PipeIining
The pipeIine slages of lhis }ILC XR encoder design can le divided inlo lhree sleps, as
shovn in Iig. 12. Since lhe coIor conversion, pre-fiIler and lhe ICT noduIes are conpuled
vilh lhe 4x4 lIock nalrix slyIe vilh no feedlack infornalion, lhey are arranged inlo lhe
sane slage al lhe leginning. The prediclion unil is used as second slage for differenl
direclion conparisons vilh lhe feedlack processed pixeIs for lhe DC/AD/AC region. Then,
lhe enlropy encoding noduIe vilh high dala dependency are divided as lhe lhird slage.

3.2 CoIor Conversion, Pre-fiIter, PCT, and Quantization
Iig. 13 shovs lhe dala fIov of slage 1. A 32 lils exlernaI lus can lransnil 2 R/C/
coefficienls in one cIock cycIe. The pre-fiIler and ICT/quanlizalion noduIes process lhe
coefficienls afler coIor conversion. This paper uses a nenory reused nelhod lo reduce lhe
nenory access landvidlh for lhe coIor conversion, pre-fiIler, and ICT noduIe. The lIack
Iine in Iig. 14 is lhe loundary of M. When lhe lIue range is lo le processed afler lhe yeIIov
range, lhe pre-fiIler funclion onIy have lo deaI vilh lhe anounl of nev dala inslead of lhe
enlire lIock dala inlo lhe regisler. This is due lo lhe presence of previousIy slored coIunn
coefficienls of lIocks in lhe regislers and lhe capaliIily lo le uliIized as lhe coefficienls in lhe
yeIIov range. Therefore, lhe nenory luffer size and lhe nenory access for pre-fiIler are
reduced. Olhervise, lhe addilionaI SRAM viII le needed lo slore lhe dala of lhe enlire lIock
fron off-chip nenory and lhe execulion line viII le increased. The delaiI sinuIalions is
discussed in (Ian el aI., 2OO8).
Full HD JPEG XR Encoder Design for Digital Photography Applications 109

!
Iig. 14. Menory reused nelhod.


Iig. 15. DC lIock inserlion on lhe lIock pipeIine.


Iig. 16. Archileclure of ICT, and Quanlizalion.

ecause lhe funclions of slage 1 are conpuling vilh lhe 4x4 lIock slyIe vilh no feedlack
infornalion, lhe lhree pipeIine archileclure for slage1 is used: coIor conversion (CC), pre-
fiIler, ICT (incIude quanlizalion). Ior lhe nenory aIIocalion, lhe coIor conversion requires 6
lIocks per coIunn, lhe pre-fiIler requires 5 lIocks, and lhe ICT incIuding quanlizalion uses
4 lIocks lo execule lhe funclion. Al lhe end of ICT in Iig. 15, DC lIock can le processed lo
reduce one pipeIine lullIe vhen lhe nexl nev pipeIine slage slarls. Hence, lhe veII
arranged lining scheduIe eIininaled lhe access of addilionaI cIock cycIe lo process lhe DC
lIock.
The archileclure for ICT incIuding Quanlizalion is shovn in Iig. 16. AddilionaI regislers are
inpIenenled lo luffer lhe reIaled coefficienls for lhe nexl pipeIine processing. The Iefl
nuIlipIex seIecls lhe inpuls for lvo parls ICT process. IniliaIIy, lhe inpul pre-fiIlered
coefficienls are seIecled lo process lhe ICT aIgorilhn. Then, lhe yeIIov lIock (DC) viII le
processed afler lhe 16 lIocks have leen conpuled. The quanlizalion slage de-nuIlipIex lhe
DC, Iov pass land AD and high pass land AC coefficienls lo suilalIe process eIenenl. The
processed dala are arranged inlo lhe quanlized coefficienls of Y, U, V SRAM lIocks for
prediclion operalion.

VLS 110

SRAM 1
768x4 bytes
R
e
g
SRAM 2
768x4 bytes
SRAM 3
1440x4 bytes
R
e
g
DC/AD/AC block
0
1
Top AD block
CBP
Predictor
BuIIer
Stage 1
Pipeline
BuIIer
Pipeline
BuIIer
Stage 3
Stage 3
Input
Output
Output

Iig. 17. Irediclion archileclure.


Iig. 18. Adaplive encode archileclure.

3.3 Prediction
The quanilzed dala slored in Y, U, V SRAM lIocks are processed vilh lhe sullracl operalion
as lhe prediclion aIgorilhn. Three SRAM lIocks are used in lhe lhis design. One 144Ox4 lyle
SRAM is used lo luffer lhe 1 DC and 3 AD coefficienls for lhe TOI AD prediclion
coefficienls in lhe prediclion judgenenl, so lhal lhe regeneralion of lhese dala are
unnecessary vhen lhey are seIecled in lhe prediclion node. In Iig. 17, lvo 768x4 lyles
SRAMs are used lo save lhe quanlized coefficienls of currenl lIock and lhe predicled
coefficienls for currenl lIock.

3.4 Entropy Encoding
There are lhree conpIex dala dependency Ioops in lhe enlropy encoding noduIe as shovn
in Iig. 18. The firsl one is lhe adaplive-scan-order lIock used lo refresh lhe scan order. The
second is lhe updaled ModeIils lIock, vhich decides hov nany lils are necessary lo
represenl a coefficienl. Third, lhe adaplive Huffnan encode lIock lo choose lhe nosl
efficienl Huffnan lalIe.
Full HD JPEG XR Encoder Design for Digital Photography Applications 111


Iig. 19. Tining scheduIing of enlropy encoding.

Algorithm Analysis
C Simulation
Architecture Design
HDL Design
Simulation
Logical synthesis
Gate Level
Simulation
DfT Considerations
Scan chain insertion
Spec.
No
Yes
Logical synthesis
Coverage enough?
No
Yes
Meet Spec. ?
Tetra-MAX
Syntest for Bist Power Compiler
Satisify?
Apollo basic flow
Power/Floor Plane
Highest Utilization?
Timing Driven P&R
Post Layout Gate
Level Simulation
Meet Spec. ?
No
No
No
Yes
Yes
Post Layout
Simulation
LPE
Fabrication
Meet Spec. ?
No
Yes
Yes

Iig. 2O. Chip design fIov.

In order lo increase lhe lhroughpul and decrease lhe lining of crilicaI palch, ve use lhree
sul-pipeIine slages archileclure lo inpIenenl lhe enlropy encoding noduIe as descriled in
(Chien el aI., 2OO9). ecause of lhe feedlack palch, slage 1 incIudes lhe noduIes of feedlack
palh and lhe feedlack infornalion generalor. Slage 2 incIudes lhe noduIes lhal encode lhe
RLL infornalion lo synloI-lased dala. The lil-vise packelizer noduIe of slage 3 processes
lhe synloI-lased dala lil ly lil. y doing lhis, ve can increase lhe lhroughpul aloul 3
lines ly veII arranged pipeIine lining scheduIe as shovn in Iig. 19.
The design of adaplive scan lIock counls lhe nunlers of lhe non-zero coefficienls lo decide
vhelher lhe scan order shouId le exchanged or nol. Afler lhe processing of lhe adaplive
scan lIock, lhe coefficienls vhich can le represenled under ModeIils are coded ly lhe
IIexils lalIe. The olher coefficienls change lo generale lhe IIexlils and LeveI. Afler lhe
processing of lhe RLL aIgorilhn, lhe LeveI and Run choose lhe suilalIe Huffnan lalIe lo
generale lhe RLL codevord. The RLL resuIls are lransIaled inlo lhe codevords ly differenl
Huffnan lalIes. The Huffnan encoder in }ILC XR is differenl fron lhe olher slandards. Il is
conposed of nany snaII Huffnan lalIes and can adapliveIy choose lhe lesl oplion in lhese
lalIes for lhe snaIIer codesize for Run and LeveI. Many codevords are produced afler lhe
Huffnan encoding. In order lo increase lhe lhroughpul, lhe codevord concenlraling
VLS 112

archileclure is proposed. Linked lo lhe oulpul of lhe alove operalion, lhe vhoIe RLL
codevord and RLL codesize viII le produced lo lhe packelizer. The packelizer archileclure
is nodified fron lhe (Agoslini el aI., 2OO2) archileclure ly conlining lhe RLL codevord and
lhe IIexils for generaling lhe }ILC XR conpressed fiIe. More delaiI design is descriled in
(Ian el aI., 2OO8).

4. ImpIementations

This design is inpIenenled lo verify lhe proposed VLSI archileclure for }ILC XR encoder.
And il is aIso verified ly IICA pIalforn. The delaiI infornalion aloul inpIenenlalion
resuIl of each noduIe ly lhe IICA prololype syslen is shovn in (Ian el aI., 2OO8). Il is used
lo lesl lhe confornance lilslreans for lhe cerlificalion.
A lhree-slage M pipeIining of 4:4:4 IossIess }ILC XR encoder vas proposed lo process lhe
capacily and hardvare uliIizalion. In our design, lhe exlra regislers are used lo increase lhe
pipeIine slages for achieving lhe specificalion, such as lhe coIor conversion, ICT/
quanlizalion and lhe adaplive encode lIock. And lhe on-chip SRAM lIocks are used lo slore
lhe reused dala processed vilh lhe prediclion noduIe lo eIininale lhe nenory access. Ior
lhe enlropy encoding noduIe, lhe lining scheduIe and pipeIining is veII designed. The
proposed archileclure of enlropy encoding noduIe increases lhe lolaI lhroughpul aloul
lhree lines. We use O.18un TSMC CMOS 1I6M process lo inpIenenl lhe }ILC XR encoder.
Our design fIov is slandard ceII lased chip design fIov. The design fIov of our design is
shovn as Iig 2O. Tesl consideralion is aIso an inporlanl issue in chip design. Therefore, lhe
scan chain and lhe luiIl-in seIf-lesl (IST) are considered in our chip. The chip synlhesis
Iayoul is shovn as Iig. 21. The inpIenenlalion resuIls are shovn as lhe TalIe 2. The pover
dissipalion dislrilulion in shovn as Iig. 22.


Iig. 21. Chip synlhesis Iayoul.
Full HD JPEG XR Encoder Design for Digital Photography Applications 113


Iig. 22. Chip pover dissipalion dislrilulion.

TechnoIogy TSMC O.18un CMOS 1I6M
Core Size 3.18 nn x 3.18 nn (9.3O25 nn
2
)
Die Size 4.64 nn x 4.64 nn (2O.47 nn
2
)
Cale Counl 651.7 K
Work CIock Rale 62.5 MHz
Iover Consunplion 368.7 nW62.5MHz1.8V
On-Chip Menory 256x32 SRAM x 6 (singIe porl)
768x32 SRAM x 2 (singIe porl)
48Ox32 SRAM x 3 (singIe porl)
TolaI: 144,384 ils = 18,O47 yles
Irocessing
CapaliIily
34.1 Mega pixeIs vilhin one second
5.45 fps for 4:4:4 HDTV(192Ox1O8O) 62.5MHz
32.9 fps for 4:4:4 VCA(64Ox48O) 62.5MHz
112 fps for 4:4:4 CII(352x288) 62.5MHz
Inpul Iad 37
Oulpul Iad 34
TalIe 2. Chip specificalion.

5. ConcIusion

Conpared vilh lhe }ILC2OOO, lhe coding fIov of lhe }ILC XR is sinpIe and has Iover
conpIexily in lhe siniIar ISNR quaIily al lhe sane lil rale. Hence, lhe }ILC XR is very
suilalIe for inpIenenlalion vilh lhe dedicaled hardvare used lo nanage HD pholo size
inages for lhe HDR dispIay requirenenl. In lhis paper, ve iniliaIIy anaIyzed lhe
conparison of }ILC XR vilh olher inage slandard, and lhen a lhree-slage M pipeIining
vas proposed lo process lhe capacily and hardvare uliIizalion. We aIso nade a Iol of efforls
on noduIe designs. The lining scheduIe and pipeIining of coIor conversion, pre-fiIler, ICT
& quanlizalion noduIes are veII designed. In order lo prevenl accessing lhe coefficienls
fron off-chip nenory, an on-chip SRAM is designed lo luffer lhe coefficienls for lhe
prediclion noduIe vilh onIy sone area overhead. The pre-fiIler and ICT funclion vas
designed lo reduce 33.3 nenory access fron off-chip nenory.Ior lhe enlropy coding, ve
designed a codevord concenlraling archileclure for lhe lhroughpul increasing of RLL
aIgorilhn. And lhe adaplive encode and packelizer noduIes efficienlIy provide lhe coding
infornalion required for packing lhe lilslrean. ased on lhis research resuIl, ve conlrilule
VLS 114

a VLSI archileclure for 192Ox1O8O HD pholo size }ILC XR encoder design. Our proposed
design can le used in lhose devices vhich need poverfuI and advanced sliII inage
conpression chip, such as lhe nexl generalion HDR dispIay, lhe digilaI sliII canera, lhe
digilaI frane, lhe digilaI surveiIIance, lhe noliIe phone, lhe canera and olher digilaI
pholography appIicalions.

6. References

. Crov, Windovs Media Iholo: A nev fornal for end-lo-end digilaIinaging, Iincus
Haruarc |nginccring Ccnfcrcncc, 2OO6.
C.-H. Ian, C.-Y. Chien, W.-M. Chao, S.-C. Huang & L.-C. Chen, Archileclure design of fuII
HD }ILC XR encoder for digilaI pholography appIicalions, |||| Trans. Ccnsu. ||cc.,
VoI. 54, Issue 3, pp. 963-971, Aug. 2OO8.
C.-Y. Chien, S.-C. Huang, C.-H. Ian, C.-M. Iang & L.-C. Chen, IipeIined Arilhnelic
Lncoder Design for LossIess }ILC XR Lncoder, |||| |n||. Sqmpc. cn Ccnsu. ||cc.,
Kyolo, }apan, May 2OO9.
D. D. Ciuslo & T. OnaIi. Dala Conpression for DigilaI Iholography: Ierfornance
conparison lelveen proprielary soIulions and slandards, |||| Ccnf. Ccnsu. ||cc.,
pp. 1-2, 2OO7.
D. Schonlerg, S. Sun, C. }. SuIIivan, S. Regunalhan, Z. Zhou & S. Srinivasan, Techniques for
enhancing }ILC XR / HD Iholo rale-dislorlion perfornance for parlicuIar fideIily
nelrics, App|ica|icns cf Digi|a| |magc Prcccssing XXX|, Prccccings cf SP||, voI. 7O73,
Aug. 2OO8.
ISO/ILC }TC1/SC29/WC1. }ILC 2OOO Iarl I IinaI Connillee Drafl, Rev. 1.O, Mar. 2OOO.
ITU. T.81 : Infornalion lechnoIogy - DigilaI conpression and coding of conlinuous-lone sliII
inages. 1992.
L.V. Agoslini, I.S. SiIva & S. anpi, IipeIined Lnlropy Coders for }ILC conpression,
|n|cgra|c Circui|s an Sqs|cm Dcsign, 2OO2.
S. Croder, ModeIing and Synlhesis of lhe HD Iholo Conpression AIgorilhn, Masler Thesis,
2OO8.
S. Srinivasan, C. Tu, S. L. Regunalhan & C. }. SuIIivan, HD Iholo: a nev inage coding
lechnoIogy for digilaI pholography, App|ica|icns cf Digi|a| |magc Prcccssing XXX,
Prccccings cf SP||, voI. 6696, Aug. 2OO7.
S. Srinivasan, Z. Zhou, C. }. SuIIivan, R. Rossi, S. Regunalhan, C. Tu & A. Roy, Coding of
high dynanic range inages in }ILC XR / HD Iholo, App|ica|icns cf Digi|a| |magc
Prcccssing XXX|, Prccccings cf SP||, voI. 7O73, Aug. 2OO8.
Y.-W. Huang, .-Y. Hsieh, T.-C. Chen & L.-C. Chen, AnaIysis, Iasl AIgorilhn, and VLSI
Archileclure Design for H.264/AVC Inlra Irane Coder, ILLL Trans. Circuils Sysl.
Video TechnoI., voI. 15, no. 3, pp. 378-4O1, Mar. 2OO5.


The Design of P Cores in Finite Field for Error Correction 115
X

The Design of IP Cores in Finite FieId
for Error Correction

Ming-Hav }ing, }ian-Hong Chen, Yan-Hav Chen,
Zih-Heng Chen and Yaolsu Chang
|-Sncu Unitcrsi|q
Taiuan, R.O.C.

1. Introduction

In recenl sludies, lhe landvidlh of connunicalion channeI, lhe reIialiIily of infornalion
lransferring, and lhe perfornance of dala sloring devices lecone lhe najor design faclors in
digilaI lransnission /slorage syslens. In consideralion of lhose faclors, lhere are nany
aIgorilhns lo delecl or renove lhe noisefron lhe connunicalion channeI and slorage nedia,
such as cycIic redundancy check (CRC) and errorcorrecling code (Ielerson & WeIdon, 1972,
Wicker, 1995). The forner, a hush funclion proposed ly Ielerson and rovn (Ielerson &
rovn, 1961), is uliIized appIied in lhe hard disk and nelvork for error deleclion, lhe Ialer is
a lype of channeI coding aIgorilhns recover lhe originaI dala fron lhe corrupled dala
againsl various faiIures. NornaIIy, lhe schene adds redundanl code(s) lo lhe originaI dala
lo provide reIialiIily funclions such as error deleclion or error correclion. The lackground
of lhis chapler invoIves lhe nalhenalics of aIgelra, coding lheory, and so on.
In lerns of lhe design of reIialIe conponenls ly hardvare and / or soflvare
inpIenenlalions, a Iarge proporlion of finile fiIed operalions is used in nosl reIaled
appIicalions. Moreover, lhe frequenlIy used finile fieId operalions are usuaIIy sinpIified and
reconslrucled inlo lhe hardvare noduIes for high-speed and efficienl fealures lo repIace lhe
sIov soflvare noduIes or huge Iook-up lalIes (a fasl soflvare conpulalion). Therefore, ve
viII inlroduce lhose connon operalions and sone lechniques for circuil sinpIificalion in
lhis chapler. Those finile fieId operalions are addilions, nuIlipIicalions, inversions, and
conslanl nuIlipIicalions, and lhe lechniques incIude circuil sinpIificalion, resource-sharing
nelhods, elc. Iurlhernore, lhe designers nay use nalhenalicaI lechniques such as group
isonorphisn and lasis lransfornalion lo yieId lhe nininun hardvare conpIexilies of
lhose operalions. And, il lakes a greal deaI of line and efforl lo search lhe oplinaI designs.
To soIve lhis prolIen, ve propose lhe conpuler-aided funclions vhich can le used lo
anaIyze lhe hardvare speed/conpIexily and lhen provide lhe oplinaI paranelers for lhe II
design.
This chapler is organized as foIIovs: In Seclion 2, lhe nalhenalicaI lackground of finile
fieId operalions is presenled. The VLSI inpIenenlalion of lhose operalions is descriled in
Seclion 3. Seclion 4 provides sone lechniques for sinpIificalion of VLSI design. The use of
6
VLS 116

conpuler-aided funclions in choosing lhe suilalIe paranelers is inlroduced in Seclion 5.
IinaIIy, lhe resuIl and concIusion are given.

2. The mathematic background of finite fieId

LIenenls of a finile fieId are oflen expressed as a poIynoniaI forn over G|(q), lhe
characlerislic of lhe fieId. In nosl conpuler reIaled appIicalions, lhe CaIois fieId vilh
characlerislic 2 is viIdIy used lecause ils ground fieId, G|(2), can le napped inlo lil-O and
lil-1 for digilaI conpuling. Ior convenience, lhe vaIue vilhin lvo parenlhesises indicales
lhal lhe coefficienls for a poIynoniaI in descending order. Ior exanpIe, lhe poIynoniaI,
1
3 5 6
+ + + x x x , is represenled ly |11O1OO1 in linary forn or |69 in hexadecinaI forn. So
does an eIenenl ) 2 (
m
G| e o is presenled as synloI lased poIynoniaI.

2.1 The common base representations

2.1.1 The standard basis
If an eIenenl ) 2 (
m
G| e o is lhe rool of a degree m irreducilIe poIynoniaI ) (x f , i.e.,
O ) ( = o f , lhen lhe sel { }
1 2 1
, , , , 1
m
o o o . forns a lasis, is caIIed a slandard lasis, a
poIynoniaI lasis or a canonicaI lasis (LidI & Niederreiler, 1986). Ior exanpIe, conslrucl
) 2 (
4
G| | = vilh lhe degree 4 irreducilIe poIynoniaI 1 ) (
4
+ + = x x x f , suppose O ) ( = o f ,
lhal is, 1
4
+ =o o and O | > =<o | as TalIe 1.

eIenenl
3
o
2
o
1
o
O
o eIenenl
3
o
2
o
1
o
O
o
O O O O O
7
o

O 1 1 1
0
o O O O 1
8
o

1 1 1 O
1
o O O 1 O
9
o

O 1 O 1
2
o O 1 O O
10
o

1 O 1 O
3
o 1 O O O
11
o

1 1 O 1
4
o 1 O O 1
12
o

O O 1 1
5
o 1 O 1 1
13
o

O 1 1 O
6
o 1 1 1 1
14
o

1 1 O O
TalIe 1. The slandard lasis expression for aII eIenenls of ) 2 (
4
G| | =

2.1.2 The normaI basis
Ior a given ) 2 (
m
G| , lhere exisls a nornaI lasis { }
1 2
2 2 2
, , , ,
m
o o o o . . Lel
_

=
=
1
O
2
m
i
i
i
| o | le
represenled in a nornaI lasis, and lhe linary veclor ( )
1 1 O
, ,
m
| | | . is used lo represenl lhe
coefficienls of | , denoled ly ( )
1 1 O
, ,

=
m
| | | . | . Since
O
2 2
1 o o = =
m
ly Iernals IillIe lheoren
(Wang el aI., 1985), ( )
2 O 1
2
2
2
O
2
1
2
, ,
1 1 O

= + + + =

m m m m
| | | | | |
m
. o o o |
or ( )
1 1 O 1
2
, , , , , , ,
+
=
i m m i i m i m
| | | | | |
i
. . | . Thal is, lhe squaring operalions ( lh 2
i
pover
operalions) can le conslrucled ly cycIic rolalions in soflvare or ly changing Iines in
The Design of P Cores in Finite Field for Error Correction 117

hardvare, vhich is vilh Iov conpIexily for praclicaI appIicalions (Ienn el aI., 1996).

2.1.3 Composite fieId
Ior circuil design, using a conposile fieId lo execule sone specific operalions is an effeclive
nelhod, for exanpIe, lhe circuil of finile fieId inversion ollained in conposile fiIed has lhe
nininun conpIexily. The fanous exanpIe is found in nosl hardvare designs of ALS VLSI
(Hsiao el aI., 2OO6, }ing el aI., 2OO7), in vhich lhe S-lox is a non-Iinear sulslilulion for aII
eIenenls in ) 2 (
8
G| can le designed vilh a Iess area conpIexily ly severaI isonorphisn
conposile fieIds such as ) ) 2 ((
4 2
G| , ) ) 2 ((
2 4
G| , and )) ) 2 (((
2 2
G| (Morioka & Saloh, 2OO3). In
lhis seclion, ve inlroduce lhe process lo conslrucl a conposile fieId and lhe lasis
lransfornalion lelveen a slandard lasis and a lasis in conposile fieId.
Lel ) 2 (
8
GF le represenled in a slandard lasis vilh reIalion poIynoniaI
1 ) (
2 3 4 8
+ + + + = x x x x x f ( ) (x f is prinilive) and O ) ( = o f such lhal ) 2 (
8
G| = o and
r
o = is a prinilive eIenenl in lhe ground fieId ) 2 (
4
G| , vhere ( ) ( ) 17 1 2 1 2
4 8
= = r . We
conslrucl lhe conposile fieId ) ) 2 ((
2 4
G| over lhe fieId ) 2 (
4
G| using lhe irreducilIe
poIynoniaI ) (x q vilh degree 2 over ) 2 (
4
G| , vhich is given as foIIovs

17 34 2 17 2 2 2
) ( ) )( ( ) (
4 4
o o o o o o o + + = + + + = + + = x x x x x x x q . (1)

Such lhal
17
o = is an eIenenl of ) 2 (
2
G| . In order lo represenl lhe eIenenls of lhe ground
fieId ) 2 (
2
G| , ve use lhe lern in ) (x q as lhe lasis eIenenl, vhich is
17
o = . An eIenenl A is
expressed in ) ) 2 ((
2 4
G| as

o
1 O
a a A ' + ' = .
(2)

vhere ) 2 (
4
G| a
j
e ' . We can express
j
a' in ) 2 (
4
G| using
17
o = as lhe lasis eIenenl

51
1
34
1
17
1 O
3
O
2
O O O
o o o
j j j j j j j j j
a a a a a a a a a + + + = + + + = ' .
(3)

vhere ) 2 ( G| a
ji
e for 1 , O = j and 3 , 2 , 1 , O = i . Therefore, lhe represenlalion of A in lhe
conposile fieId is ollained as

) ( ) (
52
13
35
12
18
11 1O
51
O3
34
O2
17
O1 OO 1 O
o o o o o o o o a a a a a a a a a a A + + + + + + + = ' + ' = .
(4)

Nexl, sulslilule lhe lerns
j i + 17
o for 1 , O = j and 3 , 2 , 1 , O = i ly lhe reIalion poIynoniaI
1 ) (
2 3 4 8
+ + + + = x x x x x f as foIIovs:

,
3 4 7 17
o o o o + + = ,
1 2 3 6 34
o o o o o + + + = ,
1 3 51
o o o + =
, 1
2 3 5 18
+ + + = o o o o ,
2 3 4 7 35
o o o o o + + + =
2 4 52
o o o + = .
(5)

y sulsliluling lhe alove lerns in expression Lqualion (4), ve ollain lhe represenlalion of
VLS 118

A in lhe slandard lasis ) , , , 1 (
7 1
o o o . as

7
7
6
6
5
5
4
4
3
3
2
2 1 1 O
o o o o o o o o a a a a a a a A + + + + + + + = .
(6)

The reIalionship lelveen lhe lerns
n
a for 7 , , 1 , O . = n and
ji
a for 1 , O = j and 3 , 2 , 1 , O = i
delernines a 8 ly 8 conversion nalrix T (Sunar el aI., 2OO3). The firsl rov of lhe nalrix T
is ollained ly galhering lhe conslanl lerns in lhe righl hand side of Lqualion (4) afler lhe
sulslilulion, vhich gives lhe conslanl coefficienls in lhe Iefl hand side, i.e., lhe lern
O
a . A
sinpIe inspeclion shovs lhal
11 OO O
a a + = o . Therefore, ve ollain lhe 8 8 nalrix T and lhis
nalrix gives lhe represenlalion of an eIenenl in lhe linary fieId ) 2 (
8
G| given ils
represenlalion in lhe conposile fieId ) ) 2 ((
2 4
G| as foIIovs:

(
(
(
(
(
(
(
(
(
(
(

(
(
(
(
(
(
(
(
(
(
(

=
(
(
(
(
(
(
(
(
(
(
(

13
12
11
1O
O3
O2
O1
OO
7
6
5
4
3
2
1
O
O 1 O O O O 1 O
O O O O O 1 O O
O O 1 O O O O O
1 1 O O O O 1 O
O 1 1 O 1 1 1 O
1 1 1 O O 1 O O
O O O 1 1 1 O O
O O 1 O O O O 1
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
.
(7)

The inverse lransfornalion, i.e., lhe conversion fron ) 2 (
8
G| lo ) ) 2 ((
2 4
G| , requires
conpuling lhe
1
T nalrix. We can use Causs-}ordan LIininalion lo derive lhe
1
T nalrix as
foIIovs:

(
(
(
(
(
(
(
(
(
(
(

(
(
(
(
(
(
(
(
(
(
(

=
(
(
(
(
(
(
(
(
(
(
(

7
6
5
4
3
2
1
O
13
12
11
1O
O3
O2
O1
OO
1 O O 1 O O O O
1 1 1 1 O 1 O O
O O 1 O O O O O
1 O 1 O 1 O 1 O
1 1 1 O 1 O O O
O 1 O O O O O O
O 1 1 1 O 1 O O
O O 1 O O O O 1
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
.
(8)

2.1.4 The basis transformation between standard basis and normaI basis
The nornaI lasis is vilh sone good fealures in hardvare, lul lhe slandard lasis is used in
popuIar designs. Iinding lhe lransfornalion lelveen lhen is an inporlanl lopic (Lu, 1997),
ve use ) 2 (
4
G| as an exanpIe lo iIIuslrale lhal. Suppose ) 2 (
4
G| is vilh lhe reIalion
1 ) (
3 4
+ + = x x x p vhich is a prinilive poIynoniaI. Lel O ) ( = o p such lhal
( )
3 2 1 O
1
, , , o o o o = 8 forn a slandard lasis. Lel
3
o = and lhe sel { }
8 4 2 1
, , , is Iinear
The Design of P Cores in Finite Field for Error Correction 119

independenl such lhal ( )
8 4 2 1
2
, , , = 8 forns a nornaI lasis. There exisls a nalrix T such
lhal
T T
8 T 8
1 2
= and
T T
8 T 8
2
1
1
=

. The nalrixes T and


1
T are Iisled as foIIovs.
(
(
(
(
(

(
(
(
(
(

=
(
(
(
(
(

(
(
(
(
(

(
(
(
(
(

=
(
(
(
(
(

(
(
(
(
(

=
(
(
(
(
(

1
2
4
8
O
1
2
3
O
1
2
3
1
2
4
8
1
O 1 1 1
O 1 1 O
O 1 O 1
1 1 O O
,
1 1 1 1
1 1 1 O
1 O 1 O
1 1 O O
O 1 1 1
O 1 1 O
O 1 O 1
1 1 O O
,
1 1 1 1
1 1 1 O
1 O 1 O
1 1 O O

o
o
o
o
o
o
o
o

T T
. (9)

2.2 The basic operation in finite fieId

2.2.1 Addition and subtraction
Ior a finile fieId vilh characlerislic 2, addilion and sullraclion are perforned ly lhe lilvise
XOR operalor. Ior exanpIe, Iel 1 ) (
1 2 4
+ + + = x x x x a , 1 ) (
1 3 4
+ + + = x x x x | , and ) (x c le lhe
sunnalion of lvo poIynoniaIs, lhus,
2 3 1 2 3 4
2 2 2 ) ( ) ( ) ( x x x x x x x | x a x c + = + + + + = + = or
perforn in linary forn |1O111 + |11O11 = |O11OO .

2.2.2 MuItipIication and inversion
The nuIlipIicalion in a finile fieId is perforned ly nuIlipIy lvo poIynoniaIs noduIo a
specific irreducilIe poIynoniaI. Ior exanpIe, consider lhe finile fieId ) 2 (
4
G| | = vhich is
vilh lhe reIalion 1 ) (
4
+ + = x x x p and Iel O ) ( = o p lhus ( )
3 2 1 O
, , , o o o o forns a slandard
lasis. Suppose | c | a e , , and 1
3
+ =o a , 1
2
+ + = o o | , and c is lhe producl of lhen. Thus
( ) ( ) 1 1 1
2 3 4 5 2 3
+ + + + + = + + + = = o o o o o o o o | a c , refer lo TalIe 1, ve have lhe producl
resuIl as ( ) ( ) o o o o o o o o + = + + + + + + + =
3 2 3 2
1 1 c . Ior every nonzero eIenenl
) 2 (
m
G| | = e , one has =
m
2
or
2 2 1
=
m
equivaIenlIy (Dinh el aI., 2OO1). Therefore, lhe
division for finile fieId can le perforned ly lhe nuIlipIicalive inversion. Ior exanpIe,
consider lhe inversion in ) 2 (
8
G| ,
2 2 1
8

= , and one can ollain lhis as Iig. 1.

2.2.3 Square operation
Consider an eIenenl | x a x a a A
m
m
e + + + =

1
1
1
1 O
vhere ) 2 ( G| a
i
e for m i < s O , lhe square
operalion for lhe characlerislic 2 finile fieId is: ( )
2
1
1
1
1 O
2

+ + + =
m
m
x a x a a A . Ior ) 2 ( G| a
i
e ,
ve have
i i
a a =
2
and lhus
) 1 ( 2
1
2
1 O
2

+ + + =
m
m
x a x a a A . esides, lhose ilens vilh pover nol
Iess n can le expressed ly slandard lasis. Thus, ve can perforn lhe square operalion ly
sone finile fieId addilions, i.e., XOR gales. Ior inslance, Iel ) 2 (
4
G| | = conslrucled ly
1 ) (
4
+ + = x x x f , an eIenenl | x a x a x a a A e + + + =
3
3
2
2
1
1 O
,
6
3
4
2
2
1 O
2
x a x a x a a A + + + = . Tvo
lerns
4
x and
6
x can le sulsliluled ly 1 + x and x x +
3
according lo TalIe 1. We have
) ( ) 1 (
3
3 2
2
1
O
O
2
x x a x a x a x a A + + + + + = or
3
3
2
1 3 2 2 O
2
) ( ) ( x a x a x a a a a A + + + + + = . The sane
VLS 120

properly is aIso suilalIe for lhe pover
i
2 operalion, such as
1 3 2
2 2 2
, , ,
m
A A A . .

3. The hardware designs for finite fieId operations

3.1 MuItipIier
Iinile fieId nuIlipIier is lhe lasic conponenl for nosl appIicalions. Many designers choose
lhe one vilh slandard lasis for lheir appIicalions, lecause lhe slandard lasis is easier lo
shov lhe vaIue ly lhe lil-veclor in digilaI conpuling. As foIIovs, ve inlroduce lvo nosl
used lypes of finile fieId nuIlipIiers, one is lhe convenlionaI nuIlipIier and anolher is lhe
lil-seriaI one.

3.1.1 ConventionaI muItipIier
As lhe slalenenl in Seclion 2.2.2, Iel ) 2 ( , ,
m
G| C 8 A e are represenled vilh slandard lasis
and 8 A C = , vhere
_

=
=
1
O
m
i
i
i
a A o ,
_

=
=
1
O
m
i
i
i
| 8 o , and lhe producl
( ) ( )
_ _ _

=
= =
1 2
O
1
O
1
O
m
i
i
i
m
i
i
i
m
i
i
i
p | a P o o o . Nole lhal every eIenenl in ) 2 (
m
G| is vilh lhe
reIalion ) (x f descriled in Seclion 2.1.1, such lhal lhe lerns vilh order grealer lhan m,
1 2 1
, , ,
+ m m m
o o o . , can le sulsliluled ly lhe Iinear conlinalion of slandard lasis
, , , 1 |
1 1 m
o o . . Thus, ve can olserve lhal lhere are
2
m and gale and aloul ) (m O m XOR
gales in lhe sulslilulion for high-order lerns.

3.1.2 Massey-Omura muItipIier
Here, ve inlroduce lhe popuIar version naned lhe lil-seriaI lype of Massey-Onura
nuIlipIier. Il is lased on lhe nornaI lasis, and lhe lransfornalion lelveen slandard lasis
and nornaI lasis is inlroduced in Seclion 2.1.4. Lel ) 2 ( , ,
m
G| C 8 A e are represenled vilh
nornaI lasis and 8 A C = , vhere
_

=
=
1
O
2
m
i
i
i
a A o ,
_

=
=
1
O
2
m
i
i
i
| 8 o , and
_

=
=
1
O
2
m
i
i
i
c C o . Denole
lhe coefficienl-veclor of A , 8 , and C ly a , | , and c , and lhe nolalion
) ( i
a neans
i
A
2
, ve
have:

| |
T
m
m
| M a
|
|
|
a a a 8 A C
m m m m
m
m
=
(
(
(
(
(

(
(
(
(
(

= =

+ + +
+ + +
+ + +

1
1
O
2 2 2 2 2 2
2 2 2 2 2 2
2 2 2 2 2 2
1 1 O
1 1 1 1 O 1
1 1 1 1 O 1
1 O 1 O O O
, , ,
.

. . .

.
o o o
o o o
o o o
, (1O)

vhere
1
2
1
2
1 O

+ + + =
m
m
M M M M o o o , such lhal

T i
m
i T
i m i m
| M a | M a c ) (
) (
1
) (
1 1
= =

.
(11)

Using Lqualion (11), lhe lil-seriaI Massey-Onura nuIlipIier can le designed as foIIoving:


The Design of P Cores in Finite Field for Error Correction 121







Iig. 1. The Massey-Onura lil-seriaI nuIlipIier

In Iig. 1, lhe lvo shifl-regisler perforn lhe square operalion in nornaI lasis, and lhe
conpIexily of and-xor pIane is aloul ) (m O and reIalive lo lhe nunler of nonzero eIenenl
in
i m
M
1
. Therefore, Massey-Onura nuIlipIier is suilalIe lo lhe design of area-Iiniled
circuils.

3.2 Inverse
In generaI lhe inverse circuil is usuaIIy vilh lhe liggesl area and line conpIexily anong
olher operalions. There are lvo nain nelhods lo inpIenenl lhe finile fieId inverse, lhal is,
nuIlipIicalive inversion and inversion lased on conposed fieId. The firsl nelhod
deconposes inversion ly nuIlipIier and squaring, and lhe oplinaI vay for deconposing is
proposed ly Iloh and Tsujii (Iloh & Tsujii, 1988). The Ialer one is lased on lhe conposed
fieId and suiled for area-Iiniled circuils, vhich has leen videIy used in nany appIicalions.

3.2.1 MuItipIicative inversion
Iron Iernal's lheoren, for any nonzero eIenenl ) 2 (
m
G| e o hoIds 1
1 2
=

m
o . Therefore,
nuIlipIicalive inversion is equaI lo
2 2
m
o . ased on lhis facl
[

=

= =
1
1
2 2 2 1
m
i
i m
o o o , Iloh and
Tsujii reduced lhe nunler of required nuIlipIicalions lo ) (Iogm O , vhich is lased on lhe
deconposilion of inleger. Suppose
_

=
=
1
O
2 1
|
n
n
n
a m , vhere ) 2 ( G| a
n
e and 1
1
=
|
a
denoled lhe decinaI nunler
2 O 1 2
j 1 | a a a
|
.

, ve have lhe foIIoving facls:



1 2 2 ) 1 2 )( 1 2 ( ) 1 2 (
1 2 2 ) 1 2 )( 1 2 (
1 2 2 ) 1 2 ( 1 2
2 O 1 2 2 O 1 2
O 1 2
2 O 1 2 2 O 1 2
2 1
2 O 1 2 2 O 1 2
1
j | j | 1 2 2
j | j | 2 2
j | j | 2 1
+ + + + =
+ + =
+ =

a a a a a a
a a a a a a
a a a a a a m
| |
|
| |
| |
| |
|
. .
. .
. .

.
(12)
1 2 2 ) 1 2 )( 1 2 ( ) 1 2 (
1 2 2 ) 1 2 ( 1 2
2 O 1 3 2 O 1 3
O 1 2
2 O 1 3 2 O 1 3
2
O 1 2
j | j | 1 2 2
2
j | j | 2
2
j |
+ + + =
+ =

a a a a a a
|
a a a a a a
|
a a a
| |
|
| |
|
|
a
a
. .
. . .

.
(13)
O
2 2
1
2 2
2
2 2
3
2 2
2
2 2
j | 2 2 2
2
2 2
j | j | 2 2 2
2
j | 2 2 2 2 2
j | j | 1 2 2 1
O O 1 1
2 2 3 3 2 2
2 O 1 3
O 1 3 2 2
2 O 1 3 2 O 1 3
O 1 3
2 O 1 3
O 1 3 2 2
2 O 1 2 2 O 1 2
O 1 2
2 ) 1 2 )( 2 ) 1 2 (
) 2 ) 1 2 )( ) 2 ) 1 2 )( 2 ) 1 2 (( (((
1 2 ) 1 2 )( 1 2 ( ) 1 2 )( 2 ) 1 2 ((
1 2 2 ) 1 2 )( 1 2 ( ) 1 2 (
2 ) 1 2 )( 1 2 ( ) 1 2 )( 2 ) 1 2 ((
1 2 2 ) 1 2 )( 1 2 ( ) 1 2 ( 1 2
a a
a a a
a
a
| |
a a a
|
a a a a a a
|
a a a
a a a a a a m
| | | |
|
| | |
| |
|
|
| | |
| |
|
+ + + +
+ + + + + + =
+ + + + + + =
+ + + +
+ + + + + =
+ + + + =

.
. .
.
. .
.
(14)

Shifl-regisler A
Shifl-regisler
AND-XOR
IIane
) (i
a
) (i
b i m
c
1

VLS 122


0
2
p A

2
A

1
p

1
a
0
a
1
A
1
b
0
b
This aIgorilhn requires 2 ) 1 .( ) 1 .( + = m u| m |cn N
M
nuIlipIiers, and
1 ) 1 .( ) 1 .( + = m u| m |cn N
P
square circuils, vhere ) 1 .( m |cn lhe Ienglh of linary
represenlalion of 1 m and ) 1 .( m u| is lhe nunler of nonzero lil in lhe represenlalion.
Ior inslance, if 8 = m lhen 7 1 = m , 4 2 3 3 2 ) 7 .( ) 7 .( = + = + = u| |cn N
M
and
5 1 3 3 1 ) 7 .( ) 7 .( = + = + = u| |cn N
P
. Ior lhe Ialency of circuil, il lakes
( ( S M
T m T m ) 1 ) 1 ( Iog ( ) ) 1 ( Iog (
2 2
+ + , vhere
M
T (resp.
S
T ) is lhe Ialency of nuIlipIier (resp.
squaring circuil). We Iisl sone resuIls of lhis aIgorilhn as TalIe 2.

m area Ialency m area Ialency
5 2 NM +3 NI 2 TM +3 TI 11 4 NM +5 NI 4 TM +5 TI
6 3 NM +4 NI 3 TM +4 TI 12 5 NM +6 NI 4 TM +5 TI
7 3 NM +4 NI 3 TM +4 TI 13 4 NM +5 NI 4 TM +5 TI
8 4 NM +5 NI 3 TM +4 TI 14 5 NM +6 NI 4 TM +5 TI
9 3 NM +4 NI 3 TM +4 TI 15 5 NM +6 NI 4 TM +5 TI
1O 4 NM +5 NI 4 TM +5 TI 16 6 NM +7 NI 4 TM +5 TI
TalIe 2. The Iisl of Iloh and Tsujii aIgorilhn

3.2.2 Composite fieId inversion
The use of conposile fieId provides an isonorphisn for ) 2 (
m
G| , vhiIe m is nol prine.
LspeciaIIy, if m is even, lhen inverse using conposile fieId is vilh very Iov conpIexily.
Consider lhe inverse in ) ) 2 ((
2 2 / m
G| vhere m is even. Suppose ) ) 2 (( ,
2 2 / m
G| 8 A e conslrucled
ly an irreducilIe poIynoniaI
O 1
) ( p x p x P + = , vhere ) 2 ( ,
2 /
1 O
m
G| p p e . Lel
O 1
a x a A + = and
O 1
| x | 8 + = , vhere ) 2 ( , , , , ,
2 /
O 1 O 1 O 1
m
G| p p | | a a e . Assune lhal 8 is lhe inverse of A, lhus
1 = 8 A or 1 ) ( ) (
O 1 O 1
+ + | x | a x a noduIo ) (x P . Afler lhe dislrilulion, one has
1 ) ( ) (
O O O 1 1 1 O O 1 1 1 1
= + + + + = | a p | a x | a | a p | a 8 A . Therefore, O
1 O O 1 1 1 1
= + + | a | a p | a and
1
O O O 1 1
= + | a p | a . Lel ) (
2
1 O 1 1 O
2
O
a p p a a a + + = A , one has
1
1 1

A = a | and
1
1 1 O O
) (

A + = p a a | , vhich
is design as Iig. 2. OlviousIy, one can olserve lhe inversion in ) 2 (
m
G| is execuled ly
severaI operalions vhich are aII in ) ) 2 ((
2 2 / m
G| , lhus lhe lolaI galed counl used can le
reduced.









Iig. 2. The circuil for conposile fieId inversion

The Design of P Cores in Finite Field for Error Correction 123

4. Some techniques for simpIification of VLSI

4.1 Finding common sharing resource in various design IeveIs
Sharing resource is a connon nelhod lo reduce lhe area cosl. This skiII can le used in
differenl design slages. Ior exanpIe, consider lhe lasis lransfornalion in Seclion 2.1.4, lhe
eIenenl of nornaI lasis is ollained ly lhe Iinear conlinalion of slandard lasis as foIIovs:

O 1 8
o o + = ,
O 2 4
o o + = ,
O 1 2 2
o o o + + = ,
O 1 2 3 1
o o o o + + + = . (15)

Il lakes 7 XOR gales for lhe slraighlforvard inpIenenlalion. Hovever, if one caIcuIale lhe
sunnalion
O 1 2
o o o + + = | firslIy, lhen | =
2
and | + =
3 1
o . Therefore, lhe nunler of
XOR gales is reduced lo 5. AIlhough il is effeclive in lhe lil-IeveI, lhis idea is aIso effeclive in
olher design slages. Consider anolher exanpIe in previous seclion, vhen ve forn lhose
conponenls ) (
2
1 O 1 1 O
2
O
a p p a a a + + = A and
1
1 1 O O
) (

A + = p a a | , il lakes 3 2-inpul adders in lvo


expressions. Suppose ve forn lhe conponenl
1 1 O
p a a + firslIy, lhus lhe nunler of 2-inpul
adder is reduced fron 3 lo 2 ( ) ) ( (
2
1 O 1 1 O O
a p p a a a + + = A ). Therefore, lhe resource-sharing idea
is suilalIe lo differenl design slages.

4.2 Finding the optimaI parameters of components
Anolher lechnique used lo sinpIify circuils for finile fieId operalions is change lhe originaI
fieId lo anolher isonorphisn. AIlhough lhese nelhods are equaI in nalhenalics, il provides
differenl oulcones in VLSI designs. There are lvo nain nelhods lo le reaIized.

4.2.1 Change the reIation poIynomiaI
Consider lhe inpIenenlalions of hardvare nuIlipIier/inverse in ) 2 (
8
G| using IICA, ve
galher area slalislics of nuIlipIier/inverse ly using differenl irreducilIe poIynoniaIs ( ) (x f )
and drav lhe Iine charl as Iig. 3 and Iig. 4, vhere lhe X axis indicales various irreducilIe
poIynoniaIs in decinaI represenlalion and lhe Y axis is lhe nunler of needed XOR gales. In
Iig. 3, one can olserve lhe Iovesl conpIexily of area and deIay is vilh ) (x f is 45. The
naxinun difference of XOR nunler (resp. deIay) lelveen lvo poIynoniaIs is 5O (resp. 2).
Therefore, choosing lhe oplinaI paranelers has greal infIuence in conpIexily in VLSI. The
sane phenonenon is aIso leen olserved in Iig. 4, lhe naxinun difference is 196 XOR gales.


Iig. 3. The slalislic of area for nuIlipIier v.s. ) (x f

VLS 124


Iig. 4. The slalislic of area for inverse v.s. ) (x f

4.2.2 Using composite fieId
In Seclion 2.1.3, ve iIIuslrale lhe lransfornalion lelveen a finile fieId represenled ly
slandard lasis and a conposile fieId. The nosl appIicalions for conposile fieId are lo
design lhe inverse, for inslance, lhe S-lox in ALS aIgorilhn (Morioka & Saloh, 2OO3). As ve
knov, lhe nain conponenl in S-lox is lhe finile fieId inverse of ) 2 (
8
G| . Here, ve
inpIenenl lhe S-lox ly lhe nuIlipIicalive inversion descriled in Seclion 3.2.1 and ly using
conposile fieId ) ) 2 ((
2 4
G| descriled in 3.2.2 as TalIe 3 ly using lhe AIlera IICA Slralix
2S1O2OC4 device. OlviousIy, lhe Ialer nelhod is vilh nore advanlages for lolh area and
line conpIexily lhan lhal of previous one.

Melhod LL/ALUT
DeIay (ns)
CLK (MHz)
Throughpul
(MHz)
nuIl. inverse 21O 23.24O 43.O29 344.232
conposile fieId 82 2O.219 49.458 395.664
TalIe 3. The resuIls for S-lox using nuIlipIicalive inversion and using conposile fieId

5. Using computer-aided functions to choose suitabIe parameters

According lo lhe expIanalions in Seclion 4, ve can reaIize lhe reIaled VLSI IIs using various
paranelers lo lring lhe lenefils for Iover area or line conpIexily. Hovever, lhere exisl so
nany isonorphisns in using finile fiIed, il seens lhal lhere are so nany procedures and
varialions lo choose lhe paranelers and hard lo find a leller ones. As a resuIl, our group
deveIoped a soflvare looIs vhich is lhe conpuler-aided design (CAD) lo heIp engineers lo do
lhe ledious anaIysis and search. This seclion viII inlroduce lhe nelhods lo appIy lhe
isonorphisn lransfornalions lelveen ) 2 (
8
G| and ) ) 2 ((
2 4
G| iIIuslraled in Seclion 4.1 and 4.2
slep ly slep.
IirslIy, Iisl aII irreducilIe and prinilive poIynoniaIs in lvo fieIds as shovn in TalIe 4 and 5,
respecliveIy. In lhis lalIe, aII irreducilIe and prinilive poIynoniaIs are represenled in
hexadecinaI forn and ve onil lhe nosl significanl lil. Ior exanpIe, in TalIe 4, one chooses
1 lhal neans (OOO11O11)
2
or 1
3 4 8
+ + + + x x x x , in TalIe 5, suppose lhe prinilive eIenenl
) 2 (
4
G| e e , one chooses 18 lhal neans (OOO11OOO)
2
or
3 2
1 e + + x x .

The Design of P Cores in Finite Field for Error Correction 125

) 2 (
8
G|
irreducilIe poIynoniaIs
#=3O
1 1D 2 2D 39 3I 4D 5I 63 65 69 71 77 7 87 8
8D 9I A3 A9 1 D C3 CI D7 DD L7 I3 I5 I9
prinilive poIynoniaIs
#=16 1D 2 2D 4D 5I 63 65 69 71 87 8D A9 C3 CI L7 I5
TalIe 4. The irreducilIe and prinilive poIynoniaIs in ) 2 (
8
G|

) ) 2 ((
2 4
G|
irreducilIe poIynoniaIs
#=12O
18 19 1A 1 1C 1D 1L 1I 21 22 25 26 29 2A 2D 2L
31 33 34 36 39 3 3C 3L 41 42 44 47 48 4 4D 4L
51 53 55 57 59 5 5D 5I 62 63 64 65 6A 6 6C 6D
72 73 74 75 78 79 7L 7I 81 83 84 86 88 8A 8D 8I
92 93 96 97 9A 9 9L 9I A1 A2 A4 A7 A9 AA AC AI
4 5 6 7 C D L I C1 C3 C5 C7 C8 CA CC CL
D4 D5 D6 D7 D8 D9 DA D L2 L3 L6 L7 L8 L9 LC LD
I1 I2 I5 I6 I8 I IC II
prinilive poIynoniaIs
#=6O
19 1 1D 1L 22 25 29 2D 2L 33 34 39 3 3L 42 44
55 59 5 5D 62 63 64 65 6 6D 72 73 74 75 79 7L
83 84 8D 92 93 9 9L A2 A4 A9 4 5 D L C3 C5
CL D4 D5 D9 D L2 L3 L9 LD I2 I5 I
TalIe 5. The irreducilIe and prinilive poIynoniaIs in ) ) 2 ((
2 4
G|

SecondIy, lhe CAD searches for aII possilIe conlinalions ly lhe proposed aIgorilhn as
shovn in TalIe 6. This aIgorilhn regards as a funclion used lo find lransfornalion nalrices
as shovn in TalIe 7. Afler ve galher aII resuIls, ve can choose lhe leller paranelers fron
lhe Iisl of anaIyzed resuIls for hardvare design of nev II.

Chose reIalion poIynoniaI 1D for ) 2 (
8
G| 1 ) (
2 3 4 8
+ + + + = x x x x x p .
Lel O ) ( = o p , such lhal ) 2 (
8
G| e | can le expressed ly linary forn as
) , , , , , , , (
O 1 2 3 4 5 6 7
o o o o o o o o
Slep
1:
Iind a ) ) 2 ((
2 4
G| irreducilIe poIynoniaI.
SeIecl an irreducilIe poIynoniaI in ground fieId ) 2 (
4
G| is
1 ) (
4
1
+ + = x x x f . Lel e le lhe rool of ) (
1
x f , lhus O 1 ) (
4
1
= + + = e e e f .
SeIecl an irreducilIe poIynoniaI in ) ) 2 ((
2 4
G| is 18
3 2
2
) ( e + + = x x x f
and Iel le lhe rool of ) (
2
x f , O ) (
3 2
2
= + + = e f .
Slep
2:
Assune a generalor o in ) ) 2 ((
2 4
G| and generale aII none-zero eIenenls
of ) ) 2 ((
2 4
G| . Ior any eIenenl in ) ) 2 ((
2 4
G| can le expressed in linary
forn as ) , , , , , , , (
O3 O2 O1 O3 1O 11 12 13
.
Assune lhe T nalrix = | |
8 8
O 6 7
) ( ) ( ) (

T T T
o o o . , ve have
VLS 126

| |
1 8
O
1
2
3
4
5
6
7
8 8
O 6 7
1 8
OO
O1
O2
O3
1O
11
12
13
) ( ) ( ) (

(
(
(
(
(
(
(
(
(
(
(

=
(
(
(
(
(
(
(
(
(
(
(

o
o
o
o
o
o
o
o
o o o

T T T
.
Slep
3:
Conpule lhe
1
T nalrix= | | 8 8
1
O 6 7
) ( ) ( ) (

T T T
o o o . .
Iul aII none-zero eIenenls ) ) 2 ((
2 4
GF in lhe foIIoving equalion and check
if lhey aII hoId, i.e., 254 O , j | j |
1
s s =

i uncrc T
T i T i
o o .
Slep
4:
If Slep 3 does nol hoId, relurn lo Slep 2 and choose anolher generalor, If
Slep 3 hoIds, lhen lhe T and
1
T nalrices are found.
TalIe 6. The proposed aIgorilhn for searching lransfornalion nalrices

inpul

1 ) (
2 3 4 8
+ + + + = x x x x x p ) 2 ( ) (
8
G| x p
1 ) (
4
1
+ + = x x x f O 1 ) (
4
1
= + + = e e e f

3 2
2
) ( e + + = x x x f ) ) 2 (( ) ( ), (
2 4
2 1
G| x f x f
oulpul
8 8
1
8 8
1 O 1 O 1 1 O O
O 1 O O 1 1 1 1
O 1 O 1 1 O O 1
O 1 1 O 1 O O O
O O 1 O O O 1 1
O O O O O O O 1
O 1 O 1 1 O 1 O
O O 1 O 1 O 1 1
,
1 1 1 1 1 O 1 1
O O 1 1 O O 1 1
O O 1 O 1 O 1 O
O O O 1 1 1 1 O
O O O O 1 O O 1
O 1 O 1 1 O O O
O O 1 O O 1 1 O
O O O O O 1 O O

(
(
(
(
(
(
(
(
(
(
(

=
(
(
(
(
(
(
(
(
(
(
(

= T T

TalIe 7. The resuIl of lransfornalion nalrices lelveen ) 2 (
8
G| and ) ) 2 ((
2 4
G| .

Iig. 5. The inlerface of CAD looI
The Design of P Cores in Finite Field for Error Correction 127

ecause various paranelers provide VLSIs oulpuls vilh huge varialion and ils seen
inpossilIe lo run aII paranelers, ve shouId provide engineers a CAD looI lo ollain lhe
anaIyzed aIgorilhn and resuIls. In Iig. 5, a designer can use a CAD vilh Windovs inlerface lo
find leller paranelers of S-lox. In lhis CAD, il provides lhe conpIexily infornalion of lhe
nuIlipIiers or lhe inverse in ) 2 (
8
G| . In lhis figure, designer chooses lhe fourlh resuIl, and lhe
eslinalive conpIexily of inverse is shovn in lhe lop righl of lhe figure, lhal lhe choice of
nuIlipIier is shovn under lhe inverse infornalion. Therefore, lhe CAD looI heIps designer lo
choose lhe leller paranelers efficienlIy.

6. Summary

In lhis chapler, ve inlroduce lhe connon concepls of finile fieId regarding ils appIicalions in
error correcling coding, cryplography, and olhers, incIuding lhe nalhenalicaI lackground,
sone inporlanl designs for nuIlipIier and inversion, and lhe idea lo uliIize lhe conpuler-
aided funclions lo find leller paranelers for II or syslen design. The sunnary of lhis chapler
is as foIIovs:
1. Inlroducing lhe lasic finile fieId operalions and lheir hardvare designs: Those connon
operalions incIude addilion, nuIlipIicalion, squaring, inversion, lasis lransfornalion, and so
on. The VLSI designs of lhose operalions nay le underslood lhrough lhe nalhenalicaI
lackground provided in Seclion 2. Iron lhe nalhenalicaI lackground, one shouId reaIize lhe
lenefils and lhe processes of lhe lransfornalion lelveen lvo isonorphic finile fieIds.
2. Using sone lechniques lo sinpIify lhe circuils: We have inlroduced sone usefuI lechniques
lo reduce lhe area cosl in VLSI design, such as lhe resource-sharing nelhod, uliIizalion of
differenl paranelers, or use sone isonorphic fieId lo sulslilule lhe used fieId. The firsl
lechnique is videIy used in various design slages. The Ialer lvo lechniques depend on lhe
paranelers used. Differenl paranelers Iead lo differenl hardvare inpIenenlalion resuIls.
Hovever, il seens infeasilIe lo anaIyze aII possilIe paranelers nanuaIIy.
3. Using lhe conposile fieId inversion: Conposile fieId inversion is used in lhe finile fieId
inversion due lo ils superiorily in hardvare inpIenenlalion. The nain idea is lo consider lhe
use of inlernediale fieIds and deconpose lhe inversion on lhe originaI fieId inlo severaI
operalions on snaIIer fieIds. This nelhod has leen used in lhe ALS S-lox design lo nininize
lhe area cosl.
4. CaIcuIaling lhe lransfornalion nalrices lelveen isonorphic finile fieIds. Il is veII knovn
lhal finile fieIds of lhe sane order are isonorphic, and lhis inpIies lhe exislence of
lransfornalion nalrices. Iinding lhe oplinaI one is inporlanl in lhe invesligalion of lhe VLSI
designs. Tvo nelhods are presenled, one is lo change lhe reIalion poIynoniaI, and lhe olher is
lo use lhe conposile fieId. An aIgorilhn lo caIcuIale lhe lransfornalion nalrices is provided in
Seclion 5, and il can le used lo find lhe oplinaI one.
5. Using lhe conpuler-aided design lo search for leller paranelers: A good hardvare CAD
looI provides fasl search and enough infornalion for designer, lecause il lrings fasl and
accurale designs. In Seclion 5, lhe conpuler-aided funclion lased on lhe proposed aIgorilhns
is one of lhe exanpIes. When lhe order of lhe finile fieId gels Iarge, lhe nunler of isonorphic
fieId increases rapidIy. This nakes il aInosl inpossilIe lo do lhe exhausling search, and lhe
proposed CAD can le used lo supporl engineers lo gel lhe lesl choices.
VLS 128

7. ConcIusion

In lhis chapler, ve use lhe concepl of conposile fieIds for lhe CAD designs, vhich can
supporl lhe VLSI designer lo caIcuIale lhe oplinaI paranelers for finile fieId inversion.

8. AcknowIedgments

This vork is supporled in parl ly lhe Nalion Science CounciI, Taivan, under granl NSC 96-
2623-7-214-OO1.

9. References

Dinh, A.V., IaIner, R.}., oIlon, R.}. & Mason, R. (2OO1). A Iov Ialency archileclure for
conpuling nuIlipIicalive inverses and divisions in CI(2
m
). |||| Transac|icns cn
Circui|s an Sqs|cms ||. Ana|cg an Digi|a| Signa| Prcccssing, VoI. 48, No. 8, pp. 789-
793, ISSN: 1O57-713O
Ienn, S.T.}., enaissa, M. & TayIor, D. (1996). Iasl nornaI lasis inversion in CI(2
m
).
||cc|rcnics |c||crs, VoI. 32, No. 17, pp. 1566-1567, ISSN: OO13-5194
Hsiao, S.-I., Chen, M.-C. Chen & Tu, C.-S. (2OO6). Menory-free Iov-cosl designs of
advanced encryplion slandard using connon sulexpression eIininalion for
sulfunclions in lransfornalions. |||| Transac|icns cn Circui|s an Sqs|cms |. Rcgu|ar
Papcrs, VoI. 53, No. 3, pp. 615-626, ISSN: 1549-8328
Iloh, T. & Tsujii, S. (1988). A fasl aIgorilhn for conpuling nuIlipIicalive inverses in CI(2
m
)
using nornaI lasis. |nfcrma|icn an Ccmpu|ing, VoI. 78, No. 3, pp. 171-177, ISSN:
O89O-54O1
}ing, M.-H., Chen, Z.-H., Chen, }.-H. & Chen, Y.-H. (2OO7). ReconfiguralIe syslen for high-
speed and diversified ALS using IICA. Micrcprcccsscrs an Micrcsqs|cms, VoI. 31,
No. 2, pp. 94-1O2, ISSN: O141-9331
LidI, R. & Niederreiler, H. (1986). |n|rcuc|icn |c fini|c fic|s an |ncir app|ica|icns, Canlridge
Universily Iress, ISN: 978O52146O941
Lu, C.-C. (1997). A search of nininaI key funclions for nornaI lasis nuIlipIiers. ||||
Transac|icns cn Ccmpu|crs, VoI. 46, No. 5, pp.588-592, ISSN: OO18-934O
Morioka, S. & Saloh, A. (2OO3). An oplinized S-lox circuil archileclure for Iov pover ALS
design. Rctisc Papcrs frcm |nc 4|n |n|crna|icna| Icr|sncp cn Crqp|cgrapnic Haruarc
an |m|cc Sqs|cms, |cc|urc Nc|cs in Ccmpu|cr Scicncc, VoI. 2523, pp. 172-186,
ISN: 3-54O-OO4O9-2, Augusl, 2OO2, Redvood Shores, CaIifornia, USA
Ielerson, W.W. & rovn, D.T. (1961). CycIic Codes for Lrror Deleclion, Prccccings cf |nc
|R|, VoI. 49, No. 1, pp. 228-235, ISSN: OO96-839O
Ielerson, W.W. & WeIdon, L.}. (1972). |rrcr-Ccrrcc|ing Cccs, The MIT Iress, 2 edilion,
Canlridge, MA, ISN: 354OOO4O92
Sunar, ., Savas, L. & Koc, C.K., (2OO3). Conslrucling conposile fieId represenlalions for
efficienl conversion. |||| Transac|icns cn Ccmpu|cr, VoI. 52, No. 11, pp. 1391-1398,
ISSN: OO18-934O
Wang, C.C., Truong, T.K., Shao, H.M., Deulsch, L.}., Onura, }.K., & Reed, I.S. (1985). VLSI
archileclure for conpuling nuIlipIicalions and inverses in CI(2
m
). ||||
Transac|icns cn Ccmpu|crs, VoI. 34, No. 8, pp. 7O9-716, ISSN: OO18-934O
The Design of P Cores in Finite Field for Error Correction 129

Wicker, S.. (1995). |rrcr Ccn|rc| Sqs|cms fcr Digi|a| Ccmmunica|icn an S|cragc, Irenlice HaII,
ISN: O-13-3O8941-X, US


VLS 130
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 131
X

ScaIabIe and SystoIic Gaussian NormaI Basis
MuItipIiers over GF(2
m
) Using HankeI
Matrix-Vector Representation

Chiou-Yng Lee
|ungnua Unitcrsi|q cf Scicncc an Tccnnc|cgq
Tacquan Ccun|q, Taiuan

1. Introduction

Lfficienl design and inpIenenlalion of finile fieId nuIlipIiers have received high allenlion
in recenl years lecause of lheir appIicalions in eIIiplic curve cryplography (LCC) and error
conlroI coding (Denning, 1983, Rhee, 1994, Menezes, Oorschol & Vanslone, 1997). AIlhough
channeI codes and cryplographic aIgorilhns lolh nake use of lhe finile fieId CI(2
m
), lhe
fieId orders needed differ dranalicaIIy: channeI codes are lypicaIIy reslricled lo arilhnelic
vilh fieId eIenenls vhich are represenled ly up lo eighl lils, vhereas LCC reIy on fieId
sizes of severaI hundred lils. The najorily of pulIicalions concenlrale on finile fieId
archileclures for reIaliveIy snaII fieIds suilalIe for inpIenenlalion of channeI codes. In
finile fieId CI(2
m
), nuIlipIicalion is one of lhe nosl inporlanl and line-consuning
conpulalions. Since cryplographic appIicalions (Menezes, Oorschol & Vanslone, 1997) are
lhe Diffie-HeIInan key exchange aIgorilhn lased on lhe discrele exponenlialion over
CI(2
m
), lhe nelhods of conpuling exponenlialion over CI(2
m
) lased on Iernals lheoren
are perforned ly lhe repealed nuIlipIy-square aIgorilhn. Therefore, lo provide lhe high
perfornance of lhe securily funclion, lhe efficienl design of high-speed aIgorilhns and
hardvare archileclures for conpuling nuIlipIicalion is required and considered.
There are lhree popuIar lasis represenlalions, lerned poIynoniaI lasis (I), nornaI lasis
(N), and duaI lasis (D). Lach lasis represenlalion has ils ovn advanlages. The nornaI
lasis nuIlipIicalion is generaIIy seIecled for cryplography appIicalions, lecause lhe
squaring of lhe eIenenl in CI(2
m
) is sinpIy lhe righl cycIic shifl of ils coordinales. N
nuIlipIicalion depended lhe seIeclion of key funclion is discovered ly Massey and Onura
(1986). Ior lhe eIIiplic curve digilaI signalure aIgorilhn (LCDSA) in ILLL Slandard I1363
(2OOO) and NalionaI Inslilule of Slandards and TechnoIogy (NIST) (2OOO), Caussian nornaI
lasis (CN) is defined lo inpIenenl lhe fieId arilhnelic operalion. The CN is a speciaI
cIass of nornaI lasis, vhich exisls for every posilive inleger m nol divisilIe ly eighl. The
CN for CI(2
m
) is delernined ly an inleger l, and is caIIed lhe lype-| Caussian nornaI
lasis. Hovever, lhe conpIexily of a lype-| CN nuIlipIier is proporlionaI lo | (Reyhani-
MasoIeh, 2OO6), snaII vaIues of | are generaIIy chosen lo ensure lhal lhe fieId nuIlipIicalion
is inpIenenled efficienlIy.
7
VLS 132

Anong various finile fieId nuIlipIiers are cIassified eilher as a paraIIeI or seriaI
archileclures. il-seriaI nuIlipIiers (Reyhani-MasoIeh & Hasan, 2OO5, Lee & Chang, 2OO4)
require Iess area, lul are sIov lhal is laken ly n cIock cycIes lo carry oul lhe nuIlipIicalion
of lvo eIenenls. ConverseIy, lil-paraIIeI nuIlipIiers (Lee, Lu & Lee, 2OO1, Hasan, Wang &
hargava, 1993, Kvon, 2OO3, Lee & Chiou, 2OO5) lend lo le fasler, lul have higher hardvare
cosls. RecenlIy, various nuIlipIiers (Lee
1
, 2OO3, Lee, Horng & }ou, 2OO5, Lee, 2OO5, Lee
2
, 2OO3)
focus on lil-paraIIeI archileclures vilh oplinaI gale counl. Hovever, previousIy nenlioned
lil-paraIIeI nuIlipIiers shov a conpulalionaI conpIexily of O(m
2
) operalions in CI(2),
canonicaI and duaI lasis archileclures are Iover lounded ly n
2
nuIlipIicalions and k
2
- 1
addilions, nornaI lasis ones ly |
2
nuIlipIicalions and 2|
2
- | addilions. A nuIlipIicalion in
CI(2) can le reaIized ly a lvo-inpul AND gale and an adder ly a lvo-inpul XOR gale. Ior
lhis reason, il is allraclive lo provide archileclures vilh Iov conpulalionaI conpIexily for
efficienl hardvare inpIenenlalions.
There have digil-seriaI/scaIalIe nuIlipIier archileclures lo enhance lhe lrade-off lelveen
lhroughpul perfornance and hardvare conpIexily. The scaIalIe archileclure needs lolh lhe
eIenenl A and 8 are separaled inlo n=|m/j sul-vord dala, vhiIe lhe digil-seriaI
archileclure onIy requires one of lhe eIenenl lo separale sul-vord dala. olh archileclures
are used ly lhe fealure of lhe scaIaliIily lo handIe groving anounls of vork in a gracefuI
nanner, or lo le readiIy enIarged. In (Tenca & Koc, 1999), a unil is considered scaIalIe
defined lhal lhe unil can le reused or repIicaled in order lo generale Iong-precision resuIls
independenlIy of lhe dala palh precision for vhich lhe unil vas originaIIy designed.
Various digil-seriaI nuIlipIiers are recenlIy deveIoped in (Iaar, IIeischnann & Soria-
Rodriguez, 1999, Kin & Yoo, 2OO5, Kin, Hong and Kvon, 2OO5, Cuo & Wang, 1998, Song &
Iarhi, 1998, Reyhani-MasoIeh & Hasan, 2OO2). Song and Iarhi (1998) proposed MSD-firsl
and LSD-firsl digil-seriaI I nuIlipIiers using Horners ruIe schene. Ior parlilioning lhe
slruclure of lvo-dinension arrays, efficienl digil-seriaI I nuIlipIiers are found in (Kin &
Yoo, 2OO5, Kin, Hong & Kvon, 2OO5, Cuo & Wang, 1998). The najor fealure of lhese
archileclures is conlined vilh lolh seriaI and paraIIeI aIgorilhns.
Ior Iarge vord Ienglhs connonIy found in cryplography, lhe lil-seriaI approach is ralher
sIov, vhiIe lil-paraIIeI reaIizalion requires Iarge circuil area and pover consunplion. In
eIIiplic curve cryplosyslens, il slrongIy depends on lhe inpIenenlalion of finile fieId
arilhnelic. y enpIoying lhe HankeI nalrix-veclor represenlalion, lhe nev CN
nuIlipIicalion aIgorilhn over CI(2
m
) is presenled. UliIizing lhe lasic characlerislics of MSD-
firsl and LSD-firsl schenes (Song & Iarhi, 1998), il is shovn lhal lhe proposed CN
nuIlipIicalion can le deconposed inlo n(n+1) HankeI nalrix-veclor nuIlipIicalions. The
proposed scaIalIe CN nuIlipIiers are incIuding one HankeI nuIlipIier, lvo regislers
and one finaI reduclion poIynoniaI circuil. The resuIls reveaI lhal, if lhe seIecled digilaI size
is > 4 lils, lhe proposed archileclure has Iess line-space conpIexily lhan lradilionaI digil-
seriaI sysloIic nuIlipIiers, and can lhus ollain an oplinun archileclure for CN nuIlipIier
over CI(2
m
). To furlher saving lolh line and space conpIexilies, lhe proposed scaIar
nuIlipIicalion aIgorilhn vilh HankeI nalrix-veclor represenlalion can aIso le reaIized ly a
scaIalIe and sysloIic archileclure for poIynoniaI lasis and duaI lasis of CI(2
m
).
The resl of lhis paper is slruclured as foIIovs. Seclion 2 lriefIy revievs a convenlionaI N
nuIlipIicalion aIgorilhn and a HankeI nalrix-veclor nuIlipIicalion. Seclion 3 proposes lhe
lvo CN nuIlipIiers, lased on HankeI nalrix-veclor represenlalion, lo yieId a scaIalIe and
sysloIic archileclure. Seclion 4 inlroduces lhe nodified CN nuIlipIier. Seclion 5 anaIyzes
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 133

our proposed CN nuIlipIier in lhe lern of lhe line-area conpIexily. IinaIIy, concIusions
are dravn in Seclion 6.

2. PreIiminaries

2.1 Gaussian normaI basis muItipIication
The finile fieId CI(2
m
) is veII knovn lo le vievalIe as a veclor space of dinension m over
CI(2). A sel } , , ,
1
2 2

=
m
N o o o is caIIed lhe nornaI lasis of CI(2
m
), and o is caIIed lhe
nornaI eIenenl of CI(2
m
). Lel any eIenenl AeCI(2
n
) can le represenled as
) , , , (
1 1 0
2
1
0

=
= =
_
m
i
m
i
i
a a a a A o (1)
vhere a
i
eCI(2), O < i < m1, denoles lhe ilh coordinale of A. In hardvare inpIenenlalion,
lhe squaring eIenenl is perforned ly a cycIic shifl of ils linary represenlalion. The
nuIlipIicalion of eIenenls in CI(2
m
) is uniqueIy delernined ly lhe n cross producls
1
2 2
0
i f
m
if
f
oo o

=
=
_
,
ij
eCI(2). M=|
ij
is caIIed a nuIlipIicalion nalrix. Lel A=(a
0
, a
1
,., a
m1
)
and 8=(|
0
, |
1
,., |
m1
) indicale lvo nornaI lasis eIenenls in CI(2
m
), and C=(c
0
, c
1
,.,
c
m1
)eCI(2
m
) represenl lheir producl, i.e., C=A8. Coordinale ci of C can lhen le represenled
ly

T i i
i
B A c ) (
) ( ) (
M = (2)
vhere A
(i)
denoles a righl cycIic shifl of lhe eIenenl A ly i posilions. To conpule lhe
nuIlipIicalion nalrix M, one can see in (ILLL Slandard I1363, 2OOO, Reyhani-MasoIeh &
Hasan, 2OO3). When lhe nuIlipIicalion nalrix M is found, lhe N nuIlipIicalion aIgorilhn is
descriled as foIIovs:
A!gnrIthm 1: (N nuIlipIicalion) (ILLL Slandard I1363, 2OOO)
Inpul: A=(a
0
, a
1
,., a
m1
) and 8=(|
0
, |
1
,., |
m1
) eCI(2
n
)
Oulpul: C=(c
0
, c
1
,., c
m1
)=A8
1. iniliaI: C=O
2. for i = O lo m 1 |
3.
T
i
B A c M =
4. A=A
(1)
and 8=8
(1)

5.
6. oulpul C=(c
0
, c
1
,., c
m1
)
AppIying AIgorilhn 1, Massey and Onura (1986) firsl proposed lil-seriaI N nuIlipIier.
The conpIexily of lhe nornaI lasis N, represenled ly C
N
, is lhe nunler of nonzero
ij

vaIues in M, and delernines lhe gale counl and line deIay of lhe N nuIlipIier. Il is shovn
in (MuIIin, Onyszchuk, Vanslone & WiIson, 1988/1989) lhal C
N
for any nornaI lasis of
CI(2
n
) is grealer or equaI lo 2m1. In order lo an efficienl and sinpIe inpIenenlalion, a
nornaI lasis is chosen lhal C
N
is as snaII as possilIe. Tvo lypes of an oplinaI nornaI lasis
(ON), lype-1 and lype-2, exisl in CI(2
m
) if C
N
=2m1. Hovever, such ONs do nol exisl for
aII m.
DcfInItInn 1. Lel p=m|+1 represenl a prine nunler and gc(m|/|,m)=1, vhere k denoles lhe
nuIlipIicalive order of 2 noduIe p. Lel le a prinilive p rool of unily. The lype-| Caussian
VLS 134

nornaI lasis (CN) is enpIoyed ly
m t m m ) 1 (
2
2
2 2
...

+ + + + = o lo generale a nornaI lasis N


for CI(2
m
) over CI(2).
SignificanlIy, CNs exisl for CI(2
m
) vhenever m is nol divisilIe ly 8. y adopling Definilion
1, each eIenenl A of CI(2
m
) can aIso le given as

) ( ) (
) (
1
2
1 2
2
1
2
1
1 ) 1 (
2
1
2 2
1
) 1 (
2 2
0
1
2
1
2
1 0

+ +

+ + + + + + + + +
+ + + = + + + =
mt m m
m
t m m
t m m m
m
a a
a a a a A

o o o


(3)
Iron Lqualion (3), lhe lype-| CN can le represenled ly lhe sel
} , , , , , , , , , , , ,
1
2
1 ) 1 (
2
) 1 (
2
1 2
2
1
2 2
1
2 2
+ + mt t m t m m m m m
. Since is a prinilive p
rool of unily, ve have

=
=
=
+
1 1
1
1
p i if
p i if
i
i

. (4)
Thus, lhe CN can lhen aIlernale lo represenl lhe redundanl lasis R=;,
2
,.,
p-1
). The fieId
eIenenl A can le aIlernaled lo define lhe foIIoving fornuIa:

1
) 1 (
2
) 2 ( ) 1 (

+ + + =
p
p F F F
a a a A (5)
vhere
|(2
i
2
mj
mc p)=i for O < i < m1 and O < j < |1.
Examp!c 1. Ior exanpIe, Lel A=(a
0
,a
1
,a
2
,a
3
,a
4
) le lhe N eIenenl of CI(2
5
), and Iel o=+
1O
le
used lo generale lhe N. AppIying lhe redundanl represenlalion, lhe fieId eIenenl A can le
represenled ly A=a
0
+a
1

2
+a
3

3
+a
2

4
+a
4

5
+a
4

6
+a
2

7
+a
3

8
+a
1

9
+a
0

10
.
Olserving lhis represenlalion, lhe coefficienls of A are dupIicaled ly l-lern coefficienls of
lhe originaI nornaI lasis eIenenl A=(a
0
, a
1
,., a
m1
) if lhe fieId eIenenl A presenls a lype-|
nornaI lasis of CI(2
m
). Thus, ly using lhe funclion |(2
i
2
mj
mc p)=i, lhe fieId eIenenl
) , , , (
) 1 ( ) 2 ( ) 1 (
=
p F F F
a a a A vilh lhe redundanl represenlalion can le lransIaled inlo lhe
foIIoving represenlalion, ) , , , , , , , , , , (
1 1 2 2 1 1 0 0
=
m m
t
a a a a a a a a A
_
. Therefore, lhe
redundanl lasis is easy converled inlo lhe nornaI lasis eIenenl, and is vilhoul exlra
hardvare inpIenenlalions.
Lel A=(a
0
, a
1
,., a
m1
) and 8=(|
0
, |
1
,., |
m1
) indicale lvo nornaI lasis eIenenls in CI(2
m
),
and C=(c
0
, c
1
,., c
m1
) represenl lheir producl, i.e., C=A8. Coordinale ci of C can lhen le
caIcuIaled as in lhe foIIoving fornuIa: (ILLL Slandard I1363, 2OOO)

_

=
+
= =
2
1
) ( ) 1 ( 0
) , (
p
f
f p F f F
b a B A G c (6)
According lo AIgorilhn 1, lhe CN nuIlipIicalion aIgorilhn for even l is nodified as
foIIovs.
A!gnrIthm 2: (CN nuIlipIicalion for l even) (ILLL Slandard I1363, 2OOO)
Inpul: A=(a
0
,a
1
,.,a
m1
) and 8=(|
0
,|
1
,.,|
m1
)eCI(2
m
)
Oulpul: C=(c
0
,c
1
,.,c
m1
)=A8
1. iniliaI : C=O
2. for |=O lo m1 |
3. c
|
= G(A,8)
4. A=A
(1)
and 8=8
(1)

Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 135

5.
6. Oulpul C=(c
0
,c
1
,.,c
m1
)

2.2 Bit-paraIIeI systoIic HankeI muItipIier
This seclion lriefIy inlroduces a lil-paraIIeI sysloIic HankeI nuIlipIier in (Lee & Chiou,
2OO5).
DcfInItInn 2. An mm nalrix H is knovn as lhe HankeI nalrix if il salisfies lhe reIalion
H(i,j)=H(i+1,j1), for O s i s m2 and 1 s j < m1, vhere H(i,j) represenls lhe eIenenl in lhe
inlerseclion of rov i and coIunn j, such lhal
.
2 2 3 2 4 2 1
3 2 4 2 1 2
1 3 2
1 2 1
1 2 3 1 0
(
(
(
(
(
(
(
(

=


+


m m m m m
m m m m
m
m m
m m m
h h h h h
h h h h
h h h
h h h h
h h h h h
.
. .
. . . . . .
. . .
. .

H
A HankeI nalrix H is enlireIy delernined ly ils Iasl coIunn and firsl rov, and lhus
depends on having 2m1 paranelers, i.e., n=(n
0
, n
1
,., n
m2
, n
m1
, n
m
,., n
2m3
, n
2m2
). The
enlries of H are conslanl dovn lhe diagonaIs paraIIeI lo lhe nain anli-diagonaI. Lel C=n8 le
lhe producl of n and 8, vhere 8=(|
0
,|
1
,.,|
m1
). The coordinale c
i
of C is given ly

f
m
f
i f i
b h c
_

=
+
=
1
0
(7)
DcfInItInn 3. Lel i,j,m le lhese posilive inlegers vilh 0 s i,j s m1, one can define lhe
foIIoving funclion o(i,j):

= + >
+ +
<
= + >

<
=
oaa f i for
i f
even f i for
i f
f i
2
1
2
) , ( o
vhere <q> denoles q nod m.
Lel i denole a fixed inleger in lhe conpIele sel |1,2,.,m1, one verifies lhal lhe nap
|=o(i,j). Ior inslance, TalIe 1 presenls lhe reIalionship lelveen j and | vilh |=o(i,j), vhere
m=7, i=2, and 0 s j s6. Therefore, ly sulsliluling j=o(i,j) inlo Lqualion (6), lhe producl C can
le denoled as

_ _

=
+
|
|
.
|

\
|
=
1
0
1
0
) , ( ) , (
m
i
i
m
f
f i i f i
b h C
o o
(8)

j O 1 2 3 4 5 6
k=o(i,j) 6 5 O 4 1 3 2
TalIe 1. The reIalionship lelveen j and k for i=2

VLS 136

Examp!c 2: Lel A=(a
0
, a
1
, a
2
, a
3
, a
4
) and n =(n
0
, n
1
, n
2
, n
3
, n
4
, n
5
, n
6
, n
7
, n
8
) represenl lvo veclors
and H represenls lhe HankeI nalrix defined ly n, and Iel C =(c
0
, c
1
, c
2
, c
3
, c
4
) le lhe producl
of nA. y appIying Lqualion (8), lhe producl can le derived as foIIovs.
4 3 2 1 0
4 0 4 1 3 1 3 2 2 2
5 1 3 0 4 2 2 1 3 3
8 4 5 2 2 0 4 3 1 1
6 2 7 4 5 3 1 0 4 4
7 3 6 3 6 4 5 4 0 0
4 3 2
1
c c c c c
h a h a h a h a h a
h a h a h a h a h a
h a h a h a h a h a
h a h a h a h a h a
h a h a h a h a h a
x x x x
+

Iigure 1 depicls lhe lil-paraIIeI sysloIic HankeI nuIlipIier given ly LxanpIe 1. The
nuIlipIier conprises 25 ceIIs, incIuding 14 U-ceIIs and 11 V-ceIIs. Lvery U
i,j
ceII (Iigure 2(a))
conlains one 2-inpul AND gale, one 2-inpul XOR gale and lhree 1-lil Ialches lo reaIize
c
i
=c
i
+a
o(i,j)
n
o(i,j)+i
. Lach V
i,j
ceII (Iigure 2(l)) is forned ly one 2-inpul AND gale, one 2-inpul
XOR gale and four 1-lil Ialches lo reaIize lhe operalion of c
i
=c
i
+a
o(i,j)
n
o(i,j)+i
, or
c
i
=c
i
+a
o(i,j)
n
o(i,j)+i+m
. As slaled in lhe alove ceII operalions, lhe Ialency needs n cIock cycIes,
and lhe conpulalion line per ceII is needed ly one 2-inpul AND gale and one 2-inpul XOR
gale.


v

u

u

v

u

v

v
u

v

v

u

v

v
u

c
0
c
1 c
2
c
3
c
4
a
0
h
0
h
5
a
4
h
1
h
6
a
3
h
2
h
7
h
4
a1
h
3
a
2
a
2
h3 h8
a
1
h
4
0 0 0
0
0
v

u
v

u


Iig. 1. The lil-paraIIeI sysloIic HankeI nuIlipIier |1Oj
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 137


(a) (b)

<-(LM)/2>

oo

o

Iig. 2. (a) The delaiIed circuil of U-ceII, (l) The delaiIed circuil of V-ceII

3. Proposed scaIabIe and systoIic GNB muItipIier architectures

Lel
i
p
i
i F
a A
_

=
=
1
0
) (
and
i
p
i
i F
b B
_

=
=
1
0
) (
vilh a
|(0)
=|
|(0)
=0 and a
|(i)
, |
|(i)
eCI(2) for 1s i s p1
denole lvo lype-| CN eIenenls in CI(2
m
), vhere represenls lhe rool of x
p
+1. Assune lhal
lhe chosen digilaI size is a -lil, and n=p/(, lolh eIenenls A and 8 can aIso le expressed as
foIIovs.
_

=
=
1
0
n
i
ai
i
A A ,
_

=
=
1
0
n
i
ai
i
B B
vhere
_

=
+
=
1
0
) (
a
f
f
f ai F i
a A ,
_

=
+
=
1
0
) (
a
f
f
f ai F i
b B .
ased on lhe parliaI nuIlipIicalion for delernining A8
0
, lhe parliaI producl can le denoled
ly
A8
0
= A
0
8
0
+ A
1
8
0

+.+ A
n1
8
0

(n1)
(9)
Lach lern A
i
8
0
of degree 22 is lhe core conpulalion of Lqualion (9). In a generaI
nuIlipIicalion, Iel us define lhal AiO is forned ly
A
i
8
0
=S
i
+D
i

, for 0 s i s n1. (1O)


vhere
S
i
=s
i,0
+s
i,1
+...+s
i,1

1

D
i
=
i,0
+
i,1
+...+
i,1

2

s
i,j
=
) (
0
) ( k f F
f
k
k ia F
b a

=
+
_
, for 0 s j s 1
d
i,j
=
) (
1
1
) ( k f a F
a
f k
k ia F
b a
+

+ =
+
_
, for 0 s j s 2
Therefore, Lqualion (9) can le re-expressed as
A8
0
=( S
0
+D
0

)+ ( S
1
+D
1

+...+ ( S
n1
+D
n1

)
(n1)

VLS 138

= S
0
+( S
1
+D
0
)

+...+ ( S
n1
+D
n2
)
(n1)
+ D
n1

n

= C
0
+C
1

+...+ C
n1

(n1)
+ C
n

n
(11)
vhere
C
0
= S
0
,
C
i
= S
i
+D
i1
, for 1 s i s n1
C
n
= D
n1

S
i
=(s
i,0
, s
i,1
,..., s
i,1
) in Lqualion (1O) can le lransIaled vilh lhe foIIoving nalrix-veclor

(
(
(
(
(
(

(
(
(
(
(
(

=
(
(
(
(
(
(

+ + +
+ +
+

) 0 (
) 1 (
) 2 (
) 1 (
) 1 ( ) 2 ( ) 1 ( ) (
) 2 ( ) 3 ( ) (
) 1 ( ) (
) (
1 ,
2 ,
1 ,
0 ,
0
0 0
0 0 0
F
F
a F
a F
a ia F a ia F ia F ia F
a ia F a ia F ia F
ia F ia F
ia F
a i
a i
i
i
b
b
b
b
a a a a
a a a
a a
a
s
s
s
s
.
.
.
. . . . .
.
.
.
(12)
SiniIarIy, D
i1
=(
i1,0
,
i1,1
,...,
i1,2
, 0) can aIso le lransIaled vilh lhe foIIoving nalrix-
veclor

(
(
(
(
(
(

(
(
(
(
(
(

=
(
(
(
(
(
(

+
+

) 0 (
) 1 (
) 2 (
) 1 (
) 1 (
) 1 ( ) 2 ) 1 ( (
) 1 ( ) 2 ( ) 1 ) 1 ( (
2 , 1
1 , 1
0 , 1
0 0
0 0
0
0
F
F
a F
a F
ia F
ia F i a F
ia F ia F i a F
a i
i
i
b
b
b
b
a
a a
a a a
a
a
a
.
. . .
. . . .
. . . . .
.
.
.
(13)
According lo Lqualions (12) and (13), C
i
=S
i
+D
i1
in Lqualion (11) can ollain

0
) 0 (
) 2 (
) 1 (
) 1 ( ) 1 ( ) (
) 1 ( ) 3 ) 1 ( ( ) 2 ) 1 ( (
) ( ) 2 ) 1 ( ( ) 1 ) 1 ( (
) 1 ) 1 ( (
) 1 (
) (
B
b
b
b
a a a
a a a
a a a
c
c
c
i n
F
a F
a F
a ia F ai F ai F
ia F i a F i a F
ia F i a F i a F
i a F
ai F
ai F

+ +
+ + +
+ +
+
+
=
(
(
(
(
(

(
(
(
(
(

=
(
(
(
(
(

H
.

. . . .

.
(14)
Iron lhe alove nalrix-veclor nuIlipIicalion, lhe HankeI veclor n
ni
=(a
|((i1)+1)
, a
|((i1)+2)
,.,
a
|(i)
, a
|(i+1)
,., a
|((i+1)1)
) is defined ly lhe HankeI nalrix H
ni
. Hence, lhe conpulalion
A8
0
using lhe HankeI nalrix-veclor represenlalion can le conpuled as foIIovs:
A8
0
=n
n
8
0
+ n
n1
8
0

+.+n
0
8
0

n
(15)
Therefore, A8
0
can le disnenlered inlo (n+1) HankeI nuIlipIicalions, and can le
perforned ly lhe foIIoving aIgorilhn:
A!gnrIthm 3: PM(A,8
0
)
Inpul:
i
p
i
i F
a A
_

=
=
1
0
) (
and
i
a
i
i F
b B
_

=
=
1
0
) ( 0

Oulpul:
i
p
i
i F
c C
_

=
=
1
0
) (
=A8
0
nod (
p
+1)
1. converl lhe HankeI veclor n
ni
=(a
|((i1)+1)
, a
|((i1)+2)
,., a
|((i+1)1)
), for 0s i s n, fron A.
2. C=(c
|(0)
, c
|(1)
,., c
|(p1)
)=(0,.,0)
3. for i=O lo n |
4. X
i
=n
ni
8
0

5.
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 139

6. C= X mc (
p
+1)
7. relurn C=(c
|(0)
, c
|(1)
,., c
|(p1)
)
AIgorilhn 3 for delernining A8
0
incIudes lvo core operalions, naneIy lhe HankeI
nuIlipIicalion and lhe reduclion poIynoniaI
p
+1, as iIIuslraled in Iigure 3. The proposed
parliaI nuIlipIier archileclure in Iigure 3 can le caIcuIaled using lhe foIIoving procedure.
Slep 1: Iron Lqualion (14), ve can see lhal HankeI nalrix H
ni
is defined ly aII coefficienls
in A. Here A=(a
|(0)
, a
|(1)
,., a
|(p1)
) is firslIy converled lo lhe HankeI veclor n
ni
=(a
|((i1)+1)
,
a
|((i1)+2)
,., a
|((i+1)1)
) and ils resuIl is slored in lhe regisler H
ni
.
Slep 2: AppIying lhe lil-paraIIeI sysloIic HankeI nuIlipIier as shovn in Iigure1, Iigure 3
shovs A8
0
conpulalion. Lach HankeI nuIlipIicalions in Slep 4 of AIgorilhn 3, lhe resuIl of
A8
0
is slored in lhe regisler X.
Slep 3: Afler (n+1) HankeI nuIlipIicalions, lhe resuIl needs lo perforn lhe reduclion
poIynoniaI
p
+1.
CeneraIIy, lhe conpulalion of A8
i
for 0sisn-1 can le ollained ly lhe foIIoving fornuIa:
A8
i
=n
n
8
i
+ n
n1
8
i

+.+ n
0
8
i

n
(16)
The alove equalion indicales lhal each A8
i
conpulalion can le disnenlered inlo (n+1)
HankeI nuIlipIicalions. As nenlioned alove, lvo CN nuIlipIiers are descriled in lhe
foIIoving sulseclions.

8
P

,

,
y
y


Iig. 3. The proposed scaIalIe and sysloIic archileclure for conpuling AO

3.1 LSD-first scaIabIe systoIic muItipIier
The producl C= A8 mc (
p
+1) using LSD-firsl nuIlipIicalion aIgorilhn can le represenled
as
C=A8
0
mc (
p
+1) +A8
1

mc (
p
+1)+.+ A8
n1

(n1)
mc (
p
+1)
= A
(0)
8
0
+A
()
8
1
+.+ A
((n1))
8
n1
(17)
vhere
f
p
f
ia f F
p a i a p ia ia
a A A A
_

= + = + =
1
0
) (
)) 1 ( ( ) (
) 1 mod( ) 1 mod( .
AppIying Lqualions (16) and (17), lhe proposed LSD-firsl scaIalIe sysloIic CN
nuIlipIicalion is addressed as foIIovs:
VLS 140

A!gnrIthm 4. (LSDCN scaIalIe nuIlipIicalion)
Inpul: ) , , , (
1 1 0
=
m
a a a A and ) , , , (
1 1 0
=
m
b b b B are lvo nornaI lasis eIenenls in CI(2
m
)
Oulpul: C= ) , , , (
1 1 0 m
c c c =A8
1. IniliaI slep:
1.1 ) , , , ( ) , , , (
1 1 0 ) 1 ( ) 1 ( ) 0 (
=
m p F F F
a a a a a a A
1.2 ) , , , ( ) , , , (
1 1 0 ) 1 ( ) 1 ( ) 0 (
=
m p F F F
b b b b b b B
1.3
_

=
=
1
0
n
i
ai
i
B B , vhere
f
a
f
f ia F i
b B
_

=
+
=
1
0
) (
and n=p/(
1.4 C=O
2. MuIlipIicalion slep:
2.1 for i=O lo n1 do
2.2 C=C +PM(A,8
i
) (vhere PM(A,8
i
) as referred lo AIgorilhn 3)
2.3 A=A

mc (
p
+1)
2.4 endfor
3. asis conversion slep:
3.1 ) , , , ( ) , , , , , , (
) 1 ( ) 1 ( ) 0 ( 1 1 0 0
=
p F F F m m
t
c c c c c c c C
_

4. Relurn ) , , , (
1 1 0 m
c c c
The proposed LSDCN scaIalIe nuIlipIicalion aIgorilhn is spIil inlo n-Ioop parliaI
nuIlipIicalions. Iigure 4 depicls lhe LSDCN nuIlipIier lased on lhe proposed parliaI
nuIlipIier in Iig.3. olh N eIenenls A and 8 are iniliaIIy lransforned inlo lhe redundanl
lasis given ly Lqualion (4), and are slored in lolh regislers A and 8, respecliveIy. In round
O (see Iigure 4), lhe sysloIic array in Iigure 3 is adopled lo conpule C=A
(0)
8
0
, and lhe resuIl
is slored in regisler C. In round 1, lhe eIenenl A nusl le cycIicaIIy shifled lo lhe righl ly
digils. The resuIl produced in lhe sysloIic array is added lo regisler C in lhe round O. The
firsl round, vhich eslinales lhe Ialency, requires +n cIock cycIes. Lach sulsequenl round
conpulalion requires a Ialency of n+1 cIock cycIes. IinaIIy, lhe enlire nuIlipIicalion requires
a Ialency of +n(n+1) cIock cycIes. The crilicaI propagalion deIay of every ceII is lhe lolaI
deIay of one 2-inpul AND gale, one 2-inpul XOR gale and one 1-lil Ialch.

Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 141


Iig. 4. The proposed LSD-firsl scaIalIe sysloIic CN nuIlipIier over CI(2
m
)

3.2 MSD-first scaIabIe muItipIier
Iron Lqualion (17), lhe producl C can aIso le re-vrillen ly
C=(.(A8
n1
mc (
p
+1))

+A8
n2
mc (
p
+1) )

+.)

+A8
0
mc (
p
+1) (18)
Therefore, lhe MSD-firsl scaIalIe sysloIic nuIlipIicalion is addressed using lhe foIIoving
aIgorilhn.
A!gnrIthm 5. (MSDCN scaIalIe nuIlipIicalion)
Inpul: A and 8 are lvo nornaI lasis eIenenls in CI(2
m
)
Oulpul: C=A8
1. iniliaI slep:
1.1 ) , , , ( ) , , , (
1 1 0 ) 1 ( ) 1 ( ) 0 (
=
m p F F F
a a a a a a A
1.2 ) , , , ( ) , , , (
1 1 0 ) 1 ( ) 1 ( ) 0 (
=
m p F F F
b b b b b b B
1.3
_

=
=
1
0
n
i
ai
i
B B , vhere n=p/( and
f
a
f
f ia F i
b B
_

=
+
=
1
0
) (

1.4 C=O
2. nuIlipIicalion slep:
2.1 for i=1 lo n do
2.2 C=C

mc (
p
+1)+PM(A,8
ni
), vhere PM(A,8
ni
) as referred lo AIgorilhn 3
2.3 endfor
3. lasis conversion slep:
3.1 ) , , , ( ) , , , , , , (
) 1 ( ) 1 ( ) 0 ( 1 1 0 0
= =
p F F F m m
t
c c c C c c c c C
_

4. relurn ) , , , (
1 1 0 m
c c c
AIgorilhn 5 presenls lhe MSD-firsl scaIalIe nuIlipIicalion, and Iigure 5 presenls lhe enlire
CN nuIlipIier archileclure. As conpared lo lolh CN nuIlipIier archileclures, lhe
LSDCN nuIlipIier lefore each round conpulalion, lhe eIenenl A nusl le perforned ly
A
(i)
=A
(i1)

mc(
p
+1). The MSDCN nuIlipIier afler each round conpulalion, lhe resuIl C
VLS 142

nusl le perforned ly C
(i)
=C
(i1)

mc(
p
+1). NolalIy, lolh operalions A
(i)
and C
(i)

represenl a righl cycIic shifl lo i posilions. Hence, lhe lvo proposed archileclures have lhe
sane line and space conpIexily.


Iig. 5. The proposed MSD-firsl scaIalIe sysloIic CN nuIlipIier over CI(2
m
)

4. Modified ScaIabIe and systoIic GNB muItipIier over GF(2
m
)

In lhe previous seclion, lvo scaIalIe CN nuIlipIiers are incIuding one HankeI
nuIlipIier, four regislers and one finaI reduclion poIynoniaI circuil. Ior lhe vhoIe
nuIlipIicalion schene, lolh circuils denand n(n+1) HankeI nuIlipIicalions. To reduce line-
and space-conpIexily, lhis seclion viII deveIop anolher version of lhe CN scaIalIe
nuIlipIicalion schene.
Lel A and 8 in CI(2
m
) le represenled ly lhe redundanl lasis represenlalion. If lhe seIecled
digilaI size is d digils, lhen eIenenl 8 can le represenled as
_

=
=
1
0
n
i
ai
i
B B , vhere
f
a
f
f ia F i
b B
_

=
+
=
1
0
) (
and n=(m|+1)/(. Iron Lqualion (14), using LSD-firsl nuIlipIicalion
aIgorilhn, lhe parliaI producl can le nodified ly
C
0
= A8
0
mc (
p
+1)
=A|
|(0)
+A|
|(1)
+.+A|
|(1)

1
mc (
p
+1)
=A
(0)
|
|(0)
+A
(1)
|
|(1)
+.+A
(

1)
|
|(1)

=c
0,|(0)
+c
0,|(1)
+.+ c
0,|(p1)

1
(19)
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 143

vhere,
i
p
i
f i F
p f f
a A A
_

= + =
1
0
) (
) (
) 1 mod( , for O s j s 1 and
) (
1
0
) ( ) ( , 0 f i F
a
f
f F i F
a b c

=
_
= , for O s i s
p1.
In (Wu, Hasan, Iake & Cao, 2OO2), il is shovn lhal, fron lhe CN represenlalion in
(Reyhani-MasoIeh & Hasan, 2OO5) lo converl lhe nornaI lasis, lhe nininun represenlalion
of A has a Hanning veighl equaI lo or Iess lhan m|/2 if m is even, and (m||+2)/2 if n is
odd. Assune lhal coordinale nunlers of lhe parliaI producl C
0
in Lqualion (19) are seIecled
ly q=| conseculive coordinales lo salisfy lhe corresponding nornaI lasis represenlalion,
vhere q > m|/2 if n is even, and q > (m||+2)/2 if m is odd. Then, lhe parliaI producl A8
0
can
le caIcuIaled ly
A8
0
= n
|1
8
0
+n
|2
8
0

+.+n
0
8
0

(|

1)
.
SiniIarIy,
A8
1

= n
|
8
1
+n
|1
8
1

+.+ n
1
8
1

(|

1)

A8
2
2

= n
|+1
8
2
+n
|
8
2

+.+ n
2
8
2

(|

1)


.

A8
n1

(n

1)
= n
|+n2
8
n1
+n
|+n3
8
n1

+.+ n
n1
8
n1

(|

1)

Thus, lhe nodified nuIlipIicalion requires onIy n| HankeI nuIlipIicalion. As slaled alove,
lhe nodified LSD-firsl scaIalIe nuIlipIicalion is addressed as foIIovs:
A!gnrIthm 6. (nodified LSDCN scaIalIe nuIlipIicalion)
Inpul: ) , , , (
1 1 0
=
m
a a a A and ) , , , (
1 1 0
=
m
b b b B are lvo nornaI lasis eIenenls in CI(2
m
)
Oulpul: C= ) , , , (
1 1 0 m
c c c =A
1. iniliaI slep:
1.1 ) , , , ( ) , , , (
1 1 0 ) 1 ( ) 1 ( ) 0 (
=
m p F F F
a a a a a a A
1.2 ) , , , ( ) , , , (
1 1 0 ) 1 ( ) 1 ( ) 0 (
=
m p F F F
b b b b b b B
1.3
_

=
=
1
0
n
i
ai
i
B B , vhere n=p/ and
f
a
f
f ia F i
b B
_

=
+
=
1
0
) (

1.4 0
1
0
= =
_

=
k
i
ai
i
C C , vhere |=q/ and
f
a
f
f ia F i
c C
_

=
+
=
1
0
) (

1.5 AII HankeI veclor his for 0 < i < n+|1 are converled fron lhe redundanl lasis
represenlalion of A.
2. nuIlipIicalion slep:
2.1 for i=O lo |1 do
2.2 for j=O lo n1 do
2.3 C
|1i
=C
|1i
+H
i+j
8
j

2.4 endfor
2.5 endfor
3. lasis conversion slep:
3.1 ) , , , ( ) , , , , , , (
) 1 ( ) 1 ( ) 0 ( 1 1 0 0
=
q F F F m m
c c c c c c c C
4. relurn ) , , , (
1 1 0 m
c c c
AppIying AIgorilhn 6, Iigure 6 shovs a LSDCN nuIlipIier using lhe redundanl
represenlalion. The circuil incIudes lvo regislers, one HankeI nuIlipIier, and one
VLS 144

sunnalion circuil. In lhe iniliaI slep, lvo regislers H and 8 are converled ly Sleps 1.5 and
1.3, respecliveIy. As for lhe regisler Hi represenl (21)-lil Ialches, and lhe regisler 8
i
is -lil
Ialches. The operalion of a d d HankeI nalrix-veclor nuIlipIier has leen descriled in lhe
previous seclion. In Iigure 6, lhe MUX is responsilIe for shifling regisler H. The SW is in
charge of shifling lhe oulcone of lhe CN nuIlipIicalion. As nenlioned alove, lhe lolaI
HankeI nuIlipIicalions can le reduced fron n(n+1) lo n|, vhere |=m|/2( if m is even and
|=(m||+2)/2( if m is odd. y lhe configuralion of Iigure 6, lhe nodified nuIlipIier is
vilhoul lhe finaI reduclion poIynoniaI circuil. Il is reveaIed lhal lhe nodified nuIlipIier has
Iover line- and space-conpIexily as conpared lo Iigures 4 and 5.



P
M

Mux

,


SW



Iig. 6. The nodified LSD-firsl scaIalIe sysloIic CN nuIlipIier over CI(2
n
)

5. Time and Area CompIexity

Various lil-paraIIeI sysloIic N nuIlipIiers are onIy discussed on lype-1 and 2 ONs of
CI(2
m
), as in (Lee & Chiou, 2OO5 , Kvon, 2OO5 , Lee, Lu & Lee, 2OO1). As is veII knovn, a
lype-1 ON is luiIl fron an irreducilIe AOI, vhiIe a lype-2 ON can le conslrucled fron a
paIindronic represenlalion of poIynoniaIs of Ienglh 2m. Hovever, lolh ON lypes exisl
aloul 24.5 for n< 1OOO, as depicled in ILLL Slandard I1363 (2OOO). Ior lhe LCDSA
(LIIiplic Curve DigilaI Signalure AIgorilhn) appIicalions, NIST (2OOO) has reconnended
five linary fieIds. They are CI(2
163
), CI(2
233
), CI(2
283
), CI(2
4O9
) and CI(2
571
). Their ON
nuIlipIiers are inpIenenled vilh unscaIalIe archileclures, and lhe Ialency needs m+1 cIock
cycIes. The space conpIexily of lhe previous archileclures is proporlionaI lo m
2
. As n
lecones Iarge, lheir hardvare inpIenenls need very Iarge area. Hence, lhe NIST
archileclures have Iiniled very appIicalions in cryplography. Hovever, lhe proposed
LSDCN and MSDCN nuIlipIiers do nol have lhis prolIen.
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 145

TalIe 2 conpares lhe circuils of lhe proposed scaIalIe nuIlipIiers vilh lhose of lhe olher un-
scaIalIe (lil-paraIIeI) nuIlipIiers. TalIe 3 Iisls lhe proposed nuIlipIiers vilh lhe lolaI
Ialency. According lo lhis lalIe, lhe proposed nuIlipIiers for a lype-2 CN save aloul 4O
Ialency as conpared lo Kvon's (2OO3) and Lee & Chiou's (2OO5) nuIlipIiers, and lhose for a
lype-1 CN save aloul 6O Ialency as conpared lo Lee-Lu-Lee's nuIlipIiers (2OO1). Since
lhe seIecled digilaI size d nusl nininize lhe lolaI Ialency, lhe proposed nuIlipIiers lhen
have Iov hardvare conpIexily and Iov Ialency.

MuIlipIiers Kvon
(2OO3)
Lee-Lu-Lee
(2OO1)
Lee-Chiou
(2OO5)
LSD-CNM in
Iigure 4
MSD-CNM
in Iigure 5
asis Type-II
ON
Type-I ON Type-II
ON
Caussian N Caussian N
Archileclure lil-paraIIeI lil-paraIIeI lil-paraIIeI scaIalIe scaIalIe
TolaI
ConpIexily
2-inpul XOR
2-inpul AND
1-lil Ialch
1x2 SW


2n
2
+n
2n
2
+n
5n
2

O


n
2
+2n+1
n
2
+2n+1
3n
2
+6n+1
O


2n
2
+n
2n
2

7n
2



d
2
+d+p
d
2

3.5d
2
+5p+3nd
p


d
2
+d+p
d
2

3.5d
2
+5p+3nd
p
Conpulalion
line per ceII
T
A
+2T
X
T
A
+T
X
T
A
+T
X
T
A
+T
X
T
A
+T
X

Lalency n+1 n+1 n+1 d+n(n+1) d+n(n+1)
Nole: lhe vaIue d is lhe seIecled digilaI size, p=nl+1 is a prine nunler, n=p/d(, TX denoles 2-inpul
XOR gale deIay, TA denoles 2-inpul AND gale deIay
TalIe 2. Conparison of various sysloIic nornaI lasis nuIlipIiers of CI(2
n
)


Iig. 7. Conparisons of lhe line-area conpIexily for various digil-seriaI nuIlipIiers over
CI(2)






VLS 146

Type-2 CN Type-1 CN
lhe proposed
archileclure
Kvon
(2OO3)
reduced
Ialency
lhe proposed
archileclure
Lee-Lu-
Lee
(2OO1)
reduced
Ialency
m Mininun
Ialency
Ialency Conpared
lo Kvon
(2OO3)
m Mininun
Ialency
Ialency Conpared
lo Lee-Lu-
Lee (2OO1)
146 46 88 147 4O 1OO 29 41 1O1 59.4O
155 57 87 156 44 1O6 31 43 1O7 59.8O
158 58 88 159 45 13O 3O 5O 131 61.8O
173 64 94 174 46 138 31 51 139 63.8O
174 64 94 175 46 148 34 54 149 63.8O
179 66 96 18O 46.6O 162 37 57 163 65
183 67 97 184 47 172 39 59 173 65.9O
186 68 98 187 47.6O 178 4O 6O 179 66.5O
189 69 99 19O 47.9O 18O 41 61 181 66.3O
191 59 1O1 192 47.4O 196 44 64 197 67.5O
194 6O 1O2 195 47.7O 21O 48 68 211 67.8O
2O9 65 1O7 21O 49 226 51 71 227 68.7O
TalIe 3. Lisls lhe lolaI Ialency for lype-1 and lype-2 CN nuIlipIiers over CI(2
m
)


Iig. 8. Conparisons of lransislor counl for various digil-seriaI nuIlipIiers over CI(2)

y appIying lhe cul-sel sysloIizalion lechniques (Kung, 1988), various digil-seriaI sysloIic
nuIlipIiers are recenlIy reporled in (Kin, Hong & Kvon, 2OO5, Cuo & Wang, 1998), vhich are
idenlicaI of n processing eIenenls (IL) lo enhance lhe lrade-off lelveen lhroughpul
perfornance and hardvare conpIexily. Lach IL requires a naxinun propagalion deIay of
T
max
=(1)(T
A
+T
iX
+T
M
)+T
A
+T
iX
, vhere T
A
, T
iX
and T
M
denole lhe propagalion deIays lhrough a
2-inpul AND gale, an i-inpul XOR gale and a 2-lo-1 nuIlipIexer, respecliveIy. The naxinun
propagalion deIay in each IL is Iarge if lhe seIecled digilaI size is Iarge. Hovever, lhe
proposed scaIalIe sysloIic archileclures do nol have such prolIens, since lhe propagalion
deIay of each IL is independenl of lhe seIecled digilaI size . AppIying Horners ruIe, Song
and Iarhi (1998) suggesled MSD-firsl and LSD-firsl digil-seriaI nuIlipIiers. Various digil-seriaI
nuIlipIiers use onIy one of lhe inpul signaIs A and 8 lo separale n=m/( sul-vord dala. Our
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 147

proposed archileclures separale lolh inpul signaIs inlo n sul-vord dala, in vhich one of lhe
inpul eIenenl is lransIaled inlo HankeI veclor represenlalion. The proposed LSD-firsl and
MSD-firsl scaIalIe nuIlipIicalion aIgorilhns require n(n+1) HankeI nuIlipIicalions, and lhe
nodified nuIlipIicalion aIgorilhn onIy denands nk HankeI nuIlipIicalions, vhere |=m|/2(
if n is even and |= (m||+2)/2( if n is odd. Using a singIe HankeI nuIlipIier lo inpIenenl
our proposed scaIalIe nuIlipIiers, ve have O(
2
) space conpIexily, vhiIe olher digil-seriaI
nuIlipIiers require O(nd) space conpIexily, as seen in TalIes 4 and 5.
Ior conparing lhe line-area conpIexily, lhe lransislor counl lased on lhe slandard CMOS
VLSI reaIizalion is enpIoyed for conparison. Therefore, sone lasic Iogic gales: 2-inpul XOR,
2-inpul AND, 12 SW, MUX and 1-lil Ialch are assuned lo le conposed of 6, 6, 6, 6 and 8
lransislors, respecliveIy (Kang & LelIelici, 1999). Sone reaI circuils (STMicroeIeclronics,
hllp://vvv.sl.con) such as M74HC86 (STMicroeIeclronics, XOR gale, T
X
=12ns (TYI.)),
M74HCO8 (STMicroeIeclronics, AND gale, T
A
=7ns (TYI.)), M74HC279 (STMicroeIeclronics, SR
Lalch, T
|
=13ns (TYI.)), M74H257 (STMicroeIeclronics, Mux, T
M
=11ns (TYI.)) are enpIoyed for
conparing line conpIexily in lhis paper. In lhe finile fieId CI(2
233
), Iigures 7 and 8 shov lhal
our proposed scaIalIe nuIlipIiers conpare lo lhe corresponding digil-seriaI nuIlipIiers (Kin,
Hong & Kvon, 2OO5, Cuo & Wang, 1998, Reyhani-MasoIeh & Hasan, 2OO2). As lhe seIecled
digilaI size > 4, lhe proposed scaIalIe nuIlipIiers have Iover line-area conpIexily lhan lvo
reporled digil-seriaI I nuIlipIiers (Kin, Hong & Kvon, 2OO5, Cuo& Wang, 1998) (as shovn
in Iigure 7). When lhe seIecled digilaI size > 8, lhe nodified scaIalIe nuIlipIier has Iover
line-area conpIexily lhan lhe corresponding digil-seriaI N nuIlipIier (Reyhani-MasoIeh &
Hasan, 2OO2). Ior conparing a lransislor counl, Iigure 8 reveaIs lhal our scaIalIe nuIlipIiers
have Iov space conpIexily as conpared lo lhe reporled digil-seriaI nuIlipIiers (Kin, Hong &
Kvon, 2OO5, Cuo & Wang, 1998, Reyhani-MasoIeh & Hasan, 2OO2).

MuIlipIiers Cuo & Wang (1998) Kin, Hong & Kvon
(2OO5)
Iigure 5 Iigure 6
asis poIynoniaI poIynoniaI Caussian nornaI Caussian
nornaI
Archileclure digil-seriaI digil-seriaI scaIalIe scaIalIe
TolaI ConpIexily
2-inpul XOR
2-inpul AND
1-lil Ialch
1x2 SW
MUX


2ed
e(2d+d)
1Od+5Id(e
O
2ed


2ed
e(2d+d)
1Od+1+4.5Id+I(e
O
2ed


d+d+p
d
3.5d+5p+3nd
p
O


d+d
d
Q
d
1
CrilicaI Ialh IipeIined:
d(TA+2TX+2TM)/(I+
1)
Non-IipeIined:
TA+3TX+(d1)(TA+2
TX+2TM)
IipeIined:
d(TA+TX+2TM)/(I+1)
Non-IipeIined:
TA+TX+(d1)(TA+TX+2T
M)
TA+TX TA+TX
Lalency IipeIined: 3+I)e
Non-IipeIined: 3e
IipeIined: (3+I)e
Non-IipeIined: 3e
d+n(n+1) d+nk
Nole: k =h/d(, e=n/d(, Q=3.5d
2
+nd+(2d1)(n+k1), TM: denoles 2-ly-1 MUX gale deIay, I+1
nunler of pipeIining slages inside each lasic ceII (Kin, Hong & Kvon, 2OO5, Cuo & Wang, 1998)
TalIe 4. Conparison of various digil-seriaI sysloIic nuIlipIiers over CI(2n)
VLS 148

MuIlipIiers Type1-DSN1
(Reyhani-MasoIeh &
Hasan , 2OO2)
Massey-Onura
(Reyhani-MasoIeh &
Hasan , 2OO2)
Iigure 5 Iigure 6
Archileclure digil-seriaI
nonsysloIic
digil-seriaI
nonsysloIic
scaIalIe sysloIic scaIalIe
sysloIic
TolaI ConpIexily
2-inpul XOR
2-inpul AND
1-lil Ialch
1x2 SW
MUX

(d+1)(n1)
d(n1)+n
O
O
O

d(2n2)
d(2n1)
O
O
O

d+d+p
d
3.5d+5p+3nd
p
O

d+d
d
Q
d
1
CrilicaI Ialh TA+(1+Iog2 n()TX TA+(1+Iog2 n()TX TA+TX TA+TX
Lalency 1 1 d+n(n+1) d+nk
TalIe 5. Conparison of various digil-seriaI nuIlipIiers for oplinaI nornaI lasis of CI(2
m
)

6. ConcIusions

This vork presenls nev nuIlipIicalion aIgorilhns for lhe CN of CI(2
m
) lo reaIize LSD-
firsl and MSD-firsl scaIalIe nuIlipIiers. The fundanenlaI difference of our designs fron
olher digil-seriaI nuIlipIiers descriled in lhe Iileralure is lased on a HankeI nalrix-veclor
represenlalion lo achieve scaIalIe nuIlipIicalion archileclures. In lhe generic fieId, lhe CN
nuIlipIicalion can le deconposed inlo n(n+1) HankeI nuIlipIicalions. To use lhe
reIalionship fron lhe CN lo N, ve can nodify lhe LSD-firsl scaIalIe nuIlipIicalion
aIgorilhn lo decrease lhe nunler of HankeI nuIlipIicalion fron n(n+1) inlo n|, vhere
|=m|/2 if n is even and |=(m|-|-2)/2 if m is odd. Our anaIysis shovs lhal, in finile fieId
CI(2
233
), if lhe seIecled digilaI size d > 8, lhe proposed scaIalIe nuIlipIiers lhen have Iover
line-area conpIexily as conpared lo exisling digil-seriaI nuIlipIiers for poIynoniaI lasis
and nornaI lasis of CI(2
m
). Since our proposed scaIalIe nuIlipIicalion aIgorilhns have
highIy fIexilIe and are suilalIe for inpIenenling aII-lype CN nuIlipIicalions. IinaIIy, lhe
proposed archileclures have good lrade-offs lelveen area and speed for inpIenenling
cryplographic schenes in enledded syslens.

7. References

Denning, D.L.R.(1983). Crqp|cgrapnq an Da|a Sccuri|q, Reading, MA: Addison-WesIey.
Rhee, M.Y. (1994). Crqp|cgrapnq an Sccurc Ccmmunica|icns, McCrav-HiII, Singapore.
Menezes, A. Oorschol, I. V. & Vanslone, S. (1997). Han|cc| cf App|ic Crqp|cgrapnq, CRC
Iress, oca Ralon, IL.
Massey, }.L. & Onura, }.K. (1986). ConpulalionaI nelhod and apparalus for finile fieId
arilhnelic, U.S. Ialenl Nunler 4,587,627.
Reyhani-MasoIeh, A. & Hasan, M.A. (2OO5). Lov conpIexily vord-IeveI sequenliaI nornaI
lasis nuIlipIiers. |||| Transac|icns cn Ccmpu|crs, VoI. 54, No.2.
Lee, C.Y. & Chang, C.}. (2OO4). Lov-conpIexily Iinear array nuIlipIier for nornaI lasis of
lype-II, |||| |n|crna|icna| Ccnfcrcncc cn Mu||imcia an |xpc, VoI. 3, pp. 1515-1518.
Lee, C.Y., Lu, L.H., & Lee, }.Y. (2OO1). il-IaraIIeI SysloIic MuIlipIiers for CI(2
n
) IieIds
Defined ly AII-One and LquaIIy-Spaced IoIynoniaIs, |||| Transac|icns cn
Ccmpu|crs, VoI. 5O, No. 5, pp. 385-393.
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 149

Hasan, M.A., Wang, M.Z. & hargava, V.K. (1993). A nodified Massey-Onura paraIIeI
nuIlipIier for a cIass of finile fieIds, |||| Transac|icns cn Ccmpu|crs, VoI. 42, No.1O,
pp. 1278-128O.
Kvon, S. (2OO3). A Iov conpIexily and a Iov Ialency lil paraIIeI sysloIic nuIlipIier over
CI(2
n
) using an oplinaI nornaI lasis of lype II. Prccccings cf 16|n |||| Sqmp.
Ccmpu|cr Ari|nmc|ic, pp. 196-2O2.
Lee, C.Y. & Chiou, C.W. (2OO5). Design of Iov-conpIexily lil-paraIIeI sysloIic HankeI
nuIlipIiers lo inpIenenl nuIlipIicalion in nornaI and duaI lases of CI(2
n
). |||C|
Transac|icns cn |unamcn|a|s, voI. L88-A, no.11, pp. 3169-3179.
ILLL Slandard I1363 (2OOO). |||| S|anar Spccifica|icns fcr Pu||ic-Kcq Crqp|cgrapnq.
NalionaI Insl. of Slandards and TechnoIogy,(2OOO). DigilaI Signalure Slandard, IIIS
IulIicalion 186-2.
Reyhani-MasoIeh, A. (2OO6). Lfficienl aIgorilhns and archileclures for fieId nuIlipIicalion
using Caussian nornaI lases. |||| Transac|icns cn Ccmpu|crs, VoI. 55, No. 1, pp.34-
47.
Lee
1
, C.Y. (2OO3). Lov-Lalency il-IaraIIeI SysloIic MuIlipIier for IrreducilIe x
n
+x
n
+1 vilh
gcd(n,n)=1. |||C| Transac|icns cn |unamcn|a|s, VoI. L86-A, No.11, pp. 2844-2852.
Lee, C.Y., Horng, }.S. & }ou, I.C. (2OO5). Lov-conpIexily lil-paraIIeI sysloIic Monlgonery
nuIlipIiers for speciaI cIasses of CI(2
n
). |||| Transac|icns cn Ccmpu|crs, voI. 54,
no.9, pp. 1O61-1O7O.
Lee, C.Y. (2OO5). SysloIic archileclures for conpuling exponenlialion and nuIlipIicalion over
CI(2
n
) using poIynoniaI ring lasis. ]curna| cf |ungHua Unitcrsi|q, voI. 19, pp.87-98.
Lee
2
, C.Y. (2OO3). Lov conpIexily lil-paraIIeI sysloIic nuIlipIier over CI(2
n
) using
irreducilIe lrinoniaIs. ||| Prccccing Ccmpu|cr, an Digi|a| Tccnnica|, VoI. 15O, pp.
39-42.
Iaar, C., IIeischnann, I. & Soria-Rodriguez, I. (1999). Iasl arilhnelic for pulIic-key
aIgorilhns in CaIois fieIds vilh conposile exponenls. |||| Transac|icns cn
Ccmpu|crs, voI. 48, no.1O, pp. 1O25-1O34.
Kin, N.Y. & Yoo, K.Y. (2OO5). Digil-seriaI A2 sysloIic archileclure in CI(2
n
). ||| Prccccing
Circui|s Dcticcs Sqs|cms, VoI. 152, No. 6, pp. 6O8-614.
Kang, S.M. & LelIelici, Y. (1999). CMOS Digi|a| |n|cgra|c Circui|s Ana|qsis an Dcsign,
McCravHiII.
Logic seIeclion guide: STMicrcc|cc|rcnics. <hllp://vvv.sl.con > .
Kin, C.H., Hong, C.I. & Kvon, S. (2OO5). A digil-seriaI nuIlipIier for finile fieId CI(2
n
).
|||| Transac|icns cn V|S|, VoI. 13, No. 4, pp. 476-483.
Cuo, }.H. & Wang, C.L. (1998). Digil-seriaI sysloIic nuIlipIier for finile fieIds CI(2
n
). ILL
Iroc.-Conpul. Digil. Tech., VoI. 145, No. 2, pp. 143-148, March.
Kung, S.Y. (1988). V|S| arraq prcccsscrs, LngIevood CIiffs, N}: Irenlice-HaII.
H. Wu, M.A. Hasan, I.I. Iake and S. Cao, Iinile fieId nuIlipIier using redundanl
represenlalion. |||| Transac|icns cn Ccmpu|crs, VoI. 51, No. 11, pp.13O6-1316, Nov.
2OO2.
MuIIin, R.C., Onyszchuk, I.M., Vanslone, S.A. & WiIson, R. M. (1988/1989). OplinaI NornaI
ases in CI(p
n
). Discrc|c App|ic Ma|n., voI. 22, pp.149-161.
Reyhani-MasoIeh, A. & Hasan, M.A. (2OO3). Iasl nornaI lasis nuIlipIicalion using generaI
purpose processors. |||| Transac|icns cn Ccmpu|crs, VoI. 52, No. 11, pp. 1379-139O.
VLS 150

Song, L. & Iarhi, K.K. (1998). Lov-energy digil-seriaI/paraIIeI finile fieId nuIlipIiers. ]curna|
cf V|S| Signa| Prcccssing , VoI.19, pp.149-166.
Tenca, A.I. & Koc, C.K. (1999). A scaIalIe archileclure for Monlgonery nuIlipIicalion.
Prccccings cf Crqp|cgrapnic Haruarc an |m|cc Sqs|cm (CHLS 1999), No. 1717 in
Leclure Noles in Conpuler Science, pp. 94-1O8.
Reyhani-MasoIeh, A. & Hasan, M.A. (2OO2). Lfficienl digil-seriaI nornaI lasis nuIlipIiers
over CI(2
M
). |||| |n|crna|icna| Ccnfcrcncc cn Circui|s an Sqs|cms.


High-Speed VLS Architectures for Turbo Decoders 151
X

High-Speed VLSI Architectures
for Turbo Decoders

Zhongfeng Wang
1
and Xinning Huang
2

1
8rcaccm Ccrpcra|icn, 5300 Ca|ifcrnia Atcnuc, |rtinc, CA 92617, USA,

2
Dcpar|mcn| cf |C|, Icrccs|cr Pc|q|ccnnic |ns|i|u|c, Icrccs|cr, MA 01609, USA.

Turlo code, leing one of lhe nosl allraclive near-Shannon Iinil error correclion codes, has
allracled lrenendous allenlion in lolh acadenia and induslry since ils invenlion in earIy
199Os. In lhis chapler, ve viII discuss high-speed VLSI archileclures for Turlo decoders.
Iirsl of aII, ve viII expIore joinl aIgorilhnic and archilecluraI IeveI oplinizalion lechniques
lo lreak lhe high speed lollIeneck in recursive conpulalion of slale nelrics for sofl-inpul
sofl-oulpul decoders. Then ve viII presenl area-efficienl paraIIeI decoding schenes and
associaled archileclures lhal ain lo IinearIy increase lhe overaII decoding lhroughpul vilh
sul-IinearIy increased hardvare overhead.

Kcywnrds: Turlo code, MAI aIgorilhn, paraIIeI decoding, high speed, VLSI.

1. Introduction

Lrror correclion codes are an essenliaI conponenl in digilaI connunicalion and dala
slorage syslens lo ensure rolusl operalion of digilaI appIicalions, vherein, Turlo code,
invenled ly errou (1993), is anong lhe lvo nosl allraclive near-oplinaI (i.e., near-Shannon
Iinil) error correclion codes. As a naller of facl, Turlo codes have leen considered in
severaI nev induslriaI slandards, such as 3rd and posl-3
rd
generalion ceIIuIar vireIess
syslens (3CII, 3CII2, and 3CII LTL), WireIess LAN (8O2.11a), WiMAX (lroadland
vireIess, ILLL 8O2.16e) and Luropean DA and DV (digilaI audio lroadcasling and digilaI
video lroadcasling) syslens.

One key fealure associaled vilh Turlo code is lhe ileralive decoding process, vhich enalIes
Turlo code lo achieve oulslanding perfornance vilh noderale conpIexily. Hovever, lhe
ileralive process direclIy Ieads lo Iov lhroughpul and Iong decoding Ialency. To ollain a
high decoding lhroughpul, a Iarge anounl of conpulalion unils have lo le inslanlialed for
each decoder, and lhis resuIls in a Iarge chip area and high pover consunplion. In conlrasl,
lhe groving narkel of vireIess and porlalIe conpuling devices as veII as lhe increasing
desire lo reduce packaging cosls have direcled induslry lo focus on conpacl Iov-pover
circuil inpIenenlalions. This lug-of-var highIighls lhe chaIIenge and caIIs for innovalions
on Very Large ScaIe Inlegralion (VLSI) design of high-dala rale Turlo decoders lhal are
lolh area and pover efficienl.
8
VLS 152


Ior generaI ASIC design, lhere are lvo lypicaI vays lo increase lhe syslen lhroughpul: 1)
raise lhe cIock speed, and 2) increase lhe paraIIeIisn. In lhis chapler, ve viII lackIe lhe high
speed Turlo decoder design in lhese lvo aspecls. Turlo code decoders can le lased on
eilher maximum-a-pcs|cricr prolaliIily (MAI) aIgorilhn proposed in ahI (1974) (or any
varianls of approxinalion) or sofl-oulpul Vilerli aIgorilhn (SOVA) proposed in Hagenauer
(1989) (or any nodified version). Hovever, eilher aIgorilhn invoIves recursive conpulalion
of slale nelrics, vhich forns lhe lollIeneck in high speed inlegraled circuil design since
convenlionaI pipeIining lechniques cannol le sinpIy appIied for raising lhe effeclive cIock
speed. Look-ahead pipeIining in Iarhi (1999) nay le appIicalIe. ul lhe inlroduced
hardvare overhead can le inloIeralIe. On lhe olher hand, paraIIeI processing can le
effeclive in increasing lhe syslen lhroughpul. UnforlunaleIy, direcl appIicalion of lhis
lechnique viII cause hardvare and pover consunplion lo increase IinearIy, vhich is againsl
lhe requirenenl of nodern porlalIe conpuling devices.

In lhis chapler, ve viII focus on MAI-lased Turlo decoder design since MAI-lased Turlo
decoder significanlIy oulperforns SOVA-lased Turlo decoders in lerns of il-Lrror-Rale
(LR). In addilion, MAI decoders are nore chaIIenging lhan SOVA decoders in high speed
design (Wang 2OO7). Inleresled readers are referred lo Yeo (2OO3) and Wang (2OO3c) for high
dala-rale SOVA or Turlo/SOVA decoder design. The resl of lhe chapler is organized as
foIIovs. In Seclion 2, ve give lackground infornalion aloul Turlo codes and discuss
sinpIe seriaI decoder slruclure. In Seclion 3, ve viII address high speed recursion
archileclures for MAI decoders. olh Radix-2 and Radix-4 recursion archileclures are
invesligaled. In Seclion 4, ve presenl area-efficienl paraIIeI processing schenes and
associaled paraIIeI decoding archileclures. We concIude lhe chapler in Seclion 5.

2. Background of Turbo Codes

SSO
nterleave
memory
nput
buffer
Address
Generator
Load Wtbk
1
p
y
2
p
y
s
y
) (k LLR
) (k L
ex

Iig. 1. A seriaI Turlo decoder archileclure.

A lypicaI Turlo encoder consisls of lvo recursive syslenalic convoIulionaI (RSC) encoders
and an inlerIeaver lelveen lhen (Wang 1999). The source dala are encoded ly lhe firsl RSC
encoder in sequenliaI order vhiIe ils inlerIeaved sequence is encoded ly lhe second RSC
encoder. The originaI source lils and parily lils generaled ly lvo RSC encoders are senl oul
in a line-nuIlipIexed vay. Inleresled reader are referred lo errou (1993) for delaiIs. Turlo
code usuaIIy vorks vilh Iarge lIock sizes for lhe reason lhal lhe Iarger lhe lIock size, lhe
leller lhe perfornance in generaI. In order lo faciIilale ileralive decoding, lhe received dala
High-Speed VLS Architectures for Turbo Decoders 153

of a vhoIe decoding lIock have lo le slored in a nenory, vhose size is proporlionaI lo lhe
Turlo lIock size. Hence, Turlo decoders usuaIIy require Iarge nenory slorage. Therefore
seriaI decoding archileclures are videIy used in praclice.

A lypicaI seriaI Turlo decoder archileclure is shovn in Iig. 1. Il has onIy one sofl-inpul sofl-
oulpul (SISO) decoder, vhich vorks in a line-nuIlipIexed vay as proposed in Suzuki
(2OOO). Lach ileralion is deconposed inlo lvo decoding phases, i.e., |nc scqucn|ia| cccing
pnasc, in vhich lhe dala are processed in sequenliaI order, and |nc in|cr|catc cccing pnasc,
in vhich lhe source dala are processed in an inlerIeaved order. olh prolaliIily MAI
aIgorilhn proposed in errou (1993) and SOVA proposed in Hagenauer (1989) can le
enpIoyed for lhe SISO decoding.

The seriaI Turlo decoder incIudes lvo nenories: one is used for received sofl synloIs,
caIIed lhe inpul luffer or lhe receiver luffer, lhe olher is used lo slore lhe exlrinsic
infornalion, denoled as lhe inlerIeaver nenory. The exlrinsic infornalion is feedlack as lhe
a pricri infornalion for nexl decoding. The inpul luffer is nornaIIy indispensalIe. Wilh
regard lo lhe inlerIeaver nenory, eilher lvo ping-pong luffers can le used lo conpIele one
Load and one Wrile operalions required lo process each infornalion lil vilhin one cycIe or
one singIe-porl nenory can le enpIoyed lo fuIfiI lhe lvo required operalions vilhin lvo
cIock cycIes. A nenory-efficienl archileclure vas presenled ly Wang (2OO3a) and Iarhi,
vhich can process lolh Read and Wrile operalions al lhe sane cycIe using singIe-porl
nenories vilh lhe aid of snaII luffers.

Turlo decoder vorks as foIIovs. The SISO decoder lakes sofl inpuls (incIuding lhe received
syslenalic lil
s
y and lhe received parily lil
1
p
y or
2
p
y
) fron lhe inpul luffer and lhe a
pricri infornalion fron lhe inlerIeaver nenory. Il oulpuls lhe Iog IikeIihood ralio ) (k LLR ,
and lhe exlrinsic infornalion, ) (k L
ex
, for lhe k-lh infornalion lil in lhe decoding sequence.
The exlrinsic infornalion is senl lack as lhe nev a pricri infornalion for nexl decoding. The
inlerIeaving and de-inlerIeaving processes are conpIeled in an efficienl vay. asicaIIy lhe
dala are Ioaded according lo lhe currenl decoding sequence. Ior inslance, lhe exlrinsic
infornalion is Ioaded in sequenliaI order al sequenliaI decoding phase vhiIe leing Ioaded
in lhe inlerIeaved order al lhe inlerIeaved decoding phase. Afler processing, lhe nev
exlrinsic infornalion is vrillen lack lo lhe originaI pIaces. In lhis vay, no de-inlerIeave
pallern is required for Turlo decoding.

3. High Speed Log-MAP Decoder Design

3.1 Maximum A Posterior (MAP) aIgorithm
As discussed in Seclion 2, praclicaI Turlo decoders usuaIIy enpIoy seriaI decoding
archileclures, such as Suzuki (2OOO), for area-efficiency. Thus, lhe lhroughpul of a Turlo
decoder is highIy Iiniled ly lhe cIock speed and lhe naxinun nunler of ileralions lo le
perforned. To faciIilale ileralive decoding, Turlo decoders require sofl-inpul sofl-oulpul
decoding aIgorilhns, anong vhich lhe prolaliIily MAI aIgorilhn proposed in ahI (1974)
is videIy adopled for ils exceIIenl perfornance.

VLS 154


The MAI aIgorilhn is connonIy inpIenenled in Iog donain, lhus caIIed Log-MAI (Wang
1999). The Log-MAI aIgorilhn invoIves recursive conpulalion of forvard slale nelrics
(sinpIy caIIed o nelrics) and lackvard slale nelrics (sinpIe caIIed | nelrics). The Iog-
IikeIihood-ralio is conpuled lased on lhe lvo lypes of slale nelrics and associaled lranch
nelrics (denoled as nelrics). Due lo differenl recursion direclions in conpuling o and
| nelrics, a slraighlforvard inpIenenlalion of Log-MAI aIgorilhn viII nol onIy consune
Iarge nenory lul aIso inlroduce Iarge decoding Ialency. The sIiding vindov approach vas
proposed ly Vilerli (1998) lo niligale lhis issue. In lhis case, pre-lackvard recursion
operalions are inlroduced for lhe varn-up process of reaI lackvard recursion. Ior cIarily,
ve denole lhe pre-lackvard recursion conpuling unil as |0 unil and denole lhe reaI
(effeclive or vaIid) lackvard recursion unil as |1 unil. The delaiIs of Log-MAI aIgorilhn
viII nol le given in lhe chapler. Inleresled readers are referred lo Wang (1999).

The lining diagran for lypicaI sIiding-vindov-lased Log-MAI decoding is shovn in Iig.
2, vhere S1, S2, elc, sland for conseculive sul-lIocks (i.c., sIiding vindovs), lhe lranch
nelrics conpulalion vas conpuled logelher vilh pre-lackvard recursion, lhough nol
shovn in lhe figure for sinpIicily and cIarily.

The slruclure of a lypicaI seriaI Turlo decoder lased on Iog-MAI aIgorilhn is shovn in Iig.
3, vhere lhe sofl-oulpul unil is used lo conpule LLR and exlrinsic infornalion (denoled as
Lex), lhe inlerIeaver nenory is used lo slore lhe exlrinsic infornalion for nexl (phase of)
decoding. Il can le seen fron lhe figure lhal lolh lhe lranch nelric unil (MU) and sofl-
oulpul unil (SOU) can le pipeIined for high speed appIicalions. Hovever, due lo recursive
conpulalion, lhree slale nelrics conpulalion unils forn lhe high-speed lollIeneck. The
reason is lhal lhe convenlionaI pipeIining lechnique is nol appIicalIe for raising lhe effeclive
processing speed unIess one MAI decoder is used lo process nore lhan one Turlo code
lIocks or sul-lIocks as discussed in Lee (2OO5). Anong various high-speed recursion
archileclures in lhe Iileralure such as Lee (2OO5), Urard (2OO4), ouliIIon (2OO3), Miyouchi
(2OO1) and ickerslaff (2OO3), lhe designs presenled in Urard (2OO4) and ickerslaff (2OO3)
are nosl allraclive. In Urard (2OO4), an offsel-add-conpare-seIecl (OACS) archileclure is
proposed lo repIace lhe lradilionaI add-conpare-seIecl-offsel (ACSO) archileclure. In
addilion, lhe Iook-up lalIe (LUT) is sinpIified vilh onIy 1-lil oulpul, and lhe conpulalion
of alsoIule vaIue is avoided lhrough inlroduclion of lhe reverse difference of lvo conpeling
palh (or slale) nelrics. An approxinale 17 speedup over lhe lradilionaI Radix-2 ASCO
archileclure vas reporled. Wilh one-slep Iook-ahead operalion, a Radix-4 ACSO
archileclure can le derived. IraclicaI Radix-4 archileclures such as Miyouchi (2OO1) and
ickerslaff (2OO3) aIvays invoIve approxinalions in order lo achieve higher effeclive
speedup. Ior inslance, lhe foIIoving approxinalion is adopled in ickerslaff (2OO3):
nax* (nax*(A, ), nax*(C, D))=nax*(nax(A, ), nax(C, D)), (1)
vhere
nax*(A, )= nax(A,)+Iog(1+
, , B A
e

). (2)

This Radix-4 archileclure can generaIIy inprove lhe processing speed (equivaIenl lo lvice of
ils cIock speed) ly over 4O over lhe lradilionaI Radix-2 archileclure, and il has c fac|c lhe
highesl processing speed anong aII exisling (MAI decoder) designs found in lhe Iileralure.
High-Speed VLS Architectures for Turbo Decoders 155

Hovever, lhe hardvare viII le nearIy doulIed conpared lo lhe lradilionaI ACSO
archileclure presenled in Urard (2OO4).

In lhis seclion, ve viII firsl presenl an advanced Radix-2 recursion archileclure lased on
aIgorilhnic lransfornalion, approxinalion and archilecluraI IeveI oplinizalion, vhich can
achieve conparalIe processing speed as lhe slale-of-lhe-arl Radix-4 design vhiIe having
significanlIy Iover hardvare conpIexily. Then ve discuss an inproved Radix-4 archileclure
lhal is 32 fasler lhan lhe lesl exisling approach.


Iig. 2. The lining diagran for Log-MAI decoding.


Iig. 3. A Log-MAI Turlo decoder slruclure.

VLS 156


3.2 An Advanced High-Speed Radix-2 Recursion Architecture for MAP Decoders
Ior convenience in Ialer discussion, ve firsl give a lrief inlroduclion lo MAI-lased Turlo
decoder slruclure. As shovn in Iig. 3, lhe lranch nelrics unil (BMU) lakes inpuls fron lhe
receiver luffer and lhe inlerIeaver nenory. The oulpuls of MU are direclIy senl lo lhe pre-
lackvard recursion unil (i.c., |0 unil). The previousIy slored lranch nelrics for conseculive
sIiding vindovs are inpul lo lhe forvard recursion unil (i.c., o unil) and lhe effeclive
lackvard recursion unil (i.c., |1 unil), respecliveIy. The sofl oulpul unil (5OU) lhal is used
lo conpule lhe Iog IikeIihood ralio (LLR) and lhe exlrinsic infornalion (Lex) lakes inpuls
fron lhe previousIy slored o nelrics, lhe currenlIy conpuled | nelrics and lhe previousIy
slored lranch nelrics (). The 5OU slarls lo generale sofl oulpuls afler lhe lranch nelrics
have leen conpuled for lhe firsl lvo sIiding vindovs. Il can le olserved lhal lhe high-
speed lollIeneck of a Log-MAI decoder Iies in lhe lhree recursive conpulalion unils since
lolh MU and SOU can le sinpIy pipeIined for high-speed appIicalions. Thus, lhis seclion
is dedicaled lo lhe design of high-speed recursive conpulalion unils, i.c., o, |0 and |1 unils
shovn in Iig. 3.

Il is knovn fron Log-MAI aIgorilhn lhal aII lhe lhree recursion unils have siniIar
archileclures. So ve viII focus our discussion on design of o unils. The lradilionaI design for
o conpulalion is iIIuslraled in Iig. 4, vhere lhe AB5 lIock is used lo conpule lhe alsoIule
vaIue of lhe inpul and lhe LUT (i.c., Iook-up lalIe) lIock is used lo inpIenenl a nonIinear
funclion Iog(1+
x
e

), vhere x > O. Ior sinpIicily, onIy one lranch (i.c., one slale) is dravn.
The overfIov approach (Wu 2OO1) is assuned for nornaIizalion of slale nelrics as used in
convenlionaI Vilerli decoders.


Iig. 4. The originaI recursion archileclure: Arch-O.

Il can le seen lhal lhe conpulalion of lhe recursive Ioop consisls of lhree nuIli-lil addilions,
lhe conpulalion of alsoIule vaIue and a randon Iogic lo inpIenenl lhe LUT. As lhere is
onIy one deIay eIenenl in each recursive Ioop, lhe lradilionaI relining lechnique in Denk
(1998) can nol le used lo reduce lhe crilicaI palh.



High-Speed VLS Architectures for Turbo Decoders 157


Iig. 5. An advanced Radix-2 fasl recursion archileclure.: Arch-A

In lhis vork, ve propose an advanced Radix-2 recursion archileclure shovn in Iig. 5. Here
ve firsl inlroduce a difference nelric for each conpeling pair of slales nelrics (c.g., o0 and
o1 in Iig. 4) so lhal ve can perforn lhe fronl end addilion and lhe sullraclion operalions
sinuIlaneousIy in order lo reduce lhe conpulalion deIay of lhe Ioop. SecondIy, ve enpIoy a
generaIized LUT (see CLUT in Iig. 5) lhal can efficienlIy avoid lhe conpulalion of alsoIule
vaIue inslead of inlroducing anolher sullraclion operalion as in Urard (2OO4). ThirdIy, ve
nove lhe finaI addilion lo lhe inpul side as vilh lhe OACS archileclure in ouliIIon (2OO3)
and lhen uliIize one slage carry-save slruclure lo converl a 3-nunler addilion lo a 2-nunler
addilion. IinaIIy, ve nake an inleIIigenl approxinalion in order lo furlher reduce lhe
crilicaI palh.

The foIIoving equalions are assuned for lhe considered recursive conpulalion shovn in
Iig. 5:

|), | 0 | | 1 |, | 3 | 0 ( max* | 1 | 2
|), | 3 | | 1 |, | 0 | | 0 ( max* | 1 | 0
k k k k k
k k k k k
o o o
o o o
+ + = +
+ + = +
(3)
vhere nax* funclion is defined in (2).

In addilion, ve spIil each slale nelrics inlo lvo lerns as foIIovs:

|. | 2 | | 2 | | 2
|, | 1 | | 1 | | 1
|, | 0 | | 0 | | 0
k B k A k
k B k A k
k B k A k
o o o
o o o
o o o
+ =
+ =
+ =
(4)
VLS 158


SiniIarIy, lhe corresponding difference nelric is aIso spIil inlo lvo lerns:

|. | | | | |
|, | | | | |
1 0 01
1 0 01
k k k
k k k
B B B
A A A
o o o
o o o
=
=
(5)
In lhis vay, lhe originaI add-and-conpare operalion is converled as an addilion of lhree
nunlers, i.c.,

B A 01 01 3 0 3 1 0 0
) ( ) ( ) ( o o o o + + = + +
(6)
vhere
3 0
is conpuled ly MU, lhe line index |kj is onilled for sinpIicily. In
addilion, lhe difference lelveen lhe lvo oulpuls fron lvo CLUTs, i.c.,
B 01
o , can le
negIecled. Iron exlensive sinuIalions, ve found lhal lhis snaII approxinalion doesnl
cause any perfornance Ioss in Turlo decoding vilh eilher AWCN channeIs or RaIeigh
fading channeIs. This facl can le sinpIy expIained in lhe foIIoving. If one conpeling palh
nelrics (c.g., 0 0 0 o + = p ) is significanlIy Iarger lhan lhe olher one (c.g., 3 1 1 o + = p ),
lhe CLUT oulpul viII nol change lhe decision anyvay due lo lheir snaII nagniludes. On
lhe olher hand, if lhe lvo conpeling palh nelrics are so cIose lhal adding or renoving a
vaIue fron one CLUT nay change lhe decision (c.g., fron p0>p1 lo p1>p0), picking any
survivor (p0 cr p1) shouId nol nake lig difference.

Al lhe inpul side, a snaII circuilry shovn in Iig. 6 is enpIoyed lo converl an addilion of 3
nunlers lo an addilion of 2 nunlers, vhere IA and HA represenls fuII-adder and haIf-
adder respecliveIy, XOR slands for excIusive OR gale, dO and d1 correspond lo lhe 2-lil
oulpul of CLUT. The slale nelrics and lranch nelrics are represenled vilh 9 and 6 lils,
respecliveIy in lhis exanpIe. The sign exlension is onIy appIied lo lhe lranch nelrics. Il
shouId le noled lhal an exlra addilion operalion (see dashed adder loxes) is required lo
inlegrale each slale nelric lefore sloring il inlo lhe o nenory.

FA FA HA
[0]
o[0] d0 o[1] d1 [2] o[2]
[1]
XOR HA
o[8] [5] o|7] |5]

Iig. 6. A carry-save slruclure in lhe fronl end of Arch-A.

The generaIized LUT (CLUT) slruclure is shovn in Iig. 7, vhere lhe conpulalion of
alsoIule vaIue is eIininaled ly incIuding lhe sign lil inlo 2 Iogic lIocks, i.c., Ls2 and ELUT,
vhere lhe Ls2 funclion lIock is used lo delecl if lhe alsoIule vaIue of lhe inpul is Iess lhan
2.O, and lhe LLUT lIock is a snaII LUT vilh 3-lil inpuls and 2-lil oulpuls. Il can le derived
lhal
7 4 7 4 3 3
( ) Z Sb b b S b b b = + + + . .
. Il vas reporled in Cross (1998) lhal using lvo oulpul
vaIues for lhe LUT onIy caused a perfornance Ioss of O.O3 d fron lhe fIoaling poinl
sinuIalion for a 4-slale Turlo code. The approxinalion is descriled as foIIovs:

If |x|<2, f(x) = 3/8, eIse f(x) = O, (7)

High-Speed VLS Architectures for Turbo Decoders 159

vhere x and f(x) slands for lhe inpul and lhe oulpul of lhe LUT, respecliveIy. In lhis
approach, ve onIy need lo check if lhe alsoIule vaIue of lhe inpul is Iess lhan 2 or nol,
vhich can le perforned ly lhe Ls2 lIock in Iig. 7. A dravlack of lhis nelhod is lhal ils
perfornance vouId le significanlIy degraded if onIy lvo lils are kepl for lhe fraclionaI parl
of lhe slale nelrics, vhich is generaIIy lhe case.


Iig. 7. The slruclure of CLUT used in Arch-A.

|x| O.O O.5O 1.O 1.5O
f(x) 2/4
TalIe 1. The proposed LUT approxinalion

In our design, lolh lhe inpuls and oulpuls of lhe LUT are quanlized vilh 4 IeveIs. The
delaiIs are shovn in TalIe 1. The inpuls lo LLUT are lrealed as a 3-lil signed linary
nunler. The oulpuls of LLUT are ANDed vilh lhe oulpul of Ls2 lIock. This neans, if lhe
alsoIule vaIue of lhe inpul is grealer lhan 2.O, lhe oulpul fron lhe CLUT is sel as O.
Olhervise lhe oulpul fron LLUT viII le lhe finaI oulpul.

The LLUT can le inpIenenled vilh conlinalionaI Iogic for high speed appIicalions. Ils
conpulalion Ialency is snaIIer lhan lhe Ialency of Ls2 lIock. Therefore, lhe overaII Ialency
of lhe CLUT is aInosl lhe sane as lhe alove-discussed sinpIified nelhod, vhere lhe lolaI
deIay consisls of 1 MUX deIay and lhe conpulalion deIay of Ls2.

Afler aII lhe alove oplinizalion, lhe crilicaI palh of lhe recursive archileclure is reduced lo 2
nuIli-lil addilions, one 2:1 MUX operalion and 1-lil addilion operalion, vhich saves nearIy
2 nuIli-lil adder deIay conpared lo lhe lradilionaI ACSO archileclure. We viII shov
delaiIed conparisons in sulseclion 3.4.

3.3 An Improved Radix-4 Architecture for MAP Decoders
In lhe foIIoving, ve discuss an inproved Radix-4 recursion archileclure. The conpulalion
for | 2 | 0 + k o is expressed as foIIovs:
VLS 160


|), 1 | 3 |) | 1 | | 3 |, | 2 | 2 ( max*
|, 1 | 0 |) | 3 | | 1 |, | 0 | | 0 ( (max* max*
|) 1 | 3 | 1 | 1 |, 1 | 0 | 1 | 0 ( max* | 2 | 0
+ + + +
+ + + + =
+ + + + + + = +
k k k k k
k k k k k
k k k k k
o o
o o
o o o
(8)

vhere

|). | 1 | | 3 |, | 2 | | 2 ( max* | 1 | 1 k k k k k o o o + + = +
(9)

In ickerslaff (2OO3), Lucenl eII Lals proposed lhe foIIoving approxinalion:
|). 1 | 3 |) | 1 | | 3 |, | 2 | 2 max(
|, 1 | 0 |) | 3 | | 1 |, | 0 | | 0 (max( max* | 2 | 0
+ + + +
+ + + + ~ +
k k k k k
k k k k k k
o o
o o o
(1O)

| |
0
k o
| 1 | | |
0 0
+ + k k
| 1 | | |
0 3
+ + k k
| 1 | | |
2 3
+ + k k
| 1 | | |
0 3
+ + k k
| |
1
k o
| |
2
k o
| |
3
k o
| 2 |
0
+ k o

Iig. 8. The Radix-4 archileclure proposed ly Lucenl eII Lals: Arch-L.

This approxinalion is reporled lo have a O.O4 d perfornance Ioss conpared lo lhe originaI
Log-MAI aIgorilhn. The archileclure lo inpIenenl lhe alove conpulalion is shovn in Iig.
8 for convenience in Ialer discussion. As il can le seen, lhe crilicaI palh consisls of 4 nuIli-
lil adder deIay, one generaIized LUT deIay (Nole: lhe LUT1 lIock incIudes alsoIule vaIue
conpulalion and a nornaI LUT operalion) and one 2:1 MUX deIay.

SiniIarIy, ve can lake an aIlernalive approxinalion as foIIovs:
|). 1 | 3 |) | 1 | | 3 |, | 2 | 2 ( max*
|, 1 | 0 |) | 3 | | 1 |, | 0 | | 0 ( max(max* | 2 | 0
+ + + +
+ + + + ~ +
k k k k k
k k k k k k
o o
o o o
(11)

High-Speed VLS Architectures for Turbo Decoders 161

| |
0
k
A
o
| |
0
k
B
o
| 1 | | |
0 0
+ + k k
| 1 | | |
0 3
+ + k k
| |
1
k
A
o
| |
1
k
B
o
| |
2
k
A
o
| |
2
k
B
o
| |
3
k
A
o
| |
3
k
B
o
| 1 | | |
2 3
+ + k k
| 1 | | |
1 3
+ + k k
| 2 |
0
+ k
A
o
| 2 |
0
+ k
B
o

Iig. 9. The inproved Radix-4 recursion arch.: Arch-.

InluiliveIy, Turlo decoder enpIoying lhis nev approxinalion shouId have lhe sane
decoding perfornance as using equalion (7). WhiIe direclIy inpIenenling (11) does nol
lring any advanlage lo lhe crilicaI palh, ve inlend lo lake advanlages of lhe lechniques lhal
ve deveIoped in sulseclion 3.2. The nev archileclure is shovn in Iig. 9. Here ve spIil each
slale nelric inlo lvo lerns and ve adopl lhe sane CLUT slruclure as ve did lefore. In
addilion, a siniIar approxinalion is incorporaled as vilh Arch-A. In lhis case, lhe oulpuls
fron CLUT are nol invoIved in lhe finaI slage conparison operalion. Il can le olserved lhal
lhe crilicaI palh of lhe nev archileclure is cIose lo 3 nuIli-lil adder deIay. To conpensale
for aII lhe approxinalion inlroduced, lhe exlrinsic infornalion generaled ly lhe MAI
decoder lased on lhis nev Radix-4 archileclure shouId le scaIed ly a faclor around O.75.

3.4 Performance Comparisons
Ior quanlilalive conparison in hardvare conpIexily and processing speed, ve used TSMC
O.18 un slandard ceIIs lo synlhesize one o unil for 5 differenl recursion archileclures, i.c., 1)
lhe lradilionaI ACSO archileclure: Arch-O, 2) lhe reduced-precision Radix-2 recursion
archileclure presenled in Urard (2OO4): Arch-U, 3) lhe Radix-4 archileclure proposed ly
Lucenl: Arch-L, 4) lhe advanced Radix-2 recursion archileclure presenled in lhis vork:
Arch-A, and 5) lhe inproved Radix-4 recursion archileclure: Arch-. AII lhe slale nelrics
are quanlized as 9 lils vhiIe lhe lranch nelrics are represenled using 6 lils. The delaiIed
synlhesis resuIls are Iisled in TalIe 2.

Il can le olserved lhal lhe proposed Radix-2 archileclure has conparalIe processing speed
as lhe Radix-4 archileclure proposed ly eII Lals vilh significanlIy Iover conpIexily, vhiIe
lhe inproved Radix-4 archileclure is 32 fasler vilh onIy 9 exlra hardvare. This anounl
of hardvare overhead shouId le negIigilIe conpared lo an enlire Turlo decoder. Il can aIso
le seen lhal lhe nev Radix-4 archileclure achieves lvice speedup over lhe lradilionaI Radix-
2 recursion archileclure.



VLS 162


Max cIock
Irequency (Mhz)
ReIalive
Area
ReIalive
Irocessing Speed
Arch-O 241 1.O 1.O
Arch-U 355 1.14 1.39
Arch-L 182 1.82 1.51
Arch-A 37O 1.O3 1.54
Arch- 241 1.99 2.O
TalIe 2. Conparison for various recursion archileclures

We have perforned exlensive sinuIalions for Turlo codes using lhe originaI MAI and
using various approxinalions. Iig. 1O shovs lhe LR (lil-error-rale) perfornance of a rale-
1/3, 8-slale, lIock size of 512 lils, Turlo code using differenl MAI archileclures. The
sinuIalions vere underlaken under lhe assunplion of AWCN channeI and ISK signaIing.
A naxinun of 8 ileralions vas perforned. More lhan 4O niIIion randon infornalion lils
vere sinuIaled for lolh Ll/No=1.6 d and Ll/No=1.8d cases. Il can le noled fron Iig.
1O lhal lhere is no olservalIe perfornance difference lelveen lhe lrue MAI aIgorilhn and
lvo approxinalion nelhods associaled vilh lhe proposed recursion archileclures vhiIe lhe
approxinalion enpIoyed in Urard (2OO4) caused approxinaleIy O.2 d perfornance
degradalion in generaI.


Iig. 1O. Ierfornance conparisons lelveen lhe originaI MAI and sone approxinalions.
High-Speed VLS Architectures for Turbo Decoders 163

We argue lhal lhe proposed Radix-4 recursion archileclure is oplinaI for high-speed MAI
decoders. Any (significanlIy) fasler recursion archileclure (c.g., a possilIe Radix-8
archileclure) viII le al lhe expense of significanlIy increased hardvare. On lhe olher hand,
vhen lhe largel lhroughpul is noderale, lhese fasl recursion archileclures can le used lo
reduce pover consunplion lecause of lheir dranalicaIIy reduced crilicaI palhs.

In generaI, lhe lhroughpul of a seriaI Turlo decoder is significanlIy Iiniled ly recursive
conpulalion and VLSI lechnoIogy used lo inpIenenl il. On lhe olher hand, increasing lhe
cIock frequency vouId cause lhe dynanic pover dissipalion lo increase IinearIy. Therefore,
il is nol a good oplion lo increase lhe cIock frequency for high lhroughpul appIicalions.

4. Efficient ParaIIeI Turbo Decoding Architectures

In lhis seclion, ve firsl inlroduce an area-efficienl paraIIeI Turlo decoding schene, vhere a
dala frane is parlilioned inlo severaI equivaIenl segnenls vilh proper overIap lelveen
adjacenl segnenls, and sIiding-vindov-lased Turlo decoding is lhen appIied lo each
segnenl. The goaI is lo nainlain lhe sane nenory requirenenl as seriaI decoding case
vhiIe increasing lhe conpIexily of lhe Iogic conponenls onIy, lhus IinearIy increase lhe
syslen lhroughpul vilh sul-IinearIy increased hardvare.

The najor chaIIenge Iies in reaI inpIenenlalion. As each nenory (receiver nenory or
exlrinsic infornalion nenory) needs lo supporl nuIlipIe dala (for nuIlipIe conponenl sofl-
inpul sofl-oulpul decoders) al lhe sane cycIe, nenory access confIicls are inevilalIe unIess
M-porl (M = paraIIeIisn IeveI) nenory is used, vhich conlradicls lhe originaI Iov-
conpIexily design oljeclive. Tvo differenl soIulions are proposed in lhis vork. Iirsl of aII, if
inlerIeave pallerns are free lo design, ve can adopl dividalIe inlerIavers, vhich inherenlIy
ensure no nenory access confIicl afler proper nenory parlilion. Ior praclicaI appIicalions
vherein Turlo code inlerIeave pallerns are fixed, e.g., 3CII and 3CII2, a nore generic
soIulion is inlroduced in lhis chapler. y inlroducing sone snaII luffers for dala and
addresses as veII, ve are alIe lo avoid nenory access confIicl under any praclicaI randon
inlerIeavers. Conlining aII lhe proposed lechniques, il is eslinaled lhal 2OO Ml/s Turlo
decoder is feasilIe vilh currenl CMOS lechnoIogy vilh noderale conpIexily.

4.1 Area-efficient paraIIeI decoding schemes
IaraIIeI processing has leen videIy used in Iov pover and high lhroughpul circuil design.
MuIlipIe seriaI Turlo decoders can le concalenaled in a paraIIeI vay (see Iig. 11 A) or in a
seriaI vay. In eilher case, lolh lhe area and pover are increased IinearIy as lhe lhroughpul
increases. The area-efficienl paraIIeI Turlo decoding schenes vere firsl proposed in Wang
(2OO1). The idea is lo Iel nuIlipIe SISO decoders vork on lhe sane dala frane
sinuIlaneousIy. In lhis case, onIy lhe hardvare of lhe SISO decoder parl needs lo le
increased vhiIe lhe lolaI hardvare in lhe nenory parl renains unchanged on a Iarge exlenl
(see Iig. 11 ). As lhe nenory parl usuaIIy doninales lhe overaII hardvare of a Turlo
decoder, lhe area-efficienl paraIIeI Turlo decoding archileclures are expecled lo achieve
nuIlipIe lines lhe lhroughpul of a seriaI decoding decoder vilh a snaII fraclion of
hardvare overhead. UnforlunaleIy, lhe reaI inpIenenlalion issues vere nol covered in lhe
originaI vork.
VLS 164














Iig. 11. TradilionaI vs. area-efficienl paraIIeI Turlo decoding.


a) A sinpIe area-efficienl 2-IeveI paraIIeI Turlo decoding schene.


Iig. 12. l) MuIli-IeveI paraIIeI Turlo decoding schene lased on sIiding-vindov approach.

A sinpIe area-efficienl 2-IeveI paraIIeI Turlo decoding schene is shovn in Iig. 12 a), vhere
a sequence of dala is divided inlo lvo segnenls vilh equivaIenl Ienglh. There is an overIap
vilh Ienglh of 2D lelveen lvo segnenls, vhere D equaIs lhe sIiding vindov size if Log-
MAI (or MAX-Log-MAI) aIgorilhn is enpIoyed in lhe SISO decoder or lhe survivor Ienglh
if SOVA is enpIoyed, as proposed in Wang (2OO3c). The lvo SISO decoders vork on lvo
differenl dala segnenls (vilh overIap) al lhe sane line. Iig. 12 l) shovs ( a porlion of) a
sIiding-vindov-lased area-efficienl nuIli-IeveI paraIIeI Turlo decoding dala fIov, vhere
High-Speed VLS Architectures for Turbo Decoders 165

SISO decoders responsilIe for lvo adjacenl dala segnenls are processing dala in reverse
direclions in order lo reuse sone of lhe previousIy conpuled lranch nelrics and slale
nelrics. Ior olher paraIIeI decoding schenes vilh various lrade-offs, inleresled reader are
referred lo Wang (2OO1) and Zhang (2OO4).

In lhis seclion, ve viII focus on area-efficienl lvo-IeveI paraIIeI decoding schenes. Il can le
olserved fron Iig. 12.a) lhal lvo dala accesses per cycIe are required for lolh lhe receiver
luffer and lhe inlerIeaver nenory (assuning lvo ping-pong luffers are used for lhe
inlerIeaver nenory). Using a duaI-porl nenory is definileIy nol an efficienl soIulion since a
nornaI duaI-porl nenory consunes as nuch area as lvo singIe-porl nenories vilh lhe
sane nenory size.

4.2 Area-efficient paraIIeI decoding architectures
In order lo nainlain lhe lolaI hardvare of lhese nenories, ve choose lo parlilion each
nenory inlo nuIlipIe segnenls lo supporl nuIlipIe dala accesses per cycIe.

Menory parlilioning can le done in various vays. An easy and yel effeclive vay is lo
parlilion lhe nenory according lo a nunler of Ieasl significanl lils (Isls) of lhe nenory
address. To parlilion a nenory inlo lvo segnenls, il can le done according lo lhe Ieasl
significanl lil (Isl). Ior lhe resuIlanl lvo nenory segnenls, one conlains dala vilh even
addresses and lhe olher conlains dala vilh odd addresses.

The nenory parlilioning can aIso le done in a fancy vay, c.g., lo parlilion lhe nenory inlo
2 segnenls, Iel lhe 1sl nenory segnenl conlain dala vilh addresses
0 1 2
b b b =|O, 2, 5, 7, and
lhe 2nd segnenl conlain dala vilh addresses
0 1 2
b b b =|1, 3, 4, or 6, vhere
0 1 2
b b b denoles
lhe 3 Isls. Depending on lhe appIicalions, differenl parlilioning schenes nay Iead lo
differenl hardvare requirenenls.

Ior lhe lvo nenory segnenls, il is possilIe, in principIe, lo supporl lvo dala accesses
vilhin one cycIe. Hovever, il is generaIIy inpossilIe lo find an efficienl parlilioning schene
lo ensure lhe largel addresses (for lolh Load and Wrile operalions) are aIvays Iocaled in
differenl segnenls al each cycIe lecause lhe Turlo decoder processes lhe dala in differenl
orders during differenl decoding phases. Consider a sinpIe exanpIe, given a sequenliaI
dala sequence |O, 1, 2, 3, 4, 5, 6, 7, lhe inlerIeaved dala sequence is assuned as |2, 5, 7, 3, 1, O,
6, 4. If ve parlilion lhe nenory inlo lvo segnenls according lo lhe sequenliaI decoding
phase, i.c., one segnenl conlains dala sequence |O, 1, 2, 3 and lhe olher has |4, 5, 6 7.
During lhe sequenliaI phase, SISO-1 vorks on dala sel |O, 1, 2, 3 and SISO-2 vorks on |4, 5,
6, 7. So lhere is no dala confIicl (nole: lhe overIap lelveen lvo segnenls for paraIIeI
decoding is ignored in lhis sinpIe exanpIe). Hovever, during lhe inlerIaved decoding
phase, SISO-1 viII process dala in sel |2, 5, 7, 3 and SISO-2 viII process dala in sel |1, O, 6,
4, lolh in sequenliaI order. Il is easy lo see lhal, al lhe firsl cycIe, lolh SISO decoders
require dala Iocaled in lhe firsl segnenl (1
sl
and 2
nd
dala in lhe inpul dala sequence). Thus a
nenory access confIicl occurs. More delaiIed anaIysis can le found in Wang (2OO3l). The
nenory access issue in paraIIeI decoding can le nuch vorse vhen nuIlipIe rales and
VLS 166


nuIlipIe lIock sizes of Turlo codes are supporled in a specific appIicalion, c.g., in 3CII
CDMA syslens.

In principIe, lhe dala access confIicl prolIen can le avoided if lhe Turlo inlerIeaver is
specificaIIy designed. A generaIized even-odd inlerIeave is defined leIov:
- AII even indexed dala are inlerIeaved lo odd addresses,
- AII odd indexed dala are inlerIeaved lo even addresses.
ack lo lhe previous exanpIe, an even-odd inlerIeaver nay have an inlerIeaved dala
sequence as |3, 6, 7, 4, 1, O, 5, 2.

Wilh an even-odd inlerIeaver, ve can parlilion each nenory inlo lvo segnenls: one
conlains even-addressed dala and lhe olher vilh odd-addressed dala.


Iig. 13. Dala processing in eilher decoding phase vilh an even-odd inlerIeaver.

As shovn in Iig. 13, lhe dala are processed in an order of |even address, odd address, even ,
odd, . in Ihase A (i.c., lhe sequenliaI phase) and an order of |odd address, even address,
odd, even, .. in Ihase (i.c., lhe inlerIeaved phase). If ve Iel SISO-1 and SISO-2 slarl
processing fron differenl nenory segnenls, lhen lhere vouId le no dala access confIicl
during eilher decoding phase. In olher vords, il is guaranleed lhal lolh SISO decoders
aIvays access dala Iocaled in differenl nenory segnenls.

More conpIicaled inlerIeavers can le designed lo supporl nuIlipIe (M>=2) dala accesses
per cycIe. The inleresled readers are referred lo Wang (2OO3c), He (2OO5), CiuIielli (2OO2),
ougard (2OO3), and Kvak (2OO3) for delaiIs.

In reaI appIicalions, lhe Turlo inlerIeaver is usuaIIy nol free lo design. Ior exanpIe, lhey are
fixed in WCDMA and CDMA 2OOO syslens. Thus, a generic paraIIeI decoding archileclure is
desired lo acconnodale aII possilIe appIicalions. Here ve propose an efficienl nenory
arlilralion schene lo resoIve lhe prolIen of dala access confIicl in paraIIeI Turlo decoding.

The fundanenlaI concepl is lo parlilion one singIe-porl nenory inlo S (S>=2) segnenls and
use ( >=2) snaII luffers lo assisl reading fron or vriling dala lack lo lhe nenory, vhere
S does nol have lo le lhe sane as . Ior design sinpIicily, lolh S and are nornaIIy chosen
lo le a pover of 2. S is leller chosen lo le a nuIlipIe of . In lhis chapler, ve viII consider
High-Speed VLS Architectures for Turbo Decoders 167

onIy one sinpIe case, i.c., S=2, =2. Ior aII olher cases of =2, lhey can le easiIy exlended
fron lhe iIIuslraled case. Hovever, for cases vilh >2, lhe conlroI circuilry viII le nuch
nore conpIicaled.

Ior lhe receiver luffer, onIy lhe Load operalion is invoIved vhiIe lolher Load and Wrile
operalions are required for lhe inlerIeaver nenory. So ve viII focus our discussion on lhe
inlerIeaver nenory.





















Iig. 14. The area-efficienl paraIIeI decoding archileclure.

Wilh regard lo lhe inlerIeaver nenory, ve assune lvo ping-pong luffers (i.e., RAM1 and
RAM2 shovn in Iig.14) are used lo ensure lhal one Load and one Wrile operalions can le
conpIeled vilhin one cycIe. Lach luffer is parlilioned inlo lvo segnenls: Seg-A conlains
even addressed dala and Seg- conlains odd-addressed dala. Ior sinpIicily, ve use Seg-A1
lo represenl lhe even-addressed parl of RAM1, Seg-1 lo represenl lhe odd-addressed parl
of RAM1. The siniIar represenlalions are used for RAM2 as veII.

An area-efficienl 2-paraII Turlo decoding archileclure is shovn in Iig. 14. The inlerIeaver
address generalor (IAC) generales lvo addresses for lvo SISO decoders respecliveIy al each
cycIe. They nusl leIong lo one of lhe foIIoving cases: (1) lvo even addresses, (2) lvo odd
addresses and (3) one even and one odd address. In Case 1, lvo addresses are pul inlo Read
Address uffer 1 (RA1), In Cases 2, lvo addresses are pul in Read Address luffer 2
(RA2). In Cases 3, lhe even address goes lo RA1 and lhe odd address goes lo RA2.

VLS 168


A snaII luffer caIIed Read Index uffer (RI) is inlroduced in lhis archileclure. This luffer
is lasicaIIy a IIIO (firsl-in firsl-oul). Tvo lils are slored al each enlry. The dislincl four
vaIues represenled ly lhe lvo lils have foIIoving neanings:
- OO: 1sl and 2nd even addresses for SISO-1 and SISO-2 respecliveIy,
- 11: 1sl and 2nd odd addresses for SISO-1 and SISO-2 respecliveIy,
- O1: lhe even address is used for SISO-1 and lhe odd address for SISO-2,
- 1O: lhe even address is used for SISO-1 and lhe odd address for SISO-2.
Afler decoding for a nunler of cycIes (c.g., 4 ~ 8 cycIes), lolh RA1 and RA2 viII have a
snaII anounl of dala. Then lhe dala fron lhe lop of RA1 and RA2 are used respecliveIy
as lhe addresses lo Seg-A1 and Seg-1 respecliveIy. Afler one cycIe, lvo previousIy slored
exlrinsic infornalion synloIs are Ioaded lo Load Dala uffer 1 (LA1) and Load Dala
uffer 2 (LD2). Then lhe dala slored in RI viII le used lo direcl lhe oulpuls fron lolh
Ioad luffers lo feed lolh SISO decoders. Ior exanpIe, if lhe dala al lhe lop of RI is OO, lhen
LD1 oulpul lvo dala lo SISO-1 and SISO-2 respecliveIy. The delaiIed aclions of address
luffers and dala luffers conlroIIed ly RI for Ioading dala are sunnarized in TalIe 3.

VaIue of
RI

Aclion of address luffers Aclion of dala luffers
O RA1 Ioad nolhing, RA2
Ioads 2 addresses, 1sl for Seg-
1 and 2nd for Seg-2
LD1 oulpul nolhing, LD2 oul 2
dala, 1sl for SISO1 and 2nd for
SISO2
1 RA1 Ioad 1 address for Seg-
1, RA2 Ioad 1 address for
Seg-2
LD1 oulpul 1 for SISO1 and
LD2 oul 1 for SISO2
2 RA2 Ioad nolhing, RA1
Ioad 2 addresses, 1sl for Seg-1
and 2nd for Seg-2
LD1 oulpul 2 dala, 1sl for SISO1
and 2nd for SISO2
3 RA1 Ioad 1 address for Seg-
1, RA2 Ioad 1 for Seg-2
LD1 oulpul 1 for SISO2 and
LD2 oul 1 for SISO1
TalIe 3. Sunnary of Ioading dala operalions.

A3
A1 B1
B2
A2
B3
A3
A1 B1
B2
A2
B3
A4
B4
A1 B1
B2
A2
A1 B1
A3
A1 B1
B2
A2
B3
A4
B4
A5
B5
t=1 t=2 t=3 t=+ t=S

(a) Ioad addresses lo RA1 and RA2

High-Speed VLS Architectures for Turbo Decoders 169

A1 B1
B2
A2 B3
A3
A1 B1
B2
A2
B3
A4
A1 B1
A2
A1 B1
A3
A1 B1
B2
A2
B3
A4
B4
A5
B5
B3
A4
A5
t=2 t=3 t=+ t=S t=6

(l) Ioad dala lo RD1 and RD2

A3
B2
A2 B3
A4
B4
A5
B5
A3
A1 B1
B2
A2 B3
A4
B4
A5
B5
A3
B3
A4
B4
A5
B5
A4
B4
A5
B5
A5
B5
t=6 t=7 t=8 t=9
t=10

(c) oulpul dala lo SISO1 and SISO2
Iig. 15. Irocedure of Ioading dala fron nenory lo SISO decoders.

A sinpIe exanpIe is shovn in Iig. 15 lo iIIuslrale lhe delaiIs of Ioading dala fron nenory
lo SISO decoders, vhere, for inslance, A3 (5) denoles lhe address of dala lo le Ioaded or
lhe dala ilseIf corresponding lo nenory segnenl A (segnenl ) for lhe 3rd (5lh) dala in lhe
processing sequence. Iig. 15 (a) shovs lhe delaiIs of feeding lvo dada addresses lo lvo
address luffers al every cycIe. Iig. 15 (l) shovs lhe delaiIs of feeding one dala lo each dala
luffer al every cycIe. The delaiIs of oulpulling lhe required lvo dala lo lolh SISO decoders
are shovn in Iig. 15 (c). As can le seen, lolh SISO1 and SISO2 can gel lheir required dala
lhal nay le Iocaled in lhe sane nenory segnenl al lhe sane cycIe (c.g., l=7 in lhis
exanpIe). A liny dravlack of lhis approach is lhal a snaII fixed Ialency is inlroduced.
IorlunaleIy lhis Ialency is negIigilIe conpared vilh lhe overaII decoding cycIes for a vhoIe
Turlo code lIock in nosl appIicalions.

In generaI (c.g., vhen sIiding-vindov-lased MAI aIgorilhn is enpIoyed for SISO
decoders), lhe vrile sequence for each SISO decoder is lhe deIayed version of lhe read
sequence. So lhe vrile index luffer and vrile address luffers can le avoided ly using deIay
Iines as shovn in Iig. 14.

Al each cycIe, lvo exlrinsic infornalion synloIs are generaled fron lhe lvo SISO decoders.
They nay lolh feed Wrile Dala uffer 1 (WDI1), or lolh feed Wrile Dala uffer 2 (WDI2),
or one feeds WDI1 and lhe olher feeds WDI2. The acluaI aclion is conlroIIed ly lhe
deIayed oulpul fron RI.

VLS 170


SiniIar lo Ioading dala, afler lhe sane deIay (c.g., 4 ~ 8 cycIes) fron lhe firsl oulpul of eilher
SISO decoder, lhe dala fron WD1 and WD2 viII le vrillen lack lo Seg-1 and Seg-2
respecliveIy. In lhe nexl decoding phase, RAM-1 and RAM-2 viII exchange roIes. So sone
exlra MUXs nusl le enpIoyed lo faciIilale lhis funclionaI svilch.

5.3 ImpIementation issues and simuIation resuIts
The inleger nunlers shovn in Iig. 14 indicaled lhe dala fIov associaled vilh each luffer.
Ior exanpIe, |O, -1, -2 is shovn aIong lhe oulpul Iine of LD1, vhich neans LD1 nay
oulpul O, 1, or 2 dala al each cycIe. Il can le olserved lhal aII dala and address luffers vork
in a siniIar vay. One connon fealure of lhese luffers is lhal lhey have a conslanl dala fIov
(+1 or -1) al one end (Load or Wrile) vhiIe having an irreguIar dala fIov (|O, -1, -2 or |O, +1,
+2) al lhe olher end (Wrile or Load). On lhe average, inconing and oulgoing dala are
IargeIy laIanced. So lhe luffer sizes can le very snaII.

The RI can le inpIenenled vilh a shifl regisler as il has a reguIar dala fIov in lolh ends
vhiIe a 3-porl nenory suffices lo fuIfiII lhe funclions of lhe resl luffers.

Il has leen found fron our cycIe-accurale sinuIalions lhal il is sufficienl lo choose a luffer
Ienglh of 25 for aII luffers if lhe proposed 2-paraIIeI archileclure is appIied in eilher
WCDMA or CDMA2OOO syslens. The naxinun Turlo lIock size for WCDMA syslen is
approxinaleIy 5K lils. The Iovesl code rale is 1/3. Assune lolh lhe received sofl inpuls
and lhe exlrinsic infornalion synloIs are expressed as 6 lils per synloI.
- The overaII nenory requirenenl is 5K*(2+3)*6=15OK lils.
- The lolaI overhead of snaII luffers is approxinaleIy 25*3*2*(3*6+2*6) +
25*3*13*2~= 6K lils, vhere lhe faclor 3 accounls for 3 porls of nenory, 13
represenls each address conlains 13 linary lils.

Il can le seen lhal lhe overhead is aloul 4 of lolaI nenory. Wilh CDMA2OOO syslens, lhe
overhead occupies an even snaIIer percenlage in lhe lolaI hardvare lecause lhe naxinun
Turlo lIock size is severaI lines Iarger.

Ior WCDMA syslens, il is reasonalIe lo assune lhe lolaI area of a Log-MAI decoder
consunes Iess lhan 15 of lolaI hardvare of a seriaI Turlo decoder. Thus, ve can achieve
lvice lhe lhroughpul vilh lhe proposed 2-paraIIeI decoding archileclure vhiIe spending
Iess lhan 2O hardvare overhead. If SOVA is enpIoyed in lhe SISO decoder, lhe overhead
couId le Iess lhan 15. In eilher case, lhe percenlage of lhe overhead viII le even snaIIer in
CDMA2OOO syslens.

Assune 6 ileralions are perforned al lhe nosl, lhe proposed 2-IeveI paraIIeI archileclure, if
inpIenenled vilh TSMC O.18 un lechnoIogy, can achieve a nininun decoding lhroughpul
of 37O M*2/(6*2) > 6O Mlps. If slale-of-lhe-arl CMOS lechnoIogy (c.g., 65nn CMOS) is used,
ve can easiIy achieve 1OOMlps dala-rale vilh lhe proposed 2-paraIIeI decoding archileclure.

If an 4-paraIIeI archileclure is enpIoyed, over 2OOMlps dala rale can le ollained, lhough a
direcl exlension of lhe proposed 2-paraIeII archileclure viII significanlIy conpIicale lhe
conlroI circuilry.
High-Speed VLS Architectures for Turbo Decoders 171

Il is vorlhvhiIe lo nole, vhen lhe largel lhroughpul is noderaleIy Iov, lhe proposed area-
efficienl paraIIeI Turlo decoding archileclure can le used for Iov pover design. The reason
is lhal lhe forner archileclure can vork al a nuch Iover cIock frequency and lhe suppIy
voIlage can lhus le reduced significanlIy.

6. ConcIusions

NoveI fasl recursion archileclure for Log-Map decoders have leen presenled in lhis chapler.
LxperinenlaI resuIls have shovn lhal lhe proposed fasl recursion archileclures can increase
lhe process speed significanlIy over lradilionaI designs vhiIe nainlaining lhe decoding
perfornance. As a nore pover-efficienl approach lo increase lhe lhroughpul of Turlo
decoder, area-efficienl paraIIeI Turlo decoding schenes have leen addressed. A hardvare-
efficienl 2-paraIIeI decoding archileclure for generic appIicalions is presenled in delaiI. Il has
leen shovn lhal lvice lhe lhroughpul of a seriaI decoding archileclure can le ollained vilh
an overhead of Iess lhan 2O of an enlire Turlo decoder. The proposed nenory
parlilioning lechniques logelher vilh lhe efficienl nenory arlilralion schenes can le
exlended lo nuIli-IeveI paraIIeI Turlo decoding archileclures as veII.



7. References

3rd Ceneralion Iarlnership Irojecl (3CII), TechnicaI specificalion group radio access
nelvork, nuIlipIexing and channeI coding (TS 25.212 version 3.O.O),
hllp://vvv.3gpp.org.
3rd Ceneralion Iarlnership Irojecl 2 (3CII2), hllp://vvv.3gpp2.org.
A. CiuIielli c| a|. (2OO2), IaraIIeI Turlo code inlerIeavers: Avoiding coIIisions in
accesses lo slorage eIenenls, ||cc|rcn. |c||., voI. 38, no. 5, pp. 232-234, Iel.
2OO2.
A. }. Vilerli. (1998). An inluilive juslificalion of lhe MAI decoder for convoIulionaI codes,
|||| ]. Sc|cc|. Arcas Ccmmun., voI.16, pp. 26O-264, Ielruary 1998.
A. Raghupalhy. (1998). Lov pover and high speed aIgorilhns and VLSI archileclures for
error conlroI coding and adaplive video scaIing, Ih.D. disserlalion, Unit. cf
Marq|an, Cc||cgc Par|, 1998.
. ougard c| a|. (2OO3). A scaIalIe 8.7 n}/lil 75.6 Ml/s paraIIeI concalenaled
convoIulionaI (Turlo-) codec, in |||| |SSCC Dig. Tccn. Papcrs, 2OO3, pp.
152-153.
C. errou, A. CIavieux & I. Thilinajshia. (1993). Near Shannon Iinil error correcling coding
and decoding: Turlo codes, |CC93, pp 1O64-7O.
L. ouliIIon, W. Cross & I. CuIak. (2OO3). VLSI archileclures for lhe MAI aIgorilhn, ||||
Trans. Ccmmun., VoIune 51, Issue 2, Iel. 2OO3, pp. 175 - 185.
L. Yeo, S. Augslurger, W. Davis & . NikoIic. (2OO3). A 5OO-Ml/s sofl-oulpul Vilerli
decoder, ILLL }ournaI of SoIid-Slale Circuils, VoIune 38, Issue 7, }uIy 2OO3,
pp:1234 - 1241.
VLS 172


H. Suzuki, Z. Wang & K. K. Iarhi. (2OOO). A K=3, 2Mlps Lov Iover Turlo Decoder for 3rd
Ceneralion W-CDMA Syslens, in Prcc. |||| 1999 Cus|cm |n|cgra|c Circui|s Ccnf.
(C|CC), 2OOO, pp 39-42.
}. Hagenauer & I. Hoher. (1989). A Vilerli aIgorilhn vilh sofl decision oulpuls and ils
appIicalions, |||| G|O8|COM, DaIIas, TX, USA, Nov. 1989, pp 47.1.1-7.
}. Kvak & K Lee (2OO2). Design of dividalIe inlerIeaver for paraIIeI decoding in
lurlo codes. ||cc|rcnics |c||crs, voI. 38, issue 22, pp. 1362-64, Ocl. 2OO2.
}. Kvak, S. M. Iark, S. S. Yoon & K Lee (2OO3). InpIenenlalion of a paraIIeI Turlo
decoder vilh dividalIe inlerIeaver, |SCAS03, voI. 2, pp. II-65- II68, May
2OO3.
K. K. Iarhi. (1999). VLSI DigilaI signaI Irocessing Syslens, ]cnn Ii|cq c Scns, 1999.
L.ahI, }.}eIinek, }.Raviv & I.Raviv. (1974). OplinaI Decoding of Linear Codes for
nininizing synloI error rale, |||| Trans.s |nf. Tnccrq, voI. IT-2O, pp.284-287,
March 1974.
M. ickerslaff, L. Davis, C. Thonas, D. Carrel & C. NicoI. (2OO3). A 24 Ml/s radix-4
LogMAI Turlo decoder for 3 CII-HSDIA noliIe vireIess, in Prcc. |||| |SSCC
Dig. Tccn. Papcrs, 2OO3, pp. 15O-151.
I. Urard el aI. (2OO4). A generic 35O Ml/s Turlo codec lased on a 16-slale Turlo decoder, in
Prcc. |||| |SSCC Dig. Tccn. Papcrs, 2OO4, pp. 424-433.
S. Lee, N. Shanlhag & A. Singer. (2OO5). A 285-MHz pipeIined MAI decoder in O.18 un
CMOS, |||| ]. Sc|i-S|a|c Circui|s, voI. 4O, no. 8, Aug. 2OO5, pp. 1718 - 1725.
T. Miyauchi, K. Yananolo & T. Yokokava. (2OO1). High-perfornance progrannalIe SISO
decoder VLSI inpIenenlalion for decoding Turlo codes, in Prcc. |||| G|c|a|
Tc|cccmmunica|icns Ccnf., voI. 1, 2OO1, pp. 3O5-3O9.
T.C. Denk & K.K. Iarhi. (1998). Lxhauslive ScheduIing and Relining of DigilaI SignaI
Irocessing Syslens, |||| Trans. Circui|s an Sqs|. ||, voI. 45, no.7, pp. 821-838, }uIy
1998
W. Cross & I. C. CuIak. (1998). SinpIified MAI aIgorilhn suilalIe for inpIenenlalion of
Turlo decoders, ||cc|rcnics |c||crs, voI. 34, no. 16, pp. 1577-78, Aug. 1998.
Y. Wang, C. Iai & X. Song. (2OO2). The design of hylrid carry-Iookahead/carry-seIecl
adders, |||| Trans. Circui|s an Sqs|. ||, voI.49, no.1, }an. 2OO2, pp. 16 -24
Y. Wu, . D. Woerner & T. K. Iankenship, Dala vidlh requirenenl in SISO decoding vilh
noduIe nornaIizalion, |||| Trans. Ccmmun., voI. 49, no. 11, pp. 1861-1868, Nov.
2OO1.
Y. Zhang & K. Iarhi (2OO4). IaraIIeI Turlo decoding, Iroceedings of lhe 2OO4 InlernalionaI
Synposiun on Circuils and Syslens (ISCASO4), VoIune 2, 23-26 May 2OO4, pp: II -
5O9-12 VoI.2.
Z Wang, Y. Tan & Y. Wang. (2OO3l). Lov Hardvare ConpIexily IaraIIeI Turlo Decoder
Archileclure, |SCAS2003, voI. II, pp. 53-56, May 2OO3.
Z. He, S. Roy & Iorlier (2OO5). High-Speed and Lov-Iover Design of IaraIIeI
Turlo Decoder Circuils and Syslens, |||| |n|crna|icna| Sqmpcsium cn
Circui|s an Sqs|cms (|SCAS05), 23-26 May 2OO5, pp. 6O18 - 6O21.
Z. Wang & K. Iarhi. (2OO3a). Lfficienl InlerIeaver Menory Archileclures for SeriaI Turlo
Decoding, |CASSP2003, voI. II, pp. 629-32, May 2OO3.
Z. Wang & K. Iarhi. (2OO3c). High Ierfornance, High Throughpul Turlo/SOVA Decoder
Design, |||| Trans. cn Ccmmun., voI. 51, no 4, ApriI 2OO3, pp. 57O-79.
High-Speed VLS Architectures for Turbo Decoders 173

Z. Wang, H. Suzuki & K. K. Iarhi. (1999). VLSI inpIenenlalion issues of Turlo decoder
design for vireIess appIicalions, in Prcc. |||| Icr|sncp cn Signa| Prcccss. Sqs|.
(SiPS), 1999, pp. 5O3-512.
Z. Wang, Z. Chi & K. K. Iarhi. (2OO1). Area-Lfficienl High Speed Decoding Schenes for
Turlo/MAI Decoders, in Prcc. |||| |CASSP'2001, pp 2633-36, voI. 4, SaIl Lake
Cily, Ulah, 2OO1.
Z. Wang. (2OOO). Lov conpIexily, high perfornance Turlo decoder design, Ih.D.
disserlalion, Unitcrsi|q cf Minncsc|a, Aug. 2OOO.
Z. Wang. (2OO7). High-Speed Recursion Archileclures for MAI-ased Turlo Decoders, in
|||| Trans. cn V|S| Sqs|., voI. 15, issue 4, pp: 47O-74, Apr. 2OO7.








VLS 174
Ultra-High Speed LDPC Code Design and mplementation 175
X

UItra-High Speed LDPC Code
Design and ImpIementation

}in Sha
1
, Zhongfeng Wang
2
and MingIun Cao
1

1
Nanjing Unitcrsi|q,
2
8rcaccm Ccrpcra|icn
1
Cnina,
2
U.S.A.

1. Introduction

The digilaI connunicalions are uliquilous and have provided lrenendous lenefils lo
everyday Iife. Lrror Correclion Codes (LCC) are videIy appIied in nodern digilaI
connunicalion syslens. Lov-Densily Iarily-Check (LDIC) code, invenled ly CaIIager
(1962) and rediscovered ly MacKay (1996), is one of lhe lvo nosl pronising near-oplinaI
error correclion codes in praclice. Since ils rediscovery, significanl inprovenenls have leen
achieved on lhe design and anaIysis of LDIC codes lo furlher enhance lhe connunicalion
syslen perfornance. Due lo ils oulslanding error-correcling perfornance (Chung 2OO1),
LDIC code has leen videIy considered in nexl generalion connunicalion slandards such
as ILLL 8O2.16e, ILLL 8O2.3an, ILLL 8O2.11n, and DV-S2. LDIC code is characlerized ly
sparse parily check nalrix. One key fealure associaled vilh LDIC code is lhe ileralive
decoding process, vhich enalIes LDIC decoder lo achieve oulslanding perfornance vilh
noderale conpIexily. Hovever, lhe ileralive process direclIy Ieads lo Iarge hardvare
consunplion and Iov lhroughpul. Thus efficienl Very Large ScaIe Inlegralion (VLSI)
inpIenenlalion of high dala rale LDIC decoder is very chaIIenging and crilicaI in praclicaI
appIicalions.
A salisfying LDIC decoder usuaIIy neans: good error correclion perfornance, Iov
hardvare conpIexily and high lhroughpul. There are various exisling design nelhods
vhich usuaIIy encounler lhe Iinilalions of high rouling overhead, Iarge nessage nenory
requirenenl or Iong decoding Ialency. To inpIenenl lhe decoder direclIy in ils inherenl
paraIIeI nanner nay gel lhe highesl decoding lhroughpul. ul for Iarge codevord Ienglhs
(e.g., Iarger lhan 1OOO lils), lo avoid rouling confIicl, lhe conpIex inlerconneclion nay lake
up nore lhan haIf of lhe chip area. olh seriaI and parlIy paraIIeI VLSI archileclures are veII
sludied novadays. Hovever, none of lhese approaches are good for very high lhroughpul
(i.e., nuIli-Cl/s) appIicalions.
In lhis chapler, ve presenl lhe conslruclion of a nev cIass of inpIenenlalion-orienled LDIC
codes, naneIy shifl-LDIC codes lo soIve lhese issues inlegraIIy. Shifl-LDIC codes have
leen shovn lo perforn as veII as conpuler generaled randon codes foIIoving sone
oplinizalion ruIes. A specific high-speed decoder archileclure largeling for nuIli-Cl/s
appIicalions naneIy shifl decoder archileclure, is deveIoped for lhis cIass of codes. In
9
VLS 176

conlrasl vilh convenlionaI decoder archileclures, lhe shifl decoder archileclure has lhree
najor nerils:
1) Menory efficienl. y expIoring lhe speciaI fealures of lhe nin-sun decoding aIgorilhn,
lhe proposed archileclure slores lhe nessage in a conpacl vay vhich nornaIIy Ieads lo
approxinaleIy 5O savings on nessage nenory over lhe convenlionaI design for high rale
codes.
2) Lov rouling conpIexily. Through inlroducing sone noveI check node infornalion
lransfer nechanisns, lhe conpIex nessage passing lelveen varialIe nodes and check
nodes can le aIIevialed ly lhe reguIar connunicalion lelveen check nodes and lhus lhe
conpIex gIolaI inlerconneclion nelvorks can le repIaced ly IocaI vires. One inporlanl facl
is lhal in lhe nev archileclure, check node processing unils have fev and fixed conneclions
vilh varialIe node processing unils.
3) High paraIIeI IeveI of decoding. The archileclure can nornaIIy expIoil nore IeveIs of
paraIIeIisn in lhe decoding aIgorilhn lhan convenlionaI parliaIIy paraIIeI decoder
archileclures. In addilion, lhe decoding paraIIeI IeveI can le furlher IinearIy increased and
lhe crilicaI palh can le significanlIy reduced ly proper pipeIining.
The chapler is organized as foIIovs. Seclion 2 gives an overviev of LDIC codes, vilh lhe
nain allenlion leing paid lo lhe decoding aIgorilhn and decoder archileclure design.
Seclion 3 inlroduces lhe code conslruclion of shifl-LDIC codes and Seclion 4 presenls lhe
decoder design vilh shifl decoder archileclure and denonslrales lhe lenefils of proposed
lechniques. In Seclion 5, ve consider lhe appIicalion of shifl archileclure lo sone veII
knovn LDIC codes such as RS-lased LDIC codes and QC-LDIC codes. Seclion 6 concIudes
lhe chapler.

2. Review of LDPC codes

2.1 LDPC decoding aIgorithms
LDIC codes are a sel of Iinear lIock codes corresponding lo lhe MN parily check nalrix H,
vhich has very Iov densily of 1s. M is lhe parily check nunler and N is lhe codevord
Ienglh. Wilh lhe aid of lhe liparlile graph caIIed Tanner graph (Tanner 1981), LDIC codes
can le effecliveIy represenled. There are lvo kinds of nodes in Tanner graph, varialIe nodes
and check nodes, corresponding lo lhe coIunns and rovs of parily check nalrix
respecliveIy. Check node f
i
is connecled lo varialIe node c
j
onIy vhen lhe eIenenl n
ij
of H is
a 1. Iig.1 shovs lhe parily check nalrix H of an (8, 4) reguIar LDIC code and ils Tanner
graph. A LDIC code is caIIed (c, |) reguIar if lhe nunler of 1s in every coIunn is a conslanl
c and lhe nunler of 1s in every rov is aIso a conslanl |.

c
2
c
3
c
4
c
5
c
6
c
7
c
8
f
1
c
1
f
2
f
3
f
4

(a)
Ultra-High Speed LDPC Code Design and mplementation 177

1 0 1 0 1 0 1 0
1 0 0 1 0 1 0 1
0 1 1 0 0 1 1 0
0 1 0 1 1 0 0 1
(
(
(
=
(
(

H

(l)
Iig. 1. An LDIC code exanpIe. (a) Tanner graph. (l) Iarily check nalrix.

The lypicaI LDIC decoding aIgorilhn is lhe Sun-Iroducl (or leIief propagalion) aIgorilhn.
Afler varialIe nodes are iniliaIized vilh lhe channeI infornalion, lhe decoding nessages are
ileraliveIy conpuled ly aII lhe varialIe nodes and check nodes and exchanged lhrough lhe
edges lelveen lhe neighlouring nodes (Kschischang 2OO4).
The nodified nin-sun decoding aIgorilhn sludied in CuiIIoud (2OO3), Chen (2OO5) and
Zhao (2OO5) is siniIar lo lhe Sun-Iroducl aIgorilhn, vilh a sinpIificalion of check node
process. Il has sone advanlages in inpIenenlalion over lhe Sun-Iroducl aIgorilhn, such as
Iess conpulalion conpIexily and no requirenenl of knovIedge of SNR (signaI-lo-noise
ralio) for AWCN channeIs. In lhe nodified nin-sun decoding aIgorilhn, lhe check node
conpules lhe check-lo-varialIe nessages R
ct
as foIIovs:
o
e
e
=
[
( )\
( )\
( ) nin
ct tc tc
n N c t
n N c t
R sign | |
(1)
vhere is a scaIing faclor around O.75, |
tc
is lhe varialIe-lo-check nessages, N(c) denoles
lhe sel of varialIe nodes lhal parlicipale in c-lh check node. The varialIe node conpules lhe
varialIe-lo-check nessages |
tc
as lhe foIIoving:
v
c v M m
mv vc
I R L + =
_
e \ ) (
(2)
vhere M(t)\c denoles lhe sel of check nodes connecled lo lhe varialIe node t excIuding lhe
varialIe node c, |
t
denoles lhe inlrinsic nessage of varialIe node t.

2.2 ImpIementation options
Iron lhe decoding aIgorilhn descriled alove, il can le seen lhal lhe caIcuIalions lelveen
differenl varialIe nodes or check nodes are independenl al each ileralion. Therefore, LDIC
codes are especiaIIy suilalIe for paraIIeI inpIenenlalion. In addilion, lhe conpulalion in lhe
decoder is fairIy sinpIe, vhich neans Iov Iogic consunplion. The nain olslacIe in lhe
inpIenenlalion of fuIIy paraIIeI LDIC decoders is lhe conpIicaled inlerconneclion lelveen
varialIe node processing unils (VNU) and check node processing unils (CNU). This
inlerconneclion conpIexily is due lo lhe randon-Iike Iocalions of ones in lhe codes parily-
check nalrix and il is a cruciaI prolIen for praclicaI fuIIy paraIIeI decoders vhen lhe code
lIock Ienglh is Iarge. The rouling congeslion or inlerconneclion prolIen viII arise and lhus
causes high area consunplion and Iov Iogic uliIizalion in lhe decoder. Ior inslance, vilh 4-
lil precision of prolaliIily nessages, lhe 52.5 nn
2
die size of lhe (1O24, 512) decoder in
Ianksly (2OO2) has a Iogic uliIizalion of 5O in ils core and lhe resl of lhe core area is
occupied ly vires. There are sone approaches deveIoped lo aIIeviale lhis rouling
VLS 178

congeslion prolIen. Daraliha (2OO8) uses lil-seriaI archileclures and nessage lroadcasling
lechnique lo reduce lhe nunler of vires in fuIIy paraIIeI decoders. Sharifi Tehrani (2OO8)
presenls a slochaslic decoder design vhere lhe nessages or prolaliIilies are converled lo
slreans of slochaslic lils and conpIex prolaliIily operalions can le perforned on
slochaslic lils using sinpIe lil-seriaI slruclures.
The nore generaI nelhods lo aIIeviale lhe rouling prolIen is lhrough reducing lhe
paraIIeIisn (ly using parliaIIy paraIIeI or seriaI processing) and using slorage eIenenls lo
slore lhe inlernediale nessages passed aIong lhe edges of lhe graph (e.g., see Mansour
2OO2, Yeo 2OO3, Cocco 2OO4). Various approaches are invesligaled in lhe Iileralure al lolh
code design and hardvare inpIenenlalion IeveIs. One approach is lo design
inpIenenlalion-avare codes (e.g., see ouliIIon 2OOO, Zhang 2OO2, Mansour 2OO3a, Liao
2OO4, Zhang 2OO4, Sha 2OO9). In lhis approach, inslead of randonIy choosing lhe Iocalions of
ones in lhe parily-check nalrix al lhe code design slage, lhe parily-check nalrix of an LDIC
code is decided vilh conslrainls aIIoving a suilalIe slruclure for decoder inpIenenlalion
and providing acceplalIe decoding perfornance. In lhese cases, lhe prolIen lecones hov
lo reduce lhe conpIexily incurred ly lhe nessage slorage eIenenls and hov lo speedup lhe
decoding process. Ior exanpIe, one inporlanl sulcIass of LDIC codes vilh reguIar
slruclure is quasi-cycIic (QC) LDIC codes vhich has received lhe nosl allenlions and has
leen seIecled ly nany induslry slandards. Due lo lheir reguIar slruclure, QC-LDIC codes
Iend lhenseIves convenienlIy lo efficienl hardvare inpIenenlalions. Nol onIy lhe parlIy
paraIIeI decoder archileclures for QC-LDIC codes require sinpIer Iogic conlroI (e.g., see
Chen 2OO4l, Wang 2OO7) lul aIso lhe encoders can le efficienlIy luiIl vilh shifl regislers
(e.g., see Li 2OO6). Conslruclion of QC-LDIC codes vilh good error perfornances is
lherefore of lolh lheorelicaI and praclicaI inleresl. Various conslruclion nelhods of QC-
LDIC codes have leen proposed and lhe salisfying error correcling perfornances are
reporled (e.g., see Iossorier 2OO4, Chen 2OO4a).
On lhe olher hand, LDIC decoders can le inpIenenled vilh a progrannalIe archileclure
or processor, vhich Iend lhenseIves lo Soflvare Defined Radio (SDR). SDR offers fIexiliIily
lo supporl codes vilh differenl lIock Ienglhs and rales, hovever, Seo (2OO7) shovs lhal lhe
lhroughpul of SDR-lased LDIC decoders is usuaIIy Iov. In addilion lo digilaI decoders,
conlinuous line anaIog inpIenenlalions have aIso leen considered for LDIC codes.
Conpared lo lheir digilaI counlerparls, anaIog decoders offer inprovenenls in speed or
pover. Hovever, lecause of lhe conpIex and lechnoIogy-dependenl design process, lhe
anaIog approach has leen onIy considered for very shorl error-correcling codes, for exanpIe,
lhe (32, 8) LDIC code in Henali (2OO6).
In lhe foIIoving, a nev ensenlIe of LDIC codes, caIIed shifl-LDIC codes, viII le
inlroduced lo niligale lhe alove decoder inpIenenlalion prolIens.

3. Shift-LDPC Code Construction

3.1 Construction of Shift-LDPC Code
Shifl-LDIC code is conslrucled in a veII-reguIaled vay. A reguIar (N, M) (c, |) shifl LDIC
code exanpIe is shovn in Iig. 1, vilh N=|q and M=cq, vhere each sulnalrix has a
dinension of qq. As dispIayed, lhe parily check nalrix consisls of c| sulnalrices. The
slruclures of lhe Ieflnosl c sulnalrices H
i1
(1 < i < c) are randon coIunn pernulalions of
lhe idenlily nalrix. The olher sulnalrices are decided ly lhese H
i1
sulnalrices.
Ultra-High Speed LDPC Code Design and mplementation 179

(
(
(
(

=
ct c c
t
t
H H H
H H H
H H H
H

2 1
2 22 21
1 12 11

Iig. 2. The conslruclion of shifl-LDIC parily check nalrix and an exanpIe of lhe (2,3) shifl-
LDIC code

The 1s in lhe shifl LDIC code parily check nalrix are arranged as lhe exanpIe in Iig. 1. Al
firsl, lhe 1s in lhe Ieflnosl sulnalrices are arranged randonIy under lhe conslrainl of
keeping one 1 in each rov and one 1 in each coIunn. Thus lhe sulnalrix H
i1
is a
pernulalion nalrix of idenlily nalrix. Nexl, for each sulnalrix on lhe righl hand side, lhe
1s are cycIic-shifled up ly 1 space. The 1 al lhe lop is noved dovn lo lhe lollon. IinaIIy
ve can gel a (c, |) reguIar shifl-LDIC code.
Il shouId le noled lhal shifl-LDIC code is a kind of code speciaIIy designed for hardvare
inpIenenlalion. In facl, lhe shifl decoder archileclure is designed firsl and lhen lhe code
conslruclion is found. This foIIovs lhe decoder-firsl code design nelhodoIogy proposed
in ouliIIon (2OOO).

3.2 Code exampIes and performance
Civen a fixed lIock Ienglh, coIunn veighl, and rov veighl, lhe ensenlIe of shifl-LDIC
codes can have consideralIe varialions in perfornance. An effeclive nelhod lo find good
LDIC codes is lo find lhe codes vilh Iarge girlh. Cirlh is referred lo lhe Ienglh of lhe
shorlesl cycIe in a Tanner graph. A Iarge girlh generaIIy inproves lhe lil error perfornance
of lhe codes. Hence, LDIC codes vilh Iarge girlh are parlicuIarIy desired. y expIoiling lhe
slruclured properly of shifl-LDIC codes, an efficienl girlh oplinizalion aIgorilhn is
deveIoped. NornaIIy, ly using lhis girlh oplinizalion sofl progran, lhe cycIe-4 can le
eIininaled, and lhe cycIe-6 can le avoided for noderale code rale shifl-LDIC codes.
Through exlensive sinuIalion, ve found lhal lhe perfornance of lhe oplinized shifl LDIC
codes can le conparalIe lo lhe conpuler generaled randon codes. Iig. 2 shovs lhe
perfornance conparison lelveen a girlh oplinized (1OO8, 5O4) (3, 6) shifl-LDIC code, lhe
sane size rale-1/2 code dovnIoaded fron Mackeys velsile and a (96O,48O) irreguIar
WiMAX code as veII as conparison lelveen a (8192, 7168) (4, 32)-reguIar shifl-LDIC code
and a girlh oplinized QC-LDIC code. Il can le seen lhal lhe WiMAX code perforns leller
due lo ils irreguIar properly, and in olher cases lhe shifl-LDIC codes have conparalIe
perfornance. The perfornance of shifl-LDIC codes couId le furlher inproved ly
inlroducing Iiniled irreguIarily, e.g., lo zero oul sone of lhe sulnalrices, vhich viII cause
liny hardvare overhead.
Shifl-LDIC codes are a kind of inpIenenlalion orienled codes and are conslrucled afler lhe
decoder design. Thus in lhis chapler ve viII focus on lhe efficienl decoder design inslead of
lhe code perfornance. On lhe olher hand, lhe shifl decoder archileclure can le appIied lo a
Iarge nunler of knovn slruclured LDIC codes such as quasi cycIic LDIC codes and RS
VLS 180

lased LDIC codes. Ior exanpIe, QC-LDIC codes can le converled lo shifl archileclure
conpIianl code vilh proper nalrix pernulalion. Il viII le expIained Ialer in lhe chapler.


Iig. 3. Ierfornance of (1OO8, 5O4) (3, 6) shifl-LDIC code and (8192, 7168) (4, 32) shifl-LDIC
code conpared vilh lhe sane Ienglh WiMAX slandard code and conpuler generaled
randon codes

4. Shift LPDC Decoder Architecture

In lhis seclion, ve presenl lhe shifl decoder archileclure specificaIIy designed for shifl-LDIC
codes.

4.1 Decoding scheduIe
IirslIy, Iel us discuss aloul lhe decoding scheduIe. Ior a reguIar (N, M) (c, |) shifl-LDIC
code, ve assune q varialIe node processing unils and M check node processing unils are
inslanlialed in lhe decoder vhere q is lhe size of sulnalrix (q=N/|=M/c). Iig. 3 iIIuslrales
lhe decoding fIov for a sinpIe shifl-LDIC code exanpIe vilh q=4. The scheduIe can le
appIied lo olher cases vilh differenl q, c, and | paranelers.


Iig. 4. Decoding scheduIe for a sanpIe shifl-LDIC code
Ultra-High Speed LDPC Code Design and mplementation 181

The Ieflnosl q coIunns are processed firsl, lhen lhe second Iefl-nosl q coIunns, and so on.
So q coIunns are processed concurrenlIy in one cIock cycIe. The coIunn process and lhe rov
process are inlerIeaved. The vhoIe check node process is divided inlo | sleps. In each cIock
cycIe, q VNUs gel M check-lo-varialIe nessages and conpule lhe M varialIe-lo-check
nessages, so lhal M CNUs gel one nessage each, so each CNU can deaI vilh one slep of lhe
check node process. Wilh lhis decoding scheduIe, ve can finish one ileralion in | cIock
cycIes. Il is nornaIIy nuch fasler lhan lhe lradilionaI parlIy paraIIeI decoder archileclures.
y conlining lhe decoding scheduIe and lhe nin-sun aIgorilhn, lhe decoding process can
le expressed as leIov:

, 0,1, , 1;
0 -1
process oI th ~ ( 1) th columns
vc v
c
L I i N
k k a
kq k q
compute
= =
= =
+

Min-Sum Algorithm and Decoding Schedule
1: Initialization:
2: repeat
3: for to do
4: ]
4:
.
( )
( )
( ) ( )
:
cv
cv
cv nc vc
n N c
vc mv v cv
m M v
R fromrow process result of last iteration
magnituae of R minimum or 2na minimum
sign of R sign L sign L
for q columns process
L R I R
e
e

=
=

= +
[
_
5: -
6:
7:
8:
( )
; ; ;
vc
for all check noaes receive one L per row
upaate minimum 2na minimum location of minimum
recora signs
max iteration times reachea or conv
: 9:
10: -
11:
12: endfor
13: until ergence to a coaewora
aecoaea bit 14: Output:

TalIe 1. Decoding scheduIe designed for shifl decoder archileclure

4.2 OveraII decoder architecture
The overaII decoder lIock diagran is shovn in Iig. 4. q varialIe node processing unils and
M check node processing unils are inslanlialed in lhe decoder. The crilicaI parl of lhis
inpIenenlalion is lhe nelvork connecling VNUs and CNUs. The shuffIe nelvork A
lransnils varialIe-lo-check nessages, lhe shuffIe nelvork lransnils check-lo-varialIe
nessages. ShuffIe nelvork is lhe sane as shuffIe nelvork A excepl lhal lhe dala fIov is in
lhe reverse direclion.
In LDIC decoding aIgorilhns, CNUs onIy connunicale vilh VNUs. The rov processes of
differenl check nodes are independenl of each olher and so do lhe varialIe nodes processes.
In shifl-LDIC decoder archileclure, a noveI lIock caIIed CNU ccmmunica|icn nc|ucr| is
added inlo lhe decoder. This nev lIock lrings sone connunicalions lelveen CNUs. y
inlroducing lhese originaIIy unvanled connunicalions, lhe conpIexily of conneclions
VLS 182

lelveen CNUs and VNUs and lhus lhe conpIexily of shuffIe nelvorks can le significanlIy
reduced. This viII le shovn nexl.


Iig. 5. OveraII shifl decoder lIock diagran. Massages are ileraliveIy exchanged lelveen
VNUs and CNUs lhrough lhe shuffIe nelvork

Ior randon codes, lhe shuffIe nelvork rouling conpIexily is nornaIIy inloIeralIe, vhiIe for
shifl-LDIC codes ve inlroduced alove, lhe shuffIe nelvorks lecone reaIIy sinpIe. Through
lhe inlroduclion of lhe CNU ccmmunica|icn nc|ucr|, ve can ensure lhal each CNU processes
lhe varialIe-lo-check nessages fron and lransnil lhe check-lo-varialIe nessages lo a fixed
VNU during lhe enlire decoding process. Therefore, lhe shuffIe nelvork connecling CNUs
and VNUs onIy consisls of M*| vires, vhere ve assune each nessage is quanlized as | lils.
(NornaIIy lhe quanlizalion lils | is chosen as 6.) In conlrasl, lhe nunler of vires required
ly originaI LDIC decoding aIgorilhn is M*|*|. Ior high rale codes, for exanpIe, | can le as
Iarge as 32.
Iig. 5 expIains vhy each CNU aIvays conpules vilh lhe nessages fron a fixed VNU
during lhe enlire decoding process. Wilh lhe sinpIe shuffIe nelvork, CNU i is connecled
vilh VNU x. The rov process of each check node is separaled inlo | sleps vilh one varialIe-
lo-check nessage leing processed in each slep. Al lhe firsl cIock cycIe of an ileralion, having
received lhe nessage fron VNU x, CNU i perforns lhe firsl slep of i-lh rov process. Afler
lhal, CNU i passes lhe i-lh rov process inlernediale resuIl lo CNU i+1 lhrough lhe CNU
ccmmunica|icn nc|ucr|. MeanvhiIe, CNU i receives lhe (i-1)-lh rov process inlernediale
resuIl fron CNU i-1. In lhe second cIock cycIe of lhe ileralion, CNU i sliII receives lhe
varialIe-lo-check nessage fron VNU x. This nessage and lhe rov process inlernediale
resuIl in CNU i are arranged lolh corresponding lo rov i-1. Therefore, CNU i can perforn
lhe second slep of (i-1)-lh rov process. Al lhe sane line, CNU i+1 is perforning lhe second
slep of i-lh rov process. Ior varialIe node processing, VNU x receives lhe check-lo-varialIe
Ultra-High Speed LDPC Code Design and mplementation 183

nessage in sequenliaI fron rov i lo i-1, i-2, ., elc. In lhe sane vay, lhe CNU ccmmunica|icn
nc|ucr| aIso can ensure lhal VNU x onIy need lo receive ils nessage fron CNU i.


Iig. 6. Dala fIov of rov process inlernediale resuIl

To shov lhe decoding procedure nore cIearIy, ve iIIuslrale lhe vhoIe process vilh lhe
sinpIe (12, 4) (2, 3) shifl LDIC code presenled in Iig. 1 as an exanpIe. The procedure for
one enlire ileralion is shovn in Iig. 6. Il can le seen lhal lhe CNU ccmmunica|icn nc|ucr| is
separaled inlo lvo independenl nelvorks: CNU ccmmunica|icn nc|ucr| (in|ra i|cra|icn) and
CNU ccmmunica|icn nc|ucr| (in|cr i|cra|icns) vhich lransfer lhe rov processing resuIls
during ileralion and lelveen ileralion respecliveIy.
The conneclions lelveen CNUs and VNUs are fixed and sinpIified. CNU 1 connecls vilh
VNU 4, CNU 2 connecls vilh VNU 2.These conneclions are lased on lhe 1s posilions in
lhe Ieflnosl sulnalrices.
One line ileralion consisls of lhree sleps or lhree cycIes (lhe rov veighl | equaIs 3). In each
slep/cIock cycIe, each CNU generales a check-lo-varialIe nessage, passes lhe nessage lo ils
connecled VNU and processes a varialIe-lo-check nessage vhich is received fron ils
connecled VNU. Al lhe slarling of ileralion, in lhe firsl cIock cycIe, CNU 1 process lhe dala
of rov 1, CNU 2 process lhe dala of rov 2. Afler lhis one slep rov process, lhe rov process
inlernediale resuIls are shifled lelveen CNUs lhrough lhe CNU ccmmunica|icn nc|ucr|
(in|ra i|cra|icn). The resuIl according lo rov 1 is passed lo CNU 2, lhe resuIl according lo rov
2 is passed lo CNU 3, elc.
In lhe second cIock cycIe, CNU 2 sliII lransnils lo and gels nessage fron VNU 2. This line
lhe varialIe-lo-check nessage il gels is corresponding lo rov 1, lhus il can deaI vilh one
slep of rov 1 process. LikeIy CNU 3 perforns one slep of rov 2 process, CNU 4 perforns
one slep of rov 3 process, elc. The lhird slep is perforned in lhe sane vay.
Once lhe ileralion is finished, lhe rov process resuIls viII le lransferred lack lo appropriale
CNUs for lhe check-lo-varialIe nessages generalion in nexl ileralion. The resuIls are
lransferred lhrough lhe CNU ccmmunica|icn nc|ucr| (in|cr i|cra|icns). This nelvork is a one-
lo-one connunicalion nelvork as shovn in Iig. 6.
As a resuIl, lhe connunicalion lelveen CNUs and VNUs required ly lhe originaI LDIC
decoding aIgorilhn is deconposed inlo lhree kinds of conneclions: lhe sinpIified
conneclion lelveen CNUs and VNUs, lhe CNU ccmmunica|icn nc|ucr| (in|ra i|cra|icn) and
lhe CNU ccmmunica|icn nc|ucr| (in|cr i|cra|icns).

VLS 184


Iig. 7. Decoder scheduIing of lhe sinpIe shifl-LDIC code in Iig. 1

In lhe foIIoving, ve viII inlroduce lhe archileclure of CNU appIying lhe nin-sun decoding
aIgorilhn. The (8192, 7168) (4, 32) shifl LDIC code is chosen as lhe design exanpIe.

4.3 Architecture of check node processing unit
The check node processing unil execules lhe check-lo-varialIe nessage conpulalion and lhe
nagnilude conparison lelveen lhe varialIe-lo-check nessage and lhe inlernediale resuIl
of rov process. Iig. 7 shovs lhe archileclure of CNU. The oId regisler, nev regisler, and
sign regisler slore lhe rov process resuIls of Iasl ileralion, lhe inlernediale resuIls of rov
process, and lhe sign lils of lhe check-lo-varialIe nessages respecliveIy. To nanage lhe
nessage slorage nore efficienlIy, onIy rov process resuIls are slored in a conpressed vay
lhal onIy lhe nininun nagnilude, 2nd-nininun nagnilude, lhe Iocalion of lhe nininun
and lhe sign lils are saved.
The dala in nev regislers and oId regislers are aIvays leing passed lelveen CNUs lhrough
lhe CNU ccmmunica|icn nc|ucr| (in|ra i|cra|icn an in|cr i|cra|icns). The signs in lhe sign
regisler are fron lhe sane VNU, so lhey do nol need lo le passed. In each cIock cycIe, sign
regisler shifls in a sign lil of lhe varialIe-lo-check nessage and shifls oul a sign lil of a
check-lo-varialIe nessage.
Ultra-High Speed LDPC Code Design and mplementation 185


Iig. 8. Archileclure of check node processing unil

ecause onIy one varialIe-lo-check nessage is processed in one line, lhe conpulalion in
CNU has very Iov conpIexily. The nagnilude conparison parl conpares lhe nagnilude of
inpul nessage vilh lhe currenl rov process inlernediale resuIl lo updale lhe nagniludes,
sign and index. Then lhe updaled rov process inlernediale resuIl is regislered and passed
lo ils nexl CNU neighlour lhrough lhe CNU ccmmunica|icn nc|ucr| (in|ra i|cra|icn). The
nessage conpulalion parl seIecls lhe proper nessage nagnilude according lo lhe index
vaIue, and conpules lhe sign of lhe nessage. Al lhe leginning of each ileralion, il perforns
lhe nessage scaIe caIcuIalion ( = O.75).

4.4 Architecture of variabIe node processing unit
The archileclure of varialIe node processing unil is lhe sane vilh olher LDIC decoder
designs. Il can le designed vilh onIy conlinalionaI Iogic. Iig. 8 shovs lhe VNU
archileclure. The inpulled check-lo-varialIe nessages are firslIy converled fron sign-
nagnilude fornal lo lvos conpIenenl fornal. Then lhe adder lree conpules lhe varialIe-
lo-check nessages according lo equalion (2). IinaIIy lhe nessages are converled lack lo
sign-nagnilude fornal and lransnilled lo CNUs.

VLS 186

Cale counl
CNU
conlinalion
Iogic
VNU
conlina-
lion Iogic
lolaI
conlina-
lion Iogic
frequenc
y (MHz)
Through-
pul per
ileralion
(Clps)
Message
area
(nn
2
)
TolaI
area
(nn
2
)
One paraIIeI
IeveI design
148 953 395K 16O 4O.96 6.1 1O
One paraIIeI
+ pipeIining
148 141O 512K 317 76.4 6.1 11.3
DoulIe
paraIIeI IeveI
342 953 838K 151 77.3 6.1 14
DoulIe
paraIIeI
+ pipeIining
342 141O 1O72K 317 153 6.1 16.7

Iig. 9. Archileclure of lhe varialIe node processing unil. SloT noduIe converls fron lhe
sign-nagnilude fornal lo lvos conpIenenl fornal. TloS is for reverse conversion

4.5 ImpIementation resuIts
To furlher increase lhe decoding speed, nore paraIIeI IeveIs can le added lo lhe decoder
archileclure. Ior exanpIe, lo doulIe lhe paraIIeI IeveI, 2*q VNUs viII le inslanlialed. The
archileclure of CNU shouId le changed lo process lvo nessages per cIock cycIe. Then each
ileralion can le finished in |/2 cIock cycIes. A doulIe paraIIeI IeveI decoder exanpIe is
designed for lhe (8192, 7168) (4, 32) shifl LDIC code lo invesligale lhe lradeoff lelveen
hardvare cosl and decoding speed.
In addilion, lo increase lhe cIock frequency, sone pipeIine IeveIs can le added lo lhe dala
palh. The crilicaI palh is fron oId regisler, lhrough CNU nessage conpulalion parl, lo VNU
conpulalion, lo CNU rov processing parl, and finaIIy ends al lhe nev regisler. IipeIine viII
increase lhe nunler of cIock cycIes needed for each ileralion. Ior exanpIe, a lhree slage
pipeIine viII add lvo cIock cycIes lo each ileralion.

TalIe 2. Synlhesis resuIl for (8192, 7168) (4, 32) shifl LDIC decoder under O.18n
lechnoIogy

TalIe 2 shovs lhe synlhesis resuIl of four design exanpIes: one-IeveI paraIIeI design, one-
IeveI paraIIeI vilh pipeIining, lvo-IeveI paraIIeI design, and lvo-IeveI paraIIeI vilh
Ultra-High Speed LDPC Code Design and mplementation 187

pipeIining. Due lo lhe Iarge nunler of CNUs and VNUs (sulnalrix size q = 256), lhe
lollon-up synlhesis slralegy is appIied. These resuIls are achieved vilh O.18n lechnoIogy.
Il can le seen lhal, ly appIying pipeIining, lhe cIock frequency can le aInosl doulIed, and
lhe overhead is 81 regislers per CNU. The one-IeveI paraIIeI design vilh pipeIining can
achieve lhe lesl lrade-off lelveen speed and hardvare conpIexily.
Ior lhe nessage slorage, il is inpIenenled vilh regislers. The lransferring nessages lake
(16*2+32)*1O24 = 65536 lils. In conparison, lhe lradilionaI parliaIIy paraIIeI archileclure
(e.g., Wang (2OO7)) needs lo slore 8192*4*6 = 196224 lils in lolaI. Il saves 67 of lhe nessage
nenory needed. The iniliaI inlrinsic infornalion lakes 8192*6 = 49152 lils. Il is lhe sane for
lolh cases.
To furlher exanine lhe efficiency of lhe shifl decoder archileclure, lackend design of lhe
sanpIe (8192, 7168) (4, 32) shifl LDIC decoder is conpIeled. The lechnoIogy used for
inpIenenlalion is a O.18n CMOS process vilh 6 nelaI Iayers. Iig. 9 shovs lhe fIoor-pIan
and Iayoul of lhe decoder chip vilh a die size of 4.1 nn 4.1nn. The Iogic densily is 7O.
The pIacenenl of CNUs is crilicaI lo reducing lhe rouling congeslion of CNU
connunicalion nelvorks. The CNU array Iocaled in lhe cenlre of lhe chip is speciaIIy
arranged: lhe 1O24 CNUs are aIigned vilh 31 CNUs per rov, so lhal lhe connunicalions
lelveen CNUs can le IocaIIy rouled.


Iig. 1O. The fIoor-pIan and Iayoul of lhe (8192, 7168) (4, 32) shifl LDIC decoder

VLS 188


TalIe 3. Conparisons vilh olher decoder archileclures

TalIe 3 shovs lhe decoder inpIenenlalion resuIls conpared vilh sone olher LDIC
decoder archileclures. The lhroughpul achieved here is 5.1 Ciga lils per second al a
naxinun ileralion lines of 15. One paraneler Hardvare Lfficiency as (Throughpul *
Ileralions / Area) is defined lo evaIuale lhe efficiency of each archileclure. The area nelrics
are scaIed lo resenlIe 9Onn CMOS resuIls (65nn ly a faclor of 2 and 18Onn ly a faclor of
1/4). Iron lhis lalIe, ve can see lhal lhe proposed design can achieve nore lhan 7O
inprovenenl in hardvare efficiency conpared vilh an advanced exisling design vhiIe lhe
higher cIock speed lenefiled fron nuch nore advanced CMOS lechnoIogy is nol
considered for lhis conparison. Olhervise lhe inprovenenl vouId le even nore
significanl. So shifl decoder archileclure is very efficienl for high-speed LDIC decoder
inpIenenlalion.

5. AppIy the shift architecture to known LDPC codes

5.1 RS-based LDPC code
In Djurdjevic (2OO4), lhe aulhors conslrucled a kind of LDIC codes lased on Reed-SoIonon
codes and caIIed lhen RS-lased LDIC codes. These codes have good nininun dislances,
girlh properlies, and perforn very veII vilh ileralive decoding. One code conslrucled in
lhis vay vas chosen as lhe forvard error correclion coding schene in ILLL 8O2.3an
1OCase-T Llhernel slandard (LAN/MAN CSMA/CD Access Melhod). RS-lased LDIC
codes are conslrucled in an aIgelraic vay and ly expIoiling lhe shifl-slruclured properlies
hidden in lhe parily check nalrices, lhe shifl decoder archileclures presenled alove can le
appIied lo lhis kind of code. In lhe foIIoving, ve viII expIain lhe shifl properlies in lhese
codes ly going lhrough lhe code generalion procedure.
We slarl fron lhe exlended (q, 2, q -1) RS code C
|
over CI(q) vhich has a Ienglh of q and lvo
infornalion synloIs. y using lhe generalion nalrix

IarlIy paraIIeI
Mansor (2OO6)
UIlra sparse
rack (2OO7)
Iroposed
Code Ienglh 2O48 96OO 8192
Code rale 1/2 3/4 7/8
Ldges 6144 264OO 32768
Quanlizalion 4l 6l 6l
AIgorilhn Min-Sun Min-Sun Min-Sun
lechnoIogy 18Onn 65nn 18Onn
frequency 125 MHz 5OOMHz 317 MHz
Ileralions 1O 1O 15
Throughpul 64OMl/s 1.45Cl/s 5.1 Cl/s
Area (nn2) 14.3 O.5O4 11.3/16.8
Area |scaIed
lo 9Onnj
3.57 1.OO8 2.82 /4.2
Hardvare
Lfficiency
1778 14384 255OO
Ultra-High Speed LDPC Code Design and mplementation 189

(

1 1
0 1 1 1 1
2 2
o o o

q
G
, (3)
ve can gel lhe exlended RS code vilh q
2
codevords in lolaI. Lel v le a nonzero codevord in
C
|
vilh veighl q, for exanpIe v = (1,
q-2
, ,
2
, , 1). Then, lhe sel C
|
(O)
= |cv : c G|(q) of q
codevords forns lhe sulcode of C
|
. Sel C
|
(O)
aIvays conlains lhe aII zero codevord and q-1
veighl q codevords, no naller vhich veighl q codevord v is chosen.
Iarlilion C
|
inlo q addilive cosels, C
|
(O)
, C
|
(1)
,., C
|
(q-1)
lased on lhe sulcode C
|
(O)
. Lach cosel
C
|
(i)
is conposed of q codevords, I
0
(i)
, I
1
(i)
, . , I
q-1
(i)
and each codevord has a Ienglh of q,
as leIov:
2 3
2 2
2 3 2
2 3
( )
0,0 0,1 0, 1
0
( )
1,0 1,1 1, 1
( ) 1
( )
1,0 1,1 1, 1
1
1 1
1
1
i q i q i i
i i q i i
i i i i
q i q i
i
q
i
q
i
b
i
q q q q
q
w w w
W
w w w
W
C
w w w
W
o o o o o o o
o o o o o o o o
o o o o o o o o
o o o o

(
(
(
(
(
(
(
(
(
(
(
(
(
(


+ + + +
+ + + +
+ + + +
=
+ +
= =

. . . . .

.
. . .

4 2
1
0 0 0 0 0
q i i q
i i i i
o o o o
o o o o

(
(
(
(
(
(
(
+ +
(
+ + + + (

. (4)
Then lhe cosels are arranged logelher lo gel lhe q
2
q nalrix H
rs


(
(
(
(
(

=
) 1 (
) 1 (
) 0 (
q
b
b
b
rs
C
C
C
H
.
. (5)
RepIace each synloI in H
rs
vilh a Iocalion veclor:
) , , , , , ( ) (
2 2 1 0
=
q
i
: : : : : : o
, (6)
vhere lhe i-lh conponel z
i
= 1 and aII lhe olher conponenls equaI zero. The exponenliaI
vaIue can le denoled as lhe posilion of 1 in lhe Iocalion veclor repIacenenl. IinaIIy,
choose a
t

c
sularray fron H
rs
lo gel a (
t
,
c
) reguIar LDIC code. This can aIso le slaled
as lhal seIecl
t
cosels randonIy and lhen seIecl
c
coIunns in lhen, finaIIy repIace each
synloI vilh a Iocalion veclor lo gel a sparse parily check nalrix H.
A cosel exanpIe is given here lo shov lhe nalrix properlies cIearIy. Il is generaled in CI(8):

VLS 190

(
(
(
(
(
(
(
(
(
(
(

0 2 2 2 2 2 2 2
7 4 0 5 1 3 7 6
6 6 4 0 5 1 3 7
5 7 6 4 0 5 1 3
4 3 7 6 4 0 5 1
3 1 3 7 6 4 0 5
2 5 1 3 7 6 4 0
1 0 5 1 3 7 6 4
. (7)

The exaninalion of lhe nalrix and equalion (4) reveaIs lhe four foIIoving properlies:
1) Lach coIunn is conposed of lhe q eIenenls of CaIois IieId CI(q).
2) qlh rov is a speciaI rov. Il is conposed vilh onIy one vaIue
i
excepl lhe Iasl one O.
3) qlh coIunn is a speciaI coIunn. Il is aIvays 1, ,
2
, .
q-2
, O.
4) Lxcepl lhe speciaI rov and speciaI coIunn, lhe renaining q-1 q-1 nalrix is a speciaI lype
of circuIanl in vhich each rov is lhe righl cycIic-shifl of lhe rov alove il vhiIe lhe firsl rov
is lhe righl cycIic-shifl of lhe Iasl rov.
CarefuI readers viII find lhal properly 4) is exaclIy lhe shifl properly vhich is inlroduced
earIier in lhe chapler. Thus lhe shifl decoder archileclure can le appIied lo lhis kind of RS-
lased LDIC codes. Iig. 1O shovs lhe Iocalion veclor repIacenenl in RS-lased LDIC code
conslruclion and lhe napping of lhe parily check nalrix lo decoder hardvare. The chosen
code sanpIe is a (32, 24) (1, 4) RS-lased LDIC code generaled in CI(8).

0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0
CNU 1
CNU 2
CNU 3
CNU 4
CNU 5
CNU 6
CNU 7
CNU 0
00000010
00001000
10000000
00000100
01000000
00010000
00000001
00100000
00000001
00000010
00001000
10000000
00000100
01000000
00010000
00100000
00010000
00000001
00000010
00001000
10000000
00000100
01000000
00100000
2 2 2 2
1 3 7 6
5 1 3 7
0 5 1 3
4 0 5 1
6 4 0 5
7 6 4 0
3 7 6 4
clock cycle 1 clock cycle 2 clock cycle 3 clock cycle 4

Iig. 11. The Iocalion veclor repIacenenl in RS-lased LDIC code conslruclion and lhe
napping of lhe parily check nalrix lo decoder hardvare
Ultra-High Speed LDPC Code Design and mplementation 191


Iig. 12. Iock diagran of lhe shifl decoder design for lhe nalrix defined in Iig.1O

Iig. 11 shovs lhe lIock diagran of decoder designed for nalrix defined in Iig. 1O. Il can le
deconposed inlo lvo parls: lhe processing unils and lhe nessage slorage regislers. There is
one rov resuIl regislers (RR) corresponding lo each CNU. One RR is conposed of currenl
ileralion rov resuIl regislers (RRC) and Iasl ileralion rov resuIl regislers (RRL). olh
conlain nininun vaIue, 2nd-nininun vaIue, nininun vaIue index and sign lil.
Since lhe offsel vaIue in qlh rov is fixed al 2, qlh rovs nessage aIvays cones fron VNU
2. CNU 7 is onIy in charge of lhe qlh rovs processing and RR 7 slores processing resuIl of
lhe qlh rov, vhich corresponds lo lhe speciaI rov properly 2).
Wilh lhe shifl decoder archileclure, lvo decoder exanpIes are designed. The largel code is
lhe (2O48, 1723) (6, 32) LDIC code generaled fron (64, 32, 2) RS code. A 6 32 sul-array al
lhe upper Iefl corner of H
rs
is seIecled.
TalIe 4 shovs lhe synlhesis resuIls under 9Onn CMOS lechnoIogy of lvo design exanpIes:
one lasic design and a four lines paraIIeI IeveI design. Conpared lo lhe lasic design, lhe
four lines paraIIeI IeveI design can achieve an approxinaleIy 3x decoding speed vilh 2x
hardvare consunplion. TalIe 3 aIso conpares lhe shifl archileclure decoders vilh slale-of-
lhe-arl lil seriaI fuIIy paraIIeI LDIC decoder (Daraliha (2OO8)) and parliaIIy paraIIeI LDIC
decoder (Chen (2OO3)). Conparing lhe lhroughpul lo area ralio nelric, il can le cIearIy seen
lhal lhe shifl archileclure-lased designs are very nuch hardvare efficienl.

VLS 192


TalIe 4. Synlhesis resuIls of (2O48, 1723) RS-LDIC code decoder under shifl archileclure and
conparisons vilh olher designs

5.2 Quasi-CycIic LDPC codes
Quasi-cycIic LDIC (QC-LDIC) codes are veII knovn lhal lhey can achieve conparalIe
perfornance vilh equivaIenl randon-Iike LDIC codes. In addilion, nore inporlanlIy, lhey
are veII suiled for hardvare inpIenenlalion lecause of lhe reguIarily in lheir parily check
nalrices. In parlicuIar lhe encoder of a QC-LDIC code can le easiIy luiIl vilh shifl regislers
vhiIe usuaIIy il is hardvare conpIex lo encode an LDIC code.
In generaI, a reguIar (c, |) quasi-cycIic LDIC code is defined as ils parily check nalrix H
(MN) consisling of a c| array of sulnalrices. The dinension of each sulnalrix is qq (q =
M/c = N/|). Lvery sulnalrix H
ij
is a circuIanl nalrix of idenlily nalrix |. Iig.12 shovs an
exanpIe of (24, 8) (2, 3) reguIar QC-LDIC code.


Iig. 13. The conslruclion of QC-LDIC parily check nalrix H
qc
: an array of circuIanl
sulnalrices
Design ModuIe
Cale counl
nunler
TolaI
gale
paranelers
CIock
speed
lhroughpul
lasic
34 cycIes
CNU 148 384
42OK
6 lil quan
8 ileralions
5OOMHz 3.8 Clps
VNU 145O 64
slorage
2O48 6 lils iniliaI
64 384 lils nessage
4paraIIeI
1O cycIes
CNU 515 384
82OK
6 lil quan
8 ileralions
4OOMHz 1O Clps
VNU 145O 256
slorage
2O48 6 lils iniliaI
64 384 lils nessage
Daraliha
(2OO8)
(2O48, 1723) (6,32)
RS-lased LDIC code
fuIIy paraIIeI archileclure
223OK
4 lil quan
8 ileralions
25OMHz 16 Clps
Chen
(2OO3)
8O88 codevord Ienglh, 1/2 rale
parliaIIy paraIIeI archileclure
742K
6 lil quan
25 ileralions
212MHz 188 Mlps
Ultra-High Speed LDPC Code Design and mplementation 193


Iig. 14. The lransfornalion fron lhe sanpIe QC-LDIC parily check nalrix inlo a shifl Iike
LDIC code lhrough coIunn pernulalion

Iig. 13 iIIuslrales lhe vay lo lransforn lhe QC-LDIC code parily check nalrix inlo a shifl
kind nalrix vhose decoder can le inpIenenled vilh lhe shifl decoder archileclure. IirslIy
lhe q coIunns are dislriluled lo | lIock coIunns of H
qcs1
in a round-rolin fashion. Then lhe
second q coIunns are pernulaled in lhe sane vay and so on unliI aII coIunns are
dislriluled inlo nev nalrix H
qcs1.
CarefuI readers viII find lhal a nev properly is inducled in
lhe nev nalrix lhal in sone sulnalrices lhere are nuIlipIe 1s in one rov. This nevIy
inducled properly requires a speciaI CNU design vhich processes nuIlipIe nessages
inslead of one al each slep. Il viII cause sone addilionaI Iogic deIay in CNU. In Cui (2OO8a),
lhe aulhors proposed an efficienl nalrix pernulalion oplinizalion nelhod lo nininize lhe
naxinun rov veighl. NornaIIy, il can le no ligger lhan 2. In addilion, in Cui (2OO8a) lhe
aulhors oplinized lhe coIunn Iayered decoding aIgorilhn and appIied il lo shifl decoder
archileclure. y lhis vay, lhe ileralion nunler lo converge can le reduced and lhe decoding
lhroughpul can le furlher increased.
There are olher nalrix lransfornalion nelhods lo appIy lhe exlended shifl decoder
archileclure on QC-LDIC codes. Ior exanpIe, lhe quasi cycIic nalrix can le lransforned
inlo a rov shifl conslruclion ly rov pernulalions as shovn in Iig. 14. IirslIy lhe q rovs are
dislriluled lo 4 lIock rovs of H
qcs2
in a round-rolin fashion (i.e., rovs A-H of H
qc
are
dislriluled lo rov 1, 5, 9, 13, 2, 6, 1O and 14 of H
qcs2
). Then lhe second q rovs are pernulaled
in lhe sane vay and so on unliI aII rovs are dislriluled inlo nev nalrix H
qcs2.
The quasi
cycIic nalrix is converled lo a forn as:

1 2 3
1 2 3
/ / / /
1 2 3
|
|
qcrs
q m q m q m q m
|
A A A A
A A A A
H
A A A A
o o o o
o o o o
(
(
(
=
(
(

. . . .

(8)

vhere is a q q pernulalion nalrix represenling a singIe righl cycIic shifl. m is an inleger
such lhal q can le divided ly m.
VLS 194


Iig. 15. The lransfornalion fron lhe sanpIe QC-LDIC parily check nalrix inlo a shifl Iike
LDIC code lhrough rov pernulalion

Afler lhe nalrix conversion, lhe rov shifl properly can le expIoiled lo appIy a rov shifl
decoder archileclure jusl Iike lhe presenled coIunn shifl archileclure. y lhis neans, lhe rov
Iayered decoding aIgorilhn shouId le appIied. IorlunaleIy lhe rov Iayer aIgorilhn can
increase lhe convergence speed significanlIy (Mansour (2OO3l)). Inleresled readers can le
referred lo Cui (2OO8l) for nore infornalion.

5.3 Array LDPC code
Array LDIC code is a nore slruclured QC-LDIC code vhich can achieve conparalIe
perfornance lo randon codes. Lfficienl encoding, nininun dislance properlies, and
perfornance of array-code lased LDIC codes for linary as veII as nuIliIeveI noduIalion
are addressed in Ian (2OOO). The parily check nalrix lased on array codes can le
represenled as foIIovs:

2 1
2( 1) 2 4
2( 1) ( 1)( 1) 1
|
|
arraq
c c | c
| | | |
|
| H
|
o o o
o o o
o o o


(
(
(
( =
(
(
(

. . . .

(9)
Where I is a q q idenlily nalrix vilh a prine nunler q. is a pernulalion nalrix vilh a
singIe cycIic righl shifl or Iefl shifl. Ior exanpIe, a (22O9, 2O24) (4, 47) array LDIC code can
le conslrucled vilh q = 47. We use lhis code in lhe decoder expIanalion.
Ior array LDIC codes, lhey have lhe siniIar shifl properly as shifl-LDIC codes. Sone
nodificalions in lhe CNU ccmmunica|icn nc|ucr| can le enough for lhe decoder
inpIenenlalion. Iig. 15 shovs lhe nodified CNU conneclions. They are grouped inlo four
Ultra-High Speed LDPC Code Design and mplementation 195

CNU groups. Lach group has 47 nenlers. Croup 1 conlains fron CNU-1 lo CNU-47, in
charge of lhe rov operalion of rov 1~47. Croup 2 conlains fron CNU-48 lo CNU-94, in
charge of lhe rov operalion of rov 48~94... There is no connunicalion lelveen groups. The
difference lelveen array LDIC codes and shifl LDIC codes is lhal lhe shifl vaIues are
changing fron O lo 3 inslead of keeping al 1. In addilion, for lhis speciaI kind of LDIC
codes, lhe CNU ccmmunica|icn nc|ucr| (in|ra i|cra|icn) is exaclIy lhe sane vilh CNU
ccmmunica|icn nc|ucr| (in|cr i|cra|icn), lhus lhese lvo nelvorks can le incorporaled inlo one
sinpIe nelvork and lhe rouling conpIexily can le furlher sinpIified. As a resuIl, excepl lhe
CNUs in Croup 1, every 47 CNUs in a group forn a chain and each CNU onIy
connunicales vilh ils lvo neighlors. Ior exanpIe, in CNU group 3, CNU-i onIy gels dala
fron CNU-(i-2) and deIivers dale lo CNU-(i+2).
A sanpIe decoder is designed lased on an ALTLRA IICA LI2C35. 5OMHz cIock speed and
12OMlps lhroughpul (2O ileralions) are achieved vilh 23k Iogic eIenenls and 26k nenory
lils. Inleresled readers are referred lo Sha (2OO6) for nore infornalion.


Iig. 16. The CNU connunicalion nelvork designed for (22O9, 2O24) (4, 47) array LDIC code

6. ConcIuding Remarks

Since MacKays (1997) rediscovery of LDIC codes, lhe LDIC decoder design has
experienced consideralIe deveIopnenl and enhancenenl over lhe Iasl len years. Various
oplinizalion lechniques on lolh decoding aIgorilhns and decoder designs have leen
deveIoped. This chapler provided a lrief overviev of lhe prolIens in exisling LDIC
decoder designs and presenled a noveI shifl-slruclured decoder design approach lo lackIe
lhese prolIens. The codes suiled for lhis kind of decoder archileclure are caIIed shifl-LDIC
codes. SeveraI decoder sanpIes are discussed lo iIIuslrale lhe effecliveness of lhe proposed
archileclure. In addilion, il vas shovn in lhe chapler lhal sone popuIar cIasses of LDIC
codes such as RS-lased LDIC codes and QC-LDIC codes can le inpIenenled vilh lhe shifl
decoder archileclure lhrough sinpIe nalrix pernulalions or archileclure exlensions. As a
concIusion, lhe presenled shifl decoder archileclure viII le a conpelenl candidale in fulure
high-speed connunicalion syslen designs.
VLS 196

7. References

A. Daraliha, A. C. Carusone, & I. R. Kschischang. (2OO8). Iover Reduclion Techniques for
LDIC Decoders, |||| ]curna| cf Sc|i-S|a|c Circui|s, voI. 43, no. 8, pp. 1835-1845
A. }. Ianksly & C .} .HovIand. (2OO2). A 69O-nW 1-Clps 1O24-l, rale-1/2 Lov-Densily
Iarily-Check code decoder, |||| ]. Sc|i-S|a|c Circui|s, voI. 37, no. 3, pp. 4O4-412
D. }. C. MacKay & R. M.NeaI. (1996). Near Shannon Iinil perfornance of Iov densily parily
check codes, ||cc|rcn. |c||., voI. 32, pp. 1645-1646
L. ouliIIon, }. Caslura, & I. R. Kschischang, (2OOO). Decoder-Iirsl Code Design, Prccccings
cf |nc 2n |n|crna|icna| Sqmpcsium cn Tur|c Cccs an Rc|a|c Tcpics, resl, Irance,
pp. 459-462
I. CuiIIoud, L. ouliIIon & }.L. Danger. (2OO3). -Min Decoding AIgorilhn of ReguIar and
IrreguIar LDIC Codes, Prcc. 3n |n|crna|icna| Sqmpcsium cn Tur|c Cccs an Rc|a|c
Tcpics, pp. 451-454
L. Liao, L. Yeo & . NikoIic. (2OO4). Lov-densily parily-check code conslruclions for
hardvare inpIenenlalion, Prcc. |||| |n|. Ccnf. cn Ccmmun. voI. 5, pp. 2573-2577
L. Yeo, . NikoIic & V. Ananlharan. (2OO3) Ileralive decoder archileclures, |||| Ccmmun.
Mag., voI. 41, pp. 132-14O
I. R. Kschischang, . }. Irey & H. A. LoeIiger. (2OO1). Iaclor graphs and lhe sun-producl
aIgorilhn, |||| Trans. |nf. Tnccrq, voI. 47, pp. 498-519
C. Liva, S. Song, L. Lan, Y. Zhang, S. Lin, & W. L. Ryan. (2OO6). Design of LDIC Codes: A
Survey and Nev ResuIls. ]. Ccmm. Scf|uarc an Sqs|cms, voI. 2, pp. 191
I. Djurdjevic, }. Xu, K. AldeI-Chaffar, & S. Lin. (2OO4). Conslruclion of Iov-densily parily-
check codes lased on Reed-SoIonon codes vilh lvo infornalion synloIs, ||||
Ccmmu. |c||., voI. 8, no. 7, pp. 317-319
}. Chen, A. DhoIakia, L. LIeflheriou, M. I. C. Iossorier, & X. Hu. (2OO5). Reduced-conpIexily
decoding of LDIC codes, |||| Trans. Ccmmun., voI. 53, pp. 1288-1299
}. L. Ian. (2OOO). Array codes as Iov-densily parily-check codes, Prcc. 2n |n|. Sqmp. Tur|c
Cccs an Rc|a|c Tcpics 8rcs|, Irance, pp. 543
}. Sha, M. Cao, Z. Zhang, L. Li, Z. Wang. (2OO6). An IICA InpIenenlalion of array LDIC
decoder, |||| Asia Pacific Ccnfcrcncc cn Circui|s an Sqs|cms, pp. 1675-1678
}. Sha, Z. Wang, M. Cao, & Li. (2OO9). MuIli-Cl/s LDIC Code Design and InpIenenlalion,
|||| Trans. cn V|S| Sqs|cms, voI. 17, no. 2, pp. 262-268
LAN/MAN CSMA/CD Access Melhod, |||| 802.3 S|anar OnIine avaiIalIe: hllp://
slandards.ieee.org/gelieee8O2/8O2.3.hlnI
}. Zhao, I. Zarkeshvari, & A. H. anihasheni. (2OO5). On inpIenenlalion of nin-sun
aIgorilhn and ils nodificalions for decoding Lov-Densily Iarily-Check (LDIC)
codes, |||| Trans. Ccmmun., voI. 53, no. 4, pp. 549-554
L. Chen, }. Xu, I. Djurdjevic & S. Lin. (2OO4a) Near-Shannon Iinil quasi-cycIic Iov-densily
parily-check codes, |||| Trans. Ccmmun, voI. 52, pp. 1O38
M. Cocco, }. DieIissen, M. HeijIigers, A. Hekslra, & }. Huisken. (2OO4). A scaIalIe archileclure
for LDIC decoding, Prcc. Dcsign, Au|cma|icn an Tcs| in |urcpc, voI. 3, pp. 88-93
M. M. Mansour & N. R. Shanlhag. (2OO2). Lov pover VLSI decoder archileclure for LDIC
codes, Prcc. |||| |n|. Sqmp. cn |cu Pcucr ||cc|rcn. Dcsign, pp. 284-289
M. M. Mansour & N. R. Shanlhag, (2OO3a). Archileclure-Avare Lov-Densily Iarily-Check
Codes, Prcc. |||| |SCAS, pp. 57-6O
Ultra-High Speed LDPC Code Design and mplementation 197

M. M. Mansour & N. R. Shanlhag. (2OO3l). High lhroughpul LDIC decoders, |||| Trans.
Vcrq |argc Sca|c |n|cgr. (V|S|) Sqs|., voI. 11, pp. 976-996
M.M. Mansour & N. R. Shanlhag. (2OO6). A 64O-Ml/s 2O48-lil progrannalIe LDIC
decoder chip, |||| ]. Sc|i-S|a|c Circui|s, voI. 41, no. 3, pp. 684- 698
M. I. C. Iossorier. (2OO4). Quasi-cycIic Iov-densily parily-check codes fron circuIanl
pernulalion nalrices, |||| Trans. |nf. Tnccrq, voI. 5O, no. 8, pp. 1788-1793
R. C. CaIIager. (1962). Lov-densily parily-check codes, |R| Transac|icns cn |nfcrma|icn
Tnccrq, voI. IT-8, pp. 21-28
R. M. Tanner. (1981). A recursive approach lo Iov conpIexily codes, |||| Trans. |nf. Tnccrq,
voI. IT-27, pp. 533-547
S. Henali, A. anihasheni, & C. IIell. (2OO6). A O.18 n anaIog nin-sun ileralive decoder
for a (32,8) Iov-densily parily-check (LDIC) code, |||| ]. Sc|i-S|a|c Circui|s, voI.
41, pp. 2531-254O
S. Seo, T. Mudge, Y. Zhu & C. Chakralarli. (2OO7). Design and anaIysis of LDIC decoders
for soflvare defined radio, Prcc. |||| Icr|sncp cn Signa| Prcccssing Sqs|cms,
Shanghai, China, pp.21O-215
S. Sharifi Tehrani, S. Mannor & W. }. Cross. (2OO8). IuIIy IaraIIeI Slochaslic LDIC Decoders
|||| Trans. Signa| Prcccssing, VoI. 56, no. 11, pp. 5692 - 57O3
S. Y. Chung, C. D. Iorney, T. }. Richardson & R. Urlanke. (2OO1). On lhe design of Iov-
densily parily-check codes vilhin O.OO45 d of lhe Shannon Iinil, |||| Ccmmun.
|c||., voI. 5, pp. 58-6O
T. rack, M. AIIes, T. Lehnigk-Lnden, I. KienIe, N. Wehn, & L. Ianucci. (2OO7). Lov
ConpIexily LDIC Code Decoders for Nexl Ceneralion Slandards, Prcc. Dcsign,
Au|cma|icn an Tcs| in |urcpc, pp. 1-6
T. Zhang & K. Iarhi. (2OO2). A 54Mlps (3,6)-reguIar IICA LDIC decoder, Prcc. |||| Sips,
pp. 127-132
T. Zhang & K. K. Iarhi. (2OO4). }oinl (3,k)-reguIar LDIC code and decoder/encoder design,
|||| Trans. Signa| Prcccss., voI. 52, no. 4, pp.1O65-1O79
Y. Chen & D. Hocevar. (2OO3). A IICA and ASIC inpIenenlalion of rale 1/2, 8O88-l
irreguIar Iov densily parily check decoder, Prcc. |||| G|O8|COM, San Irancisco,
CA, pp. 113-117
Y. Chen & K. K. Iarhi. (2OO4l). OverIapped nessage passing for quasi-cycIic Iov densily
parily check codes, |||| Trans. Circui|s Sqs|. |, voI. 51, pp. 11O6-1113
Z.-W. Li, L. Chen, L.-Q. Zeng, S. Lin & W.H. Iong. (2OO6). Lfficienl encoding of quasi-cycIic
Iov-densily parily-check codes, |||| Transac|icns cn Ccmmunica|icns, VoI. 54 , no. 1,
pp. 71 - 81
Z. Cui, Z. Wang, X. Zhang & Q. }ia. (2OO8a). Lfficienl decoder design for high-lhroughpul
LDIC decoding, APCCAS, pp. 164O-1643
Z. Cui, Z. Wang & Y. Liu. (2OO8l). High-lhroughpul Iayered LDIC decoding archileclure,
|||| Trans. V|S| Sqs|cms, voI. 17, no. 4, pp. 582-587
Z. Wang & Z. Cui. (2OO7). Lov-conpIexily high-speed decoder design for quasi-cycIic LDIC
codes, |||| Trans. cn V|S| Sqs|cms, voI. 15, no. 1, pp. 1O4-114


VLS 198
A Methodology for Parabolic Synthesis 199
X

A MethodoIogy for ParaboIic Synthesis

Lrik Herlz and Ieler NiIsson
Dcpar|mcn| cf ||cc|rica| an |nfcrma|icn Tccnnc|cgq, |un Unitcrsi|q
Succn

1. Introduction

In reIaliveIy recenl research of lhe hislory of science inlerpoIalion lheory, in parlicuIar of
nalhenalicaI aslronony, reveaIed rudinenlary soIulions of inlerpoIalion prolIens dale
lack lo earIy anliquily (Meijering, 2OO2). LxanpIes of inlerpoIalion lechniques originaIIy
conceived ly ancienl alyIonian as veII as earIy-nedievaI Chinese, Indian, and Aralic
aslrononers and nalhenalicians can le Iinked lo lhe cIassicaI inlerpoIalion lechniques
deveIoped in Weslern counlries fron lhe 17lh unliI lhe 19lh cenlury. The avaiIalIe hisloricaI
naleriaI has nol yel given a reason lo suspecl lhal lhe earIiesl knovn conlrilulors lo
cIassicaI inlerpoIalion lheory vere infIuenced in any vay ly nenlioned ancienl and
nedievaI Laslern vorks. Ior lhe cIassicaI inlerpoIalion lheory il is juslified lo say lhal lhere
is no singIe person vho did so nuch for lhis fieId as Nevlon. Therefore, Nevlon deserves
lhe credil for having pul cIassicaI inlerpoIalion lheory on a foundalion. In lhe course of lhe
18lh and 19lh cenlury Nevlons lheories vere furlher sludied ly nany olhers, incIuding
SlirIing, Causs, Waring, LuIer, Lagrange, esseI, LapIace, and Lverell. Whereas lhe
deveIopnenls unliI lhe end of 19lh cenlury had leen inpressive, lhe deveIopnenls in lhe
pasl cenlury have leen expIosive. Anolher inporlanl deveIopnenl fron lhe Iale 18OOs is lhe
rise of approxinalion lheory. In 1885, Weierslrass juslified lhe use of approxinalions ly
eslalIishing lhe so-caIIed approxinalion lheoren, vhich slales lhal every conlinuous
funclion on a cIosed inlervaI can le approxinaled unifornIy lo any prescriled accuracy ly
a poIynoniaI. In lhe 2Olh cenlury lvo najor exlensions of cIassicaI inlerpoIalion lheory is
inlroduced: firslIy lhe concepl of lhe cardinaI funclion, nainIy due lo L. T. Whillaker, lul
aIso sludied lefore hin ly oreI and olhers, and evenluaIIy Ieading lo lhe sanpIing
lheoren for land Iiniled funclions as found in lhe vorks of }. M. Whillaker, KoleI'nikov,
Shannon, and severaI olhers, and secondIy lhe concepl of osciIIalory inlerpoIalion,
researched ly nany and evenluaIIy resuIling in Schoenlerg's lheory of nalhenalicaI
spIines.

Tnc para|c|ic sqn|ncsis mc|ncc|cgq
Unary funclions, such as lrigononelric funclions, Iogarilhns as veII as square rool and
division funclions are exlensiveIy used in conpuler graphics, digilaI signaI processing,
connunicalion syslens, rololics, aslrophysics, fIuid physics, elc. Ior lhese high-speed
appIicalions, soflvare soIulions are in nany cases nol sufficienl and a hardvare
inpIenenlalion is lherefore needed. InpIenenling a nunericaI funclion f(x), ly a singIe
10
VLS 200

Iook-up lalIe (Tang, 1991) is sinpIe and fasl vhich is slrail forvard for Iov-precision
conpulalions of f(x), i.e., vhen x onIy has a fev lils. Hovever, vhen perforning high-
precision conpulalions a singIe Iook-up lalIe inpIenenlalion is inpraclicaI due lo lhe huge
lalIe size and lhe Iong execulion line.
Approxinalions onIy using poIynoniaIs have lhe advanlage of leing ROM-Iess, lul lhey
can inpose Iarge conpulalionaI conpIexilies and deIays (MuIIer, 2OO6). y inlroducing
lalIe lased nelhods lo lhe poIynoniaIs nelhods lhe conpulalionaI conpIexily can le
reduced and lhe deIays can aIso le decreased lo sone exlenl (MuIIer, 2OO6).
The CORDIC (COordinale Rolalion DIgilaI Conpuler) aIgorilhn (VoIder, 1959) (Andrala,
1998) has leen used for lhese appIicalions since il is fasler lhan a soflvare approach.
CORDIC is an ileralive nelhod and lherefore sIov vhich nakes lhe nelhod insufficienl for
lhis kind of appIicalions.
The proposed nelhodoIogy of paraloIic synlhesis (Herlz & NiIsson, 2OO8) deveIops
funclions lhal perforn an approxinalion of originaI funclions in hardvare. The archileclure
of lhe processing parl of lhe nelhodoIogy is using paraIIeIisn lo reduce lhe execulion line.
Ior lhe deveIopnenl of approxinalions of funclions a paraloIic synlhesis nelhodoIogy has
leen appIied. OnIy Iov conpIexily operalions lhal are sinpIe lo inpIenenl in hardvare are
used

2. MethodoIogy


Iig. 1. LxanpIe of nornaIized funclion, in lhis case sin
t x
2
|
\

|
.
|
.
A Methodology for Parabolic Synthesis 201

The nelhodoIogy is deveIoped for inpIenenling approxinalions of unary funclions in
hardvare. The approxinalion parl is of course lhe inporlanl parl of lhis vork lul lhere are
sonelines lvo olher sleps lhal are necessary, a preprocessing nornaIizalion and
poslprocessing lransfornalion as descriled ly (I.T.I. Tang, 1991) (MuIIer, 2OO6). The
conpulalion is lherefore divided inlo lhree sleps, nornaIizing, approxinalion and
lransforning.

2.1 NormaIizing
The purpose vilh lhe nornaIizalion is lo faciIilale lhe hardvare inpIenenlalion ly Iiniling
lhe nunericaI range.
The nornaIizalion has lo salisfy lhal lhe vaIues are in lhe inlervaI O < x < 1 on lhe x-axis and
O < q < 1 on lhe q-axis. The coordinales of lhe slarling poinl shaII le (O,O). Iurlhernore, lhe
ending poinl shaII have coordinales snaIIer lhan (1,1) and lhe funclion nusl le slriclIy
concave or slriclIy convex lhrough lhe inlervaI. An exanpIe of such a funclion, caIIed an
originaI funclion f
crg
(x), is shovn in Iig. 1.

2.2 DeveIoping the Hardware Architecture
When deveIoping a hardvare archileclure lhal approxinales an originaI funclion, onIy Iov
conpIexily operalions are used. Operalions such as shifls, addilions and nuIlipIicalions are
efficienl lo inpIenenl in hardvare and lherefore searched for. The dovnscaIing of lhe
seniconduclor lechnoIogies and lhe deveIopnenl of efficienl nuIlipIier archileclures has
nade lhe nuIlipIicalion operalion efficienl in lolh size and execulion, line vhen
inpIenenled in hardvare. The nuIlipIier is lherefore connonIy used in lhis nelhodoIogy
vhen deveIoping lhe hardvare.
As in Iourier anaIysis (Iourier, 1822) lhe proposed nelhodoIogy is lased on deconposilion
of lasic funclions. The proposed nelhodoIogy is nol, as in Iourier anaIysis, a deconposilion
nelhod in lerns of sinusoidaI funclions lul in second order paraloIic funclions. Second
order paraloIic funclions are used since lhey can le inpIenenled using Iov conpIexily
operalions. The proposed nelhodoIogy aIso differs fron lhe Iourier synlhesis process since
lhe proposed nelhodoIogy is using nuIlipIicalions in lhe reconlinalion process and nol
addilions as in lhe Iourier case.
The proposed nelhodoIogy is founded on lerns of second ordered paraloIic funclions
caIIed sul-funclions s
n
(x), lhal vhen reconlined, as shovn in (1), ollains lo lhe originaI
funclion f
crg
(x). When deveIoping lhe approxinale funclion, lhe accuracy depends on lhe
nunler of sul-funclions used.

f
org
(x) = s
1
(x) s
2
(x) ... s

(x)
(1)

The procedure vhen deveIoping sul-funclions is lo divide lhe originaI funclion f
crg
(x), vilh
lhe firsl sul-funclion s
1
(x). This division generales lhe firsl funclion f
1
(x), as shovn in (2).

f
1
(x) =
f
org
(x)
s
1
(x)

(2)

VLS 202

The firsl sul-funclion s
1
(x), viII le chosen lo le feasilIe for hardvare, according lo lhe
nelhodoIogy descriled in (4). In lhe sane nanner lhe foIIoving funclions f
n
(x), are
generaled, as shovn in (3).

f
n+1
(x) =
f
n
(x)
s
n+1
(x)

(3)

The purpose vilh lhe nornaIizalion is lo faciIilale lhe hardvare inpIenenlalion ly Iiniling
lhe nunericaI range.

2.3 MethodoIogy for deveIoping sub-functions
The nelhodoIogy for deveIoping sul-funclions is founded on deconposilion of lhe originaI
funclion f
crg
(x), in lerns of second order paraloIic funclions for lhe inlervaI O < x < 1.O and
lhe sul inlervaIs vilhin lhe inlervaI. The second order paraloIic funclion is chosen as
deconposilion funclion since lhe slruclure is reasonalIe sinpIe lo inpIenenl in hardvare
i.e. onIy Iov conpIexily operalions such as addilions and nuIlipIicalions are used.

|irs| su|-func|icn
The firsl sul-funclion s
1
(x), is deveIoped ly dividing lhe originaI funclion f
crg
(x), vilh x as an
approxinalion.
As shovn in Iig. 2 lhere are lvo possilIe resuIls afler dividing lhe originaI funclion vilh x,
one vhere f(x)>1 and one vhere f(x)<1.


A Methodology for Parabolic Synthesis 203

Iig. 2. Tvo possilIe resuIls afler dividing an originaI funclion vilh x.

The firsl sul-funclion s
1
(x), is according lo (4). To approxinale lhese funclions 1+(c
1
.
(1-x)) is
used. The firsl sul-funclion s
1
(x), is given ly a nuIlipIicalion of x and 1+(c
1
.
(1-x)) vhich
resuIls is a second order paraloIic funclion according lo (4).

s
1
(x) = x (1+(c
1
(1 x))) = x +(c
1
(x x
2
))
(4)

In (4) lhe coefficienl c
1
is delernined as lhe Iinil fron lhe division of lhe originaI funclion
vilh x and sullracled vilh 1, according lo (5).

c
1
= lim
x0
f
org
(x)
x
1
(5)

Scccn su|-func|icn
The firsl funclion f
1
(x), is caIcuIaled according lo (2) and lhe resuIl of lhis operalion is a
funclion vhich appearance is siniIar lo a paraloIic funclion, as shovn in Iig. 3.


Iig. 3. LxanpIe of lhe firsl funclion f
1
(x) conpared vilh sul-funclion s
2
(x).

The second sul-funclion s
2
(x), is chosen according lo lhe nelhodoIogy as a second order
paraloIic funclion, see (6).

s
2
(x) =1+(c
2
(x x
2
))
(6)
VLS 204


In (6) lhe coefficienl c
2
, is chosen lo salisfy lhal lhe quolienl lelveen lhe funclion f
1
(x), and
lhe second sul-funclion s
2
(x), is equaI lo 1 vhen x is equaI lo O.5, see (7).

c
2
= 4 f
1
1
2
|
\

|
.
|
1
|
\

|
.
|

(7)

Therely lhe second funclion f
2
(x), viII gel a shape of a Iying 5, as shovn in Iig. 4.


Iig. 4. LxanpIe of lhe second funclion f
2
(x), shaped Iike a Iying 5.

When deveIoping lhe lhird sul-funclion s
3
(x), lhe funclion is lo le spIil inlo lvo paraloIic
funclions vhere lhe firsl funclion is reslricled ly funclion f
2
(x) lo le in lhe inlervaI
O < x < O.5 and lhe second funclion is lhus reslricled lo lhe inlervaI O.5 < x < 1.O. y spIilling
lhe funclion ve gel slriclIy convex and concave funclions in each inlervaI. The inlervaIs can
le chosen differenlIy lul lhal viII Iead lo a nore conpIex hardvare, as shovn in seclion 3.

Su|-func|icns uncn n > 2
Ior funclions f
n
(x) vhen n > 2, lhe funclion is characlerized ly lhe forn of one or nore 5
shaped funclions. When deveIoping lhe higher order sul-funclions, each 5 shaped funclion
is divided inlo lvo paraloIic funclions. Ior each sul inlervaI, a paraloIic sul-funclion is
deveIoped as an approxinalion of lhe funclion f
n
(x) in lhe sul inlervaI. To shov vhich sul
A Methodology for Parabolic Synthesis 205

inlervaI lhe parliaI funclions is vaIid for, lhe sulscripl index is increased vilh lhe index m,
vhich gives lhe foIIoving appearance of lhe parliaI funclion f
n,m
(x).

f
n
(x) =
f
n,0
(x), 0 s x <
1
2
n1
f
n,1
(x),
1
2
n1
s x <
2
2
n1
...
f
n,2
n1
1
(x),
2
n1
1
2
n1
s x <1


(8)

In equalion (8) il is shovn hov lhe funclion f
n
(x), is divided inlo parliaI funclions f
n,m
(x),
vhen n > 2.

As shovn in (8), lhe nunler of parliaI funclions is doulIed for each order of n > 1 i.e. lhe
nunler of parliaI funclions is 2
n-1
. Iron lhese parliaI funclions, lhe corresponding sul-
funclions are deveIoped. AnaIogous lo lhe funclion f
n
(x), aIso lhe sul-funclion s
n+1
(x), viII
have parliaI sul-funclions s
n+1,m
(x). In equalion (9) il is shovn hov lhe sul-funclion s
n
(x), is
divided inlo parliaI funclions vhen n > 2.

s
n
(x) =
s
n,0
(x), 0 s x <
1
2
n2
s
n,1
(x),
1
2
n2
s x <
2
2
n2
...
s
n,2
n2
1
(x),
2
n2
1
2
n2
s x <1


(9)

Nole lhal in (9), lhe parliaI funclions lo lhe sul-funclions, x has leen changed lo x
n
. The
change lo x
n
is nornaIizalion lo lhe corresponding inlervaI, vhich sinpIifies lhe hardvare
inpIenenlalion of lhe paraloIic funclion. To sinpIify lhe nornaIizalion of lhe inlervaI of x
n

il is seIecled as an exponenlialion ly 2 of x vhere lhe inleger parl is renoved. The
nornaIizalion of x is lherefore done ly nuIlipIying x vilh 2
n-2
, vhich in hardvare is n-2 Iefl
shifls and lhe inleger parl is dropped, vhich gives x
n
as a fraclionaI parl (frac( )) of x, as
shovn in (1O).

x
n
= frac 2
n2
x
( )

(1O)

As in lhe second sul-funclion s
2
(x), lhe second order paraloIic funclion is used as an
approxinalion of lhe inlervaI of lhe funclion f
n-1
(x), as shovn in (11).

s
n,m
(x
n
) = 1+ c
n,m
x
n
x
n
2
( ) ( )

(11)

Where lhe coefficienls c
n,m
is conpuled according lo (12).

VLS 206

c
n,m
= 4 f
n1,m
2 (m+1) 1
2
n1
|
\

|
.
|
1
|
\

|
.
|

(12)
Afler lhe approxinalion parl lhe resuIl is lransforned inlo ils desired forn.

3. Hardware ImpIementation

Ior lhe hardvare inpIenenlalion lvos conpIenenl represenlalion (Iarhani, 2OOO) is used.
The inpIenenlalion is divided inlo lhree hardvare parls, preprocessing, processing, and
poslprocessing as shovn in Iig. 5, vhich vas inlroduced ly (I.T.I. Tang, 1991), (MuIIer,
2OO6).


Iig. 5. The hardvare archileclure of lhe nelhodoIogy.

3.1 Preprocessing
In lhis parl lhe inconing operand t is nornaIized lo prepare lhe inpul lo lhe processing
parl, according lo seclion 2.1.
If lhe approxinalion is inpIenenled as a lIock in a syslen lhe preprocessing parl can le
laken inlo consideralion in lhe previous lIocks, vhich inpIies lhal lhe preprocessing parl
can le excIuded.

3.2 Processing
In lhe processing parl lhe approxinalion of lhe originaI funclion is direclIy conpuled in
eilher ileralive or paraIIeI hardvare archileclure.
The lhree equalions (4), (6) and (11) has lhe sane slruclure vhich gives lhal lhe
approxinalion can le inpIenenled as an ileralive archileclure as shovn in Iig. 6.

A Methodology for Parabolic Synthesis 207


Iig. 6. The principIe of an ileralive hardvare archileclure.

The lenefil of lhe ileralive archileclure is lhe snaII chip area vhereas lhe disadvanlage is
Ionger conpulalion line.
The advanlages vilh paraIIeI hardvare archileclure are lhal il gives a shorl crilicaI palh and
fasl conpulalion lo lhe prize of a Iarger chip area. The principIe of lhe paraIIeI hardvare
archileclure for four sul-funclions is shovn in Iig. 7.


Iig. 7. The archileclure principIe for four sul-funclions.

VLS 208


Iig. 8. Squaring aIgorilhn for lhe parliaI producls x
n
2
.

To increase lhe lhroughpul even nore, pipeIine slages can le inpIenenled in lhe paraIIeI
hardvare archileclure.
In lhe sul-funclions (4), (6) and (11) x
2
and x
n
2
are reoccurring operalions. Since lhe square
operalion x
n
2
, in lhe paraIIeI hardvare archileclure is a parliaI resuIl of x
2
a unique squarer
has leen deveIoped. In Iig. 8 lhe aIgorilhn lhal perforns lhe squaring and deIivers parliaI
producl of x
n
2
is descriled.
The squaring aIgorilhn for lhe parliaI producls x
n
2
can le sinpIified as shovn in Iig. 9.

A Methodology for Parabolic Synthesis 209


Iig. 9. SinpIified squaring aIgorilhn for lhe parliaI producls x
n
2
.

In Iig. 8 and Iig. 9, lhe squaring aIgorilhn lhal perforns lhe parliaI producls x
n
2
, shovn.
The firsl parliaI producl p, is lhe squaring of lhe Ieasl significanl lil in x. The second parliaI
producl q, is lhe squaring of lhe lvo Ieasl significanl lils in x. The parliaI producl r, is lhe
resuIl of lhe squaring of lhe lhree Ieasl significanl lils in x and s is lhe resuIl of lhe squaring
of x. The squaring operalion is perforned vilh unsigned nunlers. When anaIyzing lhe
squarer in Iig. 8 and Iig. 9, il vas found lhal lhe resenlIance lo a lil-seriaI squarer (Ienne &
Viredaz, 1994) (Iekneslz el aI., 2OO1) is Iarge. y inlroducing regislers in lhe design of lhe
lil-seriaI squarer lhe parliaI resuIls of x
n
2
is easiIy exlracled. The squaring aIgorilhn can
lhus le sinpIified lo one addilion onIy vhen conpuling each parliaI producl.
Iron (4), (6) and (11) il is found lhal onIy lhe coefficienls vaIues differenliale vhen
inpIenenling differenl unary funclions. This inpIies lhal differenl unary funclions can le
reaIized in lhe sane hardvare in lhe processing parl, jusl ly using differenl sels of
coefficienls.
Since lhe nelhodoIogy is caIcuIaling an approxinalion of lhe originaI funclion lhe error lo
lhe desired precision can le lolh posilive and negalive. LspeciaIIy, if lhe vaIue of lhe
approxinalion is Iess lhan lhe desired precision, lhe vord Ienglh can have lo le increased
conpared vilh lhe vord Ienglh needed lo acconpIish lhe desired precision. If lhe order of
lhe Iasl used sul-funclion is n > 1, an inprovenenl of lhe precision can le done ly
oplinizing one or nore coefficienls c
2
in (7) or c
n,m
in (12). The oplinizalion of lhe
coefficienls viII nininize lhe error in lhe Iasl used sul-funclion and lherely il can reduce
lhe vord Ienglh needed lo acconpIish lhe desired accuracy. Conpuler sinuIalions perforn
such coefficienl oplinizalion nunericaIIy.

3.3 Postprocessing
The poslprocessing parl lransforns lhe vaIue lo lhe oulpul resuIl z. If lhe approxinalion is
inpIenenled as a lIock in a syslen lhe poslprocessing parl can le laken inlo consideralion
in lhe foIIoving lIocks, vhich inpIies lhal lhe poslprocessing parl can le excIuded.
VLS 210

4. ImpIementation of the sine function

An inpIenenlalion of lhe funclion sin(t), using lhe proposed nelhodoIogy (Herlz &
NiIsson, 2OO9) is descriled in lhis seclion as an exanpIe.

4.1 Preprocessing

Iig. 1O. The funclion f(t) lefore nornaIizalion and lhe originaI funclion f
crg
(x).

To salisfy lhal lhe vaIues of lhe inconing operand x is in lhe inlervaI O < x < 1 a /2 is
nuIlipIied vilh lhe operand as shovn in (13).

v =
t
2
x
(13)

To nornaIize lhe f(t)=sin(t) funclion t is sulsliluled vilh x vhich gives lhe originaI
funclion f
crg
(x) (14).

f
org
(x) = sin
t
2
x
|
\

|
.
|

(14)

In Iig. 1O lhe f(t) funclion is shovn logelher vilh lhe originaI funclion f
crg
(x).

A Methodology for Parabolic Synthesis 211

4.2 Processing
Ior lhe processing parl, sul-funclions are deveIoped according lo lhe proposed
nelhodoIogy. Ior lhe firsl sul-funclion s
1
(x), lhe coefficienl c
1
is defined according lo (5).
The delernined vaIue of lhe coefficienl is shovn in (15).

s
1
(x) = x +
t
2
1
|
\

|
.
|
x x
2
( )
|
\

|
.
|

(15)

The firsl funclion f
1
(x), is conpuled as shovn in (16).

f
1
(x) =
f
org
(x)
s
1
(x)

(16)

To deveIop lhe second sul-funclion s
2
(x), lhe coefficienl c
2
is defined according lo (7). The
delernined vaIue of lhe coefficienl is shovn in (17).

s
2
(x) = 1+ 0.400858
( )
x x
2
( ) ( )

(17)

The second funclion f
2
(x), is conpuled as shovn in (18).

f
2
(x) =
f
1
(x)
s
2
(x)

(18)

To deveIop lhe lhird sul-funclions s
3
(x), lhe second funclion f
2
(x), is divided inlo ils lvo
parliaI funclions as shovn in (8). The lhird order of sul-funclions is lherely divided inlo
lvo sul-funclions, vhere s
3,0
(x
3
) is reslricled lo lhe inlervaI O < x < O.5 and s
3,1
(x
3
) is
reslricled lo lhe inlervaI O.5 < x < 1.O according lo (9). A nornaIizalion of x lo x
3
is done lo
sinpIify in lhe inpIenenlalion in hardvare, vhich is descriled in (1O).
Ior each sul-funclion, lhe corresponding coefficienls c
3,0
and c
3,1
is delernined. These
coefficienls are delernined according lo (12) vhere higher order sul-funclions can le
deveIoped. The delernined vaIues of lhe coefficienls are shovn in (19).

s
3,0
(x
3
) = 1+ 0.0122452 x
3
x
3
2
( ) ( )
, 0 s x < 0.5
s
3,1
(x
3
) = 1+ 0.0105947 x
3
x
3
2
( ) ( )
, 0.5 s x <1

(19)

The lhird funclion f
3
(x), is conpuled as shovn in (2O).

f
3
(x) =
f
2
(x)
s
3
(x)

(2O)

To deveIop lhe fourlh sul-funclions s
4
(x), lhe lhird funclion f
3
(x), is divided inlo ils four
parliaI funclions as shovn in (8). The fourlh order of sul-funclions is lherely divided inlo
four sul-funclions, vhere s
4,0
(x
4
) is reslricled lo lhe inlervaI O < x < O.25, s
4,1
(x
4
) is reslricled
VLS 212

lo lhe inlervaI O.25 < x < O.5, s
4,2
(x
4
) is reslricled lo lhe inlervaI O.5 < x < O.75 and s
4,3
(x
4
) is
reslricled lo lhe inlervaI O.75 < x < 1.O according lo (9). A nornaIizalion of x lo x
4
is done lo
sinpIify lhe hardvare inpIenenlalion, vhich is descriled in (1O).
Ior each sul-funclion, lhe corresponding coefficienls c
4,0
, c
4,1
, c
4,2
and c
4,3
is delernined.
These coefficienls are delernined according lo (12) vhich acconpIish lhal higher order of
sul-funclions can le deveIoped. The delernined vaIues of lhe coefficienls are shovn in (21).

s
4,0
(x
4
) = 1+ 0.00223363 x
4
x
4
2
( ) ( )
, 0 s x < 0.25
s
4,1
(x
4
) = 1+ 0.00192558 x
4
x
4
2
( ) ( )
, 0.25 s x < 0.5
s
4,2
(x
4
) = 1+ 0.00119216 x
4
x
4
2
( ) ( )
, 0.5 s x < 0.75
s
4,3
(x
4
) = 1+ 0.00126488 x
4
x
4
2
( ) ( )
, 0.75 s x <1

(21)

No poslprocessing is needed since lhe resuIl oul fron lhe processing parl has lhe righl size.

4.3 Optimization
If no nore sul-funclions are lo le deveIoped lhe precision of lhe approxinalion can le
furlher inproved ly oplinizalion of coefficienls c
4,0
, c
4,1
, c
4,2
and c
4,3
. As shovn in Iig. 12
sul-funclion s
4,3
(x) in lhe inlervaI O.75 < x < 1.O has lhe Iargesl reIalive error. When
perforning an oplinizalion of sul-funclion s
4,3
(x) in lhe inlervaI O.75 < x < 1.O il vas found
lhal lhe vord Ienglh in lhe conpulalions couId le reduced fron 17 lils lo 16 lils.

4.4 Architecture
In Iig. 11, archileclure of lhe approxinalion of lhe sine funclion using lhe proposed
nelhodoIogy is shovn.
The x
2
lIock in Iig. 11 is lhe speciaI designed nuIlipIier descriled in Iig. 8 and Iig. 9 lhal
deIivers lhe parliaI resuIls q, q
3
and q
4
used in lhe foIIoving lIocks. In lhe x-q lIock, x is
sullracled vilh lhe parliaI resuIl q, fron lhe x
2
lIock. The resuIl r fron lhe x-q lIock is lhen
used in lhe lvo foIIoving lIocks as shovn in Iig. 11. In lhe x+(c
1
r) lIock is s
1
(x) perforned,
in 1+(c
2
r) is s
2
(x) perforned, in 1+(c
3
(x
3
-q
3
)) is s
3
(x) perforned and in 1+(c
4
(x
4
-q
4
)) is s
4
(x)
perforned. Nole, lhal in lhe lIocks for sul-funclion s
3
(x) and s
4
(x), lhe individuaI index m is
addressing lhe MUX lhal seIecls lhe coefficienls in lhe lIock.

4.5 Optimization of Word Length
As shovn in (19) and (21) lhe alsoIule vaIue of lhe coefficienls decreases in size vilh
increasing index nunler of lhe coefficienl. In siniIarily lo lhe vord Ienglh of lhe coefficienls
lhe vord Ienglh of lhe (x
n
-q
n
) parl, shovn in Iig. 11, viII decrease in size vilh increasing
index nunler. The decreased vord Ienglh viII course lhal lhe size of lhe nuIlipIier used in
a sul-funclion lo decreas accordingIy lo lhe highesl vaIue lil in lhe coefficienls and of lhe
(x
n
-q
n
) parl. In resenlIance lo alove lhe size of lhe nuIlipIiers conpuling lhe nuIlipIicalion
of lhe sul-funclions can le anaIyzed. This anaIysis viII aIso resuIl in lhal sone of lhe
foIIoving nuIlipIiers accordingIy can le decrease in size.
A Methodology for Parabolic Synthesis 213


Iig. 11. The archileclure of lhe inpIenenlalion of lhe sine funclion.

4.6 Precision
In Iig. 12 lhe resuIling precision vhen using one lo four sul-funclions is shovn. DecileI
scaIe is used lo visuaIize lhe precision since lhe conlinalion of linary nunlers and d
vorks very veII logelher. In d scaIe 2 is equaI lo 2OIog
1O
(2) = 2O(O.3) = 6 d and since 6 d
corresponds lo 1 lil, lhis viII nake il sinpIer lo undersland lhe resuIl. As shovn in Iig. 12,
lhe reIalive error decreases vilh lhe nunler of used sul-funclions. Wilh 4 sul-funclions ve
can see lhal ve have accuracy leller lhal 14 lils lhal viII resuIl in al Ieasl a Ialency of 14
adders in lhe CORDIC aIgorilhn is used.

VLS 214


Iig. 12. Lslinalion of lhe reIalive error lelveen lhe originaI funclion and differenl nunlers
of sul-funclions.

As shovn in Iig. 12, lhe reIalive error decreases vilh lhe nunler of sul-funclions used.
Hovever, increases lhe deIay vilh lhe nunler of sul-funclion as shovn in TalIe 1.

Nunler of sul-funclions DeIay
1 2 nuIl + 2 add
2 3 nuIl + 2 add
3 lo 4 4 nuIl + 2 add
5 lo 8 5 nuIl + 2 add
TalIe 1. DeIay reIalive lhe nunler of sul-funclions.

The deIay of one nuIlipIier is aloul lvo adders.

5. Comparison

The nosl connon nelhods used vhen inpIenenling approxinalion of a unary funclions
in hardvare are Iook-up lalIes, poIynoniaIs, lalIe-lased nelhods vilh poIynoniaIs and
CORDIC. Conpulalion ly lalIe Iook-up is allraclive since nenory is nuch denser lhan
randon Iogic in VLSI reaIizalions. Hovever, since lhe size of lhe Iook-up lalIe grovs
exponenliaIIy vilh increasing vord Ienglhs, lolh lhe lalIe size and execulion line lecones
lolaIIy inloIeralIe. Conpulalion ly poIynoniaIs is allraclive since il is ROM-Iess. The
disadvanlages are lhal il can inpose Iarge conpulalionaI conpIexilies and deIays.
A Methodology for Parabolic Synthesis 215

Conpulalion ly lalIe-lased nelhods conlined vilh poIynoniaIs is allraclive since il
reduces lhe conpulalionaI conpIexily and decreases lhe deIays. Since lhe size of lhe Iook-
up lalIes grovs vilh lhe accuracy lhe execulion line aIso increases vilh lhe needed
accuracy. Conpulalion ly using CORDIC is allraclive since il is using an anguIar rolalion
aIgorilhn lhal can le inpIenenled vilh snaII Iook-up lalIes and hardvare, vhich is
Iiniled lo sinpIe shifls and addilions. The CORDIC aIgorilhn is an ileralive nelhod vilh
high Ialency and Iong deIays. This nakes lhe nelhod insufficienl for appIicalions vhere
shorl execulion line is essenliaI.
In aII nelhods incIuding lhe proposed nelhod, il is a lrade-off lelveen conpIexily and
nenory slorage. y using paraIIeIisn in lhe conpulalion and paraloIic synlhesis in lhe
reconlinalion process, lhe proposed nelhodoIogy lherely gels a shorl crilicaI palh, vhich
assures fasl conpulalion.

6. Using the MethodoIogy

Il has leen shovn lhal lhe nelhodoIogy of paraloIic synlhesis can direclIy conpule lhe sine
funclion lul lhe nelhodoIogy is aIso alIe lo conpule olher lrigononelric funclions,
Iogarilhns as veII as square rool and division. In lhe foIIoving parls aIgorilhns for
eIenenlary funclions viII le shovn.
When descriling lhe inpIenenlalion of each funclion lhe differenl parls are shovn in a
lalIe. The firsl rov in lhe lalIe shovs lhe funclion lo le inpIenenled and in vhich inlervaI
lhe funclion is inpIenenled. In lhe second rov il is descriled hov lo perforn lhe
nornaIizalion of lhe funclion. The lhird rov shovs lhe originaI funclion lo le used vhen
deveIoping lhe approxinalion. The Iasl rov descriles hov lo perforn lransfornalion of lhe
approxinalion inlo desired inlervaI.

6.1 The Sine Function
When deveIoping lhe aIgorilhn lhal perforns lhe approxinalion of lhe sine funclion, lhe
nornaIizalion in lhe preprocessing parl is perforned as a sulslilulion according lo TalIe 2.
Since lhe oulcone of lhe approxinalion has lhe desired forn no poslprocessing is needed.

AIgorilhn Range
Iunclion f (v) = sin v ( )
0 s v <
t
2

Ireprocessing
x =
2
t
v 0 s x <1
Irocessing y = sin
t
2
x
|
\

|
.
|
0 s y <1
Ioslprocessing
: = y
0 s : <1
TalIe 2. The aIgorilhns for lhe sine funclion.

6.2 The Cosine Function
The aIgorilhn lhal perforns lhe approxinalion of lhe cosine funclion is founded on lhe
aIgorilhn lhal perforns lhe approxinalion of lhe sine funclion. To perforn lhe
VLS 216

approxinalion of lhe cosine funclion x is sulsliluled vilh 1-x in lhe preprocessing parl of
lhe approxinalion for lhe sine funclion.

AIgorilhn Range
Iunclion f (v) = cos v ( )
0 < v s
t
2

Ireprocessing
x = 1
2
t
v 0 s x <1
Irocessing y = sin
t
2
x
|
\

|
.
|
0 s y <1
Ioslprocessing
: = y
0 s : <1
TalIe 3. The aIgorilhn for lhe cosine funclion.

6.3 The Arcsine Function
When deveIoping lhe aIgorilhn lhal perforns lhe approxinalion of lhe arcsine funclion, lhe
nelhodoIogy has prolIens lo perforn an approxinalion for angeIs Iarger lhan t/4.
Therefore, lhe range of lhe approxinalion has leen Iiniled according lo lhe range of lhe
funclion in TalIe 4. To salisfy lhe requirenenls of lhe nelhodoIogy in lhe preprocessing
parl a sulslilulion according lo TalIe 4 has lo le perforned. To gel lhe desired oulcone lhe
approxinalion is nuIlipIied vilh a faclor according lo TalIe 4.

AIgorilhn Range
Iunclion f (v) = arcsin v ( ) 0 s v <
1
2

Ireprocessing
x = 2 v 0 s x <1
Irocessing y = arcsin
x
2
|
\

|
.
|
0 s y <1
Ioslprocessing
: =
t
4
y 0 s : <
t
4

TalIe 4. The aIgorilhn for lhe arcsine funclion.

6.4 The Arccosine Function
The aIgorilhn lhal perforns lhe approxinalion of lhe arccosine funclion is founded on lhe
aIgorilhn perforning lhe approxinalion of lhe arcsine funclion. The difference lelveen lhe
lvo approxinalions is in lhe lransfornalion in lhe poslprocessing parl, as shovn in TalIe 5.






A Methodology for Parabolic Synthesis 217

AIgorilhn Range
Iunclion f (v) = arccos v ( ) 0 s v <
1
2

Ireprocessing
x = 2 v 0 s x <1
Irocessing y = arcsin
x
2
|
\

|
.
|
0 s y <1
Ioslprocessing
: =
t
4
+
t
4
1 y ( )
t
4
< : s
t
2

TalIe 5. The aIgorilhn for lhe arccosine funclion.

6.5 The Tangent Function
When deveIoping lhe aIgorilhn lhal perforns lhe approxinalion of lhe langenl funclion lhe
angIe range is fron O lo t/4, since lhe langenl funclion is nol slriclIy concave or convex for
higher angIes. To perforn lhe nornaIizalion lhe preprocessing parl is perforned as a
sulslilulion according lo TalIe 6. Since lhe oulcone of lhe approxinalion has lhe desired
forn no poslprocessing is needed.

AIgorilhn Range
Iunclion f (v) = tan v ( )
0 s v <
t
4

Ireprocessing x =
4 v
t
0 s x <1
Irocessing y = tan
t
4
x
|
\

|
.
|
0 s y <1
Ioslprocessing
: = y
0 s : <1
TalIe 6. The aIgorilhn for lhe langenl funclion.

6.6 The Arctangent Function
When deveIoping lhe aIgorilhn lhal perforns lhe approxinalion of lhe arclangenl funclion
il can onIy le perforned in lhe range fron O lo 1 vhere lhe funclion is slriclIy concave or
convex. To gel lhe desired oulcone lhe approxinalion is in lhe poslprocessing parl
nuIlipIied vilh a faclor according lo TalIe 7.

AIgorilhn Range
Iunclion f (v) = arctan v ( )
0 s v <1
Ireprocessing x = v 0 s x <1
Irocessing
y = arctan x ( )
4
t

0 s y <1
Ioslprocessing
: =
t
4
y 0 s : <
t
4

TalIe 7. The aIgorilhn for lhe arclangenl funclion.
VLS 218

6.7 The Logarithmic Function
When deveIoping lhe aIgorilhn lhal perforns lhe approxinalion of lhe Iogarilhn funclion
vilh lhe lase lvo, il is onIy perforn on lhe nanlissa of lhe fIoaling-poinl nunler, since lhe
exponenl parl is scaIing lhe nanlissa. Ior lhe preprocessing parl a sulslilulion according lo
TalIe 8 has lo le perforned lo salisfy lhe nornaIizalion crilerias for lhe nelhodoIogy. Since
lhe oulcone of lhe approxinalion has lhe desired forn no poslprocessing is needed.

AIgorilhn Range
Iunclion f (v) = log
2
v ( )
1s v < 2
Ireprocessing x = v 1 0 s x <1
Irocessing y = log
2
1+ x ( ) 0 s y <1
Ioslprocessing
: = y
0 s : <1
TalIe 8. The aIgorilhn for lhe Iogarilhn funclion.

6.8 The ExponentiaI Function
When deveIoping lhe aIgorilhn lhal perforns lhe approxinalion of lhe exponenliaI
funclion vilh lhe lase lvo, il is onIy perforned on lhe fraclionaI parl of lhe Iogarilhn since
lhe inleger parl is scaIing lhe fraclionaI parl of lhe Iogarilhn. As shovn in TalIe 9 onIy a
one needs lo le added in lhe poslprocessing parl lo gel lhe desired oulcone.

AIgorilhn Range
Iunclion f (v) = 2
v
0 s v <1
Ireprocessing x = v 0 s x <1
Irocessing y = 2
x
1
0 s y <1
Ioslprocessing : = 1+ y 1 s : < 2
TalIe 9. The aIgorilhn for lhe exponenliaI funclion.

6.9 The Division Function
When deveIoping lhe aIgorilhn lhal perforns lhe approxinalion of lhe division il is Iiniled
lo lhe range according lo TalIe 1O, since lhe division is nol slriclIy concave or convex
oulside lhis range. The pre- and posl-processing parl lolh needs conpulalion vhen
perforning lhe approxinalion of lhe division.

AIgorilhn Range
Iunclion
f (v) =
1
1+ v
0.5 < v s1
Ireprocessing x = 2 1 v ( )
0 s x <1
Irocessing
y =
6
1+ 1
x
2
|
\

|
.
|
3
0 s y <1
Ioslprocessing
: =
3+ y
6

1
2
s : <
2
3

TalIe 1O. The aIgorilhn for lhe division funclion.
A Methodology for Parabolic Synthesis 219

6.10 The Square Root Function
When deveIoping lhe aIgorilhn lhal perforns lhe approxinalion of lhe square rool funclion
lhe range is Iiniled according lo TalIe 11. The pre- and posl-processing parl lolh needs
conpulalion vhen perforning lhe approxinalion of lhe square rool funclion.

AIgorilhn Range
Iunclion f (v) = 1+ v 1 s v < 2
Ireprocessing x = v 1 0 s x <1
Irocessing y =
2 + x 2
3 2
0 s y <1
Ioslprocessing
: = 2 + y 3 2
( )

2 s : < 3
TalIe 11. The aIgorilhn for lhe square rool funclion.

7. ConcIusions

A noveI nelhodoIogy for inpIenenling approxinalions of unary funclions such as
lrigononelric funclions, Iogarilhnic funclions, as veII as square rool and division funclions
elc. in hardvare is inlroduced. The archileclure of lhe processing parl aulonalicaIIy gives a
high degree of paraIIeIisn. The nelhodoIogy lo deveIop lhe approxinalion aIgorilhn is
founded on paraloIic synlhesis. This conlined vilh lhal lhe nelhodoIogy is founded on
operalions lhal are sinpIe lo inpIenenl in hardvare such as addilion, shifls, nuIlipIicalion,
conlrilules lo lhal lhe inpIenenlalion in hardvare is sinpIe lo perforn. y using lhe
paraIIeIisn and paraloIic synlhesis, one of lhe nosl inporlanl characlerislics vilh lhe oul
coning hardvare is lhe paraIIeIisn lhal gives a shorl crilicaI palh and fasl conpulalion. The
slruclure of lhe nelhodoIogy viII aIso assure an area efficienl hardvare inpIenenlalion.
The nelhodoIogy is aIso suilalIe for aulonalic synlhesis.

8. References

. Iarhani (2OOO), Ccmpu|cr Ari|nmc|ic, Oxford Universily Iress Inc., ISN: O-19-512583-5,
198 Madison Avenue, Nev York, Nev York 1OO16, USA.
L. Herlz, I. NiIsson (2OO8), A MelhodoIogy for IaraloIic Synlhesis of Unary Iunclions for
Hardvare InpIenenlalion, Prcc. cf |nc 2n |n|crna|icna| Ccnfcrcncc cn Signa|s,
Circui|s an Sqs|cms, SCSO8_9.pdf, pp 1-6, ISN-13: 978-1-4244-2628-7, Hannanel,
Tunisia, Nov. 2OO8, Insl. of LIec. and LIec. Lng. Conpuler Sociely, 445 Hoes Lane -
I.O.ox 1331, Iiscalavay, N} O8855-1331, Uniled Slales.
L. Herlz, I. NiIsson (2OO9), IaraloIic Synlhesis MelhodoIogy InpIenenled on lhe Sine
Iunclion, Prcc. cf |nc 2009 |||| |n|crna|icna| Sqmpcsium cn Circui|s an Sqs|cms, pp
253-256, ISN: 978-1-4244-3828-O, Taipei, Taivan, May. 2OO9.
L. Meijering (2OO2), A ChronoIogy of InlerpoIalion Iron Ancienl Aslronony lo Modern
SignaI and Inage Irocessing, Prccccings cf |nc ||||, voI. 9O, no. 3, March 2OO2, pp.
319-342, ISSN: OO189219, Inslilule of LIeclricaI and LIeclronics Lngineers Inc.
}. L. VoIder (1959), The CORDIC Trigononelric Conpuling Technique, |R| Transac|icns cn
||cc|rcnic Ccmpu|crs, voI. LC-8, no. 3, 1959, pp. 33O-334.
VLS 220

}. Iourier (1822), Tnccric Ana|q|iquc c |a Cna|cur, Iaris, Irance.
}.-M. MuIIer (2OO6), ||cmcn|arq |unc|icns. A|gcri|nm |mp|cmcn|a|icn, scccn c. irkhauser,
ISN O-8176-4372-9, irkhauser oslon, c/o Springer Science+usiness Media Inc.,
233 Spring Slreel, Nev York, NY 1OO13, USA.
K. Z. Iekneslzi, I. KaIivas, and N. MoshopouIos (2OO1), Long unsigned nunler sysloIic
seriaI nuIlipIiers and squarers, |||| Circui|s an Sqs|cms ||., voI. 48, no 3, March
2OO1, pp. 316 -321, ISSN 1O57-713O.
I. Ienne, M. A. Viredaz (1994), il-seriaI nuIlipIiers and squarers, |||| Transac|icns cn
Ccmpu|cr, voI. 43, no12, Dec. 1994, pp. 1445 -145O, ISSN OO18-934O.
I.T.I. Tang (1991), TalIe-Iookup aIgorilhns for eIenenlary funclions and lheir error
anaIysis, Prcc. cf |nc 10|n |||| Sqmpcsium cn Ccmpu|cr Ari|nmc|ic, pp. 232 - 236,
ISN: O-8186-9151-4, CrenolIe, Irance, }une 1991.
R. Andrala (1998), A survey of CORDIC aIgorilhns for IICA lased conpulers, Prcc. cf |nc
1998 ACM/S|GDA Six|n |n|cr. Sqmp. cn |ic| Prcgramma||c Ga|c Arraq (IICA98),
pp. 191-2OO, ISN: O-89791-978-5 , Monlerey, CA, Iel. 1998, ACM Inc.


Fully Systolic FFT Architectures for Giga-sample Applications 221
X

FuIIy SystoIic FFT Architectures
for Giga-sampIe AppIications

D. Reisis
Dcpar|mcn| cf Pnqsics, ||cc|rcnics |a|cra|crq,
Na|icna| Kapcis|rian Unitcrsi|q cf A|ncns,
Grcccc

1. Introduction

This chapler presenls a lechnique for designing archileclures, vhich execule Iasl Iourier
Transforn (IIT) aIgorilhns invoIving a Iarge nunler of conpIex poinls and lhey largel
reaI-line appIicalions (Lrsoy, 1997). Hilherlo pulIished lechniques and IIT archileclures
incIude nainIy ASIC designs (Thonpson, 1983, WoId and Despain, 1984, He and TorkeIson,
1996, Choi el aI., 2OO3, Uzun el aI., 2OO5, ouguezeI el aI., 2OO4, 2OO6, }o & Sunvoo, 2OO5,
Chang & Nguyen, 2OO6, Yang el aI., 2OO6, Lin el aI., 2OO5, TakaIa & Iunkka, 2OO6, Wang &
Li, 2OO7, Reisis & VIassopouIos, 2OO6), vhich vary vilh respecl lo lhe IeveI of paraIIeIisn,
lhe lhroughpul rale, lhe Ialency, lhe hardvare cosl and lhe pover consunplion. The nosl
connon ASIC archileclures are lhe fuIIy unfoIded IIT reaIizalions (Raliner & CoId)
uliIizing Iarge nenory arrays lelveen lheir successive slages and aIso lhe Ialency and
nenory efficienl cascade IIT lopoIogies (Thonson, 1983, He & TorkeIson, 1996, 1998). The
cascade soIulions lhough, as veII as lhe high radix lechniques Iead lo conpIicaled designs
in lhe case of archileclures paranelerized vilh respecl lo lhe pipeIining of conpulalions,
lhe IIT size and lhe dala Ienglh.
This chapler descriles a lechnique lo design efficienl, very high speed, deepIy-pipeIined
IIT archileclures naxinizing lhroughpul and keeping conlroI and nenory organizalions
sinpIe conpared lo lhe cascade and lhe fuIIy unfoIded IIT archileclures. Moreover, lhe
design is proven nore efficienl conparing lo lhe previousIy nenlioned archileclures in
lerns of scaIaliIily, naxinun operaling frequency and consequenlIy, in lerns of pover
consunplion, pipeIine deplh and dala and/or lviddIe lil-vidlhs. The lechnique inproves
lhe Ialency and lhe nenory requirenenls -parlicuIarIy for Iarge inpul dala sels- of sysloIic
IIT archileclures ly conlining lhree (3) Radix-4 circuils lo resuIl in a 64-poinl IIT engine.
The efficiency of organizing 64-poinl IIT engines lased on Radix-4 IIT engines is shovn ly
a 4O96-conpIex poinl design. This archileclure requires onIy lvo duaI nenory lanks of
4O96 vords each and on a XiIinx Virlex II IICA perforns al 2OO MHz lo suslain a
lhroughpul of 4O96 poinls/2O.48 us. The design inpIenenled on a high perfornance O.13
un, 1I8M CMOS (slandard ceII) process fron UMC achieved a vorsl-case (O.9V, 125 C)
posl-roule frequency of 6O4.5 MHz, vhiIe consuning 4.4 Walls. Il is inleresling lo poinl oul
11
VLS 222
lhal lhe design exceeded lhe 1 CHz frequency (rale of 1 CSanpIe/sec) for lypicaI condilions
(1.OV, 25C).
Tovards designing archileclures IITs vilh Iarge inpul dala sels ve consider lhe 4O96
conpIex poinl IIT archileclure as a core. The core conslilules lhe lasis of IIT archileclures
conpuling lransforns of 16K, 64K and 256K conpIex poinls. These archileclures
inpIenenled in lhe O.13 CMOS process perforn al 352, 256 and 188 Mhz vorsl-case (O.9V,
125 C) posl-roule frequencies respecliveIy. The 16K and lhe 64K archileclures have a four
poinl paraIIeI inpul/oulpul achieving lhroughpuls of 1.4 and 1 CSanpIe/sec respecliveIy.
Iurlher, lhis chapler viII presenl a lechnique, vhich aIIovs lhe paraIIeIizalion of lhe
nenory accesses in hardvare inpIenenlalions of lhe IIT aIgorilhn. This lechnique enalIes
each processor lo perforn a radix-l lullerfIy ly Ioading lhe l-lupIe dala fron l nenory
lanks in paraIIeI, lhen ly operaling on lhe l dala and finaIIy, ly sloring lhe resuIling l-
lupIe in l nenory lanks in paraIIeI. Hence, lhe speedup and lhe lhroughpul increase ly l.
Techniques paraIIeIizing lhe IIT accesses are reporled in (}ohnson, 1992, Ma, 1999, Reisis &
VIassopouIos, 2OO6). We descrile lhe lechnique in (Reisis & VIassopouIos, 2OO6), vhich is
deveIoped for arlilrary radix and slraighlforvard lo inpIenenl.
The chapler is organized in lhree lechnicaI seclions. Seclion 2 shovs hov lo organize radix-4
conpulalions lo resuIl in a radix-4
3
-equivaIenl lo radix-64-conpulalion and descriles lhe
fuIIy sysloIic 4O96-poinl archileclure. Seclion 3 viII descrile lhe 16K, 64K and 256K poinl
archileclures. Seclion 4 presenls lhe access paraIIeIizalion lechnique in lhe IIT archileclures
and seclion 5 concIudes lhe chapler.

2. A FuIIy SystoIic 4096 compIex point FFT Architecture

This seclion presenls an allraclive soIulion conlining lhe advanlages of lolh lhe cascaded
(Thonpson, 1983, He & TorkeIson, 1998, OH & Lin, 2OO5) and lhe unfoIded (Raliner &
CoId, Thonpson, 1983) slruclures. The organizalion uses lhe unfoIded slruclure vilh radix-
64 (R64) slages. We nole lhal 64 is a pover of eilher 2 or 4 and il is sufficienlIy Iarge lo resuIl
in snaII nunler of slages. ReaIizing lhe 64 poinl IIT al high speed vilh radices grealer lhan
2 and 4 lecones inefficienl, lecause lhe use of higher radices inposes conpIicaled
slruclures and Ieads lo Iover lhe archileclures perfornance. R4 conpulalions are equaIIy
slraighlforvard lo inpIenenl as lhe R2, lul R4 designs reduce lhe nunler of conpIex
nuIlipIiers lo lhe haIf of lhose required ly lhe R2 designs. AIso, R4 process 4 poinls in a 2-
slep conpulalion accumu|a|ing-mu||ip|qing vilh lhe mu||ip|qing reduced lo sign-intcrsicn and
suap operalions. This resuIl nininizes lhe nunler of possilIe lollIenecks in lhe dala fIov
and aIIovs lhe naxinaI pipeIining and/or paraIIeIizalion of lhe four conpuling sleps.
ConsequenlIy, lhe choice of R4 Ieads lo a nore efficienl design vilh respecl lo lhe
nuIlipIiers, lhe pipeIine deplh and lhe scaIaliIily of lhe operaling frequency and/or of lhe
vord Ienglh.

2.1 Radix-4 AIgorithm
To produce lhe R4 slruclure ve slarl vilh lhe Discrele Iourier Transforn (DIT) of a signaI
x|nj of Ienglh N,

_

=
=
1 N
O n
kn
N
W j n | x j k | X (1)
Fully Systolic FFT Architectures for Giga-sample Applications 223
|
.
|

\
|
+ + + + |
.
|

\
|
+ + + +
|
.
|

\
|
+ + + + |
.
|

\
|
+ + = |
.
|

\
|
+ +
4
N 3
n
16
N
n
64
N
n x j
2
N
n
16
N
n
64
N
n x ) 1 (
4
N
n
16
N
n
64
N
n x ) j ( n
16
N
n
64
N
n x n
16
N
n
64
N
n
3 2 1
k
3 2 1
k
3 2 1
k
3 2 1 3 2 1
k
4
N
4 4
4 4
vhere,
N
2
j
N
e W
t

= are lhe lviddIe faclors and denole lhe N-lh prinilive rool of unily
(Oppenhein, 1975). The archileclure presenled in lhis seclion is lased on a four-
dinensionaI index nap and on a R4 deconposilion of lhe DIT series. The derivalion of lhe
Radix 4 aIgorilhn Iies on lhree (3) sleps in lhe cascade deconposilion. In lhe frane vork
of lhese 3 sleps, lhe Iinear napping lransforns inlo a four-dinenlionaI index nap |1O,4j as
foIIovs:


4 3 2 1
4 3 2 1
k k 4 k 16 k 64 k
n
4
N
n
16
N
n
64
N
n n
+ + + =
+ + + =
(2)

AppIying equalion 2 lo lhe DIT equalion yieIds


nk
N 4 3 2 1
3
O n
3
O n
3
O n
1
64
N
O n
4 3 2 1
W ) n
4
N
n
16
N
n
64
N
n ( x ) k k 4 k 16 k 64 ( X
4 3 2 1
+ + + = + + +
_ _ _ _
= = =

=
(3)

The conposile lviddIe faclors can le expressed as foIIovs


1 1 4 3 2 1 4 3 2 4 3 4 4 3 3 2 2
4 3 2 1 4 3 2 1
k n
64
N
j k k 4 k 16 ( n
N
) k k 4 ( n
64
k n
16
k n k n k n kn
N
j k k 4 k 16 k 64 j| n
4
N
n
16
N
n
64
N
n |
N
kn
N
W W W W ) j ( W
W W
+ + + + +
+ + + + + +
=
=
(4)

AppIying equalion 4 inlo equalion 3 and expanding lhe sunnalion vilh index
4
n yieId


1 1 4 3 2 1 4 3 2 4 3 3 3 2 2
3
4
2 1
k n
64
N
) k k 4 k 16 ( n
N
) k k 4 ( n
64
k n
16
k n k n
3
O n
3 2 1
k
4
N
3
O n
1
64
N
O n
4 3 2 1
W W W W ) j (
n
16
N
n
64
N
n
) k k 4 k 16 k 64 ( X
+ + + +
= =

(
(

|
.
|

\
|
+ +
= + + +
_ _ _
(5)
Where
|
.
|

\
|
+ +
3 2 1
k
4
N
n
16
N
n
64
N
n
4
denoles lhe firsl lullerfIy unil and can le vrillen as


(6)


VLS 224
Lxpanding equalion 5 vilh respecl lo lhe nexl sunnalion vilh index n
3
yieIds

( )
1 1 4 3 2 1 4 3 2 2 2
2
4 3
1
k n
64
N
) k k 4 k 16 ( n
N
) k k 4 ( n
64
k n
3
O n
2 1
k k
16
N
1
64
N
O n
4 3 2 1
W W W j n
64
N
n H
) k k 4 k 16 k 64 ( X
+ + +
=

=

|
.
|

\
|
+
= + + +
_ _
(7)

vhere
|
.
|

\
|
+
2 1
k k
16
N
n
64
N
n H
4 3
is lhe secondary lullerfIy slruclure and can le expressed as

( )
( ) |
.
|

\
|
+ + + |
.
|

\
|
+ + +
+ |
.
|

\
|
+ + + |
.
|

\
|
+ = |
.
|

\
|
+
16
N 3
n
64
N
n W W j
8
N
n
64
N
n W 1
16
N
n
64
N
n W j n
64
N
n n
64
N
n H
2 1
k
4
N
k
16
k
8
k
2 1
k
4
N
k
8
k
2 1
k
4
N
k
16
k
2 1
k
4
N
2 1
k k
16
N
4 4 4 3 4 4 3
4 4 3 4 4 3
(8)

IinaIIy, expanding lhe sunnalion of equalion 7 vilh regard lo index n
2
provides a sel
of 64 DITs of Ienglh N/64.
| |
1 1
1
4 3 2 1 4 3 2
k n
64 / N
1 ) 64 / N (
O n
) k k 4 k 16 ( n
N
1
k k k
64 / N
4 3 2 1
W W ) n ( T ) k k 4 k 16 k 64 ( X
_

=
+ +
= + + + (9)

vhere ) n ( T
1
k k k
64 / N
4 3 2
represenls lhe lhird lullerfIy and is expressed according lo equalion 1O

( ) ( ) ( )
( )
|
.
|

\
|
+ +
+
|
.
|

\
|
+ +
+
|
.
|

\
|
+ + =
+
64
N 3
n H W j
32
N
n H W W 1
64
N
n H W W j n H n T
1
k k
16
N
) k k 4 ( 3
64
k
1
k k
16
N
k
32
k
8
k
1
k k
16
N
k
64
k
16
k
1
k k
16
N
1
k k k
64
N
4 3 4 3
4 3 4 3
2
4 3 4 3
2
4 3 4 3 2
2
(1O)

Lqualions 5 lo 1O descrile a radix-64 lased IIT. Iurlher, equalions 6, 8 and 1O descrile lhe
inlernaI slruclure of lhe radix-64 lullerfIy lased on lhree radix-4 lullerfIies and lherefore
caIIed Radix - 4
3
(R4
3
).

2.2 R4
3
Features
Iigure 1 shovs lhe fIov diagran of lhe R4
3
conpulalions vhere A , and> presenl
nuIlipIicalions ly
3 3
k
16
k
W ) j ( ,
4 3
k
8
k
W ) 1 ( and
4 4 3
k
16
k
8
k
W W j respecliveIy, vhiIe lhe ^ , V ,
< presenl nuIlipIicalions ly
4 3 3
k
64
k
16
k
W W ) j ( ,
4 3 2
k
32
k
8
k
W W ) 1 ( and
) k k 4 ( 3
64
k
4 3 2
W j
+
. Nole
lhal lhe order of lhe lviddIe faclors in lhe R4
3
differs fron lhe respeclive order of lhe R64.
Fully Systolic FFT Architectures for Giga-sample Applications 225
The IIT archileclure vilh R4
3
slages and inpul N conpIex poinls uses Iog
4
N1 conpIex
nuIlipIiers. Wilh each R4 using 3 conpIex adders lo produce 1 resuIl/cycIe lhe archileclure
has 3Iog
4
N conpIex adders. The nenory size is (N/3)Iog
4
N.
Conparing lo lhe cascaded R2
2
(He & TorkeIson, 1998) and R2
4
(OH & Lin, 2OO5) lhe
proposed unfoIded R4
3
has equaI nunler of nuIlipIiers (TalIe 1). The R4
3
has Iess conpIex
adders (3Iog
4
N) lhan lhe cascaded (4Iog
4
N) lul uses Iarger nenory size. The R4
3
IIT design
achieves lhe Ieasl nunler of nuIlipIiers and adders in lhe Iileralure, equaI lo lhal in (idel
el aI., 1995), and aIlhough il requires Iarger nenory, il uses sinpIe conlroI and il can
achieve high frequency perfornance as il is shovn in lhe foIIoving sulseclion.

2.3 Architecture
Iigure 2 depicls lhe archileclure of lhe 4O96-poinl IIT. The inpIenenlalion of lhe 4O96-
poinl IIT using R4
3
lullerfIies incIudes: firsl lhe use of a 2-dinensionaI index nap lased on
R64, second lhe deconposilion of lhe 4O96-poinl series inlo lvo suns using lhe R64

ConpIex MuIlipIiers ConpIex Adders Menory ConlroI
R2MDC * 2Iog
4
N 1 4Iog
4
N 3N/2-2 SinpIe
R2SDI * 2Iog
4
N 1 4Iog
4
N N 1 SinpIe
R4SDI * Iog
4
N 1 8Iog
4
N N 1 Mediun
R4MDC * 3(Iog
4
N 1) 8Iog
4
N 5N/2 4 SinpIe
R4SDC * Iog
4
N 1 3Iog
4
N 2N 2 ConpIex
R2
2
SDI ** Iog
4
N 1 4Iog
4
N N 1 SinpIe
R4
3
Iog
4
N 1 3Iog
4
N (N/3)Iog4N SinpIe
*(He & TorkeIson, 1998), **(OH & Lin, 2OO5)
TalIe 1. IIT archileclure hardvare conpIexily conparison

lullerfIy and finaIIy lhe repIacenenl of each R64 vilh a R4
3
. ConsequenlIy, lhe archileclure
consisls of lvo R4
3
engines, lvo 4O96-vord duaI lank nenory noduIes, one 4O96 poinl
read-onIy nenory sloring lhe W4O96 lviddIes and one conpIex nuIlipIier. The conlroI is
IocaI lo each noduIe and a gIolaI sync signaI synchronizes lhe noduIes. The lhroughpul is
1 conpIex-poinl/cIock-cycIe.
The overaII design aIIovs lhe oplinizalion of lhe archileclure, eilher gIolaIIy, or IocaIIy: The
archileclure can le inproved vilh respecl lo lhe operaling frequency, or lhe area, or lolh
vilhoul aIleralions in lhe conlroI, lhe scheduIing of lhe 4K luffers, lhe regislers and lhe
nenories vilhin lhe R4
3
engines. Hence, ve can nodify lhe pipeIined conpulalions and
expIoil lhe specifics of each lechnoIogy lo naxinize lhe operaling frequency.
The foIIoving sulseclions descrile lhe delaiIs of lhe R4 processor used as lhe lasis of lhe
R4
3
, lhe R4
3
archileclure, lhe addressing schene used for each R4
3
engine and discuss lhe
perfornance of lhe enlire IIT archileclure.

2.3.1 The Radix-4 basis of the R4
3

Iigure 3 depicls lhe Radix-4 engine archileclure used in lhe R4
3
. Lach Radix-4 engine
consisls of one conpIex AccunuIalor and one Svap noduIe. The lask of lhe Svap
noduIe is lo svilch lhe reaI parl of each inpul vaIue vilh lhe corresponding inaginary and
VLS 226
lhe opposile in order lo perforn lhe correcl lullerfIy operalions. Iigure 4 depicls lhe
archileclure of lhe AccunuIalor noduIe. The accunuIalor consisls of 8 regislers and 3
add/sul unils. The four A regislers are used as inpul regislers sloring every 4-lupIe of
dala, on vhich lhe R4 lullerfIy operalion viII le appIied. The four resislers are used as
an inlernediale slage hoIding lhe 4-lupIe (vhiIe lhe nexl 4-lupIe of inpul dala is shifled inlo
lhe A regislers) and Ioading lhe dala in paraIIeI lo lhe add/sul unils. The add-sul unils
forn an adder-lree archileclure, lhus avoiding lhe feedlack Ioops lhal connon
accunuIalors use.


Iig. 1. Radix 4
3
ullerfIy SIC

The dala fIov of lhe Radix-4 engine slarls vilh lhe dala enlering lhe Radix-4 engine as a
vord seriaI inpul slrean al a rale of one conpIex-poinl/cycIe. During each period of 4
cycIes four conseculive inpul dala are shifled in lhe 4 inpul regislers (Reg A). In lhe fiflh
cycIe lhis 4-lupIe is Ialched in lhe 4 slorage regislers (Reg ), vhiIe lhe nexl inpul 4-lupIe
is slarling ils inpul lo lhe A regislers. During lhe sixlh cycIe lhe 4-lupIe enlers in paraIIeI
lhe adder-lree lo produce lhe firsl of lhe four R4 resuIls. During lhe foIIoving 3 cycIes, lhe
Fully Systolic FFT Architectures for Giga-sample Applications 227
adder-lree uses as inpul lhe 4-lupIe slored in lhe regislers and il produces lhe renaining
3 of lhe four R4 resuIls, one resuIl/cycIe, ly foIIoving lhe Radix-4 conpulalionaI fIov.
Therefore, lhe lolaI Ialency of each accunuIalor is 5+2k cycIes, vhere k is lhe Ialency of each
add/sul unil, k = 5 in lhe inpIenenlalion.
A conlroI unil synchronizes lhe operalions of lhe Svap noduIe and lhe accunuIalor. A 2-
lil counler generales lhe necessary signaIs, conlroIIing lhe add/sul unils in order lo
perforn lhe correcl addilions and/or sullraclions according lo lhe Radix-4 schena.


Iig. 2. Iock diagran of lhe proposed 4K IIT Archileclure


Iig. 3. OveraII Lngine of lhe Radix 4

2.3.2 The R4
3
Architecture
Iigure 5 shovs lhe R4
3
engine consisling of 3 R4 lullerfIies, 2 conpIex nuIlipIiers, 2 DuaI
ank nenory eIenenls (1 consisling of 2x64 vords and 1 of 2x16 vords) and 2 Read OnIy
Menories sloring lhe W
64
and W
16
lviddIes. The conpIex lviddIe nuIlipIiers are 4 inleger-
nuIlipIiers of inpul 14 14 and are pipeIined vilh deplh 5. The conlroI of each pipeIine-
slage is IocaI and aII slages are synchronized lhrough a gIolaI sync signaI. ReaIizing lhe R64
slage ly lhe R4
3
engine offers a conlinalion of advanlages: lhe sinpIicily of lhe circuils as a
resuIl of using as lasic lullerfIy snaII radix (4) and lhe reguIarily of lhe archileclure, vhich
is luiIl hierarchicaIIy vilh conliguous R4
3
engines al lhe higher IeveI and vilh each R4
3
forned ly lhree R4 pipeIined conpulalions al lhe lasic IeveI. Iurlhernore, lhe R4
3
design is
VSLI area efficienl lecause il uliIizes lhe conpIex nuIlipIiers 75 of lhe cycIes and 1OO of
lhe olher resources al each cycIe. The IIT achieves lhe dynanic range of 84 d (anpIilude)
ly using lhe vidlh of 14 lils for each reaI and inaginary parl of each sanpIe. To nainlain
lhe naxinun precision (14 lils) for each parl, lhe R4
3
uses increased precision for inlernaI
VLS 228
caIcuIalions. The lil vidlh of each parl is 16 and 2O lils al lhe oulpul of lhe firsl and second
R4 slages respecliveIy. Al lhe oulpul of lhe finaI (lhird) R4 slage, lhe reaI and inaginary
parls are lruncaled lo provide an oulpul of 14 lils each. The lviddIe faclors are 18 lils reaI


Iig. 4. The Adder Tree of lhe ConpIex AccunuIalor


Iig. 5. Iock diagran of lhe Radix 4
3
ullerfIy Archileclure

and 18 lils inaginary. y appIying lhe alove nenlioned dala fornal, lhe inpIenenlalion of
lhe 4O96 poinl IIT vilh fixed poinl caIcuIalions is aInosl as accurale as a fIoaling poinl
inpIenenlalion given lhe dynanic range of 84 d. The fixed poinl IITs oulpuls are vilhin
a nargin of +/- (1) conpared lo lhe oulpul of a fIoaling poinl IIT vhose vaIues are
nornaIized (divided ly N = 4O96) and lruncaled lo 14 lil inlegers.

Fully Systolic FFT Architectures for Giga-sample Applications 229
2.3.3 Data ScheduIing and Addressing in R4
3

The firsl slage of lhe 4K poinl IIT aIgorilhn is reaIized ly lhe firsl R4
3
processor, vhich
conpules
4O96
64
IITs vilh each lransforn consisling of 64 poinls. Lach lransfornalion 64-
lupIe conlains lhe eIenenls vhose address is of lhe forn 64i + k, vhere i = O, . . . , 63 and k is
lhe lupIe index, ranging fron O lo
4O96
64
1.
In lhe firsl R4
3
processor lhe dala vilhin each 64 -lupIe are processed in lerns of
quadrupIes, so lhal lhe firsl R4 processor conpules lhe DIT of lhe eIenenls vilh address
16i
1
+ k
1
, i
1
= O, . . . , 3 and k
1
= O, . . . ,
64
4
1 is again lhe lupIe index. We can use lhe
nolalion |a
11
, . . . , a
O
j for each poinls inpul address (posilion vilhin lhe 4O96 poinls) and
j d d d d d d |
O 1 2 3 4 5
for lhe address of each lank of lhe firsl inlernediale nenory. The
resuIling 64 eIenenls are vrillen in one lank of lhe firsl inlernediale nenory in
conseculive addresses, i.e. j d d d d d d | j a a a a a a |
O 1 2 3 4 5 6 7 8 9 1O 11
.
The second R4 reads lhe dala fron lhe firsl inlernediale lank of size 64 ly using lhe
address pernulalion j d d d d d d | j a a a a a a |
O 1 2 3 4 5 4 5 2 3 O 1
. Lvery 4 conseculive resuIls of 4-
poinl DITs produced ly lhe second R4 are slored in one of lhe lvo lanks of lhe second
inlernediale nenories of size 16. When lhe lhird R4 slarls reading lhese dala lhe second R4
viII slore lhe nexl 4 DIT resuIls inlo lhe second inlernediale nenory of size 16.
Nole lhal lhe second R4 produces resuIls so lhal lhe lhird R4 can use aII lhe dala slored in
lhe inlernediale nenory lo conpIele ils 4-poinl DIT fuIIy pipeIined and vilh 1OO
resource uliIizalion. Wilh j d d d d |
O 1 2 3
denoling lhe address vilhin lhe second inlernediale
nenory, lhe lhird R4 reads lhe dala vilh address pernulalion j d d d d | j a a a a |
O 1 2 3 2 3 O 1
.
During lhe second slage of lhe 4K IIT aIgorilhn, lhe second R4
3
processor conpules
4O96
64

IITs vilh each lransforn consisling of 64 poinls, so lhal each lransfornalion 64-lupIe
conlains lhe eIenenls vhose address is of lhe forn 64k + i, vhere i = O, . . . , 63 and k is lhe
lupIe index, ranging fron O lo
4O96
64
1. The firsl R4 produces 64 eIenenls, vhich are vrillen
in one lank of lhe firsl inlernediale nenory in conseculive addresses, i.e.
j d d d d d d | j a a a a a a |
O 1 2 3 4 5 O 1 2 3 4 5
. Aparl fron lhis difference lhe addressing and lhe dala
scheduIing vilhin lhe second R4
3
is exaclIy lhe sane as in lhe firsl R4
3
descriled alove.

2.3.4 FPGA and VLSI ImpIementation
The proposed archileclure has leen inpIenenled in RTL VHDL using fixed poinl
arilhnelic. The projecl ains al lolh high capacily and high speed IICA devices such as lhe
XILINX Virlex II 6OOO, as veII as a high perfornance O.13un slandard ceII process. Ior lhe
XiIinx inpIenenlalion ve have used lhe XiIinx looI and lhe aulonalic fIoor pIanning
oplion. The XiIinx inpIenenlalion on lhe 6OOO parl resuIls in 2O of Iogic area uliIizalion
and 25 Iocks RAM uliIizalion. The use of oplinized XiIinx conponenls (CoreLil
nuIlipIiers) reduces lhe area lo 13 of lhe lolaI resources of lhe 6OOO parl.
The TSMC O.13 Iilrary inpIenenlalion achieved 6O4 MHz vorsl case. Il incIudes 96K
slandard ceIIs and 64 RAMs (1361 slandard ceII rovs), (64 configuralion in TalIe 2)
occupying a siIicon area of 263Ox5129 un square (1.42*1OL7 un sq) al 84.2 uliIizalion and
achieves a vorsl-case (O.9V, 125C) posl-roule perfornance of 6O4.5 MHz and a 4.4 Walls
VLS 230
pover consunplion. The use of lypicaI process paranelers (1V, 25C) resuIls in exceeding
lhe 1CHz posl-roule frequency nark (dala rale 1 CSanpIe/sec), naking lhe proposed
archileclure lhe faslesl slandard-ceII 4O96 conpIex poinl IIT inpIenenlalion reporled in
lhe Iileralure. In addilion, a second 4O94 conpIex poinl engine has leen inpIenenled in lhe
sane slandard ceII Iilrary, lhis line using deeper RAMs (16 configuralion in TalIe 2).
Iover consunplion in lhis case vas sulslanliaIIy reduced lo 722.8 nW fron lhe 4.4 W of
lhe 64 configuralion. Iigure 6 depicls lhe VLSI Iayoul of lhe Radix 4
3
engine, vhiIe figure
7 depicls lhe finaI Iayouls of lhe 4K IIT, of lolh lhe 16 and lhe 64 inpIenenled.
The 4K IIT has leen designed for an experinenl invoIving a frequency anaIyzer for a
landvidlh of 2OO MHz. The land has leen divided inlo four sul-lands and each sul-land
has leen acconnodaled ly a 4K IIT archileclure. The IICAs perforn al 1O2.5 MHz on a
18-Iayer loard vhich has a conpacl-ICI inlerface perforning al 51.25 MHz. The lask is lo
perforn IIT and use a ThreshoId fiIler lo idenlify lhe frequencies of high pover vilhin
each sul-land. The expecled oulpul sel incIudes al nosl 1O frequencies per sul-land, per
IIT. Afler lhe prololype conpIelion lhe 4K IIT has leen deIivered as an II core vilh
specificalions achieving 5 lines lhe perfornance of lhe IICA prololype. The 16K, 64K and
256K have leen reaIized as II cores for research purposes.

2.4 Architectures Performance and Advantages
The proposed archileclure denonslrales inproved Ialency conpared lo olher unfoIded
archileclures lecause il requires dala luffering onIy lelveen lhe lvo R4
3
slages inslead of
lhe luffering required al aII lhe 6 slages of lhe unfoIded IIT using R4. The exisling cascade
IIT archileclures require 6 R2
2
(He & TorkeIson, 1996, 1998) and aIlhough lhey use 1/4 of
lhe nenory of lhe proposed R4
3
archileclure for 4O96 poinls, lhey are nol scaIalIe and lhey
nusl le designed for specific perfornance due lo lheir organizalion invoIving cIosed
conpuling Ioops.
The achieved perfornance eslalIishes lhe proposed archileclure as a reaI-line high
perfornance IIT reaIizalion, lhe inporlance of vhich can le furlher poinled oul ly


Iig. 6. VLSI inpIenenlalion Iayoul of lhe Radix 4
3
engine
Fully Systolic FFT Architectures for Giga-sample Applications 231
conparing ils characlerislics lo olher reIevanl resuIls. TalIes 3 and 4 presenl a conparison
of lhe proposed IIT archileclure vilh reIaled resuIls. We dislinguish reIaled pulIished
resuIls inlo lvo calegories.

IIT 4O96 Ransx16 IIT 4O96 Ransx64
CIock (period) 1.6 ns 1.2 ns
Inax in MHz 386.8 6O4.5
Sld CeIIs (RAMs) 1OO347(16) 95897(64)
Sld CeIIs (rovs) 8O5 1361
Chip Size (un sq) 5.34L+O6 1.42L+O7
UliI () 8O.3O 84.2O
X(un) 1681 263O
Y(un) 3165 5129
Iover (nW) 722.8 4414.1
TalIe 2. 4K IIT VLSI InpIenenlalion Rouling ResuIls

In lhe firsl calegory ve conpare lhe hilherlo pulIished archileclures execuling IIT
aIgorilhns up lo 128 poinls lo lhe fealures of lhe R4
3
pIaying lhe roIe of a conpIele 64
conpIex poinl IIT archileclure. This conparison is shovn in TalIe 3. The second calegory
incIudes lhe archileclures soIving IITs of size 1O24 lo 4O96 poinls presenled in TalIe 4. The
conparison incIudes IIT size, vord Ienglh, aIgorilhn, IIT archileclure, lechnoIogy process,
voIlage, area, pover, naxinun operaling frequency and suslained lhroughpul in lolh
MSanpIes/s and Clils/s. Since lhe IIT designs vary vilh respecl lo IIT size, aIgorilhn
and archileclure ve have aIso incIuded lhe NornaIized Area (idel el aI., 1995), in order lo
evaIuale lhe siIicon cosl. Moreover, ve conpare lhe efficiency (perfornance/cosl) of lhe
proposed design lo lhe reIaled resuIls ly using lhe fraclion Suslained
Throughpul/NornaIized Area.
The proposed IIT archileclure achieves lhe highesl suslained lhroughpul conpared lo aII
lhe olher designs. Iurlhernore, lhe efficiency expressed as lhe fraclion Suslained
Throughpul/NornaIized Area of lhe proposed design is lhe highesl considering lolh snaII
inpul size (TalIe 3) and Iarge inpul size IIT archileclures (TalIe 4). Nole lhal, in lolh
calegories lhe archileclures R 4
3
and lhe proposed 4K occupy nore area lhan lheir
conpelilors respecliveIy. This is a penaIly lhough in achieving lhe highesl lhroughpul
possilIe.
AIso nole lhal (Lin el aI., 2OO5) perforns lransfornalions of onIy 128 poinls, and lhere is no
provision laken so lhal il viII conslilule a core for scaIalIe archileclures vilh respecl lo IIT
size. The 64-poinl Iourier lransforn chip, presenled in (Maharalna el aI, 2OO4) operales al 2O
MHz vilh Ialency 3.85 us, conparing lo lhe R4
3
processor perforning a 64 conpIex poinl
IIT vhiIe operaling al a 2OO MHz cIock frequency vilh Ialency O.32 us.
The archileclure descriled in (Lenarl & OvaII, 2OO3) is a 2K conpIex poinl IIT processor
vhich achieves naxinun operaling frequency of 76MHz and suslains a lhroughpul of 2O48
poinls/26us. The design presenled in (Corles el aI., 2OO6) inpIenenls a 2K/4K/8K
nuIlinode IIT and achieves 9 MHz cIock frequency, al a conpulalion line of up lo 45Ous.
VLS 232

Iig. 7. VLSI inpIenenlalion Iayoul of lhe 4K IIT processor: a) 16 4O96 Configuralion, l)
64 4O96 Configuralion

Characlerislics
(Lin el aI.,
2OO5)
(Maharalna
el aI., 2OO4)
Iroposed Design
(R4
3
)
IIT size 128p 64p 64p
Word Lenglh, lil 1O 16 14
AIgorilhn R-2, R-8 R-2 R-4
3

IIT Archileclure MRMDI UnroIIed UnroIIed
Irocess (un) O.18 O.25 O.13
VoIlage (V) 3.3 1.8 O.9
Area (nn
2
) 3.52 6.8 5.O9
NornaIized Area (nn
2
) 3.52 3.5 9.7
Iover (nW) 77.6 41 78.2
Inax (MHz) 11O 2O 6O4.5
Suslained Throughpul in
MSanpIes/s
48O 16.6 1O52
Suslained Throughpul in Cl/s 9.6 O.53 29.5
Throughput(Gb/s)
Norm.Area

2.72 O.15 3.O4
TalIe 3. Conparison TalIe vilh IIT designs fron 64-128 poinls
Fully Systolic FFT Architectures for Giga-sample Applications 233
Characlerislics
(Lenarl &
OvaII,
2OO3)
(Corles el
aI., 2OO6)
(SvarlzIander,
2OO7)
Iroposed
Design
IIT size 2K 2K/4K/8K 4K 4K
Word Lenglh, lil 1O 16 16 14
AIgorilhn R 2
2
R 2
2
R 2 R 4
3

IIT Archileclure SDI SDI SpIil SysloIic UnroIIed
Irocess (un) O.35 O.35 O.25 O.13
VoIlage (V) N/A 3.3 N/A O.9
Area (nn
2
) 6 18.7 N/A 13.48
NornaIized Area (nn
2
) 1.58 4.9 N/A 25.84
Iover (nW) N/A 114.65 26O 4414.1
Inax (MHz) 76 9.1 1OO 6O4.5
Suslained Throughpul in
MSanpIes/s
75.8 18.2 2OO 1O52
Suslained Throughpul in
Cl/s
1.5 O.582 6.4 29.5
Throughput(Gb/s)
Norm.Area

O.94 O.11 N/A 1.14
TalIe 4. Conparison lalIe vilh IIT designs fron 1K-4K

IinaIIy, a singIe ASIC chip, sysloIic IIT processor, deveIoped ly lhe Mayo Ioundalion
conpules 4O96-poinl IITs suslaining a lhroughpul of 2OO Ms/s (SvarlzIander, 2OO7).
Considering IICA inpIenenlalions, lhe corresponding XILINX designs (vvv.xiIinx.con)
achieve equaI naxinun operaling frequency of 2OOMHz, lul occupy consideralIy Iarger
chip area lhan lhe R4
3
approach. AIso nole lhal, ALTLRA designs (vvv.aIlera.con) uliIize
IIT cores vilh IIT Ienglh varying fron 64 poinls up lo 4K poinls. They denonslrale a
naxinun operaling frequency of 3OO MHz. Anong lhe ALTLRAs IIT designs ve conpare
lhe 64 poinl IIT al 3OO MHz lo lhe R4
3
perfornance, vhich reaIized on lhe sane ALTLRA
IICA (ALTLRA STRATIX II LI2S3OI484C3) achieves operaling frequency of 35O MHz.

3. 16K, 64K and 256K compIex point FFT Architectures

The high uliIily prospecls of lhe efficienl 4K poinl IIT design as presenled alove lecones
langilIe lhrough lhe uliIizalion of lhis archileclure lo deveIop Iarger size IIT archileclures
lo conpule 16K, 64K and 256K poinl lransforns. This seclion descriles lhe expIoilalion of
lhe 4K poinl IIT as a radix-4O96 processor resuIling in archileclures neeling differenl
requirenenls vilh respecl lo paraIIeIisn, siIicon area and lhroughpul. These archileclures
can le laiIored lo a lroad area of appIicalions.

3.1 Deriving the 16K point Architecture
The DIT equalion for a 16K poinl IIT lakes lhe forn


VLS 234
(11)



Selling
2 1
n
4O96
N
n n + = and
2 1
k 4O96 k k + = :
( )
2 2 2 1 1 1
k n
4O96
k n
4O96 4
k n
4
kn
4O96 4
2 1 2 2 1 1 1 2 1 2 1
W W W W
k
4O96
N
k Nn k n k n n k k k 4O96 n
4O96
N
n n k

=
+ + + = + |
.
|

\
|
+ =


Therefore, lhe lransforn lecones

2 1
2
2 2
1
1 1
k n
4O96 4
4O95
O n
k n
4O96
3
O n
k n
4
W W j n | x W j k | X

= =
|
|
|
.
|

\
|
=

(12)

According lo equalion 12, lhe 4K poinls IIT can le exlended lo 16K poinls. This is
acconpIished ly firsl perforning four 4K poinl IIT lransforns. Nexl, lhe dala is nuIlipIied
ly lhe lviddIe faclors lhal correspond lo a 16K poinls IIT and finaIIy, a radix-4 slage
conpIeles lhe 16K poinl IIT conpulalion.
The archileclure is presenled in figure 8. There are four 4K IIT lIocks operaling in paraIIeI.
The archileclure has a four conpIex poinl inpul per cycIe. The 16K IIT archileclure vas
inpIenenled in a high perfornance, O.13un, 1IoIy-8Copper Iayer slandard ceII lechnoIogy
fron TSMC. A fIal lack-end fIov vas used in vhich lhe design vas firsl synlhesized lo
gales using Synopsys Design ConpiIer and oplinized for a frequency of 3OO MHz. The
uniquified nelIisls vas lhen read inlo Cadence SoC Lncounler vhere fIoorpIanning, pover
pIanning, cIock-lree-synlhesis, pIacenenl, rouling and IIO rouling look pIace. IinaIIy, lhe
design vas lroughl lack inlo lhe lop-IeveI for finaI lop-IeveI pIacenenl, rouling and lining
anaIysis. TalIe 5 shovs lhe InpIenenlalion Rouling ResuIls and figure 8 depicls lhe VLSI
CeIIs for lhe 16K IIT. The lhroughpul is 1.4 Cs/sec (39.2 Clils/sec).


Iig. 8. Iock diagran of lhe 16K IIT - IaraIIeI/IaraIIeI Archileclure
_ _

=

=
= =
1 4O96 4
O n
nk
4O96 4
16383
O n
nk
16384
W j n | x W j n | x j k | X
Fully Systolic FFT Architectures for Giga-sample Applications 235

TalIe 5. VLSI InpIenenlalion Rouling ResuIls of lhe 16K IIT archileclure

The 16K archileclure inpIenenled on lhe XiIinx Virlex 5 (-2) achieves operaling frequency
of 25OMHz, occupies 12264 sIices and suslains a lhroughpul of 1Cs/s (28 Clils/sec).

3.2 64K and 256K compIex points FFT Architectures
The DIT equalion for a 64K poinl IIT lakes lhe forn



-1 16384 4
O n
nk
16384 4
65535
O n
nk
65536
W j n | x W j n | x j k | X

=
= = (13)
Selling
2 1
n
16384
N
n n + = and
2 1
k 16384 k k + = :

2 2 2 1 1 1
k n
16384
k n
16384 4
k n
4
kn
16384 4 2 1 2 2 1 1 1
W W W W k
16384
N
k Nn k n k n 16384 n k

= + + + =
Therefore, lhe lransforn lecones

2 1
2
2 2
1
1 1
k n
16384 4
16383
O n
k n
16384
3
O n
k n
4
W W j n | x W j k | X

= =
|
|
|
.
|

\
|
=

(14)

According lo equalion 14, lhe 16K poinls IIT can le exlended lo 64K poinls. This is
acconpIished ly firsl perforning a 16K poinl IIT lransforn. Nexl, lhe dala is nuIlipIied ly
lhe lviddIe faclors lhal correspond lo a 64K poinls IIT and finaIIy, a R4 slage conpIeles lhe
64K poinl IIT conpulalion, as shovn in figure 1O. The 64K IIT has leen VLSI (and IICA)
inpIenenled ly using lhe 16K paraIIeI/paraIIeI conpulalion vilh a R4 slage vilh four
paraIIeI inpuls and oulpuls. The archileclure has a posl rouling frequency of 256 Mhz vilh a
lhroughpul of 1 Cs/sec (28 Clils/sec). Iigure 11 depicls lhe VLSI CeII for lhe 64K IIT
design. The 64K has leen inpIenenled on lhe XiIinx Virlex 5 (-2) achieving an operaling
frequency of 125MHz and using 13461 sIices. The archileclure has a four dala paraIIeI inpul
and oulpul and suslains a lhroughpul of 5OOMs/s (14 Clils/sec).

Iaraneler IIT 16K
Core Size 1.6561e + O7un
2

Sld CeII Rovs 11O2
Nunler of CeIIs 885456
Nunler of RAMs 64
SlalislicaI Iover 3.45 W
Inax 352 MHz
VLS 236
3.3 256K compIex point FFT Architecture
Iigure 12 depicls lhe 256K IIT archileclure, vhich is a slraighlforvard appIicalion of lhe
R4
3
aIgorilhn. Three conseculive R4
3
engines are used lo acconpIish lhe lask.


Iig. 9. VLSI inpIenenlalion Iayoul of lhe 16K IIT IaraIIeI/IaraIIeI archileclure


Iig. 1O. Iock diagran of lhe 64K conpIex poinl IIT Archileclure
Fully Systolic FFT Architectures for Giga-sample Applications 237
Iaraneler IIT 64K IIT 256K
Core Size 3.4894e + O8un
2
2.7647e + O8un2
Sld CeII Rovs 16OO 45OO
Nunler of CeIIs 896148 735945
Nunler of RAMs 192 384
SlalislicaI Iover 9.8 W 35.75 W
Inax 256.5 MHz 188 MHz
TalIe 6. VLSI InpIenenlalion Rouling ResuIls of lhe 64K and 256K IIT designs


Iig. 11. VLSI inpIenenlalion Iayoul of lhe 64K IIT

In lhe case of lhe very Iarge 256K IIT, looI capacily nandaled lhe use of a hierarchicaI fIov.
IoIIoving lhe sane fronl-end synlhesis process (again oplinized for 3OO MHz), lhe
oplinized nelIisl vas read inlo Cadence SoC encounler vhere parlilioning vas firsl
perforned. This process crealed six inslances of lhe 256K nenory lIock (for a lolaI of 1.5
M of on-chip SRAM) vhich vere individuaIIy pIaced and rouled. The sane process vas
perforned for lhe R4
3
engines and lhe lviddIe ROM lIock. Nole lhal lo conpIele lhe enlire
IIT conpulalion il requires a 3-frane Ialency (786792 cycIes) and lhe conpulalion Ialency
vilhin lhe lhree R4
3
(36O cycIes), vhich is 4.1 nsec. The 256K IIT archileclure has a posl
rouling frequency of 188 Mhz. IinaIIy, lhe design vas lroughl lack inlo lhe lop-IeveI for
VLS 238
finaI lop-IeveI pIacenenl, rouling and lining anaIysis. Iigure 13 depicls lhe VLSI CeII for
lhe 256K IIT design and TalIe 6 shovs lhe VLSI InpIenenlalion Rouling ResuIls of lhe 64K
and 256K IIT designs.


Iig. 12. Iock diagran of lhe 256K IIT Archileclure

4. ParaIIeI Accessing in FFT Architectures

This seclion presenls a lechnique, vhich aIIovs lhe paraIIeIizalion of lhe nenory accesses in
hardvare inpIenenlalions of lhe IIT aIgorilhn ly enalIing a processor during a IIT slage
lo access in paraIIeI l nenory lanks vilhoul confIicls. ConsequenlIy lhe processor
perforns a radix-l lullerfIy ly Ioading lhe l-lupIe dala fron l nenory lanks in paraIIeI,
lhen ly operaling on lhe l dala and finaIIy ly sloring lhe resuIling l-lupIe in l nenory
lanks in paraIIeI. Hence, lhe speedup and lhe lhroughpul increase ly l.

4.1 ProbIem Definition and ReIated Work
Lel
p
l = N denole lhe nunler of poinls of lhe Iasl Iourier Transforn, vhere l is an
arlilrary lase such lhal 2 l and 1 p . Using l as a lase, lhe inpul address O,.,N-1 (or
index) of each eIenenl is represenled as j a .... a |
O 1 p-
, vhere 1 l a O
i
- . We viII denole each
eIenenl onIy ly ils address j a ,..., a |
O 1 p-
, vhich represenls lhe eIenenl in lhe inpul slrean
vilh index
O 1
1 p
1 p
a + l a + ... + l a
-
-
. This nolalion hoIds for each of lhe p slages of lhe IIT.
We viII use
j n n | l
I lo denole a processor perforning aII radix-l conpulalions, vhich can
aIso read (Ioad) n dala in paraIIeI and vrile (slore) m dala in paraIIeI. We assune a generic
archileclure using , N 1
E
ORJ ,
j l l | l
I processors and ) 1 ( l 2 + nenory lanks,
so lhal each lank can slore N/l eIenenls. Lach processor reaIizes N/l conseculive slages of
lhe IIT conpulalion.

Fully Systolic FFT Architectures for Giga-sample Applications 239

Iig. 13. VLSI inpIenenlalion Iayoul of lhe 256K IIT

The nenory lanks are arranged according lo lhe schene descriled in figure 14. Since, ve
are using a radix-l processor, lhe DIT deconposilion consisls of N Iog
l
slages. If ve
assune N Iog =
l
lhen ve consider a
j l l | l
I processor reaIizing each slage of lhe
aIgorilhn. The slages are indexed fron O lo 1 N Iog
l
- . elveen lvo (2) conseculive slages i
and i+1 lhere is a nenory of size N 2 divided inlo lvo (2) sels of size N each. We viII
denole lhe nenory lelveen slage i and slage i+1 as nenory vilh index i. The nenories
anong lhe IIT slages are indexed as O lo 2 N Iog
l
- . There are nenories indexed as -1 and
1 N Iog
l
- serving as inpul and oulpul nenories respecliveIy. The nenory of each sel i is
divided inlo l nenory lanks vilh indices O lo l-1.
Al each cIock cycIe, lhe
j l l | l
I processor of slage i Ioads fron ils 'read nenory (nenory
i-1), l eIenenls lhal forn a lransforn l-lupIe. These are lhe eIenenls vhose address is of
lhe forn j a ,..., Ia a ,..., a |
O 1 i 1 + 1 p - i -
, vhere lhe indices | j a ,..., a , a ,..., a
1 p 1 + i 1 i O - -
are lhe sane for
aII eIenenls in lhe l-lupIe and I ranges fron O lo l-1. These eIenenls viII le Ioaded in
paraIIeI if lhey are slored in dislincl nenory lanks. During lhe nexl ((i + 1)
lh
) slep of lhe
aIgorilhn, lhese l eIenenls have lhe sane a
i+1
address digil and hence lhey leIong lo
VLS 240
differenl lransfornalion l-lupIes. These eIenenls nusl le slored in lhe processors vrile
nenory so lhal lhey can le read in paraIIeI fron lhe processor al slage i + 1.


Iig.14. Ceneric arrangenenl of a
j l l | l
I processor and ils vrile nenories

The slraighlforvard approach is lo dislrilule lhe N eIenenls lo lhe l nenory lanks using
lheir i
lh
address digil. Then lhe processor al slage i can Ioad lhen in paraIIeI. Hovever, if ve
lry lo slore lhe sane l eIenenls of each l-lupIe lo lhe nenory lank al slage (i+1) according
lo lheir (i+1)
lh
digil of lheir address, ve viII nolice lhal aII lhe eIenenls in each l-lupIe nusl
le slored in lhe sane nenory lank. Since ve cannol slore nore lhan one eIenenl in a
nenory lank al each cIock cycIe lhis silualion conslilules a confIicl.
To iIIuslrale lhe silualion vhere a confIicl occurs, consider a 16 poinl lransforn lhal is
lased on
j 4 4 | 4
I

processors. The address of each eIenenl in lhe inpul slrean is j a a |
O 1
,
vhere . 3 , 2 , 1 , O = a
i
Iigure 15 shovs lhe firsl slage of lhe exanpIe. The firsl 4-lupIe lo le
lransforned consisls of lhe eIenenls |(O, O) (O, 1) (O, 2) (O, 3)j. Nole lhal, lhese eIenenls can
le read in paraIIeI fron lhe sel of nenory lanks of slage i 1, lul nusl le slored in lhe
sane nenory lank of nenory i.
The firsl lechnique for paraIIeIizing lhe nenory accesses in IIT archileclures (}ohnson,
1992) descriles a hardvare archileclure designed for in-pIace conpulalion of lhe IIT
aIgorilhn. This lechnique uses r (r leing lhe radix) lanks and pernules lhe oulpul dala of
each processor, lo le vrillen in lhe sane nenory Iocalions vilhin each lank as lhe inpul
dala have leen read. (Ma, 1999) descriles a lechnique for radix-2 lased IIT ly using queues
al lhe inpul of each nenory lo rearrange lhe dala lefore lhey are slored. }ohnsons
lechnique can le exlended lo lhe nore generaI case of a pipeIined IIT inpIenenlalion al
lhe cosl of conpIex addressing hardvare inpIenenlalion. (Thonas & YeIick, 1999) presenl
a lechnique lhal is used in veclor processors. This lechnique pernules lhe inpul dala prior
lo lhe radix caIcuIalions in order lo naxinize lhe efficiency of lhe aIgorilhn. These
pernulalions are perforned vilhin lhe veclor regislers (In-regisler lranspose). This
funclionaIily has leen inpIenenled ly exlending lhe inslruclion sel of lhe veclor processor
vilh lvo inslruclions lhal shifl lhe dala vilhin a regisler, so lhal each pernulalion is
perforned ly using a regisler lo regisler copy and a shifl inslruclion.
Fully Systolic FFT Architectures for Giga-sample Applications 241
AII lhe IIT lechniques (}ohnson, 1992, Ma, 1999), vilh l paraIIeI nenory accesses, invoIve
al each slage l lanks for reading and l lanks for vriling lhe N IIT dala vilh lank size
N/l. The in-pIace case uses lhe sane l lanks (each lank of size N/l) for lolh read and
vrile. To conpare lhe hardvare conpIexily of lhe previousIy pulIished resuIls, ve firsl
consider designs lhal are lased on queues, such as lhe one descriled in (Ma, 1999).


Iig. 15. Iirsl slage of a 16-poinl lransforn using
j 4 4 | 4
I

processors

Such designs require Z 2 ) l ( O
2
-lils vide regislers, vhere v is lhe vord Ienglh, arranged in
l queues vilh each queue having l 1 regislers. l is lhe processor radix, or in lhe generaI
case, lhe nunler of paraIIeI read/vrile accesses. Lach sel of l1 regislers is arranged so lhal
l1 eIenenls are Ioaded in paraIIeI vhiIe one eIenenl is vrillen inlo lhe nenory lank. The
olher l1 eIenenls are shifled -one eIenenl per nenory access cycIe- inlo each nenory
lank. The lechnique in (}ohnson, 1992) uses fixed hardvare lo inpIenenl firsl a
pernulalion on lhe inpul dala and lhen lhis specific hardvare can appIy a pernulalion al
each IIT slage. More specificaIIy, (}ohnson, 1992)s approach gives onIy a sulsel of lhe
possilIe soIulions. This sulsel of soIulions considers onIy in-pIace inpIenenlalions ( = 1
and nenory size N). The circuil in (}ohnson, 1992) is oplinized for inpIenenling lhese
soIulions vilh respecl lo lhe VLSI area. The address generalion is lased on a circuil
invoIving a lree of noduIo-l adders vilh
2
l ( O excIusive-OR gales.
In lhis seclion ve descrile a lechnique (Reisis & VIassopouIos, 2OO6) lhal can le used lolh
for pipeIined archileclures and in pIace inpIenenlalions (nenory size equaIs lo N). We
foIIov a differenl approach lhan (}ohnson,1992) Ieading lo a differenl proof and providing a
VLS 242
generaI soIulion, vhich incIudes lhal in (}ohnson,1992). Moreover, lhe presenled lechnique
resuIls in a sinpIe and inproved hardvare inpIenenlalion conpared lo (}ohnson,1992).
The presenled soIulion shovs inproved perfornance vilh respecl lo lhe Ialency and lhe
hardvare cosl conpared lo olher soIulions (Ma, 1999, Suler & Slevens,1998, He &
TorkeIson, Thonas & YeIick, 1999) vhiIe il provides lhe sane lhroughpul as lhese. Our
lechnique is lased on nenory address pernulalions and can le reaIized using Iook up
lalIes vilh each slage lalIe occupying
2
l ( O lils. The presenled approach resuIls in a sel of
pernulalions, vhich can le appIied in lhe design of pipeIine IIT archileclures vilh paraIIeI
nenory accesses per slage. A sulsel of lhese pernulalions acconnodales lhe in-pIace
inpIenenlalions.

4.2 ParaIIeIizing the Memory Accesses
The foIIoving Lennas 1 and 2 prove lhe exislence of lhe required pernulalions and
Theoren 1 proves lhal lhese pernulalions resuIl in a correcl IIT aIgorilhn.
Lenna 1: Assune lhal lhe
p
l = N eIenenls vilh indices j a ... a |
O 1 p
, 1 - l a O
i
, are
dislriluled in l sels S
I
, I = O, . . . , l 1, such lhal: j a ... Ia a ... a || = S
O 1 i 1 + i 1 p I

Lel f
j
: |O, . . . , l1|O, . . . , l1, vhere O < j < l1, le a sel of funclions such lhal
- f
1
j

f is 1-1 and onlo, and


- n , n , vhere n, n , l ,..., O |
If , n n lhen x , ) x ( f ) x ( f
n n

The sels
' I
' S , vhere ) I ( f = ' I
1 + i
a
are defined as foIIovs j a ... a I ) I ( f a ... a || = ' S
O 1 i a 2 + i 1 p ' I
1 + i


Where I is fixed and 1 l ,..., O = a
1 + i

Under lhe alove condilions, lhe foIIoving hoId:
1) Lach S
I
conlains al nosl N/l ilens
2) Lvery lvo eIenenls of lvo sels
1
I
S ,
2
I
S vhose indices differ onIy in lhe i
lh
digil, i.e. q and
r viII le dislriluled in differenl S sels, naneIy lhe S
q
and S
r
.

Iroof of Lenna 1: Lach sel S
I
conlains lhe dala vilh indices j a ... a I ) I ( f a ... a |
O 1 i ) a ( 2 + i 1 p
1 + i

,
vhere 1 l ,..., O = a
1 + i
and I is fixed. To prove (1) assune lhal lhere exisls a sel lhal conlains
nore lhan N/l eIenenls. Then, lhere is al Ieasl one eIenenl vilh index
j a ... a ' ' I ) ' I ( f a ... a |
O 1 i ) a ( 2 + i 1 p
1 + i

vhere I is differenl lhan I. Since such an eIenenl does nol
leIong lo lhe sel S
I
, ve concIude lhal each sel cannol conlain nore lhan N/l eIenenls.
SiniIarIy, lo prove (2), assune lhal lvo eIenenls of lhe sels S
q
and S
r
, respecliveIy, vhose
indices differ in lhe i
lh
digil are dislriluled in lhe sane sel, S
s
. Then, for lhese eIenenls,
=
+ +
+ +
j a ... a I ) r ( f a ... a | j a ... a I ) q ( f a ... a |
O 1 - i a 2 i 1 - p O 1 - i a 2 i 1 - p
1 i 1 i
) r ( f = ) q ( f
1 + i 1 + i
a a
. Since lhe
f
j
are inverlilIe, r q = =
+ + + +

(r)) (f f (q)) (f f
1 i 1 1 1 i 1 1
a
1
a a
1
a
, vhich conlradicls our iniliaI assunplion
lhal
2 1
I I . The foIIoving Lenna shovs lhal lhese funclions exisl.
Lenna 2: There exisls a sel of funclions , 1 l ,..., O | 1 l ,..., O | : f
j
1 l ,..., O = j and 2 l
such lhal lhe condilions of Lenna 1 are salisfied.
Fully Systolic FFT Architectures for Giga-sample Applications 243
Iroof of Lenna 2: Lel M = |O, . . . , l 1. Iurlher, Iel C(M) denole lhe sel of cycIic
pernulalions of Ienglh l on lhe sel M. These pernulalions are of lhe forn

)
l nod ) p + 1 l (
) 1 l (
...
...
l nod ) p + 1 (
1
l nod ) p (
O
(


for p = O, . . . , l 1. Nov, lhese pernulalions are inverlilIe and lhere are exaclIy l such
pernulalions. To prove lhal
) x ( f ) x ( f
j i
(15)
for j i and i, j = O, . . . , l 1 il suffices lo shov lhal l nod ) j + x ( l nod ) i + x ( , vhich is
lrue since i and j couId onIy le in lhe sane equivaIence cIass, if i = k j, vhere k is an
inleger, lul lhis can le lrue onIy for k = 1, since i, j < l 1. Therefore equalion 15 hoIds.

Theoren 1: Lel
p
l = N . Assune lhal ve have a
j l l | l
I

processor. Then, a radix-l lased
IIT having N Iog
l
slages can use l nenory lanks al each slage, such lhal aII lhe read and
vrile operalions are perforned in paraIIeI.

Iroof: Lel j a ... a |
O 1 p
, O < a
i
< l 1, le lhe address of each eIenenl according lo ils iniliaI
posilion vilhin lhe inpul dala sel of lhe aIgorilhn. The i
lh
slage of lhe aIgorilhn has
arranged lhe N eIenenls in l nenory lanks according lo lheir i
lh
digil of lheir address:

j a j| a ... a a ... a |
i O 1 i 1 + i 1 p
(16)
vhere a
i
is lhe index of lhe nenory lank.
To vrile lhe oulpuls of a radix-l processor inlo l dislincl nenory lanks in paraIIeI, ve use
lhe resuIls of Lenna 1. Iirsl, ve assign each nenory lank lo a sel, S
I
, so lhal each
lransforn l-lupIe consisls of eIenenls lhal differ in lhe i
lh
eIenenl of lheir address. Nexl, ve
seIecl a sel of l funclions f
O
, f
1
, . . . , f
(l1)
using Lenna 2.
Since lhe requirenenls of Lenna 1 hoId, ve can dislrilule lhe eIenenls according lo lhe
equalion:
)j a ( f j| a ... a a ... a |
i a O i 2 + i 1 p
1 + i

(17)
These eIenenls viII le dislriluled in dislincl nenory lanks and can le slored in paraIIeI.
Lxeculing lhe foIIoving slage, ve viII use lhe inverse pernulalion lo read each l-lupIe in
paraIIeI: During lhe (i+1)
lh
slage of lhe IIT aIgorilhn, lhe (i+1)
lh
processor shouId read lhe
dala vhose address is
j a j| a ... a a ... a |
1 + i O i 2 + i 1 p
(18)
vhere a
i+1
= O, . . . , l1 and a
j
= consl, for 1 + i j . The inverse pernulalion is given ly
)j (a j|f ...a a ...a |a
i
1
a
O i 2 i 1 - p
1 1

+
+
(19)
vhere a
i
= consl and a
i+1
= O, . . . , l 1, since lhis pernulalion yieIds a l-lupIe vhose
eIenenls differ onIy in lhe (i + 1)
lh
digil. To prove lhal lhese eIenenls are slored in l dislincl
nenory lanks assune lhal lhis is nol lhe case, i.e. lhere exisl lvo dislincl
eIenenls, )j a ( f j| a ... a a ... a | q
i
1
a
O i 2 i 1 - p 1
1 i

+
+
= and )j (a' j|f ...a' a' ...a' |a' q
i
1
a'
O i 2 i 1 - p 2
1 1

+
+
= lhal are
VLS 244
in lhe sane l-lupIe and al lhe sane line are slored in lhe sane nenory lank. Since q
1
and
q
2
are on lhe sane l-lupIe, lherefore a
j
= a
j
for 1 + i j . Iurlher, ) (a f ) (a f
i
1
a i
1
a'
1 i 1 i

+ +
= ,
vhich is lrue onIy vhen
1 + i 1 + i
' a = ' a . Therefore, lhe eIenenls q
1
and q
2
coincide.
To concIude lhe proof ve shov lhal lhe forvard and inverse pernulalions al aII slages
resuIl in a correcl IIT aIgorilhn. We proceed ly induclion on lhe nunler of slages. During
lhe firsl slage of lhe IIT lhe dala are vrillen in lhe inpul nenory lanks (nenory 1)
according lo j a j| a ... a |
O 1 1 p
. The oulpuls of lhe lullerfIy are vrillen lo lhe firsl inlernediale
nenory (nenory O) using lhe pernulalion )j a ( f j| a a ... a |
O a O 2 1 p
1

, 1 l ,..., O = a
O
. We assune
lhal lhe funclions f
j
are idenlicaI for aII slages, aIlhough lhis nay nol le lhe case. The onIy
acluaI reslriclion is lo use a pernulalion
L
I for vriling lhe dala al slage i and ils inverse
pernulalion

L
I for relrieving lhe dala al slage i+1. AppIying lhe inverse pernulalion ve
ollain )j a ( f j| a a ... a |
O a
1
O 2 1 p
1

, vilh a
O
= conslanl and a
1
= O, . . . , l1. The sel ) a ( f |
O a
1
1


consisls of l dislincl eIenenls, such lhal lheir address differs onIy in lhe digil a
1
and
lherefore, lhey conslilule a vaIid lransforn l-lupIe. Iurlher, assune lhal in lhe i
lh
slage lhe
eIenenls are slored according lo equalion 17. Using lhe sane reasoning as alove, ve can see
lhal lhe inverse lransforn yieIds a vaIid l-lupIe and lhe proof is conpIele. Nole lhal lhe
vrile operalion of lhe eIenenls in lhe firsl sel of lanks (nenory O) does nol require any
pernulalion on lhe inpul dala sel. The eIenenls are vrillen lo lhe lank corresponding lo
lheir O
lh
(zero) address digil, in radix l nolalion. SiniIarIy, during lhe finaI slep of lhe
aIgorilhn, lhe dala can le vrillen in lhe sane addresses and lanks as lhose lhal have leen
read fron, since no conpulalion foIIovs
The currenl seclion has shovn lhe lechnique and lhe properlies of lhe pernulalions
required lo lhe paraIIeI access of each l-lupIe al each IIT slage. An engineer can reaIize lhe
lechnique ly choosing anong lhe slraighlforvard soIulions, e.g. lhe cycIic pernulalions. A
ninleresling research lopic is lo idenlify lhose pernulalions, vhich can le reaIized vilh
nininaI inlerconneclion and address generalion circuils and lhus Iover lhe VLSI cosl.

5. ConcIusion

The presenl chapler has shovn a lechnique lo design IIT archileclures for reaI-line
appIicalions, vhich invoIve inpul vilh Iarge nunler of conpIex poinls. The lechnique lases
on conlining lhree conseculive R4 slages lo reaIize a R64 conpulalion. The resuIling R4
3
as
veII as lhe sysloIic archileclures, vhich uliIize a R4
3
as a slage for execuling 4K, 16K, 64K
and 256K poinl IITs, have leen shovn lo provide higher lhroughpul conpared lo hilherlo
pulIished archileclures soIving lhe corresponding lransfornalions. Moreover, lhe R4
3
and
4K IIT archileclures achieve lhe highesl ralio of lhroughpul lo nornaIized area.
Iurlhernore, lhe chapler has proven a lechnique for paraIIeIizing lhe nenory access for
each lullerfIy radix-l conpulalion so lhal lhe lhroughpul can le inproved furlher ly a
faclor of l.

Fully Systolic FFT Architectures for Giga-sample Applications 245
6. References

A. Corles, I. VeIez, I. ZaIlide, A. Irizar, }.I. SeviIIano An IIT Core for DV-T/DV-H
Receivers ICLCSO6, page(s):1O2-1O5, Decenler 2OO6.
A. Oppenhein, R. Schafer DigilaI SignaI Irocessing. Irenlice HaII, 1975.
. C. }o and M. H. Sunvoo, Nev Conlinuous-IIov Mixed-Radix (CIMR) IIT Irocessor
Using NoveI In-IIace Slralegy, ILLL Trans. Circuils Sysl. I, voI.52, no.5, May 2OO5.
. Suler and K. S. Slevens A Lov Iover, High Ierfornance approach for Tine-Irequency
/ Tine-ScaIe Conpulalions, Iroceedings SIIL98 Conference on Advanced SignaI
Irocessing AIgorilhns, Archileclures and InpIenenlalions VIII. VoI. 3461, pp. 86-
9O, }uIy 1998.
C. D. Thonpson, Iourier Transforns in VLSI, ILLL Transaclions on Conpulers, VoI. 32,
1O47 - 1O57, 1983
D. Harper, Iock , MuIlislride Veclor, and IIT Accesses in IaraIIeI Menory Syslens, ILLL
Trans. on IaraIIeI and Dislriluled Syslens, VoI. 2, No. 1, pp. 43 - 51, }anuary 1991
D. Reisis, N.VIassopouIos, Address Ceneralion Techniques for ConfIicl Iree IaraIIeI
Menory Accessing in IIT Archileclures ICLCS, pp.1188-1191, Decenler 2OO6.
L. idel, D. CasleIain, C. }oanlIanq and I. Slenn A fasl singIe-chip inpIenenlalion of 8192
conpIex poinl IIT ILLL }ourn. of SSC, 3O(3):3OO-3O5, Mar. 1995.
L. H. WoId and A. M. Despain IipeIine and IaraIIeI IIT Irocessors for VLSI
InpIenenlalions, ILLL Transaclions on Conpulers, voI. C-33, 1984.
LarI L. SvarlzIander, }r SysloIic IIT Irocessors: A IersonaI Ierspeclive }ournaI of VLSI
SignaI Irocessing, }une 2OO7.
I. S. Uzun, A. Anira and A. ouridane, IICA inpIenenlalions of Iasl Iourier Transforns
for reaI-line signaI and inage processing, ILLL Vision, Inage and SignaI
Irocessing, 2OO5.
}. Lee, }. Lee, M. H. Sunvoo, S. Moh and S. Oh A DSI Archileclure for High-Speed IIT in
OIDM Syslens, LTRI }ournaI, 2OO2.
}. TakaIa and K. Iunkka ScaIalIe IIT Irocessors and IipeIined ullerfIy Unils }ournaI of
VLSI SignaI Irocessing 43, 113-123, 2OO6.
}. Y. OH and M. S. Lin, Nev Radix-2 lo lhe 4lh Iover IipeIine IIT Irocessor, ILICL
Trans. LIeclron., VOL. L88-C, NO. 8, Augusl 2OO5.
}.W. CooIey and }.W. Tukey, An aIgorilhn for lhe nachine conpulalion of conpIex
Iourier series, Malhenalics of Conpulalion, 1965
K. alionilakis, K. ManoIopouIos, K. Nakos, N. VIassopouIos, D. Reisis, V. ChouIiaras, A
High Ierfornance VLSI IIT Archileclure, 13
lh
ILLL InlernalionaI Conference on
LIeclronics, Circuils and Syslens, pp. 81O-813, Decenler 2OO6
K. Maharalna, L. Crass, and U. }agdhoId, A 64-Ioinl Iourier Transforn Chip for High-
Speed WireIess LAN AppIicalions Using OIDM, ILLL }ournaI of SoIid Slale
Circuils, VOL. 39, NO. 3, March 2OO4.
K. ManoIopouIos, K. Nakos, D. Reisis, N.VIassopouIos, V.A. ChouIiaras High Ierfornance
16K, 64K, 256K conpIex poinls VLSI SysloIic IIT Archileclure ICLCS, pp. 146-149,
Decenler 2OO7.
L. C. }ohnson, ConfIicl Iree Menory Addressing for Dedicaled IIT Hardvare, ILLL
Transaclions on Circuils and Syslens - II: AnaIog and DigilaI SignaI Irocessing,
VoI. 39, No. 5, May 1992
VLS 246
L. R. Raliner and . CoId Theory and AppIicalion of DigilaI SignaI Irocessing, Irenlice-
HaII.
L. Yang, K. Zhang, H. Liu, }. Huang and S. Huang An Lfficienl LocaIIy IipeIined IIT
Irocessor, ILLL Trans. Circuils Sysl. II, voI. 53, no. 7, }uIy 2OO6.
N. Hu, O. Lrsoy, Iasl Conpulalion of ReaI Discrele Iourier Transforn for Any Nunler of
Dala Ioinls, ILLL Transaclions on Circuils and Syslens, VoI. 38, No. 11, pp. 128O -
1292, Novenler 1991
O.K . Lrsoy, Iourier-ReIaled Transforns, Iasl AIgorilhns and AppIicalions. LngIevood
CIiffs, N}:Irenlice HaII, 1997.
R. Thonas and K. YeIick, Lfficienl IITs on IRAM, Iroceedings of lhe 1
sl
Workshop on
Media Irocessors and DSIs, 1999
S. ouguezeI, M. O. Ahnad, and M. N. S. Svany, A Nev Radix-2/8 IIT AIgorilhn for
Lenglhq 2n DITs, ILLL Trans. Circuils Sysl. I, voI. 51, no. 9, Seplenler 2OO4.
S. ouguezeI, M. O. Ahnad, and M. N. S. Svany, Nev Radix-(222)/(444) and Radix-
(222)/(888) DII IIT AIgorilhns for 3-D DIT, ILLL Trans. Circuils Sysl. I,
voI. 53, no. 2, Ielruary 2OO6.
S. Choi, C. Covindu, }. W. }ang, V. K. Irasanna Lnergy-Lfficienl and Iaranelerized
Designs of Iasl Iourier Transforns on IICAs, The 28lh InlernalionaI Conference
on Acouslics, Speech, and SignaI Irocessing (ICASSI), ApriI 2OO3.
S. He and M. TorkeIson A Nev Approach lo IipeIine IIT Irocessor, Iroceedings of lhe
IIIS, 1996.
S. He and M. TorkeIson Design and InpIenenlalion of a 1O24-poinl IipeIine IIT
Irocessor, ILLL 1998 Cuslon Inlegraled Circuils.
S.S. Wang and C.S. Li An Area-Lfficienl Design of VarialIe-Lenglh Iasl Iourier Transforn
Irocessor }ournaI of VLSI SignaI Irocessing, March 2OO7.
T. Lenarl and V. OvaII, A 2O48 ConpIex Ioinl IIT Irocessor Using a NoveI Dala ScaIing
Approach, ILLL ISCAS 2OO3.
W. H. Chang, T. Nguyen, An OIDM-Specified LossIess IIT Archileclure, ILLL Trans.
Circuils Sysl. I, voI. 53, no. 6, }une 2OO6.
vvv.aIlera.con
vvv.xiIinx.con
Y. Ma, An Lffeclive Menory Addressing Schene for IIT Irocessors, ILLL Transaclions
on SignaI Irocessing, VoI. 47, No. 3, 9O7 - 911, May 1999
Y. Ma, L.Wanhannar, A Hardvare Lfficienl ConlroI ofMenory Addressing for High-
Ierfornance IIT Irocessors, ILLL Transaclions on SignaI Irocessing, VoI. 48, No.
3, 917 - 921, March 2OOO.
Y.N. Lin, H.Y. Liu, and C.Y.Lee A 1-CS/s IIT/IIIT Irocessor for UW AppIicalions,
ILLL }ourn. of SSC, voI. 4O, Issue 8, Aug. 2OO5.

Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 247
X

Radio-Frequency (RF) Beamforming
Using SystoIic FPGA-based Two
DimensionaI (2D) IIR Space-time FiIters

Arjuna Madanayake, Mcm|cr, ||||, and Leonard T. rulon, |c||cu, ||||
Mu||iimcnsicna| Signa| Prcccssing Grcup, Unitcrsi|q cf Ca|garq
Ca|garq, A||cr|a, Canaa
(nmaanaq, |ru|cnuca|garq.ca)

1. Introduction

IIane-vaves are far-fieId soIulions lo (1) lhe veclor vave equalion, for lhe case of
eIeclronagnelic vaves, (2) lhe scaIar vave equalion, for lhe case of IongiludinaI pressure
vaves in seisnic, acouslic, and uIlrasonic syslens, as veII lo (3) Iinear surface vaves, such
as lhose crealed ly dropping a pellIe inlo sliII lhe valers of a pond. Iar-fieId leanforning
refers lo lhe highIy-seIeclive direclionaI enhancenenl of propagaling spalio-lenporaI pIane-
vaves lased on lheir direclions-of-arrivaI (DOA).

The direclionaI enhancenenl (leanforning) of eIeclronagnelic pIane-vaves is of
inporlance in nany areas of eIeclricaI engineering, such as in vireIess connunicalions,
radar and radio-frequency (RI) inaging (nuIli-CHz range). Of parlicuIar inporlance are
appIicalions in radio-aslronony and space physics (Van Ardenne,2OOO), vhere far-fieId
leanforning is increasingIy enpIoyed in aperlure arrays, and in vireIess noliIe voice and
dala connunicalion syslens. In lhe case of dala connunicalions, leanforning is used for
niligaling lhe fading effecls of nuIlipalh propagalion (Lilva and Lo,1996, Lilerli }r. and
Rappaporl,1999, Huseyin ArsIan, Zhi-Ning Chen and Maria-CalrieIIa Di enedello,2OO6) in
saleIIile-lorne renole sensing appIicalions invoIving synlhelic aperlure and DoppIer radar,
in navigalion and Iocalion devices lased on CIS (SiIva, WorreI and rovn), as veII as in
various uIlra-videland Iocalion lechnoIogies (Chavani, MichaeI and Kohno).

TradilionaIIy, receiver-side leanforning has leen achieved using highIy-direclionaI
receiving anlennas, lypicaIIy enpIoying paraloIic refIeclors and horns, anlenna array
configuralions vilh passive phasing nelvorks (such as deIay-and-sun nelvorks and
phased-array feeds) and refIecl-arrays (Hun, Okonievski and Davies). DigilaI leanforning
aIgorilhns are oflen lased on fraclionaI-deIay sleering aIgorilhns and/or finile inpuIse
response (IIR) digilaI fiIlers (Chavani, MichaeI el aI., }ohnson and Dudgeon,1993, Lilerli }r.
and Rappaporl,1999, Sladerini,2OO2, Huseyin ArsIan, Zhi-Ning Chen el aI.,2OO6, }. Roderick,
H. Krishnasvany, K.Nevlon el aI.,2OO6). The use of digilaI signaI processing (DSI) in far-
12
VLS 248


fieId lroadland leanforning for snarl anlenna array appIicalions is currenlIy receiving
nuch allenlion, nainIy due lo lhe conlinuousIy increasing avaiIaliIily of digilaI
progrannalIe Iogic and cuslon siIicon falricalion lechnoIogies lhal are graduaIIy enalIing
lhe lypicaIIy high IeveIs of reaI-line conpulalionaI lhroughpuls necessilaled ly such DSI-
lased lroadland snarl anlenna arrays.

In lhis conlrilulion, ve descrile a parlicuIar lype of recenlIy proposed far-fieId leanforner
lhal is lased on lvo-dinensionaI (2D) space-line digilaI fiIlers having infinile inpuIse
responses (IIRs) (Rananoorlhy and rulon, AgalhokIis and rulon,1983, rulon and
arlIey,1985). UnIike lhe nore videIy-used DSI-lased 2D ||R leanforners, lhe descriled
2D ||R leanforners have 2D z-donain lransfer-funclions
1 2 1 2 1 2
( , ) ( , ) / ( , ) H : : N : : D : :
havng poIe-nanifoIds, as veII as zero-nanifoIds, in lhe 2D conpIex pIane
2
C . Iurlher, for
leanforning appIicalions,
1 2
( , ) D : : mus| |c ncn-scpara||c, inpIying non-lriviaI design
chaIIenges lo avoid nuIlidinensionaI inslaliIily and conpulaliIily conslrainls (such
chaIIenges are nol encounlered in 1D design or if
1 2
( , ) D : : is separalIe in lhe 2D case).
Hovever, suilalIe cIosed-forn a|gc|raic expressions for leanforning discrele-donain 2D
IIR lransfer funclions are avaiIalIe in lhe Iileralure lhal are conpulalIe, slalIe and
reaIizalIe using hardvare vilh Iover conpIexily lhan IIR leanforners having siniIar
direclionaI seIeclivily, anguIar haIf-pover landvidlh, elc. Iurlhernore, lhe exislence of
such cIosed-forn aIgelraic lransfer funclions faciIilales lhe reaI-line conlinuous sleering of
lhe direclion of lhe lean and adjuslnenl of ils landvidlh, naking lhese fiIlers allraclive for
appIicalions in energing soflvare-defined radio (SDR), nicrovave inaging and cognilive
radio syslens.

This paper is organized as foIIovs: in seclion 2, ve provide a lrief reviev of space-line
pIane-vaves and lheir properlies in lhe 2D space-line frequency donain, foIIoved ly
seclion 3, vhere ve provide a conprehensive reviev of lroadland pIane-vave digilaI fiIler
design. Thereafler, in seclion 4, ve discuss praclicaI hardvare inpIenenlalions slarling
fron 2D difference equalions lhal Iead lo signaI fIov graphs and nassiveIy-paraIIeI sysloIic-
array hardvare reaIizalions. Seclion 5 descriles sone recenl progress ve have achieved in
prololyping lhese nev sysloIic-array circuils using fieId progrannalIe gale array (IICA)
lechnoIogy. IinaIIy, in seclion 6 ve descrile recenlIy avaiIalIe lechnoIogy and eIeclronic
design aulonalion (LDA) looIs lhal nay evenluaIIy Iead lo 2D IIR leanforners lhal operale
in reaI-line for various lroadland nicrovave leanforning appIicalions.

2. EIectromagnetic PIane-waves in Space-time

We consider here eilher lhe lransverse eIeclric fieId ( , , , )
y
E x y : ct or nagnelic fieId
( , , , )
x
H x y : ct of a propagaling eIeclronagnelic pIane vave, vhere
4
( , , , ) x y : ct e1 is lhe
4D space-line conlinuous-donain,
3
( , , ) x y : e1 is 3D space, t e1 is line and
8 1
3 10 c ms

~ is lhe speed of Iighl in air/vacuun. y anaIogy vilh 3D pIanes, lhe


equalion

Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 249



3
2
1 2 3 1,2,3
1
, , and, 1
k
k
x y : ct o o o o o
=
+ + + = e
_
1 (1)

is a 4D hyper-pIane in lhe 4D conlinuous-donain
4
( , , , ) x y : ct e1 . Iropagaling
eIeclronagnelic pIane vaves are 4D hyper-pIane vaves in ( , , , ) x y : ct given ly

1 2 3
( , , , )
PW
w x y : ct w x y : ct

o o o

| |
| = + + +
|
\ .
_
,
3
2
1
1
k
k
o
=

_

1
e1 (2)

and lherefore have lhe properly lhal lhey are conslanl-vaIued in each of lhe hyper-pIanes
(1): lhal is, for each
1
e1 . LquivaIenlIy, for each vaIue of , ( )
PW
w is a
corresponding 4D iso-surface in ( , , , ) x y : ct . In Iig. 1, ve shov lhe 4D pIane vave
( )
1 2 3 PW
w x y : ct o o o + + + in lhe 3D spaliaI donain
3
( , , ) x y : e1 in an iso-pIane vhich, ly
sinpIy 3D geonelry, is perpendicuIar dislance ct fron lhe origin, as shovn. In 3D, ve nay
lherefore visuaIize lhe 4D space-line pIane vave of equalion (2) as an infinile sel of such
iso-pIanes ( ) w , each of vhich is prcpaga|ing in ( , , ) x y : ctcr |imc | vilh speed c in a
direclion nornaI lo lhe iso-pIanes. Depending on lhe 1D speclraI properlies of lhe c-scaIed
lenporaI signaI ( )
PW
w ct , lhe pIane vave nighl le lenporaIIy-narrovland or lenporaIIy-
lroadland.

Nole lhal, for lhe case of lhe ica| pIane vave, lhe region of supporl (ROS) of equalion (2) in
4
( , , , ) x y : ct e1 exlends, in generaI, lo infinily in al Ieasl sone direclions in
4
1 . Lqualion
(2) represenls eilher lhe eIeclric or nagnelic fieId of lhe pIane vave in 4D space-line. In lhis
chapler, ve are onIy concerned vilh lhe vaIues of lhe 4D pIane vave signaI as received on a
slraighl Iine in ( , , ) x y : . Therefore, ve consider onIy lhe speciaI case of lhe 2D space-line
represenlalion for vhich equalion (2) reduces lo lhe forn
1
( , 0, 0, )
PW
w x ct w x ct

o
| |
| = +
|
\ .
_
for
signaIs on lhe x-axis. Wilh lhe DOA in 3D space defined ly lhe angIes
(measured on - plane)
o
x : u and
o
as shovn in Iig. 1, il is easiIy shovn lhal (2) nay le
vrillen in lhe forn

( ) ( ) ( ) ( , , , ) sin cos cos cos sin
PW o o o o o
w x y : ct w x : y ct

u u
| |
|
= + + +
|
\ .
_
(3)
vilh / 2 , / 2,
o o
t u t < s fron vhich il foIIovs lhal lhe corresponding 2D space-line
pIane vave signaI rcccitc cn |nc x-axis is given ly
VLS 250


( ) ( , ) sin cos
PW o o
w x ct w x ct

u
| |
|
= +
|
\ .
_
(4)

The 4D hyper-pIanar iso-surfaccs of conslanl in (3) lecone 2D iso-Iines of (4) in
2
( , ) x ct e1 given ly
( ) sin cos
o o
x ct u + = (5)

As shovn in Iig. 2, lhe space-line direclion of lhe 2D space-line pIane vave is defined ly
lhe nornaI lo lhese conlours and is given ly (Cunaralne and rulon, Khadeni)


1
tan (sin cos )
o o
u u

= (6)

vilh respecl lo lhe c|-axis in lhe 2D Carlesian space-line donain ( , ), x ct vhere
sin sin cos
o o
| u = . Nole fron (6) lhal lhe 2D spalio-lenporaI direclion
1
tan (sin ) u |

= is
conslrained ly / 4 / 4 t u t s < vhere lhe exlrene vaIues / 4 t occur vhere
(( / 2
o
u t = ) and/or 0
o
= . These exlrene direclions are knovn as lhe 'end-fire angIes,
corresponding lo pIane vaves having DOAs in 3D space lhal are in lhe direclion of lhe x-
axis. The so-caIIed 'lroadside DOAs, vilh respecl lo lhe x-axis, are lhose direclions for
vhich lhe 4D space-line signaI in (3) is conslanl everyvhere on lhe x-axis al aII inslanls of
line |, vhich corresponds lo lhe 2D space-line direclion 0 u = and is equivaIenl lo lhe
DOAs in 3D space given ly ( 0,
o
u t = and/or / 2
o
t = ).


Iig. 1. The direclion of arrivaI (DOA) of a pIane-vave in 3D space, ( , )
o o
u , and lhe
appararenl spaliaI DOA seen ly a Iinear array aIong x-axis, | .
Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 251


Iig. 2. Iropagaling pIane-vave in 3D space (a), 2D spaliaI viev on lhe 0 y = pIane (l), 2D
spalio-lenporaI DOA (c), and region of supporl (ROS) on 2D frequency-donain, aIigned
aIong lhe spalio-lenporaI DOA (d).

2.3 On The Region of Support (ROS) of the 2D Fourier Transform of 2D space-time
PIane Waves
Civen lhe 2D Iourier lransforn pair for 2D space-line pIane
vaves ( ) sin ( , )
x ct
f f
PW PW
w x ct W e e
e e
u + , il nay le shovn (rulon and arlIey,1985)
lhal lhe Region of Supporl (ROS) of lhe speclrun ( , )
x ct
f f
PW
W e e
e e
in lhe 2D Carlesian
frequency donain ( , )
x ct
e e is confined lo lhe slraighl Iine

sin 0
x ct
e e u + = (7)

vhich passes lhrough lhe origin and sullends angIe u lo lhe
ct
e axis. Il Iies on lhe
ct
e axis
for lroadside DOAs and on
ct x
e e = for end-fire DOAs. InporlanlIy lherefore, lhe ROS of
aII 2D space-line eIeclronagnelic pIane vave signaIs, propagaling al speed c, cannol Iie
oulside lhe 9O-degree vide 2D fan-shaped region
x ct
e e s in ( , )
x ct
e e .

VLS 252


2.4 SpectraI-FiItering a Desired 2D Space-time PIane-wave in the Presence of other
PIane Waves and Noise
TypicaIIy, lhe signaI received on lhe x-axis nay le represenled ly M nuIlipIe lroadland
pIane-vaves, each having a differenl orienlalion
, ,
,
o k o k
u , 0,1, 2, ( 1) k M = . , in 2D space-
line, and addilive 2D noise. We assune lhe firsl pIane vave, given ly 0 k = is lhe desired
pIane vave lo le recovered ly fiIlering. Then lhe received signaI nay le vrillen in lhe forn

1
,
0
( , ) ( sin ) ( , )
M
PW k k v
k
w x ct w x ct n x ct |

=
= + +
_
(8)

Where
, ,
sin sin cos ,
k o k o k
| u = and vhere ( , )
v
n x ct represenls 2D space-line noise. The
Iourier lransforn of (8) is lherefore given ly


1
,
0
( , ) ( , ) ( , )
x ct x ct x ct
M
f f f f f f
PW k v
k
W e e W e e N e e
e e e e e e

=
= +
_
(9)
vhere
2
( , ) ( , )
D
v x ct v
N n x ct e e . TypicaIIy, ( , )
v
n x ct corresponds lo non-pIane-vave
eIeclronagnelic propagaling inlerference or olher sources of 2D lroadland noise, nodeIIed
as addilive vhile Caussian noise (AWCN). Therefore, lhe 2D ROS of lhe noise speclrun
( , )
v x ct
N e e is lypicaIIy uniforn lhroughoul lhe 2D frequency-donain
2
( , )
x ct
e e e1 . The
ROS of ( , )
x ct
f f
W e e
e e
lherefore consisls of lhe uniforn ROS of ( , )
v x ct
N e e and M Iines
lhrough lhe origin, vhere lhe orienlalion of each Iine is given ly lhe M differenl angIes
1
tan (sin )
k k
u |

= .

|cr nc|a|icna| ccntcnicncc in |nc rcs| cf |nc cnap|cr, uc ui|| usc
1 x
e e as |nc spa|ia| frcqucncq
taria||c, an
2 ct
e e as |nc |imc-frcqucncq taria||c ccrrcspcning |c spacc|imc . ct

2.5 Beamforming of a Broadband PIane Wave using Space-time FiIters
The sinpIesl of 2D space-line pIane-vave fiIler is knovn as a 'frequency-pIanar lean fiIler
(rulon and arlIey,1985), lecause ils passland Iies on a Iine-lhrough lhe origin and ideaIIy
has a 'lean shaped 2D passland of uniforn vidlh. The lean-shaped 2D passland is
orienled lo encIose lhe ROS of lhe 2D speclrun of lhe desired pIane-vave over ils fuII
lenporaI landvidlh vhiIe allenualing aII speclra avay fron lhis narrov passland. Such
lean fiIlers can le reaIized using severaI nelhods invoIving lolh IIR or IIR digilaI fiIlers
(Cunaralne and rulon, Khadeni, rulon and arlIey,1985). IIR fiIlers are inherenlIy slalIe
lul are of high arilhnelic conpIexily due lo lhe reIaliveIy higher order of lhe fiIler lhal is
required for a given seIeclivily, reIalive lo approxinaleIy-equivaIenl IIR fiIlers. Hovever,
lhe Ialler are nol as slraighlforvard lo design and lo inpIenenl.

The focus of lhis chapler is on lhe design and reaI-line hardvare inpIenenlalion of a firsl-
order 2D IIR lean digilaI pIane-vave fiIler.

Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 253

2.6 Effects of 2D Space-time SampIing using a Uniform Linear Array (ULA)
The 2D conlinuous-donain space-line signaI ( , ) w x ct is sanpIed in space using a nunler
of equi-spaced anlennas aIong lhe x-axis and sanpIed in line lo yieId lhe corresponding 2D
sanpIed space-line signaI
1 2
( , )
CLK
w n x n c T A A ,
1,2
0,1, 2, 3, n = ., and vhere. In order lo
prevenl undesiralIe spaliaI aIiasing, such uniforn Iinear arrays (ULAs) of anlennas require
lhal lhe uniforn dislance lelveen anlennas salisfy lhe Nyquisl condilion (lhe dislance
lelveen anlennas is x c T A s A , vhere 1/
U
f T A is lhe upper lenporaI frequency of lhe
inpul signaI leyond vhich ils speclrun Iacks supporl). Ior N anlennas, lhe 2D spaliaIIy-
sanpIed conlinuous-line anlenna array signaIs are given ly
1 1
( , ), 0,1,..., 1 w n x ct n N A = .
The 2D speclrun of
1
( , ) w n x ct A repIicales on lhe spaliaI-frequency
1
e axis vilh periodicily
2 . t The N conlinuous-line signaIs
1
( , ) w n x ct A are anpIified, using a Iov noise anpIifier
having lhe required lenporaI landvidlh and noise perfornance, lhen Iov-pass fiIlered
prior lo A/D conversion, resuIling in lhe 2D discrele-donain anlenna signaI
1 2
( , ).
CLK
w n x n c T A A The ADC cIock frequency 1/
CLK CLK
F T = A vhere is chosen ly seIecling
lhe inler-sanpIe line
CLK
T T A s A such lhal Nyquisl sanpIing lheoren is salisfied in lolh
lhe spaliaI and lenporaI frequency donains. UsuaIIy, , 1,
CLK U U
T K T K A = A > vhere
U
K
is lhe so-caIIed lenporaI oversanpIing faclor and is sonelines necessary for nininizing lhe
frequency-varping effecls lhal are inlroduced in lhe design of lhe digilaI fiIler lransfer
funclion.

Melhods have leen proposed and inpIenenled for significanlIy reducing lhe required
nunler of anlennas vilhoul significanlIy reducing perfornance, for vireIess
connunicalions and olher appIicalions. These nelhods aIso Iead lo nuch reduced
arilhnelic conpIexily of lhe fiIler and are lased on aIIoving a conlroIIed anounl of
nuIlidinensionaI spaliaI aIiasing and lherely spaliaI under-sanpIing, as reporled in
(Khadeni and rulon, Madanayake, Hun and rulon).

3. First-order 2D IIR Frequency-Beam PIane-wave FiIter

In lhe alove, il is eslalIished lhal lhe direclionaI enhancenenl of an ideaI desired space-
line pIane-vave nay le achieved using a 2D space-line fiIler having a 2D passland lhal
enconpasses lhe Iine-shaped ROS of lhe desired pIane-vave signaI. Iurlher, undesired
signaIs, such asolher pIane-vaves and/or noise nay le allenualed ly ensuring lhal lhe ROS
of lhe 2D slopland correspondas lo lhe ROS of lhe speclrun of lhe undesired signaIs.
AIlhough lhis approach has leen exlended lo 3D and 4D space-line signaIs (KuenzIe and
rulon, oIIe,1994, rulon,2OO3, Dansereau,2OO3, KuenzIe and rulon,2OO5, Dansereau and
rulon,2OO7), here ve focus on lhe sinpIesl 2D case ly descriling lhe design and
inpIenenlalion of a suilalIe 2D fiIler.
VLS 254


3.1 The Prototype ResistiveIy-terminated 2D Passive Frequency-Beam Network
Consider a 2D firsl-order conlinuous-donain induclance-resislance nelvork (rulon and
arlIey,1985), vhere
1
s and
2
s are spaliaI and lenporaI LapIace varialIes (Dudgeon and
Mersereau,199O, }ohnson and Dudgeon,1993, Schroeder and Iune,2OOO), respecliveIy. The
inpul-oulpul 2D LapIace voIlage lransfer funclion of lhis nelvork is given ly


1 2
1 2
1 1 2 2 1 2
( , )
( , )
( , )
Y s s R
T s s
R L s L s W s s
=
+ +
(1O)

vhere lhe paranelers
1 2
0, 0, and 0 L L R > > > correspond lo a passive spa|ia| induclor,
passive |cmpcra| induclor and passive resislance, respecliveIy, vilh lransforn inpuls and
oulpuls
1 2
( , ) W s s and
1 2
( , ) Y s s , respecliveIy. We denole lhe respeclive lransforn pairs ly
2
1 2
( , ) ( , )
D
w x ct W s s and
2
1 2
( , ) ( , ),
D
y x ct Y s s respecliveIy. The sleady-slale inpul-oulpul
frequency-response of (1O) is found ly selling
1 1
s fe = and
2 2
, s fe = Ieading lo lhe 2D
frequency response lransfer funclion


( )
1 2
1 2
1 1 2 2 1 2
( , )
( , )
( , )
Y f f R
T f f
R f L L W f f
e e
e e
e e e e
=
+ +
(11)
Iron (5), lhe nelvork under consideralion is 2D rcscnan| cn |nc 2D |inc-snapc rcgicn


1 1 2 2
0 L L e e + = (12)

passing lhrough lhe frequency-origin (Nole: |n 2D, capaci|crs arc nc| rcquirc |c inucc
rcscnancc). Al aII finile frequencies vhere (12) is salisfied (i.e. lhroughoul lhe 2D passland) ,
nelvork energy resonales lelveen lhe lvo induclance eIenenls and
1 2
( , ) T f f e e is unily. y
choosing
1 2
cos and sin L L u u = = , 0 90 ,
o
u s s ve can orienl lhe axis of lhe 2D passland lo
lhe angIe . u A lypicaI response is shovn in Iig. 3. The shape of lhe 2D gain
1 2
( , ) T f f e e of
lhe fiIler nay le envisaged in 2D frequency space ly noling lhal
1 1 2 2
L L e e + = descriles,
for conslanl , a Iine lhal is paraIIeI lo lhe 2D passland and aIong vhich
1 2
( , ) T f f e e is
conslanl and Iess lhan unily. InporlanlIy,
1 2
( , ) T f f e e decreases nonolonicaIIy vilh
increasing vaIues of vilh lhe lvo -3d Iines having gain O.7O7 given ly R = . We
nake lhe foIIoving sunnary olservalions (see (rulon and arlIey,1985) for delaiIs):

1. Al 2D resonance, lhal is on lhe frequency-Iine in
1 2
( , ) e e vhere 0, = lhe lransfer
funclion
1 2
( , ) 1 T f f e e = . This defines lhe 2D passland axis and unily gain on lhe
cenlre of lhe lean-shaped passland.
2. AIong aII direclions orlhogonaI lo lhe passland axis in
1 2
( , ) e e , lhe nagnilude
and phase frequency response of lhe fiIler correspond lo lhal of a firsl-order Iov-
Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 255

pass lransfer funclion, nonolonicaIIy-decreasing vilh dislance fron lhe passland
axis.
3. The uniforn -3d landvidlh of lhe lean is given ly
2 2
3 1 2
/
aB
R L L e

= + .

3.2 The Transfer-functions of the First-order Beam FiIter in the 2D s- and z- Domains
AIlhough lhe inverse 2D LapIace lransforn of equalion (1O) yieIds a conlinuous-donain
parliaI differenliaI equalion for lhe inpul-oulpul lransfer-funclion, praclicaI
inpIenenlalions have so far leen in lhe discrele-donain of 2D finile-difference equalions,
inpIenenled in lhe forn of digilaI circuils. Transfornalion lo lhe discrele-donain is
achieved ly appIying lhe nornaIized 2D liIinear lransforn (2D LT)
1
, 1, 2,
1
k
k
k
:
s k
:

= =
+
lo
equalion (1O) Ieads, afler consideralIe aIgelraic nanipuIalion (rulon and arlIey,1985), lo
lhe 2D z-lransforn lransfer funclion


( )( )
( )
1 2
1 2
1 1
1 2
1 2
1 1
1 2
, 1 1 1
1 1
1 2
10 11 2 1 01 2
1 1
( , )
( , )
( , )
1
: :
T
: :
: :
Y : :
H : :
W : :
b b : : b :

| |
|
|
+ +
\ .
+ +
=
+ + +
(13)

vhere
2
1 2 1 2
( , ) ( , )
D
CLK
W : : w n x n c T A A and
2
1 2 1 2
( , ) ( , ),
D
CLK
Y : : y n x n c T A A respecliveIy, and
vhere
( ) ( )
1 2 1 2
( 1) ( 1) / .
i f
if
b R L L R L L = + + + + The alove appIicalion of lhe 2D LT, vhich is
a confornaI napping lelveen lhe 2D LapIace and 2D z-donain, resuIls in a dislorlion ofn
lhe high frequency parl of lhe 2D passland, knovn an liIinear varping, lhal Ieads lo a
praclicaI Iinilalion of lhe upper frequency ( )
2
0.5 , , t e t < < of lhe lean-shaped passland.
This effecls of lhis Iinilalion nay le avoided ly suilalIe lenporaI and/or spaliaI over-
sanpIing of lhe inpul signaI.

Ior exanpIe, here ve shaII enpIoy a lenporaI over-sanpIing faclor of 2 for vhich ve shov
in Iig. 4 lhe correpsonding 'veakIy-varped nagnilude response of lhe discrele-donain
frequency-response lransfer-funclion over lhe usefuI range
2
, , 0.5 . e t s


VLS 256



Iig. 3. The 2D conlinuous-donain sleady-slale Magnilude Irequency Response
1 2
( , ) T f f e e
of lhe fiIler in (1O) for , , , 1, 2,
k
k e t s = and
1 2
0.1, cos(30 ), sin(30 ).
o o
R L L = = =


Iig. 4. A ean-shaped Response lhal is varped ly lhe 2D LT, shovn in lhe usalIe range
1 2
and 0.5 0.5 . t e t t e t s s s s The lean shape al frequencies ( )
2
0.5 , , t e t < < are
nol used in our appIicalion lecause lhe lean-shape is significanlIy off-axis, due lo linear
varping. The inleresled reader is referred lo (Madanayake and rulon, rulon,2OO3) for
delaiIs.
Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 257

3.3 On ImpIementing the 2D Difference-Equations Using DifferentiaI Operators
Taking lhe inverse 2D z-lransforn of (1O) Ieads lo lhe 2D space-line inpul-oulpul direcl-
forn difference-equalion, vhich ve have shovn can le inpIenenled in nassiveIy-paraIIeI
sysloIic-array hardvare for reaI-line fiIlering appIicalions. Hovever, lhe recenlIy proposed
hylrid-forn signaI fIov graph (Madanayake, Hun el aI., Madanayake,2OO8), aIlhough
reIaliveIy conpIicaled lo design, offers Iover conpIexily and high-speeds of operalion lhan
direcl-forn nelhods. In lhis paper, ve pursue lhe hylrid-forn signaI fIov graph nelhod.
The 2D z-donain lransfer-funclion lhal Ieads lo lhe so-caIIed hylrid-forn slruclure (an
archileclure lhal enjoys lhe high-speeds of operalion of direcl-forn slruclures vhiIe leing as
Iov in conpIexily as so-caIIed differenliaI-forn slruclures) can le ollained ly nelhods
avaiIalIe in lhe Iileralure (Madanayake, Hun el aI., Madanayake and rulon,2OO7,
Madanayake,2OO8). Ior exanpIe, ve nay vrile a suilalIe lransfer funclion in lerns of lhe
spaliaI differenliaI operalor
1 1 1
1 1 1
/(1 )
D
: : :

+ as


( )
1
1
1
2 1 2
1 2
1
1 1 1 2 1
2 2 1
1
1 ( , )
( , )
( , )
1 1
1
D
:
: Y : :
H : :
W : : :
: :
:
o |

+
=
+ +
+
_
(14)
vhere ve require


2cos 2sin
, 1
cos sin cos sin R R
u u
o |
u u u u
= =
+ + + +
(15)

Nole lhal lhe passland gain in (14) is scaIed ly lhe conslanl
1
R
R L +
, reIalive lo lhe direcl-forn
case (erlschnann, arlIey and rulon, Madanayake, Hun el aI., Liu and rulon,1989,
Madanayake and rulon,2OO7, Madanayake,2OO8), and is ignored in lhe foIIoving lecause il
is nol of praclicaI significance. Re-vriling lhe direcl-forn lransfer-funclion using lhe spaliaI-
differenliaI operalor resuIls in jusl lvo fiIler coefficienls in lhe denoninalor of (14) inslead
of lhree, inpIying a 33 reduclion in lhe nunler of paraIIeI hardvare nuIlipIiers required
in circuil reaIizalions, reIalive lo direcl-forn reaIizalions.

The difference-equalion reaIizalions descriled here Iead lo praclicaI-lounded-inpul-
lounded-oulpul (praclicaI-IO) slalIe (AgalhokIis and rulon,1983) perfornance under
finile precision arilhnelic, assuning zero iniliaI condilions (ZICs) for lolh spaliaI and
lenporaI ileralions. Melhods lhal guaranlee ZICs are considered Ialer in lhe foIIoving.

4. A ReaI-time High Throughput ImpIementation using the Hybrid-form
SystoIic-Array Processor Architecture

SysloIic-array processors are nassiveIy-paraIIeI conpulers having idenlicaI, synchronousIy
cIocked, fuIIy-pipeIined, high lhroughpul idenlicaI processing eIenenls, vhich are
connecled in a Iinear- or neshed-array configuralion
VLS 258



Iig. 5. Overviev of lhe pIane-vave fiIler inpIenenlalion consisling of N paraIIeI processing
core noduIes (IICMs), vhich have an inlernaI signaI fIov graph lased on lhe hylrid-forn
lransfer-funclion in equalion (14).

(Sid-Ahned, Kung,1988a, Kung,1988l, Shanlhag,1991, Rader,1996, Zajc, Sernec and
Tasic,2OOO). Such sysloIic-array processors are noduIar, reguIar, and IocaIIy inlerconnecled,
naking lhen veII-suiled for reaI-line signaI processing using appIicalion-specific VLSI
hardvare for digilaI signaI processing appIicalions al radio-frequency.

Research on noveI sysloIic-array archileclures for 2D/3D IIR frequency-pIanar digilaI pIane-
vave fiIlers for leanforning appIicalions has Iead lo fieId-progrannalIe gale-array
(IICA) lased singIe-chip nuIliprocessor inpIenenlalions capalIe of reaI-line operalion al
a suslained arilhnelic lhroughpul of one-frane-per-cIock-cycIe (OIICC), a requirenenl for
reaI-line pIane-vave fiIlering al RI using Iinear- or reclanguIar-arrays of anlenna eIenenls
(Hun, Madanayake and rulon, Madanayake, rulon and Conis, Madanayake and rulon,
Madanayake and rulon, Madanayake, Hun el aI., Madanayake, Hun and rulon,
Madanayake,2OO4, Madanayake,2OO8, Madanayake and rulon,2OO8). The required OIICC
lhroughpul rale, required for nuIli-CHz inpIenenlalions, arises due lo lhe facl lhal lhe
signaIs of inleresl are of uIlra-vide RI landvidlh, vhich Ieads lo Nyquisl sanpIe rales lhal
are al Ieasl lvice lhe fuII RI landvidlh of lhe signaI.

The leanforners lherefore direclIy sanpIe RI signaIs fron lhe anlennas vilhoul dovn-
conversion (or landpass sanpIing), and Ieads lo frane sanpIe rales in lhe CHz. Such
excessiveIy-high frane sanpIe rales (nuIlipIe CHz) nake soflvare-lased reaIizalions
infeasilIe using lradilionaI DSI lechnoIogies. Our research indicales (Madanayake and
rulon, Madanayake,2OO8) lhal nassiveIy-paraIIeI synchronousIy-cIocked, speed-oplinized,
Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 259

fuIIy-pipeIined sysloIic-array processors are currenlIy lhe lesl avaiIalIe soIulion for lhe
lroadland reaI-line DSI-lased radio-frequency (RI) leanforning appIicalions using
sanpIed anlenna arrays (ArnoId Van Ardenne, LIIingson,1999, Lilerli }r. and
Rappaporl,1999, Ween, Noralos and Iopovic,1999, Irederick, Wang and Iloh,2OO2, Do-
Hong and Russer,2OO4, Rodenleck, Sang-Cyu, Wen-Hua el aI.,2OO5, Madanayake,2OO8,
DevIin,Spring 2OO3).

4.1 Overview of the Architecture
The nassiveIy-paraIIeI sysloIic-array archileclure consisls of an array of idenlicaI paraIIeI-
processing core-noduIes (IICMs), sonelines caIIed processing eIenenls in lhe Iileralure
(Kung,1988a). Lach IICM is dedicaled lo processing an anlenna eIenenl, and signaIs fron
aII N eIenenls are anpIified using a Iov-noise anpIifier (LNA), Iov-pass fiIlered (LII),
and line-synchronousIy sanpIed using N idenlicaI anaIog signaI processing chains. The
IICMs and anaIog-lo-digilaI converlers (ADCs) are cIocked using a singIe-phase nasler
cIock signaI, of frequency 1/
CLK CLK
F T = A . The IICMs are derived using lhe recenlIy-
proposed hylrid-forn signaI fIov graph having lhe required z-donain lransfer-funclion
(14) (Hun, Madanayake el aI., Madanayake, Hun el aI.).

The IICMs lhal conprise lhe sysloIic-array processor are fuIIy-paraIIeI, speed-naxinized,
fuIIy-pipeIined, nuIli-inpul-nuIli-oulpul (MIMO) processors, each consisling of 2 inpul
porls and 2 oulpul porls. A IICM al spaliaI Iocalion
1
n has ils inpul porl A connecled lo lhe
ADC al Iocalion
1
n and inpul porl connecled lo lhe oulpul porl C of lhe IICM al Iocalion
1
1. n Iorl D provides lhe conpuled oulpul signaI
1 2
( , )
CLK
y n x n c T A A for spaliaI Iocalion
1
. n

4.2 Inter-PPCM and Intra-PPCM PipeIines
The IICMs are pipeIined such lhal signaIs enlering lhrough lhe inpul porls
1 1
and
n n
A B undergo p addilionaI cIocked deIays, as a resuIl of inlernaI pipeIining. These
addilionaI deIays can le conpensaled ly deIaying lhe inpul signaI
1
1 n
A
+
ly
1
( 1) n p + cIock
cycIes, Ieading lo a deIay of
1
( 1)
x
n T + A seconds, vhere
x CLK
T p T A = A is lhe pipeIining
Ialency of a IICM. Ior N IICMs, lhe finaI oulpul signaI al lhe oulpul of lhe
th
N IICM
lherefore undergoes a pipeIining deIay of Np cIock cycIes. When 2D space-line oulpul
signaIs are required, lhe oulpul signaIs fron each IICM nusl le fed lhrough addilionaI
cIocked IIIOs, having deplh
1
( 1 ) N n p , so lhal lhe signaIs al aII spaliaI oulpul Iocalions
are unifornIy deIayed ly Np cIock cycIes. The inpIenenled lransfer-funclion is lherefore
nodified, in lhe presence of pipeIining, lo lhe Iinear phase-deIayed forn
1 2 2
( , )
Np
H : : :


vhich has no effecl on lhe nagnilude frequency-response funclion (lecause
2
1.
CLK
f Np
e
e A
= )
VLS 260



Iig. 6. Hylrid-forn IICM circuil having 2 inpuls, 2 oulpuls, 4 lvo-inpul paraIIeI
adder/sullraclors, and 2 paraIIeI hardvare nuIlipIiers.

4.3 Design of the Hybrid-form PPCMs
Having ollained a lasic overviev of lhe sysloIic-array archileclure, ve nov derive lhe
inlernaI signaI fIov and inlernaI conponenls of each IICM. RecaII lhe 2D hylrid-forn
lransfer-funclion (14):

( )
1
2 1 2
1 2
1
1 1 1 2 1
2 2 1
1
1 ( , )
( , )
( , )
1 1
1
: Y : :
H : :
W : : :
: :
:
o |

+
=
+ +
+
(16)

Cross-nuIlipIying lerns in (16), ve gel lhe 2D z-donain inpul-oulpul forn, given ly


( )
1
1 1 1 1
2 1 2 2 2 1 2 1
1
1 ( , ) 1 1 ( , ),
1
:
: W : : : : Y : :
:
o |

| |
+ = + +
|
+
\ .
(17)
Ieading lo

( )
1
1
1 2 1 2 1
1 1
1 2 2 1
2
( , ) ( , )
1
( , ) 1
1
:
W : : Y : :
:
Y : : :
:
o
|

| |
+
|
+
\ .
= +
+
(18)

MuIlipIying lolh sides ly
2
p
:

yieIds lhe required forn




Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 261


Iig. 7. Inlerconneclions lelveen IICMs, shovn here in lhe nixed donain
1 2
( , ) , n : eZC
Ieads lo lhe nassiveIy-paraIIeI sysloIic-array processor inpIenenlalion of lhe lean pIane-
vave fiIler.


( )
1
1
1 2 2 2 1 2 1
1 1
1 2 2 2 1
2
( , ) ( , )
1
( , ) 1
1
p p
p
:
W : : : : Y : :
:
Y : : : :
:
o
|

| |
+
|
+
\ .
= +
+
(19)

Conpuling lhe inverse z1-lransforn of (19) under spaliaI ZICs, ve ollain lhe 2D nixed-
donain
1 2
( , ) n : eZC forn, given ly


( )
( )
1 2 2 2 1 2
1
1 2 2 2
1
2
( , ) ( , )
( , ) 1
1
p p
p
W n : : : U n :
Y n : : :
:
o
|

+
= +
+
(2O)

vhere

1 2 1 2 2 1 2 2
( , ) ( 1, ) ( 1, )
p p
U n : Y n : : U n : :

= (21)

and, vhere
2
p
:

is lhe z
2
-lransforn of lhe inlernaI pipeIining deIays al each IICM. ecause
lhe deplh of pipeIining is arlilrary, lhe nuneralor of (2O) can le pipeIined al viII using
slraighlforvard 1D IIR fiIler pipeIining nelhods, noling lhal lhis 1D IIR seclion has lvo
lerns
1 2
( , ) W n : and
1 2
( , ) U n : vhich are ollained using digilaI porls
1
n
A and
1
,
n
B respecliveIy. Lqualions (2O) and (21) descrile lhe 2-inpul-2-oulpul z
2
-donain lransfer-
funclions of a IICM al Iocalion
1
0 1. n N s < The hylrid-forn signaI fIov-graph is
lherely ollained, and is shovn in Iig. 6. The firsl 3 IICMs in lhe sysloIic-array are shovn
in Iig. 7 as an inlerconneclion of processors.

VLS 262


4.4 A PipeIining ExampIe
In order lo faniIiarize lhe reader vilh pipeIining concepls, ve nov provide a sinpIe
exanpIe vhere il is assuned lhal 12 p = inlernaI pipeIining slages are sufficienl for
achieving lhe required lhroughpul. The 12 slage pipeIine viII le dislriluled as foIIovs: lhe
nuIlipIier o viII consisl of 3 IeveI pipeIining, lhe lhree 2-inpul adders/sullraclors denoled
A1, A2, and A3, are lo have 3 IeveIs of pipeIining. Il is inporlanl lo ensure lhal aII signaI
conponenls lhal connecl lo a parlicuIar 2-inpul adder/sullraclor undergo equaI deIays.
This is essenliaI for correcl operalion, and nusl le salisfied for aII pipeIined designs.

The hylrid-forn signaI-fIov graph does nol aIIov pipeIining of A4, lecause onIy one unil-
deIay luffer is avaiIalIe in lhe firsl-order feedlack Ioop, vhich is usuaIIy alsorled inside
lhe paraIIeI Iogic of nuIlipIier . | Irovided aII feed-forvard palhs are fuIIy pipeIined, lhe
crilicaI palh deIay of lhe hylrid-forn IICM cannol le reduced leyond
/ CPD Mul A S
T T T ~ + vhere
/
and
Mul A S
T T are lhe propagalion deIays of a paraIIeI nuIlipIier
and adder/sullraclor circuil, respecliveIy. The naxinun speed of operalion for a hylrid-
forn IICM is lherefore Iess lhan 1/
CLK CPD
F T s A unIess addilionaI speed-oplinizalion
nelhods, lased on Iook-ahead oplinizalion, are enpIoyed. This nelhod is discussed in lhe
nexl seclion.


Iig. 8. SignaI fIov graph of a hylrid-forn IICM having 12 cycIes of pipeIine Ialency
(arlilrariIy chosen for lhe purpose of denonslralion). The 12 cIock-cycIes of addilionaI
pipeIining can le used as required lo reduce lhe crilicaI-palh deIay (CID) of lhe sysloIic-
array.

Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 263

The pipeIined version, having 12 p = for lhe hylrid forn IICM, is shovn in Iig. 8. We
nov descrile Iook-ahead speed-oplinizalion of lhe inlernaI 1D lenporaI IIR digilaI fiIler
seclion having lransfer-funclion
1
2
1
2
1
1
:
: |

+
+
lhal enalIes nuch grealer IeveIs of reaI-line
lhroughpul al lhe cosl of addilonaI circuil conpIexily.

4.5 AdditionaI Look-Ahead Speed-Maximization
We exlend veII-knovn 1D IIR pipeIining using Iook-ahead oplinizalion, a nelhod
pioneered ly Iarhi el aI (Iarhi, Iarhi,1991, Iarhi,1999). Look-ahead is a nelhod for
reducing addilionaI deIays inlo a crilicaI feedlack Ioop and is lased on poIe-zero
canceIIalion of 1D z-donain lransfer-funclions.

In seclion 4.4, ve descriled an exanpIe for vhich 1O addilonaI deIays are dislriluled in lhe
forvard (lhal is, IIR, aIso knovn as feed-forvard) signaI palhs of lhe IICM, such lhal lhe
crilicaI palh deIay of lhe IICM (and lherefore, of lhe sysloIic-array) is reduced lo lhe
Ialency for a nuIlipIy-add-operalion, denoled .
CPD
T A The speed-lollIeneck for lhis exanpIe
Iies vilhin lhe firsl-order feed-lack IIR fiIler, vhich has a sinpIe reaI-poIe al
2
: | = vhere
il nay le shovn lhal 1 | s for passive fiIler nelvork prololypes. ecause lhis poIe is
vilhin (or on) lhe unil circIe
2
1 : = lhe 1D IIR fiIler seclion is uncondilionaIIy slalIe
(ignoring effecls due lo finile precision).

Lel us furlher assune lhal our oljeclive is lo haIve lhe crilicaI palh deIay using Iook-ahead
oplinizalion of lhe IIR seclion. This can le achieved ly increasing lhe nunler of inlernaI
deIays in lhe firsl-order feedlack Ioop lo 2 (causing lhe feedlack Ioop lo increas in order):
lhis nay le easiIly achieved ly nuIlipIying lolh nuneralor and denoninalor of
1
2
1
2
1
1
:
: |

+
+
ly
1
2
1 : |

Ieading lo lhe 2
nd
order seclion, given ly
( )( )
1 1
2 2
2 2
2
1 1
1
: :
:
|
|

Ieading lo
a nev crilicaI palh deIay in lhe feedlack Ioop
,
/ 2,
CPD LA CPD
T T ~ inpIying an aInosl 1OO
increase in lhe naxinun speed of operalion (Iarhi, Iarhi, Iarhi and Messerschnill, Iarhi
and Messerschnill, Sundarajan and Iarhi, Iarhi and Messerschnill,1989, Iarhi,1991,
Iarhi,1999). This Iook-ahead speed-naxinizalion process nay le repealed: for exanpIe,
ly nuIlipIying lhe nuneralor and denoninalor of
( )( )
1 1
2 2
2 2
2
1 1
1
: :
:
|
|

ly
2 2
2
1 : |

+ Ieads
lo a 4
lh
-order feedlack Ioop having lransfer-funclion
( )( )( )
1 1 2 2
2 2 2
4 4
2
1 1 1
1
: : :
:
| |
|

+ +

vhich
aIIovs lhe nuIlipIier
4
| lo consisl of 3 IeveIs of inlernaI pipeIining, vhiIe lhe fourlh deIay
can le used in lhe 2-inpul adder lhal conpIeles lhe feedlack Ioop (Madanayake and rulon,
Madanayake,2OO8). The addilionaI lerns in lhe nuneralor lhal appear due lo lhe
appIicalion of Iook-ahead speed-naxinizalion Iead lo addilionaI circuil conpIexily - lhis is
VLS 264


lhe price for lhe exlensive gain in reaI-line lhroughpul, vhich is 3OO for 4
lh

order feedlack
Ioops. The addilionaI arilhnelic circuils lhal appear in lhe feed-forvard seclions can le
easiIy pipeIined ly increasing lhe deplh of pipeIining p as required. In our exanpIe, ve
have increase lhe deplh of pipeIining up lo 22 p = vhich aIIovs 3-IeveI pipeIining lo aII
addilionaI adders/sullraclors and nuIlipIiers in lhe IICM.

5. FPGA Circuit Prototypes

In lhis seclion, ve provide a proof-of-concepl circuil design using a fieId progrannalIe
gale array (IICA) device. An exanpIe inpIenenlalion of a hylrid-forn sysloIic-array
processor conlaining 21 fuIIy pipeIined IICMs is provided. The largel IICA is a XiIinx
Virlex-4 Sx35-1Off668 device inslaIIed on a NaIIalech enADDA daughler card, vhich in
lurn is inslaIIed on a NaIIalech enONL nainloard. This parlicuIar conlinalion is videIy
knovn as lhe XiIinx XlreneDSI Kil-4.

The Iogic design fIov slarls vilh lhe XiIinx Syslen Ceneralor (XSC) design looI, vhich is a
pIug-in for MalIal/SinuIink. We chose XSC as our IICA design looI, aIlhough
convenlionaI design nelhods lased on hardvare descriplion Ianguages such as VHDL or
VeriIog nay aIso le allenpled. The noduIar reguIar nalure of lhe sysloIic-array, logelher
vilh lhe conpIicaled pipeIines and dalafIov slruclure, nakes lhe use of a graphicaI IICA
design nelhod such as XSC, easier, conpared lo lexl-lased design looIs. We hovever nole
lhal XSC, in lhe end, Ieades lo synlhesizalIe VHDL (or VeriIog), vhich is sulsequenlIy
processed ly convenlionaI IICA Iogic synlhesis looIs such as lhe XiIinx Synlhesis TooI
(XST) or SynpIify Iro.

5.1 Finite Precision Arithmetic and the FPGA Circuit
The arilhnelic circuils on lhe IICA are olviousIy lased on finile precision hardvare lIocks
for lhe nuIlipIiers, adders/sullraclors, and nenory devices. The designalion of precisions
(vord sizes) is an inporlanl design slep lhal requires exlensive furlher research. Our
exanpIe is lased on experience vilh nany siniIar circuils, and is IargeIy a resuIl of
experienliaI Iearning accunuIaled over severaI years of research on siniIar sysloIic-arrays.
Al lhis line, a conprehensive design nelhod lhal can Iead lo oplinaI finile precision IeveIs
(in lerns of hardvare resource consunplion, quanlizalion noise slalislics, pover
consunplion, and lhroughpul) is nol avaiIalIe, and is an inleresling suljecl for research
aclivilies.

The foIIoving exanpIe assunes inpul signaIs ollaining fron 4-lil A/D converlers.
IreIininary sludies shov lhal 3 lil A/D converlers are quile sufficienl for uIlra-videland
vireIess connunicalions appIicalions. Our choice of 4-lils in our A/D converlers resuIls
fron 1-lil overdesign, nainIy as a nargin of safely, in order lo ensure good perfornance

Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 265


Iig. 9. XiIinx IICA circuil for a hylrid-forn IICM having 12 cycIes of pipeIine Ialency
(corresponding lo lhe signaI-fIov graph in Iig. 8).


Iig. 1O. Iirsl 4 IICMs of a hylrid-forn sysloIic-array IICA circuil shoving inler-IICM
inlerconneclions. The IICA circuil is lesled on-chip using slepped hardvare co-sinuIalion
using a 2D unil inpuIse inpul al IICM #1, vilh inpuls of IICM #2, #3, ., #21, sel lo zero,
Ieading lo lhe 2D neasured inpuIse response
1 2
( , )
CLK
h n x n c T A A . A lil-lrue cycIe-accurale
IICA circuil sinuIalion of lhe 2D inpuIse response is avaiIalIe in lhe MalIal varialIes
simcu|,simcu|1,.,simcu|20, and lhe neasured on-chip IICA circuil response are avaiIalIe in
MalIal varialIes n0,n1,.,n20.

VLS 266


fron a reaI-vorId appIicalion. The nuIlipIier coefficienls are assuned lo le 12 lils, vilh lhe
linary poinl al posilion 1O. AII olher regislers, incIuding quanlized oulpuls fron nuIlipIiers
and adder/sullraclor lIocks, are fixed al 14-lils, vilh linary poinl assuned al posilion 1O.
The design of a IICM is shovn in Iig. 9, foIIoved ly lhe sysloIic-array, in Iig. 1O. The finile
precision vaIues al various Iocalions on lhe IICM signaI fIov graph can le videIy
oplinized againsl various requirenenls, lul is nol allenpled here, lecause ve are onIy
inleresled in giving our readers a lasic design overviev of lhe hylrid-forn sysloIic-array
processor.

The IICA circuil vas lesled, using on-chip hardvare-in-lhe-Ioop co-sinuIalion, using
MalIal/SinuIink, XSC, and IUSL, using lhe XlreneDSI Kil-4 device, vhich vas inslaIIed
on lhe 5V 32-lil ICI sIol of lhe hosl IC. Iigure 11 shovs lhe neasured 2D nagnilude
frequency response of an exanpIe lean fiIler having spaliaI DOA 25 ,
o
o
| = and landvidlh
paraneler 0.02, R = conpuled for 21 spaliaI sanpIes, and 256 line sanpIes, of lhe inpuIse
response. The uneven nalure of lhe neasured response is allriluled lo quanlizalion
effecls, and nagnilude sensilivily, for vhich a conprehensive sludy renains as usefuI
fulure research.


Iig. 11. Measured 2D nagnilude response,
1 2
and 0.5 0.5 , t e t e t s s s s ollained
fron a 21 IICM IICA inpIenenlalion using on-chip hardvare co-sinuIalion using lhe
XiIinx XlreneDSI Kil-4. Quanlizalion effecls cause lhe inpIenenlalion lo have exlra rippIes
in lolh pass- and slop-lands, and Iead lo addilion of ACWN. A delaiIed sludy of
quanlizalion effecls renain for fulure vork.

Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 267

5.2 High-speed ImpIementation TechnoIogies
Al presenl, sysloIic-array inpIenenlalions of lhe proposed 2D IIR lean pIane-vave fiIlers
have leen Iiniled lo proof-of-concepl reaIizalions on IICA circuils lhal operale al cIock
rales of up lo 1OO MHz. Hovever, reaI-vorId eIeclronagnelic appIicalions requires frane
rales in excess of 1 CHz and can le high as 21 CHz for fuII-land UW radio syslens. IICA
circuil inpIenenlalions are oflen inpraclicaI for producl appIicalions and are noslIy used
as prololypes for evenluaI inpIenenlalion using appIicalion-specific inlegraled circuils
(ASICs) using high-speed VLSI pIalforns such as lhe slale-of-lhe-arl 4Onn digilaI CMOS
process. Iorling lhe avaiIalIe IICA designs lo 4Onn CMOS (or siniIar) VLSI lechnoIogy
renains an exciling fieId for fulure research.

5.3 FieId-ProgrammabIe Object Arrays (FPOAs) and Asynchronous FPGA Circuits
In generaI, aIlhough IICAs fron vendors such as XiIinx and AIlera are Iiniled lo speeds
Iess lhan ~3OO MHz for nosl rccursitc fi||cr csigns, fulure deveIopnenls nay faciIilale lhe
use of convenlionaI IICA lechnoIogy lo inpIenenl lhe proposed sysloIic arrays al 1 CHz or
higher. Iurlhernore, il shouId le noled lhal fal-Iess seniconduclor lechnoIogies, such as
MalhSlars (n||p.//uuu.ma|ns|ar.ccm/) Arrix fieId progrannalIe oljecl array (IIOA) devices
(Anonynous,2OO7) and high-speed picoIIIL IICAs capalIe of 1.5 CHz operalion
(Anonynous,2OO8) fron Acnrcnix Scmiccnuc|cr (n||p.//uuu.acnrcnix.ccm/), are energing as
an aIlernalive lo convenlionaI ASIC soIulions, and forns a lasis for fulure research.

IIOAs, fron MalhSlar, consisls of an array of hard II lIocks such as arilhnelic-and-Iogic
unils (ALUs) and nuIlipIy-accunuIale (MACs) lIocks, vilhin a reconfiguralIe svilching
falric, vhich are ideaI for sysloIic-array reaIizalions due lo lheir noduIarily, reguIarily, and
IocaI inlerconneclivily. These oljecls are pre-falricaled onlo lhe IIOA slruclure, and
neel slringenl lining slandards, enalIing delerninislic design al 1 CHz vhich are
independenl of lhe Iogic leing inpIenenled. On lhe olher hand, Spccs|cr IICAs fron
Acnrcnix Scmiccnuc|cr, enpIoy asynchronous handshaking prolocoIs lelveen
conlinalionaI Iogic lIocks, vhich lhey descrile as lhe nerging of cIock and dala lokens inlo
one signaI, vhich in lurn, according lo Acnrcnix docunenlalion, enalIes fasler operalion
conpared lo convenlionaI synchronous IICA archileclures. The Speedsler faniIy loasls 1.5
CHz, and is polenliaI candidale for reaI-line inpIenenlalion of lhe sysloIic-array
archileclures descriled herein, foIIoving fulure research.

6. ConcIusions

The alove nev sysloIic inpIenenlalion of a 2D IIR frequency-lean fiIler lransfer funclion
has pronising engineering appIicalions for lhe direclionaI enhancenenl of a propagaling
lroadland space-line pIane-vave received on an array of sensors. A parlicuIarIy inporlanl
case is lhe use of an array of lroadland anlennas for lhe direclionaI enhancenenl (lhal is,
leanforning) of uIlra-videland eIeclronagnelic pIane-vaves.

A nassiveIy-paraIIeI sysloIic-array cuslon archileclure, lhal is capalIe of processing one
Iinear frane per cIock cycIe (OIICC) vilh delaiIed design and oplinizalion infornalion,
has leen descriled. The archileclure is lased on lhe recenlIy proposed hylrid-forn 2D
signaI fIov graph, vhich has leen shovn lo le oplinaI in lerns of crilicaI palh deIay (hence
VLS 268


naxinun lhroughpul, lecause al OIICC, lhe cIock rale is equaI lo lhe frane rale in lhese
archileclures) and Iov conpulalionaI conpIexily.

A design exanpIe for lhe proposed sysloIic-array processor archileclure has leen descriled
using a XiIinx Virlex-4 Sx35 IICA device, and lhe MalIal/SinuIink lased IICA design
looI caIIed XiIinx Syslen Ceneralor. The exanpIe IICA inpIenenlalion of lhe 2D IIR
frequency- lean fiIler vas lesled on-chip using lhe hardvare-in-lhe-Ioop verificalion
nelhod caIIed 'hardvare co-sinuIalion, and lhe on-chip 2D unil-inpuIse response vas
mcasurc, vhich in lurn Ied lo mcasurc 2D frequency response resuIls lhal confirn correcl
inpIenenlalion of lhe hardvare.

AIlhough lhe IICA-lased exanpIe is generaIIy loo sIov for nicrovave inaging
appIicalions, il serves as a vaIidalion of lhe proposed OIICC sysloIic-array processor and
can le used in ils currenl forn for sIover appIicalions in audio, uIlra-sound, and Iover
radio frequencies (of up lo approxinaleIy 1OO MHz). IinaIIy, pronising nev VLSI
inpIenenlalion pIalforns are descriled here , vhich nay evenluaIIy enalIe lhe proposed
archileclure lo operale al lhe required nuIli-CHz cIock frequency lo enalIe reaI-line uIlra-
videland digilaI snarl anlenna array appIicalions.

7. References

AgalhokIis, I. and L. T. rulon (1983). "IraclicaI-IO slaliIily of N-dinensionaI discrele
syslens." Iroc. Insl. LIec. Lng. 130, Pt. G(6): 236-242.
Anonynous (2OO7). Arrix IIOA Overviev. AvaiIalIe onIine al hllp://vvv.nalhslar.con.
Anonynous (2OO8). Using High-Ierfornance IICAs for Advanced Radio SignaI Irocessing.
. AvaiIalIe onIine al hllp://vvv.achronix.con.
ArnoId Van Ardenne. The TechnoIogy ChaIIenges for lhe Nexl Ceneralion Radio
TeIescopes. Ierspeclives on Radio Aslronony - TechnoIogies for Large Anlenna
Arrays, NelherIands Ioundalion for Research in Aslronony.
erlschnann, R. K., N. R. arlIey and L. T. rulon A 3-D inlegralor-differenlialor doulIe-
Ioop (IDD) fiIler for rasler-scan video processing. ILLL InlI. Synp. on Circuils and
Syslens, ISCAS'95.
oIIe, M. (1994). A CIosed-forn Design Melhod for 3-D Recursive Cone IiIlers ILLL
InlernalionaI Conference on Acouslics, Speech, and SignaI Irocessing, ICASSI.
rulon, L. T. (2OO3). "Three-dinensionaI cone fiIler lanks." ILLL Trans. on Circuils and
Syslens I: IundanenlaI Theory and AppIicalions 50(2): 2O8-216.
rulon, L. T. and N. R. arlIey (1985). "Three-dinensionaI inage processing using lhe
concepl of nelvork resonance." ILLL Trans. on Circuils and Syslens 32(7): 664-672.
Dansereau, D. (2OO3). 4D Lighl IieId Irocessing and ils AppIicalion lo Conpuler Vision.
LIeclricaI and Conpuler Lngineering. CaIgary, Universily of CaIgary. M5c.
Dansereau, D. and L. T. rulon (2OO7). "A 4-D DuaI-Ian IiIler ank for Deplh IiIlering in
Lighl IieIds." SignaI Irocessing, ILLL Transaclions on 55(2): 542-549.
DevIin, M. (Spring 2OO3) "Hov lo Make Snarl Anlenna Arrays." XceII }ournaI OnIine
Do-Hong, T. and I. Russer (2OO4). SignaI Irocessing for Wideland Snarl Anlenna Array
AppIicalions. ILLL Microvave Magazine. 5.
Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 269

Dudgeon, D. L. and R. M. Mersereau (199O). MuIlidinensionaI DigilaI SignaI Irocessing.
LngIevood CIiffs, N.}. O7632, Irenlice-HaII.
LIIingson, S. W. (1999). A DSI Lngine for a 64-LIenenl Array. Iroceedings on Ierspeclives
for Radio Aslronony-- TechnoIogies for Large Anlenna Arrays, NelherIands.
Irederick, }. D., Y. Wang and T. Iloh (2OO2). "A snarl anlenna receiver array using a aingIe
RI channeI and digilaI leanforning." ILLL Trans. on Microvave Theory and
Techniques 50(12): 3O52-3O58.
Chavani, M., L. . MichaeI and R. Kohno UIlra videland signaIs and syslens in
connunicalion engineering, }ohn WiIey and Sons., Inc.
Cunaralne, T. K. and L. T. rulon "eanforning of roadland-landpass IIane Waves using
IoIyphase 2D IIR TrapezoidaI IiIlers." ILLL Trans. on. Circuils and Syslens:
ReguIar Iapers 55(3): 838-85O.
Hun, S. V., H. L. I. A. Madanayake and L. T. rulon "UW eanforning using 2D ean
DigilaI IiIlers." ILLL Trans. on Anlennas and Iropagalion 57(3): 8O4-8O7.
Hun, S. V., M. Okonievski and R. }. Davies "ModeIing and Design of LIeclronicaIIy TunalIe
RefIeclarrays." ILLL Trans. on Anlennas and Iropagalion 55(8): 22OO-221O.
Huseyin ArsIan, Zhi-Ning Chen and Maria-CalrieIIa Di enedello (2OO6). UIlra-videland
WireIess Connunicalion, WiIey Inlerscience.
}. Roderick, H. Krishnasvany, K.Nevlon, el aI. (2OO6). "SiIicon-lased uIlra-videland lean-
forning." ILLL }ournaI of SoIid-Slale Circuils 41(8): 1726-1739.
}ohnson, D. H. and D. L. Dudgeon (1993). Array SignaI Irocessing-Concepls and
Techniques. LngIevood CIiffs, N.}. O7632, Irenlice-HaII.
Khadeni, L. Reducing lhe conpulalionaI conpIexily of IIR 2D fan and 3D cone fiIlers |MSc
Thesisj. LIeclricaI and Conpuler Lngineering, Universily of CaIgary.
Khadeni, L. and L. T. rulon On lhe Iinilalions of narrov 2D fan fiIlers speech processing.
ILLL 2OO3 Iacific Rin Conference on Connunicalions, Conpulers, and SignaI
Irocessing (IACRIM'O3).
KuenzIe, . and L. T. rulon "3-D IIR fiIlering using decinaled DIT-poIyphase fiIler lank
slruclures." ILLL Trans. on Circuils and Syslens I: ReguIar Iapers 53(2): 394-4O8.
KuenzIe, . and L. T. rulon (2OO5). A noveI Iov-conpIexily spalio-lenporaI uIlra vide-
angIe poIyphase cone fiIler lank appIied lo sul-pixeI nolion discrininalion. ILLL
InlI. Synp. on Circuils and Syslens, ISCAS'O5, Kole, }apan.
Kung, S. Y. (1988a). VLSI Array Irocessors, Irenlice-HaII, LngIevood CIiffs, N.}.
Kung, S. Y. (1988l). VLSI Array Irocessors: Designs and AppIicalions. 1988 ILLL
InlernalionaI Synp. on Circuils and Syslens, ISCAS'88.
Lilerli }r., }. C. and T. S. Rappaporl (1999). Snarl Anlennas for WireIess Connunicalions-
IS-95 and Third Ceneralion CDMA AppIicalions. Upper SaddIe River, N.}. O7632,
Irenlice-HaII.
Lilva, }. and T. K.-Y. Lo (1996). DigilaI eanforning in WireIess Connunicalions, Arlech
House.
Liu, Q. and L. T. rulon (1989). "Design of 3-D pIanar and lean recursive digilaI fiIlers
using speclraI lransfornalion." ILLL Trans. Circuils and Syslens 36(3): 365-374.
Madanayake, A. (2OO4). IICA Archileclures for 2D/3D DigilaI IiIlers. LIeclricaI and
Conpuler Lngineering. CaIgary, Universily of CaIgary. M5c: 2O5.
VLS 270


Madanayake, A. (2OO8). ReaI-line IICA Archileclures for Space-line Irequency-pIanar
MDSI. LIeclricaI and Conpuler Lngineering. CaIgary, Universily of CaIgary. PhD:
371.
Madanayake, A. and L. rulon A Reviev of 2D/3D IIR IIane-vave ReaI-line DigilaI IiIler
Circuils. ILLL Canadian Conference on LIeclricaI and Conpuler Lngineering,
CCLCL'O5, Saskaloon, Saskelchavan, Canada.
Madanayake, A., L. rulon and C. Conis IICA archileclures for reaI-line 2D/3D IIR/IIR
pIane vave fiIlers. ILLL InlI. Synp. on Circuils and Syslens, ISCAS'O4.
Madanayake, A. and L. T. rulon "A Speed-oplinized sysloIic-array processor archileclure
for spalio-lenporaI 2D IIR lroadland lean fiIlers." ILLL Trans. on. Circuils and
Syslens: ReguIar Iapers 55(7): 1953 - 1966.
Madanayake, H. L. I. A. and L. T. rulon "A SysloIic-array Archileclure for Iirsl-Order 3D
IIR Irequency-pIanar IiIlers." ILLL Trans. Circuils and Syslens: ReguIar Iapers
55(6): 1546-1559.
Madanayake, H. L. I. A. and L. T. rulon (2OO7). "Lov-conpIexily dislriluled-paraIIeI-
processor for 2D IIR lroadland lean pIane-vave fiIlers." Canadian }ournaI of
LIeclricaI and Conpuler Lngineering (C}LCL) 32(3): 123-131.
Madanayake, H. L. I. A. and L. T. rulon (2OO8). A ReaI-line SysloIic Array Irocessor
InpIenenlalion of Tvo-dinensionaI IIR IiIlers for Radio-frequency Snarl Anlenna
AppIicalions. ILLL InlI. Synp. on Circuils and Syslens (ISCAS'O8), SeallIe.
Madanayake, H. L. I. A., S. V. Hun and L. T. rulon "A SysloIic Array 2D IIR roadland
RI eanforner." ILLL Trans. on Circuils and Syslens-II: Lxpress riefs 55(12):
1244-1248.
Madanayake, H. L. I. A., S. V. Hun and L. T. rulon UW eanforning Using DigilaI 2D
Irequency-pIanar IiIlers. ILLL 2OO8 Anlenna and Iropagalion Sociely
Synposiun/URSI Synposiun, San Diego.
Iarhi, K. K. "Iinile vord effecls in pipeIined recursive fiIlers." ILLL Trans. on SignaI
Irocessing, ILLL Trans. on Acouslics, Speech, and SignaI Irocessing. 39(6): 145O-
1454.
Iarhi, K. K. "IipeIining in aIgorilhns vilh quanlizer Ioops." ILLL Trans. on Circuils and
Syslens 38(7): 745-754.
Iarhi, K. K. (1991). "Iinile vord effecls in pipeIined recursive fiIlers." ILLL Trans. on SignaI
Irocessing |see aIso ILLL Trans. on Acouslics, Speech, and SignaI Irocessingj 39(6):
145O-1454.
Iarhi, K. K. (1999). VLSI DigilaI SignaI Irocessing Syslens: Design and InpIenenlalion,
}ohn WiIey and Sons.
Iarhi, K. K. and D. Messerschnill Look-ahead conpulalion: Inproving ileralion lound in
Iinear recursions. ILLL InlI. Conf. on Acouslics, Speech and SignaI Irocessing,
ICASSI'87.
Iarhi, K. K. and D. C. Messerschnill IipeIined VLSI Recursive IiIler Archileclures using
Scallered Look-Ahead and Deconposilion. 1988 ILLL InlI. Conf. on Acouslics,
Speech and SignaI Irocessing, ICASSI88, Nev York, N.Y., USA.
Iarhi, K. K. and D. C. Messerschnill (1989). "Concurrenl archileclures for lvo-dinensionaI
recursive digilaI fiIlering." ILLL Trans. on Circuils and Syslens 36(6): 813-829.
Rader, C. M. (1996). VLSI SysloIic Arrays for Adaplive NuIIing. ILLL SignaI Irocessing
Magazine. 13: 29-49.
Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) R Space-time Filters 271

Rananoorlhy, I. A. and L. T. rulon "Design of slalIe lvo-dinensionaI anaIog and digilaI
fiIlers vilh appIicalions in inage processing." Inl. }. Circuil Theory AppI. 7: 229-
245.
Rodenleck, C. T., K. Sang-Cyu, T. Wen-Hua, el aI. (2OO5). "UIlra-videland Iov-cosl phased-
array radars." Microvave Theory and Techniques, ILLL Transaclions on 53(12):
3697-37O3.
Schroeder, H. and H. Iune (2OOO). One- and MuIlidinensionaI SignaI Irocessing-
AIgorilhns and AppIicalions in Inage Irocessing, }ohn WiIey and Sons, Lld.
Shanlhag, N. R. (1991). "An inproved sysloIic archileclure for 2-D digilaI fiIlers." SignaI
Irocessing, ILLL Transaclions on |see aIso Acouslics, Speech, and SignaI
Irocessing, ILLL Transaclions onj 39(5): 1195-12O2.
Sid-Ahned, M. A. "A sysloIic reaIizalion for 2-D digilaI fiIlers." ILLL Trans. on Acouslics,
Speech, and SignaI Irocessing 37(4): 56O-565.
SiIva, R., R. WorreI and A. rovn ReprogrannalIe, DigilaI ean Sleering CIS Receiver
TechnoIogy for Lnhanced Space VehicIe Operalions. Core TechnoIogies for Space
Syslens Conference, CoIorado Springs, CO.
Sladerini, L. M. (2OO2). "UW radars in nedicine." Aerospace and LIeclronic Syslens
Magazine, ILLL 17(1): 13-18.
Sundarajan, V. and K. K. Iarhi Synlhesis of foIded nuIlidinensionaI DSI syslens. ILLL
InlI. Synp. on Circuils and Syslens (ISCAS'98).
Van Ardenne, A. (2OOO). Concepls of lhe Square KiIonelre Array, lovard lhe nev
generalion radio leIescopes. ILLL 2OOO InlI. Synp. on Anlennas and Iropagalion.
Ween, }. I., . M. Noralos and Z. Iopovic (1999). roadland Array Consideralions for SKA.
Iroceedings on Ierspeclives for Radio Aslronony-- TechnoIogies for Large
Anlenna Arrays.
Zajc, M., R. Sernec and }. Tasic (2OOO). Array processors for DSI: inpIenenlalion
consideralions. 1Olh Medilerranean LIeclrolechnicaI Conference, 2OOO, MLLLCON
2OOO.



VLS 272
A VLS Architecture for Output Probability Computations of HMM-based Recognition Systems 273
X

A VLSI Architecture for Output ProbabiIity
Computations of HMM-based
Recognition Systems

Kazuhiro Nakanura, Masaloshi Yananolo,
Kazuyoshi Takagi and Naofuni Takagi
Nagcqa Unitcrsi|q
]apan

1. Introduction

MoliIe enledded syslens vilh naluraI hunan inlerfaces, such as speech recognilion, Iip
reading, and geslure recognilion, are required for lhe reaIizalion of fulure uliquilous
conpuling. Recognilion lasks can le inpIenenled eilher on processors (CIUs and DSIs) or
dedicaled hardvare (ASICs). AIlhough processor-lased approaches offer fIexiliIily, reaI-
line recognilion lasks using slale-of-lhe-arl recognilion aIgorilhns exceed lhe perfornance
IeveI of currenl enledded processors, and require nodern high-perfornance processors
lhal consune far nore pover lhan dedicaled hardvare. Dedicaled hardvare, vhich is
oplinized for Iov-pover, reaI-line recognilion lasks, is nore suilalIe for inpIenenling
naluraI hunan inlerfaces in Iov pover noliIe enledded syslens. VLSI archileclures
oplinized for recognilion lasks vilh Iov pover dissipalion have leen deveIoped.
Yoshizava el aI. invesligaled a lIock-vise paraIIeI processing nelhod for oulpul prolaliIily
conpulalions of conlinuous hidden Markof nodeIs (HMMs), and proposed a Iov pover,
high-speed VLSI archileclure. Oulpul prolaliIily conpulalions are lhe nosl line-
consuning parl of HMM-lased recognilion syslens. Malhev el aI. deveIoped Iov-pover
acceIeralors for lhe SIHINX 3 speech recognilion syslen, and aIso deveIoped perceplion
acceIeralors for enledded syslens. In lhis chapler, ve presenl a fasl and nenory efficienl
VLSI archileclure for oulpul prolaliIily conpulalions of conlinuous HMMs using a nev
lIock-vise paraIIeI processing nelhod. We shov lIock-vise frane paraIIeI processing
(III) for oulpul prolaliIily conpulalions and presenl an appropriale VLSI archileclure
for ils inpIenenlalion. Conpared vilh a convenlionaI lIock-vise slale paraIIeI processing
(SII) archileclure, vhen lhere are a sufficienl nunler of HMM slales for accurale
recognilion, lhe III archileclure requires fever regislers and processing eIenenls (P|s),
and Iess processing line. The P|s used in lhe III archileclure are idenlicaI lo lhose used
in lhe SII archileclure. Iron a VLSI archilecluraI vievpoinl, a conparison shovs lhe
efficiency of lhe III archileclure lhrough efficienl use of regislers for sloring inpul fealure
veclors and inlernediale resuIls during conpulalion. The renainder of lhis chapler is
organized as foIIovs: lhe slruclure of HMM lased recognilion syslens is descriled in
13
VLS 274

















Iig. 1. asic slruclure of HMM-lased recognilion hardvare

Seclion 2, III and III-lased VLSI archileclure are inlroduced in Seclion 3, lhe
evaIualion of lhe III archileclure is descriled in Seclion 4, and concIusions are presenled
in Seclion 5.

2. HMM-based Recognition Systems

2.1 HMM-based Recognition Hardware
Due lo lheir effecliveness and efficiency for user-independenl recognilion, HMMs are
videIy used in appIicalions such as speech recognilion, Iip-reading, and geslure recognilion.
Iigure 1 shovs lhe lasic slruclure of HMM-lased recognilion hardvare (Yoshizava el aI.,
2OO6, Yoshizava el aI., 2OO4, Yoshizava el aI., 2OO2, Malhev el aI., 2OO3a). The oulpul
prolaliIily conpulalion circuil and Vilerli scorer vork logelher as a recognilion engine.
The inpuls lo lhe oulpul prolaliIily conpulalion circuil are fealure veclors of severaI
dinensions and nodeI paranelers of HMMs. These vaIues are slored in RAM and ROM
respecliveIy. The RAM, ROM and oulpul prolaliIily conpulalion circuil inlerconnecl via a
singIe lus, and nenory accesses are excIusive. The oulpul prolaliIily conpulalion circuil
oulpuls lhe resuIls of lhe oulpul prolaliIily conpulalion of HMMs. The Vilerli scorer
oulpuls IikeIihood score using lhe Vilerli aIgorilhn. In HMM-lased recognilion syslens,
lhe nosl line-consuning lask is oulpul prolaliIily conpulalions, and lhe oulpul
prolaliIily conpulalion circuil acceIerales lhese conpulalions. The oulpul prolaliIily
conpulalion circuil has severaI regisler arrays and processing eIenenls (P|s) for efficienl
high-speed paraIIeI processing.

2.2 Output ProbabiIity Computation of HMMs
Rel O
1
, O
2
, ..., O
T
le a sequence of P-dinensionaI inpul fealure veclors (inpul franes) lo
HMMs, vhere O
|
= (c
|1
, c
|2
, ..., c
|P
), 1 s | s T. T is lhe nunler of inpul fealure veclors, and P is
lhe dinension of lhe inpul fealure veclor. Ior an inpul frane O
|
, lhe oulpul prolaliIily of
N-slale Iefl-lo-righl conlinuous HMM al lhe j-lh slale is given ly
...
Oulpul prolaliIily conpulalion circuil
Oulpul prolaliIilies of HMMs
Regisler arrays
HMM paranelers
(slored in ROM)
Iealure veclors
(slored in RAM)
Vilerli scorer /search aIgorilhn
LikeIihood scores
us
vords
elc.
voice
elc.
excIusive access
P|
P|s
A VLS Architecture for Output Probability Computations of HMM-based Recognition Systems 275





















Iig. 2. IIovcharl of oulpul prolaliIily conpulalion

( ) ( ) , 1 , 1 , log
1
2
T t N f o b
P
p
fp tp fp f t f
s s s s + =
_
=
o e O

(1)
vhere e
j
, o
jp
and
jp
are lhe faclors of lhe Caussian prolaliIily densily funclion (Yoshizava
el aI., 2OO6).
The oulpul prolaliIily conpulalion circuil conpules Iog|
j
(O
|
) lased on Lq. (1), vhere aII
HMM paranelers e
j
, o
jp
and
jp
are slored in ROM, and lhe inpul franes are slored in RAM.
The vaIues of T, N, P and lhe nunler of HMMs V differ for each recognilion syslen. Ior a
recenl isoIaled vord recognilion syslen (Yoshizava el aI., 2OO6, Yoshizava el aI., 2OO4), T, N,
P and V are 86, 32, 38, and 8OO, respecliveIy, and for anolher vord recognilion syslen
(Yoshizava el aI., 2OO2), T, N, P and V are 89, 12, 16 and 1OO respecliveIy. Ior a conlinuous
speech recognilion syslen (Malhev el aI., 2OO3a), T, N, P and V are approxinaleIy 2O, 1O, 4O,
and 5O, respecliveIy. Differenl appIicalions require differenl oulpul prolaliIily conpulalion
circuil archileclures. A fIovcharl of oulpul prolaliIily conpulalions for V HMMs is shovn
in Iig. 2. Oulpul prolaliIilies are ollained ly T N P V lines lhe parliaI conpulalion of
Iog|
j
(O
|
). IarliaI conpulalion of Iog|
j
(O
|
) perforns four arilhnelic operalions, a
sullraclion (a = c
|p

jp
), an addilion (acc = acc + |, vhere lhe iniliaI vaIue of acc is e
j
), and
lvo nuIlipIicalions (| = a a o
jp
) for Lq. (1), and conpules Iog|
j
(O
|
).

3. Fast and memory efficient VLSI architecture

3.1 BIock-wise frame paraIIeI processing
Iock paraIIeI processing (II) for oulpul prolaliIily conpulalions vas proposed as an
efficienl paraIIeI processing nelhod for vord HMM-lased speech recognilion ly Yoshizava

t = O
yes
t = t + 1
| = O
| = | + 1
j = O
j = j + 1
p = O
p = p + 1
IarliaI conpulalion of Iog|
j
(O
|
)
P s p
N s j
T s |
V s t
yes
yes
yes
no
no
no
no
|ccp A
|ccp 8
|ccp C
|ccp D
VLS 276



















Iig. 3. IIovcharl of oulpul prolaliIily conpulalion using SII

el aI. (Yoshizava el aI., 2OO6, Yoshizava el aI., 2OO4, Yoshizava el aI., 2OO2). In lhis nelhod,
lhe sel of inpul franes is caIIed a ||cc|, and HMM paranelers are effecliveIy shared lelveen
differenl inpul franes in lhe conpulalion. N-paraIIeI conpulalion is perforned ly lheir II.
In lhis chapler, ve cIassify lvo lypes of II according lo dala fIov of oulpul prolaliIily
conpulalions: lIock-vise frane paraIIeI processing (III) and lIock-vise slale paraIIeI
processing (SII). A lIock can le seen as a sel of M ( T) inpul franes, vhose eIenenls are
O
|
s, 1 | M. M franes in T inpul franes are processed in lIock. 8|PP perforns
arilhnelic operalions lo IocaIIy slored inpul franes, vhich are O
1
, O
2
, ..., O
M
, and oulpul
prolaliIily conpulalions for nuIlipIe franes are carried oul sinuIlaneousIy. On lhe olher
hand, a lIock can aIso le seen as a M P nalrix vhose eIenenls are c
|p
, 1 | M, 1 p P.
8SPP perforns arilhnelic operalions lo an inpul sequence, vhich is c
11
, ..., c
1P
, c
21
, ..., c
2P
, ...,
c
M1
, ..., c
MP
, and oulpul prolaliIily conpulalions for nuIlipIe slales are carried oul
sinuIlaneousIy.
The II proposed ly Yoshizava el aI. (Yoshizava el aI., 2OO6, Yoshizava el aI., 2OO4,
Yoshizava el aI., 2OO2) is cIassified as a SII. In lhis chapler, ve presenl III for oulpul
prolaliIily conpulalions. M/2-paraIIeI conpulalions are perforned ly our III.
A fIovcharl of lhe oulpul prolaliIily conpulalions vilh lhe convenlionaI SII (Yoshizava
el aI., 2OO6, Yoshizava el aI., 2OO4, Yoshizava el aI., 2OO2) is shovn in Iig. 3.
P|
i
represenls lhe i-lh processing eIenenl, vhich conpules Iog|
i
(O
|
) ly a sullraclion, an
addilion, and lvo nuIlipIicalions for Lq. (1). |ccp 8 (Iig. 2) is expanded as shovn in Iig. 3,
and Iog|
1
(O
|
), Iog|
2
(O
|
), ..., and Iog|
N
(O
|
) are conpuled sinuIlaneousIy vilh N P|s, vhere
c
|p
is fed lo lhe N P|s in |ccp A. In addilion lo lhe N-slale paraIIeI conpulalion, lhe sane
HMM paranelers
jp
s and o
jp
s, and e
j
s, 1 j N, 1 p P, are used repealedIy during
|ccp C in Iig. 3.
A fIovcharl of lhe oulpul prolaliIily conpulalion vilh III is shovn in Iig. 4. The P|s in

t = O
t = t + 1
| = O
| = | + 1
p = O
p = p + 1
c
|p
P s p
T s |
V s t
yes
yes
yes
no
no
no
|ccp A
|ccp C
|ccp D
Iog|
1
(O
|
) Iog|
2
(O
|
)
N-paraIIeI conpulalion vilh N P|s
Iog|
N
(O
|
)
. . . . . . . . . . . . . . . . . . . . .
P|
2
P|
N
P|
1
A VLS Architecture for Output Probability Computations of HMM-based Recognition Systems 277


































Iig. 4. IIovcharl of oulpul prolaliIily conpulalion using III

Iigs. 4 and 3 are idenlicaI, lul in a differenl nunler. |ccp C in Iig. 2 is parliaIIy expanded in
Iig. 4, and Iog|
j
(O
|+1
), Iog|
j
(O
|+2
), ..., and Iog|
j
(O
|+M/2
) are conpuled sinuIlaneousIy vilh
M/2 P|s in |ccp C1, vhere
jp
and o
jp
are fed lo lhe M/2 P|s in |ccp A. In addilion lo lhe
M/2-frane paraIIeI conpulalions, Iog|
j
(O
|+M/2+1
), Iog|
j
(O
|+M/2+2
), ..., and Iog|
j
(O
|+M
) are
aIso conpuled vilh lhe sane M/2 P|s. In lhis doulIe M/2-paraIIeI conpulalion, lhe sane
HMM paranelers
jp
and o
jp
are used lvice, lecause lhe paranelers are independenl of |. In
addilion lo lhe M/2-paraIIeI conpulalions, |ccp D (Iig. 2) is divided inlo |ccps D1 and D2
(Iig. 4). The sane inpul franes O
|+1
, O
|+2
, ..., and O
|+M
are used repealedIy during |ccp D1,
lecause lhe inpul franes are independenl of t.
t = t + 1
j = O, t = t
j = j + 1
p = O
p = p + 1

jp
, o
jp
P s p
T s |
max
V s t
max
yes
yes
yes
no
no
no
|ccp A
|ccp 8
|ccp D1
t
max
s t
yes
no
N s j
yes
no
t
max
= O
|ccp D2
|
max
= O
| = |
max
, |
max
= |
max
+ M, t = t
max
|
|ccp C1
t
max
= t
max
+ |
Iog|
j
(O
|+1
) Iog|
j
(O
|+2
)
doulIe M/2-paraIIeI conpulalion
vilh M/2 P|s
. . . . . . . . . . . . . . . . . .
P|
2
P|
M/2
P|
1
Iog|
j
(O
|+M/2+1
)
. . . . . . . . . . . . . . . .
Iog|
j
(O
|+M
) Iog|
j
(O
|+M/2+2
)
Iog|
j
(O
|+M/2
)
VLS 278

3.2 A VLSI architecture for output probabiIity computation
Our III VLSI archileclure for oulpul prolaliIily conpulalions is shovn in Iig. 5. The
archileclure consisls of five regisler arrays and M/2 P|s. RcgO slores M inpul franes O
|+1
,
O
|+2
, ..., O
|+M
. Rcg and Rcgo slore HMM paranelers
jp
, and o
jp
, respecliveIy. Rcge slores




























Iig. 5. III VLSI archileclure

HMM paraneler e
j
and inlernediale resuIls. Rcgo slores conpuled oulpul prolaliIilies for
a Vilerli scorer. Lach P|
i
consisls of lvo adders and lvo nuIlipIiers, vhich are used for
conpuling
( )
_
=
+
P
p
fp p t fp f
o
1
2
'
o e
.
Iigure 6 shovs lhe fIovcharl of oulpul prolaliIily conpulalions using lhe III
archileclure. The conpulalion slarls ly reading M inpul franes fron RAM and sloring
lhen lo RcgO in |ccp C1, vhich are O
|+1
, O
|+2
, ..., O
|+M/2
, O
|+M/2+1
, O
| +M/2+2
, ..., O
|+M
. The
HMM paranelers of t-lh HMM are read fron ROM and slored in Rcg, Rcgo and Rcge,
vhich are
11
, o
11
,

and e
1
. The vaIue of aII regislers in Rcge is sel lo e
1
. Ior lhe firsl haIf of
lhe slored inpul franes O
|+1
, O
|+2
, ..., and O
|+M/2
, M/2 inlernediale resuIls are
sinuIlaneousIy conpuled vilh lhe slored
11
, o
11
, and e
1
ly M/2 P|s, vhere lhe HMM
paranelers are shared ly aII P|s. Al lhe sane line, an HMM paraneler
j p+1
of t-lh HMM
c
|p
Ou|pu| prc|a|i|i|q ccmpu|a|icn circui|
RcgO
Rcg Rcgo
ROM
(, o, e)
RAM
(O)
Rcgo
P
Rcge
+ +
P|
1
+ +
P|
2
+ +
P|
M/21
+ +
P|
M/2
. .
O
|+1
O
|+2
O
|+M/21
O
|+M/2
O
|+M/2+1
O
|+M/2+2
O
|+M1
O
|+M
.
.
.
.
.
.
.
.
.
.
.
.
.
.
M/2
M/2
M/2-paraIIeI conpulalion
M/2 M/2
M/2 M/2
M
A VLS Architecture for Output Probability Computations of HMM-based Recognition Systems 279

is read fron ROM and slored in Rcg. Then, for lhe olher haIf of lhe slored inpul franes
O
|+M/2+1
, O
|+M/2+2
, ..., and O
|+M
, M/2 inlernediale resuIls are sinuIlaneousIy conpuled vilh
lhe sane
11
, o
11
, and e
1
ly M/2 P|s. Al lhe sane line, an HMM paraneler o
j p+1
of t-lh
HMM is read fron ROM and slored in Rcgo. In lhis doulIe M/2-paraIIeI conpulalion, lhe





































Iig. 6. IIovcharl of conpulalions using lhe III archileclure

sane HMM paranelers
11
, o
11
, and e
1
are used lvice. In lhe nexl doulIe M/2-paraIIeI
conpulalion, lhe slored HMM paranelers
j p+1
and o
j p+1
are used lvice. M oulpul
prolaliIilies Iog|
j
(O
|+1
), Iog|
j
(O
|+2
), ..., and Iog|
j
(O
|+M
) of t-lh HMM are ollained ly |ccp
t = t + 1
j = O, t = t
j = j + 1
p = O
p = p + 1

jp
, o
jp
P s p
T s |
max
V s t
max
yes
yes
yes
no
no
no
|ccp A
|ccp 8
|ccp D1
t
max
s t
yes
no
N s j
yes
no
t
max
= O
|ccp D2
|
max
= O
| = |
max
, |
max
= |
max
+ M, t = t
max
|
|ccp C1
t
max
= t
max
+ |
Iog|
j
(O
|+1
)
doulIe M/2-paraIIeI conpulalion
vilh M/2 P|s
. . . . . . . . . . . . . . . . . .
P|
2
P|
M/2
P|
1
Iog|
j
(O
|+M/2
)
Iog|
j
(O
|+M/2+1
)
. . . . . . . . . . . . . . . . Iog|
j
(O
|+M
)
Load O
|
lo RcgO (|=|+1, |=|+2, ., |=|+M, MP cycIes)
Load
11
, o
11
lo Rcg, Rego, respecliveIy (2 cycIes)
t = t+1, j = 1, p = 1
Load e
j
lo Rcge (1 cycIe)
Copy Rcge lo Rcgo
Load
j p+1
lo Rcg
Load o
j p+1
lo Rcgo
Iog|
j
(O
|+2
)
Iog|
j
(O
|+M/2+2
)
VLS 280

A. The ollained resuIls are lransfered fron Rcge lo Rcgo for slarling lhe nexl oulpul
prolaliIily conpulalion, Iog|
j+1
(O
|+1
), Iog|
j+1
(O
|+2
), ..., Iog|
j+1
(O
|+M
) of t-lh HMM. The
slored resuIls are fed lo lhe Vilerli scorer. The MN oulpul prolaliIilies of t-lh HMM are
ollained ly |ccp 8. MN| oulpul prolaliIilies of HMM t + 1, t + 2, ..., t + L are ollained
ly |ccp D1 vilh lhe sane M inpul franes O
|+1
, O
|+2
, ..., and franes O
|+1
, O
|+2
, ..., and O
|+M.





























Iig. 7. SII VLSI archileclure

franes O
|+1
, O
|+2
, ..., and O
|+M.
The MN|(T/M) oulpul prolaliIilies of HMM t + 1, t +
2, ..., t + | are ollained ly |ccp C1, and finaIIy lhe MN|(T/M)(V/|) oulpul prolaliIilies
of aII HMMs are ollained ly |ccp D2.

4. EvaIuation

We conpared lhe proposed III vilh SII (Iig. 7) VLSI archileclure (Yoshizava el aI.,
2OO6, Yoshizava el aI., 2OO4, Yoshizava el aI., 2OO2). The SII archileclure consisls of lhree
regisler arrays and N P|s. Rcg and Rcgo slore HMM paranelers
jp
and o
jp
, respecliveIy,
and Rcge slores HMM paraneler e
j
and inlernediale resuIls. The P|s in Iigs. 7 and 5 are
idenlicaI.
c
|p
Ou|pu| prc|a|i|i|q ccmpu|a|icn circui|
Rcg Rcgo
ROM
(, o, e)
RAM
(O)
Rcge
+ +
P|
1
+ +
P|
2
+ +
P|
N1
+ +
P|
N
. .
.
N-paraIIeI conpulalion
N
N
.
.
.
.
.
P
N
.
.
.
.
.
P
A VLS Architecture for Output Probability Computations of HMM-based Recognition Systems 281

Iigure 8 shovs lhe fIovcharl of lhe conpulalions of SII archileclure. The conpulalion
slarls ly reading aII 2NP + N HMM paranelers of t-lh HMM fron ROM and sloring lhen
lo Rcg, Rcgo, and Rcge in |ccp D. Ior inpul c
|p
, lhe inlernediale resuIls are conpuled vilh
slored HMM paranelers ly N P|s. N oulpul prolaliIilies Iog|
1
(O
|
), Iog|
2
(O
|
), ..., Iog|
N
(O
|
)
of lhe HMM are ollained ly |ccp A. The ollained resuIls are fed lo a Vilerli scorer. NT

























Iig. 8. IIovcharl of conpulalions using lhe SII archileclure

Regisler size (lil)
III (ours) PMxo + 2x

+ x
o
+ 2Mx
f

SII NPx

+ NPx
o
+ Nx
f

TalIe 1. Regisler size

Irocessing line (cycIes)
III (ours) V/|(|PM + (1 + 2P)|NT/M(
SII V(2NP + N + PT)
TalIe 2. Irocessing line

oulpul prolaliIilies of t-lh HMM are ollained ly |ccp C vilh lhe sane HMM paranelers.
The NTV oulpul prolaliIilies of aII HMMs are ollained ly |ccp D.
TalIe 1 shovs lhe regisler size of lhe SII and III archileclures, vhere x

, x
o
, x
o
, and x
f

represenl lhe lil Ienglh of
jp
, o
jp
, o
|p
, and lhe oulpul of P|, respecliveIy. N, P, and M are lhe
t = O
t = t + 1
| = O
| = | + 1
p = O
p = p + 1
P s p
T s |
V s t
yes
yes
yes
no
no
no
|ccp A
|ccp C
|ccp D
Iog|
1
(O
|
) Iog|
2
(O
|
)
N-paraIIeI conpulalion vilh N P|s
Iog|
N
(O
|
)
. . . . . . . . . . . . . . . . . . . . .
P|
2
P|
N
P|
1
Load c
|p
Load
jp
and o
jp
of HMM t lo Rcg and Rcgo
(j = 1, 2, ., N, p = 1, 2, ., P, 2NP cycIes)
Load e
j
lo Rcge (j = 1, 2, ., N, N cycIes)
VLS 282

nunler of HMM slales, lhe dinension of inpul fealure veclor (frane), and lhe nunler of
inpul franes in a lIock, respecliveIy.
TalIe 2 shovs lhe processing line for conpuling oulpul prolaliIilies of V HMMs vilh lhe
III and SII archileclures, vhere T and | are lhe nunler of inpul franes and lhe
nunler of HMMs vhose oulpul prolaliIilies are conpuled vilh lhe sane inpul franes
during |ccp D1 of Iig. 6, respecliveIy.

Regisler size (lil) Irocessing line (cycIes) #P|s
III (ours) 15,512 4,477,44O 22
SII 2O,224 4,585,6OO 32
TalIe 3. LvaIualion of lhe SII and III perfornance






















Iig. 9. LvaIualion of lhe SII and III perfornance, and lhe vaIue of M of lhe III (N =
32, P = 38, T = 86, V = 8OO)

TalIe 3 shovs lhe regisler size, lhe processing line, and lhe nunler of P|s for conpuling
oulpul prolaliIilies of 8OO HMMs, vhere ve assune lhal N = 32, P = 38, T = 86, x

= 8, x
o
= 8,
x
o
= 8, x
f
= 24, and V = 8OO, lhe sane vaIues used in a recenl circuil design for isoIaled vord
recognilion (Yoshizava el aI., 2OO6, Yoshizava el aI., 2OO4). We aIso assune lhal M = 44 and
| = 5 for lhe III archileclure. The P|s used in lhe SII and III archileclures are
idenlicaI. Conpared vilh lhe SII archileclure, lhe III archileclure has fever regislers,
requires Iess processing line, and has fever P|s. Iron lhe VLSI archileclure vievpoinl, lhis
is lecause lhe regisler size of lhe III archileclure is independenl of N, and ils P|s can
repealedIy use lhe sane inpul franes. The III archileclure has fever vail cycIes for
#P|s=32 (SII)
#P|s (III)
6
7
8
9
15
22
#P|s=43
#P|s=32
44
A VLS Architecture for Output Probability Computations of HMM-based Recognition Systems 283

reading dala fron ROM lefore paraIIeI conpulalions, 586,24O (V/|((PM + |N)T/M(),
lhan lhe SII archileclure, vhich has 1,971,2OO (V(2NP + N)).
Iig. 9 shovs lhe processing line and lhe nunler of P|s of lhe III and SII archileclures,
and lhe vaIue of M of lhe III archileclure. The processing line and lhe nunler of P|s of
lhe III archileclure are Iess lhan lhose of lhe SII archileclure vhen M = 44 (Iig. 9).
Iron a Iogic design vievpoinl, lhe regisler arrays of lhe SII and III archileclures are
designed vilh IIip-IIops or on-chip nuIli-porl nenories of differenl sizes. Dala palhs are
designed vilh idenlicaI P|s, lul in a differenl nunler. The conlroI palhs of lhese
archileclures are designed, as shovn in lhe fIovcharls Iigs. 8 and 6. The dala palh deIay is
lhe sane for lolh lhe SII and III designs, equaI lo lhe deIay line of one P|. The deIay
lines of conlroI palhs differ lelveen lhe lvo, lul lhe conlroI palh deIay is snaII conpared
vilh lhe dala palh deIay.

5. ConcIusions

We presenled III for oulpul prolaliIily conpulalions and presenled an appropriale VLSI
archileclure for ils inpIenenlalion. III perforns arilhnelic operalions lo IocaIIy slored
inpul franes, and oulpul prolaliIily conpulalions for nuIlipIe franes are carried oul
sinuIlaneousIy. Conpared vilh lhe convenlionaI SII archileclure, vhen lhe nunler of
HMM slales is Iarge enough for accurale recognilion, lhe III archileclure requires fever
regislers and P|s, and Iess processing line. In lerns of lhe VLSI archileclure, a fasl and
nenory efficienl VLSI archileclure for oulpul prolaliIily conpulalions of HMM-lased
recognilion syslens has leen presenled. A Iogic design, a Vilerli scorer for lhe III
archileclure, and a reconfiguralIe archileclure for lolh lhe SII and III archileclures are
our fulure vorks.

6. References

. Malhev, A. Davis & Z. Iang (2OO3a). Ierceplion Coprocessors for Lnledded Syslens,
Prcc. cf Icr|sncp cn |m|cc Sqs|cms fcr Rca|-Timc Mu||imcia (|ST|Mcia), pp. 1O9-
116, 2OO3.
. Malhev, A. Davis & Z. Iang (2OO3l). A Lov-Iover AcceIeralor for lhe SIHINX 3 Speech
Recognilion Syslen, Prcc. cf |n|'| Ccnf. cn Ccmpi|crs, Arcni|cc|urc an Sqn|ncsis fcr
|m|cc Sqs|cms, pp. 21O-219, 2OO3.
S. Yoshizava, Y. Miyanaga & N. Yoshida (2OO2). On a High-Speed HMM VLSI ModuIe vilh
Iock IaraIIeI Irocessing, |||C| Trans. |unamcn|a|s (]apancsc |i|icn), VoI. }85-A,
No. 12, pp. 144O-145O, 2OO2.
S. Yoshizava, N. Wada, N. Hayasaka & Y. Miyanaga (2OO4). ScaIalIe Archileclure for Word
HMM-ased Speech Recognilion, Prcc. cf 2004 |||| |n|'| Sqmpcsium cn Circui|s an
Sqs|cms (|SCAS'04), pp. 417-42O, 2OO4.
S. Yoshizava, N. Wada, N. Hayasaka & Y. Miyanaga (2OO6). ScaIalIe Archileclure for Word
HMM-ased Speech Recognilion and VLSI InpIenenlalion in ConpIele Syslen,
|||| TRANSACT|ONS ON C|RCU|TS AND SYST|MS, VoI. 53, No. 1, pp. 7O-77,
2OO6.
VLS 284

X. Huang, I. AIIeva, H. W. Hon, M. Y. Hvang, K. f. Lee & R. RosenfeId (1992). The SIHINX-
II speech recognilion syslen: an overviev, Ccmpu|cr Spcccn an |anguagc, VoI. 7(2),
pp. 137-148, 1992.

Efcient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 285
X

Efficient BuiIt-in SeIf-Test for Video
Coding Cores: A Case Study on
Motion Estimation Computing Array

Chun-Lung Hsu, Yu-Sheng Huang and Chen-Kai Chen
Dcpar|mcn| cf ||cc|rica| |nginccring, Na|icna| Dcng Hua Unitcrsi|q
Taiuan, R.O.C

1. Introduction

In nore recenl years, nuIlinedia lechnoIogy appIicalions have leen leconing nore fIexilIe
and poverfuI vilh lhe deveIopnenl of seniconduclors, digilaI signaI processing (DSI), and
connunicalion lechnoIogy. The Ialesl video slandard, H.264/AVC/MILC-4 Iarl 1O
(Advance Video Coding) (Wiegand, 2OO3), is regarded as lhe nexl generalion video
conpression slandard (VCS). Ior video conpression slandards, lhe nolion eslinalion
conpuling array (MLCA) is lhe nosl conpulalionaIIy denanding conponenl in a video
encoder/decoder (Kuhn, 1999, Konarek and Iirsch 1989). Il is knovn lhal aloul 6O-9O of
lhe lolaI of conpulalion line is consuned in nolion eslinalion. AddilionaIIy, lhe nolion
eslinalion aIgorilhns used aIso profoundIy infIuences lhe visuaI quaIily of reconslrucled
inages. More accurale prediclions increase lhe conpression ralio and inprove peak signaI-
lo-noise ralio (ISNR) al a given lil-rale. Since lhe nolion eslinalion aIgorilhn is nol
specified in lhe video coding slandards lhal nany aIgorilhns are appIied lo vilh differenl
hardvare pIalforn, syslen frequencies, operaling voIlage and pover dissipalion.
On lhe olher hand, due lo lhe rapid advance in seniconduclor falricalion lechnoIogy, a
Iarge nunler of lransislors can le inlegraled on a singIe chip. Hovever, inlegraling Iarge
nunler of processor on a singIe chip resuIls in lhe increase in lhe Iogic-per-pin ralio, vhich
draslicaIIy reduces lhe conlroIIaliIily and olservaliIily of lhe Iogic on lhe chip.
ConsequenlIy, lesling such highIy conpIex and dense circuils lecone very difficuIl and
expensive (Lu el aI., 2OO5).
Ior a connerciaI chip, lhe VCS nusl inlroduce design-for-leslaliIily (DIT), especiaIIy in an
MLCA. The oljeclive of DIT is lo increase lhe ease vilh vhich a device can le lesled lo
guaranlee high syslen reIialiIily. Many DIT approaches have leen deveIoped such as a
ncc, s|ruc|urc, and |ui||-in sc|f-|cs| (IST) (Wu el aI., 2OO7, NagIe el aI., 1989, McCIuskey,
1985, Kung el aI., 1995, Toula and McCIuskey, 1997). Anong lhese DIT approaches, IST
has an olvious advanlage in lhal reduces lhe need for lesling of expensive lesl equipnenl,
since lhe circuil/chip and ils lesler are inpIenenled in lhe sane circuil/chip. In a vord, lhe
IST can generale lesl sinuIalions and anaIyze lesl responses vilhoul oulside supporl,
naking lesls and diagnoses of digilaI syslens quick and effeclive.

14
VLS 286

Thus, lhis chapler proposed a nininaI perfornance penaIly IST design vilh significanlIy
snaIIer area overhead. In nornaI node, lhe IST circuil does nol le perforned and do nol
deIiver lhe lesl pallern lo AD-ILs for lesling. Thus, each AD-IL in MLCA perforns ils
nornaI operalion lo delernine lhe SAD vaIues. In lesling node, lhe IST circuils, conprise
lesl pallern generalor (TIC) and oulpul response anaIyzer (ORA), are perforned lo lesling
ilseIf. In lerns of resuIls, afler enpIoying lhe presenls IST design, lhe circuil guaranlee
1OO fauIl coverage vilh Iov lesl appIicalion line al Iov area overhead. Moreover, lhe
experinenlaI resuIls prove lhe effecliveness and vaIue of lhis vork.

2. Motion estimation computing array

The nolion eslinalion process in any VCS conprises a search slralegy lo delernine lhe
nolion veclors (MVs). The nolion veclor is lo descrile lhe lransfornalion fron adjacenl
franes in lhe video sequences. On lhe olher hand, lhe nolion eslinalion aIso conprises a
process lo delernine a nalching cosl nelric conpulalion such as lhe sun of alsoIule
differences (SAD). In a vord, lhe search slralegy in nolion eslinalion ains al seIecling a sel
of candidales and seIecls lhe one vilh lhe nininun cosl nelric. Afler seIecling nininizes
lhe cosl nelric, il encodes lhe prediclion residuaI lelveen lhe originaI and nolion
conpensaled lIocks. Lach residuaI lIock is lransforned, quanlized, and enlropy coded.
AddilionaIIy, lhe nain oljeclive of nolion eslinalion aIgorilhns is lo expIoil lhe lenporaI
redundancy of a segnenled video sequences. Ior exanpIe, lhe fuII-search lIock-nalching
aIgorilhn has leen deveIoped lhal inlroduced lhe lesl resuIls in lerns of finding nolion
veclors. Such aIgorilhns are inpIenenled in lvo slages, naneIy lhe caIcuIalion of lhe sun
of alsoIule differences (SAD) for each dispIacenenl veclor, foIIoved ly nelhod for finding
lhe snaIIesl SAD vaIues. This is sunnarized ly Lq. (1) and (2).

N - 1 N - 1
0 0
SAD ( , ) ( , ) - ( , )
= =
+ +
__
k l
i f C k l R i k f l
(1)

MIN
SAD min(SAD( , )) = i f
(2)

vhere ( , ) C k l and ( , ) R i k f l + + represenl lhe currenl frane and search region's nacrolIock,
respecliveIy (Iirsch el aI., 1995). In lhe olher hand, lIock-nalching in MLCA is perforned
ly a sequenliaI expIoralion of lhe search region vhen lhe conpulalions are perforned in
paraIIeI (see Iig. 1). In Iig.1 , lhe MLCA is lhe paraIIeI archileclure for conpuling lhe SAD
vaIue and ils corresponding AD-IL slruclure are shovn in Iig.1 (a) and (l), respecliveIy.
The purpose of AD-ILs is lo slore lhe vaIue of ( , ) C k l and ( , ) R i k f l + + and receive lhe vaIue
of corresponding lo lhe currenl posilion of lhe reference in search region. In olher vord, lhe
AD-ILs perforn lhe processing of lhe sullraclion and lhe alsoIule vaIue conpulalion.
Afler lhal, AD-IL adds lhe resuIls vilh lhe parliaI resuIl coning fron lhe upper AD-IL (see
Iig. 1 (l)). The parliaI resuIls are added on coIunns and a Iinear array of adders perforns
lhe horizonlaI sunnalion of lhe rov suns, and conpule ( , ) SAD i f . Ior each posilion ( , ) i f of
lhe reference lIock, lhe M-IL checks if lhe nalching cosl nelric conpulalion, ( , ) SAD i f , is
snaIIer lhe previous snaIIer SAD vaIue lhal updales lhe snaIIer SAD vaIue lo lhe regisler
vhich slores lhe previous snaIIer SAD vaIue.
Efcient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 287

AD
AD
Adder
Adder
C R
C R
( , ) SAD i j
Iron upper
AD-IL
AD
AD
Adder
Adder
C R
C R
( , ) SAD i j
Iron upper
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
M-IL
M-IL
SAD
MIN SAD(i,j)
Adder
Adder
Adder
Adder
Adder
Adder
Adder
Adder
16
.
.
.
.
.
. . .
.
16
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
M-IL
M-IL
SAD
MIN SAD(i,j)
Adder
Adder
Adder
Adder
Adder
Adder
Adder
Adder
16
.
.
.
.
.
. . .
.
16






















(a)















(l)
Iig. 1. The generic archileclure of nolion eslinalion and slruclure of AD-IL

3. The BIST design and its test strategy

Iigure 2 shovs lhe proposed IST design, vhich consisls of lesl pallern generalor and
oulpul response anaIyzer. The nain oljeclive of lhe proposed IST design is lo effecliveIy
lesl lhe MLCA ly ilseIf. Moreover, using lhe IST design can draslicaIIy reduce lhe cosl of
VLS 288

AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
+
+
+
+
+
+
+
+
C
i
r
c
u
i
l

U
n
d
e
r

T
e
s
l

C
U
T

ORA
ORA
TIC
TIC Irane dala
Syslen CIock
Resel
Tesl_node
SAD VaIue
AD-IL SeIecl
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
AD-IL
+
+
+
+
+
+
+
+
C
i
r
c
u
i
l

U
n
d
e
r

T
e
s
l

C
U
T

ORA
ORA
TIC
TIC Irane dala
Syslen CIock
Resel
Tesl_node
SAD VaIue
AD-IL SeIecl
expensive lesl equipnenl and lesling line. The largeled fauIl, lesl pallern generalor and
oulpul response anaIyzer are expIicilIy descriled as foIIovs.






















Iig. 2. The proposed IST design for MLCA archileclure

3.1 FauIt modeI, AD-PE design and test strategy
In lhis chapler, lhe proposed IST design ained al lhe singIe sluck-al fauIl (SSAI). In
generaI, consideralIe physicaI fauIls in a circuil/chip are le discussed individuaIIy, such as
sluck-al fauIl, lridge fauIl, open fauIl, deIay fauIl, crosslaIk, elc.. Hovever, lhe sluck-al fauIl
is cIassicaI and nosl videIy fauIl, vhich covers 8O-9O of possilIe nanufacluring defecl in
CMOS circuils. Thus, lhe largeled fauIls of lhis vork are lhe singIe sluck-al fauIls. On lhe
olher hand, lhe proposed IST design consisls of lvo najor oljeclives: 1) lhe largeled fauIl
coverage is firsl achieved ly using efficienl IST design. MeanvhiIe, lhe lesl pallern
generalors can le inpIenenled vilh Iinear-feedlack shifl regisler (LISR). 2) onIy a fev lesl
pallerns are required lo lesl lhe enlire MLCA archileclure.

In addilion, lo achieve higher conlroIIaliIily and olservaliIily for singIe AD-IL, each AD-
IL uses lhe rippIe-carry adders (RCAs) lo produce lhe processing of lhe alsoIule difference
vaIue conpulalion and addilion unils, as shovn in Iig.3. In Iig. 3, each nuIlipIexers (nux)
are designed for lesling requirenenl and are perforned nornaI and lesl node ly using lhe
signaI sc|. The funclions of lhe alsoIule difference vaIue conpulalion and addilion are
designed ly using rippIe-carry adder. esides, lhe frane dala deIivered lhe pixeI vaIue lo
RCAs lo produce lhe parliaI SAD vaIue vhen lhe nornaI node is perforned. Then, lhe lesl
pallerns fron lesl pallern generalor are deIivered lo lesling lhe RCAs vhiIe lhe lesl node is
seIecled.
Efcient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 289

,XY,
mux
mux
mux sel sel
Frame data
Frame data
AD-PE
dmux
dmux
sel

mux mux sel sel


Frame TPG
Frame TPG
Upper AD-PE
Upper AD-PE
dmux
dmux
sel
To ORA
To next AD-PE
To next AD-PE
To ORA
To ORA

,XY,
mux
mux
mux sel sel
Frame data
Frame data
AD-PE
dmux
dmux
sel

mux mux sel sel


Frame TPG
Frame TPG
Upper AD-PE
Upper AD-PE
dmux
dmux
sel
To ORA
To next AD-PE
To next AD-PE
To ORA
To ORA






















Iig. 3. The inlernaI archileclure of AD-IL

In lhis chapler, lhe proposed IST design incIudes lhree nodes: 1) Ncrma| mcc: In lhis
node, each AD-IL perforns ils nornaI funclion, vhich is lo delernine lhe nininun SAD
vaIue. 2) Tcs| mcc: The lesl pallern generalor deIivers lhe lesl pallerns lo each AD-IL for
lesling. Then, lhe lesling oulpul resuIls are conpressed ly lhe oulpul response anaIyzer for
lesling anaIysis. 3) Ana|qzc mcc: In anaIyze node, oulpul response anaIyzer used lhe
conpressed signalure lo delernine each AD-IL in MLCA archileclure lhal are fauIl-free or
nol.

3.2 Test pattern generator
According lo lhe RCA slruclure, il vas conprised ly nany one lil fuII-adder (IA). in olher
vord, each AD-IL consisls of nany one lil fuII-adder. ConsequenlIy, according lo lhe
characlerislic of one lil fuII-adder, lhe IA has lhree prinary inpuls. Thus, lhe lesl pallerns
of IA1 has 2
3
differenl inpuls lhal can inpul lo one lil fuII-adder for lesling requirenenl. In
a vord, one lil fuII-adder can le ollained 1OO fauIl coverage for singIe sluck-al fauIls ly
using lesl pallerns (OOO, OO1, O1O, O11, 1OO, 1O1, 11O, 111 ). ased on lhe alove-nenlioned
concepl, lhe TalIe 1 shovs lhe proposed lesl pallerns for each one lil fuII-adder in AD-IL
vhich can achieve 1OO singIe sluck-al fauIl coverage. On lhe olher hand, in order lo reaIize
a sinpIe and Iov-cosl lesl pallern generalor, lhis chapler expIoils lhe LISR lo reaIize lhe lesl
pallern generalor of IST design. Iigure 4 shovs lhe proposed lesl pallern generalor vhich
can deIiver lhe lesl pallerns lo each RCA ly using LISRs. In order lo effecliveIy reaIize,
lhese lesl pallerns can le sinpIified as segnenl pallerns and generaled ly a LISR. The
conneclion lelveen lhe proposed lesl pallerns generalors (8 lils and 12 lils) and RCAs (8
VLS 290

LISRO LISRO LISR1 LISR1 LISR2 LISR2 LISR3 LISR3
SunO
IAO IAO
aO lO
Sun1
IA1 IA1
a1 l1
Sun2
IA2 IA2
a2 l2
Sun3
IA3 IA3
a3 l3
Sun4
IA4 IA4
a4 l4
Sun5
IA5 IA5
a5 l5
Sun6
IA6 IA6
a6 l6
C
oul
a7
Sun7
IA7 IA7
l7
RCA
C
in
TPG
LISRO LISRO LISR1 LISR1 LISR2 LISR2 LISR3 LISR3
SunO
IAO IAO
aO lO
Sun1
IA1 IA1
a1 l1
Sun2
IA2 IA2
a2 l2
Sun3
IA3 IA3
a3 l3
Sun4
IA4 IA4
a4 l4
Sun5
IA5 IA5
a5 l5
Sun6
IA6 IA6
a6 l6
C
oul
a7
Sun7
IA7 IA7
l7
RCA
C
in
TPG
lils and 12 lils) are shovn in Iig. 4 and 5, respecliveIy. Ior exanpIe, lhe lesl pallerns for 8
lils RCA are shovn in TalIe I, vhich can achieve 1OO singIe sluck-al fauIl coverage vhiIe
lhe LISRs deIiver lhe lesl pallerns lo RCA for lesling. Iurlhernore, lhe lesl pallern
generalor of 8 lils RCA conprises four LISR lo generale lhe lesl pallern for lesling
requirenenl. IirslIy, LISRO generaled and deIivered lhe lesl pallerns lo lhree inpuls of IAO
in 8 lils RCA lhal lhe sequences of lesl pallern Iisled as foIIov: 111O11OO11OOO1O
1O111O. Then, lhe nexl slage carry is generaled ly ilseIf. SecondIy, lhe inpuls of IA1 in 8
lils RCA are inpulled lhe lesl pallern sequences, OO1OO11O1111O1, ly LIRS1.
SiniIarIy, lhe LISR2 and LISR3 deIiver lhe lesl pallern sequences, 1O1111O1OO1O
O1 and 11O1OO1OO11O11, lo lhe inpuls of IA2 and IA3 in 8 lils RCA,
respecliveIy. esides, lhese inpuls, a
n
and l
n
, need lo nalch up differenl carry as shovn in
TalIe 1. in olher vords, lhe propagalions of lesl pallern are generaled ly LISRs as shovn in
Iig. 6.

TLST Tesl pallerns
TI
1
1 1 1 O O 1 O 1 1 O O 1 O 1 1 O O
TI
2
O 1 1 1 O 1 1 O 1 1 O 1 1 O 1 1 O
TI
3
O O 1 O 1 1 1 O O O 1 1 1 O O O 1
TI
4
1 O O 1 O O 1 1 O 1 O O 1 1 O 1 O
TI
5
O 1 O 1 1 O O O 1 1 1 O O O 1 1 1
TI
6
1 O 1 1 1 1 O 1 O 1 1 1 O 1 O 1 1
TI
7
1 1 O O 1 O 1 1 1 O 1 O 1 1 1 O 1
TalIe 1. 17-lil lesl pallern for 8-lil RCA














Iig. 4. The conneclion lelveen lhe proposed lesl pallern generalor and 8 lils RCA








Efcient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 291

LISRO LISRO LISR1 LISR1 LISR2 LISR2 LISR3 LISR3
SunO
IAO IAO
aO lO
Sun1
IA1 IA1
a1 l1
Sun2
IA2 IA2
a2 l2
Sun3
IA3 IA3
a3 l3
Sun4
IA4 IA4
a4 l4
Sun5
IA5 IA5
a5 l5
Sun6
IA6 IA6
a6 l6
C
oul
a7
Sun7
IA7 IA7
l7
RCA
C
in
TPG
Sun8
IA8 IA8
a8 l8
Sun9
IA9 IA9
a9 l9
Sun1O
IA1O IA1O
a1O l1O a11
Sun11
IA11 IA11
l11
LISRO LISRO LISR1 LISR1 LISR2 LISR2 LISR3 LISR3
SunO
IAO IAO
aO lO
Sun1
IA1 IA1
a1 l1
Sun2
IA2 IA2
a2 l2
Sun3
IA3 IA3
a3 l3
Sun4
IA4 IA4
a4 l4
Sun5
IA5 IA5
a5 l5
Sun6
IA6 IA6
a6 l6
C
oul
a7
Sun7
IA7 IA7
l7
RCA
C
in
TPG
Sun8
IA8 IA8
a8 l8
Sun9
IA9 IA9
a9 l9
Sun1O
IA1O IA1O
a1O l1O a11
Sun11
IA11 IA11
l11
O11
O11
11O
11O
111
111
1O1
1O1
OO1
OO1
1OO
1OO
O1O
O1O
O11
O11
11O
11O
O11
O11
11O
11O
111
111
1O1
1O1
OO1
OO1
1O1
1O1
OO1
OO1
1OO
1OO
O1O
O1O
1OO
1OO
O1O
O1O
11O
11O
1O1
1O1
1OO
1OO
111
111
OO1
OO1
O1O
O1O
O11
O11
11O
11O
1O1
1O1
11O
11O
1O1
1O1
1OO
1OO
111
111
OO1
OO1
111
111
OO1
OO1
O1O
O1O
O11
O11
O1O
O1O
O11
O11
111
111
1O1
1O1
O1O
O1O
11O
11O
O11
O11
OO1
OO1
1OO
1OO
111
111
1O1
1O1
111
111
1O1
1O1
O1O
O1O
11O
11O
O11
O11
11O
11O
O11
O11
OO1
OO1
1OO
1OO
OO1
OO1
1OO
1OO
1O1
1O1
111
111
O11
O11
11O
11O
1OO
1OO
O1O
O1O
OO1
OO1
1O1
1O1
111
111
1O1
1O1
111
111
O11
O11
11O
11O
1OO
1OO
11O
11O
1OO
1OO
O1O
O1O
OO1
OO1
O1O
O1O
OO1
OO1















Iig. 5. The proposed lesl pallern generalor for 12 lils RCA










(a) LISR O (l) LISR 1










(c) LISR 2 (d) LISR 3
Iig. 6. The lesl pallerns propagalion of 8 lils RCA

3.3 Output response anaIyzer
MuIli-Inpul Shifl Regisler (MISR) is videIy used as lhe signalure anaIyzer lo conpacl lhe
oulpul response of lhe circuil under lesl (CUT). Thus, lhis chapler expIoils lhe MISR lo
anaIyze lhe lesling resuIl. Iigure 7 shovs lhe oulpul response anaIyzer (ORA) lhal lhe ORA
of lhe lvo adders can le inpIenenled ly MISR and adding sone exlra Iogics. In lesl node,
lhe MISR conpresses lhe oulpul responses and lhe finaI resuIl of lhe MISR lo serve as lhe
VLS 292

CLK
Reset
From AD-PE output data
D
0
Q
0
D
1
Q
1
D
2
Q
2
D
6
Q
6
D
7
Q
7
From TPG generate test patterns
CLK
Reset
From AD-PE output data
D
0
Q
0
D
0
Q
0
D
1
Q
1
D
1
Q
1
D
2
Q
2
D
2
Q
2
D
6
Q
6
D
6
Q
6
D
7
Q
7
D
7
Q
7
From TPG generate test patterns
signalure vhich is used lo check and delernine vhelher lhe CUT is fauIl-free or nol. In a
vord, lesling vilh signalure anaIyzer, MISR, has lhe nerils of sinpIicily and Iov cosl
hardvare lecause MIRS does nol need lo slore lhe enlire responses of lesl pallerns. The
MISR of ORA has lvo kinds of inpul dala vhich are fron AD-IL oulpul dala lhrough sone
XOR gales and lhe lesl pallerns (see Iig. 7). On lhe olher hand, lhe resuIls are checked ly
MISR and propagaled lhe finaI resuIl lo lhe Iasl nodeI.














Iig. 7. The oulpul response anaIyzer for 8 lils RCA

3.4 Test strategy
ased on lhe discussion of each sul-circuil (AD-IL, TIC and ORA), lhe overaII vorking
fIov of lhe proposed IST design is sunnarized in Iig. 8, and is lriefIy descriled as foIIov.

5tcp 1: SeIecling lhe operalion nodes

The operalion node, nornaI and lesl node, are delernined ly lesl conlroIIer. In nornaI
node, lhe CUT circuil perforns ils nornaI operalion lo delernine lhe SAD vaIues, vhereas
lhe MLCA archileclure is lesled ly ilseIf.

5tcp 2: IniliaI selling of IST design

In lesl node, lhe lesl pallerns of IST design are generaled ly TIC vhich vas nade ly four
LISRs. Then, lhe TIC deIivered lhe lesl pallerns lo each AD-IL for lesling.

5tcp 3: AnaIyzing lhe lesl resuIl of AD-IL

The lesling resuIls of each AD-IL anaIyzed ly lhe ORA unliI aII AD-IL are lesled and
anaIyzed conpIeleIy.





Efcient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 293

NornaI node
Lnd
Slarl
Tesl ConlroIIer
Tesling `
Have fauIl on AD-IL `
Nexl AD-IL
Take dovn
lhe faiI AD-IL
Tesl done
Tesl conpIeleIy `
- IniliaI TIC vaIues
- SeIecl AD-IL lesled
- ORA on
Yes
No
Yes
Yes
No
No
NornaI node
Lnd
Slarl
Tesl ConlroIIer
Tesling `
Have fauIl on AD-IL `
Nexl AD-IL
Take dovn
lhe faiI AD-IL
Tesl done
Tesl conpIeleIy `
- IniliaI TIC vaIues
- SeIecl AD-IL lesled
- ORA on
NornaI node
Lnd
Slarl Slarl
Tesl ConlroIIer Tesl ConlroIIer
Tesling ` Tesling `
Have fauIl on AD-IL ` Have fauIl on AD-IL `
Nexl AD-IL
Take dovn
lhe faiI AD-IL
Nexl AD-IL
Take dovn
lhe faiI AD-IL
Tesl done Tesl done
Tesl conpIeleIy ` Tesl conpIeleIy ` Tesl conpIeleIy `
- IniliaI TIC vaIues
- SeIecl AD-IL lesled
- ORA on
- IniliaI TIC vaIues
- SeIecl AD-IL lesled
- ORA on
Yes
No
Yes
Yes
No
No


























Iig. 8. The overaII vorking fIov of lhe proposed IST design

4. ResuIts discussion

The proposed IST design vas reaIized using VeriIog HDL and synlhesized Design
ConpiIer of Synopsys. The perfornance conparisons ain al area overhead, fauIl coverage
and lesl pallerns discussion vhich are aIso presenled here lo verify lhe good perfornance of
lhe proposed IST design.
The design is carried oul lop-dovn al lhe gale-IeveI in lhe syslen of Quarlus II ly neans of
vaveforn and lhe design finaIIy passes lolh lhe unil lesl and lhe inlegraled lesl. Iigure 9
shovs a funclionaIily of lhe vaveforns for AD-IL (as you can see in seclion II). Iigure 1O
shovs lhe funclionaIily of lesl pallern generalor. Ior lesl node, iniliaI lesl pallerns are
scanned in, an AD-IL is lesled, and lesl responses are scanned oul. AII reIaled conlroI
signaIs are generaled fron conlroIIer. And, lhe fauIl coverage of AD-IL reaches 1OO vilh 7
lesl pallerns.





VLS 294

CLK
Pixelx
Pixely
y
(2`s)
35 255
AD
128 49 4 107 86 240
28 145 92 74 213 107 87 250
7 110 36 25 209 0 1 10
228 111 164 182 43 149 169 6
Upward AD-PE
AD-PE
45 17 29 175 84 204 155 113
0 127 65 200 293 204 156 123
0
0
1 2 3 4 5 6 7 8 9 10
52
CLK
Pixelx
Pixely
y
(2`s)
35 255
AD
128 49 4 107 86 240
28 145 92 74 213 107 87 250
7 110 36 25 209 0 1 10
228 111 164 182 43 149 169 6
Upward AD-PE
AD-PE
45 17 29 175 84 204 155 113
0 127 65 200 293 204 156 123
0
0
1 2 3 4 5 6 7 8 9 10
52
CLK
LFSR0Set
LFSR1Set
LFSR2Set
7
LFSR3Set
0
1 0
6 0
3 0
TPGouta
TPGoutb
109 183 36 218 147 254 73 109
73 109 183 36 218 147 254 73
0
0
1 2 3 4 5 6 7 8 9 10
0
0
Reset
0
0
0
0
CLK
LFSR0Set
LFSR1Set
LFSR2Set
7
LFSR3Set
0
1 0
6 0
3 0
TPGouta
TPGoutb
109 183 36 218 147 254 73 109
73 109 183 36 218 147 254 73
0
0
1 2 3 4 5 6 7 8 9 10
0
0
Reset
0
0
0
0
















Iig. 9. The Iogic sinuIalion of AD-IL

















Iig. 1O. The vaveforn for lesl pallern generalor

Conparing vilh previous vork (Li el aI., 2OO4), nunlers of lesl pallerns, lhe pin overheads,
lesl line and fauIl coverage are Iisled in TalIe 2.
y using lhe proposed IST design and fauIl nodeIs, lhe singIe sluck-al fauIl coverage of
each AD-IL can achieve 1OO. This perfeclIy proves lhe vaIidily of lhe lesl pallerns. And,
lhe area overhead of lhe MLCA archileclure incIuding IST is O.2, vhich is loIeralIe in
induslry




Efcient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 295

Iroposed (Li el aI., 2OO4)
No. of lesl pallern 8 3O
TolaI lesl line 35us 13us
Iins overhead 25 19O4
IauIl coverage 1OO 96
TalIe 2. Conparisons

5. ConcIusions

This chapler descriles a IST design for MLCA archileclure in video coding syslens. In lesl
node, lesl pallerns are generaled ly TIC and scanned inlo each AD-IL of MLCA
archileclure lo lesling, and lesl responses are scanned oul. AII of conlroI signaIs are
generaled ly lhe conlroIIer. And, experinenlaI resuIls shov lhal lhe area overhead of lhe
IST archileclure for nolion eslinalion archileclure is Iess lhan 1. The fauIl coverage of
each AD-IL can achieve 1OO, and il perfeclIy proves lhe vaIidily of lhe lesl pallerns.
Moreover, IST slruclure can easiIy le designed and appIied lo lhe MLCA archileclure.
Thal neans lhe sinpIificalion in lhe IST design of AD-IL in MLCA archileclure is
reasonalIe.

6. References


Alranovici, M., reuer, M. A. & Iriednan, A. D. (199O). DigilaI Syslens Tesling and
TeslalIe Design. oslon, MA: Conpuler Science Iress, 199O.
CaIIagher, I., Chickernane, V., Cregor, S. & Iierre, T. S. (2OO1). A uiIding Iock IST
MelhodoIogy for SOC Designs: A Case Sludy, Prccccings cf |n|crna|icna| Tcs|
Ccnfcrcncc, II. 111-12O, Ocl. 2OO1.
Konarek, T. & Iirsch, I. (1989). Array Archileclures for Iock Malching AIgorilhns, ||||
Transac|icns cn Circui|s an Sqs|cms, VoI. 36, No. 2, II. 13O1-13O8, Ocl. 1989.
Kuhn, I. (1999). AIgorilhns, ConpIexily AnaIysis and VLSI Archileclures for MILC-4
Molion Lslinalion. Nev York, NY: KIuver Acadenic IulIishers, 1999.
Kung, C. I., Huang, C. }. & Lin, C. S. (1995). Iasl IauIl SinuIalion for isl AppIicalions,
Prccccings cf |nc |cur|n Asian Tcs| Sqmpcsium, II. 93-99, 1995.
Li, D., HU, M. & Mohaned, O. (2OO4). uiIl-In SeIf-Tesl Design of Molion Lslinalion
Conpuling Array, Prccccings cf |||| Ncr|ncas| Icr|sncp cn Circui|s an Sqs|cms
(N|ICAS04), }une 2OO4.
Lu, S. K., Shih, }. S. & Huang, S. C. (2OO5). Design-for-TeslaliIily and IauIl-ToIeranl
Techniques for IIT Irocessors, ILLL Trans. VLSI Syslens, voI. 13, no. 6, pp. 732-
741, }une 2OO5.
McCIuskey, L. }. (1985). uiIl-In SeIf-Tesl Technique, |||| Dcsign an Tcs| cf Ccmpu|crs, VoI.
2, No. 2, II. 29-36, Apr. 1985.
NagIe, H. T., Roy, S. C., Havkins, C. I., Macnaner, M. C. & Irilzeneier, R. R. (1989). Design
for TeslaliIily and uiIl-in SeIf Tesl : A reviev, |||| Transac|icns cn |nus|ria|
||cc|rcnics, VoI. 36, No. 2, II. 129-14O, May. 1989.
Iirsch, I., Denassieux, N. & Cehrke, W. (1995). VIsi archileclure for video conpression-a
survey, Prccccings cf |nc ||||, No. 2, I. 22O, Iel. 1995.
VLS 296

Toula, N. A. & McCIuskey, L. }. (1997). Iseudo-Randon Iallern Tesling of ridging IauIls,
Prccccing scf |n|crna|icna| Ccnfcrcncc cn Ccmpu|cr Dcsign , II. 54-6O, 1997.
Wiegand, T. (2OO3) Drafl ITU-T Reconnendalion and IinaI Drafl InlernalionaI Slandard of
}oinl Video Specificalion (ITU-T Rec. H.264/ ISO/ ILC 14496-1O AVC), }oinl Video
Tean (}VT) of ISO/ILC MILC and ITU-T VCLC, Mar. 2OO3.
Wu, T. H., Tsai, Y. L., & Chang, S. }. (2OO7). An Lfficienl Design-for-TeslaliIily Schene for
Molion Lslinalion in H.264/AVC, Prccccings cf |n|crna|icna| Sqmpcsium V|S|
Dcsign Au|cma|icn an Tcs|, II.25-27, Apr. 2OO7.

SOC Design for Speech-to-Speech Translation 297
X

SOC Design for Speech-to-Speech TransIation

Shun-Chieh Lin
1
, }ia-Ching Wang
2
, }hing-Ia Wang
3
,
Ian-Min Li
4
and }er-Hao Hsu
5

1
|nus|ria| Tccnnc|cgq Rcscarcn |ns|i|u|c
2
Na|icna| Ccn|ra| Unitcrsi|q
3, 4, 5
Na|icna| Cncng Kung Unitcrsi|q

1. Introduction

In lodays gIolaIised vorId, infornalion exchange lelveen differenl Ianguages is
indispensalIe. AccordingIy, speech-lo-speech lransIalion researches |1j-|6j grev lo a
Ieading-edge lechnoIogy enalIing nuIliIinguaI hunan-lo-hunan and hunan-lo- nachine
inleraclion. In previous vork, lvo differenl archileclures are adopled ly nany researchers
for speech-lo-speech lransIalion research |4j,|15j - a convenlionaI sequenliaI archileclure
and a fuIIy inlegraled archileclure. The sequenliaI archileclure is conposed of a speech
recognilion syslen foIIoved ly a Iinguislic (or non-Iinguislic) lexl-lo-lexl lransIalion syslen
and a lexl-lo-speech syslen |1j,|2j,|5j,|16-2Oj. The inlegraled archileclure conlines speech
fealure nodeIs and lransIalion nodeIs in a nanner siniIar lo lhal used for speech
recognilion. This inlegralion lrings aloul efficienl lransIalion ly searching for an oplinaI
vord sequence of largel Ianguage lhrough lhe inlegraled nelvork such as finile-slale
lransducer |4j,|21j and lhe olhers |22j,|23j.
According lo lhese lvo archileclures, lhere are lvo syslen inpIenenlalions - cIienl-server-
lased syslens and sland-aIone handheId syslens. NeverlheIess, a crilicaI shorlconing of
cIienl-server lased speech lransIalion syslens is lhal lhey shouId le luiIl on lhe server
conpuler. In olher vords, il is nol avaiIalIe anyline or anyvhere. An olvious soIulion
vouId le lo luiId porlalIe sland-aIone speech-lo-speech lransIalion handheId devices. To
nenlion jusl a fev: Isolani c| a|. |7j used a Iockel IC IDA vilh 2O6 MHz SlrongARM/64
M RAM lo luiId a speech-lo-speech lransIalion syslen for lhe use in various silualions
vhiIe lraveIing. Iai|c| c| a|. |8j adopled a Iockel IC IDA vilh 4OO MHz SlrongARM/64
M RAM lo conslrucl a speech-lo-speech lransIalion syslen for nedicaI inlervievs.
Walanale c| a|. |9j deveIoped a noliIe device running on 4OO MHz Ienliun II-cIass
processor/192 M RAM lhal heIps speech-lo-speech lransIalion in various silualions during
lheir lraveI alroad. Hovever, lhese vorks shov lhal reaI-line speech-lo-speech lransIalion
vilh a resource-Iiniled device is sliII a prolIen.
TalIe 1 shovs a conparison anong reIaled speech-lo-speech lransIalion syslens, incIuding
}ANUS |1j, VerlnoliI |2j, LUTRANS |4j elc. CIearIy, lhe cIienl-server lased archileclure
can vork in reaI line, lul is nol porlalIe, vhiIe lhe sland-aIone syslen is porlalIe, lul Iacks
lhe reaI-line perfornance. Therefore, a VLSI soIulion is presenled ly reaIizing lhe enlire
15
VLS 298

speech-lo-speech lransIalion aIgorilhn vilhin a singIe chip. This SOC chip onIy requires a
fev peripheraI conponenls for conpIele operalion, and is characlerized ly snaII size, Iov
cosl, reaI-line operalion, and high reIialiIily. The conslruclion of lhis chip is acconpIished
in lvo nain phases: lhe soflvare sinuIalion phase and lhe SOC design phase.
In lhe soflvare sinuIalion phase, lhe sinuIalion is lased on lhe nuIlipIe-lransIalion
spolling (MTS) nelhod, a kind of inlegralion of speech anaIysis and Ianguage lransIalion
|23j. The proposed nuIlipIe-lransIalion spolling approach is direclIy fron speech lo speech
vilhoul Ianguage nodeIs Iike olher aulonalic speech recognilion (ASR) approaches. Wilh
idenlifying speech fealures, lransIalion prinariIy slays in lhe speech nodaIily and does nol
go lhrough a lexluaI nodaIily. The proposed MTS nelhod nol onIy relrieves lhe oplinaI
nuIlipIe-lransIalion spolling lenpIale, lul aIso exlracls lhe appropriale largel pallerns.
Wilh lhe exlracled pallerns, lhe largel speech can le generaled ly a concalenalion-lased
vaveforn segnenl synlhesis nelhod.
In lhe SOC design phase, lesides a cosl efficienl progrannalIe core used for syslen conlroI
and non-conpulalion-inlensive lasks, lhree specific hardvare cores vere designed lo
perforn cepslrun exlraclion, lenpIale relrievaI, largel pallern exlraclion. Moreover, lhe
A/D converler (ADC) and D/A converler (DAC) are aIso designed.
The resl of lhis paper is organised as foIIovs. Seclion 2 gives lhe soflvare sinuIalion of lhe
proposed speech-lo-speech lransIalion syslen. Seclion 3 discusses lhe SOC archileclure for
lhe MTS-lased speech-lo-speech lransIalion syslen. IinaIIy, a shorl concIusion is provided
in Seclion 4.

Syslen VocaluIary Size Response line Specificalion
}ANUS |1j 3,OOO~5,OOO s 2 lines reaI-line IC-lased pIalforn (server-
cIienl)
VerlnoliI |2j >1O,OOO 4 lines reaI-line IC-lased pIalforn (server-
cIienl)
LUTRANS |4j 1,7O1(LngIish)
2,459(IlaIian)
s 3 lines reaI-line IC-lased pIalforn (server-
cIienl)
ATR-MATRIX |5j 13,OOO O.1 sec for TDMT IC-lased pIalforn (server-
cIienl)
Isolani c| a|. |7j 2O,OOO(LngIish)
5O,OOO(}apan)
SIover lhan V
R
55OO
processor
Iockel IC IDA (2O6 MHz
SlrongARM/64 M RAM)
SpeechaIalor |8j - >2~3 sec Iockel IC IDA (4OO MHz
SlrongARM/64 M RAM)
Walanale c| a|. |9j 1OOOO (LngIish)
5OOOO (}apan)
>2~3 sec MoliIe IC (4OO MHz Ienliun
II-cIass processor/192 M
RAM)
TalIe 1. A conparison anong reIaled speech-lo-speech lransIalion syslens
SOC Design for Speech-to-Speech Translation 299


TalIe 2. LxanpIes for TS and MTS.
1


2. The Proposed Speech-to-Speech TransIation System

2.1 System Overview
MuIlipIe-lransIalion spolling (MTS) is proposed as an inprovenenl on lhe lradilionaI
lransIalion spolling (TS) nelhod |11j, |12j and uses nuIlipIe-lransIalion spolling lenpIales
derived fron nuIlipIe pairs of lransIalions for spolling aII hypolhesised largel-Ianguage
pallerns. TalIe 2 Iisls sone exanpIes of TS and MTS. The source-Ianguage inpul | uan| a
cu||c rccm can onIy provide five largel-Ianguage pallerns <,,,,> in lradilionaI
TS nelhod lul lhe proposed MTS provides aII largel-Ianguage pallerns <,,,,,
>. AII largel-Ianguage pallerns <,,,,,,> are aIso exlracled ly MTS vhiIe
inpulling lhe senlence |s |ncrc a sing|c rccm fcr |cmcrrcu. Ior each nuIlipIe-lransIalion
spolling lenpIale, hypolhesised largel pallerns are generaled in lhe MTS process.
Iigure 1 shovs an exanpIe of lhe MTS process for lhe direclion of LngIish-lo-Chinese vhiIe
inpulling a speech. The nuIlipIe-lransIalion spolling lenpIale of lhis exanpIe possesses
eIeven LngIish pallerns in a fealure lenpIale. ased on lhe one-slage aIgorilhn |13j, vhen a
speaker inpuls an LngIish speech |s a sing|c rccm s|i|| atai|a||c fcr |cnign|, lhe proposed
syslen can ollain lhe idenlified resuIls - is, a, sing|c rccm, s|i||, atai|a||c, fcr,
and |cnign|. IoIIoving lhe idenlified resuIls, lhe lransIalions of is, a,
sing|c rccm, and so on are exlracled. A Mandarin Chinese speech pallern sel lhal
incIudes seven hypolhesised speech pallerns, , , , , , and
is generaled ly lhe conslrucled vaveforn lransIalions in lhe nuIlipIe-lransIalion spolling
lenpIale and represenled as darker vaveforns in Iig. 1.
Ior speech generalion, afler delernining lhe oplinaI largel speech sequence, lhese
vaveforns are rearranged vilh adequale overIapping porlions lo generale speech vilh lhe

1
The gray lalIels Iisl lhe lransIalion spolling resuIls.
VLS 300

vaveforn siniIarily overIap and add (WSOLA) aIgorilhn |24j. WSOLA inlroduces a
loIerance on lhe desired line-varping funclion lo ensure signaI conlinuily al vaveforn
segnenl joins. Wilh a proper vindovs Ienglh and a lining loIerance, WSOLA usuaIIy
produces high quaIily line-scaIed speech |25j,|26j. Therefore, lhe syslen can generale high
phonelic/prosodic quaIily in lhe lransIaled speech oulpul. The advanlages of lhis nelhod
are lhe snaII conpulalionaI cosl during lhe generalion process and lhe high inleIIigiliIily of
lhe generaled speech. The foIIoving sulseclions furlher discuss lhe delaiIs of lhe kerneI
spolling aIgorilhn vilhin lhe speech lransIalion.

Is a single room still available Ior tonight



b
r
e
a
k
I
a
s
t









a
v
a
i
l
a
b
l
e








t
o
m
o
r
r
o
w















l
a
u
n
d
r
y

s
e
r
v
i
c
e

















t
o
n
i
g
h
t

An hypothesized Mandarin Chinese speech pattern set:
Speech
Ieature input
Speech
Ieatures
Multiple-
Translation
Spotting
Template
Feature
Template
WaveIorm
Translation
IdentiIied
results
I
s



















s
t
i
l
l












I
o
r





















r
o
o
m

s
e
r
v
i
c
e



















a


s
i
n
g
l
e

r
o
o
m
Speech
Pattern

Iig. 1. An exanpIe of MTS process for lhe direclion of LngIish-lo-Mandarin Chinese.

2.2 MuItipIe-TransIation Spotting
To fornuIale lhe concepl of MTS lelveen inpul speech (
L
X
1
) and lhe t-lh nuIlipIe-
lransIalion spolling lenpIale (
v
r
), ve use lhe foIIoving nolalions: | represenls lhe frane
index vilhin
L
X
1
,
L l s s 1
, j represenls lhe lransIalion pair
v
f
v
f
t s ,
index vilhin
v
r
,
J f s s 1
,
and | represenls lhe frane index vilhin lhe j-lh source pallern
v
f
s
and ils napped largel
pallern denoled ly
v
f
t
,
f
K k s s 1
. Then for each inpul frane, lhe accunuIaled dislorlion
) , , ( f k l a
A
is defined ly:
SOC Design for Speech-to-Speech Translation 301

( ) ) , , 1 ( min ) , , ( ) , , (
2
f m l a f k l a f k l a
A
k m k
A
+ =
s s
, (1)
for
f
K k s s 2
,
J f s s 1
, vhere
) , , ( f k l a
is lhe IocaI dislorlion lelveen lhe |-lh frane of
L
X
1
and lhe |-lh frane of lhe source pallern
v
f
s
. The recursion in (1) is carried oul for lhe
inlernaI franes (i.e.,
2 > k
) of each source pallern. Al lhe pallern loundary, i.e., vhen
1 = k
,
lhe recursion can le caIcuIaled as:
( ) | | ) , 1 , 1 ( , ) , , 1 ( min min ) , 1 , ( ) , 1 , (
1
f l a m K l a f l a f l a
A m A
J m
A
+ =
s s
. (2)
And lhe lesl palh is delernined ly
( ) | | f K L a a
f A
J f
v
G
, , min
1 s s
=
. (3)
Afler ranking aII lhe lenpIales, lhe hypolhesized spolling lenpIale is decided fron lhe Top
N candidales vilh nininun dislorlion ly
_
=
s s
=
J
f
v
f
N v
v
1
1
max arg ` t
(4)
According lo lhe decided
v` -|n lenpIale fron (4), lhe largel pallerns
{ }
J
f
v
f
t

1
`

=
can le ollained
ly
{ }
J
f
v
f


1
`
=
. Wilh lhe delernining largel speech pallerns, lhese vaveforns are rearranged
vilh adequale overIapping porlions lo generale speech vilh lhe vaveforn siniIarily
overIap and add (WSOLA) aIgorilhn.

2.3 Software SimuIation
The lransIalion experinenls vere perforned on lolh a IC-lased pIalforn and iIAQ IDAs.
On lhe IC-lased pIalforn, lhe soflvare sinuIalions vere done using Windovs CL 3.O on a
Ienliun

IV 1.8 CHz, 1 C RAM, Windovs

XI IC. On lhe COMIAQ iIAQ IDAs, lhe


syslen vas inpIenenled on a 4OO MHz InleI XScaIe processor vilh 128 M RAM. This
vork luiIl a coIIeclion of LngIish senlences and lheir Chinese lransIalions lhal frequenlIy
appear in phraselooks for foreign lourisls. The phraselooks are referred lo |27j and |28j. Six
sul-donains are used for coIIeclion incIuding acconnodalion, reslauranl, lraffic, shopping,
lourisn, and asking for direclions. Lach sul-donain conlains 174~251 lransIalions.
Lxperinenls vere conducled lelveen Mandarin Chinese and LngIish. To evaIuale lhe
syslen perfornance, lhe 1,26O ullerances of lhe lraining sel used for conslrucling nuIlipIe-
lransIalion spolling lenpIales vere coIIecled fron one speaker (Sp1) and lhe 1O5 ullerances
of lhe lesl sel vere aIso coIIecled fron lhe sane speaker (Sp1) for cIosed lesling and lvo
liIinguaI naIe speakers (Sp2 and Sp3) for open lesling. Iirsl speaker's fealure nodeIs
(spolling lenpIales) vere used lo proceed lesls on lhe renaining lvo speakers. AII speeches
had an 8 kHz sanpIing rale vilh a precision of 16 lils on lolh lhe ICs and lhe IDAs. The
speech fealure anaIysis vas perforned using 1Olh-order Iinear prediclion cepslraI
coefficienls (LICCs) using 32 ns franes vilh 8 ns overIap. The lolaI nenory requirenenl
of lhe vhoIe syslen al lhe slarlup vas aloul 22.9 M. The required vork nenory is aloul 1
M, depending on lhe Ienglh of a speech inpul.
WhiIe hesilalions or inserlions are occurred, lhe proposed MTS approach can spol crilicaI
pallerns as keyvord spolling or parliaI nalching and a largel oulpul can le properIy
generaled ly lhe spolled pallerns vilh lhe decided spolling lenpIale. Hovever, lhe
proposed approach vouId le sensilive lo dislurlances incIuding speaking rale, noise,
speaker properlies, and so on. Ior lhe effecl of speaker properlies for lhe proposed syslen,
VLS 302

Ior relrieving Top 5 lenpIales, TalIe 3 shovs lhal lhe spolling accuracy of Sp2 and Sp3
drops ly 1O lo 15 percenl. A given spolling lenpIale is caIIed a ma|cn vhen il ollains lhe
sane inlenlion of lhe inpul speech. In addilion, fron lhe research presenled in |31j, lhe
speaking rale had a significanl effecl on recognilion accuracy and furlher adaplalion
nelhods of duralion nodeIs for spolling lenpIales are needed. ased on lhe proposed
approach, vhen lenpIale or vocaluIary size increases, lhe increasing spolling lenpIales
viII Iead lo nore speech fealure veclors and hence nore siniIarilies viII occur in speech
spolling neasurenenl, lhus causing faIse spolling resuIls and Iovering spolling accuracy.
y coIIecling nore speech dalalases, lhe syslen can appIy speaker-dependenl or speaker-
independenl HMM lo MTS for nore rolusl speech lransIalion. Speech lransIalion
perfornance aIso is degraded ly noise. ReIaled vorks lo nininize lhe effecls of lhe noise on
lhe syslen perfornance are presenled in |29j,|3Oj and vouId le appIied lo lhe proposed
syslen in lhe fulure.
To judge lhe generaled lransIalions fron lhe nalched lenpIales, a suljeclive senlence error
rale (SSLR) in |3j is used lo evaIuale and cIassify lhe largel generalion resuIls inlo lhree
calegories ly lhree liIinguaI evaIualors. Referring lo lhe SSLR evaIualion nelhod, gcc (G),
uncrs|ana||c (U), |a (8) IeveIs of lransIalion quaIily are scaIed lo 1.O, O.5, and O.O,
respecliveIy. The underslandalIe lransIalion rale is caIcuIaled ly
( ) ( )
100
tests oI no.
level oI no. 5 . 0 level oI no. 0 . 1
rate
+
=
U G
(5)
The resuIls reveaIed lhal our proposed approach roughIy achieves 89 and 92
underslandalIe lransIalion rale fron (5) for lhe Mandarin Chinese-lo-LngIish and lhe
LngIish-lo-Mandarin Chinese lransIalions, respecliveIy.


TalIe 3. Average spolling accuracy in nuIlipIe speaker lesling.

3. SOC Design for the MTS-based Speech-to-Speech TransIation System

The VLSI archileclure for lhe speech-lo-speech lransIalion SOC vas deveIoped using lhe
progrannalIe appIicalion-specific lechnique. esides a cosl efficienl progrannalIe core
|14j, vhich is used for syslen conlroI and olher non conpulalion-inlensive lasks, lhree
specific hardvare cores are aIso designed for fealure exlraclion, lenpIale relrievaI, pallern
exlraclion. Moreover, lhe A/D converler and D/A converler are aIso designed.
Iigure 2 shovs lhe overaII lIock diagran of lhe VLSI archileclure for lhe speech-lo-speech
lransIalion SOC. This archileclure nainIy consisls of a cepslrun exlraclion core, a lenpIale
relrievaI core, a pallern exlraclion core, a progrannalIe core, and ADC/DAC. In lhe
SOC Design for Speech-to-Speech Translation 303

foIIoving paragraph, ve viII discuss lhe lhree specific processing hardvare cores and lhe
ADC/DAC for our speech-lo-speech lransIalion syslen.

ADC/
DAC
Translation
Template
ROM
Synthesis
Pattern ROM
Translation
Rule ROM
Programmable
Core
LPC Extraction
Core
Template
Retrieval Core
High Low
Converter
Speech
Feature
RAM
High Speed Bus
L
o
w

S
p
e
e
d

B
u
s
Pattern
Extraction Core
Cepstrum
Extraction
Core
Template
Retrieval Core
Pattern
Extraction
Core
Programmable
Core

Iig. 2. Iock diagran of lhe SOC archileclure for lhe proposed speech-lo-speech lransIalion
syslen.

3.1 The Design of the Cepstrum Extraction Core
Iealure exlraclion is one of lhe nosl inporlanl issues in lhe fieId of speech recognilion. In
nany speech recognilion syslens, lhe cepslraI coefficienls are lrealed as lhe fealure
paranelers, derived ly lhe Iourier Transforn or fron lhe LIC coefficienls. We seIecle lhe
Ialler vay as il yieIds high recognilion rale and Iover conpulalionaI Ioad.

3.1.1 AutocorreIation AnaIysis
The sanpIing rale of lhe inpul speech for lhe cepslrun exlraclion core is 8 KHz. The pre-
enphasis fiIler for lhe digilized speech is defined ly


. . , ) ( s s =

h h: : H
. (6)
ecause ve adopl fixed-poinl inpIenenlalion, n is chosen as 15/16. The pre-enphasized
speech is lhen lIocked inlo N sanpIes, vilh adjacenl franes leing separaled ly M sanpIes.
In olher vords, lhe adjacenl frane legins M sanpIes Ialer lhan lhe previous frane and
overIaps vilh il ly N-M sanpIes. In our design, M=192, lhe frane size N=256, and lherefore
lhe frane rale equaIs 31.25 franes per second. The nexl slep is lo vindov each frane so lhal
lhe signaI disconlinuilies al lhe leginning and lhe end of each frane are nininized. The
Hanning vindov is used and has lhe forn of

s s

=
otherwise
N i
N
i
i w


, 0
), cos( . .
) (
t

(7)
Ior each frane of lhe vindoved speech signaI, aulocorreIalion anaIysis is perforned ly lhe
foIIoving fornuIa:
VLS 304

P k i x k i x k R
N
k i
s s =
_

, ) ( ) ( ) (
,
(8)
vhere P is lhe order of lhe LIC anaIysis.
The delaiIed archileclure for lhe aulocorreIalion anaIysis of overIapping franes fron lhe
digilized inpul speech can le seen in |32j. The archileclure is lased on a caIcuIalion
procedure vhich is depicled in Iig. 3.

R(0) x(0)x(0)
R(1)
R(2)
R(10)
x(2)x(2)x(3)x(3)
x(1)x(2)x(2)x(3)
x(0)x(2)x(1)x(3)

x(1)x(1)
x(0)x(1)


......x(10)x(10)......
...... x(9)x(10)......
...... x(8)x(10)......
x(0)x(10)......
x(n) x(n) x(n1)x(n1) ...... x(N-1)x(N-1)
x(n-1) x(n) x(n) x(n1) ...... x(N-2)x(N-1)
x(n-2) x(n) x(n-1)x(n1) ...... x(N-3)x(N-1)
x(n-10)x(n) x(n-9)x(n1) ...... x(N-9)x(N-1)
) (k f
n
) 1 ( ) 1 ( + + n x k n x

Iig. 3. An exanpIe iIIuslraling lhe aulocorreIalion caIcuIalion procedure.

3.1.2 Linear Predictive AnaIysis
Of lhe various Iinear prediclive anaIysis aIgorilhns, ve adopl lhe aulocorreIalion nelhod
lecause il Ieads lo nore slalIe resuIls lhan lhe covariance nelhod and has Iover
conpulalionaI Ioad lhan lhe Iallice nelhod |33j. Ior lhe aulocorreIalion nelhod, lhe nalrix
equalion for soIving lhe LIC coefficienls is in lhe forn of

P i k i R a i R
P
k
k
s s =
_
=
1 , ) ( ) (
1
, (9)

vhere R(i) is lhe aulocorreIalion coefficienl, P is lhe order of lhe LIC anaIysis, chosen as 1O
in lhis paper.
The nosl efficienl nelhod knovn for soIving lhis parlicuIar syslen of equalions is lhe
Levinson-Durlin recursion, vhich can le fornuIaled as foIIovs |33j:

) 0 (
) 0 (
R E =
(1O)
for P i s s 1

) ( ) (
1
1
) 1 (
_

=
i
f
i
f
f i R a i R temp
(11)

/
) 1 (
=
i
i
E temp K
(12)
end for

i
i
i
K a =
) (
(13)

1 1 ,
) 1 ( ) 1 ( ) (
s s =

i f a K a a
i
f i i
i
f
i
f
. (14)
E K E
i
i
i ( ) ( )
( ) =

1
2 1
, (15)
The alove equalions are soIved recursiveIy for i = 1, 2,
.
, P and lhe finaI soIulion is given
as

P f a a
P
f f
s s = 1 ,
) (
. (16)
Sulsliluling (12) inlo (15), ve ollain lhal

) ( ) 1 ( ) ( i i i
K temp E E =

. (17)
The archileclure lased on pipeIine fashion for lhe Levinson-Durlin recursion is descriled in
SOC Design for Speech-to-Speech Translation 305

Iig. 4. There are a snaII nunler of divisions needed for lhe LIC conpulalion. Hovever, il is
nol econonicaI lo prepare exlra hardvare for lhen. Inslead, lhe division operalions can le
perforned ly prune-and-search |34j on lhe hardvare vilhoul lhe use of an individuaI divider.

Constants
quotient
Set up
Register File
R(0)~R(10),
E, K,Temp,
a1~a10
mux
mux mux
Multiplier
mux
0
Adder

Iig. 4. The archileclure for lhe LIC anaIysis.

mux mux
mux
Con:an:
lc_:cr Flc
Ccp:rum
lc_:cr Flc
L!C
mux
Adder
Multiplier
0

Iig. 5. The archileclure for lhe conversion fron LIC lo cepslrun.

VLS 306

3.1.3 Conversion from LPC to Cepstrum
The cepslrun is conpuled fron lhe Iinear prediclion nodeI. The recursive equalions of lhe
LIC lased cepslrun are as foIIovs |35j:

1 1
a C =
, (18a)

, 1 , ) 1 (
1
1
P n C a n m a C
n
m
m n m n n
s s =
_

(18l)
vhere
n
C
is lhe nlh cepslraI coefficienl,
n
a
is lhe nlh LIC coefficienl, and P is lhe cepslrun
order, vhich is sel equaI lo lhe LIC order in lhis paper.
The nenory requirenenl for lhe conslanl
) 1 ( n m
can le reduced as foIIovs. There are n-1
conslanls for any n, so lhe lolaI nunler of lhe various conslanls is 1+2+3+.+P-1=P(P-1)/2.
y laking inlo consideralion lhe pallern of lhe conslanl array, Iel |
m,n
=(1-m/n), lhen ve can
shov lhal
|
n-1-i,n
=1- |
i,n
, (19)
for i=1, 2 ,.,
(
2 / 1 n
, and if n is even lhen
n n
k
, 2 /
=1/2.
Lqualion (19) inpIies lhal
n i
k
,
is equaI lo lhe conpIenenl of
n i n
k
, 1
, lherefore, lhe size of lhe
conslanl array can le reduced lo haIf. This aIgorilhn is inpIenenled ly a slore-and-
accunuIale lechnique and lhe archileclure is descriled in Iig. 5. We need lvo slorage
eIenenls lo hoId lhe LIC coefficienls and lhe previous order cepslraI coefficienls. In
addilion, an accunuIalor is necessary for conpuling and lolaIing lhe nain expression, (1-
m/n)a
n
C
n-m
lo yieId lhe cepslraI coefficienls defined in (18l).
The dedicaled archileclure for lhe aulocorreIalion anaIysis, lhe LIC anaIysis, and lhe
conversion fron LIC lo LSI are individuaIIy designed. Hovever, since lhe lhree procedures
are perforned sequenliaIIy, a resource-sharing lechnique,is perforned so lhal lhe cepslrun
exlraclion core viII need onIy one nuIlipIier and one adder.

3.2 The Design of the TempIate RetrievaI Core
To exenpIify our design concepl, Iigs. 6(a) and 6(l) dispIay lvo dependence graphs (DCs) of
lhe lenpIale relrievaI. These dependence graphs are conslrucled according lo (1) and (2), and
can le regarded as lhe lenpIale relrievaI space. The DC shovn in Iig. 6(a) descriles lhe case
vilh 12 franes in lhe inpul speech (
12
1
X
). There are lhree source pallerns (
1
1
v
s
,
1
2
v
s
,
1
3
v
s
) in lhis
lenpIale (
1 v
r
), and lhe frane nunler of lhen is 7, 8, and 6, respecliveIy. The DC shovn in Iig.
6(l) has 8 franes in lhe inpul speech (
8
1
X
) and 2 pallerns in lhe lenpIale (
2 v
r
). The frane
nunler of lhe lvo pallerns (
2
1
v
s
,
2
2
v
s
) in lhis lenpIale is 6 and 7, respecliveIy.
The differenl frane nunlers vilhin differenl inpul speeches cause lhe varialion aIong lhe
horizonlaI axis in lhe lenpIale relrievaI space. The differenl pallern nunlers vilhin differenl
lenpIales and lhe differenl frane nunlers vilhin differenl pallerns are responsilIe for lhe
varialion aIong lhe verlicaI axis in lhe lenpIale relrievaI space. The archileclure design for
coping vilh lhe varialion of lhe slruclure of lhe lenpIale relrievaI space is descriled as foIIovs.
Iirsl, horizonlaI projeclion is perforned in lhe originaI DCs (Iig. 6(a)) and lhe resuIl is shovn
in. Iig. 6(c). This projeclion reduces lhe lvo-dinensionaI (2-D) lenpIale relrievaI space inlo a
one-dinensionaI (1-D) one and eIininales lhe varialion resuIling fron differenl frane
nunlers vilhin differenl inpul speeches. Lach node in Iig. 6(c) requires a regisler lo slore lhe
conpulalionaI resuIls for lhe dislorlion accunuIalion of lhe nexl frane. The nodes vilhin one
SOC Design for Speech-to-Speech Translation 307

coIunn are divided inlo differenl lIocks according lo lhe pallerns lhey leIong lo. Thus lhe DC
in Iig. 6(c) consisls of lhree lIocks corresponding lo lhree differenl pallerns.
To furlher reduce lhe size of lhe DC, a verlicaI projeclion is perforned in Iig. 6(c). This
projeclion lransforns lhe lhree-lIock DC inlo a one-lIock DC (see Iig. 6(e)). ecause lhe frane
nunlers for lhe lIocks are differenl, lhe Iargesl frane nunler is laken as lhe frane nunler
for lhe nev one-lIock DC. To deaI vilh lhe discrepancy of having differenl frane nunlers
vilhin differenl pallerns, ve added sone nuIlipIexers. In addilion, lvo exlra regislers are
added in fronl of each node lo repIace lhe lvo eIininaled lIocks. Iigure 6(e) is lhen nodified
as Iig. 6(g) lo have a reguIar vire conneclion.
Iigures 6(l), 6(d), 6(f) and 6(h) provide anolher exanpIe of lhe use of horizonlaI and verlicaI
projeclions. One can find lhal lhe frane nunler and lhe regisler nunler for a cerlain node
lelveen lhe DCs in Iigs. 6(g) and 6(h) are differenl. To conline lhe lvo DCs inlo a singIe one,
lhe Iargesl frane nunler and lhe Iargesl regisler nunler lelveen lhen are chosen, see Iig.
6(i). The added nuIlipIexers in fronl of each node provide lhe seIeclivily anong lhe one-lIock
DCs.
P

P

u C 1 8 S
u C 1 8 S
8
1
x
2 v
r
2
1
v
s
2
2
v
s
1 2
1
x
1 v
r
1
1
v
s
1
2
v
s
1
3
v
s
u 1 8 S
u 1 8 S
u C
M
M
M
M
M
M
M
M
M
M

M

v

M
M

M

M
M

8

M

8

v


C 8 u C M C 8 u C
C 8 u C M C 8 u C
l M u C
8


Iig. 6. Archileclure inference of lhe lenpIale relrievaI lased on DC.
VLS 308

3.2.1 The Distortion Unit
Lach node in lhe lenpIale relrievaI DC associales vilh a IocaI dislorlion. The IocaI dislorlion
lelveen a frane of lhe lenpIale and a frane of lhe inpul speech is defined ly
_
=
=
P
i
i i
U R
0
distortion Local
, (2O)
vhere
i
R
denoles lhe i-lh cepslraI coefficienl in lhe R-lh frane of lhe lenpIale,
i
U
represenls
lhe i-lh cepslraI coefficienl in lhe U-lh frane of lhe inpul speech, and P is lhe cepslrun order.
The cepslraI coefficienls of lhe inpul speech and lhe lenpIales are slored in RAMO and
ROM1, respecliveIy. The dislorlion unil accesses lhe cepslraI coefficienls fron lhe lvo
nenories, and lhen accunuIales lhe dislorlion in regisler R.

3.2.2 The Processing EIement
Lach node in lhe DCs shovn in Iig. 6(i) is inpIenenled as a processing eIenenl. The
archileclure design of lhe IL is dispIayed in Iig. 7. Afler receiving aII lhe accunuIaled
dislorlions fron lhe regislers, lhe IL seIecls lhe nininun accunuIaled dislorlion, and lhen
adds il lo lhe IocaI dislorlion
) , , ( f k l a
. The resuIl is lhe nev accunuIaled dislorlion
) , , ( f k l a
A

associaled vilh lhe currenl node. In addilion lo lhe accunuIalion process, lhe IL aIso
generales lhe decision infornalion lhal indicales lhe source node in a palh lransilion. The
use of lhe decision infornalion is discussed logelher vilh lhe pallern exlraclion core Ialer.
ased on lhe dala palh of lhe lenpIale relrievaI core, a corresponding lenpIale relrievaI
conlroIIer is aIso deveIoped. In addilion lo producing conlroI signaIs, lhis conlroIIer aIso
funclions as a nenory address generalor for accessing lhe lenpIale speech fealures. The
lenpIale relrievaI core nol onIy accunuIales lhe nininun IocaI dislorlion lul aIso vriles
lhe decision infornalion inlo lhe nenory. The decision infornalion for a node indicales ils
source node in a palh lransilion. The lesl palh can le ollained fron lhe decision
infornalion, lracing il lackvard for exlracling lhe hypolhesised speech pallerns. The design
of lhe pallern exlraclion core is given in lhe nexl sulseclion.

M
u
x
Comparator
decision inIormation
0
1
2
) , , 1 ( f k l a
A

) , 1 , 1 ( f k l a
A

) , 2 , 1 ( f k l a
A

) , , ( f k l a
) , , ( f k l a
A
M
u
x
Comparator
0
1
) , , 1 ( f k l a
A

) , , ( f k l a
) , , ( f k l a
A
( ) ) , , 1 ( min
1
m K l a
m A
J m

s s
For the internal Frames in a node For the Iirst Irame in a node
1 , 2 , 1 , = K k a
k
e a ,
0
decision inIormation

Iig. 7. Archileclure of lhe processing eIenenl.

3.3 The Design of the Pattern Extraction Core
To exlracl lhe lesl pallern sequence, lhis vork presenls a lIock-node pallern exlraclion
nelhod. Iigure 8 exenpIifies lhe reIalionship lelveen nodes and lIocks. The lIock
loundaries are lhe sane as lhe pallern loundaries. This lenpIale relrievaI space conlains 24
SOC Design for Speech-to-Speech Translation 309

lIocks, nunlered fron O lo 23. Iocks in lhe sane rov leIong lo lhe sane pallern, and have
lhe sane nunler of nodes (franes). Ior exanpIe, lIocks 2, 5, 8, ., 23 leIong lo lhe lhird
pallern vilhin lhis lenpIale. The pallern exlraclion nelhod used in lhis vork can le
regarded as a lvo-IeveI address decoding process. Al lhe firsl IeveI, lhe lIocks are lhe
addressing unils. Lach lIock consisls of frane nodes, vhich lecone addressing unils al lhe
second IeveI. If | denoles lhe lIock index and | denoles lhe IocaI node index in a lIock, each
node can aIso le referred lo ly lhe pair (|, |). Ior exanpIe, ncc 64 can le referred lo as ncc
(21, 1).
The decision vord is generaled for each lIock, and is denoled ly
} , ..., , , ,
1 2 1 0
e a a a a D
K

, (21)
vhere K is lhe Iargesl nunler anong aII lhe lIock node ones, c is lhe pallern decision
infornalion fron lhe firsl node of a lIock, and
|
,
1 0 s s K k
, is lhe decision infornalion
fron lhe |-lh node of a lIock. We use c lo indicale lhe source pallern for an exlernaI
(lelveen-pallerns) lransilion, and
|
lo indicale lhe source node for an inlernaI (vilhin-
pallerns) lransilion. Assune lhal K
j
denoles lhe nunler of nodes in lhe j-lh pallern.
Iallern exlraclion slarls fron lhe end node, and recursiveIy updale lhe IocaI node index and
lhe lIock index lo conslrucl a lesl palh in lhe lenpIale relrievaI space. The lIock index | is
updaled ly

=
, external Ior ,
, internal Ior ,

transition e ola e new J b
transition J b
b
(22)
vhere ] is lhe nunler of pallerns in lhis lenpIale, ncu_c is lhe currenl vaIue of lhe decision
infornalion c, and c|_c is lhe previous vaIue of lhe decision infornalion c.
As for lhe IocaI node index |, il is updaled ly

=
=
=
=
on. transiti external Ior 1,
2, on with transiti internal Ior , 2
1, on with transiti internal Ior , 1
, 0 on with transiti internal Ior ,

e
K
a k
a k
a k
k
k
k
k
(23)
Lel us denole lhe lesl palh end in lhe lIock ly cn_||cc|, and ils pallern ly cn_pa||crn. The
fIov charl of lhe aIgorilhn is depicled in Iig. 9.
The lIock-node pallern exlraclion nelhod is inpIenenled as a finile slale nachine (ISM).
ecause lenpIale relrievaI and pallern exlraclion are required for aII lenpIales, lhis design
is lased on a lvo-nenory schene. WhiIe lhe decision vord is leing vrillen inlo RAM1, lhe
lack lracing is leing perforned in RAM2, and vice versa. This reduces lhe overaII
conpulalion line of lenpIale relrievaI and pallern exlraclion for aII lenpIales.

VLS 310

First
Pattern
Third
Pattern
Second
Pattern






Iig. 8. LxanpIe iIIuslraling lhe lIock-node addressing nelhod.

kK
e
-1;
bb-Jnewe-olae;
Current node is a Iirst
node oI a block?
End Yes
No
Yes
No
External Transition? Yes
No
kk, iI a
k
0;
kk-1, iI a
k
1;
kk-2, iI a
k
2;
bb-J;
Pattern extraction
Irom the end node (b, k)
noae (b, k)
Current
block is in the Iirst
column oI the block-
node addressing
method?

Iig. 9. IIov charl of lhe lIock-node pallern exlraclion.

SOC Design for Speech-to-Speech Translation 311

3.4 The Design of the ADC/DAC
The Iasl specific hardvare core is ADC/DAC. In lhis parl, ve nainIy presenl an efficienl
archileclure lo inprove lhe convenlionaI fuIIy differenliaI successive approxinalion ADC.
The proposed archileclure incIudes lvo oplinaI design soIulions. Iirsl, in order lo avoid lhe
nisnalch lelveen posilive and negalive reference voIlages, a singIe reference voIlage (SRV)
nelhod lhal renoves lhe negalive reference voIlage is deveIoped ly lhe dislincl svilched
capacilor conlroIIing aIgorilhn. Iigure 1O shovs lhe inpIenenlalion of lhe SRV aIgorilhn
in circuil. Second, a snaII area schene for successive approxinalion regisler (SAR) is
inlroduced. Conparing lo lhe lradilionaI successive approxinalion ADC, our proposed
archileclure is conlrilulive lo inprove lhe area saving in SAR and lhe accuracy of ADC.


Iig. 1O. The archileclure of fuIIy differenliaI successive approxinalion ADC using singIe
reference voIlage.

To reduce lhe area for successive approxinalion regisler, lhis vork presenls a sinpIified
non-redundanl successive approxinalion regisler (SSAR). The lasic archileclure of lhe
SSAR is a nuIlipIe inpul n-lil shifl regisler, shovn in Iig. 11. This nuIlipIe inpul regisler
consisls of a generaI D fIip-fIop and a nuIlipIexer vilh lvo inpuls. The funclion of SSAR is
shovn in Iig. 11. Suppose lhe 1 is lhe inpul loken in Iig. 12. Whenever lhe SSAR is
lriggered, each regisler of SSAR nusl go lhrough lhree nodes: (1) vhen lhe loken has nol
passed yel, lhe vaIues of lhe regislers are O, (1) vhen lhe loken is slaying al a cerlain
regisler, ils regisler vaIue is changed lo 1, and receives lhe resuIl of conparalor, (3) vhen
lhe loken has passed lhrough, lhe vaIue delernined ly lhe resuIl of conparalor is heId unliI
lhe vhoIe conversion is done. The funclion can le inpIenenled as leIov.
Therefore, vhen lhe vaIue of |-lh regisler of lhe SSAR is sliII O, lhe Q
|+1
seIecled ly lhe
nuIlipIexer is connecled lo D
|
. If lhe |-lh regisler receives lhe loken, lhe Q
|
is changed lo 1,
VLS 312

and lhe CMI vouId le seIecled. The resuIl of conparalor Z
|
vouId le heId in lhe |-lh
regisler unliI lhe vhoIe conversion is done. The syslen can le inpIenenled ly lhe lIocked
CLK
|
. The conlinalion of aII alove inpIenenlalions nakes il possilIe lo inprove lhe
accuracy and save lhe chip area.
CLK CLK CLK
D Q D Q D Q
Mux
CMI
Mux
CLK
D Q
CMI
Mux
Q Q Q
Qb Qb
Qb
MS
LS
CMI
lh lh
lh
Q
CLK

Iig. 11. The archileclure of lhe sinpIified non-redundanl SAR.

1 0 0
1 0 Z
2
1 Z
2
Z
1
Z
2
Z
1
Z
0
0
0
0
1
1st cycle
2nd cycle
cycle
4th cycle
3rd

Iig. 12. The funclion of sinpIified non-redundanl SAR.

4. Summary

Speech-lo-speech nachine lransIalion is a prospeclive appIicalion of speech and Ianguage
lechnoIogy. This vork presenls an MTS lased speech-lo-speech lransIalion syslen lelveen
Mandarin Chinese and LngIish. The proposed MTS approach achieves aloul a 9O
underslandalIe lransIalion rale on average. Ior a porlalIe and reaI-line speech-lo-speech
lransIalion syslen, lhis vork aIso proposes lhe SOC reaIizalion. The archileclure design is
lased on lhe seni-ASIC lechnique, vhich incorporales a cosl efficienl progrannalIe core
aIong vilh specific hardvare acceIeralors: an LIC exlraclion core, a lenpIale relrievaI core,
and a pallern exlraclion core. esides, lhe ADC/DAC are aIso incIuded. A SRV-lased fuIIy
differenliaI successive approxinalion ADC vilh reduced SSAR is designed. These hardvare
cores conslrucl a conpIele speech-lo-speech lransIalion SOC. The proposed SOC chip is lhe
firsl one dedicaled for speech-lo-speech lransIalion.

SOC Design for Speech-to-Speech Translation 313

5. References

|1jA. Lavie c| a|., }ANUS III: speech-lo-speech lransIalion in nuIlipIe Ianguages, in Prcc.
|||| |n|. Ccnf. Accus|ics, Spcccn an Signa| Prcccssing, Apr. 1997, pp. 99-1O2.
|2jW. WahIsler, Vcr|mc|i|. |cuna|icns cf Spcccn-|c-Spcccn Trans|a|icn. erIin HeideIlerg,
Nev York: Springer-VerIag, 2OOO.
|3jH. Ney, S. Nieen, I. }. Och, H. Savaf, C. TiIInann, and S. VogeI, AIgorilhns for
slalislicaI lransIalion of spoken Ianguage, |||| Trans. Spcccn an Auic Prcccssing,
voI. 8, pp. 24-36, }an. 2OOO.
|4jI. Casaculerla c| a|., Speech-lo-speech lransIalion lased on finile-slale lransducers, in
Prcc. |||| |n|. Ccnf. Accus|ics, Spcccn an Signa| Prcccssing, May 2OO1, pp. 613-616.
|5jI. Sugaya, T. Takezava, A. Yokoo, and S. Yananolo, Lnd-lo-end evaIualion in ATR-
MATRIX: speech lransIalion syslen lelveen LngIish and }apanese, in Prcc. 6|n
|ur. Ccnf. Spcccn Ccmmunica|icn an Tccnnc|cgq, Sep. 1999, pp. 2431-2434.
|6jI. C. Ching and H. H. Chi, ISIS: a lriIinguaI conversalionaI syslen vilh Iearning
capaliIilies and conlined inleraclion and deIegalion diaIogs, in Prcc. Na|icna|
Ccnf. Man-Macninc Spcccn Ccmmunica|icns, Nov. 2OO1, pp. 119-124.
|7jR. Isolani, K. Yanalana, S. Ando, K. Hanazava, S. Ishikava, T. Lnori, H. Hallori, A.
Okunura, and T. Walanale, An aulonalic speech lransIalion syslen on IDAs for
lraveI conversalion, in Prcc. |||| |n|. Ccnf. Mu||imca| |n|crfaccs, Ocl. 2OO2, pp.
211-216.
|8jA. WaileI, A. adran, A. W. Iack, R. Irederking, D. Cales, A. Lavie, L. Levin, K. Lenzo,
L. M. Tonokiyo, }. Reicherl, T. SchuIlz, D. WaIIace, M. Woszczyna, and }. Zhang,
SpeechaIalor: lvo-vay speech-lo-speech lransIalion on a consuner IDA, in Prcc.
|urcpcan Ccnf. Spcccn Ccmmunica|icn an Tccnnc|cgq, Sep. 2OO3, pp. 369-372.
|9jT. Walanale, A. Okunura, S. Sakai, K. Yanalana, S. Doi, and K. Hanazava, An
aulonalic inlerprelalion syslen for lraveI conversalion, in Prcc. |n|. Ccnf. Spc|cn
|anguagc Prcccssing, Sep. 2OOO, pp. IV-444-IV-447.
|1Oj}. I. Wang, . Z. Houg, and S. C. Lin, A sludy for Chinese lexl lo Taivanese speech
syslen, in Prcc. |n|. Ccnf. Rcscarcn cn Ccmpu|a|icna| |inguis|ics, Aug. 1999, pp. 37-
53.
|11jM. Sinard, TransIalion spolling for lransIalion nenories, in Prcc. H|T-NAAC|
Icr|sncp cn 8ui|ing an Using Para||c| Tcx|s. Da|a Dritcn Macninc Trans|a|icn an
8cqcn, May 2OO3, pp. 65-72.
|12j}. Veronis and I. LangIais, LvaIualion of paraIIeI lexl aIignnenl syslens - lhe ARCADL
projecl, in Para||c| Tcx| Prcccssing, Dordrechl: KIuver Acadenic, 2OOO, pp. 369-388.
|13jL. Raliner and . H. }uang, |unamcn|a|s cf Spcccn Rcccgni|icn. Irenlice-HaII, Inc., 1993.
|14j}. I. Wang, A. N. Suen, and C. K. Chieh, A progrannalIe appIicalion specific
archileclure for reaI-line speech recognilion, in Prcc. cf V|S| Dcsign/CAD
Sqmpcsium, Aug. 1995, pp. 261-264.
|15jY. Zhang, Survey of currenl speech lransIalion research, presenled al MuIliIinguaI
Speech-lo-Speech TransIalion Seninar, Carnegie MeIIon Universily, Iillslurgh,
IA, 2OO3.
|16jY. S. Lee and S. Roukos, IM Spoken Language TransIalion Syslen LvaIualion, in
Prcc. |NT|RSP||CH2004 Icr|sncp cn Spc|cn |anguagc Trans|a|icn. |ta|ua|icn
Campaign cn Spc|cn |anguagc Trans|a|icn, Ocl. 2OO4, pp.39-46.
VLS 314

|17jL. Cu and Y. Q. Cao, On Iealure SeIeclion in Maxinun Lnlropy Approach lo SlalislicaI
Concepl-lased Speech-lo-Speech TransIalion, in Prcc. |NT|RSP||CH2004
Icr|sncp cn Spc|cn |anguagc Trans|a|icn. |ta|ua|icn Campaign cn Spc|cn |anguagc
Trans|a|icn, Ocl. 2OO4, pp.115-121.
|18jS. Nakanura, K. Markov, T. }ilsuhiro, }. S. Zhang, H. Yananolo and C. Kikui, MuIli-
LinguaI Speech Recognilion Syslen for Speech-To-Speech TransIalion, in Prcc.
|NT|RSP||CH2004 Icr|sncp cn Spc|cn |anguagc Trans|a|icn. |ta|ua|icn Campaign
cn Spc|cn |anguagc Trans|a|icn, Ocl. 2OO4, pp.146-154.
|19jK. Malsui, Y. Wakila, T. Konuna, K. Mizulani, M. Lndo, and M. Murala, An
experinenlaI nuIliIinguaI speech lransIalion syslen, in Prcc. |CM|-PU|, 2OO1, pp.
1-4.
|2OjS. Rossalo, H. Ianchon, and L. esacier, Speech-lo-speech lransIalion syslen
evaIualion: ResuIls for Irench for lhe NespoIe! projecl firsl shovcase, in Prcc.
|CS|P, 2OO2, pp. 19O5-19O8.
|21jI. Casaculerla, L. VidaI, and }. M. ViIar, Archileclures for speech-lo-speech lransIalion
using finile-slale nodeIs, in Prcc. AC| Icr|sncp cn Spcccn-|c-Spcccn Trans|a|icn.
A|gcri|nms an Sqs|cms, 2OO2, pp. 39-44.
|22jH. Ney, Speech lransIalion: CoupIing of recognilion and lransIalion, in Prcc. |CASSP,
1999, pp. 517-52O.
|23j}. I. Wang and S. C. Lin, and H.W. Yang, MuIlipIe-TransIalion Spolling for Mandarin-
Taivanese Speech-lo-Speech TransIalion, |n|. ]curna| cf Ccmpu|a|icna| |inguis|ics
an Cnincsc |anguagc Prcccssing, voI.9, no.2, 2OO4, pp. 13-28.
|24jW. VerheIsl and M. RoeIands, An overIap-add lechnique lased on vaveforn siniIarily
(WSOLA) for high quaIily line-scaIe nodificalion of speech, in Prcc. |CASSP,
1993, pp. 554-557.
|25jM. DenoI, K. Slruyve, W. VerheIsl, H. IauIussen, I. Desnel, and I. Verhoeve, Lfficienl
non-uniforn line-scaIing of speech vilh WSOLA for CALL appIicalions,
presenled al InSTIL/ICALL Synp. Conpuler Assisled Learning, Venice, IlaIy,
2OO4.
|26jH. C. IIk and S. Tugac, ChanneI and source consideralions of a lil-rale reduclion
lechnique for a possilIe vireIess connunicalions syslen's perfornance
enhancenenl, |||| Trans. Iirc|css Ccmmunica|icns, voI. 4, no. 1, pp. 93-99, }an.
2OO5.
|27jL. T. CorneIius, |ng|isn 900. Iace Croup InlernalionaI Inc., 1999.
|28j. L. agneII and M. Lee, Ncu G|c|c |ng|isn Ccursc cn Tratc|. Nev CIole IulIishing Co.,
LTD, 199O.
|29j}. I. Wang and S. H. Chen, Speech Lnhancenenl Using Ierceplion WaveIel Iackel
Deconposilion and Teager Lnergy Operalor, }ournaI of VLSI SignaI Irocessing,
VoI. 36 I: 2-3, pp. 125-139, Iel. 2OO4..
|3Oj}. I. Wang, C. H. Yang, and K. H. Chang, Design of a Sulspace Tracking ased Speech
Lnhancenenl Syslen, in Prcc. |||| T|NCON 2004, voI. 1, pp. 147-15O..
|31j}. T. Chien and C. H. Huang, ayesian Learning of Speech Duralion ModeIs, ||||
Trans. Spcccn an Auic Prcccssing, voI. 11 I:6, pp. 558-567, Nov. 2OO3.
|32j}hing-Ia Wang, }ia-Ching Wang, Han-Chiang Chen, Tai-Lung Chen, Chin-Chan Chang,
and Ming-Chi Shih, Chip Design of IorlalIe Speech Menopad SuilalIe for
SOC Design for Speech-to-Speech Translation 315

Iersons vilh VisuaI DisaliIilies, |||| Transac|icns cn Spcccn an Auic Prcccssing,
voI. 1O, no. 8, pp. 644-658, Novenler 2OO2.
|33j}. MakhouI, Linear prediclion: a luloriaI reviev, Spcccn Ana|qsis, ILLL Iress, Nev
York, 1979.
|34jL. Y. Liu, }. I. Wang, }. Y. Lee, M. H. Sheu, and Y. L. }eang, "An ASIC design for Iinear
prediclive coding of speech signaIs," in Prcc. |urc AS|C' 92, 1992, pp.288-291.
|35jS. Iurui, CepslraI anaIysis lechnique for aulonalic speaker verificalion, ||||
Transac|icns cn Accus|ics, Spcccn, an Signa| Prcccssing, voI. ASSI-29, no. 2, pp. 254-
272, 1981.
VLS 316
A Novel De Bruijn Based Mesh Topology for Networks-on-Chip 317
X

A NoveI De Bruijn Based Mesh
TopoIogy for Networks-on-Chip

Reza Sallaghi-Nadooshan
1
, Mehdi Modarressi
2,3

and Hanid Sarlazi-Azad
2,3

1
|s|amic Aza Unitcrsi|q Ccn|ra| Tcnran 8rancn, Tcnran, |ran
2
Snarif Unitcrsi|q cf Tccnnc|cgq, Tcnran, |ran
3
|PM Scncc| cf ccmpu|cr scicncc, Tcnran, |ran

1. Introduction

The nesh lopoIogy is lhe nosl doninanl lopoIogy for lodays reguIar liIe-lased NoCs. Il is
veII knovn lhal nesh lopoIogy is very sinpIe. Il has Iov cosl and consunes Iov pover.
During lhe pasl fev years, nuch efforl has leen nade lovard underslanding lhe
reIalionship lelveen pover consunplion and perfornance for nesh lased lopoIogies
(Srivasan el aI., 2OO4). Despile lhe advanlages of neshes for on-chip connunicalion, sone
packels nay suffer fron Iong Ialencies due lo Iack of shorl palhs lelveen renoleIy Iocaled
nodes. A nunler of previous vorks lry lo lackIe lhis shorlconing ly adding sone
appIicalion-specific Iinks lelveen dislanl nodes in lhe nesh (Ogras & MarcuIescu, 2OO5)
and lypassing sone inlernediale nodes ly inserling express channeIs (DaIIy, 1991), or
using sone olher lopoIogies vilh Iover dianeler (sallaghi el aI., 2OO8).
The facl lhal lhe de ruijn nelvork has a Iogarilhnic dianeler and a cosl equaI lo lhe Iinear
array lopoIogy nolivaled us lo evaIuale il as an underIying lopoIogy for on-chip nelvorks.
De ruijn lopoIogy is a veII-knovn nelvork slruclure vhich vas iniliaIIy proposed ly de
ruijn (de ruijn, 1946) as an efficienl lopoIogy for paraIIeI processing. Sanalhan (Sanalhan
& Iradhan, 1989) shoved lhal de ruijn nelvorks are suilalIe for VLSI inpIenenlalion, and
severaI olher researchers have sludied lopoIogicaI properlies, rouling aIgorilhns, efficienl
VLSI Iayoul and olher inporlanl aspecls of lhe de ruijn nelvorks (Iark & AgravaI, 1995,
Canesan & Iradhan, 2OO3).
In lhis chapler, ve propose a lvo-dinensionaI de ruijn lased nesh lopoIogy (2D DM for
shorl) for NoCs. We viII conpare equivaIenl nesh and 2D DM archileclures using lhe lvo
nosl inporlanl faclors, nelvork Ialency and pover consunplion. A rouling schene for lhe
2D DM nelvork has leen deveIoped and lhe perfornance and pover consunplion of lhe
lvo nelvorks under siniIar vorking condilions have leen evaIualed using sinuIalion
experinenls. SinuIalion resuIls shov lhal lhe proposed nelvork can oulperforn ils
equivaIenl popuIar nesh lopoIogy in lerns of nelvork perfornance and energy dissipalion.

16
VLS 318

2. The 2D DBM TopoIogy

2.1 The Structure
The lasic idea aloul our vork is lased on digraphs, and in lhis seclion, ve presenl lhe
infornalion conpiIed fron sludies conducled on lhe de ruijn digraph (Liu & Lee, 1993 ,
Mao & Yang , 2OOO).
The de ruijn lopoIogy has nany appIicalions in connunicalion nelvorks and paraIIeI
processing (Sananalhan & Iradhan, 1989). A de ruijn graph has |
n
nodes. Lach node
) ,..., (
0 1
u u u
n
=
has an edge lo node
) ,..., (
0 1
v v v
n
=
if and onIy if
1
=
i i
u v
1 1 s s n i . Node t
has a unidireclionaI direcl Iink lo node u if and onIy if
) (mod
n
k r k v u + =
, 1 0 s s k r
(1)

The in-degree and oul-degree of a node is equaI lo |. Therefore, lhe degree of each node is
equaI lo 2|. The dianeler of a de ruijin graph is equaI lo n vhich is oplinaI. The de ruijn
aIso has a sinpIe rouling aIgorilhn. Case |=2 is lhe nosl popuIar de rujin nelvork vhich
is aIso used in lhis sludy. Due lo lhe Iogarilhnic (oplinaI) dianeler and a sinpIe rouling
aIgorilhn, il can le expecled lhal lhe lraffic on channeIs in lhe nelvork viII le Iess lhan
olher nelvorks, resuIling in a leller perfornance (Canesan & Iradhan, 2OO3).
LxanpIes of de ruijn nelvorks are iIIuslraled in Iig. 1. SeveraI researchers have sludied lhe
lopoIogicaI properlies (Liu & Lee, 1993 , Mao & Yang , 2OOO) and efficienl VLSI Iayoul
(Sananalhan & Iradhan, 1989, Chen el aI., 1993) of lhe de ruijn nelvorks. Moreover, lhe
scaIaliIily prolIen of de ruijn nelvorks is addressed in (Liu & Lee, 1993). In de rujin
nelvork dala is circuIaled fron node lo node unliI il reaches ils deslinalion. Lach node has
lvo oulgoing (inconing) conneclions lo (fron) olher nodes via shuffIe (rolale Iefl ly one lil)
and shuffIe-exchange (rolale Iefl ly one lil and conpIenenl LS) operalions lo neighloring
nodes (Louri & Sung, 1995). Oving lo lhe facl lhal lhese conneclions are unidireclionaI, lhe
degree of lhe nelvork is lhe sane as lhe one-dinensionaI nesh nelvorks (or Iinear array
nelvork). The dianeler of a de ruijn nelvork vilh size N, lhal is, lhe dislance lelveen
nodes O and N-1, is equaI lo Iog (N).


(a)

(l)
Iig. 1. The de ruijn nelvork vilh (a) 8 nodes and (l) 16 nodes
A Novel De Bruijn Based Mesh Topology for Networks-on-Chip 319

2.2 The 2D DBM TopoIogy
In a 2D DM nelvork, lhe nodes in each rov and each coIunn forn a de ruijn nelvork.
Lach node has lvo oulgoing edges aIong vhich dala packels can le senl lo olher nodes, and
lvo inconing Iinks receiving dala packels fron olher nodes in each dinension. Thus, node
(u, t),
) ,..., (
0 1
u u u
n
=
,
) ,..., (
0 1
v v v
n
=
, has edges lo node (u, t) if and onIy if
1
'

=
i i
u u
for
1 1 s s n i , and node (u, t), if and onIy if
1
'

=
i i
v v
for 1 1 s s n i . Node (u, t) has a direcl
Iink lo node (u, t) if and onIy if

n
r u u
2 mod
' 2 + =
,
1 , 0 = r
(2)

and lo node (u, t) if and onIy if

n
r v v
2 mod
' 2 + =
,
1 , 0 = r
(3)

The 88 2D DM is shovn in Iig. 2.























Iig. 2. A 2D DM vilh 64 nodes conposed fron eighl 8-node de ruijn nelvorks (as shovn
in Iig.1. a) aIong each dinension

The 2D DM nelvorks have sone inleresling lopoIogicaI properlies lhal nolivale us lo
consider lhen as a suilalIe candidale for on-chip nelvork archileclures. The nosl inporlanl
properly of 2D DM nelvorks is lhal vhiIe lhe nunler of Iinks in a 2D DM and an equaI-
sized nesh are exaclIy lhe sane, lhe nelvork dianeler of lhis nelvork is Iess lhan lhal of lhe
1 , 0 2 , 0 3 , 0 4 , 0 5 , 0 6 , 0 7 , 0 0 , 0
1 , 1 2 , 1 3 , 1 4 , 1 5 , 1 6 , 1 7 , 1 0 , 1
1 , 2 2 , 2 3 , 2 4 , 2 5 , 2 6 , 2 7 , 2 0 , 2
1 , 3 2 , 3 3 , 3 4 , 3 5 , 3 6 , 3 7 , 3 0 , 3
1 , 4 2 , 4 3 , 4 4 , 4 5 , 4 6 , 4 7 , 4 0 , 4
1 , 5 2 , 5 3 , 5 4 , 5 5 , 5 6 , 5 7 , 5 0 , 5
1 , 6 2 , 6 3 , 6 4 , 6 5 , 6 6 , 6 7 , 6 0 , 6
1 , 7 2 , 7 3 , 7 4 , 7 5 , 7 6 , 7 7 , 7 0 , 7
VLS 320

nesh. More preciseIy, lhe dianeler of a 2D DM and a nesh are 2|cg N
O.5
and 2(N
O.5
-1),
respecliveIy, vhere N represenls lhe nelvork size.
AIlhough, eslalIishing lhe nev Iinks renoves lhe Iink lelveen sone adjacenl nodes (for
exanpIe 1 lo O, 2 lo 1 and 3 lo 4 conneclions in Iig. 1) and increases lheir dislance ly one
hop. In lhis nelvork, hovever, lhe dislance lelveen nany nodes is decreased ly one or
nuIlipIe hops, conpared lo a nesh, and lhis can Iead lo a consideralIe reduclion in lhe
average inler-node dislance in lhe nelvork.
The 2D DM Iinks are unidireclionaI and al nosl 8 unidireclionaI Iinks are used per node.
This is equaI lo lhe nunler of Iinks connecled lo a node in a nesh (vhich is 4 lidireclionaI
Iinks). Since lhe node degree of a lopoIogy has an inporlanl conlrilulion in (and usuaIIy
acls as lhe doninanl faclor of) lhe nelvork cosl, lhe proposed lopoIogy can achieve Iover
average dislance lhan a 2D nesh vhiIe il has aInosl lhe sane cosl. Hovever, ve viII
discuss lhe area overhead due lo Ionger Iinks in 2D DM in nexl seclions.

2.3 Routing AIgorithm
During pasl years, a nunler of rouling aIgorilhns have leen deveIoped for lhe originaI de
ruijn nelvork. Canesan (Canesan & Iradhan, 2OO3) proposed a rouling aIgorilhn vhich
roules lhe packels lovard lhe deslinalion ly changing one lil al a line, slarling fron lhe
nosl significanl lil of an n-lil address in a nelvork of size 2
n
. Al lhe i
lh
slep of lhis
aIgorilhn, lhe n-i
lh
lil of lhe deslinalion address is conpared lo lhe MS of lhe currenl
address. If lhey are equaI, lhe nessage is rouled over lhe shuffIe channeI, lo keep lhe lil
unchanged and rolale lhe address. Olhervise, lhe nessage is rouled over lhe shuffIe-
exchange channeI lo nake lhe lvo lils idenlicaI and lhen rolale lhe address (seIf-Ioops are
avoided). This aIgorilhn invoIves a naxinun of n sleps. In order lo le deadIock-free, lhis
aIgorilhn requires n virluaI channeIs and lhe nessage uses lhe i
lh
channeI al lhe n-i
lh
slep.
Since in lhis virluaI channeI seIeclion scenario rouling is perforned in a descending order
regarding channeI nunlers, lhe dependency graph of virluaI channeIs is acycIic and lhe
rouling is deadIock-free (DaIIy & Seilz, 1987).
Canesan (Canesan & Iradhan, 2OO3) spIils lhe nelvorks inlo lvo lrees: T1 and T2. A
nessage is rouled lelveen T1 lo T2, and lhen in T2, and lhen lhe nessage is rouled lelveen
T2 lo T1 and lhen in T1. Therefore, lhis rouling aIgorilhn has four differenl sleps, vhich
nay aIso decrease, depending on lhe source and deslinalion nodes. The aIgorilhn has lvo
phases and iniliales vilh phase O (using virluaI channeI O). When lhe nessage goes lhrough
T2 lo T1, lhe phase nunler increases (virluaI channeI 1 is used). Canesan (Canesan &
Iradhan, 2OO3) proved lhal lhis nelhod is deadIock-free. The lvo lrees, T1 and T2, are
depicled in Iig. 3 for N=8.



Iig. 3. Trees T1 and T2 for N=8

A Novel De Bruijn Based Mesh Topology for Networks-on-Chip 321

Iark (Iark & AgravaI, 1995) has deforned lhe de ruijn as lvo graphs, increasing and
decreasing. In lhe i
lh
slages, lhe MS of lhe currenl node is conpared vilh lhe n-i
lh
lil of lhe
deslinalion node, and if lhey are lhe sane, shuffIe cycIe is used, olhervise, shuffIe-exchange
is used. If lhe palh svilches fron increasing graph lo lhe decreasing graph, lhe virluaI
channeI nunler increnenls ly 1. Il is proved lhal lhe nenlioned aIgorilhn for lhe de rujin
nelvork is deadIock-free (Iark & AgravaI, 1995). Iark (Iark & AgravaI, 1995) has shovn
lhal for a de rujin nelvork vilh N nodes (N=2
n
) lhis kind of rouling requires

Nunler of V.C.
(


=
2
1 n
n

(4)

virluaI channeIs lo ensure deadIock freedon.
So, for N=8, il requires 2 virluaI channeIs. Ior N=16 nodes, (n=4), lased on equalion (4), lhe
nunler of virluaI channeIs needed for a deadIock-free rouling is equaI lo 3. VirluaI channeI
O is nore crovded lhan lhe olher virluaI channeIs. The rouling aIgorilhn uses virluaI
channeIs unevenIy and very fev packels use aII lhe virluaI channeIs.
We revised lhe rouling aIgorilhn lo laIance lhe use of virluaI channeIs. In our proposed
soIulion, al each node, lhe header fIil has a degree of fIexiliIily in seIecling virluaI channeIs,
as used in (Kiasari el aI., 2OO5). Ior exanpIe, for N=16, sone of lhe source nodes nay use
onIy 1 virluaI channeI. Sone nodes use 2 virluaI channeIs and a fev source nodes use 3
virluaI channeIs. Ior lhe Iasl group (lhose use 3 virluaI channeIs), virluaI channeI O is
seIecled in lhe source node. Ior lhe second group, ve slarl vilh virluaI channeI 1 and for lhe
firsl group ve slarl fron virluaI channeI 2. If virluaI channeI 2 is occupied, lhen ve can use
virluaI channeI 1 and finaIIy virluaI channeI O is chosen. Using lhis nelhod, virluaI channeIs
are used unifornIy.
In lhis chapler, for rouling in lhe 2D DM, ve use lhe revised nelhod of Iark (Iark &
AgravaI, 1995) in each dinension and ve choose lhe palhs and nodes in a vay lhal lhe
rouling is nininaI. Like XY rouling in nesh nelvorks, lhe delerninislic rouling firsl appIies
lhe rouling nechanisn in rovs in order lo deIiver lhe packel lo lhe coIunn al vhich lhe
deslinalion is Iocaled. Aflervards, lhe nessage is rouled lo lhe deslinalion ly appIying lhe
sane rouling aIgorilhn in lhe coIunns. OlviousIy, adding lhe second dinension in lhis
rouling schene does nol generale a cycIe and lhe vhoIe rouling in lhe 2D nelvork is
deadIock-free provided lhal lhe rouling in each dinension is deadIock-free (Dualo el aI.,
2OO5).

3. SimuIation ResuIts

3.1 SimuIator
To sinuIale lhe proposed NoC lopoIogy, ve have used an inlerconneclion nelvork
sinuIalor lhal is deveIoped lased on IOINLT sinuIalor (Iopnel, 2OO7) vilh Orion pover
Iilrary enledded in il. Orion is a Iilrary vhich nodeIs lhe pover consunplion of lhe
inlerconneclion nelvorks (Wang el aI., 2OO2). Iroviding delaiIed pover characlerislics of lhe
nelvork eIenenls, Orion enalIes lhe designers lo nake rapid pover perfornance lradeoffs
al lhe archileclure IeveI (Wang el aI., 2OO2). As nenlioned in (Wang el aI., 2OO2), lhe lolaI
energy each fIil consunes al a specified node and ils oulgoing Iink is given ly
VLS 322

flit wrt arb reaa xb link
E E E E E E = + + + +
(5)
Il consisls of five conponenls: 1) lhe pover lhal is consuned for vriling inlo luffers, 2) lhe
pover of arliler, 3) lhe pover lhal is consuned in reading fron luffers, 4) lhe pover of lhe
inlernaI crosslar, and 5) lhe pover lhal is consuned in lhe Iinks.
The IOINLT sinuIalor is onIy inlroduced for a lvo-dinensionaI nesh lopoIogy (Iopnel,
2OO7). We have cuslonized lhe sinuIalor lo supporl olher lopoIogies and olher rouling
aIgorilhns such as shuffIe-exchange (Sallaghi el aI., 2OO8), and de ruijn lopoIogies. The
pover is ollained and reporled for each Iayer and each conponenl of lhe nelvork.
We sel lhe nelvorks Iink vidlh lo 32 lils. Lach Iink has lhe sane landvidlh and one fIil
lransnission is aIIoved on a Iink. The pover is caIcuIaled lased on a NoC vilh 9O nn
lechnoIogy vhose roulers operale al 25O MHz. ased on lhe core size infornalion presenled
in (MuIIins el aI., 2OO6), ve sel lhe vidlh of lhe II cores lo 2 nn, and lhe Ienglh of each vire
is sel lased on lhe nunler of cores il passes. The sinuIalion resuIls are ollained for 88
and 1616 nesh NoCs vilh XY rouling aIgorilhn, and 88 and 1616 de ruijn NoCs using
lhe rouling aIgorilhns descriled in lhe previous seclion. The nessage Ienglh is assuned lo
le 32 and 64 fIils and 2 and 3 virluaI channeIs per physicaI channeI are used. Messages are
generaled according lo a Ioisson dislrilulion vilh rale . The lraffic pallern can le Unifcrm,
Ma|rix-|ranspcsc, and Hc|spc| (Dualo el aI., 2OO5).

3.2 Comparison ResuIts
In lhis seclion, ve evaIuale lhe 2D DM and conpare il vilh lhe nesh, lhe nosl connon
lopoIogy for NoCs. ased on lhe neckIace properlies in de ruijn Iayoul (Chen el aI., 1993),
ve have considered a nore efficienl Iayoul for each rov and coIunn of lhe 2D DM as
shovn in Iig. 4. Wilh lhis nev Iayoul lhe lolaI vire Ienglh used in lhe nelvork is decreased.
Ior exanpIe, for an 88 2D DM aloul 25 reduclion in lolaI vire Ienglh is ollained and
lhis can le increased lo a 5O reduclion for a 1616 2D DM.


Iig. 4. A leller node pIacenenl of de ruijn nelvork of 8 nodes

As nenlioned lefore, ve have aIso revised lhe rouling aIgorilhn in (Iark & AgravaI, 1995) lo
have laIanced use of virluaI channeIs. Iig. 5 conpares lhe perfornance of originaI rouling
aIgorilhn and lhe nev rouling aIgorilhn in lhe 88 2D DM using 2 virluaI channeIs vilh
nessages of 32 fIils. As can le seen in lhe figure, lhe nev aIgorilhn exhilils leller
perfornance in lerns of average nessage Ialency.
Iigures 6-9 conpare pover consunplion and lhe perfornance of sinpIe 2D nesh and 2D
DM NoCs under various lraffic pallerns, nelvork sizes and nessage Ienglhs. In Iig. 6 and
Iig. 7, lhe average nessage Ialency is dispIayed as a funclion of nessage generalion rale al
each node for lhe 88 and 1616 nelvorks under delerninislic rouling. As can le seen in lhe
figures, lhe 2D DM NoC achieves a reduclion in nessage Ialency vilh respecl lo lhe popuIar
2D nesh nelvork for lhe fuII range of nelvork Ioad under various lraffic pallerns (especiaIIy
in uniforn lraffic). Nole lhal for nalrix-lranspose lraffic Ioad, il is assuned lhal 3O of
nessages generaled al a node are of nalrix-lranspose lype (i.e. node (x,q) sends lhe nessage lo
A Novel De Bruijn Based Mesh Topology for Networks-on-Chip 323

node (q,x)) and lhe resl of 7O nessages are senl lo olher nodes unifornIy. Ior holspol lraffic
Ioad a holspol rale of 16 is assuned (i.e. each node sends 16 of ils nessages lo lhe holspol
node (node (4,4) in 88 nelvork and node (8,8) for 1616 nelvork) and lhe resl of 84 of
nessages are senl lo olher nodes unifornIy). Nole lhal increasing lhe nelvork size causes
earIier saluralion in a sinpIe 2D nesh.

8-8 v=2 m=32
100
200
300
400
0.001 0.002 0.003 0.004 0.005
Message generation rate ()
A
v
e
r
a
g
e

d
e
I
a
y

(
c
y
c
I
e
s
)
with v.c. uniformity
without v.c. uniformity

Iig. 5. The effecl of laIanced use of virluaI channeIs

8-8 v=2 m=32
100
300
500
0.001 0.002 0.003 0.004 0.005
Message generation rate ()
A
v
e
r
a
g
e

D
e
I
a
y
(
c
y
c
I
e
s
)
bruijn-u
mesh-u
bruijn-mat
mesh-mat
bruijn-hot
mesh-hot

a)
8-8 v=2 m=64
200
400
600
800
0.0005 0.001 0.0015 0.002 0.0025
Message generation rate ()
A
v
e
r
a
g
e

D
e
I
a
y
(
c
y
c
I
e
s
)
bruijn-u
mesh-u
bruijn-mat
mesh-mat
bruijn-hot
mesh-hot

l)
Iig. 6. The average nessage Ialency in lhe 88 sinpIe 2D nesh and 2D DM for differenl
lraffics pallerns vilh nessage size of (a) 32 fIils and (l) 64 fIils
VLS 324

16-16 v=3 m=32
100
300
500
0.001 0.002 0.003 0.004 0.005
Message generation rate ()
A
v
e
r
a
g
e

D
e
I
a
y
(
c
y
c
I
e
s
)
bruijn-u
mesh-u
bruijn-mat
mesh-mat
bruijn-hot
mesh-hot

a)
16-16 v=3 m=64
200
400
600
800
0.0005 0.001 0.0015 0.002 0.0025
Message generation rate ()
A
v
e
r
a
g
e

D
e
I
a
y
(
c
y
c
I
e
s
)
bruijn-u
mesh-u
bruijn-mat
mesh-mat
bruijn-hot
mesh-hot

l)
Iig. 7. The average nessage Ialency in lhe 1616 sinpIe 2D nesh and 1616 nelvork of 2D
DM for differenl lraffics pallerns vilh nessage size of (a) 32 fIils and (l) 64 fIils

According lo lhe sinuIalion resuIls reporled alove, lhe 2D DM has a leller perfornance
conpared lo lhe equivaIenl sinpIe 2D nesh NoC. The reason is lhal lhe average dislance a
nessage lraveIs in lhe nelvork in a 2D DM nelvork is Iover lhan lhal of a sinpIe 2D
nesh. The node degree of lhe 2D DM and sinpIe 2D nesh nelvorks (hence lhe slruclure
and area of lhe roulers) are lhe sane. Hovever, unIike lhe sinpIe 2D nesh lopoIogy, lhe 2D
DM Iinks do nol aIvays connecl lhe adjacenl nodes and lherefore, sone Iinks nay le
Ionger lhan lhe Iinks in an equivaIenl nesh. This can Iead lo an increase in lhe nelvork area
and aIso creale prolIens in Iink pIacenenl. The Ialler can le aIIevialed ly using efficienl
VLSI Iayouls (Sananalhan & Iradhan, 1989, Chen el aI., 1993) proposed for de ruijn
nelvorks, as ve used.
Iig. 8 denonslrales pover consunplion of lhe sinpIe 2D nesh and 2D DM under
delerninislic rouling schene vilh uniforn lraffic. Il is again lhe 2D DM lhal shovs a
leller lehavior lefore reaching lo lhe saluralion poinl. Iig. 9 reporls siniIar resuIls for
holspol and nalrix-lranspose lraffic pallerns in lhe lvo nelvorks.
A Novel De Bruijn Based Mesh Topology for Networks-on-Chip 325

30
50
70
90
110
0.001 0.002 0.003 0.004 0.005 0.006
Message generation rate ()
P
o
w
e
r


(
n
j

/

c
y
c
I
e
s
)
mesh-64f
mesh-32f
bruijn-64f
bruijn-32f

a)
150
200
250
300
350
400
450
500
550
0.001 0.002 0.003 0.004 0.005 0.006
Message generation rate ()
P
o
w
e
r


(
n
j

/

c
y
c
I
e
s
)
mesh-64f
mesh-32f
bruijn-64f
bruijn-32f

l)
Iig. 8. Iover consunplion of lhe sinpIe 2D nesh and 2D DM vilh uniforn lraffic pallern
and nessage size of 32 and 64 fIils for (a) 88 nelvork and (l) 1616 nelvork

30
50
70
90
110
0.001 0.002 0.003 0.004 0.005 0.006
Message generation rate ()
P
o
w
e
r


(
n
j

/

c
y
c
I
e
s
)
bruijn-32u
bruijn-hot
bruijn-mat
mesh-32u
mesh-hot
mesh-mat

a)
VLS 326

150
200
250
300
350
400
450
500
550
0.001 0.002 0.003 0.004 0.005 0.006
Message generation rate ()
P
o
w
e
r


(
n
j

/

c
y
c
I
e
s
)
bruijn-u
bruijn-hot
bruijn-mat
mesh-u
mesh-hot
mesh-mat

l)
Iig. 9. Iover consunplion of sinpIe 2D nesh and 2D DM for differenl lraffic pallerns and
nessage size 32 fIils for (a) 88 and (l) 1616 nelvorks

The resuIls indicale lhal lhe pover of 2D DM nelvork is Iess for Iighl lo nediun lraffic
Ioads. The nain source of lhis reduclion is lhe Iong vires vhich lypass sone nodes and
hence, save lhe pover vhich is consuned in inlernediale roulers in an equivaIenl nesh
lopoIogy.
AIlhough for Iov lraffic Ioads lhe 2D DM nelvork provides a leller pover consunplion
conpared lo lhe sinpIe 2D nesh nelvork, il legins lo lehave differenlIy near heavy lraffic
regions.
Il is nolalIe lhal a usuaI advice on using any nelvorked syslen is nc| |c |a|c |nc nc|ucr|
ucr|ing ncar sa|ura|icn rcgicn (Dualo el aI., 2OO5). Having considered lhis and aIso lhe facl
lhal nosl of lhe nelvorks rareIy enler such lraffic regions, ve can concIude lhal lhe 2D
DM nelvork can oulperforn ils equivaIenl nesh nelvork vhen pover consunplion is
considered.
The area eslinalion is done lased on lhe hylrid synlhesis-anaIylicaI area nodeIs presenled
in (MuIIins el aI. , 2OO6, Kin el aI., 2OO6, Kin el aI. 2OO8). In lhese papers, lhe area of lhe
rouler luiIding lIocks is caIcuIaled in 9Onn slandard ceII ASIC lechnoIogy and lhen
anaIylicaIIy conlined lo eslinale lhe rouler lolaI area. TalIe 1 oulIines lhe paranelers. The
anaIylicaI area nodeIs for NoC and ils conponenls are dispIayed in TalIe 2. The area of a
rouler is eslinaled lased on lhe area of lhe inpul luffers, nelvork inlerface queues, and
crosslar svilch, since lhe rouler area is doninaled ly lhese conponenls.
The area overhead due lo lhe addilionaI inler-rouler vires is anaIyzed ly caIcuIaling lhe
nunler of channeIs in a nesh-lased NoC. An nn nesh has 2n(n-1) channeIs. The 2D
DM has lhe sane nunler of channeIs as nesh lul vilh Ionger vires. In lhe anaIysis, lhe
Ienglhs of packelizalion and depackelizalion queues are considered as Iarge as 64 fIils.
In TalIe 3, lhe area overhead of 2D DM NoC is caIcuIaled for 88 and 1616 nelvork sizes
in a 32-lil vide syslen. The resuIls shov lhal, in an 88 nesh, lhe lolaI area of lhe 2nn
Iinks and lhe roulers are O.O633 nn
2
and O.1O89 nn
2
, respecliveIy. ased on lhese area
eslinalions, lhe area of lhe nelvork parl of lhe 2D DM nelvork shovs a 44 increase
conpared lo a sinpIe 2D nesh vilh equaI size. Considering 2nn2nn processing
eIenenls, lhe increase in lhe enlire chip area is Iess lhan 3.5. OlviousIy, ly increasing lhe
A Novel De Bruijn Based Mesh Topology for Networks-on-Chip 327

luffer sizes, lhe nelvork node/configuralion svilch area increases, Ieading lo nuch
reduclion in lhe area overhead of lhe proposed archileclure.

Paramctcr 5ymbn!
IIil Size I
uffer Deplh
No. of VirluaI channeIs V
uffer area (O.OOOO2 mm
2
/|i| (Kin el aI., 2OO8))
area

Wire pilch (O.OOO24 mm (ITRS, 2OO7) W
pilch

No. of Iorls I
Nelvork Size N (= nn)
Iackelizalion queue capacily IQ
Depackelizalion queue capacily DQ
ChanneI Area (O.OOO99 nn
2
/lil/nn (MuIIins el aI. , 2OO6) W
area

ChanneI Lenglh (2nn ) L
No. Of ChanneIs N
channeI

TalIe 1. Iaranelers

5ymbn! Mndc!
Crosslar RCX
area
W
2
pilch
III
2

uffer (per
porl)
RI
area

area
IV
Rouler R
area
RCX
area
+IRI
area

Nelvork
Adaplor
NA
area
IQ
area
+DQ
area

ChanneI CH
area
IW
area
LN
channeI

NoC Area NoC
area
n
2
(R
area
+ NA
area
)+ CH
area

TalIe 2. Area anaIylicaI nodeI

Nelvork Link Area Rouler
Area
Increase percenl lo
nesh
increase percenl in
lhe enlire chip
88 nesh .O6338 .1O89 O O
88 2D DM .1O86 .1O89 44.38 3.46
1616 nesh .O6338 .1217 O O
1616 2D DM .1626 .1217 1O3.58 9.57
TalIe 3. 2D DM area overhead

4. ConcIusion

The sinpIe 2D nesh lopoIogy has leen videIy used in a variely of appIicalions especiaIIy
for NoC design due lo ils sinpIicily and efficiency. Hovever, lhe de ruijn nelvork has nol
leen sludied yel as lhe underIying lopoIogy for 2D liIed NoCs. In lhis chapler, ve
inlroduced lhe lvo-dinensionaI de ruijn Mesh (2D DM) nelvork vhich has lhe sane cosl
as lhe popuIar nesh, lul has a Iogarilhnic dianeler. We lhen conducled a conparalive
sinuIalion sludy lo assess lhe nelvork Ialency and pover consunplion of lhe lvo
VLS 328

nelvorks. ResuIls shoved lhal lhe 2D DM lopoIogy inproves on lhe nelvork Ialency
especiaIIy for heavy lraffic Ioads. The pover consunplion in lhe 2D DM nelvork vas aIso
Iess lhan lhal of lhe equivaIenl sinpIe 2D nesh NoC.
Iinding a VLSI Iayoul for lhe 2D and 3D DM nelvorks lased on lhe design consideralions
in deep sul-nicron lechnoIogy, especiaIIy in lhree dinensionaI design, can le a chaIIenging
fulure research in lhis Iine.

5. References

hllp://vvv.princelon.edu/~Ishang/popnel.hlnI, Augusl 2OO7.
Chen, C., AgravaI, I. & urke, }R. (1993). dcule : A Nev cIass of HierarchicaI
MuIliprocessor Inlerconneclion Nelvorks vilh Area Lfficienl Layoul, ||||
Transac|icn cn Para||c| an Dis|ri|u|c Sqs|cms, VoI. 4, No. 12, pp. 1332-1344.
DaIIy, W}. & Seilz, C. (1987). DeadIock-free Message Rouling in MuIliprocessor
Inlerconneclion Nelvorks, |||| Trans. cn Ccmpu|crs, VoI. 36, No. 5, pp. 547-553.
DaIIy, W}. (1991). Lxpress Cules: Inproving lhe Ierfornance of K-ary N-cule
Inlerconneclion Nelvorks, |||| Trans. cn Ccmpu|crs, VoI. 4O, No. 9, pp. 1O16-1O23.
De ruijn, NC. (1946). A ConlinaloriaI IrolIen, Kcnin||ij|c Nccr|ans A|acmic tan
Ic|cnscnappcn Prccccings, 49-2, pp.758-764.
Dualo, }. (1995). A Necessary and Sufficienl Condilion for DeadIock-free Adaplive Rouling
in WornhoIe Nelvorks, |||| Transac|icns cn Para||c| an Dis|ri|u|c Sqs|cms, VoI. 6,
No. 1O, pp. 1O55-1O67.
Dualo, }., YaIananchiIi, S. & Ni, L. (2OO5). |n|crccnncc|icn Nc|ucr|s. An |nginccring Apprcacn,
Morgan Kaufnann IulIishers.
Canesan, L. & Iradhan, DK. (2OO3). WornhoIe Rouling in de ruijn Nelvorks and Hyper-
de ruijn Nelvorks, |||| |n|crna|icna| Sqmpcsium cn Circui|s an Sqs|cms (|SCAS),
pp. 87O-873.
ITRS. (2OO7). InlernalionaI lechnoIogy roadnap for seniconduclors. Tccn. rcp., InlernalionaI
TechnoIogy Roadnap for Seniconduclors.
Kiasari, AL., Sarlazi-Azad, H. & Rezazad, M. (2OO5). Ierfornance Conparison of Adaplive
Rouling AIgorilhns in lhe Slar Inlerconneclion Nelvork, Prccccings cf |nc 8|n
|n|crna|icna| Ccnfcrcncc cn Hign Pcrfcrmancc Ccmpu|ing in Asia-Pacific Rcgicn
(HPCAsia), pp. 257-264.
Kin, M., Kin, D. & SoleInan, L. (2OO6). NoC Iink anaIysis under pover and perfornance
conslrainls, |||| |n|crna|icna| Sqmpcsium cn Circui|s an Sqs|cms (|SCAS), Creece.
Kin, MM., Davis, }D., Oskin, M & Auslin, T. (2OO8). IoIynorphic on-Chip Nelvorks,
|n|crna|icna| Sqmpcsium cn Ccmpu|cr Arcni|cc|urc(|SCA), pp. 1O1 -112.
Liu, CI. & Lee, KY. (1993). OplinaI Rouling AIgorilhns for CeneraIized de ruijn Digraph,
|n|crna|icna| Ccnfcrcncc cn Para||c| Prcccssing, pp. 167-174.
Louri, A. & Sung, H. (1995). An Lfficienl 3D OplicaI InpIenenlalion of inary de ruijn
Nelvorks vilh AppIicalions lo MassiveIy IaraIIeI Conpuling, Scccn Icr|sncp cn
Massitc|q Para||c| Prcccssing Using Op|ica| |n|crccnncc|icns, pp.152-159.
Mao, }. & Yang, C. (2OOO). Shorlesl Ialh Rouling and IauIl-loIeranl Rouling on de ruijn
Nelvorks, Nc|ucr|s, voI.35, pp.2O7-215.
A Novel De Bruijn Based Mesh Topology for Networks-on-Chip 329

MuIIins, R., Wesl, A. & Moore, S. (2OO6). The Design and InpIenenlalion of a Lov-Lalency
On-Chip Nelvork, Asia an Scu|n Pacific Dcsign Au|cma|icn Ccnfcrcncc(ASP-DAC),
pp. 164-169.
Ogras, UY. & MarcuIescu, R. (2OO5). AppIicalion-Specific Nelvork-on-Chip Archileclure
Cuslonizalion via Long-Range Link Inserlion, ||||/ACM |n||. Ccnf. cn Ccmpu|cr
Aic Dcsign, San }ose, pp. 246-253.
Iark, H., AgravaI, DI. (1995). A NoveI DeadIock-free Rouling Technique for a cIass of de
ruijn lased Nelvorks, |PPS, pp. 524-531.
Sallaghi-Nadooshan, R., Modarressi, M. & Sarlazi-Azad, H. (2OO8). A NoveI high
Ierfornance Iov pover ased Mesh TopoIogy for NoCs, PM|O-2008, 7
|n

|n|crna|icna| Icr|sncp cn Pcrfcrmancc Mcc|ing, |ta|ua|icn, an Op|imiza|icn, pp. 1-7.
Sananalhan, MR., Iradhan, DK. (1989). The de ruijn MuIliprocessor Nelvork: a VersaliIe
IaraIIeI Irocessing and Sorling Nelvork for VLSI, |||| Trans. On Ccmpu|crs, voI.
38, pp.567-581.
Srivasan, K., Chala, KS. & Konjevad, C. (2OO4). Linear Irogranning ased Techniques for
Synlhesis of Nelvorks-on-chip Archileclures, |||| |n|crna|icna| ccnfcrcncc cn
Ccmpu|cr Dcsign, pp. 422-429.
Wang, H., Zhu, X., Ieh, L. & MaIik, S. (2OO2). Orion: A Iover-Ierfornance SinuIalor for
Inlerconneclion Nelvorks, 35|n |n|crna|icna| Sqmpcsium cn Micrcarcni|cc|urc
(M|CRO) , Turkey, pp. 294-3O5.

VLS 330
On the Efcient Design & Synthesis of Differential Clock Distribution Networks 331
X

On the Efficient Design & Synthesis of
DifferentiaI CIock Distribution Networks

Hounan Zarrali
1
, ZeIjko ZiIic
2
, Yvon Savaria
3
and A. }. AI-KhaIiIi
1

1
Dcpar|mcn| cf ||cc|rica| an Ccmpu|cr |nginccring, Ccnccria Unitcrsi|q
2
Dcpar|mcn| cf ||cc|rica| an Ccmpu|cr |nginccring, McGi|| Unitcrsi|q
3
Dcpar|mcn| cf ||cc|rica| |nginccring, |cc|c Pc|q|ccnniquc c Mcn|rca|
Canaa

1. Introduction

AInosl aII high-perfornance VLSI syslens in loday lechnoIogies are synchronous. These
syslens use a c|cc| signaI lo conlroI lhe fIov of dala lhroughoul lhe chip. This grealIy
faciIilales lhe design process of syslens lecause il provides a gIolaI franevork lhal aIIovs
nany differenl conponenls lo operale sinuIlaneousIy vhiIe sharing dala. The onIy price for
using synchronous lype of syslens is lhe addilionaI overhead required lo generale and
dislrilule lhe cIock signaI.
NearIy aII on-chip CIock Dislrilulions Nelvorks (CDNs) conlain a series of luffers and
inlerconnecls lhal repealedIy pover-up lhe cIock signaI fron lhe cIock source lo lhe cIock
sinks. ConvenlionaIIy, CDNs consisled of onIy a singIe slage luffer driving vires lo lhe
cIock Ioads. This is sliII lhe case for cIock dislrilulion in very snaII scaIe syslens, yel
conlenporary conpIex syslens use nuIlipIe luffer slages. A lypicaI cIock lree dislrilulion
nelvork in nodern conpIex syslens is shovn in Iigure 1. This design is lased on lhe
reporled CDNs in (OMahony el aI, 2OO3, ReslIe el aI, 1998, Vasseghi el aI, 1996).

1.1 Hierarchy in CDNs
The cIock signaI is generaled vilh a Ihase Lock Loop (ILL). A ILL is a conlroI syslen lhal
generales a signaI having a fixed reIalion lo lhe phase of ils reference signaI. A ILL circuil
responds lo lolh lhe frequency and lhe phase of ils inpul signaI and aulonalicaIIy
raises/Iovers lhe frequency of lhe conlroIIed osciIIalor unliI il nalches lhe reference
(Wikipedia, 2OO9). The core cIock signaI is lhen anpIified lhrough lhe gIolaI luffer and
dislriluled lhrough a hierarchicaI nelvork and luffers. The syslen CDN is generaIIy
defined lo span fron lhe ILL lo lhe cIock pins. The pin is lhe inpul lo a luffer lhal IocaIIy
anpIifies and dislrilules lhe cIock signaI lo cIocked slorage eIenenls vilhin a nacro, lhe
snaII lIocks lhal nake up a syslen. There can le any nunler of luffer IeveIs lelveen lhe
ILL and lhe cIock pin. In nodern VLSI syslens, lhere are up lo four luffer IeveIs. The Iasl
luffer IeveI lefore lhe cIock pin is generaIIy caIIed a seclor luffer. This slage drives lhe
inlerconnecl Ieading lo lhe nacros and lhe IocaI luffers al lhe pins. A synchronous VLSI
17
VLS 332

syslen has lhousands of Ioads lo le driven ly cIock signaI. In CDNs, lhe Ioads are grouped
logelher crealing a (sul-) lIock. This lrend resuIls in a hierarchy in lhe design of CDNs
incIuding lhree differenl IeveIs/calegories of cIock dislrilulion naneIy as g|c|a|, rcgicna| and
|cca| as shovn in Iigure 1. Al each IeveI of hierarchy lhere are luffers associaled vilh lhal
IeveI lo regenerale and lo inprove lhe cIock signaI al lhal IeveI.
The gIolaI cIock dislrilulion connecls lhe gIolaI cIock luffer lo lhe inpuls of lhe seclor
luffers. This IeveI of lhe dislrilulion has usuaIIy lhe |cngcs| pa|n in CDN lecause il reIays
lhe cIock signaI fron lhe cenlraI poinl on lhe die lo lhe seclor luffers Iocaled lhroughoul lhe
die. The issues in designing lhe gIolaI lree is mcs||q rc|a|c |c signa| in|cgri|q vhich is neanl
lo nainlain a fasl edge rale over Iong vires vhiIe nol inlroducing a Iarge anounl of lining
uncerlainly. Skev and jiller accunuIale as lhe cIock signaI propagales lhrough lhe cIock
nelvork and lolh lend lo accunuIale proporlionaI lo lhe Ialency of lhe palh. ecause nosl
of lhe Ialency occurs in lhe gIolaI cIock dislrilulion, lhis is aIso a prinary source of skev
and jiller (ReslIe el aI, 2OO1). Iron a design poinl of viev, achieving Iov lining uncerlainly
is lhe nosl crilicaI chaIIenge al lhis IeveI.
The regionaI cIock IeveI is defined lo le lhe dislrilulion of cIock signaIs fron lhe seclor
luffers lo lhe cIock pins. This IeveI is lhe niddIe ground lelveen gIolaI and IocaI cIock
dislrilulion, il does nol span as nuch area as lhe gIolaI IeveI and il does nol drive as nuch
Ioad or consune nearIy as nuch pover as lhe IocaI IeveI.
The IocaI IeveI is lhe parl of lhe CDN lhal deIivers lhe cIock pin lo lhe Ioad of lhe syslen lo
le synchronized. This nelvork drives lhe finaI Ioads and hence consunes lhe nosl pover.
As a design chaIIenge, lhe pover al lhe IocaI IeveI is aloul one order of nagnilude Iarger
lhan lhe pover in lhe gIolaI and regionaI IeveIs conlined (ReslIe el aI, 2OO1).


Iig. 1. A lypicaI hierarchicaI CDN for a high-perfornance synchronous VLSI syslen

1.2 CDNs figures of merit
The nain figures of neril for a CDN are lhe conponenls of lining uncerlainly, as veII as,
pover consunplion. AII of lhese perfornance nelrics have significanl inpacls on lhe
design, evaIualion and verificalion of synchronous syslen perfornance and reIialiIily.
As nenlioned previousIy, lhe advanlage of a synchronous syslen is lo reguIale lhe fIov of
dala lhroughoul lhe syslen. Hovever, lhis synchronizing approach depends on lhe aliIily
lo accuraleIy reIay a cIock signaI lo niIIions of individuaI cIocked Ioads. Any lining error
inlroduced ly lhe cIock dislrilulion has lhe polenliaI of causing a funclionaI error Ieading lo
On the Efcient Design & Synthesis of Differential Clock Distribution Networks 333

syslen naIfunclioning. Therefore, lhe lining uncerlainly of lhe cIock signaI nusl le
eslinaled and laken inlo accounl in lhe firsl design slages. The lvo calegories of lining
uncerlainlies in a cIock dislrilulion are s|cu and ji||cr.
CIock skev refers lo lhe alsoIule line difference in cIock signaIs arrivaI line lelveen lvo
poinls in a CDN. CIock skev is generaIIy caused ly nisnalches in eilher device or
inlerconnecl vilhin lhe cIock dislrilulion or ly lenperalure or voIlage varialions around
lhe chip. There are lvo conponenls for cIock skev: lhe skev caused due lo lhe slalic noise
(such as inlaIanced rouling) vhich is c|crminis|ic and lhe one caused ly lhe syslen device
and environnenlaI varialions vhich is rancm. An ideaI cIock dislrilulion vouId have zero
skev, vhich is usuaIIy unachievalIe.
}iller is anolher source of dynanic lining uncerlainlies al a singIe cIock Ioad. The key
neasure of jiller for a synchronous syslen is lhe period or cycIe-lo-cycIe jiller, vhich is lhe
difference lelveen lhe noninaI cycIe line and lhe acluaI cycIe line. The firsl cycIe, lhe
period is lhe sane as lhe cIock signaI period and lhe second cycIe, lhe cIock period lecones
Ionger/shorler. The lolaI cIock jiller is lhe sun of lhe jiller fron lhe cIock source and fron
lhe cIock dislrilulion. Iover suppIy noise nay cause jiller in lolh lhe cIock source and lhe
dislrilulion (HerzeI el aI, 1999).
CIock nelvork aIso invoIves Iong inlerconnecls vhich inpIies having Iols of parasilics
associaled vilh lhe nelvork conlriluling lo lhe pover consunplion of lhe cIock signaI.
Having lhe highesl svilching aclivily of lhe circuil in a chip is anolher facl of consuning a
Iarge anounl of pover of lhe syslen. This pover consunplion can le as high as 5O of lhe
lolaI pover consunplion of lhe chip according lo (Zhang el aI, 2OOO). The conponenls of
pover consunplion of CDN are: slalic, dynanic and Ieakage pover. The pover
consunplion due lo lhe Ieakage currenl, in CDNs, is reIaliveIy snaII. In lhe sane vay,
keeping lhe proper rise/faII lines, nininizes lhe slalic pover consunplion. Thus lhe nain
porlion of lhe pover consunplion is due lo lhe dynanic pover consunplion. This is
eslinaled as:

P=f C
|
V

V
suing

in vhich f, C
|
,

V

and

V
suing
respecliveIy represenl frequency of lhe cIock nelvork, lolaI Ioad
capacilances, suppIy-voIlage and voIlage-sving of cIock signaI. Ior lhe case of fuII sving (in
vhich lhe cIock signaI sving reaches lhe voIlage-suppIy IeveI) V
suing
is lhe sane as V

.
AccordingIy, nelhods lo reduce lhe pover consunplion are:
a. Reduce lolaI Ioad capacilances (C
|
)
l. Reduce voIlage-suppIy (V
DD
)
c. Reduce cIock signaI sving (V
suing
)
The inlrinsic Ioad capacilance reIies on lhe process lechnoIogy and lhere is no handy vay lo
inprove il. Yel, fron lhe design aspecls ly lreaking dovn inlerconnecls ly repealer
inserlion lhe lolaI inlerconnecl Ioad is reduced. Worlh nenlioning lhal in coupIed Iines, lhe
lolaI Ioad is grealer lhan lhal of singIe-node Iines, lhus conpensaling design nelhods
shouId le laken inlo consideralion for pover-saving inprovenenl. TypicaIIy, pover
reduclion is achieved ly neans of suppIy and/or sving voIlage scaIing in CDNs.

VLS 334

2. DifferentiaI CIock Distribution Networks (DCDNs)

In lhis seclion, lased on lhe generaI overviev given on CDNs, ve viII inlroduce lhe
concepls and nolivalions lovard lhe design of DifferenliaI CDNs (DCDNs). Ior lhis, ve
iniliaIIy address lhe preIininaries needed for lhe design of DCDNs. These lheories incIude
differenliaI signaIing and differenliaI signaI inlegrily.


Iig. 2. VoIlage-node differenliaI signaIing

2.1 PreIiminaries

2.1.1 DifferentiaI signaIing
A digilaI signaI can le lransnilled iffcrcn|ia||q over lhe nediun ly uliIizing lvo
conduclors. One of vhich is used for lransnilling lhe signaI and lhe olher is used for lhe
conpIenenl of lhe signaI. Iigure 2 shovs a differenliaI voIlage-node signaIing syslen. To
lransnil Iogic 1, lhe upper voIlage source drives V
1
and lhe Iover voIlage source drives V
O
.
Ior Iogic 'O lransnission, lhe voIlages are reversed.
As is shovn in Iigure 2, lhe foIIoving voIlages are defined in a differenliaI syslen: V
1
is lhe
signaI on lhe firsl Iine vilh respecl lo connon relurn palh, V
O
is lhe signaI on lhe second
Iine vilh respecl lo connon relurn palh, V
iff
is lhe differenliaI signaI vhich is lhe voIlage
difference of lhe lvo signaI pair, and, V
ccmm
is lhe connon voIlage signaI vhich is in
connon lelveen lolh of signaI pair. DifferenliaI signaI V
iff
carries lhe infornalion and al
lhe receiver lhe infornalion is exlracled fron lhis voIlage difference. In addilion lo lhe
differenliaI voIlage lhere is a connon-node signaI. This signaI is used lo give an iniliaI
liasing lo lhe differenliaI signaI pair. In ideaI condilions, lhe connon-node signaI is
conslanl and il does nol carry any infornalion. In lhis case:

V
iff
=V
1
-V
0
DifferenliaI signaIing requires nore rouling and vires and pins lhan ils singIe-ended
counlerparl syslen. In relurn for lhis increase, differenliaI signaIing offers lhe foIIoving
advanlages over singIe-ended signaIing:
a. A differenliaI syslen, serves ils ovn reference. The receiver al lhe far end of lhe
syslen conpares lhe lvo signaI pair lo delecl lhe vaIue of lhe lransnilled
infornalion. Transnillers are Iess crilicaI in lerns of noise issues, since lhe receiver
is conparing lvo pair of signaIs logelher ralher lhan conparing lo a fixed
reference. This resuIls in canceIing any noises in connon lo lhe signaIs.
l. The voIlage difference for lhe lvo signaI pair lelveen Iogic1 and 'O is:

V=2(V
1
-V
0
)
On the Efcient Design & Synthesis of Differential Clock Distribution Networks 335

vhich is lvice as nuch as is defined for a singIe-ended signaIing syslen. This
shovs lhal |nc ncisc margin cf |nc iffcrcn|ia| sqs|cm is |uicc as mucn as |nc sing|c-cnc
signa|ing sqs|cm. This doulIing effecl of signaI sving inproves lhe speed of lhe
signaIing syslen. Il affecls lhe lransilion lines (rise/faII line) vhich is done in haIf
of lhe lransilion line of singIe-ended signaIing syslen.

rax
ax c
g
ax c
c
ax c
g
rax
lax
ax l
m
lax

Iig. 3. A segnenl of a coupIed inlerconnecl

2.1.2 DifferentiaI signaI integrity
In order lo enpIoy differenliaI signaIing, lhe coupIed inlerconnecls nodeI is uliIized and
appIied lo lhe syslen. This lype of inlerconnecls nol onIy have lhe inlrinsic signaI inlegrily
issues, lul aIso, lhey are invoIved vilh lheir nuluaI signaI inlegrily aspecls. In Iigure 3, a
segnenl of a coupIed inlerconnecl is shovn.
The nuluaI parasilic eIenenls are due lo lhe adjacenl Iine. These are nuluaI capacilance Cc
and nuluaI induclance |
m
in addilion lo lhe inlrinsic parasilic eIenenls r, Cg and | vhich
indicale inlrinsic resislance, capacilance and induclance of each Iine. The effeclive
capacilance Ccff associaled vilh each Iine, depending on lhe direclion/node of lhe signaIing
(in-phase or oul-of-phase usuaIIy caIIed ctcn and c node respecliveIy) can le caIcuIaled
fron lhe foIIoving equalions (HaII el aI, 2OOO):

C
cff
(c)

= Cc+Cg
C
cff
(ctcn)

=Cg
And for effeclive induclance ve have:

|
cff
(c)

=|-|
m
|
cff
(ctcn)

=|+|
m

As lhe alove equalions indicale, for lhe case of differenliaI signaIing (or oul-of-phase
signaIing), lhe effeclive capacilance is increased ly lhe faclor of due lo coupIing
capacilances and lhe effeclive induclance is decreased due lo lhe effecl of nuluaI
induclance. In (Kahng el aI, 2OOO) il vas shovn lhal has lhe vaIue of |O, 2 and 3
depending on lhe node of signaIing and sIev rales of lhe coupIed signaIs. The lypicaI vaIue
for , for lypicaI sharp inpul signaIs designs, is laken as 2.

VLS 336

2.1.3 DifferentiaI Buffers
The configuralion of differenliaI luffers is lased on currenl sleering devices, in vhich lhe
oulpul Iogic can le sel ly sleering lhe currenl in lhe circuil. These devices are aIso
considered as Currenl Mode Logic (CML) circuils. CML circuils are knovn lo oulperforn
lhe convenlionaI CMOS circuils in Ciga Herlz (CHz) operalion frequency. A lasic
differenliaI luffer is given in Iigure 4. The currenl source in differenliaI luffer is lhe laiI
currenl |
ss
. When lhe connon-node voIlage V
ccmm
is appIied lo lhe differenliaI luffer, due
lo lhe synnelry of lhe differenliaI luffer, lhe currenl is spIil equaIIy lelveen lhe lvo vings
(|
ss
/2). Increasing one of lhe inpul voIlages vhich inpIies lhe decrease in lhe olher one, viII
resuIl in increase in currenl of one lranch and decrease in currenl of lhe olher lranch. Nole
lhal lhe lolaI possilIe currenl lo sleer is |
ss
and vhen one inpul voIlage rises, lhe olher one
decreases ly lhe sane anounl. When lhe inpul differenliaI voIlage V=V
in
-V
in
has passed
a specific lhreshoId, in olher vords vhen one of lhe lransislors derives aII lhe possilIe
currenl fron one lranch lhe olher lransislors lurn off, hence lhe oulpul voIlage reaches V


vhereas lhe firsl lranch drops lo V

-R|
ss
. SeveraI differenliaI Ioads aIso have leen
inlroduced in lhe Iileralure (DaIIy el aI, 1998). These Ioads nay use resislor, currenl nirror
and cross-coupIed lransislors. The differenliaI Ioad is characlerized ly ils differenliaI and
connon-node inpedances, knovn as r

and r
c
respecliveIy. The differenliaI inpedance
delernines lhe change in lhe differenliaI currenl |

vhen lhe voIlages on lhe lvo inpuls of


lhe lerninaI are varied in opposile direclions. The connon-node inpedance inpIies lhe
average currenl changes vhen lolh inpul voIlages are varied in lhe sane direclion.
Depending on lhe lype of appIicalion, lhe design nay chose fron lhese design oplions.
TalIe I denonslrale lhe r

and r
c
for each Ioad.


Iig. 4. A lasic differenliaI luffer

Load r
c
r


Resislor R R
Currenl-nirror 1/g
n
-1/I
Cross-coupIed 1/g
n
1/g
n

TalIe 1. Inpedance of differenliaI Ioads

On the Efcient Design & Synthesis of Differential Clock Distribution Networks 337

2.2 DifferentiaI CIock Distribution Networks (DCDNs)
As discussed previousIy, differenliaI signaIing offers higher innunily againsl exlernaI
perlurlalions. Due lo lhe conpIexily increase and lhe need for error-free operalion in
conlenporary syslens, lhe idea of inlegraling differenliaI signaIing and cIock dislrilulion is
seeningIy leconing a vialIe soIulion for nodern and for fulure IC designs.
HisloricaIIy lhe idea of DCDN vas lo le uliIized for off-chip cIock dislrilulion and for IC-
IeveI synchronizalion. This lechnique vas uliIized lo reduce and suppress lhe LIeclro-
Magnelic Inlerference (LMI) of lhe neighloring circuils and syslens vaves. Due lo lhe
superiorily of DCDN, recenlIy lhere has leen a coupIe of vorks on on-chip DCDN as veII,
such as (Sekar, 2OO2, Anderson el aI, 2OO2). The idea of uliIizing on-chip DCDN has nol leen
videIy used in lhe Iileralure. In (Anderson el aI, 2OO2) a DCDN is used in gIolaI IeveI of lhe
hierarchicaI CDN for Ilaniun Microprocessor. They reporled lhal lhe use of DCDN has
given lhe advanlage of 1O Iess skev varialion. In (Sekar, 2OO2) il is reporled lhal DCDN
has 25-42 Iess sensilivily lo pover suppIy noises and 6 Iess sensilivily lo
nanufacluring varialions vhen lhey uliIized H-Tree DCDN.
A generaI nodeI of a DCDN is given in Iigure 5. The DCDN is conposed of a differenliaI
signaI pair shovn in lvo differenl pallerns. The cIock lree generaIIy is a linary lree. The
differenliaI signaI is dispersed aIong lhe cIock nelvork. Throughoul lhe cIock nelvork al
lranching poinls lhe differenliaI cIock signaIs are regeneraled ly differenliaI luffers lo
inprove lhe signaI inlegrily of lhe cIock nelvork. IinaIIy al lhe Iasl slage, lhey are aII
converled lo singIe-ended signaIs for conpaliliIily vilh lhe resl of lhe syslen funclionaIily,
vhich nornaIIy use singIe-ended signaIs. Ior lhe regeneralive luffers a sinpIe differenliaI
luffer inlroduced in lhe previous parl can le uliIized. The onIy design issue reIaled lo lhe
luffer is lhe choice of differenliaI Ioads. ased on lhe process lechnoIogy, or design crileria,
lhis ilen can le chosen fron lhe design Iilrary. Ior finaI slage converlers, usuaIIy lhe choice
of currenl nirror Ioad is lhe superior choice. As TalIe 1 denonslrales, currenl nirror Ioads
have high differenliaI oulpul inpedance vhich resuIls in fasl change in lhe oulpul lhal is
used lo drive lhe oulpul of lhe cIock nelvork.
DifferenliaI cIocking eIininales lhe induced crosslaIk due lo aggression of cIock signaIs.
CIock signaI is spread aII over lhe chip area. Il aIso has fuII svilching aclivily. AIso device
sizes lend lo shrink as lechnoIogy advances. These facls shov lhal as lechnoIogy advances
lhe cIock signaI aggression can le quile harnfuI for aII syslen conponenls aII over lhe chip
area. Dislrilulion of cIock vilh differenliaI signaIs eIininales lhis prolIen lo a cerlain
exlenl, as lolh posilive and negalive signaI vaIues are appIied and lhe noise vouId le
canceIIed. Iurlhernore, as given in (Anderson el aI, 2OO2), DCDN offers Iess skev varialions
in lhe presence of exlernaI noises, il has Iess sensilivily in presence of suppIy and process
varialions (Sekar 2OO5).
The aforenenlioned poinls are of lhe nosl inporlanl crileria/soIulions for reIialIe syslen
design. Due lo lechnoIogy advances and increase in syslen conpIexily, lhe design vilh Iov
or no paraneler varialion in ideaI case, has lecone lhe nosl concerning issue. Tining error
resuIls direclIy in syslen naIfunclioning. Thus designing a reIialIe and noise loIeranl, cIock
dislrilulion nay heIp significanlIy for a reIialIe syslen design. As inlroduced in lhe
Iileralure, DCDN has lhese polenliaIs, lhus lhis design nelhodoIogy can le a soIulion for
fulure rolusl syslen design.
IIus lhe pros and cons of DCDN, lhere are sone design/synlhesis chaIIenges associaled
vilh lhe efficienl design of DCDNs. Sone of nosl chaIIenges nay le sunnarized as:
VLS 338

- DifferenliaI signaIing is invoIved vilh higher parasilic, due lhe exislence of
coupIed Iines. In lhis case lhe lolaI pover consunplion is connonIy increased.
- CoupIed Iines are connonIy rouled and synlhesized using synnelricaI palh
nodeIs (vhich is nol lhe generaI case).
- Rouling conpIex DCDNs nay lake loo nuch conpulalion line. Using exisling
rouling nelhods is nol line efficienl.
Iroposing soIulions lo address lhe alove chaIIenges can efficienlIy heIp lhe design and
synlhesis of DCDNs, needed for nodern conpIex VLSI lechnoIogies. These soIulions are
given in lhe foIIoving seclions.


Iig. 5. A generaI slruclure for DCDN

3. Efficient design of DCDNs

3.1 Dynamic ThreshoId (DT) MOS for Iow-voItage DCDNs
Ior lhe design and synlhesis of CDNs, luffers are inserled lo inprove lhe perfornance of
CDNs in order lo reduce lhe overhead capacilances. In lhis parl, Dynanic ThreshoId (DT)
(sonelines referred lo as VarialIe ThreshoId) lransislors (Assaderaghi el aI, 1997) are
uliIized in convenlionaI differenliaI luffer slruclures. These lransislors oulperforn lhe
convenlionaI lransislors in Iov voIlage appIicalions vhich are suilalIe for advanced Iov
voIlage lechnoIogies. The use of DT lransislors heIps inprove lhe luffer perfornance. DT
lransislors svilch fasler since lheir lhreshoId voIlage decrease dynanicaIIy vhen lhe inpul
is appIied lo lheir gale lerninaI due lo lody effecl. Such luffers are depicled in Iigure 6
(Zarrali el aI, 2OO6). Iigure 6(a) presenls a Lov-Sving:Lov-Sving differenliaI luffer. DT
lransislors heIp inprove lhe speed of lhese luffers vhen Iov sving inpuls are appIied lo
lhe luffer. The use of cross-coupIed differenliaI Ioad vilh high differenliaI inpedance heIps
lo have a fasl lransilion of lhe inpuls lo lhe oulpuls. Iigure 6(l) represenls a Lov-Sving:
IuII-Sving IeveI converler. This luffer is used al lhe sinks lo reslore cIock signaIs lo lheir
singIe-node fuII anpIilude. The currenl-source puII-ups heIp lo have asynnelricaI fasl
lransfornalion of differenliaI lo singIe-ended signaIs. The slruclure is lased on ChappeII
anpIifier vhich offers good connon-node noise rejeclion (ChappeII el aI, 1998).

On the Efcient Design & Synthesis of Differential Clock Distribution Networks 339

(a) (b)

Iig. 6. (a) Lov-Sving: Lov-Sving (l) Lov-Sving: IuII-Sving DT differenliaI luffers

3.2 DifferentiaI Iow-power buffers
RecaIIing fron Seclion 2.1.1, lvo voIlage conponenls are associaled vilh differenliaI
signaIing: connon-node and differenliaI. Connon-node aIso refers lo DC voIlage liasing
and is lhe voIlage used for iniliaI liasing of lhe differenliaI luffer.
In order for receivers (differenliaI lo singIe-ended converlers) lo operale efficienlIy and have
a fuII/proper oulpul sving, lhe connon-node voIlage or DC liasing of lhe differenliaI
luffer shouId le Iov enough lo lurn lhe inpul lransislors off. In lhe Iileralure, lhe nelhod
used in order lo overcone lhis issue is lo increase lhe voIlage sving as nuch as possilIe lo
le alIe lo decrease lhe connon-node voIlage lo lhe sufficienl suppIy IeveI (usuaIIy used
differenliaI voIlage of 5O of V

) (Anderson el aI, 2OO2, Sekar, 2OO5). This nelhod resuIls in


high pover consunplion in DCDN. RecaIIing fron Seclion 2.1:

V
iff_|cu
=V

-R|
ss

The alove equalion inpIies lhal in order lo increase lhe differenliaI voIlage sving, lhe laiI
currenl need lo le increased. This lechnique IargeIy affecls lhe pover consunplion in
DCDN. Nole lhal, il is nol possilIe lo louch lhe Ioad (R) as il direclIy affecls lhe landvidlh
of lhe cIock nelvork. Therefore, in previous vorks, in order lo reach sufficienl oulpul
sving, lhe differenliaI voIlage sving is increased lo reduce connon-node voIlage.
CorrespondingIy, a circuil lechnique is proposed lo address lhis design prolIen.
The proposed lechnique for differenliaI receiver is given in Iigure 7 (Zarrali, 2OO6). The
luffer configuralion is lased on ChappeII anpIifier as inlroduced in lhe previous seclion.
Allached lo lhe luffer are lhe IeveI-shifling circuils. The luffer funclionaIily is as foIIovs:
The dashed parls in Iigure 7 are lhe IeveI shiflers (aIso referred lo as source foIIovers)
(Razavi, 2OO1). When lhe inpul is appIied lo lhe gale lerninaIs of lhe IeveI shiflers, lhe
oulpuls are dropped and foIIov lheir inpuls. In olher vords, lhe voIlage gain equaIs one (no
voIlage anpIificalion), and lhe foIIoving reIalions are appIicalIe (roderson, 2OO5):

|=(/2)(V
|N
-V
OUT
-V
T
)
2

V
OUT
-=|R
s
V
OUT
=(R
s
/2)(V
|N
-V
OUT
-V
T
)
2

V
|N
=V
OUT
+ V
T
+j(V
|N
-2)/(R
s
)]
0.5


VLS 340


Iig. 7. DifferenliaI receiver vilh IeveI shifler

The Iasl resuIl shovs lhal V
OUT
can le derived ly soIving lhe finaI equalion ileraliveIy.
Hovever, ly naking lhe firsl order approxinalion lhal R
S
is Iarge enough (especiaIIy in
currenl sources) lo nake lhe lhird lern equaI lo zero, ve can concIude:

V
OUT
= V
|N
-V
T
This shovs lhal lhe oulpul of lhe source foIIover circuils copies lhe inpul of lhe gales vilh a
shifl of a lransislor lhreshoId vhich is a lechnoIogy dependenl faclor. The lransislor ralios
for luffers sizing are lhe sane as lhe ones given in (ChappeII el aI, 1998). Hovever, lhe lolaI
size of lhe luffer is scaIed lo nininize lhe skev.
The alove configuralion for lhe differenliaI receivers heIps Iover lhe connon-node (DC
lias) of lhe inlernaI inpul lransislors of lhe receiver. UliIizing lhis design lechnique, il is
possilIe lo furlher reduce lhe differenliaI voIlage sving vhiIe nainlaining a sufficienl
oulpul sving al lhe finaI nodes.
In order lo perforn differenliaI voIlage scaIing in DCDN, previousIy a nev design for IeveI
converler vas given. Ior lhe case of inlernediale luffers, in order lo le alIe lo vary lhe
differenliaI voIlage vhiIe nainlaining lhe Iinearily of lhe luffer, lhe differenliaI Ioad shouId
le reconfigured in a vay lo eslalIish lhis design goaI. In lhis parl, a nev configuralion for
differenliaI Ioad is proposed vhich enalIes us lo have Iinearily in lhe luffer. Iigure 8 shovs
lhe proposed luffer configuralion.


Iig. 8. DifferenliaI luffer vilh conposile Ioad

The dashed parl denonslrales lhe proposed conposile configuralion of lhe differenliaI Ioad.
Such conposilion enalIes lhe circuil lo conline lolh lhe characlerislics of lhe diode
connecled device and lriode lransislor logelher lo have a Iinear operaling Ioad in various
On the Efcient Design & Synthesis of Differential Clock Distribution Networks 341

voIlage ranges (DaIIy el aI, 1998). The proposed luffer lased on conposile differenliaI Ioad
is a lechnoIogy porlalIe design and can le used in any avaiIalIe design process vhereas lhe
use of resislance is Iiniled lo currenl and fulure advanced lechnoIogies. This porlalIe
design nelhod cones al lhe price of increase in area and parasilic eIenenls. The lransislor
ralios (for luffers sizing) are 1 lo 3 vhich refer lo lhe ralio of puII up lo puII dovn
lransislors (L=2L
nin
lo reduce lhe channeI Ienglh noduIalion effecl). The lolaI size of lhe
luffer is scaIed lo reach lhe oljeclive frequency of operalion.

4. Efficient synthesis of DCDNs

4.1 Zero skew DCDN routing
As seen in lhe overviev seclion, DCDNs are connonIy rouled assuning synnelricaI CDN
palh nodeIs. This hovever is nol lhe generaI case. Here ve viII propose a nelhod for zero
skev rouling of DCDNs appIicalIe lo generaI (asynnelric) palh nodeIs. In lhe Iileralure,
especiaIIy in (Cong el aI, 1996) a conprehensive sludy on efficienl cIock rouling is sludied.
Tsays nelhod (Tsay, 1991) is one of lhe nelhods inlroduced for zero skev rouling of cIock
lrees. In order lo roule differenliaI cIock lrees vilh zero skev characlerislic, lhe exisling
nelhods are nodified lo salisfy design oljeclives. To achieve lhis ain here in lhis parl, a
Iine equivaIenl deIay nodeI is uliIized. The nodeI is appIied lo Tsays nelhod (Tsay, 1991)
for zero skev rouling of differenliaI cIock lrees (DCDNs).

4.1.1 UtiIizing Tsays method
In lhis nelhod, zero skev is achieved ly Iocaling lapping poinls lhroughoul lhe cIock lree.
Tapping poinls are lhe lranching poinls al vhich sul-lrees are chosen lo nainlain equaI
deIay as shovn in Iigure 9.


Iig. 9. Tapping poinl exlraclion lhrough nerging decoupIed sul-lree(s)

As vas seen in Seclion 2.1.2, lhe effeclive capacilance associaled vilh each segnenl of a
coupIed Iine, considering lolh inlrinsic and nuluaI effecls is:

C
cff
(c)

= Cc+Cg
The effeclive capacilance is appIied lo lolh signaI Iines independenl of each olher, in lhis
vay, ve caII lhese Iines as decoupIed Iines and ve nane lhis nodeI as cccup|c |inc nodeI.
This nodeI is enpIoyed for lhe purpose of cIock rouling. In lhis vay, lhe decoupIed RC-
deIay nodeI is used lo nodeI inlerconnecls. The nelhodoIogy of lapping exlraclion is as
VLS 342

foIIovs. Iigure 9 shovs a schenalic of a decoupIed cIock lree lranch in vhich each Iine of
lhe lranch is a decoupIed dislriluled RC nodeI connecled lo ils sul-lree chiId, for vhich
lhe dislriluled Iine propagalion deIay is given ly |
in|
=0.37R
in|
C
cff
. Lach sul-lree is nodeIed
ly a lolaI capacilance C
su||rcc
and lolaI propagalion deIay |
su||rcc
as shovn in Iigure 9.
Considering lapping Iocalion x, lo salisfy lhe equaIily of lhe lvo lranch deIays, lhe
foIIoving equalion is reaIized:

|
in|1
+0.74R
in|1
C
cffsu||rcc1
+|
1
= |
in|2
+0.74R
in|2
C
cffsu||rcc2
+|
2
"
In lhe second parl of lhe equaIily, since lhe inlerconnecl resislance conlined vilh sul-lree
capacilance creales a |umpc Ioop, il has lhe Iunped propagalion deIay of 0.74R
in|
C
su||rcc
.
Revriling inlerconnecl parasilics ly per unil Ienglh paranelers, ve have:

R
in|1
=r
0
x|, C
in|1
=c
0
x|
R
in|2
=r
0
(1- x)|, C
in|1
=c
0
(1- x)|
in vhich r
0
and c
0
are lhe resislance and capacilance per unil Ienglh of lhe vire, | is lolaI
inlerconneclion Ienglh lelveen lhe lvo sul-lrees and lapping Iocalion x. SoIving Lqualion
" vilh respecl lo x resuIls inlo:

x=j1.35(|
in|2
-|
in|1
)+r
0
|(C
cffsu||rcc2
+0.5c
0
|)]/jr
0
|(C
cffsu||rcc1
+c
0
|+C
cffsu||rcc2
)]
In case of (x < O or x > 1), eIongalion vouId le needed. LIongalion is lhe process of adding
exlra vire Ienglh lo lhe sul-lree vhich has Iess effeclive capacilance, in order lo equaIize lhe
deIay of lolh sul-lrees. The Ienglh of eIongalion lo nainlain zero skev is given ly:

|=j-20r
0
C
cffsu||rcc2
+2(100r
0
2
C
cffsu||rcc2
2
+270r
0
c
0
(|
in|2
-|
in|1
))
0.5
]/j20r
0
c
0
]
This nelhodoIogy is appIied for zero skev rouling in DCDNs. The resuIls are given in
Seclion 5.1 viII vaIidale lhe efficiency of lhis nelhodoIogy.

4.2 ParaIIeI synthesis of DCDNs
CDN synlhesis is one of lhe prinary line-consuning sleps, perforned in lhe synlhesis fIov
of VLSI syslens. LspeciaIIy vilh lhe grovlh of conpIex SoCs in currenl advanced
lechnoIogies, lhis parl has lecone nore conpIicaled and Iess conpulalionaI cosl-effeclive.
Many efforls have leen pul inlo paraIIeI conpuler aided design, aII vilh lhe goaI of
reducing lhe conpulalion line. In Iileralure, nelhodoIogies have leen proposed for paraIIeI
synlhesis of CDN such as lhe ones proposed in (anerjee el aI, 1992, anerjee, 1994). These
nelhods hovever, focus nainIy on lhe singIe-ended cIock lree slruclure. In lhis seclion, lhe
goaI is lo Ieverage dislinclive fealures of paraIIeI conpulalion lo reduce conpulalionaI line
required lo synlhesize DCDNs (Zarrali el aI, 2OO7). The nelhodoIogy uliIizes and exlends
lhe lechnique proposed in Seclion 4.1, lo synlhesize zero skev DCDNs in paraIIeI. This is a
fIexilIe nelhodoIogy, appIicalIe lo synnelric/asynnelric and hylrid (differenliaI and/or
singIe-ended) cIock lree slruclures.

On the Efcient Design & Synthesis of Differential Clock Distribution Networks 343

P0 P1
P2 P3
Clock-sink
P0 P1
P2 P3
Clock-root
oI the
partition
P
Clock source

Iig. 1O. IaraIIeI DCDN dislrilulion: a) parlilioning lhe die area inlo sul-regions, l) Iocaling
lhe cIock-rool of each region, c) finding lhe source of lhe cIock nelvork

The nelhodoIogy for paraIIeI synlhesis of zero skev DCDNs is as foIIovs. IniliaIIy lhe lolaI
chip area is parlilioned inlo sul-regions (parlilioning phase). Laler, synlhesis of zero skev
differenliaI cIock dislrilulion nelvorks is perforned on each of lhe parlilioned regions
(IocaI cIock dislrilulion phase). In lhe finaI slage, lhe gIolaI differenliaI cIock nelvork is
rouled for each of lhe previousIy-exlracled cIock-rools of lhe sul-regions (gIolaI cIock
dislrilulion phase). The ollained source of lhe cIock nelvork can end up anyvhere in lhe
vhoIe chip area (Manhallan surface), regardIess of lhe iniliaI parlilioning. The proposed
scenario is iIIuslraled in Iigure 1O. The proposed nelhod nay le inpIenenled using C++
Ianguage and lhe Message Iassing Inlerface (MII) pIalforn (MII). A pseudo-code
descriling lhe nelhod is given in Iigure 11.
A possilIe negalive side effecl of paraIIeI synlhesis is lhe increase in lhe lolaI vire-Ienglh in
lhe cIock nelvork. This couId le inlerpreled as lhe inpacl of nuIli-slage dislrilulion of lhe
cIock nelvork vhich resuIls in iniliaI IocaI zero-skev cIock nelvorks and a finaI gIolaI cIock
nelvork rouled on lop of regionaI cIock nelvorks. In generaI, lhis paraIIeI processing
approach resuIls in a cIock-lree differenl fron lhe one rouled in a singIe slep, due lo die area
parlilioning, lhus, lhe characlerislics of lhe nev cIock lree such as lolaI vire-Ienglh and
skev nay le sIighlIy differenl. This proposed nelhodoIogy is fIexilIe, as il aIIovs having a
hylrid (differenliaI and/or singIe-ended) dislrilulion of lhe cIock nelvork. The gIolaI CDN
couId le differenliaI, vhiIe lhe IocaI (Iover IeveIs) CDNs couId le singIe-ended lo aIIeviale
rouling conpIexily. Il is possilIe lo enhance lhe gIolaI/IocaI dislrilulion aIgorilhn vilh
refined inlerconnecl nodeIs. This nelhodoIogy is aIso appIicalIe lo aII
synnelric/asynnelric cIock-lrees.

Para!!c! Zcrn 5kcw DIffcrcntIa! C!nck DIstrIbutInn
(CIock-sinks, Nunler of Irocessing-nodes)
|
1. PartItInn chip area according lo lhe nunler of prnccssIng nndcs.
2. AppIy '!nca! zero skev (differenliaI) cIock dislrilulion lo lhe parlilioned
areas and scnd lhe cIock-lree rnnt(s) lo lhe rool processing node.
3. RcccIvc lhe processed cIock-lree rool(s) fron processing nodes, and, appIy
'g!nba! zero skev (differenliaI) cIock dislrilulion.
4. Rcturn lhe ollained finaI cIock-lree rool as lhe snurcc of lhe (differenliaI)
cIock dislrilulion nelvork.
]
Iig. 11. Iseudo-code for paraIIeI synlhesis of zero skev differenliaI cIock dislrilulion
VLS 344

5. ResuIts

In lhis seclion, lhe quanlilalive resuIls reIaled lo lhe given design and synlhesis nelhods for
DCDNs are given.

5.1 Zero skew routing
The zero skev rouling nelhod, inspired ly Tsays aIgorilhn and given in Seclion 4.1, vas
appIied lo IM lenchnarks r1.r5 (Tsay, 1991). Modified Tsays nelhod (for differenliaI
signaI inlegrily) vas inpIenenled using C++ for cIock rouling, and ILRL Ianguage for
nelIisl nanipuIalion vas uliIized for design and sinuIalions. HSIICL sinuIalion resuIls for
lhe proposed nelhod are laluIaled in TalIe 2. The deIay and skev resuIls presenled
lhroughoul lhis vork represenl lhe average and alsoIule difference of cIock signaI phase
deIay al sink nodes respecliveIy.
Tvo nelhodoIogies vere used for rouling differenliaI Iines: SingIe-Spaced (SS) and DoulIe-
Spaced (DS) rouling. In singIe-spaced rouling schene, lhe nuluaI coupIing effecls are
slronger, lherefore differenliaI characlerislics of lhe pair is nore doninanl. DS offers snaIIer
nuluaI coupIing, consequenlIy lhis reduces deIay vhiIe degrading noise innunily. This is
due lo lhe facl lhal lhe slronger lhe coupIing effecl is, lhe slronger connon-node noise
rejeclion lecones.
TalIe 2 denonslrales lhal, on average, cIock lrees generaled vilh lhe proposed nodeI shov
97 skev reduclion conpared lo lhose ollained using LInore nodeI, i.e. negIecling
coupIing effecls. This inprovenenl is achieved lecause coupIing effecls of differenliaI Iines
are nore accuraleIy considered in lhe aIgorilhn Ieading lo lapping poinl seIeclion. As
lechnoIogy advances lhe coupIing effecls increase and ve are no Ionger alIe lo negIecl lhese
effecls in syslen nodeIing. In lhis case, negIecling lhe coupIing effecls, resuIls in lhe
nispIacenenl of lapping poinls and reduces lhe effecliveness of lhe considered zero skev
DCDNs. SinuIalion resuIls aIso shov snaIIer deIay and skev for lhe DS schene due lo
reduced coupIing, hovever lhis design slralegy as ve viII see degrades roluslness in
presence of exlernaI noise.

ench
Mark
SingIe Spaced (SS) SingIe Spaced (SS) DoulIe Spaced (DS)
LInore Iroposed Iroposed LInore Iroposed Iroposed
Skev (ps) DeIay (ns) Skev (ps) DeIay (ns) Skev (ps) DeIay (ns)
r1 115 1.1 5.9 1.1 2.4 O.9
r2 199 3.8 15 3.7 8.8 3.3
r3 341 5.1 14 5.1 8.O 4.6
r4 759 17.5 36 17.1 2O 14.5
r5 1825 34 51 34 39 28
TalIe 2. Skev and deIay of DCDN for r1-r5 lenchnarks in 18Onn lechnoIogy

On the Efcient Design & Synthesis of Differential Clock Distribution Networks 345

5.2 AppIying DT buffers
uffers vere inserled lo inprove lhe perfornance of lenchnark cIock nelvorks. The luffer
inserlion procedure is as foIIovs. Lov-sving luffers vere inserled al lranching poinls and
IeveI converling luffers vere inserled al sinks. Lov-sving luffers are sized lo reduce
propagalion deIay lhroughoul lhe cIock nelvork. To acconpIish lhis, a lase size for luffer
according lo ils deIay-pover characlerislic diagran vas chosen lo drive a unil Ienglh
inlerconnecl segnenl. Iurlher, lhe luffers vere unifornIy sized according lo lhe Iongesl
inlerconnecl. IuII-sving IeveI converlers (differenliaI lo singIe-ended) vere conposed of
nininun size lransislors lo reduce lhe pover consunplion and skev reduclion, lheir sizes
vere scaIed up reIalive lo lheir Ioad capacilance.

ench
Mark
Lov-Sving
ConvenlionaI
Lov-Sving
Iroposed
IuII-Sving
ConvenlionaI
Skev (ps) DeIay (ns) Skev (ps) DeIay (ns) Skev (ps) DeIay (ns)
r1 695 9.6 545 7.O 71 1.1
r2 844 1O 679 7.7 163 1.3
r3 8O8 1O 667 7.8 127 1.3
r4 1388 11.9 981 8.8 379 1.6
r5 1566 13.1 1135 9.6 532 1.9
TalIe 3. uffered DCDNs (IuII-Sving Vdd=1.8V, Lov-Sving Vdd=O.5V)

TalIe 3 shovs lhe skev and deIay difference for siniIarIy sized, luffered DCDN lased on
lhe convenlionaI (DaIIy el aI, 1998) and lhe proposed luffers. Il shovs 25 deIay and skev
inprovenenl on average conpared lo convenlionaI luffers in Iov-sving differenliaI
cIocking schene. ResuIls shov lhal lhe deIays are reduced significanlIy vhiIe skevs are
degraded as conpared vilh un-luffered DCDNs. Il is leIieved lhal skevs in luffered cIock
nelvorks can le reduced significanlIy ly enhancing lhe process of luffer inserlion. Ior
inslance, differenliaI luffers deIay nodeI shouId le considered vhen lapping poinls are
seIecled in lhe zero skev DCDN design aIgorilhn.

Single Differential (SS) Differential (DS)
0
2
4
6
8
10
12
Clocking Schemes
%

S
k
e
w

V
a
r
i
a
t
i
o
n
s
Skew Variations for Different CDNs in presence of Crosstalk
Low Swing
Full Swing

Iig. 12. Skev varialions due lo crosslaIk

Wilh regards lo lhe skev sensilivily of lhe proposed DT DCDNs, lvo lypes of exlernaI
aggressors, resuIling inlo randon skev are invesligaled: pover-suppIy varialions and
crosslaIk. Conparisons vere nade lelveen siniIarIy designed singIe-node CDN, singIe-
VLS 346

spaced DCDNs and doulIe-spaced DCDNs (Iigure 12). enchnark r3 is used for
sinuIalions due lo ils average characlerislic in lerns of size and sinuIalion line. Ior lhe
Iov-sving schene, lhe pover suppIies vere V
ddH
=1.8V & V
ddL
=O.5V, vhereas for fuII-sving
schene a singIe suppIy voIlage (V
ddH
=1.8V) vas used. In lhose experinenls, suppIy-
voIlages vere varied ly 1O.
SinuIalion resuIls shov lhal for lolh cIocking schenes, lhe singIe-spaced DCDN is lhe nosl
rolusl design nelhod in lhe presence of pover-suppIy varialions vhen conpared lo olher
CDNs. Skev varialions increase vhen Iov-sving cIocking is used. DoulIe-spaced DCDN
has Iess roluslness lo suppIy varialions. DCDN is seen lo have up lo 25 Iess skev
varialions in Iov-sving cIocking schene and up lo 9 Iess skev varialions in fuII-sving
cIocking schene lhan singIe-node CDN, in presence of pover-suppIy varialions.
Anolher source of perlurlalion lhal causes deIay uncerlainly in CDNs is crosslaIk. Ior
experinenls, a fuII-sving aggressor is appIied lo one of lhe lvo lig-chiId of lhe cIock lree.
The sane Iov-sving and fuII-sving cIocking schenes vere considered. SinuIalion resuIls
shov lhal lhe singIe-spaced DCDN shovs 6 Iess skev varialions vhen conlined vilh Iov
sving cIocking schene and 9 vhen conlined vilh fuII sving cIocking schene as
conpared lo singIe-node CDN suljecl lo crosslaIk.

5.3 AppIying Iow-power buffers
In lhis parl, lhe effecl of enpIoying Iov-pover luffers inlroduced in Seclion 3.2 is sludied.
Ior conparison and perfornance evaIualion of cIock nelvorks, a sel of 4OO MHz CDNs
vere designed for lenchnark r3 due lo ils average size suilalIe for sinuIalions. AII reporled
designs vere nininaIIy sized lo neel lhe largel operaling frequency (4OOMHz) lased on
18Onn lechnoIogy paranelers. In DCDNs, lhe differenliaI signaI sving vas scaIed ly
adjusling lhe laiI currenl source of inlernediale differenliaI luffers. The Iovesl polenliaI
reached ly eilher parl of lhe differenliaI signaI is V

-R|
ss
vhere V

is lhe suppIy voIlage, |


ss

is lhe laiI currenl and R is lhe equivaIenl resislance of lhe lransislor Ioads. Nole lhal, lhe
Ioad resislance delernines lhe landvidlh of lhe cIock nelvork, hence lhe onIy possilIe
varialIe lo lune is lhe laiI currenl lo scaIe lhe differenliaI voIlage sving. In lhe foIIoving, lhe
effecl of differenliaI voIlage scaIing on pover consunplion and cIock skev varialions in lhe
presence of pover suppIy varialions is expIored.

1.62-0.45 1.79-0.475 1.8-0.5 1.89-0.525 1.98-0.55
-80
-60
-40
-20
0
20
40
Voltage Supply VddH -VddL (V)
%

S
k
e
w

V
a
r
i
a
t
i
o
n
s

Skew Variations for Different Low Swing CDNs
Single
Differential (SS)
Differential (DS)
1.62 1.71 1.8 1.89 1.98
0
5
10
Voltage Supply (V)
%

S
k
e
w

V
a
r
i
a
t
i
o
n
s
Skew Variations for Different Full Swing CDNs
Single
Differential (SS)
Differential (DS)

Iig. 13. Skev due lo suppIy-voIlage varialions in Iov/fuII -sving schenes

On the Efcient Design & Synthesis of Differential Clock Distribution Networks 347

5.3.1 Effect on power consumption
The laiI currenl affecls lolh lhe shorl circuil and dynanic pover dissipalion. Iigure 14
denonslrales lhe effecl of differenliaI voIlage scaIing on lhe pover consunplion of DCDN
and ils singIe-node CDN counlerparl lolh oplinized for lhe sane lenchnark, vhich is
ollained ly scaIing lhe laiI currenl source of lhe differenliaI luffer.
Iigure 14 shovs lhe pover consunplion ollained vilh lhree differenl cIocking schenes:
differenliaI CDNs vilh conposile-Ioad luffers, differenliaI CDNs vilh onIy grounded-gale
Ioad lransislor luffers and lhe convenlionaI singIe-node cIocking schene. The reason for
expIoring lhe nelvork vilh onIy grounded-gale Ioad lransislor luffers is lo denonslrale lhe
effecliveness of conposile-Ioad luffers in lerns of reduced sensilivily vhen Iarge signaI
svings are considered. In lhis case, lhe size of lhe Ioad lransislors is increased (ly
approxinaleIy 2O) lo gel lhe sane anpIilude for a given laiI currenl. The CDN used as
reference is a convenlionaI laIanced singIe-node cIocking lree, vhere lransislors have lhe
nininaI size necessary lo give fuII oulpul sving vhiIe operaling al lhe largel frequency
(4OOMHz).

10 15 20 25 30 35 40 45 50
1.5
2
2.5
3
3.5
4
4.5
Differential Pair Swing Scaling (%Vdd)
P
o
w
e
r

C
o
n
s
u
m
p
t
i
o
n

(
W
)


DCDN Composite Load
DCDN GND Load
Single

Iig. 14. Iover consunplion for voIlage scaIed DCDNs vs. a singIe-node CDN (r3).

Iigure 14 shovs lhal as lhe differenliaI sving increases leyond 25 of suppIy-voIlage
(45OnV, V

=1.8V), lhe pover consunplion increases draslicaIIy. This enphasizes lhe


significanl inpacl of Iarge differenliaI voIlage svings on lhe pover consunplion of lhe
cIock nelvork. Ior differenliaI voIlage svings leIov 45OnV, lhe pover consunplion is nol
reduced nuch if lhe differenliaI sving scaIing is furlher reduced. A Iover lound of 1O of
V
dd
vas inposed lo lhe differenliaI sving lo ensure a sufficienl noise nargin. Anolher
consideralion is lhal even vilh a differenliaI sving as Iov as 1O of V

(18OnV), lhe pover


consunplion of lhe differenliaI cIock nelvork renains aInosl 3O higher lhan lhal of
singIe-node cIock dislrilulion nelvork. Thus, lrying lo nalch lhe pover dissipalion of a
singIe-node nelvork ly decreasing lhe sving of differenliaI nelvorks does nol appear lo le
a vialIe oplion. A finaI olservalion fron Iigure 14 is lhal lhe DCDN vilh grounded-gale
Ioads (CND) consunes Iess pover over a Iiniled region vhere lhe differenliaI sving is
Iarge. Hovever, as ve viII see in lhe foIIoving, lhis sIighl reduclion in dissipaled pover
cones al a Iarge price in cIock skev varialiIily.

VLS 348

5.3.2 SuppIy-voItage scaIing
In lhe previous seclion, ve olserved lhal lhe reduclion of lhe differenliaI sving has a slrong
inpacl on lhe pover consunplion. Ior differenliaI svings snaIIer lhan 25 of V

, lhe
pover consunplion lecones Iess lhan haIf of lhal olserved vhen lhe cIock nelvork
operales vilh lhe voIlage sving of 5O of V

. Second, as Iigure 15 suggesls, due lo lhe


fIuclualions of connon-node voIlage in lhe cIock nelvork, lhe voIlage sving of 25 of V


(45OnV) shovs lhe Ieasl skev vhen lhe cIock nelvork is suljecl lo suppIy varialions. We
aIso olserve lhal DCDN vilh conposile Ioad is nore rolusl. This is due lo lhe Iinear
characlerislic of conposile Ioad luffers, as vas seen in Seclion 3.2.

10 15 20 25 30 35 40 45 50
0
50
100
150
200
250
300
350
Differential Pair Swing Scaling (%Vdd)
M
a
x

S
k
e
w

V
a
r
i
a
t
i
o
n
s

(
%
)


DCDN Composite Load
Differential GND Load

Iig. 15. Ieak lo peak skev varialions in differenliaI voIlage scaIed DCDNs.

Taking lhese consideralions inlo accounl, ve consider a design poinl for vhich lhe
differenliaI sving is 25 of V

(45OnV) and ve reduce lhe suppIy voIlage lo lhe poinl


vhere ve reach lhe sane pover consunplion as lhal olserved for lhe singIe-node and
differenliaI cIock nelvorks. HSIICL sinuIalions denonslrale lhal for a suppIy voIlage of
1.4V and differenliaI sving 45OnV, ve ollain lhe sane pover consunplion for lhe
differenliaI cIock nelvork. Yel, as can le olserved fron Iigure 16, lhe varialion of cIock
skev is sliII Iess lhan lhal of lhe conparalIe singIe-node CDN. Anolher inleresling poinl
lhal vas olserved during suppIy voIlage scaIing in DCDN is a ncg|igi||c signa| |a|cncq
iffcrcncc. This can le juslified since as lhe laiI currenl is Iovered lo achieve Iover
differenliaI svings, lhe necessary differenliaI voIlage needed for lhe differenliaI luffer lo
svilch is aIso decreased. This enalIes lhe differenliaI luffer lo operale/svilch fasler lhan in
lhe case vhere grealer suppIy voIlage vilh grealer differenliaI sving is used. AIso as
olserved fron Iigure 16 and as discussed previousIy, lhe DCDN lased on onIy grounded-
gale Ioads is Iess resiIienl.

On the Efcient Design & Synthesis of Differential Clock Distribution Networks 349

-10% -5% 0 5% 10%
0
5
10
Voltage Supply Variations (V)

















%

S
k
e
w

V
a
r
i
a
t
i
o
n
s
Skew Variations for Different Clocking Schemes @ 400 MHz Frequency


Single
Differential (1.8V) Composite
Differential (1.4V) Composite
Differential (1.4V) GND Load
Less Varations
Less Power
Consumption

Iig. 16. HSIICL sinuIalion shov Iess varialions in DCDN conpared lo singIe-node CDN
for equaI noninaI pover consunplion.

5.4 ParaIIeI zero skew routing
enchnarks fron (Tsay, 1991) vilh as nany as 31O1 cIock sinks and lolaI area as Iarge as 1.4
cn * 1.4 cn vere chosen. The reporled resuIls are lased on 18Onn CMOS lechnoIogy fiIe,
for vhich inlerconnecl paranelers are: C
g
= 8e-17(I/un), C
c
= 8e-17(I/un) and
R
inl
=O.O22(/un). The processing line, speed up and resuIling sinuIaled skevs vhen
synlhesizing zero skev DCDNs using 1 (sequenliaI), 2 and 4 processing nodes are reporled
in TalIe 4. These resuIls denonslrale a ncar|q-|incar spcc-up. Il is expecled lhal for very Iarge
lenchnarks, lhe speed-up grovs |incar|q vhen lhe nunler of sinks is sufficienlIy Iarge
conpared lo lhe nunler of processing nodes. Thus, lhe processing line overhead due lo
perforning lhal synlhesis in paraIIeI vas aInosl negIigilIe. Iigure 17 confirns lhese
olservalions for 2 and 4 processing nodes. As lhe nunler of cIock sink nodes increases, lhe
speed-up (dashed Iines) converges lo lhe naxinun expecled speed-up (rigid Iines).

enchnark
(# of sinks)
Skev (ps) Conpulalion Tine (s) Speed-Up
1 IN 2 IN 4 IN 1 IN 2 IN 4 IN 1 IN 2 IN 4 IN
r3
(862)
14 12 14 O.7O O.48 O.21 1.O 1.45 3.24
r4
(19O3)
36 35 36 1.6O O.9O O.46 1.O 1.76 3.46
r5
(31O1)
51 49 52 2.74 1.45 O.73 1.O 1.89 3.74
TalIe 4. Run-line and speed-up resuIls of lenchnarks

In generaI, lhe paraIIeI processing approach resuIls in a cIock-lree differenl fron lhe one
rouled in a singIe slep, due lo die area parlilioning, lhus, lhe characlerislics of lhe nev cIock
lree such as lolaI vire-Ienglh and skev nay le sIighlIy differenl. This proposed
nelhodoIogy is fIexilIe, as il aIIovs having a hylrid (differenliaI and/or singIe-ended)
dislrilulion of lhe cIock nelvork. The gIolaI CDN couId le differenliaI, vhiIe lhe IocaI
(Iover IeveIs) CDNs couId le singIe-ended lo aIIeviale rouling conpIexily. Il is possilIe lo
enhance lhe gIolaI/IocaI dislrilulion aIgorilhn vilh refined inlerconnecl nodeIs. This
nelhodoIogy is aIso appIicalIe lo aII synnelric/asynnelric cIock-lrees.
VLS 350

1000 1500 2000 2500 3000 3500
1
1.5
2
2.5
3
3.5
4
Number oI clock-sinks
S
p
e
e
d
-
u
p



Iig. 17. Speed-up approaches ils naxinun, as lhe size of cIock nelvork increases, for 2 and
4 processing node synlhesis cases.

6. ConcIusions

In lhis chapler, sone lechniques for efficienl design and synlhesis of on-chip DifferenliaI
CIock Dislrilulion Nelvorks (DCDNs) vere given.
IniliaIIy design lechniques vere proposed lhal inprove lhe perfornance of differenliaI
luffers vhich resuIl inlo lhe perfornance inprovenenl of DCDNs. This vas achieved ly
neans of inlroducing configuralions for differenliaI luffers lased on Dynanic ThreshoId
(DT) lransislors. Il vas shovn lhal for Iov suppIy-voIlages, lhey oulperforn lhe
convenlionaI luffers vilh 25 deIay reduclion. AIso, in order lo overcone lhe high pover
consunplion of DCDNs, a circuil configuralion vas proposed ly vhich il is possilIe lo
reduce lhe differenliaI voIlage svings (dovn lo 1O of V

) vhich reduces lhe pover


consunplion significanlIy (3O nore lhan singIe-node CDN). Iurlhernore, ly scaIing lhe
suppIy voIlage of lhe syslen fron 1.8V lo 1.4V, ve reach a design poinl vhere lhe DCDN
consunes lhe sane pover as ils singIe-node CDN counlerparl lul has Iess varialion (in
lerns of skev). This hovever cones al lhe expense of deIay and reduced voIlage sving.
Various synlhesis lechniques vere inlroduced lhal inprove lhe DCDNs rouling lo achieve
Iov (and possilIy zero) skev. Ior lhis, a Iine equivaIenl deIay nodeI vas suggesled ly
vhich il is possilIe lo roule DCDNs vilh Iov (zero) skev. On average, 97 skev reduclion
vas ollained uliIizing lhis nodeI conpared lo lhe cIassic LInore deIay nodeI. A
nelhodoIogy for paraIIeI dislrilulion (rouling) of zero skev DCDNs vas aIso proposed.
The nelhod is appIicalIe lo aII synnelric/asynnelric cIock nelvorks vilh aliIily for
hylrid inpIenenlalion (differenliaI and/or singIe-ended). The proposed nelhod aIIeviales
lhe prolIen of high conpulalionaI cosl of such CDNs in conpIex VLSI syslens. UliIizing
lhis nelhod, nearIy-Iinear speed-up is achieved for zero skev DCDNs.
In lhe hierarchy of CDNs in nodern high-perfornance conpIex syslens, DCDNs can le
effecliveIy fil in lhe gIolaI IeveI of CDNs, yel lhey can le used as lhe soIe soIulion lo lhe
cIock dislrilulion of lhe syslen, vhen noise is lhe nain design issue.

On the Efcient Design & Synthesis of Differential Clock Distribution Networks 351

7. References

Anderson, I. L., WeIIs, }. S. & erla, L. Z. (2OO2). The core cIock syslen on lhe nexl
generalion Ilaniun nicroprocessor, in |SSCC Digcs| cf Tccnnica| Papcrs, pp. 146-7.
Assaderaghi, I., Sinilsky, D., Iarke, S.A., okor, }., Ko, I.K. & Hu, Chenning. (1997).
Dynanic lhreshoId-voIlage MOSILT (DTMOS) for uIlra-Iov voIlage VLSI, ILLL
Transaclions on LIeclron Devices, VoIune 44, Issue 3, pp.414 - 422.
anerjee, Irilhviraj. & Xing, Zhaoyun. (1992). A paraIIeI aIgorilhn for zero skev cIock lree
rouling, InlernalionaI Synposiun on IhysicaI Design, pp. 118 - 123.
anerjee, Irilhviraj. (1994). IaraIIeI AIgorilhns for VLSI Conpuler-Aided Design, ITR
Irenlice HaII, LngIevood CIiffs, Nev }ersey O7632.
roderson, ol. (2OO5). AnaIog Inlegraled Circuils, onIine naleriaI, avaiIalIe:
hllp://lvrc.eecs.lerkeIey.edu/IeopIe/IacuIly/rl/.
ChappeII, .A., ChappeII, T.I., Schusler, S.L., SegnuIIer, H.M., AIIan, }.W., Iranch, R.L. &
ReslIe, I.}. (1988). Iasl CMOS LCL receivers vilh 1OO-nV vorsl-case sensilivily,
ILLL }SSC, VoIune 23, Issue 1, pp:59 - 67.
Cong, }., He, L., Koh, C. K. & Madden, I. (1996). Ierfornance Oplinizalion of VLSI
Inlerconnecl Layoul, |n|cgra|icn, |nc V|S| ]curna|, voI. 21, pp. 1-94.
DaIIy, WiIIian }. & IouIlon, }ohn. (1998). DigilaI Syslens Lngineering, Canlridge
Universily Iress.
HaII, S.H., HaII C.W. & McCaII, }.A. (2OOO). High-Speed DigilaI syslen Design, A Handlook
of Inlerconnecl lheory and Design Iraclices. }ohn WiIey & Sons INC.
HerzeI, I., Razavi, . (1999). A s|uq cf csci||a|cr ji||cr uc |c supp|q an su|s|ra|c ncisc, ILLL }.
Circuils and Syslens, VoIune 46, pp. 56 - 62.
Kahng, A.., Muddu, S., Sarlo, L. (2OOO). On svilch faclor lased anaIysis of coupIed RC
inlerconnecls, Design Aulonalion Conference, pp. 79 - 84.
MII, Message Iassing Inlerface, onIine: hllp://vvv.npi-forun.org/.
OMahony, Irank I. (2OO3). 1OCHz CIolaI CIock Dislrilulion Using CoupIed Slanding-
vave OsciIIalors, IhD Disserlalion, Slanford Universily.
Razavi, ehzad. (2OO1). Design of AnaIog CMOS Inlegraled Circuils. Mc Crav HiII.
ReslIe, I. }. & A. Deulsch (1998). Designing lhe lesl cIock dislrilulion nelvork, in
Sqmpcsium V|S| Circui|s Digcs| cf Tccnnica| Papcrs.
ReslIe, I.}., McNanara, T.C., Weller, D.A., Canporese, I.}., Lng, K.I., }enkins, K.A., AIIen,
D.H., Rohn, M.}., Quaranla, M.I., oerslIer, D.W., AIperl, C.}., Carler, C.A., aiIey,
R.N., Ielrovick, }.C., Krauler, .L. & McCredie, .D (2OO1). A cIock dislrilulion
nelvork for nicroprocessors, |||| ]. Sc|i-S|a|c Circui|s, voI. 36, no.5, pp. 792-799.
Sekar, D.C. (2OO5). CIock lrees: differenliaI or singIe ended` , InlernalionaI Synposiun on
QuaIily of LIeclronic Design, pp.548 - 553.
Tsay, R. S. (1991). Lxacl zero skev, in Iroc. ILLL Inl. Conf. Conpuler-Aided Design, pp.
336-339, Nov.
Vasseghi, N., Yeager, K., Sarlo, L. & Seddighnezhad, M. (1996). 2OO-MHz superscaIar RISC
nicroprocessor, |||| ]. Sc|i-S|a|c Circui|s, voI. 31, no. 11, pp. 1675-1685.
Wikipedia, onIine: hllp://en.vikipedia.org/viki/Ihase-Iocked_Ioop
Zarrali, Hounan. (2OO6). On lhe design and synlhesis of differenliaI cIock dislrilulion
nelvork, MASc Disserlalion, Concordia Universily.
VLS 352

Zarrali, Hounan, Saaied, Haydar, AI-KhaIiIi, A. }. & Savaria, Yvon. (2OO6). Zero Skev
DifferenliaI CIock Dislrilulion Nelvork, InlernalionaI Synposiun on Circuil And
Syslens (ISCAS), Creece, IsIand of Kos.
Zarrali, Hounan, ZiIic, ZeIjko, AI-KhaIiIi, A. }. & Savaria, Yvon. (2OO7). A nelhodoIogy for
paraIIeI synlhesis of zero skev differenliaI cIock dislrilulion nelvorks, }oinl
Conference of MWSCAS/NLWCAS, MonlreaI, Canada.
Zhang, H., Varghese Ceorge & Ralaey, }. M. (2OOO). Lov-sving on-chip signaIing
lechniques: effecliveness and roluslness, ILLL Trans. on VLSI Sysl., VoIune 8,
Issue 3, pp. 264 - 272.

Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 353
X

Robust Design and Test
of AnaIog/Mixed-SignaI Circuits in
DeepIy ScaIed CMOS TechnoIogies

Cuo Yu and Ieng Li
Tcxas AcM Unitcrsi|q
Cc||cgc S|a|icn, TX, USA

1. Introduction

The proIiferalion of connunicalion and consuner eIeclronic syslens Ieads lo lhe Iarge
denands of high-perfornance & rolusl anaIog/nixed-signaI circuils. On lhe olher hand,
aIlhough deepIy scaIed CMOS lechnoIogies enalIe grealer degrees of seniconduclor
inlegralion and Iover nanufacluring cosl, lhe advancenenls of lechnoIogies aIso inlroduce
severaI nev chaIIenges for VLSI circuil design. The increasing paranelric varialions and
lheir inpacls on circuil perfornances are leconing key issues vhich nake aIready conpIex
circuils even nore sophislicaled in design praclice (Nassif, 2OO1). Cive lhese larriers,
designing rolusl nixed-signaI circuils in deepIy scaIed CMOS lechnoIogies lecone a reaI
chaIIenge for circuil designers.

In lhis chapler, ve propose lo soIve lhe prolIens ly firsl inlroducing efficienl and accurale
nodeIing lechniques for Iarge anaIog/nixed-signaI circuil designs vilh consideralion of
process varialions. IoverfuI slalislicaI dinension reduclion lechniques are uliIized lo nake
perfornance nodeIing of Iarge circuils possilIe. Then lhese noveI circuil nodeIs are used lo
achieve efficienl paranelric syslen perfornance anaIysis. In such vays lhe designers can
have lhe fuII knovIedge of reaI circuil perfornances under process varialions, vhich nakes
il feasilIe lo do rolusl syslen lopoIogy design and circuil oplinizalion for Iarge nixed-
signaI syslens. The efficienl nodeIing franevork is aIso exlended for circuil lesl purposes,
vhich Ieads lo rolusl uiIl-in SeIf-Tesl (IST) circuil design and oplinizalion.

We denonslrale lhe effecliveness of lhe proposed ideas vilh lvo popuIar lypes of nixed-
signaI circuil exanpIes, Signa-DeIla A/D converlers and Ihase Looked Loops (ILLs).

Ior Signa-DeIla ADCs, ve presenl a noveI paranelerized Iookup lalIe (LUT) lechnique for
capluring perfornances of luiIding lIocks in lhe syslens, and use lhese LUTs lo perforn
lopoIogy lrade-off anaIysis and syslen oplinizalion. ModeIing of circuil IeveI
nonIinearilies, adaplive LUT generalion and rolusl syslen design are expIained in delaiI
vilh conprehensive experinenlaI resuIls (Yu & Li, 2OO6 & 2OO7a).
18
VLS 354

Ior ILLs, ve discuss luiIding paranelric VeriIog-A nodeIs for charge-punp ILLs and use
lhese nodeIs for high-IeveI perfornance anaIysis. In order lo handIe Iarge nunler of
paranelric varialIes, dinension reduclion lechnique is appIied lo reduce sinuIalion
conpIexilies. We appIy lhe ollained syslen sinuIalion franevork lo evaIuale lhe
efficiencies of paranelric faiIure delecling of differenl IST circuils and perforn
oplinizalion lased on lhe experinenlaI resuIls (Yu & Li, 2OO7l).

2. Robust Sigma-DeIta ADCs Design

Signa-DeIla ADCs have leen videIy used in dala conversion appIicalions due lo lhe good
resoIulion. Hovever, oversanpIing and conpIex circuil lehaviors render lhe lransislor-
IeveI anaIysis of lhese designs prohililiveIy line consuning (Norsvorlhy & el aI., 1997).
The inefficiency of lhe slandard sinuIalion approach aIso ruIes oul lhe possiliIily of
anaIyzing lhe inpacls of a nuIlilude of environnenlaI and process varialions crilicaI in
nodern VLSI lechnoIogies. We propose a Iookup lalIe (LUT) lased nodeIIing lechnique lo
faciIilale nuch nore efficienl perfornance anaIysis of Signa-DeIla ADCs. Various
lransislor-IeveI circuil nonideaIilies are syslenalicaIIy characlerized al lhe luiIding lIock
IeveI and lhe vhoIe syslen is sinuIaled nuch nore efficienlIy using lhese luiIding lIock
nodeIs. Our approach can provide up lo four orders of nagnilude runline speedup over
SIICL-Iike sinuIalors, hence significanlIy shorlening lhe CIU line required for evaIualing
syslen perfornances such as SNDR (signaI-lo-noise-and-dislorlion-ralio). The proposed
nodeIing lechnique is furlher exlended lo enalIe scaIalIe perfornance varialion anaIysis of
conpIex Signa-DeIla ADC designs. Such approach aIIovs us lo perforn lrade-off anaIysis
of various lopoIogies considering nol onIy noninaI perfornances lul aIso lheir varialiIilies.

2.1 Background of Sigma-DeIta ADC design
We lriefIy discuss lhe lackground and difficuIlies for Signa-DeIla ADC design in lhis
seclion. Various inporlanl circuil nonideaIilies, vhich are difficuIl lo nodeI accuraleIy
using anaIylicaI equalions, are aIso discussed. The lvo lasic conponenls of Signa-DeIla
ADCs are noduIalors and digilaI fiIlers as shovn in Iig. 1. The anaIog inpul of lhe ADC is
sanpIed ly a very high frequency cIock in lhe noduIalor, lhen lhe sanpIed signaI is passed
lhrough lhe Ioop fiIler lo perforn noise-shaping. The oulpul of lhe Ioop-fiIler is quanlized
ly an inlernaI A/D converler, producing a lil-slrean al lhe sane speed as lhe sanpIing
cIock. The Iov-pass digilaI fiIler in lhe decinalor lhen renoves lhe oul-of-land noise, and
lhe dovn-sanpIer converls lhe high speed lil-slrean lo high resoIulion digilaI codes.

( ) H :

Iig. 1. Iock diagran of a Signa-DeIla ADC
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 355

The difference lelveen lhe ideaI digilaI oulpul of lhe quanlizer and lhe acluaI anaIog signaI
is caIIed quanlizalion noise. The goaI of Signa-DeIla lechnique is lo eIininale lhis unvanled
quanlizalion noise as nuch as possilIe. y oversanpIing lhe inpul signaI, lhe noduIalor
noves lhe najorily of lhe quanlizalion noise oul of lhe signaI landvidlh. The principIe of
noise-shaping can le anaIyzed using lransfer funclions, vhich can le ollained using a
Iinear nodeI in lhe frequency donain. The quanlizalion noise L(z) is nodeIIed as addilive
noise lo lhe quanlizer, lhe oulpul of lhe quanlizer Y(z) can le vrillen as

( ) 1
( ) ( ) ( )
1 ( ) 1 ( )
H :
Y : X : E :
a H : a H :
= +
+ +
(1)
vhere d is lhe feedlack gain of lhe DAC, X(z) is lhe inpul signaI, H(z) is lhe lransfer
funclion of lhe Ioop fiIler. y configuring H(z) and d, ve can have differenl noise shaping
funclions so lhal lhe signaI-lo-noise ralio in lhe oulpul can le oplinized.

Signa-DeIla ADC perfornances are grealIy inpacled ly various circuil-IeveI nonideaIilies,
such as lhe finile DC gain, sIev of lhe operalionaI anpIifies, charge injeclions of lhe
svilches, nisnalch of lhe inlernaI quanlizers and D/A converlers. The effecls are difficuIl
lo anaIyze accuraleIy ly hand caIcuIalion or sinpIe anaIylicaI nodeIs, neilher are lheir
inpacls on syslen perfornances. There exisl severaI high-IeveI sinuIalors for Signa-DeIla
ADC design, Iike MIDAS (alii el aI., 1997) and SWITCAI (Iang & Suyana, 1992). These
lechniques are onIy suilalIe for archileclure-IeveI expIoralion and delerninalion of luiIding
lIock specificalions in lhe earIy design phase, lul no suilalIe for lhe consideralion of circuil-
IeveI non-ideaIilies. Oflen lhe line, lransislor-IeveI sinuIalion is lhe onIy choice for currenl
perfornance evaIualion, vhich nay lake a fev veeks for a singIe lransienl sinuIalion.
AnaIyzing lhe inpacl of process varialions is an even grealer chaIIenge lecause a Iarge
nunler of Iong lransienl sinuIalions are needed lo derive lhe perfornance slalislics under
process varialions. In lhe foIIoving seclions, ve address lhese issues ly adopling lhe Iookup
lalIe lased nodeIs, vhich can handIe lhe circuil-IeveI delaiIs and process varialion induced
perfornance slalislics accuraleIy and efficienlIy.

2.2 Performance modeIing with Lookup tabIe (LUT) technique
The need of efficienl sinuIalion lechniques capalIe of incIuding lransislor-IeveI delaiIs is
parlicuIarIy pressing for assessing lhe inpacl of process varialions, vhere lhe anaIysis
conpIexily lends lo expIode in a Iarge paraneler space. We slarl vilh nodeIIing of
svilched-capacilor lype Signa-DeIla ADCs, vhich are cIocked ly sone gIolaI sanpIing
signaIs. The najor conponenls in lhe converlers are inlegralors, quanlizers and feedlack
DACs.

Since lhe svilched-capacilor inlegralors in lhe discrele-line Signa-DeIla noduIalors are
cIocked ly lhe sanpIing cIock as shovn in Iig. 2, il is possilIe lo use Iookup lalIes lo
represenl lhe nonIinear slale lransfer funclion of lhe syslen (ishop el aI., 199O, rauns el
aI., 199O). The oulpul of inlegralor can le expressed as a funclion of inpul signaIs and lheir
previous slales as in (2)
| 1| ( | |, | 1|, | 1|) y k F y k x k a k + = + + (2)
vhere y|k+1j is lhe currenl oulpul of lhe inlegralor, y|kj is lhe inlegralor oulpul in lhe
previous cIock cycIe, x|k+1j and d|k+1j are lhe currenl inpul signaI and lhe digilaI feedlack
VLS 356

oulpul of lhe DAC, respecliveIy. This properly of Signa-DeIla ADCs nakes il possilIe lo
predicl lhe nev inlegralor oulpul using lhe previous slale and lhe nev inpul. The previous
slale of lhe inlegralor, lhe digilaI feedlack and lhe nev anaIog inpul are discrelized al a sel
of discrele voIlage IeveIs lhal are used as lhe indices lo lhe Iookup lalIe nodeIs.

y|k1|F(y|k|,x|k1|,d|k1|)

Iig. 2. Inlegralor lehaviours under cIocking

As iIIuslraled in equalion (2) and Iig. 2, lhe oulpul of an inlegralor is a funclion of lhe inpul
signaIs and lhe iniliaI slale of lhe inlegralor, vhich are discrelized lo generale lhe Iookup
lalIe enlries. The nunler of discrelizalion IeveIs depends on lhe accuracy requirenenl of
lhe sinuIalion. The inlernaI circuil node voIlage svings can le eslinaled ly lhe syslen
archileclure. Ior Iov-voIlage Signa-DeIla ADC designs, lhe inlernaI voIlages can change
fron O lo suppIy voIlage Vdd. To cover lhe vhoIe range of voIlage sving, ve discrelize lhe
inpuls and oulpuls of lhe inlegralors unifornIy al N IeveIs fron O lo Vdd, vhere N is in lhe
range of 1O. The exlraclion selup for an inlegralor vilh a nuIli-lil DAC inpIenenled in
lhernoneler code is shovn in Iig.3. A Iarge induclor L logelher vilh a voIlage source Vs is
used lo sel lhe iniliaI vaIue of lhe inlegralor oulpul. The inpul of lhe inlegralor is aIso sel ly
a voIlage source Vi. The digilaI oulpul of lhe quanlizer conlroIs lhe anounl of charge lo le
fed lack. An n-lil DAC inpIenenled in lhernoneler code has 2
n
- 1 lhreshoId voIlages.
The digilaI codes fron O lo 2
n
- 1 can le represenled ly counling lhe nunler of voIlage
sources lhal are connecled lo lhe inlegralor inpuls fron a sel of voIlages sources Vd1, Vd2, .,
Vd(2
n
-1),lhe voIlages of vhich are sel lo le eilher digilaI 1 or digilaI O.


Iig. 3. LUT generalion selup for inlegralors

Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 357

The nonIinearilies of quanlizers can le caplured using Iookup lalIes as veII. The quanlizer
acls as a conparalor, lhe inpul lhreshoId voIlage varies depending on lhe direclion in vhich
lhe inpul voIlage changes. To caplure lhe hysleresis effecl accuraleIy, ve use lhe lransislor-
IeveI sinuIalion lo find lhe inpul lhreshoId voIlages al vhich lhe digilaI oulpul svilches
fron O lo 1 (
off
J
+
) and fron 1 lo O (
off
J

), respecliveIy. The quanlizer is lhen nodeIed as


1 ( | 1| )
| 1| | | ( | 1| )
0 ( | 1| )
in off
off in off
in off
J k J
a k a k J J k J
J k J
+
+

+ >

+ = < + <

+ <

(3)
vhere d|k+1j is lhe nev oulpul of lhe quanlizer, d|kj is lhe oulpul in lhe previous cIock
cycIe. MuIli-lil quanlizers can le nodeIed in a siniIar vay since lhey are luiIl fron severaI
1-lil quanlizers.

Signa-DeIla ADCs vilh conlinous-line noduIalors can aIso le nodeIed using lhe
proposed lechnique vilh ninor nodificialion. Conlinuous-line Signa-DeIla ADCs are
differenl fron discrele-line counlerparls since lhe inlegralors are nol cIocked ly lhe
sanpIing cIock, and lhe inpul and lhe oulpul of inlegralor changes lhroughoul a cIock
period. In order lo nake lhe Iookup-lased nodeIing possilIe, ve discrelize each cIock cycIe
inlo M line inlervaIs vilh a slep size dT=T/M. If dT is snaII enough, lhen in each snaII
line period lhe lehaviours of conlinous-line noduIalors can le approxinaled using lhe
presenled lechnique, delaiIed inpIenenlalion for conlinous-line Signa-DeIla ADCs can le
found in (Yu & Li, 2OO7a).

2.3 Parametric LUT based modeIing
Irocess varialions in lhe falricalion slage can cause significanl perfornance shifl for anaIog
and nixed-signaI circuils. So handIing of process varialions in lhe earIy design slage is
crilicaI for rolusl anaIog/nixed-signaI design. In order caplure lhe infIuence of process
varialions, ve exlracl paranelerized LUT nodeIs and perforn fasl slalislicaI sinuIalion lo
evaIuale lhe perfornance dislrilulions of conpIex Signa-DeIla ADCs. Using lhis efficienl
nodeIing lechnique, ve have lhe aliIily lo find lhe nosl suilalIe syslen lopoIogies and
design paranelers for ADC designs.

Since lhe nunler of process varialIes is Iarge, il is nol possilIe lo exhausl aII lhe possilIe
perfornances under process varialions. Here ve use paranelerized LUT-lased nodeIs lo
caplure lhe inpacls of circuil paranelric varialions lhal incIude lolh environnenlaI and
process varialions. In lhis case, a nonIinear regression nodeI (nacronodeI) is exlracled for
each lalIe enlry. In generaI, a nacronodeI correIaling lhe inpul varialIes and lheir
responses can le slaled as foIIovs:
given n sels olserved responses |y
1
,y
2
,.,y
n
and n sels of n inpul varialIes |x1,x2,.,xnj, ve
can delernine a funclion lo reIale inpul x and response y as (Lov & Direclor, 1989)
VLS 358

1 11 12 1
2 21 22 2
1 2
( , , , )
( , , , )
( , , , )
m
m
n n n nm
y h x x x
y h x x x
y h x x x
=
=
=

(4)
vhere
y
i
i-lh response
h funclion reIaling y and x
x
i
i-lh sel of process varialIes
n nunler of process varialIes
n nunler of experinenlaI runs

The lask of conslrucling each nacronodeI is achieved ly appIying lhe response surface
nodeIing (RSM) lechnique vhere enpiricaI poIynoniaI regression nodeIs reIaling lhe
inpuls and lheir oulpuls are exlracled ly perforning nonIinear Ieasl square filling over a
chosen sel of inpul and oulpul dala (ox el aI., 2OO5). To syslenalicaIIy conlroI lhe nodeI
accuracy and cosl, design of experinenl (DOL) lechnique is appIied lo properIy choose a
snaIIesl sel of dala poinls vhiIe salisfying a given nodeIing reguIalion. Ior our circuil
nodeIing lask, lhe inpul paranelers are lhe paranelric circuil varialions and lhe oulpul is
an enlry in lhe Iookup lalIes. Then, a nonIinear funclion such as a quadralic funclion
reIaling each enlry in lhe lalIes vilh lhe circuil paranelric varialions can le delernined via
regression

` ` ` `
0
1 1 1
m m m
i i if i f
i i f
y x x x | | |
= = =
= + +
_ __
(5)
vhere
x
i
i-lh process varialIe sel
Y approxinaled response
|
eslinaled nodeI filling coefficienl
n nunler of process varialIes

Lqualion (5) can le revrillen in a nore conpacl nalrix forn as
X Y | = (6)

The filling coefficienl veclor
|
can le caIcuIaled using lhe Ieasl-square filling of
experinenlaI dala as

1
( )
T T
X X X Y |

= (7)

A najor prolIen in soIving equalions (5 - 7) is lhe nunler of experinenlaI dala, so a
second-order cenlraI conposile pIan consisling of a cule design sul-pIan and a slar design
sul-pIan is enpIoyed (ox & el aI., 2OO5). The cule design pIan is a lvo-IeveI fraclionaI
facloriaI pIan lhal can le used lo eslinale firsl-order effecls (e.g., x
i
) and inleraclion effecls
(e.g., x
i
_x
j
), lul il is nol possilIe lo eslinale pure quadralic lerns (e.g., x
i
2
). The slar design
pIan is used as a suppIenenlary lraining sel lo provide pure quadralic lerns in equalion (5).
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 359

In our inpIenenlalion, lhe cule design pIan is seIecled in order lo eslinale aII lhe firsl-
order and cross-faclor second-order coefficienl of lhe inpul varialIes.

The ranges of aII paranelric varialions are usuaIIy ollained fron lhe process
characlerizalion. This infornalion is used lo selup lhe nodeI exlraclion procedure. In lhe
cule design pIan, each faclor lakes on lvo vaIues -1 and +1 lo represenl lhe nininun and
lhe naxinun vaIues of lhe paranelric varialion. Lach faclor in lhe slar pIan lakes on lhree
IeveIs -a, O, a, vhere O represenls lhe noninaI condilion and lhe IeveI range |a| < 1. As
iIIuslraled in Iig. 4, for each poinl (i,j) in lhe Iookup lalIe, n sinuIalions runs are conducled
using fraclionaI facloriaI pIan lo provide lhe required dala lo generale lhe regression nodeI
in equalion (5). As Iong as lhe Iookup lalIes for specified process varialion dislrilulions are
generaled, ve can perforn fasl syslen-IeveI sinuIalion lo evaIuale lhe perfornances under
process varialions, and in lurn lo achieve oplinizalion as lo le discussed in lhe foIIoving
seclion.

` ` ` `
0
1 1 1
m m m
i i if i f
i i f
y x x x | | |
= = =
= + +
_ __

Iig. 4. Response surface nodeIIing of paranelerized LUTs

2.4 System optimization using parametric LUTs
The appIicalion of lhe proposed nodeIing lechniques are denonslraled vilh lhree discrele-
line Signa-DeIla ADC designs vilh differenl lopoIogies incIuding 2nd-order vilh 1-lil
quanlizer (SDM 1), 2nd-order vilh 2-lil quanlizer (SDM 2) and 3rd-order vilh 1-lil
quanlizer (SDM 3). AII lhese ADCs are inpIenenled in 13Onn CMOS lechnoIogy vilh a
singIe 1.5V pover suppIy. The sanpIing cIock and oversanpIing ralio are chosen lo le
1MHz and 128, respecliveIy.

Using our paranelerized LUT-lased infraslruclure, ve are alIe lo nol onIy predicl lhe
noninaI case design perfornances lul aIso lheir sensilivilies lo paranelric varialions.
Hence, our lechnique provides an efficienl vay for slalislicaI circuil sinuIalion as veII as
perfornance-roluslness lrade-off anaIysis. Ior slalislicaI anaIysis, a ResoIulion V 2
(8-2)

fraclionaI facloriaI design pIan lhal incIudes 64 runs for lhe cule design pIan and 17 runs for
lhe slar design pIan is enpIoyed ly SDM 1. Ior SDM 2 and SDM 3, a ResoIulion VI 2
(6-1)

VLS 360

fraclionaI facloriaI design pIan vilh 45 runs is enpIoyed, resuIling in 32 runs for lhe cule
design pIan and 13 runs for lhe slar design pIan.

In TalIe 1, lhe proposed LUT-lased sinuIalor is conpared vilh lhe lransislor-IeveI sinuIalor
(Speclre) in lerns of nodeI exlraclion line, sinuIalion line, and predicled noninaI SNDR and
THD vaIues. Once lhe LUT nodeIs are exlracled, lhe LUT-lased sinuIalor can le efficienlIy
enpIoyed lo perforn slalislicaI perfornance anaIysis, vhich is infeasilIe for lhe lransislor-
IeveI sinuIalor. Ior lhe 2nd-order Signa-DeIla ADC vilh 1-lil quanlizer, il onIy lakes 2O
ninules lo conducl 1,OOO LUT-lased lransienl sinuIalions each incIuding 64k cIock cycIes. Ior
lhe sane anaIysis, lransislor-IeveI sinuIalion vilh convenlionaI sinuIalors is expecled lo lake
4,5OO hours lo conpIele. In lerns of accuracy, lhe SNDRs and THDs predicled ly Speclre and
lhe LUT sinuIalor are aIso presenled in TalIe 1. The error of SNDR of our LUT-lased
sinuIalor is vilhin 1d, vhich denonslrales lhe accuracy of lhe proposed lechnique.

LUT-bascd sImu!atInn 5pcctrc sImu!atInn
DcsIgn Sin.
gen.
Iar. gen. Runline SNDR THD Runline SNDR THD
SDM 1 7 nin 9.5 hr 2 s 73.8d -63.1d 4.5 hr 74.1d -62.6d
SDM 2 2O nin 15 hr 4 s 9O.2d -77.4d 9.5 hr 9O.Od -76.8d
SDM 3 8 nin 11 hr 2 s 83.3d -67.5d 5.5 hr 83.5d -66.9d
TalIe 1. Runline and accuracy of lhe proposed LUT sinuIalion ( |2OO7j ILLL, fron Yu &
Li, 2OO7a)

Wilh lhe poverfuI LUT-lased sinuIalor, ve can perforn syslen evaIualion very efficienlIy so
lhe oplinizalion of syslen lopoIogies and delaiIed design lecone possilIe. Iirsl ve use lhe
oplinizalion of 2nd-order Signa-DeIla ADC vilh nuIli-lil quanlizer as an exanpIe ly
invesligaling lhe inpacls of DAC capacilance nisnalch. The capacilor nisnalch IeveI
decreases as lhe capacilance increases, so il is of inleresl lo invesligale lhe lrade-offs of syslen
noise perfornances and area (IeIgron & el aI., 1989). SlalislicaI sinuIalions are perforned lo
anaIyze lhe infIuence of lhe nisnalch of lhe lvo inlernaI DACs ly sveeping lhe vaIues of lhe
lhree charging capacilors in each DAC. The varialion of capacilances is nodeIed using a
Caussian dislrilulion vilh 3 1 o = . The dislrilulions of SNDR due lo lhe capacilance
nisnalch in lhe lvo DACs are shovn in Iig. 5, respecliveIy.
55 60 65 70 75 80 85 90
0
5
10
15
20
25
30
35
40
SNDR(dB)
N
u
m
b
e
r
SNDR without
mismatching

85 86 87 88 89 90
0
5
10
15
20
25
30
35
40
SNDR(dB)
N
u
m
b
e
r
SNDR without
mismatching

(a) DAC connecled lo firsl-slage (l) DAC connecled lo second-slage
Iig. 5. SDNR dislrilulions for DACs connecled lo differenl slages in SDM 2 ( |2OO7j ILLL,
fron Yu & Li, 2OO7a)
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 361

We can see fron lhe lvo figures lhal lhe nisnalch of lhe DAC connecled lo lhe firsl slage
inlegralor (Iefl figure) has nuch nore infIuence lo lhe syslen perfornance lhan lhal of lhe
olher DAC (righl figure). This can le expIained ly lhe facl lhal lhe firsl slage DAC is
connecled direclIy lo lhe syslen inpul, so lhe feedlack error lecause of lhe DAC nisnalch
viII le nagnified ly lhe second slage inlegralor. The resuIl of lhis anaIysis indicales lhal
nore allenlion shouId le paid lo lhe firsl slage DAC in lhe design process.

Anolher oplinizalion exanpIe is lo evaIuale lhe charging capacilor nisnalches in SDM 3.
The nisnalch of each capacilor is nodeIIed using Caussian dislrilulion. We evaIuale lhe
syslen perfornance dislrilulions vilh varialion of capacilances sel lo 1 o = and 5 o = ,
as iIIuslraled in Iig. 6.

82.9 83 83.1 83.2 83.3 83.4 83.5 83.6 83.7 83.8
0
5
10
15
20
25
30
35
40
45
50
SNDR(dB)
N
u
m
b
e
r
SNDR in
nominal case

80 80.5 81 81.5 82 82.5 83 83.5 84 84.5
0
5
10
15
20
25
30
35
40
45
50
SNDR(dB)
N
u
m
b
e
r
SNDR in
nominal case

(a) 1 o = (l) 5 o =
Iig. 6. SNDR dislrilulions for nisnalches of charging and sanpIing capacilors in SDM 3 (
|2OO7j ILLL, fron Yu & Li, 2OO7a)

We can olserve fron Iig. 6 lhal perfornance dislrilulion devialions increase fron 1d lo
4.5d for capacilor varialion 5 o = , and lhe inpacl of lhe nisnalch of charging and
sanpIing capacilors is nol as crilicaI as lhal of lhe nuIli-lil DAC even vilh 5 o = . Il is
aIso possilIe lo perforn nore conpIele syslen anaIysis and oplinizalion using lhe
proposed paranelric LUT-lased sinuIalion nelhod depending on lhe largel of lhe design
(Yu & Li, 2OO7a).

3. Robust PLL Design

As an essenliaI luiIding lIock, ILLs are videIy used in loday's connunicalion and digilaI
syslens for purposes such as frequency synlhesis, Iov-jiller cIock generalion, dala recovery
and so on. AIlhough lhe inpul and oulpul signaIs of ILLs are in lhe digilaI donain, nosl
ILLs inpIenenlalions consisl of lolh digilaI and anaIog conponenls, vhich nake lhen
prone lo process varialion infIuences. In lhis seclion ve propose an efficienl paraneler-
reduclion nodeIIing lechnique lo caplure process varialions and furlher achieve Iov-cosl
syslen perfornance evaIualion using hierarchicaI syslen sinuIalion. The proposed nelhod
nol onIy can le used for rolusl ILL design under process varialion, lul aIso paves lhe road
for effeclive luiIl-in seIf-lesl circuil design as lo le discussed in lhe nexl seclion.

VLS 362

3.1 Background of PLL design
As iIIuslraled in Iig. 7, a lypicaI charge-punp ILL syslen consisls of a frequency deleclor, a
charge punp, a Ioop fiIler, a voIlage-conlroIIed osciIIalor (VCO) and a frequency divider.
The frequency of lhe oulpul cIock signaI Ioul is N lines of lhal of reference cIock signaI
Iref, vhere N can le an inleger nunler or fraclionaI nunler. The ILL design oplions
incIude VCO lopoIogies and conponenl sizes, fiIler characlerizalions, charge currenl in lhe
charge punp and so on. The nelrics of ILL syslens usuaIIy incIude acquisilion/Iock-in
line, oulpul jiller, syslen pover, lolaI area, elc. The goaI of ILL design and oplinizalion is
lo find lhe lesl overaII syslen perfornances ly searching in lhe design varialIe spaces.


Iig. 7. Iock diagran of charge-punp ILL

Due lo lhe nixed-signaI nalure, lhe design and oplinizalion of ILL syslen is quile conpIex
and coslIy. Ior exanpIe, a Iong lransienl sinuIalion (in lhe order of hours or days) is needed
lo ollain lhe Iock-in line lehavior of ILL, vhich is one of lhe nosl inporlanl perfornance
nelrics for a ILL. So lhe lrule-force oplinizalion ly searching in lhe design space vilh
lransislor-IeveI sinuIalion is infeasilIe for ILL syslens.

The difficuIlies of syslen perfornance anaIysis can le addressed ly adopling a lollon-up
nodeIIing and sinuIalion slralegy. The perfornances of anaIog luiIding lIocks can le
evaIualed and oplinized vilhoul loo nuch cosl. When lhe lehaviours of anaIog luiIding
lIocks are exlracled, lhese luiIding lIocks can le napped lo VeriIog-A nodeIs for fasl
syslen IeveI evaIualion (Zou & el aI., 2OO6). y using lhis approach ve can avoid lhe
scaIaliIily issue associaled vilh line consuning lransislor-IeveI sinuIalions.

When process varialions are considered, lhe silualion lecones nore sophislicaled. The
Iarge nunler of process varialIes and lhe correIalions lelveen differenl luiIding lIocks
inlroduce nore uncerlainlies for ILL perfornance under process varialions. In order lo
uliIize lhe hierarchicaI sinuIalion nelhod vhiIe laking inlo consideralion of slalislicaI
perfornance dislrilulions, ve propose an efficienl nacronodeIing nelhod lo handIe lhis
difficuIly. The key aspecl of our nacronodeIing lechniques is lhe exlraclion of
paranelerized lehavioraI nodeIs lhal can lrulhfuIIy nap lhe device-IeveI varialiIilies lo
varialiIilies al lhe syslen IeveI, so lhal lhe infIuence of falricalion slage varialions can le
propagaled lo lhe ILL syslen perfornances.

Iaranelerizalion can le done for each luiIding lIock nodeI as foIIovs. Iirsl, nuIlipIe
lehaviouraI nodeI exlraclions are conducled al nuIlipIe paraneler corners, possilIy
foIIoving a parlicuIar design-of-experinenls (DOL) pIan (ox & el aI., 2OO5). Then, a
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 363

paranelerized lehavioraI nodeI is conslrucled ly perforning nonIinear regression over lhe
nodeIs exlracled al differenl corners. This delaiIed paranelric nodeIing slep is
advanlageous since il syslenalicaIIy naps lhe device-IeveI paranelric varialions lo each of
lhe lehavioraI nodeIs. Hovever, difficuIlies arise vhen lhe nunler of paranelric
varialions is Iarge, vhich Ieads lo a prohililiveIy high paranelric nodeI exlraclion cosl. We
address lhis chaIIenge ly appIying design-specihc paraneler dinension reduclion
lechniques as descriled in lhe foIIoving seclion.

3.2 HierarchicaI modeIing for PLLs
In lhis seclion ve firsl descrile lhe noninaI lehavioraI nodeI exlraclion for each ILL
luiIding lIock, lhen ve discuss hov a paranelerized nodeI can le conslrucled in lhe nexl
seclion.

The voIlage conlroIIed osciIIalor is lhe core conponenl of a ILL. The lvo nainslrean lypes
of VCOs are LC-lank osciIIalors and ring osciIIalors. In a lypicaI VCO nodeI, lhe dynanic
(response lo inpul change) and slalic (V-Ireq reIalion) characlerislics of lhe voIlage lo
frequency lransfer are nodeIed separaleIy firsl and lhen conlined lo forn lhe conpIele
nodeI. The slalic VCO characlerislic can le vrillen as Ioul=f(Vcon), vhere Ioul is lhe
oulpul signaI frequency, Vcon is lhe deIayed conlroI voIlage, and f(.) is a nonIinear
napping reIaling lhe voIlage vilh lhe frequency. To generale lhe anaIylicaI nodeI, lhe
napping funclion f(.) can le furlher represenled ly an n-lh order poIynoniaI funclion.
' ' 2 '
0 1 2
n
out con con n con
F a aJ a J a J = + + + + (8)
vhere aO, a1, ., an are coefficienls of lhe poIynoniaI. To generale lhe alove poIynoniaI,
nuIlipIe VCO sleady-slale sinuIalions are conducled al differenl conlroI voIlage IeveIs and
a nonIinear regression is perforned using lhe coIIecled sinuIalion dala.

Suppose lhe conlroI voIlage is Vcon, lhe dynanic lehavior of lhe VCO is nodeIed ly
adding a deIay eIenenl lhal produces a deIayed version of lhe conlroI voIlage (Vcon). The
deIay eIenenl can le expressed using a Iinear lransfer funclion H(s) (e.g. a second-order RC
nelvork consisling of lvo R's and lvo C's). H(s) can le delernined via lransislor-IeveI
sinuIalion as foIIovs: a slep conlroI voIlage is appIied lo lhe VCO and lhe line il lakes for
lhe VCO lo reach lhe sleady-slale oulpul frequency, or lhe slep-inpul deIay of lhe VCO, is
recorded. H(s) is lhen synlhesized lhal gives lhe sane slep-inpul deIay. The dynanic effecl
is usuaIIy nolalIe in LC VCOs due lo lhe high-Q LC lank vhiIe in ring osciIIalors lhis effecl
nay le negIecled.

The charge punp is nainIy luiIl vilh svilching currenl sources. As iIIuslraled in Iig. 8, lhe
conlroI signaIs of lhe lvo svilches M1 and M2 cone fron lhe oulpuls of lhe phase and
frequency deleclor. The currenls lhrough M1 and M2 can le lurned on-and-off lo provide
desired charge-up or charge-dovn currenls. The exisling charge punp nacronodeIs are
very sinpIislic. UsuaIIy, lolh lhe charge-up and charge-dovn currenls are nodeIed as
conslanl vaIues. A conslanl nisnalch lelveen lhe lvo currenls nay aIso le considered
(Zou & el aI., 2OO6). Hovever, lhis sinpIe approach is nol sufficienl lo nodeI lhe lehavior of
charge punp accuraleIy. In reaI inpIenenlalion, lhe currenl sources are inpIenenled using
lransislors so lhal lhe acluaI oulpul currenls viII vary according lo lhe voIlages across lhese
VLS 364

MOSILTs. Therefore, lhe dependency of charge-up and charge-dovn currenls on Vcon
nusl le considered.

Iig. 8. ModeIing of charge punp ( |2OO7j ILLL, fron Yu & Li, 2OO7l)

In our charge punp nodeI, for each oulpul currenl, lhe currenl vs. Vcon characlerislics is
divided inlo lvo regions. When lhe oulpul voIlage Vcon is cIose lo lhe suppIy voIlage, lhen
svilch M1 viII le liased in lriode region. The charge-up currenl Iup in lhe lriode region can
le vrillen as

2
|( ) 0.5 |
up p ox gs thp as as
as aa on con
W
I C J J J J
L
J J J J
=
=
(9)
vhere Vdd is lhe suppIy voIlage, Von is lhe on-voIlage across lhe svilch, Vgs is lhe gale-
source voIlage,
p

is lhe noliIily, Cox is lhe oxide capacilance, W is vidlh and L is lhe


Ienglh of M
1
. We can see fron Lqualion (9) lhal lhe charge-up currenl is dependenl on lhe
oulpul voIlage Vcon. We use a poIynoniaI lo expIicilIy nodeI such voIlage dependency

2 3
0 1 2 3 up con con con
I b bJ b J b J = + + + (1O)
vhere li are lhe poIynoniaI coefficienls. SiniIarIy, lhe charge-dovn currenl has a slrong
Vcon dependency vhen Vcon is Iov. This voIlage dependency is nodeIed in a siniIar
fashion. When M
1
and M
2
operale in saluralion region, lhey acl as parl of lhe currenl
nirrors. In lhis case, conslanl oulpul currenl vaIues are assuned vhiIe lhe possilIe
nisnalches lelveen lhe lvo are considered in our VeriIog-A nodeIs.

The phase deleclor and lhe frequency divider are digilaI circuils so lhal lhey are nore
anenalIe lo lehavioraI nodeIing. The lvo key paranelers of lhe phase deleclor and lhe
frequency divider are lhe oulpul signaI deIay and lhe lransilion line, vhich are easy lo
exlracl fron lransislor-IeveI sinuIalion. The Ioop fiIlers are usuaIIy conprised of passive RC
eIenenls, vhich can le direclIy nodeIed in VeriIog-A sinuIalion.

3.3 Efficient parameter-reduction modeIing for PLLs
The key paranelric varialions for lransislors nay incIude varialions in noliIily, gale oxide,
lhreshoId voIlage, effeclive Ienglh and so on (Nassif, 2OO1). The consideralion of aII possilIe
sources of varialions in lransislors and inlerconnecls can easiIy Iead lo expIosion of lhe
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 365

paraneler space, rendering lhe paranelric nodeIing infeasilIe. AIlhough lhe videIy used
principIe conponenl anaIysis (ReinseI & VeIu, 1998) can le adopled lo perforn paraneler
dinension reduclion, ils effecliveness nay le ralher Iiniled since lhe paraneler reduclion is
achieved ly onIy considering lhe slalislics of conlroIIing paranelers vhiIe negIecling lhe
inporlanl correspondence lelveen lhese paranelers and lhe circuil perfornances of
inleresl. As such, lhe exlenl lo vhich lhe paraneler reduclion can le achieved is nol
sufficienl for our anaIog nacronodeIing prolIens. To address lhis difficuIly, a nore
poverfuI design-specific dinension reduclion lechnique, vhich is lased on reduced rank
regression (RRR), is deveIoped. This nev lechnique considers lhe cruciaI slrucluraI
infornalion inposed ly lhe design and has leen shovn lo le quile effeclive for paranelric
inlerconnecling nodeIing prolIens (Ieng & el aI., 2OO6).

Suppose ve have a sel of n process varialions, X, and a sel of N perfornances, Y. The
oljeclive is lo idenlify a snaIIer sel of nev varialIes Z, lased on X, vhich are slalislicaIIy
significanl lo lhe perfornances of inleresl, Y. Wilhoul Ioss of generaIily, Iel us assune Y
nonIinearIy depends on X lhrough a quadralic nodeI

| |
1 2
( )
X
Y f X C C
X X
(
= ~
(


(11)
vhere C1 and C2 are lhe firsl and second order coefficienls, X X represenls lhe quadralic
lerns of X. The conlinalion of lhe Iinear and quadralic lerns of X are lhen defined as a nev
prediclor veclor `
( )
T
T
T
X X X X
(
=

. Nov lhe quadralic nodeI in equalion (11) can le casl
inlo a Iinear nodeI as
`
Y AX c = + . To idenlify lhe redundancy in X lo faciIilale paraneler
reduclion, ve seek a reduced rank regression nodeI in lhe forn
`
R R
Y A B X c = +
(12)
vhere AR and R have a reduced rank of R (R < n), and R has onIy R coIunns. We denole
lhe covariance nalrix of
`
X as
` `
( )
X X
Cov X = E
, and covariance nalrix lelveen
`
X and Y as
`
`
( , )
Y X
Cov Y X = E
. Il can le shovn lhal an oplinaI reduced rank nodeI (in lhe sense of nean
square error) is given as (ReinseI & VeIu, 1998)
` ` `
1
,
T
R R
Y X X X
A U B U

= = E E
(13)
vhere U conlains R nornaIized eigenveclors corresponding lo lhe R Iargesl eigenvaIues of
lhe nalrix:
` ` ` `
1
Y X X X XY
D

= E E E
. Il is inporlanl lo nole lhal a successfuI conslruclion of lhe
alove reduced rank nodeI indicales lhal onIy a snaIIer sel of R nev paranelers
`
R
Z B X =

are crilicaI lo Y in a slalislicaI sense, hence faciIilaling lhe desired paraneler reduclion.

Il shouId le noled lhal lhe reduced rank regression is onIy enpIoyed as a neans for
paraneler reduclion so as lo reduce lhe conpIexily of lhe sulsequenl paranelerized
nacronodeIing slep. Hence, Y in lhe alove equalions does nol have lo le lhe lrue
perfornances of inleresl and can le jusl sone circuil responses lhal are highIy correIaled lo
lhe perfornances. This fIexiliIily can le expIoiled lo nore efficienlIy coIIecl
`
Y X
E
lhough
Monle-CarIo sanpIing if Y are easier lo ollain lhan lhe lrue perfornances in sinuIalion.

VLS 366

The conpIele paranelerized ILL nacronodeI exlraclion fIov is shovn in Iig. 9. Lvery
luiIding lIock is nodeIed using VeriIog-A for efficienl syslen-IeveI sinuIalion. Lach nodeI
paraneler for luiIding lIocks is expressed as a poIynoniaI in lhe underIying device-IeveI
varialions, such as

1 1 1
( , , ,...., , , )
th eff ox thn effn oxn
P f J L T J L T =
(14)
vhere Vlhi, Leffi, Toxi, elc represenl lhe paranelers of i-lh lransislor, f(.) is lhe nonIinear
poIynoniaI funclion lo connecl process varialions lo syslen perfornances. f(.) is very
difficuIl lo ollain if lhe nunler of paranelers is Iarge. Hence, RRR-lased paraneler
reduclion is appIied, vhich Ieads lo a sel of R nev paranelers Z lhal are lhe nosl inporlanl
varialions for lhe given circuil perfornances of inleresl. If R is snaII, lhen a nev
paranelerized nodeI in lerns of Z can le easiIy ollained lhrough convenlionaI nonIinear
regression for coefficienls of equalion (11)
1 2
( , , )
R
C f Z Z Z = (15)

1 2
( , , )
R
C f Z Z Z =

Iig. 9. IIov of ILL nacronodeIing

In addilion lo reducing lhe cosl of paranelerized nacronodeIing, paraneler dinension
reduclion viII aIso Iead lo nore efficienl slalislicaI sinuIalion of lhe conpIele ILL. This is
lecause inslead of anaIyzing lhe design perfornance varialions over lhe originaI high-
dinensionaI paraneler space, slalislicaI sinuIalion can nov le perforned nore efficienlIy
in a nuch Iover dinensionaI space lhal carries lhe essenliaI infornalion of lhe design
varialiIily, vhich is a very good properly lo achieve ILL syslen oplinizalion. DelaiIed
oplinizalion nelhod for ILL using lhe hierarchicaI nacronodeIs can le found in (Yu & Li,
2OO8).

Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 367

Wilh lhe hierarchicaI nodeIs and paraneler reduclion lechnique, ve can aIso perforn luiIl-
in seIf-lesl circuil design and oplinizalion since Ienglhy lransienl sinuIalions can le
reIieved ly lhe proposed nelhod. We viII discuss lhis parl in lhe nexl seclion.

4. BuiIt-in SeIf-test Scheme Design for PLLs

Tesling of ILLs is very chaIIenging and of greal inleresl since: a) UsuaIIy onIy sinpIe
funclionaI lesls such as phase Iock lesl is feasilIe in produclion lesl. Hovever, il nay nol le
sufficienl for guaranleeing aII lhe specificalions, l) The operalion of ILLs is inlrinsicaIIy
conpIex, e.g. sinpIe frequency donain anaIysis is nol appIicalIe for ILL circuils due lo lhe
digilaIized inpul/oulpul signaIs and lhe cIosed-Ioop dynanics, c) InlernaI anaIog signaIs
are difhcuIl lo access oulside of lhe chip and syslen specificalions such as jiller, frequency
range and Iock-line are very conpIex and coslIy lo neasure vilh digilaI leslers. In lhis
seclion, ve discuss lhe inpIenenlalions and oplinizalion of luiIl-in seIf-lesl circuils lo
caplure ILL perfornance faiIures using lhe efficienl nodeIIing and sinuIalion franevork
in Seclion 3.

4.1 BIST circuit design
uiIl-in seIf-lesl (IST) has energed as a very pronising lesl nelhodoIogy for inlegraled
circuils aIlhough ils appIicalion lo nixed-signaI ICs is nore chaIIenging conpared lo lhe
digilaI counlerparls. Sunler and Roy propose a IST schene lo neasure key anaIog
specificalions incIuding jiller, open Ioop gain, Iock range and Iock line in ILL syslens
(Sunler & Roy, 1999), vhiIe nosl of olher proposed IST schenes nainIy focus on
calaslrophic fauIls. Kin and Sona presenl an aII-digilaI IST schene using charge punp as
slinuIus lo charge VCO up and dovn lo delecl calaslrophic fauIls (Kin & Sona, 2OO1). In
(Hsu & el aI., 2OO5) Hsu el a| propose a differenl IST schene for calaslrophic fauIl deleclion
in ILLs ly inlroducing lhe phase errors lo lhe inpuls lo lhe phase frequency deleclor. Olher
IST schenes proposed Iike (Azais & el aI., 2OO3) are aIso largeled al calaslrophic fauIls.

WhiIe deleclion of calaslrophic (hard) fauIls renains as a key consideralion in lesl,
paranelric fauIls caused ly lhe groving process varialions in nodern nanoneler VLSI
lechnoIogies are receiving significanl concerns. In nosl cases, chips vilh paranelric fauIls
nay le sliII funclionaI lul cannol achieve lhe desired perfornance specificalions, and hence
shouId le screened oul. These faiIing chips are free of calaslrophic fauIls so lhal specific
IST schenes largeling al paranelric faiIures nusl le deveIoped.

Civen lhal process varialions and resuIlanl paranelric faiIures viII conlinue lo rise in sul-
1OO-nn lechnoIogies, a design-phase ILL IST deveIopnenl nelhodoIogy is slrongIy
desired. Such nelhodoIogy shouId faciIilale syslenalic evaIualion of paranelric varialions
of conpIex ILL syslen specificalions and lheir reIalions lo specific IST neasurenenls so
as lo enalIe oplinaI IST schene deveIopnenl. To lhis end, hovever, lhree najor
chaIIenges nusl le addressed: a) SuilalIe nodeIing lechniques nusl le deveIoped in order
lo enalIe feasilIe vhoIe syslen ILL anaIysis vhiIe considering reaIislic device-IeveI process
varialions and nisnalch, l) Device-IeveI paranelric varialions and nisnalch lhal
conlrilule lo paranelric faiIures forn a high-dinensionaI paraneler space and lhe resuIling
VLS 368

issue of cursc cf imcnsicna|i|q lrings significanl nodeIIing and anaIysis difficuIly, and c)
Lffeclive oplinizalion slralegies are desired in order lo deveIop oplinaI IST schenes.

Wilh lhe lechniques presenled in seclion 3, lhe firsl lvo difficuIlies have leen addressed,
and nov ve pul nore efforls on lhe oplinizalion of IST circuils. The nosl videIy used
approach in IST design is lo uliIize lhe exisling digilaI lIocks, such as lo use lhe frequency
divider as counler and read oul ils slale in order lo delecl chip faiIures (Sunler & Roy, 1999),
(Kin & Sona, 2OO1), (Hsu & el aI., 2OO5), (Azais & el aI., 2OO3). Il is expecled lhal lhe
frequency divider/counler oulpul viII change significanlIy if lhere exisls a calaslrophic
fauIl. Hovever, paranelric faiIures nay produce snaIIer varialions in lhe readoul vaIues.
Hence, lhey are nore difficuIl lo delecl and deserve nore carefuI lrealnenls.

We consider lhree IST schenes shovn in Iig. 1O as polenliaI candidales. SiniIar in spiril lo
lhe exisling IST schenes, lhe nain idea of lhe proposed IST schenes is lo conlroI lhe
charge punp in a vay such lhal lhe oulpul frequency of lhe ILL viII le aIlered and lhe
slale of lhe frequency divider is read oul al cerlain line inslance for faiIure deleclion.
Device-IeveI varialions and nisnalch viII perlurl lhe operalion of lhe ILL and can push
lhe syslen perfornances oul of lhe specificalion vindov. The sane paranelric varialions
nay le refIecled in lhe varialions in lhe readoul vaIues of lhe frequency divider. Iaranelric
faiIures can le delecled if lhe slales of lhe frequency divider are slrongIy correIaled vilh lhe
design perfornances.

Delay
counter
Frequency
detector
Frequency
divider
0
1
Charge
pump
Loop
filter
VCO
Fref
counter
readout
start/read
signal
nT
Delay
counter
Frequency
detector
Frequency
divider
Charge
pump
Loop
filter
VCO
start/read
signal
Fref
counter
readout
0
1
nT
BST
scheme 2
BST
scheme 3
Delay
counter
Frequency
detector
Frequency
divider
0
1
Charge
pump
Loop
filter
VCO
start/read
signal
Fref
counter
readout
Delay2
0
1
Delay1
nT
BST
scheme 1
Vcon /
Counter
t
t
t
Vcon /
Counter
Vcon /
Counter

Iig. 1O. Three IST schene candidales ( |2OO7j ILLL, fron Yu & Li, 2OO7l)

Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 369

Tesl schenes are descriled as foIIovs:

a) IST schene 1
The firsl IST schene is siniIar lo lhe one adopled in (Hsu & el aI., 2OO5). In lhe nornaI
operalion node, lhe reference inpul and lhe oulpul of lhe frequency divider are appIied lo
lhe frequency deleclor lo forn lhe cIosed Ioop configuralion. In lhe lesl node, lhe oulpul of
lhe frequency divider is disconnecled fron lhe inpul of lhe frequency deleclor. The
reference inpul and or ils deIayed versions are fed lhrough lhe nuxes lo lhe frequency
divider forning a open Ioop configuralion. The firsl deIay eIenenl has a Iarger deIay vaIue
lhan lhe second one. To charge up lhe VCO, lhe reference inpul and ils deIayed version
lhrough deIay 2 are appIied lo lhe frequency deleclor. To charge dovn lhe VCO, lhe
deIayed versions of lhe reference inpul ly lolh deIay 1 and deIay 2 are seIecled. The deIay
vaIues of lhe lvo deIay eIenenls delernine lhe phase error inlroduced al lhe frequency
deleclor inpuls. Hence, lhey aIso diclale lhe coverage of lhe VCO luning range in lhis IST
selup. Under lypicaI design vaIues, deIay vaIues in lhe order of len's of lhe reference cIock
signaI period are required, vhich nay cosl significanl siIicon area lo inpIenenl.

Ior aII lhese lhree schenes, lhe counler read-oul signaIs, vhich conlroI lhe slarl and end
poinls for a singIe lesl run, are generaled ly passing lhe reference cIock signaI Iref lhrough a
series of D fIip-fIops. As such, lhe conlenls of lhe frequency divider vilhin a defined line
inlervaI are read oul.

l) IST schene 2
To soIve lhe siIicon overhead prolIen of schene 1, ve propose lhe second IST schene
vhich enpIoys an inverler lo inlroduce lhe phase difference. Since lhis configuralion
inlroduces a conslanl phase deIay of dT al lhe inpuls of frequency divider, lhe charge punp
experiences lhe foIIoving sequence of operalion: charge up slop charge up slop
unliI lhe conlroI voIlage of lhe VCO reaches lhe fuIIy voIlage sving.

c) IST schene 3
The lhird IST schene is configured as foIIovs: firsl lhe ILL is pul in a slandard cIosed-Ioop
configuralion and lhen a slandard phase Iock lesl is perforned. Once lhe ILL is Iocked, lhe
feedlack signaI frequency is changed fron Ioul lo 2Ioul ly using lhe nux lo seIecl lhe oulpul
of lhe second Iasl D fIip-fIop in lhe divider.

A lrief sunnary of lhe lhree IST schenes: for schene 1, lhe area cosl is high vhiIe lhe lesl
line is shorl, for schene 2, lhe area cosl is Iov and lhe lesl line is aIso Iov, for schene 3, lhe
area cosl is Iov, and lhe lesl line is nediun. The nosl inporlanl aspecl for IST schenes,
hovever, is lhe lesl accuracy. We use lhe nacronodeIs deveIoped in seclion 3 lo perforn
lesl accuracy evaIualion and oplinizalion for lhese IST schene candidales.

4.2 Test scheme evaIuation and optimization
The nosl slraighlforvard vay lo evaIuale a IST schene is lo perforn Monle-CarIo
sinuIalions and exanine lhe aliIily of lhe specified IST neasurenenls lo predicl lhe
pass/faiI slalus of lhe design. Since a sel of fev hundred Monle-CarIo ILL sinuIalions nay
VLS 370

lake lens of hours lo conpIele, nore runline efficienl approaches are needed, especiaIIy for
lhe oplinizalion purpose.

The oplinizalion of a given IST schene vilh n digilaI oulpuls is iIIuslraled in lhe second
haIf of Iig. 11. Since a IST schene nay le evaIualed nany lines under differenl selups
(e.g., lhe line inlervaI vilhin vhich lhe slales of lhe frequency divider are read oul) vilhin
lhe oplinizalion Ioop, lhe correIalion lelveen lhe IST schenes and lhe ILL perfornance
nusl le efficienlIy conducled. This goaI is achieved ly idenlifying lhe crilicaI sources of
varialions Z
v
as in equalion (15).


Iig. 11. IST schene oplinizalion ( |2OO7j ILLL, fron Yu & Li, 2OO7l)

Nolicing lhal Z
v
onIy conlains a reduced sel of varialions, a nonIinear enpiricaI nodeI
reIaling each neasurenenl T
i
of lhe given IST schene and Z
v
can le ralher efficienlIy
generaled. This is achieved ly conducling a fev VeriIog-A lased ILL sinuIalions al
differenl Z
v
sanpIes and perforning nonIinear regression: T
i
= f(Z
v
). Nole lhal lhis slep does
nol incur a high sinuIalion cosl since regression nodeIs are onIy luiIl over a Iov-dinension
paraneler space represenled ly Z
v
. Using lhese easiIy ollained regression nodeIs, a Iarge
sel of sanpIes for each T
i
can le efficienlIy generaled.

To caplure lhe polenliaI nonIinear correspondence lelveen lhe design perfornances and
lhe neasurenenls, Supporl Veclor Machine (SVM) is adopled as an accurale cIassifier
(Vapnik, 1998). Supporl Veclor Machine is a poverfuI nelhod lo luiId highIy nonIinear
nuIlivariale regression/cIassificalion nodeIs. In SVM regression, ve consider a sel of
lraining dala |(x
1
, y
1
), (x
2
, y
2
), ., (x
n
, y
n
), vhere x
i
is lhe inpul veclor and y
i
is lhe
corresponding oulpul. The inpul X is napped inlo a high dinensionaI fealure space using
nonIinear lransfornalion, lhen a lesl filling funclion is conslrucled in lhis fealure space as
( ) ( ) y f x x b e | = = + (16)
vhere | is lhe nonIinear lransfornalion, l is lhe lias lern, and e represenls lhe nodeI
paranelers lo le decided. ased on lhis nonIinear funclion f(.), ve can cIassify lhe chip as
fauIly or nol vilh lhe IST circuil oulpuls.

u||d regress|on mode| for
|8T measurement T w|th
process parameter Z v T|= f(Zv}

End
Adjust |8T setups
Ach|eved
good accuracy?
F|na| accuracy
ver|f|cat|on w|th
more s|mu|at|ons
Y
N
6heck the c|ass|f|cat|on
accuracy of 8VH mode|
Cenerate a |arge number of
measurements us|ng the
non||near regress|on mode|
6orre|at|ng w|th T's
performances us|ng 8VH
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 371

We denonslrale lhe effecliveness of lhe proposed nelhod using a ILL design inpIenenled
in 9Onn CMOS lechnoIogy. The frequency versus conlroI voIlage curve of lhe VCO is
exlracled using lhe nodeI of equalion (8). Iirsl ve sinuIale lhe VCO for a fev cIock cycIes
and galher lhe line-donain oulpul response as Y. Then RRR is appIied lo gel a reduced
paraneler sel Z lo represenl lhe inporlanl device-IeveI paranelers. A paranelric nodeI of
lhe VCO in lerns of Z is lhen luiIl. To nodeI lhe slalislicaI characlerislics of lhe VCO
accuraleIy, a 6-lh order poIynoniaI filling is used lo fil lhe oulpul frequency vs. conlroI
voIlage curve.

The VeriIog-A nodeIs for olher luiIding lIocks are exlracled in a siniIar fashion.
SpecificaIIy, lhe charge punp nodeI is generaled using a 3-rd order poIynoniaI in lhe
oulpul voIlage for each charge-up/dovn currenl. There are a lolaI of 17 VeriIog-A nodeI
paranelers exlracled for lhe conpIele ILL design.

We evaIuale lhe effecliveness of lhree IST schenes. SVM nodeIs are generaled as in Iig. 11
lo predicl lhe pass/faiI slalus of lhe chips lased on lhe corresponding IST oulpuls. 4OO
Monle-CarIo sinuIalion sanpIes are generaled ly conducling ILL syslen sinuIalion using
VeriIog-A nacronodeIs. These dala are used lo generale lhe SVM nodeI. To evaIuale lhe
effecliveness of lhe each schene nore reIialIy, anolher 1OO Monle-CarIo sinuIalions are
carried oul and used as lhe lesl dala for checking lhe accuracy of lhe SVM nodeI. The
pass/faiI prediclions achieved lhrough lhe lhree SVM nodeIs are conpared againsl lhe
sinuIaled chip perfornances, as shovn in Iig. 12. Here, lhe prediclions nade lhrough lhe
sinuIalion are IaleIed as direcl neasurenenl, and +1 indicales a chip leing cIassified as
faiI, vhiIe -1 indicales lhe opposile.

0 10 20 30 40 50 60 70 80 90 100
1
0.5
0
0.5
1
1.5
Chip Index
P
a
s
s
/
F
a
i
I
Direct measurement
BST scheme 1
BST scheme 2
BST scheme 3

Iig. 12. Iass/faiI prediclions of lhree IST schenes ( |2OO7j ILLL, fron Yu & Li, 2OO7l)

Iron Iig. 12 ve can see lhal IST schene 1 onIy has onIy 1 niscIassificalion. The
perfornance of IST schene 2 is verified lo le poor as il can onIy delecl lvo fauIly chips
vhich nay have Iargesl varialions. IST schene 3 can delecl nore faiIures lhan schene 2
lul is sliII nol as good as schene 1.

VLS 372

We furlher Iook inlo lhe lrade-offs lelveen lhe accuracy and lhe nunler of digilaI oulpuls
for oplinizalion. This is inporlanl since fever lesl codes viII correspond lo a shorler lesl
line, if a siniIar accuracy can le achieved. This lrade-off anaIysis is conducled for every
schene in Iig. 12. Il can le olserved lhal for a snaII nunler of digilaI oulpuls, lhe accuracy
of schene 2 is acluaIIy higher lhan lhal of schene 3. Il can le aIso seen lhal lhe accuracy of
schene 3 lecones quickIy saluraled as lhe nunler of oulpuls increases. Under aII lhe cases,
schene 1 is aIvays lhe oplinaI choice.

3 4 5 6 7 8 9
0
0.05
0.1
0.15
0.2
Number of test codes
E
r
r
o
r
BST scheme 1
BST scheme 2
BST scheme 3

Iig. 13. Accuracy vs. lesl slruclures for lhree IST schenes ( |2OO7j ILLL, fron Yu & Li,
2OO7l)

There are olher oplinizalions can le done lo inprove IST schenes using lhe efficienl
nacronodeIing lechnique in Seclion 3 and Seclion 4 (Yu & Li, 2OO7l). Il is of greal lenefil lo
do such kind of anaIysis and oplinizalion ly laking consideralion of slalislicaI syslen
perfornances in lhe earIy design slage so lhal ve can avoid lhe coslIy design ileralions due
lo lhe infIuence of process and environnenlaI varialions.

5. ConcIusion

In lhis chapler ve have discussed lhe infIuences of process varialions lo anaIog/nixed-
signaI circuils in deepIy scaIed CMOS lechnoIogies. The perfornances of lvo lypes of
popuIar nixed-signaI syslens, i.e. Signa-DeIla ADCs and Ihase-Iocked Loops are evaIualed
under process varialions. Iaranelerized Iookup lalIe lechnique and reduced-rank
regression vilh hierarchicaI nacronodeIing nelhod vere proposed lo fuIfiI lhe
oplinizalion of lhese lvo syslens, respecliveIy. We aIso exlended lhe ollained fasl syslen
perfornance evaIualion franevork lo conpare lhe efficiencies of paranelric faiIure
delecling of differenl IST circuils and perforn lesl circuil oplinizalion.

6. AcknowIedgement

This vork vas funded in parl ly lhe ICRI Iocus Cenler for Circuil & Syslen SoIulions
(C2S2), under conlracl 2OO3-CT-888.
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 373

7. References

Azais, I. & el aI. (2OO3). An aII-digilaI DIT schene for lesling calaslrophic fauIls in ILLs.
|||| Dcsign c Tcs| cf Ccmpu|crs, VoI. 2O, No. 1, pp. 6O - 67, }an. 2OO3
alii, S. & el aI. (1997). M|DAS Uscr Manua|, Slanford Universily, Slanford, CA
ishop, R. & el aI. (199O). TalIe-lased nodeIing of deIla-signa noduIalors. |||| Trans. On
Circui|s an Sqs|cms, VoI. 37, No. 3, pp. 447-451, Mar. 199O
ox, C., Hunler, D. & Hunler, W. (2OO5). S|a|is|ics fcr |xpcrimcn|s. Dcsign, |nncta|icn, an
Discctcrq, }ohn WiIey & Son, 978-O47171813O, Holoken, N}
rauns, C. & el aI. (199O). TalIe-lased nodeIing of deIla-signa noduIalors using ZSIM.
|||| Trans. On Ccmpu|cr-aic Dcsign, VoI. 9, No. 2, pp. 142-15O, Iel. 199O
Iang, S. & Suyana, K. (1992). Uscr's Manua| fcr SI|TCAP2, CoIunlia Universily, Nev
York, NY
Ieng, Z., Yu, C. & Li, I. (2OO7). Reducing lhe ConpIexily of VLSI Ierfornance Varialion
ModeIing Via Iaraneler Dinension Reduclion. Prccccings cf |n|crna|icna|
Sqmpcsium cn Qua|i|q ||cc|rcnic Dcsign, pp. 737-742, 978-O769527957, Mar. 2OO7,
ILLL press, San }ose, CA
Hsu, C. , Lai, Y. & Wang, S. (2OO5). uiIl-in seIf-lesl for phase-Iocked Ioops. |||| Trans. On
|ns|rumcn| an Mcasurcmcn|, VoI. 54, No. 3, pp. 996-1OO2, }un. 2OO5
Kin, S. & Sona, M. (2OO1). An aII-digilaI luiIl-in seIf-lesl for high-speed phase-Iocked Ioops.
|||| Trans. On Circui|s an Sqs|cms ||, VoI. 48, No. 2, pp. 141-15O, Iel. 2OO1
Lov, K. & Direclor, S. (1989). An efficienl nelhodoIogy for luiIding nacronodeIs of IC
falricalion processes. |||| Trans. On Ccmpu|cr-aic Dcsign, VoI. 8, No. 12, pp.
1299-1313, Dec. 1989
Nassif, S. (2OO1). ModeIing and AnaIysis of Manufacluring Varialions. Prccccings cf Cus|cm
|n|cgra|c Circui|s Ccnfcrcncc, pp. 223-228, 978-O78O365917, May 2OO1, ILLL press,
San Diego, CA
Norsvorlhy, S., Schreier, R. & Tenes C. (1997). Dc||a-Sigma Da|a Ccntcr|crs. Tnccrq, Dcsign,
an Simu|a|icn. ILLL Iress, 978-O78O31O452, Iiscalavay, N}
IeIgron, M., Duinnaijer, A. & WeIlers, A. (1989). Malching properlies of MOS lransislors.
|||| ]curna| cf Sc|i-S|a|c Circui|s, VoI. 24, No. 5, pp. 1433 - 144O, Ocl. 1989
ReinseI, C. & VeIu, R. (1998). Mu||itaria|c Rcucc-Ran| Rcgrcssicn, Tnccrq an App|ica|icns.
Springer, 978-O387986O12, Nev York, NY
Sunler, S. & Roy A. (1999). IST for phase-Iocked Ioops in digilaI appIicalions. Prccccings cf
|n|crna|icna| Tcs| Ccnfcrcncc, pp. 532-54O, 978-O78O357531, Sep. 1999, ILLL Iress,
AlIanlic Cily, N}
Vapnik, V. (1998). S|a|is|ica| |carning Tnccrq. WiIey-Inlerscience, 978-O471O3OO34, Nev York,
NY
Yu, C. & Li, I. (2OO6). Lookup TalIe ased SinuIalion and SlalislicaI ModeIing of Signa-
DeIla ADCs. Prccccings cf Dcsign Au|cma|icn Ccnfcrcncc, pp. 1O35-1O4O, 978-
1595933816, San Irancisco, CA, }uIy 2OO6, ILLL Iress
Yu, C. & Li, I. (2OO7a). Lfhcienl Lookup TalIe ased ModeIing for Rolusl Design of A
ADCs. |||| Trans. cn Circui|s an Sqs|cms - |, VoI.54, No.7, Sep. 2OO7, pp.1513-1528,
Yu, C. & Li, I. (2OO7l). A MelhodoIogy for Syslenalic uiIl-in SeIf-Tesl of Ihase-Iocked
Loops Targeling al Iaranelric IaiIures. Prccccings cf |n|crna|icna| Tcs| Ccnfcrcncc,
pp. 1-1O, 978-1424411276, Ocl. 2OO7, ILLL Iress, Sanla CIara, CA
VLS 374

Yu, C. & Li, I. (2OO8). YieId-avare hierarchicaI oplinizalion of Iarge anaIog inlegraled
circuils. Prccccings cf |n|crna|icna| Ccnfcrcncc cn Ccmpu|cr-Aic Dcsign, pp. 79-84,
978-1424428199, Nov. 2OO8, ILLL Iress, San }ose, CA
Zou, }., MueIIer, D., Crael, H. & SchIichlnann, U. (2OO6). A CIILL hierarchicaI
oplinizalion nelhodoIogy considering jiller, pover and Iocking line. Prccccings cf
Dcsign Au|cma|icn Ccnfcrcncc, pp. 19-24, 978-1595933816, San Irancisco, CA, }uIy
2OO6, ILLL Iress
Nanoelectronic Design Based on a CNT Nano-Architecture 375
0
Nanoelectronic Design Based on
a CNT Nano-Architecture
Bao Liu
Electrical and Computer Engineering Department
The University of Texas at San Antonio
San Antonio, TX, 78249-0669
Email: bliu@utsa.edu
Abstract Carbon nanotubes (CNTs) and carbon nanotube eld effect transistors (CNFETs) have
demonstrated extraordinary properties and are widely expected to be the building blocks of next
generation VLSI circuits. This chapter presents (1) the rst purely CNT and CNFET based nano-
architecture, (2) an adaptive conguration methodology for nanoelectronic design based on the
CNT nano-architecture, and (3) robust differential asynchronous circuits as a promising nano-circuit
paradigm.
1. Introduction
Silicon based CMOS technology scaling has driven the semiconductor industry towards cost
minimization and performance improvement in the past ve decades, and is rapidly ap-
proaching its end (30). On the other hand, nanotechnology has achieved signicant progress
in recent years, fabricating a variety of nanometer scale devices, e.g., molecular diodes (44)
and carbon nanotube eld effect transistors (CNFETs) (46). This provides new opportunities
for VLSI circuits to achieve continuing cost minimization and performance improvement in a
post-silicon-based-CMOS-technology era.
However, we must overcome a number of signicant challenges for practical nanoelectronic
systems, including achieving some of the most critical nanoelectronic design metrics as follow.
1. Manufacturability. As minimum layout feature size becomes smaller than lithography
light wavelength, traditional lithography based manufacturing process can no longer
achieve satisable resolution, and leads to signicant process variations. Resolution
enhancement and other design for manufacturability techniques become less applica-
ble as scaling continues. Alternatively, nanoelectronic systems are expected to be based
on bottom-up self-assembly based manufacturing processes, e.g., molecular beam epi-
taxy (MBE). Such bottom-up self-assembly manufacturing processes provide regular
structures, e.g., perfectly aligned carbon nanotubes (23). Consequently, nanoelectronic
systems need to rely on recongurability to achieve functionality and reliability (51).
2. Reliability. Technology scaling has led to increasingly signicant process and system
runtime variations, including critical dimension variation, dopant uctuation, electro-
magnetic emission, alpha particle radiation and cosmos ray strikes. Such variations can-
not be avoided by manufacturing process improvement, and is inherent at nanometer
19
VLS 376
Nano InterIace
Nano InterIace
N
a
n
o

I
n
t
e
r
I
a
c
e
N
a
n
o

I
n
t
e
r
I
a
c
e
C
N
T
CNT
Fig. 1. The proposed CNT crossbar nano-architecture: layers of orthogonal carbon nanotubes
form a dense array of RDG-CNFETs and programmable interconnects with voltage-controlled
nano-addressing circuits on the boundaries.
scale due to the uncertainty principle of quantum physics. Robust design techniques,
including redundant, adaptive, and resilient design techniques at multiple (architec-
ture, circuit, layout) levels, are needed to achieve a reliable nanoelectronic system (5).
3. Performance. Nanoscale devices have achieved ultra-high performance in the absence
of load, however, nanoelectronic system performance bottleneck lies in global intercon-
nects. Rents rule states that the maximum interconnect length scales with the circuit
size in a power law (24), while signal propagation delay across unit length interconnect
increases as technology scales (30). As a result, interconnect design will be critical to
nanoelectronic system performance.
4. Power consumption. As technology scaling leads to increased device density and de-
sign performance, power consumption is also expected to be critical in nanoelectronic
design.
This chapter presents several recent technical advancements towards manufacturable, reli-
able, high performance and low power nanoelectronic systems.
1. The rst purely CNT and CNFET based nano-architecture, which is constructed by lay-
ers of orthogonal CNTs with via-forming and gate-forming molecules sandwiched in
between, forming a dense array of recongurable double gate carbon nanotube eld
effect transistors (RDG-CNFETs) and programmable interconnects. Such a CNT array
is addressed by novel voltage-controlled nano-addressing circuits on the boundaries,
which do not require precise layout design and achieve yield in aggressive scaling
and adaptivity to process variations. Simulation based on CNFET and molecular de-
vice compact models demonstrates superior logic density, reliability, performance and
power consumption for nano-circuits implemented in this CNT crossbar based nano-
architecture compared with the existing, e.g., molecular diode and MOSFET based
nano-architectures.
Nanoelectronic Design Based on a CNT Nano-Architecture 377
2. A complete set of linear complexity methods for adaptive conguration of nanoelec-
tronic systems based on a CNT crossbar based nano-architecture (Fig. 1) (26), including
(1) adaptive nano-addressing, (2) RDG-CNFET gate matching, and (3) catastrophic de-
fect mapping methods. Compared with the previous nano-architecture defect mapping
and adaptive conguration proposals, these methods are complete, specic, determin-
istic, of low runtime complexity. These methods demonstrate the promising prospect of
achieving nanoelectronic systems of correct functionality, performance, and reliability
based on the CNT crossbar nano-architecture.
3. Robust Differential Asynchronous (RDA) circuits as a promising paradigm for reliable
(noise immune and delay insensitive) high performance and low power nano-circuits
based on the CNT crossbar nano-architecture. Theoretical analysis and SPICE simu-
lation based on 22nm CMOS Predictive Technology Models show that RDA circuits
achieve much enhanced reliability in logic correctness in the presence of a single bit
soft error or common multiple bit soft errors, and timing correctness in the presence of
parametric variations given the physical proximity of the circuit components.
The rest of this chapter is organized as follows. Section 2 reviews the existing nanoelectronic
devices, nano-architectures and nano-addressing circuits. Section 3 presents the proposed
CNT crossbar based nano-architecture including a novel RDG-CNFET device, a multi-layer
CNT crossbar structure, and a voltage-controlled nano-addressing circuit. Section 4 presents
adaptive conguration methods for nanoelectronic systems based on the CNT crossbar nano-
architecture. Section 5 presents robust differential asynchronous circuits as a promising nano-
circuit paradigm. Section 6 presents simulation results which evaluate the CNT crossbar nano-
architecture and robust differential asynchronous nano-circuits. Section 7 concludes this pa-
per with a list of nanotechnologies which enable and improve the proposed CNT crossbar
based nano-architecture.
2. Background
2.1 Existing Nanoscale Devices
Carbon nanotube is one of the most promising candidates for interconnect technology at
nanometer scale, due to its extraordinary properties in electrical current carrying capabil-
ity, thermal conductivity, and mechanical strength. A carbon nanotube is a one-atom-thick
graphene sheet rolled up in a cylinder of a nanometer-order diameter, which is semicon-
ductive or metallic depending on its chirality. The cylinder form eliminates boundaries and
boundary-induced scattering, yielding electron mean free path on the order of micrometers
compared with few tens of nanometers in copper interconnects (32). This gives extraordinary
current carrying capacity, achieving a current density on the order of 10
9
A/cm
2
(56). How-
ever, large resistance exists at CNT-metal contacts, reducing the performance advantage of
CNTs over copper interconnects (38).
Among various nanotechnology devices, carbon nanotube eld effect transistors are the most
promising candidates to replace the current CMOS eld effect transistors as the building
blocks of nanoelectronic systems. Three kinds of carbon nanotube based eld effect transis-
tors (CNFETs) have been manufactured: (1) A Schottky barrier based carbon nanotube eld
effect transistor (SB-CNFET) consists of a metal-nanotube-metal junction, and works on the
principle of direct tunneling through the Schottky barrier formed by direct contact of metal
and semiconducting nanotube. The barrier width is modulated by the gate voltage. This de-
vice has the most mature manufacturing technique up to today, while two problems limit its
VLS 378
future: (a) The metal-nanotube contact severely limits current. (b) The ambipolar conduction
makes this devices cannot be applied to conventional circuit design methods. (2) A MOSFET-
like CNFET is made by doping a continuous nanotube on both sides of the gate, thus forming
the source/drain regions. This is a unipolar device of high on-current. (3) A band-to-band
tunneling carbon nanotube eld effect transistor (T-CNFET) is made by doping the source
and the drain regions into p
+
and n
+
respectively. This device has low on-current and ultra
low off current, making it potential for ultra low power applications. It also has the potential
to achieve ultra fast signal switching with < 60mV/decade subthreshold slope (46).
Molecular electronic devices are based on two families of molecules: the catenanes which
consist of two or more interlocked rings, and the rotaxanes which consist of one or more
rings encircling a dumbbell-shaped component. These molecules can be switched between
states of different conductivities in a redox (reduction/oxidation) process by applying cur-
rents through them, providing recongurability for nanoscale devices (44).
Avariety of recongurable nanoscale devices have been proposed. Resonant tunneling diodes
based on redox active molecules are congurable on/off (44). Nanowire eld effect transistors
with redox active molecules at gates are of high/low conductance (17). Spin-RAM devices are
of high/low conductivity based on the parallel/anti-parallel magnetization conguration of
the device which is congured by the polarity of the source voltage (40). A double gate Schot-
tky barrier CNFET is congurable to be a p-type FET, an n-type FET, or off, by the electrical
potential of the back gate (25). A double gate eld effect transistor with the back gate driven
by a three state RTD memory cell is congurable to be a transistor or an interconnect, reducing
reconguration cost of a gate array (4).
2.2 Existing Nanoelectronic Architectures
At least three categories of nanoelectronic architectures have been proposed. An early nano-
electronic architecture NanoFabrics was based on molecular resonant tunneling diodes (RTDs)
and negative differential resistors (NDRs) (20). The insightful authors have observed that
passive device (diode/resistor) based circuits lack signal gain to recover from signal attenua-
tion, while combining with CMOS circuits compromises scaling advantages. They proposed
latches based on negative differential resistors (NDRs), which, unfortunately, have become
obsolete since the publication.
The majority of the existing nanoelectronic architectures are based on a hybrid nano-CMOS
technology, with CMOS circuits complementing nano-circuits. In FPNI (50) (CMOL (54)),
a nanowire crossbar is placed on top of CMOS logic gates (inverters). The nanowires pro-
vide programmable interconnects (and wired-OR logic), while the CMOS gates(inverters)
provide logic implementation (signal inversion and gain). Such architectures achieve compro-
mised scaling advantage in term of device density. DeHon (11; 13) proposed to combine pro-
grammable nanoscale diode logic arrays with xed simple CMOS circuitry, e.g., of precharge
and evaluation transistors as in domino logic for signal gain. Sequential elements need also to
be implemented as CMOS circuits. However, the optimal size of a combinational logic block
is typically small (e.g., of 30-50 gates), which results in signicant CMOS circuitry overhead
in such architectures. An exception is memory design, where CMOS technology provides pe-
ripheral circuitry such as address decoders and read sensors with moderate overhead, while
nanotechnology provides scaling advantage in memory cells (17; 47; 63).
The third category of existing nanoelectronic architectures rely on DNA-guided self-assembly
to form 2-D scufes for nanotubes (42; 43) or 3-D DNA-rods (14). Such technologies target
application in the far future.
Nanoelectronic Design Based on a CNT Nano-Architecture 379
Data Lines / Nanoscale Wires
A
d
d
r
e
s
s

L
i
n
e
s

/

M
a
c
r
o
s
c
a
l
e

W
i
r
e
s
Fig. 2. Layout of undifferentiated nanoscale wires (data lines) addressed by microscale wires
(address lines). Lithography denes high- and low-k dielectric regions, which gives eld effect
transistors and direct conduction, respectively.
2.3 Existing Nano-Addressing Circuits
A nano-addressing circuit selectively addresses a nanoscale wire in an array, and enables data
communication between a nano-system and the outside world. The existing nano-addressing
circuits are based on binary decoders, with an array of (microscale) address lines running
across the (nanoscale) data lines, forming transistors at each crossing (e.g., Fig. 2). Each data
line is selected by a unique binary address, given each data line has a unique gate congura-
tion. However, such precise layout design is highly unlikely to achieve at a sublithographic
nanometer scale (without signicantly compromised yield).
In details, the existing nano-addressing circuits are in four categories as follow.
1. Randomized contact decoder (59) includes gold particles which are deposited at ran-
domas contacts between nanoscale and microscale wires. Testing and feedback provide
a one-to-one mapping between a nanoscale wire and an address.
2. Undifferentiated nanoscale wires are addressable by microscale wires with (e.g., lithog-
raphy dened) different gate congurations (which requires nanoscale wire spacing in
the same order of lithography resolution) (22) (Fig. 2).
3. Alternatively, different gate congurations are realized in the nanoscale wires, by grow-
ing lightly-doped and heavily-doped carbon nanotubes of different length alternatively,
while the microscale wires are undifferentiated. A microscale wire crossing a lightly-
doped nanotube segment forms a gate, while a heavily-doped nanotube segment is
always conductive for all possible signals in the microscale wire. In such a case, precise
control of the lengths of the lightly- and heavily-doped nanotube segments would be
critical (12; 21).
4. In radial addressing, multi-walled carbon nanotubes are grown with lightly- and
heavily-doped shells, an etching process removes the heavily-doped outer shells at pre-
cise locations, and denes the gate congurations at each crossing of nanoscale and
microscale wires (48).
VLS 380
Because process variations are inevitably signicant at nanometer scale, these existing nano-
addressing structures achieve limited yield, e.g., there is certain probability that two nanoscale
wires have identical or similar gate conguration due to process variation. Furthermore,
nanoscale wires are mostly partially selected, e.g., they may not achieve the ideal conductivity
upon selected, due to process variations such as misalignment, dopant variation, etc.
2.4 Existing Nano-Architecture Defect-Mapping and Adaptive Conguration Methods
Existing nano-architecture defect mapping techniques are as follow. (1) On a Teramac recon-
gurable computing platform, signals are propagated along each row or each column in a
crossbar structure, a defect is located at the intersection of a defective row and a defective col-
umn, based on the assumption that a single defect is present (10). (2) In the NanoFabrics nano-
architecture, the roughly estimated number of defects for a subset of computing resources are
collected by counter or none-some-many circuits, a simple graph based algorithm or a Bayes
rule based probabilistic computation procedure gives defect occurrence probability estimates.
E.g., highly likely defects are detected in the probability assignment phase, which accumulates
defect probability in different test congurations, while less likely defects are located in the
defect location phase, which incrementally clears certain spots as non-defects during test of
different congurations (33). (3) A Build-In Self-Test (BIST) method in the NanoFabrics nano-
architecture brings much increased complexity with limited applicability (in nding available
defect-free neighboring nanoBlocks to implement test circuitry) (8; 58).
After a defect map is achieved presumably, logic circutis can be constructed avoiding or utiliz-
ing the defects. For example, a nanoPLA block can be synthesized in the presence of defective
crosspoints (36), a CNT nano-circuit layout can be synthesized in the presence of misaligned
and mispositioned CNTs (41), metallic CNTs (61), and CNTs of variational density (62).
2.5 Existing CNT Nano-Circuit Design
A very limited number of primitive combinational logic circuits have been fabricated based
on CNFETs, including an inverter and two NOR gates in NMOS logic based on SB-CNFETs
(3), and a ve-inverter ring oscillator based on MOSFET-like CNFETs (9). While nano-circuits
based on ambipolar SB-CNFETs need different topologies (3; 46; 53), nano-circuits based on
unipolar MOSFET-like CNFETs can be identical to CMOS circuits (9).
3. CNT Crossbar based Nano-Architecture
As we have seen, most existing nanoelectronic architectures are based on diode/resistor logic
and CMOS/nano-technologies (11; 13; 20; 50; 54; 63), which only achieve limited manufactura-
bility, reliability, and performance. Carbon nanotubes (CNTs) and carbon nanotube eld effect
transistors (CNFETs) are the most promising candidates as the the building blocks of nanoelec-
tronic systems due to their extraordinary properties. CNTs possess excellent electrical current
carrying capability, thermal conductivity, and mechanical strength. CNFETs are potential to
achieve high on-current, ultra-low off-current, and ultra-fast switching (< 60mV/decade sub-
threshold slope). CNT crossbar structure (Fig. 1) is one of the most promising candidates
for nanoelectronic design platform. Recently, UIUC researchers have achieved fabrication of
dense perfectly aligned CNT arrays (23). Such a CNT crossbar structure forms the basis of
nanoscale memories (17; 47; 63).
However, no nanoelectronic architecture has been proposed which is solely based on CNTs
and CNFETs. The reasons include lack of (1) a recongurable CNT based device which could
provide functionality and reliability, (2) a self-assembly process which forms complex CNT
Nanoelectronic Design Based on a CNT Nano-Architecture 381
n

n

p or i
Dielectric
Front Gate
Bistable
Molecules
CNT
CNT Source CNT Drain
CNT
Molecules
BacN Gate
Redox Active
Fig. 3. A n-type MOSFET-like recongurable double gate carbon nanotube eld effect tran-
sistor (RDG-CNFET).
structures, and (3) an achievable mechanism which precisely addresses an individual CNT in
an array.
In this section, we investigate the rst purely CNT and CNFET based nano-architecture, which
is based on a novel RDG-CNFET device, includes a CNT crossbar structure on multiple layers,
and a novel voltage-controlled nano-addressing circuit.
3.1 RDG-CNFET Device Structure
As the building block of a purely CNT and CNFET based nano-architecture, a recong-
urable double-gate CNFET (RDG-CNFET) is constructed by sandwiching electrically bistable
molecules in a double gate CNFET. The double gate CNFET is constructed by three overlap-
ping orthogonal carbon nanotubes. The top and the bottom carbon nanotubes form the front
gate and the back gate, while doping the carbon nanotube in the middle layer forms the source
and the drain of a n- or p-type MOSFET-like CNFET (46). Electrically bistable molecules are
coated around the front gate and sandwiched between the front gate and the source/drain re-
gions. Dielectric and redox active molecules are coated around the back gate and sandwiched
between the back gate and the source/drain regions (Fig. 3).
The redox active molecules at the back gate are electrically recongurable to hold/release
charge in a redox process, which controls the CNFET threshold voltage and conductance, or,
turns the CNFET on or off. An example of such conguration is reported in (17), wherein
a 10V voltage applied to cobalt phthalocyanine (CoPc) molecules triggers a redox process,
and results in a NW-FET conductance change of nearly 10
4
times. Such reconguration of
CoPc molecules is repeatable for more than 100 times.
The bistable molecules sandwiched between the front gate and the source/drain regions are
electrically recongurable to be conductive or insular, making the device a via or a FET. An ex-
ample of such electrically bistable molecules is reported in (44), wherein oxidative degradation
reduces resonant tunneling current of the V-shaped amphiphilic [2]-rotaxane 5
4+
molecules
by nearly a factor of 100. Alternatively, the anti-fuse technologies in the existing recong-
urable architectures provide one-time congurability. For example, the QuickLogic ViaLink
technology include a layer of amorphous silicon sandwiched between two layers of metal. A
10V programming voltage provides a resistance difference between G and 80 (6).
VLS 382
S D
G
BG
Fig. 4. Compact model of a n-type MOSFET-like recongurable double gate carbon nanotube
eld effect transistor (RDG-CNFET).
L1
L4
Dielectric
L2
L3
Redox Active Molecules
Bistable Molecules
Bistable Molecules
Fig. 5. Carbon nanotube (CNT) layers in the proposed nanoelectronic architecture.
3.2 RDG-CNFET Device Behavior
Such a RDG-CNFET device is described in a compact model as is shown in Fig. 4, and is
recongurable to the following components, making it an ideal nanoelectronic architecture
building block.
1. Via, when the front gate bistable molecules are congured to be conductive. The over-
lapping of the front gate and the source/drain regions form conductive contacts. As a
result, the front gate, the source, and the drain are short circuited. The device is cong-
ured as a via between the carbon nanotubes on the top and in the middle.
2. Short, when the front gate bistable molecules are congured to be insular, and the back
gate redox active molecules are congured to hold positive(negative) charge in a n-
type(p-type) CNFET. The CNFET is on for any front gate voltage.
3. MOSFET-like CNFET, when the front gate bistable molecules are congured to be insu-
lar, and the back gate redox active molecules are congured to hold negative(positive)
charge in a n-type(p-type) MOSFET-like CNFET. The CNFET threshold voltage is ad-
justable by the doping concentration in the channel (p or n doping for a n- or p-type CN-
FET), such that when the back gate redox active molecules are congured to hold neg-
ative(positive) charge in a n-type(p-type) MOSFET-like CNFET, the CNFET achieves
both performance and leakage control.
4. Open, when the MOSFET-like CNFET is turned off. This is achieved at the architecture
level as follows.
Nanoelectronic Design Based on a CNT Nano-Architecture 383
3.3 CNT Crossbar Structure
At a larger scale, a nanoelectronic architecture is constructed by growing layers of orthog-
onal carbon nanotubes, with via-forming (electrically bistable) and gate-forming (dielectric
and redox active) molecules sandwiched at each crossing (Fig. 1). The carbon nanotubes
are either (1) semiconductive CNTs which are doped to have low resistivity and are recon-
gurable to opens by gate isolation, or (2) metallic CNTs which upon identication can be
utilized as global interconnects if not avoided or removed (1; 64). The (1) via-forming (electri-
cally bistable) and (2) gate-forming (dielectric and redox active) molecules can be rst coated
around a carbon nanotube (e.g., as in (17)), then undergo an etching process with the top layer
of carbon nanotubes as masks (e.g., as in (49)). The remaining molecules are sandwiched be-
tween two orthogonal carbon nanotubes on adjacent layers. A top-down (e.g., lithography)
process denes the areas for each type of molecules to assemble on each layer, as well as the
p-wells and n-wells. P-type and n-type of MOSFET-like CNFETs are formed by (e.g., potas-
sium or electrostatic (46)) doping of the carbon nanotubes selectively. E.g., a p-well or n-well
of dimensions in the order of 22nm include about 10 rows of CNFETs.
Conguration of such a CNT crossbar based nanoelectronic architecture gives a nanoscale
VLSI implementation including MOSFET-like CNFETs and interconnects with opens, shorts
and vias (Fig. 6), which can be 2-D (compatible to traditional VLSI systems) or 3-D.
In a 2-D VLSI implementation, MOSFET-like CNFETs are formed on the bottom three layers
of carbon nanotubes, with the rst layer (L1) from bottom of carbon nanotubes provides the
back gates, the second layer (L2) provides the source and the drain regions, and the third
layer (L3) provides the front gates of the MOSFET-like CNFETs. Dielectric and redox active
(back gate) molecules are sandwiched between the L1 and L2 layer carbon nanotubes, and
electrically bistable (front gate) molecules are sandwiched between the L2 and L3 layer car-
bon nanotubes. A multi-layer recongurable interconnect structure with programmable vias
and opens is achieved with via-forming and gate-forming molecules sandwiched between
interconnects which are formed above the rst (L1) layer (Fig. 5).
3-D VLSI circuits are under active research in recent years due to their potential of achieving
reduced wirelength, reduced power consumption and improved performance. However, sili-
con based VLSI circuits are essentially 2-D, because MOSFETs are surface devices on the bulk
of silicon, 3-D MOSFET circuits can only be achieved by bonding chips. It is therefore criti-
cal to achieve (1) bonding technology which provides acceptable mechanical strength, (2) via
technology which provides low resistive interconnects between chips, and (3) heat dissipation
in a multiple chip system for silicon based 3-D circuits. On the contrary, CNFET and CNFET
based nano-architectures provide excellent platforms for 3-D VLSI circuits, because (1) CNTs
and CNFETs are not conned to certain surface and can be manufactured in 3-D space, (2)
CNTs possess excellent current carrying, mechanical and heat dissipation properties which
are critical to 3-D VLSI circuits.
In a 3-D VLSI implementation, the RDG-CNFETs do not need to be conned on the bottom
layers, with the upper layers dedicated to interconnects. Instead, transistors and interconnects
are free to be located on each layer of carbon nanotubes. Gate forming (dielectric and redox
active) molecules and via-forming (electrically bistable) molecules are distributed between
adjacent CNT layers. Combination of the types of molecules surrounding a CNT segment
gives three components.
1. Gate-forming molecules both on top and on bottom of a CNT segment give a device
which is recongurable to either open or short,
VLS 384
b a c b a c Vdd
output
Fig. 6. An RDG-CNFET based Boolean logic a(b + c) implementation.
output clN
Vdd
input
clN
Fig. 7. An RDG-CNFET based latch implementation.
2. Gate-forming and via-forming molecules on top and on bottom of a CNT segment give
the RDG-CNFET, which is recongurable to via, short, MOSFET-like CNFET, and open,
3. Via-forming molecules both on top and on bottom of a CNT segment give a device
which is recongurable to be stacked via, simple via, or double gate FET.
We have the following observations.
Observation 1. Via-forming (electrically bistable) molecules must be present between any two adja-
cent layers.
Observation 2. Gate-forming (redox active) molecules must be present next to each layer for gate
isolation.
Observation 3. Gate-forming (redox active) and via-forming (electrically bistable) molecules need to
be evenly distributed on each layer for performance.
3.4 Circuit Paradigms and Analysis
This CNT crossbar based nano-architecture provides regularity and manufacturability for
high logic density implementations of all CMOS logics, including the standard CMOS logic
(e.g., in Fig. 6), domino logic, pass-transistor logic, etc., for combinational circuits, as well as
latches (e.g., in Fig. 7), ip-ops, memory input address decoder and output sensing circuits.
Such high logic density is achieved via direct connection of CNFETs through their
source/drain regions (e.g., as in an latest Intel microprocessor implementation (15)), without
going through additional (e.g., metal) interconnects. CNT-metal contacts are known to bring
the most signicant resistivity in CNT technology (38). Avoiding such CNT-metal contacts
contributes to performance and reliability improvements. Furthermore, reduced interconnect
length also leads to reduced interconnect capacitance, and improved circuit performance.
This CNT crossbar based nano-architecture also provides a high recongurability by allow-
ing an arbitrary ratio of logic gates and interconnect switches (a RDG-CNFET device can be
Nanoelectronic Design Based on a CNT Nano-Architecture 385
Address Line 2
Vdda2
Address Line 1
Vdda1
Vssa1
Vssa2
Data Lines / Nanoscale Wires
Fig. 8. Schematic of the proposed voltage-controlled nano-addressing circuit.
congured as either a logic gate or an interconnect switch). Apre-determined ratio of logic de-
vices and interconnect switches (e.g., in standard cell designs and FPGA architectures where
cells and routing channels are separated) constrains design optimization and may lead to
inefcient device or interconnect utilization. Allowing an arbitrary ratio of logic gates and
interconnect switches (e.g., as in sea-of-gate designs) provides increased degree of freedom
for design optimization (4).
The CNT crossbar based nano-architecture is also the rst to include multiple routing layers.
Multiple routing layers (as in the current technologies) are necessary for VLSI designs, as
Rents rule suggests that the I/Onumber of a circuit module follows a power lawwith the gate
number in the module (24). A small routing layer number could lead to infeasible physical
design or signicant interconnect detouring, resulting in degraded performance and device
utilization.
3.5 Voltage Controlled Nano-Addressing Structure
The nal piece of the CNT crossbar nanoelectronic architecture is the nano-addressing circuits
on the boundary of the carbon nanotube crossbar structure.
Designing a nano-addressing circuit is a challenging task, because (1) the nanoscale lay-
out cannot be manufactured precisely unless it is of a regular structure, and (2) the nano-
addressing circuit cannot be based on recongurability since it provides recongurability to
the rest of the nanoelectronic system.
A novel voltage-controlled nano-addressing circuit (Fig. 8) is constructed by running two ad-
dress lines (of either microscale or nanoscale wires) on top of the data lines (of nanoscale wires
in an array which are to be addressed). The address lines and the data lines are orthogonal. At
each crossing of an address line and a data line, a eld effect transistor is formed by doping
the data line into the source and the drain regions while the address line provides the gate
of the transistor, with a thin layer of dielectric sandwiched between the gate and the tran-
sistor channel. Such eld effect transistors have been successfully fabricated based on either
nanowires or carbon nanotubes (17; 37; 46).
3.6 Voltage-Controlled Nano-Addressing Principle
The address line provides the gate voltage for the transistors. Each address line is connected to
two external voltages at the ends (V
dda1
and V
ssa1
for address line 1, V
dda2
and V
ssa2
for address
line 2). The position of a nanoscale wire in the array gives the gate voltage for the transistor
VLS 386
on the nanoscale wire alone the address line. For example, a i-th nanoscale wire (starting from
V
ss
) in an array of n equally spaced nanoscale wires has a transistor gate voltage
V
g
(i, n) =
i
n
V
dd
+
n i
n
V
ss
(1)
in an address line connecting to two external voltage sources V
dd
and V
ss
. Here we assume
uniform address lines of negligible external resistance (from the rst or the last nanoscale wire
to the nearest external voltage source).
A transistor is on if its gate voltage exceeds the threshold voltage V
g
> V
th
. A nanoscale
wire is conductive if both transistors on it are on. Because the two address lines provide an
increasing series and a decreasing series of gate voltages respectively, only nanoscale wires at
specic positions in the array are conductive. For example, for V
dda1
= V
dda2
and V
ssa1
= V
ssa2
,
the nanoscale wire in the middle of the array gets conductive.
In general, to select the i-th data line from the left in an array of n nanoscale wires, the external
voltages need to be such that all the transistors on the right hand side of the i-th data line in
the rst address line are off, and all the transistors on the left hand side of the i-th data line in
the second address line are off:
V
ga1
(i +1, n) = (1
i +1
n
)V
dda1
+
i +1
n
V
ssa1
< V
th
V
ga2
(i 1, n) =
i 1
n
V
dda2
+ (1
i 1
n
)V
ssa2
< V
th
(2)
3.7 Voltage Controlled Nano-Addressing Analysis
Compared with the existing nano-addressing circuits, the proposed voltage-controlled nano-
addressing circuit leads to signicant manufacturing yield improvement due to the following
reasons.
The existing nano-addressing circuits are based on binary decoders and require every
nanoscale wire have a unique physical structure to differentiate itself, which is highly unlikely in
a nanotechnology manufacturing process - lithography cannot achieve nanoscale resolution,
while bottom-up self-assembly based nanotechnology manufacturing processes provide only
regular structures. Even at microscale, such a structure is subject to prevalent catastrophic
defects and signicant parametric variations, which result in low yield.
On the contrary, the proposed circuit consists of only uniform components in a regular struc-
ture. Every nanoscale wire has a uniform physical structure and is differentiated by their electrical
parameters, e.g., the node voltages. This scheme avoids any precise layout design and signi-
cantly improves yield and enables aggressive scaling of the addressing circuit with the rest of
the nanoelectronic system.
Furthermore, let us compare voltage-controlled nano-addressing with the existing binary de-
coder based nano-addressing mechanisms in terms of addressing accuracy and resolution.
These two key quantitative metrics for nano-addressing circuits are dened as follow since
such denition is not available in previous publications to the best of the authors knowledge.
Denition 1. Addressing inaccuracy of a nano-addressing circuit is the offset between the target
data line i and the data line j of maximum current.
AI = [i j[ (3)
Nanoelectronic Design Based on a CNT Nano-Architecture 387
In voltage-controlled nano-addressing, addressing inaccuracy is given by inaccurate address-
ing voltages from the voltage dividers. Such addressing inaccuracy can be further minimized
by adjusting the external voltages to adapt to manufacturing process and system runtime
parametric variations. As a result, a mis-addressing is localized, i.e., the data line j of maxi-
mum current is not far from the target data line i.
In traditional binary decoder based nano-addressing, an n-bit binary address has n neigh-
boring binary addresses of Hamming distance 1. A 1-bit error could lead to n different mis-
addressings. This leads to non-localized mis-addressing and a more signicant addressing
inaccuracy.
Denition 2. Addressing resolution of a nano-addressing circuit is the minimum ratio between the
on current I
on
(i) of a target data line i and the off current I
o f f
(j) of a non-target data line j (under all
conditions, e.g., different inputs and parametric variations).
AR = Min
I
on
(i)
I
o f f
(j)
(4)
In traditional binary decoder based nano-addressing, the achievable addressing resolution de-
pends on the conductance difference between the target data line and other non-target data
lines. There are n non-target data lines with Hamming distance 1 for a n-bit target address,
which have similar if not identical conductances. The presence of parametric variations fur-
ther reduces addressing resolution.
In voltage-controlled nano-addressing, addressing resolution is largely given by the address-
ing voltage difference between two adjacent data lines. Applying high voltages leads to a
number of reliability issues, such as electromigration and gate dioxide breakdown. Carbon
nanotubes are highly resistive to electromigration, while new material is needed to enhance
reliability for gate dioxide breakdown.
Alternatively, for given gate voltage difference, transistor current difference can be improved
by improving the inverse subthreshold slope. However, MOSFETs and MOSFET-like CNFETs
are limited to an inverse subthreshold slope S (which is the minimum gate voltage variation
needed to bring a 10 source-drain current increase) of 2.3
kT
q
60mV/decade at 300K (46).
This requires development of novel devices for larger inverse subthreshold slopes.
4. Adaptive Conguration of Nanoelectronic Systems Based on the CNT Crossbar
Nano-Architecture
In this section, we examine a list of nanoelectronic design adaptive conguration methods
which cancel the effects of catastrophic defects and parametric variations in the proposed
CNT crossbar nano-architecture.
4.1 Adaptive Nano-Addressing
A variety of parametric variations are expected to be prevalent and signicant in nanoelec-
tronic systems. Their effects on the voltage-controlled nano-addressing circuit (Fig. 8) are as
follow.
1. Global address line resistance variations, e.g., due to uniform width, height, and/or
resistivity variations of the address lines, have no effect on the voltage divider hence
the addressing scheme.
2. Address line misalignment (shifting) has no effect on the conductances of the data lines.
VLS 388
V
h
R
R
V
l
D D
i j
R R
R
h l
l
R
h
Fig. 9. Addressing two CNTs D
i
and D
j
with a resistance of R in between.
3. Global data line misalignment (i.e., shifting of all data lines), variations of external volt-
age sources, and variations of external wire/contact resistance (between the resistive
voltage divider and the external voltage sources) lead to potential addressing inaccu-
racy (CNT target offset).
4. Individual data line misalignment (shifting) could decrease the difference between the
gate voltages of two adjacent transistors, leading to degraded addressing resolution
(on/off CNT current ratio between two adjacent CNTs).
5. Process variations of the transistors, including width, length, dopant concentration, and
oxide thickness variations, lead to transistor conductivity uncertainty and degraded
addressing resolution.
The nano-addressing scheme needs to achieve a higher enough addressing resolution which
endures the above-mentioned parametric variation effects (e.g., by applying high external
addressing voltages, and/or novel CNFETs of < 60mV/decade subthreshold slope).
After achieving satisable addressing resolution, we need to minimize any addressing inac-
curacy and address the correct CNT data line (Problem 1).
Problem 1 (Adaptive Nano-Addressing). Given a voltage-controlled nano-addressing circuit, ad-
dress the i-th CNT data line in the presence of parametric variations.
Let us rst derive the external voltage offset needed for a data address offset. Suppose for an
address line, the external voltages V
h
and V
l
address the i-th CNT data line D
i
. The resistance
between CNT data line D
i
and the high (low) external address voltage V
h
(V
l
) is R
h
(R
l
) (Fig.
9).
1
We have
R
l
R
h
+ R
l
V
h
+
R
h
R
h
+ R
l
V
l
= V
on
(5)
where V
on
is the voltage needed to address a CNT data line of peak current. Shifting the
external voltages to V
h
+V and V
l
+V addresses another CNT data line D
j
. The resistance
between CNT data line D
j
and the high (low) external address voltage is R
h
+R (R
l
R).
We have
R
l
R
R
h
+ R
l
(V
h
+V) +
R
h
+R
R
h
+ R
l
(V
l
+V) = V
on
(6)
1
For the rst address line, the high external voltage V
h
= V
l1
is on the left, the low external voltage
V
l
= V
r1
is on the right. For the second address line, the high external voltage V
h
= V
r2
is on the right,
the low external voltage V
l
= V
l2
is on the left.
Nanoelectronic Design Based on a CNT Nano-Architecture 389
As a result,
V =
R
R
h
+ R
l
(V
h
V
l
) (7)
Observation 4. The external voltage offset V is proportional to the resistance offset R between two
CNT data lines, and is proportional to the physical offset L between the two CNT data lines, if the
resistive voltage dividers are uniform (e.g., the CNT data lines are equally spaced and the address lines
have uniform resistivity).
Based on Observation 1, Method 1 gives an adaptive nanoelectronic addressing method,
which nds the external voltage shifts needed to address the left most and the right most
CNT data lines rst. Any other external voltage shift needed to address a specic CNT data
line is then computed based on a linear interpolation. To address the left most or the right
most CNT data line, we apply a gradually increasing/decreasing external voltage offset V
at an address line, keep all the transistors at the other address line on, and measure the con-
ductance of the array of CNT data lines. The maximum and the minimum Vs (e.g., V
mink
and V
maxk
, k = 1 or 2) with non-zero CNT data line conductances address the left most and
the right most CNT data lines, respectively.
Algorithm 1: Adaptive Voltage Controlled
Nano Addressing
Input: An array of n CNT data lines, address i
Output: Addressing i-th data line
1. Turn on all transistors at address line 2 (V
l2
=
V
r2
> V
th
)
2. Find V
l1
which addresses rst data line (bi-
nary search)
3. Find V
r1
which addresses n-th data line (bi-
nary search)
4. Turn on all transistors at address line 1 (V
l1
=
V
r1
> V
th
)
5. Find V
l2
which addresses rst data line (bi-
nary search)
6. Find V
r2
which addresses n-th data line (bi-
nary search)
7. Shift V
l1
and V
r1
by
ni
n
V
l1
+
i
n
V
r1
8. Shift V
l2
and V
r2
by
ni
n
V
l2
+
i
n
V
r2
Observation 5. The addressing accuracy given by Method 1 depends only on the uniformity of the
resistive voltage divider, and the time domain variations of the external voltage differences V
l1
V
r1
and V
l2
V
r2
. Any time-invariant (e.g., manufacturing process) variations of the external voltages
(V
l1
, V
r1
, V
l2
, and V
r2
) or the external address line resistances (from the outer most data lines to the
external voltage sources) do not affect the achievable addressing accuracy.
4.2 RDG-CNFET Gate Matching
Another process variation is the misalignment of the front gate CNT and the back gate CNT of
a recongurable double-gate CNFET (RDG-CNFET). This is because that the front gate CNT
and the back gate CNT are on different (i 1 and i +1) layers, while CNT arrays on different
layers do not have and are not expected to have a precise alignment mechanism.
VLS 390
Front Gate
Bistable
Molecules
CNT
n

p or i n

p or i
CNT
Molecules
BacN Gate
Redox Active
CNT
Molecules
BacN Gate
Redox Active
Front Gate
Bistable
Molecules
CNT
CNT
Molecules
BacN Gate
Redox Active
n

Dielectric
Fig. 10. CNT misalignment in dense CNT arrays. The closest CNT pair forms the front gate
and the back gate of a RDG-CNFET. The neighboring CNTs have cross-coupling effect which
needs to be simulated/tested or avoided by shielding.
Fortunately, we observe that precise alignment between a front gate CNT and a back gate
CNT is not necessarily required as long as the CNT arrays are dense, e.g., with the spacing
between CNTs close to the CNT diameters. In such a case, a double gate eld effect transistor
is formed even in the presence of CNT misalignment (Fig. 10). A CNFET channel is formed
by doping the source/drain regions with the front gate CNT on the upper layer as mask. The
resultant CNFET channel aligns with the front gate CNT. A misaligned back gate injects a
weaker electrical eld in the CNFET channel from a longer distance. A neighboring back gate
may also injects a weak electrical eld in the channel. This is either tolerated (which needs to
be veried by simulation or testing) or avoided (by reserving the neighboring back gates for
shielding).
The question is then how to nd the closest CNT pair on different layers which form the front
gate and the back gate of a RDG-CNFET (such that we can address them and congure the
RDG-CNFET).
Problem 2 (RDG-CNFET Gate Matching). Given a CNT i on layer l, locate the closest CNT j on
layer l +2 (or l 2) where CNTs i and j form the front gate and the back gate of a RDG-CNFET.
Method 2 solves Problem 2 and nds the closest CNT pairs which form the front gate and the
back gate of a RDG-CNFET.
Nanoelectronic Design Based on a CNT Nano-Architecture 391
Algorithm 2: RDG-CNFET Gate Matching
Input: CNT i on layer l which is a gate of CNFET
T
Output: Closest CNT j to CNT i on layer l + 2
(l 2) which is the other gate of CNFET T
1. Apply a turn-off gate voltage to CNT i
2. For each CNT j on layer l +2 (l 2),
3. Apply a turn-off gate voltage to CNT j
4. Measure the conductance of CNFET T
5. Find CNT j for the smallest CNFET conduc-
tance
Once a matching gate is identied, the CNFET can be characterized (by achieving its I-V
curves). A parasitic CNFET can also be identied by nding the second closest CNT (with the
second smallest CNFET conductance in the algorithm), which is either tolerated or avoided in
a nanoelectronic design.
4.3 Catastrophic Defects and Mapping Techniques
In this subsection, we examine catastrophic defects for CNTs, programmable vias and CN-
FETs, and their corresponding detection and location methods.
4.3.1 Metallic, Open and Crossover CNTs
CNTs are metallic or semiconductive depending on their chirality. One third of CNTs are
metallic if they are grown isotropically. Metallic CNTs can be removed by either chemical
etching (64) or electrical breakdown (1). However, such techniques bring large process vari-
ation effects (34). Mitra et al. propose use of CNT bundles for each nanoelectronic signal to
reduce metallic CNT effect (34). We observe that metallic CNTs need not necessarily to be
removed and CNT bundles are not needed for each nanoelectronic signal as long as metal-
lic CNTs can be detected and located. Upon detection and location, metallic CNTs can be
congured to form global interconnects if not avoided. Their low resistivity helps to reduce
signal propagation delay in global interconnects which are critical to nanoelectronic system
performance.
Open CNTs are expected to be prevalent in a CNT array, as open CNT occurrence is propor-
tional to the length of the CNT. A CNT with a single open can be largely included in a correct
nanoelectronic design, upon detection and location of the single defect. A CNT with two (or
more) opens is not fully utilizable. The segment between the two (extreme) opens are not
accessible by any nano-addressing circuit, and components attached to that segment are not
congurable. Upon detection and location of the extreme opens, the end segments of an open
CNT can be included in a nano-circuit. Or, we can simply avoid open CNTs.
CNTs which are supposedly-parallel may cross over each other, resulting in different ad-
dresses for a CNT on two sides of a crossbar, and unexpected resistive contacts between CNTs.
If not corrected by etching (34), such crossover CNTs can be taken as multi-thread cables and
included in a correct nano-circuit. It is necessary to solve the following problem for nanoelec-
tronic system conguration on a CNT crossbar nano-architecture.
Problem3. Detect and locate metallic, open and crossover CNTs in an CNT array, which are addressed
on both ends by nano-addressing circuits.
VLS 392
Such metallic, open, and crossover CNTs can be captured in a n n resistance matrix R
CNT
,
where each entry R
CNT
(i, j) gives the resistance of CNT between the i-th CNT end and the j-th
CNT end on the opposite sides of an array of n CNTs (if i ,= j, R
CNT
(i, j) gives the resistance
of a crossover CNT, otherwise, R
CNT
(i, i) gives the i-th CNTs resistance).
Method 3 solves Problem 3 by giving such a n n resistance matrix R
CNT
. With this CNT re-
sistance matrix R
CNT
, we avoid open CNTs, and consider only semiconductive CNTs, metal-
lic CNTs, and crossover CNT bundles (as multi-thread cables) for the rest of the calibration
(Methods 4 and 5 and 2).
Algorithm 3: Metallic, Open, Crossover CNT
Detection and Location
Input: Array of n CNTs with nano-addressing
circuits on both ends (Fig. 1)
Output: Resistance map R
CNT
for metallic, open,
crossover CNTs
1. Congure all CNFETs as shorts
2. For each i
3. For each j
4. Address the i-th CNT on one end of CNT
5. Address the j-th CNT on the other end of
CNT
6. Measure resistance R
CNT
(i, j)
7. If i = j and R
CNT
(i, j)
8. Open CNT (i, j)
9. If i ,= j and R
CNT
(i, j)
10. Crossover CNT (i, j)
11. If R
CNT
(i, j) R
metallic
12. Metallic CNT (i, j)
13. If R
CNT
(i, j) R
semiconductive
14. Semiconductive CNT (i, j)
4.3.2 Opens and Shorts in Programmable Vias
A CNT junction with electrically bistable molecules is a programmable via, which is suppos-
edly recongured as a conductive via or open. A catastrophic defect at such a junction can
be either (1) permanent open, or (2) permanent short. It is necessary to solve the following
problem for nanoelectronic system conguration on a CNT crossbar nano-architecture.
Problem 4. Detect and locate permanently open or short vias in a CNT crossbar nano-architecture.
Method 4 solves Problem 4 by giving two m n resistance maps R
Pmin
and R
Pmax
, where
each entry R
Pmin
(i, j) or R
Pmax
(i, j) gives the resistance of a L-shaped path which includes
the i-th CNT segment on the top(bottom) of the CNT crossbar, the j-th CNT segment on the
left(right) of the CNT crossbar, and a programmable via which is congured as conductive
or open, respectively. Given non-open CNTs, these resistance matrices give a defect map for
permanently open or short vias.
Nanoelectronic Design Based on a CNT Nano-Architecture 393
Algorithm 4: Permanently Open or Short Via
Detection and Location
Input: Two layers of m n CNT crossbar with
nano-addressing interface on four sides
Output: Resistance maps R
Pmin
and R
Pmax
for
permanently open or short vias
1. For each non-open CNT i
2. For each non-open CNT j
3. Address i-th CNT from top(down) of
crossbar
4. Address j-th CNT from left(right) of
crossbar
5. Program via V(i, j) to conductive
6. Measure path resistance R
Pmin
(i, j)
7. Program via V(i, j) to insular
8. Measure path resistance R
Pmax
(i, j)
9. If R
Pmin
(i, j) = R
Pmax
(i, j)
10. Permanently open via V(i, j)
11. If R
Pmin
(i, j) = R
Pmax
(i, j) R
CNT
(i, i)
or R
CNT
(j, j)
12. Permanently short via V(i, j)
4.3.3 Opens and Shorts in CNFETs
A CNT junction with dielectric and redox active molecules is supposedly recongured as a
FET. A catastrophic defect could lead to (1) short between source and drain (e.g., due to chan-
nel punchthrough, no intrinsic channel area, redox active molecules cannot release charge), (2)
short between gate and source or drain (e.g., due to dielectric breakthrough), or (3) constant
open gate (e.g., redox active molecules cannot hold charge). It is necessary to solve the follow-
ing problem for nanoelectronic system conguration on a CNT crossbar nano-architecture.
Problem 5. Detect and locate permanently open or short CNFETs in a CNT crossbar nano-
architecture.
Shorts between CNFET gate and source or drain can be detected in a method which is sim-
ilar to Method 4 but without via programming. Method 5 nds permanent opens or shorts
between the source and the drain of a CNFET by giving a m n resistance matrix R
CNFET
.
Upon detection and location, these catastrophic defects (metallic, open and crossover CNTs,
permanently open or short vias and CNFETs) can be included in a correct nano-circuit. Nano-
circuit physical design needs to be adaptive to the presence of these catastrophic defects, and
will be different from die to die, based on the catastrophic defect maps (R
CNT
, R
Pmin
, R
Pmax
,
and R
CNFET
) for each die.
VLS 394
Algorithm 5: Permanently Open or Short CN-
FET Detection and Location
Input: CNFETs in crossbar with nano-addressing
interface, CNT resistance matrix R
CNT
Output: Resistance map R
CNFET
for perma-
nently open or short CNFETs
1. Congure all CNFETs as shorts
2. For each non-open CNT i
3. For each non-open CNT j
4. Address the i-th CNT on both ends
5. Congure CNFET (i, j) as open
6. Measure resistance R
CNFET
(i, j)
7. If R
CNFET
(i, j) R
CNT
(i, i)
8. Short between CNFET (i, j) source-
drain
9. If R
CNFET
(i, j) R
CNT
(i, i)
10. Open between CNFET (i, j) source-
drain
11. Congure CNFET (i, j) as short
4.4 Parametric Variation and Adaptive Design
Other than catastrophic defects, process variations are also critical to nanoelectronic system
performance and reliability. Compared with catastrophic defects, process variations are more
prevalent, and they are more difcult to detect since their effects are accumulated in affecting
the underlying circuit. Adaptive or resilient nano-circuit design techniques are expected to
achieve functionality and reliability in the presence of such process variations, besides online
calibration and adaptive conguration as follows.
In adaptive conguration, each module of the circuit is congured with its test circuit. The test
circuit can be as simple as additional interconnects which connect the inputs and the outputs
of the module to some of the primary inputs and the primary outputs, respectively. In such
cases, function and performance calibration is performed externally. Alternatively, self-test
can be performed given the complexity of the test circuit. If the current conguration passes
online function and performance verication, the auxiliary test circuit will be removed, and
the current conguration of the module is committed. Otherwise, the same circuit module
needs to be realized using other hardware resources on the recongurable platform.
5. Reliable, High Performance and Low Power Nano-Circuits
5.1 Nano-Circuit Design Challenges and Promising Techniques
As we have seen, the CNT crossbar nano-architecture provides regularity and manufactura-
bility for high logic density implementations of all CMOS combinational logic families, in-
cluding static logic, domino logic, pass-transistor logic, as well as latches, ip-ops, memory
input address decoder and output sensing circuits such as differential sense ampliers (Fig. 6
and Fig. 7).
However, nano-circuits in a CNT crossbar nano-architecture face a number of unique chal-
lenges. Nano-circuits must achieve reliability in the presence of prevalent defects and signif-
Nanoelectronic Design Based on a CNT Nano-Architecture 395
icant parametric variations, must achieve performance with highly resistive CNT intercon-
nects, etc. We discuss nano-circuit design in a CNT crossbar nano-architecture in this section.
Nanoscale computing systems are expected to be subject to prevalent defects and signicant
process and environmental variations inevitably as a result of the uncertainty principle of
quantum physics. E.g., the conductance of a CNT or a CNFET is very sensitive to chirality,
diameter, etc. (38). Besides adaptive conguration, nanoscale computing systems need new
computing models or circuit paradigms for reliability enhancement, performance improve-
ment and power consumption reduction.
As technology scales, nanoelectronic computing systems are expected to be based on single
electron devices (the average number of electrons in a transistor channel is approaching one
for the current technologies). In quantum mechanics, the occurrence probability of an electron
is the wavefunction given by the Schr odinger equation. How to extract a deterministic com-
putation result from stochastic events such as electron occurrences is one of the fundamental
problems that we are facing in designing nanometer scale computing systems. Traditional
computation based on large devices can be modeled as redundancy and threshold based logic
(which includes majority logic). In redundancy and threshold based logic, the error rate is
given by a binomial distribution (the probability of observing m events in an environment of
expecting an average of n independent events). As a result, a minimum signal-to-noise ratio
is required with performance and power consumption implications. Finding a more efcient
reliable computing model is of essential interest in nanoelectronic design.
Besides stochastic signal occurrence, signal propagation delay variability is another category
of uncertainty in stochastic nanoscale systems. Nanoelectronic design needs to be adaptive to
or resilient in the presence of signal propagation delay variations. Existing techniques (e.g.,
the Razor technology (2; 18) wherein a shadowip-op captures a delayed data signal for tim-
ing verication and correction) achieves only limited performance adaptivity, e.g., the circuity
is adaptive to performance variations only within a given range. Asynchronous circuits have
unlimited adaptivity to performance variations, and are ideal for high performance (enabling
performance scaling in the presence of signicant performance variations) and lowpower (be-
ing event-driven and clockless) nanoelectronic design (e.g., multi-core chips are expected to be
increasingly self-timed, global-asynchronous-locally-synchronous, or totally asynchronous).
However, existing asynchronous design techniques suffer in reliability in the presence of soft
errors (e.g., glitches, coupling noises, radiation or cosmos ray strike induced random noises),
which has limited their applications for decades.
Anumber of robust design techniques at multiple levels help to enhance reliability and reduce
error rate of a nanoelectronic computing system. At the circuit level, differential signaling and
complementary logic reduces parametric variation effects by exploiting spatial and temporal
correlations (e.g., by correlating m and n for reduced error rate in a binomial distribution) (15;
31; 57). At a higher level, we believe that Error Detection/Correction Code (35) is the key to
the stochastic signal occurrence problem in nanoscale systems (e.g., for lower required signal-
to-noise ratios, which lead to high performance and lowpower), and needs to be applied more
extensively at a variety of design hierarchy levels. Error correction coding has been applied
widely in todays memories and wireless communication systems. Proposals also exist for
applying (AN or residue) error detection/correction coding in arithmetic circuits (7).
VLS 396
I
I
valid
C
C
valid(i)
valid(i)
rs.I
rs.t
d.I
d.t
req
acN.I
acN.t
acN.I
acN.t
valid
I
I
I
inputs
I
inputs
FF
valid(d)
valid(d)
d.I
d.t
cN.I cN.t
rs.I rs.t
C
1
C
acN.t
n
1
acN.I
n
Fig. 11. A static logic based RDA (robust differential asynchronous) circuit.
5.2 Robust Asynchronous Circuits
A promising CNT nano-circuit paradigm is robust asynchronous circuits, which are formed
by applying Error Detection/Correction Code to asynchronous circuit design for enhanced
reliability in the presence of soft errors, while achieving delay insensitive nano-circuits (28).
For a 1-bit data signal, a robust differential asynchronous (RDA) communication channel in-
cludes three rails, two differential data signals d.t and d. f , and a request signal req. The data
and request signals are valid only if (d.t, d. f , req) = (1, 0, 1) or (0, 1, 1), while (d.t, d. f , req) =
(0, 0, 0) when the data and request signals return to zero. A validity check circuit detects the
arrival of valid data and triggers the receiver ip-op. The Hamming distance between any
two legal codes for the data and request signals is at least two, instead of one in the dual-rail
asynchronous signaling (d.t, d. f ) = (0, 1), (1, 0) or (0, 0). Differential acknowledgment signals
ack.t and ack. f are also included.
2
Fig. 11 gives a static logic based RDA circuit.
3
Two Muller C elements take the input validity
signals valid(i) and valid(i) (as well as the rs.t and rs. f signals coming from the downstream
acknowledgment signals) and generate two differential validity signals valid and valid, which
trigger the combinational logic computation. The Muller C elements have the maximum sig-
nal propagation delay for a rising(falling) output for all combinational logic circuits with the
same number of inputs, due to the presence of the longest path of transistors to the power
supply or the ground. Also the valid and valid signals derive from and arrive later than the
data signals. As a result, the valid and valid signals always arrive later than the differential
combinational logic outputs f and

f .
The later arriving validity signals lter out the internal glitches which come from combina-
tional logic computation before the differential outputs f and

f settle to their nal values.
Two NMOS transistors clamp the differential outputs f and the

f to the ground until the valid
signal rises and the valid signal falls for noise immunity. The differential acknowledgment
2
In general, multiple-bit data can be encoded in a variety of error detection codes (35). A single parity
bit for n-bit data provides a Hamming distance of two which is immune to any single bit error, while
an error detection code of a Hamming distance larger than k is immune to any k-bit error.
3
Alternative implementations (in dynamic circuits, e.g., dual-rail domino, DCVSL, etc.) achieve dif-
ferent cost and reliability tradeoffs, and are potentially preferrable depending on the manufacturing
technology and the environment, e.g., parametric variabilities, soft error rates, etc.
Nanoelectronic Design Based on a CNT Nano-Architecture 397
signals ack.t and ack. f derive from and always appear later than the differential data signals
f and

f .
At the senders end of the interconnect, for sequential elements, ip-ops are preferred over
latches for reliability. A ip-op is only vulnerable to noise when capturing the signal, while a
latch is vulnerable to noise whenever it is transparent. The ip-ops send out differential data
signals d.t and d. f (which come from the differential combinational logic outputs f and

f ) as
well as a request req signal (which comes from the acknowledgment signal ack.t). At the re-
ceiver end of the interconnect, a group of XOR and AND(NAND) gates verify the differential
data and request signals, and generate two differential validity signals valid(d) and valid(d).
Any single bit soft error or common multiple bit soft errors injected to the interconnects or at
the validity signals will halt the circuit.
Each sender ip-op is triggered by two differential acknowledgment signals ack.t and ack. f
as the differential clock signals ck.t = 1 and ck. f = 0, and is reset by two differential reset
signals when rs.t = 1 and rs. f = 0. The differential reset signals come from the downstream
differential acknowledgment signals ack.t
i
and ack. f
i
via the Muller C elements. They also
generate the valid and valid signals which trigger the combinational logic computation.
In the presence of multiple fanouts, multiple sets of differential acknowledgment signals will
be sent back to the upstream stage. With the Muller C elements holding the input validity sig-
nals, the early arriving acknowledgment signals hold until the latest acknowledgment signal
arrive from the fanouts. At that time the Muller C elements close the inputs to the combina-
tional logic block and reset the ip-op at the upstream stage, which brings all differential
data and request signals d.t, d. f and req as well as the acknowledgment signals ack.t and ack. f
back to the ground, completing an asynchronous communication cycle.
5.3 Logic and Timing Correctness of RDA Circuits
An RDA circuit achieves logic and timing correctness in the presence of a single bit soft error
or common multiple bit soft errors given the physical proximity of the circuit components.
Denition 3 (Single Bit Soft Error). A single bit soft error is a glitch or toggling caused by a single
event upset as a result of an alpha particle or neutron strike from radioactive material or cosmos rays.
Denition 4 (Common Multiple Bit Soft Error). A common multiple bit soft error is glitches or
togglings of the same magnitude and polarity caused by common noises such as capacitive or inductive
interconnect coupling, or spatially correlated transient parametric (e.g., supply voltage, temperature)
variations (19; 39; 60), which have near identical effects on components at close physical proximity.
Theorem 1 (Logic Correctness). An robust differential asynchronous circuit achieves logic correct-
ness at the event of a single bit soft error or common multiple bit soft errors.
Proof. An RDA circuit achieves logic correctness in the following cases.
1. A single bit soft error or a common multiple bit soft error at the input data signals leads
to invalid data. The valid(i)(valid(i)) signal will not rise(fall).
2. A single bit soft error or a common multiple bit soft error at the valid(i) and valid(i)
signals, at the Muller C elements computing the valid and valid signals, or at the valid
and valid signals, leads to an early false or a late valid(valid) signal. In this case, the
differential structure for the valid/valid signals prevents any logic error. A false valid
signal turns off the keeper circuit, but can rise neither f nor

f , because the valid signal
is still high. A false valid signal rises either f or

f , while the keeper circuit still clamps
VLS 398
both f and

f to the ground. Only when both the valid and valid signals arrive, the
differential combinational logic computation is enabled.
3. A single bit soft error or a common multiple bit soft error at the differential combina-
tional logic block or the differential data signals f and

f will not raise the ack.t signal
nor lower the ack. f signal.
4
4. A single bit soft error or a common multiple bit soft error at the differential acknowl-
edgment signals ack.t and ack. f does not trigger the ip-op.
5. A single bit soft error or a common multiple bit soft error at the differential reset signals
RS and RS does not reset the ip-op.
6. A single bit soft error or a common multiple bit soft error at the differential data signals
d.t and d. f and the request req signal leads to invalid data and does not generate a
validity signal.
In summary, in order to make an RDA circuit to fail, the glitches must follow certain specic
patterns, e.g., to reverse a 01 to a 10, which is highly unlikely to take place.
.
Theorem 2 (Timing Correctness). An robust differential asynchronous circuit achieves timing cor-
rectness for any delay variation given the physical proximity of the circuit components.
Proof. Prevalent parametric (process, temperature, supply voltage) variations in nanoelec-
tronic circuits lead to signicant delay variations for the components in the circuit. Because
such delay variations are spatially correlated (19; 39; 60), given the physical proximity of the
circuit components, their delay variations are tightly correlated. Consequently, an RDAcircuit
achieves timing correctness in the following cases.
1. The input data signals d.t and d. f always arrive earlier than the differential validity
signals valid(d) and valid(d), which derive from the differential data and the request
signals.
2. The valid(valid) signal derives from the input validity signals valid(i) and valid(i)
through the Muller C elements.
The rising(falling) delay of an Muller C element is the maximum of any combinational
logic block of the same number of inputs with the longest serial transistor path to the
power supply and the ground. Given the tight correlation of parametric variations for
the transistors and the interconnects in the circuit, the valid and valid signals arrive no
early than the nal combinational logic computation results f and

f . The valid and valid
signals enable the differential outputs f and

f via the tri-state output structure and the
keeper circuit, thus ltering out the glitches in combinational logic computation.
3. The differential acknowledgment signals ack.t and ack. f derive fromand arrive no early
than the differential data signals f and

f . They are further delayed (e.g., via buffers)
such that at the rising edge of the ip-op clock signal, the input data signals have
settled to their nal values, and the ip-op captures the correct data f and

f . Conse-
quently, no setup time constraint is required. The differential data signals f and

f hold
4
A glitch does not appear before valid data given the clamp NMOS transistors and fast data rise time
in a static RDA circuit. A dynamic RDA circuit needs to enhance robustness for a soft error strike at
a combinational logic output, e.g., by including weak keeper PMOS transistors, or delaying precharge
PMOS transistors.
Nanoelectronic Design Based on a CNT Nano-Architecture 399
until the differential acknowledgment signals ack.t and ack. f reach the upstream stage,
reset the upstream ip-op, and lower the valid(i) and valid signals, which take much
longer time than the hold time of the ip-op. Consequently, no hold time constraint is
required.
4. The differential acknowledgment signals ack.t and ack. f arrive after the combinational
logic computation completes and the differential data signals f and

f settle to their nal
values.
5. After all downstream stages send back acknowledgment signals, the ip-op is reset,
bringing the differential data d.t and d. f and the request req signals to the ground. The
downstream stage acknowledgment signals are also brought back to the ground as a
result. This completes a four-phase asynchronous communication cycle.
As a result, the proposed robust differential asynchronous circuit is delay insensitive, i.e.,
achieves correct timing (signal arrival time sequence) in the presence of delay variations,
which is critical for nanoelectronic circuits. .
6. Experiments
6.1 Voltage-Controlled Nano-Addressing
In this section, we rst verify the effectiveness of the proposed voltage-controlled nano-
addressing circuit (Fig. 8) by running SPICE simulation based on the Stanford CNFET com-
pact model (52).
In the proposed voltage-controlled nano-addressing circuit, each nanotube is gated by two N-
type MOSFET-like CNFETs. These CNFETs are of 6.4nm gate width and 32nm channel length,
as are described in the Stanford CNFET compact model. The two CNFETs in each nanotube
are given a voltage drop of V
dd
= 1V. The external address voltages are V
dda1
= V
dda2
= 1V,
V
ssa1
= V
ssa2
= 0. As a result, the CNFETs have complementary gate voltages V
g1
+V
g2
= 1V.
Fig. 12 gives the nanotube currents in the array with different gate voltage at the rst address
line. The nanotubes carry a signicant current only with specic gate voltages, e.g., reaching
I
out
= 5.064mA at gate voltage V
g1
= 0.495V.
With 0 and 1V external voltages, Fig. 12 gives the currents for all the nanotubes in the array.
With larger external voltages, Fig. 12 is extended to give the nanotube currents: any nanotube
with a V
ga1
> 1V or V
ga2
< 0V gate voltage at the rst address line carries zero current.
Addressing resolution is given by the difference of addressing voltages between two adjacent
nanotubes (since MOSFETs and MOSFET-like CNFETs are limited to a < 60mV/decade in-
verse subthreshold slope). Adjusting the external address voltages minimizes any addressing
inaccuracy due to manufacturing process and system runtime parametric variations.
6.2 Comparison of CNT Crossbar based and the Existing Nano-Architecture
Let us now compare nano-circuits implemented in the proposed CNT crossbar based nano-
architecture and the existing nano-architectures. Considering DNA-guide self-assembly based
nanoelectronic architectures such as NANA (42) and SOSA (43) target the far future, and FPNI
(50) is very similar to CMOS technology by employing CMOS transistors and nanowires, I will
compare RDG-CNFET based logic implementation with molecular diode and MOS transistor
based logic implementation which is the mainstreamnanoelectronic architecture in literature.
5
5
Comparing CNFET and CMOS-FET circuits gives approximately 5 performance improvement (16).
VLS 400
0
1
2
3
4
5
6
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

o
u
t

(
m
A
)
Vg1 (V)
"out"
Fig. 12. Nanotube current I
out
in mA for CNFET gate voltage V
g1
in the rst address line.
Vdd a
en output
b c
Fig. 13. A molecular diode/MOSFET based Boolean logic a(b + c) implementation.
As an example of a combinational logic block, a Boolean logic function a(b +c) is implemented
based on RDG-CNFETs (Fig. 6) and by molecular diodes and peripheral CMOS transistors
(Fig. 13). In the following experiments, SPICE simulation is conducted based on the latest
Stanford compact CNFET model (52), a molecular device model from a latest publication (17),
and the latest Predictive CMOS Technology Model (45).
The RDG-CNFETs are constructed based on an enhancement mode CNFET of 6.4nm gate
width and 32nm channel length, as is described in the Stanford compact model (52). The
bistable molecules at the front gate provide a resistance difference between G and about
80 (6).
6
The redox active molecules at the back gate are cobalt phthalocyanine (CoPc) (17),
which have been the basis of a NW-FET device with 1000 conductance difference (17).
The molecular diodes are based on V-shaped amphiphilic [2]rotaxane 5
4+
molecules (44), with
saturation current I
s
= 36pA, emission coefcient N = 14.66, and an on/off current ratio of
194.9. The CMOS transistors are modeled by 22nm Predictive Technology Models (45). To
balance the current difference between molecular diodes and PMOS transistors, the PMOS
transistors have a channel width/length ratio W/L = 1/10, while each molecular diode con-
sists of 10, 000 V-shaped amphiphilic [2]rotaxane 5
4+
molecules. As a result, the circuit has a
current on the order of nA.
6
The amorphous silicon based anti-fuse technology works with silicon based nanowires (6). Similar
technologies are expected and assumed here for carbon nanotubes.
Nanoelectronic Design Based on a CNT Nano-Architecture 401
Mo. Diode RDG-CNFET
abc V
out
P
static
V
out
P
static
(V) (W) (V) (W)
111 0.999 1.49n 1.000 0.25n
110 0.807 0.83 1.000 0.33n
101 0.807 0.83 1.000 0.32n
011 0.497 1.45 0.000 15.34p
000 0.265 1.47 0.000 41.32p
Table 1. Output voltage and static power consumption with different inputs of RDG-CNFET
and molecular diode based Boolean logic a(b + c) implementations.
Comparing the CNFET based and the molecular diode/CMOS based logic implementations,
we have the following observations.
1. Area: The CNFET based logic implementation takes an area of 2 6 = 12 CNFETs
and 2 3 = 6 vias, while molecular diodes and MOSFET based implementation takes
an area of 2 4 = 8 molecular diodes and 2 MOSFETs (and two more MOSFETs if
an inverter is included at each output to restore signal voltage swing). Considering
CNFET based implementation is in a complementary logic, and the MOS transistors
do not scale well, CNFET based implementation is expected to achieve superior logic
density at a nanometer technology node.
2. Signal reliability: The CNFET based logic implementation achieves full voltage swing
at the outputs, while in the diode logic circuit, the output swing depends on the inputs,
and varies between 0.503V to 0.735V in the experiment (Table 1). Additional CMOS
circuitry (e.g., an inverter) can be included at each output to restore full voltage swing,
however, the reduced signal voltage swing in the diode logic circuit still implies com-
promised signal reliability.
3. Static power: The CNFET based logic implementation in CMOS logic achieves orders of
magnitudes of less power consumption compared with molecular diodes and MOSFET
based implementation for most input vectors (Table 1).
4. Performance: The CNFET based logic implementation achieves orders of magnitude
of timing performance improvement compared with molecular diodes and MOSFET
based implementation (Table 2).
In summary, CNFET based logic implementation achieves superior logic density, reliabil-
ity, performance, and power consumption compared with molecular diodes and CMOS-FET
based Boolean logic implementation.
6.3 Verication of RDA Circuits
In this section, we verify logic and timing correctness of robust differential asynchronous cir-
cuits by running HSPICE simulation based on 22nm Predictive Technology Models (45).
Fig. 14 gives signal waveforms for a perfect RDA circuit implementing Boolean function f =
ab in CMOS static logic. Input signal a falls from logic 1 at 100ps to logic 0 at 200ps, input
signal b rises from logic 0 at 0ps to logic 1 at 50ps. The validity signal valid(valid) rises(falls)
from logic 0(1) at 100ps to logic 1(0) at 200ps. As a result, we observe that the late arrival of
the valid(valid) signals enable the combinational logic computation for f and

f , and lter out
VLS 402
Mo. Diode RDG-CNFET
C
L
D
r
D
f
D
r
D
f
( f F) (ns) (ns) (ns) (ns)
1 0.37 0.63 0.01 0.01
10 3.31 6.61 0.08 0.08
100 32.77 62.29 0.77 0.78
Table 2. Rising/falling signal propagation delays D
r
/D
f
(ns) (from a to output) for various
load capacitance C
L
( f F) of RDG-CNFET and molecular diode based Boolean logic a(b + c)
implementations. Input signal transition time varying from 1ps to 100ps leads to no consider-
able delay difference.
100 150 200 250 300
~500
0
500
1000
1500
Time (ps)
V
o
l
t
a
g
e

(
m
V
)
valid
f
ack.t
d.f
valid(d)
Fig. 14. Signal waveforms in a robust differential asynchronous circuit with no single bit soft
error.
the internal glitches that would be observed at the f and

f signals if the validity signals arrive
early.
Fig. 15 gives signal waveforms for the same RDA circuit with the same input signals a and
b, while the validity signal valid is delayed by 50ps compared to the complementary validity
signal valid, representing an early false validity signal (valid) or a late arriving validity signal
(valid) due to an injected negative glitch at either the valid or the valid signal. We observe
that the differential data signals f and

f are clamped to the ground until both validity signals
settle to their nal values valid = 1 and valid = 0. As a result, no logic malfunction is present
while all signals are delayed by 50ps.
Fig. 16 gives signal waveforms for the RDA circuit with a triangle current (of 0.1mA peak
current starting at 160ps ending at 180ps) injected to the f signal. Comparing with Fig. 14, we
observe that such a negative glitch does not lead to any logic error, instead, the arrivals of the
ack.t and ack. f signals are postponed for about 40ps, as well as all the downstream signals d.t,
d. f , and valid(d).
Fig. 17 gives signal waveforms for the RDA circuit with a triangle current (of 0.1mA peak
current starting at 120ps ending at 140ps) injected to the ack.t signal. The glitch at the ack.t
signal does not trigger the ip-op, and the subsequent signals d.t, d. f , valid(d) and valid(d)
are not affected.
Fig. 18 gives signal waveforms for the RDA circuit with two identical triangle currents (of
0.1mA peak current starting at 120ps ending at 140ps) injected to both differential acknowl-
Nanoelectronic Design Based on a CNT Nano-Architecture 403
100 150 200 250 300
~500
0
500
1000
1500
Time (ps)
V
o
l
t
a
g
e

(
m
V
)
valid
f
ack.t
d.f
valid(d)
Fig. 15. Signal waveforms in a robust differential asynchronous circuit with an early false
valid or a late arriving valid signal.
100 150 200 250 300
~1000
~500
0
500
1000
1500
Time (ps)
V
o
l
t
a
g
e

(
m
V
)
valid
f
ack.t
d.f
valid(d)
Fig. 16. Signal waveforms in a robust differential asynchronous circuit with a negative glitch
injected at the f signal.
edgment signals ack.t and ack. f . We observe that the double glitches do not trigger the ip-op
either, the subsequent signals d.t, d. f , valid(d) and valid(d) are not affected.
From these experiments, we observe that the RDA circuit achieves correct logic and correct
timing (signal arrival time sequence) at the event of a single bit soft error or common multiple
bit soft errors, by temporarily halting the circuit operation until the valid data re-appear.
7. Conclusion
In this chapter, we have studied the rst purely CNT and CNFET based nano-architecture,
which is based on (1) a novel recongurable double gate carbon nanotube eld effect transistor
(RDG-CNFET) device, (2) a multi-layer CNT crossbar structure with sandwiched via-forming
and gate-forming molecules, and (3) a novel voltage-controlled nano-addressing circuit not
requiring precise layout design, enabling manufacture of nanoelectronic systems in all existing
CMOS circuit design styles.
A complete methodology of adaptive conguration of nanoelectronic systems based on the
CNT crossbar nano-architecture is also presented, including (1) an adaptive nano-addressing
method for the voltage-controlled nano-addressing circuit, (2) an adaptive RDG-CNFET gate
matching method, and (3) a set of catastrophic defect mapping methods, which are specic (to
VLS 404
100 150 200 250 300
~500
0
500
1000
1500
2000
Time (ps)
V
o
l
t
a
g
e

(
m
V
)
valid
f
ack.t
ack.f
d.f
valid(d)
Fig. 17. Signal waveforms in a robust differential asynchronous circuit with a positive glitch
injected at the ack.t signal.
100 150 200 250 300
~500
0
500
1000
1500
2000
Time (ps)
V
o
l
t
a
g
e

(
m
V
)
valid
f
ack.t
ack.f
d.f
valid(d)
Fig. 18. Signal waveforms in a robust differential asynchronous circuit with positive glitches
injected at both differential acknowledgment signals ack.t and ack. f .
the CNT crossbar nano-architecture), complete (in detecting and locating all possible catas-
trophic defects in the CNT crossbar nano-architecture), deterministic (with no probabilistic
computation), and efcient (test paths are rows or columns of CNT, or L-shaped CNT paths,
CNT open/short defect detection is separated with via/CNFET open/short defect detection,
runtime is linear to the number of defect sites). This is signicant improvement compared
with the previous techniques (which either detects only a single defect (10), or is generic, ab-
stract, probabilistic, and highly complex (8; 33; 58)).
We have also examined some of the design challenges and promising techniques for CNT
and CNFET based nano-circuits. We identify signicant parametric variation effects on logic
correctness and timing correctness, and propose robust (differential) asynchronous circuits by
applying Error Detection Code to asynchronous circuit design for noise immune and delay
insensitive nano-circuits.
SPICE simulation based on compact CNFET and molecular device models demonstrates su-
perior logic density, reliability, performance and power consumption of the proposed RDG-
CNFET based nanoelectronic architecture compared with previously published nanoelec-
tronic architectures, e.g., of a hybrid nano-CMOS technology including molecular diodes and
MOSFETs. Furthermore, theoretical analysis and SPICE simulation based on 22nm Predictive
Technology Models showthat RDAcircuits achieve much enhanced reliability in logic correct-
Nanoelectronic Design Based on a CNT Nano-Architecture 405
ness in the presence of a single bit soft error or common multiple bit soft errors, and timing
correctness in the presence of parametric variations given the physical proximity of the circuit
components.
While nanotechnology development has not enabled fabrication of such a system, this chap-
ter has demonstrated the prospected manufacturability, reliability, and performance of a
purely carbon nanotube and carbon nanotube transistor based nanoelectronic system. These
nanoelectronic system design techniques are expected to be further developed along with
nanoscale device fabrication and integration techniques which are critical to achieve and to
improve the proposed nanoelectronic architecture in several aspects, including: (1) search of
electrically bistable molecules of repeated recongurability and low contact resistance with
carbon nanotubes, (2) development of etching processes for electrically bistable and redox ac-
tive molecules with carbon nanotubes as masks, and (3) manufacture of nanoscale devices of
superior subthreshold slope for enhanced nano-addressing resolution, and ultra high perfor-
mance low power nanoelectronic systems.
8. References
[1] I. Amlani, N. Pimparkar, K. Nordquist, D. Lim, S. Clavijo, Z. Qian and R. Emrick, Au-
tomated Removal of Metallic Carbon Nanotubes in a Nanotube Ensemble by Electrical
Breakdown, Proc. IEEE Conference on Nanotechnology, 2008, pp. 239-242.
[2] T. Austin, V. Bertacco, D. Blaauw and T. Mudge, Opportunities and Challenges for Bet-
ter Than Worst-Case Design, Asian South Pacic Design Automation Conference, 2005.
[3] A. Bachtold, P. Hadley, T. Nakanishi and C. Dekker, Logic Circuits with Carbon Nan-
otube Transistors, Science, 2001, 294(5545), pp. 1317-1320.
[4] P. Beckett, A Fine-Grained Recongurable Logic Array Based on Double Gate Transis-
tors, International Conference on Field-Programmable Technology, 2002, pp. 260-267.
[5] L. Benini and G. De Micheli, Networks on chips: A new paradigm for component-based
MPSoC design, Proc. MPSoC., 2004.
[6] J. Birkner, A. Chan, H. T. Chua, A. Chao, K. Gordon, B. Kleinman, P. Kolze and R. Wong,
A Very High-Speed Field Programmable Gate Array Using Metal-To-Metal Anti-Fuse
Programmable Elements, New Hardware Product Introduction at Custom Integrated Circuits
Conference, 1991.
[7] T. J. Brosnan and N. R. Strader II, Modular Error Detection for Bit-Serial Multiplication,
IEEE Trans. on Computers, 37(9), 1988, pp. 1043-1052.
[8] J. G. Brown and R. D. Blanton, CAEN-BIST: Testing the NanoFabric, Proc. International
Test Conference, 2004, pp. 462-471.
[9] Z. Chen, J. Appenzeller, Y.-M. Lin, J. Sippel-Oakley, A. G. Rinzler, J. Tang, S. J. Wind, P.
M. Solomon and P. Avouris, An Integrated Logic Circuit Assembled on a Single Carbon
Nanotube, Science, 2006, 311(5768), pp. 1735.
[10] B. Culbertson, R. Amerson, R. Carter, P. Kuekes and G. Snider, Defect Tolerance on the
Teramac Custom Computer, Proc. Symposium on FPGAs for Custom Computing Machines
(FCCM), 2000, pp. 185-192.
[11] A.DeHon, Array-Based Architecture for FET-Based, Nanoscale Electronics, IEEE Trans.
on Nanotechnology, 2(1), pp. 23-32, 2003.
[12] A. DeHon, P. Lincoln and J. E. Savage, Stochastic Assembly of Sublithographic
Nanoscale Interface, IEEE Trans. Nanotechnology, 2(3), pp. 165-174, 2003.
[13] A. DeHon and M. J. Wilson, Nanowire-Based Sublithographic Programmable Logic Ar-
rays, Proc. FPGA, 2004, pp. 123-132.
VLS 406
[14] C. Dwyer, L. Vicci, J. Poulton, D. Erie, R. Superne, S. Washburn and R. M. Taylor, The
design of DNA self-assembled computing circuitry, IEEE Trans. on Very Large Scale Inte-
gration (VLSI) Systems, 12(11), pp. 1214-1220, Nov. 2004.
[15] D. J. Deleganes, M. Barany, G. Geannopoulos, K. Kreitzer, M. Morrise, D. Milliron, A.
P. Singh and S. Wijeratne, Low-Voltage Swing Logic Circuits for a Pentium 4 Processor
Integer Core, IEEE J. of Solid-State Circuits, 40(1), pp. 36-43, 2005.
[16] J. Deng, and H.-S. P. Wong, A Compact SPICE Model for Carbon Nanotube Field Effect
Transistors Including Non-Idealities and Its Application

A

T Part II: Full Device Model


and Circuits Performance Benchmarking, IEEE Trans. Electron Devices, 2007.
[17] X. Duan, Y. Huang and C. M. Lieber, Nonvolatile Memory and Programmable Logic
from Molecule-Gated Nanowires, Nano Letters, 2(5), pp. 487-490, 2002.
[18] D. Ernst, N. S. Kim, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge and K. Flautner, Ra-
zor: Circuit-Level Correction of Timing Errors for Low-Power Operation, IEEE MICRO
special issue on Top Picks From Microarchitecture Conferences of 2004, 2005.
[19] P. Friedberg, Y. Cao, J. Cain, R. Wang, J. Rabaey, and C. Spanos, Modeling Within-Die
Spatial Correlation Effects for Process-Design Co-Optimization, Proc. International Sym-
posium on Quality Electronic Design, pp. 516-521, 2005.
[20] S. C. Goldstein and M. Budiu, NanoFabrics: Spatial Computing Using Molecular Elec-
tronics, Proc. International Symposium on Computer Architecture, 2001, pp. 178-191.
[21] B. Gojman, E. Rachlin, and J. E. Savage, Evaluation of Design Strategies for Stochasti-
cally Assembled Nanoarray Memories, Journal of Emerging Technologies, 1(2), 2005, pp.
73-108.
[22] J. R. Heath and M. A. Ratner, Molecular Electronics, Physics Today, 56(5), pp. 43-49,
2003.
[23] S. J. Kang, C. Kocabas, T. Ozel, M. Shim, N. Pimparkar, M. A. Alam, S. V. Rotkin and J. A.
Rogers, High-Performance Electronics Using Dense, Perfectly Aligned Arrays of Single-
Walled Carbon Nanotubes, Nature Nanotechnology, Vol. 2, pp. 230-236, April 2007.
[24] B. S. Landman and R. L. Russo, On a Pin Versus Block Relationship for Partitions of
Logic Graphs, IEEE Trans. on Computers, C-20, pp. 1469-1479, 1971.
[25] J. Liu, I. OConnor, D. Navarro and F. Gafot, Design of a Novel CNTFET-Based Recon-
gurable Logic Gate, Proc. ISVLSI, 2007, pp. 285-290.
[26] B. Liu, Recongurable Double Gate Carbon Nanotube Field Effect Transistor Based Na-
noelectronic Architecture, Proc. Asia and South Pacic Design Automation Conference, 2009.
[27] B. Liu, Adaptive Voltage Controlled Nanoelectronic Addressing for Yield, Accuracy and
Resolution, Proc. International Symposium on Quality Electronic Design, 2009.
[28] B. Liu, Robust Differential Asynchronous Nanoelectronic Circuits, Proc. International
Symposium on Quality Electronic Design, 2009.
[29] B. Liu, Defect Mapping and Adaptive Conguration of Nanoelectronic Circuits Based
on a CNT Crossbar Nano-Architecture, Workshop on Nano, Molecular, and Quantum Com-
munications (NanoCom), 2009.
[30] International Technology Roadmap for Semiconductors, http://www.itrs.net/.
[31] A. Maheshwari and W. Burleson, Differential Current-Sensing for On-Chip Intercon-
nects, IEEE Trans. on VLSI Systems, 12(12), 2004, pp. 1321-1329.
[32] P. L. McEuen, M. S. Fuhrer and P. Hongkun, Single-walled Carbon Nanotube Electron-
ics, IEEE Trans. Nanotechnology, 1(1), pp. 78-85, 2002.
[33] M. Mishra and S. C. Goldstein, Defect Tolerance at the End of the Roadmap, Proc.
International Test Conference, 2003, pp. 1201-1211.
Nanoelectronic Design Based on a CNT Nano-Architecture 407
[34] S. Mitra, N. Patil and J. Zhang, Imperfection-Immune Carbon Nanotube VLSI Logic
Circuits, Foundations of NANO, 2008.
[35] T. K. Moon, Error Correction Coding: Mathematical Methods and Algorithms, Wiley-
Interscience, 2005.
[36] H. Naeimi and A. DeHon, A Greedy Algorithm for Tolerating Defective Crosspoints in
NanoPLA Design, Proc. Intl. Conf. on Field-Programmable Technology, 2004, pp. 49-56.
[37] P. Nguyen, H. T. Ng, T. Yamada, M. K. Smith, J. Li, J. Han and M. Meyyappan, Direct
Integration of Metal Oxide Nanowire in Vertical Field-Effect Transistor, Nano Letters,
2004, 4(4), pp. 651-657.
[38] A. Nieuwoudt and Y. Massoud, Assessing the Implications of Process Variations on Fu-
ture Carbon Nanotube Bundle Interconnect Solutions, Proc. Intl. Symp. on Quality Elec-
tronic Design, 2007, pp. 119-126.
[39] M. Orshansky, L. Milor, P. Chen, K. Keutzer, C. Hu, Impact of spatial intrachip gate
length variability on the performance of high-speed digital circuits, IEEE Trans. on
Computer-Aided Design of Integrated Circuits and Systems, 2002, pp. 544-553.
[40] S. S. P. Parkin, Spintronics Materials and Devices: Past, Present and Future, IEEE Inter-
national Electron Devices Meeting (IEDM) Technical Digest, pp. 903-906, 2004.
[41] N. Patil, J. Deng, A. Lin, H.-S. Philip Wong and S. Mitra, Design Methods for Mis-
aligned and Mispositioned Carbon-Nanotube Immune Circuits, IEEE Tran. on CAD,
2008, 27(10), pp. 1725-1736.
[42] J. P. Patwardhan, C. Dwyer, A. R. Lebeck and D. J. Sorin, NANA: A Nano-Scale Active
Network Architecture, ACM Journal on Emerging Technologies in Computing Systems, 2(1),
pp. 1-30, 2006.
[43] J. P. Patwardhan, V. Johri, C. Dwyer and A. R. Lebeck, ADefect Tolerant Self-Organizing
Nanoscale SIMD Architecture, International Conference on Architecture Support for Pro-
gramming Languages and Operating Systems, 2006, pp. 241-251.
[44] A. R. Pease, J. O. Jeppesen, J. F. Stoddart, Y. Luo, C. P. Collier and J. R. Heath, Switching
Devices Based on Interlocked Molecules, Acc. Chem. Res., 34, pp. 433-444, 2001.
[45] Predictive Technology Model, http://www.eas.asu.edu/ptm/.
[46] A. Raychowdhury and K. Roy, Carbon Nanotube Electronics: Design of High Perfor-
mance and Low Power Digital Circuits, IEEE Trans. on Circuits and Systems - I: Funda-
mental Theory and Applications, 54(11), pp. 2391-1401, 2007.
[47] G. S. Rose, A. C. Cabe, N. Gergel-Hackett, N. Majumdar, M. R. Stan, J. C. Bean, L. R. Har-
riott, Y. Yao and J. M. Tour, Design Approaches for Hybrid CMOS/Molecular Memory
Based on Experimental Device Data, Proc. Great Lakes Symposium on VLSI, 2006, pp. 2-7.
[48] J. E. Savage, E. Rachlin, A. DeHon, C. M. Lieber and Y. Wu, Radial Addressing of
Nanowires, ACM Journal of Emerging Technologies in Computing Systems, 2(2), pp. 129-
154. 2006.
[49] M. S. Schmidt, T. Nielsen, D. N. Madsen, A. Kristensen and P. Bggild, Nano-Scale
Silicon structures by Using Carbon Nanotubes as Reactive Ion Masks, Nanotechnology,
16, pp. 750-753, 2005.
[50] G. S. Snider and R. S. Williams, Nano/CMOS Architectures Using a Field-
Programmable Nanowire Interconnect, Nanotechnology, 18(3), 2007.
[51] M. R. Stan, P. D. Franzon, S. C. Goldstein, J. C. Lach and M. M. Ziegler, Molecular
Electronics: From Devices and Interconnect to Circuits and Architecture, Proc. of the
IEEE, 91(11), pp. 1940-1957, 2003.
[52] Stanford CNFET Model, http://nano.stanford.edu/models.php.
VLS 408
[53] R. Sordan, K. Balasubramanian, M. Burghard and K. Kern, Exclusive-OR Gate with a
Single Carbon Nanotube, Appl. Phys. Lett., 2006, 88.
[54] D. B. Strukov and K. K. Likharev, CMOL FPGA: A Recongurable Architecture for Hy-
brid Digital Circuits with Two-Terminal Nanodevices, Nanotechnology, 16(6), pp. 888-
900, 2005.
[55] J. P. Sun, G. I. Haddad, P. Mazumder and J. N. Schulman, Resonant Tunneling Diodes:
Models and Properties, Proc. of the IEEE, 86(4), 1998, pp. 641-660.
[56] Y.-C. Tseng, P. Xuan, A. Javey, R. Malloy, Q. Wang, J. Bokor and H. Dai, Monolithic
Integration of Carbon Nanotube Devices with Silicon MOS Technology, Nano Letters,
4(1), pp. 123-127, 2004.
[57] N. Tzartzanis and W. W. Walker, Differential Current-Mode Sensing for Efcient On-
Chip Global Signaling, IEEE J. of Solid-State Circuits, 40(11), 2005, pp. 2141-2147.
[58] Z. Wang and K. Chakrabarty, Using Built-In Self-Test and Adaptive Recovery for Defect
Tolerance in Molecular Electronics-Based NanoFabrics, Proc. International Test Confer-
ence, 2005.
[59] R. S. Williams and P. J. Kuekes, Demultiplexer for a Molecular Wire Crossbar Network,
US Patent Number 6,256,767, 2001.
[60] J. Xiong, V. Zolotov, and L. He, Robust Extraction of Spatial Correlation, Proc. Interna-
tional Symposium on Physical Design, 2006, pp. 2-9.
[61] J. Zhang, N. P. Patil, A. Hazeghi and S. Mitra, Carbon Nanotube Circuits in the Presence
of Carbon Nanotube Density Variations, Proc. Design Automation Conference, 2009.
[62] J. Zhang, N. P. Patil and S. Mitra, Design Guidelines for Metallic-Carbon-Nanotube-
Tolerant Digital Logic Circuits, Proc. Design Automation and Test in Europe, 2008, pp. 1009-
1014.
[63] W. Zhang, N. K. Jha and L. Shang, NATURE: A Hybrid Nanotube/CMOS Dynamically
Recongurable Architecture, Proc. Design Automation Conference, 2006, pp. 711-716.
[64] G. Zhang, P. Qi, X. Wang, Y. Lu, X. Li, R. Tu, S. Bangsaruntip, D. Mann, L. Zhang and H.
Dai, Selective Etching of Metallic Carbon Nanotubes by Gas-Phase Reaction, Science,
314, 2006, pp. 974-977.
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 409
1

A New Technique of Interconnect Effects
EquaIization by using Negative Group
DeIay Active Circuits

Iaise RaveIo, Andre Ierennec and Marc Le Roy
U|8, Unitcrsi|q cf 8rcs|
|a|-ST|CC, UMR CNRS 3192,
|rancc

1. Introduction

During lhe Iasl lvo decades, lechnoIogicaI progresses in VLSI process have lroughl an
oulslanding deveIopnenl of infornalion lechnoIogy equipnenls and lhus a greal increase
in lhe use of connunicalion services aII over lhe vorId. As reporled ly lolh lhe
InlernalionaI TechnoIogy Roadnap for Seniconduclors (ITRS) and lhe OveraII Roadnap
TechnoIogy Characlerislics (ORTC), lhe exponenliaI reduclion of lhe fealure size of
eIeclronic chips according lo Moores Iav (Moore, 1965) sliII occurs logelher vilh lhe
exponenliaI increase in line of lhe nunler of lransislor per unil area. Conlined lo lhis
shrinking of fealure sizes, lhe on-chip cIock frequency increases conlinuaIIy and shouId
exceed 1O CHz in 2O1O. These ceaseIess lrends in VLSI circuils have Ied lo nore and nore
conpIex inlerconnecl syslens, and lhus, lhe inpIenenlalion of nelaI nuIli-Iayers for inlra-
chip inlerconnecls has lecone a nusl. In lhe nid-198Os, lhe devices vere, lhus, conposed
of one or lvo Iayers of aIuniniun, in 2O11, according lo ITRS prediclion, chips viII consisl
of nore lhan len Iayers of copper.
Under lhese condilions, oving lo lhe higher operalion speeds, lhe inlerconnecl propagalion
deIay lecones nore and nore significanl and doninales consideralIy lhe Iogic propagalion
deIay (Deulsch, 199O, Ralay, 1996). ecause of lhe sensilivilies of paranelric varialions,
cIock and dala fIov nay nol le synchronized (Iriednan, 1995). This expIains vhy
inlerconneclions are so inporlanl in lhe delerninalion of VLSI syslen perfornances.
esides, sinpIified nodeIs are vorlh leing considered in order lo reduce lhe conpIexily of
any sludy on inlerconnecls. Since lhe leginning of lhe 198Os lhe nodeIIing of propagalion
deIay in order lo eslinale lhe deIay of an inlerconnecl Iine driven ly a CMOS gale has leen
lhe suljecl of nunerous papers (Sakurai, 1983 and 1993, Deng & Shiau, 199O). The sinpIesl
and nosl used nodeI of lhis deIay vas proposed ly LInore in 1948, il reIies on lhe use of
onIy an RC-Iine nodeI. NeverlheIess, due lo lhe eIevalion of syslen dala rales, lhis nodeI
lends lo le insufficienlIy accurale. Therefore, nore accurale nodeIs lhal sonelines lake inlo
accounl lhe induclive effecl (Wyall, 1987, IsnaiI el aI., 2OOO) have leen proposed.
20
VLS 410

To soIve lhe prolIen of cIock skev and propagalion deIays, a lechnique of signaI inlegrily
enhancenenl lased on repealer inserlion vas proposed ly differenl aulhors (AdIer and
Iriednan, 1998, IsnaiI & Iriednan, 2OOO). ul, vhen lhe signaIs are significanlIy
allenualed, such a soIulion nay le unalIe lo conserve lhe dala duralion and lhus,
inefficienl. These consideralions drove us lo recenlIy propose a nev lechnique for
inlerconnecl-effecl equaIizalion (RaveIo, Ierennec & Le Roy, 2OO7a, 2OO8a, 2OO9 and 2OO9)
lhrough use of negalive group deIay (NCD) aclive circuils. As shovn in Iig. 1, il consisls
nereIy in cascading lhese NCD circuils al lhe end of lhe inlerconnecl Iine. The possiliIily of
signaI recovery vilh a reduclion of signaI rise/faII-, sellIing- and propagalion-deIays vas
lheorelicaIIy denonslraled and evidenced lhrough sinuIalions in (RaveIo el aI., 2OO7a,
2OO8a) and confirned ly experinenls in (RaveIo el aI., 2OO9).


Iig. 1. Inlerconnecl Iine driven ly a Iogic gale ended ly an NCD circuil.

In facl, evidences of lhe NCD phenonenon have leen provided lhrough lheorelicaI
denonslralions and experinenls vilh passive-eIeclronic devices (Lucyszyn el aI., 1993,
LIeflheriades el aI., 2OO3, Siddiqui el aI., 2OO4 and 2OO5) and aclive ones (SoIIi & Chiao, 2OO2,
Kilano el aI. 2OO3, Nakanishi el aI., 2OO2, Munday & Henderson, 2OO4). As descriled in
severaI physics donains (Wang el aI., 2OOO, Dogariu el aI., 2OO1, SoIIi & Chiao, 2OO2), in lhe
case of a snoolhed signaI propagaling in a device/naleriaI lhal generales NCD, lhe, lhe
peak of lhe oulpul signaI and ils fronl edge are lolh in line advance conpared lo lhe inpul
ones. Then, confirnalions lhal lhis counlerinluilive phenonenon is nol physicaIIy al odds
vilh lhe causaIily principIe have leen provided (Wang el aI., 2OOO, Nakanishi el aI., 2OO2). A
Iileralure reviev shovs lhal lhe firsl circuils lhal exhililed NCD al nicrovave vaveIenglhs
dispIayed aIso significanl Iosses, vhereas lhe laseland-operaling ones vere inlrinsicaIIy
Iiniled lo Iov frequencies. To cope vilh lhese issues, ve, recenlIy, reporled on lhe design,
lesl and vaIidalion lhrough sinuIalions and experinenls of a nev and lolaIIy inlegralIe
lopoIogy of NCD aclive circuil (RaveIo el aI., 2OO7l, 2OO7c and 2OO8l), lhis lopoIogy reIies
on lhe use of a ILT and shoved ils aliIily lo conpensale for Iosses al nicrovave
frequencies over lroad landvidlh. Transposilion of lhis NCD lopoIogy lo laseland
frequencies aIIoved us lo deveIop nev slruclures lhal denonslraled lheir aliIily lo
sinuIlaneousIy generale an NCD and gain for lroad and laseland signaIs (RaveIo el aI,
2OO8). Then, lhe idea pul forvard ly SoIIy and Chiao (SoIIy, 2OO2) lo conpensale
degradalions inlroduced ly passive syslens such as inlerconnecl Iines ly using NCD
devices lecane possilIe.
As a conlinualion of lhese invesligalions, lhis chapler deaIs vilh furlher deveIopnenls of
lhis lechnique. Seclion 2 gives insighl inlo lhe vay lhis lechnique vorks, and lriefIy
expIains lhe roIe of lhe NCD circuil. The lheory of inlerconnecl nodeIIing and lhe definilion
of lhe propagalion deIay are lolh recaIIed in Seclion 3. AnaIylicaI approach and
experinenlaI vaIidalions of RC-nodeI equaIizalion are presenled in Seclion 4. The feasiliIily
of lhe proposed lechnique, vhen lhe induclive effecls are laken inlo accounl, is deaIl in
Seclion 5. In Seclion 6, a conpIeleIy originaI and fuIIy-inlegralIe lopoIogy is proposed ly
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 411

gelling rid off induclance lo cope vilh lheir inpIenenlalion issue. Thus, lhe resuIls of
sinuIalions, vhich provided a very good vaIidalion of lhe perfornances expecled fron
lheory, are anaIysed and discussed. A sunnary of lhis chapler is given in lhe Iasl Seclion
logelher vilh proposaIs aloul possilIe fulure deveIopnenls.

2. The basic principIe of the proposed compensation technique

Iigure 2 iIIuslrales lhe generaI case of lhe dislorlion undergone ly a nunericaI signaI
degraded ly lhe inlerconnecl circuilry as inlroduced in Iig. 1. Conpared lo lhe inpul signaI,
t
i
(|), lhe degradalion of lhe oulpul one, t
|
(|), can le assessed fron lhe signaI allenualion as
veII as lhe rise-, faII- and sellIing-lines and lhe 5O propagalion deIay. Il is vorlh recaIIing
lhal lhis Iasl paraneler, denoled here ly T
p5O
or nereIy T
p
, is defined as lhe line needed ly
lhe oulpul signaI, t
|
, lo reach 5O of lhe unil-slep inpul of anpIilude, V
M
, here. The deIay
required lo reach 5O of lhe Iogic sving is lradilionaIIy referred as lhe deIay line.


Iig. 2. Tine-donain responses of lhe ideaI syslen shovn in Iig. 1 for a periodicaI inpul
voIlage, t
i
(|).

In frequency donain, lhe degradalion lelveen lhe inpul, t
i
and lhe oulpul, t
|
corresponds
lo a lransfer funclion denoled, G
|
(s) vhose gain nagnilude and group deIay usuaIIy verify
lhe foIIoving inequaIilies:

1 ) ( < e j G
|
and O / ) ( ) ( > c cZ = e e e t j G
| |
. (1)

The oulpul LapIace lransforn of lhis inlerconnecl Iine can le vrillen as:

t
|
(s) = G
|
(s) . t
i
(s).
(2)

As shovn ly Iigs. 1 and 2, lhis sludy vas ained al finding a reIevanl configuralion or
circuil alIe lo provide a conpensaled oulpul, v
N
, (lIack lhick curve) as cIose as possilIe lo
lhe inpul signaI, v
i
, (red dashed curve). Il neans lhal lhe foIIoving nalhenalicaI
approxinalion can le nade:

t
N
(|) ~ t
i
(|).
(3)

VLS 412

In lheory, for veII-nalched circuils, lhe lransfer syslen lo le found, G
x
(s), nusl le
associaled lo G
|
(s) so lhal equalion (4) is verified:

) ( ). ( ). ( ) ( s t s G s G s t
i x | N
= . (4)

According lo lhe circuil and syslen lheory, lhrough use of equalions (3) and (4), one gels
lhe adequale lransfer funclion:

) ( / 1 ) ( ) ( ) ( ). ( ). ( s G s G s t s t s G s G
p x i i x |
~ ~ .
(5)

ConsequenlIy, in lhe frequency donain, lhe syslen gain and group deIay nusl le such lhal:

8
|
8
x
j G j G ) ( ) ( e e = ,
(6)
) ( ) ( e t e t
| x
= . (7)

So, on condilion lo lake inlo accounl lhe condilion expressed in equalion (1), lhe gain and
lhe group deIay nusl le respecliveIy such lhal |G
x
(j)|
8
> O and
x
() < O. TechnicaIIy,
lhese condilions require lhe cascade of a syslen alIe lo sinuIlaneousIy exhilil Cain and an
NCD in laseland. As descriled ly lhe lIock diagran of Iig. 3, lhe vhoIe cascaded syslen
is characlerized ly ils lransfer funclion C(s) vhere G
|
is lhe inlerconnecl lransfer funclion
and G
x
= G
NGD
, t is lhe vhoIe group deIay.


Iig. 3. Iock diagran: inlerconnecl passive syslen, G
|
(s) cascaded ly lhe aclive NCD
circuil, G
NGD
(s): G(s) = G
|
(s).G
NGD
(s).

This process conslilules, lhen, a lechnoIogicaI soIulion of equaIizalion lechnique. In lhis
case, lhe gain and group deIay generaled ly lhe conpensalion syslen shouId le
respecliveIy lhe reverse and lhe opposile of lhose of lhe inlerconnecl circuilry as depicled in
Iig. 4. The principIe of inlerconnecl Ioss conpensalion and group deIay reduclion is
iIIuslraled in Iig. 4. Al lhis slage, il is vorlh noling lhal lhe conpensalor nusl conlain a
syslen alIe lo exhilil a group deIay negalive al lase land frequencies nol onIy vilh an
opposile vaIue of lhe inlerconnecl one lul aIso over lhe corresponding frequency land. A
siniIar renark can le nade aloul lhe gain conpensalion. The need for aclive circuils drove
us lo propose, here, a lopoIogy lased on lhe use of a ILT, chosen lecause of ils liasing
sinpIicily, lhe differenl nodeIs avaiIalIe (sinpIe for anaIylicaI lheory and nore accurale
ones for finaI sinuIalions), ils facuIly lo operale al lens of CHz as veII as ils easy
inlegralion.
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 413


Iig. 4. IIIuslralion of lhe conpensalion principIe: frequency responses of lhe nagnilude in
d (a) and group deIay (l).

In concIusion, lhe NCD circuil consisls in lhe achievenenl of a group deIay and allenualion
conpensalory funclion. Iig. 4 depicls lhe conpensalion principIe ly considering lhe generaI
frequency lehaviour of inlerconnecls or lransnission Iines. So, prior lo conducling a
feasiliIily sludy of lhis lechnique vilh concrele syslens, Iel us lriefIy recaII lhe lheory on
connonIy used nodeIs of inlerconnecls, i.e. RC- and RLC-circuils, and on propagalion
deIay assessnenls.

3. Theory on the interconnect modeIIing and the propagation deIay
approximation

In nosl of nicroeIeclronic device inlerconnecl nodeIs, lhe conduclance effecl can le
negIecled conpared lo lhe per-unil Ienglh resislance, R
|
, lhe induclance, |
|
, and lhe
capacilance, C
|
. Therefore, Iel us consider lhe inlerconnecl slruclure presenled in Iig. 5: il
consisls of an RLC-Iine nodeI of Ienglh, , driven ly a gale vilh an oulpul resislance, R
s
. As
previousIy nenlioned, t
|
is lhe circuil oulpul voIlage and t
i
, lhe inpul one.
VLS 414


Iig. 5. A gale vilh oulpul resislance, R
s
, driving a RLC-nodeI inlerconnecl.

Ior lhe line-donain anaIysis of lhe slruclure under sludy, lhe inpul voIlage, t
i
, is assigned
as a Heaviside unil slep funclion, (|), of anpIilude, V
M
:

t
i
(|)=V
M
(|). (8)
As defined previousIy, lhe 5O propagalion deIay, T
p
, is defined as lhe rool of equalion (9):

t
|
(T
p
) = V
M
/2. (9)
efore lhe caIcuIalion of lhis propagalion deIay, il is vorlh focusing on lhe syslen lransfer
funclion, vhich is defined as G
|
(s) = V
|
(s)/V
i
(s). Il vas eslalIished (Ajoy el aI., 2OO4) lhal,
according lo lhe configuralion of Iig. 5, lhis quanlily is expressed as:

) sinh( ) / ( ) cosh( ) 1 (
1
) (
Z R sR
s G
c s s
|
+ +
=
,
(1O)
vhere
s C s | R Z
| | | c
/ ) ( + = , (11)
and
s C s | R
| | |
) ( + = , (12)
are, respecliveIy, lhe characlerislic inpedance and lhe propagalion conslanl of lhe Iine. This
lransfer funclion is anaIysed lhrough use of poIynoniaI expansion as done for cIassicaI
Iinear syslens. Ior sinpIificalion, Iel us deaI vilh lhe nornaIized lransfer funclion, g
|
(s)
vhich is delernined ly G
|
(s)/G
|
(O), and lhal can le expressed ly lhe m-order Iinear
expression:

m
m
n
n
|
s | s | s |
s a s a s a
s g
+ + + +
+ + + +
=
... 1
... 1
) (
2
2 1
2
2 1
, (13)
vhere lhe coefficienls, a
i
(i = |1,...,n) and |
j
(j = |1,...,m) are reaI nunlers, and m and n are
inlegers. According lo lhe Iileralure (LInore, 1948, Wyall, 1987, IsnaiI el aI., 2OOO), use of
lhis expression aIIovs one lo eslinale lhe 5O propagalion deIay expressed in equalion (9).
Anong lhe exisling approxinalion, il is vorlh recaIIing lhal lhe sinpIesl and lhe nosl used
in lhe induslriaI conlexl is lhe one proposed ly LInore in 1948. Il is lased on lhe firsl-order
consideralion of equalion (13). Indeed, lhis eslinalion of lhe propagalion deIay is nereIy
defined ly:
T
p||mcrc
= |
1
- a
1
. (14)
One shouId nole lhal lhis propagalion deIay is exaclIy equaI lo lhe group deIay of lhe
syslen under consideralion al very Iov frequencies (VIach el aI. 1991). NeverlheIess, lhis
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 415

fornuIa has proven lo lecone Iess and Iess accurale lovards lhe eIevalion of lhe operaling
frequency. A nev fornuIa vas proposed ly Wyall (Wyall 1987): il differs fron lhe previous
approach ly onIy lhe vaIues of lhe coefficienls, a
i
and |
i
, vhich are defined ly lhe reciprocaI
of lhe doninanl poIe of lhe syslen lransfer funclion:

_
=

=
n
i
i
z a
1
1
1
,
(15)
_
=

=
m
i
i
p |
1
1
1
,
(16)
vhere lhe reaI nunlers, z
i
and p
i
, are, respecliveIy, lhe zeros and lhe poIes of g
|
(s). Il is
vorlh noling lhal lhis approach provides an exacl expression of T
p
in lhe case of lhe RC-
nodeI (L
I
= O).

3.1 RecaII on RC-Iine modeI
Il vas eslalIished (Sakurai , 1983, Deng & Shiau, 199O) lhal lhe firsl-order approxinalion of
lhe lransfer funclion in equalion (1O) Ieads lo lhe foIIoving expression:

{ } ) ( j ) |( 1 / 1 ) ( s G s C R R s G
rc | | s |
= + + + = . (17)
This aIIovs lhe nodeIIing of lhe inlerconnecl circuilry presenled in Iig. 5 as a sinpIe
equivaIenl circuil (Iig. 6).


Iig. 6. SinpIified represenlalion of lhe slruclure shovn in Iig. 5 ly considering lhe firsl-
order approxinalion of lhe lransfer funclion.

Therefore, lhe driver gale Ioaded ly lhe dislriluled lransnission Iine can le equivaIenl lo a
Iunped RC-circuil. To nake easier lhe anaIylicaI caIcuIalion, Iel us consider lhe equivaIenl
paranelers of lhe syslen under sludy, R
|
= R
s
+ R
|
and C = C
|
. So, lhe LInore propagalion
deIay of lhis veII-knovn circuil is given ly:
C R T
| pRC
= .
(18)
ul, il can le shovn ly caIcuIalion lhrough lhe unil slep response lhal lhe exacl vaIue of
lhis quanlily is expressed as:
) 2 In( C R T
| pRC
= .
(19)
Then, lhe rise line, denoled |
rRC
, vhich is defined as lhe line needed ly lhe oulpul signaI lo
pass fron 1O lo 9O of lhe finaI oulpul vaIue, is vrillen as:
) 9 In( = C R |
| rRC
.
(2O)
VLS 416

3.2 Summary on RLC-Iine theory
As previousIy nenlioned, novadays, inlerconnecl Iine nodeIIing requires, in nosl cases, lo
lhoroughIy consider lhe induclance effecl. This paraneler is usuaIIy laken inlo accounl
lhrough use of a second-order syslen, such as an RLC-nelvork vilh lhe foIIoving canonicaI
lransfer funclion:

) 2 /( ) (
2 2 2
n n n |
s s s g e e e + + = . (21)
vhere
C |
| n
/ 1 = e , (22)
and
| |
| C R / ) 2 / ( = , (23)
are, respecliveIy, lhe undanped naluraI anguIar frequency and lhe danping ralio. Iron lhis
second-order lransfer funclion, an accurale propagalion deIay, T
p
, vas eslalIished ly
IsnaiI & Iriednan (2OOO):

n p
c T e

/ ) 48 . 1 (
35 . 1
9 . 2
+ =
-
. (24)
To reduce lhis propagalion deIay, and as firsl envisaged ly SoIIi and Chiao (SoIIi el aI, 2OO2),
il couId le vorlh cascading lhe inlerconnecl Iine vilh an NCD aclive circuil. ul a
preIininary lo lhis proposaI is a delaiIed sludy of lhe resuIling device in order lo gel a
confirnalion of lhe conpensalion principIe efficacy. This viII le lhe focus of lhe nexl
Seclion.

4. Compensation of a first-order interconnect modeI (RC-circuit)

To conpensale for inlerconnecl spurious effecls, especiaIIy Iosses and line deIay, ve
recenlIy deveIoped and lesled a nev equaIizalion lechnique lased on lhe use of an NCD
aclive circuil (RaveIo el aI., 2OO7a, 2OO8a and 2OO9). In lhis seclion, ve viII lriefIy recaII lhe
lheorelicaI fundanenlaIs and vaIidale lhe lechnique lhrough experinenls carried oul in
frequency- and line-donains in lhe case vhere a firsl-order circuil, i.e. an RC-circuil, is
used lo nodeI lhe inlerconnecl effecls.

4.1 Theory
As expIained alove, lhe equaIizalion principIe consisls in ending lhe circuil lo le
conpensaled, here an RC one, vilh an NCD aclive ceII as depicled in Iig. 7. To sinpIify lhe
anaIylicaI approaches, lhe ILT is nodeIIed ly a conlroIIed voIlage currenl source vilh a
lransconduclance, g
m
, and lhe drain source resislance, R
s
.

A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 417


Iig. 7. RC-NCD circuil: RC-circuil cascaded vilh a lasic ceII of lhe NCD aclive circuil (ILT
in feedlack vilh an RL series nelvork) and lhe corresponding equivaIenl circuil.

In a firsl slep, Iel us recaII (RaveIo el aI., 2OO7a, 2OO8a, 2OO8l, 2OO9) lhe lransfer funclion and
lhe group deIay expressions of lhe NCD ceII aIone (in lhe dash lox in Iig. 7) al Iov
frequencies:

as
as m
NGD
R R
R R g
G
+

=
) 1 (
) 0 ( ,
(25)
(1 )
(0)
( )(1 )
m as
NGD
as m
L g R
R R g R
t
+
=
+
.
(26)
On condilion lhal:
0 1 < R g
m
, (27)
lhe NCD ceII exhilils a negalive group deIay (
NGD
( O) < O) al very Iov frequencies.

Then, lhe lransfer funclion and lhe gain al Iov frequencies of lhe vhoIe RC-NCD circuil
presenled in Iig. 7 are, respecliveIy, expressed as:

| |
2
(1 )
( )
(1 ) ( )
as m m as
as t m as t as t
R g R g R Ls
G s
R R R g R R C R R L s R CLs

=
+ + + + + + +
,
(28)
(1 )
(0)
(1 )
as m
as t m as
R g R
G
R R R g R

=
+ + +
.
(29)
One shouId nole lhal, lecause of lhe unnalched effecl lelveen lhe RC- and NCD-circuils,
lhe lransfer funclion, G(s), is nol equaI lo G
rc
(s).G
NGD
(s) (RaveIo el aI., 2OO9). The sane
renark appIies lo lhe group deIays:

) ( . ) ( ) ( e e e j G j G j G
NGD rc
= , (3O)
) ( ) ( ) ( e t e t e t
NGD rc
+ = . (31)

(I) DcmnnstratInn nf thc 50% prnpagatInn dc!ay rcductInn:
Iron equalions (18) and (28), lhe LInore propagalion deIay (or lhe group deIay al Iov
frequencies) of lhe conpensaled RC-NCD circuil can le expressed as foIIovs (RaveIo el aI.,
2OO8a):

VLS 418

) 1 )( (
) 1 )( ( j ) ( 1 |
2
s m s | m s |
m s pRC s | m s | m
p
R g R R g R R R
R g R R T | R R g R R g
T
+ + +
+ + + +
= . (32)
So, lhe deIay is reduced (T
p
< T
pRC
,) on condilion lhal:

) 1 /( ) 1 (
2
R g C R R g |
m | m
+ > . (33)
ul lhis inequaIily is aulonalicaIIy verified if lhe condilion expressed in equalion (27) is
lrue, i.e. if lhe aclive circuil generales NCD. According lo lhis sludy, lhe oplinun vaIues of
lhe NCD circuil can le synlhesised as a funclion of lhe RC-vaIues.

(II) 5ynthcsIs nf an NGD cc!! accnrdIng tn thc RC-mndc! va!ucs
As lhe ain of lhe equaIizalion principIe is lhe generalion of an oulpul signaI equaI or cIose
lo lhe inpul one, lhe lransfer funclion nagnilude and group deIay shouId le, respecliveIy
aInosl equaI lo unily and nuII. Ior sinpIificalion purpose, Iel us appIy lhese condilions al
very Iov frequencies:

|G(O)| ~ 1 and (O) ~ O. (34)

Iron equalions (31) and inversion of (29) and (32), one gels lhe synlhesis reIalions
expressed in equalions (35) and (36):

) 1 /( j ) 1 ( 2 | + + =
s m | s m s
R g R R g R R , (35)
1 ) (
) )( 1 (
2
+ + +
+
=
| s m | s m
| s m
R R g R R g
C R R R R g
| .
(36)
These expressions are neaningfuI vhen lhe Iosses dispIayed ly lhe inlerconnecl are Iess
lhan lhe naxinaI gain nagnilude of an NCD ceII, C
nax
:

s m
R g G =
nax
. (37)

Olhervise, severaI slages of NCD ceIIs are needed lo generale a vhoIe gain of aloul unily,
lhen, as reporled in (RaveIo el aI., 2OO8a), il is advised lo use equalion (38) for lhe caIcuIalion
of lhe oplinaI nunler of ceIIs, n:

)j In( / ) 1 In( inl| 1
nax
) /(
G c n
C R T
|

+ ~ . (38)

T is lhe line duralion of lhe considered inpul square puIse. Ior a reaI x, lhe vaIue of lhe
grealesl inleger given ly lhe funclion, in|(x), is equaI lo or Iover lhan x. Il is vorlh
underIining lhal lhe inpIenenlalion of, al Ieasl, lvo or an even nunler of lransislors is
preferalIe vhen lhe appIicalion under sludy requires avoiding signaI inversion.

To vaIidale lhese lheorelicaI prediclions, proof-of-concepl (IOC) circuils vere designed,
falricaled and neasured in frequency- and line-donains as expIained hereafler.

A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 419

4.2 ExperimentaI vaIidations
The slarling vaIues of lhe NCD circuil cone fron lhe synlhesis reIalions knoving lhe
characlerislics of lhe chosen ILT and lhe RC-nodeI vaIues. Then, lhe nosl accurale
avaiIalIe nodeIs vere used lo run sinuIalions of lhe IOC circuils on ADS
TM
sinuIalor
soflvare fron Agi|cn|
TM
, a sensilivily anaIysis vas aIso perforned. The differenl IOC
circuils presenled in lhis seclion vere inpIenenled in hylrid pIanar lechnoIogy (associalion
of nicroslrip and surface nounl chip/package conponenls).

(I) DcsIgn Prnccss
This denonslralor vas ained al iIIuslraling visuaIIy lhe recovery of lhe degraded signaI
and lhe deIay reduclion. The circuils vere designed and inpIenenled separaleIy, lhey
consisled of an RC-circuil, an NCD-one and lhe conlinalion of lolh inlo an RC-NCD circuil
(Iig. 8(l)). To perforn a lroadland liasing and lo avoid lhe evenluaI disruplions caused ly
lhe lias nelvork al Iov frequencies, lhe ILT vas liased lhrough an aclive-Ioad lechnique.


(a)


(b)
Iig. 8. Schenalics of lhe (a) RC- and (l) RC-NCD-circuils (incIuding lhe liasing nelvork in
lhin Iines) using a IHLMT ILT (ATI-34143, V
gs
= O V, V

= 3 V, |

= 11O nA), for R


|
= 33 , C
= 68O pI, R = 56 , R
c
= 1O , | = 22O nH, C
|
= 1OO nI and Z
O
is lhe oulpul reference Ioad.

The ILT paranelers required for lhe synlhesis equalions vere exlracled fron lhe non-Iinear
nodeI and found lo le g
m
= 226 nS and R
s
= 27 O. Ior lhe RC circuil, lhe RL vaIues of lhe
NCD circuil vere synlhesised fron equalions (33) and (34). Then, accurale frequency
responses vere ollained lhrough conlinalion of circuil sinuIalions (Iunped conponenls
and non Iinear ILT nodeI) vilh eIeclronagnelic co-sinuIalions of lhe dislriluled parls ly
Monenlun Soflvare (fron Agi|cn|
TM
). The vaIues of lhe avaiIalIe Iunped conponenls vere
used in a finaI sIighl oplinisalion procedure. The Iayoul of lhe hylrid pIanar circuil shovn
in Iig. 9 vas prinled on an 8OO-n-lhick IR4 sulslrale of reIalive pernillivily,
r
= 4.3, lhen
lhe surface-nounl chip passive conponenls vere sel onlo lhe sulslrale.

VLS 420


Iig. 9. Layoul of lhe inpIenenled RC-NCD circuil.

(II) 5Imu!atInn Rcsu!ts and 5cnsItIvIty Ana!ysIs
The sinuIalions of lhe circuil piclured in Iig. 9 and of lhe RC and NCD ones vere
independenlIy run in frequency- and line-donains under lhe specificalions and condilions
nenlioned alove. A sensilivily Monle-CarIo anaIysis over 1O lriaIs vas carried oul on using
loIerance vaIues of 5 around lheir noninaI vaIues for lhe NCD eIenenls (R = 56 and
| = 22O nH). AII lhe resuIls are presenled in Iigs. 1O and 11.

- |requenc-domaln ana|sls: Iig. 1O(a) shovs lhal lhe nagnilude |G(f)| of lhe vhoIe RC-
NCD circuil is kepl al aloul O d up lo 4O MHz, and lhe corresponding group deIay |t(f)|
is, as expecled, slrongIy decreased fron DC lo 15 MHz conpared lo lhal of lhe RC circuil.
Moreover, lhe frequency responses are onIy sIighlIy sensilive lo Iunped-conponenl
loIerance.

(a) (b)
Iig. 1O. (a) Magnilude and (l) group deIay frequency responses issued fron sinuIalions
vilh a Monle-CarIo anaIysis of lhe NCD circuil eIenenls R and L ( 5).

- Tlme-domaln ana|sls: Il is vorlh underIining lhal, al firsl, lhe signaI vas neasured al lhe
oulpul of a Rncc c Scnuarz Signa| Gcncra|cr SM] 100A al lhe highesl avaiIalIe rale, i.e. 25
Msyn/s, and lhen furlher used in sinuIalions as lhe inpul signaI. Then, excilalion of lhe
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 421

sinuIaled RC- and RC-NCD-circuils vilh lhis inpul square vave puIse (nagnilude, V
M
= 1
V) Ied lo lhe line-donain sinuIalion resuIls presenled in Iig. 11, vhere lhe dolled curve
indicales lhe degradalion induced ly lhe RC-circuil aIone, noreover, lhe signaI recovery is
evidenced ly lhe lhick lIack curve. Il cIearIy appears lhal lhe conpensalion is significanl,
lul inconpIele. ecause of lhe drain-source currenl inversion, lhe oulpul signaI is reversed
conpared lo lhe inpul voIlage. This is vhy lhe pIol presenled in Iig. 11 is lhal of -V
N
.


Iig. 11. ResuIls of line-donain sinuIalions (vilh a Monle CarIo anaIysis of lhe NCD circuil
eIenenls, R and L ( 5)) for an inpul square vave puIse (V
M
= 1 V vilh a 4O ns dala
duralion).

Il is vorlh noling lhal lhe 5O propagalion deIay is shorlened fron 16.5 ns (for lhe RC
circuil) lo aloul 3 ns (for lhe RC-NCD circuil), i.e. a reIalive reduclion of, al Ieasl, 81. A
Monle-CarIo sensilivily anaIysis vilh a 5-varialion around lhe R and | noninaI vaIues
shoved nearIy no change of lhe RC-NCD circuil oulpul signaI (incIuding lhe fronl and
Ieading edges), lul lhe finaI vaIue is sonevhal sIighlIy affecled.

(III) Mcasurcd rcsu!ts and dIscussInns:
To check for lhe vaIidily of lhe resuIls of lheorelicaI anaIysis and sinuIalions, lhree circuils
vere falricaled and lesled in lolh frequency- and line-donains.

- |requenc-domaln resu|ts: These neasurenenls vere nade vilh a veclor nelvork
anaIyzer (Rncc c Scnuarz ZVR| 9|Hz - 4GHz), vhich provides lhe scallering nalrix (S-
paranelers) of lhe devices under lesl. Then, lhe lransfer funclion is caIcuIaled lhrough use
of lhe cIassicaI passage reIalionships in order lo gel lhe nagnilude and lhe group deIay
(caIcuIaled fron lhe lransfer funclion phase response). The falricaled devices consisled of
lhe singIe RC-circuil, lhe NCD one and a lhird one, denoled RC-NCD circuil, vhich
conlined lolh funclions in order lo avoid any conneclor nisnalch IialIe lo occur if lhe firsl
lvo ones vere sinpIy cascaded. Iigure 12(a) shovs lhal lhe nagnilude and group deIay
responses of lhe lhree circuils are in good agreenenl vilh lhe sinuIalion resuIls presenled
in Iigs. 1O. As expecled fron equalions (3O) and (31), lhe lolaI nagnilude and group deIay
vaIues are differenl fron lhose issued fron lhe addilion of lhe individuaI ones lecause of
lhe exislence of nisnalch lelveen lhe RC and NCD parls (G
RCNGD
= G
RC
.G
NGD
). The
neasured nagnilude, |G(f)|
d
, is cIose lo O up lo 4O MHz and gels dovn lo -1O d al 8O
MHz.
VLS 422

(a) (b)
Iig. 12. Measurenenls of (a) nagnilude and (l) group deIay produced ly lhe RC-, NCD-
and RC-NCD-circuils.

One shouId nole lhal, lhough lhe RC-NCD circuil group deIay, t(f), is nol fuIIy canceIIed
(Iig. 12(l)), il is slrongIy reduced leIov 2O MHz lhanks lo lhe NCD circuil, noreover, il is
kepl leIov 4 ns up lo 8O MHz. In lheory, a higher alsoIule vaIue of NCD couId le ollained,
lul a conpronise lelveen NCD vaIue, NCD landvidlh and gain fIalness has lo le found
lo nininize lhe overshool or lhe rippIe in line-donain
- Tlme-domaln exerlmenta| resu|ts: Throughoul lhe line-donain sinuIalions and
neasurenenls, lhe circuil Ioad, Z
O
, is sel al a high inpedance vaIue (Z
O
= 1 MO).


Iig. 13. Tine-donain responses vilh an inpul square puIse (25-Msyn/s rale, 2 ns rise- and
faII-lines) and zoon on lvice lhe synloI duralion.
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 423

A square vave inpul puIse of anpIilude V
M
= 1 V vas deIivered al a rale of 25-Msyn/s
(corresponding lo a 4O-ns dala duralion) ly lhe laseland dala oulpul of an RcS SM] 100A
veclor signaI generalor and recorded on a 2 Gs/s LeCroy digilaI osciIIoscope. The inpul
puIse, V
i
, vas lhus nonilored as veII as lhe oulpul ones of lhe RC- and RC-NCD-circuils,
V
RC
and -V
N
, prior lo lheir resynchronizalion lhrough use of lhe sane reference signaI.
Iigure 13 shovs lhal lhe oulpul Ieading edge provided ly lhe RC-circuil is degraded and
characlerised ly |
rRC
~ 35 ns as rise line and a 5O propagalion deIay of T
pRC
~ 18.5O ns.
Conpared lo V
RC
, lhe oulpul vaveforn, -V
N
, is reshaped and Iess dislorled. Hence, lhe
NCD circuil conpensalion aIIoved a reduclion of lolh paranelers dovn lo |
r
~ 1O ns and
T
p
~ 2.5O ns, lhal is reIalive reduclions of 71.4 (1-|
r
/|
rRC
) and 86.5 (1-T
p
/T
pRC
), respecliveIy.
As shovn in Iig. 13, lhe lraiIing edge is aIso slrongIy inproved. This poinl is vorlh leing
noled in lhe case of an enhancenenl of lhe dala rale.
The proposed inlerconnecl equaIizalion lechnique vas vaIidaled lhrough nodeIIing of lhe
inlerconnecl Iine degradalion ly an RC-circuil, vhich is lhe nosl currenl nodeI in use.
Hovever, in order lo suppIenenl lhis sludy, lhe induclive effecls of inlerconnecl Iines viII
le laken inlo accounl in lhe nexl Seclion.

5. EquaIization under conditions of interconnect Iine inductive effects

As underIined al lhe leginning of lhis chapler, lhe ceaseIess increase of operaling
frequencies, lhe induclive effecls can no Ionger le negIecled in lhe nodeIIing of inlerconnecl
syslens. In lhis sludy, lhe Iine is nodeIIed ly lhe RLC nelvork shovn in Iig. 14(a) and
driven ly a Iogic gale vilh R
s
as oulpul resislance. In order lo check vhelher lhe proposed
conpensalion lechnique sliII vorks, Iel us consider lhe vhoIe circuil presenled in Iig. 14(l):
il consisls of lhe inlerconnecl nodeI cascaded ly lhe conpensalion circuil, vhich is
conposed of a ILT fedlack ly an RL-series nelvork. To sinpIify lhe lheorelicaI sludy, lhe
ILT is repIaced ly ils Iov frequency equivaIenl-circuil nodeI as inlroduced in Iig. 5. The
lerns, R
|
= R
s
+ R
|
, |
|
= |
|
and C = C
|
, inlroduced in lhe previous sludy of lhe RC-nelvork
are sliII used.


(a)

(b)
Iig. 14. (a) Iine nodeI: RLC nelvork driven ly a Iogic gale vilh oulpul resislance, R
s
, (l) lhe
vhoIe circuil conposed of lhe Iine nodeI conpensaled ly an NCD ceII.

5.1 Theory
According lo lhe procedure used in lhe previous seclion, lhe lransfer funclion of lhe vhoIe
syslen is:

VLS 424

) /( )j ( 1 | ) (
3
3
2
2 1 O
s s s |s R g R s G
m s
o o o o + + + + = , (39)
vhere
) 1 (
O | m s |
R g R R R + + + = o , (4O)
| | s m s |
| | | R g R R C R + + + + = ) (
1
o ,
(41)
C R R | |R
s | |
)j ( |
2
+ + = o ,
(42)
C ||
|
=
3
o .
(43)
Then, al very Iov frequencies ( O), lhe corresponding gain, C(O), and lhe LInore
propagalion deIay, T
p
, are respecliveIy expressed as:

)j 1 ( /| )j 1 ( | ) O (
| m s | m s
R g R R R R g R G + + + = , (42)
) 1 j( + ) + 1 ( + |
) + 1 )( + 1 ( )j + 1 ( + ) + ( )| 1 (
=
-
- -
R g R R g R R
R g R g R g | R R C R R g
T
m s s m |
| m s m m | s | m
p
. (43)
AppIicalion of lhe equaIizalion oljeclives expressed in (32) lo equalions (42) and (43) aIIovs
one lo exlracl lhe foIIoving synlhesis reIalions:

) 1 /( j ) 1 ( 2 | + + =
s m | s m s
R g R R g R R , (44)
| | j 1 ) ( /| ) 1 ( ) ( ) 1 (
2
+ + + + + + =
| s m | s m | s m | s m
R R g R R g | R g CR R R R g | . (45)
To evidence lhe reIevance of lhis RLC effecl equaIizalion, sinuIalions under reaIislic
condilions vere run.

5.2 VaIidations by simuIations
In order lo soIve lhe prolIen of lhe oulpul voIlage sign and lo lake inlo accounl induclive
effecls, a lvo-slage NCD circuil vas sinuIaled lhrough use of ADS
TM
for lvo lypes of
inlerconnecl Iines nodeIs: i) a nodeI vilh Iunped RLC eIenenls and an inpul signaI of 1 ns
in dala duralion, and ii) an RLC dislriluled Iine and an inpul signaI of 5-ns dala duralion.
Moreover, using lvo NCD ceIIs aIIovs videning lhe NCD frequency landvidlh (RaveIo el
aI, 2OO7c) and lhus, inpul signaIs vilh higher dala rales and snaIIer rise-/faII-lines can le
considered.

(I) CnmpcnsatInn nf a !umpcd RLC-cIrcuIt fnr an Input sIgna! wIth 1 ns-data duratInn
Iig. 15 depicls lhe vhoIe conpensaled RLC-NCD circuil under sludy. Il consisls in a
Iunped RLC-circuil ended vilh a lvo-slage NCD aclive circuil. The ILTs used ly lhe Ialler
(LC-2612 vilh g
m
= 98.14 nS and R
s
= 116.8 ) are fron Mimix 8rca|an
TM
.


Iig. 15. Tvo-slage NCD aclive circuil conpensaling lhe RLC nelvork (R
|
= 1OO , |
|
= 6 nH,
C = 4.3 pI, R
1
= 86 , R
2
= 89 , |
1
= 5.2 nH, |
2
= 2.6 nH and ILT/LC-2612).
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 425

One shouId nole lhal lhe line- and frequency-donain resuIls presenled here vere ollained
ly using lhe ILT Iinear nodeI provided ly lhis nanufaclurer and recorded al lhe liasing
poinl V
s
= 3 V and |
s
= 25 nA.

- |requenc-domaln resu|ts: Iigs. 16 (a) and (l) presenl lhe nagniludes and lhe group
deIays of lhe RLC- and NCD-circuils separaleIy, and lhen, lolh cascaded (noled RLC-NCD),
as depicled in Iig. 15. Iig. 16(a) shovs lhal lhe lransfer-funclion nagnilude of lhe overaII
circuil (lIack lhick curve) is kepl vilhin -4 and O d up lo 3 CHz.


(a) (b)
Iig. 16. SinuIaled frequency responses of lhe RLC, NCD and RLC-NCD circuils: (a)
nagnilude and (l) group deIay.

Once again, one shouId nole lhal lhe vaIue of lhis vhoIe lransfer-funclion nagnilude is
differenl fron lhe sun (in d) of lhe individuaI nagniludes (G
R|CNGD
= G
R|C
.G
NGD
) lecause
of nisnalch lelveen lolh parls. Iigure 16(l) evidences lhal, lhanks lo lhe NCD effecl (lhin
lIue curve), lhe lolaI group deIay of lhe vhoIe circuil (lIack lhick curve) is Iess lhan 68 ps. A
conparison of lhe group deIays respecliveIy produced ly lhe RLC and lhe RLC-NCD
circuils shovs a significanl reduclion up lo aloul O.8 CHz vilh lhe Ialler.

- Tlme-domaln resu|ts: Iigure 17 presenls lhe resuIls of lransienl sinuIalions run in lhe case
of a 2-ns periodic inpul signaI, V
i
, vhose rise-/faII-lines are aloul 92 ps (lhin red curve).
The response al lhe oulpul of lhe RLC circuil aIone, V
R|C
, is depicled ly lhe degraded
dashed curve, and lhe corresponding 5O propagalion deIay, T
pR|C
, is equaI lo 3O4 ps.
Iurlher lo lhe inserlion of lhe NCD circuil, lhe lIack lhick curve represenlalive of V
N,
i.e. al
lhe oulpul of lhe RLC-NCD circuil, shovs inprovenenl vilh a reIalive reduclion of nore
lhan 85 of propagalion deIay since T
p
is aloul 44 ps. Moreover, ly conparison lo V
i,
V
N

presenls neilher allenualion, nor overshool, and lhe Ieading and lraiIing edges are lolh
inproved.

VLS 426


Iig. 17. Tine-donain responses produced ly sinuIalions of lhe circuil in Iig. 15 for a 2-ns
periodic lrapezoidaI inpul, rise-/faII-lines, |
r
= 92 ps and 5O-duly ralio vilh zoon on a
haIf-period (lollon).

As lhe accuracy of lhe Iunped RLC nodeI of inlerconnecl Iine used in lhese sinuIalions
nighl le insufficienl for cerlain VLSI configuralions, lhe sinuIalion descriled in lhe nexl
paragraph deaIl vilh a dislriluled RLC-nodeI (RaveIo el aI, 2OO7a).

(II) CnmpcnsatInn nf a dIstrIbutcd RLC-!Inc In thc casc nf an Input sIgna! nf 5-ns data-
duratInn


Iig. 18. RLC-Iine driven ly a gale of oulpul resislance, R
s
= 9O , conpensaled ly a lvo-
slage NCD circuil (ILT/LC-2612, R
1
= 73 , |
1
= 99 nH, R
2
= 1O2 and |
2
= 17 nH) Ioaded
ly anolher gale of inpul capacilor C
|
= 3O pI.

Iig. 18 shovs lhe circuil under sludy, lhe inlerconnecl Iine is nodeIIed ly an RLC
dislriluled circuil. The cIassicaI RLC-Iine is driven ly a gale of oulpul resislance, R
s
,
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 427

conpensaled ly a lvo-slage NCD circuil Ioaded vilh anolher gale of inpul capacilor, C
|
.
The lransnission Iine paranelers vere laken fron ITRS roadnap so lhal R
|
= 76 /cn,
|
|
= 5.3 nH/cn and C
|
= 2.6 pI/cn for a O.8-cn-Iong Iine. Ior lhese vaIues, lhe synlhesis
reIalions vere used logelher vilh an oplinizalion process lo gel an oulpul, V
N
, cIose lo lhe
inpul, V
i
. IinaIIy, lhe conponenl vaIues for lhe lvo NCD ceIIs vere: R
1
= 73 , |
1
= 99 nH,
R
2
= 1O2 and |
2
= 17 nH.

- |requenc-domaln resu|ts: Iigs. 19(a) and (l) respecliveIy shov lhe lransfer funclion
nagnilude and lhe group deIays produced ly sinuIalions of lhe lhree circuils (RLC-Iine,
NCD- and RLC-NCD-circuils). In Iig. 19(a), lhe inporlanl allenualion dispIayed ly lhe
RLC-Iine is conpensaled ly lhe NCD ceIIs so lhal lhe lolaI gain is kepl al aloul O d up lo
aloul 5OO MHz. Wilh respecl lo lhe group deIay produced ly lhe RLC-Iine, lhe one ly lhe
RLC-NCD circuil (lhick lIack curve) is slrongIy reduced as expecled lhanks lo lhe NCD
conlrilulion (lhin lIue curve). Indeed, al laseland frequencies, lhis Ialler provides a
nininaI NCD vaIue around -O.5 ns.


(a) (b)
Iig. 19. Irequency responses produced lhrough sinuIalions of lhe RLC-, NCD-, and RLC-
NCD circuils shovn in Iig. 18: nagnilude (a) and group deIay (l).

- Tlme-domaln resu|ts: Tine-donain sinuIalions vere run on using al lhe inpul a 1O-
ns-period signaI vilh O.6O ns as rise-/faII-lines.


Iig. 2O. Tine donain oulpul responses of lhe RLC-Iine and of lhe RLC-NCD conpensaled
slruclure for a lrapezoidaI inpul voIlage al a 2OO Mlil/s rale.

VLS 428

Iig. 2O evidences a narked degradalion of lhe oulpul vaveforn, V
|
, of lhe RLC-Iine
conpared lo lhe inpul vaveforn. Once again, lhe inserlion of lhe NCD aclive circuil aIIovs
greal enhancenenl iIIuslraled ly a nolalIe signaI regeneralion, a reduclion of lhe
propagalion deIay of aloul 1.6O ns (fron 1.86 ns lo O.26 ns) acconpanied vilh an
enhancenenl of lhe signaI Ieading- and lraiIing-edges. These resuIls confirn lhe efficacy
and lhe reIialiIily of lhis lechnique lo equaIize lhe degradalion induced eilher ly Iunped or
dislriluled RLC-nodeI of inlerconnecl Iines.

6. Improvement of the NGD topoIogy toward an integration process

AppIicalion of lhese innovalive ceIIs lo lhe conpensalion of VLSI inlerconnecl-induced
degradalion requires conpaliliIily vilh VLSI inlegralion process. So, here, lhe nain issue is
lhe inlegralion and nanufacluring of induclive eIenenls such as lhe induclance of lhe ILT
feedlack and lhe liasing Induclance (choke seIf vilh a high vaIue). The Ialler can le
repIaced vilh, for exanpIe, an aclive lias vhich, noreover, provides a possiliIily of
operaling landvidlh enhancenenl. To gel rid of lhe feedlack induclance, Iel us consider a
nev lopoIogy of NCD aclive circuil vilh no induclance. Il sinpIy consisls of a ILT cascaded
vilh a shunl RC-nelvork. To nalch lhe ceII oulpul lo lhe foIIoving slage inpul, an R
2

resislor is connecled in shunl al ils end. As descriled in Iig. 21, lhis nev lopoIogy is
cascaded al lhe end of an RC circuil, vhich nodeIs lhe inlerconnecl Iine lo le conpensaled.


Iig. 21. RC-circuil conpensaled ly a NCD ceII vilhoul induclance.

Lel us focus, al firsl, on lhe anaIylicaI sludy of lhe proposed ceII according lo lhe RC circuil
vaIues prior providing evidence, lhrough sinuIalions, of lhe conpensalion of lhe RC effecls,
i.e. signaI recovery and deIay reduclion, ly using lhe proposed NCD lopoIogy

6.1 Theory
According lo (RaveIo el aI, 2OO8), lhe lransfer funclion of lhe RC-NCD device descriled in
Iig. 21 is:

){ R R ( s C R R R R )[ Cs R (
) s C R ( R G
) s ( G
as as t
max
+ + + + +
+
=
2 1 1 2 1
1 1 2
1
1
,
(48)
vhere G
max
is lhe naxinaI gain vaIue expressed in equalion (37). The ninus sign is
expIained ly lhe inlrinsic ILT nodeI, vhich enlaiIs naluraIIy lhe inversion of lhe oulpul
voIlage direclion conpared lo lhe inpul one. Olhervise, vilh lhe sane approach as in
Seclion 4.1, fron lhis lransfer funclion il can le eslalIished lhal lhe gain al very Iov
frequencies and lhe LInore propagalion deIay are given ly:
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 429

) R R R (
R G
) ( G
as
max
+ +

=
2 1
2
0
,
(49)
) R R R (
){ R R ( C R T ) R R [(
T
as
t pRC as
p
+ +
+ +
=
2 1
1 1 1 2
.
(5O)

Iron expression (49), Ioss conpensalion (|G(O)| > 1) is effeclive on condilion lhal lhe
foIIoving condilion lelveen G
nax
and lhe resislance vaIues le nel:

2
1
1
R
R R
G
as
max
+
+ >
.
(51)

In addilion, fron expression (5O), il can le found lhal, vhalever lhe vaIues of lhe RC- and
lhe NCD-circuil paranelers, T
P
is aIvays Iover lhan T
pRC
. In olher vords, lhe RC-
propagalion deIay is aIvays reduced in lhe configuralion of Iig. 21. Al lhis slage, il is vorlh
poinling oul lhal equalion (5O), despile ils usefuIness, is an approxinalion of lhe 5O
propagalion deIay as proposed ly LInore. Indeed, according lo IsnaiI and Iriednan, a
reIalive inaccuracy of aloul 3O is possilIe. Anolher Iinilalion appears vhen lhe foIIoving
condilion is salisfied:

as
t
t
R R
) R R ( C R
C R
+
+
<
2
1 1 1
.
(52)

In lhis case, expression (5O) provides a negalive vaIue for T
p
, vhich Ieads lo an unreaIislic
lehaviour lhal vouId conlradicl lhe causaIily principIe. Indeed, caIcuIalion of lhe exacl
expression for T
p
fron lhe rool of lhe equalion, t
N
(T
p
) = V
M
/2, gives aIvays a posilive vaIue
for lhe lolaI propagalion deIay.
Despile lhese reslriclions, equalions (49) and (5O) are parlicuIarIy usefuI for a firsl anaIylicaI
approxinalion and pernil lhe exlraclion of lhe synlhesis reIalions needed lo delernine lhe
vaIues of lhe NCD ceII conponenls:

s
R G R R = ) 1 (
nax 2 1
, (53)
) 1 /( ) (
nax 1 2
+ = G R R R
s
, (54)

and ly using lhe equalion, (O) ~ O:

2
1 2 1 1
/ ) + + ( = R R R R CR C
s
. (55)

The synlhesis reIalions (53) and (54) are physicaIIy reaIislic under lhe foIIoving condilions:

1
nax
> G , (56)
) 1 /(
nax 2
> G R R
s
.
(57)
VLS 430

6.2 Discussion on the simuIation resuIts
The previous synlhesis reIalions vere used lo design and sinuIale, vilh ADS

soflvare, lhe
circuil presenled in Iig. 22(a). The lhin Iines indicale lhe DC lias nelvork. The RC-circuil is
conpensaled ly a lvo-slage NCD circuil, and lhe vhoIe circuil is exciled al lhe rale of
1 Clil/s ly an inpul signaI, V
i
, vilh rise and faII lines of aloul O.2 ns.

(a)
(b)
Iig. 22. (a): The vhoIe circuil conposed of an RC-nodeI conpensaled ly a lvo-slage NCD
circuil in aclive liasing, ILT (LC-2612, V

= 3 V, |
s
= 3O nA), R = 68 O, C = 1O pI, R
1
= 142
O, R
2
= 32 O, C
1
= C
2
= 1O pI, R
3
= 1OO O, R
4
= 51 O and (l) lhe corresponding sinuIalion
resuIls.

An aclive Ioad lias vilh lhe sane ILTs (LC-2612) vas appIied lo lhe circuil. No induclance
eIenenl is needed in lhis circuil. Transienl sinuIalions run vilh ideaI conponenls gave lhe
resuIls dispIayed on Iig. 22(l). Once again, lhe conparison of lhe RC- and RC-NCD-oulpuls
highIighls a signaI recovery, characlerised ly a gain of aloul 1.85 d al | = T/2 and a
reduclion of propagalion fron T
pRC
~ 47 ps lo T
p
~ 12 ps or T/T
pRC
~ 74.4 in reIalive vaIue.
Tine-donain neasurenenls are scheduIed and viII indicale if furlher inprovenenls are
required prior lo lhe inlegralion in a finaI VLSI device.

7. ConcIusion and future works

A noveI and innovalive lechnique of inlerconnecl effecl equaIizalion in eIeclronic syslens
vas deveIoped lhrough lheorelicaI sludies and experinenlaIIy vaIidaled. Il reIies on lhe use
of aclive circuils alIe lo sinuIlaneousIy generale gain and negalive group deIay in laseland
over lroad landvidlh.
The properlies of lhese circuils vere used lo propose a nev conpensalion approach
consisling in an equaIizalion of lolh lhe posilive group deIay and allenualion induced ly
inlerconnecls ly an equivaIenl negalive group deIay and gain.
The lheory on connonIy used circuils lo nodeI inlerconnecl effecls vere lriefIy recaIIed.
Then, a circuil conposed of a firsl-order inlerconnecl nodeI (i.e. an RC-circuil) cascaded
vilh an NCD ceII vas lheorelicaIIy sludied in order lo delernine lhe condilions required lo
conpensale for lolh lhe degraded propagalion deIay and lhe allenualion and lo express lhe
synlhesis reIalions lo le used in lhe delerninalion of lhe vaIues of lhe NCD ceII
conponenls. This NCD ceII sinpIy consisls in a ILT fedlack ly an RL series nelvork. Then,
for lhis firsl nodeIIing of inlerconnecl Iine, a proof-of-concepl circuil inpIenenled in hylrid
pIanar lechnoIogy vas falricaled lo denonslrale lhe efficiency of lhis lechnique. The
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 431

experinenlaI resuIls in frequency- and line-donains vere in very good agreenenl vilh
sinuIalions and vaIidaled lhe conpensalion lechnique in lhe case of an inpul signaI vilh a
25 Mlil/s dala rale. Indeed, lhe reduclions of lhe rise line and lhe 5O propagalion deIay
vere 71 and 86, respecliveIy.
In nany VLSI syslens and parlicuIarIy in Iong vires and/or for high dala rales or cIocks,
lhe induclive spurious effecls can no Ionger le negIecled. So, a nore eIaloraled syslen
conposed of an RLC inlerconnecl nodeI conpensaled vilh NCD ceIIs vas aIso sludied
anaIylicaIIy in order lo check for lhe vaIidily and efficacy of lhe equaIizalion lechnique and
lo delernine lhe synlhesis reIalions lo le used in furlher appIicalions. To vaIidale lhe
approach, a firsl series of sinuIalions vas run vilh an RLC Iunped nodeI for an inpul
signaI al 1 Clils/s-rale, lhen , lhe nodeI used in lhe second sel of sinuIalions vas an RLC
dislriluled Iine for an inpul signaI al a rale of 2OO Mlil/s. These sinuIalions under reaIislic
condilions confirned lhe conpensalion approach vilh reduclion of lhe propagalion deIay
of lhe sane order as previousIy. Moreover, as olserved vilh lhe RC-nodeI, lhe fronl and
lraiIing edges lolh shoved greal enhancenenls indicalive of a good recovery of lhe signaI
inlegrily.
IinaIIy, lo le alIe lo conpensale for inlerconnecl effecls in VLSI syslens, lhe proposed
circuils nusl le conpalilIe vilh a VLSI inlegralion process. This requirenenl drove us lo
propose inprovenenls of lhe proposed lopoIogy in order lo cope vilh induclance
inlegralion and nanufacluring prerequisiles. So, a lopoIogy vilh no induclance, lul vilh
lhe sane lehaviour and perfornances as previousIy vas proposed. A lheorelicaI sludy
provided evidence of ils aliIily lo exhilil a negalive group deIay in laseland logelher vilh
gain. Then, line-donain sinuIalions of a lvo-slage NCD device exciled ly a 1 Clils/s-rale
inpul signaI vere run lo vaIidale lhe expecled conpensalion approach and check for lhe
signaI recovery.

The inpIenenlalion of lhis equaIizalion lechnique in lhe case of a VLSI inlegralion process
is expecled lo aIIov conpensalion for spurious effecls such as deIay and allenualion
inlroduced ly Iong inler-chip inlerconnecls in SiI and SoC equipnenls or ly Iong vires and
luses. A preIininary slep vouId le lhe design and inpIenenlalion of such a circuil in
MMIC lechnoIogy and especiaIIy ly using dislriluled eIenenls. Al lhis slage, even if
experinenlaIIy lhe NCD ceIIs vere nol parlicuIarIy sensilive lo noise conlrilulion, il vouId
le vorlh conparing lhis approach and repealer inserlion under rough condilions, i.e. Iong
vires vilh a significanl allenualion, in order lo gain key infornalion on lheir respeclive
lehaviour under condilions of significanl noise. As idenlified in ITRS roadnap, lhe pover
consunplion is nov one of lhe najor conslrainls in chip design and has leen idenlified as
one of lhe lop lhree overaII chaIIenges over lhe Iasl 5 years. Iaced lo lhese conslrainls,
furlher invesligalions are needed lo accuraleIy evaIuale lhe consunplion of lhe presenled
NCD aclive circuils.

VLS 432

8. References

AdIer, V. & Iriednan, L. C. (1998). Repealer Design lo Reduce DeIay and Iover in Resislive
Inlerconnecl, |||| Trans. Circui|s Sqs|. ||, Ana|cg an Digi|a| Signa| Prcccssing, VoI.
54, No. 5, pp. 6O7-616.
akogIu H. . & MeindI }. D. (1985). OplinaI Inlerconneclion Circuils for VLSI, |||| Trans.
On ||cc|rcn. Dcticcs, VoI. 32, No. 5, pp. 9O3-9O9.
arke L. (1988), Line-lo-ground capacilance caIcuIalion for VLSI: a conparison, |||| Trans.
cf Ccmpu|cr Ac Dcsign, VoI. 7, No. 2, pp. 295-298.
Deng, A. C. & Shiau, Y. C. (199O). Ceneric Iinear RC deIay nodeIing for digilaI CMOS
circuils, |||| Tran. cn Ccmpu|cr-Aic Dcsign, VoI. 9, No. 4, pp. 367-376.
Deulsch., A. (199O). High-speed signaI propagalion on Iossy lransnission Iines, |8M ]. Rcs.
Dctc|cp., VoI. 34, No. 4, pp. 6O1-615.
Deulsch, A., Kopcsay, C. V., ReslIe, I. }., Snilh, H. H., Kalopis, C., ecker, W. D., Coleus, I.
W., Surovic, C. W., Rulin, . }., Dune, R. I., CaIIo, T. A., }ankis, K. A., Ternan, L.
M., Dennard, R. H., Asai-HaIsz, C., Krauler, . L. & KneleI, D. R. (1997). When are
lhe lransnission Iine effecls inporlanl for on-chip inlerconneclions, |||| Trans. cn
MTT, VoI. 45, No. 1O, pp. 1836-1846.
Deulsch, A., Kopcsay, C. V., Surovic, C. W., Rulin, . }., Ternan, L.M., Dunne, }r. R. I.,
CaIIo, T. A. & Dennard, R. H. (1995). ModeIing and characlerizalion of Iong on-chip
inlerconneclions for high speed-perfornance nicroprocessors, |8M ]. Rcs. Dctc|cp.,
VoI. 39, No. 5, pp. 547-667.
Dogariu, A., Kuznich, A. & Cao, H. & Wang, L. }. (2OO1). SuperIuninaI Iighl puIse
propagalion via rephasing in a lransparenl anonaIousIy dispersive nediun, Op|ics
|xprcss, VoI. 8, No. 6, pp. 344-35O.
LIeflheriades, C. V., Siddiqui, O. & Iyer, A. K. (2OO3). Transnission Iine for negalive
refraclive index nedia and associaled inpIenenlalions vilhoul excess resonalors,
|||| MIC |c||., VoI. 13, No. 2, 51, pp. 51-53.
LInore, W. C. (1948). The lransienl response of danped Iinear nelvorks, ]. App|. Pnqs., VoI.
19, pp. 55-63.
Iriednan, L. (1995). CIock dislrilulion nelvorks in VLSI circuils and syslens, Ncu Ycr|
|||| Prcss.
Crover, I. (1945). Induclance CaIcuIalions Working IornuIas and TalIes, |ns|rum. Scc. cf
Amcrica.
IsnaiI, Y. I. & Iriednan, L. C. (2OOO). Lffecls of induclance on lhe propagalion, deIay and
repealer inserlion in VLSI circuils, |||| Tran. V|S| Sqs., VoI. 8, No. 2, pp. 195-2O6.
IsnaiI, Y. I., Iriednan, L. C. & Neves, }. L. (2OOO). LquivaIenl LInore deIay for RLC lrees,
|||| Tran. Ccmpu|c-Aic Dcsign, VoI. 19, No. 1, pp. 83-97.
Kilano, M., Nakanishi, T. & Sugiyana, K. (2OO3). Negalive group deIay and superIuninaI
propagalion: an eIeclronic circuil approach, |||| ]curna| cf Sc|cc|c Tcpics in
Quan|um ||cc|rcnics, VoI. 9, No. 1, pp. 43-51.
Krauler, . & Mehrolra S. (1998). Layoul lased frequency dependenl induclance and
resislance exlraclion for on-chip inlerconnecl lining anaIysis, ILLL Design
Aulonalion Conference, Prccccings cf |nc ACM, pp. 3O3-3O8.
Lucyszyn, S., Rolerlson, I. D. & Aghvani, A. H. (1993). Negalive group deIay synlhesiser,
||cc|rcn. |c||., VoI. 29, pp. 798-8OO.
A New Technique of nterconnect Effects Equalization
by using Negative Group Delay Active Circuits 433

Moore, C. L. (1965). Cranning nore conponenls inlo inlegraled circuils, in ||cc|rcnics, VoI.
38, No. 8, pp. 114-117.
Munday, }. N. & Henderson, R. H. (2OO4). SuperIuninaI line advance of a conpIex audio
signaI, App|. Pnqs. |c||., VoI. 85, pp. 5O3-5O4.
Nakanishi, T., Sugiyana, K. & Kilano, M. (2OO2). Denonslralion of negalive group deIays in
a sinpIe eIeclronic circuil, Amcrican ]curna| cf Pnqsics, Issue 11, VoI. 7O, pp. 1117-
1121.
IaIil, A. K., Meyer, V., DuganapaIIi, K. K., Anheier, W. & SchIoeffeI, }. (2OO4). Tesl pallern
generalion lased on predicled signaI inlegrily Ioss lhrough reduced order
inlerconnecl nodeI, 16|n Icr|sncp Tcs| Mc|ncs an Rc|ia|i|i|q cf Circui|s an
Sqs|cms.
Ralay, }. M. (1996). DigilaI inlegraled circuils, a design perspeclive, |ng|cucc C|iffs, N}:
Irenlice-HaII.
RaveIo, . (Dec. 2OO8). Negalive group deIay aclive devices: lheory, experinenlaI
vaIidalions and appIicalions, Pn.D. |ncsis (in |rcncn), Lal-STICC, UMR CNRS 3192,
Universily of resl, Irance.
RaveIo, ., Ierennec, A. & Le Roy, M. (2OO7a). LquaIizalion of inlerconnecl propagalion
deIay vilh negalive group deIay aclive circuils, 11|n |||| Icr|sncp cn SP|, Cenova,
IlaIy, pp. 15-18.
RaveIo, ., Ierennec, A., Le Roy, M. & oucher Y. (2OO7l). Aclive nicrovave circuil vilh
negalive group deIay, |||| MIC |c||., VoI. 17, Issue 12, pp. 861-863.
RaveIo, ., Ierennec, A. & Le Roy, M. (2OO7c). Synlhesis of lroadland negalive group deIay
aclive circuils, |||| MTT-S Sqmp. Dig., Hcnc|u|u (Hauaii), pp. 2177-218O.
RaveIo, ., Ierennec, A. & Le Roy, M. (2OO8a). AppIicalion of negalive group deIay aclive
circuils lo reduce lhe 5O propagalion deIay of RC-Iine nodeI, 12|n |||| Icr|sncp
cn SP|, Avignon, Irance.
RaveIo, ., Ierennec, A. & Le Roy, M. (2OO8l). Negalive group deIay aclive lopoIogies
respecliveIy dedicaled lo nicrovave frequencies and laseland signaIs, ]curna| cf
|nc |uMA, VoI. 4, pp. 124-13O.
RaveIo, ., Ierennec, A. & Le Roy, M. (2OO9). LxperinenlaI vaIidalion of lhe RC-inlerconnecl
effecl equaIizalion vilh negalive group deIay aclive circuil in pIanar hylrid
lechnoIogy, 13|n |||| Icr|sncp cn SP|, Slraslourg, pp. 1-4.
RuehIi, A. & rennan, I. (1975). Capacilance nodeIs for inlegraled circuil nelaIIizalion
vires, ]. cf Sc|i-S|a|c |n|cgra|c Circui|s, VoI. 1O, No. 6, pp. 53O-536.
Sakurai, T. (1983). Approxinalion of viring deIay in MOSILT LSI, |||| ]. cf Sc|i S|a|c
Circui|s, VoI. 18, No. 4, pp. 418-425.
Sakurai, T. (1993). CIosed-forn expressions of inlerconneclion deIay, coupIing and crosslaIk
in VLSIs, |||| Tran. cn ||cc|rcn. Dcticcs, VoI. 4O, No. 1, pp. 118-124.
Siddiqui, O. I., Lrickson, S. }., LIeflheriades, C. V. & Mojahedi, M. (2OO4). Tine-donain
neasurenenl of negalive-index lransnission-Iine nelanaleriaIs, |||| Trans. MTT,
VoI. 52, No. 5, pp. 1449-1453.
Siddiqui, O., Mojahedi, M., Lrickson, S. & LIeflheriades, C. V. (2OO3). IeriodicaIIy Ioaded
lransnission Iine vilh effeclive negalive refraclive index and negalive group
veIocily, |||| Tran. cn An|cnnas an Prcpaga|icn, VoI. 51, No. 1O.
SoIIi, D. & Chiao, R. Y. (2OO2). SuperIuninaI effecls and negalive deIays in eIeclronics and
lheir appIicalions, Pnqsica| Rcticu |, Issue 5.
VLS 434

SlandIey D. & Wyall, }. L. }r., (1986). Inproved signaI deIay lounds for RC lree nelvorks,
V|S| Mcmc, No. 86-317, Massachusells Inslilule of TechnoIogy, Canlridge,
Massachusells.
VIach, }., arly, }. A., VanneIIi, A., TaIkhan, T. & Shi. C. }. (1991). Croup deIay as an eslinale
of deIay in Iogic, |||| Tran. Ccmpu|c-Aic Dcsign, VoI. 1O, No. 7, pp. 949-953.
Wang, L. }., Kuznich, A. & Dogariu, A. (2OOO). Cain-assisled superIuninaI Iighl
propagalion, Na|urc 406, pp. 277-279.
Wyall, }. L. }r., & Yu, Q. (1984). SignaI deIay in RC neshes, lrees and Iines, Prccccings cf |nc
|||| |n|crna|icna| Ccnfcrcncc cn Ccmpu|cr-Aic Dcsign, pp. 15-17.
Wyall, }. L. }r. (1985). SignaI deIay in RC nesh nelvorks, |||| Tran. Circui|s an Sqs|cms,
CAS-32(5), pp. 5O7-51O.
Wyall, }. L. }r. (1987). SignaI propagalion deIay in RC nodeIs for inlerconnecl, Circui|
Ana|qsis, Simu|a|icn an Dcsign, Par| ||. V|S| Circui| Ana|qsis an Simu|a|icn, A.
RuehIi, ed., VoI. 3 in lhe series Advances in CAD for VLSI, Norlh-HoIIand.
Yu, Q., Wyall, }. L. }r., Zukovski, C., Tan, H-N. & O'rien, I. (1985). Inproved lounds on
signaI deIay in Iinear RC nodeIs for MOS inlerconnecl, Prccccings cf |nc ||||
|n|crna|icna| Sqmpcsium cn Circui|s an Sqs|cms, pp. 9O3-9O6.


Book Embeddings 435


Book Embeddings

Sad ellayel
Unitcrsi|q cf Hcus|cn C|car |a|c
Hcus|cn, TX 77058, USA

1. Introduction

Craph enleddings pIay an inporlanl roIe in inlerconneclion nelvork and VLSI (Very Large
ScaIe Inlegralion) design. SinuIalion of one inlerconneclion nelvork ly anolher can le
represenled as a graph enledding prolIen. Delernining lhe nunler of Iayers required lo
luiId a VLSI chip, aIso caIIed look-enledding, is anolher appIicalion of graph enleddings.
In lhis chapler, ve expIore lhe Ialler prolIen. Afler an overviev of lhe resuIls on look
enledding of lhe hypercule, ve presenl resuIls on look enledding of lhe k-ary hyperule,
a varianl of lhe hypercule. We aIso presenl recenlIy ollained resuIls on lhe look
enledding of lhe lorus graph.
Craph enleddings have leen sludied in lhe Iileralure exlensiveIy for lhe inporlanl roIe il
pIays in inlerconneclion nelvork and VLSI (Very Large ScaIe Inlegralion) design (ernharl
& Kainen, 1979, ellayel el aI., 1989, ellayel & Sudlorough., 1992, ellayel &
Sudlorough, 1989, Chung el., 1987, Yannakakis, 1989). SinuIaling an inlerconneclion
nelvork, say A, ly anolher, say , is represenled as a graph enledding prolIen vhere lhe
nodes and edges of A are napped lo nodes and palhs of . A look enledding of a graph C
is lhe napping of lhe nodes of C onlo lhe spine of a look and lhe edges of C onlo pages so
lhal lhe edges assigned lo lhe sane page do nol cross. Delernining lhe nininun nunler of
pages required for such an enledding is lhe focus of lhis chapler. The nininun nunler of
pages in vhich a graph C can le enledded is caIIed lhe pagenunler of C, pg(C).
Delernining lhe pagenunler of an arlilrary graph has leen shovn lo le NI-conpIele
(Chung el aI, 1987, Yannakakis, 1989). Il renains NI-conpIele lo delernine if an arlilrary
graph can le enledded in lvo pages. In a 198O arlicIe, Carey el aI (Carey el aI., 198O)
proved lhal delernining lhe pagenunler of an arlilrary graph renains NI-conpIele even
if ve assune lhal lhe node enledding parl is fixed, i.e. lhe Iayoul of lhe nodes is given. An
equaIIy chaIIenging lask is lhe prolIen of delernining lhe pagervidlh or geonelric
lhickness of an arlilrary graph C vhich defined lo le lhe nininun nunler of Iayers in a
pIanar draving of an arlilrary C.
Researchers have leen dravn lo sludy lhis prolIen and varialions of lhis prolIen (Chung
el aI., 1987, CaIiI el aI., 1989, Yannakakis, 1989) lecause of ils nany and diverse appIicalions
such as fauIl loIeranl conpuling (Chung el aI., 1987), graph draving, and graph separalors
(CaIiI el aI., 1989). Olher prolIens renain lo le soIved such as lhe reIalionship of lhe
21
VLS 436
pagenunler of a graph and olher invarianls. Lnonolo and Miyauchi(Lnonolo & Miyauchi,
1999) considered lhe case vhere edges nay use nore lhan one page.
The pagenunler of a graph has slrong inpIicalions in VLSI design. The pagenunler is lhe
nininun nunler of Iayers required lo produce a VLSI chip. Anolher area of VLSI design
lhal can le descriled in lerns of look enledding is lhe configuring of processors in lhe
presence of fauIls. Civen an array of processors, sone of vhich nay le fauIly, ve Iay lhen
in a Iine. This couId le eilher physicaI or IogicaI. Running paraIIeI lo lhe Iine of processors
are lundIes of vires. As ve scan lhe Iine of processors, ve aclivale svilches connecling lhe
good processors lo a lundIe of vires and lypassing lhe lad processors. The lundIe of vires
acl Iike a slack in lhal, vhen a processor requesls a conneclion lo anolher processor, is
connecled lo a parlicuIar lundIe and pushes lhe olher processor conneclions dovn lo a
vire. When our scan reaches lhe processor lo vhich processor connecls, il is popped off
lhe lundIe since lhe vire is no Ionger needed, and lhe olher conneclions are relurned lo
lheir originaI posilions. The desired properly in lhis case, is lhe nininizalion of lhe nunler
of lundIes required lo inlerconnecl aII of lhe good processors in lhe desired Iayoul. This is
used in lhe Diogenes nelhod of fauIl loIeranl design as descriled ly Chung, Leighlon, and
Rosenlerg (Chung el aI., 1987). If ve lake each lundIe of vires and represenl il as a page,
ve have a look enledding. Chung, Leighlon, and Rosenlerg (Chung el aI., 1987) have
sludied lhe look enledding prolIen for a variely of graphs incIuding lrees, grids, X-lrees,
enes nelvorks, conpIele graphs, and lhe linary hypercule.

2. Definitions and TerminoIogy

Lel le an undirecled graph. A Iinear Iayoul of lhe verlices of C is a napping of
V lo lhe sel vhere . Lel and le edges in such lhal
and Then and inlersecl vilh respecl lo L if
. If, in lhe olher hand,

or


lhe edges and are said lo nesl vilh respecl lo .
A |cc| cm|cing, aIso referred lo as slack enledding, is a Iinear Iayoul of lhe verlices and
an assignnenl of lhe edges lo pages such lhal no page conlains a pair of edges lhal inlersecl.
The pagenunler of a graph, aIso referred lo as look lhickness, is lhe nininun nunler of
pages achieved anong aII possilIe look enleddings.
When each edge has lo le dravn as a slraighl Iine segnenl, lhe prolIen is referred lo as lhe
gccmc|ric |nic|ncss or rca| |incar |nic|ncss (Kainen, 199O, DiIIencourl el aI, 2OOO)
Anolher varialion sludied in lhe Iileralure is lhe so-caIIed qucuc cm|cing. A queue
enledding is a Iinear Iayoul of lhe verlices of a graph and an assignnenl of lhe edges lo
queues such lhal no queue conlains a pair of edges lhal nesls (ellayel el aI. 2O1O).

A d-dinensionaI lorus is lhe d-dinensionaI nesh vilh vraparound edges. The vraparound
edges connecl lhe firsl and Iasl verlex in each dinension. The nolalion


denoles lhe d-dinensionaI lorus vhere

is lhe size of lhe

dinension.
The linary hypercule of dinension denoled ly

has

verlices IaleIed ly lhe


linary represenlalion of inlegers lelveen O and

Tvo verlices are connecled ly an


edge if and onIy if lheir IaleIs differ in exaclIy one lil posilion. The k-ary hypercule of
dinension denoled ly

has

verlices IaleIed ly lhe k-ary represenlalion of


Book Embeddings 437
inlegers lelveen O and

Tvo verlices are connecled ly an edge if and onIy if lheir


IaleIs differ in exaclIy one posilion ly one (noduIo k).

3. Book Thickness of Graphs

In a 1979 arlicIe, ernharl and Kainen (ernharl & Kainen, 1979) inlroduced lhe prolIen of
look enledding. They proved lhal lhe pagenunler of a graph C, if and onIy C
is a sulgraph of sone pIanar graph. They aIso conjeclured lhal pIanar graphs have
unlounded pagenunler. Healh and IslraiI (Healh & IslraiI, 1992) disproved lhis conjeclure
and proved lhal lhe pagenunler of graphs vilh genus is (). The genus of a graph is
defined as foIIovs. A graph C can le enledded vilh no edge crossing on a conpacl
orienlalIe lvo-nanifoId surface on vhich a nunler of handIes have leen pIaced. A handIe
is used ly an edge lhal vouId olhervise cross olher edges. The genus of a surface is lhe
nunler of handIes on lhal surface. The genus of a graph is lhe nininun genus of aII
possilIe surfaces on vhich C can le enledded. So pIanar graphs have genus O since no
handIe is needed.
In (Kainen & Overlay, 2OO3, lhe aulhors sludied lhe pagenuunler in lerns of lIock-
culpoinl lree. A lIock-culpoinl graph of a graph is a liparlile graph
vhere There is a verlex for each lIock of C and a verlex for each
culpoinl in The verlices x and y are connecled ly an edge if and onIy if lhe lIock
represenled ly x conlains lhe culpoinl y. A graph is connecled if and onIy if ils lIock-
culpoinl is lree. They shoved lhal lhe pagenunler of a graph is lhe naxinun of lhe
pagenunler of ils lIocks. They aIso shoved lhal a graph is pIanar if and onIy if il
honeonorphic lo a graph vilh pagenunler al nosl lvo.

4. Book Embedding of PIanar Graphs

A graph has pagenunler if and onIy if , . consisls of isoIaled
verlices. Trees are shovn lo have pagenunler 1. A graph is said lo le oulerpIanar if il can
le dravn in lhe pIane so lhal aII of ils verlices Iie on lhe sane face. OulerpIanar graphs have
pagenunler one. The foIIoving lheoren is due lo Kainen and Overlay (Kainen & Overlay,
2OO3):
Tnccrcm. A graph C has pagenunler vilh verlex ordering

if and onIy

, vhere each

is an oulerpIanar graph enledded vilh verlex ordering


.
In (ernharl & Kainen, 1979), lhe aulhors conjeclured lhal pIanar graphs have unlounded
pagenunler lul lhis vas disproved ly Healh and IslraiI (Healh & IslraiI, 1992). uss and
Shor (uss & Shor, 1984) gave an aIgorilhn lhal enleds aII pIanar graphs in nine pages.
Healh and IslraiI (Healh & IslraiI, 1992)gave an inprovenenl lhal achieves seven pages.
Yannakakis (Yannakakis, 1989) lroughl lhe nunler lo four and shoved lhal indeed four
pages are sufficienl and necessary.

4.1 OutIine of Yannakakis AIgorithm
Lel C le a cycIe C of lhe graph C lhal conlains no exlernaI chords. Denole lhe verlices of C
ly . The edge is caIIed lhe Iong edge, and lhe resl of edges of C are caIIed
VLS 438
shorl edges. Lel C
c
le lhe graph ollained fron C ly adding ils chords. If I and I are lvo
inner faces of C
c
vhere lhe Iong edge of I is a shorl edge of I lhen aII verlices of I nusl
appear lelveen lvo conseculive verlices of I. Iirsl, Lel K le lhe cycIe lounding lhe inner
face of C
c
lhal conlains lhe Iong edge of C. Ils shorl edges nusl le lhe Iong edges of lhe
olher inner cycIes. Layoul lhe inlerior of K. Then, expand recursiveIy lhe inner cycIes.
As poinled oul in (Yannakakis, 1989), lhe verlices of a pIanar graph can le parlilioned inlo
IeveIs. AII edges connecl eilher verlices of lhe sane IeveI or verlices of adjacenl IeveIs. The
forner are caIIed |ctc|-cgcs and lhe Ialler |ining cgcs. LeveI O consisls of lhe verlices
forning lhe cycIe K. Laying oul lhe inlerior of K is acconpIished ly firsl Iaying IeveI 1
verlices and coIoring lhe shorl edges of K and lhe linding edges lelveen IeveIs O and 1.
Lxpand recursiveIy lhe cycIes forned ly IeveI 1 verlices.

5. Book Embedding of the Torus and the k-ary hypercube

In (ellayel & HoeIzenan, 2OO9), ve eslalIished lhe upper lound and Iover lound on lhe
pagenunler of lhe k-ary hypercule. The |-ary hypercule is a generaIizalion of lhe linary
hypercule. Il is defined as foIIovs. The |-ary hypercule of dinension is an undirecled
graph of |
d
verlices IaleIed ly lhe inlegers O lhrough |
d-1
. Tvo nodes x and y are connecled
if and onIy lhe k-ary represenlalions of lheir IaleIs differ in exaclIy one posilion ly one
noduIo k. We shoved lhal lhe pagenunler of lhe -dinensionaI -ary hypercule is al Ieasl


anu the uppei bounu is (

) vhere a =

. In (Chung el aI., 1987), Chung el aI.


shoved lhal lhe linary hypercule of dinension ,

adnils an page enledding


vhich is vilhin a faclor of 2 of oplinaI lecause lhe Iover lound can easiIy le shovn lo le

Hovever, Healh, Leighlon and Rosenlerg (Healh el aI., 1992) have shovn lhal lhe
pagenunler of lhe lernary hypercule of dinension has a Iover lound

. Il seens
lhal vhen k is even lhe pagenunler grovs IinearIy vilh lhe nunler of dinensions vhiIe il
grovs exponenliaIIy vhen k is odd. In (ellayel el aI., 1O2O) ve shov lhal lhe pagenunler
lhe k-ary hypercule depends on lhe parily of k. When k is even lhe pagenunler of


grovs IinearIy vilh lhe dinension lul vhen k is odd, lhe pagenunler grovs
exponenliaIIy vilh . Wilh lhe d-dinensionaI lorus

, if aII

, are
even lhe pagenunler is grovs IinearIy vilh lhe nunler of dinensions . In lhal paper, ve
aIso descrile a Iayoul lechnique vilh sequenliaI correclions of lhe order of verlices lhal
niligales lhe prolIen for lhe case vhen dinensions are odd. asicaIIy, lhe lechnique
nodifies lhe slandard Iayoul of aIlernaling Iefl-lo-righl and righl-lo-Iefl segnenls vilh an
anorlizalions of correclions lhal aIIov lhe reverse order lo le reaIized vilhoul paying lhe
penaIly of aII edges in lhe firsl-lo-Iasl vraparound conneclions lo le sinuIlaneousIu
crossing. This vouId reduce lhe nunler of pages.
RecenlIy, ve shoved lhal for any d-dinensionaI lorus lhe pagenunler is lounded ly

(ellayel el aI., 2O1O). Il had leen shovn ( ellayel & HoeIzenan, 2OO9, Chung el
aI., 1987, Yannakakis, 1989 ) lhal lhe pagenunler of a graph is al Ieasl as Iarge as lhe
nininun nunler of oulerpIanar graphs inlo vhich il can deconposed.
Ior lhe queue enledding prolIen, lhe queue nunler for lhe lorus grovs IinearIy in lhe
nunler of dinensions, regardIess of lhe parily of lhe sizes of ils dinensions. The queue
nunler of lhe k-ary hypercule

aIso grovs IinearIy vilh lhe dinension and does nol


depend on lhe parily of . Il is indeed
Book Embeddings 439
6. ConcIusion

In lhis chapler, ve presenled resuIls on look enledding and queue enledding of graphs.
The upper lound for lhe look-enledding of lhe lorus ve achieved is

. Il is
inleresling lo knov if lhis couId le inproved. Healh, Leighlon and Rosenlerg (Healh el aI.
1992) shoved lhal lhe lernary hypercule has a Iover lound

. Il foIIovs lhal lhe


Iover lound for lorus is exponenliaI in lhe nunler of dinensions vhen lhe sizes of ils
dinensions are odd. The aulhors in (Healh el aI., 1992) conjeclured lhal faniIy of graphs
vilh Iarge queue nunler and snaII page (or slack) nunler do nol exisl. In (ellayel el aI.,
2O1O), ve descrile a cIass of graphs, naneIy lhe -dinensionaI -ary nodified hypercules
vhich have pagenunler . We conjeclured lhal lhe queue nunler for such graphs grov
nore rapidIy lhan anu Iinear funclion of lhe dinension .

7. References

ernharl, I., Kanien I.C., (1979). The ook Thickness of a Craph, ]curna| cf Ccm|ina|cria|
Tnccrq, Series (27), 1979, pp. 32O-331.
ellayel, S., (1995). On lhe K-ary Hypercule. ]curna| cf Tnccrc|ica| Ccmpu|cr Scicncc 14O,
1995, pp. 333-339.
ellayel, S., Heydari, H., MoraIes, L., Sudlorough, I.H., (2O1O). Slack and Queue Layouls
for Toruses and Lxlended Hypercules. To appear
ellayel, S., HoeIzenan, D., (2OO9). Upper and Lover ounds on lhe Iagenunler of lhe
ook Lnledding of lhe k-ary Hypercule. ]curna| cf Digi|a| |nfcrma|icn
Managcmcn|, 7 (1), 2OO9, pp. 31-35.
ellayel, S., MiIIer, Z., Sudlorough, I.H., (1992). Lnledding Crids inlo Hypercules. ]curna|
cf Ccmpu|cr an Sqs|cm Scicnccs, 45 (3), 1992, pp. 34O-366.
ellayel, S., Sudlorough, I.H., (1989). Crid Lnledding inlo Ternary Hypercules. Prcc. Of
|nc ACM Scu|n Ccn|ra| Rcgicna| Ccnfcrcncc, 1989, pp. 62-64.
uss, }.I., Shor, I.W., (1984). On lhe Iagenunler of IIanar Craphs. Iroc. Of lhe 16
lh
AnnuaI
Synposiun on Theory of Conpuling, 1984, pp. 98-1OO.
Chung, I.R.K, Leighlonn I.T, Rosenlerg, A.L. (1983). DIOCLNLS: A MelhodoIogy for
Designing IauIl ToIeranl Irocessor Arrays. 13
|n
Ccnf. cn |au|| Tc|cran| Ccmpu|ing,
1983, pp. 26-32.
Chung, I.R.K, Leighlonn I.T, Rosenlerg, A.L. (1987). Lnledding Craphs in ooks: A
Layoul IrolIen vilh AppIicalion lo VLSI Design. S|AM ]curna| cf A|gc|raic
Discrc|c Mc|ncs, 8 (1), 1987, pp. 33-58.
Dean, A.M., Hulchinson, }.I. (1991). ReIalions anong Lnledding Iaranelers for Craphs.
Grapn Tnccrq, Ccm|ina|crics, an App|ica|icns, voI. 1, WiIey Inlerscience IulI., Nev
York, 1991, pp. 287-296.
Dean, A.M., Hulchinson, }.I. , Sheinernan, L.R. (1991). On lhe Thickness and Arloricily of a
Craph. ]curna| cf Ccm|ina|cria| Tnccrq Scrics 8, voI. 52, 1991, pp. 147-151.
DiIIencourl, M.., Lpslein, D., Hirschlerg, D.S. (2OOO). Ceonelric Thicknes of ConpIele
Craphs. ]curna| cf Grapn A|gcri|nms an App|ica|icns. voI. 4, 2OOO, pp. 5-17.
Lnonolo, H., Miyauchi, M.S., (1999). Lnledding Craphs inlo a Three Iage ook vilh O(M
Iog N) crossings of edges over lhe spine. S|AM ]curna| cf Discrc|c Ma|n. 12, 1999, pp.
337-341.
VLS 440
Healh, L.S, IslraiI, S. (1992). Conparing Queues and Slacks As Mechanisns for Laying oul
Craphs. ]curna| cf |nc Assccia|icn fcr ccmpu|ing Macnincrq, 39, 1992, pp. 479-5O1.
Healh, L.S, Leighlon, I.T., Rosenlerg, A.L. (1992). Conparing Queues and Slacks As
Mechanisns for Laying oul Craphs. S|AM ]curna| cf Discrc|c Ma|ncma|ics, 5 (3),
1992, pp. 398-412.
Kainen, I.C., Overlay, S. (2OO3). ook Lnleddings of Craphs and a Theoren of Whilney.
Tccnnica| Rcpcr| GUGU-2/25/03, n||p.// uuu.gccrgc|cun.cu/facu||q/|aincn/p|ip3.pf .
Kainen, I.C. 199O. The ook Thickness of a Craph. Ccngr. Numcr., 71, 199O, pp. 127-132.
Yannakakis, M. (1989) Lnledding IIanar Craphs in four Iages., }ournaI of Conpuler and
Syslen Sciences, 38 (1), 1989, pp. 36-67.

VLS Thermal Analysis and Monitoring 441
X

VLSI ThermaI AnaIysis and Monitoring

Ahned Lakhssassi and Mohanned ougalaya
Unitcrsi|c u Quc|cc cn Ou|acuais
Canaa

1. Introduction

The rapid evoIulion in lhe induslry of lhe inlegraled circuils during lhe Iasl decade vas so
quick lhal currenlIy il is possilIe lo inlegrale conpIex syslens on a singIe SoC (Syslen on a
Chip). Due lo aggressive lechnoIogy scaIing, VLSI inlegralion densily as veII as pover
densily increases draslicaIIy. As lhernaI phenonena research aclivilies on nicro-scaIe IeveI
are gaining popuIarily due lo an alundance of SoC and MLMS-lased appIicalions, various
neasurenenl lechniques are needed lo undersland lhe lhernaI lehaviour of VLSI chip. In
parlicuIar, neasurenenl lechniques for surface lenperalure dislrilulions of Iarge VLSI
syslens are a highIy chaIIenging research lopic. Hence, surface peaks lhernaI deleclion is
necessary in nodern VLSI circuils, lheir inlernaI slress due lo packaging conlined vilh
IocaI seIf healing lecones serious and nay resuIl in Iarge perfornance varialion, circuil
naIfunclion and even chip cracking.

One of lhe inporlanl queslions in lhe fieId of lhernaI issues of VLSI syslens and nicro-
syslens is hov lo perforn lhe lhernaI noniloring, in order lo indicale lhe overhealing
silualions, vilhoul conpIicaled conlroI circuils. TradilionaI approach consisls of pIacenenl
of nany sensors everyvhere on lhe chip, and lhen lheir oulpul can le read sinuIlaneousIy
and conpared vilh lhe reference voIlage recognized as lhe overhealing IeveI.

The idea of lhe proposed nelhod is lo predicl lhe IocaI lenperalure and gradienl aIong lhe
given dislance in a fev pIaces onIy on lhe nonilored surface, and evaIuale ollained
infornalion in order lo predicl lhe lenperalure of lhe heal source. Therefore, in lhe case of
an SoC devices, lhere is no pIace on lhe Iayoul for lhe conpIicaled unil perforning
conpulalions, lul lhere is aIso no need for il, as ve onIy vanl lo delecl lhe overhealing
silualions. These peaks are essenliaIs during lhernaI die noniloring lo avoid a crilicaI
induced lherno-nechanicaI slress. Moreover, in nosl cases, lhe overhealing occurs onIy in
one pIace.

Due lo aggressive lechnoIogy scaIing, VLSI inlegralion densily as veII as pover densily
increases draslicaIIy. Ior exanpIe, lhe pover densily of high perfornance nicroprocessors
has aIready reached 5OW/cn
2
al 1OOnn lechnoIogy and il viII reach 1OOW/cn
2
al 5Onn
lechnoIogy (ITRS, 2OO3). This evoIulion lovards higher inlegralion IeveIs is nolivaled ly
lhe needs of advanced high perfornance, Iighler and nore conpacl syslens vilh Iess
22
VLS 442

pover consunplion. MeanvhiIe, lo niligale lhe overaII pover consunplion, nany Iov
pover lechniques such as dynanic pover nanagenenl (Wu el aI., 2OOO), cIock galing (Oh &
Iedran, 2OO1), voIlage isIands (Iuri el aI., 2OO3), duaI V
dd
/V
lh
(Srivaslava el aI., 2OO4) and
pover galing (Kao el aI., 1997), (Long & He, 2OO3) are proposed recenlIy. These lechniques,
lhough heIpfuI lo reduce lhe overaII pover consunplion, nay cause significanl on-chip
lhernaI gradienls and IocaI hol spols due lo differenl cIock/pover galing aclivilies and
varying voIlage scaIing. Il has leen reporled in (Cronovski el aI., 1998) lhal lenperalure
varialions of 3O

C can occur in a high perfornance nicroprocessor design. The nagnilude of


lhernaI gradienls and associaled lherno-nechanicaI slress is expecled lo increase furlher as
VLSI designs nove inlo nanoneler processes and nuIli-CHz frequencies.

NeverlheIess, lhe grovlh of pover densily dissipaled lroughl a nunler of crilicaI lherno-
nechanicaI prolIens. Thus lhe heal produced in lhe slruclure of lhe VLSI devices is
direcled lovards ils edges vhere il is dissipaled ly radialion, conduclion, or conveclion.
The principaI effecl of lhe alsence of a good dynanic lhernaI nanagenenl is lhe graduaI
and conlinuous degradalion of lhe quaIily of perfornance as veII as sone olher direcl
effecls on lhe Iife cycIe of lhe eIeclronic syslens (uedo el aI., 2OO2). Thus, an aIgorilhn for
lhe deleclion and lhe IocaIizalion of lhe lhernaI peaks is exlreneIy inporlanl in order lo
nanage lhe lhernaI slress on high densily seniconduclor devices. These peaks are
essenliaIs during lhernaI die noniloring lo avoid a crilicaI induced lherno-nechanicaI
slress.

This sludy presenls a VLSI lhernaI peak noniloring approach, using CDS, for deveIopnenl
of SITDA aIgorilhn. The proposed aIgorilhn using onIy lvo sensor ceIIs viII le fornuIaled
in a nanner lo faciIilale lhe deveIopnenl of noduIar archileclures using nininun siIicon
space in regards of lheir inpIenenlalion in VLSI. The archileclure seIecled viII le nodeIIed
in high IeveI Ianguages, sinuIaled in order lo evaIuale ils perfornances, and lhen
inpIenenled on a IICA (IieId-IrogrannalIe Cale Array). A cIosed Ioop of sinuIalion is
used in order lo evaIuale lhe perfornances of lhe archileclures proposed al each slage. The
proposed archileclure of lhe aIgorilhn viII le designed in a noduIar perspeclive afler lhe
separalion of lhe differenl eIenenlary funclions of lhe aIgorilhn. Hence lhe design of
SITDA vilh fIexilIe noduIar-lased archileclure viII le presenled. The archileclure is
designed in high IeveI Ianguages such as MalIal - SinuIink, sinuIaled, lesled using
VHDL and synlhesized using XiIinx ISL (XiIinx, 2OO9) and AIlera DSI (AIlera, 2OO9)
looIs. The sinuIalion and hardvare inpIenenlalion resuIls ollained viII le conpared lo a
finile eIenenl nelhod (ILM) lenperalure prediclion of lhe enlire CDS nelhod
configuralion ceIIs.

This chapler viII presenl a packaged VLSI lhernaI anaIysis ly ILM, lhernaI noniloring
approach using CDS (Cradienl Direclion Sensors) nelhod. AIso lhe design of surface peaks
lhernaI deleclor aIgorilhn (SITDA) vilh fIexilIe noduIar-lased archileclure viII le
presenled.

VLS Thermal Analysis and Monitoring 443

2. ThermaI monitoring by Gradient Direction Sensors method

This parl presenls lhe firsl approach lo lhe nelhod of predicling lhe lenperalure of a singIe
heal source on lhe VLSI device surface. The second approach used concern prediclion of lhe
lenperalure gradienl aIong lhe given dislance in a fev pIaces onIy on lhe nonilored chip-
surface. The infornalion ollained viII le used lo evaIuale lhernaI peak posilion in order lo
achieve lhe lenperalure of lhe heal source. This neans lhal ve knov lenperalure vaIues in
sone pIaces on lhe chip and ve lry lo find lhe lenperalure vaIues of lhe heal sources and
evenluaIIy predicl lherno-nechanicaI slress dislrilulion in lhe vhoIe slruclure.

The proposed aIgorilhn is lased on lhe CDS nelhod for evaIualing a singIe heal source on
lhe chip surface. This nelhod principIe of vork is expIained in delaiIs in (Wjciak &
NapieraIski, 1997). Ior lvo sensors, A and C pIaced in lhe dislance a (Iig. 1), lhe difference
lelveen lheir oulpul voIlages is proporlionaI lo lhe changes of lhe lenperalure vaIue aIong
lhe dislance a (Wjciak & NapieraIski, 1997). This is lrue onIy vhen lhe heal source is
direclIy on lhe Iine AC for any olher cases lhe vaIues of lhe angIe has lo le laken inlo
accounl for lhe proper caIcuIalion of AT (1).
o o cos . cos . a
J J
a
T T
r
T
A C A C

=
A
A
(1)
vhere r is lhe dislance fron lhe heal source. In figure 1, ve have:

A B A B
J J T T AD AD b = ,
(2)
A C A C
J J T T AE AE c b = + ,
(3)
In order lo ollain lhe infornalion aloul angIe o ve shouId appIy lhe lhird sensor. In lhe
sinpIesl case lhe CDS conlains onIy lhree lenperalure sensors pIaced in dislance a (fig. 1)












Iig. 1. The 3 sensors ceII o e (O, 3O) (Wjciak & NapieraIski, 1997)

On lhe lasis of eqs. (2) and (3) and lhe geonelricaI dependencies fron (fig. 1), eq. (4) is
ollained:
|
|
.
|

\
|

2
1
.
3
2
tan
A C
A B
J J
J J
o
(4)
Using lhe ceII ve can ollain infornalion on lhe lenperalure dislrilulion and parlIy on lhe
posilion of lhe heal source. In order lo ollain lhe lenperalure vaIue of a singIe puncluaI
heal source ve have lo caIcuIale lhe dislance lelveen lhe sensor and lhe heal source.
VLS 444














Iig. 2. IrolIen descriplion and lhe dislrilulion of lhe sensors ceIIs

Tvo sensor ceIIs are required for lhis purpose (fig2). The ceIIs are pIaced in a given dislance
(H) and each of lhen gives infornalion aloul lhe angIe (1 and 2) in lhe direclion of lhe
heal source. Hence, lhe heal source and ceIIs fron a lriangIe in vhich lhe Ienglh of one side
and vaIues of lhe angIes adjacenl lo lhis side are knovn. This neans lhal ve can caIcuIale
lhe dislances lelveen lhe heal source and sensors. Nov ve can caIcuIale lhe lenperalure
gradienl aIong lhe knovn dislance. y adding il lo lhe lenperalure of lhe sensor ve ollain
lhe lenperalure of lhe heal source. Tvo sensor ceIIs A,,C and D,L,I are pIaced in lvo
corners of a nonilored Iayoul in lhe dislance H. Hence, lhe lenperalure of lhe heal source
can le ollained ly equalion (5).
( )
( ) ( )
( ) ( )
A A C S S
J J J
a
H
J T +
+
+ +
=
2 1 2 1
2 1
2
tan tan tan tan 1 3
tan 3 1 tan
o o o o
o o
(5)
Iigure 2 shovs lhe proposed dislrilulion of lhe 6 sensors divided inlo 2 ceIIs Iocaled
vhelher on or oulside lhe chip.

3. AIgorithmic Design MethodoIogy

3.1 Presentation of aIgorithmic division
The generaI fIovcharl of lhe proposed SITDA aIgorilhn is shovn in figure 3. Il shovs sone
paraIIeIisn in conpulalion lelveen lhe inlernaI varialIes represenled ly lan1 and lan2.
This paraIIeIisn viII le used Ialer in lhe archilecluraI design in order lo oplinize lhe speed
of lhe running inpIenenlalion. Since successive evaIualion of heal source lenperalure is
needed for fasl heal-source changes. The aIgorilhn viII creale 2 lripIels of vaIues coning
forn lhe deleclors. These 2 lripIels viII le used lo caIcuIale lhe langenls vaIues and eslinale
R
1
, R
2
and lhe lenperalure of lhe heal source. y dividing lhe aIgorilhn inlo ils lasic
arilhnelic eIenenlary operalions (ALO) ve viII le alIe lo fornuIale ils dala fIov graph in
order lo nodeI lhe aIgorilhn.





D


L


I


A





C


HealfIov fron
LquivaIenl
singIe
R2
R1
H

CeII 2 (3 sensors)

Isolh A


Isolh D


Isolh L

Isolh


Isolh C


Isolh I



CeII 1 (3 sensors)
heal source
VLS Thermal Analysis and Monitoring 445





















Iig. 3. IIovcharl of lhe SITDA aIgorilhn

3.2 Architecture modeIing using SimuIink

efore nodeIIing lhe syslen using XiIinx ISL

or AIlera DSI

looIs ve have lo perforn a


quanlificalion anaIysis in order lo delernine lhe conlinalion of fixed lils lhal viII Iead lo
lhe nininun quanlificalion error. Due lo lhe fixed nunler of lils in lhe fixed poinl
represenlalion, overfIov and underfIov prolIens are serious and connon. Thus, a sel of
sinuIalions is required lo delernine lhe lesl fixed poinl represenlalion. Iigure 4 shov lhal
a quanlificalion conlinalion of a signed 11.4 and 12.4 lils had greal differences vilh lhe
nain funclion, 13.4 vas lhe lurning poinl, and increasing lhe nunler of lils presenled no
greal enhancenenl lo lhe resuIl. Hovever, il is inporlanl lo nole lhal lhe quanlificalion
process induce a cerlain Ioss of precision. The fixed poinl specificalion has lo le appIied lo
every operalion and even every conslanl lIock in a nodeI.














Input : values reported
by detectors.
Computation oI tan1
and tan 2.
Computation
oI r2
Estimation oI Ts
Thresholds (Tmax, Length, Width)
Loop
Delay
Delay
Computation
oI r1
VeriIication Module
Initialisation Module
VLS 446

















Iig. 4. ResuIls generaled using differenl fixed-poinl represenlalions

4. ThermaI anaIysis and computationaI resuIts

4.2 WSI physicaI structure
In lhis parl ve viII presenl sleady slale and lransienl lhernaI anaIysis of WSI (Wafer ScaIe
Inlegralion) chip junclion ly ILM approach. Il viII le used lo luiId nodeIs lo vaIidale
lhernaI peaks prediclion ly CDS nelhod. Hence, lhe geonelricaI coordinales of lhe
invesligaled source can le ollained ly appIying lhe gradienl direclion sensors. This vay,
lhe possiliIilies lo nininize lhe lhernaI peaks in lhe crilicaI surface areas for CA (aII
Crid Array) packaged WSI devices can le expIored.










Iig. 5. Cross seclion viev of lhe sinuIaled WSI slruclure








0 20 40 60 80 100 120 140 160 180 200
-150
-100
-50
0
50
100
150
200
250
Matching intervals
T
e
m
p
e
r
a
t
u
r
e
Different fixed point representations
[15].[4]
[14].[4]
[12].[4]
[11].[4]
[13].[4]
X
Z
A : Substrat level
B : Sold
C : Chip level
D : Molding compound level
E :AL level
Convection coeIIicient
Heat source
VLS Thermal Analysis and Monitoring 447












Iig. 6. IhysicaI posilion of ceII sensors and heal sources on lhe ILM nodeI

As iIIuslraled in figure 5 (dinensions nol respecled), lhe WSI device sludied is nuIliIeveI
slruclure vilh sinpIe Si (siIicon) sulslrale covered vilh differenl Iayers, AI (AIuninun),
soIder laII, and noIding conpound. The packaging assenlIy vas a ceranic CA. Once lhe
geonelry of lhe device had leen delernined and lhe heal lransfer nechanisns quanlified, il
vas possilIe lo nodeI lhe syslen using finile eIenenl anaIysis. Using lhe conpuler code
NISA (NunericaI Inlegraled LIenenls for Syslen AnaIysis), a 3-D nodeI vas crealed vilh
nore lhan 1OO OOO isoparanelric lhernaI sheII eIenenls. Ior lhe healing conpulalions, lhis
eIenenl nodeIs lhe 3-D slale of heal fIov. Ior lhe lhernaI parl, lhe eIenenl has lhe
lenperalure (T) as lhe onIy degree of freedon al each node. Iigure 6 shovs heal sources
and sensors enpIacenenl in lhe surface of processor. Thal viII enalIe us lo eslalIish in-silu
sensor nelvork lo achieve lhe nosl honogeneous lherno-nechanicaI carlography. In lhis
sludy ve presenl a case of WSI slruclure vilh inlernaI heal generalion. This configuralion
viII enalIe us lo sinuIale device inlense aclivily and lo conslrucl lhernaI conlroI unil lhal
viII ensure a suilalIe cooIing. Moreover, ve have lo nake sure lhal lhe lenperalure device
slruclure varialion renains suilalIe for appropriale induced lherno-nechanicaI slress.
Iigure 7 shovs posilion of ceII sensors on WSI device for lenperalure and slress resuIls.










Iig. 7. Schenalic posilion of sensors ceII and heal source on WSI device

4.3 WSI ThermaI monitoring
In (Lakhsasi el aI., 2OO6) a nelhod has leen presenled in delaiIs, vhich can le used for
nixed fIuid-heal lransfer approach for VLSI sleady slale lhernaI anaIysis used lo anaIyze IC
package prolIen. Hovever, during lhernaI anaIysis lhe lenperalure of lhe chip is
delernined for lypicaI packages and pover IeveIs using fIuid loundary condilions lo
X1
X2
X3
X4
X
Y
2 sensor ceIIs Heal source
A
C
B
E
D
F
HS1
R1
R2
HS2
VLS 448

evaIuale equivaIenl conveclion coefficienl lo le appIied as a lhernaI junclion Cs.
Therefore, lhe knovIedge of lhe conpIele lenperalure fieId inpIies lhe knovIedge of lhe
lenperalure gradienls al aII lines, vhich is of significance for reIialiIily issues. Ior lhe Iarge
VLSI device dissipaling nuIlipIe IocaIized heal sources lhe singIe heal source can le
considered. As shovn in figure 8, fasl varialions of lenperalures in space and in line
infIuence lhe IocaI peak lenperalure. Hence, in lhe case of excessive healing or Iarge
anounl of pover dissipaled during shorl line, lhe conpIele lransienl finile eIenenl
conpulalions are slrongIy reconnended.










Iig. 8. VLSI device lransienl lhernaI anaIysis dissipaling nuIlipIe IocaIized heal sources


Iig. 9. VLSI device sleady slale lhernaI anaIysis dissipaling singIe IocaIized heal source

In lhis sludy, invesligalions are done for lhe sinpIesl case, onIy six lenperalure sensors
(A,,C,D,L and I, Iigure 2) in lhe forn of lvo sensor ceIIs and one singIe pover healing
source, in order lo vaIidale prediclion vilh lhe 3-D ILM nodeI. The sinuIalions have leen
carried oul for a one source pIaced al lhe junclion surface IeveI. As expecled, lhe peak
lenperalure profiIe is Iocaled al lhe cenlre of lhe heal source (figure 9). There is sulsequenl
reIaxalion on lenperalure gradienls lhrough lhe slruclure Ieading lo an essenliaIIy uniforn
lenperalure varialion T. The sensor ceIIs can le pIaced in any vay oul of lhe nonilored
area (differenl dislance H lelveen ceII 1 and ceII 2), lul in sone cases adequale pIacenenl
can sinpIify lhe lhernaI conlroI unil design. In lhis parl lhe resuIls of lhernaI peaks can le
very usefuI for indicaling overhealing silualions and crilicaI lherno-nechanicaI slress
occurring in lhe device slruclure. Hence, lalIe 1 dispIay a conparison lelveen lhe
lenperalures peaks on surface vilh differenl inpIenenlalion under lhe sane condilions.
VLS Thermal Analysis and Monitoring 449

Dctcctnrs Va!ucs (
n
C)
Dctcctnr 5ct 1 5ct 2 5ct 3
V
A
68.655 98.655 75.324
V

7O.248 1OO.248 76.324


V
C
71.325 1O1.325 77.O55
V
L
69.365 99.365 75.6O3
V
I
7O.325 1OO.325 76.54O
V
C
72.318 1O2.318 78.649
Tcmpcraturc pcaks cstImatcd nn surfacc (
n
C)
EstImatcd by 5ct 1 5ct 2 5ct 3
ILM 82.115 115.213 84.632
SITDA IIoal 85.32O 115.3OO 85.73O
SITDA Iix 84.O6O 114.1OO 84.19O
SITDA AIlera 85.63O 115.6OO 85.31O
Thus, lhe ILM resuIls ollained (Iigure 1O and 11) are in fuII agreenenl vilh lhe CDS
prediclions.

In lhis sludy ve use NISA progran lo conslrucl 3D lherno-nechanicaI nodeI of WSI
device. Iurlhernore, lhe appIicalion of lhe finile eIenenl nelhod lo delernine lhe gradienl
of lenperalure lhal arises in WSI slruclure is nol aIvays sinpIe especiaIIy for nuIli
inlerconneclion conponenls such as laII grid array and fIip-chip packages. This lhernaI
invesligalion is lased on pover Ioss densily dislrilulion (lhernaI carlography) conlined
vilh sleady slale finile eIenenl anaIysis lo predicl spaliaI lherno-nechanicaI slress al
differenl Iocalion in lhe WSI slruclure. Therefore, anaIylicaI CDS lhernaI prediclion viII le
conpared vilh ILM nelhod conpulalions.















TalIe 1. Tenperalure peaks conparisons ly ILM and SITDA

During inpIenenlalion lhe fixed-poinl represenlalion of lhe SITDA aIgorilhn vas used.
Afler inpIenenlalion, lhe sane inpul is direcled lo lhe aIgorilhn sinuIlaneousIy in
SinuIink

and on lhe IICA loard. The resuIls are rouled lack lo SinuIink

, nuIlipIexed,
and projecled on lhe sane 2-D graph in order lo conpare lhe oulpul. As nany sinuIalions
and co-sinuIalions have preceded lhe inpIenenlalion, lhe resuIl vas expecled lo le lhe
sane as lhe conparison lelveen lhe fIoaling and lhe fixed poinl conparison. An oplinaI
frequency cIose lo 1OO MHZ has leen achieved. Iurlhernore, lhe VHDL T (Tesl ench)
vas conslrucled and lhe force-fiIe vas used lo slinuIale inpuls and lo conpare vilh lhe
aIgorilhn prediclions. Hence, lhe sel of sinuIalions reveaIed lhal lhe eslinalions generaled
ly lhe SITDA aIgorilhn presenled a greal concordance vilh lhe prediclions generaled ly
lhe Iinile LIenenl Melhod (ILM) presenled in (Lakhsasi el aI., 2OO6, ougalaya el aI., 2OO6).




VLS 450


Iig. 1O. Tenperalure dislrilulion according lo A-C-axis |C
o
j















Iig. 11. Tenperalure dislrilulion according lo -axis |C
o
j

4.4 ThermaI boundary conditions issue
One of lhe nosl chaIIenging issues in crealing conpacl lhernaI nodeIs is lo use an
appropriale sel of loundary condilions for generaling 'dala vilh a delaiIed finile eIenenl
nodeI represenling lhe lhernaI enveIope of lhe appIicalion. The lhernaI anaIysis depends
on:
- The cooIing oplion (radialors lop and/or lollon) appIied,
- The Iocalion/vicinily and pover of ils heal-dissipaling neighlours,
- The lhernaI conduclivily of lhe VLSI naleriaIs: IC, heal sink, package, sulslrale
and heal spreader.
In lhis sludy ve use lhe NISA finile eIenenl progran lo predicl lhernaI lehaviour of lhe
VLSI device slruclure. A vide variely of loundary condilions can le appIied using lhe ILM
soflvare. Hovever, lhe loundary condilion on lhe verlicaI sides of lhe sinuIalion region is
sonevhal prolIenalic. IIacing a fixed loundary condilion on lhese surfaces produces an
VLS Thermal Analysis and Monitoring 451

incorrecl resuIl, unIess a very Iarge sinuIalion region is used al lhe expense of very Iong
sinuIalion run lines. A nore naluraI loundary condilion is a zero fIov condilion across
lhese liny surfaces (adialalic loundary condilions) as shovn in figure 12. The renaining
loundary condilions lo le defined are on lhe lollon and lop surfaces of lhe VLSI device,
represenling lhe heal sink inlerfaces. ecause lhe VLSI devices is reIaliveIy lhin and siIicon
and soIder are good lhernaI conduclors, heal fIovs happens nainIy in lhe verlicaI direclion,
so lhe loundary condilions in lolh horizonlaI direclions can le considered adialalic. The
uniforn heal renovaI al lhe lollon and lop is nodeIIed ly heal fIux exchange coefficienl n
|W/n
2
*
o
Kj. The pover dissipaled ly lhe device is nodeIIed ly heal fIux produced inlo lhe
conponenls. The prolIen fornuIalion is presenled graphicaIIy in figure 12.












Iig. 12. VLSI device: inside lhernaI loundary condilions (TCs) (C).

5. ThermaI stress prediction

If ve knov lhe peak lenperalure aIong vilh naleriaIs properly and loundary condilions
ve can evaIuale lhe lhernaI slress using lhe proposed approach. The aIgorilhn presenled
herein nighl prove lo le praclicaI conparaliveIy vilh lhernaI slress sensors nelhods
especiaIIy for appIicalions vhere lhe lenperalure has lo le knovn onIy in a Iiniled nunler
of poinls, e.g. lhe delerninalion of hol spol lenperalure or on-Iine lenperalure noniloring.
The lhernaI slress (excIuding lhe inlrinsic one induced ly packaging process) in lhin fiIns is
given ly lhe foIIoving expression:

lh
= ( L / (1 - )) .T

Where L / (1 - ) is lhe conposile eIaslic conslanl for lhe differenl Iayers and lhe
difference in lhe coefficienls of lhernaI expansion (CTL) lelveen differenl IeveI of
packaging (Koleda el aI., 1989). In (Lakhsasi & Skorek, 2OO2) a nelhod has leen presenled
in delaiIs for caIcuIalion of lhe conpressive slress in lhe siIicon IeveI given ly:

xx, nax
~ -9.63 ( L)Si .T
in
(Koleda el aI., 1989)

xx, nax
~ -9.63 (O.15 x 2.8)Si .(1O2.3-81.1)

xx, nax
~ -8.574 MIa

Hovever, lhe induced conpressive lhernaI slress viII le conlined lo lhe inlrinsic one due
lo lhe falricalion processes, and lhe slress due lo lhe nechanicaI cIanping nechanisn.
Hcat f!ux
(5urfacc Pnwcr
dcnsIty)
Hcat snurcc
AdIabatIc
AdIabatIc
AdIabatIc
AdIabatIc
Cnn!Ing
Hcat f!ux
(5urfacc Pnwcr
dcnsIty)
Hcat snurcc
AdIabatIc
AdIabatIc
AdIabatIc
AdIabatIc
Cnn!Ing
VLS 452

TalIe 2 gives lhe lherno nechanicaI paranelers for differenl naleriaIs used in lhe
conpulalion.

TalIe 2. Device MaleriaIs lherno-nechanicaI Iroperlies

M e l h o d Dislance CeII 1-
ceII 2, H (n)
Sour ce Tenp
Tnax (
o
C)
Dislance CeII 1-
source 1 (n)
Dislance CeII 2-
source 1 (n)
AnaIylicaI
resuIls ly CDS
8OOO 1OO.2 1O376 1O8O9
3-D ILM nodeI
resuIls.
8OOO 1O2.3 1O148 1O984
TalIe 3. ResuIls conparison lelveen CDS nelhod and 3-D ILM nodeI

The nodaI lenperalures slored in lhe lhernaI parl vere used lo perforn a lhernaI slress
anaIysis of lhe device slruclure. Il is assuned lhal lhe slruclure is slress-free al 25
o
C. The
presence of a lenperalure varialion lhroughoul lhe device slruclure causes defornalion due
lo lhernaI conlraclion and expansion. A lhernaI slress anaIysis is perforned lo caIcuIale
lhese defIeclions and associaled slresses due lo lhe lhernaI Ioading. The conpulalion is
exlended lo lhe vhoIe voIune of lhe device slruclure.

The presenl approach nay le suilalIe for Iarge VLSI devices packaging, lecause il
polenliaIIy aIIovs lhe conlinalion and inlegralion of lolh lhernaI and nechanicaI conlroI,
in a singIe unil. As an exanpIe, lhe naxinun IeveI of slress generaled ly inlense healing of
WSI devices have leen evaIualed and exanined ly CDS nelhod. During design and finaIe
packaging of Iarge VLSI devices, in-silu lherno-nechanicaI conlroI unile nusl le
inpIenenled lo ensure lhe safe fuIfiIInenls of lheir operaling condilions. The nalure of such
inlerface naleriaIs nusl lherefore le considered very carefuIIy during WSI design and
packaging. Iurlher research shouId le focused on eIaloraling soflvare looIs for
oplinizalion of lenperalure sensor posilions vilhin lhe aIIoved area for an IC designer.

Iigure 13 and 14 shovs lhernaI slress profiIe according lo X
1
-axis. The peak conpressive
slress of
xx
= -8.5 MIa is reporled around lhe region of lhe heal source.



MaleriaIs Chip
(die)
M

T
Sulslrale
SoIder aII AI Die
allach
Youngs ModuIus
(CIa)
131 16 26(xy), 11(z) 17 7O 1OO
Ioisons Ralio O.3 O.25 O.39(xz.yz), O.11(xy) O.4 O.33 O.25
CTL ppn/
O
C 2.8 15 15(xy), 52(z) 21 22.4 O.8
ThernaI
Conduclivily
W/n.
O
C
15O 65 O.3 5O 16O 33
Densily Kg/n
3
233O 166O 166O 846O 27OO 33OO
Specific heal }/kg.

O
C
712 1672 1672 957 96O 11OO
VLS Thermal Analysis and Monitoring 453




















Iig. 13. VLSI lhernaI slress profiIe according lo X
1
-axis,
xx
= -8.5 MIa Maxinun























Iig. 14. VL VLSI lhernaI slress profiIe according lo X
2
-axis,
xx
= -3.3 MIa Maxinun

VLS 454

VLSI Iayers are very lhin and any inperfeclion in slruclure nay Iead lo cracking and
sulsequenl shear-inilialed deIaninalion al lhe siIicon IeveI. The nalure of such inlerface
naleriaIs nusl lherefore le considered very carefuIIy in VLSI devices design for inlense
appIicalions. Depending on lhe lechnoIogy requirenenls and definilion of faiIure, lhe
nechanisn of faiIure nay lake severaI forns.

6. Discussion

One of lhe inporlanl queslions in lhe fieId of lhernaI issues of VLSI syslens and
nicrosyslens is hov lo perforn lhe lhernaI noniloring, in order lo indicale lhe overhealing
silualions, vilhoul conpIicaled conlroI circuils. TradilionaI approach consisls of pIacenenl
of nany sensors everyvhere on lhe chip, and lhen lheir oulpul can le read sinuIlaneousIy
(SzekeIy, 1994) and conpared vilh lhe reference voIlage recognized as lhe overhealing
IeveI.
The idea of lhe proposed aIgorilhn is lo predicl lhe IocaI lenperalure and gradienl aIong
lhe given dislance in a fev pIaces onIy on lhe nonilored surface, and evaIuale ollained
infornalion in order lo predicl lhe lenperalure of lhe heal source. Therefore, in lhe case of
an SoC devices, lhere is no pIace on lhe Iayoul for lhe conpIicaled unil perforning
conpulalions, lul lhere is aIso no need for il, as ve onIy vanl lo delecl lhe overhealing
silualions. These peaks are essenliaIs during lhernaI die noniloring lo avoid a crilicaI
induced lherno-nechanicaI slress. Moreover, in nosl cases, lhe overhealing occurs onIy in
one pIace.
In lhis chapler, a nelhodoIogy lo evaIuale and predicl a lhernaI peak of Iarge VLSI circuil
vas presenled. The inporlanl faclors conlriluling lo lhe device's lhernaI healing vere
characlerized. The noniloring approach reporled in lhis paper can le appIied lo predicl lhe
lhernaI slress peak of nuIliIeveI slruclures. Surface peaks lhernaI deleclion is necessary in
nodern VLSI circuils, lheir inlernaI slress due lo packaging conlined vilh IocaI seIf healing
lecones serious and nay resuIl in Iarge perfornance varialion, circuil naIfunclion and
even chip cracking. AIso, in lhis paper CDS lechnique vas used lo deveIop SITDA
aIgorilhn.

7. ConcIusion

This sludy presenled an approach lo lhernaI and lherno-nechanicaI slress noniloring of
VLSI chip. The possiliIily of evaIualing lhernaI peaks and associaled lhernaI slress
dislrilulion aII over lhe nonilored area ly using CDS and ILM has leen shovn.
Iurlhernore, lhis sludy presenled an approach lo lhe appIicalion of inverse prolIen for
lherno-nechanicaI anaIysis. The lhernaI peaks of invesligaled source can le ollained ly
appIying lhe gradienl direclion sensors nelhod. The adequale pIacenenl of sensors can
accuraleIy evaIuale lhe lhernaI peaks and sinpIify lhe lherno-nechanicaI conlroI unil
design. Thal viII enalIe lhe chip designer lo eslalIish lhe nosl honogeneous lherno-
nechanicaI carlography during operalion. The cosl of lhe lhernaI nanagenenl of WSI
device depends heaviIy upon lhe efficiency of lhe chip design. Hovever, heal sources
spaliaI dislrilulion has a significanl effecl on lhe WSI device operalion. Hence, in-silu
lherno-nechanicaI conlroI unile nusl le inpIenenled lo prevenl unexpecled pilfaIIs.

VLS Thermal Analysis and Monitoring 455

Anolher aspecl presenled in lhis sludy is reIaled lo surface peaks lhernaI deleclion in
nodern VLSI circuils, lecause lheir inlernaI slress due lo packaging conlined vilh IocaI
seIf healing lecones serious and nay resuIl in Iarge perfornance varialion, circuil
naIfunclion and even chip cracking. As an exanpIe, in lhis sludy CDS lechnique vas used
lo deveIop SITDA aIgorilhn. SeveraI approaches vere inpIenenled lo achieve a leller
perfornance for lhe SITDA operalion. In (Wjciak & NapieraIski, 1997), physicaI circuil
archileclure vas proposed. Hovever, conpIicaled conlroI conponenls had lo le depIoyed
in lhe circuil vhich is nol suilalIe for lodays conslrainls regarding area, especiaIIy if lhe
conlroI aIgorilhn needs lo le pIaced on-chip, for exanpIe, on a sensor nelvork node.
Therely, depIoying lhe SITDA in sensor nelvork can le nade feasilIe in lhe fulure.

8. References

A. Lakhsasi, A. Skorek," Dynanic Iinile LIenenl Approach for AnaIyzing Slress and
Dislorlion in MuIliIeveI Devices ", SOLID-STATL LLLCTRONICS, ILRCAMON,
LIsevier Science Lld., VoIune 46/6 pp. 925-932, May 2OO2.
A. Lakhsasi, M. ougalaya, D. Massicolle: IraclicaI approach lo gradienl direclion sensor
nelhod in very Iarge scaIe inlegralion lhernonechanicaI slress anaIysis, }. Vac. Sci.
TechnoI. A 24(3), pp. 758-763, May 2OO6.
A. Srivaslava, D. SyIvesler, and D. Iaauv, Concurrenl Sizing, Vdd and Vlh Assignnenl
for Lov-Iover Design, in Iroc. Design, Aulonalion and Tesl in Lurpoe, voI. 1,
Iel 2OO4, pp. 718-719.
AIlera corp (hllp://vvv.aIlera.con), 2OO9.
C. Long and L. He, Dislriluled sIeep lransislor nelvork for pover reduclion, in Iroc.
Design Aulonalion Conf., 2OO3.
L Koleda and aI ' in silu neasurenenls during lhernaI oxidalion of siIicon ' }.vac.sci
TechnoI 7 (2), Mar/aprI 1989
ITRS, InlernalionaI TechnoIogy Roadnap for Seniconduclors (ITRS), 2OO3.
}. Kao, A. Chandrakasan, and D. Anloniadis, Transislor sizing issues and looI for nuIli-
lhreshoId CMOS lechnoIogy, in Iroc. Design Aulonalion Conf., 1997.
}. Oh and M. Iedran, Caled cIock rouling for Iov-pover nicroprocessor design, ILLL
Trans. on Conpuler-Aided Design of Inlegraled Circuils and Syslens, voI. 2O, pp.
715-722, }un 2OO1.
M. ougalaya, A. Lakhsasi and D. Massicolle: Sleady Slale Therno-nechanicaI Slress
Irediclion for Large VLSI circuils using CDS Melhod. ILLL CCLCLO6 Iroceeding
SN:1-4244-OO38-4, pp.917-921
I. Cronovski el aI., High perfornance nicroprocessor design, ILLL }. SoIid-Slale Circuils,
voI. 33, pp. 676-686, May 1998.
Q. Wu, Q. Qiu, and M. Iedran, Dynanic pover nanagenenl of conpIex syslens using
generaIized slochaslic Ielri nels, in Iroc. Design Aulonalion Conf., }un 2OOO.
R. Iuri, L. Slok, }. Cohn, D. Kung, D. Ian, D. SyIvesler, A. Srivaslava, and S. H. KuIkarni,
Iushing ASIC Ierfornance in a Iover LnveIope, in Iroc. Design Aulonalion
Conf., 2OO3.
Sergio Lopez-uedo, }avier Carrido, and Lduardo I. oeno: 'DynanicaIIy Inserling,
Operaling, and LIininaling ThernaI Sensors of IICA-ased Syslens, ILLL TRAN
VLS 456

ON COM AND IACK TLCHNOLOCILS, VOL. 25, NO. 4, DLCLMLR 2OO2, pp.
561-566.
V. SzekeIy, ThernaI noniloring of nicroeIeclronic slruclures, MicroeIeclronic }., 25 (1994)
157-17O.
W. Wjciak and A. NapieraIski 'ThernaI noniloring of a singIe heal source in
seniconduclor devices - lhe firsl approach MicroeIeclronics }ournaI 28 (1997) p
313-316.
XiIinx corp (hllp://vvv.xiIinx.con), 2OO9.

You might also like