Multiple-input multiple-output (MIMO) wireless communication systems [1] employ multiple antennas at the transmitter
and at the receiver to increase system capacity and to achieve
better quality of service. In spatial multiplexing mode, MIMO
systems reach higher peak data rates without increasing the
bandwidth of the system by transmitting multiple data streams
in parallel in the same frequency band. Orthogonal frequency
division multiplexing (OFDM) is a modulation scheme that is
robust against interference arising from multipath propagation.
Consequently, many upcoming standards for high throughput
wireless communication such as IEEE 802.11n and IEEE
802.16 rely on a combination of MIMO with OFDM. Unfortunately, the performance improvements of MIMO technology also entail a considerable increase in signal processing
complexity, in particular for the separation of the parallel
data streams. Hence, a major challenge associated with the
implementation of future wireless communication systems is
in the design of low-complexity MIMO detection algorithms
and corresponding VLSI architectures.
In this work, we consider the VLSI implementation of
linear MMSE detection for wideband MIMO-OFDM systems.
A suboptimal linear detection scheme is considered since
implementations of algorithms with better performance
(e.g., [2], [3], [4]) either do not meet the high throughput
requirements for MIMO-WLAN (especially not on FPGAs)
or lack the ability to provide soft-information for channel
decoding with low hardware complexity.
A. System Model and Requirements
The system under consideration is a packet-based MIMO-OFDM system
with MT transmit and MR receive antennas.
[Figure: timing of a data frame, showing idle periods, MIMO detection, and the detection latency.]
time index t on the kth tone of the OFDM signal. After proper
OFDM modulation at the transmitter and demodulation at the
receiver, the corresponding received vector y[k, t] is given by
y[k, t] = H[k] s[k, t] + n[k, t],    (1)
G[k] = (H^H[k] H[k] + M_T \sigma^2 I)^{-1} H^H[k]    (2)
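As a point of reference, a linear MMSE estimator of the form G = (H^H H + M_T \sigma^2 I)^{-1} H^H can be evaluated directly in floating point. The following is a minimal sketch under stated assumptions: the helper names (`hermitian`, `matmul`, `invert`, `mmse_estimator`) are hypothetical, matrices are plain lists of rows, and the Gauss-Jordan inversion is a generic textbook routine, not the method implemented in the paper.

```python
def hermitian(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[r][c].conjugate() for r in range(len(a))]
            for c in range(len(a[0]))]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def invert(a):
    """Gauss-Jordan inversion with partial pivoting (reference only)."""
    n = len(a)
    # Augment the matrix with the identity: [A | I].
    aug = [list(map(complex, row)) + [complex(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest pivot magnitude for stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate the current column from all other rows.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mmse_estimator(h, sigma2):
    """G = (H^H H + M_T sigma^2 I)^-1 H^H for an M_R x M_T matrix h."""
    m_t = len(h[0])
    hh = hermitian(h)
    gram = matmul(hh, h)
    for i in range(m_t):
        gram[i][i] += m_t * sigma2  # regularization term M_T * sigma^2
    return matmul(invert(gram), hh)
```

With `sigma2 = 0` and a square, invertible `h`, the result reduces to the ordinary inverse of `h`, which gives a simple sanity check.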
Authorized licensed use limited to: Texas A M University. Downloaded on March 24, 2009 at 03:08 from IEEE Xplore. Restrictions apply.
ISCAS 2006
CDiv2MR
(6)
8: end for
9: G = P^{(M_R)} H^H
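The recursion that ends in G = P^{(M_R)} H^H can be illustrated in software. The following is a minimal sketch, assuming the update is the standard matrix inversion lemma applied once per row of H, starting from P^{(0)} = I / (M_T \sigma^2). It is floating-point Python rather than a transcription of the individual steps of Alg. 1, and the function name `mmse_estimator_rank_one` is hypothetical.

```python
def mmse_estimator_rank_one(h, sigma2):
    """Compute G = (H^H H + M_T sigma^2 I)^-1 H^H via M_R rank-one updates.

    h is an M_R x M_T matrix given as a list of rows; sigma2 must be > 0.
    """
    m_r, m_t = len(h), len(h[0])
    # Initialization: P(0) = I / (M_T * sigma^2).
    p = [[complex(i == j) / (m_t * sigma2) for j in range(m_t)]
         for i in range(m_t)]
    for row in h:  # one rank-one update per row H_j of H
        # g = P(j-1) H_j^H (a column vector of length M_T).
        g = [sum(p[i][k] * row[k].conjugate() for k in range(m_t))
             for i in range(m_t)]
        # Scalar denominator 1 + H_j P(j-1) H_j^H.
        denom = 1 + sum(row[i] * g[i] for i in range(m_t))
        # P is Hermitian, so H_j P(j-1) = g^H and the update is
        # P(j) = P(j-1) - g g^H / denom.
        p = [[p[i][k] - g[i] * g[k].conjugate() / denom
              for k in range(m_t)]
             for i in range(m_t)]
    # Final step: G = P(M_R) H^H, an M_T x M_R matrix.
    return [[sum(p[i][k] * h[j][k].conjugate() for k in range(m_t))
             for j in range(m_r)]
            for i in range(m_t)]
```

Only one scalar division per update is needed, which is the property that makes this recursion attractive compared with a general-purpose matrix inversion.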
favorable numerical properties in fixed-point implementations
and in the availability of a wide range of regular array architectures
[8], [9] for their implementation. The main arguments
for direct matrix inversion are the lower number of operations
compared to QR decomposition and the fact that the matrix
(H^H[k] H[k] + M_T \sigma^2 I)^{-1} is produced as an intermediate result.
In fact, the diagonal entries of this matrix are required for the
computation of soft-outputs [10], [6].
The implementation that is described in this paper relies
on direct matrix inversion. The corresponding algorithm is
borrowed from the updating procedure of the Kalman gain in
Kalman filtering applications. The basic idea is to start from
the trivial inverse of M_T \sigma^2 I and to obtain (H^H H + M_T \sigma^2 I)^{-1}
through a series of MR rank-one updates by using the matrix
inversion lemma. The iteration is initialized by setting
P^{(0)} = \frac{1}{M_T \sigma^2} I    (4)
and proceeds by computing
P^{(j)} = P^{(j-1)} - \frac{P^{(j-1)} H_j^H H_j P^{(j-1)}}{1 + H_j P^{(j-1)} H_j^H}    (5)
where H_j denotes the jth row of H. After MR iterations,
P^{(M_R)} = (H^H H + M_T \sigma^2 I)^{-1}, where the
index of the OFDM tone has been omitted for brevity. The
complexity of the above described algorithm in terms of the [...]

[...] such a minimum-area solution cannot meet the low-latency
requirements of MIMO-OFDM systems. A highly parallel
architecture achieves higher throughput but suffers significantly
from the fact that data dependencies and the desire
for a regular data flow mandate a sequential execution of the
individual steps in Alg. 1. Since these steps differ significantly
in the number of required operations, a massively parallel
architecture would result in a poor utilization of processing
resources. In a moderately parallel VLSI architecture the
number of processing resources is chosen so that their average
utilization is high. Most of the steps in Alg. 1 require either
MT or a multiple of MT multiplications. Hence, choosing
an MT-fold degree of parallelism leads to a high hardware
utilization.
A. Moderately Parallel Architecture
The high-level block diagram of the proposed moderately
parallel architecture is shown in Fig. 2. The circuit employs
MT identical processing elements (PEs) arranged in a circular
array and a common 1/ Y-block that computes the additions in
step 4) and the pseudo floating-point division in step 5). The
connections in the array are local, meaning that only neighboring
PEs are connected with each other. Each PE mainly
contains a complex-valued multiplier, an adder and some local
storage registers as shown in Fig. 3. All intermediate variables
are stored locally, equally distributed over the PEs. For the [...]
[Fig. 2: High-level block diagram (labels recovered: Cycles, PE(1), PE(2), ..., PE(MT)).]
[Fig. 3: Processing element PE(i) with optional pipeline register.]
The matrix G is computed in a series of MR matrix-vector multiplications,
each of which is identical to step 3). The entries of H are again applied
to the inputs of the PEs row-by-row so that G is output column-wise,
as shown in Fig. 2. The jth row of H can be replaced [...] so that
no extra memory is required.
[...]    (7)
from the memory one column at a time. The entries of the jth
column are presented to the PEs together with the jth entry
of the received vector y, which is broadcast to all PEs. The
results of the multiplications of G_ij with y_j in the ith PE are
accumulated in the
av register of the same PE and s is available
after MR cycles.
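The column-wise schedule described above can be mimicked in software. The following is a minimal sketch, assuming G is stored as an MT x MR list of rows and that each PE owns one accumulator; the function name `detect_columnwise` is hypothetical, and the inner loop stands in for work that the PEs would perform in parallel in hardware.

```python
def detect_columnwise(g, y):
    """Accumulate s = G y one column of G (one entry of y) per cycle."""
    m_t, m_r = len(g), len(y)
    acc = [0j] * m_t  # one accumulator register per processing element
    for j in range(m_r):  # cycle j: column j of G is read from memory
        for i in range(m_t):  # in hardware, all PEs do this step at once
            acc[i] += g[i][j] * y[j]  # y_j is broadcast to every PE
    return acc  # available after M_R cycles
```

For example, `detect_columnwise([[1, 2j, 0], [1j, 1, -1]], [1, 1j, 2])` accumulates the two output entries over three cycles, one per entry of the received vector.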
D. Pipelining
Despite the recursive nature of the applied algorithm,
pipelining can be introduced to allow for higher clock frequencies. An additional register is added to the original architecture
as shown in Fig. 3. The actual increase in clock speed depends
on the quality of the placement of this pipeline register in the
logic. Implementation results show that a factor of almost 1.7
can easily be reached with manual retiming. Unfortunately,
pipelining of the recursive matrix inversion algorithm also
mandates the insertion of additional cycles to flush the pipeline
after the operations associated with steps 3), 4), 6), 7) and 9)
of Alg. 1. As a result, the number of cycles increases to
t_C^{Pd} = M_R (3 M_T + 6) - M_T + 2 + M_R C_Div    (8)
and the number of cycles for the division must also be
increased to match the higher clock rate.
IV. IMPLEMENTATION RESULTS
ASIC Implementation. A critical design parameter is the
wordlength of the complex-valued datapath. The corresponding
trade-offs between silicon area, bit error rate (BER) performance,
and the computation time for a single MMSE estimator
are illustrated in Fig. 5 for a system with MT = MR = 4.
The VLSI implementation results are based on a 0.25 µm
technology and for the BER simulations the entries of H[k]
are assumed i.i.d. Rayleigh fading with variance one, so that
the received SNR is given by 1/\sigma^2. For the computation of
the MMSE estimator, H[k] is represented in a block floating-point
format and WW denotes the wordwidth of the real-valued
multipliers which constitute the complex-valued multipliers
in the PEs. The clock rates of the unpipelined designs are
between 93 MHz and 101 MHz, depending on the wordlength.
The pipelined implementations achieve between 167 MHz and
176 MHz. For the computation of the MMSE estimators, the
gain from the higher clock frequency remains small due to
the increase in the number of cycles. However, a significant
performance improvement is achieved from pipelining when
the circuit operates in detection mode, because during this
operation no pipeline bubbles need to be inserted. Without
pipelining, 23-25 million (received) vectors per second (Mvps)
can be processed, while with pipelining throughput increases
to 42-44 Mvps.
[Fig. 5: BER versus SNR for multiplier wordwidths WW = 18 to 21 and for floating point, pipelined and not pipelined, annotated with area (GE) and time per inversion.]

In packet-based MIMO-OFDM systems [...]
a considerable implementation challenge. The presented algorithm and the scalable VLSI
architecture for the computation of the MMSE estimators
partially solve this problem for MIMO-OFDM systems with a
small number of tones (K < 64). A first important advantage of
the presented approach is that it reduces silicon area by reusing
the same hardware for the preprocessing and for the [...]
ACKNOWLEDGEMENT
This work is supported by the STREP project No. IST-026905 (MASCOT) within the sixth framework programme
of the European Commission.
REFERENCES
[1] G. Foschini and M. Gans, "On limits of wireless communications in a fading environment when using multiple antennas," Wireless Personal Communications, vol. 6, no. 3, pp. 311-334, 1998.
[2] Z. Guo and P. Nilsson, "A 53.3 Mb/s 4 x 4 16-QAM MIMO decoder in 0.35 µm CMOS," in Proc. IEEE ISCAS, May 2005, pp. 4947-4950.
[3] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, "VLSI implementation of MIMO detection using the sphere decoder algorithm," IEEE Journal of Solid-State Circuits, 2005.
[4] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-Best MIMO detection VLSI architectures achieving up to 424 Mbps," in Proc. IEEE Int. Symp. on Circuits and Systems, May 2006.
[5] D. Perels, S. Haene, P. Luethi, A. Burg, N. Felber, W. Fichtner, and H. Bölcskei, "ASIC implementation of a MIMO-OFDM transceiver for 192 Mbps WLANs," in Proc. IEEE ESSCIRC, Sept. 2005, pp. 215-218.
[6] S. Haene, A. Burg, D. Perels, P. Luethi, N. Felber, and W. Fichtner, "Silicon implementation of an MMSE-based soft demapper for MIMO-BICM," in Proc. IEEE Int. Symp. on Circuits and Systems, May 2006.
[7] Z. Khan, T. Arslan, J. S. Thompson, and A. T. Erdogan, "Area & power efficient VLSI architecture for computing pseudo inverse of channel matrix in a MIMO wireless system," in Proc. IEEE Int. Conf. on VLSI Design (VLSID), Jan. 2006, pp. 734-737.
[8] G. Lightbody, R. Woods, and R. Walke, "Design of a parameterized silicon intellectual property core for QR-based RLS filtering," IEEE Trans. on VLSI Systems, vol. 11, pp. 659-678, 2003.
[9] F. Edman and V. Owall, "A scalable pipelined complex valued matrix inversion architecture," in Proc. IEEE ISCAS, 2005, pp. 4489-4492.
[10] I. B. Collings, M. R. G. Butler, and M. McKay, "Low complexity receiver design for MIMO bit-interleaved coded modulation," in Proc. 8th IEEE Int. Symposium on Spread Spectrum Techniques and Applications, 2004, pp. 12-16.
[11] D. Cescato, M. Borgmann, H. Bölcskei, J. C. Hansen, and A. Burg, "Interpolation-based QR decomposition in MIMO-OFDM systems," in Proc. IEEE Workshop on Signal Processing Advances in Wireless [...]