
Algorithm and VLSI Architecture for Linear MMSE Detection in MIMO-OFDM Systems


A. Burg, S. Haene, D. Perels, P. Luethi, N. Felber and W. Fichtner
Integrated Systems Laboratory, ETH Zurich, Switzerland
{ apburg,haene,perels,luethi,felber,fw } @iis.ee.ethz.ch
Abstract- The paper describes an algorithm and a corresponding VLSI architecture for the implementation of linear MMSE
detection in packet-based MIMO-OFDM communication systems. The advantages of the presented receiver architecture are
low latency, high throughput, and efficient resource utilization,
since the hardware required for the computation of the MMSE
estimators is reused for the detection. The algorithm also supports
the extraction of soft information for channel decoding.
I. INTRODUCTION

Multiple-input multiple-output (MIMO) wireless communication systems [1] employ multiple antennas at the transmitter
and at the receiver to increase system capacity and to achieve
better quality of service. In spatial multiplexing mode, MIMO
systems reach higher peak data rates without increasing the
bandwidth of the system by transmitting multiple data streams
in parallel in the same frequency band. Orthogonal frequency
division multiplexing (OFDM) is a modulation scheme that is
robust against interference arising from multipath propagation.
Consequently, many upcoming standards for high throughput
wireless communication such as IEEE 802.11n and IEEE
802.16 rely on a combination of MIMO with OFDM. Unfortunately, the performance improvements of MIMO technology also entail a considerable increase in signal processing
complexity, in particular for the separation of the parallel
data streams. Hence, a major challenge associated with the
implementation of future wireless communication systems is
in the design of low-complexity MIMO detection algorithms
and corresponding VLSI architectures.
In this work, we consider the VLSI implementation of
linear MMSE detection for wideband MIMO-OFDM systems.
A suboptimal linear detection scheme is considered since
implementations of algorithms with better performance
(e.g., [2], [3], [4]) either do not meet the high throughput
requirements for MIMO-WLAN (especially not on FPGAs)
or lack the ability to provide soft information for channel
decoding with low hardware complexity.
A. System Model and Requirements
The system under consideration is a packet-based MIMO-OFDM system
with $M_T$ transmit and $M_R$ receive antennas.

0-7803-9390-2/06/$20.00 ©2006 IEEE

Fig. 1. Timing diagram of MIMO detection process in packet-based MIMO-OFDM systems.

time index t on the kth tone of the OFDM signal. After proper
OFDM modulation at the transmitter and demodulation at the
receiver, the corresponding received vector y[k, t] is given by
y[k, t]= H[k]s[k, t] + n[k, t],
(1)

where the $M_R \times M_T$-dimensional matrix H[k] describes the
effective MIMO channel for the kth tone and the vector n[k, t]
models the thermal noise in the system as i.i.d. proper complex
Gaussian with variance $\sigma^2$ per complex dimension. Assuming
knowledge of the channel matrices, the linear MMSE estimator
for each tone is given by

$$G[k] = \left(H^H[k]\,H[k] + M_T\sigma^2 I\right)^{-1} H^H[k] \tag{2}$$

and linear MIMO detection corresponds to a straightforward
matrix-vector multiplication according to
$$\hat{s}[k,t] = G[k]\,y[k,t] \tag{3}$$
followed by quantization of the entries of $\hat{s}[k,t]$ to the nearest
constellation point.
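As an illustration of (2) and (3), the following is a minimal floating-point sketch in plain Python. It is not the fixed-point procedure developed in this paper: the helper names (`hermitian`, `matmul`, `inverse`, `mmse_estimator`, `detect`), the Gauss-Jordan inversion, the QPSK constellation and the example channel are all assumptions chosen only to make the example self-contained.

```python
# Minimal sketch of linear MMSE detection, eqs. (2) and (3), in pure Python.
# All names and the example data are illustrative assumptions.

def hermitian(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[i][j].conjugate() for i in range(len(a))]
            for j in range(len(a[0]))]

def matmul(a, b):
    """Product of two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse(a):
    """Gauss-Jordan inversion with partial pivoting (floating point)."""
    n = len(a)
    # Augment the matrix with the identity.
    aug = [list(map(complex, a[i])) +
           [1.0 + 0j if i == j else 0j for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mmse_estimator(h, sigma2):
    """G = (H^H H + M_T sigma^2 I)^-1 H^H, eq. (2)."""
    mt = len(h[0])
    hh = hermitian(h)
    a = matmul(hh, h)
    for i in range(mt):
        a[i][i] += mt * sigma2
    return matmul(inverse(a), hh)

def detect(g, y, constellation):
    """s_hat = G y, eq. (3), then quantization to the nearest point."""
    s_hat = [sum(gij * yj for gij, yj in zip(row, y)) for row in g]
    return [min(constellation, key=lambda c: abs(c - s)) for s in s_hat]

# Example: M_T = 2 transmit and M_R = 3 receive antennas, noise-free reception.
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[1 + 0.5j, 0.2 - 0.1j],
     [0.3 + 0.2j, 0.9 - 0.4j],
     [-0.1 + 0.6j, 0.4 + 0.3j]]
s = [1 - 1j, -1 + 1j]
y = [sum(hij * sj for hij, sj in zip(row, s)) for row in H]
G = mmse_estimator(H, sigma2=0.01)
print(detect(G, y, QPSK))
```

The estimator G depends only on H[k] and the noise variance, so it is computed once per tone and frame, while `detect` runs once per received vector.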
The difficulty in the implementation of linear receivers for
packet-based MIMO-OFDM systems arises from the frame
structure because the initial training phase, during which the
receiver obtains knowledge of H[k], is immediately followed
by data. Since the detection of the data according to (3) only
starts when the MMSE estimators for all K data carrying tones
have been computed, the delay incurred by the preprocessing
according to (2) translates directly into detection latency as
illustrated in Fig. 1. In MIMO-OFDM receiver implementations [5], this latency is responsible for considerable memory
requirements to buffer the received vectors and can cause problems [...]

Authorized licensed use limited to: Texas A M University. Downloaded on March 24, 2009 at 03:08 from IEEE Xplore. Restrictions apply.

ISCAS 2006

of packet-based MIMO-OFDM receivers. However, it is also
noted that the corresponding operation is only performed once
at the start of the frame so that, without special provisions, the
potentially costly hardware for the preprocessing will be idle
most of the time.
Contribution: In this paper an algorithm for efficient tone-by-tone linear preprocessing of channel state information in
MIMO-OFDM systems is presented, together with a hardware-efficient VLSI architecture for its realization. The described
receiver constitutes the basis for the soft-output demapper
described in [6] which yields a 5-6 dB gain in terms of signal
to noise ratio (SNR) over a hard-decision MMSE decoder.
The reported ASIC and FPGA area and performance figures
provide reference for the true silicon complexity of linear
MMSE receivers for MIMO-OFDM systems.
Outline: The next section introduces the algorithm for
the computation of the linear MMSE detectors. Section III
describes a scalable VLSI architecture for the proposed algorithm. Area and performance figures for ASIC and FPGA
implementations are provided in Section IV. Section V concludes the paper.

II. PREPROCESSING ALGORITHM

Algorithm choices for the implementation of (2) are either based on QR-decomposition [7] using unitary transformations or on direct matrix inversion algorithms with conventional arithmetic. The main advantages of the QR approach lie in its favorable numerical properties in fixed-point implementations and in the availability of a wide range of regular array architectures [8], [9] for their implementation. The main arguments for direct matrix inversion are the lower number of operations compared to QR decomposition and the fact that the matrix $(H^H[k]H[k] + M_T\sigma^2 I)^{-1}$ is produced as an intermediate result. In fact, the diagonal entries of this matrix are required for the computation of soft-outputs [10], [6].
The implementation that is described in this paper relies on direct matrix inversion. The corresponding algorithm is borrowed from the updating procedure of the Kalman gain in Kalman filtering applications. The basic idea is to start from the trivial inverse of $M_T\sigma^2 I$ and to obtain $(H^H H + M_T\sigma^2 I)^{-1}$ through a series of $M_R$ rank-one updates by using the matrix inversion lemma. The iteration is initialized by setting
$$P^{(0)} = \frac{1}{M_T\sigma^2}\, I \tag{4}$$
and proceeds by computing
$$P^{(j)} = P^{(j-1)} - \frac{P^{(j-1)} H_j^H H_j P^{(j-1)}}{1 + H_j P^{(j-1)} H_j^H}, \tag{5}$$
where $H_j$ denotes the jth row of H. After $M_R$ iterations, $P^{(M_R)} = (H^H H + M_T\sigma^2 I)^{-1}$, where the index of the OFDM tone has been omitted for brevity. The complexity of the above described algorithm in terms of the number of multiplications² and divisions is given by
$$\mathcal{C}_{\mathrm{Mult}} = \tfrac{5}{2} M_R M_T^2 + \tfrac{5}{2} M_R M_T - M_T^2 + M_T, \qquad \mathcal{C}_{\mathrm{Div}} = M_R. \tag{6}$$
In order to map recursion (5) to hardware, its compact mathematical description is expanded as shown in Alg. 1. The operation sequence is designed to reduce the dynamic range of intermediate results and to minimize the number of costly divisions, while keeping the number of multiplications low.

Algorithm 1 Algorithm for computing the MMSE estimator
1: $P^{(0)} = \frac{1}{M_T\sigma^2} I$
2: for $j = 1 \ldots M_R$ do
3:    $g = P^{(j-1)} H_j^H$
4:    $S = 1 + H_j g$ (note that S is strictly positive)
5:    $S_e$ = $\log_2 S$ rounded to an integer, $S_m = 2^{S_e}/S$
6:    $\tilde{g} = S_m\, g$
7:    $P^{(j)} = P^{(j-1)} - \tilde{g}\, g^H\, 2^{-S_e}$
8: end for
9: $G = P^{(M_R)} H^H$

² In terms of complex-valued multiplications. The few real-valued multiplications are counted as complex-valued, assuming a dedicated VLSI architecture with multipliers optimized for complex-valued coefficients.

III. VLSI ARCHITECTURE

The choice of a suitable hardware architecture for the implementation of Alg. 1 depends on the system specifications and on the available area: The most area efficient solution is a fully decomposed, processor-like architecture. However, such a minimum-area solution cannot meet the low-latency requirements of MIMO-OFDM systems. A highly parallel architecture achieves higher throughput but suffers significantly from the fact that data dependencies and the desire for a regular data flow mandate a sequential execution of the individual steps in Alg. 1. Since these steps differ significantly in the number of required operations, a massively parallel architecture would result in a poor utilization of processing resources. In a moderately parallel VLSI architecture the number of processing resources is chosen so that their average utilization is high. Most of the steps in Alg. 1 require either $M_T$ or a multiple of $M_T$ multiplications. Hence, choosing an $M_T$-fold degree of parallelism leads to a high hardware utilization.

A. Moderately Parallel Architecture
The high-level block diagram of the proposed moderately parallel architecture is shown in Fig. 2. The circuit employs $M_T$ identical processing elements (PEs) arranged in a circular array and a common 1/S-block that computes the additions in step 4) and the pseudo floating-point division in step 5). The connections in the array are local, meaning that only neighboring PEs are connected with each other. Each PE mainly contains a complex-valued multiplier, an adder and some local storage registers as shown in Fig. 3. All intermediate variables are stored locally, equally distributed over the PEs. For the
Hermitian matrix P, only the main diagonal and the lower triangular part are stored.

Fig. 2. High-level block diagram of a moderately parallel architecture for the direct matrix inversion. The same hardware is reused for the linear detection.

Fig. 3. Schematic of a single PE. The main components are the complex-valued multiplier [...]

B. MMSE Estimator Computation
The computation of the MMSE estimators starts with the loop between step 2) and 8) in Alg. 1. During the jth iteration of this loop, the entries of the jth row of H must be presented to the inputs of the PEs as shown in Fig. 2. The computation of the matrix-vector multiplication in step 3) is illustrated in Fig. 4. In the first cycle, the first PE uses the upper ring to broadcast $H_{j,1}^{*}$ to all other PEs, which multiply $H_{j,1}^{*}$ with their respective entry of the first column of $P^{(j-1)}$ and store the result in the g register of the neighboring PE. In the second cycle, the second PE broadcasts $H_{j,2}^{*}$ to the other PEs, which multiply it with the respective entries of the second column of $P^{(j-1)}$, add the content of their g register and forward the result again into the g register of the neighboring PE. This procedure is repeated for $M_T$ cycles with the exception of the first iteration (j = 1) in which a single cycle suffices to compute g, since $P^{(0)}$ is a diagonal matrix.

Fig. 4. Procedure to compute step 3) in Alg. 1 for $M_T = 4$.

The multiplications in step 4) can be carried out on all PEs in parallel in a single cycle. The summation of the results that yields S is absorbed into the 1/S block, which performs the addition and the division in step 5) in $C_{\mathrm{Div}}$ cycles. The computation of $\tilde{g}$ from g in step 6) is again trivial and can be performed in a single cycle on the individual PEs. Step 7) is carried out in $M_T$ cycles, in each of which one entry of g is broadcast to all PEs through the upper ring, while the other operand circulates through the lower ring.
Before step 9) is executed, the diagonal entries of P can be extracted for the computation of soft bit-metrics, as described in [6]. The matrix multiplication with $H^H$ is computed in a series of $M_R$ matrix-vector multiplications, each of which is identical to step 3). The entries of H are again applied to the inputs of the PEs row-by-row so that G is output column-wise, as shown in Fig. 2. The jth row of H can be replaced with the jth column of G, so that no extra memory is required to store the MMSE estimators [...]. The overall number of cycles that is required to compute Alg. 1 with the presented architecture is given by
$$t_{\mathrm{cpd}} = M_R(3M_T + 2) - M_T + 1 + M_R C_{\mathrm{Div}}. \tag{7}$$

C. Detection
A significant advantage of the described conventional arithmetic based computation of the MMSE estimator is that once preprocessing is complete, the same hardware can be reused for the detection according to (3). To this end, G is read back from the memory one column at a time. The entries of the jth column are presented to the PEs together with the jth entry of the received vector y, which is broadcast to all PEs. The results of the multiplications of $G_{i,j}$ with $y_j$ in the ith PE are accumulated in a register of the same PE and $\hat{s}$ is available after $M_R$ cycles.
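The rank-one update recursion (4)-(5), step 9) of Alg. 1, the column-wise detection described above and the cycle count (7) can be sketched in plain Python as follows. This is a floating-point illustration under stated assumptions, not the hardware implementation: the pseudo floating-point split of S into $S_e$ and $S_m$ is merged into a single division, and the function names, the example channel and the value used for $C_{\mathrm{Div}}$ are hypothetical.

```python
# Sketch of the rank-one update recursion (4)-(5) and column-wise detection.
# Floating point only; names and example values are illustrative assumptions.

def mmse_by_rank_one_updates(h, sigma2):
    """Return (P, G) with P = (H^H H + M_T sigma^2 I)^-1 and G = P H^H."""
    mr, mt = len(h), len(h[0])
    # Eq. (4): P(0) = I / (M_T sigma^2)
    p = [[(1.0 / (mt * sigma2) if i == k else 0.0) + 0j for k in range(mt)]
         for i in range(mt)]
    for row in h:  # one rank-one update per row H_j of H
        # Step 3: g = P H_j^H
        g = [sum(p[i][k] * row[k].conjugate() for k in range(mt))
             for i in range(mt)]
        # Step 4: S = 1 + H_j g (real and strictly positive)
        s = 1.0 + sum(row[k] * g[k] for k in range(mt)).real
        # Steps 5-7 merged into one division: P <- P - g g^H / S
        p = [[p[i][k] - g[i] * g[k].conjugate() / s for k in range(mt)]
             for i in range(mt)]
    # Step 9: G = P H^H, an M_T x M_R matrix
    g_mat = [[sum(p[i][k] * h[j][k].conjugate() for k in range(mt))
              for j in range(mr)] for i in range(mt)]
    return p, g_mat

def detect_columnwise(g_mat, y):
    """Accumulate column j of G times y[j], one column per cycle."""
    acc = [0j] * len(g_mat)
    for j, yj in enumerate(y):        # the jth column of G and y_j arrive
        for i in range(len(g_mat)):   # the ith PE updates its accumulator
            acc[i] += g_mat[i][j] * yj
    return acc

def cycles_unpipelined(mt, mr, c_div):
    """Cycle count of eq. (7)."""
    return mr * (3 * mt + 2) - mt + 1 + mr * c_div

# Example with M_T = 2 and M_R = 3.
H = [[1 + 0.5j, 0.2 - 0.1j],
     [0.3 + 0.2j, 0.9 - 0.4j],
     [-0.1 + 0.6j, 0.4 + 0.3j]]
SIGMA2 = 0.01
P, G = mmse_by_rank_one_updates(H, SIGMA2)
soft_info = [P[i][i].real for i in range(len(P))]  # diagonal of P for soft outputs
s_hat = detect_columnwise(G, [1 + 0j, 0j, 0j])     # picks out the first column of G
print(cycles_unpipelined(mt=4, mr=4, c_div=4))     # c_div = 4 is a hypothetical value
```

Because P stays Hermitian throughout, $H_j P^{(j-1)}$ equals $g^H$, which is why a single vector g suffices for both factors of the update.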

D. Pipelining
Despite the recursive nature of the applied algorithm, pipelining can be introduced to allow for higher clock frequencies. An additional register is added to the original architecture
as shown in Fig. 3. The actual increase in clock speed depends
on the quality of the placement of this pipeline register in the
logic. Implementation results show that a factor of almost 1.7
can easily be reached with manual retiming. Unfortunately,
pipelining of the recursive matrix inversion algorithm also
mandates the insertion of additional cycles to flush the pipeline
after the operations associated with steps 3), 4), 6), 7) and 9) of Alg. 1. As a result, the number of cycles increases to
$$t_{\mathrm{cpd}} = M_R(3M_T + 6) - M_T + 2 + M_R C_{\mathrm{Div}} \tag{8}$$
and the number of cycles for the division must also be increased to match the higher clock rate.

IV. IMPLEMENTATION RESULTS

ASIC Implementation. A critical design parameter is the wordlength of the complex-valued datapath. The corresponding trade-offs between silicon area, bit error rate (BER) performance, and the computation time for a single MMSE estimator are illustrated in Fig. 5 for a system with $M_T = M_R = 4$. The VLSI implementation results are based on a 0.25 µm technology and for the BER simulations the entries of H[k] are assumed i.i.d. Rayleigh fading with variance one, so that the received SNR is given by $1/\sigma^2$. For the computation of the MMSE estimator, H[k] is represented in a block floating-point format and WW denotes the wordwidth of the real-valued multipliers which constitute the complex-valued multipliers in the PEs. The clock rates of the unpipelined designs are between 93 MHz and 101 MHz, depending on the wordlength. The pipelined implementations achieve between 167 MHz and 176 MHz. For the computation of the MMSE estimators, the gain from the higher clock frequency remains small due to the increase in the number of cycles. However, a significant performance improvement is achieved from pipelining when the circuit operates in detection mode, because during this operation no pipeline bubbles need to be inserted. Without pipelining, 23-25 million (received) vectors per second (Mvps) can be processed, while with pipelining throughput increases to 42-44 Mvps.

Fig. 5. Fixed-point BER simulation and VLSI implementation results for MMSE detection in a system with $M_T = M_R = 4$ and with 16-QAM modulation.

FPGA Implementation. For the implementation of the design (for $M_T = M_R = 4$) on a XILINX XC2V6000-6 FPGA, WW = 18 was chosen as the device contains hardwired multipliers of that size. The pipelined version operates with a clock rate of 40 MHz and requires 2.2 µs to compute the MMSE estimator of one tone. Hence, for example, the detection latency in a system with K = 64 tones adds up to 141 µs. The throughput in detection mode is 10 Mvps. In terms of area, the design consumes 16 out of 144 multipliers and 3'416 logic slices out of a total of 33'792.

V. CONCLUSIONS

In packet-based MIMO-OFDM systems even allegedly low-complexity linear detectors pose a considerable implementation challenge. The presented algorithm and the scalable VLSI architecture for the computation of the MMSE estimators partially solve this problem for MIMO-OFDM systems with a small number of tones (K < 64). A first important advantage of the presented approach is that it reduces silicon area by reusing the same hardware for the preprocessing and for the detection. The second advantage is the ability to easily extract soft bit-metrics for a subsequent channel decoder [6]. The main drawback lies in the considerable numerical requirements. Moreover, it is noted that for systems with a large number of tones, preprocessing latency is still too high. A possible solution to this problem has recently been proposed in [11].

ACKNOWLEDGEMENT
This work is supported by the STREP project No. IST-026905 (MASCOT) within the sixth framework programme of the European Commission.

REFERENCES
[1] G. Foschini and M. Gans, "On limits of wireless communications in a fading environment when using multiple antennas," Wireless Personal Communications, vol. 6, no. 3, pp. 311-334, 1998.
[2] Z. Guo and P. Nilsson, "A 53.3 Mb/s 4 x 4 16-QAM MIMO decoder in 0.35 µm CMOS," in Proc. IEEE ISCAS, May 2005, pp. 4947-4950.
[3] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, "VLSI implementation of MIMO detection using the sphere decoder algorithm," IEEE Journal of Solid-State Circuits, 2005.
[4] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-Best MIMO detection VLSI architectures achieving up to 424 Mbps," in Proc. IEEE Int. Symp. on Circuits and Systems, May 2006.
[5] D. Perels, S. Haene, P. Luethi, A. Burg, N. Felber, W. Fichtner, and H. Bölcskei, "ASIC implementation of a MIMO-OFDM transceiver for 192 Mbps WLANs," in Proc. IEEE ESSCIRC, Sept. 2005, pp. 215-218.
[6] S. Haene, A. Burg, D. Perels, P. Luethi, N. Felber, and W. Fichtner, "Silicon implementation of an MMSE-based soft demapper for MIMO-BICM," in Proc. IEEE Int. Symp. on Circuits and Systems, May 2006.
[7] Z. Khan, T. Arslan, J. S. Thompson, and A. T. Erdogan, "Area & power efficient VLSI architecture for computing pseudo inverse of channel matrix in a MIMO wireless system," in Proc. IEEE Int. Conf. on VLSI Design (VLSID), Jan. 2006, pp. 734-737.
[8] G. Lightbody, R. Woods, and R. Walke, "Design of a parameterized silicon intellectual property core for QR-based RLS filtering," IEEE Trans. on VLSI Systems, vol. 11, pp. 659-678, 2003.
[9] F. Edman and V. Owall, "A scalable pipelined complex valued matrix inversion architecture," in Proc. IEEE ISCAS, 2005, pp. 4489-4492.
[10] I. B. Collings, M. R. G. Butler, and M. McKay, "Low complexity receiver design for MIMO bit-interleaved coded modulation," in Proc. 8th IEEE Int. Symposium on Spread Spectrum Techniques and Applications, 2004, pp. 12-16.
[11] D. Cescato, M. Borgmann, H. Bölcskei, J. C. Hansen, and A. Burg, "Interpolation-based QR decomposition in MIMO-OFDM systems," in Proc. IEEE Workshop on Signal Processing Advances in Wireless Communications (SPAWC), June 2005, pp. 945-949.