Professional Documents
Culture Documents
Abstract systems.
1
However, these speed efficient designs are not al- a 4-input LUT, 16 bits of distributed SelectRAM
ways relevant solutions. Many applications require memory, or a 16-bit variable-tap shift register element.
smaller throughput (wireless communication, digital The output from the function generator in each slice
cinema, pay TV, ...). Sequential designs based on a 1- drives both the slice output and the D input of the
round loop may be judicious and attractive in terms of storage element.
hardware cost for many embedded applications. Sev-
eral such implementations have been published in the
literature. For DES and triple-DES designs, the most
efficient solution [16] encrypts/decrypts in 18 cycles
with a fresh key. For AES, the best design based on
1-round loop [17, 18] produces a data rate of 1450
Mbps (Virtex-E) using 542 slices and 10 RAM blocks,
but it does not support the decryption mode. Another
efficient circuit [20] proposes a compact architecture
that combines encryption and decryption. It executes
1 round in four cycles and produces a throughput of
166 Mbps (Spartan-II) using 222 slices and 3 RAM
blocks.
The design proposed in this paper is also based on
a quarter of round loop implementation and improves
by 68% (in term of ratio T hroughput/Area) the de- Figure 1. The Spartan-3 CLB.
sign detailed in [20]. We investigate a good combina-
tion of encryption/decryption and place a strong focus
on a very low area constraints. The resulting design
fits in the smallest Xilinx devices (e.g. the Spartan-
3 XC3S50 and Virtex-II XC2V40), achieves a data
stream of 208 Mbps (using 163 slices, 3 RAM blocks)
and 358 Mbps (using 146 slices, 3 RAM blocks), re-
spectively in Spartan-3 and Virtex-II devices. It at-
tempts to create a bridge between throughput and cost
requirements for embedded applications.
The paper is organized as follows: section 2 de-
scribes the smallest Spartan-3 and Virtex-II devices;
the mathematical description of Rijndael is in sec-
tion 3; section 4 describes our sequential AES encryp-
tor/decryptor; finally, section 5 concludes this paper. Figure 2. The Spartan-3 slice.
2. Spartan-3 and CLB description: the A specific feature of the slice is the 16-bit shift reg-
XC3S50 component ister configuration. The write operation is synchronous
with a clock input and an optional clock enable. A dy-
The Spartan-3 configurable logic blocks (CLBs) are namic read access is performed through the 4-bit ad-
organized in an array and are used to build combinato- dress bus.
rial and synchronous logic designs. Each CLB element Spartan-3 devices also incorporate 18-Kbit RAM
is tied to a switch matrix to access the general rout- blocks. These ones complement the distributed Selec-
ing matrix, as shown in Figure 1. A CLB element in- tRAM resources provided by the CLBs. Each RAM
cludes 4 similar slices, with fast local feedback within block is an 18-Kbit true dual-port RAM with two inde-
the CLB. The four slices are split into two columns of pendently clocked and independently controlled syn-
two slices with two independent carry logic chains and chronous ports that access a common storage area.
one common shift chain. Both ports are functionally identical.
Each slice includes two 4-input function genera- Virtex-II devices exploit the same architecture as
tors, carry logic, arithmetic logic gates, multiplexers Spartan-3.
and two storage elements. As shown in Figure 2, The XC3S50 and XC2V40 components are, respec-
each 4-input function generator is programmable as tively, the smallest Spartan-3 and Virtex-II compo-
2
nents. Table 1 illustrates the logic resources available 2. An affine transform over GF (2).
in both components.
The SubByte transformation applied to the State can
Component XC3S50 XC2V40 be represented as follows:
CLB array: row × col. 16 × 12 8×8
Number of slices 768 256 SB(Statei ) =
SB(di15 ) SB(di11 ) SB(di7 ) SB(di3 )
Number of flip-flops 1, 536 512
Number of LUTs 1, 536 512 SB(di14 ) SB(di10 ) SB(di6 ) SB(di2 )
Max dist. selectRAM
SB(di13 ) SB(di9 ) SB(di5 ) SB(di1 )
or shift reg. (bits) 24, 576 8, 192 SB(di12 ) SB(di8 ) SB(di4 ) SB(di0 )
Number of RAM blocks 4 4 The inverse transformation is defined InvSubByte
(ISB).
Table 1. Resources available in XC3S50 Shif tRow (SR) performs a cyclical left shift on
and XC2V40. the last three rows of the State. The second row is
shifted of one byte, the third row is shifted of two bytes
and the fourth row is shifted of three bytes. Thus, the
3. The AES algorithm Shif tRow transformation proceeds as follows:
SR(SB(Statei )) =
The Advanced Encryption Standard (AES, Rijn-
SB(di15 ) SB(di11 ) SB(di7 ) SB(di3 )
dael) algorithm is a symmetric block cipher that pro- SB(di10 ) SB(di6 ) SB(di2 ) SB(di14 )
cesses data block of 128, 192 and 256 bits using, re- SB(di )
5 SB(di1 ) SB(di13 ) SB(di9 )
spectively, keys of the same length. In this paper, SB(di0 ) SB(di12 ) SB(di8 ) SB(di4 )
only the 128 bit encryption version (AES-128) is con-
The inverse Shif tRow operation (InvShif tRow
sidered. The 128-bit data block and key are consid-
(ISR)) is trivial.
ered as a byte array, respectively called State and
M ixColumn (M C) operates separately on every
RoundKey, with four rows and four columns.
column of the State. A column is considered as a
Let a 128-bit data block in the ith round be defined
polynomial over GF (28 ) and multiplied modulo x4 +1
as:
with the fixed polynomial c(x):
data block i = di15 |di14 |di13 |di12 |di11 |di10 |di9 | c(x) =0 030 x3 +0 010 x2 +0 010 x +0 020
di8 |di7 |d6 |di5 |di4 |di3 |di2 |di1 |di0
As an illustration, the multiplication by 0 020 corre-
where represents the most significant byte of the
di15 sponds to a multiplication by two, modulo the irre-
data block of the round i. The corresponding Statei ductible polynomial m(x) = x8 + x4 + x3 + x + 1.
is: This can be represented as a matrix multiplication:
Ri = M C(SR(SB(Statei
i i i i
d15 d11 d7 d3 ))) =
i di14 di10 di6 di2 0 0
02 0 0 0
03 0 0 0
01 01
State = i
d13 di9 di5 di1 0 010 0
020 0
030 0 010
⊗
di12 di8 di4 di0 0 010 0
010 0
020 0 030
0 0 0
03 010 0
01 0 0
02 0
3
4.1. Implementation of ShiftRow and
i i
R15 R11 R7i R3i
i i
R R10 R6i R2i InvShiftRow operations
AK(Ri ) = 14 ⊕
i
R13 R9i R5i R1i
i
R12 R8i R4i R0i
i i
rk15 rk11 rk7i rk3i
i
rk14 rk10 i
rk6i rk2i In our design, the way to access the Statei ,
rki i
rk5i rk1i
rk for the first quarter of the round, is described
13 9
rk12 rk8i
i
rk4i rk0i in Figure 3. We read di15 ,di10 ,di5 ,di0 in paral-
lel from the input memory, and execute SubByte,
The inverse operation (InvAddRoundKey M ixColumn and AddRoundKey. Then we write re-
(IAK)) is trivial. sults di+1
15 ,d14 ,d13 , d12 to a different output mem-
i+1 i+1 i+1
RoundKeys are calculated with the key schedule ory. The second, third, and fourth quarters of the
for every AddRoundKey transformation. In AES- round are managed in a similar manner, depending on
128, the original cipher key is the first RoundKey 0 Shif tRow.
(rk 0 ) used in the additional AddRoundKey at the
beginning of the first round. RoundKey i , where The best FPGA solution to implement such si-
0 < i ≤ 10, is calculated from the previous multaneous read and write memory accesses is pro-
RoundKey i−1 . Let p(j) (0 ≤ j ≤ 3) be the column posed in [20]. The solution is based on a shift reg-
j of the RoundKey i−1 and let w(j) be the column ister design. As described above, all calculations
j of the RoundKey i . Then the new RoundKey i is from the AddRoundKey are written into adjacent lo-
calculated as follows: cations of the output memory in consecutive cycles.
We store first di+1
15 ,d14 ,d13 , d12 in parallel, then
i+1 i+1 i+1
w(0) = p(0) ⊕ (Rot(Sub(p(3))) ⊕ rconi , d11 ,d10 ,d9 , d8 in parallel, and so on. Therefore
i+1 i+1 i+1 i+1
4
4.2. Implementation of SubByte/MixColumn
and InvSubbyte/InvMixColumn
SB(a)
SB(a)
T3 (a) =
0 0
03 • SB(a)
Compared to the paper [20], we propose a more
0 0
efficient combination of SubByte and M ixColumn 02 • SB(a)
operations, i.e. we use less resources than separated The combination of SubByte followed by
block implementations. Our solution takes advantage M ixColumn can be expressed as:
of specific features of the new Xilinx devices and per-
fectly fits into the Spartan-3 or Virtex-II technologies4 .
i
e15
The Spartan-3 and Virtex-II FPGAs have both dedi- ei14
ei = T0 (di15 ) ⊕ T0 (di10 ) ⊕ T0 (di5 ) ⊕ T0 (di0 )
cated 18-Kbit dual-port RAM blocks5 , that can be used 13
5
CIPHER
ENABLE_OUT
PLAINTEXT_IN
32
tion process takes 48 cycles to generate the complete
Reset_IN
InvRoundKeys.
Due to InvRoundKey 0 and InvRoundKey 10 , ta-
bles (T0 to T3 ) need to be changed. InvSubByte has
RoundKeyi to replace the duplicated SubByte. We define new 16-
32 Kbit tables (CT0 to CT3 ) combined with (IT0 to IT3 ).
4X8 CT0 is illustrated below as an example:
0
0 0 0 0
02 • SB(a) 0E • SB(a)
.......
SRL16 SB(a) 0
090 • SB(a)
4X8
CT0 (a) =
....... 0
ISB(a) 0D0 • SB(a)
15 0
4X4
030 • SB(a) 0
0B 0 • SB(a)
Rd_ROW1
Rd_ROW2
Rd_ROW3
32
Rd_ROW4
32
SRL3
32
2 RAM blocks
MODE_DECR
CT0...CT 3
ISB+IMC
6
Algo. Gaj’s Ours Ours Ours Ours
AES AES AES DES 3-DES
Device XC2S30-6 XC3S50-4 XC2V40-6 XC2V40-6 XC2V40-6
Slices 222 163 146 189 227
Through. (Mbps) 166 208 358 974 326
RAM blocks 3 3 3 0 0
Throughput/Area
(Mbps/slices) 0.75 1.26 2.45 5.15 1.44
ENABLE_OUT
CIPHER INPUTS
PLAINT/KEY
Device XC3S50-4 XC2V40-6
32
32 LUTs used 293 288
Reset_IN
Registers used 126 113
32
Slices used 163 146
K_RAM
32
ROT
rconi
RAM blocks 3 3
Latency (cycles) 46 46
.......
Out. every (cycles) 1/44 1/44
SRL3
SRL16 Frequency 71.5 MHz 123 MHz
.......
32
15
F5
4X4
Rd_ROW1 Table 3. Final results of our complete se-
Rd_ROW2
32
Rd_ROW3
Rd_ROW4
quential AES.
2 RAM blocks
MODE_DECR
CT0...CT 3
F5
ISB+IMC
mains interesting for applications that need to regularly
change the key for encryption or decryption. Indeed,
ADDRA DIA DIB
our AES design takes 92 cycles, in the worst case, to
32 K_RAM
ADDRB
1 RAM block calculate a new complete InvRoundKeys.
RoundKey i or Reset_K_RAM
InvRoundKey i
5. Conclusion
Figure 6. Our complete AES design.
In this paper, we propose solutions for a very com-
pact and effective FPGA implementation of the AES.
We combine narrowly the non-linear S-boxes and the
The synthesis of our complete design was done us- linear diffusion layer thanks to specific features of re-
ing Synpllify Pro 7.2 from Synplicity. The place and cent Xilinx devices. We also propose a low-cost so-
route were done using Xilinx ISE 6.1.02i package. The lution to deal with the subkeys computed during the
final results are given in Table 3 for Spartan-3 and decryption step. In addition, we merge efficiently the
Virtex-II. key schedule and the data path parts.
As a comparison, we also set up a table with the The resulting implementations fits in a very inex-
previous AES [20], DES and 3-DES [16] results. Ta- pensive Xilinx Spartan-3 XC3S50 FPGA, for which
ble 2 shows the results of these compact encryp- the cost starts below $10 per unit. This implementa-
tion/decryption circuits. Like others papers, we also tion can encrypt and decrypt a throughput up to 208
define a ratio T hroughput/Area to facilitate compar- Mbps, using 163 slices. The design also fits in Xil-
isons. We finally achieve an implementation of AES inx Virtex-II XC2V40 and produces data streams up
which is 68% better in terms of T hroughput/Area to 358 Mbps, using 146 slices. In comparison with 3-
assuming that Spartan-II and Spartan-3 are equivalent. DES, AES is more efficient if we do not care about the
In comparison with the most efficient compact 3- use of three internal FPGA RAM blocks.
DES circuits in XC2V40-6, we can conclude that AES The throughput, low-cost and flexibility of our solu-
is more effective if we do not care about the use tion make it perfectly practical for cryptographic em-
7
bedded applications. [19] P. Chodowiec and K. Gaj. Comparison of the Hardware Performance
of the AES Candidates using Reconfigurable Hardware. The Third Ad-
vanced Encryption Standard (AES3) Candidate Conference, April 13-14
References 2000, New York, USA.
[14] J.P. Kaps and C. Paar. Fast DES Implementations for FPGAs and Its
Application to a Universal Key-Search Machine. In Proc. of SAC’98:
Selected Areas in Cryptography,LNCS 1556, pp. 234-247, Springer-
Verlag.
[15] G. Rouvroy, FX. Standaert, JJ. Quisquater, JD. Legat. Efficient Uses of
FPGA’s for Implementations of DES and its Experimental Linear Crypt-
analysis. In IEEE Transactions on Computers, Special CHES Edition,
pp. 473-482, April 2003.
[16] G. Rouvroy, FX. Standaert, JJ. Quisquater, JD. Legat. Design Strategies
and Modified Descriptions to Optimize Cipher FPGA Implementations:
Fast and Compact Results for DES and TripleDES. In the proceedings of
FPL 2003, Lecture Notes in Computer Science, vol 2778, pp. 181-193,
Springer-Verlag.
[18] FX. Standaert, G. Rouvoy, JJ. Quisquater, JD. Legat. Efficient Im-
plementation of Rijndael Encryption in Reconfigurable Hardware: Im-
provements and Design Tradeoffs. In the proceedings of CHES 2003,
Lecture Notes in Computer Science, vol 2779, pp. 334-350, Springer-
Verlag.