
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 4, APRIL 2010, p. 553

An Efficient Multimode Multiplier Supporting AES and Fundamental Operations of Public-Key Cryptosystems
Chen-Hsing Wang, Chieh-Lin Chuang, and Cheng-Wen Wu
Abstract: This paper presents a highly efficient multimode multiplier supporting prime-field, polynomial-field, and matrix-vector multiplications based on an asymmetric word-based Montgomery multiplication (MM) algorithm. The proposed multimode 128 × 32 b multiplier provides throughput rates of 441 and 511 Mb/s for 256-b operands over $GF(p)$ and $GF(2^n)$, respectively, at a clock rate of 100 MHz. With 21 930 additional gates for the Advanced Encryption Standard (AES), the multiplier is extended to provide 1.28-, 1.06-, and 0.91-Gb/s throughput rates for 128-, 192-, and 256-b keys, respectively. The comparison result shows that the proposed integration architecture outperforms others in terms of performance and efficiency for both AES and MM, the latter being essential in most public-key cryptosystems.

Index Terms: Advanced Encryption Standard (AES), composite field arithmetic, digital signature algorithm (DSA), elliptic-curve cryptography (ECC), Montgomery multiplication (MM), Rivest, Shamir, and Adleman (RSA).

Manuscript received September 25, 2008; revised December 31, 2008. First published October 16, 2009; current version published March 24, 2010. The authors are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan. Digital Object Identifier 10.1109/TVLSI.2009.2013958

I. INTRODUCTION

THE RAPID evolution of communication technology has altered human life deeply during the past two decades. Many communication applications have been invented to make daily life more convenient, such as credit card transaction systems and Internet transaction services. However, transmitting private data over an insecure network carries significant risk and may result in huge losses. Security has therefore become an important issue in today's wired and wireless Internet applications. One of the most useful methods to protect data is to employ a cryptographic system, as the design of cipher algorithms is based on advanced mathematical theory. A secure protocol usually mixes different types of cryptosystems to provide a safe channel for data transmission. Generally speaking, asymmetric-key cryptosystems, such as RSA (named after Rivest, Shamir, and Adleman, who first publicly described it in 1978), are typically used for key establishment and authentication, while symmetric-key cryptosystems, such as the Data Encryption Standard (DES) or the Advanced Encryption Standard (AES), are used to encrypt bulk data in the transmission phase. Due to limited computing resources in portable applications, the system usually off-loads the security process to dedicated hardware.

Recently, there have been many works on designing cost-effective encryption hardware for portable applications [1]-[12]. Some works [1]-[5] focus on area reduction of AES, while others [6]-[10] propose to reduce hardware cost for both ECC and RSA cryptosystems. In [1], the authors propose a cost-effective method to implement ShiftRows and InvShiftRows by using only 16 8-b registers and a few 8-b multiplexers. In [2], the authors present a fully rolled inner-pipelined architecture that uses only two 8-b basis conversion units, while others need 16 conversion and 16 inverse-conversion units. Three different area reduction strategies for the InvSubBytes/SubBytes transformations are proposed in [3]-[5]. In [3], the authors propose using Fermat's little theorem [13] to compute modular inversion over $GF(2^8)$. An efficient method using composite field arithmetic to reduce hardware complexity in modular inversion over $GF(2^8)$ is proposed in [4]. Using a single lookup table to implement modular inversion over $GF(2^8)$ for both encryption and decryption is proposed in [5].

Other papers study asymmetric-key cryptosystems. They concentrate on designing a fast yet low-cost multiplier, which is the basic functional block of public-key cryptosystems, e.g., RSA and the digital signature algorithm (DSA). In [6], a scalable multiplier consisting of several processing elements (PEs) chained in a pipeline fashion is proposed, where each PE contains two word-wide carry-save adders (CSAs). Its performance depends on the number of linked PEs, giving users high flexibility in the performance/area tradeoff for various applications. Based on [6], two improved pipeline architectures are proposed in [7] and [8]. In [8], a parity prediction module is inserted in each PE to remove the pipeline stall, improving the performance with minor area overhead. The other paper [7] presents an enhanced pipeline architecture with less latency and hardware cost. A word-based multiplier is proposed in [9], which requires no additional logic gates to disable carry propagation, while the adder-based architectures [6]-[8] need an AND gate to stop carry propagation. The selection of the word widths in bits is also a flexible parameter providing the tradeoff between performance and cost. Another method [10] (only supporting the $GF(p)$ field) uses two carry-propagation adders (CPAs) in an arithmetic unit. Its target is to handle different key sizes by combining multiple arithmetic units in parallel for ECC or in series for RSA. Compared with the architectures in [6]-[8], it needs no extra CPA to convert a carry-save redundant number into a normal binary number.

Some works [11], [12] propose to simultaneously consider area reduction for symmetric- and asymmetric-key cryptography algorithms. In [11], the authors propose a unified multiplier to accelerate both ECC and AES cryptosystems, but the AES performance decreases dramatically due to inefficient computation

1063-8210/$26.00 © 2010 IEEE


of modular inversion over $GF(2^8)$. A universal cryptography processor for smart-card applications is proposed in [12]. Three cryptosystems (AES, DES, and ECC) are implemented in a small chip that consists of five fine-grained function blocks and a reconfigurable microprogram controller. As the target application is smart cards, the design has lower area cost and lower power consumption, but also lower performance (1.83 Mb/s for AES).

In our survey, most papers [1]-[10] focus on area minimization of a single type of cryptosystem (either AES/DES or RSA/ECC). Only a few papers [11], [12] discuss the topic for both types of cryptosystems. Nevertheless, due to cost considerations, most of them implement AES with a 32-b data path, which has lower performance. When considering security on portable applications, users often need both types of cryptographic hardware to speed up certain applications. Thus, it is necessary to study efficient architectures supporting dual-type cryptosystems with higher performance but lower area cost.

The purpose of this paper is to design a cost-effective multiplier that can enhance the performance of multiple encryption algorithms, particularly AES. Other algorithms that are composed of a large number of multiplications or matrix-vector (MV) multiplications may benefit as well. Based on a word-based Montgomery multiplication (MM) algorithm [14], [9], we propose a multimode multiplier supporting the essential operations used by AES and the public-key cryptosystems. We present a different solution to this topic, one that considers area minimization for multiple encryption standards together in a single architectural design.

II. CIPHER ALGORITHMS

A. AES Algorithm

AES is a private-key block cipher algorithm composed of three key procedures: the encryption, decryption, and round-key expansion processes. It deals with data blocks of 128 b using keys with three standard lengths of 128, 192, or 256 b. Fig. 1 shows the AES algorithm.
Each 128-b block is arranged as a 4 × 4 state, operated on by four primitive transformations. During the encryption/decryption process, the four primitive transformations are executed iteratively in $N_r$ rounds, where the value of $N_r$ is 10, 12, or 14, depending on which key size is selected. In the encryption procedure, the incoming data are first bitwise XORed with an initial key, and then, four transformations are executed in the following order: SubBytes, ShiftRows, MixColumns, and AddRoundKey. Notice that the MixColumns transformation is not performed in the last round. The execution sequence is reversed in the decryption process, where the inverse transformations are InvSubBytes, InvShiftRows, InvMixColumns, and AddRoundKey, respectively. Since each round needs a round key, an initial key is used to generate all round keys before encryption/decryption.

In the AES algorithm, the SubBytes transformation is a nonlinear byte substitution composed of two operations: 1) modular inversion over $GF(2^8)$, modulo the irreducible polynomial $x^8 + x^4 + x^3 + x + 1$, and 2) an affine transformation, defined as $y = Ax + b$, where $A$ is an 8 × 8 b matrix, $b$ is an 8-b constant, and $x$ and $y$ denote the 8-b input and output. In the MixColumns transformation, the 128-b data arranged as a 4 × 4 state are operated on column by column. The four elements of each column form a four-term polynomial that is multiplied by the constant polynomial $\{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$ modulo $x^4 + 1$. The ShiftRows transformation is a simple operation in which each row of the state is cyclically shifted by a different offset (to the left for ShiftRows and to the right for InvShiftRows). The AddRoundKey transformation is a bitwise XOR operation of each round key and the current state.

Fig. 1. AES algorithm.

B. MM Algorithm

Modular multiplication is the major operation of many popular public-key cryptosystems, and the MM algorithm [14], proposed by Montgomery in 1985, is the most effective algorithm to compute it. The MM algorithm replaces the modular multiplication with a series of additions and right shifts. Given the inputs $A$, $B$, and $N$, where $N$ is an $n$-bit modulus, $0 \le A, B < N$, and $R = 2^n$, the output of the MM algorithm is equal to $A B R^{-1} \bmod N$. The MM algorithm, shown in Algorithm I, generally consists of four phases: the parity generation phase, the accumulation phase, the reduction phase, and the final-correction phase. We use the labels {P1, P2, P3, and P4} to mark the four operation phases, respectively. As shown in Algorithm I, the for-loop iteratively generates the parity $q$, accumulates each partial product, and performs the modular reduction. After the for-loop, the final correction adjusts the final result to fall within the range $[0, N)$.

Algorithm I:
Inputs: $A$, $B$, $N$
Outputs: $S = A B R^{-1} \bmod N$, $R = 2^n$
    $S = 0$;
    for ($i = 0$ to $n - 1$) {
        P1: $q = (S + a_i B)\ \%\ 2$;  // Parity generation
        P2: $S = S + a_i B$;  // Accumulation
        P3: $S = (S + qN) / 2$;  // Reduction
    }
    P4: return $(S \ge N)\ ?\ S - N : S$;  // Final correction
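The four phases of Algorithm I can be sketched in a few lines of Python. This is a minimal bit-serial illustration, assuming an odd modulus and operands already reduced below it; the function and parameter names are illustrative and are not taken from the algorithm listing.

```python
def montgomery_multiply(a: int, b: int, n: int, n_bits: int) -> int:
    """Bit-serial Montgomery multiplication.

    Returns a * b * 2^(-n_bits) mod n, assuming n is odd and 0 <= a, b < n < 2^n_bits.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(n_bits):
        a_i = (a >> i) & 1           # i-th bit of the multiplier
        q = (s + a_i * b) & 1        # P1: parity generation
        s = s + a_i * b              # P2: accumulation
        s = (s + q * n) >> 1         # P3: reduction (the sum is even, so the shift is exact)
    return s - n if s >= n else s    # P4: final correction


if __name__ == "__main__":
    n, n_bits = 239, 8
    a, b = 123, 45
    r_inv = pow(1 << n_bits, -1, n)  # R^-1 mod n (Python 3.8+)
    print(montgomery_multiply(a, b, n, n_bits) == (a * b * r_inv) % n)
```

Because the running sum stays below $2N$ after every iteration, a single conditional subtraction in P4 is enough to bring the result into $[0, N)$.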

III. MODIFICATIONS OF AES AND MM ALGORITHMS

A. Extraction of MV Multiplication From AES Round Function

In the AES algorithm, the Inv-/SubBytes and Inv-/MixColumns transformations are the most complicated ones; their basic operations are modular inversion and MV multiplication. The MV multiplication discussed in this paper is defined as the multiplication of an 8 × 8 b matrix and an 8-b vector, where the elements of the matrix and the vector are either 1 or 0. The notation Inv-/ denotes both the inverse and the original transforms; e.g., Inv-/SubBytes denotes the InvSubBytes and SubBytes transformations. Our target is to find as many MV multiplications in the AES algorithm as possible; therefore, we apply some reductions, rearrangement, and grouping to the four primitive transformations.

In the AES round function, the four primitive transformations are categorized as linear and nonlinear operations. Only the Inv-/SubBytes transformations are nonlinear, while the others are linear. As mentioned in Section II-A, the Inv-/SubBytes transformations consist of modular inversion over $GF(2^8)$ and Inv-/affine transformations. To reduce hardware complexity, modular inversion is usually simplified by composite field arithmetic [15], [16], moving the computation from $GF(2^8)$ to a composite field. We therefore use this technique to reduce the hardware complexity of modular inversion, and the four primitive transformations are decomposed, rearranged, and regrouped as new linear and nonlinear operations, as shown in Fig. 2. The figure labels the Inv-/isomorphism functions, which map between $GF(2^8)$ and the composite field in the two directions, and the Inv-/affine transformations. RK denotes the round key. Inv-/ShiftRows and Inv-/MixColumns are abbreviated as ISR/SR and IMC/MC. Fig. 2(a) shows the encryption/decryption sequence of the AES round function, in which the multiplexer selects the encryption/decryption path.
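To make the decomposition of SubBytes into modular inversion plus an MV multiplication concrete, the following Python sketch computes the standard AES S-box that way. It works directly in $GF(2^8)$ rather than in the composite field used by the proposed design, computes the inverse by plain exponentiation for brevity, and uses illustrative helper names (`gf_mul`, `gf_inv`, `mv_mul`, `sub_byte`) that do not come from the paper.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return r


def gf_inv(a: int) -> int:
    """Inverse in GF(2^8) as a^254; zero maps to zero, as AES requires."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


# Row i of the 8 x 8 affine matrix, packed as a bit mask over the input bits.
AFFINE_ROWS = [sum(1 << ((i + k) % 8) for k in (0, 4, 5, 6, 7)) for i in range(8)]
AFFINE_CONST = 0x63


def mv_mul(rows, x: int) -> int:
    """Bit-matrix times bit-vector over GF(2): output bit i is parity(row_i AND x)."""
    y = 0
    for i, row in enumerate(rows):
        y |= (bin(row & x).count("1") & 1) << i
    return y


def sub_byte(x: int) -> int:
    """SubBytes on one byte: inversion followed by the affine transformation."""
    return mv_mul(AFFINE_ROWS, gf_inv(x)) ^ AFFINE_CONST


if __name__ == "__main__":
    print(hex(sub_byte(0x00)), hex(sub_byte(0x53)))
```

Only `gf_inv` is nonlinear; `mv_mul` and the XOR with the constant are linear over $GF(2)$, which is the property the regrouping in this section relies on.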
By composite field arithmetic [16], the Inv-/SubBytes transformations are decomposed as Inv-/affine transformations, Inv-/isomorphism functions, and modular inversion over the composite field. In order to shorten the circuit delay in the Inv-/SubBytes module, the Inv-/isomorphism functions are moved outside the inverter to combine with the Inv-/affine transformations. Fig. 2(b) shows the result: the Inv-/SubBytes transformations are clearly separated as modular inversion over the composite field and four linear functions. The order of the linear functions and SR/ISR is further exchanged, as shown in Fig. 2(c). Since the SR/ISR transformations are just cyclic rotations (see Fig. 1), the exchange does not affect the result. However, it provides an opportunity to merge the linear functions with the coefficients of MC/IMC. Finally, loop unrolling is used for further matrix combination, as exhibited in Fig. 2(d). The current and next round functions are considered together, and more consecutive matrices are merged. The rearrangement and regrouping yield the new coefficients of the MC/IMC transformations.

Fig. 2. Regrouping steps of AES round function.

Fig. 3 shows the rearranged result, in which the new AES round function is performed iteratively inside the flip-flops, while the prebasis converter and the postbasis converter, transforming data between $GF(2^8)$ and the composite field, are placed before and after the center. In the figure, NMC/NIMC represents the new Inv-/MixColumns transformations, and NRK represents the new round key, which has been mapped to the composite field. Each 128-b input (DIN) is mapped to the composite field in the first stage, iteratively processed in the central stage, and remapped to $GF(2^8)$ in the final stage. The new AES round function is eventually partitioned into modular inversion over the composite field, SR/ISR, new Inv-/MixColumns, and AddRoundKey.

Fig. 3. Rearranged AES round function.

We give an example to explain the new MixColumns transformation. Let the first column of the 4 × 4 state be the coefficients of a four-term polynomial, and let the result of the new MixColumns transformation be the coefficients of another four-term polynomial. Let the constant coefficients of NMC be as listed in the following:

TABLE I PREEVALUATION OF AES DESIGN

(1)-(3)

Let the new fixed polynomial be formed from these coefficients. The new MixColumns transformation is modified as follows:

(4)-(7)

In the aforementioned equations, the MV multiplications of the three constant coefficients are shown in (8), (9), and (10), respectively, where the index runs from zero to three.

(8)-(10)

In Fig. 3, the new AES round function is clearly regrouped into a new nonlinear part (the composite-field inverter) and linear parts. Based on this architecture, an RTL model is implemented for a fast evaluation of the area statistics of the linear and nonlinear parts. The RTL model is then synthesized with a commercial 0.18-μm cell library. Table I lists the resulting area statistics, excluding the round-key generation. The Inv-/SubBytes transformations, which are the most area consuming in hardware implementation, have been reduced to the modular inversion over the composite field (the nonlinear operation), accounting for 11.0% of the total gates. The other linear functions, including the prebasis/postbasis converters, SR/ISR, NMC/NIMC, and AddRoundKey functions, account for 54.3% of the total gates. Two 1024-b FIFOs buffering the input/output data account for 33.4% of the total gates. The evaluation reveals that the AES area can be greatly reduced, provided that the linear functions can be done by a multimode multiplier as proposed in this paper, which will be described in detail in the next section. Moreover, the storage element can be shared with the MM algorithm as well, saving more area.

B. AWBMM Algorithm

We modify the MM algorithm, reported in [14] and [9], into an asymmetric word-based MM (AWBMM) algorithm. The asymmetric feature of the operand size helps us design an efficient multiplier to support MV and dual-field multiplications. In the AWBMM algorithm (Algorithm II), all operands are represented in word-based form, but the word widths of different operands may differ. For instance, one $n$-bit integer operand is represented with words of one width, while another $n$-bit integer operand is represented with words of a different width. The notation in Algorithm II distinguishes the $i$th word of an operand from a sequence of bits taken from one bit position to another. One variable denotes the parity, and four variables are used to store the temporary values in the inner for-loop. The AWBMM algorithm is similar to the MM algorithm in that it has the same operation phases: parity generation, accumulation, reduction, and final correction, marked with P1, P2, P3, and P4 in Algorithm II, respectively. The difference is that all operands in the AWBMM algorithm are processed word by word. In short, the outer for-loop iteratively generates the parity, and the inner for-loop


concurrently accumulates the partial products and does the modular reduction. The final correction adjusts the output into the correct range $[0, N)$ after the outer loop is finished. Compared with the symmetric word-based MM algorithm reported in [9], both the symmetric and asymmetric algorithms have the same complexity. Given 256-b inputs, e.g., both of them need 16 iterations to accomplish a 256-b MM operation. The AWBMM algorithm is also efficient in supporting MM over $GF(2^n)$ [17] by replacing the integer modulus with an irreducible polynomial of degree $n$. Here, the operands are polynomials of degree less than $n$, and the result is the corresponding Montgomery product modulo the irreducible polynomial.

Algorithm II:
Inputs: …
Outputs: …
    …;
    for (… to …) {
        P1: … % …;  // Parity generation
        for (… to …) {  // Accumulation & reduction
            P2: …;
            P3: …;
        }
    }
    P4: return … ? … : …;  // Final correction

IV. PROPOSED MULTIMODE MULTIPLIER

Based on the modified AES and MM algorithms, we propose a multimode multiplier to support both MV and dual-field multiplications. The multimode multiplier is modified from the dual-field multiplier proposed in [9], as both their multiplier and ours are designed based on the word-based MM algorithm. Therefore, the description in this section focuses on the support for MV multiplication.

A. Reformulation of MV Multiplication

Let $X = (x_7, \ldots, x_0)$ and $Y = (y_7, \ldots, y_0)$ be two 8-b vectors, and let $M = [m_{i,j}]$ be an 8 × 8 b matrix, where the elements are either 0 or 1. MV multiplication is defined as

$y_i = \bigoplus_{j=0}^{7} m_{i,j}\, x_j$    (11)

where the value of $i$ is from zero to seven. The MV multiplication of (11) can be reformulated as eight vectors XORed together. The eight columns of the matrix $M$ shown in (11) are first defined as $M_0, M_1, \ldots, M_7$, i.e., $M_j = (m_{7,j}, \ldots, m_{0,j})^T$. The MV multiplication is then reformulated as $Y = \bigoplus_{j=0}^{7} x_j \cdot M_j$. Here, the operation $x_j \cdot M_j$ selects a column: it becomes $M_j$ when $x_j = 1$ and the all-zero vector when $x_j = 0$. For example, $1 \cdot M_j$ is equal to $M_j$.

Fig. 4. Proposed multimode 8 × 8 b multiplier.

B. Proposed Multimode 8 × 8 b Multiplier

Fig. 4 shows the proposed multimode multiplier, using a multimode 8 × 8 b multiplier as an example. Fig. 4(a) shows the eight partial products of an 8 × 8 b multiplication.


Fig. 5. Arrangement of new MixColumns coefficients and 128-b data.

The eight partial products are padded with some zeros and partitioned as P1, P2, and P3, as shown in Fig. 4(b). The new eight partial products of P2 are labeled as pp1, pp2, ..., and pp8 and fed into an XOR tree calculator (XTC), shown in Fig. 4(c). It is named XTC since only the sum vectors of each CSA are XORed [see the right part of Fig. 4(c)]. In the XTC module, the square and circle at the output of a CSA or HA denote the produced carry vector and sum vector. The carry vectors of each CSA and the final sum vector are sent to the Wallace tree accumulator (WTA) module. In the meantime, the other partial products in P1 and P3 are sent to the WTA as well for carry-save accumulation. Finally, an adder is used to convert the WTA's outputs (carry and sum) into a normal binary number. It is obvious that the final sum vector of the XTC is equal to the XOR value of pp1, pp2, ..., and pp8. By replacing these partial products with the column vectors of the reformulated MV multiplication, the MV product can be obtained at the XTC module. In addition, the polynomial product is easy to obtain by concatenating the sum vectors of P1, P2 (final sum vector), and P3. Hence, we can get the MV product from the XTC module, the polynomial product from the WTA module, and the integer product from the final-stage adder.

C. Width Selection of Multimode Multiplier

So far, a novel multimode 8 × 8 b multiplier has been presented. The multiplier size is further enlarged to handle all MV multiplications needed in the new AES round function. According to the analysis in Section III-A, 64 MV multiplications must be executed concurrently in each Inv-/MixColumns transformation. Fig. 5(a) shows the MixColumns transformation represented by the 16 × 4 blocks. In the figure, a 128-b input is partitioned into 16 8-b vectors, and the constant coefficients of the new MixColumns are those defined in Section III-A. Each block indicates an MV multiplication and produces an 8-b intermediate value.
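The relation between the three products named in Section IV-B can be illustrated in software. The sketch below is a behavioral model only, not a model of the CSA-based XTC/WTA circuits: it assumes 8-b operands, packs each vector into a Python integer, and uses illustrative function names. Summing the partial products gives the integer product, XORing them gives the polynomial (carry-less) product, and XORing matrix columns selected by the input bits gives the MV product.

```python
def partial_products(a: int, b: int, width: int = 8):
    """Shifted copies of a, one per bit of b (zero where the bit of b is clear)."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def xor_all(vectors) -> int:
    """XOR-reduce an iterable of bit vectors packed as integers."""
    acc = 0
    for v in vectors:
        acc ^= v
    return acc


def integer_product(a: int, b: int) -> int:
    """Integer multiplication: partial products are added with carries."""
    return sum(partial_products(a, b))


def polynomial_product(a: int, b: int) -> int:
    """Multiplication over GF(2)[x]: partial products are XORed, no carries."""
    return xor_all(partial_products(a, b))


def mv_product(columns, x: int) -> int:
    """MV multiplication over GF(2): XOR of the columns selected by the bits of x.

    columns[j] is column j of the 8 x 8 bit matrix, with row i stored at bit i.
    """
    return xor_all(columns[j] if (x >> j) & 1 else 0 for j in range(8))


if __name__ == "__main__":
    identity = [1 << j for j in range(8)]
    print(integer_product(0xB7, 0x5D), polynomial_product(0b11, 0b11), mv_product(identity, 0xA5))
```

The only difference between `polynomial_product` and `mv_product` is which vectors enter the XOR, which mirrors how the same XOR tree serves both modes once the partial products are replaced.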
The positions of all blocks are carefully arranged so that the result of the MixColumns transformation is defined as the 16 × 4 intermediate values XORed column by column. For example, the MixColumns transformation of the first column of the 4 × 4 state, listed in (4), (5), (6), and (7), is represented as the first four columns on the rightmost side. The four results are obtained by vertically XORing the rightmost 4 × 4 intermediate values. In fact, each 8-b result of the MixColumns transformation can be represented as 32 8-b vectors XORed together. The 32 8-b vectors of each column can therefore be concatenated into 32 128-b vectors. The MixColumns transformation is finally formulated as the XOR value of 32 128-b vectors; hence, it needs a 128 × 32 b multiplier to do the XOR operation.

Fig. 5(b) and (c) show the extension from a multimode 8 × 8 b multiplication (see Fig. 4) to a multimode 128 × 32 b multiplication. Fig. 5(b) shows the partial products of a 128 × 32 b multiplication. In the same way, the 32 128-b partial products are padded with some zeros in the upper- and bottom-right corners, as shown in Fig. 5(c). The padding does not affect the result of dual-field multiplication since only zeros are added to the partial products. The central 32 128-b partial products with the padded zeros are computed by an enlarged XTC. Then, if the new Inv-/MixColumns transformations are needed, the central partial products are simply replaced by new 32 128-b vectors arranged as in Fig. 5(a).

We recommend the use of the 128 × 32 b multiplier for supporting the MV multiplications since it is the most cost effective. Other sizes, such as 64 × 64 or 256 × 16 b, either waste too much area or reduce the MV multiplication efficiency. Fig. 6(a) shows the use of a 64 × 64 b multiplier to do the MV multiplication. It needs to pad 1520 zeros in the upper- and bottom-right corners to support the XOR operation of 32 128-b vectors, whereas the 128 × 32 b multiplier only needs 256 zeros. In circuit design, zero padding means adding XOR gates in the multiplier, as the MV multiplication only needs an XOR operation. Fig. 6(b) shows another case, which is a 256 × 16 b multiplier.
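The count of 32 vectors per output byte can be checked with a small model. The sketch below uses the standard AES MixColumns coefficients ({02}, {03}, {01}, {01}), not the merged NMC coefficients of the proposed design, which are not reproduced here; all function names are illustrative. Each constant multiplication in $GF(2^8)$ is expressed as an 8 × 8 bit matrix, so each output byte is the XOR of four MV products, i.e., of 4 × 8 = 32 selected 8-b column vectors.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return r


def const_mul_columns(c: int):
    """Columns of the 8 x 8 bit matrix for multiplication by c: column j is c * x^j."""
    return [gf_mul(c, 1 << j) for j in range(8)]


def mv_product(columns, x: int) -> int:
    """XOR of the matrix columns selected by the set bits of x."""
    acc = 0
    for j in range(8):
        if (x >> j) & 1:
            acc ^= columns[j]
    return acc


# Standard AES MixColumns matrix (an assumption of this sketch, see the lead-in).
MIX = [[2, 3, 1, 1],
       [1, 2, 3, 1],
       [1, 1, 2, 3],
       [3, 1, 1, 2]]


def mix_column(col):
    """MixColumns on one 4-byte column, built only from MV products and XORs."""
    out = []
    for row in MIX:
        acc = 0
        for c, s in zip(row, col):
            acc ^= mv_product(const_mul_columns(c), s)
        out.append(acc)
    return out


if __name__ == "__main__":
    print([hex(v) for v in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

With four columns in the state, the same construction gives 16 output bytes, each an XOR of 32 vectors, which matches the 32 128-b vectors described above.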
Since the depth of the 256 × 16 b multiplier is only 16 b, it must take twice as many clock cycles to do the Inv-/MixColumns transformations as the 128 × 32 b multiplier. Although no zeros need to be padded in the 256 × 16 b multiplier, the slow MV multiplication degrades the AES performance. Therefore, we choose the size of 128 × 32 b to implement the MV multiplication, which has less area overhead but the highest performance.

Fig. 6. Multiplier width selection.

V. CIPHER CORE ARCHITECTURE

Fig. 7 shows the proposed cipher core architecture based on a multimode 128 × 32 b multiplier (highlighted by a solid rectangle). Other components include dedicated hardware for the AES function (enclosed by a dashed rectangle), the storage element unit, the main controller, the I/O controller, and the I/O interface. The storage element unit includes two 128 × 16 b memory banks and 1024 flip-flops for data buffering. They are wrapped as six read ports and two write ports, which are controlled by the storage management unit.

Fig. 7. Proposed cipher core architecture based on a multimode 128 × 32 b multiplier.

A. Enhanced Multimode 128 × 32 b Multiplier

The multimode 128 × 32 b multiplier shown in Fig. 7 is an extension of the 8 × 8 b version, allowing each AES round to be done within one clock cycle. In each clock cycle, the partial product generator produces 32 128-b partial products according to the operation mode. It is trivial to generate the partial products for dual-field multiplication, but it takes more effort to organize the matrix coefficients and inputs for the AES function. A smart arrangement, shown in Fig. 5, performs the new MixColumns transformation in the XTC module. The XTC module is composed of several CSAs arranged in a treelike structure. It only calculates the XOR results of all input partial products and sends all carry vectors to the WTA module. The WTA accumulates all input vectors into one carry vector and one sum vector; then, the two vectors are added to get the result of integer multiplication.

The multimode multiplier is further enhanced to support the multiply-add function to accelerate the execution of the MM algorithm. If the multiply-add function is provided, Algorithm II requires only two clock cycles to accumulate all vectors per inner-loop iteration, while five clock cycles are needed originally. One CSA is accordingly added to the XTC module for this enhanced function, and two source ports are created to transmit the additional operands to the multiplier. The multiply-add function not only accelerates the execution of the AWBMM algorithm but also provides another chance for further integration of the AddRoundKey and Inv-/MixColumns transformations. With this enhancement, each round key can be XORed with other partial products directly (see Fig. 7). The constant vectors of the Inv-/affine transformations [see (8), (9), and (10)], which have been merged with the coefficients of the Inv-/MixColumns and Inv-/isomorphism functions, can be XORed with other partial products as well. Finally, basic addition and subtraction are supported as well. In summary, the proposed multimode multiplier supports modular addition, modular subtraction, and modular multiplication for public-key cryptographic algorithms, and it also provides the enhanced MV multiplication to accelerate the AES algorithm.

B. Dedicated Module for AES

As our multiplier only supports the MV multiplication for the AES function, some dedicated hardware modules are still needed for the other operations, such as modular inversion over the composite field, key expansion, etc. The dedicated hardware modules shown in Fig. 7 contain a key expansion unit generating round keys, a prebasis converter, a composite-field inverter, and a ROM storing the matrix coefficients, embedded in the partial product generator. All round keys are precomputed by the key expansion unit and kept in the storage element before data encryption/decryption. Because RSA requires a larger storage size, the storage element unit is large enough to cache all round keys. For example, at least 4000 flip-flops are needed to store the parameters of a 1000-b RSA operation, yet fewer than 4000 flip-flops are required to store all round keys generated from a 256-b initial key. The prebasis converter provides two basis transformations, one for encryption and one for decryption. Although the prebasis conversion can be done by the multimode multiplier, a dedicated prebasis converter is still implemented, as it improves the performance with acceptable area overhead. Without the dedicated prebasis converter, the multimode multiplier needs one more clock cycle for the basis conversion from $GF(2^8)$ to the composite field. Another clock cycle would originally be needed to transform the encryption/decryption results back to $GF(2^8)$ after the round function, but the final round does not need the Inv-/MixColumns transformations; thus, the multimode multiplier is reused for postbasis conversion in the final round. The proposed architecture can hence


deal with every 128-b data block in 10, 12, or 14 cycles for a 128-, 192-, or 256-b key, respectively.

VI. EXPERIMENTAL RESULTS

A. Implementation, Area Statistics, and Power Profile

The proposed multimode multiplier is implemented by following the general soft-IP design flow. The area statistics of the design are listed in Table II. The gate count of the multimode 128 × 32 b multiplier is 77 010, of which the XTC and WTA modules contribute 64 110 gates in total. Since the partial product generator is shared by the AES and MM algorithms, its gates are shared as well. As the linear functions are done in the multimode multiplier, only 21 930 extra gates are required to implement the dedicated modules for AES. Compared with the evaluation shown in Table I, the proposed architecture (21 930 gates) needs fewer than half of the gates (49 700 gates) that would be required if AES were directly integrated without any hardware sharing. Moreover, the shared I/O interface and storage elements for multiple algorithms also save a lot of area. In total, more than 50 000 gates are saved for the AES design, as compared with the evaluation result. In addition, the integration does not degrade the AES performance. For a 128-b input, the integration architecture can finish data encryption/decryption in 10, 12, or 14 clock cycles for a 128-, 192-, or 256-b key, respectively. The CBC mode is supported in the integration architecture as well. The integration has no adverse effect on either AES or dual-field multiplication, since the added XOR gates for MV multiplication do not affect the timing-critical paths of the dual-field multiplier.

We use a commercial power estimation tool to estimate the power consumption of the proposed cipher core, where the multimode multiplier is configured to perform the AES and MM algorithms, respectively. The average power consumption is represented by the dark bars in Fig. 8(a), where the values are 100.8 mW for AES and 100.9 mW for MM.
Due to high degree of resource sharing, the power consumption for both algorithms is quite similar in whatever mode the multimode multiplier is congured. It means that the multimode multiplier will waste power if it is operated in the AES mode since the WTA module and the nal-stage adder are not disabled. In order to compensate the drawback, some AND gates are added between the XTC and WTA modules to disable the signal transitions that will be propagated to the WTA module and adder, producing unwanted power consumption. The disable command is given from the controller, which is forced to Low in the AES mode but kept High in other modes. The cipher core with the disable logic consumes 68.69 and 102.6 mW for the AES and MM algorithms, respectively, which are represented by the light bars in Fig. 8(a). The power prole of the cipher core working in the AES mode is shown in Fig. 8(b), in which the two bars denote the power consumption with/without disable logic, respectively. The values marked on the bars represent the power consumption of the respective modules. With the disable logic, the power consumption of AES can be reduced from 100.8 to 68.69 mW, with a 31.8% improvement. We apply the same method to disable the dedicated AES circuit when the multimode multiplier is working in the MM mode. However, the power saving is not obvious since the partial product generator, having the most power

TABLE II AREA STATISTICS OF CIPHER CORE

Fig. 8. (a) Power consumption of AES and MM. (b) AES power profile.
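The area and power figures quoted in this subsection can be cross-checked with a short calculation. The sketch below only re-derives differences and ratios from the numbers stated in the text; the constant and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the text (gate counts, and average power in mW).
MULTIPLIER_GATES = 77_010      # multimode 128 x 32 b multiplier
XTC_WTA_GATES = 64_110         # XTC + WTA modules inside the multiplier
AES_SHARED_GATES = 21_930      # extra AES gates with hardware sharing
AES_UNSHARED_GATES = 49_700    # Table I evaluation without sharing

POWER_MW = {
    # mode: (without disable logic, with disable logic)
    "AES": (100.8, 68.69),
    "MM": (100.9, 102.6),
}


def area_summary():
    """Return derived area ratios (hypothetical helper, not from the paper)."""
    saved = AES_UNSHARED_GATES - AES_SHARED_GATES
    return {
        "xtc_wta_share": XTC_WTA_GATES / MULTIPLIER_GATES,
        "aes_gates_saved": saved,
        "aes_saving_fraction": saved / AES_UNSHARED_GATES,
    }


def power_delta_mw(mode):
    """Power change caused by the disable logic; negative means a saving."""
    before, after = POWER_MW[mode]
    return round(after - before, 2)


if __name__ == "__main__":
    print(area_summary())
    for mode in POWER_MW:
        print(mode, power_delta_mw(mode), "mW")
```

Under these figures, the disable logic costs a small amount of extra power in the MM mode in exchange for a much larger saving in the AES mode, which is the trade-off the text describes.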

consumption in the dedicated AES circuit, cannot be disabled. Therefore, we decide to insert the disable logic for AES only. The disable logic needs 2100 extra gates and adds 1.7 mW of power for MM; however, it saves 32.11 mW for AES.

B. Result Comparison and Discussion for AES Algorithm

We compare the proposed integration architecture with others in terms of the AES and MM algorithms, as shown


in Table III. The work in [11] and ours support both the AES and MM algorithms; the others support only one of them. The gate counts for [11] and ours are separated into the dedicated AES circuit and the multiplier, which are listed in Table III(a) and (b), respectively. The total gate count of the dedicated AES circuit and the multiplier, enclosed in parentheses, is also listed in the table, i.e., 98 930 for ours and 56 000 for [11]. In addition, the AES data (with the disable logic) are also listed in Table III(a).

In Table III(a), we compare the proposed integration architecture with other works for the AES algorithm. Some works based on full pipelining or round unrolling reported in [20] and [21] are not taken into account, as they achieve a higher throughput rate at the cost of a much higher gate count (150 000–350 000 gates). The compact architecture reported in [1] is based on a 32-b data path and a 32-b on-the-fly key scheduler. However, it only supports the 128-b key size and takes more clock cycles to encrypt a 128-b data block (44 cycles) than ours (10 cycles). Highly regular and scalable AES hardware is presented in [18]. The regularity saves area in circuit design, but the full-custom design makes porting to a new process difficult. The work reported in [2] presents a low-cost design as well. It implements all possible combinations (128, 192, 256) of data and key sizes, but only data encryption is supported. In [5], the authors propose to share the same modular inversion in both the SubBytes and Inv-SubBytes transformations. They implement modular inversion by using 16 lookup tables. As they do not use composite field arithmetic to reduce the hardware complexity of modular inversion, their design requires more hardware. It consumes 32 000 gates to provide a throughput of 610 Mb/s. The work reported in [4] is a highly efficient design, but the pipeline architecture makes it inefficient in the CBC mode.
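The CBC limitation noted for pipelined designs follows from the chaining dependency: each block's input includes the previous ciphertext block, so the next block cannot enter the pipeline until the previous one has left it. The sketch below illustrates this with a toy 128-b permutation standing in for AES. This stand-in is an assumption made purely for illustration; `toy_block_encrypt` is not AES and is not secure.

```python
MASK128 = (1 << 128) - 1
# Any odd constant gives a bijection modulo 2**128; this value is arbitrary.
ODD_MULTIPLIER = 0x9E3779B97F4A7C15F39CC0605CEDC835


def toy_block_encrypt(block, key):
    """Toy invertible 128-b mapping used as a placeholder for one AES call."""
    return ((block ^ key) * ODD_MULTIPLIER) & MASK128


def encrypt_ecb(blocks, key):
    """ECB: every block is independent, so blocks could be pipelined."""
    return [toy_block_encrypt(p, key) for p in blocks]


def encrypt_cbc(blocks, key, iv):
    """CBC: block i needs ciphertext i-1 before it can start."""
    ciphertext = []
    previous = iv
    for p in blocks:
        previous = toy_block_encrypt(p ^ previous, key)
        ciphertext.append(previous)
    return ciphertext


def changed_positions(first, second):
    """Indices at which two equally long block lists differ."""
    return [i for i, (u, v) in enumerate(zip(first, second)) if u != v]


if __name__ == "__main__":
    key, iv = 0x0123, 0x42
    a = [1, 2, 3, 4]
    b = [9, 2, 3, 4]  # only the first plaintext block differs
    print(changed_positions(encrypt_ecb(a, key), encrypt_ecb(b, key)))
    print(changed_positions(encrypt_cbc(a, key, iv), encrypt_cbc(b, key, iv)))
```

Changing only the first plaintext block alters one ciphertext block in ECB but every following block in CBC, which is why a deep pipeline sits mostly idle in the CBC mode while an iterative one-round-per-cycle design is unaffected.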
The work reported in [3] also uses a pipeline architecture to increase the clock rate, making it unsuitable for the CBC mode as well. In [11], the authors propose a single cipher core architecture to support both ECC and AES cryptosystems. In their architecture, a dedicated modular reduction module is designed for modular multiplication over $GF(2^{8})$, where the irreducible polynomial used by AES is implemented. With this module, their architecture is extended to handle dedicated modular multiplication for AES. Although the dedicated module for AES only consumes 6100 gates, the AES performance decreases to 64 Mb/s in their architecture. Compared with previous works, our proposed design outperforms others in terms of hardware efficiency (throughput/gate count). It takes only 21 930 additional gates to implement the dedicated AES circuit, and the AES performance achieves a throughput of 1.28 Gb/s. The proposed design, moreover, supports more features than others (excluding [2]), i.e., encryption/decryption, three key lengths, and ECB/CBC modes.

Our target is to support the AES and MM algorithms in a single core architecture, which is similar to the work reported in [11]. Therefore, we give a more detailed discussion of both integration architectures. Both cipher core architectures are based on enhanced multipliers to support the AES and MM algorithms. Their enhanced multiplier is able to do dedicated modular multiplication for AES, while ours can do MV multiplication. In the AES round function, they observe

that the common operation of the Inv-/SubBytes and Inv-/MixColumns transformations is modular multiplication, where the modular inversion of Inv-/SubBytes can be replaced by a series of modular multiplications based on Fermat's little theorem [13]. We find that the four primitive transformations can be regrouped as new linear and nonlinear operations by applying composite field arithmetic to the Inv-/SubBytes transformations. In their architecture, it takes 200 clock cycles to encrypt 128-b data with a 128-b key; in ours, it takes only ten clock cycles for data encryption with the same key size. The performance degradation in their architecture is due to the inefficient modular inversion, which takes too many clock cycles. In our architecture, modular inversion over the composite field is executed by dedicated hardware, and the other linear operations are done by our proposed multimode multiplier. Therefore, each AES round can be finished in one clock cycle. Comparing the implementation results, both cipher cores are implemented in a 0.18-μm CMOS process and operated at a 100-MHz clock rate. For the AES extension, their enhanced multiplier consumes 6100 extra gates to provide a 64-Mb/s throughput rate, while our dedicated hardware modules cost 21 930 extra gates to provide a 1.28-Gb/s throughput rate. The hardware area of ours is 3.6 times larger than theirs, but the performance of ours is 20 times higher than theirs. In terms of hardware efficiency, our design is about 5.56 times higher than theirs.

C. Result Comparison and Discussion for MM Algorithm

Table III(b) compares this work with other approaches in terms of the MM algorithm. Since there is no good way to compare field-programmable gate array and application-specific integrated circuit implementations, we do not list the hardware efficiency of the two works [7], [10] in Table III(b). In [9], the authors propose a symmetric 64 × 64 b multiplier recognized as a highly efficient design.
They are the first to propose a multiplier-based dual-field multiplier. In their design, the multiplier operates at different clock rates for the different fields ($GF(p)$ and $GF(2^{n})$). Although the performance of MM over $GF(2^{n})$ can achieve a very high throughput rate (2612 Mb/s) at a higher clock frequency, it is more difficult to support different clock frequencies when the core is integrated into a system-on-chip design. Both our enhanced multimode multiplier and the work reported in [11] are based on the dual-field multiplier. In [11], double multipliers with a smaller radix (8 b) in an arithmetic core are implemented to improve the performance of both the AES and MM algorithms. Instead of changing the multiplier size, they increase the number of arithmetic cores for the performance/area tradeoff. We propose to modify the multiplier size, as it gives a better result for AES without degrading the performance of the MM algorithm. As mentioned in Section III-B, the symmetric and AWBMM algorithms have the same complexity. Thus, the performance of the asymmetric 128 × 32 b multiplier is equivalent to that of the symmetric 64 × 64 b multiplier proposed in [9]. In circuit design, the 128 × 32 b multiplier has fewer partial products (32) than the 64 × 64 b multiplier (64). Fewer partial products require less hardware to implement the XTC and WTA modules. However, the asymmetric feature results in a larger vector size, which needs a bigger adder to convert the carry/sum vectors into a normal binary number.
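The trade-off between the two multiplier shapes can be made concrete with a small model. The first helper counts partial-product rows and the product width of an a × b multiplier, assuming one row per multiplier bit (no Booth recoding) and a carry/sum vector as wide as the full product; both are simplifying assumptions rather than details taken from the design. The second function is a generic word-serial Montgomery multiplication that scans one operand in 32-b words. It is a textbook formulation used as a stand-in, not the AWBMM algorithm itself, and it relies on Python 3.8+ for the modular inverse via `pow`.

```python
def multiplier_shape(a_bits, b_bits):
    """Simple cost model for an a_bits x b_bits array multiplier (assumed)."""
    return {
        "partial_product_rows": b_bits,     # one row per multiplier bit
        "bit_products": a_bits * b_bits,    # AND terms generated per cycle
        "product_width": a_bits + b_bits,   # width the final adder must cover
    }


def montgomery_multiply(a, b, n, operand_bits=256, word_bits=32):
    """Return a * b * 2**(-operand_bits) mod n for odd n and a, b < n.

    b is consumed one word at a time; after each word the running sum is
    made divisible by 2**word_bits by adding a multiple of n, then shifted.
    """
    if n % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    if operand_bits % word_bits != 0:
        raise ValueError("operand_bits must be a multiple of word_bits")
    mask = (1 << word_bits) - 1
    n_prime = (-pow(n, -1, 1 << word_bits)) & mask  # -n^(-1) mod 2^w
    s = 0
    for i in range(operand_bits // word_bits):
        b_word = (b >> (i * word_bits)) & mask
        s += a * b_word                      # asymmetric: wide a, narrow b
        q = ((s & mask) * n_prime) & mask    # quotient digit for this word
        s = (s + q * n) >> word_bits         # low word is zero, exact shift
    return s - n if s >= n else s            # single final subtraction


if __name__ == "__main__":
    print(multiplier_shape(128, 32))
    print(multiplier_shape(64, 64))
    n = 2**255 - 19
    print(hex(montgomery_multiply(3**150, 5**100, n)))
```

In this model both shapes generate the same number of bit products per cycle, while the 128 × 32 shape has half the rows to compress and a wider product to resolve in the final adder, which mirrors the argument in the text.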


TABLE III PERFORMANCE COMPARISON
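The AES comparison with [11] discussed in the text can be reproduced from the cycle counts. The sketch assumes that one 128-b block is completed every given number of cycles at 100 MHz with no additional overhead, an assumption that happens to match the quoted throughput rates; the function names are illustrative only.

```python
CLOCK_HZ = 100e6     # both cipher cores are reported at 100 MHz
BLOCK_BITS = 128     # AES block size


def throughput_bps(cycles_per_block, clock_hz=CLOCK_HZ, block_bits=BLOCK_BITS):
    """Throughput in bit/s when one block finishes every cycles_per_block."""
    return block_bits * clock_hz / cycles_per_block


def efficiency_ratio(gates_a, cycles_a, gates_b, cycles_b):
    """Ratio of (throughput / gate count) of design A over design B."""
    eff_a = throughput_bps(cycles_a) / gates_a
    eff_b = throughput_bps(cycles_b) / gates_b
    return eff_a / eff_b


if __name__ == "__main__":
    for key_bits, cycles in ((128, 10), (192, 12), (256, 14)):
        print(key_bits, "b key:", throughput_bps(cycles) / 1e9, "Gb/s")
    print("[11]:", throughput_bps(200) / 1e6, "Mb/s")
    print("area ratio:", 21_930 / 6_100)
    print("speed ratio:", throughput_bps(10) / throughput_bps(200))
    print("efficiency ratio:", efficiency_ratio(21_930, 10, 6_100, 200))
```

The area ratio of about 3.6 and the speed ratio of 20 combine to the efficiency ratio of about 5.56 quoted in the text.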

Furthermore, the bigger adder lengthens the critical path. To compensate for this drawback, we use a fast carry-lookahead adder as the final-stage adder. The work proposed in [6] is the first adder-based dual-field multiplier. It is a scalable pipeline architecture comprising 40 8-b PEs. The works reported in [7], [8], and [19] are enhanced versions of it. The work in [7] enhances the design by reducing the pipeline stall and area cost. The work in [8] removes the pipeline stall by employing a parity prediction module, and that in [19] further enhances the pipeline architecture by improving the clock rate. In [10], a CPA-based arithmetic unit is proposed to efficiently address the issue of the different key sizes used by ECC and RSA cryptosystems in a single architecture. The adder-based multiplier improves the MM performance through a faster clock rate, as compared to the multiplier-based design. When comparing our work with the others [6]–[11], the proposed multimode 128 × 32 b multiplier has better efficiency or performance than the others, excluding [9]. However, the work in [9] is implemented in a more advanced process (0.13 μm) than ours.

VII. CONCLUSION

In this paper, we have presented a high-efficiency and high-performance cipher core based on the proposed multimode multiplier. We use the composite field arithmetic to

decompose the Inv-/SubBytes transformations; therefore, the AES round function is regrouped as new linear and nonlinear functions. The new Inv-/MixColumns transformations, which are the most area-consuming part, are reformulated as multiple MV multiplications; then, they are efficiently executed by our proposed multimode multiplier. The multimode multiplier also supports modular addition and subtraction, as well as multiplication in both the $GF(p)$ and $GF(2^{n})$ fields. Thanks to the shared multiplier, it takes only 21 930 additional gates for AES, and the proposed cipher core can provide 1.28-, 1.06-, and 0.91-Gb/s throughput rates for 128-, 192-, and 256-b keys, respectively. The performance of MM over $GF(p)$ and $GF(2^{n})$ can achieve 441- and 511-Mb/s throughput rates for 256-b operands, respectively. In addition, the integration architecture supports more features than other low-cost AES designs, and it also supports scalable key sizes for the MM algorithm by changing the storage size. As the proposed integration architecture efficiently shares the hardware resources, it saves more area than the straightforward method of directly integrating different cipher cores into a single core architecture. When comparing the hardware efficiency for both the AES and MM algorithms, our proposed architecture is about two to six times higher than others.

REFERENCES

[1] H. Li and J. Li, "A new compact architecture for AES with optimized shiftrows operation," in Proc. IEEE ISCAS, May 2007, pp. 1851–1854.


[2] M. Alam, S. Ray, D. Mukhopadhayay, S. Ghosh, D. RoyChowdhury, and I. Sengupta, "An area optimized reconfigurable encryptor for AES-Rijndael," in Proc. Conf. DATE, Apr. 2007, pp. 1–6.
[3] Y.-K. Lai, L.-C. Chang, L.-F. Chen, C.-C. Chou, and C.-W. Chiu, "A novel memoryless AES cipher architecture for networking applications," in Proc. IEEE ISCAS, May 2004, pp. 333–336.
[4] C.-P. Su, T.-F. Lin, C.-T. Huang, and C.-W. Wu, "A high-throughput low-cost AES processor," IEEE Commun. Mag., vol. 41, no. 12, pp. 86–91, Dec. 2003.
[5] C.-C. Lu and S.-Y. Tseng, "Integrated design of AES (Advanced Encryption Standard) encrypter and decrypter," in Proc. IEEE Int. Conf. Appl.-Specific Syst. Architectures, Processors, Jul. 2002, pp. 277–285.
[6] A. F. Tenca and Ç. K. Koç, "A scalable architecture for modular multiplication based on Montgomery's algorithm," IEEE Trans. Comput., vol. 52, no. 9, pp. 1215–1221, Sep. 2003.
[7] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu, "An improved unified scalable radix-2 Montgomery multiplier," in Proc. 17th IEEE Symp. Comput. Arithmetic, 2005, pp. 172–178.
[8] M.-C. Sun, C.-P. Su, C.-T. Huang, and C.-W. Wu, "Design of a scalable RSA and ECC crypto-processor," in Proc. ASP-DAC, Jan. 2003, pp. 495–498.
[9] A. Satoh and K. Takano, "A scalable dual-field elliptic curve cryptographic processor," IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.
[10] F. Crowe, A. Daly, and W. Marnane, "A scalable dual mode arithmetic unit for public key cryptosystems," in Proc. Int. Conf. ITCC, Apr. 2005, pp. 568–573.
[11] J. Wang, X. Zeng, and J. Chen, "A VLSI implementation of ECC combined with AES," in Proc. ICSICT, Oct. 2006, pp. 1899–1904.
[12] Y. Eslami, A. Sheikholeslami, P. G. Gulak, S. Masui, and K. Mukaida, "An area-efficient universal cryptography processor for smart cards," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 1, pp. 43–56, Jan. 2006.
[13] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. Boca Raton, FL: CRC Press, Oct. 1993.
[14] P. L. Montgomery, "Modular multiplication without trial division," Math. Comput., vol. 44, no. 170, pp. 519–521, Apr. 1985.
[15] V. Rijmen, "Efficient implementation of the Rijndael S-box," 2001. [Online]. Available: http://www.esat.kuleuven.ac.be/~rijmen/rijndael/sbox.pdf
[16] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "Unified hardware architecture for 128-bit block ciphers AES and Camellia," in Proc. CHES, Aug. 2003, pp. 304–318.
[17] Ç. K. Koç and T. Acar, "Montgomery multiplication in $GF(2^{k})$," Des., Codes Cryptography, vol. 14, no. 1, pp. 57–69, Apr. 1998.
[18] S. Mangard, M. Aigner, and S. Dominikus, "A highly regular and scalable AES hardware architecture," IEEE Trans. Comput., vol. 52, no. 4, pp. 483–491, Apr. 2003.
[19] C.-H. Wang, C.-P. Su, C.-T. Huang, and C.-W. Wu, "A word-based RSA crypto-processor with enhanced pipeline performance," in Proc. 4th IEEE AP-ASIC, Fukuoka, Japan, Aug. 2004, pp. 218–221.
[20] T. Good and M. Benaissa, "Pipelined AES on FPGA with support for feedback modes (in a multi-channel environment)," IET Inf. Security, vol. 1, no. 1, pp. 1–10, Mar. 2007.
[21] A. Hodjat and I. Verbauwhede, "Area-throughput trade-offs for fully pipelined 30 to 70 Gbits/s AES processors," IEEE Trans. Comput., vol. 55, no. 4, pp. 366–371, Apr. 2006.

Chieh-Lin Chuang received the M.S. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2007. He is currently an ASIC Engineer with Novatek Corporation, Hsinchu. His research interests include digital circuit design.


Chen-Hsing Wang received the M.S. degree from the Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, in 2003, where he is currently working toward the Ph.D. degree in electrical engineering. His research interests include the design and test of VLSI circuits and systems. He is particularly interested in cryptographic circuit design.

Cheng-Wen Wu (S'86–M'87–SM'95–F'04) received the B.S.E.E. degree from National Taiwan University, Taipei, Taiwan, in 1981 and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California, Santa Barbara (UCSB), in 1985 and 1987, respectively. From 1981 to 1983, he was an Ensign Instructor with the Chinese Naval Petty Officers School of Communications and Electronics, Tsoying, Taiwan. From 1983 to 1984, he was with the Information Processing Center, Bureau of Environmental Protection, Executive Yuan, Taipei. Since 1988, he has been with the Department of Electrical Engineering, National Tsing Hua University (NTHU), Hsinchu, Taiwan, where he is currently a Professor. At NTHU, he was also the Director of the Computer and Communications Center from 1996 to 1998 and the Director of the Technology Service Center from 1998 to 1999. From August 1999 to February 2000, he was a Visiting Researcher with the Department of Electrical and Computer Engineering, UCSB. He was the Chair of the Department of Electrical Engineering, NTHU, from 2000 to 2003 and the Director of the IC Design Technology Center from 2000 to 2005. He is currently the Dean of the College of Electrical Engineering and Computer Science, NTHU. His research interests include the design and test of high-performance VLSI circuits and systems. Dr. Wu is a Life Member of the Chinese Institute of Electrical Engineers (CIEE) and the Taiwan IC Design Society. He was the Technical Program Chair of the IEEE Fifth Asian Test Symposium (ATS'96), the General Chair of ATS'00, and the General Cochair of the IEEE Memory Technology, Design, and Testing Workshop in 2005 and 2006. He is the Editor-in-Chief for the International Journal of Electrical Engineering (IJEE), an Editor for the Journal of Electronic Testing: Theory and Applications, and an Editor for the IEEE Design & Test of Computers.
He was an Editor for IJEE from 2000 to 2003, and in 2001, he edited the IJEE Special Issue on Design and Test of System-on-Chip. He was also a Guest Editor for the Journal of Information Science and Engineering, Special Issue on VLSI Testing. He was a recipient of the Distinguished Teaching Awards from NTHU in 1996 and 2006, the Outstanding Electrical Engineering Professor Award from CIEE in 1997, the Distinguished Research Awards from the National Science Council in 2000 and 2002, the Industrial Collaboration Award from the Ministry of Education (MOE) in 2001, the Best Paper Award at the 2002 IEEE International Workshop on Design and Diagnostics of Electronic Circuits and Systems, the Best Paper Award at the 2003 IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), the Special Feature Award of the 2003 ASP-DAC University LSI Design Contest, the Academic Award from MOE in 2005, and the Continuous Service Award and the Outstanding Contribution Award from the IEEE Computer Society in 2005. He became a Golden Core Member of the IEEE Computer Society in 2006.
