
A New RSA Cryptosystem Hardware Implementation Based on High-Radix Montgomery's Algorithm

Fang Yingli, Gao Zhiqiang

IC Research and Design Group


Institute of Microelectronics, Tsinghua University, Beijing, China 100084

fangyl@dns.ime.tsinghua.edu.cn

Abstract

In this paper, we propose an efficient hardware-oriented modular multiplication algorithm based on Montgomery's algorithm. We employ the high-radix technique and modify the original Montgomery's algorithm to reduce hardware complexity and improve processing speed. An RSA cryptosystem hardware design based on the proposed algorithm is presented. The design has been implemented as a single-chip 512-bit RSA processor with the CSMC (Central Semiconductor Manufacture Corporation) 0.6 µm CMOS standard cell library. The processor contains about 96k gates and delivers a baud rate of 113 kbits/sec with a 40 MHz clock in the worst case.
Introduction
As the telecommunication network has grown explosively and the internet has become increasingly popular, their applications cover almost every aspect of human life, including very important fields such as personal identification and commerce. Network security has therefore become an increasingly serious issue. The fundamental security requirements include confidentiality, authentication, data integrity and nonrepudiation. One efficient solution to the network security issue is public key cryptography [1]. Among the various public key cryptography algorithms, the RSA cryptosystem [2] is one of the most efficient, versatile and widely used today. To encrypt and decrypt, the input text is first encoded to a numeric format and divided into blocks of suitable size. The blocks are then processed as $C = M^E \pmod{N}$ for encryption and $M = C^D \pmod{N}$ for decryption, where $M$ and $C$ are the plaintext and ciphertext blocks, respectively, and $N$, $E$ and $D$ are the cryptosystem parameters.
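As a small numerical illustration of the two formulas above, the following sketch encrypts and decrypts one block. The primes, exponent and message are toy values assumed here for illustration only; they are far too small for real use and are not taken from the design described in this paper.

```python
# Toy RSA parameters (assumed for illustration; not from the paper).
p, q = 61, 53
N = p * q                    # modulus
phi = (p - 1) * (q - 1)      # Euler totient of N
E = 17                       # public exponent, coprime to phi
D = pow(E, -1, phi)          # private exponent: E * D = 1 (mod phi), Python 3.8+

def encrypt(m: int) -> int:
    """C = M^E mod N."""
    return pow(m, E, N)

def decrypt(c: int) -> int:
    """M = C^D mod N."""
    return pow(c, D, N)

M = 65
C = encrypt(M)
assert decrypt(C) == M       # decryption recovers the plaintext block
```

The built-in three-argument `pow` performs the modular exponentiation in software; the hardware design replaces it with repeated Montgomery products.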

Modular exponentiation is the main computation of the RSA cryptosystem. It can be reduced to a sequence of modular multiplications ($AB \pmod{N}$). Among the many algorithms for modular multiplication, Montgomery's algorithm [3] is of low complexity and high efficiency, which makes it the most popular in RSA cryptosystem realizations. The paper is organized as follows. In the second section, we propose hardware-oriented modular multiplication and exponentiation algorithms based on high-radix Montgomery's algorithm. The third section describes the hardware implementation of the RSA cryptosystem using the proposed algorithms. Finally, we conclude the paper in the last section.

Improved Algorithms for RSA Cryptosystem

In 1985, P. L. Montgomery invented an algorithm for modular multiplication without traditional trial division [3]. The original Montgomery's algorithm is described as follows. Given an odd modulus $N > 1$, select a number $R$ satisfying $R > N$ and $\gcd(R, N) = 1$. Let $R^{-1}$ and $N'$ be two numbers satisfying $0 < R^{-1} < N$, $0 < N' < R$, $N N' \pmod{R} = -1$ and $R R^{-1} \pmod{N} = 1$. Let $A$ and $B$ be random numbers. Montgomery's algorithm is:

Algorithm 1. MontPro(A, B, N)
{
  T = A * B;
  M = (T mod R) * N' (mod R);
  S = (T + M * N) / R;
  if S >= N then S = S - N;
  return S;
}
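The steps of Algorithm 1 can be sketched directly with arbitrary-precision integers. This is a minimal illustration rather than the hardware algorithm: `R` is passed as an explicit parameter (an assumption made here so the example is self-contained), the operands are assumed to satisfy $0 \le A, B < N$, and the values of `N`, `R`, `A` and `B` are toy numbers chosen for the example.

```python
def mont_pro(A: int, B: int, N: int, R: int) -> int:
    """Return A * B * R^{-1} mod N following the steps of Algorithm 1.

    Assumes N is odd, R > N, gcd(R, N) = 1 and 0 <= A, B < N.
    """
    # N' satisfies N * N' = -1 (mod R).
    N_prime = (-pow(N, -1, R)) % R
    T = A * B
    M = ((T % R) * N_prime) % R
    # T + M*N is a multiple of R by construction, so the division is exact.
    S = (T + M * N) // R
    if S >= N:
        S -= N
    return S

# Toy values (assumed): R is a power of two, so gcd(R, N) = 1 for odd N.
N = 3233
R = 1 << 12
A, B = 1234, 2345

R_inv = pow(R, -1, N)
assert mont_pro(A, B, N, R) == (A * B * R_inv) % N
```

Choosing `R` as a power of two turns the reduction modulo `R` and the division by `R` into a mask and a shift, which is what makes the method attractive for hardware.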

0-7803-6677-8/01/$10.00 ©2001 IEEE.


Authorized licensed use limited to: Padre Conceicao College of Engineering. Downloaded on July 24,2010 at 05:59:51 UTC from IEEE Xplore. Restrictions apply.

In the Montgomery algorithm, the final result $S$, referred to as the Montgomery Product, is not $AB \pmod{N}$ but $ABR^{-1} \pmod{N}$, containing an extra factor $R^{-1}$. The mathematical proof can be found in [3]. Algorithm 1 shown above is only mathematical and not practical for realization, because all the operations in the algorithm involve large integers. Based on Algorithm 1 we propose our improved hardware-oriented Montgomery's algorithm. The proposed algorithm has the following improvements: (1) each operation in the algorithm involves only 1-word integers, which makes it easy to implement in hardware; (2) variables in the algorithm are reused, which lowers the register complexity when mapped to hardware; (3) dependence between operations is minimized, which maximizes operation parallelism and reduces the number of clock cycles needed to complete the whole algorithm.

The improved hardware-oriented Montgomery's algorithm is described as follows. We present all numbers in radix $r$. For convenience, the radix $r$ is usually selected to be $2^k$, where $k$ is called the word length. Assuming all operands to be $w$-word integers, we get $A = (a_{w-1} a_{w-2} \ldots a_1 a_0)$, $B = (b_{w-1} b_{w-2} \ldots b_1 b_0)$ and $N = (n_{w-1} n_{w-2} \ldots n_1 n_0)$. Let $n_0'$ be a word satisfying $n_0 n_0' \equiv -1 \pmod{r}$. The modified algorithm is given below.

Algorithm 2. MontPro(A, B, N)
{
  S = 0; T = 0;
  for i = 0 to w-1 do
  {
    (c1, s1) = t_1 + a_i * b_0;
    t_0 = s1;
    m = (s_0 + t_0) * n_0' (mod r);
    (c2, s2) = s_0 + m * n_0 + t_0;
    for j = 1 to w-1 do
    {
      (c1, s1) = t_{j+1} + a_i * b_j + c1;
      t_j = s1;
      (c2, s2) = s_j + m * n_j + c2;
      s_{j-1} = s2;
    }
    t_w = c1;
    s_{w-1} = c2;
  }
  (c2, s2) = s_0 + t_1;
  c1 = 1;
  ...
}

Here $A$, $B$, $N$ and $S$ are all $w$-word integers and $T$ is a $(w+1)$-word integer. s1 and s2 represent the sum words, and c1 and c2 represent the carry words.

The algorithm in which the Montgomery Product is used to compute the modular exponentiation is given below. Let $M$ and $C$ be the plaintext and ciphertext, respectively, with $C = M^E \pmod{N}$, where $N$ is the modulus and $E$ is the exponent. $M$, $C$, $N$ and $E$ are all $u$-bit integers. $R$ has the same definition as in Algorithm 1. Usually we let $R = 2^u$.

Algorithm 3. ModExp(M, E, N)
{
  C = 1;
  M = MontPro(M, R^2 mod N, N);
  for i = 0 to u-1 do
  {
    if (e_i = 1) C = MontPro(M, C, N);
    M = MontPro(M, M, N);
  }
  return C;
}

Hardware Implementation of RSA Cryptosystem
System Architecture
Fig 1 shows the architecture of our RSA cryptosystem design based on Algorithm 2 and Algorithm 3. The RSA cryptosystem works in two modes: programming mode and processing mode. In the programming mode (program=1), the cryptosystem loads RSA operation parameters such as $N$ and $E$ from ports din and e-in. In the processing mode (program=0), the cryptosystem first loads the message data block to be processed while new-msg=1, then performs RSA encryption/decryption on the data block while new-msg=0. When the RSA operation is completed, port state outputs a positive pulse indicating that the result data can be read from port dout.
The RSA cryptosystem is partitioned into four parts: one register file, one modular exponentiation controller, one Montgomery controller and two identical Montgomery datapaths. Observing that in Algorithm 3 the two Montgomery modular multiplications in each iteration are identical in form and performed simultaneously, we use only one Montgomery controller to control the two identical Montgomery datapaths, which reduces the hardware complexity.
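To make the data flow of Algorithm 2 and Algorithm 3 concrete, the following software sketch performs a word-serial, radix-$2^k$ Montgomery product and uses it for the exponentiation loop. It is a sketch under stated assumptions, not a transcription of Algorithm 2: it keeps one big-integer accumulator `Z` instead of the separate word-level registers S and T, and the function names and toy parameters are chosen here for illustration. Operands are assumed to be smaller than an odd modulus `N` that fits in `w` words.

```python
def mont_pro_words(A: int, B: int, N: int, k: int, w: int) -> int:
    """Word-serial Montgomery product: A * B * r^(-w) mod N with r = 2^k."""
    r = 1 << k
    mask = r - 1
    # n0' satisfies n0 * n0' = -1 (mod r), where n0 is the low word of N.
    n0_prime = (-pow(N & mask, -1, r)) % r
    Z = 0  # single accumulator (assumption; the paper splits this into S and T)
    for i in range(w):
        a_i = (A >> (k * i)) & mask          # i-th radix-r word of A
        Z += a_i * B
        m = ((Z & mask) * n0_prime) & mask   # makes Z + m*N divisible by r
        Z = (Z + m * N) >> k                 # exact division by r
    if Z >= N:
        Z -= N
    return Z

def mod_exp(M: int, E: int, N: int, k: int, w: int) -> int:
    """Right-to-left binary exponentiation in the manner of Algorithm 3."""
    u = k * w
    R = 1 << u
    C = 1
    M = mont_pro_words(M, (R * R) % N, N, k, w)   # M * R mod N
    for i in range(u):
        # Both products below read the same, not yet updated, value of M,
        # so they do not depend on each other within one iteration.
        if (E >> i) & 1:
            C = mont_pro_words(M, C, N, k, w)
        M = mont_pro_words(M, M, N, k, w)
    return C

# Toy check with 8-bit words and a 16-bit operand length (assumed values).
assert mod_exp(65, 17, 3233, k=8, w=2) == pow(65, 17, 3233)
```

Because the update of `C` and the squaring of `M` use only values from the previous iteration, the two calls can be issued in the same time slot, which is the property the two identical datapaths exploit.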
Fig 1. System Architecture

Module Design
The RSA cryptosystem contains four modules: register file, Montgomery datapath, Montgomery controller and modular exponentiation controller.

(1) Register File. The register file contains registers for operation parameters, constants and variables such as $N$, $M$, $C$, $E$ and $R^2 \bmod N$ in Algorithm 3. Observing that the operands are processed word by word in Algorithm 2, we store the operands in loop-shift registers, which loop-shift the operands by one word at the end of each corresponding iteration. The shifting feature makes the controller simpler and the looping feature guarantees that the data will not be lost. The registers also have external inputs to load initial data. Figure 2 shows the typical architecture of the register.

Fig 2. Loop-shift register

(2) Montgomery Datapath. The Montgomery datapath contains arithmetic units and registers for the Montgomery modular multiplication, including two multiply-add units for arithmetic operations, some multiplexers to choose the proper operands and some loop-shift registers to store variables such as S and T in Algorithm 2. Each multiply-add unit contains a $k \times k$ multiplier and a $k + k + 2k$ adder, where $k$ is the word length. The results of the multiplier and the adder are registered to make the multiply-add operations pipelined.

Fig 3. Multiply-add Unit

(3) Montgomery Controller. The Montgomery controller controls the Montgomery Product computation process. We use two state machines. One is negative edge triggered, switching the states and generating control codes for sequential logic, and the other is positive edge triggered, generating control codes for combinational logic.

(4) Modular Exponentiation Controller. The modular exponentiation controller controls the shifting of the exponent. It receives the ending signal from the Montgomery controller and shifts the exponent by one bit. It also counts the shifted bits and gives out an ending signal for the whole encryption/decryption process when it finishes with the last bit of the exponent.

Performance and Features
To measure the performance of our design, we implemented it as a 512-bit RSA processor. In the implementation, we let the word length be 32 bits. A larger word length reduces the number of clock cycles needed for computation, but operations on longer words have greater delay, which limits the clock frequency. Empirically we take 32 bits as the word length to balance this trade-off.

We use the fast carry look-ahead model to implement the 32+32+64 adder and the Booth-encoded Wallace-tree model to implement the 32x32 multiplier. When mapped to the CSMC 0.6 µm CMOS standard cell library, the adder shows a 21.33 ns critical path delay and the multiplier shows a 21.73 ns critical path delay. Even if we take 15 percent of the delay as the design margin, the maximum delay is about 25 ns, so the RSA processor can operate under a 40 MHz clock. According to Algorithm 2 and Algorithm 3, the RSA encryption/decryption needs $u(w^2 + 5w + 5)$ clock cycles, where $u$ is the data length and $w$ is the number of words. If $k$ is the word length, there holds $u = k \cdot w$. For our 512-bit RSA processor, $u = 512$, $k = 32$ and $w = 16$, so it takes about 0.18M clock cycles to complete one RSA encryption/decryption. Table 1 lists the main features of our RSA processor and some other recently presented RSA realizations. With comparable hardware complexity, our design greatly reduces the number of clock cycles by taking advantage of the high-radix technique, so it can operate at a baud rate of 113 kbits/sec even under a relatively low clock frequency of 40 MHz.
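The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the cycle count has the form $u(w^2 + 5w + 5)$, which reproduces the quoted value of about 0.18M cycles for $u = 512$ and $w = 16$; the variable names are chosen here for illustration.

```python
# Parameters quoted in the text.
u, k = 512, 32                 # data length and word length in bits
w = u // k                     # number of words
clock_hz = 40e6                # 40 MHz clock

# Timing: slowest unit (the multiplier) plus a 15 percent design margin.
multiplier_delay_ns = 21.73
max_delay_ns = multiplier_delay_ns * 1.15
clock_period_ns = 1e9 / clock_hz

# Cycle count, assuming the form u * (w^2 + 5w + 5).
cycles = u * (w * w + 5 * w + 5)

# Data rate from the exact count and from the rounded 0.18M figure.
rate_exact = u * clock_hz / cycles
rate_rounded = u * clock_hz / 0.18e6

print(f"w = {w}, cycles = {cycles}")
print(f"max delay = {max_delay_ns:.2f} ns, clock period = {clock_period_ns:.0f} ns")
print(f"rate (exact cycles)   = {rate_exact / 1e3:.1f} kbits/sec")
print(f"rate (0.18M cycles)   = {rate_rounded / 1e3:.1f} kbits/sec")
```

The exact cycle count is slightly below 0.18M, so the rate computed from it is a little higher than the quoted 113 kbits/sec, which corresponds to the rounded figure.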

Conclusions
In this paper, we propose a hardware-oriented RSA encryption/decryption algorithm and its VLSI architecture based on the high-radix Montgomery's algorithm. Using the CSMC 0.6 µm CMOS standard cell library, we implemented our design as a 32-bit-radix 512-bit RSA processor. The processor contains about 96k gates and takes about 0.18M clock cycles to complete a 512-bit RSA encryption/decryption, delivering a baud rate of 113 kbits/sec at a clock frequency of 40 MHz in the worst case. It has relatively low hardware complexity and high processing speed. It is also scalable for different numbers of bits in RSA cryptosystems. These features make our design a good candidate for RSA cryptosystem hardware implementation.

References
[1] W. Diffie and M. Hellman, New Directions in Cryptography, IEEE Transactions on Information Theory, vol. IT-22, pp. 644-654, November 1976.
[2] R. Rivest, A. Shamir and L. Adleman, A Method for Obtaining Digital Signatures and Public-Key Cryptosystems, Communications of the ACM, vol. 21, pp. 120-126, February 1978.
[3] P. L. Montgomery, Modular Multiplication without Trial Division, Math. Computation, vol. 44, pp. 519-521, April 1985.
[4] H. Orup, A 100 Kbits/s single chip modular exponentiation processor, in HOT Chips VI, Symp. Rec., pp. 53-59, 1994.
[5] S. Ishii, K. Ohyama, and K. Yamanaka, A single-chip RSA processor implemented in 0.5 µm rule gate array, in Proc. 7th Annu. IEEE Int. ASIC Conf. Exhibit, pp. 433-436, 1994.
[6] P. S. Chen, S. A. Hwang, and C. W. Wu, A systolic RSA public key cryptosystem, in Proc. IEEE International Symposium on Circuits and Systems, vol. 4, pp. 408-411, 1996.
[7] Ching-Chao Yang, Tian-Sheuan Chang, and Chein-Wei Jen, A New RSA Cryptosystem Hardware Design Based on Montgomery's Algorithm, IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 45, no. 7, pp. 908-913, July 1998.
