
Efficient Realization of Givens Rotation through Algorithm-Architecture Co-design for Acceleration of QR Factorization
ASHISH TIWARI(T18010)
Under the Guidance of Dr. Shubhajit Roy Chowdhury
T18010@students.iitmandi.ac.in
School of Computing and Electrical Engineering
Indian Institute of Technology, Mandi

Abstract—This report presents an efficient realization of Generalized Givens Rotation (GGR) for QR factorization that achieves 3-100x better performance in terms of Gflops/watt on multicore processors and GPGPUs compared to classical Givens Rotation (GR). GGR is an advanced form of Givens Rotation in which multiple elements of a matrix, in different rows and columns, are annihilated simultaneously. GGR requires 33% fewer multiplications than GR. If an asynchronous circuit is used, the latency of the circuit is expected to be lower than that of the synchronous circuit.

I. INTRODUCTION
Speed is one of the dominant design factors in engineering, but it trades off against area and power consumption. The design can therefore be optimized toward a point where the device remains efficient at high speed. Why do we need factorization at all? Factorization plays a vital role because calculations become easy for upper and lower triangular matrices. "Easier" here means that the calculation is faster for a triangular matrix than for a general matrix.

QR factorization decomposes a matrix A into two matrices:

A = QR

where R is an upper triangular matrix and Q is an orthogonal matrix. QR factorization is the building block of the QR algorithm, which computes the eigenvalues and eigenvectors of A, and the product of the diagonal elements of R gives the determinant of A up to sign. Eigenvalues are among the most important properties of a matrix and are used across engineering applications such as communication systems, bridge design, car stereo system design, decoupling of three-phase systems, and the solution of linear equations.

There are three main QR factorization techniques: 1) Givens Rotation (GR), 2) Householder Transform (HT), and 3) Modified Gram-Schmidt (MGS) [1]. MGS is used where the accuracy of the final solution is not critical. HT is used in high-performance computing (HPC) and is numerically stable. GR is used in embedded applications where the accuracy of the final solution is critical; however, GR is comparatively slow, which leaves room to improve the speed of the technique [1].

For a 3 × 3 matrix,

A = QR

where

$$
Q = \begin{bmatrix} q_{11} & q_{12} & q_{13} \\ q_{21} & q_{22} & q_{23} \\ q_{31} & q_{32} & q_{33} \end{bmatrix}, \quad
R = \begin{bmatrix} a & b & c \\ 0 & d & e \\ 0 & 0 & f \end{bmatrix}
$$

$$
|\det A| = |a \cdot d \cdot f|
$$

since $\det Q = \pm 1$.

To implement these factorizations on FPGAs and GPGPUs, library-based approaches are used, such as packages built on BLAS (Basic Linear Algebra Subprograms) that are highly tuned for such problems to improve speed. Examples of tuned packages are MAGMA and PLASMA [2].

In this project, floating-point arithmetic is used for the matrix computations. The main reasons are that floating point makes large computations easy and offers high accuracy. A floating-point number is represented as follows:

Floating Point Representation
Sign | Exponent | Mantissa

Example:
85.125
85 = 1010101
0.125 = 0.001
85.125 = 1010101.001
= 1.010101001 × 2^6
sign = 0
For single precision:
biased exponent = 127 + 6 = 133
133 = 10000101
Normalised mantissa = 010101001
The IEEE 754 single-precision representation is:
= 01000010101010100100000000000000
For double precision the bias is 1023, so the biased exponent is 1023 + 6 = 1029.
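The single-precision bit pattern and the two biased exponents worked out above can be checked mechanically. The following is a minimal sketch using Python's standard `struct` module; the helper name `float_bits` is an illustrative choice and not part of the original report.

```python
import struct

def float_bits(value, fmt):
    """Return the IEEE 754 bit pattern of `value` as a string of 0s and 1s.

    `fmt` is a struct format: ">f" for single precision, ">d" for double.
    """
    raw = struct.pack(fmt, value)          # big-endian bytes of the number
    return "".join(f"{byte:08b}" for byte in raw)

single = float_bits(85.125, ">f")
sign, exponent, mantissa = single[0], single[1:9], single[9:]
print(single)                       # 32-bit pattern
print(sign, exponent, mantissa)     # sign, 8-bit exponent, 23-bit mantissa
print(int(exponent, 2))             # biased exponent (bias 127)

double = float_bits(85.125, ">d")
print(int(double[1:12], 2))         # biased exponent (bias 1023)
```

Splitting the string at bit positions 1 and 9 (single) or 1 and 12 (double) recovers the sign, exponent and mantissa fields used in the hand calculation.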
In binary, the double-precision biased exponent is 1029 = 10000000101
Normalised mantissa = 010101001
The IEEE 754 double-precision representation is:
= 0100000001010101010010000000000000000........

Multiplying large binary numbers directly is complex and loses accuracy. Floating point makes this multiplication easier and more accurate. Floating point still has some limitations, which are addressed by the posit number representation.

BLAS routines, which are highly tuned for matrix operations, come in three levels:
BLAS1: Level 1, vector-vector operations.
BLAS2: Level 2, matrix-vector operations.
BLAS3: Level 3, matrix-matrix operations.

These BLAS levels give different efficiency depending on the software package and the order of the matrix [3].

II. RELATED WORK
This section first explains GR, then CGR, and finally GGR.

In a Givens rotation, the given matrix is multiplied by a rotation matrix so that a chosen element becomes zero. Multiplying by the rotation matrix corresponds to rotating a plane in a particular direction through a specific angle θ; after the multiplication, the targeted element is zero.

Consider a column vector with elements a and b. To build an upper triangular matrix, that is, to make an element zero, the vector is multiplied by the matrix shown below:

$$
\begin{bmatrix} c & -s \\ s & c \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
=
\begin{bmatrix} r \\ 0 \end{bmatrix}
$$

where

$$
r = \sqrt{a^2 + b^2}, \quad c = \cos\theta, \quad s = \sin\theta
$$

Recently, several proposals have developed BLAS and MAGMA on CGRA platforms through algorithm-architecture co-design, DLA, and processor pipelining [6][7]. In this paper, we focus on the acceleration of GR-based QR factorization, where classical GR is generalized to obtain Generalized Givens Rotation (GGR), which has 33% fewer multiplications than GR. Several macro operations in GGR are identified and realized on an RDP to achieve superior performance compared to the Modified Householder Transform (MHT) presented in [5]. The major efforts toward efficient GR-based QR factorization are as follows: GR is improved by Column-wise Givens Rotation (CGR), which is further improved by GGR.

A recently presented architecture is based on a synchronous clock; in this paper, an asynchronous design is used to further improve the execution time of that architecture. A Reconfigurable Data Path (RDP) improves the data processing, as explained in the next section. Moving to highly tuned software reduces both the time taken and the number of multiplications, which increases performance in terms of Gflops/watt.

A. IMPROVEMENT IN SYNCHRONOUS CIRCUIT
Column-wise Givens Rotation, explained in [8], is discussed below, and a modification of it is also presented in this section.

For a 4 × 4 matrix X = [x_ij], three Givens sequences must be multiplied to obtain the QR factorization, as shown in the following equation [1]:

$$
GX =
\begin{bmatrix}
p_3 & \dfrac{x_{11}x_{12}+s_{11}}{p_3} & \dfrac{x_{11}x_{13}+s_{12}}{p_3} & \dfrac{x_{11}x_{14}+s_{13}}{p_3} \\
0 & \dfrac{x_{11}s_{11}}{p_3 p_2}-\dfrac{x_{12}p_2}{p_3} & \dfrac{x_{11}s_{12}}{p_3 p_2}-\dfrac{x_{13}p_2}{p_3} & \dfrac{x_{11}s_{13}}{p_3 p_2}-\dfrac{x_{14}p_2}{p_3} \\
0 & \dfrac{x_{21}s_{21}}{p_2 p_1}-\dfrac{x_{22}p_1}{p_2} & \dfrac{x_{21}s_{22}}{p_2 p_1}-\dfrac{x_{23}p_1}{p_2} & \dfrac{x_{21}s_{23}}{p_2 p_1}-\dfrac{x_{24}p_1}{p_2} \\
0 & \dfrac{x_{31}x_{42}}{p_1}-\dfrac{x_{41}x_{32}}{p_1} & \dfrac{x_{31}x_{43}}{p_1}-\dfrac{x_{41}x_{33}}{p_1} & \dfrac{x_{31}x_{44}}{p_1}-\dfrac{x_{41}x_{34}}{p_1}
\end{bmatrix}
$$

$$
=
\begin{bmatrix}
p_3 & \dfrac{x_{11}x_{12}+s_{11}}{p_3} & \dfrac{x_{11}x_{13}+s_{12}}{p_3} & \dfrac{x_{11}x_{14}+s_{13}}{p_3} \\
0 & k_1 s_{11} - x_{12} l_1 & k_1 s_{12} - x_{13} l_1 & k_1 s_{13} - x_{14} l_1 \\
0 & k_2 s_{21} - x_{22} l_2 & k_2 s_{22} - x_{23} l_2 & k_2 s_{23} - x_{24} l_2 \\
0 & c\,x_{42} - x_{32}\,s & c\,x_{43} - x_{33}\,s & c\,x_{44} - x_{34}\,s
\end{bmatrix}
$$

Comparing the two forms term by term gives $k_1 = x_{11}/(p_3 p_2)$, $l_1 = p_2/p_3$, $k_2 = x_{21}/(p_2 p_1)$, $l_2 = p_1/p_2$, $c = x_{31}/p_1$ and $s = x_{41}/p_1$.

The DAG (Directed Acyclic Graph) of the above equation is shown below [1]:

Fig. 1. One Iteration of Column-wise Givens Rotation

The DAG above represents the simultaneous annihilation of multiple elements, such as x21, x31 and x41, together with the update of the columns of the matrix. For a matrix of order n × n, GR needs n(n−1)/2 sequences, while CGR needs only n−1 sequences. As the name column-wise suggests, the number of sequences cannot be reduced further in this case, because it always takes n−1 sequences. However, there is an opportunity to modify the sequence shown in Fig. 1. From Fig. 1, the delays of the multiplication block and the addition block are not the same. If addition, multiplication, and square root all fall in one stage (between two clock edges), then the clock period is the maximum of the three operation delays, which causes more delay in the circuit [1].
In this project, the delays of the different operations differ from each other. Generally, the order of delay is

+ < × < √ < ÷

From Fig. 1, if the delay of the adder is t, the delay of the multiplier block is 2t, the delay of the square root block is 3t, and the delay of the divider is 4t, then one column update takes 23t, while using the asynchronous circuit it takes 25t. So, with the help of the asynchronous circuit, the delay of the overall circuit is reduced.
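The reasoning behind this comparison can be sketched with the delay values given above. In the model below, a synchronous stage always costs one clock period set by the slowest operation, whereas an asynchronous stage finishes after its own operation delay. The chain of operations is a made-up example and not the DAG of Fig. 1, so it does not reproduce the totals quoted above; handshake overhead is also ignored, which is a simplifying assumption.

```python
# Delays in units of t, taken from the text above.
DELAY = {"add": 1, "mul": 2, "sqrt": 3, "div": 4}

def synchronous_delay(stages):
    """Every stage takes one clock period, set by the slowest operation used."""
    clock_period = max(DELAY[op] for op in stages)
    return clock_period * len(stages)

def asynchronous_delay(stages):
    """Each stage completes after its own operation delay (handshake cost ignored)."""
    return sum(DELAY[op] for op in stages)

# Hypothetical chain of dependent operations, for illustration only.
chain = ["mul", "add", "sqrt", "div", "mul", "add"]
print(synchronous_delay(chain))    # 6 stages, each one clock period of 4t
print(asynchronous_delay(chain))   # 2 + 1 + 3 + 4 + 2 + 1 in units of t
```

In this simplified model the asynchronous total can never exceed the synchronous one, because each stage delay is bounded by the clock period.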
Fig. 2. One Iteration of Column-wise Givens Rotation

B. ASYNCHRONOUS CIRCUIT
We choose the asynchronous circuit because of the limit on operating frequency imposed by a synchronous clock. Reducing the clock period raises the operating frequency, and using an asynchronous circuit also reduces power consumption and interference in the circuit. In a synchronous circuit, clock gating is introduced to reduce dynamic power, but clock gating needs extra circuitry, so the lower dynamic power is paid for in area. For these reasons, we turn to the asynchronous circuit [4].

1) Asynchronous Architecture: An asynchronous architecture does not need a clock. Instead, there is a sender block and a receiver block, and data is transferred with the help of request and acknowledge signals, as shown below:

Fig. 3. Asynchronous Architecture

2) Difficulties in Asynchronous Architecture: One of the main difficulties in an asynchronous architecture is synchronization across a large design. Because the approach is relatively new, there are few automated tools to check functionality, so synthesis is a big challenge.

3) Synchronization of Asynchronous circuit:

• Muller C-Element:
This circuit is used to synchronize two different signals. The expression of the Muller C-element is shown below:
Fn = AB + (A + B)Fn−1
Truth table of the Muller C-element:

TABLE I
MULLER C-ELEMENT

A   B   OUTPUT(Fn)
0   0   0
0   1   Fn−1
1   0   Fn−1
1   1   1

From the table, it is clear that the output becomes 1 when both inputs are 1 and becomes 0 when both inputs are 0; otherwise it holds its previous value. This element plays a vital role in providing synchronization in the asynchronous circuit [4].

4) Data Communication in Asynchronous Digital Circuits:

• 2-Phase Protocol:
In the 2-phase handshaking protocol, data is transferred from the sender to the receiver in two cycles. In the first cycle, the sender drives the request signal high (1), which shows that it is ready to send the data. In the next cycle, the receiver drives the acknowledge signal high, which means that it is ready to receive the data, and the data is transferred from sender to receiver [4].
The main problem with this protocol is reliability: there is no indication of whether the data was transferred successfully.

Fig. 4. 2-phase protocol

• 4-Phase Protocol:
In the 4-phase handshaking protocol, data is transferred from sender to receiver in four cycles. In the first cycle, the sender makes sure that it has data available to transmit and drives the request signal high. In the second cycle, when the receiver sees that the request signal is high, it drives its acknowledge signal high. In the third cycle, when both inputs to the Muller C-element are high, the output of the Muller C-element also goes high and the data is transferred from sender to receiver. In the fourth cycle, the receiver drives the acknowledge signal to zero, which shows that the data was received successfully.

Fig. 5. 4-phase protocol

The main drawback of the 4-phase protocol is that it takes four cycles. To reduce this time, another method is used: the modified 4-phase protocol.

• Modified 4-Phase Protocol:
In the modified 4-phase protocol, the number of cycles required to transmit the data is three, one fewer than in the 4-phase protocol. Here the request and acknowledge signals go active high simultaneously, which removes one cycle and increases speed. The modified 4-phase protocol has some limitations: the area required is also higher, and so is the complexity.

C. Performance Analysis of 2-Phase/4-Phase/Modified 4-Phase Protocol
The performance of these three protocols in terms of speed, area and power consumption is shown in the table below [4]:

TABLE II

Index              Area   Power   Cycles Required
2 Phase            y      x       2
4 Phase            2y     2x      4
Modified 4 Phase   2y     2x      3

Based on this result, we use the modified 4-phase protocol for asynchronous data communication in our architecture to improve speed and reliability.

III. GGR AND IMPLEMENTATION
In the given matrix $A_{n \times n}$, eliminating the element in the last row and first column, that is element (n,1), takes one Givens sequence:

$$
G_{n,1} A = \begin{bmatrix} R^{(1)} \\ 0 \end{bmatrix}
$$

where $R^{(k)}$ is a matrix that has k zero elements in its lower part and has undergone k updates. In general, the Givens matrix is given by the following equation:

$$
G_{i,j} = \mathrm{diag}(I_{i-2},\ \tilde{G}_{i,j},\ I_{m-i}), \quad \text{where } \tilde{G}_{i,j} = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}
$$

with $c = \dfrac{A_{i-1,j}}{t}$, $s = \dfrac{A_{i,j}}{t}$ and $t = \sqrt{A_{i-1,j}^2 + A_{i,j}^2}$.

To annihilate n−1 elements, n−1 Givens sequences are needed. It is possible to annihilate multiple elements simultaneously. To annihilate two elements simultaneously, two sequences are used, as given below:

$$
G_{n-1,1} G_{n,1} A = \begin{bmatrix} R^{(2)} \\ 0 \end{bmatrix}
$$

where

$$
(G_{n-1,1} G_{n,1})^T (G_{n-1,1} G_{n,1}) = (G_{n-1,1} G_{n,1})(G_{n-1,1} G_{n,1})^T = I
$$

To annihilate the n−1 elements of the first column of matrix A:

$$
G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1} A = \begin{bmatrix} R^{(n-1)} \\ 0 \end{bmatrix}
$$

where

$$
(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1})^T (G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1}) = (G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1})(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1})^T = I
$$

The above equation annihilates n−1 elements from the first column of matrix A. Continuing the annihilation from n−1 to (n−1)+(n−2) elements gives the matrix $R^{(n-1)+(n-2)}$, with n−1 annihilations in column 1 and n−2 in column 2:

$$
(G_{3,2} G_{4,2} \cdots G_{n-1,2} G_{n,2})(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1}) A = \begin{bmatrix} R^{(n-1)+(n-2)} \\ 0 \end{bmatrix}
$$

Fig. 6. CGR and GGR
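The first-column product of rotations described above can be checked numerically. The sketch below builds each n × n rotation explicitly in the form diag(I, [[c, s], [−s, c]], I), multiplies the rotations together, and verifies that the product P is orthogonal and that P·A has zeros below the first entry of column 1. It uses plain Python lists and 0-based indices; the helper names and the 3 × 3 test matrix are illustrative assumptions, not taken from the source.

```python
import math

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(col) for col in zip(*x)]

def givens(n, i, j, m):
    """n x n rotation that zeros m[i][j] against m[i-1][j] (0-based rows)."""
    t = math.hypot(m[i - 1][j], m[i][j])
    c, s = (1.0, 0.0) if t == 0.0 else (m[i - 1][j] / t, m[i][j] / t)
    g = identity(n)
    g[i - 1][i - 1], g[i - 1][i] = c, s
    g[i][i - 1], g[i][i] = -s, c
    return g

def annihilate_column(a, j=0):
    """Return (P, P*a), where P is the product of the rotations for column j."""
    n = len(a)
    p, work = identity(n), [row[:] for row in a]
    for i in range(n - 1, j, -1):      # bottom element first
        g = givens(n, i, j, work)
        work = matmul(g, work)
        p = matmul(g, p)               # later rotations multiply on the left
    return p, work

A = [[4.0, 1.0, 2.0],
     [2.0, 5.0, 1.0],
     [1.0, 2.0, 6.0]]
P, PA = annihilate_column(A)
PtP = matmul(transpose(P), P)
print([round(row[0], 12) for row in PA])   # first column after annihilation
```

Building full n × n rotation matrices is wasteful and is done here only so that the product and its transpose can be formed directly, mirroring the equations.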
The rotation sequence for the first two columns is itself orthogonal:

$$
\big((G_{3,2} G_{4,2} \cdots G_{n-1,2} G_{n,2})(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1})\big)\big((G_{3,2} G_{4,2} \cdots G_{n-1,2} G_{n,2})(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1})\big)^T
$$

$$
= \big((G_{3,2} G_{4,2} \cdots G_{n-1,2} G_{n,2})(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1})\big)^T\big((G_{3,2} G_{4,2} \cdots G_{n-1,2} G_{n,2})(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1})\big) = I
$$

To annihilate the n(n−1)/2 elements in the lower part of the matrix A, n−1 sequences are needed, eliminating the elements one by one:

$$
(G_{n,n-1})(G_{n-1,n-2} G_{n,n-2}) \cdots (G_{3,2} G_{4,2} \cdots G_{n-1,2} G_{n,2})(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1}) A = \begin{bmatrix} R^{(n-1)+(n-2)+\cdots+1} \\ 0 \end{bmatrix}
$$

and

$$
(G_{n,n-1})(G_{n-1,n-2} G_{n,n-2}) \cdots (G_{3,2} G_{4,2} \cdots G_{n-1,2} G_{n,2})(G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1}) = Q^T
$$

and $QQ^T = Q^TQ = I$.
The above equations express GGR in equation form [4].

IV. GGR IN MULTICORE AND GPGPU
For the realization of GGR on GPGPUs, we use PLASMA and MAGMA, as discussed below.
To implement Generalized Givens Rotation on GPGPUs and multicore processors in PLASMA, we first implement the dgeqr2ggr routine. The performance of different routines with different software packages, based on the synchronous design, is shown in the graph below [1]:

Fig. 7. Performance of GGR in Different Packages and Platforms

From the above graph, it is clear that different operations with different software packages give different results. The main limitation is that this software is based on synchronous circuits. Implementing these circuits asynchronously is quite complicated because the software used here is written for synchronous circuits.

V. RESULTS
Implementing an asynchronous architecture is difficult because of its complexity and the unavailability of automated software to run and measure the performance for different orders of matrix multiplication and different operations such as matrix-matrix, matrix-vector, and vector-vector operations. In Section II(A) we show an asynchronous architecture instead of a synchronous circuit, which gives an approximately 1/5 times reduction in the delay of the circuit. To understand the working and implementation of the synchronous circuit, we wrote Verilog code for the different blocks used in the circuit shown in Fig. 1. The Verilog results for some blocks, namely the addition, multiplication and division blocks, are shown below:

Fig. 8. 32 Bit floating point addition

Fig. 9. 32 Bit floating point multiplication

VI. CONCLUSION
GGR, the modified form of GR presented in this paper, requires fewer multiplications than classical GR. In this paper, we use floating-point numbers, which are efficient for operations on very large and very small numbers. By using the asynchronous architecture, we can reduce the time required in a particular section by 1/5 times that of the previous one. To implement the asynchronous architecture, we are going to use the modified 4-phase protocol, which requires less time than the 4-phase protocol and is more reliable than the 2-phase protocol. The future scope of this project is to replace floating-point numbers with posit numbers, because floating-point numbers have some limitations.

REFERENCES
[1] F. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, R. Narayan, and R. Leupers, "Efficient Realization of Givens Rotation through Algorithm-Architecture Codesign for Acceleration of QR Factorization."
[2] F. A. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, "Efficient realization of Householder transform through algorithm-architecture co-design for acceleration of QR factorization," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1-1, 2018.
[3] Z. E. Rakossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, "Efficient and scalable CGRA-based implementation of column-wise Givens rotation," in ASAP, 2014, pp. 188-189.
[4] N. Bhandari, R. Shrestha, and S. Roy Chowdhury, "FPGA Based High Performance Asynchronous Arithmetic Logic Unit and Asynchronous Finite State Machine Controller using Modified 4-Phase Handshaking Protocol."
[5] M. A. Erle, B. J. Hickmann, and M. J. Schulte, "Decimal Floating-Point Multiplication," IEEE Transactions on Computers, vol. 58, no. 7, July 2009.
[6] S. Das, K. T. Madhu, M. Krishna, N. Sivanandan, F. Merchant, S. Natarajan, I. Biswas, A. Pulli, S. K. Nandy, and R. Narayan, "A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths," Journal of Systems Architecture - Embedded Systems Design, vol. 60, no. 7, pp. 592-614, 2014.
[7] G. Ansaloni, P. Bonzini, and L. Pozzi, "EGRA: A coarse grained reconfigurable architectural template," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 6, pp. 1062-1074, June 2011.
[8] F. Merchant, A. Chattopadhyay, G. Garga, S. K. Nandy, R. Narayan, and N. Gopalan, "Efficient QR decomposition using low complexity column-wise Givens rotation (CGR)," in 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems, Mumbai, India, January 5-9, 2014, pp. 258-263.
