
Partially Reconfigurable Matrix Multiplication for Area and Time Efficiency on FPGAs

Luo Jianwen and Jong Ching Chuen
School of Electrical and Electronic Engineering, Nanyang Technological University
Nanyang Avenue, Singapore 639798
robin_ljw@pmail.ntu.edu.sg, eccjong@ntu.edu.sg

Abstract

This paper presents a novel architecture for matrix multiplication implemented on reconfigurable hardware with a partially reconfigurable feature. The proposed design significantly reduces the size and achieves the minimum computation cycles for n × n matrix multiplication. Compared with the linear array design [1], the area of our design is reduced by 72%-81% while the AT metric (product of area and latency) is reduced by 40%-58% for matrix sizes between 3 × 3 and 48 × 48. The versatility of our design is demonstrated in different parameterisable instantiations that cater for implementations with various time and area requirements. Partial reconfiguration allows us to reload the design contents with minimum configuration overhead. The performance of our design is even better for larger matrices.

1. Introduction

Matrix multiplication is one of the essential operations in a wide range of applications such as graphics, video, image, robotics and signal processing. These applications need high-performance as well as cost-efficient designs. Reconfigurable systems offer a potential for computation acceleration due to the software-like programmable nature of their parallel processing units. Run-time configuration opens a novel research area for reconfigurable hardware: processing can be sped up further by eliminating the configuration overhead through overlap with execution time. This offers a non-interrupted processing system even while the circuits change, and greatly improves logic density through time-shared logic.

Many existing schemes addressing matrix multiplication on FPGAs deal with the area-time tradeoff to achieve the maximum processing speed. Partially reconfigurable devices offer the ability to change the design implementation without stopping the whole executing process. To the best of our knowledge, none of the existing matrix multiplication designs is run-time configurable. In this paper we present a novel matrix multiplier with a partially reconfigurable feature which greatly improves the area-latency tradeoff when compared with the existing designs.

The linear array design by Jang et al. [1] implemented matrix multiplication on the Xilinx Virtex-II device. Their design adopted a systolic architecture and focused on minimizing the area-latency tradeoff, achieving great improvement over the state-of-the-art FPGA-based designs [2] and [3]. For a matrix size of 4 × 4, it had 52% and 69% less in area/speed metrics respectively, and saved up to 46% silicon against the design in [4], while achieving a maximum frequency of 166 MHz.

The Xilinx reference design [5] for 3 × 3 matrix multiplication maximized the pipelined data flow by "multi-pumping" the embedded multipliers to 9 times the environment frequency, up to 154 MHz. We use this design as our benchmark for matrix multiplication implemented on the Xilinx Virtex-II device.

The Xilinx Core Generator tool [5] has many parameterisable library cores for fast design realization. These cores have guaranteed high performance and density. We implemented a uniprocessor for matrix multiplication using this tool and compared it with our proposed design. The uniprocessor can run at 113 MHz when adopting the MAC v3.0 core [5].

The rest of this paper is organized as follows: Section 2 describes the proposed matrix multiplier architecture for AT efficiency. Section 3 presents the FPGA implementation, the comparison with existing designs, and the content partial reconfiguration used in our design. We conclude in Section 4.

2. Design architecture

Since Virtex-II devices incorporate large amounts of 18 Kbit Block SelectRAM with versatile configuration options, we can instantiate a memory cell with an operand matrix the way we do with parameterisable registers. The proposed matrix multiplier uses two chunks of the memory area. Figure 1 shows the architecture of the proposed processing element (PE).

Proceedings of the EUROMICRO Symposium on Digital System Design (DSD'04)
0-7695-2203-3/04 $20.00 © 2004 IEEE
Memory B is used to store column j of matrix B, and memory C is used to store the partial and final products of column j. Compared with previous techniques, our design significantly reduces the number of registers needed for data movement: 4n registers are required for data movement in the linear array design [1], while only n registers are used in our design. In the linear array design, n² + 2n cycles are needed for an n × n matrix multiplication. With run-time configurable parameters and parallel processors, we save n cycles in our systolic mode design and 2n cycles in the parallel mode.

[Figure 1: block diagram of PEj — one MAC, one a-register, an I/O port, and a Block SelectRAM split into part B (column j of matrix B) and part C (the partial products of column j of matrix C).]

Figure 1. Architecture of PEj in the Proposed Design

Based on the proposed PE architecture, a number of theorems are derived to show the performance of the proposed multiplier. Lemma 1 gives the minimum latency requirement of n × n matrix multiplication with n MACs (multiplier-and-accumulators) and with a uniprocessor. Lemma 2 improves on the linear array algorithm for matrix multiplication with respect to both the number of registers and the number of computation cycles. Lemma 2 is extended in Corollaries 1 and 2 to demonstrate the ability of the proposed design to meet the latency limit with n MACs and with one MAC respectively. Lemma 3 addresses the matrix decomposition when the matrix size is larger than the number of available PEs and gives a quantitative analysis of the trade-off between area and latency.

Lemma 1. n × n matrix multiplication can never be performed in less than n² cycles with n multipliers, or n³ cycles with one multiplier.

Proof: We define O(n³) as the complexity of n × n matrix multiplication. The equation c_ij = Σ_{k=1..n} a_ik × b_kj denotes the calculation of any n × n matrix multiplication C = A × B, where a_ik, b_kj and c_ij represent the elements of the n × n matrices A, B and C respectively. We need n multiplications to produce each element of the product C. Thus, to compute the whole n × n matrix C, n³ multiplications are needed. If we have n multipliers working in parallel, n² cycles is the latency elapsed in multiplication. Note that we have not counted the latency for addition in pipelined processing, nor the cycles spent in data movement. So the minimum timing requirement for an n × n matrix multiplication is n² cycles with n multipliers, and by the same reasoning n³ cycles with one multiplier. These are the lower-bound latencies for implementing n × n matrix multiplication.

Lemma 2. n × n matrix multiplication can be performed in n² + n cycles using n PEs, each with 1 MAC, 1 register, a Block SelectRAM of 2n words and 1 I/O port.

Proof: Figure 1 is devised to compute c_ij = Σ_{k=1..n} a_ik × b_kj for all i, j, where a_ik, b_kj and c_ij represent the elements of the n × n matrices A, B and C. PEj denotes the j-th PE in the whole structure. PEj computes column j of matrix C, i.e. c_1j, c_2j, …, c_nj, stored in the Block SelectRAM C part. The input of PEj connects to the output of PEj−1, and the output of PEj is the input of the next array element PEj+1. In phase i, row i of matrix A (a_ik, 1 ≤ k ≤ n) traverses PE1, PE2, PE3, …, PEn in order. Column j of matrix B resides in the Block SelectRAM of PEj, which can be partially configured. This scheme allows PEj to update c′_ij = c′_ij + a_ik × b_kj every clock cycle, where c′_ij represents the intermediate value of c_ij. It thus takes n cycles to calculate each element of matrix C. The MAC in PEj does not start until the first element of matrix A, a_11, arrives. Thus, PEj starts computing j cycles after the ready signal activates, and completes on cycle n² + j. So we get the result after the last element c_nn in PEn is ready, which is after the (n² + n)-th cycle.

Corollary 1. n × n matrix multiplication can be performed in n² cycles using n PEs, each with 1 MAC, 1 register, 1 Block SelectRAM of 2n words and 1 I/O port.

Proof: Nothing changes from Lemma 2 except the way matrix A traverses. Instead of passing through PE1, PE2, PE3, …, PEn, the elements of matrix A travel along the data bus and are fed into every PE simultaneously. We instantiate all the PEs with the parameters of PE1 but a different column of matrix B in Block SelectRAM part B — PEj holds the j-th column of matrix B. This method allows all the PEs to start at the same time and finish with the latency of PE1 as in Lemma 2.

Corollary 2. n × n matrix multiplication can be performed in n³ cycles using 1 PE with 1 MAC, 1 register, a Block SelectRAM of 2n² words and 1 I/O port.

Proof: This is the same as the uniprocessor case: n × n matrix multiplication can also be performed using only PE1. We parameterize the value of Block SelectRAM part B with matrix B in column order. Matrix A is fed into PE1 n times, at a rate of one column per pass, so n³ cycles are needed in this case.

Lemma 3. n × n matrix multiplication can be performed in rn² cycles using n/r PEs, each with 1 MAC, 1 register, 1 Block SelectRAM of 2n words and 1 I/O port, where n is divisible by r.

Proof: An n × n matrix multiplication can be decomposed into r³ matrix multiplications of size n/r × n/r. Using Corollary 1 with n replaced by n/r, the result is obtained. The matrix operand management is as follows: matrix A is fed in with the row of sub-matrices as the major sequence and the row order within each sub-matrix as the minor sequence; matrix B resides in the Block SelectRAM, with the column of sub-matrices as the major order and each column within the sub-matrix as the minor order. For example, if we decompose an 8 × 8 matrix multiplication with a factor of r = 2, we manipulate the matrices in the arrow sequence depicted in Figure 2.

[Figure 2(a): the 8 × 8 matrix A (elements a11 … a88), with arrows labelled "First" and "Next" showing the traversal order of its sub-matrices. Figure 2(b): the 8 × 8 matrix B (elements b11 … b88), with its columns partitioned among PE1, PE2, PE3 and PE4.]

Figure 2. Decomposition of Matrix Multiplication in the Proposed Scheme

The Block SelectRAMs of the PEs are configured in the order shown in Figure 2(b). The way matrix A is fed is illustrated in the following pseudocode:

For major-row-count = 1 to r do
    For major-row = 1 to r do
        For major-column = 1 to r do
            For minor-row = 1 to n/r do
                For minor-column = 1 to n/r do
                    aik = Aij

where aik is the register for a_ik in Figure 1 and Aij is the current element of matrix A ready for feeding in.

Lemma 3 caters for the area-latency trade-off. A smaller value of n/r reduces the number of PEs, resulting in less area. However, it increases the number of cycles needed to complete the matrix multiplication.

3. FPGA implementation

3.1 Performance Comparison

The matrix multiplier described above was implemented on a Xilinx Virtex-II device and its performance in terms of area and latency metrics was evaluated.

We define the performance equation to be Perf = n³ / (slices × latency), where n is the matrix size, and slices and latency stand for the area consumption and computing time respectively.
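Before turning to the measured results, the Lemma 3 schedule can be checked behaviourally. The sketch below is our own Python illustration (function and variable names are ours, not the authors'); it performs the r³ sub-matrix products in the feeding order described above and reproduces the rn² cycle count:

```python
def blocked_matmul(A, B, r):
    """Lemma 3 sketch: decompose an n x n product into r^3 products of
    (n/r) x (n/r) sub-matrices, each costing (n/r)^2 cycles on n/r PEs."""
    n = len(A)
    assert n % r == 0, "Lemma 3 requires n divisible by r"
    m = n // r  # sub-matrix size, also the number of PEs (Corollary 1 mode)
    C = [[0] * n for _ in range(n)]
    cycles = 0
    for br in range(r):            # major order: block row of A and C
        for bc in range(r):        # block column of B and C
            for bk in range(r):    # inner block index
                # one m x m block product, m^2 cycles on the m-PE array
                for i in range(m):
                    for k in range(m):
                        a = A[br * m + i][bk * m + k]  # the streamed a-register
                        for j in range(m):
                            C[br * m + i][bc * m + j] += a * B[bk * m + k][bc * m + j]
                cycles += m * m
    return C, cycles  # cycles = r^3 * (n/r)^2 = r * n^2
```

For n = 8 and r = 2 this performs 2³ = 8 block products of size 4 × 4 and counts 8 × 16 = 128 = 2 · 8² cycles, matching Lemma 3.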
By using the metric slices × latency (AT) for evaluation, we are able to take into account both the effect of increased numbers of processing elements and the area differences of the various types of memory. This is especially relevant in an era of deep pipelines and huge caches, where small performance improvements are bought at the cost of dramatic increases in area.

Table 1 shows the different matrix multiplication modules with their various area and latency tradeoffs, with the Xilinx reference design run at 154 MHz, the module generated by Core Generator at 113 MHz, the linear array at 166 MHz and the proposed module at 74 MHz. The reason our design runs at a relatively low rate is our non-pipelined design architecture; this part needs further optimization. Note that the modules by Core Generator and by Xilinx use a single multiplier, so their area is the same for all matrix sizes.

Table 1. Comparison of 3 existing designs against the proposed design for various sizes of matrix multiplication

(a) Area Comparison

Matrix Size | A_Xilinx (slices) | A_CoreGen (slices) | A_LinearArray (slices) | A_Proposed (slices)
3 × 3       | 207 | 158 | 393  | 110
6 × 6       | 207 | 158 | 786  | 219
9 × 9       | 207 | 158 | 1179 | 334
12 × 12     | 207 | 158 | 1572 | 446
15 × 15     | 207 | 158 | 1965 | 558
24 × 24     | 207 | 158 | 3912 | 893
48 × 48     | 207 | 158 | 9360 | 1798

(b) Latency Comparison

Matrix Size | L_Xilinx (cycles / µs) | L_CoreGen (cycles / µs) | L_LinearArray (cycles / µs) | L_Proposed (cycles / µs)
3 × 3       | 45 / 0.292        | 45 / 0.398        | 16 / 0.096    | 13 / 0.175
6 × 6       | 360 / 2.337       | 288 / 2.549       | 49 / 0.295    | 43 / 0.581
9 × 9       | 1215 / 7.890      | 891 / 7.885       | 100 / 0.602   | 91 / 1.229
12 × 12     | 2280 / 14.805     | 2016 / 17.841     | 169 / 1.018   | 157 / 2.121
15 × 15     | 5625 / 36.526     | 3825 / 33.850     | 256 / 1.542   | 241 / 3.256
24 × 24     | 23040 / 149.610   | 14976 / 132.531   | 625 / 3.765   | 601 / 8.121
48 × 48     | 184320 / 1196.883 | 115200 / 1019.469 | 2401 / 14.464 | 2353 / 31.797

Figure 3 shows the performance evaluation of the 3 existing designs against our proposed one for various sizes of matrix multiplication. The performance equation shows a significant improvement over the existing modules under the AT metric. The comparable linear array design can run almost 2 times faster than the proposed module, but its performance deteriorates after n = 15 due to its significant slice consumption beyond that point.

[Figure 3: plot of Perf = n³/(slices × latency) against matrix size n (0-50) for the Xilinx, CoreGen, LinearArray and Proposed modules.]

Figure 3. Performance Evaluation of Matrix Multiplication with Various Sizes

Lemma 2 and its corollaries show the ability of our design to be configured as different types of processor according to specific requirements. By instantiating the column parameter everywhere with the first column number, we get a cluster of vector multipliers that compute in parallel and achieve the maximum speedup of any n-processor mode, with n² computation cycles. Each vector multiplier can also be used individually as a uniprocessor for matrix-vector multiplication. Table 2 shows the latency of the proposed module when configured as a Uniprocessor, a Systolic Array and an Optimal Parallel module:

Table 2. Latency for Various Sizes of Matrix Multiplication in the Versatile Configurations of the Proposed Module

Matrix Size | Uniprocessor (cycles) | SystolicArray (cycles) | Parallel (cycles)
3 × 3       | 28     | 13   | 10
6 × 6       | 217    | 43   | 37
9 × 9       | 730    | 91   | 82
12 × 12     | 1729   | 157  | 145
15 × 15     | 3376   | 241  | 226
24 × 24     | 13825  | 601  | 577
48 × 48     | 110593 | 2353 | 2305

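The systolic-array mode itself can also be modeled behaviourally. This is our own Python sketch of the Lemma 2 schedule (names are ours, not the authors'): matrix A streams row by row, every PE j holds column j of B, and the count matches Lemma 2's n² + n bound (Table 2 shows one further cycle, presumably for result readiness):

```python
def systolic_matmul(A, B):
    """Lemma 2 sketch: PE j keeps column j of B and the partial column j
    of C in its Block SelectRAM, doing one multiply-accumulate per cycle."""
    n = len(A)
    b_col = [[B[k][j] for k in range(n)] for j in range(n)]  # SelectRAM part B
    c_col = [[0] * n for _ in range(n)]                      # SelectRAM part C
    # In phase i, row i of A streams past the array one element per cycle;
    # PE j updates c'_ij += a_ik * b_kj (running j cycles behind PE 1).
    for i in range(n):
        for k in range(n):
            a_ik = A[i][k]          # the single a-register of Figure 1
            for j in range(n):      # every PE fires once per streamed element
                c_col[j][i] += a_ik * b_col[j][k]
    cycles = n * n + n              # n^2 streamed elements + skew of PE n
    C = [[c_col[j][i] for j in range(n)] for i in range(n)]
    return C, cycles
```

Comparing the result against an ordinary triple-loop product confirms both the column-wise accumulation and the cycle bound.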
Note that the performance evaluation in Table 1 and Figure 3 is carried out by instantiating the proposed module in the systolic array mode, while the parallel mode can achieve even greater improvement.

Figure 4 shows the area-latency tradeoffs stated in Lemma 3 for n = 48. For matrices of other sizes, the trend remains the same. We can see from the chart that area and latency are in fact inversely proportional.

[Figure 4: plot of area (slices, 200-2000) against latency (cycles, up to 12 × 10⁴) for the 48 × 48 multiplication.]

Figure 4. Area-Latency Tradeoffs for Matrix Multiplication (n=48)

3.2 Partial Reconfiguration

One of the novel features of our architecture is its partial reconfigurability. As the target Virtex-II FPGA supports partial configuration, we can partially change our design through independent configuration bitstream flows and modify only the desired parts of the silicon, without stopping the processing or reprogramming the whole device. This gives us a novel space to work in, where the cost of reconfiguration is alleviated by the reduced size of the bitstream.

The contents of our matrix multiplier can be changed dynamically by partially reconfiguring the memory cells of the Block RAM embedded in the Virtex-II device. In this way, the matrix B multiplicand can be modified at run-time without re-running the whole synthesis design flow.

At this point, only the contents of the matrix multiplicand can be partially configured. We will extend partial reconfigurability to the clock templates, the data width, and the number of processing elements for different matrix sizes.

Figure 5 shows the whole implementation of the 48 × 48 matrix multiplication. The rectangles designate the PEs, linearly distributed along the Block RAM columns, each of which contains a Block RAM for partial content implementation.

[Figure 5: floorplan with PE1, PE2, … placed along the Block RAM columns of the device.]

Figure 5. 48×48 Matrix product implementation on Virtex-II 4000, PE's distribution

4. Conclusion

A computation- and area-efficient architecture for matrix multiplication has been proposed, with instantiation versatility and a content partial reconfiguration feature. We demonstrated the improved area-latency tradeoff by comparing its performance with existing designs.

5. References

[1] Ju-Wook Jang, Seonil Choi, and Viktor K. Prasanna, "Area and Time Efficient Implementation of Matrix Multiplication on FPGAs," The First IEEE International Conference on Field-Programmable Technology (FPT), December 2002.

[2] A. Amira, A. Bouridane, and P. Milligan, "Accelerating Matrix Product on Reconfigurable Hardware for Signal Processing," Field-Programmable Logic and Applications (FPL), pp. 101-111, 2001.

[3] O. Mencer, M. Morf, and M. Flynn, "PAM-Blox: High Performance FPGA Design for Adaptive Computing," IEEE Symposium on FPGAs for Custom Computing Machines, pp. 167-174, 1998.

[4] V. K. Prasanna Kumar and Y. Tsai, "On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication," IEEE Transactions on Computers, Vol. 40, No. 6, 1991.

[5] Xilinx Application Note XAPP284, Virtex-II Series, http://www.xilinx.com, 2003.