Proceedings of the EUROMICRO Systems on Digital System Design (DSD’04)
0-7695-2203-3/04 $ 20.00 IEEE
Authorized licensed use limited to: Pune Institute of Computer Technology. Downloaded on October 30, 2009 at 05:15 from IEEE Xplore. Restrictions apply.
to store column j of matrix B, and memory C is used to store the partial and final products of column j. Compared with the previous techniques, our design significantly reduces the number of registers needed for data movement: 4n registers are required for data movement in the Linear Array design [1], while only n registers are used in our design. In the Linear Array design, n^2 + 2n cycles are needed for the n × n matrix multiplication. With run-time configurable parameters and parallel processors, we save n cycles in our systolic mode design and 2n cycles in the parallel mode.

Figure 1. Architecture of PE_j in the Proposed Design

Based on the proposed PE architecture, a number of theorems are derived to show the performance of the proposed multiplier. Lemma 1 gives the minimum latency requirement of n × n matrix multiplication with n MACs (multiplier-accumulators) and with a uniprocessor. Lemma 2 improves the Linear Array algorithm for matrix multiplication with respect to both the number of registers and the number of computation cycles. Lemma 2 is extended in Corollaries 1 and 2 to demonstrate the ability of the proposed design to meet the latency lower bound with n MACs and with one MAC respectively. Lemma 3 addresses matrix decomposition when the matrix size is larger than the number of available PEs and gives a quantitative analysis of the trade-off between area and latency.

Lemma 1. n × n matrix multiplication can never be performed in fewer than n^2 cycles with n multipliers, or fewer than n^3 cycles with one multiplier.

Proof: The complexity of n × n matrix multiplication is O(n^3). The equation c_ij = Σ_{k=1}^{n} a_ik × b_kj defines any n × n matrix multiplication C = A × B, where a_ik, b_kj and c_ij represent the elements of the n × n matrices A, B and C respectively. We need n multiplications to produce each element of the product C. Thus, to compute the whole n × n matrix C, n^3 multiplications are needed. If we have n multipliers working in parallel, n^2 cycles is the latency spent in multiplication. Note that we have not counted the latency of the additions in pipelined processing, nor the cycles spent on data movement. So the minimum timing requirement for an n × n matrix multiplication is n^2 cycles with n multipliers, and by the same reasoning n^3 cycles with one multiplier. These are the lower bounds on latency for implementing n × n matrix multiplication.

Lemma 2. n × n matrix multiplication can be performed in n^2 + n cycles using n PEs, each with 1 MAC, 1 register, a Block SelectRAM of 2n words and 1 I/O port.

Proof: Figure 1 is devised to compute c_ij = Σ_{k=1}^{n} a_ik × b_kj for all i, j, where a_ik, b_kj and c_ij represent the elements of the n × n matrices A, B and C. PE_j denotes the j-th PE in the whole structure. PE_j computes column j of matrix C, namely c_1j, c_2j, ..., c_nj, stored in the C part of its Block SelectRAM. The input of PE_j connects to the output of PE_{j-1}, and the output of PE_j is the input of the next array element, PE_{j+1}. In phase k, column k of matrix A (a_ik, 1 ≤ i ≤ n) traverses PE_1, PE_2, PE_3, ..., PE_n in order. Column j of matrix B resides in the Block SelectRAM of PE_j, which can be partially reconfigured. This scheme allows PE_j to update c'_ij = c'_ij + a_ik × b_kj every clock cycle, where c'_ij represents the intermediate value of c_ij, and it takes n cycles to calculate each element of matrix C. The MAC in PE_j does not start until the first element of matrix A, a_11, arrives. Thus, PE_j starts computing j cycles after the ready signal activates, and completes at cycle j + n^2. So we get the result after the last element c_nn in PE_n is ready, which is after the (n^2 + n)-th cycle.

Corollary 1. n × n matrix multiplication can be performed in n^2 cycles using n PEs, each with 1 MAC, 1 register, 1 Block SelectRAM of 2n words and 1 I/O port.

Proof: Nothing changes from Lemma 2 except the way matrix A traverses the array. Instead of passing through PE_1, PE_2, PE_3, ..., PE_n in sequence, the elements of matrix A travel over the data bus and are fed into every PE simultaneously. We instantiate all the PEs with the parameters of PE_1 but with a different column of matrix B in Block SelectRAM part B: PE_j holds the j-th column of matrix B. This method allows all the PEs to start at the same time and finish with the latency of PE_1 in Lemma 2.

Corollary 2. n × n matrix multiplication can be performed in n^3 cycles using 1 PE with 1 MAC, 1 register and a Block SelectRAM of 2n^2 words and 1 I/O port.
Proof: This is the same as the uniprocessor case of Lemma 1: n × n matrix multiplication can also be performed using only PE_1. We parameterize the value of Block SelectRAM part B with the whole of matrix B in column order. Matrix A is fed into PE_1 n times, producing one column of the product per pass. So n^3 cycles are needed in this case.

Lemma 3. n × n matrix multiplication can be performed in rn^2 cycles using n/r PEs, each with 1 MAC, 1 register and 1 Block SelectRAM of 2n words and 1 I/O port, where n is divisible by r.

Proof: n × n matrix multiplication can be decomposed into r^3 matrix multiplications of size n/r × n/r. Using Corollary 1 with n replaced by n/r, the result follows. The matrix operands are managed as follows: matrix A is fed in with the row of the sub-matrix as the major sequence and the row order within each sub-matrix as the minor sequence; matrix B resides in the Block SelectRAM, with the column of the sub-matrix as the major order and each column within the sub-matrix as the minor order. For example, if we decompose an 8 × 8 matrix multiplication with a factor of r = 2, we can manipulate the matrices in the arrow sequence depicted in Figure 2.

[Figure 2 shows (a) the 8 × 8 matrix A annotated with its traversal order ("First", "Next"), and (b) the 8 × 8 matrix B with its columns distributed across PE 1 to PE 4.]

Figure 2. Decomposition of Matrix Multiplication in the Proposed Scheme

The Block SelectRAMs of the PEs are configured in the order shown in Figure 2 (b). The way matrix A is fed in is illustrated by the following pseudocode:

    For major-row_count = 1 to r do
        For major-row = 1 to r do
            For major-column = 1 to r do
                For minor-row = 1 to n/r do
                    For minor-column = 1 to n/r do
                        aik <- Aij

where aik is the a_ik register of Figure 1 and Aij is the current element of matrix A ready for feeding in.

Lemma 3 caters for the area-latency trade-off. A smaller value of n/r reduces the number of PEs, resulting in less area. However, it increases the number of cycles needed to complete the matrix multiplication.

3. FPGA implementation

3.1 Performance Comparison

The matrix multiplier described above was implemented in a Xilinx Virtex-II device and its performance in terms of area and latency metrics was evaluated.
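Before turning to the measured results, the schedule established in Lemma 2 can be sanity-checked with a small software model. The following is an illustrative Python sketch, not the hardware design; the function name `systolic_matmul` is introduced here for illustration, and the latency is derived from the proof's formula (PE_j finishes at cycle j + n^2) rather than simulated gate by gate:

```python
# Software model of the Lemma 2 systolic schedule (illustrative only).
# PE_j stores column j of B in its Block SelectRAM, accumulates column j
# of C, and sees the stream of A elements j cycles after PE_1.

def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # Feed order from the proof: in phase k, elements a_1k .. a_nk
    # traverse the PE chain, one element per cycle.
    stream = [(i, k) for k in range(n) for i in range(n)]
    latency = 0
    for j in range(n):                 # PE_{j+1} in the paper's numbering
        for i, k in stream:            # n^2 MAC operations per PE
            C[i][j] += A[i][k] * B[k][j]
        # PE_{j+1} starts j+1 cycles after the ready signal and then
        # performs n^2 MACs, so it finishes at cycle (j+1) + n^2.
        latency = max(latency, (j + 1) + n * n)
    return C, latency                  # latency == n^2 + n
```

For a 2 × 2 input the model returns the exact product together with the n^2 + n = 6 cycle latency predicted by the lemma.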
Area and latency are measured in slices and execution time respectively. By using the metric slices × latency (AT) for evaluation, we are able to take into account the effect of increased numbers of processing elements and the area differences between the various types of memory. This is especially relevant in an era of deep pipelines and huge caches, where small performance improvements are bought at the cost of dramatic increases in area.

Table 1 shows the different matrix multiplication modules with their various area and latency trade-offs, with the Xilinx reference design run at 154 MHz and the module generated by Core Generator without further optimization. Note that the modules by Core Generator and by Xilinx use a single multiplier, so their area is the same for all matrix sizes.

Table 1. Comparison of 3 existing designs against the proposed design for various sizes of matrix multiplication

Figure 3 shows the performance evaluation of the 3 existing designs against our proposed one for various sizes of matrix multiplication. The performance equation shows a significant improvement over the existing modules under the AT metric. The comparable Linear Array design can run almost 2 times faster than the proposed module, but its performance deteriorates after n = 15 due to the significant slice consumption beyond that point.

[Figure 3 plots the performance values Perf = n^3 / (slices × latency) of the LinearArray and Proposed designs against matrix size.]
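The metric behind Figure 3 can be written down explicitly. The helper below is a sketch; the slice and cycle counts in the example are hypothetical placeholders, not the measured values from Table 1:

```python
def perf(n, slices, latency_cycles):
    """Normalized performance Perf = n^3 / (slices * latency):
    useful work (n^3 multiply-accumulates) per slice-cycle."""
    return n ** 3 / (slices * latency_cycles)

# Hypothetical example for a 16 x 16 multiply in parallel mode,
# which needs n^2 cycles per Corollary 1; 800 slices is a made-up
# figure used only to show how the metric is computed.
example = perf(16, slices=800, latency_cycles=16 ** 2)
```

A design with more PEs may lose on this metric even when it wins on raw latency, which is exactly the trade-off the AT comparison is meant to expose.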
Note that the performance evaluation in Table 1 and Figure 3 is carried out by instantiating the proposed module in the systolic array mode, while the parallel mode can achieve even greater improvement.

Figure 4 shows the area-latency trade-off stated in Lemma 3 for n = 48. For matrices of other sizes, the trend remains the same. We can see from the chart that area and latency are in fact inversely proportional.

[Figure 4 plots area in slices (200 to 2000) against latency in cycles (× 10^4), with the design points PE1 and PE2 marked.]

Figure 4. Area-Latency Tradeoffs for Matrix Multiplication (n=48)

3.2 Partial Reconfiguration

One of the novel features of our architecture is its partial reconfigurability. As the target Virtex-II FPGA supports partial reconfiguration, we can partially change our design through independent configuration bitstream flows and modify only the desired parts of the silicon, without stopping the processing or reprogramming the whole device. This gives us a novel space to work in, where the cost of reconfiguration is alleviated by the reduced size of the bitstream.

The contents of our matrix multiplier can be changed dynamically by partially reconfiguring the memory cells of the Block RAM embedded in the Virtex-II device. In this way, the matrix B multiplicand can be modified at run-time without re-running the whole synthesis flow.

At this point, only the contents of the matrix multiplicand can be partially reconfigured. We will extend partial reconfigurability to the clock templates, the data width, and the number of processing elements for different matrix sizes.

Figure 5 shows the whole implementation of the 48 × 48 matrix multiplication. The rectangles designate the PEs, linearly distributed along the Block RAM columns; each contains a Block RAM for partial content implementation.

Figure 5. 48×48 Matrix product implementation on Virtex-II 4000, PE distribution

4. Conclusion

A computation- and area-efficient architecture for matrix multiplication is proposed, with instantiation versatility and a partial content reconfiguration feature. We demonstrate the improved area-latency trade-off by comparing its performance with existing designs.

5. References

[1] Ju-Wook Jang, Seonil Choi, and Viktor K. Prasanna, "Area and Time Efficient Implementation of Matrix Multiplication on FPGAs," The First IEEE International Conference on Field-Programmable Technology (FPT), December 2002.

[2] A. Amira, A. Bouridane, and P. Milligan, "Accelerating Matrix Product on Reconfigurable Hardware for Signal Processing," Field-Programmable Logic and Applications (FPL), pp. 101-111, 2001.

[3] O. Mencer, M. Morf, and M. Flynn, "PAM-Blox: High Performance FPGA Design for Adaptive Computing," IEEE Symposium on FPGAs for Custom Computing Machines, pp. 167-174, 1998.

[4] V. K. Prasanna Kumar and Y. Tsai, "On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication," IEEE Transactions on Computers, Vol. 40, No. 6, 1991.

[5] Xilinx Application Note XAPP284, Virtex-II Series, http://www.xilinx.com, 2003.