
Default Term Project Report

Soumitry J. Ray
April 30, 2012

3a) Explain your implementation (5000 words or less).

MPI only:
The implementation details are as below:

1. Communicators: All the allocated processors were organized into a logical rectangular mesh of r rows and c columns. Row communicators (row_comm) and column communicators (col_comm) were then constructed out of these processors using MPI_Comm_split. All processors on a row belonged to one row_comm, and similarly for the other rows; col_comm carries the analogous meaning for the columns of the mesh. The code snippet in Algorithm 1 below shows how the communicators were constructed. The variables proc_x and proc_y are the coordinates of each processor in MPI_COMM_WORLD.

Algorithm 1 row_comm and col_comm communicators.

MPI_Comm row_comm, col_comm;
MPI_Comm_split(MPI_COMM_WORLD, proc_x, proc_y, &row_comm);
MPI_Comm_split(MPI_COMM_WORLD, proc_y, proc_x, &col_comm);
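The report does not show how proc_x and proc_y are obtained. A minimal sketch, assuming a row-major assignment of MPI ranks to the procGridX x procGridY process grid (the mapping itself is an assumption; only the procGridX and procGridY names are taken from the later listings):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Assumed row-major rank-to-grid mapping (not stated in the report). */
int proc_x = rank / procGridY;   /* row coordinate,    0 .. procGridX-1 */
int proc_y = rank % procGridY;   /* column coordinate, 0 .. procGridY-1 */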

2. Tiling: The matrices A (M x K) and B (K x N) are then divided into r x c tiles and each processor received one unique tile. Let the size of the tiles of A, denoted by Ablock, be ar x ac and the size of the tiles of B, denoted by Bblock, be br x bc. In the implementation, ar is represented by nrow_Ablock, ac by ncol_Ablock, br by nrow_Bblock and bc by ncol_Bblock.
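The tile dimensions follow directly from these definitions. A minimal sketch, assuming M, K and N are exact multiples of the process-grid dimensions (M, K and N here are assumed names for the global matrix sizes; procGridX and procGridY are the grid sizes used in the later listings):

/* Tiles of A are a_r x a_c = (M/r) x (K/c); tiles of B are b_r x b_c = (K/r) x (N/c). */
int nrow_Ablock = M / procGridX;   /* a_r */
int ncol_Ablock = K / procGridY;   /* a_c */
int nrow_Bblock = K / procGridX;   /* b_r */
int ncol_Bblock = N / procGridY;   /* b_c */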

3. Buffers: For each processor two storage buffers were declared: a) row_storage and b) col_storage. row_storage is buffer space of dimension ar x K that stored all the received panels of tiles of A from row_comm. Similarly, col_storage, of dimension K x bc, stored all the received panels of tiles of B from col_comm. In addition, each processor copied its own tiles of A and B to the appropriate locations in the row_storage and col_storage buffers respectively. Furthermore, row_storage is column ordered whereas col_storage is row ordered. Algorithm 2 shows a processor copying its own A tile to its row_storage buffer and Algorithm 3 shows it copying its own B tile to its col_storage buffer.

Algorithm 2 Each processor copies its own A tile to row_storage


for (c = 0; c < ncol_Ablock; c++) {        // iterate over the columns of the A tile
    for (r = 0; r < nrow_Ablock; r++) {    // iterate over the rows of the A tile
        int index_A = (c * nrow_Ablock) + r;
        int index_buf = proc_y * ncol_Ablock * nrow_Ablock + (c * nrow_Ablock) + r;
        row_storage[index_buf] = Ablock[index_A];
    }
}

Algorithm 3 Each processor copies its own B tile to col_storage


int index_buf = proc_x * ncol_Bblock * nrow_Bblock;
for (r = 0; r < nrow_Bblock; r++) {
    for (c = 0; c < ncol_Bblock; c++) {
        int index_B = (c * nrow_Bblock) + r;
        col_storage[index_buf] = Bblock[index_B];
        index_buf++;
    }
}

4. Communication: Each processor then broadcasted its own A and B tiles, Ablock and Bblock respectively, from its row_storage and col_storage buffers, within its row_comm and col_comm. The broadcasting is done in the form of panels. Let the panel size be pb (in the implementation pb is denoted by the variable pb). For broadcasts within a row_comm, each panel is a column vector (of dimension ar x pb) whereas for broadcasts within a col_comm the panel is a row vector (of dimension pb x bc). The total number of panels to be broadcast by a processor is given by nPanels. Algorithm 4 shows the broadcast along the row_comm and Algorithm 5 shows the broadcast along the col_comm communicators.

Algorithm 4 Each processor broadcasts its Ablock on its row_comm

int nPanels = ncol_Ablock / pb;
int row_id, panel;
for (row_id = 0; row_id < procGridY; row_id++) {
    for (panel = 0; panel < nPanels; panel++)
        MPI_Bcast(&row_storage[row_id * ncol_Ablock * nrow_Ablock + panel * pb * nrow_Ablock],
                  nrow_Ablock * pb, MPI_DOUBLE, row_id, row_comm);
}

Algorithm 5 Each processor broadcasts its own B tile on its col_comm


nPanels = nrow_Bblock / pb;
int col_id, panel;
for (col_id = 0; col_id < procGridX; col_id++) {
    for (panel = 0; panel < nPanels; panel++)
        MPI_Bcast(&col_storage[col_id * ncol_Bblock * nrow_Bblock + panel * pb * ncol_Bblock],
                  ncol_Bblock * pb, MPI_DOUBLE, col_id, col_comm);
}
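Reading the two loop nests of Algorithms 4 and 5 directly, the number of MPI_Bcast calls issued per processor per iteration is (a restatement of the loop bounds above, not an additional result from the report):

\[
N_{\mathrm{bcast}} \;=\; \underbrace{\texttt{procGridY} \cdot \frac{a_c}{p_b}}_{\text{on row\_comm}} \;+\; \underbrace{\texttt{procGridX} \cdot \frac{b_r}{p_b}}_{\text{on col\_comm}},
\]

with each row_comm broadcast carrying ar * pb doubles and each col_comm broadcast carrying pb * bc doubles.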

5. Computation: After all the broadcasts are finished, each node computes its respective C tile (Cblock) using its row_storage and col_storage buffers. Each Cblock (ar x bc) is stored in column ordered format. Multiplication is performed in rank-1 update format, that is, Cblock is formed by accumulating the sum of outer products of the k_iter-th column of row_storage with the k_iter-th row of col_storage, where 1 <= k_iter <= K. Algorithm 6 below shows how Cblock is computed by each processor.

Algorithm 6 Each processor computes its Cblock by accumulating the sum of outer products on columns of row_storage and rows of col_storage.

int k_iter;
for (k_iter = 0; k_iter < k; k_iter++) {
    int Cblock_ind = 0;
    for (c = 0; c < ncol_Bblock; c++) {
        int col_storage_ind = (k_iter * ncol_Bblock) + c;
        for (r = 0; r < nrow_Ablock; r++) {
            int row_storage_ind = (k_iter * nrow_Ablock) + r;
            Cblock[Cblock_ind] += row_storage[row_storage_ind] * col_storage[col_storage_ind];
            Cblock_ind++;
        }
    }
}
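In matrix notation, writing R for the ar x K panel held in row_storage and S for the K x bc panel held in col_storage (R and S are shorthand introduced here, not names from the report), the loop above accumulates

\[
C_{\text{block}} \;=\; \sum_{k=1}^{K} R(:,k)\, S(k,:) \;=\; R\,S .
\]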

MPI+OpenMP:
The MPI+OpenMP implementation differs from the MPI (only) implementation in Step 5, the Computation step. Below, only the Computation step for the MPI+OpenMP implementation has been discussed.

5. Computation: To parallelize the computation of Cblock, the number of threads (nthreads) was set to 2, 4 and 6. The matrix multiplication was performed in dot-product fashion, where each element of Cblock was obtained as the dot product of a row of row_storage and a column of col_storage. Algorithm 7 below shows the implementation. The outer two loops of the matrix multiplication were collapsed. The resulting matrix Cblock is in column ordered format.

Algorithm 7 Parallelized matrix multiplication with OpenMP


int k_iter, Cblock_ind, col_storage_ind, row_storage_ind;
int nthreads = 2;
omp_set_num_threads(nthreads);
#pragma omp parallel for private(k_iter, c, r, Cblock_ind, col_storage_ind, row_storage_ind) schedule(static) collapse(2)
for (r = 0; r < nrow_Ablock; r++) {
    for (c = 0; c < ncol_Bblock; c++) {
        Cblock_ind = (c * nrow_Ablock) + r;
        for (k_iter = 0; k_iter < k; k_iter++) {
            row_storage_ind = (k_iter * nrow_Ablock) + r;
            col_storage_ind = (k_iter * ncol_Bblock) + c;
            Cblock[Cblock_ind] += row_storage[row_storage_ind] * col_storage[col_storage_ind];
        }
    }
}
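Element-wise, the collapsed loops compute the same product as Algorithm 6, but in dot-product order (with R = row_storage and S = col_storage as before):

\[
C_{\text{block}}(r,c) \;=\; \sum_{k=1}^{K} R(r,k)\, S(k,c), \qquad 1 \le r \le a_r,\; 1 \le c \le b_c .
\]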

MPI+GPU:
The MPI+GPU implementation differs from the MPI (only) implementation in Step 5, the Computation step. Below, only the Computation step for the MPI+GPU implementation has been discussed.

5. Computation: Prior to performing the multiplication on the GPU it is necessary to copy the multiplicand matrices to the device. The multiplicand matrices row_storage and col_storage on the host are in double format and need to be copied to the device as float. Therefore, the two multiplicand matrices row_storage and col_storage are first copied to the row_storage_row_order_float and col_storage_float matrices on the host.

Algorithm 8 Host-to-device copies, kernel invocation and copying of the result back to Cblock
int ind_row_order = 0;
for (r = 0; r < nrow_Ablock; r++) {
    for (c = 0; c < k; c++) {
        int ind_col_order = (c * nrow_Ablock) + r;
        row_storage_row_order_float[ind_row_order] = (float)row_storage[ind_col_order];
        ind_row_order++;
    }
}

int ind = 0;
for (ind = 0; ind < ncol_Bblock * k; ind++)
    col_storage_float[ind] = (float)col_storage[ind];

float *dev_Arow_block, *dev_Bcol_block, *dev_Cblock;
cudaMalloc((void**)&dev_Arow_block, nrow_Ablock * k * sizeof(float));
cudaMalloc((void**)&dev_Bcol_block, ncol_Bblock * k * sizeof(float));
cudaMalloc((void**)&dev_Cblock, nrow_Ablock * ncol_Bblock * sizeof(float));

cudaMemcpy(dev_Arow_block, row_storage_row_order_float, nrow_Ablock * k * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_Bcol_block, col_storage_float, ncol_Bblock * k * sizeof(float), cudaMemcpyHostToDevice);

MatrixMulKernel_Wrapper(dev_Arow_block, dev_Bcol_block, dev_Cblock, nrow_Ablock, k, ncol_Bblock);

float *buffer = NULL;
buffer = (float*)malloc(sizeof(float) * nrow_Ablock * ncol_Bblock);
assert(buffer != NULL);
cudaMemcpy(buffer, dev_Cblock, nrow_Ablock * ncol_Bblock * sizeof(float), cudaMemcpyDeviceToHost);

ind = 0;
// Now Cblock is in column-ordered format
for (ind = 0; ind < nrow_Ablock * ncol_Bblock; ind++)
    Cblock[ind] = (double)buffer[ind];

Device buffers are allocated with cudaMalloc, and the row_storage_row_order_float and col_storage_float matrices are copied to dev_Arow_block and dev_Bcol_block with cudaMemcpy; the kernel on the device is then invoked. Finally, buffer is copied to Cblock. Algorithm 8 illustrates this process.

The product matrix dev_Cblock is in float format, therefore it is copied to buffer as a matrix of floats. In the kernel implementation, BLOCK_SIZE x BLOCK_SIZE sub-blocks of A and B are collaboratively loaded into shared memory on the GPU. Each C sub-matrix, Csub, is computed accumulatively and finally written to the matrix C. Algorithm 9 below shows the kernel implementation. (The kernel implementation is a modification of the matrix multiplication implementation in the NVIDIA CUDA Handbook.)

Algorithm 9 Computing Cblock using collaborative loading of blocks.

#define BLOCK_SIZE 2

__global__ void MatrixMulKernel(float* A, float* B, float* C, int hA, int wA, int wB)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = BLOCK_SIZE;
    int bBegin = BLOCK_SIZE * bx;
    int bStep  = BLOCK_SIZE * wB;

    float Csub = 0;
    int a, b;
    for (a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        __syncthreads();
        int k;
        for (k = 0; k < BLOCK_SIZE; k++)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    // Store the result in column-ordered format (hA is the leading dimension).
    int c = BLOCK_SIZE * by + BLOCK_SIZE * hA * bx;
    C[c + hA * tx + ty] = Csub;
}

void MatrixMulKernel_Wrapper(float *a, float *b, float *c, int hA, int wA, int wB)
{
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(a, b, c, hA, wA, wB);
}
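As written, the wrapper's grid computation and the kernel's traversal of A assume that hA, wA and wB are exact multiples of BLOCK_SIZE (otherwise the integer divisions in dimGrid drop the remainder and the shared-memory loads run past the tile edges). A minimal guard reflecting that reading; the _checked wrapper name is hypothetical and not part of the original code:

#include <assert.h>

/* Hypothetical guard around the wrapper in Algorithm 9. */
void MatrixMulKernel_Wrapper_checked(float *a, float *b, float *c, int hA, int wA, int wB)
{
    assert(hA % BLOCK_SIZE == 0);   /* full tiles along the rows of A and C    */
    assert(wA % BLOCK_SIZE == 0);   /* full tiles along the shared K dimension */
    assert(wB % BLOCK_SIZE == 0);   /* full tiles along the columns of B and C */
    MatrixMulKernel_Wrapper(a, b, c, hA, wA, wB);
}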

3b)

General implementation notes:

The performance of the three implementations for different panel sizes (pb = 1, 4, 16) was the same. Therefore, to reduce the job time on the cluster, pb = 1 was used. The #iterations was also reduced to 5 from the default value of 25. The #nodes used was 4.

Figures:
The figures below show time_per_iter vs. #processors for the three implementations. The matrices were assumed to be square, and the legend in all the figures shows the common dimension of the matrices (A, B and C).

[Plot: time_per_iter vs. #processors for MPI (only); legend: matrix dimension 128, 256, 512, 1024, 2048, 4096]

Figure 1: MPI (only)

[Plots: time_per_iter vs. #processors for MPI+OpenMP with (a) 2 threads, (b) 4 threads and (c) 6 threads; legend: matrix dimension 128, 256, 512, 1024, 2048, 4096]

Figure 2: MPI+OpenMP

[Plots: time_per_iter vs. #processors for MPI+CUDA with block size (a) 2, (b) 4 and (c) 16; legend: matrix dimension 128, 256, 512, 1024, 2048, 4096]

Figure 3: MPI+GPU

i) Fastest in terms of time to solution.


Comparing Figures 1, 2 and 3, it can be seen that MPI+GPU is the fastest in terms of time to solution, and the time to solution is minimal when the block size is 16.

ii) Most scalable.


Similarly, looking at Figures 1, 2 and 3, it can be noted that Figure 3a has the steepest slope for a matrix dimension of 4096, while for all other dimensions it has an approximately flat profile. Therefore, MPI+GPU with block size 2 appears to be the most scalable; however, it is slower in terms of time to solution than MPI+GPU with block size 16. For dimensions less than 4096, MPI+OpenMP (2 threads) also appears to scale, but it is slower than all MPI+GPU configurations.

iv) Approximate lines of code.


MPI: 110; MPI+OpenMP: 116; MPI+GPU: 210

v) Most complex task in implementing your term project.


The most difficult part was implementing the MPI (only) version; in particular, implementing the panel broadcast. Second in line was the MPI+GPU implementation, specifically the collaborative loading of matrices.

vi) Most surprising result in your performance analysis across implementations.


The most surprising result in the performance analysis came from the MPI+GPU implementation. Figures 3b and 3c show that MPI+GPU with block sizes 4 and 16 is not scalable, even though it is the fastest in terms of time to solution. One possible reason is that MPI communication is significantly dominant. Furthermore, the MPI+OpenMP implementation performs worse than the MPI (only) implementation. Since the outer two loops were collapsed in MPI+OpenMP, this suggests that collapsing the outer two loops is probably not the optimal strategy for MPI+OpenMP.
