MPI only:
The implementation details are as below:

1. Communicators:
All the allocated processors were organized into a logical 2D mesh. Row communicators row_comm and column communicators col_comm were then constructed out of these processors using MPI_Comm_split. All processors on a row belonged to one row_comm, and similarly for the other rows; the col_comm communicators carry a similar meaning for columns. The code snippet, Algorithm 1, below shows how the communicators were constructed. The variables proc_x and proc_y are the coordinates of each processor in MPI_COMM_WORLD.
Algorithm 1: Construction of row_comm and col_comm

MPI_Comm row_comm, col_comm;
/* color = proc_x groups the processors of one row; key = proc_y orders them */
MPI_Comm_split(MPI_COMM_WORLD, proc_x, proc_y, &row_comm);
/* color = proc_y groups the processors of one column; key = proc_x orders them */
MPI_Comm_split(MPI_COMM_WORLD, proc_y, proc_x, &col_comm);
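The computation of proc_x and proc_y is not shown above. A minimal sketch, assuming the world ranks are laid out row-major over the mesh (the name nproc_cols for the number of mesh columns is hypothetical, not from the original code):

/* Hedged sketch: derive mesh coordinates from the MPI_COMM_WORLD rank,
   assuming a row-major layout of ranks over the mesh. nproc_cols (number
   of mesh columns) is a hypothetical name, assumed known. */
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int proc_x = world_rank / nproc_cols;   /* row coordinate */
int proc_y = world_rank % nproc_cols;   /* column coordinate */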
2. Tiling:
The matrices A (M × K) and B (K × N) were partitioned into tiles matching the r × c processor mesh, and each processor received one unique tile. Let the size of the tile of A, denoted by Ablock, be ar × ac, and that of B, denoted by Bblock, be br × bc. In the implementation, ar is represented by nrow_Ablock, ac by ncol_Ablock, br by nrow_Bblock and bc by ncol_Bblock.

3. Buffers:
Each processor maintained a row_storage buffer of size ar × K, which received panels of tiles of A from its row_comm, and a col_storage buffer of size K × bc, which received panels of tiles of B from its col_comm. In addition, each processor copied its own tiles of A and B to the appropriate locations in the row_storage and col_storage buffers respectively. Furthermore, row_storage is column ordered whereas col_storage is row ordered. Algorithm 2 shows the copying of its own A tile by a processor to its row_storage buffer and Algorithm 3 shows the copying of its own B tile to its col_storage buffer (a sketch of both is given below).
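Algorithms 2 and 3 themselves are not reproduced above. The following is a minimal sketch consistent with the stated layouts (row_storage column ordered, col_storage row ordered); the rank variables row_id and col_id (the processor's ranks within row_comm and col_comm) and the orderings of Ablock and Bblock are assumptions:

/* Sketch of Algorithm 2: copy own Ablock (assumed column ordered) into
   row_storage at column offset row_id * ncol_Ablock. */
int r, c;
for (c = 0; c < ncol_Ablock; c++)
    for (r = 0; r < nrow_Ablock; r++)
        row_storage[(row_id * ncol_Ablock + c) * nrow_Ablock + r] =
            Ablock[c * nrow_Ablock + r];

/* Sketch of Algorithm 3: copy own Bblock (assumed row ordered) into
   col_storage at row offset col_id * nrow_Bblock. */
for (r = 0; r < nrow_Bblock; r++)
    for (c = 0; c < ncol_Bblock; c++)
        col_storage[(col_id * nrow_Bblock + r) * ncol_Bblock + c] =
            Bblock[r * ncol_Bblock + c];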
4. Communication:
Each processor then broadcasted its own A and B tiles, Ablock and Bblock, from its row_storage and col_storage buffers respectively, within its row_comm and col_comm. The broadcasting is done in the form of panels. Let the panel size be pb (in the implementation, pb is denoted by pb). For broadcasts within a row_comm, each panel is a column vector (a slab of pb columns of Ablock); for broadcasts within a col_comm, the panel was a row vector (a slab of pb rows of Bblock). The #panels to be broadcast by a processor is given by nPanels. Algorithm 4 below shows the broadcast along row_comm, and Algorithm 5 (sketched after it) shows the broadcast along col_comm.
Algorithm 4: Each processor broadcasts the panels of its Ablock on its row_comm

/* row_id is the root's rank within row_comm; its tile lives at offset
   row_id * ncol_Ablock * nrow_Ablock in the column ordered row_storage. */
int panel;
for (panel = 0; panel < nPanels; panel++) {
    MPI_Bcast(&row_storage[row_id * ncol_Ablock * nrow_Ablock + panel * pb * nrow_Ablock],
              nrow_Ablock * pb, MPI_DOUBLE, row_id, row_comm);
}
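Algorithm 5 is not reproduced above. A minimal sketch mirroring Algorithm 4, under the assumption that col_storage is row ordered with ncol_Bblock as its leading dimension and col_id is the root's rank within col_comm:

/* Sketch of Algorithm 5: broadcast the panels of Bblock on col_comm.
   Each panel is pb contiguous rows of the row ordered col_storage. */
int panel;
for (panel = 0; panel < nPanels; panel++) {
    MPI_Bcast(&col_storage[col_id * nrow_Bblock * ncol_Bblock + panel * pb * ncol_Bblock],
              ncol_Bblock * pb, MPI_DOUBLE, col_id, col_comm);
}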
5. Computation:
After all the broadcasts are finished, each node computes its respective C tile (Cblock) using its row_storage and col_storage buffers. Each Cblock (of size ar × bc) is stored in column ordered format. Multiplication is performed as rank-1 updates, that is, Cblock is formed by accumulating the sum of the outer product of the kiter-th column of row_storage with the kiter-th row of col_storage, for 1 ≤ kiter ≤ K.
Algorithm 6: Each processor computes its Cblock by accumulating the sum of outer products

int k_iter, r, c;
for (k_iter = 0; k_iter < k; k_iter++) {
    int Cblock_ind = 0;
    for (c = 0; c < ncol_Bblock; c++) {
        /* element c of the k_iter-th row of col_storage (row ordered) */
        int col_storage_ind = (k_iter * ncol_Bblock) + c;
        for (r = 0; r < nrow_Ablock; r++) {
            /* element r of the k_iter-th column of row_storage (column ordered) */
            int row_storage_ind = (k_iter * nrow_Ablock) + r;
            Cblock[Cblock_ind] += row_storage[row_storage_ind] * col_storage[col_storage_ind];
            Cblock_ind++;
        }
    }
}
MPI+OpenMP:
The MPI+OpenMP implementation differs from the MPI (only) implementation in Step 5, the Computation step. Below, only the Computation step for the MPI+OpenMP implementation is discussed.

5. Computation:
To parallelize the computation of Cblock, the number of threads (nthreads) was set to 2, 4 and 6. The matrix multiplication was performed in dot-product fashion, where each element of Cblock was obtained as the dot product of a row of row_storage and a column of col_storage. Algorithm 7 shows the implementation (a sketch is given below). The outer two loops of the matrix multiplication were collapsed. The resulting matrix Cblock was in column ordered format.
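Algorithm 7 itself is not reproduced above. A minimal sketch consistent with the description (dot products, the outer two loops collapsed, column ordered Cblock), with all loop details assumed:

/* Sketch of Algorithm 7: each Cblock element is the dot product of a row of
   row_storage (column ordered) and a column of col_storage (row ordered).
   The outer two loops are collapsed across nthreads OpenMP threads. */
#pragma omp parallel for collapse(2) num_threads(nthreads)
for (int c = 0; c < ncol_Bblock; c++) {
    for (int r = 0; r < nrow_Ablock; r++) {
        double sum = 0.0;
        for (int k_iter = 0; k_iter < k; k_iter++)
            sum += row_storage[k_iter * nrow_Ablock + r] *
                   col_storage[k_iter * ncol_Bblock + c];
        Cblock[c * nrow_Ablock + r] = sum;   /* column ordered result */
    }
}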
MPI+GPU:
The MPI+GPU implementation differs from the MPI (only) implementation in Step 5, the Computation step. Below, only the Computation step for the MPI+GPU implementation is discussed.

5. Computation:
Prior to performing the multiplication on the GPU, it is necessary to copy the multiplicand matrices to the device. The multiplicand matrices row_storage and col_storage on the host are in double format and need to be copied to the device as float. Therefore, the two multiplicand matrices row_storage and col_storage are first copied to the row_storage_row_order_float and col_storage_float matrices on the host. These are then copied to dev_Arow_block and dev_Bcol_block on the device, and the kernel is invoked. The product matrix dev_Cblock is in float format, so it is copied back into a host buffer as a matrix of float. Finally, buffer is copied to Cblock. Algorithm 8 illustrates this process.
Algorithm 8: Host-side conversion, device transfer, kernel invocation and copy-back

/* row_storage is column ordered double; convert to row ordered float. */
int r, c;
int ind_row_order = 0;
for (r = 0; r < nrow_Ablock; r++) {
    for (c = 0; c < k; c++) {
        int ind_col_order = (c * nrow_Ablock) + r;
        row_storage_row_order_float[ind_row_order] = (float)row_storage[ind_col_order];
        ind_row_order++;
    }
}

/* col_storage is already row ordered; only convert double to float. */
int ind = 0;
for (ind = 0; ind < ncol_Bblock * k; ind++)
    col_storage_float[ind] = (float)col_storage[ind];

float *dev_Arow_block, *dev_Bcol_block, *dev_Cblock;
cudaMalloc((void**)&dev_Arow_block, nrow_Ablock * k * sizeof(float));
cudaMalloc((void**)&dev_Bcol_block, ncol_Bblock * k * sizeof(float));
cudaMalloc((void**)&dev_Cblock, nrow_Ablock * ncol_Bblock * sizeof(float));
cudaMemcpy(dev_Arow_block, row_storage_row_order_float,
           nrow_Ablock * k * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_Bcol_block, col_storage_float,
           ncol_Bblock * k * sizeof(float), cudaMemcpyHostToDevice);

MatrixMulKernel_Wrapper(dev_Arow_block, dev_Bcol_block, dev_Cblock,
                        nrow_Ablock, k, ncol_Bblock);

float *buffer = (float *)malloc(sizeof(float) * nrow_Ablock * ncol_Bblock);
assert(buffer != NULL);
cudaMemcpy(buffer, dev_Cblock, nrow_Ablock * ncol_Bblock * sizeof(float),
           cudaMemcpyDeviceToHost);

/* Now Cblock is in col storage format. */
for (ind = 0; ind < nrow_Ablock * ncol_Bblock; ind++)
    Cblock[ind] = (double)buffer[ind];
In the kernel implementation, blocks of size BLOCK_SIZE are collaboratively loaded into the shared memory on the GPU. Each C sub-matrix, Csub, is computed in an accumulative way and is finally copied to the matrix C. Algorithm 9 below shows the kernel implementation. (The kernel implementation is a modification of the matrix multiplication implementation in the NVIDIA CUDA Handbook.)
Algorithm 9: Tiled matrix multiplication kernel and its wrapper

__global__ void MatrixMulKernel(float* A, float* B, float* C, int hA, int wA, int wB)
{
    int tx = threadIdx.x;  int ty = threadIdx.y;
    int bx = blockIdx.x;   int by = blockIdx.y;

    /* Each thread block walks one row of A tiles and one column of B tiles
       (A and B are row ordered with widths wA and wB). */
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = BLOCK_SIZE;
    int bBegin = BLOCK_SIZE * bx;
    int bStep  = BLOCK_SIZE * wB;

    float Csub = 0;
    int a, b;
    for (a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        /* Collaboratively load one tile of A and one tile of B. */
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        __syncthreads();

        int k;
        for (k = 0; k < BLOCK_SIZE; k++)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    /* Write the result in column ordered format (leading dimension hA). */
    int c = BLOCK_SIZE * by + BLOCK_SIZE * hA * bx;
    C[c + hA * tx + ty] = Csub;
}

void MatrixMulKernel_Wrapper(float *a, float *b, float *c, int hA, int wA, int wB)
{
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    /* Assumes hA and wB are multiples of BLOCK_SIZE. */
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(a, b, c, hA, wA, wB);
}
3b)
For the MPI+GPU implementation, block sizes of 2, 4 and 16 were used (see Figure 3). For all the experiments, a panel size of pb = 1 was used. The #iterations was also reduced to 5 from the default value of 25. The #nodes used was 4.
Figures:
The figures below show time_per_iter vs #processors for the three implementations. The matrices were assumed to be square, and the legend in all the figures shows the common dimension of the matrices (A, B and C).
[Figure 1: MPI (only). Plot of time_per_iter vs #processors; legend: common matrix dimension 128, 256, 512, 1024, 2048, 4096.]

[Figure 2: MPI+OpenMP. Panels (a), (b) and (c) plot time_per_iter vs #processors; panel (c) is for #threads 6; legend: common matrix dimension 128, 256, 512, 1024, 2048, 4096.]
[Figure 3: MPI+GPU. Panels (a) block size 2, (b) block size 4 and (c) block size 16 plot time_per_iter vs #processors; legend: common matrix dimension 128, 256, 512, 1024, 2048, 4096.]