
Default Term Project Report

Soumitry J. Ray
April 30, 2012

3a) Explain your implementation (5000 words or less).

MPI only:
The implementation details are as below:

1. Communicators: All the allocated processors were organized into a logical rectangular mesh of r rows and c columns. Row communicators (row_comm) and column communicators (col_comm) were then constructed out of these processors using MPI_Comm_split. All processors on a row belonged to one row_comm, and similarly for the other rows; col_comm carries the analogous meaning for the columns of the mesh. The code snippet in Algorithm 1 below shows how the communicators were constructed. The variables proc_x and proc_y are the coordinates of each processor in MPI_COMM_WORLD.

Algorithm 1 row_comm and col_comm communicators.

MPI_Comm row_comm, col_comm;
MPI_Comm_split(MPI_COMM_WORLD, proc_x, proc_y, &row_comm);
MPI_Comm_split(MPI_COMM_WORLD, proc_y, proc_x, &col_comm);
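The report does not show how proc_x and proc_y are obtained. A minimal sketch, assuming a row-major assignment of MPI ranks to the procGridX x procGridY process grid (the mapping itself is an assumption; only the procGridX and procGridY names are taken from the later listings):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Assumed row-major rank-to-grid mapping (not stated in the report). */
int proc_x = rank / procGridY;   /* row coordinate,    0 .. procGridX-1 */
int proc_y = rank % procGridY;   /* column coordinate, 0 .. procGridY-1 */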

2. Tiling: The matrices A (M x K) and B (K x N) are then divided into r x c tiles and each processor received one unique tile. Let the size of the tiles of A, denoted by Ablock, be ar x ac and the size of the tiles of B, denoted by Bblock, be br x bc. In the implementation, ar is represented by nrow_Ablock, ac by ncol_Ablock, br by nrow_Bblock and bc by ncol_Bblock.
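The tile dimensions follow directly from these definitions. A minimal sketch, assuming M, K and N are exact multiples of the process-grid dimensions (M, K and N here are assumed names for the global matrix sizes; procGridX and procGridY are the grid sizes used in the later listings):

/* Tiles of A are a_r x a_c = (M/r) x (K/c); tiles of B are b_r x b_c = (K/r) x (N/c). */
int nrow_Ablock = M / procGridX;   /* a_r */
int ncol_Ablock = K / procGridY;   /* a_c */
int nrow_Bblock = K / procGridX;   /* b_r */
int ncol_Bblock = N / procGridY;   /* b_c */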

3. Buffers: For each processor two storage buffers were declared: a) row_storage and b) col_storage. row_storage is buffer space of dimension ar x K that stored all the received panels of tiles of A from row_comm. Similarly, col_storage, of dimension K x bc, stored all the received panels of tiles of B from col_comm. In addition, each processor copied its own tiles of A and B to the appropriate locations in the row_storage and col_storage buffers respectively. Furthermore, row_storage is column ordered whereas col_storage is row ordered. Algorithm 2 shows a processor copying its own A tile to its row_storage buffer and Algorithm 3 shows it copying its own B tile to its col_storage buffer.

Algorithm 2 Each processor copies its own A tile to row_storage


for (c = 0; c < ncol_Ablock; c++) {        // iterate over the columns of the A tile
    for (r = 0; r < nrow_Ablock; r++) {    // iterate over the rows of the A tile
        int index_A = (c * nrow_Ablock) + r;
        int index_buf = proc_y * ncol_Ablock * nrow_Ablock + (c * nrow_Ablock) + r;
        row_storage[index_buf] = Ablock[index_A];
    }
}

Algorithm 3 Each processor copies its own B tile to col_storage


int index_buf = proc_x * ncol_Bblock * nrow_Bblock;
for (r = 0; r < nrow_Bblock; r++) {
    for (c = 0; c < ncol_Bblock; c++) {
        int index_B = (c * nrow_Bblock) + r;
        col_storage[index_buf] = Bblock[index_B];
        index_buf++;
    }
}

4. Communication: Each processor then broadcasted its own A and B tiles, Ablock and Bblock respectively, from its row_storage and col_storage buffers, within its row_comm and col_comm. The broadcasting is done in the form of panels. Let the panel size be pb (in the implementation pb is denoted by the variable pb). For broadcasts within a row_comm, each panel is a column vector (of dimension ar x pb) whereas for broadcasts within a col_comm the panel is a row vector (of dimension pb x bc). The total number of panels to be broadcast by a processor is given by nPanels. Algorithm 4 shows the broadcast along the row_comm and Algorithm 5 shows the broadcast along the col_comm communicators.

Algorithm 4 Each processor broadcasts its Ablock on its row_comm

int nPanels = ncol_Ablock / pb;
int row_id, panel;
for (row_id = 0; row_id < procGridY; row_id++) {
    for (panel = 0; panel < nPanels; panel++)
        MPI_Bcast(&row_storage[row_id * ncol_Ablock * nrow_Ablock + panel * pb * nrow_Ablock],
                  nrow_Ablock * pb, MPI_DOUBLE, row_id, row_comm);
}

Algorithm 5 Each processor broadcasts its own B tile on its col_comm


nPanels = nrow_Bblock / pb;
int col_id, panel;
for (col_id = 0; col_id < procGridX; col_id++) {
    for (panel = 0; panel < nPanels; panel++)
        MPI_Bcast(&col_storage[col_id * ncol_Bblock * nrow_Bblock + panel * pb * ncol_Bblock],
                  ncol_Bblock * pb, MPI_DOUBLE, col_id, col_comm);
}
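Reading the two loop nests of Algorithms 4 and 5 directly, the number of MPI_Bcast calls issued per processor per iteration is (a restatement of the loop bounds above, not an additional result from the report):

\[
N_{\mathrm{bcast}} \;=\; \underbrace{\texttt{procGridY} \cdot \frac{a_c}{p_b}}_{\text{on row\_comm}} \;+\; \underbrace{\texttt{procGridX} \cdot \frac{b_r}{p_b}}_{\text{on col\_comm}},
\]

with each row_comm broadcast carrying ar * pb doubles and each col_comm broadcast carrying pb * bc doubles.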

5. Computation: After all the broadcasts are finished, each node computes its respective C tile (Cblock) using its row_storage and col_storage buffers. Each Cblock (ar x bc) is stored in column ordered format. Multiplication is performed in rank-1 update format, that is, Cblock is formed by accumulating the sum of outer products of the k_iter-th column of row_storage with the k_iter-th row of col_storage, where 1 <= k_iter <= K. Algorithm 6 below shows how Cblock is computed by each processor.

Algorithm 6 Each processor computes its Cblock by accumulating the sum of outer products on columns of row_storage and rows of col_storage.

int k_iter;
for (k_iter = 0; k_iter < k; k_iter++) {
    int Cblock_ind = 0;
    for (c = 0; c < ncol_Bblock; c++) {
        int col_storage_ind = (k_iter * ncol_Bblock) + c;
        for (r = 0; r < nrow_Ablock; r++) {
            int row_storage_ind = (k_iter * nrow_Ablock) + r;
            Cblock[Cblock_ind] += row_storage[row_storage_ind] * col_storage[col_storage_ind];
            Cblock_ind++;
        }
    }
}
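In matrix notation, writing R for the ar x K panel held in row_storage and S for the K x bc panel held in col_storage (R and S are shorthand introduced here, not names from the report), the loop above accumulates

\[
C_{\text{block}} \;=\; \sum_{k=1}^{K} R(:,k)\, S(k,:) \;=\; R\,S .
\]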

MPI+OpenMP:
The MPI+OpenMP implementation differs from the MPI (only) implementation in Step 5, the Computation step. Below, only the Computation step for the MPI+OpenMP implementation has been discussed.

5. Computation: To parallelize the computation of Cblock, the number of threads (nthreads) was set to 2, 4 and 6. The matrix multiplication was performed in dot-product fashion, where each element of Cblock was obtained as the dot product of a row of row_storage and a column of col_storage. Algorithm 7 below shows the implementation. The outer two loops of the matrix multiplication were collapsed. The resulting matrix Cblock is in column ordered format.

Algorithm 7 Parallelized matrix multiplication with OpenMP


int k_iter, Cblock_ind, col_storage_ind, row_storage_ind;
int nthreads = 2;
omp_set_num_threads(nthreads);
#pragma omp parallel for private(k_iter, c, r, Cblock_ind, col_storage_ind, row_storage_ind) schedule(static) collapse(2)
for (r = 0; r < nrow_Ablock; r++) {
    for (c = 0; c < ncol_Bblock; c++) {
        Cblock_ind = (c * nrow_Ablock) + r;
        for (k_iter = 0; k_iter < k; k_iter++) {
            row_storage_ind = (k_iter * nrow_Ablock) + r;
            col_storage_ind = (k_iter * ncol_Bblock) + c;
            Cblock[Cblock_ind] += row_storage[row_storage_ind] * col_storage[col_storage_ind];
        }
    }
}
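Element-wise, the collapsed loops compute the same product as Algorithm 6, but in dot-product order (with R = row_storage and S = col_storage as before):

\[
C_{\text{block}}(r,c) \;=\; \sum_{k=1}^{K} R(r,k)\, S(k,c), \qquad 1 \le r \le a_r,\; 1 \le c \le b_c .
\]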

MPI+GPU:
The MPI+GPU implementation differs from the MPI (only) implementation in Step 5, the Computation step. Below, only the Computation step for the MPI+GPU implementation has been discussed.

5. Computation: Prior to performing the multiplication on the GPU it is necessary to copy the multiplicand matrices to the device. The multiplicand matrices row_storage and col_storage on the host are in double format and need to be copied to the device as float. Therefore, the two multiplicand matrices row_storage and col_storage are first copied to the row_storage_row_order_float and col_storage_float matrices on the host.

Algorithm 8 Host-to-device copies, kernel invocation and copying of the result back to Cblock
int ind_row_order = 0;
for (r = 0; r < nrow_Ablock; r++) {
    for (c = 0; c < k; c++) {
        int ind_col_order = (c * nrow_Ablock) + r;
        row_storage_row_order_float[ind_row_order] = (float)row_storage[ind_col_order];
        ind_row_order++;
    }
}

int ind = 0;
for (ind = 0; ind < ncol_Bblock * k; ind++)
    col_storage_float[ind] = (float)col_storage[ind];

float *dev_Arow_block, *dev_Bcol_block, *dev_Cblock;
cudaMalloc((void**)&dev_Arow_block, nrow_Ablock * k * sizeof(float));
cudaMalloc((void**)&dev_Bcol_block, ncol_Bblock * k * sizeof(float));
cudaMalloc((void**)&dev_Cblock, nrow_Ablock * ncol_Bblock * sizeof(float));

cudaMemcpy(dev_Arow_block, row_storage_row_order_float, nrow_Ablock * k * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_Bcol_block, col_storage_float, ncol_Bblock * k * sizeof(float), cudaMemcpyHostToDevice);

MatrixMulKernel_Wrapper(dev_Arow_block, dev_Bcol_block, dev_Cblock, nrow_Ablock, k, ncol_Bblock);

float *buffer = NULL;
buffer = (float*)malloc(sizeof(float) * nrow_Ablock * ncol_Bblock);
assert(buffer != NULL);
cudaMemcpy(buffer, dev_Cblock, nrow_Ablock * ncol_Bblock * sizeof(float), cudaMemcpyDeviceToHost);

ind = 0;
// Now Cblock is in column-ordered format
for (ind = 0; ind < nrow_Ablock * ncol_Bblock; ind++)
    Cblock[ind] = (double)buffer[ind];

Device buffers are allocated with cudaMalloc, and the row_storage_row_order_float and col_storage_float matrices are copied to dev_Arow_block and dev_Bcol_block with cudaMemcpy; the kernel on the device is then invoked. Finally, buffer is copied to Cblock. Algorithm 8 illustrates this process.

The product matrix dev_Cblock is in float format, therefore it is copied to buffer as a matrix of floats. In the kernel implementation, BLOCK_SIZE x BLOCK_SIZE sub-blocks of A and B are collaboratively loaded into shared memory on the GPU. Each C sub-matrix, Csub, is computed accumulatively and finally written to the matrix C. Algorithm 9 below shows the kernel implementation. (The kernel implementation is a modification of the matrix multiplication implementation in the NVIDIA CUDA Handbook.)

Algorithm 9 Computing Cblock using collaborative loading of blocks.

#define BLOCK_SIZE 2

__global__ void MatrixMulKernel(float* A, float* B, float* C, int hA, int wA, int wB)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = BLOCK_SIZE;
    int bBegin = BLOCK_SIZE * bx;
    int bStep  = BLOCK_SIZE * wB;

    float Csub = 0;
    int a, b;
    for (a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        __syncthreads();
        int k;
        for (k = 0; k < BLOCK_SIZE; k++)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    // Store the result in column-ordered format (hA is the leading dimension).
    int c = BLOCK_SIZE * by + BLOCK_SIZE * hA * bx;
    C[c + hA * tx + ty] = Csub;
}

void MatrixMulKernel_Wrapper(float *a, float *b, float *c, int hA, int wA, int wB)
{
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(a, b, c, hA, wA, wB);
}
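As written, the wrapper's grid computation and the kernel's traversal of A assume that hA, wA and wB are exact multiples of BLOCK_SIZE (otherwise the integer divisions in dimGrid drop the remainder and the shared-memory loads run past the tile edges). A minimal guard reflecting that reading; the _checked wrapper name is hypothetical and not part of the original code:

#include <assert.h>

/* Hypothetical guard around the wrapper in Algorithm 9. */
void MatrixMulKernel_Wrapper_checked(float *a, float *b, float *c, int hA, int wA, int wB)
{
    assert(hA % BLOCK_SIZE == 0);   /* full tiles along the rows of A and C    */
    assert(wA % BLOCK_SIZE == 0);   /* full tiles along the shared K dimension */
    assert(wB % BLOCK_SIZE == 0);   /* full tiles along the columns of B and C */
    MatrixMulKernel_Wrapper(a, b, c, hA, wA, wB);
}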

3b)

General implementation notes:

The performance of the three implementations for different panel sizes (pb = 1, 4, 16) was the same. Therefore, to reduce the job time on the cluster, pb = 1 was used. The #iterations was also reduced to 5 from the default value of 25. The #nodes used was 4.

Figures:
The figures below show time_per_iter vs. #processors for the three implementations. The matrices were assumed to be square, and the legend in all the figures shows the common dimension of the matrices (A, B and C).

[Plot: time_per_iter vs. #processors for MPI (only); legend: matrix dimension 128, 256, 512, 1024, 2048, 4096]

Figure 1: MPI (only)

[Plots: time_per_iter vs. #processors for MPI+OpenMP with (a) 2 threads, (b) 4 threads and (c) 6 threads; legend: matrix dimension 128, 256, 512, 1024, 2048, 4096]

Figure 2: MPI+OpenMP

[Plots: time_per_iter vs. #processors for MPI+CUDA with block size (a) 2, (b) 4 and (c) 16; legend: matrix dimension 128, 256, 512, 1024, 2048, 4096]

Figure 3: MPI+GPU

i) Fastest in terms of time to solution.


Comparing Figures 1, 2 and 3, it can be seen that MPI+GPU is the fastest in terms of time to solution, and the time to solution is minimal when the block size is 16.

ii) Most scalable.


Similarly, looking at Figures 1, 2 and 3, it can be noted that Figure 3a has the steepest slope for a matrix dimension of 4096, while for all other dimensions it has an approximately flat profile. Therefore, MPI+GPU with block size 2 appears to be the most scalable; however, it is slower in terms of time to solution than MPI+GPU with block size 16. For dimensions less than 4096, MPI+OpenMP (2 threads) also appears to scale, but it is slower than all MPI+GPU configurations.

iv) Approximate lines of code.


MPI: 110; MPI+OpenMP: 116; MPI+GPU: 210

v) Most complex task in implementing your term project.


The most difficult part was implementing the MPI (only) version; in particular, implementing the panel broadcast. Second in line was the MPI+GPU implementation, specifically the collaborative loading of matrices.

vi) Most surprising result in your performance analysis across implementations.


The most surprising result in the performance analysis came from the MPI+GPU implementation. Figures 3b and 3c show that MPI+GPU with block sizes 4 and 16 is not scalable, even though it is the fastest in terms of time to solution. One possible reason is that MPI communication is significantly dominant. Furthermore, the MPI+OpenMP implementation performs worse than the MPI (only) implementation. Since the outer two loops were collapsed in MPI+OpenMP, this suggests that collapsing the outer two loops is probably not the optimal strategy for MPI+OpenMP.
