PAC3 MS Cuda.v01

COMPUTACIÓ D'ALTES PRESTACIONS
PAC 3
Semester: September 2015
The objectives and goals
The exercises
Evaluation criteria
Formatting
Deadline
Presentation and goals

Objectives
The main goal of this exercise is to take a deeper dive into the basic concepts of parallel
programming. To that end, two options are provided. The first one is more theoretical
and is aimed at students who do not have a strong computer science background (e.g.,
programming, Linux, etc.). In this option the student proposes a parallelization of a
specific scientific problem. The second one is more practical and is aimed at
students who have programming and systems skills. In this second option the student
implements a parallel application using CUDA.
What is expected
Students doing the first option have to deliver: 1) a description and characterization of the
algorithm proposed for parallelization; 2) a pseudo-code parallelization of the proposed algorithm
and a comparison with the sequential implementation; 3) an estimate of the potential
speedup that the parallel application may show with respect to the serial implementation.
Students doing the second option have to deliver: 1) the sources of the developed code; 2)
one document with the answers to the proposed questions.
For both options the maximum length of the document is 4 pages.
The environment and resources
For the second option it is recommended to use Ocelot to validate the code that the student
develops. More details are provided below. The student will need to use their own installation on
Amazon Web Services.
It is important to emphasize that students are expected to carry out their own
exploration and research to find solutions to compilation and programming problems.
Part of the evaluation will consider the student's ability to understand, address and fix the issues
and problems that show up during the exercise. That is what happens in real life. It is also
expected that active discussion in the forum will help to address them.
Computació d'Altes Prestacions, 2015

First option: Theoretical parallelization

1. The serial algorithm
The goal of this first part is to understand the nature of the Smith-Waterman algorithm or of the
Fast Fourier Transform (FFT): what it solves, what inputs it needs, what it generates, its computational
cost and which parts of it can potentially be parallelized. The student needs to select one of the
proposed algorithms. Smith-Waterman is more complex than the FFT. Therefore, if it is
the one selected, this will be taken into account in the evaluation process.
1.1 Describe what the selected algorithm does (include the references that you
have used): inputs, outputs and pseudo-code describing the algorithm. The pseudo-code
must contain comments on what each important part of it is doing.
1.2 Describe which parts of the algorithm can potentially be parallelized.
2. Parallel implementation
The goal of this second part is to propose a pseudo-code parallel implementation of the parts
identified in 1.2.
2.1 Describe in pseudo-code a potential parallel implementation of this algorithm:
2.1.1 What strategy have you selected (e.g., pipeline, shared memory, message passing, etc.)?
Why?
2.1.2 What other options could you use?
2.1.3 Describe the pseudo-code, including comments explaining why the different parts were
selected for parallelization.
3. Performance projection
3.1 Given the pseudo-code proposed in 2.1 and 1.1, propose a theoretical model to project
the speedup that the parallel implementation may show with respect to the serial
implementation. (It is a model, so it is not expected to be 100% accurate.)
3.2 Provide a speedup analysis using the previous model for 1, 2, 4, 16 and 32 threads.
3.3 Describe what type of computational system would be best suited to the
proposed implementation and which components would be most worth investing in.
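As an illustration of the kind of model asked for in 3.1 and 3.2, the sketch below uses Amdahl's law, which is only one possible choice of model. The function names amdahl_speedup and print_projection are hypothetical, and the parallel fraction p is an input that each student would have to estimate for the selected algorithm.

```c
#include <assert.h>
#include <stdio.h>

/* Amdahl's law: projected speedup on n threads when a fraction p of the
   serial run time is perfectly parallelizable. p is an assumed input. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Print the projection for the thread counts listed in 3.2. */
void print_projection(double p)
{
    const int threads[] = {1, 2, 4, 16, 32};
    for (int i = 0; i < 5; i++)
        printf("%2d threads: %.2fx\n", threads[i], amdahl_speedup(p, threads[i]));
}
```

For example, assuming p = 0.9, the model projects a speedup of 6.40 on 16 threads and about 7.80 on 32 threads, which shows how the serial fraction limits the gain. The model ignores communication and synchronization costs, so a real projection would probably need extra terms.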

Second option: CUDA

1. CUDA basics
The first part of the exercise is devoted to understanding basic CUDA code.
The typical hello world (file hello.cu):
#include <stdio.h>
#include <stdlib.h>

const int N = 16;
const int blocksize = 16;

/* Each thread adds one offset from b to one character of a. */
__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    /* Allocate the device buffers and copy the inputs to the device. */
    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    /* Launch a single block of blocksize threads. */
    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);

    /* Copy the result back to the host and release the device memory. */
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}
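To see what the kernel computes without a GPU, the per-thread work can be emulated on the host in plain C. This is only a sketch: the function name hello_on_host is hypothetical, and the loop index i stands in for threadIdx.x, so the iterations run one after another instead of in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation of the hello kernel: iteration i does the work
   that the thread with threadIdx.x == i does on the device. */
void hello_on_host(char *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];   /* shift character i by offset b[i] */
}
```

Applied to the arrays a and b from hello.cu, the offsets turn "Hello " into "World!" ('H' + 15 is 'W', 'e' + 10 is 'o', and so on), so the program as a whole prints "Hello World!".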
Questions
1.1 How does the Hello World in CUDA work? What is the role of the GPU? Explain how data is
transferred between the host and the device.
2. Mandelbrot
We want to implement a parallel version of a very popular program called Mandelbrot. The
Mandelbrot set results in a geometric figure of infinite complexity (a fractal)
obtained through a mathematical formula and a recursive algorithm. For each point c, the
algorithm counts iterations as follows:
int count, max;
complex z;
float temp, lengthsq;
max = 256;
z.real = 0; z.imag = 0;
count = 0; /* number of iterations */
do {
    temp = z.real * z.real - z.imag * z.imag + c.real;
    z.imag = 2 * z.real * z.imag + c.imag;
    z.real = temp;
    lengthsq = z.real * z.real + z.imag * z.imag;
    count++;
} while ((lengthsq < 4.0) && (count < max));
return count;
The left part of the following figure shows the output that the algorithm generates:
[Figure: Mandelbrot set]
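The iteration above can be wrapped into a self-contained C function and applied to every pixel of an image, which makes it visible that each pixel is computed independently of the others. This is a sketch and not the provided mandelbrot.c: the names complex_t, mandel_count and render are hypothetical, and the mapping of pixels to the square from -2 to 2 on both axes is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the handout's complex type. */
typedef struct { float real, imag; } complex_t;

/* Iterate z = z*z + c from z = 0 and return the number of iterations
   done until |z|^2 >= 4, capped at max_iter. */
int mandel_count(complex_t c, int max_iter)
{
    assert(max_iter >= 1);
    complex_t z = {0.0f, 0.0f};
    float temp, lengthsq;
    int count = 0;
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;
        z.imag = 2.0f * z.real * z.imag + c.imag;
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;
        count++;
    } while (lengthsq < 4.0f && count < max_iter);
    return count;
}

/* Fill counts[row * width + col] for every pixel. The region
   [-2, 2] x [-2, 2] is an assumed mapping. No iteration of these
   loops depends on another, so each pixel could go to its own thread. */
void render(int *counts, int width, int height, int max_iter)
{
    assert(counts != NULL && width > 1 && height > 1);
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            complex_t c;
            c.real = -2.0f + 4.0f * (float)col / (float)(width - 1);
            c.imag = -2.0f + 4.0f * (float)row / (float)(height - 1);
            counts[row * width + col] = mandel_count(c, max_iter);
        }
    }
}
```

Points that never escape, such as c = 0, reach the cap max_iter, while points far from the origin escape after a single iteration. That uneven cost per pixel is worth keeping in mind when deciding how to distribute the work.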
The code provided as part of the exercise (mandelbrot.c) is a sequential implementation of the
Mandelbrot set that produces the figure shown in the right part of the previous figure. The provided code
will be used as our reference. To link the code it is necessary to include the libraries libm
and libX11. One way of compiling the code is:
gcc -I/usr/include/X11 -omandelbrot mandelbrot.c -L/usr/lib -lX11 -lm
(It is important to emphasize that the paths depend on the installation that you are
using. The provided command line works on the cluster provided by UOC; the compilation may
show some warnings.)
It is important to note that, to visualize the output, you need a connection that
allows X forwarding (for example, if connecting through a Linux client: ssh -X username@host).
In order to compile the CUDA version of mandelbrot (mandelbrot.cu) with Ocelot (more details
below) you will need to use the following instructions:
nvcc -cuda mandelbrot.cu -I /usr/local/include/ocelot/api/interface/ -arch=sm_20 -lX11 -lm
g++ -o mandelbrot mandelbrot.cu.cpp.ii -I /usr/local/include/ocelot/api/interface/ -L /usr/local/lib/ -locelot -L/usr/local/lib -lpthread -ldl -lm

Questions
2.1. Provide and explain the implementation of mandelbrot using CUDA.
2.2. Who displays the figure on the screen? The GPU? Why?
2.3. Which options have you selected (block size, stream distribution, etc.)? Why?
What other alternatives could you consider?
2.4. Propose and carry out a performance analysis of the CUDA implementation and
compare it with the serial implementation. Evaluation criteria:
2.4.1. The experiment proposal
2.4.2. The experiment results and analysis
2.5. How would you execute mandelbrot using several GPUs in parallel with CUDA?
Provide a scheme / pseudo-code showing your proposal.
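When reasoning about the block size in question 2.3, it can help to work out on the host how many blocks a given size implies and which element each thread would handle. The sketch below does this arithmetic in plain C. The names blocks_needed and global_index are hypothetical, and the one-dimensional layout is an assumption, not a requirement of the exercise.

```c
#include <assert.h>

/* Number of blocks needed so that blocks * block_size >= n
   (ceiling division). */
int blocks_needed(int n, int block_size)
{
    assert(n > 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}

/* Host-side version of the usual CUDA index expression
   blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    assert(block_idx >= 0 && block_dim > 0);
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}
```

For instance, 801 elements with blocks of 16 threads need 51 blocks, and the last block then has threads whose global index is 801 or more. A kernel using this layout would have to check the index against the element count before writing.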
How to use Ocelot

Ocelot is an environment that lets you execute CUDA programs on both GPUs and x86 processors.
More information can be found at:
http://code.google.com/p/gpuocelot/
In order to carry out this second option, the student will need to use Amazon Web Services to
install and test Ocelot. Using Amazon Web Services was already covered at the beginning of
the course.
The following instructions detail the basic commands and steps that you need to follow in order to
compile the Hello World program:
1. Source (file hello.cu): use the hello.cu listing from section 1.
2. Compilation with CUDA (note that the include and linking directories will depend on the
Ocelot installation):
nvcc -cuda hello.cu -I /usr/local/include/ocelot/api/interface/ -arch=sm_20
3. The previous step will generate a file hello.cu.cpp.ii that you will need to compile with g++
in order to obtain a binary that can be executed:
g++ -o hello hello.cu.cpp.ii -I /usr/local/include/ocelot/api/interface/ -L /usr/local/lib/ -locelot -L/usr/local/lib -lpthread -ldl -lm
4. Execute:
./hello
Note that the .bashrc file can be modified in order to point to the required binaries:
export PATH=$PATH:/usr/local/cuda/bin/:/usr/local/cuda/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
Instructions to compile the code with a GPU (no support provided)

Information about how to install and use the SDK for CUDA or OpenCL (for NVIDIA or ATI
GPUs, respectively) can be found at http://developer.nvidia.com/cuda-downloads and
http://www.khronos.org/
Evaluation criteria
Criteria that will be used in the evaluation: proper use of the MPI or OpenMP models, brevity
and clarity of results, experiment setup, and discussion and analysis.
Format
One PDF document containing all the answers for the selected option:
- The answers to the formulation questions (must not exceed 4 pages).
- All the developed code or scripts, added as an annex section at the end
of the document (no limit).
Provide one tar file with the developed code (if any):
$ tar cvf tot.tar fitxer1 fitxer2 ...
Deadline
