COMPUTACIÓ D'ALTES PRESTACIONS

PAC 3
September 2015 semester

The objectives and goals

The exercises

Evaluation criteria

Formatting

Deadline


Presentation and goals


Objectives
The main goal of this exercise is to take a deeper dive into the basic concepts of parallel
programming. To do that, two different options are provided. The first one is more theoretical
and is oriented to students who do not have a strong computer science background (i.e.
programming, Linux, etc.). In this option the student proposes a parallelization of a
specific scientific problem. The second one is more practical and is oriented to
students who have programming and system skills. In this second option the student
implements a parallel application using CUDA.
What is expected
Students doing the first option have to deliver: 1) a description and characterization of the
algorithm chosen for parallelization; 2) a pseudo-code parallelization of that algorithm
and a comparison with the sequential implementation; 3) an estimate of the potential
speedup that the parallel application may show with respect to the serial implementation.
Students doing the second option have to deliver: 1) the sources of the developed code; 2)
one document with the answers to the proposed questions.
For both options the maximum length of the document is 4 pages.
The environment and resources
For the second option it is recommended to use Ocelot to validate the code that the student
develops. More details are provided below. The student will need to use their own installation
on Amazon Web Services.
It is important to emphasize that the student is expected to carry out their own
investigation and research to find solutions to compilation and programming problems.
Part of the evaluation will consider the student's ability to understand, address and fix the
issues and problems that show up during the exercise. That is what happens in real life. It is
also expected that active discussion in the forum will help to address them.

Computació d'Altes Prestacions, 2015


First option: Theoretical parallelization


1. The serial algorithm
The goal of this first part is to understand the nature of the Smith-Waterman algorithm or of the
Fast Fourier Transform (FFT): what it solves, what inputs it needs, what it generates, its
computational cost and which parts of it can potentially be parallelized. The student needs to
select one of the proposed algorithms. Smith-Waterman is more complex than the FFT. Therefore, if
it is the one selected, this will be taken into account in the evaluation process.
1.1 Describe what the selected algorithm does (include the references that you
have used): inputs, outputs and pseudo-code describing the algorithm. The pseudo-code
must contain comments on what each important part of it is doing.

1.2 Describe which parts of the algorithm can potentially be parallelized.

2. Parallel implementation
The goal of this second part is to propose a pseudo-code parallel implementation of the parts
identified in 1.2.
2.1 Describe in pseudo-code a potential parallel implementation of this algorithm:

2.1.1 What strategy have you selected (e.g. pipeline, shared memory, message passing, etc.)?
Why?
2.1.2 What other options could you use?
2.1.3 Describe the pseudo-code, including comments on why the different parts were selected
for parallelization.
3. Performance projection
3.1 Given the pseudo-code proposed in 1.1 and 2.1, propose a theoretical model to project
the speedup that the parallel implementation may show with respect to the serial
implementation. (It is a model, so it is not expected to be 100% accurate.)

3.2 Provide a speedup analysis using the previous model for 1, 2, 4, 16 and 32 threads.

3.3 Describe what type of computational system would be best suited to the
proposed implementation and which components would be most worth investing in.


Second option: CUDA


1. CUDA basics
The first part of the exercise is devoted to understanding basic CUDA code.
The typical hello world (file hello.cu):

#include <stdio.h>
#include <stdlib.h>   /* EXIT_SUCCESS */

const int N = 16;
const int blocksize = 16;

/* Kernel: each thread adds one offset from b to one character of a. */
__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    char *ad;   /* device copy of a */
    int *bd;    /* device copy of b */
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    /* Allocate device memory and copy the inputs from host to device. */
    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    /* Launch one block of blocksize threads. */
    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);

    /* Copy the result back to the host and release device memory. */
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}
Questions
1.1 How does the Hello World in CUDA work? What is the role of the GPU? Explain how the transfer
is done between the host and the device.

2. Mandelbrot

We want to implement a parallel version of a very popular program called Mandelbrot. The
Mandelbrot set results in a geometric figure of infinite complexity (a fractal)
obtained through a mathematical formula and a recursive algorithm. The left part of the
following figure shows the output that the algorithm generates:

[Figure: Mandelbrot set]

The iteration count for a point c is computed by the following fragment:

int count, max;
complex z;
float temp, lengthsq;
max = 256;
z.real = 0; z.imag = 0;
count = 0; /* number of iterations */
do {
    temp = z.real * z.real - z.imag * z.imag + c.real;
    z.imag = 2 * z.real * z.imag + c.imag;
    z.real = temp;
    lengthsq = z.real * z.real + z.imag * z.imag;
    count++;
} while ((lengthsq < 4.0) && (count < max));
return count;


The code provided as part of the exercise (mandelbrot.c) is a sequential implementation of the
Mandelbrot set that produces the figure shown in the right part of the previous figure. The
provided code will be used as our reference. To link the code it is necessary to include the
libraries libm and libX11. One way of compiling the code is:

gcc -I/usr/include/X11 -o mandelbrot mandelbrot.c -L/usr/lib -lX11 -lm

(It is important to emphasize that the paths depend on the installation that you are
using. The command line above works on the cluster provided by UOC; the compilation may
show some warnings.)
Note that to visualize the output you will need a connection that
allows X forwarding (for example, when connecting from a Linux client: ssh -X username@host).
In order to compile the CUDA version of Mandelbrot (mandelbrot.cu) with Ocelot (more details
below) you will need to use the following instructions:

nvcc -cuda mandelbrot.cu -I /usr/local/include/ocelot/api/interface/ -arch=sm_20

g++ -o mandelbrot mandelbrot.cu.cpp.ii -I /usr/local/include/ocelot/api/interface/ -L /usr/local/lib/ -locelot -L/usr/local/lib -lpthread -ldl -lX11 -lm


Questions
2.1. Provide and explain the implementation of Mandelbrot using CUDA.

2.2. Who displays the figure on the screen? The GPU? Why?

2.3. What options have you selected (block size, stream distribution, etc.)? Why?
What other alternatives could you consider?

2.4. Propose and carry out a performance analysis of the CUDA implementation and
compare it with the serial implementation. Evaluation criteria:
2.4.1. The experiment proposal
2.4.2. The experiment results and analysis

2.5. How would you execute Mandelbrot on several GPUs in parallel with CUDA?
Provide a scheme / pseudo-code showing your proposal.

How to use Ocelot


Ocelot is an environment that lets you execute CUDA programs on either GPUs or x86 processors.
More information can be found at:
http://code.google.com/p/gpuocelot/
In order to carry out this second option, the student will need to use Amazon Web Services to
install and test Ocelot. Using Amazon Web Services was already covered at the beginning of
the course.
The following instructions detail the basic commands and steps that you need to follow in order
to compile the Hello World program:
1. Source (file hello.cu).

2. Compilation with CUDA (note that the include and linking directories will depend on the
Ocelot installation):
nvcc -cuda hello.cu -I /usr/local/include/ocelot/api/interface/ -arch=sm_20

3. The previous step will generate a file hello.cu.cpp.ii that you will need to compile with g++
in order to obtain a binary that can be executed:
g++ -o hello hello.cu.cpp.ii -I /usr/local/include/ocelot/api/interface/ -L /usr/local/lib/ -locelot -L/usr/local/lib -lpthread -ldl -lm

4. Execute:
./hello
Note that the .bashrc file can be modified so that it points to the required binaries:
export PATH=$PATH:/usr/local/cuda/bin/:/usr/local/cuda/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Instructions to compile the code with GPU (no support provided)


Information about how to install and use the SDK for CUDA or OpenCL (for NVIDIA or ATI GPUs,
respectively) can be found at http://developer.nvidia.com/cuda-downloads and
http://www.khronos.org/

Evaluation criteria
Criteria that will be used in the evaluation: proper utilization of MPI or OpenMP models, brevity
and clear results, experiment setup and discussion and analysis.

Format
One PDF document containing all the answers for the selected option, consisting of:
- The answers to the formulated questions (must not exceed 4 pages).
- All the code or scripts developed, added as an annex section at the end
of the document (no limit).

Provide one tar file with the developed code (if any):
$ tar cvf tot.tar fitxer1 fitxer2 ...

Deadline
