
PERFORMANCE ANALYSIS OF PARALLELISED APPLICATIONS ON MULTICORE PROCESSORS

Vishnu.G (CB107CS070), Baskar.R (CB107CS110), Avinash Kumar.M.K (CB107CS209)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
AMRITA VISHWA VIDYAPEETHAM
ETTIMADAI, COIMBATORE - 641105
ggvishnu29@gmail.com
ABSTRACT:

This project is aimed at analysing and comparing the performance of selected applications which are parallelised and made to run on a multicore platform. Tools such as the VTune performance analyser, the Intel C++ compiler and the OpenMP concurrency platform are used for collecting thread data and program execution time. Using the data collected from these tools, a graph is drawn showing the speed-up attained when the applications are parallelised.

Introduction:

The digital age paves the way for undeterred growth in all fields of science, especially in the field of microprocessors, as the number of transistors on a microprocessor chip has grown exponentially. All of this was in accordance with Moore's law, which proposed that the number of transistors on Intel's chips would double roughly every eighteen months; as such, the number of transistors on each chip reached millions and there was no stopping this colossal growth. The clock frequency of operation rose from a few megahertz to a few gigahertz. Later, it became evident that any further over-clocking of the microprocessor would result in the chip melting due to the intense heat generated during its operation. The frequency of operation could not be increased further without excessive heat generation. This paved the way for multi-core processors to come to the fore. A multi-core processor consists of multiple cores, or multiple CPUs, on one chip. A multi-core processor is quite different from separate processors working together, where each processor has its own dedicated memory, cache and other hardware; in a multi-core processor, the multiple execution cores are placed on the same chip and resources such as cache and memory are common to all the execution cores. When we run two applications simultaneously on a single-core CPU, what actually happens is that the CPU time is shared between the two applications, with the CPU time slice for each application being in the order of microseconds. In a multi-core system with, say, n cores, it is feasible to run n applications simultaneously. This paper is aimed at envisaging the need for parallelising present-day software for better adaptation to multi-core architectures.

Our project mainly deals with analysing the performance of applications which are parallelised using a concurrency platform. We used various tools provided by Intel for testing and analysing the applications. They are as follows:

1. Intel C++ compiler
2. OpenMP concurrency platform
3. VTune performance analyzer



Intel C++ compiler:

This is a C++ compiler with support for all OpenMP directives (pragmas).
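By way of illustration (this command is not part of the original paper), OpenMP support in the Intel compiler is enabled with a command-line switch whose spelling depends on the compiler version:

    # Hypothetical build command: older Intel compilers used -openmp,
    # newer ones use -qopenmp (or /Qopenmp on Windows).
    icpc -openmp matrix_mul.cpp -o matrix_mul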

OpenMP concurrency platform:

OpenMP stands for Open Multi-Processing. It is an Application Program Interface (API) that may be used to explicitly direct multithreaded, shared-memory parallelism. It comprises three primary API components (a minimal example combining them is given below):

1. Compiler Directives
2. Runtime Library Routines
3. Environment Variables
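To make the three components concrete, the following is a minimal sketch (an illustration, not code from this project) that uses a compiler directive (#pragma omp parallel for with a reduction clause), runtime library routines (omp_get_num_procs, omp_set_num_threads, omp_get_wtime), and can also be controlled through the OMP_NUM_THREADS environment variable:

    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main()
    {
        // Runtime library routines: query the core count and request one
        // thread per core. The OMP_NUM_THREADS environment variable, if set,
        // can also control the thread count.
        omp_set_num_threads(omp_get_num_procs());

        double sum = 0.0;
        double start = omp_get_wtime();

        // Compiler directive: split the loop iterations among the threads,
        // combining the per-thread partial sums with a reduction.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / i;

        double end = omp_get_wtime();
        cout << "sum: " << sum << ", run time: " << (end - start) << " s" << endl;
        return 0;
    }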
Vtune performance analyser:

This tool can be used to analyse the performance of any application in detail. The call graph wizard is used for performance analysis. The sampling wizard can be optimised by selecting the advanced performance analysis wizard.

PROGRAMS AND ANALYSIS:

We selected some applications that can be parallelised without any hazards or data dependencies. We first made a serial version of each application and calculated its execution time. Then we parallelised the applications, again calculated the execution time, and observed a significant reduction in execution time when parallelised. We used OpenMP directives to parallelise the applications. The following applications were analysed:

1. Matrix multiplication
2. Merge sort
3. Odd-even sort
4. Database search
5. Image conversion
6. Path finding

Matrix multiplication:

This application involves summing the products of elements from two matrices (A and B) and storing the value in the product matrix (C). Since the operations involved for each element in the product matrix are independent of each other, they can be parallelised easily. So, we parallelised the application by including OpenMP directives. We show the parallel version of the code and the difference in runtime using the VTune analyzer. In this application, a large number of pairs of matrices with predefined values are multiplied. Two parallel threads executing simultaneously, one on each core, are considered while generating the call graph.

Parallelization of the outer loop:

// Assumed context (not shown in the original listing): a, b and c are
// 100 x 100 integer matrices, n is the number of matrix pairs to multiply,
// and <omp.h> and <iostream> are included.
void Matrix_mul()
{
    int i, j, k, x;
    int numprocs = omp_get_num_procs();    // number of available cores
    omp_set_num_threads(numprocs);         // one thread per core
    double start = omp_get_wtime();
    for (x = 0; x < n; x++)
    {
        // Pragma directive: parallelise the outer row loop
        #pragma omp parallel for shared(a,b,c) private(i,j,k)
        for (i = 0; i < 100; i++)
        {
            for (j = 0; j < 100; j++)
            {
                c[i][j] = 0;
                for (k = 0; k < 100; k++)
                {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
    double end = omp_get_wtime();
    cout << "end:" << endl;
    cout << "run time:" << (end - start);
}



Parallelization of the inner-most loop:

void Matrix_mul()
{
    int i, j, k, x;
    double start = omp_get_wtime();
    for (x = 0; x < n; x++)
    {
        for (i = 0; i < 100; i++)
        {
            for (j = 0; j < 100; j++)
            {
                int sum = 0;
                // Pragma directive: parallelise only the inner-most loop;
                // a reduction combines the per-thread partial sums safely.
                #pragma omp parallel for reduction(+:sum)
                for (k = 0; k < 100; k++)
                {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] = sum;
            }
        }
    }
    double end = omp_get_wtime();
    cout << "end:" << endl;
    cout << "run time:" << (end - start);
}

There are two important issues which we found while parallelising the applications. They are as follows:

1. The number of threads created should be proportional to the number of processors. We used the OpenMP runtime functions to set the thread count to the number of cores available in the processor before parallelising the application.

2. The function that each thread is going to execute should be complex enough to see the performance gain. In the above matrix multiplication program, we parallelised the whole matrix multiplication operation involving two entire matrices rather than parallelising the operations for each element in the matrix. To illustrate this, we included the pragma directive on the inner-most for loop and analysed the application. We found that it takes more execution time than the serial version.

Analysis of Results:

The above application was run with the data sets mentioned in the table and the respective results were obtained.

Datasets (pairs of matrices multiplied):   10000    40000    80000
Runtime (Serial)*:                         11.86    47.3     94.57
Runtime (Parallel)*:                        6.77    24.46    48.86
Speed Up:                                   1.75     1.93     1.94

* Note: All run-time data represented in seconds

The average speed-up thus obtained was found to be 1.8717. From the above table it was observed that, as the volume of data being processed increases, parallelism becomes very important for better program execution time.
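For clarity, each speed-up value in the table is the ratio of the serial runtime to the parallel runtime for the same data set; for example, for 10000 pairs of matrices the speed-up is 11.86 s / 6.77 s ≈ 1.75.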
Call Graph function for serial code for Matrix multiplication:

As seen from the screenshot (figure 1a, present in the diagrams section at the end of this paper) from the VTune performance analyzer for a sequential execution, two threads are present, namely the master thread (Thread_0) and the child thread (Thread_1).



Call Graph function for Parallelized code for Matrix multiplication:

The screenshot (figure 1b, present in the diagrams section at the end of this paper) shows the spawned threads for a parallel execution of the application. In this program, we included the pragma directive on the outer for loop. Here we set the thread count by calling omp_get_num_procs() and omp_set_num_threads(): the first function returns the number of processors available in the system, and the second sets that value as the number of threads. Here, the multiplication of each pair of matrices is parallelised. Since the number of cores available in this case is 2, two child threads are created.

Time taken for parallel version:

Thread_0:
Execution time: 6.326343 s
Thread_1:
Execution time: 6.714668 s
Thread_2:
Execution time: 6.126622 s

Merge Sort:

The second application which we selected to parallelise is sorting. Merge sort involves dividing a list into two and sorting the two parts recursively (a divide-and-conquer approach). Here we first divided the list into two, called the merge sort function (a recursive function) on each half, and merged the two sorted lists into one. As the list gets divided into two, threads are created and mapped onto the available cores, as shown in the call graph. We show the parallel version of the code and the difference in runtime using the VTune analyzer.

// parallel version
void Merge_sort()
{
    int numprocs = omp_get_num_procs();
    omp_set_num_threads(numprocs);
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            mergesort(1, n/2);            // first half sorted by one thread
            #pragma omp section
            mergesort((n/2)+1, n);        // second half sorted by another thread
        }
        #pragma omp single                // merge once, after both halves are sorted
        merge(1, (n/2), n);
    }
    double end = omp_get_wtime();
    cout << "run time:" << (end - start) << endl;
}
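The recursive mergesort() and merge() routines called above are not reproduced in the paper; a minimal sketch of what they are assumed to look like, operating on global arrays list[] and tmp[] with 1-based indices as the calls above suggest, is:

// Assumed helpers for the listing above; list[] and the scratch buffer tmp[]
// are taken to be global integer arrays indexed from 1 to n.
void merge(int lo, int mid, int hi)
{
    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi)                  // pick the smaller head element
        tmp[k++] = (list[i] <= list[j]) ? list[i++] : list[j++];
    while (i <= mid) tmp[k++] = list[i++];       // copy any leftovers
    while (j <= hi)  tmp[k++] = list[j++];
    for (k = lo; k <= hi; k++) list[k] = tmp[k]; // copy back into place
}

void mergesort(int lo, int hi)
{
    if (lo >= hi) return;                        // a single element is already sorted
    int mid = (lo + hi) / 2;
    mergesort(lo, mid);
    mergesort(mid + 1, hi);
    merge(lo, mid, hi);
}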
Analysis of Results:

Number of unsorted elements:   100000000 (100 Million)   200000000 (200 Million)   300000000 (300 Million)
Runtime (Serial)*:             15.39                     31.85                     48.88
Runtime (Parallel)*:           10.11                     21.8                      32.23

* All run-time data represented in seconds

Average Speed Up: 1.498926
(Refer to screenshots 2a and 2b at the end of the paper.)

Time taken for completion of threads and waiting time:


Time taken for serial version:

Thread_0:
Execution time: 15.894525 s
Wait time: 15.892648 s
Thread_1:
Execution time: 15.838973 s
Wait time: 0.250066 s

In this program, we divided the list into two and parallelised that section using the OpenMP sections directive. Two threads are created and each thread sorts one of the divided lists. At last, the two sorted lists are merged into a single list.

The following thread-specific data illustrates the time each thread spent waiting.

Time taken for parallel version:

Thread_0:
Execution time: 10.586861 s
Wait time: 10.581956 s
Thread_1:
Execution time: 10.563912 s
Wait time: 0.537342 s
Thread_2:
Execution time: 10.057626 s
Wait time: 10.057574 s

Odd-Even Sort:

Description:

In many areas of application, dealing with a large pool of numbers becomes unavoidable (for example, a company has to prepare an expense chart and list out the expenditure for every single item in some order). There are only two kinds of numbers, odd and even; now that we have a multi-core processor, we can run a thread on one core to sort out the odd numbers and another thread on the other core to sort the even numbers in parallel.

// parallel version
void Odd_even_sort()
{
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            oesort(1, n/2);              // first half sorted by one thread
            #pragma omp section
            oesort((n/2)+1, n);          // second half sorted by another thread
        }
        #pragma omp single               // merge once, after both halves are sorted
        merge(1, (n/2), n);
    }
    double end = omp_get_wtime();
    cout << "Time:" << (end - start);
}
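The oesort() routine is likewise not listed in the paper; assuming it performs an odd-even transposition sort on the sub-range list[lo..hi] of a shared global array (an assumption about how the helper might look), a minimal sketch would be:

// Assumed helper for the listing above: odd-even transposition sort of
// list[lo..hi] (1-based indices), swapping out-of-order neighbours in
// alternating passes until the sub-range is sorted. Uses std::swap.
void oesort(int lo, int hi)
{
    bool sorted = false;
    while (!sorted)
    {
        sorted = true;
        for (int i = lo; i + 1 <= hi; i += 2)        // "odd" pass
            if (list[i] > list[i + 1]) { swap(list[i], list[i + 1]); sorted = false; }
        for (int i = lo + 1; i + 1 <= hi; i += 2)    // "even" pass
            if (list[i] > list[i + 1]) { swap(list[i], list[i + 1]); sorted = false; }
    }
}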
Analysis of Results:

Number of elements:     100000   200000   300000
Runtime (Serial)*:        5.84    23.17    52.09
Runtime (Parallel)*:      3.74    14.81    32.24

* All run-time data represented in seconds

Average Speed Up: 1.53635
(Refer to screenshots 3a and 3b at the end of the paper.)

Time taken for serial version:

Thread_0:
Execution time: 8.712375 s
Wait time: 0 s

Since it is the only thread in the serial execution, it does not have to wait for the completion of any other thread.


In this program, the list is divided into two lists. Each list is given to a thread and sorting is done. Then, the two lists are merged by calling the merge() function.

The following is the amount of time the various threads took for completion and spent waiting.

Time taken for parallel version:

Thread_0:
Execution time: 3.784085 s
Wait time: 3.767029 s
Thread_1:
Execution time: 3.848531 s
Wait time: 0.806332 s
Thread_2:
Execution time: 3.780029 s
Wait time: 3.779433 s

Database search:

This is one of the major applications which can be parallelised. Most of today's database searches involve a tree data structure, and searching in a tree can be parallelised. We can divide the tree structure into two trees (by considering the two children of the root as two roots) and give each tree to a thread. We show both the serial version and the parallel version of the code and the difference in their runtime using the VTune analyser.

void Database_search()
{
    double start, end;

    // serial version: search both subtrees one after the other
    start = omp_get_wtime();
    search(root->l, data);
    search(root->r, data);
    end = omp_get_wtime();
    cout << "run time:" << (end - start) << endl;

    // parallel version: search the two subtrees in separate threads
    int numprocs = omp_get_num_procs();
    omp_set_num_threads(numprocs);
    start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            search(root->l, data);       // left subtree searched by one thread
            #pragma omp section
            search(root->r, data);       // right subtree searched by another thread
        }
    }
    end = omp_get_wtime();
    cout << "run time:" << (end - start) << endl;
}
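The tree node structure and the recursive search() routine are not shown in the paper; a minimal sketch of what they are assumed to look like (an exhaustive binary-tree search that reports any node whose key matches) is:

// Assumed declarations for the listing above.
struct node
{
    int key;
    node *l, *r;       // left and right children
};

node *root;            // root of the database tree
int data;              // value being searched for

// Recursively search the subtree rooted at t for the given value.
void search(node *t, int value)
{
    if (t == NULL) return;
    if (t->key == value)
        cout << "found: " << value << endl;
    search(t->l, value);
    search(t->r, value);
}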
Analysis of Results:

Number of entities in the database:   40000   60000   100000
Runtime (Serial)*:                     5.97   13.81    20.88
Runtime (Parallel)*:                   9.23    9.84    12.82

* All run-time data represented in seconds

Average speed up = 1.50
(Refer to screenshots 4a and 4b at the end of the paper.)

Time taken for serial version:

Thread_0:
Execution time: 192.342031 s
Wait time: 42.430577 s
Thread_1:
Execution time: 45.863450 s
Wait time: 34.584689 s


In this program, we divided the tree into two sections and gave each section to a thread. The left child of the root is taken as the root in one thread and searching is done; in the other thread, the right child of the root is taken as the root and searching is done.

Time taken for parallel version:

Thread_0:
Execution time: 56.537615 s
Wait time: 21.116431 s
Thread_1:
Execution time: 21.977969 s
Wait time: 21.761701 s
Thread_2:
Execution time: 22.785477 s
Wait time: 17.858027 s

Image Conversion (RGB to greyscale):

Image processing is a computation-intensive area. A large number of images are sent and received over the internet every day. Over low-bandwidth links an image can take quite some time to load, especially text images, which need not be colourful to be understood. In such cases, we can convert the original image into a greyscale image or a binary image. An image is a two-dimensional array of pixels holding colour values (from 0 up to some maximum; here we use 0-255, a standard 8-bit image). What we have done here is take the average contrast of an image and then adjust the contrast in such a way that pixels with a value lower than the average are made 0 and those with a higher value are made 255 (in binary it is 1). Since this operation is a set of independent computations, we can parallelise the execution by dividing the matrix and giving each thread a part of the matrix to work on. Due to size constraints, we assumed a maximum number of pixels an image could have, in this case 90601 pixels, and higher-resolution images would be built up from such sub-images.

C++ code:

// parallel version (fragment): threshold each 300 x 300 sub-image;
// picture[][] holds the pixel values and n is the number of sub-images.
int num = omp_get_num_procs();
omp_set_num_threads(num);
double start = omp_get_wtime();
for (k = 1; k <= n; k++)
{
    #pragma omp parallel for shared(picture) private(i,j)
    for (i = 1; i <= 300; i++)
    {
        for (j = 1; j <= 300; j++)
        {
            if (picture[i][j] >= 128)    // threshold at the mid-point of the 0-255 range
                picture[i][j] = 255;
            else
                picture[i][j] = 0;
        }
    }
}
double end = omp_get_wtime();
cout << "run time:" << (end - start) << endl;
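The listing above uses a fixed threshold of 128; a minimal sketch of the average-based thresholding described in the text (an illustration that reuses the i, j and picture variables of the listing, not the paper's own code) would first compute the mean pixel value of the sub-image and then binarise against it:

// Illustrative only: binarise one 300 x 300 sub-image against its own average intensity.
long total = 0;
#pragma omp parallel for private(i,j) reduction(+:total)
for (i = 1; i <= 300; i++)
    for (j = 1; j <= 300; j++)
        total += picture[i][j];
int average = (int)(total / (300 * 300));    // mean pixel value of the sub-image

#pragma omp parallel for shared(picture) private(i,j)
for (i = 1; i <= 300; i++)
    for (j = 1; j <= 300; j++)
        picture[i][j] = (picture[i][j] > average) ? 255 : 0;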
Analysis of Results:

Number of sub-images of size 300 x 300:   10000      20000      30000
Runtime (Serial)*:                         4.7887     9.52558   14.236
Runtime (Parallel)*:                       2.79798    5.56294    8.32684

* All run-time data represented in seconds

Average Speed-Up: 1.7111
(Refer to screenshots 5a and 5b at the end of the paper.)



Time taken for serial version:

Thread_0:
Execution time: 9.537091 s
Wait time: 9.472819 s
Thread_1:
Execution time: 9.459156 s
Wait time: 0.130074 s

In this program, an image is divided into a number of sub-images and each sub-image is processed by a thread.

Time taken for parallel version:

Thread_0:
Execution time: 3.214905 s
Wait time: 2.631652 s
Thread_1:
Execution time: 6.576604 s
Wait time: 6.576558 s
Thread_2:
Execution time: 3.043152 s
Wait time: 2.744147 s

PATH FINDING:

This program detects a path between two nodes, a starting node and an ending node. There may be many ways to reach the destination point. All possible paths are searched in the graph drawn from the given nodal details (the graph is always placed in the first quadrant, and the co-ordinates of the starting node should be less than those of the ending node, otherwise the recursive procedure becomes an infinite loop). This program can be parallelised effectively using a recursive procedure. A large number of threads are created and each thread searches for a path to reach the destination. If one thread finds a path, then the function ends and all threads quit execution. This program is a simple illustration of a technique applied in many fields; for example, in networking, packet switching is done where the packets may be sent along different routes, and this is one of the major fields where the algorithm can be applied.

The runtime may change when analysed at different times; this depends on the state of the processor.

C++ code:

// (ei, ej) are the co-ordinates of the ending node; the grid is 15 x 15.
void recury(int i, int j)
{
    // Serial version
    if (i == ei && j == ej)
    {
        cout << "found:" << endl;
        exit(1);                  // a path has been found; stop the whole search
    }
    else
    {
        if (i + 1 <= 15)
        {
            recury(i + 1, j);     // advance i by one
        }
        if (j + 1 <= 15)
        {
            recury(i, j + 1);     // advance j by one
        }
    }
}

// parallel version: the same body of recury(), with the two recursive
// calls placed in separate OpenMP sections
if (i == ei && j == ej)
{
    cout << "found:" << endl;
    exit(1);
}
else
{
    #pragma omp parallel shared(i,j)
    {
        #pragma omp sections
        {
            #pragma omp section
            if (i + 1 <= 15)
            {
                recury(i + 1, j);
            }
            #pragma omp section
            if (j + 1 <= 15)
            {
                recury(i, j + 1);
            }
        }
    }
}


Analysis of Results:

Datasets (number of nodes):   1000       2000      3000
Runtime (Serial)*:            0.783345   3.7046    20.2386
Runtime (Parallel)*:          0.1947     2.32      11.79564
Speed Up:                     4.02334    1.5968     1.7157

* All run-time data represented in seconds

Average Speed-Up: 2.44528 (here the speed-up is greater than 2 because of cache coherency)

The datasets denote the size of the graph in which the two nodal points exist; in this case it refers to the size of the two-dimensional matrix, e.g. 1000 x 1000.

(Refer to screenshots 6a and 6b at the end of the paper.)

Time taken for serial version:

Thread_0:
Execution time: 2.4465558 s
Wait time: 9.472819 s
Thread_1:
Execution time: 9.459156 s
Wait time: 0.130074 s

Time taken for parallel version:

Thread_0:
Execution time: 1.214943 s
Wait time: 1.188413 s
Thread_1:
Execution time: 1.286208 s
Wait time: 0.655480 s
Thread_2:
Execution time: 0.392150 s
Wait time: 0.239700 s

Graph of the achieved Speed-Up:

[Bar chart of the average speed-up for each application (y-axis 0 to 2.5): A ≈ 1.87, B ≈ 1.50, C ≈ 1.54, D ≈ 1.50, E ≈ 1.71, F ≈ 2.45]

A - Matrix Multiplication
B - Merge Sort
C - Odd-Even Sort
D - Database Search
E - Image Conversion
F - Path Finding


DIAGRAMS:

Call Graph function for serial code for Matrix multiplication (1a): [screenshot]

Call Graph function for parallelized code for Matrix multiplication (1b): [screenshot]

Call Graph function for serial code for Merge sort (2a): [screenshot]

Call Graph function for parallelized code for Merge sort (2b): [screenshot]

Call Graph function for serial code for Odd-Even sort (3a): [screenshot]

Call Graph function for parallelized code for Odd-Even sort (3b): [screenshot]

Call Graph function for serial code for Database search (4a): [screenshot]

Call Graph function for parallelized code for Database search (4b): [screenshot]

Call Graph function for serial code for Image Conversion (5a): [screenshot]

Call Graph function for parallel code for Image Conversion (5b): [screenshot]


Call Graph function for serial code for Path Finding (6a): [screenshot]

Call Graph function for parallel code for Path Finding (6b): [screenshot]

CONCLUSION:

Thus, from the analysis performed using the available tools and concurrency platforms, it is clear that applications which are parallelised run more efficiently on a multicore processor than their serial counterparts. A detailed view of the working mechanism of threads on multiple cores is also presented. This project would serve as a benchmark for further developments and advancements in this area. This paper's ultimate goal is to reveal the real power of multi-core processors: the day when every application produced is parallelised for multi-core is not very far.

