12.950 Parallel Programming for Multicore Machines Using OpenMP and MPI
Dr. C. Evangelinos
MIT/EAPS
● Under the assumption of a single address space, one uses multiple control streams (threads, processes) to operate on both private and shared data.
● Shared data is the medium for synchronization, communication and work distribution.
● In a shared arena/mmapped file (multiple processes)
● In the heap of the process address space (multiple threads)
[Figure: Shared memory models. Multiple threads within a single process address space share the heap while each thread keeps a private stack; multiple processes keep private heaps and stacks and communicate through a shared memory arena.]
[Figure: Fork-join execution model. The master thread FORKs a team of threads at the start of each parallel region and JOINs them at its end, with the necessary synchronization at each fork and join point.]
● Directives are additions to the source code that can be ignored by the compiler:
● Appearing as comments (OpenMP Fortran)
● Appearing as pragmas (OpenMP C/C++)
● Hence a serial and a parallel program can share the same source code - the serial compiler simply overlooks the parallel code additions
● Addition of a directive does not break the serial code
● However, the wrong directive or combination of directives can give rise to parallel bugs!
● Easier code maintenance, more compact code
● New serial code enhancements outside parallel regions will not break the parallel program.
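For instance (a minimal sketch): to a non-OpenMP Fortran compiler the directive lines below are plain comments, and the !$ sentinel hides the OpenMP-only call, so one source serves both builds.

program hello_directive
!$ use omp_lib
  implicit none
  integer :: nthr
  nthr = 1                        ! value seen by the serial build
!$ nthr = omp_get_max_threads()   ! only compiled under OpenMP
!$omp parallel
  print *, 'hello'                ! each thread executes this in parallel
!$omp end parallel
  print *, 'threads available:', nthr
end program hello_directive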
● Directive continuation and conditional compilation:

C$OMP PARALLEL DO
C$OMP& REDUCTION(+:a)

#ifdef _OPENMP
  whoami = omp_get_thread_num() + 1;
#endif

!$    whoami = OMP_GET_THREAD_NUM() + &
!$   &         1

● Orphan directive: a worksharing directive in a procedure called from inside a parallel region.

C$OMP PARALLEL
      call foo(q)
      call bar
C$OMP END PARALLEL

      subroutine foo(q)
      real b, c
      b = 1.0
C$OMP DO PRIVATE(b,c)
      DO i=1,1000
        c = log(real(i)+b)
        b = c
        a = c
      ENDDO
C$OMP END DO
      return
      end
● Structured block: a single point of entry at the top, a single point of exit at the bottom
● Implicit barrier at the end
      PROGRAM PARALLEL
      ...

To run on a system:
% setenv OMP_NUM_THREADS 4
% ./a.out
hello world!
hello world!
hello world!
hello world!
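Only the header of PROGRAM PARALLEL survives above; a minimal sketch (assumed body) consistent with the four-thread output:

      PROGRAM PARALLEL
!$OMP PARALLEL
      print *, 'hello world!'
!$OMP END PARALLEL
      END PROGRAM PARALLEL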
● Control clauses:
● IF (scalar_logical_expression)
● Conditional parallel execution (is there enough work to do?)
● NUM_THREADS (scalar_integer_expression)
● Hardcodes the number of threads (useful for sections)
● Overrides runtime library call, environment variable
● Data sharing attribute clauses (lowercase for C/C++)
● PRIVATE (list), SHARED (list)
● DEFAULT (PRIVATE | SHARED | NONE)
● FIRSTPRIVATE (list)
● REDUCTION (operator: list)
● COPYIN (list)
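A sketch combining several of these clauses (all variable names are illustrative, not from the original slides):

program clauses
  implicit none
  integer :: i, n
  real :: a, tmp
  real :: b(100000)
  n = 100000
  a = 0.0
  ! parallelize only if there is enough work; ask for a team of 4
!$omp parallel do if(n > 1000) num_threads(4) &
!$omp& default(none) private(i,tmp) shared(n,b) reduction(+:a)
  do i = 1, n
     tmp = sin(0.1*i)   ! tmp is a private scratch variable
     a = a + tmp        ! per-thread partial sums, combined on exit
     b(i) = tmp
  end do
!$omp end parallel do
  print *, a
end program clauses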
To run with 9 threads:
% setenv OMP_NUM_THREADS 9
% ./a.out
Here I am running
Here I am running
Here I am running
Here I am running
Here I am running
Here I am running
Here I am running
Here I am running
Here I am running
1000000

To run with 3 threads:
% setenv OMP_NUM_THREADS 3
% ./a.out
Here I am running
Here I am running
Here I am running
1000000
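Only fragments of this example's source survive (the integer N declaration, the value 1000000 and the END); a minimal sketch consistent with both runs:

      PROGRAM HEREIAM   ! program name assumed
      integer N
      N = 1000000
!$OMP PARALLEL
      print *, 'Here I am running'
!$OMP END PARALLEL
      print *, N
      END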
● Global variables are shared:
● C: file scope variables, static variables, storage on the heap
● Stack (automatic) variables are private:
● Automatic variables in a structured block
Worksharing loop constructs:

!$OMP DO
      DO i=1,N
        ...
      ENDDO
!$OMP END DO

#pragma omp for
for (i=0; i<N; i++) {
  ...
}

Dividing the work by hand with the runtime library instead:

      iwork = 1000
      np = omp_get_num_threads()
      tnumber = OMP_GET_THREAD_NUM()
      ieach = iwork/np
      I = I + tnumber
Non-iterative worksharing:

!$OMP SECTIONS
!$OMP SECTION
      ...
!$OMP SECTION
      ...
!$OMP END SECTIONS

#pragma omp sections
{
#pragma omp section
  ...
#pragma omp section
  ...
}

!$OMP SINGLE
      ...
!$OMP END SINGLE

!$OMP WORKSHARE
      FORALL (i=1,N) ...
!$OMP END WORKSHARE
C$OMP SECTIONS
C$OMP SECTION
      call foo
C$OMP SECTION
      call bar
C$OMP END SECTIONS
C$OMP END PARALLEL

Dividing loop iterations among threads by hand (fragment):

      do i=mytid*many, ntill
        write(6,*) foo
      enddo
      A = B + 1.0
C$OMP MASTER
      print*, "Init"
C$OMP END MASTER

#pragma omp master
{
  ...
}
● Loops parallelize when each iteration updates only its own elements on the LHS:

!$OMP DO
      DO i=1, M
        A(i) = log(B(i))
        D(i) = C(i)/D(i)
      ENDDO
!$OMP END DO

!$OMP DO
      DO i=1, N
        A(i) = A(i)*B(i)
      ENDDO
!$OMP END DO
      real A(4)
      DO I=1,4
        A(I) = 10.0
      ENDDO
C$OMP PARALLEL
C$OMP DO LASTPRIVATE(A)
      DO I=1,4
        ...
      ENDDO
C$OMP END DO
C$OMP END PARALLEL
      end

      real A(4)
      DO I=1,4
        A(I) = 10.0
      ENDDO
C$OMP PARALLEL DO
C$OMP& LASTPRIVATE(A)
      DO I=1,4
        ...
      ENDDO
C$OMP END PARALLEL DO
      end
● Barrier Synchronization
● Threads wait until all threads reach this point
● Implicit barrier at the end of each parallel region
● Costly operation, to be used judiciously
● Be careful not to cause deadlock:
● No BARRIER inside CRITICAL, SINGLE, MASTER, SECTIONS or ORDERED regions
C$OMP BARRIER

C$OMP CRITICAL(mult)
      ...
C$OMP END CRITICAL(mult)

C$OMP ATOMIC
      A(i) = A(i) - vlocal
      PROGRAM CRITICAL
      INTEGER:: L, nthreads, tnumber
      L=10
!$OMP PARALLEL PRIVATE(nthreads, tnumber)
      nthreads = OMP_GET_NUM_THREADS()
      tnumber = OMP_GET_THREAD_NUM()
!$OMP MASTER
      READ(5,*) L
!$OMP END MASTER
!$OMP BARRIER
!$OMP CRITICAL
      CALL ADD_ONE(L)
!$OMP END CRITICAL
!$OMP END PARALLEL
      PRINT *, "The final value of L is", L
      END PROGRAM CRITICAL

      SUBROUTINE ADD_ONE(J)
      INTEGER:: J
!$OMP CRITICAL(adder)
      J=J+1
!$OMP END CRITICAL(adder)
      END SUBROUTINE ADD_ONE
● Library calls:
● omp_set_num_threads, omp_get_num_threads
● omp_get_max_threads, omp_get_num_procs
● omp_get_thread_num
● omp_set_dynamic, omp_get_dynamic
● omp_set_nested, omp_get_nested
● omp_in_parallel
● Environment variables:
● OMP_NUM_THREADS (see NUM_THREADS clause)
● OMP_DYNAMIC
● OMP_NESTED
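A small sketch exercising a few of these calls (guarded by the !$ sentinel so a serial compiler still accepts the source):

program query
!$ use omp_lib
  implicit none
  integer :: nthr, nproc
  logical :: par
  nthr = 1; nproc = 1; par = .false.
!$ call omp_set_num_threads(4)    ! takes precedence over OMP_NUM_THREADS
!$ nproc = omp_get_num_procs()    ! processors available to the program
!$ nthr  = omp_get_max_threads()  ! team size the next parallel region gets
!$ par   = omp_in_parallel()      ! .false. here: we are in serial code
  print *, 'procs =', nproc, ' max threads =', nthr, ' in parallel =', par
end program query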
      enddo
      error = sqrt(error)/dble(n*m)

! compute potential energy and forces
!$omp parallel do
!$omp& default(shared)
!$omp& private(i,j,k,rij,d)
!$omp& reduction(+ : pot, kin)
      do i=1,np
        f(1:nd,i) = 0.0
        do j=1,np
          if (i .ne. j) then
            call dist(nd,box, &
              & pos(1,i),pos(1,j),rij,d)
            ! attribute half of the potential
            ! energy to particle 'j'
            pot = pot + 0.5*v(d)
            do k=1,nd
              f(k,i) = f(k,i) - &
                & rij(k)*dv(d)/d
            enddo
          endif
        enddo
        ! compute kinetic energy
        kin = kin + &
          & dotr8(nd,vel(1,i),vel(1,i))
      enddo
!$omp end parallel do
      kin = kin*0.5*mass
!$omp parallel do
!$omp& default(shared)
!$omp& private(i,j)
do i = 1,np
do j = 1,nd
pos(j,i) = pos(j,i) + vel(j,i)*dt + 0.5*dt*dt*a(j,i)
vel(j,i) = vel(j,i) + 0.5*dt*(f(j,i)*rmass + a(j,i))
a(j,i) = f(j,i)*rmass
enddo
enddo
!$omp end parallel do
8) One can use the diagnostic information emitted by the autoparallelizing engines of most compilers to identify the low-hanging fruit for fine-grained loop parallelization more quickly.
9) Once you have a functional parallel program, test first for correctness vs. the baseline solution and then for speed and scaling at various processor counts. Debug if necessary.
10) You then proceed to optimize the code by hoisting and fusing parallel regions so as to get as few as possible, ideally only one for the whole code, through the use of SINGLE & MASTER. Move code to subroutines with orphaned directives and privatize as many variables as possible. Iterate with (9). (A sketch of fusing two regions is shown below.)
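For example, two consecutive PARALLEL DO regions can be fused into a single parallel region with two DO constructs (a minimal sketch; array names are illustrative):

program fuse
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: a(n), b(n), c(n)
  b = 1.0
  ! one fork-join instead of two: a single parallel region
  ! containing two worksharing loops
!$omp parallel private(i)
!$omp do
  do i = 1, n
     a(i) = b(i) + 1.0
  end do
!$omp end do          ! implicit barrier: a() is complete before it is read
!$omp do
  do i = 1, n
     c(i) = 2.0*a(i)
  end do
!$omp end do
!$omp end parallel
  print *, c(1)
end program fuse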
● !$OMP THREADPRIVATE(list)
● #pragma omp threadprivate(list)
● Makes global data private to a thread
● Fortran: COMMON blocks, module or save variables
● C: File scope and static variables
      real rnumber
      integer i, iam, nthr
      common /identify/ iam
      common /work/ rnumber
!$OMP THREADPRIVATE(/identify/ &
!$OMP& , /work/)
!$ nthr = omp_get_num_threads()
      DO i=1,nthr
        CALL DO_SERIAL_WORK
      ENDDO
!$OMP PARALLEL COPYIN(/identify/ &
!$OMP& , /work/)
!$ iam = omp_get_thread_num()
      call DO_PARALLEL_WORK
!$OMP END PARALLEL

● COPYIN initializes each thread's THREADPRIVATE copies from the master thread's values on entry to the parallel region.
● The user has finer control over the distribution of loop iterations onto threads
in the DO/for construct:
● SCHEDULE(static[,chunk])
● Distribute work evenly or in chunk size units
● SCHEDULE(dynamic[,chunk])
● Distribute work to threads as they become available, in chunks of the given size
● SCHEDULE(guided[,chunk])
● Variation of dynamic: starts with large chunks and decreases them exponentially down to the chunk size
● SCHEDULE(runtime)
● The schedule is taken at run time from the environment variable OMP_SCHEDULE, which is one of static, dynamic, guided, or a pair with a chunk size, say "static,500"
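A sketch using SCHEDULE(runtime), which defers the choice to OMP_SCHEDULE (loop body illustrative):

program sched
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: a(n)
  ! schedule chosen at run time via the OMP_SCHEDULE variable
!$omp parallel do schedule(runtime)
  do i = 1, n
     a(i) = log(real(i))
  end do
!$omp end parallel do
  print *, a(n)
end program sched

For example: % setenv OMP_SCHEDULE "guided,50" before running ./a.out.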
[Figure: STATIC scheduling of a 600-iteration space over threads 0, 1 and 2 for chunk = 150, chunk = 250, chunk = 300 and the default: chunks are handed out to the threads in round-robin order over the iteration space.]
[Figure: DYNAMIC scheduling - pieces of work 1-12 are executed by the threads of the parallel region in whatever order threads become free. GUIDED scheduling - successive pieces of work shrink exponentially (2048, 1024, 512, 256, 256, ... iterations).]
Look at ordered.f90:

!$OMP DO &
!$OMP& PRIVATE(I,J) &
!$OMP& ORDERED
      DO I=1,N
        DO J=1,M
          Z(I) = Z(I) + X(I,J)*Y(J,I)
        END DO
!$OMP ORDERED
        IF (I<21) THEN
          ...
        END IF
!$OMP END ORDERED
      END DO
!$OMP END DO
● omp_init_lock, omp_destroy_lock
● omp_set_lock, omp_unset_lock, omp_test_lock
● omp_init_nest_lock, omp_destroy_nest_lock
● omp_set_nest_lock, omp_unset_nest_lock,
omp_test_nest_lock
● Timing API:
● omp_get_wtime
● omp_get_wtick
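A minimal sketch combining the lock routines and the timing calls (all OpenMP lines carry sentinels, so the serial build compiles too; the summed quantity is illustrative):

program locktime
!$ use omp_lib
  implicit none
!$ integer(omp_lock_kind) :: lck
  real(8) :: t0, t1
  integer :: i, total
  total = 0
  t0 = 0.d0; t1 = 0.d0
!$ call omp_init_lock(lck)
!$ t0 = omp_get_wtime()          ! wallclock time stamp
!$omp parallel do shared(total)
  do i = 1, 1000
!$   call omp_set_lock(lck)      ! mutual exclusion, like a named CRITICAL
     total = total + i
!$   call omp_unset_lock(lck)
  end do
!$omp end parallel do
!$ t1 = omp_get_wtime()
!$ call omp_destroy_lock(lck)
  print *, 'total =', total, ' elapsed seconds =', t1 - t0
end program locktime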
      a = 0.0
C$OMP PARALLEL
C$OMP DO REDUCTION(+:a)
      do i = 1, n
        tmp = sin(0.1*i)
        a = a + tmp
        b(i) = tmp
      enddo
C$OMP END DO
C$OMP END PARALLEL
      print *, b(1)
      print *, a

● Note: as written, tmp is shared by default inside the parallel region and must be made PRIVATE to avoid a race.
IA32/IA64/AMD64 (commercial)
● Compilers
● Portland Group compilers (PGI) with debugger
● Absoft Fortran compiler
● Pathscale compilers
● Lahey/Fujitsu Fortran compiler
● Debuggers
● TotalView
● Allinea DDT
12.950 Parallel Programming for Multicore Machines Using OpenMP and MPI
IAP 2010
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.