
UNIT-8

FORMS OF PARALLELISM
8.1 SIMPLE PARALLEL COMPUTATION:
Example 1: Numerical Integration over two variables

Consider a simple example to explain the role of parallel processing in algorithms: the double integration of a function of two variables over a rectangular region of the X-Y plane. The double integral is evaluated numerically using a simple parallel algorithm.
A continuous function f(X, Y) of two variables X and Y defines a surface Z = f(X, Y) in the three-dimensional space created by the axes X, Y and Z, and hence a volume under that surface.
This volume is determined by the integral:

V = \int_{Y_{min}}^{Y_{max}} \int_{X_{min}}^{X_{max}} f(X, Y) \, dX \, dY     (1.1)

where the appropriate limits on X and Y have been taken as Xmin, Xmax, Ymin and Ymax respectively.
When such an integral is to be evaluated on a computer, the axes X and Y can be divided into intervals of length ΔX and ΔY respectively, and the integral is replaced by the following summation:

V \approx \sum_{Y} \sum_{X} f(X, Y) \, \Delta X \, \Delta Y

The function f(X, Y) must be evaluated at an appropriate point, for example the midpoint, within each area element of size (ΔX)(ΔY).

Figure: Double integration of f(X, Y) over a rectangular region of the X-Y plane


The number of intervals along the X and Y axes is, respectively,

N_X = \frac{X_{max} - X_{min}}{\Delta X} \quad and \quad N_Y = \frac{Y_{max} - Y_{min}}{\Delta Y}
One possible parallelized version of evaluating Eqn (1.1) is shown below:

1. For each of the N_X × N_Y area elements, in parallel, calculate the value of the function f(X, Y) at the midpoint of the area element.
2. For each of the N_Y rows, in parallel, calculate the summation of f(X, Y) at the N_X points along the row; denote this summation as the respective row total. This is the inner summation.
3. Calculate the sum of the N_Y row totals found in step 2; this is the outer summation.
4. Multiply the sum of step 3 by (ΔX)(ΔY).
N_X × N_Y processors work in parallel in step 1.
Barrier synchronization: step 2 should not start until all processors have completed step 1, and similarly step 3 should not start until all the processors involved have completed step 2. This type of synchronization between processors (or processes) is known as barrier synchronization. A code sketch of these steps follows.
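The following is a minimal sketch of the four steps in C with OpenMP (the notes themselves give no code; the limits, the grid sizes NX and NY, and the sample function f are illustrative assumptions). The implicit barrier at the end of each OpenMP parallel loop plays the role of the barrier synchronization described above.

    #include <stdio.h>

    #define NX 1000                     /* intervals along X (illustrative) */
    #define NY 1000                     /* intervals along Y (illustrative) */

    static double f(double x, double y) { return x * x + y * y; } /* sample f(X,Y) */

    int main(void) {
        const double xmin = 0.0, xmax = 1.0, ymin = 0.0, ymax = 1.0;
        const double dx = (xmax - xmin) / NX, dy = (ymax - ymin) / NY;
        static double row_total[NY];
        double sum = 0.0;

        /* Steps 1 and 2: evaluate f at each midpoint and form the row totals.
           Rows are independent, so the loop runs in parallel; the implicit
           barrier at its end is the barrier synchronization before step 3. */
        #pragma omp parallel for
        for (int j = 0; j < NY; j++) {
            double y = ymin + (j + 0.5) * dy;
            double t = 0.0;
            for (int i = 0; i < NX; i++)
                t += f(xmin + (i + 0.5) * dx, y);   /* midpoint evaluation */
            row_total[j] = t;
        }

        /* Step 3: outer summation of the NY row totals */
        #pragma omp parallel for reduction(+:sum)
        for (int j = 0; j < NY; j++)
            sum += row_total[j];

        /* Step 4: multiply by the area element (dx)(dy) */
        printf("integral = %f (exact value is 2/3)\n", sum * dx * dy);
        return 0;
    }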
Let us assume that we have a square grid over which the double integration is to be performed, i.e. N_X = N_Y = N.
The steps of the parallel algorithm can be summarized as follows:
o Function evaluation: For the evaluation of f(X, Y) at each grid element, we use N^2 processors, and the time taken is independent of N.
o Row totals: The N row totals can be calculated in parallel in log2 N time steps, with N/2 processors used for each row.
o Final sum: The final sum of the N row totals is calculated using N/2 processors in log2 N time steps.
o Thus, overall, with N^2 processors, the computation of the double integration is performed in time O(log2 N).

Example 2: Addition of N numbers using parallel processors

Addition of N numbers on a single processor takes N - 1 addition steps. On multiple processors operating in parallel, the same addition of N numbers can be done in a more efficient manner.
Let us consider the addition of N = 8 numbers on 4 processors.
Assume that the numbers a0, a1, ..., a7 are distributed on eight processors P0, P1, ..., P7.
Step 1: Do in parallel: a0+a4 -> a0, a1+a5 -> a1, a2+a6 -> a2, a3+a7 -> a3
o Note that a0+a4 -> a0 means that the operand a4 is made available from processor P4 to processor P0, using some mechanism of inter-processor communication; operand a0 is already present in processor P0, and therefore the result of the addition is also then available in processor P0.
Step 2: Do in parallel: a0+a2 -> a0, a1+a3 -> a1
Step 3: a0+a1 -> a0

Figure: Inter-processor communication in the above algorithm


We see that four additions take place in parallel in step 1, two additions in step 2, and a single addition in step 3.
Barrier synchronization is needed between steps. The sum of the eight numbers is available in a0 after three time steps, and the degree of parallelism is 4, since that is the maximum number of parallel operations we carried out, which was in step 1.
Let us assume that, in general, N = 2^k for some integer k, i.e. N is a power of 2.
o We can easily verify that:
1. In the above example, at the end of three time steps, variable a0 in processor P0 does indeed have the sum of the eight operands originally given to us.
2. In general, for N = 2^k values to be added, the number of time steps required will be k = log2 N.
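As a concrete illustration, here is a minimal C/OpenMP sketch of the pairwise addition scheme for N = 2^k values, matching the N = 8 example above; the sample array a and the use of OpenMP are assumptions for illustration, not part of the original algorithm description.

    #include <stdio.h>

    int main(void) {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};  /* a0..a7, one value per PE */
        /* Each pass halves the number of active "processors"; the additions
           within a pass are independent and therefore run in parallel. */
        for (int stride = 8 / 2; stride >= 1; stride /= 2) {
            #pragma omp parallel for
            for (int i = 0; i < stride; i++)
                a[i] += a[i + stride];           /* a(i) + a(i+stride) -> a(i) */
            /* implicit barrier: the next pass needs all results of this one */
        }
        printf("sum = %g\n", a[0]);              /* 36 after log2(8) = 3 passes */
        return 0;
    }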

The following figure presents another depiction of the algorithm; the inter-processor communication occurs in the pattern of a binary tree.

Figure: Another depiction of the algorithm for adding N numbers using parallel processors
When an associative operation such as this is carried out on a multiprocessor system, it is known as a reduce or reduction operation.
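On shared-memory systems this reduce operation is available as a language-level primitive; for instance, OpenMP (mentioned later in this unit) provides a reduction clause. A minimal sketch, assuming OpenMP is available:

    #include <stdio.h>

    int main(void) {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double sum = 0.0;
        /* OpenMP combines the per-thread partial sums for us, internally
           using a tree-style combination like the one described above */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 8; i++)
            sum += a[i];
        printf("sum = %g\n", sum);   /* 36 */
        return 0;
    }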

8.3 PARALLEL ALGORITHMS:

The complexity of any sequential algorithm is defined in terms of the asymptotic running time of the algorithm on a problem instance of size n.
The complexity is shown in big-Oh or order notation.
Ex: O(t(n)) means that, for all n > n0, the running time of the algorithm grows no faster than k·t(n) for some constants n0 and k.
For a problem instance of size n, assume that an algorithm uses p(n) processors in parallel and has running time in O(t(n)). Then the work performed by the algorithm on a problem instance of size n is defined as w(n) = O(p(n)·t(n)). The work performed by the parallel algorithm can also be referred to as the cost of the algorithm.
Consider two different parallel algorithms, say I and II, for solving a given problem.
o In solving a problem instance of size n, let these two algorithms perform work W_I(n) = O(p_I(n)·t_I(n)) and W_II(n) = O(p_II(n)·t_II(n)), respectively.
o We say that algorithm I is work-efficient with respect to algorithm II if W_I(n) is in O(W_II(n)), i.e. W_I(n) is of the order of W_II(n).

A deterministic sequential algorithm is considered efficient if its running time T(n) is a polynomial in n; for example, bubble sort has running time in O(n^2).

A parallel algorithm is said to be efficient if, for solving a problem of size n, it satisfies the following two conditions:
1. The number of processors p(n) used is in O(n^a) for some constant a, i.e. the number of processors required is polynomial in n, and
2. The running time of the algorithm t(n) is in O(log^b n) for some constant b, i.e. the running time of the algorithm is polylogarithmic in n.


An optimal parallel algorithm is defined as one which is work-efficient with
respect to the best possible sequential algorithm for solving the problem.

BRENT'S THEOREM: For a given problem, suppose that there exists a parallel algorithm which solves a problem instance of size n using p(n) processors in time O(t(n)). If only q(n) < p(n) processors are available to solve the problem, then the problem can be solved in time O(p(n)·t(n)/q(n)). This is Brent's theorem.
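A minimal sketch of the simulation idea behind the theorem: q(n) physical processors emulate one time step of p(n) virtual processors in about p(n)/q(n) rounds. The values P and Q and the step() function below are hypothetical placeholders, not from the notes.

    #include <stdio.h>

    #define P 16   /* virtual processors p(n) (hypothetical) */
    #define Q 4    /* physical processors q(n) (hypothetical) */

    static void step(int pe) {           /* placeholder for one PE's work */
        printf("virtual PE %d executes its operation\n", pe);
    }

    int main(void) {
        /* One parallel time step of P virtual processors, run on Q threads:
           each thread handles ceil(P/Q) virtual PEs sequentially, so one
           step of the p(n)-processor algorithm costs O(p(n)/q(n)) time. */
        #pragma omp parallel for num_threads(Q)
        for (int t = 0; t < Q; t++)
            for (int pe = t; pe < P; pe += Q)   /* round-robin assignment */
                step(pe);
        return 0;
    }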
Ex: Consider the following algorithm for multiplying two n × n matrices; let A and B be the input matrices.
Version 1 (executed in parallel by processor (i, j, k)):
1. Read A(i, k)
2. Read B(k, j)
3. Compute A(i, k) × B(k, j)
4. Store in C(i, j, k)
Version 2 (logarithmic summation over k, executed in parallel by processor (i, j, k)):
1. l <- n
2. repeat
   i. l <- l/2
   ii. if (k < l) then
       begin
         Read C(i, j, k)
         Read C(i, j, k+l)
         Compute C(i, j, k) + C(i, j, k+l)
         Store in C(i, j, k)
       end
   until (l = 1)

o Assume a three-dimensional indexing of the PEs. Let us consider the two versions of the algorithm.
VERSION 1:
Assume that there are n^3 processors; the algorithm takes O(log n) time.
Work done = p(n)·t(n) = O(n^3 log n)
VERSION 2:
Assume that there are n^3/log n processors; the algorithm runs in O(log n) time.
Work done = q(n)·t(n) = O(n^3)

o From the above, we can conclude that the number of processors can be reduced by a factor of log n, i.e. from the n^3 processors used in the first version of the algorithm to the n^3/log n processors used in the second version.


o In general, we can say that q(n) < p(n) processors can simulate one time step of p(n) parallel processors in O(p(n)/q(n)) time steps.
o The two versions of the parallel algorithm, on p(n) and q(n) processors respectively, are work-efficient with respect to each other.
o Thus the running time of the algorithm on the reduced number of processors q(n) increases by a factor of O(p(n)/q(n)), giving us the result known as Brent's theorem. A code sketch of the two versions follows.
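The following is a minimal C/OpenMP sketch of the two versions for a small n (n = 4, a power of 2, chosen for illustration); the parallel loops stand in for the PEs indexed (i, j, k), and the loop over l mirrors Version 2.

    #include <stdio.h>

    #define N 4                          /* n = 4, a power of 2 (illustrative) */

    double A[N][N], B[N][N], C[N][N][N]; /* C(i,j,k) holds partial products */

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }

        /* Version 1: PE (i,j,k) computes one product term A(i,k)*B(k,j) */
        #pragma omp parallel for collapse(3)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j][k] = A[i][k] * B[k][j];

        /* Version 2: logarithmic summation over k; after log2(N) passes,
           C(i,j,0) holds the (i,j) entry of the product matrix */
        for (int l = N / 2; l >= 1; l /= 2) {
            #pragma omp parallel for collapse(3)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    for (int k = 0; k < l; k++)
                        C[i][j][k] += C[i][j][k + l];
        }
        printf("C(0,0) = %g\n", C[0][0][0]); /* sum over k of A(0,k)*B(k,0) = 4 */
        return 0;
    }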

DESIGN STAGES OF A PARALLEL ALGORITHM:


The three stages in the process of writing, compiling and executing a parallel program
are
1. STRUCTURAL PARALLELISM enters at the very first stage of program design and development.
o Structural parallelism is the highest level of abstraction in program design (top-down manner).
o Applications designed for coarse-grain parallelism should also be considered examples of structural parallelism.
o Multiple instruction multiple data (MIMD) parallelism, and the more restricted single program multiple data (SPMD) parallelism, both fall into this structural parallelism.
2. COMPILER-DISCOVERED PARALLELISM is discovered in the second stage.
o It needs support from the underlying hardware. This form of parallelism focuses on a block of instructions, or it may have scope spanning two or more blocks.
3. PROCESSOR-DISCOVERED PARALLELISM (instruction-level parallelism, ILP) is independent of the first two stages; it is discovered and exploited on-the-fly by the processor hardware.
o It relies on discovering independence between the multiple instructions of the program which occupy the fetch buffer and instruction pipeline at any one time.
Figure: Stages in writing, compiling and running an application: (1) application program written in a higher-level language, (2) compiler, function libraries and runtime environment, (3) processor(s) on which the application runs.

8.4 STREAM PROCESSING

Stream processing is a form of data parallelism which has some characteristics of SIMD as well as of dataflow processing.
Stream processing depends on a high level of data locality and regularity in the processing of stream data.
Stream processing combines the features of high processing power, energy efficiency and programmability.
All the data elements in a data stream go through the same processing stages. For example, the 3D graphical model of a car may be made up of hundreds of thousands of line elements or polygons, which must be processed through the so-called rendering pipeline to display the car on the system display screen.
APPLICATIONS OF STREAM PROCESSING
i. Animated 3D graphics
ii. Multimedia applications
iii. Image and signal processing applications
iv. 3G mobiles, set-top boxes
v. Biological computations
vi. Cryptography and database queries
STREAM PROCESSING UNITS:
o The different units/components required for stream processing applications like graphics are as follows:
GPU (graphics processing unit): GPUs are widely used in graphics processing applications.
GPUs are basic elements in stream processing applications, and they operate in parallel with other processors, sharing the graphics processing load.
Stream processing can be seen as a new variant of SIMD in which streams of data flow amongst processing kernels.
Processing kernels: The processing kernels are basically software functions being executed on GPU processor cores.
GPU processor cores or stream processors: Multiple copies of a kernel execute in parallel on multiple cores, giving rise to a SIMD organization.

Fig: Four processing kernels operating on a data stream

If you consider sixteen sets (assume each set is a combination of four kernels), then the total number of processor cores employed will be 64.
Each processor core contains a local register file to maintain copies of working variables for the single execution thread (or task) running in that core.
Data locality plays a key role in the design of a stream processing algorithm.
Stream processing is a form of structural parallelism; a small code sketch of the kernel idea is given below.
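A minimal sketch of the kernel idea in C: every element of a data stream passes through the same chain of small kernels, and since the elements are independent the loop can execute SIMD-style in parallel. The kernel names scale and offset are invented for illustration, not from any real GPU API.

    #include <stdio.h>

    static float scale(float x)  { return 2.0f * x; }   /* kernel 1 (invented) */
    static float offset(float x) { return x + 1.0f; }   /* kernel 2 (invented) */

    int main(void) {
        float stream[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        /* Every element goes through the same processing stages; the
           iterations are independent, so they can run SIMD-style in parallel. */
        #pragma omp parallel for
        for (int i = 0; i < 8; i++)
            stream[i] = offset(scale(stream[i]));
        for (int i = 0; i < 8; i++)
            printf("%g ", stream[i]);            /* 1 3 5 7 9 11 13 15 */
        printf("\n");
        return 0;
    }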
Stream processors:
Imagine stream processor, developed at Stanford University
Fermi GPU, developed by Nvidia Corporation
o IMAGINE STREAM PROCESSOR:
It was designed by researchers at Stanford University.
It achieved tens of GFLOPS performance for certain graphics applications.
The aggregate power dissipation of this processor is less than 10 watts.
MERRIMAC is the name of another research project at Stanford aimed at a larger computing platform using stream architecture and advanced interconnection networks.
The goals of MERRIMAC are:
Achieving a high ratio of computation to communication.
Very high performance/processing speed.
Compact size.
High energy efficiency.
Reliability.
Simple system management.
Scalability.
o FERMI GPU:
Nvidia Corporation is a leading producer of graphics processing units.
They also define a hardware/software platform named Compute Unified Device Architecture (CUDA) for general-purpose program development using GPUs and standard programming languages.
Nvidia named this concept GPU computing, i.e., GPUs applied to general-purpose computing.
From around 2006, Nvidia has developed several multi-core, multi-threaded general-purpose GPUs (also called GPGPUs), under the names GeForce, Quadro and Tesla.
Nvidia has announced its advanced Fermi architecture for GPU computing.
The first Fermi based GPU from Nvidia has over 3.0 billion
transistors and 512 cores.
Each core executes a floating point or integer instruction per
clock.
The 512 cores are organized in 16 so-called streaming multiprocessors (SMs) of 32 cores each.
The L2 cache is shared between the 16 SMs.
The GPU chip provides six 64-bit memory interfaces, for a total 384-bit memory interface, supporting up to a total of 6 GB of memory.
A host interface connects the GPU to the CPU via PCI Express, while the GigaThread unit on the GPU schedules groups of threads among the SMs.
A schematic diagram of the Fermi chip is shown next.
Apart from the 32 cores, each SM is also provided with 16 load/store units, and 4 independent special function units (SFUs) to compute sine, cosine, reciprocal and square root functions.
The processor cores themselves are very basic, with one ALU and one FPU each.
Fermi offers improved memory access and double-precision floating point performance, ECC support, a (limited) cache hierarchy, more shared memory amongst SMs, faster context switching, faster atomic operations and instruction scheduling, and the use of predication to reduce branch penalty.
Threads are grouped into larger units, known as warps, blocks and grids, for the purpose of scheduling.
Most of the area in the Fermi chip is taken up by actual processing elements, i.e. FPUs, ALUs and SFUs.

Figure: Block diagram of Fermi GPU

When we compare stream processing with other available technologies for achieving specialized and power-efficient processing, the following broad picture emerges:
i. Application-specific ICs (ASICs) have comparable performance and are power efficient, but they involve longer design cycles and higher design costs, and are less flexible.
ii. Field-programmable gate arrays (FPGAs) are less energy efficient, and do not allow applications to be programmed in higher-level languages.

8.5 CRAY LINE OF COMPUTER SYSTEMS


Seymour Cray (1925-1996) is known as the father of supercomputing because he innovated supercomputer architecture, including innovative packaging and cooling techniques.
Seymour Cray was the chief designer of the CDC 6600, the first commercial supercomputer, built at Control Data Corporation.
The CDC 6600 was followed by the CDC 7600.
Seymour Cray founded his first company, Cray Research, and built the Cray-1 and Cray-2 supercomputers.
The different supercomputers of Cray Research are:
The different super computers of cray research are:
i.
Microprocessor super computers:
a. Cray computer systems combined multiprocessing with vector
processing called multiprocessor super computers.
b. Cray X-MP* is first multiprocessor supercomputer
c. Cray Y-MP* is the powerful successor of cray X-MP.
ii.

Massively parallel processing (MPP) systems:


a. Massively parallel processing (MPP) systems developed by cray
research are T3D, and its successor T3E.
b. Both T3D and T3E both uses 3-D torus topology.

iii.

iv.

v.

c. T3Dand T3E both employed different versions of the 64 bit DEC


(Digital equipment corporation) Alpha processors.
Scalable Linux supercomputers:
a. Cray introduced the XT series* of so called scalable Linux super
computers
b. Examples (scalable Linux supercomputers) include XT5* and XT6*
(discussed briefly next).
Massively multi-threading supercomputers:
a. Examples include cray XMT supercomputer*, announced in 2006.
b. Its descendent is Tera/MTA* massive multithreading.
c. Cray XMT uses Crays own 500 MHz, 64-bit threadstorm processors.
d. Each threadstorm processor supports 128 threads.
e. Cray XMT contains more than 8000 processors.
f. XMT system can deliver over one million concurrent processing threads.
g. The total shared memory on XMT system is up to 64 Tera bytes i.e., 8 GB
per node.
h. XMT system provides very high levels of multithreading needed for
applications such as data analysis, data mining, predictive analytics and
pattern matching.
i. The systems interconnect used in XMT system is Crays proprietary sea
star technology.
j. Scalar processing, I/O and service functions are provided by AMD
Opteron based nodes.
Other systems:
a. Cray CX1* is a lower end super computer from the company which is less
expensive and easier to deploy.
b. It makes use of INTEL Xeon processors in cluster architecture.
c. Adaptive computing- a technology which combines features of vector
processing, parallel processing and multithreading is the latest
technology used in recent Cray computer systems

CRAY XT SUPERCOMPUTERS:
The Cray XT series of supercomputers includes the XT5 and XT6, which offer petaflops performance.
The XT5 system at Oak Ridge National Laboratory in the US (nicknamed Jaguar) is currently rated as the world's most powerful supercomputer.
The XT5 system uses six-core AMD Opteron processors, with a total of over 224,000 processing cores in the system, and can reach a peak performance of over two petaflops.
The OS platform employed in the XT5 is Linux based.
The XT5 is based on AMD Opteron processors (quad-core or six-core) in a 2D torus network built using Cray's proprietary SeaStar interconnect.
The main goals of the XT5, the XT6 and Cray's present supercomputer technologies are:
a. High computing performance with scalability and programmability.
b. Advanced packaging.
c. Efficient cooling.
d. Low power consumption.
Each diskless node in the XT5 network is made up of Opteron processors, which have a shared 25.6 GB/sec data path to shared local memory.
The local memory is 16 GB or 32 GB of DDR2 memory provided with ECC (error correction code).
Each processing core in the XT5 has a 64 KB L1 (level-1) instruction cache; in addition, the processor chip provides a 6 MB shared L3 cache.
The proprietary SeaStar ASIC (application-specific IC) chip has a direct memory access (DMA) engine, a communication-cum-management processor, and a service port.

Figure: Schematic diagram of the 2D torus network in Cray XT5


All XT supercomputers (XT5 and XT6) developed by Cray use its own Linux-based Cray Linux Environment (CLE).
Program development software supported by the XT supercomputers (XT5 and XT6) includes Fortran 90, Fortran 95, C, C++, MPI 2, Cray's shared memory software SHMEM, OpenMP, high-performance math libraries, and performance analysis tools.
Hardware and software features provided on the XT supercomputers (XT5 and XT6) include system monitoring, fault identification and recovery, checkpoint and restart, system interconnect management, system status displays for administrators, redundant power supplies and voltage regulator modules, and redundant data paths to the system RAID.

XT6:
At 2009 supercomputing conference, in November 2009, Cray announced its high
end XT6 super computer system.
XT6 super computers employ eight and twelve core AMD Opteron processors to
provide higher processing performance than XT5.
Each computer node in XT6 can be provided with 32GB or 64GB of ECC DDR
local memory.
In future systems, XT6 can be upgraded to 12 and 16 core Opteron processors.
XT5m and XT6m are fully compatible midrange versions of XT5 and XT6 super
computers.

Super Computer | Year Introduced                              | Speed/Peak Performance
Titan          | Oak Ridge National Laboratory (ORNL), 2012   | 17.59 petaflops; 27 petaflops (peak)
Jaguar (XT5)   | ORNL, 2009                                   | 1.75 petaflops (peak)
Cray-1         | Los Alamos National Laboratory, 1976         | 80 Mflops (peak)
Cray-2         | 1985                                         | 1.9 Gflops (peak)
Cray X-MP      | 1982                                         | 800 Mflops
Cray T3E       | 1995                                         | 1 teraflops
Cray Y-MP      | 1988                                         | 333 Mflops
XT6            | November 16, 2009                            | 2 petaflops

List of supercomputers developed by Cray Research:
Cray-1, Cray X-MP, Cray-2, Cray Y-MP, Cray XMS, Cray Y-MP EL, Cray C90, Cray EL90, Cray T3D, Cray J90, Cray T90, Cray T3E, Cray SV1, Cray SV2, Cray-3, Cray-3/SSS, Cray-4, Cray APP, Cray S-MP, Cray CS6400, Cray SX-6, Cray MTA-2, Cray Red Storm, Cray X1, Cray XT3, Cray XD1, Cray XT4, Cray XMT, Cray XT5, Cray CX1, Cray XT6, Cray XE6, Cray CX1000, Cray XK6, Cray XK7, and a few more.
