FORMS OF PARALLELISM
8.1 SIMPLE PARALLEL COMPUTATION:
Example 1: Numerical Integration over two variables
∫ (Y = Ymin to Ymax) ∫ (X = Xmin to Xmax) f(X, Y) dX dY
where the appropriate limits on X and Y have been taken as Xmin, Xmax, Ymin and
Ymax respectively.
When such an integral is to be evaluated on a computer, the axes X and Y can be
divided into intervals of length ΔX and ΔY respectively, and the integral is
replaced by the following summation:
Σ (over Y) Σ (over X) f(X, Y) ΔX ΔY

If the summation contains N² terms, the additions can be carried out with N²/2
processors in log₂(N²) time steps, i.e., the summation can be performed in time
O(log₂ N).

Example 2: Addition of N numbers using parallel processors in log₂ N time steps

Addition of N numbers on a single processor takes N-1 addition steps. On multiple
processors operating in parallel, the same addition of N numbers can be done in a
more efficient manner.
Let us consider the addition of N = 8 numbers. Assume that the numbers
a0, a1, a2, ..., a7 are distributed on eight processors P0, P1, ..., P7.
Step 1: Do in parallel: a0+a4 --> a0, a1+a5 --> a1, a2+a6 --> a2, a3+a7 --> a3
o Note that a0+a4 --> a0 means that the operand a4 is made available from
processor P4 to processor P0 using some mechanism of inter-processor
communication; operand a0 is already present in processor P0, and therefore
the result of the addition is also then available in processor P0.
Step 2: Do in parallel: a0+a2 --> a0, a1+a3 --> a1
Step 3: a0+a1 --> a0
Thus the sum of N numbers is obtained in log₂ N time steps, at the cost of some
inter-processor communication in each step.
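As a concrete sketch, the three steps above can be simulated sequentially in Python; the inner loop of each pass stands in for additions that the processors would execute simultaneously (this is an illustration, not an actual multiprocessor program):

```python
# Tree reduction of N numbers, mirroring Steps 1-3 above.
# Each pass halves the number of active values; on real hardware the
# additions inside one pass would run in parallel on separate processors.

def parallel_sum(a):
    a = list(a)          # a[i] models the value held by processor Pi
    stride = len(a) // 2 # len(a) is assumed to be a power of two
    while stride >= 1:
        # one "time step": processors P0 .. P(stride-1) add in parallel
        for i in range(stride):
            a[i] = a[i] + a[i + stride]   # a[i] + a[i+stride] --> a[i]
        stride //= 2
    return a[0]          # the result accumulates in processor P0

nums = [1, 2, 3, 4, 5, 6, 7, 8]
print(parallel_sum(nums))   # 36, obtained in log2(8) = 3 passes
```

For N = 8 the while loop runs exactly three times, matching Steps 1, 2 and 3.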
Each processor P(i, j, k) executes the following, with l initially equal to n:
Repeat
i.  l <-- l/2
ii. if (k < l) then
    begin
      Read c(i,j,k)
      Read c(i,j,k+l)
      Compute c(i,j,k)+c(i,j,k+l)
      Store in c(i,j,k)
    end
Until (l = 1)
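This fragment appears to be the reduction phase of a PRAM-style matrix multiplication, where c(i,j,k) initially holds the product a(i,k)·b(k,j). Under that assumption, a minimal sequential simulation (each pass of the while loop modelling one parallel time step) is:

```python
# Halving reduction over the k index: after log2(n) passes, c[i][j][0]
# holds sum over k of a[i][k]*b[k][j], i.e. one entry of the matrix product.

def reduce_k(c, n):
    l = n
    while l > 1:
        l //= 2                      # l <-- l/2
        for i in range(n):
            for j in range(n):
                for k in range(l):   # processors with k < l are active
                    c[i][j][k] = c[i][j][k] + c[i][j][k + l]
    return [[c[i][j][0] for j in range(n)] for i in range(n)]

n = 4                                # n assumed to be a power of two
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
# c[i][j][k] starts out holding the partial product a[i][k] * b[k][j]
c = [[[a[i][k] * b[k][j] for k in range(n)] for j in range(n)] for i in range(n)]

prod = reduce_k(c, n)
# cross-check against a direct triple-loop matrix multiplication
check = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
print(prod == check)   # True
```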
With n³/log n processors, the computation is still completed in O(log n) time.
Work done = q(n) × t(n) = O(n³).
o From the above, we can conclude that the number of processors can be
reduced by a factor of log n, i.e., from n³ to n³/log n.
o Thus, we can say that the second version of the algorithm uses n³/log n
processors.
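The processor-count/work trade-off can be checked with a small calculation; n = 256 is an arbitrary choice for which n³ is exactly divisible by log₂ n:

```python
import math

n = 256
t = int(math.log2(n))        # parallel time: log2(256) = 8 steps

p1 = n ** 3                  # version 1: n^3 processors
work1 = p1 * t               # work = n^3 * log n

p2 = n ** 3 // t             # version 2: n^3 / log n processors (exact here)
work2 = p2 * t               # work = n^3

print(p1 // p2)              # 8  -> processors reduced by a factor of log2(n)
print(work2 == n ** 3)       # True -> work done is back down to O(n^3)
```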
If we consider sixteen sets (assuming each set is a combination of four
kernels), then the total number of processor cores employed will be 64.
Each processor core contains a local register file to maintain copies of working
variables for the single execution thread (or task) running in each core.
Data locality plays a key role in the design of a stream processing algorithm.
Stream processing is a form of structural parallelism
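As an illustration only (Python generators are not a stream processor, but they share the consume-and-produce structure), a chain of "kernels" can be sketched like this; each stage reads its input stream element by element, so working data stays local to that stage:

```python
# A toy stream pipeline: scale -> offset -> clamp kernels chained together.
# Each "kernel" is a generator that transforms a stream one element at a
# time, mimicking how a stream processor keeps data local to each stage.

def scale(stream, factor):
    for x in stream:
        yield x * factor

def offset(stream, delta):
    for x in stream:
        yield x + delta

def clamp(stream, lo, hi):
    for x in stream:
        yield max(lo, min(hi, x))

source = range(5)                          # input stream: 0, 1, 2, 3, 4
pipeline = clamp(offset(scale(source, 3), -2), 0, 10)
print(list(pipeline))                      # [0, 1, 4, 7, 10]
```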
Stream processors:
Imagine Stream Processor, developed at Stanford University
Fermi GPU, developed by Nvidia Corporation
o IMAGINE STREAM PROCESSOR:
It was designed by researchers at Stanford University
It achieved tens of GFLOPS of performance for certain graphics
applications.
The aggregate power dissipation of this processor is less than
10 watts.
MERRIMAC is the name of another research project at
Stanford aimed at larger computing platform using stream
architecture and advanced interconnection networks.
The goals of MERRIMAC are:
Achieving a high ratio of computation to communication.
Very high performance/processing speed.
Compact size.
High energy efficiency.
Reliability.
Simple system management.
Scalability.
o FERMI GPU:
Nvidia Corporation is a leading producer of graphics
processing units.
They also define a hardware/software platform named
Compute Unified Device Architecture (CUDA) for general
purpose program development using GPUs and standard
programming languages.
Nvidia named this concept GPU computing i.e., GPUs
applied to general purpose computing.
From around 2006, Nvidia has developed several multi-core,
multi-threaded general purpose GPUs (also called GPGPUs),
under the product names GeForce, Quadro and Tesla.
Nvidia has announced its advanced Fermi architecture for
GPU computing.
The first Fermi based GPU from Nvidia has over 3.0 billion
transistors and 512 cores.
Each core executes a floating point or integer instruction per
clock.
The 512 cores are organized in 16 so called stream
multiprocessors (SMs) of 32 cores each.
The L2 cache is shared between the 16 SMs.
The GPU chip provides six 64-bit memory interfaces, for a total
384-bit memory interface, supporting up to a total of 6 GB of
memory.
A host interface connects the GPU to the CPU via PCI-Express, while
the GigaThread unit on the GPU schedules groups of threads
among the SMs.
A schematic diagram of the Fermi chip is shown next.
Apart from the 32 cores, each SM is also provided with 16
load/store units, and 4 independent special function
units (SFUs) to compute sine, cosine, reciprocal and square root
functions.
The processor cores themselves are very basic, with one ALU
and one FPU each.
Fermi offers improved memory access and double-precision
floating point performance, ECC support, a (limited) cache
hierarchy, more shared memory among SMs, faster context
switching, faster atomic operations and instruction scheduling,
and the use of predication to reduce branch penalty.
Threads are grouped into larger units, known as warps,
blocks and grids, for the purpose of scheduling.
Most of the area in the Fermi chip is taken up by actual
processing elements, i.e., FPUs, ALUs and SFUs.
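The way each thread in this hierarchy locates its data element can be illustrated with the standard CUDA index calculation blockIdx.x * blockDim.x + threadIdx.x, sketched here in plain Python (the block and grid sizes are arbitrary choices for the sketch):

```python
# Mapping (block index, thread index) to a global element index, as a
# CUDA kernel does with blockIdx.x * blockDim.x + threadIdx.x.

block_dim = 32            # threads per block (one warp, in this sketch)
n_blocks  = 4             # blocks in the grid
n         = 100           # problem size; surplus threads do no work

covered = []
for block_idx in range(n_blocks):
    for thread_idx in range(block_dim):
        gid = block_idx * block_dim + thread_idx   # global thread index
        if gid < n:                                # guard against overrun
            covered.append(gid)

print(covered == list(range(100)))   # True: each element gets one thread
```

The guard `gid < n` is the usual idiom when the grid launches more threads (here 128) than there are data elements.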
When we compare stream processing with other available technologies for achieving
specialized and power-efficient processing, the following broad picture emerges:
i. Application-Specific ICs (ASICs) have comparable performance and are power
efficient, but they involve longer design cycles and higher design costs, and are
less flexible.
ii. Field-Programmable Gate Arrays (FPGAs) are less energy efficient, and do not
allow applications to be programmed in higher-level languages.
iii.
iv.
v.
CRAY XT SUPERCOMPUTERS:
The Cray XT series of supercomputers includes the XT5 and XT6, which offer
petaflops performance.
The XT5 system at Oak Ridge National Laboratory in the US (nicknamed
JAGUAR) is currently rated as the world's most powerful supercomputer.
The XT5 system uses six-core AMD Opteron processors, with a total of over 224,000
processing cores in the system, and can reach a peak performance of over two
petaflops.
The OS platform employed in XT5 is Linux based.
XT5 is based on AMD Opteron processors (quad-core or six-core) in a 2D torus
network, which is built using Cray's proprietary SeaStar interconnect.
The main goals of the XT5, the XT6 and Cray's present supercomputer technologies
are:
a. High computing performance with scalability and programmability.
b. Advanced packaging.
c. Efficient cooling.
d. Low power consumption.
Each diskless node in the XT5 network is made up of Opteron processors, which
have a shared 25.6 GB/sec data path to shared local memory.
The local memory is 16 GB or 32 GB of DDR2 memory provided with ECC (error
correction code).
Each processing core in XT5 has a 64 KB L1 (level-1) instruction cache; in addition,
the processor chip provides a 6 MB shared L3 cache.
The proprietary SeaStar ASIC (application-specific IC) chip has a direct memory
access (DMA) engine, a communication-cum-management processor, and a
service port.
XT6:
At the Supercomputing Conference in November 2009, Cray announced its high-end
XT6 supercomputer system.
XT6 supercomputers employ eight- and twelve-core AMD Opteron processors to
provide higher processing performance than the XT5.
Each computer node in XT6 can be provided with 32GB or 64GB of ECC DDR
local memory.
In future systems, XT6 can be upgraded to 12 and 16 core Opteron processors.
XT5m and XT6m are fully compatible midrange versions of the XT5 and XT6
supercomputers.
Super Computer      Year Introduced                              Speed/Peak Performance
Titan               Oak Ridge National Laboratory (ORNL), 2012   17.59 petaflops (27 petaflops peak)
Jaguar (Cray XT5)   ORNL, 2009
Cray-1              Los Alamos National Laboratory, 1976
Cray X-MP           1982
Cray-2              1985
Cray Y-MP           1988
Cray T3E            1995
Cray XT6            16 November 2009

Other Cray systems include: Cray SV1, Cray XT3, Cray SV2, Cray XD1, Cray-3,
Cray XT4, Cray-3/SSS, Cray XMT, Cray XMS, Cray-4, Cray Y-MP EL, Cray APP,
Cray CX1, Cray C90, Cray S-MP, Cray EL90, Cray CS6400, Cray XE6, Cray T3D,
Cray SX-6, Cray CX1000, Cray J90, Cray MTA-2, Cray XK6, Cray T90,
Cray Red Storm, Cray XK7, Cray X1, and a few more.