Parallel Algorithms
(NCS-063)
SEMESTER-VI
SESSION: 2015-16
Priyanka Goel
Assistant Professor
Department of CS & E
TEXT BOOK:
1. M.J. Quinn, “Designing Efficient Algorithms for
Parallel Computers”, McGraw-Hill
REFERENCE BOOK:
1. S.G. Akl, “Design and Analysis of Parallel Algorithms”
(Pipeline space-time diagram: successive instructions enter the pipeline one cycle apart, each passing through the IF, ID, M, EX, and WB stages, so that up to five instructions are in execution at once.)
5. Write Back: The Write Back (WB) stage is responsible for writing the
result of a calculation, memory access or input into the register file.
Advantages/Disadvantages of Pipelining
Advantages:
1. More efficient use of processor
2. Faster execution of a large number of instructions (increased
throughput)
Disadvantages:
1. Pipelining involves adding hardware to the chip
2. Inability to continuously run the pipeline at full speed because of
pipeline hazards which disrupt the smooth execution of the pipeline.
Pipeline Hazards
1. Data Hazards – an instruction uses the result of the previous instruction. A hazard occurs
exactly when an instruction tries to read a register in its ID stage that an earlier
instruction intends to write in its WB stage.
For ADD R1, R2, R3 the ID stage selects R2 and R3 for the ALU operation, the EX stage adds them, and the WB stage stores the sum in R1. SUB R4, R1, R5 reads R1 and therefore cannot enter its ID stage until the ADD has written R1 back:

Cycle            1    2    3    4    5    6    7    8    9
ADD R1, R2, R3   IF   ID   M    EX   WB
SUB R4, R1, R5        IF   --   --   --   ID   M    EX   WB
                      (three stall cycles until the ADD's WB completes)
I/O Processors (Channels)
• In 1st generation computers, I/O instructions were executed by the CPU.
• The data transmission speed of an I/O device was far slower than the data manipulation speed of a
processor.
• As a result, the CPU spent most of its time idle while executing an I/O instruction.
• This problem was solved by introducing a separate processor to handle I/O operations (in 2nd
generation computers).
• This I/O processor, called a channel, receives I/O instructions from the CPU but then works
independently, freeing the CPU to resume arithmetic processing.
Interleaved Memory
• This is a memory bank divided into a number of modules or banks that can be accessed
simultaneously.
• Each memory bank has its own addressing circuitry and data addresses are interleaved to take
advantage of the parallel fetch capability.
• With low-order interleaving, the low-order bits of an address determine the memory bank containing
the address.
• With high-order interleaving, the high-order bits of an address determine the memory bank.
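As a rough sketch in C (the four-bank memory size and the function names are illustrative assumptions, not from any particular machine), the two schemes differ only in which address bits select the bank:

    #include <stdio.h>

    #define BANKS    4     /* assumed number of banks (power of two) */
    #define MEM_SIZE 64    /* assumed total words, 16 per bank       */

    /* Low-order interleaving: the low-order address bits pick the bank,
       so consecutive addresses fall in different banks and can be
       fetched in parallel. */
    unsigned low_order_bank(unsigned addr)   { return addr % BANKS; }
    unsigned low_order_offset(unsigned addr) { return addr / BANKS; }

    /* High-order interleaving: the high-order address bits pick the bank,
       so each bank holds one contiguous block of the address space. */
    unsigned high_order_bank(unsigned addr)   { return addr / (MEM_SIZE / BANKS); }
    unsigned high_order_offset(unsigned addr) { return addr % (MEM_SIZE / BANKS); }

    int main(void) {
        for (unsigned addr = 0; addr < 8; addr++)
            printf("addr %2u -> low-order bank %u, high-order bank %u\n",
                   addr, low_order_bank(addr), high_order_bank(addr));
        return 0;
    }

With low-order interleaving, the loop shows addresses 0, 1, 2, 3 landing in banks 0, 1, 2, 3, which is what lets a sequential fetch proceed in parallel across banks.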
Cache Memory
• This memory is a small, fast memory unit used as a buffer between a processor and main
memory.
• Its purpose is to reduce the time the processor must spend waiting for data to arrive from the
slower primary memory.
• The efficiency of a cache memory depends on the locality of reference in the program being
run.
• Temporal locality refers to the observed phenomenon that once a particular data or
instruction location is referenced, it is often referenced again in the near future.
• Spatial locality refers to the observation that once a particular memory location is referenced,
a nearby memory location is often referenced in the near future.
• Given a reasonable amount of locality of reference, the majority of the time the processor can
fetch instructions and operands from cache memory, rather than primary memory.
• Only when the instruction or operand is not in the cache memory does the processor remain idle,
waiting for it to arrive from primary memory.
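The following small C illustration (the matrix size is arbitrary) shows both kinds of locality: summing a row-major matrix row by row touches consecutive addresses (spatial locality) and reuses the accumulator every iteration (temporal locality), while the column-by-column version strides across memory and misses in the cache far more often.

    #include <stdio.h>

    #define N 512

    static double a[N][N];   /* C stores this row-major */

    double sum_row_major(void) {
        double sum = 0.0;                 /* reused each step: temporal locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];           /* consecutive addresses: spatial locality */
        return sum;
    }

    /* Column order strides N*sizeof(double) bytes between accesses,
       defeating spatial locality and causing many more cache misses. */
    double sum_column_major(void) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        printf("%f %f\n", sum_row_major(), sum_column_major());
        return 0;
    }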
Superscalar Architecture
• A superscalar architecture includes parallel execution units, which can execute
instructions simultaneously.
• This parallel architecture was first implemented in RISC processors, which use short
and simple instructions to perform calculations.
• Common instructions (arithmetic, load/store, conditional branch) can be initiated and
executed independently in separate pipelines:
• Instructions are not necessarily executed in the order in which they appear in a program.
• The processor attempts to find instructions that can be executed independently, even if
they are out of order.
• Additional registers and register renaming are used to eliminate some dependencies.
• Because of their superscalar capabilities, RISC processors have typically performed better
than CISC processors running at the same megahertz.
• However, most CISC-based processors (such as the Intel Pentium) now include some
RISC architecture as well, which enables them to execute instructions in parallel.
• Nearly all processors developed after 1998 are superscalar.
• A superscalar processor of degree m can issue m instructions per cycle.
• In order to fully utilize a superscalar processor of degree m, m instructions
must be executable in parallel.
• This will not be true in every clock cycle; when fewer than m independent
instructions are available, some of the pipelines stall in a wait state.
Out-of-order execution of instructions
• Instruction issue and completion policies are critical to superscalar processor
performance.
• Following scheduling policies are introduced:
• In-order issue : when instructions are issued in program order
• Out-of-order issue : when instructions are issued out of program order
• In-order completion : If the instructions must be completed in program order
• Out-of-order completion : If the instructions are not completed in program order
• In-order issue is easier to implement but may not yield the optimal performance. In-
order issue may result in either in-order or out-of-order completion.
• Out-of-order issue usually ends up with out-of-order completion.
• The purpose of out-of-order issue and completion is to improve performance.
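The fragment below is a hand-written C illustration (not compiler output) of the register renaming mentioned earlier; think of t, t1, and t2 as machine registers.

    #include <stdio.h>

    int x, y, t, t1, t2;

    /* Before renaming: both statement pairs reuse t, so "t = c + d" must
       not overwrite t until "x = t * 2" has read it (a write-after-read
       hazard); the two pairs cannot issue together. */
    void before(int a, int b, int c, int d) {
        t = a + b;
        x = t * 2;
        t = c + d;      /* false dependency: waits only because of the name t */
        y = t * 3;
    }

    /* After renaming t to t1 and t2: the two chains share no registers, so
       a superscalar processor can issue and complete them in parallel, out
       of order, without changing the results. */
    void after(int a, int b, int c, int d) {
        t1 = a + b;
        x  = t1 * 2;
        t2 = c + d;     /* independent of the first chain */
        y  = t2 * 3;
    }

    int main(void) {
        before(1, 2, 3, 4); printf("%d %d\n", x, y);  /* prints 6 21 */
        after(1, 2, 3, 4);  printf("%d %d\n", x, y);  /* prints 6 21 */
        return 0;
    }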
VLIW Architecture
• The Very Long Instruction Word (VLIW) architecture uses even more functional
units than a superscalar processor.
• Thus the CPI of a VLIW processor can be further lowered.
• Due to the use of very long instructions (256 to 1024 bits per instruction), VLIW
processors have mostly been implemented with microprogrammed control.
• Thus the clock rate is slow, owing to the use of ROM for the microcode.
• A large number of microcode access cycles may be needed for some instructions.
• These processors have instruction words with fixed "slots" for instructions that
map to the functional units available.
• This makes the instruction issue unit much simpler, but places an enormous
burden on the compiler to allocate useful work to every slot of every instruction.
• The number of operations in a VLIW instruction = the number of execution
units in the processor
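As a rough sketch, a VLIW instruction word can be pictured as a C struct with one fixed slot per functional unit; the slot names and widths below are hypothetical, not any real instruction set.

    #include <stdio.h>
    #include <stdint.h>

    enum { NOP = 0 };   /* a slot with no useful work must hold a NOP */

    typedef struct {
        uint32_t int_alu_op;   /* slot for the integer ALU         */
        uint32_t fp_alu_op;    /* slot for the floating-point unit */
        uint32_t load_op;      /* slot for the load unit           */
        uint32_t store_op;     /* slot for the store unit          */
        uint32_t branch_op;    /* slot for the branch unit         */
    } vliw_word;               /* one slot per execution unit      */

    int main(void) {
        /* One very long instruction: the compiler has scheduled an
           integer op, a floating-point op, and a load to issue together;
           the store and branch slots carry NOPs this cycle. */
        vliw_word w = { .int_alu_op = 0x12, .fp_alu_op = 0x34,
                        .load_op = 0x56, .store_op = NOP, .branch_op = NOP };
        printf("instruction word is %zu bytes wide\n", sizeof w);
        return 0;
    }

The struct makes the burden on the compiler visible: every slot of every word must be filled, with a NOP if no independent operation can be found.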
Parallelism in Software
• Parallel algorithms and parallel architectures are closely tied together.
• We cannot think of a parallel algorithm without thinking of the parallel hardware that will support it.
• Conversely, we cannot think of parallel hardware without thinking of the parallel software that will drive it.
1. Data-level parallelism: where we simultaneously operate on multiple bits of a datum or on multiple data.
Examples are bit-parallel addition, multiplication, and division of binary numbers, vector processor
arrays, and systolic arrays for dealing with several data samples. It offers the highest potential for
concurrency and is practiced in both SIMD and MIMD modes. Data parallel code is easier to write and
to debug than control parallel code. Synchronization in SIMD data parallelism is handled by the
hardware.
2. Instruction-level parallelism (ILP)/control parallelism: where the processor simultaneously executes more
than one instruction. It can appear in the form of instruction pipelining or multiple functional units,
and is limited by the pipeline length and by the multiplicity of functional units. Both pipelining and
functional parallelism are handled by the hardware; programmers need to take no special action to invoke them.
3. Thread-level parallelism (TLP): A thread is a portion of a program that shares
processor resources with other threads. A thread is sometimes called a lightweight
process. In TLP, multiple software threads are executed simultaneously on one processor
or on several processors (see the sketch after this list).
4. Process (task)-level parallelism: A process is a program that is running on the
computer. A process reserves its own computer resources, such as memory space and
registers. This is, of course, the classic multitasking and time-sharing computing, where
several programs run simultaneously on one machine or on several machines.
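A minimal sketch of TLP using POSIX threads (compile with -pthread; splitting the array into exactly two halves is just for illustration): both threads share the process's address space and can run simultaneously on separate cores if available.

    #include <pthread.h>
    #include <stdio.h>

    typedef struct { const int *data; int n; long sum; } job;

    /* Each thread sums its own slice of the array independently. */
    static void *partial_sum(void *arg) {
        job *j = arg;
        j->sum = 0;
        for (int i = 0; i < j->n; i++)
            j->sum += j->data[i];
        return NULL;
    }

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        job lo = { a,     4, 0 };       /* first half  */
        job hi = { a + 4, 4, 0 };       /* second half */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, partial_sum, &lo);
        pthread_create(&t2, NULL, partial_sum, &hi);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("sum = %ld\n", lo.sum + hi.sum);   /* prints sum = 36 */
        return 0;
    }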
To solve the mismatch problem between software parallelism and hardware parallelism,
one approach is to develop compilation support and the other is through hardware
redesign for more efficient exploitation by an intelligent compiler. These two approaches
must cooperate with each other to produce the best result.
Algorithm
• The IEEE Standard Dictionary of Electrical and Electronics Terms defines an
algorithm as “A prescribed set of well-defined rules or processes for the solution
of a problem in a finite number of steps”.
• The tasks or processes of an algorithm are interdependent in general.
• Some tasks can run concurrently in parallel and some must run serially or
sequentially one after the other.
• Any algorithm is composed of a serial part and a parallel part.
• In fact, it is very hard to say that one algorithm is serial while the other is parallel,
except in extremely trivial cases.
• If the number of tasks of the algorithm is W , then we say that the work associated
with the algorithm is W .
• The basic components defining an algorithm are
1. the different tasks,
2. the dependencies among the tasks, where a task output is used as another
task's input,
3. the set of primary inputs needed by the algorithm, and
4. the set of primary outputs produced by the algorithm.
Sequential model
• A sequential algorithm uses an abstract model of computation called the Random Access
Machine (RAM).
• In this model, the machine consists of a single processor connected to a memory
system.
• Each basic CPU operation, including arithmetic operations, logical operations, and
memory accesses, requires one time step.
• The designer's goal is to develop an algorithm with modest time and memory
requirements.
• The random-access machine model allows the algorithm designer to ignore many of
the details of the computer on which the algorithm will ultimately be executed, but
captures enough detail that the designer can predict with reasonable accuracy how
the algorithm will perform.
• Generally, software has been written for serial computation:
• To be run on a single computer having a single Central Processing Unit
(CPU);
• A problem is broken into a discrete series of instructions.
• Instructions are executed one after another.
• Only one instruction may execute at any moment in time.
Need of alternative model
• Most of today's algorithms are sequential, that is, they specify a sequence of steps in which
each step consists of a single operation.
• These algorithms are well suited to today's computers, which basically perform operations in
a sequential fashion.
• Although the speed at which sequential computers operate has been improving at an
exponential rate for many years, the improvement is now coming at greater and greater cost.
• As a consequence, researchers have sought more cost-effective improvements by building
parallel computers.
• In order to solve a problem efficiently on a parallel machine, it is usually necessary to design
an algorithm that specifies multiple operations on each step, i.e., a parallel algorithm.
• As an example, consider the problem of computing the sum of a sequence A of n numbers.
• The standard algorithm computes the sum by making a single pass through
the sequence, keeping a running sum of the numbers seen so far.
• It is not difficult however, to devise an algorithm for computing the sum that
performs many operations in parallel.
• For example, suppose that, in parallel, each element of A with an even index
is paired and summed with the next element of A, which has an odd index,
i.e., A[0] is paired with A[1], A[2] with A[3], and so on.
• The result is a new sequence of ⌈n/2⌉ numbers that sum to the same value
as the sum that we wish to compute.
• This pairing and summing step can be repeated until, after ⌈log2 n⌉ steps,
a sequence consisting of a single value is produced, and this value is equal to
the final sum.
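A sequential C sketch of this pairwise summation (the in-place array layout is one possible choice): each pass pairs element 2i with element 2i+1, and after ⌈log2 n⌉ passes a single value remains. On a parallel machine, every iteration of the inner loop would be performed by a separate processor in the same step.

    #include <stdio.h>

    long pairwise_sum(long a[], int n) {
        while (n > 1) {
            int half = (n + 1) / 2;           /* ceil(n/2) values survive   */
            for (int i = 0; i < n / 2; i++)   /* one parallel step          */
                a[i] = a[2*i] + a[2*i + 1];   /* pair even with odd index   */
            if (n % 2)                        /* odd leftover carries over  */
                a[n/2] = a[n - 1];
            n = half;
        }
        return a[0];
    }

    int main(void) {
        long a[7] = {3, 1, 4, 1, 5, 9, 2};
        printf("sum = %ld\n", pairwise_sum(a, 7));  /* prints sum = 25 */
        return 0;
    }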
Parallel Computation
• In the simplest sense, parallel computing is the simultaneous use of multiple compute resources
to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different CPUs
• The computing resources can include:
• A single computer with multiple processors;
• An arbitrary number of computers connected by a network;
• A combination of both.
• The computational problem usually demonstrates characteristics such as the
ability to be:
• broken apart into discrete pieces of work that can be solved
simultaneously;
• executed as multiple program instructions at any moment in time;
• solved in less time with multiple compute resources than with a single
compute resource.
Uses of Parallel Computing
• Historically, parallel computing has been considered to be "the high end of
computing", and has been used to model difficult scientific and engineering
problems found in the real world. Some examples:
• Solve larger problems: Many problems are so large and/or complex that it is impractical or
impossible to solve them on a single computer, especially given limited computer memory. For
example:
• "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring
PetaFLOPS and PetaBytes of computing resources.
• Web search engines/databases processing millions of transactions per second
• Use of non-local resources: Using compute resources on a wide area
network, or even the Internet when local compute resources are
scarce. For example:
• SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for
a compute power of over 528 TeraFLOPS (as of August 04, 2008)
• Folding@home (folding.stanford.edu) uses over 340,000 computers for a
compute power of 4.2 PetaFLOPS (as of November 4, 2008)
• Provide concurrency: A single compute resource can only do one thing at a time. Multiple
computing resources can be doing many things simultaneously. For example, the Access Grid
(www.accessgrid.org) provides a global collaboration network where people from around the
world can meet and conduct work "virtually".
• Limits to serial computing: Both physical and practical reasons pose
significant constraints to simply building ever faster serial computers:
• Transmission speeds - the speed of a serial computer is directly dependent
upon how fast data can move through hardware. Absolute limits are the speed
of light (30 cm/nanosecond) and the transmission limit of copper wire (9
cm/nanosecond). Increasing speeds necessitate increasing proximity of
processing elements.
• Limits to miniaturization - processor technology is allowing an increasing
number of transistors to be placed on a chip. However, even with molecular or
atomic-level components, a limit will be reached on how small components can
be.
• Economic limitations - it is increasingly expensive to make a single processor
faster. Using a larger number of moderately fast commodity processors to
achieve the same (or better) performance is less expensive.
Classification of Parallel Computers : Flynn's
Classical Taxonomy
• There are different ways to classify parallel computers. One of the more widely
used classifications, in use since 1966, is called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer architectures according
to how they can be classified along the two independent dimensions of Instruction
and Data. Each of these dimensions can have only one of two possible states:
Single or Multiple.
• The matrix below defines the 4 possible classifications according to Flynn:
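                            Single Data (SD)    Multiple Data (MD)
 Single Instruction (SI)    SISD                SIMD
 Multiple Instruction (MI)  MISD                MIMD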
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer
• A computer in this class consists of a single processing unit receiving a single stream of instructions
that operate on a single stream of data
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock
cycle
• Single data: only one data stream is being used as input during any one clock cycle
• This is the oldest and, even today, the most common type of computer
• The overwhelming majority of computers today adhere to this model invented by John von Neumann
and his collaborators in the late 1940s. An algorithm for a computer in this class is said to be
sequential (or serial).
• Examples: older generation mainframes, minicomputers and workstations (e.g., the UNIVAC 1, IBM 360, and CRAY 1); most modern day PCs.
Example 2 (conclusion): simulating a multiple write on an EREW computer
• And so on. After log N steps, P1 knows whether all the ai are equal.
• If they are, it proceeds to store a in A; otherwise no writing is allowed to take place.
Example 3: m out of N multiple Read/Write accesses on EREW computer
• This is the more general case, where a multiple read from or a multiple write into a memory location
does not necessarily involve all processors.
• In a typical algorithm, arbitrary subsets of processors may each be attempting to gain access to
different locations, one location per subset.
• Clearly the procedures for broadcasting and storing described in Example 1 no longer work in this
case.
• Another approach is needed in order to simulate such an algorithm on the EREW model with N
processors.
• Say that the algorithm requires a total of M locations of shared memory.
• The idea here is to associate with each of the M locations another 2N - 2 locations.
• Each of the M locations is thought of as the root of a binary tree with N leaves (the
tree has depth log N and a total of 2N – 1 nodes).
• The leaves of each tree are numbered 1 through N and each is associated with the
processor with the same number.
• When m processors, m <= N, need to gain access to location A, they can put their
requests at the leaves of the tree rooted at A.
• For a multiple read from location A, the requests trickle (along with the processors) up
the tree until one processor reaches the root and reads from A.
• The value of A is then sent down the tree to all the processors that need it.
• Similarly, for a multiple-write operation, the processors "carry" the requests up the
tree in the manner described in Example 2.
• After log N steps one processor reaches the root and makes a decision about
writing.
• Going up and down the tree of memory locations requires 2 log N steps.
• The formal description of these simulations is known as multiple broadcasting and
multiple storing, respectively.
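The following C program is a sequential sketch of the tree-based multiple read (the array layout, variable names, and the fixed set of requesting processors are all illustrative). Each tree level corresponds to one parallel step, so the two phases take 2 log N steps in all; the sequential loops here merely simulate those steps.

    #include <stdio.h>

    #define N 8   /* number of processors (leaves); a power of two */

    int main(void) {
        int A = 42;           /* the shared location being read            */
        int want[2*N] = {0};  /* want[i] != 0: a request pends at node i   */
        int value[2*N];       /* the value travelling back down the tree   */

        /* tree[1] is the root associated with A; the leaves are nodes
           N..2N-1, leaf N+k belonging to processor Pk.  Suppose m = 3
           processors, P2, P3 and P6, issue a read of A (m <= N). */
        int readers[] = {2, 3, 6};
        for (int k = 0; k < 3; k++) want[N + readers[k]] = 1;

        /* Phase 1 (log N parallel steps): requests trickle up; each
           internal node forwards at most one request, so every memory
           access stays exclusive. */
        for (int i = N - 1; i >= 1; i--)
            want[i] = want[2*i] || want[2*i + 1];

        /* The single processor that reaches the root performs the one
           (exclusive) read of A. */
        if (want[1]) value[1] = A;

        /* Phase 2 (log N parallel steps): the value is sent down the
           tree to exactly the leaves that requested it. */
        for (int i = 1; i < N; i++) {
            if (want[2*i])     value[2*i]     = value[i];
            if (want[2*i + 1]) value[2*i + 1] = value[i];
        }

        for (int k = 0; k < 3; k++)
            printf("P%d read %d\n", readers[k], value[N + readers[k]]);
        return 0;
    }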
2. Concurrent-Read, Exclusive-Write (CREW) SM SIMD Computers:
• Multiple processors are allowed to read from the same memory location but
the right to write is still exclusive: No two processors are allowed to write into
the same location simultaneously.
• For a concurrent read operation, the value is sent to the shared memory by the
broadcasting processor.
• All other processors then read it from the shared memory.
• This takes a total of O(1) steps, since all processors can read the same value at the
same time.
• CREW is more powerful than the EREW PRAM model (this can be shown using
Example 1).
3. Exclusive-Read, Concurrent-Write (ERCW) SM SIMD Computers
• Multiple processors are allowed to write into the same memory location but read
accesses remain exclusive.
Picking one processor for each competing write
(Figure: several processors, among them P4 and P5, attempt writes into locations such as A[4] and A[5]; for each contested location one processor, e.g. P5, is picked and writes, while the others, e.g. P4, do nothing.)
Lemma 2: Assume m' < m. Any problem that can be solved on a p-processor, m-cell
PRAM in t steps can be solved on a max(p, m')-processor, m'-cell PRAM in O(tm/m') steps.
Proof:
1. Partition the m simulated shared memory cells into m' contiguous segments Si of
size m/m'.
2. Each simulating processor P'i, 1 <= i <= p, will simulate processor Pi of the original
PRAM.
3. Each simulating processor P'i, 1 <= i <= m', stores the initial contents of Si in its local
memory and will use M'[i] as an auxiliary memory cell for simulating accesses to
cells of Si.
4. Simulation of one original READ operation:
each P'i, i = 1, ..., max(p, m'), repeats for k = 1, ..., m/m':
1. write the value of the k-th cell of Si into M'[i], i = 1, ..., m',
2. read the value that the simulated processor Pi, i = 1, ..., p, would read in this simulated
substep, if it appeared in the shared memory.
5. The local computation substep of Pi, i = 1, ..., p, is simulated in one step by P'i.
6. Simulation of one original WRITE operation is analogous to that of READ.
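A sequential C sketch of step 4 of this proof (the processor count, cell count, and read addresses are small illustrative choices; each k-round stands for one parallel step of the simulating PRAM, giving the O(m/m') slowdown per original step):

    #include <stdio.h>

    #define P   4          /* processors of the original PRAM           */
    #define M   16         /* cells of the original PRAM                */
    #define MP  4          /* m' cells of the simulating PRAM (M%MP==0) */
    #define SEG (M / MP)   /* segment size m/m'                         */

    int local[MP][SEG];    /* P'i holds segment Si in its local memory  */
    int Mprime[MP];        /* the m' shared cells M'[i]                 */

    /* Simulate one EREW READ step: original processor j reads original
       cell addr[j]; the value ends up in result[j]. */
    void simulate_read(const int addr[P], int result[P]) {
        for (int k = 0; k < SEG; k++) {            /* m/m' rounds        */
            for (int i = 0; i < MP; i++)           /* substep 1: publish */
                Mprime[i] = local[i][k];           /* k-th cell of Si    */
            for (int j = 0; j < P; j++)            /* substep 2: grab it */
                if (addr[j] % SEG == k)            /* appeared this round*/
                    result[j] = Mprime[addr[j] / SEG];
        }
    }

    int main(void) {
        for (int i = 0; i < MP; i++)               /* original cell      */
            for (int k = 0; k < SEG; k++)          /* i*SEG+k holds      */
                local[i][k] = 100 + i * SEG + k;   /* value 100+cell     */
        int addr[P] = {0, 5, 10, 15}, result[P];
        simulate_read(addr, result);
        for (int j = 0; j < P; j++)
            printf("P%d read M[%d] = %d\n", j, addr[j], result[j]);
        return 0;
    }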
• Such a simulation may be desirable for one of two reasons:
1. The parallel computer available belongs to the EREW class and thus the only way to
execute a CREW, ERCW, or CRCW algorithm is through simulation or
2. Parallel computers of the CREW, ERCW, and CRCW models with a very large
number of processors may be technologically impossible to build.
• Indeed, the number of processors that can be simultaneously connected to a memory location is
limited
(i) not only by the physical size of the device used for that location,
(ii) but also by the device's physical properties (such as voltage).
• Therefore concurrent access to memory by an arbitrary number of processors may not be
realizable in practice.
• Again, in this case simulation is the only way to implement an algorithm that was developed
in theory to use multiple accesses.