
PARALLEL ALGORITHMS

(NCS-063)
SEMESTER-VI
SESSION:2015-16
Priyanka Goel
Assistant Professor
Department of CS & E
TEXT BOOK:
1. M.J. Quinn, “Designing Efficient Algorithms for Parallel Computers”, McGraw-Hill

REFERENCE BOOK:
1. S.G. Akl, “The Design and Analysis of Parallel Algorithms”

2. S.G. Akl, “Parallel Sorting Algorithms”, Academic Press


OBJECTIVES:
• To introduce techniques for the design of efficient parallel
algorithms and their implementation
• To analyze and select the best algorithm for a particular machine/problem (an algorithm good for one architecture may not be good for others)
• To learn how algorithms are made to run efficiently on processor
arrays, multiprocessors and multi-computers.
LEARNING OUTCOMES:
At the end of the course you will be:
• familiar with efficient parallel algorithms related to many areas of computer science: expression computation, sorting, graph-theoretic problems, computational geometry, algorithmics of texts, etc.
• familiar with the basic issues of implementing parallel algorithms.
UNIT-1
• Sequential model
• Need of alternative model
• Parallel computational models such as PRAM,
LMCC, Hypercube, Cube Connected Cycle,
Butterfly, Perfect Shuffle Computers, Tree model,
Pyramid model , Fully Connected model
• PRAM-CREW, EREW models
• Simulation of one model from another one.
CONTENTS
• The Famous Moore’s Law
• Consequences of Moore’s Law
• Various forms of parallelism
• Pipelining
• Advantages/Disadvantages of Pipelining
• Pipeline Hazards
• Stalling in Pipelining
• I/O Processors( Channels) and Interleaved Memory
• Cache Memory
• Super Scalar Architecture
• Out-of-order execution of instructions
• VLIW Architecture
• Parallelism in Software
• Algorithm
• Sequential model
• Need of alternative model
• Parallel computation
• Uses of Parallel Computing
• Classification of Parallel Computers: Flynn Classical Taxonomy
• Single Instruction Single Data (SISD)
• Single Instruction, Multiple Data (SIMD)
• Shared Memory SIMD : PRAM Model, PRAM-CREW, EREW models
• Interconnection network SIMD
• Multiple Instruction, Single Data (MISD)
• Multiple Instruction, Multiple Data ( MIMD)
• Models such as LMCC, Hypercube, Cube Connected Cycle,
Butterfly, Perfect Shuffle Computers, Tree model, Pyramid model,
Fully Connected model
• Simulation of one model from another one.
The Famous Moore’s Law
• Speed of conventional processors is limited by:
• line delays: signal transmission time between gates
• gate delays: settling time before state can be reliably read
• Both can be improved by reducing device size, but this is in turn
ultimately limited by:
• heat dissipation
• thermal noise (degradation of signal-to-noise ratio)
• Heat dissipation is current binding constraint on processor speed.
• It was implicitly assumed that more transistors per chip = more
performance. BUT …
• ~1986 – 2002 50% performance increase
• Since 2002 ~20% performance increase
• Number of transistors per chip that yields minimum cost per
transistor increases by factor of two every two years
• But it does not say that microprocessor performance or clock
speed doubles every two years.
• Nevertheless, clock speed did in fact double every two years from
roughly 1975 to 2005, but has now flattened at about 3 GHz due
to limitations on power (heat) dissipation.
Consequences of Moore’s Law
• Smaller circuits are more efficient, so one can either maintain
same power but increase clock speed (historical trend ~1975–
2005)
• Or, maintain same power and clock speed but increase
functionality (current trend)
• Or, maintain same clock speed but use less power (future
trend?)
• For given clock speed, increasing performance depends on producing
more results per cycle, which can be achieved by :
• Exploiting various forms of parallelism
• I/O Processors/Channels
• Pipelined functional units
• Superscalar architecture (multiple instructions per cycle)
• Out-of-order execution of instructions
• SIMD instructions (multiple sets of operands per instruction)
• Memory hierarchy (larger caches and more levels of cache)
• Multicore and multithreaded processors
Consequently, almost all processors today are parallel !!
This course is about how to design and analyze efficient parallel
algorithms for such architectures and applications.
Various forms of parallelism
• Parallelism in Hardware
▪ Parallelism in a Uniprocessor
– Pipelining
– Superscalar, VLIW etc.
▪ SIMD instructions, Vector processors, GPUs
▪ Multiprocessor
– Shared-memory multiprocessors
– Distributed-memory multiprocessors
– Chip-multiprocessors a.k.a. Multi-cores
▪ Multicomputers a.k.a. clusters
• Parallelism in Software
▪ Instruction level parallelism
▪ Task-level parallelism
▪ Data parallelism
▪ Thread level parallelism
Pipelining
• Pipelining is an implementation technique where multiple instructions are
overlapped in execution.
• The computer pipeline is divided in stages. Each stage completes a part of an
instruction in parallel.
• The stages are connected one to the next to form a pipe - instructions enter at one
end, progress through the stages, and exit at the other end.
• Pipelining does not decrease the time for individual instruction execution.
• Instead, it increases instruction throughput.
• The throughput of the instruction pipeline is determined by how often an
instruction exits the pipeline.
• Because the pipe stages are hooked together, all the stages must be ready to
proceed at the same time.
• We call the time required to move an instruction one step further in the
pipeline a machine cycle .
• The length of the machine cycle is determined by the time required for
the slowest pipe stage.
[Figure: Four Pipelined Instructions – four instructions overlap in the pipeline, each passing through the IF, ID, EX, MEM and WB stages and starting one machine cycle after the previous one.]
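To make the throughput argument concrete, here is a minimal timing sketch (not from the text; the stage latencies are assumed values chosen only for illustration). It compares total execution time with and without pipelining when the machine cycle is set by the slowest stage.

```python
# Minimal sketch: pipelined vs. unpipelined execution time for n instructions.
# Stage latencies (in ns) are assumed values chosen only for illustration.
STAGE_LATENCY = {"IF": 2, "ID": 1, "EX": 2, "MEM": 3, "WB": 1}

def unpipelined_time(n):
    # Each instruction passes through every stage before the next one starts.
    return n * sum(STAGE_LATENCY.values())

def pipelined_time(n):
    # The machine cycle equals the slowest stage; the first instruction needs
    # k cycles and every further instruction completes one cycle later.
    cycle = max(STAGE_LATENCY.values())
    k = len(STAGE_LATENCY)
    return (k + n - 1) * cycle

if __name__ == "__main__":
    for n in (1, 4, 1000):
        print(n, unpipelined_time(n), pipelined_time(n))
```

For a single instruction the pipelined time is not smaller, consistent with the remark above that pipelining increases throughput rather than shortening individual instruction execution.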


1. Instruction Fetch: The Instruction Fetch (IF) stage is responsible for obtaining
the requested instruction from memory. The instruction and the program
counter (which is incremented to point to the next instruction) are stored in the
IF/ID pipeline register as temporary storage so that they may be used by the next
stage at the start of the next clock cycle.
2. Instruction Decode: The Instruction Decode (ID) stage is responsible for
decoding the instruction and sending out the various control lines to the other
parts of the processor. The instruction is sent to the control unit where it is
decoded and the registers are fetched from the register file.
3. Execution: The Execution (EX) stage is where any calculations are performed.
The main component in this stage is the ALU, which provides the
arithmetic and logic capabilities.
4. Memory and I/O : The Memory and IO (MEM) stage is responsible for
storing and loading values to and from memory. It is also responsible for
input or output from the processor. If the current instruction is not of the
Memory or I/O type, then the result from the ALU is passed through to
the write-back stage.

5. Write Back: The Write Back (WB) stage is responsible for writing the
result of a calculation, memory access or input into the register file.
Advantages/Disadvantages of Pipelining
Advantages:
1. More efficient use of the processor
2. Quicker execution of a large number of instructions (increased
throughput)

Disadvantages:
1. Pipelining involves adding hardware to the chip
2. Inability to continuously run the pipeline at full speed because of
pipeline hazards which disrupt the smooth execution of the pipeline.
Pipeline Hazards
1. Data Hazards – an instruction uses the result of the previous instruction. A hazard occurs
exactly when an instruction tries to read a register in its ID stage that an earlier
instruction intends to write in its WB stage.
[Figure: data hazard example – ADD R1, R2, R3 is followed by SUB R4, R1, R5. SUB selects R1 and R5 for its ALU operation in its ID stage before ADD has stored the sum in R1 during its WB stage.]

2. Control Hazards – the location of the next instruction to fetch depends on the outcome of a previous instruction (e.g., a branch)


3. Structural Hazards – two instructions need to access the same resource
Stalling in Pipelining
• Stalling involves halting the flow of instructions until the required result is
ready to be used. However stalling wastes processor time by doing nothing
while waiting for the result.
[Figure: ADD R1, R2, R3 is followed by three stall cycles before SUB R4, R1, R5 enters the pipeline, so that SUB reads R1 only after ADD has written the result back.]
I/O Processors (Channels)
• In 1st generation computers I/O instructions were executed by the CPU.
• The data transmission speed of an I/O device was far slower than the data manipulation speed of a
processor.
• As a result CPU spent most of its time idling while executing an I/O instruction.
• This problem was solved by introducing a separate processor to handle I/O operations(in 2nd
generation computers).
• This I/O processor called a channel receives I/O instructions from the CPU but then works
independently , freeing the CPU to resume arithmetic processing.
Interleaved Memory
• This is a memory bank divided into a number of modules or banks that can be accessed
simultaneously.
• Each memory bank has its own addressing circuitry and data addresses are interleaved to take
advantage of the parallel fetch capability.
• With low-order interleaving the low-order bits of an address determine the memory bank containing
the address
• With high-order interleaving the high order bits of an address determine the memory bank.
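As a rough illustration of the two interleaving schemes, the sketch below maps addresses to banks by bit-slicing; the bank count and address width are assumed values, not taken from the text.

```python
# Sketch: mapping an address to a memory bank. The bank count (a power of
# two) and the address width are assumed values used only for illustration.
NUM_BANKS = 4
ADDR_BITS = 16
BANK_BITS = NUM_BANKS.bit_length() - 1   # log2(NUM_BANKS)

def low_order_bank(addr):
    # Low-order bits select the bank, so consecutive addresses fall into
    # different banks and can be fetched in parallel.
    return addr & (NUM_BANKS - 1)

def high_order_bank(addr):
    # High-order bits select the bank, so each bank holds one contiguous
    # block of the address space.
    return addr >> (ADDR_BITS - BANK_BITS)

if __name__ == "__main__":
    for addr in range(8):
        print(addr, low_order_bank(addr), high_order_bank(addr))
```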
Cache Memory
• This memory is a small, fast memory unit used as a buffer between a processor and main
memory.
• Its purpose is to reduce the time the processor must spend waiting for data to arrive from the
slower primary memory.
• The efficiency of a cache memory depends on the locality of reference in the program being
run.
• Temporal locality refers to the observed phenomenon that once a particular data or
instruction location is referenced, it is often referenced again in the near future.
• Spatial locality refers to the observation that once a particular memory location is referenced,
a nearby memory location is often referenced in the near future.
• Given a reasonable amount of locality of reference, the majority of the time the processor can
fetch instructions and operands from cache memory, rather than primary memory.
• Only when the instruction or operand is not in the cache memory does the processor have to wait for the slower primary memory.
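The effect of locality can be quantified with the standard average-memory-access-time formula; the sketch below uses assumed latencies and hit rates purely for illustration.

```python
# Sketch: average memory access time (AMAT) as the hit rate improves.
# The latencies (in processor cycles) and hit rates are assumed numbers.
CACHE_HIT_TIME = 1     # cycles to read a word that is in the cache
MISS_PENALTY = 100     # extra cycles to fetch the word from primary memory

def amat(hit_rate):
    # Every access pays the cache lookup; only misses also pay the penalty.
    return CACHE_HIT_TIME + (1.0 - hit_rate) * MISS_PENALTY

if __name__ == "__main__":
    for h in (0.80, 0.95, 0.99):
        print(f"hit rate {h:.2f} -> {amat(h):5.1f} cycles per access")
```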
Super Scalar Architecture
• A superscalar architecture includes parallel execution units, which can execute
instructions simultaneously.
• This parallel architecture was first implemented in RISC processors, which use short
and simple instructions to perform calculations.
• Common instructions (arithmetic, load/store, conditional branch) can be initiated and
executed independently in separate pipelines.
• Instructions are not necessarily executed in the order in which they appear in a program;
the processor attempts to find instructions that can be executed independently, even if
they are out of order.
• Additional registers and register renaming are used to eliminate some dependencies.
• Because of their superscalar capabilities, RISC processors have typically performed better
than CISC processors running at the same megahertz.
• However, most CISC-based processors (such as the Intel Pentium) now include some
RISC architecture as well, which enables them to execute instructions in parallel.
• Nearly all processors developed after 1998 are superscalar.
• A superscalar processor of degree m can issue m instructions per cycle.
• In order to fully utilize a superscalar processor of degree m, m instructions
must be executable in parallel.
• This situation may not hold in every clock cycle; in such cycles some of the
pipelines will be stalled in a wait state.
Out-of-order execution of instructions
• Instruction issue and completion policies are critical to superscalar processor
performance.
• Following scheduling policies are introduced:
• In-order issue : when instructions are issued in program order
• Out-of-order issue : when program order is violated
• In-order completion : If the instructions must be completed in program order
• Out-of-order completion : If the instructions are not completed in program order
• In-order issue is easier to implement but may not yield the optimal performance. In-
order issue may result in either in-order or out-of-order completion.
• Out-of-order issue usually ends up with out-of-order completion.
• The purpose of out-of-order issue and completion is to improve performance.
VLIW Architecture
• The Very Long Instruction Word (VLIW) architecture uses even more functional
units than that of a superscalar processor.
• Thus the CPI of a VLIW processor can be further lowered.
• Due to the use of very long instructions( 256 to 1024 bits per instruction), VLIW
processors have been mostly implemented with microprogrammed control.
• Thus the clock rate is slow with the use of ROM.
• A large number of microcode access cycles may be needed for some instructions.
• These processors have instruction words with fixed "slots" for instructions that
map to the functional units available.
• This makes the instruction issue unit much simpler, but places an enormous
burden on the compiler to allocate useful work to every slot of every instruction.
• The number of operations in a VLIW instruction = the number of execution
units in the processor
Parallelism in Software
• Parallel algorithms and parallel architectures are closely tied together.
• We cannot think of a parallel algorithm without thinking of the parallel hardware that will support it.
• Conversely, we cannot think of parallel hardware without thinking of the parallel software that will drive it.

1. Data-level parallelism: where we simultaneously operate on multiple bits of a datum or on multiple data.
Examples of this are bit-parallel addition, multiplication and division of binary numbers, vector processor
arrays and systolic arrays for dealing with several data samples. It offers the highest potential for concurrency
and is practiced in both SIMD and MIMD modes. Data-parallel code is easier to write and to debug than
control-parallel code. Synchronization in SIMD data parallelism is handled by the hardware. (A minimal
software sketch of data parallelism appears at the end of this section.)
2. Instruction - level parallelism (ILP)/ Control parallelism: where we simultaneously execute more than one
instruction by the processor. It can appear in the form of instruction pipelining or multiple functional units
and is limited by the pipeline length and by the multiplicity of functional units. Both pipelining and
functional parallelism are handled by the hardware; programmers need take no special actions to invoke them.
3. Thread - level parallelism (TLP): A thread is a portion of a program that shares
processor resources with other threads. A thread is sometimes called a lightweight
process. In TLP, multiple software threads are executed simultaneously on one processor
or on several processors.
4. Process(Task) - level parallelism: A process is a program that is running on the
computer. A process reserves its own computer resources such as memory space and
registers. This is, of course, the classic multitasking and time -sharing computing where
several programs are running simultaneously on one machine or on several machines.
To solve the mismatch problem between software parallelism and hardware parallelism,
one approach is to develop compilation support and the other is through hardware
redesign for more efficient exploitation by an intelligent compiler. These two approaches
must cooperate with each other to produce the best result.
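As a minimal software illustration of data-level parallelism (the sketch promised above), the following Python fragment applies the same operation to different chunks of the data on separate worker processes; the worker count and chunking scheme are arbitrary choices, not prescribed by the text.

```python
# Sketch of data-level parallelism in software: the same operation (a sum)
# is applied to different chunks of the data by different worker processes.
from multiprocessing import Pool

def chunk_sum(chunk):
    # The operation applied, in parallel, to each slice of the data.
    return sum(chunk)

def parallel_sum(data, workers=4):
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        partials = pool.map(chunk_sum, chunks)   # data-parallel step
    return sum(partials)                         # small serial reduction

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_sum(data) == sum(data))       # True
```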
Algorithm
• The IEEE Standard Dictionary of Electrical and Electronics Terms defines an
algorithm as “ A prescribed set of well - defined rules or processes for the solution
of a problem in a finite number of steps ”.
• The tasks or processes of an algorithm are interdependent in general.
• Some tasks can run concurrently in parallel and some must run serially or
sequentially one after the other.
• Any algorithm is composed of a serial part and a parallel part.
• In fact, it is very hard to say that one algorithm is serial while the other is parallel,
except in extremely trivial cases.
• If the number of tasks of the algorithm is W , then we say that the work associated
with the algorithm is W .
• The basic components defining an algorithm are
1. the different tasks,
2. the dependencies among the tasks where a task output is used as another
task ’s input,
3. the set of primary inputs needed by the algorithm, and
4. the set of primary outputs produced by the algorithm.
Sequential model
• Sequential algorithm uses an abstract model of computation called Random Access
Machine(RAM).
• In this model, the machine consists of a single processor connected to a memory
system.
• Each basic CPU operation, including arithmetic operations, logical operations, and
memory accesses, requires one time step.
• The designer's goal is to develop an algorithm with modest time and memory
requirements.
• The random-access machine model allows the algorithm designer to ignore many of
the details of the computer on which the algorithm will ultimately be executed, but
captures enough detail that the designer can predict with reasonable accuracy how
the algorithm will perform.
• Generally, software has been written for serial computation:
• To be run on a single computer having a single Central Processing Unit
(CPU);
• A problem is broken into a discrete series of instructions.
• Instructions are executed one after another.
• Only one instruction may execute at any moment in time.
Need of alternative model
• Most of today's algorithms are sequential, that is, they specify a sequence of steps in which
each step consists of a single operation.
• These algorithms are well suited to today's computers, which basically perform operations in
a sequential fashion.
• Although the speed at which sequential computers operate has been improving at an
exponential rate for many years, the improvement is now coming at greater and greater cost.
• As a consequence, researchers have sought more cost-effective improvements by building
parallel computers.
• In order to solve a problem efficiently on a parallel machine, it is usually necessary to design
an algorithm that specifies multiple operations on each step, i.e., a parallel algorithm.
• As an example, consider the problem of computing the sum of a sequence A of n numbers.
• The standard algorithm computes the sum by making a single pass through
the sequence, keeping a running sum of the numbers seen so far.
• It is not difficult however, to devise an algorithm for computing the sum that
performs many operations in parallel.
• For example, suppose that, in parallel, each element of A with an even index
is paired and summed with the next element of A, which has an odd index,
i.e., A[0] is paired with A[1], A[2] with A[3], and so on.
• The result is a new sequence of ⌈n/2⌉ numbers that sum to the same value
as the sum that we wish to compute.
• This pairing and summing step can be repeated until, after ⌈log₂ n⌉ steps,
a sequence consisting of a single value is produced, and this value is equal to
the final sum.
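The pairing-and-summing idea can be expressed directly; the sketch below simulates the parallel algorithm on a sequential machine, one loop iteration per parallel step, so that the number of rounds equals ⌈log₂ n⌉.

```python
# Sketch: simulating the parallel pairwise summation. Each while-iteration
# corresponds to one parallel step in which all pair additions are
# independent, so a PRAM-style machine would need ceil(log2 n) steps.
import math

def pairwise_parallel_sum(a):
    values = list(a)
    rounds = 0
    while len(values) > 1:
        nxt = []
        for i in range(0, len(values) - 1, 2):
            nxt.append(values[i] + values[i + 1])   # A[2i] + A[2i+1]
        if len(values) % 2 == 1:
            nxt.append(values[-1])                  # odd element carries over
        values = nxt
        rounds += 1
    return values[0], rounds

if __name__ == "__main__":
    a = list(range(1, 11))                          # 1 + 2 + ... + 10
    total, rounds = pairwise_parallel_sum(a)
    print(total, rounds, math.ceil(math.log2(len(a))))   # 55 4 4
```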
Parallel Computation
• In the simplest sense, parallel computing is the simultaneous use of multiple compute resources
to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different CPUs
• The computing resources can include:
• A single computer with multiple processors;
• An arbitrary number of computers connected by a network;
• A combination of both.
• The computational problem usually demonstrates characteristics such as the
ability to be:
• Broken apart into discrete pieces of work that can be solved
simultaneously;
• Execute multiple program instructions at any moment in time;
• Solved in less time with multiple compute resources than with a single
compute resource.
Uses of Parallel Computing
• Historically, parallel computing has been considered to be "the high end of
computing", and has been used to model difficult scientific and engineering
problems found in the real world. Some examples:

1. Weather Prediction: Forecasting the weather on a computer requires the solution
of general circulation model equations in a spherical coordinate system. A 3D grid
partitions the atmosphere by altitude, latitude and longitude. Time is the fourth
dimension, and it too is partitioned by specifying a time increment. A computer
capable of 100 million floating point operations per second (100 megaflops), like
the Cray-1, would require 24 hours to complete the 24-hour forecast. Moreover, if
we want to receive accurate long-range forecasts, much more powerful computers
must be developed.
2. Artificial Intelligence: Most current computers have a relatively inflexible
I/O interface. If computers need to be more “user friendly” they must be
able to interact with humans at a higher level using speech, pictures and
natural language. Allowing voice, pictorial and natural language input to be
handled in real time requires an enormous amount of computing power.
3. Remote sensing: The analysis of earth-resource data broadcast from
satellites has many applications in agriculture, ecology, forestry, geology
and land-use planning. However images are often so large that even simple
calculations require large amounts of CPU time.
Other examples are:
Physics - applied, nuclear, particle, condensed matter, high pressure, fusion,
photonics
• Biotechnology, Bio-science, Genetics
• Chemistry, Molecular Sciences
• Geology, Seismology
• Mechanical Engineering - from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics
Today, commercial applications provide an equal or greater driving
force in the development of faster computers. These applications
require the processing of large amounts of data in sophisticated ways.
For example:
• Databases, data mining
• Oil exploration
• Web search engines, web based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Management of national and multi-national corporations
• Financial and economic modeling
• Advanced graphics and virtual reality, particularly in the
entertainment industry
• Networked video and multi-media technologies
• Collaborative work environments
Why Use Parallel Computing?
• Save time and/or money: In theory, throwing more resources at a task will shorten its time to
completion, with potential cost savings. Parallel clusters can be built from cheap, commodity
components.

• Solve larger problems: Many problems are so large and/or complex that it is impractical or
impossible to solve them on a single computer, especially given limited computer memory. For
example:
• "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring
PetaFLOPS and PetaBytes of computing resources.
• Web search engines/databases processing millions of transactions per second
• Use of non-local resources: Using compute resources on a wide area
network, or even the Internet when local compute resources are
scarce. For example:
• SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for
a compute power over 528 TeraFLOPS (as of August 04, 2008)
• Folding@home (folding.stanford.edu) uses over 340,000 computers for a
compute power of 4.2 PetaFLOPS (as of November 4, 2008)
• Provide concurrency: A single compute resource can only do one thing at a time. Multiple
computing resources can be doing many things simultaneously. For example, the Access Grid
(www.accessgrid.org) provides a global collaboration network where people from around the
world can meet and conduct work "virtually".
• Limits to serial computing: Both physical and practical reasons pose
significant constraints to simply building ever faster serial computers:
• Transmission speeds - the speed of a serial computer is directly dependent
upon how fast data can move through hardware. Absolute limits are the speed
of light (30 cm/nanosecond) and the transmission limit of copper wire (9
cm/nanosecond). Increasing speeds necessitate increasing proximity of
processing elements.
• Limits to miniaturization - processor technology is allowing an increasing
number of transistors to be placed on a chip. However, even with molecular or
atomic-level components, a limit will be reached on how small components can
be.
• Economic limitations - it is increasingly expensive to make a single processor
faster. Using a larger number of moderately fast commodity processors to
achieve the same (or better) performance is less expensive.
Classification of Parallel Computers : Flynn's
Classical Taxonomy
• There are different ways to classify parallel computers. One of the more widely
used classifications, in use since 1966, is called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer architectures according
to how they can be classified along the two independent dimensions of Instruction
and Data. Each of these dimensions can have only one of two possible states:
Single or Multiple.
• The four possible classifications according to Flynn, each described in turn below, are SISD, SIMD, MISD and MIMD.
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer
• A computer in this class consists of a single processing unit receiving a single stream of instructions
that operate on a single stream of data
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock
cycle
• Single data: only one data stream is being used as input during any one clock cycle
• This is the oldest and even today, the most common type of computer
• The overwhelming majority of computers today adhere to this model invented by John von Neumann
and his collaborators in the late 1940s. An algorithm for a computer in this class is said to be
sequential (or serial).
• Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.
(Pictured: UNIVAC 1, IBM 360, CRAY-1, CDC 7600, PDP-1.)

Example:
• In order to compute the sum of n numbers, the processor needs to gain access
to the memory n consecutive times, each time receiving one number.
• There are also n − 1 additions involved that are executed in sequence.
Therefore, this computation requires on the order of n operations in total.
• This example shows that algorithms for SISD computers do not contain any
parallelism. The reason is obvious: there is only one processor!
Single Instruction, Multiple Data (SIMD)
• In this class, a parallel computer consists of N identical processors.
• Each of the N processors possesses its own local memory where it can
store both programs and data.
• All processors operate under the control of a single instruction stream
issued by a central control unit.
• Equivalently, the N processors may be assumed to hold identical copies of
a single program, each processor's copy being stored in its local memory.
• There are N data streams, one per processor.
• The processors operate synchronously: At each step, all processors
execute the same instruction, each on a different datum.
• The instruction could be a simple one (such as adding or comparing two
numbers) or a complex one (such as merging two lists of numbers).
• Similarly, the datum may be simple (one number) or complex (several
numbers).
• Sometimes, it may be necessary to have only a subset of the processors execute an
instruction. This information can be encoded in the instruction itself, thereby telling a
processor whether it should be active (and execute the instruction) or inactive (and
wait for the next instruction).
• There is a mechanism, such as a global clock, that ensures lock-step operation.
• Thus processors that are inactive during an instruction or those that complete
execution of the instruction before others may stay idle until the next instruction is
issued.
• The time interval between two instructions may be fixed or may depend on the
instruction being executed.
• Best suited for specialized problems characterized by a high degree of regularity, such
as graphics/image processing.
• In most problems that we wish to solve on an SIMD computer, it is
desirable for the processors to be able to communicate among
themselves during the computation in order to exchange data or
intermediate results.
• This can be achieved in two ways, giving rise to two subclasses:
SIMD computers where communication is through a shared memory
and those where it is done via an interconnection network.
• Two varieties: Processor Arrays and Vector Pipelines
Shared-Memory (SM) SIMD Computers
• This class is also known as the Parallel Random-Access Machine (PRAM) model.
• Here, the N processors share a common memory that they use in the same way a
group of people may use a bulletin board.
• The size of the shared memory is called the communication width.
• This model consists of a control processor, a global random access memory and a
set of parallel processing units.
• When two processors wish to communicate, they do so through the shared
memory.
• Say processor i wishes to pass a number to processor j.
• This is done in two steps. First, processor i writes the number in the shared
memory at a given location known to processor j. Then, processor j reads the
number from that location.
• During the execution of a parallel algorithm, the N processors gain access to the
shared memory for reading input data, for reading or writing intermediate results,
and for writing final results.
• The basic model allows all processors to gain access to the shared memory
simultaneously if the memory locations they are trying to read from or write
into are different.
• However, the class of shared-memory SIMD computers can be further divided
into four subclasses, according to whether two or more processors can gain
access to the same memory location simultaneously:
• With shared memory, the model must specify how concurrent read and
concurrent write of memory are handled.
• Four memory update options are possible:
• Exclusive Read(ER): This allows at most one processor to read from
any memory location in each cycle, a rather restrictive policy.
• Exclusive Write(EW): This allows at most one processor to write into a
memory location at a time.
• Concurrent Read(CR): This allows multiple processors to read the
same information from the same memory cell in the same cycle.
• Concurrent Write(CW): This allows simultaneous writes to the same
memory location. In order to avoid confusion, some policy must be
set up to resolve the write conflicts.
• Various combinations of the above options lead to several variants of the
PRAM model. Since CR does not create a conflict problem , variants
differ mainly in how they handle the CW conflicts.
1. Exclusive-Read, Exclusive-Write (EREW) SM SIMD Computers:
• Access to memory locations is exclusive. In other words, this model forbids more
than one processor from reading or writing the same memory cell simultaneously.
• The EREW model is the weakest of the four subclasses of the shared-memory
approach, as it restricts its access to a given address to one processor at a time.
• An algorithm for such a computer must be specifically designed to exclude any
attempt by more than one processor to read from or write into same location
simultaneously.
• This model can be used to simulate multiple accesses, at the cost of increasing
the space and/or time requirements of an algorithm.
Example 1: Searching a value among n entries on EREW computer
• Consider a very large computer file consisting of n distinct entries.
• We shall assume for simplicity that the file is not sorted in any order. (In fact, it may be the
case that keeping the file sorted at all times is impossible or simply inefficient.)
• Now suppose that it is required to determine whether a given item x is present in the file in
order to perform a standard database operation, such as read, update, or delete.
• On a conventional (i.e. SISD) computer, retrieving x requires n steps in the worst case where
each step is a comparison between x and a file entry. The worst case clearly occurs when x is
either equal to the last entry or not equal to any entry.
• On the average, if the file entries are distributed uniformly over a given range, then half as
many steps are required to retrieve x.
• The job can be done a lot faster on an EREW SM SIMD computer with N processors, where N
< n. Let us denote the processors by P1, P2, ..., PN. To begin with, we need to let all the
processors know the value of x. This can be done using an operation known as broadcasting:
• 1. P1 reads x and communicates it to P2.
2. Simultaneously, P1 and P2 communicate x to P3 and P4 respectively.
3. Simultaneously, P1, P2, P3, and P4, communicate x to P5, P6, P7, and P8,
respectively, and so on.
• The process continues until all processors obtain x. As the number of processors that
receive x doubles at each stage, broadcasting x to all N processors requires log N
steps
• Now the file to be searched for x is subdivided into sub files that are searched
simultaneously by the processors: P1 searches the first n/N elements, P2 searches the
second n/N elements, and so on.
• Since all sub files are of the same size, n/N steps are needed in the worst case to
answer the query about x. In total, therefore, this parallel algorithm requires log N +
n/N steps in the worst case.
• On the average, we can do better than that (as was done with the SISD computer): A
location F holding a Boolean value can be set aside in the shared memory to signal
that one of the processors has found the item sought and, consequently, that all other
processors should terminate their search.
• Initially, F is set to false. When a processor finds x in its subfile, it sets F
to true. At every step of the search all processors check F to see if it is
true and stop if this is the case. Unfortunately, this modification of the
algorithm does not come for free: log N steps are needed to broadcast the
value of F each time the processors need it.
• This leads to a total of log N + (n/N)log N steps in the worst case. It is
possible to improve this behavior by having the processors either check the
value of F at every (log N)th step, or broadcast it (once true) concurrently
with the search process.
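A minimal sequential simulation of this EREW search, assuming N divides the work as evenly as possible; it is meant only to make the log N + n/N step count concrete, not to model a real SIMD machine.

```python
# Sketch: simulating the EREW broadcast followed by the subfile search.
# Step counts are tracked to show the log N + n/N bound.
def broadcast_steps(n_procs):
    # The number of processors that know x doubles each step.
    have, steps = 1, 0
    while have < n_procs:
        have = min(2 * have, n_procs)
        steps += 1
    return steps                     # = ceil(log2 N)

def erew_search(entries, x, n_procs):
    bcast = broadcast_steps(n_procs)
    size = (len(entries) + n_procs - 1) // n_procs   # about n/N entries each
    found, scan = False, 0
    for p in range(n_procs):
        sub = entries[p * size:(p + 1) * size]       # Pp's subfile
        scan = max(scan, len(sub))                   # parallel cost = longest scan
        found = found or (x in sub)
    return found, bcast + scan                       # log N + n/N steps

if __name__ == "__main__":
    entries = list(range(0, 64, 2))                  # 32 unsorted entries (by assumption)
    print(erew_search(entries, 40, n_procs=4))       # (True, 2 + 8)
```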
Example 2 : N multiple Write accesses on EREW computer
• Suppose that we want to run a parallel algorithm involving multiple accesses on an
EREW SM SIMD computer with N processors P1, P2, . . . , PN.
• Suppose further that every multiple access means that all N processors are attempting
to read from or write into the same memory location A.
• We can simulate multiple-read operations on an EREW computer using a broadcast
procedure as explained in earlier example.
• This way, A can be distributed to all processors in log N steps.
• Similarly, a procedure symmetrical to broadcasting can be used to handle multiple-
write operations.
• Assume that the N processors are allowed to write in A simultaneously only if they are
all attempting to store the same value.
• Let the value that Pi is attempting to write be denoted by ai, 1 <= i <= N.
• The procedure to store in A works as follows: the processors compare their values
pairwise in a tree-like fashion, each step recording whether the values seen so far
agree (a sketch of this pairwise comparison is given in the code below).
• After log N steps, P1 knows whether all the ai are equal.
• If they are, it proceeds to store a1 in A; otherwise no writing is allowed to take place.
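The sketch below simulates this multiple-write procedure under one reasonable reading of the slides: the processors compare their values pairwise in a balanced tree, so P1 learns after about log N steps whether all values agree, and only then performs a single exclusive write.

```python
# Sketch: simulated multiple write on an EREW machine. The exact pairing
# scheme is an assumption about the intended procedure.
def simulated_common_write(memory, location, values):
    pairs = [(v, True) for v in values]     # (candidate value, "all equal so far")
    steps = 0
    while len(pairs) > 1:
        nxt = []
        for i in range(0, len(pairs) - 1, 2):
            (v1, ok1), (v2, ok2) = pairs[i], pairs[i + 1]
            nxt.append((v1, ok1 and ok2 and v1 == v2))
        if len(pairs) % 2 == 1:
            nxt.append(pairs[-1])
        pairs = nxt
        steps += 1                          # one parallel comparison step
    value, all_equal = pairs[0]
    if all_equal:
        memory[location] = value            # single exclusive write by P1
    return all_equal, steps

if __name__ == "__main__":
    A = {}
    print(simulated_common_write(A, "A", [7, 7, 7, 7]), A)   # (True, 2) {'A': 7}
    print(simulated_common_write(A, "A", [7, 7, 3, 7]), A)   # (False, 2), A unchanged
```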
Example 3: m out of N multiple Read/Write accesses on EREW computer
• This is the more general case where a multiple read from or a multiple write into a memory location
does not necessarily implicate all processors.
• In a typical algorithm, arbitrary subsets of processors may be each attempting to gain access to
different locations, one location per subset.
• Clearly the procedures for broadcasting and storing described in Example 1 no longer work in this
case.
• Another approach is needed in order to simulate such an algorithm on the EREW model with N
processors.
• Say that the algorithm requires a total of M locations of shared memory.
• The idea here is to associate with each of the M locations another 2N - 2 locations.
• Each of the M locations is thought of as the root of a binary tree with N leaves (the
tree has depth log N and a total of 2N – 1 nodes).
• The leaves of each tree are numbered 1 through N and each is associated with the
processor with the same number.
• When m processors, m <= N, need to gain access to location A, they can put their
requests at the leaves of the tree rooted at A.
• For a multiple read from location A, the requests trickle (along with the processors) up
the tree until one processor reaches the root and reads from A.
• The value of A is then sent down the tree to all the processors that need it.
• Similarly, for a multiple-write operation, the processors "carry" the requests up the
tree in the manner described in Example 2.
• After log N steps one processor reaches the root and makes a decision about
writing.
• Going up and down the tree of memory locations requires 2 log N steps.
• The formal description of these simulations is known as multiple broadcasting and
multiple storing, respectively.
2.Concurrent-Read, Exclusive-Write (CREW) SM SIMD Computers:
• Multiple processors are allowed to read from the same memory location but
the right to write is still exclusive: No two processors are allowed to write into
the same location simultaneously.
• For a concurrent read operation, the value is written to the shared memory once
by the broadcaster.
• All other processors read it from the shared memory.
• This takes a total of O(1) steps, as all processors can read the same value at the
same time.
• CREW is therefore more powerful than the EREW PRAM model (this can be seen
from Example 1, where the EREW machine needed log N broadcasting steps).
3. Exclusive-Read, Concurrent-Write (ERCW) SM SIMD Computers
• Multiple processors are allowed to write into the same memory location but read
accesses remain exclusive.

4. Concurrent-Read, Concurrent-Write (CRCW) SM SIMD Computers
• Both multiple-read and multiple-write privileges are granted.
• Allowing multiple-read accesses to the same address in memory should in principle
pose no problems. Conceptually, each of the several processors reading from that
location makes a copy of the location's contents and stores it in its own local
memory.
• With multiple-write accesses, however, difficulties arise. If several processors
are attempting simultaneously to store (potentially different) data at a given
address, which of them should succeed?
• In other words, there should be a deterministic way of specifying the contents
of that address after the write operation.
• Several policies have been proposed to resolve such write conflicts, thus further
subdividing classes (iii) and (iv).
• Some of these policies are :
(a) the smallest-numbered processor is allowed to write, and access is denied to
all other processors: Priority CRCW PRAM
(b) all processors are allowed to write provided that the quantities they are
attempting to store are equal, otherwise access is denied to all processors:
Common CRCW PRAM
(c) the sum/average/maximum/minimum value of all quantities that the
processors are attempting to write is stored: Combining CRCW PRAM
(d) the non- deterministic choice of the values written or selecting the value
randomly: Arbitrary CRCW PRAM
• The EREW PRAM is the weakest (but the most realistic) and the Priority CRCW
PRAM is the strongest PRAM model.
• The relative powers of the different PRAM models are as follows:
• EREW ≤ CREW ≤ Common CRCW ≤ Arbitrary CRCW ≤ Priority CRCW
• An algorithm designed for a weaker model can be executed within the same time
and work complexities on a stronger model.
• An algorithm designed for a stronger PRAM model can be simulated on a weaker
model either with asymptotically more processors(work) or with asymptotically more
time.
• Examples:
• Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi
S820, ETA10
• Most modern computers, particularly those with graphics processor units (GPUs) employ
SIMD instructions and execution units.
Feasibility of the Shared Memory Model
• When one processor needs to gain access to a datum in memory, some circuitry is
needed to create a path from that processor to the location in memory holding
that datum.
• The cost of such circuitry is usually expressed as the number of logical gates
required to decode the address provided by the processor.
• If the memory consists of M locations, then the cost of the decoding circuitry may
be expressed as f (M) for some cost function f: If N processors share that memory
as in the SM SIMD model, then the cost of the decoding circuitry climbs to N x f
(M).
• For large N and M this may lead to prohibitively large and expensive decoding
circuitry between the processors and the memory.
• There are many ways to mitigate this difficulty.
• One way to reduce the cost of the decoding circuitry is to divide the shared
memory into R blocks, say, of M/R locations each.
• There are N + R two-way lines that allow any processor to gain access to any
memory block at any time.
• However, no more than one processor can read from or write into a block
simultaneously.
• This arrangement is shown in following figure for N = 5 and R = 3.
• The circles at the intersections of horizontal and vertical lines represent small
(relatively inexpensive) switches.
• When the ith processor wishes to gain access to the jth memory block, it sends
its request along the ith horizontal line to the jth switch, which then routes it
down the jth vertical line to the jth memory block.
• Each memory block possesses one decoder circuit to determine which of the
M/R locations is needed.
• Therefore, the total cost of decoding circuitry is R x f (M/R).
• To this we must add of course the cost of the N x R switches.
• Another approach to obtaining a weaker version of the SM SIMD is described
in the next section.
Interconnection-Network SIMD Computers
• In previous topic we find that SM SIMD model can be made more feasible by
dividing the memory into blocks and making access to these blocks exclusive. We
can extend this idea to obtain a slightly more powerful model.
• Here the M locations of the shared memory are distributed among the N processors,
each receiving M/N locations. In addition every pair of processors are connected by
a two way line. This arrangement is shown in coming figure for N = 5.
• At any step during the computation, processor Pi can receive a datum from Pj and
send another one to Pk (or to Pj). Consequently, each processor must contain
(i) a circuit of cost f(N - 1) capable of decoding a log(N - 1)-bit address: this
allows the processor to select one of the other N - 1 processors for
communicating; and
(ii) a circuit of cost f (M/N) capable of decoding a log(M/N)-bit address
provided by another processor.
• This model is therefore more powerful than the R-block shared memory, as it
allows instantaneous communication between any pair of processors.
• Several pairs can thus communicate simultaneously (provided, of course, no
more than one processor attempts to send data to or expects to receive data
from another processor).
• Thus, potentially all processors can be busy communicating all the time,
something that is not possible in the R-block shared memory when N > R.
Fully interconnected set of N=5 processors
Some Features of Interconnection Network SIMD Computer:
i. Price: There are N - 1 lines leaving each processor for a total of N(N - 1)/2 lines.
Clearly, such a network is too expensive, especially for large values of N. This is
particularly true if we note that with N processors the best we can hope for is an
N-fold reduction in the number of steps required by a sequential algorithm
ii. Feasibility: Even if we could afford such a high price, the model is unrealistic in
practice, again for large values of N. Indeed, there is a limit on the number of
lines that can be connected to a processor, and that limit is dictated by the actual
physical size of the processor itself.
iii. Relation to SM SIMD: This fully interconnected model is weaker than a shared-
memory computer for the same reason as the R-block shared memory: No more
than one processor can gain access simultaneously to the memory block
associated with another processor. Allowing the latter would yield a cost of N^2 x
f(M/N), which is about the same as for the SM SIMD (not counting the
quadratic cost of the two-way lines): this clearly would defeat our original
purpose of getting a more feasible machine!
• Thus, a small subset of all pairwise connections can be sufficient to obtain a good
performance.
• It should be kept in mind that since two processors can communicate in a constant
number of steps on a SM SIMD computer, any algorithm for an interconnection-network
SIMD computer can be simulated on the former model in no more steps than required to
execute it by the latter.
• The most popular of these networks are briefly outlined:
i. Linear Array: The simplest way to interconnect N processors is in the form of a
one-dimensional array, as shown in Fig. for N = 6. Here, processor Pi is linked to
its two neighbors Pi-1, and Pi+1, through a two-way communication line. Each of
the end processors, namely, P1 and Pn, has only one neighbor.
ii. Two-Dimensional Array: A two-dimensional network is obtained by arranging the N
processors into an m x m array, where m = N^½, as shown in Fig. for m = 4. The
processor in row j and column k is denoted by P(j, k), where 0 <= j <= m - 1 and 0 <= k
<= m - 1. A two-way communication line links P(j, k) to its neighbors P(j + 1, k), P(j - 1,
k), P(j, k + 1), and P(j, k - 1). Processors on the boundary rows and columns have fewer
than four neighbors and hence fewer connections. This network is also known as the
mesh.
iii. Tree Connection
iv. Perfect Shuffle Connection
v. Cube Connection
Multiple Instruction, Single Data (MISD)
• A single data stream is fed into multiple processing units.
• Here, N processors each with its own control unit share a common
memory unit where data reside.
• Each processing unit operates on the data independently via
independent instruction streams.
• Thus, parallelism is achieved by letting the processors do different
things at the same time on the same datum. This class of computers
lends itself naturally to those computations requiring an input to be
subjected to several operations, each receiving the input in its original
form.
• Few actual examples of this class of parallel computer have ever
existed. One is the experimental Carnegie-Mellon C.mmp computer
(1971).
Example:
• It is required to determine whether a given positive integer z has no
divisors except 1 and itself.
• The obvious solution to this problem is to try all possible divisors of z: If
none of these succeeds in dividing z, then z is said to be prime; otherwise z
is said to be composite.
• We can implement this solution as a parallel algorithm on an MISD
computer.
• The idea is to split the job of testing potential divisors among processors.
• Assume that there are as many processors on the parallel computer as there
are potential divisors of z.
• All processors take z as input, then each tries to divide it by its associated
potential divisor and issues an appropriate output based on the result.
• Thus it is possible to determine in one step whether z is prime.
• More realistically, if there are fewer processors than potential divisors, then
each processor can be given the job of testing a different subset of these
divisors.
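A small sequential sketch of the MISD idea: the same datum z is handed to every "processor", and each one tests a different subset of the potential divisors. The loop over processors stands in for units that would run concurrently; the divisor-splitting scheme is an illustrative assumption.

```python
# Sketch of the MISD primality test: every "processor" receives the same
# datum z but tests a different subset of the potential divisors.
def is_prime_misd(z, n_procs=4):
    if z < 2:
        return False
    divisors = range(2, z)                        # all potential divisors of z
    verdicts = []
    for p in range(n_procs):
        my_share = divisors[p::n_procs]           # this processor's divisors
        verdicts.append(all(z % d != 0 for d in my_share))
    return all(verdicts)                          # prime iff nobody found a divisor

if __name__ == "__main__":
    print([n for n in range(2, 40) if is_prime_misd(n)])
    # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
```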
Multiple Instruction, Multiple Data (MIMD)
• This class of computers is the most general and most powerful in our
paradigm of parallel computation that classifies parallel computers
according to whether the instruction and/or the data streams are
duplicated.
• Here we have N processors, N streams of instructions, and N streams
of data.
• Multiple Instruction: every processor may be executing a different
instruction stream
• Multiple Data: every processor may be working with a different data
stream
• The processors here are of the type used in MISD computers in the
sense that each possesses its own control unit in addition to its local
memory and arithmetic and logic unit.
• This makes these processors more powerful than the ones used for SIMD
computers.
• Each processor operates under the control of an instruction stream issued by its
control unit. Thus the processors are potentially all executing different programs on
different data while solving different sub problems of a single problem.
• When all the processors in MIMD are running the same program, we call it Single
Program Multiple Data(SPMD) computation.
• This means that the processors typically operate asynchronously. As with SIMD
computers, communication between processors is performed through a shared
memory or an interconnection network.
• MIMD computers sharing a common memory are often referred to as
multiprocessors (or tightly coupled machines) while those with an interconnection
network are known as multi computers (or loosely coupled machines).
Classification of MIMD Computers
• MIMD machines are classified on the basis of how processors communicate with each
other. Accordingly, there are two categories:
Shared Memory Multiprocessors
• Single address space shared by all processors.
• By single address space we mean that same address generated by two or more
processors refers to the same memory element in one of the memory modules.
• Shared Memory multiprocessors are usually connected by a shared BUS.
• References generated by processors are first searched in their respective local caches. A
cache miss necessitates memory access.
• When all the processors have equal access time to all memory words, such
architectures are called Uniform Memory Access (UMA) model.
• Shared memory architectures are also called tightly coupled multiprocessors due to
close integration of modules (processors, memory, I/O) and high degree of resource
sharing.
• When all processors have equal access to all peripheral devices, the system is called
symmetric multiprocessor.
• In this case, all the processors are equally capable of running the executive programs
such as the OS kernel and I/O service routines.
• In an asymmetric multiprocessor, only one or a subset of processors are executive
capable( i.e. it can execute the OS and handle I/O).
• The remaining processors have no I/O capability and thus are called attached
processors(APs).
• APs execute user codes under the supervision of the master processor.
• When the access time varies with the location of the memory word in a shared
memory model, it is called Non-Uniform Memory Access(NUMA) model.
• The shared memory is physically distributed to all processors called local memories.
• The collection of all local memories forms a global address space accessible by all
processors.
• It is faster to access a local memory with a local processor as it will avoid long
delays of connecting through interconnection network.
• When a multiprocessor uses cache-only memory, it is called the Cache-Only Memory
Access (COMA) model.
• It is a special case of NUMA machine in which the distributed main memories are
converted to caches and all caches form a global address space.
Distributed Memory Multi-computers
• This system consists of multiple computers( or nodes), interconnected by a message
passing network.
• Each node is an autonomous computer consisting of a processor, local memory and
sometimes attached disks or I/O peripherals.
• The message passing network provides point to point static connections among the
nodes.
• All local memories are private and are accessible only by local processors.
• Therefore, traditional multi-computers are also called No-Remote-Memory-
Access(NORMA) machines.
• When the processors of a distributed system are far apart, the number of data
exchanges among them is significantly more important than the number of
computational steps performed by any of them.
Inter Connection Networks
• Tree Model
• Pyramid Model
• Hypercube
• Cube connected cycle
• Butterfly
• Perfect Shuffle Computers
• Fully Connected model
Network Properties and Routing
• Node degree: It is the number of edges(links or channels) incident on a
node. In practical aspects, it reflects the number of I/O ports required
per node and therefore the node degree should be kept constant and as
small as possible.
• Network Diameter: It is the maximum shortest path between any two
nodes. It indicates the maximum number of distinct hops between any
two nodes. Therefore it should be as small as possible from
communication point of view.
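These two properties are easy to compute for any concrete topology; the sketch below does so for a small ring, chosen only as an example, by breadth-first search.

```python
# Sketch: node degree and diameter of a concrete network, here a 6-node
# ring chosen only as an example, computed by breadth-first search.
from collections import deque

ring6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

def node_degree(graph):
    return max(len(neigh) for neigh in graph.values())

def diameter(graph):
    # The diameter is the largest shortest-path distance over all node pairs.
    best = 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

if __name__ == "__main__":
    print(node_degree(ring6), diameter(ring6))    # 2 3
```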
Tree Model
• In this network, the processors form a complete binary tree.
• Such a tree has d levels, numbered 0 to d − 1, and N = 2^d − 1 nodes, each
of which is a processor, as shown in the figure for d = 4.
• Each processor at level i is connected by a two-way line to its parent at
level i + 1 and to its two children at level i − 1.
• The root processor (at level d − 1) has no parent and the leaves (all of
which are at level 0) have no children.
• Maximum node degree = 3
• Diameter = 2(d − 1)
• The terms tree connection (or tree-connected computer) are used to refer
to such a tree of processors.
Perfect Shuffle Computers
• Let N processors P0, P1, ..., PN−1 be available, where N is a power of 2.
• In the perfect shuffle interconnection, a one-way line links Pi to Pj, where
  j = 2i for 0 ≤ i ≤ N/2 − 1, and
  j = 2i + 1 − N for N/2 ≤ i ≤ N − 1,
as shown in the figure for N = 8.


• Equivalently, the binary representation of j is obtained by cyclically
shifting that of i one position to the left.
• In addition to these shuffle links, two-way lines connecting every
even-numbered processor to its successor are sometimes added to
the network.
• These connections, called the exchange links, are shown as broken lines
in Fig. In this case, the network is known as the shuffle-exchange
connection.
[Figure: Perfect Shuffle mapping for N = 8 and Inverse Perfect Shuffle for N = 8 – each processor 000, 001, ..., 111 is linked to the processor whose index is the cyclic left shift (respectively right shift) of its own.]
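The mapping in the figure can be generated mechanically from the cyclic-shift rule; the sketch below prints the N = 8 shuffle and its inverse.

```python
# Sketch: the perfect shuffle and its inverse for N = 2^k processors,
# obtained by cyclically shifting the k-bit index left or right.
def shuffle(i, k):
    n = 1 << k
    msb = (i >> (k - 1)) & 1                 # bit that wraps around
    return ((i << 1) & (n - 1)) | msb

def inverse_shuffle(j, k):
    lsb = j & 1                              # bit that wraps back to the top
    return (j >> 1) | (lsb << (k - 1))

if __name__ == "__main__":
    k = 3                                    # N = 8, as in the figure
    for i in range(1 << k):
        print(f"{i:03b} -> {shuffle(i, k):03b}   "
              f"(inverse: {i:03b} -> {inverse_shuffle(i, k):03b})")
```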
Hypercube model
• An n-dimensional cube(hypercube) has n routing functions, defined by each bit of
the n-bit address.
• These data exchange functions can be used in routing messages in a hypercube
machine.
[Figure: the three routing functions of a 3-cube on nodes 000, 001, ..., 111 – C0 (routing by the least significant bit), C1 (routing by the middle bit) and C2 (routing by the most significant bit). Ci links each node to the node whose address differs in bit i.]
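A minimal sketch of the hypercube routing functions: Ci pairs each node with the node whose address differs in bit i, which reproduces the three routings shown above for the 3-cube.

```python
# Sketch: hypercube routing functions. Ci flips bit i of a node's address,
# giving the neighbor reached by routing along dimension i.
def cube_route(node, i):
    return node ^ (1 << i)                   # complement the i-th address bit

if __name__ == "__main__":
    n = 3                                    # 3-cube on nodes 000..111
    for i in range(n):
        pairs = sorted({tuple(sorted((x, cube_route(x, i)))) for x in range(1 << n)})
        print(f"C{i}:", [(f"{a:03b}", f"{b:03b}") for a, b in pairs])
```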


Cube Connected Cycle
• Assume that N = 2^q for some q ≥ 1 and let N processors P0, P1, ..., PN−1 be available.
• It is a modified architecture of the hypercube model.
• The idea is to cut off each corner node of the 3-cube and replace it by a
ring (cycle) of 3 nodes.
• The 3-CCC has a diameter of 6 (twice that of the original 3-cube).
• In general, the network diameter of a k-CCC equals 2k.
• In general, one can construct k-cube-connected cycles from a k-cube
with n = 2^k cycle nodes by replacing each vertex of the k-dimensional
hypercube by a ring of k nodes.
• Thus a k-cube can be transformed into a k-CCC with k · 2^k nodes.
• The q neighbors Pj of Pi are defined as follows: The binary
representation of j is obtained from that of i by complementing a single
bit.
• This is illustrated in the following figure for q = 3. The indices of P0, P1, ...,
PN−1 are given in binary notation. Note that each processor has three
neighbors.
• The major improvement of a CCC lies in its constant node degree of 3, which
is independent of the dimension of the underlying hypercube.
Butterfly Network
• A butterfly network topology consists of (k + 1) · 2^k nodes arranged in
k + 1 ranks (or rows), each containing n = 2^k nodes.
• k is called the order of the network.
• Rows are labeled 0 … k. Each processor has four connections to other
processors (except processors in top and bottom row).
• Processor P(r, j), i.e. processor number j in row r is connected to P(r-
1, j) and P(r-1, m) where m is obtained by inverting the rth significant
bit in the binary representation of j.
Following is a butterfly network for rank=3:
Pyramid Network
• A pyramid consists of (4^(d+1) − 1)/3 processors organized in d + 1 levels
such that:
• Levels are numbered from d down to 0
• There is 1 processor at level d
• Every level below d has four times as many processors as the level
immediately above it.
• Pyramid interconnection can be seen as generalization of the ring –
binary tree network, or as a way of combining meshes and trees.
Pyramid network for 2 levels
Comparison Of Interconnection Networks

Network Topology     Number of Nodes     Node Degree
Linear and Ring      d                   2
Shuffle-Exchange     2^d                 3
2D Mesh              d^2                 4
Hypercube            2^d                 d
Binary Tree          2^d − 1             3
Butterfly            (d + 1) · 2^d       d + 1
Pyramid              (4^(d+1) − 1)/3     9
Simulation of one model into other
• Small PRAMs can simulate larger PRAMs. Even though relatively
simple, the following two simulations are very useful and widely used.
• The first result says that if we decrease the number of processors,
the cost of a PRAM algorithm does not change, up to a
multiplicative constant.
Simulation may be desirable for one of two reasons:
1. The parallel computer available belongs to the EREW class
and thus the only way to execute a CREW, ERCW, or
CRCW algorithm is through simulation or,
2. Parallel computers of the CREW, ERCW, and CRCW models with a
very large number of processors are technologically impossible to
build at all.
Indeed, the number of processors that can be simultaneously connected to a
memory location is limited :
(i) not only by the physical size of the device used for that location,
(ii) but also by the device's physical properties (such as voltage).
• Therefore concurrent access to memory by an arbitrary number of processors
may not be realizable in practice. Again in this case simulation is the only
resort to implement an algorithm developed in theory to include multiple
accesses.
Lemma 1: Assume p'<p. Any problem that can be solved on a p-processor
PRAM in t steps can be solved on a p'-processor PRAM in t'=O(tp/p') steps
assuming the same size of shared memory.
Proof:
1. Partition p simulated processors into p' groups of size p/p' each.
2. Associate each of the p' simulating processors with one of these groups.
3. Each of the simulating processors simulates one step of its group of
processors by:
i. executing all their READ and local computation substeps first,
ii. executing their WRITE substeps then.
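A rough sequential sketch of the Lemma 1 simulation, under the simplifying assumptions that p' divides p and that no processor in one group reads a cell written by another group in the same PRAM step; the read-then-write ordering inside each group mirrors the proof.

```python
# Sketch of the Lemma 1 simulation: p' processors emulate p virtual PRAM
# processors by taking groups of p/p' each, doing all the group's READs
# (and local computation) before any of its WRITEs.
def simulate_step(p, p_prime, shared, read_phase, write_phase):
    group = p // p_prime
    for sim in range(p_prime):                       # each simulating processor
        members = range(sim * group, (sim + 1) * group)
        local = {i: read_phase(i, shared) for i in members}   # READ substeps first
        for i in members:
            write_phase(i, local[i], shared)                  # then WRITE substeps
    # One original step costs p/p' simulated steps per simulating processor,
    # hence t' = O(t * p / p') overall.

if __name__ == "__main__":
    shared = [0] * 8

    def read(i, mem):
        return mem[i] + i        # virtual Pi's READ plus local computation

    def write(i, val, mem):
        mem[i] = 2 * val         # virtual Pi's WRITE

    simulate_step(p=8, p_prime=2, shared=shared, read_phase=read, write_phase=write)
    print(shared)                # [0, 2, 4, 6, 8, 10, 12, 14]
```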
Simulation theorem: Any algorithm running on a CRCW PRAM with p
processors cannot be more than O(log p) times faster than the best
algorithm on a EREW PRAM with p processors for the same problem.
Proof:
“Simulate” the concurrent writes:
• When Pi writes value xi to address li, the write is replaced by an (exclusive)
write of the pair (li, xi) to A[i], where A is an auxiliary array with one slot per
processor.
• Array A is then sorted by the first component (the address) of its contents.
• Processor i of the EREW PRAM looks at A[i] and A[i − 1]; if their first
components differ, or if i = 0, it writes the value stored in A[i] to the
corresponding address.
• Since A is sorted according to the first component, the writing is exclusive.
• P0  (29,43) = A[0] Note that we said that we just
sort array A. If we have an
A[0]=(8,12) P0 writes
algorithm that sorts p
• P1  (8,12) = A[1] elements with O(p) processors
A[1]=(8,12) P1 nothing
in O(log p) time, therefore, the
• P2  (29,43) = A[2] A[2]=(29,43) P2 writes proof is complete

• P3  (29,43) = A[3] A[3]=(29,43) P3 nothing

A[4]=(29,43) P4 nothing
• P4  (92,26) = A[4]
A[5]=(92,26) P5 writes
• P5  (8,12) = A[5]
Picking one processor for each competing write
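The concurrent-write simulation can be sketched directly from the proof: collect the (address, value) pairs, sort them by address, and let only the processor holding the first slot of each run of equal addresses perform the (now exclusive) write. The dictionary standing in for shared memory is an illustrative choice.

```python
# Sketch: simulating one round of concurrent writes on an EREW machine.
# requests[i] is the (address, value) pair that Pi wants to write.
def simulate_crcw_writes(memory, requests):
    A = sorted(requests)                      # sort A by its first component
    for i, (addr, val) in enumerate(A):       # processor i inspects A[i], A[i-1]
        if i == 0 or A[i - 1][0] != addr:     # first slot of a run of equal addresses
            memory[addr] = val                # exclusive write
    return memory

if __name__ == "__main__":
    reqs = [(29, 43), (8, 12), (29, 43), (29, 43), (92, 26), (8, 12)]
    print(simulate_crcw_writes({}, reqs))     # {8: 12, 29: 43, 92: 26}
```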
Lemma 2: Assume m'<m. Any problem that can be solved on a p-processor and m-cell
PRAM in t steps can be solved on a max(p,m')-processor m'-cell PRAM in O(tm/m') steps.
Proof:
1. Partition the m simulated shared memory cells into m' contiguous segments Si of
size m/m' each.
2. Each simulating processor P'i, 1<= i<= p, will simulate processor Pi of the original
PRAM.
3. Each simulating processor P'i, 1<= i<= m', stores the initial contents of Si into its local
memory and will use M'[i] as an auxiliary memory cell for simulation of accesses to
cells of Si.
4. Simulation of one original READ operation:
each P'i, i=1,...,max(p,m') repeats for k=1,...,m/m':
1. write the value of the k-th cell of Si into M'[i], i=1,...,m',
2. read the value which the simulated processor Pi, i=1,...,p, would read in this simulated
substep, if it appeared in the shared memory.
5. The local computation substep of Pi, i=1,...,p, is simulated in one step by P'i.
6. Simulation of one original WRITE operation is analogous to that of READ.