
CS 194 Parallel Programming

Why Program for Parallelism?


Katherine Yelick
yelick@cs.berkeley.edu
http://www.cs.berkeley.edu/~yelick/cs194f07
8/29/2007


What is Parallel Computing?


Parallel computing: using multiple processors in parallel to solve problems more quickly than with a single processor
Examples of parallel machines:
A cluster computer that contains multiple PCs combined together with a high-speed network
A shared memory multiprocessor (SMP*), built by connecting multiple processors to a single memory system
A Chip Multi-Processor (CMP), which contains multiple processors (called cores) on a single chip

Concurrent execution here comes from the desire for performance, unlike the inherent concurrency in a multi-user distributed system
* Technically, SMP stands for Symmetric Multi-Processor

Why Parallel Computing Now?


Researchers have been using parallel computing for decades:
Mostly used in computational science and engineering
Problems too large to solve on one computer; use 100s or 1000s

There has been a graduate course in parallel computing (CS267) for over a decade
Many companies in the 80s/90s bet on parallel computing and failed
Computers got faster too quickly for there to be a large market

Why is Berkeley adding an undergraduate course now?


Because the entire computing industry has bet on parallelism
There is a desperate need for parallel programmers

Let's see why



Technology Trends: Microprocessor Capacity

[Figure: Moore's Law: 2X transistors/chip every 1.5 years]


Microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra


Microprocessor Transistors and Clock Rate


[Figure: two log-scale plots, 1970-2005. Left: growth in transistors per chip (1,000 to 100,000,000), from the i4004, i8080, and i8086 through the i80286, i80386, R2000, R3000, Pentium, and R10000. Right: increase in clock rate (0.1 to 1000 MHz).]

Why bother with parallel programming? Just wait a year or two



Limit #1: Power density


"Can soon put more transistors on a chip than can afford to turn on." -- Patterson '07

Scaling clock speed (business as usual) will not work

[Figure: power density (W/cm², log scale from 1 to 10,000) of Intel processors from the 4004, 8008, 8080, 8085, and 8086 through the 286, 386, 486, Pentium, and P6, plotted 1970-2010; the trend passes the power density of a hot plate and heads toward that of a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]

Parallelism Saves Power


Exploit explicit parallelism for reducing power
Power = C * V^2 * F (C = capacitance, V = voltage, F = frequency)
→ 2C * (V/2)^2 * (F/2) = (C * V^2 * F) / 4

Performance = Cores * F
→ 2 Cores * (F/2) = Cores * F (unchanged)

Using additional cores


Increase density (= more transistors = more capacitance)
Can increase cores (2x) and performance (2x)
Or increase cores (2x), but decrease frequency (1/2): same performance at 1/4 the power

Additional benefits
Small/simple cores → more predictable performance
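
A minimal Python sketch of the arithmetic above (the baseline values of C, V, and F are normalized to 1 for illustration, not measurements):

    # Dynamic power model: Power = C * V^2 * F; Performance ~ Cores * F.
    def power(c, v, f):
        return c * v**2 * f

    def perf(cores, f):
        return cores * f

    C, V, F = 1.0, 1.0, 1.0                 # normalized single-core baseline
    base_power, base_perf = power(C, V, F), perf(1, F)

    # Double the cores and halve the frequency; halving F also lets us
    # lower the voltage (here, halve it), since V scales with clock speed.
    new_power = power(2 * C, V / 2, F / 2)  # = base_power / 4
    new_perf  = perf(2, F / 2)              # = base_perf

    print(new_perf / base_perf)             # 1.0  (same performance)
    print(new_power / base_power)           # 0.25 (one quarter the power)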

Limit #2: Hidden Parallelism Tapped Out


Application performance was increasing by 52% per year, as measured by the SPECint benchmarks in the figure below
[Figure: SPECint performance, 1978-2006, log scale (1 to 10,000), from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. VAX: 25%/year from 1978 to 1986, due to transistor density. RISC + x86: 52%/year from 1986 to 2002, due to architecture changes, e.g., Instruction-Level Parallelism (ILP). Since 2002: ??%/year.]

Limit #2: Hidden Parallelism Tapped Out


Superscalar (SS) designs were the state of the art;
many forms of parallelism not visible to programmer
multiple instruction issue
dynamic scheduling: hardware discovers parallelism
between instructions
speculative execution: look past predicted branches
non-blocking caches: multiple outstanding memory ops

You may have heard of these in 61C, but you haven't needed to know about them to write software
Unfortunately, these sources have been used up


Performance Comparison

Measure of success for hidden parallelism is Instructions Per Cycle (IPC)


The 6-issue processor has higher IPC than the 2-issue, but far less than 3x higher
Reasons: waiting for memory (D- and I-cache stalls) and dependencies (pipeline stalls)
Graphs from: Olukotun et al., ASPLOS 1996

Uniprocessor Performance (SPECint) Today


[Figure: uniprocessor SPECint performance (vs. VAX-11/780), 1978-2006, log scale, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. VAX: 25%/year, 1978 to 1986. RISC + x86: 52%/year, 1986 to 2002. RISC + x86: ??%/year (2x every 5 years?), 2002 to present; the gap to the old 52%/year trend is already about 3X.]

Sea change in chip design: multiple cores or processors per chip

Limit #3: Chip Yield


Manufacturing costs and yield problems limit use of density

Moore's (Rock's) 2nd law: fabrication costs go up
Yield (% usable chips) drops

Parallelism can help
More smaller, simpler processors are easier to design and validate
Can use partially working chips:
E.g., the Cell processor (PS3) is sold with 7 out of 8 SPEs enabled to improve yield

Limit #4: Speed of Light (Fundamental)


[Diagram: a 1 Tflop/s, 1 Tbyte sequential machine must fit within a radius of r = 0.3 mm]

Consider the 1 Tflop/s sequential machine:

Data must travel some distance, r, to get from memory to CPU.
To get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3×10^8 m/s. Thus r < c/10^12 = 0.3 mm.
Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:

Each bit occupies about 1 square Angstrom, or the size of a small atom.
No choice but parallelism
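
The arithmetic can be checked directly; a short Python sketch of the slide's numbers:

    # Max distance light travels in one cycle of a 1 Tflop/s machine.
    c = 3e8                 # speed of light, m/s
    cycles_per_sec = 1e12   # one data element per cycle at 1 Tflop/s
    r = c / cycles_per_sec
    print(r)                # 3e-4 m = 0.3 mm

    # Pack 1 Tbyte of storage into an r x r square.
    bits = 8 * 1e12
    area_per_bit = (r * r) / bits
    print(area_per_bit ** 0.5)  # ~1e-10 m: about 1 Angstrom on a side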

Revolution is Happening Now


Chip density is continuing to increase, ~2x every 2 years
Clock speed is not
Number of processor cores may double instead

There is little or no hidden parallelism (ILP) to be found
Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Multicore in Products
"We are dedicating all of our future product development to multicore designs. This is a sea change in computing."
Paul Otellini, President, Intel (2005)

All microprocessor companies switch to MP (2X CPUs / 2 yrs)


Procrastination penalized: 2X sequential perf. / 5 yrs
Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'07
Processors/chip         2          2          2         8
Threads/Processor       1          2          2        16
Threads/chip            2          4          4       128

And at the same time,
The STI Cell processor (PS3) has 8 cores
The latest NVIDIA Graphics Processing Unit (GPU) has 128 cores
Intel has demonstrated an 80-core research chip

Tunnel Vision by Experts


"On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
Ken Kennedy, CRPC Director, 1994

"640K [of memory] ought to be enough for anybody."
Bill Gates, chairman of Microsoft, 1981

"There is no reason for any individual to have a computer in their home."
Ken Olsen, president and founder of Digital Equipment Corporation, 1977

"I think there is a world market for maybe five computers."
Thomas Watson, chairman of IBM, 1943
Slide source: Warfield et al.


Why Parallelism (2007)?


These arguments are no longer theoretical
All major processor vendors are producing multicore chips
Every machine will soon be a parallel machine
All programmers will be parallel programmers???

New software model


Want a new feature? Hide the cost by speeding up the code first
All programmers will be performance programmers???

Some of this may eventually be hidden in libraries, compilers, and high-level languages
But a lot of work is needed to get there

Big open questions:
What will be the killer apps for multicore machines?
How should the chips be designed, and how will they be programmed?

Outline

Why all powerful computers must be parallel processors
Including your laptop

Why writing (fast) parallel programs is hard

Principles of parallel computing performance
Structure of the course


Why writing (fast) parallel programs is hard


Principles of Parallel Computing

Finding enough parallelism (Amdahl's Law)


Granularity
Locality
Load balance
Coordination and synchronization
Performance modeling

All of these things make parallel programming even harder than sequential programming.

Finding Enough Parallelism


Suppose only part of an application seems parallel
Amdahl's law:
let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
P = number of processors

Speedup(P) = Time(1)/Time(P)
<= 1/(s + (1-s)/P)
<= 1/s
Even if the parallel part speeds up perfectly, performance is limited by the sequential part
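
The bound transcribes directly into Python (the values of s and P below are hypothetical, for illustration):

    def amdahl_speedup(s, p):
        """Upper bound on speedup: sequential fraction s, p processors."""
        return 1.0 / (s + (1.0 - s) / p)

    # Even with only 5% sequential work, 1000 processors give < 20x.
    print(amdahl_speedup(0.05, 1000))  # ~19.6
    print(1 / 0.05)                    # 20.0, the limit as p -> infinity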


Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to
getting desired speedup
Parallelism overheads include:

cost of starting a thread or process


cost of communicating shared data
cost of synchronizing
extra (redundant) computation

Each of these can be in the range of milliseconds (= millions of flops) on some systems

Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work (see the sketch below)
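
A toy cost model of this tradeoff (all constants are invented for illustration):

    import math

    # Split W units of work into n tasks on p processors; each task pays a
    # fixed overhead o. Tasks run in waves of p at a time.
    def run_time(W, p, n, o):
        per_task = W / n + o
        waves = math.ceil(n / p)
        return waves * per_task

    W, p, o = 1e6, 8, 1e3
    for n in (4, 8, 64, 1024):
        print(n, run_time(W, p, n, o))
    # n=4 is too coarse (idle processors), n=1024 too fine (overhead
    # dominates); n=8 is best here, at roughly W/p + o.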


Locality and Parallelism


[Diagram: conventional storage hierarchy, replicated per processor: Proc → Cache → L2 Cache → L3 Cache → Memory, with potential interconnects between the processors]

Large memories are slow, fast memories are small


Storage hierarchies are large and fast on average
Parallel processors, collectively, have a large, fast cache
the slow accesses to remote data are what we call communication

Algorithm should do most work on local data

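
One way to quantify "large and fast on average": a back-of-the-envelope average memory access time (AMAT) model. The latencies and hit rates below are invented for illustration:

    # Each level is checked in order; a miss falls through to the next
    # (larger, slower) level. Tuples: (name, latency in cycles, hit rate).
    levels = [
        ("L1 cache", 1,   0.90),
        ("L2 cache", 10,  0.95),
        ("L3 cache", 40,  0.98),
        ("memory",   200, 1.00),
    ]

    amat, p_reach = 0.0, 1.0
    for name, latency, hit_rate in levels:
        amat += p_reach * hit_rate * latency
        p_reach *= 1.0 - hit_rate
    print(amat)  # ~2.1 cycles on average, despite 200-cycle memory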

Load Imbalance
Load imbalance is the time that some processors in the system are idle due to:
insufficient parallelism (during that phase)
unequal size tasks

Examples of the latter


adapting to interesting parts of a domain
tree-structured computations
fundamentally unstructured problems

Algorithm needs to balance load
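
A minimal sketch of measuring imbalance: with unequal task sizes, parallel time is set by the most loaded processor (the loads are hypothetical):

    # Work assigned to each of 4 processors (arbitrary units).
    loads = [90, 40, 30, 40]

    t_parallel = max(loads)               # everyone waits for the slowest
    t_balanced = sum(loads) / len(loads)  # perfectly balanced time
    print(t_parallel, t_balanced, t_balanced / t_parallel)
    # 90, 50.0, ~0.56 -- only 56% efficiency due to imbalance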


Course Organization


Course Mechanics
Expected background
All of 61 series
At least one upper div software/systems course, preferably 162

Work in course
Homework with programming (~1/week for first 8 weeks)
Parallel hardware in CS, from Intel, at LBNL

Final project of your own choosing: may use other hardware (PS3, GPUs, Niagara 2, etc.) depending on availability
2 in-class quizzes mostly covering lecture topics

See course web page for tentative calendar, etc.:
http://www.cs.berkeley.edu/~yelick/cs194f07

Grades: homework (30%), quizzes (30%), project (40%)


Caveat: This is the first offering of this course, so things will change dynamically

Reading Materials
Optional text
Introduction to Parallel Computing, 2nd Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Addison-Wesley, 2003

Some on-line texts (on high-performance scientific programming):
Demmel's notes from CS267 Spring 1999, which are similar to 2000 and 2001. However, they contain links to html notes from 1996.
http://www.cs.berkeley.edu/~demmel/cs267_Spr99/

Ian Foster's book, Designing and Building Parallel Programs.
http://www-unix.mcs.anl.gov/dbpp/

