
CS 194 Parallel Programming

Why Program for Parallelism?


Katherine Yelick
yelick@cs.berkeley.edu
http://www.cs.berkeley.edu/~yelick/cs194f07
8/29/2007


What is Parallel Computing?


Parallel computing: using multiple processors in parallel to solve problems more quickly than with a single processor
Examples of parallel machines:
A cluster computer that contains multiple PCs combined together with a high-speed network
A shared memory multiprocessor (SMP*), built by connecting multiple processors to a single memory system
A Chip Multi-Processor (CMP), which contains multiple processors (called cores) on a single chip

Concurrent execution here comes from the desire for performance, unlike the inherent concurrency in a multi-user distributed system
* Technically, SMP stands for Symmetric Multi-Processor

Why Parallel Computing Now?


Researchers have been using parallel computing for decades:
Mostly used in computational science and engineering
Problems too large to solve on one computer; use 100s or 1000s

There has been a graduate course in parallel computing (CS267) for over a decade
Many companies in the 80s/90s bet on parallel computing and failed
Computers got faster too quickly for there to be a large market

Why is Berkeley adding an undergraduate course now?


Because the entire computing industry has bet on parallelism
There is a desperate need for parallel programmers

Let's see why



Technology Trends: Microprocessor Capacity

[Figure: Moore's Law: 2X transistors/chip every 1.5 years]


Microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra


Microprocessor Transistors and Clock Rate


[Figure: two log-scale plots, 1970-2005. Left: growth in transistors per chip (1,000 to 100,000,000), from the i4004, i8080, and i8086 through the i80286, i80386, R2000, R3000, Pentium, and R10000. Right: increase in clock rate (0.1 to 1000 MHz).]

Why bother with parallel programming? Just wait a year or two



Limit #1: Power density


"Can soon put more transistors on a chip than can afford to turn on." -- Patterson '07

Scaling clock speed (business as usual) will not work

[Figure: power density (W/cm², log scale from 1 to 10,000) of Intel processors from the 4004, 8008, 8080, 8085, and 8086 through the 286, 386, 486, Pentium, and P6, plotted 1970-2010; the trend passes the power density of a hot plate and heads toward that of a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]

Parallelism Saves Power


Exploit explicit parallelism for reducing power
Power = C * V^2 * F (C = capacitance, V = voltage, F = frequency)
→ 2C * (V/2)^2 * (F/2) = (C * V^2 * F) / 4

Performance = Cores * F
→ 2 Cores * (F/2) = Cores * F (unchanged)

Using additional cores


Increase density (= more transistors = more capacitance)
Can increase cores (2x) and performance (2x)
Or increase cores (2x), but decrease frequency (1/2): same performance at 1/4 the power

Additional benefits
Small/simple cores → more predictable performance
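
A minimal Python sketch of the arithmetic above (the baseline values of C, V, and F are normalized to 1 for illustration, not measurements):

    # Dynamic power model: Power = C * V^2 * F; Performance ~ Cores * F.
    def power(c, v, f):
        return c * v**2 * f

    def perf(cores, f):
        return cores * f

    C, V, F = 1.0, 1.0, 1.0                 # normalized single-core baseline
    base_power, base_perf = power(C, V, F), perf(1, F)

    # Double the cores and halve the frequency; halving F also lets us
    # lower the voltage (here, halve it), since V scales with clock speed.
    new_power = power(2 * C, V / 2, F / 2)  # = base_power / 4
    new_perf  = perf(2, F / 2)              # = base_perf

    print(new_perf / base_perf)             # 1.0  (same performance)
    print(new_power / base_power)           # 0.25 (one quarter the power)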

Limit #2: Hidden Parallelism Tapped Out


Application performance was increasing by 52% per year, as measured by the SPECint benchmarks in the figure below
[Figure: SPECint performance, 1978-2006, log scale (1 to 10,000), from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. VAX: 25%/year from 1978 to 1986, due to transistor density. RISC + x86: 52%/year from 1986 to 2002, due to architecture changes, e.g., Instruction-Level Parallelism (ILP). Since 2002: ??%/year.]

Limit #2: Hidden Parallelism Tapped Out


Superscalar (SS) designs were the state of the art;
many forms of parallelism not visible to programmer
multiple instruction issue
dynamic scheduling: hardware discovers parallelism
between instructions
speculative execution: look past predicted branches
non-blocking caches: multiple outstanding memory ops

You may have heard of these in 61C, but you haven't needed to know about them to write software
Unfortunately, these sources have been used up


Performance Comparison

Measure of success for hidden parallelism is Instructions Per Cycle (IPC)


The 6-issue processor has higher IPC than the 2-issue, but far less than 3x higher
Reasons: waiting for memory (D- and I-cache stalls) and dependencies (pipeline stalls)
Graphs from: Olukotun et al., ASPLOS 1996

Uniprocessor Performance (SPECint) Today


[Figure: uniprocessor SPECint performance (vs. VAX-11/780), 1978-2006, log scale, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. VAX: 25%/year, 1978 to 1986. RISC + x86: 52%/year, 1986 to 2002. RISC + x86: ??%/year (2x every 5 years?), 2002 to present; the gap to the old 52%/year trend is already about 3X.]

Sea change in chip design: multiple cores or processors per chip

Limit #3: Chip Yield


Manufacturing costs and yield problems limit use of density

Moore's (Rock's) 2nd law: fabrication costs go up
Yield (% usable chips) drops

Parallelism can help
More smaller, simpler processors are easier to design and validate
Can use partially working chips:
E.g., the Cell processor (PS3) is sold with 7 out of 8 SPEs enabled to improve yield

Limit #4: Speed of Light (Fundamental)


[Diagram: a 1 Tflop/s, 1 Tbyte sequential machine must fit within a radius of r = 0.3 mm]

Consider the 1 Tflop/s sequential machine:

Data must travel some distance, r, to get from memory to CPU.
To get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3×10^8 m/s. Thus r < c/10^12 = 0.3 mm.
Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:

Each bit occupies about 1 square Angstrom, or the size of a small atom.
No choice but parallelism
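
The arithmetic can be checked directly; a short Python sketch of the slide's numbers:

    # Max distance light travels in one cycle of a 1 Tflop/s machine.
    c = 3e8                 # speed of light, m/s
    cycles_per_sec = 1e12   # one data element per cycle at 1 Tflop/s
    r = c / cycles_per_sec
    print(r)                # 3e-4 m = 0.3 mm

    # Pack 1 Tbyte of storage into an r x r square.
    bits = 8 * 1e12
    area_per_bit = (r * r) / bits
    print(area_per_bit ** 0.5)  # ~1e-10 m: about 1 Angstrom on a side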

Revolution is Happening Now


Chip density is continuing to increase, ~2x every 2 years
Clock speed is not
Number of processor cores may double instead

There is little or no hidden parallelism (ILP) to be found
Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Multicore in Products
"We are dedicating all of our future product development to multicore designs. This is a sea change in computing."
Paul Otellini, President, Intel (2005)

All microprocessor companies switch to MP (2X CPUs / 2 yrs)


Procrastination penalized: 2X sequential perf. / 5 yrs
Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'07
Processors/chip         2          2          2         8
Threads/Processor       1          2          2        16
Threads/chip            2          4          4       128

And at the same time,
The STI Cell processor (PS3) has 8 cores
The latest NVIDIA Graphics Processing Unit (GPU) has 128 cores
Intel has demonstrated an 80-core research chip

Tunnel Vision by Experts


"On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
Ken Kennedy, CRPC Director, 1994

"640K [of memory] ought to be enough for anybody."
Bill Gates, chairman of Microsoft, 1981

"There is no reason for any individual to have a computer in their home."
Ken Olsen, president and founder of Digital Equipment Corporation, 1977

"I think there is a world market for maybe five computers."
Thomas Watson, chairman of IBM, 1943
Slide source: Warfield et al.


Why Parallelism (2007)?


These arguments are no longer theoretical
All major processor vendors are producing multicore chips
Every machine will soon be a parallel machine
All programmers will be parallel programmers???

New software model


Want a new feature? Hide the cost by speeding up the code first
All programmers will be performance programmers???

Some of this may eventually be hidden in libraries, compilers, and high-level languages
But a lot of work is needed to get there

Big open questions:
What will be the killer apps for multicore machines?
How should the chips be designed, and how will they be programmed?

Outline

Why all powerful computers must be parallel processors
Including your laptop

Why writing (fast) parallel programs is hard

Principles of parallel computing performance
Structure of the course


Why writing (fast) parallel programs is hard


Principles of Parallel Computing

Finding enough parallelism (Amdahl's Law)


Granularity
Locality
Load balance
Coordination and synchronization
Performance modeling

All of these things make parallel programming even harder than sequential programming.

Finding Enough Parallelism


Suppose only part of an application seems parallel
Amdahl's law:
let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
P = number of processors

Speedup(P) = Time(1)/Time(P)
<= 1/(s + (1-s)/P)
<= 1/s
Even if the parallel part speeds up perfectly, performance is limited by the sequential part
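
The bound transcribes directly into Python (the values of s and P below are hypothetical, for illustration):

    def amdahl_speedup(s, p):
        """Upper bound on speedup: sequential fraction s, p processors."""
        return 1.0 / (s + (1.0 - s) / p)

    # Even with only 5% sequential work, 1000 processors give < 20x.
    print(amdahl_speedup(0.05, 1000))  # ~19.6
    print(1 / 0.05)                    # 20.0, the limit as p -> infinity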


Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to
getting desired speedup
Parallelism overheads include:

cost of starting a thread or process


cost of communicating shared data
cost of synchronizing
extra (redundant) computation

Each of these can be in the range of milliseconds (= millions of flops) on some systems

Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work (see the sketch below)
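
A toy cost model of this tradeoff (all constants are invented for illustration):

    import math

    # Split W units of work into n tasks on p processors; each task pays a
    # fixed overhead o. Tasks run in waves of p at a time.
    def run_time(W, p, n, o):
        per_task = W / n + o
        waves = math.ceil(n / p)
        return waves * per_task

    W, p, o = 1e6, 8, 1e3
    for n in (4, 8, 64, 1024):
        print(n, run_time(W, p, n, o))
    # n=4 is too coarse (idle processors), n=1024 too fine (overhead
    # dominates); n=8 is best here, at roughly W/p + o.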


Locality and Parallelism


[Diagram: conventional storage hierarchy, replicated per processor: Proc → Cache → L2 Cache → L3 Cache → Memory, with potential interconnects between the processors]

Large memories are slow, fast memories are small


Storage hierarchies are large and fast on average
Parallel processors, collectively, have a large, fast cache
the slow accesses to remote data are what we call communication

Algorithm should do most work on local data

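
One way to quantify "large and fast on average": a back-of-the-envelope average memory access time (AMAT) model. The latencies and hit rates below are invented for illustration:

    # Each level is checked in order; a miss falls through to the next
    # (larger, slower) level. Tuples: (name, latency in cycles, hit rate).
    levels = [
        ("L1 cache", 1,   0.90),
        ("L2 cache", 10,  0.95),
        ("L3 cache", 40,  0.98),
        ("memory",   200, 1.00),
    ]

    amat, p_reach = 0.0, 1.0
    for name, latency, hit_rate in levels:
        amat += p_reach * hit_rate * latency
        p_reach *= 1.0 - hit_rate
    print(amat)  # ~2.1 cycles on average, despite 200-cycle memory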

Load Imbalance
Load imbalance is the time that some processors in the system are idle due to:
insufficient parallelism (during that phase)
unequal size tasks

Examples of the latter


adapting to interesting parts of a domain
tree-structured computations
fundamentally unstructured problems

Algorithm needs to balance load
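
A minimal sketch of measuring imbalance: with unequal task sizes, parallel time is set by the most loaded processor (the loads are hypothetical):

    # Work assigned to each of 4 processors (arbitrary units).
    loads = [90, 40, 30, 40]

    t_parallel = max(loads)               # everyone waits for the slowest
    t_balanced = sum(loads) / len(loads)  # perfectly balanced time
    print(t_parallel, t_balanced, t_balanced / t_parallel)
    # 90, 50.0, ~0.56 -- only 56% efficiency due to imbalance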


Course Organization


Course Mechanics
Expected background
All of 61 series
At least one upper div software/systems course, preferably 162

Work in course
Homework with programming (~1/week for first 8 weeks)
Parallel hardware in CS, from Intel, at LBNL

Final project of your own choosing: may use other hardware (PS3, GPUs, Niagara 2, etc.) depending on availability
2 in-class quizzes mostly covering lecture topics

See course web page for tentative calendar, etc.:
http://www.cs.berkeley.edu/~yelick/cs194f07

Grades: homework (30%), quizzes (30%), project (40%)


Caveat: This is the first offering of this course, so things will change dynamically

Reading Materials
Optional text
Introduction to Parallel Computing, 2nd Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Addison-Wesley, 2003

Some on-line texts (on high-performance scientific programming):
Demmel's notes from CS267 Spring 1999, which are similar to 2000 and 2001. However, they contain links to html notes from 1996.
http://www.cs.berkeley.edu/~demmel/cs267_Spr99/

Ian Foster's book, Designing and Building Parallel Programs.
http://www-unix.mcs.anl.gov/dbpp/

