
Architectures for Next Generation HiPC

Prof. V. Kamakoti
Reconfigurable Intelligent Systems Engineering group (RISE)
Dept. of CSE
IIT Madras



How Chips Have Shrunk

• Built in 1946 at UPenn (the ENIAC)
• Measured in cubic feet



ENIAC on a Chip

• 1997 → 7.44 mm x 5.29 mm
• 174,569 transistors → 0.5 µm technology
Integrated Circuit Revolution

1958: First integrated circuit (germanium)
– Built by Jack Kilby at Texas Instruments
– Contained five components: transistors, resistors and capacitors

2000: Intel Pentium 4 Processor
– Clock speed: 1.5 GHz
– Transistors: 42 million
– Technology: 0.18 µm CMOS



Costs Over Time



Evolution in IC Complexity



If Transistors are Counted as Seconds

4004 < 1 hr
8080 < 2 hrs
8086 8 hrs
80286 1.5 days
80386 DX 3 days
80486 13 days
Pentium > 1 month
Pentium Pro 2 months
P II 3 months
P III ~1 year
P4 ~ 1.5 years
Comparison of Sizes



How Small Are The Transistors?

[Chart: transistor feature size (µm) vs. year, 1983–2007 – from 2.0 µm (80286, 80386) through the micron, sub-micron, deep sub-micron and ultra-deep sub-micron regimes (486, Pentium, Pentium II) down to roughly 0.1 µm and below in the nano era (Pentium IV, Itanium)]
• Compare that to the diameter of a human hair – 56 µm
Moore’s Law
• Transistor counts double almost every 2.3 years
– Gordon Moore of Intel
– A visionary prediction
– Observed in practice for more than 3 decades

• Implications
– More functionality
– More complexity
– Cost??
Processor Frequency Trends

Frequency doubles each generation


Processor Power Trends



Power Density Increase
[Chart: power density (W/cm²) vs. year, 1970–2010 – rising from about 1 W/cm² (4004, 8008, 8080, 8085) up through the 8086, 286, 386, 486, Pentium and P6, with the trend extrapolating past hot-plate levels toward nuclear-reactor, rocket-nozzle and eventually Sun's-surface power densities]



Complexity



What can we do to bridge the gap?

• Increase frequency
• Increase voltage of operation
• Increase the amount of work done per unit time
• Increase the number of hardware units
• Use clever techniques



Increase Frequency
• Has been attempted for a long time
• Increase in frequency ≠ better performance
• Clock speeds have been around 4 GHz for almost 2 years
• Companies don't play this number game anymore



Increase Voltage of Operation
• Has been done by overclockers
– Mostly avid gamers
– There is a limit beyond which voltage cannot be increased – electrical breakdown
• Power = C · V² · f + g1 · V³ + g2 · V⁵, where the g1 · V³ term is subthreshold leakage and the g2 · V⁵ term is gate leakage
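To make the trade-off concrete, here is a minimal sketch in plain C/C++ of this power model; the coefficient values C, f, g1 and g2 below are illustrative placeholders, not measured numbers from the slides.

    #include <stdio.h>

    /* Power model from the slide:
       P = C*V^2*f (switching) + g1*V^3 (subthreshold leakage) + g2*V^5 (gate leakage).
       All coefficient values are made up for illustration only. */
    double chip_power(double C, double f, double g1, double g2, double V) {
        double switching    = C * V * V * f;
        double subthreshold = g1 * V * V * V;
        double gate         = g2 * V * V * V * V * V;
        return switching + subthreshold + gate;
    }

    int main(void) {
        double C = 1.5e-8, f = 3e9, g1 = 20.0, g2 = 15.0;  /* hypothetical values */
        for (double V = 1.0; V <= 1.61; V += 0.2)
            printf("V = %.1f V  ->  P = %6.1f W\n", V, chip_power(C, f, g1, g2, V));
        return 0;
    }

Because the leakage terms grow as V³ and V⁵, even modest voltage increases cost disproportionate power.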



Increase Amount of Work Done per Time Unit
• By dedicating more resources
• Don't waste computation
• Make hardware faster



Basis of Hyperthreading



[Figure: a conventional wide-issue superscalar processor (4 instructions per cycle) – different tasks issued into pipelined functional units; filled slots show executed work]
Opportunities Lost
• Functional units have pipeline bubbles
– Lost opportunity to execute some other instruction
• Correspondingly, the front-end issue slots also have holes
– Front end = instruction fetch, decode, out-of-order execution unit, register rename logic, etc.



Symmetric Multiprocessing to the Rescue

Multiple Processors
Discussion on SMP
• Two programs are running
– More work done per unit time
• Number of execution units doubled
– Number of empty slots also doubled
• Execution efficiency has not improved
Multithreading
One instruction stream per slot



Multithreading
• Alleviates some of the memory latency problems
• Still has problems
– What if the red thread waits for data from memory and there is a cache miss?
• The yellow thread waits unnecessarily
Hyperthreading
More than one instruction stream per slot



Multicores
• Two or more processors on the same chip
• Each has an independent interface to the frontside bus
• Both the OS and the applications must support thread-level parallelism
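As a minimal sketch of what such application-level thread parallelism looks like (standard C++11 threads; the work function and data here are made up for illustration), each core can be given its own software thread:

    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical per-core work: each software thread sums its own chunk,
    // so two cores on the same chip can make progress independently.
    void sum_chunk(const std::vector<int>& data, long long* out) {
        long long s = 0;
        for (int x : data) s += x;   // no shared writes between threads
        *out = s;
    }

    int main() {
        std::vector<int> a(1 << 20, 1), b(1 << 20, 2);
        long long sum_a = 0, sum_b = 0;

        // One software thread per core; the OS schedules them onto the cores.
        std::thread t0(sum_chunk, std::cref(a), &sum_a);
        std::thread t1(sum_chunk, std::cref(b), &sum_b);
        t0.join();
        t1.join();

        std::printf("total = %lld\n", sum_a + sum_b);
        return 0;
    }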



Typically……



Challenges & Solutions
• Multiple cores face a serious programmability problem
– Writing correct parallel programs is very difficult
– The problem is synchronization
– Traditional approach: lock-based synchronization

Problem: Lock-Based Synchronization

Lock-based synchronization of shared data access is fundamentally problematic:

• Software engineering problems
– Lock-based programs do not compose
– Performance and correctness tightly coupled
– Timing-dependent errors are difficult to find and debug
• Performance problems
– High performance requires finer-grain locking
– More and more locks add more overhead

Need a better concurrency model for multi-core software
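A small sketch of the composability problem (a hypothetical Account type, standard C++): each account's own lock is correct in isolation, yet composing two locked operations into a transfer can deadlock, and avoiding that requires a program-wide locking discipline.

    #include <mutex>

    // Each account protects its own balance with a lock -- fine in isolation.
    struct Account {
        std::mutex m;
        long balance = 0;
    };

    // Composing two individually-correct locked operations is the hard part:
    // transfer(a, b) running concurrently with transfer(b, a) can deadlock,
    // because each thread holds one lock and waits forever for the other.
    void transfer_naive(Account& from, Account& to, long amount) {
        std::lock_guard<std::mutex> l1(from.m);   // thread 1 locks A, thread 2 locks B
        std::lock_guard<std::mutex> l2(to.m);     // ...then each waits on the other
        from.balance -= amount;
        to.balance   += amount;
    }

    // Avoiding the deadlock needs a global locking discipline (std::lock, or
    // always locking in a fixed order) -- exactly the kind of whole-program
    // reasoning that makes lock-based code hard to compose.
    void transfer_ordered(Account& from, Account& to, long amount) {
        std::lock(from.m, to.m);                  // acquires both without deadlock
        std::lock_guard<std::mutex> l1(from.m, std::adopt_lock);
        std::lock_guard<std::mutex> l2(to.m, std::adopt_lock);
        from.balance -= amount;
        to.balance   += amount;
    }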


Transactional Memory
• Transactional Memory addresses a key part of the problem
– Makes parallel programming easier by simplifying coordination
– Requires hardware support for performance
• Allows multiple atomic operations to proceed until they conflict (read-write / write-write on the same address)
Examples (transactions T1, T2; time runs downward):

CONFLICT
T1: start_transaction, R(A), W(B), end_transaction
T2: start_transaction, R(B), R(C), end_transaction
(T1 writes B while T2 reads B – a write-read conflict on the same address)

NO CONFLICT
T1: start_transaction, R(A), R(B), end_transaction
T2: start_transaction, R(B), R(C), end_transaction
(both transactions only read B, so they can commit independently)
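One way to express the same transactions in software, sketched with GCC's experimental -fgnu-tm extension (an assumption of this example; the slides describe hardware TM, and start_transaction/end_transaction are generic notation rather than a specific API):

    // Compile with: g++ -fgnu-tm tm_sketch.cpp
    // Software-TM sketch only; hardware TM would detect the same
    // read-write/write-write conflicts in the cache and coherence logic.
    long A = 0, B = 0, C = 0;

    long t1_body() {
        long a;
        __transaction_atomic {   // start_transaction
            a = A;               // R(A)
            B = a + 1;           // W(B) -- conflicts with any concurrent access to B
        }                        // end_transaction: commit, or abort and re-execute
        return a;
    }

    long t2_body() {
        long r;
        __transaction_atomic {   // start_transaction
            r = B + C;           // R(B), R(C)
        }                        // end_transaction
        return r;
    }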
TM Design Issues

• Detect conflicts – eager or lazy
• Resolve conflicts
• Commit/abort – version management
• STM and HTM – two approaches


The History
• Massively Parallel Processors
– Distributed Memory Machines
– Shared memory architectures
• Multicores
– The new generation
– Share a large amount of cache
• Research on how we can use them
The History (Contd)
• Yesterday's software is today's coprocessor and tomorrow's hardware
• Examples
– Segmentation
– Overlay to Paging
– Single User to Multiuser OS
– Floating point – from software, to coprocessor, to in-processor capability
– Graphics cards become compute cards – GPUs
The Rise of GPUs
• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until last year, programmed through graphics API
[Chart: GFLOPS over time, NVIDIA GPUs vs. CPUs – G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
– GPU in every PC and workstation – massive volume and potential impact
GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU

[Block diagram of the G80: Host → Input Assembler → Thread Execution Manager → an array of parallel data caches with texture and load/store units → Global Memory]
G80 Characteristics

• 367 GFLOPS peak performance (25–50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for typical applications
• Massively parallel, 128 cores, 90 W
• Massively threaded, sustains 1000s of threads per app
• 30–100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
– User kicks off batches of threads on the GPU
– GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
– Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
– Standalone Driver - Optimized for computation
– Interface designed for compute - graphics free API
– Data sharing with OpenGL buffer objects
– Guaranteed maximum download & readback speeds
– Explicit GPU memory management
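A minimal CUDA sketch of this model: the host explicitly manages GPU memory and kicks off a batch of threads as a kernel (the kernel, array size and scaling factor here are illustrative, not taken from the slides).

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each GPU thread handles one element of the array -- the "batch of threads"
    // that the host kicks off with a single kernel launch.
    __global__ void scale(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float* h = new float[n];
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        // Explicit GPU memory management: allocate on the device, copy in, copy out.
        float* d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        // Enough 256-thread blocks to cover all n elements.
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        std::printf("h[0] = %f\n", h[0]);   // expect 2.000000

        cudaFree(d);
        delete[] h;
        return 0;
    }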
GeForce 7800 GTX Board Details
[Annotated board photo: SLI connector, single-slot cooling, sVideo TV out, 2x DVI, 256 MB / 256-bit DDR3 at 600 MHz (8 pieces of 8M x 32), 16x PCI-Express]
Online Materials
• http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html – A course on Programming Massively Parallel Processors, by David Kirk, UIUC
