
Architectures for Next Generation HiPC

Prof. V. Kamakoti
Reconfigurable Intelligent Systems Engineering group (RISE)
Dept. of CSE
IIT Madras



How Chips Have Shrunk

• Built in 1946 at UPenn (the ENIAC)
• Measured in cubic feet



ENIAC on a Chip

• 1997 → 7.44 mm x 5.29 mm
• 174,569 transistors → 0.5 µm technology
Integrated Circuit Revolution

1958: First integrated circuit (germanium)
– Built by Jack Kilby at Texas Instruments
– Contained five components: transistors, resistors and capacitors

2000: Intel Pentium 4 Processor
– Clock speed: 1.5 GHz
– Transistors: 42 million
– Technology: 0.18 µm CMOS



Costs Over Time



Evolution in IC Complexity



If Transistors are Counted as Seconds

4004 < 1 hr
8080 < 2 hrs
8086 8 hrs
80286 1.5 days
80386 DX 3 days
80486 13 days
Pentium > 1 month
Pentium Pro 2 months
P II 3 months
P III ~1 year
P4 ~ 1.5 years
Comparison of Sizes



How Small Are The Transistors?

[Chart: transistor feature size (µm) vs. year, 1983–2007 – from 2.0 µm (80286, 80386) through the micron, sub-micron, deep sub-micron and ultra-deep sub-micron regimes (486, Pentium, Pentium II) down to roughly 0.1 µm and below in the nano era (Pentium IV, Itanium)]
• Compare that to the diameter of a human hair – 56 µm
Moore’s Law
• Transistor counts double almost every 2.3 years
– Gordon Moore of Intel
– A visionary prediction
– Observed in practice for more than 3 decades

• Implications
– More functionality
– More complexity
– Cost??
Processor Frequency Trends

Frequency doubles each generation


Processor Power Trends



Power Density Increase
[Chart: power density (W/cm²) vs. year, 1970–2010 – rising from about 1 W/cm² (4004, 8008, 8080, 8085) up through the 8086, 286, 386, 486, Pentium and P6, with the trend extrapolating past hot-plate levels toward nuclear-reactor, rocket-nozzle and eventually Sun's-surface power densities]



Complexity



What can we do to bridge the gap?

• Increase frequency
• Increase voltage of operation
• Increase the amount of work done per unit time
• Increase the number of hardware units
• Use clever techniques



Increase Frequency
• Has been attempted for a long time
• Increase in frequency ≠ better performance
• Clock speeds have been around 4 GHz for almost 2 years
• Companies don't play this number game anymore



Increase Voltage of Operation
• Has been done by overclockers
– Mostly avid gamers
– There is a limit beyond which voltage cannot be increased – electrical breakdown
• Power = C · V² · f + g1 · V³ + g2 · V⁵, where the g1 · V³ term is subthreshold leakage and the g2 · V⁵ term is gate leakage
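To make the trade-off concrete, here is a minimal sketch in plain C/C++ of this power model; the coefficient values C, f, g1 and g2 below are illustrative placeholders, not measured numbers from the slides.

    #include <stdio.h>

    /* Power model from the slide:
       P = C*V^2*f (switching) + g1*V^3 (subthreshold leakage) + g2*V^5 (gate leakage).
       All coefficient values are made up for illustration only. */
    double chip_power(double C, double f, double g1, double g2, double V) {
        double switching    = C * V * V * f;
        double subthreshold = g1 * V * V * V;
        double gate         = g2 * V * V * V * V * V;
        return switching + subthreshold + gate;
    }

    int main(void) {
        double C = 1.5e-8, f = 3e9, g1 = 20.0, g2 = 15.0;  /* hypothetical values */
        for (double V = 1.0; V <= 1.61; V += 0.2)
            printf("V = %.1f V  ->  P = %6.1f W\n", V, chip_power(C, f, g1, g2, V));
        return 0;
    }

Because the leakage terms grow as V³ and V⁵, even modest voltage increases cost disproportionate power.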



Increase Amount of Work Done per Time Unit
• By dedicating more resources
• Don't waste computation
• Make hardware faster



Basis of Hyperthreading



[Figure: a conventional wide-issue superscalar processor (4 instructions per cycle) – different tasks issued into pipelined functional units; filled slots show executed work]
Opportunities Lost
• Functional units have pipeline bubbles
– Lost opportunity to execute some other instruction
• Correspondingly, the front-end issue slots also have holes
– Front end = instruction fetch, decode, out-of-order execution unit, register rename logic, etc.



Symmetric Multiprocessing to the Rescue

Multiple Processors
Discussion on SMP
• Two programs are running
– More work done per unit time
• Number of execution units doubled
– Number of empty slots also doubled
• Execution efficiency has not improved
Multithreading
One instruction stream per slot



Multithreading
• Alleviates some of the memory latency problems
• Still has problems
– What if the red thread waits for data from memory and there is a cache miss?
• The yellow thread waits unnecessarily
Hyperthreading
More than one instruction stream per slot



Multicores
• Two or more processors on the same chip
• Each has an independent interface to the frontside bus
• Both the OS and the applications must support thread-level parallelism
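As a minimal sketch of what such application-level thread parallelism looks like (standard C++11 threads; the work function and data here are made up for illustration), each core can be given its own software thread:

    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical per-core work: each software thread sums its own chunk,
    // so two cores on the same chip can make progress independently.
    void sum_chunk(const std::vector<int>& data, long long* out) {
        long long s = 0;
        for (int x : data) s += x;   // no shared writes between threads
        *out = s;
    }

    int main() {
        std::vector<int> a(1 << 20, 1), b(1 << 20, 2);
        long long sum_a = 0, sum_b = 0;

        // One software thread per core; the OS schedules them onto the cores.
        std::thread t0(sum_chunk, std::cref(a), &sum_a);
        std::thread t1(sum_chunk, std::cref(b), &sum_b);
        t0.join();
        t1.join();

        std::printf("total = %lld\n", sum_a + sum_b);
        return 0;
    }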



Typically……



Challenges & Solutions
• Multiple cores face a serious programmability problem
– Writing correct parallel programs is very difficult
– The problem is synchronization
– Traditional approach: lock-based synchronization

Problem: Lock-Based Synchronization

Lock-based synchronization of shared data access is fundamentally problematic:

• Software engineering problems
– Lock-based programs do not compose
– Performance and correctness tightly coupled
– Timing-dependent errors are difficult to find and debug
• Performance problems
– High performance requires finer-grain locking
– More and more locks add more overhead

Need a better concurrency model for multi-core software
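A small sketch of the composability problem (a hypothetical Account type, standard C++): each account's own lock is correct in isolation, yet composing two locked operations into a transfer can deadlock, and avoiding that requires a program-wide locking discipline.

    #include <mutex>

    // Each account protects its own balance with a lock -- fine in isolation.
    struct Account {
        std::mutex m;
        long balance = 0;
    };

    // Composing two individually-correct locked operations is the hard part:
    // transfer(a, b) running concurrently with transfer(b, a) can deadlock,
    // because each thread holds one lock and waits forever for the other.
    void transfer_naive(Account& from, Account& to, long amount) {
        std::lock_guard<std::mutex> l1(from.m);   // thread 1 locks A, thread 2 locks B
        std::lock_guard<std::mutex> l2(to.m);     // ...then each waits on the other
        from.balance -= amount;
        to.balance   += amount;
    }

    // Avoiding the deadlock needs a global locking discipline (std::lock, or
    // always locking in a fixed order) -- exactly the kind of whole-program
    // reasoning that makes lock-based code hard to compose.
    void transfer_ordered(Account& from, Account& to, long amount) {
        std::lock(from.m, to.m);                  // acquires both without deadlock
        std::lock_guard<std::mutex> l1(from.m, std::adopt_lock);
        std::lock_guard<std::mutex> l2(to.m, std::adopt_lock);
        from.balance -= amount;
        to.balance   += amount;
    }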


Transactional Memory
• Transactional Memory addresses a key part of the problem
– Makes parallel programming easier by simplifying coordination
– Requires hardware support for performance
• Allows multiple atomic operations to proceed until they conflict (read-write / write-write on the same address)
Examples (transactions T1, T2; time runs downward):

CONFLICT
T1: start_transaction, R(A), W(B), end_transaction
T2: start_transaction, R(B), R(C), end_transaction
(T1 writes B while T2 reads B – a write-read conflict on the same address)

NO CONFLICT
T1: start_transaction, R(A), R(B), end_transaction
T2: start_transaction, R(B), R(C), end_transaction
(both transactions only read B, so they can commit independently)
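One way to express the same transactions in software, sketched with GCC's experimental -fgnu-tm extension (an assumption of this example; the slides describe hardware TM, and start_transaction/end_transaction are generic notation rather than a specific API):

    // Compile with: g++ -fgnu-tm tm_sketch.cpp
    // Software-TM sketch only; hardware TM would detect the same
    // read-write/write-write conflicts in the cache and coherence logic.
    long A = 0, B = 0, C = 0;

    long t1_body() {
        long a;
        __transaction_atomic {   // start_transaction
            a = A;               // R(A)
            B = a + 1;           // W(B) -- conflicts with any concurrent access to B
        }                        // end_transaction: commit, or abort and re-execute
        return a;
    }

    long t2_body() {
        long r;
        __transaction_atomic {   // start_transaction
            r = B + C;           // R(B), R(C)
        }                        // end_transaction
        return r;
    }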
TM Design Issues

• Detect conflicts – eager or lazy
• Resolve conflicts
• Commit/abort – version management
• STM and HTM – two approaches


The History
• Massively Parallel Processors
– Distributed Memory Machines
– Shared memory architectures
• Multicores
– The new generation
– Share a large amount of cache
• Research on how we can use them
The History (Contd)
• Yesterday's software is today's coprocessor and tomorrow's hardware
• Examples
– Segmentation
– Overlay to Paging
– Single User to Multiuser OS
– Floating point – from software, to coprocessor, to in-processor capability
– Graphics cards become compute cards – GPUs
The Rise of GPUs
• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until last year, programmed through graphics API
[Chart: GFLOPS over time, NVIDIA GPUs vs. CPUs – G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
– GPU in every PC and workstation – massive volume and potential impact
GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU

[Block diagram of the G80: Host → Input Assembler → Thread Execution Manager → an array of parallel data caches with texture and load/store units → Global Memory]
G80 Characteristics

• 367 GFLOPS peak performance (25–50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for typical applications
• Massively parallel, 128 cores, 90 W
• Massively threaded, sustains 1000s of threads per app
• 30–100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
– User kicks off batches of threads on the GPU
– GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
– Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
– Standalone Driver - Optimized for computation
– Interface designed for compute - graphics free API
– Data sharing with OpenGL buffer objects
– Guaranteed maximum download & readback speeds
– Explicit GPU memory management
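A minimal CUDA sketch of this model: the host explicitly manages GPU memory and kicks off a batch of threads as a kernel (the kernel, array size and scaling factor here are illustrative, not taken from the slides).

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each GPU thread handles one element of the array -- the "batch of threads"
    // that the host kicks off with a single kernel launch.
    __global__ void scale(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float* h = new float[n];
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        // Explicit GPU memory management: allocate on the device, copy in, copy out.
        float* d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        // Enough 256-thread blocks to cover all n elements.
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        std::printf("h[0] = %f\n", h[0]);   // expect 2.000000

        cudaFree(d);
        delete[] h;
        return 0;
    }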
GeForce 7800 GTX Board Details
[Annotated board photo: SLI connector, single-slot cooling, sVideo TV out, 2x DVI, 256 MB / 256-bit DDR3 at 600 MHz (8 pieces of 8M x 32), 16x PCI-Express]
Online Materials
• http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html – A course on Programming Massively Parallel Processors, by David Kirk, UIUC
