sjenks@uci.edu
Source: NVIDIA
Where Is Parallel Computing Going? And What Are the Research Challenges?
2/25/2009, Stephen Jenks
Computing is Changing
Parallelism is changing:
- Special-purpose parallel engines
- CPU and parallel engine work together
- Different code on CPU & parallel engine
- Asymmetric Computing
Hasn't Changed Much for 40 Years
- Pipelining and superscalar since the 1960s, but now integrated on chip
- Microprocessors: high clock speed, great performance
- High power and cooling issues; various solutions
Stencil
[Diagram: four CPUs, each with its own cache, connected to one shared memory]
Shared Memory
- Each CPU computes results for its partition
- Memory is shared, so dependences are satisfied
Threads
- POSIX Threads, Windows Threads
- Not integrated with the compiler or language
  - No idea whether code is in a thread or not
  - Poor optimizations
OpenMP
- User-inserted directives to the compiler
- Loop parallelism:

```c
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i] * c[i];
```

- Parallel regions
- Compiler support: e.g., Visual Studio
Multicore Processors
- Several CPU cores per chip
- Shared memory
- Shared caches (sometimes)
- Lower clock speed
AMD Athlon 64 X2
Source: AMD.com
Nehalem - Everything You Need to Know about Intel's New Architecture http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3382
Current Cores Are Powerful and General
Pair general CPUs with specialized accelerators:
- Graphics Processing Unit (GPU)
- Field-Programmable Gate Array (FPGA)
- Single Instruction, Multiple Data (SIMD) processor
[Diagram: Athlon 64 CPU and ATI GPU on one die, connected through an XBAR to the HyperTransport link and memory controller]
Some applications only need certain operations; perhaps a simpler processor could be faster.
From the NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide, 11/29/2007
Data Parallel
- But not loop parallel
- Very lightweight threads
- Write code from the thread's point of view
- Block of threads
- No shared memory between host and device: the host copies data to and from the device (different memories)
Fields stored in six 3D arrays: EX, EY, EZ, HX, HY, HZ, plus two coefficient arrays; O(N³) algorithm
OpenMP FDTD
Because of dependences, the E and H update loops must be split into separate parallel loops.

```c
/* update magnetic field */
#pragma omp parallel for private(i,j,k) \
        shared(hx,hy,hz,ex,ey,ez)
for (i = 1; i < nx - 1; i++)
    for (j = 1; j < ny - 1; j++) {
        for (k = 1; k < nz - 1; k++) {
            ELEM_TYPE invmu = 1.0f / AR_INDEX(mu,i,j,k);
            ELEM_TYPE tmpx = rx * invmu;
            ELEM_TYPE tmpy = ry * invmu;
            ELEM_TYPE tmpz = rz * invmu;
            AR_INDEX(hx,i,j,k) +=
                tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
            AR_INDEX(hy,i,j,k) +=
                tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
            AR_INDEX(hz,i,j,k) +=
                tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
        }
    }
```
[Performance chart residue: PS3, 28.6]
CUDA FDTD
```c
// Block index (predefined variables provide block and thread IDs)
int bx = blockIdx.x;
// Because CUDA does not support 3D grids, we use the y block value to
// fake it. Unfortunately, this needs mod and divide (or a lookup table).
int by = blockIdx.y / gridZ;
int bz = blockIdx.y % gridZ;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;

// This thread's indices in the global matrix (skip outer boundaries)
int i = bx * THREAD_COUNT_X + tx + 1;
int j = by * THREAD_COUNT_Y + ty + 1;
int k = bz * THREAD_COUNT_Z + tz + 1;
```
Problem: CUDA memory accesses are slow
Problem: not much work done per thread
Idea 1: Copy data to fast shared memory
- Allows reuse by neighbor threads
- Takes advantage of coalesced memory accesses
Cell Broadband Engine (PS3)
- PowerPC Processing Element with simultaneous multithreading at 3.2 GHz
- 8 Synergistic Processing Elements at 3.2 GHz
  - Optimized for SIMD/vector processing (100 GFLOPS total)
  - 256 KB local storage, no cache
```c
struct dma_list_elem inList[DMA_LIST_SIZE]  __attribute__ ((aligned (8)));
struct dma_list_elem outList[DMA_LIST_SIZE] __attribute__ ((aligned (8)));

void SendSignal1(program_data* destPD, int value)
{
    sig1_local.SPU_Sig_Notify_1 = value;
    mfc_sndsig(&(sig1_local.SPU_Sig_Notify_1), destPD->sig1_addr,
               3 /* tag */, 0, 0);
    mfc_sync(1 << 3);
}
```
- Extract top performance from multicore CPUs
- Make multicore processors better to program
- Explore & exploit asymmetric computing
[Diagram: "Then" one core per memory interface vs. "Now" multiple cores sharing one memory interface]
- The memory interface was already too slow for one core/thread
- Now multiple threads access memory simultaneously, overwhelming the memory interface
- Parallel programs can run as slowly as sequential ones!
Memory Bottleneck
[Diagram: producer thread passing data to consumer thread through shared cache]
Register-Based Synchronization
- Shared registers between cores
- Halt the processor while waiting, to save power

Synchronization Engine
- Hardware-based multicore synchronization operations
[Diagram: CPU and GPU sharing an XBAR, HyperTransport link, and memory controller]
Thoughts:
- Not many multithreaded apps, so we don't need many CPUs
- Application accelerators speed up multimedia, etc.
- Probably cost-effective to integrate CPU & GPU
- Cray and SGI add FPGA application accelerators to high-end parallel machines
- FPGA code created through VHDL or other compilers
- Can speed up some applications
My Thoughts
- FPGAs are great for prototyping: general & flexible
- But they have slow clock speeds, so they will be quickly surpassed by GPUs & special-purpose CPUs
Intel Larrabee
[Diagram: many simple x86 cores on an interprocessor ring network, each with a slice of coherent L2 cache]
OpenCL
- New standard pushed by Apple & others (all major players)
- Data parallel:
  - N-dimensional computational domain
  - Work-groups, similar to CUDA blocks
  - Work-items, similar to CUDA threads
- Task parallel:
  - Similar to threads, but a single work-item
Summary
GPU Processing
- Hard to program
- Especially hard to get good performance
Resources
- http://www.ibm.com/developerworks/power/cell - Cell tools & docs
- http://www.nvidia.com/object/cuda_home.html - CUDA tools & docs
- http://spds.ece.uci.edu/~sjenks/Pages/ParallelMicros.html - SPPM & parallel execution models
- http://web.mac.com/sfj/iWeb/Site/Welcome.html - Old parallel programming blog (source code for FDTD examples)
- http://newport.eecs.uci.edu/~sjenks - My homepage, where this presentation is posted