sjenks@uci.edu
Source: NVIDIA
Where Is Parallel Computing Going? And What Are the Research Challenges?
2/25/2009, Stephen Jenks
Computing is Changing
Parallelism is changing:
- Special-purpose parallel engines
- CPU and parallel engine work together
- Different code on CPU & parallel engine
- Asymmetric Computing
Hasn't Changed Much for 40 Years
- Pipelining and superscalar since the 1960s, but now integrated on chip
- Microprocessors: high clock speed, great performance
- High power and cooling issues; various solutions
Stencil
[Diagram: four CPUs, each with its own cache, connected to one shared memory]
Shared Memory
- Each CPU computes results for its partition
- Memory is shared, so dependences are satisfied
Threads
- POSIX Threads, Windows Threads
- Not integrated with the compiler or language
  - No idea whether code is in a thread or not
  - Poor optimizations
OpenMP
- User-inserted directives to the compiler
- Loop parallelism:

```c
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i] * c[i];
```

- Parallel regions
- Compiler support: e.g., Visual Studio
Multicore Processors
- Several CPU cores per chip
- Shared memory
- Shared caches (sometimes)
- Lower clock speed
AMD Athlon 64 X2
Source: AMD.com
Nehalem - Everything You Need to Know about Intel's New Architecture http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3382
Current Cores Are Powerful and General
Pair general CPUs with specialized accelerators:
- Graphics Processing Unit (GPU)
- Field-Programmable Gate Array (FPGA)
- Single Instruction, Multiple Data (SIMD) processor
[Diagram: Athlon 64 CPU and ATI GPU on one die, connected through an XBAR to the HyperTransport link and memory controller]
Some applications only need certain operations; perhaps a simpler processor could be faster.
From the NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide, 11/29/2007
Data Parallel
- But not loop parallel
- Very lightweight threads
- Write code from the thread's point of view
- Block of threads
- No shared memory between host and device: the host copies data to and from the device (different memories)
Fields stored in six 3D arrays: EX, EY, EZ, HX, HY, HZ, plus two coefficient arrays; O(N³) algorithm
OpenMP FDTD
Because of dependences, the E and H update loops must be split into separate parallel loops.

```c
/* update magnetic field */
#pragma omp parallel for private(i,j,k) \
        shared(hx,hy,hz,ex,ey,ez)
for (i = 1; i < nx - 1; i++)
    for (j = 1; j < ny - 1; j++) {
        for (k = 1; k < nz - 1; k++) {
            ELEM_TYPE invmu = 1.0f / AR_INDEX(mu,i,j,k);
            ELEM_TYPE tmpx = rx * invmu;
            ELEM_TYPE tmpy = ry * invmu;
            ELEM_TYPE tmpz = rz * invmu;
            AR_INDEX(hx,i,j,k) +=
                tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
            AR_INDEX(hy,i,j,k) +=
                tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
            AR_INDEX(hz,i,j,k) +=
                tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
        }
    }
```
[Performance chart residue: PS3, 28.6]
CUDA FDTD
```c
// Block index (predefined variables provide block and thread IDs)
int bx = blockIdx.x;
// Because CUDA does not support 3D grids, we use the y block value to
// fake it. Unfortunately, this needs mod and divide (or a lookup table).
int by = blockIdx.y / gridZ;
int bz = blockIdx.y % gridZ;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;

// This thread's indices in the global matrix (skip outer boundaries)
int i = bx * THREAD_COUNT_X + tx + 1;
int j = by * THREAD_COUNT_Y + ty + 1;
int k = bz * THREAD_COUNT_Z + tz + 1;
```
Problem: CUDA memory accesses are slow
Problem: not much work done per thread
Idea 1: Copy data to fast shared memory
- Allows reuse by neighbor threads
- Takes advantage of coalesced memory accesses
Cell Broadband Engine (PS3)
- PowerPC Processing Element with simultaneous multithreading at 3.2 GHz
- 8 Synergistic Processing Elements at 3.2 GHz
  - Optimized for SIMD/vector processing (100 GFLOPS total)
  - 256 KB local storage, no cache
```c
struct dma_list_elem inList[DMA_LIST_SIZE]  __attribute__ ((aligned (8)));
struct dma_list_elem outList[DMA_LIST_SIZE] __attribute__ ((aligned (8)));

void SendSignal1(program_data* destPD, int value)
{
    sig1_local.SPU_Sig_Notify_1 = value;
    mfc_sndsig(&(sig1_local.SPU_Sig_Notify_1), destPD->sig1_addr,
               3 /* tag */, 0, 0);
    mfc_sync(1 << 3);
}
```
- Extract top performance from multicore CPUs
- Make multicore processors better to program
- Explore & exploit asymmetric computing
[Diagram: "Then" one core per memory interface vs. "Now" multiple cores sharing one memory interface]
- The memory interface was already too slow for one core/thread
- Now multiple threads access memory simultaneously, overwhelming the memory interface
- Parallel programs can run as slowly as sequential ones!
Memory Bottleneck
[Diagram: producer thread passing data to consumer thread through shared cache]
Register-Based Synchronization
- Shared registers between cores
- Halt the processor while waiting, to save power

Synchronization Engine
- Hardware-based multicore synchronization operations
[Diagram: CPU and GPU sharing an XBAR, HyperTransport link, and memory controller]
Thoughts:
- Not many multithreaded apps, so we don't need many CPUs
- Application accelerators speed up multimedia, etc.
- Probably cost-effective to integrate CPU & GPU
- Cray and SGI add FPGA application accelerators to high-end parallel machines
- FPGA code created through VHDL or other compilers
- Can speed up some applications
My Thoughts
- FPGAs are great for prototyping: general & flexible
- But they have slow clock speeds, so they will be quickly surpassed by GPUs & special-purpose CPUs
Intel Larrabee
[Diagram: many simple x86 cores on an interprocessor ring network, each with a slice of coherent L2 cache]
OpenCL
- New standard pushed by Apple & others (all major players)
- Data parallel:
  - N-dimensional computational domain
  - Work-groups, similar to CUDA blocks
  - Work-items, similar to CUDA threads
- Task parallel:
  - Similar to threads, but a single work-item
Summary
GPU Processing
- Hard to program
- Especially hard to get good performance
Resources
- http://www.ibm.com/developerworks/power/cell - Cell tools & docs
- http://www.nvidia.com/object/cuda_home.html - CUDA tools & docs
- http://spds.ece.uci.edu/~sjenks/Pages/ParallelMicros.html - SPPM & parallel execution models
- http://web.mac.com/sfj/iWeb/Site/Welcome.html - Old parallel programming blog (source code for FDTD examples)
- http://newport.eecs.uci.edu/~sjenks - My homepage, where this presentation is posted