November 2011
[Die diagram: four Bulldozer modules, each with its own 2MB L2 cache, surrounding four 2MB L3 cache blocks (8MB total), the northbridge, the DDR3 PHY, and the HyperTransport PHY]
Virtualization Enhancements
Faster switching between VMs: TLB flush by ASID, VM state caching, decode assists
AMD-V extended migration support
Improved IPC
Micro-architecture and ISA enhancements: SSE4.1/4.2, AVX 1.0/1.1, SSSE3, AES, LWP, XOP, FMA4
Independent core/FP schedulers, deeper buffers, larger caches, branch fusion, aggressive pre-fetching, and a high degree of thread concurrency throughout
[Package diagram: a 16-core AMD Opteron 6200 Series processor. Eight Bulldozer modules (cores 1-16), each pairing two cores with a shared 2M L2 cache and a 256-bit FPU; two 8M L3 caches; two memory controllers; four HyperTransport interfaces]
Four Core AMD Opteron 6200 Series Processors: huge memory bandwidth per core
Eight Core AMD Opteron 6200 Series Processors: a blend of cores and memory with an edge on memory bandwidth
Twelve Core AMD Opteron 6200 Series Processors: a blend of cores and memory with a little more computational muscle
Sixteen Core AMD Opteron 6200 Series Processors: the industry's only 16-core x86 processor, for massive thread density
Executes two SSE or AVX (128-bit) instructions simultaneously or one AVX (256-bit) instruction per Bulldozer module
No waiting on the integer scheduler to run instructions; designed to be always available to schedule floating-point operations
*The FMAC can execute an FMA4 operation (a = b + c*d) in one cycle vs. the two cycles required for FMA3 or a standard SSE floating-point calculation.
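To make the footnote concrete, here is a minimal, hedged sketch of the a = b + c*d form using compiler intrinsics. The function name fma4_axpy is illustrative, and it assumes GCC or Clang invoked with -mfma4 on an FMA4-capable processor:

#include <x86intrin.h>

/* Illustrative only: computes a[i] = b[i] + c[i]*d[i] for four floats
 * with a single FMA4 instruction; _mm_macc_ps(c, d, b) yields c*d + b. */
void fma4_axpy(float *a, const float *b, const float *c, const float *d)
{
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 vd = _mm_loadu_ps(d);
    _mm_storeu_ps(a, _mm_macc_ps(vc, vd, vb));
}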
FLEX FP
[Diagram: per module, one FP scheduler feeds two 128-bit FMACs, usable as two independent 128-bit AVX pipes or fused as one 256-bit AVX pipe]

Capability | AMD Opteron 6100 Series FPU | Intel Sandy Bridge | AMD Opteron 6200 Series FlexFP
Execute 128-bit FP | Yes | Yes | Yes
Support SSSE3, SSE4.1, SSE4.2 | No | Yes | Yes
Execute 128-bit AVX | No | Yes | Yes
Execute 256-bit AVX | No | Yes | Yes
Execute two 128-bit SSE or AVX ADD instructions in 1 cycle | No | No | Yes
Execute two 128-bit SSE or AVX MUL instructions in 1 cycle | No | No | Yes
Switch between SSE and AVX instructions without penalty | No | No | Yes
Execute FMA operations (A = B + C*D) | No | No | Yes
Supports XOP | No | No | Yes
FLOPs per cycle (128-bit FP) | 48 | 32 | 64
FLOPs per cycle (128-bit AVX) | - | 32 | 64
FLOPs per cycle (256-bit AVX) | - | 64 | 64

Two 128-bit FMACs are shared per module, allowing for dedicated 128-bit execution per core or shared 256-bit execution per module.
Sandy Bridge information from http://software.intel.com/en-us/avx/
Applications/Use Cases
Video encoding and transcoding
Biometrics algorithms
Text-intensive applications
Applications using AES encryption: secure network transactions, disk encryption (MSFT BitLocker), database encryption (Oracle), cloud security
Software that currently supports SSSE3, SSE4.1, SSE4.2, AESNI, and AVX should run on Bulldozer.
No recompile of code is needed if the software only checks ISA feature bits; a recompile is needed if the software also checks for processor type (see the feature-bit sketch below). For FMA4 or XOP, software will need to be written to call the specific instructions, or be compiled with a compiler that automatically generates code leveraging these instructions.
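As a hedged illustration of the feature-bit approach (not AMD's official detection code), a check using GCC's <cpuid.h>; bit positions follow the public CPUID specification (leaf 1 ECX[28] = AVX; extended leaf 0x80000001 ECX[11] = XOP, ECX[16] = FMA4):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Standard leaf 1: ECX bit 28 indicates AVX support */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("AVX:  %s\n", (ecx & (1u << 28)) ? "yes" : "no");

    /* Extended leaf 0x80000001: ECX bit 11 = XOP, bit 16 = FMA4 */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
        printf("XOP:  %s\n", (ecx & (1u << 11)) ? "yes" : "no");
        printf("FMA4: %s\n", (ecx & (1u << 16)) ? "yes" : "no");
    }
    return 0;
}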
Floating-point-intensive applications:
Signal processing / seismic
Multimedia applications
Scientific simulations
Financial analytics
3D modeling
HPC applications
Numeric applications
Algorithms used for audio/radio
COMPILER SUPPORT
Compiler support: Microsoft Visual Studio 2010 SP1; GCC 4.5, 4.6; Open64 4.2.5; PGI 11.6, 11.7, 11.9; ICC 12*
*Intel's ICC offers an -mAVX flag designed to run on any x86 processor; however, the ICC runtime makes assumptions about cache line sizes and other parameters that cause code not to run on AMD processors.
Others
Random number generators
AVX compiler switch for Fortran
Compiler Support
Alpha support for GCC 4.6 and Open64 4.2.5
PGI 11.8 and ICC 12 added with the ACML 5.0 production release
Cray to begin deployment of ACML with their compiler as of ACML 5.0
All compilers listed for ACML 5.0 will be supported
Real-to-complex (R-C) single precision FFTs
Double precision C-C and R-C FFTs
[Chart: 33% improvement]
Memory Throughput
1600 MHz DDR3 support
LR-DIMM support
1.25V LV-DDR3 support
New memory power management features: aggressive power down, partial channel power down, and memory power capping via APML
*Based on measurements by AMD labs as of 8/9/11. Comparison is AMD Opteron 6200 Series with DDR3-1600 vs. AMD Opteron 6100 Series with DDR3-1333. See substantiation section for config info.
The AMD Opteron 6200 Series fits into the same general thermal range as the previous generation.
TDP Power Cap: set the thermal design power (TDP) to meet power and workload demands for more flexible, denser deployments
Minimizes the number of active transistors for lower power and better performance
POWER MANAGEMENT | P-states, AMD Turbo CORE
Core P-states specify multiple frequency and voltage points of operation
Higher-frequency P-states deliver greater performance but require higher voltage and thus more power
The hardware and operating system vary which P-state a core is in to deliver performance as needed, but use lower-frequency P-states to save power whenever possible
AMD Turbo CORE: when the processor is below its power/thermal limits, frequency and voltage can be boosted above the normal maximum and remain there until the processor reaches those limits again.
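As a hedged illustration, on Linux the P-state frequency ladder is visible through the cpufreq sysfs interface; this minimal sketch assumes the acpi-cpufreq driver and the standard sysfs path:

#include <stdio.h>

int main(void)
{
    /* Each listed frequency corresponds to one software-visible P-state. */
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/"
                    "scaling_available_frequencies", "r");
    char buf[256];
    if (f && fgets(buf, sizeof buf, f))
        printf("P-state frequencies (kHz): %s", buf);
    if (f) fclose(f);
    return 0;
}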
[Diagram: hardware vs. software views of the P-state ladder, ordered from higher performance to lower power. The boost P-states Pb0 and Pb1 sit above the base P-state and appear only in the hardware view; software sees the base P-state and below, renumbered from P0. An example TDP limit (higher resolution) is set between P5 and P6.]
When there is TDP headroom in a given workload, AMD Turbo CORE technology is automatically activated and can increase clock speeds by 300 to 500 MHz across all cores.
When a lightly threaded workload sends half the Bulldozer modules into C6 sleep state but also requests max performance, AMD Turbo CORE technology can increase clock speeds by up to 1 GHz across half the cores.
More control over power settings
Flexibility to set power limits without capping CPU frequencies4
…support in Bulldozer processor platforms where the TDP power capping feature is enabled in the system BIOS
3For platforms that have designed-in APML platform support
4TDP power capping can still allow the processor to operate at a maximum specified frequency
[Diagram: core voltage (VDD Core) across idle cores under each power state]

C1e
No cores running workloads; core/module frequency is reduced to 800MHz to save more power
Voltage is reduced but still applied to the cores, resulting in leakage / static power

Smart Fetch
After a set idle time the L2 cache is flushed to L3, allowing cores to sleep to save power while maintaining MP coherency

C6¹ (NEW!)
Voltage is gated off to virtually remove all core static power / leakage
On Bulldozer, any idle module can independently enter C6, gating off up to 95% of power for considerable power savings; module state is saved to DRAM
In addition to reducing memory and I/O power, C6 further reduces core power on Bulldozer vs. the current core in C1e

¹Planned
GENERATIONAL COMPARISONS
 | AMD Opteron 6100 Series Processors | AMD Opteron 6200 Series Processors
Cores | 8 or 12 core | 4, 8, 12 or 16 core
Cache (L2 per core / L3 per die) | 512KB / 6MB | 2MB (shared between 2 cores) / 8MB
Memory channels | four; up to 1333MHz | four; up to 1600MHz
FPU | 128-bit FPU per core (FADD/FMUL) | 128-bit dedicated FMAC per core or 256-bit AVX shared between 2 cores
Decode width (instructions per cycle) | 3 | 4
AMD Turbo CORE | No | Yes (+500MHz with all cores active)
Power bands | 65W, 80W, 105W | TBD (planned 65W, 80W, 105W)
New ISA instructions | - | SSSE3, SSE4.1/4.2, AVX, AES, FMA4, XOP, PCLMULQDQ
Power management | AMD CoolCore, C1E | AMD CoolCore, C1E, C6
Process | 45nm SOI | 32nm SOI (smaller overall die size)
Performance
[Chart: generational performance comparison*]
The above reflects current expectations regarding features and performance and is subject to change.
*Based on pre-production measurements by AMD labs as of 8/9/11 and comparison of top-bin AMD Opteron 6200 Series processors to top-bin AMD Opteron 6100 Series processors.
ATI FirePro Professional Graphics
3D accelerators for visualization
Full support for GPU computation with OpenCL

AMD FireStream GPU Accelerators
Optimized for server integration
Single-slot and dual-slot form factors
Industry-standard OpenCL SDK
[Chart: performance roadmap toward the Exascale Target, on a log scale from 1.00E+15 to 1.00E+18 FLOPS, with 20PF, 50PF, 100PF, and 500PF milestones at ~30% uplift per generation]
HIGH-EFFICIENCY LINPACK IMPLEMENTATION ON AMD MAGNY-COURS + AMD 5870 GPU
[Bar chart: System GFLOPS (0-2500) for DGEMM/node, Linpack/node, and Linpack/4 nodes]
GPU DPFP peak: 544 GFLOPS
GPU DGEMM kernel: 87% of peak; 2.5 GFLOPS/W
Node DPFP peak: 745.6 GFLOPS
Linpack efficiency: 75.5% of peak
Linpack scaling across 4 nodes: 70% of peak
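In absolute terms, those ratios work out to roughly 0.87 × 544 ≈ 473 GFLOPS for the DGEMM kernel and 0.755 × 745.6 ≈ 563 GFLOPS of delivered Linpack per node.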
HPL code:
http://code.compeng.uni-frankfurt.de/
APU: a fusion of CPU & GPU compute power within one processor, with high-bandwidth I/O
[Diagram: discrete configuration vs. APU. The CPU chip (CPU cores, MC) reaches system memory at ~17 GB/sec, while the discrete GPU chip (GPU, UVD) is reached over PCIe at ~7 GB/sec through the SB functions; the APU chip (CPU cores, UNB/MC, GPU, UVD) shares a ~27 GB/sec path to memory]
3X the bandwidth between GPU and memory
Even the same-sized GPU is substantially more effective in this configuration
Eliminates the latency and power associated with the extra chip crossing; bandwidth pinch points and latency hold back the GPU's capabilities
Substantially smaller physical footprint
APU bandwidth enhancements improve not only traditional graphics but also data-parallel compute effectiveness
Both GPUs cooperate on graphics and compute
OpenCL 1.1 and DirectCompute compliant in both the APU and the optional discrete GPU
[Diagram: APU chip (GPU, UVD) with ~27 GB/sec paths to system memory, plus an optional discrete GPU card/chip (GPU, UVD, MC) with >100 GB/sec to its own local memory]
The AMD Fusion Experience Fund is helping to fuel the application developer community
INTRODUCING OPENCL™
The open standard for parallel programming across heterogeneous processors
[Diagram: a CPU attached to system memory, a Fusion GPU, and a discrete GPU with its own GPU memory]
Very many GPU processing elements (100s); different vendors, configurations, and architectures
The multi-million-dollar question: how do you avoid developing and maintaining different source code versions?
WHAT IS OPENCL™?
[Venn diagram: heterogeneous computing sits at the emerging intersection of CPUs and GPUs. Source: Khronos]
OPENCL FRAMEWORK
[Diagram: an OpenCL context contains programs, kernels, memory objects, and command queues. Programs compile to per-device binaries (a dp_mul CPU program binary and a dp_mul GPU program binary); kernels hold their argument values (arg[0], arg[1], arg[2]); memory objects are buffers and images; command queues execute in order]

__kernel void dp_mul(__global const float *a,
                     __global const float *b,
                     __global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}
// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the args values
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);

// set work-item dimensions
global_work_size[0] = n;

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             global_work_size, NULL, 0, NULL, NULL);

// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n*sizeof(cl_float), dst, 0, NULL, NULL);
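The snippets above assume an OpenCL context already exists; a minimal sketch of that missing first step, with error handling omitted:

// Assumed earlier step, not shown on the slide: create a context
// spanning all GPU devices in the platform.
cl_context context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                             NULL, NULL, NULL);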
CORE DESIGN
EXPOSING PARALLELISM

// Plain C function: a serial loop over all 24 elements
// (function name and X parameter reconstructed for completeness)
void saxpy(const float *X, float *Y, const float a)
{
    for (int i = 0; i < 24; i++) {
        Y[i] = a*X[i] + Y[i];
    }
}

// Equivalent OpenCL kernel: the loop disappears; each of the
// 24 work items computes one element
__kernel void saxpy(__global const float *X,
                    __global float *Y,
                    const float a)
{
    uint i = get_global_id(0);
    Y[i] = a*X[i] + Y[i];
}
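On the host side, the kernel replaces the loop by enqueueing one work item per element; a hedged sketch reusing the cmd_queue and kernel objects from the framework slides:

// Launch 24 work items, one per array element, mirroring the loop bound.
size_t global_work_size = 24;
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             &global_work_size, NULL, 0, NULL, NULL);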
[Diagram: a 24-element index space of work items (0-23), partitioned into work groups]
MEMORY SPACES
[Diagram: OpenCL memory spaces]

EXECUTION ON GPUS
[Diagram: work items execute in SIMD fashion across GPU processing elements]

EXECUTION ON CPU
[Diagram: a CPU core iterates through work items (e.g., 0-4) one at a time]
Hardware Support
Fusion APUs
Multi-GPU (Windows 7)
FFT and BLAS-3 libraries
Binary image format / offline compile
Developer tool suite (later slides): debugger, profiler, optimizers
Graphics driver releases: the OpenCL run-time is available by default in Catalyst driver builds (Windows)
gDEBugger V5.8
Proven, mature product (6+ years); widely deployed, with tens of thousands of users across multiple industries
OpenCL and OpenGL API-level debugger, profiler, and memory analyzer
Access internal system information
Find bugs, optimize compute performance, minimize memory usage
Windows, Linux, and MacOS
GDEBUGGER 6.0 (VISUAL STUDIO EXTENSION)
Visual Studio 2010 extension with a native look and feel
Find bugs and shorten development time; minimize memory usage
Includes all of gDEBugger's capabilities (OpenCL and OpenGL)
Trace OpenCL API calls
Single-step through OpenCL kernels
Insert breakpoints into OpenCL kernels
View kernel local variables, buffers & images
APP PROFILER
Run-time OpenCL and DirectCompute profiler
Identify bottlenecks and optimization opportunities
Access GPU hardware performance counters for AMD GPUs
Visualize the timeline and API trace for an OpenCL program
APP KERNELANALYZER
Version of the GPU ShaderAnalyzer tool targeting Accelerated Parallel Processing
Statically analyzes OpenCL kernels for AMD Radeon GPUs
Displays compiled kernels as IL or GPU ISA
Displays statistics from the AMD kernel compiler
Estimates kernel performance
Includes support for the previous 12 versions of the Catalyst driver
CODEANALYST PERFORMANCE ANALYZER
Profiling suite to identify, investigate, and tune code performance on AMD platforms
Find time-critical hot spots
C, C++, Fortran, Java, and .NET managed code
Identify thread-affinity and core-utilization problems
Windows and Linux platforms
LINUX SDK 2.5
Linux OpenCL SDK available
gDEBugger
LLVM optimization passes
APP Profiler
CodeAnalyst
OpenCL
Open standard, closed implementation
MulticoreWare: Parallel Path Analyzer; Task Manager; GMAC (v11.3, v10.04, v6, 5.5)
FUSION
A PATH TO THE FUTURE
FSA ROADMAP
[Roadmap: four stages of Fusion integration: Physical Integration, Optimized Platforms, Architectural Integration, System Integration. Milestones include common manufacturing technology, bi-directional power management between CPU and GPU, and quality of service]
[Software stack diagram: applications run on domain libraries and OpenCL, which sit on the FSA runtime and FSA JIT, layered over AMD user-mode and kernel-mode components]
QUESTIONS?