Professional Documents
Culture Documents
Serial Code
Serial Code
..
.
MT IU MT IU MT IU MT IU MT IU MT IU MT IU MT IU
SP SP SP SP SP SP SP SP
...
Shared Shared Shared Shared Shared Shared Shared Shared
Memory Memory Memory Memory Memory Memory Memory Memory
Device Memory
• Data parallelism
– across threads in a block
– across blocks in a kernel
• Task parallelism
– different blocks are independent
– independent kernels
Beyond Programmable Shading: In Action
Memory model
Thread Block
Per-block
Per-thread Shared
Local Memory
Memory
Kernel 0
... Sequential
Per-device Kernels
Kernel 1 Global
Memory
...
Device 0
memory
Host cudaMemcpy()
memory
Device 1
memory
NVIDIA C Compiler
NVIDIA Assembly
CPU Host Code
for Computing
GPU CPU
David Luebke
dluebke@nvidia.com
..
.
SP SP SP SP SP SP SP SP
...
Shared Shared Shared Shared Shared Shared Shared Shared
Memory Memory Memory Memory Memory Memory Memory Memory
Device Memory
texture<type, dim>
texfetch(texture<> t,
float x, float y=0, float z=0)
cudaChannelFormatDesc
cudaCreateChannelDesc<float>(void);
cudaChannelFormatDesc
cudaCreateChannelDesc<float4>(void);
cudaChannelFormatDesc
cudaCreateChannelDesc<uchar4>(void);
• To allocate an array
cudaError_t
cudaMalloc( cudaArray_t **arrayPtr,
struct cudaChannelFormatDesc &desc,
size_t width,
size_t height = 1);
• To free an array:
cudaError_t
cudaFree( struct cudaArray *array);
Beyond Programmable Shading: In Action
Loading Memory from Arrays
• Load memory to arrays using a version of
cudaMemcpy:
cudaError_t cudaMemcpy(
void *dst,
const struct cudaArray *src,
size_t count,
enum cudaMemcpyKind kind
)
• To unbind:
template<class T, int dim, bool norm>
cudaError_t cudaUnbindTexture(
const struct texture<T, dim, norm> &tex
)
// Allocate array
cudaMalloc( &cu_array, cudaCreateChannelDesc<float>(),
width, height );
// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_odata, width, height);
cudaUnbindTexture(tex);
NVIDIA C Compiler
NVIDIA Assembly
CPU Host Code
for Computing
GPU CPU
• Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }
• Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];
• Execution configuration:
dim3 dimGrid(100, 50); // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel
• Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
void main()
{
Rounding modes for Round to nearest and All 4 IEEE, round to Round to nearest Round to
FADD and FMUL round to zero nearest, zero, inf, -inf only zero/truncate only
Supported, Supported,
Denormal handling Flush to zero Flush to zero
1000’s of cycles 1000’s of cycles
Reciprocal estimate
24 bit 12 bit 12 bit 12 bit
accuracy
Reciprocal sqrt
23 bit 12 bit 12 bit 12 bit
estimate accuracy
13–457x
GPU Computing:
110-240X
Motivation
35X
1000
750
500
250
0
Sep-02 Jan-04 May-05 Oct-06 Feb-08
• Applications:
– High bandwidth:
Sequencing (virus scanning, genomics), sorting, database…
– Visual computing:
Graphics, image processing, tomography, machine vision…
Computational Computational
Medicine Modeling
Computational Computational
Science Biology
Computational Image
Finance Processing
Beyond Programmable Shading: In Action
GPU Computing Sweet Spots
• From cluster to workstation
– The “personal supercomputing” phase change
• From lab to clinic
• From machine room to engineer, grad student desks
• From batch processing to interactive
• From interactive to real-time
• GPU-enabled clusters
– A 100x or better speedup changes the science
• Solve at different scales
• Direct brute-force methods may outperform cleverness
• New bottlenecks may emerge
• Approaches once inconceivable may become practical
Beyond Programmable Shading: In Action
Applications - Condensed
• 3D image analysis • Film Protein folding
• Adaptive radiation therapy • Financial - lots of areas Quantum chemistry
• Acoustics • Languages
Ray tracing
• Astronomy • GIS
• Audio • Holographics cinema Radar
• Automobile vision • Imaging (lots) Reservoir simulation
• Bioinfomatics • Mathematics research Robotic vision/AI
• Biological simulation • Military (lots)
• Broadcast • Mine planning Robotic surgery
• Cellular automata • Molecular dynamics Satellite data analysis
• Computational Fluid Dynamics • MRI reconstruction Seismic imaging
• Computer Vision • Multispectral imaging
Surgery simulation
• Cryptography • nbody
• CT reconstruction • Network processing Surveillance
• Data Mining • Neural network Ultrasound
• Digital cinema/projections • Oceanographic research Video conferencing
• Electromagnetic simulation • Optical inspection
Telescope
• Equity training • Particle physics
Video
Visualization
Wireless
X-ray
Beyond Programmable Shading: In Action
Tesla HPC Platform
• HPC-oriented product line
– C870: board (1 GPU)
– D870: deskside unit (2 GPUs)
– S870: 1u server unit (4 GPUs)
Architecture: TESLA
Beyond Programmable Shading: In Action
Chips: G80, G84, G92, …
New Applications
Real-time options implied volatility engine
Ultrasound imaging
Manifold 8 GIS
• Architectures
– Connection Machine, MasPar, Cray
– True supercomputers: incredibly exotic, powerful, expensive
multiprocessors
Multiprocessor 2
• Each multiprocessor is a Multiprocessor 1
set of 32-bit processors
with a Single Instruction
Multiple Data architecture Instruction
Processor 1 Processor 2
… Processor M
multiprocessor executes
the same instruction on a
group of threads called a
warp
• The number of threads in a
warp is the warp size
• Processing elements
– 8 scalar thread processors (SP)
SM t0 t1 … tB
– 32 GFLOPS peak at 1.35 GHz
MT IU – 8192 32-bit registers (32KB)
• ½ MB total register file space!
SP – usual ops: float, int, branch, …
• Hardware multithreading
– up to 8 blocks resident at once
– up to 768 active threads in total
Multiprocessor N
memory Texture
Memory