
Introduction to GPGPU

and ManyCore Systems


Tim Child
Oct 2017
Bio

Former CTO of EARN.org. Over 20 years' experience in
software development, including 3D CAD, database,
and Web applications. Formerly VP Engineering at
Oracle, BEA Systems Inc., and Informix. Led teams
that have developed many large 3D database projects
and products, including AutoCAD Map, NASA's EOSDIS,
and LINZ (Land Information New Zealand).
Overview
Bio

Overview

Why GPUs

GPU Architecture

Hardware Types

Software

Dev Tools

Programming Techniques

Applications

Gotchas

Q&A
Why GPUs

It's their Performance!


Hardware
Discrete GPUs

ASIC

Tensor Processing Units (TPU)

Special ASIC for Machine Learning

Accelerated Processing Units (APU)

Integrated CPU & GPU: Laptops, Tablets

Mobile

Cell Phone, SBC, Tablet

Embedded

AMD, Nvidia boards

Cloud

Most Vendors

AWS Elastic GPUs


What Powers a GPU

Large # SIMD Cores


ALU, FPU
High Bandwidth RAM
HBM2
HW Accelerators
Video, Audio, Tensor
High Speed Interconnect
NVLink, PCI-E
GPU Architecture
Nvidia Volta

84 Streaming Multiprocessors
64 FP32 Cores per SM = 5376
32 FP64 Cores per SM = 2688
8 Tensor Cores per SM = 672
GPU Speed-Up

https://blogs.nvidia.com/blog/2010/06/23/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel/
Amdahl's Law
Speed-up is Limited by Amdahl's Law

(Figure: two pie charts — a workload that is 5% serial / 95% parallel vs. one that is 50% serial / 50% parallel)
Software Model
Kernel Instances

Host
Driver
Code
GPU Code Example

#include "../common/book.h"
#define N 10

// CUDA Kernel
__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice ) );

    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

From: https://developer.download.nvidia.com/books/cuda-by-example/cuda-by-example-sample.pdf
Dev Stacks & Tools
CUDA

Most Advanced, NVCC + language bindings

OpenCL

GPU & CPU + language bindings

OpenACC

Programming Standard; Compiler Support for C, C++, Fortran

OpenMP

API; Compiler Support for C, C++, Fortran

C++ AMP

Others

Debuggers

Libraries

Thrust, cuBLAS, cuXXX,


Programming Techniques

http://www.seas.upenn.edu/~cis565/LECTURES/Lecture3.pdf
GPU Memory Hierarchy

https://www.bu.edu/pasi/files/2011/07/Lecture31.pdf
Parallel Algorithms
Traditional Model: Locks for Synchronization

Mutex, Semaphore, Critical Sections

Hazards: Deadlock, Live Lock, Priority Inversion

Lock Free Model

Atomics, Compare & Swap
Applications
Graphics

OpenGL, WebGL, Vulkan

HPC

Numerous,

Finance

Black-Scholes, Forex, Programmed Trading, $$$$

AI/ML

TensorFlow,

Computer Vision

OpenCV,

Video/Audio

FFMpeg,

Database

Kinetica, MapD,
Gotchas
Devices

Drivers

Memory Allocation

Memory Bandwidth

Tools

Debugging

Transfer Bandwidth

No Virtual Memory

No Interrupts

No O/S

SIMD
Summary
GPUs Bring Great Benefits

Increase Processing Power

Lower Power per Operation

Require Advanced Skills & Tools to Program

Offer an Increasingly Broad Range of Applications


Q&A
