SIMD/Vector/GPU
Vector Processing:
Exploiting Regular (Data) Parallelism
Data Parallelism

Time-space duality: the same operation can be applied to many data elements either in time (one element per cycle through a pipelined unit) or in space (many elements at once across parallel units). SIMD processing exploits this regular data parallelism.
Figure: array processor vs. vector processor executing the same instruction stream:

    LD   VR <- A[3:0]
    ADD  VR <- VR, 1
    MUL  VR <- VR, 2
    ST   A[3:0] <- VR

In the array processor, each instruction (LD0..LD3, AD0..AD3, ...) operates on all four elements in the same time step: same operation in space. In the vector processor, each instruction processes one element per cycle, so operations from different instructions overlap in time (e.g., LD1 executes alongside AD0): same operation in time.
Vector Processors

(VLIW is a different approach to exploiting parallelism: Fisher, "Very Long Instruction Word architectures and the ELI-512," ISCA 1983.)
Vector Registers

Figure: a vector register file with registers V0 through V3. Each vector register holds N elements (V0,0 ... V0,N-1; V1,0 ... V1,N-1), each element M bits wide. A vector instruction such as V3 <- V1 * V2 operates on all N element pairs.
Memory Banking

Example machine: the CRAY-1 (Russell, "The CRAY-1 computer system," CACM 1978). Figure: memory is divided into banks (Bank 0, Bank 1, ..., Bank 15), each with its own MDR and MAR, all sharing the data bus and address bus to the CPU. Banking lets multiple accesses be in flight at once, so consecutive vector elements can stream at one per cycle even though each individual bank is busy for several cycles.
Figure: loading/storing vectors from memory. A base address and a stride feed an address generator; the generated addresses go to the memory banks, and data streams between the banks and the vector registers.
Vectorizable Loops

A vectorizable loop: for I = 0 to 49, C[i] = (A[i] + B[i]) / 2.

Scalar code (304 dynamic instructions):

    MOVI R0 = 50            1
    MOVA R1 = A             1
    MOVA R2 = B             1
    MOVA R3 = C             1
    X: LD R4 = MEM[R1++]    11  ;autoincrement addressing
    LD R5 = MEM[R2++]       11
    ADD R6 = R4 + R5        4
    SHFR R7 = R6 >> 1       1
    ST MEM[R3++] = R7       11
    DECBNZ --R0, X          2   ;decrement and branch if NZ

Vectorized loop (7 dynamic instructions):

    MOVI VLEN = 50          1
    MOVI VSTR = 1           1
    VLD V0 = A              11 + VLN - 1
    VLD V1 = B              11 + VLN - 1
    VADD V2 = V0 + V1       4 + VLN - 1
    VSHFR V3 = V2 >> 1      1 + VLN - 1
    VST C = V3              11 + VLN - 1

Why 16 banks? Memory latency is 11 cycles per bank. With elements interleaved across banks 0 through F, the stride-1 address generator touches a new bank every cycle, so each bank has 16 cycles to finish its access before it is needed again, sustaining one element per cycle.
Vector Chaining

Chaining forwards the output of one vector functional unit directly into the next, so dependent vector instructions overlap instead of waiting for whole-vector completion:

    LV   v1
    MULV v3, v1, v2
    ADDV v5, v3, v4

Figure: the load unit fills v1 from memory; its elements chain into the multiplier (producing v3 from v1 and v2), whose elements chain into the adder (producing v5 from v3 and v4).

Execution time for the vectorized loop above, under the strict assumption that each memory bank has a single port (memory bandwidth bottleneck):

    No chaining:                            285 cycles
    Chaining, one memory port:              182 cycles (the two loads and the store still serialize on memory)
    Chaining, separate load/store ports:    79 cycles
Gather/Scatter Operations

Want to vectorize loops with indirect accesses:

    for (i = 0; i < N; i++)
        A[i] = B[i] + C[D[i]]

Indexed (gather) loads make this possible:

    LV     vD, rD        # load indices in D vector
    LVI    vC, rC, vD    # load indirect: gather C[D[i]] from base rC
    LV     vB, rB        # load B vector
    ADDV.D vA, vB, vC    # do the add
    SV     vA, rA        # store the result
Gather/Scatter Operations

Gather/scatter is commonly used for sparse vectors: a compact data vector plus an index vector represents a mostly-zero dense vector.

    Index vector:  0     2    7     8
    Data vector:   3.14  6.5  71.2  2.71

    Equivalent dense vector:
    3.14  0.0  6.5  0.0  0.0  0.0  0.0  71.2  2.71
Masked vector operations use a mask register to select which elements an instruction affects. Example:

    B:     2  2  2  10  -4  -3  5  -8
    VMASK: 0  1  1  0   0   1  1   1

Simple Implementation vs. Density-Time Implementation

Figure: the simple implementation processes every element pair in order (A[7]+B[7] with M[7]=1, A[6]+B[6] with M[6]=0, and so on) and uses the mask bit as a write enable on the write data port, discarding results for masked-off elements. The density-time implementation scans the mask and spends execution time only on elements whose mask bit is 1, so its latency scales with the number of enabled elements rather than the full vector length.
Some Issues

Storage of a matrix: row-major vs. column-major layout determines the stride needed to traverse a row or a column, and the stride in turn determines which memory banks are accessed (a stride that is a multiple of the number of banks causes bank conflicts).
    ADDV C, A, B

Figure: execution using one pipelined functional unit completes one element per cycle (A[4]+B[4], A[5]+B[5], A[6]+B[6] in flight; C[0], C[1], C[2] completing). Execution using four pipelined functional units completes four elements per cycle (A[16..19]+B[16..19] entering while C[4..7] complete).

Figure: vector unit structure. Each lane contains a slice of the vector registers and its own functional-unit pipelines: lane 0 holds elements 0, 4, 8, ...; lane 1 holds elements 1, 5, 9, ...; lane 2 holds elements 2, 6, 10, ...; lane 3 holds elements 3, 7, 11, .... All lanes share the memory subsystem.
Figure: vector instruction-level parallelism. The load unit, multiply unit, and add unit work on different vector instructions at once: while one load streams its elements (Iter. 1: load, load), the multiplier executes a mul and the adder executes adds from earlier instructions, and instruction issue starts the next vector instruction (Iter. 2) as soon as a unit frees up.

Vectorized Code

Figure: time diagram of the vectorized loop. Each vector instruction (load, load, add, store) covers all elements of an iteration's work, and the instructions of Iter. 1 and Iter. 2 overlap in time.
Figure: a thread warp is a set of scalar threads (X, Y, Z, ...) that execute the same instruction in lock step on the SIMD pipeline, sharing a common PC. Multiple warps (Thread Warp 3, Thread Warp 7, Thread Warp 8, ...) are interleaved on the same pipeline.
Scalar Sequential Code vs. Vectorized Code

Figure: the scalar version executes each iteration's load, load, add, store in sequence (Iter. 1, then Iter. 2). In the vectorized version, a single vector instruction performs each operation over the whole vector: two vector loads, one vector add (+), and one vector store, collapsing the iterations into a few vector instructions.

Slide credit: Krste Asanovic
CPU code

    for (ii = 0; ii < 100; ++ii) {
        C[ii] = A[ii] + B[ii];
    }

CUDA code

    // there are 100 threads
    __global__ void KernelFunction() {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        int varA = aa[tid];
        int varB = bb[tid];
        C[tid] = varA + varB;
    }
Traditional SIMD operates in lock step; the programming model is SIMD (no threads), so software needs to know the vector length, and the ISA contains vector/SIMD instructions.

Figure: warp scheduling in a warp-based SIMD pipeline. Warps available for scheduling (Thread Warp 1, Thread Warp 2, Thread Warp 3, Thread Warp 6, Thread Warp 7, Thread Warp 8) flow through I-Fetch, Decode, RF, and the ALUs. At the D-Cache, if all threads hit ("All Hit?"), data is written back; on a miss, the warp joins the set of warps accessing the memory hierarchy until its data returns.
SPMD

Branch divergence occurs when threads within a warp branch to different execution paths.
Figure: handling branch divergence with a reconvergence stack. A warp of four threads executes blocks A and B with full mask 1111 (A/1111, B/1111), then diverges at a branch: threads 1001 follow one path to block C (C/1001) and threads 0110 follow the other to block D (D/0110). A stack of (Reconv. PC, Next PC, Active Mask) entries, with a top-of-stack (TOS) pointer, tracks which path is active; the threads of the active path share a common PC. The two divergent paths (Path A, Path B) execute one after the other in time, then reconverge at E (E/1111) and continue to G (G/1111).

Slide credit: Tor Aamodt
Dynamic Warp Formation

Figure: two warps, x and y, execute the same control-flow graph of basic blocks A through G (legend: "x/1000" means execution of warp x at a basic block with active mask 1000). Both start at A with full masks (x/1111, y/1111) and diverge across blocks C, D, and F with partial masks (e.g., x/0110, y/0010, y/0001). A new warp can be created from scalar threads of both warp x and warp y executing at basic block D. Timeline comparison: the baseline executes each warp's divergent paths with partially empty warps, while dynamic warp formation packs threads from different warps into fuller warps and finishes sooner. Both warps reconverge at G (x/1111, y/1111).
61
NVIDIA-speak:
240 stream processors
SIMT execution
Generic speak:
30 cores
8 SIMD functional units per core
64 KB of storage
for fragment
contexts (registers)
= multiply-add
= multiply
62
Tex
Tex
64 KB of storage
for thread contexts
(registers)
64
Tex
Tex
Tex
Tex
Tex
Tex
Tex
65