
Parallel Sparse Linear Solver GMRES for GPU Clusters

with Compression of Exchanged Data

J. M. Bahi, R. Couturier, L. Ziane Khodja

Laboratoire d’Informatique de l’Université de Franche-Comté (LIFC)


IUT Belfort-Montbéliard, France

9th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar'11)
Bordeaux, France

August 29th 2011



Our objectives

Previous work:
Solving large sparse linear systems
A parallel GMRES solver on GPU clusters
Sparse banded matrices

This work:
All types of sparse matrices (including those with large bandwidths)
Parallel sparse matrix-vector products on a GPU cluster:
Problem: overheads of the CPU/CPU and GPU/CPU communications
Solution: data compression of the shared vectors



Outline

GPU clusters

GMRES solver and its improvements

Experimental results

Conclusion and future work



GPU clusters



Graphics Processing Unit (GPU)


Initially designed for real-time 3D rendering
Today, a high-performance accelerator for data-parallel, compute-intensive tasks

[Figure: GPU architecture: several streaming multiprocessors, each containing scalar processors (SP) and a shared memory, all connected to the global memory]




GPGPU programming
CUDA: Compute Unified Device Architecture

Nvidia's CUDA programming environment: an extension of the C language

The GPU is viewed as a co-processor of the CPU
Kernels: the data-parallel, compute-intensive functions of an application
In a CUDA program (illustrated below):
Host (CPU):
Controls the execution of the application
Executes the sequential code, written in C
Launches all the kernels on the GPU
GPU:
Executes the kernels as grids of thread blocks (SIMT execution model)
Returns the final result of each kernel execution to the CPU
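A minimal, self-contained illustration of this host/GPU split (our own example, not from the talk): the host allocates device memory, launches a kernel that scales a vector, waits for it, and cleans up.

#include <cuda_runtime.h>

__global__ void scale(double *v, double a, int n)   /* kernel: runs on the GPU */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= a;
}

int main(void)                                      /* host: runs on the CPU */
{
    const int n = 1 << 20;
    double *d_v;
    cudaMalloc(&d_v, n * sizeof(double));
    cudaMemset(d_v, 0, n * sizeof(double));
    scale<<<(n + 255) / 256, 256>>>(d_v, 2.0, n);   /* launch the kernel */
    cudaDeviceSynchronize();                        /* wait for completion */
    cudaFree(d_v);
    return 0;
}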




GPU cluster architecture


Goal: exploit simultaneously the memory capacity and the computing power of several GPUs

[Figure: a six-node GPU cluster; in each node, a CPU (host memory at 25.6 GB/s) is connected by PCIe at 8 GB/s to two GPUs (device memory at 102 GB/s); the nodes are interconnected by InfiniBand at 20 GB/s]

Figure: GPU cluster of IUT Belfort-Montbéliard



GMRES solver and its improvements



GMRES method
Generalized Minimal RESidual solver

Iterative method developed by Saad and Schultz in 1986

Generalization of the MINRES method

Generic and efficient method to solve:

nonsymmetric and non-Hermitian problems
definite and indefinite symmetric problems




Sequential GMRES algorithm with restarts

1: Start: Choose $x_0$ and compute $r_0 = M^{-1}(b - Ax_0)$ and $v_1 = r_0 / \|r_0\|$

2: Iterate: Arnoldi process
For $j = 1, 2, \ldots, m$ do:
  $h_{i,j} = (M^{-1} A v_j, v_i)$, $i = 1, 2, \ldots, j$
  $w_j = M^{-1} A v_j - \sum_{i=1}^{j} h_{i,j} v_i$
  $h_{j+1,j} = \|w_j\|$
  $v_{j+1} = w_j / \|w_j\|$
3: Form the approximate solution $x_m = x_0 + V_m y_m$,
where $y_m$ minimizes $\|\beta e_1 - \bar{H}_m y\|$ over $y \in \mathbb{R}^m$, with $\beta = \|r_0\|$
4: Restart:
Compute $r_m = M^{-1}(b - A x_m)$
If satisfied then stop,
else set $x_0 := x_m$, $v_1 := r_m / \|r_m\|$ and go to 2




GMRES parallelization on a GPU cluster


Partitioning of $Ax = b$ into $p$ sparse sub-systems $A_k x_k = b_k$, $k = 1, 2, \ldots, p$, where $p$ is the number of GPUs in the cluster (a C sketch of the partitioning follows the figure)
[Figure: row-band partitioning of the sparse matrix $A$ and of the vectors $x$ and $b$ over four processes; for each process, the matrix bandwidth determines a local part of the vector $x$ plus left and right shared parts owned by the other processes]
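A minimal sketch, in C, of how such a row-band partitioning can be set up; the variable names (rows_per_proc, offset) are illustrative assumptions, not the authors' code:

/* Row-band partitioning of an n-by-n system over p processes.
   For simplicity, assume that p divides n evenly. */
int rows_per_proc = n / p;
int offset = rank * rows_per_proc;  /* first global row owned by process `rank` */
/* Process `rank` stores the rows [offset, offset + rows_per_proc) of A and the
   matching slices of x and b; the entries of x referenced by these rows but
   owned by other processes form the left and right shared parts. */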




GMRES parallelization on a GPU cluster

Accelerating the mathematical operations on the GPUs:

Sparse matrix-vector product (SpMV)
Dot product, Euclidean norm, AXPY operation
Other kernels: scalar-vector product, solution of the least-squares problem, updates of the solution vector

Synchronizations across the GPU cluster, handled by the CPUs:

Reduction operations: MPI_Allreduce() (see the sketch below)
Parallel SpMV: exchange of the shared vectors of unknowns
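As an illustration of the first kind of synchronization, a minimal sketch (assumed names, not the authors' code) of a distributed dot product: each GPU computes a partial dot product with cuBLAS, and the CPUs combine the partial results with MPI_Allreduce().

#include <mpi.h>
#include <cublas_v2.h>

/* Dot product of two vectors distributed over the GPUs, n_local elements each. */
double parallel_dot(cublasHandle_t handle, int n_local,
                    const double *d_x, const double *d_y)
{
    double local = 0.0, global = 0.0;
    cublasDdot(handle, n_local, d_x, 1, d_y, 1, &local);  /* partial result on the GPU */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);                        /* reduction on the CPUs */
    return global;
}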




The most expensive step of the parallel GMRES algorithm


Parallel sparse matrix-vector product (SpMV) on a GPU cluster (a C sketch of this exchange follows the figure):
1: gpu_to_cpu(); /* GPU→CPU comm. (8 GB/s) */
2: MPI_Alltoallv(); /* CPU↔CPU comm. (20 GB/s) */
3: cpu_to_gpu(); /* GPU←CPU comm. (8 GB/s) */
4: SpMV();
[Figure: the parallel SpMV product on process 1: its local sparse matrix multiplies the global vector $x$, whose left and right shared parts are obtained from processes 0, 2 and 3 through GPU/CPU and CPU/CPU communications]
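A minimal sketch, with assumed buffer names, of steps 1 to 3 surrounding the SpMV kernel; it is not the authors' implementation and omits the packing of the outgoing values:

#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange the shared parts of the vector x between all processes. */
void exchange_shared(const double *d_send, double *d_recv,  /* GPU buffers */
                     double *h_send, double *h_recv,        /* CPU buffers */
                     int n_send, int n_recv,
                     int *sendcnts, int *sdispls,
                     int *recvcnts, int *rdispls)
{
    /* 1: gpu_to_cpu(): GPU -> CPU copy over PCIe (8 GB/s) */
    cudaMemcpy(h_send, d_send, n_send * sizeof(double), cudaMemcpyDeviceToHost);
    /* 2: MPI_Alltoallv(): CPU <-> CPU exchange over InfiniBand (20 GB/s) */
    MPI_Alltoallv(h_send, sendcnts, sdispls, MPI_DOUBLE,
                  h_recv, recvcnts, rdispls, MPI_DOUBLE, MPI_COMM_WORLD);
    /* 3: cpu_to_gpu(): CPU -> GPU copy over PCIe (8 GB/s) */
    cudaMemcpy(d_recv, h_recv, n_recv * sizeof(double), cudaMemcpyHostToDevice);
}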

Slow data exchanges of the shared vectors


The parallel SpMV does not need all the unknown values, yet entire shared vectors are exchanged ⇒ loss of performance!
In the example below: loss of performance ≈ 6 unneeded elements × (2 × per-element cost of a GPU/CPU transfer + per-element cost of a CPU/CPU transfer), where each per-element cost is the inverse of the link throughput (a rough cost model follows the figure).
[Figure: exchange of the full 12-element vector $x$ from node α to node β: GPU α to CPU α at 8 GB/s, CPU α to CPU β at 20 GB/s, then CPU β to GPU β at 8 GB/s, although the local sparse matrix of node β uses only 6 of the 12 received elements]
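Under our own simplifying assumption that each link is throughput-bound, the time wasted on $k$ unneeded double-precision (8-byte) elements can be estimated as

$t_{\text{wasted}} \approx 8k \left( \frac{2}{B_{\text{GPU/CPU}}} + \frac{1}{B_{\text{CPU/CPU}}} \right)$, with $B_{\text{GPU/CPU}} = 8$ GB/s and $B_{\text{CPU/CPU}} = 20$ GB/s,

which instantiates the cost expression above with $k = 6$ for this example.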




Enhanced data exchanges of the shared vectors


Solution:
Compression of the shared vectors before sending
Decompression of the shared vectors after reception
⇒ Minimizes the communication overheads when solving large sparse systems
[Figure: the same exchange with compression: node α gathers only the 6 needed elements of $x$ (indices 0, 2, 3, 6, 10, 11) into a compressed vector before the GPU/CPU and CPU/CPU transfers, and node β scatters them back into its global vector $x$ after reception]



Enhanced parallel SpMV on a GPU cluster

Accelerating the compression and decompression functions on the GPUs (a sketch of these kernels follows the listing below)

Parallel SpMV algorithm:


1: compression(); /*kernel on GPUs*/
2: gpu_to_cpu();
3: MPI_Alltoallv(); /*done on CPUs*/
4: cpu_to_gpu();
5: decompression(); /*kernel on GPUs*/
6: SpMV(); /*kernel on GPUs*/
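A minimal CUDA sketch of what the compression() and decompression() kernels could look like; the index array idx, listing the positions of the shared unknowns that are actually needed, is our assumption about the data structure:

/* Gather the needed entries of x into a contiguous packed buffer. */
__global__ void compression(const double *x, const int *idx,
                            double *packed, int n_shared)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_shared)
        packed[i] = x[idx[i]];
}

/* Scatter a received packed buffer back into the global vector x. */
__global__ void decompression(double *x, const int *idx,
                              const double *packed, int n_shared)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_shared)
        x[idx[i]] = packed[i];
}

Both kernels would be launched with, e.g., (n_shared + 255)/256 blocks of 256 threads.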



Experimental results



Our GPU cluster

GPU cluster:
InfiniBand interconnect
Six quad-core Xeon E5530 CPUs
Two Tesla C1060 GPUs per CPU
→ a cluster of 12 GPUs

Performance measure: speed-up $T_{CPU}/T_{GPU}$ of the cluster of 12 GPUs compared to:
Cluster of 12 CPU cores
Cluster of 24 CPU cores

Comparison between the two versions of the GMRES solver (without and with the compression/decompression operations)




Sparse matrices used in the tests


Real-world sparse matrices from Davis's collection (the University of Florida sparse matrix collection)
Automatic generation of large sparse matrices from them

[Figure: a real sparse matrix, and a large sparse banded matrix generated from it over four processes by replicating it along the main diagonal, with left_part and right_part off-diagonal blocks linking neighboring processes]



Sparse matrices of the experimental tests

Sparse matrices generated from those of Davis's collection

Size of the test sparse matrices: 90 million rows

Matrix           Nonzeros         Bandwidth
ecology2           449,729,174        1,002
stomach          1,277,498,438       22,868
shallow_water2     360,751,026       23,212
language           276,894,366      398,626
G3_circuit         443,429,071      525,429
cage14           1,674,718,790    1,266,626
thermal2           643,458,527    1,928,223




Speed-ups of the GMRES solver without and with data compression

Precision of the solutions on the GPU cluster: from 5.75e-08 to 4.27e-15

Difference between the GPU and CPU solutions: from 3.78e-10 to 1.35e-18

Matrix           vs. 12 CPUs          vs. 24 CPUs
                 - comp.   + comp.    - comp.   + comp.
ecology2           8.46     13.33       5.86      9.18
stomach            8.53     10.66       5.88      7.35
shallow_water2     9.83     12.89       6.49      8.92
language           8.81     11.93       5.89      8.05
G3_circuit         8.02     12.37       5.50      8.53
cage14             6.93      9.11       5.30      6.21
thermal2           6.56     10.70       4.34      7.05




Conclusion & future work

Conclusion:
Both versions of the GMRES solver are faster on GPU clusters
GMRES with data compression is the more efficient of the two
Data compression minimizes the GPU/CPU and CPU/CPU communication overheads

Future work:
Other structures: matrices with large bandwidths
Data partitioning that minimizes the communication volume




Thank you for your attention!

