GPU Implementations for Finite Element Methods


Brian S. Cohen

12 December 2016
Last Time…
Stiffness Matrix Assembly
[Figure: CPU time [s] for stiffness matrix assembly vs. number of DOFs (10² to 10⁷, log–log axes), comparing Julia v0.4.7 and MATLAB R2016b.]



Goals

1. Implement an efficient GPU-based assembly routine to interface with the
   EllipticFEM.jl package
   – Speed test all implementations and compare against the CPU algorithm using
     varied mesh densities
   – Investigate where the GPU implementation choke points are and how they can
     be improved in the future

2. Implement a GPU-based linear solver routine
   – Speed test the solver and compare against the CPU algorithm



Finite Element Mesh
• A finite element mesh is a set of nodes and elements that divides a geometric
  domain on which our PDE can be solved
• Other relevant information for the mesh may be necessary
  – Element centroids
  – Element edge lengths
  – Element quality
  – Subdomain tags
• EllipticFEM.jl stores this information in the object meshData
• All meshes are generated using linear 2D triangle elements

[Figure: a 2D triangle element e with nodes p_i, p_j, and p_k.]

• Node data are stored as a Float64 2D array, one column per node:

  𝒏𝒐𝒅𝒆𝒔 = [x_1 ⋯ x_n; y_1 ⋯ y_n]

• Element data are stored as an Int64 2D array, one column per element listing
  its node indices:

  𝒆𝒍𝒆𝒎𝒆𝒏𝒕𝒔 = [⋯ p_i ⋯; ⋯ p_j ⋯; ⋯ p_k ⋯]
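A minimal sketch of this storage layout for a one-element mesh (plain Julia
arrays; the actual meshData field names in EllipticFEM.jl may differ):

```julia
# nodes is 2 × n_nodes (Float64); elements is 3 × n_elements (Int64),
# where column e holds the node indices (p_i, p_j, p_k) of element e.
nodes = [0.0  1.0  0.0;
         0.0  0.0  1.0]
elements = reshape([1, 2, 3], 3, 1)

xy_e = nodes[:, elements[:, 1]]   # 2 × 3 vertex coordinates of element 1
```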



Finite Element Matrix Assembly
• Consider the simple linear system

  𝐊𝐮 = 𝐟,    𝐊 = [K_{1,1} ⋯ K_{1,nDOF}; ⋮ ⋱ ⋮; K_{nDOF,1} ⋯ K_{nDOF,nDOF}]

• The stiffness matrix 𝐊 is an assembly of all element contributions and is
  built with the triplet constructor K = sparse(I, J, V):

  𝐊 = Σ_{e=1}^{m} 𝐤_e,    𝐤_e = [k_11 k_12 k_13; k_21 k_22 k_23; k_31 k_32 k_33]

• Element contributions are derived from the "hat" function used to approximate
  the solution on each element:

  𝐤_e = ∫ 𝐉⁻ᵀ 𝐁ᵀ 𝐄 𝐉⁻¹ 𝐁 dA

[Figure: piecewise-linear "hat" function u_i over the x–y plane.]
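A CPU reference of this triplet assembly, sketched for the Laplacian case
(𝐄 = 𝐈, for which 𝐤_e reduces to the standard linear-triangle stiffness); this
shows the shape of the algorithm, not the exact EllipticFEM.jl routine:

```julia
using SparseArrays

# Triplet (COO) assembly of K for linear triangles, Laplacian case (E = I).
function assemble_K(nodes, elements)
    m = size(elements, 2)
    I = Vector{Int}(undef, 9m)
    J = similar(I)
    V = Vector{Float64}(undef, 9m)
    for e in 1:m
        p = elements[:, e]                              # node indices of element e
        x = nodes[1, p]; y = nodes[2, p]
        b = [y[2] - y[3], y[3] - y[1], y[1] - y[2]]     # shape-function gradients
        c = [x[3] - x[2], x[1] - x[3], x[2] - x[1]]
        A = abs(x[1]*b[1] + x[2]*b[2] + x[3]*b[3]) / 2  # element area
        ke = (b*b' + c*c') / (4A)                       # 3 × 3 element matrix
        for i in 1:3, j in 1:3
            n = 9*(e - 1) + 3*(i - 1) + j
            I[n] = p[i]; J[n] = p[j]; V[n] = ke[i, j]
        end
    end
    return sparse(I, J, V)    # sparse() sums duplicate (I, J) entries: 𝐊 = Σ 𝐤_e
end
```

The inner double for-loop over (i, j) appears to be the same loop structure
that Implementation A (next slide) moves to the GPU.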
GPU Implementation A

Pre-Processing → Assemble 𝐊 Matrix → Solve

  CPU: Read Equation Data → Generate Geometric Data · call the sparse()
       constructor · solve 𝐮 = 𝐊\𝐛
  GPU: Generate Mesh Data → Generate (I, J) Vectors → Generate Ke_Values Array
       (double for-loop implementation)


GPU Implementation B

Pre-Processing → Assemble 𝐊 Matrix → Solve

  CPU: Read Equation Data → Generate Geometric Data → Generate Mesh Data →
       Generate (I, J) Vectors · call the sparse() constructor · solve 𝐮 = 𝐊\𝐛
  GPU: Generate Ke_Values Array (only the node and element arrays are
       transferred to the GPU)
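Implementation B's GPU stage maps naturally onto vectorized array arithmetic:
every entry of every 𝐤_e can be computed for all elements at once from the node
and element arrays. A sketch of that idea (Laplacian case again, written with
plain Julia arrays, on the assumption that wrapping them in a GPU array type
such as ArrayFire.jl's AFArray runs the same elementwise arithmetic on the
device):

```julia
# Vectorized Ke_Values: compute all nine k_e entries for every element at
# once, with no per-element loop. Only the nodes/elements data are needed,
# as in Implementation B.
function ke_values(nodes, elements)
    x1, x2, x3 = (nodes[1, elements[i, :]] for i in 1:3)
    y1, y2, y3 = (nodes[2, elements[i, :]] for i in 1:3)
    b = (y2 .- y3, y3 .- y1, y1 .- y2)
    c = (x3 .- x2, x1 .- x3, x2 .- x1)
    A4 = 2 .* abs.(x1 .* b[1] .+ x2 .* b[2] .+ x3 .* b[3])   # 4 × element areas
    # One column per (i, j) entry of k_e, each holding the values for all elements.
    return reduce(hcat, [(b[i] .* b[j] .+ c[i] .* c[j]) ./ A4 for i in 1:3 for j in 1:3])
end
```

The matching (I, J) index vectors are pure integer bookkeeping on the element
array, which is consistent with Implementation B keeping them on the CPU.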



CPU vs. GPU Implementations
[Figure: CPU time [s] for I, J, V assembly vs. number of DOFs (10² to 10⁷, log–log axes), comparing the CPU implementation with GPU Implementations A and B.]

• GPU: GeForce GTX 765M, 2048 MB



Runtime Diagnostics
[Figure: CPU runtime [s] breakdown vs. number of DOFs (10² to 10⁶) for Implementations A and B, split into CPU → GPU transfer, I, J, V array assembly, GPU → CPU transfer, and sparse() assembly.]

• Overhead to transfer mesh data from CPU → GPU is low
• Overhead to transfer the assembled arrays from GPU → CPU is high
• It would be nice to be able to perform sparse matrix construction on the GPU
• It would be even nicer to solve the problem on the GPU
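Timing sweeps like the ones behind these plots can be scripted directly; a
sketch (make_mesh is a hypothetical stand-in for the EllipticFEM.jl mesh
generator, and assemble_K is the assembly sketch from the earlier slide):

```julia
# Warm up once so JIT compilation does not pollute the first measurement.
nodes, elements = make_mesh(10^2)          # hypothetical mesh generator
assemble_K(nodes, elements)

for n in (10^2, 10^3, 10^4, 10^5, 10^6)
    nodes, elements = make_mesh(n)
    t = @elapsed assemble_K(nodes, elements)
    println("DOFs = $n: assembly took $t s")
end
```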



Solving the Model

• Now we want to solve the linear model:

  𝐊𝐮 = 𝐟

• ArrayFire.jl does not currently support sparse matrices
  – Dense matrix operations seem to be comparable in speed or slower on GPUs
    than on CPUs

• CUSPARSE.jl wraps NVIDIA CUSPARSE library functions
  – High-performance sparse linear algebra library
  – Does not wrap any solver routines

• Built on the CUDArt.jl package
  – Wraps the CUDA runtime API

• Both packages require the CUDA Toolkit (v8.0)



GPU Solver Implementation
• Preconditioned Conjugate Gradient (PCG) method
  – 𝐊 is a sparse symmetric positive definite matrix
  – Preconditioning improves convergence if 𝐊 is not well conditioned
• Uses the incomplete Cholesky factorization

  𝐊 ≈ 𝐌 = 𝐑ᵀ𝐑

• Rather than solve the original system 𝐊𝐮 = 𝐟, we solve the preconditioned
  system

  𝐑⁻ᵀ𝐊𝐑⁻¹(𝐑𝐮) = 𝐑⁻ᵀ𝐟

Algorithm:

  𝐫 ← 𝐟 − 𝐊𝐮
  𝒇𝒐𝒓 i = 1, 2, … until convergence 𝒅𝒐
      𝒔𝒐𝒍𝒗𝒆 𝐌𝐳 ← 𝐫
      ρᵢ ← 𝐫ᵀ𝐳
      𝒊𝒇 i == 1 𝒕𝒉𝒆𝒏
          𝐩 ← 𝐳
      𝒆𝒍𝒔𝒆
          β ← ρᵢ / ρᵢ₋₁
          𝐩 ← 𝐳 + β𝐩
      𝒆𝒏𝒅 𝒊𝒇
      𝐪 ← 𝐊𝐩
      α ← ρᵢ / 𝐩ᵀ𝐪
      𝐮 ← 𝐮 + α𝐩
      𝐫 ← 𝐫 − α𝐪
  𝒆𝒏𝒅 𝒇𝒐𝒓
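A plain-Julia rendering of this loop can serve as a CPU reference (a sketch:
𝐑 is any factor with 𝐊 ≈ 𝐑ᵀ𝐑; a GPU port would swap each kernel, i.e. the
SpMV, dot products, and triangular solves, for CUSPARSE/CUBLAS wrappers once
those are available):

```julia
using LinearAlgebra, SparseArrays

# Preconditioned CG as on this slide, with M = RᵀR supplied via its factor R,
# so "solve Mz ← r" becomes two triangular solves.
function pcg(K, f, R; tol = 1e-8, maxiter = 1000)
    u = zeros(length(f))
    r = f - K * u                        # r ← f − Ku
    p = similar(r)
    q = similar(r)
    ρ_prev = 0.0
    for i in 1:maxiter
        z = R \ (R' \ r)                 # solve Mz ← r with M = RᵀR
        ρ = dot(r, z)                    # ρᵢ ← rᵀz
        if i == 1
            p .= z
        else
            p .= z .+ (ρ / ρ_prev) .* p  # β = ρᵢ / ρᵢ₋₁
        end
        mul!(q, K, p)                    # q ← Kp
        α = ρ / dot(p, q)
        u .+= α .* p
        r .-= α .* q
        norm(r) ≤ tol * norm(f) && break
        ρ_prev = ρ
    end
    return u
end
```

For a quick correctness check, the exact Cholesky factor can stand in for the
incomplete one, e.g. u = pcg(K, f, cholesky(Matrix(K)).U), which makes 𝐌 = 𝐊
and converges in a single iteration.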



Solver Results
[Figure: CPU time [s] to solve 𝐊𝐮 = 𝐟 vs. number of DOFs (10² to 10⁷, log–log axes), comparing the CPU and GPU implementations.]



Conclusion
• GPU computing in Julia shows promise for speeding up FEM matrix assembly and
  solve routines
  – Potentially greater gains to be made with higher-order 2D/3D elements
  – Minimizing the data transfer to the GPU needed to assemble FEM matrices helps
  – Keeping code vectorized helps
  – Removing any temporary data copies on the GPU helps

• ArrayFire.jl should (and hopefully soon will) support sparse matrix assembly
  and arithmetic
  – Open issue on GitHub since late September 2016

• CUSPARSE.jl should (and hopefully soon will) wrap additional functions
  – A COO matrix constructor and iterative solvers would be especially useful

• Large impact expected on optimization problems, where matrix assembly routines
  and solvers are called many times
