GPU Implementations for Finite Element Methods


Brian S. Cohen

12 December 2016
Last Time…
Stiffness Matrix Assembly
[Figure: CPU time [s] for stiffness matrix assembly vs. number of DOFs (10² to 10⁷, log–log axes), comparing Julia v0.4.7 and MATLAB R2016b.]



Goals

1. Implement an efficient GPU-based assembly routine to interface with the
   EllipticFEM.jl package
   – Speed test all implementations and compare against the CPU algorithm using
     varied mesh densities
   – Investigate where the GPU implementation choke points are and how they can
     be improved in the future

2. Implement a GPU-based linear solver routine
   – Speed test the solver and compare against the CPU algorithm



Finite Element Mesh
• A finite element mesh is a set of nodes and elements that divides a geometric
  domain on which our PDE can be solved
• Other relevant information for the mesh may be necessary
  – Element centroids
  – Element edge lengths
  – Element quality
  – Subdomain tags
• EllipticFEM.jl stores this information in the object meshData
• All meshes are generated using linear 2D triangle elements

[Figure: a 2D triangle element e with nodes p_i, p_j, and p_k.]

• Node data are stored as a Float64 2D array, one column per node:

  𝒏𝒐𝒅𝒆𝒔 = [x_1 ⋯ x_n; y_1 ⋯ y_n]

• Element data are stored as an Int64 2D array, one column per element listing
  its node indices:

  𝒆𝒍𝒆𝒎𝒆𝒏𝒕𝒔 = [⋯ p_i ⋯; ⋯ p_j ⋯; ⋯ p_k ⋯]
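A minimal sketch of this storage layout for a one-element mesh (plain Julia
arrays; the actual meshData field names in EllipticFEM.jl may differ):

```julia
# nodes is 2 × n_nodes (Float64); elements is 3 × n_elements (Int64),
# where column e holds the node indices (p_i, p_j, p_k) of element e.
nodes = [0.0  1.0  0.0;
         0.0  0.0  1.0]
elements = reshape([1, 2, 3], 3, 1)

xy_e = nodes[:, elements[:, 1]]   # 2 × 3 vertex coordinates of element 1
```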



Finite Element Matrix Assembly
• Consider the simple linear system

  𝐊𝐮 = 𝐟,    𝐊 = [K_{1,1} ⋯ K_{1,nDOF}; ⋮ ⋱ ⋮; K_{nDOF,1} ⋯ K_{nDOF,nDOF}]

• The stiffness matrix 𝐊 is an assembly of all element contributions and is
  built with the triplet constructor K = sparse(I, J, V):

  𝐊 = Σ_{e=1}^{m} 𝐤_e,    𝐤_e = [k_11 k_12 k_13; k_21 k_22 k_23; k_31 k_32 k_33]

• Element contributions are derived from the "hat" function used to approximate
  the solution on each element:

  𝐤_e = ∫ 𝐉⁻ᵀ 𝐁ᵀ 𝐄 𝐉⁻¹ 𝐁 dA

[Figure: piecewise-linear "hat" function u_i over the x–y plane.]
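A CPU reference of this triplet assembly, sketched for the Laplacian case
(𝐄 = 𝐈, for which 𝐤_e reduces to the standard linear-triangle stiffness); this
shows the shape of the algorithm, not the exact EllipticFEM.jl routine:

```julia
using SparseArrays

# Triplet (COO) assembly of K for linear triangles, Laplacian case (E = I).
function assemble_K(nodes, elements)
    m = size(elements, 2)
    I = Vector{Int}(undef, 9m)
    J = similar(I)
    V = Vector{Float64}(undef, 9m)
    for e in 1:m
        p = elements[:, e]                              # node indices of element e
        x = nodes[1, p]; y = nodes[2, p]
        b = [y[2] - y[3], y[3] - y[1], y[1] - y[2]]     # shape-function gradients
        c = [x[3] - x[2], x[1] - x[3], x[2] - x[1]]
        A = abs(x[1]*b[1] + x[2]*b[2] + x[3]*b[3]) / 2  # element area
        ke = (b*b' + c*c') / (4A)                       # 3 × 3 element matrix
        for i in 1:3, j in 1:3
            n = 9*(e - 1) + 3*(i - 1) + j
            I[n] = p[i]; J[n] = p[j]; V[n] = ke[i, j]
        end
    end
    return sparse(I, J, V)    # sparse() sums duplicate (I, J) entries: 𝐊 = Σ 𝐤_e
end
```

The inner double for-loop over (i, j) appears to be the same loop structure
that Implementation A (next slide) moves to the GPU.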
GPU Implementation A

Pre-Processing → Assemble 𝐊 Matrix → Solve

  CPU: Read Equation Data → Generate Geometric Data · call the sparse()
       constructor · solve 𝐮 = 𝐊\𝐛
  GPU: Generate Mesh Data → Generate (I, J) Vectors → Generate Ke_Values Array
       (double for-loop implementation)


GPU Implementation B

Pre-Processing → Assemble 𝐊 Matrix → Solve

  CPU: Read Equation Data → Generate Geometric Data → Generate Mesh Data →
       Generate (I, J) Vectors · call the sparse() constructor · solve 𝐮 = 𝐊\𝐛
  GPU: Generate Ke_Values Array (only the node and element arrays are
       transferred to the GPU)
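Implementation B's GPU stage maps naturally onto vectorized array arithmetic:
every entry of every 𝐤_e can be computed for all elements at once from the node
and element arrays. A sketch of that idea (Laplacian case again, written with
plain Julia arrays, on the assumption that wrapping them in a GPU array type
such as ArrayFire.jl's AFArray runs the same elementwise arithmetic on the
device):

```julia
# Vectorized Ke_Values: compute all nine k_e entries for every element at
# once, with no per-element loop. Only the nodes/elements data are needed,
# as in Implementation B.
function ke_values(nodes, elements)
    x1, x2, x3 = (nodes[1, elements[i, :]] for i in 1:3)
    y1, y2, y3 = (nodes[2, elements[i, :]] for i in 1:3)
    b = (y2 .- y3, y3 .- y1, y1 .- y2)
    c = (x3 .- x2, x1 .- x3, x2 .- x1)
    A4 = 2 .* abs.(x1 .* b[1] .+ x2 .* b[2] .+ x3 .* b[3])   # 4 × element areas
    # One column per (i, j) entry of k_e, each holding the values for all elements.
    return reduce(hcat, [(b[i] .* b[j] .+ c[i] .* c[j]) ./ A4 for i in 1:3 for j in 1:3])
end
```

The matching (I, J) index vectors are pure integer bookkeeping on the element
array, which is consistent with Implementation B keeping them on the CPU.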



CPU vs. GPU Implementations
[Figure: CPU time [s] for I, J, V assembly vs. number of DOFs (10² to 10⁷, log–log axes), comparing the CPU implementation with GPU Implementations A and B.]

• GPU: GeForce GTX 765M, 2048 MB



Runtime Diagnostics
[Figure: CPU runtime [s] breakdown vs. number of DOFs (10² to 10⁶) for Implementations A and B, split into CPU → GPU transfer, I, J, V array assembly, GPU → CPU transfer, and sparse() assembly.]

• Overhead to transfer mesh data from CPU → GPU is low
• Overhead to transfer the assembled arrays from GPU → CPU is high
• It would be nice to be able to perform sparse matrix construction on the GPU
• It would be even nicer to solve the problem on the GPU
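Timing sweeps like the ones behind these plots can be scripted directly; a
sketch (make_mesh is a hypothetical stand-in for the EllipticFEM.jl mesh
generator, and assemble_K is the assembly sketch from the earlier slide):

```julia
# Warm up once so JIT compilation does not pollute the first measurement.
nodes, elements = make_mesh(10^2)          # hypothetical mesh generator
assemble_K(nodes, elements)

for n in (10^2, 10^3, 10^4, 10^5, 10^6)
    nodes, elements = make_mesh(n)
    t = @elapsed assemble_K(nodes, elements)
    println("DOFs = $n: assembly took $t s")
end
```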



Solving the Model

• Now we want to solve the linear model:

  𝐊𝐮 = 𝐟

• ArrayFire.jl does not currently support sparse matrices
  – Dense matrix operations seem to be comparable in speed or slower on GPUs
    than on CPUs

• CUSPARSE.jl wraps NVIDIA CUSPARSE library functions
  – High-performance sparse linear algebra library
  – Does not wrap any solver routines

• Built on the CUDArt.jl package
  – Wraps the CUDA runtime API

• Both packages require the CUDA Toolkit (v8.0)



GPU Solver Implementation
• Preconditioned Conjugate Gradient (PCG) method
  – 𝐊 is a sparse symmetric positive definite matrix
  – Preconditioning improves convergence if 𝐊 is not well conditioned
• Uses the incomplete Cholesky factorization

  𝐊 ≈ 𝐌 = 𝐑ᵀ𝐑

• Rather than solve the original system 𝐊𝐮 = 𝐟, we solve the preconditioned
  system

  𝐑⁻ᵀ𝐊𝐑⁻¹(𝐑𝐮) = 𝐑⁻ᵀ𝐟

Algorithm:

  𝐫 ← 𝐟 − 𝐊𝐮
  𝒇𝒐𝒓 i = 1, 2, … until convergence 𝒅𝒐
      𝒔𝒐𝒍𝒗𝒆 𝐌𝐳 ← 𝐫
      ρᵢ ← 𝐫ᵀ𝐳
      𝒊𝒇 i == 1 𝒕𝒉𝒆𝒏
          𝐩 ← 𝐳
      𝒆𝒍𝒔𝒆
          β ← ρᵢ / ρᵢ₋₁
          𝐩 ← 𝐳 + β𝐩
      𝒆𝒏𝒅 𝒊𝒇
      𝐪 ← 𝐊𝐩
      α ← ρᵢ / 𝐩ᵀ𝐪
      𝐮 ← 𝐮 + α𝐩
      𝐫 ← 𝐫 − α𝐪
  𝒆𝒏𝒅 𝒇𝒐𝒓
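A plain-Julia rendering of this loop can serve as a CPU reference (a sketch:
𝐑 is any factor with 𝐊 ≈ 𝐑ᵀ𝐑; a GPU port would swap each kernel, i.e. the
SpMV, dot products, and triangular solves, for CUSPARSE/CUBLAS wrappers once
those are available):

```julia
using LinearAlgebra, SparseArrays

# Preconditioned CG as on this slide, with M = RᵀR supplied via its factor R,
# so "solve Mz ← r" becomes two triangular solves.
function pcg(K, f, R; tol = 1e-8, maxiter = 1000)
    u = zeros(length(f))
    r = f - K * u                        # r ← f − Ku
    p = similar(r)
    q = similar(r)
    ρ_prev = 0.0
    for i in 1:maxiter
        z = R \ (R' \ r)                 # solve Mz ← r with M = RᵀR
        ρ = dot(r, z)                    # ρᵢ ← rᵀz
        if i == 1
            p .= z
        else
            p .= z .+ (ρ / ρ_prev) .* p  # β = ρᵢ / ρᵢ₋₁
        end
        mul!(q, K, p)                    # q ← Kp
        α = ρ / dot(p, q)
        u .+= α .* p
        r .-= α .* q
        norm(r) ≤ tol * norm(f) && break
        ρ_prev = ρ
    end
    return u
end
```

For a quick correctness check, the exact Cholesky factor can stand in for the
incomplete one, e.g. u = pcg(K, f, cholesky(Matrix(K)).U), which makes 𝐌 = 𝐊
and converges in a single iteration.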



Solver Results
[Figure: CPU time [s] to solve 𝐊𝐮 = 𝐟 vs. number of DOFs (10² to 10⁷, log–log axes), comparing the CPU and GPU implementations.]



Conclusion
• GPU computing in Julia shows promise for speeding up FEM matrix assembly and
  solve routines
  – Potentially greater gains to be made with higher-order 2D/3D elements
  – Minimizing the data transfer to the GPU needed to assemble FEM matrices helps
  – Keeping code vectorized helps
  – Removing any temporary data copies on the GPU helps

• ArrayFire.jl should (and hopefully soon will) support sparse matrix assembly
  and arithmetic
  – Open issue on GitHub since late September 2016

• CUSPARSE.jl should (and hopefully soon will) wrap additional functions
  – A COO matrix constructor and iterative solvers would be especially useful

• Large impact expected on optimization problems, where matrix assembly routines
  and solvers are called many times
