
Linear Algebra Operators

for GPU Implementation of Numerical Algorithms


Jens Krüger and Rüdiger Westermann
Computer Graphics and Visualization Group, Technical University Munich
jens.krueger, westermann@in.tum.de

Figure 1: We present implementations of techniques for solving sets of algebraic equations on graphics hardware. In this way, numerical
simulation and rendering of real-world phenomena, like 2D water surfaces in the shown example, can be achieved at interactive rates.

Abstract

In this work, the emphasis is on the development of strategies to realize techniques of numerical computing on the graphics chip. In particular, the focus is on the acceleration of techniques for solving sets of algebraic equations as they occur in numerical simulation. We introduce a framework for the implementation of linear algebra operators on programmable graphics processors (GPUs), thus providing the building blocks for the design of more complex numerical algorithms. In particular, we propose a stream model for arithmetic operations on vectors and matrices that exploits the intrinsic parallelism and efficient communication on modern GPUs. Besides performance gains due to improved numerical computations, graphics algorithms benefit from this model in that the transfer of computation results to the graphics processor for display is avoided. We demonstrate the effectiveness of our approach by implementing direct solvers for sparse matrices, and by applying these solvers to multi-dimensional finite difference equations, i.e. the 2D wave equation and the incompressible Navier-Stokes equations.

CR Categories: I.6.7 [Simulation and Modeling]: Simulation Support Systems; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism

Keywords: Numerical Simulation, Graphics Hardware

1 Introduction

The development of numerical techniques for solving partial differential equations is one of the traditional subjects in applied mathematics. These techniques have a variety of applications in physics based simulation and modelling, and they have been frequently employed in computer graphics to provide realistic simulation of real-world phenomena [Kaas and Miller 1990; Chen and da Vitoria Lobo 1995; Foster and Metaxas 1996; Stam 1999; Foster and Fedkiw 2001; Fedkiw et al. 2001]. Despite their use in numerical simulation, these techniques have also been applied in a variety of computer graphics settings, e.g. the simulation of watercolor drawings [Curtis et al. 1997], the processing of polygonal meshes [Desbrun et al. 1999], or the animation of deformable models [Baraff and Witkin 1998; Debunne et al. 2001], to mention just a few.

The numerical complexity of techniques for solving sets of algebraic equations often imposes limitations on the numerical accuracy or extremely high demands on memory and computing resources. As a consequence thereof, parallelization of numerical solvers on multi-processor architectures has been an active research area for quite a long time.

An alternative direction of research is leading towards the implementation of general techniques of numerical computing on computer graphics hardware. Driven by the evolution of graphics processors from fixed function pipelines towards fully programmable floating point pipelines, additional effort is spent on the development of numerical algorithms amenable to the intrinsic parallelism and efficient communication on modern GPUs. Recent examples include GPU implementations of matrix multiplications [Thompson et al. 2002], multi-grid simulation techniques [Bolz et al. 2003] and numerical solution techniques to least squares problems [Hillesland et al. 2003]. Particularly in computer graphics applications, the goal of such implementations of numerical techniques is twofold: to speed up computation processes, as they are often at the core of the graphics application, and to avoid the transfer of computation results from the CPU to the GPU for display.

Based on early prototypes of programmable graphics architectures [Olano and Lastra 1998; Lindholm et al. 2001], the design of graphics hardware as a pipeline consisting of highly optimized stages providing fixed functionality is more and more abandoned on modern graphics chips, e.g. the NVIDIA NV30 [Montrym and Moreton 2002] or the ATI R300 [Elder 2002]. Today, the user has access to parallel geometry and fragment units, which can be programmed by means of standard APIs. In particular, vertex and pixel shader programs enable direct access to the functional units, and they allow for the design of new classes of hardware supported
graphics algorithms. As a representative example for such algorithms, let us refer to [Purcell et al. 2002], where ray-tracing on programmable fragment processors was described.

In our current work, we employ the Pixel Shader 2.0 API [Microsoft 2002], a specific set of instructions and capabilities in DirectX9-level hardware, which allows us to perform hardware supported per-fragment operations. Besides basic arithmetic operations, instructions to store intermediate results in registers and to address and access multiple texture units are available. Our target architecture is the ATI Radeon 9800, which supports 32-bit floating point textures as well as a set of hardware supported pixel shaders. Pixel shaders provide 24-bit precision for internal computations. To save rendering results and to communicate these results to consecutive passes, rendering can be directed to a 32-bit offscreen buffer. This buffer can be directly bound to a texture map, i.e. the content of the buffer is immediately available as a 2D texture map.

Until today, the benefits of graphics hardware have mainly been demonstrated in rendering applications. Efforts have been focused on the creation of static and dynamic imagery including polygonal models and scalar or vector valued volumetric objects. In a few examples, strategies to realize numerical computations on graphics processors were described, usually implemented by means of low-level graphics APIs that did not yet provide programmable vertex and pixel shaders.

Hopf et al. [Hopf and Ertl 1999; Hopf and Ertl 2000] described implementations of morphological operations and wavelet transforms on the graphics processor. Numerical computations were realized by means of blending functionality, and by exploiting the functionality provided by the OpenGL imaging subset. Using similar coding mechanisms, the simulation of cellular automata and stochastic fractals was demonstrated in [nVidia 2002; Hart 2001]. Strzodka and Rumpf [Strzodka and Rumpf 2001a; Strzodka and Rumpf 2001b] proposed first concepts for the implementation of numerical solvers for partial differential equations on graphics hardware. Therefore, the intrinsic communication and computation patterns of numerical solution techniques for finite difference equations were mapped to OpenGL functionality. Non-standard OpenGL extensions were employed in [Heidrich et al. 1999; Jobard et al. 2000; Weiskopf et al. 2001] to interactively visualize 2D vector fields. At the core of these techniques, vector valued data is encoded into RGB texture maps, thus allowing for the tracing of particles by successive texture fetch and blend operations.

With the availability of programmable fragment units, the possibility to implement general numerical techniques on graphics processors was given. A number of examples have been demonstrated since then, ranging from physics based simulation of natural phenomena to real-time shading and lighting effects [nVidia 2003; ATI 2003]. Weiskopf et al. [Weiskopf et al. 2002] extended their previous work towards the interactive simulation of particle transport in flow fields. Recently, Harris et al. [Harris et al. 2002] described the implementation of an explicit scheme for the solution of a coupled map lattice model on commodity graphics cards. In both examples, numerical computations were entirely carried out in a fragment shader program. A GPU implementation of matrix multiplies based on a particular distribution strategy for 2D textures across a cube-shaped lattice was described in [Larsen and McAllister 2001].

In contrast to previous approaches, which were specifically designed with regard to the solution of particular problems, our goal is to develop a generic framework that enables the implementation of general numerical techniques for the solution of difference equations on graphics hardware. Therefore, we provide the basic building block routines that are used by standard numerical solvers. Built upon a flexible and efficient internal representation, these functional units perform arithmetic operations on vectors and matrices. In the same way as linear algebra libraries employ encapsulated basic vector and matrix operations, many techniques of numerical computing can be implemented by executing GPU implementations of these operations in a particular order. One of our goals is to replace software implementations of basic linear algebra operators as available in widespread linear algebra libraries, i.e. the BLAS (Basic Linear Algebra Subprograms) library [Dongarra et al. 1988; Dongarra et al. 1990], by GPU implementations, thus enabling more general linear algebra packages to be implemented on top of these implementations, i.e. the LAPACK (Linear Algebra Package) [Anderson et al. 1999].

In the remainder of this paper, we will first introduce the internal representation for vectors and matrices on the graphics processor, and we will describe the syntax and the semantics of the vector and matrix routines our approach is built upon. Similar to the notation used in the BLAS library, we outline specific operations on vectors and matrices. Although we use a different syntax than BLAS, and we also do not provide the many different operators contained in the BLAS function definition, it should become obvious that by means of our approach the same functionality can be achieved.

We will also address sparse matrix representation and operations on such matrices as they typically occur in numerical simulation techniques. In this way, we achieve a significant speed-up compared to software approaches. Next, GPU implementations of two methods for solving sparse linear systems as they occur in many numerical simulation techniques are described: the Conjugate Gradient (CG) method and the Gauss-Seidel solver. Finally, we demonstrate the effectiveness of our approach by solving the 2D wave equation and the incompressible Navier-Stokes equations on graphics hardware, and by directly visualizing the results on the ATI 9800.

2 Matrix Representation on GPUs

In this section, we describe the internal representation of matrices on graphics hardware. The proposed representation enables the efficient computation of basic algebraic operations used in numerical simulation techniques. The general idea is to store matrices as texture maps and to exploit pixel shader programs to implement arithmetic operations. For the sake of simplicity only column matrices, i.e. vectors, and square NxN matrices are discussed. General matrices, however, can be organized in exactly the same way.

Figure 2: This illustration demonstrates the computation of a matrix-vector product using 1D and 2D textures to represent vectors and matrices, respectively. The 1D texture is continued periodically across the rendered quadrilateral.

On the graphics processor, vectors might be represented as 1D texture maps. This kind of representation, however, has certain drawbacks: First, limitations on the maximum size of 1D textures considerably reduce the number of vector elements that can be stored in one texture map. Second, rendering 1D textures is significantly slower than rendering square 2D textures with the same number of texture elements. On current graphics cards, the use of 2D textures yields a performance gain of about a factor of 2. Third, if a 1D vector contains the result of a numerical simulation on a computational grid, the data finally has to be rearranged in a 2D texture
for rendering purposes. Fourth, this representation prohibits the efficient computation of matrix-vector products. Although by means of multi-textures both the matrix and the vector can be rendered simultaneously and combined on a per-fragment basis (see Figure 2), the computed values are not in place and have to be summed along the rows of the matrix to obtain the result vector.

To circumvent the mentioned drawbacks, we represent matrices as a set of diagonal vectors and we store vectors in 2D texture maps (see Figure 3). To every diagonal starting at the i-th entry of the first row in the upper right part of the matrix, its counterpart starting at the (N-i)-th entry of the first column in the lower left part of the matrix is appended. In this way, no texture memory is wasted. Each vector is converted into a square 2D texture by the application program. Vectors are padded with zero entries if they do not entirely fill the texture. This representation has several advantages:

- Much larger vectors can be stored in one single texture element.
- Arithmetic operations on vectors perform significantly faster because square 2D textures are rendered.
- Vectors that represent data on a 2D grid can directly be rendered to visualize the data.
- Matrix-vector multiplication can be mapped effectively to vector-vector multiplication.
- The result of a matrix-vector multiplication is already in place and does not need to be rearranged.

Figure 3: The representation of a 2D matrix as a set of diagonal vectors, and finally as a set of 2D textures is shown.

Most notably, however, the particular layout of matrices as a set of diagonal vectors allows for the efficient representation and processing of banded diagonal matrices, as they typically occur in numerical simulation techniques. In a pre-process the application program inspects every diagonal, discarding those diagonals that do not carry any information. If no counterpart exists for one part of a diagonal, it is filled with zero entries.

A nice feature of this representation is that the transpose of a matrix is generated by simply ordering the diagonals in a different way. Off-diagonals numbered i, which start at the i-th entry of the first row, now become off-diagonals numbered N-i. Entries located in the former upper right part of the matrix are swapped with those entries located in the lower left part. Swapping does not have to be performed explicitly, but it can be realized by shifting the indices used to access these elements. Each index has to be shifted by the number of entries in the vector that come from the lower left part of the matrix. This can easily be accomplished in the pixel shader program, where the indexing during matrix-vector operations is performed (see below).
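To make the mapping concrete, the following CPU-side sketch (an illustration under the assumptions stated in the comments, not code from the library described here) packs a square matrix into the diagonal vectors described above and converts a linear vector index into the coordinates of a square 2D texture.

    #include <vector>

    // Element j of diagonal i holds the matrix entry A(j, (i+j) mod N): the diagonal
    // starting at the i-th entry of the first row with its counterpart from the
    // lower left part of the matrix appended. A is assumed to be stored row-major.
    std::vector<std::vector<float>> packDiagonals(const std::vector<float>& A, int N)
    {
        std::vector<std::vector<float>> diag(N, std::vector<float>(N));
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                diag[i][j] = A[j * N + (i + j) % N];
        return diag;
    }

    // A vector of length len is laid out as a square 2D texture with side
    // ceil(sqrt(len)) and padded with zeros; this converts a linear index
    // into the (u,v) address of the corresponding texel.
    void indexToTexel(int index, int side, int& u, int& v)
    {
        u = index % side;
        v = index / side;
    }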
3 Basic Operations

Now, we describe the implementation of basic algebraic operations on vectors and matrices based on the proposed internal representation. In each operation, rendering is directed to a specific render target that can be directly bound to a 2D texture for further use. To update values in a target that is not the current target anymore, it is made the current render target again. Then, its content can be modified by consecutive rendering passes.

Vector and matrix containers are defined as classes clVec and clMat, respectively. Both containers assemble C++ arrays in that the array is decomposed into one or multiple vectors. Vectors composed of zero entries neither have to be stored nor processed. Upon initialization, for each vector a texture is created and bound to a texture handle. Internally, each class element stores the resolution of the respective vector or matrix and of all the textures that are handled by that element. Texture handles and the size of each texture can be accessed via public functions.

3.1 Vector Arithmetic

Arithmetic operations on two vectors can be realized in a simple pixel shader program. The application issues both operands as multi-textures, which are accessed and combined in the shader program. On current graphics cards supporting the Pixel Shader 2.0 instruction set, arithmetic operations like addition, subtraction and multiplication are available. The product of a scalar times a vector is realized by issuing the scalar as a constant value in the shader program.

The function header for implementing standard arithmetic operations on vector elements looks like this:

    void clVecOp (
        CL_enum op,
        float alpha, float beta,
        clVec x, clVec y,
        clVec res
    );

The enumerator op can be one of CL_ADD, CL_MULT or CL_SUB. The scalars alpha and beta are multiplied with each element of x and y, respectively. At the beginning of each routine a consistency check is performed to verify that both vectors have the same internal representation. Then, the respective shader program is activated and vectors x and y are issued as multi-textures. Finally, a square quadrilateral is rendered, which is lined up in screen space with the 2D textures used to represent the active vectors. The result is kept as a 2D texture to be used in consecutive computations.
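As an illustration of these semantics, the following CPU reference loop (a sketch for exposition only; the enumerator values mirror the ones named above, while the real operation runs in the pixel shader) evaluates once per element the same expression that the shader evaluates once per fragment.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    enum CL_enum { CL_ADD, CL_SUB, CL_MULT };

    // CPU reference of one clVecOp pass: res[i] = (alpha*x[i]) op (beta*y[i]).
    void vecOpReference(CL_enum op, float alpha, float beta,
                        const std::vector<float>& x, const std::vector<float>& y,
                        std::vector<float>& res)
    {
        res.resize(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            const float a = alpha * x[i];
            const float b = beta * y[i];
            res[i] = (op == CL_ADD) ? a + b : (op == CL_SUB) ? a - b : a * b;
        }
    }

    int main()
    {
        std::vector<float> x = {1.f, 2.f, 3.f}, y = {4.f, 5.f, 6.f}, res;
        vecOpReference(CL_ADD, 2.f, 1.f, x, y, res);   // res = 2x + y
        std::cout << res[0] << " " << res[1] << " " << res[2] << "\n";
        return 0;
    }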
3.2 Matrix-Vector Product

In the following, we consider the product of a matrix times a vector. A second vector might be specified to allow for the computation of Ax op y, where A is a matrix, x and y are vectors, and op is one of CL_ADD, CL_MULT, CL_SUB. To compute Ax op y we first render the result of Ax into the render target. Now, the result is bound as an additional texture, and in a final rendering pass it is combined with the vector y and rendered to the destination target.

The header of the function performing matrix-vector operations looks like this (if one of A, x or y is NULL, only the respective components not equal to NULL are considered in the operation):

    void clMatVec (
        CL_enum op,
        clMat A,
        clVec x, clVec y,
        clVec res
    );

Because matrices are represented as a set of diagonal vectors, matrix-vector multiplication becomes a multiplication of every diagonal with the vector. Therefore, N rendering passes are performed, in each of which one diagonal and the vector are issued as separate textures in the shader program. Then, corresponding entries in both textures are multiplied. However, to the j-th element of a diagonal that starts at the i-th entry of the first row of the matrix corresponds the ((i+j) mod N)-th entry of the vector. This entry first has to be computed in the shader program before it can be used to access the vector.

Values i and N are simply issued as constant values in the shader program. Index j, however, is directly derived from the fragment's
texture coordinates and N. Finally, the destination index ((i+j) mod N) is calculated and converted to 2D texture coordinates.
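The following CPU loop is a sketch of what the N rendering passes compute in total; it assumes the diagonal layout of Section 2 (element j of diagonal i corresponds to matrix entry (j, (i+j) mod N)) and is meant only to make the index arithmetic explicit.

    #include <vector>

    // CPU reference of the diagonal-based matrix-vector product: pass i multiplies
    // the j-th element of diagonal i with vector entry ((i+j) mod N) and accumulates
    // the result into entry j of the result vector.
    std::vector<float> matVecDiagonals(const std::vector<std::vector<float>>& diag,
                                       const std::vector<float>& x)
    {
        const int N = static_cast<int>(x.size());
        std::vector<float> res(N, 0.0f);
        for (int i = 0; i < N; ++i)        // one rendering pass per diagonal
            for (int j = 0; j < N; ++j)    // one fragment per result entry
                res[j] += diag[i][j] * x[(i + j) % N];
        return res;
    }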
After the first diagonal has been processed as described, the current render target is simultaneously used as a texture and as a render target. Thus, fragments in consecutive passes have access to the intermediate result, and they can update this result in each iteration. After rendering the last diagonal, the result vector is already in place, and the current render target can be used to internally represent this vector.

A considerable speed-up is achieved by specifying multiple adjacent diagonals as multi-textures, and by processing these diagonals at once in every pass. Parameters i and N only have to be issued once in the shader program. A particular fragment has the same index j in all diagonals, and as a matter of fact it only has to be computed once. Starting with the first destination index, this index is successively incremented by one for consecutive diagonals. The number of diagonals that can be processed simultaneously depends on the number of available texture units.

Let us finally mention that, with regard to the described implementation of matrix-vector products, there is no particular need to organize matrices into sets of diagonal vectors. For instance, dense matrices might be represented as sets of column vectors, giving rise to even more efficient matrix-vector multiplication. Every column just has to be multiplied with the respective vector element, resulting in a much smaller memory footprint, yet requiring a simple shader program to which only the index of the currently processed column is input.

3.3 Vector Reduce

Quite often it is necessary to combine all entries of one vector into one single element, e.g. computing a vector norm, finding the maximum/minimum element etc. Therefore, we provide a special operation that outputs the result of such a reduce operation to the application program:

    float clVecReduce (
        CL_enum cmb,
        clVec x, clVec y
    );

The enumerator cmb can be one of CL_ADD, CL_MULT, CL_MAX, CL_MIN, CL_ABS. If the second parameter y is not equal to NULL, the combiner operation is carried out on the product x times y rather than only on x.

The reduce operation combines the vector entries in multiple rendering passes by recursively combining the result of the previous pass. Starting with the initial 2D texture containing the vector and the quadrilateral lined up with the texture in screen space, in each step a quadrilateral scaled by a factor of 0.5 is rendered. In the shader program, each fragment that is covered by the shrunken quadrilateral combines the texel that is mapped to it and the three adjacent texels in positive (u,v) texture space direction. The distance between texels in texture space is issued as a constant in the shader program. The result is written into a new texture, which is a factor of two smaller in each dimension than the previous one. The entire process is illustrated in Figure 4. This technique is a standard approach to combine vector elements on parallel computer architectures, which in our scenario is used to keep the memory footprint for each fragment as low as possible. For a vector that is represented by a square texture with resolution 2^n x 2^n, n rendering passes have to be performed until the result value is obtained in one single pixel. The respective pixel value is finally returned to the application program.

Figure 4: Schematic illustration of the reduce operation.
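The following CPU sketch mirrors this pass structure for the CL_ADD combiner on a power-of-two texture; it is an illustration of the recursion, not the shader code itself.

    #include <iostream>
    #include <utility>
    #include <vector>

    // Each pass halves the rendered quadrilateral; every surviving texel combines
    // itself with its three neighbours in positive (u,v) direction. After n passes
    // a 2^n x 2^n texture is reduced to a single value.
    float reduceAdd(std::vector<float> tex, int side)
    {
        while (side > 1) {
            const int half = side / 2;
            std::vector<float> next(half * half);
            for (int v = 0; v < half; ++v)
                for (int u = 0; u < half; ++u)
                    next[v * half + u] = tex[(2 * v) * side + 2 * u]
                                       + tex[(2 * v) * side + 2 * u + 1]
                                       + tex[(2 * v + 1) * side + 2 * u]
                                       + tex[(2 * v + 1) * side + 2 * u + 1];
            tex = std::move(next);
            side = half;
        }
        return tex[0];   // the single remaining pixel is read back by the application
    }

    int main()
    {
        std::vector<float> tex(16, 1.0f);          // a 4 x 4 texture of ones
        std::cout << reduceAdd(tex, 4) << "\n";    // prints 16
        return 0;
    }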
4 Sparse Matrices

So far, the operations we have encountered execute very efficiently on current commodity graphics hardware. On the other hand, they are not suitable to process matrices as they typically arise in numerical simulations. For instance, let us assume that the solution to the
2D wave equation

    \frac{\partial^2 u}{\partial t^2} = c^2 \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)

on a grid of resolution 512 x 512 has to be computed numerically (including boundary points). If first and higher order partial derivatives are approximated by finite differences, the partial differential equation writes as a set of finite difference equations for each grid point (i,j):

    \frac{u^{t+1}_{ij} - 2u^t_{ij} + u^{t-1}_{ij}}{\Delta t^2} = c^2 \left( \frac{u^t_{i+1,j} + u^t_{i-1,j} + u^t_{i,j+1} + u^t_{i,j-1} - 4u^t_{ij}}{\Delta x \, \Delta y} \right)

Using the implicit Crank-Nicholson scheme, where the average of the right-hand side is taken, i.e. for all grid points we set u^t_{i,j} = 0.5 (u^{t+1}_{i,j} + u^t_{i,j}), the difference equation contains more than one unknown and the system of algebraic equations has to be solved simultaneously.

If initial and boundary conditions are specified, the set of equations can be written in matrix form as Ax = b, where A is a 512^2 x 512^2 matrix, and both b and the solution vector x are of length 512^2. Here, x contains the unknowns u^{t+1}_{ij} to be solved for. In the particular example, A is a banded matrix (a triangular matrix with fringes) with a maximal bandwidth of six. Obviously, storing the matrix as a full matrix is quite inefficient both in terms of memory requirements and numerical operations. In order to effectively represent and process general sparse N x N matrices, in which only O(N) entries are supposed to be active, an alternative representation on the GPU needs to be developed.

4.1 Banded Matrices

With regard to the internal representation of matrices as a set of diagonal vectors, we can effectively exploit the existence of a banded matrix with only a few non-zero off-diagonals. Zero off-diagonals are simply removed from the internal representation, and off-diagonals that do not have a counterpart on either side of the main diagonal are padded with zero entries.

In the above example, where the 2D wave equation has been discretized by means of finite differences, only six diagonals have to be stored internally. As a consequence, the product of this matrix times a vector costs no more than six vector-vector products.

In the general setting, however, where non-zero entries are positioned randomly in the matrix, the diagonal layout of vectors does not allow for the exploitation of the sparseness in a straightforward way.

4.2 Sparse Random Matrices

To overcome this problem, we use vertices to render the matrix values at the correct position. For each non-zero entry in a column vector one vertex is generated. The coordinate of each vertex is chosen in such a way that it renders at exactly the same position as the respective vector element if it was rendered via the 2D texture used to represent the vector. For each column we thus store as many vertices as there are non-zero entries. Matrix values are stored as colors associated with the respective vertices.

Vertices and corresponding colors are stored in a vertex array on the GPU. As long as the matrix is not going to be modified, the internal representation does not have to be changed. Note that in case of a banded matrix, where apart from start and end conditions for each NxN block in the matrix the same band is present in every row, it is sufficient to store one representative set of vertices for inner grid points. Then, this set can be rendered using the appropriate offset with respect to the current column. Most effectively, the offset is specified in a vertex shader program, by means of which each vertex computes the exact 2D position in screen space.

Figure 5: This image illustrates the computation of a sparse-matrix-vector product based on the internal representation of matrix columns as sets of vertices.

For the multiplication of a matrix times a vector, the color of each vertex has to be multiplied with the color of the corresponding entry in the vector. The vector, however, is not static and can thus not be coded into the vertex array. As a matter of fact, we associate with each vertex a texture coordinate, which is used to access the vector via the 2D textures used to represent it. Fortunately, these texture coordinates can also be stored on the GPU, so that only the appropriate textures have to be bound during matrix-vector computations (see Figure 5).

It is a nice feature of the described scheme that the realization of matrix-vector operations on the GPU as it was proposed in Chapter 3 is not affected by the graphical primitives we use to internally represent and render matrices. The difference between sparse and full matrices just manifests in that we render every diagonal or column vector as a set of vertices instead of a set of 2D textures. In this way, a significant amount of texture memory, rasterization operations and texture fetch operations can be saved in techniques where sparse matrices are involved. For instance, to compute the product between the sparse matrix described above and a vector only 512^2 x 6 textured vertices have to be rendered.
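A CPU-side sketch of how such a point set might be assembled is given below; the struct names and the normalized texel addressing are illustrative assumptions, not part of the interface described in this paper.

    #include <vector>

    // For every non-zero entry a(r,c) one vertex is generated. It is positioned at
    // the texel that holds result entry r in the render target, it carries the matrix
    // value as its color, and its texture coordinate addresses entry c of the vector
    // texture. 'side' is the side length of the square 2D textures.
    struct SparseEntry  { int row, col; float value; };
    struct MatrixVertex { float posU, posV;   // position of result texel 'row'
                          float texU, texV;   // texture coordinate of vector entry 'col'
                          float color; };     // matrix value

    std::vector<MatrixVertex> buildVertexArray(const std::vector<SparseEntry>& entries,
                                               int side)
    {
        std::vector<MatrixVertex> verts;
        verts.reserve(entries.size());
        for (const SparseEntry& e : entries) {
            MatrixVertex v;
            v.posU  = (e.row % side + 0.5f) / side;
            v.posV  = (e.row / side + 0.5f) / side;
            v.texU  = (e.col % side + 0.5f) / side;
            v.texV  = (e.col / side + 0.5f) / side;
            v.color = e.value;
            verts.push_back(v);
        }
        return verts;
    }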
5 Examples

We will now exemplify the implementation of general techniques of numerical computing by means of the proposed basic operations for matrix-vector and vector-vector arithmetic.

5.1 Conjugate Gradient Method

The conjugate gradient (CG) method is an iterative matrix algorithm for solving large, sparse linear systems of equations Ax = b, where A ∈ R^{n x n}. Such systems arise in many practical applications, such as computational fluid dynamics or mechanical engineering. The method proceeds by generating vector sequences of iterates (i.e. successive approximations to the solution), residuals r corresponding to the iterates, and search directions used in updating the iterates and residuals. The CG algorithm remedies the shortcomings of steepest descent by forcing the search directions p(i) to be A-conjugate, that is p(i)^T A p(j) = 0 for i ≠ j, and the residuals to be orthogonal. Particularly in numerical simulation techniques, where large but sparse finite difference equations have to be solved, the CG algorithm is a powerful and widely used technique.

In the following, pseudo code for the unpreconditioned version of the CG algorithm is given. Lower and upper subscripts indicate the values of scalar and vector variables, respectively, in the specified iteration. For a good introduction to the CG method as well as
to other solution methods for linear systems of equations let us refer to [Press et al. 2002].

Unpreconditioned CG
     1  p(0) = r(0) = b - A x(0)    for some initial guess x(0)
     2  for i = 0 to #itr
     3      ρi = r(i)^T r(i)
     4      q(i) = A p(i)
     5      αi = ρi / (p(i)^T q(i))
     6      x(i+1) = x(i) + αi p(i)
     7      r(i+1) = r(i) - αi q(i)
     8      βi = r(i+1)^T r(i+1) / ρi
     9      p(i+1) = r(i+1) + βi p(i)
    10      convergence check

The CG method can effectively be stated in terms of the described building blocks for GPU implementation of numerical techniques. (Note that using a preconditioner matrix, for instance the diagonal part of A stored in the first diagonal vector in our internal representation, only involves solving one more linear system in each iteration.)

Unpreconditioned GPU-based CG
     1  clMatVec(CL_ADD, A, x(0), b, r(0))    initial guess x(0)
     2  clVecOp(CL_ADD, 1, 0, r(0), NULL, r(0))
     3  clVecOp(CL_ADD, 1, 0, r(0), NULL, p(0))
     4  for i = 0 to #itr
     5      ρi = clVecReduce(CL_ADD, r(i), r(i))
     6      clMatVec(CL_ADD, A, p(i), NULL, q(i))
     7      σi = clVecReduce(CL_ADD, p(i), q(i))
     8      αi = ρi / σi
     9      clVecOp(CL_ADD, 1, αi, x(i), p(i), x(i+1))
    10      clVecOp(CL_SUB, 1, αi, r(i), q(i), r(i+1))
    11      ρi+1 = clVecReduce(CL_ADD, r(i+1), r(i+1))
    12      βi = ρi+1 / ρi
    13      clVecOp(CL_ADD, 1, βi, r(i+1), p(i), p(i+1))
    14      convergence check

In the GPU implementation, the application program only needs to read single pixel values from the GPU, thus minimizing bus transfer. All necessary numerical computations can be directly performed on the GPU. Moreover, the final result is already in place and can be rendered as a 2D texture map.

5.2 Gauss-Seidel Solver

Next, let us describe the GPU implementation of a Gauss-Seidel solver. Denoting with L and U the strict lower and upper triangular sub-matrices, and with D the main diagonal of the matrix A, we can rewrite A as L + D + U. In one iteration, the Gauss-Seidel method essentially solves the following matrix-vector equation:

    x(i) = L x(i) + (D + U) x(i-1)

where x(k) is the solution vector at the k-th iteration. r(i) = (D + U) x(i-1) can be derived from the previous time step by one matrix-vector product. To compute L x(i), however, updates of x(i) have to be done in place. Based on the representation of matrices as a set of column vectors, we sweep through the matrix in column-wise order, using the result vector x(i) as the current render target as well as a currently bound texture. Initially, the content of r(i) is copied into x(i). When the j-th column of L is rendered, each element is multiplied with the j-th element in x(i), and the result is added to x(i). We thus always multiply every column with the most recently updated value of x(i).
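The following CPU transcription of this sweep is a sketch for exposition; L is assumed to be stored column-wise, matching the column-vector layout used on the GPU.

    #include <cstddef>
    #include <vector>

    // One Gauss-Seidel iteration as described above: x is initialized with
    // r = (D+U) x_old, then the columns of the strictly lower triangular part L
    // are applied one after another, so every column is multiplied with the most
    // recently updated entry x[j]. Lcols[j][k] holds L(k,j).
    void gaussSeidelSweep(const std::vector<std::vector<float>>& Lcols,
                          const std::vector<float>& r,
                          std::vector<float>& x)
    {
        const std::size_t N = r.size();
        x = r;                                   // copy r into the render target
        for (std::size_t j = 0; j < N; ++j)      // render the j-th column of L
            for (std::size_t k = j + 1; k < N; ++k)
                x[k] += Lcols[j][k] * x[j];      // accumulate into the in-place target
    }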
6 Discussion and Performance Evaluation

To verify the effectiveness of the proposed framework for the implementation of numerical simulation techniques we have implemented two meaningful examples on the graphics processor. All our experiments were run under WindowsXP on a P4 2.8 GHz processor equipped with an ATI 9800 graphics card.

With regard to the realization of methods of numerical computing on graphics hardware, limited accuracy in both the internal texture stages and the shader stages is certainly the Achilles heel of any approach. In many cases, numerical methods have to be performed in double precision to allow for accurate and stable computations. As a matter of fact, our current target architecture does not provide sufficient accuracy in general. Other graphics cards, on the other hand, like NVIDIA's GeForceFX, already provide full IEEE floating point precision in both the shader and texture stages. Thus, it will be of particular interest to evaluate this GPU in particular as well as other near-future graphics architectures with regard to computation accuracy.

Let us now investigate the performance of our approach as well as the differences to CPU implementations of some of the described basic operations. In our experiments the resolution of vectors and matrices was chosen such as to avoid paging between texture memory and main memory and between main memory and disk. All our operations were run on vectors and matrices of size 512^2 to 2048^2. We have also not considered the constant time to initially load textures from main memory to texture memory. The reason is that we predominantly focus on iterative techniques, where a large number of iterations have to be performed until convergence. Supposedly, in these particular applications the time required to set up the hardware is insignificant compared to the time required to perform the computations. During all iterations the data resides on the GPU and it has neither to be reloaded from main memory nor duplicated on the CPU. In other scenarios, e.g. if frequent updates of a matrix happen, this assumption may not be justifiable anymore. In this case, also the time needed to transfer data between different units has to be considered.

On vectors and full matrices the implementation of standard arithmetic operations, i.e. vector-vector arithmetic and matrix-vector multiplication, was about 12-15 times faster compared to an optimized software implementation on the same target architecture. A considerable speed-up was achieved by internally storing vectors and matrices as RGBA textures. Sets of 4 consecutive entries from the same vector are stored in one RGBA texel. Thus, up to four times as many entries can be processed simultaneously. We should note here that operations on vectors and matrices built upon this particular internal format perform in exactly the same way as outlined. Only at the very end of the computation do the vector elements stored in separate color components need to be rearranged for rendering purposes. We can easily realize this task by means of a simple shader program that for each pixel in the result image fetches the respective color component.
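The packing itself is straightforward; the following sketch (plain CPU code, for illustration only) shows how four consecutive entries end up in one RGBA texel, with the last texel padded by zeros.

    #include <array>
    #include <iostream>
    #include <vector>

    // Packs a vector into RGBA texels: entry i goes to texel i/4, channel i%4.
    std::vector<std::array<float, 4>> packRGBA(const std::vector<float>& v)
    {
        std::vector<std::array<float, 4>> texels((v.size() + 3) / 4,
                                                 std::array<float, 4>{});
        for (std::size_t i = 0; i < v.size(); ++i)
            texels[i / 4][i % 4] = v[i];
        return texels;
    }

    int main()
    {
        std::vector<float> v = {1.f, 2.f, 3.f, 4.f, 5.f};
        auto texels = packRGBA(v);                 // 2 texels, second one zero-padded
        std::cout << texels.size() << " texels\n";
        return 0;
    }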
On average, the multiplication of two vectors of length 512^2 took 0.2 ms. Performance dropped down to 0.72 ms and 2.8 ms for vectors of length 1024^2 and 2048^2, respectively. Multiplication of
a 4096^2 full matrix times a vector was carried out in roughly 0.23 seconds. In contrast, the multiplication of a sparse banded matrix of the same size, which was composed of 10 non-zero diagonals, took 0.72 ms.

Obviously, only one vector element can be stored in a single RGBA texel if numerical operations on vector-valued data have to be performed. On the other hand, in this case also the performance of software implementations drops down due to enlarged memory footprints. Our current software implementation is highly optimized with regard to the exploitation of cache coherence on the CPU. In practical applications, a less effective internal representation might be used, so that we rather expect a relative improvement of the GPU based solution. In this respect it is important to know that also on the GPU access to higher precision textures slows down performance by about a factor of 1.5-2.

The least efficient operation compared to its software counterpart is the reduce operation. It is only about a factor of three faster even though we store four elements in one RGBA texture. For instance, reducing a 1024^2 vector takes about 1.6 ms on the GPU; for a vector of the same length, the software implementation takes about 5.4 ms. The relative loss in performance is due to the fact that the pixel shader program to be used for this kind of operation is a lot more complex than the one that is used for vector-vector multiplication. On the other hand, even a performance gain of a factor of three seems to be worth an implementation on the GPU.

In the following, we present two examples that demonstrate the efficient solution of finite difference equations on the GPU. In both examples, a 1024 x 1024 computational grid was employed, and matrices were represented as a set of diagonal vectors.

In the first example, a solution to the 2D wave equation was computed based on the implicit Crank-Nicholson scheme as described (see Figures 6 and 7 for results). Compared to explicit schemes, the implicit approach allows us to considerably increase the step size in time. To solve the system of equations we employed the GPU implementation of the conjugate gradient solver. The banded structure of the matrix was exploited by reducing the number of diagonal vectors to be rendered in one matrix-vector multiplication. The computation of one matrix-vector product (1024^2 x 1024^2 sparse matrix times 1024^2 vector) took roughly 4.54 ms. Overall, one iteration of the conjugate gradient solver was finished in 15.4 ms. By performing only a limited number of iterations, five in the current example, interactive simulation at 13 fps could be achieved.

In our second example, we describe a GPU implementation of a numerical solution to the incompressible Navier-Stokes equations (NSE) in 2D:

    \frac{\partial u}{\partial t} = \frac{1}{Re} \nabla^2 u - (V \cdot \nabla) u + f_x - \frac{\partial p}{\partial x}    (1)

    \frac{\partial v}{\partial t} = \frac{1}{Re} \nabla^2 v - (V \cdot \nabla) v + f_y - \frac{\partial p}{\partial y}    (2)

Here, u and v correspond to the components of the velocity V in the x and y direction, respectively. Re is the Reynolds number, p the pressure, and via f_x and f_y external forces can be specified.

We first discretize the partial derivatives of u and v, resulting in an explicit scheme to compute new velocities at time t+1 from values at time t:

    u^{t+1} = G^t - \Delta t \, \frac{\partial p^{t+1}}{\partial x}    (3)

    v^{t+1} = F^t - \Delta t \, \frac{\partial p^{t+1}}{\partial y}    (4)

with

    G^t = u^t + \Delta t \left( \frac{1}{Re} \nabla^2 u - (V \cdot \nabla) u + f_x \right)    (5)

    F^t = v^t + \Delta t \left( \frac{1}{Re} \nabla^2 v - (V \cdot \nabla) v + f_y \right)    (6)

Given current values for u and v at every grid point, G^t and F^t can be directly evaluated. In the current implementation, we employ an explicit scheme to resolve for G^t and F^t. The diffusion operator is discretized by means of central differences, and, as proposed in [Stam 1999], we solve for the advection part by tracing the velocity field backward in time. Note that these operations are carried out on a 2D grid represented by a 2D texture. The involved computations are performed at each grid point in a pixel shader program. Finally, in order to compute updated velocities at time t+1, we have to solve for the pressure at time t+1. From the continuity equation for incompressible media (div(V) = 0), we obtain the following Poisson equation for the pressure p:

    \frac{\partial^2 p^{t+1}}{\partial x^2} + \frac{\partial^2 p^{t+1}}{\partial y^2} = \frac{1}{\Delta t} \left( \frac{\partial F^t}{\partial x} + \frac{\partial G^t}{\partial y} \right)    (7)

The partial derivatives of F and G are computed at each grid point, represented as a 2D texture, by means of forward differences. Finally, the right hand side of equation 7 is input to the GPU implementation of the CG solver. Because vectors are internally represented as 2D textures, the data does not have to be converted and can directly be used to feed the CG solver. Equipped with appropriate boundary conditions, the CG solver iteratively computes a solution for p at time t+1, which can be directly passed back to the explicit scheme to compute new velocity values by means of equations 3 and 4.
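To summarize the flow of one time step, the sketch below transcribes equations (3), (4) and (7) as given above on a regular grid with spacing h; the grid layout, the boundary handling and the call into the CG solver are assumptions of this illustration and are not spelled out in the text.

    #include <vector>

    // Scalar field on an n x n grid with spacing h, stored row-major.
    struct Field {
        int n; float h; std::vector<float> v;
        float& at(int i, int j) { return v[j * n + i]; }
    };

    // Right hand side of equation (7), assembled with forward differences of F and G.
    void poissonRhs(Field& F, Field& G, float dt, Field& rhs)
    {
        for (int j = 0; j < rhs.n - 1; ++j)
            for (int i = 0; i < rhs.n - 1; ++i)
                rhs.at(i, j) = ((F.at(i + 1, j) - F.at(i, j)) +
                                (G.at(i, j + 1) - G.at(i, j))) / (dt * rhs.h);
    }

    // Velocity update of equations (3) and (4), once the pressure p at time t+1
    // has been obtained from the GPU-based CG solver.
    void updateVelocities(Field& G, Field& F, Field& p, float dt, Field& u, Field& v)
    {
        for (int j = 0; j < u.n - 1; ++j)
            for (int i = 0; i < u.n - 1; ++i) {
                u.at(i, j) = G.at(i, j) - dt * (p.at(i + 1, j) - p.at(i, j)) / u.h;  // eq. (3)
                v.at(i, j) = F.at(i, j) - dt * (p.at(i, j + 1) - p.at(i, j)) / v.h;  // eq. (4)
            }
    }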
Overall, by means of the GPU implementation of both the explicit and the implicit scheme we were able to interactively demonstrate the numerical solution of the NSE at 9 fps on a 1024^2 grid. In each time step, we use the pressure distribution from the last time step as initial guess for the CG solver. In this way, only a few iterations have to be performed, yet resulting in good accuracy. In the current implementation, four iterations were executed. Such a small number of iterations, on the other hand, yields inaccurate results once abrupt changes are applied by means of external forces. In Figure 8, we show a snapshot of our interactive tool, which allows one to interact with the velocity field, and to visualize the dynamics of dye injected into this field.

7 Conclusion

In this work, we have described a general framework for the implementation of numerical simulation techniques on graphics hardware. For this purpose, we have developed efficient internal layouts for vectors and matrices. By considering matrices as a set of diagonal or column vectors and by representing vectors as 2D texture maps, matrix-vector and vector-vector operations could be accelerated considerably compared to software based approaches.

Our emphasis was on providing the building blocks for the design of general techniques of numerical computing. This is in contrast to existing approaches, where dedicated, mainly explicit solution methods have been proposed. In this respect, for the simulation of particular phenomena some of these approaches might be superior to ours in terms of performance. On the other hand, our framework offers the flexibility to implement arbitrary explicit or implicit schemes, and it can thus be used in applications where larger step sizes and stability are of particular interest. Furthermore, because our internal matrix layout can benefit from the sparsity of columns quite effectively, we do not expect our method to be significantly slower compared to customized explicit schemes.

In order to demonstrate the effectiveness and the efficiency of our approach, we have described a GPU implementation of the conjugate gradient method to numerically solve the 2D wave equation
and the incompressible Navier-Stokes equations. In both examples, implicit schemes were employed to allow for stable computations, yet providing interactive rates. Despite precision issues, we could achieve considerably better performance compared to our software realization. On the other hand, to allow for a fair comparison we should consider timing statistics of SSE-optimized software solutions, which are supposed to perform about a factor of 2 to 3 faster.

The lack of a contiguous floating point pipeline on our target architecture still prohibits its use in numerical applications where accuracy is a predominant goal. On the other hand, with regard to the fact that full floating point pipelines are already available, the implementation of numerical techniques on commodity graphics hardware is worth an elaborate investigation. Particularly in entertainment and virtual scenarios, where precision issues might be of lesser concern, such implementations can be used effectively for interactive physics based simulation.

In the future, we will implement matrix-matrix operations based on the described internal layout, and we will investigate methods to efficiently update vectors and matrices that are stored in texture memory. In this way, linear algebra operations like LU-decomposition or Singular Value Decomposition can be implemented. In the long term, we aim at providing the functionality that is available in the BLAS library, thus allowing general linear algebra packages to be built upon GPU implementations.

8 Acknowledgements

We would like to thank ATI for providing the 9800 graphics card, and in particular Mark Segal for providing information about the technical details of this card.

Figure 6: GPU-based interactive simulation of 2D water surfaces is demonstrated. The implementation runs at 43 fps on a 512^2 grid.

Figure 7: A GPU-based tool to interact with water surfaces in real-time is shown. By means of the mouse, the user can simulate external forces that disturb the water surface. On a 1024^2 grid the application runs at 13 fps.

Figure 8: An interactive tool for the visualization of the solution to the 2D Navier-Stokes equations is demonstrated. The user can modify the velocity field, and dye can be injected into the field. On a 1024^2 grid the application runs at 9 fps.

References

ANDERSON, E., BAI, Z., BISCHOF, C., BLACKFORD, S., DEMMEL, J., DONGARRA, J., DU CROZ, J., GREENBAUM, A., HAMMARLING, S., MCKENNEY, A., AND SORENSEN, D. 1999. LAPACK Users' Guide, third ed. Society for Industrial and Applied Mathematics, Philadelphia, PA.

ATI, 2003. Sample effects on the ATI graphics cards. http://www.ati.com/developer/techpapers.html.

BARAFF, D., AND WITKIN, A. 1998. Large steps in cloth simulation. Computer Graphics SIGGRAPH 98 Proceedings, 43-54.

BOLZ, J., FARMER, I., GRINSPUN, E., AND SCHROEDER, P. 2003. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. Computer Graphics SIGGRAPH 03 Proceedings.

CHEN, J., AND DA VITORIA LOBO, N. 1995. Towards interactive-rate simulation of fluids with moving obstacles using Navier-Stokes equations. Graphical Models and Image Processing 57, 2.

CURTIS, C., ANDERSON, S., SEIMS, J., FLEISCHER, F., AND SALESIN, D. 1997. Computer-generated watercolor. Computer Graphics SIGGRAPH 97 Proceedings.

DEBUNNE, G., DESBRUN, M., CANI, M.-P., AND BARR, A. 2001. Dynamic real-time deformations using space and time adaptive sampling. In Computer Graphics SIGGRAPH 01 Proceedings.

DESBRUN, M., MEYER, M., SCHROEDER, P., AND BARR, A. 1999. Implicit fairing of irregular meshes using diffusion and curvature flow. In Computer Graphics SIGGRAPH 99 Proceedings, 317-324.
DONGARRA, J., DU CROZ, J., HAMMARLING, S., AND HANSON, R. 1988. An extended set of FORTRAN basic linear algebra subprograms. ACM Transactions on Mathematical Software 14, 1-17.

DONGARRA, J., DU CROZ, J., HAMMARLING, S., AND HANSON, R. 1990. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software 16, 1-17.

ELDER, G. 2002. Radeon 9700. In Proceedings Eurographics/SIGGRAPH Workshop on Graphics Hardware 2002.

FEDKIW, R., STAM, J., AND JENSEN, H. 2001. Visual simulation of smoke. Computer Graphics SIGGRAPH 01 Proceedings, 15-22.

FOSTER, N., AND FEDKIW, R. 2001. Practical animation of liquids. Computer Graphics SIGGRAPH 01 Proceedings, 23-30.

FOSTER, N., AND METAXAS, D. 1996. Realistic animation of liquids. Graphical Models and Image Processing 58, 5, 471-483.

HARRIS, M., COOMBE, G., SCHEUERMANN, T., AND LASTRA, A. 2002. Physically-based visual simulation on graphics hardware. In Proceedings Eurographics/SIGGRAPH Workshop on Graphics Hardware 2002.

HART, J. 2001. Perlin noise pixel shaders. In Proceedings Eurographics/SIGGRAPH Workshop on Graphics Hardware 2001.

HEIDRICH, W., WESTERMANN, R., SEIDEL, H.-P., AND ERTL, T. 1999. Applications of pixel textures in visualization and realistic image synthesis. In ACM Symposium on Interactive 3D Graphics, 110-119.

HILLESLAND, K., MOLINOV, S., AND GRZESZCZUK, R. 2003. Nonlinear optimization framework for image-based modelling on programmable graphics hardware. Computer Graphics SIGGRAPH 03 Proceedings.

HOPF, M., AND ERTL, T. 1999. Accelerating 3D convolution using graphics hardware. In Proceedings IEEE Visualization '99, 471-474.

HOPF, M., AND ERTL, T. 2000. Hardware accelerated wavelet transformations. In Proceedings EG/IEEE TCVG Symposium on Visualization VisSym '00, 93-103.

JOBARD, B., ERLEBACHER, G., AND HUSSAINI, Y. 2000. Lagrangian-Eulerian advection of noise and dye textures for unsteady flow visualization. In Proceedings IEEE Visualization '00, 110-118.

KAAS, M., AND MILLER, G. 1990. Rapid, stable fluid dynamics for computer graphics. Computer Graphics SIGGRAPH 90 Proceedings, 49-57.

LARSEN, E. S., AND MCALLISTER, D. 2001. Fast matrix multiplies using graphics hardware. In Proceedings Supercomputing 2001.

LINDHOLM, E., KILGARD, M., AND MORETON, H. 2001. A user-programmable vertex engine. Computer Graphics SIGGRAPH 01 Proceedings.

MICROSOFT, 2002. DirectX9 SDK. http://www.microsoft.com/DirectX.

MONTRYM, J., AND MORETON, H. 2002. GeForce4. In Proceedings Eurographics/SIGGRAPH Workshop on Graphics Hardware 2002.

NVIDIA, 2002. nVidia OpenGL game of life. http://www.nvidia.com/view.asp?IO=ogl_gameoflife.

NVIDIA, 2003. Sample effects on the nVIDIA graphics cards. http://developer.nvidia.com/view.asp?PAGE=papers.

OLANO, M., AND LASTRA, A. 1998. A shading language on graphics hardware. Computer Graphics SIGGRAPH 98 Proceedings, 159-168.

PRESS, W., TEUKOLSKY, S., VETTERLING, W., AND FLANNERY, B. 2002. Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press.

PURCELL, T., BUCK, I., MARK, W., AND HANRAHAN, P. 2002. Ray tracing on programmable graphics hardware. Computer Graphics SIGGRAPH 02 Proceedings, 703-712.

STAM, J. 1999. Stable fluids. Computer Graphics SIGGRAPH 99 Proceedings, 121-128.

STRZODKA, R., AND RUMPF, M. 2001a. Nonlinear diffusion in graphics hardware. In Proceedings EG/IEEE TCVG Symposium on Visualization 2001, 75-84.

STRZODKA, R., AND RUMPF, M. 2001b. Using graphics cards for quantized FEM computations. In Proceedings VIIP 2001, 98-107.

THOMPSON, C., HAHN, S., AND OSKIN, M. 2002. Using modern graphics architectures for general-purpose computing: A framework and analysis. Proceedings of the 35th International Symposium on Microarchitecture (MICRO-35).

WEISKOPF, D., HOPF, M., AND ERTL, T. 2001. Hardware-accelerated visualization of time-varying 2D and 3D vector fields by texture advection via programmable per-pixel operations. In Proceedings Workshop on Vision, Modeling, and Visualization VMV '01, 439-446.

WEISKOPF, D., HOPF, M., AND ERTL, T. 2002. Hardware-accelerated Lagrangian-Eulerian texture advection for 2D flow visualization. In Proceedings Workshop on Vision, Modeling, and Visualization VMV '02.