
A HIGH PERFORMANCE

PARALLEL SPARSE LINEAR EQUATION SOLVER USING CUDA

A thesis submitted
to Kent State University in
partial fulfillment of the requirements
for the degree of Master of Science

by
Andrew J. Martin
August, 2011

Thesis written by
Andrew J. Martin
B.S., Keene State College, 2007
M.S., Kent State University, 2011

Approved by

Dr. Mikhail Nesterenko, Advisor


Dr. John Stalvey, Chair, Department of Computer Science
Dr. Timothy S. Moerland, Dean, College of Arts and Sciences

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

Acknowledgements

Dedication

1 Introduction

2 Preliminaries
    2.1 Bi-factorization
    2.2 CUDA Overview
    2.3 Random Power System Generator
    2.4 Parallel Sparse Linear Equation Solver

3 Experiment Setup

4 Experiment Results and Analysis

5 Future Work

Appendix

A CUDA Solver Source Code

LIST OF FIGURES

1   A simple North American power system.
2   CUDA memory hierarchy.
3   A COO representation of sample matrix A.
4   A 20 node random power system topology: dmin = 0, dmax = 0.3, λ = 2.67.
5   A 20 node random power system topology: dmin = 0, dmax = 0.3, λ = 4.78.
6   A 100 node random power system topology: dmin = 0, dmax = 0.3, λ = 2.67.
7   A 100 node random power system topology: dmin = 0, dmax = 0.3, λ = 4.78.
8   Average admittance matrix size with λ = 1.7.
9   Average admittance matrix size with λ = 2.67.
10  Average admittance matrix size with λ = 4.78.
11  Average computation time for CPU versus GPU with λ = 1.7.
12  Average computation time for CPU versus GPU with λ = 2.67.
13  Average computation time for CPU versus GPU with λ = 4.78.

LIST OF TABLES

1   Average computation times for L and U when average neighbor count is 2.67.
2   Average number of elements in the admittance matrix for UPS.
3   Average number of elements in the admittance matrix for WECC.
4   Average number of elements in the admittance matrix for NYISO.
5   Average memory transfer times for λ = 1.7.
6   Average memory transfer times for λ = 2.67.
7   Average memory transfer times for λ = 4.78.
8   Average computation times with λ = 1.7.
9   Average computation times with λ = 2.67.
10  Average computation times with λ = 4.78.

Acknowledgements

I would like to thank Evgeny Karasev of the Moscow Power Engineering Institute
and Dimitry Nikiforov of Monitor Electric for providing technical guidance throughout
the course of this thesis. I would also like to thank my advisor, Mikhail Nesterenko, for
his expert guidance throughout the entire thesis process.


I dedicate this thesis to my brother, Corey Martin Jr. May your memory be eternal.


CHAPTER 1

Introduction

Electricity. One of the most significant achievements of mankind was the harnessing of
electric power. Nothing in our history has done more to further our progress as a civilized
society. We use electricity to light our homes, heat our food, power our traffic lights,
charge the batteries in our portable electronic devices and more. Without electricity, we
would still rely on animals to power farming equipment and on fire to provide light and
heat for our homes. Instead, we flip a switch on the wall and light comes out of the light
bulb. As consumers, we are oblivious to how intricate our power system is. In fact, the
only time we ever really pay attention to electricity is when it is, for some reason, not
there.
Actually, by flipping the light switch, we are connecting a power source to the light
bulb in a direct path known as an electric circuit. An electric circuit needs three basic
components to deliver energy to the consumer:

Energy source provides the force that induces electrons to move and thus provides
energy. The flow of electrons through an electric circuit is the electric current (I).

Load is the device that converts the energy in a circuit to productive use, such as
a light bulb or a motor.

Path provides a conduit for the electric current to flow between the energy source
and the load.
Materials where electrons are tightly bound to atoms, such as rubber and air, are
insulators. Insulators do not conduct electric current easily. Conversely, materials that
have free electrons in atoms, such as copper and steel, are conductors. The degree to
which the medium prevents electrons from flowing is resistance (R). Insulators have
high resistance and conductors have low resistance. Voltage (V) is the cause of current
flow. The higher the voltage, the greater the electric current. Ohm's law states that
the amount of current flowing through a circuit element is directly proportional to the
voltage across the element, and inversely proportional to the resistance of the element [1].
Ohm's law can be written as I = V/R.
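For example (a worked illustration added here for clarity, not a value taken from the thesis): a 120 V source across a 60 Ω load drives a current of

$$
I = \frac{V}{R} = \frac{120\ \text{V}}{60\ \Omega} = 2\ \text{A}.
$$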
There are two types of electric current: direct current (DC) and alternating current
(AC). Direct current is a unidirectional flow of electrons. Alternating current is the flow
of electrons that periodically reverses direction. In AC, the voltage and current change
as a sine wave in time. In electrical engineering, the sine wave is represented as a rotating
vector in the real and imaginary plane.
One of the major advantages of AC over DC is that it is easy to generate and its voltage
is easy to change. DC requires many sophisticated pieces of equipment to distribute power
to consumers, as the voltage cannot be easily changed for transmission.
The device that changes the AC voltage is a transformer. A transformer contains
two coils of wire that are wound around the same core material in such a way that
mutual inductance is maximized [2]. Mutual inductance is the effect of inducing voltage
in a secondary coil by a primary coil. Current passing through the primary coil induces
voltage in the secondary coil. A transformer, thus, transfers electric energy from one
circuit to another. As the energy is transferred, the voltage may be changed. There are
two types of transformers: a step up transformer and a step down transformer. A step up
transformer increases the voltage by having a secondary coil that has more turns than
the primary coil. The voltage induced in the secondary coil is thus increased. A step
down transformer decreases the voltage by having a secondary coil that has fewer turns
than the primary coil. Therefore, the secondary coil will have less voltage induced.
An electric circuit presents two obstacles to this kind of change: inductance and
capacitance. Inductance (L) is a property of an electric circuit that opposes a sudden
change of current flow. Unlike resistance, inductance does not cause energy loss in the
form of heat. Capacitance (C) is the ability to hold an electric charge. Reactance (X)
is the opposition of an electric circuit to a change of electric current or voltage. We,
therefore, have the inductive reactance X_L and the capacitive reactance X_C. The impedance
(Z) is the total measure of opposition in a circuit to AC. It is expressed as a complex
number such that Z = R + j(X_L − X_C), where X_L and X_C are imaginary quantities.
(Power engineers use j to represent the imaginary unit of a complex number so as to not
confuse i with current.) Admittance (Y) is the inverse of impedance. It measures how
easily AC can flow through a circuit.
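As a worked illustration (the numbers are arbitrary and not taken from the thesis), a circuit element with R = 3 Ω, X_L = 5 Ω and X_C = 1 Ω has

$$
Z = R + j(X_L - X_C) = 3 + j4\ \Omega,
\qquad
Y = \frac{1}{Z} = \frac{3 - j4}{3^2 + 4^2} = 0.12 - j0.16\ \text{S},
$$

so the admittance is itself a complex number, which is why the admittance matrix introduced below stores complex entries.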

Electric power system. An electric power system is a connected collection of electric
circuits that is used to supply, transmit and distribute electric power. See Figure 1 for
an example of a North American electric power system. (Figure 1 has been taken from
the US Department of Energy's final report on the August 14, 2003 blackout in the
United States and Canada [3].)
[Diagram: generation (generating station, generating step-up transformer), transmission (765, 500, 345, 230, and 138 kV lines; transmission customers at 138 kV or 230 kV), and distribution (substation step-down transformer; subtransmission customers at 26 kV and 69 kV; primary customers at 13 kV and 4 kV; secondary customers at 120 V and 240 V).]

Figure 1: A simple North American power system.

A power generation plant converts energy from sources such as water, coal, and natural
gas into electric energy. Most power plants are located away from heavily populated
areas and near water sources. In a power plant, the energy from the primary source is
used to convert water into steam. The steam is applied to the blades of a turbine forcing
it to rotate. The turbine is connected to an electromechanical generator, which converts
the rotation of the turbine into electric energy.
A generator contains a coil of wire inside a large magnet. As the turbine rotates,
it turns the coil of wire, passing the different poles of the magnet, and electric current
is induced in the wire. The coil rotates inside the magnet, causing the magnet to pull
the electrons in one direction, but when the coil has rotated 180 degrees, the magnet pulls the
electrons in the other direction. This rotation creates AC.
Long transmission lines offer high resistance to AC which leads to extensive energy
losses. Higher voltages require less current to transmit the same amount of power. This
reduces the amount of energy lost during transmission [4]. Typical voltages for long
distance transmission lines are between 138 kV and 765 kV, whereas the power is
generated at about 10 kV to 13 kV. The generated power is transmitted to the load,
which is possibly remote, to be used. As the power leaves the power plant, its voltage is
increased to a transmission level.
Electric power cannot be consumed at transmission-level voltages due to health
and safety concerns. Once the power is transmitted to populated areas, the voltage needs
to be decreased to the consumer level. The power is transmitted to substations, where a
step down transformer decreases the voltage to a distribution level that a consumer can
use. A typical substation has multiple transmission lines delivering power in multiple
directions. To organize power routing, the conductors in a substation are arranged into
buses. A bus is a part of the power system with zero impedance [1]. A bus has circuit
breakers and switches to allow uninterrupted power delivery in case any element of the
substation fails.

Powerflow problem. To ensure the safe and efficient operation of an electric power
system, it is continuously managed by a distributed team of electric power dispatchers.
There is a collection of control centers manned by dispatchers. The data about the state
of the power system is telemetered to these control centers. This data is processed to
compute the state of the power system, which reflects the power distribution through it.
When determining the motion of electric machines, the generators and motors, the
powerflow calculation alternates with integration of differential equations that determine
the rotor acceleration at the next time step. In real-time applications, such as system
simulators, the powerflow has to be computed hundreds of times per second.
The problem of computing the power distribution is called the powerflow problem.
Specifically, the powerflow problem requires that the voltage at each bus in the power
system be computed on the basis of current. The admittances for each bus are therefore
stored in a sparse admittance matrix. For the powerflow problem, the electric power
system of n buses is represented by a system of linear equations. Rewriting Ohms law,
we have:
$$
I = YV, \qquad (1)
$$

where

$$
I = \begin{pmatrix} I_1 \\ I_2 \\ \vdots \\ I_n \end{pmatrix}, \quad
V = \begin{pmatrix} V_1 \\ V_2 \\ \vdots \\ V_n \end{pmatrix}, \quad
Y = \begin{pmatrix}
Y_{11} & Y_{12} & \cdots & Y_{1n} \\
Y_{21} & Y_{22} & \cdots & Y_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{n1} & Y_{n2} & \cdots & Y_{nn}
\end{pmatrix}.
$$
Vector I contains the currents, represented as complex numbers, that each bus injects into
the system. Voltage profile vector V contains the voltages, represented as complex numbers,
at each bus, and Y is the admittance matrix that stores the admittances, represented as
complex numbers, at each bus. Each y_ij, i ≠ j, is a mutual admittance between
buses, and y_ii is a self admittance for a bus. A typical power system may contain thousands
of buses and, therefore, the system may have thousands of equations. The admittance
matrix may potentially contain millions of elements. However, each bus has relatively
few neighbors. In fact, the typical number of neighbors ranges from 1.7 to 4.78. Therefore,
the admittance matrix is rather sparse and diagonally dominated. That is, most of the
non-zero elements are along the main diagonal. The powerflow problem requires solving
this system for voltages. One of the most straightforward methods is to invert the
admittance matrix and solve the system as follows: V = Y^{-1}I. However, the processing
of the inverted matrix is rather computationally intensive. Even more problematic is
the matrix fill in. That is, even if the original matrix Y is sparse, its inverse is usually
dense. Using a dense multi-million element matrix to solve the powerflow problem is
prohibitively expensive both in storage and computation time.
One approach to overcome the fill in problem is factorization: rather than computing
the inverse, factor matrices are calculated such that these factors preserve the sparsity
of the original matrix. For example, the well-known LU factorization technique converts
Y into two triangular factors, Y = LU. The lower L and upper U triangular matrices
consist of zeros above and below the main diagonal, respectively. Substituting the factors
in Formula 1 we get:

$$
I = (LU)V. \qquad (2)
$$

Letting UV = X, the system of linear equations in Formula 2 can be split into
two as follows:

$$
I = LX, \qquad X = UV. \qquad (3)
$$

Since L is the lower triangular factor matrix, the first system of linear equations can
be solved for X by forward substitution. Then, since U is the upper triangular factor
matrix, the second system can be solved for V using backward substitution. Given that
U and L preserve the sparsity of the original admittance matrix, this method of solving
the linear system of equations results in substantial memory and computational resource
savings.
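As a concrete illustration of these two triangular solves, the following is a minimal dense, real-valued sketch in C. The thesis solver itself operates on sparse, complex, bi-factorized matrices (Chapter 2 and Appendix A), so this is illustrative only.

#include <stddef.h>

/* Forward substitution: solve L x = i for x, with L lower triangular
 * (row-major, n x n). */
static void forward_substitution(const double *L, const double *i,
                                 double *x, size_t n) {
    for (size_t r = 0; r < n; r++) {
        double sum = i[r];
        for (size_t c = 0; c < r; c++)
            sum -= L[r * n + c] * x[c];   /* subtract already-known terms */
        x[r] = sum / L[r * n + r];        /* divide by the diagonal entry */
    }
}

/* Backward substitution: solve U v = x for v, with U upper triangular
 * (row-major, n x n). */
static void backward_substitution(const double *U, const double *x,
                                  double *v, size_t n) {
    for (size_t r = n; r-- > 0; ) {
        double sum = x[r];
        for (size_t c = r + 1; c < n; c++)
            sum -= U[r * n + c] * v[c];
        v[r] = sum / U[r * n + r];
    }
}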

The computational demands for real-time power system state computation require
efficient powerflow computations. Such requirements are beyond the capabilities of even
the fastest modern day central processing units (CPU). One method of accelerating
power system state computation is through parallelization on multi-processor architectures. However, the success in this direction has been limited thus far. One major
obstacle is that the powerflow computation, specifically solving I = YV, is inherently
sequential. Nevertheless, such problems could become solvable when the number of parallel processors, such as those on Graphics Processing Units (GPUs), is large enough that
even minor concurrency gains result in significant performance improvements.

GPGPU. The GPU is a specialized processor used for image processing and 3D applications. In 1992, Silicon Graphics opened the programming interface to its hardware by
releasing the OpenGL library [5]. Spurred by the graphics and 3D gaming market, the
GPU has evolved into a massively parallel, many-core processor with significant computational power. In 2002, Mark Harris coined the term General-purpose computation on
graphics processing units (GPGPU) [6]. GPGPU uses a GPU to perform computations
that are not related to graphics processing. A main difference between the multi-core
CPU and the many-core GPU is that a many-core processor puts more cores in a given
thermal envelope than a multi-core processor does [7]. This makes many-core computation advantageous. The effect of a many-core system is that it does not suffer from
traditional problems that arise in multi-core systems, such as bus contention (in this sense,
a computer bus and not a power system bus). A typical GPU has hundreds of cores and
is able to achieve high peak memory bandwidth throughput. The GPU is designed for
data-parallel computations. Data-parallel processing on the GPU involves mapping data
elements to parallel processing threads. That is,
each thread on the GPU performs the same instruction or program on a different piece of
data. Researchers and developers quickly realized the benefit of using GPUs to accelerate
data-parallel algorithms. GPUs are used in a variety of areas that include computational
biology and chemistry, the SETI project [8], protein folding, video accelerators, ray tracing and more.
Several frameworks exist that allow the developer to write applications for the GPU
using several high level programming languages. OpenCL [9] is an example of one such
framework that allows developers to write applications that run on a variety of GPUs. Microsoft has contributed to the list of GPU frameworks by developing DirectCompute [10]
for DirectX. DirectCompute differs from OpenCL in that it utilizes DirectX and therefore is only for the Windows operating system. OpenCL can be run on all of the major
operating systems (Linux, Windows and OS X). There is a third solution available for
developers wishing to write GPU accelerated applications, named CUDA.
Compute Unified Device Architecture (CUDA) was introduced by NVIDIA in November 2006, and is a general purpose parallel computing architecture. Unlike OpenCL and
DirectCompute, CUDA only runs on NVIDIA GPUs. However, in most cases the tight
integration of hardware and software allows CUDA to achieve better performance results
than OpenCL and DirectCompute. The CUDA parallel programming model [11] consists of an application that contains a sequential host program that may execute parallel
programs called kernels on the GPU. A kernel is the code that runs in parallel on the
GPU.


Related literature. Previous work has been done on sparse matrix vector (SpMV) multiplications using the GPU [12, 13, 14]. Bell et al. present several packing methods for
packing sparse matrices based on their storage requirements and computational characteristics. Kernels are also included to illustrate generic SpMV multiplications. However,
the SpMV multiplication kernels presented would have to be heavily modified to solve
systems of linear equations using a bi-factorized matrix. Chalasani [15] presents several
approaches to computing the powerflow problem using parallelization. Chalasani's work
differs from mine in that he uses a loosely coupled, heterogeneous network of workstations.
The architecture of Chalasani's work is incompatible with the GPU architecture
on which I will be implementing my algorithm. Furthermore, the use of a GPU is more
cost-effective than a cluster or workgroup of machines and will likely yield similar, if not
better, results. My implementation exploits the high level of parallelism offered by a
single GPU installed on a single workstation. Liu et al. [16] present a generic method for
SpMV multiplication using OpenMP. This implementation is limited by the number of
CPU cores on a given system. The alternative is to use OpenMP on a cluster; however,
this suffers from costly message passing. The generic SpMV multiplication method
is not suitable for solving systems of linear equations using a bi-factorized matrix.
Wang et al. [17] present a method for solving systems of linear equations using LU
factorization and a specialized FPGA multiprocessor. Their implementation requires several
compute nodes and several control nodes, which may lead to cumbersome and complex
configurations. This approach differs from my setup in that I use one machine with one
GPU, which is incompatible with the requirements of Wang's work. Amestoy et al. [18]

present a multifrontal parallel distributed solver that utilizes a MIMD architecture to
factorize dense and sparse matrices to solve systems of linear equations. Their approach
requires a host node to analyze the matrix, break up the work, distribute the work,
collect the solution and organize the compute nodes during the computation process. In my
approach, the hardware schedulers on the GPU distribute work to the multiprocessors
on the same GPU and the multiprocessors utilize their local resources to complete the
computations, thus not relying on the hardware schedulers, or a single host node, for
further instructions. Furthermore, in my implementation, the mapping of data to a group
of threads is intuitively based on the thread ID and requires no analysis of the incoming
sparse matrix (I assume the incoming matrix has already been sorted by row). Arnold
et al. [19] present an elemental scheme for distributing the LU factor matrices across a
MIMD processor. The elemental scheme analyzes all of the tasks to find tasks ready to
be processed and assigns them to an idle processor. The management overhead for this
method could lead to significant compromises in efficiency when the matrix is significantly
sparse. My method packs the sparse matrix in a format that ignores all zero elements,
such that analyzing the matrix for ready tasks is not necessary.
Several linear algebra libraries have also been developed to aid in the process of
solving linear equations using the GPU, such as CUBLAS [20] and CUSP [21]. These
libraries, however, are for generic matrix vector computations, and are not suitable for
my purposes. Mayanglambam et al. [22] implement a library, TAUCUDA, for CUDA
developers to accurately test the complete performance of their parallel applications.
Their library takes into account asynchronous and synchronous operations. The setup of
my kernels is straightforward and allows for performance measurements to be computed
without the use of additional libraries. To my knowledge, this is the first time that a
GPU is used to solve systems of linear equations using factorized, specifically bi-factorized,
matrices.

Thesis outline. In this thesis I explore the performance improvement of the powerflow
computation by parallelizing the least parallelizable component of the powerflow problem:
the sparse linear equation solver. In Chapter 2, I explain the bi-factorization method
that factorizes a matrix into two factor matrices, while preserving the sparsity of the
original admittance matrix. I also explain how to solve systems of linear equations
using bi-factorized matrices. In Chapter 2, I also explain what CUDA is and how it
can be used to accelerate the powerflow computation. I then present a parallel sparse
linear equation solver for bi-factorized matrices. I implement this algorithm using a
cost-effective commodity NVIDIA GPU. I build a system to accurately produce sample power
system data. I also discuss the optimization process. In Chapter 3, I explain the setup
of my hardware and software used for the test environment. I also explain the process
of generating the data that is used in our performance tests. In Chapter 4, I present the
results from my experiments, which show that my approach achieves up to a 38 times speedup
over a single threaded CPU implementation. In Chapter 5, I present my plans for future
work.

CHAPTER 2

Preliminaries

To implement a high performance parallel sparse linear equation solver, I created a


random power system topology generator. The random power system topology generator
produces realistic sample admittance matrices. To solve a system for voltages, the admittance matrix is factorized using bi-factorization. I explain how CUDA and the GPU
can be used to parallelize the sparse linear equation solver.

2.1 Bi-factorization
Bi-factorization is particularly suitable for sparse coefficient matrices that are diagonally dominant and that are either symmetrical, or, if not symmetrical, have a symmetrical sparsity structure [23]. Electric networks and power flow systems adhere to these
requirements. A factorization approach is to separate the matrix into multiple factor
matrices. This method is based on finding 2n factor matrices, such that the product of
these factor matrices satisfies the requirement:

$$
L^{(n)} L^{(n-1)} \cdots L^{(2)} L^{(1)} \, Y \, U^{(1)} U^{(2)} \cdots U^{(n-1)} U^{(n)} = E \qquad (4)
$$

where

Y = original coefficient matrix,
L = lower factor matrices,
U = upper factor matrices,
E = unit matrix of order n.

Pre-multiplying equation (4) by the inverses of L^(n), L^(n−1), ..., L^(2) and L^(1) consecutively yields:

$$
Y U^{(1)} U^{(2)} \cdots U^{(n-1)} U^{(n)} = (L^{(1)})^{-1} (L^{(2)})^{-1} \cdots (L^{(n-1)})^{-1} (L^{(n)})^{-1} \qquad (5)
$$

Post-multiplying equation (5) by L^(n), L^(n−1), ..., L^(2) and L^(1) consecutively yields:

$$
Y U^{(1)} U^{(2)} \cdots U^{(n-1)} U^{(n)} L^{(n)} L^{(n-1)} \cdots L^{(2)} L^{(1)} = E \qquad (6)
$$

Finally, pre-multiplying equation (6) by Y^{-1} yields:

$$
U^{(1)} U^{(2)} \cdots U^{(n-1)} U^{(n)} L^{(n)} L^{(n-1)} \cdots L^{(2)} L^{(1)} = Y^{-1} \qquad (7)
$$

The factor matrices obtained by the criterion given by (4) enable the inverse of the
coefficient matrix Y to be expressed and determined implicitly in terms of these factor
matrices. Hence, the solution of the linear system of equations I = YV can be found as:

$$
V = Y^{-1} I = U^{(1)} U^{(2)} \cdots U^{(n-1)} U^{(n)} L^{(n)} L^{(n-1)} \cdots L^{(2)} L^{(1)} I.
$$

Bi-factorization is a method of obtaining sparse factors for the original sparse matrix.
Intuitively, this method obtains a pair of factors by applying Gauss and then Crout
elimination to the original matrix. Specifically, let Y = Y^(0) and Y^(k) = L^(k) Y^(k−1) U^(k), where

$$
L^{(k)} = \begin{pmatrix}
1 &        &              &   &        &   \\
  & \ddots &              &   &        &   \\
  &        & l^{(k)}_{kk} &   &        &   \\
  &        & l^{(k)}_{ik} & 1 &        &   \\
  &        & \vdots       &   & \ddots &   \\
  &        & l^{(k)}_{nk} &   &        & 1
\end{pmatrix},
\qquad
U^{(k)} = \begin{pmatrix}
1 &        &   &              &        &              \\
  & \ddots &   &              &        &              \\
  &        & 1 & u^{(k)}_{kj} & \cdots & u^{(k)}_{kn} \\
  &        &   & 1            &        &              \\
  &        &   &              & \ddots &              \\
  &        &   &              &        & 1
\end{pmatrix},
$$

$$
Y^{(k)} = \begin{pmatrix}
1 &        &              &              &              \\
  & \ddots &              &              &              \\
  &        & y^{(k)}_{kk} & y^{(k)}_{kj} & y^{(k)}_{kn} \\
  & 0      & y^{(k)}_{ik} & y^{(k)}_{ij} & y^{(k)}_{in} \\
  &        & \vdots       & \vdots       & \vdots       \\
  &        & y^{(k)}_{nk} & \cdots       & y^{(k)}_{nn}
\end{pmatrix}.
$$
The elements of these matrices are as follows:

$$
l^{(k)}_{ik} = -\frac{y^{(k-1)}_{ik}}{y^{(k-1)}_{kk}} \qquad (i = k+1, \ldots, n),
$$

$$
u^{(k)}_{kj} = -\frac{y^{(k-1)}_{kj}}{y^{(k-1)}_{kk}} \qquad (j = k+1, \ldots, n),
$$

$$
y^{(k)}_{ij} = y^{(k-1)}_{ij} - \frac{y^{(k-1)}_{ik}\, y^{(k-1)}_{kj}}{y^{(k-1)}_{kk}} \qquad (i = k+1, \ldots, n;\ j = k+1, \ldots, n),
$$

with l^(k)_kk = 1 / y^(k−1)_kk on the diagonal. Note that if y^(k−1)_ik is zero, so is l^(k)_ik. Similarly, if y^(k−1)_kj
is zero, so is u^(k)_kj. That is, sparse
Y^(k−1) produces sparse factors L^(k) and U^(k). These factor matrices are conveniently
stored compactly as a collection of factor matrices in one large matrix. As such, there is
a certain amount of fill in with regard to the entire matrix as we transition from Y^(k−1)
to Y^(k), but it tends to be insignificant. This fill in is further minimized if the rows
and columns in Y are sorted in increasing order of the number of non-zero elements.
Thus, if Y is sparse, the 2n factors produced by bi-factorization are also sparse. Note
also that in every factor L^(k) and U^(k), only the respective column below and row above
the diagonal is non-zero. Therefore, all the 2n factors can be compactly stored as
rows and columns in a single factorized n × n matrix.
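To make the elimination step concrete, the sketch below performs one bi-factorization step on a small dense array of real numbers, storing the factor column, factor row and diagonal element in place as described above. It is a simplified illustration under those assumptions; the matrices used in this thesis are complex and are factorized offline into a packed sparse format.

/* One bi-factorization elimination step k on a dense, real-valued, row-major
 * n x n matrix y. The factor column l(:,k), factor row u(k,:) and diagonal
 * element l_kk are written back into the same array. Illustrative only. */
static void bifactor_step(double *y, int n, int k) {
    double pivot = y[k * n + k];
    /* Reduce the remaining lower-right block: y_ij -= y_ik * y_kj / y_kk. */
    for (int i = k + 1; i < n; i++)
        for (int j = k + 1; j < n; j++)
            y[i * n + j] -= y[i * n + k] * y[k * n + j] / pivot;
    /* Store the k-th factor column and row in place. */
    for (int i = k + 1; i < n; i++)
        y[i * n + k] = -y[i * n + k] / pivot;   /* l_ik = -y_ik / y_kk */
    for (int j = k + 1; j < n; j++)
        y[k * n + j] = -y[k * n + j] / pivot;   /* u_kj = -y_kj / y_kk */
    y[k * n + k] = 1.0 / pivot;                 /* l_kk = 1 / y_kk     */
}

Calling bifactor_step for k = 0, ..., n − 1 produces the single factorized matrix described above.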

2.2 CUDA Overview

GPU architecture. The newest generation of GPUs by NVIDIA, code named Fermi,
features up to 512 CUDA cores. A CUDA core is a streaming processor on the GPU.
Each CUDA core features a fully pipelined integer arithmetic logic unit and a floating
point unit (the Fermi architecture implements the IEEE 754-2008 floating point standard).
The architecture was designed with double precision arithmetic in mind,
offering up to 16 double precision fused multiply-add operations per SM, per clock tick.
The core executes one floating point or integer instruction per clock tick per core. The
CUDA cores are organized into 16 streaming multiprocessors (SM) of 32 cores each. A
streaming multiprocessor is a grouping of 32 CUDA cores, all of which have access to 16
load/store units, 4 special function units, an interconnection network, 64KB of shared
memory and 32,768 registers. A thread is an independent execution stream that runs a
piece of the code based on the data to which it is assigned. A core may execute many
threads, but may only execute one thread at a time. A thread block is a logical grouping
of threads that guarantees that all threads within a thread block reside on the same
SM. The threads within a given thread block cooperate and communicate using barrier
synchronization. Each thread block operates independently of all other thread blocks.
The GigaThread global scheduler distributes thread blocks to SM thread schedulers as
efficiently as possible. It is important to note that in order to achieve the highest level
of efficiency, the GigaThread global scheduler distributes work to the thread blocks in
an arbitrary order. That is, one can not assume that thread block zero will be executed
before thread block one and so on. The GPU has a 384-bit memory interface, which
allows up to a total of 6 GB of GDDR5 DRAM memory.
CUDA threads have access to multiple types of memory during their execution. See
Figure 2 for an illustration (the figure is taken from NVIDIA's CUDA Programming
Guide [11]). Each thread has private local, register based memory, which
is fast but quite small. Each thread block has access to slower shared memory, which is
accessible to all threads of the local thread block. Shared memory is located at each SM
and is limited to 64 KB. The scope of the data in shared memory is the local thread block.
Threads of all thread blocks have access to the large global memory. Global memory is
the DRAM located off-chip on the GPU. The access to the global memory is significantly
slower than access to the on-chip memory. Therefore, the usage of global memory should
be minimized.

Software. A CUDA program consists of two essential components: the host code and
the kernel code. Host code is the code that runs on the CPU. Host code is responsible
for calling the parallel kernels. Kernel code is the code that runs in parallel on the GPU.
A kernel is a Single Program Multiple Data (SPMD) code that is executed on the GPU
using a potentially large number of parallel threads. Multiple GPU threads run the
kernel. Each thread has a unique thread ID. A programmer can refer to an individual
thread by its thread ID. The thread ID is a three-dimensional vector. The programmer
or compiler organizes threads into thread blocks. In the Fermi architecture, each thread
block may contain up to 1,024 threads, with a maximum of 65,535 thread blocks. This
means that a programmer has a total of over 67 million threads at his or her disposal,
which provides a high degree of potential parallelism.
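As a minimal illustration of this model (a generic example, not part of the thesis solver), the kernel below lets each thread scale one element of a vector; the launch configuration chooses the number of thread blocks and the threads per block.

#include <cuda_runtime.h>

// Generic SPMD example: each thread handles one element of x.
__global__ void scale(double *x, double alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // globally unique thread index
    if (i < n)
        x[i] *= alpha;
}

// Host code picks the launch configuration, e.g.:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   scale<<<blocks, threads>>>(d_x, 2.0, n);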

Figure 2: CUDA memory hierarchy.

2.3 Random Power System Generator
Before I can test the performance of my parallel algorithm, I need to generate random
admittance matrices to model realistic power systems. The standard practice for engineering applications is to use a small number of historical test systems. This practice,
however, has shortcomings when examining new theories, methods and scalability. In my
system, I develop a random power system topology generator that generates statistically
accurate, realistic power system topologies.
Studies of power systems in the United States indicate that the average number of
neighbors is about 2.67 for Western Electricity Coordinating Council system (WECC),
formerly known as the Western Systems Coordinating Council, and about 4.78 for the
New York Independent System Operator (NYISO) system [24]. The Siberian region of the
Unified Power System of the Russian Federation (UPS) has 1.7 neighbors on average [25].
A random power system topology is generated using a random distribution function.
The generated power system topology has no self-loops and no disconnected nodes. Inside
a fixed square, N bus locations are selected using a random distribution function. The
edges, which represent transmission lines, are chosen according to the distance limitation
dmin ≤ d ≤ dmax. Each bus is assigned a corresponding random Poisson variable, P(λ),
that represents the potential number of neighbors. λ is equal to the average number
of neighbors for each bus in the power system. A potential list of neighbors, X, is
calculated using the distance limitation and the random Poisson variable for each bus.
It is important to note that in some cases X < P(λ). In such cases, the number of
neighbors in X is chosen over P(λ). The output of the random power system generator is the
admittance matrix Y. Matrix Y is stored in Coordinate List (COO) format. An example of COO is
given in Figure 3. See Figures 4 - 7 for examples of generated power system topologies.

$$
A = \begin{pmatrix}
0 & 2 + j1.2 & 0 & 0 \\
0 & 0 & 4 - j2.1 & 1 + j0.9 \\
12 + j6.7 & 0 & 0 & 0 \\
0 & 0 & 0 & 4 + j2.3
\end{pmatrix}
$$

row    column    value
0      1         2 + j1.2
1      2         4 - j2.1
1      3         1 + j0.9
2      0         12 + j6.7
3      3         4 + j2.3

Figure 3: A COO representation of sample matrix A.
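In code, each packed element simply carries its complex value together with its coordinates. A minimal sketch of such a record is shown below; the field names here are illustrative, while the equivalent struct actually used by the solver appears in Appendix A.

/* Illustrative COO record for one non-zero element of the admittance matrix. */
typedef struct {
    double re, im;   /* real and imaginary part of the admittance */
    int row, col;    /* coordinates of the non-zero element       */
} coo_entry;

/* The five non-zero entries of the sample matrix A from Figure 3. */
static const coo_entry A_coo[] = {
    {  2.0,  1.2, 0, 1 },
    {  4.0, -2.1, 1, 2 },
    {  1.0,  0.9, 1, 3 },
    { 12.0,  6.7, 2, 0 },
    {  4.0,  2.3, 3, 3 },
};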

Figure 4: A 20 node random power system topology: dmin = 0, dmax = 0.3, λ = 2.67.

Figure 5: A 20 node random power system topology: dmin = 0, dmax = 0.3, λ = 4.78.

Figure 6: A 100 node random power system topology: dmin = 0, dmax = 0.3, λ = 2.67.

Figure 7: A 100 node random power system topology: dmin = 0, dmax = 0.3, λ = 4.78.

2.4 Parallel Sparse Linear Equation Solver
Even though solving systems of linear equations with bi-factorized matrices is inherently sequential, I have identified portions of calculations on the sparse matrix that can be
performed concurrently. In my method, I pack all non-zero elements of the bi-factorized
matrix into COO format.

Development history. It took me several attempts to optimize the algorithm. I started


out with a non-packed N N admittance matrix and a kernel that consisted of as many
threads as possible inside of one block. The algorithm used the CPU to control its
execution sequentially. As such, my first attempt actually did worse than the CPU.
Next, I took a look at the data and figured out that the COO format would work well for
my purposes. The algorithm was then rewritten to handle the new format. As such, I
reconfigured the kernel launch parameters such that the kernels had one thread per matrix
element. This involved having many blocks with many threads. CUDA executes blocks
in an arbitrary order, therefore the programmer does not have control over the thread
block execution. The data dependencies of the algorithm require that while multiplying
factor matrix L by voltage vector v, column k must finish processing before column
k + 1 may begin. Similarly, for factor matrix U, row k must finish processing before row
k − 1 may begin. Since I was using many blocks, I had to use the CPU to launch
the kernel sequentially to avoid data-dependency violations. Upon running NVIDIA's
CUDA Profiler, I discovered that over 80% of the execution time was spent calling the
kernels sequentially. In fact, even with a packed sparse matrix, the GPU algorithm
was still achieving only a 2-4 times speedup over the CPU. Therefore, I decided that
having one thread block with a couple hundred threads would be better. That is, I could
launch the kernels one time, and have them loop through the matrix without needing to
coordinate with the CPU. Once I implemented this method, I started observing a 10-18
times speedup. After reviewing the requirements of the algorithm, as well as reviewing
the data dependencies, I identified several areas where I could utilize shared memory.
Utilization of shared memory led to a 22-27 times speedup.
My final optimization came after discovering that the GPU and my algorithm had
trouble processing the data when Y had more than 1,000,000 non-zero elements. When
the number of non-zeros in the admittance matrix exceeds 1,000,000, the solution vector
is returned with zeros. I believe it has to do with a memory overflow, but I have not
determined the exact cause. Previously, I used atomic operations to keep track of how
many non-zero elements were in a given row or column. Keeping track of this number
allowed the kernels to increment a position counter in order to move from column to
column or row to row. Instead, I created an array to keep track of how many elements
are in a given row or column. I send the array of row and column position values to the
GPU and increment the position counter accordingly. This final optimization increased
the speedup to 38 times over the CPU.
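As an illustration of this change, the per-column element counts can be built once on the host while the packed matrix is imported and then copied to the GPU along with the matrix. A minimal sketch is shown below; the helper name is mine, and the actual import routine appears in Appendix A.

/* Build the per-column element counts for a packed COO matrix, so that the
 * kernels can advance a position counter without atomic operations.
 * col_of[e] is the column index of the e-th packed non-zero element. */
static void count_per_column(const int *col_of, int num_elements,
                             int *count, int num_columns) {
    for (int c = 0; c < num_columns; c++)
        count[c] = 0;
    for (int e = 0; e < num_elements; e++)
        count[col_of[e]]++;    /* one increment per non-zero in that column */
}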
The process of learning to program on the GPU was not without trial and error.
Several behaviors of the GPU are not well documented within the CUDA documentation.
For instance, during testing, I found that my algorithm would produce incorrect results
when testing against admittance matrices in a specific order. When I ran the admittance
matrices in a different order, the results would be correct. This discrepancy was due to a
memory initialization issue. I assumed that when I freed the memory, it and its contents
were gone. Furthermore, I assumed that when I allocated new memory and initialized it
in the next run of the application, the memory would be properly initialized. However, I
found that I had to not only initialize the memory before I used it, but I had to reinitialize
the memory before freeing it and exiting.

L kernel. Since the data dependencies for processing Lv only require that column k
be processed before column k + 1, all of the elements in column k on or below the main
diagonal are able to be processed in parallel. To calculate the values of v'_j, threads are
mapped to all elements of the kth column of L where i = k and j ≥ i. The values of v'_j are
calculated such that v'_j = v_j + l_ji v_i where i < j ≤ n, and v'_i = l_ii v_i where i = k and j = k.
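A simplified, real-valued sketch of this column update is shown below; the actual kernel in Appendix A operates on the packed complex COO elements and loops over all columns internally, so this is a per-column illustration only.

// Simplified sketch of the column-k update of the L kernel: one thread per
// element of column k on or below the diagonal. The diagonal result is
// written to a separate output vector, as in the full kernel, so that v[k]
// is still readable by the other threads of the block.
__global__ void l_column_step(const double *L, double *v, double *out, int n, int k)
{
    int j = k + threadIdx.x;                 // row handled by this thread
    if (j >= n) return;
    if (j == k)
        out[k] = L[k * n + k] * v[k];        // v'_i = l_ii * v_i on the diagonal
    else
        v[j] += L[j * n + k] * v[k];         // v'_j = v_j + l_ji * v_i below it
}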

U kernel. The data dependencies for processing UV require that row k be processed
before row k − 1. Furthermore, for row k the sum of u_ij v_j must be computed, where
i = k and j > k. The products u_ij v_j can be computed in parallel. Using strategies
from [26, 27], I implement a parallel reduction function to compute the sum of u_ij v_j
efficiently in parallel.

Parallel reduction. The parallel reduction code uses N/2 active threads, where N is
the number of threads in the block, to compute the sum of u_ij v_j. That is, thread 0 and
thread N − 1 compute their sum, thread 1 and thread N − 2 compute their sum, thread n
and thread N − (n + 1) compute their sum, and so on. Each iteration of the for loop cuts
the number of active threads in half while storing the partial sums in shared memory
mapped to the lower half of the active threads. Normally, a parallel reduction on the GPU
makes use of multiple thread blocks to sum very large arrays in parallel. In this case,
however, there is only one block available, as the reduction code is implemented inside
of the U kernel code. However, the array is not large and never exceeds the number of
threads in the thread block. My method is as efficient as a many-block implementation,
as it is only necessary to sum an array of size N . Furthermore, there is no need to sum
the partial sums for each thread block at the end. To further speed up this reduction,
the product of uij vj is stored in shared memory, and thus there is no need to load shared
memory from global for the reduction.
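The core of that reduction, shown here as a standalone real-valued kernel for clarity (the in-kernel version in Appendix A operates on the complex partial products held in shared memory):

#define NTHREADS 1024   // number of threads in the single block

// Single-block parallel reduction: each iteration halves the number of active
// threads, adding the upper half of the shared array into the lower half.
__global__ void reduce_block(const double *in, double *out)
{
    __shared__ double sdata[NTHREADS];
    int t = threadIdx.x;
    sdata[t] = in[t];                        // one partial product per thread
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            sdata[t] = sdata[t] + sdata[t + s];
        __syncthreads();                     // make the new partial sums visible
    }
    if (t == 0)
        *out = sdata[0];                     // thread 0 holds the final sum
}

It would be launched as reduce_block<<<1, NTHREADS>>>(d_in, d_out), matching the single-block configuration used by the solver.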
Number of Buses    L kernel, ms    U kernel with reduction, ms
500                 0.99            2.46
1000                1.98            4.94
2000                4.00            9.91
3000                6.01           14.87
4000                8.13           19.85
5000               10.31           24.86
6000               12.36           29.83
7000               14.75           34.89
8000               16.84           39.88
9000               19.24           44.95

Table 1: Average computation times for L and U when average neighbor count is 2.67.

Table 1 displays the timings for the L kernel and the U kernel with the reduction code.
The U kernel time is generally double that of the L kernel, as the U timing contains the
code to perform a parallel reduction.

CHAPTER 3

Experiment Setup

Hardware. The test environment I used to evaluate the performance of the GPU versus
the CPU is a PC with an Intel i7 processor running at 2.8 GHz, 8 GB of DDR3 RAM
and an NVIDIA GTX 570. The GTX 570 has an overclocked processor clock running at
1.9 GHz and a memory clock running at 2.3 GHz. The GTX 570 has 480 CUDA cores
and a 320-bit memory interface with 1.28 GB of GDDR5 DRAM. It is also important to
note that the GTX 570 is not a dedicated GPU, and has to deal with both CUDA and
GUI programs.

Software. The host OS is Windows 7 64-bit Enterprise Edition. I wrote the CPU
algorithm using C# and compiled it with Visual Studio 2008, version 9.0, with .NET
framework 3.5 SP1. The CPU algorithm is not optimized and is single threaded. I wrote
the GPU algorithm using the CUDA toolkit v3.2 in C and compiled it using NVIDIA's
nvcc compiler from within Visual Studio 2008.

Data generation. I selected the following system sizes for the UPS and WECC systems:
500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000 and 9000. For the NYISO system,
the system size was limited to 6000, as system sizes of 7000 and greater contained more
than 1,000,000 non-zero elements. For each system size, I generated 30 matrices. See
Tables 2 - 4 and Figures 8 - 10 for the average numbers of non-zero elements per sample
size and their densities. These matrices were generated offline before the performance
testing began. The confidence intervals shown in the graphs were calculated using a
t-distribution with the confidence level set to 99%.
The fill in appears large; however, Table 2 shows that the sparsity still remains.
Number of Buses    Original            Factorized              Fill in     Density, %
500                1030.6 ± 0.99       1399.6 ± 136.41          369         0.56
1000               2078.6 ± 2.18       3468 ± 417.00           1389.4       0.35
2000               3919.06 ± 0.99      7807.2 ± 992.79         3888.14      0.20
3000               6254 ± 1.94         13981 ± 3669.27         7727         0.16
4000               8297.2 ± 1.44       23738.4 ± 4558.57       15441.2      0.15
5000               10386 ± 2.56        35419.4 ± 5719.57       25033.4      0.14
6000               12432 ± 2.17        41171.4 ± 9113.15       28739.4      0.11
7000               14457 ± 2.00        53392.6 ± 13778.59      38935.6      0.11
8000               16572.4 ± 2.71      76363.8 ± 19466.18      59791.4      0.12
9000               18609.2 ± 1.99      89030.2 ± 18239.76      70421        0.11

Table 2: Average number of elements in the admittance matrix for UPS.

[Plot: Average Admittance Matrix Size for Siberian UPS — original and factorized matrix size versus number of buses.]

Figure 8: Average admittance matrix size with λ = 1.7.

Number of Buses    Original            Factorized               Fill in     Density, %
500                1826.47 ± 1.09      5166.53 ± 319.99          3340        2.07
1000               3688.27 ± 1.59      13132.33 ± 1003.55        9444        1.31
2000               7429.20 ± 1.85      38748.10 ± 2741.56        31319       0.97
3000               11106.53 ± 1.49     65304.10 ± 6803.51        54198       0.73
4000               14660.80 ± 0.91     113136.27 ± 10796.83      98475       0.71
5000               18343.87 ± 1.91     164057.80 ± 12944.24      145714      0.66
6000               22079.67 ± 1.95     196493.67 ± 16406.81      174414      0.55
7000               25667.00 ± 1.96     283087.80 ± 23595.03      257421      0.58
8000               29206.73 ± 1.35     321423.67 ± 25630.34      292217      0.50
9000               32976.33 ± 1.30     400489.40 ± 37933.98      367513      0.49

Table 3: Average number of elements in the admittance matrix for WECC.

[Plot: Average Admittance Matrix Size for WECC — original and factorized matrix size versus number of buses.]

Figure 9: Average admittance matrix size with λ = 2.67.

Number of Buses    Original           Factorized                Fill in     Density, %
500                2863.6 ± 3.22      15085.4 ± 1895.54          12222       6.03
1000               5760 ± 5.52        39518.2 ± 5933.89          33758       3.95
2000               11489.8 ± 7.52     39518.2 ± 16387.82         91739       2.58
3000               17163.4 ± 6.07     103228.4 ± 19112.85        161712      1.99
4000               22899.6 ± 6.74     365619 ± 48518.50          342719      2.29
5000               28594.2 ± 5.70     498994.2 ± 116629.32       470400      2.00
6000               34166.4 ± 4.76     609192.4 ± 145238.24       575026      1.69

Table 4: Average number of elements in the admittance matrix for NYISO.

[Plot: Average Admittance Matrix Size for NYISO — original and factorized matrix size versus number of buses.]

Figure 10: Average admittance matrix size with λ = 4.78.

For the reader's consideration, I have included Tables 5 - 7, which show the average
data transfer times for the admittance matrices to and from the GPU.

Number of Buses    To Card, ms    From Card, ms
500                 0.22           0.13
1000                0.28           0.11
2000                0.49           0.13
3000                0.59           0.14
4000                0.87           0.17
5000                1.21           0.18
6000                1.42           0.19
7000                1.81           0.21
8000                2.29           0.22
9000                2.54           0.24

Table 5: Average memory transfer times for λ = 1.7.

Number of Buses    To Card, ms    From Card, ms
500                 0.29           0.09
1000                0.51           0.10
2000                1.16           0.12
3000                1.84           0.15
4000                2.85           0.15
5000                3.88           0.16
6000                4.63           0.19
7000                6.19           0.20
8000                7.04           0.21
9000                8.77           0.24

Table 6: Average memory transfer times for λ = 2.67.

Number of Buses    To Card, ms    From Card, ms
500                 0.55           0.10
1000                1.17           0.11
2000                2.60           0.13
3000                4.29           0.14
4000                7.90           0.16
5000               10.71           0.17
6000               12.98           0.24

Table 7: Average memory transfer times for λ = 4.78.

In order to assess the efficiency of my algorithm, I collected performance data using
the previously generated matrices. For each generated matrix I measured the time that it
takes to execute the algorithm on the CPU and the GPU. For each run of the programs,
a third program is run that checks the CPU and GPU output for correctness.
In the case of C#, the resolution of the default timer, stopwatch(), was not sufficiently high. In fact, I observed that the default C# timer is only accurate up to about
15 ms. Using methods in [28], I developed a high-precision timer, accurate to 1 ns.
CUDA comes with a high resolution event timer API [29]. The event timer creates
GPU events, records time stamps on them, and computes the elapsed time between two
events, which is returned as a floating-point value representing milliseconds.
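For reference, timing a GPU operation with this API follows the pattern used throughout Appendix A; the small wrapper below is my own illustration of that pattern, not part of the thesis code.

#include <cuda_runtime.h>

// Measure the duration of a GPU operation with CUDA events; returns milliseconds.
static float time_gpu_work(void (*launch)(void))
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);               // time stamp before the work
    launch();                                // e.g. a kernel launch wrapper
    cudaEventRecord(stop, 0);                // time stamp after the work
    cudaEventSynchronize(stop);              // wait for the stop event to complete
    cudaEventElapsedTime(&elapsed_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsed_ms;
}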

CHAPTER 4

Experiment Results and Analysis

Since my goal is to measure the performance of CUDA kernels, I do not include the
time spent transferring data between host and GPU in the average computation time
table. Although I developed a system to produce realistic power system topologies, it is
important to note that the uniformly distributed random values that were generated for
admittances and voltages may not be indicative enough to test the performance of my
algorithm. In the future I would like to compare the following results against realistic
values. Figures 11 - 13 show the average computation time in milliseconds for all sample
sizes for the CPU and the GPU. Since my sample sizes only contained 30 matrices, I have
included confidence intervals on the graphs. The confidence intervals were calculated
using a t-distribution with the confidence level set to 99%.

[Plot: Average Computation Time for Siberian Region of Russian UPS — GPU and CPU time in milliseconds versus number of buses.]

Figure 11: Average computation time for CPU versus GPU with λ = 1.7.

Number of Buses    GPU, ms          CPU, ms              Speedup
500                 3.34 ± 0.00       6.53 ± 0.18          1.96
1000                6.68 ± 0.01      22.33 ± 0.49          3.34
2000               13.35 ± 0.01      89.88 ± 4.11          6.73
3000               20.07 ± 0.06     204.88 ± 4.69         10.21
4000               26.82 ± 0.04     377.22 ± 4.15         14.06
5000               33.56 ± 0.05     597.67 ± 9.46         17.81
6000               40.25 ± 0.07    1010.93 ± 7.32         25.12
7000               47.07 ± 0.10    1301.54 ± 15.71        27.65
8000               53.89 ± 0.11    2099.35 ± 17.41        38.96
9000               60.66 ± 0.09    2327.66 ± 26.77        38.37

Table 8: Average computation times with λ = 1.7.

Table 8 has the average computation times for CPU versus GPU with the average
speedup for each sample size for the UPS system.

[Plot: Average Computation Time for WECC — GPU and CPU time in milliseconds versus number of buses.]

Figure 12: Average computation time for CPU versus GPU with λ = 2.67.

Number of Buses    GPU, ms          CPU, ms              Speedup
500                 3.45 ± 0.01       6.76 ± 0.07          1.96
1000                6.92 ± 0.01      23.08 ± 0.31          3.34
2000               13.91 ± 0.02      97.21 ± 1.04          6.99
3000               20.88 ± 0.06     224.19 ± 4.41         10.74
4000               27.99 ± 0.1      408.62 ± 3.47         14.60
5000               35.16 ± 0.13     643.11 ± 15.51        18.29
6000               42.19 ± 0.16    1048.77 ± 8.94         24.85
7000               49.65 ± 0.24    1364.36 ± 9.26         27.49
8000               56.72 ± 0.24    2105.36 ± 7.75         37.12
9000               64.19 ± 0.36    2385.62 ± 22.93        37.16

Table 9: Average computation times with λ = 2.67.

Table 9 has the average computation times for CPU versus GPU with the average
speedup for each sample size for WECC.

[Plot: Average Computation Time for NYISO — GPU and CPU time in milliseconds versus number of buses.]

Figure 13: Average computation time for CPU versus GPU with λ = 4.78.

Number of Buses    GPU, ms          CPU, ms              Speedup
500                 3.58 ± 0.02       6.97 ± 0.32          1.95
1000                7.21 ± 0.05      25.26 ± 0.73          3.50
2000               14.54 ± 0.15     100.84 ± 5.58          6.94
3000               21.98 ± 0.18     235.77 ± 13.66        10.73
4000               30.21 ± 0.49     425.92 ± 17.71        14.10
5000               38.13 ± 1.22     667.84 ± 16.16        17.51
6000               45.88 ± 1.58    1113.95 ± 30.43        24.86

Table 10: Average computation times with λ = 4.78.

Table 10 has the average computation times for CPU versus GPU with the average
speedup for each sample size for NYISO.
My results show that the use of a GPU for solving sparse linear equations with
bi-factorized matrices achieves up to a 38 times speedup over the traditional CPU implementation. This result highlights the practical use of a GPU for a high performance
sparse linear equation solver. My kernels identify and exploit parallelism in an otherwise
sequential algorithm.

CHAPTER 5

Future Work

Future work. I plan to further optimize the kernels with the goal of being able to
compute matrices with more than 1,000,000 non-zero elements. The next revision of the
code will have a strong focus on memory access optimization on the GPU. The current
implementation receives the admittance matrix pre-sorted by row; the algorithm then makes
a copy and sorts the matrix by column using the CPU. I plan to explore two options for
addressing these sorting issues. First, I plan to implement an efficient parallel sorter to
sort the matrix in place on the GPU. Second, I plan to explore a method where the matrix
does not require prior sorting. I also plan to optimize my parallel reduction code further,
such that only non-zero elements are processed. I plan to implement additional parts of
the powerflow computation where the GPU could be used to accelerate the computation
further. In order for this software to be adapted into industrial use, the sorting issues
will have to be resolved.


APPENDIX A

CUDA Solver Source Code


#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "cuda.h"
#define nTL 1024 // num threads for L kernel
#define nTU 1024 // num threads for U kernel

typedef struct
{
double x, y; // x = real, y = imaginary
int px, py; // px = position x, py = position y
} cuDoubleComplex;
void writeVector(cuDoubleComplex *v, FILE* f,int N);
void importMatrix(FILE *f, cuDoubleComplex *m, cuDoubleComplex *m2,
int *countersx, int *countersy);
void importVector(FILE *f, cuDoubleComplex *m);
int countVector(FILE *f);
int int_cmp(const void *A, const void *B);

// OPERATOR OVERLOADS
__device__ __forceinline__ cuDoubleComplex operator*(const cuDoubleComplex a,
const cuDoubleComplex b){
cuDoubleComplex result;
result.x = (a.x * b.x) - (a.y * b.y);
result.y = (a.y * b.x) + (a.x * b.y);
return result;
}
__device__ __forceinline__ cuDoubleComplex operator+(const cuDoubleComplex a,
const cuDoubleComplex b){
cuDoubleComplex result;
result.x = (a.x + b.x);
result.y = (a.y + b.y);
return result;
}
__device__ __forceinline__ cuDoubleComplex operator-(const cuDoubleComplex a,
const cuDoubleComplex b){
cuDoubleComplex result;
result.x = a.x - b.x;
result.y = a.y - b.y;
return result;
}
__device__ __forceinline__ cuDoubleComplex operator/(const cuDoubleComplex a,
const cuDoubleComplex b){
cuDoubleComplex result;
result.x = (((a.x * b.x) + (a.y * b.y)) / ((pow(b.x, 2)) + (pow(b.y, 2))));
result.y = (((a.y * b.x) - (a.x * b.y) ) / ((pow(b.x, 2)) + (pow(b.y, 2))));
return result;
}
// END OPERATOR OVERLOADING
// boolean value to determine if a block is the last block done or not
__shared__ bool isLastBlockDone;
__global__ void L(cuDoubleComplex *a, cuDoubleComplex* b,
cuDoubleComplex *c, int N, int ne, int *count,
unsigned int *index){
// variable to keep track of how many elements match the current col, num
__shared__ unsigned int placeholder;
if(threadIdx.x==0)
placeholder=0; // start at 0
__syncthreads();
for(int i = 0; i < N; i++){
int tid = placeholder + threadIdx.x;
// extract row, col coordinates from COO format
// and assign them to x and y for each thread
int x = a[tid].px; int y = a[tid].py;
if(x == i && y == i){
// hardcode special case where num==0
if(i==0){
// number on the diagonal for num=0 v_i = l_ii * v_i
c[i] = a[i] * b[i];
}else{
// number on the diagonal v_i = l_ii * v_i
c[i] = a[tid] * b[i];
}
}
__syncthreads();
if(x == i && y > i){
// v_j = v_j + l_ji * v_i
b[y] = b[y] + a[tid] * b[i];
}
// increment tid by placeholder
if(threadIdx.x==0) placeholder += count[i];
__syncthreads();
}
}

__global__ void U(cuDoubleComplex *a, cuDoubleComplex* b,
cuDoubleComplex *c, int N, int ne, int *count,
unsigned int *index){
// start kernel at end of matrix and work our way up
int tid = ne - threadIdx.x;
int t = threadIdx.x;
// variable to keep track of how many elements match the current col, num
__shared__ unsigned int placeholder;
if(threadIdx.x==0) placeholder = 0;
__syncthreads();
for(int num = N-1; num >=0; num--){
// decrement tid by number of elements minus placeholder
tid = (ne-placeholder) - t;
// extract row, col coordinates from COO format and
// assign them to x and y for each thread
int x = a[tid].px; int y = a[tid].py;
// setup shared memory array of size nTU for faster access
__shared__ cuDoubleComplex sdata[nTU];
// initialize shared memory to 0
sdata[t].x=0;sdata[t].y=0;
if(y == num && x > num){
// u_ji * v_j
sdata[t] = a[tid] * c[x];
}
__syncthreads();
// begin parallel reduction
// cut number of active threads in half for each iteration
for(unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (t < s){
// add lower threads to higher threads and store results in
// lower threads. sum(u_ji * v_j)
sdata[t] = sdata[t] + sdata[t + s];
}
__syncthreads();
}
// write results to global memory
if(t==0){
c[num] = c[num] + sdata[0];
// increment placeholder counter
placeholder += count[num];
}
__syncthreads();
}

}
int main(int argc, char *argv[]){
/**************************************
call functions to:
import matrix and vector
launch parallel kernels
record results in a file
**************************************/
// variables for reading powerflow data based on command line arguments
char Ystring[50], Vin[50];
sprintf(Ystring, "PGen/Y_%d_%d.csv", atoi(argv[1]), atoi(argv[2]));
sprintf(Vin, "PGen/V_%d_%d.csv", atoi(argv[1]), atoi(argv[2]));
FILE *cout = fopen("IO/cudaB.csv", "w");
FILE *cout2 = fopen("IO/cudaB2.csv", "w");
FILE *results = fopen("IO/results.csv", "a");
FILE *Y = fopen(Ystring,"r");
FILE *V = fopen(Vin,"r");
// ne = number of elements, ctr for keeping track of kernel calls
int ne;
// size of matrix N*N
const int N = atoi(argv[1]);
// cuda event timers - two timers per event, start and stop
cudaEvent_t lCompute, lComputeS, d2gMem, d2gMemS, g2dMem, g2dMemS,
uCompute, uComputeS;
// cuda event timer totals
float lComputeT, d2gMemT, g2dMemT, uComputeT;
// count number of elements in the packed matrix
ne = countVector(Y);
// close file to reset the fh
fclose(Y);
// reopen the matrix file and actually import it
FILE *Yin = fopen(Ystring,"r");
// setup matrix A and allocate host mem
unsigned int size_a = ne;
unsigned int mem_size_a = sizeof(cuDoubleComplex) * size_a;
// setup arrays B, C and allocate host mem
unsigned int size_b = N;
unsigned int mem_size_b = sizeof(cuDoubleComplex) * size_b;
unsigned int size_c = N;
unsigned int mem_size_c = sizeof(cuDoubleComplex) * size_c;
// allocate memory for matrix and arrays
// col sorted matrix

cuDoubleComplex* a = (cuDoubleComplex*)malloc(mem_size_a);
// row sorted matrix
cuDoubleComplex* a1= (cuDoubleComplex*)malloc(mem_size_a);
// dense voltage vector
cuDoubleComplex* b = (cuDoubleComplex*)malloc(mem_size_b);
// results vector
cuDoubleComplex* c = (cuDoubleComplex*)malloc(mem_size_c);
// array to store num elements for each column
int *countersx = (int*)malloc(sizeof(int)*N);
// array to store num elements for each row
int *countersy = (int*)malloc(sizeof(int)*N);
// initialize elements to 0
memset(countersx, 0, sizeof(int)*N);
memset(countersy, 0, sizeof(int)*N);
// debugging variable to keep track of num elements processed
unsigned int *h_index = (unsigned int*)malloc(
sizeof(unsigned int));
// import the randomly generated admittance matrix
importMatrix(Yin, a, a1, countersx, countersy);
// import the randomly generated voltage vector
importVector(V, b);
// close file handlers
fclose(Yin);
fclose(V);

// sort matrix in row format
qsort((void *)a, ne, sizeof(cuDoubleComplex), int_cmp);
// allocate memory on GPU
// admittance matrices sorted by col and row
cuDoubleComplex* d_a;
cudaMalloc((void**) &d_a, mem_size_a);
cuDoubleComplex* d_a1;
cudaMalloc((void**) &d_a1, mem_size_a);
// voltage vector
cuDoubleComplex* d_b;
cudaMalloc((void**) &d_b, mem_size_b);
// results vector
cuDoubleComplex* d_c;
cudaMalloc((void**) &d_c, mem_size_c);

// row and column counters
int* d_countx;
cudaMalloc((void**) &d_countx, sizeof(int)*N);
int* d_county;
cudaMalloc((void**) &d_county, sizeof(int)*N);
// debugging variable for tracking number of elements processed
unsigned int *d_index;
cudaMalloc((void**) &d_index, sizeof(unsigned int));
cudaMemset(d_index, 0, sizeof(unsigned int));
// start host to gpu mem xfer timer
cudaEventCreate(&d2gMem);
cudaEventCreate(&d2gMemS);
cudaEventRecord(d2gMem, 0);
// copy matrix a into d_a matrix
cudaMemcpy(d_a, a, mem_size_a, cudaMemcpyHostToDevice);
// copy matrix a1 into d_a1 matrix
cudaMemcpy(d_a1, a1, mem_size_a, cudaMemcpyHostToDevice);
// copy vector b into d_b vector
cudaMemcpy(d_b, b, mem_size_b, cudaMemcpyHostToDevice);
// copy counters to gpu
cudaMemcpy(d_countx, countersx, sizeof(int)*N,
cudaMemcpyHostToDevice);
cudaMemcpy(d_county, countersy, sizeof(int)*N,
cudaMemcpyHostToDevice);
// initialize c = 0
cudaMemset(d_c, 0, mem_size_c);
// stop host to gpu mem xfer timer
cudaEventRecord(d2gMemS, 0);
cudaEventSynchronize(d2gMemS);
cudaEventElapsedTime(&d2gMemT, d2gMem, d2gMemS);
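// note: the d2gMem/d2gMemS events are recorded in the same (default) stream as
// the copies above, so the measured interval covers the complete host-to-GPU
// transfer; cudaEventElapsedTime reports the elapsed time in milliseconds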
// start L computation timer
cudaEventCreate(&lCompute);
cudaEventCreate(&lComputeS);
cudaEventRecord(lCompute, 0);
// Call L kernel
L<<<1, nTL>>>(d_a, d_b, d_c, N, ne, d_countx, d_index);
// stop L computation timer
cudaEventRecord(lComputeS, 0);
cudaEventSynchronize(lComputeS);
cudaEventElapsedTime(&lComputeT, lCompute, lComputeS);
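// note: kernel launches are asynchronous, but cudaEventSynchronize() on the
// stop event waits until the L kernel has finished, so lComputeT covers the
// complete kernel execution time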

// start U computation timer
cudaEventCreate(&uCompute);
cudaEventCreate(&uComputeS);
cudaEventRecord(uCompute, 0);
// Call U kernel
U<<<1,nTU>>>(d_a1, d_b, d_c, N, ne, d_county, d_index);
// stop U computation timer
cudaEventRecord(uComputeS, 0);
cudaEventSynchronize(uComputeS);
cudaEventElapsedTime(&uComputeT, uCompute, uComputeS);
// start GPU to host mem xfer timer
cudaEventCreate(&g2dMem);
cudaEventCreate(&g2dMemS);
cudaEventRecord(g2dMem, 0);
// transfer vector C from GPU to host
cudaMemcpy(c , d_c, mem_size_c, cudaMemcpyDeviceToHost);
// stop GPU to host mem xfer timer
cudaEventRecord(g2dMemS, 0);
cudaEventSynchronize(g2dMemS);
cudaEventElapsedTime(&g2dMemT, g2dMem, g2dMemS);
// write the results
writeVector(c, cout, N);
// write performance results
fprintf(results,"%f,", d2gMemT);
fprintf(results,"%f,", lComputeT);
fprintf(results,"%f,", uComputeT);
fprintf(results,"%f,", g2dMemT);
fprintf(results,"%f,", lComputeT+uComputeT);
fprintf(results,"%d\n", N);
printf("%f\t%f\t%f\n", lComputeT, uComputeT, lComputeT+uComputeT);
// write zeros into memory
cudaMemset(d_a, 0, mem_size_a);
cudaMemset(d_a1, 0, mem_size_a);
cudaMemset(d_b, 0, mem_size_b);
cudaMemset(d_c, 0, mem_size_c);
cudaMemset(d_countx, 0, sizeof(int)*N);
cudaMemset(d_county, 0, sizeof(int)*N);
// free gpu mem
cudaFree(d_a);
cudaFree(d_a1);
cudaFree(d_b);
cudaFree(d_c);
cudaFree(d_countx);
cudaFree(d_county);
cudaFree(d_index);

// free host
free(a);
free(a1);
free(b);
free(c);
free(countersx);
free(countersy);
free(h_index);
// close file handlers
fclose(cout);
fclose(cout2);
fclose(results);
return 0;
}
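/* Note: the CUDA API calls in main() are not error checked. A minimal,
   illustrative sketch of a check (not part of the timed solver) would be:
   cudaError_t err = cudaGetLastError();
   if(err != cudaSuccess) fprintf(stderr, "CUDA error: %s\n",
   cudaGetErrorString(err));
*/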
void importVector(FILE *f, cuDoubleComplex *m){
/*******************************************
import vector from disk
*******************************************/
int i;
double vx, vy;
i=0;
while(feof(f)==0){
fscanf(f, "%lf,%lf\n",&vx,&vy);
m[i].x = vx;
m[i].y = vy;
i++;
}
}
void importMatrix(FILE *f, cuDoubleComplex *m, cuDoubleComplex *m1,
int *countersx, int *countersy){
/*******************************************
import admittance matrix and create two
copies; one copy is re-sorted with qsort()
later in main()
*******************************************/
int i,x,y;
double vx, vy;
i=0;
while(feof(f)==0){
fscanf(f, "%lf,%lf,%d,%d\n",&vx,&vy,&x,&y);
m[i].x = vx; m1[i].x = vx;
m[i].y = vy; m1[i].y = vy;
m[i].px = x; m1[i].px = x;
m[i].py = y; m1[i].py = y;
countersx[x]++;
countersy[y]++;
i++;
}
}
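/* Note: the fscanf loops in importVector(), importMatrix(), and countVector()
   are driven by feof(), so if the input file ends with a newline the final
   fscanf() can fail and the last record may be counted or read twice. A common
   alternative (a sketch, not used here) is to loop on the fscanf() return
   value, e.g. for the matrix format:
   while(fscanf(f, "%lf,%lf,%d,%d\n", &vx, &vy, &x, &y) == 4){ ... }
*/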
void writeVector(cuDoubleComplex *v, FILE* f, int N){
/*******************************************

write results vector to disk as string
*******************************************/
for(int i=0; i<N;i++){
char x[100], y[100];
sprintf(x, "%.10lf", v[i].x);
sprintf(y, "%.10lf", v[i].y);
fprintf(f,"[%s,%s]\n", x, y);
}
}
int countVector(FILE *f){
/*******************************************
count number of elements in vector or
admittance matrix
*******************************************/
int i,x,y;
double vx, vy;
i=0;
while(feof(f)==0){
fscanf(f, "%lf,%lf,%d,%d\n",&vx,&vy,&x,&y);
i++;
}
return i;
}
// qsort comparator: order packed matrix elements by their px coordinate
int int_cmp(const void *A, const void *B)
{
return ((cuDoubleComplex*)A)->px - ((cuDoubleComplex*)B)->px;
}
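/* A possible way to build and run the listing above (the source file name and
   GPU architecture flag are assumptions; the PGen/ and IO/ paths follow the
   fopen() calls in main):
   nvcc -O2 -arch=sm_20 solver.cu -o solver
   ./solver 100 1   reads PGen/Y_100_1.csv and PGen/V_100_1.csv and appends
   timing results to IO/results.csv
*/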

