Abstract—The sparse matrix solver based on LU factorization is a serious bottleneck in Simulation Program with Integrated Circuit Emphasis (SPICE)-based circuit simulators. State-of-the-art graphics processing units (GPUs) have numerous cores sharing the same memory, offer attractive memory bandwidth and compute capability, and support massive thread-level parallelism, so GPUs can potentially accelerate the sparse solver in circuit simulators. In this paper, an efficient GPU-based sparse solver for circuit problems is proposed. We develop a hybrid parallel LU factorization approach combining task-level and data-level parallelism on GPUs. Work partitioning, the number of active thread groups, and memory access patterns are optimized based on the GPU architecture. Experiments show that the proposed LU factorization approach on NVIDIA GTX580 attains an average speedup of 7.02× (geometric mean) compared with sequential PARDISO, and 1.55× compared with 16-threaded PARDISO. We also investigate the bottlenecks of the proposed approach with a parametric performance model. The performance of sparse LU factorization on GPUs is constrained by the global memory bandwidth, so it can be further improved by future GPUs with larger memory bandwidth.
Index Terms—Graphics processing unit, parallel sparse LU factorization, circuit simulation, performance model
[Table 1: Summary of Existing GPU-Based Direct Sparse Solvers]
[Table 2: Specifications of Computing Platforms]
[Table 3: Performance of LU Factorization on GPUs, and the Speedups over KLU (re-factorization) and PARDISO]
optimization), KLU [16] (compiled with -O3 optimization), and PARDISO [21] from the Intel Math Kernel Library (MKL). We show the main results here; some additional results are given in Appendix B, available online.

5.2 Results of LU Factorization on GPUs

5.2.1 Performance and Speedups

Table 3 shows the results of the proposed LU factorization approach on NVIDIA GTX580 and NVIDIA K20x, and the speedups compared with KLU and PARDISO, which run on the host CPU. All the benchmarks are sorted by the average number of floating-point operations (flop) per nonzero, flop/NNZ(L+U−I). The time listed in Table 3 covers only numeric re-factorization, excluding pre-processing and right-hand-side solving. Some data have to be transferred between the host memory and the device memory in every re-factorization (see Fig. 4); the time for these transfers is included in the GPU re-factorization time.

Our GPU-based LU re-factorization outperforms KLU re-factorization for most of the benchmarks. The speedups tend to be higher for matrices with bigger flop/NNZ(L+U−I), which indicates that GPUs are more suitable for problems that are more computationally demanding. The average speedup is 24.24× achieved by GTX580 and 19.86× achieved by K20x. Since the speedups differ greatly across matrices, the geometric means of the speedups are also listed in Table 3.

Compared with sequential PARDISO, the GPU-based LU re-factorization is also faster for most of the benchmarks. Here the speedups tend to be higher for matrices with smaller flop/NNZ(L+U−I): PARDISO utilizes BLAS to compute dense subblocks, so it performs better for matrices with bigger flop/NNZ(L+U−I). The average speedup of the GPU-based LU re-factorization is 18.77× achieved by GTX580 and 15.72× achieved by K20x; the geometric mean of the speedups is 7.02× achieved by GTX580.

Compared with 16-threaded PARDISO, the GPU-based LU re-factorization is faster for about half of the benchmarks and slower for the other half. The average speedup is 1.55× (geometric mean) achieved by GTX580 and 1.21× achieved by K20x. Fig. 6a shows the average speedup of the GPU-based LU re-factorization (implemented on GTX580) compared with PARDISO using different numbers of threads. On average, the GPU-based LU re-factorization approach outperforms multi-threaded PARDISO.

Fig. 6b shows the average speedup of the GPU-based LU re-factorization (implemented on GTX580) compared with NICSLU re-factorization. Since the speedups differ little across matrices in these results, only arithmetic means are presented. Fig. 6b indicates that the performance of the GPU-based LU re-factorization is similar to that of 10-threaded NICSLU; when NICSLU uses more than 10 threads, it outperforms the GPU implementation. The average speedup of the GPU-based LU re-factorization compared with 16-threaded NICSLU is 0.78×.

[Fig. 6: Average speedups. The GPU results are obtained on GTX580.]
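Because the per-matrix speedups vary widely, the arithmetic mean can be dominated by a few large values, while the geometric mean weights all matrices equally; that is why both aggregates appear above. A minimal host-side sketch of the two aggregates (the speedup values below are hypothetical placeholders, not the measured data):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Aggregate per-matrix speedups two ways. The geometric mean damps the
// influence of a few very large speedups, which is why Table 3 reports it
// alongside the arithmetic mean.
int main() {
    std::vector<double> speedup = {1.2, 3.5, 24.0, 110.0};  // hypothetical values
    double sum = 0.0, log_sum = 0.0;
    for (double s : speedup) {
        sum += s;
        log_sum += std::log(s);
    }
    double arith = sum / speedup.size();
    double geo   = std::exp(log_sum / speedup.size());
    std::printf("arithmetic mean = %.2f, geometric mean = %.2f\n", arith, geo);
    return 0;
}
```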
[Fig. 7: Achieved GPU performance, using different numbers of resident warps per SM.]
[Fig. 8: Predicted LU factorization time by the CWP-MWP model [37], for ASIC_100ks, obtained on GTX580.]
Although the theoretical peak performance of K20x is about 6× that of GTX580 (for double precision), the results in Table 3 indicate that the achieved performance on K20x and GTX580 is similar. The main reason is that the performance of sparse LU factorization on GPUs is bandwidth-bound; we will explain this point in Section 6. The effective global memory bandwidth of K20x (with ECC on) is very close to that of GTX580, so the higher compute capability of K20x cannot be exploited.

5.2.2 Bottleneck Analysis

The achieved average bandwidth over all benchmarks is 53.81, 51.05, and 92.75 GB/s on NVIDIA GTX580, NVIDIA K20x, and Intel E5-2690 (implemented by 16-threaded NICSLU), respectively. For CPU execution, some memory accesses are served by caches, so the measured bandwidth may greatly exceed the theoretical peak of the main memory. The factors that restrict performance differ across platforms. For CPUs, large caches and a high clock frequency greatly improve computational efficiency. However, for GTX580 and K20x, disabling the L1 cache results in less than 5 percent performance loss. It is difficult to evaluate the effect of the GPU's L2 cache, since there is no way to disable the L2 cache of current NVIDIA GPUs programmed using CUDA. Since the GPU's L2 cache is much smaller than a modern CPU's last-level cache (LLC) and many more cores share the LLC on a GPU, the LLC is expected to have less effect on GPUs than on CPUs. The main bottleneck of the proposed LU re-factorization approach on GPUs is the peak global memory bandwidth.

We have found that processing too many columns concurrently may decrease performance on GPUs. Fig. 7 shows the giga floating-point operations per second (Gflop/s) for three matrices on GTX580 and K20x, using different numbers of resident warps per SM. The best performance is attained with about 24 (GTX580) / 40 (K20x) resident warps per SM, rather than with the maximum number of resident warps. Although the reported bandwidth in Table 3 does not reach the peak bandwidth of the GPU, Fig. 7 indicates that the bandwidth is already fully utilized, i.e., the actual bandwidth is saturated, at 24 (GTX580) / 40 (K20x) resident warps per SM. With more resident warps, performance is constrained by the bandwidth and actually decreases. To explain this phenomenon in detail, we build a parametric performance model in Section 6.

6 PERFORMANCE EVALUATION OF LU FACTORIZATION ON GPUS

In this section, a parametric performance model is built to explain the phenomenon shown in Fig. 7 in detail. With this model, we can investigate the relation between the achieved GPU performance and GPU parameters such as the global memory bandwidth and the number of warps, which helps us better understand the bottlenecks of the solver and choose the optimal number of work threads on a given hardware platform. The proposed model is in fact universal and can also be applied to other GPU applications.

A CWP-MWP model (CWP and MWP are short for computation warp parallelism and memory warp parallelism) is proposed in [37]. In that model, a warp that is waiting for memory data is called a memory warp, and MWP represents the number of memory warps that can be handled concurrently on one SM. We attempted to use the CWP-MWP model to evaluate the performance on NVIDIA GTX580, but it produces incorrect results, as shown in Fig. 8. The reason is that MWP in the CWP-MWP model is very small (4-7 for different matrices), so when the number of resident warps per SM exceeds MWP, the predicted execution time is dominated by the global memory latencies, leading to unreasonable predictions.

In this paper, a better performance model is proposed, which improves on the CWP-MWP model in the following respects.

- The computation of MWP is changed: MWP is hardware-specific and directly constrained by the peak bandwidth of the global memory.
- The concepts of MWP_req and warp_exp, which have physical meanings, are introduced to illustrate the different cases of warp scheduling.
- The effect of the L2 cache is considered.
- The CWP-MWP model performs static analysis on the compiled code to predict performance. However, in the pipeline mode (see Appendix A.1, available online), the behavior is dynamic, so static analysis alone is insufficient.
[Table 4: Parameters Used in the Model, for GTX580]

MWP represents the maximum number of memory warps that can be served concurrently on one SM. It is hardware-specific and directly constrained by the peak bandwidth of the global memory:

    MWP = (peak_bandwidth × mem_cycle) / (#SM × freq × load_bytes_per_warp),    (2)

where peak_bandwidth, #SM, and freq are physical parameters of the GPU, and load_bytes_per_warp is the average data size accessed from the global memory in each memory period. Since a MAD operation involves one int read, two double reads, and one double write, the equivalent data size accessed from the global memory takes the cache into account; the cache miss rate is measured on the target GPU using a micro-benchmark (see Appendix C.1, available online, for details).

warp_exp is the number of resident warps required so that the computing periods of the other warps fully cover one memory period:

    warp_exp = mem_cycle / comp_cycle + 1.    (5)

6.1.2 Three Cases in Warp Scheduling

Based on the above parameters, there are three cases in the warp scheduling of GPUs, according to the number of resident warps per SM, as illustrated in Fig. 10.

Case 1 (Fig. 10a). If #warp ≥ warp_exp and MWP ≥ MWP_req, there are enough warps running to hide the global memory latencies, and the memory bandwidth is sufficient to support all the concurrent memory warps. In the time line, each memory period follows the corresponding computing period closely and no delay occurs; the total runtime is mainly determined by the computing periods. For a given warp that is repeated for #repeat iterations (#repeat = 3 in Fig. 10a), the number of total execution cycles is:
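To make Eqs. (2) and (5) concrete, the host-side sketch below evaluates them for a set of hypothetical GTX580-class parameters and applies the Case 1 test from Section 6.1.2. The numbers, struct, and the MWP_req assumption are illustrative, not the measured values of Table 4:

```cpp
#include <cstdio>

// Illustrative model parameters; the real values come from Table 4 and the
// micro-benchmarks in Appendix C.1.
struct ModelParams {
    double peak_bandwidth;       // global memory bandwidth, bytes/s
    double num_sm;               // #SM
    double freq;                 // core clock, Hz
    double load_bytes_per_warp;  // average bytes per memory period
    double mem_cycle;            // length of one memory period, cycles
    double comp_cycle;           // length of one computing period, cycles
};

// Eq. (2): memory warps the peak bandwidth can serve concurrently per SM.
double mwp(const ModelParams& p) {
    return p.peak_bandwidth * p.mem_cycle /
           (p.num_sm * p.freq * p.load_bytes_per_warp);
}

// Eq. (5): warps needed so that computation fully hides one memory period.
double warp_exp(const ModelParams& p) {
    return p.mem_cycle / p.comp_cycle + 1.0;
}

int main() {
    // Hypothetical GTX580-like numbers (16 SMs, ~1.5 GHz, ~192 GB/s).
    ModelParams p = {192e9, 16, 1.544e9, 220.0, 600.0, 40.0};
    double n_warp  = 24.0;    // resident warps per SM
    double mwp_req = n_warp;  // assumption: every resident warp issues
                              // memory requests concurrently
    if (n_warp >= warp_exp(p) && mwp(p) >= mwp_req)
        std::printf("Case 1: compute-limited, no memory stalls\n");
    else
        std::printf("Case 2/3: memory latency or bandwidth limits the warps\n");
    return 0;
}
```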
[Fig. 11: Predicted time and actual time (on GTX580), with different #warp.]
proposed approach and choose the optimal number of work threads on the given hardware platform.

The proposed LU factorization approach on NVIDIA GTX580 achieves on average a 7.02× speedup (geometric mean) compared with sequential PARDISO, and a 1.55× speedup compared with 16-threaded PARDISO. The performance of the proposed GPU solver is comparable to that of the existing GPU solvers listed in Table 1. The model-based analysis indicates that the performance of the GPU-based LU factorization is constrained by the global memory bandwidth; consequently, if the global memory bandwidth becomes larger, the performance can be further improved.

One limitation of the current approach is that it can only be implemented on single-GPU platforms. For multi-GPU platforms, blocked algorithms are required, and the data transfers between GPUs need to be carefully controlled. On the other hand, multiple GPUs can provide larger aggregate global memory bandwidth, so the performance is expected to be higher.
ACKNOWLEDGMENTS

This work was supported by the 973 project (2013CB329000), the National Science and Technology Major Project (2013ZX03003013-003), the National Natural Science Foundation of China (No. 61373026, No. 61261160501), and the Tsinghua University Initiative Scientific Research Program. This work was done when Ling Ren was at Tsinghua University. We would like to thank Prof. Haohuan Fu and Yingqiao Wang from the Center for Earth System Science, Tsinghua University, for lending us the NVIDIA Tesla K20x platform. We also thank the NVIDIA Corporation for donating GPUs.
REFERENCES

[1] L. W. Nagel, "SPICE 2: A computer program to simulate semiconductor circuits," Ph.D. dissertation, Univ. California, Berkeley, CA, USA, 1975.
[2] S. Rajamanickam, E. Boman, and M. Heroux, "ShyLU: A hybrid-hybrid solver for multicore platforms," in Proc. IEEE 26th Int. Parallel Distrib. Process. Symp., 2012, pp. 631–643.
[3] X. Chen, Y. Wang, and H. Yang, "An adaptive LU factorization algorithm for parallel circuit simulation," in Proc. 17th Asia and South Pacific Design Autom. Conf., Jan.–Feb. 2012, pp. 359–364.
[4] X. Chen, W. Wu, Y. Wang, H. Yu, and H. Yang, "An EScheduler-based data dependence analysis and task scheduling for parallel circuit simulation," IEEE Trans. Circuits Syst. II: Express Briefs, vol. 58, no. 10, pp. 702–706, Oct. 2011.
[5] X. Chen, Y. Wang, and H. Yang, "NICSLU: An adaptive sparse matrix solver for parallel circuit simulation," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 2, pp. 261–274, Feb. 2013.
[6] N. Kapre and A. DeHon, "Performance comparison of single-precision SPICE model-evaluation on FPGA, GPU, Cell, and multi-core processors," in Proc. Int. Conf. Field Programmable Logic Appl., Aug.–Sep. 2009, pp. 65–72.
[7] K. Gulati, J. Croix, S. Khatri, and R. Shastry, "Fast circuit simulation on graphics processing units," in Proc. Asia and South Pacific Design Autom. Conf., Jan. 2009, pp. 403–408.
[8] R. Poore, "GPU-accelerated time-domain circuit simulation," in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2009, pp. 629–632.
[9] J. W. H. Liu, "The multifrontal method for sparse matrix solution: Theory and practice," SIAM Rev., vol. 34, no. 1, pp. 82–109, 1992.
[10] M. Christen, O. Schenk, and H. Burkhart, "General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform," in Proc. First Workshop General Purpose Process. Graph. Process. Units, 2007, pp. 1–8.
[11] G. P. Krawezik and G. Poole, "Accelerating the ANSYS direct sparse solver with GPUs," in Proc. Symp. Appl. Accelerators High Perform. Comput., Jul. 2009, pp. 1–3.
[12] C. D. Yu, W. Wang, and D. Pierce, "A CPU-GPU hybrid approach for the unsymmetric multifrontal method," Parallel Comput., vol. 37, no. 12, pp. 759–770, Dec. 2011.
[13] T. George, V. Saxena, A. Gupta, A. Singh, and A. Choudhury, "Multifrontal factorization of sparse SPD matrices on GPUs," in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2011, pp. 372–383.
[14] R. F. Lucas, G. Wagenbreth, J. J. Tran, and D. M. Davis, "Multifrontal sparse matrix factorization on graphics processing units," Los Angeles, CA, USA, Tech. Rep. ISI-TR-677, 2012.
[15] R. F. Lucas, G. Wagenbreth, D. M. Davis, and R. Grimes, "Multifrontal computations on GPUs and their multi-core hosts," in Proc. 9th Int. Conf. High Perform. Comput. Comput. Sci., 2011, pp. 71–82.
[16] T. A. Davis and E. Palamadai Natarajan, "Algorithm 907: KLU, a direct sparse solver for circuit simulation problems," ACM Trans. Math. Softw., vol. 37, pp. 36:1–36:17, Sep. 2010.
[17] L. Ren, X. Chen, Y. Wang, C. Zhang, and H. Yang, "Sparse LU factorization for parallel circuit simulation on GPU," in Proc. 49th ACM/EDAC/IEEE Design Autom. Conf., Jun. 2012, pp. 1125–1130.
[18] J. W. Demmel, J. R. Gilbert, and X. S. Li, "An asynchronous parallel supernodal algorithm for sparse Gaussian elimination," SIAM J. Matrix Anal. Appl., vol. 20, no. 4, pp. 915–952, 1999.
[19] J. R. Gilbert and T. Peierls, "Sparse partial pivoting in time proportional to arithmetic operations," SIAM J. Sci. Statist. Comput., vol. 9, no. 5, pp. 862–874, 1988.
[20] J. J. Dongarra, J. D. Cruz, S. Hammarling, and I. S. Duff, "Algorithm 679: A set of level 3 basic linear algebra subprograms: Model implementation and test programs," ACM Trans. Math. Softw., vol. 16, no. 1, pp. 18–28, Mar. 1990.
[21] O. Schenk and K. Gärtner, "Solving unsymmetric sparse systems of linear equations with PARDISO," Future Generat. Comput. Syst., vol. 20, no. 3, pp. 475–487, Apr. 2004.
[22] G. Poole, Y.-C. Liu, and J. Mandel, "Advancing analysis capabilities in ANSYS through solver technology," Electron. Trans. Numer. Anal., Jan. 2003, pp. 106–121.
[23] T. A. Davis, "Algorithm 832: UMFPACK v4.3—An unsymmetric-pattern multifrontal method," ACM Trans. Math. Softw., vol. 30, no. 2, pp. 196–199, Jun. 2004.
[24] A. Gupta, M. Joshi, and V. Kumar, "WSMP: A high-performance shared- and distributed-memory parallel sparse linear solver," IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, Tech. Rep. RC 22038 (98932), Apr. 2001.
[25] V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2008, pp. 1–11.
[26] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, "Dense linear algebra solvers for multicore with GPU accelerators," in Proc. IEEE Int. Symp. Parallel Distrib. Process., Workshops and PhD Forum, Apr. 2010, pp. 1–8.
[27] S. Tomov, J. Dongarra, and M. Baboulin, "Towards dense linear algebra for hybrid GPU accelerated manycore systems," Parallel Comput., vol. 36, pp. 232–240, Jun. 2010.
[28] J. Kurzak, P. Luszczek, M. Faverge, and J. Dongarra, "LU factorization with partial pivoting for a multicore system with accelerators," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 8, pp. 1613–1621, Aug. 2013.
[29] The Univ. of Tennessee, Knoxville, TN, USA, "The MAGMA project," 2013. [Online]. Available: http://icl.cs.utk.edu/magma/index.html
[30] NVIDIA Corporation, "CUDA BLAS," 2012. [Online]. Available: http://docs.nvidia.com/cuda/cublas/
[31] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, "Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate," ACM Trans. Math. Softw., vol. 35, no. 3, pp. 22:1–22:14, Oct. 2008.
[32] NVIDIA Corporation, "NVIDIA CUDA C programming guide," 2012. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[33] I. S. Duff and J. Koster, "The design and use of algorithms for permuting large entries to the diagonal of sparse matrices," SIAM J. Matrix Anal. Appl., vol. 20, no. 4, pp. 889–901, Jul. 1999.
[34] I. S. Duff and J. Koster, "On algorithms for permuting large entries to the diagonal of a sparse matrix," SIAM J. Matrix Anal. Appl., vol. 22, no. 4, pp. 973–996, Jul. 2000.
[35] R. Bahar, E. Frohm, C. Gaona, G. Hachtel, E. Macii, A. Pardo, and F. Somenzi, "Algebraic decision diagrams and their applications," Formal Methods Syst. Design, vol. 10, no. 2/3, pp. 171–206, 1997.
[36] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.
[37] S. Hong and H. Kim, "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness," in Proc. 36th Annu. Int. Symp. Comput. Archit., 2009, pp. 152–163.
Xiaoming Chen (S'12) received the BS degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2009, where he is currently working toward the PhD degree. His current research interests include power- and reliability-aware circuit design methodologies, parallel circuit simulation, high-performance computing on GPUs, and GPU architecture. He was nominated for the Best Paper Award at ISLPED 2009 and ASPDAC 2012. He is a student member of the IEEE.

Ling Ren received the BS degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2012. He is currently working toward the PhD degree in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. His research interests include computer architecture, computer security, and parallel computing.

Yu Wang (S'05-M'07) received the BS degree and the PhD degree (with honor) from Tsinghua University, Beijing, China, in 2002 and 2007, respectively. He is currently an associate professor with the Department of Electronic Engineering, Tsinghua University. His current research interests include parallel circuit analysis, low-power and reliability-aware system design methodology, and application-specific hardware computing, especially for brain-related topics. He has authored and co-authored more than 90 papers in refereed journals and conference proceedings. He received the Best Paper Award at ISVLSI 2012. He was nominated five times for the Best Paper Award (ISLPED 2009, CODES 2009, twice at ASPDAC 2010, and ASPDAC 2012). He is the Technical Program Committee co-chair of ICFPT 2011, the publicity co-chair of ISLPED 2011, and the finance chair of ISLPED 2013. He is a member of the IEEE.

Huazhong Yang (M'97-SM'00) received the BS degree in microelectronics, and the MS and PhD degrees in electronic engineering from Tsinghua University, Beijing, China, in 1989, 1993, and 1998, respectively. In 1993, he joined the Department of Electronic Engineering, Tsinghua University, where he is currently a specially appointed professor of the Cheung Kong Scholars Program. He has authored and co-authored more than 200 technical papers and holds 70 granted patents. His current research interests include wireless sensor networks, data converters, parallel circuit simulation algorithms, nonvolatile processors, and energy-harvesting circuits. He is a senior member of the IEEE.
mainly about memory optimization. In this subsection, we discuss the data format for the intermediate vectors, and the sorting process that yields more coalesced access patterns to the global memory.

A.2.1 Format of Intermediate Vectors

We have two alternative data formats for the intermediate vectors (x in Algorithm 1): compressed sparse column (CSC) vectors or uncompressed vectors. CSC vectors save space and can be placed in the shared memory, while uncompressed vectors have to reside in the global memory. Uncompressed vectors are preferred in this problem for two reasons. First, the CSC format is inconvenient for indexed accesses: we have to use binary search, which is very time-consuming even in the shared memory. Second, using too much shared memory reduces the maximum number of resident warps, which results in severe performance degradation:

    resident_warps_per_SM ≤ size_of_shared_memory_per_SM / size_of_a_CSC_vector.

Our tests have proved that using uncompressed formats is several times faster than using compressed formats. A sketch of the preferred layout follows.
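The following CUDA sketch illustrates the uncompressed layout under stated assumptions: one thread block processes the nonzeros of one column, the working vector x is a full-length array in global memory, and the kernel and parameter names are hypothetical, not the paper's implementation.

```cpp
// Hypothetical illustration of A.2.1: an uncompressed working vector allows
// direct indexed access, x[row], instead of a binary search over a
// compressed (CSC) vector. One full-length working vector per concurrently
// processed column resides in global memory.
__global__ void scatter_updates(const int* __restrict__ row_idx,
                                const double* __restrict__ val,
                                int nnz,                 // nonzeros of one column
                                double* __restrict__ x)  // uncompressed vector
{
    // Consecutive threads handle consecutive nonzeros of the column.
    for (int i = threadIdx.x; i < nnz; i += blockDim.x)
        x[row_idx[i]] -= val[i];  // direct indexed access, no search
}
```

With a CSC vector, the same update would first have to locate the slot for row_idx[i] by binary search over the compressed index list; direct indexing removes that search from the inner loop.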
A.2.2 Improving Data Locality

Higher global memory bandwidth can be achieved on GPUs if memory access patterns are coalesced [2]. The nonzeros in L and U are out of order after pre-processing and the first factorization performed on the host CPU, resulting in highly random memory access patterns. We therefore sort the nonzeros in each column of L and U by their row indices to improve data locality. As shown in Fig. 2, after sorting, neighboring nonzeros in each column are more likely to be processed by consecutive threads.

In Fig. 3, we use the test benchmarks to show the effect of sorting. The achieved memory bandwidth is significantly increased, by an average of 2.1×. It is worth mentioning that CPU-based sparse LU factorization also benefits from sorting the nonzeros, but the performance gain is tiny (less than 5% on average). Sorting the nonzeros is done by transposing the CSC storages of L and U twice, as sketched below. The time cost of sorting is about 20% of the time cost of one LU re-factorization on GPUs, on average. Moreover, sorting is performed just once, so its cost is not a bottleneck and is negligible in circuit simulation.

[Fig. 3: Performance increases after sorting the nonzeros.]
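A host-side sketch of that double-transpose sort, assuming a square CSC matrix and hypothetical type and function names; each transpose is a linear-time counting pass, and the second pass leaves every column's row indices in ascending order:

```cpp
#include <vector>

// Sort the nonzeros of each column of a CSC matrix by row index, by
// transposing twice (CSC -> CSR-like -> CSC). In the paper this is applied
// to the CSC storages of both L and U, once, before re-factorization.
struct Csc {
    int n;                    // matrix dimension (square)
    std::vector<int> colptr;  // size n+1
    std::vector<int> rowidx;  // size nnz
    std::vector<double> val;  // size nnz
};

static Csc transpose(const Csc& a) {
    Csc t;
    t.n = a.n;
    t.colptr.assign(a.n + 1, 0);
    t.rowidx.resize(a.rowidx.size());
    t.val.resize(a.val.size());
    for (int r : a.rowidx) ++t.colptr[r + 1];  // count entries per row
    for (int i = 0; i < a.n; ++i) t.colptr[i + 1] += t.colptr[i];
    std::vector<int> next(t.colptr.begin(), t.colptr.end() - 1);
    for (int c = 0; c < a.n; ++c)
        for (int k = a.colptr[c]; k < a.colptr[c + 1]; ++k) {
            int dst = next[a.rowidx[k]]++;
            t.rowidx[dst] = c;  // column of A becomes row of A^T
            t.val[dst] = a.val[k];
        }
    return t;  // columns of the transpose come out sorted by row index
}

Csc sort_columns(const Csc& a) { return transpose(transpose(a)); }
```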
[Figure: "CPU solve + data transfer" with mem_trans=1 and mem_trans=2 curves, versus #resident warps per SM (4 to 32).]

[Fig. 6: Measured L2 cache miss rate, obtained on GTX580 using NVIDIA Visual Profiler.]

[Figure: predicted LU factorization time versus actual time (s), on a log-log scale.]

APPENDIX B
ADDITIONAL RESULTS

B.1 Relation Between the Achieved GPU Performance and Matrix Characteristics

We find that the achieved GPU performance strongly depends on the characteristics of the matrices, as shown in Fig. 4. The two subfigures are scatter plots in which each point represents a matrix; the results are obtained on GTX580. They show that the achieved GPU performance has an approximately logarithmic relation with both the relative fill-in, NNZ(L+U−I)/NNZ(A), and the average number of flop per nonzero, flop/NNZ(L+U−I). The fitted lines are also plotted in Fig. 4.
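The fitted lines in Fig. 4 correspond to a least-squares fit of performance against the logarithm of the matrix characteristic. A sketch with hypothetical sample points (not the measured data):

```cpp
#include <cmath>
#include <cstdio>

// Fit y = a * log(x) + b by closed-form least squares, mirroring the
// logarithmic relation between achieved Gflop/s (y) and flop-per-nonzero
// or relative fill-in (x) reported in Appendix B.1.
int main() {
    double x[] = {2.0, 8.0, 32.0, 128.0};  // hypothetical flop/NNZ(L+U-I)
    double y[] = {0.5, 1.4, 2.2, 3.1};     // hypothetical Gflop/s
    const int n = 4;
    double su = 0, sy = 0, suu = 0, suy = 0;
    for (int i = 0; i < n; ++i) {
        double u = std::log(x[i]);          // regress against log(x)
        su += u; sy += y[i]; suu += u * u; suy += u * y[i];
    }
    double a = (n * suy - su * sy) / (n * suu - su * su);
    double b = (sy - a * su) / n;
    std::printf("fit: y = %.3f * log(x) + %.3f\n", a, b);
    return 0;
}
```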
warp t’s task queue and the finish time of node k can be
determined. The earliest finish time of the whole DAG
is just the maximum finish time of all the nodes.
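A hedged sketch of that earliest-finish-time recurrence, assuming nodes have been statically assigned to warps' task queues and a precomputed topological order that is also consistent with each queue's order; the names are illustrative, not the paper's code:

```cpp
#include <algorithm>
#include <vector>

// Earliest-finish-time estimate for a dependency DAG whose nodes are
// assigned to warps' task queues. A node can start only after all its
// dependencies have finished and its warp is free; the DAG's earliest
// finish time is the maximum finish time over all nodes.
struct Node {
    std::vector<int> deps;  // nodes that must finish first
    int warp;               // warp whose queue executes this node
    double cost;            // predicted execution cycles of this node
};

double earliest_finish(const std::vector<Node>& dag,
                       const std::vector<int>& topo, int num_warps) {
    std::vector<double> finish(dag.size(), 0.0);
    std::vector<double> warp_free(num_warps, 0.0);  // when each warp is idle
    double total = 0.0;
    for (int k : topo) {
        double ready = warp_free[dag[k].warp];      // warp availability
        for (int d : dag[k].deps)
            ready = std::max(ready, finish[d]);     // dependency constraint
        finish[k] = ready + dag[k].cost;
        warp_free[dag[k].warp] = finish[k];
        total = std::max(total, finish[k]);         // finish time of the DAG
    }
    return total;
}
```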
REFERENCES

[1] S. Yan, G. Long, and Y. Zhang, "StreamScan: Fast scan algorithms for GPUs without global barrier synchronization," in Proc. 18th ACM SIGPLAN Symp. Principles Practice Parallel Program. (PPoPP '13), 2013, pp. 229–238.
[2] NVIDIA Corporation, "NVIDIA CUDA C programming guide." [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[3] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM J. Sci. Stat. Comput., vol. 7, no. 3, pp. 856–869, Jul. 1986.
[4] "CUSP." [Online]. Available: https://code.google.com/p/cusp-library/
[5] C.-J. Lin and J. J. Moré, "Incomplete Cholesky factorizations with limited memory," SIAM J. Sci. Comput., vol. 21, no. 1, pp. 24–45, 1999.
[6] X. Chen, W. Wu, Y. Wang, H. Yu, and H. Yang, "An EScheduler-based data dependence analysis and task scheduling for parallel circuit simulation," IEEE Trans. Circuits Syst. II: Express Briefs, vol. 58, no. 10, pp. 702–706, Oct. 2011.
[7] M. Naumov, "Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU," NVIDIA Tech. Rep. NVR-2011-001, Jun. 2011.
[8] NVIDIA Corporation, "cuSPARSE." [Online]. Available: https://developer.nvidia.com/cusparse
[9] X. Xiong and J. Wang, "Parallel forward and back substitution for efficient power grid simulation," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2012, pp. 660–663.