
Parallelization of a DEM Code Based on CPU-GPU

Heterogeneous Architecture

Xiaoqiang Yue1, Hao Zhang2, Congshu Luo1, Shi Shu3,*, and Chunsheng Feng1
1 School of Mathematics and Computational Science, Xiangtan University,
411105, Hunan, China
2 Heat and Mass Transfer Technological Center, Technical University of Catalonia,
08222, Terrassa, Spain
3 Hunan Key Laboratory for Computation and Simulation in Science and Engineering,
Key Laboratory of Intelligent Computing and Information Processing of
Ministry of Education, Xiangtan University, 411105, Hunan, China
siukyoo@163.com, tourzhang@gmail.com, luocs105@126.com,
{shushi,spring}@xtu.edu.cn

Abstract. Particulate flows are commonly encountered in both engineering and
environmental applications. The discrete element method (DEM) has attracted
considerable attention since it can predict the whole motion of a particulate flow
by monitoring every single particle. However, the computational capability of
the method depends strongly on the numerical scheme as well as on the hardware
environment. In this study, a parallelization of a DEM-based code called Trubal
was implemented. Numerical simulations were carried out to demonstrate the
benefits of this work: the final parallel code substantially accelerates Trubal. By
simulating 6,000 particles using an NVIDIA Tesla C2050 card together with an
Intel Core-Dual 2.93 GHz CPU, an average speedup of 4.69 in computational
time was obtained.

Keywords: Parallelization, Discrete element method, CPU-GPU heterogeneous
architecture.

1 Introduction
Particulate flows are commonly encountered in both engineering and environmental
applications. However, due to the stochastic nature of the particles, the fundamental
physical mechanisms in these systems are still not well understood, and researchers
have yet to formulate a general method for the reliable scale-up, design and control of
processes of different types [1]. Numerical simulation can assist in making decisions
on trial conditions; compared with actual experiments, it is a cheaper and more
convenient option. At present, numerical modeling has become a powerful tool for
developing new engineering facilities or optimizing existing ones. As a typical
Lagrangian method, the discrete element method (DEM) [2] has attracted considerable
attention from researchers in different fields, since it can predict the whole motion of a
particulate flow by monitoring every single particle.

* Corresponding author.


Trubal is a DEM-based software package originally developed by Prof. Cundall [2].
Dr. Thornton from Aston University later proposed several interaction laws [3-4] and
introduced them into Trubal so that more multi-physical factors could be considered.
Sheng used Trubal to investigate powder compaction processes [5-6]. Charley Wu and
his colleagues coupled Trubal with a computational fluid dynamics (CFD) solver to
simulate complex particle-fluid interaction problems [7-10]. It should be stressed that
the computational capability of DEM relies strongly on the numerical scheme as well
as on the hardware environment. The studies mentioned above were limited in the
number of particles they could handle because of the serial nature of the original
Trubal. Later, Trubal was parallelized by Kafui et al. from the University of
Birmingham using a single-program multiple-data strategy [11], and the new code was
successfully applied to three-dimensional (3D) applications [12]. Another
parallelization of Trubal is due to Washington et al., who ported the parallel Trubal to
the CM-5 architecture at the Pittsburgh Supercomputing Center [13]. Ghaboussi et al.
also parallelized Trubal, using neural networks on the Connection Machine (CM-2)
with 32,768 processors [14]. In addition, Maknickas et al. parallelized the DEMMAT
code to simulate visco-elastic frictional granular media using a spatial domain
decomposition strategy, obtaining a speedup of about 11 on 16 processors [15].
Darmana et al. parallelized an Euler-Lagrange model using mixed domain
decomposition and a mirror-domain technique; they applied their code to simulate
dispersed gas-liquid two-phase flow and obtained a maximum speedup of about 20 on
32 processors [16]. Meanwhile, widely used commercial packages such as EDEM [17]
and PFC3D [18] also provide partial parallel functionality to widen their applicability
to engineering problems. This survey of the literature shows that existing work relies
mainly on MPI/OpenMP programming environments to eliminate bottlenecks in
computational efficiency and memory capacity.
Graphics processing units (GPUs) have recently burst onto the scientific computing
scene as a technology yielding substantial improvements in performance and energy
efficiency. Some older supercomputers, such as JAGUAR (now known as TITAN) at
the Oak Ridge National Laboratory, have been redesigned to incorporate GPUs and
thereby achieve better performance. The CPU-GPU heterogeneous architecture has
become an important trend in the development of high-performance computers, with
GPUs successfully used in supercomputers [19]. A GPU is a symmetric multi-core
processor that is accessed and controlled by the CPU. An Intel/AMD CPU accelerated
by NVIDIA/AMD GPUs is probably the most commonly used heterogeneous
high-performance computing architecture at present. Under conditions often met in
scientific computing, modern GPUs surpass CPUs in computational power, data
throughput and computational efficiency per watt by almost one order of
magnitude [20]. Shigeto et al. proposed a new algorithm for multi-thread parallel
computation of DEM and reported that the calculation speed ratio of the GPU to the
CPU was up to 3.4 using single-precision floating-point numbers [21]; however, the
superiority of the GPU decreased as the number of simulated particles increased.
Recently, Li et al. developed a DEM-based software package named GDEM,
exploiting GPU technologies to accelerate the continuous-based DEM, and achieved
an average speedup of 650 with an NVIDIA GTX VGA card relative to an Intel
Core-Dual 2.66 GHz CPU [21-22]. Ye et al. proposed a spatial subdivision algorithm
that partitions space into a uniform grid and sorts the particles based on their hash
values; their results showed that the rendering speed of a large-scale granular flow
scene can still reach 37.9 frames per second when the number of particles reaches
200,000 [23]. Radeke et al. used the massively parallel compute unified device
architecture (CUDA) technique on GPUs to implement DEM algorithms, enabling
simulations of more than two million particles per gigabyte of memory [24].
Applications of GPU-based DEM are still limited but are becoming increasingly
popular, and simulations with engineering-scale numbers of particles are in especially
high demand. In this work we reconstructed Trubal on a CPU-GPU heterogeneous
architecture and attained an average speedup of 4.69 when simulating 6,000 particles
over 200,000 time steps starting from four representative moments.
The remainder of the paper is organized as follows. Section 2 gives a brief
introduction to the theoretical and numerical aspects of DEM. In Section 3, numerical
simulations are conducted and a comparison is made between Trubal and the new
solvers. Finally, the results are summarized and conclusions are drawn in Section 4.

2 Discrete Particle Modeling

2.1 Governing Equations


The basic idea behind DEM is to calculate the trajectory of every single element
based on Newton's second law. The collisions between the moving particles are
treated using interaction laws. The dynamic equations of a discrete element can be
symbolically expressed as

m\,a = m\,g + F_c, \qquad I\,\frac{\partial^2 \theta}{\partial t^2} = \tau_c    (1)
where m and I are respectively the mass and the moment of inertia of the particle, a is
the acceleration, θ is the angular displacement, g is the acceleration of gravity (if
considered), and F_c and τ_c are the contact force and moment generated by the direct
collisions. Among the most common interaction laws are the 'Hard' contact law, the
linear law and the nonlinear law, all of which are based on classical contact
mechanics. The calculation of the interaction force is not necessary in the 'Hard'
contact models. The latter two are called 'Soft' contact laws, in which a small overlap
is allowed to represent the physical deformation of the contacting bodies at the
interface. In Trubal, as adopted in the present work, the normal force-displacement
relationship of the particles is calculated based on the theory of Hertz [25].

For two particles with radii R_i, Young's moduli E_i and Poisson's ratios ν_i (i = 1, 2),
the normal force-displacement relationship reads
F_n = \frac{4}{3}\,E^* R^{*1/2} \delta_n^{3/2}    (2)
where

\frac{1}{E^*} = \frac{1-\nu_1^2}{E_1} + \frac{1-\nu_2^2}{E_2}    (3)
and
\frac{1}{R^*} = \frac{1}{R_1} + \frac{1}{R_2}    (4)
The incremental tangential force arising from an incremental tangential displacement
depends on the loading history as well as the normal force and is given by Mindlin
and Deresiewicz [26]

\Delta T = 8 G^* r_a \theta_k\,\Delta\delta_t + (-1)^k \mu\,\Delta F_n (1-\theta_k)    (5)

where

\frac{1}{G^*} = \frac{1-\nu_1^2}{G_1} + \frac{1-\nu_2^2}{G_2}    (6)
where r_a = (R^* δ_n)^{1/2} is the radius of the contact area, Δδ_t is the incremental
relative tangential surface displacement, μ is the coefficient of friction, and k and θ_k
change with the loading history.
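To make the normal contact law concrete, the following small C function evaluates the
Hertzian normal force of Eq. (2) from the effective modulus and radius of Eqs. (3)-(4).
It is a minimal sketch with illustrative names and is not taken from the Trubal source.

#include <math.h>

/* Hertzian normal force between two contacting spheres, Eqs. (2)-(4).
   Inputs: radii R1, R2, Young's moduli E1, E2, Poisson's ratios nu1, nu2,
   and the normal overlap delta_n (all in consistent SI units). */
double hertz_normal_force(double R1, double R2,
                          double E1, double E2,
                          double nu1, double nu2,
                          double delta_n)
{
    double E_star = 1.0 / ((1.0 - nu1 * nu1) / E1 + (1.0 - nu2 * nu2) / E2);  /* Eq. (3) */
    double R_star = 1.0 / (1.0 / R1 + 1.0 / R2);                              /* Eq. (4) */
    return (4.0 / 3.0) * E_star * sqrt(R_star) * pow(delta_n, 1.5);           /* Eq. (2) */
}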

2.2 Numerical Strategy: Two Simulators under CPU


Trubal, written in Fortran 77 and C, can be broadly divided into three parts: a setup
phase, a solve phase and a post-processing phase. The setup phase reads the
parameters, such as the number and material characteristics of the particles and walls,
the extent of the physical domain and the logical boxes dividing it, and generates what
the solve phase requires, such as the time step and the contact situations between
particles and walls in each box. The solve phase, the crucial component of the
numerical simulation, explicitly computes the linear and angular velocity, linear and
angular displacement, composite force and resultant moment of each particle and wall
at every moment. The post-processing phase saves the computed results to files and
displays them as images to assess the validity of the simulation. The entire calculation
flow of Trubal is shown in Fig. 1; a simplified sketch of one time step of the solve
phase is given after the figure.

Fig. 1. Flowchart of simulators under CPU
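As a rough illustration of the solve phase described above, the following C fragment
sketches the structure of one explicit DEM time step: forces are accumulated from the
contact laws, then Newton's second law, Eq. (1), is integrated for every particle. The
data layout and names are hypothetical and the contact-force routine is only indicated
by a comment; the real solve phase additionally updates rotations, handles walls and
maintains the box-based contact lists.

#include <stddef.h>

/* Simplified per-particle state (illustrative; not Trubal's actual layout). */
typedef struct {
    double x[3];   /* position                   */
    double v[3];   /* velocity                   */
    double f[3];   /* accumulated contact force  */
    double mass;
} Particle;

/* One explicit DEM time step of length dt under gravity g[3]. */
void dem_step(Particle *p, size_t n, const double g[3], double dt)
{
    /* 1. Reset the force accumulators. */
    for (size_t i = 0; i < n; ++i)
        for (int d = 0; d < 3; ++d)
            p[i].f[d] = 0.0;

    /* 2. Contact detection and force calculation would go here: for every
          pair found in the same or a neighbouring box, evaluate the
          Hertz/Mindlin laws of Section 2.1 and add the resulting forces
          to the accumulators of both particles.                          */

    /* 3. Explicit integration of Newton's second law, Eq. (1). */
    for (size_t i = 0; i < n; ++i) {
        for (int d = 0; d < 3; ++d) {
            double a = p[i].f[d] / p[i].mass + g[d];  /* acceleration    */
            p[i].v[d] += a * dt;                      /* update velocity */
            p[i].x[d] += p[i].v[d] * dt;              /* update position */
        }
    }
}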

The main storage structure in Trubal is a global static array in single precision. This
eases programming, data access and memory saving, but it is prone to interrupting a
simulation, because it is hard to predict how much memory will be sufficient over the
entire run. The usual workaround is to reserve the maximum storage available on the
machine in use. This simulator is denoted Trubal-org-s in what follows.
To avoid these shortcomings of the static storage structure, we allocate the required
local arrays dynamically according to the number of particles, walls or contacts, as
sketched below. This prevents wasted memory and enhances the computational
capability, i.e. we can simulate more particles than Trubal. In the following, this new
simulator is named Trubal-new-s.
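A minimal sketch of this demand-driven allocation, with hypothetical names rather
than the actual Trubal-new-s data structures, is:

#include <stdio.h>
#include <stdlib.h>

/* Allocate a per-particle work array once the particle count is known,
   instead of reserving one large static array up front. */
double *alloc_particle_array(size_t n_particles, size_t doubles_per_particle)
{
    double *a = malloc(n_particles * doubles_per_particle * sizeof *a);
    if (a == NULL) {
        fprintf(stderr, "out of memory for %zu particles\n", n_particles);
        exit(EXIT_FAILURE);
    }
    return a;
}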

2.3 GPU Computing and a Simulator under CPU-GPU


Based on the NVIDIA Fermi GPU architecture, shown in Fig. 2, Tesla C2050 GPUs
deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power
consumption compared with the latest quad-core CPUs [27]. The Tesla C2050 has 14
streaming multiprocessors (SMs), each with 32 streaming processors (SPs), or CUDA
cores. Each SM has 64 KB of on-chip memory, configurable between L1 cache and
shared memory (up to 48 KB of shared memory), and 32,768 registers. The SMs share
the L2 cache (768 KB), constant memory (64 KB), texture memory and global memory
(3.0 GB, or 2.625 GB with ECC on). The GPU memory bandwidth and peak
performance in single and double precision are both roughly ten times higher than
those of CPUs.

Fig. 2. NVIDIA Fermi GPU architecture

The CPU-GPU heterogeneous parallelism consists of two parts: the CPU is
responsible for computations unsuitable for data parallelism, such as complex logical
transactions, while the GPU is in charge of large-scale, computationally intensive
work. This is a significant advantage in exploiting the potential performance and cost
effectiveness of a computer, compensating for the CPU bottleneck by exploiting the
powerful processing capability and high memory bandwidth of the GPU.
Since CUDA does not hide the heterogeneous nature of the memory system,
maximum throughput is obtained by optimizing the use of the various types of
memory, thereby boosting overall performance: for example, using shared memory to
reduce access latency and texture memory for random accesses, with the aim of
making the best use of the GPU memory bandwidth. Fig. 3 shows the entire
calculation flow under the CPU-GPU heterogeneous architecture; an illustrative GPU
kernel of this kind is sketched after the figure. We refer to this simulator as
Trubal-new-p hereinafter.

Fig. 3. Flowchart of simulator under CPU-GPU
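To illustrate the kind of data-parallel work handed to the GPU in such a scheme, the
CUDA kernel below accumulates contact forces with one thread per particle, staging
particle data tile-by-tile in shared memory to reduce global-memory traffic. It is a
simplified all-pairs sketch using a linear repulsive spring and assumed data layouts; it
is not the actual Trubal-new-p kernel, which restricts the search to neighbouring boxes
and applies the Hertz-Mindlin laws of Section 2.1.

#include <cuda_runtime.h>

#define TILE 256   // must equal the block size used at launch

__global__ void contact_forces(const float4 *pos_rad,  // x, y, z, radius
                               float4 *force,          // fx, fy, fz, unused
                               float kn,               // normal stiffness
                               int n)                  // number of particles
{
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos_rad[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 f = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        // Stage one tile of particle data in shared memory.
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos_rad[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();

        for (int k = 0; k < TILE && base + k < n; ++k) {
            if (i >= n || base + k == i) continue;    // skip self and padding
            float4 pj = tile[k];
            float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
            float dist = sqrtf(dx * dx + dy * dy + dz * dz);
            float overlap = pi.w + pj.w - dist;       // > 0 means contact
            if (overlap > 0.f && dist > 0.f) {
                float s = kn * overlap / dist;        // simple linear spring
                f.x += s * dx; f.y += s * dy; f.z += s * dz;
            }
        }
        __syncthreads();
    }
    if (i < n) force[i] = make_float4(f.x, f.y, f.z, 0.f);
}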

3 Numerical Experiments

The die filling process, as sketched in Fig. 4, has wide applications ranging from
pharmacy and metallurgy to food processing. Charley Wu et al. have successfully
used Trubal to simulate the die filling process [7-10]. In this study, several numerical
experiments are carried out to demonstrate the efficiency of the three simulators
described above for a class of die filling problems, whose parameters are listed in
Table 1.
Table 2 lists the specifications of our testbed. For the sake of fairness, Trubal-org-s
and Trubal-new-s are compiled with the optimization options "-O2 -IPA", and
Trubal-new-p with "-arch=sm_20 -O2". Here, the specified "grid-block-thread"
hierarchy in CUDA is "1-13-512", as sketched below.
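For reference, that hierarchy corresponds to a launch configuration along the
following lines; the kernel shown is only a stand-in for the actual solve-phase kernels.

#include <cuda_runtime.h>

__global__ void solve_kernel(float *data, int n)      // placeholder for the real work
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per particle
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int n = 6000;                                // 13 * 512 = 6,656 threads cover 6,000 particles
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    dim3 grid(13), block(512);                         // the "1-13-512" grid-block-thread hierarchy
    solve_kernel<<<grid, block>>>(d_data, n);          // built with: nvcc -arch=sm_20 -O2
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}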

Fig. 4. Sketch of the die filling process

Table 1. Parameters of our simulation

Parameter                                                     Value
Number of light, heavy particles and walls                    3,000, 3,000, 11
Density of light, heavy particles (kg/m³)                     400, 7,800
Diameter of particles (m)                                     1.30×10⁻⁴
(Young's modulus (Pa), Poisson's ratio) of particles, walls   (8.70×10⁹, 0.30), (2.10×10¹¹, 0.29)
Friction coefficient of particle-particle, particle-wall      0.30, 0.30
Damping coefficient of mass, stiffness                        0.00, 1.59×10⁻²
Acceleration of gravity, downward (m/s²)                      9.80
Domain of die (m)                                             (0.002, 0.009)×(0.020, 0.027)
Domain of shoe (m)                                            (0.009, 0.021)×(0.029, 0.043)
Velocity of the shoe, leftward (m/s)                          0.07
Time step of the simulation (s)                               7.6283×10⁻⁹

Table 2. Specifications of our test platform

                 CPU                   GPU                Operating System
Model            Intel(R) CPU E7500    Tesla C2050        Linux,
Memory           2.0 GB (Host)         3.0 GB (Device)    Fedora 13-2.6.34
Compiler         gcc 4.4.5             nvcc 4.0
Cores            2 Cores               448 CUDA Cores
Clock rate       2.93 GHz              1.15 GHz

Fig. 5 shows the distributions of particles from Trubal-org-s, Trubal-new-s and
Trubal-new-p at 0.0382 s, 0.0763 s, 0.0915 s, 0.114 s, 0.130 s, 0.168 s, 0.244 s and
0.305 s, which verifies their validity, as the distributions all accord with the actual
physical process.
The comparison among Trubal-org-s, Trubal-new-s and Trubal-new-p for
simulations of 200,000 time steps starting from four representative moments is shown
in Table 3. These moments correspond to when the shoe arrives at the die (0.0382 s),
passes over the die (0.0763 s), starts to leave the die (0.130 s) and is about to move
clear of the die (0.305 s), respectively.

Fig. 5. Comparison of the distributions of particles (left, center and right are from
Trubal-org-s, Trubal-new-s and Trubal-new-p, respectively)

Table 3. Wall time and Ratio/Speedup of partial simulations of 6,000 particles


Moment      Trubal-org-s     Trubal-new-s              Trubal-new-p
            Wall time        Wall time      Ratio      Wall time     Speedup
0.0382 s    3,464.99 s       3,459.67 s     1.54‰      726.27 s      4.77
0.0763 s    2,878.35 s       2,871.09 s     2.53‰      612.83 s      4.70
0.1300 s    2,473.77 s       2,467.99 s     2.34‰      520.65 s      4.75
0.3050 s    3,333.51 s       3,316.78 s     5.04‰      731.40 s      4.56

As shown in Table 3, Trubal-new-s is slightly faster than Trubal-org-s, and
Trubal-new-p reaps speedups of 4.77, 4.70, 4.75 and 4.56 with respect to Trubal-org-s,
for an average speedup of 4.69, where Ratio = Trubal-org-s/Trubal-new-s − 1 and
Speedup = Trubal-org-s/Trubal-new-p.
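For instance, taking the first row of Table 3: Ratio = 3,464.99/3,459.67 − 1 ≈ 1.54‰
and Speedup = 3,464.99/726.27 ≈ 4.77.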

4 Conclusions
In this study, the parallelization of a serial discrete particle code, Trubal, was carried
out in two steps: (1) reconstruction of the static storage structure; and (2)
parallelization of the resulting code on a CPU-GPU heterogeneous architecture, using
shared memory free of bank conflicts and texture memory to make the best use of the
GPU memory bandwidth. Numerical simulations showed that the final parallel code
substantially accelerates Trubal. By simulating 6,000 particles using an NVIDIA Tesla
C2050 card together with an Intel Core-Dual 2.93 GHz CPU, an average speedup of
4.69 in computational time was obtained.

Acknowledgments. The authors are grateful to Prof. Yuanqiang Tan from Xiangtan
University for providing the Trubal code, which was essential for our understanding of
DEM. We would also like to thank Prof. Mingjun Li from Xiangtan University for his
helpful comments and suggestions on the reconstruction of the data structure. Yue and
Feng are partially supported by the National Natural Science Foundation of China
(11201398), the Specialized Research Fund for the Doctoral Program of Higher
Education (20124301110003) and the Scientific Research Fund of the Hunan Provincial
Education Department of China (12A138). Shu is partially supported by the National
Natural Science Foundation of China (91130002, 11171281).

References
1. Zhu, H.P., Zhou, Z.Y., Hou, Q.F., Yu, A.B.: Linking discrete particle simulation to
continuum process modelling for granular matter: Theory and application. Particuology 9,
342 (2011)
2. Cundall, P.A.: A computer model for simulating progressive, large-scale movements in
blocky rock systems. In: Symposium ISRM Proc., vol. 2, p. 129 (1971)
3. Thornton, C., Yin, K.K.: Impact of elastic spheres with and without adhesion. Powder
Technology 65, 113 (1991)
4. Thornton, C., Ning, Z.: A theoretical model for the stick/bounce behavior of adhesive,
elastic-plastic spheres. Powder Technology 99, 154 (1998)
5. Sheng, Y., Lawrence, C.J., Briscoe, B.J.: 3D DEM simulation of powder compaction. In:
3rd International Conference on Discrete Element Methods, Santa Fe, New Mexico, p. 305
(2002)
6. Sheng, Y., Lawrence, C.J., Briscoe, B.J.: Numerical studies of uniaxial powder
compaction process by 3D DEM. Engineering Computations 62, 304 (2010)
7. Wu, C.Y.: DEM simulations of die filling during pharmaceutical tabletting. Particuology
6, 412 (2008)
8. Wu, C.Y., Guo, Y.: Modelling of the flow of cohesive powders during pharmaceutical
tabletting. Journal of Pharmacy and Pharmacology 62, 1450 (2010)
9. Guo, Y., Wu, C.Y., Kafui, D.K., Thornton, C.: 3D DEM/CFD analysis of size-induced
segregation during die filling. Powder Technology 206, 177 (2011)
10. Guo, Y., Wu, C.Y., Thornton, C.: The effects of air and particle density difference on
segregation of powder mixtures during die filling. Chemical Engineering Science 66, 661
(2011)

11. Kafui, D.K., Johnson, S., Thornton, C., Seville, J.P.K.: Parallelization of a Lagrangian-
Eulerian DEM/CFD code for application to fluidised beds. Powder Technology 207, 270
(2011)
12. Guo, Y., Wu, C.Y.: Numerical modelling of suction filling using DEM/CFD. Chemical
Engineering Science 73, 231 (2012)
13. Washington, D.W., Meegoda, J.N.: Micro-mechanical simulation of geotechnical problems
using massively parallel computers. International Journal of Numerical and Analytical
Mehods in Geomechanics 27, 1227 (2003)
14. Ghaboussi, J., Basole, M., Ranjithan, S.: Three dimensional discrete element analysis on
massively parallel computers. In: 2nd International Conference on Discrete Element
Methods. Massachusetts Institute of Technology, Cambridge (1993)
15. Maknickas, A., Kaceniauskas, A., Kacianauskas, R., Balevicius, R., Dziugys, A.: Parallel
DEM software for simulation of granular media. Informatica 17, 207 (2006)
16. Darmana, D., Deen, N.G., Kuipers, J.A.M.: Parallelization of a Euler-Lagrange model
using mixed domain decomposition and a mirror domain technique: Application to
dispersed gas-liquid two-phase flow. Journal of Computational Physics 220, 216 (2006)
17. EDEM, http://www.dem-solutions.com/software/edem-software/
18. PFC, http://www.itasca.cn/index.jsp
19. Top 500 SuperComputer lists, http://www.top500.org/lists/2011/11
20. Feng, C.S., Shu, S., Xu, J., Zhang, C.S.: Numerical study of geometric multigrid methods
on CPU-GPU heterogenous computers (2012) (preprint),
http://arxiv.org/abs/1208.4247v2
21. CDEM, GDEM, http://www.sinoelite.cc/
22. Ma, Z.S., Feng, C., Liu, T.P., Li, S.H.: A GPU accelerated continuous-based discrete
element method for elastodynamics analysis. Advanced Materials Research 320, 329
(2011)
23. Ye, J., Chen, J.X., Chen, X.Q., Tao, H.P.: Modeling and rendering of real-time large scale
granular flow scene on GPU. In: Procedia Environmental Sciences, p. 1035 (2011)
24. Radeke, C.A., Glasser, B.J., Khinast, J.G.: Large-scale powder mixer simulations using
massively parallel GPU architectures. Chemical Engineering Science 65, 6435 (2010)
25. Johnson, K.L.: Contact mechanics. Cambridge University Press, Cambridge (1985)
26. Mindlin, R.D., Deresiewicz, H.: Elastic spheres in contact under varying oblique forces.
Journal of Applied Mechanics 20, 327 (1953)
27. NVIDIA Tesla C2050/C2070 GPU Computing Processor,
http://www.nvidia.com/object/personal-supercomputing.html
