Parallelization of a DEM Code Based on CPU-GPU Heterogeneous Architecture
Xiaoqiang Yue1, Hao Zhang2, Congshu Luo1, Shi Shu3,*, and Chunsheng Feng1
1 School of Mathematics and Computational Science, Xiangtan University, 411105, Hunan, China
2 Heat and Mass Transfer Technological Center, Technical University of Catalonia, 08222, Terrassa, Spain
3 Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, 411105, Hunan, China
siukyoo@163.com, tourzhang@gmail.com, luocs105@126.com,
{shushi,spring}@xtu.edu.cn
1 Introduction
Particulate flows are commonly encountered in both engineering and environmental applications. However, owing to the stochastic nature of the particles, the fundamental physical mechanisms in these systems are still not well understood, and researchers have yet to formulate a general method for the reliable scale-up, design and control of processes of different types [1]. Numerical simulation can assist in making decisions on trial conditions; compared with actual experiments, it is a cheaper and more convenient option. At present, numerical modeling has become a powerful tool for developing new engineering facilities and optimizing existing ones. As a typical Lagrangian method, the discrete element method (DEM) [2] has attracted considerable attention from researchers in different fields. DEM can predict the whole motion of a particulate flow by monitoring every single particle.
* Corresponding author.
achieved an average 650-fold speedup using an NVIDIA GTX VGA card relative to an Intel Core-Dual 2.66 GHz CPU [21-22]. Ye et al. proposed a spatial subdivision algorithm that partitions space into a uniform grid and sorts the particles based on their hash values; their results showed that the rendering speed of a large-scale granular flow scene can still reach 37.9 frames per second when the number of particles reaches 200,000 [23]. Radeke et al. proposed an approach using the massively parallel compute unified device architecture (CUDA) technique on GPUs to implement DEM algorithms, which enables simulations of more than two million particles per gigabyte of memory [24]. Applications of GPU-based DEM are still limited but increasingly popular, and numerical simulations with engineering-scale numbers of particles are in especially high demand. We reconstructed Trubal on a CPU-GPU heterogeneous architecture and attained an average speedup of 4.69 when simulating 6,000 particles over 200,000 time steps, measured at four classical moments.
The remainder of the paper is organized as follows. Section 2 gives a brief introduction to the theoretical and numerical issues of the DEM. In Section 3, numerical simulations are conducted and a comparison is made between Trubal and the new solvers. Finally, the results are summarized and conclusions are drawn in Section 4.
m a = m g + F_c
I ∂²θ/∂t² = τ_c        (1)
where m and I are the mass and the moment of inertia of the particle, respectively, a is the acceleration, θ is the angular displacement, g is the acceleration of gravity (if considered), and F_c and τ_c are the contact force and moment of force generated by direct collisions. The commonest interaction laws are 'hard' contact, the linear law and the nonlinear law, all based on classical contact mechanics. The calculation of the interaction force is not necessary in 'hard' contact models. The latter two are called 'soft' contact, where a small overlap is allowed to represent the physical deformation of the contacting bodies that takes place at the interface. In Trubal, as adopted in the present work, the normal force-displacement relationship of the particles is calculated based on the theory of Hertz [25].
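The equations of motion (1) can be advanced particle by particle with an explicit scheme; a minimal Python sketch of the translational part, with placeholder mass, time step and contact force (Trubal itself uses a central-difference scheme; a semi-implicit Euler update is shown here for brevity):

```python
# Sketch of explicit time integration of the translational part of Eq. (1)
# for one spherical particle. All values are illustrative placeholders.
import numpy as np

def step(x, v, m, f_contact, dt, g=np.array([0.0, 0.0, -9.8])):
    """Advance position x and velocity v by one time step dt.

    m a = m g + F_c  ->  a = g + F_c / m
    """
    a = g + f_contact / m          # acceleration from Eq. (1)
    v_new = v + a * dt             # semi-implicit Euler: velocity first,
    x_new = x + v_new * dt         # then position with the new velocity
    return x_new, v_new

# free fall from rest with no contact force
x, v = np.zeros(3), np.zeros(3)
m, dt = 1.0e-6, 1.0e-5
for _ in range(1000):
    x, v = step(x, v, m, np.zeros(3), dt)
# after t = 0.01 s of free fall, v_z ≈ -9.8 * 0.01 = -0.098
```

The rotational equation in (1) is integrated analogously, with the moment of inertia I and the contact moment τ_c in place of m and F_c.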
For two particles of radii R_i, Young's moduli E_i and Poisson's ratios ν_i (i = 1, 2), the normal force-displacement relationship reads
F_n = (4/3) E* R*^(1/2) δ_n^(3/2)        (2)
where
1/E* = (1 − ν₁²)/E₁ + (1 − ν₂²)/E₂        (3)
and
1/R* = 1/R₁ + 1/R₂        (4)
The incremental tangential force arising from an incremental tangential displacement depends on the loading history as well as the normal force, and following Mindlin and Deresiewicz [26] takes the form

ΔF_t = 8 G* r_a θ_k Δδ_t ± μ(1 − θ_k) ΔF_n        (5)

where
1/G* = (1 − ν₁²)/G₁ + (1 − ν₂²)/G₂        (6)
r_a = (R* δ_n)^(1/2) is the radius of the contact area, Δδ_t is the relative tangential incremental surface displacement, μ is the coefficient of friction, and k and θ_k change with the loading history.
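The Hertzian normal force of Eqs. (2)-(4) can be computed directly from the material properties; a minimal sketch with placeholder material values (not those used in the simulations below):

```python
# Hertzian normal force from Eqs. (2)-(4): effective modulus E*, effective
# radius R*, and F_n = (4/3) E* sqrt(R*) delta_n^(3/2).
# Material parameters are illustrative placeholders.
import math

def hertz_normal_force(delta_n, E1, nu1, R1, E2, nu2, R2):
    E_star = 1.0 / ((1 - nu1**2) / E1 + (1 - nu2**2) / E2)   # Eq. (3)
    R_star = 1.0 / (1.0 / R1 + 1.0 / R2)                     # Eq. (4)
    return (4.0 / 3.0) * E_star * math.sqrt(R_star) * delta_n**1.5  # Eq. (2)

f1 = hertz_normal_force(1e-6, 1e9, 0.3, 0.01, 1e9, 0.3, 0.01)
f4 = hertz_normal_force(4e-6, 1e9, 0.3, 0.01, 1e9, 0.3, 0.01)
# quadrupling the overlap multiplies the force by 4**1.5 = 8
```

The δ_n^(3/2) dependence is what makes the law nonlinear: doubling the overlap multiplies the force by 2^(3/2) ≈ 2.83, in contrast to a linear spring model.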
The main storage structure in Trubal is a globally static array in single precision. This eases programming, data access and memory saving, but it is prone to interrupting a simulation, because we cannot predict how much memory will be sufficient over the entire run. The usual workaround is to allocate the maximum storage of the machine in use. This simulator is denoted Trubal-org-s in what follows.
To avoid these shortcomings of the static storage structure, we dynamically allocate the required local arrays according to the numbers of particles, walls and contacts. This prevents wasted memory and enhances the computational capability, i.e. we can simulate more particles than Trubal. In the following, we call this new simulator Trubal-new-s.
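The difference between the two storage strategies can be sketched in Python (class and field names are hypothetical; Trubal itself uses a single static Fortran array):

```python
# Contrast of the two storage strategies described above (names hypothetical).
import numpy as np

# Trubal-org-s style: one fixed-size single-precision pool, sized at start-up;
# the run aborts if the simulation ever needs more slots than were reserved.
MAX_WORDS = 1000
static_pool = np.zeros(MAX_WORDS, dtype=np.float32)

# Trubal-new-s style: arrays sized from the actual particle count and regrown
# (here by doubling) only when the number of contacts exceeds capacity.
class ContactStore:
    def __init__(self, n_particles):
        # initial capacity proportional to the particle count
        self.data = np.zeros(4 * n_particles, dtype=np.float32)

    def ensure(self, n_needed):
        # grow the array instead of aborting when capacity is exceeded
        if n_needed > self.data.size:
            new = np.zeros(max(n_needed, 2 * self.data.size), dtype=np.float32)
            new[:self.data.size] = self.data
            self.data = new

store = ContactStore(n_particles=6000)
store.ensure(100_000)   # the run continues where a static pool would overflow
```

Amortized doubling keeps the cost of regrowth negligible relative to the per-step contact computations.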
consumption compared to the latest quad-core CPUs [27]. The Tesla C2050 has 14 streaming multiprocessors (SMs), and each SM has 32 streaming processors (SPs), or CUDA cores. Each SM has its own L1 cache (64 KB), shared memory (48 KB) and registers (32,768 available per block). The SMs share the L2 cache (768 KB), constant memory (64 KB), texture memory and global memory (3.0 GB, or 2.625 GB with ECC on). The GPU's memory bandwidth and peak performance in single and double precision are both about 10 times those of CPUs.
3 Numerical Experiments
The die filling process, as sketched in Fig. 4, has wide applications ranging from pharmacy and metallurgy to food processing. Charley Wu et al. have successfully used Trubal to simulate the die filling process [7-10]. In this study, several numerical experiments are carried out to demonstrate the efficiency of the above three simulators on a class of die filling problems, whose parameters are shown in Table 1.
Table 2 lists the specifications of our testbed. For the sake of fairness, we compile Trubal-org-s and Trubal-new-s with the optimization options "-O2 -IPA", and Trubal-new-p with "-arch=sm_20 -O2". Here, the specified "grid-block-thread" hierarchy in CUDA is "1-13-512".
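With this hierarchy, 13 blocks of 512 threads give 6,656 threads in total, enough for one thread per particle; the usual index mapping can be sketched in Python (mirroring CUDA's blockIdx/blockDim/threadIdx arithmetic):

```python
# How the "1-13-512" thread hierarchy covers 6,000 particles: each thread
# handles the particle with global index
#     i = blockIdx * blockDim + threadIdx,
# and threads with i >= n_particles simply idle.
BLOCKS, THREADS_PER_BLOCK = 13, 512
N_PARTICLES = 6000

covered = set()
for block in range(BLOCKS):
    for thread in range(THREADS_PER_BLOCK):
        i = block * THREADS_PER_BLOCK + thread   # global thread index
        if i < N_PARTICLES:
            covered.add(i)

total_threads = BLOCKS * THREADS_PER_BLOCK       # 13 * 512 = 6656
idle_threads = total_threads - N_PARTICLES       # 656 threads idle
```

Keeping the block size a multiple of the 32-thread warp size avoids partially filled warps within each block.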
Table 1. Parameters of the die filling simulation

Parameter                                                      Value
Number of light particles, heavy particles, walls              3,000, 3,000, 11
Density of light, heavy particles (kg/m³)                      400, 7,800
Diameter of particles (m)                                      1.30×10⁻⁴
(Young's modulus (Pa), Poisson's ratio) of particles, walls    (8.70×10⁹, 0.30), (2.10×10¹¹, 0.29)
Friction coefficient of particle-particle, particle-wall       0.30, 0.30
Damping coefficient of mass, stiffness                         0.00, 1.59×10⁻²
Acceleration of gravity, downward (m/s²)                       9.80
Domain of die (m)                                              (0.002, 0.009)×(0.020, 0.027)
Domain of shoe (m)                                             (0.009, 0.021)×(0.029, 0.043)
Velocity of the shoe, leftward (m/s)                           0.07
Time step of the simulation (s)                                7.6283×10⁻⁹
Fig. 5. Comparison of the particle distributions (left, center and right are from Trubal-org-s, Trubal-new-s and Trubal-new-p, respectively)
4 Conclusions
In this study, a serial discrete particle code, Trubal, was parallelized in two steps: (1) reconstruction of the static storage structure; (2) an essential parallelization of the restructured code on a CPU-GPU heterogeneous architecture, using shared memory without bank conflicts and texture memory to maximize the utilization of GPU memory bandwidth. Numerical simulation showed that the final parallel code substantially accelerates Trubal: by simulating 6,000 particles on an NVIDIA Tesla C2050 card together with an Intel Core-Dual 2.93 GHz CPU, an average speedup of 4.69 in computational time was obtained.
Acknowledgments. The authors are grateful to Prof. Yuanqiang Tan from Xiangtan University for providing the Trubal code, which was essential for our understanding of DEM. We would also like to thank Prof. Mingjun Li from Xiangtan University for his helpful comments and suggestions on the reconstruction of the data structure. Yue and Feng are partially supported by the National Natural Science Foundation of China (11201398), the Specialized Research Fund for the Doctoral Program of Higher Education (20124301110003) and the Scientific Research Fund of the Hunan Provincial Education Department of China (12A138). Shu is partially supported by the National Natural Science Foundation of China (91130002, 11171281).
References
1. Zhu, H.P., Zhou, Z.Y., Hou, Q.F., Yu, A.B.: Linking discrete particle simulation to
continuum process modelling for granular matter: Theory and application. Particuology 9,
342 (2011)
2. Cundall, P.A.: A computer model for simulating progressive large scale movement in
block rock system. In: Symposium ISRM Proc., vol. 2, p. 129 (1971)
3. Thornton, C., Yin, K.K.: Impact of elastic spheres with and without adhesion. Powder
Technology 65, 113 (1991)
4. Thornton, C., Ning, Z.: A theoretical model for the stick/bounce behavior of adhesive,
elastic-plastic spheres. Powder Technology 99, 154 (1998)
5. Sheng, Y., Lawrence, C.J., Briscoe, B.J.: 3D DEM simulation of powder compaction. In:
3rd International Conference on Discrete Element Methods, Santa Fe, Mexico, p. 305
(2002)
6. Sheng, Y., Lawrence, C.J., Briscoe, B.J.: Numerical studies of uniaxial powder
compaction process by 3D DEM. Engineering Computations 62, 304 (2010)
7. Wu, C.Y.: DEM simulations of die filling during pharmaceutical tabletting. Particuology
6, 412 (2008)
8. Wu, C.Y., Guo, Y.: Modelling of the flow of cohesive powders during pharmaceutical
tabletting. Journal of Pharmacy and Pharmacology 62, 1450 (2010)
9. Guo, Y., Wu, C.Y., Kafui, D.K., Thornton, C.: 3D DEM/CFD analysis of size-induced
segregation during die filling. Powder Technology 206, 177 (2011)
10. Guo, Y., Wu, C.Y., Thornton, C.: The effects of air and particle density difference on
segregation of powder mixtures during die filling. Chemical Engineering Science 66, 661
(2011)
11. Kafui, D.K., Johnson, S., Thornton, C., Seville, J.P.K.: Parallelization of a Lagrangian-
Eulerian DEM/CFD code for application to fluidised beds. Powder Technology 207, 270
(2011)
12. Guo, Y., Wu, C.Y.: Numerical modelling of suction filling using DEM/CFD. Chemical
Engineering Science 73, 231 (2012)
13. Washington, D.W., Meegoda, J.N.: Micro-mechanical simulation of geotechnical problems
using massively parallel computers. International Journal of Numerical and Analytical
Methods in Geomechanics 27, 1227 (2003)
14. Ghaboussi, J., Basole, M., Ranjithan, S.: Three dimensional discrete element analysis on
massively parallel computers. In: 2nd International Conference on Discrete Element
Methods. Massachusetts Institute of Technology, Cambridge (1993)
15. Maknickas, A., Kačeniauskas, A., Kačianauskas, R., Balevičius, R., Džiugys, A.: Parallel
DEM Software for simulation of granular media. Informatica 17, 207 (2006)
16. Darmana, D., Deen, N.G., Kuipers, J.A.M.: Parallelization of a Euler-Lagrange model
using mixed domain decomposition and a mirror domain technique: Application to
dispersed gas-liquid two-phase flow. Journal of Computational Physics 220, 216 (2006)
17. EDEM, http://www.dem-solutions.com/software/edem-software/
18. PFC, http://www.itasca.cn/index.jsp
19. Top 500 SuperComputer lists, http://www.top500.org/lists/2011/11
20. Feng, C.S., Shu, S., Xu, J., Zhang, C.S.: Numerical study of geometric multigrid methods
on CPU-GPU heterogenous computers (2012) (preprint),
http://arxiv.org/abs/1208.4247v2
21. CDEM, GDEM, http://www.sinoelite.cc/
22. Ma, Z.S., Feng, C., Liu, T.P., Li, S.H.: A GPU accelerated continuous-based discrete
element method for elastodynamics analysis. Advanced Materials Research 320, 329
(2011)
23. Ye, J., Chen, J.X., Chen, X.Q., Tao, H.P.: Modeling and rendering of real-time large scale
granular flow scene on GPU. In: Procedia Environmental Sciences, p. 1035 (2011)
24. Radeke, C.A., Glasser, B.J., Khinast, J.G.: Large-scale powder mixer simulations using
massively parallel GPU architectures. Chemical Engineering Science 65, 6435 (2010)
25. Johnson, K.L.: Contact mechanics. Cambridge University Press, Cambridge (1985)
26. Mindlin, R.D., Deresiewicz, H.: Elastic spheres in contact under varying oblique forces.
Journal of Applied Mechanics 20, 327 (1953)
27. NVIDIA Tesla C2050/C2070 GPU Computing Processor,
http://www.nvidia.com/object/personal-supercomputing.html