[Figure 1 flowchart: check whether the neighbour list needs updating; if yes, run the neighbour search (O(Np) × Nsearch); then force summation (O(Np) × Nt) and position update (O(Np) × Nt); increment t = t + 1 and repeat until t = Nt, then end.]
particles which can potentially interact with a specific particle is limited to a constant number which is determined by the ratio of the radii of the largest and smallest particles in the model. Therefore the total number of potentially interacting particle pairs stored in the Verlet list is O(Np), where Np is the number of particles. This results in a linear dependence of the time needed for the force calculations on the total number of particles for a given particle size range. Similarly, the grid-based neighbour search algorithm reduces the time for a rebuild of the list of potentially interacting pairs of particles. Instead of checking every particle in the domain against every other particle in the domain for interactions, or Np(Np − 1)/2 particle-pairs (a search with O(Np²) complexity), the algorithm looks for neighbouring particles within a smaller, fixed space in which the number of particles can be considered constant. The number of particle-pairs checked for interactions is therefore approximately C1·Np, i.e. O(Np), where C1 is a constant which depends mainly on the range of particle sizes present in the model. The neighbour search is only done on an as-needed basis and is usually triggered when a particle has moved a significant distance, such that it may have new neighbours.

Having obtained the list of potential particle-pairs from the neighbour search, the force summation is performed. This requires, for each particle-pair in the list, computing the particle-pair forces and updating the running sum of the net force acting on each particle. At the end of the force summation, the algorithm yields the total net force acting on each particle, from which the current acceleration is obtained by dividing by the mass of the particle. The force summation requires approximately C2·Np calculations, i.e. O(Np), where C2 is the average coordination number (the average number of particles interacting with a given particle). After the net force acting on each particle has been obtained, the velocities and positions of each particle are individually updated via a finite-difference time-integration scheme with an algorithmic complexity of O(Np).

As illustrated in Figure 1, the neighbour search, force summation and position update stages are repeatedly executed until the desired number of simulation timesteps is completed. Consequently, the total algorithmic complexity of the DEM is at least O(Np × Nt) for simulations with infrequent neighbour list updates. Simulating some geophysical applications with the DEM may require both millions of particles and millions of timesteps, a computational burden that is beyond the capacity of desktop computers.

3. Three-dimensional Parallel DEM Implementation

ESyS-Particle is a parallel implementation of the Discrete Element Method written in C++ with a Python API [8], [9], [4]. ESyS-Particle is specifically designed with geophysical applications in mind and is capable of simulating large three-dimensional problems. The software design permits distribution of both control and data. It employs a distributed memory model with communication implemented using the Message Passing Interface (MPI) [10].

ESyS-Particle uses a master-worker model, modified to permit communication directly among workers. The master process, known as the lattice master, provides a high level of control, communicating requirements to all worker processes in the lattice covering the domain of the problem. Each worker process comprises a sublattice controller and the sublattice itself. Each sublattice controller communicates with the lattice master, conveying instructions to its corresponding sublattice. Each sublattice is responsible for the computations, such as force calculations, in the simulation. In the modified master-worker model, the sublattices also communicate directly among themselves [8], [9].

Particles are packed into a closed geometrical space with a power-law distribution of sizes. The problem of how to assign particles within the domain to workers is addressed with spatial domain decomposition. Each worker is responsible for a subdomain of the problem domain and thus for all particles and their interactions within that subdomain. The movement of particles between the subdomains is handled during the rebuilding of the neighbour table described in Section 2.

In order to compute the net force on a particle near a subdomain boundary, a worker process may require the positions of particles residing in a neighbouring subdomain. To minimise communication between worker processes, each worker stores the positions of particles residing in a small region surrounding its own domain. At the end of each position update stage, workers broadcast the positions of particles lying in their subdomain boundary region, and neighbouring workers then update their own particle position lists accordingly. In addition to the memory and communication overheads this approach entails, each worker also faces an increased computational work-load to compute forces in boundary regions, compared with that of an isolated subdomain.

The output of simulation data to disk is also parallelised in ESyS-Particle. There exist two types of data structure that can be written from an ESyS-Particle experiment: checkpointers and field savers.
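Before turning to the benchmark results, a few illustrative sketches may help make Sections 2 and 3 concrete. The first shows a grid-based neighbour search of the kind described in Section 2. It is a minimal serial Python sketch, not ESyS-Particle code; the function and parameter names (build_pair_list, cell_size) and the 10% Verlet skin are hypothetical choices of ours.

```python
from collections import defaultdict
from itertools import product

def build_pair_list(positions, radii, cell_size):
    """Grid-based neighbour search, O(Np) instead of O(Np^2).

    Assumes cell_size >= 2 * max(radii) + skin, so every potentially
    interacting pair lies in the same or an adjacent grid cell.
    """
    skin = 0.1 * cell_size  # assumed Verlet skin
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(positions):  # binning: O(Np)
        cells[(int(x // cell_size),
               int(y // cell_size),
               int(z // cell_size))].append(i)

    pairs = []
    # Each particle is checked only against the 27 surrounding cells;
    # the particle count per cell is bounded by the particle size
    # range, so the total work is ~ C1 * Np.
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = cells.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in neighbours:
                    if j <= i:
                        continue  # count each unordered pair once
                    d2 = sum((a - b) ** 2
                             for a, b in zip(positions[i], positions[j]))
                    if d2 <= (radii[i] + radii[j] + skin) ** 2:
                        pairs.append((i, j))
    return pairs
```

Because a pair is kept whenever its gap is smaller than the skin, the list remains valid until some particle has moved roughly half the skin distance, which is what triggers the rebuild described above.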
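The overall loop of Figure 1 then looks schematically as follows. Again this is a serial sketch under assumptions of ours: a simple linear elastic (Hookean) contact force with stiffness K, an explicit finite-difference update, and the build_pair_list function from the previous sketch. ESyS-Particle's actual force laws and integrator are not reproduced here.

```python
import math

def run_dem(positions, velocities, radii, masses,
            nt, dt, stiffness, cell_size):
    """Schematic DEM loop: neighbour search on demand, force
    summation (~ C2 * Np per step), position update (O(Np))."""
    skin = 0.1 * cell_size  # assumed Verlet skin
    pairs = build_pair_list(positions, radii, cell_size)
    moved = [0.0] * len(positions)  # displacement since last rebuild

    for step in range(nt):
        # Force summation over the pair list: O(Np) because the
        # average coordination number C2 is bounded.
        forces = [[0.0, 0.0, 0.0] for _ in positions]
        for i, j in pairs:
            d = [b - a for a, b in zip(positions[i], positions[j])]
            dist = math.sqrt(sum(c * c for c in d))
            overlap = radii[i] + radii[j] - dist
            if overlap > 0.0 and dist > 0.0:  # particles in contact
                for k in range(3):
                    f = stiffness * overlap * d[k] / dist
                    forces[j][k] += f  # push j away from i
                    forces[i][k] -= f  # equal and opposite on i

        # Finite-difference update of velocities and positions: O(Np).
        for i in range(len(positions)):
            v = [velocities[i][k] + dt * forces[i][k] / masses[i]
                 for k in range(3)]
            velocities[i] = v
            positions[i] = tuple(p + dt * vk
                                 for p, vk in zip(positions[i], v))
            moved[i] += dt * math.sqrt(sum(c * c for c in v))

        # Rebuild the neighbour list only when some particle may have
        # gained new neighbours.
        if max(moved) > 0.5 * skin:
            pairs = build_pair_list(positions, radii, cell_size)
            moved = [0.0] * len(positions)
```

The total work is Nt × O(Np) for the summation and update stages, plus O(Np) per occasional rebuild, matching the O(Np × Nt) lower bound stated above.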
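On the parallel side, the boundary-region ("ghost particle") exchange between neighbouring workers can be illustrated with a one-dimensional decomposition. This sketch uses mpi4py for brevity, whereas ESyS-Particle implements the exchange internally in C++ over MPI; the function name, the halo_width parameter, and the dictionary particle records are all hypothetical.

```python
from mpi4py import MPI

def exchange_boundary_particles(comm, my_particles, lo, hi, halo_width):
    """1-D domain decomposition: this worker owns particles with
    lo <= x < hi and receives read-only ghost copies of particles
    within halo_width of its faces from its two neighbours."""
    rank, size = comm.Get_rank(), comm.Get_size()
    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    # Particles lying in my own boundary regions, broadcast to the
    # respective neighbour after each position update stage.
    to_left = [p for p in my_particles if p["x"] < lo + halo_width]
    to_right = [p for p in my_particles if p["x"] >= hi - halo_width]

    # Paired send/receive with each neighbour; MPI.PROC_NULL makes the
    # calls at the ends of the domain no-ops.
    from_right = comm.sendrecv(to_right, dest=right, source=right)
    from_left = comm.sendrecv(to_left, dest=left, source=left)

    # Ghosts are used in the force summation but never integrated here.
    return (from_left or []) + (from_right or [])
```

Calling such a routine after every position update keeps boundary force summations consistent, at the price of the memory, communication and extra computation overheads noted above.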
Table 1. Simulation parameters for the wave propagation benchmark experiments.

    Simulation Parameter      Symbol   Value
    Subdomain side length     L        15.0 mm
    Minimum particle radius   Rmin     0.2 mm
    Maximum particle radius   Rmax     1.0 mm
    Number of timesteps       Nt       10000
    Timestep increment        Δt       10⁻³ ms
    Elastic stiffness         K        1000.0 N/mm
    Wave source amplitude     AW       0.1 mm/ms
    Wave source period        τW       1.0 ms

Table 2. Partitioning of the domain.

    Nodes   Processes/Cores   Subdomains (x × y × z)   Particles
    1       2                 1 × 1 × 1                8647
    2       9                 2 × 2 × 2                71615
    4       28                3 × 3 × 3                241540
    9       65                4 × 4 × 4                573027
    16      126               5 × 5 × 5                1107566
    28      217               6 × 6 × 6                1925620
    43      344               7 × 7 × 7                3020677
    65      513               8 × 8 × 8                4558458
    92      730               9 × 9 × 9                6345300
    126     1001              10 × 10 × 10             8700121
[Figure 5 plot: simulation execution time (s), 0 to 4500, versus number of worker processes, 0 to 1000, with one curve per output mode: None, FS (field saver), CP (checkpointer), and FS & CP.]
Figure 5. Simulation execution times for four experiments: No data output, field saver output, checkpoint
file output, and both field saver and checkpoint file output.
Simulations with a given number of workers and field saver output are typically only between 10% and 20% slower than their counterparts without data output. These results demonstrate that the inclusion of simple data output in the form of a field saver is not a significant overhead for simulations involving larger numbers of particles.

Simulations in which checkpoint files are saved also display weak scalability, albeit with a more significant linear increase in average execution time. Whereas the field saver only requires approximately 10 B of data to be stored per timestep, the amount of data saved in checkpoint files scales with the number of particles in the simulation. Approximately 2.3 MB of data are stored per worker process which, for the largest simulations, results in approximately 2.6 GB being written to disk every 100 timesteps. We observe that runtimes for simulations involving 1000 workers are approximately three times those of simulations with 27 workers. The cause of this linear performance degradation is almost certainly a filesystem bandwidth bottleneck, although further analysis is required to determine the extent to which such a bottleneck may be mitigated. It should be noted that storing checkpoint files every 100 timesteps is highly unusual for production simulations. Typically, checkpoint files would be stored every 10 000 timesteps for simulations comprising between 100 thousand and 1 million timesteps. In such cases, the performance degradation from data output would be negligible relative to the computational burden.

Writing both field savers and checkpointers does not increase the total execution time significantly, compared with writing checkpointers alone. This confirms the previous observation that storing field saver output is a relatively trivial overhead compared with the actual computations and the storage of simulation state data in checkpoint files. For many production simulations, storage of checkpoint files is mandatory for post-analysis of simulation results and three-dimensional visualisation. Storage of field saver data is typically useful when obtaining the desired output data through post-analysis would otherwise be exceptionally time-consuming.
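As a rough consistency check on these output volumes, the total checkpoint data written per run scales with both the worker count and the dump frequency, while field saver output stays near 10 B per timestep. The helper below is ours; the 2.3 MB-per-worker figure is the one quoted above.

```python
def checkpoint_volume_gb(n_workers, mb_per_worker=2.3,
                         n_timesteps=10000, interval=100):
    """Total checkpoint data written over a run, in GB."""
    dumps = n_timesteps // interval
    return n_workers * mb_per_worker * dumps / 1024.0

# ~2.2 GB per dump for ~1000 workers (cf. the ~2.6 GB observed above);
# dumping every 100 steps over this 10 000-step benchmark writes
# ~220 GB, while the production interval of 10 000 steps gives one dump.
print(checkpoint_volume_gb(1000, interval=100))    # ~224.6
print(checkpoint_volume_gb(1000, interval=10000))  # ~2.2
```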
6. Conclusion

A scaling benchmark for ESyS-Particle, a parallel implementation of the Discrete Element Method, shows that ESyS-Particle scales well up to at least 1000 cores when simulating up to 8.7 million particles in a three-dimensional elastic wave propagation problem. The increase in computation time is as little as 46% on 1000 cores compared with 27 cores. The trend for simulation data output similarly shows a flattening of the curve toward constant time as the number of worker processes increases. On the basis of our analysis, ESyS-Particle is demonstrably an efficient platform for parallel DEM simulations, scaling to beyond 8.7 million particles on cluster supercomputers. One would expect ESyS-Particle to continue to display weak scalability well beyond the limit of 1000 worker processes considered in our analysis.

Acknowledgments

References

[6] T. Pöschel and T. Schwager, Computational Granular Dynamics: Models and Algorithms. Springer, 2005.

[7] L. Verlet, “Computer ‘experiments’ on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules,” Phys. Rev., vol. 159, no. 1, p. 98, Jul. 1967.

[8] S. Abe, “Investigation of the influence of different microphysics on the dynamic behaviour of faults using the lattice solid model,” Ph.D. thesis, The University of Queensland, 2001.

[9] S. Abe, D. Place, and P. Mora, “A parallel implementation of the lattice solid model for the simulation of rock mechanics and earthquake dynamics,” Pure Appl. Geophys., vol. 161, pp. 2265–2277, 2004.

[10] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A high-performance, portable implementation of the MPI message passing interface standard,” Parallel Computing, vol. 22, pp. 789–828, 1996.