
2010 Sixth IEEE International Conference on e–Science

Scaling Benchmark of ESyS-Particle for Elastic Wave Propagation Simulations

D. K. Weatherley∗† , V. E. Boros∗ , W. R. Hancock† and Steffen Abe‡



∗ The University of Queensland
Earth Systems Science Computational Centre
Brisbane, Queensland, Australia

† The University of Queensland
Sustainable Minerals Institute
W. H. Bryan Mining and Geology Research Centre
Brisbane, Queensland, Australia

‡ RWTH Aachen University
Aachen, Germany

Abstract

The Discrete Element Method (DEM) is a popular particle-based numerical method for simulating geophysical processes including earthquakes, rock breakage and granular flow. Often simulations consisting of thousands of particles have insufficient resolution to reproduce the micromechanics of many geophysical processes, requiring millions of particles in some instances. The high computational expense of the DEM precludes execution of such problem sizes on desktop PCs.

ESyS-Particle is a parallel implementation of the DEM, designed for execution on cluster supercomputers. Three-dimensional spatial domain decomposition is implemented using the MPI for interprocess communications. We present results of scaling benchmarks in which problem size per worker remains constant. As the number of workers increases from 27 to 1000, execution time remains near-constant, permitting simulations of 8.7M particles in approximately the same real time as simulations comprising 240K particles.

1. Introduction

The Discrete Element Method (DEM) is a generalised numerical modelling method for the simulation of processes that can be described using fundamental interactions between grains or particles [1]. By using a first principles approach, the aim of the DEM is to model macroscopic processes such as rock breakage and granular flow by solving Newton's equations of motion for a large assembly of microscopic particles undergoing particle-pair interactions. The success of the DEM lies in its ability to be used as both a fundamental research modelling tool and as a practical engineering tool in industries ranging from pharmaceuticals [2] to mining [3].

As explained in Section 2, the DEM is inherently computationally expensive, with the total number of particles being the largest constraint on execution times. To replicate the macroscopic behaviour of many geophysical processes, significant numbers of particles are required. One such example is in rock fracture simulations, where samples are created entirely of bonded particles and fractures are resolved along broken bonds. The propagation of the crack tip is heavily dependent on the resolution of the simulation where, in the worst case, if too few particles are used, cracks may never coalesce to form macroscopic fractures. This motivates the need for a parallel implementation of the algorithm, one that is able to distribute workload evenly across the compute space and maintain a reasonable efficiency even at larger scales.

The aim of this paper is to benchmark the performance of ESyS-Particle, a parallel implementation of the DEM designed for cluster supercomputers. Earlier work [4] has verified its scalability on a two-dimensional elastic wave propagation problem. More recent work [5] has showcased its three-dimensional capability on the simulation of a fault gouge. Here we verify its scalability on a three-dimensional version of the elastic wave propagation problem. In the following we first describe the basic features of DEM algorithms, then discuss the parallel implementation employed in ESyS-Particle. Subsequently we describe a series of benchmarking experiments and present results demonstrating the scalability of the software. We find that ESyS-Particle displays reasonable weak scalability, i.e. total execution time remains near-constant with increasing numbers of processes for constant problem size per processor. We conclude that ESyS-Particle provides an effective platform for DEM simulations involving tens of millions of particles.

2. Discrete Element Method

The DEM is a first principles approach to modelling elastic solids or granular media in which Newton's equations of motion are used to compute the trajectories of large assemblies of microscopic particles undergoing particle-pair interactions. Spherical particles described by their mass, radius, position and velocity interact with neighbouring particles via a range of simplified interactions including linear elastic repulsion and frictional forces. An explicit finite difference time integration scheme is used to integrate Newton's equations of motion, solving for the velocity and new position of each particle in response to the forces it experiences at a given time. To ensure numerical stability, positions are repeatedly updated at discrete time intervals (∆t) which are typically of the order of micro- or nanoseconds for real-world applications. Since DEM simulations often comprise large numbers of particles and very short time increments, the DEM is a computationally expensive numerical method, with computation time often the single most limiting factor determining the resolution and accuracy of numerical solutions.
Figure 1. Flow diagram of the DEM algorithm.

Figure 1 illustrates the DEM solution algorithm. A complete treatise on DEM algorithms is provided in [6]. The DEM algorithm consists of four main components: initialisation, neighbour search, force summation and position update. Initialisation of a DEM simulation typically consists in specifying the initial locations and velocities of all particles. Various methods have been devised for specifying initial particle locations. For the simulations conducted here, we employ a geometrical particle-packing algorithm in which particles ranging in size from Rmin to Rmax are closely packed into a bounding volume (see Figure 2). For the benchmarking results presented later, the details of this packing algorithm are irrelevant because packing was conducted in a separate stage prior to commencing simulations and does not contribute to the execution times reported.

Figure 2. Example of a cubic volume packed with spherical particles.

The second component of any DEM algorithm is an efficient neighbour search algorithm. This algorithm generates a list of pairs of particles that are sufficiently close to one another that they may interact. The particle-pair list is used during the subsequent force summation stage to avoid unnecessary calculations. A number of efficient neighbour search algorithms are described in [6], two of the most common being: maintain a Verlet list [7] to keep track of neighbouring pairs to be considered for force calculations, and implement a grid-neighbour search algorithm.
The Verlet list accelerates the force calculations by storing a list of potentially interacting particle pairs. Given that the force calculations in the DEM are based on nearest-neighbour interactions only, the number of particles which can potentially interact with a specific particle is limited to a constant number which is determined by the ratio of the radii of the largest and smallest particles in the model. Therefore the total number of potentially interacting particle pairs stored in the Verlet list is O(Np), where Np is the number of particles. This results in a linear dependence of the time needed for the force calculations on the total number of particles for a given particle size range.

Similarly, the grid-based neighbour search algorithm reduces the time for a rebuild of the list of potentially interacting pairs of particles. Instead of checking every particle in the domain against every other particle in the domain for interactions, or Np(Np − 1)/2 particle-pairs (a search with O(Np²) complexity), the algorithm looks for neighbouring particles within a smaller, fixed space in which the number of particles can be considered constant. So the number of particle-pairs checked for interactions is now approximately C1 Np, i.e. O(Np), where C1 is a constant which mainly depends on the range of particle sizes present in the model. The neighbour search is only done on a needs basis and is usually triggered when a particle has moved a significant distance, such that it may have new neighbours.
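
As a concrete illustration of the grid-based approach, the sketch below bins particles into cubic cells whose edge length covers the largest possible interaction range plus a Verlet "skin", so that candidate partners of any particle can only lie in its own cell or the 26 adjacent cells. The cell size, the skin width and the dictionary-of-cells data structure are assumptions for illustration, not the search implemented in ESyS-Particle.

import itertools
import numpy as np

def grid_neighbour_search(pos, rad, verlet_dist):
    """Build a list of potentially interacting particle pairs.

    Cells are sized so that any two particles closer than the sum of
    their radii plus the Verlet skin must sit in the same cell or in
    neighbouring cells; each pair is counted once.
    """
    cell_size = 2.0 * rad.max() + verlet_dist
    cells = {}
    for i, p in enumerate(pos):
        cells.setdefault(tuple((p // cell_size).astype(int)), []).append(i)

    pairs = []
    for cell, members in cells.items():
        for offset in itertools.product((-1, 0, 1), repeat=3):
            neigh = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(neigh, []):
                    if j <= i:
                        continue      # avoid double-counting pairs
                    cutoff = rad[i] + rad[j] + verlet_dist
                    if np.linalg.norm(pos[i] - pos[j]) < cutoff:
                        pairs.append((i, j))
    return pairs
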
Having obtained the list of potential particle-pairs from the neighbour search, the force summation is performed. This requires, for each particle-pair in the list, computing the particle-pair forces and updating the running sum of the net force acting on each particle. At the end of the force summation, the algorithm yields the total net force acting on each particle, from which the current acceleration is obtained by dividing by the mass of the particle. The force summation requires approximately C2 Np calculations, i.e. O(Np), where C2 is the average coordination number (the average number of particles interacting with a given particle). After the net force acting on each particle has been obtained, the velocities and positions of each particle are individually updated via a finite-difference time-integration scheme with an algorithmic complexity of O(Np).

As illustrated in Figure 1, the neighbour search, force summation and position update stages are repeatedly executed until the desired number of simulation timesteps has been completed. Consequently, the total algorithmic complexity of the DEM is at least O(Np × Nt) for simulations with infrequent neighbour list updates. To simulate some geophysical applications with the DEM may require the use of both millions of particles and millions of timesteps, a computational burden that is beyond the capabilities of desktop computers.
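
The sketch below ties these stages together in a minimal timestep loop, reusing the pair-force and neighbour-search helpers sketched above: forces are accumulated over the current pair list, positions are advanced with a simple explicit update, and the pair list is rebuilt only once a particle has drifted by more than half the Verlet skin. The forward Euler update and the rebuild criterion are simplifying assumptions chosen for brevity, not the integration scheme used by ESyS-Particle.

import numpy as np

def run_simulation(pos, vel, rad, mass, n_t, dt, k, verlet_dist=0.5):
    """Advance the assembly for n_t explicit timesteps: O(Np * Nt) work
    for a bounded coordination number and infrequent list rebuilds."""
    pairs = grid_neighbour_search(pos, rad, verlet_dist)
    ref_pos = pos.copy()                     # positions at the last rebuild
    for _ in range(n_t):
        # Force summation over the current particle-pair list.
        force = np.zeros_like(pos)
        for i, j in pairs:
            f_ij = linear_elastic_repulsion(pos[i], pos[j], rad[i], rad[j], k)
            force[i] += f_ij
            force[j] -= f_ij                 # Newton's third law
        # Explicit position update (forward Euler for brevity).
        vel += dt * force / mass[:, None]
        pos += dt * vel
        # Rebuild the neighbour list when any particle has moved more
        # than half the Verlet skin since the last rebuild.
        if np.max(np.linalg.norm(pos - ref_pos, axis=1)) > 0.5 * verlet_dist:
            pairs = grid_neighbour_search(pos, rad, verlet_dist)
            ref_pos = pos.copy()
    return pos, vel
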
3. Three-dimensional Parallel DEM Implementation

ESyS-Particle is a parallel implementation of the Discrete Element Method written in C++ with a Python API [8], [9], [4]. ESyS-Particle is specifically designed with geophysical applications in mind and is capable of simulating large three-dimensional problems. The software design permits distribution of both control and data. It employs a distributed memory model with communication implemented using the Message Passing Interface (MPI) [10].

ESyS-Particle uses a master-worker model modified to permit communication directly among workers. The master process, known as the lattice master, provides a high level of control, communicating requirements to all worker processes in the lattice covering the domain of the problem. Each worker process comprises a sublattice controller and the sublattice itself. Each sublattice controller communicates with the lattice master, conveying instructions to its corresponding sublattice. Each sublattice is responsible for the computations, such as force calculations, in the simulation. In the modified master-worker model, the sublattices also communicate directly among themselves [8], [9].

Particles are packed into a closed geometrical space with a power-law distribution of sizes. The problem of how to assign particles within the domain to workers is addressed with spatial domain decomposition. Each worker is responsible for a subdomain of the problem domain and thus all particles and their interactions within that subdomain. The movement of particles between the subdomains is handled during the rebuilding of the neighbour table described in Section 2.

In order to compute the net force on a particle near a subdomain boundary, a worker process may require the positions of particles residing in a neighbouring subdomain. To minimise communication between worker processes, each worker stores the positions of particles residing in a small region surrounding its own subdomain. At the end of each position update stage, workers broadcast the positions of particles lying in their subdomain boundary region. Neighbouring workers then update their own particle position lists accordingly. In addition to the memory and communication overheads this approach entails, each worker also faces an increased computational workload to compute forces in boundary regions, compared with that of an isolated subdomain.
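
A minimal sketch of such a decomposition is given below: each particle is mapped to a worker on a regular grid of subdomains, and particles lying within a chosen halo width of an internal subdomain face are flagged as boundary-region particles whose positions would be shared with the neighbouring worker after each position update. The grid mapping, the halo test and all parameter values are illustrative assumptions, not the structures ESyS-Particle uses internally.

import numpy as np

def assign_to_subdomains(pos, domain_min, domain_max, dims, halo):
    """Return (owner, in_boundary): the flat worker index owning each
    particle on a dims[0] x dims[1] x dims[2] grid of subdomains, and a
    flag marking particles inside the boundary (halo) region."""
    dims = np.asarray(dims)
    extent = (np.asarray(domain_max) - np.asarray(domain_min)) / dims
    cell = np.floor((pos - domain_min) / extent).astype(int)
    cell = np.clip(cell, 0, dims - 1)              # keep edge particles inside
    owner = np.ravel_multi_index(cell.T, tuple(dims))

    # Distance of each particle from the low and high faces of its own
    # subdomain; faces shared with a neighbour define the boundary region.
    local = (pos - domain_min) - cell * extent
    near_low = (local < halo) & (cell > 0)
    near_high = (extent - local < halo) & (cell < dims - 1)
    in_boundary = (near_low | near_high).any(axis=1)
    return owner, in_boundary

# Example: a 150 mm cube split into 10 x 10 x 10 subdomains (1000 workers).
pos = np.random.rand(5000, 3) * 150.0
owner, ghost = assign_to_subdomains(pos, np.zeros(3), np.full(3, 150.0),
                                    (10, 10, 10), halo=2.0)
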

The output of simulation data to disk is also parallelised in ESyS-Particle. There exist two types of data structure that can be written from an ESyS-Particle experiment: checkpointers and field savers. Checkpointers are used for verbose output of simulation information including the positions, velocities and forces for each particle. Field savers are used to output specific quantities such as total potential or kinetic energy, forces acting on walls or the number of remaining bonds within a simulation. The communication that occurs in a checkpoint operation involves the master requesting that all workers write information about particles within their subdomain into a file that each worker creates. The field saver approach is a scatter-gather operation in which the master requests the workers to calculate their individual totals. This information is then sent back to the master, which collates it into a single datum that is then written to disk by the master process. Both approaches have their advantages and disadvantages. Checkpointers involve a large amount of data output but use limited communication (as the process is embarrassingly parallel). Field savers, on the other hand, require a larger amount of communication with usually small amounts of data output at the end of the operation. One of the aims of the benchmarking experiments below is to quantify the relative overheads of the checkpointer versus field saver data output options.
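
The two output patterns can be sketched with mpi4py as follows: the field-saver style reduces a single scalar to the master, which appends one number per timestep to a file, while the checkpointer style has every worker write its own particles to its own file with no inter-process communication. The function names, file formats and the use of mpi4py here are assumptions for illustration; ESyS-Particle's own checkpointer and field saver code differs.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def save_total_kinetic_energy(vel, mass, outfile, step):
    """Field-saver style scatter-gather: each worker reduces its local
    total and only the master writes a single value to disk."""
    local_ke = 0.5 * np.sum(mass * np.einsum("ij,ij->i", vel, vel))
    total_ke = comm.reduce(local_ke, op=MPI.SUM, root=0)
    if comm.rank == 0:
        with open(outfile, "a") as f:
            f.write(f"{step} {total_ke}\n")

def save_checkpoint(pos, vel, step):
    """Checkpointer style: every worker writes its own subdomain's
    particle state to its own file (embarrassingly parallel)."""
    np.savetxt(f"checkpoint_{step}_rank{comm.rank}.txt", np.hstack([pos, vel]))

# Called from the timestep loop, e.g. the field saver every step and the
# checkpointer every 100 steps:
#     save_total_kinetic_energy(vel, mass, "kinetic_energy.dat", step)
#     if step % 100 == 0:
#         save_checkpoint(pos, vel, step)
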
4. Benchmarking Experiment

In order to test the scalability of ESyS-Particle, a simple problem is required which can be scaled up with relative ease. A three-dimensional elastic wave propagation problem is a suitable choice. The problem consists of a cube of bonded particles excited into movement by an impulse supplied by an external, planar wall (see Figure 3). Wave propagation is a fundamental process that occurs as kinetic energy is propagated through a medium, e.g. the transmission of sound through air or earthquakes via ground motion.

Figure 3. Snapshot of particle velocities at timestep 800 in a simulation using 27 worker processes.

A series of experiments is conducted in which the size of the problem and the number of processor cores are varied. As the problem size grows, the problem domain is divided into subdomains of equal size, with each worker process assigned one subdomain. The experiment is scaled up by stacking the base problem size in a cubic fashion, as demonstrated in Figure 4. Keeping the simulation model a cubic volume also ensures that the entire compute space is utilised.

Figure 4. Subdomain decomposition for scaling benchmark experiments (1, 8 and 27 workers).

The wave propagation experiment was selected because it is a small-deformation problem in which particles do not have to move significant distances. Small deformation reduces the number of times the neighbour search is triggered and limits the amount of communication required to transfer particles from one worker process to another due to migration across subdomain boundaries. This simplifies the communication and computational overheads for the simulations, thus easing interpretation of the timing benchmark results presented later.

The wave propagation model is initialised by packing a cubic domain with particles. A base cube measuring 15 × 15 × 15 mm is packed using particles with radii ranging between 0.2 and 1.0 mm. For larger simulations, the same packing algorithm is used but the side length of the cube is expanded according to the number of subdomains in each coordinate direction. In our simulations, each subdomain contains approximately 9000 particles, commencing with 8647 particles in the smallest test (one worker process) and finishing with 8 700 121 particles spread across 1000 workers in the largest test. Table 2 shows the partitioning of the problem for differing numbers of worker processes.

Table 1. Simulation parameters for the wave propagation benchmark experiments.

Simulation Parameter       Symbol   Value
Subdomain side length      L        15.0 mm
Minimum particle radius    Rmin     0.2 mm
Maximum particle radius    Rmax     1.0 mm
Number of timesteps        Nt       10000
Timestep increment         ∆t       10⁻³ ms
Elastic stiffness          K        1000.0 N/mm
Wave source amplitude      AW       0.1 mm/ms
Wave source period         τW       1.0 ms

Table 2. Partitioning of the domain.

Nodes   Processes/Cores   Subdomains (x y z)   Particles
1       2                 1 1 1                8647
2       9                 2 2 2                71615
4       28                3 3 3                241540
9       65                4 4 4                573027
16      126               5 5 5                1107566
28      217               6 6 6                1925620
43      344               7 7 7                3020677
65      513               8 8 8                4558458
92      730               9 9 9                6345300
126     1001              10 10 10             8700121
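
The node and process counts in Table 2 are consistent with a simple partitioning rule: an n × n × n decomposition uses n³ workers plus one lattice master, packed onto nodes of eight cores each (matching the dual quad-core nodes described in the next section), with the cube side length scaled by n. The sketch below reproduces that rule; it is a reconstruction for illustration only, and the particle counts in Table 2 come from the actual packings rather than from any formula.

def partitioning(n, cores_per_node=8, base_side_mm=15.0):
    """Partitioning for an n x n x n subdomain decomposition: n**3
    workers plus one lattice master, packed onto 8-core nodes."""
    workers = n ** 3
    processes = workers + 1                   # workers + lattice master
    nodes = -(-processes // cores_per_node)   # ceiling division
    return {
        "nodes": nodes,
        "processes": processes,
        "subdomains": (n, n, n),
        "cube_side_mm": base_side_mm * n,
    }

# Reproduces the Nodes and Processes/Cores columns of Table 2:
for n in range(1, 11):
    print(partitioning(n))
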

Having constructed the particle assembly, the geometries are stored to files that are read during initialisation of the actual simulations. The lattice master reads the input geometry file and assigns each particle to one of the workers according to the subdomain in which the particle initially resides. Each particle is then connected to its nearest neighbours via linear elastic springs. Such a bonded particle assembly has the macroscopic properties of an isotropic, linear elastic solid [8]. In order to trigger elastic wave propagation, a wall is created at the base of the cube which is moved upwards to generate an input pulse. An overview of simulation parameters is given in Table 1. Each simulation is executed for a total of 10 000 timesteps and an external timing function is employed to measure execution time in real seconds.

The benchmarking experiments are conducted in four stages, with ten simulations in each stage as listed in Table 2. In the first stage, execution times are recorded for simulations with no output data generated. This provides a baseline for computation time without data I/O overheads. The second stage involves simulations in which a field saver records a single floating point number (the total kinetic energy) to a file, once per timestep. This is considered the minimum amount of data I/O for any practical simulation. The third stage stores checkpoint files (containing the entire simulation state) once every 100 timesteps. In the final stage, both field saver and checkpoint files are output. The hardware used for these experiments is a newly commissioned SGI HPC cluster at The University of Queensland containing 3144 cores on 384 nodes connected through InfiniBand and Gigabit Ethernet networks. For the benchmarking tests, we utilise between 1 and 126 nodes of that supercomputer, each comprised of dual Intel Xeon L5520 2.26 GHz quad-core CPUs. For each CPU the memory speed is 1066 MHz, with 3 GB RAM available to each core.

5. Performance Results

Total execution times for each of the experiments are provided in Figure 5. In all experiments, ESyS-Particle shows evidence of weak scalability, even for very large numbers of worker processes, although scalability is degraded for simulations involving large amounts of data output. For simulations involving no data output, there is significant performance degradation when increasing from 1 to 27 worker processes; however, total execution time thereafter remains near-constant. Simulating the interactions of 241 540 particles (in 27 subdomains) takes 1104.5 s of real time. By comparison, simulations of 8 700 121 particles (in 1000 subdomains) take 1617.5 s, an increase of approximately 500 s over 10 000 timesteps. Although the simulation with 1000 workers contains roughly 36 times as many particles as that with 27 workers, the execution time increases by only 46%.

The initial increase in execution times arises due to the increasing workload of one or more worker processes as the number of subdomains increases. For one worker process, there is no computation or communication overhead from subdomain decomposition (due to the lack of neighbouring subdomains). For the eight-worker simulation (2 × 2 × 2 subdomains), workers share at most three subdomain boundaries with neighbouring subdomains, whereas for 27 or more workers, at least one worker shares the maximum number of subdomain boundaries with its 6 neighbouring workers. Since interprocess synchronisation occurs only prior to the force summation and position update stages, increasing the number of workers with a full complement of shared subdomain boundaries causes relatively little increase in the total time to execute a single timestep of the simulation. The slight increase in execution time as the number of workers increases is likely due to a combination of communications latency and variations in processor speeds.

For simulations in which field saver output occurs, the total execution times closely track those of simulations without data output. Field saver output amounts to a constant time overhead (approximately 200 s over 10 000 timesteps), even for simulations with larger numbers of worker processes. For field saver output, the increase in simulation time between 27 and 1000 workers is 37%.

Figure 5. Total execution time in seconds versus number of worker processes for the four experiments: no data output (None), field saver output (FS), checkpoint file output (CP), and both field saver and checkpoint file output (FS & CP).

We also find that simulations with a given number of workers and field saver output are typically only between 10% and 20% slower than counterparts without data output. These results testify that inclusion of simple data output in the form of a field saver is not a significant overhead for simulations involving larger numbers of particles.

Simulations in which checkpoint files are saved also display weak scalability, albeit with a more significant linear increase in average execution time. Whereas the field saver only requires approximately 10 B of data to be stored per timestep, the amount of data saved in checkpoint files scales with the number of particles in the simulation. Approximately 2.3 MB of data are stored per worker process which, for the largest simulations, results in approximately 2.6 GB being stored to disk every 100 timesteps. We observe that runtimes for simulations involving 1000 workers are approximately three times those of simulations with 27 workers. The cause of this linear performance degradation is undoubtedly a filesystem bandwidth bottleneck, although further analysis is required to determine the extent to which such a bottleneck may be mitigated. It should be noted that storage of checkpoint files every 100 timesteps is highly unusual for production simulations. Typically checkpoint files would be stored every 10 000 timesteps for simulations comprising between 100 thousand and 1 million timesteps. In such cases, the performance degradation from data output would be negligible relative to the computational burden.

Writing both field savers and checkpointers does not increase the total execution time significantly, compared with writing checkpointers alone. This confirms the previous observation that storing field saver output is a relatively trivial overhead compared with the actual computations and storage of simulation state data in checkpoint files. For many production simulations, storage of checkpoint files is mandatory for post-analysis of simulation results and three-dimensional visualisation. Storage of field saver data is typically useful when the desired output quantities would be exceptionally time-consuming to obtain by post-analysis.
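
As a quick numerical check of the figures quoted in this section, the sketch below recomputes the relative time increase and a conventional weak-scaling efficiency from the reported no-output timings; the efficiency metric (reference time divided by scaled-run time) is a common definition and is not one stated above.

# Timings for the no-output runs, as reported in Section 5.
t_27, t_1000 = 1104.5, 1617.5                  # seconds over 10 000 timesteps
particles_27, particles_1000 = 241_540, 8_700_121

increase = (t_1000 - t_27) / t_27              # ~0.46, the reported 46%
efficiency = t_27 / t_1000                     # ~0.68 weak-scaling efficiency
size_ratio = particles_1000 / particles_27     # ~36x larger problem

print(f"time increase: {increase:.0%}, efficiency: {efficiency:.0%}, "
      f"problem size ratio: {size_ratio:.0f}x")
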

6. Conclusion

A scaling benchmark for ESyS-Particle, a parallel implementation of the Discrete Element Method, shows that ESyS-Particle scales well up to at least 1000 cores when simulating up to 8.7 million particles in a three-dimensional elastic wave propagation problem. The increase in computation time is as little as 46% on 1000 cores compared with 27 cores. The trend for simulation data output similarly shows a flattening of the curve toward constant time for increasing numbers of worker processes. On the basis of our analysis, ESyS-Particle is demonstrably an efficient platform for parallel DEM simulations, scaling to beyond 8.7 million particles on cluster supercomputers. One would expect ESyS-Particle to continue to display weak scalability well beyond the limit of 1000 worker processes considered in our analysis.

Acknowledgments

This work was supported by AuScope, funded by the Australian Federal Government National Research Infrastructure Strategy. The authors wish to acknowledge the useful contributions by colleagues at the Earth Systems Science Computational Centre. ESyS-Particle is Open Source Software, freely available from https://launchpad.net/esys-particle. The SGI cluster supercomputer used in this work was provided by the Queensland Cyber Infrastructure Foundation and The University of Queensland.

References

[1] P. A. Cundall and O. D. L. Strack, “A discrete numerical model for granular assemblies,” Geotechnique, vol. 29, pp. 47–65, 1979.

[2] K. Yamane, “Discrete-element method application to mixing and segregation model in industrial blending system,” Journal of Materials Research, vol. 19, pp. 623–627, 2004.

[3] W. Hancock and D. Weatherley, “3D simulations of block caving flow using ESyS-Particle,” in 1st Southern Hemisphere International Rock Mechanics Symposium, 2008, pp. 221–229.

[4] S. Latham, S. Abe, and M. Davies, “Scaling evaluation of the lattice solid model on the SGI Altix 3700,” in Proc. HPCAsia’04, 2004.

[5] S. Latham, S. Abe, and P. Mora, “Parallel 3-D simulation of a fault gouge using the lattice solid model,” Pure Appl. Geophys., vol. 163, pp. 1949–1964, 2006.

[6] T. Pöschel and T. Schwager, Computational Granular Dynamics: Models and Algorithms. Springer, 2005.

[7] L. Verlet, “Computer ‘experiments’ on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules,” Phys. Rev., vol. 159, no. 1, p. 98, Jul 1967.

[8] S. Abe, “Investigation of the influence of different microphysics on the dynamic behaviour of faults using the lattice solid model,” Ph.D. thesis, The University of Queensland, 2001.

[9] S. Abe, D. Place, and P. Mora, “A parallel implementation of the lattice solid model for the simulation of rock mechanics and earthquake dynamics,” Pure Appl. Geophys., vol. 161, pp. 2265–2277, 2004.

[10] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A high-performance, portable implementation of the MPI message passing interface standard,” Parallel Computing, vol. 22, pp. 789–828, 1996.

