Dynamic Load Balancing Techniques for Particulate Flow Simulations

Parallel multiphysics simulations often suffer from load imbalances originating from the applied coupling of algorithms with spatially and temporally varying workloads. It is thus desirable to minimize these imbalances to reduce the time to solution and to better utilize the available hardware resources. Taking particulate flows as an illustrating example application, we present and evaluate load balancing techniques that tackle this challenging task. This involves a load estimation step in which the currently generated workload is predicted. We describe in detail how such a workload estimator can be developed. In a second step, load distribution strategies like space-filling curves or graph partitioning are applied to dynamically distribute the load among the available processes. To compare and analyze their performance, we employ these techniques to a benchmark scenario and observe a reduction of the load imbalances by almost a factor of four. This results in a decrease of the overall runtime by 14% for space-filling curves.


Introduction
Simulations are becoming an increasingly prominent alternative to often expensive and time consuming laboratory experiments Therefore, engineers are interested in introducing more and more physical effects to explore complex phenomena with the help of computers. These multiphysics simulations thus feature a combination of different algorithms for the physics, like multiphase fluid flow [1], particle motion [2], and free surfaces [3]. Due to their computational cost, it is often necessary to utilize thread-based and distributed memory parallelization techniques to obtain results in reasonable time and to make use of the immense capabilities provided by today's supercomputers. This is commonly achieved by a spatial domain partitioning where the computational domain is subdivided into smaller pieces which are then distributed among the available processes, see e.g. [4]. As the majority of the parallel numerical codes consist of computation and synchronization, the desired properties of such a distribution are that the overall workload (originating from the computations) is spread evenly among the processes and that the amount of communication (due to synchronization) is minimized. If the first property is violated, the simulation suffers from load imbalances and the processes must wait at the synchronization points for the slowest process, i.e. the one with the largest workload, until the simulation can proceed. As a result, the runtime of the whole simulation increases and the hardware resources at hand are utilized less efficiently due to the idle times. Acquiring and maintaining a distribution with balanced workloads is thus crucial to achieve efficient parallel simulations.
Conceptually, the task of load balancing can be split into two steps: At first, referred to as load estimation, each subdomain is assigned a weight that quantifies the workload present on that subdomain. Based on these weights, a load distribution routine is executed in a second step to reassign the subdomains to the processes. An illustration is given in Figure 1. Depending on the characteristics w 5 w 6 w 7 w 8 Ω i : subdomain i w i : weight of subdomain i domain of process P 1 domain of process P 2 Figure 1. Illustration of the domain partitioning with load balancing. The computational domain Ω is subdivided into subdomains Ω i , such that Ω = i Ω i . The load estimator assigns a weight w i to each subdomain that quantifies the workload. The load distribution routine then assigns the subdomains to the available processes (here P 1 and P 2 ) such that the load per process is balanced.
of the underlying algorithms and the targeted degree of parallelism, various load balancing strategies have been developed, as reviewed in [5]. Meshless methods, like particle simulations, typically use the number of particles [6,7] or collisions [8] per subdomain as weight function. Consequently, this weight may vary significantly in space and time, requiring dynamic load balancing techniques. This can be achieved, e.g., by changing the geometric position of the subdomains [6,7] or further subdivision [8].
Contrary to that, in grid-based algorithms like finite volume or lattice Boltzmann methods, the weight of a subdomain is typically given as the number of grid cells [9][10][11][12]. As a special case, if the grid remains unchanged during the simulation and a constant number of cells per subdomain is chosen, a distribution of one subdomain per process is an obvious choice and thus encountered in many such programs. However, when adaptive mesh refinement (AMR) is applied to improve the accuracy and efficiency of these simulations [13,14], the workload is altered significantly and load balancing becomes an essential component. Apart from these techniques, load distribution routines have attracted a lot of attention throughout the years and various methods have been proposed, since they are also used for classical graph partitioning problems [5,15,16]. It becomes apparent, however, that the applied techniques differ substantially for meshless and grid-based methods where in the first case the subdomains are modified solely for load balancing whereas this is not desired for the grid-based methods. Determining a suitable load balancing approach for multiphysics simulations that incorporate both types of algorithms thus poses a challenge. Commonly, in cases where one of the algorithms is the workload-wise dominant one, the specific approaches of this algorithm are used, e.g. the grid-based fluid solver in particulate flow simulations [2,17], which comes at the cost of load imbalances. Alternatively, a different domain partitioning for each part can be used [6], which, however, requires expensive communication and complicated mapping mechanisms between the subdomains for distributed memory simulations. In this paper, we will therefore present a different strategy for dynamic load balancing of these simulations. We will use particulate flows as an illustrating example, which features a coupling between a fluid solver and a particle simulation. Our primary design objective is to enable massively parallel simulations on supercomputers for large physical systems. This restriction already excludes all algorithms that require global knowledge about process local quantities, e.g. the position of all particles, as they will inevitably not scale to several thousand or more processes. Additionally, we require that all parts of our simulation use the same domain partitioning to avoid the previously mentioned difficulties and bottlenecks. A variant that complies with these specifications is a static partitioning of the computational domain into blocks of constant size [4], as shown in Figure 1. Load balancing is achieved by dynamically distributing these blocks among the available processes with the goal to have a similar workload on each process by specifically assigning several blocks to each process. We develop a genuine load estimation approach to quantify the workload per block by taking into account all aspects of the coupled simulation and evaluate different load distribution routines in detail.  The remainder of this paper is thus structured as follows: At first, we briefly describe the numerical methods we apply for geometrically fully resolved particulate flow simulations in Sec. 2, consisting of the lattice Boltzmann method, a hard contact particle solver and the fluid-particle coupling mechanism. Next, in Sec. 3, we present and calibrate our load estimation strategy that predicts the block's weight based on locally available quantities. This estimator is then applied in Sec. 4, where additionally the performance of several load distribution approaches is investigated and compared. We summarize our findings and give an outlook on future directions in Sec. 5.

Numerical Methods
In this section, we briefly describe the numerical methods that we apply in our simulations of particulate flows. The fluid flow is computed via the lattice Boltzmann method and a non-smooth granular dynamics simulation accounts for the motion and collisions of the suspended particles. A coupling of these two approaches is established based on the momentum exchange principle which features geometrically fully resolved particle shapes [2,18]. This coupling is illustrated in Figure 2. We note, however, that most of the observations and results obtained in the following sections also carry over to other CFD methods, like finite volume methods, other particle collision resolution methods, like the discrete element method, and also other coupling algorithms, like the immersed boundary method.

Lattice Boltzmann Method
In the lattice Boltzmann method (LBM) [19,20], the three-dimensional domain is discretized with a uniform lattice. Each cell features q different particle distribution functions (PDF) f q where each one is associated with a lattice velocity c q . We employ the D3Q19 variant such that q ∈ {0...18}. The macroscopic quantities, density ρ f = ρ 0 + δρ f , with a mean density ρ 0 and a fluctuation δρ f , and fluid velocity u f , are obtained as moments of the PDFs in each grid cell x: The evolution of the PDFs is described by alternating between a collision and a stream step. We use the two-relaxation-time (TRT) collision operator [21] that results in the collision step where the PDFs are split into their symmetric f + q and anti-symmetric f − q parts with their respective collision times τ + and τ − . This collision step relaxes the PDFs linearly towards their equilibrium values where the lattice weights w q are taken from [22]. The succeeding streaming step is then given as The kinematic fluid viscosity ν is determined by the collision time and we use the following relations to define τ + and τ − [21]:

Rigid Body Dynamics
The particles suspended in the fluid interact with each other and walls via collisions. In order to account for these particle-particle or particle-wall collisions, we apply a rigid body solver which determines the collision forces and integrates the particles' position and velocity in time. The individual particles are thus represented by their exact geometric shape. Each particle is described by its position X p , orientation Q p , as well as its translational and rotational velocity, V p and W p . The dynamics are described by the Newton-Euler equations for rigid bodies, given acting forces F p and torques T p . These arise from external forces like gravity, hydrodynamic interactions with the surrounding fluid, and collisions. Here, the collisions are determined by a contact detection algorithm and are modeled as inelastic hard contacts [23], resulting in non-smooth functions for position and orientation. This effectively resolves overlaps between the particles. Additionally, the contact model includes Coulomb friction. The integration of the particle trajectories in time is carried out by a semi-implicit Euler method. As a result, a non-linear system of equations has to be solved for each particle and each contact [23]. More details about the algorithm and its implementation can be found in [23,24].

Fluid-solid Interaction
In order to realize a hydrodynamic coupling between the fluid and the geometrically fully resolved particulate phase, we make use of the momentum exchange method [25]. Here, the particles are explicitly mapped into the fluid domain, marking cells contained inside the particle as solid, in contrast to the fluid cells.
A no-slip boundary condition is applied along the surface of the particles which is given by the central linear (CLI) scheme as [21] fq(x, t + ∆t) =f q (x, t) + where the boundary velocity v b = V p + W p × (x b − X p ) is the particle's velocity at the position x b . The interpolation weight δ q is the distance between the cell center x and the actual surface position x b = x + δ q c q , normalized by the distance of the cell centers, i.e. c q ∆t . For spherical particles, it can be determined analytically as the result of a ray-sphere intersection problem. A sketch of the boundary handling procedure is provided in Figure 3.
solid fluid, near boundary fluid particle surface As a result of this boundary treatment, momentum is transferred from the fluid onto the submerged particle along the boundary link q, given as [26]: By summing up these contributions over all boundary links from all near-boundary cells (NB) next to a single particle, the total hydrodynamic force and torque are determined as respectively. Finally, due to the explicit mapping of the particles into the domain, cells that convert from solid to fluid due to moving particles need to restore PDF information before the simulation can continue. Here, we initialize the PDFs of such a transformed cell with its equilibrium values, Eq. (3), using the particle's surface velocity and a spatially averaged density [18].

Parallelization
The direct numerical simulations of complex particulate flow scenarios can exhibit enormous computational costs that can only be tackled by parallel execution on a large number of processes. Such large computers are designed with a distributed memory so that typically the message passing interface (MPI) is used to implement the parallel algorithms. We here employ the open-source WALBERLA framework [4,27] where all the above mentioned algorithms are implemented. It partitions the computational domain into blocks. They contain all the data for this subdomain and are then assigned to the available MPI processes to distribute the workload, see Figure 1. This requires the introduction of synchronization mechanisms for the fluid as well as for the particle simulation which both use the same domain partitioning as a result of the close coupling within the same framework.
The sketch in Figure 4 shows the applied techniques for two neighboring blocks. Each block is subdivided into uniform cells that are required for the fluid simulation by the lattice Boltzmann method, Sec. 2.1. To be able to carry out the streaming step, Eq. (4), across block borders, an extra layer of cells, a so called ghost layer, has to be introduced around each block. They contain the corresponding PDFs from the neighboring blocks that are to be streamed to cells inside the block at hand and thus have to be communicated beforehand. Similarly, the particle simulation defines local and shadow particles on a block [28]. The former are particles whose center of mass are contained inside this block.  To check for particle-particle collisions, however, all possible collision partners have to be known on a block. This necessitates that copies of all particles that intersect with the block, with the center of mass being contained inside another block, are present on that block as well. These copies are called shadow particles.
More details on the block-based domain partitioning and the communication routines can be found in [2,4,11,14].

Complete Algorithm
By combining the different components presented in the previous sections, we obtain the complete algorithm for coupled fluid-particle simulations with geometrically fully resolved particles, shown in Algorithm 1. Since the treatment of the boundary, Eq. (6), and the link-based computation of the interaction force, Eq. (7), require the same information, we fuse them to a single step in line 6 of the algorithm. Additionally, we note that the time step of the rigid body has to be chosen smaller than the one of the fluid simulation to obtain an accurate collision behavior [29]. Therefore, we introduce a sub cycling loop for the rigid body solver in which we subdivide the time step into S sub steps in which the hydrodynamic force acting on the particles is kept constant, see line 8 of Algorithm 1.   Perform LBM collision step, Eq. (2).

Description
The algorithm outlined in Algorithm 1 is a typical representative for a simulation procedure to solve multiphysics problems. One time step consists of contributions from the individual physical components, here fluid flow, Sec. 2.1, and rigid body dynamics, Sec. 2.2, as well as coupling algorithms for their interaction. Depending on the physical setup at hand, each of these parts generates a different workload within a block, used for the partitioning of the simulation domain, see Sec. 2.4. Taking particulate flows as an example, blocks with dense particle packings result in longer compute times, i.e. larger workloads, for the rigid body simulation in comparison to blocks without any particles. Additionally, in many physical problems, the location of these areas, and consequently the workload per block, changes over time. Since the simulation gets synchronized via communications in each time step, unbalanced workload distributions result in waiting times for the processes that own blocks with smaller workloads. These idle times reduce the overall efficiency of the simulation and result in a longer time to solution than ideally achievable. For successful load balancing, it is thus important to first estimate the workload per block. This is resembled by a scalar quantity and we denote it as weight of the block, see Figure 1. Once the weights of all blocks are known, a load distribution algorithm can then be used to reassign the blocks to the processes to establish a balanced state. In this section, we present our approach to develop a load estimator for particulate flow simulations with Algorithm 1. The goal is to find a function that is able to reliably predict the workload of a block, depending on local quantities that describe the state of this block. The workload is here given as the runtime of a single time step in this simulation, excluding communication routines. For that reason, simulations of a characteristic setup have to be carried out. During these simulations, the block local quantities and the corresponding runtimes of the different parts of the algorithm are continuously evaluated and stored. Based on this data, function fitting is applied to finally obtain a estimator function that can be used in all subsequent simulations to predict the workload for a block.
The simulation setup hast to be chosen such that different cases which can be encountered in a real application also occur in the simulation. This means that blocks with no particles and dilute as well as densely packed particles should be included in the setup and thus in the data. Only then, the function that is to be fitted to these measurements can later on predict these cases accurately enough. Specifically, we execute several simulations to obtain separate time measurements of the major parts of Algorithm 1 and vary some of the important parameters to increase the amount of data and the reliability. The block local quantities that are available in our fluid-particle simulation are given in Table 1 and their definition can be obtained from Figures 3 and 4. For function fitting, we first have to determine the form of the functions that map these quantities to the measured runtime. For each part, this is achieved by analyzing the structure of the individual algorithm. In a calibration step, those functions are then fitted to the timing measurements of the separate parts based on the block local quantities to determine the functions' coefficients. Finally, those functions are combined to obtain a complete estimator for the workload per block.
We note, that this procedure must be carried out only once as a preprocessing step for all upcoming simulations. Once the load estimator is found, it is simply applied, i.e. the fitted function is evaluated, in the simulations whenever load balancing is carried out. This procedure also only involves simulations of small systems which keeps the computational effort small.

Workload Contributions
In the following, we establish the form of the functions we use to fit the measured times based on the local quantities from Table 1. The contributions are mainly gathered from the structure of the respective implementation, which are briefly outlined for each part.

LBM module
The simulation of the fluid flow is carried out by the LBM, which consists of the collision (line 4 of Algorithm 1) and the stream (line 7) step. These are only carried out for cells marked as fluid and the pseudocode is given in Algorithms 2 and 3, respectively. The resulting workload heavily depends on the applied collision operator and the actual implementation [30,31] but it will generally be mainly determined by the number of cells C and the number of fluid cells F. Thus, a function that represents the workload generated by the two LBM steps is given by

Boundary handling module
The boundary handling procedure for LBM, line 6 of Algorithm 1, applies the chosen condition for each near boundary cell. More specifically, it does so for each fluid-boundary link of the near boundary cell As can be seen in the pseudocode of Algorithm 4, the resulting workload will thus be related to the number of these links. However, evaluating this number for each block is computationally more costly than simply counting the number of near boundary cells and the workload per link depends on the specific boundary condition, which we here do not distinguish. We therefore simply use a function of the form where we additionally included C to represent the outer loop.

Particle mapping module
Mapping the particles into the domain by marking the contained cells as boundary cells is an important part of the applied fluid-particle coupling algorithm, see line 2 of Algorithm 1. The pseudocode is outlined in Algorithm 5. The resulting workload thus depends on the number of particles, local and shadow, as well as the extent of their axis-aligned bounding box. Since the size of this box depends on the diameter of the particle, this information is not readily available. Instead, we can analyze the result of the mapping by including the number of non-fluid cells as a kind of block local solid volume fraction. Since shadow particles are only partially present on the block they will generate a smaller workload than local particle and we should thus distinguish workload contributions originating from local and shadow particles in our function. We therefore propose the following function: Algorithm 5: Pseudocode for particle mapping (Coup1).
1 for each particle p do 2 Obtain an axis-aligned bounding box (AABB) which fully contains the particle. 3 for each cell c in this AABB do 4 if cell c is contained in particle p then 5 Set flag to boundary (solid) flag.

PDF reconstruction module
The second part of the coupling algorithm, line 3 in Algorithm 1, reconstructs the PDF values in cells that have changed from being solid to being fluid due to the motion of the corresponding particle, see Algorithm 6. The generated workload thus depends on the position, orientation and velocity of the individual particles and is therefore difficult to predict in general. Assuming that chances for these cell state changes are higher if there are more particles around and the solid volume fraction is larger, we estimate its workload by: Algorithm 6: Pseudocode for PDF reconstruction (Coup2). The rigid body simulation part, line 9 of Algorithm 1, consists of several components which for simplicity will all be included in a single function. Algorithm 7 outlines these different sub steps [23]. The first part, contact detection, is typically of squared complexity since all particles have to be checked against all other particles to find possible contacts. Our implementation makes use of hash grids for optimizing this routine to obtain linear complexity. This, however, is only activated if a reasonable number of particles is present on a block since otherwise the computational overhead renders it less efficient. In a second step, the contact resolution treats each contact according to the applied collision model in order to resolve the overlaps between particles. Determining the needed information for these two steps is simple for spherical particles but can become complex and costly for other shapes. The last step uses a time integrator scheme to update the local particles' position and velocity. In the simulation algorithm, the whole rigid body algorithm is embedded into a loop over S number of sub cycles, see line 8 of Algorithm 1. Combining these steps, we use the following function to estimate the workload: WL RB = S a 1,RB (P L + P S ) 2 + a 2,RB P L + a 3,RB P S + a 4,RB K + a 5,RB Algorithm 7: Pseudocode for rigid body simulation.
1 for each particle p do 2 if p is in contact with another particle then 3 Generate a contact pair. Compute reaction to resolve contact. 8 end 9 for each particle p do 10 Integrate p in time using the acting forces. 11 end

Total workload estimator
Since, in the general case, only a single weight per block has to be provided, the aforementioned contributions must be combined in order to obtain a estimator function for the total workload per block. A natural and simple choice is to add up all individual functions, Eqs. (10)-(14), effectively combing the coefficients of the block quantities:

Simulation Setup
After having identified and defined the workload functions for the different modules of the algorithm, we now carry out multiple simulations of a characteristic setup to obtain timing information that are then used to fit the functions and determine the coefficients. To obtain a generally applicable estimator, the setup should be designed such that it contains various cases that are also encountered in typical applications involving particulate flows. Those are e.g. bounding walls, densely packed areas with many inter-particle collisions as well as dilute regions with only few or even no particles. Additionally, due to complex bounding geometries, some parts of the computational domain could be completely excluded from the simulation as neither fluid nor particles can enter these regions. We therefore chose a horizontally periodic setup with an initially random distribution of particles. Those particles are then set into a settling motion due to the acting gravity and then pack at the bottom plane until all particles have settled. Visualizations of the initial, intermediate and final state of the simulations with spherical particles can be seen in Figure 5.
The characteristic parameters of this setup are the Galileo number, Ga = u s D/ν = 30, with a characteristic settling velocity u s = (ρ p /ρ f − 1)gD, the diameter D and kinematic viscosity ν. The particles have a density ratio of ρ p /ρ f = 2.5 and are subjected to a gravitational acceleration g in vertical direction, resulting in gravitational and buoyancy forces. The domain size is L x × L y × L z with L x = 4b s , L y = 4b s and L z = 5b s − o t . Here, b s is the block size and we introduce a constant offset of the top wall o t = 1.05b s . The global solid volume fraction of the domain is 0.2. The simulation is run for 2.5 unitless time steps, where one time step is T = L z /u s . As the definition of the domain size already suggests, we subdivide the whole domain into 4 × 4 × 5 blocks of size b s × b s × b s .
As a result, we obtain five vertical layers of blocks with different characteristics throughout the simulation. The bottom layer experiences a continuous increase in the number of particles until it is densely packed. The second layer features the interface between the particle packing and the bulk fluid region at the end of the simulation. Blocks of the third layer are traversed by all upper particles during the settling phase and end up without any particles or boundaries. Similarly, the fourth layer ends up with no particles but, due to the offset in the top wall, with a constant number of boundary and (a) Initially random particle distribution (t * = 0).  near-boundary cells. The blocks of the topmost row contain neither fluid cells nor particles throughout the complete simulation since they are completely overlapped by the top wall. Those simulations are carried out for different typical block sizes b s /∆x = {24, 32, 48, 64} and diameters D/∆x = {10, 20} to obtain a large enough variance in the samples. Each simulation is executed in parallel on 80 processes of the SuperMUC Haswell cluster at the LRZ such that each process is assigned one block. This allows to obtain separate measurements for each block with 2000 samples each. In total, this results in around 1.3 × 10 6 data points that will be used for the function fitting. In the first 50 time steps, no measurements are taken to exclude possible warm-up effects of the hardware. Furthermore, we make use of thread pinning provided by the LIKWID tool [32,33] to obtain reliable measurements. Each sample consists of the current values for the block-local quantities from Table 1

Results of the Calibration
An example of the quantity evaluation and time measurement for the LBM module of the aforementioned simulation setup is shown in Figure 6. From the temporal evolution of F, the curves can be matched to a block of one of the five rows. Clearly, a correlation between F and m LBM can be seen which is in agreement with the assumption made in Eq. (10).
Before we evaluate the result of the calibration process of the workload estimator, we want to obtain insight into the proportion of the different parts of the total runtime, given by These proportions are different for each sample and are thus visualized in a compact format as a box-and-whiskers plot in Figure 7, showing the median, the upper, and lower quartile values as a box, as well as whiskers that extend to the 5 and 95 percentiles. It can be seen that the LBM and the BH part make up most of the workload with an average of 45% and 35%, respectively. This is followed by the rigid body simulation and the coupling parts with median values below 5%. The relatively large spread in the whiskers is introduced by the empty blocks of the fifth block row in the setup where total runtimes are very low and the different parts, except for Coup1, contribute similarly to it. We can conclude that for spherical particles with the chosen coupling algorithm, it is most important to accurately predict the workload from the lattice Boltzmann and the boundary handling routines.
With all samples available, we make use of the curve fitting functionality provided by the Python library SciPy in a post-processing step to obtain the coefficients from Eqs. (10)- (14). The outcome is presented in Table 2.
To quantify the quality of the fit for each part, various measures can potentially be evaluated. One of them is the relative error of predicted workload by the five different fits for each individual sample, calculated as E X = (WL X − m X )/m tot . It is visualized as histograms for the different parts and the total workload in Figure 8. Additionally, the median and the median absolute deviation (MAD), which is a robust measure of the statistical dispersion, are stated. Note that we calculate the relative error with respect to the total runtime instead of m X to avoid the division by almost zero for the small runtimes of Coup1 and RB, see Figure 7.
For the LBM part, we obtain good predictions of the workload with a median close to zero. Outliers can be seen at some distinct error values which originate from the empty blocks where the prediction is generally worse. A similar conclusion can be drawn for the BH part, where again the median is close to zero and some outliers can be found, this time on the negative side, i.e. due to underpredictions of the workload. These samples can again be attributed to those with empty blocks but also the ones of the third row which eventually end up with no boundary cells like the empty blocks. The Coup1 part shows an acceptable median value but a relatively large MAD value due to the outliers at around 30%. These are again the empty blocks where the assumption we have formulated when establishing our fit function in Eq. (12), i.e. the mapping should be related to the number of boundary cells as a measure of the solid volume fraction, breaks down as no particles are present. Similarly, for Coup2 the median is close to zero but the MAD value is relatively large. As discussed when setting up the equation for the fit, Eq. (13), predicting the workload for this part is especially difficult as it can not be directly related to any of the available quantities. This is then also the source of deviations from the predictions. At last, the RB part is well-predicted in average and also the variance is small. Considering the relative error for the total runtime, shown in the last histogram, the obtained estimator from Eq. (15) is able to predict the observed runtimes with good accuracy. Outliers originating from the empty blocks of different block sizes where the fits do not work as good can be seen. We also find that over 85% of all samples are predicted with a relative error of less than 10%.

Discussion
In this section, we have constructed a specific workload estimator for coupled fluid-particle simulations with the methods from Sec. 2. Here, the derivation of the specific form of the fit functions, Eqs. (10)- (14), was purely driven by the algorithms themselves. These functions can thus only be valid assumptions for the here applied algorithms. As they calibration process of the coefficients relies on data from actual runtime measurements, the obtained coefficients for these functions also depend on the hardware on which the corresponding measurements were obtained. However, it can be expected that they still remain good predictors if the hardware design is similar to the used one. In case of larger differences, like e.g. using a GPU instead of a CPU to run the simulation [34,35], this might no longer be the case and the measurements and fits should be redone. A possible improvement to overcome this drawback would be to also add hardware details, like cache sizes, clock frequency, etc., to the estimator and try to come up with a performance model for hardware-aware predictions. Since this is a topic of active research, even for simple, stencil-based algorithms like LBM [31,36], this is not further pursued here.
Generally, the here reported values for the coefficients from Table 2 should not be blindly reused in other programs even if the same algorithms and hardware are used, as also the actual implementation of the algorithms can have a large influence on the runtimes. Nevertheless, by following the procedure presented in this section as a general recipe, workload estimators for many different applications, including pure CFD or complicated multiphysics setups, can be obtained. They are then tailored and calibrated for the specific algorithms, hardware and implementation, and can be used in upcoming simulations of that kind to predict, and in a next step also distribute, the workload.
Besides the form of the fit functions, the simulation setup to generate the measurements has to be selected with care. As reported in Sec. 3.4, in our case the predictions of the empty blocks were generally worse than for other, regular blocks as some assumptions broke down. We still included them in the calibration procedure in order to obtain a general workload estimator and to avoid any special treatment of these empty blocks in our load estimation step. However, if empty blocks do not appear in the simulations in which this workload estimator will then be applied, we suggest to remove them from the test scenario as well. This improves the quality of the fits notably and a more accurate estimator can be obtained.

Description
Once the calibration process of the workload estimator has been executed, Eq. (15) with coefficients from Table 2, it can be applied in the subsequent simulations to assign a weight to each block, representing its workload. As a second step, these weights are now used by load distribution methods to reassign the blocks to the processes dynamically throughout the simulation. Besides the reduction of load imbalances, minimizing communication costs between the different processes and keeping the redistribution costs low are the key aspects of this step. We briefly review the different types of distribution methods in the next section. Next, we set up a simulation that is different from the one used to deduce the load estimator in Sec. 3.3. With this setup, we demonstrate the capabilities of our workload estimation approach and can compare the performance of different load distribution strategies.

Space-filling curve
Generally, a space-filling curve has the property that it covers the entire n-dimensional unit hypercube. Typically, these curves are constructed recursively from a sequence of piecewise linear continuous curves, following a specific construction pattern. Different versions of such curves exist, e.g. the Peano, Hilbert, or Morton curve, as reviewed in [15]. As a practical result, a curve is obtained that connects all blocks of the computational domain and thus determines a one-dimensional ordering of these blocks. For load distribution and while knowing the block weights as well as the total weight of all blocks, the blocks are assigned to the processes in a greedy manner. This means that one traverses the curve and picks the blocks until the sum of these block weights is approximately equal to the total weight divided by the number of processes. Due to the construction of these curves, the assigned blocks are usually geometrical neighbors and thus a reduction of the inter-process communication is implicitly achieved. More information about space-filling curves can be found in [15].

Diffusive distribution
Load distribution via space-filling curves requires global information about the blocks and their weights. For massively parallel simulations, this procedure poses an upper limit on the applicable number of processes as it inherently does not scale and also might require too much memory on a single process [8,14]. For these cases, diffusive techniques become the only feasible approach to distribute the load. There, processes require only information about the load distribution in their direct neighborhood and will then try to even out possible imbalances by sending individual blocks to these neighboring processes. The load gets distributed in a diffusive manner and thus several iterations are required to obtain good results. However, a well-balanced load cannot be guaranteed and also the blocks on a process might get fragmented over time which increases communication costs.

Graph partitioning
The primary goal of these approaches is to partition unstructured graphs such that the edge cut is minimized, corresponding to a reduced communication between the processes. This requires to determine weights for the edges that resemble the communication cost between neighboring blocks. As an additional optimization constraint, weights for the vertices, i.e. here the blocks, can be provided in order to balance these as well. Since graph partitioning is a common problem in various application fields, several algorithms are available [5,16,37]. The MPI-parallel library ParMETIS [38] is a commonly applied choice for this multi-objective optimization task. It offers various algorithms to construct, improve or update graph partitions that can be further tuned by specifying additional parameters. Furthermore, if provides functionalities to deal with multi constraint problems where several weights per block can be given to account for multiphysics applications [39,40]. Generally, it makes use of combinations of space-filling curves, multilevel algorithms and diffusive distributions to obtain a graph partitioning.
Specifically, the required edge weights EW between two vertices v 1 and v 2 , i.e. here blocks with b 3 s cells, are evaluated as follows to resemble the communication volume for the LBM part: ParMETIS offers various algorithms and tuning parameters. We restrict ourselves to the algorithms PartGeomKway and AdaptiveRepart. The latter one is chosen since it is supposedly  particularly well-suited for simulations with adaptive grid refinement, which is one natural use case of load balancing in general. The proposed default values for the load imbalance tolerance, ubvec = 1.05, and the ratio between inter-process communication time compared to data redistribution times, itr = 1000, are used.

Simulation Setup
The simulation domain is a rectangular box that is subdivided into 12 × 12 × 16 blocks of size 32 3 cells. This results in a total of 2304 blocks. By using N p = 256 processes of the SuperMUC cluster, a uniform block distribution is achieved with 9 blocks per process. We use four inclined planes to construct a fluid domain where the horizontal cross section is reduced by 60% towards the bottom plane, resembling a hopper as illustrated in Figure 9. As a result, several blocks remain empty throughout the whole simulation. The domain contains about 4300 spherical particles of diameter D/∆x = 15 that are initially densely clustered along the top plane and then settle under the effect of gravity. The Galileo number is Ga = 50, the density ratio is ρ p /ρ f = 1.5, the number of rigid body solver sub cycles is set to S = 10, and the simulation is run for 80000 time steps. The temporal evolution of the simulation can be seen in Figure 9.
In order to distribute the blocks among the available processes based on the predicted loads, we use several of the strategies presented in Sec. 4.2 and evaluate their performance. For space-filling curves, Hilbert and Morton orders are applied and the mentioned ParMETIS variants are also included. In all cases, the load balancing step that consists of load estimation and load distribution is triggered every 100 time steps. These variants are compared against the case where no load balancing is applied at all. However, since the initial distribution of 9 blocks per process is non-trivial, we use the static partitioning given by the Hilbert curve for that case, which is thus called once at the beginning of the simulation.

Results of the Hopper Simulation
In Figure 10, the temporal evolution of the runtime for 100 time steps is shown for the four different load distribution variants and compared against the case with no load balancing. From these results, we see that without load balancing the runtime decreases until time step 40000, which is shown in the middle of Figure 9. This corresponds to the transition from an initially densely packed bed to a fully suspended settling with only few inter-particle collisions. Afterwards, when the clustering at the bottom sets in, the simulation again gets slower. On the other hand, both cases that use space-filling  curves for load distribution yield a relatively constant runtime throughout the whole simulation and are both very similar in their behavior. Except for some fluctuations halfway through the simulation, the space-filling curves are significantly faster than the default case. The two ParMETIS variants also feature low runtimes at the beginning which then steadily increase as the simulation advances, leading to larger runtimes than the default case. Again, the trend of both curves is rather similar but the AdaptiveRepart variant exhibits stronger oscillations. The time to solution of the whole simulation is evaluated as the cumulated sum of the values depicted in Figure 10 and is summarized in Table 3. In total, load balancing with space-filling curves reduces the overall runtime by 14% whereas using ParMETIS even results in a slight increase of the runtime. Compared to the rather simple space-filling curves, the more complex load distribution algorithms of ParMETIS are also reflected in the larger fraction of the runtime spent in the load balancing procedure. As a result, however, the ParMETIS variants show the smallest edge cut, i.e. the sum of all edge weights from Eq. (17) that are cut by process boundaries.
Since the overall goal of load balancing is to reduce possible load imbalances, the latter ones are again evaluated in intervals of 100 time steps and shown in Figure 11. The load imbalance LI, ideally close to zero, is here defined as the normalized difference between the maximum and the average runtime per interval, i.e.  Figure 11. Load imbalance over the time steps for simulation shown in Figure 9.
where m tot is the runtime of the workload generating parts, see Eq. (16), summed up over the 100 time steps, and p denotes one of the N p = 256 processes. This definition takes into account that the performance of the simulation is most affected by a few processes that are slower than average, resulting in many idle processes, whereas having some processes that are faster than average is less problematic, as only a few processes have to wait. Our definition thus differs from [6] who used the normalized difference between the maximum and minimum runtime. The evolution of the load imbalance for the case without load balancing resembles the behavior from Figure 10 with a drop from 90% to around 40%, followed by an increase to 70%. The space-filling curves exhibit the lowest imbalances with a median of 17%, and thus almost four times smaller than the default case. Two distinct behaviors can be seen for ParMETIS: the PartGeomKway variant is slightly worse than the space-filling curves with an median of 22% whereas the AdaptiveRepart case shows larger imbalances together with strong fluctuations.

Discussion
The results from this comparative study show the importance of choosing an appropriate load distribution technique. Space-filling curves achieve a reduction of the overall runtime by 14% compared to a simulation in which no load balancing is applied. On the other hand, ParMETIS, which relies on space-filling curves but combines them with more advanced partitioning strategies, does not yield an improvement in the runtime. Even though the ParMETIS variants show a reduction of the load imbalance and the edge cut, they ultimately fail in reducing the time to solution. One reason for that is certainly but only partly that the runtimes for the load balancing step are significantly higher than for the space-filling curves. Also, some improvements can be expected by optimizing the parameters that have to be specified for the ParMETIS routines. This specialization, however, runs contrary to a widely and readily applicable load distribution technique. Another reason is that ParMETIS, like many other graph partitioning tools, is not designed to work with an average of, here, 9 blocks per process but usually expects around 1000 or more. This would e.g. be the case for cell-based domain partitioning, but unrealistic for block-based ones, and with this larger granularity the partitioning routines are expected to work faster and better. The observable strong fluctuations in the runtime and load imbalance graphs can possibly addressed to be a direct cause of this difficulty. Lastly, our choice of the edge weights, Eq. (17), can be improved as it currently is solely based on the volume of the communication of the LBM data structures. Thus, the communication by the particle simulation is not regarded but might change the weights notably. One could, therefore, also try to find an estimator for the edge weights similarly to the procedure used to come up with a vertex weight estimator.
We note that these results are specific for the here simulated setup and cannot be generalized easily. They also show, however, that load balancing is a viable tool to reduce the time to solution if carefully applied and systematically compared. It is especially helpful for cases where strong heterogeneities exist in the simulation, as it is the case for the initial and end phase of the here carried out simulation.

Conclusion
In this work, we present and evaluate different techniques to improve the performance of particulate flow simulations that exhibit spatially and temporally varying workloads during their parallel execution.
To this end, first a workload estimator is designed and calibrated that predicts the workload for each block of the domain using locally available information. It is based on analyzing the involved algorithms and setting up suitable functions that describe the generated load. The coefficients of the functions are determined by fitting them to timing measurements obtained from simulations of a small but representative setup. Choosing such a setup is a crucial step as all relevant phenomena and properties that can lead to varying workloads must be included to obtain a generally applicable estimator. Since timing measurements are influenced by various factors, like the hardware used or the actual implementation details, major changes in these might require a reevaluation of the fitted coefficients. Even though the article focuses on particulate flow simulations, the presented approach can be readily used for other multiphysics simulations.
In a second step, this workload estimator is applied in a more complex simulation scenario in conjunction with a load distribution technique to reduce load imbalances between the processes. We compare and evaluate load distribution via space-filling curves and with the help of the graph partitioning tool ParMETIS to the case without any load balancing. For the case of space-filling curves, a significant reduction of the time to solution by 14% can be observed, demonstrating how these techniques can be used to utilize hardware resources more efficiently by minimizing load imbalances. On the other hand, the ParMETIS-based distributions are unable to improve the runtime for this specific scenario. This can be attributed to the relatively large computational overhead produced by the load balancing step but also generally to the different focus of these graph partitioning tools that would require a finer granularity.
All in all, this study shows that load balancing is an effective tool for parallel multiphysics simulations where workloads change in time and space. It is typically used in combination with adaptive mesh refinement, where it becomes a necessity due to the strongly varying work loads when the grid is updated. Thus, naturally, we aim to apply the here presented load balancing approach with AMR. But, as we have shown, it can also be advantageous to apply it with uniform grids to reduce the time to solution of the simulation.
Future work will pursue further improvements of the load estimator as well as the load distribution step. By including information available from performance models for the different algorithms, the workload estimator can be made more general and flexible. Tools like Kerncraft [41] automatically analyze the performance of a given implementation for the hardware at hand, which would render the estimator independent of these factors. Furthermore, a workload estimate based on the current runtimes is a natural alternative to the proposed predictor as it is able to use actual data from the currently running simulation. Such an estimator can not be used in an a-priori fashion which renders it less useful for adaptive grid refinement where new blocks are created and have to be distributed immediately to avoid huge imbalances. However, a combination of both strategies would result in a kind of predictor-corrector approach and could reduce load imbalances more effectively. For the load distribution step, other graph partitioning libraries are available, e.g. Zoltan [42], PT-Scotch [43] or Geographer [44]. In combination with a better estimator for the communication weights, their performance should thus be evaluated and again be compared to the space-filling curves.
Author Contributions: Christoph Rettinger developed the presented algorithms, implemented them, realized the simulations and evaluated the results. Furthermore, he wrote major parts of this article. Ulrich Rüde supervised the work and contributed to the arrangement of this article by significantly improving the structure and wording of this manuscript.
Funding: This research received no external funding.