# Dynamic Load Balancing Techniques for Particulate Flow Simulations


## Abstract


## 1. Introduction

## 2. Numerical Methods

### 2.1. Lattice Boltzmann Method

### 2.2. Rigid Body Dynamics

### 2.3. Fluid-Solid Interaction

### 2.4. Parallelization

### 2.5. Complete Algorithm

**Algorithm 1.** Coupled algorithm for particulate flow simulations.

## 3. Development and Calibration of a Workload Estimator

### 3.1. Description

### 3.2. Workload Contributions

#### 3.2.1. LBM Module

**Algorithm 2.** Pseudocode for LBM (collision).

**Algorithm 3.** Pseudocode for LBM (stream).

#### 3.2.2. Boundary-Handling Module

**Algorithm 4.** Pseudocode for boundary-handling.

#### 3.2.3. Particle-Mapping Module

**Algorithm 5.** Pseudocode for particle-mapping (Coup1).

#### 3.2.4. PDF Reconstruction Module

**Algorithm 6.** Pseudocode for PDF reconstruction (Coup2).

#### 3.2.5. Rigid Body Simulation Module

**Algorithm 7.** Pseudocode for rigid body simulation.

#### 3.2.6. Total Workload Estimator

### 3.3. Simulation Setup

### 3.4. Results of the Calibration

### 3.5. Discussion

## 4. Comparison of Load Distribution Methods

### 4.1. Description

### 4.2. Load Distribution Algorithms

#### 4.2.1. Space-Filling Curve
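The Hilbert and Morton orders evaluated later in this section can be illustrated with a minimal sketch; this is illustrative only and not the waLBerla implementation. A 3D Morton (Z-order) index is obtained by interleaving the bits of the block coordinates, and sorting blocks by this index yields a locality-preserving linear order that can then be cut into contiguous, equally weighted chunks per process.

```python
def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the bits of (x, y, z) into a single Z-order (Morton) index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)      # x occupies bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)  # y occupies bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)  # z occupies bit positions 2, 5, 8, ...
    return code

# Sorting blocks by their Morton index gives a locality-preserving 1D order;
# contiguous chunks of that order are then assigned to the processes.
# The 4 x 4 x 5 block grid matches the partitioning shown in Figure 5.
blocks = [(x, y, z) for x in range(4) for y in range(4) for z in range(5)]
curve_order = sorted(blocks, key=lambda b: morton3d(*b))
```

A Hilbert curve preserves locality even better (consecutive indices are always face-neighbors), at the cost of a more involved index computation.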

#### 4.2.2. Diffusive Distribution

#### 4.2.3. Graph Partitioning

For the graph partitioning, the ParMETIS routines `PartGeomKway` and `AdaptiveRepart` are used. The latter is chosen since it is supposedly particularly well-suited for simulations with adaptive grid refinement, which is a natural use case of load balancing in general. The proposed default values for the load imbalance tolerance, `ubvec` $=1.05$, and for the ratio of inter-process communication time to data redistribution time, `itr` $=1000$, are used.
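The meaning of the `ubvec` tolerance can be sketched as follows; the definition below is the standard maximum-to-average load ratio and is stated here as an assumption, not quoted from ParMETIS internals.

```python
def load_imbalance(loads):
    """Ratio of the maximum per-process load to the average load.
    A value of 1.0 is perfect balance; `ubvec` bounds this ratio."""
    avg = sum(loads) / len(loads)
    return max(loads) / avg

# With ubvec = 1.05, a partition is acceptable if no process exceeds
# the average load by more than 5%.
ubvec = 1.05
loads = [98.0, 101.0, 103.0, 100.0]  # hypothetical per-process loads
acceptable = load_imbalance(loads) <= ubvec
```

The `itr` parameter complements this: a large value (such as the default of 1000) tells the repartitioner that redistribution is cheap relative to ongoing communication, so it should prioritize partition quality over minimizing data movement.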

### 4.3. Simulation Setup

### 4.4. Results of the Hopper Simulation

The `AdaptiveRepart` variant exhibits stronger oscillations in the runtime.

The `PartGeomKway` variant is slightly worse than the space-filling curves, with a median of 22%, whereas the `AdaptiveRepart` case shows larger imbalances together with strong fluctuations.

### 4.5. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest


**Figure 1.** Illustration of domain partitioning with load balancing. The computational domain $\mathsf{\Omega}$ is subdivided into subdomains ${\mathsf{\Omega}}_{i}$, such that $\mathsf{\Omega}={\bigcup}_{i}{\mathsf{\Omega}}_{i}$. The load estimator assigns a weight ${w}_{i}$ to each subdomain that quantifies its workload. The load distribution routine then assigns the subdomains to the available processes (here ${P}_{1}$ and ${P}_{2}$) such that the load per process is balanced.
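The two-stage scheme of Figure 1 — estimate a weight per subdomain, then distribute subdomains across processes — can be sketched with a simple greedy longest-processing-time assignment. This sketch is illustrative only; the distribution algorithms actually compared in Section 4 are space-filling curves, diffusive schemes, and graph partitioning.

```python
import heapq

def distribute(weights, num_processes):
    """Greedy longest-processing-time assignment: give each subdomain,
    heaviest first, to the currently least-loaded process."""
    heap = [(0.0, p) for p in range(num_processes)]  # (accumulated load, process id)
    heapq.heapify(heap)
    assignment = {}
    for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, p = heapq.heappop(heap)   # least-loaded process so far
        assignment[i] = p
        heapq.heappush(heap, (load + weights[i], p))
    return assignment

# Four subdomains with unequal (hypothetical) weights w_i, two processes
# corresponding to P_1 and P_2 in Figure 1:
weights = [4.0, 3.0, 2.0, 1.0]
assignment = distribute(weights, 2)
```

In a dynamic setting this assignment would be recomputed whenever the estimated weights drift, which is exactly where migration cost starts to matter and the more sophisticated distribution methods of Section 4 come into play.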

**Figure 2.** Schematic representation of the numerical methods presented in Section 2 that are used for the particulate flow simulations in this work.

**Figure 3.** Sketch of the particle-mapping and the boundary treatment according to the CLI boundary scheme from Equation (6).

**Figure 4.** Domain partitioning into blocks. The cell-based fluid simulation requires a ghost layer for synchronization. The particle simulation uses shadow particles. The two bold vertical lines denote the common face of both blocks.

**Figure 5.** Workload evaluation simulation with ${b}_{s}/\Delta x=32$ and $D/\Delta x=10$. Black boxes show the $4\times 4\times 5$ blocks used for domain partitioning. The magnitude of the fluid velocity is shown in two slices at the back of the domain. (**a**) Initially random particle distribution (${t}^{*}=0$). (**b**) Ongoing settling (${t}^{*}=1.3$). (**c**) Final, completely settled state (${t}^{*}=2.5$).

**Figure 6.** Number of fluid cells (**left**) and time measurements (**right**) of the LBM part per block over time (${t}^{*}=t/T$) of the workload evaluation simulation with ${b}_{s}/\Delta x=32$ and $D/\Delta x=10$. The coloring indicates the affiliation of each block with one of the five rows in the setup.

**Figure 7.** Box-and-whisker plot of the proportion of each part’s runtime relative to the total runtime, ${m}_{\mathrm{X}}/{m}_{\mathrm{tot}}$, for each sample.

**Figure 8.** Relative errors ${E}_{\mathrm{X}}$ of the predicted workload for the five different parts and the total runtime.

**Figure 9.** Visualization of the temporal evolution of the particles and the fluid velocity in the hopper clogging simulation. (**a**) Initial state with block structure. (**b**) Active settling phase. (**c**) Final, completely settled state.

**Figure 11.** Load imbalance over the time steps for the simulation shown in Figure 9.

Variable | Description
---|---
$C$ | total number of cells on a block
$F$ | number of cells flagged as fluid
$B$ | number of cells flagged as near boundary
${P}_{L}$ | number of local particles
${P}_{S}$ | number of shadow particles
$K$ | number of contacts between rigid bodies
$S$ | number of sub-cycles of the rigid body simulation

Coefficient | Value
---|---
${a}_{1,\mathrm{LBM}}$ | $9.99\times 10^{-6}$
${a}_{2,\mathrm{LBM}}$ | $1.57\times 10^{-4}$
${a}_{3,\mathrm{LBM}}$ | $-8.23\times 10^{-2}$
${a}_{1,\mathrm{BH}}$ | $6.65\times 10^{-6}$
${a}_{2,\mathrm{BH}}$ | $7.06\times 10^{-4}$
${a}_{3,\mathrm{BH}}$ | $-1.09\times 10^{-1}$
${a}_{1,\mathrm{Coup}1}$ | $3.08\times 10^{-6}$
${a}_{2,\mathrm{Coup}1}$ | $2.42\times 10^{-7}$
${a}_{3,\mathrm{Coup}1}$ | $1.41\times 10^{-2}$
${a}_{4,\mathrm{Coup}1}$ | $2.78\times 10^{-2}$
${a}_{5,\mathrm{Coup}1}$ | $-1.40\times 10^{-1}$
${a}_{1,\mathrm{Coup}2}$ | $5.99\times 10^{-6}$
${a}_{2,\mathrm{Coup}2}$ | $3.90\times 10^{-6}$
${a}_{3,\mathrm{Coup}2}$ | $-8.80\times 10^{-3}$
${a}_{4,\mathrm{Coup}2}$ | $2.51\times 10^{-2}$
${a}_{5,\mathrm{Coup}2}$ | $-1.30\times 10^{-1}$
${a}_{1,\mathrm{RB}}$ | $1.16\times 10^{-6}$
${a}_{2,\mathrm{RB}}$ | $9.62\times 10^{-4}$
${a}_{3,\mathrm{RB}}$ | $2.75\times 10^{-4}$
${a}_{4,\mathrm{RB}}$ | $1.48\times 10^{-3}$
${a}_{5,\mathrm{RB}}$ | $1.88\times 10^{-2}$
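For illustration, fitted coefficients like those above can be plugged into a simple linear-plus-offset model in the per-block quantities (cells $C$, fluid cells $F$, and so on). The specific feature choice below — $C$ and $F$ for the LBM module, with the last coefficient acting as a constant offset — is an assumption made for this sketch; the paper's exact functional forms are given in Section 3.2.

```python
def estimate(coeffs, features):
    """Hypothetical workload model: w = a_1*f_1 + ... + a_{n-1}*f_{n-1} + a_n,
    where the last coefficient is a constant offset. Illustrative only."""
    assert len(coeffs) == len(features) + 1
    return sum(a * f for a, f in zip(coeffs, features)) + coeffs[-1]

# LBM module with the fitted coefficients from the table above and a
# hypothetical block of C = 32**3 cells, F = 30000 of them fluid:
a_lbm = [9.99e-6, 1.57e-4, -8.23e-2]
w_lbm = estimate(a_lbm, [32**3, 30000])
```

The total block weight would then be the sum of such per-module estimates, which is the quantity the load distribution methods of Section 4 balance across processes.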

**Table 3.** Comparison of the total runtime of the hopper simulation for the four load distribution variants, relative to the case without load balancing. Additionally, the runtime of the load balancing step relative to the total runtime and the simulation-averaged edge cut are given.

Variant | Relative Runtime (%) | Load Balancing Fraction (%) | Mean Edge Cut
---|---|---|---
no load balancing | 100.0 | 0.0 | $4.5\times {10}^{6}$
Hilbert | 86.0 | 2.0 | $4.3\times {10}^{6}$
Morton | 86.0 | 1.9 | $4.8\times {10}^{6}$
ParMETIS AdaptiveRepart | 101.9 | 3.9 | $3.6\times {10}^{6}$
ParMETIS PartGeomKway | 102.3 | 6.8 | $3.6\times {10}^{6}$

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rettinger, C.; Rüde, U.
Dynamic Load Balancing Techniques for Particulate Flow Simulations. *Computation* **2019**, *7*, 9.
https://doi.org/10.3390/computation7010009
