Cross-Platform GPU-Based Implementation of Lattice Boltzmann Method Solver Using ArrayFire Library

This paper deals with the design and implementation of a cross-platform lattice Boltzmann method solver (D2Q9-BGK and D3Q27-MRT) for 2D and 3D flows, developed with the ArrayFire library for high-performance computing. The solver leverages ArrayFire's just-in-time compilation engine to compile high-level code into optimized kernels for both CUDA and OpenCL GPU backends. We provide C++ and Rust implementations and show that it is possible to produce fast, cross-platform lattice Boltzmann method simulations with minimal code, effectively less than 90 lines. Illustrative benchmarks (lid-driven cavity and Kármán vortex street) for single- and double-precision floating-point simulations on 4 different GPUs are provided.


Introduction
The popularity of the lattice Boltzmann method (LBM) has steadily grown since its inception from lattice gas automata [1] more than three decades ago. Lattice gas automata are a type of cellular automaton used to simulate fluid flows and were the precursor to the LBM; from them it is possible to derive the macroscopic Navier-Stokes equations. A disadvantage of the lattice gas automata method is statistical noise; another problem is the difficulty of extending the model to the 3D case. For these reasons, the LBM started to rise in the early 1990s as an alternative procedure [2]. As a mesoscopic method, filling the gap between macroscopic Navier-Stokes solvers and microscopic molecular dynamics, it has been an important tool for numerical simulations of multi-component, multiphase flows [3][4][5], flows in porous media [6,7], turbulent flows [8], and lately for more complex fluid flows, such as Bose-Einstein condensates [9], the interaction of (2+1)-dimensional solitons [10], or the modeling of viscous quasi-incompressible flows [11]. Thanks to the computational simplicity of the LBM and its spatial and temporal locality, it is naturally suited to parallel computing [12].
Recently, the increase in computational power and advances in general-purpose computing on GPUs (GPGPU) opened the door for real-time and interactive computational fluid dynamics (CFD) simulations [13][14][15][16][17]. Together with the performance and speed of the LBM, it is now possible to compute several hundred iterations per second, which makes interaction with a simulation in progress possible [18]. Getting instant feedback in response to changes of various simulation parameters gives researchers the ability to iterate faster toward an accurate model, a better understanding of the underlying phenomena, or the use of simulation within the control of industrial systems. It is, therefore, desirable to push the limits of the execution speed of LBM simulations. Developers have to be careful with memory limitations, even though GPUs provide high memory bandwidth, as LBM algorithms tend to consume large amounts of memory for storing the data. GPU architecture is designed for high data throughput thanks to the combination of the single instruction, multiple data (SIMD) execution model and multithreading, called single instruction, multiple threads (SIMT). With the parallel nature of the LBM algorithm, CFD simulations that use it can achieve high speeds not just on high-performance computing (HPC) systems but also on PC workstations with a single GPU. For applications with real-time or online interactive visualization of the running simulation, high frame rates are easier to achieve on such workstation PCs because of the high bandwidth of the PCIe slot. On networked HPC systems, the transfer speeds are limited by network throughput and higher latency; therefore, transferring data from a GPU on the HPC system back to the PC client for visualization is much slower [19]. In this study, we use GPUs that have between 70 and 900 GB/s theoretical maximum memory bandwidth.
To attain more computational power from GPUs, algorithm optimization techniques for parallel computing need to be considered. Compiled code needs to be vectorized and multithreaded to leverage the capabilities of massively parallel architectures [13]. This is typically done by using specific compilation flags to automatically parallelize code (e.g., NVIDIA's NVC++ compiler with stdpar), by writing GPU-specific code (compute kernels) with frameworks like CUDA and OpenCL, or by compiling to both CUDA and OpenCL from the same code, without specialized compiler directives, using a cross-platform library like ArrayFire. In multi-GPU setups, a message passing framework such as MPI is usually employed, often combined with shared-memory threading such as OpenMP.
In this paper, a cross-platform solver for LBM simulations is presented, discussing the advantages of using the ArrayFire library [20] for writing optimized CUDA and OpenCL software without worrying about hand-tuned optimization of custom kernels for specific platforms.
The present work is organized as follows. Section 1 introduces the problem. In Section 2, we present the computational framework of LBM together with the structure of our cross-platform LBM solver, discussing the advantages and disadvantages of the ArrayFire computational library. In Section 3, we provide a performance analysis of the implemented solver, benchmarking it on consumer-grade GPUs with well-known academic test cases. In Section 4, we discuss the next steps in future research building on the foundations of this study. Section 5 concludes this article with some comments.

Materials and Methods
An LBM solver with the D2Q9 stencil (a two-dimensional computational domain with fluid particle distributions moving in 9 velocity directions to neighboring nodes) and a single relaxation time, and one with the D3Q27 stencil (three dimensions with 27 velocity directions) and multiple relaxation times, were implemented, using GPUs for high-performance computation. Various optimization techniques exist to parallelize and optimize LBM algorithms [15,18,[21][22][23][24]. We kept manual optimizations in the code to a minimum to see how performant the resulting ArrayFire application is.
The same LBM solver was implemented in both C++ and Rust to study the ease of development across different languages for which ArrayFire wrappers exist. We picked Rust instead of Python as the second language because it is a modern, memory-safe language focused on performance, and the feature parity of its ArrayFire wrapper with the C++ API was better at the time of development.
In this section, we cover the basic theory behind the LBM, followed by a description of the implementation of the solver algorithms on the GPU, the pros and cons of using the ArrayFire library for the development of parallel programs, and how the simulation results can be visualized in real time with the Forge library (included in ArrayFire) while the simulation is running.

Lattice Boltzmann Method
The lattice Boltzmann method originally grew out of lattice gas automata [2,25]. It is positioned in the middle between Eulerian and Lagrangian methods for solving fluid flow problems. Instead of calculating the properties of individual particles, a particle distribution function (PDF) is used to describe the distribution of particles, computed for each node of the discretized domain. As mentioned earlier, each node needs only its neighbors for the actual computation, allowing for good parallelization. The collision of particle distributions is described by an operator Ω, which states that the rate of change of the PDF (denoted f) is equal to the rate of collision in the limit dt → 0:

df/dt = Ω(f).   (1)

The collision operator Ω is difficult to solve. It was simplified by the work of Bhatnagar, Gross, and Krook (BGK) [26], who introduced the BGK operator defined as

Ω_i(f) = −(f_i − f_i^eq)/τ,   (2)

where f_i^eq is the PDF in the equilibrium state of the system, obtained by a Taylor expansion of the Maxwell-Boltzmann equilibrium function. The relaxation parameter τ represents the time over which the system relaxes toward equilibrium.
The lattice Boltzmann equation (LBE) in its discrete form, the fundamental part of the LBM, is obtained by discretizing the velocity space of the Boltzmann equation into a finite number of discrete velocities e_i, i = 0, 1, ..., Q − 1, with Q = 9 for the D2Q9 and Q = 27 for the D3Q27 stencil, respectively (see Figure 1). The LBE can be stated as

f_i(x + e_i ∆t, t + ∆t) = f_i(x, t) + Ω_i(x, t),   (3)

where f_i(x, t) denotes the individual direction of the PDF at each lattice point at a particular time, and f_i(x + e_i ∆t, t + ∆t) is the resulting PDF of the neighboring node in the next iteration step. In other words, it represents the probability of finding a particle at position x and time t with velocity e_i. A necessary criterion for stability is that physical information should not travel faster than the fastest speed supported by the lattice [25]. The discrete velocities e_i in the D2Q9 model for a single node in the 2D grid can be expressed as

e_0 = (0, 0), e_1..4 = (±1, 0), (0, ±1), e_5..8 = (±1, ±1),   (4)

and for D3Q27 the set is extended to accommodate 27 velocities moving in the 3D grid. Macroscopic quantities are obtained from the hydrodynamic moments of the distribution function (see Equation (5)), where the density ρ and the velocity u are defined as

ρ = Σ_i f_i,  ρu = Σ_i e_i f_i.   (5)

The equilibrium distribution function f^eq can be expressed by performing a Hermite expansion of the Boltzmann equilibrium function as

f_i^eq = ω_i ρ [1 + (e_i · u)/c_s² + (e_i · u)²/(2c_s⁴) − (u · u)/(2c_s²)],   (6)

where c_s is the dimensionless speed of sound within the lattice, usually set to c_s = 1/√3, and ω_i denotes the different weights (7) for the discrete velocities in the D2Q9 stencil as follows:

ω_0 = 4/9, ω_1,2,3,4 = 1/9, ω_5,6,7,8 = 1/36.   (7)
With a proper set of discrete velocities, the LBE recovers the incompressible Navier-Stokes equations through the Chapman-Enskog expansion. For flows within the incompressible limit, assumptions such as a low Mach number Ma = |u|/c_s and variations in density of order O(Ma²) have to be made.
The two general steps of the LBM solver in each iteration are solving Equation (8) for collision and Equation (9) for streaming, defined as follows:

f_i*(x, t) = f_i(x, t) − (f_i(x, t) − f_i^eq(x, t))/τ,   (8)

f_i(x + e_i ∆t, t + ∆t) = f_i*(x, t).   (9)

To overcome the numerical instability encountered when applying the LBM in 3D, the multiple-relaxation-time (MRT) scheme is useful for stabilizing the solution and obtaining satisfactory results. The MRT model allows for independent tuning of the relaxation times of each physical process [8]. It is, therefore, natural to extend the current work to include MRT. In this study, we implemented a non-orthogonal MRT-LBM in D3Q27 because it simplifies the transformation between the discrete velocity space and the moment space [5].
The collision operator in MRT-LBM is defined as

Ω = −M⁻¹ S M (f − f^eq),   (10)

where M is the transformation matrix between the distribution and moment spaces, and S is a diagonal relaxation matrix, constructed by the process described in [5]. The collision step is performed in moment space. Formally, it can be defined as

m* = m − S (m − m^eq),   (11)

where m = M f and m^eq = M f^eq. After the collision step, the distribution function is reconstructed by f* = M⁻¹ m* and used in the streaming step as usual. For the boundary condition, we implemented a simple no-slip boundary known as bounce-back, which effectively reverses the direction of f_i [12]. At places like the input and output of the simulated pipe in the Kármán vortex test case, we implemented periodic boundary conditions [27].
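To make the BGK update concrete, the collision and streaming steps above can be sketched in a few lines of NumPy. This is an illustrative stand-in only, not the solver's actual ArrayFire C++/Rust code, and all names below are hypothetical:

```python
import numpy as np

# D2Q9 discrete velocities e_i and weights w_i, Eqs. (4) and (7)
E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)
CS2 = 1.0 / 3.0  # squared lattice speed of sound, c_s^2

def equilibrium(rho, u):
    # f_eq_i = w_i rho [1 + e.u/cs^2 + (e.u)^2/(2 cs^4) - u.u/(2 cs^2)], Eq. (6)
    eu = np.einsum('id,xyd->ixy', E, u)
    u2 = np.einsum('xyd,xyd->xy', u, u)
    return W[:, None, None] * rho * (1 + eu / CS2
                                     + eu**2 / (2 * CS2**2)
                                     - u2 / (2 * CS2))

def collide_and_stream(f, tau):
    rho = f.sum(axis=0)                                   # density, Eq. (5)
    u = np.einsum('id,ixy->xyd', E, f) / rho[..., None]   # velocity, Eq. (5)
    f_post = f - (f - equilibrium(rho, u)) / tau          # BGK collision, Eq. (8)
    # periodic streaming: shift each population along its velocity, Eq. (9)
    return np.stack([np.roll(f_post[i], shift=tuple(E[i]), axis=(0, 1))
                     for i in range(9)])
```

Starting from the equilibrium of a fluid at rest, this update leaves the populations unchanged and conserves density, which is a quick sanity check for any implementation.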

Using GPU for CFD Simulations
There has been an increase in studies focused on optimizing the execution speed of LBM algorithms since the idea of using GPGPU for CFD simulations started gaining traction more than a decade ago. It has been bolstered by the LBM's advantage of locality: only the values of neighboring nodes are needed for each computation.
In this space, the most used APIs for programming GPUs are CUDA and OpenCL. CUDA is a proprietary API used to program NVIDIA GPUs [28]. OpenCL is an open standard that supports different hardware from various vendors on the market [29].
Recently, multiple 2D and 3D LBM simulations used CUDA or OpenCL for their parallel implementation targeting GPUs [13][14][15][16][30][31][32][33][34]. There is an increasing number of studies on memory access efficiency and optimization techniques [22,35]. However, accomplishing near-optimal performance requires an extraordinary amount of work. Programming software for GPGPU is still very difficult [36]. Developers need to optimize for specific hardware and, therefore, have to know each architecture thoroughly. For this reason, hardware-specific implementations in cross-platform code tend to become very complex.
The ArrayFire library can help by automatically leveraging the best hardware features available on multiple architectures, hiding the hardware-specific optimizations. Developers can write code in a high-level language such as C++, Rust, or Python (languages for which ArrayFire wrappers exist) and have it compiled for CPU, GPU, or other accelerators such as FPGAs.

Compiling to CUDA and OpenCL from One Codebase
The CUDA and OpenCL APIs differ from each other. They can be considered extensions of the C/C++ language and require significant experience in low-level C/C++ programming. To write optimized, parallel software, developers need to employ different techniques, specific to the API they choose, which adds substantial overhead when porting from one to the other if the need arises. In heterogeneous computing systems, writing optimized cross-platform code for different GPUs means writing multiple hardware-specific kernels in CUDA and OpenCL.
In this work, the high-performance parallel computing library ArrayFire has been used to significantly simplify programming for GPUs. Its easy-to-use API provides high-level abstractions in the form of hundreds of functions, which are automatically converted to optimized, fast GPU kernels using just-in-time (JIT) compilation and lazy evaluation [37]. ArrayFire's high-level construct, called Array, is a data structure acting as a container that represents memory stored on the device. On top of it, ArrayFire provides the Unified Backend abstraction for working with different computational backends. This makes it possible to switch to a CPU, GPU, FPGA, or another type of accelerator at runtime (see Listing 1) and permits programmers to write massively parallel applications in a high-level language with a much lower number of lines of code [20]. Arrays (or matrices) of up to 4 dimensions can be created.

Employed Optimizations in LBM Solver
In the following section, we describe simple optimizations employed in our LBM-based solver. A large portion of the time-consuming, hardware-specific, low-level optimizations needed to achieve high throughput on different parallel architectures is done automatically by ArrayFire. The underlying JIT compilation engine converts expressions into the smallest possible number of CUDA or OpenCL kernels; to decrease the number of kernel calls and unnecessary global memory operations, it merges cooperating expressions into a single kernel. However, not everything can be optimized automatically. In LBM algorithms, some optimizations have a large impact on the performance of the simulation, namely coalesced memory writes resulting from a better data organization scheme, removal of branch divergence, improved cache locality, and thread parallelism.

Data Organization
The key to the most efficient memory throughput when programming for a GPU is to ensure that memory access is coalesced [22]. There are two common patterns for data organization: array of structures (AoS) and structure of arrays (SoA). Considering a 1D array: in AoS, all discrete velocities of each node occupy consecutive elements of the array; in SoA, the values of one discrete velocity direction from all nodes are arranged consecutively in memory, then the next direction for all nodes, and so on. In a GPU-based LBM algorithm, it is beneficial to store the values of the distribution functions for the whole computational domain, represented by a grid, in a 1-dimensional array. For the D2Q9 and D3Q27 stencils, this translates into storing a single velocity direction for the 2D grid of N_x × N_y or the 3D grid of N_x × N_y × N_z nodes at a time, consecutively 9 times for D2Q9 and 27 times for D3Q27, respectively. Using SoA is significantly faster than AoS [13,22], and the cache use is considerably improved [12].
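The difference between the two layouts can be sketched in NumPy (illustrative only; the solver builds the SoA layout with ArrayFire arrays, and all names here are hypothetical). Reading one direction across all nodes touches one contiguous block in SoA but a strided pattern in AoS:

```python
import numpy as np

nx, ny, Q = 4, 3, 9
rng = np.random.default_rng(0)
node_values = rng.random((nx * ny, Q))  # per-node distribution values

# AoS: the Q directions of each node sit next to each other in memory
f_aos = node_values.reshape(-1)         # [n0d0, n0d1, ..., n0d8, n1d0, ...]

# SoA: direction 0 of every node first, then direction 1, and so on
f_soa = node_values.T.reshape(-1)       # [n0d0, n1d0, ..., then the d1 block, ...]

# One full direction over all nodes: contiguous slice in SoA, stride Q in AoS
d = 5
soa_block = f_soa[d * nx * ny:(d + 1) * nx * ny]  # contiguous
aos_strided = f_aos[d::Q]                         # strided, uncoalesced on a GPU
```

Both expressions return the same values; only the memory access pattern differs, which is exactly what coalescing is about.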
To create an SoA structure with a 1D array storing the particle distributions of all directions along the whole computational domain, we can set the first dimension of the array construct straightforwardly to n_x × n_y × n_z × Q elements, but for easier index creation further down the line, it is better to initially create a 3D (or 4D in the case of D3Q27) array and flatten it when used for the heavy computation of the actual simulation (see Listing 2).
Listing 2: Creating SoA structure representation of D2Q9 lattice with ArrayFire in C++.
unsigned nx = 64, ny = 64, dirs = 9;
// Option 1: SoA 1D array created directly
array f_1d_soa = constant(0, nx * ny * dirs);
// Option 2: 3D array flattened to a SoA 1D array
array f = constant(0, nx, ny, dirs);
array f_1d_soa = flat(f);

For the streaming operation in LBM, we use ArrayFire's shift function for each direction of the particle distributions. Instead of running this function on every iteration, we create two indices at the initialization phase of the solver, for accessing the "current" nodes and their neighboring nodes, and store them for the whole lifetime of the simulation. Streaming is then as easy as accessing the particle distributions in a 1D array with the neighbors index. In ArrayFire, indexing is also executed in parallel but is not part of the JIT compilation; instead, it is a hand-written, optimized kernel. Any JIT code that is fed to indexing is evaluated in a single kernel if possible (see Listing 3).
Listing 3: Creating "current" (or main) index and neighboring index.
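The idea behind the precomputed indices can be sketched in NumPy (hypothetical names, not the ArrayFire code of the listing, which builds the indices with ArrayFire's shift and indexing functions): compute, once, the flat source index each node reads from for every direction, then streaming becomes a single gather.

```python
import numpy as np

nx, ny = 4, 3
E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])

# Precompute, at initialization, the flat index of the upwind neighbor
# of every node for every direction (periodic wrap-around); flat index
# of node (x, y) is x * ny + y.
x, y = np.meshgrid(np.arange(nx), np.arange(ny), indexing='ij')
neighbors = np.stack([((x - ex) % nx) * ny + ((y - ey) % ny)
                      for ex, ey in E]).reshape(9, -1)

f = np.random.default_rng(1).random((9, nx * ny))
# Streaming (pull): each node gathers the population from its upwind neighbor
f_streamed = np.take_along_axis(f, neighbors, axis=1)
```

Because the index arrays are built once and reused every iteration, the per-step cost reduces to one parallel gather, which mirrors the design described above.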

Removing Branch Divergence
When writing classical, imperative code, control flow is usually handled with if-else blocks, creating different possible branches. In a multithreaded execution model like the SIMT model used in GPUs, the processor's threads execute different paths of the control flow, leading to poor utilization due to thread-specific control flow using masking [38]. Branch divergence is a major cause of performance degradation in GPGPU applications [39]. To keep the flow coherent across the processing threads, it is recommended to remove if-else blocks from the code (see Listing 5).
Listing 5: Pseudo-code showcasing the removal of branch divergence by removing if statement.
In a practical LBM application, branch divergence occurs when doing different computations on different types of nodes in the computational domain, e.g., "if fluid node, do computation, else do nothing". Branch removal in the C++ version of ArrayFire applications is shown in Listing 6. In the Rust version, branch removal is achieved in the same manner, but at the end, the select function can be used to apply the condition (see Listing 7).
Listing 7: Example Rust code of removing branch divergence using ArrayFire.
let out = af::select(&A, &cond, &B);

In LBM simulations, we are also concerned with setting up boundary conditions. It is necessary to tell the solver which cells are solid (e.g., for doing bounce-back in some step down the line) and which are fluid or other cell types. For the simplest case, let us consider only two types of cells, solid and fluid. A Boolean mask, in this case represented as integers (fluid as 0 and solid as 1), is instantiated in the mask variable. Indexes of the solid nodes within the computational domain can then easily be found with the where function, as shown in Listing 8.
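The same branch-free pattern can be sketched in NumPy, with np.where playing the role of ArrayFire's select (names are illustrative; the opposite-direction table assumes the D2Q9 velocity ordering used earlier in this paper's equations):

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.random((9, 16))              # 9 populations for 16 flattened nodes
mask = np.zeros(16, dtype=bool)
mask[[0, 3, 7]] = True               # mark a few nodes as solid

# Branch-free update: compute both outcomes for every node, then blend.
# Solid nodes bounce back (swap each population with its opposite
# direction); fluid nodes keep their values. No per-node if/else runs,
# so all SIMT threads follow the same instruction stream.
OPP = np.array([0, 3, 4, 1, 2, 7, 8, 5, 6])  # opposite directions for D2Q9
bounced = f[OPP]
f_new = np.where(mask, bounced, f)
```

The trade-off is that both branches are evaluated for every node, but on a GPU this is usually far cheaper than divergent control flow.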

Pull Scheme
The most common algorithms for the streaming phase in LBM solvers use the push and pull schemes [22,35]. In the push scheme, the streaming step occurs after the collision step, at which point the particle distribution values are written to the neighboring nodes. This causes a misalignment of the memory locations, resulting in uncoalesced writes and degrading performance significantly. In the pull scheme, on the other hand, the streaming step occurs before the collision step: the neighboring particle distribution values are gathered to the current nodes and then used for the computations ending with the collision step, after which the results are written directly to the current nodes. This way, the writes are coalesced in memory.
The idea behind preferring coalesced writes to GPU device memory is that requests for values stored at memory addresses within a 128-byte range are combined into one, which saves memory bandwidth. It is generally accepted that the cost of uncoalesced reading is smaller than the cost of uncoalesced writing [22].
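A one-dimensional NumPy sketch makes the difference between the two schemes explicit (illustrative only, with hypothetical names): both produce the same result, but push scatters its writes while pull gathers its reads and writes contiguously.

```python
import numpy as np

nx = 8
f = np.random.default_rng(3).random(nx)  # one population, +1 lattice velocity
src = (np.arange(nx) - 1) % nx           # upwind neighbor index, periodic

# Push: after collision, scatter each value to its downwind neighbor.
# The write addresses are permuted, i.e., potentially uncoalesced.
pushed = np.empty_like(f)
pushed[(np.arange(nx) + 1) % nx] = f

# Pull: before collision, gather each value from the upwind neighbor.
# The reads are indexed, but the write to `pulled` is contiguous.
pulled = f[src]
```

On a GPU, the pull variant turns the expensive scattered writes into cheaper gathered reads, which is exactly the trade-off described above.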

Real-Time Visualization
The progress of scientific computations can be viewed in real time thanks to the high-performance OpenGL visualization library called Forge (see Figure 2). It is developed by the same team behind ArrayFire and distributed together with their library for high-performance parallel computing. It is written specifically for use with GPU-accelerated applications: it does not require expensive copying from GPU to CPU and back to GPU for rendering but instead builds on CUDA/OpenCL interoperability with OpenGL, allowing direct reading of the data on the GPU [40]. Forge provides various plotting and visualization functions for 2D and 3D domains. For the current study, we only used 2D visualizations.
Practical scientific simulations for the in-depth study of complex physical phenomena from the real world, e.g., direct numerical simulation of cellular blood flow [32], require higher accuracy. Instead of the single-precision floating-point type (f32), double-precision floating-point (f64) computation has to be chosen. In the LBM context, this practically doubles the amount of memory needed. For real-time visualization of results computed at this precision, they have to be converted to single-precision floating point for Forge to work with the data effectively. In ArrayFire (and generally in programming), the function for this operation is called cast (see Listing 10).
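In a NumPy analogy (astype standing in for ArrayFire's cast; names are illustrative), the conversion and the factor-of-two memory difference look like this:

```python
import numpy as np

field = np.linspace(0.0, 1.0, 1024, dtype=np.float64)  # simulation output in f64
field_f32 = field.astype(np.float32)                   # cast for visualization

# double precision uses exactly twice the memory of single precision
assert field.nbytes == 2 * field_f32.nbytes
```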
Listing 10: Converting to single precision floating point for Forge visualization in C, C++ and Rust.

Results
For the current work, the LBM was applied to simulate incompressible steady and unsteady low-Reynolds-number flows in the 2D and 3D lid-driven cavity using full bounce-back walls, and the 2D flow around a confined circle, also known as the Kármán vortex street. These classic academic tests were simulated at different grid densities, going from coarse to denser, within the memory limits of the GPUs that were used.
The performance of LBM simulations is measured in million lattice updates per second (MLUPS), which is the standard unit of measurement within the LBM research community. The same metric is used for both single- and double-precision floating-point operations in the benchmarked simulations.
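For reference, the metric is simple to compute; a small sketch (the helper name is hypothetical):

```python
def mlups(nx, ny, nz, iterations, elapsed_seconds):
    """Million lattice-node updates per second: every node of the grid
    counts as one update per iteration, regardless of stencil size."""
    return nx * ny * nz * iterations / elapsed_seconds / 1e6

# e.g., a 256 x 256 D2Q9 grid (nz = 1) advanced 10,000 steps in 2 s
speed = mlups(256, 256, 1, 10_000, 2.0)
```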
We also show different visualization functions from the Forge library that can be used to study fluid flows and other types of CFD simulations.

Hardware
The solver performance was tested on 4 GPUs spanning multiple architectures, as shown in Table 1. The AMD Radeon R9 M370X is a mobile GPU installed in laptops; in the current study, it was tested in a higher-end MacBook Pro 2015. The NVIDIA GTX 1070 is an average desktop GPU, priced at $443.78 USD at the time of writing according to PassMark G3D Mark (a GPU benchmarking website). The NVIDIA RTX 3090 is a top-of-the-line, consumer-grade, enthusiast-level GPU, priced at $2139.99 USD according to PassMark G3D Mark at the time of writing. We have also included an NVIDIA GeForce RTX 2080 Ti.

Performance Analysis
Since ArrayFire allows for using not only GPU backends but also the CPU, we added CPU benchmarks executed on an Intel Core i7-6800K running at 3.40 GHz. Performance peaked at around 19 MLUPS and stayed the same across various domain sizes in both academic test cases.
Most of the GPU benchmarks were done using both the CUDA and OpenCL backends, although the differences between them were minimal. Therefore, in the following tables and graphical representations of the data, we show MLUPS numbers solely from the OpenCL benchmarks, since this platform allowed us to benchmark not only NVIDIA hardware but also the AMD GPU.
Single-precision floating-point (f32) calculations perform significantly faster than double-precision (f64) ones. The difference between f32 and f64 is a doubling of the problem data size, hence twice the memory bandwidth is needed. This difference can be seen in the D2Q9 lid-driven cavity and Kármán vortex test cases (see Figures 5-8).
For the 2D Kármán vortex test case, the benchmarks showed similar results (see Figures 7 and 8). The performance spikes in the so-called "warm-up" phase at the start of the simulation were less significant in the double-precision benchmarks than in the lid-driven cavity test. For the 3D LBM simulation with the D3Q27 stencil and MRT, the results are shown in Figure 9, with computation speeds reaching up to 1500 MLUPS in single-precision and up to 490 MLUPS in double-precision benchmarks, as depicted in Figure 10. The MRT is much more computationally intensive than BGK, and with 27 particle distributions to track, the difference between D2Q9 and D3Q27 is significant. The memory demand of the D3Q27 lid-driven cavity tests is much higher; therefore, we were limited to grids under 128³ in the current implementation.

Discussion
The focus of this study was to test the performance of an ArrayFire implementation of LBM codes on a single GPU. Some literature mentions a simple extension of algorithms written with ArrayFire that adds multi-GPU support in 4 lines of code [36]. An implementation like this is great for simple use cases and for systems with the same type of GPU, which prevents load-balancing issues. Proper multi-GPU support, compatible with the Unified Backend convention, is planned for future versions of the ArrayFire library. We are, therefore, eager to integrate multi-GPU support and to test performance on heterogeneous HPC systems once explicit multi-GPU support is ready.
In the current implementations of the D2Q9-BGK and D3Q27-MRT codes analyzed in this study, the line counts are quite low, comparable to MATLAB implementations of the same algorithms, yet they run significantly faster without code changes beyond the standard ones required by the ArrayFire library documentation (vectorizing for-loops, using the first dimension of matrices for coalesced writes, etc.). When we implemented the 3D LBM simulations using the D3Q27 stencil, the performance was not good enough in comparison to state-of-the-art solvers in this category (e.g., Sailfish [30] or ELBE [41]). Therefore, for such cases, more optimizations should be considered [35].

Conclusions
In this article, we presented a simple cross-platform implementation of an LBM solver that can be used on a variety of parallel accelerators (e.g., GPUs, CPUs, or FPGAs) by using ArrayFire, a high-performance parallel computing library (we used version 3.8.0). A reference implementation was created in C++, and the same code was ported to Rust. We chose Rust because it has modern capabilities, is memory safe, and its performance is comparable to C/C++. In this way, we were able to produce a sufficient LBM solver in under 150 lines of code, including the real-time visualization code. Benchmarks were conducted for both the C++ and Rust versions, resulting in identical performance. The data reported in this study are taken from the Rust version.
For the benchmarks, we analyzed two classical academic test cases: the lid-driven cavity and the Kármán vortex street (flow around an obstacle in a pipe). The commonly used metric for measuring the speed of LBM implementations, MLUPS, was employed. We benchmarked our LBM implementation on 4 different GPUs and one CPU, observing speed-ups of 4 to more than 300 times on the various GPUs compared with the CPU.
The code is available under the MIT license and is freely published on GitHub. We also included a "condensed" implementation of the Kármán vortex street in 87 lines of C++ code to showcase how little code is needed to create high-performance, cross-platform simulation software with the ArrayFire library.

Data Availability Statement: Code with the implementation of the LBM algorithms for this article is provided freely at https://github.com/michaltakac/lbm-arrayfire (accessed on 16 June 2021). All the data from the benchmarks performed for this article are available in the same repository under the rust/benchmarks folder.

Acknowledgments:
The author would like to thank the ArrayFire team, especially John Melonakos and Pradeep Garigipati, for building the ArrayFire library, helping with technical questions, and performing additional benchmarking on an NVIDIA Tesla K20c GPU.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

GPU     Graphics processing unit
GPGPU   General-purpose computing on graphics processing units
D2Q9    Two-dimensional lattice stencil with 9 discrete velocity directions in each node
D3Q27   Three-dimensional lattice stencil with 27 discrete velocity directions in each node
JIT     Just-in-time
API     Application programming interface