1. Introduction
The immersed boundary method (IBM) is widely employed in computational fluid dynamics (CFD). Unlike body-fitted grid methods [1], the IBM avoids complex mesh generation by representing internal boundaries with force nodes that are independent of the fluid grid. This approach allows uniform fluid grids to be used even for problems involving complex geometries, such as particle-laden flows [2] or flows around obstacles [3]. Consequently, it is widely applied in the simulation of fluidized beds [4,5,6], biofluid mechanics [7,8], and other areas.
In simulating incompressible flows, the IBM is often integrated with the pressure-implicit with splitting of operators (PISO) algorithm [9], which computes the pressure via a pressure Poisson equation. As an elliptic equation, the Poisson equation introduces global dependence: local perturbations immediately affect the entire domain through the pressure field [10]. Consequently, in large-scale direct numerical simulations (DNSs), the pressure Poisson solver frequently dominates the computational cost and complexity. Additionally, the central idea of the IBM is to enforce boundary conditions through an external body-force term [11,12,13]. Thus, in the PISO-IBM framework, additional computational cost arises from calculating this body force at each time step [14]. Because the force calculation depends on interpolation between fluid and boundary nodes, cases with numerous interpolation nodes (e.g., particle-laden flows) can become computationally expensive.
To reduce computational cost, researchers focus on decreasing computational complexity and exploiting parallel computing. For solving the Poisson equation, fast Fourier transform (FFT)-based algorithms are considered highly effective [15,16]. The core idea is to use the discrete Fourier transform (DFT) to convert the discrete Poisson equation into independent one-dimensional problems on different bases. This approach has two main advantages: it is well suited for parallel computation, because the global elliptic equation is decomposed into multiple independent one-dimensional problems that can be distributed across different processes; additionally, the DFT can be accelerated with the FFT algorithm [17], which has a complexity of O(N log N), where N is the size of the equation. The resulting one-dimensional problems, which have tridiagonal form, can be solved using the Thomas algorithm [18] with a complexity of O(N). The FFT-based algorithm imposes strict restrictions on geometry, as it can only be applied to equations posed on simple geometries. However, when coupled with the IBM, its range of application is significantly expanded, making it a key tool in related fields [19,20].
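As a concrete illustration (the notation below is chosen for this sketch and is not taken from the paper), consider the standard five-point discretization of the 2D Poisson equation on a uniform grid with spacing h and periodic conditions in x; a DFT along x decouples the wavenumbers k into independent tridiagonal problems in y:

```latex
% DFT along x: \hat{p}_{k,j} = \sum_{i=0}^{N_x-1} p_{i,j}\, e^{-2\pi \mathrm{i}\, k i / N_x}.
% Each wavenumber k then satisfies an independent tridiagonal system in j:
\[
\frac{\hat{p}_{k,j+1} - 2\hat{p}_{k,j} + \hat{p}_{k,j-1}}{h^{2}}
+ \frac{2\cos(2\pi k / N_x) - 2}{h^{2}}\,\hat{p}_{k,j}
= \hat{b}_{k,j},
\qquad k = 0,\dots,N_x - 1,
\]
% which the Thomas algorithm solves in O(N_y) operations per wavenumber.
```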
To accelerate IBM computation, research has focused on optimizing parallelism, since the complexity of the IBM calculation itself cannot be reduced. Nonetheless, parallel implementations introduce communication overhead that cannot be reduced merely by increasing the number of computational nodes; in fact, excessive node counts may exacerbate communication bottlenecks. While the IBM reduces some computational costs, communication challenges remain significant for large-scale problems. One effective strategy to mitigate these bottlenecks is the overlapping method, in which communication proceeds concurrently with computation [21]. This approach effectively "hides" communication time behind computational tasks. However, implementing this strategy in CFD algorithms presents a particular challenge: overlapping requires independence between computation and communication, meaning the data being computed cannot depend on the data being communicated during the overlap phase.
Existing research has proposed only limited optimizations for IBM-PISO algorithms in parallel environments. A common approach is to separate inner and interface regions: non-blocking communication is initiated for the interface regions first, the inner regions are computed concurrently during communication, and the interface regions are processed after communication completes [22]. Some optimization strategies focus on the grid. For example, Ref. [23] employs a dual-grid approach: a fine grid resolves the fluid velocity and the discrete element method (DEM) coupling to maintain accuracy, while a coarse grid is used for the pressure solution to accelerate large-scale computations. Additionally, numerous studies have explored performance optimizations on heterogeneous architectures, particularly multi-granularity parallelization strategies [24,25,26]. However, these methods only work for locally dependent operations such as convective-term calculations. They prove ineffective for global operations such as Poisson equation solving, which requires all-to-all communication.
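For reference, the inner/interface pattern described above can be sketched with non-blocking MPI calls as follows (a minimal illustration with hypothetical field layout and kernel names, not code from the cited works):

```cpp
// Sketch: overlap halo (interface) exchange with interior computation.
#include <mpi.h>
#include <vector>

void advance_slab(std::vector<double>& field, int nx, int ny_local,
                  int up, int down, MPI_Comm comm) {
    std::vector<double> send_lo(field.begin(), field.begin() + nx);  // first local row
    std::vector<double> send_hi(field.end() - nx, field.end());      // last local row
    std::vector<double> recv_lo(nx), recv_hi(nx);

    MPI_Request req[4];
    // 1. Start non-blocking exchange of the interface (halo) rows.
    MPI_Irecv(recv_lo.data(), nx, MPI_DOUBLE, down, 0, comm, &req[0]);
    MPI_Irecv(recv_hi.data(), nx, MPI_DOUBLE, up,   1, comm, &req[1]);
    MPI_Isend(send_lo.data(), nx, MPI_DOUBLE, down, 1, comm, &req[2]);
    MPI_Isend(send_hi.data(), nx, MPI_DOUBLE, up,   0, comm, &req[3]);

    // 2. Update interior rows, which need no remote data, while messages are in flight.
    // update_interior(field, nx, ny_local);   // hypothetical kernel

    // 3. Finish communication, then update the rows adjacent to the slab boundary.
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    // update_interface(field, recv_lo, recv_hi, nx, ny_local);  // hypothetical kernel
}
```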
This research introduces a novel time-step splitting method to implement overlapping optimization for the IBM-PISO algorithm. Our approach staggers the solution of the Poisson equation with the IBM force calculations within each time step. Section 2.1 details this algorithm, which employs an FFT-based Poisson solver. Unlike classic Krylov methods [27], the FFT-based approach transforms the global equation into multiple independent sub-equations. This method offers two key advantages: (1) low computational complexity, owing to the efficiency of the FFT and Thomas algorithms, and (2) excellent parallel scalability, as the sub-equations can be solved independently. While FFT-based solvers are typically restricted to simple geometries because of their spectral expansion requirements, our work combines this approach with the IBM to handle both simple and complex internal boundaries.
The manuscript is organized as follows: Section 2.1 presents the governing equations and solution methodology, while Section 3 details our overlapping strategy implementation. We validate the precision and computational efficiency through multiple numerical test cases in Section 4.
3. Overlapping Strategy
The proposed overlapping strategy to accelerate the computation of (8) focuses on parallel implementation, with the parallelization strategy detailed in this section. For parallel computation, we employ a one-dimensional (slab) domain decomposition of the 2D fluid domain, partitioning the computational domain uniformly along the y-axis to optimize load balancing. Extending this approach to 3D configurations is conceptually straightforward, requiring only algorithmic generalization without altering the core logic of the overlapping scheme.
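As a simple illustration of this decomposition (a hypothetical helper, not the paper's code), the local row range of each MPI rank can be computed as:

```cpp
// Sketch: uniform 1D (slab) decomposition of ny grid rows over nprocs ranks.
#include <mpi.h>

struct Slab { int j_begin; int j_end; };  // local row range [j_begin, j_end)

Slab local_slab(int ny, MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    int base = ny / nprocs, rem = ny % nprocs;
    // The first `rem` ranks take one extra row so the load stays balanced.
    int j0 = rank * base + (rank < rem ? rank : rem);
    int rows = base + (rank < rem ? 1 : 0);
    return {j0, j0 + rows};
}
```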
As mentioned in Section 2.1, the Poisson equation is discretized using the finite difference method (FDM). In this research, the incompressible Navier–Stokes equations are discretized by the FDM on a staggered grid, where the discrete nodes for the x- and y-velocity components and the pressure are located at different positions, as shown in Figure 1. Furthermore, when employing one-dimensional (slab) domain decomposition in FDM-IBM, updating a grid point requires information from neighboring points following a specific dependency pattern known as the finite difference stencil. The stencil of a force point defines its dependence on adjacent Eulerian fluid points, determined by the support region of the kernel function.
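For context, the interpolation and spreading operations that give rise to this stencil usually take the following standard form (a sketch using a generic regularized delta kernel and notation chosen here; the specific kernel used in the paper is not reproduced):

```latex
% Interpolation of the Eulerian velocity u to a Lagrangian force point X_k,
% and spreading of the Lagrangian force F_k back onto the Eulerian grid:
\[
U(X_k) = \sum_{x \in g_h} u(x)\, \delta_h\!\left(x - X_k\right) h^{2},
\qquad
f(x) = \sum_{k} F_k\, \delta_h\!\left(x - X_k\right) \Delta V_k ,
\]
% where g_h is the Eulerian grid, h the grid spacing, and \Delta V_k the volume
% (area, in 2D) associated with force point k. The support of \delta_h determines
% which Eulerian points appear in the stencil of each force point.
```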
As shown in Figure 1, in the parallel implementation of the IBM, certain force points near the slab boundary may have stencils extending beyond the local process domain. For such points, the stencil includes Eulerian velocity points from both the local process and adjacent processes. Consequently, the local process requires Eulerian velocity and Lagrangian force data from the near-boundary regions of neighboring processes.
In the parallel implementation, each process maintains buffer regions on both sides of its local slab boundary. These buffers receive the data needed for stencil computation from neighboring processes. When the support region spans the boundary between two processes, both the calculation of the force on the Lagrangian point and the spreading of the body force require data from neighboring processes. Consequently, two buffer synchronizations are necessary in the Lagrangian-to-Eulerian force calculation:
Eulerian velocity increment synchronization: the local process computes the force contribution from near-boundary force points affecting neighboring processes and transmits it via MPI to the neighbor's buffer, where it is applied to the Eulerian velocity field.
Lagrangian force point synchronization: the local process sends its boundary force points directly to the neighbor's buffer, allowing each process to independently compute the body-force spreading.
The first subroutine incurs a communication cost proportional to the grid size, whereas the second scales with the number of force points. The choice between the two depends on the specific simulation case; for example, when simulating a fluid with many particles, and hence a large number of force points, the second subroutine is more advantageous.
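A minimal sketch of the second synchronization, assuming a hypothetical force-point layout and a single neighbor rank (not the paper's data structures), could look as follows:

```cpp
// Sketch: Lagrangian force-point synchronization across a slab boundary.
#include <mpi.h>
#include <vector>

struct ForcePoint { double x, y, fx, fy; };  // position and force of a Lagrangian point

// Send the force points lying within a kernel-support distance of the slab
// boundary to the neighboring rank, and receive its boundary points in return.
std::vector<ForcePoint> exchange_boundary_points(
        const std::vector<ForcePoint>& to_send, int neighbor, MPI_Comm comm) {
    int n_send = static_cast<int>(to_send.size()), n_recv = 0;
    MPI_Sendrecv(&n_send, 1, MPI_INT, neighbor, 0,
                 &n_recv, 1, MPI_INT, neighbor, 0, comm, MPI_STATUS_IGNORE);

    std::vector<ForcePoint> recv(n_recv);
    MPI_Sendrecv(to_send.data(), n_send * 4, MPI_DOUBLE, neighbor, 1,
                 recv.data(),    n_recv * 4, MPI_DOUBLE, neighbor, 1,
                 comm, MPI_STATUS_IGNORE);
    // The received points are then used locally to spread the body force
    // onto the buffer region of the Eulerian grid.
    return recv;
}
```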
In general, overlapping strategies involve overlapping the computation of buffer regions with that of interior points. However, when the number of processes is large, the buffer regions may be too small relative to the interior points, so the communication time can no longer be fully hidden behind the computation. The new overlapping strategy proposed in this paper instead overlaps the communication of the FFT-based Poisson solve and the buffer synchronization with computation, rather than overlapping within a single IBM subroutine.
Specifically, as described in (5), the FFT-based Poisson solver involves two critical global (all-to-all) communication steps. These correspond to the transformations from physical space to Fourier space and back, implemented as two transposition steps: an array indexed along the y-direction is transformed into one indexed by n, and vice versa, each constituting an all-to-all global communication. Therefore, the proposed overlapping strategy can be understood as overlapping the two communication steps of the IBM with the two of the Poisson solver. The detailed implementation is described in Algorithm 1.
Algorithm 1 Overlapping Fluid–Particle Coupling Algorithm with CPU-IO Parallelism
1: Solve momentum equation for the intermediate velocity
2: Parallel execution:
| Computation process | Communication process |
| 1. Transformation by FFT | 1. Synchronize buffer data |
| 2. Update … and … | 2. Transpose data from y-direction indexing to n indexing |
| 3. Apply chasing method for pressure solution | 3. Synchronize buffer data |
| 4. Update … and velocity | 4. Transpose data from n indexing to y-direction indexing |
| 5. Inverse transformation by IFFT | |
3: Combine results from both processes and solve pressure equation
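One possible realization of step 2, sketched with non-blocking MPI collectives and hypothetical routine names (the paper's actual implementation may organize the computation and communication processes differently):

```cpp
// Sketch: overlap the y->n transposition with IBM buffer synchronization and local work.
#include <mpi.h>
#include <vector>

void poisson_transpose_overlapped(std::vector<double>& send_slab,
                                  std::vector<double>& recv_pencil,
                                  int block, MPI_Comm comm) {
    MPI_Request transpose_req;
    // Start the global transposition (all-to-all) without blocking.
    MPI_Ialltoall(send_slab.data(), block, MPI_DOUBLE,
                  recv_pencil.data(), block, MPI_DOUBLE,
                  comm, &transpose_req);

    // While the transposition is in flight, run the IBM communication/computation
    // work of the same time step (buffer synchronization, local FFTs, etc.).
    // synchronize_ibm_buffers();   // hypothetical IBM step
    // compute_local_ffts();        // hypothetical compute step

    // Complete the transposition before the tridiagonal (Thomas) solves.
    MPI_Wait(&transpose_req, MPI_STATUS_IGNORE);
}
```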
In the classic PISO algorithm shown in (2), the computation and communication subroutines for both the IBM and Poisson solving are executed sequentially. In contrast, the algorithm based on the proposed overlapping strategy interleaves the computation phases of the IBM and Poisson solving, as demonstrated in Algorithm 1. The time advancement process of the overlapping algorithm can be expressed as
The intermediate velocity is calculated from the advective and dissipative terms, consistent with (2a). It is important to note that both (2) and (9) are based on a first-order forward scheme for simplicity. However, in transient simulations, a low-order time advancement scheme is not always reliable at an acceptable time step. In the actual implementation, as seen in the numerical results in Section 4, we employ the fourth-order Runge–Kutta algorithm.
4. Numerical Results
In this section, we validate the precision of our proposed algorithm and assess the impact of the overlapping algorithm on the accuracy of transient flow simulations. Specifically, we compare the numerical solution with an analytically constructed solution of the incompressible Navier–Stokes equations. For the well-known benchmark problem of two-dimensional flow around a cylinder, we compare the numerical results from our solver, which incorporates the proposed overlapping method, with solutions obtained using the classic method. In this context, we evaluate computational efficiency using different numbers of MPI processes and assess parallel computation acceleration. To analyze the speed-up effect of the proposed overlapping method, we construct a particle-laden flow scenario with varying numbers of particles and compare the results with those obtained from the classic algorithm in terms of computational efficiency.
All numerical results were obtained using an in-house solver developed in C++. The open-source FFT library FFTW [28], known for its high efficiency, is the only third-party library used in our development. All parallel tests were conducted on the CPU parallel computing platform described in Section 4.2.
4.1. Precision Validation
A constructed analytical solution is employed to evaluate the accuracy of the algorithm. This is the so-called method of manufactured solutions, which is regarded as a severe test for numerical computations [29,30]. The manufactured velocity field satisfies the incompressibility condition.
Figure 2a depicts the geometry of the analytical solution, which consists of a stationary square outer boundary and a circular inner boundary. Both boundaries enforce no-slip wall conditions. The circular boundary, of prescribed diameter, is placed inside the square domain defined by the outer boundary.
We compare the velocity results at a fixed time from the overlapping algorithm, labeled as "overlapping" in Figure 3, with those obtained from the classic sequential algorithm, labeled as "classic", across various spatial discretizations.
As shown in Figure 3, the discretization accuracy is between first and second order. This is consistent with the results of the classic algorithm, both over the entire field and at the nodes surrounding the obstacle. The overlapping algorithm does not affect the order of accuracy, indicating that the overlapping strategy retains the computational precision of the classic scheme.
4.2. Flow Around a Cylinder
Flow around a cylinder is a classical benchmark problem. To assess the reliability of the proposed method in transient flow and its computational efficiency, the setup depicted in Figure 4 is used. The left boundary serves as an inlet with a velocity of 1, the top and bottom boundaries are stationary walls, and the outlet boundary is treated as a fully developed flow boundary. The computational domain is rectangular in dimensionless units, with a cylinder of fixed diameter placed inside it. The Reynolds number is defined in terms of the inlet velocity and the cylinder size, and the computational grid uses a uniform spacing chosen according to the Reynolds number of each case.
The drag coefficient, lift coefficient, and Strouhal number are calculated from the drag and lift forces on the cylinder, the inlet velocity u, the cylinder radius r, and the frequency of vortex shedding.
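The defining relations are not shown above; the conventional definitions consistent with the listed symbols (assumed here as the standard forms, with fluid density ρ, diameter D = 2r, and shedding frequency f_s, not reproduced from the paper) are:

```latex
% Conventional drag coefficient, lift coefficient, and Strouhal number:
\[
C_d = \frac{F_d}{\tfrac{1}{2}\rho u^{2} D} = \frac{F_d}{\rho u^{2} r},
\qquad
C_l = \frac{F_l}{\tfrac{1}{2}\rho u^{2} D} = \frac{F_l}{\rho u^{2} r},
\qquad
St = \frac{f_s D}{u} = \frac{2 f_s r}{u}.
\]
```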
Table 1 shows that all computed quantities are in good agreement with the reference data.
The time evolution of the drag and lift coefficients is depicted in Figure 5. Both forces settle into a regular sinusoidal pattern once vortex shedding becomes stable. Comparing the two algorithms, there are small amplitude and phase differences between the two curves. As shown in Table 1, the overlapping algorithm predicts the mean drag coefficient with a maximum deviation of 7.16% compared with the reference data. This discrepancy, which falls within acceptable limits, can be attributed primarily to algorithmic accuracy and the geometric parameters of the flow field. The phase shift in the drag and lift coefficients results from numerical velocity perturbations affecting the onset of vortex shedding. A common practice in the literature is to apply initial perturbations (random velocities across the fluid domain or cylinder rotation) to trigger the vortices earlier; although these methods modify the exact onset timing, the essential vortex dynamics remain unchanged.
The vorticity distribution, indicative of transient flow structures such as the Kármán vortex street, is shown in Figure 6. The upper half depicts the results from the classic algorithm, while the lower half shows the results from our proposed algorithm. The vorticity distributions are very similar, with almost continuous contour lines, indicating consistency between the two approaches.
To evaluate the optimization effect of the proposed algorithm, computational efficiency was compared between our proposed method and the classic algorithm on a server with two CPUs. Detailed software and hardware configurations are provided in Table 2.
Parallel efficiency is assessed by measuring the computational time for various numbers of MPI processes, as shown in Figure 7. The primary aim is to evaluate the parallel optimization using the relative speed-up R, defined in terms of the computational time measured with n MPI processes. The parallelization is implemented using MPI in C++, ensuring that the number of MPI processes equals the number of CPU cores. The parallel efficiency is tested with up to 32 CPU cores, as indicated in Figure 7.
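The defining expression for R is not reproduced above; the conventional form consistent with this description (an assumption here, not the paper's stated formula) is:

```latex
% Relative speed-up with respect to the single-process run,
% where T_n denotes the wall-clock time measured with n MPI processes:
\[
R(n) = \frac{T_1}{T_n}.
\]
```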
In Figure 7a, the gray dashed line represents ideal computational efficiency, where n CPU cores yield an n-fold speed-up. The pale line shows that computational efficiency improves continuously across the tested core counts.
To further analyze parallel efficiency, Figure 7b compares the relative speed-up of the classic algorithm and the proposed method. The proposed algorithm consistently demonstrates higher efficiency, particularly as the number of CPU cores increases. When the number of CPU cores exceeds 8, the proposed algorithm achieves more than a 40% speed-up compared with the classic method, consistent with the expectation that the overlapping strategy becomes more beneficial with more CPU cores.
4.3. Multiple Particles in Cavity
To evaluate the efficacy of our proposed optimization method, we present a case of multiple particles in a cavity. The case setup follows the particle-laden flow study of Zhi-Gang Feng and Efstathios E. Michaelides [32], a well-studied problem in computational fluid dynamics that is often simulated using the IBM to capture the interaction between the flow and solid particles [33].
In this efficiency experiment, the domain is a rectangular cavity, and each circular particle has the same prescribed diameter. The computational grid uses a uniform spacing, with the resolution set at 16 grid points per particle diameter to ensure that the particle dynamics are fully resolved. The fluid density, the particle-to-fluid density ratio, and the fluid's kinematic viscosity are fixed constants. Particles are precisely positioned in a uniform grid pattern to avoid collisions.
The particles are initialized with randomly generated unit velocities: in the x-direction, the velocity may be either positive or negative, while in the y-direction, it is constrained to be positive. For simplicity, particle advection is disabled, and the velocities are utilized solely for the IBM computations. The fluid driving force is thus entirely derived from the velocity difference between the particles and the flow field. Disabling particle advection allows us to focus on evaluating the performance of the overlapping algorithm in handling a large number of force points, without the additional complexity of particle motion or collision dynamics.
Figure 8 illustrates the vorticity field in the particle-laden flow, offering visual insight into the flow structures. Note that this vorticity plot is primarily for demonstration; the particle counts used in the efficiency experiment, as depicted in Figure 9, are significantly larger.
In this efficiency experiment, simulations were conducted over 200 timesteps, and the average computational time per step was calculated from the final 100 timesteps, effectively removing any initialization transients to provide a robust measure of steady-state performance.
Figure 9 highlights the computational efficiency of both the standard and the overlapping methods as a function of particle count. The plot clearly shows that the computational cost of the standard method escalates as the particle count increases, mirroring the growing computational demand of the IBM step. Conversely, the overlapping method exhibits a distinct trend: the increased cost of the IBM calculations is largely absorbed by the overlapped inter-process communication, resulting in a computational time that remains nearly constant as the number of particles increases.
It is theoretically anticipated that the computational cost of the overlapping method will also rise beyond a certain particle count threshold. Nevertheless, in this study, the particle numbers do not exceed this limit, as the maximum particle counts already occupy a significant part of the computational domain.
5. Conclusions
To improve the computational efficiency of the IBM-PISO algorithm in parallel computation, this study proposes an overlapping-based parallel optimization strategy. The distinctive feature of the proposed method, compared with traditional approaches, lies in its modified time-stepping scheme, which interleaves the solution of the FFT-based pressure Poisson equation and the IBM force calculation. By employing a communication–computation overlap technique, communication overhead is effectively hidden during transient simulations, thereby significantly accelerating the PISO algorithm in parallel computation. An in-house code was developed and employed to conduct numerical evaluations of the proposed method on parallel platforms, including assessments against a manufactured analytical solution and the benchmark problem of flow around a circular cylinder. The results demonstrate that the proposed method exhibits significantly superior computational efficiency compared with the classic sequentially executed IBM-PISO scheme, particularly when employing a large number of CPU cores and when dealing with complex immersed boundaries characterized by a high density of IBM force points. Consequently, this method is particularly well suited for fluid dynamics problems with complex immersed boundaries, especially fluid–structure interaction (FSI) problems. The developed algorithm offers a valuable and efficient alternative for addressing these challenging simulations.
To further validate the capabilities of our proposed overlapping algorithm, comprehensive numerical experiments need to be conducted to assess its performance in handling complex geometries, moving boundaries, and elastic boundary conditions. Additionally, the proposed algorithm requires systematic validation for various FSI problems.
While our current IBM implementation builds upon the basic direct forcing method, we recognize opportunities to incorporate more advanced IBM variants. These include the multiple direct forcing method [34], Poisson equation modification techniques [35] for enhanced enforcement of the divergence-free condition at boundaries, and moving least squares methods for greater interpolation stability [36]. Integrating these advanced methods with our FFT-based overlapping strategy represents a promising avenue for future research.
Due to the structural limitations of the Poisson equation matrix, FFT-based solvers are restricted to rectangular domains. As a result, our proposed overlapping algorithm, which builds upon FFT-based Poisson solvers, currently shares the same limitation. Recent studies have explored extensions to curvilinear orthogonal coordinates [37] and dimensionality-reduction strategies, in which the FFT is applied along a selected direction and the reduced subproblems are solved using multigrid methods [38]. Nevertheless, these approaches remain fundamentally constrained by the underlying matrix structure. Addressing this limitation will be an important direction for future work.