Article

An Efficient GPU-Accelerated High-Order Upwind Rotated Lattice Boltzmann Flux Solver for Simulating Three-Dimensional Compressible Flows with Strong Shock Waves

1 College of Aerospace Engineering, Nanjing University of Aeronautics and Astronautics, Yudao Street 29, Nanjing 210016, China
2 College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Yudao Street 29, Nanjing 210016, China
3 China Aerospace Times Feihong Technology Co., Ltd., China Academy of Aerospace Electronics Technology, Intelligent Unmanned System Overall Technology Research and Development Center, Beijing 100094, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(12), 1193; https://doi.org/10.3390/e27121193
Submission received: 16 October 2025 / Revised: 18 November 2025 / Accepted: 21 November 2025 / Published: 24 November 2025
(This article belongs to the Section Statistical Physics)

Abstract

This paper presents an efficient and high-order WENO-based Upwind Rotated Lattice Boltzmann Flux Solver (WENO-URLBFS) on graphics processing units (GPUs) for simulating three-dimensional (3D) compressible flow problems. The proposed approach extends the baseline Rotated Lattice Boltzmann Flux Solver (RLBFS) by redefining the interface tangential velocity based on the theoretical solution of the Euler equations. This improvement, combined with a weighted decomposition of the numerical fluxes in two mutually perpendicular directions, effectively reduces numerical dissipation and enhances solution stability. To achieve high-order accuracy, WENO interpolation is applied in characteristic space to reconstruct the physical quantities on both sides of the interface. A density perturbation test is employed to assess the accuracy of the scheme and demonstrates the expected 5th- and 7th-order convergence. The same test case is used to confirm the consistency between the CPU serial and GPU parallel implementations of the WENO-URLBFS scheme and to assess the acceleration performance across different grid resolutions, yielding a maximum speedup factor of 1208.27. The low-dissipation property of the scheme is further assessed through the inviscid Taylor–Green vortex problem. Finally, a series of challenging three-dimensional benchmark cases demonstrates that the present scheme achieves high accuracy, low dissipation, and excellent computational efficiency in simulating strongly compressible flows with complex features such as strong shock waves and discontinuities.

1. Introduction

Advances in numerical methods and computing resources have significantly propelled computational fluid dynamics (CFD), making it widely applicable in various fields [1]. Compressible flows in aerospace often feature complex phenomena such as strong shocks and contact discontinuities, which require numerical methods with high accuracy and efficiency. Traditional low-order numerical methods usually fail to capture these complex features accurately, resulting in under-resolved simulations. In addition, as the problem dimension and computational scale grow, the consumption of computing resources increases sharply, and a single serial program can hardly meet practical needs. Therefore, developing a high-precision, low-dissipation numerical method with efficient computational capabilities is particularly important.
The key to constructing a high-precision, low-dissipation numerical scheme lies in the effective use of interpolation reconstruction techniques and a sensible choice of flux evaluation method. Traditional second-order interpolation methods, such as the Monotonic Upstream-Centered Schemes for Conservation Laws (MUSCL) difference scheme [2] and total variation diminishing (TVD) schemes [3,4,5,6], can simulate contact discontinuity problems to a certain extent. However, these schemes often introduce significant numerical dissipation when dealing with strong discontinuities or complex flow structures, excessively smoothing detailed flow features. Although they effectively prevent non-physical oscillations near contact discontinuities, their ability to capture details in high-gradient regions is limited, restricting the overall accuracy and resolution of the computation. To address these challenges, researchers have developed advanced numerical schemes to capture complex flow characteristics effectively. Harten [7] introduced the essentially non-oscillatory (ENO) scheme, which theoretically achieves arbitrarily high-order accuracy. Shu and Osher [8,9] introduced highly accurate ENO schemes within the finite difference method (FDM) framework. However, the ENO scheme's selection of a single stencil from multiple candidates at each reconstruction discards physical information and reduces computational efficiency. Liu et al. [10] developed the weighted essentially non-oscillatory (WENO) scheme to address this problem by adopting a nonlinear combination of the reconstruction polynomials on all candidate stencils, constructing a finite-volume WENO scheme of 3rd-order accuracy. Jiang and Shu [11] refined the smoothness indicators and established the foundational framework for finite difference WENO (FD-WENO) schemes in multiple dimensions, culminating in the classic WENO-JS scheme, which achieves fifth-order accuracy. Building on this, many WENO schemes have been proposed [12,13]. The finite difference method has become the preferred choice for constructing high-precision numerical schemes owing to its simple, efficient parallelization and the ease with which it extends to high-order accuracy.
After an appropriate interpolation reconstruction technique is adopted, the choice of flux evaluation method is equally crucial. Flux evaluation is essential to guarantee the stability and accuracy of numerical calculations, particularly when handling shock waves, discontinuities, and intricate flow phenomena; a well-designed flux evaluation method can effectively control numerical dissipation and enhance the precision of the solution. Several commonly used flux evaluation methods and their performance in practical applications are reviewed here. Typical schemes, such as the Lax–Friedrichs (LF) [14], Roe [15], and HLL [16] schemes, directly approximate the macroscopic flux. However, the LF scheme exhibits significant numerical dissipation, affecting the accuracy of the results, and when simulating high-speed flows with strong shock waves, the Roe and HLL schemes can exhibit various forms of instability, particularly the carbuncle phenomenon [17,18,19,20]. The mesoscopic lattice Boltzmann method (LBM) [21,22,23] has become increasingly popular in recent years, mainly for incompressible flows. In flux-solver form, the local reconstruction of the LBM is employed to evaluate the numerical fluxes at the interface; this performs well at low speeds but has limitations when applied to high-speed flows. Modified equilibrium distribution functions have been proposed to extend the approach to compressible flows [24,25,26,27,28], but these functions are relatively complex, which affects computational efficiency. To this end, Shu et al. [29,30,31] developed the lattice Boltzmann flux solver (LBFS), and Yang et al. [32,33] advanced the method by proposing an optimized D1Q4 model to improve computational efficiency. Chen [34] proposed the rotated lattice Boltzmann flux solver (RLBFS), which decomposes the interface normal vector into two orthogonal components, calculates the flux in each direction separately, and forms the final numerical flux by a weighted combination. This method has good numerical stability, but its accuracy is only second order. It should be emphasized that in both LBFS and RLBFS the tangential velocity is calculated approximately, which can lead to discrepancies with the theoretical solution of the Euler equations [35]. Such approximations often introduce additional numerical dissipation and reduce the resolution of the results. This is the primary motivation for the low-dissipation numerical scheme investigated in this paper.
As the problem dimensionality and computational scale increase, the computation time grows rapidly, and traditional serial computation cannot meet the needs of practical applications, so parallel acceleration is required. Initially, computational acceleration relied on multi-core central processing units (CPUs) for parallel computing [36,37]. Although efficiency can be improved to a certain extent, as the scale of the calculation increases, CPU performance is limited by the number of cores and their processing power, and bottlenecks gradually emerge, especially when processing large-scale data and floating-point operations. With the rapid development of GPU technology, GPU-based parallel computing has become a more efficient acceleration approach. Compared with the CPU, the GPU has significant advantages in large-scale parallel computing tasks: its vast number of computing cores and powerful parallel processing capability allow it to handle many computing tasks simultaneously, significantly improving the computing speed. GPU acceleration has shown clear advantages in CFD calculations with high resolution and complex physical phenomena [38,39,40,41,42,43]. To make full use of computing resources, a heterogeneous computing paradigm is adopted in which the computing tasks are divided between the GPU and the CPU [44,45,46]. Under this architecture, the GPU is responsible for compute-intensive tasks such as numerical flux calculation and time advancement, while the CPU is mainly responsible for data input/output and relatively light tasks. Specifically, in this work the GPU code runs on a single NVIDIA TITAN V and the CPU code runs on a Hygon 7185.
In this paper, the URLBFS is introduced as a new numerical scheme by improving the calculation of the interface tangential velocity. Combining the FDM framework with WENO high-order reconstruction, 5th-order and 7th-order FD-WENO-URLBFS schemes are constructed, which can accurately simulate the complex flow features of compressible flows and effectively reduce numerical dissipation. The scheme is accelerated on the GPU using the CUDA platform, which greatly improves computational efficiency. Section 2 presents the methodology, Section 3 discusses the GPU implementation, Section 4 assesses the accuracy and computational speed of the method through challenging 3D test cases, and Section 5 concludes the paper.

2. Methodology

2.1. Governing Equations

The conservation form of the 3D Navier–Stokes (NS) equations is shown below:
$$\frac{\partial \mathbf{W}}{\partial t} + \frac{\partial \mathbf{F}_c}{\partial x} + \frac{\partial \mathbf{G}_c}{\partial y} + \frac{\partial \mathbf{H}_c}{\partial z} = \frac{\partial \mathbf{F}_v}{\partial x} + \frac{\partial \mathbf{G}_v}{\partial y} + \frac{\partial \mathbf{H}_v}{\partial z},$$
$$\mathbf{W} = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ E \end{pmatrix},\quad
\mathbf{F}_c = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ \rho u w \\ u(E+p) \end{pmatrix},\quad
\mathbf{G}_c = \begin{pmatrix} \rho v \\ \rho v u \\ \rho v^2 + p \\ \rho v w \\ v(E+p) \end{pmatrix},\quad
\mathbf{H}_c = \begin{pmatrix} \rho w \\ \rho w u \\ \rho w v \\ \rho w^2 + p \\ w(E+p) \end{pmatrix},$$
$$\mathbf{F}_v = \begin{pmatrix} 0 \\ \tau_{xx} \\ \tau_{xy} \\ \tau_{xz} \\ u\tau_{xx} + v\tau_{xy} + w\tau_{xz} + \kappa \dfrac{\partial T}{\partial x} \end{pmatrix},\quad
\mathbf{G}_v = \begin{pmatrix} 0 \\ \tau_{yx} \\ \tau_{yy} \\ \tau_{yz} \\ u\tau_{yx} + v\tau_{yy} + w\tau_{yz} + \kappa \dfrac{\partial T}{\partial y} \end{pmatrix},\quad
\mathbf{H}_v = \begin{pmatrix} 0 \\ \tau_{zx} \\ \tau_{zy} \\ \tau_{zz} \\ u\tau_{zx} + v\tau_{zy} + w\tau_{zz} + \kappa \dfrac{\partial T}{\partial z} \end{pmatrix}.$$
where $\rho$ is the density, $p$ and $T$ are the pressure and temperature, and $u$, $v$, $w$ denote the velocity components along the three coordinate axes. The viscous stress tensor $\boldsymbol{\tau}$ is:
$$\boldsymbol{\tau} = \mu \begin{pmatrix}
\dfrac{4}{3}\dfrac{\partial u}{\partial x} - \dfrac{2}{3}\left(\dfrac{\partial v}{\partial y} + \dfrac{\partial w}{\partial z}\right) & \dfrac{\partial u}{\partial y} + \dfrac{\partial v}{\partial x} & \dfrac{\partial u}{\partial z} + \dfrac{\partial w}{\partial x} \\
\dfrac{\partial v}{\partial x} + \dfrac{\partial u}{\partial y} & \dfrac{4}{3}\dfrac{\partial v}{\partial y} - \dfrac{2}{3}\left(\dfrac{\partial u}{\partial x} + \dfrac{\partial w}{\partial z}\right) & \dfrac{\partial v}{\partial z} + \dfrac{\partial w}{\partial y} \\
\dfrac{\partial w}{\partial x} + \dfrac{\partial u}{\partial z} & \dfrac{\partial w}{\partial y} + \dfrac{\partial v}{\partial z} & \dfrac{4}{3}\dfrac{\partial w}{\partial z} - \dfrac{2}{3}\left(\dfrac{\partial u}{\partial x} + \dfrac{\partial v}{\partial y}\right)
\end{pmatrix}.$$
Here $\mu$ is the molecular viscosity, and $\kappa = \mu C_p/\Pr$ is the thermal conductivity, where $C_p$ is the specific heat at constant pressure and $\Pr$ is the Prandtl number. $E$ is the total energy:
$$E = \rho\left(\tfrac{1}{2}V^2 + e\right),$$
where e represents internal energy. The kinetic energy of the fluid is:
$$\tfrac{1}{2}V^2 = \tfrac{1}{2}\left(u^2 + v^2 + w^2\right).$$
The system is closed by introducing the equation of state:
$$p = (\gamma - 1)\left[E - \tfrac{1}{2}\rho\left(u^2 + v^2 + w^2\right)\right].$$
Here, γ is the specific heat ratio of air. Unless otherwise specified, all examples in this paper take the value of γ = 1.4 .
All examples in this paper are spatially discretized on a uniform grid with $\Delta x = \Delta y = \Delta z$. The finite difference method is used to discretize Equation (1), and the semi-discrete equation at a grid point is:
$$\frac{d \mathbf{W}_{ijk}(t)}{dt} = -\frac{1}{\Delta x}\left(\mathbf{F}_{i+1/2,j,k} - \mathbf{F}_{i-1/2,j,k}\right) - \frac{1}{\Delta y}\left(\mathbf{G}_{i,j+1/2,k} - \mathbf{G}_{i,j-1/2,k}\right) - \frac{1}{\Delta z}\left(\mathbf{H}_{i,j,k+1/2} - \mathbf{H}_{i,j,k-1/2}\right),$$
where $\mathbf{W}_{ijk}(t)$ is a numerical approximation of $\mathbf{W}(i,j,k,t)$, and $\mathbf{F} = \mathbf{F}_c - \mathbf{F}_v$, $\mathbf{G} = \mathbf{G}_c - \mathbf{G}_v$, and $\mathbf{H} = \mathbf{H}_c - \mathbf{H}_v$ are the numerical fluxes in the three directions. Writing $\mathbf{h}$ for the vector of conservative variables, the resulting ordinary differential equation in time reads:
$$\frac{d \mathbf{h}(t)}{dt} = L(\mathbf{h}).$$
The third-order TVD Runge–Kutta method is commonly employed due to its strong stability properties. It is formulated as follows:
$$\begin{aligned}
\mathbf{h}^{(1)} &= \mathbf{h}^n + \Delta t\, L(\mathbf{h}^n),\\
\mathbf{h}^{(2)} &= \tfrac{3}{4}\mathbf{h}^n + \tfrac{1}{4}\mathbf{h}^{(1)} + \tfrac{1}{4}\Delta t\, L(\mathbf{h}^{(1)}),\\
\mathbf{h}^{n+1} &= \tfrac{1}{3}\mathbf{h}^n + \tfrac{2}{3}\mathbf{h}^{(2)} + \tfrac{2}{3}\Delta t\, L(\mathbf{h}^{(2)}).
\end{aligned}$$
This scheme preserves the TVD property and suppresses numerical oscillations, making it suitable for complex flow phenomena in high-Mach-number flows. The velocity and temperature gradients in the governing equations are evaluated with a second-order central difference scheme to compute the stress tensor and heat flux, from which the viscous numerical flux is determined. The following section focuses on the evaluation of the inviscid numerical fluxes.
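For concreteness, the following is a minimal C++ sketch of one step of this Runge–Kutta scheme. The routine computeResidual is a hypothetical stand-in for the spatial operator L(h) that the actual solver assembles from the WENO-URLBFS fluxes; here it is replaced by a trivial placeholder so the sketch is self-contained.

```cpp
#include <vector>

using Field = std::vector<double>;   // flattened array of conservative variables

// Stand-in for the spatial operator L(h); the real solver would assemble it
// from the inviscid and viscous numerical fluxes.
static void computeResidual(const Field& h, Field& L)
{
    for (std::size_t i = 0; i < h.size(); ++i) L[i] = -h[i];
}

// One step of the third-order TVD Runge-Kutta scheme.
void rk3Step(Field& h, double dt)
{
    const std::size_t n = h.size();
    Field L(n), h1(n), h2(n);

    computeResidual(h, L);                                         // stage 1: h1 = h^n + dt*L(h^n)
    for (std::size_t i = 0; i < n; ++i) h1[i] = h[i] + dt * L[i];

    computeResidual(h1, L);                                        // stage 2: h2 = 3/4 h^n + 1/4 h1 + 1/4 dt*L(h1)
    for (std::size_t i = 0; i < n; ++i) h2[i] = 0.75 * h[i] + 0.25 * (h1[i] + dt * L[i]);

    computeResidual(h2, L);                                        // stage 3: h^{n+1} = 1/3 h^n + 2/3 h2 + 2/3 dt*L(h2)
    for (std::size_t i = 0; i < n; ++i) h[i] = (h[i] + 2.0 * (h2[i] + dt * L[i])) / 3.0;
}
```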

2.2. Inviscid Numerical Flux Evaluation Method

This section begins by reviewing the traditional RLBFS and then introduces an enhanced version, the URLBFS. The primary improvement in the URLBFS is the redefinition of the tangential velocity at the interface. Instead of relying on approximate methods, this improvement uses the exact Euler-equation solution to calculate the tangential velocity, which helps to reduce numerical dissipation significantly.

2.2.1. The RLBFS Scheme

Chen [47] proposed the RLBFS method, which decomposes the interface normal vector into two perpendicular components. Directional numerical fluxes are then obtained using the LBFS scheme with a D1Q4 lattice Boltzmann (LB) model and merged via a weighted procedure to form the final numerical flux. The D1Q4 model avoids the performance degradation that other one-dimensional models cause in LBFS schemes due to the large number of user-specified parameters [24,32]. The model is shown in Figure 1. The equilibrium distribution functions $g_m$ ($m = 1, 2, 3, 4$) and lattice velocities $d_m$ ($m = 1, 2$) can be expressed as follows:
$$\begin{aligned}
g_1 &= \frac{\rho\left(-d_1 d_2^2 - d_2^2 u + d_1 u^2 + d_1 \vartheta^2 + u^3 + 3u\vartheta^2\right)}{2 d_1 \left(d_1^2 - d_2^2\right)}, &
g_2 &= \frac{\rho\left(-d_1 d_2^2 + d_2^2 u + d_1 u^2 + d_1 \vartheta^2 - u^3 - 3u\vartheta^2\right)}{2 d_1 \left(d_1^2 - d_2^2\right)},\\
g_3 &= \frac{\rho\left(d_1^2 d_2 + d_1^2 u - d_2 u^2 - d_2 \vartheta^2 - u^3 - 3u\vartheta^2\right)}{2 d_2 \left(d_1^2 - d_2^2\right)}, &
g_4 &= \frac{\rho\left(d_1^2 d_2 - d_1^2 u - d_2 u^2 - d_2 \vartheta^2 + u^3 + 3u\vartheta^2\right)}{2 d_2 \left(d_1^2 - d_2^2\right)},
\end{aligned}$$
$$d_1 = \sqrt{u^2 + 3\vartheta^2 - \sqrt{4u^2\vartheta^2 + 6\vartheta^4}},\qquad
d_2 = \sqrt{u^2 + 3\vartheta^2 + \sqrt{4u^2\vartheta^2 + 6\vartheta^4}},$$
where $\vartheta = \sqrt{Dp/\rho}$ is the peculiar velocity and $D$ is the spatial dimension, which is 1 in this paper. When $g_m$ and $d_m$ are determined, the variables in Equation (1) are given by:
$$\rho = \sum_i g_i,\quad \rho u = \sum_i g_i \xi_i,\quad \rho u u + p = \sum_i g_i \xi_i \xi_i,\quad
\rho E = \sum_i g_i \left(\tfrac{1}{2}\xi_i \xi_i + e_p\right),\quad (\rho E + p)u = \sum_i g_i \left(\tfrac{1}{2}\xi_i \xi_i + e_p\right)\xi_i,$$
where $\xi_i$ represents the particle velocity, with $\xi_1 = d_1$, $\xi_2 = -d_1$, $\xi_3 = d_2$, $\xi_4 = -d_2$, $e_p = \left[1 - \tfrac{D(\gamma-1)}{2}\right]e$, and $e = p/\left[(\gamma-1)\rho\right]$. When the above D1Q4 model is applied to multidimensional problems, it is usually applied only along the interface normal. For the three-dimensional case depicted in Figure 2, this paper replaces the velocity $u$ in Equation (10) with the normal velocity $U_n = \mathbf{U}\cdot\mathbf{n}$. The tangential velocity vector is $\mathbf{U}_\tau = \left(u_{\tau x}, u_{\tau y}, u_{\tau z}\right) = \mathbf{u} - U_n \mathbf{n}$. Then, the variables $\mathbf{W}$ and convective flux $\mathbf{F}_c$ in Equation (2) are:
$$\mathbf{W} = \begin{pmatrix} \rho \\ \rho\left(U_n n_x + u_{\tau x}\right) \\ \rho\left(U_n n_y + u_{\tau y}\right) \\ \rho\left(U_n n_z + u_{\tau z}\right) \\ \rho\left(U_n^2/2 + e\right) + \rho \left|\mathbf{U}_\tau\right|^2/2 \end{pmatrix},\qquad
\mathbf{F}_c = \begin{pmatrix} \rho U_n \\ \left(\rho U_n^2 + p\right)n_x + \rho U_n u_{\tau x} \\ \left(\rho U_n^2 + p\right)n_y + \rho U_n u_{\tau y} \\ \left(\rho U_n^2 + p\right)n_z + \rho U_n u_{\tau z} \\ \left[\rho\left(U_n^2/2 + e\right) + p\right]U_n + \rho U_n \left|\mathbf{U}_\tau\right|^2/2 \end{pmatrix}.$$
Considering only the normal velocity, the inviscid flux at the cell interface x = 0 is:
$$\mathbf{F}_{c\,i+1/2}^{*} = \begin{pmatrix} \rho U_n & \rho U_n U_n + p & \left[\rho\left(\tfrac{1}{2}U_n U_n + e\right) + p\right]U_n \end{pmatrix}^{T} = \sum_i \xi_i \varphi_\alpha f_i(0, t),$$
where $\varphi_\alpha = \left(1,\ \xi_i,\ \tfrac{1}{2}\xi_i \xi_i + e_p\right)^{T}$ and the superscript $*$ indicates the numerical value at the interface. $f_i(0,t)$ is given by:
$$f_i(0, t) = g_i(0, t) - \tau_0\left[g_i(0, t) - g_i(-\xi_i \delta t, t - \delta t)\right] + O(\delta t),$$
where $g_i(0,t)$ and $g_i(-\xi_i \delta t, t - \delta t)$ represent the equilibrium distribution functions at the interface and at the adjacent points, respectively. The numerical flux accounting only for the normal direction is:
$$\mathbf{F}_{c\,i+1/2}^{*} = \left(1 - \tau_0\right)\sum_i \xi_i \varphi_\alpha g_i(0, t) + \tau_0 \sum_i \xi_i \varphi_\alpha g_i(-\xi_i \delta t, t - \delta t) = \left(1 - \tau_0\right)\mathbf{F}_{c,i+1/2}^{(I)} + \tau_0 \mathbf{F}_{c,i+1/2}^{(II)},$$
where $\tau_0 = \tau/\delta t$ is the dimensionless collision time. Inserting Equation (14) into Equation (13) and including the effect of the tangential velocity, the complete numerical flux takes the form:
$$\mathbf{F}_{c,i+1/2}^{*} = \left(1 - \tau_0\right)
\begin{pmatrix} F_{c,i+1/2}^{(I)(1)} \\ F_{c,i+1/2}^{(I)(2)}\, n_x + \rho U_n u_{\tau x} \\ F_{c,i+1/2}^{(I)(2)}\, n_y + \rho U_n u_{\tau y} \\ F_{c,i+1/2}^{(I)(2)}\, n_z + \rho U_n u_{\tau z} \\ F_{c,i+1/2}^{(I)(3)} + \rho U_n \left|\mathbf{U}_\tau\right|^2/2 \end{pmatrix}
+ \tau_0
\begin{pmatrix} F_{c,i+1/2}^{(II)(1)} \\ F_{c,i+1/2}^{(II)(2)}\, n_x + \rho U_n u_{\tau x} \\ F_{c,i+1/2}^{(II)(2)}\, n_y + \rho U_n u_{\tau y} \\ F_{c,i+1/2}^{(II)(2)}\, n_z + \rho U_n u_{\tau z} \\ F_{c,i+1/2}^{(II)(3)} + \rho U_n \left|\mathbf{U}_\tau\right|^2/2 \end{pmatrix}.$$
From the above equation, it is clear that the key step in obtaining the inviscid numerical flux is the evaluation of $g_i(0,t)$ and $g_i(-\xi_i \delta t, t - \delta t)$. The function $g_i(-\xi_i \delta t, t - \delta t)$ can be obtained from:
$$g_i(-\xi_i \delta t, t - \delta t) = \begin{cases} g_i^{L}, & \text{if } i = 1, 3,\\ g_i^{R}, & \text{if } i = 2, 4,\end{cases}$$
where $g_i^{L}$ and $g_i^{R}$ are calculated from Equation (10) using the states on the left and right sides of the interface, as shown in Figure 3. The numerical flux $\mathbf{F}_{c,i+1/2}^{(II)}$ contributed by the normal velocity is then:
$$\mathbf{F}_{c,i+1/2}^{(II)} = \begin{pmatrix} \rho U_n & \rho U_n U_n + p & \left[\rho\left(\tfrac{1}{2}U_n U_n + e\right) + p\right]U_n \end{pmatrix}^{T} = \sum_i \xi_i \varphi_\alpha g_i(-\xi_i \delta t, t - \delta t) = \sum_{i=1,3} \xi_i \varphi_\alpha g_i^{L} + \sum_{i=2,4} \xi_i \varphi_\alpha g_i^{R}.$$
From the above equation, the interface values of conservative variables can be calculated by:
$$\mathbf{W}^{*}(0, t)_{i+1/2} = \begin{pmatrix} \rho & \rho U_n & \tfrac{1}{2}\rho U_n U_n + \rho e \end{pmatrix}^{T} = \sum_i \varphi_\alpha g_i(0, t) = \sum_i \varphi_\alpha g_i(-\xi_i \delta t, t - \delta t) = \sum_{i=1,3} \varphi_\alpha g_i^{L} + \sum_{i=2,4} \varphi_\alpha g_i^{R}.$$
With the flow variables obtained from the above formula and Equation (10), $g_i(0,t)$ can be easily calculated. Therefore, the two flux components $\mathbf{F}_{c,i+1/2}^{(I)}$ and $\mathbf{F}_{c,i+1/2}^{(II)}$ are both available. The coefficient $\tau_0$ in Equation (16) controls the numerical dissipation. Yang et al. [27] proposed a switch function to determine it, as shown below:
$$\tau_0 = \tanh\!\left(m\, \frac{\left|p_L - p_R\right|}{p_L + p_R}\right).$$
Here, p L and p R denote the interface pressures. The parameter m is an empirical constant, and its value is set to 100 in this study.
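The normal-direction building blocks above can be summarized in code. The sketch below is illustrative only (the names are not the authors' implementation): it evaluates the D1Q4 lattice velocities and equilibrium distributions for a given state (ρ, U_n, p), together with the pressure-based switch function τ0.

```cpp
#include <cmath>

struct D1Q4 {
    double d1, d2;   // lattice velocities (+/- d1, +/- d2)
    double g[4];     // equilibrium distribution functions g1..g4
};

// Equilibrium distributions of the D1Q4 model for density rho, normal velocity un and pressure p.
D1Q4 equilibriumD1Q4(double rho, double un, double p)
{
    const double th2 = p / rho;                         // theta^2 = D*p/rho with D = 1
    const double u2  = un * un;
    const double s   = std::sqrt(4.0 * u2 * th2 + 6.0 * th2 * th2);

    D1Q4 m;
    m.d1 = std::sqrt(u2 + 3.0 * th2 - s);
    m.d2 = std::sqrt(u2 + 3.0 * th2 + s);

    const double d1 = m.d1, d2 = m.d2;
    const double den1 = 2.0 * d1 * (d1 * d1 - d2 * d2);
    const double den2 = 2.0 * d2 * (d1 * d1 - d2 * d2);
    const double u3 = un * u2, ut = 3.0 * un * th2;

    m.g[0] = rho * (-d1 * d2 * d2 - d2 * d2 * un + d1 * u2 + d1 * th2 + u3 + ut) / den1;
    m.g[1] = rho * (-d1 * d2 * d2 + d2 * d2 * un + d1 * u2 + d1 * th2 - u3 - ut) / den1;
    m.g[2] = rho * ( d1 * d1 * d2 + d1 * d1 * un - d2 * u2 - d2 * th2 - u3 - ut) / den2;
    m.g[3] = rho * ( d1 * d1 * d2 - d1 * d1 * un - d2 * u2 - d2 * th2 + u3 + ut) / den2;
    return m;
}

// Switch function controlling the dissipative part of the flux (m = 100 in this paper).
double tau0(double pL, double pR, double mSwitch = 100.0)
{
    return std::tanh(mSwitch * std::fabs(pL - pR) / (pL + pR));
}
```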
The interface normal vector in the RLBFS scheme is decomposed into two vectors n 1 and n 2 , as shown in Figure 4. The final numerical flux is obtained by combining the fluxes in the two directions with appropriate weights [47]. The relationship between vectors n 1 and n 2 is as follows:
$$\mathbf{n}_1 \cdot \mathbf{n}_2 = 0,\qquad \mathbf{n} = \alpha_1 \mathbf{n}_1 + \alpha_2 \mathbf{n}_2,$$
where $\left|\mathbf{n}_1\right| = \left|\mathbf{n}_2\right| = 1$, $\alpha_1 = \mathbf{n}\cdot\mathbf{n}_1$, and $\alpha_2 = \mathbf{n}\cdot\mathbf{n}_2$. Requiring $\alpha_1, \alpha_2 \ge 0$ ensures that $\mathbf{n}_1$ and $\mathbf{n}_2$ always point to the same side of the interface as $\mathbf{n}$. The numerical flux is calculated using the following weighted combination:
$$\mathbf{F}_{RLBFS} = \mathbf{F}_{RLBFS}(\mathbf{n}) = \alpha_1 \mathbf{F}_{LBFS}(\mathbf{n}_1) + \alpha_2 \mathbf{F}_{LBFS}(\mathbf{n}_2),$$
where $\mathbf{F}_{LBFS}(\mathbf{n}_1)$ and $\mathbf{F}_{LBFS}(\mathbf{n}_2)$ are determined using Equation (16). From the above formula, it is evident that $\alpha_1$ plays a crucial role, as its value directly influences $\mathbf{F}_{RLBFS}$. Since $\alpha_1 = \mathbf{n}\cdot\mathbf{n}_1$, the choice of $\mathbf{n}_1$ is decisive for the result [48]. This paper adopts the definition of Nishikawa et al. [49]:
$$\mathbf{n}_1 = \begin{cases} \mathbf{n}, & \text{if } \sqrt{(\Delta u)^2 + (\Delta v)^2 + (\Delta w)^2} \le \varepsilon,\\[4pt]
\dfrac{\Delta u\, \mathbf{i} + \Delta v\, \mathbf{j} + \Delta w\, \mathbf{k}}{\sqrt{(\Delta u)^2 + (\Delta v)^2 + (\Delta w)^2}}, & \text{otherwise},\end{cases}$$
where $\Delta(\cdot) = (\cdot)_R - (\cdot)_L$. When $\varepsilon$ is sufficiently close to zero it does not affect the numerical result, whereas larger values have a more significant impact [48]. In this article, $\varepsilon = 10^{-3} U_{\infty}$, where $U_{\infty}$ is the free-stream velocity. Equation (25) determines $\mathbf{n}_1$, and $\mathbf{n}_2$ is then obtained from:
$$\mathbf{n}_2 = \left(\mathbf{n}_1 \times \mathbf{n}\right) \times \mathbf{n}_1.$$
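To illustrate the rotated decomposition, the following sketch assembles n1, n2, and the weighted combination of the two directional fluxes. The routine lbfsFlux is a caller-supplied placeholder for the normal-direction LBFS evaluation, and all names are illustrative rather than the authors' code; eps would be taken as 10^-3 times the free-stream velocity.

```cpp
#include <array>
#include <cmath>
#include <functional>

using Vec3 = std::array<double, 3>;
using Flux = std::array<double, 5>;

static double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0] };
}

// n   : unit face normal;  dU : velocity jump (du, dv, dw) across the interface
// lbfsFlux : evaluation of the normal-direction LBFS flux along a unit vector
Flux rotatedFlux(const Vec3& n, const Vec3& dU, double eps,
                 const std::function<Flux(const Vec3&)>& lbfsFlux)
{
    // First rotated direction: the velocity-difference direction, unless the jump is negligible.
    double mag = std::sqrt(dot(dU, dU));
    Vec3 n1 = (mag <= eps) ? n : Vec3{ dU[0]/mag, dU[1]/mag, dU[2]/mag };
    if (dot(n, n1) < 0.0)                          // flip so that alpha1 = n.n1 >= 0
        n1 = { -n1[0], -n1[1], -n1[2] };

    double a1 = dot(n, n1);
    Flux f = lbfsFlux(n1);
    for (double& c : f) c *= a1;

    // Second direction: orthogonal to n1, in the plane spanned by n and n1.
    Vec3 t = cross(cross(n1, n), n1);
    double tn = std::sqrt(dot(t, t));
    if (tn > 1e-12) {                              // if n1 is (nearly) parallel to n, alpha2 ~ 0
        Vec3 n2 = { t[0]/tn, t[1]/tn, t[2]/tn };
        double a2 = dot(n, n2);
        Flux f2 = lbfsFlux(n2);
        for (int k = 0; k < 5; ++k) f[k] += a2 * f2[k];
    }
    return f;
}
```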

2.2.2. The URLBFS Developed Based on RLBFS

Following the above description of the RLBFS approach, it is evident that the method relies solely on the D1Q4 model for computing the normal numerical flux, while the contribution of the tangential velocity to the momentum and energy fluxes is evaluated approximately, as shown in the following formula:
$$\begin{aligned}
\rho \mathbf{U}_\tau &= \sum_i g_i \mathbf{U}_\tau = \sum_{i=1,3} g_i^{L} \mathbf{U}_\tau^{L} + \sum_{i=2,4} g_i^{R} \mathbf{U}_\tau^{R},\\
\rho U_n \mathbf{U}_\tau &= \sum_i \xi_i g_i \mathbf{U}_\tau = \sum_{i=1,3} \xi_i g_i^{L} \mathbf{U}_\tau^{L} + \sum_{i=2,4} \xi_i g_i^{R} \mathbf{U}_\tau^{R},\\
\rho U_n \left|\mathbf{U}_\tau\right|^2 &= \sum_i \xi_i g_i \left|\mathbf{U}_\tau\right|^2 = \sum_{i=1,3} \xi_i g_i^{L} \left|\mathbf{U}_\tau^{L}\right|^2 + \sum_{i=2,4} \xi_i g_i^{R} \left|\mathbf{U}_\tau^{R}\right|^2.
\end{aligned}$$
Here $\mathbf{U}_\tau$ is the tangential velocity at the interface, and $\mathbf{U}_\tau^{L}$ and $\mathbf{U}_\tau^{R}$ are its values on the left and right sides of the interface. Since this treatment relies on approximation, it inevitably introduces some numerical dissipation, which hampers the ability to capture fine vortex structures and reduces the resolution of the flow field. Especially for complex flow problems, numerical dissipation may lead to the loss or distortion of essential flow features. To achieve a numerical method with low dissipation, this study presents an enhanced scheme derived from the analytical solution of the Euler equations [35], which allows a more accurate tangential velocity $\mathbf{U}_\tau^{*}$ to be recalculated. Specifically, the method first solves the interface normal velocity $U_n^{*}$ through Equation (28) and then uses the upwind direction of $U_n^{*}$ to determine $\mathbf{U}_\tau^{*}$:
$$\rho = \sum_{i=1,3} g_i^{L} + \sum_{i=2,4} g_i^{R},\qquad \rho U_n = \sum_{i=1,3} \xi_i g_i^{L} + \sum_{i=2,4} \xi_i g_i^{R}.$$
Let $U_n^{upwind} = U_n^{*}$; the new tangential velocity $\mathbf{U}_\tau^{*}$ is then determined from the sign of $U_n^{upwind}$, as shown below:
$$\mathbf{U}_\tau^{*} = \begin{cases} \mathbf{U}_\tau^{L}, & \text{if } U_n^{upwind} \ge 0,\\ \mathbf{U}_\tau^{R}, & \text{if } U_n^{upwind} < 0.\end{cases}$$
So, Equation (27) can be reformulated as follows:
$$\begin{aligned}
&\left\{\begin{aligned}
\rho \mathbf{U}_\tau &= \sum_i g_i \mathbf{U}_\tau^{*} = \sum_{i=1,3} g_i^{L} \mathbf{U}_\tau^{L} + \sum_{i=2,4} g_i^{R} \mathbf{U}_\tau^{L},\\
\rho U_n \mathbf{U}_\tau &= \sum_i \xi_i g_i \mathbf{U}_\tau^{*} = \sum_{i=1,3} \xi_i g_i^{L} \mathbf{U}_\tau^{L} + \sum_{i=2,4} \xi_i g_i^{R} \mathbf{U}_\tau^{L},\\
\rho U_n \left|\mathbf{U}_\tau\right|^2 &= \sum_i \xi_i g_i \left|\mathbf{U}_\tau^{*}\right|^2 = \sum_{i=1,3} \xi_i g_i^{L} \left|\mathbf{U}_\tau^{L}\right|^2 + \sum_{i=2,4} \xi_i g_i^{R} \left|\mathbf{U}_\tau^{L}\right|^2,
\end{aligned}\right.\qquad U_n^{upwind} \ge 0,\\[6pt]
&\left\{\begin{aligned}
\rho \mathbf{U}_\tau &= \sum_i g_i \mathbf{U}_\tau^{*} = \sum_{i=1,3} g_i^{L} \mathbf{U}_\tau^{R} + \sum_{i=2,4} g_i^{R} \mathbf{U}_\tau^{R},\\
\rho U_n \mathbf{U}_\tau &= \sum_i \xi_i g_i \mathbf{U}_\tau^{*} = \sum_{i=1,3} \xi_i g_i^{L} \mathbf{U}_\tau^{R} + \sum_{i=2,4} \xi_i g_i^{R} \mathbf{U}_\tau^{R},\\
\rho U_n \left|\mathbf{U}_\tau\right|^2 &= \sum_i \xi_i g_i \left|\mathbf{U}_\tau^{*}\right|^2 = \sum_{i=1,3} \xi_i g_i^{L} \left|\mathbf{U}_\tau^{R}\right|^2 + \sum_{i=2,4} \xi_i g_i^{R} \left|\mathbf{U}_\tau^{R}\right|^2,
\end{aligned}\right.\qquad U_n^{upwind} < 0.
\end{aligned}$$
In this way, the URLBFS scheme provides a more accurate tangential velocity, eliminates the numerical errors introduced by the approximate treatment, reduces numerical dissipation, and enables the precise capture of both strong shock waves and small-scale vortices.
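The upwind selection of the tangential velocity can be summarized by the following sketch, which assumes the left/right equilibrium distributions g_i^L, g_i^R and the particle velocities ξ_i of the D1Q4 model are already available; the names are illustrative only.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// gL[4], gR[4]: left/right equilibrium distributions (indices 0..3 correspond to g1..g4)
// xi[4]       : particle velocities (+d1, -d1, +d2, -d2)
// uTauL, uTauR: tangential velocity vectors on the two sides of the interface
Vec3 upwindTangentialVelocity(const double gL[4], const double gR[4],
                              const double xi[4],
                              const Vec3& uTauL, const Vec3& uTauR)
{
    // Interface density and normal momentum from the streamed distributions:
    // particles 1 and 3 (positive velocities) come from the left cell, 2 and 4 from the right.
    double rho      = gL[0] + gR[1] + gL[2] + gR[3];
    double rhoUn    = xi[0]*gL[0] + xi[1]*gR[1] + xi[2]*gL[2] + xi[3]*gR[3];
    double unUpwind = rhoUn / rho;

    // The tangential velocity is taken from the upwind side of the interface.
    return (unUpwind >= 0.0) ? uTauL : uTauR;
}
```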

2.3. WENO Scheme Reconstruction Interface Flow Variables

Rather than applying the WENO method directly to reconstruct interface fluxes, as in standard FDM approaches, this work first uses the WENO method to interpolate the flow variables on each side of the interface and then feeds those values into the URLBFS described above. As a result, the scheme achieves higher accuracy and lower numerical dissipation compared to conventional FDM.

2.3.1. High-Order WENO Reconstruction Method

This section focuses on the WENO5 and WENO7 interpolation methods. To obtain the physical quantities $(\rho, u, v, w, p)$ on the left side of the interface, the WENO5 scheme uses the information at the five nodes surrounding the interface, $S = \left\{U_{i-2}, U_{i-1}, U_i, U_{i+1}, U_{i+2}\right\}$, as shown in Figure 5.
The global stencil is divided into three sub-stencils, on each of which a low-order reconstruction polynomial is constructed:
$$S_0 = \left\{U_{i-2}, U_{i-1}, U_i\right\},\qquad S_1 = \left\{U_{i-1}, U_i, U_{i+1}\right\},\qquad S_2 = \left\{U_i, U_{i+1}, U_{i+2}\right\}.$$
After a weighted combination of the three candidate reconstructions, the final WENO5 value is obtained:
$$U_{i+1/2}^{L} = w_0\left(\tfrac{1}{3}U_{i-2} - \tfrac{7}{6}U_{i-1} + \tfrac{11}{6}U_i\right) + w_1\left(-\tfrac{1}{6}U_{i-1} + \tfrac{5}{6}U_i + \tfrac{1}{3}U_{i+1}\right) + w_2\left(\tfrac{1}{3}U_i + \tfrac{5}{6}U_{i+1} - \tfrac{1}{6}U_{i+2}\right),$$
The weight coefficients and smoothness indicators of Jiang and Shu [11] are used as follows:
$$w_k = \frac{\alpha_k}{\alpha_0 + \alpha_1 + \alpha_2},\qquad \alpha_k = \frac{d_k}{\left(\beta_k + \epsilon\right)^2},\qquad k = 0, 1, 2,$$
$$\begin{aligned}
\beta_0 &= \tfrac{13}{12}\left(U_{i-2} - 2U_{i-1} + U_i\right)^2 + \tfrac{1}{4}\left(U_{i-2} - 4U_{i-1} + 3U_i\right)^2,\\
\beta_1 &= \tfrac{13}{12}\left(U_{i-1} - 2U_i + U_{i+1}\right)^2 + \tfrac{1}{4}\left(U_{i-1} - U_{i+1}\right)^2,\\
\beta_2 &= \tfrac{13}{12}\left(U_i - 2U_{i+1} + U_{i+2}\right)^2 + \tfrac{1}{4}\left(3U_i - 4U_{i+1} + U_{i+2}\right)^2.
\end{aligned}$$
The optimal linear weights are given by [50]:
$$d_0 = \tfrac{1}{10},\qquad d_1 = \tfrac{3}{5},\qquad d_2 = \tfrac{3}{10}.$$
The parameter $\epsilon = 10^{-6}$ is used to ensure that the denominator is non-zero. Only the calculation of $U_{i+1/2}^{L}$ is given above; the right-side value $U_{i+1/2}^{R}$ is obtained symmetrically and is not discussed in detail here.
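As an illustration, a minimal implementation of the WENO5 left-biased reconstruction described above might look as follows; for GPU use the same routine can simply be qualified as a __device__ function. This is a sketch, not the authors' code.

```cpp
#include <cmath>

// Left-biased WENO5 reconstruction of U_{i+1/2} from {U_{i-2},...,U_{i+2}}.
double weno5Left(double um2, double um1, double u0, double up1, double up2)
{
    const double eps = 1e-6;                      // avoids a zero denominator

    // Candidate third-order reconstructions on the three sub-stencils.
    double q0 = ( 2.0*um2 - 7.0*um1 + 11.0*u0) / 6.0;
    double q1 = (-1.0*um1 + 5.0*u0  +  2.0*up1) / 6.0;
    double q2 = ( 2.0*u0  + 5.0*up1 -  1.0*up2) / 6.0;

    // Jiang-Shu smoothness indicators.
    double b0 = 13.0/12.0*std::pow(um2 - 2.0*um1 + u0, 2) + 0.25*std::pow(um2 - 4.0*um1 + 3.0*u0, 2);
    double b1 = 13.0/12.0*std::pow(um1 - 2.0*u0 + up1, 2) + 0.25*std::pow(um1 - up1, 2);
    double b2 = 13.0/12.0*std::pow(u0 - 2.0*up1 + up2, 2) + 0.25*std::pow(3.0*u0 - 4.0*up1 + up2, 2);

    // Nonlinear weights built from the optimal linear weights d = (1/10, 3/5, 3/10).
    double a0 = 0.1 / std::pow(b0 + eps, 2);
    double a1 = 0.6 / std::pow(b1 + eps, 2);
    double a2 = 0.3 / std::pow(b2 + eps, 2);

    return (a0*q0 + a1*q1 + a2*q2) / (a0 + a1 + a2);
}
```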
The global stencil of the WENO7 scheme contains seven surrounding points, $S = \left\{U_{i-3}, \ldots, U_{i+3}\right\}$, and is divided into four sub-stencils, as shown in Figure 6.
The left side flow variables of the interface are reconstructed using WENO7 as follows:
$$\begin{aligned}
U_{i+1/2}^{L} ={}& w_0\left(-\tfrac{1}{4}U_{i-3} + \tfrac{13}{12}U_{i-2} - \tfrac{23}{12}U_{i-1} + \tfrac{25}{12}U_i\right) + w_1\left(\tfrac{1}{12}U_{i-2} - \tfrac{5}{12}U_{i-1} + \tfrac{13}{12}U_i + \tfrac{1}{4}U_{i+1}\right)\\
&+ w_2\left(-\tfrac{1}{12}U_{i-1} + \tfrac{7}{12}U_i + \tfrac{7}{12}U_{i+1} - \tfrac{1}{12}U_{i+2}\right) + w_3\left(\tfrac{1}{4}U_i + \tfrac{13}{12}U_{i+1} - \tfrac{5}{12}U_{i+2} + \tfrac{1}{12}U_{i+3}\right),
\end{aligned}$$
The detailed parameters are as follows:
$$w_k = \frac{\alpha_k}{\alpha_0 + \alpha_1 + \alpha_2 + \alpha_3},\qquad \alpha_k = \frac{d_k}{\left(\beta_k + \epsilon\right)^2},\qquad k = 0, 1, 2, 3,$$
$$\begin{aligned}
\beta_0 ={}& U_{i-3}\left(547U_{i-3} - 3882U_{i-2} + 4642U_{i-1} - 1854U_i\right) + U_{i-2}\left(7043U_{i-2} - 17246U_{i-1} + 7042U_i\right)\\
&+ U_{i-1}\left(11003U_{i-1} - 9402U_i\right) + 2107U_i^2,\\
\beta_1 ={}& U_{i-2}\left(267U_{i-2} - 1642U_{i-1} + 1602U_i - 494U_{i+1}\right) + U_{i-1}\left(2843U_{i-1} - 5966U_i + 1922U_{i+1}\right)\\
&+ U_i\left(3443U_i - 2522U_{i+1}\right) + 547U_{i+1}^2,\\
\beta_2 ={}& U_{i-1}\left(547U_{i-1} - 2522U_i + 1922U_{i+1} - 494U_{i+2}\right) + U_i\left(3443U_i - 5966U_{i+1} + 1602U_{i+2}\right)\\
&+ U_{i+1}\left(2843U_{i+1} - 1642U_{i+2}\right) + 267U_{i+2}^2,\\
\beta_3 ={}& U_i\left(2107U_i - 9402U_{i+1} + 7042U_{i+2} - 1854U_{i+3}\right) + U_{i+1}\left(11003U_{i+1} - 17246U_{i+2} + 4642U_{i+3}\right)\\
&+ U_{i+2}\left(7043U_{i+2} - 3882U_{i+3}\right) + 547U_{i+3}^2,
\end{aligned}$$
$$d_0 = \tfrac{1}{35},\qquad d_1 = \tfrac{12}{35},\qquad d_2 = \tfrac{18}{35},\qquad d_3 = \tfrac{4}{35}.$$
The parameter $\epsilon$ takes the same value as in WENO5, and only the interpolation of the left-side variables is shown; the right-side variables are obtained by the symmetric operation.

2.3.2. Characteristic Space

In compressible flow problems, oscillations often occur at shocks and strong discontinuities, which may cause the computation to diverge [51]. To handle these complex flows more robustly, the reconstruction process is carried out in characteristic space. For convenience of description, only the Jacobian matrix in the x direction is shown:
$$A(\mathbf{U}) = \left.\frac{\partial \mathbf{F}}{\partial \mathbf{U}}\right|_{i+1/2} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0\\
\hat{\gamma}H - u^2 - c^2 & (3 - \gamma)u & -\hat{\gamma}v & -\hat{\gamma}w & \hat{\gamma}\\
-uv & v & u & 0 & 0\\
-uw & w & 0 & u & 0\\
u\left[\tfrac{1}{2}(\gamma - 1)V^2 - H\right] & H - \hat{\gamma}u^2 & -\hat{\gamma}uv & -\hat{\gamma}uw & \gamma u
\end{pmatrix},$$
where $\hat{\gamma} = \gamma - 1$ and the speed of sound is $c = \sqrt{\gamma p/\rho}$. The total enthalpy $H$ is defined as:
$$H = \tfrac{1}{2}V^2 + \frac{c^2}{\gamma - 1},$$
where $V = \sqrt{u^2 + v^2 + w^2}$. The matrix of right eigenvectors of $A(\mathbf{U})$ is:
$$R_F = \begin{pmatrix}
1 & 1 & 0 & 0 & 1\\
u - c & u & 0 & 0 & u + c\\
v & v & 0 & -c & v\\
w & w & c & 0 & w\\
H - uc & \tfrac{1}{2}V^2 & cw & -cv & H + uc
\end{pmatrix}.$$
The matrix of left eigenvectors of $A(\mathbf{U})$, $L_F$, is simply $R_F^{-1}$. The conservative variables $\mathbf{U}_{i+1/2,j,k}$ at $x_{i+1/2,j,k}$ are taken as the average:
$$\mathbf{U}_{i+1/2,j,k} = \tfrac{1}{2}\left(\mathbf{U}_{i,j,k} + \mathbf{U}_{i+1,j,k}\right).$$
The conservative variables in physical space are transformed into characteristic space by left-multiplying with $L_F$, as shown below:
$$\mathbf{V}_{m,j,k} = L_F \mathbf{U}_{m,j,k},$$
where $m$ runs over the stencil nodes required for the interpolation reconstruction. After the WENO reconstruction in characteristic space yields the interface values $\mathbf{V}_{i+1/2,j,k}^{\pm}$, they are transformed back to physical space through the right eigenvector matrix $R_F$:
$$\mathbf{U}_{i+1/2,j,k}^{\pm} = R_F \mathbf{V}_{i+1/2,j,k}^{\pm}.$$
The conservative variables are obtained through the above process, and the flow variables can then be easily derived. The procedures for the other two directions are identical and are not repeated here.
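The characteristic-space procedure can be sketched as follows, assuming frozen eigenvector matrices L_F and R_F at the interface and reusing the weno5Left routine sketched in Section 2.3.1. The code is illustrative only.

```cpp
#include <array>

using State  = std::array<double, 5>;
using Matrix = std::array<std::array<double, 5>, 5>;

double weno5Left(double, double, double, double, double);   // as sketched in Section 2.3.1

static State matVec(const Matrix& M, const State& u)
{
    State r{};
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            r[i] += M[i][j] * u[j];
    return r;
}

// stencil[0..4] holds U_{i-2},...,U_{i+2} in physical space.
State reconstructLeftCharacteristic(const Matrix& L, const Matrix& R,
                                    const std::array<State, 5>& stencil)
{
    std::array<State, 5> v;                       // project the stencil to characteristic space
    for (int m = 0; m < 5; ++m) v[m] = matVec(L, stencil[m]);

    State vFace{};                                // component-wise WENO reconstruction
    for (int c = 0; c < 5; ++c)
        vFace[c] = weno5Left(v[0][c], v[1][c], v[2][c], v[3][c], v[4][c]);

    return matVec(R, vFace);                      // transform back to physical space
}
```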

3. GPU Implementations

In previous research on LBFS and RLBFS, the calculations were mainly CPU-based, with some work accelerating the computation through OpenMP directives and the message passing interface (MPI). However, the acceleration is limited by the number of CPU threads. Unlike a traditional CPU, a GPU contains a very large number of arithmetic logic units (ALUs) and supports far more concurrent threads, which makes it well suited to large-scale, fine-grained parallel computation of relatively simple operations. OpenCL and CUDA are the two main programming models for GPU parallel tasks. This study utilizes the CUDA platform for GPU programming to fully leverage the highly parallel computing capabilities of the GPU, thereby significantly improving computational efficiency.

3.1. CUDA-GPU Introduction

The original program was written in Fortran 90 to solve the compressible flow. To use the GPU to accelerate the calculation, the Fortran program needs to be rewritten in a version that supports CUDA; there are usually two options, CUDA Fortran (PGI) and CUDA C (NVIDIA). In this study, to fully utilize the capabilities of CUDA for efficient GPU computing, the Fortran program was rewritten in CUDA C. In a heterogeneous computing environment, the CPU and GPU undertake different tasks to realize parallel computing effectively. Generally, the CPU is suitable for complex logic and program control, while the GPU excels at large-scale, simple, parallel computing tasks; a heterogeneous programming model is therefore adopted. In CUDA programming, the code is divided into serial and parallel parts. The code running on the host (CPU) prepares the necessary data, such as variable declaration, initialization, and data output operations. The host then transfers the data to the device (GPU), which is responsible for performing the computationally intensive tasks. Once the GPU completes its work, the result is returned to the host through data transfer for subsequent processing; the overall process is shown in Figure 7, and a minimal sketch of this workflow is given below. The code that runs on the GPU is commonly referred to as a "kernel." These parallel codes are executed simultaneously by hundreds or even thousands of threads on the device. The threads are grouped into thread blocks, which can in turn be organized into grids of thread blocks according to the characteristics of the hardware, as shown in Figure 8. To simplify programming, each thread is assigned a unique thread index, and threads can be organized into 1D, 2D, or 3D thread-block structures. The numbers of threads and thread blocks should be chosen based on hardware performance and computing requirements to maximize the use of GPU resources: too few threads or blocks leads to underutilization of hardware resources, while too many may cause resource contention and performance degradation. Choosing an appropriate thread and thread-block configuration is therefore key to optimizing GPU computing performance.
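A minimal CUDA C example of this host/device workflow is sketched below; the kernel body is a placeholder for the actual flux and time-advancement computations, and the array size and launch configuration are only illustrative.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel: the real code would evaluate fluxes and advance the solution.
__global__ void updateField(double* w, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) w[idx] *= 1.0;
}

int main()
{
    const int n = 1 << 20;
    std::vector<double> hostW(n, 1.0);                 // part 1: initialization on the host

    double* devW = nullptr;
    cudaMalloc(&devW, n * sizeof(double));
    cudaMemcpy(devW, hostW.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    const int block = 256;                             // threads per block
    const int grid  = (n + block - 1) / block;         // enough blocks to cover all cells
    updateField<<<grid, block>>>(devW, n);             // part 2: compute-intensive work on the device
    cudaDeviceSynchronize();

    cudaMemcpy(hostW.data(), devW, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(devW);                                    // part 3: results back on the host for output
    return 0;
}
```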

3.2. Parallel Implementation of the WENO-URLBFS Scheme

This paper uses an NVIDIA TITAN V GPU and the CUDA C programming model to develop the WENO-URLBFS GPU parallel code; the parameters of the TITAN V graphics card are listed in Table 1. Solving the compressible flow problem with the FD-WENO-URLBFS scheme consists of three key parts: (1) flow field initialization; (2) boundary conditions, flux evaluation, and time advancement; (3) checking the simulation time and outputting the results, as shown in Figure 9a. The first part assigns initial values to the flow field and defines some global parameters; it is executed only once at the beginning of the program and has little impact on the calculation time, so it does not need to be parallelized. The second part solves the entire flow field, evaluating the numerical fluxes with the URLBFS scheme so that the flow characteristics are accurately captured. After the spatial discrete operator is obtained, the TVD Runge–Kutta method performs the time advancement and yields the flow variables of the next time step. Since this part requires millions of operations on every grid interface, it dominates the total calculation time and therefore needs to be parallelized. The last part determines whether the calculation has reached the termination time and, if so, outputs the data of the entire flow field; since it is executed only once at the end of the calculation, it does not need to be parallelized.
The second part accounts for nearly all of the computing time, so it is the part to parallelize. As described above, the GPU is only a computing device and its work requires commands from the host; the host drives the device by calling kernel functions. The second part is therefore implemented on the GPU as a kernel function, and its specific process is shown in Figure 9b. An important feature of a kernel function is that it allows multiple threads to be allocated, so the number of threads must be specified appropriately before the kernel is launched. For three-dimensional flow problems, the thread configuration can be defined by $grid\_size\,(G_x, G_y, G_z)$ and $block\_size\,(B_x, B_y, B_z)$, where $grid\_size$ is the number of thread blocks, $block\_size$ is the thread block size, and the product of the two is the total number of threads. Higher computational efficiency is achieved on the GPU only when the number of threads exceeds the available computing cores, allowing more effective utilization of the computational resources. Each thread has a unique identity in the kernel function that determines the computing cell for which it is responsible. Taking the three-dimensional problem as an example, the index corresponding to the computational cell in the x direction is calculated as follows:
$$n_x = \mathrm{blockDim.x} \times \mathrm{blockIdx.x} + \mathrm{threadIdx.x},$$
where $\mathrm{blockDim.x}$ is the number of threads in a thread block, corresponding to the value of $block\_size.B_x$; $\mathrm{blockIdx.x}$ is the index of the thread block within the grid; and $\mathrm{threadIdx.x}$ denotes the thread's position inside its block. Through this indexing mechanism, each thread uniquely addresses a specific cell of the computational grid, as sketched below. With the above in place, parallel computing is realized; both the CPU and GPU codes use double precision, so there is no precision mismatch.
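The following sketch illustrates this 3D indexing for a kernel that assigns one grid cell to each thread; the cell update itself is a placeholder, and the launch configuration (8 × 8 × 8 threads per block) is only an example, not the configuration used in the paper.

```cuda
#include <cuda_runtime.h>

// One thread per grid cell; nx, ny, nz are the numbers of cells in each direction.
__global__ void advanceCells(double* w, int nx, int ny, int nz)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    int k = blockDim.z * blockIdx.z + threadIdx.z;
    if (i >= nx || j >= ny || k >= nz) return;     // guard threads outside the grid

    int cell = (k * ny + j) * nx + i;              // flattened 3D index
    w[cell] += 0.0;                                // placeholder for flux evaluation / time advance
}

void launchAdvance(double* devW, int nx, int ny, int nz)
{
    dim3 block(8, 8, 8);                           // block_size (Bx, By, Bz)
    dim3 grid((nx + block.x - 1) / block.x,        // grid_size (Gx, Gy, Gz)
              (ny + block.y - 1) / block.y,
              (nz + block.z - 1) / block.z);
    advanceCells<<<grid, block>>>(devW, nx, ny, nz);
}
```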

4. Numerical Tests and Validation

This section verifies the numerical accuracy of the WENO-URLBFS scheme and evaluates the performance of its GPU implementation through the density perturbation advection problem. Additionally, test cases such as the inviscid Taylor–Green vortex, explosion in a box, explosion in an enclosed cabin, and oblique shock–mixing layer interaction are used to further illustrate the URLBFS scheme's ability to capture detailed flow structures and its low-dissipation properties. Unless otherwise specified, all of the following test cases use the third-order TVD Runge–Kutta method for time advancement with a constant time step of 1 × 10−4. All the test cases in this study were conducted on a Windows desktop equipped with a Hygon 7185 CPU (2.0 GHz) and an NVIDIA TITAN V GPU; the specifications of the CPU and GPU are listed in Table 1. The GPU-accelerated code was developed using CUDA C++, while the CPU-based version was implemented in standard C++. Both versions were compiled in double-precision mode. The CPU-based flow simulations presented in this paper were executed in single-thread mode, which is a common performance-analysis approach reported in the literature [52,53,54].

4.1. Advection of Density Perturbation Problem

This example tests the precision of the WENO-URLBFS numerical scheme and compares the efficiency of the CPU serial and GPU parallel implementations. In the initial conditions, all variables except the density are set to 1; the density is [55]:
$$\rho(x, y, z) = 1 + 0.2\sin\left[\pi\left(x + y + z\right)\right].$$
The exact solution is:
$$\rho(x, y, z, t) = 1 + 0.2\sin\left[\pi\left(x + y + z - 3t\right)\right].$$
This example employs a [0, 2]³ computational domain with periodic boundary conditions on all boundaries, and the simulation runs to a total time of t = 2. Tables 2 and 3 present the density errors and the corresponding segmented orders of accuracy for different flux solvers combined with different WENO reconstruction methods at grid spacings of 1/5, 1/10, 1/20, and 1/40. Figure 10 illustrates the relationship between the L2 error and the grid spacing for the various numerical schemes; the order of accuracy is obtained from a least-squares fit. It is evident that the different flux evaluation methods, within the finite difference framework proposed in this paper, generally achieve the expected fifth- and seventh-order accuracy, with the URLBFS scheme exhibiting slightly higher accuracy than the other three schemes. URLBFS also yields smaller errors, indicating lower numerical dissipation.
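For reference, the error norms and observed orders reported in Tables 2 and 3 can be evaluated as sketched below. The sampling at grid nodes and the exact solution follow the problem definition above; the routine names are illustrative and not taken from the authors' post-processing code.

```cpp
#include <cmath>
#include <vector>

// L2 error of the computed density against the exact solution at time t,
// sampled at the nodes x_i = i*h of an n^3 uniform grid on [0, 2]^3.
double densityL2Error(const std::vector<double>& rho, int n, double h, double t)
{
    const double pi = std::acos(-1.0);
    double sum = 0.0;
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                double x = i * h, y = j * h, z = k * h;
                double exact = 1.0 + 0.2 * std::sin(pi * (x + y + z - 3.0 * t));
                double diff  = rho[(k * n + j) * n + i] - exact;
                sum += diff * diff;
            }
    return std::sqrt(sum / (static_cast<double>(n) * n * n));
}

// Observed (segmented) order of accuracy between two successively refined grids (spacing h and h/2).
double observedOrder(double errCoarse, double errFine)
{
    return std::log(errCoarse / errFine) / std::log(2.0);
}
```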
To quantitatively evaluate the computational performance of the FD-WENO-URLBFS scheme on the CPU and GPU platforms, a fixed time step of dt = 1 × 10−4 is used for time advancement. The CPU code runs on the Hygon x86 7185-32c CPU with a C++ compiler, while the GPU calculation is carried out on the NVIDIA TITAN V graphics card using NVIDIA CUDA technology and the NVCC compiler; both use double-precision arithmetic. Figure 11 presents the density contour distributions obtained from the serial and parallel implementations at a grid resolution of 40³, and Table 4 further verifies the consistency between the CPU and GPU results by comparing the L1 and L2 errors. The results demonstrate that the parallel implementation produces solutions highly consistent with those of the serial version. The following analysis focuses on the acceleration efficiency of the parallel scheme for various grid sizes. The speedup is introduced to represent the GPU acceleration efficiency, as shown below:
$$\mathrm{Speedup} = \frac{T_{CPU}}{T_{GPU}},$$
where $T_{CPU}$ and $T_{GPU}$ denote the computation times required by the CPU and GPU codes, respectively. Table 5 lists the computation times of the CPU and GPU implementations and the corresponding speedup ratios for the WENO-URLBFS scheme across different grid sizes, and Figure 12 provides a visual comparison of the computation times and the GPU speedup relative to the CPU. As the number of grid points increases, the computation time of the CPU serial solver grows substantially, posing a challenge for work that must obtain results within a limited time. In contrast, the GPU parallel solution shows excellent acceleration performance: for the same WENO reconstruction method, the acceleration becomes more significant as the grid is refined, and for the WENO5 method the GPU reaches a speedup ratio of 1208.27. It is also worth highlighting that, for the same grid size, increasing the numerical accuracy has a more noticeable impact on the CPU computation time, while the GPU still completes the calculation efficiently, reflecting its advantage for complex, high-precision computations. Figure 13 compares the computational time and numerical accuracy of the WENO5 and WENO7 reconstruction schemes in the GPU implementation at various grid resolutions, in order to evaluate which scheme is more efficient on the GPU. The circular markers in the figure represent the grid sizes 10³, 20³, 40³, and 80³. When the grid resolution is relatively low, the WENO7 reconstruction exhibits higher overall efficiency, considering both computational time and numerical accuracy. As the grid resolution increases, WENO7 continues to achieve smaller numerical errors; however, its advantage in computational time over WENO5 becomes less pronounced than at lower resolutions.
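One way the GPU wall-clock times entering the speedup could be measured is with CUDA events, as sketched below; the paper does not specify its timing instrumentation, so this is only an assumed approach, and the CPU time can be measured analogously with std::chrono.

```cuda
#include <cuda_runtime.h>

// Returns the elapsed GPU time in milliseconds for the work issued by 'launch'
// (e.g. a wrapper that runs all time steps of the solver).
float timeKernelMs(void (*launch)())
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();                                   // issue the GPU work to be timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait until the recorded work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```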

4.2. Inviscid Taylor–Green Vortex

The velocity and pressure fields at the initial time form a symmetric vortex structure. As time evolves, the inviscid vortices stretch and generate progressively smaller scales. This problem is commonly used to demonstrate a scheme's ability to capture small-scale vortical structures and to examine properties such as the conservation of total kinetic energy and numerical dissipation. The simulations are performed in the region [0, 2π]³ with a grid resolution of 128³. The initial conditions are [56]:
$$\begin{aligned}
\rho &= 1,\\
u(x, y, z) &= V_0 \sin\left(\frac{x}{L}\right)\cos\left(\frac{y}{L}\right)\cos\left(\frac{z}{L}\right),\\
v(x, y, z) &= -V_0 \cos\left(\frac{x}{L}\right)\sin\left(\frac{y}{L}\right)\cos\left(\frac{z}{L}\right),\\
w(x, y, z) &= 0,\\
p(x, y, z) &= p_0 + \frac{\rho_0 V_0^2}{16}\left[\cos\left(\frac{2x}{L}\right) + \cos\left(\frac{2y}{L}\right)\right]\left[\cos\left(\frac{2z}{L}\right) + 2\right].
\end{aligned}$$
All boundaries are subject to periodic boundary conditions. Since the total simulation time is T = 10, to save time the CPU run is timed only up to t = 0.1 and the total CPU time is estimated by proportional scaling to t = 10. Table 6 shows the computational time of the WENO-URLBFS scheme for the inviscid Taylor–Green vortex problem under CPU serial and GPU parallel computing, where the GPU achieves a speedup ratio of up to 1063.49. Figure 14 presents the time evolution of the vortices. At t = 0, the vorticity field shows a set of symmetric vortex structures in a regular arrangement. As time progresses, the vortex structures become more complex: vortices intensify or split, and new vortex pairs and interactions with other flow features appear. By t = 10 the vortex structure has become asymmetric, the initially regular pattern is disturbed and distorted, and the vortex size varies in space, reflecting different vortex intensities at different locations.
Figure 15 shows the time evolution of the kinetic energy normalized by its initial value, $(\rho u_i u_i)/(\rho u_i u_i)_0$, for the various numerical schemes. The results indicate that the proposed URLBFS scheme preserves kinetic energy better than the ROE and LBFS schemes, demonstrating its low numerical dissipation. Figure 16 shows the Q-criterion iso-surfaces at Q = 2 for the different solvers; the URLBFS scheme resolves finer vortex structures, providing higher flow-field resolution.
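The kinetic-energy ratio plotted in Figure 15 can be evaluated as in the following sketch; on the uniform grid the constant cell volume cancels in the ratio, and the data layout shown is illustrative.

```cpp
#include <vector>

struct Cell { double rho, u, v, w; };

// Volume sum of rho*u_i*u_i over the field (constant cell volume omitted).
double totalKineticEnergy(const std::vector<Cell>& field)
{
    double sum = 0.0;
    for (const Cell& c : field)
        sum += c.rho * (c.u * c.u + c.v * c.v + c.w * c.w);
    return sum;
}

// Ratio monitored in Figure 15: (rho*u_i*u_i) / (rho*u_i*u_i)_0.
double kineticEnergyRatio(const std::vector<Cell>& now, const std::vector<Cell>& initial)
{
    return totalKineticEnergy(now) / totalKineticEnergy(initial);
}
```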

4.3. Explosion in a Box

This problem describes the propagation of a spherical shock wave generated by an explosion in a closed box; its interaction with the box walls produces a series of complex flow phenomena [57]. The configuration is shown in Figure 17: a sphere of radius 0.3 centered at (0.4, 0.4, 0.4) is placed in the box, and the computational domain [0, 1]³ is discretized into a uniform 120³ grid. All boundaries are set as reflecting walls. The initial conditions are given by:
$$(\rho, u, v, w, p) = \begin{cases} (5, 0, 0, 0, 5), & \text{if } r \le 0.3,\\ (1, 0, 0, 0, 1), & \text{otherwise},\end{cases}$$
where $r$ is the distance from the center of the sphere.
Table 7 compares the calculation times of the GPU parallel implementation of the WENO-URLBFS scheme and the CPU serial solution for the explosion-in-a-box problem. The GPU parallel solution has a clear advantage in computational efficiency, with a speedup ratio of up to 1289.62, which demonstrates the efficiency of GPU parallelization for complex flow problems. Figure 18 shows the density contours at t = 0.5 on the z = 0.4 plane, illustrating the density distribution obtained with the different schemes, and Figure 19 shows the density iso-surfaces (ρ = 1.8) for the present schemes. The figures show that the FD-WENO framework proposed in this paper performs well with the different solvers and can resolve complex flow phenomena stably. To provide a more intuitive comparison of the results obtained by the different numerical schemes, Figure 20 presents the density distribution along y = 0.2 on the plane z = 0.4. The reference solution was obtained using the WENO5–ROE scheme with an eight-fold increase in the number of grid cells. As shown in the figure, the URLBFS scheme preserves larger amplitude variations in the density profile, indicating significantly lower numerical dissipation; the enhanced amplitude preservation further demonstrates its superior capability in resolving flow features and capturing fine-scale structures.

4.4. Explosion in an Enclosed Cabin

To further extend the applicability of the URLBFS, this study employs the method to simulate the propagation of explosion waves generated by a cylindrical volume of high-pressure, high-density gas within a confined square chamber, and examines the evolution of the wave system and the pressure loads at typical measuring points on the walls. By constructing the geometric model and boundary conditions of the enclosed chamber, the propagation, reflection, and interaction of the explosion wave in the restricted space are analyzed, revealing its dynamic evolution under multiple reflections and interferences. To simulate the explosion, high-pressure, high-density gas is placed in the blue cylindrical region shown in Figure 21, and the surrounding region represents the air domain. The specific initial conditions are as follows [58]:
$$(\rho, u, v, w, p) = \begin{cases} (166.3, 0, 0, 0, 3791), & \text{if } (x - 0.4)^2 + (y - 0.4)^2 \le 0.05 \text{ and } z \le 0.07,\\ (1, 0, 0, 0, 1), & \text{otherwise}.\end{cases}$$
Points p1 and p2 are two pressure measurement points set on the wall surface, which are used to compare the differences in calculations using different numerical schemes. The simulations are performed in the region [−0.4, 0.4]3 with a grid resolution of 1003. All boundaries are set as reflection boundary conditions.
Table 8 presents the execution times of the URLBFS scheme for the CPU serial and GPU parallel implementations. The GPU speedup ratio reaches 1198.11, which fully demonstrates the significant advantage of the GPU in computationally intensive numerical simulations. The pressure contours of the initial explosion in the three-dimensional closed cabin simulated by the WENO5-URLBFS scheme are shown in Figure 22, which provides insight into how the explosion wave propagates during the initial phase of the blast. The high-pressure gas first expands freely in a cylindrically symmetric manner. When the explosion wave first reaches the cabin wall, a regular reflection occurs (see Figure 22a,b). Due to the constraints of the surrounding walls, a local pressure concentration appears at the intersection of two walls (see Figure 22c). When the explosion wave propagates further into the corner of the cabin, a more pronounced pressure convergence forms in the region where three walls meet (see Figure 22d). Subsequently, the shock waves produced by the wall reflections propagate toward the center of the cabin, where they converge and collide (see Figure 22e,f). The analysis shows that the explosion wave in the closed square cabin undergoes repeated wall reflections, center convergence, collisions, and subsequent re-reflections; during this process its intensity gradually decays and the pressure in the cabin tends toward uniformity.
Figure 23 compares the pressure time histories at the two monitoring points up to t = 1. Owing to the combined effects of wall reflections and multiple interactions of the blast-induced shock waves, the pressure response inside the three-dimensional enclosed chamber exhibits a multi-peak pattern that gradually attenuates over time. The reference solution was obtained using the WENO5–LBFS scheme with an eight-fold increase in the number of grid cells. As illustrated in the figure, the URLBFS scheme yields pressure profiles with larger amplitude, indicating lower numerical dissipation and a superior ability to capture detailed wave fluctuations and local extrema. Overall, URLBFS demonstrates enhanced resolution of the nonlinear wave interactions in this complex three-dimensional blast scenario.

4.5. Oblique Shock–Mixing Layer Interaction

This example verifies the robustness of the scheme under complex flow conditions in which the distinction between shock waves and smooth flow is not sharp. A shock wave enters from the upper-left corner at an angle of 12° to the flow direction and interacts with the shear layer, destabilizing it and leading to the formation of vortex structures. These vortices rotate and convect downstream over time, developing a complex shock-wave structure. The initial condition is as follows [59]:
$$(\rho, u, v, w, p) = \begin{cases} (0.3626,\ 2,\ 0,\ 0,\ 0.3327), & y \le 0,\\ (1.6374,\ 3,\ 0,\ 0,\ 0.3327), & y > 0.\end{cases}$$
The simulations are performed in the region [0, 200] × [−20, 20] × [−20, 20] with a grid resolution of 400 × 80 × 80. The inflow velocity applied to the left boundary is:
$$u_{in} = 2.5 + 0.5\tanh(2y),\qquad
v_{in} = \sum_{k=1}^{2} a_k \cos\!\left(\frac{2\pi k t}{T} + \frac{z}{L_z} + \phi_k\right) e^{-y^2/10},$$
where $a_1 = a_2 = 0.05$, $\phi_1 = 0$, $\phi_2 = \pi/2$, and $T = \lambda/u_c$ with $\lambda = 30$ and $u_c = 2.68$. Zero-gradient extrapolation is used at the right boundary. A slip-wall condition is applied at the bottom boundary, while the upper boundary adopts the post-shock values $(\rho, u, v, p) = (2.1101, 2.9709, -0.1367, 0.4754)$. The front and rear boundaries in the z direction are set as symmetric boundary conditions. Here, Pr = 0.72 and Re = 500. Since a fixed time step is used, the CPU serial and GPU parallel computing times are compared at t = 1, as shown in Table 9; the GPU parallel speedup ratio reaches 1002.82.
A velocity disturbance is applied at the left boundary to establish the velocity-inlet condition. This disturbance introduces a phase variation in the z direction, producing a clearly regular vortex structure before the incident shock interacts with the shear layer. After the shock impinges on the shear layer, the spanwise vortices undergo significant deformation. Because the lower boundary is a slip wall, the shock reflects upon impact, and the additional interactions further evolve and distort the vortex structure. Figure 24 illustrates the density contours at t = 120 calculated using the present scheme.
Figure 25 presents the density contours at the z = 20 cross-section obtained using the WENO-URLBFS scheme. The interaction between the spanwise vortex structure and the shock wave is clearly captured, and the evolution of the vortex generated by the interference between the oblique shock and the boundary layer is accurately reproduced. To compare the results of the two schemes intuitively, the pressure distributions along a straight line across the vortices from point (90, 0, 20) to point (200, −6, 20) are shown and compared with the reference solution in Figure 26. The reference solution was obtained by using the WENO5–URLBFS scheme with an eight-fold increase in the number of grid cells. The comparison of the pressure profiles shows that the solution based on the WENO7 reconstruction exhibits larger amplitude and is closer to the reference solution, indicating lower numerical dissipation. This observation is consistent with the accuracy assessments presented earlier.

5. Conclusions

This paper presents WENO-URLBFS, a high-order, low-dissipation numerical scheme with GPU-parallel acceleration, for simulating 3D compressible flows involving strong shock waves and contact discontinuities. The scheme improves upon the conventional LBFS and RLBFS approaches by replacing the approximate treatment of the interface tangential velocity with the exact Euler equation solution, thereby effectively reducing numerical dissipation. Unlike traditional finite difference methods that directly reconstruct interface fluxes, WENO-URLBFS reconstructs interface physical quantities using the WENO scheme in characteristic space and then evaluates the numerical fluxes via the URLBFS solver. The implementation on the CUDA platform enables efficient GPU parallelization, resulting in a substantial enhancement in computational performance.
The accuracy of the WENO-URLBFS scheme is first assessed by simulating the advection of a density perturbation, achieving the formal fifth- and seventh-order convergence. In this test case, the consistency between the CPU serial and GPU parallel implementations is confirmed, and the GPU acceleration efficiency is assessed across different grid resolutions, yielding a maximum speedup ratio of 1208.27. The low-dissipation property of the scheme is quantitatively demonstrated for the inviscid Taylor–Green vortex problem by analyzing the temporal evolution of the total kinetic energy. Furthermore, several challenging three-dimensional problems, including the explosion in a box, the explosion in an enclosed cabin, and the oblique shock–mixing layer interaction, are simulated to validate both the computational efficiency and the accuracy of the scheme in capturing complex flow phenomena. For these cases, the GPU acceleration performance is also compared. Although variations in initial conditions and grid sizes lead to differences in acceleration efficiency, the speedup generally reaches around 1000. Overall, the results confirm that the GPU-parallel WENO-URLBFS scheme can efficiently and robustly capture essential physical features such as strong shock waves, vortices, and discontinuities while maintaining low numerical dissipation, thereby demonstrating its strong capability for high-dimensional, strongly nonlinear compressible flow problems.
Currently, the numerical experiments are mainly performed on a single GPU device. Future work will aim to further enhance the accuracy of the WENO reconstruction, explore multi-GPU strategies for greater speedup, and develop high-order multi-GPU schemes suitable for non-uniform grid problems.

Author Contributions

Conceptualization, Y.W. (Yan Wang); Methodology, Y.W. (Yan Wang); Validation, Y.W. (Yunhao Wang), Q.W. and Y.W. (Yan Wang); Formal analysis, Y.W. (Yunhao Wang), Q.W. and Y.W. (Yan Wang); Investigation, Y.W. (Yunhao Wang) and Q.W.; Data curation, Y.W. (Yunhao Wang) and Q.W.; Writing—original draft, Y.W. (Yunhao Wang) and Q.W.; Writing—review & editing, Y.W. (Yan Wang); Visualization, Y.W. (Yan Wang); Supervision, Y.W. (Yan Wang); Project administration, Y.W. (Yan Wang); Funding acquisition, Y.W. (Yan Wang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 12272178), the Research Fund of State Key Laboratory of Mechanics and Control for Aerospace Structures (Grant No. MCAS-I-0325G01), and the Aeronautical Science Foundation of China (Grant No. 20220012052004). This work was also partially supported by the High Performance Computing Platform of Nanjing University of Aeronautics and Astronautics.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in this article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Qite Wang was employed by the company China Aerospace Times Feihong Technology Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Shang, J.S. Three decades of accomplishments in computational fluid dynamics. Prog. Aerosp. Sci. 2004, 40, 173–197. [Google Scholar] [CrossRef]
  2. Van Leer, B. Towards the ultimate conservative difference scheme. V. A second-order sequel to Godunov’s method. J. Comput. Phys. 1979, 32, 101–136. [Google Scholar] [CrossRef]
  3. Harten, A. High Resolution Schemes for Hyperbolic Conservation Laws. J. Comput. Phys. 1997, 135, 260–278. [Google Scholar] [CrossRef]
  4. Osher, S. Convergence of Generalized MUSCL Schemes. SIAM J. Numer. Anal. 1985, 22, 947–961. [Google Scholar] [CrossRef]
  5. Zhao, J.; Özgen, I.; Liang, D.; Hinkelmann, R. Improved multislope MUSCL reconstruction on unstructured grids for shallow water equations. Int. J. Numer. Methods Fluids 2018, 87, 401–436. [Google Scholar] [CrossRef]
  6. Yuan, M. A predictor–corrector symmetric TVD scheme for magnetogasdynamic flow. Comput. Phys. Commun. 2019, 237, 86–97. [Google Scholar] [CrossRef]
  7. Harten, A.; Engquist, B.; Osher, S.; Chakravarthy, S.R. Uniformly high order accurate essentially non-oscillatory schemes, III. J. Comput. Phys. 1987, 71, 231–303. [Google Scholar] [CrossRef]
  8. Shu, C.-W.; Osher, S. Efficient implementation of essentially non-oscillatory shock-capturing schemes. J. Comput. Phys. 1988, 77, 439–471. [Google Scholar] [CrossRef]
  9. Shu, C.-W.; Osher, S. Efficient implementation of essentially non-oscillatory shock-capturing schemes, II. J. Comput. Phys. 1989, 83, 32–78. [Google Scholar] [CrossRef]
  10. Liu, X.-D.; Osher, S.; Chan, T. Weighted Essentially Non-oscillatory Schemes. J. Comput. Phys. 1994, 115, 200–212. [Google Scholar] [CrossRef]
  11. Jiang, G.-S.; Shu, C.-W. Efficient Implementation of Weighted ENO Schemes. J. Comput. Phys. 1996, 126, 202–228. [Google Scholar] [CrossRef]
  12. Henrick, A.K.; Aslam, T.D.; Powers, J.M. Mapped weighted essentially non-oscillatory schemes: Achieving optimal order near critical points. J. Comput. Phys. 2005, 207, 542–567. [Google Scholar] [CrossRef]
  13. Borges, R.; Carmona, M.; Costa, B.; Don, W.S. An improved weighted essentially non-oscillatory scheme for hyperbolic conservation laws. J. Comput. Phys. 2008, 227, 3191–3211. [Google Scholar] [CrossRef]
  14. Ding, X.; Chen, G.; Luo, P. Convergence of the Lax-Friedrichs Scheme for Isentropic Gas Dynamics (I). Acta Math. Sci. 1985, 5, 415–432. [Google Scholar] [CrossRef]
  15. Roe, P.L. Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes. J. Comput. Phys. 1997, 135, 250–258. [Google Scholar] [CrossRef]
  16. Harten, A.; Lax, P.D.; Leer, B.V. On Upstream Differencing and Godunov-Type Schemes for Hyperbolic Conservation Laws. SIAM Rev. 1983, 25, 35–61. [Google Scholar] [CrossRef]
  17. Kitamura, K.; Roe, P.; Ismail, F. Evaluation of Euler Fluxes for Hypersonic Flow Computations. AIAA J. 2009, 47, 44–53. [Google Scholar] [CrossRef]
  18. Quirk, J.J. A contribution to the great Riemann solver debate. Int. J. Numer. Methods Fluids 1994, 18, 555–574. [Google Scholar] [CrossRef]
  19. Linde, T. A practical, general-purpose, two-state HLL Riemann solver for hyperbolic conservation laws. Int. J. Numer. Methods Fluids 2002, 40, 391–402. [Google Scholar] [CrossRef]
  20. Chen, L.; Wang, Y. Feature-consistent field inversion and machine learning framework with regularized ensemble Kalman method for improving the k–ω shear stress transport model in simulating separated flows. Phys. Rev. Fluids 2025, 10, 024603. [Google Scholar] [CrossRef]
  21. Guo, Z.; Shu, C. Lattice Boltzmann Method and Its Applications in Engineering; World Scientific: London, UK, 2013. [Google Scholar]
  22. Benzi, R.; Succi, S.; Vergassola, M. The lattice Boltzmann equation: Theory and applications. Phys. Rep. 1992, 222, 145–197. [Google Scholar] [CrossRef]
  23. Zhang, Y.; Pu, T.; Jia, H.; Wu, S.; Zhou, C. Extension of a sharp-interface immersed-boundary method for simulating parachute inflation. Adv. Aerodyn. 2024, 6, 3. [Google Scholar] [CrossRef]
  24. Kataoka, T.; Tsutahara, M. Lattice Boltzmann method for the compressible Euler equations. Phys. Rev. E 2004, 69, 056702. [Google Scholar] [CrossRef] [PubMed]
  25. Zhong, C.; Li, K.; Sun, J.; Zhuo, C.; Xie, J. Compressible flow simulation around airfoil based on lattice Boltzmann method. Trans. Nanjing Univ. Aeronaut. Astronaut. 2009, 26, 3. [Google Scholar]
  26. Chen, J.; Wang, Y.; Chen, Q. A global matrix-based stability analysis of the lattice Boltzmann and gas kinetic flux solvers for the simulation of compressible flows. Phys. Fluids 2025, 37, 066108. [Google Scholar] [CrossRef]
  27. Tong, Z.-X.; Li, M.-J.; Du, Y.; Yuan, X. Mass transfer analyses of reactive boundary schemes for lattice Boltzmann method with staircase approximation. Adv. Aerodyn. 2024, 6, 9. [Google Scholar] [CrossRef]
  28. Qin, J.; Yu, H.; Wu, J. On the investigation of shock wave/boundary layer interaction with a high-order scheme based on lattice Boltzmann flux solver. Adv. Aerodyn. 2024, 6, 6. [Google Scholar] [CrossRef]
  29. Ji, C.Z.; Shu, C.; Zhao, N. A lattice Boltzmann method-based flux solver and its application to solve shock tube problem. Mod. Phys. Lett. B 2009, 23, 313–316. [Google Scholar] [CrossRef]
  30. Shu, C.; Wang, Y.; Teo, C.J.; Wu, J. Development of Lattice Boltzmann Flux Solver for Simulation of Incompressible Flows. Adv. Appl. Math. Mech. 2014, 6, 436–460. [Google Scholar] [CrossRef]
  31. Wang, Y.; Yang, L.; Shu, C. From Lattice Boltzmann Method to Lattice Boltzmann Flux Solver. Entropy 2015, 17, 7713–7735. [Google Scholar] [CrossRef]
  32. Yang, L.M.; Shu, C.; Wu, J. Development and Comparative Studies of Three Non-free Parameter Lattice Boltzmann Models for Simulation of Compressible Flows. Adv. Appl. Math. Mech. 2012, 4, 454–472. [Google Scholar] [CrossRef]
  33. Yang, L.M.; Shu, C.; Wu, J. A moment conservation-based non-free parameter compressible lattice Boltzmann model and its application for flux evaluation at cell interface. Comput. Fluids 2013, 79, 190–199. [Google Scholar] [CrossRef]
  34. Chen, J.; Yang, D.; Chen, Q.; Sun, J.; Wang, Y. A rotated lattice Boltzmann flux solver with improved stability for the simulation of compressible flows with intense shock waves at high Mach number. Comput. Math. Appl. 2023, 132, 18–31. [Google Scholar] [CrossRef]
  35. Toro, E.F. Riemann Solvers and Numerical Methods for Fluid Dynamics; Springer: Berlin/Heidelberg, Germany, 1997. [Google Scholar]
  36. Cheng, J.; Liu, X.; Liu, T.; Luo, H. A Parallel, High-Order Direct Discontinuous Galerkin Method for the Navier-Stokes Equations on 3D Hybrid Grids. Commun. Comput. Phys. 2017, 21, 1231–1257. [Google Scholar] [CrossRef]
  37. Xia, Y.; Luo, H.; Frisbey, M.; Nourgaliev, R. A set of parallel, implicit methods for a reconstructed discontinuous Galerkin method for compressible flows on 3D hybrid grids. Comput. Fluids 2014, 98, 134–151. [Google Scholar] [CrossRef]
  38. Bernardini, M.; Modesti, D.; Salvadore, F.; Pirozzoli, S. STREAmS: A high-fidelity accelerated solver for direct numerical simulation of compressible turbulent flows. Comput. Phys. Commun. 2021, 263, 107906. [Google Scholar] [CrossRef]
  39. Bonelli, F.; Tuttafesta, M.; Colonna, G.; Cutrone, L.; Pascazio, G. An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster. Comput. Phys. Commun. 2017, 219, 178–195. [Google Scholar] [CrossRef]
  40. Elsen, E.; LeGresley, P.; Darve, E. Large calculation of the flow over a hypersonic vehicle using a GPU. J. Comput. Phys. 2008, 227, 10148–10161. [Google Scholar] [CrossRef]
  41. Lai, J.; Li, H.; Tian, Z.; Zhang, Y. A Multi-GPU Parallel Algorithm in Hypersonic Flow Computations. Math. Probl. Eng. 2019, 2019, 2053156. [Google Scholar] [CrossRef]
  42. Rossinelli, D.; Hejazialhosseini, B.; Spampinato, D.G.; Koumoutsakos, P. Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids. SIAM J. Sci. Comput. 2011, 33, 512–540. [Google Scholar] [CrossRef]
  43. Ji, H.; Lien, F.-S.; Zhang, F. A GPU-accelerated adaptive mesh refinement for immersed boundary methods. Comput. Fluids 2015, 118, 131–147. [Google Scholar] [CrossRef]
  44. Khan, A.; Sim, H.; Vazhkudai, S.S.; Butt, A.R.; Kim, Y. An Analysis of System Balance and Architectural Trends Based on Top500 Supercomputers. In Proceedings of the HPCAsia’21: The International Conference on High Performance Computing in Asia-Pacific Region, Virtual Event, 20–22 January 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 11–22. [Google Scholar]
  45. Strohmaier, E.; Meuer, H.W.; Dongarra, J.; Simon, H.D. The TOP500 List and Progress in High-Performance Computing. Computer 2015, 48, 42–49. [Google Scholar] [CrossRef]
  46. Zeng, Y.; Wang, Y.; Yuan, H. A stable and efficient semi-implicit coupling method for fluid-structure interaction problems with immersed boundaries in a hybrid CPU-GPU framework. J. Comput. Phys. 2025, 534, 114026. [Google Scholar] [CrossRef]
  47. Chen, J.; Wang, Y.; Yang, D.; Chen, Q.; Sun, J. Development of three-dimensional rotated lattice Boltzmann flux solver for the simulation of high-speed compressible flows. Comput. Fluids 2023, 265, 105992. [Google Scholar] [CrossRef]
  48. Ren, Y.-X. A robust shock-capturing scheme based on rotated Riemann solvers. Comput. Fluids 2003, 32, 1379–1403. [Google Scholar] [CrossRef]
  49. Nishikawa, H.; Kitamura, K. Very simple, carbuncle-free, boundary-layer-resolving, rotated-hybrid Riemann solvers. J. Comput. Phys. 2008, 227, 2560–2581. [Google Scholar] [CrossRef]
  50. Shu, C. Essentially non-oscillatory and weighted essentially non-oscillatory schemes for hyperbolic conservation laws. In Advanced Numerical Approximation of Nonlinear Hyperbolic Equations; Quarteroni, A., Ed.; Springer: Berlin/Heidelberg, Germany, 1998; pp. 325–432. [Google Scholar]
  51. Wang, Y.; Chen, J.; Wang, Y.; Zeng, Y.; Ke, S. A WENO-Based Upwind Rotated Lattice Boltzmann Flux Solver with Lower Numerical Dissipation for Simulating Compressible Flows with Contact Discontinuities and Strong Shock Waves. Appl. Sci. 2024, 14, 11450. [Google Scholar] [CrossRef]
  52. Ma, Z.H.; Wang, H.; Pu, S.H. A parallel meshless dynamic cloud method on graphic processing units for unsteady compressible flows past moving boundaries. Comput. Methods Appl. Mech. Eng. 2015, 285, 146–165. [Google Scholar] [CrossRef]
  53. Zhang, J.-L.; Chen, H.-Q.; Xu, S.-G.; Gao, H.-Q. A novel GPU-parallelized meshless method for solving compressible turbulent flows. Comput. Math. Appl. 2020, 80, 2738–2763. [Google Scholar] [CrossRef]
  54. Ma, Z.; Wang, H.; Pu, S.H. GPU computing of compressible flow problems by a meshless method with space-filling curves. J. Comput. Phys. 2014, 263, 113–135. [Google Scholar] [CrossRef]
  55. Liu, Y.Y.; Yang, L.M.; Shu, C.; Zhang, H.W. Three-dimensional high-order least square-based finite difference-finite volume method on unstructured grids. Phys. Fluids 2020, 32, 123604. [Google Scholar] [CrossRef]
  56. Taylor, G.I.; Green, A.E. Mechanism of the production of small eddies from large ones. Proc. R. Soc. London Ser. A Math. Phys. Sci. 1937, 158, 499–521. [Google Scholar] [CrossRef]
  57. Li, Q.; He, Y.L.; Wang, Y.; Tang, G.H. Three-dimensional non-free-parameter lattice-Boltzmann model and its application to inviscid compressible flows. Phys. Lett. A 2009, 373, 2101–2108. [Google Scholar] [CrossRef]
  58. Xu, W.; Kong, X.; Wu, W. An Improved 3rd-Order WENO Scheme Based on Mapping Functions and Its Application. Appl. Math. Mech. 2017, 38, 1120–1135. [Google Scholar] [CrossRef]
  59. Park, J.S.; Kim, C. Hierarchical multi-dimensional limiting strategy for correction procedure via reconstruction. J. Comput. Phys. 2016, 308, 57–80. [Google Scholar] [CrossRef]
Figure 1. Distribution of the D1Q4 lattice velocity model.
Figure 2. Use of the 1D LB model at cell interfaces in 3D simulations. Black arrows represent the velocity components in the global Cartesian frame, and red arrows denote the velocity components in the local rotated coordinate system.
Figure 3. Streaming process of the D1Q4 model.
Figure 4. Schematic diagram of the decomposition of the normal vector at the 3D cell interface.
Figure 5. Global template and sub-templates of WENO5. Red dots represent the cell nodes.
Figure 6. Global template and sub-templates of WENO7. Red dots represent the cell nodes.
Figure 7. A heterogeneous programming model that uses the CPU and GPU together.
Figure 8. Description of the thread hierarchy.
Figure 9. Comparison between the CPU-only serial implementation and the CPU + GPU parallel execution of the WENO-URLBFS scheme.
Figure 10. L2 error of the density as a function of the grid size h. (a): WENO5 reconstruction method; (b): WENO7 reconstruction method.
Figure 11. Comparison of CPU and GPU results for the density perturbation advection problem with 40³ grid points using the WENO5-URLBFS scheme. (a): CPU serial; (b): GPU parallel.
Figure 12. Speedup of GPU parallel scheme relative to CPU serial scheme at different meshes. (a): WENO5-URLBFS scheme; (b): WENO7-URLBFS scheme.
Figure 13. Comparison of accuracy and computational cost of different reconstruction schemes at various grid resolutions.
Figure 14. Temporal evolution of the inviscid Taylor–Green vortex computed with the WENO5-URLBFS scheme; Q-criterion iso-surfaces show the vortex structure and are colored by the velocity in the x direction.
Figure 15. 3D inviscid Taylor–Green vortex problem. Time history of the ratio between the average kinetic energy and its initial value for various numerical schemes. Panel (a) shows the WENO5 reconstruction technique; Panel (b) shows the WENO7 reconstruction technique.
Figure 16. 3D inviscid Taylor–Green vortex problem. Iso-surfaces of the Q-criterion at Q = 2 for different numerical schemes, colored with the x-velocity at t = 10. Panels (a–c): WENO5 reconstruction technique; Panels (d–f): WENO7 reconstruction technique.
Figure 17. Schematic diagram of the initial conditions. The blue circle represents an initially stationary, high-density, high-pressure bubble surrounded by ambient gas.
Figure 18. Density distributions on the plane z = 0.4, visualized with 23 equally spaced contour levels from 0.2 to 2.4, obtained using different schemes. Panels (a–c): WENO5 reconstruction technique; Panels (d–f): WENO7 reconstruction technique.
Figure 19. Density iso-surfaces (ρ = 1.8) obtained using the WENO-URLBFS schemes. Panel (a) shows the WENO5 reconstruction technique; Panel (b) shows the WENO7 reconstruction technique.
Figure 20. Comparison of density profiles along y = 0.2 on the plane z = 0.4 obtained using different numerical schemes. Panels (a,c) show the WENO5 and WENO7 reconstruction techniques, respectively; Panels (b,d) are locally enlarged views.
Figure 21. Schematic showing the confined chamber setup and the initial explosion region. Panel (a): Geometric configuration and computational setup; blue dots represent monitoring points, and the interior of the blue cylinder represents the high-density, high-pressure region. Panel (b): Pressure distribution contour map at the initial time.
Figure 22. Temporal evolution of the pressure contour maps for an explosion in a confined chamber simulated using the WENO5-URLBFS scheme.
Figure 23. Time histories of the pressure at monitoring points p1 and p2 obtained with different schemes. Panels (a,c) show the WENO5 and WENO7 reconstruction techniques, respectively; Panels (b,d) are locally enlarged views.
Figure 24. Three-dimensional density iso-surface plot for the oblique shock/mixing-layer interaction problem. Panel (a) shows the WENO5 reconstruction technique; Panel (b) shows the WENO7 reconstruction technique.
Figure 25. Density contours with 23 equally spaced contour lines ranging from 0.4 to 2.4 on the plane z = 20, obtained using the WENO-URLBFS schemes. Panel (a) shows the WENO5 reconstruction technique; Panel (b) shows the WENO7 reconstruction technique.
Figure 26. Computed pressure distributions along a straight line across the vortices from point (90, 0, 20) to point (200, −6, 20).
Table 1. Specifications of the Hygon 7185 CPU and NVIDIA TITAN V GPU.
| Category | Item | Hygon 7185 | NVIDIA TITAN V |
| Processor | Total number of cores | 32 | 5120 |
| Processor | Clock rate | 2.0 GHz | 1455 MHz |
| Memory | Global memory | 64 GB | 12 GB |
| Memory | Shared memory | -- | 64 KB |
| Memory | Registers per block | -- | 256 KB |
| Peak theoretical performance | Floating-point operations | 1-core: 32 GFLOP/s | 7450 GFLOP/s |
| Peak theoretical performance | Memory bandwidth | 170 GB/s | 652.8 GB/s |
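For reference, device-side specifications such as those collected in Table 1 (multiprocessor count, clock rate, global and shared memory, registers per block) can be queried at run time through the CUDA runtime API. The snippet below is a generic, illustrative utility and not part of the solver; the reported values depend on the installed device.

```cuda
// Minimal sketch: query the GPU properties summarized in Table 1 via the
// CUDA runtime API (illustrative utility, not part of the solver itself).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // device 0

    printf("Device name        : %s\n", prop.name);
    printf("Multiprocessors    : %d\n", prop.multiProcessorCount);
    printf("GPU clock rate     : %.0f MHz\n", prop.clockRate / 1000.0);
    printf("Global memory      : %.1f GB\n", prop.totalGlobalMem / 1073741824.0);
    printf("Shared mem / block : %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("Registers / block  : %d\n", prop.regsPerBlock);
    printf("Memory bus width   : %d bit\n", prop.memoryBusWidth);
    return 0;
}
```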
Table 2. Comparison of error and numerical order for WENO5-based flux solvers at different grid resolutions.
| Scheme | h | L1 Error | L1 Order | L2 Error | L2 Order | L∞ Error | L∞ Order |
| WENO5-LF | 1/5 | 1.76 × 10−2 | - | 1.89 × 10−2 | - | 2.53 × 10−2 | - |
| | 1/10 | 9.19 × 10−4 | 4.256 | 9.88 × 10−4 | 4.259 | 1.40 × 10−3 | 4.179 |
| | 1/20 | 2.94 × 10−5 | 4.968 | 3.33 × 10−5 | 4.891 | 5.15 × 10−5 | 4.761 |
| | 1/40 | 9.08 × 10−7 | 5.015 | 1.02 × 10−6 | 5.026 | 1.67 × 10−6 | 4.943 |
| WENO5-ROE | 1/5 | 1.63 × 10−2 | - | 1.75 × 10−2 | - | 2.33 × 10−2 | - |
| | 1/10 | 8.45 × 10−4 | 4.266 | 9.12 × 10−4 | 4.262 | 1.30 × 10−3 | 4.161 |
| | 1/20 | 2.68 × 10−5 | 4.979 | 3.05 × 10−5 | 4.902 | 4.81 × 10−5 | 4.760 |
| | 1/40 | 8.35 × 10−7 | 5.004 | 9.43 × 10−7 | 5.016 | 1.57 × 10−6 | 4.935 |
| WENO5-LBFS | 1/5 | 1.36 × 10−2 | - | 1.47 × 10−2 | - | 2.06 × 10−2 | - |
| | 1/10 | 6.91 × 10−4 | 4.298 | 7.55 × 10−4 | 4.281 | 1.11 × 10−3 | 4.211 |
| | 1/20 | 2.13 × 10−5 | 5.018 | 2.45 × 10−5 | 4.946 | 4.06 × 10−5 | 4.779 |
| | 1/40 | 6.64 × 10−7 | 5.005 | 7.51 × 10−7 | 5.028 | 1.31 × 10−6 | 4.956 |
| WENO5-URLBFS | 1/5 | 1.47 × 10−2 | - | 1.58 × 10−2 | - | 2.20 × 10−2 | - |
| | 1/10 | 6.78 × 10−4 | 4.436 | 7.39 × 10−4 | 4.419 | 1.09 × 10−3 | 4.337 |
| | 1/20 | 2.00 × 10−5 | 5.083 | 2.28 × 10−5 | 5.021 | 3.60 × 10−5 | 4.919 |
| | 1/40 | 5.26 × 10−7 | 5.248 | 5.92 × 10−7 | 5.267 | 9.04 × 10−7 | 5.317 |
Table 3. Comparison of error and numerical order for WENO7-based flux solvers at different grid resolutions.
| Scheme | h | L1 Error | L1 Order | L2 Error | L2 Order | L∞ Error | L∞ Order |
| WENO7-LF | 1/5 | 1.76 × 10−2 | - | 1.89 × 10−2 | - | 2.53 × 10−2 | - |
| | 1/10 | 9.19 × 10−4 | 4.256 | 9.88 × 10−4 | 4.259 | 1.40 × 10−3 | 4.179 |
| | 1/20 | 2.94 × 10−5 | 4.968 | 3.33 × 10−5 | 4.891 | 5.15 × 10−5 | 4.761 |
| | 1/40 | 9.08 × 10−7 | 5.015 | 1.02 × 10−6 | 5.026 | 1.67 × 10−6 | 4.943 |
| WENO7-ROE | 1/5 | 2.12 × 10−3 | - | 2.52 × 10−3 | - | 3.98 × 10−3 | - |
| | 1/10 | 4.68 × 10−5 | 5.502 | 5.32 × 10−5 | 5.564 | 9.84 × 10−5 | 5.336 |
| | 1/20 | 3.06 × 10−7 | 7.257 | 4.28 × 10−7 | 6.958 | 1.07 × 10−6 | 6.517 |
| | 1/40 | 1.02 × 10−9 | 8.235 | 1.39 × 10−9 | 8.263 | 3.48 × 10−9 | 8.270 |
| WENO7-LBFS | 1/5 | 1.73 × 10−3 | - | 2.08 × 10−3 | - | 3.38 × 10−3 | - |
| | 1/10 | 3.95 × 10−5 | 5.451 | 4.52 × 10−5 | 5.526 | 8.43 × 10−5 | 5.327 |
| | 1/20 | 2.53 × 10−7 | 7.286 | 3.52 × 10−7 | 7.006 | 9.07 × 10−7 | 6.538 |
| | 1/40 | 8.09 × 10−10 | 8.289 | 1.12 × 10−9 | 8.297 | 2.93 × 10−9 | 8.276 |
| WENO7-URLBFS | 1/5 | 1.85 × 10−3 | - | 2.23 × 10−3 | - | 3.69 × 10−3 | - |
| | 1/10 | 3.95 × 10−5 | 5.451 | 4.52 × 10−5 | 5.526 | 8.43 × 10−5 | 5.327 |
| | 1/20 | 2.53 × 10−7 | 7.286 | 3.52 × 10−7 | 7.006 | 9.07 × 10−7 | 6.538 |
| | 1/40 | 8.09 × 10−10 | 8.289 | 1.12 × 10−9 | 8.297 | 2.93 × 10−9 | 8.276 |
Table 4. Consistency comparison of the WENO-URLBFS scheme results between CPU and GPU implementations.
| Scheme | h | CPU (L1 Error) | GPU (L1 Error) | CPU (L2 Error) | GPU (L2 Error) |
| WENO5-URLBFS | 1/20 | 2.00 × 10−5 | 2.01 × 10−5 | 2.28 × 10−5 | 2.29 × 10−5 |
| | 1/40 | 5.26 × 10−7 | 5.28 × 10−7 | 5.92 × 10−7 | 5.94 × 10−7 |
| WENO7-URLBFS | 1/20 | 2.53 × 10−7 | 2.53 × 10−7 | 3.52 × 10−7 | 3.53 × 10−7 |
| | 1/40 | 8.09 × 10−10 | 8.10 × 10−10 | 1.12 × 10−9 | 1.12 × 10−9 |
Table 5. Computation times and speedup ratios of the CPU serial and GPU parallel WENO-URLBFS schemes at different mesh resolutions.
| Scheme | Mesh Resolution | CPU (min) | GPU (min) | Speedup |
| WENO5-URLBFS | 1/5 | 10.24 | 0.15 | 69.29 |
| | 1/10 | 68.93 | 0.18 | 382.94 |
| | 1/20 | 512.14 | 0.56 | 914.53 |
| | 1/40 | 3695.64 | 3.21 | 1151.29 |
| | 1/60 | 12,928.54 | 10.7 | 1208.27 |
| WENO7-URLBFS | 1/5 | 14.00 | 0.17 | 82.35 |
| | 1/10 | 94.93 | 0.28 | 339.05 |
| | 1/20 | 686.87 | 0.97 | 708.11 |
| | 1/40 | 5461.92 | 6.28 | 821.96 |
| | 1/60 | 19,246.23 | 19.2 | 1002.41 |
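The speedup entries in Table 5 are simply the ratios of the tabulated CPU and GPU wall-clock times; for example, the maximum speedup quoted in the abstract and conclusions corresponds to the finest WENO5-URLBFS mesh (resolution 1/60):

```latex
S_{\max} \;=\; \frac{t_{\mathrm{CPU}}}{t_{\mathrm{GPU}}}
         \;=\; \frac{12{,}928.54~\text{min}}{10.7~\text{min}}
         \;\approx\; 1208.27 .
```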
Table 6. Comparison of CPU serial and GPU parallel computation times for the WENO-URLBFS scheme at t = 10.
| Scheme | Number of Cells | CPU (min) | GPU (min) | Speedup |
| WENO5-URLBFS | 128³ | 77,630.52 | 91.42 | 849.16 |
| WENO7-URLBFS | 128³ | 135,786.3 | 127.68 | 1063.49 |
Table 7. Comparison of CPU serial and GPU parallel computation times for the WENO-URLBFS scheme at t = 0.5.
| Scheme | Number of Cells | CPU (min) | GPU (min) | Speedup |
| WENO5-URLBFS | 120³ | 2914.55 | 2.26 | 1289.62 |
| WENO7-URLBFS | 120³ | 4534.52 | 4.46 | 1016.70 |
Table 8. Comparison of CPU serial and GPU parallel computation times for the WENO-URLBFS scheme at t = 2.6.
| Scheme | Number of Cells | CPU (min) | GPU (min) | Speedup |
| WENO5-URLBFS | 100³ | 1569.52 | 1.31 | 1198.11 |
| WENO7-URLBFS | 100³ | 2342.25 | 2.32 | 1009.59 |
Table 9. Comparison of CPU serial and GPU parallel computation times for the WENO-URLBFS scheme at t = 1.0.
| Scheme | Number of Cells | CPU (min) | GPU (min) | Speedup |
| WENO5-URLBFS | 400 × 80 × 80 | 12,214.31 | 12.18 | 1002.82 |
| WENO7-URLBFS | 400 × 80 × 80 | 15,582.72 | 15.91 | 979.43 |
