High-Performance Computing Optimization of the Maxwell–Stefan Diffusion Model in OpenFOAM

Chi, Zixin; Hui, Xin; Wang, Bosen

doi:10.3390/app16073611

Open Access Article

by Zixin Chi ¹,², Xin Hui ¹,³ and Bosen Wang ¹,³,*

¹ National Key Laboratory of Science and Technology on Aero-Engine Aero-Thermodynamics, Beihang University, 37 Xueyuan Road, Beijing 100191, China

² School of Energy and Power Engineering, Beihang University, 37 Xueyuan Road, Beijing 100191, China

³ Research Institute of Aero-Engine, Beihang University, 37 Xueyuan Road, Beijing 100191, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3611; https://doi.org/10.3390/app16073611

Submission received: 17 March 2026 / Revised: 3 April 2026 / Accepted: 6 April 2026 / Published: 7 April 2026

Abstract

Multicomponent diffusion modeling based on the Maxwell–Stefan formulation is widely used in high-fidelity combustion simulations due to its superior physical accuracy compared with simplified diffusion models. However, the computational complexity of the Maxwell–Stefan model, which arises from the solution of coupled multicomponent transport equations, becomes a major performance bottleneck in large-scale CFD simulations. In this work, a high-performance computing optimization strategy for the Maxwell–Stefan diffusion model is developed within the OpenFOAM framework. The proposed method improves computational efficiency through block-based computation and vectorization-oriented data organization to better exploit modern CPU architectures and SIMD instruction capabilities. The optimized implementation enhances memory locality, increases data reuse efficiency, and reduces cache miss penalties. Numerical validation is performed using two-dimensional laminar counterflow flame cases and ammonia–hydrogen turbulent combustion cases, including both premixed and non-premixed jet flames. Results demonstrate that the optimized Maxwell–Stefan implementation preserves numerical accuracy while significantly improving computational performance. Speedups of 2.5×–4.5× are achieved depending on the number of chemical species. The developed approach provides an efficient solution for detailed combustion simulations involving large chemical mechanisms. The test cases and source code are openly shared.

Keywords: multicomponent diffusion; Maxwell–Stefan formulation; high-performance computing; combustion

1. Introduction

Multicomponent mass diffusion plays a critical role in the numerical simulation of reacting flows, especially in applications involving detailed chemistry, such as hydrogen combustion, ammonia combustion, and pollutant formation in aero-engine and gas turbine combustors. Among the available formulations, the Maxwell–Stefan diffusion model provides a physically rigorous description of multicomponent transport by implicitly accounting for pairwise species interactions. Compared to mixture-averaged diffusion models, the Maxwell–Stefan formulation offers significantly improved predictive capability in systems characterized by strong composition gradients, light species transport, or high-pressure conditions. It has been widely used in the energy and chemical industries, such as in combustion, catalysts, adsorbents and membranes [1,2,3,4,5,6,7]. In the combustion area, it has been proven that the Maxwell–Stefan formulation provides higher accuracy than simpler methods such as the mixture-averaged method or the unity Lewis number assumption (equal diffusivity assumption) in predicting the overall combustion process [8,9,10,11] and pollutants [9,12]. In addition, it has been found that in a highly turbulent premixed H₂ flame, the importance of the diffusion model becomes more pronounced, as flame stretching and vortex-induced curvature enhance differential diffusion effects among species, thereby leading to more complex macroscopic combustion rate behaviors [13].

The Maxwell–Stefan diffusion model incurs high computational cost; it requires the solution of a coupled linear system for each computational cell, with the coefficient matrix depending on local thermodynamic states and binary diffusion coefficients. For detailed chemical mechanisms involving tens or even hundreds of species, the associated computational complexity grows rapidly. In large-scale Computational Fluid Dynamics (CFD) simulations—such as Large Eddy Simulation (LES) or Direct Numerical Simulation (DNS)—the cost of evaluating multicomponent diffusion can become comparable to or even exceed that of chemical source term integration. As a result, despite its physical advantages, the practical use of the Maxwell–Stefan model in high-resolution simulations remains limited by computational efficiency.

Previous studies have focused primarily on reducing computational cost at the model [14,15,16,17] or algorithm level; however, such approaches often achieve improved efficiency at the expense of numerical accuracy. The low-level computational performance—such as memory access patterns, data locality, vectorization, and instruction-level parallelism—has received comparatively less systematic investigation. In modern High-Performance Computing (HPC) environments, especially on Single Instruction Multiple Data (SIMD)-capable Central Processing Unit (CPU) architectures, suboptimal data layout and dependency chains can significantly limit performance, even when the algorithmic formulation itself is optimal.

In this work, we present a performance-oriented reimplementation of the Maxwell–Stefan detailed multicomponent diffusion model within an open-source CFD framework, OpenFOAM, using HPC technology. The objective is to substantially improve computational performance while preserving numerical accuracy and physical consistency. The optimization strategy combines algorithmic analysis with hardware-conscious code restructuring, including data layout redesign, reduction in memory traffic, enhanced SIMD vectorization, and improved instruction-level parallelism. A theoretical performance analysis is conducted to identify computational bottlenecks and guide optimization decisions.

The remainder of this paper is organized as follows. Section 2 introduces the mathematical formulation of the Maxwell–Stefan diffusion model and describes the baseline computational algorithm. Section 3 presents the proposed optimization strategies and theoretical performance considerations. Section 4 provides the case setup. Section 5 provides detailed benchmarking results and performance analysis. Finally, conclusions are drawn in Section 6.

2. Theory

For a multicomponent gas system, the governing equations can be written as:

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho u) = 0 \tag{1}$$

$$\frac{\partial (\rho u)}{\partial t} + \nabla \cdot (\rho u u) = - \nabla p + \nabla \cdot \tau \tag{2}$$

$$\frac{\partial (\rho Y_{i})}{\partial t} + \nabla \cdot (\rho u Y_{i}) = - \nabla \cdot J_{i} + S_{i} \tag{3}$$

$$\frac{\partial (\rho h_{s} + \rho K)}{\partial t} + \nabla \cdot (\rho u h_{s} + \rho u K) = - \nabla \cdot \dot{q} + \frac{d p}{d t} + \Phi + \dot{Q} \tag{4}$$

Here, $\rho$ is the mixture density (kg/m³), $u$ is the mixture velocity (m/s), $p$ is the pressure (Pa), $\tau$ is the viscous stress tensor (Pa), $Y_{i}$ is the mass fraction of the i-th species, and $J_{i}$ is the molecular diffusion flux of the i-th species (kg/m²/s). $S_{i}$ is the chemical source term (kg/m³/s). $h_{s}$ is the sensible enthalpy (J/kg), and $K$ is the kinetic energy of the mixture (J/kg). $\dot{q}$ is the energy flux due to heat conduction and species diffusion (W/m²), $\Phi$ is the viscous heating (W/m³), and $\dot{Q}$ is the volumetric heat release from chemical reactions (W/m³).

The Maxwell–Stefan formulation describes the detailed species transport, which can be written as [18]:

$$\sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_{i} X_{j}}{D_{ij}} \left( \frac{J_{j}}{\rho_{j}} - \frac{J_{i}}{\rho_{i}} \right) = \nabla X_{i} - \frac{\nabla T}{T} \sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_{i} X_{j}}{D_{ij}} \left( \frac{D_{T,j}}{\rho_{j}} - \frac{D_{T,i}}{\rho_{i}} \right) \tag{5}$$

Here, $X_{i}$ is the mole fraction of the i-th species, and $D_{ij}$ is the binary diffusion coefficient between the i-th species and the j-th species. $D_{T,j}$ is the thermal diffusion coefficient of the j-th species. Equation (5) is only valid for an ideal gas, and the effects of pressure and other external forces on different species are omitted. For a real gas or liquid, the mole fraction gradient $\nabla X_{i}$ should be replaced by generalized flux driving forces, d, where the details of d can be found in ref. [18]. In practice, an equivalent generalized Fick’s Law formulation of the Maxwell–Stefan diffusion model for an ideal gas is often used for convenience in CFD software, such as the commercial software ANSYS Fluent 2024 R2 [19] or the open-source software OpenFOAM-10 [20], where the equivalent generalized Fick’s Law formulation is:

$$J_{i} = \sum_{j = 1}^{Ns - 1} \rho D_{ij} \nabla Y_{j} - D_{T,i} \frac{\nabla T}{T} \tag{6}$$

Here, $D_{ij}$ is the generalized Fick diffusion coefficient between the i-th species and the j-th species. The derivation of the generalized Fick diffusion equation can be found in Appendix A. $D_{ij}$ can be rewritten in matrix form:

$$D_{ij} = [D] = [A]^{-1} [B] \tag{7}$$

The symbols $[D]$, $[A]$ and $[B]$ represent matrices of size (Ns − 1) × (Ns − 1), where Ns represents the total number of species. For matrix $[A]$, the elements can be computed as:

$$\begin{cases} A_{ii} = - \left( \dfrac{X_{i}}{D_{i,Ns}} \dfrac{W_{m}}{W_{Ns}} + \sum\limits_{\substack{j = 1 \\ j \neq i}}^{Ns} \dfrac{X_{j}}{D_{ij}} \dfrac{W_{m}}{W_{i}} \right) \\ A_{ij} = X_{i} \left( \dfrac{1}{D_{ij}} \dfrac{W_{m}}{W_{j}} - \dfrac{1}{D_{i,Ns}} \dfrac{W_{m}}{W_{Ns}} \right), \quad i \neq j \end{cases} \tag{8}$$

For matrix $[B]$, the elements can be computed as:

$$\begin{cases} B_{ii} = - \left( X_{i} \dfrac{W_{m}}{W_{Ns}} + (1 - X_{i}) \dfrac{W_{m}}{W_{i}} \right) \\ B_{ij} = X_{i} \left( \dfrac{W_{m}}{W_{j}} - \dfrac{W_{m}}{W_{Ns}} \right), \quad i \neq j \end{cases} \tag{9}$$

The subscripts $m$, $i$ and $Ns$ represent the mixture, the i-th species and the last species, respectively. $D_{ij}$ represents the binary diffusion coefficient between the i-th species and the j-th species. $D_{i,Ns}$ represents the binary diffusion coefficient between the i-th species and the last species. This work treats the last species as a non-free variable, and its value is obtained through $1 - \sum_{i = 1}^{Ns - 1} Y_{i}$. The symbol $W$ represents molecular weight (kg/kmol), and $X$ represents the species mole fraction.

Generally speaking, computing the generalized Fick diffusion coefficient matrix $[D]$ is the most expensive part of the species flux computation; it involves matrix inversion as well as matrix multiplication, each with a time complexity of O(Ns³). As the number of species increases, the computational cost rises rapidly.
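As an illustration of Equations (7)–(9), the following pure-Python sketch assembles [A] and [B] for a single cell and solves for [D]. It is not the OpenFOAM implementation: the names `fick_matrix` and `solve_multi`, the nested-list storage, and the use of Gaussian elimination with partial pivoting are assumptions made for this example, and the mixture molecular weight is taken as W_m = Σ X_i W_i.

```python
def solve_multi(A, B):
    """Solve A D = B for D by Gaussian elimination with partial pivoting."""
    n, m = len(A), len(B[0])
    M = [ra[:] + rb[:] for ra, rb in zip(A, B)]      # augmented matrix [A | B]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + m):
                M[r][k] -= f * M[c][k]
    D = [[0.0] * m for _ in range(n)]
    for r in range(n - 1, -1, -1):                   # back substitution
        for j in range(m):
            s = M[r][n + j] - sum(M[r][k] * D[k][j] for k in range(r + 1, n))
            D[r][j] = s / M[r][r]
    return D


def fick_matrix(X, W, Dbin):
    """Assemble [A], [B] per Eqs. (8)-(9) and return [D] = [A]^-1 [B].

    X    : mole fractions, length Ns (last entry is the dependent species)
    W    : molecular weights in kg/kmol, length Ns
    Dbin : Ns x Ns table of binary diffusion coefficients (diagonal unused)
    """
    Ns = len(X)
    n = Ns - 1                                       # index of the last species
    Wm = sum(x * w for x, w in zip(X, W))            # assumed: W_m = sum X_i W_i
    A = [[0.0] * n for _ in range(n)]
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                s = sum(X[k] / Dbin[i][k] * Wm / W[i] for k in range(Ns) if k != i)
                A[i][i] = -(X[i] / Dbin[i][n] * Wm / W[n] + s)
                B[i][i] = -(X[i] * Wm / W[n] + (1.0 - X[i]) * Wm / W[i])
            else:
                A[i][j] = X[i] * (Wm / (Dbin[i][j] * W[j]) - Wm / (Dbin[i][n] * W[n]))
                B[i][j] = X[i] * (Wm / W[j] - Wm / W[n])
    return solve_multi(A, B)


# Illustrative three-species mixture with one common binary coefficient d.
X = [0.2, 0.3, 0.5]
W = [2.016, 17.031, 28.014]
d = 2.0e-5
D = fick_matrix(X, W, [[d] * 3 for _ in range(3)])
```

A useful sanity check follows directly from Equations (8) and (9): when every binary coefficient equals the same value d, [A] = [B]/d, so [D] reduces to d times the identity matrix. The sketch reproduces this to rounding error.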

3. Implementation Details

The goal of this work is to optimize the Maxwell–Stefan diffusion model in the open-source CFD code OpenFOAM-10 to improve efficiency. This section is therefore divided into two parts: the first addresses performance profiling, and the second addresses computational performance optimization.

3.1. Computational Bottleneck Analysis

To identify the dominant performance bottlenecks of the Maxwell–Stefan implementation, a detailed profiling study was conducted based on the opposed-flow flame tutorial in OpenFOAM-10 using a 38-species, 263-reaction ammonia mechanism [21] under cold-flow conditions (with chemistry disabled). Performance statistics were collected using the Linux perf tool with approximately 30,000 samples. Only self-time (exclusive time) was considered for each function: if a function funcA calls a sub-function funcB, the execution time of funcB is not attributed to funcA.

The profiling results indicate that the computational cost is dominated by the functions shown in Table 1. The most time-consuming routine is transformDiffusionCoefficientFields, which performs data structure conversion and memory access and is analyzed in Section 3.2. The routines LUBacksubstitute, MatrixMultiplation and LUDecompose are matrix operations, which are discussed in Section 3.3. The remaining overhead is the 2D interpolation of the binary diffusion coefficients for the different species pairs, where the input variables are pressure and temperature.

It should be noted that the flow state does not affect the results reported in Table 1. The first function in Table 1 involves data access and transformation between field variables and matrix representations, as well as the reverse process. Functions 2–4 correspond to matrix operations, which are independent of the flow state. For the interpolation of binary diffusion coefficients, a uniform lookup table is employed, allowing the indices to be determined directly through algebraic operations without any search procedure; therefore, this process is also independent of the flow state.
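The search-free lookup described above can be sketched as follows. This is only an illustration: the text states that the table is uniform in pressure and temperature, but the bilinear scheme, the clamping at the table edges, and the names `make_uniform_table` and `lookup` are assumptions made for this example.

```python
def make_uniform_table(f, p0, dp, n_p, t0, dt, n_t):
    """Tabulate f(p, T) on a uniform grid of n_p pressures and n_t temperatures."""
    return [[f(p0 + i * dp, t0 + j * dt) for j in range(n_t)] for i in range(n_p)]


def lookup(table, p0, dp, t0, dt, p, T):
    """Bilinear interpolation; cell indices come from arithmetic, not a search."""
    n_p, n_t = len(table), len(table[0])
    # Fractional grid position, clamped so the 2 x 2 stencil stays in the table.
    fp = min(max((p - p0) / dp, 0.0), n_p - 1.0)
    ft = min(max((T - t0) / dt, 0.0), n_t - 1.0)
    i = min(int(fp), n_p - 2)
    j = min(int(ft), n_t - 2)
    a, b = fp - i, ft - j
    return ((1 - a) * (1 - b) * table[i][j] + a * (1 - b) * table[i + 1][j]
            + (1 - a) * b * table[i][j + 1] + a * b * table[i + 1][j + 1])


# Hypothetical tabulated function and grid, chosen only for illustration.
g = lambda p, T: 2.0 * p + 0.5 * T + 1.0
table = make_uniform_table(g, 1.0, 0.5, 5, 300.0, 100.0, 4)
value = lookup(table, 1.0, 0.5, 300.0, 100.0, 1.7, 455.0)
```

Because the index is a division and a truncation, the cost of one lookup is the same for every cell, whatever the local pressure and temperature are.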

3.2. Memory Access Optimization

The primary function of the routine transformDiffusionCoefficientFields is data reorganization and memory movement rather than arithmetic computation. Its workflow can be decomposed into three stages:

1. Matrix Assembly. For each computational cell, a diffusion coefficient matrix is constructed. At this stage, the matrix entries correspond to the binary diffusion coefficients. Each element is gathered from the associated entry of the Ns × Ns binary species diffusion coefficient fields, where Ns denotes the number of chemical species.

2. Matrix Operations. The assembled matrix is then passed to the subsequent sub-routines, including the construction of matrices $[A]$ and $[B]$, computation of the inverse of $[A]$, and evaluation of the matrix product $[D] = [A]^{-1} [B]$. The execution time of this part is not attributed to the routine discussed in this section.

3. Data Redistribution. After the matrix operations are completed, the resulting matrix entries, which now represent generalized Fick diffusion coefficients, are written back to the diffusion coefficient fields. At this point, the original binary diffusion coefficient fields are overwritten by the generalized Fick coefficients.

From a computational performance perspective, this routine is memory-bound rather than compute-bound. For a single computational cell, both the assembly and redistribution stages exhibit a time complexity of O(Ns²), since all pairwise species coefficients must be processed. More critically, matrix assembly requires accessing the value at a fixed cell index across all diffusion coefficient fields. Because these fields are stored as separate arrays in memory, the required data are distributed across widely separated memory addresses. Consequently, each access is likely to incur cache misses, especially when Ns is large and the working set exceeds cache capacity.

The performance implication is significant: a main-memory load may incur a latency on the order of hundreds of clock cycles, whereas a floating-point add operation typically has a latency of only a few cycles and a throughput approaching one operation per cycle on modern architectures. Therefore, the dominant cost of this routine arises from irregular memory access patterns and poor spatial locality rather than from arithmetic work. Figure 1 illustrates the data redistribution process, in which data are written from the matrix back to the species diffusion coefficient fields. The memory addresses of successive write operations are separated by large strides, so consecutive writes land far apart in memory. Under such an access pattern, significant cache-miss penalties are likely because of poor spatial locality.

In the present work, the memory access pattern is optimized based on the principle of cache locality. When data are fetched from main memory into the cache, an entire cache line is loaded at once. On most modern architectures, a cache line is 64 bytes, corresponding to eight double-precision floating-point values. In the original write-back procedure (Figure 1), although only a single value is explicitly written, the surrounding data corresponding to nearby computational cells are also implicitly loaded into the cache as part of the same cache line. These data, already resident in the cache, can potentially be reused to improve memory efficiency. To exploit this behavior, the AVX2 instruction set was employed to enhance memory access efficiency. AVX2 is widely supported on modern x86 processors and enables 256-bit SIMD operations. The optimized procedure, illustrated in Figure 2, consists of three steps:

1. For four consecutive matrices, four 256-bit vectors are loaded simultaneously.
2. An in-register transpose operation is performed on these vectors.
3. The transposed vectors are written to the corresponding four species diffusion coefficient fields.

This approach fully leverages spatial locality by ensuring that data loaded within a cache line are effectively utilized before eviction. In addition to memory access, the vector transpose operation involves AVX2 shuffle instructions. These instructions operate exclusively on registers. In modern processor architectures, register-to-register operations are significantly faster than cache or memory accesses. Consequently, the overhead associated with vector shuffling is negligible compared to the latency of memory traffic.
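The data movement of the three steps can be mimicked in plain Python. The sketch below is not the AVX2 code: Python has no SIMD registers, so four-element list slices stand in for 256-bit vectors and `zip` stands in for the shuffle-based transpose. The flat row-major buffer `mats`, the one-list-per-field layout, and the function names are assumptions made for this example.

```python
def scatter_scalar(mats, n, n_cells):
    """Baseline write-back: one value per store, a different field every time."""
    nn = n * n
    fields = [[0.0] * n_cells for _ in range(nn)]
    for c in range(n_cells):
        for e in range(nn):
            fields[e][c] = mats[c * nn + e]
    return fields


def scatter_blocked(mats, n, n_cells):
    """Blocked write-back: 4 cells x 4 entries are transposed, then stored."""
    nn = n * n
    fields = [[0.0] * n_cells for _ in range(nn)]
    c4, e4 = n_cells - n_cells % 4, nn - nn % 4
    for c in range(0, c4, 4):
        for e in range(0, e4, 4):
            # Step 1: load one 4-wide vector from each of four consecutive matrices.
            v = [mats[(c + k) * nn + e:(c + k) * nn + e + 4] for k in range(4)]
            # Step 2: 4 x 4 transpose (done with register shuffles in AVX2).
            t = [list(col) for col in zip(*v)]
            # Step 3: store four contiguous values into each of four fields.
            for m in range(4):
                fields[e + m][c:c + 4] = t[m]
        for e in range(e4, nn):                      # leftover entries, scalar path
            for k in range(4):
                fields[e][c + k] = mats[(c + k) * nn + e]
    for c in range(c4, n_cells):                     # leftover cells, scalar path
        for e in range(nn):
            fields[e][c] = mats[c * nn + e]
    return fields


# Ten cells, each holding a 3 x 3 matrix, filled with distinct values.
mats = [float(v) for v in range(10 * 9)]
blocked = scatter_blocked(mats, 3, 10)
```

Each store in the blocked version touches four neighbouring cells of one field, so the eight doubles of a 64-byte cache line are consumed by two stores instead of eight widely separated ones.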

Memory optimization is also performed on the Matrix Assembly process, but the implementation details are omitted since the basic idea is similar to Figure 2.

3.3. Matrix Operation Optimization

Matrix operations form the other major cost: matrix multiplication, LU decomposition, and matrix inversion through LU substitution together take up more than 40% of the running time in the Maxwell–Stefan module. The optimization of these operations is presented in this section.

3.3.1. Matrix Inversion

The objective of matrix inversion is to compute the inverse of matrix $[A]$. The standard approach in numerical linear algebra is to first compute the LU decomposition $[A] = [L] [U]$, where $[L]$ and $[U]$ denote the lower and upper triangular matrices, respectively, and then compute $[A]^{-1}$ using the LU factors. The LU decomposition is not the major computational bottleneck in this module. Therefore, the optimization details of LU decomposition are omitted in this work, but they can be found in the author’s previous work on developing high-performance chemistry solvers in OpenFOAM, where the LU decomposition becomes a bottleneck in the chemistry integration process.

After LU decomposition, the inverse matrix is obtained by solving a sequence of linear systems. Noting that $[A] [A]^{-1} = [I]$, each column of $[A]^{-1}$ can be computed by solving $[L] [U] x = e$, where x represents an unknown column of $[A]^{-1}$, and e is the corresponding column of the identity matrix $[I]$. This procedure amounts to performing forward and backward substitution for each column of the identity matrix. This column-wise inversion strategy based on LU decomposition is the standard implementation adopted in OpenFOAM, as illustrated in Figure 3.

This approach exhibits two inherent drawbacks. First, only one column is computed at a time. Such a column-wise processing strategy limits instruction-level parallelism and fails to fully exploit the SIMD vectorization capabilities of modern CPUs. Since the forward and backward substitutions are performed independently for each right-hand side, opportunities for data-level parallelism across multiple columns are not utilized. Second, the procedure of writing the computed column back into the inverse matrix introduces unfavorable memory access patterns. In the OpenFOAM implementation, matrices are stored in row-major order, meaning that a two-dimensional matrix is laid out in contiguous memory row by row. Consequently, storing elements column-wise results in strided memory accesses. This access pattern degrades spatial locality, reduces effective cache-line utilization, and may lead to increased cache misses. Such memory inefficiencies can significantly impact performance, especially for large matrices where memory bandwidth becomes a limiting factor.

To improve computational efficiency, multiple columns (e.g., four columns) can be processed simultaneously within the LU back-substitution routine, rather than computing a single column at a time. Specifically, under the AVX2 instruction set, a single vector instruction can operate on four double-precision (64-bit) floating-point values concurrently. In contrast, the conventional scalar implementation in OpenFOAM performs arithmetic operations on only one 64-bit floating-point value per instruction. By reformulating the back-substitution procedure to solve multiple right-hand sides simultaneously, data-level parallelism can be effectively exploited through SIMD vectorization. Furthermore, after computing multiple columns together, the results can be written back to the inverse matrix in a manner that improves memory access locality. By organizing the storage so that vector elements correspond to contiguous memory locations, spatial locality is enhanced, leading to better cache-line utilization and reduced cache miss rates. Consequently, overall cache efficiency is improved. The validation and performance results can be found in Appendix B.
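The difference between the column-wise and the multi-column strategies can be sketched in plain Python. This is an illustration of the loop structure only, not the OpenFOAM or AVX2 code: the function names, the nested-list storage, and the pivoting bookkeeping through `perm` are assumptions made for this example, and the innermost loop over `j` stands in for a single vector instruction acting on four doubles.

```python
def lu_decompose(A):
    """Doolittle LU with partial pivoting on a copy of A; returns (LU, perm)."""
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))                  # perm[r] = original row now at row r
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(LU[r][c]))
        LU[c], LU[p] = LU[p], LU[c]
        perm[c], perm[p] = perm[p], perm[c]
        for r in range(c + 1, n):
            LU[r][c] /= LU[c][c]
            for k in range(c + 1, n):
                LU[r][k] -= LU[r][c] * LU[c][k]
    return LU, perm


def invert_by_columns(LU, perm):
    """Baseline: one right-hand side at a time, column-wise write-back."""
    n = len(LU)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        x = [1.0 if perm[r] == col else 0.0 for r in range(n)]
        for r in range(n):                 # forward substitution (unit diagonal L)
            for k in range(r):
                x[r] -= LU[r][k] * x[k]
        for r in range(n - 1, -1, -1):     # backward substitution
            for k in range(r + 1, n):
                x[r] -= LU[r][k] * x[k]
            x[r] /= LU[r][r]
        for r in range(n):                 # strided store in row-major storage
            inv[r][col] = x[r]
    return inv


def invert_by_panels(LU, perm, width=4):
    """Sketch: `width` right-hand sides advance together through substitution."""
    n = len(LU)
    inv = [[0.0] * n for _ in range(n)]
    for c0 in range(0, n, width):
        w = min(width, n - c0)
        # X[r] holds w adjacent entries of row r (one SIMD register per row).
        X = [[1.0 if perm[r] == c0 + j else 0.0 for j in range(w)] for r in range(n)]
        for r in range(n):
            for k in range(r):
                f = LU[r][k]
                for j in range(w):         # lane loop: the part that vectorizes
                    X[r][j] -= f * X[k][j]
        for r in range(n - 1, -1, -1):
            for k in range(r + 1, n):
                f = LU[r][k]
                for j in range(w):
                    X[r][j] -= f * X[k][j]
            for j in range(w):
                X[r][j] /= LU[r][r]
        for r in range(n):
            inv[r][c0:c0 + w] = X[r]       # contiguous store within one row
    return inv
```

Both routines perform the same arithmetic in the same order for every column, so they return the same inverse. What changes is that each coefficient of the LU factors is loaded once per panel rather than once per column, and that the results are stored as short contiguous runs within a row.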

3.3.2. Matrix Multiplication

After obtaining the inverse matrix $[A]^{-1}$, the matrix of generalized Fick diffusion coefficients $[D]$ is obtained from $[D] = [A]^{-1} [B]$. In OpenFOAM, this product is computed with the classical triple-loop matrix multiplication, based on row-major storage and without blocking or SIMD vectorization. This implementation exhibits several structural performance limitations, which are discussed below.

First, although row-major storage ensures contiguous access for rows of the left operand matrix, $[A]^{-1}$, the access pattern to the right operand matrix, $[B]$, remains inherently strided. In the typical i-j-k loop ordering, elements of the matrix $[B]$ are accessed column-wise while being stored row-wise in memory. This mismatch leads to non-contiguous memory accesses with large strides, degrading spatial locality and reducing effective cache-line utilization, as shown in Figure 4. As the matrix size grows beyond the capacity of the upper cache levels, frequent cache misses occur, significantly limiting performance.

Second, the absence of cache blocking prevents effective temporal reuse of data. In the naïve formulation, once a row of the left matrix or a column element of the right matrix is used in a partial accumulation, it is unlikely to remain in fast cache levels for subsequent reuse. Consequently, the algorithm fails to exploit the hierarchical memory system efficiently, resulting in increased memory traffic and bandwidth pressure.

Third, the implementation operates entirely in scalar mode. Each floating-point instruction processes a single double-precision value, leaving SIMD execution units underutilized. Modern CPUs provide wide vector registers and fused multiply–add (FMA) units capable of processing multiple floating-point operations per cycle. However, the classical triple-loop structure in OpenFOAM does not explicitly expose sufficient data-level parallelism for the compiler to reliably generate efficient vectorized code.

To improve computational efficiency, this work adopts a blocked matrix multiplication algorithm in place of the standard implementation in OpenFOAM-10. The blocked formulation is mathematically equivalent to conventional matrix multiplication; however, it reorganizes the computation in a manner that better aligns with modern computer architectures. By partitioning matrices into sub-blocks and performing computations at the block level, data locality is significantly enhanced. This restructuring increases temporal reuse of matrix elements and improves cache-line utilization, thereby reducing memory traffic and mitigating cache miss rates. For this reason, blocked algorithms are widely employed in high-performance linear algebra libraries.

In the present implementation, the innermost loop computes a 4 × 4 sub-block of the result matrix in each kernel invocation rather than a single scalar element, as illustrated in Figure 5. This formulation exposes data-level parallelism suitable for SIMD execution, enabling vector instructions to operate on multiple elements simultaneously. By combining blocking with SIMD-based computation, the reuse of elements from matrices $[A]^{-1}$ and $[B]$ is substantially increased, while strided memory accesses are reduced. Consequently, cache efficiency is improved and potential cache misses are effectively mitigated, leading to higher overall computational throughput.
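The contrast between the triple loop and a tiled kernel can be sketched in plain Python. This is a structural illustration only, with hypothetical function names: the list `acc` stands in for the accumulators that a SIMD kernel would keep in vector registers, and the loop over `j` inside a tile stands in for one vector multiply-add.

```python
def matmul_naive(A, B):
    """Classical i-j-k triple loop; B is read down a column (strided access)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_blocked(A, B, bs=4):
    """Each kernel pass accumulates one bs x bs tile of the result."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, bs):
        i1 = min(i0 + bs, n)
        for j0 in range(0, p, bs):
            j1 = min(j0 + bs, p)
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]   # tile accumulators
            for k in range(m):
                brow = B[k][j0:j1]         # contiguous piece of one row of B
                for i in range(i0, i1):
                    a = A[i][k]            # one scalar, broadcast across the tile row
                    row = acc[i - i0]
                    for j in range(j1 - j0):
                        row[j] += a * brow[j]
            for i in range(i0, i1):
                C[i][j0:j1] = acc[i - i0]  # contiguous store of each tile row
    return C
```

Inside a tile, every loaded piece of a row of B is reused by up to four rows of the left operand, and B is only ever read along rows, which removes the column-wise stride of the i-j-k ordering.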

4. Case Setup

In this section, the proposed implementation is validated in terms of both numerical correctness and computational efficiency. The validation cases are categorized into two-dimensional and three-dimensional configurations. The two-dimensional test case consists of a laminar counterflow flame, obtained from the official tutorial cases of OpenFOAM-10. This configuration is employed to verify numerical accuracy and to assess computational performance under controlled conditions. The three-dimensional cases involve two LES combustion cases, and details are given in Section 4.2.

All simulations were conducted on the T6 partition of the Beijing Super Cloud high-performance computing cluster. Each compute node of the cluster is equipped with dual Intel Xeon Platinum 9242 processors, providing a total of 96 CPU cores per node, with a base frequency of 2.30 GHz and 384 GB of memory per node. The high-performance code developed in this work was compiled using the GCC 7.3.0 compiler. During compilation, the optimization flags “-O3 -mavx2 -mfma” were enabled to fully exploit the aggressive compiler optimization level (O3), the AVX2 SIMD instruction set, and fused multiply–add (FMA) instructions for enhanced computational performance. Due to the relatively small grid size of the two-dimensional counterflow flame case, simulations for this configuration were executed using a single CPU core. For three-dimensional LES simulations, four computing nodes were used, with a total of 384 cores.

4.1. Setup of Two-Dimensional Counterflow Flame Cases

The reaction mechanisms and fuel compositions for the two-dimensional test cases are summarized in Table 2. The initial temperature and pressure are set to 293 K and 100,000 Pa, respectively. After each case reaches a steady state, an additional 100 time steps are performed to collect flow field data and computational performance metrics.

The fuel and air streams are arranged in an opposed-flow configuration, with both streams having an inlet velocity of 0.1 m/s. The Soret effect is neglected in all cases. For chemical kinetics integration, a high-performance solver developed by the author is employed (see Reference [22]). In the chemical Ordinary Differential Equation (ODE) solver, the absolute tolerance is set to 10⁻⁸, and the relative tolerance is set to 10⁻⁴.

Table 2. Two-dimensional counterflow flame cases and reaction mechanisms used.

Fuel Composition	Number of Species	Number of Reactions	Reference
H₂/NH₃: 0.5/0.5	31	203	[23]
H₂/NH₃: 0.5/0.5	38	263	[21]
CH₄/H₂/NH₃: 0.3/0.5/0.2	42	130	[24]
CH₄/H₂/NH₃: 0.3/0.5/0.2	59	356	[25]
CH₄/H₂/NH₃: 0.3/0.5/0.2	69	389	[26]
CH₄/H₂/NH₃: 0.3/0.5/0.2	77	460	[27]
CH₄/H₂/NH₃: 0.3/0.5/0.2	125	1099	[28]
CH₄/H₂/NH₃: 0.3/0.5/0.2	129	1231	[29]

4.2. Setup of Three-Dimensional Large Eddy Simulations of Ammonia/Hydrogen Jet Flames

The three-dimensional test cases involve LES of ammonia–hydrogen jet flames, including both non-premixed and premixed configurations. The combustion process of ammonia–hydrogen mixtures involves detailed species transport phenomena and complex nitrogen oxide (NOₓ) formation pathways. Previous studies have demonstrated that the accuracy of species diffusion modeling plays a critical role in the prediction of pollutant emissions [9,10]. Therefore, ammonia–hydrogen flames are selected in this study as the benchmark configurations for evaluating computational efficiency.

4.2.1. Burner Description

The schematic diagrams of the experimental setups are presented in Figure 6. Specifically, Figure 6a illustrates the non-premixed jet flame configuration, while Figure 6b shows the premixed jet flame configuration.

The non-premixed flame configuration is based on the jet flame burner designed by King Abdullah University of Science and Technology (KAUST). The burner operates within a pressurized duct at 5 atm, with both the fuel and oxidizer supplied at an unburned temperature of 294 K. In this work, the CAJF28 operating condition is adopted for validation. Under this condition, 28% of the ammonia fuel is cracked into hydrogen and nitrogen. The resulting fuel mixture consists of 56.3% ammonia (NH₃), 32.8% hydrogen (H₂), and 10.9% nitrogen (N₂) by volume fraction, while the oxidizer is pure air. The inner diameter of the jet nozzle is 4.58 mm. The fuel jet velocity is 10.1 m/s, corresponding to a jet Reynolds number of 11,200. The coflow air velocity is 0.24 m/s. Further details regarding the experimental configuration and operating conditions can be found in Reference [30].
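The quoted fuel composition can be cross-checked from the cracking ratio. The sketch below assumes that the cracked ammonia decomposes completely according to 2 NH₃ → N₂ + 3 H₂ and that nothing else is added to the fuel stream; the function name is hypothetical.

```python
def cracked_ammonia_composition(crack_fraction):
    """Mole fractions after cracking part of the NH3 via 2 NH3 -> N2 + 3 H2."""
    nh3 = 1.0 - crack_fraction             # moles left per mole of NH3 fed
    h2 = 1.5 * crack_fraction              # 3/2 mol H2 per mol NH3 cracked
    n2 = 0.5 * crack_fraction              # 1/2 mol N2 per mol NH3 cracked
    total = nh3 + h2 + n2
    return {"NH3": nh3 / total, "H2": h2 / total, "N2": n2 / total}


comp = cracked_ammonia_composition(0.28)
```

For a cracking ratio of 0.28, one mole of ammonia becomes 0.72 mol NH₃, 0.42 mol H₂ and 0.14 mol N₂, that is mole fractions of about 0.5625, 0.3281 and 0.1094, in line with the composition stated above.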

The second configuration is a piloted premixed jet flame designed by Lund University, Sweden. The selected operating condition is taken from Reference [31]. The main jet consists of an ammonia/hydrogen premixed fuel with a volumetric ratio of 60/40 (NH₃/H₂), operating at an equivalence ratio of 1.4. The unburned temperature is 298 K, the jet velocity is 110 m/s, and the ambient pressure is 1 atm. The pilot stage employs a methane–air premixed flame with an equivalence ratio of 0.9 and a flow velocity of 0.44 m/s.

4.2.2. Model Selection

In terms of subgrid-scale turbulence modeling, the WALE subgrid-scale model [32] is employed for both configurations. For the chemical kinetics modeling of the non-premixed jet flame, Abdelwahid et al. [33] evaluated four reaction mechanisms (ranging from 32 to 125 species) for the corresponding counterflow flame configuration and reported good agreement with experimental measurements. In their study, the Zhang mechanism (38 species and 263 reactions) [21] was selected due to its relatively low computational cost. To maintain consistency with their work, the Zhang mechanism is also adopted in this study. For the premixed jet flame, Xu et al. [34] selected the Okafor mechanism (59 species and 356 reactions) [25] for ammonia–methane combustion. Following their approach, the Okafor mechanism is also adopted in this work to ensure consistency with previous studies. The turbulent combustion model is treated using the laminar closure assumption, which is consistent with ref. [33] as well as ref. [34].

4.2.3. Numerical Approach

Both cases are simulated using the reactingFoam solver in OpenFOAM-10. The chemical ODE system of the cases is integrated using the Seulex solver. The relative and absolute tolerances for the non-premixed jet flame are set to 10⁻⁴ and 10⁻⁸, respectively. For the premixed jet flame, the relative tolerance and absolute tolerance are set to 10⁻¹ and 10⁻⁸, respectively. For the main jet inlet of both cases, the LEMOS synthetic turbulence generator [35] is used, with an inlet turbulence intensity of approximately 5%. Adiabatic boundary conditions are imposed at the walls of both cases, while a non-reflecting boundary condition is applied at the outlet. The PIMPLE algorithm is used for pressure-velocity coupling, and two outer iterations are performed for each flow time step.

For the non-premixed jet flame, the computational mesh contains approximately 7 million cells, with a minimum grid spacing of about 0.15 mm. The mesh resolution satisfies the requirement that more than 80% of the turbulent kinetic energy is resolved. The flow time step is set to 2 × 10⁻⁶ s, with a maximum Courant number of approximately 0.3. The turbulent kinetic energy resolution of this flame can be found in Appendix C.

The computational mesh for the premixed jet flame contains approximately 9.2 million cells, with a grid resolution of about 0.1 mm in the core region, which is the same as in ref. [34]. Note that the integral length scale and Kolmogorov length scale are 2.9 mm and 0.048 mm, respectively. Given these scales, the chosen grid is expected to adequately resolve the dominant turbulent structures. The flow time step is set to 2 × 10⁻⁷ s, and the maximum Courant number is approximately 0.6.

5. Results and Discussion

5.1. Two-Dimensional Counterflow Flame Cases

5.1.1. Code Verification

This section first qualitatively examines the flame structure. For the two-dimensional counterflow flame, molecular diffusion and chemical reactions reach a dynamic balance; therefore, the accuracy of molecular diffusion calculations can significantly influence the predicted flame structure. Figure 7 shows the distributions of H species mass fraction, OH species mass fraction, temperature, and heat release rate. It can be observed that the standard Maxwell–Stefan diffusion model and the optimized implementation proposed in this study produce consistent results for all four physical quantities. No significant visual differences can be identified between the two approaches.

In addition to comparing species distributions, a quantitative assessment of the relative error of the generalized Fick diffusion coefficient matrix is also performed. The relative error is computed as follows:

ε_{ij} = \frac{|D_{ij,\mathrm{standard}} - D_{ij,\mathrm{optimized}}|}{|D_{ij,\mathrm{standard}}| + 10^{-300}}

(10)

where 10⁻³⁰⁰ is introduced as a small regularization constant to prevent division by zero. Considering double-precision (64-bit) floating-point arithmetic, as long as the magnitude of the generalized Fick diffusion coefficient is not on the order of 10⁻³⁰⁰, the addition of this small constant does not alter the denominator due to machine precision limitations.
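The behavior of the regularization constant can be checked directly in double precision. The following is a minimal sketch, not the authors' verification code; the function name `relative_error` and the sample coefficient value are assumptions chosen for illustration.

```python
import sys


def relative_error(d_standard: float, d_optimized: float, eps: float = 1e-300) -> float:
    """Element-wise relative error in the form of Equation (10)."""
    return abs(d_standard - d_optimized) / (abs(d_standard) + eps)


# Hypothetical coefficient of ordinary magnitude (m^2/s): adding 1e-300
# is absorbed by rounding, so the denominator is unchanged.
d = 2.5e-5
denominator_unchanged = (d + 1e-300 == d)

# Two identical zero entries give 0.0 instead of a division by zero.
zero_case = relative_error(0.0, 0.0)

# A perturbation of about one machine epsilon yields an error of that order.
perturbed = relative_error(d, d * (1.0 + sys.float_info.epsilon))
```

The constant only matters when both the reference value and the difference are themselves close to 10⁻³⁰⁰, which is the situation the text excludes.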

Table 3 presents the statistical results of the element-wise relative errors of the generalized Fick diffusion coefficient matrix at the 54th cell. This cell is located within the flame zone, where the temperature is approximately 2400 K. The relative error statistics at this location demonstrate excellent agreement. For all chemical mechanisms considered, the maximum relative error is below 10⁻¹¹, and 99% of the matrix elements exhibit relative error below 3 × 10⁻¹³.

Table 4 presents the results at the 24th cell, which is located in the high-temperature-gradient region (1165 K). The numerical accuracy remains consistent with the standard OpenFOAM implementation. Compared to the results in Table 3, the computational error in Table 4 shows a slight increase, with the maximum relative error reaching 7.36 × 10⁻¹²; nevertheless, this result still demonstrates a very high level of accuracy.

Table 5 presents the results at the 0th cell (corresponding to 300 K). Here the maximum relative error reaches 7.83 × 10⁻⁵. This increase may be attributed to intermediate species in the combustion process. Because of species diffusion, certain intermediate species are present at this location only in extremely low concentrations (for instance, the OH mass fraction is about 6.23 × 10⁻¹⁵ with the 31-species mechanism), which may amplify numerical errors during the evaluation of matrix elements (e.g., in matrices [A] and [B]) and in the subsequent matrix operations.

5.1.2. Performance Evaluation

Table 6 presents the performance comparison between the standard and optimized implementations of the Maxwell–Stefan diffusion model. It can first be observed that the execution time of the Maxwell–Stefan model increases rapidly as the number of chemical species grows. In terms of computational performance, for the mechanism containing 31 species, the optimized implementation achieves a speedup of approximately 2.5×. When the number of species increases to 129, the speedup reaches approximately 4.5×, demonstrating a substantial performance improvement.

To further assess the performance of the optimized implementation, the matrix operations of Section 3.3 are replaced with calls to the Intel Math Kernel Library (MKL), a high-performance linear algebra library, using the AVX2 instruction set in single-thread mode. The results are also shown in Table 6. When the matrix size is relatively small (<59 species), the optimized implementation of Section 3.3 is faster; for a larger number of species, MKL is faster. A possible reason is that Intel MKL employs a pre-packing strategy for matrix [B] in matrix multiplication. This strategy can improve cache utilization and thereby computational efficiency for large matrices. Instead of a separate pre-packing step, a potential improvement would be to generate the elements of matrix [B] directly in a packed format during computation. This approach will be considered in future work.

5.2. Three-Dimensional Ammonia/Hydrogen Jet Flame Cases

Figure 8 presents the computational performance comparison for the ammonia–hydrogen non-premixed jet flame case. The reported average execution time is the mean value per time step. Since the two compared cases differ only in the species diffusion model, while all other physical and numerical settings remain identical, the execution times labeled “Others” and “Chemistry Integration” are taken as the average of the two cases. Here, “Others” refers to computational costs excluding species diffusion and chemical reaction calculations, including thermophysical property updates, equation solving, and related numerical operations.

As shown in Figure 8, the standard species diffusion module requires approximately 6.5 s per time step on average, dominating the total computational cost. In contrast, the optimized implementation reduces the average execution time to 2.55 s per time step, which becomes comparable to the computational cost of chemical reaction integration and other operations. The resulting speedup is approximately 2.54×, leading to a substantial reduction in overall computational time and cost.

It is worth emphasizing that the combustion simulations in this study employ a high-performance chemical kinetics solver previously developed by the authors to reduce computational expense [22]. This solver maintains mathematical consistency and numerical accuracy equivalent to the standard OpenFOAM implementation, without loss of precision. If the standard chemical reaction solver in OpenFOAM-10 were used instead, the average execution time for chemical kinetics integration would be approximately 26 s per time step. In that case, the Maxwell–Stefan diffusion model would no longer constitute the dominant computational cost.

Figure 9 presents the computational performance results for the premixed jet flame case. Due to the use of a relatively large chemical mechanism, the computational cost associated with the Maxwell–Stefan diffusion model becomes significantly higher. In the standard implementation, the species diffusion calculation requires approximately 26.46 s per time step, whereas in the optimized implementation, the execution time is reduced to approximately 8.95 s, corresponding to a speedup of 2.96×. In comparison, the chemical reaction integration requires about 2.21 s per time step, and the remaining computational processes account for approximately 4.79 s. The relatively small computational cost of chemical kinetics is primarily attributed to the use of a relaxed relative convergence tolerance (10⁻¹) in the ODE solver. It is evident that the optimized implementation substantially reduces the computational overhead of the Maxwell–Stefan diffusion model, bringing its cost to the same order of magnitude as the chemical reaction and other numerical processes.
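The reported per-step timings of the premixed case can be combined into an overall estimate. The sketch below is illustrative arithmetic only: it uses the four values quoted above and assumes that species diffusion, chemistry integration, and “Others” together make up the whole cost of a time step. The variable names are hypothetical.

```python
# Per-time-step wall-clock costs (s) quoted for the premixed jet flame.
diffusion_standard = 26.46
diffusion_optimized = 8.95
chemistry = 2.21
others = 4.79

# Speedup of the Maxwell-Stefan module alone.
diffusion_speedup = diffusion_standard / diffusion_optimized

# Assumption: the three components sum to the full cost of one time step.
total_standard = diffusion_standard + chemistry + others
total_optimized = diffusion_optimized + chemistry + others
overall_speedup = total_standard / total_optimized

# Share of the time step spent in species diffusion before and after.
share_standard = diffusion_standard / total_standard
share_optimized = diffusion_optimized / total_optimized

print(f"diffusion speedup: {diffusion_speedup:.2f}x")
print(f"overall speedup:   {overall_speedup:.2f}x")
print(f"diffusion share:   {share_standard:.1%} -> {share_optimized:.1%}")
```

Under this assumption, the 2.96× module-level speedup corresponds to roughly a 2.1× reduction in total time per step, with diffusion falling from about 79% to about 56% of the step.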

5.3. Further Discussion

Although the present implementation is optimized for AVX2/FMA instructions, the underlying optimization strategy is not inherently tied to a specific hardware architecture. The proposed approach is built upon general principles, including vectorization-oriented data organization, cache-aware blocking, and data access optimization, which are widely applicable to modern CPU architectures.

It should be noted that certain low-level implementation details, such as SIMD intrinsics, vector width, and block size selection, are architecture-dependent and have been tuned in this work for AVX2. However, these parameters can be systematically adapted to other architectures, such as AVX-512 or ARM-based SIMD (e.g., NEON or SVE), by redesigning the vectorized kernels and adjusting hardware-specific configurations. The overall algorithmic structure, data layout, and computational workflow remain unchanged under such adaptations.

Furthermore, the emphasis of the present work is not on developing a hardware-specific optimization, but rather on establishing a problem-oriented performance optimization framework for the Maxwell–Stefan diffusion model within OpenFOAM. In this context, the combination of data restructuring and lightweight kernel design provides a portable pathway for performance improvement across different platforms.

Therefore, while the current implementation reflects the characteristics of the target AVX2 platform, the proposed methodology is expected to be transferable to other architectures with moderate implementation effort.

6. Conclusions

In this study, an optimized implementation of the Maxwell–Stefan species diffusion model was developed and systematically evaluated in the OpenFOAM-10 framework. The primary objective was to improve computational efficiency while preserving numerical accuracy in simulations of ammonia and ammonia–hydrogen turbulent flames.

First, numerical validation was performed using a two-dimensional counterflow flame configuration. Both qualitative comparisons of flame structure and quantitative assessments of the generalized Fick diffusion coefficient matrix demonstrated excellent agreement between the standard and optimized implementations. Within the flame zone and the high-temperature-gradient region, element-wise relative errors were found to be below 10⁻¹¹ for all tested mechanisms, with a significant portion of entries exhibiting zero relative error, confirming numerical consistency at the machine precision level. Only at the cold boundary cell (300 K), where some intermediate species have extremely low concentrations, did the maximum relative error rise to 7.83 × 10⁻⁵.

Subsequently, computational performance was assessed using a two-dimensional counterflow flame and large-scale three-dimensional large-eddy simulations of non-premixed and premixed ammonia–hydrogen jet flames. Results show that the execution time of the Maxwell–Stefan module increases rapidly with the number of chemical species, making it a dominant computational bottleneck for detailed reaction mechanisms.

The proposed optimization significantly reduces computational cost. For mechanisms with 31 species, a speedup of approximately 2.5× was achieved, while for larger mechanisms containing up to 129 species, the speedup reached approximately 4.5×. In realistic three-dimensional flame simulations, the optimized implementation reduced the species diffusion cost from 6.5 s to 2.55 s per time step in the non-premixed case and from 26.46 s to 8.95 s in the premixed case. This improvement effectively lowered the diffusion cost to the same order of magnitude as chemical reaction integration and other numerical processes.

The results demonstrate that the proposed optimization strategy substantially enhances the computational efficiency of the Maxwell–Stefan diffusion model without sacrificing accuracy. This improvement is particularly important for combustion simulations involving large chemical mechanisms, where species transport modeling can become a major computational constraint. The developed approach therefore provides a practical and efficient solution for high-fidelity turbulent combustion simulations.

Author Contributions

Conceptualization, Z.C.; methodology, Z.C.; software, Z.C.; validation, Z.C.; formal analysis, Z.C.; investigation, Z.C.; resources, X.H. and B.W.; data curation, X.H. and B.W.; writing—original draft preparation, Z.C.; writing—review and editing, B.W.; supervision, X.H.; project administration, X.H.; funding acquisition, X.H. and B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (12402292); the Fund Project supported by the National Key Laboratory of Science and Technology on Aero-Engine Aero-thermodynamics, China (2023-JCJQ-LB-063-0305); the Sustaining Project supported by the National Key Laboratory of Science and Technology on Aero-Engine Aero-thermodynamics, China (12700002024146001); and the Fundamental Research Funds for the Central Universities of China (501QYZX2023146001, YWF-23-Q-1068).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The test cases and code that support the findings of this study are openly available in a GitHub repository at https://github.com/chizixin-688/FastMaxwellStefan-OpenFOAM-10, accessed on 27 February 2026.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT-5.3 for language polishing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CFD	Computational Fluid Dynamics
LES	Large Eddy Simulation
DNS	Direct Numerical Simulation
HPC	High-Performance Computing
SIMD	Single Instruction Multiple Data
CPU	Central Processing Unit
AVX2	Advanced Vector Extensions 2
FMA	Fused Multiply–Add
ODE	Ordinary Differential Equation
NOx	Nitrogen Oxides
KAUST	King Abdullah University of Science and Technology
WALE	Wall-Adapting Local Eddy-Viscosity

Appendix A. Derivation of Generalized Fick Diffusion Equation

The accurate Maxwell–Stefan equation is rewritten as:

\sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_i X_j}{D_{ij}} \left( \frac{J_j}{ρ_j} - \frac{J_i}{ρ_i} \right) = \nabla X_i - \frac{\nabla T}{T} \sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_i X_j}{D_{ij}} \left( \frac{D_{T,j}}{ρ_j} - \frac{D_{T,i}}{ρ_i} \right)

(A1)

To convert the Maxwell–Stefan equation into the generalized Fick diffusion equation, the following equation is used:

\sum_{j = 1}^{Ns} Y_j = \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} Y_j + Y_i + Y_{Ns} = 1

(A2)

\sum_{j = 1}^{Ns} X_j = \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} X_j + X_i + X_{Ns} = 1

(A3)

\sum_{j = 1}^{Ns} J_j = \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} J_j + J_i + J_{Ns} = 0

(A4)

\frac{X_i}{Y_i} = \frac{W_m}{W_i}

(A5)

W_m = \frac{1}{\sum_{j = 1}^{Ns} \frac{Y_j}{W_j}} = \left[ \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \frac{Y_j}{W_j} + \frac{Y_{Ns}}{W_{Ns}} + \frac{Y_i}{W_i} \right]^{-1}

(A6)

ρ Y_i = ρ_i \quad (\text{ideal gas assumption})

(A7)

Here, Equations (A2) and (A3) define the realizability conditions for mass fractions and mole fractions, respectively. Equation (A4) specifies the realizability condition for the species diffusion flux. Equation (A5) establishes the relationship between the mole fraction and the mass fraction, where $W_m$ denotes the mean molecular weight of the gas mixture. Equation (A6) relates $W_m$ to the species mass fractions $Y_i$, and Equation (A7) provides the relationship between the mixture density and the species apparent densities in a computational cell, based on the ideal gas assumption. The index Ns denotes the last species, whose mass fraction is calculated using $Y_{Ns} = 1 - \sum_{j = 1}^{Ns - 1} Y_j$. The variables $Y_{Ns}$ and $J_{Ns}$ should be eliminated in the generalized Fick diffusion equation.

By expanding the left side of Equation (A1), we have:

\sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_i X_j}{D_{ij}} \left( \frac{J_j}{ρ_j} - \frac{J_i}{ρ_i} \right) = \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \left[ \frac{X_i X_j}{D_{ij}} \frac{J_j}{ρ_j} \right] + \frac{X_i X_{Ns}}{D_{i,Ns}} \frac{J_{Ns}}{ρ_{Ns}} - \sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \left[ \frac{X_i X_j}{D_{ij}} \frac{J_i}{ρ_i} \right]

(A8)

Here, Equation (A8) contains the species diffusion flux of the last species, which should be eliminated. By substituting Equation (A4) into Equation (A8), the second term on the right side of Equation (A8) can be rewritten as:

\frac{X_i X_{Ns}}{ρ_{Ns} D_{i,Ns}} J_{Ns} = - \frac{X_i X_{Ns}}{ρ_{Ns} D_{i,Ns}} \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} J_j - \frac{X_i X_{Ns}}{ρ_{Ns} D_{i,Ns}} J_i

(A9)

By substituting Equation (A9) into Equation (A8), we have:

\sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_i X_j}{D_{ij}} \left( \frac{J_j}{ρ_j} - \frac{J_i}{ρ_i} \right) = \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \left[ \frac{X_i X_j}{ρ_j D_{ij}} J_j \right] - \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \left[ \frac{X_i X_{Ns}}{ρ_{Ns} D_{i,Ns}} J_j \right] - \frac{X_i X_{Ns}}{ρ_{Ns} D_{i,Ns}} J_i - \sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \left[ \frac{X_i X_j}{ρ_i D_{ij}} J_i \right]

(A10)

By rearranging Equation (A10), we have:

\sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_i X_j}{D_{ij}} \left( \frac{J_j}{ρ_j} - \frac{J_i}{ρ_i} \right) = \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \left[ \left( \frac{X_i X_j}{ρ_j D_{ij}} - \frac{X_i X_{Ns}}{ρ_{Ns} D_{i,Ns}} \right) J_j \right] - \left( \frac{X_i X_{Ns}}{ρ_{Ns} D_{i,Ns}} + \sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \left[ \frac{X_i X_j}{ρ_i D_{ij}} \right] \right) J_i

(A11)

Finally, by substituting the relation $ρ Y_i = ρ_i$ (see Equation (A7)) into Equation (A11) and using Equation (A5) to replace each ratio $X_j / Y_j$ with $W_m / W_j$, we have:

\begin{aligned} \sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_i X_j}{D_{ij}} \left( \frac{J_j}{ρ_j} - \frac{J_i}{ρ_i} \right) &= \frac{1}{ρ} \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \left[ X_i \left( \frac{W_m}{D_{ij} W_j} - \frac{W_m}{D_{i,Ns} W_{Ns}} \right) J_j \right] \\ &\quad - \frac{1}{ρ} \left( \frac{X_i W_m}{W_{Ns} D_{i,Ns}} + \sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \left[ \frac{W_m X_j}{W_i D_{ij}} \right] \right) J_i \end{aligned}

(A12)

Note that the right side of Equation (A12) is a matrix–vector product. Let $J = \{J_1, J_2, \dots, J_{Ns - 1}\}$; then Equation (A12) can be rewritten in matrix form:

\sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_i X_j}{D_{ij}} \left( \frac{J_j}{ρ_j} - \frac{J_i}{ρ_i} \right) \Rightarrow \frac{1}{ρ} [A] J

(A13)

where matrix $[A]$ is defined as:

\begin{cases} A_{ii} = - \left( \frac{X_i}{D_{i,Ns}} \frac{W_m}{W_{Ns}} + \sum_{\substack{j = 1 \\ j \neq i}}^{Ns} \frac{X_j}{D_{ij}} \frac{W_m}{W_i} \right) \\ A_{ij} = X_i \left( \frac{1}{D_{ij}} \frac{W_m}{W_j} - \frac{1}{D_{i,Ns}} \frac{W_m}{W_{Ns}} \right), \quad i \neq j \end{cases}

(A14)
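The elimination of the last species can be checked numerically. The sketch below is not the authors' implementation: it assembles $[A]$ from Equation (A14) for a hypothetical four-species state and compares $\frac{1}{ρ}[A]J$ with the left side of Equation (A1) evaluated term by term. The species set, mass fractions, binary diffusivities, and fluxes are made-up illustrative values, and all function names are assumptions.

```python
import math


def mole_fractions(Y, W):
    """Return X_i = Y_i * W_m / W_i (Equation (A5)) and W_m (Equation (A6))."""
    Wm = 1.0 / sum(y / w for y, w in zip(Y, W))
    return [y * Wm / w for y, w in zip(Y, W)], Wm


def build_A(X, W, Wm, D):
    """Assemble the (Ns-1) x (Ns-1) matrix [A] of Equation (A14)."""
    n = len(X) - 1  # zero-based index of the eliminated last species
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                drag = sum(X[k] / D[i][k] for k in range(n + 1) if k != i)
                A[i][i] = -(X[i] / D[i][n] * Wm / W[n] + drag * Wm / W[i])
            else:
                A[i][j] = X[i] * (Wm / (D[i][j] * W[j]) - Wm / (D[i][n] * W[n]))
    return A


def maxwell_stefan_lhs(i, X, Y, D, J, rho):
    """Left side of Equation (A1) for species i, with rho_j = rho * Y_j (A7)."""
    return sum(
        X[i] * X[j] / D[i][j] * (J[j] / (rho * Y[j]) - J[i] / (rho * Y[i]))
        for j in range(len(X)) if j != i
    )


# Hypothetical state (not from the paper): H2, O2, H2O, N2.
W = [2.016, 31.998, 18.015, 28.014]   # molecular weights, kg/kmol
Y = [0.02, 0.20, 0.10, 0.68]          # mass fractions, sum to 1
rho = 1.0                             # mixture density, kg/m^3
D = [[0.0, 8.0e-5, 9.0e-5, 7.5e-5],   # made-up symmetric binary
     [8.0e-5, 0.0, 2.5e-5, 2.1e-5],   # diffusivities, m^2/s
     [9.0e-5, 2.5e-5, 0.0, 2.6e-5],   # (diagonal entries are never used)
     [7.5e-5, 2.1e-5, 2.6e-5, 0.0]]
J_indep = [1.0e-4, -3.0e-4, 0.5e-4]   # fluxes of the first Ns-1 species
J = J_indep + [-sum(J_indep)]         # last flux from Equation (A4)

X, Wm = mole_fractions(Y, W)
A = build_A(X, W, Wm, D)
lhs = [maxwell_stefan_lhs(i, X, Y, D, J, rho) for i in range(3)]
rhs = [sum(A[i][k] * J[k] for k in range(3)) / rho for i in range(3)]
max_mismatch = max(abs(a - b) for a, b in zip(lhs, rhs))
```

Here `lhs` uses all four fluxes, while `rhs` uses only the three independent ones, so agreement to rounding error confirms that Equations (A8) to (A14) remove $J_{Ns}$ consistently.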

Repeating the same operation for the thermo-diffusion term in Equation (A1) yields the same matrix $[A]$. Therefore, the Maxwell–Stefan equation can be written in matrix form:

\frac{1}{ρ} [A] J = \nabla X - \frac{\nabla T}{T} \frac{1}{ρ} [A] D_{T}

(A15)

Here, $X = \{X_1, X_2, \dots, X_{Ns - 1}\}$ and $D_T = \{D_{T,1}, D_{T,2}, \dots, D_{T,Ns - 1}\}$.

Matrix $[A]$ provides the drag effect, and matrix $[B]$ provides the thermodynamic effect between the mass fraction and the mole fraction [1]. With the sign convention of Equation (A20), this relation reads $\nabla X = - [B] \nabla Y$.

Next, we provide the derivation of matrix $[B]$. Using the product rule and Equation (A5), we can decompose the gradient of the mole fraction into the gradient of the mass fraction and the gradient of the mixture molecular weight:

\nabla X_{i} = \nabla [\frac{Y_{i}}{W_{i}} W_{m}] = \frac{W_{m}}{W_{i}} \nabla Y_{i} + \frac{Y_{i}}{W_{i}} \nabla W_{m}

(A16)

To further express $\nabla W_m$ in terms of the mass-fraction gradients, Equations (A6) and (A2) are required. $\nabla W_m$ can finally be expressed as:

\nabla W_m = - W_m^2 \left[ \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \left( \frac{1}{W_j} - \frac{1}{W_{Ns}} \right) \nabla Y_j + \left( \frac{1}{W_i} - \frac{1}{W_{Ns}} \right) \nabla Y_i \right]

(A17)

By substituting Equation (A17) into Equation (A16), and using Equation (A5) to convert $Y_i / W_i$ into $X_i / W_m$, the second term on the right side of Equation (A16) can be rewritten as:

\frac{Y_i}{W_i} \nabla W_m = - X_i \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \left( \frac{W_m}{W_j} - \frac{W_m}{W_{Ns}} \right) \nabla Y_j - \left( X_i \frac{W_m}{W_i} - X_i \frac{W_m}{W_{Ns}} \right) \nabla Y_i

(A18)

By substituting Equation (A18) into Equation (A16) and rearranging the equation, we have:

\nabla X_i = \left[ (1 - X_i) \frac{W_m}{W_i} + X_i \frac{W_m}{W_{Ns}} \right] \nabla Y_i - X_i \sum_{\substack{j = 1 \\ j \neq i}}^{Ns - 1} \left( \frac{W_m}{W_j} - \frac{W_m}{W_{Ns}} \right) \nabla Y_j

(A19)

Equation (A19) is a matrix–vector product of the form $\nabla X = - [B] \nabla Y$, where the matrix $[B]$ is defined as:

\begin{cases} B_{ii} = - \left( X_i \frac{W_m}{W_{Ns}} + (1 - X_i) \frac{W_m}{W_i} \right) \\ B_{ij} = X_i \left( \frac{W_m}{W_j} - \frac{W_m}{W_{Ns}} \right), \quad i \neq j \end{cases}

(A20)
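With $[B]$ defined as in Equation (A20), Equation (A19) states that $\nabla X = - [B] \nabla Y$ for the Ns − 1 independent species. A finite-difference check of this relation is sketched below; it is an illustration under assumed inputs, not part of the authors' code. The molecular weights, base composition, perturbation direction, and step size are hypothetical, and a small directional perturbation of $Y$ stands in for the spatial gradient.

```python
import math


def mole_fractions(Y_indep, W):
    """Mole fractions of all Ns species from the Ns-1 independent mass fractions."""
    Y = list(Y_indep) + [1.0 - sum(Y_indep)]        # last species closes the sum (A2)
    Wm = 1.0 / sum(y / w for y, w in zip(Y, W))     # Equation (A6)
    return [y * Wm / w for y, w in zip(Y, W)], Wm   # Equation (A5)


def build_B(X, W, Wm):
    """Assemble the (Ns-1) x (Ns-1) matrix [B] of Equation (A20)."""
    n = len(W) - 1  # zero-based index of the eliminated last species
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                B[i][i] = -(X[i] * Wm / W[n] + (1.0 - X[i]) * Wm / W[i])
            else:
                B[i][j] = X[i] * (Wm / W[j] - Wm / W[n])
    return B


# Hypothetical state (not from the paper): H2, O2, H2O, with N2 as last species.
W = [2.016, 31.998, 18.015, 28.014]  # molecular weights, kg/kmol
Y0 = [0.02, 0.20, 0.10]              # independent mass fractions
dY = [1.0, -0.5, 0.25]               # arbitrary direction standing in for grad(Y)
h = 1.0e-6                           # finite-difference step

X0, Wm0 = mole_fractions(Y0, W)
B = build_B(X0, W, Wm0)

# Central difference of X along dY, playing the role of grad(X).
X_plus, _ = mole_fractions([y + h * d for y, d in zip(Y0, dY)], W)
X_minus, _ = mole_fractions([y - h * d for y, d in zip(Y0, dY)], W)
dX_fd = [(X_plus[i] - X_minus[i]) / (2.0 * h) for i in range(3)]

# Prediction from Equation (A19) written as grad(X) = -[B] grad(Y).
dX_B = [-sum(B[i][j] * dY[j] for j in range(3)) for i in range(3)]
```

The two vectors `dX_fd` and `dX_B` should agree up to the truncation error of the central difference, which also makes the minus sign in front of $[B]$ in Equation (A21) explicit.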

Let $Y = \{Y_1, Y_2, \dots, Y_{Ns - 1}\}$, and substitute Equation (A19) into Equation (A15). Equation (A15) can be rewritten as:

\frac{1}{ρ} [A] J = - [B] \nabla Y - \frac{\nabla T}{T} \frac{1}{ρ} [A] D_T

(A21)

Multiplying both sides of Equation (A21) by the inverse of $[A]$ and by the mixture density, we finally obtain the generalized Fick diffusion equation:

J = - ρ [A]^{-1} [B] \nabla Y - \frac{\nabla T}{T} D_T

(A22)

Appendix B. Numerical Validation and Performance Analysis of Matrix Inversion Using the Intel Math Kernel Library

To further verify and evaluate the optimized implementation proposed in this work, Intel MKL from Intel oneAPI 2022.1 is employed as a reference for the matrix inversion strategy used in this study, which involves LU decomposition followed by LU back-substitution. The sizes of the square matrices range from 16 to 1024, and the results are presented in Table A1.

The evaluation is conducted using the L2 error, the relative L2 error, the maximum absolute error of matrix element, and the maximum relative error of matrix element, which are defined as follows:

ε_{L2} = \sqrt{\sum_{i = 1}^{Ns \times Ns} (a_i - a_i^{\mathrm{mkl}})^2}

(A23)

ε_{\mathrm{rel},L2} = \frac{\sqrt{\sum_{i = 1}^{Ns \times Ns} (a_i - a_i^{\mathrm{mkl}})^2}}{\sqrt{\sum_{i = 1}^{Ns \times Ns} (a_i^{\mathrm{mkl}})^2}}

(A24)

\mathrm{Max}\, ε_{\mathrm{abs}} = \max_i |a_i - a_i^{\mathrm{mkl}}|, \quad i = 1, \dots, Ns \times Ns

(A25)

\mathrm{Max}\, ε_{\mathrm{rel}} = \max_i \frac{|a_i - a_i^{\mathrm{mkl}}|}{|a_i^{\mathrm{mkl}}|}, \quad i = 1, \dots, Ns \times Ns

(A26)
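The four metrics in Equations (A23) to (A26) can be evaluated on flattened matrices as in the following sketch. This is an illustrative reimplementation, not the unit-test code from the repository; the function name `inversion_error_metrics` and the sample values are assumptions.

```python
import math


def inversion_error_metrics(a, a_ref):
    """Compare a flattened inverse `a` against a flattened reference `a_ref`.

    Returns the L2 error (A23), relative L2 error (A24), maximum absolute
    element error (A25), and maximum relative element error (A26).
    """
    diffs = [x - r for x, r in zip(a, a_ref)]
    l2 = math.sqrt(sum(d * d for d in diffs))
    ref_norm = math.sqrt(sum(r * r for r in a_ref))
    return {
        "l2": l2,
        "rel_l2": l2 / ref_norm,
        "max_abs": max(abs(d) for d in diffs),
        # Equation (A26) has no regularization term, so every reference
        # element must be nonzero for this ratio to be defined.
        "max_rel": max(abs(d) / abs(r) for d, r in zip(diffs, a_ref)),
    }


# Hypothetical 2 x 2 inverse stored row by row; one element deviates.
reference = [1.0, 2.0, -4.0, 0.5]
candidate = [1.0, 2.0, -4.0, 1.0]
metrics = inversion_error_metrics(candidate, reference)
```

Because the relative L2 error is normalized by the whole reference matrix while the maximum relative error is normalized element by element, a single small reference entry can dominate the latter, which is one plausible reading of why the last error column of Table A1 is several orders of magnitude larger than the relative L2 column.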

When calling Intel MKL, a single-core, single-thread mode is used. The matrix inversion is performed using the dgetrf and dgetri routines with the AVX2 instruction set enabled. All other settings are consistent with those described in Section 4 (Case Setup) of the main text. The unit test code and corresponding results are available in the authors' GitHub repository.

The computational results are summarized in Table A1. It can be observed that the inverse matrices obtained using the optimized implementation in this work agree very closely with those computed using MKL routines. Even when the matrix size increases to 1024, the maximum relative error remains as low as 3.43 × 10⁻¹⁰, demonstrating the high numerical accuracy of the proposed method.

In terms of computational time, the proposed implementation is faster than MKL for matrix sizes smaller than 64 × 64. For matrices of size 64 × 64, both approaches exhibit comparable performance. For larger matrices, the more sophisticated optimization strategies in MKL lead to superior performance. In 3D LES combustion simulations with hydrogen fuel or ammonia/hydrogen fuel, researchers tend to employ reduced chemical mechanisms (fewer than 64 species) to avoid excessive computational cost; in this range, the optimized implementation proposed in this work is the faster option.

Table A1. The relative error between MKL and the implementation in this work, and the average execution time (seconds) of the matrix inversion routine for different matrix sizes.

Size	$ε_{L2}$	$ε_{\mathrm{rel},L2}$	$\mathrm{Max}\, ε_{\mathrm{abs}}$	$\mathrm{Max}\, ε_{\mathrm{rel}}$	τ (this work, s)	τ (MKL, s)
16	2.29 × 10⁻¹⁸	2.27 × 10⁻¹⁶	1.73 × 10⁻¹⁸	1.36 × 10⁻¹⁵	1.18 × 10⁻⁶	4.17 × 10⁻⁶
32	4.90 × 10⁻¹⁸	2.44 × 10⁻¹⁶	3.47 × 10⁻¹⁸	1.42 × 10⁻¹⁵	5.92 × 10⁻⁶	1.31 × 10⁻⁵
64	8.69 × 10⁻¹⁸	3.19 × 10⁻¹⁶	3.47 × 10⁻¹⁸	6.00 × 10⁻¹³	4.90 × 10⁻⁵	4.79 × 10⁻⁵
128	1.59 × 10⁻¹⁷	5.07 × 10⁻¹⁶	6.07 × 10⁻¹⁸	1.54 × 10⁻¹²	4.15 × 10⁻⁴	2.10 × 10⁻⁴
256	4.56 × 10⁻¹⁷	9.06 × 10⁻¹⁶	1.56 × 10⁻¹⁷	1.62 × 10⁻¹²	4.06 × 10⁻³	1.08 × 10⁻³
512	8.36 × 10⁻¹⁷	1.17 × 10⁻¹⁵	1.99 × 10⁻¹⁷	1.75 × 10⁻¹⁰	4.97 × 10⁻²	7.94 × 10⁻³
1024	1.74 × 10⁻¹⁶	1.73 × 10⁻¹⁵	3.47 × 10⁻¹⁷	3.43 × 10⁻¹⁰	4.48 × 10⁻¹	4.28 × 10⁻²

Appendix C. Turbulent Kinetic Energy Resolution for Non-Premixed Jet Flame

This appendix presents the resolved turbulent kinetic energy fraction for the non-premixed jet flame. The turbulent kinetic energy resolution is defined as

γ = \frac{k_{\mathrm{res}}}{k_{\mathrm{res}} + k_{\mathrm{sgs}}}

(A27)

where the resolved turbulent kinetic energy $k_{\mathrm{res}}$ is obtained from statistical analysis of the velocity field, and the subgrid-scale turbulent kinetic energy $k_{\mathrm{sgs}}$ is obtained from the subgrid-scale model. To obtain statistically meaningful distributions of turbulent kinetic energy, approximately five flow-through times are sampled in this case. The final results are shown in Figure A1. It can be observed that the velocity field has been well averaged, while some statistical uncertainty remains only in the far downstream region. As shown in Figure A1b, the turbulent kinetic energy resolution exceeds 80% in both the upstream shear layer and the fully developed downstream region.

Figure A1. (a) The time-averaged velocity magnitude field of NH3/H2; (b) The turbulent kinetic energy (TKE) resolution field.

References

1. Krishna, R.; Wesselingh, J.A. The Maxwell-Stefan Approach to Mass Transfer. Chem. Eng. Sci. 1997, 52, 861–911.
2. Aziaba, K.; Jordan, C.; Haddadi, B. Design of a Gas Permeation and Pervaporation Membrane Model Based on Maxwell–Stefan Diffusion. Membranes 2022, 12, 1186.
3. Rios, W.Q.; Antunes, B.; Rodrigues, A.E.; Portugal, I.; Silva, C.M. Accurate Effective Diffusivities in Multicomponent Systems. Processes 2022, 10, 2042.
4. Kumar, V.K.; Gholamalian, F.; Kalyvas, C.; Ghassemi, M.; Chizari, M. Modelling Mass Transport in Anode-Supported Solid Oxide Fuel Cells Using a Multi-Component Diffusion Model Based on Stefan–Maxwell Formulation. Electronics 2025, 14, 3486.
5. Veerman, J. A Maxwell–Stefan Approach to Ion and Water Transport in a Reverse Electrodialysis Stack. Processes 2024, 12, 1407.
6. Cheng, C.; Ding, W.; Shen, J.; Liao, P.; Yu, C.; Miao, B.; Zhou, Y.; Li, H.; Zhang, H.; Zhong, Z. Multi-Physics Coupling Simulation of H₂O–CO₂ Co-Electrolysis Using a Multi-Component Maxwell–Stefan Diffusion Model. Processes 2025, 13, 3192.
7. Bizon, K.; Boroń, D.; Tabiś, B. Assessment and Discussion of the Steady-State Determination in Zeolite Composite Membranes for Multi-Component Diffusion. Membranes 2025, 15, 301.
8. Fillo, A.J.; Schlup, J.; Blanquart, G.; Niemeyer, K.E. Assessing the Impact of Multicomponent Diffusion in Direct Numerical Simulations of Premixed, High-Karlovitz, Turbulent Flames. Combust. Flame 2021, 223, 216–229.
9. Chi, C.; Thévenin, D. Impact of Different Diffusion Models on NO Production in NH₃/H₂/Air Turbulent Flames Using DNS. In Proceedings of the Global Power and Propulsion Society (GPPS) Xi’an21 Conference; Global Power and Propulsion Society: Zug, Switzerland, 2022; p. GPPS-TC-2021-3.
10. Chi, C.; Han, W.; Thévenin, D. Effects of Molecular Diffusion Modeling on Turbulent Premixed NH3/H2/Air Flames. Proc. Combust. Inst. 2023, 39, 2259–2268.
11. Adam, A.; Abdulnaim, A.; Kai, R.; Watanabe, H. Differential Diffusion Effect on NH3/H2 Non-Premixed Turbulent Flame Structure and Chemical Kinetics. Int. J. Hydrogen Energy 2025, 102, 20–28.
12. Dworkin, S.B.; Smooke, M.D.; Giovangigli, V. The Impact of Detailed Multicomponent Transport and Thermal Diffusion Effects on Soot Formation in Ethylene/Air Flames. Proc. Combust. Inst. 2009, 32, 1165–1172.
13. Lee, H.C.; Dai, P.; Wan, M.; Lipatnikov, A.N. Influence of Molecular Transport on Burning Rate and Conditioned Species Concentrations in Highly Turbulent Premixed Flames. J. Fluid Mech. 2021, 928, A5.
14. Xin, Y.; Liang, W.; Liu, W.; Lu, T.; Law, C.K. A Reduced Multicomponent Diffusion Model. Combust. Flame 2015, 162, 68–74.
15. Ambikasaran, S.; Narayanaswamy, K. An Accurate, Fast, Mathematically Robust, Universal, Non-Iterative Algorithm for Computing Multi-Component Diffusion Velocities. Proc. Combust. Inst. 2017, 36, 507–515.
16. Fillo, A.J.; Schlup, J.; Beardsell, G.; Blanquart, G.; Niemeyer, K.E. A Fast, Low-Memory, and Stable Algorithm for Implementing Multicomponent Transport in Direct Numerical Simulations. J. Comput. Phys. 2020, 406, 109185.
17. Naud, B.; Arias-Zugasti, M. Accurate Multicomponent Fick Diffusion at a Lower Cost than Mixture-Averaged Approximation: Validation in Steady and Unsteady Counterflow Flamelets. Combust. Flame 2020, 219, 120–128.
18. Kee, R.J.; Coltrin, M.E.; Glarborg, P. Chemically Reacting Flow: Theory and Practice, 1st ed.; Wiley: Hoboken, NJ, USA, 2003.
19. ANSYS Fluent; ANSYS Inc.: Canonsburg, PA, USA, 2024.
20. OpenFOAM Foundation. OpenFOAM: The Open Source CFD Toolbox, Version 10; OpenFOAM Foundation: London, UK, 2022.
21. Zhang, X.; Moosakutty, S.P.; Rajan, R.P.; Younes, M.; Sarathy, S.M. Combustion Chemistry of Ammonia/Hydrogen Mixtures: Jet-Stirred Reactor Measurements and Comprehensive Kinetic Modeling. Combust. Flame 2021, 234, 111653.
22. Chi, Z.; Hui, X.; Ji, K.; Wang, B. A High-Performance Chemistry Solver for Reactive Flow Simulations in OpenFOAM. Phys. Fluids 2025, 37, 125154.
23. Stagni, A.; Cavallotti, C.; Arunthanayothin, S.; Song, Y.; Herbinet, O.; Battin-Leclerc, F.; Faravelli, T. An Experimental, Theoretical and Kinetic-Modeling Study of the Gas-Phase Oxidation of Ammonia. React. Chem. Eng. 2020, 5, 696–711.
24. Okafor, E.C.; Naito, Y.; Colson, S.; Ichikawa, A.; Kudo, T.; Hayakawa, A.; Kobayashi, H. Measurement and Modelling of the Laminar Burning Velocity of Methane-Ammonia-Air Flames at High Pressures Using a Reduced Reaction Mechanism. Combust. Flame 2019, 204, 162–175.
25. Okafor, E.C.; Naito, Y.; Colson, S.; Ichikawa, A.; Kudo, T.; Hayakawa, A.; Kobayashi, H. Experimental and Numerical Study of the Laminar Burning Velocity of CH₄–NH₃–Air Premixed Flames. Combust. Flame 2018, 187, 185–198.
26. Xu, L.; Chang, Y.; Treacy, M.; Zhou, Y.; Jia, M.; Bai, X.-S. A Skeletal Chemical Kinetic Mechanism for Ammonia/n-Heptane Combustion. Fuel 2023, 331, 125830.
27. Zhang, X.; Shen, D.; Meng, X.; Bi, M. A Comprehensive Kinetic Analysis of Auto-Ignition Explosion Characteristics of DME/NH₃/H₂/O₂ Blends and Dilution Effects on H₂/O₂. Energy 2025, 319, 135060.
28. Shrestha, K.P.; Lhuillier, C.; Barbosa, A.A.; Brequigny, P.; Contino, F.; Mounaïm-Rousselle, C.; Seidel, L.; Mauss, F. An Experimental and Modeling Study of Ammonia with Enriched Oxygen Content and Ammonia/Hydrogen Laminar Flame Speed at Elevated Pressure and Temperature. Proc. Combust. Inst. 2021, 38, 2163–2174.
29. Lubrano Lavadera, M.; Brackmann, C.; Konnov, A.A. Experimental and Modeling Study of Laminar Burning Velocities and Nitric Oxide Formation in Premixed Ethylene/Air Flames. Proc. Combust. Inst. 2021, 38, 395–404.
30. Tang, H.; Yang, C.; Wang, G.; Krishna, Y.; Guiberti, T.F.; Roberts, W.L.; Magnotti, G. Scalar Structure in Turbulent Non-Premixed NH3/H2/N2 Jet Flames at Elevated Pressure Using Raman Spectroscopy. Combust. Flame 2022, 244, 112292.
31. Cai, X.; Fan, Q.; Bai, X.-S.; Wang, J.; Zhang, M.; Huang, Z.; Alden, M.; Li, Z. Turbulent Burning Velocity and Its Related Statistics of Ammonia-hydrogen-air Jet Flames at High Karlovitz Number: Effect of Differential Diffusion. Proc. Combust. Inst. 2023, 39, 4215–4226.
32. Nicoud, F.; Ducros, F. Subgrid-scale Stress Modelling Based on the Square of the Velocity Gradient Tensor. Flow Turbul. Combust. 1999, 62, 183–200.
33. Abdelwahid, S.; Malik, M.R.; Al Kader Hammoud, H.A.; Hernández-Pérez, F.E.; Ghanem, B.; Im, H.G. Large Eddy Simulations of Ammonia-Hydrogen Jet Flames at Elevated Pressure Using Principal Component Analysis and Deep Neural Networks. Combust. Flame 2023, 253, 112781.
Xu, L.; Fan, Q.; Liu, X.; Cai, X.; Subash, A.A.; Brackmann, C.; Li, Z.; Aldén, M.; Bai, X.-S. Flame/Turbulence Interaction in Ammonia/Air Premixed Flames at High Karlovitz Numbers. Proc. Combust. Inst. 2023, 39, 2289–2298. [Google Scholar] [CrossRef]
Kornev, N.; Kröger, H.; Hassel, E. Synthesis of Homogeneous Anisotropic Turbulent Fields with Prescribed Second-order Statistics by the Random Spots Method. Commun. Numer. Meth. Eng. 2008, 24, 875–877. [Google Scholar] [CrossRef]

Figure 1. (a) Diagram of standard data redistribution; (b) The corresponding C++ code implementation in OpenFOAM-10. Different colors represent different binary diffusion coefficients and the arrows denote the sequential data mapping process: extraction of matrix data into the fields.

Figure 2. (a) Diagram of optimized data redistribution; (b) The corresponding C++ code implementation in the optimized module. Different colors represent different binary diffusion coefficients, and the arrows denote the sequential data mapping process: (1) extraction of matrix data into __m256d vector registers for SIMD optimization, followed by (2) transfer of the vectorized data to the field storage. The red dashed frame marks the write of the __m256d vector into the field data, where four data elements are stored simultaneously in a single SIMD operation.
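
The four-at-a-time store described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `copyCoefficients`, the use of flat `std::vector<double>` buffers, and the assumption that source and destination elements are contiguous in memory are all hypothetical. When AVX is enabled at compile time, the sketch uses the unaligned load and store intrinsics `_mm256_loadu_pd` and `_mm256_storeu_pd`; otherwise it falls back to a scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: copy coefficients from a per-cell matrix buffer (src)
// into field storage (dst). Both buffers are assumed to be contiguous.
inline void copyCoefficients(const std::vector<double>& src,
                             std::vector<double>& dst)
{
    assert(dst.size() >= src.size());
    const std::size_t n = src.size();
    std::size_t i = 0;
#if defined(__AVX__)
    // Main loop: one __m256d register holds four doubles, so each iteration
    // moves four coefficients with a single store instruction.
    for (; i + 4 <= n; i += 4)
    {
        const __m256d v = _mm256_loadu_pd(src.data() + i);
        _mm256_storeu_pd(dst.data() + i, v);
    }
#endif
    // Remainder loop (and the whole copy when AVX is not enabled).
    for (; i < n; ++i)
    {
        dst[i] = src[i];
    }
}
```

The remainder loop matters whenever the number of coefficients is not a multiple of four, which is the common case for species counts such as 31 or 77.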

Figure 3. (a) Diagram of standard matrix inversion; (b) The corresponding C++ code implementation in OpenFOAM-10. Different colors represent elements from different matrices; the red arrow indicates data being written to matrix $[A]^{-1}$, and the red dashed box denotes the write position.

Figure 4. (a) Diagram of standard matrix multiplication; (b) The corresponding C++ code implementation in OpenFOAM-10. Red dashed line frame represents the current elements from matrices $[A]^{-1}$ and $[B]$ being multiplied. Red arrow indicates the next elements to be accessed from matrices $[A]^{-1}$ and $[B]$ in the multiplication process. Blue arrow represents the computation sequence of elements in matrix $[D]$.

Figure 5. (a) Diagram of optimized matrix multiplication; (b) The corresponding C++ code implementation in the optimized routine. Red dashed line frame represents the current sub-block from matrices $[A]^{-1}$ and $[B]$ being multiplied. Red arrow indicates the next sub-block to be accessed from matrices $[A]^{-1}$ and $[B]$ in the multiplication process. Blue arrow represents the computation sequence of the sub-block in matrix $[D]$.
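
The sub-block traversal described in the Figure 5 caption can be illustrated with a generic cache-blocked product $[D] = [A]^{-1}[B]$. The sketch below is an assumption-laden example rather than the optimized routine itself: the function name `blockedMultiply`, the row-major flat storage, and the default block size of 4 are hypothetical choices made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// D = Ainv * B for n x n row-major matrices, computed sub-block by sub-block.
// The block size bs is an illustrative value, not one taken from the paper.
inline std::vector<double> blockedMultiply(const std::vector<double>& Ainv,
                                           const std::vector<double>& B,
                                           std::size_t n,
                                           std::size_t bs = 4)
{
    assert(bs > 0 && Ainv.size() == n * n && B.size() == n * n);
    std::vector<double> D(n * n, 0.0);

    // Outer loops walk over sub-blocks; std::min handles edge blocks when
    // n is not a multiple of bs.
    for (std::size_t ii = 0; ii < n; ii += bs)
    for (std::size_t kk = 0; kk < n; kk += bs)
    for (std::size_t jj = 0; jj < n; jj += bs)
    {
        const std::size_t iEnd = std::min(ii + bs, n);
        const std::size_t kEnd = std::min(kk + bs, n);
        const std::size_t jEnd = std::min(jj + bs, n);

        // Inner loops multiply one sub-block of Ainv by one sub-block of B
        // and accumulate into the matching sub-block of D.
        for (std::size_t i = ii; i < iEnd; ++i)
        for (std::size_t k = kk; k < kEnd; ++k)
        {
            const double a = Ainv[i * n + k];
            for (std::size_t j = jj; j < jEnd; ++j)
            {
                D[i * n + j] += a * B[k * n + j];
            }
        }
    }
    return D;
}
```

Keeping the innermost loop over `j` gives unit-stride access to both `B` and `D`, which is the access pattern that compilers can vectorize most readily.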

Figure 6. (a) Diagram of non-premixed NH₃/H₂/N₂ jet flame at elevated pressure; (b) Diagram of piloted premixed NH₃/H₂/Air jet flame at atmospheric pressure.

Figure 7. Reactive scalar distribution comparison between standard and optimized Maxwell–Stefan diffusion model implementations.

Figure 8. Mean execution time of standard and optimized Maxwell–Stefan diffusion implementations for non-premixed NH₃/H₂/N₂ jet flame.

Figure 9. Mean execution time of standard and optimized Maxwell–Stefan diffusion implementations for premixed NH₃/H₂/Air jet flame.

Table 1. The overhead of different routines in the standard Maxwell–Stefan implementation in OpenFOAM-10.

Overhead	Routines
37.53%	transformDiffusionCoefficientFields
20.11%	LUBacksubstitute
17.67%	MatrixMultiplation
5.90%	LUDecompose
4.07%	2D interpolation of binary diffusion coefficient

Table 3. Statistics of element-wise relative errors of the generalized Fick diffusion coefficient matrix at the 54th cell in the counterflow flame.

Number of Species	75% Quantile	99% Quantile	Maximum Value
31	2.91 × 10⁻¹⁵	2.06 × 10⁻¹³	1.21 × 10⁻¹²
38	2.57 × 10⁻¹⁵	1.44 × 10⁻¹³	4.38 × 10⁻¹²
42	2.50 × 10⁻¹⁵	4.59 × 10⁻¹⁴	2.48 × 10⁻¹²
59	2.39 × 10⁻¹⁵	3.58 × 10⁻¹⁴	1.00 × 10⁻¹²
69	1.91 × 10⁻¹⁵	4.68 × 10⁻¹⁴	4.16 × 10⁻¹²
77	2.06 × 10⁻¹⁵	3.79 × 10⁻¹⁴	7.36 × 10⁻¹²
125	1.52 × 10⁻¹⁵	3.39 × 10⁻¹⁴	4.64 × 10⁻¹²
129	1.74 × 10⁻¹⁵	3.25 × 10⁻¹⁴	5.81 × 10⁻¹²

Table 4. Statistics of element-wise relative errors of the generalized Fick diffusion coefficient matrix at the 24th cell in the counterflow flame.

Number of Species	75% Quantile	99% Quantile	Maximum Value
31	3.93 × 10⁻¹⁵	6.42 × 10⁻¹⁴	6.21 × 10⁻¹³
38	2.64 × 10⁻¹⁵	6.09 × 10⁻¹⁴	2.65 × 10⁻¹²
42	3.45 × 10⁻¹⁵	7.23 × 10⁻¹⁴	2.17 × 10⁻¹²
59	2.59 × 10⁻¹⁵	6.22 × 10⁻¹⁴	8.53 × 10⁻¹³
69	1.86 × 10⁻¹⁵	3.79 × 10⁻¹⁴	3.83 × 10⁻¹²
77	3.06 × 10⁻¹⁵	7.33 × 10⁻¹⁴	1.54 × 10⁻¹¹
125	1.29 × 10⁻¹⁵	3.33 × 10⁻¹⁴	8.85 × 10⁻¹²
129	1.30 × 10⁻¹⁵	3.48 × 10⁻¹⁴	7.75 × 10⁻¹⁰

Table 5. Statistics of element-wise relative errors of the generalized Fick diffusion coefficient matrix at the 0th cell in the counterflow flame.

Number of Species	75% Quantile	99% Quantile	Maximum Value
31	5.39 × 10⁻¹⁵	2.65 × 10⁻¹⁰	3.98 × 10⁻⁹
38	4.00 × 10⁻¹⁵	1.19 × 10⁻¹⁰	1.05 × 10⁻⁹
42	7.16 × 10⁻¹⁵	1.13 × 10⁻⁷	2.49 × 10⁻⁶
59	2.90 × 10⁻¹⁵	7.41 × 10⁻¹⁴	2.67 × 10⁻¹²
69	3.76 × 10⁻¹⁵	1.72 × 10⁻⁷	9.87 × 10⁻⁶
77	3.46 × 10⁻¹⁵	2.49 × 10⁻¹⁰	1.11 × 10⁻⁸
125	3.30 × 10⁻¹⁵	1.76 × 10⁻¹³	7.83 × 10⁻⁵
129	2.44 × 10⁻¹⁵	8.63 × 10⁻¹⁴	3.45 × 10⁻⁶
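
Statistics of the kind reported in Tables 3 to 5 could be computed along the following lines. This is a sketch only: the definition of the relative error as $|D^{\mathrm{opt}}_{ij} - D^{\mathrm{ref}}_{ij}| / |D^{\mathrm{ref}}_{ij}|$, the skipping of zero reference entries, and the nearest-rank quantile rule are assumptions, since this section does not state which definitions were used. The function names `relativeErrors` and `quantile` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise relative error |opt - ref| / |ref|.
// Entries with ref == 0 are skipped (an assumption of this sketch).
inline std::vector<double> relativeErrors(const std::vector<double>& ref,
                                          const std::vector<double>& opt)
{
    assert(ref.size() == opt.size());
    std::vector<double> err;
    err.reserve(ref.size());
    for (std::size_t i = 0; i < ref.size(); ++i)
    {
        if (ref[i] != 0.0)
        {
            err.push_back(std::abs(opt[i] - ref[i]) / std::abs(ref[i]));
        }
    }
    return err;
}

// Nearest-rank quantile: the smallest sample such that at least a fraction q
// of all samples is less than or equal to it. q = 1 returns the maximum.
inline double quantile(std::vector<double> values, double q)
{
    assert(!values.empty() && q > 0.0 && q <= 1.0);
    std::sort(values.begin(), values.end());
    const std::size_t rank = static_cast<std::size_t>(
        std::ceil(q * static_cast<double>(values.size())));
    return values[rank - 1];
}
```

With these helpers, the three table columns would correspond to `quantile(err, 0.75)`, `quantile(err, 0.99)`, and `quantile(err, 1.0)` applied to the flattened coefficient matrices of the standard and optimized implementations.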

Table 6. Mean execution time (in seconds) and speedup of standard and optimized Maxwell–Stefan diffusion implementations.

Number of Species	Standard	Optimized	Optimized (MKL)
31	0.43185	0.17303	0.24956
38	0.82547	0.32650	0.43289
42	0.99545	0.36815	0.45838
59	2.27939	0.82173	0.81996
69	3.47764	1.23635	1.16933
77	5.3393	1.39736	1.28829
125	20.42961	5.14657	3.65615
129	22.54269	5.01626	3.59296
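
The caption of Table 6 mentions speedup, which can be recovered from the listed timings as the ratio of the standard time to the optimized time. The short sketch below does this arithmetic; the struct `TimingRow` and the functions `table6`, `speedup`, and `printSpeedups` are hypothetical names, and the numbers are copied from the table rows above.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One row of Table 6: mean execution times in seconds.
struct TimingRow
{
    int nSpecies;
    double standard;
    double optimized;
    double optimizedMkl;
};

// Timings copied from Table 6.
inline std::vector<TimingRow> table6()
{
    return {
        {31, 0.43185, 0.17303, 0.24956},
        {38, 0.82547, 0.32650, 0.43289},
        {42, 0.99545, 0.36815, 0.45838},
        {59, 2.27939, 0.82173, 0.81996},
        {69, 3.47764, 1.23635, 1.16933},
        {77, 5.3393, 1.39736, 1.28829},
        {125, 20.42961, 5.14657, 3.65615},
        {129, 22.54269, 5.01626, 3.59296},
    };
}

// Speedup defined as baseline time divided by candidate time.
inline double speedup(double baseline, double candidate)
{
    assert(baseline > 0.0 && candidate > 0.0);
    return baseline / candidate;
}

// Print the speedup of both optimized variants for every mechanism size.
inline void printSpeedups()
{
    for (const TimingRow& r : table6())
    {
        std::printf("%3d  %.2f  %.2f\n", r.nSpecies,
                    speedup(r.standard, r.optimized),
                    speedup(r.standard, r.optimizedMkl));
    }
}
```

For example, the 31-species case gives 0.43185 / 0.17303 ≈ 2.50 and the 129-species case gives 22.54269 / 5.01626 ≈ 4.49, consistent with the 2.5×–4.5× range quoted in the abstract.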
