A Neural Network Inspired Formulation of Chemical Kinetics

A method which casts the chemical source term computation into an artiﬁcial neural network (ANN)- inspired form is presented. This approach is well-suited for use on emerging supercomputing platforms that rely on graphical processing units (GPUs). The resulting equations allow for a GPU-friendly matrix-multiplication based source term estimation where the leading dimension (batch size) can be interpreted as the number of chemically reacting cells in the domain; as such, the approach can be readily adapted in high-ﬁdelity solvers for which an MPI rank oﬄoads the source term computation task for a given number of cells to the GPU. Though the exact ANN-inspired recasting shown here is optimal for GPU environments as-is, this interpretation allows the user to replace portions of the exact routine with trained, so-called approximate ANNs, where the goal of these approximate ANNs is to increase computational eﬃciency over the exact routine counterparts. Note that the main objective of this paper is not to use machine learning for developing models, but rather to represent chemical kinetics using the ANN framework. The end result is that little-to-no training is needed, and the GPU-friendly structure of the ANN formulation during the source term computation is preserved. The method is demonstrated using chemical mechanisms of varying complexity on both 0-D auto-ignition and 1-D channel detonation problems, and the details of performance on GPUs are explored.


Introduction
Computational modeling is an integral component of the design of modern combustion devices.While there has been considerable growth in the physical understanding and the development of reliable yet computationally efficient models [1,2], representation of complex chemical kinetics remains both a computational and modeling challenge.In particular, the use of detailed mechanisms that involve a hundred or more species and an even larger number of reactions in a turbulent flow configuration still remains out of reach [3].While progress towards their use in canonical flows has been reported [4], their use in simulation of complex geometries is still limited.When modeling turbulent combustion, manifold methods have overcome this computational issue by representing multi-step kinetics using a reduced-set of tracking variables such as mixture fraction and progress variable [5].However, other combustion models such as the transported probability density function (PDF) approach [6,7] or the linear-eddy model [8] require detailed chemistry to be directly evolved.In this regard, methods and algorithms that allow detailed chemical processes to be included in such approaches are a critical requirement.
In the past, several approaches have been used to accelerate chemical source term computations.These include tabulation methods such as in-situ adaptive tabulation (ISAT) [9] and the PRISM [10] approach.In these methods, the computationally expensive numerical integration of chemical source terms, which can be cast as a set of ordinary differential equations (ODEs), is replaced by a look-up table.In particular, ISAT builds a trust region in thermochemical composition space using a set of ellipsoids determined by the Jacobian of the source term.However, the cost of building and accessing such tables can become expensive, especially on modern high performance computers that are memory-limited and use extensive concurrency of computations to reach high throughput efficiency.An alternative approach, which is also the focus here, is based on artificial neural networks (ANNs) [11][12][13][14][15]. Before discussing the specifics of the ANN for kinetics, it is necessary to describe a parallel trend in computing hardware.
The use of ANNs has driven the overall revolution of data sciences.A critical enabling tool for ANNs has been the development of hardware for machine learning.Due to the large application scope of artificial intelligence and machine learning, modern high-performance computing (HPC) revolves around the usage of graphics processing units (GPUs) or similar accelerators whose architectures enable fast algorithm execution in single-instruction, multiple-thread (SIMT) environments [16].Additionally, the compute power of modern day HPCs is increasingly being dominated by GPUs due to their power efficiency and high theoretical peak performance.It is has become crucial for the CFD community to adapt to these changes, though a central issue revolves around the re-interpretation and re-design of traditional algorithms that have been around for decades into a GPU-optimal scope [17].In general, GPUs operate differently from CPUs, requiring algorithmic implementations to be vastly altered in order to leverage their specific hardware architecture.To this end, approaches for GPU-offloading of kinetics have been explored in detail in recent years to good success [18][19][20], and their implementation into high-fidelity parallel solvers has also been demonstrated [21].These approaches traditionally rely on translation of the exact equations for kinetics and time-integration methods into the GPU environment.
However, given that ANN libraries that take advantage of specific GPU capabilities already exist, ensuring a readily interpretable and accurate ANN representation of chemistry will allow kinetics calculations to be performed efficiently.The focus of this work is to develop such an ANN-based framework for GPUs.To provide context, prior use of ANNs for chemical kinetics computations can be placed into two categories, each providing a different ANN representation and levels of interpretability.The first is to replace both the source term computation and the time integration step with a single trained ANN that takes thermochemical state as input and outputs the same state at a future time step (ANN as a time integrator) [22][23][24].Assuming a relatively simple architecture, this approach is attractive due to its speed: if it works, both source term recovery and time integration is captured in a single efficient pass through the ANN.However, the downside is that this approach relies heavily on sampling a high-dimensional data space during the training process, which is either prohibitive for high-dimensional mechanisms or requires involved sampling procedures that rely on slow manifold theory.Furthermore, because the time integration is treated within the ANN, the time step is either fixed or required as an additional input to the model.Unless treated explicitly within the ANN architecture, the integration scheme contained within the ANN in such methods will naturally reflect the scheme used to recover the training data itself.Lastly, this approach is completely black-box in nature, and few constraints based on underlying physical relations (i.e.Arrhenius form) can be exploited (though some constraints, such as mass fraction conservation, can be enforced with the correct output layer activation function [22]).A consequence of the black-box quality lies in ANN interpretability: if one deploys a trained ANN using this approach into a solver, by construction, assessing where and how the resulting model fails is difficult because of its large operation scope.
The second category is to replace just the source term computation with a trained ANN that takes thermochemical state as input and outputs the corresponding source terms (ANN as a tabulation method) [12,15,25].Here, the ANN serves as an approximation to a known nonlinear function.This can be seen as narrowing the scope of the ANN with respect to the first approach, as it eliminates the time-integrator role played by the neural network.Due to this elimination, the advantage here is that well-established GPU (or other ANN-based) integration techniques [18,26] can still be utilized, and the role of the ANN becomes more transparent.However, the disadvantage is still in the prohibitive dependence on sampling an highdimensional dynamical system to produce the training data.Techniques that rely on clustering subsets of the thermochemical state within the ANN [15,27] have been attempted to reduce this dependence, though this adds significant computational complexity to the architecture and introduces additional assumptions to the procedure.Overall, both routes discussed above are at risk of overfitting to the configurations used to obtain the training data [11].In-situ training techniques [9,28] can be used in the ANN setting in light of this and remains an open area of research, though it ambitiously relies on the in-situ training phase to be overall less expensive than the deployment phase.
The goal of this work concentrates on the second category, but adopts a different approach -the exact equations for chemical kinetics are cast into an ANN-based form.More specifically, components of the source term evaluation are transformed into matrix-multiplication representations that can be interpreted as ANN layers.It will be shown that this exact ANN framework can be modified by utilizing trained ANNs as drop-in replacements for their exact form counterparts.Such replacements allow for direct control over the computational cost of individual components of the source term evaluation through their architectures, and because of this, additional speedup on the GPU can be extracted over the corresponding "exact ANN" form in some conditions.Furthermore, it will be shown that the training approach for these drop-in replacements a) does not require an intensive high-dimensional sampling procedures for data generation, and b) allows the user to retain certain physical constraints that drive the source term computation (i.e. the Arrhenius form), providing a method that extends to any configuration.Note that the main objective of this paper is not to use machine learning for developing models, but rather to represent chemical kinetics using the ANN framework.The end result is that little-to-no training is needed while preserving the GPU-friendly structure of the ANN formulation.
The remainder of the paper proceeds as follows.In Sec. 2, the methodology for the ANN-inspired formulation is presented.In Sec. 3, the method is demonstrated using various simulations, and GPU speedup and saturation effects are discussed.Concluding remarks are provided in Sec. 4.

Neural Network Interpretation of Kinetics
The main goal of this section is to recast the chemical kinetics equations into neural network interpretations that enable fast GPU execution through matrix multiplications.Though these reformulations are exact, they can be easily modified to include trained artificial neural networks (ANNs) to further enhance speedup.It will be shown that this form of ANN integration allows for extendable models that preserve underlying physical constraints attributed to the Arrhenius form and chemistry mechanism structure.
In the following, the quantities N C , N S and N R denote the batch size (which can be interpreted as the number of reacting cells in a domain offloaded to the GPU), number of species, and number of reactions, respectively.Unless otherwise indicated, matrices are denoted by bold symbols (A) and vectors by non-bold symbols.The scalar entry of matrix A in row i and column j is denoted A ij ; similarly, the scalar i-th entry of vector a is denoted a i .Further, the quantities i, j, and k index N C , N R , and N S respectively (i.e.i = 1, . . ., N C , j = 1, . . ., N R , and k = 1, . . ., N S ).For the set of species {S 1 , . . ., S N S }, a general chemical mechanism is represented as where ν ∈ R N S ×N R (resp.ν ) is the reactant (resp.product) stoichiometric coefficient matrix, and ν = ν − ν .

Species Production Rate
The molar net production rate (kmol/m 3 s) for species k in cell i is where Ω ∈ R N C ×N S contains the source terms and Q net ∈ R N C ×N R contains the net reaction rates.Note that Eq. 2 can be expressed concisely through the matrix multiplication Ω = Q net ν T .The complexity comes from the net reaction rate, which is expressed as Above, Q f and Q r ∈ R N C ×N R are the forward and reverse reaction rate matrices respectively, K f and K r ∈ R N C ×N R are the forward and reverse rate constants respectively, and C ∈ R N C ×N S contains the species molar concentrations.Since Q f and Q r are non-negative, Eq. 3 can be interpreted as a summation of two ANN layers by enabling matrix multiplications in the logarithm space: Reverse rate It can be seen through Eq. 4 that the forward and reverse contributions are ANN layers with exponential activation functions, where the input is the logarithm of the concentration matrix C, the weight matrices are known stoichiometric coefficients ν and ν , and the biases are the logarithms of rate constants K f and K r .
Figure 1a summarizes the above formulation (Eqs. 2 and 4) through an ANN architecture.Note that the leading matrix dimension of all input and output variables, which constitutes the batch size in the forward pass, is N C .This allows for efficient threading and fast execution in high fidelity settings, assuming optimized linear algebra libraries (such as cuBLAS) are utilized by the user.The remaining task, described below, is to obtain the rate constants K f and K r .

Forward Rate Constant:
The forward rate constant K f ∈ R N C ×N R is given by the Arrhenius expression where A, β, and E are vectors each of size N R containing pre-exponential factors, temperature exponents, and activation energies respectively for the elementary reactions.These Arrhenius parameters are known to the user through the mechanism files.The natural logarithm of the forward rate (required in Eq. 4) usefully yields a form that can also be interpreted as a linear ANN layer, R is a weight matrix consisting of temperature exponents and activation energies, and B f ∈ R 1×N R is a bias term of pre-exponential factors.In this sense, each row of the layer output log(K f ) can be interpreted as a set of N R Arrhenius neurons.

Reverse Rate Constant:
The reverse rate constant K r ∈ R N C ×N R is given by where K c ∈ R N C ×N R contains the equilibrium rate constants.Since the expression for log(K f ) is provided through the Arrhenius neurons (Eq.6), the task of determining log(K r ) required in Eq. 4 is accomplished by considering only log(K c ).The equilibrium constant for cell i and reaction j is [29] where ∆S j and ∆H j are changes in entropy and enthalpy for reaction j, and p ref is the reference pressure (1 bar).The logarithm of Eq. 8 yields where G ∈ R N C ×N S is the nondimensional Gibbs free energy matrix (hereafter referred to as the Gibbs matrix) obtained from the nondimensional enthalpy (H ∈ R N C ×N S ) and entropy (S ∈ R N C ×N S ) matrices.Each entry in the Gibbs matrix is determined from NASA polynomials which provide species enthalpy and entropy as tabulated functions of temperature.The result can be expressed as a matrix multiplication In Eq. 10, X G ∈ R N C ×6 is the input consisting of various functions of temperature and α ∈ R 7×N S is a matrix of polynomial coefficients; the first 6 rows of α is the weight matrix W G and the last row is the bias B G .Note that, though not shown in Eq. 10 for conciseness, the quantities in α (and in turn W G and B G ) are also functions of the cell temperature T i and the species index.This is because the species polynomial coefficients change based on a cutoff temperature (usually 1000 K).
Inserting Eq. 10 into Eq.9 gives where the standard concentration term p ref /RT i has been integrated into the bias B G .Equation 11 can be interpreted as a linear two-layer ANN.The parameters of the first layer (the Gibbs layer) are the temperaturedependent W G and B G , and those of the second layer are the net stoichiometric coefficients ν.The intermediary neurons (i.e.hidden layer neurons) here are referred to as the Gibbs neurons.Illustrations of the both forward and equilibrium rate constant formulations as neural network-inspired architectures are shown in Fig. 1b and c, with Arrhenius and Gibbs neurons highlighted.

Integration of Approximate Artificial Neural Networks
Thus far, the ANN-inspired reformulations are exact and can by themselves be implemented on GPUs with efficient linear algebra libraries -we refer to these as exact ANNs.However, additional computational efficiency can be provided by utilizing approximate ANNs as drop-in replacements for the exact ANN architectures described in Fig. 1.At the cost of accuracy, such replacements allow for direct control over computational cost through the ANN architecture.The end goal is that the execution time of the approximate ANN should be faster than the exact ANN counterpart.
Though many pathways to this end are available, here, the replacement of the exact ANN for the logarithm of the equilibrium rate constant (Fig. 1c) will be explored.By narrowing the approximate ANN scope to the equilibrium constant alone, a) sampling an N S -dimensional phase space to develop a training dataset is not required, and b) known physical constraints to recover the source term, such as the relationship between concentrations and net reaction rate (Eq.4), and Arrhenius forms (Eq.6), are preserved.Additionally, the exact ANN architecture in Fig. 1c is complex enough to warrant a reduction based on an approximate ANN (there is more than one layer, which is not the case for the forward rate constant architecture).
In general, an ANN layer takes the following form: where X l is the layer input, X l+1 is the layer output, W l is the weight matrix, B l is the bias vector, σ is an activation function, and N L is the total number of hidden layers.Note that unlike in the above sections, the parameters (weights/biases) are assumed unknown in this setting and are found through a training procedure.As with the exact formulations, the leading dimension (batch size) for these input and output matrices is N C .Here, to simplify analysis, for a given N L , the hidden layer dimension N H (or number of neurons per hidden layer) is fixed.
The only restrictions are that the input features are functions of temperature and the output dimensionality is N R .To highlight key points, two examples of approximate ANN architectures are shown in Fig. 2. Architecture 1 uses the same input features as the exact ANN but allows for variation N L and N H (referred to as the modified Gibbs neurons).Architecture 2 is similar but only utilizes one input feature, namely log(T ).
Nonlinearity is imposed in both architectures through the activation functions σ l .In Architecture 1, since several functions of temperature are already included in the input, a simple rectified linear unit (relu) activation function can be used: On the other hand, since Architecture 2 utilizes only log(T ) in the input layer, the more expensive exponential linear unit (elu) activation can be used [30] to allow the model to extract dependence on powers of T as needed during the training process: In the above scenario, it is reasonable to expect that the computational advantage offered by the smaller input size of Architecture 2 is offset by the more expensive activation function.In light of this, Architecture 2 can be modified to use the relu activation, though this lessens the expressive power of the ANN.Although several other candidate architectures can be created, in the results below, we demonstrate the approximate ANN performance using only Architecture 2 for brevity, as this architecture consists of a less complex input.Overall trends discussed throughout this work are applicable to both (and more) architecture types.As an aside, in some cases it is advantageous if the architecture is modified to instead operate on scaled versions of the inputs and outputs, which can assist in weight convergence during the training phase.In this work, a standardization operation (subtraction of training set mean and normalization by training set standard deviation) is used for scaling, though many other scaling strategies are viable.

Additional Comments
Three-body reactions: To handle three-body reactions, the quantity log(M) can be added to log(K f ), where M ∈ R N C ×N R is a matrix of third-body concentrations for each reaction (M ij is 1 if reaction j does not include a third body).The entries in M can be obtained through the matrix multiplication M = CE, where E ∈ R N S ×N R is a matrix of third-body efficiency factors.
Falloff reactions: Falloff reactions are treated separately from standard reactions -although the Arrhenius rate constants used to compute the non-dimensional reduced pressure can be treated by the same Arrhenius layer described in Fig. 1b, falloff functions such as that of Troe (if they exist in the mechanism) are handled separately on the GPU in a non-matrix fashion.A useful extension of the method could be to capture both standard and Troe falloff functions in a single forward pass using an approximate ANN for the forward rate constant.
Irreversible reactions: Treatment of irreversible reactions is handled by appending an indicator function encoding reaction reversibility to the reverse reaction rate term in Eq. 3.

Results
The objective below is to first demonstrate the feasibility of the approximate ANN replacement (Fig. 2, Architecture 2) for the equilibrium rate constant as described in Sec.2.3, and to then assess the GPU-based performance of the method (Sec.3.2).Table 1 details the three mechanisms (referred to as mechanisms A, B, and C) of increasing complexity used throughout this section.

Training results:
As per Fig. 2, the goal of the approximate ANN is to recover the logarithm of the equilibrium rate constant for all reactions in the mechanism, log(K c ).As such, the training data is obtained by sweeping through a range of temperatures (functions of which supply the input) and, for each temperature, recovering the exact N R -dimensional vector of equilibrium rate constants (which supplies the target).Then, standard supervised techniques can be used to train the neural network with a specified loss function.The mean-squared error (MSE) loss on the scaled (standardized) logarithm of the equilibrium constant is used here.Note that this approach usefully eliminates an involved high-dimensional sampling procedure for obtaining the training data -temperature is trivially sampled from a range that is usually known a-priori.For each mechanism, 1 million temperature / equilibrium constant pairs were sampled within the ranges specified in Tab. 1 (i.e.N C = 10 6 in the training phase).ANNs were trained with the PyTorch library [35].
Training results for the three mechanisms are shown in Fig. 3 in the form of loss function history versus training iteration (epoch).For all models and mechanisms shown, training parameters (batch size, learning rate, total epochs, optimization method, etc.) were fixed to allow for more direct comparisons.Two classes of models are considered: low-fidelity (LF) and high-fidelity (HF) ANNs.The architecture for the HF-ANNs (3 hidden layers, 32 neurons per layer) is constructed such that its parameter space is much larger relative to the LF-ANN counterparts (1-to-2 hidden layers, 8 neurons per layer).The consideration of ANNs with varying complexity is important in the context of speedup, to be explored in Sec.3.2.In the discussion below, the notation X/Y is used to concisely refer to an ANN with X hidden layers and Y neurons per hidden layer.
Figure 3 shows that the HF-ANNs (the 3/32 ANNs) approach MSE values that increase slightly with increasing mechanism complexity.Despite this, the convergence point for all three HF-ANNs occur at nearly the same order of magnitude.In other words, for the ANN architectures considered, no significant decrease in training loss is seen when moving from Mechanism A to C. This is particularly impressive because Mechanism C is much more complex than A.
As expected, the LF-ANNs approach MSE values much higher than the HF-ANN counterparts.Surprisingly, despite the vast difference in complexity, the 1/8 ANNs for Mechanisms A and C approach the same MSE.Interestingly, the converged MSE for the 1/8 ANN for Mechanism B is roughly an order of magnitude lower than both Mechanisms A and C. Figure 3 shows that a jump from 1/8 to 2/8 (addition of one hidden layer) in the LF-ANN architecture is required for Mechanism B to converge to a loss similar to the 1/8 ANNs for Mechanisms A and C.This highlights an important facet of chemical mechanism complexity: although Mechanism B contains less species and reactions than C, higher order functions of temperature are required in the ANN estimation to recover the equilibrium constants at similar levels of accuracy.

Predictions:
The remainder of this section outlines how errors in the log(K c ) computation (Fig. 3) translate to errors in mass fraction predictions in a-posteriori simulation settings.This involves replacing the exact equilibrium rate constants with the approximate ANN outputs during the source term computation used throughout the simulation.Both HF and LF-ANN models for each mechanism are considered here; architectures are supplied in Tab. 1.Note that in light the discussion in Sec.3.1.1,the LF-ANN architecture for Mechanism B for the analysis below contains two hidden layers instead of one.Two simulation scenarios are considered: zero-dimensional (0-D) auto-ignition and one-dimensional (1-D) channel detonation.

0-D Ignition:
Auto-ignition in a constant pressure reactor was simulated for the three mechanisms using both exact and ANN-based formulations.All simulations were performed using a forward Euler integration scheme at a constant time step of 1e-10s.Only one initial condition per mechanism is shown here for conciseness.
The time evolution of mass fraction and error for a subset of species are shown in Figs. 4, 5, and 6 for Mechanisms A, B and C respectively.The fuel and oxidizer used for Mechanisms A and C are hydrogen/air, whereas those for B are methane/oxygen.For all three mechanisms, the mass fraction profiles show that both the LF and HF-ANN based simulations are nearly indistinguishable from the exact (Cantera) counterpart.However, the error profiles expectedly reveal that the LF-ANN simulation observes significantly higher errors than the HF-ANN simulation, with peaks occurring near the ignition point.For Mechanisms A and C (Figs. 4  and 6), the highest observed LF-ANN errors occur at values of roughly two orders of magnitude lower than the respective mass fraction values.Surprisingly, the errors profiles for the LF-ANNs in Mechanism C (the mechanism with highest species and reaction count) are noticeably lower across the board -this could likely be due to the reduced temperature range with which its training data was collected (see Tab. 1).
Mechanism B (Fig. 5) LF-ANN errors show peaks near or at the same order of the respective mass fraction values.These spikes in error are explained by the slightly offset peaks in the LF-ANN mass fractions.Since the ignition timescale is very small, this delay in ignition time (indicated by the insets in Fig. 5) produces a very high mass fraction errors at the same time instant.This LF-ANN effect is not observed to the same degree for Mechanisms A and C, and is alleviated in Mechanism B by the HF-ANN.Despite this, the structure of the LF-ANN species profiles in Mechanism B are very close to the exact simulations; the relative error in LF-ANN ignition time shown in the insets in Fig. 5 is less than 1%.
These results ultimately show that a) despite higher MSE values observed in the training phase, the LF-ANN based approximations still produce near-exact mass fraction profiles, and b) one can correlate in confidence a drop in MSE in the context of Fig. 3 with a drop in errors in species time evolution profiles for a given mechanism.However, the degree to which this error is reduced is not necessarily consistent across different mechanisms.An important result is that the LF-ANN maintains high accuracy even for Mechanism C; this is somewhat counter-intuitive because the LF-ANN provides significantly higher complexity reduction for Mechanism C than for the others.
The pre-detonation (ambient) conditions and other simulation details used for each mechanism case are provided in Tab. 2. Note that the initial condition for each case contains a small section at the left end of the channel that is filled with a high pressure, high temperature post-detonation mixture to enable detonation propagation.Figure 7 shows a snapshot of temperature, pressure, and several mass fraction fields collected after 0.1 ms of run time for each mechanism.The results discussed in the 0-D case are in general applicable here: both LF and HF-ANN models almost exactly capture the nonlinear detonation profiles.However, a slight misrepresentation of the the wavefront location obtained by the LF-ANN simulation is observed in the insets in Fig. 7, which is a direct consequence of the higher training errors discussed in Sec.3.1.1.More specifically, the location of peak pressure is overestimated for the Mechanism A and C cases, and underestimated for the Mechanism B case.As expected, the insets show that these inaccuracies are largely eliminated by the HF-ANN architecture, especially for small intermediary species in Mechanism B (last row in Fig. 7).
Figure 8 shows the time evolution of relative errors in peak values for temperature and pressure.Curves for Mechanisms A and C show no constant increase in time, and relative errors for both temperature and pressure stay under 1% throughout.Further, for most of the snapshots considered in Fig. 8, an expected drop in peak value error is observed when moving from the LF to the HF-ANN model.These trends are also observed for Mechanism B, albeit to a lesser degree: relative errors in peak pressure are as high as 10% for both HF and LF-ANN models at some time instances, and the amount of overlap between LF and HF-ANN error curves is higher.The 1-D detonation results are overall convincing and bring significant confidence towards the viability of the approximate ANN approach.Despite the large increase in complexity over the 0-D cases, the approximate ANNs (both low and high fidelity) capture the relevant detonation structures to acceptable levels of accuracy.However, these results also present a clear trade-off in ANN accuracy versus computational cost, as the increase in ANN fidelity eliminates errors (however small) as expected.Section 3.2 builds on this demonstration and explores this tradeoff in the context of GPU speedup and saturation.

GPU Performance
In this section, the GPU-enabled speedup provided by the formulations in Sec. 2 is explored.In particular, three questions are addressed: 1) For a given N C , how much faster is the GPU-based evaluation of source term than the Cantera-based CPU counterpart?2) Do the formulations in Sec. 2 allow the GPU to be utilized to its fullest capacity? 3) In which scenario does switching to an approximate ANN for recovering the equilibrium rate constant provide speedup over the exact ANN form (i.e. when is using the ANNs described in Sec.2.3 "worth it")?It should be stated that below, GPU speedup and performance is assessed only from the perspective of the source term computation in isolation, and not for an entire solver, which is deemed out of scope.This is because GPU-enabled speedup for an entire reacting flow solver can drastically vary depending on a) the chemistry time-integration algorithm, b) GPU treatment of convective/diffusive fluxes, c) GPU treatment of boundary conditions and domain decomposition based communication steps, and d) the amount (and implementation of) CPU-GPU data transfers.Since the methodology in Sec. 2 exists independently from these factors, the GPU speedup and performance trends are also treated independently.
The methodology described in Sec. 2 was implemented in the GPU setting with a combination of the CUDA and cuBLAS C++ APIs.Calculations used in the analysis below were performed on a single ORNL Summit node consisting of IBM Power9 CPUs and Nvidia V100 GPUs.It should be noted that absolute values of speedup will of course depend on both CPU and GPU architectures as well as the user implementation of GPU functions.Despite this, the GPU computation trends discussed below -especially with regards to saturation limits and approximate ANN architecture -are valuable in general.Further, in the context of domain decomposition based approaches used in high-fidelity multi-physics solvers, the GPU is often used to accelerate routines assigned to one or more MPI ranks (or hardware threads of MPI ranks) that operate over some number of cells/nodes in the domain.The speedup-related quantities and figures discussed below are therefore shown as functions of cell number (N C in Sec. 2), and increases in N C can be interpreted as the result of mesh refinement.
Figure 9 shows the GPU speedup (ratio between GPU and CPU time-to-solution) and GPU evaluation time for the source term calculation (Eq.2).The GPU computations use the exact matrix-based formulations from Sec. 2 -no approximate ANNs are utilized at this point.The CPU baseline used for Fig. 9a comes from the C++ Cantera function getNetProductionRates evaluated with one MPI rank.
Figure. 9 (left) shows that the speedup curves of all three mechanisms have similar profiles: a convergence in speedup is reached near 100x after an initial period of near-linear growth.There is no decay in speedup after the convergence point is reached.Further, as mechanism complexity increases, speedup also increases within the 10 1 -10 4 cell count range which may seem counter-intuitive.On the other hand, the converged speedup at 10 6 cells drops slightly with increasing N S and N R .This comes directly from the fact that the saturation point, or the point at which speedup stabilizes, occurs at a lower cell count when mechanisms become more complex (i.e. the increase in N S and N R is accounted for by a decrease in N C ).This phenomenon is better accessed in the right plot in Fig. 9, which shows absolute GPU compute times.For a given mechanism, there is a range of cell counts for which compute time does not change at all;  the upper bound of this range (which corresponds to the elbow in left plot of Fig. 9) drops as mechanism complexity increases.Beyond this point, the GPU compute time increases linearly.Since the CPU compute time also increases linearly with respect to N C , no drop in speedup occurs beyond the saturation point as evidenced by Fig. 9 (left).In Sec.3.1, the approximate ANN which replaces the exact computation for the equilibrium constant was utilized.Although the discussion above shows that GPU speedup using the exact matrix-based formulations described in Sec. 2 is high, additional speedup can be extracted through the approximate ANN replacement under some conditions.This is shown in Fig. 10, where the speedup represents the ratio of times taken to compute log( K c ) (approximate ANN in Fig. 2, Architecture 2) and log(K c ) (exact form, Fig. 1c) on the GPU.Note that the speedup shown in Fig. 10 is different than Fig. 9, which compares GPU-to-CPU ratio in execution times for the exact source terms.
Figure 10 shows that a) ANN based speedup is affected minimally by the number of hidden layers when the number of neurons per layer is reasonably small, and b) achieving ANN speedup for less complex mechanisms is much less feasible.This second point is especially important, as it signifies how the upper bound on the speedup achievable by the ANN in this context is limited by the number of species and reactions in the mechanism (as evidenced by all mechanisms collapsing to similar curves in Fig. 10).As a result, even when considering the low-fidelity (LF) ANN models used in Sec.3.1, significant speedup is only observed for Mechanism C because the ANN reduction provided is much more significant.Further, despite the fact that the HF-ANNs alleviate much of the errors of the LF-ANNs, none of the HF-ANNs provide speedup over the exact computation of the equilibrium constant.This means the ANN-based speedup also comes at a slight cost in accuracy, though as discussed in Sec.reasonably low-fidelity ANNs in this scope.Translation of the approximate ANN based speedup in Fig. 10 into speedup observed in the entire source term calculation (Fig. 9) is difficult to assess.This is because the values in Fig. 10 (as well as Fig. 9) for a given ANN architecture are heavily reliant on the GPU implementation of the exact equilibrium constant calculation (Eq.9).This implementation not only affects the ANN-replacement speedup, but also changes the contribution of the equilibrium rate constant computation to the entire GPU-based source term computation.Depending on the implementation, it was found that this contribution can range anywhere from 90% (inefficient) to 10% (efficient) of the total source term cost.Despite this variation, in the context of high-fidelity simulations, even a small ANN-derived speedup in Fig. 10 can be useful (i.e. the speedup factor seen for Mechanism A's LF-ANN) as the amount of calls to the source term calculation within one simulation time step is usually very high.

Conclusion
A method which casts the source term computation into an ANN-inspired form was presented, which allows for an interpretation of the equations as a series of ANN layers.The resulting equations allow for matrixmultiplication based source term estimation where the leading dimension (batch size) can be interpreted as the number of chemically reacting cells in the domain; as such, the approach can be readily adapted in highfidelity solvers for which an MPI rank offloads the source term computation task to the GPU.Though the exact matrix-multiplication based recasting is GPU-friendly as-is, the ANN-inspired interpretation allows the user to replace portions of the exact routine with trained, approximate ANNs.The ultimate goal is to use these approximate ANNs to decrease computational cost (or increase speedup) over the exact counterparts.
In this work, the approximate ANNs were trained as drop-in replacements for the equilibrium rate constant computation.The utilization of a trained ANN in this fashion removes the input dependence on the N S -dimensional species concentration vector, which greatly simplifies the training process and allows the framework to maintain certain physical qualities during the source term computation (i.e.Arrhenius form for forward rate constants).Through a-posteriori 0-D auto-ignition and 1-D channel detonation simulations on several mechanisms of varying complexity, the results ultimately showed the viability of this approach in complex nonlinear environments.Despite higher MSE values observed in the training phase, the low-fidelity ANN approximations still produced near-exact mass fraction profiles across the board.
When the number of cells is reasonably high, the exact GPU-based methodology displays significant speedup over the Cantera CPU counterpart.Further, saturation trends showed that after a near-linear growth in speedup with respect to the number of cells, a saturation point is reached after which the GPU computation time trends linearly with cell count, and speedup stabilizes as a result.As expected, this saturation point was reached for a smaller cell count when the mechanism complexity increased.Further, it was found that speedup obtained by the approximate ANNs for the GPU-based computation of the equilibrium rate constant depended both on the approximate ANN architecture and the chemical mechanism complexity.Achieving approximate ANN speedup for less complex mechanisms was ultimately much less feasible, as the upper bound on the speedup achievable by the ANN in this context is limited by the number of species and reactions.Realistic approximate ANN speedup over the exact form in this sense was achievable only for the more complex chemical mechanism.In general, the analysis showed how the computational advantage provided by the approximate ANNs can be significant, although trained ANNs used to substitute particular algorithms that can already be cast in GPU-friendly forms should not be interpreted as a catch-all technique for speedup.
There are many ways in which this approach can be extended.Since this technique applies only to the source term estimation, coupling of the method with a GPU-optimal stiff time-integration routine is warranted.Further, the design of an approximate ANN architecture better suited for speedup in less complex mechanisms should be explored.Additionally, adapting the ANN-based interpretation to single or half-precision environments can allow for improved GPU utilization, and faster run times as a result.These topics will be explored in future work.

Figure 1 :
Figure 1: Illustrations ANN-based formulations for NC = 1, NS = 4, and NR = 8.Since NC = 1, input/outputs are vectors and cell indices are ignored.Bias terms not shown for clarity.a) Schematic of Eqs. 2 and 4. Exponential activation functions are used to produce forward/reverse rates.b) Schematic of Arrhenius layer for forward rate constant (Eq.6).c) Schematic of Gibbs layer equilibrium constant (Eq.11).
3.Last three columns show high-fidelity (HF) ANN architecture used in predictions in Sec.3.1.2,low-fidelity (LF) architecture used in predictions in Sec.3.1.2,and temperature range used to generate the training data, respectively.For ANN columns, notation "X/Y" refers to ANN with X hidden layers and Y neurons per hidden layer.

Figure 7 :Figure 8 :
Figure 7: Detonation profiles after 0.1 ms from exact ( ), LF-ANN ( ), and HF-ANN ( ) simulations for Mechanism A (left column), B (middle column), and C (right column).First two rows show temperature and pressure profiles; remaining rows show several species mass fraction profiles.Insets show quantities near detonation front.

Figure 9 :
Figure 9: (Left) GPU-enabled speedup in source term calculation over the Cantera-based CPU counterpart versus NC for Mechanism A ( ), B ( ), and C ( ). (Right) GPU source term evaluation times versus NC for Mechanism A, B, and C (same symbols).Linear trend with respect to cells ( ) is shown for reference.

Figure 10 :
Figure 10: Speedup for GPU-based equilibrium rate constant computation enabled by approximate ANN (Fig. 2, Architecture 2) over the exact form (Fig. 1c) for Mechanism A ( ), B ( ), and C ( ) at NC = 10 5 .X-axis is number of neurons per hidden layer (NH ) normalized by number of species (NS).Plots show increasing number of hidden layers from left-to-right.Arrows indicate LF-ANN and HF-ANN architectures for each mechanism as used in Sec.3.1.

Table 1 :
Details of chemistry mechanisms used throughout Sec.

Table 2 :
Pre-detonation (ambient) conditions for each mechanism case.Refer to Tab. 1 for mechanism details.Ratio of fuel for Mechanism C is 50:50 H2:CH4 by volume.All mechanisms use a simulation time step ∆t of 1e-10 s, adjusted as needed to satisfy a CFL condition of 0.2.