This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

The solution of chemical kinetics is one of the most computationally intensive tasks in atmospheric chemical transport simulations. Due to the stiff nature of the system, implicit time stepping algorithms which repeatedly solve linear systems of equations are necessary. This paper reviews the issues and challenges associated with the construction of efficient chemical solvers, discusses several families of algorithms, presents strategies for increasing computational efficiency, and gives insight into implementing chemical solvers on accelerated computer architectures.

Chemical transport models solve the mass balance equations:

The mass balance partial differential equations are solved via operator splitting, in which the individual processes are treated sequentially within each split step. Here Δt_splitting denotes the model time split step size, and should be distinguished from the integration step size used inside each process solver. Operator splitting leads to a sequence of simpler problems involving advection and diffusion, chemistry, etc. In each grid cell the chemistry is described by a system of d ordinary differential equations, with state vector y ∈ R^d and Jacobian J ∈ R^{d×d}.

The solution of chemical kinetics

The overall accuracy of a chemical transport simulation, roughly defined as the difference between the model output and the real chemical concentration fields, is the result of nonlinear interactions between errors coming from different sources:

The overall simulation accuracy can be considerably improved by

Data and modeling errors are difficult to quantify. A desirable level of overall numerical accuracy is on the order of 1%; the overall simulation accuracy can be assessed a posteriori by comparing model results and measurements. The size of the numerical errors should be at least one or two orders of magnitude below the overall target.

The chemical kinetic mechanism is just one subsystem of a large chemical transport simulation. Data errors are associated with the initial conditions. Model errors are associated with the level of detail of the chemical mechanism, and with the accuracy of reaction rate coefficient values. Numerical errors are associated with the particular numerical integration algorithm employed and with the length of the time steps used.

The target level of relative accuracy (relative error tolerance) is 0.1%,

Due to operator splitting [

Different chemical species participating in atmospheric chemical kinetics have widely different life times. Specifically, different species evolve on different time scales, from milliseconds (e.g., for radicals such as OH) to years (e.g., for CH_{4}). The resulting system of ordinary differential equations is

Due to numerical stability considerations, explicit time integration methods cannot use time steps that are much larger than the fastest time scale in the system. Roughly speaking, the current solution is influenced by the approximation error made during the previous step, multiplied by the ratio of the step size over the fastest dynamic time scale. If this ratio is large the errors accumulate extremely quickly and the solution becomes unusable (numerical instability).
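This instability is easy to reproduce. The following illustrative Python sketch integrates the stiff scalar test problem y′ = −100 y with a step size five times larger than the fastest time scale: explicit Euler blows up, while implicit (backward) Euler remains stable.

```python
# Stiffness demo: y' = -lam*y with lam = 100 and step h = 0.05,
# so h*lam = 5 lies well outside the explicit Euler stability interval.
lam, h, n_steps = 100.0, 0.05, 20

y_explicit = 1.0
y_implicit = 1.0
for _ in range(n_steps):
    y_explicit = y_explicit + h * (-lam * y_explicit)   # y <- (1 - h*lam)*y = -4*y
    y_implicit = y_implicit / (1.0 + h * lam)           # y <- y / (1 + h*lam) = y/6

print(abs(y_explicit))  # grows like 4**20, i.e. about 1e12: numerical instability
print(y_implicit)       # decays monotonically toward 0, like the true solution
```

The amplification factor per step is exactly the ratio described above: (1 − h·λ) for the explicit method, versus 1/(1 + h·λ) for the implicit one.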

The time stepping methods used to solve atmospheric chemical kinetics should be unconditionally stable,

The solution of a chemical kinetic system has several intrinsic properties. The total mass and total electric charge are preserved during the system evolution, and the concentrations remain positive at all times. It is desirable that the numerical solution preserves such properties as well [

The total mass and the total charge are linear invariants of the system,

The preservation of positivity is more difficult to achieve. Methods that preserve positivity unconditionally (for any step size) are at most of order one [

Note that the preservation of positivity is important only in those situations where negative concentrations render the ODE dynamics unstable. For many chemical mechanisms used in practice, small negative concentrations do not result in instability [

Since the solution of chemical kinetics takes up an important fraction of the total compute cycles in a chemical transport simulation, special care needs to be paid to computational efficiency. Roughly speaking, an efficient computation achieves the target accuracy in the shortest CPU time possible. The total compute time depends on the number of time steps used to cover the split interval [t, t + Δt_splitting], and on the CPU time spent in each of these steps.

We have seen that the solution of stiff chemistry requires implicit time integration algorithms. Most of the computational effort per step is spent in solving the system of nonlinear equations. In a Newton-Raphson approach, the LU factorization of the Jacobian is computed once, and is reused for all iterations; most of the computational effort is spent on performing the LU factorization and the repeated substitutions. The following ideas have proved successful in reducing the computational effort per step:

Avoid solving coupled nonlinear systems by the use of approximate implicit algorithms [

Use sparse linear algebra techniques [

Reduce the number of forward and backward substitutions by using iteration-free (linearly-implicit) time stepping algorithms [
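The LU-reuse idea from the Newton-Raphson discussion can be sketched as follows (illustrative Python with NumPy; the two-species toy kinetics and all coefficients are hypothetical, and an explicit inverse stands in for the one-time LU factorization that production codes would keep):

```python
import numpy as np

def f(y):
    # Hypothetical two-species toy kinetics (not a real mechanism).
    return np.array([-0.04 * y[0] + 1.0e2 * y[1] ** 2,
                      0.04 * y[0] - 1.0e2 * y[1] ** 2 - 3.0 * y[1]])

def jac(y):
    return np.array([[-0.04,            2.0e2 * y[1]],
                     [ 0.04, -2.0e2 * y[1] - 3.0]])

def implicit_euler_step(y0, h, newton_iters=10):
    # Modified Newton: the matrix M = I - h*J is built from the Jacobian
    # at y0 and "factorized" once; the factorization is reused by every
    # iteration, so each iteration costs only substitutions.
    M = np.eye(2) - h * jac(y0)
    M_inv = np.linalg.inv(M)           # stands in for a single LU factorization
    y = y0.copy()                      # starting guess
    for _ in range(newton_iters):
        residual = y - y0 - h * f(y)   # implicit Euler residual G(y) = 0
        y = y - M_inv @ residual       # reuse the factorization every iteration
    return y

y1 = implicit_euler_step(np.array([1.0, 0.0]), h=1e-3)
```

Because the Jacobian changes slowly over one step, freezing it costs little in convergence speed while avoiding a refactorization per iteration.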

The reduction in the number of necessary time steps requires a good mechanism for time step adaptivity, one that covers the split interval [t, t + Δt_splitting] in a small number of steps.

A considerable increase in efficiency is possible based on the important observation that chemical systems

Some early techniques for dealing with the stiffness of chemical ODE systems in atmospheric chemistry include analytical techniques [

The Kinetic PreProcessor (KPP) [

The QSSA method [

Starting with the production-destruction form of the chemical kinetic ODE, y′ = P(y) − D(y) y, where P is the production vector and D is the diagonal matrix of destruction frequencies, the QSSA method freezes the coefficients at their current values P_n = P(y_n) and D_n = D(y_n) over the step h, and advances each component by the exact solution of the frozen linear equation:

y_{n+1} = D_n^{−1} P_n + ( y_n − D_n^{−1} P_n ) exp(−D_n h).

For nonnegative P, D, and y_n, this QSSA formula is unconditionally stable and preserves the positivity of the solution.
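One QSSA step for the production-destruction form y′ = P − D·y can be sketched as follows (illustrative Python; P and D are held frozen over the step):

```python
import numpy as np

def qssa_step(y, P, D, h):
    """One QSSA step for y' = P - D*y with P, D frozen over the step.

    y, P, D are arrays of length d; D holds the (positive) diagonal
    destruction frequencies. Each component is advanced by the exact
    solution of the frozen linear ODE, which is unconditionally stable
    and preserves positivity for P, D, y >= 0.
    """
    y_ss = P / D                         # quasi-steady-state value P/D
    return y_ss + (y - y_ss) * np.exp(-D * h)

# A fast species (lifetime 1/D = 1 ms) relaxes to its steady state P/D
# in a single step far larger than its lifetime -- no instability.
y_new = qssa_step(np.array([0.0]), P=np.array([5.0]), D=np.array([1000.0]), h=1.0)
```

The formula is exact when P and D are constant; the first-order accuracy limitation discussed in the text arises only because P and D in fact vary over the step.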

A careful analysis of the QSSA method has revealed that it is of first order, and improved QSSA methods remain of first order under stiffness [

Backward differentiation formulas (BDF) have become famous under the name “Gear” methods for solving chemical kinetic problems. BDF are linear multistep methods with excellent stability properties for the integration of stiff systems [

A k-step BDF formula computes the new solution from the current and past values:

∑_{i=0}^{k} α_i y_{n+1−i} = h β f(y_{n+1}),

where the coefficients α_i and β are chosen such that the method has order k.

Practical implementations of BDF formulas are able to adapt both the time step and the order to achieve maximum efficiency. For easy adjustment of the step size it is convenient to represent the past solution history y_n, y_{n−1}, …, y_{n+1−k} by the Nordsieck array, which stores the scaled derivatives of the interpolating polynomial at t_n.

The nonlinear system is solved for y_{n+1} by a Newton-Raphson iterative approach. The starting point is provided by a k-th order "predictor" approximation of y_{n+1}, obtained by extrapolating the past solution values y_n, y_{n−1}, …, rather than by using y_n directly.

The Newton-Raphson iterations proceed by repeatedly solving linear systems whose matrix is M = I − h β J, where J is the Jacobian of the ODE function evaluated at the predicted solution; the converged iterate is the numerical solution y_{n+1} at time t_{n+1}.

After the iterations converge, the local truncation error is estimated by

If ‖Err_{n+1}‖ ≤ f_safe the solution y_{n+1} is accepted; otherwise the step is rejected and recomputed with a smaller step size. A typical safety factor is f_safe = 0.9. In both situations a change in step size and/or order is considered in order to maximize computational efficiency.

Note that VODE uses a variable-coefficient implementation (fixed-leading coefficient form) instead of the fixed-step-interpolate methods in LSODE. The fixed-leading coefficient form shows better performance on many, though not all, stiff problems. KPP offers interfaces to both LSODE and VODE modified to use the optimized sparse linear algebra routines generated by KPP.

A general s-stage implicit Runge-Kutta method reads [

y_{n+1} = y_n + h ∑_{i=1}^{s} b_i k_i,   k_i = f( t_n + c_i h, y_n + h ∑_{j=1}^{s} a_{ij} k_j ),

and is defined by its coefficients a_{ij}, b_i, and c_i. In a fully implicit method all the stage vectors k_1, …, k_s are coupled, so a nonlinear system of dimension s·d has to be solved at each step, with iteration matrices of dimension (s·d) × (s·d) rather than d × d.

A Singly Diagonally-Implicit Runge-Kutta (SDIRK) method is a special case of the fully implicit Runge-Kutta method, with coefficients satisfying a_{ij} = 0 for j > i and equal diagonal entries a_{ii} = γ. The stage equations then decouple and can be solved one after another; each stage requires the solution of a nonlinear system of dimension d only, and all the stages share the same iteration matrix I − h γ J, with J the Jacobian evaluated at (t_n, y_n).

An estimator of the local truncation error is obtained with the help of the embedded formula
A second solution ŷ_{n+1} is computed using the already computed increment vectors k_i, but with different weights; the order of ŷ_{n+1} is one less than that of y_{n+1}. The difference vector Err_{n+1} = ŷ_{n+1} − y_{n+1} provides the local error estimator. As pointed out in [

The step adjustment strategy uses a weighted norm ‖Err_{n+1}‖ based on the user specified relative and absolute tolerances. The step is accepted if ‖Err_{n+1}‖ ≤ f_safe, with the safety factor f_safe = 0.9.

The new step size is estimated by the asymptotic formula

h_new = h · min( f_max, max( f_min, f_safe · ‖Err_{n+1}‖^{−1/(p+1)} ) ),

where p is the order of the method, f_max is an upper bound, and f_min a lower bound on the step change factor. Typical values are f_max = 10, f_min = 0.1, and f_safe = 0.9. If the step size has been recently rejected, the allowed increase factor is further limited (e.g., f_max = 1 following a rejection). Furthermore, the step size is constrained to lie within prescribed bounds, h_min ≤ h ≤ h_max, starting from an initial step size h_start.
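The step size controller just described can be sketched as follows (illustrative Python; the parameter names are generic, not taken from any particular code):

```python
def next_step_size(h, err_norm, order,
                   f_safe=0.9, f_min=0.1, f_max=10.0,
                   h_min=1e-10, h_max=3600.0, rejected_last=False):
    """Asymptotic step size controller.

    err_norm is the weighted local error norm; order is the order p of
    the method, so the local error behaves like h**(p+1). The change
    factor is clipped to [f_min, f_max], and the new step to [h_min, h_max].
    """
    if rejected_last:
        f_max = 1.0                      # forbid growth right after a rejection
    factor = f_safe * err_norm ** (-1.0 / (order + 1))
    factor = min(f_max, max(f_min, factor))
    return min(h_max, max(h_min, h * factor))

# An error norm of exactly 1 with a 3rd order method shrinks the step
# by the safety factor: 10.0 -> 9.0.
print(next_step_size(10.0, 1.0, 3))
```

The clipping bounds implement the aggressive-yet-cautious behavior discussed in the text: large steps are taken when the error is small, but never more than f_max times the previous step.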

Several Runge Kutta methods are available in the KPP numerical library. The fully implicit schemes implemented are the 3-stage Radau-IIa, Radau-Ia, Lobatto-IIIc, and Gauss methods [

Rosenbrock methods are competitive with other stiff solvers for low to modest accuracy, and therefore are attractive for atmospheric chemistry applications [
The simplest member of the family is the linearly implicit Euler method, which computes y_{n+1} by solving a single linear system to obtain an increment vector

A general s-stage Rosenbrock method computes the increment vectors k_i from a sequence of s linear systems:

( I − h γ_{ii} J ) k_i = h f( y_n + ∑_{j<i} α_{ij} k_j ) + h J ∑_{j<i} γ_{ij} k_j + h² γ_i f_t,   i = 1, …, s,   y_{n+1} = y_n + ∑_{i=1}^{s} b_i k_i,

where J is the Jacobian and f_t the time partial derivative of the ODE function, both evaluated at (t_n, y_n). The method is defined by its coefficients α_{ij}, γ_{ij}, γ_i, and b_i.

For implementation purposes, it is advantageous to choose all the diagonal coefficients equal to each other, γ_{ii} = γ for all i: a single LU decomposition of the matrix I − h γ J then serves all s stages. A standard change of stage variables further avoids the d² multiplications otherwise needed to form the matrix-vector products with J at each stage.
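The shared-matrix structure can be seen in a sketch of the classical two-stage, second order ROS2 method (illustrative Python with NumPy; γ = 1 + 1/√2 is the standard choice, and a dense solve stands in for the sparse LU used in production codes):

```python
import numpy as np

GAMMA = 1.0 + 1.0 / np.sqrt(2.0)  # standard ROS2 diagonal coefficient

def ros2_step(f, jac, y, h):
    """One step of the 2-stage ROS2 Rosenbrock method (autonomous system).

    Both stages solve a linear system with the SAME matrix I - h*gamma*J,
    so only one LU factorization per step is needed, and there are no
    Newton iterations at all (the method is linearly implicit).
    """
    d = y.size
    M = np.eye(d) - h * GAMMA * jac(y)    # factorized once per step in practice
    k1 = np.linalg.solve(M, f(y))
    k2 = np.linalg.solve(M, f(y + h * k1) - 2.0 * k1)
    return y + h * (1.5 * k1 + 0.5 * k2)

# Linear decay test problem y' = -y: one thousand steps of h = 1e-3
# bring y from 1 to approximately exp(-1).
f = lambda y: -y
jac = lambda y: -np.eye(1)
y = np.array([1.0])
for _ in range(1000):
    y = ros2_step(f, jac, y, 1e-3)
```

Taking a single huge step (h·λ = 100) on the same problem still yields a small, bounded result, illustrating the stability that makes the method usable for stiff chemistry.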

The local error estimator for Rosenbrock methods is based on an embedded formula, similar to the Runge-Kutta case. The step is accepted if the weighted error norm is below the safety factor f_safe = 0.9, and rejected otherwise. The next time step size is calculated by the same asymptotic formula.

An important sub-class are the Rosenbrock-W methods, which allow the use of any approximation

Careful benchmarks of stiff solvers [

Several Rosenbrock methods are available in the KPP numerical library. They are Rodas (the 6-stage method based on a stiffly accurate pair of order 4(3) [

This family of methods constructs numerical solutions by applying Richardson extrapolation to a sequence of low order approximations, each made with a different step size [

Consider a sequence of step sizes h_1 > h_2 > h_3 > … defined by h_j = H/n_j, where n_1 < n_2 < n_3 < … is an increasing sequence of integers. For each j, an approximation T_{j,1} of y(t_{n+1}) is obtained as follows: start from y_n and integrate the system over [t_n, t_n + H] using n_j steps of size h_j with a low order base method (e.g., the linearly implicit Euler method). For smooth problems the global error of the base method admits an asymptotic expansion in powers of the step size, with coefficient functions that do not depend on h, for t_n ≤ t ≤ t_{n+1}. (For very stiff problems a different, perturbed expansion of the global error holds [
Richardson extrapolation combines the approximations T_{j,1} so as to cancel the leading error terms; with a first order base method the extrapolated value T_{j,k+1} is of order k + 1.
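A minimal sketch of one extrapolation macro-step with a linearly implicit Euler base method and the harmonic sequence n_j = j (illustrative Python with NumPy):

```python
import numpy as np

def implicit_euler_substeps(f, jac, y, H, n):
    """Integrate y' = f(y) over a macro-step H using n linearly implicit
    Euler steps: each substep solves (I - h*J) dy = h*f(y) with the
    Jacobian frozen at the start of the substep, so no Newton iterations
    are needed."""
    h = H / n
    for _ in range(n):
        M = np.eye(y.size) - h * jac(y)
        y = y + np.linalg.solve(M, h * f(y))
    return y

def extrapolated_step(f, jac, y, H):
    """Order-2 entry of the Richardson extrapolation tableau:
    T22 = 2*T21 - T11 cancels the leading O(h) error term of the
    first order base method."""
    T11 = implicit_euler_substeps(f, jac, y, H, 1)   # one step of size H
    T21 = implicit_euler_substeps(f, jac, y, H, 2)   # two steps of size H/2
    return 2.0 * T21 - T11
```

Higher columns of the tableau (T33, T44, …) cancel further error terms in the same way, which is how codes such as SEULEX reach high, variable orders.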

The KPP numerical library offers an interface to the SEULEX code [

In this section we discuss two approaches to improve the computational efficiency of the chemical kinetic solvers in air quality models. The first approach is the use of sparse linear algebra, and the second is harnessing the power of modern accelerator architectures.

In a chemical kinetic solver, most of the computational effort is spent in solving the linear systems associated with the implicit time integration algorithms. For all methods discussed here the matrix of coefficients is of the form

In a typical chemical mechanism, the pattern of chemical interactions leads to a Jacobian that has the majority of entries equal to zero.

Linear algebra algorithms can take advantage of this to avoid unnecessary operations and greatly reduce CPU time. Since the sparsity structure depends only on the chemical network (and not on the values of concentrations or rate coefficients) it can be computed offline [
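The offline analysis can be sketched as a symbolic LU fill-in computation on the Boolean sparsity pattern (illustrative Python; production tools such as KPP also reorder the species so as to minimize fill):

```python
def symbolic_lu_fill(n, pattern):
    """Compute the sparsity pattern of the LU factors (no pivoting).

    pattern is a set of (row, col) index pairs of structurally nonzero
    Jacobian entries; the returned set adds the fill-in entries created
    during Gaussian elimination. This runs once, offline, so the numeric
    factorization can loop over a fixed, precomputed index list.
    """
    filled = set(pattern) | {(i, i) for i in range(n)}  # keep the diagonal
    for k in range(n):
        rows = [i for i in range(k + 1, n) if (i, k) in filled]
        cols = [j for j in range(k + 1, n) if (k, j) in filled]
        for i in rows:
            for j in cols:
                filled.add((i, j))       # eliminating column k fills entry (i, j)
    return filled

# Arrow-shaped pattern pointing to the top-left corner: eliminating the
# first species couples every remaining pair, so the factors fill in
# completely despite the sparse input.
n = 4
arrow = {(0, j) for j in range(n)} | {(i, 0) for i in range(n)}
fill = symbolic_lu_fill(n, arrow)
```

Reordering the same pattern so that the dense row and column come last produces no fill-in at all, which is why species ordering matters so much in practice.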

Recent developments in multi-core chipset architectures can be leveraged to reduce chemical simulation runtime. In general, good performance is achieved by using every tier of heterogeneous parallelism available to the model. Chemical kinetics are embarrassingly parallel between grid cells, so there is abundant data parallelism (DLP). Within the solver itself, the ODE system is coupled so that, while there is still some data parallelism available in lower-level linear algebra operations, parallelization is limited largely to the instruction level (ILP). (Some specific chemical mechanisms are only partially coupled and can be separated into a small number of sub-components, but such inter-module decomposition is rare.) Thus, a three-tier parallelization is possible: ILP on each core, DLP using single-instruction-multiple-data (SIMD) features of a single core, and DLP across multiple cores (using multi-threading) or nodes (using MPI). The coarsest tier of MPI and OpenMP parallelism is typically supplied by the atmospheric model.

This section presents parallelization strategies for Rosenbrock integration in one-cell-per-thread, N-cells-per-thread, and 4/2-cells-per-thread decompositions. For performance benchmarks of parallelized Rosenbrock solvers, see [

Although not an “accelerated” architecture, multi-threaded CPUs are common, inexpensive, and a mature target platform. Modest performance improvements are achievable by parallelizing the Rosenbrock integrator via OpenMP. Since the chemistry at each grid cell is independent, the outermost iteration over grid cells is the thread-parallel dimension; that is, a one-cell-per-thread decomposition. Within the integrator itself, the inseparable Jacobian matrix prohibits direct parallelization, though SIMD instructions may be introduced by the compiler for a small intra-integrator performance improvement. The principal disadvantage of this architecture is a relatively low peak performance.

A CUDA implementation takes advantage of the high degree of parallelism and independence between cells in the simulation. The outermost loops of the solver are kept on the CPU and the GPU is used to accelerate the innermost computational kernels. Time loops, Runge-Kutta loops, and error control branch-back logic are executed on the CPU. LU decomposition and solve, the ODE function evaluation, Jacobi matrix operations, and BLAS operations, are coded and invoked as separate kernels on the GPU. All data for the solver is resident on the GPU and arrays are stored with cell-index stride-one so that adjacent threads access adjacent words in memory to coalesce access to the chemical data across threads. Under this paradigm, parallelism occurs within the solver across grid cells, rather than external to the solver and across grid cells as on a multi-core CPU. Although each GPU thread still processes only one cell, the exact mapping of threads to grid cells is handled by the GPU hardware, effectively achieving an N-cells-per-thread decomposition, where N is the total number of grid cells in the simulation.

This implementation is easy to debug and profile since the GPU code is spread over many small kernels with control returning frequently to the CPU. Additionally, resource bottlenecks such as register pressure and shared-memory usage are limited to only those affected kernels. Performance critical parameters such as the size of thread blocks and shared-memory allocation can be adjusted and tuned separately, kernel-by-kernel, without subjecting the entire solver to worst-case limits. One disadvantage is that all N grid cells are forced to use the minimum time step and iterate the maximum number of times, even though only a few cells will typically require that many iterations to converge. The overhead of these additional iterations can be mitigated by storing the per-cell time, time step length, and error in a vector and using vector masks to “turn off” cells that have converged. The solver still performs the maximum number of iterations, but thread-blocks assigned to cells that have converged do little or no work and relinquish the GPU cores quickly.
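The masking idea can be sketched in vectorized NumPy (an illustrative stand-in for the per-thread GPU logic), where converged cells drop out of the work while the loop structure stays uniform:

```python
import numpy as np

def integrate_cells(lam, h, t_end, max_iters=10000):
    """Advance y' = -lam*y independently in every grid cell (implicit Euler).

    lam holds one decay rate per grid cell and h one step length per cell,
    stored as vectors as described in the text. The 'active' mask turns off
    cells that have reached t_end, so the loop runs for as long as the
    slowest cell needs while finished cells do no further updates.
    """
    y = np.ones_like(lam)
    t = np.zeros_like(lam)
    for _ in range(max_iters):
        active = t < t_end                       # vector mask over grid cells
        if not active.any():
            break
        y = np.where(active, y / (1.0 + h * lam), y)
        t = np.where(active, t + h, t)
    return y, t

# Three cells with different per-cell step lengths finish after 2, 4,
# and 8 iterations respectively; the mask freezes each one as it finishes.
lam = np.array([1.0, 1.0, 1.0])
h = np.array([0.5, 0.25, 0.125])
y_cells, t_cells = integrate_cells(lam, h, t_end=1.0)
```

On a GPU the same effect is obtained with per-thread predicates rather than `np.where`, but the trade-off is identical: uniform control flow in exchange for some wasted lanes.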

A CUDA implementation is straight-forward to program, but may prove difficult to optimize. CUDA's automatic thread management and familiar programming environment make solver implementations simple to conceive and implement. However, a deep understanding of the underlying architecture is required to achieve good performance. For example, memory access coalescing is one of the most powerful features of the GPU architecture, yet CUDA neither hinders nor promotes program designs that leverage coalescing. The principal limitation on performance is the size of the on-chip shared memory and register file, which prevent large-footprint applications from running sufficient numbers of threads to expose parallelism and hide latency to the device memory. In general, GPU implementations of the Rosenbrock solver are faster than multi-core CPU implementations, but by less than a factor of two. From a power consumption standpoint, this makes them less efficient than multi-core CPUs in this arena.

The heterogeneous Cell Broadband Engine Architecture can achieve exceptionally high levels of performance for the Rosenbrock integrator, yet its complexity and uniqueness make it difficult to program. As a heterogeneous architecture, a homogeneous one-cell-per-thread decomposition across all cores will not achieve maximum performance. A master-worker approach resulting in multiple grid cells processed per thread is more appropriate.

The Power Processing Element (PPE), with full access to main memory, is the master. It prepares the model data for processing by the Synergistic Processing Elements (SPEs) by padding and aligning data to comply with architectural restrictions. The Rosenbrock solver tends to be computation-bound, so the PPE has ample time to maintain a buffer of aligned data. The SPEs implement a 128-bit SIMD instruction set architecture. Hence, every cycle operates on 128-bit vectors of either four single precision or two double precision floating point numbers. The data of two or four grid cells, depending on floating point precision, are packaged together by the PPE into a single padded, aligned, and buffered payload for processing by the SPEs (a so-called “vector cell”). This achieves a four-cells-per-thread (two-cells-per-thread in double precision) decomposition.

A small change is required in the Rosenbrock integrator design to operate on a vector cell. Typically, the integrator adapts the time step size independently for each grid cell; when several cells are packed into a vector cell, a single step size must be shared, so the smallest acceptable step among the packed cells is used.

Writing chemical kinetics code is often tedious and error-prone work. The Kinetic PreProcessor (KPP) [

KPP makes it possible to rapidly generate correct and efficient chemical kinetics solvers on scalar architectures, but these generated codes cannot be easily ported to multi-core accelerated or heterogeneous architectures. KPPA (the Kinetics PreProcessor: Accelerated) [

KPPA combines a general analysis tool for chemical kinetics with a code generation system for scalar, homogeneous multi-core, and heterogeneous multi-core architectures. It is written in object-oriented C++ with a clearly-defined upgrade path to support future multi-core architectures as they emerge. KPPA has all the functionality of KPP 2.1 and maintains backwards compatibility with KPP. Many atmospheric models, including WRF-Chem and STEM, support a number of chemical kinetics solvers that are automatically generated at compile time by KPP. Reusing these analysis techniques in KPPA insures its accuracy and applicability.

KPPA's code generation component accommodates a two-dimensional design space of programming language/target architecture combinations superseding the one-dimensional design space of KPP (

KPPA's key feature is its ability to generate fully-unrolled, platform-specific sparse matrix/matrix and matrix/vector operations that achieve very high levels of efficiency. As KPPA parses its input, language independent expression trees describing sparse matrix/matrix or matrix/vector operations are constructed in memory. For example, the aggregate ODE function of the chemical mechanism is calculated by multiplying the left-side stoichiometric matrix by the concentration vector, and then combining the result according to the stoichiometric matrix. KPPA performs these operations symbolically at code generation time, using the matrix formed by the analytical component and a symbolic vector, which will be calculated at run-time. The result is an expression tree of language-independent arithmetic operations and assignments, equivalent to a rolled-loop sparse matrix/vector operation, but in completely unrolled form.
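The unrolling idea can be sketched as follows (illustrative Python; KPPA itself emits C or Fortran and handles far more than this toy matrix-vector product):

```python
import numpy as np

def generate_unrolled_matvec(name, pattern, n_rows):
    """Emit fully unrolled source for f = A @ y, with A's structure fixed.

    pattern maps (row, col) -> index into the flat array of nonzero values.
    Because the structure is known at generation time, the emitted code has
    no loops, no index arithmetic, and no branches.
    """
    lines = [f"def {name}(a, y, f):"]
    for i in range(n_rows):
        terms = [f"a[{k}]*y[{j}]"
                 for (r, j), k in sorted(pattern.items()) if r == i]
        lines.append(f"    f[{i}] = " + (" + ".join(terms) if terms else "0.0"))
    return "\n".join(lines)

# Toy 3x3 sparse structure with 4 structural nonzeros.
pattern = {(0, 0): 0, (0, 2): 1, (1, 1): 2, (2, 0): 3}
src = generate_unrolled_matvec("sparse_matvec", pattern, 3)
namespace = {}
exec(src, namespace)                     # "compile" the generated kernel

a = np.array([2.0, 3.0, 4.0, 5.0])      # nonzero values, in pattern order
y = np.array([1.0, 2.0, 3.0])
f = np.zeros(3)
namespace["sparse_matvec"](a, y, f)      # f[0] = a[0]*y[0] + a[1]*y[2], etc.
```

A compiler sees the generated body as straight-line arithmetic, which is what allows the dual-issue pipelines of the SPEs (and other targets) to stay full.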

KPPA uses its knowledge of the target architecture to generate highly-efficient code from the expression tree. Vector types are preferred when available, branches are avoided on all architectures, and parts of the function can be rolled into a tight loop if KPPA determines that on-chip memory is a premium. An analysis of four KPPA-generated ODE functions and ODE Jacobians targeting the CBEA showed that, on average, both SPU pipelines remain full for over 80% of the function implementation. Pipeline stalls account for less than 1% of the cycles required to calculate the function. For example, in the SAPRCNOV mechanism on CBEA, there are only 20 stalls in the 2989 cycles required by the ODE function (0.66%), and only 24 stalls in the 5490 cycles required for the ODE Jacobian (0.43%). Code of this caliber typically requires meticulous hand-optimization, but KPPA is able to generate this code automatically in seconds. See [

For the numerical results we use the CAABA box model with MECCA chemistry [
In addition to the basic HO_x and NO_x chemistry, halogen (Cl, Br, I) and sulfur chemistry are also considered. The full mechanism including rate coefficients can be found in the supplement.

A reference solution y^{ref} has been computed with the Radau-5A numerical method implemented in the KPP Runge Kutta suite, with the tight tolerances RelTol = 10^{−10} and AbsTol = 10^{2} molecules/cm^{3}. The accuracy of each numerical solution y is measured by its RMS error, defined as the root mean square of the relative errors (y_i − y_i^{ref})/y_i^{ref} over the well-resolved species: an indicator takes a value equal to one if the reference concentration of the i-th species exceeds a threshold (in molecules/cm^{3}), and zero otherwise, so that species with near-zero concentrations do not dominate the norm.
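A sketch of such a thresholded RMS error computation (illustrative Python; the threshold value here is an assumption, standing in for the paper's criterion):

```python
import numpy as np

def rms_relative_error(y, y_ref, threshold=1.0e2):
    """RMS of the relative errors over well-resolved species only.

    Species whose reference concentration is at or below 'threshold'
    (e.g., in molecules/cm^3) are excluded, since relative errors are
    meaningless for near-zero concentrations.
    """
    mask = y_ref > threshold
    rel = (y[mask] - y_ref[mask]) / y_ref[mask]
    return np.sqrt(np.mean(rel ** 2))

y_ref = np.array([1.0e8, 5.0e3, 1.0e-3])   # last species falls below threshold
y     = np.array([1.1e8, 4.5e3, 2.0e-3])   # 10% errors on the first two species
err = rms_relative_error(y, y_ref)          # close to 0.1
```

Without the mask, the 100% relative error on the near-zero third species would dominate the norm and make the diagram meaningless.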

The work-precision diagrams for several numerical integrators are shown in the corresponding figure. Relative tolerances spanning the range [10^{−6}, 10^{−1}] have been used to generate the different points on each curve.

One of the most computationally demanding tasks in atmospheric chemical transport simulations is the solution of chemical kinetic processes. Special considerations need to be taken into account when designing chemistry time integration algorithms. Unconditionally stable methods are needed due to the stiff nature of the equations; such algorithms perform expensive solves of (non)linear systems of equations at each stage. The accuracy requirements are relatively low, with relative errors of about 0.1%, but compute times must be low as well. This is achieved with algorithms that quickly adjust the time step size through efficient step size control mechanisms. A challenge comes from the fact that error estimators based on asymptotic formulas may not work well in the low accuracy regime. Closed chemical systems preserve mass and charge, and the concentrations remain positive. It is desirable to have numerical solvers that also preserve these properties. Conservation is easy to achieve, but positivity requires more involved computations.

Several families of solvers that are suitable for atmospheric chemical kinetics were discussed: QSSA, BDF, implicit Runge-Kutta, Rosenbrock, and Extrapolation. Special implementations of general purpose methods have taken the place of special integrators (e.g., QSSA) during the last decade. Among them the Rosenbrock methods have become popular due to their efficiency at moderate accuracy requirements.

A major goal when implementing a chemical solver is efficiency, since many copies of the chemical mechanism (one per grid cell) need to be solved at each operator split cycle. Careful exploitation of the Jacobian structure and the use of efficient sparse linear algebra operations are key to obtaining efficiency. The ideal parallelism between the chemical tasks in different grid cells can be exploited either by domain decomposition, or by vectorization. The latter approach has found renewed interest in the context of modern heterogeneous multicore (accelerator) architectures.

All the numerical methods discussed in this paper are implemented in the KPP numerical library [

With the increase in complexity of gas phase chemical mechanisms, and the frequent inclusion of aqueous and heterogeneous phase chemistry in three dimensional simulations, the importance of efficient and robust solvers for atmospheric chemistry models is expected to continue to increase in future.

Sparsity of the Jacobian of the MECCA chemical mechanism used in Section 5.

Work precision diagrams for stiff integration methods applied to the MECCA chemical mechanism.

Requirements and challenges for the numerical solution of atmospheric chemical kinetics.

Accuracy | Relative error under 0.1% | Quickly adjust step size; estimate error outside asymptotic regime |

Stiffness | Unconditional stability | Solve nonlinear system of equations at each step |

Special properties | Mass and charge balance; positive concentrations | Linear invariants easy to preserve; enforcing positivity requires special methods |

Efficiency | Deliver target accuracy in the shortest possible CPU time | Repeated LU factorizations are expensive; step control should be aggressive, yet avoid many step rejections |

Language/architecture combinations supported by KPP and KPPA.


This work has been supported in part by NSF through awards NSF OCI-0904397, NSF CCF-0916493, NSF DMS0915047, and by the United States Department of Defense High Performance Computing Modernization Program through an NDSEG fellowship.

The paper is dedicated to the memory of Dr. Daewon Byun, whose work remains a lasting legacy to the field of air quality modeling and simulation.