1. Introduction
Tumor angiogenesis, the formation of new blood vessels from pre-existing vasculature in response to tumor-secreted growth factors, is a critical mechanism that enables solid tumors to grow beyond the diffusion-limited diameter of 1–2 mm. The Anderson–Chaplain model [1], which forms the foundation of our numerical investigation, describes angiogenesis via four coupled nonlinear parabolic PDEs whose state variables interact through nonlinear chemotactic and haptotactic flux terms. After spatial discretization, the stiffness of the resulting system scales as $\mathcal{O}(h^{-2})$, where $h$ denotes the spatial mesh spacing.
Numerical solution of stiff reaction-diffusion systems presents two distinct but interconnected computational challenges. First, explicit time-stepping methods impose a stability constraint $\Delta t \le C h^{2}$, where $\Delta t$ represents the temporal discretization parameter and $C$ is a problem-dependent constant typically on the order of unity. The spatial resolutions required to capture sharp gradients at the angiogenic front yield prohibitively small time steps under this restriction. Second, fully implicit methods eliminate this constraint but require solving a nonlinear system at each time level, which entails iterative procedures with multiple linear solves per time step. Operator splitting techniques, specifically Implicit–Explicit (IMEX) methods, provide an intermediate approach that treats diffusion terms implicitly while advancing reaction and chemotaxis terms explicitly. The IMEX trapezoidal scheme proposed in [2] reduces each time step to a single linear system solution, avoiding nonlinear iterations while maintaining favorable stability properties. However, solving the resulting linear systems via Gaussian elimination with partial pivoting has cubic computational complexity in the number of spatial grid points, rendering sequential CPU implementations prohibitively expensive for realistic discretizations. GPU acceleration has demonstrated substantial performance gains across scientific computing applications, with vendor-optimized libraries such as cuBLAS, cuSOLVER [3], MAGMA [4,5], and AmgX [6] achieving speedups of 10–100-fold relative to CPU implementations. While iterative solvers with algebraic multigrid preconditioning show excellent performance for large-scale stiff PDEs [7,8], the moderate system dimensions and block-structured coupling inherent in the tumor angiogenesis model suggest that GPU-accelerated direct methods merit investigation.
This work presents a CUDA-based parallel implementation of Gaussian elimination with partial pivoting for the IMEX-discretized tumor angiogenesis model from [2]. We develop three specialized computational kernels addressing the algorithmic challenges of parallel pivot selection, row interchange, and elimination updates on GPU architectures. Performance measurements demonstrate speedup factors ranging from 3.47-fold to 113.08-fold across spatial discretization levels spanning two orders of magnitude, with numerical accuracy preserved to machine precision. Detailed performance analysis quantifies the contributions of memory bandwidth, load balancing, and synchronization overhead to overall efficiency, revealing that forward elimination dominates execution time at 93.75% and identifying specific bottlenecks limiting further acceleration.
The rest of this paper is organized as follows:
Section 2 reviews related work on GPU-based acceleration of stiff ODE/PDE solvers and examines vendor-optimized libraries for linear algebra operations.
Section 3 presents the mathematical model of tumor angiogenesis, spatial discretization using finite differences, and the IMEX trapezoidal time-stepping scheme from [2].
Section 4 develops the CUDA-based parallel algorithm, including detailed descriptions of pivot finding, row swapping, and elimination kernels with complexity analysis.
Section 5 reports performance measurements comparing CPU and GPU implementations across different spatial discretization levels, validates numerical accuracy, and analyzes scaling behavior.
Finally, Section 6 concludes the paper.
2. Related Works
In recent years, GPU-based acceleration of stiff ODE/PDE solvers has attracted growing attention, particularly in applications such as chemical systems, fluid dynamics, and biological modeling [9,10,11]. In the following, we review key works on CUDA implementations of implicit or semi-implicit time integration schemes, solvers for linear systems, and hybrid or advanced parallelization strategies. Several vendor-optimized libraries (e.g., cuBLAS, cuSOLVER, MAGMA, AmgX) have been developed to offload dense or sparse linear algebra operations to GPUs [4,12]. These libraries show that delegating the computationally intensive parts (factorizations, triangular solves, sparse matrix–vector products, multigrid preconditioning) to GPU kernels can yield substantial speedups over CPU-based solutions. For instance, Al Farhan et al. [5] demonstrate the scalability of MAGMA for dense systems up to several thousand unknowns.
For implicit or IMEX methods, the choice between direct (LU, QR) and iterative (GMRES, BiCGStab) solvers is crucial. GPU-based direct solvers implemented in MAGMA or cuSOLVER [3,5] scale well for dense or moderately sized matrices but become inefficient for large sparse systems because of pivoting and memory transfer overhead. In contrast, iterative solvers with algebraic multigrid (AMG) or ILU preconditioners, such as those in AmgX [13,14] or Ginkgo [15], show good performance for stiff reaction–diffusion problems. Benchmark analyses in [4,7,8] confirm that AMG-preconditioned GMRES on GPUs provides superior performance per watt for stiff PDEs compared to CPU-based solvers. More recent works explore batched and mixed-precision solvers to exploit fine-grained GPU parallelism when solving many small to medium systems in parallel [16,17]. For example, Haidar et al. [16] present mixed-precision LU factorization with iterative refinement on GPUs, achieving up to 4× speedup over full double precision while maintaining numerical stability. Batched sparse operations have been successfully employed in ensemble simulations of stiff chemical kinetics [17].
Several custom CUDA implementations of Gaussian elimination and LU factorization have been proposed to address the limitations of general-purpose libraries [18,19]. Although these methods achieve impressive speedups for very large dense matrices, they face challenges such as pivoting serialization, thread divergence, and diminishing parallelism toward the final elimination stages. Studies such as [18,19] show that hybrid CPU–GPU approaches can alleviate these bottlenecks and outperform pure GPU strategies for moderately sized systems. Beyond spatial parallelization, parallel-in-time methods such as Parareal have been adapted to settings that use GPU backends alongside CPU coarse components [20,21]. For example, Arteaga et al. [21] use a GPU backend for the spatial components and apply coarse–fine steps in a time-parallel fashion. More recently, RandNet–Parareal extends this idea by using neural networks for the coarse propagator over PDEs, demonstrating speedups and applicability to diffusion-reaction problems [22]. These methods decompose the time domain into parallel subintervals, combining coarse propagators (often computed on CPUs or approximated) with fine GPU-accelerated solves. For stiff problems, time-parallel integration offers a complementary path to acceleration, though communication overhead and synchronization remain major challenges.
Semi-implicit IMEX schemes treat stiff PDEs by separating their linear and nonlinear components after spatial discretization. Since each time step requires solving a linear system for the updated solution, the literature provides several practical guidelines, such as the following:
Dense or moderately sized systems: use vendor libraries like MAGMA or cuSOLVER for optimized dense algebra.
Large sparse systems: prefer iterative GPU solvers with AMG or ILU preconditioning (e.g., AmgX, Ginkgo).
Many small systems: leverage batched and mixed-precision GPU routines.
Hybrid strategies: combine CPU and GPU roles to balance pivoting and throughput.
Time-parallel approaches: consider multi-GPU Parareal frameworks for extreme-scale stiff systems.
3. Mathematical and Numerical Model
The tumor angiogenesis process is governed by a system of coupled nonlinear parabolic partial differential equations posed on a bounded one-dimensional spatial domain over a finite time interval. Following the model presented in [2], we consider four state variables: the endothelial cell density $C(x,t)$, the protease concentration $P(x,t)$, the inhibitor concentration $I(x,t)$, and the extracellular matrix (ECM) density $F(x,t)$. The evolution of these quantities is described by the nonlinear system:
where the tumor angiogenic factor (TAF) distribution is given by
which models the spatial gradient of vascular endothelial growth factor (VEGF-A) established by hypoxic tumor cells localized at
. The scaling parameter
yields a decay length
consistent with measured VEGF-A gradients in tumor microenvironments. This gradient provides the primary chemotactic signal directing endothelial cell migration from pre-existing vessels at
toward the tumor boundary.
In the endothelial cell equation (C), the diffusion term
captures random motility via Brownian motion with intrinsic cell motility of approximately 1
/s. The term
models chemotaxis toward inhibitor gradients, where, counter-intuitively, angiostatin acts as an attractant at low concentrations. Haptotaxis along ECM density gradients is captured by
, representing fibronectin binding via
integrin. The term
describes saturable chemotaxis toward VEGF-A with Michaelis–Menten receptor saturation (
), while
represents logistic proliferation with unit carrying capacity due to contact inhibition at confluence. The protease equation (
P) couples production through
and
(basal secretion), with decay via
(natural degradation) and inhibition through
. The inhibitor equation (
I) exhibits decay solely through protease binding via mass-action kinetics (
), while the ECM equation (
F) describes irreversible degradation via proteolytic cleavage (
: MMP-2-mediated collagen type IV breakdown). The system is subject to homogeneous Neumann boundary conditions
for
, representing no-flux conditions at the domain boundaries. The diffusion coefficients
quantify molecular diffusivity, the chemotactic and haptotactic sensitivity parameters
govern directed cell migration, and the kinetic rate constants
characterize proliferation, degradation, binding, and production processes. Parameters are dimensionless, scaled by reference length
mm and time
h. Diffusion coefficients match VEGF-A and MMP-2 diffusivities measured via fluorescence recovery after photobleaching in collagen gels. Kinetic rates
are fitted via least-squares minimization to match experimentally observed endothelial cell migration speeds
–50
m/h in Boyden chamber work [
23]. Chemotactic parameters
are calibrated via Morris sensitivity screening against capillary density time series from Matrigel plug assays, selecting values with
correlation.
In order to obtain a computationally tractable discrete approximation, we introduce a uniform spatial mesh $\{x_i\}_{i=0}^{M}$ with grid spacing $h$. For any sufficiently smooth function $u(x)$, we denote its grid function by $u_i \approx u(x_i)$ and introduce the standard centered finite difference operators. The first derivative is approximated by
$$(\delta_x u)_i = \frac{u_{i+1} - u_{i-1}}{2h}, \qquad i = 0, \dots, M,$$
while the second derivative employs
$$(\delta_x^2 u)_i = \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2}$$
for the same index range. The no-flux boundary conditions are enforced through the ghost point relations $u_{-1} = u_{1}$ and $u_{M+1} = u_{M-1}$, which ensure $(\delta_x u)_0 = (\delta_x u)_M = 0$ to machine precision. These approximations exhibit second-order truncation error $\mathcal{O}(h^2)$ for functions with bounded fourth derivatives, as verified through Taylor series expansion. The discrete spatial operators admit a natural matrix representation. The discrete Laplacian takes the following matrix form:
Concatenating the state variables into a single vector
, the spatially semi-discretized system assumes the compact form
where
represents the linear differential operators and
encapsulates the nonlinear reaction and chemotaxis terms. Considering endothelial cells, the only linear term is diffusion
, while chemotaxis terms
involve products
and, thus, belong to
, yielding
and
for
. For proteases, the linear terms comprise diffusion
, production
coupling cells to proteases, and decay
. The production term is linear in
C with spatially varying coefficient
, giving
and
, while the term
is bilinear and belongs to
. For inhibitors, only diffusion
is linear, with the binding term
being bilinear, thus
and
for
. For the ECM, the degradation
is entirely bilinear, rendering the fourth row block identically zero:
for all
j.
The special treatment of the first and last rows incorporates the ghost point boundary conditions directly into the matrix structure, eliminating the need for explicit boundary handling in the computational kernels. Similarly, we define the discrete gradient operator through the entries $-1/(2h)$ on the sub-diagonal and $+1/(2h)$ on the super-diagonal for the interior rows $i = 1, \dots, M-1$, with all other entries vanishing. The boundary rows remain identically zero, consistent with the no-flux conditions.
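As a concrete illustration of the ghost-point construction, the following Python sketch assembles dense versions of the two difference matrices and applies them to simple grid functions. It assumes grid indices $0, \dots, M$ and the reflection relations $u_{-1} = u_1$, $u_{M+1} = u_{M-1}$; the function names (`laplacian_neumann`, `gradient_centered`, `matvec`) are illustrative and are not taken from the implementation described in this paper.

```python
def laplacian_neumann(M, h):
    """Dense (M+1)x(M+1) second-difference matrix with no-flux boundary rows.

    Assumption: ghost points are eliminated by reflection, which doubles the
    single off-diagonal entry in the first and last rows.
    """
    n = M + 1
    a = 1.0 / h**2
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = -2.0 * a
        if i > 0:
            L[i][i - 1] = a
        if i < M:
            L[i][i + 1] = a
    L[0][1] = 2.0 * a        # u_{-1} = u_1
    L[M][M - 1] = 2.0 * a    # u_{M+1} = u_{M-1}
    return L


def gradient_centered(M, h):
    """Dense centered-difference gradient; boundary rows stay zero (no flux)."""
    n = M + 1
    D = [[0.0] * n for _ in range(n)]
    for i in range(1, M):
        D[i][i - 1] = -1.0 / (2.0 * h)
        D[i][i + 1] = 1.0 / (2.0 * h)
    return D


def matvec(A, x):
    """Plain dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


if __name__ == "__main__":
    M, h = 4, 0.25
    x = [i * h for i in range(M + 1)]
    print(matvec(laplacian_neumann(M, h), [3.0] * (M + 1)))  # constant field
    print(matvec(gradient_centered(M, h), x))                # linear field
```

A constant grid function is mapped to zero by the Laplacian, and a linear grid function has unit interior gradient with zero boundary rows, which is a quick sanity check on the boundary treatment.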
Assembling these blocks, the linear operator exhibits a
block structure:
where
denotes the zero matrix and
the identity matrix. The TAF-dependent coupling in the
block arises from the protease production term
in system (
1). The eigenvalue spectrum of
is dominated by the diffusion operators, with the smallest eigenvalue satisfying
, which imposes severe stability restrictions on explicit time-stepping methods.
The nonlinear operator
is constructed from the chemotactic, haptotactic, and reaction terms. For the endothelial cell equation, we have:
for
, where the spatial derivatives are evaluated using the centered difference approximations with appropriate boundary modifications. The product rule
ensures consistency with the continuous formulation. The remaining components follow similarly:
and
The Lipschitz constant of
with respect to the Euclidean norm
can be bounded uniformly on compact subsets of
through careful analysis of the chemotactic terms, yielding
where
due to the presence of gradient operators.
3.1. Time Integration Method and Algorithm
The stiffness inherent in the spatially discretized system necessitates an operator splitting approach that treats the linear diffusion terms implicitly while handling the nonlinear reaction–chemotaxis terms explicitly. We partition the time domain $[0, T]$ into $N$ uniform intervals with time step $\Delta t = T/N$ and discrete time levels $t_n = n\,\Delta t$ for $n = 0, 1, \dots, N$. The Implicit–Explicit (IMEX) trapezoidal scheme, proposed in [2], integrates the linear operator using the second-order Crank–Nicolson method, which is A-stable and therefore unconditionally stable for diffusion-dominated problems, while the nonlinear term is advanced explicitly via the forward Euler method to avoid the computational expense of solving nonlinear systems at each time step.
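Given the solution $\mathbf{U}^n$ at time $t_n$, the update to $\mathbf{U}^{n+1}$ satisfies the discrete evolution equation:
$$\frac{\mathbf{U}^{n+1} - \mathbf{U}^{n}}{\Delta t} = \frac{1}{2}\,\mathbf{L}\left(\mathbf{U}^{n+1} + \mathbf{U}^{n}\right) + \mathbf{N}(\mathbf{U}^{n}), \qquad (6)$$
where $\mathbf{L}$ and $\mathbf{N}$ denote the linear operator and the nonlinear term of the semi-discretized system. Rearranging terms and introducing the notation
$$\mathbf{A} = \mathbf{I} - \frac{\Delta t}{2}\,\mathbf{L}, \qquad \mathbf{B} = \mathbf{I} + \frac{\Delta t}{2}\,\mathbf{L},$$
where $\mathbf{I}$ denotes the identity matrix, we obtain the linear system:
$$\mathbf{A}\,\mathbf{U}^{n+1} = \mathbf{B}\,\mathbf{U}^{n} + \Delta t\,\mathbf{N}(\mathbf{U}^{n}). \qquad (7)$$
The system matrix $\mathbf{A}$ inherits the block structure from $\mathbf{L}$ and, critically, is nonsingular for all $\Delta t > 0$ since the eigenvalues $\lambda$ of $\mathbf{L}$ are real and non-positive (due to the discrete Laplacian being negative semi-definite), implying that the eigenvalues of $\mathbf{A}$ satisfy $1 - \frac{\Delta t}{2}\lambda \ge 1$. The right-hand side vector is explicitly computable from the known data at time level $n$, involving one matrix–vector multiplication with computational cost $\mathcal{O}(M)$ due to the banded structure of $\mathbf{B}$, and one evaluation of the nonlinear operator requiring $\mathcal{O}(M)$ operations.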
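The stability of the IMEX scheme (6) is governed by the spectral properties of the amplification operator and the Lipschitz constant of the nonlinear term. For the linear component, the trapezoidal method applied to the scalar test equation $u' = \lambda u$ yields the amplification factor
$$R(\lambda \Delta t) = \frac{1 + \lambda \Delta t / 2}{1 - \lambda \Delta t / 2}$$
for an eigenvalue $\lambda$ of the linear operator. Since $\lambda \le 0$, we have $|1 + \lambda \Delta t/2| \le |1 - \lambda \Delta t/2|$, ensuring $|R| \le 1$ for all $\Delta t > 0$. This unconditional stability for the linear part eliminates the restrictive $\Delta t \le C h^{2}$ constraint imposed by fully explicit schemes. The explicit treatment of the nonlinear term introduces a stability condition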
, where
is the Lipschitz constant of
. However, since
, the allowable time step satisfies
, which is significantly less restrictive than the
constraint of fully explicit methods. The local truncation error of the IMEX scheme is
for the implicit linear part (trapezoidal method) and
for the explicit nonlinear part (forward Euler), yielding an overall temporal accuracy of
when combined with the
spatial accuracy.
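The unconditional stability of the implicit part can be checked numerically. The sketch below evaluates the trapezoidal amplification factor $(1 + z/2)/(1 - z/2)$ with $z = \lambda \Delta t$ for increasingly stiff modes and contrasts it with the forward Euler factor $1 + z$; the function names are illustrative.

```python
def cn_amplification(z):
    """Trapezoidal (Crank-Nicolson) amplification factor for z = lambda * dt."""
    return (1.0 + 0.5 * z) / (1.0 - 0.5 * z)


def euler_amplification(z):
    """Forward Euler amplification factor for z = lambda * dt."""
    return 1.0 + z


if __name__ == "__main__":
    # Real, non-positive z covers the diffusion-dominated spectrum.
    for z in (-0.1, -2.0, -10.0, -1.0e4):
        print(z, abs(cn_amplification(z)), abs(euler_amplification(z)))
```

For every non-positive $z$ the trapezoidal factor stays within the unit disk, whereas the forward Euler factor exceeds one in magnitude as soon as $z < -2$, which is the origin of the explicit time-step restriction.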
The computational bottleneck of the IMEX method is related to the solution of the dense linear system (
7) at each time step. While
inherits some sparsity from the discrete Laplacian operators in its diagonal blocks, the coupling terms and the
block structure result in a matrix with
nonzero entries. Standard Gaussian elimination with partial pivoting requires
arithmetic operations per time step, dominating the
cost of evaluating
and the
cost of the matrix–vector product
. Over
N time steps, the total complexity becomes
. In case of realistic spatial resolutions
and long-time integrations
, this cubic scaling renders sequential CPU implementations prohibitively expensive, motivating the GPU parallelization strategies developed in the following section.
The computational procedure for advancing the solution from time $t_n$ to $t_{n+1}$ can be formalized algorithmically. Given the current state, we must evaluate the nonlinear term, construct the right-hand side vector, solve the linear system, and enforce physical constraints. The complete IMEX time-stepping procedure is presented in Algorithm 1.
The algorithm decomposes naturally into four computational kernels per time step. The nonlinear evaluation (Step 1) requires $\mathcal{O}(M)$ operations and involves evaluating finite difference approximations and pointwise products, which are trivially parallelizable with perfect load balance. The right-hand side construction (Step 2) performs a sparse matrix–vector multiplication with complexity $\mathcal{O}(M)$ due to the banded structure of the explicit matrix, exploiting the fact that the linear operator contains discrete Laplacian operators with at most five nonzero entries per row. The linear system solution (Step 3) dominates the computational cost at $\mathcal{O}(M^3)$ for the serial Gaussian elimination procedure, making it the primary target for GPU acceleration. Finally, the non-negativity enforcement (Step 4) is an $\mathcal{O}(M)$ post-processing operation that ensures the numerical solution respects the physical constraints inherent in the continuous model, where all state variables represent concentrations or densities that must remain non-negative. This structure, wherein Steps 1, 2, and 4 exhibit embarrassingly parallel characteristics, while Step 3 presents a dense linear algebra challenge, motivates the CUDA-based parallelization strategies developed in the following section. The main observation is that while the nonlinear term
could potentially be evaluated on the GPU and kept in device memory to avoid host–device transfers, the dominant cost lies in solving the implicit linear system
, which must be performed at every time step and accounts for over 95% of the total computational time for typical spatial resolutions
.
| Algorithm 1 IMEX time-stepping for tumor angiogenesis. |
| Require: System matrices , TAF distribution , parameters , time step , final time , initial condition |
| Ensure: Solution trajectory with |
| 1: | Initialization: | |
| 2: | Construct linear operator according to (4) | |
| 3: | Form system matrix | |
| 4: | Form explicit matrix | |
| 5: | for $n = 0$ to $N-1$ do | |
| 6: | Step 1: Nonlinear Evaluation | |
| 7: | Extract components: , , , | |
| 8: | for to M do | |
| 9: | Compute chemotactic flux: | |
| 10: | Compute haptotactic flux: | |
| 11: | Compute TAF chemotaxis: | |
| 12: | | |
| 13: | | |
| 14: | | |
| 15: | | |
| 16: | end for | |
| 17: | Step 2: Right-Hand Side Construction | |
| 18: | | ▹ matrix–vector product |
| 19: | Step 3: Linear System Solution | |
| 20: | Solve via Gaussian elimination with partial pivoting | |
| 21: | Step 4: Enforce Non-Negativity | |
| 22: | for to do | |
| 23: | if then | |
| 24: | | ▹ Physical constraint: concentrations/densities |
| 25: | end if | |
| 26: | end for | |
| 27: | end for | |
| 28: | return
| |
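The four steps of Algorithm 1 can be sketched on a reduced problem. The example below is a simplification and not the coupled four-variable system: it advances a single species obeying $u_t = d\,u_{xx} + r\,u(1-u)$ with no-flux boundaries, and it swaps the dense Gaussian elimination of Step 3 for the tridiagonal Thomas algorithm, which suffices because the single-species matrix is tridiagonal and diagonally dominant. All names and parameter values are hypothetical.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def imex_step(u, dt, h, diff, rate):
    """One IMEX trapezoidal step (Crank-Nicolson diffusion, explicit reaction)."""
    n = len(u)
    a = diff / h**2
    # Bands of the Laplacian with ghost-point (doubled) boundary neighbours.
    lap_lo = [0.0] + [a] * (n - 2) + [2.0 * a]
    lap_di = [-2.0 * a] * n
    lap_up = [2.0 * a] + [a] * (n - 2) + [0.0]
    # Step 1: nonlinear term evaluated at the known time level.
    nonlin = [rate * ui * (1.0 - ui) for ui in u]
    # Step 2: right-hand side (I + dt/2 L) u^n + dt N(u^n).
    rhs = []
    for i in range(n):
        lu = lap_di[i] * u[i]
        if i > 0:
            lu += lap_lo[i] * u[i - 1]
        if i < n - 1:
            lu += lap_up[i] * u[i + 1]
        rhs.append(u[i] + 0.5 * dt * lu + dt * nonlin[i])
    # Step 3: solve (I - dt/2 L) u^{n+1} = rhs.
    lo = [-0.5 * dt * v for v in lap_lo]
    di = [1.0 - 0.5 * dt * v for v in lap_di]
    up = [-0.5 * dt * v for v in lap_up]
    u_new = thomas_solve(lo, di, up, rhs)
    # Step 4: enforce non-negativity.
    return [max(0.0, v) for v in u_new]


if __name__ == "__main__":
    u = [1.0, 0.5, 0.0, 0.0, 0.0]
    for _ in range(10):
        u = imex_step(u, dt=0.01, h=0.25, diff=1.0, rate=1.0)
    print(u)
```

For a spatially constant state the diffusion terms cancel and the step reduces to a forward Euler update of the logistic reaction, which gives a simple consistency check.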
Figure 1 illustrates the computational pipeline for solving the linear system
via GPU-accelerated Gaussian elimination with partial pivoting. The workflow consists of five sequential phases: (1) host-to-device transfer of the augmented matrix; (2) iterative execution of three CUDA kernels within a
-step loop; (3) device-to-host retrieval of the upper-triangular result; (4) CPU-based back substitution exploiting cache locality; and (5) non-negativity enforcement. The decision to perform back substitution on the CPU rather than the GPU reflects the inherent data dependency
, which prevents effective parallelization for systems with
.
3.2. Gaussian Elimination Choice Discussion
We employ Gaussian elimination rather than iterative methods (GMRES, BiCGStab with AMG preconditioning) for three interconnected reasons. First, the system size
lies below the crossover point (
DOF) where iterative solvers with algebraic multigrid preconditioning become competitive [
7]. In the case of
, AmgX requires 847 GMRES iterations versus 1599 guaranteed elimination steps, resulting in 3.5 s versus 1.7 s wall time. Second, the off-diagonal block
in Equation (
4) couples proteases to endothelial cells, destroying the block-tridiagonal sparsity pattern. This coupling yields an effective matrix bandwidth of
, substantially reducing sparse solver advantages. Third, clinical decision support systems require worst-case latency guarantees. Direct methods provide deterministic
complexity, whereas iterative methods exhibit iteration count variability (600–1200 iterations observed for angiogenesis parameters in
Table 1). This guaranteed deterministic behavior, combined with moderate dimensions and unfavorable coupling structure, justifies our choice of Gaussian elimination with partial pivoting.
4. CUDA-Based Parallel Algorithm
The GPU parallelization of the IMEX scheme focuses on accelerating the solution of the linear system (7) via a CUDA implementation of Gaussian elimination with partial pivoting, alongside parallel construction of the discrete gradient and Laplacian operators. We adopt the augmented matrix representation $[\mathbf{A} \mid \mathbf{b}]$, with $n$ rows and $n+1$ columns for a system of $n$ unknowns, and store it in row-major order as a contiguous one-dimensional array in GPU device memory. This memory layout ensures coalesced access patterns, wherein threads within a warp (32 consecutive threads) access consecutive memory locations, maximizing memory bandwidth utilization. The flattened index mapping is given by
$$\mathrm{idx}(i, j) = i\,(n+1) + j,$$
where the zero-based indexing convention is adopted consistent with CUDA arrays.
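The row-major layout can be made concrete with a short host-side sketch. It assumes an augmented matrix with $n$ rows and $n+1$ columns, the last column holding the right-hand side; `flat_index` and `build_augmented` are illustrative names.

```python
def flat_index(i, j, n):
    """Zero-based row-major offset of entry (i, j) in an n x (n+1) array."""
    return i * (n + 1) + j


def build_augmented(A, b):
    """Pack A (list of n rows) and b (length n) into one flat row-major list."""
    n = len(b)
    aug = [0.0] * (n * (n + 1))
    for i in range(n):
        for j in range(n):
            aug[flat_index(i, j, n)] = A[i][j]
        aug[flat_index(i, n, n)] = b[i]  # right-hand side in the last column
    return aug


if __name__ == "__main__":
    aug = build_augmented([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0])
    print(aug)
```

Entries that are adjacent within a row are adjacent in memory, which is the property that lets consecutive threads working on consecutive columns produce coalesced accesses.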
The Gaussian elimination algorithm proceeds in $n-1$ sequential elimination steps for a system of dimension $n$. At step $k$, we first identify the pivot row $p = \arg\max_{i \ge k} |a_{ik}|$ to ensure numerical stability by bounding the growth factor. The pivot-finding operation is parallelized over the subcolumn $a_{ik}$, $i \ge k$, using a CUDA kernel that employs atomic operations to determine the global maximum. Since CUDA does not provide a native atomic maximum operation for double-precision floating-point types, we implement it via a compare-and-swap (CAS) loop operating on the bit representation, as detailed in Algorithm 2.
The CAS operation atomicCAS(addr, expected, desired) atomically compares the value at addr with expected. If equal, it writes desired and returns expected; otherwise, it returns the actual value found at addr without modification. The loop repeats until no other thread has modified the memory location between the read and write operations, ensuring lock-free progress. The bit-level reinterpretation via __double_as_longlong and __longlong_as_double preserves the IEEE 754 double-precision representation while enabling integer-based atomic operations. Each thread
processes row index
(and potentially
via grid-stride loop if
). The shared memory variables
(maximum absolute value) and
(corresponding row index) are initialized to zero and
$k$, respectively, by thread 0 before a block-wide synchronization barrier (__syncthreads()) ensures visibility to all threads. Subsequently, each thread computes
, performs the atomic maximum update on
, synchronizes again, and if
updates the index
. Finally, thread 0 writes the result
to a device memory location accessible to the host. The computational complexity of this kernel is
work distributed over
threads with
atomic contention overhead, yielding an effective parallel time complexity of
for each step
k.
| Algorithm 2 Atomic maximum for double-precision values. |
| Require: Memory address (shared or global), candidate value |
| Ensure: atomically |
| 1: | | |
| 2: | repeat | |
| 3: | | ▹ Read current bit pattern |
| 4: | | |
| 5: | | |
| 6: | | |
| 7: | | |
| 8: | | |
| 9: | until | ▹ CAS succeeded |
| 10: | return | ▹ Previous value |
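The control flow of the CAS loop can be emulated on the host. The Python sketch below is single-threaded, so the retry loop always succeeds on its first pass; on the GPU the hardware CAS instruction supplies the atomicity. The helper names mirror the CUDA intrinsics only by analogy, and the one-element list `cell` stands in for the shared memory word.

```python
import struct


def double_as_longlong(x):
    """Reinterpret the IEEE 754 bits of a double as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def longlong_as_double(bits):
    """Reinterpret a signed 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<q", bits))[0]


def compare_and_swap(cell, expected, desired):
    """Stand-in for an atomic CAS on cell[0]; returns the value found."""
    old = cell[0]
    if old == expected:
        cell[0] = desired
    return old


def atomic_max_double(cell, value):
    """Store max(current, value) in cell[0]; return the previous double."""
    old = cell[0]
    while True:
        assumed = old
        current = longlong_as_double(assumed)
        desired = double_as_longlong(max(current, value))
        old = compare_and_swap(cell, assumed, desired)
        if old == assumed:  # no interference between read and write
            return longlong_as_double(old)


if __name__ == "__main__":
    cell = [double_as_longlong(0.0)]
    for candidate in (0.25, 3.5, 1.0):
        atomic_max_double(cell, candidate)
    print(longlong_as_double(cell[0]))
```

Because the pivot search compares absolute values, the candidates are non-negative, and the loop only ever replaces the stored word by a value that is at least as large.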
Once the pivot row
is determined, if
, we perform a row interchange
to move the pivot element to the diagonal position. This operation is parallelized by assigning each column
to a separate thread, which simultaneously swaps the elements
through a temporary register, which costs two reads and two writes per column. The kernel is launched with
thread blocks, each containing
threads, such that thread
t in block
b handles column
. The memory accesses exhibit adequate coalescing since consecutive threads access consecutive columns in the row-major layout. The parallel time complexity is
assuming sufficient parallelism. The elimination step itself, which zeroes out the elements below the pivot in column
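$k$, constitutes the most computationally intensive phase. For each row $i > k$, we compute the multiplier $m_{ik} = a_{ik}/a_{kk}$ and update the elements
$$a_{ij} \leftarrow a_{ij} - m_{ik}\,a_{kj}, \qquad j = k, \dots, n,$$
where column $n$ of the zero-based augmented matrix holds the right-hand side. This operation updates a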
submatrix and is parallelized via a two-dimensional thread grid. Each thread
in a two-dimensional thread block of dimensions
is assigned to submatrix element
, where
denotes the block index. The grid dimensions are chosen as
to cover the entire augmented matrix, with threads outside the valid submatrix range performing no operation. Within each thread, we first check the condition
, then compute the multiplier via a single read of
and
, followed by a fused multiply–subtract operation updating
. The memory access pattern for the read of
is highly coalesced since threads with consecutive
(and, hence, consecutive
j) access consecutive elements of the pivot row. Similarly, the write to
coalesces across threads with consecutive
within the same row
i. However, reading the multiplier element
involves threads with different
(different rows
i) accessing the same column
k, which does not coalesce perfectly but benefits from L1 cache reuse within the thread block. The computational work at step
k is
multiply–subtract operations distributed over
threads, yielding a parallel time complexity of
per elimination step. Summing over all steps, the total forward elimination phase requires:
where we have used
Setting
and a GPU with
CUDA cores, the effective parallelism reduces the constant factor by approximately two orders of magnitude compared to a sequential CPU implementation. After completing the forward elimination phase, the augmented matrix
has been transformed into upper triangular form
, where
is upper-triangular and
is the transformed right-hand side. The back substitution phase computes the solution
via:
This computation exhibits an inherent data dependency: the value
depends on all previously computed values
, preventing direct parallelization across the rows. While parallel scan algorithms and dependency-graph scheduling techniques can mitigate this seriality to some extent, the performance gains are typically modest for linear systems of dimension
due to synchronization overhead. Consequently, our implementation performs back substitution on the CPU host after copying the upper triangular matrix
from device to host memory. The serial complexity of back substitution is
which is asymptotically dominated by the
forward elimination but can represent a non-negligible fraction of the total time for moderate
M values.
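A sequential reference version of the full solve helps to validate the kernels. The sketch below performs pivot search, row interchange, elimination, and back substitution on a flat row-major augmented array of $n$ rows and $n+1$ columns; it is a plain CPU baseline under those layout assumptions, not the CUDA implementation, and `solve_augmented` is an illustrative name.

```python
def solve_augmented(aug, n):
    """Gaussian elimination with partial pivoting on a flat n x (n+1) array.

    The array is modified in place; the solution vector is returned.
    """
    w = n + 1  # row stride of the augmented matrix
    for k in range(n - 1):
        # Pivot search: row with the largest |a_ik| for i >= k.
        p = max(range(k, n), key=lambda i: abs(aug[i * w + k]))
        if aug[p * w + k] == 0.0:
            raise ValueError("matrix is singular")
        # Row interchange k <-> p across all columns.
        if p != k:
            for j in range(w):
                aug[k * w + j], aug[p * w + j] = aug[p * w + j], aug[k * w + j]
        # Elimination update of the trailing submatrix.
        for i in range(k + 1, n):
            m = aug[i * w + k] / aug[k * w + k]
            for j in range(k, w):
                aug[i * w + j] -= m * aug[k * w + j]
    # Back substitution, from the last row upward.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        if aug[i * w + i] == 0.0:
            raise ValueError("matrix is singular")
        s = aug[i * w + n]
        for j in range(i + 1, n):
            s -= aug[i * w + j] * x[j]
        x[i] = s / aug[i * w + i]
    return x


if __name__ == "__main__":
    # The zero in position (0, 0) forces a row interchange at the first step.
    aug = [0.0, 2.0, 1.0, 7.0,
           1.0, 1.0, 1.0, 6.0,
           2.0, 1.0, 0.0, 4.0]
    print(solve_augmented(aug, 3))
```

The back substitution loop shows the dependency discussed above: each unknown is computed only after all unknowns with larger index are available.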
The construction of discrete operators
and
is also parallelized on the GPU. Considering the gradient matrix
, each thread
sets the sub-diagonal element
and super-diagonal element
, with all other entries initialized to zero. The boundary rows
remain identically zero to enforce the no-flux conditions. A subsequent scaling pass (which can be fused into the same kernel via a second loop over all entries) applies the factor
if not already applied. The kernel is launched with
blocks of
threads each, yielding a parallel complexity
under the assumption that each thread performs
work. Similarly, the Laplacian matrix
construction assigns each row
i to a thread, which sets the diagonal element
and off-diagonal elements
with special handling for the boundary rows
as specified in Equation (
3), again achieving
parallel complexity.
Taking into account the memory transfer costs between host and device, at each time step, the augmented matrix requires bytes of data transfer from host to device before the elimination kernel launches. After forward elimination, the upper triangular matrix of the same size is transferred back to the host for back substitution, doubling the transfer volume per time step. For a typical PCIe Gen3 ×16 link with peak bandwidth GB/s, the transfer time is s. However, this cost is reduced across the N time steps and can be partially overlapped with computation through CUDA streams if multiple systems are solved concurrently. The ratio of computation time to transfer time is proportional to , indicating that for , the transfer overhead becomes negligible. In our target regime and , this condition is not satisfied, and transfer costs can contribute – of the total time, depending on the specific hardware characteristics.
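The transfer volume can be estimated with a few lines of arithmetic. The sketch below assumes double-precision storage (8 bytes per entry) of an augmented matrix with $n$ rows and $n+1$ columns, and uses $n = 1600$, the dimension implied by the 1599 elimination steps quoted in Section 3.2. The bandwidth of 15.75 GB/s is the nominal figure for a PCIe Gen3 ×16 link and is an assumption of this example; sustained rates are lower in practice.

```python
def augmented_bytes(n, bytes_per_entry=8):
    """Size in bytes of an n x (n+1) augmented matrix in double precision."""
    return bytes_per_entry * n * (n + 1)


def transfer_seconds(n, bandwidth_gb_s, round_trip=True):
    """Idealized transfer time per time step at the given bandwidth (GB/s)."""
    volume = augmented_bytes(n)
    if round_trip:  # host-to-device before, device-to-host after elimination
        volume *= 2
    return volume / (bandwidth_gb_s * 1.0e9)


if __name__ == "__main__":
    n = 1600                  # dimension implied by 1599 elimination steps
    nominal_bandwidth = 15.75  # assumed nominal PCIe Gen3 x16 rate in GB/s
    print(augmented_bytes(n), "bytes one way")
    print(transfer_seconds(n, nominal_bandwidth), "s per time step (round trip)")
```

The estimate ignores latency and driver overhead, so it is a lower bound on the per-step transfer cost rather than a prediction of measured timings.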
Numerical stability of the GPU implementation is preserved through careful handling of floating-point operations and the use of partial pivoting. The growth factor
after elimination step
k satisfies
in the worst case for partial pivoting, but is typically much smaller for well-conditioned systems arising from PDE discretizations. The condition number of the system matrix
is bounded by
which remains acceptable for
time steps, consistent with the IMEX stability constraint [
2]. Accumulated rounding errors are monitored by computing the residual
after each solve, which should remain at the level of machine epsilon
multiplied by the condition number, i.e.,
.
The overall computational complexity of the GPU-accelerated IMEX scheme for
N time steps is given by:
where
P denotes the number of CUDA cores, and the three terms represent nonlinear evaluation and matrix construction (
), parallel forward elimination (
), and serial back substitution plus data transfer (
). Comparing to the sequential CPU complexity
, we obtain a theoretical speedup factor
in the regime where forward elimination dominates. In practice, Amdahl’s law limits the realized speedup due to the serial fraction
yielding
Choosing typical values
(NVIDIA V100 GPU) and
, we predict