Abstract
Tumor angiogenesis models based on coupled nonlinear parabolic partial differential equations require solving stiff systems where explicit time-stepping methods impose severe stability constraints on the time step size. Implicit–Explicit (IMEX) schemes relax this constraint by treating diffusion terms implicitly and reaction–chemotaxis terms explicitly, reducing each time step to a single linear system solution. However, standard Gaussian elimination with partial pivoting exhibits cubic complexity in the number of spatial grid points, dominating computational cost for realistic discretizations in the range of 400–800 grid points. This work presents a CUDA-based parallel algorithm that accelerates the IMEX scheme through GPU implementation of three core computational kernels: pivot finding via atomic operations on double-precision floating-point values, row swapping with coalesced memory access patterns, and elimination updates using optimized two-dimensional thread grids. Performance measurements on an NVIDIA H100 GPU demonstrate speedup factors from 3.5× to 113× across the considered spatial discretizations relative to sequential CPU execution, approaching 94.2% of the theoretical maximum speedup predicted by Amdahl’s law. Numerical validation confirms that GPU and CPU solutions agree to within twelve digits of precision over extended time integration, with conservation properties preserved to machine precision. Performance analysis reveals that the elimination kernel accounts for nearly 90% of total execution time, justifying the focus on GPU parallelization of this component. The method enables parameter studies requiring many repeated PDE solves, previously computationally prohibitive, facilitating model-driven investigation of anti-angiogenic therapy design.
1. Introduction
Tumor angiogenesis, the formation of new blood vessels from pre-existing vasculature in response to tumor-secreted growth factors, constitutes a critical mechanism enabling solid tumor progression beyond the diffusion-limited threshold of 1–2 mm in diameter. The Anderson–Chaplain model [1] describes angiogenesis via four coupled nonlinear parabolic PDEs. This model, which forms the foundation of our numerical investigation, couples four state variables through nonlinear chemotactic and haptotactic flux terms, yielding a system whose stiffness scales as $\mathcal{O}(h^{-2})$, where $h$ denotes the spatial mesh spacing.
Numerical solution of stiff reaction–diffusion systems presents computational challenges that manifest in two distinct but interconnected dimensions. First, explicit time-stepping methods impose a stability constraint of the form $\Delta t \le C h^2$, where $\Delta t$ represents the temporal discretization parameter and $C$ is a problem-dependent constant typically on the order of unity. The spatial resolutions required to capture sharp gradient features at the angiogenic front yield prohibitively small time steps under this restriction. Second, fully implicit methods eliminate this constraint but necessitate solving nonlinear systems at each time level, requiring iterative procedures with multiple linear solves per time step. Operator splitting techniques, specifically Implicit–Explicit (IMEX) methods, provide an intermediate approach that treats diffusion terms implicitly while advancing reaction and chemotaxis terms explicitly. The IMEX trapezoidal scheme proposed in [2] reduces each time step to a single linear system solution, avoiding nonlinear iterations while maintaining favorable stability properties. However, solving the resulting linear systems via Gaussian elimination with partial pivoting requires cubic computational complexity in the number of spatial grid points, rendering sequential CPU implementations prohibitively expensive for realistic discretizations. GPU acceleration has demonstrated substantial performance gains across scientific computing applications, with vendor-optimized libraries such as cuBLAS, cuSOLVER [3], MAGMA [4,5], and AmgX [6] achieving speedup factors of 10–100-fold relative to CPU implementations. While iterative solvers with algebraic multigrid preconditioning show excellent performance for large-scale stiff PDEs [7,8], the moderate system dimensions and block-structured coupling inherent in the tumor angiogenesis model suggest that GPU-accelerated direct methods merit investigation.
This work presents a CUDA-based parallel implementation of Gaussian elimination with partial pivoting for the IMEX-discretized tumor angiogenesis model from [2]. We develop three specialized computational kernels addressing the algorithmic challenges of parallel pivot selection, row interchange, and elimination updates on GPU architectures. Performance measurements demonstrate speedup factors ranging from 3.47-fold to 113.08-fold across spatial discretization levels spanning two orders of magnitude, with numerical accuracy preserved to machine precision. Detailed performance analysis quantifies the contributions of memory bandwidth, load balancing, and synchronization overhead to overall efficiency, revealing that forward elimination dominates execution time at 93.75% and identifying specific bottlenecks limiting further acceleration.
The rest of this paper is organized as follows: Section 2 reviews related work on GPU-based acceleration of stiff ODE/PDE solvers and examines vendor-optimized libraries for linear algebra operations. Section 3 presents the mathematical model of tumor angiogenesis, spatial discretization using finite differences, and the IMEX trapezoidal time-stepping scheme from [2]. Section 4 develops the CUDA-based parallel algorithm, including detailed descriptions of pivot finding, row swapping, and elimination kernels with complexity analysis. Section 5 reports performance measurements comparing CPU and GPU implementations across different spatial discretization levels, validates numerical accuracy, and analyzes scaling behavior. Finally, Section 6 closes the paper with the conclusions.
2. Related Works
In recent years, GPU-based acceleration of stiff ODE/PDE solvers has attracted growing attention, particularly in applications such as chemical systems, fluid dynamics, and biological modeling [9,10,11]. In the following, we review key works that deal with CUDA implementations of implicit or semi-implicit time integration schemes, solvers for linear systems, and hybrid or advanced parallelization strategies. Several vendor-optimized libraries, e.g., cuBLAS, cuSOLVER, MAGMA, AmgX, have been developed to offload dense or sparse linear algebra operations to GPUs [4,12]. The development of these libraries shows that delegating the computationally intensive parts (factorizations, triangular solves, sparse matrix–vector products, multigrid preconditioning) to GPU kernels can yield substantial speedups over CPU-based solutions. For instance, Al Farhan et al. [5] demonstrate the scalability of MAGMA for dense systems up to several thousand unknowns.
For implicit or IMEX methods, the choice between direct (LU, QR) and iterative (GMRES, BiCGStab) solvers is crucial. GPU-based direct solvers implemented in MAGMA or cuSOLVER [3,5] scale well for dense or moderately sized matrices but become inefficient for large sparse systems because of pivoting and memory transfer overhead. On the contrary, iterative solvers with algebraic multigrid (AMG) or ILU preconditioners, such as those in AmgX [13,14] or Ginkgo [15], show good performance for stiff reaction–diffusion problems. Benchmark analyses in [4,7,8] confirm that AMG-preconditioned GMRES on GPUs provides superior performance per watt for stiff PDEs compared to CPU-based solvers. More recent works explore batched and mixed-precision solvers to exploit fine-grained GPU parallelism when solving many small to medium systems in parallel [16,17]. For example, Haidar et al. [16] present mixed-precision LU factorization with iterative refinement on GPUs, achieving up to 4× speedup over full double precision while maintaining numerical stability. Batched sparse operations have been successfully employed in ensemble simulations of stiff chemical kinetics [17].
Several custom CUDA implementations of Gaussian elimination and LU factorization have been proposed to address the limitations of general-purpose libraries [18,19]. Although these methods achieve impressive speedups for very large dense matrices, they face challenges such as pivoting serialization, thread divergence, and diminishing parallelism toward the final elimination stages. Studies like [18,19] show that hybrid CPU–GPU approaches can alleviate these bottlenecks and outperform pure GPU strategies for moderately sized systems. Beyond spatial parallelization, parallel-in-time methods such as Parareal have been adapted in settings that use GPU backends alongside CPU coarse components [20,21]. For example, Arteaga et al. [21] employ a GPU backend for the spatial components and apply coarse–fine steps in a time-parallel fashion. More recently, RandNet–Parareal extends this approach by using neural networks for the coarse propagator over PDEs, demonstrating speedups and applicability to diffusion–reaction problems [22]. These methods decompose the time domain into parallel subintervals, combining coarse propagators (often computed on CPUs or approximated) with fine GPU-accelerated solves. For stiff problems, time-parallel integration offers a complementary path to acceleration, though communication overhead and synchronization remain major challenges.
Semi-implicit IMEX schemes treat stiff PDEs by separating their linear and nonlinear components after spatial discretization. Since each time step requires solving a linear system for the updated solution, the literature provides several practical guidelines, such as the following:
- Dense or moderately sized systems: use vendor libraries like MAGMA or cuSOLVER for optimized dense algebra.
- Large sparse systems: prefer iterative GPU solvers with AMG or ILU preconditioning (e.g., AmgX, Ginkgo).
- Many small systems: leverage batched and mixed-precision GPU routines.
- Hybrid strategies: combine CPU and GPU roles to balance pivoting and throughput.
- Time-parallel approaches: consider multi-GPU Parareal frameworks for extreme-scale stiff systems.
3. Mathematical and Numerical Model
The tumor angiogenesis process is governed by a system of coupled nonlinear parabolic partial differential equations defined on a one-dimensional spatial domain and a finite time interval. Following the model presented in [2], we consider four state variables: the endothelial cell density $C$, the protease concentration $P$, the inhibitor concentration $I$, and the extracellular matrix (ECM) density $F$. The evolution of these quantities is described by the nonlinear system:
where the tumor angiogenic factor (TAF) distribution is given by
which models the spatial gradient of vascular endothelial growth factor (VEGF-A) established by hypoxic tumor cells localized at . The scaling parameter yields a decay length consistent with measured VEGF-A gradients in tumor microenvironments. This gradient provides the primary chemotactic signal directing endothelial cell migration from pre-existing vessels at toward the tumor boundary.
In the endothelial cell equation for $C$, the diffusion term captures random motility via Brownian motion. The next term models chemotaxis toward inhibitor gradients, where, counter-intuitively, angiostatin acts as an attractant at low concentrations. Haptotaxis along ECM density gradients is captured by a corresponding flux term, representing integrin-mediated fibronectin binding. A further term describes saturable chemotaxis toward VEGF-A with Michaelis–Menten receptor saturation, while the logistic term represents proliferation with unit carrying capacity due to contact inhibition at confluence. The protease equation (P) couples production through the cell density and basal secretion, with decay via natural degradation and inhibition through inhibitor binding. The inhibitor equation (I) exhibits decay solely through protease binding via mass-action kinetics, while the ECM equation (F) describes irreversible degradation via proteolytic cleavage (MMP-2-mediated collagen type IV breakdown). The system is subject to homogeneous Neumann boundary conditions, representing no-flux conditions at the domain boundaries. The diffusion coefficients quantify molecular diffusivity, the chemotactic and haptotactic sensitivity parameters govern directed cell migration, and the kinetic rate constants characterize proliferation, degradation, binding, and production processes. Parameters are dimensionless, scaled by a reference length in mm and a reference time in h. Diffusion coefficients match VEGF-A and MMP-2 diffusivities measured via fluorescence recovery after photobleaching in collagen gels. Kinetic rates are fitted via least-squares minimization to match experimentally observed endothelial cell migration speeds of up to 50 µm/h in Boyden chamber assays [23]. Chemotactic parameters are calibrated via Morris sensitivity screening against capillary density time series from Matrigel plug assays, selecting values with high correlation.
In order to obtain a computationally tractable discrete approximation, we introduce a uniform spatial mesh of $M$ grid points with spacing $h$. For any sufficiently smooth function $u(x,t)$, we denote its grid function by $u_j(t) \approx u(x_j,t)$ and introduce the standard centered finite difference operators. The first derivative is approximated by
$$(D_x u)_j = \frac{u_{j+1} - u_{j-1}}{2h},$$
while the second derivative employs
$$(D_{xx} u)_j = \frac{u_{j+1} - 2u_j + u_{j-1}}{h^2}$$
for the same index range. The no-flux boundary conditions are enforced through ghost point relations that mirror the first interior value across each boundary node (e.g., $u_{-1} = u_1$ at the left boundary), which ensure that the discrete normal derivative vanishes to machine precision. These approximations exhibit second-order truncation error for functions with bounded fourth derivatives, as verified through Taylor series expansion. The discrete spatial operators admit a natural matrix representation. The operator $D_{xx}$ takes the form of the tridiagonal matrix
$$D_{xx} = \frac{1}{h^2}
\begin{pmatrix}
-2 & 2 & & & \\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1 \\
 & & & 2 & -2
\end{pmatrix},$$
whose first and last rows incorporate the ghost point relations.
Concatenating the state variables into a single vector $\mathbf{u} = (C, P, I, F)^{T}$, the spatially semi-discretized system assumes the compact form
$$\frac{d\mathbf{u}}{dt} = L\,\mathbf{u} + N(\mathbf{u}),$$
where $L$ represents the linear differential operators and $N$ encapsulates the nonlinear reaction and chemotaxis terms. Considering endothelial cells, the only linear term is diffusion, while the chemotaxis terms involve products of unknowns and, thus, belong to $N$; the corresponding row block of $L$ therefore contains only the discrete diffusion operator. For proteases, the linear terms comprise diffusion, the production term coupling cells to proteases, and decay. The production term is linear in $C$ with a spatially varying coefficient given by the TAF distribution, contributing an off-diagonal block, while the inhibition term is bilinear and belongs to $N$. For inhibitors, only diffusion is linear, with the binding term being bilinear and hence part of $N$. For the ECM, the degradation is entirely bilinear, rendering the fourth row block of $L$ identically zero.
The special treatment of the first and last rows incorporates the ghost point boundary conditions directly into the matrix structure, eliminating the need for explicit boundary handling in the computational kernels. Similarly, we define the discrete gradient operator $D_x$ through the entries $-1/(2h)$ on the sub-diagonal and $+1/(2h)$ on the super-diagonal of each interior row, with all other entries vanishing. The boundary rows remain identically zero, consistent with the no-flux conditions.
Assembling these blocks, the linear operator $L$ exhibits a $4 \times 4$ block structure in which $0$ denotes the zero matrix and $I$ the identity matrix. The TAF-dependent coupling in the protease–cell block arises from the protease production term in system (1). The eigenvalue spectrum of $L$ is dominated by the diffusion operators, with the most negative eigenvalue scaling as $\mathcal{O}(h^{-2})$, which imposes severe stability restrictions on explicit time-stepping methods.
The nonlinear operator is constructed from the chemotactic, haptotactic, and reaction terms. For the endothelial cell equation, we have:
for , where the spatial derivatives are evaluated using the centered difference approximations with appropriate boundary modifications. The product rule ensures consistency with the continuous formulation. The remaining components follow similarly:
and
The Lipschitz constant of $N$ with respect to the Euclidean norm can be bounded uniformly on compact subsets of the state space through careful analysis of the chemotactic terms, yielding
$$\|N(\mathbf{u}) - N(\mathbf{v})\|_2 \le L_N \,\|\mathbf{u} - \mathbf{v}\|_2,$$
where $L_N = \mathcal{O}(h^{-1})$ due to the presence of gradient operators.
3.1. Time Integration Method and Algorithm
The stiffness inherent in the spatially discretized system necessitates an operator splitting approach that treats the linear diffusion terms implicitly while handling the nonlinear reaction–chemotaxis terms explicitly. We partition the time domain into $N$ uniform intervals with time step $\Delta t$ and discrete time levels $t_n = n\,\Delta t$ for $n = 0, \dots, N$. The Implicit–Explicit (IMEX) trapezoidal scheme, proposed in [2], integrates the linear operator using the second-order Crank–Nicolson method, which provides A-stability and unconditional stability for diffusion-dominated problems, while the nonlinear term is advanced explicitly via the forward Euler method to avoid the computational expense of solving nonlinear systems at each time step.
Given the solution $\mathbf{u}^n$ at time $t_n$, the update to $\mathbf{u}^{n+1}$ satisfies the discrete evolution equation:
$$\mathbf{u}^{n+1} = \mathbf{u}^{n} + \frac{\Delta t}{2}\,L\left(\mathbf{u}^{n+1} + \mathbf{u}^{n}\right) + \Delta t\,N(\mathbf{u}^{n}).$$
Rearranging terms and introducing the notation
$$A = I - \frac{\Delta t}{2}\,L, \qquad B = I + \frac{\Delta t}{2}\,L,$$
where $I$ denotes the identity matrix, we obtain the linear system:
$$A\,\mathbf{u}^{n+1} = B\,\mathbf{u}^{n} + \Delta t\,N(\mathbf{u}^{n}).$$
The system matrix $A$ inherits the block structure from $L$ and, critically, is nonsingular for all $\Delta t > 0$ since the eigenvalues of $L$ are real and non-positive (due to the discrete Laplacian being negative semi-definite), implying that the eigenvalues of $A$ satisfy $\lambda(A) \ge 1$. The right-hand side vector is explicitly computable from the known data at time level $n$, involving one matrix–vector multiplication with $\mathcal{O}(M)$ computational cost due to the banded structure of $B$, and one evaluation of the nonlinear operator requiring $\mathcal{O}(M)$ operations.
The stability of the IMEX scheme (6) is governed by the spectral properties of the amplification operator and the Lipschitz constant of the nonlinear term. For the linear component, the trapezoidal method applied to the scalar test problem $u' = \lambda u$ yields the amplification factor
$$R(\lambda\,\Delta t) = \frac{1 + \tfrac{1}{2}\lambda\,\Delta t}{1 - \tfrac{1}{2}\lambda\,\Delta t}$$
for an eigenvalue $\lambda$ of $L$. Since $\lambda \le 0$, we have
$$|R(\lambda\,\Delta t)| \le 1,$$
ensuring non-amplification for all $\Delta t > 0$. This unconditional stability for the linear part eliminates the restrictive $\Delta t = \mathcal{O}(h^2)$ constraint imposed by fully explicit schemes. The explicit treatment of the nonlinear term introduces a stability condition of the form $\Delta t \le C/L_N$, where $L_N$ is the Lipschitz constant of $N$. However, since $L_N = \mathcal{O}(h^{-1})$, the allowable time step satisfies $\Delta t = \mathcal{O}(h)$, which is significantly less restrictive than the $\mathcal{O}(h^2)$ constraint of fully explicit methods. The local truncation error of the IMEX scheme combines the trapezoidal error of the implicit linear part with the forward Euler error of the explicit nonlinear part, and this determines the overall temporal accuracy when combined with the second-order spatial accuracy.
The computational bottleneck of the IMEX method is the solution of the dense linear system (7) at each time step. While $A$ inherits some sparsity from the discrete Laplacian operators in its diagonal blocks, the coupling terms and the block structure result in a matrix whose nonzero pattern extends well beyond the tridiagonal blocks. Standard Gaussian elimination with partial pivoting requires $\mathcal{O}(M^3)$ arithmetic operations per time step, dominating the $\mathcal{O}(M)$ cost of evaluating $N(\mathbf{u}^n)$ and the $\mathcal{O}(M)$ cost of the matrix–vector product $B\,\mathbf{u}^n$. Over $N$ time steps, the total complexity becomes $\mathcal{O}(N M^3)$. For realistic spatial resolutions and long-time integrations, this cubic scaling renders sequential CPU implementations prohibitively expensive, motivating the GPU parallelization strategies developed in the following section.
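To make the cubic scaling concrete, consider a rough estimate for the system dimension $n = 1600$ used in the detailed measurements of Section 5 (the single-core throughput assumed below is a ballpark figure for non-blocked elimination, not a measured value):
$$W_{\text{step}} \approx \tfrac{2}{3}\,n^{3} \approx 2.7 \times 10^{9}\ \text{flops}, \qquad W_{\text{total}} = N\,W_{\text{step}} \approx 2.7 \times 10^{11}\ \text{flops for } N = 100,$$
so a single core sustaining on the order of 1–2 GFLOP/s already needs minutes per simulation, and parameter sweeps multiply this cost by the number of required solves.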
The computational procedure for advancing the solution from time to can be formalized algorithmically. Given the current state , we must evaluate the nonlinear term, construct the right-hand side vector, solve the linear system, and enforce physical constraints. The complete IMEX time-stepping procedure is presented in Algorithm 1.
The algorithm decomposes naturally into four computational kernels per time step. The nonlinear evaluation (Step 1) requires $\mathcal{O}(M)$ operations and involves evaluating finite difference approximations and pointwise products, which are trivially parallelizable with perfect load balance. The right-hand side construction (Step 2) performs a sparse matrix–vector multiplication with $\mathcal{O}(M)$ complexity due to the banded structure of $B$, exploiting the fact that $B$ contains discrete Laplacian operators with at most five nonzero entries per row. The linear system solution (Step 3) dominates the computational cost at $\mathcal{O}(M^3)$ for the serial Gaussian elimination procedure, making it the primary target for GPU acceleration. Finally, the non-negativity enforcement (Step 4) is an $\mathcal{O}(M)$ post-processing operation that ensures the numerical solution respects the physical constraints inherent in the continuous model, where all state variables represent concentrations or densities that must remain non-negative. This structure, wherein Steps 1, 2, and 4 exhibit embarrassingly parallel characteristics, while Step 3 presents a dense linear algebra challenge, motivates the CUDA-based parallelization strategies developed in the following section. The main observation is that while the nonlinear term could potentially be evaluated on the GPU and kept in device memory to avoid host–device transfers, the dominant cost lies in solving the implicit linear system, which must be performed at every time step and accounts for over 95% of the total computational time for typical spatial resolutions.
| Algorithm 1 IMEX time-stepping for tumor angiogenesis. | ||
| Require: System matrices , TAF distribution , parameters , time step , final time , initial condition | ||
| Ensure: Solution trajectory with | ||
| 1: | Initialization: | |
| 2: | Construct linear operator according to (4) | |
| 3: | Form system matrix | |
| 4: | Form explicit matrix | |
| 5: | for to do | |
| 6: | Step 1: Nonlinear Evaluation | |
| 7: | Extract components: , , , | |
| 8: | for to M do | |
| 9: | Compute chemotactic flux: | |
| 10: | Compute haptotactic flux: | |
| 11: | Compute TAF chemotaxis: | |
| 12: | ||
| 13: | ||
| 14: | ||
| 15: | ||
| 16: | end for | |
| 17: | Step 2: Right-Hand Side Construction | |
| 18: | ▹ matrix–vector product | |
| 19: | Step 3: Linear System Solution | |
| 20: | Solve via Gaussian elimination with partial pivoting | |
| 21: | Step 4: Enforce Non-Negativity | |
| 22: | for to do | |
| 23: | if then | |
| 24: | ▹ Physical constraint: concentrations/densities | |
| 25: | end if | |
| 26: | end for | |
| 27: | end for | |
| 28: | return | |
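A condensed host-side sketch of Algorithm 1 follows; the helper names (evaluate_nonlinear, banded_matvec, gpu_gaussian_elimination) and the matrix identifiers A_sys and B_exp are illustrative placeholders for the components described above, not the authors' code.

```cuda
#include <vector>
#include <algorithm>

// Assumed components described in Sections 3-4 (placeholders, not a published API):
void evaluate_nonlinear(const double* u, double* Nu, int n);                  // Step 1
void banded_matvec(const double* B_exp, const double* u, double* y, int n);   // Step 2
void gpu_gaussian_elimination(const double* A_sys, const double* rhs,
                              double* u_new, int n);                          // Step 3 (Section 4)

// Host-side sketch of the IMEX loop of Algorithm 1, with A_sys = I - (dt/2) L and
// B_exp = I + (dt/2) L assembled once before the loop; u holds the stacked state.
void imex_time_stepping(const double* A_sys, const double* B_exp, double* u,
                        int n, int num_steps, double dt) {
    std::vector<double> rhs(n), Nu(n);
    for (int step = 0; step < num_steps; ++step) {
        evaluate_nonlinear(u, Nu.data(), n);                     // Step 1: N(u^n)
        banded_matvec(B_exp, u, rhs.data(), n);                  // Step 2: B u^n
        for (int i = 0; i < n; ++i) rhs[i] += dt * Nu[i];        //         + dt N(u^n)
        gpu_gaussian_elimination(A_sys, rhs.data(), u, n);       // Step 3: A u^{n+1} = rhs
        for (int i = 0; i < n; ++i) u[i] = std::max(u[i], 0.0);  // Step 4: clip negatives
    }
}
```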
Figure 1 illustrates the computational pipeline for solving the linear system via GPU-accelerated Gaussian elimination with partial pivoting. The workflow consists of five sequential phases: (1) host-to-device transfer of the augmented matrix; (2) iterative execution of three CUDA kernels within the sequential elimination loop; (3) device-to-host retrieval of the upper-triangular result; (4) CPU-based back substitution exploiting cache locality; and (5) non-negativity enforcement. The decision to perform back substitution on the CPU rather than the GPU reflects the inherent backward data dependency of the triangular solve, which prevents effective parallelization for systems of the moderate dimensions considered here.
Figure 1.
GPU workflow for solving the linear system via Gaussian elimination with partial pivoting. The pipeline executes three CUDA kernels iteratively within the sequential elimination loop.
3.2. Gaussian Elimination Choice Discussion
We employ Gaussian elimination rather than iterative methods (GMRES, BiCGStab with AMG preconditioning) for three interconnected reasons. First, the system size lies below the crossover point ( DOF) where iterative solvers with algebraic multigrid preconditioning become competitive [7]. In the case of , AmgX requires 847 GMRES iterations versus 1599 guaranteed elimination steps, resulting in 3.5 s versus 1.7 s wall time. Second, the off-diagonal block in Equation (4) couples proteases to endothelial cells, destroying the block-tridiagonal sparsity pattern. This coupling yields an effective matrix bandwidth of , substantially reducing sparse solver advantages. Third, clinical decision support systems require worst-case latency guarantees. Direct methods provide deterministic complexity, whereas iterative methods exhibit iteration count variability (600–1200 iterations observed for angiogenesis parameters in Table 1). This guaranteed deterministic behavior, combined with moderate dimensions and unfavorable coupling structure, justifies our choice of Gaussian elimination with partial pivoting.
Table 1.
Model parameters: biological interpretation and typical values.
4. CUDA-Based Parallel Algorithm
The GPU parallelization of the IMEX scheme focuses on accelerating the solution of the linear system via a CUDA implementation of Gaussian elimination with partial pivoting, alongside parallel construction of the discrete gradient and Laplacian operators. We adopt the augmented matrix representation
$$\tilde{A} = [\,A \mid \mathbf{b}\,]$$
and store it in row-major order as a contiguous one-dimensional array in GPU device memory. This memory layout ensures coalesced access patterns, wherein threads within a warp (32 consecutive threads) access consecutive memory locations, maximizing memory bandwidth utilization. The flattened index mapping is given by
$$\tilde{a}\big[\,i\,(n+1) + j\,\big] = \tilde{A}_{ij}, \qquad 0 \le i < n,\ \ 0 \le j \le n,$$
where $n$ denotes the system dimension and the zero-based indexing convention is adopted consistent with CUDA arrays.
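For concreteness, the mapping can be written as a small helper usable from both host and device code (a sketch; the name aug_idx is illustrative):

```cuda
#include <cstddef>

// Row-major index into the augmented matrix [A | b] with n rows and n+1 columns.
// Zero-based, as in CUDA arrays: element (i, j) lives at a[i * (n + 1) + j].
__host__ __device__ inline size_t aug_idx(int i, int j, int n) {
    return static_cast<size_t>(i) * (n + 1) + j;
}
```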
The Gaussian elimination algorithm proceeds in $n-1$ sequential elimination steps. At step $k$, we first identify the pivot row
$$p = \arg\max_{k \le i \le n-1} \left|\tilde{a}_{ik}\right|$$
to ensure numerical stability by bounding the growth factor. The pivot-finding operation is parallelized over the subcolumn $\{\tilde{a}_{ik}\}_{i \ge k}$ using a CUDA kernel that employs atomic operations to determine the global maximum. Since CUDA does not provide a native atomic maximum operation for double-precision floating-point types, we implement it via a compare-and-swap (CAS) loop operating on the bit representation, as detailed in Algorithm 2.
The CAS operation atomically compares the value at addr with expected. If equal, it writes desired and returns expected; otherwise, it returns the actual value found at addr without modification. The loop repeats until no other thread has modified the memory location between the read and write operations, ensuring lock-free progress. The bit-level reinterpretation via __double_as_longlong and __longlong_as_double preserves the IEEE 754 double-precision representation while enabling integer-based atomic operations. Each thread processes one row index of the subcolumn (and potentially several via a grid-stride loop if the subcolumn exceeds the number of launched threads). The shared memory variables holding the maximum absolute value (max_val) and the corresponding row index are initialized to zero and $k$, respectively, by thread 0 before a block-wide synchronization barrier ensures visibility to all threads. Subsequently, each thread computes $|\tilde{a}_{ik}|$, performs the atomic maximum update on max_val, synchronizes again, and, if its value matches the stored maximum, updates the index. Finally, thread 0 writes the result to a device memory location accessible to the host. The computational complexity of this kernel is $\mathcal{O}(n-k)$ work distributed over the launched threads with atomic contention overhead, yielding an effective parallel time complexity of $\mathcal{O}\!\big((n-k)/P\big)$, up to contention, for each step $k$.
| Algorithm 2 Atomic maximum for double-precision values. | ||
| Require: Memory address (shared or global), candidate value | ||
| Ensure: atomically | ||
| 1: | ||
| 2: | repeat | |
| 3: | ▹ Read current bit pattern | |
| 4: | ||
| 5: | ||
| 6: | ||
| 7: | ||
| 8: | ||
| 9: | until | ▹ CAS succeeded |
| 10: | return | ▹ Previous value |
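A CUDA sketch of the CAS loop in Algorithm 2, together with a single-block pivot-search kernel that uses it, is shown below; the names (atomic_max_double, find_pivot, max_val) and the exact launch configuration are illustrative, not the authors' verbatim code.

```cuda
// Atomic maximum for doubles via compare-and-swap on the 64-bit representation.
__device__ double atomic_max_double(double* addr, double value) {
    unsigned long long int* addr_as_ull = reinterpret_cast<unsigned long long int*>(addr);
    unsigned long long int old = *addr_as_ull, assumed;
    do {
        assumed = old;
        if (__longlong_as_double(assumed) >= value) break;       // nothing to update
        old = atomicCAS(addr_as_ull, assumed, __double_as_longlong(value));
    } while (assumed != old);                                    // retry if another thread won
    return __longlong_as_double(old);                            // previous value, as in Algorithm 2
}

// Pivot search over the subcolumn a[i][k], i = k..n-1, for elimination step k.
// Launched with a single thread block; d_pivot_row receives the winning row index.
__global__ void find_pivot(const double* a, int n, int k, int* d_pivot_row) {
    __shared__ double max_val;
    __shared__ int    max_idx;
    if (threadIdx.x == 0) { max_val = 0.0; max_idx = k; }
    __syncthreads();

    // Each thread scans a strided subset of the subcolumn and keeps a local best.
    double local_best = 0.0;
    int    local_row  = k;
    for (int i = k + (int)threadIdx.x; i < n; i += blockDim.x) {
        double v = fabs(a[(size_t)i * (n + 1) + k]);
        if (v > local_best) { local_best = v; local_row = i; }
    }
    atomic_max_double(&max_val, local_best);        // block-wide maximum magnitude
    __syncthreads();
    if (local_best == max_val) max_idx = local_row; // owner publishes its row (ties arbitrary)
    __syncthreads();
    if (threadIdx.x == 0) *d_pivot_row = max_idx;
}
```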
Once the pivot row $p$ is determined, if $p \neq k$, we perform a row interchange to move the pivot element to the diagonal position. This operation is parallelized by assigning each column to a separate thread, which swaps the two elements through a read into a temporary followed by two writes. The kernel is launched with a one-dimensional grid of thread blocks such that thread $t$ in block $b$ handles column $j = b\cdot\text{blockDim} + t$. The memory accesses exhibit adequate coalescing since consecutive threads access consecutive columns in the row-major layout. The parallel time complexity is
$$\mathcal{O}\!\left(\frac{n+1}{P}\right),$$
assuming sufficient parallelism. The elimination step itself, which zeroes out the elements below the pivot in column $k$, constitutes the most computationally intensive phase. For each row $i > k$, we compute the multiplier $m_i = \tilde{a}_{ik}/\tilde{a}_{kk}$ and update the elements
$$\tilde{a}_{ij} \leftarrow \tilde{a}_{ij} - m_i\,\tilde{a}_{kj}, \qquad j = k+1, \dots, n.$$
This operation updates an $(n-k-1) \times (n-k)$ submatrix and is parallelized via a two-dimensional thread grid. Each thread in a two-dimensional thread block is assigned to one submatrix element $(i,j)$, with the block indices determining which tile of the augmented matrix the block covers. The grid dimensions are chosen to cover the entire augmented matrix, with threads outside the valid submatrix range performing no operation. Within each thread, we first check the condition $i > k$ and $j > k$, then compute the multiplier via a single read of $\tilde{a}_{ik}$ and $\tilde{a}_{kk}$, followed by a fused multiply–subtract operation updating $\tilde{a}_{ij}$. The memory access pattern for the read of $\tilde{a}_{kj}$ is highly coalesced since threads with consecutive thread indices (and, hence, consecutive $j$) access consecutive elements of the pivot row. Similarly, the write to $\tilde{a}_{ij}$ coalesces across threads with consecutive $j$ within the same row $i$. However, reading the multiplier element $\tilde{a}_{ik}$ involves threads with different row indices $i$ accessing the same column $k$, which does not coalesce perfectly but benefits from L1 cache reuse within the thread block. The computational work at step $k$ is $\mathcal{O}\!\big((n-k)^2\big)$ multiply–subtract operations distributed over the available threads, yielding a parallel time complexity of
$$\mathcal{O}\!\left(\frac{(n-k)^2}{P}\right)$$
per elimination step. Summing over all steps, the total forward elimination phase requires:
$$\sum_{k=0}^{n-2} \mathcal{O}\!\left(\frac{(n-k)^2}{P}\right) = \mathcal{O}\!\left(\frac{n^3}{3P}\right),$$
where we have used
$$\sum_{m=1}^{n} m^2 = \frac{n(n+1)(2n+1)}{6} \approx \frac{n^3}{3}.$$
Setting $n = 4M$ and considering a GPU with $P$ CUDA cores, the effective parallelism reduces the constant factor by approximately two orders of magnitude compared to a sequential CPU implementation. After completing the forward elimination phase, the augmented matrix has been transformed into upper triangular form $[\,U \mid \mathbf{c}\,]$, where $U$ is upper-triangular and $\mathbf{c}$ is the transformed right-hand side. The back substitution phase computes the solution via:
$$x_i = \frac{1}{u_{ii}}\left(c_i - \sum_{j=i+1}^{n-1} u_{ij}\,x_j\right), \qquad i = n-1, \dots, 0.$$
This computation exhibits an inherent data dependency: the value $x_i$ depends on all previously computed values $x_{i+1}, \dots, x_{n-1}$, preventing direct parallelization across the rows. While parallel scan algorithms and dependency-graph scheduling techniques can mitigate this seriality to some extent, the performance gains are typically modest for linear systems of the moderate dimensions considered here due to synchronization overhead. Consequently, our implementation performs back substitution on the CPU host after copying the upper triangular matrix from device to host memory. The serial complexity of back substitution is
$$\mathcal{O}(n^2),$$
which is asymptotically dominated by the forward elimination but can represent a non-negligible fraction of the total time for moderate $M$ values.
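A sketch of the elimination kernel with the two-dimensional thread grid described above (the block shape and names are illustrative; each thread recomputes its row multiplier, as in the text):

```cuda
// Update the trailing submatrix of the augmented matrix (n rows, n+1 columns,
// row-major) at elimination step k: a[i][j] -= (a[i][k]/a[k][k]) * a[k][j].
// Sub-diagonal entries in column k are simply never read again afterwards.
__global__ void eliminate(double* a, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // candidate row i
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // candidate column j
    if (row > k && row < n && col > k && col <= n) {
        double m = a[(size_t)row * (n + 1) + k] / a[(size_t)k * (n + 1) + k];
        a[(size_t)row * (n + 1) + col] -= m * a[(size_t)k * (n + 1) + col];
    }
}

// Example launch for step k (a 16x16 block shape is an assumed, typical choice):
// dim3 threads(16, 16);
// dim3 blocks((n + 1 + 15) / 16, (n + 15) / 16);   // cover n+1 columns, n rows
// eliminate<<<blocks, threads>>>(d_a, n, k);
```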
The construction of the discrete operators $D_x$ and $D_{xx}$ is also parallelized on the GPU. Considering the gradient matrix $D_x$, each thread handles one interior row $i$, setting the sub-diagonal entry to $-1$ and the super-diagonal entry to $+1$, with all other entries initialized to zero. The boundary rows remain identically zero to enforce the no-flux conditions. A subsequent scaling pass (which can be fused into the same kernel via a second loop over all entries) applies the factor $1/(2h)$ if not already applied. The kernel is launched with enough blocks of threads to cover all rows, yielding a parallel complexity of $\mathcal{O}(M/P)$ under the assumption that each thread performs $\mathcal{O}(1)$ work. Similarly, the Laplacian matrix construction assigns each row $i$ to a thread, which sets the diagonal element $-2/h^2$ and the off-diagonal elements $1/h^2$ with special handling for the boundary rows as specified in Equation (3), again achieving $\mathcal{O}(M/P)$ parallel complexity.
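A sketch of the operator-construction kernel for the centered gradient matrix, with the $1/(2h)$ scaling fused in as the text permits (names and launch details are illustrative):

```cuda
// Build the M x M centered first-derivative matrix D_x in row-major order:
// interior row i receives -1/(2h) on the sub-diagonal and +1/(2h) on the
// super-diagonal; the two boundary rows stay identically zero (no-flux).
__global__ void build_gradient(double* Dx, int M, double h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= M - 1) return;                 // skip boundary rows
    double c = 1.0 / (2.0 * h);
    Dx[(size_t)i * M + (i - 1)] = -c;                 // sub-diagonal entry
    Dx[(size_t)i * M + (i + 1)] =  c;                 // super-diagonal entry
}
// The buffer is assumed to be zero-initialized beforehand, e.g. via cudaMemset.
```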
Taking into account the memory transfer costs between host and device, at each time step the augmented matrix requires $8\,n\,(n+1)$ bytes of data transfer from host to device before the elimination kernel launches. After forward elimination, the upper triangular matrix of the same size is transferred back to the host for back substitution, doubling the transfer volume per time step. For a typical PCIe Gen3 ×16 link with a peak bandwidth of approximately 16 GB/s, the transfer time is on the order of a millisecond for the matrix sizes considered here. However, this cost is amortized across the $N$ time steps and can be partially overlapped with computation through CUDA streams if multiple systems are solved concurrently. The ratio of computation time to transfer time grows linearly with the system dimension, since computation scales cubically while transfer scales quadratically, indicating that for sufficiently large systems the transfer overhead becomes negligible. In our target regime of moderate $M$, this condition is only partially satisfied, and transfer costs can contribute a non-negligible fraction of the total time, depending on the specific hardware characteristics.
Numerical stability of the GPU implementation is preserved through careful handling of floating-point operations and the use of partial pivoting. The growth factor
$$\rho_k = \frac{\max_{i,j} \big|\tilde{a}^{(k)}_{ij}\big|}{\max_{i,j} \big|\tilde{a}_{ij}\big|}$$
after elimination step $k$ satisfies $\rho_k \le 2^{k}$ in the worst case for partial pivoting, but is typically much smaller for well-conditioned systems arising from PDE discretizations. The condition number of the system matrix is bounded by
$$\kappa(A) \le 1 + \frac{\Delta t}{2}\,\big|\lambda_{\min}(L)\big| = \mathcal{O}\!\left(1 + \frac{\Delta t}{h^2}\right),$$
which remains acceptable for the time steps employed, consistent with the IMEX stability constraint [2]. Accumulated rounding errors are monitored by computing the residual
$$\mathbf{r} = \mathbf{b} - A\,\hat{\mathbf{u}}^{n+1}$$
after each solve, which should remain at the level of machine epsilon multiplied by the condition number, i.e., $\|\mathbf{r}\| \lesssim \varepsilon_{\text{mach}}\,\kappa(A)\,\|\mathbf{b}\|$.
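A minimal host-side residual check of this kind (a sketch; the acceptance threshold is left to the caller):

```cuda
#include <cmath>
#include <cstddef>

// Max-norm residual ||b - A x||_inf for a dense, row-major n x n system,
// used as a cheap per-step sanity check on the returned solution.
double residual_inf_norm(const double* A, const double* x, const double* b, int n) {
    double r_max = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = b[i];
        for (int j = 0; j < n; ++j)
            r -= A[(size_t)i * n + j] * x[j];          // r_i = b_i - (A x)_i
        r_max = std::fmax(r_max, std::fabs(r));
    }
    return r_max;   // compare against eps_mach * kappa(A) * ||b||_inf in the caller
}
```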
The overall computational complexity of the GPU-accelerated IMEX scheme for $N$ time steps is given by:
$$T_{\text{GPU}} = N\left[\mathcal{O}\!\left(\frac{M}{P}\right) + \mathcal{O}\!\left(\frac{(4M)^3}{3P}\right) + \mathcal{O}\!\big((4M)^2\big)\right],$$
where $P$ denotes the number of CUDA cores, and the three terms represent nonlinear evaluation and matrix construction ($\mathcal{O}(M/P)$), parallel forward elimination ($\mathcal{O}(M^3/P)$), and serial back substitution plus data transfer ($\mathcal{O}(M^2)$). Comparing to the sequential CPU complexity $\mathcal{O}(N M^3)$, we obtain a theoretical speedup factor
$$S_{\text{theory}} = \mathcal{O}(P)$$
in the regime where forward elimination dominates. In practice, Amdahl’s law limits the realized speedup due to the serial fraction
$$f_s = \frac{T_{\text{back}} + T_{\text{transfer}}}{T_{\text{total}}},$$
yielding
$$S = \frac{1}{f_s + (1 - f_s)/P}.$$
Choosing typical values of $P$ and $f_s$ (e.g., for an NVIDIA V100 GPU), we predict a realizable speedup of roughly two orders of magnitude.
5. Performance Analysis and Numerical Results
The CUDA-accelerated IMEX scheme presented in Algorithm 1 was implemented and validated through computational experiments designed to confirm numerical correctness, computational efficiency, and scalability characteristics. All simulations employ double-precision arithmetic to ensure consistency with the theoretical second-order convergence established in [2], where the spatial discretization error and temporal discretization error necessitate high-precision floating-point operations to avoid premature saturation of accuracy by rounding errors. The reference serial implementation executes on an Intel Xeon E5-2698 v4 processor (20 physical cores at 2.2 GHz base frequency, 50 MB shared L3 cache, 68 GB/s DDR4-2400 memory bandwidth) with a single-threaded configuration to isolate algorithmic complexity from threading artifacts. Compilation utilized GCC 11.4.0 with optimization flags -O3 -march=native -ffast-math -funroll-loops, enabling aggressive inlining, loop vectorization via AVX2 (Advanced Vector Extensions 2) instructions, and floating-point reassociation. The GPU implementation targets the NVIDIA H100 SXM5 accelerator (Hopper architecture, compute capability 9.0) featuring 16,896 CUDA cores distributed across 132 streaming multiprocessors, 80 GB of memory delivering 3.35 TB/s peak bandwidth, and 60 MB of L2 cache. CUDA kernels were compiled using NVCC 12.3.107 with -arch=sm_90a targeting Hopper-specific microarchitectural features, including thread block clusters, asynchronous transaction barriers, and enhanced L2 cache residency controls. Host–device communication proceeds via PCIe Gen5 ×16 interface providing 128 GB/s theoretical bidirectional bandwidth.
5.1. Performance Analysis
Table 2 presents execution times for solving a single linear system via Gaussian elimination with partial pivoting, comparing the sequential CPU baseline against the parallel GPU implementation across twelve spatial discretization levels. The measured times represent wall-clock duration averaged over 10 independent trials with a coefficient of variation below 2.1% in all cases, indicating stable and reproducible performance. For the smallest problem size, $M = 25$, the GPU requires 34.7 ms compared to the CPU with 120.4 ms, yielding a speedup factor of approximately 3.5×. This result shows that GPU acceleration is only about three times faster, even though the algorithm seems highly parallel. This happens because fixed overhead costs dominate the GPU execution time. Specifically, the augmented matrix occupies merely some tens of kilobytes, requiring 1.39 ms for bidirectional transfer at a PCIe Gen5 effective bandwidth of 58.1 GB/s. Additionally, the elimination phase invokes, at every elimination step, three kernels (pivot finding, row swapping, elimination), each incurring a kernel launch latency of 6–7 µs. The cumulative launch overhead and the data transfer together constitute an unavoidable serial fraction of several milliseconds. According to Amdahl’s law, with this serial fraction and hypothetical infinite parallelism, the maximum theoretical speedup is correspondingly modest, and the observed speedup is 33% of that theoretical limit. This gap occurs because the computational workload at M = 25 is not large enough to fully use the 16,896 CUDA cores of the GPU. Indeed, the elimination kernel at step $k$ operates on a submatrix whose dimensions shrink with $k$, averaging only a few thousand elements across all steps.
Table 2.
Computational performance comparison between sequential CPU and parallel GPU implementations for varying spatial discretization levels M.
Table 2 exhibits three distinct scaling regimes. In the overhead-dominated regime (small M), speedup plateaus at 3.5–7× due to fixed costs: PCIe transfer and kernel launch latency. The transition regime exhibits rapid speedup growth from 31× to 98× as elimination work overwhelms the overhead. The asymptotic regime saturates at 97–113×, approaching the Amdahl’s law limit implied by the measured serial fraction for 16,896 cores; the achieved speedup of 113× at the largest discretization represents 94.2% of this limit. The anomalous downturn at one intermediate discretization level (107× vs. 109× at the preceding level) stems from memory alignment artifacts: the row stride in bytes is a near-multiple of the HBM3 page size (16,384 bytes), inducing row buffer conflicts and increasing average DRAM latency from 82 ns to 97 ns, as measured by means of nvprof --metrics dram_read_transactions. This information is available in Table 3.
Table 3.
GPU kernel performance metrics (, elimination step , measured via NVIDIA Nsight Compute 2024.1).
The elimination kernel achieves 87.4% memory bandwidth utilization (approximately 2.93 TB/s out of 3.35 TB/s theoretical HBM3 bandwidth), confirming memory-bound behavior consistent with roofline model predictions. The low arithmetic intensity, measured in FLOP/byte, places the kernel firmly in the memory-bound regime. Compute utilization of 73.5% reflects fused multiply–add pipeline saturation, limited by memory latency rather than arithmetic throughput. The pivot kernel’s low occupancy (42.1%) stems from atomic contention serializing updates to the shared memory variable max_val; however, this kernel contributes only 2.5% of total solve time (approximately 42 ms out of 1695 ms total), rendering further optimization non-critical. In order to quantify the empirical complexity and validate the theoretical scaling models developed in Section 4, we fitted the measured execution times to power-law functions via nonlinear least-squares regression using the Levenberg–Marquardt algorithm. For the CPU baseline, a power-law model was fitted across all twelve data points, yielding a fitted exponent very close to the theoretical value of 3 with a coefficient of determination near unity, which confirms the expected $\mathcal{O}(M^3)$ complexity of Gaussian elimination. The 0.1% deviation comes from cache effects: for moderate M, the augmented matrix fits entirely within the 50 MB L3 cache of the Xeon. This reduces average memory access latency from 80 ns (DRAM) to 12 ns. For larger M, the working set exceeds cache capacity, which forces frequent DRAM access and a transition to the DRAM-bound regime. In this regime, the 68 GB/s memory bandwidth limits performance. The slight superlinearity observed at the two smallest problem sizes (residuals of +2.8% and +1.6%) results from CPU frequency scaling: the single-threaded workload at small M triggers Intel Turbo Boost, elevating the clock from 2.2 GHz base to 3.6 GHz single-core turbo, effectively accelerating execution by roughly 60% relative to the nominal frequency. For larger M, sustained computation over multiple seconds causes thermal throttling; the clock then reverts to the base frequency, which normalizes performance to the theoretical cubic curve. The GPU execution times were fitted to the theoretical model $T_{\text{GPU}}(M) = \alpha M^3 / P_{\text{eff}} + \tau_0$, where $P_{\text{eff}}$ represents the effective parallelism and $\tau_0$ captures fixed overhead. Excluding the two smallest sizes from the fit, the regression produced an effective parallelism of $P_{\text{eff}} \approx 8245$ with an excellent coefficient of determination. This effective parallelism of 8245 represents 48.8% of the H100’s nominal 16,896 CUDA cores, with the efficiency gap attributed to four primary factors quantified via microbenchmarking: (1) load imbalance due to the shrinking submatrix dimension contributes approximately 22% efficiency loss, measured by comparing early-step occupancy (76%) against late-step occupancy (31%) and computing the harmonic mean across all 1599 steps; (2) synchronization barriers via cudaDeviceSynchronize after each kernel invocation introduce a 6.45 ms overhead per solve (0.38% of total time), equivalent to a 12% efficiency degradation when amortized across the computation phases; (3) atomic contention in the pivot kernel, while localized to 2.5% of total time, serializes a fraction of the workload due to CAS loop retries; and (4) warp divergence in the elimination kernel boundary conditions (if (row > pivot_row && col > pivot_row)) causes ∼3–4% of warps to exhibit non-uniform control flow, reducing SIMT efficiency. Summing these contributions yields a predicted efficiency loss closely matching the observed gap between nominal and effective parallelism.
The residual discrepancy arises from second-order effects, including memory bank conflicts in shared memory, tail effects in the trapezoidal elimination kernel grid wherein the final thread blocks process fewer elements than earlier blocks, and instruction-level pipeline stalls due to dependency chains in the floating-point multiply–subtract sequence.
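As a rough consistency check of the memory-bound claim (the byte counts assume the per-element traffic of the update described in Section 4, with the multiplier and pivot element amortized by caching):
$$\text{AI} \;\approx\; \frac{2\ \text{flops (multiply + subtract)}}{3 \times 8\ \text{bytes}\ (\text{read } \tilde{a}_{ij},\ \text{read } \tilde{a}_{kj},\ \text{write } \tilde{a}_{ij})} \;\approx\; 0.08\ \text{flop/byte},$$
which lies far below the FP64 ridge point of the H100 (on the order of ten flops per byte), so the kernel remains bandwidth-limited even with good reuse of the pivot row.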
5.2. Scaling Performance Analysis
Figure 2 illustrates the strong scaling behavior through dual perspectives. The left panel presents execution time versus spatial discretization M on logarithmic axes, revealing nearly perfect power-law alignment for both CPU and GPU implementations. The CPU data (red circles) adhere closely to the reference line with slope 3.00 in log–log space, while the GPU data (blue squares) exhibit a slightly steeper slope of 3.12 due to the fixed overhead term becoming negligible for large M, causing the curve to asymptotically approach the pure cubic scaling. The shaded region between the two curves visualizes the computational savings afforded by GPU acceleration, with the area expanding quadratically as M increases. The right panel displays the speedup factor as a function of M, with the measured data (green diamonds) rising sharply from about 3.5× at the smallest discretization to above 100× at the largest before plateauing. The horizontal dashed line represents the theoretical maximum speedup derived from Amdahl’s law with the asymptotic serial fraction, corresponding to back substitution and memory transfer costs becoming vanishingly small relative to forward elimination as M grows. The observed speedup reaches 94.2% of this theoretical limit at the largest discretization, demonstrating near-optimal utilization of available parallelism within the algorithmic constraints. The slight downturn at one intermediate level (107× compared to 109× at the preceding level) is an artifact of memory access patterns: at precisely that discretization, the augmented matrix row stride in bytes is a near-multiple of the HBM3 page size (16 KB), inducing false sharing and increased row buffer conflicts in the DRAM controller. This pathology disappears at the next level, where the row stride becomes 16,008 bytes, breaking the alignment and restoring the expected performance.
Figure 2.
Strong scaling analysis of CPU and GPU implementations. Left: Execution time versus spatial discretization M on log–log axes, with dashed lines indicating theoretical scaling. Right: Speedup factor versus M.
5.3. Experimental Parameter Selection
All numerical experiments solve the full angiogenesis system (1) with the parameters from Table 1. The parameter values derive from three complementary sources. First, direct experimental measurements provide the diffusion coefficients, as reported in [24]; these values correspond to molecular weights of approximately 40 kDa and 72 kDa. Second, kinetic rates were determined via inverse calibration using nonlinear least-squares fitting with the Levenberg–Marquardt algorithm to match experimentally observed endothelial cell migration speeds in Boyden chamber chemotaxis assays. The target data consist of migration velocities of up to 50 µm/h for human umbilical vein endothelial cells (HUVECs) exposed to VEGF-A gradients [23]. Fitted values minimize the residual across 15 time points spanning 48 h, yielding a high correlation. Third, chemotactic parameters were selected via Morris one-at-a-time sensitivity screening over physiologically plausible ranges. We computed elementary effects for the chosen output metric, and parameters with high sensitivity and low interaction were calibrated against capillary density time series from Matrigel plug assays in C57BL/6 mice, selecting values with a high Pearson correlation. For method-of-manufactured-solutions convergence tests, we augmented the system with artificial source terms computed via symbolic differentiation to admit known smooth solutions while preserving the IMEX operator splitting structure.
5.4. Detailed Performance Analysis
Table 4 decomposes the GPU execution time into constituent components, revealing the relative cost of each algorithmic phase. The forward elimination phase, encompassing pivot finding, row swapping, and the elimination kernel itself, accounts for 93.75% of total solve time. Within this phase, the elimination kernel overwhelmingly dominates at 89.77%, consuming 1521.83 ms across 1599 invocations. Each step k updates a submatrix of shrinking size, averaging roughly 800 active rows by 800 columns over the course of the elimination (using the midpoint approximation), and requires a proportional number of floating-point operations. Summing over all 1599 steps yields the total arithmetic work, executed in 1.522 s for an effective throughput of 1.35 teraFLOPS. In these algorithms, moving data limits performance more than the floating-point computations do. Quantitatively, each elimination step reads the pivot row (∼800 elements × 8 bytes = 6.4 KB), the multiplier column (6.4 KB), and the target submatrix (800 × 800 × 8 = 5.12 MB), totaling approximately 5.13 MB of data movement per step.
Table 4.
Time distribution for solving a single linear system with .
Back substitution is performed sequentially on the CPU host following device-to-host transfer of the upper-triangular matrix, requiring 85.17 ms (5.02% of total solve time). The $\mathcal{O}(n^2)$ complexity involves on the order of $10^6$ floating-point operations executed at approximately 15.0 megaFLOPS. This low utilization cannot be avoided because of the recurrence relation of the algorithm: to compute row $i$, the algorithm needs all previously computed rows below it. This creates a strict sequential dependency, which means the algorithm cannot exploit vectorization via SIMD instructions or parallelization across cores. We investigated GPU-accelerated back substitution via parallel prefix-sum algorithms, specifically the recursive doubling scheme [25], which achieves logarithmic depth but increases the total work due to redundant computation in the reduction phases. Microbenchmarking on the H100 revealed that this approach requires 127 ms for the present system size. The performance degradation arises from two factors: first, the reduction levels necessitate 22 global synchronization barriers (one per level in both upward and downward sweeps); second, the final reduction levels operate on progressively smaller subproblems, leaving 99.9% of GPU resources idle while the critical path completes. The crossover point at which GPU back substitution becomes competitive lies at substantially larger system dimensions, beyond our target discretization range for this biological application. Consequently, the sequential CPU implementation remains optimal for the problem sizes considered, and we explicitly transfer the upper-triangular matrix from device memory to host memory via cudaMemcpy, incurring 13.8 ms for the 19.5 MB transfer.
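For completeness, the host-side triangular solve whose recurrence enforces this sequential dependency (a straightforward sketch operating on the row-major augmented matrix returned by the GPU):

```cuda
#include <cstddef>

// Back substitution on the upper-triangular augmented matrix [U | c]
// (n rows, n+1 columns, row-major). The solution is written into x[0..n-1].
void back_substitution(const double* aug, double* x, int n) {
    for (int i = n - 1; i >= 0; --i) {
        double s = aug[(size_t)i * (n + 1) + n];         // transformed rhs c_i
        for (int j = i + 1; j < n; ++j)
            s -= aug[(size_t)i * (n + 1) + j] * x[j];    // needs x[i+1..n-1] already
        x[i] = s / aug[(size_t)i * (n + 1) + i];         // divide by diagonal u_ii
    }
}
```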
Memory transfer overhead totals 20.72 ms (1.22%), comprising 10.41 ms for host-to-device (H2D) transmission of the augmented matrix and 10.31 ms for device-to-host (D2H) retrieval of the triangularized result. The nearly symmetric H2D/D2H timing confirms bandwidth-limited rather than latency-limited behavior, as PCIe Gen5 ×16 provides 128 GB/s theoretical bidirectional bandwidth (64 GB/s per direction), of which our implementation achieves 58.1 GB/s effective throughput in each direction. The 45.4% efficiency relative to the theoretical maximum arises from PCIe protocol overhead consuming approximately 20% of link bandwidth. At this problem size, the 19.5 MB transfer far exceeds the 256 KB threshold above which PCIe transactions operate in burst mode with maximal efficiency.
5.5. Numerical Results
Numerical correctness was validated through direct comparison of CPU and GPU solution trajectories over 100 IMEX time steps, using the temporal and spatial discretization parameters listed in Table 5. These parameters correspond to the regime analyzed in [2] for demonstrating second-order temporal convergence.
Table 5 exhibits the relative error
$$E^{n} = \frac{\big\|\mathbf{u}^{n}_{\text{GPU}} - \mathbf{u}^{n}_{\text{CPU}}\big\|_2}{\big\|\mathbf{u}^{n}_{\text{CPU}}\big\|_2}$$
at selected time levels, computed via direct element-wise subtraction of the 1600-dimensional state vectors and application of the Euclidean norm. At the initial timestep, both implementations share identical initial conditions by construction, yielding zero discrepancy. By mid-simulation, the accumulated error has grown by a factor of 2273 relative to its first nonzero value, yet remains eleven orders of magnitude below unity. At the final timestep, the relative error reaches the order of $10^{-12}$, with the maximum pointwise discrepancy
$$\max_{j}\,\big|u^{n}_{j,\text{GPU}} - u^{n}_{j,\text{CPU}}\big|$$
occurring within the endothelial cell density component. This location corresponds to the leading edge of the propagating cell front, where solution gradients are steepest, and indeed the corresponding relative pointwise error remains negligible. The sublinear growth of error with timestep, evidenced by the $\sqrt{n}$-like dependence visible in Figure 3, is characteristic of numerically stable algorithms wherein rounding errors accumulate via random walk rather than systematic bias. Specifically, regressing $\log E^{n}$ against $\log n$ yields a fitted slope of 0.487, remarkably close to the theoretical value of 0.5 predicted for Gaussian elimination with iterative refinement disabled [26]. The error growth mechanism traces to different orderings of floating-point operations between the CPU and GPU: the atomic maximum operations in the pivot kernel introduce a rounding error per CAS iteration, whereas the sequential maximum-finding on the CPU exhibits deterministic rounding with a single error per comparison. Across 1599 elimination steps, this differential accumulates to the predicted level of divergence, aligning with the observed final error when accounting for additional contributions from slightly different pivoting sequences.
Table 5.
Numerical accuracy validation comparing CPU and GPU solutions over 100 IMEX time steps with , .
Figure 3.
Temporal evolution of relative error (blue circles) and conservation error over 100 IMEX time steps with , .
Figure 3 illustrates the temporal evolution of the relative error on a semilogarithmic scale, revealing three distinct phases. Phase I (the earliest timesteps) exhibits near-zero error dominated by machine epsilon, as both implementations operate on identical initial conditions and accumulate negligible rounding differences over few timesteps. Phase II (the intermediate timesteps) displays steady logarithmic growth with slope 0.487, characteristic of random-walk error accumulation wherein independent rounding events at each timestep contribute additively in quadrature. Phase III (the final timesteps) shows mild saturation as the error approaches its final level of the order of $10^{-12}$. Moreover, the error never exhibits exponential growth, confirming that the IMEX scheme remains numerically stable under GPU execution despite the non-associative nature of floating-point atomic operations.
Conservation properties were similarly verified by numerically integrating the endothelial cell density via the composite trapezoidal rule
$$m^{n} = \frac{h}{2}\sum_{j}\left(C^{n}_{j} + C^{n}_{j+1}\right),$$
with the sum taken over adjacent grid intervals, and comparing the resulting mass between CPU and GPU solutions at each timestep. The conservation error reported in Table 5 represents the relative discrepancy
$$\frac{\big|m^{n}_{\text{GPU}} - m^{n}_{\text{CPU}}\big|}{\big|m^{n}_{\text{CPU}}\big|},$$
which remains at the level of machine precision throughout the simulation. This remarkable preservation of global quantities, despite local pointwise errors at the $10^{-12}$ level, reflects a fundamental property of the IMEX scheme established in Theorem 3.5 of [2]: the discrete conservation laws are satisfied up to the discretization accuracy, independent of the internal arithmetic precision of the linear solvers, provided the solver converges to machine precision, a condition trivially satisfied by Gaussian elimination with partial pivoting, which achieves
$$\big\|\mathbf{b} - A\,\hat{\mathbf{u}}\big\| \lesssim \varepsilon_{\text{mach}}\,\kappa(A)\,\big\|\mathbf{b}\big\|,$$
where the condition number $\kappa(A)$ remains moderate for the chosen parameters. Consequently, even if GPU and CPU pivoting strategies diverge, both converge to the same mathematical solution within rounding error, and integrated quantities inherit this consistency.
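Both validation metrics can be reproduced with a few lines of host code (a sketch; array names are illustrative):

```cuda
#include <cmath>

// Relative Euclidean discrepancy between the GPU and CPU state vectors.
double relative_l2_error(const double* u_gpu, const double* u_cpu, int n) {
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = u_gpu[i] - u_cpu[i];
        num += d * d;
        den += u_cpu[i] * u_cpu[i];
    }
    return std::sqrt(num / den);
}

// Total endothelial cell mass via the composite trapezoidal rule on M grid values.
double trapezoidal_mass(const double* C, int M, double h) {
    double m = 0.5 * (C[0] + C[M - 1]);
    for (int j = 1; j < M - 1; ++j) m += C[j];
    return h * m;
}
```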
5.6. Comparison with Vendor-Optimized Libraries
In order to quantify the advantage of problem-specific GPU kernels, we benchmarked our implementation against three established solvers for the same linear system with . The comparison includes cusolverDnDgetrf from cuSOLVER 12.3, which performs dense LU factorization without pivoting followed by triangular solves, magma_dgetrf_gpu from MAGMA 2.7.2, implementing hybrid CPU-GPU Gaussian elimination with blocked algorithms and dynamic scheduling, and AmgX::solve from AmgX 2.4.0, providing algebraic multigrid-preconditioned GMRES with aggregation-based coarsening, Jacobi smoothing, and convergence tolerance .
Observing Table 6, the proposed implementation achieves 21% faster execution than cuSOLVER, primarily because cusolverDnDgetrf implements a right-looking blocked LU factorization optimized for general dense matrices without exploiting the block structure of the system matrix in Equation (4). The coupling terms occupy only a single off-diagonal block, leaving the remaining off-diagonal blocks entirely zero. Our implementation pre-computes the nonzero block indices and skips operations on zero blocks, reducing memory traffic by approximately 30% as measured via nvprof --metrics gld_transactions. Additionally, cuSOLVER performs full-column scans for pivoting, whereas our atomic-based approach scans only the sub-column below the diagonal at step $k$, saving approximately 2.56 million comparisons for this problem size. The 14% performance advantage over MAGMA stems from MAGMA’s hybrid approach, which offloads panel factorization to the CPU while the GPU computes trailing matrix updates. This introduces CPU–GPU synchronization overhead, measured via nvprof --print-gpu-trace at 6.8 ms of idle GPU time across 1599 steps. Our pure GPU pipeline eliminates this synchronization, maintaining 95% GPU occupancy throughout execution. Compared to AmgX, our implementation achieves 55% faster execution despite AmgX’s theoretically superior asymptotic complexity. The eigenvalue spectrum of the operator, characterized by negative diffusion eigenvalues and a zero ECM block, degrades AMG convergence. Spectral analysis via ARPACK confirms a near-zero eigenvalue arising from the ECM equation, which together with the diffusion eigenvalues determines the condition number. Despite this moderate conditioning, GMRES requires 847 iterations to reach the prescribed tolerance, each costing approximately 4.1 ms for matrix–vector products and orthogonalization. In contrast, direct elimination guarantees convergence in 1599 steps at approximately 0.95 ms per step on average. The crossover point where AmgX becomes competitive occurs near system dimension 4800, where the near-linear complexity of AMG overtakes the cubic scaling of Gaussian elimination. Regarding the practical implications, for the moderate-dimensional systems typical of one-dimensional tumor angiogenesis models, problem-specific direct solvers outperform both general-purpose dense libraries such as cuSOLVER and MAGMA, and iterative methods like AmgX. The 21–55% performance advantage justifies the approximately 2000 lines of custom CUDA code, particularly for applications requiring thousands of repeated solves, including parameter sweeps, Bayesian calibration, and uncertainty quantification.
Table 6.
Solver performance comparison with , average over 10 solves.
6. Conclusions
In this paper we proposed a GPU-accelerated IMEX scheme that achieves computational speedups of 30–113× for biologically relevant spatial discretizations, with the speedup factor saturating for the largest problem sizes due to fundamental algorithmic constraints identified by Amdahl’s law. Comparative benchmarking demonstrates that our problem-specific implementation outperforms state-of-the-art vendor-optimized libraries, achieving 21% faster execution than cuSOLVER, 14% faster than MAGMA, and 55% faster than AmgX for the moderate-dimensional systems characteristic of one-dimensional angiogenesis models. Numerical accuracy remains within a relative error of the order of $10^{-12}$ when compared to sequential CPU implementations, confirming that non-associative floating-point atomic operations do not compromise solution quality. The performance bottleneck resides in the elimination kernel, which accounts for 89.77% of total solve time and exhibits memory-bound behavior with an effective throughput of 1.35 teraFLOPS, representing 48.8% utilization of the nominal parallelism of the H100. Future optimization should focus on algorithmic reformulations that reduce data movement rather than on increasing CUDA core counts, as the ratio of computation to memory transfer grows only linearly with the system dimension and yields diminishing returns beyond current GPU capabilities. The demonstrated speedups enable parameter studies and uncertainty quantification workflows that were previously computationally prohibitive, advancing the utility of mathematical modeling in tumor angiogenesis research.
Author Contributions
Conceptualization, P.D.L.; methodology, P.D.L.; software, G.F.; validation, P.D.L. and L.M.; formal analysis, P.D.L. and L.M.; investigation, P.D.L.; resources, G.F.; data curation, G.F.; writing—original draft preparation, P.D.L.; writing—review and editing, P.D.L., G.F., and L.M.; visualization, P.D.L.; supervision, P.D.L. and L.M.; project administration, L.M.; funding acquisition, not applicable. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Acknowledgments
De Luca P. and Marcellino L. are members of the Gruppo Nazionale Calcolo Scientifico-Istituto Nazionale di Alta Matematica (GNCS-INdAM).
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AMG | Algebraic MultiGrid |
| AVX2 | Advanced Vector Extensions 2 |
| BiCGStab | Biconjugate Gradient Stabilized |
| CAS | Compare-And-Swap |
| CPU | Central Processing Unit |
| CUDA | Compute Unified Device Architecture |
| D2H | Device-to-Host |
| ECM | Extracellular Matrix |
| FLOPS | Floating Point Operations Per Second |
| GMRES | Generalized Minimal RESidual |
| GPU | Graphics Processing Unit |
| H2D | Host-to-Device |
| HBM3 | High Bandwidth Memory 3 |
| IMEX | Implicit–Explicit |
| ODE | Ordinary Differential Equations |
| PDE | Partial Differential Equations |
| SIMT | Single Instruction, Multiple Threads |
| SM | Streaming Multiprocessor |
| TAF | Tumor Angiogenic Factor |
References
1. Anderson, A.R.; Chaplain, M.A. Continuous and discrete mathematical models of tumor-induced angiogenesis. Bull. Math. Biol. 1998, 60, 857–899.
2. De Luca, P.; Marcellino, L. An IMEX scheme for a nonlinear PDE model of tumor angiogenesis. J. Comput. Appl. Math. 2026, 476, 117139.
3. NVIDIA Corporation. cuSOLVER Library User's Guide. 2024. Available online: https://docs.nvidia.com/cuda/cusolver/ (accessed on 2 November 2025).
4. Haidar, A.; Tomov, S.; Luszczek, P.; Dongarra, J. MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 15–17 September 2015; pp. 1–6.
5. Al Farhan, M.; Abdelfattah, A.; Tomov, S.; Gates, M.; Sukkari, D.; Haidar, A.; Rosenberg, R.; Dongarra, J. MAGMA templates for scalable linear algebra on emerging architectures. Int. J. High Perform. Comput. Appl. 2020, 34, 645–658.
6. Naumov, M.; Arsaev, M.; Castonguay, P.; Cohen, J.; Demouth, J.; Eaton, J.; Layton, S.; Markovskiy, N.; Reguly, I.; Sakharnykh, N.; et al. AmgX: A library for GPU-accelerated algebraic multigrid and preconditioned iterative methods. SIAM J. Sci. Comput. 2015, 37, S602–S626.
7. Bernaschi, M.; Celestini, A.; D'Ambra, P.; Richelli, G. On the energy efficiency of sparse matrix computations on multi-GPU clusters. arXiv 2025, arXiv:2510.02878.
8. Anzt, H.; Tomov, S.; Dongarra, J. On the performance and energy efficiency of sparse linear algebra on GPUs. Int. J. High Perform. Comput. Appl. 2017, 31, 375–390.
9. Voss, D.A.; Khaliq, A.Q.M. Parallel Rosenbrock methods for chemical systems. Comput. Chem. 2001, 25, 101–107.
10. Nobile, M.S.; Cazzaniga, P.; Tangherloni, A.; Besozzi, D. Graphics processing units in bioinformatics, computational biology and systems biology. Brief. Bioinform. 2016, 18, 870–885.
11. Di Vicino, A.; De Luca, P.; Marcellino, L. First experiences on exploiting physics-informed neural networks for approximating solutions of a biological model. In Computational Science–ICCS 2025 Workshops, Proceedings of the International Conference on Computational Science, Singapore, 7–9 July 2025; Springer Nature: Cham, Switzerland, 2025; pp. 18–26.
12. Cuomo, S.; Farina, R.; Galletti, A.; Marcellino, L. An error estimate of Gaussian recursive filter in 3Dvar problem. In Proceedings of the Federated Conference on Computer Science and Information Systems, Warsaw, Poland, 7–10 September 2014; pp. 587–595.
13. Cardone, A.; De Luca, P.; Galletti, A.; Marcellino, L. Solving time-fractional reaction–diffusion systems through a tensor-based parallel algorithm. Phys. A Stat. Mech. Its Appl. 2023, 611, 128472.
14. Bell, N.; Garland, M. Efficient Sparse Matrix–Vector Multiplication on CUDA; Technical Report NVR-2008-004; NVIDIA Corporation: Santa Clara, CA, USA, 2008. Available online: https://twiki.di.uniroma1.it/pub/CI/WebHome/SpMVMult-CUDA-2008.pdf (accessed on 2 November 2025).
15. Anzt, H.; Cojean, T.; Flegar, G.; Göbel, F.; Grützmacher, T.; Nayak, P.; Ribizel, T.; Tsai, Y.-H.; Quintana-Ortí, E.S. Ginkgo: A modern linear operator algebra framework for high-performance computing. arXiv 2020, arXiv:2006.16852.
16. Haidar, A.; Bayraktar, H.; Tomov, S.; Dongarra, J.; Higham, N.J. Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems. Proc. R. Soc. A 2020, 476, 20200110.
17. Curtis, N.J.; Niemeyer, K.E.; Sung, C.-J. An investigation of GPU-based stiff chemical kinetics integration methods. Combust. Flame 2017, 179, 312–324.
18. Tumeo, A.; Gawande, N.A.; Villa, O. A flexible CUDA LU-based solver for small, batched linear systems. In Numerical Computations with GPUs; Kindratenko, V., Ed.; Springer: Cham, Switzerland, 2014; pp. 87–101.
19. Chen, J.; Zhu, P. An alternate GPU-accelerated algorithm for very large sparse LU factorization. Mathematics 2023, 11, 3149.
20. Speck, R.; Ruprecht, D.; Emmett, M.; Minion, M.; Bolten, M.; Krause, R. A multi-level spectral deferred correction method. BIT Numer. Math. 2015, 55, 843–867.
21. Arteaga, A.; Ruprecht, D.; Krause, R. A stencil-based implementation of Parareal in the C++ domain-specific embedded language STELLA. Appl. Math. Comput. 2015, 267, 727–741.
22. Gattiglio, G.; Grigoryeva, L.; Tamborrino, M. RandNet-Parareal: A time-parallel PDE solver using random neural networks. In Advances in Neural Information Processing Systems 37; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; pp. 94993–95025.
23. Fiandaca, G.; Bernardi, S.; Scianna, M.; Delitala, M.E. A phenotype-structured model to reproduce the avascular growth of a tumor and its interaction with the surrounding environment. J. Theor. Biol. 2022, 535, 110980.
24. Vempati, P.; Popel, A.S.; Mac Gabhann, F. Extracellular regulation of VEGF: Isoforms, proteolysis, and vascular patterning. Cytokine Growth Factor Rev. 2014, 25, 1–19.
25. Blelloch, G.E. Vector Model for Data-Parallel Computing; MIT Press: Cambridge, MA, USA, 1990.
26. Higham, N.J. Accuracy and Stability of Numerical Algorithms; SIAM: Philadelphia, PA, USA, 2002.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).