Implementation of a GPU-Accelerated Lagrangian Particle Dispersion Model for Atmospheric Transport of Radioactive Nuclides

Li, Qingyun; He, Tao; Li, Mingye; Zhang, Junfang; Lian, Bing; Liu, Liye; Qiu, Rui; Li, Junli

doi:10.3390/atmos17060573

by Qingyun Li ¹,²,³,†, Tao He ¹,†, Mingye Li ¹, Junfang Zhang ¹, Bing Lian ¹, Liye Liu ¹, Rui Qiu ²,³,* and Junli Li ²,³

¹ China Institute for Radiation Protection, Taiyuan 030006, China

² Department of Engineering Physics, Tsinghua University, Beijing 100084, China

³ Key Laboratory of Particle & Radiation Imaging, Tsinghua University, Ministry of Education, Beijing 100084, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Atmosphere 2026, 17(6), 573; https://doi.org/10.3390/atmos17060573

Submission received: 8 April 2026 / Revised: 22 May 2026 / Accepted: 29 May 2026 / Published: 1 June 2026

(This article belongs to the Section Atmospheric Techniques, Instruments, and Modeling)

Abstract

Large-scale atmospheric dispersion modeling for emergency response to nuclear accidents requires high computational efficiency and numerical reliability. To address these demands, a GPU-oriented Lagrangian particle dispersion model was developed within the FLEXPART framework. Core transport processes, including advection, turbulent diffusion, convective mixing, and dry/wet deposition, were restructured for GPU parallel execution. Fast arithmetic operators and multi-level parallelization strategies further improved overall computational performance while preserving physical accuracy. In addition, an MPI-based parallel tool for decoupling and preprocessing meteorological data was developed to alleviate data-handling bottlenecks, and multi-GPU execution with a load-balancing strategy enables efficient scaling in heterogeneous computing environments. The accuracy and acceleration of the GPU program were evaluated using the first release of the European Tracer Experiment (ETEX-I) as a benchmark. Results show that, while maintaining nearly comparable accuracy (with relative errors on the order of $10^{-2}$), the program achieves an overall speedup of approximately 40.45× on a single-GPU platform, which increases to about 52.05× in high-performance application scenarios where meteorological background fields are reusable. Moreover, multi-GPU experiments reveal favorable parallel scalability across configurations ranging from one to four GPUs and confirm that the proposed load-balancing strategy effectively enhances computational efficiency in heterogeneous GPU environments.

Keywords: GPU acceleration; Lagrangian dispersion model; emergency response

1. Introduction

Driven by the ongoing transformation of the global energy structure and low-carbon development objectives, nuclear energy continues to expand as a major source of low-emission power, accompanied by an elevated potential risk of nuclear accidents and inadvertent releases of radioactive materials. In the event of an accident, substantial quantities of radionuclides may be emitted into the atmosphere, posing severe threats to public health and ecological security [1]. Numerical atmospheric dispersion models constitute a cornerstone for consequence assessment and emergency decision support in such scenarios, with their computational speed and numerical reliability directly governing the timeliness and accuracy of response actions [2,3,4,5].

Atmospheric dispersion processes are intrinsically complex [6]. As a result, many turbulent transport and removal processes cannot be fully resolved directly from governing physical equations in practical atmospheric dispersion simulations, and contemporary dispersion models therefore rely on simplified parameterizations or empirically derived representations for these unresolved processes [7,8]. Previous studies have demonstrated that simulated concentration fields are highly sensitive to the parameterization of dry and wet deposition [9,10,11,12], turbulent mixing [9,10,11], and boundary-layer dynamics [10,11,13]. Accurately resolving plume evolution under complex meteorological regimes—particularly within unstable convective boundary layers—often requires higher-order and more sophisticated parameterization schemes [14,15]. Although such schemes can substantially reduce simulation errors, they incur a pronounced increase in computational cost [16]. Under nuclear-emergency conditions, the adoption of high-fidelity physical representations therefore entails a trade-off between numerical accuracy and computational efficiency, intensifying constraints on model responsiveness [4,17,18].

Moreover, the experience of the Chernobyl and Fukushima accidents demonstrates that, during emergencies, power outages caused by the incidents make it difficult to obtain accurate source terms [19,20,21]. Meteorological forecast products also carry significant uncertainties [6,22,23], so the inputs required for dispersion simulations are often incomplete. Under these conditions, reliable estimation of atmospheric radionuclide distributions commonly depends on auxiliary approaches such as source-term inversion [24,25,26,27,28] and data assimilation [29,30,31,32], both of which require a large number of dispersion simulations [9,20,33]. Source-term inversion is particularly computationally demanding: the accuracy of the inferred release parameters critically determines the credibility of subsequent predictions [34,35,36], and prevailing inversion algorithms often involve thousands of sequential dispersion runs, as shown in Table 1. This computational burden constitutes a major bottleneck for time-critical decision making during emergencies [37,38,39,40].

The simultaneous demand for high timeliness and high numerical accuracy in atmospheric dispersion simulations has increasingly exposed the computational limitations of conventional CPU-based platforms. In contrast, the rapid advancement of graphics processing units (GPUs), characterized by massive parallelism and high throughput, has opened new technical avenues for alleviating computational burdens and enhancing overall simulation efficiency [44,45,46,47]. Among existing dispersion modeling approaches, Lagrangian particle models are widely employed [48,49] because of their ability to maintain robust numerical accuracy under large-scale, heterogeneous, and highly complex turbulent conditions. By explicitly tracking the trajectories of a large ensemble of representative particles, these models describe pollutant transport and diffusion in a statistically consistent manner. Increasing the particle number improves the statistical stability of the simulation results—where the associated error scales inversely with the square root of the particle count—while also leading to a proportional increase in computational cost. Importantly, the mutual independence of particles in Lagrangian formulations aligns naturally with the “high parallelism, weak dependency” paradigm of GPU architectures, making such models intrinsically well suited for large-scale GPU acceleration [46,50].

Recent studies have explored the potential of accelerating Lagrangian particle models on heterogeneous computing platforms. Existing efforts have demonstrated that parallelizing core particle-update loops can yield measurable performance gains on both multi-core CPUs and GPUs. However, reported speedups typically range from 10 to 15 times [44,46,50] and often saturate as particle counts approach hardware concurrency limits [50]. Moreover, most implementations offload only selected computational kernels [46,50], leaving substantial portions of the physical processes, data management, and input–output operations on the CPU. As a result, the overall acceleration is constrained by residual serial components and data-transfer overheads. In addition, the majority of current GPU-based implementations are restricted to single-device execution and lack the scalable multi-GPU or cross-node capabilities required for large-ensemble simulations. More fundamentally, these studies primarily pursue technical porting and generic performance improvement, without explicitly accounting for the operational characteristics and constraints of nuclear-emergency response scenarios. Consequently, despite demonstrated benefits, existing GPU-accelerated Lagrangian particle models remain insufficient to simultaneously satisfy the stringent requirements for accuracy and real-time responsiveness under emergency conditions.

The present study builds upon the internationally established and extensively validated FLEXPART framework [14,15] to develop a GPU-oriented Lagrangian dispersion model tailored to the demands of emergency-response applications. First, focusing on the most computationally intensive physical processes of pollutant dispersion—including particle advection, turbulent random displacement, convective mixing, and wet deposition—a fine-grained data parallel strategy was systematically implemented to port these processes entirely to the GPU. With tailored memory configurations and computational flow optimizations specific to each physical mechanism, the approach overcomes previous constraints of partial parallelization limited to core loops without covering the complete physical processes. Subsequently, by integrating fast arithmetic instructions, the execution latency of fundamental operators was effectively reduced, further enhancing the computational throughput of GPU kernels. Furthermore, leveraging the architectural characteristics of GPUs, optimizations were carried out from both the parallel granularity and resource utilization perspectives. On one hand, a coarse-grained parallel strategy was introduced to appropriately adjust the mapping of computational tasks onto Streaming Multiprocessors (SMs), aligning with GPU asynchronous scheduling and compute–memory access overlapping mechanisms, thereby boosting overall parallel throughput. On the other hand, through precise control of the number of registers per thread, register spilling was minimized and thread concurrency was increased, significantly improving GPU resource utilization efficiency. These synergistic optimization strategies collectively enhanced GPU computational performance. Considering the high reusability of background meteorological fields in emergency simulations, this study structurally decoupled the reading and computational processes of such fields. 
Employing an MPI-based preprocessing and decoding strategy for meteorological data enabled efficient parallel execution during the preprocessing phase, substantially reducing I/O overhead while improving the efficiency of emergency technologies such as source inversion and data assimilation that rely on reusable background fields. Finally, a multi-GPU particle domain decomposition and load-balancing method was introduced to achieve scalable parallel capability across multiple GPUs. The accelerated performance and simulation accuracy of the developed program were systematically and comprehensively validated using the release scenario and measurement data from the European Tracer Experiment’s first trial (ETEX-I) [51].

2. Materials and Methods

2.1. Lagrangian Particle Model and the FLEXPART Framework

The atmospheric transport of radioactive contaminants can be described from a Lagrangian perspective by a system of partial differential equations (PDEs):

$$\frac{Dc}{Dt} = \nabla \cdot (K \nabla c) - v_{d}\, c\, \delta(z) - \Lambda c - \lambda c + q(x, y, z, t) \tag{1}$$

Here, $c$ denotes the material (Lagrangian-following) concentration of the contaminant, $K$ represents the turbulent diffusion coefficient tensor, and $v_{d}$, $\Lambda$, and $\lambda$ denote the dry deposition velocity, the wet scavenging coefficient, and the radioactive decay constant, respectively. The Dirac delta function, $\delta(z)$, constrains dry deposition to occur only within the near-surface layer [52,53]. The term $q(x, y, z, t)$ represents the source release-rate density as a function of spatial position $(x, y, z)$ and time $t$.

In the Lagrangian particle approach, an ensemble of computational particles is used to represent the contaminant plume. By releasing and tracking these particles throughout the atmosphere, key processes such as advective transport, turbulent diffusion, and depositional removal are explicitly resolved [54,55]. The motion of each particle is assumed to be independent and governed solely by the local meteorological conditions at its instantaneous position, with particle states updated at each time step. Under this assumption, the transport and deposition processes described by Equation (1) can be simplified for an individual particle as follows:

$$\begin{aligned} X_{t + \Delta t} &= X_{t} + U \Delta t - v_{g} \Delta t\, \hat{z}, \\ \mu_{t + \Delta t} &= \mu_{t}\, e^{-(\Lambda + \Lambda_{d} + \lambda) \Delta t}, \\ \Lambda_{d} &= \begin{cases} \dfrac{v_{d}}{2 h_{\mathrm{ref}}}, & z < 2 h_{\mathrm{ref}}, \\ 0, & z \geq 2 h_{\mathrm{ref}}. \end{cases} \end{aligned} \tag{2}$$

Here, $X = (x, y, z)$ denotes the particle position vector, and $\mu$ represents the mass or activity of radioactive contaminants carried by an individual particle. $U = \bar{U} + U'$ is the wind velocity vector at the particle location, where $\bar{U} = (\bar{u}, \bar{v}, \bar{w})$ denotes the mean wind velocity interpolated from meteorological fields, and $U' = (u', v', w')$ represents the stochastic turbulent velocity fluctuations in each spatial direction. The statistical properties of $U'$ are estimated from boundary-layer parameters such as friction velocity, and the fluctuations are generated using a Langevin stochastic differential equation or related random-process models to characterize the random perturbations induced by atmospheric turbulence. $v_{g}$ denotes the gravitational settling velocity, and $\hat{z} = (0, 0, 1)$ is the vertical unit vector along which gravitational settling acts. The parameters $\Lambda_{d}$ and $2 h_{\mathrm{ref}}$ represent the dry deposition removal coefficient and the effective thickness of the dry deposition layer, respectively. The term $\Delta t$ denotes the minimum time step used in the simulation.
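The single-particle update in Equation (2) can be illustrated with a minimal Python sketch. It is not the FLEXPART implementation: the function names `dry_removal_rate` and `step_particle`, the Cartesian coordinates in metres, and the choice to evaluate $\Lambda_{d}$ at the height before the position update are all assumptions made for illustration, and the total wind (mean plus turbulent fluctuation) is taken as an input instead of being generated by a Langevin model.

```python
import math


def dry_removal_rate(z, v_d, h_ref):
    """Dry-deposition removal coefficient: v_d / (2 h_ref) below 2 h_ref, else 0."""
    return v_d / (2.0 * h_ref) if z < 2.0 * h_ref else 0.0


def step_particle(pos, mass, wind, v_g, v_d, wet_rate, decay_rate, h_ref, dt):
    """Advance one particle by dt, following the structure of Equation (2).

    pos  : (x, y, z) in metres (assumed Cartesian frame)
    wind : total wind (u, v, w) at the particle, mean plus fluctuation
    """
    x, y, z = pos
    u, v, w = wind
    # Assumption: the removal coefficient uses the height before the update.
    lam_d = dry_removal_rate(z, v_d, h_ref)
    # Position update: advection by the wind, settling along -z.
    new_pos = (x + u * dt, y + v * dt, z + w * dt - v_g * dt)
    # Mass update: combined wet scavenging, dry deposition, radioactive decay.
    new_mass = mass * math.exp(-(wet_rate + lam_d + decay_rate) * dt)
    return new_pos, new_mass


if __name__ == "__main__":
    pos, mass = step_particle((0.0, 0.0, 10.0), 1.0, (2.0, 0.0, 0.1),
                              v_g=0.01, v_d=0.001, wet_rate=0.0,
                              decay_rate=0.0, h_ref=15.0, dt=300.0)
    print(pos, mass)
```

Because the update reads only the particle's own state and the local meteorological values, calls for different particles are independent of each other, which is the property the GPU mapping relies on.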

Based on the discrete particle update formulation described above, the FLEXPART model adopts the computational workflow illustrated in Figure 1. After initialization, the model enters the time-integration loop and first checks whether the integration period has been completed. If the simulation continues, the meteorological fields are updated to the required time level. When active particles are present, wet deposition removal, first-order chemical reactions, and backward convective transport are first evaluated. The emission and chemical fields are then updated, followed by particle release and emission injection. Subsequently, forward convective transport is calculated, and radioactive decay of the deposited mass is applied. After the statistical calculation and output procedures, particle advection and turbulent diffusion are performed to update particle positions. The model then removes inactive particles, evaluates particle dry deposition and radioactive decay, calculates the surface dry-deposition flux, and removes inactive particles again before advancing the model time by one time step. This loop is repeated until the end of the prescribed simulation period.

Figure 2 illustrates the relative computational time contributions of the major modules involved in the dispersion simulation. The computational profile presented in this figure was obtained from a Cs-137 dispersion simulation using $1.0 \times 10^{6}$ particles, with a total simulation duration of 90 h and a synchronization time step of 300 s. Concentration fields were output every 3 h on a $320 \times 250$ grid with a horizontal resolution of $0.1^{\circ} \times 0.1^{\circ}$. As shown, the dominant computational cost arises from the advection–diffusion module, which, in addition to advective transport and stochastic turbulent displacement, includes key operations such as aerosol gravitational settling and dry-deposition probability calculations. The convective mixing and wet deposition modules together account for most of the remaining computational workload. Because all of these processes require particle-by-particle updates at every time step, the overall computational complexity scales linearly with both the number of particles and the number of time steps. This characteristic endows the dispersion model with pronounced data parallelism, making it inherently well suited for large-scale parallelization on GPU architectures. In addition, the Get fields module is responsible for reading and decoding multi-level meteorological fields from GRIB and related encoded formats, while also performing the necessary coordinate transformations and physical variable derivations to generate all meteorological drivers required for the dispersion calculations. This procedure involves sequential decoding of meteorological data and intensive access to gridded fields during time stepping, and therefore accounts for a non-negligible fraction of the total runtime. In particular, under scenarios with relatively small particle counts and long time steps, the overhead associated with background-field decoding and preprocessing may exceed half of the total propagation cost, thereby becoming a critical factor limiting the overall computational efficiency of the model.

Therefore, although the computational time distribution shown in Figure 2 is qualitatively representative for FLEXPART simulations with relatively large particle counts, the exact runtime fractions of different modules may still vary under different simulation configurations. In particular, the computational cost of the output and statistical modules is sensitive to the spatial and temporal resolution of the output fields and the sampling interval, while the runtime of the meteorological preprocessing module mainly depends on the spatial resolution and update frequency of the meteorological fields. In contrast, the computational costs of the major particle-transport-related modules are primarily controlled by the number of simulated particles. Consequently, the relative contribution of each module to the total runtime may change under different simulation scenarios, although the overall computational characteristics of the model remain qualitatively similar.

2.2. Fine-Grained Parallel Acceleration Architecture

To fully exploit the parallel computing capability of GPUs and achieve maximal performance gains, a full offloading strategy was adopted, in which all core subroutines originally executed on the CPU were restructured and migrated to run entirely on the GPU device. This approach eliminates frequent host–device data transfers and the associated communication overhead, thereby enabling end-to-end acceleration of the computational workflow. Building on this design, and leveraging the inherent data-parallel nature of the Lagrangian particle model, a fine-grained GPU parallelization scheme was employed in which each particle is mapped to a single GPU thread.
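The one-particle-per-thread mapping can be sketched in plain Python as a serial emulation of a kernel launch. The block size of 256 threads and the names `launch_config`, `particle_index`, and `emulate_kernel` are illustrative assumptions and are not taken from the paper; a real kernel would run the inner body concurrently instead of in nested loops.

```python
def launch_config(n_particles, threads_per_block=256):
    """Number of blocks needed so that every particle is owned by one thread."""
    return (n_particles + threads_per_block - 1) // threads_per_block


def particle_index(block_idx, thread_idx, threads_per_block=256):
    """Global (0-based) particle index owned by a (block, thread) pair."""
    return block_idx * threads_per_block + thread_idx


def emulate_kernel(masses, decay_factor, threads_per_block=256):
    """Serial stand-in for a kernel in which each thread updates one particle."""
    n = len(masses)
    out = list(masses)
    for b in range(launch_config(n, threads_per_block)):
        for t in range(threads_per_block):
            i = particle_index(b, t, threads_per_block)
            if i < n:  # guard: the last block may contain surplus threads
                out[i] = masses[i] * decay_factor
    return out


if __name__ == "__main__":
    print(launch_config(1_000_000))
    print(emulate_kernel([1.0, 2.0, 3.0], 0.5, threads_per_block=2))
```

The bounds guard matters because the particle count is rarely a multiple of the block size, so some threads in the final block have no particle to update.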

2.2.1. Memory Allocation and Data Transfer

The GPU employs a hierarchical memory architecture comprising registers, L1/L2 caches, and global memory. In the Lagrangian particle dispersion model, each particle is processed by an independent thread, which requires efficient access to meteorological fields, control parameters, and particle-specific variables. Consequently, data layout and memory-management strategies play a decisive role in reducing memory-access latency and improving overall parallel performance.

Because particle positions evolve dynamically and cannot be predetermined, the model must support fast, random access to background-field data over the full spatial domain. All relevant background data are therefore resident on the GPU. To mitigate the memory-bandwidth pressure caused by concurrent global-memory access, time-invariant quantities—including particle attributes, simulation control parameters, and meteorological-field dimensions—are stored in GPU constant memory, leveraging warp-level broadcast to reduce bandwidth consumption and improve throughput.

To satisfy the GPU execution model, all state updates are thread-private. Intermediate variables were accordingly restructured: performance-critical quantities were assigned as thread-local variables to exploit registers and caches, while less frequently accessed variables were organized in global-memory arrays indexed by particle ID. This design avoids race conditions, limits register pressure, and achieves an effective balance between correctness, performance, and resource utilization.

2.2.2. Parallelization of Particle Grid Reordering and the Two-Level Loop Structure

In Lagrangian particle dispersion models, particle states are generally updated individually according to the local meteorological conditions experienced by each particle. However, deep convective transport is usually not treated using a conventional Langevin-type stochastic model because the lack of stable turbulence statistics and appropriate stochastic parameters makes it difficult to construct a generally applicable stochastic formulation for deep convection. Therefore, LPDMs commonly employ mass-flux-based convective parameterization schemes, in which convective mass fluxes defined on Eulerian grid columns are used to describe the vertical redistribution of particles [56]. This approach is also employed in FLEXPART for the treatment of deep convective transport, and a detailed description of the corresponding redistribution procedure is provided in Appendix A. In this approach, particles located within the same grid column may share identical redistribution relationships. As illustrated in Figure 3a, traditional CPU implementations commonly employ a grid-binning strategy, in which particles are reordered according to their associated grid cells and their states are updated using a two-level nested loop structure that first iterates over grid cells and then over particles within each grid cell. When ported to a GPU parallel architecture, however, this grid-dominated, nested-loop computation pattern cannot be directly mapped onto a kernel execution model in which particles serve as the fundamental parallel units. If the reuse of within-grid transformation relationships is preserved, grid cells containing only a small number of particles can lead to low thread utilization and insufficient parallelism, resulting in typical GPU load-imbalance issues. 
Conversely, recomputing grid-level transformation relationships independently for each particle would substantially increase the computational burden per thread, thereby diminishing the effectiveness of parallel acceleration. Based on these considerations, the computational procedures involving particle reordering and two-level loop structures were structurally redesigned to accommodate a particle-centric GPU execution model. As shown in Figure 3b, because the computation of grid identifiers for individual surviving particles is free of inter-particle data dependencies, this step was implemented as a GPU-side parallel kernel. Moreover, leveraging the stable parallel reordering operators provided by CUDA Fortran, particle reordering can be efficiently performed entirely on the GPU device. In contrast, the computation of redistribution matrices for active grid cells involves complex logic but is required for only a relatively small number of grid cells. Therefore, the original grid-first, particle-second nested-loop structure was functionally decoupled: the outer, grid-based computation of redistribution matrices was retained on the CPU and the resulting matrices were written to GPU global memory. Subsequently, in the particle-level parallel kernels, each thread directly indexes the corresponding redistribution matrix based on the grid identifier of the particle, enabling fully parallel updates of all particles. This strategy effectively avoids the insufficient parallelism and GPU load imbalance caused by nonuniform particle counts per grid cell in the original grid-based traversal scheme. It should be noted that this workflow introduces an additional device-to-host data transfer during the redistribution-matrix generation stage. 
However, because the data transferred in each instance consist only of a one-dimensional array with a size proportional to the number of surviving particles, the associated data volume is relatively small, and its impact on overall computational performance is negligible.
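The decoupled workflow described above (per-particle grid identifiers, a stable reordering, host-side matrices for active cells only, then a per-particle lookup) can be sketched as follows. This is a serial Python illustration under stated assumptions: `toy_matrix` is a hypothetical placeholder for the mass-flux-based redistribution matrix, a dictionary stands in for GPU global memory, and each loop iteration in `redistribute` plays the role of one GPU thread.

```python
import random


def sort_by_cell(cell_ids):
    """Particle indices ordered by grid-column identifier (stable sort)."""
    return sorted(range(len(cell_ids)), key=lambda i: cell_ids[i])


def toy_matrix(n_levels, up_fraction):
    """Hypothetical redistribution matrix.

    Row i holds the probabilities of moving from level i to each
    destination level; every row sums to 1.
    """
    m = [[0.0] * n_levels for _ in range(n_levels)]
    for i in range(n_levels):
        j = min(i + 1, n_levels - 1)
        m[i][i] += 1.0 - up_fraction
        m[i][j] += up_fraction
    return m


def build_matrices(cell_ids, n_levels, up_fraction):
    """Host-side step: one matrix per active cell, not per particle."""
    return {c: toy_matrix(n_levels, up_fraction) for c in set(cell_ids)}


def redistribute(levels, cell_ids, matrices, rng):
    """Particle-level step: index the cell's matrix and sample a new level."""
    out = []
    for level, cell in zip(levels, cell_ids):
        row = matrices[cell][level]
        r, acc, dest = rng.random(), 0.0, level
        for j, prob in enumerate(row):
            acc += prob
            if r < acc:
                dest = j
                break
        out.append(dest)
    return out


if __name__ == "__main__":
    cells = [7, 3, 7, 3]
    print(sort_by_cell(cells))
    mats = build_matrices(cells, n_levels=4, up_fraction=0.3)
    print(redistribute([0, 1, 2, 3], cells, mats, random.Random(0)))
```

The matrix cost scales with the number of active cells, while the lookup cost scales with the number of particles, which is why the two stages can be placed on different devices.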

2.3. Fast Arithmetic Instruction Optimization

In GPU parallel computing, although the overall floating-point throughput is high, complex arithmetic operations such as division and square root still exhibit significantly higher instruction latency and lower execution throughput in single-precision arithmetic compared with simple operations such as addition and multiplication. When these operators are invoked frequently during particle updates, they can substantially limit kernel execution efficiency and thus become major performance bottlenecks that constrain the overall acceleration achieved.

2.3.1. Fast Division Operations

On CUDA architectures, the standard division operator “/” follows IEEE 754 floating-point semantics and is typically translated by the compiler into a single-precision floating-point division instruction (e.g., div.rn.f32 with round-to-nearest mode). At the hardware level, this instruction is usually implemented through a sequence of high-latency micro-operations, resulting in substantially higher execution latency and lower throughput than basic arithmetic instructions such as addition and multiplication. In Lagrangian particle dispersion simulations, division operations are frequently invoked in key numerical procedures, including turbulence-intensity calculations, gradient normalization, and scaling of stochastic increments, and therefore constitute a primary arithmetic bottleneck limiting kernel performance. To reduce this computational overhead, standard division operations were reformulated using a reciprocal–multiplication scheme:

$$\frac{a}{b} \approx a \cdot \mathrm{approx}\left(\frac{1}{b}\right) \tag{3}$$

Here, $\mathrm{approx}(\cdot)$ denotes an approximate reciprocal operator. The reciprocal operation can be implemented using the reciprocal approximation mechanism provided by the GPU multi-function units (MFUs), which is typically based on hardware lookup tables combined with interpolation. Compared with the iterative logic employed by standard division instructions, this mechanism exhibits substantially lower instruction latency and higher throughput. Under this approximation, the execution efficiency of the corresponding arithmetic operations can be several times higher than that of standard division. In terms of numerical accuracy, for single-precision floating-point numbers satisfying $|b| \in [2^{-126}, 2^{126}]$, the maximum error of the reciprocal approximation does not exceed 2 units in the last place (ULP). This accuracy level is sufficient to maintain numerical stability and precision requirements while significantly improving overall computational performance.
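The reciprocal–multiplication scheme of Equation (3) can be emulated on the CPU to see how far it may drift from a correctly rounded division. In the sketch below, the helpers `f32`, `bits`, `div_exact`, `div_fast`, and `max_ulp_gap` are hypothetical names, and a correctly rounded single-precision reciprocal stands in for the hardware approximation, so the measured gaps illustrate the reformulation only and say nothing about the accuracy of any particular GPU.

```python
import random
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bits(x):
    """Integer bit pattern of a positive single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def div_exact(a, b):
    """Single-precision division rounded once, as IEEE 754 division would be."""
    return f32(a / b)


def div_fast(a, b):
    """Reciprocal-then-multiply, with a rounding after each operation."""
    recip = f32(1.0 / b)  # assumption: stand-in for the hardware reciprocal
    return f32(a * recip)


def max_ulp_gap(n=10000, seed=0):
    """Largest distance, in representable values, between the two results."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(n):
        a = f32(rng.uniform(0.1, 1000.0))
        b = f32(rng.uniform(0.1, 1000.0))
        gap = abs(bits(div_fast(a, b)) - bits(div_exact(a, b)))
        worst = max(worst, gap)
    return worst


if __name__ == "__main__":
    print(max_ulp_gap())
```

For positive values of the same sign, the difference between two bit patterns counts the representable numbers between them, which is why `bits` gives a convenient ULP-style distance here.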

2.3.2. Fast Square-Root Operations

In calculations related to turbulence parameterization, gravitational settling corrections, and normalization of stochastic increments, a large number of single-precision floating-point square-root operations are required. Similar to floating-point division, standard square-root operations that conform to IEEE 754 semantics impose strict rounding rules and boundary-condition handling (e.g., NaN/Inf and subnormal values). Compilers typically map these operations to single-precision square-root instructions (e.g., sqrt.rn.f32), which exhibit relatively high execution latency and low throughput at the hardware level and can therefore constitute a significant arithmetic bottleneck for kernel performance. To mitigate this overhead, standard square-root operations were reformulated as a combination of a single multiplication and a reciprocal square-root operation:

$$\sqrt{a} \approx a \cdot \mathrm{approx}\left(\frac{1}{\sqrt{a}}\right) \tag{4}$$

Here, the reciprocal square-root approximation $\mathrm{approx}\left(\frac{1}{\sqrt{a}}\right)$ is likewise executed by the GPU multi-function units (MFUs) and relies on hardware lookup tables and interpolation to obtain a fast approximate value. Its instruction latency is substantially lower than that of the standard square-root operation, while offering higher execution throughput. Within the single-precision floating-point range, the maximum error of this approximation is likewise bounded by 2 units in the last place (ULP).

2.3.3. Newton–Raphson Iterative Refinement

The introduction of approximate operators results in a maximum numerical error of up to 2 ULP. For most atmospheric dispersion simulations, this error magnitude is substantially smaller than the uncertainties associated with stochastic turbulent perturbations and meteorological-field interpolation. However, along certain computational pathways—such as the calculation of vertical diffusion coefficients, normalization of turbulence time scales, and gravitational settling corrections—these errors may still accumulate over long-term integrations. To preserve numerical consistency between the GPU-based computations and the original implementation to the greatest extent possible, Newton–Raphson iterative refinement was applied to correct the approximate results.

The Newton–Raphson method is used to solve equations of the form $f(x) = 0$ through the following iterative formulation:

$$x_{k+1} = x_{k} - \frac{f(x_{k})}{f'(x_{k})} \tag{5}$$

For division operations, the numerical uncertainty originates from the fast reciprocal approximation of $\frac{1}{b}$. Therefore, Newton–Raphson refinement is applied only to this component. Letting $f(x) = x^{-1} - b$, one Newton–Raphson iteration gives the refined expression:

$$x_{k+1} = x_{k} - \frac{x_{k}^{-1} - b}{- x_{k}^{-2}} = x_{k} \cdot (2 - b x_{k}) \tag{6}$$

For square-root computations, Newton–Raphson refinement is likewise applied only to the reciprocal square-root term of the form $\frac{1}{\sqrt{a}}$. Accordingly, by defining $f(x) = x^{-2} - a$, the expression after a single Newton–Raphson iteration can be written as:

$$x_{k+1} = x_{k} - \frac{x_{k}^{-2} - a}{- 2 x_{k}^{-3}} = x_{k} \left(1.5 - 0.5\, a\, x_{k}^{2}\right) \tag{7}$$

As noted above, the maximum initial error introduced by the reciprocal and reciprocal square-root approximation operators does not exceed 2 ULP, corresponding to approximately

2.4 \times 10^{- 7}

. Owing to the quadratic convergence property of the Newton–Raphson method, once the error is sufficiently small, each iteration reduces the relative error to approximately the square of its previous magnitude. Accordingly, for an initial relative error on the order of

10^{- 7}

, a single Newton–Raphson refinement step can suppress the error to the order of

10^{- 14}

, which is well below the effective representational precision of single-precision floating-point arithmetic. Given that rounding errors in single-precision floating-point operations are typically bounded by 0.5 ULP (corresponding to a relative error of approximately

10^{- 8}

), the inclusion of a single Newton–Raphson iteration is sufficient to ensure that the final result attains the maximum effective precision achievable in single precision. It should be emphasized that both types of Newton–Raphson refinements involve only a small number of multiplication and addition operations. Compared with the full

sqrt.rn.f32

or

div.rn.f32

instructions, their computational cost is substantially lower, and they can be efficiently executed by the GPU fused multiply–add (FMA) units without imposing a noticeable burden on overall kernel performance.
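As a minimal numerical sketch of Equations (6) and (7), the Python snippet below perturbs the exact reciprocal and reciprocal square root by a relative error of 2.4 × 10^-7 and applies one refinement step to each. It is evaluated in double precision so that the squared error remains observable; the operand values and function names are illustrative assumptions and do not come from the GPU code.

```python
import math


def refine_reciprocal(x, b):
    """One Newton-Raphson step for f(x) = 1/x - b, i.e. Equation (6)."""
    return x * (2.0 - b * x)


def refine_rsqrt(x, a):
    """One Newton-Raphson step for f(x) = x**-2 - a, i.e. Equation (7)."""
    return x * (1.5 - 0.5 * a * x * x)


def rel_err(approx, exact):
    """Relative error of an approximation."""
    return abs(approx - exact) / abs(exact)


# Assumed seed error, mimicking the ~2 ULP bound of the fast operators.
EPS0 = 2.4e-7
# Hypothetical operands chosen only for illustration.
b, a = 3.7, 5.3

exact_recip = 1.0 / b
exact_rsqrt = 1.0 / math.sqrt(a)

# Emulate the fast approximations by perturbing the exact values.
seed_recip = exact_recip * (1.0 + EPS0)
seed_rsqrt = exact_rsqrt * (1.0 + EPS0)

err_recip = rel_err(refine_reciprocal(seed_recip, b), exact_recip)
err_rsqrt = rel_err(refine_rsqrt(seed_rsqrt, a), exact_rsqrt)

print(f"reciprocal: {EPS0:.1e} -> {err_recip:.1e}")
print(f"rsqrt:      {EPS0:.1e} -> {err_rsqrt:.1e}")
```

In single precision the refined value would instead be limited by float32 rounding, which is why one iteration is treated as sufficient in the text.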

In addition, to further mitigate the potential accumulation of numerical errors introduced by fast division approximations, algebraic reformulation strategies were applied in selected computations to reduce the number of division operations. For example, expressions of the form

d = a / b / c

were consistently rewritten as

d = a / (b \cdot c)

, and expressions of the form

d = a / (b / c)

were reformulated as

d = (a \cdot c) / b

, thereby effectively decreasing the frequency of division operations. Similarly, power expressions such as

a^{1.5}

were explicitly decomposed and expressed as

a \cdot \sqrt{a}

, enabling efficient evaluation in combination with the fast square-root approximation operator.
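The three rewrites can be checked with a short sketch. The function names and test values below are hypothetical; the snippet only confirms that each reformulated expression agrees with the original form to within rounding.

```python
import math


def div_chain(a, b, c):
    """a / b / c rewritten with a single division."""
    return a / (b * c)


def div_nested(a, b, c):
    """a / (b / c) rewritten with a single division."""
    return (a * c) / b


def pow_1p5(a):
    """a ** 1.5 rewritten as a multiplication and a square root."""
    return a * math.sqrt(a)


# Hypothetical operands chosen only for illustration.
a, b, c = 7.25, 3.5, 0.4

same_chain = math.isclose(div_chain(a, b, c), a / b / c, rel_tol=1e-12)
same_nested = math.isclose(div_nested(a, b, c), a / (b / c), rel_tol=1e-12)
same_power = math.isclose(pow_1p5(a), a ** 1.5, rel_tol=1e-12)

print(same_chain, same_nested, same_power)
```

One caveat applies to such rewrites in general: the intermediate product (b·c or a·c) can overflow or underflow in cases where the chained division would not, so they are safest where operand magnitudes are known to be moderate.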

2.4. GPU Parallel Execution Strategy and Resource Utilization Optimization

After the parallel restructuring of the particle–grid computation workflow and the optimization of arithmetic operators, the overall performance of the GPU kernels is further constrained by the design of the parallel execution model and the efficiency of hardware resource utilization. To fully exploit the parallel computing potential of GPUs for large-scale particle simulations, this study systematically optimized kernel execution from two complementary perspectives: parallel granularity design and hardware resource allocation.

2.4.1. Parallel Granularity and Thread Organization Strategy

After implementing fine-grained particle-level parallelism, further improvement in GPU resource utilization primarily depends on reducing idle periods between successive kernel launches and data-preparation stages, as well as increasing the degree of overlap between computation and associated data-access operations. To this end, a coarse-grained parallelization strategy based on CUDA streams was introduced. Through stream-based task scheduling, multiple computational tasks are executed in a temporally interleaved manner, thereby constructing a higher-throughput parallel execution pipeline.

CUDA streams are execution channels that support asynchronous scheduling. Operations submitted within the same stream are executed sequentially in the order of submission, whereas operations issued to different streams may execute concurrently, subject to hardware resource availability. As illustrated in Figure 4a, when parallel granularity is not optimized, particle simulations typically follow a sequential execution pattern in which kernel execution is initiated only after all relevant data have been read and prepared. As a result, computational resources remain idle during data-preparation stages, increasing overall execution latency. In contrast, as shown in Figure 4b, by appropriately partitioning the particle ensemble and assigning multiple CUDA streams, data-reading and kernel-execution operations for different particle subsets can be temporally interleaved. This approach effectively overlaps computation with data preparation, significantly reducing idle waiting periods and thereby improving overall computational throughput.
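The benefit of interleaving can be illustrated with a simple timing model rather than actual CUDA code. The sketch below is an idealized two-stage pipeline written under stated assumptions: transfers are serialized on one copy engine, kernels are serialized on the GPU, and every chunk pays a fixed per-stage overhead. The function name and all timing values are hypothetical.

```python
def pipelined_time(transfer, compute, n_streams, overhead=0.0):
    """Estimated wall time when the work is split into n_streams chunks.

    Assumed model: chunk i may start computing once its own transfer has
    finished and the previous chunk's kernel has completed.
    """
    if n_streams < 1:
        raise ValueError("n_streams must be at least 1")
    t = transfer / n_streams + overhead  # per-chunk transfer time
    c = compute / n_streams + overhead   # per-chunk kernel time
    # First transfer, then the slower stage paces the pipeline, then last kernel.
    return t + (n_streams - 1) * max(t, c) + c


# Hypothetical timings in seconds: 4 s of data preparation, 6 s of kernels.
for n in (1, 2, 4, 8, 16, 64):
    print(n, round(pipelined_time(4.0, 6.0, n, overhead=0.05), 3))
```

With a single stream the model reduces to the sequential sum of transfer and compute time. With a nonzero per-chunk overhead, the estimated time first falls and then rises again as the number of chunks grows, which is the qualitative behaviour one would expect from excessive task subdivision.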

2.4.2. Register Usage Control Optimization

During GPU kernel execution, registers constitute the lowest-latency and highest-throughput storage resource, and their allocation strategy directly affects instruction-level parallelism within individual threads as well as the number of threads that can concurrently reside on a streaming multiprocessor (SM). Excessive register allocation, while beneficial for reducing memory-access overhead associated with intermediate variables, may exhaust available register resources and trigger register spilling, forcing some variables to be stored in local memory and thereby significantly increasing access latency. Conversely, overly restrictive limits on register usage can constrain the compiler’s optimization space and degrade single-thread execution efficiency. In this work, the maximum number of registers per thread was therefore explicitly capped at a value chosen to raise thread occupancy substantially without incurring significant register spilling.

2.5. Decoupling of Background-Field Preprocessing

Lagrangian particle dispersion calculations rely on meteorological fields stored in encoded formats such as GRIB. These meteorological datasets are typically decoded and preprocessed sequentially in time, including the computation of intermediate physical variables and coordinate-system transformations. However, in high-demand emergency application scenarios—such as source-term inversion and data assimilation—the background meteorological fields often remain unchanged across multiple simulation runs. Repeated decoding and preprocessing of meteorological data in such cases inevitably introduce substantial computational redundancy and significantly increase I/O overhead, thereby constraining overall computational efficiency. To address this issue, the meteorological-field preprocessing module was structurally decoupled from the main FLEXPART computational workflow, and an independent meteorological preprocessing program,

grib_to_bin_mpi

, was designed and implemented. This program executes in parallel on multicore CPU platforms using the Message Passing Interface (MPI), enabling efficient decoding of GRIB and related encoded meteorological files, derivation of physical variables, and coordinate transformations. The processed meteorological fields are then stored in binary format. Correspondingly, the FLEXPART main program was adapted, and the meteorological data input interface was reimplemented to ensure efficient and stable access to the preprocessed meteorological fields during subsequent GPU-accelerated computations. During model execution, the FLEXPART main program reads the required meteorological variables from these binary files and transfers them to GPU global memory. Notably, data transfer between CPU and GPU is triggered only when new meteorological time levels are loaded, thereby avoiding repeated host-device communication at every particle integration time step. Since the temporal resolution of meteorological fields is typically on the order of 3–6 h, whereas the particle integration time step is usually on the order of several hundred seconds, meteorological field updates are generally required only once every tens of integration steps, which helps keep the associated communication overhead relatively limited.

During particle trajectory integration, two adjacent meteorological time levels are maintained simultaneously in memory to evaluate meteorological conditions at arbitrary particle positions and intermediate integration times. Meteorological variables are interpolated spatially and temporally directly on the GPU to obtain the atmospheric state corresponding to the current particle position and simulation time. Since particle positions continuously evolve during trajectory integration, these interpolation operations are performed together with the corresponding particle-update procedures rather than treated as an independent computational module.
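A minimal sketch of the temporal part of this procedure is given below. It assumes linear weighting between the two stored time levels, which the text does not state explicitly; the function names and field values are hypothetical. The second helper shows how many integration steps fall between two meteorological updates.

```python
def interp_time(f0, f1, t0, t1, t):
    """Linearly interpolate between field values f0 (at t0) and f1 (at t1)."""
    if not t0 <= t <= t1:
        raise ValueError("t must lie between the two stored time levels")
    w = (t - t0) / (t1 - t0)  # weight of the later time level
    return (1.0 - w) * f0 + w * f1


def steps_per_update(met_interval_s, dt_s):
    """Number of particle integration steps per meteorological update."""
    return met_interval_s // dt_s


# Hypothetical wind component (m/s) at two time levels 3 h apart.
u = interp_time(4.0, 7.0, 0.0, 10800.0, 2700.0)
n_steps = steps_per_update(3 * 3600, 300)

print(u, n_steps)
```

For a 3 h field interval and a 300 s step, the host-to-device transfer in this sketch would be needed once every 36 steps.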

The design, computational details, and stored parameters of the preprocessing program are provided in Appendix B.

2.6. Multi-GPU Scalability and Load Balancing

To overcome the limitations on particle scale and computational resources under a single-GPU configuration, a multi-GPU parallel computing architecture based on particle-set partitioning was designed and implemented. Distinct particle subsets are assigned to different GPU devices, on which particle advection and associated physical process calculations are performed independently. During computation, frequent exchange of particle state information between GPUs is not required; instead, simulation results from all devices are aggregated only at the output stage, thereby enabling efficient and scalable parallel computation under multi-GPU configurations.

In multi-GPU systems, however, individual devices may differ in computational capability. To achieve optimal overall performance, it is necessary to ensure load balancing across GPUs as much as possible. Given that, in mainstream GPU architectures, the configuration of Streaming Multiprocessors (SMs) is typically uniform within a single device, the number of SMs was adopted as the primary metric for assessing relative computational capacity across devices. On this basis, a particle subset partitioning strategy proportional to SM count was designed as follows:

N_{i} = N \cdot \frac{SM_{i}}{S}

(8)

Here, N denotes the total number of particles,

SM_{i}

and

N_{i}

represent the number of streaming multiprocessors available on the i-th GPU and the corresponding number of particles assigned to that device, respectively, and

S = \sum_{i} SM_{i}

denotes the total number of SMs across all GPUs. After task partitioning, particle subsets are scheduled to the corresponding GPUs under host-side OpenMP management. Upon completion of all computations, the simulation results are collected and integrated by the master thread.
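Equation (8) generally yields non-integer particle counts, so an implementation has to decide how to round. The sketch below is one possible approach, not the authors' code: it floors each share and distributes the remainder by largest fractional part. The second function estimates the finishing time under the simplifying assumption that throughput scales with SM count. The SM counts are illustrative values only.

```python
def partition_particles(n_total, sm_counts):
    """Split n_total particles across GPUs in proportion to SM count."""
    total_sm = sum(sm_counts)
    shares = [n_total * sm // total_sm for sm in sm_counts]
    # Assumed rounding rule: give leftover particles to the largest remainders.
    leftover = n_total - sum(shares)
    order = sorted(range(len(sm_counts)),
                   key=lambda i: (n_total * sm_counts[i]) % total_sm,
                   reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def makespan(shares, sm_counts):
    """Finishing time in arbitrary units, assuming throughput ~ SM count."""
    return max(n / sm for n, sm in zip(shares, sm_counts))


# Illustrative SM counts for two unequal devices (assumed values).
sms = [84, 48]
balanced = partition_particles(1_000_000, sms)
even = [500_000, 500_000]

print(balanced, makespan(balanced, sms), makespan(even, sms))
```

Under these assumptions the proportional split finishes earlier than an even split, because the slower device no longer determines the completion time.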

2.7. Validation

2.7.1. Validation Experiments and Reference Benchmarks

To systematically evaluate the simulation accuracy and acceleration performance of the GPU-accelerated Lagrangian particle dispersion model developed in this study, ETEX-I was selected as a unified test scenario for validation. During the ETEX-I experiment, a total of 340 kg of PMCH (perfluoromethylcyclohexane) was released into the atmosphere at Monterfil, Brittany, France

({48.058}^{\circ} N, {2.0083}^{\circ} W)

[51]. PMCH is a chemically inert perfluorocarbon tracer widely used in long-range atmospheric transport experiments because of its non-toxic, non-water-soluble, and non-depositing characteristics [57]. The release started at 16:00 UTC on 23 October 1994 and lasted for 12 h. During the release process, liquid PMCH was sprayed into a heated air stream to ensure rapid evaporation and atmospheric dispersion [58]. A network of 168 ground-based monitoring stations was deployed across Europe to collect tracer concentration measurements. Sequential air samplers equipped with adsorption tubes were used for sample collection, and the collected samples were subsequently analyzed using thermal desorption gas chromatography with electron-capture detection. To avoid missing plume arrival and to obtain background concentration levels, sampling operations at each station were initiated approximately 6 h before the expected arrival time of the tracer plume. Each station continuously collected 24 consecutive 3 h samples over a 72 h period, with sampling schedules progressively delayed from western to eastern Europe. The most distant stations completed sampling approximately 90 h after the release start. Overall, more than 4000 samples were collected during the ETEX-I experiments, yielding 3104 valid concentration measurements used in this study. The location of the release source and the spatial distribution of the ground monitoring stations are shown in Figure 5.

The serial version of FLEXPART v11.04 was adopted as the reference benchmark for both simulation accuracy and acceleration performance. The meteorological driving fields used for validation were obtained from the Climate Forecast System Reanalysis (CFSR) dataset provided by the National Centers for Environmental Prediction (NCEP). In the simulations, the total number of particles was set to

1.0 \times 10^{6}

, the synchronization time step was configured as 300 s, and the total simulation duration was 90 h. The horizontal grid resolution of the output concentration fields was configured as

{0.1}^{\circ} \times {0.1}^{\circ}

, corresponding to an output grid size of

320 \times 250

. Detailed descriptions of the corresponding configuration parameters are provided in Appendix C Table A3.

It should also be noted that the output grid was mainly used for gridded concentration diagnosis and plume visualization. The station-level statistical validation against ETEX-I observations was performed using receptor concentration time series calculated directly from particle contributions through the FLEXPART receptor kernel method, rather than by interpolating concentrations from the gridded output fields.

2.7.2. Validation Metrics and Accuracy Assessment

The GPU-accelerated FLEXPART implementation developed in this study was validated against both the reference CPU version of FLEXPART and the ETEX-I observational dataset. For consistency, the GPU and CPU simulations were independently performed using identical emission source parameters, meteorological input data, particle settings, physical parameterizations, and output configurations. The resulting concentration fields were quantitatively compared with the ETEX-I observational measurements, while the relative differences between the GPU and CPU results were further analyzed to assess the numerical consistency of the GPU implementation. The agreement between simulations and observations was evaluated using several statistical metrics, including Fractional Bias (FB), Root Mean Square Error (RMSE), Fraction within a Factor of Two (FA2), and Fraction within a Factor of Five (FA5), as summarized in Table 2.
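For orientation, the four metrics can be sketched as follows using their commonly used forms. The sign convention of FB and the exclusion of non-positive pairs from FA2 and FA5 are assumptions made here, and the sample values are invented for illustration; the definitions in Table 2 take precedence.

```python
import math


def fractional_bias(obs, pred):
    """FB = 2 (mean(pred) - mean(obs)) / (mean(pred) + mean(obs))."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    return 2.0 * (mean_pred - mean_obs) / (mean_pred + mean_obs)


def rmse(obs, pred):
    """Root mean square error between paired values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def fraction_within_factor(obs, pred, factor):
    """Fraction of pairs with 1/factor <= pred/obs <= factor."""
    # Assumption: pairs with a non-positive value are skipped.
    pairs = [(o, p) for o, p in zip(obs, pred) if o > 0 and p > 0]
    if not pairs:
        return float("nan")
    hits = sum(1 for o, p in pairs if 1.0 / factor <= p / o <= factor)
    return hits / len(pairs)


# Invented sample data, not ETEX-I measurements.
obs = [1.0, 2.0, 4.0, 8.0]
pred = [1.5, 1.0, 20.0, 8.0]

fb = fractional_bias(obs, pred)
err = rmse(obs, pred)
fa2 = fraction_within_factor(obs, pred, 2.0)
fa5 = fraction_within_factor(obs, pred, 5.0)

print(fb, err, fa2, fa5)
```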

2.7.3. Performance Evaluation Design

Since PMCH is a passive tracer that does not activate deposition-related modules, especially wet deposition, it cannot fully reflect the computational characteristics of complex physical processes involved in atmospheric radionuclide transport simulations. Accordingly, Cs-137 was used in the performance benchmark cases to activate dry- and wet-deposition calculations, thereby enabling a more comprehensive evaluation of the acceleration performance of the GPU implementation across all major computational modules. Apart from the tracer substitution, all other simulation configurations remained identical to those used in the accuracy validation. The performance evaluation comprised three aspects:

Single-GPU acceleration assessment. This evaluation was conducted on a computing platform equipped with an AMD Ryzen Threadripper 7970X CPU (32 cores) and an NVIDIA GeForce RTX 5080 GPU (hereafter referred to as Platform A). The runtime performance of the developed GPU-accelerated program was compared with that of the reference benchmark. The analysis focused on quantifying performance improvements achieved through the successive introduction of the fine-grained parallel architecture, fast arithmetic instruction optimization, and parallel execution and resource utilization strategies, thereby enabling a quantitative assessment of the relative contribution of each optimization to the overall speedup.
Multi-GPU scalability evaluation. By progressively increasing the number of GPUs, the variation in total execution time and speedup with respect to GPU count was systematically analyzed to assess the parallel scalability of the developed program in multi-device environments. Because such scalability tests require substantial GPU resources to adequately characterize performance trends, this evaluation was performed on a server platform equipped with an Intel® Xeon® Platinum 8160 CPU @ 2.10 GHz and eight NVIDIA Tesla V100 GPUs (hereafter referred to as Platform B).
Heterogeneous GPU load-balancing evaluation. Based on Platform A, an additional NVIDIA GeForce RTX 5070 GPU was introduced to construct a heterogeneous multi-GPU environment (hereafter referred to as Platform C) for evaluating the proposed load-balancing strategy. By comparing the distribution of computation time across GPUs with and without the load-balancing strategy enabled, the effectiveness of the strategy in improving task allocation balance and overall parallel efficiency under heterogeneous computing conditions was analyzed.

3. Results and Discussion

3.1. Accuracy Validation

Figure 6a,b present the mean concentration fields at 48 h after release simulated by the reference CPU code and the developed GPU implementation, respectively. The two fields exhibit highly consistent large-scale spatial patterns, indicating that the GPU implementation reproduces the overall advection and dispersion structure of the tracer plume with comparable fidelity. The relative difference field shown in Figure 6c indicates that, within the high-concentration core of the plume, concentrations simulated by the GPU code are marginally lower than those of the reference code, whereas slightly higher values occur along the low-concentration plume margins. This pattern suggests that, relative to the reference implementation, the GPU code may produce moderately enhanced turbulent mixing effects.

Despite these differences, the absolute deviation between the two simulations (Figure 6d) remains consistently low throughout the simulation period, with magnitudes approximately two orders of magnitude lower than the local mean concentrations at each corresponding time step. This indicates that the discrepancies have a limited influence on the overall concentration field. A further assessment of the long-term numerical consistency of the GPU simulation is provided in Appendix D, where the 15-day validation experiment is summarized in Figure A1.

To further assess the practical impact of the aforementioned deviations on simulation accuracy, the spatial distribution of differences between the observed concentrations and the reference simulation at the corresponding time is presented in Figure 6e. The comparison shows that the simulation errors introduced by the GPU implementation, relative to the reference model, are significantly smaller than the discrepancies between the observations and the reference simulation. This result further confirms that the numerical errors introduced by the GPU acceleration remain well controlled. For a systematic evaluation, Table 3 summarizes the statistical performance metrics of the reference CPU code and the GPU implementation, calculated using the complete ETEX-I observational dataset. The results indicate that the two implementations exhibit highly consistent overall simulation accuracy, demonstrating that the GPU code effectively reproduces the real pollutant dispersion process while maintaining numerical consistency. Overall, the GPU implementation developed in this study achieves computational accuracy essentially consistent with that of the CPU reference model, indicating that the proposed approach provides reliable results while preserving numerical precision.

3.2. Computational Performance Evaluation

3.2.1. Single-GPU Acceleration Performance

Figure 7a compares the runtime performance of individual computational components between the GPU implementation with fine-grained parallelization and the reference CPU code on validation platform A. After adopting the fine-grained “one-particle–one-thread” mapping strategy, the overall computational performance is substantially improved. Among the major physical processes, the advection–diffusion module exhibits the highest acceleration, with a speedup of 38.50, while the convective mixing and wet deposition modules achieve speedups of 10.79 and 12.53, respectively. These differences are primarily associated with the computational characteristics of each process and their inherent parallel scalability. During the advection–diffusion stage, particle updates are mutually independent and involve minimal data dependency, allowing efficient utilization of large-scale GPU parallelism and resulting in high parallel efficiency. In contrast, the acceleration of the convective mixing process is more limited, as this stage involves particle–grid rearrangement and nested loop structures; after GPU porting, the construction and update of the corresponding transition matrices remain on the CPU, which constrains the achievable speedup. For the wet deposition process, precipitation events during the ETEX-I experiment occur infrequently, leading to relatively low activation frequency and computational density. Under these conditions, the proportion of memory-access operations is comparatively high, while the effective arithmetic workload is limited, resulting in lower acceleration than that observed for the advection–diffusion process. Nevertheless, at the overall model level, the fine-grained parallelization strategy provides a clear performance benefit. Even when time-consuming sequential procedures such as meteorological data decoding and preprocessing are included, the total speedup reaches 9.75.

Given that computationally expensive arithmetic operations, such as division and square-root calculations, are primarily concentrated in the advection–diffusion stage, Figure 7b further presents the performance obtained after introducing fast arithmetic instruction optimizations for this module. The results show that the execution time of the advection–diffusion process is reduced from 152.3 s to 110.0 s, a 27.8% reduction in runtime that corresponds to a throughput gain of approximately

38.4 %

(152.3/110.0 ≈ 1.384). This outcome indicates that, without modifying the underlying algorithmic structure, the use of fast arithmetic instructions can effectively reduce arithmetic overhead and further increase computational throughput for this stage.

After introducing the coarse-grained parallelization strategy, differences in data-transfer patterns and computational workflows among the physical processes lead to distinct task-partitioning schemes. Figure 8 summarizes the acceleration performance obtained for different physical processes under varying numbers of CUDA streams. For the advection–diffusion and wet deposition processes, the computational time generally decreases as the number of streams increases, while the performance gain approaches saturation when the stream count reaches eight. In principle, increasing the number of streams enhances the overlap between computation and data transfer, thereby improving execution efficiency. In practice, however, excessive task subdivision shortens individual kernel execution times, increasing the relative contribution of kernel launch and scheduling overhead to the total runtime. At the same time, splitting contiguous data into an excessive number of sub-blocks reduces the efficiency of each data transfer. As a result, once the number of streams exceeds a certain threshold, further refinement of task granularity yields diminishing returns and may even lead to increased overall runtime due to additional scheduling and transfer overheads.

In contrast, when coarse-grained parallelization is applied to the convective mixing process, the computational time increases. This behavior is primarily associated with the particle–grid rearrangement mechanism used in this stage. Because particle grid locations cannot be determined a priori during task partitioning, parameters related to active grid cells cannot be effectively divided across streams, which limits the achievable overlap between data transfer and kernel execution. Consequently, the potential benefits of coarse-grained parallelism are not fully realized. In addition, the additional scheduling and management overhead introduced by coarse-grained partitioning becomes more pronounced for this process, offsetting the gains expected from parallel execution and ultimately resulting in degraded performance.

Based on these observations, process-specific coarse-grained configurations are adopted in this study. Eight CUDA streams are selected for the advection–diffusion and wet deposition processes, while coarse-grained parallelization is not applied to the convective mixing process.

Figure 9 illustrates the impact of limiting the maximum number of registers per thread on the performance of individual computational modules. As the register limit is progressively reduced, the execution time of the advection–diffusion process decreases markedly from 256 to 128 registers, then declines more gradually from 128 to 72, and finally exhibits a tendency toward performance degradation when the limit is further reduced to 64. This behavior is primarily attributable to the influence of register usage on the number of warps that can reside concurrently on an SM. With a high register count, the number of resident warps per SM is constrained; when active warps stall due to memory accesses, the lack of alternative warps for scheduling leads to SM idle cycles and reduced computational efficiency. As the register limit is lowered, the number of resident warps increases substantially, enabling more effective warp switching and improved hiding of memory latency, which results in a rapid performance improvement. Once the resident warp count becomes sufficient to mask most memory stalls, further increases yield diminishing returns. Moreover, excessive register compression may induce register spilling or increase instruction counts, ultimately leading to performance degradation. For the convective mixing process, the core computational kernel uses approximately 116 registers per thread in the absence of any imposed limit, as summarized in Appendix E, Table A4, corresponding to an occupancy of about

33.3 %

. Under these conditions, the available number of resident warps is already sufficient to effectively hide memory latency, and further restriction of register usage has only a minor impact on performance. For the wet deposition process, the initial register usage is relatively high (approximately 250), comparable to that of the advection–diffusion process. However, its memory throughput is only about

5–7 %

, which is substantially lower than that of the advection–diffusion and convective mixing processes, indicating that it is not predominantly limited by memory bandwidth or latency. In this case, increasing occupancy does not readily translate into tangible performance gains, while the risks associated with register spilling and additional instruction overhead introduced by register compression tend to dominate, leading to increased execution time. Because the maximum register constraint applies globally to all kernels, an overall balance across different computational modules is required. Based on the combined performance characteristics, the maximum number of registers per thread is set to 72 in this study.
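The link between the register limit and occupancy can be approximated with a simplified register-only model. All hardware parameters below are assumptions chosen for illustration (65,536 registers and at most 48 resident warps per SM, 256 threads per block, registers allocated in multiples of 8); the model ignores shared-memory and block-count limits, and the function name is hypothetical.

```python
def occupancy(regs_per_thread, threads_per_block=256,
              regs_per_sm=65536, max_warps_per_sm=48,
              warp_size=32, reg_granularity=8):
    """Register-limited occupancy estimate under the assumed SM limits."""
    # Round the per-thread request up to the assumed allocation granularity.
    regs = -(-regs_per_thread // reg_granularity) * reg_granularity
    # Warps that fit in the register file, capped by the warp limit.
    warps_by_regs = min(regs_per_sm // (regs * warp_size), max_warps_per_sm)
    # Only whole thread blocks can be resident.
    warps_per_block = threads_per_block // warp_size
    resident_blocks = warps_by_regs // warps_per_block
    return resident_blocks * warps_per_block / max_warps_per_sm


for limit in (255, 128, 116, 72, 64):
    print(limit, round(occupancy(limit), 3))
```

Under these assumed limits, 116 registers per thread maps to an occupancy of one third, in line with the value quoted above, although that agreement depends on the assumed block size. Vendor occupancy tools should be preferred for actual tuning.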

Although the optimization analysis in this study was conducted using the ETEX-I simulation case, the resulting conclusions exhibit reasonably good generality due to the relatively fixed computational structure, thread organization strategy, and data-access patterns of the FLEXPART GPU framework under the one-particle-per-thread parallel mapping strategy. Consequently, the register usage characteristics and SM resource-allocation behavior of the major computational kernels remain relatively stable across different simulation scenarios.

It should also be noted that, under a fixed GPU architecture and computational framework, the register utilization characteristics and occupancy behavior of GPU kernels are primarily determined by the kernel structure, thread organization strategy, and memory-access patterns [59]. Therefore, for the GPU implementation of FLEXPART developed in this study, the register usage characteristics of the major computational kernels generally remain relatively stable across different simulation cases. The primary exception is the wet-deposition module, whose active-grid distribution and memory-access behavior may vary moderately depending on the spatial distribution and intensity of precipitation fields. Nevertheless, for most computational modules, the overall resource-utilization characteristics remain broadly consistent across different simulation configurations.

After completing the GPU implementation and the associated optimizations, the runtime statistics of the individual computational modules are summarized in Table 4. The results indicate that, at this stage, the reading and preprocessing of the meteorological background fields account for

78.56 %

of the total execution time, thereby constituting the dominant bottleneck limiting further improvements in overall computational efficiency.
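The consequence of such a dominant stage can be illustrated with Amdahl's law. The sketch below takes the 78.56% share from the text and applies hypothetical acceleration factors to it; the resulting figures are illustrative estimates, not measured results.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the runtime is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


MET_SHARE = 0.7856  # share of runtime spent on meteorological input (from the text)

# Hypothetical acceleration factors for the meteorological stage.
for factor in (2.0, 5.0, 10.0, float("inf")):
    print(factor, round(amdahl_speedup(MET_SHARE, factor), 3))

# Upper bound if only the remaining GPU kernels were accelerated further.
kernel_only_bound = amdahl_speedup(1.0 - MET_SHARE, float("inf"))
print(round(kernel_only_bound, 3))
```

Under this model, further kernel optimization alone could not improve the total runtime by more than a factor of about 1.27, whereas accelerating or bypassing the meteorological stage leaves considerably more headroom.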

Table 5 reports the runtime of the independently developed MPI-based meteorological data decoupling and preprocessing program,

grib_to_bin_mpi

, under different numbers of processes, and compares it with the corresponding procedure in the reference CPU implementation. The results indicate that, under a typical modern CPU configuration with 18 threads, the meteorological field preprocessing achieves a speedup of approximately 11.63. This substantially reduces the proportion of time spent in sequential meteorological data processing and mitigates its impact on overall computational performance.

Based on the above developments, the total simulation time is reduced from

6598.00

s in the original implementation to

163.12

s, corresponding to an overall speedup of approximately 40.45. In practical application scenarios involving high computational demand, such as source-term inversion, multiple simulation tasks typically share the same meteorological background fields. Once the preprocessing step is completed, the meteorological data can be repeatedly accessed in binary form, with an associated I/O time of only about

2.85

s. Under these conditions, the overall speedup of the developed GPU program relative to the reference CPU implementation further increases to approximately 52.05. As summarized in Table 6, this performance substantially exceeds that reported for existing GPU-accelerated Lagrangian particle dispersion approaches. The sensitivity of the acceleration performance to the number of released particles is further examined in Appendix F, with the corresponding runtime and speedup results listed in Table A5.

3.2.2. Multi-GPU Scalability Assessment

Multi-GPU scalability tests of the developed GPU program were conducted on validation platform B, and the results are shown in Figure 10. As the number of GPUs increases from 1 to 4, the advection–diffusion process exhibits near-linear speedup, indicating favorable parallel scalability for this component. When the GPU count is increased beyond this range, the incremental performance gains gradually diminish. This behavior arises because, under a fixed total problem size, the computational workload assigned to each GPU decreases as more devices are introduced, while the relative contribution of communication and synchronization overhead to the total runtime correspondingly increases, thereby constraining further speedup. In contrast, the convective mixing and wet deposition processes do not show substantial acceleration with increasing GPU count. On the one hand, the proportion of kernel execution time associated with these processes is relatively small within the overall computation, limiting the impact of additional GPUs on their runtime. On the other hand, as discussed in Section 3.2.1, the convective mixing process involves relatively frequent data transfers; under fixed CPU-side communication bandwidth, introducing additional GPUs further amplifies communication and synchronization overheads and may even lead to increased execution time. Despite these limitations in the scalability of certain components, the overall GPU program maintains satisfactory parallel scalability within a moderate GPU count, supporting its feasibility and effectiveness in multi-GPU environments.

3.2.3. Load Balancing Across Heterogeneous GPUs

The acceleration effects of the load-balancing strategy in a heterogeneous GPU environment were evaluated on validation platform C, with the results shown in Figure 11. Using the single NVIDIA GeForce RTX 5080 configuration as the reference, the dual-GPU setup (RTX 5080 + RTX 5070) achieves a speedup of only about 12.52% for the advection–diffusion process when no load-balancing strategy is applied. After introducing the proposed load-balancing scheme, the speedup increases to approximately 31.94%, representing an absolute improvement of about 19.42 percentage points relative to the unbalanced configuration. This result indicates that the load-balancing strategy effectively mitigates the adverse impact of performance heterogeneity among GPUs on parallel efficiency, thereby enhancing overall acceleration. The load-balancing strategy also exerts an indirect influence on the execution efficiency of the convective mixing process. By allocating computational tasks according to the relative capabilities of the GPUs and avoiding excessive workloads on the less capable device, delays associated with data transfer and synchronization are reduced. This scheduling approach partially suppresses the increase in convective mixing runtime observed under heterogeneous multi-GPU configurations. In contrast, the impact of load balancing on the wet deposition process is limited. Because the kernel workload of this process is relatively small, both GPUs complete the computations within a short time, making its performance less sensitive to the task distribution. As a result, the overall execution time of the wet deposition process remains largely unchanged.
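The idea of allocating work according to relative device capability can be sketched as a proportional split of the particle population. This is a minimal illustration, not the scheme implemented in the GPU program: the function names, the 5:3 throughput ratio, and the assumption that runtime scales linearly with particle count are all hypothetical.

```python
def split_particles(n_particles, relative_throughput):
    """Split particles across GPUs in proportion to an assumed throughput."""
    total = sum(relative_throughput)
    counts = [int(n_particles * w / total) for w in relative_throughput]
    # Give any remainder left by integer truncation to the fastest device.
    fastest = relative_throughput.index(max(relative_throughput))
    counts[fastest] += n_particles - sum(counts)
    return counts


def makespan(counts, relative_throughput):
    """Time until the slowest device finishes, in arbitrary units."""
    return max(c / w for c, w in zip(counts, relative_throughput))


# Hypothetical relative throughputs of the two devices (ratio 5:3).
weights = [5, 3]
n_particles = 1_000_000

even = [n_particles // 2, n_particles // 2]
balanced = split_particles(n_particles, weights)

t_even = makespan(even, weights)
t_balanced = makespan(balanced, weights)
```

With an even split the faster device idles while the slower one finishes, so the step time is set by the slower device; the proportional split lets both devices finish at about the same time.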

4. Conclusions

This study establishes a GPU-oriented Lagrangian particle dispersion model within the FLEXPART framework, achieving a comprehensive migration of core transport and deposition processes to modern accelerator architectures. Through coordinated fine-grained parallelization and architectural adaptations, the framework enables efficient large-scale dispersion calculations while preserving numerical fidelity across key physical processes. Complementary arithmetic optimizations and concurrency-oriented execution strategies further unlock GPU computational capacity, resulting in a balanced and scalable implementation suited to high-throughput operational use.

Beyond kernel-level acceleration, the development of a parallelized meteorological data preparation program alleviates a major upstream bottleneck, enabling end-to-end performance gains in realistic application workflows. The extension to multi-GPU execution, together with a heterogeneous load-balancing strategy tailored to mixed-performance devices, demonstrates robust scalability and efficient resource utilization across diverse hardware configurations.

Validation against the ETEX-I tracer experiment confirms that the GPU-based model remains consistent with established reference simulations, with minor numerical deviations being confined and physically reasonable. Compared to the benchmark verification program, it achieves a speedup ratio of approximately 40.45–52.05, significantly outperforming previously developed GPU-accelerated Lagrangian particle programs. Furthermore, the model demonstrates favorable multi-GPU scalability and can optimize execution efficiency across heterogeneous GPU devices through load-balancing strategies. Overall, the developed framework provides a practical and extensible solution for computationally intensive atmospheric dispersion applications—particularly those requiring rapid turnaround and repeated simulations, such as emergency response assessment and source-term analysis—within the domain of large-scale transport modeling.

5. Limitations and Future Works

Although the GPU-accelerated FLEXPART framework developed in this study demonstrates good numerical consistency and substantial computational acceleration, several limitations remain in the current implementation.

At present, the multi-GPU implementation exhibits scalability saturation when the number of GPU devices increases beyond a certain level. As discussed in Section 3.2.2, this behavior is mainly associated with the increasing communication and synchronization overhead during multi-GPU execution. In the current implementation, some intermediate data structures are still independently transferred from the host to each GPU device, which may gradually increase host-mediated communication overhead and PCIe transfer pressure as the number of GPUs increases. Future work may further investigate advanced GPU-direct communication technologies, such as GPUDirect RDMA and NVLink-enabled peer-to-peer communication, to reduce redundant host-device data transfers and improve communication efficiency in large-scale multi-GPU deployments. For example, future implementations may allow only a subset of GPU devices to synchronize data directly from the host, while the remaining devices obtain replicated data through direct GPU-to-GPU communication paths.

In addition, although the present study mainly focuses on computational acceleration and numerical consistency validation, the energy efficiency of GPU-accelerated atmospheric dispersion simulations has not yet been systematically evaluated. Considering the increasing interest in green computing and low-carbon high-performance computing, future work may further compare the energy consumption and energy-per-simulation characteristics between conventional CPU platforms and GPU-accelerated implementations to more comprehensively evaluate the potential environmental benefits of GPU-based atmospheric transport modelling.

Author Contributions

Conceptualization, Q.L. and T.H.; methodology, Q.L.; software, T.H.; validation, Q.L., M.L. and B.L.; formal analysis, Q.L. and T.H.; investigation, Q.L.; resources, L.L.; data curation, Q.L.; writing—original draft preparation, Q.L.; writing—review and editing, R.Q.; visualization, Q.L.; supervision, J.L. and R.Q.; project administration, J.Z. and L.L.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovative Team Project of the China Institute for Radiation Protection.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

To enhance the reproducibility of this study and to facilitate direct replication of the reported validation results by subsequent researchers and end users, the developed GPU-accelerated Lagrangian particle dispersion program, together with all validation test cases, has been packaged as a Docker image. This image integrates the GPU executable, the complete dependency and runtime environment, and all validation scenarios and parameter files presented in this work, enabling users to reproduce the experiments through minimal configuration on CUDA-enabled computing platforms. Docker image: https://doi.org/10.5281/zenodo.18164030, accessed on 29 May 2026.

Acknowledgments

This work was supported by the China Institute for Radiation Protection. The authors gratefully acknowledge the funding support, which made this work possible.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Deep Convective Particle Redistribution Procedure

Deep convective mixing mainly occurs in deep convective clouds or strongly convective environments, where air masses may undergo rapid vertical transport within a short period, leading to fast redistribution of particles among different vertical layers. Due to the lack of stable turbulence statistics and appropriate stochastic parameters for deep convection, it remains difficult to construct a generally applicable Langevin-type stochastic model for deep convective transport. Therefore, LPDMs commonly employ mass-flux-based convective parameterization schemes, in which convective mass fluxes defined on Eulerian grids are used to describe the vertical redistribution of particles [56]. In FLEXPART, this process is parameterized using the scheme of Emanuel and Živković-Rothman [52].

Specifically, the model first computes a vertical mass-flux matrix according to the convective parameterization scheme. This matrix represents the fractional transport relationship of air mass from layer i to layer j. The mass-flux matrix is then converted into a particle redistribution probability matrix. If $F_{ij}$ denotes the mass flux from layer i to layer j, the probability that a particle initially located in layer i is transported to layer j can be expressed as

$$P_{ij} = \frac{F_{ij}}{\sum_{k} F_{ik}} . \tag{A1}$$

In the numerical implementation, a uniformly distributed random number $R \sim U(0, 1)$ is generated for each particle, and the target layer is determined using the cumulative probability function. When

$$C_{m-1} < R \leq C_{m}, \qquad C_{m} = \sum_{j=1}^{m} P_{ij}, \tag{A2}$$

the particle is assigned to target layer m. After the target layer is determined, the final particle height is not fixed at the center of the layer, but is randomly interpolated between the upper and lower boundaries of the target layer. If the lower and upper boundaries of the target layer are denoted by $z_{m}^{\mathrm{bot}}$ and $z_{m}^{\mathrm{top}}$, respectively, the new particle height is calculated as

$$z_{\mathrm{new}} = (1 - \alpha)\, z_{m}^{\mathrm{bot}} + \alpha\, z_{m}^{\mathrm{top}}, \qquad \alpha \sim U(0, 1) . \tag{A3}$$

For particles that are not redistributed between layers, a compensating subsidence velocity $w_{\mathrm{sub}}$ is applied according to the convective mass-balance condition:

$$z(t + \Delta t) = z(t) + w_{\mathrm{sub}}\, \Delta t . \tag{A4}$$

Therefore, although the redistribution probability is derived from mass fluxes defined on Eulerian grid columns, the actual redistribution of particles is carried out individually and stochastically.
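The per-particle procedure in Equations (A1)–(A4) can be sketched in a few lines of Python. This is an illustrative serial version under stated assumptions, not the GPU kernel of the model: the function names, the example flux row, and the layer boundaries are hypothetical.

```python
import bisect
import random
from itertools import accumulate


def redistribution_probabilities(flux_row):
    """Equation (A1): normalize the mass fluxes leaving one source layer."""
    total = sum(flux_row)
    return [f / total for f in flux_row]


def redistribute(flux_row, z_bot, z_top, rng):
    """Draw a target layer via (A2) and a height inside it via (A3)."""
    probs = redistribution_probabilities(flux_row)
    cumulative = list(accumulate(probs))  # C_m for each target layer m
    # R is drawn from (0, 1] so that zero-probability layers are never chosen.
    r = 1.0 - rng.random()
    # First index m with R <= C_m; min() guards against round-off in the last C_m.
    m = min(bisect.bisect_left(cumulative, r), len(cumulative) - 1)
    alpha = rng.random()
    z_new = (1.0 - alpha) * z_bot[m] + alpha * z_top[m]
    return m, z_new


def subside(z, w_sub, dt):
    """Equation (A4): compensating subsidence for non-redistributed particles."""
    return z + w_sub * dt


# Hypothetical three-layer column; indices are 0-based in this sketch.
flux_from_layer = [0.0, 2.0, 6.0]
z_bot = [0.0, 1000.0, 2000.0]
z_top = [1000.0, 2000.0, 3000.0]
rng = random.Random(42)
draws = [redistribute(flux_from_layer, z_bot, z_top, rng) for _ in range(20000)]
```

Because each particle only needs its own random numbers and the shared probability row of its source layer, the draws are independent of one another, which is what makes the step amenable to particle-level parallelism.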

Appendix B. Design, Computational Details, and Intermediate Variables of the Preprocessing Program grib_to_bin_mpi

The core idea of the decoupling strategy is to transform meteorological field processing into a one-time precomputation step. Specifically, tasks are distributed through an MPI parallel framework, with each process responsible for a subset of GRIB files. Required variables are extracted and computed independently by each process and subsequently stored as standalone binary files. As a result, the FLEXPART main program no longer needs to repeatedly read GRIB files; instead, it directly loads the preprocessed binary data, thereby substantially reducing I/O overhead and redundant computations. In parallel, the preprocessing program generates an AVAILABLE_bin file that records the names of the available binary wind-field files, ensuring compatibility with the original AVAILABLE mechanism. This design leverages MPI broadcast and synchronization to maintain global consistency, while load-balanced task allocation enables efficient utilization of multicore resources.

The main workflow of the preprocessing program is summarized as follows:

MPI initialization and process management. Upon startup, the program calls MPI_Init to initialize the MPI environment and retrieves the total number of processes (nprocs) and the process rank (myrank). The root process (myrank = 0) is designated with lroot = .true. and is responsible for overall coordination.
Configuration file reading and format detection. The root process reads the COMMAND, RELEASES, and AVAILABLE configuration files to obtain simulation parameters (e.g., ldirect, ind_receptor, ipin). The meteorological data format (ECMWF or NCEP) is identified using detectformat(). Depending on the detected format, gridcheck_ecmwf() or gridcheck_gfs() is invoked to analyze grid dimensions (e.g., nxmax, nymax, nuvzmax). The resulting metadata are written to header.t for subsequent use by FLEXPART.
Global broadcast and memory allocation. The identified format, grid dimensions, and related parameters (e.g., nxshift, nx, ny, nz, metdata_format) are broadcast to all processes via MPI_Bcast. Each process then allocates memory for the required data structures (such as the three-dimensional arrays uuh, vvh, and wwh), ensuring consistency across all ranks.
Parallel file processing. GRIB files are distributed among processes according to the numbwf and wfname arrays. Each process reads its assigned files and invokes readwind_ecmwf() or readwind_gfs() to extract wind-field data. After completing parameter calculations and necessary transformations, the results are written to .bin files. An MPI_Barrier is used to synchronize all processes, after which the root process generates the AVAILABLE_bin file listing the produced binary files.
Memory deallocation and program termination.
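The file-distribution step of this workflow can be sketched without an MPI runtime by looping over ranks. The text does not specify how files are mapped to processes, so the round-robin rule below is an assumption, as are the function names, the example file names, and the output naming rule; the actual program would perform this inside each MPI process between MPI_Bcast and MPI_Barrier.

```python
def files_for_rank(wfname, myrank, nprocs):
    """Round-robin assignment (an assumption): rank r takes files r, r + nprocs, ..."""
    return wfname[myrank::nprocs]


def binary_name(grib_name):
    """Hypothetical naming rule for the preprocessed output file."""
    return grib_name + ".bin"


# Hypothetical list of GRIB wind-field files, standing in for the wfname array.
wfname = [f"EN9410{day:02d}{hour:02d}" for day in (23, 24) for hour in (0, 6, 12, 18)]
nprocs = 3

# Emulate what each rank would process independently.
assignment = {rank: files_for_rank(wfname, rank, nprocs) for rank in range(nprocs)}

# After all ranks finish, the root would list every produced file,
# in the same order as the original AVAILABLE entries.
available_bin = [binary_name(name) for name in wfname]
```

A round-robin split keeps the per-rank file counts within one of each other, which is a reasonable balance when all GRIB files have similar size.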

Corresponding modifications are also required in the FLEXPART main program. Specifically, gridcheck() is adapted to obtain meteorological format and grid-dimension information directly from header.t, and getfields() is modified to read data from the .bin files rather than from GRIB input. These changes replace the original GRIB-based workflow and ensure seamless integration of the preprocessing scheme into the overall model framework.

Table A1. Parameters contained in the binary meteorological field data files. Superscript ¹ denotes spatial two-dimensional variables, superscript ² denotes vertical-level (one-dimensional) variables, and superscript ³ denotes scalar variables; all remaining variables represent spatial three-dimensional fields. Variables with the suffix eta correspond to their counterparts defined in the η-coordinate system.

Variable Short Name	Description
tth	temperature data on half model levels
qvh	specific humidity data on half model levels
ps ¹	surface pressure
tcc ¹	total cloud cover
tt2 ¹	2 m temperature
td2 ¹	2 m dew point
lsprec ¹	large scale total precipitation
convprec ¹	convective precipitation
oro ¹	orography
lsm ¹	land sea mask
lcw ³	logical, indicating whether clwc is available (see ctwc ¹)
ustar ¹	friction velocity
wstar ¹	convective velocity scale
hmix ¹	mixing height
tropopause ¹	altitude of thermal tropopause
oli ¹	inverse Obukhov length (1/L)
vdep ¹	deposition velocity
uu (uueta)	wind components in x
vv (vveta)	wind components in y
ww (wweta)	wind components in z
uupol (uupoleta)	wind components in polar stereographic projection
vvpol (vvpoleta)	wind components in polar stereographic projection
tt (tteta)	temperature data on internal model levels
qv	specific humidity data on internal model levels
pv (pveta)	potential vorticity
rho (rhoeta)	air density
drhodz (drhodzeta)	vertical air density gradient
ctwc ¹	total cloud water content (= liquid clwc + ice ciwc)
icloudbot ¹	cloud bottom height
icloudtop ¹	cloud top
height ²	heights of all levels
wheight ² (etawheight ²)	model level heights
uvheight ² (etauvheight ²)	half-model level heights
nmixz ³	number of levels up to maximum PBL height (3500 m)
pplev	pressure on half model levels
prs	air pressure RLT

Table A2. List of general information produced by the preprocessing program and stored in header.t.

Variable Name	Description
metdata_format	storing the input data type (ECMWF/NCEP)
nxmax	Size of windfield
nymax	Size of windfield
nuvzmax	Size of windfield
nwzmax	Size of windfield
nzmax	Size of windfield
nx	actual dimensions of wind fields in x, y and z direction
ny	actual dimensions of wind fields in x, y and z direction
nz	actual dimensions of wind fields in x, y and z direction
nxmin1	nx − 1
nymin1	ny − 1
nxfield	same as nx for limited area fields, but for global fields nx = nxfield + 1
nuvz	vertical dimension of original data (u, v components; staggered grid)
nwz	vertical dimension of original data (w component)
nlev_ec	number of levels ECMWF model
xglobal	T for global fields, F for limited area fields
sglobal	T for global fields, F for limited area fields
nglobal	T for global fields, F for limited area fields
akm	coefficients which regulate vertical discretization
bkm	coefficients which regulate vertical discretization
akz	model discretization coefficients at the centre of the layers
bkz	model discretization coefficients at the centre of the layers
dx	grid distance in x direction
dy	grid distance in y direction
dxconst	auxiliary variables for utransform
dyconst	auxiliary variables for utransform
xlon0	geographical longitude of lower left grid point
ylat0	geographical latitude of lower left grid point
southpolemap	define stereographic projections at the two poles
northpolemap	define stereographic projections at the two poles
switchnorthg	use polar stereographic threshold in grid units
switchsouthg	use polar stereographic threshold in grid units
nconvlevmax	maximum number of levels for convection
na	parameter used in Emanuel’s convect subroutine

Appendix C. Configuration Parameters in the Validation Benchmark

Table A3. Key FLEXPART data-card parameter settings used in the validation simulations.

Parameter	Setting	Description
LDIRECT	1	Forward simulation
IBDATE	19941023	Start date of the simulation (YYYYMMDD)
IBTIME	160000	Start time of the simulation (HHMMSS, UTC)
IEDATE	19941027	End date of the simulation (YYYYMMDD)
IETIME	100000	End time of the simulation (HHMMSS, UTC)
PARTS	1,000,000	Total number of released Lagrangian particles
LSYNCTIME	300 s	Synchronization time step of the simulation
DXOUT/DYOUT	0.1°	Longitude/latitude resolution of the output grid
NX/NY	320/250	Number of output grid cells in the longitude/latitude directions
LCONVECTION	1	Switch on convection parameterization
LTURBULENCE	1	Switch on turbulence parameterization
CBLFLAG	1	Skewed turbulence parameterization scheme used in the CBL
CTL	10	Reduction factor for the turbulence-integration time step
IFINE	10	Reduction factor for the vertical-transport time step
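A few derived quantities follow directly from the settings in Table A3 and may help when checking a configuration. The short sketch below computes the simulated duration, the number of synchronization steps, and the angular extent of the output grid; the variable names are chosen here for illustration and are not FLEXPART identifiers.

```python
from datetime import datetime

# Values copied from Table A3.
ibdate, ibtime = "19941023", "160000"
iedate, ietime = "19941027", "100000"
lsynctime_s = 300
dxout_deg = dyout_deg = 0.1
nx_out, ny_out = 320, 250

start = datetime.strptime(ibdate + ibtime, "%Y%m%d%H%M%S")
end = datetime.strptime(iedate + ietime, "%Y%m%d%H%M%S")

duration_s = int((end - start).total_seconds())
duration_h = duration_s // 3600
n_sync_steps = duration_s // lsynctime_s

# Angular extent covered by the output grid, rounded to suppress float noise.
lon_extent_deg = round(nx_out * dxout_deg, 6)
lat_extent_deg = round(ny_out * dyout_deg, 6)
```

For the benchmark settings this gives a 90 h simulation, 1080 synchronization steps, and an output domain spanning 32° in longitude and 25° in latitude.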

Appendix D. Long-Term Accuracy Assessment of the GPU Model

To further assess whether the numerical agreement between the GPU implementation and the reference CPU program remains valid over extended dispersion periods, an additional 15-day simulation experiment was conducted. The original CPU implementation was used as the reference solution, and the corresponding GPU simulation was performed under the same release, meteorological, and model-configuration conditions. At each output time, the absolute deviation between the GPU and CPU concentration fields was calculated and summarized as a time-dependent diagnostic. This metric is shown by the blue solid line in Figure A1. For comparison, the red dashed line denotes the mean absolute error between the CPU simulation results and the observational measurements from the ETEX-I experiment, providing a physically meaningful reference scale for model–observation discrepancy.

As shown in Figure A1, the absolute deviation between the GPU and CPU simulations does not accumulate with increasing integration time. Instead, it generally decreases and then approaches a stable level during the 15-day simulation. This decreasing tendency is partly associated with the progressive dilution of the simulated tracer plume, which reduces the absolute magnitude of concentration differences as transport and diffusion proceed. More importantly, the GPU–CPU deviation remains substantially smaller than the mean absolute discrepancy between the CPU model prediction and the ETEX-I observations throughout the long-term simulation. These results indicate that the numerical differences introduced by GPU parallelization are minor relative to the intrinsic uncertainty of atmospheric dispersion modeling and observational comparison. Therefore, the GPU model preserves the computational accuracy of the reference CPU implementation under long-duration atmospheric transport conditions.

Figure A1. Long-term accuracy assessment of the GPU model in a 15-day ETEX-I dispersion simulation. The blue solid line represents the absolute deviation between the GPU and CPU simulations as a function of simulation time, while the red dashed line indicates the mean absolute error between the CPU simulation and ETEX-I observations.
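The time-dependent diagnostic described in this appendix can be sketched as follows. The text does not state how the absolute deviation is aggregated over the grid, so the spatial mean used here is an assumption, and the function names and toy field values are hypothetical.

```python
def mean_absolute_deviation(field_a, field_b):
    """Mean of |a - b| over all cells of two flattened concentration fields."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must have the same number of cells")
    return sum(abs(a - b) for a, b in zip(field_a, field_b)) / len(field_a)


def deviation_series(gpu_fields, cpu_fields):
    """One diagnostic value per output time."""
    return [mean_absolute_deviation(g, c) for g, c in zip(gpu_fields, cpu_fields)]


# Toy fields with two grid cells at two output times (illustrative values only).
gpu_fields = [[1.0, 2.0], [0.5, 0.5]]
cpu_fields = [[1.0, 1.0], [0.5, 0.0]]
series = deviation_series(gpu_fields, cpu_fields)
```

Applying the same function to the CPU fields and the observations at the measurement sites would give the reference scale against which the GPU and CPU deviation is judged.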

Appendix E. Resource Utilization of Individual Components Without Imposing a Per-Thread Maximum Register Limit

Table A4. Resource utilization of the GPU model prior to optimization. Computational throughput and memory throughput are expressed as percentages of the device theoretical peak values.

Computational Module	Registers per Thread	Occupancy	Computational Throughput	Memory Throughput
Advection-Diffusion	254	16.67%	∼20%	∼46%
Convective mixing	116	33.33%	∼13%	∼38%
Wet deposition	250	16.67%	∼2%	∼4.5%
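The link between the register counts and the occupancy values in Table A4 can be illustrated with a register-limited occupancy estimate. The hardware limits below (65,536 registers per multiprocessor, at most 48 resident warps, registers allocated per warp in units of 256, and warps allocated in groups of 4) are assumptions typical of recent NVIDIA consumer GPUs and are not taken from the text; block size and shared-memory limits are ignored.

```python
def register_limited_occupancy(regs_per_thread,
                               regs_per_sm=65536,
                               max_warps_per_sm=48,
                               warp_size=32,
                               reg_alloc_unit=256,
                               warp_alloc_granularity=4):
    """Upper bound on occupancy imposed by register usage alone.

    All default limits are assumed values, not platform data from the paper.
    """
    regs_per_warp = regs_per_thread * warp_size
    # Round the per-warp register demand up to the allocation unit.
    regs_per_warp = -(-regs_per_warp // reg_alloc_unit) * reg_alloc_unit
    warps = regs_per_sm // regs_per_warp
    # Round the resident warp count down to the allocation granularity.
    warps -= warps % warp_alloc_granularity
    return min(warps, max_warps_per_sm) / max_warps_per_sm


registers = {"Advection-Diffusion": 254, "Convective mixing": 116, "Wet deposition": 250}
occupancy_pct = {name: round(100 * register_limited_occupancy(r), 2)
                 for name, r in registers.items()}
```

Under these assumed limits the estimate gives 16.67%, 33.33%, and 16.67% for the three modules, in line with the Occupancy column. This is consistent with register pressure being the binding constraint, which is the motivation for imposing a per-thread register limit, although the agreement does not by itself confirm the assumed hardware parameters.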

Appendix F. Performance Comparison for Different Particle Numbers

Based on the single-GPU acceleration validation experiment, the number of released particles was varied over a range of 100,000, 200,000, 500,000, 1,000,000, 2,000,000, and 5,000,000 to investigate the sensitivity of the speedup ratio to particle number. The CPU and GPU total running times and the corresponding acceleration ratios are listed in Table A5. The results show that the speedup generally increases at first and then decreases slightly as the particle number continues to increase. The increase at relatively small and medium particle numbers is mainly because more particles provide a larger particle-level parallel workload, allowing GPU hardware resources to be more fully utilized while reducing the relative contribution of fixed overheads such as initialization, kernel launches, meteorological-field reading, and output operations. This behavior reflects an inherent limitation of low-particle-number simulations, in which the computational workload is not yet large enough to dominate the total runtime; similar effects are also more evident in MPI-parallel FLEXPART simulations [15]. The slight decrease at larger particle numbers is reasonable because the computation gradually becomes more constrained by memory bandwidth, cache reuse, particle-state storage, device scheduling, and data-management overhead. In addition, serial or partially accelerated parts of the workflow, including host–device synchronization, serial control logic, and file output, cannot scale proportionally with the particle population and therefore limit the attainable speedup according to the non-parallel fraction of the overall workflow. Therefore, the observed decrease should be interpreted as a mild saturation effect rather than a continuous degradation trend.

Table A5. Total running time and acceleration performance of the CPU and single-GPU models for different particle numbers.

Particle Number	CPU Total Time (s)	GPU Total Time (s)	Speedup
100,000	1168.47	37.79	30.92×
200,000	1830.58	48.61	37.66×
500,000	3678.58	87.27	42.15×
1,000,000	6598.00	126.77	52.05×
2,000,000	13,550.26	278.87	48.59×
5,000,000	30,103.03	630.61	47.74×
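The Speedup column of Table A5 is the ratio of CPU to GPU total time, which the following sketch recomputes from the tabulated timings and uses to locate the particle number with the highest speedup. The variable names are chosen for illustration.

```python
# (CPU total time in s, GPU total time in s) from Table A5.
timings = {
    100_000: (1168.47, 37.79),
    200_000: (1830.58, 48.61),
    500_000: (3678.58, 87.27),
    1_000_000: (6598.00, 126.77),
    2_000_000: (13550.26, 278.87),
    5_000_000: (30103.03, 630.61),
}

speedup = {n: cpu / gpu for n, (cpu, gpu) in timings.items()}
best_n = max(speedup, key=speedup.get)

for n, s in speedup.items():
    print(f"{n:>9,d} particles: {s:.2f}x")
```

The ratio rises up to 1,000,000 particles and then declines by a few percent, which is the mild saturation described above rather than a continuing downward trend.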

References

Zhang, X.; Wang, J. Atmospheric dispersion of chemical, biological, and radiological hazardous pollutants: Informing risk assessment for public safety. J. Saf. Sci. Resil. 2022, 3, 372–397.
Yao, R. Atmospheric dispersion of radioactive material in radiological risk assessment and emergency response. Prog. Nucl. Sci. Technol. 2011, 1, 7–13.
Sugiyama, G.; Nasstrom, J.; Pobanz, B.; Foster, K.; Simpson, M.; Vogt, P.; Aluzzi, F.; Homann, S. Atmospheric Dispersion Modeling: Challenges of the Fukushima Daiichi Response. Health Phys. 2012, 102, 493–508.
Hernández-Ceballos, M.A.; Sangiorgi, M.; García-Puerta, B.; Montero, M.; Trueba, C. Dispersion and ground deposition of radioactive material according to airflow patterns for enhancing the preparedness to N/R emergencies. J. Environ. Radioact. 2020, 216, 106178.
Ulimoen, M.; Berge, E.; Klein, H.; Salbu, B.; Lind, O.C. Comparing model skills for deterministic versus ensemble dispersion modelling: The Fukushima Daiichi NPP accident as a case study. Sci. Total Environ. 2022, 806, 150128.
Xu, Y.; Li, X.; Luo, H.; Wang, W.; Fang, S. Source reconstruction for atmospheric radionuclide leakage: Recent advances in decoding information from atmospheric transport physics. J. Hazard. Mater. 2025, 497, 139534.
Rakesh, P.T.; Venkatesan, R.; Srinivas, C.V. Formulation of TKE based empirical diffusivity relations from turbulence measurements and incorporation in a Lagrangian particle dispersion model. Environ. Fluid Mech. 2013, 13, 353–369.
Zhang, X.; Efthimiou, G.; Wang, Y.; Huang, M. Comparisons between a new point kernel-based scheme and the infinite plane source assumption method for radiation calculation of deposited airborne radionuclides from nuclear power plants. J. Environ. Radioact. 2018, 184–185, 32–45.
Pudykiewicz, J. Simulation of the Chernobyl dispersion with a 3-D hemispheric tracer model. Tellus B 1989, 41, 391–412.
Leelőssy, Á.; Mészáros, R.; Lagzi, I. Short and long term dispersion patterns of radionuclides in the atmosphere around the Fukushima Nuclear Power Plant. J. Environ. Radioact. 2011, 102, 1117–1121.
Christoudias, T.; Proestos, Y.; Lelieveld, J. Atmospheric Dispersion of Radioactivity from Nuclear Power Plant Accidents: Global Assessment and Case Study for the Eastern Mediterranean and Middle East. Energies 2014, 7, 8338–8354.
Hu, X.; Li, D.; Huang, H.; Shen, S.; Bou-Zeid, E. Modeling and sensitivity analysis of transport and deposition of radionuclides from the Fukushima Dai-ichi accident. Atmos. Chem. Phys. 2014, 14, 11065–11092.
Christoudias, T.; Lelieveld, J. Modelling the global atmospheric transport and deposition of radionuclides from the Fukushima Dai-ichi nuclear accident. Atmos. Chem. Phys. 2013, 13, 1425–1438.
Pisso, I.; Sollum, E.; Grythe, H.; Kristiansen, N.I.; Cassiani, M.; Eckhardt, S.; Arnold, D.; Morton, D.; Thompson, R.L.; Groot Zwaaftink, C.D.; et al. The Lagrangian particle dispersion model FLEXPART version 10.4. Geosci. Model Dev. 2019, 12, 4955–4997.
Bakels, L.; Tatsii, D.; Tipka, A.; Thompson, R.; Dütsch, M.; Blaschek, M.; Seibert, P.; Baier, K.; Bucci, S.; Cassiani, M.; et al. FLEXPART version 11: Improved accuracy, efficiency, and flexibility. Geosci. Model Dev. 2024, 17, 7595–7627.
Cassiani, M.; Stohl, A.; Brioude, J. Lagrangian stochastic modelling of dispersion in the convective boundary layer with skewed turbulence conditions and a vertical density gradient: Formulation and implementation in the FLEXPART model. Bound.-Layer Meteorol. 2015, 154, 367–390.
Van Thielen, S.; Turcanu, C.; Camps, J.; Keppens, R. Optimizing the calculation grid for atmospheric dispersion modelling. J. Environ. Radioact. 2015, 142, 103–112.
Sørensen, J.H.; Bartnicki, J.; Blixt Buhr, A.M.; Feddersen, H.; Hoe, S.C.; Israelson, C.; Klein, H.; Lauritzen, B.; Lindgren, J.; Schönfeldt, F.; et al. Uncertainties in atmospheric dispersion modelling during nuclear accidents. J. Environ. Radioact. 2020, 222, 106356.
International Nuclear Safety Advisory Group. The Chernobyl Accident: Updating of INSAG-1; International Atomic Energy Agency: Vienna, Austria, 1992.
Saunier, O.; Mathieu, A.; Didier, D.; Tombette, M.; Quélo, D.; Winiarek, V.; Bocquet, M. An inverse modeling method to assess the source term of the Fukushima Nuclear Power Plant accident using gamma dose rate observations. Atmos. Chem. Phys. 2013, 13, 11403–11421.
Jammal, R.; Vincze, P.; Heitsch, M.; Dobrzynski, L.; Dolganov, K.; Duspiva, J.; Grant, I.; Guerpinar, A.; Hirano, M.; Khouaja, H. The Fukushima Daiichi Accident; International Atomic Energy Agency: Vienna, Austria, 2015.
Snoun, H.; Bellakhal, G.; Kanfoudi, H.; Zhang, X.; Chahed, J. One-way coupling of WRF with a Gaussian dispersion model: A focused fine-scale air pollution assessment on southern Mediterranean. Environ. Sci. Pollut. Res. 2019, 26, 22892–22906.
He, J.; Lyu, M.; Qiu, Z.; He, X.; Lu, B.; Wang, J.; Shen, S.; Zhang, X. Physics-informed optimization for emergency radiation assessment with temporal correction under meteorological uncertainty. J. Environ. Radioact. 2026, 291, 107817.
Zhang, X.; Raskob, W.; Landman, C.; Trybushnyi, D.; Li, Y. Sequential multi-nuclide emission rate estimation method based on gamma dose rate measurement for nuclear emergency management. J. Hazard. Mater. 2017, 325, 288–300.
Dong, X.; Fang, S.; Zhuang, S.; Xu, Y.; Zhao, Y.; Sheng, L. Objective inversion of the continuous atmospheric ¹³⁷Cs release following the Fukushima accident. J. Hazard. Mater. 2023, 447, 130786.
Dong, X.; Zhuang, S.; Xu, Y.; Hu, H.; Li, X.; Fang, S. Multi-scenario validation of the robust inversion method with biased plume range and values. J. Environ. Radioact. 2024, 272, 107363.
Xu, Y.; Fang, S.; Dong, X.; Zhuang, S. A spatiotemporally separated framework for reconstructing the sources of atmospheric radionuclide releases. Geosci. Model Dev. 2024, 17, 4961–4982.
Xu, Y.; Dong, X.; Luo, H.; Fang, S. Robust source reconstruction of atmospheric radionuclides from observations of different sparsity with spatial preselection and non-smooth constraints. J. Hazard. Mater. 2025, 486, 136919.
Zhang, X.L.; Su, G.F.; Yuan, H.Y.; Chen, J.G.; Huang, Q.Y. Modified ensemble Kalman filter for nuclear accident atmospheric dispersion: Prediction improved and source estimated. J. Hazard. Mater. 2014, 280, 143–155.
Zhang, X.L.; Li, Q.B.; Su, G.F.; Yuan, M.Q. Ensemble-based simultaneous emission estimates and improved forecast of radioactive pollution from nuclear power plant accidents: Application to ETEX tracer experiment. J. Environ. Radioact. 2015, 142, 78–86.
Zhang, X.L.; Su, G.F.; Chen, J.G.; Raskob, W.; Yuan, H.Y.; Huang, Q.Y. Iterative ensemble Kalman filter for atmospheric dispersion in nuclear accidents: An application to Kincaid tracer experiment. J. Hazard. Mater. 2015, 297, 329–339.
Jianyao, Y.; Yuan, H.; Su, G.; Wang, J.; Weng, W.; Zhang, X. Machine learning-enhanced high-resolution exposure assessment of ultrafine particles. Nat. Commun. 2025, 16, 1209.
Huang, S.X.; Zhang, J.P.; Yang, W.D.; Wang, Z.F.; Hu, F.; Liu, F.; Sheng, L.; Zeng, Q.C. Predicting and Controlling Nuclear Accident Hazards: Issues and Challenges. Aerosol Air Qual. Res. 2016, 16, 417–429.
Sun, S.; Li, H.; Fang, S. A forward-backward coupled source term estimation for nuclear power plant accident: A case study of loss of coolant accident scenario. Ann. Nucl. Energy 2017, 104, 64–74.
Saunier, O.; Korsakissok, I.; Didier, D.; Doursout, T.; Mathieu, A. Real-time use of inverse modeling techniques to assess the atmospheric accidental release of a nuclear power plant. Radioprotection 2020, 55, 107–115.
Fang, S.; Dong, X.; Zhuang, S.; Tian, Z.; Chai, T.; Xu, Y.; Zhao, Y.; Sheng, L.; Ye, X.; Xiong, W. Oscillation-free source term inversion of atmospheric radionuclide releases with joint model bias corrections and non-smooth competing priors. J. Hazard. Mater. 2022, 440, 129806.
Andronopoulos, S.; Kovalets, I.V. Method of Source Identification Following an Accidental Release at an Unknown Location Using a Lagrangian Atmospheric Dispersion Model. Atmosphere 2021, 12, 1305.
Cui, W.; Cao, B.; Fan, Q.; Fan, J.; Chen, Y. Source term inversion of nuclear accident based on deep feedforward neural network. Ann. Nucl. Energy 2022, 175, 109257.
Hoffmann, L.; Haghighi Mood, K.; Herten, A.; Hrywniak, M.; Kraus, J.; Clemens, J.; Liu, M. Accelerating Lagrangian transport simulations on graphics processing units: Performance optimizations of Massive-Parallel Trajectory Calculations (MPTRAC) v2.6. Geosci. Model Dev. 2024, 17, 4077–4094.
Ling, Y.; Liu, C.; Shan, Q.; Hei, D.; Zhang, X.; Shi, C.; Jia, W.; Yue, Q.; Wang, J. Source term inversion of short-lived nuclides in complex nuclear accidents based on machine learning using off-site gamma dose rate. J. Hazard. Mater. 2024, 465, 133388.
Zhao, Y.; Liu, Y.; Wang, L.; Cheng, J.; Wang, S.; Li, Q. Source Reconstruction of Atmospheric Releases by Bayesian Inference and the Backward Atmospheric Dispersion Model: An Application to ETEX-I Data. Sci. Technol. Nucl. Install. 2021, 2021, 5558825.
Li, Q.-Y.; Zhang, J.; Lian, B.; Liu, L.; Qiu, R.; Li, J. A Bayesian Source Term inversion Method Based on Spatiotemporal Trajectory Prior and Joint Adaptive MCMC Sampling. ChinaXiv 2025.
Xu, Y.; Dong, X.; Fang, S. Efficient Bayesian source reconstruction and uncertainty quantification of atmospheric radionuclide releases by replacing release rate sampling with Maximum-A-Posteriori estimation of time-varying release rates. J. Hazard. Mater. 2025, 492, 138171.
Harvey, P.; Hameed, S.; Vanderbauwhede, W. Accelerating Lagrangian particle dispersion in the atmosphere with OpenCL across multiple platforms. In IWOCL ’14: Proceedings of the International Workshop on OpenCL 2013 & 2014; Association for Computing Machinery: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
Santos, M.C.; Pinheiro, A.; Schirru, R.; Pereira, C.M.N.A. GPU-based implementation of a real-time model for atmospheric dispersion of radionuclides. Prog. Nucl. Energy 2019, 110, 245–259. [Google Scholar] [CrossRef]
Yu, F.; Strazdins, P.; Henrichs, J.; Pugh, T. Shared Memory and GPU Parallelization of an Operational Atmospheric Transport and Dispersion Application. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 20–24 May 2019. [Google Scholar] [CrossRef]
Kong, B.; Dai, T.; Xu, L.; Li, B.; Dai, N.; Xiao, B. CPU-GPU concurrent computing algorithm of particle transport using discontinuous finite element discrete ordinates with unstructured grids. ChinaXiv 2025. [Google Scholar] [CrossRef]
Stohl, A.; Hittenberger, M.; Wotawa, G. Validation of the lagrangian particle dispersion model FLEXPART against large-scale tracer experiment data. Atmos. Environ. 1998, 32, 4245–4264. [Google Scholar] [CrossRef]
Muhammad, H.; Xuan, W.; Wang, M.; Su, G. Review of spatial scale dispersion models (ATDMs) to simulate environmental dispersion and deposition of radionuclides and the overview of GIS coupling with dispersion models. Int. J. Adv. Nucl. React. Des. Technol. 2024, 6, 256–280. [Google Scholar] [CrossRef]
Zeng, J.; Matsunaga, T.; Mukai, H. Using nvidia gpu for modelling the lagrangian particle dispersion in the atmosphere. In Proceedings of the 5th International Congress on Environmental Modelling and Software, Ottawa, ON, Canada, 5–8 July 2010. [Google Scholar]
Van dop, H.; Addis, R.; Fraser, G.; Girardi, F.; Graziani, G.; Inoue, Y.; Kelly, N.; Klug, W.; Kulmala, A.; Nodop, K.; et al. ETEX: A European tracer experiment; observations, dispersion modelling and emergency response. Atmos. Environ. 1998, 32, 4089–4094. [Google Scholar] [CrossRef]
Emanuel, K.A.; Živković Rothman, M. Development and Evaluation of a Convection Scheme for Use in Climate Models. J. Atmos. Sci. 1999, 56, 1766–1782. [Google Scholar] [CrossRef]
Seinfeld, J.H.; Pandis, S.N. Atmospheric Chemistry and Physics: From Air Pollution to Climate Change; John Wiley & Sons: Hoboken, NJ, USA, 2016. [Google Scholar]
Lin, J.C.; Gerbig, C.; Wofsy, S.C.; Andrews, A.E.; Daube, B.C.; Davis, K.J.; Grainger, C.A. A near-field tool for simulating the upstream influence of atmospheric observations: The Stochastic Time-Inverted Lagrangian Transport (STILT) model. J. Geophys. Res. Atmos. 2003, 108, 4493. [Google Scholar] [CrossRef]
Song, C.K.; Kim, C.H.; Lee, S.H.; Park, S.U. A 3-D Lagrangian particle dispersion model with photochemical reactions. Atmos. Environ. 2003, 37, 4607–4623. [Google Scholar] [CrossRef]
Forster, C.; Stohl, A.; Seibert, P. Parameterization of convective transport in a Lagrangian particle dispersion model and its evaluation. J. Appl. Meteorol. Climatol. 2007, 46, 403–422. [Google Scholar] [CrossRef]
Dietz, R. Perfluorocarbon Tracer Technology; Technical Report; Brookhaven National Lab.: Upton, NY, USA, 1985. [Google Scholar]
Nodop, K.; Connolly, R.; Girardi, F. The field campaigns of the European Tracer Experiment (ETEX): Overview and results. Atmos. Environ. 1998, 32, 4095–4108. [Google Scholar] [CrossRef]
Sakdhnagool, P.; Sabne, A.; Eigenmann, R. RegDem: Increasing GPU performance via shared memory register spilling. arXiv 2019, arXiv:1907.02894. [Google Scholar] [CrossRef]

Figure 1. Schematic overview of the FLEXPART computational framework.

Figure 2. Relative computational time distribution among the major FLEXPART modules for a simulation with 1,000,000 particles.

Figure 3. Computational workflow of the convective mixing process: (a) CPU-based implementation and (b) optimized GPU-adapted implementation. The red boxes indicate procedures executed on the GPU.

Figure 4. Schematic illustration of the coarse-grained parallel execution mechanism: (a) single CUDA stream execution and (b) multi-CUDA stream execution. Roman numerals (I-IV) denote different CUDA stream identifiers.

Figure 5. Location of the ETEX-I release (red star) and spatial distribution of monitoring stations (yellow circles).

Figure 6. Comparative validation of numerical simulations, illustrating the performance of the GPU and CPU models and their discrepancies relative to observations. (a) Spatial distribution of the plume at 48 h simulated by the reference CPU model; (b) spatial distribution of the plume at 48 h simulated by the GPU implementation; (c) concentration difference between the GPU and CPU simulations at 48 h ($|\mathrm{GPU} - \mathrm{CPU}|$); (d) temporal evolution of the domain-averaged concentration simulated by the CPU and GPU models, together with the corresponding time series of their mean absolute difference ($|\mathrm{GPU} - \mathrm{CPU}|$); (e) relative differences between observed concentrations at monitoring stations and the simulated concentrations.

Figure 7. Effects of fine-grained parallelization (a) and fast arithmetic instruction optimization (b) on computational time.

Figure 8. Variation of computational time for individual model components as a function of the number of CUDA streams.

Figure 9. Variation of computational time as a function of the maximum number of registers available per thread.

Figure 10. Variation of GPU program execution time as a function of the number of GPU devices.

Figure 11. Impact of the load-balancing strategy on computational time.

Table 1. The number of iterations used in typical inversion algorithm studies.

Author (Year)	Convergence Iterations
Zhao et al. [41]	∼3000
Li et al. [42]	∼2400
Xu et al. [43]	1000–5000

Table 2. Statistical metrics used for accuracy validation. $P_{i}$ and $M_{i}$ denote the model-predicted and observed concentrations at sample $i$, respectively; $\bar{P}$ and $\bar{M}$ are the corresponding mean values; $N$ is the total number of samples.

Metric	Mathematical Definition
FB	FB $= \frac{2 (\bar{P} - \bar{M})}{\bar{P} + \bar{M}}$
RMSE	RMSE $= \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(P_{i} - M_{i})}^{2}}$
FA2	FA2 $= \frac{1}{N} \sum_{i = 1}^{N} I (0.5 \leq \frac{P_{i}}{M_{i}} \leq 2)$
FA5	FA5 $= \frac{1}{N} \sum_{i = 1}^{N} I (0.2 \leq \frac{P_{i}}{M_{i}} \leq 5)$
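
The definitions in Table 2 can be evaluated directly from paired model and observation samples. The following is a minimal sketch rather than the authors' evaluation code: the function name `validation_metrics`, the treatment of non-positive observations (counted as falling outside the factor range), and the sample values are all assumptions introduced for illustration.

```python
import math


def validation_metrics(predicted, observed):
    """Return FB, RMSE, FA2 and FA5 following the definitions in Table 2."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("predicted and observed must be non-empty and equal in length")
    n = len(predicted)
    p_mean = sum(predicted) / n
    m_mean = sum(observed) / n

    # Fractional bias: 2 (P_mean - M_mean) / (P_mean + M_mean)
    fb = 2.0 * (p_mean - m_mean) / (p_mean + m_mean)

    # Root-mean-square error over all N samples
    rmse = math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, observed)) / n)

    def fraction_within(factor):
        # Indicator I(1/factor <= P_i / M_i <= factor), averaged over N.
        # Assumption: samples with M_i <= 0 are counted as outside the range.
        hits = sum(
            1
            for p, m in zip(predicted, observed)
            if m > 0 and 1.0 / factor <= p / m <= factor
        )
        return hits / n

    return {
        "FB": fb,
        "RMSE": rmse,
        "FA2": fraction_within(2.0),
        "FA5": fraction_within(5.0),
    }


# Hypothetical sample values, not ETEX-I measurements.
predicted = [1.0, 2.0, 3.0, 10.0]
observed = [1.0, 1.0, 1.0, 1.0]
metrics = validation_metrics(predicted, observed)
print(metrics)
```

For these illustrative inputs the ratios $P_{i}/M_{i}$ are 1, 2, 3 and 10, so two of four samples fall within a factor of 2 and three of four within a factor of 5.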

Table 3. Statistical performance metrics of the reference CPU model and the GPU implementation relative to observed measurements.

Metric	CPU Reference Implementation	GPU Implementation
FB	0.43	0.4
RMSE	0.58	0.57
FA2	0.7	0.7
FA5	0.74	0.74

Table 4. Percentage contribution of individual computational components to the total runtime of the GPU program prior to decoupling the meteorological field preprocessing.

Computational Module	Execution Time (s)	Time Fraction (%)
Advection-Diffusion	55.07	9.53
Get fields	454.04	78.56
Convective mixing	12.95	2.24
Wet deposition	4.72	0.82
Statistics & Output	20.42	3.53
Others	30.76	5.32

Table 5. Variation of computational time of the meteorological field decoding and preprocessing program grib_to_bin_mpi as a function of the number of parallel threads.

Number of Processes	Execution Time (s)	Speedup
CPU baseline	454.04	-
1	490.27	0.93
2	252.24	1.80
3	172.80	2.63
6	93.31	4.87
9	65.23	6.99
18	39.20	11.58
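
The speedup column in Table 5 is the ratio of the CPU baseline time to the execution time at each process count. The short sketch below recomputes it from the listed times; the per-process efficiency (speedup divided by the number of processes) is an additional derived quantity that is not reported in the table, and the variable and function names are assumptions.

```python
# Execution times (s) taken from Table 5.
cpu_baseline_s = 454.04
mpi_times_s = {1: 490.27, 2: 252.24, 3: 172.80, 6: 93.31, 9: 65.23, 18: 39.20}


def speedup_and_efficiency(baseline_s, times_s):
    """Map process count -> (speedup vs. baseline, speedup per process)."""
    results = {}
    for n_proc, t in sorted(times_s.items()):
        speedup = baseline_s / t
        # Efficiency here is measured against the CPU baseline,
        # not against the single-process MPI run (assumption).
        results[n_proc] = (speedup, speedup / n_proc)
    return results


scaling = speedup_and_efficiency(cpu_baseline_s, mpi_times_s)
for n_proc, (s, e) in scaling.items():
    print(f"{n_proc:>2} processes: speedup {s:5.2f}, efficiency {e:4.2f}")
```

Recomputed from the listed times, the speedups agree with the tabulated values to within about 0.03, and the per-process efficiency at 18 processes is roughly 0.64.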

Table 6. Speedup ratio reported for recent GPU-accelerated implementations of Lagrangian particle dispersion models.

Author(s)	Original Model	Speedup Ratio
Yu et al. [46]	HYSPLIT	12.9
Zeng et al. [50]	FLEXPART	8–12
Harvey et al. [44]	FLEXPART	15
Ours	FLEXPART	40.45–52.05

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Q.; He, T.; Li, M.; Zhang, J.; Lian, B.; Liu, L.; Qiu, R.; Li, J. Implementation of a GPU-Accelerated Lagrangian Particle Dispersion Model for Atmospheric Transport of Radioactive Nuclides. Atmosphere 2026, 17, 573. https://doi.org/10.3390/atmos17060573

