Visual Mamba-Inspired Directionally Gated State-Space Backtracking for Chemical Gas Source Localization

Jooyoung Park; Daehong Min; Sungjin Cho; Donghee Kang; Hyunwoo Nam

doi:10.3390/app152010900

,

and

¹

CompuMath AI, 396 Seochodae-ro, Seocho-gu, Seoul 06619, Republic of Korea

²

Department of Electronic Engineering, School of Aerospace & Advanced Materials, Sunchon National University, Suncheon 57922, Republic of Korea

³

Agency for Defense Development, Daejeon 34186, Republic of Korea

^*

Authors to whom correspondence should be addressed.

Appl. Sci.2025, 15(20), 10900;https://doi.org/10.3390/app152010900

Version Notes

Order Reprints

Abstract

Rapidly pinpointing the origin of accidental chemical gas releases is essential for effective response. Prior vision pipelines—such as 3D CNNs, CNN–LSTMs, and Transformer-based ViViT models—can improve accuracy but often scale poorly as the temporal window grows or winds meander. We cast recursive backtracking of concentration fields as a finite-horizon, multi-step spatiotemporal sequence modelling problem and introduce Recursive Backtracking with Visual Mamba (RBVM), a Visual Mamba-inspired, directionally gated state-space backbone. Each block applies causal, depthwise sweeps along

H^{\pm}

,

W^{\pm}

, and

T^{\pm}

and then fuses them via a learned upwind gate; a lightweight MLP follows. Pre-norm LayerNorm and small LayerScale on both branches, together with a layer-indexed, depth-weighted DropPath, yield stable stacking at our chosen depth, while a 3D-Conv stem and head keep the model compact. Computation and parameter growth scale linearly with the sequence extent and the number of directions. Across a synthetic diffusion corpus and a held-out NBC_RAMS field set, RBVM consistently improves Exact and hit

\leq 1

over strong 3D CNN, CNN–LSTM, and ViViT baselines, while using fewer parameters. Finally, we show that, without retraining, a physics-motivated two-peak subtraction on the oldest reconstructed frame enables zero-shot dual-source localization. We believe RBVM provides a compact, linear-time, directionally causal backbone for inverse inference on transported fields—useful not only for gas–release source localization in CBRN response but more broadly for spatiotemporal backtracking tasks in environmental monitoring and urban analytics.

Keywords:

Visual Mamba; chemical-gas dispersion; source localization; state-space models; linear-time sequence modeling

1. Introduction

Accidental or malicious releases of toxic industrial chemicals (TICs)—including chlorine, sulfur dioxide, and anhydrous ammonia—have repeatedly demonstrated their potential for severe harm to public health, industry, and ecosystems. Large-scale field trials and incident reports alike show that dense gases hug the surface, weave around obstacles, and traverse long distances within minutes. Under such constraints, the practical value of a localization system is decided in the earliest stage: quickly inferring the emission origin guides evacuation corridors, prioritises mobile or fixed sensor deployment, and informs remediation teams. While the conceptual ideal would be a dense, always-on lattice of chemical detectors, the reality is constrained by cost, power, maintenance, and social acceptance; in practice, responders pair sparse measurements with operational dispersion tools such as ALOHA and HPAC to delineate hazard footprints [1,2]. The resulting observability gap motivates methods that can reason from sparse measurements or image-like surrogates of concentration fields, with early mobile-robotics work demonstrating gas-concentration gridmaps from sparse sensors [3].

A natural direction is to treat the gridded concentration map—obtained from fast surrogates, physics simulators, or interpolated sensor mosaics—as a spatiotemporal signal and to pose source localization as a data-driven inference task. Early pipelines based on 3D convolutional neural networks (3D CNNs) or hybrid CNN–LSTM stacks (e.g., ConvLSTM [4]) learn useful local motion features, but their compute and memory grow unfavourably with temporal horizon, cubically for deep 3D CNNs (to achieve large receptive fields) and at least quadratically for CNN–LSTM variants that carry recurrent state over long windows [5,6]. Parallel optimization and Bayesian inversions remain attractive for small grids [7,8], yet they typically require explicit wind-field estimates and degrade as resolution or domain size increases. More recently, Transformer-based approaches have recast localization as recursive frame prediction: given recent frames, a model iteratively predicts earlier snapshots until the origin emerges [9]. Transformer models have also been widely adapted to vision tasks [10]. Video Vision Transformer (ViViT) further improved single-pass accuracy [11]; however, global self-attention incurs

O (L^{2})

cost in sequence length L, which constrains temporal reach and throughput, and prior studies primarily address single-source incidents. Related video backbones such as SlowFast and the Temporal Shift Module (TSM) have also been widely adopted for efficient temporal modelling [12,13]. Recent surveys and designs on efficient video backbones further motivate axis-wise, linear-time operators for streaming use cases (see, e.g., [14]).

State-space sequence models (SSMs) offer a principled path to linear-time sequence modelling without global attention. Mamba replaces global attention with content-aware kernels that propagate a latent state at

O (L)

cost [15], and visual extensions apply directional scans along image axes to capture spatial dependencies [16,17]. Building on these ideas, we adopt a Visual Mamba-inspired formulation tailored to dispersion grids: rather than implementing explicit associative scans, we realize each axis-wise state update with a one-sided, causal, depthwise operator and learn a convex combination over directions via an upwind gate. This preserves linear scaling, encodes physical causality along the operated axis, and maps cleanly to commodity GPUs through grouped convolutions. In effect, directional depthwise stencils play the role of stable, axis-aligned transports, while the gate adapts to local flow regimes (global/local/hybrid), which we ablate in Section 5.3. Beyond gas release response, we view this combination as a general backbone for inverse inference on transported fields, i.e., spatiotemporal backtracking tasks in environmental monitoring and urban analytics.

Within this design space, we introduce Recursive Backtracking with Visual Mamba (RBVM), a directionally gated visual state-space backbone for source localization. Each residual block applies causal, depthwise sweeps along

H^{\pm}

,

W^{\pm}

, and

T^{\pm}

and then fuses them via a learned upwind gate; a lightweight two-layer GELU MLP follows. Small LayerScale coefficients modulate both the directional aggregate and the MLP branch, and a layer-indexed, depth-weighted DropPath schedule stabilizes training across blocks. A

1 \times 3 \times 3

3D-Conv stem embeds the single-channel lattice into C channels, while a symmetric

1 \times 3 \times 3

head produces logits for all time slices. Causal padding strictly prevents look-ahead along the operated axis, aligning the inductive bias with advective transport. The six-direction variant adds a backward temporal branch (

T^{-}

) and improves robustness under meandering winds, whereas the five-direction variant (omitting

T^{-}

) may be considered for strictly causal regimes. In all cases, computation and memory scale linearly with the spatiotemporal volume

T H W

and with the number of directions

| D |

.

Our inference protocol follows the recursive backtracking paradigm. The model ingests a conditioning window of recent frames and recursively predicts earlier snapshots in reverse chronological order, with dataset-specific choices of window length and backtracking depth (cf. [9,11]). The last predicted slice provides the probability map for localization. Training uses a supervised loss that blends per-slice reconstruction MSE with a BCE-with-logits term computed over the backtracked slices (including the oldest), encouraging sharp, well-localized peaks. To further stabilize learning, we maintain a Polyak/EMA [18] teacher of the student parameters, and, after a short warm-up, add consistency terms (probability-space MSE and logit-space BCE) whose coefficients ramp linearly with epoch [19]. Because dispersion in our synthetic generator is additive, a simple two-peak subtraction applied post hoc to the oldest slice enables zero-shot dual-source localization without retraining.

We evaluate on two complementary domains. A controlled synthetic benchmark—flat terrain with stochastic yet strictly causal advection—isolates algorithmic design effects and supplies dual-source sequences for stress testing. In addition, we rigorously evaluate the model using diffusion data spanning diverse experimental conditions and meteorological environments generated via the Nuclear–Biological–Chemical Reporting and Modelling System (NBC_RAMS) [9,11]. A held-out subset of these runs captures urban channelling, shear, and occasional wind reversals. Models are trained only on single-source synthetic sequences and then evaluated on NBC_RAMS as-is, under the same protocol, as in prior work [9,11]. Across both domains, we compare RBVM against strong baselines—3D-CNN, CNN–LSTM, and ViViT—and report geometry-aware localization metrics (RMSE/AED, hit

\leq 1

, Exact) together with model compactness (parameter count).

In summary, this paper makes three contributions in a unified framework. First, it presents a Visual Mamba-inspired, directionally gated visual SSM that aggregates causal depthwise operators via a learned upwind gate, achieving linear-time scaling with a hardware-friendly implementation. Second, it demonstrates robust performance under meandering, urban flows—yielding a favourable accuracy–compactness trade-off relative to 3D-CNN, CNN–LSTM, and ViViT—without resorting to quadratic-cost attention. Third, it shows that a parameter-free, physics-motivated two-peak subtraction enables zero-shot dual-source localization on synthetic stress tests, indicating that RBVM learns a physically meaningful representation rather than a brittle lookup. The remainder of the paper develops the preliminaries on linear-time state-space modelling, details the RBVM architecture and training protocol, specifies datasets and experimental procedures, and reports quantitative and qualitative results before concluding with a brief outlook.

Paper organization. We first lay the groundwork in Section 2, reviewing linear-time state-space modelling, introducing notation, and motivating directional causality for dispersion grids. Next, Section 3 presents RBVM end-to-end: we unpack the block design (causal depthwise operators and DirGate), the stem/head, training objectives and schedules, the inference protocol, and the zero-shot dual-source postprocessing step. We then turn to the experimental setup in Section 4, detailing datasets, splits, preprocessing, the common evaluation protocol, and baseline implementations, ensuring like-for-like comparisons. With the stage set, Section 5 presents the evidence: we begin with the NBC_RAMS field benchmark (Section 5.1), then turn to the synthetic benchmark (Section 5.2), and round out the section with targeted analyses—sensitivity highlights (Section 5.3) and qualitative overlays (Section 5.1 and Section 5.2). Finally, Section 6 distils the main takeaways, limitations, and practical guidance, and Section 7 closes with a brief outlook and opportunities for extension.

Scope and positioning. We address the real-time inverse task on short windows of solver-generated concentration fields under strict latency and input-availability constraints (no winds or solver calls at inference). Directional state-space modelling matches advective transport while retaining linear-time scaling and streaming readiness; physics-based solvers and their adjoint and data assimilation variants solve the forward problem, while we study a compact, directionally causal inverse layer that consumes their rasters at decision time.

Novelty—what is new vs. prior art.

(i): Directionally causal visual SSM. Axis-wise one-sided, depthwise stencils with an upwind gate (global/local/hybrid), preserving per-axis causality and linear scaling—contrasted with bidirectional scans or quadratic attention.
(ii): Flipped-time joint decoding. A reverse-index causal temporal design that predicts all backtracked slices in one pass and is trained with BCE-on-logits across slices plus light EMA consistency, limiting exposure bias without iterative rollout.
(iii): Deployment posture. A learned inverse layer that consumes solver rasters under strict latency without winds or online solver calls, with optional calibration/abstention for operations—complementing (not replacing) physics stacks.

Contributions.

A directionally causal, depthwise state-space backbone with an upwind gate that preserves linear-time scaling in sequence length and respects per-axis causality.
On NBC_RAMS and synthetic diffusion, the model improves localization accuracy while remaining compact (with a compact parameter budget and linear-time scaling).
A zero-shot dual-source postprocess with explicit assumptions and lightweight safeguards (non-negative residual clamping, small-radius NMS, uncertainty flag).

2. Preliminaries

2.1. Linear-Time State-Space Modelling and Directional Causality

A continuous-time state-space model (SSM) drives a latent state

h (t) \in R^{d}

with input

x (t)

via

\dot{h} (t) = A h (t) + B x (t), y (t) = C h (t),

(1)

which, under uniform sampling with period

Δ

, yields the discrete update

h_{t} = \bar{A} h_{t - 1} + \bar{B} x_{t}

with

\bar{A} = exp (Δ A)

and

\bar{B} = A^{- 1} (exp (Δ A) - I) B

(for invertible A; otherwise via the series/Van Loan formulation). Absorbing C into a learned readout shows that the induced dynamics form a (causal) 1-D convolution, giving linear time and memory in sequence length L and motivating modern SSM kernels such as S4/Hyena [20,21].

Recent SSMs (e.g., Mamba) replace fixed coefficients with selective, input-conditioned parameters and compute latent updates with

O (L d)

cost, providing a principled alternative to quadratic self-attention [15]. Visual and multi-dimensional extensions apply directional scans or structured kernels on images/videos [16,17,22,23]. In this paper, we adopt a directional, causal, depthwise formulation that preserves the linear scaling and inductive bias of SSMs while remaining simple and hardware-friendly: instead of explicit associative scans, we realize each axis-wise update with a one-sided, depthwise convolution and learn the (upwind) combination of directions. This aligns the inductive bias with advective transport and keeps operators GPU-friendly via grouped depthwise convolutions, anticipating ablations in Section 5.3.

2.1.1. Notation and Operators

Let

X \in R^{B \times T \times H \times W \times C}

be a normalized (concentration) tensor (channels-last inside blocks), and let

LN (X)

denote pre-normalized features. We define the direction sets

D_{6} = {H^{+}, H^{-}, W^{+}, W^{-}, T^{+}, T^{-}}

and

D_{5} = D_{6} ∖ {T^{-}}

, where

H / W / T

denotes the height/width/time axes

{and}^{+} /^{-}

chooses the causal side (forward vs. backward) along that axis. For

d \in D_{6}

, the operator

O_{d}

is a depthwise convolution (kernel width 3) that scans only along the operated axis with one-sided causal padding on the upwind side (zeros elsewhere); spatial

(H^{\pm}, W^{\pm})

use masked 2D depthwise kernels, while temporal

(T^{\pm})

use 1D depthwise kernels on the flipped timeline (see the paragraph on temporal causality). Unless stated otherwise, our main experiments use the 6-dir set

D_{6}

; the 5-dir set

D_{5}

appears only in sensitivity snapshots. Per-block complexity is

O (| D | C T H W)

, i.e., linear in sequence extent.

2.1.2. Directional Operator

For

d \in {T^{\pm}}

we apply a 1-D depthwise convolution of width 3 along time; for

d \in {H^{\pm}, W^{\pm}}

, we apply a separable depthwise 2-D convolution whose kernel is

3 \times 1

(for

H

) or

1 \times 3

(for

W

), so the operation is effectively 1-D along the selected spatial axis:

Z_{d} = O_{d} (LN (X)) \in R^{B \times T \times H \times W \times C},

(2)

with past/future (time) or top/bottom/left/right (space) causal padding chosen

{by}^{\pm}

. Each

O_{d}

is depthwise and thus costs

O (C T H W)

.

We consider two direction sets:

D_{5} = {H^{+}, H^{-}, W^{+}, W^{-}, T^{+}}

and

D_{6} = D_{5} \cup {T^{-}}

. The sole architectural difference between 5-dir and 6-dir is the inclusion of the reverse-index temporal sweep

T^{-}

. DirGate forms a convex mixture over

D_{5}

or

D_{6}

with identical logits initialization, temperature, LayerScale, and DropPath. The added parameters are limited to the extra gate logit and one LayerScale scalar per channel (order

10^{3}

at our scale,

< 1 %

of total); hidden widths are unchanged.

2.1.3. Learned Upwind Gating (DirGate)

We mix directional responses with a temperature-controlled softmax:

\begin{matrix} π (X) & = softmax (\frac{1}{τ} W ϕ (X)) \in Δ^{| D | - 1}, \end{matrix}

(3)

\begin{matrix} ϕ (X) & = \{\begin{matrix} {mean}_{t, h, w} (LN (X)) \in R^{B \times C} & (global), \\ LN (X) \in R^{B \times T \times H \times W \times C} & (local), \end{matrix} \end{matrix}

(4)

where W maps

ϕ (X)

to

| D |

logits and

τ > 0

is a temperature. In our main experiments, the gate is global, providing one mixture per sample. We zero-initialize W so that training starts from a uniform mixture. Hybrid gating (global backbone modulated by a shallow local adaptor) is considered in Section 5.3; the block equations remain unchanged.

2.1.4. Block Update with LayerScale and DropPath

Denoting

π_{d} (X)

the d-th mixture weight (broadcast to

B \times T \times H \times W \times 1

), the residual update in each block is

X \leftarrow X + DropPath (\sum_{d \in D} γ_{d} π_{d} (X) O_{d} (LN (X))) + DropPath (γ_{mlp} MLP (LN (X))),

(5)

with learnable LayerScale scalars

γ_{d}

(per direction) and

γ_{mlp}

initialized small (∼

10^{- 3}

) for stability. In practice, we use a layer-indexed, depth-weighted DropPath probability across blocks (static during training).

2.1.5. Complexity

Each block performs

| D |

depthwise operations, plus a small gate and MLP. Overall, computation scales linearly with the spatiotemporal volume

T H W

and with

| D |

; memory scales linearly with

C T H W

(or

| D | C T H W

if all directional responses are buffered). Our main configuration uses six directions; the 5-dir variant drops

T^{-}

and appears only as an ablation.

2.2. Synthetic Benchmark

We introduce a controlled synthetic benchmark to support sensitivity analyses and stress tests under known physics. Following the general recipe of [9], we simulate passive gas transport on a discrete

H \times W

lattice under a Fickian advection–diffusion process driven by a piecewise-steady stochastic wind. Each scenario starts from an impulse-like initial condition and evolves forward to yield a short sequence of concentration fields exhibiting advection and dilution. This construction makes the recursive backtracking task well posed while avoiding any commitment to a particular temporal horizon.

The corpus contains single-source sequences used for training and model selection, and dual-source sequences reserved for stress testing. Variants cover representative wind and flow conditions, including meandering winds and so on. We employ a fully vectorized generator for high-throughput sampling. Exact lattice size, boundary conditions, denoising/normalization choices, and dataset-specific window and backtracking lengths are specified in Section 4.2.

2.3. Baseline Architectures

We compare RBVM against three representative families that span the common design space for spatiotemporal modelling on dispersion grids: a 3D convolutional backbone, a CNN–LSTM hybrid, and a Transformer-based video classifier (ViViT). Unless stated otherwise, all baselines ingest the same preprocessed input as RBVM (a recent conditioning window of frames on an

H \times W

lattice with per-frame min–max normalization) and predict the source cell on the oldest slice as a multi-class classification task. Given our emphasis on NBC_RAMS [9,11], baseline capacities are adjusted per domain, while the synthetic setting uses lighter, overfit-resistant variants for sensitivity and stress tests.

3D–CNN (VideoGasNet–style). We implement a compact I3D-style backbone akin to VideoGasNet [24,25]: a stack of blocks, each comprising a 3D convolution, batch normalization, ReLU activation, and spatiotemporal pooling with small kernels; these stages are followed by global average pooling and a classifier. Dropout is applied in the head for regularization. This family provides a strong convolutional baseline but typically requires deeper stacks to cover longer temporal horizons, increasing compute and memory cost as the window grows.

CNN–LSTM (per-frame encoder + temporal RNN). Frames are encoded framewise by a shallow 2D CNN with weights shared across time; the resulting per-frame features are then aggregated over the sequence by a unidirectional LSTM, akin to CNN–LSTM pipelines (cf. [4]). We pass the LSTM output to a lightweight MLP classifier. Training uses a standard cross-entropy classification objective. This hybrid keeps parameter growth linear with sequence length through recurrence, at the cost of reduced parallelism compared with fully convolutional designs.

ViViT (Video Vision Transformer). We re-implement ViViT [26] as a single-stream, tubelet-tokenised video classifier, following prior practice [11]. Given

V \in R^{T \times H \times W \times 1}

, we extract non-overlapping tubelets

x_{i} \in R^{t \times h \times w \times 1}

, linearly project them to tokens

E x_{i} \in R^{d}

, prepend a learned class token

z_{cls}

, and add learned positional embeddings p to obtain

z^{0} = (z_{cls}, E x_{1}, \dots, E x_{N}) + p \in R^{(N + 1) \times d} .

A stack of L Transformer encoder layers with multihead self-attention (MSA), pre-norm LayerNorm, and a two-layer GELU MLP produces

z^{L}

; the classifier reads

LN (z_{cls}^{L})

and outputs 625 logits. Our default instantiation (mirroring the internal supplement) is as follows:

L = 10

, heads

k = 8

, embedding width

d = 128

, feed-forward width

d_{f f} = 512

, and tubelet sizes

(t, h, w) = (2, 12, 13)

. ViViT offers strong single-pass accuracy on spatiotemporal grids but incurs quadratic attention cost in token count and non-trivial memory overhead as the token grid grows.

Baseline training and complexity at a glance. All baselines share the same data splits and preprocessing pipeline as RBVM. Supervision is cross-entropy on the 625-way source label from the oldest slice; early stopping and model selection are based on validation loss. For CNN–LSTM and 3D-CNN, we adopt Adam (LR

10^{- 3}

) with

L_{2}

weight decay

10^{- 4}

and dropout as stated above; for ViViT, we follow the configuration listed above. Hyperparameters were tuned once on the synthetic validation split and then frozen for the NBC_RAMS transfer. For comparability, we report both absolute metrics (Exact/hit

\leq 1

, RMSE, macro precision/recall/

F_{1}

) and efficiency (latency, peak GPU), and we keep parameter counts within a reasonable budget range across baselines when possible; when this is not feasible (e.g., ViViT’s positional table vs. CNN recurrent state), we additionally comment on accuracy–efficiency trade-offs rather than raw accuracy alone.

Complexity viewpoint. 3D-CNNs scale roughly with the spatiotemporal receptive field (depth increases to “see” longer horizons); CNN–LSTM scales linearly in frames but loses parallelism due to recurrence; ViViT keeps parameter count nearly constant with T (positional table aside) but attention cost scales quadratically in tokens. In contrast, RBVM uses causal, depthwise directional operators with a learned upwind mixture and scales linearly in

T H W

(product) and in the number of directions

| D |

.

Section 5 reports a head-to-head comparison on the synthetic benchmark and the held-out NBC_RAMS corpus under the common protocol above.

2.4. Related Works

We group prior art into three strands, and clarify how our approach differs.

2.4.1. Physics-Based Source Inversion and Hybrid Approaches

Operational CBRN pipelines run forward dispersion solvers coupled with meteorology and turn them into inverse estimators via adjoint optimization or data assimilation. Classical references cover inverse theory and source–receptor relationships [27,28], while operational stacks such as RAMS/HYSPLIT illustrate realistic transport under terrain and synoptic variability [29,30]. Ensemble Kalman filtering/smoothing remains a common choice [31,32]. Physics-informed learning amortises computation by injecting PDE constraints or learning operators [33,34]. Our design is complementary: we consume solver-generated rasters and learn a compact inverse layer that localizes sources online without solver calls or wind inputs at inference.

2.4.2. Vision Backbones for Spatiotemporal Fields

Generic video architectures have been applied to scientific rasters: 3D CNNs learn local motion features [35,36]; CNN–RNN hybrids (e.g., ConvLSTM) add recurrent memory [4]; ViT-based backbones (ViViT/ TimeSformer/MViT) capture long-range interactions [26,37,38]. These are accurate but can be heavy or quadratic in attention length, complicating streaming. In contrast, we use directionally causal, depthwise operators with an upwind gate, retaining linear-time behaviour and aligning with advective transport.

2.4.3. State-Space Sequence Models and Visual Extensions

Structured state-space models achieve linear-time scanning and stable long-range memory on 1D sequences [39], with selective variants improving the capacity–efficiency trade-off [15]. Visual extensions adapt SSMs to images/videos via axis-wise scans [17]. We enforce axis-wise one-sided stencils to respect per-axis causality and mix directions via an upwind gate, yielding an inductive bias consistent with advection–diffusion while remaining streaming-ready.

2.4.4. Positioning vs. PDE-Constrained and Hybrid Routes

PDE-constrained inversions (adjoint/DA) offer strong physical priors but assume meteorology and iterative solves; physics-informed learning can reduce cost yet remains coupled to PDE constraints. Our approach operates downstream of physics: given solver-consistent rasters, we perform the inverse mapping online with axis-wise one-sided sweeps and an upwind gate—no online solver dependencies and linear complexity—with a path to confidence calibration/abstention for reliability; see also work on probabilistic outputs in other domains (e.g., paragraph-based sentiment) [40], and complementary watermarking for provenance in deployment [41].

3. Methodology

3.1. RBVM: Directionally Gated Visual SSM Backbone

Long-horizon backtracking requires a model that (i) preserves temporal causality, (ii) propagates information efficiently across space and time, and (iii) scales linearly with the sequence extent so that compute and memory remain bounded as the look-back window grows. In dispersion grids, the spatiotemporal signal is moreover anisotropic: advection makes information travel primarily along an upwind/downwind axis that can change over time. These constraints motivate a state-space formulation with directional propagation.

Why Mamba? Transformers provide global context but incur

O (L^{2})

time and memory in sequence length L. Three-dimensional CNNs scale linearly but need deep stacks to connect distant voxels; CNN–LSTM hybrids lose parallelism and grow memory with horizon. State-space sequence models such as Mamba replace global attention with content-aware latent dynamics that advance in

O (L)

, enable streaming, and keep memory flat with horizon.

Why Visual Mamba? Backtracking must move information along spatial axes as well as time. Visual Mamba extends SSMs with directional scans on images (and videos), yielding short communication paths across H and W while preserving linear complexity. In dispersion grids, this is a good inductive bias: gaseous filaments meander and split, but transport is largely directional instant-to-instant; directional scans approximate these flows with lightweight operators.

Why advance beyond prior Visual Mamba? Standard visual SSM implementations often use bidirectional (non-causal) passes or associative scans [17,23] that, while powerful, are heavier than needed here and can blur temporal causality. We tailor the formulation to dispersion by (i) enforcing axis-wise causality with one-sided, depthwise operators (e.g., [42,43]); (ii) introducing an upwind gate (DirGate) that learns a convex mixture over directions

{H^{\pm}, W^{\pm}, T^{\pm}}

, routing information along the current flow; and (iii) using a 3D–Conv stem/head for compact I/O, with small LayerScale coefficients and a layer-indexed, depth-weighted DropPath for stable stacking [44,45]. This Mamba-inspired design keeps the linear-time advantage, encodes physical causality, and is hardware-friendly (grouped depthwise ops dominate cost).

Concretely, we parameterise RBVM as follows: Let the input be

X \in R^{B \times 1 \times T \times H \times W}

(one channel); the stem maps to C channels and we then operate on

\tilde{X} = LN (X_{feat}) \in R^{B \times T \times H \times W \times C}

(pre-norm, channels-last inside the block). For notational consistency in this section, we denote the channels-last features by

U \equiv \tilde{X} \in R^{B \times T \times H \times W \times C}

.

Model overview. The model structure comprises a 3D-Conv stem that embeds inputs

(B, 1, T, H, W)

into C channels, a stack of Mamba3D blocks operating in channels-last form

(B, T, H, W, C)

, and a 3D-Conv head that maps back to one channel of logits for all T slices; inside each block, six directional NDLayer3D operators

(H^{\pm}, W^{\pm}, T^{\pm})

are applied with causal padding and mixed by an upwind DirGate. LayerScale and optional DropPath are used on both the directional and MLP branches with residual skips, while NDLayer3D permutes/reshapes to apply 2D depthwise convolutions along space and 1D depthwise convolutions along time before restoring the

(B, T, H, W, C)

layout (Figure 1, Figure 2 and Figure 3).

Figure 1. NDLayer3D schematic. Axis selector (choose T, H, or W), per-direction causal padding masks, and shape transitions are annotated. For spatial sweeps (

H^{\pm}, W^{\pm}

):

(B, T, H, W, C) \leftrightarrow (B T, C, H, W)

; for temporal sweeps (

T^{\pm}

, flipped index):

(B, T, H, W, C) \leftrightarrow (B H W, C, T)

. Each path applies a depthwise conv (3-wide) on the operated axis and reshapes back to

(B, T, H, W, C)

.

Figure 2. Top-level RBVM architecture. Stem (Conv3D

1 \to C

) →permute to channels-last

(B, C, T, H, W) \to (B, T, H, W, C)

→ stacked Mamba3DBlocks→permute back

(B, T, H, W, C) \to (B, C, T, H, W)

→Head (Conv3D

C \to 1

). Explicit permute points are shown; the output is logits on

(B, 1, T, H, W)

.

Figure 3. Mamba3DBlock (modular view). Pre-norm LayerNorm feeds six NDLayer3D paths

(H^{\pm}, W^{\pm}, T^{\pm})

; DirGate (global/local) produces convex weights g, whose gated sum is scaled by LayerScale

γ_{dir}

with optional DropPath and added residually. A second pre-norm LayerNorm feeds an MLP, followed by LayerScale

γ_{mlp}

and optional DropPath, then a residual add. Residual routes, LayerScale/DropPath placement, and gating output g are highlighted; the block operates in

(B, T, H, W, C)

. Compositionality: stacked axis-wise sweeps synthesise oblique information flow while preserving per-axis causality. Temporal sweeps are applied on a flipped index so that both

T^{+}

and

T^{-}

remain one-sided with causal padding; dashed regions indicate unobserved frames that are never accessed.

Operational inputs. The model consumes gridded concentration rasters exported by an upstream dispersion stack (e.g., NBC_RAMS/HPAC-family). No wind vectors are used at inference. The same interface applies to other operational stacks that export concentration fields on a regular grid.

3.1.1. Temporal Causality and Time Flip

We index the conditioning window so that the oldest slice used for evaluation aligns with the causal boundary (via a simple time flip). Temporal branches

T^{+}

and

T^{-}

then perform one-sided sweeps on this flipped timeline, with padding applied only on the causal (upwind) side. All temporal operations remain within the observed window; no operator can access frames earlier than the oldest target, and the head unflips the index before producing logits. The time flip is a non-learned re-indexing that preserves gradients to all supervised frames; it does not grant access to frames earlier than the oldest target and does not remove supervision from any FUT index. Normalization statistics are taken from the conditioning window only; we do not peek into frames earlier than the oldest target, so no information leakage is introduced by normalization.

3.1.2. Temporal Windowing

Let the conditioning window cover

{t_{0} - T + 1, \dots, t_{0}}

with

t_{0} \geq 5

min. We apply a time flip so the oldest frame is the decoding target and supervise all backtracked frames within this window. By construction, neither training nor metrics require frames earlier than

t_{0}

, and excluding the first five minutes avoids near-field transients beyond the scope of this study.

3.1.3. Optional Raster Adaptor for Irregular Sensing

When only sparse/irregular measurements are available, a lightweight adaptor constructs a working raster

X \in R^{T \times H \times W}

together with a co-registered uncertainty/support map

U \in {[0, 1]}^{T \times H \times W}

. We can use conservative kernel interpolation (e.g., inverse-distance or Gaussian kernels with mass normalization) or standard kriging to populate X, and define U from local support (e.g., distance-to-sample or posterior variance). The model then consumes

[X; U]

(uncertainty as an auxiliary channel) with no change to the backbone. At inference, U is folded into the decision rule by tightening abstention thresholds when support is low (higher U). This adaptor runs before the directional sweeps, thus keeping linear-time behaviour and per-axis causality intact. This optional raster adaptation is documented for deployment completeness.

3.1.4. Directional Operators (Causal, Depthwise)

Each operator

O_{d}

performs a causal sweep along one axis/direction, propagating information only from the past (or from the upwind side) without peeking ahead. In gas dispersion, this matches physics: under advection, the most informative context at

(t, h, w)

lies immediately upwind or in the preceding time step. We therefore use one-sided stencils (kernel width 3) with causal padding along the operated axis. Kernels are depthwise (groups

= C

) so each channel evolves with its own small filter; cross-channel mixing is deferred to the gated fusion and the MLP head. This yields short communication paths and excellent hardware efficiency while preserving the state-space flavour of axis-wise propagation. While each operator acts along a single axis (

H^{\pm}

or

W^{\pm}

), stacking blocks composes these short paths into oblique or curved trajectories; the upwind gate shifts the dominant orientation across blocks and time, and the post-aggregation MLP mixes channels to couple axes. This compositional routing suffices to capture meanders and off-axis transport in our data; if tightly wound eddies dominate, diagonal variants are a drop-in extension that leave the rest of the stack unchanged.

3.1.5. Orientation Extensions

For scenes dominated by rotational or swirling transport, we can extend the direction set to

D_{8} = {H^{\pm}, W^{\pm}, D_{45}^{\pm}, D_{135}^{\pm}}

, where

D_{45}^{\pm}, D_{135}^{\pm}

are one-sided, depthwise diagonal stencils implemented as oriented masked kernels with causal support on the upwind half-plane. As with

H^{\pm} / W^{\pm}

, padding is causal with respect to the operated direction and evaluation index, so per-direction causality is preserved. This change is modular and increases complexity linearly with

| D |

(still

O (| D | C T H W)

).

Define the direction set

D_{6} = {H^{+}, H^{-}, W^{+}, W^{-}, T^{+}, T^{-}}

(our default), with

D_{5} = D_{6} ∖ {T^{-}}

used only in sensitivity snapshots. For

d \in D

, we apply a one-sided, depthwise kernel of width 3 with causal padding along the operated axis:

\begin{matrix} Z_{d} & = O_{d} (U), \\ Z_{d} & \in R^{B \times T \times H \times W \times C}, \\ O_{T^{\pm}} & = {DW - Conv 1 D}_{t}, O_{H^{\pm}} = {DW - Conv 2 D}_{h}, O_{W^{\pm}} = {DW - Conv 2 D}_{w} . \end{matrix}

(6)

3.1.6. Implementation Note

T^{+}

processes the sequence in index order (consistent with streaming), while

T^{-}

adds a backward temporal branch that helps under meandering winds;

H^{\pm}

and

W^{\pm}

use

(1 \times 3)

stencils with padding applied only on the upwind side. We permute/reshape tensors so these become depthwise 1D/2D grouped convolutions on contiguous memory. Across blocks, alternation of

H^{\pm} / W^{\pm}

sweeps with gate-driven orientation changes yields oblique information flow without breaking per-axis causality. On the flipped timeline,

T^{+}

and

T^{-}

are both one-sided with causal padding, so adding

T^{-}

improves robustness to meanders without introducing bidirectional look-ahead.

3.1.7. Upwind Gate (DirGate)

The five (or six) directional responses are not equally informative at every site. When the plume drifts east, for example,

W^{+}

(eastward) should carry more weight; when winds reverse or channel, the optimal mixture shifts accordingly. DirGate learns this upwind routing as a temperature-controlled softmax over directions. At each block, we normalize features, project to

| D |

logits, scale by

τ

, and apply softmax to obtain a convex mixture. Smaller

τ

sharpens the distribution; a larger

τ

encourages smoother mixing. We zero-initialize the projection so training starts from a uniform mixture.

3.1.8. Formal Definition and Modes

Directional responses

{Z_{d}}_{d \in D_{6}}

are combined by a convex mixture where a lightweight summary

ϕ (U)

feeds zero-initialized logits W with temperature

τ

. We support global, local, and a hybrid (global prior + local refinement); operators are unchanged across modes and preserve linear complexity:

g = softmax (W ϕ (U) / τ), \hat{Z} = \sum_{d \in D_{6}} g_{d} Z_{d} .

Here

ϕ (\cdot)

summarizes U (global pooling for global; identity or

1 \times 1

projection for local); W are gate logits (zero-initialized; uniform start), and

τ

controls mixture entropy.

3.1.9. Gating with Additional Orientations

DirGate naturally generalizes to

D_{8} = {H^{\pm},

W^{\pm}, D_{45}^{\pm}, D_{135}^{\pm}}

by producing convex weights over the enlarged set. Alternatively, a compact dynamic-orientation variant replaces fixed diagonals with K learned-orientation atoms (small K, e.g.,

K \leq 4

) and gates over these atoms (global or local mode). Zero-initialized logits and a moderate temperature

τ

avoid premature collapse; the added cost is linear in K, and streaming behaviour is unchanged. At higher resolutions, a local DirGate (per-voxel mixtures) can replace or refine the global mixture while keeping linear complexity; an optional patchify/strided stem lets global gating operate on a reduced token grid, with the patch stride chosen to balance latency and spatial fidelity. Note that DirGate supports three modes. Global gating predicts a single convex mixture over directions for the whole sample (our default at

25 \times 25

). Local gating predicts a mixture per voxel (and time index), enabling spatially varying routing when heterogeneity increases. A hybrid option combines a global prior with local refinement. Logits are zero-initialized (uniform start) and the temperature

τ

controls entropy; for local mode, we optionally apply a small spatial smoothing on pre-softmax logits. Switching modes leaves the directional operators unchanged and preserves linear complexity in

T H W

(and in

| D |

).

3.1.10. Degeneracy Mitigation

We zero-initialize gate logits (uniform start) and use a moderate temperature (

τ = 1.0

) to avoid overly sharp mixtures. Small LayerScale on both branches (

γ_{dir}, γ_{mlp} \approx 10^{- 3}

at init) prevents any single direction from dominating early; global gating (default at modest grids) further stabilizes routing by averaging over space–time. Local/hybrid variants remain available when heterogeneity increases.

3.1.11. Block, Stem, Head, and Complexity

Each residual block first aggregates directional responses using DirGate and then refines them with a lightweight two-layer GELU MLP. We use pre-norm LayerNorm on both branches so gradients flow through identity skips; small LayerScale coefficients

(γ_{d}, γ_{mlp} \sim 10^{- 3})

stabilize early training; and a layer–indexed, depth-weighted DropPath regularizes across blocks without adding inference cost. A

1 \times 3 \times 3

3D-Conv stem maps the single-channel lattice to C channels, and a symmetric head maps back to one channel of logits at all time steps. Conceptually, stacking such blocks alternates axis-wise message passing (short, causal stencils) with channel mixing (MLP), so long-range dependencies emerge through depth rather than wide filters.

3.1.12. Block, Stem, and Head (Modular View)

Stem. Conv3D maps

1 \to C

; an optional patchify/strided reduction maps

H \times W \to (H / p) \times (W / p)

before the sweeps.

Block. Six axis-wise depthwise paths

(H^{\pm}, W^{\pm}, T^{\pm})

operate on

LN (U)

to produce

{Z_{d}}

; DirGate yields g, and the gated sum is added with LayerScale and optional DropPath:

U \leftarrow U + DropPath (\sum_{d \in D_{6}} γ_{dir, d} g_{d} O_{d} (LN (U))), U \leftarrow U + DropPath (γ_{mlp} MLP (LN (U))) .

Residual form. We initialize

γ_{dir, d}

and

γ_{mlp}

at

10^{- 3}

. Small residual scales ease early optimization in deep residual/Transformer-style stacks while preserving capacity.

Head. Conv3D maps

C \to 1

logits on the cell grid; temporal alignment follows the flip convention in Sec. Temporal causality. Deployment safeguards (calibration/abstention) are wrappers and do not alter the backbone.

3.1.13. Scaling and Input Reduction

Patchify at input. For large rasters, we optionally apply a non-overlapping

p \times p

patchify (or a strided

1 \times 3 \times 3

stem) to coarsen

H \times W \to (H / p) \times (W / p)

before the directional sweeps; the head symmetrically upsamples to cell logits. Because the operators remain depthwise and one-sided, axis-wise causality and the linear-time profile are preserved. Gating (global/local/hybrid) is orthogonal to this reduction and leaves the directional operators unchanged. Gating Under Patchify Global DirGate operates unchanged on patch tokens; local DirGate (per token) is also supported when heterogeneity increases with resolution. Both retain linear complexity and reuse the same directional responses. Gating granularity does not alter the backbone’s cost profile: axis-wise, one-sided depthwise operators remain the dominant term. The gate adds a per-sample (global) or per-voxel (local) convex mixture whose overhead is linear in volume. When resolution increases, a patchify/strided stem maps

H \times W \to (H / p) \times (W / p)

tokens before the directional sweeps; global or hybrid gating then operates on the compact token grid, with optional local refinement if sub-grid heterogeneity persists.

3.1.14. Cost Drivers and Permutations

Per-block cost scales linearly in

T H W

(and

| D |

); grouped depthwise convolutions dominate runtime. We keep a channels-last layout

(B, T,

H, W, C)

throughout the body so most permutations are metadata (stride) changes, only stem/head swap layouts. In practice, permutation overhead is negligible relative to depthwise stencils on current GPUs. Adding diagonal or learned-orientation stencils increases the per-block constant proportionally to the number of directions/orientation atoms, but does not change the linear dependence on sequence extent; memory overhead grows linearly with the volume and linearly with the number of directional buffers.

3.1.15. Tiling with Causal Halos

For very large rasters, we process overlapping tiles with a small causal halo so one-sided stencils remain valid at tile boundaries; logits are then stitched without seams.

Compute and memory are dominated by depthwise grouped convolutions and thus scale linearly with the spatiotemporal volume

T H W

and with

| D |

. DirGate (a single linear projection + softmax) and LayerScale add negligible overhead. This keeps the backbone compact while allowing look-back horizons to extend without quadratic penalties.

3.1.16. Pseudocode (Forward and One Training Step)

Listing A1 provides ASCII-safe pseudocode for one forward pass (incl. optional patchify) and one training step reflecting the equations. Optional regularizers/safeguards are flagged and disabled in the main tables unless stated. For quick lookup, see Appendix A.

3.2. Training Objective and Schedules

We adopt a finite-horizon backtracking protocol parameterised by the conditioning-window length COND and the backtracking depth FUT (cf. [9,11]). Given a mini-batch

X_{1 : COND}

, the network returns logits

{\hat{L}}_{1 : FUT} \in R^{B \times FUT \times H \times W}

for the earlier slices. During training we apply a time flip so that the index

k = FUT

corresponds to the oldest slice (closest to

t \approx 0

), aligning targets and losses with the evaluation slice used at test time.

3.2.1. Joint Decoding

We decode all backtracked frames jointly from the same conditioning window and supervise the entire FUT stack in parallel. Concretely, the network produces

{\hat{L}}_{1 : FUT}

(and

{\hat{P}}_{1 : FUT}

) in one forward pass; we do not feed

{\hat{P}}_{t}

as input to predict

{\hat{P}}_{t + 1}

. The temporal operators

T^{\pm}

are one-sided causal sweeps over the latent representation aligned to the observed window (via time flip), not an auto-regressive rollout. This joint design mitigates per-step error accumulation and exposure bias typical of strictly auto-regressive decoders; the “oldest” evaluation slice is a direct head output rather than the result of repeated prediction chaining.

All FUT slices are decoded from the same conditioning window on a flipped index; both temporal branches remain one-sided with respect to the evaluation boundary, so training cannot leak information from frames earlier than the oldest target.

No forgetting via joint supervision. We supervise the entire FUT stack jointly from the same conditioning window, aggregating per-frame terms across all FUT indices (oldest included) in the supervised loss. Because all targets are optimized together, the model does not preferentially fit only the most recent slice; random window sampling across epochs further rehearses every FUT index.

Per-frame min–max yields invariance to unknown affine rescaling (emission strength, sensor gain), which is desirable for localization across heterogeneous sources. The model thus learns shape/transport patterns rather than absolute amplitude; when absolute mass is needed operationally, the optional amplitude channel provides that cue without altering the backbone or loss.

3.2.2. Loss Decomposition

The objective encourages two complementary behaviours: (i) faithful reconstruction of the earlier maps (to learn dynamics, not merely a point label) and (ii) sharp localization on the oldest slice, where the source imprint is most concentrated. Let

\hat{P} = σ (\hat{L})

and

Y \in {[0, 1]}^{B \times FUT \times H \times W}

. We blend per-pixel regression (MSE) with a classification-style term (BCE) applied to the logits over all backtracked frames (including the oldest):

L_{\sup} = \underset{L_{MSE}}{\underset{︸}{\frac{1}{B} \sum {∥ \hat{P} - Y ∥}_{2}^{2}}} + λ \underset{L_{BCE}}{\underset{︸}{\frac{1}{B} \sum BCEWithLogits (\hat{L}, Y)}}, λ = LAM .

(7)

Applying BCE on logits improves gradients when probabilities saturate;

λ

trades off geometric fidelity and peak sharpness (default

λ = 0.1

). Losses use reduction=”sum” internally and are normalized by batch size in logging.

3.2.3. Complementarity of BCE-on-Logits and MSE

Writing

p = σ (ℓ)

, the logistic loss on logits satisfies

\nabla_{ℓ} L_{BCE} = p - y

, while MSE on probabilities gives

\nabla_{ℓ} L_{MSE} = 2 (p - y) p (1 - p)

by the chain rule; the latter attenuates near saturation (

p \approx 0

or 1). We therefore use

L = L_{MSE} + λ L_{BCE}

with

λ = 0.1

: BCE-on-logits supplies non-vanishing sharpening signal under saturation, and MSE preserves geometric fidelity elsewhere. For numerical stability, we compute BCE with logits and evaluate MSE on clipped probabilities; post hoc temperature scaling (optional) can further refine calibration without altering the objective.

3.2.4. Class Imbalance and Weighting

Because positives occupy a small fraction of cells on the oldest slice, the dense

L_{MSE}

term prevents trivial all-zero solutions and enforces the global plume shape, while

L_{BCE}

focuses learning on the sparse high-probability region and maintains gradients under saturation. For simplicity and robustness across datasets, we use a single

λ

for all backtracked frames; this setting was stable in our experiments.

3.2.5. SWA/Polyak Teacher Consistency

Backtracking stacks several predictions; small calibration errors can compound. To stabilize training without affecting inference cost, we maintain a Polyak/EMA teacher (AveragedModel) and add light consistency terms after a short warm-up:

\begin{matrix} {\hat{L}}^{ema} & = f_{avg} (X_{1 : COND}), {\hat{P}}^{ema} = σ ({\hat{L}}^{ema}), \end{matrix}

(8)

\begin{matrix} L_{cons} & = α {∥ \hat{P} - {\hat{P}}^{ema} ∥}_{2}^{2} + β BCEWithLogits (\hat{L}, {\hat{P}}^{ema}), \end{matrix}

(9)

with

(α, β)

ramped linearly by

schedule_coeff

. The total loss is

L = L_{\sup} + L_{cons}

.

3.2.6. EMA Teacher and Schedule

Let

θ_{t}

be the student parameters and

θ_{t}^{ema}

the teacher. We update the teacher as

θ_{t}^{ema} \leftarrow ρ θ_{t - 1}^{ema} + (1 - ρ) θ_{t}

with a high decay

ρ

, so the teacher prediction is a temporally smoothed copy of the student. After a short warm-up, the consistency weights

(α, β)

are ramped linearly from 0 to modest values so the EMA target does not act as a strong prior early in training. The teacher uses the same inputs and indexing as the student (including the time flip that makes the oldest slice the target), receives no external wind or solver signals, and is updated with stop-gradient. Thus, the consistency loss improves calibration and temporal coherence across FUT slices without introducing information leakage or exogenous bias.

3.2.7. Robustness Under Meanders

When winds meander or reverse, student and teacher may transiently disagree; keeping

(α, β)

modest ensures supervised terms dominate in low-confidence cases, while the EMA acts as a gentle stabilizer rather than a hard constraint.

3.2.8. Inference

At test time, we feed the most recent

COND

frames and obtain

{\hat{L}}_{1 : FUT}

. After the same time flip used in training,

{\hat{P}}_{FUT} = σ ({\hat{L}}_{FUT})

is the estimate at

t \approx 0

. The predicted source cell is the spatial argmax:

(\hat{x}, \hat{y}) = \underset{(x, y) \in {1, \dots, H} \times {1, \dots, W}}{arg max} {\hat{P}}_{FUT} (x, y) .

(10)

We report geometry-aware metrics—Exact hit, hit

\leq 1

(Chebyshev

d_{\infty}

), and RMSE (px). Using the same per-frame normalization at train/test time avoids bias from dilution-induced dynamic-range shifts. All FUT slices are produced in a single pass from the same conditioning window, so no predicted slice is reused as input at test time. At test time, we run a single forward pass on the observed window; temporal sweeps are one-sided on the flipped index, so no information beyond the window can be accessed.

3.2.9. Uncertainty and Abstention (Optional)

Alongside the point estimate on the oldest predicted slice

{\hat{P}}_{FUT}

, we provide an optional confidence score built from two quantities: (i) the top-two probability margin

Δ p = p_{(1)} - p_{(2)}

at the argmax cell and (ii) the predictive entropy

H ({\hat{P}}_{FUT}) = - \sum_{h, w} {\hat{P}}_{FUT} (h, w) log ({\hat{P}}_{FUT} (h, w) + ε)

. An abstention policy can be enabled at deployment: emit a hard decision only if

Δ p \geq ϵ

and

H \leq κ

; otherwise, return uncertain. These thresholds are chosen on a held-out validation set and kept fixed. For reliability, logits may be post hoc calibrated by temperature scaling (

ℓ \to ℓ / T

; T fit on validation); when enabled, we recommend reporting the expected calibration error (ECE) on the oldest slice. All of these components are optional knobs for deployment and are disabled in the main tables unless explicitly stated.

3.2.10. Stabilizers

We use a DropPath schedule across blocks, an EMA teacher for light consistency after warm-up, and gradient clipping. These discourage sample-idiosyncratic routing and help the gate avoid collapse, without changing the backbone.

3.2.11. Deployment Note

In practice, we keep the validation-selected thresholds

(ϵ, κ)

fixed at deployment, optionally apply temperature scaling for calibration, and log both the margin and entropy with each prediction. The abstention safeguard remains disabled in the main accuracy tables unless stated.

3.2.12. Optional Stabilizer for Very Low-Wind Plateaus

Episodes with near-stagnant flow can produce broad, low-contrast concentration fields. As an optional stabilizer that leaves the backbone unchanged, we add a light mass-balance term on the oldest slice,

L_{mb} = λ_{mb} \cdot |\sum_{h, w} {\hat{P}}_{FUT} - \sum_{h, w} Y_{FUT}|,

and we recommend mixing low- and moderate-wind windows within a mini-batch to avoid collapse onto trivial flat solutions. When used,

L_{mb}

is applied after a short warm-up with a small weight

λ_{mb}

tuned on validation.

3.2.13. Conceptual Algorithms (For Clarity)

Algorithms 1 and 2 give compact, ASCII-safe pseudocode for the forward pass (causal directional sweeps + DirGate) and a single training step (BCE-with-logits across backtracked slices; after warm-up, a light EMA-consistency term). These conceptual listings include only the interfaces needed for reproduction and do not alter the backbone.

Algorithm 1 Forward pass with directional sweeps and DirGate

Input:

X \in R^{B \times 1 \times T \times H \times W}

Stem:

U = Conv 3 D (X)

; permute to

(B, T, H, W, C)

For each block:

For

d \in {H^{+}, H^{-}, W^{+}, W^{-}, T^{+}, T^{-}}

:

Z_{d} = O_{d} (LN (U))

(depthwise, causal)

g = softmax (W ϕ (U) / τ)

(global or local DirGate)

U \leftarrow U + DropPath (\sum_{d} γ_{d} g_{d} ⊙ Z_{d})

U \leftarrow U + DropPath (γ_{mlp} MLP (LN (U)))

Head: permute to channels-first and apply

Conv 3 D

to produce logits

Algorithm 2 One training step

Input: conditioning window

X_{1 : COND}

, targets

Y_{1 : FUT}

(time-flipped)

Forward: logits

{\hat{L}}_{1 : FUT}

, probs

\hat{P} = σ (\hat{L})

Loss:

L_{\sup} = MSE (\hat{P}, Y) + λ BCEWithLogits (\hat{L}, Y)

After warm-up: teacher EMA gives

{\hat{P}}^{ema}

; add

α MSE (\hat{P}, {\hat{P}}^{ema}) + β BCEWithLogits (\hat{L}, {\hat{P}}^{ema})

Update: mixed-precision backprop, gradient clipping, EMA update

3.3. Dual-Source Stress Test (Zero-Shot)

We frame the dual-source experiment as a stress test under additive transport assumptions: it is parameter-free, states its assumptions/safeguards, and is intended to probe qualitative behaviour rather than to assert fully general multi-source performance in the absence of solver-backed multi-release data.

Scope and assumptions. We cast dual-source localization as a zero-shot stress test under passive advection–diffusion: (i) approximate linear superposition over the short observation window; (ii) after backtracking, the oldest slice exhibits at least moderate peak separation so that the global maximum is attributable to a single emitter; and (iii) subtraction uses an isotropic template by default, or an oriented/elliptical template when a coarse wind prior is available.

Safeguards. We clamp residuals to be non-negative, apply small-radius NMS before the second pick, and flag “uncertain” when the top-two probability margin on the oldest slice is small, in which case we defer a hard two-source decision.

Optional upgrades (kept off by default). A compact two-Gaussian mixture fit or a short unrolled residual decoder trained on synthetic superpositions can improve deblending under stronger overlap while remaining near-linear at our lattice size and leaving the backbone unchanged.

Real incidents may involve more than one emitter. Because the synthetic generator is additive (linear superposition), we keep the backbone unchanged and apply a lightweight, physics-aware postprocess on the oldest slice. We treat

{\hat{P}}_{FUT}

as a superposition of two diffusion kernels and perform a two-peak subtraction: (i) locate the global maximum, subtract a fixed isotropic heat-kernel template scaled by that peak, and clamp to zero; (ii) repeat once on the residual and return the two peak coordinates. This deterministic, parameter-free procedure introduces no learned weights, runs in

O (H W)

time, and provides a practical zero-shot baseline for dual-source localization without retraining.

Multi-source extension (documented). For

K > 2

, we apply a sequential K-peak procedure on the oldest slice: (1) pick the current global maximum; (2) subtract a scaled template (isotropic by default, oriented if a coarse wind prior is available); (3) clamp residuals to non-negative; (4) apply small-radius NMS; (5) repeat until a residual mass criterion is met. We stop when the residual mass falls below a fraction

θ_{res}

of the original or after

K_{max}

picks (e.g.,

K_{max} = 4

). A minimum-separation guard (≥1 cell) and a top-two margin check prevent over-segmentation. This runs in linear time per pick and leaves the backbone unchanged.

Optional refinements (kept off by default). For overlapping lobes, a K-component Gaussian mixture may be fitted from the subtractor initialization (means at picked peaks; amplitudes from subtracted scales; optionally shared covariance) and refined by a few EM steps. Alternatively, a tiny residual decoder (a shallow conv head on the residual) trained on synthetic superpositions improves deblending while preserving the linear-time profile and leaving the directional operators/gating unchanged.

For

K > 1

, we expose a per-pick confidence based on the local top-two margin and the residual mass curve; if either falls below thresholds, the routine halts further picks and flags uncertain. This confidence is combined with predictive entropy into a single actionable score.

4. Experimental Setup

4.1. NBC_RAMS Field Benchmark

To assess performance under realistic urban flow—and to prioritise an operationally relevant setting—we adopt a primary benchmark curated from the Nuclear–Biological–Chemical Reporting and Modelling System (NBC_RAMS) [9,11]. NBC_RAMS couples a 3D mesoscale meteorology model (winds/temperature with terrain and building effects) and an Eulerian dispersion module to produce gridded concentration sequences over a

5 km \times 5 km

area (cell size 200 m). Each run spans 30 min at 1 min cadence, yielding

25 \times 25

rasters; we vary wind sector/speed and payloads to generate diverse scenarios for training/validation/testing. Functionally, NBC_RAMS is analogous to HPAC (v6) [46] and has been used in prior studies for comparable tasks [9,11].

Following prior practice [9,11], we discard the first 5 min of near-field turbulence and use the remaining frames as follows: ten recent frames are retained for conditioning (COND = 10), and six earlier frames are reconstructed back to

t \approx 0

(FUT = 6). This empirically grounded setting keeps the learning/evaluation pipeline identical across domains. We randomize diffusion/wind broadly in synthetic generation and evaluate cross-domain on a distinct NBC_RAMS terrain without fine-tuning.

We omit the first five minutes after release, following standard NBC_RAMS practice, to exclude the highly transient near-field jet regime that is difficult to sense reliably and poorly represented at

25 \times 25

resolution. In our setup, the conditioning window begins at

t_{0} \geq 5

min and spans recent frames; evaluation is always performed on the oldest frame in this window. This reflects common operational timelines (earliest observations arrive with a short delay) and focuses the task on mid-field transport patterns used by responders.

All 5-dir/6-dir comparisons use identical data splits, optimizer, schedule, and training budget; only the direction set differs. A parameter-matched 5-dir control (e.g., slightly wider MLP to match the extra

T^{-}

parameters) is left for future large-scale ablations; current gains arise with <1% parameter change.

4.1.1. Variability and Scale

We sweep payload mass (

{SF}_{6}

, 0.0005 to 0.0025 kg with 0.0005 kg steps), wind speed 1

{ms}^{- 1}

to 3

{ms}^{- 1}

(step 0.5

{ms}^{- 1}

), wind direction (eight sectors), and source positions (all 625 cells), yielding 125,000 single-source scenarios. This breadth exposes models to channelling, shear, and occasional reversals that are absent in the flat synthetic regime.

4.1.2. Preprocessing and Usage

We apply per-frame min–max normalization to

[0, 1]

, clip sub-threshold numerical noise, and use a light Gaussian blur to suppress isolated sensor-like outliers; frames are already

25 \times 25

, so no spatial resize is needed. We adopt the same 70:20:10 split across scenarios. In all single-source experiments, we use the same conditioning/backtracking protocol (cf. [9]). Dual-source experiments are not available in NBC_RAMS and are evaluated only on synthetic data. Unless otherwise stated, we reuse a consistent preprocessing/evaluation pipeline across datasets. When reporting results, we use geometry-aware metrics on the oldest slice: RMSE (px) and AED (m) with AED computed as

RMSE \times 200 m

(1 px = 200 m), plus hit

\leq 1

and Exact; model compactness is summarized by parameter count. Unless otherwise stated, we use global DirGate at

25 \times 25

. For higher-resolution exports, we enable the patchify/strided stem and, if needed, switch to local or hybrid gating following the guidance in Section 6.

Data splits. We split at the scenario level to avoid leakage. A scenario is the tuple ⟨release–location id, wind–sector bin, wind–speed bin, payload id⟩; all windows from a scenario are assigned in full to exactly one of train/val/test. Splits are grouped and stratified so that the wind–sector and wind–speed marginals are comparable across splits while keeping overall proportions fixed.
Leakage controls. The model employs LayerNorm (no BatchNorm), and all augmentations are applied only within the training split. No window from a scenario appears in more than one split, and no re-shuffling occurs during training.
Illustrative samples. Figure 4 illustrates gas dispersion at one-minute intervals from 1 to 15 min post-release. Each tile corresponds to a fixed time index ( $t = 1, \dots, 15$ ); the red “×” marks the true source location. Normalized concentration is overlaid on a satellite basemap, revealing the plume’s growth, transport, and dilution as it advects along urban corridors and experiences shear. As time elapses, near-source concentrations diminish due to atmospheric dilution, while downwind lobes broaden and may split as flow channels and meanders. Figure 5 shows the corresponding preprocessed input fed to the model. From each simulation we extract a consecutive window of $COND$ frames after an initial near-field transient so that early, unstable slices do not dominate the learning signal. Candidate windows are validated and discarded if too sparse (e.g., if more than $COND / 3$ frames contain no detectable concentration), ensuring that each sample carries sufficient dispersion information. To align with RBVM’s training contract, frames are rasterised to a $25 \times 25$ lattice, min–max-normalized per frame to $[0, 1]$ to avoid dilution-induced dynamic-range bias, and lightly denoised by clipping isolated single-pixel spikes (numerical noise). These preprocessed time-series frames are the model inputs; the label is the source grid cell on the oldest slice in the extracted window.

Figure 4. Sample NBC_RAMS sequence ( $t = 1$ –15) shown for illustration; early near-field frames are excluded from training/evaluation as noted in the text. Concentration (normalized to $[0, 1]$ ) is overlaid on the basemap; the red “×” marks the true source. The plume grows, advects, and dilutes over time under the prevailing wind field. The colour bar indicates the value of each data point.

Figure 5. Preprocessed counterpart of Figure 4. Frames are rasterised to a $25 \times 25$ grid, per-frame min–max-normalized to $[0, 1]$ , and lightly denoised as in Section 4.1. The red “×” indicates the true source; the colour bar denotes the normalized concentration value.
Operational input rasters. All experiments consume gridded concentration rasters exported by an upstream dispersion stack (NBC_RAMS), which couples mesoscale meteorology with Eulerian transport and can perform upstream assimilation when measurements are available. Consequently, the regular lattice is defined by the dispersion stack; RBVM operates downstream on these solver-produced rasters and does not ingest raw irregular samples. When higher-resolution exports are provided (e.g., $100 \times 100$ ), our input interface supports them directly.

4.2. Synthetic Benchmark for Sensitivity and Stress Tests

Our synthetic corpus follows the physics recipe of [9] but is reimplemented in a fully vectorized PyTorch 2.3 pipeline for high-throughput sampling. We model horizontal gas motion on a

25 \times 25

lattice under the advection–diffusion equation [47,48]:

\frac{\partial C}{\partial t} = k \nabla^{2} C - v \cdot \nabla C,

(11)

with concentration

C (x, y, t)

, diffusion k, and horizontal wind

v = (v_{x}, v_{y})

. For an instantaneous point release at

r_{0}

(time

t_{0}

) in

R^{2}

, the Green solution [47] is

C (r, t) = \frac{C_{0}}{4 π k (t - t_{0})} exp (- \frac{{∥r - \int_{t_{0}}^{t} v (τ) d τ - r_{0}∥}^{2}}{4 k (t - t_{0})}),

(12)

i.e., a diffusive Gaussian that drifts with the time-integrated wind, consistent with standard atmospheric–dispersion treatments [48]. Each scenario randomizes the source

r_{0}

, diffusion k, initial mass

C_{0}

, and a piecewise-steady wind

v_{n} = (s_{n} cos θ_{n}, s_{n} sin θ_{n})

, where the main direction is drawn from eight equal angular sectors and perturbed by

θ_{n} \sim U (- π / 3, π / 3)

; the speed

s_{n}

is sampled from a chosen range. This choice yields strictly causal transport on flat terrain, allowing us to isolate algorithmic effects without orographic confounds.

4.2.1. Sequences, Conditioning, and Targets

Each synthetic sequence contains 16 frames as in [9]: a one-hot Kronecker delta at

t = 0

(true source) followed by 15 advecting/decaying plumes. We use the latest 10 frames as the conditioning window and recursively backtrack the earlier 6 frames; this mirrors [9], keeps the inference horizon fixed across methods, and makes evaluation comparable to field data. To match RBVM’s training contract, we apply per-frame min–max normalization to

[0, 1]

to prevent dilution-induced dynamic-range shifts from biasing the optimizer, and we clip isolated single-pixel spikes caused by numerical noise. The regression targets are the six earlier frames; the classification target is the source on the oldest slice.

4.2.2. Rationale for Synthetic Use

The synthetic set provides a controllable arena to verify (i) whether axis-wise causality and the upwind mixture suffice in strictly causal flows; and (ii) how far a single-source model generalizes when the generator emits dual-source sequences. We do not train on dual-source; those sequences are used only as a zero-shot stress test (Section 3.3). Unless stated, data are split 70:20:10 (train/val/test), and all baselines and RBVM are trained on the same images for fairness.

4.2.3. Evaluation Slice and Metrics

We always score on the oldest reconstructed slice (closest to

t = 0

). Reported metrics are RMSE (px) and AED (m = RMSE

\times 200

), plus hit

\leq 1

and Exact. Using the identical slice across methods avoids favouring architectures that implicitly bias a different time index.

4.3. Experimental Instantiation of the Proposed Method

We instantiate RBVM for all main experiments. The backbone comprises a

1 \times 3 \times 3

3D–Conv stem (

1 \to C

),

N_{blk}

residual blocks, and a symmetric

1 \times 3 \times 3

head (

C \to 1

) producing logits for all T frames. Each block applies causal, depthwise directional sweeps over

D_{6}

(six directions by default;

D_{5} = D_{6} ∖ {T^{-}}

is used only in sensitivity snapshots), fuses them via a temperature-controlled DirGate (global by default, zero-initialized so training starts uniform), and refines with a two-layer GELU MLP. We use pre-norm LayerNorm on both branches; small LayerScale coefficients

γ_{d}, γ_{mlp} \sim 10^{- 3}

stabilize residuals; and a layer-indexed, depth-weighted DropPath regularizes across blocks (static during training).

4.3.1. Default Hyperparameters

Unless stated otherwise, we set embed

C = 192

, blocks

N_{blk} = 6

, VARIANT=6dir (the 5dir variant appears only in ablations), drop_path_rate = 0.2, gate_mode = global,

τ = 1.0

, and

γ_{init} = 10^{- 3}

. Optimization follows Section 3.2.1, Section 3.2.2, Section 3.2.3, Section 3.2.4, Section 3.2.5, Section 3.2.6, Section 3.2.7 and Section 3.2.8. Defaults are BATCH_SIZE = 16, EPOCHS = 30, WARMUP_EPOCHS=3,

LAM = 0.1

,

α = 1.0

, and

β = 0.1

.

4.3.2. Fairness Controls and Protocol Consistency

All methods—including 3D–CNN, CNN–LSTM, and ViViT—are trained on the same splits with identical preprocessing and are scored on the same oldest slice using RMSE/AED, hit

\leq 1

, and Exact. We also report parameter count for model compactness. This protocol ensures that observed gains derive from modelling choices (directional causality, upwind gating) rather than differences in data handling or evaluation timing. Consistent with Section 5, we report NBC_RAMS first and use the synthetic benchmark primarily for sensitivity analyses and the dual-source stress test under the same evaluation protocol.

Per-frame min–max normalization. We normalize each input frame with a small stabilizer

ε

and apply the same policy at train and test; statistics are computed only from the observed conditioning window.

{\tilde{X}}_{t} (h, w) = \frac{X_{t} (h, w) - {min}_{h, w} X_{t}}{{max}_{h, w} X_{t} - {min}_{h, w} X_{t} + ε}, ε = 10^{- 6} .

Targets

Y_{t}

are normalized independently with the same transform, making the loss effectively scale-invariant. This avoids domination by high-intensity frames, improves stability under dilution, and preserves relative morphology. If absolute scale is required, an optional single-channel per-frame summary

a_{t}

(e.g., framewise max or integral) can be concatenated to the input; our default experiments keep this off to prioritise localization from morphology.

4.3.3. Ablation Protocol and Findings

To isolate LayerScale/DropPath effects under limited compute, we define an ablation split that uses one third of the full training set (same val/test as the main results). We sweep

γ_{init} \in {10^{- 4}, 10^{- 3}, 10^{- 2}}

and drop_path_rate

\in {0.1, 0.2, 0.3}

under identical budgets and report ratios relative to the default

(γ_{init} = 10^{- 3}

,

drop_path_rate = 0.2)

. Around this default, performance remains essentially unchanged (ratios

\approx 1.00

), while a higher DropPath of

0.3

consistently worsens RMSE (ratio

> 1

) and slightly reduces Exact (ratio

< 1

);

γ_{init} \leq 10^{- 3}

is stable on this split. Results are summarized in Appendix B. No training instabilities were observed on this split (no NaN/Inf; gradients remained bounded under clip = 1.0). With a small LayerScale

γ_{init} \leq 10^{- 3}

and a moderate drop_path_rate of

0.2

, we observed a mild regularization gain without any loss in convergence speed.

5. Results

This section presents RBVM results on the high-fidelity NBC_RAMS Field Benchmark, our primary focus, and, secondarily, on a controlled Synthetic Benchmark for sensitivity analyses and a dual-source stress test. Unless noted, we follow Section 4: a COND(=10)-frame conditioning window, FUT(=6) recursive backtracking steps, and evaluation on the oldest reconstructed slice

{\hat{P}}_{FUT}

. We report geometry-aware localization metrics (RMSE/AED, hit

\leq 1

, Exact), model compactness via parameter count, and qualitative overlays. The core tables report Exact/hit ≤1/RMSE as before. Uncertainty and abstention are documented as deployment-facing options; when enabled, we suggest reporting the abstention rate at the chosen margin/entropy thresholds and ECE after temperature scaling, but these diagnostics are not required for the main results.

5.1. NBC_RAMS Field Benchmark

5.1.1. Setup

We evaluate on NBC_RAMS simulations under an as-is protocol, consistent with recent practice [9,11]. By default, we reuse the training settings from our previous study [11]. The NBC_RAMS terrain used here differs from that in [11], but we intentionally keep all model and baseline configurations unchanged to enable a clean cross-terrain comparison. NBC_RAMS introduces urban complexity (channelling, shear, occasional wind reversals), making it the primary test of operational relevance. Classification metrics (Exact and hit

\leq 1

) are computed on the oldest slice, and geometric error is reported as Average Error Distance (AED) in metres, obtained by converting pixel offsets with a ground sampling distance of 200 m per cell. For source localization, we regard geometry-aware metrics—Exact, hit

\leq 1

, and RMSE/AED—as primary indicators of operational performance.

5.1.2. Main Results

Table 1 summarizes geometry-aware metrics and parameter counts. RBVM (6-dir) achieves the best scores across all measures while using the fewest parameters (0.95 M): RMSE

0.20 px

(AED

40 m

), hit

\leq 1

99.86 %

, and Exact

96.70 %

. Compared with the strongest baseline by Exact (ViViT, 2.11 M params), RBVM reduces AED by about

29 %

(

56 m \to 40 m

) and raises Exact by

+ 2.51

pp, with roughly

55 %

fewer parameters. Relative to CNN–LSTM (9.87 M), AED improves by

38 %

(

64 m \to 40 m

) and Exact by

+ 4.96

pp, with about

90 %

fewer parameters. Against 3D–CNN, the gains are larger (AED

126 m \to 40 m

,

68 %

reduction; Exact

+ 22.4

pp) despite an

\sim 86 %

smaller parameter budget. Note that hit

\leq 1

is near the ceiling for several models (all

\approx 99 %

), so Exact and AED are more discriminative at the operating point; RBVM improves both, shifting the accuracy–compactness frontier relative to 3D–CNN, CNN–LSTM, and ViViT. These gains persist on a terrain different from prior work under unchanged configurations (cf. Setup), underscoring robustness across urban layouts. The 6-dir improvement persists despite a negligible parameter delta relative to 5-dir. Since DirGate remains a single convex mixture and all training budgets/hyperparameters are identical, we attribute the gain to the reverse-index causal sweep

T^{-}

, which stabilizes routing under per-frame angle noise and partial wind reversals, rather than to added capacity.

Table 1. NBC_RAMS field benchmark (evaluated on the oldest slice). Metrics: RMSE/AED reported jointly as pixels/metres on the

25 \times 25

grid, with AED computed as

RMSE \times 200 m

(1 px = 200 m); hit $\leq 1$ = accuracy within Chebyshev distance 1; Exact = exact hit. RBVM (6-dir) attains the lowest RMSE/AED and the highest hit/Exact while using the fewest parameters.

5.1.3. Qualitative View

Figure 6 presents a representative case. The top row (

t = 15 \to 6

) is the input conditioning window; the middle row (

t = 5 \to 0

) shows the model’s recursively backtracked frames; the bottom row shows the corresponding ground truth. The right panel overlays the predicted source (blue marker) and the ground truth (red “×”). As the sequence is rolled back, the predicted plumes follow the dominant wind channels, gradually contract, and concentrate near the origin, closely matching the ground-truth frames. This illustrates that RBVM captures the governing advective pattern and reconstructs the past consistently, which in turn yields accurate source localization. These visual trends are consistent with Table 1: RBVM produces sharper, better-localized peaks on the oldest slice, reduces overshoot and near-miss confusions in the neighbourhood of the source, and better preserves source geometry under meandering winds, aligning with the improved Exact and lower AED.

Figure 6. Past-frame reconstruction and source localization on NBC_RAMS. (Top): input conditioning window (

t = 15 \to 6

). (Middle): recursively backtracked frames (

t = 5 \to 0

). (Bottom): ground-truth frames for the same indices. (Right): localization overlay (red “×” = ground truth; blue marker = prediction). Predictions align with the prevailing wind, shrink in spatial extent toward early times, and concentrate at the source, consistent with the ground truth.

5.2. Synthetic Benchmark: Sensitivity and Stress Tests

5.2.1. Setup

The synthetic study isolates algorithmic effects in strictly causal flows on a flat domain and is used primarily for sensitivity analyses. We compare RBVM against 3D–CNN, CNN–LSTM, and ViViT, trained/evaluated on the same images and evaluation slice. For RBVM we report both 5-dir (

H^{\pm}, W^{\pm}, T^{+}

) and 6-dir (

T^{-}

added) variants. Unless noted, we evaluate on the oldest reconstructed slice and report geometry-aware metrics (RMSE/AED, hit

\leq 1

, Exact) together with parameter count.

5.2.2. Main Results

Table 2 reports the geometry-aware view (RMSE/AED, hit

\leq 1

, Exact) and parameter count. On this controlled benchmark, 6-dir decisively outperforms 5-dir—with a much lower RMSE/AED and higher Exact—while keeping essentially the same parameter size, indicating that the gain stems from the backward temporal branch rather than model capacity.

Table 2. Synthetic benchmark (strictly causal flows; oldest slice evaluation). RMSE/AED are reported jointly as pixels/metres on the

25 \times 25

grid, with AED computed as

RMSE \times 200 m

(1 px = 200 m).

5.2.3. Where the Gains Come from

Causal, depthwise directional operators keep communication paths short; DirGate adapts the mixture to the dominant flow; and a lightweight MLP supplies channel mixing without attention’s quadratic costs. In this regime, adding the backward temporal branch (

T^{-}

) markedly improves robustness to per-frame angle noise and cumulative drift, yielding ∼8.5× lower AED (4.40 m vs. 37.2 m) and

+ 14.63

percentage points (pp) in Exact over 5-dir, with a negligible change in parameter count (0.130 M vs. 0.128 M). hit

\leq 1

also improves from

99.76 %

to

100.0 %

.

5.2.4. Scope of the Toy Synthetic Benchmark

For a sanity check and ablations, we used a lean Mamba–3D variant and a lightweight, fully vectorized data pipeline. Each sample provides 16 frames on a

25 \times 25

grid; we condition on the most recent 10 frames and evaluate on the oldest reconstructed slice after a 6-step rollout (COND

= 10

, FUT

= 6

). The dataset is split 70:20:10 (seed 42). Unless noted, training uses AdamW (LR

= 3 \times 10^{- 4}

, weight decay

= 10^{- 4}

), cosine annealing over 30 epochs, AMP, and gradient clipping at

1.0

; the best checkpoints are selected by validation RMSE (pixels). The toy backbone is intentionally minimal: embed width

C = 96

, blocks

N_{blk} = 3

; each block stacks causal, depthwise axis-wise operators (no DirGate/LayerScale/DropPath), followed by a two-layer GELU MLP; the stem/head are 3D convolutions with padding and a sigmoid head. The direction set is controlled by a single flag (VARIANT

\in {5 dir, 6 dir}

), enabling paired comparisons under identical capacity.

The supervised objective combines a reconstruction term over the returned sequence with a small, source-sharpening term on the oldest slice:

L = MSE (Y, \hat{Y}) + 0.1 BCE (Y_{oldest}, {\hat{Y}}_{oldest}),

and the model is evaluated with RMSE (px) and Exact (%) on the oldest slice; AED (m) is reported as RMSE

\times 200

given the grid scale (1 px

\approx 200 m

). For numerical stability in time-reversed slicing and flips, tensors are made contiguous in both the dataset and rollout paths. Synthetic runs follow the controlled setup in Section 4.2.

5.3. Sensitivity Highlights

We provide brief snapshots that relate design choices to observed behaviour (geometry-aware metrics on the oldest slice unless noted):

Direction set (5-dir vs. 6-dir). In strictly causal synthetic flows, 5-dir and 6-dir can be close, consistent with the absence of reversals. On NBC_RAMS, adding the backward temporal branch ( $T^{-}$ ) improves Exact and reduces RMSE, indicating value under meandering winds and urban channelling.
Gating mode (global vs. local). At the $25 \times 25$ scale, a global gate (one mixture per sample) performs on par with, or slightly better than, a local gate; sample-level mixtures suffice at this resolution. Larger domains may benefit from local gating.
DropPath rate. A small, layer-indexed DropPath suffices on the toy synthetic split. On NBC_RAMS, a higher maximum rate acts as a useful regularizer and yields better hit $\leq 1$ / Exact without altering model capacity.

5.4. Dual-Source Stress Test (Zero-Shot)

Finally, we assess zero-shot behaviour on dual-source synthetic sequences. Following Section 3.3, we keep RBVM as-is and apply a parameter-free, physics-aware two-peak subtraction on the oldest slice

{\hat{P}}_{FUT}

(no retraining, no architectural change). Table 3 reports results for both variants: the six-direction model achieves

{RMSE}_{m} = 118.2 m

,

SR @ 100 = 89.7 %

,

SR @ 200 = 95.1 %

, and

SL - F_{1} = 0.969

; the five-direction model attains

{RMSE}_{m} = 184.1 m

,

SR @ 100 = 59.9 %

,

SR @ 200 = 83.9 %

, and

SL - F_{1} = 0.907

. Thus, adding the backward temporal branch (

T^{-}

) reduces distance error by

\approx 36 %

(118.2 m vs. 184.1 m) and improves success rates by

+ 29.8

pp (SR@100) and

+ 11.2

pp (SR@200), while maintaining essentially the same parameter budget (∼0.13

M

).

Table 3. Zero-shot dual-source localization on synthetic sequences (two-peak subtraction on

{\hat{P}}_{FUT}

). SR@

r

denotes the fraction of test cases where both sources are within radius

r m

. SL-

F_{1}

uses a

200 m

success threshold.

Timeline visualisation. For qualitative inspection, Figure 7 presents a 16-frame timeline per sequence: the top row shows the 10 conditioning inputs ( $t = 15 \to 6$ ), the middle row shows the 6 recursively backtracked predictions ( $t = 5 \to 0$ ), and the bottom row shows the corresponding ground-truth frames. The right panel overlays localization on the oldest slice ( $t \approx 0$ ): “×” marks the two ground-truth emitters, and open circles mark the two predicted sources obtained via a parameter-free two-peak subtraction on ${\hat{P}}_{FUT}$ . Predictions follow the dominant wind, contract backward in time, and concentrate at the sources, consistent with the ground truth.

Figure 7. Past-frame reconstruction and zero-shot dual-source localization on the Synthetic Benchmark. Top: input conditioning window ( $t = 15 \to 6$ ). Middle: recursively backtracked frames ( $t = 5 \to 0$ ). Bottom: ground-truth frames for the same indices. Right: localization overlay on the oldest slice ( $t \approx 0$ ); “×” marks the two ground-truth emitters and open circles mark the two predicted sources obtained via a parameter-free two-peak subtraction on ${\hat{P}}_{FUT}$ (no retraining). Predictions follow the dominant wind, contract backward in time, and concentrate at the sources, consistent with the ground truth.

For threshold sensitivity at this map resolution (100 m = half-cell, 200 m = one-cell), see Appendix D, in which we report mean ± std over 10 seeds for

{RMSE}_{m}

, SR@100, SR@200, and SL

F_{1}

.

Our quantitative focus is the dual-source stress test; the documented

K > 2

extension is intended for application-specific benchmarking in future work.

6. Discussion

6.1. Main Findings on NBC_RAMS

Across both benchmarks, RBVM improves localization accuracy while remaining compact relative to 3D-CNN, CNN–LSTM, and ViViT; here we prioritise the field setting. On NBC_RAMS, where winds meander and urban channelling occurs, the six-direction variant (adds

T^{-}

) consistently raises Exact and lowers RMSE compared with five directions (Table 1). Two design elements appear most influential in the field: (i) applying BCE-with-logits to all backtracked slices (including the oldest) sharpens peaks without probability saturation, improving decision boundaries near the source; and (ii) an EMA-based consistency term stabilizes calibration along the six-step rollout, reducing near-miss confusions even when Exact is already high. From a scaling standpoint, depthwise grouped operators dominate cost, while DirGate and LayerScale are negligible. Hence compute and memory scale linearly with the spatiotemporal volume

T H W

and with the number of directions

| D |

, enabling predictable extension of the look-back horizon. Adding

T^{-}

introduces one additional one-sided temporal stencil that preserves causality and adds robustness to reverse-index transport; the overhead is linear and modest (one more sweep), whereas the observed benefit under meanders points to a physics-aligned gain rather than capacity-driven effects.

6.2. Dual-Source Localization: Importance and Implications

Multi-emitter incidents matter operationally—ambiguity in early response often arises from overlapping plumes. Our zero-shot treatment keeps the backbone fixed and applies a parameter-free, physics-aware two-peak subtraction on the oldest slice

{\hat{P}}_{FUT}

. On the synthetic stress test, this recovers both emitters within

200 m

in

95.1 %

of cases, with

{RMSE}_{m} = 118.2 m

(Table 3, Figure 7). The approach works because (a) backtracking concentrates mass near the origin, and (b) the generator is additive, so a first-order “deblending” via subtraction is effective. Beyond gas releases, the same template applies to other transported fields—e.g., source separation in environmental monitoring or urban analytics—whenever superposition is a reasonable approximation and the oldest slice carries the sharpest imprint. As for the robustness across runs, we report SR@100 m and SR@200 m as mean±std over 10 random seeds (independent initialization and shuffle, identical budgets) to characterise variability; success rates remain consistent and method ranking is stable across runs.

The subtract-and-pick baseline is intended for short windows and passive releases, where backtracking typically increases peak separability on the oldest slice. In anisotropic urban winds, an oriented (elliptical) kernel aligned to a coarse wind prior further stabilizes results. We explicitly defer (flag “uncertain”) when emitters are nearly co-located, turbulence induces highly non-Gaussian lobes, or chemistry violates additivity; a small mixture fit or shallow residual decoder is a documented extension that improves deblending while preserving the linear-time profile.

The uncertainty flagging here is consistent with our deployment UQ/abstention policy (top-2 margin and predictive entropy on the oldest slice), ensuring conservative behaviour in low-SNR or plateaued cases.

When the oldest slice is heavily diffused or noisy, we prioritise decision quality over forced guesses. On

{\hat{P}}_{FUT}

, we compute the top-two margin

m = p_{(1)} - p_{(2)}

and predictive entropy H; if

m < ϵ

or

H > η

, the system flags uncertain rather than emitting a hard argmax. This confidence-aware abstention reduces false precision in low-SNR regimes and enables a clear coverage–accuracy trade-off. If desired, a tiny temporal pooling/decoder over the last few backtracked slices can further stabilize low-SNR cases without altering the backbone or its linear-time profile.

6.2.1. Overlapping Emitters and Limits

Near-coincident sources or elongated, intersecting lobes challenge purely subtractive schemes. Our default policy is conservative: stop early and mark uncertain rather than over-commit. When heavier deblending is required, two drop-ins remain lightweight: a subtractor-initialized mixture fit and a tiny residual decoder; both preserve linear-time inference and do not alter the directional operators or upwind gating.

6.2.2. Rotational Eddies and Orientation Extensions

While stacked axis-wise sweeps plus DirGate already form curved information paths, explicit diagonal (or learned) orientations can further align computation with local swirling cells or street-canyon recirculation. Because the added stencils remain depthwise and one-sided, the linear-time and causality guarantees hold. In deployments where rotational signatures are anticipated, enabling diagonals—or a small orientation basis—provides a practical, low-overhead enhancement without altering the backbone.

6.2.3. Scaling with Resolution and Domain Size

For domains beyond

25 \times 25

, two practical routes preserve throughput: (i) patchify the input (e.g.,

p = 4

maps

100 \times 100

to

25 \times 25

tokens) so the backbone runs at constant token count; (ii) process full resolution when needed—cost and memory then grow linearly with

T H W

and

C T H W

, respectively. In both cases, runtime remains dominated by the grouped depthwise stencils; permutations are comparatively cheap due to the channels-last body. The choice of p trades spatial fidelity and latency without altering the causal directional design.

6.3. Possible Failure Modes and Practical Guidance

Typical issues are shared across methods and can be mitigated with lightweight extensions: Boundary sources lose context near edges; boundary-aware padding or scoring on an inset region reduces this bias. Intersecting plumes may yield shallow residuals after subtraction; a small residual decoder or a learned mixture head (two-kernel deblender) can stabilize the second peak. Very low winds produce broad plateaus; mixing low- and moderate-wind episodes during training maintains Exact without extra inference cost. These considerations do not alter the backbone and preserve the linear-time, directionally causal design. Very low-wind episodes often yield broad plateaus for which a single-cell decision provides limited operational value. We therefore document two optional safeguards that preserve our linear-time profile: a light mass-balance regularizer at training time and a confidence-aware abstention rule at inference (top-two margin and entropy). When enabled in deployment, these reduce over-confident decisions on plateaus.

We prefer per-frame min–max over per-sequence or global z-score to prevent domination by high-intensity frames and to stabilize optimization under dilution. If absolute-scale reasoning is mission-critical (e.g., rate estimation), adding the amplitude channel or incorporating a solver-derived prior is a straightforward extension that preserves the model’s linear-time profile.

Near-field minutes (0–5) are dominated by jetting, strong verticals, and building-scale recirculation that can violate a coarse 2D lattice assumption. Our study targets the mid-field inverse task on solver-produced rasters under real-time constraints. As a near-term extension, earlier minutes can be incorporated by pairing higher-resolution or patchified stems with short-horizon synthetic jets, without altering the directional operators or gating and while preserving linear-time inference.

6.3.1. Vertical and Terrain Effects

Our current lattice is 2D at the screen level, which abstracts vertical stratification (e.g., stable/unstable boundary layers) and building-induced recirculation. Two lightweight extensions are natural. (i) Shallow-3D: introduce a small vertical depth Z (e.g., 2–8 levels from a solver export or coarse lidar) and add axis-wise one-sided sweeps

Z^{\pm}

. Because operators remain depthwise and one-sided, the per-block cost stays linear in

T + H + W + Z

; memory also scales linearly in

C \times T \times H \times W \times Z

. A learned vertical pooling/projection (few modes) can further reduce Z while preserving key stratification cues. (ii) Auxiliary priors: supply coarse vertical/terrain proxies as static channels (e.g., mixing height, stability, roughness/orography, building height/street-canyon masks) and use them to bias DirGate (additive logits or temperature modulation) without requiring online solver calls. These priors can be precomputed at low cadence and broadcast over the conditioning window. Both routes preserve our linear-time profile and streaming readiness while improving robustness to OOD flows with vertical structure.

6.3.2. Transfer and Operationalisation

NBC_RAMS supplies high-fidelity, terrain-aware flows but is still a proxy for live incidents. Our pathway is to (i) pretrain on breadth (synthetic + NBC_RAMS), (ii) validate in controlled field exercises (surrogate releases with ground truth), and (iii) optionally enable light calibration and confidence-aware abstention at deployment as operational safeguards. This positions RBVM as a downstream inverse layer over physics-consistent rasters while retaining a clear route to field validation.

6.3.3. Positioning vs. Physics-Based Inversions

RBVM serves as a compact learned inverse layer running downstream of a physics-based stack: physics supplies terrain/ meteorology-aware rasters, and RBVM performs source localization online without solver calls or winds. This division of labour matches operational practice and maintains predictable, linear-time latency. For deployment, light post hoc calibration (temperature scaling with ECE) and an uncertainty-aware abstention policy are documented as optional safeguards; they do not change the backbone and are not used for the core results.

6.3.4. Operational Inputs and Irregular Sensing

Our inverse layer targets solver-produced rasters, matching emergency-response practice. Where a solver grid is not available, an uncertainty-aware raster adaptor provides a working lattice and a support map that integrate seamlessly with RBVM and the abstention policy. A graph-based analogue—causal directional message passing on sensor graphs with upwind-inspired gating—is a natural extension; we leave it as future work to keep the present backbone unchanged and linear-time.

6.3.5. Spatial Heterogeneity and Gating Granularity

When flows are relatively uniform or grids are compact, global gating offers the best accuracy–efficiency trade-off. As spatial heterogeneity increases (e.g., urban canyons) or grids become larger, local gating—or a hybrid global prior plus local refinement—captures spatially varying winds while keeping the directional operators and linear-time scaling intact. In practice, we recommend patchify/stride → global (or hybrid) as a first step, enabling local refinement only where persistent sub-grid variation is observed. A systematic analysis of gate dynamics (e.g., entropy distributions across wind regimes) is an interesting direction for future work.

6.3.6. Generalization

Beyond our current domain randomization and cross-domain test, future work will (i) widen the randomization ranges for diffusion/wind/diurnal variability and (ii) inject coarse solver priors as optional channels to bias gating—both preserving the linear-time backbone while improving OOD robustness.

6.3.7. Operational Thresholds and Reporting (Optional)

A simple default (e.g.,

ε = 0.05

for the margin and a modest entropy cap

κ

) balances coverage and false positives on validation and can be fixed for deployment. We recommend logging the top-two margin, entropy, and calibrated confidence (after temperature scaling) together with the final action (“localize” vs. “uncertain”). Under broad plateaus or diffuse plumes—where point decisions carry low operational value—the abstention rule avoids over-confident mis-localizations. When resources permit, a few MC-DropPath samples can be turned on to obtain uncertainty bands without changing the backbone. These safeguards mirror our low-SNR guidance and remain purely post hoc.

Stability to split choice is important. Because our grouping jointly encodes source location and meteorology, we expect model ranking to be robust under stratified re-sampling; a full multi-split study is left to follow-on work enabled by the released indices and splitter.

Taken together, the results support RBVM as a compact, linear-time, directionally causal backbone for inverse inference on transported fields in both field and synthetic settings. At larger spatial extents, we use a patchify/strided stem and, if needed, a two-stage coarse-to-fine schedule: a coarse predictor backtracks on the reduced grid and a fine predictor refines on a local crop. Both stages share the same directional operators and gating.

7. Conclusions and Future Work

We presented Recursive Backtracking with Visual Mamba (RBVM), a compact, Visual-Mamba-inspired visual state-space framework that infers the origin of a gas release from a short conditioning window. The backbone aggregates causal, depthwise directional operators along

{H^{\pm}, W^{\pm}, T^{+} [, T^{-}]}

via an upwind gate (DirGate) and uses pre-norm LayerNorm with a small LayerScale and a layer-indexed, depth-weighted DropPath for stable stacking. Because grouped depthwise convolutions dominate the cost, compute and memory scale linearly with the spatiotemporal extent, yielding predictable behaviour as the look-back horizon increases.

Our contributions are threefold and mutually reinforcing. Architecturally, we replace heavy spatiotemporal mixing with axis-wise, one-sided depthwise operators that preserve physical causality, while DirGate learns a convex directional mixture that follows the flow and remains hardware-friendly. In training/inference, a simple recipe—time flip alignment, BCE-with-logits across all backtracked slices (including the oldest), and an EMA-based consistency term—produces sharp, well-calibrated peaks on the evaluation slice without complicating optimization. Empirically, across a field corpus with urban channelling and wind variability and a strictly causal synthetic benchmark, RBVM improves geometry-aware localization metrics with a small parameter budget relative to attention-centric trackers. The six-direction variant (with a backward temporal branch) proves more robust than five directions in settings with meandering winds, and in a dual-source stress test, a parameter-free, physics-aware two-peak subtraction on the oldest slice provides a practical zero-shot extension without changing the backbone.

To maintain clarity and comparability, we scoped evaluation around two complementary settings. Field analyses follow an as-is protocol, and synthetic experiments follow a controlled setup used for sensitivity studies and the dual-source stress test. This design yields clean, like-for-like baselines and forms a springboard for broader field corpora and additional synthetic variants.

Looking ahead, we aim to broaden RBVM’s applicability while preserving its linear-time core. Priorities include handling sparser or irregular observations, strengthening generalization across diverse wind and terrain conditions with physics-aware extensions, scaling to larger domains, and improving multi-source separation—all within a streaming, hardware-friendly pipeline.

By marrying linear-time, directional state-space modelling with visual grid inputs, RBVM offers a practical alternative to attention-centric trackers and advances toward real-time decision support in CBRN emergencies. Beyond this domain, we view RBVM as a compact, directionally causal backbone for inverse inference on transported fields, useful more broadly for spatiotemporal backtracking tasks in environmental monitoring and urban analytics. Future work will pursue controlled comparisons against solver-backed inversions under harmonised I/O and budgets, and explore light hybridisation via auxiliary priors or gating biases while retaining linear-time inference.

Author Contributions

Conceptualisation, J.P., S.C., and H.N.; methodology, J.P. and D.M.; software, D.M. and J.P.; validation, D.M., J.P., S.C., D.K., and H.N.; formal analysis, J.P. and S.C.; investigation, J.P., D.M., S.C., D.K., and H.N.; resources, D.M., S.C., D.K., and H.N.; data curation, D.M., S.C., D.K., and H.N.; writing—original draft preparation, J.P. and D.M.; writing—review and editing, J.P., D.M., and S.C.; visualisation, D.M. and S.C.; supervision, J.P. and S.C.; project administration, S.C. and J.P.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Agency For Defense Development Grant funded by the Korean Government (915114101).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to legal restrictions.

Conflicts of Interest

Authors Jooyoung Park and Daehong Min were employed by the company CompuMath AI. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Pseudocode

This appendix provides conceptual pseudocode for reproducibility.

Listing A1. Pseudocode excerpt (for reader convenience).

Appendix B. LayerScale/DropPath Ablation

We report ratios with respect to the default

(γ_{init} = 10^{- 3}, drop_path_rate = 0.2)

. To keep compute practical, the ablation uses the one-third training split defined in Section 4, with validation/test identical to the main tables. For RMSE, lower is better (ratio

< 1

); for Exact, higher is better (ratio

> 1

).

Table A1. RMSE ratio by (drop_path_rate ×

γ_{init}

)—lower is better (ratio

< 1

).

Table A1. RMSE ratio by (drop_path_rate ×

γ_{init}

)—lower is better (ratio

< 1

).

DropPath Rate	$γ_{init} = 10^{- 4}$	$γ_{init} = 10^{- 3}$	$γ_{init} = 10^{- 2}$
0.1	0.9864	0.9389	0.9434
0.2	1.0281	1.0000	1.0337
0.3	1.1572	1.0997	1.0984

Table A2. Exact ratio by (drop_path_rate ×

γ_{init}

)—higher is better (ratio

> 1

).

Table A2. Exact ratio by (drop_path_rate ×

γ_{init}

)—higher is better (ratio

> 1

).

DropPath Rate	$γ_{init} = 10^{- 4}$	$γ_{init} = 10^{- 3}$	$γ_{init} = 10^{- 2}$
0.1	1.0016	1.0124	1.0109
0.2	0.9933	1.0000	0.9916
0.3	0.9639	0.9766	0.9768

Appendix C. Parameter Counts, Latency, and Peak Memory

Scope. This paper’s efficiency indicator centers on parameter count (model compactness). Latency and peak memory are reported as measured under a shared protocol for transparency, but we do not analyse or emphasise them in the main text. Absolute values are sensitive to hardware/drivers/implementation and tuning; rigorous efficiency comparisons (and tuning) are intentionally deferred to future work.

Table A3 summarizes the hardware/software environment and the shared timing protocol used for measurements. All models were run in eval mode with torch.no_grad() on single-sample inputs (default float32); after a short warm-up, we executed a fixed-length inference loop, calling torch.cuda.synchronize() before and after each iteration to compute the average wall-clock latency. Peak memory denotes the maximum GPU allocation observed during the loop.

Table A3. Hardware/software specs and shared timing protocol.

Component	Specification
CPU	2 × Intel Xeon Silver 4310 @ 2.10GHz
GPU	2 × NVIDIA RTX 6000 Ada
GPU Memory	48 GiB per GPU (49140 MiB)
OS	Ubuntu 24.04.2 LTS
CUDA	13.0
NVIDIA Driver	580.76.05

Table A4 reports parameter counts (primary) together with as-measured latency and peak memory for 3D–CNN, CNN–LSTM, ViViT, and RBVM under identical settings. These values are environment-/implementation-dependent and are presented here for reporting only. We do not claim a ranking of models.

Table A4. Parameter counts (primary), plus as-measured latency and peak memory under the shared protocol.

Model	Params [M]	Latency [ms]	Peak Memory [MB]
CNN–LSTM	9.87	2.62	115.79
3D–CNN	6.74	0.73	110.67
ViViT	2.11	4.11	118.36
RBVM	0.95	6.22	157.56

Minimal interpretation. (i) RBVM is favourable on the paper’s core axes (accuracy and parameter count), whereas its current implementation yields conservative latency/memory figures that reflect PyTorch path choices and untuned memory/kernel routes. These should be understood as pre-tuning baseline measurements; they can vary substantially with kernel/layout optimization. (ii) The narrative scope of the main text is accuracy and compactness. Latency/memory will be revisited in a fair apples-to-apples profiling (aligned kernels/layouts, identical precision/batch/runtime options, and per-backbone optimized implementations) in future work.

Figure A1 shows the mean RBVM latency as a function of the sequence length T (batch = 1, iterations = 100, same protocol). The short-T regime is dominated by fixed overheads; as T increases, a near-linear slope emerges, which is consistent with the axis-wise one-sided design’s linear-time behaviour. The figure illustrates a trend, and absolute values are not used as a benchmarking metric.

Figure A1. RBVM inference latency vs. sequence length T (as-measured trend under the shared protocol).

Future directions (brief). Fair comparison and improvements for latency/memory will be organized around: (a) kernel fusion and shorter tensor-layout/permutation paths, (b) mixed precision with safe activation caching, (c) dynamic resolution via patchified/strided stems and local gating where appropriate, and (d) backend-/runtime-specific fast paths. Items that are primarily implementation-/tuning-dependent lie outside the paper’s core contribution and are not interpreted in the main text.

Appendix D. Dual-Source Threshold Sensitivity

At the NBC_RAMS scale (200 m per grid cell), SR@100 m and SR@200 m evaluate, respectively, half-cell and one-cell localization tolerances—tight vs. operationally permissive. Thresholds much below 100 m become overly strict at this resolution, whereas substantially larger radii tend to saturate success and lose discrimination. The table below reports means and standard deviations over 10 random seeds for dual-source metrics; the 6-dir variant remains consistently stronger under both thresholds.

Table A5. Dual-source threshold sensitivity (Mean and Std. over 10 seeds).

	${RMSE}_{m}$ [m]		SR@100 [%]		SR@200 [%]		SL- $F_{1}$
Variant	Mean	Std.	Mean	Std.	Mean	Std.	Mean	Std.
5-dir	247.72	21.73	28.51	5.44	68.31	3.78	0.812	0.022
6-dir	195.35	40.06	53.32	12.30	86.02	5.45	0.916	0.031

References

Jones, R.; Lehr, W.; Simecek-Beatty, D.; Reynolds, M. ALOHA® (Areal Locations of Hazardous Atmospheres) 5.4.4: Technical Documentation; U.S. Department of Commerce, National Oceanic and Atmospheric Administration, National Ocean Service, Office of Response and Restoration: Seattle, WA, USA, 2013. Available online: https://books.google.co.kr/books?id=iR47nwEACAAJ (accessed on 29 September 2025).
Simpson, S.M.; Miner, S.; Mazzola, T.; Meris, R. HPAC model studies of selected Jack Rabbit II (JRII) releases and comparisons to test data. Atmos. Environ. 2020, 243, 117675. [Google Scholar] [CrossRef]
Lilienthal, A.; Duckett, T. Building gas concentration gridmaps with a mobile robot. Robot. Auton. Syst. 2004, 48, 3–16. [Google Scholar] [CrossRef]
Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
Yeon, A.S.A.; Zakaria, A.; Zakaria, S.M.M.S.; Visvanathan, R.; Kamarudin, K.; Kamarudin, L.M. Gas source localization via mobile robot with gas distribution mapping and deep neural network. In Proceedings of the 2022 2nd International Conference on Electronic and Electrical Engineering and Intelligent System (ICE3IS), Yogyakarta, Indonesia, 4–5 November 2022; pp. 120–124. [Google Scholar]
Kim, H.; Park, M.; Kim, C.W.; Shin, D. Source localization for hazardous material release in an outdoor chemical plant via a combination of LSTM-RNN and CFD simulation. Comput. Chem. Eng. 2019, 125, 476–489. [Google Scholar] [CrossRef]
Sharan, M.; Issartel, J.P.; Singh, S.K. A point-source reconstruction from concentration measurements in low-wind stable conditions. Q. J. R. Meteorol. Soc. 2012, 138, 1884–1894. [Google Scholar]
Ristic, B.; Gunatilaka, A.; Gailis, R. Localization of a source of hazardous substance dispersion using binary measurements. Atmos. Environ. 2016, 142, 114–119. [Google Scholar]
Son, J.; Kang, M.; Lee, B.; Nam, H. Source localization of the chemical gas dispersion using recursive tracking with transformer. IEEE Access 2024. [Google Scholar] [CrossRef]
Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
Jang, H.D.; Kwon, S.; Nam, H.; Chang, D.E. Chemical Gas Source Localization with Synthetic Time Series Diffusion Data Using Video Vision Transformer. Appl. Sci. 2024, 14, 4451. [Google Scholar] [CrossRef]
Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
Lin, J.; Gan, C.; Han, S. TSM: Temporal Shift Module for Efficient Video Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093. [Google Scholar]
Liu, X.; Zhang, C.; Huang, F.; Xia, S.; Wang, G.; Zhang, L. Vision Mamba: A Comprehensive Survey and Taxonomy. IEEE Trans. Neural Netw. Learn. Syst. 2025; 1–21, early access. [Google Scholar] [CrossRef]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Li, S.; Singh, H.; Grover, A. Mamba-nd: Selective state space modeling for multi-dimensional data. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 75–92. [Google Scholar]
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
Polyak, B.T.; Juditsky, A.B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 1992, 30, 838–855. [Google Scholar]
Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
Vardi, G.; Yehudai, G.; Shamir, O. Width is less important than depth in ReLU neural networks. In Proceedings of the Conference on Learning Theory, London, UK, 2–5 July 2022; pp. 1249–1281. [Google Scholar]
Poli, M.; Massaroli, S.; Nguyen, E.; Fu, D.Y.; Dao, T.; Baccus, S.; Bengio, Y.; Ermon, S.; Ré, C. Hyena hierarchy: Towards larger convolutional language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28043–28078. [Google Scholar]
Nguyen, E.; Goel, K.; Gu, A.; Downs, G.; Shah, P.; Dao, T.; Baccus, S.; Ré, C. S4nd: Modeling images and videos as multidimensional signals with state spaces. Adv. Neural Inf. Process. Syst. 2022, 35, 2846–2861. [Google Scholar]
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
Wang, J.; Ji, J.; Ravikumar, A.P.; Savarese, S.; Brandt, A.R. VideoGasNet: Deep learning for natural gas methane leak classification using an infrared camera. Energy 2022, 238, 121516. [Google Scholar] [CrossRef]
Wang, J.; Tchapmi, L.P.; Ravikumar, A.P.; McGuire, M.; Bell, C.S.; Zimmerle, D.; Savarese, S.; Brandt, A.R. Machine vision for natural gas methane emissions detection using an infrared camera. Appl. Energy 2020, 257, 113998. [Google Scholar] [CrossRef]
Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
Tarantola, A. Inverse Problem Theory and Methods for Model Parameter Estimation; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2005. [Google Scholar]
Enting, I.G. Inverse Problems in Atmospheric Constituent Transport; Cambridge University Press: Cambridge, UK, 2002. [Google Scholar]
Pielke, R.A.; Cotton, W.R.; Walko, R.E.A.; Tremback, C.J.; Lyons, W.A.; Grasso, L.D.; Nicholls, M.E.; Moran, M.D.; Wesley, D.A.; Lee, T.J.; et al. A comprehensive meteorological modeling system—RAMS. Meteorol. Atmos. Phys. 1992, 49, 69–91. [Google Scholar] [CrossRef]
Stein, A.F.; Draxler, R.R.; Rolph, G.D.; Stunder, B.J.; Cohen, M.D.; Ngan, F. NOAA’s HYSPLIT atmospheric transport and dispersion modeling system. Bull. Am. Meteorol. Soc. 2015, 96, 2059–2077. [Google Scholar] [CrossRef]
Evensen, G. The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dyn. 2003, 53, 343–367. [Google Scholar] [CrossRef]
Kalnay, E. Atmospheric Modeling, Data Assimilation and Predictability; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 2021, 3, 422–440. [Google Scholar] [CrossRef]
Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy, 22–29 October 2017; pp. 6299–6308. [Google Scholar]
Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Icml 2021, 2, 4. [Google Scholar]
Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6824–6835. [Google Scholar]
Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
Ke, W.; Chan, K.H. A multilayer CARU framework to obtain probability distribution for paragraph-based sentiment analysis. Appl. Sci. 2021, 11, 11344. [Google Scholar] [CrossRef]
Xing, Z.; Lam, C.T.; Yuan, X.; Im, S.K.; Machado, P. Mmqw: Multi-modal quantum watermarking scheme. IEEE Trans. Inf. Forensics Secur. 2024, 19, 5181–5195. [Google Scholar] [CrossRef]
Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going Deeper with Image Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 32–42. [Google Scholar]
Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep networks with stochastic depth. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 646–661. [Google Scholar]
Chang, J.C.; Hanna, S.R.; Boybeyi, Z.; Franzese, P. Use of Salt Lake City URBAN 2000 field data to evaluate the urban hazard prediction assessment capability (HPAC) dispersion model. J. Appl. Meteorol. 2005, 44, 485–501. [Google Scholar]
Crank, J. The Mathematics of Diffusion; Oxford University Press: Oxford, UK, 1979. [Google Scholar]
Seinfeld, J.H.; Pandis, S.N. Atmospheric Chemistry and Physics: From Air Pollution to Climate Change, 3rd ed.; Wiley: Hoboken, NJ, USA, 2016. [Google Scholar]

Figure 1. NDLayer3D schematic. Axis selector (choose T, H, or W), per-direction causal padding masks, and shape transitions are annotated. For spatial sweeps (

H^{\pm}, W^{\pm}

):

(B, T, H, W, C) \leftrightarrow (B T, C, H, W)

; for temporal sweeps (

T^{\pm}

, flipped index):

(B, T, H, W, C) \leftrightarrow (B H W, C, T)

. Each path applies a depthwise conv (3-wide) on the operated axis and reshapes back to

(B, T, H, W, C)

.

Figure 2. Top-level RBVM architecture. Stem (Conv3D

1 \to C

) →permute to channels-last

(B, C, T, H, W) \to (B, T, H, W, C)

→ stacked Mamba3DBlocks→permute back

(B, T, H, W, C) \to (B, C, T, H, W)

→Head (Conv3D

C \to 1

). Explicit permute points are shown; the output is logits on

(B, 1, T, H, W)

.

Figure 3. Mamba3DBlock (modular view). Pre-norm LayerNorm feeds six NDLayer3D paths

(H^{\pm}, W^{\pm}, T^{\pm})

; DirGate (global/local) produces convex weights g, whose gated sum is scaled by LayerScale

γ_{dir}

with optional DropPath and added residually. A second pre-norm LayerNorm feeds an MLP, followed by LayerScale

γ_{mlp}

and optional DropPath, then a residual add. Residual routes, LayerScale/DropPath placement, and gating output g are highlighted; the block operates in

(B, T, H, W, C)

. Compositionality: stacked axis-wise sweeps synthesise oblique information flow while preserving per-axis causality. Temporal sweeps are applied on a flipped index so that both

T^{+}

and

T^{-}

remain one-sided with causal padding; dashed regions indicate unobserved frames that are never accessed.

Figure 4. Sample NBC_RAMS sequence (

t = 1

–15) shown for illustration; early near-field frames are excluded from training/evaluation as noted in the text. Concentration (normalized to

[0, 1]

) is overlaid on the basemap; the red “×” marks the true source. The plume grows, advects, and dilutes over time under the prevailing wind field. The colour bar indicates the value of each data point.

Figure 5. Preprocessed counterpart of Figure 4. Frames are rasterised to a

25 \times 25

grid, per-frame min–max-normalized to

[0, 1]

, and lightly denoised as in Section 4.1. The red “×” indicates the true source; the colour bar denotes the normalized concentration value.

Figure 6. Past-frame reconstruction and source localization on NBC_RAMS. (Top): input conditioning window (

t = 15 \to 6

). (Middle): recursively backtracked frames (

t = 5 \to 0

). (Bottom): ground-truth frames for the same indices. (Right): localization overlay (red “×” = ground truth; blue marker = prediction). Predictions align with the prevailing wind, shrink in spatial extent toward early times, and concentrate at the source, consistent with the ground truth.

Figure 7. Past-frame reconstruction and zero-shot dual-source localization on the Synthetic Benchmark. Top: input conditioning window (

t = 15 \to 6

). Middle: recursively backtracked frames (

t = 5 \to 0

). Bottom: ground-truth frames for the same indices. Right: localization overlay on the oldest slice (

t \approx 0

); “×” marks the two ground-truth emitters and open circles mark the two predicted sources obtained via a parameter-free two-peak subtraction on

{\hat{P}}_{FUT}

(no retraining). Predictions follow the dominant wind, contract backward in time, and concentrate at the sources, consistent with the ground truth.

Table 1. NBC_RAMS field benchmark (evaluated on the oldest slice). Metrics: RMSE/AED reported jointly as pixels/metres on the

25 \times 25

grid, with AED computed as

RMSE \times 200 m

(1 px = 200 m); hit $\leq 1$ = accuracy within Chebyshev distance 1; Exact = exact hit. RBVM (6-dir) attains the lowest RMSE/AED and the highest hit/Exact while using the fewest parameters.

Table 1. NBC_RAMS field benchmark (evaluated on the oldest slice). Metrics: RMSE/AED reported jointly as pixels/metres on the

25 \times 25

grid, with AED computed as

RMSE \times 200 m

(1 px = 200 m); hit $\leq 1$ = accuracy within Chebyshev distance 1; Exact = exact hit. RBVM (6-dir) attains the lowest RMSE/AED and the highest hit/Exact while using the fewest parameters.

Model	Params [M]	RMSE/AED [px/m]	hit $\leq 1$ [%]	Exact [%]
3D–CNN	6.74	0.63 / 126	96.87	74.30
CNN–LSTM	9.87	0.32 / 64	99.55	91.74
ViViT	2.11	0.28 / 56	99.49	94.19
RBVM (6-dir)	0.95	0.20 / 40	99.86	96.70

Table 2. Synthetic benchmark (strictly causal flows; oldest slice evaluation). RMSE/AED are reported jointly as pixels/metres on the

25 \times 25

grid, with AED computed as

RMSE \times 200 m

(1 px = 200 m).

Table 2. Synthetic benchmark (strictly causal flows; oldest slice evaluation). RMSE/AED are reported jointly as pixels/metres on the

25 \times 25

grid, with AED computed as

RMSE \times 200 m

(1 px = 200 m).

Model	Params [M]	RMSE/AED [px/m]	hit $\leq 1$ [%]	Exact [%]
3D–CNN	0.272	0.851 / 170.	91.34	29.15
CNN–LSTM	9.868	0.334 / 66.8	100.0	67.68
ViViT	0.789	0.398 / 79.6	100.0	60.61
RBVM (5-dir)	0.128	0.186 / 37.2	99.76	83.17
RBVM (6-dir)	0.130	0.022 / 4.40	100.0	97.80

Table 3. Zero-shot dual-source localization on synthetic sequences (two-peak subtraction on

{\hat{P}}_{FUT}

). SR@

r

denotes the fraction of test cases where both sources are within radius

r m

. SL-

F_{1}

uses a

200 m

success threshold.

Table 3. Zero-shot dual-source localization on synthetic sequences (two-peak subtraction on

{\hat{P}}_{FUT}

). SR@

r

denotes the fraction of test cases where both sources are within radius

r m

. SL-

F_{1}

uses a

200 m

success threshold.

Model (Variant)	${RMSE}_{m}$ [m]	SR@100 [%]	SR@200 [%]	SL- $F_{1}$
RBVM (5-dir)	184.1	59.9	83.9	0.907
RBVM (6-dir)	118.2	89.7	95.1	0.969

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).