1. Introduction
Fine-grained urban air-quality fields are increasingly used in geospatial decision-making, including exposure assessment, urban planning, and policy evaluation, where neighborhood-scale variation often matters more than city-wide averages [1,2]. Yet, from a geo-information perspective, these applications depend on reconstructing a spatially continuous spatiotemporal field from observations that are sparse, irregular, and sometimes unreliable. In operational settings, the monitoring infrastructure that provides regulatory-grade supervision is both limited and fragile: stations are expensive to deploy, and long gaps due to outages, maintenance, or communication failures are common [3]. This creates a persistent gap between the spatial detail required by downstream geospatial analyses and the coverage of reliable measurements.
We study this problem in Augsburg, Germany, where the goal is to reconstruct city-scale PM10 and NO2 fields over a dense grid from only a handful of fixed stations, leveraging exogenous covariates (meteorology, traffic) and static geographic descriptors. With so few station time series constraining a latent city-wide field, the reconstruction is strongly underdetermined and naturally calls for a probabilistic treatment: rather than returning a single surface completion, we seek a distribution of plausible spatiotemporal fields conditioned on sparse and potentially noisy measurements. Such uncertainty-aware field inference is especially relevant under practical stressors, including single-station availability, extended station outages, and sensor corruption, where point estimates can be misleading. Classical geostatistical and GIS interpolation methods such as Kriging [4] and IDW [5] remain useful baselines, but their smoothness and stationarity assumptions are often strained in heterogeneous urban environments.
Recent learning-based spatiotemporal models provide more flexible inductive biases by capturing non-linear dependencies over time and space. In air-quality applications, architectures based on spatiotemporal attention, graph neural networks, and multi-scale representations have been proposed for forecasting and inference [6,7,8,9,10,11]. Related advances on sensor networks, ranging from diffusion-convolution recurrent designs to graph convolutional backbones, offer effective mechanisms for combining temporal modeling with spatial message passing [12,13,14,15,16], and Transformer variants provide complementary tools for long-range temporal structure [17,18]. Still, most of these methods are trained as deterministic predictors: under extreme sparsity they tend to over-smooth unconstrained regions, degrade sharply during long outages, and provide limited or poorly calibrated uncertainty [19]. For urban environmental mapping and geo-information applications, a probabilistic reconstruction approach is better aligned with this regime because it can represent multiple field realizations that are consistent with the available evidence.
Diffusion models are a particularly attractive probabilistic family because they can represent complex conditional distributions through sampling [20,21,22,23,24,25]. However, in sparse sensing settings, reconstruction quality depends critically on how measurement constraints are enforced during reverse sampling. If observations enter only through denoiser inputs, sampling trajectories can drift from sparse constraints; if one instead applies hard replacement, inserting clean values into noisy intermediate states induces a clean–noisy mismatch and can propagate corrupted readings when measurements are unreliable [26,27]. From a Bayesian viewpoint, conditioning should approximate posterior sampling under an explicit likelihood, connecting to diffusion-based inverse-problem methods that guide sampling with measurement constraints [28,29,30,31,32,33].
We therefore propose STGPD (SpatioTemporal Graph Posterior Diffusion), a geo-information-oriented posterior sampling framework for city-scale air-quality field reconstruction on graph-structured spatiotemporal domains. STGPD performs constrained posterior-guided sampling with a noise-aware soft-consistency update: at each step, it combines the model proposal with a noise-matched measurement term via variance-weighted Gaussian fusion, rather than rigid replacement, improving stability under sensor noise. To better capture heterogeneous urban structure, STGPD further constructs a dual-view graph that combines geographic proximity with functional similarity derived from multi-scale geographic descriptors, enabling information propagation between distant yet functionally similar regions. Across stress tests involving extreme sparsity, station outages, and synthetic noise injection, STGPD improves both reconstruction accuracy and uncertainty calibration, while remaining compatible with fast solvers and conditional samplers [34,35,36].
Our contributions are summarized as follows:
- We formulate city-scale air-quality reconstruction as geo-information field inference via diffusion-based posterior sampling on a graph-structured spatiotemporal domain.
- We propose a noise-aware, step-wise soft-consistency mechanism that enforces measurement fidelity without clean–noisy mismatch, improving robustness under unreliable sensors.
- We introduce a dual-view (geographic + functional) graph construction using multi-scale geographic descriptors and demonstrate consistent gains in accuracy and uncertainty calibration under multiple real-world stress scenarios.
3. Methodology
We first introduce the Augsburg dataset and preprocessing pipeline, followed by graph construction and the proposed STGPD framework. The study area and monitoring configuration are shown in Figure 1. An overview of the proposed STGPD framework is shown in Figure 2. We focus on two policy-relevant pollutants, PM10 and NO2. Since grid-level ground truth is unavailable, station measurements are the only supervised reference for training and evaluation. Grid nodes are included to support city-scale field reconstruction, but they do not contribute to the supervised loss or evaluation metrics unless stated otherwise.
3.1. Study Area and Data Sources
The study area is Augsburg, Bavaria, Germany, a mid-sized city with heterogeneous land-use patterns and traffic corridors, which provides a challenging setting for reconstruction from sparse monitoring, as shown in Figure 1. All spatial layers (station locations, city boundary, and grid-cell centroids) are processed in a consistent coordinate reference system, and distances used in graph construction are computed accordingly. The reconstruction domain consists of grid-cell centroids within the Augsburg administrative boundary. Building on this spatial setup, we next describe the pollutant measurements and the auxiliary covariates used for reconstruction.
Hourly concentrations of PM10 and NO2 were collected from 1 July 2016 to 31 December 2020 at four fixed monitoring stations: Augsburg/Königsplatz, Augsburg/Bourges-Platz, Augsburg/Karlstraße, and Augsburg University of Applied Sciences (UAS). Measurements contain missing segments due to outages and are kept as missing in the aligned dataset. In addition to these station readings, we incorporate time-varying exogenous covariates that are available across the domain.
We use hourly meteorological variables, including pressure, temperature, relative humidity, precipitation, wind speed, and wind direction, together with traffic intensity provided by the City of Augsburg and calendar indicators such as hour-of-day, day-of-week, weekend and holiday flags. Following the experimental assumptions used throughout the paper, including the leave-one-out evaluation, meteorology and calendar features are treated as globally available covariates for all nodes, both stations and grid cells, reflecting their availability from dense external products or administrative records. Traffic features are aligned to the hourly timeline and associated with nodes according to the spatial support provided by the data source, for example by mapping measurements to nearby road segments or aggregating them to the spatial units used by the provider. Finally, we assign each node a static descriptor to summarize multi-scale geographic context and support spatial extrapolation.
To represent local-to-regional spatial context, each node is assigned a static descriptor vector computed from concentric buffers with 19 radii spanning 0.1–5.0 km. Within each radius, we compute summary statistics of geographic context and concatenate them across scales. These statistics are derived from standard GIS layers extracted from OpenStreetMap (OSM), including land-use and land-cover categories as well as indicators related to the built environment and roads, aggregated within each buffer radius. The descriptors are computed for both monitoring stations and grid-cell centroids, enabling the model to use consistent spatial context when propagating information from stations to grids. The proposed framework does not explicitly ingest a source-resolved emission inventory. Instead, source-related spatial heterogeneity is represented indirectly through traffic covariates, meteorological forcings, and multi-scale geographic descriptors, which together capture local traffic influence, broader urban functional differences, and spatial context relevant to transport and dispersion.
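As a concrete illustration, the multi-scale descriptor construction can be sketched as follows. The buffer statistics are simplified here to per-category feature counts (the actual OSM-derived statistics are richer), and the helper `buffer_descriptors` is hypothetical, not the paper's implementation:

```python
import numpy as np

def buffer_descriptors(node_xy, feat_xy, feat_cat, n_cats, radii_km):
    """Multi-scale descriptor sketch: per-category feature counts within
    concentric buffers around each node, concatenated across radii.
    A simplified stand-in for the OSM-derived buffer statistics."""
    # pairwise node-to-feature distances (same planar units as the radii)
    d = np.linalg.norm(node_xy[:, None, :] - feat_xy[None, :, :], axis=-1)
    blocks = []
    for r in radii_km:
        within = d <= r  # (n_nodes, n_feats) buffer membership at this scale
        counts = np.stack([(within & (feat_cat == c)).sum(axis=1)
                           for c in range(n_cats)], axis=1)
        blocks.append(counts.astype(float))
    return np.concatenate(blocks, axis=1)  # (n_nodes, n_cats * len(radii_km))

# 19 radii spanning 0.1-5.0 km, as in the descriptor construction
radii = np.linspace(0.1, 5.0, 19)
```

Because the buffers are nested, each per-category count is non-decreasing in the radius, so the concatenated vector encodes context from local (0.1 km) to regional (5.0 km) scales.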
Table 1 summarizes the station names, site types, measurement methods, and QA/QC information for the four fixed monitoring stations used in this study.
3.2. Preprocessing and Input Tensors
We next describe how all sources are aligned and transformed into the tensors used by the denoiser and the posterior sampling procedure. All data sources are aligned to a common hourly timeline. Pollutant channels may be missing and are retained as missing values in the raw aligned data. Exogenous covariates are synchronized by timestamp. Static descriptors are matched to each node by location.
Before modeling, we apply standard transformations and encodings to ensure numerical stability and avoid representational artifacts. Traffic intensity is transformed with log1p. Wind direction is encoded by its sine and cosine components to avoid angular discontinuities. Calendar features are encoded as standard time indicators.
All continuous variables (pollutants, meteorology, traffic, static descriptors) are standardized using z-score normalization. Normalization statistics are computed only on the training split and then applied to test splits to prevent leakage. In implementation, missing pollutant entries are filled with zeros after normalization to avoid propagating NaN values; the corresponding observation mask ensures that these placeholders do not contribute to the supervised objective. All deterministic metrics (MAE/RMSE/R²) are reported in the de-normalized scale.
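A minimal sketch of these preprocessing steps (log1p traffic transform, sin/cos wind-direction encoding, train-split-only z-scoring, and mask-aware zero-filling); the function `preprocess` and its argument layout are illustrative, not the paper's implementation:

```python
import numpy as np

def preprocess(pollutant, traffic, wind_dir_deg, train_idx):
    """Sketch of the preprocessing pipeline: log1p traffic transform,
    sin/cos wind-direction encoding, train-split-only z-scoring, and
    mask-aware zero-filling of missing pollutant entries."""
    mask = ~np.isnan(pollutant)                 # observation mask (True = observed)
    traffic = np.log1p(traffic)                 # compress heavy-tailed traffic counts
    wd = np.deg2rad(wind_dir_deg)
    wind_enc = np.stack([np.sin(wd), np.cos(wd)], axis=-1)  # no 0/360 discontinuity
    # normalization statistics from the training split only (no leakage)
    mu = np.nanmean(pollutant[train_idx])
    sd = np.nanstd(pollutant[train_idx])
    z = (pollutant - mu) / sd
    z = np.where(mask, z, 0.0)                  # placeholder zeros, excluded via mask
    return z, mask, traffic, wind_enc, (mu, sd)
```

The returned `(mu, sd)` pair is what de-normalizes predictions back to physical units for metric computation.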
With the aligned and normalized inputs, we define the target field, observations, and covariates used throughout the paper. Let T be the number of hourly timestamps and N the number of nodes, where N = N_s + N_g with N_s stations and N_g grid-cell centroids. We define:
Latent field: X ∈ R^(T×N×2), with two channels corresponding to PM10 and NO2.
Observations: Y ∈ R^(T×N×2), denoting the aligned observation tensor; only station-node entries are observed, while all grid-node entries are treated as missing.
Time-varying covariates: U ∈ R^(T×N×d_u), including meteorological variables, traffic intensity, and calendar features, where d_u denotes the number of dynamic covariate channels.
Static descriptors: S ∈ R^(N×d_s), representing multi-scale geographic descriptors shared across time, where d_s denotes the number of static descriptor channels.
We define a pollutant observation mask M ∈ {0,1}^(T×N×2), where M_(t,i,c) = 1 indicates that pollutant channel c at node i and time t is available in the raw dataset (after natural missingness), and 0 otherwise. Exogenous covariates are not masked.
For re-masking evaluation, we define a station-only observation mask M_st by restricting M to station nodes (and setting all grid-node entries to zero). We then split the observed station entries into a visible mask M_vis and a target mask M_tgt such that M_vis + M_tgt = M_st and M_vis ⊙ M_tgt = 0. We define the visible observation tensor as Y_vis = M_vis ⊙ Y.
During training under the re-masking objective, the denoiser is conditioned on Y_vis and the supervised loss is computed only on target entries indexed by M_tgt. Crucially, during inference (testing), the consistency update is applied only on visible entries by setting the consistency mask M_cons = M_vis and using Y_vis in all re-noising and fusion steps; target entries in M_tgt are never used during sampling and are reserved solely for metric computation, preventing any ground-truth leakage into the reverse diffusion trajectory. For convenience, M_vis is represented in the full-node tensor shape by assigning zero entries to all grid nodes.
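The visible/target split of the station mask can be sketched as follows; `split_remask` is an illustrative helper, assuming boolean mask tensors and a 20% target fraction as in the evaluation protocol:

```python
import numpy as np

def split_remask(obs_mask_station, target_frac=0.2, rng=None):
    """Split observed station entries into a visible mask (conditioning)
    and a disjoint target mask (supervision/evaluation), per the
    re-masking protocol. obs_mask_station is a boolean (T, N, C) tensor
    with grid-node entries already zeroed."""
    rng = np.random.default_rng(rng)
    idx = np.flatnonzero(obs_mask_station.ravel())   # observed station entries
    n_tgt = int(round(target_frac * idx.size))
    tgt_idx = rng.choice(idx, size=n_tgt, replace=False)
    m_tgt = np.zeros(obs_mask_station.size, dtype=bool)
    m_tgt[tgt_idx] = True
    m_tgt = m_tgt.reshape(obs_mask_station.shape)
    m_vis = obs_mask_station & ~m_tgt                # disjoint: vis + tgt = observed
    return m_vis, m_tgt
```

Fixing the random seed here is what makes the test-split partition identical across all compared methods.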
3.3. Graph Construction
We encode spatial structure with a weighted graph G = (V, E, A), whose edge weights are defined as follows. To reflect both geographic proximity and functional similarity, we fuse two affinity matrices:

A = α · φ(A_geo) + (1 − α) · φ(A_fun),  (1)

where α ∈ [0, 1], and φ(A_geo), φ(A_fun) are affinities normalized to comparable scales prior to fusion. Here φ(·) denotes a monotone rescaling used to place the two affinity matrices on comparable scales; in our implementation we apply row-wise normalization, φ(A)_ij = A_ij / (Σ_k A_ik + ε), where ε is a small constant for numerical stability.
We compute A_geo using a Gaussian kernel with bandwidth σ over pairwise distances d_ij:

A_geo,ij = exp(−d_ij² / σ²).  (2)
This proximity-based view encourages local information propagation on the urban grid; however, purely distance-based connectivity can be insufficient in heterogeneous cities where distant locations may share similar land-use and built-environment characteristics. To complement geographic proximity, we therefore introduce a functional-similarity affinity derived from static descriptors.
We compute A_fun from the static descriptor vectors. Let s_i be the normalized descriptor of node i. We define a nonnegative similarity:

A_fun,ij = max(0, s_iᵀ s_j),  (3)

which allows edges between functionally similar locations even when they are geographically distant. After constructing A_geo and A_fun, we fuse them using Equation (1) and then sparsify the resulting affinity to obtain a tractable graph.
For efficiency and to avoid isolated nodes, we sparsify the fused affinity by retaining the top-k neighbors per node, symmetrize by A ← (A + Aᵀ)/2, add self-loops, and apply row normalization before message passing. To ensure connectivity among monitoring sites, we preserve all station–station edges during sparsification so that station nodes form a fully connected subgraph, while pruning is applied to all other pairs.
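The dual-view graph construction described above can be sketched end-to-end; the clipped-cosine functional similarity and the unit self-loop weight are illustrative choices, since the exact forms are implementation details:

```python
import numpy as np

def build_graph(coords, desc, alpha=0.5, sigma=1.0, k=8, n_station=4, eps=1e-8):
    """Dual-view graph sketch: Gaussian geographic affinity fused with a
    descriptor-based functional affinity, then top-k sparsified with
    station-station edges preserved and rows normalized."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    a_geo = np.exp(-(d ** 2) / sigma ** 2)            # proximity view
    s = desc / (np.linalg.norm(desc, axis=1, keepdims=True) + eps)
    a_fun = np.maximum(s @ s.T, 0.0)                  # nonnegative functional view

    def rownorm(a):
        return a / (a.sum(axis=1, keepdims=True) + eps)

    a = alpha * rownorm(a_geo) + (1 - alpha) * rownorm(a_fun)  # fusion
    # top-k sparsification: keep the k strongest neighbors per node
    keep = np.zeros_like(a, dtype=bool)
    nbr = np.argsort(-a, axis=1)[:, :k]
    np.put_along_axis(keep, nbr, True, axis=1)
    keep[:n_station, :n_station] = True               # stations stay fully connected
    a = np.where(keep, a, 0.0)
    a = 0.5 * (a + a.T)                               # symmetrize
    np.fill_diagonal(a, np.maximum(np.diag(a), 1.0))  # add self-loops
    return rownorm(a)                                 # row-normalized for message passing
```

Preserving the station block before pruning guarantees that the few supervised nodes can always exchange information directly, regardless of k.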
When wind-driven transport effects are modeled, we further modulate edge strengths using wind speed and direction to obtain a time-dependent graph A_t, followed by the same sparsification and normalization. Unless stated otherwise, α and k are selected on the validation split.
3.4. STGPD: Spatiotemporal Graph Posterior Diffusion
We treat reconstruction as posterior sampling on graph-structured spatiotemporal fields under sparse observations. Let X denote the latent pollutant field over all nodes and times. Observed station readings follow:

Y = M_st ⊙ X + ε_obs,  (4)

where M_st is the station-only mask induced by the pollutant observation mask M, and ε_obs denotes observation noise. Grid-node pollutant entries are treated as unobserved; supervision is provided only by station entries, and the training loss is computed on the target subset introduced by re-masking (Section 4). Building on this observation model, we first learn a conditional diffusion prior over the full latent field and then enforce step-wise consistency with the visible observations during reverse-time sampling.
3.4.1. Diffusion Prior with a Graph-Temporal Denoiser
We learn a conditional diffusion prior over X. With a noise schedule {β_t}, define α_t = 1 − β_t and ᾱ_t = Π_{s=1..t} α_s. The forward process is:

x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε,  ε ~ N(0, I),  (5)

where x_0 = X. We train a denoiser ε_θ(x_t, t, C) to predict ε, where the conditioning set is C = {Y_cond, M_cond, U, S, A}, i.e., the conditioning observations and mask, the dynamic covariates, the static descriptors, and the graph. Here, ε_θ denotes the neural network, parameterized by θ, that predicts the diffusion noise residual ε, and M_cond denotes the conditioning mask: under the re-masking protocol, M_cond = M_vis; otherwise M_cond = M_st. The denoiser follows a bidirectional graph-recurrent design: graph message passing captures spatial dependencies on A, and bidirectional recurrent updates model temporal evolution in both directions.
Training minimizes the standard noise-prediction loss on pollutant channels:

L(θ) = E_(x_0, t, ε) ‖ M_loss ⊙ (ε − ε_θ(x_t, t, C)) ‖²,  (6)

where M_loss selects the supervised entries. When grid-level ground truth is unavailable, the loss is evaluated only on supervised station entries and, under re-masking, restricted to the target subset indexed by M_tgt (details in Section 4). Once this prior is learned, we draw samples by reverse diffusion; however, under sparse sensing, unconstrained sampling can drift away from the observed station values, motivating an explicit consistency mechanism.
3.4.2. Noise-Aware Observation Consistency During Sampling
Unconstrained reverse diffusion samples from the learned prior and may drift when observations are sparse. STGPD enforces consistency by updating the reverse trajectory at every step, using only the visible observations specified by the re-masking protocol.
A standard DDPM-style proposal from x_t is:

x̂_(t−1) = (1/√α_t) · (x_t − (β_t/√(1 − ᾱ_t)) · ε_θ(x_t, t, C)) + σ_t z,  z ~ N(0, I).  (7)

We then combine this prior proposal with an observation-based term in a noise-matched manner. To avoid inserting clean values into a noisy state and to ensure a leakage-free re-masking evaluation protocol, we first define the visible observation tensor Y_vis = M_vis ⊙ Y and re-noise these visible observations to the current diffusion level:

y_(t−1) = √(ᾱ_(t−1)) · Y_vis + √(1 − ᾱ_(t−1)) · ε′,  ε′ ~ N(0, I).  (8)
Hard Consistency (Noise Matched)
We enforce consistency strictly on the visible subset. Let M_cons = M_vis be the consistency mask. The update is:

x_(t−1) = M_cons ⊙ y_(t−1) + (1 − M_cons) ⊙ x̂_(t−1).  (9)

In this hard-consistency variant, we treat the (re-noised) visible observations as exact constraints on the visible subset, i.e., we directly replace the corresponding entries after noise matching.
Soft Consistency (Noise-Aware Fusion)
When measurements are noisy, strict clamping can overfit corrupted readings. To make the role and scope of the proposed update explicit, we emphasize that the soft-consistency mechanism is intended as a practical approximation to posterior-guided reverse sampling under an explicit observation-noise assumption, rather than as a closed-form derivation of the exact reverse-time Bayesian posterior of the full diffusion process. Its purpose is to mitigate the clean–noisy mismatch caused by direct replacement of observations and to provide a noise-aware balance between measurement consistency and prior-guided denoising. Assuming additive observation noise with variance σ_obs², the effective variance of the noise-matched observation term at step t − 1 is:

σ̃²_(t−1) = ᾱ_(t−1) · σ_obs² + (1 − ᾱ_(t−1)).  (10)

On observed entries, we fuse the prior proposal and the measurement term via a Gaussian product-of-experts:

x̄_(t−1) = (y_(t−1)/σ̃²_(t−1) + x̂_(t−1)/σ_t²) / (1/σ̃²_(t−1) + 1/σ_t²).  (11)

The state is then updated using the same consistency mask M_cons:

x_(t−1) = M_cons ⊙ x̄_(t−1) + (1 − M_cons) ⊙ x̂_(t−1).  (12)
Under this interpretation, y_(t−1) and x̂_(t−1) can be viewed as two Gaussian experts: the former carries the noise-matched observational information, while the latter represents the prior-guided reverse proposal produced by the denoiser and sampler. Their precision-weighted fusion is therefore best understood as a practical Gaussian approximation to the local posterior update, not as a claim of exact Bayesian optimality under the full generative model. As t decreases, diffusion noise diminishes and the constraint naturally tightens; the update approaches hard consistency when σ_obs is small. This interpretation is also consistent with the empirical behavior observed in the sensitivity analysis of σ_obs (Section 4.6.3): moderate nonzero values provide a better balance between observational conditioning and prior guidance than strict hard clamping, whereas excessively large values weaken the contribution of measurements. Unless stated otherwise, σ_obs is defined in the normalized space. When σ_obs is specified in physical units, it is converted to the normalized space by dividing by the training-split standard deviation of the corresponding pollutant.
During re-masking evaluation, the target entries indexed by M_tgt are treated as latent variables and excluded from M_cons and Y_vis, ensuring that they do not influence the reverse diffusion trajectory. The denoiser conditioning similarly uses M_cond = M_vis. With this step-wise consistency update in place, we can use different samplers to trade off speed and accuracy, and obtain uncertainty estimates by repeated posterior sampling.
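The re-noising and precision-weighted fusion described above can be sketched as a per-step function; `soft_consistency_step` is illustrative and assumes the sampler supplies the prior proposal and its step standard deviation:

```python
import numpy as np

def soft_consistency_step(x_prev_prior, y_vis, m_vis, alpha_bar_prev,
                          sigma_prior, sigma_obs, rng=None):
    """Noise-aware soft consistency sketch: re-noise visible observations
    to the current diffusion level, then fuse them with the prior proposal
    by precision weighting (Gaussian product of experts), on visible
    entries only."""
    rng = np.random.default_rng(rng)
    # noise-matched observation term at step t-1
    y_noised = (np.sqrt(alpha_bar_prev) * y_vis
                + np.sqrt(1 - alpha_bar_prev) * rng.standard_normal(y_vis.shape))
    # effective variance of the observation expert
    var_obs = alpha_bar_prev * sigma_obs ** 2 + (1 - alpha_bar_prev)
    var_prior = sigma_prior ** 2
    # precision-weighted fusion of the two Gaussian experts
    fused = ((y_noised / var_obs + x_prev_prior / var_prior)
             / (1 / var_obs + 1 / var_prior))
    return np.where(m_vis, fused, x_prev_prior)    # unobserved entries: keep prior
```

In the late-step limit (alpha_bar_prev near 1 and small sigma_obs), var_obs shrinks and the fused state tracks the measurements, recovering hard-consistency behavior.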
3.4.3. Fast Sampling and Uncertainty Estimation
The consistency updates in Equations (9)–(12) are applied after each solver step and are therefore compatible with different samplers. In experiments, we report results for DDPM and accelerated solvers (DDIM, DPM-Solver++) under varying numbers of function evaluations (NFE). Following the experimental definition in Section 4.1, no clamping refers to standard conditional diffusion where observations enter only through the denoiser inputs (conditioning), without explicit replacement or fusion on the reverse trajectory.
Uncertainty is obtained by Monte Carlo sampling. Given S posterior samples {x̂^(s)}, s = 1, …, S, we estimate moments by:

μ̂ = (1/S) Σ_s x̂^(s),  σ̂² = (1/(S − 1)) Σ_s (x̂^(s) − μ̂)².  (13)
Deterministic metrics use the posterior mean, while probabilistic metrics (e.g., CRPS) are computed from the sample set.
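The moment estimates and a sample-based CRPS can be computed as follows; the ensemble CRPS estimator here is the generic Gneiting–Raftery form, not necessarily the exact scoring implementation used in the experiments:

```python
import numpy as np

def posterior_moments(samples):
    """Monte Carlo posterior summaries from S diffusion samples stacked
    on axis 0: mean field for deterministic metrics, per-entry standard
    deviation as an uncertainty map."""
    mean = samples.mean(axis=0)
    std = samples.std(axis=0, ddof=1)  # unbiased sample standard deviation
    return mean, std

def crps_ensemble(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|,
    computed directly from the ensemble."""
    term1 = np.abs(samples - y).mean(axis=0)
    term2 = np.abs(samples[:, None] - samples[None, :]).mean(axis=(0, 1))
    return term1 - 0.5 * term2
```

The pairwise term makes this estimator quadratic in S, which is acceptable for the modest ensemble sizes typical of diffusion-based uncertainty estimation.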
Training follows the standard conditional diffusion formulation with the noise-prediction objective. For each mini-batch, a diffusion step t is sampled uniformly and ε ~ N(0, I) is drawn to obtain the noisy state x_t via Equation (5). Observed station pollutant entries are randomly partitioned into a visible subset and a target subset, forming M_vis and M_tgt with M_vis + M_tgt = M_st. The denoiser is conditioned on Y_vis = M_vis ⊙ Y, and parameters θ are optimized by minimizing Equation (6), with supervision restricted to the target entries indexed by M_tgt (station entries only).
3.5. Implementation Details
For reproducibility, we provide additional implementation details here. The proposed STGPD framework was implemented in PyTorch 2.0 and trained on a workstation equipped with a single NVIDIA V100 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB VRAM, in a Linux environment (Ubuntu 22.04). The core denoiser adopts a bidirectional graph-recurrent architecture with a hidden dimension of 64. For the diffusion process, we employed a linear noise schedule between the endpoints β_1 and β_T. During inference, DPM-Solver++ was used to accelerate sampling while maintaining reconstruction fidelity. All major hyperparameters, including the learning rate, batch size, graph construction settings, and soft-consistency configuration, were selected based on validation performance; the selected learning rate was retained because it yielded stable convergence in all reported experiments. In particular, the soft-consistency parameter σ_obs was set to 0.05 in the normalized space after sensitivity analysis, so as to provide a practical balance between diffusion-prior guidance and sparse observational constraints.
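The linear schedule and its cumulative products, used by the forward process and by re-noising, can be sketched as follows; the endpoint values are illustrative placeholders rather than the paper's exact configuration:

```python
import numpy as np

def linear_schedule(n_steps, beta_1=1e-4, beta_T=0.02):
    """Linear noise schedule (illustrative endpoints) and the cumulative
    products alpha_bar_t used for forward noising and re-noising."""
    betas = np.linspace(beta_1, beta_T, n_steps)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)  # monotonically decreasing in t
    return betas, alpha_bar
```

Because alpha_bar decreases monotonically toward zero, the noise-matched observation variance in the soft-consistency update grows with t, which is what loosens the constraint early in sampling and tightens it near t = 0.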
4. Experiments and Results
We evaluate STGPD using the Augsburg dataset. Since exhaustive grid-level ground truth is unavailable, quantitative evaluation relies on station-level cross-validation under controlled re-masking and leave-one-station-out protocols. Therefore, the reported results provide indirect but practically relevant evidence for probabilistic grid-field reconstruction under sparse observational constraints, rather than exhaustive verification of all grid-level predictions. We assess performance across four dimensions: (1) reconstruction accuracy under standard sparsity; (2) spatial generalization via station hold-out tests; (3) robustness to extreme data loss; and (4) resilience to sensor noise injection and sampling-efficiency trade-offs. We next describe the experimental protocol and implementation details before presenting results for each evaluation dimension.
4.1. Experimental Setup
To evaluate temporal generalization, the dataset is partitioned chronologically to prevent information leakage from future timestamps. The training, validation, and testing splits follow a standard temporal order. Deterministic metrics are computed on de-normalized values to reflect physical magnitudes, while probabilistic scores (CRPS) are reported in the normalized space to facilitate comparison between PM10 and NO2.
We simulate sparse supervision using a randomized re-masking strategy on station observations. Let M_st denote the binary mask of available station entries. On the test split, we randomly partition observed station entries into two disjoint subsets: a conditioning set (M_vis) used as model input, and a target set (M_tgt) reserved strictly for evaluation. The protocol is:
Input: the model is conditioned on Y_vis = M_vis ⊙ Y together with exogenous covariates (meteorology, traffic, static descriptors) and the graph structure.
Output: the model reconstructs the full spatiotemporal field; error metrics are computed only on entries indexed by M_tgt.
This protocol ensures that target values never influence the sampling trajectory, providing a leakage-free assessment of imputation performance. Unless specified otherwise, 20% of available station entries are assigned to M_tgt. All methods share the same fixed re-masking partition on the test split, generated with a single random seed.
To contextualize performance, we compare STGPD against baselines from three categories:
- 1. Spatial Interpolation: IDW and Kriging, applied per timestamp using station coordinates and the available observations in the visible conditioning set. These methods use only same-timestep information and ignore temporal context.
- 2. Deterministic Deep Learning: BRITS [53] and GRIN [37], which learn spatiotemporal dependencies but output single-point estimates.
- 3. Probabilistic Diffusion Models: state-of-the-art diffusion imputers including CSDI [20], PriSTI [21], SaSDim [22], RDPI [23], and CoFILL [24].
STGPD employs a DDPM sampler by default. The model is trained using Adam with a batch size of 32 and a learning rate selected on the validation split (Section 3.5); early stopping is applied based on validation loss. All experiments were executed on a workstation equipped with a single NVIDIA V100 GPU (NVIDIA Corporation, Santa Clara, CA, USA). For clarity, the observational time series are split chronologically into training, validation, and test subsets, and only station observations are used as supervised references during model optimization and quantitative evaluation. In contrast, the city-grid nodes serve as unlabeled inference targets without exhaustive ground-truth values. Under the standard evaluation protocol, only observed station entries in the test period are re-masked and used as targets, while naturally missing entries remain excluded unless explicitly involved in a dedicated masking protocol.
4.2. Performance Under Random Re-Masking
We first report overall imputation performance under the standard random re-masking setting.
Table 2 reports deterministic metrics under random re-masking. Across both pollutants, STGPD achieves the lowest RMSE/MAE and the highest R². Table 3 reports CRPS for diffusion-based models; STGPD yields the lowest CRPS, indicating improved probabilistic quality under the same supervision. Although the absolute margin over the strongest baseline is moderate under the standard re-masking setting, repeated-run significance tests indicate that the improvement is statistically reliable rather than attributable to random fluctuation alone (Figure 3). Moreover, the practical contribution of STGPD is not limited to average point-error reduction, but also lies in uncertainty-aware conditioning and more stable behavior under sparse, noisy, and partially missing observational settings. This behavior is consistent with the method design, in which observational information is incorporated through noise-aware soft consistency rather than rigid replacement during reverse diffusion.
4.3. Station Outage Extrapolation: Leave-One-Out
We next evaluate spatial extrapolation under full station outages via 4-fold LOO evaluation. In each fold, we mask all pollutant observations from one station for the entire test horizon, while keeping its node in the graph. Exogenous covariates and static descriptors remain available at the held-out station. We evaluate a single model trained with the standard random re-masking objective, without retraining per fold.
Table 4 reports RMSE on the held-out station for each fold, and Table 5 reports the corresponding normalized CRPS. Figure 4 visualizes absolute error distributions. STGPD yields lower errors and tighter interquartile ranges than PriSTI across stations, demonstrating robust spatial extrapolation under station outages.
4.4. Uncertainty Analysis and Interpretation of Grid-Level Reconstructions
We acknowledge that the station-based evaluation protocols used in this study provide only indirect evidence for grid-level reconstruction quality in fully unmonitored areas. Because exhaustive gridded ground truth is unavailable, the reconstructed city-scale fields should be interpreted as probabilistic spatial inferences constrained by station observations, rather than as fully verified deterministic surfaces.
To better justify the grid-level reconstructions, we provide an uncertainty analysis based on the posterior standard deviation of diffusion samples. Figure 5 presents a representative example of the posterior mean field, the corresponding posterior standard deviation field, and the relationship between uncertainty and the distance to the nearest monitoring station. The posterior mean illustrates the inferred city-scale concentration pattern under sparse observational constraints, while the posterior standard deviation provides a spatially explicit diagnostic of confidence in the reconstruction.
A clear spatial pattern can be observed: uncertainty is generally lower near observed stations and increases in less constrained regions, especially where station support is sparse and spatial extrapolation is required. This behavior is consistent with the design of the proposed conditioning mechanism. Grid nodes close to monitoring stations are more strongly constrained by the available observations through the soft-consistency update, whereas nodes farther away rely more heavily on the learned spatiotemporal prior. As a result, posterior uncertainty naturally increases with extrapolation distance.
The positive relationship between posterior uncertainty and the distance to the nearest monitoring station shown in Figure 5c further supports this interpretation. Although this analysis does not replace direct full-field validation, it provides an interpretable confidence diagnostic for identifying where the reconstructed field is more strongly supported by observations and where it should be interpreted with greater caution. Under extremely sparse monitoring conditions, such uncertainty-aware interpretation is particularly important for practical use of city-scale reconstructed air-quality surfaces.
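The distance-based uncertainty diagnostic of Figure 5c can be reproduced schematically; `uncertainty_vs_distance` is an illustrative helper relating per-node posterior standard deviation to nearest-station distance:

```python
import numpy as np

def uncertainty_vs_distance(grid_xy, station_xy, post_std):
    """Diagnostic sketch: per-grid-node distance to the nearest
    monitoring station, and its Pearson correlation with the posterior
    standard deviation of the reconstructed field."""
    d = np.linalg.norm(grid_xy[:, None] - station_xy[None, :], axis=-1)
    d_nearest = d.min(axis=1)                      # nearest-station distance
    r = np.corrcoef(d_nearest, post_std)[0, 1]     # positive r = uncertainty grows
    return d_nearest, r
```

A strongly positive correlation would match the qualitative pattern described above: confidence is highest near stations and decays with extrapolation distance.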
4.5. Robustness to Extreme Sparsity
We further stress-test robustness under extreme information loss, considering both spatial sparsity (fewer conditioning stations) and temporal sparsity (fewer observed timestamps).
We reduce the number of conditioning stations from four down to one. For each k, we select k stations as conditioning stations and mask pollutant observations from the remaining stations in the conditioning mask, while keeping all nodes and covariates available. Evaluation is performed on the same target set indexed by M_tgt, so the reported results remain directly comparable across different k and consistent with the standard re-masking setting. This design isolates the effect of reduced conditioning information while keeping the evaluation target distribution fixed.
For temporal sparsity, we first form M_tgt by masking 20% of observed station entries on the test split. From the remaining entries M_vis, we further subsample a proportion ρ as the effective conditioning mask. All models are evaluated on the same targets indexed by M_tgt. Figure 6 shows that STGPD degrades most gracefully as information decreases, maintaining a clear gap over baselines under both extreme spatial and temporal sparsity.
4.6. Efficiency and Noise-Aware Consistency
Finally, we analyze the trade-off between sampling efficiency and robustness, focusing on how different consistency strategies behave under fewer function evaluations and under explicit sensor-noise injection.
4.6.1. Sampling Efficiency with Fast Solvers
We evaluate DPM-Solver++ with different numbers of function evaluations (NFE): 10, 15, 20, 50, and 200. We compare STGPD (Soft Clamping) against Hard Clamping, Naive Replacement, and No Clamping. All variants are evaluated under the same station-target re-masking protocol. As NFE decreases, all methods exhibit the expected accuracy–efficiency trade-off. Across the tested settings, Soft Clamping consistently provides the most stable conditioning behavior while retaining competitive accuracy at low NFE. This behavior is consistent with the role of the soft-consistency mechanism, which is designed to balance observational conditioning against prior-guided denoising rather than enforcing rigid replacement of measurements.
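One reverse-diffusion step of the soft-consistency mechanism can be sketched as follows. The DDPM-style noise schedule and the specific variance weighting are illustrative assumptions, not the paper's exact update; the point is the qualitative behavior that distinguishes the clamping variants compared above.

```python
import numpy as np

def soft_consistency_step(x_t, y_obs, cond_mask, alpha_bar_t, sigma_obs, rng=None):
    """Noise-aware soft-consistency update at one reverse-diffusion step.

    Re-noises the clean measurements y_obs to the current diffusion level,
    then blends them into x_t at conditioned entries with a variance weight.
    sigma_obs = 0 recovers hard clamping (exact replacement with the
    re-noised measurement); very large sigma_obs approaches no clamping.
    """
    rng = np.random.default_rng(rng)
    # Forward-noise the observations to match the current noise level.
    eps = rng.standard_normal(y_obs.shape)
    y_t = np.sqrt(alpha_bar_t) * y_obs + np.sqrt(1.0 - alpha_bar_t) * eps

    # Variance-weighted fusion: trust measurements less as sigma_obs grows.
    sigma_t2 = 1.0 - alpha_bar_t
    w = sigma_t2 / (sigma_t2 + sigma_obs**2)

    out = x_t.copy()
    out[cond_mask] = w * y_t[cond_mask] + (1.0 - w) * x_t[cond_mask]
    return out
```

Re-noising the measurements before fusing them avoids the clean-noisy mismatch of naive replacement, which pastes clean values into a sample that the solver expects to carry the current noise level.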
4.6.2. Robustness to Sensor Noise
We inject additive Gaussian noise into conditioning observations at inference time, i.e., the sampler conditions on the measurements plus zero-mean Gaussian perturbations, sweeping over several noise standard deviations. We compare Soft Clamping (with an adaptively set observation-noise parameter) against Hard Clamping, Naive Replacement, and No Clamping.
Figure 7 shows that Soft Clamping remains stable across noise levels, whereas Hard Clamping and Naive Replacement degrade as they force the sampler to fit corrupted measurements. This result is consistent with the noise-aware design of the proposed fusion step, in which the confidence assigned to measurements is adjusted through the observation-noise parameter instead of treating all observations as equally reliable.
4.6.3. Sensitivity to the Observation-Noise Parameter
To further clarify the role of the soft-consistency parameter, we vary the observation-noise parameter while keeping all other settings fixed and evaluate both reconstruction accuracy (RMSE) and probabilistic quality (CRPS). Figure 8 shows that moderate nonzero values provide the best balance between observational conditioning and diffusion-prior guidance. In contrast, hard clamping (the zero-noise limit) over-trusts the observations, whereas excessively large values weaken conditioning and degrade performance. These results support interpreting this quantity as an effective observation-noise parameter that controls the confidence assigned to conditioning measurements, rather than as an exact estimate of raw sensor error.
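CRPS, the probabilistic-quality metric used in this sweep, can be estimated directly from posterior samples. A minimal sample-based estimator in the standard energy form (the function name is ours, not the paper's):

```python
import numpy as np

def crps_ensemble(samples, y):
    """Sample-based CRPS estimate for one scalar target y.

    samples: (M,) draws from the predictive distribution.
    Energy form: E|X - y| - 0.5 * E|X - X'|.
    Lower is better; for a degenerate (point) forecast it reduces
    to the absolute error.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2
```

Because CRPS rewards both calibration and sharpness, it complements RMSE when comparing soft-consistency settings: an over-confident posterior can score well on RMSE yet poorly on CRPS.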
4.7. Interpreting the Learned Spatial Receptive Field
To provide a complementary interpretability view, we analyze the learned attention weights over 19 buffer radii (0.1–5.0 km) used in static geographic descriptors.
Figure 9 indicates that NO2 places relatively more weight on localized context, whereas PM10 exhibits a broader receptive field, consistent with their expected spatial characteristics.
The distinct scale sensitivities of NO2 and PM10 can be plausibly linked to differences in their dominant urban atmospheric processes. NO2 is more strongly associated with local traffic-related emissions and near-road concentration gradients, so descriptors extracted at relatively small buffer radii are expected to be more informative. In contrast, PM10 is influenced not only by local sources but also by broader-scale processes such as regional transport, background concentrations, and resuspension. This may help explain why PM10 exhibits weaker importance at very small radii and a comparatively stronger contribution from larger contextual scales. Therefore, the observed scale-dependent descriptor weights are broadly consistent with the differing spatial representativeness of traffic-dominated versus more regionally modulated pollutants.
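The receptive-field comparison above can be summarized numerically. The sketch below assumes per-radius attention logits are available from the trained model (the radius grid and helper names are illustrative); it normalizes them into weights and reduces each pollutant's profile to an attention-weighted effective radius.

```python
import numpy as np

# 19 buffer radii spanning 0.1-5.0 km, as used for the static descriptors.
radii_km = np.round(np.linspace(0.1, 5.0, 19), 2)

def radius_attention(logits):
    """Normalise per-radius logits into attention weights (softmax)."""
    z = logits - logits.max()          # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def effective_radius(logits):
    """Attention-weighted mean radius: a one-number receptive-field summary."""
    w = radius_attention(logits)
    return float((w * radii_km).sum())
```

Under this summary, an NO2-like profile peaked at small radii yields a small effective radius, while a flatter PM10-like profile yields a larger one, matching the qualitative reading of Figure 9.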
5. Conclusions
We studied city-scale spatiotemporal air-quality reconstruction in Augsburg from sparse and unreliable station measurements, focusing on PM10 and NO2. We proposed STGPD, a graph-structured diffusion framework that formulates reconstruction as constrained posterior-guided sampling and enforces observation consistency throughout the reverse diffusion trajectory. The key component is a noise-aware soft-consistency update that re-noises measurements to the current diffusion level and adaptively relaxes constraints under sensor corruption, avoiding the clean–noisy mismatch introduced by hard replacement.
Across all evaluated stress scenarios, including reduced station availability, heavy temporal missingness, full station outages, and explicit sensor-noise injection, STGPD shows lower error growth and more stable uncertainty behavior than the evaluated baselines. These gains are driven by variance-weighted fusion, which balances the model prior with a noise-matched observation term instead of forcing exact agreement with potentially corrupted readings. In addition, the learned multi-scale spatial attention highlights pollutant-specific receptive fields across buffer radii, providing an interpretable view of which spatial scales contribute most to reconstruction. These properties suggest that STGPD may be useful for urban exposure assessment and policy-relevant environmental mapping under sparse monitoring, although further validation beyond the present case study remains necessary. Extending the evaluation to additional cities and incorporating richer transport-aware graph constructions are natural next steps.
This study also has several limitations. First, the analysis is based on a single-city case study with a very sparse monitoring network, which limits the strength of broader generalization claims across cities, climatic conditions, or monitoring-density regimes. Second, although the proposed framework reconstructs dense city-grid air-quality fields, exhaustive gridded ground truth is unavailable; therefore, grid-level performance in fully unmonitored areas can only be assessed indirectly through station-based validation protocols and uncertainty analysis. In this sense, the current evaluation provides supporting evidence for probabilistic spatial reconstruction under sparse observational constraints, rather than exhaustive verification of all grid-level predictions. Third, source-related heterogeneity is represented indirectly through geographic descriptors and exogenous covariates rather than through explicit source-resolved emission modeling. Consequently, the present results should be interpreted as evidence from a promising sparse-monitoring case study rather than as definitive proof of universal superiority or broad external generalization. Future work should extend the analysis to additional cities, denser monitoring settings, and independent spatial references where possible.