1. Introduction
Urban grid-scale population inflow prediction plays a pivotal role in smart city management and emergency decision-making. Based on historical flow data, population inflow prediction models estimate future inflow across urban grids, which is critical for optimizing urban planning, improving traffic control, enhancing emergency response, and allocating public service resources efficiently [1,2]. With accelerating urbanization and increasing demands for fine-grained urban management, population inflow prediction has emerged as a key research priority in urban computing and smart city domains [3]. However, urban population mobility exhibits significant spatiotemporal heterogeneity, as the underlying temporal data are governed by both periodic temporal regularities and urban spatial functional structures, thereby posing substantial challenges for accurate prediction.
Existing urban population prediction methods are primarily grounded in two paradigms: traditional time series analysis and deep learning techniques [4,5]. Traditional statistical models, including Autoregressive Integrated Moving Average (ARIMA) [6], Vector Autoregression (VAR) [7], and the Kalman Filter [8], offer theoretical advantages and interpretability in capturing linear temporal trends but struggle to handle complex nonlinear relationships and high-dimensional spatial features. The emergence of deep learning methods has enabled new breakthroughs in complex forecasting tasks. Recurrent Neural Networks (RNNs) [9] and their variants, such as Long Short-Term Memory (LSTM) [10] and Gated Recurrent Unit (GRU) [11], have demonstrated strong capabilities in modeling sequential dependencies. Furthermore, graph-based approaches, including Graph Neural Networks (GNNs) [12], Graph Attention Networks (GATs) [13], and Spatiotemporal Graph Convolutional Networks (STGCNs) [14], have achieved significant progress in modeling spatial correlations. However, these methods predominantly rely on deterministic modeling frameworks that provide only point estimates without quantifying predictive uncertainty [15,16]. This limitation is particularly critical in practical urban management applications: decision-makers require not only predicted values but also assessments of prediction reliability and potential risks to formulate robust management strategies.
Despite the substantial progress achieved in spatiotemporal modeling, most existing studies continue to employ deterministic architectures that produce point estimates and neglect explicit uncertainty quantification. In contrast, some recent studies have begun to incorporate probabilistic reasoning into graph-based spatiotemporal frameworks. For example, Wang et al. [17] proposed the Prob-GNN framework for quantifying spatiotemporal uncertainty in urban travel flow prediction, marking one of the first attempts to embed probabilistic graph neural networks (PGNNs) into traffic forecasting. Similarly, Gao et al. [18] developed an uncertainty-aware probabilistic graph learning model that explicitly captures prediction variance across multiple forecast steps for traffic risk assessment. While these approaches have demonstrated the potential of PGNNs to model spatiotemporal uncertainties, they still rely on indirect approximations for uncertainty estimation and suffer from limited calibration fidelity, leaving notable gaps in robust and scalable probabilistic prediction.
Uncertainty quantification is critical for urban prediction tasks. Existing uncertainty modeling approaches, such as Bayesian Neural Networks [19,20] and Deep Gaussian Processes [21], attempt to incorporate uncertainty estimation into neural network frameworks but often suffer from high computational complexity, convergence difficulties, or overly restrictive assumptions. In practical applications, uncertainty information provides direct value for decision-making. For instance, Figure 1 illustrates the predicted passenger flow at subway stations. As shown in the black-boxed region, deterministic methods fail to indicate prediction reliability. In contrast, the probabilistic approach reveals substantially higher uncertainty (depicted by the green shaded area), suggesting potential passenger flow surges in this region. Therefore, developing methods that simultaneously deliver accurate predictions and reliable uncertainty estimates is of paramount importance.
In recent years, the advent of generative artificial intelligence has opened new paradigms for addressing complex prediction problems. Recent studies have investigated Variational Autoencoders (VAEs) [22] and Generative Adversarial Networks (GANs) [23] as effective tools for probabilistic representation and sample-based uncertainty modeling. Huang et al. [24] proposed a VAE-based generative model of urban human mobility that combines a latent variable architecture with a sequence-to-sequence structure, enabling realistic reconstruction and simulation of mobility trajectories under data sparsity. Mo, Fu, and Di [25] developed PhysGAN-TSE, which explicitly quantifies uncertainty by integrating stochastic traffic flow models into the adversarial training process. These studies demonstrate the effectiveness of generative models in representing probabilistic distributions of urban dynamics. However, their adversarial or variational inference mechanisms often lead to unstable training, insufficient probabilistic calibration, and limited scalability in large-scale forecasting tasks.
Building on the progress of generative modeling, diffusion models have recently emerged as a powerful and theoretically grounded alternative. As a distinctive class of generative methods, diffusion models exhibit greater training stability and higher sample quality than VAEs and GANs. Representative variants include Denoising Diffusion Probabilistic Models (DDPMs) [26,27], Score-based Generative Models [28,29], and SDE-based Diffusion Models [30]. These approaches achieve probabilistic forecasting by iteratively perturbing and reconstructing data through controlled noise processes. Recent studies have further demonstrated the effectiveness of diffusion mechanisms for time-series forecasting. Biloš et al. [31] introduced a function-space diffusion framework that models temporal dynamics as continuous stochastic processes. Their method defines perturbation and denoising operations directly on continuous functions, allowing the model to handle irregularly sampled sequences and produce calibrated probabilistic forecasts through stochastic process diffusion. Yuan et al. [32] extended the use of diffusion mechanisms to spatio-temporal point processes by proposing a unified probabilistic framework capable of learning complex joint spatial-temporal distributions. They incorporated a spatio-temporal co-attention module to capture interdependent features between event time and location, yielding significant performance improvements across tasks such as epidemic spread, urban mobility, and crime prediction. Although these studies reveal the strong capability of diffusion mechanisms for modeling stochastic temporal dependencies, no existing research applies diffusion models to urban grid-scale population inflow prediction or integrates functional semantic information, such as POI attributes, as conditional guidance.
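To make the phrase "iteratively perturbing and reconstructing data through controlled noise processes" concrete, the following is a minimal sketch of the forward (perturbation) half of a DDPM. The linear schedule from 1e-4 to 0.02 over 1000 steps follows the original DDPM formulation and is an assumption here, not a setting reported for PDCDM; the function names are illustrative.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, linearly spaced (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in closed form."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

# Hypothetical normalized inflow window for one grid (illustrative values only).
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
inflow_window = [0.2, 0.5, 0.9, 0.4]
noisy_window, target_noise = q_sample(inflow_window, 500, alpha_bars, random.Random(0))
```

In training, a denoising network is fitted to recover `target_noise` from `noisy_window` and the step index; in a conditional model, it would additionally receive the conditioning information. Sampling then runs the learned denoiser backwards from pure noise, and repeating that procedure yields the multiple forecast samples used for uncertainty estimates.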
Urban functional heterogeneity is a key factor influencing population mobility patterns, as well as a core challenge for diffusion models in urban prediction. Studies in spatial econometrics and geographic information science [33,34] demonstrate that different functional zones, due to variations in service types, facility density, and spatial accessibility, exhibit distinctly different population agglomeration dynamics and temporal variations. POIs, as direct proxies for urban functions [35], encode rich spatial semantics: commercial areas attract substantial commuting flows on weekdays, areas around educational facilities display typical tidal flow patterns, and residential zones form population flow dynamics complementary to those of employment centers. These spatial functional attributes not only shape the magnitude and temporal distribution of population inflow but also determine the degree and spatial variability of prediction uncertainty.
Given the limitations of existing methods and the practical demands of urban forecasting tasks, this study addresses three core challenges: (1) how to quantify uncertainty while providing accurate predictions to meet risk assessment needs in urban management decision-making; (2) how to effectively integrate information about urban spatial heterogeneity, particularly the semantic constraints imposed by POI functional characteristics on population flows; and (3) how to design a probabilistic modeling framework suited to the characteristics of urban grid data, handling high-dimensional time-series data in real-world scenarios while maintaining computational efficiency.
Building upon the above analysis, this study proposes PDCDM, a novel urban grid inflow prediction model. PDCDM integrates urban functional semantic information into the conditional diffusion generation process by extracting static POI features from each grid as conditional guidance. The model employs a dual-dimensional Transformer architecture to separately capture temporal dependencies and inter-grid feature correlations, achieving progressive feature fusion through residual connections. The main contributions of this study can be summarized as follows:
Introduction of conditional diffusion models to urban grid-scale population inflow prediction, simultaneously achieving accurate point estimates and reliable uncertainty quantification through a probabilistic generative framework, thereby offering a novel modeling paradigm for risk-sensitive urban management decisions.
Design of a POI-enhanced conditional encoding framework that systematically incorporates urban functional semantics into the diffusion process, enabling the model to capture zone-specific population agglomeration patterns and enhance spatial prediction adaptability.
Design of a decoupled dual-dimensional attention mechanism that separately models temporal dependencies and inter-grid correlations, thereby enhancing model representation of complex spatiotemporal patterns.
Comprehensive validation on real-world urban datasets demonstrating that the proposed model outperforms baseline approaches in both prediction accuracy and uncertainty quantification, with systematic ablation studies confirming the effectiveness of each component and the model’s practical utility in complex urban scenarios.
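As an illustration of the decoupled dual-dimensional attention named in the contributions above, the sketch below applies self-attention first along the time axis of each grid and then across grids at each time step, with a residual connection after each pass. It is a simplified stand-in under stated assumptions: single-head attention with identity query/key/value projections, no normalization or feed-forward layers, and hypothetical function names. It is not the PDCDM implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)])
    return outputs

def add_residual(tokens, updates):
    """Element-wise sum of two equally shaped token lists."""
    return [[a + b for a, b in zip(tok, upd)] for tok, upd in zip(tokens, updates)]

def dual_axis_block(x):
    """x[g][t] is the feature vector of grid g at time step t; returns a new nested list."""
    # Temporal pass: each grid attends over its own time steps.
    x = [add_residual(series, self_attention(series)) for series in x]
    # Inter-grid pass: at each time step, all grids attend to one another.
    for t in range(len(x[0])):
        column = [x[g][t] for g in range(len(x))]
        mixed = add_residual(column, self_attention(column))
        for g in range(len(x)):
            x[g][t] = mixed[g]
    return x

# Two grids, three time steps, two channels (illustrative values only).
features = [[[0.1, 0.3], [0.2, 0.1], [0.4, 0.2]],
            [[0.5, 0.0], [0.3, 0.3], [0.1, 0.6]]]
fused = dual_axis_block(features)
```

Handling the two axes in separate passes keeps each attention map small (time-by-time or grid-by-grid) instead of attending over every grid-time pair at once, which is one common motivation for decoupled designs of this kind.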
4. Results
4.1. Performance Comparison
Probabilistic modeling approaches for urban grid-level population inflow forecasting remain underexplored. Therefore, we benchmark PDCDM against state-of-the-art probabilistic and deterministic methods from the time series forecasting domain, including probabilistic approaches (SSSD, DiffusionTS, CSDI) and deterministic approaches (PatchTST, iTransformer, TimeMixer, TimesNet, FEDformer). We employ the Continuous Ranked Probability Score (CRPS) as the primary metric for probabilistic forecasting performance, alongside MAE and RMSE to assess point prediction accuracy across all models.
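For readers unfamiliar with CRPS, the following minimal sketch shows a standard sample-based estimator, CRPS ≈ E|X − y| − ½·E|X − X′|, where X and X′ are independent draws from the forecast and y is the observation. This is one common estimator and is offered as an assumption; the exact CRPS variant and normalization used in the experiments are not restated in this section.

```python
def crps_from_samples(samples, observation):
    """Sample-based CRPS estimate for one scalar observation (lower is better)."""
    n = len(samples)
    # Average distance between forecast samples and the observed value.
    term_obs = sum(abs(s - observation) for s in samples) / n
    # Average pairwise distance between samples, which credits sharpness.
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread

# With a single sample the score reduces to the absolute error.
print(crps_from_samples([2.0], 5.0))       # 3.0
print(crps_from_samples([1.0, 3.0], 2.0))  # 0.5
```

Because the score collapses to the absolute error for a one-sample forecast, CRPS can be read on the same scale as MAE, which is why the two metrics are reported side by side.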
As shown in Table 4, PDCDM achieves the best performance across all evaluation metrics. Compared to the strongest baseline, PDCDM reduces MAE by 11.7%, RMSE by 14.9%, and CRPS by 8.6%. Among probabilistic baselines, DiffusionTS performs competitively. As a diffusion model specifically designed for time series forecasting, DiffusionTS effectively captures temporal dependencies through its Transformer architecture, highlighting the strengths of diffusion models for sequential modeling. However, it does not account for urban spatial heterogeneity, which limits its performance on grid-level forecasting tasks. SSSD exhibits limited performance in our forecasting task, as its structured state space model struggles with high-dimensional spatial data. While CSDI demonstrates strong capability in spatial prediction, its imputation-oriented design fails to fully leverage the autoregressive nature of forecasting, thereby limiting its predictive accuracy. Beyond point forecasting accuracy, PDCDM achieves the lowest CRPS of all compared models. Because CRPS rewards predictive distributions that are both well calibrated and sharp, this result indicates that PDCDM delivers not only accurate point forecasts but also more reliable probabilistic forecasts than the baselines.
Inference Time. Diffusion-based prediction methods are generally much slower than other methods because inference requires iterative denoising, yet they deliver superior probabilistic prediction performance. We therefore restrict the inference-speed comparison to the diffusion-based methods: SSSD, DiffusionTS, CSDI, and the proposed PDCDM. Table 5 reports the average inference time of these four diffusion models when generating different numbers of samples on a single grid. We observe that SSSD is extremely time-consuming due to its recurrent state space architecture. Our model achieves approximately 30-fold acceleration compared to SSSD, owing to its non-autoregressive parallelized architecture. Despite employing a decoupled Transformer structure, DiffusionTS still incurs high inference time, requiring 103.783 s to generate 30 samples. PDCDM achieves efficient inference through parallelization in the temporal dimension and optimized network design, generating 30 samples in just 13.894 s. This is comparable to CSDI (10.662 s), which also employs a non-autoregressive architecture.
Visualization. To illustrate PDCDM’s forecasting performance, we visualize the probabilistic prediction distributions for representative grids in Figure 8. The model quantifies uncertainty by generating 15 prediction samples, from which we derive the median prediction (green solid line) and two confidence intervals: the 50% CI (dark green shaded area, spanning the 25th–75th percentiles) and the 90% CI (light green shaded area, spanning the 5th–95th percentiles). As shown in Figure 8, during the historical period (timesteps 0–480, marked with red crosses), observed values cluster tightly around the median prediction and fall almost entirely within the confidence intervals, demonstrating the model’s ability to accurately capture historical patterns. More critically, during the forecasting horizon (timesteps 480–576, marked with blue circles), ground truth values across all grids predominantly lie within the 90% confidence interval, validating the model’s probabilistic reliability. Across different grids, PDCDM maintains consistent prediction quality while successfully capturing spatial heterogeneity.
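The interval construction described above can be sketched as follows: for each time step, the generated samples are sorted and the 5th, 25th, 50th, 75th, and 95th percentiles are read off, and the share of observations falling inside a band gives its empirical coverage. The linear interpolation rule and all function names are assumptions made for illustration; the section does not specify how percentiles were interpolated.

```python
def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0-100) of a pre-sorted list."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1.0 - weight) + sorted_values[upper] * weight

def prediction_bands(samples_per_step):
    """For each time step, map percentile level -> value across the forecast samples."""
    bands = []
    for draws in samples_per_step:
        ordered = sorted(draws)
        bands.append({q: percentile(ordered, q) for q in (5, 25, 50, 75, 95)})
    return bands

def empirical_coverage(bands, observed, lower_q, upper_q):
    """Fraction of observations that fall inside the [lower_q, upper_q] band."""
    inside = sum(1 for band, y in zip(bands, observed)
                 if band[lower_q] <= y <= band[upper_q])
    return inside / len(observed)

# Two time steps with 15 hypothetical samples each (illustrative values only).
samples = [[float(v) for v in range(15)], [float(v) for v in range(15)]]
bands = prediction_bands(samples)
coverage_90 = empirical_coverage(bands, [7.0, 100.0], 5, 95)  # one hit, one miss
```

For a well-calibrated model, the empirical coverage of the 5th–95th percentile band over the forecasting horizon should be close to 90%, which gives a numerical counterpart to the visual check in Figure 8.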
Notably, the confidence interval width adapts to prediction difficulty: during stable periods, narrower intervals reflect higher predictive certainty, while during volatile periods (e.g., the peak region in Grid 5 at timesteps 500–570), the model produces wider uncertainty bounds. Furthermore, the visualizations clearly reveal PDCDM’s ability to learn periodic patterns in population inflow data. Regular oscillations observed during the historical period are effectively extrapolated to the forecasting horizon. For instance, in Grid 15’s forecasting period, the model accurately captures the fluctuation pattern between timesteps 520–550, with the median prediction closely aligning with ground truth. This demonstrates the model’s capacity to learn and extrapolate long-range temporal dependencies in extended forecasting scenarios. Together, adaptive uncertainty quantification and precise spatiotemporal pattern recognition enable PDCDM to provide reliable decision support for practical applications such as urban planning.
4.2. Ablation Study
To systematically evaluate the contribution of each component to the overall performance, we conduct ablation studies by removing individual modules from PDCDM. Table 6 summarizes the results across different configurations.
PDCDM comprises three core components: the POI feature fusion module, the temporal Transformer layer, and the spatial Transformer layer. Each ablated variant removes exactly one of these components, and its effect on probabilistic forecasting performance is evaluated using MAE and CRPS.
The complete PDCDM achieves the best performance across all metrics. Removing the temporal Transformer layer yields the most severe performance degradation, with MAE and CRPS increasing to 0.421 and 0.464, corresponding to error increases of 84.6% and 98.2%, respectively. The temporal Transformer leverages self-attention to construct temporal representations for each grid, adaptively capturing periodic patterns and long-range dependencies. Without it, the model loses its capacity to model temporal dynamics, confirming that temporal dependency modeling is critical for accurate forecasting.
Removing the spatial Transformer layer causes MAE and CRPS to rise to 0.331 and 0.362, increasing errors by 45.2% and 55.4% relative to PDCDM. The spatial Transformer learns inter-grid dependencies through global self-attention. Its removal reduces the model to processing grids independently, preventing it from capturing cross-regional flow propagation and spatial coordination effects.
Removing the POI feature fusion module increases MAE and CRPS to 0.269 and 0.261, representing error increases of 18.0% and 12.0% relative to PDCDM. POI features encode urban functional attributes to provide static semantic context. Without them, the model cannot capture function-driven flow dynamics, reducing its generalization capability across heterogeneous functional zones and limiting prediction interpretability.
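The relative increases quoted in the three paragraphs above follow the usual definition, 100 × (ablated − full) / full. Since the full-model scores from Table 6 are not restated in this section, the sketch below only shows the arithmetic and how a full-model score can be backed out from an ablated score and its reported percentage. The variant labels and function names are illustrative.

```python
def relative_increase(ablated, full):
    """Percentage increase of an ablated score over the full-model score."""
    return 100.0 * (ablated - full) / full

def implied_full_score(ablated, pct_increase):
    """Full-model score implied by an ablated score and its reported increase."""
    return ablated / (1.0 + pct_increase / 100.0)

# (ablated score, reported % increase) as quoted in the text above.
reported = {
    "without temporal Transformer": {"MAE": (0.421, 84.6), "CRPS": (0.464, 98.2)},
    "without spatial Transformer": {"MAE": (0.331, 45.2), "CRPS": (0.362, 55.4)},
    "without POI fusion": {"MAE": (0.269, 18.0), "CRPS": (0.261, 12.0)},
}

for variant, metrics in reported.items():
    for name, (score, pct) in metrics.items():
        print(f"{variant:30s} {name:4s} implied full-model score: "
              f"{implied_full_score(score, pct):.4f}")
```

Comparing the implied full-model scores across the three variants is a quick consistency check on the reported percentages; they should agree up to the rounding of the published figures.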
5. Conclusions
This paper presents PDCDM, a probabilistic forecasting framework for urban grid population inflow. PDCDM pioneers the application of conditional diffusion models to grid-scale population forecasting, overcoming the limitations of traditional deterministic approaches. Beyond delivering accurate point forecasts, it quantifies prediction uncertainty to support risk-aware urban management decisions. In particular, the model represents predictive uncertainty adaptively through its probabilistic diffusion mechanism, as reflected in its lowest CRPS among the compared models and in confidence intervals that widen during volatile periods while still covering most observed values.
The model incorporates a POI-enhanced conditional encoding framework that systematically integrates urban functional semantics into the diffusion process. By constructing multi-scale static POI feature representations, PDCDM effectively captures population aggregation patterns across diverse functional zones, enhancing prediction accuracy in spatially heterogeneous urban areas. Furthermore, we employ a dual-dimensional attention mechanism to decouple temporal and grid feature modeling, with separate pathways processing temporal dependencies and inter-grid spatial correlations through Transformer-based attention, while residual connections enable progressive feature fusion. Experiments on real-world urban datasets demonstrate that PDCDM outperforms all eight probabilistic and deterministic baselines considered, reducing MAE by 11.7%, RMSE by 14.9%, and CRPS by 8.6% relative to the strongest baseline.
Although the current experimental validation is conducted using grid-level inflow data from Beijing due to dataset availability, the proposed PDCDM framework is in principle generalizable to other cities or countries. The model depends only on grid-scale population inflow time series and corresponding POI features, which are widely obtainable in urban environments. For regions lacking sufficient historical data, PDCDM could be extended through transfer learning, cross-city pretraining, or regional similarity analysis to leverage prior knowledge from data-rich areas. Such strategies, which remain to be validated, could enhance the model’s adaptability and applicability across diverse urban contexts.
However, several limitations should be acknowledged. First, due to GPU memory constraints, a 5% grid sampling strategy was adopted for validation and testing, which may introduce minor sampling bias in performance assessment. Second, the model relies on static POI data that do not capture temporal variations, and it excludes external factors such as weather conditions and special temporal contexts (e.g., holidays, events); incorporating these factors could further improve prediction accuracy. Third, the model’s performance is inherently dependent on the accuracy and representativeness of the POI data: if the POI information deviates from actual urban functional distributions, prediction deviations may emerge. Future work will address these limitations through full-scale evaluation on complete grid sets and by expanding the conditional framework to integrate dynamic environmental and temporal factors, thereby enhancing the model’s robustness and applicability to real-world urban management scenarios.