1. Introduction
Origin–Destination (OD) flow prediction forms a cornerstone of intelligent transportation research, aiming to uncover the spatiotemporal dynamics of urban mobility between different origin–destination pairs. By leveraging historical records and multi-source contextual data, OD prediction elucidates the distribution and evolution of travel demand, offering valuable insights for traffic management, resource allocation, and urban planning. With the increasing complexity of modern transportation systems, travel behaviors have become highly nonlinear, multi-scale, and temporally dynamic. Accurately modeling such intricate spatiotemporal patterns under data uncertainty and noise remains a central challenge to achieving reliable traffic forecasting. Notably, while diffusion models have demonstrated remarkable modeling potential across domains such as image generation [1,2,3,4], text synthesis [5,6,7], speech processing [8,9], and video production [10,11,12,13], their application to spatiotemporal forecasting in traffic scenarios remains constrained. Specifically, existing approaches often struggle to simultaneously capture multi-scale traffic dynamics and complex spatial semantic dependencies, thereby limiting both prediction accuracy and model generalization capabilities.
At their core, diffusion models learn to reconstruct realistic data by progressively denoising random noise, effectively simulating the reverse diffusion process that maps noise to the underlying data distribution. Initially introduced in image generation, diffusion frameworks such as Denoising Diffusion Probabilistic Models (DDPM) [14] and Stochastic Differential Equation-based (SDE-based) models [15] have since evolved to handle multimodal data with improved theoretical consistency and flexibility. In time series modeling, TimeGrad [16] enhances predictive capability through time-dependent diffusion structures, while ScoreGrad [17] improves generative stability by refining score function estimation.
Diffusion-based approaches have also shown growing potential in complex temporal and multivariate domains. The Conditional Score-based Diffusion Imputation (CSDI) model [18] incorporates conditional information to guide the denoising process, achieving high precision in imputation and generation tasks. The Discrete Spatial-Temporal Diffusion (DSPD) and Continuous Spatial-Temporal Diffusion (CSPD) frameworks [19] further integrate spatiotemporal dependencies, enabling joint modeling across traffic, meteorological, and other multimodal systems. More recently, TimeDiff [20] and TSDiff [21] have introduced contextual conditioning to better characterize irregular temporal intervals and abrupt pattern changes, pushing the boundaries of diffusion modeling for complex, real-world temporal dynamics.
Although diffusion models have demonstrated remarkable generative capability in time-series modeling, two fundamental limitations remain.
First, the utilization of spatial semantic features is insufficient. Current research primarily focuses on modeling temporal dependencies or topological correlations. For example, ScoreGrad and TimeGrad improve predictive stability by introducing time-dependent diffusion processes, yet they largely rely on the numerical evolution of sequences while lacking explicit incorporation of urban semantic priors. In real-world transportation systems, however, different urban regions exhibit strong functional heterogeneity, where the spatial distribution of Points of Interest (POIs) implicitly reflects travel purposes and human activity patterns. Neglecting such semantic information prevents models from distinguishing regions that are structurally similar but functionally distinct. For instance, the OD flow dynamics between commercial and residential areas may appear similar at the data level but are driven by fundamentally different mechanisms. Consequently, purely temporal diffusion models tend to produce unrealistic flow correlations in such cases, thereby restricting their generalization capability in complex urban settings.
Second, the ability to model spatiotemporal interactions remains limited. Some extensions, such as CSDI and TSDiff, incorporate conditional information into the diffusion process to enhance temporal dependency modeling or contextual representation. Nevertheless, these methods usually employ static embedding fusion, making it difficult to dynamically capture interactions among spatiotemporal features during the denoising phase. For traffic flow tasks characterized by both periodic and sudden fluctuations, static conditioning cannot reflect temporal variations in semantic importance, resulting in coarse-grained denoising and delayed feature response. This limitation becomes particularly apparent during traffic surges, disruptions, or demand bursts in POI-dense areas. Therefore, there is an urgent need for a generative prediction framework capable of adaptively integrating multi-source semantic information and capturing fine-grained spatiotemporal correlations throughout the diffusion process.
To address these challenges, we propose a Cross-Attention Diffusion Model for short-term OD flow prediction. Built upon the generative diffusion framework, Cross-Attention Diffusion Model (CADM) introduces semantic conditioning by incorporating POI embeddings into the noise modeling process, enabling external knowledge–guided generation control. Specifically, the model takes noisy OD features and the diffusion time step t as primary inputs, while using POI embedding vectors as spatial semantic conditions. On this basis, a Cross-Attention Module is designed to achieve bidirectional fusion between semantic and spatiotemporal representations. In this mechanism, OD spatiotemporal features serve as queries (Q), and POI embeddings act as keys (K) and values (V). Through attention-based weighting, the model dynamically captures semantic relevance, allowing it to focus on semantically related regions at each diffusion step and thereby perform semantically guided noise prediction.
This design addresses two critical gaps in existing diffusion-based traffic forecasting. First, in terms of novelty, while conventional diffusion models treat spatial regions uniformly during denoising, CADM explicitly distinguishes functionally heterogeneous areas through POI-guided cross-attention, enabling the model to adaptively prioritize semantically relevant regions at each diffusion step. Unlike static semantic fusion approaches, the proposed mechanism dynamically modulates attention weights throughout the reverse process, allowing semantic guidance to evolve stage-by-stage in response to noise reduction. Second, regarding practical importance, this semantic-aware generation framework enhances both interpretability and spatial consistency—particularly in multifunctional urban zones where purely data-driven models often produce semantically inconsistent flows. By grounding traffic generation in urban functional context, CADM offers a more explainable and controllable forecasting tool for real-time traffic management and dynamic transportation planning.
The remainder of this paper is organized as follows. Section 2 introduces the data preparation pipeline and presents the overall architecture of CADM, including the temporal encoding module, U-Net denoising network, cross-attention semantic fusion mechanism, and conditional diffusion generation process. Section 3 reports comprehensive experimental results on real-world urban datasets, comparing CADM against 13 baseline methods and conducting ablation studies to validate the contribution of each component. Finally, Section 4 discusses the findings, identifies current limitations, and outlines promising directions for future research.
2. Materials and Methods
2.1. Data Preparation
In this study, a multi-source spatiotemporal dataset is constructed from real-world urban mobility data, where regional Origin–Destination flows and semantic representations of Points of Interest are utilized as key inputs for short-term traffic flow forecasting.
The data processing pipeline comprises three main stages—POI feature extraction, OD matrix construction, and multi-source sample generation—as depicted in Figure 1.
2.1.1. POI Data and Semantic Feature Construction
The Point of Interest data utilized in this study are obtained from AMap’s internal geographic information database, which provides high spatial accuracy and rich semantic attributes. The dataset covers a broad spectrum of urban functional categories, including dining, commercial, educational, office, recreational, medical, and transportation hub facilities. Due to privacy protection regulations and data licensing requirements, the POI data are used exclusively for academic research and are not publicly distributed or shared. The original POI records include essential metadata such as unique identifiers, names, functional categories, geographic coordinates, and administrative divisions. A representative example is shown in Table 1.
Each record contains static attributes including the longitude and latitude coordinates (location), functional category (type), and administrative division code (adcode) of the POI, which facilitate multi-level spatial aggregation and semantic analysis. To construct semantic feature vectors corresponding to traffic stations, this study utilizes POI data from Beijing. We designate the geographic center of each station as an aggregation unit and establish a spatial statistical radius of 500 m. Within each unit, the quantity of POIs across different functional categories is tallied to derive the functional distribution characteristics of the area:

$$\mathbf{p}_i = \big[c_{i,1},\ c_{i,2},\ \ldots,\ c_{i,K}\big]$$

Formally, $c_{i,k}$ represents the number of POIs belonging to the k-th functional category within the i-th station area, where K denotes the total number of POI categories.
All POI features are first normalized using min–max scaling to mitigate magnitude inconsistencies across categories and subsequently projected into a low-dimensional representation space via a semantic embedding layer. This process establishes a mapping from raw point-of-interest data to regional semantic embeddings, enabling the model to effectively discern urban functional heterogeneity and spatial semantic dependencies in the following stages, thereby reinforcing the interpretability and robustness of short-term traffic forecasting.
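As a concrete illustration of this aggregation and scaling step, the following Python sketch counts POIs per functional category within a 500 m radius of each station center and applies min–max scaling per category. The DataFrame column names (lat, lon, type), helper names, and the haversine-based distance are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np
import pandas as pd

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between a point and arrays of points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000.0 * np.arcsin(np.sqrt(a))

def build_poi_features(stations, pois, categories, radius_m=500.0):
    """Count POIs of each functional category within `radius_m` of every station centre,
    then min-max scale each category column to [0, 1]."""
    counts = np.zeros((len(stations), len(categories)), dtype=np.float32)
    for i, s in enumerate(stations.itertuples()):
        dist = haversine_m(s.lat, s.lon, pois["lat"].to_numpy(), pois["lon"].to_numpy())
        nearby = pois.loc[dist <= radius_m, "type"]
        for k, cat in enumerate(categories):
            counts[i, k] = (nearby == cat).sum()
    c_min, c_max = counts.min(axis=0), counts.max(axis=0)
    return (counts - c_min) / np.maximum(c_max - c_min, 1e-8)
```

The resulting count matrix would then be passed through the semantic embedding layer described above to obtain the low-dimensional regional representations.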
2.1.2. OD Data and Flow Matrix Construction
The Origin–Destination flow data utilized in this study are also derived from the internal geographic information database of Amap. This dataset records the travel intensity between spatial grids in Beijing within discrete time slices over the past year, incorporating environmental context such as holiday indicators and traffic control statuses. Following aggregation and cleaning, the data are structured into a flow matrix in a time-series format with a temporal resolution of 15 min.
Specifically, each row in the matrix represents a time slot t (e.g., 1 January 2024 00:00–00:15), and each column corresponds to an ordered OD pair $(i, j)$, indicating the outbound flow from origin grid $i$ to destination grid $j$. The matrix element $X_t(i, j)$ signifies the traffic flow volume between the paired grids within the given time interval. A representative example is shown in Table 2.
In this formulation, each “origin–destination grid pair” explicitly represents the indices of departure and arrival regions. The temporal dimension extends across the entire observation horizon, forming a multidimensional spatiotemporal series that integrates a fixed spatial layout with continuous temporal dynamics.
For each time interval t, the traffic flows are represented by an N × N OD matrix $X_t \in \mathbb{R}^{N \times N}$, where N denotes the total number of spatial grids. This matrix characterizes the overall spatial distribution of mobility flows and inter-regional interactions, with missing entries replaced by zeros. To capture temporal dependencies, a sliding window framework is adopted to construct time-series samples: 12 consecutive intervals (approximately 3 h) are used as historical inputs, and the subsequent 4 intervals (around 1 h) are used for prediction. Formally, the temporal window is defined as $\mathcal{X}_t = \{X_{t-11}, \ldots, X_t\}$, and the sample construction process is illustrated as follows:

$$\big(\mathcal{X}_t,\ \mathcal{Y}_t\big) = \big(\{X_{t-11}, \ldots, X_t\},\ \{X_{t+1}, \ldots, X_{t+4}\}\big)$$

Here, $\mathcal{X}_t$ denotes the temporal dependency feature set fed into the model, and $\mathcal{Y}_t$ represents the subsequent four time steps used as prediction targets.
To ensure numerical stability and comparability across OD pairs, all flow values are normalized to the [0, 1] interval. The processed dataset is then partitioned into training, validation, and testing subsets following a consistent ratio, providing standardized input for subsequent spatiotemporal modeling.
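The matrix and window construction described above can be sketched as follows; the record format, function names, and zero-filling of missing entries are illustrative assumptions rather than the authors’ code.

```python
import numpy as np

def build_od_tensor(records, n_grids, n_slots):
    """Aggregate raw (slot, origin, destination, flow) records into a [n_slots, N, N]
    tensor; missing entries stay zero, matching the treatment described above."""
    od = np.zeros((n_slots, n_grids, n_grids), dtype=np.float32)
    for t, o, d, flow in records:
        od[t, o, d] = flow
    return od

def make_windows(od, hist=12, horizon=4):
    """Sliding window: 12 historical matrices as input, the next 4 as targets."""
    xs, ys = [], []
    for t in range(hist, od.shape[0] - horizon + 1):
        xs.append(od[t - hist:t])      # [12, N, N]
        ys.append(od[t:t + horizon])   # [4, N, N]
    return np.stack(xs), np.stack(ys)

def minmax_scale(x, lo, hi):
    """Normalise to [0, 1] using statistics taken from the training split only."""
    return (x - lo) / max(hi - lo, 1e-8)
```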
2.1.3. Data Alignment and Sample Generation
After deriving the POI-based semantic representations and constructing the OD flow matrices, multi-source data are aligned across spatial and temporal dimensions to guarantee feature synchronization for model training. The aligned data are then organized into spatiotemporal sample sets that can be directly fed into the predictive framework.
This study employs a unified grid-based spatial framework as the common reference system, ensuring that POI features and OD flows share identical spatial indexing structures. Specifically, each origin and destination grid within the OD matrix is associated with unique POI feature vectors $\mathbf{p}_i$ and $\mathbf{p}_j$, respectively.
All POI feature embeddings are projected into a shared semantic space to capture latent similarities and inter-regional interactions among urban functional zones, thereby complementing the spatial semantics underlying OD flows.
The OD flow series is uniformly sampled at 15 min intervals. To guarantee temporal synchronization among multi-source datasets, all records are aligned along a unified time axis and resampled when necessary. Missing or irregular entries are smoothed or interpolated to ensure structural completeness. After alignment, each time slice yields a paired input consisting of the flow matrix $X_t$ and its corresponding set of semantic vectors $\{\mathbf{p}_i\}_{i=1}^{N}$.
A sliding-window framework is employed to construct training samples. Each sample comprises 12 historical OD flow matrices and their associated POI semantic representations, used to predict the subsequent 4 time steps of flows:

$$\hat{\mathcal{Y}}_t = f_\theta\big(\mathcal{X}_t,\ \mathcal{P}\big), \qquad \mathcal{X}_t = \{X_{t-11}, \ldots, X_t\}, \qquad \mathcal{P} = \{\mathbf{p}_i\}_{i=1}^{N}$$

Here, $\mathcal{X}_t$ captures short-term temporal dependencies, while $\mathcal{P}$ encodes spatial semantic conditions.
All features are normalized via the min–max approach to eliminate scale inconsistencies. The final dataset is partitioned into training, validation, and testing sets with a ratio of 8:1:1, yielding the aligned multi-source spatiotemporal samples:

$$\mathcal{D} = \big\{\big(\mathcal{X}_t,\ \mathcal{P},\ \mathcal{Y}_t\big)\big\}_{t}$$
Through this layered spatial–temporal alignment and sample construction process, the dataset effectively preserves the dynamic continuity of OD flows and the spatial diversity of POI semantics, providing a rigorous empirical foundation for forecasting urban mobility under complex environmental conditions.
2.2. Overall Framework of CADM
To enable high-precision forecasting of short-term urban traffic dynamics, this study proposes a Cross-Attention Diffusion Model—a multi-source spatiotemporal framework grounded in the diffusion-based generative paradigm.
Within the Denoising Diffusion Probabilistic Model framework, CADM jointly models the dynamic temporal dependencies of OD flows and the spatial heterogeneity embedded in POI semantic representations. By leveraging a structured noise-to-data reconstruction process—encompassing noise injection, denoising, and matrix reconstruction—the model synthesizes future traffic flow distributions. An overview of the framework is provided in Figure 2.
2.2.1. Model Inputs and Outputs
At each diffusion timestep, CADM integrates heterogeneous spatiotemporal signals through three complementary sources:
Noisy OD Matrix ($x_t$): Obtained by injecting Gaussian noise into historical OD flows during the forward diffusion process, capturing the uncertainty of the current mobility state.
POI Semantic Embedding ($E_{POI}$): Drawn from the multi-dimensional semantic representation introduced in Section 2.1.1, providing spatial-functional priors for diffusion learning.
Temporal Embedding ($e_t$): Generated by a multi-layer perceptron (MLP) to encode the diffusion-step index into a continuous latent manifold, enriching temporal awareness of the model.
The model outputs a predicted noise field $\hat{\epsilon}$ that approximates the actual noise $\epsilon$. During inference, the reverse diffusion process progressively denoises $x_T$ to reconstruct the clean OD matrix $\hat{x}_0$, representing the predicted traffic flows for forthcoming time slots.
2.2.2. Model Architecture and Computational Flow
The computation pipeline of CADM comprises five core modules:
Input Encoding: Fuse $x_t$, $E_{POI}$, and $e_t$ into unified latent representations serving as model inputs.
Temporal Embedding: Inject timestep semantics through nonlinear MLP transformations.
U-Net Denoising Network: Leverage multi-scale convolutions and residual connections to extract hierarchical OD spatiotemporal features.
Cross-Attention Fusion: Utilize OD features as queries (Q) and POI embeddings as keys (K) and values (V) to enable self-adaptive alignment between semantic and flow domains.
Noise Prediction and Reconstruction: Predict the noise $\hat{\epsilon}$ and iteratively reconstruct $\hat{x}_0$ through reverse diffusion.
Formally, the forward pass can be expressed as:

$$\hat{\epsilon} = \epsilon_\theta\big(x_t,\ t,\ E_{POI}\big), \qquad \hat{x}_0 = G_\theta\big(x_T,\ E_{POI}\big)$$

where $\epsilon_\theta$ is the parameterized denoising network and $G_\theta$ represents the generative process implementing reverse diffusion.
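To make this interface concrete, the following is a minimal, self-contained stand-in for $\epsilon_\theta(x_t, t, E_{POI})$. Every submodule here (an embedding table for the timestep, a single attention layer, linear encode/decode heads) is a simplified placeholder for the full U-Net and cross-attention components detailed in Sections 2.3 and 2.4, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CADMSketch(nn.Module):
    """Minimal stand-in for eps_theta(x_t, t, E_POI). Every submodule is a simplified
    placeholder, not the full U-Net / CASF design of Sections 2.3-2.4."""
    def __init__(self, n_pairs, poi_dim, hidden=64, T=100):
        super().__init__()
        self.time_emb = nn.Embedding(T, hidden)              # simplified temporal embedding
        self.encode = nn.Linear(n_pairs, hidden)             # input encoding of the noisy OD state
        self.fuse = nn.MultiheadAttention(hidden, num_heads=4,
                                          kdim=poi_dim, vdim=poi_dim,
                                          batch_first=True)  # OD queries, POI keys/values
        self.decode = nn.Linear(hidden, n_pairs)             # noise prediction head

    def forward(self, x_t, t, e_poi):
        h = self.encode(x_t) + self.time_emb(t)              # [B, hidden]
        h, _ = self.fuse(h.unsqueeze(1), e_poi, e_poi)       # cross-attention fusion
        return self.decode(h.squeeze(1))                     # predicted noise, same shape as x_t

model = CADMSketch(n_pairs=1024, poi_dim=16)
x_t = torch.randn(8, 1024)                  # flattened noisy OD matrices
t = torch.randint(0, 100, (8,))             # diffusion-step indices
e_poi = torch.randn(8, 50, 16)              # POI embeddings for 50 regions (illustrative)
eps_hat = model(x_t, t, e_poi)              # -> [8, 1024]
```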
2.3. Temporal Step Encoding and U-Net Denoising Network
This section elaborates on the core denoising module of the proposed CADM framework, whose objective is to accurately estimate the latent noise $\epsilon$ and reconstruct the probabilistic distribution of traffic flow features, conditioned on the noisy OD matrix $x_t$ and POI embeddings $E_{POI}$. The denoising backbone consists of two tightly coupled components: a temporal-embedding block that explicitly encodes diffusion-step information to preserve temporal coherence across iterations, and a U-Net-based denoising network capable of multi-scale convolutional processing and spatial reconstruction. Through their joint operation, CADM achieves consistent denoising capability across diffusion stages while preserving fine-grained structural realism in the reconstructed traffic flows. The overall network architecture is illustrated in Figure 3.
2.3.1. Overall Design Philosophy
Within the diffusion generation framework, each diffusion and reverse-sampling step corresponds to the noise intensity and re-weighting stage during the forward and backward processes. The explicit modeling of such temporal information allows the network to perceive the magnitude and stage sensitivity of noise across diffusion timesteps, thereby enhancing denoising adaptability. To achieve this, CADM introduces a temporal embedding module based on a multi-layer perceptron. This module maps discrete timesteps into continuous high-dimensional representations, providing the model with sequential temporal awareness throughout the diffusion process.
Specifically, given the timestep $t \in \{1, \ldots, T\}$, a fixed sinusoidal function is used to generate periodic positional mappings so that each diffusion step can be continuously represented as

$$\gamma(t) = \big[\sin(\omega_1 t),\ \cos(\omega_1 t),\ \ldots,\ \sin(\omega_{d/2} t),\ \cos(\omega_{d/2} t)\big]$$

Here, $\omega_k$ denotes the frequency component used to capture the periodic variation in diffusion time. The resulting temporal encoding $\gamma(t)$ is transformed by two fully connected layers to obtain a smooth nonlinear representation:

$$e_t = W_2\,\phi\big(W_1\,\gamma(t) + b_1\big) + b_2$$

where $\phi(\cdot)$ is the ReLU activation. This block projects the discrete timestep into a continuous semantic transition space, enabling the network to learn continuous representations from fixed indices and thus making the diffusion phase controllable and temporally coherent. The encoded $e_t$ is forwarded to the feature pathway through residual layers for progressive propagation, allowing different diffusion stages to adaptively express varying noise states and achieve smooth transitions across reverse diffusion steps.
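A minimal PyTorch sketch of this temporal-embedding block is given below, assuming an embedding dimension of 128 and the standard sinusoidal frequency spacing; the class name and hyperparameters are illustrative rather than taken from the paper.

```python
import math
import torch
import torch.nn as nn

class TimestepEmbedding(nn.Module):
    """Sinusoidal encoding gamma(t) followed by two fully connected layers with ReLU,
    mirroring the equations above; dimension 128 is an illustrative choice."""
    def __init__(self, dim=128):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, t):                                       # t: [B] integer diffusion steps
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        angles = t.float().unsqueeze(1) * freqs.unsqueeze(0)    # [B, dim/2]
        gamma = torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)  # gamma(t)
        return self.mlp(gamma)                                  # smooth representation e_t
```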
2.3.2. Hierarchical U-Net Architecture
After obtaining the temporal embedding, CADM employs an improved U-Net architecture to reconstruct multi-scale features of the noisy OD matrix. Unlike standard image U-Nets, this framework is tailored for the OD matrix and introduces lightweight regional processing and attention mechanisms to enhance stability and model convergence.
The encoder is composed of three one-dimensional convolutional layers (Conv1D × 3), each followed by batch normalization (BN) and ReLU activation to capture inter-layer spatiotemporal variations:

$$h^{(l+1)} = \phi\big(\mathrm{BN}(\mathrm{Conv1D}(h^{(l)}))\big)$$
Here, a Conv1D architecture is adopted instead of the traditional Conv2D. This design choice is predicated on the organizational structure of the model’s input features: during the diffusion phase, the OD matrices are encoded and flattened into sequences of regional travel vectors. Conv1D is employed to extract feature representations along the temporal and semantic channel dimensions, while spatial dependencies are subsequently captured via the Cross-Attention Semantic Fusion (CASF) mechanism.
The adoption of Conv1D offers two primary advantages. First, it significantly reduces parameter count and computational complexity while maintaining temporal consistency, thereby preventing the interference caused by noise propagation during the diffusion inversion process that often accompanies the dimensional expansion in Conv2D. Second, since the primary role of the U-Net at this stage is multi-scale feature reconstruction and decoding rather than direct modeling of spatial neighborhoods, Conv1D processes high-dimensional sequential signals more efficiently, providing a more stable feature flow for conditional diffusion generation.
Shallow convolutions extract short-term spatiotemporal dependencies, while deeper residual blocks further compress and abstract high-level representations. To enhance hierarchical learning, a residual block (ResBlock × 2) is applied as follows:

$$h_{\mathrm{out}} = h_{\mathrm{in}} + \mathcal{F}\big(h_{\mathrm{in}}\big)$$

where $\mathcal{F}(\cdot)$ denotes the stacked Conv1D–BN–ReLU transformation within each block.
This design maintains information flow along short connections, accelerates gradient propagation, and strengthens multi-level feature transmission. The encoded representation thus carries comprehensive high-level features for subsequent reconstruction.
In the decoding stage, transposed convolution (Transposed Conv1D) layers are used for upsampling and feature restoration. Each decoder stage symmetrically corresponds to its matching encoder layer through skip connections to preserve spatial detail and ensure feature complementarity. Finally, a Conv1D layer outputs the denoised result $\hat{\epsilon}$, representing the predicted noise field aligned with the OD dimensions. Through the cooperation of temporal embedding and hierarchical U-Net structure, CADM achieves consistent denoising capability across different diffusion steps while maintaining structural smoothness and reconstruction fidelity.
2.3.3. Skip-Connection Mechanism
To mitigate information loss during downsampling, CADM integrates skip connections between symmetric encoder and decoder layers:

$$\tilde{h}_{\mathrm{dec}}^{(l)} = \mathrm{Concat}\big(h_{\mathrm{enc}}^{(l)},\ h_{\mathrm{dec}}^{(l)}\big)$$

Among them, $h_{\mathrm{enc}}^{(l)}$ represents the output of the encoder layer, while $h_{\mathrm{dec}}^{(l)}$ refers to the feature map of the decoder layer. The concatenation operation ensures that the positional information from lower layers and the semantic features from higher layers are fully fused during the decoding stage, enabling the network to reconstruct the OD matrix with both fine-grained integrity and global consistency.
In addition, the skip-connection structure provides a direct path for gradient propagation, significantly alleviating the instability issues that frequently occur during the training of deep networks.
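The following sketch shows a compressed 1-D U-Net with two encoder/decoder stages, skip concatenations, and transposed-convolution upsampling. It assumes the flattened sequence length is divisible by four and simplifies the three Conv1D layers and two residual blocks described above; channel widths and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv1D -> BN -> ReLU, the basic encoder unit described above."""
    return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                         nn.BatchNorm1d(c_out), nn.ReLU())

class UNet1DSketch(nn.Module):
    """Compressed 1-D U-Net: two encoder stages, a bottleneck (where CASF would be
    injected), and two decoder stages with skip connections."""
    def __init__(self, c_in=1, base=32):
        super().__init__()
        self.enc1 = conv_block(c_in, base)
        self.down1 = nn.Conv1d(base, base, 3, stride=2, padding=1)
        self.enc2 = conv_block(base, base * 2)
        self.down2 = nn.Conv1d(base * 2, base * 2, 3, stride=2, padding=1)
        self.bottleneck = conv_block(base * 2, base * 2)          # semantic fusion point
        self.up2 = nn.ConvTranspose1d(base * 2, base * 2, 4, stride=2, padding=1)
        self.dec2 = conv_block(base * 4, base)                    # concat with enc2 skip
        self.up1 = nn.ConvTranspose1d(base, base, 4, stride=2, padding=1)
        self.dec1 = conv_block(base * 2, base)                    # concat with enc1 skip
        self.out = nn.Conv1d(base, c_in, 1)                       # predicted noise field

    def forward(self, x):                                         # x: [B, c_in, L], L % 4 == 0
        e1 = self.enc1(x)
        e2 = self.enc2(self.down1(e1))
        b = self.bottleneck(self.down2(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)
```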
2.4. Cross-Attention Semantic Fusion Module
To enhance the model’s capacity for urban semantic interpretation and spatial-structural perception during the generative process, CADM introduces a Cross-Attention Semantic Fusion mechanism into the bottleneck layer of the U-Net backbone. This mechanism incorporates external POI feature embeddings to realize dynamic alignment and interactive learning between heterogeneous modal features, thereby establishing an explicit correspondence between flow dynamics and geographic semantic information. Within the diffusion generation framework, relying solely on the spatiotemporal characteristics of the OD flow matrix often fails to sufficiently represent regional functional features and behavioral semantic associations. On this basis, CASF introduces a learnable attention-weighting mechanism that enables the model to adaptively extract, from the structured POI embeddings, those features most semantically relevant to the current traffic state, thereby achieving complementary fusion between external semantic constraints and internal dynamic modeling.
In this mechanism, the query matrix Q is defined as the OD feature representation $H_{OD}$ output by the U-Net bottleneck, which characterizes spatiotemporal dependencies and traffic dynamics among regions. The key (K) and value (V) matrices are constructed from the POI embeddings $E_{POI}$, that is, Q = $H_{OD}$, K = V = $E_{POI}$. This feature-pairing strategy maintains the continuity of OD dynamic features while introducing static environmental semantic constraints, allowing the model to capture the dual structural relationships of “traffic behavioral patterns” and “spatial functional semantics” within the feature space. The cross-attention operation follows the standard Scaled Dot-Product Attention formulation, and its computation is expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

Here, $\sqrt{d}$ represents the scaling factor of the feature dimension, which is used to suppress excessively large dot-product results and mitigate gradient instability. The similarity matrix $QK^{\top}$ measures the correlation strength between traffic features and POI semantics, and after softmax normalization, forms the attention-weight matrix A. By performing the weighted aggregation AV of POI semantic features, the fused features are obtained. This process essentially realizes the semantic modulation of traffic features across different regions, allowing the model to emphasize pattern correlations among functionally similar or spatially adjacent areas during generation, thereby improving semantic consistency and spatial interpretability.
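A single-head sketch of this cross-attention operation is shown below, with OD bottleneck features as queries and POI embeddings as keys and values. The residual projection back onto the OD pathway and all dimensions are assumptions for illustration rather than the exact CASF implementation.

```python
import math
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Scaled dot-product cross-attention: Q from OD bottleneck features,
    K = V from POI embeddings; single head for clarity."""
    def __init__(self, od_dim, poi_dim, d=64):
        super().__init__()
        self.w_q = nn.Linear(od_dim, d)
        self.w_k = nn.Linear(poi_dim, d)
        self.w_v = nn.Linear(poi_dim, d)
        self.proj = nn.Linear(d, od_dim)       # project fused features back to the OD pathway

    def forward(self, h_od, e_poi):            # h_od: [B, N, od_dim], e_poi: [B, N, poi_dim]
        q, k, v = self.w_q(h_od), self.w_k(e_poi), self.w_v(e_poi)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)  # A = softmax(QK^T / sqrt(d))
        fused = attn @ v                       # weighted aggregation AV of POI semantics
        return h_od + self.proj(fused)         # residual fusion keeps the original OD dynamics
```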
CASF is embedded in the bottleneck layer of the U-Net architecture with clear design motivation and theoretical basis. At this stage, the OD features $H_{OD}$, after multiple convolution and downsampling operations, already possess strong global abstraction capability but have not yet entered the upsampling reconstruction phase; therefore, this serves as the optimal position for semantic injection. By introducing the cross-attention mechanism at this layer, the fused feature information can propagate upward along the decoding path, guiding semantic restoration in higher-level structures, while maintaining structural consistency with the lower-level features in the encoding path. The final generated representation simultaneously carries both high-level semantics and low-level spatial details, achieving a balance between local precision and global coherence during reconstruction, thus resolving the semantic discontinuities that may occur when relying solely on convolutional features.
From a mechanistic perspective, the primary objective of designing the Cross-Attention Semantic Fusion mechanism is to enhance the semantic perception and spatial modeling capabilities of CADM. On the one hand, the model leverages attention weights to identify semantic coupling relationships between regions characterized by similar functions, comparable land use, or related activity types, ensuring that the generated OD distributions adhere to the spatial logic inherent in the urban functional layout. On the other hand, the global weighting property of the attention mechanism imposes directional constraints on the diffusion denoising process, ensuring that the generated traffic feature distributions exhibit not only improved numerical stability but also greater fidelity in their spatial structure. Furthermore, CASF possesses high parallelism, allowing it to operate synchronously with the U-Net backbone convolutions; this design achieves semantic alignment and information reinforcement without significantly increasing the parameter count or computational overhead.
2.5. Conditional Diffusion Generation and Reconstruction
After the temporal-step encoding and semantic-fusion mechanisms jointly establish the multi-source representation space, CADM models and reconstructs the latent distribution of the noise-added OD matrix through a conditional diffusion generation process. This stage consists of two phases: the training phase and the inference phase. The former aims to learn a reverse noise-estimation function capable of reconstructing the latent true distribution from the noisy samples, while the latter performs step-by-step denoising inversion from random noise based on the learned parameters, ultimately generating a high-fidelity OD distribution.
This study adopts the Denoising Diffusion Probabilistic Model framework as the theoretical foundation for diffusion modeling (Figure 4). The core principle of this framework involves modeling the forward diffusion and reverse denoising processes of data via Markov chains. During the training phase, the diffusion process can be viewed as a Markov chain that progressively injects Gaussian noise into the data over discrete time steps.
Starting from the original sample $x_0$, a sequence of latent variables $\{x_1, x_2, \ldots, x_T\}$ is generated, where the forward diffusion process can be expressed as

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

where $\epsilon$ represents Gaussian noise, and $\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)$ denotes the cumulative product of noise coefficients across time steps. This process describes how sample features are gradually perturbed by noise at each time step, leading the sample distribution to transition smoothly toward an isotropic Gaussian distribution. To regulate the intensity of noise injection, this study employs a linear β schedule, where the noise level $\beta_t$ increases linearly within the interval [0.0001, 0.02]. This specific scheduling strategy is selected because it strikes an effective balance between stationarity and abruptness in traffic flow data, thereby preventing early information loss caused by excessive noise accumulation. The core of model training lies in learning the reverse reconstruction process—specifically, parameterizing the conditional distribution of the backward denoising steps through a neural network $\epsilon_\theta(x_t, t, E_{POI})$ so as to recover a clean sample from its noisy state. The reverse diffusion step can be expressed as:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\big(x_t, t, E_{POI}\big)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, \mathbf{I})$$

where $\sigma_t^2$ is the variance of the sample noise, and $\alpha_t = 1 - \beta_t$. The network learns the noise estimation function $\epsilon_\theta$ to approximate the true posterior distribution $p(x_{t-1} \mid x_t, x_0)$, thereby realizing conditional and stable denoising generation.
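Under these definitions, the forward noising step and a single reverse step can be sketched as follows. The choice $\sigma_t^2 = \beta_t$ is one common DDPM variance setting and is an assumption here rather than a detail stated in the text.

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.02, T)            # linear beta schedule from the text
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)         # cumulative noise coefficients

def q_sample(x0, t, noise):
    """Forward diffusion in closed form: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*eps."""
    ab = alpha_bar[t].view(-1, 1)                # t: [B] integer steps
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

def p_step(x_t, t, eps_hat):
    """One reverse denoising step at scalar step t using the predicted noise."""
    beta_t, a_t, ab_t = betas[t], alphas[t], alpha_bar[t]
    mean = (x_t - beta_t / (1.0 - ab_t).sqrt() * eps_hat) / a_t.sqrt()
    if t == 0:
        return mean                              # no noise added at the final step
    return mean + beta_t.sqrt() * torch.randn_like(x_t)   # assumes sigma_t^2 = beta_t
```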
The training objective of the model is to minimize the mean-square error between the predicted noise and the true noise, ensuring precise learning of detailed variations at each diffusion step. The loss function is defined as

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}\Big[\big\| \epsilon - \epsilon_\theta\big(x_t,\ t,\ E_{POI}\big) \big\|^2\Big]$$

where t is uniformly sampled across time steps, $\epsilon$ is the true Gaussian noise, and $\hat{\epsilon} = \epsilon_\theta(x_t, t, E_{POI})$ is the model-predicted noise. This objective seeks to minimize the expected error between predicted and true noise, guiding the network to learn a robust conditional denoising mapping associated with POI semantics, thereby enhancing its ability to reconstruct regionally and temporally variant features from noisy conditions.
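A corresponding training-step sketch is given below, reusing q_sample from the previous snippet; the model call signature follows the conditional form $\epsilon_\theta(x_t, t, E_{POI})$ and everything else is illustrative.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, e_poi, T=100):
    """One optimisation step of the noise-regression objective: sample t and eps,
    noise the clean OD features with q_sample (previous snippet), predict the noise."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)  # uniform timestep sampling
    noise = torch.randn_like(x0)                               # true Gaussian noise eps
    x_t = q_sample(x0, t, noise)                               # forward diffusion
    eps_hat = model(x_t, t, e_poi)                             # conditional noise prediction
    return F.mse_loss(eps_hat, noise)                          # || eps - eps_theta ||^2
```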
During the inference phase, the model begins by sampling $x_T$ from a standard Gaussian distribution, and then iteratively performs the reverse diffusion process to progressively generate $\hat{x}_0$. In each step, the model sequentially removes noise from the latent representation while integrating semantic constraints, thereby achieving layer-by-layer conditional denoising reconstruction. As time steps decrease, the denoised samples gradually evolve into structurally coherent OD matrices. In the early stages, the model focuses on restoring coarse spatiotemporal patterns of regional mobility and traffic intensity; in the later stages, it converges toward geometry-aware local refinement, capturing relational consistency among semantically or spatially correlated regions.
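The inference loop can be sketched as below, reusing p_step from the earlier snippet; tensor shapes and the absence of gradient tracking are the only assumptions added here.

```python
import torch

@torch.no_grad()
def sample(model, shape, e_poi, T=100):
    """Reverse diffusion: start from x_T ~ N(0, I) and iterate p_step (previous snippet)
    under the POI semantic condition to obtain the reconstructed OD features."""
    x = torch.randn(shape)                                     # x_T
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = model(x, t_batch, e_poi)                     # conditioned noise estimate
        x = p_step(x, t, eps_hat)                              # one denoising step
    return x                                                   # reconstructed x_0_hat
```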
In summary, CADM leverages time-step diffusion modeling to conceptualize noise injection as dynamic perturbations to traffic system states and reconstructs spatiotemporal structure through reverse denoising. To integrate multi-source information, the model employs cross-modal attention to project POI semantics and OD flow dynamics into a unified representation space, enabling collaborative enhancement that strengthens semantic adaptability and spatial consistency across functionally heterogeneous regions. Architecturally, CADM adopts a U-Net-based symmetric generation network with cross-layer attention mappings at multiple scales, jointly modeling global trends and local details. This design preserves diffusion reversibility while providing flexibility to incorporate multi-source spatiotemporal features and generalize across diverse scenarios. Consequently, CADM integrates semantic, temporal, and spatial information within a unified probabilistic generative framework, establishing a stable, controllable, and structurally adaptive traffic flow generation system. The following section presents comprehensive experimental evaluations on real-world urban datasets to validate the model’s effectiveness and generalization capability in spatiotemporal forecasting.
3. Experimental Evaluation and Ablation Study
3.1. Datasets
In this study, a high-precision multi-source spatiotemporal dataset was constructed based on the internal geographic information database of Amap (Gaode Map) to verify the effectiveness of the proposed Cross-Attention Diffusion Model in short-term urban OD traffic flow prediction tasks. The dataset covers the core functional areas and major road grids of Beijing, with a spatial resolution of 500 m as the basic unit, resulting in N spatial regions. At each time step, a corresponding N × N OD flow matrix is generated. The observation period spans from January to June 2024, with a temporal resolution of 15 min, forming more than 15,000 continuous time slices, which comprehensively reflect the dynamic evolution characteristics of urban traffic. In addition to traffic flow data, multiple environmental factors closely related to travel behavior are also collected, including temperature, precipitation, holiday indicators, and traffic control information, to support modeling and generation under multi-source conditions.
Compared with commonly used public traffic datasets (such as PEMS-Bay), the internal Amap dataset employed in this study exhibits higher complexity and realism in both spatial resolution and semantic dimensionality, enabling a more detailed characterization of multi-scale dynamics in complex urban mobility. The POI (Point of Interest) data encompass typical functional categories such as dining, commerce, office, and transportation hubs. For each spatial grid, the semantic distribution is obtained by counting the number of POIs across different categories. To ensure feature stability and comparability in scale, all POI features are normalized using the min–max method and further mapped into a low-dimensional continuous space through a semantic embedding layer, forming structured regional semantic vectors. This allows the model to simultaneously perceive variations in urban functional roles and latent travel patterns during the generation process.
For temporal modeling, all OD flow matrices are column-wise normalized and organized into sequential samples via a sliding-window mechanism, in which the observations of 12 consecutive time steps (approximately 3 h) are used as input to predict the flow variation in the subsequent 4 time steps (approximately 1 h). Normalization is performed based on the statistical range of the training set to ensure numerical consistency between the training and inference phases. This dataset possesses high representativeness in terms of temporal coverage, spatial granularity, and multi-source semantic characteristics, providing sufficient and reliable support for the conditional diffusion learning of the CADM, and establishing a solid data foundation for subsequent experimental analysis.
3.2. Evaluation Metrics
To comprehensively evaluate the performance of the CADM on short-term OD traffic flow prediction tasks, this study adopts multiple quantitative evaluation metrics to assess both the accuracy of point predictions and the consistency between predicted and empirical distributions. Specifically, three common error measures are employed—Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE)—to quantify the bias and robustness of model predictions from different perspectives. They are defined as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2}, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$$

where $y_i$ is the observed ground truth, $\hat{y}_i$ is the model prediction, and n is the total number of samples. MAE reflects the average prediction deviation, while RMSE is more sensitive to large errors, effectively capturing the impact of extreme values. MAPE, in turn, measures relative errors, enabling intuitive cross-scale comparisons of model accuracy across heterogeneous regions. Together, these three indicators provide a multidimensional evaluation of both overall prediction accuracy and stability. In addition, to evaluate the probabilistic accuracy of the generative model, this study introduces the Continuous Ranked Probability Score (CRPS) as a comprehensive distributional metric. CRPS assesses how well the predicted distribution F aligns with the empirical observation Y, and is defined as:

$$\mathrm{CRPS} = \frac{1}{n}\sum_{i=1}^{n} \int_{-\infty}^{+\infty} \big(F_i(z) - \mathbb{1}\{z \ge y_i\}\big)^2\, dz$$

where $F_i$ denotes the cumulative distribution function of the i-th sample prediction, and $\mathbb{1}\{\cdot\}$ is an indicator function. This metric computes the integral distance between the predicted and empirical cumulative distributions. A smaller CRPS value indicates a closer agreement between the two distributions; thus, the lower the CRPS, the better the generative model’s ability to align its predictive distribution with reality, resulting in predictions that are statistically more trustworthy and interpretable.
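For reference, the point metrics and a sample-based CRPS estimate could be implemented as below. The energy-form CRPS estimator is an assumption, since the text does not state how CRPS is computed from the generated samples.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat, eps=1e-8):
    return np.mean(np.abs((y - y_hat) / np.maximum(np.abs(y), eps))) * 100.0

def crps_from_samples(y, samples):
    """Energy-form CRPS estimate E|X - y| - 0.5 E|X - X'| from S generated samples
    (samples: [S, ...]); averaged over all OD entries."""
    term1 = np.mean(np.abs(samples - y), axis=0)
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]), axis=(0, 1))
    return float(np.mean(term1 - term2))
```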
3.3. Implementation Details
To comprehensively verify the performance of the proposed Cross-Attention Diffusion Model for short-term OD traffic flow prediction, a series of comparative experiments were conducted on the aforementioned multi-source spatiotemporal dataset. All experiments were implemented in Python 3.8, using the PyTorch 2.0 framework, and executed on a server equipped with an NVIDIA RTX 4080 GPU (16 GB memory).
The dataset was partitioned chronologically into 80% for training, 10% for validation, and 10% for testing. All input features (including OD flow matrices and POI embeddings) were normalized using the min–max method based on the statistical range of the training set, ensuring consistent scaling across different data splits.
The model was trained for 50 epochs with a batch size of 32. The Adam optimizer was employed with an initial learning rate of 1 × and a weight decay of 1 × . A cosine annealing scheduler was used to dynamically adjust the learning rate during training. The diffusion process was set to T = 100 steps, and the noise level $\beta_t$ was increased linearly within the range [0.0001, 0.02], which helps prevent numerical instability near t = T.
To prevent overfitting, a dropout rate of 0.2 was applied within the feature layers, and an early-stopping mechanism was incorporated. Training was terminated early if the validation RMSE did not improve over 10 consecutive epochs. RMSE was chosen as the stopping criterion because its squared-error structure is more sensitive to large deviations, allowing it to guide the optimization of deterministic prediction performance more effectively.
Since diffusion models are inherently stochastic, the same input may produce slight variations across different generations. To mitigate randomness and enhance robustness, each input sample was independently generated 10 times, and the mean prediction was reported as the final result. A fixed random seed was applied in all experiments to ensure reproducibility. This overall training and inference procedure achieves an effective balance among stability, accuracy, and computational efficiency.
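A minimal sketch of this averaging procedure is shown below, where generate is any callable wrapping one stochastic reverse-diffusion run for a fixed conditioning input; the seed value shown is illustrative, as the text does not report the actual seed.

```python
import torch

def averaged_prediction(generate, n_runs=10, seed=42):
    """Average n_runs stochastic generations of the same input; `generate` is any callable
    wrapping one reverse-diffusion run for fixed conditioning. Seed value is illustrative."""
    torch.manual_seed(seed)                       # fixed seed for reproducibility
    preds = torch.stack([generate() for _ in range(n_runs)])
    return preds.mean(dim=0)                      # mean prediction reported as final result
```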
Implemented on an NVIDIA 4080 GPU with 16 GB of VRAM, CADM requires approximately 25 min per training epoch. Complete training to convergence—typically 50 epochs with early stopping—takes roughly 21 h. The relatively high training cost of CADM primarily stems from the additional computational overhead associated with multi-step noise estimation during the diffusion process and cross-attention calculations. The model comprises 18.7 million parameters, with peak VRAM usage reaching 12.8 GB during training. In the inference phase, the prediction latency for a single sample (predicting 4 future time steps from 12 input time steps) is approximately 1.8 s. The memory requirement during inference is 6.5 GB, which is allocated mainly for storing intermediate diffusion states and attention weight matrices.
3.4. Baselines
To comprehensively evaluate the performance of the proposed Cross-Attention Diffusion Model in short-term OD traffic flow prediction, a total of 13 representative traffic forecasting models were selected for systematic comparison. The baselines include:
Among these, (1) and (2) are traditional statistical learning models; (3)–(5) and (12)–(13) represent deep learning models for time-series forecasting; and (6)–(11) are spatiotemporal graph neural network (GNN) models. Except for STGIN, which is implemented on the TensorFlow platform, all other baselines are built under the PyTorch 2.0 framework.
To ensure fairness in comparison, all baseline models followed the same data preprocessing and partitioning protocol as CADM. All methods except traditional baselines applied the same min-max normalization scheme. For models incorporating spatial topology (Baselines (6)–(10)), traffic-environmental features were embedded as multi-channel node attributes to enhance spatiotemporal representation, whereas STGIN, with its built-in adaptive embedding structure, required no additional feature-channel integration.
Hyperparameter configurations were initialized based on the settings reported in the original papers and official implementations, and further optimized via grid search. Each model was trained for up to 100 epochs with a batch size of 32, using an early-stopping condition that halted training if the validation RMSE did not improve for 10 consecutive epochs. This consistent experimental design ensures methodological rigor and reproducibility across all baselines.
All experiments were conducted under the same hardware environment as CADM. By covering a comprehensive range of baseline paradigms—from traditional statistical methods and sequence models to advanced graph-based spatiotemporal architectures—this comparative framework provides a thorough evaluation of CADM’s performance advantages and applicability across different methodological categories.
3.5. Results
The proposed CADM was compared with a variety of mainstream traffic-prediction approaches across forecasting horizons ranging from 15 min to 60 min.
Table 3 presents the detailed quantitative results, where the best performance at each prediction step is underlined.
From an overall perspective, conventional statistical models (e.g., HA and SVR) exhibit significantly higher errors on all metrics than other model categories, indicating that their linear assumptions fail to capture the complex nonlinear dynamics of traffic evolution. Early time-series models such as GRU and LSTM achieve moderate accuracy in short-term prediction, whereas more recent spatiotemporal graph neural networks (TGCN, STGCN, DCRNN, GWN, etc.) and Transformer-based architectures (STGIN, iTransformer, TimeMixer) demonstrate superior precision and stability, attributed to their stronger ability to capture inter-node spatial dependencies and intricate temporal behaviors.
Across different prediction horizons, CADM performs remarkably well in short-term forecasts (15 min and 30 min), achieving RMSE, MAE, and MAPE close to the best baseline results, thereby reflecting its strong capability for short-range prediction. However, as the forecasting horizon extends (45 min and 60 min), its performance drops more noticeably, with errors increasing at a faster rate, and the advantage gradually diminishes relative to certain graph-based neural models.
This performance shift suggests that CADM is highly adept at capturing short-term traffic dynamics and generating smooth, stable predictions. Nevertheless, at longer temporal scales, the diffusion-based generation process becomes more susceptible to accumulated errors and external uncertainties, leading to a gradual decline in accuracy. This degradation in performance is primarily attributed to error accumulation inherent in the diffusion generation process. During the multi-step reverse denoising phase of conditional diffusion, an increase in predicted time steps predisposes the generated distribution to diverge from the ground truth, thereby causing errors to amplify progressively. Furthermore, the evolution of traffic flow over extended time windows is typically subject to non-stationary factors—such as sudden incidents and complex travel decision-making—which render short-term learnable dynamic patterns increasingly ineffective for long-term prediction. In contrast, deeper structural models such as STGIN and TimeMixer maintain relatively high predictive stability for long-term forecasts, benefiting from their multi-level attention mechanisms that enhance modeling of long-range temporal dependencies.
In summary, while CADM exhibits a distinct advantage in short-term OD prediction, it encounters performance attenuation during medium- to long-term forecasting. Future improvements should be directed towards bolstering the model’s capacity for sustained modeling of long-range dependencies. Specifically, the perception of complex spatiotemporal evolutionary dynamics could be enhanced by incorporating two categories of information—dynamic semantic labels and spatial context relationships—atop the existing POI features.
Firstly, in addition to static POI information, time-sensitive dynamic labels can be introduced. For instance, ‘heat index’ features derived from travel demand, visitation frequency, or real-time crowd intensity statistics could be integrated to characterize the fluctuating activity levels of regional functions across different time intervals. Such dynamic semantic information would facilitate the capture of non-stationary temporal features, such as holiday effects and diurnal travel variations, thereby mitigating the limitations associated with static semantic features in long-term forecasting.
Secondly, spatial contextual associations between POIs can be explicitly incorporated into semantic modeling. By constructing activity correlation graphs between adjacent grids or functionally similar regions (e.g., leveraging spatial adjacency or semantic similarity matrices), the model can be empowered to capture latent semantic coupling and functional interdependence between regions. For example, commercial districts and surrounding dining areas often exhibit strong mobility synergy during peak hours; integrating such spatial dependencies into semantic encoding contributes to enhancing the model’s spatial generalization capabilities.
3.6. Ablation Study
To systematically evaluate the contribution of POI semantic features to the CADM framework, this study designed a comprehensive set of ablation experiments. These experiments verify the effectiveness of each component by progressively removing, perturbing, or simplifying the POI inputs and their associated fusion mechanisms. The experimental configurations are detailed as follows:
CADM: The complete model, integrating OD flow features, POI semantic embeddings, and the cross-attention fusion mechanism.
CADM without POI: The POI input is completely removed, causing the cross-attention module to degenerate into an identity mapping.
CADM with Random POI: Randomized POI vectors are employed to disrupt spatial semantic consistency while maintaining feature dimensionality.
CADM with Static POI: Regional features are replaced with a global average POI vector to eliminate spatial heterogeneity.
CADM without Cross-Attention: POI embeddings are retained, but the cross-attention module is replaced by a simple feature concatenation operation.
All experimental configurations were evaluated under identical conditions, including training hyperparameters, dataset partitioning schemes, and evaluation metrics. To ensure statistical robustness, a fixed random seed was used for each experiment, which was repeated three times; the average results are reported. The experimental results are presented in Table 4.
This ablation study systematically validates the critical contributions of POI semantic features and the cross-attention fusion mechanism to the performance of the CADM. First, POI semantic features exert a significant influence on overall model performance; upon the removal of POI semantic features, the evaluation metrics for short-term prediction (15 min) exhibited an increase of 15–20%, while metrics for long-term prediction (60 min) increased by 5–15%, indicating that POI semantics play a more pronounced role in modeling short-term traffic dynamics, whereas temporal dependencies gradually assume dominance in long-term prediction. Second, the consistency of spatial semantics is another factor influencing performance; the ablation experiments involving Random POI verified the model’s reliance on authentic POI-OD spatial semantic mappings in real-world scenarios. Notably, the performance of Random POI was marginally superior to the complete removal of POI, suggesting that the model can learn partial information from feature dimensions, but this learning gain is significantly lower than the structural knowledge derived from authentic semantics. Meanwhile, the Static POI configuration consistently underperformed the complete model across all time steps but outperformed Random POI, indicating that while static POI data can provide statistical information, it lacks the capability to express regional functional heterogeneity, such as distinguishing the travel characteristic differences between functional zones like commercial centers and residential areas. Finally, the experimental results regarding Cross-Attention demonstrate that the prediction performance of the model utilizing the cross-attention mechanism is superior to that of the static feature concatenation approach.
In conclusion, the model requires authentic spatial semantic correspondence and dynamic interactions between regional functions to support accurate prediction; furthermore, the results clarify that short-term prediction relies more heavily on spatial semantics, while long-term prediction depends more on temporal dynamic modeling, providing compelling support for the proposed future direction of introducing dynamic POI features (such as real-time crowd heat) into this model.
3.7. Attention Pattern Visualization
To validate the interpretability of the POI-guided cross-attention mechanism, Figure 5 visualizes the learned spatiotemporal attention patterns during Beijing’s morning rush hour (07:00–10:00). The heatmap reveals three key characteristics of the model’s semantic awareness.
First, attention allocation exhibits clear functional asymmetry. During early morning (07:00–07:30), residential grids (Grid 0, Grid 1) receive elevated attention weights, capturing outflow patterns as commuters depart. During peak hours (08:30–09:15), attention shifts toward transportation hubs (Grid 10) and office districts (Grid 20, Grid 21), with probabilities reaching 0.18–0.34, reflecting adaptive focus on high-demand destinations. Second, attention evolution demonstrates progressive semantic refinement. Early denoising stages prioritize coarse spatial alignment between functionally complementary regions (e.g., residential-to-hub connections), while later stages concentrate on fine-grained local flows within functionally similar zones. Third, the heatmap confirms semantic coherence. Functionally related regions (e.g., residential and office areas during morning commutes) exhibit synchronized attention peaks, while dissimilar regions (e.g., residential areas and public parks) maintain consistently low weights. This validates the model’s capacity to distinguish semantically distinct mobility patterns and capture functional coupling encoded in POI semantics.
These patterns substantiate that CADM’s cross-attention mechanism provides interpretable, semantically coherent modulation of spatiotemporal reconstruction, enhancing both model transparency and applicability for urban traffic management.