Article

A Spatial–Temporal Transformer with Query Enhancement and Fourier Analysis for Traffic Forecasting

School of Computer Science, Shandong Xiehe University, Jinan 250109, China
*
Author to whom correspondence should be addressed.
Information 2026, 17(6), 542; https://doi.org/10.3390/info17060542
Submission received: 9 April 2026 / Revised: 21 May 2026 / Accepted: 30 May 2026 / Published: 1 June 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

Accurate traffic forecasting can assist in traffic management and promote the construction of smart cities. Existing methods have achieved promising results, but they often do not sufficiently distinguish node importance in road networks, explicitly capture temporal periodicity in noisy traffic data, or effectively exploit auxiliary query information. To address these issues, we propose a Spatial–Temporal Fourier Query-Enhanced Transformer (STFQET) for traffic forecasting. STFQET introduces a query-enhanced attention module to model the delayed influence of regional query dynamics on node-level traffic states, a key node attention module to identify influential nodes and strengthen spatial dependency learning, a Fourier filter module to suppress noise while preserving useful temporal components, and an FFT-based temporal period extraction module to enhance periodic feature learning in the frequency domain. Experiments on real traffic datasets demonstrate that the proposed method has competitive and generally superior predictive performance compared to existing baseline models.

1. Introduction

Traffic forecasting is a core component of Intelligent Transportation Systems (ITS). It analyzes historical traffic data, real-time information, and other factors to predict future traffic conditions. As cities modernize, per-capita car ownership rises and urban traffic congestion becomes increasingly severe, disrupting daily life and the normal operation of cities. Accurate traffic forecasting is therefore essential for relieving congestion, improving road efficiency, and reducing traffic accidents.
Deep learning methods have made significant progress in traffic prediction. For example, RSTIB [1], STPGNN [2], and FC-STGNN [3] automatically extract the temporal and spatial features hidden in traffic data and model their complex relationships. Attention-based models such as STFAN [4] and FSatten [5] have also achieved good results in traffic prediction tasks. Nevertheless, traffic speed prediction remains challenging when models need to jointly exploit traffic observations and auxiliary information, distinguish heterogeneous spatial influences among road segments, and extract stable temporal patterns from noisy traffic series. Therefore, despite the progress of existing studies, the following challenges still need to be addressed.
Challenge 1: Learning the delayed influence of query information. Auxiliary information, such as user queries, can provide early signals of future traffic changes. A query refers to a user’s request for the optimal route to a destination, which usually contains information such as departure time, departure region, arrival time, and arrival region. Liao et al. [6] demonstrated the usefulness of online route queries by combining them with geographical and event-related auxiliary features. However, existing query-aware traffic forecasting methods mainly treat queries as auxiliary inputs and do not sufficiently model the delayed effect between query dynamics and future traffic states. For example, when an entertainment event or performance takes place in a certain region, the number of queries to that region usually increases before the actual traffic condition changes. Therefore, query data contains useful information about future traffic evolution. As shown in Figure 1, the influence of query data on traffic conditions is not instantaneous, but gradually emerges over time as users arrive at their destinations. Ignoring or insufficiently modeling this delayed influence limits the effectiveness of auxiliary information and reduces prediction accuracy.
Challenge 2: Distinguishing key nodes while preserving spatial propagation. In a road network, complex spatio-temporal dependencies exist between nodes. Some road segments, such as intersections, ramps, and highway entrances, often have stronger influence on surrounding traffic evolution; congestion at these locations can quickly propagate to adjacent road segments. Other road segments, especially those at the edge of a local road network or with stable low traffic volume, may have weaker effects on their neighbors. Therefore, identifying key nodes is necessary because treating all road segments with the same importance can obscure the propagation patterns centered on influential locations. STPGNN [2] has shown the value of pivotal-node modeling by identifying pivotal nodes and constructing a pivotal graph for spatio-temporal dependency learning. However, key node-centered traffic modeling still faces a design challenge: Influential nodes should be emphasized without isolating them from the remaining road network, and full network propagation should be preserved without overwhelming the contribution of important nodes. Therefore, traffic forecasting models need a mechanism that can adaptively identify key nodes, strengthen their interactions, and integrate them with the complete spatial context of the road network.
Challenge 3: Extracting reliable periodic cues from noisy traffic series. Traffic data exhibits distinct periodic characteristics along the temporal dimension, and the frequency domain provides a natural way to discover such periodic patterns. However, real traffic observations are often disturbed by sensor errors, missing values, and temporary incidents, making temporal periodicity difficult to extract directly. Frequency-aware methods have explored this direction from different perspectives. MSGNet [7] uses frequency-domain analysis to identify multi-scale periodic patterns, while Affirm [8] applies adaptive Fourier filtering with causal convolutions to improve temporal representation. These methods demonstrate the usefulness of spectral modeling, but period discovery alone may retain noisy frequency components, and filtering alone may not fully transform dominant traffic rhythms into effective temporal cues. Therefore, a key challenge is to reduce noise while extracting data-driven periodic information that can better support temporal dependency learning in traffic forecasting.
To overcome these limitations, we propose the Spatial–Temporal Fourier Query-Enhanced Transformer (STFQET), which establishes a unified traffic forecasting framework that integrates multi-source data. To address Challenge 1, we introduce a query-enhanced attention module. This module maps query data and traffic data from different feature spaces into a shared latent space and learns cross-feature dependencies through a joint attention mechanism, thereby adaptively capturing the time-delay relationship between query data and traffic data. To address Challenge 2, we design a key node attention module. This module calculates the importance of each node to the current traffic status. Based on the importance scores, the k nodes with the greatest impact on the surrounding traffic conditions are identified as key nodes. A key node subgraph is built from these nodes so that the model can learn their deep spatial dependencies. To address Challenge 3, we apply the Fast Fourier Transform (FFT) to each node’s time series to convert the time-domain data into the frequency domain and use the dominant frequencies as the basis for temporal embedding. We then design a Fourier filtering module, which uses low-pass filtering, high-pass filtering, and adaptive filtering to remove potential noise in traffic data. Furthermore, because frequency-domain data has a natural advantage in exposing temporal periodicity, we design an FFT-based temporal period extraction module that uses an adaptive weighting method to fuse the frequencies at each node and improve the model’s ability to learn temporal correlations.
In this paper, our main contributions can be summarized as follows:
  • We propose STFQET, a query-enhanced spatio-temporal Transformer that targets delayed query influence, key node-centered spatial dependency, and noise-robust frequency-domain temporal modeling for traffic speed forecasting.
  • We use query data as auxiliary information and build a query-enhanced attention module that learns the time-delay relationship between query data and traffic data, thereby improving the model’s ability to integrate external information.
  • We construct a key node attention module to identify key nodes and construct a key node subgraph to distinguish the importance of different nodes in the road network graph, enabling the model to learn deep spatial features between nodes.
  • We design a Fourier filter module and an FFT-based temporal period extraction module to adaptively filter the data and learn periodic features according to its frequency domain characteristics, thereby enhancing the model’s learning of time periodic features.
  • Experiments show that STFQET outperforms existing baseline models on real urban network traffic datasets, and ablation experiments confirm the independent contribution of each module.

2. Related Work

2.1. Traffic Forecasting Methods

Recent years have witnessed significant progress in traffic prediction, with many studies developing advanced models to capture complex spatio-temporal dependencies in traffic data. For example, Zheng et al. [9] proposed GMAN, which combines spatial attention, temporal attention, and gated fusion within an encoder–decoder framework, and further introduces a transform attention layer to alleviate error propagation by directly linking historical and future time steps. Qian et al. [10] proposed DeepSTUQ, which models complex traffic associations through spatio-temporal analysis and adopts two independent sub-networks to estimate random uncertainty and epistemic uncertainty. Guo et al. [11] proposed SSTBAN, which reduces the computational complexity of attention from quadratic to linear through a spatial–temporal bottleneck attention mechanism and improves data utilization by combining it with a self-supervised learning framework.
Another important research direction focuses on improving temporal pattern modeling in traffic forecasting. Cai et al. [7] proposed MSGNet, which extracts multi-scale periodic patterns through frequency-domain analysis and introduces adaptive convolutional layers and self-attention mechanisms to dynamically learn inter-sequence correlations and intra-sequence dependencies at different time scales. Affirm [8] combines an adaptive Fourier filter block with causal convolutions of different kernel sizes to optimize frequency-domain feature representation and promote multi-granularity temporal interaction. Ma et al. [12] proposed U-Mixer, which combines the Unet and Mixer architectures to fuse multi-scale features and uses MLP blocks to model dependencies between time slices and channels. FSatten [5] embeds sequences into the frequency-domain space through Fourier transform and introduces multi-head spectrum scaling to capture periodic dependencies. Li et al. [13] proposed SSL-STMFormer, which develops a spatio-temporal entangled Transformer based on self-supervised learning to capture long-distance and long-term dependencies as well as heterogeneity in traffic flow. Gao et al. [14] proposed ST-SSDL, which introduces a self-supervised deviation learning scheme to model the discrepancy between current observations and historical patterns. It anchors current inputs to historical averages and uses learnable prototypes together with contrastive and deviation losses to quantify dynamic deviations in latent space, thereby improving the robustness of spatio-temporal forecasting under varying traffic conditions.
Recent studies have further strengthened traffic forecasting by exploring richer spatio-temporal correlation modeling and more expressive relational structures. Liu et al. [15] proposed a multi-layer spatiotemporal correlation-aware graph attention network (MSTC-GAT), which combines a spatial structure-aware graph attention module, a temporal structure-aware graph attention module, and a spatiotemporal transformer to jointly model dynamic, global spatial, and local temporal correlations. Xian et al. [16] proposed MDHGFN, which introduces a multiscale dual hypergraph construction strategy to capture high-order spatial interactions across microscopic, mesoscopic, and macroscopic traffic patterns. Yu et al. [17] proposed a quantum-inspired dynamic spatiotemporal matching transformer (DSTMT), which enhances traffic forecasting through temporal pattern embedding and memory-augmented pattern matching. These methods demonstrate that recent progress has moved beyond conventional pairwise graph modeling toward richer spatio-temporal dependency learning. Nevertheless, most of these studies focus on improving spatio-temporal representation from traffic observations themselves or from general relational structures. In contrast, this study further explores how auxiliary query information, heterogeneous node influence, and frequency-domain temporal representations can be integrated into a unified traffic forecasting framework.
In addition to traffic observations themselves, recent studies have also emphasized the value of external auxiliary information. Liao et al. [6] utilized online route query information, geographical attributes, road intersection information, and event-related factors for traffic prediction on the Q-Traffic dataset. Their full model, denoted as Hybrid in their study, demonstrates that route queries can provide useful early signals for traffic evolution. However, Hybrid mainly integrates auxiliary features in an encoder–decoder sequence learning framework, and the delayed relation between regional query changes and future node-level traffic states is not specifically modeled. In contrast, STFQET focuses on learning region-to-node query interactions and their latent delayed influence while also addressing key node spatial propagation and noisy temporal periodicity. Several subsequent studies have also explored Q-Traffic or related Baidu Map traffic data from different perspectives. Dest-ResNet [18] discovers hotspots from crowd map queries and jointly models traffic speed sequences and query sequences through a residual sequence learning framework. GSeqAtt [19] formulates traffic speed prediction as graph-sequence modeling and combines temporal and graph-structure attention mechanisms to capture dynamic dependencies. EGAF-Net [20] incorporates event information into traffic speed prediction by learning event-aware spatio-temporal representations and fusing them with road network dependencies. These studies further demonstrate the usefulness of query-, event-, and graph-based information for traffic forecasting. Therefore, although existing traffic prediction studies have achieved promising results in spatio-temporal dependency learning, temporal pattern extraction, and auxiliary information utilization, they still have limited ability to simultaneously model delayed query effects, distinguish node importance, and enhance temporal periodicity in a unified framework.

2.2. GNN

To better characterize the non-Euclidean spatial dependencies in road networks, many studies have introduced graph neural networks or graph-based learning strategies into traffic forecasting. Cao et al. [21] proposed STDN, which integrates spatio-temporal embedding learning, dynamic relationship graph learning, and a trend-seasonality decomposition module to disentangle traffic flow components and refine node representations for prediction. Yang et al. [22] proposed GSLI, which learns heterogeneous global spatial correlations through node-scale graph structure learning and captures common spatial dependencies through feature-scale graph structure learning. Ma et al. [23] proposed TGCRN, which explicitly captures the trends and periodicity of dynamic spatial correlations through a time-aware graph structure learning method and integrates graph convolutional gated recurrent units for multi-step prediction. Zheng et al. [24] proposed DPSTGC, which uses a delay-aware directed graph attention mechanism and adaptive graph convolution to learn the delayed propagation of messages between nodes. DAGCAN [25] learns node-specific parameters and dynamic edge correlations to dynamically capture fine-grained spatio-temporal relationships in traffic data.
A number of recent graph-based methods have further improved spatial modeling from different perspectives. Prabowo et al. [26] proposed a spatial graph Transformer that uses node embedding and self-attention to adaptively manage information flow between sensor pairs according to their unique dynamic characteristics. Zhang et al. [27] proposed LightST, which transfers spatial and temporal knowledge from a high-capacity GNN teacher to a lightweight MLP student through spatio-temporal distillation and distribution alignment. This design improves inference efficiency while alleviating the over-smoothing issue caused by deep graph message passing. Jiang et al. [28] proposed MegaCRN, which integrates a meta-graph learner powered by a meta-node library into a graph convolutional recurrent encoder–decoder framework. ST-ReP [29] models fine-grained spatial–temporal relationships through a compression–extraction–decompression encoder and introduces a multi-scale temporal analysis loss. Chen et al. [30] proposed EAC, which alleviates catastrophic forgetting and parameter inflation in continual spatio-temporal graph prediction through heterogeneity-guided expansion and low-rank-guided compression. Wang et al. [31] proposed DSTG, which separates temporal correlations into seasonal and trend patterns in a disentangled dynamic spatio-temporal graph learning framework. Huang et al. [32] proposed STD-PLM, which adapts pre-trained language models to spatial–temporal prediction by designing spatial and temporal tokenizers, topology-aware node embedding, and hourglass attention modules. Fang et al. [33] proposed STWave, which combines wavelet decomposition and spectral graph attention to model trend and event components. Choi et al. [34] proposed STG-NCDE, which applies neural controlled differential equations to temporal and spatial processing. Jiang et al. [35] proposed PDFormer, which combines spatial self-attention with short-range and long-range graph masks and explicitly models the propagation delay of traffic information. Liu et al. [36] further improved model robustness by dynamically selecting adversarial node subsets through reinforcement learning.
Some graph-based studies have also begun to explore differences in the roles of individual nodes. In STPGNN [2], key nodes are identified from traffic data, and a key node graph is constructed for spatio-temporal modeling. In HonGAT [37], the relationships between nodes and their high-order neighbors are explicitly explored. FC-STGNN [3] also demonstrates the effectiveness of modeling spatial dependencies among nodes. These studies indicate that spatial dependency modeling can benefit from considering node roles, high-order neighbors, and cross-time sensor relations. For key node-centered traffic modeling, a remaining design issue is how to emphasize influential nodes while still preserving contextual information from the complete road graph. Motivated by this consideration, our model introduces a key node attention module to identify important nodes and strengthen the modeling of their spatial interactions.

3. Preliminary

3.1. Regions and Query

We divide a city into regions according to longitude and latitude coordinates, resulting in a total of $L \times W$ regions. Each region is approximately 1 km × 1 km in size. We denote the set of regions by $R = \{r_{1,1}, r_{1,2}, \ldots, r_{u,v}, \ldots\}$, where $r_{u,v}$ represents the region in the $u$-th row and the $v$-th column. In this study, $Q$ denotes the overall query information, and $Q_{u,v}$ denotes the query information associated with region $r_{u,v}$.

3.2. Road Network Map

Consider a traffic network as a graph $G = (V, E, A)$, where $V$ is the set of road segments, with $|V| = N$. $E$ is the set of edges representing connections between nodes (for example, road connectivity or spatial proximity). $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix, with $a_{i,j} = 1$ when road segments $i$ and $j$ are connected and $a_{i,j} = 0$ otherwise.

3.3. Traffic Prediction

The traffic state at time $t$ is represented by a feature matrix $X_t \in \mathbb{R}^{N \times C}$, where $C$ denotes the number of features. In this study, only traffic speed is predicted, so $C = 1$. The historical traffic data over $P$ time steps is denoted as $X = (X_{t-P+1}, X_{t-P+2}, \ldots, X_t)$. For example, when $P = S = 8$, one sample uses eight historical speed matrices as $X$, predicts eight future speed matrices as $Y$, and uses the corresponding regional query counts $Q_{query} \in \mathbb{R}^{P \times 5 \times 5 \times 2}$ as auxiliary information.
The goal of traffic prediction is to learn a mapping function $f(\cdot)$ that predicts the future traffic states for the next $S$ time steps, given the historical traffic data, query data and the graph structure:
$$Y = [X_{t+1}, \ldots, X_{t+S}] = f\big(G;\, Q;\, (X_{t-P+1}, \ldots, X_t)\big)$$
where the function $f(\cdot)$ captures both spatial dependencies (via $G$) and temporal dependencies (via the historical traffic sequence) to predict future traffic states.
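The sample layout described above can be illustrated with a short sliding-window sketch. It is not the authors' data pipeline: the function name `make_samples`, the synthetic speed and query arrays, and the stride of one step are assumptions made only for illustration.

```python
import numpy as np

def make_samples(speed, query, P=8, S=8):
    """Slice a speed series (T, N) and a query series (T, 5, 5, 2) into samples.

    Returns X: (num, P, N, 1), Q: (num, P, 5, 5, 2), Y: (num, S, N, 1).
    """
    T = speed.shape[0]
    num = T - P - S + 1                       # number of complete windows
    if num <= 0:
        raise ValueError("series too short for the requested P and S")
    # History window, query window, and future window share the same start index.
    X = np.stack([speed[i:i + P] for i in range(num)])[..., None]
    Q = np.stack([query[i:i + P] for i in range(num)])
    Y = np.stack([speed[i + P:i + P + S] for i in range(num)])[..., None]
    return X, Q, Y

rng = np.random.default_rng(0)
speed = rng.uniform(20, 60, size=(100, 223))     # synthetic speeds, 223 nodes
query = rng.poisson(3.0, size=(100, 5, 5, 2))    # synthetic regional query counts
X, Q, Y = make_samples(speed, query)
print(X.shape, Q.shape, Y.shape)
```

Each sample pairs $P$ historical speed matrices and the matching query counts with the $S$ speed matrices that follow them, which is the input and output contract of $f(\cdot)$.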

4. Methodology

4.1. Hierarchical Model Architecture

The core idea of STFQET is to jointly model delayed query effects, key node spatial dependencies, and frequency-domain temporal patterns within a unified spatio-temporal Transformer framework. In the data processing phase, we use Spatial Temporal Embedding (STE) to convert spatial attributes and temporal patterns into a unified low dimensional vector representation. Furthermore, we use a Fourier filter module to clean the spatio-temporal data to remove noise and enhance its periodicity. We extract information related to the target region from the query Q and use the query-enhanced attention module to establish the time-delay relationship between the query data and traffic data. In the feature learning phase, we employ L layers of ST Blocks to learn the complex spatio-temporal dependencies in traffic data. In the spatial feature learning phase, the key node attention module first calculates the importance scores of all nodes, masks nodes with relatively low scores, and constructs a key node subgraph from the selected key nodes; this key node branch is then combined with the graph diffusion convolution module and the spatial attention module to learn spatial features. In the temporal feature learning phase, the FFT-based temporal period extraction module transforms temporal representations into the frequency domain, adaptively weights the frequency components, and maps them back to the time domain through inverse FFT to enhance temporal feature learning; the enhanced representation is further combined with the temporal attention module to learn temporal features. The transform attention module is placed between the encoder and decoder, enabling the model to predict future data based on real traffic data, avoiding error propagation during prediction. Following GMAN [9], STFQET uses spatial attention, temporal attention, and transform attention as standard attention components and further extends this framework with the modules proposed in this study. 
The overall model framework is shown in Figure 2.

4.2. Spatial Temporal Embedding

4.2.1. Spatial Embedding

The purpose of spatial embedding is to enable the model to learn the relationships between nodes, thereby improving the accuracy of traffic prediction. The adjacency matrix describes only direct connectivity between nodes, and its ability to express higher-order neighbors or structural similarity is limited. To obtain the spatial embedding, we therefore apply a graph embedding method, such as DeepWalk or Node2Vec, to encode structural relationships among nodes and generate the spatial embedding $E_s$.
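As a rough illustration of how such graph embedding methods see beyond direct connectivity, the sketch below generates truncated uniform random walks on the adjacency matrix, which is the corpus-building stage of DeepWalk. The skip-gram training that turns walks into vectors is omitted, and the function name, walk length, and number of walks are assumptions rather than settings reported in this paper.

```python
import numpy as np

def random_walks(A, walk_len=10, walks_per_node=5, seed=0):
    """Generate uniform random walks on a graph given its adjacency matrix A."""
    rng = np.random.default_rng(seed)
    neighbors = [np.flatnonzero(row) for row in A]   # neighbor list per node
    walks = []
    for _ in range(walks_per_node):
        for start in rng.permutation(len(A)):        # shuffle start order per pass
            walk = [int(start)]
            # Extend the walk until it is long enough or hits an isolated node.
            while len(walk) < walk_len and len(neighbors[walk[-1]]) > 0:
                walk.append(int(rng.choice(neighbors[walk[-1]])))
            walks.append(walk)
    return walks

# Toy ring graph with six road segments (illustrative only).
A = np.zeros((6, 6), dtype=int)
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1
walks = random_walks(A)
print(walks[0])
```

Nodes that co-occur within a walk are several hops apart in the graph, so an embedding trained on these walks can reflect higher-order neighborhoods that a single adjacency entry cannot.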

4.2.2. Temporal Embedding

Like spatial embedding, temporal embedding aims to enable the model to learn the underlying connections between time steps in traffic data. Most existing models build temporal embeddings on fixed daily or weekly timescales or divide time steps into periods and trends. Such fixed schemes are difficult to adapt to constantly changing traffic data.
In our research, we use FFT to process time series and construct temporal embedding from the dominant periodic components contained in the data. Specifically, we first center the traffic data of each node by subtracting its mean and then apply FFT to project it into the frequency domain. The spectra of all nodes are then aggregated to obtain a global spectrum and normalized. We select the three most significant period lengths in the global spectrum as the basis for temporal embedding because the first three peaks are the most evident after Fourier transform, while the subsequent frequency components usually correspond to much shorter period lengths. The sensitivity of this choice is further analyzed in Section 5. The formula for establishing the global spectrum is shown below.
$$Frqu(X) = \frac{\sum_{n=1}^{N} \mathcal{F}\big(X_{:,n} - \mu(X_{:,n})\big)}{\max\Big(\sum_{n=1}^{N} \mathcal{F}\big(X_{:,n} - \mu(X_{:,n})\big)\Big)}.$$
Here, $Frqu(X)$ represents the aggregated frequency spectrum of traffic data over all nodes, $\mathcal{F}$ represents the Fast Fourier Transform, $X_{:,n}$ denotes the complete temporal sequence of the $n$-th node, and $\mu(X_{:,n})$ denotes the mean value of that sequence. We first detect the peak set in $Frqu(X)$ and then extract the most significant top-k cycles as follows:
$$TopK(X; k_f) = TopK\big(Peak(Frqu(X));\, k_f\big),$$
where $Peak(\cdot)$ denotes the set of local spectral peaks detected from the aggregated spectrum, and $TopK(\cdot\,; k_f)$ selects the $k_f$ most significant peaks according to their amplitudes.
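The period extraction above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: amplitudes are taken as FFT magnitudes, a peak is a bin strictly larger than both neighbors, the data are synthetic hourly series, and the function names are hypothetical.

```python
import numpy as np

def global_spectrum(X):
    """X: (T, N) speeds. Returns the normalized aggregated amplitude spectrum."""
    centered = X - X.mean(axis=0, keepdims=True)      # subtract each node's mean
    amp = np.abs(np.fft.rfft(centered, axis=0))       # assumption: use magnitudes
    agg = amp.sum(axis=1)                             # aggregate over nodes
    return agg / agg.max()                            # normalize to [0, 1]

def top_k_periods(X, k_f=3, step_hours=1.0):
    """Return the k_f dominant period lengths (in hours) of the global spectrum."""
    spec = global_spectrum(X)
    freqs = np.fft.rfftfreq(X.shape[0], d=step_hours)  # cycles per hour
    inner = np.arange(1, len(spec) - 1)                # skip DC and the last bin
    is_peak = (spec[inner] > spec[inner - 1]) & (spec[inner] > spec[inner + 1])
    peaks = inner[is_peak]
    best = peaks[np.argsort(spec[peaks])[::-1][:k_f]]  # largest amplitudes first
    return 1.0 / freqs[best]

# Synthetic example: four weeks of hourly data with 24 h, 12 h, and 8 h rhythms.
rng = np.random.default_rng(0)
t = np.arange(24 * 28)[:, None]
phase = rng.uniform(0, 2 * np.pi, size=(3, 10))
X = 40 + sum(a * np.sin(2 * np.pi * t / p + phase[i])
             for i, (a, p) in enumerate([(3, 24), (2, 12), (1, 8)]))
X = X + rng.normal(0, 0.5, size=X.shape)
periods = top_k_periods(X)
print(np.round(periods, 2))
```

On this synthetic input the three selected periods are close to 24 h, 12 h, and 8 h, ordered by amplitude. Real traffic series rarely align exactly with FFT bins, so some spectral leakage around each peak should be expected.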
The use of the global spectrum is based on the assumption that road segments in the same urban area share several dominant periodic components, such as daily and sub-daily traffic rhythms. This does not mean that all nodes have identical temporal trajectories; rather, the selected global periods provide a shared temporal basis, while node-level variations remain in the original traffic sequences and are further modeled by Fourier filtering, temporal attention, and the FFT-based temporal period extraction module. To verify this assumption, we conduct a node-level FFT consistency analysis, as shown in Figure 3. The three dominant global periods are 24 h, 12 h, and 8 h. Among the node-level top-3 spectral peaks, 220 out of 223 nodes have at least one peak within ±5% of these global periods, accounting for 98.7% of all nodes. Specifically, the 24 h, 12 h, and 8 h periods are matched by 219/223, 202/223, and 116/223 nodes, respectively. This result indicates that, although spatial heterogeneity exists, dominant temporal rhythms are highly consistent across nodes in the studied area.
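A consistency check of this kind could be computed along the following lines. The sketch is an assumption about the procedure, not the authors' analysis script: peaks are taken from FFT magnitudes, the helper names are hypothetical, and the data are synthetic, with half of the nodes deliberately given unrelated rhythms.

```python
import numpy as np

def node_top_periods(x, k=3, step_hours=1.0):
    """Top-k spectral-peak periods (hours) of a single node series x of shape (T,)."""
    amp = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=step_hours)
    inner = np.arange(1, len(amp) - 1)
    peaks = inner[(amp[inner] > amp[inner - 1]) & (amp[inner] > amp[inner + 1])]
    best = peaks[np.argsort(amp[peaks])[::-1][:k]]
    return 1.0 / freqs[best]

def match_ratio(X, global_periods, tol=0.05):
    """Fraction of nodes with at least one top-k peak within ±tol of a global period."""
    g = np.asarray(global_periods, dtype=float)
    hits = 0
    for n in range(X.shape[1]):
        p = node_top_periods(X[:, n])
        rel = np.abs(p[:, None] - g[None, :]) / g[None, :]   # relative deviation
        hits += bool((rel <= tol).any())
    return hits / X.shape[1]

rng = np.random.default_rng(0)
t = np.arange(24 * 28)[:, None]                # four weeks, hourly (assumed)

def mix(periods, n):
    """n synthetic nodes that share the given periods, with random phases and noise."""
    ph = rng.uniform(0, 2 * np.pi, size=(len(periods), n))
    clean = sum(np.sin(2 * np.pi * t / p + ph[i]) for i, p in enumerate(periods))
    return clean + rng.normal(0, 0.3, size=(len(t), n))

X = np.hstack([mix([24, 12, 8], 6), mix([6, 4, 3], 6)])
ratio = match_ratio(X, [24, 12, 8])
print(ratio)
```

Only the six nodes built from the 24 h, 12 h, and 8 h rhythms count as matches here, so the ratio separates nodes that follow the shared global periods from those that do not.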
We use the extracted $TopK(X; k_f)$ to perform temporal embedding to obtain $E_t$, where $k_f = 3$. So far we have obtained the complete spatio-temporal embedding: $STE = E_s + E_t$.

4.3. Fourier Filtering Module

In real-world traffic data, abnormal fluctuations are often caused by sensor errors, temporary incidents, or missing observations. These disturbances may weaken the model’s ability to learn stable temporal patterns. Therefore, we introduce a Fourier filter module before spatio-temporal feature extraction to suppress noisy components while retaining meaningful trend and fluctuation information. Given the input traffic tensor $X \in \mathbb{R}^{B \times T \times N \times 1}$, where $B$ denotes the batch size, $T$ denotes the number of time steps, and $N$ denotes the number of nodes, we first apply the Fast Fourier Transform along the temporal dimension to obtain its frequency-domain representation:
$$\hat{X} = \mathcal{F}(X).$$
To avoid the distortion caused by using only a single filter, we construct three parallel branches in the frequency domain, namely adaptive filtering, low-pass filtering, and high-pass filtering:
$$\hat{X}_{adaptive} = \hat{X} \odot \sigma(\alpha), \quad \hat{X}_{low} = \hat{X} \odot M_{low}, \quad \hat{X}_{high} = \hat{X} \odot M_{high},$$
where $\alpha$ is a learnable parameter, and $\sigma(\cdot)$ denotes the sigmoid function. $M_{low}$ and $M_{high}$ are binary masks defined on the frequency axis. In implementation, we set the cutoff threshold as $\tau = 0.1 f_{max}$, where $f_{max}$ is the maximum FFT frequency. Thus, low-pass filtering preserves components satisfying $f \le \tau$ to emphasize slowly varying trends, while high-pass filtering preserves components satisfying $f \ge \tau$ to retain short-term variations and local details.
Each filtered spectrum is then mapped back to the time domain through the inverse Fourier transform. To prevent useful information from being overly removed, we also keep the original signal as a residual branch. Denoting the four time-domain branches by $Z_i \in \{\mathcal{F}^{-1}(\hat{X}_{adaptive}), \mathcal{F}^{-1}(\hat{X}_{low}), \mathcal{F}^{-1}(\hat{X}_{high}), X\}$, their fusion is written as
$$X_{fused} = \sum_{i=1}^{4} g_i Z_i, \quad [g_1, g_2, g_3, g_4] = \mathrm{Softmax}(\beta),$$
where $\beta$ is a learnable gate parameter. Because filtering can shift the data distribution away from that of the original data, we rescale the fused signal to restore the original mean and standard deviation:
$$X_{filter} = \frac{X_{fused} - \mu_{fused}}{\sigma_{fused}} \cdot \sigma_{original} + \mu_{original},$$
where $\mu_{fused}$ and $\sigma_{fused}$ denote the mean and standard deviation of the fused data, and $\mu_{original}$ and $\sigma_{original}$ denote the mean and standard deviation of the original traffic data.
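The forward pass of such a filter can be sketched in NumPy as below. Several details are assumptions, since the text does not fix them: $\alpha$ is taken to hold one value per frequency bin, the statistics are computed over the whole tensor, a small `eps` guards the division, and the learnable parameters are plain arrays rather than trained weights.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def fourier_filter(X, alpha, beta, cutoff_ratio=0.1, eps=1e-8):
    """X: (B, T, N, 1); alpha: (T // 2 + 1,) gate logits; beta: (4,) fusion logits."""
    T = X.shape[1]
    Xf = np.fft.rfft(X, axis=1)                         # spectrum along time
    freqs = np.fft.rfftfreq(T)                          # bin frequencies, max = f_max
    tau = cutoff_ratio * freqs.max()                    # cutoff tau = 0.1 * f_max
    low = (freqs <= tau).astype(float).reshape(1, -1, 1, 1)
    high = (freqs >= tau).astype(float).reshape(1, -1, 1, 1)
    gate = (1.0 / (1.0 + np.exp(-alpha))).reshape(1, -1, 1, 1)   # sigmoid(alpha)
    branches = [
        np.fft.irfft(Xf * gate, n=T, axis=1),           # adaptive branch
        np.fft.irfft(Xf * low, n=T, axis=1),            # low-pass branch
        np.fft.irfft(Xf * high, n=T, axis=1),           # high-pass branch
        X,                                              # residual branch
    ]
    g = softmax(beta)
    fused = sum(gi * Zi for gi, Zi in zip(g, branches))
    # Distribution correction: restore the original mean and standard deviation.
    return (fused - fused.mean()) / (fused.std() + eps) * X.std() + X.mean()

rng = np.random.default_rng(0)
B, T, N = 2, 64, 5
X = 40 + rng.normal(0, 3, size=(B, T, N, 1))            # synthetic speeds
out = fourier_filter(X, alpha=np.zeros(T // 2 + 1), beta=np.zeros(4))
print(out.shape, round(float(out.mean()), 3), round(float(out.std()), 3))
```

With zero-valued parameters, and because no frequency bin in this example falls exactly on the cutoff, the two masks are complementary and the output reproduces the input. That makes a convenient sanity check before the gates are trained.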

4.4. Query-Enhanced Attention Module

Query data can largely reflect future traffic speed changes. For example, when a theater opens at time $t$, people will begin navigating to the theater at time $t - p$ to watch the play, where $p$ denotes the lead time between the query behavior and the target time. This will cause traffic speeds on that road section to drop significantly over the next time period $t - p + q$, where $q$ denotes the travel or arrival delay after the query is issued. To model this influence, we incorporate a query-enhanced attention module to learn the relationship between query data and traffic data.

4.4.1. Query Processing Module

The query data is organized at the regional level, whereas the traffic data is organized at the node level within an area. To predict the future traffic speed in region $r_{i,j}$, we select the query data $Q_{query} = (Q_{i-2,j-2}, Q_{i-1,j-2}, \ldots, Q_{i+2,j+2})$ from the surrounding 5 × 5 region centered at $r_{i,j}$ as auxiliary information. This query data contains information about the travel process, including the origin, destination, departure time, and estimated arrival time. We denote the number of departures from neighboring regions to the target region at time $t$ and the number of arrivals from the target region to neighboring regions at time $t$ by $QA_{i,j} \in \mathbb{R}^{5 \times 5 \times T}$ and $QS_{i,j} \in \mathbb{R}^{5 \times 5 \times T}$, respectively. For convenience, the 5 × 5 neighboring regions are further unfolded into $N_q = 25$ regional tokens at each time step.
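The unfolding step can be illustrated with a small reshaping sketch. It assumes row-major ordering of the 5 × 5 grid and stacks the two count tensors as two channels per regional token; the function name and the synthetic counts are hypothetical.

```python
import numpy as np

def build_query_tokens(QA, QS):
    """QA, QS: (5, 5, T) regional counts -> tokens of shape (T, 25, 2)."""
    stacked = np.stack([QA, QS], axis=-1)            # (5, 5, T, 2)
    T = stacked.shape[2]
    # Move time to the front, then flatten the 5 x 5 grid row by row.
    return stacked.transpose(2, 0, 1, 3).reshape(T, -1, 2)

rng = np.random.default_rng(0)
T = 8
QA = rng.poisson(3.0, size=(5, 5, T))                # synthetic counts
QS = rng.poisson(2.0, size=(5, 5, T))
tokens = build_query_tokens(QA, QS)
print(tokens.shape)
```

Under this ordering, token index $r = 5u + v$ corresponds to the grid cell in row $u$ and column $v$ (both counted from zero), so the spatial position of each regional token stays recoverable.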

4.4.2. Query Attention Module

The query tensor is defined as $Q_{target} = \mathrm{concat}(\mathrm{QA}_{i,j}, \mathrm{QS}_{i,j}) \in \mathbb{R}^{T \times N_q \times 2}$, where $N_q = 25$. To establish a relationship between the query data and the traffic tensor, we first perform feature projection. For simplicity, the batch dimension is omitted below.
$$X_{proj} = \phi_x(X_{filter}), \quad Q_{proj} = \phi_q(Q_{target}),$$
where $X_{proj} \in \mathbb{R}^{T \times N \times D}$ and $Q_{proj} \in \mathbb{R}^{T \times N_q \times D}$, and $\phi_x(\cdot)$ and $\phi_q(\cdot)$ denote fully connected layers. The cross-feature fusion of query data and traffic data is then performed by the query-enhanced attention mechanism. For each time step $t$, the query, key, and value matrices are constructed as
$$Q^{att}_{t} = W_q(Q_{proj,t}), \quad K^{att}_{t} = W_k(X_{proj,t}), \quad V^{att}_{t} = W_v(X_{proj,t}),$$
where $W_q$, $W_k$, and $W_v$ are learnable linear projection matrices used to map the query, key, and value features into a shared attention space, respectively.
The attention score between the $r$-th neighboring region and the $n$-th node is calculated as
$$\mathrm{Score}_{t,r,n} = \frac{\exp\left(Q^{att}_{t,r} (K^{att}_{t,n})^{\top} / \sqrt{D}\right)}{\sum_{n'=1}^{N} \exp\left(Q^{att}_{t,r} (K^{att}_{t,n'})^{\top} / \sqrt{D}\right)}.$$
The weighted traffic feature corresponding to the $r$-th neighboring region is then written as
$$M_{t,r} = \phi\left(\sum_{n=1}^{N} \mathrm{Score}_{t,r,n} V^{att}_{t,n}\right),$$
where $\phi(\cdot)$ is a fully connected layer. The mapped regional features are averaged along the regional dimension and then added to the original traffic data through a broadcast mechanism to achieve residual connection:
$$X_{Query,t} = X_{proj,t} + \frac{1}{N_q} \sum_{r=1}^{N_q} M_{t,r},$$
where the averaged regional feature is broadcast along the node dimension before residual addition.
Spatially, this module computes cross-granularity correlations between regional query data and node-level traffic data. We do not impose a fixed explicit delay between query records and future traffic states, because the effective lag can differ across travel purposes and traffic scenarios. For example, queries before a concert may influence nearby roads after a short gathering period, whereas airport-related queries may correspond to longer travel durations and later congestion responses; a manually specified delay is therefore difficult to generalize.
In the query-enhanced attention module, the regional query representation is used as the query matrix, while the node-level traffic speed representation is used as the key and value matrices. For a target time $t$, denote the historical query representations as $(u_{t-P+1}, \ldots, u_t)$ and their learned attention scores as $(\omega_{t-P+1}, \ldots, \omega_t)$. If a query surge at $u_{t-3}$ receives a larger score $\omega_{t-3}$ than other intervals, the module assigns more weight to the query signal three steps before the target time, which is equivalent to selecting a three-step delay for prediction. Therefore, different attention scores correspond to different candidate delay lengths, allowing STFQET to learn the latent delayed relationship through query-traffic attention. The perturbation study in Section 5.7 further verifies that the model relies on the temporal position of historical query sequences. Specifically, it compares the original query history with zeroed, current-only, and temporally shuffled query settings. This design enables the model to dynamically adjust the region-node association strength and effectively utilize query information to assist in predicting future traffic status.
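The region-to-node attention and the broadcast residual described in this subsection can be sketched in NumPy. This is a minimal sketch under stated assumptions: the fully connected layers are replaced by bias-free matrix products, the weights are random rather than learned, and the function name is hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def query_enhanced_attention(x_proj, q_proj, Wq, Wk, Wv, Wo):
    """x_proj: (T, N, D) traffic features; q_proj: (T, N_q, D) query features."""
    D = x_proj.shape[-1]
    q_att = q_proj @ Wq                                   # (T, N_q, D)
    k_att = x_proj @ Wk                                   # (T, N, D)
    v_att = x_proj @ Wv                                   # (T, N, D)
    score = softmax(q_att @ k_att.transpose(0, 2, 1) / np.sqrt(D), axis=-1)  # (T, N_q, N)
    m = (score @ v_att) @ Wo                              # (T, N_q, D); Wo stands in for phi
    # Average over regions, then broadcast over the node axis for the residual.
    return x_proj + m.mean(axis=1, keepdims=True), score

rng = np.random.default_rng(0)
T, N, N_q, D = 4, 7, 25, 8
x_proj = rng.normal(size=(T, N, D))
q_proj = rng.normal(size=(T, N_q, D))
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4))
x_query, score = query_enhanced_attention(x_proj, q_proj, Wq, Wk, Wv, Wo)
```

Because the regional features are averaged before the residual, every node at a given time step receives the same additive query term.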

4.5. ST Block

4.5.1. Key Node Attention Module

The key node attention module primarily addresses the unequal contributions of different nodes in traffic networks. Traditional neural networks usually treat all nodes equally, which makes it difficult to emphasize those nodes that dominate local traffic evolution, such as major intersections and highway entrances. To alleviate this issue, we design an importance scoring mechanism to identify key nodes and construct a key node subgraph. Attention is then performed on this subgraph to extract more discriminative spatial dependencies.
Specifically, let $H^{l-1} \in \mathbb{R}^{B \times T \times N \times D}$ denote the input hidden representation of the $(l-1)$th layer, where $B$ denotes the batch size, $T$ denotes the number of time steps, $N$ denotes the number of nodes, and $D$ denotes the feature dimension. Before estimating node importance, we first fuse the hidden traffic representation and the spatio-temporal embedding:
$$\tilde{H}^{l-1} = \phi_h\left([H^{l-1}, \mathrm{STE}]\right),$$
where $[\cdot, \cdot]$ denotes feature concatenation, and $\phi_h(\cdot)$ denotes a fully connected layer. Based on $\tilde{H}^{l-1}$, we employ a three-layer fully connected network to compute a scalar importance score for each node:
$$s_{b,t,n} = \phi_3\left(\phi_2\left(\phi_1\left(\tilde{H}^{l-1}_{b,t,n}\right)\right)\right), \quad \bar{s}_n = \frac{1}{BT} \sum_{b=1}^{B} \sum_{t=1}^{T} s_{b,t,n}, \quad V_k = \mathrm{TopK}(\bar{s}; k),$$
where $s_{b,t,n}$ is the importance score of node $n$ at time step $t$ in batch $b$; $\phi_1(\cdot)$, $\phi_2(\cdot)$, and $\phi_3(\cdot)$ denote the three fully connected layers used for importance scoring; $\bar{s}_n$ is the averaged score used for node selection; and $k = \max(1, \mathrm{round}(\rho N))$.
Therefore, the importance is computed from the current hidden traffic states together with the spatio-temporal embedding, rather than from a predefined structural score. Here, $\rho$ denotes the key node ratio, and $k$ is determined adaptively according to the graph size. Since $\bar{s}$ is recomputed in each forward pass, the selected key node set is dynamic rather than static.
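The selection step can be illustrated with a short NumPy sketch. The scores are taken as given here (random toy values instead of the output of the three-layer scoring network), and the function name `select_key_nodes` is hypothetical.

```python
import numpy as np

def select_key_nodes(scores, rho):
    """scores: (B, T, N) importance scores; returns (indices of top-k nodes, k)."""
    n = scores.shape[-1]
    # Python's round() rounds exact halves to the nearest even integer.
    k = max(1, round(rho * n))
    s_bar = scores.mean(axis=(0, 1))              # average over batch and time: (N,)
    idx = np.argsort(-s_bar, kind="stable")[:k]   # highest averaged score first
    return idx, k

rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 12, 223))   # toy scores for a 223-node graph
scores[..., 5] += 100.0                  # make node 5 clearly the most important
key_idx, k = select_key_nodes(scores, rho=0.21)
```

For a 223-node graph and a ratio of 0.21, this gives $k = \mathrm{round}(46.83) = 47$ key nodes; the `max(1, ...)` guard keeps at least one node on very small graphs.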
Based on the selected node set $V_k$, we perform self-attention on the key node subgraph:
$$H^{l}_{key} = S\left(\frac{(W_q \tilde{H}^{l-1}_{V_k})(W_k \tilde{H}^{l-1}_{V_k})^{\top}}{\sqrt{d_k}}\right) W_v \tilde{H}^{l-1}_{V_k},$$
where $S$ is a softmax function; $W_q$, $W_k$, and $W_v$ are learnable matrices; and $\tilde{H}^{l-1}_{V_k}$ denotes the spatio-temporal features on the key node subgraph. The attention output is then scattered back to the original node space through a zero-padding operation and fused with the projected full-node representation:
$$H^{l}_{KN} = \phi_o\left([\tilde{H}^{l-1}, \mathrm{ZeroPad}(H^{l}_{key})]\right),$$
where $\phi_o(\cdot)$ denotes a fully connected layer. In this way, the model preserves complete node information while strengthening the representation of important nodes. In the spatial feature learning component, we also use a traditional spatial attention module and a graph diffusion convolution module to supplement the common-node information that may be weakened by the key node selection process. The fusion output of these modules is shown below.
$$H^{l}_{S} = \phi_s\left([H^{l}_{KN}, H^{l}_{SA}, H^{l}_{DC}]\right)$$
Here, $\phi_s(\cdot)$ denotes a fully connected layer, $H^{l}_{SA}$ represents the output of the $l$-th spatial attention module, $H^{l}_{DC}$ represents the output of the $l$-th graph diffusion convolution module, and $H^{l}_{S}$ represents the result of spatial feature learning.
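The key node subgraph attention and the zero-padding scatter can be sketched for a single batch and time slice. This is an illustration under assumptions: single-head attention, random weights, and the fully connected layer $\phi_o$ left out so that the function returns only its concatenated input.

```python
import numpy as np

def key_node_attention(h, idx, Wq, Wk, Wv):
    """h: (N, D) fused features for one batch/time slice; idx: key node indices."""
    h_key = h[idx]                                   # (k, D) features on the subgraph
    q, k_, v = h_key @ Wq, h_key @ Wk, h_key @ Wv
    logits = q @ k_.T / np.sqrt(q.shape[-1])         # (k, k) pairwise scores
    logits -= logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = np.zeros((h.shape[0], v.shape[-1]))        # ZeroPad: non-key rows stay zero
    out[idx] = attn @ v                              # scatter back to the node space
    return np.concatenate([h, out], axis=-1)         # concatenated input to phi_o

rng = np.random.default_rng(0)
N, D = 10, 4
h = rng.normal(size=(N, D))
idx = np.array([1, 4, 7])
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
fused_in = key_node_attention(h, idx, Wq, Wk, Wv)    # (N, 2 * D)
```

The attention matrix has size $k \times k$ rather than $N \times N$, while the concatenation keeps the full-node features available to the following layer.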

4.5.2. FFT-Based Temporal Period Extraction Module

This module aims to transform traffic data into frequency domain data through FFT and then capture and learn the temporal periodic features inherent in traffic data in the frequency domain. Its core is to decompose complex time series into a set of sinusoidal components. By modulating these components, the model can emphasize or suppress specific temporal frequencies, thereby achieving more accurate predictions.
The process begins by transforming the traffic data $H^{l-1} \in \mathbb{R}^{B \times T \times N \times D}$ into the frequency domain using FFT applied along the temporal dimension. For a single node's feature vector across time, $h_{b,n,d} \in \mathbb{R}^{T}$ is the time series of the input signal at batch index $b$, node index $n$, and feature dimension index $d$. The transformation formula is as follows:
$$\mathcal{F}(h_{b,n,d})[k] = \sum_{t=0}^{T-1} h_{b,t,n,d} \cdot \exp\{-i 2\pi k t / T\}.$$
In practice, this is efficiently computed for all elements in the tensor via FFT, yielding a complex-valued tensor $H^{l}_{fft} \in \mathbb{C}^{B \times K_{fft} \times N \times D}$, where $K_{fft} = \lfloor T/2 \rfloor + 1$ is the number of frequencies obtained from a real-valued signal.
To reduce computational complexity and focus on the most salient oscillations, the model retains only the first $K$ dominant frequency components ($K \le K_{fft}$), discarding higher-frequency noise. The selection is formalized as
$$H^{select}_{fft} = H^{l}_{fft}[:, :K, :, :].$$
This truncation effectively acts as a low-pass filter, preserving the fundamental and most significant harmonic components of the traffic flow signal.
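The low-pass effect of the truncation can be checked on a toy series with NumPy's real FFT. The series below is synthetic (one slow cycle plus a faster oscillation standing in for noise), and the values of $T$ and $K$ are illustrative.

```python
import numpy as np

T, K = 12, 3
t = np.arange(T)
# Toy series: one slow cycle (bin 1) plus a fast oscillation (bin 5).
h = np.sin(2 * np.pi * t / T) + 0.3 * np.sin(2 * np.pi * 5 * t / T)

H_fft = np.fft.rfft(h)           # complex spectrum of length T // 2 + 1
K_fft = H_fft.shape[0]
H_select = H_fft[:K]             # keep the K lowest-frequency bins

# Zero-pad back to K_fft and invert to inspect what the truncation keeps.
H_pad = np.zeros(K_fft, dtype=complex)
H_pad[:K] = H_select
h_low = np.fft.irfft(H_pad, n=T)
```

Here `h_low` contains only the slow cycle: the component in bin 5 lies above the cut-off and is removed.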
Subsequently, the module learns to adaptively modulate these selected frequency components. Two learnable real-valued weight tensors $W_r, W_i \in \mathbb{R}^{B \times N \times D \times K}$ are derived from the input and are used to scale the real and imaginary parts of the spectrum, respectively. These weights are applied to the truncated spectrum via element-wise multiplication:
$$\tilde{H}^{l}_{fft} = W_r \odot \mathrm{Re}(H^{select}_{fft}) + i \cdot W_i \odot \mathrm{Im}(H^{select}_{fft}),$$
where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ represent the real and imaginary parts of the spectrum, respectively. This operation allows the model to perform feature-specific spectral manipulation, independently amplifying, attenuating, or phase-shifting each frequency component of each node and feature, thereby better learning the spatio-temporal characteristics in traffic data.
The modulated frequency components are then processed through a fully connected layer to enhance representation learning before being converted back to the time domain via an inverse Fast Fourier Transform. The formula is shown below.
$$H^{l}_{FR} = \mathcal{F}^{-1}\left(\mathrm{Pad}_{K \to K_{fft}}\left(\phi(\tilde{H}^{l}_{fft})\right)\right),$$
where $\mathrm{Pad}(\cdot)$ zero-pads the processed $K$-length spectrum back to the original FFT length $K_{fft}$ before inversion. Finally, in order to stabilize the distribution of the data, we perform residual connection and batch normalization operations, the formula of which is shown below:
$$H^{l}_{FT} = \mathrm{BatchNorm}(H^{l}_{FR} + H^{l-1}),$$
where $H^{l}_{FT}$ is the output of the FFT-based temporal period extraction module. This module provides a powerful and interpretable method for temporal feature extraction by leveraging adaptive frequency domain processing, enabling the model to dissect and manipulate the fundamental time-varying patterns in traffic data. In the temporal feature learning component, we also use a traditional temporal attention module to supplement the information in the FFT-based temporal period extraction module. The fused output of these modules is shown below:
$$H^{l}_{T} = \phi\left([H^{l}_{FT}, H^{l}_{TA}]\right),$$
where $H^{l}_{TA}$ is the output of the $l$-th temporal attention module, and $\phi(\cdot)$ denotes a fully connected layer. We fuse the spatial features and temporal features together to obtain the output of ST Block at the $l$-th layer:
$$H^{l} = \phi\left([H^{l}_{S}, H^{l}_{T}]\right).$$
Finally, we convert the learned spatio-temporal features into the predicted future traffic status Y through a fully connected layer.
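The frequency-domain path of the FFT-based temporal period extraction module (transform, truncate, modulate, zero-pad, invert, residual) can be sketched in NumPy. This is a simplified sketch, not the authors' implementation: the fully connected layer on the spectrum and the batch normalization are omitted, the weights are random rather than learned, and they are laid out as $(B, K, N, D)$ to match the spectrum axes.

```python
import numpy as np

def fft_period_block(h, w_r, w_i, K):
    """h: (B, T, N, D) real tensor; w_r, w_i: (B, K, N, D) real weights."""
    T = h.shape[1]
    H_fft = np.fft.rfft(h, axis=1)                       # (B, T // 2 + 1, N, D)
    H_sel = H_fft[:, :K]                                 # keep the first K bins
    H_mod = w_r * H_sel.real + 1j * (w_i * H_sel.imag)   # scale Re and Im separately
    H_pad = np.zeros_like(H_fft)
    H_pad[:, :K] = H_mod                                 # zero-pad from K back to K_fft
    h_fr = np.fft.irfft(H_pad, n=T, axis=1)              # back to (B, T, N, D)
    return h_fr + h                                      # residual connection only

rng = np.random.default_rng(0)
B, T, N, D, K = 2, 12, 5, 4, 3
h = rng.normal(size=(B, T, N, D))
w_r = rng.normal(size=(B, K, N, D))
w_i = rng.normal(size=(B, K, N, D))
h_ft = fft_period_block(h, w_r, w_i, K)
```

As a sanity check, all-ones weights with no truncation leave the spectrum unchanged, so the block then returns the input plus itself.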

4.6. Standard Attention Components

The spatial attention, temporal attention, and transform attention modules in STFQET follow the standard GMAN design [9]. In general, an attention operation is written as
$$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V.$$
Spatial attention applies this operation over road nodes at each time step, temporal attention applies it over historical time steps for each node, and transform attention uses future temporal embeddings as queries and historical temporal embeddings as keys to map encoded historical features to future representations:
$$H_{tr} = \mathrm{Attn}\left(\phi_q(\mathrm{STE}_{pred}), \phi_k(\mathrm{STE}_{his}), \phi_v(H_{enc})\right).$$
These standard components provide basic spatial–temporal dependency learning, while STFQET further enhances them through query-enhanced interaction, key node attention, Fourier filtering, and FFT-based temporal period extraction.
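To make the role of transform attention concrete, the sketch below applies scaled dot-product attention with future embeddings as queries and historical embeddings as keys, which maps $P$ historical steps to $S$ future steps per node. The projections $\phi_q$, $\phi_k$, and $\phi_v$ are taken as identity maps and all tensors are random toy values; both are simplifying assumptions.

```python
import numpy as np

def attn(Q, K, V):
    """Scaled dot-product attention over the second-to-last axis of K and V."""
    d = Q.shape[-1]
    logits = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
P, S, N, D = 12, 4, 5, 8
ste_his = rng.normal(size=(N, P, D))    # historical spatio-temporal embeddings
ste_pred = rng.normal(size=(N, S, D))   # future spatio-temporal embeddings
h_enc = rng.normal(size=(N, P, D))      # encoded historical features
h_tr = attn(ste_pred, ste_his, h_enc)   # (N, S, D): one feature vector per future step
```

Each future step attends directly to all historical steps, so no prediction is fed back as input to a later step.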

4.7. Complexity Analysis

Let $N$ denote the number of road segments, $P$ the number of historical steps, $S$ the number of prediction steps, $D$ the hidden dimension, $L$ the number of ST Blocks, and $k$ the number of selected key nodes. In STFQET, the dominant computational cost comes from the spatial modeling modules in each ST Block, including spatial attention and graph diffusion convolution, which lead to $O(L(P+S)N^{2}D)$. Although the temporal modeling modules, such as temporal attention and FFT-based temporal extraction, also introduce additional cost, their complexities mainly depend on $P$ and $S$ (e.g., $O(NP^{2}D)$ or $O(NDP \log P)$). Since $P$ and $S$ are small and fixed in our setting, while $N$ is much larger, the overall complexity is dominated by the spatial dependency modeling term rather than the temporal modeling term. In addition, the key node attention branch reduces its pairwise interaction cost to $O(L(P+S)k^{2}D)$, where $k \ll N$, which further improves efficiency.
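A rough sense of the saving in the key node branch can be obtained by plugging numbers into the two quadratic terms. The sketch below drops all constant factors; $N = 223$, $\rho = 0.21$, two ST Blocks, and $S = 4$ come from the settings reported in Section 5, whereas $P = 12$ and $D = 64$ are illustrative assumptions.

```python
def pairwise_cost(num_nodes, steps, dim, layers):
    """Operation count implied by O(L * steps * n^2 * D), constants dropped."""
    return layers * steps * num_nodes ** 2 * dim

N, rho = 223, 0.21            # Q-24-33 node count and the reported key node ratio
P, S, D, L = 12, 4, 64, 2     # P and D are illustrative assumptions
k = max(1, round(rho * N))    # number of key nodes
full = pairwise_cost(N, P + S, D, L)
key = pairwise_cost(k, P + S, D, L)
ratio = key / full            # equals (k / N) ** 2
```

With these values $k = 47$, and the pairwise term of the key node branch is about 4.4% of the full-graph term; the ratio depends only on $k/N$, not on the assumed $P$ and $D$.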
To complement the theoretical complexity analysis, we further report the practical training cost of STFQET on five datasets, including the number of nodes, the actual number of training epochs until convergence, the average training time per epoch, and the overall training and validation time.
As shown in Table 1, the practical training cost of STFQET increases with the graph size, which is consistent with the theoretical complexity analysis. STFQET remains trainable on both medium-scale and large-scale datasets, while the overall cost is also affected by the number of epochs required for convergence.

5. Experiments

5.1. Dataset

Liao et al. [6] proposed the Q-Traffic dataset, which is the data source for our research. This dataset is provided by Baidu and collected from Beijing, China. It contains three major types of data: road network, user queries, and traffic speed. The specific explanation is as follows:
  • The road network data presents the complete architecture of Beijing’s road network, providing a foundation for building a road network graph. The road network is divided into 68 × 72 regions along the X and Y axes and provides detailed information on the road segments within each region.
  • The user query data includes a total of 114 million user query records recorded from 1 April 2017 to 31 May 2017. Each query contains the user ID, search timestamp, and current location coordinates of the user, as well as the coordinates of the starting and destination locations and query keywords.
  • The traffic speed data covers 15,073 road sections with a total mileage of approximately 738.91 km. This data records the real-time vehicle speed on road segments covered by the study area, and its coverage area and time period fully correspond to the user query data.
Therefore, Q-Traffic is not only large in scale but also rich in multi-source auxiliary information, making it a suitable dataset for evaluating the effectiveness of STFQET in traffic speed forecasting with auxiliary information. In our experiments, one time step corresponds to 15 min. Accordingly, the forecasting horizons of 60, 90, and 120 min correspond to 4, 6, and 8 prediction steps, respectively. We selected three representative single-region subsets from different areas of Beijing, namely Q-24-33, Q-26-31, and Q-32-37, which correspond to regions (24, 33), (26, 31), and (32, 37) in the Q-Traffic dataset. Here, (24, 33) denotes the region in the 24th row and 33rd column of the 68 × 72 regional grid, and the other region coordinates follow the same convention.
Since no other publicly available traffic forecasting dataset provides comparable query auxiliary information, these spatially separated subsets are used to partly examine the road structure heterogeneity that cross-city evaluation is expected to test. As shown in Figure 4, Q-24-33, Q-26-31, and Q-32-37 contain 223 nodes with 249 links, 123 nodes with 133 links, and 134 nodes with 146 links, respectively, indicating clear differences in node scale, density, and connectivity patterns. In addition, to evaluate scalability under broader spatial coverage, we construct two larger multi-region subsets: Q-3335-3840 contains the data of 9 adjacent regions from (33, 38) to (35, 40), including 1074 road segments and 1324 links, while Q-4346-2629 contains the data of 16 adjacent regions from (43, 26) to (46, 29), including 922 road segments and 1063 links. These two subsets are used to further evaluate whether STFQET can learn on larger road network graphs and maintain stable performance under broader spatial coverage.

5.2. Baseline Methods

Our evaluation incorporates a range of advanced methods for traffic forecasting, each addressing unique aspects of spatio-temporal modeling. The baseline comparisons are presented as follows:
GMAN [9]: GMAN employs a graph multi-attention network with an encoder–decoder architecture, featuring spatial and temporal attention mechanisms to model dynamic spatio-temporal correlations. It includes a transform attention layer to alleviate error propagation by capturing direct relationships between historical and future time steps.
DeepSTUQ [10]: DeepSTUQ proposes a unified approach for uncertainty quantification in traffic forecasting, estimating both aleatoric and epistemic uncertainties. It combines Monte Carlo dropout and adaptive weight averaging re-training methods, enhanced with a post-processing calibration technique based on temperature scaling.
RDAT [36]: RDAT leverages reinforced dynamic adversarial training to enhance adversarial robustness. It uses a reinforcement learning-based method to dynamically select a subset of nodes as adversarial examples, reducing overfitting and incorporating self-knowledge distillation regularization to mitigate forgetting issues.
STG-NCDE [34]: STG-NCDE designs two neural controlled differential equations for temporal and spatial processing, respectively, and demonstrates robustness to irregular time series.
STWave [33]: Spatial–Temporal Wavelet Framework utilizes discrete wavelet transform to disentangle traffic series into trends and events, combined with efficient spectral graph attention networks.
STD-PLM [32]: Spatial–Temporal Data Pre-trained Language Model adapts pre-trained language models to understand spatial–temporal properties through specifically designed tokenizers and sandglass attention modules.
EAC [30]: Expand and Compress framework employs prompt tuning principles for continual spatio-temporal forecasting, using continuous prompt pools to adapt to streaming data.
ST-ReP [29]: Reconstruction and Prediction integrated learning combines current value reconstruction with future value prediction in a pre-training framework, using a compression–extraction–decompression structure for efficient encoding.
LightST [27]: LightST is an efficient traffic forecasting framework based on spatio-temporal distillation. It transfers spatial and temporal knowledge from a high-capacity GNN teacher to a lightweight MLP student through prediction-level alignment and representation-level distribution alignment, achieving competitive accuracy with much higher inference efficiency.
STDN [21]: STDN is a spatio-temporal-aware trend-seasonality decomposition network for traffic flow forecasting. It combines spatio-temporal embedding learning, dynamic relationship graph learning, and trend-seasonality decomposition to disentangle traffic flow components and enhance the representation learning of traffic nodes.
ST-SSDL [14]: ST-SSDL is a spatio-temporal forecasting framework with self-supervised deviation learning. It introduces historical anchors, learnable prototypes, contrastive loss, and deviation loss to capture the discrepancy between current observations and historical patterns, thereby improving the adaptability of forecasting under dynamic traffic conditions.
Hybrid [6]: Hybrid is the full model proposed with the Q-Traffic dataset. It combines online route queries with geographical, event-related, and road intersection auxiliary information in an encoder–decoder sequence learning framework.

5.3. Experiment Results

In this section, we compare all baseline models with STFQET on the Q-24-33, Q-26-31, and Q-32-37 datasets. We test these models’ Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) metrics over 60, 90, and 120 min. Since the prediction target is traffic speed, MAE and RMSE are reported in km/h, while MAPE is reported as a percentage; the same unit convention is used in the following performance tables. All reported results are averaged over three independent runs, and the sample standard deviation is shown in the tables. The results are shown in Table 2.
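For reference, the three metrics and the mean ± sample standard deviation over runs can be computed as in the following sketch. All numbers are toy values chosen for easy checking, not results from the paper, and the `eps` guard in MAPE is an added assumption.

```python
import numpy as np

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mape(y, y_hat, eps=1e-8):
    # eps guards against division by zero for zero speeds (an assumption).
    return float(100.0 * np.mean(np.abs(y - y_hat) / np.maximum(np.abs(y), eps)))

y = np.array([40.0, 50.0, 60.0, 50.0])       # toy ground-truth speeds (km/h)
y_hat = np.array([42.0, 47.0, 60.0, 55.0])   # toy predictions (km/h)
scores = (mae(y, y_hat), rmse(y, y_hat), mape(y, y_hat))

run_maes = np.array([2.0, 2.1, 2.2])         # toy MAE values from three runs
mean_mae = run_maes.mean()
std_mae = run_maes.std(ddof=1)               # ddof=1 gives the sample standard deviation
```

On the toy data the absolute errors are 2, 3, 0, and 5 km/h, so MAE is 2.5 km/h, RMSE is $\sqrt{9.5} \approx 3.08$ km/h, and MAPE is 5.25%.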
The results show that STFQET achieves the best or among the best performance on most datasets and forecasting horizons, with relatively small standard deviations across independent runs. On Q-24-33, STFQET obtains the lowest mean MAE, RMSE, and MAPE across all three horizons; for example, at 120 min, its MAE is 2.16 ± 0.02, compared with 2.20 ± 0.01 for STDN and 2.22 ± 0.02 for ST-SSDL. On Q-26-31, STFQET achieves the best MAE and MAPE at all horizons and the best or near-best RMSE. On Q-32-37, STFQET also achieves the lowest mean errors across the three horizons. These results demonstrate that STFQET provides consistently strong forecasting accuracy and stable performance across different regions and prediction horizons.
Compared with representative strong baselines, the advantage of STFQET can be understood from the modeling perspective. Compared with Hybrid, STFQET uses query information through region-to-node attention rather than directly fusing auxiliary features in an encoder–decoder framework. STDN is effective at capturing traffic dynamics through dynamic graph learning and trend-seasonality decomposition, but STFQET further incorporates query information and key-node-aware spatial modeling. STG-NCDE is strong at continuous temporal dynamics, whereas STFQET additionally strengthens periodic representation through Fourier filtering and FFT-based temporal extraction. STD-PLM benefits from pre-trained representations, but its modeling is more general, while STFQET is tailored to traffic forecasting by targeting delayed query interaction, key-node-aware spatial propagation, and frequency-domain temporal learning. ST-ReP learns spatio-temporal representations efficiently, but STFQET further exploits external query signals and their delayed influence on traffic evolution. Overall, the results suggest that modeling auxiliary information, heterogeneous spatial dependency, and temporal periodicity is beneficial for improving forecasting accuracy.

5.4. Parameter Study

Comprehensive parameter experiments are conducted using the Q-24-33 dataset to evaluate the impact of six hyperparameters on model performance. The results are shown in Figure 5. These hyperparameters include the learning rate, batch size, dimensionality per attention head, number of attention heads, number of ST Blocks, and the key node ratio. Performance is measured at 60, 90, and 120 min. The optimal values determined experimentally are as follows: learning rate of 0.003, batch size of 8, dimensionality per attention head of 8, number of attention heads of 8, number of ST Blocks of 2, and key node ratio of 0.21.
The learning rate significantly impacts model convergence. If the learning rate is too high, the model may overshoot the optimal solution in parameter space, while if the learning rate is too low, convergence slows down and training may stagnate prematurely. Experiments show that a learning rate of 0.003 strikes the best balance, achieving the lowest error across all forecasting horizons.
Batch size has a significant impact on training dynamics and generalization. Smaller batch sizes may provide regularization but lead to noisy updates, while larger batch sizes may improve stability but reduce generalization. Results show that a batch size of 8 performs best, with minimal error at 60 and 90 min.
The dimension of each attention head determines its representational capacity. Too small a dimension can lead to a lack of expressiveness in the model; too large a dimension can introduce noise and overfitting. Experiments have shown that a dimension of 8 per attention head provides an ideal balance, enhancing the model’s ability to capture relevant patterns without adding excessive complexity.
Increasing the number of attention heads allows the model to focus on different subspaces, improving performance to some extent. However, beyond an optimal threshold, too many attention heads may learn noisy or redundant information, degrading results. We found that the optimal number of attention heads is 8, which maximizes performance across all prediction ranges.
Model depth, determined by the number of blocks, affects the model’s ability to capture complex dependencies. Shallow models may underfit, while very deep models may overfit due to the large number of parameters. Experiments show that 2 blocks achieve the best performance, suggesting that a moderate depth is sufficient for the Q-24-33 dataset without introducing unnecessary complexity.
The key node ratio controls the sparsity of the key node subgraph. When the ratio is too small, some influential nodes may be omitted, which weakens the spatial interaction modeling ability of the module. When the ratio becomes too large, the sparsification effect gradually diminishes, and the model becomes closer to performing dense interactions on a larger node set. As shown in Figure 5f, STFQET remains relatively stable over a moderate ratio range, while the best overall performance is achieved at $\rho = 0.21$. In particular, the results around $\rho = 0.13$ and $\rho = 0.25$ are also close to the optimum, indicating that the model is not overly sensitive to this hyperparameter within a reasonable interval. This suggests that selecting an appropriate proportion of key nodes is sufficient to preserve dominant spatial dependencies without introducing excessive redundancy.
We further study the number of dominant frequencies $k_f$ used in temporal embedding because this hyperparameter determines the width of the temporal basis. Table 3 reports the sensitivity results on Q-24-33. Increasing $k_f$ from 1 to 3 clearly improves the prediction accuracy, indicating that a single dominant period is insufficient to represent the main traffic rhythms. Although $k_f = 5$ and $k_f = 6$ achieve slightly lower errors in some cases, we choose $k_f = 3$ as the default setting because it provides a better trade-off between predictive accuracy and computational efficiency.

5.5. Multi-Region Experiments

To further evaluate the scalability of STFQET under broader spatial coverage, we conduct multi-region experiments on two larger subsets of the Q-Traffic dataset. Specifically, Q-3335-3840 contains the data of 9 adjacent regions from (33, 38) to (35, 40), including 1074 road segments, while Q-4346-2629 contains the data of 16 adjacent regions from (43, 26) to (46, 29), including 922 road segments. Compared with the single-region subsets, these two settings involve more road segments and denser spatial interactions, which makes it more difficult for the model to capture long-range spatial propagation and stable temporal patterns. The comparison results are shown in Table 4.
The results show that STFQET maintains strong performance on the two larger multi-region subsets. On Q-3335-3840, STFQET achieves the best MAE at all forecasting horizons and obtains the best or tied-best RMSE and MAPE. For example, at the 120 min horizon, STFQET reaches an MAE of 2.46 ± 0.02, outperforming STDN, LightST, and Hybrid. On Q-4346-2629, STFQET achieves the lowest mean errors on all three metrics and all three prediction horizons, showing clear advantages when the number of road segments increases. These results indicate that the proposed model not only performs well in single-region forecasting but also preserves its predictive capability when the forecasting area is expanded to larger road networks with richer spatial dependencies.

5.6. Ablation Experiments

Our ablation study further verifies the contribution of each component in STFQET. The results are shown in Figure 6. The variant without spatial attention shows the largest performance drop, indicating that the spatial attention module plays a fundamental role in capturing complex spatial dependencies in traffic networks. The key node attention module also contributes significantly, especially at shorter horizons, because it allows the model to focus on the most influential nodes rather than treating all nodes equally. Removing the graph diffusion module also weakens performance, which suggests that STFQET benefits from combining attention-based spatial interaction learning with graph-based structural propagation.
From the temporal perspective, both the temporal attention module and the FFT-based temporal period extraction module improve prediction accuracy, especially as the forecasting horizon increases. This indicates that STFQET benefits from jointly modeling temporal dependency in the time domain and periodic structure in the frequency domain. The query-enhanced attention module also consistently improves performance, confirming that external query data provide useful complementary information beyond historical traffic observations alone.
It can also be observed that the variant without Fourier filter performs slightly better than STFQET at the 2-step horizon. This suggests that the earliest prediction step may be more sensitive to short-term local fluctuations, while the Fourier filter, although beneficial for suppressing noise, may remove a small portion of high-frequency details that are useful for very short-horizon prediction. However, as the prediction horizon increases, the complete STFQET model regains the advantage, indicating that the Fourier filter is more helpful for stabilizing the signal and enhancing informative periodic patterns in medium- and long-horizon forecasting. Overall, the ablation results suggest that the performance improvement of STFQET comes from the joint effect of its spatial, temporal, frequency-domain, and query-enhanced components rather than from any single module alone.

5.7. Query Temporal Perturbation Study

To further examine whether STFQET uses the temporal position of query information, we conduct an additional query perturbation experiment on Q-24-33 (Table 5). The model structure, traffic input, temporal embedding, and prediction labels remain unchanged; only the historical query sequence is modified. We report representative settings averaged over all prediction horizons: Full uses the original historical query sequence, zero removes all query values, current_only keeps only the last historical query step, and temporal_shuffle randomly shuffles query time steps within each sample.
The zero setting degrades performance, confirming that query information provides useful auxiliary signals. The current_only setting is also weaker than the full historical query sequence, suggesting that earlier query steps still contain useful information. The temporal_shuffle setting performs worse even though the query value distribution and spatial layout are preserved. This indicates that STFQET does not simply exploit query volume or spatial intensity but that it also uses the historical temporal position of query signals. These results support the design that delayed query influence is learned through query-traffic attention over temporally ordered historical query sequences.
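The four query settings can be reproduced with a small helper such as the one below. It is a sketch under assumptions: the function name `perturb_queries` and the tensor layout $(B, P, N_q, C)$ are hypothetical, and `current_only` is implemented by zeroing all earlier steps, which is one plausible reading of the description.

```python
import numpy as np

def perturb_queries(q, mode, rng=None):
    """q: (B, P, N_q, C) historical query tensor; returns a perturbed copy."""
    q = q.copy()
    if mode == "full":
        return q
    if mode == "zero":
        return np.zeros_like(q)
    if mode == "current_only":
        q[:, :-1] = 0.0                         # keep only the last historical step
        return q
    if mode == "temporal_shuffle":
        if rng is None:
            rng = np.random.default_rng()
        for b in range(q.shape[0]):             # independent permutation per sample
            q[b] = q[b, rng.permutation(q.shape[1])]
        return q
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
q = rng.random(size=(3, 12, 25, 2))             # toy query history
q_zero = perturb_queries(q, "zero")
q_cur = perturb_queries(q, "current_only")
q_shuf = perturb_queries(q, "temporal_shuffle", rng)
```

The shuffle permutes whole time slices, so the set of query values and their spatial layout within each slice are unchanged; only the temporal order differs.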

5.8. Comparison of Different Graph Embedding Methods

Graph embedding is the process of mapping graph data into low-dimensional, dense vectors. We systematically analyze the impact of different graph embedding algorithms on the performance of STFQET. Specifically, we compare the following eight classic graph embedding algorithms:
  • DeepWalk [38] leverages random walks and the Skip-gram model to learn node embeddings, effectively capturing community structures and homophily in networks.
  • GF [39] employs matrix factorization to learn embeddings by directly approximating the adjacency matrix, focusing primarily on first-order proximity.
  • Node2vec [40] extends DeepWalk with a biased random walk strategy, controlled by parameters p and q, to balance between exploring homophily and structural equivalence.
  • GraRep [41] explicitly captures higher-order proximities by factorizing different powers of the transition matrix, integrating multi-scale network relationships.
  • HOPE [42] preserves high-order proximity and asymmetric transitivity by approximating and factorizing a predefined similarity matrix such as the Katz index.
  • HIN2Vec [43] employs multi-task learning to model multiple relationship types and meta-paths between nodes, framing relationship prediction as a binary classification problem.
  • LLE [44] assumes local linearity and learns embeddings by reconstructing each node from its neighbors, preserving the local geometric structure of the graph.
  • SDNE [45] utilizes deep autoencoders to jointly optimize for first-order and second-order proximity, capturing highly non-linear network structures.
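To make the random-walk family in the list above more concrete, the following is a minimal sketch of the biased second-order walk that Node2vec uses, on an unweighted undirected graph. The function name `node2vec_walk` and the adjacency-dictionary input are choices made for this example; the Skip-gram training step that turns walks into embeddings is not shown. With p = q = 1 the walk reduces to a uniform random walk of the kind DeepWalk uses.

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one biased walk.

    adj: dict mapping node -> set of neighbours (undirected, unweighted).
    p:   return parameter (larger p makes stepping back less likely).
    q:   in-out parameter (smaller q favours moving away from the previous node).
    """
    if rng is None:
        rng = random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break                                  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))          # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)            # return to the previous node
            elif x in adj[prev]:
                weights.append(1.0)                # stay one hop from the previous node
            else:
                weights.append(1.0 / q)            # move two hops from the previous node
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy path graph 0-1-2-3 (hypothetical data, not a Q-Traffic subset).
toy_adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
toy_walk = node2vec_walk(toy_adj, start=0, length=6, p=4.0, q=0.5,
                         rng=random.Random(0))
```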
MAE measures the average absolute deviation between predicted and ground-truth values and is relatively insensitive to outliers. As shown in Figure 7a, GF achieves the lowest MAE (2.0554), followed closely by LLE (2.0619). HIN2Vec (2.0660) and DeepWalk (2.1075) exhibit the third-best and worst MAE values, respectively. These results indicate that GF and LLE are the most effective methods for reducing average absolute prediction error on this task, while GraRep, HOPE, and SDNE show intermediate performance.
RMSE is the square root of the mean of the squared prediction errors. It is more sensitive to large prediction errors because squaring amplifies their impact, so it better reflects the stability of the predictions. As shown in Figure 7b, HOPE performs best (3.1548), followed by HIN2Vec (3.1569) and GraRep (3.1614). Although GF leads in MAE, its RMSE (3.1864) is better only than those of SDNE (3.1900) and DeepWalk (3.2065). This result suggests that HOPE and HIN2Vec, which capture high-order or meta-path-based relationships, are the most effective at avoiding very large prediction errors; their learned embeddings may be better at suppressing extreme values in the predictions. GF's higher RMSE suggests that its predictions contain some relatively large errors despite its low mean absolute error. DeepWalk and SDNE perform relatively poorly in terms of RMSE in this experiment.
MAPE measures the average prediction error as a percentage of the true value. As a relative error metric, it facilitates comparison of prediction accuracy across data of different scales. As shown in Figure 7c, Node2vec and LLE tie for the lowest MAPE (6.95%), indicating that their prediction errors are the smallest relative to the true values. SDNE (6.98%) and HIN2Vec (6.99%) follow closely. GF (7.03%), DeepWalk (7.08%), and GraRep (7.12%) rank behind them, while HOPE (7.20%) has the highest MAPE. Combined with the strong MAE results (especially for LLE), this suggests that Node2vec and LLE achieve both small absolute errors and the smallest errors relative to the true values. SDNE and GF also perform well in terms of relative error. It is worth noting that although HOPE and GraRep perform well in RMSE, they have the highest and second-highest MAPE, respectively. This may mean that their predictions deviate from the true values by a large relative margin in some cases, for example where the true values are small.
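The different behaviour of the three metrics can be reproduced on toy numbers. The sketch below uses the standard definitions of MAE, RMSE, and MAPE; the arrays are made-up values chosen for illustration, and the MAPE helper assumes that no true value is zero (any masking of missing values used in the actual evaluation is not modelled here).

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_pred - y_true)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mape(y_true, y_pred):
    # Assumes y_true contains no zeros.
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100.0)

y_true = np.array([40.0, 40.0, 40.0, 40.0])

# Predictor A spreads a small error evenly; predictor B is exact three
# times and makes one large error. Both have the same total absolute error.
pred_a = np.array([42.0, 42.0, 42.0, 42.0])
pred_b = np.array([40.0, 40.0, 40.0, 48.0])

mae_a, mae_b = mae(y_true, pred_a), mae(y_true, pred_b)      # equal MAE
rmse_a, rmse_b = rmse(y_true, pred_a), rmse(y_true, pred_b)  # B has the larger RMSE

# The same absolute error of 2 weighs more in MAPE when the true value is small.
mape_small = mape(np.array([10.0]), np.array([12.0]))
mape_large = mape(np.array([40.0]), np.array([42.0]))
```

In this toy case the two predictors are indistinguishable by MAE, while RMSE penalises the single large error of predictor B. This is the same pattern that lets a method rank first on MAE but lower on RMSE, and the last two lines show why a method with good RMSE can still have a high MAPE if its errors fall on small true values.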

6. Discussion

Compared with existing studies already discussed in this paper, STFQET shows a distinct problem-driven modeling strategy. GMAN [9] provides an effective Transformer framework for spatial–temporal dependency learning, but it does not explicitly model auxiliary query information or distinguish node importance. Hybrid [6] demonstrates the value of query information, but STFQET further learns region-to-node query influence and allows its delayed effect to be captured by the spatio-temporal attention framework. STDN [21] and STG-NCDE [34] are effective in capturing spatial–temporal dynamics, but they do not jointly emphasize delayed query effects, key node-centered spatial modeling, and frequency-domain periodic enhancement. STWave [33] and ST-ReP [29] improve temporal representation from wavelet or predictive representation perspectives, respectively, whereas STFQET performs Fourier filtering and FFT-based temporal period extraction to reduce noise and strengthen data-driven periodic cues. Therefore, the favorable results of STFQET suggest that modeling delayed query signals, key node influence, and robust temporal periodicity is beneficial for traffic forecasting.
Although all experiments use Q-Traffic, Section 5.1 explains that the selected single-region and multi-region subsets cover road networks with clearly different scales and connectivity patterns, which partly addresses the cross-city concern from the perspective of spatial structural heterogeneity.
Despite these results, this study still has several limitations. First, the query modeling strategy is based on a fixed 5 × 5 neighboring region, which may not fully reflect the spatial influence range of different traffic scenarios. Second, the temporal embedding and frequency-domain learning modules rely on the dominant periods extracted from the global spectrum of the study area, which is effective in the current setting but may be less flexible when traffic rhythms vary substantially across datasets or regions. Third, the current framework mainly focuses on traffic speed, query information, and road topology, while other external factors are not considered. Finally, all data are still collected from Beijing during April–May 2017. Since publicly available traffic forecasting datasets with route-query information are still limited, cross-season query-enhanced evaluation and further validation on other cities remain important directions for future work.
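The second limitation refers to dominant periods taken from a global spectrum. A minimal sketch of that idea follows, assuming the global spectrum is the node-averaged FFT amplitude and that the k strongest non-zero frequencies are kept. The function name `dominant_periods`, the averaging rule, and the 15 min sampling step in the usage lines are assumptions for illustration and may differ from the actual implementation.

```python
import numpy as np

def dominant_periods(series, k, step_minutes):
    """Return the k strongest periods (in minutes) of a multi-node series.

    series: array of shape (time, nodes).
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)                 # remove each node's mean
    amp = np.abs(np.fft.rfft(x, axis=0)).mean(axis=1)     # node-averaged amplitude spectrum
    freqs = np.fft.rfftfreq(x.shape[0], d=step_minutes)   # cycles per minute
    amp, freqs = amp[1:], freqs[1:]                       # drop the zero-frequency bin
    top = np.argsort(amp)[::-1][:k]                       # indices of the k largest peaks
    return (1.0 / freqs[top]).tolist()

# Synthetic example: one week at an assumed 15 min step, with a daily
# component and a weaker half-day component on two nodes.
t = np.arange(7 * 96)
daily = np.sin(2 * np.pi * t / 96)
half_day = 0.4 * np.sin(2 * np.pi * t / 48)
toy = np.stack([daily + half_day, 2.0 * (daily + half_day)], axis=1)
toy_periods = dominant_periods(toy, k=2, step_minutes=15)
```

Because the periods come from one spectrum aggregated over all nodes, every node shares the same temporal basis; a node whose rhythm differs from the majority would not get its own periods, which is the flexibility concern raised above.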
Future work could be extended in several directions. One direction would be to design more adaptive query-region selection mechanisms so that the spatial range of auxiliary information can vary with traffic context rather than remaining fixed. It would also be worthwhile to explore more flexible frequency selection strategies, allowing the model to adjust temporal periodic modeling according to different data characteristics. Furthermore, when additional datasets containing both traffic states and route-query records become available, STFQET could be evaluated on more heterogeneous datasets to verify its robustness across cities and traffic conditions. Another direction would be to incorporate richer external information, such as weather, holidays, and event-related signals, to further enhance the practical value of the model in intelligent transportation systems.

7. Conclusions

In this study, we propose the Spatial–Temporal Fourier Query-Enhanced Transformer (STFQET), a framework for traffic forecasting that leverages query data and frequency-domain processing. STFQET integrates a query-enhanced attention module to model the delayed effect of queries on traffic states, a key node attention module to focus on influential nodes, a Fourier filter module to suppress noise while preserving useful temporal components, and an FFT-based temporal period extraction module to enhance periodic feature learning in the frequency domain. Experiments on five Q-Traffic subsets show that STFQET achieves competitive and generally superior performance compared with existing baseline models across forecasting horizons and metrics. Ablation studies confirm the contribution of each component. Future work will explore integrating additional external factors and adapting the model to other spatio-temporal forecasting tasks.

Author Contributions

Conceptualization, S.Z. and X.T.; methodology, S.Z. and X.T.; software, S.Z.; validation, S.Z. and X.T.; formal analysis, S.Z. and X.T.; investigation, S.Z.; resources, X.T.; data curation, S.Z.; writing—original draft preparation, S.Z.; writing—review and editing, S.Z. and X.T.; visualization, S.Z.; funding acquisition, X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shandong Provincial Natural Science Foundation, grant number ZR2025QC2304Z.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The datasets are available from their original public source(s). The code used in this study is publicly available for review at https://anonymous.4open.science/r/STFQET/ (accessed on 8 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, M.; Pang, G.; Wang, W.; Yan, C. Information Bottleneck-guided MLPs for Robust Spatial-temporal Forecasting. In Proceedings of the Forty-second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025.
  2. Kong, W.; Guo, Z.; Liu, Y. Spatio-Temporal Pivotal Graph Neural Networks for Traffic Flow Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 8627–8635.
  3. Wang, Y.; Xu, Y.; Yang, J.; Wu, M.; Li, X.; Xie, L.; Chen, Z. Fully-Connected Spatial-Temporal Graph for Multivariate Time Series Data. arXiv 2023, arXiv:2309.05305.
  4. Almousa, G.; Lee, Y. Traffic forecasting using spatio-temporal dynamics and attention with graph attention PDEs. Inf. Sci. 2025, 711, 122108.
  5. Wu, H. Revisiting attention for multivariate time series forecasting. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence (AAAI’25/IAAI’25/EAAI’25), Philadelphia, PA, USA, 25 February–4 March 2025; AAAI Press: Washington, DC, USA, 2025.
  6. Liao, B.; Zhang, J.; Wu, C.; McIlwraith, D.; Chen, T.; Yang, S.; Guo, Y.; Wu, F. Deep Sequence Learning with Auxiliary Information for Traffic Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18), London, United Kingdom, 19–23 August 2018; pp. 537–546.
  7. Cai, W.; Liang, Y.; Liu, X.; Feng, J.; Wu, Y. MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 11141–11149.
  8. Wu, Y.; Meng, X.; Hu, H.; Zhang, J.; Dong, Y.; Lu, D. Affirm: Interactive Mamba with Adaptive Fourier Filters for Long-term Time Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 21599–21607.
  9. Zheng, C.; Fan, X.; Wang, C.; Qi, J. GMAN: A Graph Multi-Attention Network for Traffic Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1234–1241.
  10. Qian, W.; Zhang, D.; Zhao, Y.; Zheng, K.; Yu, J.J. Uncertainty Quantification for Traffic Forecasting: A Unified Approach. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; pp. 992–1004.
  11. Guo, S.; Lin, Y.; Gong, L.; Wang, C.; Zhou, Z.; Shen, Z.; Huang, Y.; Wan, H. Self-Supervised Spatial-Temporal Bottleneck Attentive Network for Efficient Long-term Traffic Forecasting. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; pp. 1585–1596.
  12. Ma, X.; Li, X.; Fang, L.; Zhao, T.; Zhang, C. U-Mixer: An Unet-Mixer Architecture with Stationarity Correction for Time Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 14255–14262.
  13. Li, Z.; Hu, Z.; Han, P.; Gu, Y.; Cai, S. SSL-STMFormer Self-Supervised Learning Spatio-Temporal Entanglement Transformer for Traffic Flow Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 12130–12138.
  14. Gao, H.; Dong, Z.; Yong, J.; Fukushima, S.; Taura, K.; Jiang, R. How Different from the Past? Spatio-Temporal Time Series Forecasting with Self-Supervised Deviation Learning. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025.
  15. Liu, J.; Wang, Y.; Zhu, J.; Bai, W.; Zhang, H.; Zuo, L.; Zhou, T.; Li, K. A Multilayer Spatiotemporal Correlation-Aware Graph Attention Network for Traffic Flow Prediction. IEEE Trans. Neural Netw. Learn. Syst. 2026, 37, 2235–2249.
  16. Xian, J.; Ye, Y.; Zhang, W.; Chen, Z.; Huang, J.; Lin, Z.; Zhou, T. MDHGFN: Multiscale Dual Hypergraph Fusion Spatiotemporal Network for traffic flow prediction. Chaos Solitons Fractals 2025, 201, 117228.
  17. Yu, C.; Lin, Z.; Cheng, H.; Cao, C.; Zhou, T.; Leung, M.F. Quantum-Inspired Dynamic Spatiotemporal Matching Transformer for Traffic Flow Forecasting. IEEE Trans. Consum. Electron. 2025, 72, 2527–2539.
  18. Liao, B.; Zhang, J.; Cai, M.; Tang, S.; Gao, Y.; Wu, C.; Yang, S.; Zhu, W.; Guo, Y.; Wu, F. Dest-ResNet: A Deep Spatiotemporal Residual Network for Hotspot Traffic Speed Prediction. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1883–1891.
  19. Lu, Z.; Lv, W.; Xie, Z.; Du, B.; Xiong, G.; Sun, L.; Wang, H. Graph Sequence Neural Network with an Attention Mechanism for Traffic Speed Prediction. ACM Trans. Intell. Syst. Technol. 2022, 13, 20:1–20:24.
  20. Qiu, Z.; Zhu, T.; Jin, Y.; Sun, L.; Du, B. A Graph Attention Fusion Network for Event-Driven Traffic Speed Prediction. Inf. Sci. 2023, 622, 405–423.
  21. Cao, L.; Wang, B.; Jiang, G.; Yu, Y.; Dong, J. Spatiotemporal-aware Trend-Seasonality Decomposition Network for Traffic Flow Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 11463–11471.
  22. Yang, X.; Sun, Y.; Chen, X.; Zhang, Y.; Yuan, X. Graph Structure Learning for Spatial-Temporal Imputation: Adapting to Node and Feature Scales. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 959–967.
  23. Ma, M.; Hu, J.; Jensen, C.S.; Teng, F.; Han, P.; Xu, Z.; Li, T. Learning Time-Aware Graph Structures for Spatially Correlated Time Series Forecasting. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–17 May 2024; pp. 4435–4448.
  24. Zheng, Y.; Luo, C.; Shao, R. Enhancing Traffic Flow Forecasting with Delay Propagation: Adaptive Graph Convolution Networks for Spatio-Temporal Data. IEEE Trans. Intell. Transp. Syst. 2025, 26, 650–660.
  25. Yuan, Q.; Wang, J.; Han, Y.; Liu, Z.; Liu, W. DAGCAN: Decoupled Adaptive Graph Convolution Attention Network for Traffic Forecasting. IEEE Trans. Intell. Transp. Syst. 2025, 26, 3513–3526.
  26. Prabowo, A.; Shao, W.; Xue, H.; Koniusz, P.; Salim, F.D. Because Every Sensor Is Unique, so Is Every Pair: Handling Dynamicity in Traffic Forecasting. In Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation (IoTDI’23), San Antonio, TX, USA, 9–12 May 2023; pp. 93–104.
  27. Zhang, Q.; Gao, X.; Wang, H.; Yiu, S.M.; Yin, H. Efficient traffic prediction through spatio-temporal distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 1093–1101.
  28. Jiang, R.; Wang, Z.; Yong, J.; Jeph, P.; Chen, Q.; Kobayashi, Y.; Song, X.; Fukushima, S.; Suzumura, T. Spatio-Temporal Meta-Graph Learning for Traffic Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 8078–8086.
  29. Zheng, Q.; Yao, Z.; Zhang, Y. ST-ReP: Learning Predictive Representations Efficiently for Spatial-Temporal Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 13419–13427.
  30. Chen, W.; Liang, Y. Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting. arXiv 2024, arXiv:2410.12593.
  31. Wang, B.; Wang, P.; Zhang, Y.; Wang, X.; Zhou, Z.; Bai, L.; Wang, Y. Towards Dynamic Spatial-Temporal Graph Learning: A Decoupled Perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 9089–9097.
  32. Huang, Y.; Mao, X.; Guo, S.; Chen, Y.; Shen, J.; Li, T.; Lin, Y.; Wan, H. STD-PLM: Understanding Both Spatial and Temporal Properties of Spatial-Temporal Data with PLM. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 11817–11825.
  33. Fang, Y.; Qin, Y.; Luo, H.; Zhao, F.; Xu, B.; Zeng, L.; Wang, C. When Spatio-Temporal Meet Wavelets: Disentangled Traffic Forecasting via Efficient Spectral Graph Attention Networks. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; pp. 517–529.
  34. Choi, J.; Choi, H.; Hwang, J.; Park, N. Graph Neural Controlled Differential Equations for Traffic Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, Online, 22 February–1 March 2022; Volume 36, pp. 6367–6374.
  35. Jiang, J.; Han, C.; Zhao, W.X.; Wang, J. PDFormer: Propagation Delay-Aware Dynamic Long-Range Transformer for Traffic Flow Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 4365–4373.
  36. Liu, F.; Zhang, W.; Liu, H. Robust Spatiotemporal Traffic Forecasting with Reinforced Dynamic Adversarial Training. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’23), Long Beach, CA, USA, 6–10 August 2023; pp. 1417–1428.
  37. Zhang, H.K.; Zhang, Y.G.; Zhou, Z.; Li, Y.F. HONGAT: Graph Attention Networks in the Presence of High-Order Neighbors. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 16750–16758.
  38. Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’14), New York, NY, USA, 24–27 August 2014; pp. 701–710.
  39. Ahmed, A.; Shervashidze, N.; Narayanamurthy, S.; Josifovski, V.; Smola, A.J. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web (WWW’13), Rio de Janeiro, Brazil, 13–17 May 2013; pp. 37–48.
  40. Grover, A.; Leskovec, J. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
  41. Cao, S.; Lu, W.; Xu, Q. GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM’15), Melbourne, Australia, 19–23 October 2015; pp. 891–900.
  42. Ou, M.; Cui, P.; Pei, J.; Zhang, Z.; Zhu, W. Asymmetric Transitivity Preserving Graph Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), San Francisco, CA, USA, 13–17 August 2016; pp. 1105–1114.
  43. Fu, T.; Lee, W.; Lei, Z. HIN2Vec: Explore Meta-paths in Heterogeneous Information Networks for Representation Learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; Lim, E., Winslett, M., Sanderson, M., Fu, A.W., Sun, J., Culpepper, J.S., Lo, E., Ho, J.C., Donato, D., Agrawal, R., et al., Eds.; ACM: New York, NY, USA, 2017; pp. 1797–1806.
  44. Roweis, S.T.; Saul, L.K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 2000, 290, 2323–2326.
  45. Wang, D.; Cui, P.; Zhu, W. Structural Deep Network Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), San Francisco, CA, USA, 13–17 August 2016; pp. 1225–1234.
Figure 1. Illustration of the delay effect between query counts and traffic speed.
Figure 2. Overall framework of STFQET. (a) The complete model architecture, which integrates query-enhanced attention module, spatial–temporal embedding, Fourier filter, stacked ST blocks, transform attention, and the final prediction layer. (b) The Fourier filter module, which combines adaptive, low-pass, and high-pass filtering branches with a residual branch to suppress noise while preserving useful temporal information. (c) The construction of spatial–temporal embedding, where temporal embedding is derived from the dominant frequencies identified from the global spectrum.
Figure 3. Node-level frequency consistency analysis for the global temporal basis. In (a), the curve shows the aggregated global spectrum, and the marked peaks indicate the three selected dominant periods. In (b), for each node, blue, orange, and green points denote the first-, second-, and third-strongest detected frequency features, respectively, while gray points denote the remaining detected frequency features.
Figure 4. Road network visualizations of the five Q-Traffic subsets. Blue points and black lines represent road nodes and links, respectively, and the orange dashed boxes indicate the selected grid regions. The top row shows the three single-region subsets Q-24-33, Q-26-31, and Q-32-37, while the bottom row shows the two multi-region subsets Q-3335-3840 and Q-4346-2629.
Figure 5. Parameter study of STFQET on the Q-24-33 dataset. The six subfigures report the effects of (a) learning rate, (b) batch size, (c) the dimension of each attention head, (d) the number of attention heads, (e) the number of blocks, and (f) the key node ratio on MAE at 60, 90, and 120 min forecasting horizons.
Figure 6. Ablation results of STFQET on the Q-24-33 dataset.
Figure 7. Comparison of graph embedding methods in terms of (a) MAE, (b) RMSE, and (c) MAPE. Blue, green, and orange bars correspond to MAE, RMSE, and MAPE, respectively; darker bars mark the best or tied-best result in each subplot.
Table 1. Practical training cost of STFQET on five datasets.
| Dataset | No. of Nodes | Epochs | Avg. Train/Epoch (s) | Train + Val Time (min) |
|---|---|---|---|---|
| Q-24-33 | 223 | 26 | 73.0 | 32.6 |
| Q-26-31 | 123 | 20 | 63.7 | 21.9 |
| Q-32-37 | 134 | 22 | 62.4 | 23.7 |
| Q-3335-3840 | 1074 | 29 | 336.8 | 168.1 |
| Q-4346-2629 | 922 | 30 | 252.2 | 130.3 |
Table 2. Performance comparison of STFQET and baseline models on the Q-24-33, Q-26-31, and Q-32-37 datasets. Each entry is reported as mean ± std; the standard deviation is rounded to two decimals for compact display.
| Data | Model | 60 min MAE | 60 min RMSE | 60 min MAPE | 90 min MAE | 90 min RMSE | 90 min MAPE | 120 min MAE | 120 min RMSE | 120 min MAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| Q-24-33 | GMAN | 2.29 ± 0.03 | 3.44 ± 0.06 | 7.76 ± 0.14% | 2.32 ± 0.03 | 3.48 ± 0.06 | 7.85 ± 0.14% | 2.34 ± 0.03 | 3.51 ± 0.07 | 7.93 ± 0.14% |
| Q-24-33 | DeepSTUQ | 2.22 ± 0.02 | 3.35 ± 0.02 | 7.51 ± 0.06% | 2.28 ± 0.02 | 3.45 ± 0.02 | 7.76 ± 0.07% | 2.35 ± 0.02 | 3.56 ± 0.02 | 8.03 ± 0.08% |
| Q-24-33 | RDAT | 2.42 ± 0.16 | 3.50 ± 0.17 | 8.19 ± 0.39% | 2.48 ± 0.14 | 3.58 ± 0.17 | 8.44 ± 0.38% | 2.52 ± 0.15 | 3.66 ± 0.19 | 8.60 ± 0.39% |
| Q-24-33 | STG-NCDE | 2.26 ± 0.01 | 3.44 ± 0.01 | 7.72 ± 0.03% | 2.35 ± 0.01 | 3.55 ± 0.03 | 8.00 ± 0.07% | 2.44 ± 0.01 | 3.68 ± 0.03 | 8.30 ± 0.12% |
| Q-24-33 | STWave | 2.44 ± 0.02 | 3.61 ± 0.05 | 8.32 ± 0.07% | 2.56 ± 0.02 | 3.78 ± 0.05 | 8.78 ± 0.11% | 2.67 ± 0.03 | 3.93 ± 0.05 | 9.20 ± 0.12% |
| Q-24-33 | STD-PLM | 2.40 ± 0.11 | 3.62 ± 0.11 | 8.15 ± 0.40% | 2.60 ± 0.03 | 3.87 ± 0.07 | 8.86 ± 0.09% | 2.71 ± 0.03 | 3.98 ± 0.14 | 8.87 ± 0.87% |
| Q-24-33 | EAC | 2.70 ± 0.15 | 3.94 ± 0.18 | 9.17 ± 0.48% | 2.94 ± 0.21 | 4.25 ± 0.25 | 9.99 ± 0.71% | 3.15 ± 0.27 | 4.54 ± 0.34 | 10.69 ± 0.91% |
| Q-24-33 | ST-ReP | 2.31 ± 0.12 | 3.42 ± 0.13 | 7.84 ± 0.56% | 2.39 ± 0.13 | 3.47 ± 0.15 | 8.20 ± 0.42% | 2.46 ± 0.18 | 3.71 ± 0.51 | 8.71 ± 1.00% |
| Q-24-33 | LightST | 2.21 ± 0.01 | 3.37 ± 0.01 | 7.57 ± 0.03% | 2.29 ± 0.01 | 3.48 ± 0.02 | 7.83 ± 0.05% | 2.35 ± 0.02 | 3.58 ± 0.03 | 8.07 ± 0.07% |
| Q-24-33 | STDN | 2.13 ± 0.01 | 3.25 ± 0.04 | 7.27 ± 0.08% | 2.16 ± 0.01 | 3.37 ± 0.01 | 7.39 ± 0.10% | 2.20 ± 0.01 | 3.35 ± 0.05 | 7.54 ± 0.06% |
| Q-24-33 | ST-SSDL | 2.14 ± 0.01 | 3.28 ± 0.06 | 7.33 ± 0.05% | 2.18 ± 0.01 | 3.35 ± 0.06 | 7.49 ± 0.07% | 2.22 ± 0.02 | 3.38 ± 0.03 | 7.62 ± 0.09% |
| Q-24-33 | Hybrid | 2.59 ± 0.02 | 3.80 ± 0.03 | 8.74 ± 0.05% | 2.73 ± 0.03 | 3.99 ± 0.04 | 9.23 ± 0.08% | 2.87 ± 0.03 | 4.15 ± 0.04 | 9.74 ± 0.09% |
| Q-24-33 | STFQET | 2.11 ± 0.01 | 3.24 ± 0.03 | 7.21 ± 0.07% | 2.14 ± 0.01 | 3.30 ± 0.03 | 7.31 ± 0.04% | 2.16 ± 0.02 | 3.34 ± 0.02 | 7.42 ± 0.06% |
| Q-26-31 | GMAN | 2.53 ± 0.03 | 4.12 ± 0.06 | 8.80 ± 0.20% | 2.58 ± 0.04 | 4.21 ± 0.07 | 9.06 ± 0.19% | 2.64 ± 0.04 | 4.30 ± 0.07 | 9.34 ± 0.18% |
| Q-26-31 | DeepSTUQ | 2.34 ± 0.02 | 3.90 ± 0.02 | 8.48 ± 0.14% | 2.41 ± 0.02 | 4.02 ± 0.03 | 8.87 ± 0.11% | 2.48 ± 0.03 | 4.13 ± 0.03 | 9.35 ± 0.07% |
| Q-26-31 | RDAT | 2.65 ± 0.09 | 4.16 ± 0.08 | 9.71 ± 0.11% | 2.84 ± 0.11 | 4.39 ± 0.10 | 10.67 ± 0.20% | 2.94 ± 0.11 | 4.54 ± 0.09 | 11.35 ± 0.17% |
| Q-26-31 | STG-NCDE | 2.41 ± 0.02 | 4.04 ± 0.02 | 8.86 ± 0.10% | 2.50 ± 0.03 | 4.19 ± 0.04 | 9.43 ± 0.17% | 2.58 ± 0.04 | 4.30 ± 0.03 | 9.95 ± 0.17% |
| Q-26-31 | STWave | 2.65 ± 0.06 | 4.27 ± 0.08 | 9.54 ± 0.34% | 2.82 ± 0.08 | 4.52 ± 0.10 | 10.47 ± 0.41% | 2.97 ± 0.07 | 4.74 ± 0.10 | 11.34 ± 0.44% |
| Q-26-31 | STD-PLM | 2.42 ± 0.02 | 3.99 ± 0.03 | 8.51 ± 0.02% | 2.50 ± 0.01 | 4.11 ± 0.01 | 9.24 ± 0.02% | 2.53 ± 0.03 | 4.16 ± 0.01 | 9.52 ± 0.02% |
| Q-26-31 | EAC | 2.73 ± 0.18 | 4.33 ± 0.29 | 8.34 ± 0.25% | 2.95 ± 0.18 | 4.65 ± 0.28 | 9.14 ± 0.22% | 3.17 ± 0.11 | 4.96 ± 0.19 | 10.33 ± 0.29% |
| Q-26-31 | ST-ReP | 2.47 ± 0.05 | 3.93 ± 0.05 | 9.07 ± 0.30% | 2.56 ± 0.11 | 4.06 ± 0.09 | 9.45 ± 0.62% | 2.68 ± 0.16 | 4.21 ± 0.15 | 9.84 ± 0.68% |
| Q-26-31 | LightST | 2.37 ± 0.01 | 3.97 ± 0.03 | 8.50 ± 0.13% | 2.48 ± 0.03 | 4.14 ± 0.05 | 9.24 ± 0.22% | 2.57 ± 0.02 | 4.30 ± 0.05 | 9.93 ± 0.18% |
| Q-26-31 | STDN | 2.33 ± 0.06 | 3.87 ± 0.03 | 8.39 ± 0.17% | 2.39 ± 0.06 | 3.99 ± 0.08 | 8.66 ± 0.12% | 2.46 ± 0.05 | 4.08 ± 0.02 | 9.09 ± 0.06% |
| Q-26-31 | ST-SSDL | 2.39 ± 0.17 | 3.96 ± 0.19 | 8.72 ± 0.87% | 2.49 ± 0.22 | 4.13 ± 0.23 | 11.00 ± 0.10% | 2.56 ± 0.25 | 4.26 ± 0.27 | 11.70 ± 0.14% |
| Q-26-31 | Hybrid | 2.73 ± 0.01 | 4.37 ± 0.02 | 10.07 ± 0.12% | 2.98 ± 0.02 | 4.70 ± 0.04 | 11.38 ± 0.09% | 3.10 ± 0.04 | 4.82 ± 0.05 | 11.84 ± 0.09% |
| Q-26-31 | STFQET | 2.30 ± 0.03 | 3.85 ± 0.05 | 8.12 ± 0.20% | 2.38 ± 0.03 | 4.00 ± 0.06 | 8.35 ± 0.27% | 2.42 ± 0.03 | 4.06 ± 0.04 | 8.64 ± 0.27% |
| Q-32-37 | GMAN | 3.17 ± 0.05 | 4.87 ± 0.08 | 11.15 ± 0.20% | 3.22 ± 0.05 | 4.95 ± 0.07 | 11.32 ± 0.23% | 3.27 ± 0.04 | 5.02 ± 0.06 | 11.51 ± 0.21% |
| Q-32-37 | DeepSTUQ | 3.00 ± 0.01 | 4.86 ± 0.01 | 10.65 ± 0.14% | 3.12 ± 0.01 | 5.08 ± 0.01 | 11.05 ± 0.10% | 3.24 ± 0.01 | 5.29 ± 0.04 | 11.44 ± 0.10% |
| Q-32-37 | RDAT | 3.28 ± 0.07 | 4.98 ± 0.12 | 11.56 ± 0.42% | 3.49 ± 0.18 | 5.29 ± 0.27 | 12.43 ± 0.78% | 3.66 ± 0.23 | 5.57 ± 0.38 | 14.30 ± 0.63% |
| Q-32-37 | STG-NCDE | 3.09 ± 0.03 | 4.99 ± 0.04 | 10.98 ± 0.10% | 3.24 ± 0.04 | 5.27 ± 0.05 | 11.52 ± 0.13% | 3.38 ± 0.05 | 5.54 ± 0.08 | 12.12 ± 0.21% |
| Q-32-37 | STWave | 3.31 ± 0.08 | 5.20 ± 0.09 | 11.59 ± 0.21% | 3.50 ± 0.08 | 5.50 ± 0.09 | 12.21 ± 0.29% | 3.67 ± 0.10 | 5.75 ± 0.11 | 12.80 ± 0.30% |
| Q-32-37 | STD-PLM | 2.98 ± 0.02 | 4.71 ± 0.06 | 10.40 ± 0.10% | 3.08 ± 0.01 | 4.98 ± 0.03 | 10.82 ± 0.03% | 3.24 ± 0.03 | 5.17 ± 0.03 | 10.39 ± 0.02% |
| Q-32-37 | EAC | 3.78 ± 0.44 | 5.47 ± 0.12 | 12.56 ± 0.33% | 4.02 ± 0.24 | 5.82 ± 0.24 | 13.33 ± 0.69% | 4.24 ± 0.16 | 6.16 ± 0.38 | 12.48 ± 0.36% |
| Q-32-37 | ST-ReP | 3.13 ± 0.11 | 4.90 ± 0.23 | 11.15 ± 0.43% | 3.17 ± 0.11 | 5.07 ± 0.43 | 11.27 ± 0.44% | 3.23 ± 0.13 | 5.17 ± 0.39 | 11.51 ± 0.47% |
| Q-32-37 | LightST | 3.02 ± 0.02 | 4.91 ± 0.04 | 10.64 ± 0.09% | 3.16 ± 0.03 | 5.15 ± 0.05 | 11.13 ± 0.12% | 3.28 ± 0.03 | 5.37 ± 0.07 | 11.61 ± 0.15% |
| Q-32-37 | STDN | 2.89 ± 0.01 | 4.67 ± 0.03 | 10.12 ± 0.13% | 2.96 ± 0.01 | 4.79 ± 0.04 | 10.37 ± 0.16% | 3.05 ± 0.02 | 4.92 ± 0.05 | 10.73 ± 0.22% |
| Q-32-37 | ST-SSDL | 3.02 ± 0.08 | 4.90 ± 0.11 | 10.62 ± 0.25% | 3.13 ± 0.11 | 5.07 ± 0.16 | 11.01 ± 0.38% | 3.22 ± 0.13 | 5.21 ± 0.18 | 11.32 ± 0.46% |
| Q-32-37 | Hybrid | 3.40 ± 0.05 | 5.23 ± 0.03 | 11.91 ± 0.20% | 3.77 ± 0.12 | 5.67 ± 0.12 | 12.51 ± 0.32% | 3.69 ± 0.03 | 5.59 ± 0.02 | 12.85 ± 0.22% |
| Q-32-37 | STFQET | 2.85 ± 0.02 | 4.56 ± 0.05 | 9.78 ± 0.03% | 2.92 ± 0.03 | 4.71 ± 0.03 | 9.86 ± 0.13% | 2.99 ± 0.01 | 4.84 ± 0.05 | 9.94 ± 0.29% |
Note: Bold values indicate the best performance for each metric and forecasting horizon.
Table 3. Sensitivity study of the number of dominant frequencies k_f on Q-24-33.
| k_f | MAE | RMSE | MAPE |
|---|---|---|---|
| 1 | 2.1315 | 3.2700 | 7.18% |
| 2 | 2.1148 | 3.2534 | 7.16% |
| 3 | 2.0554 | 3.1864 | 7.03% |
| 4 | 2.0677 | 3.1914 | 7.11% |
| 5 | 2.0596 | 3.1918 | 7.02% |
| 6 | 2.0549 | 3.1844 | 7.05% |
Table 4. Performance comparison of STFQET and baseline models on the Q-3335-3840 and Q-4346-2629 datasets. Each entry is reported as mean ± std; the standard deviation is rounded to two decimals for compact display.

| Data | Model | MAE (60 min) | RMSE (60 min) | MAPE (60 min) | MAE (90 min) | RMSE (90 min) | MAPE (90 min) | MAE (120 min) | RMSE (120 min) | MAPE (120 min) |
|---|---|---|---|---|---|---|---|---|---|---|
| Q-3335-3840 | GMAN | 2.46 ± 0.05 | 3.67 ± 0.04 | 9.73 ± 0.23% | 2.51 ± 0.04 | 3.74 ± 0.04 | 9.91 ± 0.25% | 2.55 ± 0.05 | 3.79 ± 0.05 | 10.10 ± 0.27% |
| | DeepSTUQ | 2.44 ± 0.07 | 3.71 ± 0.09 | 9.72 ± 0.29% | 2.51 ± 0.09 | 3.81 ± 0.14 | 10.05 ± 0.54% | 2.56 ± 0.10 | 3.90 ± 0.17 | 10.26 ± 0.40% |
| | RDAT | 2.71 ± 0.07 | 3.94 ± 0.09 | 10.81 ± 0.28% | 2.85 ± 0.11 | 4.13 ± 0.13 | 11.48 ± 0.43% | 2.97 ± 0.14 | 4.31 ± 0.17 | 12.03 ± 0.51% |
| | STG-NCDE | 2.75 ± 0.21 | 4.12 ± 0.29 | 9.74 ± 0.39% | 2.91 ± 0.30 | 4.38 ± 0.41 | 10.00 ± 0.59% | 3.02 ± 0.34 | 4.57 ± 0.50 | 10.20 ± 0.62% |
| | STWave | 2.72 ± 0.12 | 4.06 ± 0.14 | 10.86 ± 0.49% | 2.88 ± 0.14 | 4.29 ± 0.17 | 11.56 ± 0.54% | 2.99 ± 0.22 | 4.45 ± 0.27 | 12.03 ± 0.90% |
| | STD-PLM | 2.72 ± 0.22 | 4.09 ± 0.30 | 10.61 ± 0.83% | 2.90 ± 0.30 | 4.37 ± 0.41 | 9.59 ± 0.07% | 3.06 ± 0.37 | 4.59 ± 0.51 | 9.80 ± 0.06% |
| | EAC | 2.94 ± 0.06 | 4.33 ± 0.08 | 11.58 ± 0.08% | 3.20 ± 0.06 | 4.71 ± 0.08 | 12.66 ± 0.13% | 3.44 ± 0.08 | 5.05 ± 0.09 | 13.64 ± 0.19% |
| | ST-ReP | 2.54 ± 0.08 | 3.73 ± 0.08 | 10.11 ± 0.32% | 2.56 ± 0.07 | 3.76 ± 0.08 | 10.16 ± 0.28% | 2.62 ± 0.06 | 3.85 ± 0.06 | 10.40 ± 0.22% |
| | LightST | 2.43 ± 0.01 | 3.71 ± 0.02 | 9.61 ± 0.07% | 2.50 ± 0.01 | 3.82 ± 0.02 | 9.92 ± 0.07% | 2.57 ± 0.02 | 3.94 ± 0.02 | 10.26 ± 0.10% |
| | STDN | 2.38 ± 0.01 | 3.61 ± 0.01 | 9.45 ± 0.08% | 2.43 ± 0.00 | 3.70 ± 0.01 | 9.65 ± 0.07% | 2.51 ± 0.01 | 3.81 ± 0.01 | 9.98 ± 0.08% |
| | ST-SSDL | 2.63 ± 0.33 | 4.08 ± 0.65 | 10.20 ± 0.52% | 2.74 ± 0.36 | 4.24 ± 0.70 | 10.64 ± 0.59% | 2.82 ± 0.38 | 4.37 ± 0.72 | 10.95 ± 0.63% |
| | Hybrid | 2.92 ± 0.00 | 4.30 ± 0.01 | 11.35 ± 0.01% | 3.15 ± 0.01 | 4.59 ± 0.02 | 12.30 ± 0.02% | 3.34 ± 0.01 | 4.82 ± 0.03 | 13.06 ± 0.04% |
| | STFQET | 2.36 ± 0.02 | 3.59 ± 0.01 | 9.26 ± 0.03% | 2.40 ± 0.02 | 3.65 ± 0.01 | 9.52 ± 0.09% | 2.46 ± 0.02 | 3.74 ± 0.02 | 9.76 ± 0.09% |
| Q-4346-2629 | GMAN | 2.56 ± 0.01 | 3.98 ± 0.01 | 9.30 ± 0.06% | 2.62 ± 0.01 | 4.10 ± 0.02 | 9.58 ± 0.08% | 2.68 ± 0.01 | 4.19 ± 0.03 | 9.82 ± 0.10% |
| | DeepSTUQ | 2.54 ± 0.01 | 4.03 ± 0.03 | 9.56 ± 0.11% | 2.64 ± 0.01 | 4.20 ± 0.02 | 10.04 ± 0.10% | 2.72 ± 0.01 | 4.30 ± 0.01 | 10.40 ± 0.11% |
| | RDAT | 2.94 ± 0.09 | 4.43 ± 0.13 | 10.94 ± 0.39% | 3.25 ± 0.18 | 4.86 ± 0.22 | 12.22 ± 0.61% | 3.45 ± 0.26 | 5.15 ± 0.31 | 13.09 ± 0.89% |
| | STG-NCDE | 2.98 ± 0.23 | 4.65 ± 0.38 | 11.12 ± 0.91% | 3.22 ± 0.33 | 5.06 ± 0.52 | 10.40 ± 0.46% | 3.40 ± 0.36 | 5.37 ± 0.58 | 11.02 ± 0.45% |
| | STWave | 3.01 ± 2.19 | 4.69 ± 2.59 | 11.19 ± 7.65% | 3.28 ± 2.10 | 5.11 ± 2.42 | 12.47 ± 7.23% | 3.50 ± 1.99 | 5.44 ± 2.27 | 13.60 ± 6.80% |
| | STD-PLM | 2.90 ± 0.27 | 4.54 ± 0.40 | 10.34 ± 0.89% | 3.15 ± 0.38 | 4.94 ± 0.56 | 9.41 ± 0.05% | 3.33 ± 0.47 | 5.21 ± 0.67 | 9.67 ± 0.06% |
| | EAC | 3.22 ± 0.06 | 4.86 ± 0.09 | 12.01 ± 0.23% | 3.57 ± 0.08 | 5.39 ± 0.09 | 13.58 ± 0.18% | 3.85 ± 0.10 | 5.78 ± 0.13 | 14.78 ± 0.25% |
| | ST-ReP | 2.78 ± 0.11 | 4.18 ± 0.16 | 10.18 ± 0.53% | 2.89 ± 0.14 | 4.37 ± 0.19 | 10.67 ± 0.71% | 2.99 ± 0.19 | 4.51 ± 0.23 | 11.08 ± 0.86% |
| | LightST | 2.55 ± 0.02 | 4.01 ± 0.03 | 9.36 ± 0.06% | 2.67 ± 0.02 | 4.22 ± 0.03 | 9.99 ± 0.09% | 2.79 ± 0.03 | 4.41 ± 0.04 | 10.55 ± 0.11% |
| | STDN | 2.55 ± 0.01 | 4.04 ± 0.03 | 9.61 ± 0.16% | 2.63 ± 0.01 | 4.19 ± 0.03 | 10.00 ± 0.13% | 2.73 ± 0.02 | 4.35 ± 0.03 | 10.46 ± 0.08% |
| | ST-SSDL | 2.64 ± 0.06 | 4.15 ± 0.08 | 9.93 ± 0.26% | 2.81 ± 0.10 | 4.45 ± 0.13 | 10.87 ± 0.44% | 2.92 ± 0.12 | 4.62 ± 0.15 | 11.43 ± 0.59% |
| | Hybrid | 3.10 ± 0.01 | 4.74 ± 0.01 | 11.05 ± 0.08% | 3.41 ± 0.02 | 5.20 ± 0.02 | 12.51 ± 0.07% | 3.62 ± 0.02 | 5.48 ± 0.03 | 13.54 ± 0.07% |
| | STFQET | 2.44 ± 0.02 | 3.89 ± 0.05 | 8.89 ± 0.22% | 2.52 ± 0.03 | 4.03 ± 0.04 | 9.28 ± 0.18% | 2.59 ± 0.04 | 4.13 ± 0.04 | 9.59 ± 0.17% |
Note: Bold values indicate the best performance for each metric and forecasting horizon.
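For readers who want to reproduce entries in the "mean ± std" format used in Table 4, the following is a minimal sketch of the three metrics and the summary string. It rests on assumptions not confirmed by the paper: MAPE skips targets that are (near) zero, the standard deviation is the population form (`ddof=0`) over repeated runs, and all function names are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error in percent.

    Assumption: targets with |y| <= eps are masked out to avoid
    division by zero; the paper's exact masking rule is not shown here.
    """
    mask = np.abs(y_true) > eps
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100.0)

def summarize(run_values):
    """Format per-run metric values as 'mean ± std' with two decimals.

    Assumption: population standard deviation (ddof=0).
    """
    v = np.asarray(run_values, dtype=float)
    return f"{v.mean():.2f} ± {v.std():.2f}"

# Toy example with two targets and two predictions.
y = np.array([2.0, 4.0])
p = np.array([1.0, 6.0])
scores = (mae(y, p), rmse(y, p), mape(y, p))
```

Rounding the standard deviation to two decimals explains entries such as "± 0.00" in the table: the spread across runs is below 0.005, not exactly zero.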
Table 5. Representative query temporal perturbation results on Q-24-33.

| Query Setting | Avg. MAE | Avg. RMSE | Avg. MAPE |
|---|---|---|---|
| full | 2.0554 | 3.1864 | 7.03% |
| zero | 2.1087 | 3.2534 | 7.19% |
| current_only | 2.1194 | 3.2621 | 7.27% |
| temporal_shuffle | 2.1055 | 3.2749 | 7.29% |
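The four query settings in Table 5 can be sketched as simple array transformations. The definitions below are interpretations inferred from the setting names alone, not the paper's stated procedure: `zero` replaces the query input with zeros, `current_only` keeps only the most recent time step, and `temporal_shuffle` permutes the time axis. The function `perturb_query` and the convention that time is axis 0 with the current step last are assumptions.

```python
import numpy as np

def perturb_query(q, setting, rng=None):
    """Apply a query perturbation (hypothetical helper).

    q       : array of shape (T, ...) with time on axis 0; the last
              index is assumed to be the current time step.
    setting : 'full', 'zero', 'current_only', or 'temporal_shuffle'.
    rng     : optional numpy Generator used for the shuffle.
    """
    q = np.asarray(q, dtype=float)
    if setting == "full":
        return q.copy()                      # unmodified query history
    if setting == "zero":
        return np.zeros_like(q)              # remove all query information
    if setting == "current_only":
        out = np.zeros_like(q)
        out[-1] = q[-1]                      # keep only the current step
        return out
    if setting == "temporal_shuffle":
        rng = np.random.default_rng(0) if rng is None else rng
        return q[rng.permutation(q.shape[0])]  # destroy temporal order
    raise ValueError(f"unknown setting: {setting}")

# Toy query tensor: 4 time steps, 3 regions.
q = np.arange(12.0).reshape(4, 3)
variants = {s: perturb_query(q, s)
            for s in ("full", "zero", "current_only", "temporal_shuffle")}
```

Under this reading, `temporal_shuffle` preserves the values of the query signal but not their order, so comparing it with `full` isolates the contribution of temporal structure.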

Share and Cite

MDPI and ACS Style

Zhao, S.; Ta, X. A Spatial–Temporal Transformer with Query Enhancement and Fourier Analysis for Traffic Forecasting. Information 2026, 17, 542. https://doi.org/10.3390/info17060542

