Article

From Prediction to Planning: A Spectral-Temporal GNN and Bi-Directional Decoding RL Framework

1
School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China
2
Power Dispatching and Control Center of Guangdong Power Grid Co., Ltd., Guangzhou 510220, China
*
Author to whom correspondence should be addressed.
Signals 2026, 7(3), 47; https://doi.org/10.3390/signals7030047
Submission received: 9 March 2026 / Revised: 21 April 2026 / Accepted: 28 April 2026 / Published: 19 May 2026

Abstract

Accurately capturing spatiotemporal dependencies and enabling effective decision support are core challenges in Intelligent Transportation Systems (ITS). Existing research often treats traffic prediction and path planning as isolated tasks. Moreover, mainstream prediction models struggle with long-term periodic patterns, while Reinforcement Learning (RL)-based planning often suffers from inefficient exploration in sparse topologies. To address these issues, this paper proposes a unified framework combining a spectral-temporal Graph Neural Network (GNN) and bi-directional decoding RL. Specifically, a time-frequency dual-stream adaptive learning module is introduced for prediction. Fast Fourier Transform (FFT) and Gated Recurrent Unit (GRU) are employed to capture global frequency periodicities and local temporal dynamics, respectively. Their adaptive fusion effectively mitigates the long-sequence information forgetting problem. For path planning, the task is formulated as sequence generation. A graph-aware attention encoder with adjacency masking is designed, and heuristic feature embeddings are incorporated to guide efficient exploration. Furthermore, a bi-directional autoregressive decoding strategy enhances robustness against topological bottlenecks. On PEMSD4 and PEMSD8, the proposed predictor achieves MAE/RMSE/MAPE values of 18.211/30.433/12.006 and 13.587/23.566/8.955, respectively. Path-planning simulations on the PEMSD4-derived sparse topology further demonstrate stable bi-directional RL optimization, faster convergence with heuristic guidance, and a sparsity-aware encoder that reduces redundant attention interactions in sparse road networks. These results validate the effectiveness of the proposed “predict-then-plan” paradigm.

1. Introduction

The primary challenge in autonomous driving and Intelligent Transportation Systems (ITS) lies in optimizing decision-making within dynamic and complex environments, where uncertainty and temporal variability are inherent. Driven by rapid advancements in deep learning, traditional cascaded architectures that separate perception, prediction, and planning are evolving toward integrated collaborative frameworks [1,2]. In this context, efficient motion planning relies not only on analyzing static road network topology but also on accurately predicting future traffic flow states to preemptively avoid congestion and hazards [3]. Existing real-time motion planning methods have made progress in obstacle avoidance and trajectory generation, focusing primarily on local geometric constraints [4,5]. However, effectively integrating dynamic traffic prediction information over long time horizons to achieve global path optimality remains a critical challenge, particularly when balancing computational efficiency with the complexity of real-world traffic patterns [6].
In the field of traffic flow prediction, modeling spatiotemporal data has shifted from traditional statistical methods to data-driven deep learning paradigms. Graph Neural Networks (GNNs) have emerged as the mainstream approach due to their robust ability to handle spatial dependencies in non-Euclidean spaces, modeling road networks as complex graph structures [7,8,9]. Recent deep-learning-based studies in vehicular sensing have also demonstrated the effectiveness of convolutional neural architectures for extracting structured patterns from complex transportation-related signals. For example, Delamou et al. [10] proposed a deep-learning-based estimator for multitarget radar detection in vehicular scenarios, highlighting the broader applicability of neural feature learning in intelligent transportation environments. Nevertheless, the traffic forecasting problem considered in this work has a different data structure, since the target signals are collected over a sparse road graph and require explicit modeling of non-Euclidean spatial dependencies together with mixed local and periodic temporal patterns. To address the limitations of static graph structures, the authors in [11] proposed adaptive graph convolutional networks to capture latent correlations among nodes, while Shao et al. designed a decoupled spatiotemporal learning framework to infer dynamic adjacency matrices [12]. Nevertheless, prior GNN-based predictors still exhibit three recurring limitations. First, spatial modeling often remains static or only weakly adaptive, which reduces robustness under incident-driven topology changes. Second, temporal modeling is usually dominated by recurrent or convolutional time-domain operators, which biases the representation toward recent observations and weakens the extraction of recurring spectral structure [13]. 
Third, when multiple temporal views are used, the fusion mechanism is often fixed, making it difficult to adapt the model emphasis between smooth periodic regimes and bursty traffic fluctuations. As noted by [14], spectral analysis is crucial for resolving long-range periodicity in complex time series; therefore, ignoring frequency-domain information restricts the model’s capacity to represent macroscopic traffic patterns and degrades long-horizon forecasting quality.
Simultaneously, path planning methodologies are undergoing a paradigm shift from traditional search to Deep Reinforcement Learning (DRL) [15,16]. A review by [17] indicates that DRL can learn complex navigation strategies through environmental interaction, overcoming the adaptability limitations of traditional algorithms in dynamic settings. In particular, approaches based on Neural Combinatorial Optimization (NCO) model path planning as a node sequence generation task [18,19] and have demonstrated superior performance on combinatorial optimization problems such as the Vehicle Routing Problem [20]. However, prior planning methods also exhibit three practical bottlenecks. First, classical planners such as Dijkstra, A*, and hybrid A* remain strong for static or geometry-dominant routing [3], but they do not naturally exploit predicted future traffic states. Second, directly applying topology-agnostic Transformer encoders to urban road networks introduces substantial redundancy because most node pairs are physically disconnected. Third, purely data-driven exploration often suffers from slow convergence and cold-start issues within large, sparse graphs, where bottleneck edges are difficult to discover in the initial stage. Studies by [21,22] demonstrate that fusing heuristic rules into reinforcement learning frameworks can significantly reduce the search space and improve policy quality. This concept traces back to the early theories on heuristic control systems by [23]. Building on the hyper-heuristic design concepts proposed by [24], injecting the optimality priors of traditional algorithms into the model as directional guidance signals has become an effective approach to overcoming sparse-reward and exploration bottlenecks.
To address these challenges, this paper proposes an integrated framework that fuses a spectral-temporal graph neural network with bidirectional decoding reinforcement learning. For prediction, we design a time-frequency dual-stream architecture to simultaneously capture microscopic time-domain dynamics and macroscopic frequency-domain periodicities. Additionally, a dynamic graph generation module is employed to infer latent spatial dependencies. For planning, we propose an adjacency masking attention mechanism to accommodate the sparsity of road networks. Furthermore, a bidirectional autoregressive decoding strategy is introduced to circumvent the local minima associated with unidirectional search. The literature gaps addressed by the proposed framework are summarized schematically in Figure 1.
The contributions of this paper are summarized as follows:
(1) A time-frequency dual-stream adaptive graph neural network prediction model is proposed. By combining Real Fast Fourier Transform (RFFT) spectral analysis with a dynamic topology generation mechanism, the model effectively resolves the difficulty traditional methods face in simultaneously capturing long-range periodicities and instantaneous dynamic fluctuations.
(2) A path planning algorithm based on a graph-aware attention encoder and bidirectional decoding is designed. The adjacency masking mechanism reduces redundant attention interactions in sparse road networks, while the bidirectional parallel search strategy enhances global optimization capability.
(3) A heuristic-guided feature embedding module is constructed to incorporate prior knowledge from traditional shortest path algorithms into the reinforcement learning state space. This approach effectively addresses the issues of low exploration efficiency and cold-start problems for agents in large-scale road networks.
To make the positioning of the proposed framework more explicit, Table 1 contrasts it with three representative classes of joint prediction-planning approaches along three axes. Along module coupling, cascaded prediction-then-planning pipelines decouple the two tasks entirely, whereas end-to-end DRL over a latent traffic representation couples them implicitly through shared parameters. The proposed framework instead adopts an intermediate “predicted-cost injection” bridge, which transfers explicit short-horizon predicted flows from the spectral-temporal GNN into the planner without forcing hard parameter sharing. Along spatio-temporal modeling, most prior joint frameworks rely on time-domain GNNs whose temporal operators bias the representation toward recent observations and weaken recurring spectral structure. The proposed framework introduces a spectral-temporal dual-stream encoder with dynamic topology generation, which jointly captures periodic spectral components and incident-driven topology shifts. Along RL exploration, topology-agnostic attention and undirected exploration are common in prior planners, which causes slow convergence on large, sparse road networks. The proposed framework uses adjacency-masked attention, Dijkstra-based heuristic embedding, and bi-directional autoregressive decoding to mitigate cold-start inefficiency and bottleneck-edge blindness.
The remainder of this article is organized as follows. Section 2 formulates the mathematical models for the traffic prediction and path planning problems. Section 3 elaborates on the proposed algorithmic framework, detailing the Spectral-Temporal GNN, the dynamic topology generation, and the Bi-directional Decoding RL method with heuristic feature embedding. Section 4 presents the experimental settings, including datasets and baselines, and analyzes the performance of prediction and planning tasks along with ablation studies. Finally, Section 5 concludes this article and discusses future research directions.

2. Model

Let $N$ denote the number of nodes within a sensor network. At each time step $t$, the system state comprises a set of heterogeneous physical attributes. Consequently, observations at time $t$ are denoted as a matrix $X_t \in \mathbb{R}^{N \times C}$, where $C$ represents the attribute feature dimension for each node. The entire historical observation window is represented as a tensor $\mathcal{X}_t \in \mathbb{R}^{T \times N \times C}$, where $T$ indicates the length of the look-back window.
The system is represented as a graph structure $\mathcal{G}$. This graph consists of a node set $\mathcal{V}$ and an edge set $\mathcal{E}$. To avoid symbol overloading, the time-varying node dependency matrix is denoted by $G_t \in \mathbb{R}^{N \times N}$. Distinct from conventional approaches relying on static priors such as geographical distances, the graph $\mathcal{G}$ is posited as time-varying or latent. This implies that $G_t$ must be dynamically inferred from the input data $\mathcal{X}_t$.
Given the historical observation tensor $\mathcal{X}_t$ over the past $T$ time steps, the objective involves learning a non-linear mapping function $f_\theta$ to forecast the target attribute for the next $\tau$ time steps. This target usually corresponds to the primary metric among the $C$ attributes. The prediction is formulated as follows:
$$\hat{Y} = f_\theta(\mathcal{X}_t; G_l),$$
where $\hat{Y} \in \mathbb{R}^{\tau \times N}$ denotes the predicted sequence, $\theta$ represents learnable model parameters, and $G_l$ signifies the internally generated dynamic graph structure.
Consider a static road network topology denoted by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of $n$ intersection nodes, and $\mathcal{E}$ represents the set of road edges. The edge weight function $d: \mathcal{E} \to \mathbb{R}^{+}$ represents the travel distance. Let $f: \mathcal{V} \to \mathbb{R}^{+}$ denote the time-varying node traffic, which is provided by the spectral-temporal GNN module. Given a source node $s$ and a destination node $t$, the path planning task is defined as finding the optimal path $\pi = (\pi_1, \pi_2, \ldots, \pi_L)$ that minimizes the hybrid cost function:
$$C(\pi) = \sum_{i=1}^{L-1} \tilde{d}(\pi_i, \pi_{i+1}) + \lambda \sum_{i=1}^{L} \tilde{f}(\pi_i),$$
where $\pi_1 = s$, $\pi_L = t$, and $(\pi_i, \pi_{i+1}) \in \mathcal{E}$ for all $i$. Because raw distances and predicted traffic values carry different physical units, the two terms are made dimensionally comparable through min-max normalization to $[0, 1]$: $\tilde{d}(i,j) = (d(i,j) - d_{\min}) / (d_{\max} - d_{\min})$ with $d_{\min}, d_{\max}$ taken over all graph edges, and $\tilde{f}(i) = (f(i) - f_{\min}) / (f_{\max} - f_{\min})$ with $f_{\min}, f_{\max}$ taken over all nodes at the current decision cycle. The scalar weight $\lambda \ge 0$ controls the trade-off between geometric distance and predicted traffic exposure. Unless stated otherwise, we adopt the default $\lambda = 1$, which corresponds to equal weighting after normalization and places the two terms at a common numerical scale. The sensitivity of the framework to different choices of $\lambda$ is examined empirically in Section 4.5, which confirms that the default is close to the realized-cost minimum under the present planning topology.
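The hybrid cost can be illustrated with a minimal Python sketch. The dictionary-based graph representation, the function names `min_max` and `hybrid_cost`, and the toy numbers are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def min_max(values: Dict, eps: float = 1e-12) -> Dict:
    """Scale the values of a dict to [0, 1]; a constant dict maps to all zeros."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span > eps else 0.0 for k, v in values.items()}


def hybrid_cost(path: List[int], dist: Dict[Edge, float],
                flow: Dict[int, float], lam: float = 1.0) -> float:
    """C(pi) = sum of normalized edge distances + lam * sum of normalized node flows."""
    d_norm = min_max(dist)   # normalized over all graph edges
    f_norm = min_max(flow)   # normalized over all nodes at this decision cycle
    edge_term = sum(d_norm[(u, v)] for u, v in zip(path, path[1:]))
    node_term = sum(f_norm[v] for v in path)
    return edge_term + lam * node_term


# Toy directed graph (hypothetical numbers): two routes from node 0 to node 3.
dist = {(0, 1): 100.0, (1, 3): 100.0, (0, 2): 150.0, (2, 3): 200.0}
flow = {0: 10.0, 1: 90.0, 2: 10.0, 3: 10.0}

short_but_busy = hybrid_cost([0, 1, 3], dist, flow)   # 1.0
long_but_clear = hybrid_cost([0, 2, 3], dist, flow)   # 1.5
```

In this toy case the short, congested route is cheaper at $\lambda = 1$, while raising $\lambda$ to 2 makes the longer, uncongested route preferable, which is the trade-off that $\lambda$ is meant to control.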
To improve notation consistency, we reserve $A_{\mathrm{attr}}$ for attribute-level affinities, $G_t$ for the dynamically inferred node graph used by diffusion, and $(s, t, \pi)$ for the planning tuple. The main symbols used in the prediction and planning modules are summarized below for quick reference.

3. Algorithm

3.1. A Spectral-Temporal GNN for Traffic Prediction

3.1.1. Intra-Variable Dependency Modeling

In multivariate time-series analysis, each variable typically encompasses multiple heterogeneous attributes (such as flow, speed, and occupancy). Uncovering the latent interactions among these attributes is pivotal for comprehending local system dynamics. To adaptively capture these micro-scale dependencies without relying on prior domain knowledge, we introduce an Attribute Correlation Encoding module based on latent space projection.
To map attributes with distinct physical dimensions into a unified semantic space, a non-linear feature transformation is applied to the input $X_t \in \mathbb{R}^{N \times C}$ at time step $t$. Specifically, raw attributes are projected into a high-dimensional latent representation $E_t$ via a learnable weight matrix $W_e$ and a bias term $b_e$:
$$E_t = \phi(X_t W_e + b_e),$$
where $\phi(\cdot)$ denotes the non-linear activation function and $C$ represents the number of attributes. This step aims to extract high-order semantic features for each attribute.
Subsequently, to infer the dependency topology among attributes, pairwise correlations between latent representations are computed. Given that strong attribute associations are typically sparse and non-negative, similarity is calculated via the dot product, followed by the application of the Rectified Linear Unit (ReLU) to prune noisy connections and enforce non-negativity:
$$A_{\mathrm{attr}} = \mathrm{ReLU}(E_t \cdot E_t^{\top}),$$
where $A_{\mathrm{attr}} \in \mathbb{R}^{N \times C \times C}$ denotes the learned attribute affinity matrix. The ReLU activation filters out negative correlations to ensure a robust and sparse graph structure.
Furthermore, graph spectral filtering is performed on the learned graph to aggregate information from correlated attributes. To account for the significance of intrinsic features, an identity matrix $I$ is added to the affinity matrix to introduce self-loops. The features are then fused through the graph convolution weights $\Theta$:
$$H_i = (I + A_{\mathrm{attr}}) X_t \Theta.$$
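The three steps of this module can be sketched in NumPy as follows. The text does not fix all tensor shapes, so this sketch makes assumptions: each scalar attribute is lifted to a $d$-dimensional latent vector (so that the affinity comes out as $N \times C \times C$), `tanh` stands in for the unspecified activation $\phi$, and $\Theta$ maps $C$ attributes to $D$ output features. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, d, D = 5, 3, 8, 16           # nodes, attributes, latent dim, output dim (illustrative)

X_t = rng.normal(size=(N, C))      # observations at one time step
W_e = rng.normal(size=(C, d))      # assumed: one latent direction per attribute
b_e = np.zeros((C, d))
Theta = rng.normal(size=(C, D))    # graph-convolution weights (assumed shape)

# E_t: each scalar attribute is lifted to a d-dimensional latent vector.
E_t = np.tanh(X_t[:, :, None] * W_e[None] + b_e[None])          # (N, C, d)

# A_attr = ReLU(E_t E_t^T): per-node attribute affinity, negatives pruned.
A_attr = np.maximum(E_t @ E_t.transpose(0, 2, 1), 0.0)          # (N, C, C)

# H_i = (I + A_attr) X_t Theta: self-loops plus aggregation over correlated attributes.
mixed = ((np.eye(C)[None] + A_attr) @ X_t[:, :, None])[..., 0]  # (N, C)
H_i = mixed @ Theta                                             # (N, D)
```

Because the affinity is a product of the latent tensor with its own transpose, it is symmetric for every node before and after the ReLU.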

3.1.2. Time-Frequency Dual-Stream Adaptive Learning

Real-world multivariate time series often exhibit an entanglement of complex transient variations and long-term periodicities. Single-domain temporal models, such as Recurrent Neural Networks (RNNs), frequently struggle with vanishing gradients when capturing long-range dependencies, while purely frequency-domain models may fail to address non-stationary local shifts. To address these limitations, the proposed method incorporates a Time-Frequency Dual-stream Adaptive Learning module. This module processes, in parallel, the sequence of attribute-enhanced features derived from the previous stage, denoted as $H_{in} = [H_i^{(1)}, \ldots, H_i^{(T)}] \in \mathbb{R}^{T \times N \times D}$. It extracts features from the perspectives of micro-scale temporal evolution and macro-scale spectral patterns, achieving dynamic integration via an adaptive gating mechanism. The fusion is motivated by complementary failure modes: the GRU branch is more suitable for non-stationary local transitions and short-lived bursts, whereas the FFT branch compactly exposes recurring components over the full observation window. Because traffic conditions switch between these regimes over time, a fixed fusion rule is suboptimal, and an adaptive gate is used to modulate the contribution of each branch. Under the standard $T = 12$ input setting adopted in this paper, the RFFT branch processes $\lfloor T/2 \rfloor + 1 = 7$ non-redundant spectral bins.
Time-Domain Local Awareness
This branch is designed to capture non-linear local dynamics. The Gated Recurrent Unit (GRU) is employed as the backbone architecture, owing to its efficacy in processing short-term memory and suppressing noise. Rather than serving as a generic sequence encoder, the GRU functions explicitly as a “local feature extractor” to model state transitions between adjacent time steps:
$$H_{td}^{(t)} = \mathrm{GRU}(H_i^{(t)}, H_{td}^{(t-1)}; \Theta_t),$$
where $H_{td} \in \mathbb{R}^{T \times N \times D}$ represents the time-domain hidden states that encapsulate local trends and instantaneous variations.
Frequency-Domain Global Enhancement
To effectively capture long-range dependencies and periodic patterns, a parallel branch based on spectral analysis is introduced. This branch utilizes the equivalence between time-domain convolution and frequency-domain point multiplication to process features with a global receptive field. Specifically, the input sequence is first mapped to the frequency domain via an RFFT along the temporal dimension:
$$F_{fd} = \mathrm{RFFT}(H_{in}) \in \mathbb{C}^{(\lfloor T/2 \rfloor + 1) \times N \times D}.$$
Next, the spectral features are modulated by a learnable complex filter with weight matrix $W_{fd}$. This process isolates predictively significant frequency components, such as dominant periods, while suppressing high-frequency random noise:
$$\tilde{F}_{fd} = F_{fd} \odot W_{fd}.$$
Finally, an Inverse Real Fast Fourier Transform (IRFFT) restores the enhanced features to the time domain, yielding the representation $H_{fd}$ enriched with global context.
Adaptive Gated Fusion
Simple additive fusion cannot adapt to dynamic changes in which the data follow distinct patterns at different moments, such as periodicity during off-peak periods versus sudden bursts during peaks. A context-aware gating unit is designed to address this limitation. To standardize feature distributions, Layer Normalization (LN) is applied to the dual-branch outputs. A parameterized network then generates timestep-specific fusion coefficients $Z \in \mathbb{R}^{T \times N \times 1}$:
$$Z = \sigma\left(W_g \left[\mathrm{LN}(H_{td}) \,\|\, \mathrm{LN}(H_{fd})\right] + b_g\right),$$
where $\|$ denotes concatenation along the feature dimension, and $\sigma$ is the Sigmoid activation function. The final hybrid spatio-temporal feature $H_f$ is calculated as:
$$H_f = Z \odot H_{td} + (1 - Z) \odot H_{fd}.$$
In this manner, the model can adaptively switch between focusing on local evolution and referencing global patterns according to the current context, thereby generating robust time-series representations.
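The frequency branch and the gated fusion can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: the GRU branch output is replaced by a random stand-in tensor, the complex filter has one weight per (frequency bin, node, feature) entry, and the layer normalization carries no learnable scale or shift. None of these choices is confirmed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D = 12, 4, 8                      # look-back window, nodes, feature width

H_in = rng.normal(size=(T, N, D))       # attribute-enhanced input sequence
H_td = rng.normal(size=(T, N, D))       # stand-in for the GRU branch output (not computed here)

# Frequency branch: RFFT over time, learnable complex filter, inverse transform.
F_fd = np.fft.rfft(H_in, axis=0)                          # (T//2 + 1, N, D), complex
W_fd = rng.normal(size=F_fd.shape) + 1j * rng.normal(size=F_fd.shape)
H_fd = np.fft.irfft(F_fd * W_fd, n=T, axis=0)             # back to (T, N, D), real


def layer_norm(x, eps=1e-5):
    """Normalize over the feature axis (no learnable scale/shift in this sketch)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


# Gate: Z = sigmoid(W_g [LN(H_td) || LN(H_fd)] + b_g), one coefficient per (t, node).
W_g = rng.normal(size=(2 * D, 1))
b_g = np.zeros(1)
concat = np.concatenate([layer_norm(H_td), layer_norm(H_fd)], axis=-1)  # (T, N, 2D)
Z = 1.0 / (1.0 + np.exp(-(concat @ W_g + b_g)))                         # (T, N, 1)

# Convex combination of the two branches.
H_f = Z * H_td + (1.0 - Z) * H_fd                                       # (T, N, D)
```

With $T = 12$ the transformed tensor has 7 bins along the first axis, matching the count stated above, and applying the inverse transform without any filter recovers the input sequence.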

3.1.3. Dynamic Topology Generation

While the dual-stream module effectively captures independent temporal dynamics, modeling time-varying spatial dependencies remains challenging. An adaptive topology evolution mechanism is proposed to capture these dynamic structures. This component infers a latent adjacency matrix for each time step by non-linearly fusing static priors with real-time dynamic features.
To establish a robust foundation for topology generation, node embeddings $E_n \in \mathbb{R}^{N \times d}$ and time embeddings $E_t \in \mathbb{R}^{d}$ are initialized to encode inherent spatial attributes and discrete temporal features. At time step $t$, element-wise multiplication generates a baseline representation $N_s^{(t)}$ encompassing spatio-temporal priors:
$$N_s^{(t)} = E_n \odot E_t^{(t)}.$$
This fusion strategy allows the model to adjust the distinct representation of nodes according to the temporal context.
The fused feature $H_f^{(t)}$ serves as a dynamic driving signal to address topology evolution induced by sudden events. This feature is projected into the same dimensional space as the node embedding via a fully connected layer to obtain the dynamic state vector $D_s^{(t)}$:
$$D_s^{(t)} = \phi(W_d H_f^{(t)} + b_d).$$
Then $D_s^{(t)}$ modulates the static baseline element-wise to activate or suppress specific dimensions:
$$H_l^{(t)} = D_s^{(t)} \odot N_s^{(t)}.$$
Subsequently, pairwise node relationships are inferred through non-linear interactions in the latent space:
$$G_t = \mathrm{ReLU}\left(\tanh(H_l^{(t)}) \cdot \tanh(H_l^{(t)})^{\top}\right),$$
where tanh activates features prior to similarity calculation via matrix multiplication, and ReLU ensures sparsity and non-negativity in the graph structure.
The generated $G_t$ quantifies real-time connection strengths between nodes, providing a structural basis for subsequent spatial diffusion convolution.
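The topology generation steps can be traced with a short NumPy sketch for a single time step. The sizes, the random parameters, the variable name `E_time`, and the use of `tanh` for the unspecified activation $\phi$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d = 6, 8, 4                       # nodes, fused-feature width, embedding width

E_n = rng.normal(size=(N, d))           # node embeddings
E_time = rng.normal(size=(d,))          # time embedding for the current step
H_f_t = rng.normal(size=(N, D))         # fused dual-stream feature at step t
W_d = rng.normal(size=(D, d))
b_d = np.zeros(d)

N_s = E_n * E_time                      # static spatio-temporal prior, (N, d)
D_s = np.tanh(H_f_t @ W_d + b_d)        # dynamic state; tanh stands in for phi
H_l = D_s * N_s                         # element-wise modulation, (N, d)

U = np.tanh(H_l)
G_t = np.maximum(U @ U.T, 0.0)          # (N, N), non-negative and symmetric
```

Since both factors of the product are the same matrix, the resulting graph is symmetric by construction, so this formulation yields undirected dependency strengths.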

3.1.4. Spatio-Temporal Aggregation and Output

This module executes information propagation on the graph and decodes latent states into predictions. It utilizes the extracted time-frequency features and the inferred dynamic topology.
Spatio-Temporal Aggregation
To capture multi-hop spatial dependencies, this module adopts a diffusion convolution mechanism on the dynamic graph to guide information to flow along the dynamic path defined by the adjacency matrix $G_t$. The fused feature $H_f^{(t)}$ serves as the initial signal. The diffusion process is modeled as a $K$-step random walk:
$$S_t = \sum_{k=0}^{K} G_t^{k} \cdot H_f^{(t)} \cdot \Theta_{diff}^{(k)},$$
where $K$ denotes the diffusion order, $G_t^{k}$ represents the $k$-th order transition matrix, and $\Theta_{diff}^{(k)}$ is a learnable weight matrix.
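The diffusion sum can be evaluated without forming matrix powers explicitly, as the following NumPy sketch shows. The helper name `diffusion_conv`, the sizes, and the row normalization of the graph (used here so that it behaves like a transition matrix) are assumptions; the text does not state how $G_t$ is normalized.

```python
import numpy as np


def diffusion_conv(G_t, H, thetas):
    """S = sum_k G_t^k H Theta^(k), computed without forming matrix powers explicitly."""
    out = np.zeros((H.shape[0], thetas[0].shape[1]))
    propagated = H                          # G_t^0 H
    for k, theta in enumerate(thetas):
        if k > 0:
            propagated = G_t @ propagated   # G_t^k H from G_t^(k-1) H
        out += propagated @ theta
    return out


rng = np.random.default_rng(0)
N, D, D_out, K = 5, 4, 3, 2
A = np.abs(rng.normal(size=(N, N)))
G_t = A / A.sum(axis=1, keepdims=True)      # row-normalized (assumption)
H = rng.normal(size=(N, D))
thetas = [rng.normal(size=(D, D_out)) for _ in range(K + 1)]

S_t = diffusion_conv(G_t, H, thetas)        # (N, D_out)
```

Reusing the previously propagated signal keeps the work at $K$ matrix products per step instead of recomputing each power of the graph from scratch.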
While spatial diffusion aggregates external neighborhood information, the evolution of a node’s own state must maintain temporal continuity. Therefore, a second GRU is introduced to internalize spatial context into the temporal memory of the node:
$$H_{st}^{(t)} = \mathrm{GRU}_{dec}(S_t, H_{st}^{(t-1)}; \Theta_{dec}).$$
Cumulative Residual Decoding
To alleviate the vanishing gradient problem in deep neural networks and enhance the representational capacity of the multi-layer architecture, this module adopts a cumulative output strategy based on residual learning. Each block focuses on correcting prediction errors from the preceding stage rather than directly outputting the final result. For the $l$-th block, parallel linear projections generate the prediction component $\hat{Y}^{(l)}$ and the residual approximation $\Delta X^{(l)}$:
$$\hat{Y}_t^{(l)} = H_{st}^{(t,l)} W_o + b_o,$$
$$X_{res}^{(l+1)} = X_{res}^{(l)} - \left(H_{st}^{(t,l)} W_{res} + b_{res}\right).$$
The final prediction is the cumulative sum of outputs from all blocks: $\hat{Y} = \sum_{l=1}^{L} \hat{Y}^{(l)}$. This iterative refinement mechanism allows the model to decompose the complex forecasting task into multiple sub-problems, progressively approximating the true distribution manifold.

3.2. Bi-Directional Decoding RL Method for Path Planning

3.2.1. Graph-Aware Attention Encoder

Urban traffic path planning is fundamentally a problem of intersection node selection. Specifically, upon reaching an intersection, once a vehicle selects its next destination, the traversing trajectory between them is uniquely determined under the assumption of non-circular routing. Consequently, we model path planning as a node sequence generation task, where the generation of the next node is conditioned on the traffic environment and the sequence of previously visited nodes. This process shares a logical parallel with Large Language Models (LLMs), which generate tokens based on contextual history. Motivated by this paradigm, we propose utilizing a Transformer architecture for node feature encoding. However, urban road networks exhibit significant topological sparsity. A typical intersection node maintains direct connections with only a few adjacent intersections, lacking edge relationships with the vast majority of other nodes in the network. When a standard Transformer encoder is applied to such graphs, the fully connected self-attention mechanism still evaluates all $n^2$ node pairs, meaning that a large proportion of pairwise interactions correspond to physically unconnected nodes. This topology-agnostic design not only wastes attention capacity but also weakens the model's ability to learn local connectivity patterns that are critical for downstream path decisions. To address these challenges, we propose an adjacency-masked attention mechanism. By explicitly embedding topological constraints into the attention calculation process, this mechanism restricts each node to attend solely to its physically adjacent neighbors:
$$\alpha_{ij} = \frac{\exp(e_{ij}) \cdot \mathbf{1}[(i,j) \in \mathcal{E}]}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})},$$
where $\mathcal{N}(i)$ denotes the set of neighbor nodes for node $i$, $\mathbf{1}[\cdot]$ is an indicator function (evaluating to 1 if the condition holds, and 0 otherwise), and $e_{ij}$ denotes the attention score. This mechanism reduces the effective interaction space of each local attention layer from $O(n^2)$ to $O(|\mathcal{E}|)$, making the encoder structurally better aligned with sparse road networks while preserving the local topological information required for planning.
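The masked normalization can be written as a softmax restricted to each node's neighbor list, as in the following plain-Python sketch. The function name `masked_attention`, the neighbor-list representation, and the toy scores are assumptions for illustration; self-loops are not included because the text defines the mask over graph edges only.

```python
import math
from typing import Dict, List


def masked_attention(scores: List[List[float]],
                     neighbors: Dict[int, List[int]]) -> List[List[float]]:
    """Row-wise softmax restricted to graph neighbors; non-neighbors get weight 0."""
    n = len(scores)
    alpha = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = neighbors.get(i, [])
        if not nbrs:
            continue                                  # isolated node: row stays all zeros
        m = max(scores[i][k] for k in nbrs)           # subtract the max for numerical stability
        exps = {k: math.exp(scores[i][k] - m) for k in nbrs}
        total = sum(exps.values())
        for k in nbrs:
            alpha[i][k] = exps[k] / total
    return alpha


# Path graph 0 - 1 - 2 - 3 with arbitrary raw scores e_ij.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = [[0.5, 1.0, 2.0, -1.0],
          [0.0, 0.3, 0.0, 4.0],
          [1.0, 2.0, 0.1, 2.0],
          [3.0, 0.2, 0.7, 0.0]]
alpha = masked_attention(scores, neighbors)
```

Node 1 assigns weight 0.5 to each of its two neighbors and weight 0 to node 3, even though the raw score toward node 3 is the largest in its row. The loop only touches neighbor entries, which is where the reduction to a cost proportional to the number of edges comes from.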
Furthermore, in path planning tasks, the distance between nodes serves as a critical part of path cost. However, standard self-attention mechanisms derive attention weights solely based on the similarity of node features, thereby failing to explicitly model edge attributes. Relying on the model to infer edge distance information from data not only exacerbates the learning burden but also risks introducing biases into decision-making processes, which are inherently distance-sensitive in traffic path planning. To address this limitation, the proposed algorithm designs an edge feature injection module that explicitly incorporates edge distance information into the attention score calculation via a learnable bias term. Let $q_i \in \mathbb{R}^{d_k}$ be the query vector for node $i$ and $k_j \in \mathbb{R}^{d_k}$ be the key vector for node $j$, where $d_k$ represents the dimension of the key vectors. The enhanced attention score is defined as:
$$e_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d_k}} + \phi_\theta(d_{ij}),$$
where $d_{ij}$ denotes the normalized distance of edge $(i, j)$, $\phi_\theta: \mathbb{R} \to \mathbb{R}^{H}$ represents the parameterized edge encoder, and $H$ indicates the number of attention heads. This encoder is implemented using a two-layer feed-forward network:
$$\phi_\theta(d) = W_2 \cdot \mathrm{ReLU}(W_1 \cdot d + b_1) + b_2,$$
where $W_1 \in \mathbb{R}^{2H \times 1}$ and $W_2 \in \mathbb{R}^{H \times 2H}$ are learnable weight matrices, and $b_1, b_2$ are the corresponding bias vectors. Intuitively, this mechanism encodes geometric relationships: if two nodes are spatially distant (i.e., large $d_{ij}$), the model effectively suppresses the attention weight; conversely, for proximal nodes, the weight is enhanced. This eliminates the need for the model to learn geometric relationships through extensive iterative training, effectively turning distance information into prior knowledge. Simultaneously, the incorporation of the edge distance bias term significantly enhances the model's convergence capabilities. Specifically, in the absence of edge features, the model might stochastically select a neighbor distant from the destination during the initial exploration phase, resulting in high traversal costs. The edge distance bias functions as a warm start for the model. Even during the initialization phase (with randomized parameters), the presence of this bias ensures that the model exhibits an inherent tendency to choose physically proximal nodes.
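The edge-biased score can be sketched in NumPy with the stated weight shapes. The helper names `edge_bias` and `attention_score`, the zero-initialized biases, the random weights, and the per-head layout of the query and key (shape $H \times d_k$) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, d_k = 2, 4                                  # attention heads, key width

W1, b1 = rng.normal(size=(2 * H, 1)), np.zeros(2 * H)
W2, b2 = rng.normal(size=(H, 2 * H)), np.zeros(H)


def edge_bias(d_ij: float) -> np.ndarray:
    """phi_theta(d) = W2 ReLU(W1 d + b1) + b2, one bias per attention head."""
    hidden = np.maximum(W1[:, 0] * d_ij + b1, 0.0)   # (2H,)
    return W2 @ hidden + b2                          # (H,)


def attention_score(q_i: np.ndarray, k_j: np.ndarray, d_ij: float) -> np.ndarray:
    """e_ij = q_i^T k_j / sqrt(d_k) + phi_theta(d_ij), with q_i, k_j of shape (H, d_k)."""
    return (q_i * k_j).sum(axis=-1) / np.sqrt(q_i.shape[-1]) + edge_bias(d_ij)


q_i, k_j = rng.normal(size=(H, d_k)), rng.normal(size=(H, d_k))
e_ij = attention_score(q_i, k_j, d_ij=0.3)     # shape (H,)
```

Each head receives its own scalar bias for the same edge, so different heads are free to weight distance differently.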
Finally, while the adjacency masking mechanism effectively captures local topological structure, excessively constraining the attention scope may hinder information propagation between distant nodes. In large-scale path planning scenarios, the starting point and destination are often spatially distant. Relying exclusively on local information aggregation may prevent the model from learning global planning optimality (i.e., capturing long-range dependencies). To reconcile the trade-off between preserving local topological learning and facilitating global information transfer, we propose a hybrid encoding architecture. The encoder comprises $L$ attention layers, designed as follows: the initial $L - L_g$ layers employ adjacency masking to perform constrained local feature aggregation, while the subsequent $L_g$ layers remove the masking constraints to enable cross-hop global information interaction:
$$h^{(l)} = \begin{cases} \mathrm{GraphAttn}(h^{(l-1)}; M_{\mathrm{adj}}, D), & l \le L - L_g, \\ \mathrm{FullAttn}(h^{(l-1)}), & l > L - L_g. \end{cases}$$
The proposed architecture enables the lower-level attention layers to specialize in learning local intersection topological features, while the higher-level layers are responsible for integrating the macroscopic information essential for global path planning.

3.2.2. Heuristic-Guided Feature Embedding

In sparse road network topologies, Reinforcement Learning (RL)-based path exploration often suffers from inefficient search and struggles to locate the correct direction over extended periods. Drawing inspiration from the heuristic function design in the classic A* algorithm, we propose a heuristic feature embedding module. This module utilizes Dijkstra's algorithm to perform offline pre-calculation of the shortest distance $d(i, j)$ and hop count $h(i, j)$ between all node pairs, thereby providing "directional guidance" to the model. As a classic extension of Breadth-First Search (BFS) strategies, Dijkstra's algorithm is designed to solve single-source shortest path problems in graphs with non-negative edge weights. Based on a greedy strategy, it maintains a priority queue to iteratively select the unvisited node closest to the source and relaxes the distances of its neighbors, guaranteeing the discovery of globally optimal paths in a static graph. By injecting the precise global distances calculated by Dijkstra as prior knowledge into the model, our approach effectively alleviates the blindness associated with exploration in sparse graphs. Specifically, given a target node $t$, the heuristic feature for each node $i$ is defined as:
$$x_i^{\mathrm{heur}} = \left[ \frac{d(i, t)}{D_{\max}}, \frac{h(i, t)}{H_{\max}} \right],$$
where $D_{\max} = \max_{j \in \mathcal{V}} d(j, t)$ and $H_{\max} = \max_{j \in \mathcal{V}} h(j, t)$ are normalization constants representing the maximum shortest distance and the maximum hop count to the target over all nodes, respectively. This feature vector equips the policy network with twofold critical information: (1) the normalized shortest distance reflects the physical proximity of the current intersection to the target; and (2) the normalized hop count indicates the minimum number of steps required to reach the destination. This mechanism functions analogously to a GPS system, ensuring that the agent maintains awareness of the target direction throughout the exploration process. It is important to emphasize that these heuristic features provide directional guidance based solely on the static topology. They are designed to assist the agent in discovering feasible paths rather than determining the optimal path directly. The model still needs to learn how to optimize decisions based on dynamic traffic information, that is, selecting alternative paths with suboptimal topological distances but lower overall costs in high-traffic areas.
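The heuristic features for one target can be computed with two Dijkstra passes, one on edge lengths and one on unit weights for the hop count. The sketch below uses only the Python standard library; the adjacency-list format, the function names, the assumption of an undirected graph (so that searching outward from the target is valid), and the toy graph are illustrative choices rather than the authors' implementation.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[int, List[Tuple[int, float]]]   # node -> [(neighbor, edge length), ...]


def shortest_from(graph: Graph, target: int, unit_weights: bool = False) -> Dict[int, float]:
    """Dijkstra from `target`; with unit_weights=True the result is the hop count."""
    best = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue                          # stale queue entry
        for v, w in graph[u]:
            nd = d + (1.0 if unit_weights else w)
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return best


def heuristic_features(graph: Graph, target: int) -> Dict[int, Tuple[float, float]]:
    """x_i = (d(i,t)/D_max, h(i,t)/H_max) for every node i connected to the target."""
    dist = shortest_from(graph, target)
    hops = shortest_from(graph, target, unit_weights=True)
    d_max = max(dist.values()) or 1.0         # guard against a single-node graph
    h_max = max(hops.values()) or 1.0
    return {i: (dist[i] / d_max, hops[i] / h_max) for i in dist}


# Undirected toy graph: edges 0-1 (1.0), 1-2 (1.0), 0-2 (5.0), 2-3 (2.0).
graph = {0: [(1, 1.0), (2, 5.0)],
         1: [(0, 1.0), (2, 1.0)],
         2: [(1, 1.0), (0, 5.0), (3, 2.0)],
         3: [(2, 2.0)]}
feats = heuristic_features(graph, target=3)
```

In this toy graph node 1 gets the features (0.75, 1.0) and node 2 gets (0.5, 0.5). Node 0 is two hops from the target along the long edge but three hops along its shortest-distance route, so the two features carry different information.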

3.2.3. Bi-Directional Autoregressive Decoding

In large-scale sparse graph structures, the topology often contains critical edges that serve as the sole links connecting two distinct subgraph regions. During the exploration process, starting from the origin, if the agent fails to identify these critical edges, it may remain confined within the subgraph region preceding the bottleneck. Even if the critical edge is eventually discovered after repeated attempts, the accumulated traversal cost is often prohibitively high. To mitigate this issue, we propose a bi-directional decoding strategy. Specifically, we concurrently generate paths for the same planning task in two directions, from source to target ($s \to t$) and from target to source ($t \to s$), and ultimately select the path with the lower cost. Let $\mathbf{H}$ denote the node embedding matrix output by the encoder. The forward and backward decoding processes are formulated as:
$$\overrightarrow{\pi} = \text{Decode}_\theta(s \to t; \mathbf{H}), \qquad \overleftarrow{\pi} = \text{Decode}_\theta(t \to s; \mathbf{H}).$$
The final path is determined by minimizing the cost function over the two candidates:
$$\pi^{*} = \arg\min_{\pi \in \{\overrightarrow{\pi}, \, \text{Reverse}(\overleftarrow{\pi})\}} C(\pi),$$
where $\text{Reverse}(\cdot)$ denotes the operation of reversing the path sequence. The efficacy of bi-directional decoding stems from the symmetry of the path planning problem: the optimal path from $s$ to $t$ is theoretically the exact inverse of the optimal path from $t$ to $s$. However, due to the sequential and irreversible nature of autoregressive decoding, the decoding processes in the two directions may yield different trajectories. Crucially, a critical edge that is difficult to traverse during forward decoding may be approached from the opposite side during backward decoding, allowing it to be naturally incorporated into the path. Consequently, the proposed strategy provides an additional layer of robustness against topological bottlenecks.
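The candidate selection step can be sketched as follows. The `greedy_decode` function is only a hypothetical stand-in for the learned autoregressive decoder (it picks the cheapest unvisited neighbor at each step); the part that mirrors the text is `bidirectional_select`, which decodes in both directions, reverses the backward path, and keeps the cheaper candidate.

```python
def path_cost(path, edge_cost):
    """Sum of edge costs along a node sequence."""
    return sum(edge_cost[frozenset((u, v))] for u, v in zip(path, path[1:]))

def greedy_decode(adj, edge_cost, start, goal, max_steps=50):
    """Stand-in decoder: repeatedly take the cheapest unvisited neighbor."""
    path, visited = [start], {start}
    while path[-1] != goal and len(path) <= max_steps:
        options = [v for v in adj[path[-1]] if v not in visited]
        if not options:
            return None  # dead end: no feasible continuation
        nxt = min(options, key=lambda v: edge_cost[frozenset((path[-1], v))])
        path.append(nxt)
        visited.add(nxt)
    return path if path[-1] == goal else None

def bidirectional_select(adj, edge_cost, s, t):
    """Decode s->t and t->s, then return the cheaper path expressed as s->t."""
    forward = greedy_decode(adj, edge_cost, s, t)
    backward = greedy_decode(adj, edge_cost, t, s)
    candidates = []
    if forward is not None:
        candidates.append(forward)
    if backward is not None:
        candidates.append(backward[::-1])  # Reverse(.) so the path reads s -> t
    if not candidates:
        return None
    return min(candidates, key=lambda p: path_cost(p, edge_cost))

# Hypothetical toy graph in which the forward decoder is lured onto a costly edge.
toy_adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
toy_cost = {
    frozenset((0, 1)): 1.0, frozenset((0, 2)): 2.0,
    frozenset((1, 3)): 10.0, frozenset((2, 3)): 1.0,
}
best = bidirectional_select(toy_adj, toy_cost, s=0, t=3)
```

Here the forward pass commits to node 1 and pays for the expensive edge (1, 3), whereas the backward pass starts at the target and approaches through node 2, so the reversed backward path is selected.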
Regarding the specific implementation, the decoder calculates the probability distribution for the next node selection at each time step $t$ based on the current state. The state representation comprises three components: the global graph embedding $\mathbf{g}$, the current node embedding $\mathbf{h}_{\pi_t}$, and the target node embedding $\mathbf{h}_{\text{target}}$. To integrate these heterogeneous information sources, we design a hierarchical context aggregation mechanism. First, the embeddings of the current node and the target node are concatenated and projected via a context projection matrix $W_c \in \mathbb{R}^{d \times 2d}$ to form the step-wise context vector:
$$\mathbf{c}_t = W_c \cdot \text{Concat}\left(\mathbf{h}_{\pi_t}, \mathbf{h}_{\text{target}}\right).$$
Subsequently, this context is combined with the global graph embedding to generate the query vector:
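$$\mathbf{q}_t = W_g \cdot \mathbf{g} + W_s \cdot \mathbf{c}_t,$$
where $W_g \in \mathbb{R}^{d \times d}$ is the global projection matrix, $W_s \in \mathbb{R}^{d \times d}$ is the step-wise projection matrix, and $\mathbf{g} \in \mathbb{R}^{d}$ is the global graph embedding. The action probabilities are computed via an attention mechanism, where the attention scores are subject to dual constraints imposed by adjacent nodes and visitation history:
$$u_j = \begin{cases} \dfrac{\mathbf{q}_t^{\top} \mathbf{k}_j}{\sqrt{d_k}}, & j \in \mathcal{N}(\pi_t) \ \wedge \ j \notin \{\pi_1, \ldots, \pi_t\}, \\ -\infty, & \text{otherwise}. \end{cases}$$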
Finally, the action probability distribution is output using a tanh-clipped softmax function:
$$p(a_t = j \mid \pi_{1:t}) = \frac{\exp\left(\tanh(u_j) \cdot C\right)}{\sum_{k} \exp\left(\tanh(u_k) \cdot C\right)},$$
where C is a clipping coefficient used to regulate the smoothness of the probability distribution and mitigate gradient instability caused by excessive logit values.
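A minimal sketch of the masked, tanh-clipped softmax is given below. It takes already-scaled attention scores $u_j$ as input and drops masked nodes before normalizing, so that they receive exactly zero probability; this is one assumed way of realizing the $-\infty$ case. The clipping value `clip=10.0` and the function name are illustrative assumptions, not values reported in the paper.

```python
import math

def masked_clipped_softmax(scores, neighbors, visited, clip=10.0):
    """scores: {node: scaled attention score u_j}; returns {node: probability}.

    Only unvisited neighbors of the current node stay unmasked.
    """
    feasible = [j for j in scores if j in neighbors and j not in visited]
    if not feasible:
        return {j: 0.0 for j in scores}  # no valid action remains
    logits = {j: clip * math.tanh(scores[j]) for j in feasible}
    peak = max(logits.values())          # subtract the max for numerical stability
    weights = {j: math.exp(value - peak) for j, value in logits.items()}
    total = sum(weights.values())
    return {j: weights.get(j, 0.0) / total for j in scores}

# Hypothetical step: nodes 1, 2, 3 are adjacent to the current node, 3 was already visited.
probs = masked_clipped_softmax(
    scores={1: 0.0, 2: 0.0, 3: 5.0, 4: 9.0},
    neighbors={1, 2, 3},
    visited={3},
)
```

Nodes 3 and 4 have the largest raw scores but are masked (visited and non-adjacent, respectively), so the probability mass is split evenly between nodes 1 and 2.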

3.2.4. Training with REINFORCE

The proposed bi-directional decoding RL method is trained using the REINFORCE algorithm. Given that the forward and backward decoding processes share a unified set of encoder and decoder parameters, the training objective is formulated as the joint minimization of the expected costs for both directional paths:
$$J(\theta) = \mathbb{E}_{\overrightarrow{\pi} \sim p_\theta}\left[ C(\overrightarrow{\pi}) \right] + \mathbb{E}_{\overleftarrow{\pi} \sim p_\theta}\left[ C(\overleftarrow{\pi}) \right].$$
Furthermore, a critical challenge in policy gradient methods is the high variance associated with gradient estimation. To mitigate this, we adopt a Rollout Baseline strategy. Specifically, the proposed algorithm maintains a baseline model $\theta_{\text{BL}}$, whose parameters serve as a snapshot of the historically optimal policy. The baseline value is derived via greedy decoding using the baseline model on the same input instance, and is selected as the minimum cost between the bidirectional outcomes:
$$b_i = \min\left( C(\overrightarrow{\pi}_{i,\text{BL}}), \; C(\overleftarrow{\pi}_{i,\text{BL}}) \right).$$
At the conclusion of each training epoch, a paired t-test is conducted to evaluate whether the current policy significantly outperforms the baseline policy. If the average path cost of the current policy on the validation set is significantly lower than that of the baseline, the baseline model parameters are updated:
$$\theta_{\text{BL}} \leftarrow \theta, \quad \text{if } \bar{C}_{\text{current}} < \bar{C}_{\text{BL}} - \epsilon,$$
where ϵ is a significance threshold. This adaptive update mechanism ensures that the baseline model consistently represents the optimal performance observed throughout the training process, thereby enhancing training stability and accelerating convergence.
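The baseline logic can be sketched in plain Python, without an autograd framework. The sketch assumes each instance provides a forward and a backward cost together with the log-probabilities of the sampled paths, and it implements only the threshold rule of the update equation (the paired t-test is omitted). All function names and numbers are illustrative assumptions.

```python
def rollout_baseline(baseline_costs):
    """b_i = min(forward, backward) cost of the greedy baseline rollout."""
    return [min(fwd, bwd) for fwd, bwd in baseline_costs]

def surrogate_loss(sampled_costs, log_probs, baseline_costs):
    """Mean over instances of the sum over directions of (C - b_i) * log p.

    Differentiating this value with respect to the policy parameters, with
    the advantage (C - b_i) held constant, gives the REINFORCE gradient estimate.
    """
    b = rollout_baseline(baseline_costs)
    total = 0.0
    for (c_f, c_b), (lp_f, lp_b), b_i in zip(sampled_costs, log_probs, b):
        total += (c_f - b_i) * lp_f + (c_b - b_i) * lp_b
    return total / len(b)

def should_update_baseline(current_val_costs, baseline_val_costs, eps=0.0):
    """True when the mean validation cost improves on the baseline by more than eps."""
    mean_cur = sum(current_val_costs) / len(current_val_costs)
    mean_bl = sum(baseline_val_costs) / len(baseline_val_costs)
    return mean_cur < mean_bl - eps

# Hypothetical batch of two instances: (forward, backward) pairs.
baseline = [(5.0, 4.0), (3.0, 6.0)]
sampled = [(6.0, 5.0), (3.0, 4.0)]
logp = [(-1.0, -2.0), (-0.5, -1.5)]
loss = surrogate_loss(sampled, logp, baseline)
```

In an actual training loop, `loss` would be computed on tensors so that the gradient flows through the log-probabilities only.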
The variance-reduction effect of this baseline admits a simple explanation. The unbiased REINFORCE estimator $\nabla J(\theta) \approx C(\pi) \nabla \log p_\theta(\pi)$ has a variance that scales with $\mathbb{E}[C(\pi)^2]$, which is dominated by the absolute magnitude of the hybrid cost and therefore tends to be large on sparse graphs with high cost dispersion. Subtracting the rollout baseline $b$ replaces this quantity with $\mathbb{E}[(C(\pi) - b)^2]$. For a given instance, $b$ is a constant, so $\mathbb{E}[(C(\pi) - b)^2] = \mathrm{Var}(C(\pi)) + (\mathbb{E}[C(\pi)] - b)^2$, which is minimized when $b$ is close to the expected cost under the current policy. The greedy rollout of the baseline snapshot provides an inexpensive estimate of that expected cost. Empirically, we observed that this substitution reduces the measured policy-gradient variance by approximately an order of magnitude on the present planning topology and substantially shortens the convergence horizon. Section 4.9 complements this analysis with a controlled comparison against Proximal Policy Optimization (PPO) and Deep Q-Network (DQN) under the same graph environment.

4. Experiment

4.1. Datasets and Preprocessing

To evaluate the performance of the proposed framework in real-world scenarios, we conducted experiments on two widely used public traffic datasets, PEMSD4 and PEMSD8 (also known as PeMS04 and PeMS08). These datasets are collected by the Caltrans Performance Measurement System (PeMS) and represent real-time traffic flow data from the San Francisco Bay Area and the San Bernardino area, respectively. Their basic statistics are summarized in Table 2. Taking PEMSD4 as an example, this dataset contains traffic flow measurements collected by 307 loop detectors in the San Francisco Bay Area from 1 January 2018 to 28 February 2018, with a total of 16,992 time steps.
The prediction task is evaluated on both PEMSD4 and PEMSD8, whereas the path-planning simulation is conducted on the sparse planning graph derived from PEMSD4 and visualized in Figure 2. This graph contains 307 nodes and 340 edges, resulting in an average degree of approximately 2.21. Such a sparse topology strongly motivates the use of adjacency-masked attention. For this graph, a standard full-attention layer evaluates $307 \times 307 = 94{,}249$ pairwise interactions. In contrast, a local masked-attention layer retains only $2|E| + N = 2 \times 340 + 307 = 987$ valid interactions after accounting for adjacent node pairs and self-loops, eliminating 98.95% of redundant pairwise interactions per local layer. Under the default planning encoder used in this work, which contains two local masked layers and one global layer, the total number of pairwise interactions is reduced from $3 \times 94{,}249 = 282{,}747$ to $2 \times 987 + 94{,}249 = 96{,}223$, corresponding to an overall reduction of 65.97%. This quantifies the structural efficiency gain of the proposed sparsity-aware encoder.
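The interaction counts above follow from simple arithmetic, which the short sketch below reproduces. The function name and dictionary keys are illustrative; the node, edge, and layer counts are the ones stated in the text.

```python
def attention_interactions(num_nodes, num_edges, local_layers, global_layers):
    """Count pairwise attention interactions for dense and adjacency-masked encoders."""
    full = num_nodes * num_nodes            # dense N x N attention
    masked = 2 * num_edges + num_nodes      # both edge directions plus self-loops
    dense_total = (local_layers + global_layers) * full
    sparse_total = local_layers * masked + global_layers * full
    return {
        "full_per_layer": full,
        "masked_per_layer": masked,
        "per_layer_reduction": 1 - masked / full,
        "dense_total": dense_total,
        "sparse_total": sparse_total,
        "overall_reduction": 1 - sparse_total / dense_total,
    }

stats = attention_interactions(num_nodes=307, num_edges=340,
                               local_layers=2, global_layers=1)
```

Because the single global layer still attends over all node pairs, it dominates the remaining cost, which is why the overall reduction is smaller than the per-layer reduction.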
We split both datasets into training, validation, and testing sets in chronological order with a ratio of 6:2:2. A sliding-window strategy is used to generate input-output pairs. Both the historical input window size and the prediction horizon are set to 12, meaning that the model uses the previous hour of traffic observations to predict the next hour.
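The split and windowing procedure can be sketched as follows. The sketch assumes the series is a time-ordered Python list and that split boundaries are obtained by truncating to whole time steps; whether windows are generated before or after the split is not stated in the text, so the two helpers are kept independent. The function names are illustrative.

```python
def chronological_split(num_steps, ratios=(0.6, 0.2, 0.2)):
    """Return (train, val, test) index ranges in time order."""
    train_end = int(num_steps * ratios[0])
    val_end = train_end + int(num_steps * ratios[1])
    return range(0, train_end), range(train_end, val_end), range(val_end, num_steps)

def sliding_windows(series, history=12, horizon=12):
    """Yield (input, target) pairs of consecutive windows from a time-ordered list."""
    last_start = len(series) - history - horizon
    for start in range(last_start + 1):
        x = series[start:start + history]
        y = series[start + history:start + history + horizon]
        yield x, y

train_idx, val_idx, test_idx = chronological_split(16992)  # PEMSD4 length
pairs = list(sliding_windows(list(range(30))))             # toy series of 30 steps
```

A series of length $n$ yields $n - 12 - 12 + 1$ input-output pairs, so the 30-step toy series produces 7 pairs.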

4.2. Baselines and Experimental Settings

To evaluate the proposed predictor, we compared it against a diverse set of baselines, including statistical methods (HA and AR), conventional deep learning approaches (LSTNet), and representative spatio-temporal graph neural networks (AGCRN, ST-AE, SDGL, DDGCRN, and HSDGNN). Prediction accuracy is assessed using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), with missing values excluded from evaluation to ensure fair comparison.
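The three metrics with missing values excluded can be sketched as below. The sketch assumes that missing ground-truth readings are encoded as `0.0` or `NaN`, which is a common convention for these benchmarks but is not stated in the text; the function name is illustrative.

```python
import math

def masked_metrics(y_true, y_pred, missing_value=0.0):
    """MAE, RMSE and MAPE (in %) over entries whose ground truth is not missing."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred)
             if not math.isnan(t) and t != missing_value]
    if not pairs:
        raise ValueError("no valid observations to evaluate")
    n = len(pairs)
    mae = sum(abs(t - p) for t, p in pairs) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / n)
    mape = 100.0 * sum(abs((t - p) / t) for t, p in pairs) / n
    return mae, rmse, mape

# Hypothetical readings: the third ground-truth value is missing and is skipped.
mae, rmse, mape = masked_metrics([100.0, 200.0, 0.0, 50.0],
                                 [110.0, 180.0, 30.0, 50.0])
```

Excluding missing entries also avoids division by zero in the MAPE term.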
The major architectural and training settings used for both the prediction and path-planning modules are summarized in Table 3. All experiments were implemented in PyTorch 2.0.1 on a Linux workstation equipped with a single NVIDIA GeForce RTX 4090 GPU. For prediction, we followed the standard setting of a 12-step input horizon and a 12-step forecasting horizon, used a latent feature dimension of 64, processed 7 non-redundant RFFT bins under the 12-step input window, adopted a diffusion order of $K = 2$, used the Adam optimizer with an initial learning rate of 0.002, set the batch size to 64, and trained for up to 300 epochs with early stopping patience of 30 based on MAE. For path planning, the final configuration uses a 307-node sparse graph, embedding and hidden dimensions of 64, three encoder layers with one global layer and eight attention heads, heuristic features enabled, a rollout baseline, a learning rate of $1 \times 10^{-4}$, a batch size of 64, and 100 training epochs.
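The count of 7 non-redundant bins follows from the real-input FFT, which returns $\lfloor n/2 \rfloor + 1$ bins for a window of length $n$. The naive transform below is a minimal illustration of that count using only the standard library; it is not the FFT routine used in the experiments, and the function name is an assumption.

```python
import cmath
import math

def real_dft_bins(signal):
    """Naive O(n^2) real-input DFT returning the n // 2 + 1 non-redundant bins."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n // 2 + 1)
    ]

# Hypothetical 12-step window containing a pure oscillation with 3 cycles.
window = [math.cos(2 * math.pi * 3 * t / 12) for t in range(12)]
bins = real_dft_bins(window)
dominant = max(range(len(bins)), key=lambda k: abs(bins[k]))
```

For this window the energy concentrates in bin 3, which is the kind of periodic component the frequency-domain branch is meant to expose.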

4.3. Results on the Prediction Performance

Table 4 reports the comparative prediction results on PEMSD4 and PEMSD8. The proposed method achieves MAE/RMSE/MAPE values of 18.211/30.433/12.006 on PEMSD4 and 13.587/23.566/8.955 on PEMSD8. On both datasets, it outperforms the strongest baseline HSDGNN, which records 18.348/30.468/12.102 on PEMSD4 and 13.843/23.678/9.196 on PEMSD8.
These results support the effectiveness of the proposed dual-stream design. Traditional statistical baselines exhibit the largest errors because they cannot capture nonlinear spatio-temporal dependencies. Deep learning baselines substantially improve over statistical methods, but models relying primarily on time-domain recurrent or convolutional processing still struggle to preserve long-range periodic information. In contrast, the frequency-domain enhancement helps the proposed method capture global periodic patterns, while the adaptive gating mechanism prevents rigid fusion from amplifying high-frequency noise, leading to better robustness during traffic peaks and fluctuations.

4.4. Ablation Studies

To verify the individual contributions of the Frequency-domain Global Enhancement, the Adaptive Gated Fusion, and the Dynamic Topology generation, we conducted prediction-side ablation studies on PEMSD4, as summarized in Table 5. The variant ours_w/o_FD removes the frequency-domain branch and relies only on the time-domain GRU. The variant ours_w/o_AGF retains the frequency-domain branch but replaces adaptive gated fusion with static concatenation. The variant ours_w/o_DT replaces the dynamic topology generation module with a static adjacency built from the offline distance-based graph.
The ablation results show that simply introducing frequency-domain features is not sufficient by itself. Although frequency information improves the MAE, static fusion leads to a higher RMSE, indicating that rigid fusion can introduce noise during peak periods. Disabling the dynamic topology further degrades all three metrics (MAE 18.412, RMSE 30.782, MAPE 12.290), which indicates that a static prior cannot adequately represent the incident-driven spatial shifts that arise during peak cycles. The full model achieves the best MAE, RMSE, and MAPE simultaneously, confirming that frequency-domain enhancement, adaptive gated fusion, and dynamic topology generation contribute complementarily to prediction quality.

4.5. Cost Function Sensitivity Analysis

The hybrid planning cost introduced in Section 2 combines a normalized distance term and a normalized flow term through a scalar weight $\lambda$. To examine how sensitive the planner is to this design choice, we conduct a controlled sweep over $\lambda \in \{0, 0.25, 0.5, 1, 2, 4, \infty\}$ on the same 237-node connected subgraph, 24 source-target pairs, and 24-cycle dynamic flows used in the controlled classical-planner comparison. For each $\lambda$, the planner runs under identical encoder and decoder parameters, and the realized hybrid cost, congested-node ratio, and average path length are averaged across all pairs and cycles. $\lambda = 0$ corresponds to a distance-only objective (equivalent to geometric shortest path, ignoring predicted flow), while $\lambda \to \infty$ corresponds to flow-only routing.
Table 6 and Figure 3 show a clear U-shaped dependence of the realized cost on λ . A distance-only objective ignores predicted traffic and incurs the highest congestion exposure (11.42%); a flow-only objective over-reacts to short-horizon predictions, systematically detours through longer paths (12.47 hops on average), and also ends up with a higher total cost despite the lowest nominal congestion. The default λ = 1 is close to the realized-cost minimum and offers a favorable trade-off between congestion exposure and path length, which justifies the equal-weighting design adopted in this work. For scenarios that explicitly prefer congestion avoidance at the expense of route length (e.g., emergency dispatch), the table supplies guidance for moving λ toward 2 or 4.
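The role of $\lambda$ can be illustrated with a small sketch. It assumes the hybrid objective has the additive form "sum of normalized edge distances plus $\lambda$ times the sum of normalized node flows"; the exact definition is the one given in Section 2 and may differ in detail. The toy routes and all numeric values are hypothetical.

```python
import math

def hybrid_cost(path, norm_dist, norm_flow, lam=1.0):
    """Assumed form: sum of normalized edge distances + lam * sum of normalized node flows."""
    distance = sum(norm_dist[frozenset((u, v))] for u, v in zip(path, path[1:]))
    flow = sum(norm_flow[v] for v in path)
    if math.isinf(lam):
        return flow  # flow-only limit
    return distance + lam * flow

def best_route(routes, norm_dist, norm_flow, lam):
    """Pick the candidate route with the lowest hybrid cost for a given lam."""
    return min(routes, key=lambda p: hybrid_cost(p, norm_dist, norm_flow, lam))

# Hypothetical example: a short route through a congested node vs. a longer clear detour.
short_route, long_route = [0, 1, 3], [0, 2, 4, 3]
norm_dist = {
    frozenset((0, 1)): 0.2, frozenset((1, 3)): 0.2,
    frozenset((0, 2)): 0.3, frozenset((2, 4)): 0.3, frozenset((4, 3)): 0.3,
}
norm_flow = {0: 0.1, 1: 0.9, 2: 0.1, 3: 0.1, 4: 0.1}
choices = {lam: best_route([short_route, long_route], norm_dist, norm_flow, lam)
           for lam in (0.0, 0.25, 1.0, math.inf)}
```

In this toy case the short route wins for small $\lambda$ and the detour wins once the flow term carries enough weight, which mirrors the trade-off between path length and congestion exposure discussed above.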

4.6. Path Planning Simulation

Figure 4 illustrates the training progression of the proposed path-planning framework over 100 epochs. The plotted curves show the forward training cost, backward training cost, and validation cost.
Both decoding directions exhibit a rapid decrease during the first 20 epochs, indicating that the heuristic-guided feature embedding effectively alleviates the cold-start problem in sparse graphs. After approximately 60 epochs, the curves become markedly smoother and remain stable, suggesting that the combination of REINFORCE training and the rollout baseline provides stable optimization behavior.
Notably, a marginal performance discrepancy can be observed between the forward and backward directions. This discrepancy reflects the directional asymmetry of decoding on this topology: traversing in reverse from the target to the source may encounter different critical bottleneck edges than the forward path. Nevertheless, the synchronous decline of both curves indicates that our joint optimization objective (30) effectively coordinates the two decoding directions in searching for a good policy. Furthermore, the validation cost remains consistently lower than the training cost and converges steadily, indicating that the model generalizes well to unseen instances. These observations are reported for the fixed sparse-topology setting summarized in Table 3, which defines the scope of the present path-planning evidence.
To further validate the heuristic-guided feature embedding, we compared training behavior with and without heuristic features. Figure 5 shows that the model equipped with heuristic priors converges faster and remains consistently below the baseline without such guidance throughout training. This behavior indicates that normalized shortest-path distance and hop-count features provide effective directional cues, helping the policy avoid prolonged blind exploration in sparse graphs.
The same figure also shows that heuristic guidance improves final path quality. Because the agent receives coarse topological directionality from the start, it more quickly avoids obviously high-cost regions and reaches better solutions than the baseline that relies purely on trial-and-error exploration.

4.7. Controlled Comparison with Classical Planners

To complement the learning-based results above, we constructed a controlled decision-cycle simulation on the largest connected component of the PEMSD4-derived planning graph. This connected subgraph contains 237 nodes and 280 edges. We selected 24 source-target pairs with at least two competitive simple routes and generated time-varying node costs by shifting hotspot regions over 24 decision cycles. Dijkstra and A* re-plan from the current node using the instantaneous traffic snapshot, whereas the proposed framework uses a short-horizon predicted cost estimate consistent with the predict-then-plan paradigm. Because the current graph benchmark does not expose continuous vehicle kinematic states, hybrid A* is implemented as a discretized heading-regularized surrogate and is used only as a practical reference rather than as a full vehicle-dynamics benchmark.
To ensure a fair cross-planner comparison, we evaluate all methods by the same planner-agnostic metric, which we term the realized cost. Specifically, after each planner produces its executed node sequence $\pi = (\pi_1, \ldots, \pi_L)$ during the multi-cycle simulation, the realized cost is obtained by substituting into the hybrid objective $C(\pi)$ in Equation (2) the true simulated node-flow values $\tilde{f}(\pi_i)$ at the cycle at which each node is actually visited, rather than any planner-internal estimate. The normalized edge distances $\tilde{d}$ follow the same min–max normalization defined in Section 2, and the weight $\lambda$ is fixed at its default value. In this way, Dijkstra and A* are scored against exactly the same ground-truth traffic realization that the proposed framework attempts to anticipate, and the realized cost faithfully reflects the cost actually incurred along the executed route rather than the cost the planner believed it would incur at decision time.
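The difference between the realized cost and a planner-internal estimate can be sketched as follows. The sketch assumes an additive hybrid objective (normalized distances plus $\lambda$ times normalized flows) and a hypothetical data layout in which `true_flow[cycle][node]` holds the simulated flow at each decision cycle; neither is specified in this form in the text.

```python
def realized_cost(visits, norm_dist, true_flow, lam=1.0):
    """Score an executed route with the flow observed when each node is visited.

    visits: list of (node, cycle) in execution order.
    true_flow[cycle][node]: simulated normalized flow at that cycle.
    """
    nodes = [node for node, _ in visits]
    distance = sum(norm_dist[frozenset((u, v))] for u, v in zip(nodes, nodes[1:]))
    flow = sum(true_flow[cycle][node] for node, cycle in visits)
    return distance + lam * flow

# Hypothetical three-cycle example in which congestion builds up along the route.
norm_dist = {frozenset((0, 1)): 0.2, frozenset((1, 2)): 0.3}
true_flow = [
    {0: 0.1, 1: 0.2, 2: 0.3},  # cycle 0
    {0: 0.1, 1: 0.8, 2: 0.3},  # cycle 1
    {0: 0.1, 1: 0.2, 2: 0.5},  # cycle 2
]
executed = [(0, 0), (1, 1), (2, 2)]          # node visited at its actual cycle
snapshot = [(0, 0), (1, 0), (2, 0)]          # same route scored with the cycle-0 snapshot
realized = realized_cost(executed, norm_dist, true_flow)
believed = realized_cost(snapshot, norm_dist, true_flow)
```

The snapshot-based value underestimates the cost actually incurred here, which is the gap the realized-cost metric is designed to expose.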
Table 7 summarizes the controlled comparison, including realized cost, congestion exposure, expanded states, and arrival rate for each planner. The results show that the proposed framework achieves the lowest realized cost and the lowest congestion exposure among the compared methods. Relative to A*, the mean realized cost decreases from 9.560 to 9.431, while the congested-node ratio decreases from 8.18% to 7.43%, corresponding to a relative reduction of approximately 9.2% in congestion exposure. In contrast, A* mainly improves search efficiency over Dijkstra, reducing the average number of expanded states per cycle from 39.40 to 30.93, whereas the hybrid A* surrogate preserves a similar cost profile but requires substantially more search effort under the present graph abstraction.
Figure 6 provides a cycle-level view of the same simulation. The proposed framework remains consistently below the classical baselines after the early decision cycles, indicating that incorporating predicted future traffic information helps the planner avoid route commitments that later become congested. This result should be interpreted as a controlled graph-based illustration of practical decision cycles rather than as a claim that the present graph benchmark fully subsumes continuous-state vehicle planning.

4.8. Planning Module Ablation

To assess the individual contribution of each core planning component, we conducted an ablation under the same 237-node connected subgraph, 24 source-target pairs, and 24-cycle dynamic flows used above. The variants are: (i) w/o masking, which removes the adjacency-masked attention and restores full self-attention over all node pairs; (ii) w/o heuristic, which disables the Dijkstra-derived heuristic feature embedding; (iii) w/o bi-directional, which uses only forward decoding; and (iv) w/o edge-distance bias, which removes the learnable edge-distance bias from the graph-aware attention encoder. All other architectural and training settings follow Table 3. Table 8 summarizes the resulting realized cost, congested-node ratio, and convergence epoch for the full model and each ablated variant.
The results indicate four complementary effects. First, disabling the heuristic embedding incurs the largest degradation: the realized cost rises to 9.711 and the training curve does not reach 95% of the full-model cost within 150 epochs, confirming that the normalized shortest-path distance and hop-count features provide an indispensable cold-start signal on sparse topologies. Second, removing adjacency masking raises both the realized cost (9.582) and the congestion exposure (7.89%), because the fully connected attention allocates capacity to topology-inconsistent node pairs and slows down the effective learning of neighbor-aware policies. Third, restricting the decoder to a single direction produces a moderate but consistent degradation (9.528), in line with the bi-directional decoding analysis that an additional decoding path offers a second route of access to bottleneck edges. Fourth, removing the edge-distance bias $\phi_\theta(d_{ij})$ in the graph-aware attention encoder delays convergence from epoch 60 to epoch 78 and raises the realized cost from 9.431 to 9.495, which quantifies the effect of the edge-distance bias on convergence speed: the learnable bias acts as a geometric warm start so that, even under random initialization, the attention scores prefer physically proximal neighbors and the policy avoids wasteful long-range exploration in the early epochs. No ablated variant dominates the full model on any metric, confirming that the four planning components are non-redundant.

4.9. RL Optimizer Comparison

To further justify the choice of REINFORCE with a rollout baseline, we compared it against two mainstream RL optimizers on the same planning task: Proximal Policy Optimization (PPO) with a clipping ratio of 0.2, and a Double-Dueling DQN tailored for the discrete neighbor-selection action space. For each optimizer, we ran five independent training trajectories (100 epochs each) with identical encoder/decoder architectures and recorded the policy-gradient variance, the number of epochs required to reach 95% of the best observed cost, and the final realized cost.
Table 9 and Figure 7 support three observations. First, the rollout baseline reduces the variance of the policy gradient by roughly an order of magnitude relative to plain REINFORCE, which translates into faster convergence and more reliable attainment of the target cost across seeds. Second, PPO achieves the lowest measured variance, yet its clipped updates become somewhat conservative on the present large discrete action space, leading to a slightly slower convergence horizon and a marginally higher final cost. Third, DQN, despite its strong track record on dense-reward discrete-action benchmarks, struggles here: the combinatorial neighbor-selection action space induces Q-value bootstrapping variance, and sparse rewards make exploration unstable, yielding the highest final cost and only two of five seeds converging within the budget. Overall, the rollout baseline offers the most favorable variance-convergence trade-off for this sparse-graph planning task.

4.10. Decoder Strategy Comparison

To further assess whether the bi-directional decoding strategy yields near-optimal solutions relative to a stronger single-direction search, we compared it against a greedy autoregressive decoder (beam width $B = 1$) and a beam-search decoder with widths $B \in \{4, 10, 20, 50\}$. All variants share the same trained encoder, decoder, REINFORCE-trained policy, and hybrid cost $C(\pi)$ with $\lambda = 1$. Beam search keeps the top-$B$ partial paths at every step, scored by the accumulated log-probability under the masked softmax policy, and returns the finished path with the minimum realized hybrid cost. All decoders run on the same 237-node connected subgraph, 24 source-target pairs, and 24 decision cycles used throughout Section 4.8. For each decoder we record: (i) the realized hybrid cost under the true simulated flow; (ii) the fraction of actually visited congested nodes; (iii) the decoding wall-clock time normalized to greedy; and (iv) a bottleneck-edge coverage rate, defined as the fraction of graph-theoretic bridge edges of the planning subgraph that are included in at least one executed route across the 576 planning events.
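The beam-search procedure described above can be sketched in a few lines. The callable `log_prob(path, nxt)` is a hypothetical stand-in for the masked softmax policy, and `cost` stands in for the hybrid cost; the toy policy and graph are illustrative assumptions.

```python
import math

def beam_search(adj, log_prob, cost, source, target, beam_width=4, max_len=20):
    """Keep the top-B partial paths by accumulated log-probability.

    Returns the finished path with the lowest cost, or None if none finishes.
    """
    beams = [(0.0, [source])]
    finished = []
    for _ in range(max_len):
        expansions = []
        for score, path in beams:
            for nxt in adj[path[-1]]:
                if nxt in path:
                    continue  # visited nodes are masked
                candidate = (score + log_prob(path, nxt), path + [nxt])
                if nxt == target:
                    finished.append(candidate)
                else:
                    expansions.append(candidate)
        if not expansions:
            break
        expansions.sort(key=lambda c: c[0], reverse=True)
        beams = expansions[:beam_width]
    if not finished:
        return None
    return min((path for _, path in finished), key=cost)

# Hypothetical policy that strongly prefers node 1 from the source.
toy_adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
toy_probs = {(0, 1): 0.9, (0, 2): 0.1, (1, 3): 1.0, (2, 3): 1.0}
edge_cost = {frozenset((0, 1)): 1.0, frozenset((0, 2)): 2.0,
             frozenset((1, 3)): 10.0, frozenset((2, 3)): 1.0}

def toy_log_prob(path, nxt):
    return math.log(toy_probs[(path[-1], nxt)])

def toy_cost(path):
    return sum(edge_cost[frozenset((u, v))] for u, v in zip(path, path[1:]))

greedy_path = beam_search(toy_adj, toy_log_prob, toy_cost, 0, 3, beam_width=1)
wide_path = beam_search(toy_adj, toy_log_prob, toy_cost, 0, 3, beam_width=2)
```

With $B = 1$ the search follows the most probable prefix and ends on the costly route, whereas $B = 2$ retains the less probable prefix long enough to find the cheaper one.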
Table 10 and Figure 8 support two conclusions. First, the realized cost decreases concavely with $\log B$ and starts to plateau above $B = 20$: increasing beam width from 20 to 50 lowers the cost by only 0.007 ($9.445 \to 9.438$) but multiplies decoding time by $2.5\times$. Beam search, which widens a single-direction search, therefore exhibits rapidly diminishing returns. Second, the bi-directional decoder reaches 9.431 at only $2.0\times$ the greedy decoding time, which is lower in realized cost than the beam search at $B = 50$ and about $24\times$ cheaper to compute. The bottleneck-coverage column explains the remaining gap on this sparse topology: all beams in a forward-only search descend the same autoregressive prefix tree and share an inherent directional bias toward bridges close to the source, while bi-directional decoding exposes bridges from both endpoints and lifts coverage from 62.1% to 91.2%. This means the residual advantage of the proposed strategy over a wide beam search is topological rather than a matter of search width, which is consistent with the role of bi-directional decoding described in Section 3.

4.11. Action Entropy at Bottleneck Edges

To understand how the learned policy behaves when the agent encounters highly congested bottleneck edges, we analyzed the Shannon entropy of the one-step action probability distribution $p(a_t = j \mid \pi_{1:t})$ defined by the tanh-clipped softmax in Equation (12). On the same controlled simulation, we classify every decision step along the 576 executed routes into four groups along two axes: (i) whether the current node is incident to a bridge edge of the planning subgraph (graph-theoretic cut edges identified by Tarjan’s algorithm), and (ii) whether the predicted congestion at the current node $\tilde{f}(\pi_t)$ falls in the top decile (“high”) or below (“low”). The resulting four groups cover approximately 3137 decision steps after removing terminal and trivially masked states. Because only non-visited physically adjacent neighbors remain unmasked, the maximum achievable entropy at a decision is $\ln k$, where $k$ is the number of unvisited neighbors, which is typically 2 or 3 on this sparse topology.
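The entropy measure and its upper bound can be sketched as follows. The sketch assumes that the number of unmasked actions $k$ equals the number of entries with non-zero probability; the function names and example distributions are illustrative.

```python
import math

def action_entropy(probs):
    """Shannon entropy in nats of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_entropy(probs):
    """Entropy as a fraction of ln k, where k is the number of unmasked actions."""
    k = sum(1 for p in probs if p > 0.0)
    if k <= 1:
        return 0.0  # a single feasible action is deterministic
    return action_entropy(probs) / math.log(k)

committed = action_entropy([0.9, 0.1])       # confident choice between two neighbors
hedged = normalized_entropy([1 / 3] * 3)     # uniform over three detour candidates
forced = normalized_entropy([1.0, 0.0])      # only one unmasked neighbor remains
```

A uniform distribution over the remaining neighbors attains the bound $\ln k$, while a step with a single feasible neighbor has zero entropy regardless of the policy.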
Table 11 and Figure 9 reveal a structured, topology-aware response. At non-bottleneck nodes, the policy is confidently committed (mean entropy 0.462 nats, about 33% of $\ln k$), and high congestion adds only a mild increment ($0.462 \to 0.587$). At bottleneck nodes under low congestion, the entropy actually drops to 0.193 nats because only one viable non-visited neighbor typically remains, making the next action nearly deterministic. Crucially, when the agent meets a bottleneck edge whose incident node is also highly congested, the entropy rises sharply to 0.914 nats, approximately $4.74\times$ the bottleneck-low-congestion level, and reaches roughly 83% of the theoretical upper bound $\ln k$. In other words, the policy refuses to over-commit at exactly the decisions that are simultaneously topologically critical and operationally costly, and instead distributes probability nearly uniformly across the few remaining detour candidates, preserving exploratory margin precisely where a premature commitment would be most damaging. This behavior is consistent with the bi-directional decoding design and indicates that the rollout-trained policy has internalized a congestion-aware hedging strategy at the most consequential decision points of the sparse graph.

4.12. Limitations and Practical Scope

Two limitations deserve explicit discussion. First, the prediction module relies on traffic sensors whose observations may be noisy, missing, or delayed. Such perturbations can affect both the spectral branch and the dynamically inferred graph, which in turn may degrade the planning signal passed to the route optimizer. Recent progress on robust learning under noisy observations, exemplified by the progressive sample selection framework with contrastive loss designed for noisy labels [25], suggests practical directions for addressing sensor-level uncertainty in future deployments. Second, the planning module is trained with policy gradients; although the rollout baseline stabilizes optimization, RL performance can still vary under sparse rewards, topology shifts, or different random seeds. For these reasons, the current path-planning results should be interpreted as a fixed-topology proof of concept rather than as a complete deployment study. Robust sensor denoising, uncertainty-aware forecasting, and broader multi-seed RL evaluation remain important directions for future work.

5. Conclusions

This paper presents an integrated framework coupling a Spectral-Temporal Graph Neural Network with Bi-directional Decoding Reinforcement Learning to address spatiotemporal dependency modeling and sparse graph navigation challenges in Intelligent Transportation Systems. For traffic prediction, the proposed Time-Frequency Dual-stream Adaptive Learning module captures global periodicities and local dynamics through parallel FFT and GRU branches, while the adaptive gating mechanism harmonizes the two views. Quantitatively, the model achieves MAE/RMSE/MAPE values of 18.211/30.433/12.006 on PEMSD4 and 13.587/23.566/8.955 on PEMSD8, outperforming strong graph-based baselines on both datasets.
For path planning on the PEMSD4-derived sparse topology, the Bi-directional Decoding RL method demonstrates rapid cost reduction within the first 20 epochs and stable convergence after approximately 60 epochs. Under the default three-layer planning encoder, adjacency masking reduces the total number of pairwise attention interactions from 282,747 to 96,223, corresponding to a 65.97% reduction in redundant interactions. In addition, heuristic feature embedding further accelerates convergence and improves path quality by providing directional guidance in sparse graphs. A controlled decision-cycle simulation on the largest connected planning subgraph further shows that the proposed framework reduces the realized cost from 9.560 to 9.431 relative to A* and lowers congestion exposure from 8.18% to 7.43%. The planning-side ablation additionally confirms that removing the learnable edge-distance bias slows convergence from 60 to 78 epochs, the decoder strategy comparison shows that the bi-directional decoder reaches a lower realized cost than beam search with $B = 50$ while using about $24\times$ less decoding time, and the entropy analysis indicates that the rollout-trained policy preserves exploratory margin at highly congested bottleneck nodes, where the action-distribution entropy rises by a factor of approximately $4.74\times$ relative to uncongested bottlenecks. Future research will focus on extending the framework to dynamic graph scenarios with real-time incident handling, large-scale heterogeneous transportation networks, robust sensor-noise mitigation, and more comprehensive multi-seed reproducibility studies, together with platform-specific efficiency benchmarking.

Author Contributions

Conceptualization, X.D., K.F. and J.T.; Methodology, X.D., K.F. and J.T.; Software, X.D. and K.F.; Validation, X.D. and K.F.; Formal analysis, J.T. and C.Y.; Investigation, X.D., K.F. and J.T.; Resources, P.Z., J.L., J.F. and C.Y.; Writing—original draft, X.D. and K.F.; Writing—review and editing, P.Z., J.L., J.F., X.D., K.F., J.T. and C.Y.; Visualization, X.D. and K.F.; Supervision, P.Z., J.L., J.F. and C.Y.; Project administration, P.Z., J.L., J.F. and J.T.; Funding acquisition, P.Z., J.L., J.F. and J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Key Technologies for Security Defense and Intelligent Operation and Maintenance of Self-healing Distribution Communication Network 036000KC23090008 (GDKJXM20231043) of Power Dispatching and Control Center of Guangdong Power Grid Co., Ltd.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

Authors Peiming Zhang, Jiangang Lu and Jiajia Fu are employed by the company Power Dispatching and Control Center of Guangdong Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Nomenclature

The following symbols are used in this manuscript:

| Symbol | Definition |
| --- | --- |
| X t / X t | Traffic observation matrix at time step t / historical observation window of length T |
| G t | Time-varying node dependency matrix inferred from the traffic observations |
| A attr | Learned attribute affinity matrix used in intra-variable dependency modeling |
| H t d , H f d , H f | Time-domain feature, frequency-domain feature, and adaptively fused representation |
| G t , S t , H s t | Dynamic graph for diffusion, diffusion-aggregated signal, and decoder hidden state |
| s, t, π | Source node, target node, and planned node sequence |
| d(i, j) / f(i) | Edge travel distance and node-level traffic cost in the hybrid planning objective |
| π / π | Forward and backward decoded path candidates used by the bi-directional planner |

References

  1. Hagedorn, S.; Hallgarten, M.; Stoll, M.; Condurache, A.P. The integration of prediction and planning in deep learning automated driving systems: A review. IEEE Trans. Intell. Veh. 2024, 10, 3626–3643.
  2. Manas, K.; Paschke, A. Knowledge integration strategies in autonomous vehicle prediction and planning: A comprehensive survey. arXiv 2025, arXiv:2502.10477.
  3. Alexander, A.; Venkatesan, K.; Mounsef, J.; Ramanujam, K. A Comprehensive Survey of Path Planning Algorithms for Autonomous Systems and Mobile Robots: Traditional and Modern Approaches. IEEE Access 2025, 13, 176287–176326.
  4. Katrakazas, C.; Quddus, M.; Chen, W.H.; Deka, L. Real-time motion planning methods for autonomous on-road driving: State-of-the-art and future research directions. Transp. Res. Part C Emerg. Technol. 2015, 60, 416–442.
  5. Leon, F.; Gavrilescu, M. A review of tracking and trajectory prediction methods for autonomous driving. Mathematics 2021, 9, 660.
  6. Qi, P.; Pan, C.; Xu, X.; Wang, J.; Liang, J.; Zhou, W. A review of dynamic traffic flow prediction methods for global energy-efficient route planning. Sensors 2025, 25, 5560.
  7. Li, Y.; Yu, D.; Liu, Z.; Zhang, M.; Gong, X.; Zhao, L. Graph neural network for spatiotemporal data: Methods and applications. arXiv 2023, arXiv:2306.00012.
  8. Jiang, W.; Luo, J.; He, M.; Gu, W. Graph neural network for traffic forecasting: The research progress. ISPRS Int. J. Geo-Inf. 2023, 12, 100.
  9. Bui, K.H.N.; Cho, J.; Yi, H. Spatial-temporal graph neural network for traffic forecasting: An overview and open research issues. Appl. Intell. 2022, 52, 2763–2774.
  10. Delamou, M.; Bazzi, A.; Chafii, M.; Amhoud, E.M. Deep Learning-Based Estimation for Multitarget Radar Detection. In Proceedings of the 2023 IEEE 97th Vehicular Technology Conference, Florence, Italy, 20–23 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
  11. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive graph convolutional recurrent network for traffic forecasting. Adv. Neural Inf. Process. Syst. 2020, 33, 17804–17815.
  12. Shao, Z.; Zhang, Z.; Wei, W.; Wang, F.; Xu, Y.; Cao, X.; Jensen, C.S. Decoupled dynamic spatial-temporal graph neural network for traffic forecasting. arXiv 2022, arXiv:2206.09112.
  13. Nguyen, N.; Quanz, B. Temporal latent auto-encoder: A method for probabilistic multivariate time series forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 9117–9125.
  14. Ghil, M.; Allen, M.R.; Dettinger, M.D.; Ide, K.; Kondrashov, D.; Mann, M.E.; Robertson, A.W.; Saunders, A.; Tian, Y.; Varadi, F.; et al. Advanced spectral methods for climatic time series. Rev. Geophys. 2002, 40, 3-1–3-41.
  15. Ye, F.; Zhang, S.; Wang, P.; Chan, C.Y. A survey of deep reinforcement learning algorithms for motion planning and control of autonomous vehicles. In Proceedings of the 2021 IEEE Intelligent Vehicles Symposium (IV), Nagoya, Japan, 11–17 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1073–1080.
  16. Aradi, S. Survey of deep reinforcement learning for motion planning of autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2020, 23, 740–759.
  17. Singh, R.; Ren, J.; Lin, X. A review of deep reinforcement learning algorithms for mobile robot path planning. Vehicles 2023, 5, 1423–1451.
  18. Kool, W.; van Hoof, H.; Gromicho, J.; Welling, M. Deep policy dynamic programming for vehicle routing problems. In Proceedings of the International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, Los Angeles, CA, USA, 20–23 June 2022; Springer: Cham, Switzerland, 2022; pp. 190–213.
  19. Luo, F.; Lin, X.; Wu, Y.; Wang, Z.; Xialiang, T.; Yuan, M.; Zhang, Q. Boosting neural combinatorial optimization for large-scale vehicle routing problems. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  20. Delarue, A.; Anderson, R.; Tjandraatmadja, C. Reinforcement learning with combinatorial actions: An application to vehicle routing. Adv. Neural Inf. Process. Syst. 2020, 33, 609–620.
  21. Cheng, C.A.; Kolobov, A.; Swaminathan, A. Heuristic-guided reinforcement learning. Adv. Neural Inf. Process. Syst. 2021, 34, 13550–13563.
  22. Manchanda, S.; Mittal, A.; Dhawan, A.; Medya, S.; Ranu, S.; Singh, A. Learning heuristics over large graphs via deep reinforcement learning. arXiv 2019, arXiv:1903.03332.
  23. Waltz, M.; Fu, K.S. A heuristic approach to reinforcement learning control systems. IEEE Trans. Autom. Control 1965, 10, 390–398.
  24. Choong, S.S.; Wong, L.P.; Lim, C.P. Automatic design of hyper-heuristic based on reinforcement learning. Inf. Sci. 2018, 436, 89–107.
  25. Zhang, Q.; Zhu, Y.; Cordeiro, F.R.; Chen, Q. PSSCL: A progressive sample selection framework with contrastive loss designed for noisy labels. Pattern Recognit. 2025, 161, 111284.
Figure 1. Schematic summary of the prediction-side and planning-side literature gaps addressed by the proposed framework.
Figure 2. The visualization of the network topology for the PeMS04 dataset.
Figure 3. Sensitivity of the realized hybrid cost (a) and the congestion/path-length trade-off (b) to the flow-term weight λ.
Figure 4. Policy loss and value loss curves during the training phase.
Figure 5. Impact of Heuristic Features on Model Training Performance.
Figure 6. Average realized hybrid cost across decision cycles in the controlled classical-planner comparison.
Figure 7. Comparison of (a) policy-gradient variance and (b) validation realized cost across training epochs for four reinforcement-learning optimizers.
Figure 8. Decoder strategy comparison: (a) realized cost as a function of beam width; (b) quality-cost trade-off on a log-scaled decoding-time axis. The light red shaded band in (a) indicates the cost interval around the bi-directional decoder reference line. The bi-directional decoder reaches a lower realized cost than beam search with B = 50 while using only 2× the decoding time of greedy decoding.
Figure 9. (a) Distribution of action-probability entropy across the four decision contexts. (b) Action entropy versus normalized predicted congestion at bottleneck nodes; the red curve is the bin-mean trend and confirms the upward slope in the top congestion decile.
Table 1. Positioning of the proposed framework relative to representative joint prediction-planning approaches.
| Framework Class | Module Coupling | Spatio-Temporal Modeling | Planner/Exploration |
| --- | --- | --- | --- |
| Cascaded prediction + classical planner | Fully decoupled | Time-domain GNN | Dijkstra/A*, no learning |
| Joint multi-task pred. + RL planner | Shared encoder (hard share) | Time-domain GNN or GRU | Full attention, ε-greedy |
| End-to-end DRL over predicted states | Tight/implicit coupling | Time-domain latent traffic | Full attention, random rollout |
| Proposed | Predicted-cost injection bridge | Spectral-temporal dual-stream + dynamic graph | Adjacency-masked + heuristic + bi-directional |
Table 2. The details of the datasets used in the experiment.
| Dataset | Length | Node | Sample Rate | Attribute |
| --- | --- | --- | --- | --- |
| PEMSD4 | 16,992 | 307 | 5 min | Flow, Occupancy, Speed |
| PEMSD8 | 17,856 | 170 | 5 min | Flow, Occupancy, Speed |
Table 3. Centralized summary of major architectural and training settings.
| Module | Aspect | Setting | Value |
| --- | --- | --- | --- |
| Prediction | Data split | Training/validation/testing ratio | 6:2:2 |
| Prediction | Temporal setting | Look-back window/prediction horizon | 12/12 |
| Prediction | Representation size | Input attributes/latent dimension D | 3/64 |
| Prediction | Dual-stream encoder | GRU hidden size/non-redundant RFFT bins | 64/7 |
| Prediction | Dynamic graph module | Node-time embedding size/diffusion order K | 64/2 |
| Prediction | Optimization | Optimizer/loss | Adam/MAE |
| Prediction | Training | Learning rate/batch size/max epochs | 0.002/64/300 |
| Prediction | Early stopping | Patience | 30 |
| Path planning | Graph topology | Nodes/edges | 307/340 |
| Path planning | Encoder size | Embedding dimension/hidden dimension | 64/64 |
| Path planning | Attention architecture | Encoder layers/global layers/heads | 3/1/8 |
| Path planning | Graph-aware priors | Edge-distance bias/heuristic features | enabled/enabled |
| Path planning | Policy optimization | Objective/baseline | REINFORCE/rollout |
| Path planning | Training | Learning rate/batch size/epochs | 1 × 10⁻⁴/64/100 |
| Path planning | Model size | Trainable parameters | 212,656 |
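
The "7 non-redundant RFFT bins" entry in Table 3 follows from the 12-step look-back window: the DFT of a real-valued sequence of length T is conjugate-symmetric, so only T/2 + 1 bins carry independent information. The standard-library sketch below illustrates this; a naive DFT stands in for an FFT routine, the input window is synthetic, and the function names are illustrative rather than taken from the article's code.

```python
import cmath
import math

def dft(signal):
    """Naive O(T^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def rfft_bins(window_length):
    """Number of non-redundant frequency bins for a real-valued signal."""
    return window_length // 2 + 1

T = 12  # look-back window from Table 3
# Synthetic real-valued window, used only to demonstrate the symmetry.
window = [
    math.sin(2 * math.pi * t / T) + 0.5 * math.cos(4 * math.pi * t / T)
    for t in range(T)
]
spectrum = dft(window)

# Bins k and T - k are complex conjugates, so bins 7..11 mirror bins 5..1.
symmetric = all(
    abs(spectrum[k] - spectrum[T - k].conjugate()) < 1e-9 for k in range(1, T)
)

print(rfft_bins(T))  # 7
print(symmetric)     # True
```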
Table 4. Comparisons of the performance.
| Methods | PEMSD4 MAE | PEMSD4 RMSE | PEMSD4 MAPE | PEMSD8 MAE | PEMSD8 RMSE | PEMSD8 MAPE |
| --- | --- | --- | --- | --- | --- | --- |
| HA | 35.173 | 52.346 | 25.453 | 29.189 | 43.540 | 18.448 |
| AR | 27.827 | 43.930 | 18.973 | 22.385 | 35.456 | 14.662 |
| LSTNet | 19.932 ± 0.123 | 31.564 ± 0.132 | 14.016 ± 0.143 | 16.601 ± 0.238 | 25.817 ± 0.289 | 10.729 ± 0.285 |
| AGCRN | 19.316 ± 0.054 | 31.561 ± 0.170 | 12.871 ± 0.070 | 15.837 ± 0.165 | 25.219 ± 0.228 | 10.316 ± 0.135 |
| ST-AE | 19.908 ± 0.083 | 31.257 ± 0.093 | 13.896 ± 0.327 | 15.960 ± 0.165 | 24.877 ± 0.214 | 10.220 ± 0.175 |
| SDGL | 18.625 ± 0.063 | 31.069 ± 0.236 | 12.400 ± 0.088 | 14.944 ± 0.059 | 24.166 ± 0.110 | 9.597 ± 0.085 |
| DDGCRN | 18.460 ± 0.093 | 30.864 ± 0.324 | 12.290 ± 0.134 | 14.382 ± 0.064 | 23.793 ± 0.166 | 9.446 ± 0.088 |
| HSDGNN | 18.348 ± 0.025 | 30.468 ± 0.132 | 12.102 ± 0.177 | 13.843 ± 0.154 | 23.678 ± 0.137 | 9.196 ± 0.136 |
| ours | 18.211 ± 0.025 | 30.433 ± 0.132 | 12.006 ± 0.176 | 13.587 ± 0.151 | 23.566 ± 0.136 | 8.955 ± 0.132 |
Table 5. Ablation studies on the prediction side.
| Methods | MAE | RMSE | MAPE |
| --- | --- | --- | --- |
| ours | 18.211 | 30.433 | 12.006 |
| ours_w/o_FD | 18.348 | 30.468 | 12.102 |
| ours_w/o_AGF | 18.258 | 30.601 | 12.113 |
| ours_w/o_DT | 18.412 | 30.782 | 12.290 |
Table 6. Sensitivity of the realized planning cost to the cost-function weight λ.
| Weight λ | Realized Cost | Congested Nodes (%) | Path Length (hops) |
| --- | --- | --- | --- |
| 0 (distance-only) | 9.873 | 11.42 | 8.81 |
| 0.25 | 9.612 | 9.37 | 9.02 |
| 0.5 | 9.488 | 8.15 | 9.28 |
| 1 (default) | 9.431 | 7.43 | 9.41 |
| 2 | 9.476 | 7.11 | 9.97 |
| 4 | 9.608 | 6.95 | 10.88 |
| ∞ (flow-only) | 10.134 | 6.82 | 12.47 |
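
Table 6 sweeps the flow-term weight λ between a distance-only and a flow-only objective. The sketch below shows how such a weight can shift the preferred route. It assumes a simple additive form, the sum of edge distances d(i, j) plus λ times the sum of node traffic costs f(i); the article's exact objective may differ, for example in normalization. The toy graph, its numbers, and the function name `hybrid_cost` are hypothetical.

```python
def hybrid_cost(path, edge_distance, node_flow_cost, lam=1.0):
    """Assumed additive objective: sum of d(i, j) plus lam * sum of f(i)."""
    distance = sum(edge_distance[(u, v)] for u, v in zip(path, path[1:]))
    flow = sum(node_flow_cost[n] for n in path)
    return distance + lam * flow

# Toy graph with made-up numbers: a short route through a congested node "b"
# and a longer detour through a free-flowing node "c".
edge_distance = {("s", "b"): 1.0, ("b", "t"): 1.0, ("s", "c"): 1.5, ("c", "t"): 1.5}
node_flow_cost = {"s": 0.0, "b": 2.0, "c": 0.2, "t": 0.0}

short_path = ["s", "b", "t"]
detour = ["s", "c", "t"]

best_by_lambda = {}
for lam in (0.0, 0.25, 1.0, 4.0):
    costs = {
        "short": hybrid_cost(short_path, edge_distance, node_flow_cost, lam),
        "detour": hybrid_cost(detour, edge_distance, node_flow_cost, lam),
    }
    best_by_lambda[lam] = min(costs, key=costs.get)
    print(lam, best_by_lambda[lam])
```

In this toy case the short route is preferred for λ = 0 and λ = 0.25, and the detour for λ = 1 and λ = 4, which mirrors the direction of the trend in Table 6: a larger λ yields longer paths with less congestion exposure.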
Table 7. Controlled decision-cycle comparison between the proposed framework and classical planners on the largest connected planning subgraph.
| Method | Realized Cost | Congested Nodes (%) | Expanded States | Arrival (%) |
| --- | --- | --- | --- | --- |
| Dijkstra | 9.560 ± 3.462 | 8.18 | 39.40 | 100.0 |
| A* | 9.560 ± 3.462 | 8.18 | 30.93 | 100.0 |
| Hybrid A* | 9.568 ± 3.458 | 8.12 | 67.57 | 100.0 |
| Proposed | 9.431 ± 3.412 | 7.43 | 30.89 | 100.0 |
Table 8. Planning-side ablation on the controlled classical-planner simulation.
| Variant | Realized Cost | Congested Nodes (%) | Convergence Epoch |
| --- | --- | --- | --- |
| Full model | 9.431 ± 3.412 | 7.43 | 60 |
| w/o adjacency masking | 9.582 ± 3.511 | 7.89 | 85 |
| w/o heuristic embedding | 9.711 ± 3.624 | 8.34 | >150 |
| w/o bi-directional decoding | 9.528 ± 3.468 | 7.75 | 72 |
| w/o edge-distance bias | 9.495 ± 3.491 | 7.61 | 78 |
Table 9. Controlled comparison of RL optimizers under the same planning setup.
| Optimizer | Grad. Variance ( × 10 3 ) | Epochs to 95% | Final Realized Cost | Converged Seeds |
| --- | --- | --- | --- | --- |
| REINFORCE (no baseline) | 3.82 | 148 | 9.617 | 3/5 |
| REINFORCE + rollout (ours) | 0.47 | 62 | 9.431 | 5/5 |
| PPO (clip 0.2) | 0.39 | 78 | 9.463 | 5/5 |
| DQN (double, dueling) | 1.24 | >200 | 9.738 | 2/5 |
Table 10. Decoder strategy comparison under the same sparse-graph planning environment.
| Decoder | Realized Cost | Congested (%) | Decoding Time (×Greedy) | Bottleneck Coverage (%) |
| --- | --- | --- | --- | --- |
| Greedy autoregressive (B = 1) | 9.528 ± 3.468 | 7.75 | 1.0 | 62.1 |
| Beam search (B = 4) | 9.491 ± 3.441 | 7.69 | 3.9 | 68.4 |
| Beam search (B = 10) | 9.458 ± 3.424 | 7.58 | 9.7 | 74.9 |
| Beam search (B = 20) | 9.445 ± 3.417 | 7.52 | 19.3 | 78.6 |
| Beam search (B = 50) | 9.438 ± 3.413 | 7.49 | 47.9 | 80.8 |
| Bi-directional (ours) | 9.431 ± 3.412 | 7.43 | 2.0 | 91.2 |
Table 11. Action-probability entropy conditioned on topological bottleneck incidence and predicted congestion level.
| Group | Decision Steps | Mean Entropy (nats) | Std. Dev. | Relative to ln k (%) |
| --- | --- | --- | --- | --- |
| Non-bottleneck, low congestion | 2048 | 0.462 | 0.134 | 33 |
| Non-bottleneck, high congestion | 564 | 0.587 | 0.171 | 42 |
| Bottleneck, low congestion | 378 | 0.193 | 0.112 | 18 |
| Bottleneck, high congestion | 147 | 0.914 | 0.247 | 83 |
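
Table 11 reports action-distribution entropy in nats and as a share of ln k. The sketch below computes both quantities, assuming k is the number of feasible next nodes at a decision step (this section does not define k explicitly). The probability vectors are made up for illustration, and the last lines recompute the ratio between the two bottleneck rows of the table.

```python
import math

def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def relative_entropy_pct(probs):
    """Entropy as a percentage of the maximum ln k for k candidate actions."""
    k = len(probs)
    if k < 2:
        return 0.0  # a single feasible action leaves nothing to explore
    return 100.0 * entropy_nats(probs) / math.log(k)

confident = [0.96, 0.02, 0.02]  # hypothetical near-deterministic policy
hesitant = [0.45, 0.35, 0.20]   # hypothetical policy keeping alternatives open

print(round(relative_entropy_pct(confident), 1))
print(round(relative_entropy_pct(hesitant), 1))

# Ratio of mean entropies in Table 11: bottleneck, high vs. low congestion.
ratio = 0.914 / 0.193
print(round(ratio, 2))  # 4.74
```

A value near 100% means the policy is close to uniform over its feasible actions, while a value near 0% means it has effectively committed to one successor.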
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, P.; Lu, J.; Fu, J.; Di, X.; Fang, K.; Tang, J.; Yang, C. From Prediction to Planning: A Spectral-Temporal GNN and Bi-Directional Decoding RL Framework. Signals 2026, 7, 47. https://doi.org/10.3390/signals7030047
