1. Introduction
The primary challenge in autonomous driving and Intelligent Transportation Systems (ITS) lies in optimizing decision-making within dynamic and complex environments, where uncertainty and temporal variability are inherent. Driven by rapid advancements in deep learning, traditional cascaded architectures that separate perception, prediction, and planning are evolving toward integrated collaborative frameworks [1,2]. In this context, efficient motion planning relies not only on analyzing static road network topology but also on accurately predicting future traffic flow states to preemptively avoid congestion and hazards [3]. Existing real-time motion planning methods have made progress in obstacle avoidance and trajectory generation, focusing primarily on local geometric constraints [4,5]. However, effectively integrating dynamic traffic prediction information over long time horizons to achieve global path optimality remains a critical challenge, particularly when balancing computational efficiency with the complexity of real-world traffic patterns [6].
In the field of traffic flow prediction, modeling spatiotemporal data has shifted from traditional statistical methods to data-driven deep learning paradigms. Graph Neural Networks (GNNs) have emerged as the mainstream approach due to their robust ability to handle spatial dependencies in non-Euclidean spaces, modeling road networks as complex graph structures [7,8,9]. Recent deep-learning-based studies in vehicular sensing have also demonstrated the effectiveness of convolutional neural architectures for extracting structured patterns from complex transportation-related signals. For example, Delamou et al. [10] proposed a deep-learning-based estimator for multitarget radar detection in vehicular scenarios, highlighting the broader applicability of neural feature learning in intelligent transportation environments. Nevertheless, the traffic forecasting problem considered in this work has a different data structure, since the target signals are collected over a sparse road graph and require explicit modeling of non-Euclidean spatial dependencies together with mixed local and periodic temporal patterns. To address the limitations of static graph structures, the authors in [11] proposed adaptive graph convolutional networks to capture latent correlations among nodes, while Shao et al. designed a decoupled spatiotemporal learning framework to infer dynamic adjacency matrices [12]. Nevertheless, prior GNN-based predictors still exhibit three recurring limitations. First, spatial modeling often remains static or only weakly adaptive, which reduces robustness under incident-driven topology changes. Second, temporal modeling is usually dominated by recurrent or convolutional time-domain operators, which biases the representation toward recent observations and weakens the extraction of recurring spectral structure [13]. Third, when multiple temporal views are used, the fusion mechanism is often fixed, making it difficult to adapt the model emphasis between smooth periodic regimes and bursty traffic fluctuations. As noted by [14], spectral analysis is crucial for resolving long-range periodicity in complex time series; therefore, ignoring frequency-domain information restricts the model’s capacity to represent macroscopic traffic patterns and degrades long-horizon forecasting quality.
Simultaneously, path planning methodologies are undergoing a paradigm shift from traditional search to Deep Reinforcement Learning (DRL) [15,16]. A review by [17] indicates that DRL can learn complex navigation strategies through environmental interaction, overcoming the adaptability limitations of traditional algorithms in dynamic settings. In particular, approaches based on Neural Combinatorial Optimization (NCO) model path planning as a node sequence generation task [18,19], demonstrating superior performance in solving combinatorial optimization challenges such as the Vehicle Routing Problem [20]. However, prior planning methods also exhibit three practical bottlenecks. First, classical planners such as Dijkstra, A*, and hybrid A* remain strong for static or geometry-dominant routing [3], but they do not naturally exploit predicted future traffic states. Second, directly applying topology-agnostic Transformer encoders to urban road networks introduces substantial redundancy because most node pairs are physically disconnected. Third, purely data-driven exploration often suffers from slow convergence and cold-start issues within large, sparse graphs, where bottleneck edges are difficult to discover in the initial stage. Studies by [21,22] demonstrate that fusing heuristic rules into reinforcement learning frameworks can significantly reduce the search space and improve policy quality. This concept traces back to the early theories on heuristic control systems by [23]. Building on the hyper-heuristic design concepts proposed by [24], injecting the optimality priors of traditional algorithms into the model as directional guidance signals has become an effective approach to overcoming sparse reward and exploration bottlenecks.
To address these challenges, this paper proposes an integrated framework that fuses a spectral-temporal graph neural network with bidirectional decoding reinforcement learning. For prediction, we design a time-frequency dual-stream architecture to simultaneously capture microscopic time-domain dynamics and macroscopic frequency-domain periodicities. Additionally, a dynamic graph generation module is employed to infer latent spatial dependencies. For planning, we propose an adjacency masking attention mechanism to accommodate the sparsity of road networks. Furthermore, a bidirectional autoregressive decoding strategy is introduced to circumvent the local minima associated with unidirectional search. The literature gaps addressed by the proposed framework are summarized schematically in Figure 1.
The contributions of this paper are summarized as follows:
- (1) A time-frequency dual-stream adaptive graph neural network prediction model is proposed. By combining Real Fast Fourier Transform (RFFT) spectral analysis with a dynamic topology generation mechanism, the model effectively resolves the difficulty traditional methods face in simultaneously capturing long-range periodicities and instantaneous dynamic fluctuations.
- (2) A path planning algorithm based on a graph-aware attention encoder and bidirectional decoding is designed. The adjacency masking mechanism reduces redundant attention interactions in sparse road networks, while the bidirectional parallel search strategy enhances global optimization capability.
- (3) A heuristic-guided feature embedding module is constructed to incorporate prior knowledge from traditional shortest path algorithms into the reinforcement learning state space. This approach effectively addresses the issues of low exploration efficiency and cold-start problems for agents in large-scale road networks.
To make the positioning of the proposed framework more explicit, Table 1 contrasts it with three representative classes of joint prediction-planning approaches along three axes. Along module coupling, cascaded prediction-then-planning pipelines decouple the two tasks entirely, whereas end-to-end DRL over a latent traffic representation couples them implicitly through shared parameters. The proposed framework instead adopts an intermediate “predicted-cost injection” bridge, which transfers explicit short-horizon predicted flows from the spectral-temporal GNN into the planner without forcing hard parameter sharing. Along spatio-temporal modeling, most prior joint frameworks rely on time-domain GNNs whose temporal operators bias the representation toward recent observations and weaken recurring spectral structure. The proposed framework introduces a spectral-temporal dual-stream encoder with dynamic topology generation, which jointly captures periodic spectral components and incident-driven topology shifts. Along RL exploration, topology-agnostic attention and undirected exploration are common in prior planners, which causes slow convergence on large, sparse road networks. The proposed framework uses adjacency-masked attention, Dijkstra-based heuristic embedding, and bi-directional autoregressive decoding to mitigate cold-start inefficiency and bottleneck-edge blindness.
The remainder of this article is organized as follows. Section 2 formulates the mathematical models for the traffic prediction and path planning problems. Section 3 elaborates on the proposed algorithmic framework, detailing the Spectral-Temporal GNN, the dynamic topology generation, and the Bi-directional Decoding RL method with heuristic feature embedding. Section 4 presents the experimental settings, including datasets and baselines, and analyzes the performance of prediction and planning tasks along with ablation studies. Finally, Section 5 concludes this article and discusses future research directions.
2. Model
Let $N$ denote the number of nodes within a sensor network. At each time step $t$, the system state comprises a set of heterogeneous physical attributes. Consequently, observations at time $t$ are denoted as a matrix $X_t \in \mathbb{R}^{N \times C}$, where $C$ represents the attribute feature dimension for each node. The entire historical observation window appears as a tensor $\mathcal{X} \in \mathbb{R}^{T \times N \times C}$, where $T$ indicates the length of the look-back window.
The system is represented as a graph structure $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. This graph consists of a node set $\mathcal{V}$ with $|\mathcal{V}| = N$ and an edge set $\mathcal{E}$. To avoid symbol overloading, the time-varying node dependency matrix is denoted by $A_{\mathrm{dyn}} \in \mathbb{R}^{N \times N}$. Distinct from conventional approaches relying on static priors such as geographical distances, the graph is posited as time-varying or latent. This implies that $A_{\mathrm{dyn}}$ must be dynamically inferred from the input data $\mathcal{X}$.
Given the historical observation tensor $\mathcal{X}$ over the past $T$ time steps, the objective involves learning a non-linear mapping function $f_{\theta}$ to forecast the target attribute for the next $T'$ time steps. This target usually corresponds to the primary metric among the $C$ attributes. The prediction is formulated as follows:
$$\hat{Y} = f_{\theta}\left(\mathcal{X};\, A_{\mathrm{dyn}}\right) \tag{1}$$
where $\hat{Y} \in \mathbb{R}^{T' \times N}$ denotes the predicted sequence, $\theta$ represents learnable model parameters, and $A_{\mathrm{dyn}}$ signifies the internally generated dynamic graph structure.
Consider a static road network topology denoted by $G = (V, E)$, where $V$ denotes the set of $n$ intersection nodes, and $E \subseteq V \times V$ represents the set of road edges. The edge weight function $d: E \to \mathbb{R}_{\ge 0}$ represents the travel distance. Let $f(v)$ denote the time-varying traffic at node $v$, which is provided by the spectral-temporal GNN module. Given a source node $s$ and a destination node $t$, the path planning task is defined as finding the optimal path $\pi = (v_1, v_2, \ldots, v_k)$ that minimizes the hybrid cost function:
$$J(\pi) = \sum_{i=1}^{k-1} \tilde{d}(v_i, v_{i+1}) + \lambda \sum_{i=1}^{k} \tilde{f}(v_i) \tag{2}$$
where $v_1 = s$, $v_k = t$, and $(v_i, v_{i+1}) \in E$ for all $i$. Because raw distances and predicted traffic values carry different physical units, the two terms are made dimensionally comparable through min-max normalization to $[0, 1]$:
$$\tilde{d}(e) = \frac{d(e) - d_{\min}}{d_{\max} - d_{\min}}$$
with $d_{\min}$ and $d_{\max}$ taken over all graph edges, and
$$\tilde{f}(v) = \frac{f(v) - f_{\min}}{f_{\max} - f_{\min}}$$
with $f_{\min}$ and $f_{\max}$ taken over all nodes at the current decision cycle. The scalar weight $\lambda \ge 0$ controls the trade-off between geometric distance and predicted traffic exposure. Unless stated otherwise, we adopt the default $\lambda = 1$, which corresponds to equal weighting after normalization and places the two terms at a common numerical scale. The sensitivity of the framework to different choices of $\lambda$ is examined empirically in Section 4.5, which confirms that the default is close to the realized-cost minimum under the present planning topology.
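The hybrid cost described above can be illustrated with a small sketch. It assumes the cost is the sum of normalized edge distances along the path plus a weight `lam` times the sum of normalized flows at the visited nodes (source and destination included); the function names and the toy graph are hypothetical and are not taken from the paper's implementation.

```python
def min_max(values):
    """Scale a dict of raw values to [0, 1]; a constant input maps to 0."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span > 0 else 0.0 for k, v in values.items()}


def hybrid_cost(path, edge_dist, node_flow, lam=1.0):
    """Normalized distance along the path plus lam times normalized node flow."""
    d_norm = min_max(edge_dist)   # normalized over all graph edges
    f_norm = min_max(node_flow)   # normalized over all nodes in this cycle
    dist_term = sum(d_norm[(u, v)] for u, v in zip(path, path[1:]))
    flow_term = sum(f_norm[v] for v in path)
    return dist_term + lam * flow_term


# Toy graph with made-up numbers: two routes from "s" to "t".
edge_dist = {("s", "a"): 1.0, ("a", "t"): 1.0, ("s", "b"): 2.0, ("b", "t"): 3.0}
node_flow = {"s": 100.0, "a": 500.0, "b": 100.0, "t": 300.0}

short_route = ["s", "a", "t"]   # shorter, but passes the busy node "a"
long_route = ["s", "b", "t"]    # longer, but avoids it

for lam in (0.0, 1.0, 4.0):
    print(lam,
          hybrid_cost(short_route, edge_dist, node_flow, lam),
          hybrid_cost(long_route, edge_dist, node_flow, lam))
```

In this toy case the shorter route is cheaper at the default weight, while a weight of 4 makes the detour around the busy node cheaper, which is the kind of trade-off the weight is meant to control.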
To keep the notation consistent, separate symbols are reserved for the attribute-level affinities, for the dynamically inferred node graph used by diffusion, and for the planning tuple.
4. Experiment
4.1. Datasets and Preprocessing
To evaluate the performance of the proposed framework in real-world scenarios, we conducted experiments on two widely used public traffic datasets, PeMS04 and PeMS08, referred to below as PEMSD4 and PEMSD8. These datasets are collected by the Caltrans Performance Measurement System (PeMS) and contain traffic flow data from the San Francisco Bay Area and the San Bernardino area, respectively. Their basic statistics are summarized in Table 2. Taking PEMSD4 as an example, this dataset contains traffic flow measurements collected by 307 loop detectors in the San Francisco Bay Area from 1 January 2018 to 28 February 2018, with a total of 16,992 time steps.
The prediction task is evaluated on both PEMSD4 and PEMSD8, whereas the path-planning simulation is conducted on the sparse planning graph derived from PEMSD4 and visualized in Figure 2. This graph contains 307 nodes and 340 edges, resulting in an average degree of approximately 2.21. Such a sparse topology strongly motivates the use of adjacency-masked attention. For this graph, a standard full-attention layer evaluates $307^2 = 94{,}249$ pairwise interactions. In contrast, a local masked-attention layer retains only $2 \times 340 + 307 = 987$ valid interactions after accounting for adjacent node pairs and self-loops, eliminating 98.95% of redundant pairwise interactions per local layer. Under the default planning encoder used in this work, which contains two local masked layers and one global layer, the total number of pairwise interactions is reduced from $3 \times 94{,}249 = 282{,}747$ to $2 \times 987 + 94{,}249 = 96{,}223$, corresponding to an overall reduction of 65.97%. This quantifies the structural efficiency gain of the proposed sparsity-aware encoder.
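The interaction counts behind these percentages can be checked with a few lines of arithmetic. The sketch below assumes that a full-attention layer scores every ordered node pair and that a masked layer scores both directions of each edge plus one self-loop per node; this counting convention is an assumption, chosen because it reproduces the percentages stated above.

```python
n_nodes, n_edges = 307, 340

full_pairs = n_nodes ** 2                 # every ordered node pair
masked_pairs = 2 * n_edges + n_nodes      # both edge directions + self-loops
per_layer_saving = 1 - masked_pairs / full_pairs

# Default encoder in the text: two local masked layers and one global layer.
all_full = 3 * full_pairs
mixed = 2 * masked_pairs + full_pairs
overall_saving = 1 - mixed / all_full

avg_degree = 2 * n_edges / n_nodes

print(full_pairs, masked_pairs, f"{per_layer_saving:.2%}")
print(all_full, mixed, f"{overall_saving:.2%}")
print(f"{avg_degree:.2f}")
```

The overall saving is capped by the single global layer, which still evaluates all node pairs.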
We split both datasets into training, validation, and testing sets in chronological order with a ratio of 6:2:2. A sliding-window strategy is used to generate input-output pairs. Both the historical input window size and the prediction horizon are set to 12, meaning that the model uses the previous hour of traffic observations to predict the next hour.
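The split and windowing described above can be sketched as follows. The helper names are hypothetical, and the rounding of split boundaries and the order in which splitting and windowing are applied are assumptions, since the text does not specify them.

```python
def chronological_split(n_steps, ratios=(6, 2, 2)):
    """Return (train, val, test) ranges of time-step indices, in time order."""
    total = sum(ratios)
    train_end = n_steps * ratios[0] // total
    val_end = n_steps * (ratios[0] + ratios[1]) // total
    return range(0, train_end), range(train_end, val_end), range(val_end, n_steps)


def sliding_windows(series, t_in=12, t_out=12):
    """Pair each t_in-step history with the t_out steps that follow it."""
    pairs = []
    for start in range(len(series) - t_in - t_out + 1):
        x = series[start:start + t_in]
        y = series[start + t_in:start + t_in + t_out]
        pairs.append((x, y))
    return pairs


# 16,992 is the number of PEMSD4 time steps reported in the text.
train, val, test = chronological_split(16992)
print(len(train), len(val), len(test))

# Toy series of 40 steps: each window uses 12 inputs and 12 targets.
windows = sliding_windows(list(range(40)))
print(len(windows), windows[0])
```

Splitting by time index before windowing keeps every test target strictly later than the training data, which is the point of a chronological split.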
4.2. Baselines and Experimental Settings
To evaluate the proposed predictor, we compared it against a diverse set of baselines, including statistical methods (HA and AR), conventional deep learning approaches (LSTNet), and representative spatio-temporal graph neural networks (AGCRN, ST-AE, SDGL, DDGCRN, and HSDGNN). Prediction accuracy is assessed using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), with missing values excluded from evaluation to ensure fair comparison.
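A minimal sketch of the three metrics with missing values excluded is given below. It assumes that a missing target is encoded as `0.0`, a common convention in PeMS preprocessing pipelines that the text does not confirm; the function name is hypothetical.

```python
import math


def masked_metrics(y_true, y_pred, missing=0.0):
    """MAE, RMSE, and MAPE (in %) over entries whose target is not `missing`."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != missing]
    if not pairs:
        raise ValueError("no valid targets to evaluate")
    n = len(pairs)
    mae = sum(abs(t - p) for t, p in pairs) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / n)
    mape = 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / n
    return mae, rmse, mape


# The middle target is treated as missing and ignored.
mae, rmse, mape = masked_metrics([100.0, 0.0, 200.0], [110.0, 5.0, 180.0])
print(mae, rmse, mape)
```

Masking also keeps MAPE well defined, since a zero target would otherwise cause a division by zero.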
The major architectural and training settings used for both the prediction and path-planning modules are summarized in Table 3. All experiments were implemented in PyTorch 2.0.1 on a Linux workstation equipped with a single NVIDIA GeForce RTX 4090 GPU. For prediction, we followed the standard setting of a 12-step input horizon and a 12-step forecasting horizon, used a latent feature dimension of 64, processed 7 non-redundant RFFT bins under the 12-step input window, adopted the diffusion order listed in Table 3, used the Adam optimizer with an initial learning rate of 0.002, set the batch size to 64, and trained for up to 300 epochs with early stopping patience of 30 based on MAE. For path planning, the final configuration uses a 307-node sparse graph, embedding and hidden dimensions of 64, three encoder layers with one global layer and eight attention heads, heuristic features enabled, a rollout baseline, the learning rate listed in Table 3, a batch size of 64, and 100 training epochs.
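The count of 7 non-redundant RFFT bins follows from the window length: a real sequence of length 12 has 12 // 2 + 1 = 7 independent frequency bins, because the remaining bins are complex conjugates of these. The sketch below uses a naive one-sided DFT in pure Python for illustration only; in a PyTorch model the same quantity would normally come from `torch.fft.rfft`, and the toy signal is made up.

```python
import cmath
import math


def rfft(signal):
    """Naive one-sided DFT of a real sequence: bins k = 0 .. len(signal) // 2."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n // 2 + 1)
    ]


# Toy 12-step window containing exactly two cycles of a cosine.
window = [math.cos(2 * math.pi * 2 * t / 12) for t in range(12)]
spectrum = rfft(window)
magnitudes = [abs(c) for c in spectrum]
dominant_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)

print(len(spectrum), dominant_bin)
```

The energy of the toy signal concentrates in bin 2, matching its two cycles per window, which is the kind of periodic structure a frequency-domain branch can read off directly.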
4.3. Results on the Prediction Performance
Table 4 reports the comparative prediction results on PEMSD4 and PEMSD8. The proposed method achieves MAE/RMSE/MAPE values of 18.211/30.433/12.006 on PEMSD4 and 13.587/23.566/8.955 on PEMSD8. On both datasets, it outperforms the strongest baseline HSDGNN, which records 18.348/30.468/12.102 on PEMSD4 and 13.843/23.678/9.196 on PEMSD8.
These results support the effectiveness of the proposed dual-stream design. Traditional statistical baselines exhibit the largest errors because they cannot capture nonlinear spatio-temporal dependencies. Deep learning baselines substantially improve over statistical methods, but models relying primarily on time-domain recurrent or convolutional processing still struggle to preserve long-range periodic information. In contrast, the frequency-domain enhancement helps the proposed method capture global periodic patterns, while the adaptive gating mechanism prevents rigid fusion from amplifying high-frequency noise, leading to better robustness during traffic peaks and fluctuations.
4.4. Ablation Studies
To verify the individual contributions of the Frequency-domain Global Enhancement, the Adaptive Gated Fusion, and the Dynamic Topology generation, we conducted prediction-side ablation studies on PEMSD4, as summarized in Table 5. The variant ours_w/o_FD removes the frequency-domain branch and relies only on the time-domain GRU. The variant ours_w/o_AGF retains the frequency-domain branch but replaces adaptive gated fusion with static concatenation. The variant ours_w/o_DT replaces the dynamic topology generation module with a static adjacency built from the offline distance-based graph.
The ablation results show that simply introducing frequency-domain features is not sufficient by itself. Although frequency information improves the MAE, static fusion leads to a higher RMSE, indicating that rigid fusion can introduce noise during peak periods. Disabling the dynamic topology further degrades all three metrics (MAE 18.412, RMSE 30.782, MAPE 12.290), which indicates that a static prior cannot adequately represent the incident-driven spatial shifts that arise during peak cycles. The full model achieves the best MAE, RMSE, and MAPE simultaneously, confirming that frequency-domain enhancement, adaptive gated fusion, and dynamic topology generation contribute complementarily to prediction quality.
4.5. Cost Function Sensitivity Analysis
The hybrid planning cost introduced in Section 2 combines a normalized distance term and a normalized flow term through a scalar weight $\lambda$. To examine how sensitive the planner is to this design choice, we conduct a controlled sweep over $\lambda$ on the same 237-node connected subgraph, 24 source-target pairs, and 24-cycle dynamic flows used in the controlled classical-planner comparison. For each value of $\lambda$, the planner runs under identical encoder and decoder parameters, and the realized hybrid cost, congested-node ratio, and average path length are averaged across all pairs and cycles. Setting $\lambda = 0$ corresponds to a distance-only objective (equivalent to geometric shortest path, ignoring predicted flow), while the opposite extreme of the sweep corresponds to flow-only routing.
Table 6 and Figure 3 show a clear U-shaped dependence of the realized cost on $\lambda$. A distance-only objective ignores predicted traffic and incurs the highest congestion exposure (11.42%); a flow-only objective over-reacts to short-horizon predictions, systematically detours through longer paths (12.47 hops on average), and also ends up with a higher total cost despite the lowest nominal congestion. The default $\lambda = 1$ is close to the realized-cost minimum and offers a favorable trade-off between congestion exposure and path length, which justifies the equal-weighting design adopted in this work. For scenarios that explicitly prefer congestion avoidance at the expense of route length (e.g., emergency dispatch), the table supplies guidance for moving $\lambda$ toward 2 or 4.
4.6. Path Planning Simulation
Figure 4 illustrates the training progression of the proposed path-planning framework over 100 epochs. The plotted curves show the forward training cost, backward training cost, and validation cost.
Both decoding directions exhibit a rapid decrease during the first 20 epochs, indicating that the heuristic-guided feature embedding effectively alleviates the cold-start problem in sparse graphs. After approximately 60 epochs, the curves become markedly smoother and remain stable, suggesting that the combination of REINFORCE training and the rollout baseline provides stable optimization behavior.
Notably, a marginal performance discrepancy can be observed between the forward and backward directions. This phenomenon reflects the topological asymmetry, where traversing reversely from the target to the source may encounter different critical bottleneck edges compared to the forward path. Nevertheless, the synchronous decline of both curves validates that our joint optimization objective in Equation (30) effectively coordinates the bi-directional decoders to concurrently search for the optimal policy. Furthermore, the validation set cost remains consistently lower than the training cost and converges steadily, indicating the model’s robust generalization capability in unseen scenarios. These observations are reported for the fixed sparse-topology setting summarized in Table 3, which defines the scope of the present path-planning evidence.
To further validate the heuristic-guided feature embedding, we compared training behavior with and without heuristic features. Figure 5 shows that the model equipped with heuristic priors converges faster and remains consistently below the baseline without such guidance throughout training. This behavior indicates that normalized shortest-path distance and hop-count features provide effective directional cues, helping the policy avoid prolonged blind exploration in sparse graphs.
The same figure also shows that heuristic guidance improves final path quality. Because the agent receives coarse topological directionality from the start, it more quickly avoids obviously high-cost regions and reaches better solutions than the baseline that relies purely on trial-and-error exploration.
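The heuristic features mentioned above, normalized shortest-path distance and hop count, can be sketched as follows. This is an illustration under stated assumptions: features are measured from every node to the destination, the graph is undirected, and each feature is divided by its maximum over reachable nodes. The function names and the toy graph are hypothetical and do not come from the paper's implementation.

```python
import heapq
from collections import deque


def dijkstra_to_target(adj, target):
    """Shortest weighted distance from every reachable node to `target`."""
    dist = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def hops_to_target(adj, target):
    """Breadth-first hop count from every reachable node to `target`."""
    hops = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v, _ in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops


def heuristic_features(adj, target):
    """Per-node (normalized distance, normalized hop count) toward `target`."""
    dist = dijkstra_to_target(adj, target)
    hops = hops_to_target(adj, target)
    d_max = max(dist.values()) or 1.0
    h_max = max(hops.values()) or 1
    return {v: (dist[v] / d_max, hops[v] / h_max) for v in dist}


# Toy undirected graph: adjacency lists of (neighbour, edge length).
adj = {
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("s", 1.0), ("t", 1.0)],
    "b": [("s", 4.0), ("t", 1.0)],
    "t": [("a", 1.0), ("b", 1.0)],
}
feats = heuristic_features(adj, "t")
print(feats)
```

Both features shrink toward zero as a node gets closer to the destination, which gives an untrained policy a sense of direction before any reward has been observed.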
4.7. Controlled Comparison with Classical Planners
To complement the learning-based results above, we constructed a controlled decision-cycle simulation on the largest connected component of the PEMSD4-derived planning graph. This connected subgraph contains 237 nodes and 280 edges. We selected 24 source-target pairs with at least two competitive simple routes and generated time-varying node costs by shifting hotspot regions over 24 decision cycles. Dijkstra and A* re-plan from the current node using the instantaneous traffic snapshot, whereas the proposed framework uses a short-horizon predicted cost estimate consistent with the predict-then-plan paradigm. Because the current graph benchmark does not expose continuous vehicle kinematic states, hybrid A* is implemented as a discretized heading-regularized surrogate and is used only as a practical reference rather than as a full vehicle-dynamics benchmark.
To ensure a fair cross-planner comparison, we evaluate all methods by the same planner-agnostic metric, which we term the realized cost. Specifically, after each planner produces its executed node sequence $\pi$ during the multi-cycle simulation, the realized cost is obtained by substituting into the hybrid objective $J(\pi)$ in Equation (2) the true simulated node-flow values at the cycle at which each node is actually visited, rather than any planner-internal estimate. The normalized edge distances follow the same min–max normalization defined in Section 2, and the weight $\lambda$ is fixed at its default value. In this way, Dijkstra and A* are scored against exactly the same ground-truth traffic realization that the proposed framework attempts to anticipate, and the realized cost faithfully reflects the cost actually incurred along the executed route rather than the cost the planner believed it would incur at decision time.
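The difference between the realized cost and the cost a planner believes it will incur can be shown with a short sketch. It assumes that distances and flows are already normalized and that each visited node is charged the true flow of the cycle in which it was reached; the function name and all numbers are hypothetical.

```python
def realized_cost(route, visit_cycles, d_norm, true_flow_norm, lam=1.0):
    """Score an executed route against the flows that actually occurred.

    route          : executed node sequence
    visit_cycles   : decision cycle at which each node in `route` was visited
    d_norm         : {(u, v): normalized edge distance}
    true_flow_norm : {cycle: {node: normalized true flow}}
    """
    dist_term = sum(d_norm[(u, v)] for u, v in zip(route, route[1:]))
    flow_term = sum(true_flow_norm[c][v] for v, c in zip(route, visit_cycles))
    return dist_term + lam * flow_term


d_norm = {("s", "a"): 0.2, ("a", "t"): 0.4}
# True flows per cycle: node "a" becomes congested at cycle 1.
true_flow_norm = {
    0: {"s": 0.1, "a": 0.2, "t": 0.3},
    1: {"s": 0.1, "a": 0.9, "t": 0.3},
    2: {"s": 0.1, "a": 0.9, "t": 0.3},
}
route = ["s", "a", "t"]

# Cost as seen from the cycle-0 snapshot (every node charged at cycle 0).
believed = realized_cost(route, [0, 0, 0], d_norm, true_flow_norm)
# Cost actually incurred (nodes charged at the cycle they were visited).
realized = realized_cost(route, [0, 1, 2], d_norm, true_flow_norm)
print(believed, realized)
```

A snapshot-based planner underestimates this route because the hotspot reaches node "a" only after the decision was made, which is the gap the realized-cost metric is designed to expose.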
Table 7 summarizes the controlled comparison, including realized cost, congestion exposure, expanded states, and arrival rate for each planner. The results show that the proposed framework achieves the lowest realized cost and the lowest congestion exposure among the compared methods. Relative to A*, the mean realized cost decreases from 9.560 to 9.431, while the congested-node ratio decreases from 8.18% to 7.43%, corresponding to a relative reduction of approximately 9.2% in congestion exposure. In contrast, A* mainly improves search efficiency over Dijkstra, reducing the average number of expanded states per cycle from 39.40 to 30.93, whereas the hybrid A* surrogate preserves a similar cost profile but requires substantially more search effort under the present graph abstraction.
Figure 6 provides a cycle-level view of the same simulation. The proposed framework remains consistently below the classical baselines after the early decision cycles, indicating that incorporating predicted future traffic information helps the planner avoid route commitments that later become congested. This result should be interpreted as a controlled graph-based illustration of practical decision cycles rather than as a claim that the present graph benchmark fully subsumes continuous-state vehicle planning.
4.8. Planning Module Ablation
To assess the individual contribution of each core planning component, we conducted a three-variant ablation under the same 237-node connected subgraph, 24 source-target pairs, and 24-cycle dynamic flows used above. The three variants are: (i) w/o masking, which removes the adjacency-masked attention and restores full self-attention over all node pairs; (ii) w/o heuristic, which disables the Dijkstra-derived heuristic feature embedding; and (iii) w/o bi-directional, which uses only forward decoding. All other architectural and training settings follow Table 3.
Table 8 summarizes the resulting realized cost, congested-node ratio, and convergence epoch for the full model and each ablated variant.
The results indicate four complementary effects. First, disabling the heuristic embedding incurs the largest degradation: the realized cost rises to 9.711 and the training curve does not reach 95% of the full-model cost within 150 epochs, confirming that the normalized shortest-path distance and hop-count features provide an indispensable cold-start signal on sparse topologies. Second, removing adjacency masking raises both the realized cost (9.582) and the congestion exposure (7.89%), because the fully connected attention allocates capacity to topology-inconsistent node pairs and slows down the effective learning of neighbor-aware policies. Third, restricting the decoder to a single direction produces a moderate but consistent degradation (9.528), in line with the bi-directional decoding analysis that an additional decoding path offers a second route of access to bottleneck edges. Fourth, an additional run that removes the edge-distance bias in the graph-aware attention encoder delays convergence from epoch 60 to epoch 78 and raises the realized cost from 9.431 to 9.495, which quantifies the convergence-speed difference with and without the edge-distance bias: the learnable bias acts as a geometric warm start so that, even under random initialization, the attention scores prefer physically proximal neighbors and the policy avoids wasteful long-range exploration in the early epochs. No ablated variant dominates the full model on any metric, confirming that the four planning components are non-redundant.
4.9. RL Optimizer Comparison
To further justify the choice of REINFORCE with a rollout baseline, we compared it against two mainstream RL optimizers on the same planning task: Proximal Policy Optimization (PPO) with a clipped surrogate objective, and a Double-Dueling DQN tailored for the discrete neighbor-selection action space. For each optimizer, we ran five independent training trajectories (100 epochs each) with identical encoder/decoder architectures and recorded the policy-gradient variance, the number of epochs required to reach 95% of the best observed cost, and the final realized cost.
Table 9 and Figure 7 support three observations. First, the rollout baseline reduces the variance of the policy gradient by roughly an order of magnitude relative to plain REINFORCE, which translates into faster convergence and more reliable attainment of the target cost across seeds. Second, PPO achieves the lowest measured variance, yet its clipped updates become somewhat conservative on the present large discrete action space, leading to a slightly slower convergence horizon and a marginally higher final cost. Third, DQN, despite its strong track record on dense-reward discrete-action tasks, struggles here: the combinatorial neighbor-selection action space induces Q-value bootstrapping variance, and sparse rewards make exploration unstable, yielding the highest final cost and only two of five seeds converging within the budget. Overall, the rollout baseline offers the most favorable variance-convergence trade-off for this sparse-graph planning task.
4.10. Decoder Strategy Comparison
To further assess whether the bi-directional decoding strategy yields near-optimal solutions relative to a stronger single-direction search, we compared it against a greedy autoregressive decoder (beam width $B = 1$) and a beam-search decoder evaluated at several beam widths $B$, including $B = 20$ and $B = 50$. All variants share the same trained encoder, decoder, REINFORCE-trained policy, and the hybrid cost of Equation (2) with the default weight. Beam search keeps the top-$B$ partial paths at every step, scored by the accumulated log-probability under the masked softmax policy, and returns the finished path with the minimum realized hybrid cost. All decoders run on the same 237-node connected subgraph, 24 source-target pairs, and 24 decision cycles used throughout Section 4.8. For each decoder we record: (i) the realized hybrid cost under the true simulated flow; (ii) the fraction of actually visited congested nodes; (iii) the decoding wall-clock time normalized to greedy; and (iv) a bottleneck-edge coverage rate, defined as the fraction of graph-theoretic bridge edges of the planning subgraph that are included in at least one executed route across the 576 planning events.
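The bottleneck-edge coverage rate defined in item (iv) can be sketched with a standard low-link depth-first search for bridges. The code assumes an undirected simple graph (no parallel edges) given as adjacency lists, and uses recursion, which is adequate for a graph of a few hundred nodes. The function names and the toy graph are hypothetical.

```python
def find_bridges(adj):
    """Return the bridge edges of an undirected simple graph.

    adj maps each node to a list of neighbours. Edges are returned as
    frozensets so that (u, v) and (v, u) compare equal.
    """
    disc, low, bridges = {}, {}, set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        for v in adj[u]:
            if v == parent:
                continue  # skip the tree edge we arrived by (simple graph)
            if v in disc:
                low[u] = min(low[u], disc[v])       # back edge
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:                # nothing bypasses (u, v)
                    bridges.add(frozenset((u, v)))

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return bridges


def bridge_coverage(adj, routes):
    """Fraction of bridges traversed by at least one executed route."""
    bridges = find_bridges(adj)
    used = {frozenset(e) for r in routes for e in zip(r, r[1:])}
    return len(bridges & used) / len(bridges) if bridges else 0.0


# Toy graph: a triangle a-b-c with a tail c-d-e and a spur b-f.
adj = {
    "a": ["b", "c"], "b": ["a", "c", "f"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d"], "f": ["b"],
}
routes = [["a", "c", "d", "e"]]
print(sorted(tuple(sorted(b)) for b in find_bridges(adj)))
print(bridge_coverage(adj, routes))
```

Here the triangle edges are not bridges because each lies on a cycle, and the single route covers two of the three bridges.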
Table 10 and Figure 8 support two conclusions. First, the realized cost decreases with the beam width $B$ at a diminishing rate and plateaus at larger widths: increasing the beam width from 20 to 50 lowers the cost only marginally while multiplying the decoding time. Beam search, which widens a single-direction search, therefore exhibits rapidly diminishing returns. Second, the bi-directional decoder attains a lower realized cost than a wide beam search at only a modest multiple of the greedy decoding time, which makes it considerably cheaper to compute. The bottleneck-coverage column explains the remaining gap on this sparse topology: all beams in a forward-only search descend the same autoregressive prefix tree and share an inherent directional bias toward bridges close to the source, while bi-directional decoding exposes bridges from both endpoints and raises the coverage rate. This means the residual advantage of the proposed strategy over a wide beam search is topological rather than a matter of search width, which is consistent with the role of bi-directional decoding described in Section 3.
4.11. Action Entropy at Bottleneck Edges
To understand how the learned policy behaves when the agent encounters highly congested bottleneck edges, we analyzed the Shannon entropy of the one-step action probability distribution defined by the tanh-clipped softmax in Equation (12). On the same controlled simulation, we classify every decision step along the 576 executed routes into four groups along two axes: (i) whether the current node is incident to a bridge edge of the planning subgraph (graph-theoretic cut edges identified by Tarjan’s algorithm), and (ii) whether the predicted congestion at the current node falls in the top decile (“high”) or below (“low”). The four groups cover the decision steps that remain after removing terminal and trivially masked states. Because only non-visited physically adjacent neighbors remain unmasked, the maximum achievable entropy at a decision is $\ln k$ nats, where $k$ is the number of unvisited neighbors, which is typically 2 or 3 on this sparse topology.
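The entropy bound can be illustrated with a small sketch of a masked, tanh-clipped softmax. Equation (12) is not reproduced in this section, so the form used here, a softmax over `clip * tanh(logit)` with masked actions given zero probability and `clip = 10`, is an assumption borrowed from common neural combinatorial optimization practice. The function names and logits are hypothetical.

```python
import math


def masked_policy(logits, mask, clip=10.0):
    """Softmax over clip * tanh(logit); masked actions get probability 0.

    Assumes at least one action is unmasked.
    """
    scores = [clip * math.tanh(z) if ok else None for z, ok in zip(logits, mask)]
    top = max(s for s in scores if s is not None)   # for numerical stability
    exps = [math.exp(s - top) if s is not None else 0.0 for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_nats(probs):
    """Shannon entropy in nats, skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Three unmasked neighbours with equal scores: entropy reaches ln 3.
hedging = masked_policy([0.3, 0.3, 0.3, 2.0], [True, True, True, False])
# Two unmasked neighbours with a strong preference: entropy near 0.
committed = masked_policy([2.0, -2.0, 0.0, 0.0], [True, True, False, False])

print(entropy_nats(hedging), math.log(3))
print(entropy_nats(committed))
```

Comparing a measured entropy with ln k for the same step gives a scale-free measure of how close the policy is to uniform over its remaining neighbours.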
Table 11 and
Figure 9 reveal a structured, topology-aware response. At non-bottleneck nodes, the policy is confidently committed (mean entropy
nats, about
of
), and high congestion adds only a mild increment (
). At bottleneck nodes under low congestion, the entropy actually drops to
nats because only one viable non-visited neighbor typically remains, making the next action nearly deterministic. Crucially, when the agent meets a bottleneck edge whose incident node is also highly congested, the entropy rises sharply to
nats, approximately
the bottleneck-low-congestion level, and reaches roughly
of the theoretical upper bound
. In other words, the policy avoids over-committing at exactly the decisions that are both topologically critical and operationally costly. Instead, it spreads probability nearly uniformly across the few remaining detour candidates, preserving exploratory margin where a premature commitment would be most damaging. This behavior is consistent with the bi-directional decoding design and suggests that the rollout-trained policy has learned a congestion-aware hedging strategy at the most consequential decision points of the sparse graph.
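The entropy quantities discussed in this subsection can be illustrated with a short Python sketch. It assumes that the tanh-clipped softmax takes the common form softmax(C · tanh(u)) over unmasked actions. The clipping constant `clip=10.0`, the function names, and the example logits are all hypothetical and are not taken from Equation (12).

```python
import math


def masked_policy(logits, mask, clip=10.0):
    """Tanh-clipped softmax over unmasked actions (assumed form C * tanh(u)).

    `mask[i]` is True when action i is an unvisited, adjacent neighbor.
    Masked actions receive probability exactly 0.
    """
    if not any(mask):
        raise ValueError("at least one action must be unmasked")
    clipped = [clip * math.tanh(u) if keep else -math.inf
               for u, keep in zip(logits, mask)]
    top = max(clipped)  # subtract the max for numerical stability
    weights = [math.exp(c - top) for c in clipped]
    total = sum(weights)
    return [w / total for w in weights]


def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropy(probs, mask):
    """Entropy as a fraction of its upper bound ln(k), k = unmasked actions."""
    k = sum(1 for keep in mask if keep)
    if k <= 1:
        return 0.0  # a single feasible action is deterministic
    return entropy_nats(probs) / math.log(k)


# Three feasible detours with equal scores: entropy reaches the bound ln(3).
mask = [True, True, True, False]
hedged = masked_policy([0.0, 0.0, 0.0, 0.0], mask)

# Two feasible actions with a strong preference: entropy is close to zero.
committed = masked_policy([5.0, -5.0], [True, True])

print(normalized_entropy(hedged, mask))
print(normalized_entropy(committed, [True, True]))
```

The normalized value makes decisions with different numbers of unvisited neighbors comparable. A ratio near 1 corresponds to the hedging behavior reported at congested bottlenecks, and a ratio near 0 corresponds to a confidently committed policy.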
4.12. Limitations and Practical Scope
Two limitations deserve explicit discussion. First, the prediction module relies on traffic sensors whose observations may be noisy, missing, or delayed. Such perturbations can affect both the spectral branch and the dynamically inferred graph, which in turn may degrade the planning signal passed to the route optimizer. Recent progress on robust learning under noisy observations, exemplified by the progressive sample selection framework with contrastive loss designed for noisy labels [
25], suggests practical directions for addressing sensor-level uncertainty in future deployments. Second, the planning module is trained with policy gradients; although the rollout baseline stabilizes optimization, RL performance can still vary under sparse rewards, topology shifts, or different random seeds. For these reasons, the current path-planning results should be interpreted as a fixed-topology proof of concept rather than as a complete deployment study. Robust sensor denoising, uncertainty-aware forecasting, and broader multi-seed RL evaluation remain important directions for future work.
5. Conclusions
This paper presents an integrated framework coupling a Spectral-Temporal Graph Neural Network with Bi-directional Decoding Reinforcement Learning to address spatiotemporal dependency modeling and sparse graph navigation challenges in Intelligent Transportation Systems. For traffic prediction, the proposed Time-Frequency Dual-stream Adaptive Learning module captures global periodicities and local dynamics through parallel FFT and GRU branches, while the adaptive gating mechanism harmonizes the two views. Quantitatively, the model achieves MAE/RMSE/MAPE values of 18.211/30.433/12.006 on PEMSD4 and 13.587/23.566/8.955 on PEMSD8, outperforming strong graph-based baselines on both datasets.
For path planning on the PEMSD4-derived sparse topology, the Bi-directional Decoding RL method demonstrates rapid cost reduction within the first 20 epochs and stable convergence after approximately 60 epochs. Under the default three-layer planning encoder, adjacency masking reduces the total number of pairwise attention interactions from 282,747 to 96,223, corresponding to a 65.97% reduction in redundant interactions. In addition, heuristic feature embedding further accelerates convergence and improves path quality by providing directional guidance in sparse graphs. A controlled decision-cycle simulation on the largest connected planning subgraph further shows that the proposed framework reduces the realized cost from 9.560 to 9.431 relative to A* and lowers congestion exposure from 8.18% to 7.43%. The planning-side ablation additionally confirms that removing the learnable edge-distance bias slows convergence from 60 to 78 epochs, the decoder strategy comparison shows that the bi-directional decoder reaches a lower realized cost than beam search with while using about less decoding time, and the entropy analysis indicates that the rollout-trained policy preserves exploratory margin at highly congested bottleneck nodes, where the action-distribution entropy rises by a factor of approximately relative to uncongested bottlenecks. Future research will focus on extending the framework to dynamic graph scenarios with real-time incident handling, large-scale heterogeneous transportation networks, robust sensor-noise mitigation, and more comprehensive multi-seed reproducibility studies, together with platform-specific efficiency benchmarking.