1. Introduction
Traffic flow prediction aims to estimate the future utilization of transportation resources in various urban regions over a given time horizon (e.g., the next hour or several hours) [1]. For instance, it can be used to forecast the demand for taxis or shared bicycles in different districts. Accurate traffic forecasting is essential for optimizing traffic control strategies and public transportation scheduling [2,3,4], and it serves as a foundational component of intelligent transportation systems (ITSs) [5].
Traditional traffic forecasting methods primarily rely on time series models such as the Kalman filter and the autoregressive integrated moving average (ARIMA) model [6]. Although these approaches are effective in capturing temporal dependencies, they are often inadequate for modeling the complex spatio-temporal correlations that characterize traffic data. In recent years, the rise of deep learning has led to the development of various neural network-based models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph convolutional networks (GCNs), and attention-based frameworks, which are designed to extract both spatial and temporal features from traffic observations [7,8,9]. These models typically handle spatial and temporal dependencies through sequential, parallel, or decoupled processing strategies [10,11,12]. However, despite continuous advances in architectural design, the improvement in predictive performance has gradually reached a plateau.
In recent years, large language models (LLMs) have achieved remarkable success across a wide range of domains, including natural language processing and computer vision [13,14]. Compared to conventional neural architectures, LLMs possess superior representation capabilities and can be adapted to new tasks with minimal fine-tuning, thereby eliminating the need for extensive architectural modifications. Motivated by these advantages, an increasing number of studies have investigated the application of LLMs to time series forecasting, covering both short-term and long-term prediction tasks. However, in long-horizon scenarios such as traffic flow forecasting, existing LLM-based methods have yet to achieve satisfactory performance [15,16,17,18]. This limitation is primarily due to two factors. First, the high computational costs associated with LLM training and inference hinder their practical deployment in resource-constrained environments. Second, there exists a substantial domain gap between natural language and structured traffic data, which reduces the transferability of pre-trained models.
To alleviate the computational burden, some studies adopt a frozen pre-trained (FPT) fine-tuning strategy, freezing core LLM components such as the feed-forward and multi-head attention modules during training. While this reduces overhead and improves generalization in traffic tasks, most existing approaches treat spatial and temporal embeddings as separate, independent features, largely ignoring their complex interactions. To our knowledge, only a few attempts, such as ST-LLM [19] and ST-LLM++ [20], have addressed this interaction by concatenating embeddings followed by pointwise convolution. Although these methods increase representational capacity, they fail to explicitly model the cross-dependencies among spatial, temporal, and node-level features, which are crucial for capturing intricate traffic dynamics [21].
Existing traffic forecasting models struggle to simultaneously capture structural graph information and complex spatio-temporal dependencies, often leading to incomplete feature representation and suboptimal predictive accuracy. To address this limitation, we propose the GSF-LLM (graph-enhanced spatio-temporal fusion-based large language model), a novel architecture that effectively integrates graph-based spatial topology with temporal dynamics and semantic information. Leveraging the powerful contextual understanding and generalization capabilities of large language models (LLMs) enables the GSF-LLM to interpret heterogeneous traffic patterns as structured sequences, enhancing its ability to model long-term dependencies and adapt to unseen scenarios. As illustrated in Figure 1, the model employs a multi-branch embedding module to extract token-level, temporal, and spatial representations, which are deeply fused through a multi-head cross-attention mechanism to model intricate cross-modal interactions. To further enhance efficiency without sacrificing accuracy, a frozen pre-trained large language model is fine-tuned using the lightweight low-rank adaptation (LoRA) technique, enabling targeted parameter updates. A regression head is then used to produce multi-step traffic forecasts. By unifying these components, the GSF-LLM overcomes the limitations of prior approaches, achieving more comprehensive spatio-temporal representation learning and improved adaptability to large-scale traffic networks.
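As a rough illustration of the fusion step described above, the following NumPy sketch applies multi-head cross-attention between one embedding branch and the other two. It is not the model's implementation: the choice of the token-level embedding as the query and the concatenated temporal and spatial embeddings as keys and values, the dimensions, the random weights, and all function names are assumptions made for this example only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def split_heads(x, num_heads):
    """Reshape (length, dim) into (num_heads, length, dim // num_heads)."""
    length, dim = x.shape
    return x.reshape(length, num_heads, dim // num_heads).transpose(1, 0, 2)


def cross_attention(query, context, weights, num_heads):
    """Multi-head cross-attention: rows of `query` attend over rows of `context`."""
    n, d = query.shape
    if d % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    q = split_heads(query @ weights["wq"], num_heads)
    k = split_heads(context @ weights["wk"], num_heads)
    v = split_heads(context @ weights["wv"], num_heads)
    # Scaled dot-product scores, shape (heads, n_query, n_context).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d // num_heads)
    attn = softmax(scores, axis=-1)
    # Merge the heads back into a (n_query, dim) matrix.
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ weights["wo"], attn


rng = np.random.default_rng(0)
num_nodes, dim, heads = 5, 8, 2  # toy sizes, chosen only for the example
token_emb = rng.normal(size=(num_nodes, dim))
temporal_emb = rng.normal(size=(num_nodes, dim))
spatial_emb = rng.normal(size=(num_nodes, dim))
weights = {
    name: rng.normal(scale=dim ** -0.5, size=(dim, dim))
    for name in ("wq", "wk", "wv", "wo")
}

# Assumption: token embeddings query the stacked temporal and spatial embeddings.
context = np.concatenate([temporal_emb, spatial_emb], axis=0)
fused, attn = cross_attention(token_emb, context, weights, heads)
print(fused.shape, attn.shape)  # (5, 8) (2, 5, 10)
```

Each row of `attn` sums to one, so every node's fused representation is a convex combination of projected temporal and spatial embeddings per head, which is what distinguishes this kind of fusion from plain concatenation.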
The main contributions of this article are summarized as follows:
1. We propose the GSF-LLM, a unified framework that effectively integrates spatio-temporal embeddings with structural graph information. By combining frozen pre-training with parameter-efficient fine-tuning via LoRA, the model achieves an optimal trade-off between predictive performance and computational efficiency.
2. We design a multi-head cross-attention fusion module that explicitly captures the dependencies among spatial, temporal, and node-specific features. This enables the model to represent complex traffic patterns characterized by nonlinear trends, multi-scale periodicity, spatial heterogeneity, and abrupt anomalies, thereby improving its adaptability to diverse traffic scenarios.
3. Extensive experiments on multiple benchmark traffic datasets verify the superiority of the GSF-LLM in terms of both predictive accuracy and robustness, underscoring its potential for deployment in real-world intelligent transportation systems.
The remainder of this paper is organized as follows. Section 2 discusses related work on LLMs for time series analysis and traffic prediction. Section 3 introduces the problem definition. Section 4 details the GSF-LLM, followed by the experiments in Section 5. Section 6 concludes the paper.
3. Problem Definition
In this section, we define the spatio-temporal traffic prediction problem and clarify the notations used throughout the paper. The key symbols are summarized in Table 1.
We represent the traffic data as a three-dimensional tensor $\mathbf{X} \in \mathbb{R}^{T \times N \times C}$, where T is the number of time steps, N is the number of spatial locations (e.g., traffic stations or sensors), and C is the number of traffic-related features (e.g., pick-up and drop-off counts). For instance, when C = 1, the input is a univariate time series indicating the traffic flow at each location.
The traffic network is modeled as a graph $G = (V, E, A)$, where V is the set of nodes such that $|V| = N$, with each node corresponding to a spatial location. The edge set $E \subseteq V \times V$ defines the spatial connectivity between nodes. The adjacency matrix $A \in \mathbb{R}^{N \times N}$ encodes spatial proximity, which can be computed from road distances, physical connectivity, or other similarity criteria.
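One common way to derive an adjacency matrix from pairwise road distances is a thresholded Gaussian kernel. The sketch below shows that construction as an example only; the text does not state which criterion is used here, and the function name, the threshold value, and the toy distance matrix are assumptions.

```python
import numpy as np


def gaussian_kernel_adjacency(dist, threshold=0.1):
    """Weighted adjacency from pairwise distances via a thresholded Gaussian kernel.

    dist: (N, N) array of non-negative distances; np.inf marks unreachable pairs.
    Weights below `threshold` are set to zero to keep the graph sparse.
    """
    dist = np.asarray(dist, dtype=float)
    # Scale by the standard deviation of the finite distances
    # (assumes the distances are not all identical, so sigma > 0).
    sigma = dist[np.isfinite(dist)].std()
    weights = np.exp(-np.square(dist / sigma))
    weights[weights < threshold] = 0.0
    return weights


# Toy example with three locations (distances in arbitrary units).
dist = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
])
adj = gaussian_kernel_adjacency(dist)
print(np.round(adj, 3))
```

Nearby locations receive weights close to one, while the distant pair (0, 2) falls below the threshold and is disconnected.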
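Given historical traffic data over P time steps, denoted as $\mathbf{X}_P = \{\mathbf{X}_{t-P+1}, \ldots, \mathbf{X}_t\} \in \mathbb{R}^{P \times N \times C}$, and the corresponding traffic graph G, the goal is to learn a mapping function $f_\theta(\cdot)$, parameterized by θ, that predicts traffic conditions over the next S time steps $\mathbf{Y}_S = \{\mathbf{Y}_{t+1}, \ldots, \mathbf{Y}_{t+S}\} \in \mathbb{R}^{S \times N \times C}$. That is,

$$[\mathbf{X}_{t-P+1}, \ldots, \mathbf{X}_t;\, G] \xrightarrow{f_\theta(\cdot)} [\mathbf{Y}_{t+1}, \ldots, \mathbf{Y}_{t+S}],$$

where each $\mathbf{X}_i \in \mathbb{R}^{N \times C}$.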
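The problem definition implies a sliding-window construction of training samples: P consecutive historical steps as input and the following S steps as target. A minimal NumPy sketch of this step is given below; the function name and the toy tensor are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np


def make_windows(data, p, s):
    """Slice a (T, N, C) tensor into input/target windows.

    Returns inputs of shape (num_samples, P, N, C) and
    targets of shape (num_samples, S, N, C).
    """
    num_samples = data.shape[0] - p - s + 1
    if num_samples <= 0:
        raise ValueError("not enough time steps for the requested window sizes")
    inputs = np.stack([data[i:i + p] for i in range(num_samples)])
    targets = np.stack([data[i + p:i + p + s] for i in range(num_samples)])
    return inputs, targets


# Toy tensor: T = 30 time steps, N = 2 locations, C = 1 feature.
data = np.arange(30 * 2 * 1, dtype=float).reshape(30, 2, 1)
x, y = make_windows(data, p=12, s=12)
print(x.shape, y.shape)  # (7, 12, 2, 1) (7, 12, 2, 1)
```

Each target window starts exactly one step after its input window ends, so no future information leaks into the input.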
5. Experiments
This section presents the experimental setup, including the datasets, evaluation metrics, and baseline models used to assess the effectiveness of the proposed GSF-LLM framework.
5.1. Datasets
To verify the robustness and generalizability of the proposed model, we conduct experiments on two real-world urban mobility datasets: NYCTaxi and CHBike. Both datasets record spatio-temporal traffic demand over three consecutive months, as summarized in Table 2.
The NYCTaxi dataset comprises over 35 million taxi trip records in New York City, discretized into 266 virtual stations based on pick-up locations. It covers a time span from 1 April to 30 June 2016, with each time step representing a 30-min interval, resulting in 4368 temporal steps.
The CHBike dataset contains approximately 2.6 million bike-sharing orders during the same period. After removing stations with low activity, the dataset focuses on the top 250 most frequently used stations. It shares the same temporal resolution and range as NYCTaxi.
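The number of temporal steps follows directly from the date range and the 30-min resolution. The short check below reproduces the arithmetic; the helper name is illustrative.

```python
from datetime import date

MINUTES_PER_DAY = 24 * 60


def count_time_steps(start, end, interval_minutes):
    """Number of fixed-length intervals covering start..end, both days inclusive."""
    days = (end - start).days + 1
    return days * (MINUTES_PER_DAY // interval_minutes)


steps = count_time_steps(date(2016, 4, 1), date(2016, 6, 30), 30)
print(steps)  # 4368
```

The range spans 91 days with 48 half-hour intervals each, which gives the 4368 steps reported for both datasets.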
5.2. Baselines
To evaluate the effectiveness of the proposed model, we conduct a comparative analysis with representative baseline models for traffic prediction, categorized into three major groups: graph neural network (GNN)-based models, attention-based models, and large language model (LLM)-based models.
DCRNN [29]: Models traffic data as a directed graph and introduces a diffusion convolutional recurrent network.
STGCN [30]: Combines spatial graph convolution with temporal 1D convolution to address traffic time series forecasting.
GWN [28]: Utilizes graph convolution with an adaptive adjacency matrix to capture spatial dependencies.
AGCRN [50]: Employs adaptive graph convolutional recurrent networks to learn node-specific features and inter-series dependencies.
STG-NCDE [51]: Introduces a graph neural-controlled differential equation framework for traffic prediction.
DGCRN [52]: Proposes a dynamic graph convolutional recurrent network tailored for traffic forecasting.
ASTGCN [53]: Integrates spatial–temporal attention mechanisms into GNNs for more effective traffic forecasting.
GMAN [54]: Employs an encoder–decoder architecture with multi-level attention to capture temporal and spatial patterns.
ASTGNN [5]: Focuses on learning dynamic and heterogeneous traffic patterns using attention mechanisms.
LLM-based models:
OFA [55]: A GPT-2-based model that freezes self-attention and feed-forward modules within its residual blocks; we adapt it by inverting the traffic data view to improve performance.
GATGPT [37]: Combines a graph attention network (GAT) with GPT-2; we also implement a variant where GAT follows the GPT-2 backbone.
GCNGPT [19]: Integrates a graph convolutional network (GCN) with the frozen pre-trained (FPT) GPT-2 architecture.
LLaMA-2 [43]: A suite of pre-trained and fine-tuned LLMs developed by Meta; we employ a frozen pre-trained transformer from LLaMA-2 in our setup.
ST-LLM [19]: A preliminary approach introducing spatio-temporal LLMs with partially frozen attention modules.
5.3. Evaluation Metrics
To quantitatively evaluate the prediction performance, we adopt the following metrics commonly used in traffic forecasting:
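MAE (mean absolute error):
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
RMSE (root mean square error):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}$$
MAPE (mean absolute percentage error):
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\left| y_i \right| + \epsilon}$$
WAPE (weighted absolute percentage error):
$$\mathrm{WAPE} = \frac{\sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|}{\sum_{i=1}^{n} \left| y_i \right| + \epsilon}$$
where $\hat{y}_i$ and $y_i$ denote the predicted and ground truth values, respectively, $n$ is the number of evaluated values, and $\epsilon$ is a small constant to avoid division by zero.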
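For reference, the four metrics can be sketched in NumPy as follows. These are the standard textbook definitions; the placement of the small constant `eps` in the MAPE and WAPE denominators, its default value, and the function names are assumptions, since the exact form used in the experiments is not recoverable from the text. MAPE and WAPE are returned as fractions rather than percentages, in line with the values reported in the ablation study.

```python
import numpy as np


def mae(pred, true):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - true)))


def rmse(pred, true):
    """Root mean square error."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))


def mape(pred, true, eps=1e-8):
    """Mean absolute percentage error (as a fraction); eps guards zero targets."""
    return float(np.mean(np.abs(pred - true) / (np.abs(true) + eps)))


def wape(pred, true, eps=1e-8):
    """Weighted absolute percentage error (as a fraction)."""
    return float(np.sum(np.abs(pred - true)) / (np.sum(np.abs(true)) + eps))


true = np.array([10.0, 20.0, 5.0, 40.0])
pred = np.array([12.0, 18.0, 6.0, 40.0])
print(mae(pred, true), rmse(pred, true), mape(pred, true), wape(pred, true))
```

WAPE weights errors by total demand, so it is less sensitive than MAPE to locations with very small ground truth values, which are common in sparse pick-up and drop-off data.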
5.4. Implementation Details
Following standard practice, we split the NYCTaxi and CHBike datasets into training, validation, and test sets with a 6:2:2 ratio. Both the number of historical time steps P and the number of prediction steps S are set to 12. We define $T_w = 7$ to represent the seven days of a week and $T_d = 48$, corresponding to 30-min intervals across a day.
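The split and the calendar indices can be illustrated with the sketch below. It assumes a chronological split over raw time steps (the text does not say whether the split is applied before or after windowing) and that the series starts on 1 April 2016, a Friday; the function names are illustrative.

```python
import numpy as np


def chronological_split(num_steps, ratios=(0.6, 0.2, 0.2)):
    """Return (train, val, test) index ranges in temporal order."""
    train_end = int(num_steps * ratios[0])
    val_end = train_end + int(num_steps * ratios[1])
    return range(0, train_end), range(train_end, val_end), range(val_end, num_steps)


def time_indices(num_steps, steps_per_day=48, first_weekday=4):
    """Time-of-day and day-of-week index for every step.

    first_weekday=4 encodes Friday with Monday=0, matching 1 April 2016.
    """
    t = np.arange(num_steps)
    time_of_day = t % steps_per_day
    day_of_week = (t // steps_per_day + first_weekday) % 7
    return time_of_day, day_of_week


train, val, test = chronological_split(4368)
print(len(train), len(val), len(test))  # 2620 873 875

time_of_day, day_of_week = time_indices(4368)
print(time_of_day.max() + 1, day_of_week.max() + 1)  # 48 7
```

The two index arrays take 48 and 7 distinct values, respectively, which is the vocabulary size a time-of-day and a day-of-week embedding table would need.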
All experiments were conducted on a system equipped with an NVIDIA RTX 5090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). LLM-based models were trained using the Ranger21 optimizer (Lawrence Berkeley National Laboratory, Berkeley, CA, USA) with a learning rate of 0.001, while GCN and attention-based models used the Adam optimizer with the same learning rate. The language models include GPT-2 (6 layers) [53] and LLaMA-2 7B (8 layers) [41]. We used a batch size of 64, trained each model for up to 300 epochs, and set the random seed to 6666.
5.5. Main Results
Table 3 and Table 4 present the performance comparison of the GSF-LLM with baseline models across four traffic prediction tasks (bike drop-off, bike pick-up, taxi drop-off, taxi pick-up) in terms of MAE, RMSE, MAPE, and WAPE. The results demonstrate that the GSF-LLM achieves the best performance on most evaluated metrics, confirming its superiority in capturing complex spatio-temporal dependencies in traffic networks. The key findings are outlined as follows:
The GSF-LLM consistently delivers the best overall performance across all tasks and evaluation metrics. Specifically, for both pick-up and drop-off prediction tasks on the NYCTaxi and CHBike datasets, the GSF-LLM achieves the lowest scores on MAE, RMSE, MAPE, and WAPE, surpassing all baseline models. For instance, on the CHBike drop-off task, the GSF-LLM achieves an MAE of 1.88, outperforming strong baselines such as the ST-LLM (1.89), DGCRN (1.96), and GATGPT (1.95).
Compared to the ST-LLM, which already benefits from partially frozen attention and spatio-temporal embeddings, the GSF-LLM demonstrates further improvements, particularly on MAPE and WAPE—metrics crucial for assessing relative prediction accuracy. The GSF-LLM introduces enhanced fusion mechanisms and efficient fine-tuning strategies, resulting in an average relative improvement of 1.8% in MAE and 2.3% in WAPE across all four tasks.
When compared to other LLM-based baselines (OFA, GATGPT, GCNGPT, and LLaMA-2), the GSF-LLM exhibits substantial advantages. OFA and LLaMA-2 often suffer from inadequate spatio-temporal embeddings, while GATGPT and GCNGPT fail to fully leverage spatial–temporal dependencies. The GSF-LLM overcomes these limitations by integrating a spatially aware tokenization scheme and a dedicated fusion strategy tailored for traffic data.
Attention-based and GNN-based models, such as the GMAN, STSGCN, and AGCRN, achieve reasonable results on certain tasks. However, they generally struggle to capture the complex spatial–temporal dynamics present in real-world traffic data, and their performance is less stable across different datasets. In contrast, the GSF-LLM demonstrates stronger generalization and robustness.
In summary, these findings confirm that the GSF-LLM achieves a new state-of-the-art in traffic prediction. Its architectural innovations, particularly those in spatio-temporal token representation, fusion design, and fine-tuning methodology, enable effective generalization and high prediction accuracy across diverse traffic scenarios.
5.6. Ablation Study
To assess the contribution of each key component in the GSF-LLM to the overall performance, we conducted a series of ablation experiments. Specifically, we designed four model variants by removing or altering core modules to evaluate their respective impacts:
w/o Temporal Embedding: The temporal embedding module is removed to examine the importance of explicit temporal context modeling.
w/o Node Embedding: The node embedding module is excluded to evaluate its role in capturing spatial heterogeneity across regions.
w/o CrossAttention Fusion: The multi-head cross-attention fusion module is replaced with simple feature concatenation, in order to assess the effectiveness of deep spatio-temporal feature interaction.
w/o Frozen: The GPT backbone is fully fine-tuned and the LoRA structure is removed, to test the effect of parameter-efficient fine-tuning strategies.
All ablated variants are trained and evaluated under the same settings as the full GSF-LLM, across four prediction tasks derived from the NYCTaxi and CHBike datasets: bike pick-up, bike drop-off, taxi pick-up, and taxi drop-off. We employ four evaluation metrics, namely MAE, RMSE, MAPE, and WAPE, to comprehensively assess model performance.
As illustrated in Figure 3, the full GSF-LLM consistently outperforms all ablated variants across all tasks and metrics, validating the effectiveness of its full architecture. Detailed observations are as follows:
Temporal embedding significantly enhances the model’s ability to capture sequential patterns, with particularly noticeable improvements on the CHBike dataset. After removing the temporal embedding (w/o Temporal Embedding), performance declines across all tasks. For instance, in the taxi pick-up task, WAPE increases from 0.1969 to 0.2062, and MAE increases from 5.1594 to 5.4510, indicating the critical role of explicit temporal encoding.
Node embedding is crucial for modeling spatial heterogeneity. The w/o Node Embedding variant exhibits the most severe performance degradation. In the taxi drop-off task, MAE rises from 5.0593 to 6.6812, and MAPE increases from 0.3609 to 0.4790, demonstrating that ignoring spatial context substantially harms predictive accuracy.
Cross-attention fusion facilitates effective spatio-temporal feature interaction. When it is replaced by simple concatenation (w/o CrossAttention Fusion), performance deteriorates across all metrics. For example, in the bike pick-up task, MAE increases from 1.9911 to 2.0062, and WAPE increases from 0.4010 to 0.4041, confirming the module’s efficacy in modeling complex interactions.
Frozen GPT with LoRA fine-tuning improves generalization and reduces training cost. The w/o Frozen variant, where the entire GPT is fine-tuned without LoRA, consistently underperforms the GSF-LLM. In the taxi pick-up task, WAPE increases from 0.1969 to 0.2155, and MAPE increases from 0.3545 to 0.3682, showing that the combination of LoRA and layer freezing offers both efficiency and better generalization.
The ablation study demonstrates that temporal embedding, node embedding, cross-attention fusion, and parameter-efficient fine-tuning (e.g., LoRA) are all essential to the superior performance of the GSF-LLM. The synergy of these components enables the model to effectively capture complex spatio-temporal dynamics and make accurate predictions across diverse urban traffic scenarios.
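To put the ablation numbers quoted above on a common scale, the following snippet converts each reported pair of values into a relative increase in error. Only figures stated in the text are used; the helper name and the dictionary layout are illustrative.

```python
def relative_change(full, ablated):
    """Relative increase in error (in percent) when a component is removed."""
    return (ablated - full) / full * 100.0


# variant -> (task, metric, full model, ablated variant), as reported in the text
reported = {
    "w/o Temporal Embedding": ("taxi pick-up", "MAE", 5.1594, 5.4510),
    "w/o Node Embedding": ("taxi drop-off", "MAE", 5.0593, 6.6812),
    "w/o CrossAttention Fusion": ("bike pick-up", "MAE", 1.9911, 2.0062),
    "w/o Frozen": ("taxi pick-up", "WAPE", 0.1969, 0.2155),
}

changes = {
    variant: relative_change(full, ablated)
    for variant, (_, _, full, ablated) in reported.items()
}
for variant, (task, metric, _, _) in reported.items():
    print(f"{variant:27s} {task:14s} {metric:5s} +{changes[variant]:.2f}%")
```

On these examples the increases are roughly 5.7%, 32.1%, 0.8%, and 9.4%, respectively. The figures come from different tasks and metrics, so they indicate the order of magnitude of each effect rather than a strict ranking of the components.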
5.7. Parameter Analysis
In the GSF-LLM framework illustrated in Figure 2, the hyperparameter U plays a pivotal role, as it specifies the number of unfrozen multi-head graph attention layers during training. Figure 4 shows how different values of U influence model performance across various metrics for the NYCTaxi and CHBike drop-off datasets.
For the NYCTaxi drop-off dataset under the WAPE metric (Figure 4a), model performance improves as U increases to 1, suggesting that unfreezing more layers up to this point enhances predictive accuracy. However, when U exceeds 1, performance declines, indicating potential overfitting or diminishing returns. A similar trend is observed under the MAE metric (Figure 4b), where U = 1 achieves the lowest error, confirming that this value provides an optimal balance between model complexity and prediction accuracy. For the CHBike drop-off dataset, Figure 4c and Figure 4d present the results under the WAPE and MAE metrics, respectively. In Figure 4c, the lowest WAPE is obtained when U = 2, marking the best model performance. Likewise, Figure 4d shows that the MAE reaches its minimum at U = 2, beyond which both metrics degrade. These results suggest that for the CHBike dataset, unfreezing two graph attention layers achieves an optimal trade-off between generalization and fine-tuning capacity, maintaining model simplicity while ensuring superior predictive performance.
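The role of U can be pictured as a per-layer freezing plan. The sketch below is a schematic only: it assumes that the last U layers are the ones whose attention is unfrozen and that feed-forward sublayers always stay frozen, neither of which is stated in this section, and the function and key names are hypothetical.

```python
def freeze_plan(num_layers, num_unfrozen):
    """Describe which sublayers are trainable when the last U layers are unfrozen."""
    if not 0 <= num_unfrozen <= num_layers:
        raise ValueError("num_unfrozen must be between 0 and num_layers")
    plan = []
    for idx in range(num_layers):
        plan.append({
            "layer": idx,
            # Assumption: only the attention of the final U layers is updated.
            "attention_trainable": idx >= num_layers - num_unfrozen,
            # Assumption: feed-forward sublayers are frozen in every layer.
            "feed_forward_trainable": False,
        })
    return plan


plan = freeze_plan(num_layers=6, num_unfrozen=2)
for entry in plan:
    print(entry)
```

With six retained layers and U = 2, layers 0 to 3 stay fully frozen and only the attention in layers 4 and 5 receives gradient updates, which is why small values of U keep the trainable parameter count low.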
The GPT-2 model consists of 12 layers, and in traffic flow prediction tasks, it is common to use 6 layers as the base model for re-training. As the number of retained layers increases, the computational cost also rises. To compare the effects of different layer depths, we set U to 2 and the batch size to 32. Figure 5 illustrates how varying the number of layers influences multiple performance metrics on the NYCTaxi and CHBike drop-off datasets.
For the NYCTaxi drop-off dataset under the WAPE metric (Figure 5a), model performance improves significantly as the number of layers increases, indicating that deeper architectures enhance predictive accuracy within this range. Similarly, for the MAE metric shown in Figure 5b, a comparable trend is observed: the minimum error occurs when all 12 layers are retained, suggesting that this configuration achieves the best balance between model complexity and prediction accuracy. For the CHBike drop-off dataset, Figure 5c and Figure 5d show the results under the WAPE and MAE metrics, respectively. In Figure 5c, the lowest WAPE is achieved when six layers are retained, representing the optimal model performance. Likewise, Figure 5d demonstrates that the MAE also reaches its minimum at six layers, after which performance declines as additional layers are included. These results indicate that for the CHBike dataset, retaining six layers strikes an optimal balance between generalization and fine-tuning capability, maintaining model simplicity while ensuring superior predictive performance.
Table 5 presents a comparison of the trainable parameters between the two models, ST-LLM and GSF-LLM. The results highlight the substantial reduction in trainable parameters achieved by the GSF-LLM compared with its predecessor across both the NYCTaxi and CHBike datasets. Although the GSF-LLM contains a slightly higher total number of parameters due to the integration of the LoRA-augmented partially frozen graph attention (PFGA) mechanism, it significantly reduces the proportion of trainable parameters. Specifically, the GSF-LLM requires only 11.48% and 11.47% trainable parameters for the NYCTaxi and CHBike datasets, respectively, compared with 51.40% and 54.26% for the ST-LLM. This notable reduction demonstrates the efficiency of the LoRA-based adaptation strategy, which not only decreases computational overhead but also enhances the model’s generalization ability by preserving the pre-trained foundational knowledge. Consequently, the GSF-LLM offers a more scalable and computationally efficient solution for spatio-temporal prediction tasks.
To further investigate the effect of the LoRA rank parameter (r) on model performance, we conducted a sensitivity analysis over several values of r with a fixed LoRA alpha of 32. The results on both the NYCTaxi and CHBike drop-off datasets are shown in Figure 6.
As illustrated in Figure 6a–d, performance variation across different ranks remains minimal, indicating the robustness and stability of the GSF-LLM. Specifically, for the NYCTaxi dataset, WAPE fluctuates slightly within 0.20–0.21, and MAE varies around 5.3–5.4. For CHBike, WAPE remains approximately 0.38–0.39, and MAE around 1.88–1.90. The best trade-off is achieved at r = 16, where the model attains slightly higher accuracy while preserving reasonable parameter size and computational efficiency.
When r is too low, the reduced subspace capacity constrains the model’s ability to capture complex spatio-temporal dependencies, leading to minor degradation in prediction accuracy. Conversely, further increasing r beyond 32 yields negligible improvement, suggesting that the adaptation capacity of LoRA saturates beyond this level. Overall, a medium-rank configuration (r = 16) provides an optimal balance between fine-tuning efficiency and predictive performance for large-scale traffic forecasting tasks.
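The cost side of this trade-off can be made concrete with the standard LoRA bookkeeping: an adapter on a weight matrix of size d_out × d_in adds r(d_in + d_out) trainable parameters, and its update is scaled by alpha / r. The sketch below assumes square 768 × 768 projections (the hidden size of the 12-layer GPT-2) and an illustrative set of ranks; which matrices are actually adapted and the exact ranks evaluated are not specified here.

```python
def lora_adapter_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


def lora_scale(alpha, rank):
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


D_MODEL = 768  # hidden size of the 12-layer GPT-2; square projections assumed
ALPHA = 32     # fixed LoRA alpha stated in the text

for rank in (4, 8, 16, 32, 64):  # illustrative ranks, not necessarily the evaluated grid
    added = lora_adapter_params(D_MODEL, D_MODEL, rank)
    share = added / (D_MODEL * D_MODEL)
    print(f"r={rank:2d}  added={added:6d}  share={share:.2%}  scale={lora_scale(ALPHA, rank):.2f}")
```

The adapter size grows linearly with r, while the scaling factor alpha / r shrinks, so with alpha fixed at 32 a larger rank adds capacity but applies each update with a smaller weight. This is consistent with the flat accuracy curves reported for the rank sweep.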
To evaluate the computational efficiency of the GSF-LLM, we compared its training cost and model size against three strong baselines (GATGPT, GCNGPT, and ST-LLM) on both the NYCTaxi and CHBike datasets. Table 5 summarizes the model size, training time, and GPU memory usage of the four models. As shown, the GSF-LLM achieves a substantial reduction in the proportion of trainable parameters (approximately 11.5% of the total, compared with over 50% for the ST-LLM), demonstrating the effectiveness of the LoRA-based partial fine-tuning strategy. Although the GSF-LLM contains slightly more total parameters due to the integration of the PFGA module, this design efficiently reduces redundant parameter updates while preserving the generalization capability of the pre-trained backbone.
In terms of computational efficiency, the GSF-LLM maintains moderate training time and memory usage (around 13 GB), positioned between the lightweight GCNGPT and the more complex ST-LLM. Compared with GATGPT, it avoids the excessive memory and latency caused by graph attention operations. These results indicate that the GSF-LLM achieves an optimal balance between adaptability, training efficiency, and resource consumption, making it both scalable and practical for spatio-temporal traffic prediction tasks.
5.8. Few-Shot Prediction
In the few-shot prediction setting, the large language models (LLMs) are trained using only 10% of the available data. The experimental results presented in Table 6 demonstrate the strong few-shot learning capability of the GSF-LLM. As shown, the GSF-LLM consistently outperforms other LLM-based models, indicating its robustness in identifying complex spatio-temporal patterns even under data-scarce conditions. This superior performance can be attributed to the effectiveness of the partially frozen graph attention (PFGA) mechanism, which enables the model to capture spatial dependencies through graph-based attention despite limited training samples.
For example, the GSF-LLM achieves a 9.29% reduction in MAE compared with LLaMA-2, and a 2.41% reduction compared with the ST-LLM on the NYCTaxi pick-up dataset. These improvements highlight the impact of the PFGA-enhanced architecture and the LoRA-augmented fine-tuning strategy, both of which contribute to improved adaptation and generalization.
While OFA, GATGPT, GCNGPT, and ST-LLM also exhibit commendable few-shot performance, they still fall short of the GSF-LLM in prediction accuracy. For instance, although OFA performs relatively well on the CHBike drop-off dataset, the GSF-LLM surpasses it with a 7.77% improvement in MAE. Moreover, when compared with GATGPT, GCNGPT, and ST-LLM, the GSF-LLM achieves average MAE improvements of approximately 29%, 37%, and 8% across all datasets, respectively. These results underscore the substantial advancements introduced by the GSF-LLM over its predecessors and competing LLM-based approaches, establishing it as a superior and more efficient framework for few-shot traffic prediction.
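A few-shot training subset of this kind can be drawn as in the sketch below. It assumes that the 10% is taken from the start of the training portion and kept in temporal order; the text does not specify how the subset is selected, and the function name is illustrative.

```python
def few_shot_subset(num_train, fraction=0.1):
    """Indices of the leading fraction of training samples, in temporal order."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(num_train * fraction))
    return range(keep)


# Example with 2620 training steps (60% of the 4368 half-hour steps).
subset = few_shot_subset(2620)
print(len(subset))  # 262
```

Under this assumption the model sees only about five and a half days of half-hourly data, so it observes each weekday at most once during training.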
5.9. Zero-Shot Prediction
The zero-shot prediction experiments are designed to evaluate the intra-domain and inter-domain knowledge transfer capabilities of large language models (LLMs). In these experiments, each model predicts traffic flow in the CHBike dataset after being trained solely on the NYCTaxi dataset, without any prior exposure to CHBike data. The corresponding results are summarized in Table 7.
For intra-domain transfer, such as predicting NYCTaxi drop-off flow based on NYCTaxi pick-up flow, the GSF-LLM achieves high prediction accuracy, maintaining lower error rates than the other models. This indicates the GSF-LLM’s strong ability to capture and transfer complex spatio-temporal dependencies within the same domain.
Furthermore, the GSF-LLM also demonstrates exceptional performance in inter-domain transfer tasks, such as transferring knowledge from the NYCTaxi domain to the CHBike domain. Across both MAE and RMSE metrics, the GSF-LLM consistently outperforms all comparison models, highlighting its robustness and superior generalization capability to unseen domains without the need for re-training. Among the baseline models, LLaMA-2 shows strong results and surpasses models such as OFA, GATGPT, GCNGPT, and ST-LLM; however, it still falls short of the GSF-LLM across all evaluated settings.
The success of the GSF-LLM in zero-shot prediction can be attributed to the partially frozen graph attention (PFGA) strategy, which effectively enables the model to leverage learned representations for both intra-domain and inter-domain predictions. By incorporating selective graph-based attention, the GSF-LLM captures spatial dependencies more effectively and activates the LLM’s inherent reasoning and knowledge transfer capabilities, establishing it as a powerful and generalizable framework for zero-shot traffic prediction tasks.
6. Conclusions and Future Work
In this paper, we present the GSF-LLM, a graph-enhanced spatio-temporal fusion-based large language model for traffic prediction. The model is designed to address the challenge of capturing complex spatial and temporal dependencies in urban mobility networks, and it establishes a new benchmark for accurate and robust traffic forecasting.
Extensive experiments on the NYCTaxi and CHBike datasets confirm the effectiveness of the proposed approach. The GSF-LLM consistently surpasses strong baselines across four prediction tasks. For instance, in the CHBike drop-off task, it achieves an MAE of 1.88, outperforming the ST-LLM, DGCRN, and GATGPT. On average, the model reduces MAE by 1.8 percent and WAPE by 2.3 percent across all tasks. Ablation studies further demonstrate that temporal embedding, node embedding, cross-attention fusion, and LoRA-based fine-tuning are all essential, since the removal of any component results in significant performance degradation.
Despite the encouraging results, several limitations of the current GSF-LLM design suggest avenues for further improvement. First, the spatial embedding module employs a simple linear transformation, which may limit its ability to capture complex spatial dependencies. Future work will explore advanced graph representation techniques, such as GraphSAGE, to enhance spatial modeling. Second, the temporal embedding combines hour and day features in a basic manner, potentially failing to fully represent cyclical traffic patterns. To address this, periodic or Fourier-based temporal encodings will be investigated to improve the representation of temporal dynamics. Third, the regression head relies on a simple convolution, which may constrain multi-step forecasting accuracy; more sophisticated architectures will be considered to better model temporal evolution over longer horizons.
Moreover, the current evaluation focuses on historical 2016 datasets, which may not reflect recent changes in urban mobility, including post-pandemic traffic patterns. Expanding the evaluation to more recent and diverse datasets will help verify model generalization. In addition, the commonly used pointwise metrics do not quantify predictive uncertainty or capture qualitative aspects such as directional changes or anomalous events. Future studies will incorporate probabilistic metrics, such as CRPS, PICP, and PINBALL, and conduct detailed error analyses that account for holidays, weather conditions, and varying forecast horizons.
Finally, practical deployment considerations will guide further development. To mitigate risks associated with hallucinations or unintended outputs inherent to LLMs, security and ethics audits will be performed in line with current guidelines. Model interpretability will be enhanced through the visualization of node importance, temporal intervals, and attention mechanisms. The GSF-LLM will also be extended to broader applications, including traffic data imputation, trajectory generation, and anomaly detection, with the integration of multi-modal data sources and support for online learning to enable real-time prediction while ensuring ethical, privacy, and fairness standards.