SASTGCN: A Self-Adaptive Spatio-Temporal Graph Convolutional Network for Traffic Prediction
ISPRS Int. J. Geo-Inf. 2023, 12, 346

Abstract: Traffic prediction plays a significant part in building intelligent cities, supporting applications such as traffic management, urban computing, and public safety. Nevertheless, complex spatio-temporal dependencies and dynamically shifting patterns make it challenging. Existing mainstream traffic prediction approaches rely heavily on graph convolutional networks and sequence prediction methods to extract complicated spatio-temporal patterns statically. However, they neglect dynamic underlying correlations and thus fail to produce satisfactory predictions. Therefore, we propose a novel Self-Adaptive Spatio-Temporal Graph Convolutional Network (SASTGCN) for traffic prediction. The framework consists of a self-adaptive calibrator, a spatio-temporal feature extractor, and a predictor. In the self-adaptive calibrator, a self-supervisor built on an encoder-decoder structure extracts the distribution bias of the input. The concatenation of the bias and the original features is provided as input to the spatio-temporal feature extractor, which leverages transformer and graph convolution structures to learn the spatio-temporal pattern; a predictor then produces the final prediction. Extensive experiments on two public traffic prediction datasets (METR-LA and PEMS-BAY) demonstrate that SASTGCN surpasses recent state-of-the-art techniques on several metrics.


Introduction
Traffic prediction has become a crucial element of the Intelligent Traffic System (ITS) in recent years [1]. As seen in Figure 1, it has been extensively utilized in numerous tasks, including traffic speed forecasting [2], flow prediction [3], trip time estimation [4], bike-sharing allocation [5], and taxi order response (pick-up and drop-off) [6]. Reliable forecasts are increasingly essential for developing sensible travel and transportation strategies.
As a typical spatio-temporal forecasting task, traffic prediction is difficult, mostly as a result of the intricate spatio-temporal dependencies and dynamic changes in the spatio-temporal distribution. On the one hand, the spatial and temporal characteristics of traffic data are highly correlated, which means that they ought to be captured simultaneously. On the other hand, the temporal distribution of the data is susceptible to all kinds of external influences. For instance, road construction, new subway stations, and sudden weather changes are among the factors that could change statistics of traffic data such as the mean and variance.
Conventional techniques commonly employ recurrent neural networks, including long short-term memory (LSTM) [7], gated recurrent unit (GRU) [8], and their derivatives, to capture temporal dependencies in traffic data. Furthermore, convolutional neural networks (CNN) [9] are utilized to model spatial correlations between regions in grid-based traffic data, whereas graph neural networks (GNN) [10] are applied to graph-based traffic data. More recently, transformer-based architectures have been introduced to facilitate long-term traffic prediction.
However, these approaches neglect the fact that the statistical information of traffic data distributions (such as mean and variance) varies over time, thus leading to unsatisfactory prediction performance and poor generalization ability. Passalis et al. [11] introduced a deep adaptive neural network that is capable of dynamically learning temporal distribution shifts, thereby allowing the model to comprehensively predict both future sequences and anticipated mean and variance. Arik et al. [12] proposed a self-adaptive forecasting model that can adaptively encode the evolving distributions. Nonetheless, they failed to tackle the issue of concurrent distribution shifts across multiple correlated time series, and the dynamic shifts in the temporal distribution of traffic data cannot yet be modeled.
To address these issues, we developed a hybrid deep learning framework named Self-Adaptive Spatio-Temporal Graph Convolutional Network (SASTGCN). The framework comprises three main components: a self-adaptive calibrator, a spatio-temporal feature extractor, and a predictor. In the self-adaptive calibrator, we exploit a self-supervisor composed of an encoder-decoder module to derive the distribution bias of the input; both the encoder and the decoder consist of two layers of recurrent units. Taking the concatenation of the bias and the original features as input, the spatio-temporal feature extractor utilizes graph convolution structures as well as a transformer [13] layer to learn the spatio-temporal pattern. Finally, a predictor, which consists of a fully connected layer, is applied to produce the final forecast. To the best of our knowledge, this is the first attempt to analyze the temporal distribution shift in spatio-temporal prediction problems using an autoencoder. The principal contributions of this work can be summarized as follows:

• We construct a novel self-adaptive calibrator, which can obtain the temporal distribution of traffic data. The calibrator exploits a self-supervised encoder-decoder structure, where the encoder and decoder are both constructed using recurrent layers to better capture the temporal dynamic properties.

• We propose a spatio-temporal feature extractor to discover both spatial and temporal dependencies simultaneously by stacking ST blocks with residual connections. The ST blocks consist of graph convolution layers for spatial correlations and a transformer layer for temporal characteristics.

• Our model surpasses the state-of-the-art methods, as evidenced by a comprehensive range of experimental outcomes on two real-world benchmark datasets: METR-LA and PEMS-BAY.

Related Work
As a quintessential problem in spatio-temporal sequence prediction, traffic prediction plays a fundamental role in the advancement of smart cities. Consequently, it has attracted considerable attention from scholars and practitioners alike, leading to its rapid development as a research discipline. Traffic forecasting methods now fall into two broad groups: statistical techniques and data-driven techniques. Owing to a shortage of transportation data and processing resources, statistical methods dominated the early stages of traffic prediction. Representative techniques in this category include support vector regression (SVR) [14], auto-regressive integrated moving average (ARIMA) [15], logistic regression (LR) [16], localized extended Kalman filter (L-EKF) [17], and gradient boosting decision tree (GBDT) [18]. These methods overlook long-term temporal relations and treat each spatio-temporal sequence individually, which is far from satisfactory.
As data acquisition and deep learning methods advance swiftly [19], data-driven approaches have become mainstream. Compared to statistical models, data-driven models are better equipped to capture the highly non-linear, complex features in large-scale spatio-temporal data. Several neural network (NN) approaches, including artificial NNs [20] and deep belief networks (DBNs) [21], were adopted in traffic prediction during the preliminary stage. To mine the temporal correlations, researchers introduced recurrent neural networks such as LSTM into traffic prediction and obtained promising performance [22]. Liu et al. introduced a prediction framework that utilizes LSTM with feature partitioning and feature selection, which successfully enhanced prediction performance by incorporating feature engineering techniques [23]. The traffic state in a region is likely to be affected by its surrounding regions as well as by distant regions (for instance, when a subway connects two regions or the two regions serve related functions). To fully exploit the node-to-node spatial correlations, Zhang et al. put forth a model called ST-ResNet [24], in which the entire city is segmented into a regular grid map and the traffic data are projected as a series of images. In this way, a convolutional neural network can be exploited to obtain the spatial correlations. To further capture the temporal dependencies, Yao et al. introduced DMVST-Net [25], a model that combines CNN, graph embeddings, and LSTM to extract the spatio-temporal features.
Nevertheless, the approaches mentioned above are only suitable for regular, grid-like traffic data, while, in reality, the majority of traffic data are irregular non-Euclidean data, and projecting them onto a grid-like sequence harms prediction accuracy. The GCN is a network structure that is suitable for dealing with non-Euclidean data. Yu et al. were the first to apply graph convolutional networks to traffic prediction. They proposed a model, ST-GCN, that captures the temporal and spatial features using gated 1D convolution and graph convolution, respectively. Because the recurrent layers are replaced with 1D convolution, the computational cost is reduced considerably, and the model is therefore faster. However, the adjacency matrix in ST-GCN is pre-defined and static, while the spatial dependency between regions changes over time. Wu et al. proposed Graph WaveNet [26], in which a novel flexible dependency matrix is developed and learned through node embedding. Combined with stacked dilated 1D convolutions, Graph WaveNet can handle very long sequences as well as dynamic spatial correlations. A single graph can only represent one perspective of the relationship among different regions, while the correlation between regions is manifold. Li et al. proposed a spatio-temporal fusion graph neural network (STFGNN) [27] that, by combining different spatial and temporal graphs, could efficiently discover hidden spatio-temporal dependencies. As reported in [28], Cai et al. combined GCN with a transformer to model the spatio-temporal correlations and periodicity present in traffic data. However, none of these methods has successfully addressed the issue of data distribution shifts in spatio-temporal sequences.

Traffic Prediction
Traffic prediction is a typical spatio-temporal sequence prediction task. We formulate the road network as a graph $G = \langle V, E, A \rangle$, where $V$ denotes a finite set of nodes ($|V| = N$, where $N$ is the total number of nodes and each node corresponds to a specific region on the road network), $E$ is the set of edges between these nodes, and $A$ denotes the spatial adjacency matrix representing the similarity of nodes. If $V_i, V_j \in V$ and $(V_i, V_j) \in E$, then $A_{ij}$ equals 1; otherwise, $A_{ij}$ takes the value of 0. A common traffic prediction problem is shown in Figure 2. Given the previous $H$ traffic observations, it aims to forecast the traffic observations (speed, demand, flow, and so on) in the next $L$ time steps, which can be defined as follows:

$$X_{t+1:t+L} = f\left(X_{t-H+1:t}\right) \tag{1}$$

where $X_{t-H+1:t} \in \mathbb{R}^{H \times N \times d}$ represents the traffic observations from time step $t-H+1$ to $t$ (the value of $d$ corresponds to the dimension of the observation). $f$ denotes the model function to be learned, and it estimates the most likely traffic observations $X_{t+1:t+L} \in \mathbb{R}^{L \times N \times d}$ in the next $L$ time steps.
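The windowing implied by this formulation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the function name `make_windows` and the toy series are assumptions, and each time step is treated as an opaque item (in practice it would be an N × d snapshot).

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence, H: int, L: int) -> List[Tuple[list, list]]:
    """Split a time-ordered sequence into (history, target) pairs.

    history holds the H observations for steps t-H+1 .. t,
    target holds the L observations for steps t+1 .. t+L.
    """
    if H <= 0 or L <= 0:
        raise ValueError("H and L must be positive")
    samples = []
    # `end` is the exclusive index just past time step t.
    for end in range(H, len(series) - L + 1):
        history = list(series[end - H:end])
        target = list(series[end:end + L])
        samples.append((history, target))
    return samples


# Toy example: six time steps, each snapshot reduced to a single number.
toy_series = [10, 11, 12, 13, 14, 15]
pairs = make_windows(toy_series, H=3, L=2)
```

With H = 12 and L in {3, 6, 12}, the same routine would produce the input and target windows used in the experimental setup.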

Graph Convolutional Network
Standard convolution on regular grids is not suitable for graph-structured data, whereas the graph convolutional network has demonstrated superior performance in handling such data. Graph convolution derives features by aggregating information from neighboring nodes on the graph, which is similar to what ordinary convolution operations do on images. Nodes in the graph constantly change their state under the influence of nearby and distant points until reaching a final equilibrium, with closer neighbors exerting stronger influence. Spectral-based and spatial-based approaches are two common categories of GCN. The former apply convolutional filters in the spectral domain with graph Fourier transforms [29,30], and the latter combine the representation of a node with that of its neighbors to obtain a new representation for the node [31,32]. Based on the theory mentioned above, given a defined graph $G = \langle V, E, A \rangle$, a common paradigm of the graph convolutional operation can be defined as follows:

$$S^{(i+1)} = f\left(S^{(i)}, A\right) \tag{2}$$

where $S^{(i)}$ and $S^{(i+1)}$ are the graph signals (features) in layers $i$ and $i+1$, respectively, and $S^{(0)}$ is the initial input of the graph. $A$ denotes the adjacency matrix of the graph, and $f$ is the aggregation function. Specifically, Equation (2) can be written as follows:

$$S^{(i+1)} = \Theta *_G S^{(i)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} S^{(i)} \Theta\right) \tag{3}$$

where "$*_G$" is the convolution operator, $\Theta$ denotes the kernel, $\sigma$ represents a non-linear transformation function, $\hat{A} = A + I_N$ denotes the self-looping adjacency matrix, and $\hat{D}$ indicates the diagonal degree matrix of $\hat{A}$.
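As a concrete illustration of the normalized propagation rule described above, the following sketch applies one graph convolution layer using plain Python lists. It is a minimal sketch under stated assumptions: ReLU is assumed for the non-linearity, and the helper names (`matmul`, `normalized_adjacency`, `gcn_layer`) are hypothetical, not taken from the paper's code.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Return D^-1/2 (A + I) D^-1/2, with D the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj: Matrix, signal: Matrix, theta: Matrix) -> Matrix:
    """One layer: sigma(normalized_adj @ signal @ theta), sigma = ReLU (assumed)."""
    propagated = matmul(normalized_adjacency(adj), signal)
    transformed = matmul(propagated, theta)
    return [[max(0.0, v) for v in row] for row in transformed]


# Two connected nodes with one feature each; theta is the identity kernel.
adj = [[0.0, 1.0], [1.0, 0.0]]
signal = [[1.0], [3.0]]
theta = [[1.0]]
out = gcn_layer(adj, signal, theta)
```

In this toy graph each node averages its own feature with its neighbor's, so both outputs equal 2.0.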

Methodology
This section commences with an overview of the comprehensive architecture of our proposed model, SASTGCN. Subsequently, each component of SASTGCN is formally described. Finally, a detailed explanation is provided on how to optimize the algorithm.

Overall Architecture
Previous graph-based approaches have primarily concentrated on exploring the dynamic changes in spatial correlation by designing various adjacency matrices. Nevertheless, these approaches have overlooked the fact that the temporal distribution of the data also changes over time. For instance, the construction of new traffic infrastructure, unexpected weather changes, or new public policies can influence the mobility patterns of individuals and thereby alter the temporal distribution of traffic data. Therefore, it is crucial to devise an approach that accounts for both spatial and temporal variations in the data to capture the underlying patterns and trends more completely. Figure 3 shows the architecture of SASTGCN, which consists of three parts: (1) the self-adaptive calibrator; (2) the spatio-temporal feature extractor; and (3) the predictor. The self-adaptive calibrator is a self-supervised encoder-decoder framework whose encoder and decoder are each composed of two recurrent layers, which are adept at capturing temporal relations. The spatio-temporal feature extractor takes as input the merged bias and original features, and it utilizes graph convolution structures as well as a transformer layer to learn the spatio-temporal pattern. To resize the feature and produce the final output, a predictor consisting of a fully connected layer is additionally employed. The implementation of each component is detailed in the following subsections.

Self-Adaptive Calibrator
In the self-adaptive calibrator, we employ a self-supervised encoder-decoder structure to capture the temporal distribution of the data. The encoder and decoder are each composed of two recurrent layers. Given the input traffic observations $X_{in} \in \mathbb{R}^{H \times N \times d}$, we first utilize an encoder to model it and obtain the hidden feature $X_{hidden}$, which can be represented as follows:

$$X_{hidden} = \mathrm{LSTM}\left(\mathrm{LSTM}\left(X_{in}\right)\right) \tag{4}$$

where $\mathrm{LSTM}(\cdot)$ denotes a long short-term memory layer as depicted in Appendix A, and the second LSTM layer ingests the hidden states of the first LSTM layer as input.
The decoder utilizes the same network structure as the encoder to convert the hidden representation $X_{hidden}$ into the encoder-decoder output $X_{out1}$, which can be formulated as below:

$$X_{out1} = \mathrm{LSTM}\left(\mathrm{LSTM}\left(X_{hidden}\right)\right) \tag{5}$$

Then the bias between the original input and the output is calculated as $X_{bias} = X_{in} - X_{out1}$, which denotes the dynamic changes in the spatio-temporal pattern. Finally, we generate the new feature $X_{new}$ by concatenating the model input $X_{in}$ and the bias $X_{bias}$, and it is sent to the spatio-temporal feature extractor for further processing. The equation is shown below:

$$X_{new} = X_{in} \oplus X_{bias} \tag{6}$$

where $\oplus$ denotes the concatenation operation along the temporal dimension and $X_{new} \in \mathbb{R}^{2H \times N \times d}$. By combining the raw data $X_{in}$ with the bias $X_{bias}$ resulting from distribution shifts, $X_{new}$ is capable of effectively representing the dynamic changes in the traffic data distribution. We utilize a joint training framework, which integrates the autoencoder loss and the prediction loss described in Section 4.5, to optimize both components concurrently. This approach combines the reconstruction ability of the autoencoder with the prediction capability of the model and, as a result, leads to enhanced performance.
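The bias-and-concatenate step of the calibrator can be illustrated independently of the recurrent layers. The sketch below is a hedged illustration only: the reconstruction `x_out1` is supplied by hand as a stand-in for the decoder output, tensors are nested Python lists shaped [H][N][d], and the helper names `elementwise_sub` and `calibrate` are assumptions.

```python
def elementwise_sub(a, b):
    """Recursively compute a - b for equally shaped nested lists."""
    if isinstance(a, list):
        if not isinstance(b, list) or len(a) != len(b):
            raise ValueError("inputs must have identical shapes")
        return [elementwise_sub(x, y) for x, y in zip(a, b)]
    return a - b


def calibrate(x_in, x_out1):
    """Return X_new: X_in followed by X_bias = X_in - X_out1 along the time axis.

    x_in and x_out1 are shaped [H][N][d]; the result is shaped [2H][N][d].
    """
    x_bias = elementwise_sub(x_in, x_out1)
    return x_in + x_bias  # list concatenation along the first (time) axis


# H = 2 time steps, N = 1 node, d = 1 feature.
x_in = [[[5.0]], [[7.0]]]
x_out1 = [[[4.0]], [[7.5]]]  # stand-in for the encoder-decoder reconstruction
x_new = calibrate(x_in, x_out1)
```

A perfect reconstruction would yield an all-zero bias, so the second half of `x_new` carries only the part of the input that the autoencoder could not reproduce.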

Spatio-Temporal Feature Extractor
Taking the preprocessed feature $X_{new}$ as input, the spatio-temporal feature extractor (STFE) uncovers the sophisticated spatio-temporal correlations hidden in the data. The basic component of this module is an ST block, which is composed of two graph convolutional layers used to explore spatial characteristics, followed by a transformer structure designed to extract temporal dependencies. To enhance the information-capturing ability and global view of the spatio-temporal feature extractor, we sequentially arrange a series of ST blocks and employ residual connections to mitigate gradient vanishing while simultaneously accelerating the training procedure of the model (we set the number of ST blocks $p = 3$). The whole process of this module is shown in Algorithm 1, and the implementation details are described below.
Figure 3c shows the inner structure of a single spatio-temporal block (ST block). It takes the calibrated features $X \in \mathbb{R}^{2H \times N \times d}$ as the input. Firstly, the feature is sent to a two-layer GCN structure, and the adjacency matrix is calculated via singular value decomposition (SVD). We reshape the input 3D tensor $X \in \mathbb{R}^{2H \times N \times d}$ as a 2D tensor $X_a$ of shape $(2H \cdot d) \times N$. To obtain the inner correlations between distinct regions, we apply SVD to perform matrix factorization on $X_a$, resulting in the transformation of $X_a$ into two new matrices:

$$X_a = X_t \left(X_r\right)^{\top} \tag{7}$$

where $X_r$ and $X_t$ respectively symbolize the region-wise and the time-wise matrices. The matrix $X_r \in \mathbb{R}^{N \times \lambda}$ contains the spatial correlations amongst all regions, where $\lambda$ indicates the region dimension. To compute the resemblance between the $i$-th and $j$-th regions, we utilize a Gaussian kernel-based method to estimate their similarity. The similarity between regions can be calculated as follows:

$$A_{ij} = \exp\left(-\frac{\left\| X_r^{i} - X_r^{j} \right\|^{2}}{\varepsilon^{2}}\right) \tag{8}$$

where $\varepsilon$ denotes the standard deviation and $A_{ij}$ is used as the adjacency matrix.
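The Gaussian-kernel similarity step can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the kernel is taken in the common form exp(-||x_i - x_j||^2 / eps^2), the region embeddings `x_r` are given directly instead of being obtained from an SVD, and the name `gaussian_adjacency` is hypothetical.

```python
import math
from typing import List


def gaussian_adjacency(x_r: List[List[float]], eps: float) -> List[List[float]]:
    """Build a dense similarity matrix from region embeddings.

    x_r[i] is the embedding of region i; eps plays the role of the
    standard deviation. Kernel form (assumed): exp(-dist^2 / eps^2).
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    n = len(x_r)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x_r[i], x_r[j]))
            adj[i][j] = math.exp(-sq_dist / (eps ** 2))
    return adj


# Two regions whose embeddings are 5 units apart; eps = 5 gives exp(-1).
x_r = [[0.0, 0.0], [3.0, 4.0]]
adj_sim = gaussian_adjacency(x_r, eps=5.0)
```

The resulting matrix is symmetric with ones on the diagonal, and regions with similar embeddings receive weights close to 1.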
On top of this, we construct two graph convolutional layers $\mathrm{GCN}_1$ and $\mathrm{GCN}_2$, which share the same structure but have separate parameters. After obtaining the spatial features $X_{sp} = \mathrm{GCN}_2(\mathrm{GCN}_1(X))$, we employ a transformer structure to extract the temporal pattern from $X_{sp}$ and eventually get the spatio-temporal feature $X_{st}$ as shown below:

$$X_{st} = \mathrm{Transformer}\left(X_{sp}\right) \tag{9}$$

where $\mathrm{Transformer}(\cdot)$ denotes a transformer structure; its implementation details are given in Appendix B due to space constraints. Figure 3b depicts the residual connection of the ST blocks; for the $l$-th layer, the output of the layer $X_{st}^{(l+1)}$ is formulated as follows:

$$X_{st}^{(l+1)} = \mathrm{STblock}\left(X_{st}^{(l)}\right) + X_{st}^{(l)} \tag{10}$$

where $\mathrm{STblock}(\cdot)$ is the single ST block described above and $X_{st}^{(0)}$ is equal to $X_{st}$.
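The residual stacking of ST blocks can be sketched generically. In the sketch below each block is an arbitrary callable standing in for an ST block (two GCN layers followed by a transformer), features are flat lists of floats, and the function name `stack_with_residuals` is an assumption made for illustration.

```python
from typing import Callable, List, Sequence

Block = Callable[[List[float]], List[float]]


def stack_with_residuals(x: List[float], blocks: Sequence[Block]) -> List[float]:
    """Apply blocks in order, adding each block's input to its output."""
    for block in blocks:
        out = block(x)
        if len(out) != len(x):
            raise ValueError("a residual connection needs matching shapes")
        x = [o + xi for o, xi in zip(out, x)]
    return x


def toy_block(features: List[float]) -> List[float]:
    """Placeholder for an ST block: simply doubles every feature."""
    return [2.0 * v for v in features]


# p = 3 stacked blocks, as in the paper's configuration.
result = stack_with_residuals([1.0, 2.0], [toy_block] * 3)
```

Because each step adds the block input back to its output, the identity path is always available to gradients, which is the property the residual connections rely on.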

Predictor
The predictor is composed of a linear layer and a reshape operation. This part takes the concatenation of the spatio-temporal feature $X_{st}$ from the STFE module in Section 4.3 and the hidden feature $X_{hidden}$ of the calibrator in Section 4.2 as the input $X_{final} \in \mathbb{R}^{2H \times N \times d}$:

$$X_{final} = X_{st} \oplus X_{hidden} \tag{11}$$

which is then delivered to a fully connected linear layer and reshaped to conform to the prescribed prediction shape. The process can be represented as follows:

$$X_{prediction} = \delta\left(W X_{final} + b\right) \tag{12}$$

where $\delta(\cdot)$ is an activation function and $W$ and $b$ are parameters to be trained ($X_{prediction} \in \mathbb{R}^{L \times N \times d}$, where $L$ denotes the number of prediction time steps).

Training
The whole procedure of our proposed model SASTGCN is depicted in Algorithm 2. It should be noted that the self-adaptive calibrator is trained jointly with the rest of the network; that is to say, the entire model constitutes a seamless end-to-end pipeline. The training loss $L$ can be expressed in the following manner:

$$L = \mu L_1 + (1 - \mu) L_2 \tag{13}$$

where $L_1$ denotes the prediction loss (between the model prediction and the ground truth), $L_2$ indicates the calibrator loss (encoder-decoder reconstruction loss), both $L_1$ and $L_2$ are MSE (mean squared error) losses, and $\mu$ is a hyperparameter that balances the weight of these two losses, which is set to 0.5 in practice.
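A joint objective of this kind can be sketched as below. The exact weighting form is not recoverable from the text alone, so the sketch assumes a convex combination controlled by `mu`; tensors are flattened to plain lists, and the names `mse` and `joint_loss` are hypothetical.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(y_pred, y_true, x_rec, x_in, mu: float = 0.5) -> float:
    """Combine prediction loss L1 and reconstruction loss L2.

    Assumed form: mu * L1 + (1 - mu) * L2, with mu = 0.5 as in the paper.
    """
    l1 = mse(y_pred, y_true)  # prediction vs. ground truth
    l2 = mse(x_rec, x_in)     # autoencoder output vs. its own input
    return mu * l1 + (1.0 - mu) * l2


# L1 = 2.0 and L2 = 4.0, so the joint loss with mu = 0.5 is 3.0.
loss_value = joint_loss(y_pred=[1.0, 2.0], y_true=[1.0, 4.0],
                        x_rec=[0.0, 0.0], x_in=[2.0, 2.0])
```

Since both terms are minimized in one backward pass, the calibrator and the predictor are optimized end to end rather than in separate stages.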

Algorithm 2: Self-Adaptive Spatio-Temporal Graph Convolutional Network (SASTGCN)
Input: initial input $X_{in} \in \mathbb{R}^{H \times N \times d}$ // sample chosen from the preprocessed dataset
Output: model prediction $X_{out} \in \mathbb{R}^{L \times N \times d}$ // corresponding prediction for the input
1. $X_{hidden} = \mathrm{Encoder}(X_{in})$ // use a two-layer LSTM structure to encode the input
2. $X_{out1} = \mathrm{Decoder}(X_{hidden})$ // decode the hidden representation via a decoder
3. $X_{new} = X_{in} \oplus (X_{in} - X_{out1})$ // join the initial input $X_{in}$ with its deviation from $X_{out1}$
4. $X_{st} = \mathrm{STFE}(X_{new})$ // generate the spatio-temporal feature
5. $X_{final} = X_{st} \oplus X_{hidden}$ // concatenate the feature with the hidden representation
6. $X_{out} = \mathrm{Linear}(X_{final})$ // generate the prediction $X_{out}$ and reshape it
7. return $X_{out}$

Experiments
We conduct comprehensive experiments on two real-world traffic datasets, METR-LA and PEMS-BAY, to substantiate the efficacy of the proposed approach with empirical evidence. In addition to evaluating its performance against other baselines, we conduct an ablation study to confirm the contribution of several modules in SASTGCN. Furthermore, we examine the impact of the input horizon hyperparameter on the efficacy of the model.

Datasets
We verify our model on two publicly available spatio-temporal traffic datasets, METR-LA and PEMS-BAY, released by Li et al. (DCRNN) [2]. METR-LA compiles data collected by 207 sensors over four months to provide statistics on traffic speed along the highways of Los Angeles County, covering the period between 1 March 2012 and 30 June 2012. Detailed information on the METR-LA dataset is given in Appendix D. PEMS-BAY documents six months of traffic speed information, from 1 January 2017 to 31 May 2017, sampled from 325 sensors in the Bay Area. As for the spatial adjacency network, we employ SVD to obtain a dynamic adjacency matrix to better cope with dynamic changes in spatial correlation. We aggregate these two datasets into 5 min windows. The dataset statistics are shown in Table 1.

Baselines
We use the following baselines as a comparison. For the sake of fairness, we tune the key hyperparameters to ensure that they have the best performance.
HA: The historic average method is a forecasting technique that assumes that traffic flow follows a seasonal pattern and predicts future traffic by taking the average of past observations. An example of implementing this approach would involve utilizing all recorded data from 5:00 p.m. to 6:00 p.m. on Mondays throughout history as a reference point to forecast traffic speed for the same time frame on the upcoming Monday.
SVR [14]: Support vector regression employs a linear support vector machine to perform regression tasks, which allows for a certain degree of deviation between the actual and predicted values.
FC-LSTM [33]: The FC-LSTM model is an encoder-decoder architecture that leverages the long short-term memory (LSTM) neural network with a peephole mechanism. Notably, both the encoder and decoder components are composed of two distinct recurrent layers.
DCRNN [2]: Similar to FC-LSTM, it is a diffusion convolutional recurrent neural network that represents a sophisticated encoder-decoder architecture, comprising a recurrent layer that enables the network to effectively process sequential data. Nevertheless, in recurrent layers, the matrix multiplications are replaced with diffusion convolution.
GraphWaveNet [26]: Graph WaveNet introduces a pioneering adaptive adjacency matrix concept, which is incorporated into the graph convolution technique employing 1-D dilated convolutions to learn the dynamic and long-term spatio-temporal correlations.
SLCNN [35]: SLCNN combines structure learning convolution blocks with a pseudo-three-dimensional convolution module to model the spatio-temporal correlations in traffic speed data.
AutoCTS [36]: AutoCTS employs a combination of micro and macro search spaces to represent potential architectures of ST blocks and connections between them to obtain the optimal forecasting models.
STFGNN [27]: STFGNN fuses several spatial and temporal graphs and applies a gated convolution layer to handle the long-term sequence prediction problem.
MD-GCN [37]: MD-GCN employs a dual graph convolution network operating across multiple temporal scales, which is comprised of a gated temporal convolution and a dual graph convolution module.
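Of the baselines listed above, the historic average (HA) is simple enough to sketch directly. The following is a minimal illustration, assuming each past observation is tagged with a seasonal key such as (weekday, hour); the function name `historical_average` and the sample values are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def historical_average(
    history: Iterable[Tuple[Hashable, float]]
) -> Dict[Hashable, float]:
    """Average past observations per seasonal slot.

    history yields (slot_key, value) pairs, e.g. (("Mon", 17), 60.0).
    The returned mean for a slot is the HA forecast for that slot.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in history:
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical speeds observed on past Mondays and Tuesdays at 5 p.m.
observations = [(("Mon", 17), 60.0), (("Mon", 17), 50.0), (("Tue", 17), 40.0)]
forecast = historical_average(observations)
```

The forecast for next Monday at 5 p.m. is then `forecast[("Mon", 17)]`, the mean of all past Monday 5 p.m. readings.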

Experimental Setup
We partition the METR-LA and PEMS-BAY datasets into training, validation, and testing sets at a ratio of 7:1:2. The basic time interval is set to 5 min, and we use the observed traffic values over 12 time intervals to forecast the values in the next 3, 6, and 12 time intervals, respectively. We set these parameters to align with the ones documented in the literature [1] to establish a fair comparison setting. To assess the effectiveness of the approaches, we employ three commonly accepted metrics, namely the mean absolute error (MAE), the root mean squared error (RMSE), and the mean absolute percentage error (MAPE). All three criteria are specified in Appendix C, and the smaller the value, the better the model performs. All experiments were executed on a 64-bit Ubuntu server equipped with a 2.40 GHz CPU and 8 NVIDIA Titan GPUs, and the code was implemented in the PyTorch (https://pytorch.org/, accessed on 8 May 2023) framework. For the neural network-based methods, the optimal hyperparameters were selected through a grid search based on performance on the validation set. We optimized the whole model with the Adam optimizer, with the learning rate set to 0.001. The model was trained for up to 200 epochs, and an early-stopping mechanism with a patience of 10 epochs was used.
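For reference, the three evaluation metrics can be computed as in the sketch below. It uses the standard textbook definitions rather than the exact formulas of Appendix C, and it makes one labeled assumption: MAPE skips targets equal to zero to avoid division by zero. The function names are illustrative.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error in percent.

    Assumption: zero-valued targets are skipped to avoid division by zero.
    """
    terms = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred) if t != 0]
    if not terms:
        raise ValueError("MAPE is undefined when every target is zero")
    return 100.0 * sum(terms) / len(terms)


# Toy speeds: absolute errors of 2 and 4, each 20% of its target.
y_true = [10.0, 20.0]
y_pred = [12.0, 16.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

All three metrics are error measures, so lower values indicate better forecasts.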

Main Results
Tables 2 and 3 present the primary outcomes of SASTGCN on two real-world datasets. The superior results from the experiment are emphasized in bold. Moreover, ten-fold cross-validation was conducted to calculate the average error for each value, which is then denoted by the ± symbol to represent the error range. Upon observation, it is evident that our proposed method, SASTGCN, outperforms all others in each of the three evaluation metrics, thereby proving its superiority and applicability. It is noticeable that traditional methods (HA, SVR) are far less effective than deep learning methods. The poor effect of FC-LSTM is likely because it completely ignores the spatial correlations among regions. Additionally, short-term forecasting outperforms long-term forecasting in terms of accuracy and precision. This phenomenon can be attributed to the accumulation of prediction errors in the early stages of long-term forecasting, which subsequently impact the accuracy of later stages. In conclusion, by incorporating a calibrator mechanism and transformer architecture to extract spatio-temporal correlations among regions, our proposed SASTGCN model outperforms all comparative methods in the majority of instances.

Effect of Each Component
We conducted an ablation study on the PEMS-BAY dataset to assess the contribution of the key components of our model. For convenience, the variants of SASTGCN with individual components removed are named as follows:
• w/o HR: SASTGCN without the hidden representation added in the predictor. The output of the STFE is fed directly to a linear layer to generate the prediction.
• w/o Calibrator: SASTGCN without the whole calibrator part. The original input is fed directly to the STFE module to produce the final prediction.
• w/o Transformer: SASTGCN without the transformer part in each ST block. The temporal feature module in the STFE module is removed.
• w/o RC: SASTGCN without the residual connection in the STFE module. The three ST blocks are simply stacked.
The ablation was run with a forecasting horizon of six time steps (30 min), and the results are shown in Table 4. SASTGCN outperformed the variant without the hidden representation in the predictor, which indicates that the hidden representation preserves information that assists the prediction. SASTGCN also performed better than the variants without the calibrator and without the transformer, demonstrating the effectiveness of the calibrator and the importance of the temporal module. The residual connection was likewise found to be beneficial. In summary, each of the designed sub-modules improved overall performance.

Parameter Sensitivity Study
The input horizon length H is a crucial hyperparameter of SASTGCN. Figure 4 shows the impact of different input horizon lengths on the prediction results for the METR-LA dataset. H takes the values 3, 6, 12, 24, 48, and 72, and the forecast horizon is fixed at 6. The RMSE first decreases and then increases, with H = 12 yielding the best results. The RMSE initially decreases as H grows because the model has more historical data at its disposal. However, performance declines as H increases beyond 12. One possible explanation is that longer input sequences make the model considerably more complex, which raises the risk of overfitting.

Conclusions
In this paper, we proposed a novel Self-Adaptive Spatio-Temporal Graph Convolutional Network (SASTGCN) for traffic prediction. SASTGCN can effectively capture complex and dynamic spatio-temporal relationships and model the temporal distribution shift by combining a self-adaptive calibrator with graph convolution. To capture the temporal drifts in distribution, we employ a calibrator consisting of an encoder-decoder framework made up of multiple LSTM layers. Furthermore, we stack a series of spatio-temporal modules via residual connections to extract spatio-temporal features from the data. Each spatio-temporal module comprises GCN layers and a transformer encoder, where GCN is utilized to extract spatial correlations among regions, and the transformer is employed to capture temporal characteristics. The SVD method was employed in constructing the adjacency matrix for GCN. Extensive experimental results on two real-world datasets have demonstrated the effectiveness of our proposed model.
For future work, we aim to improve the performance of our model by incorporating external features, such as weather, holidays, and events. SASTGCN is not limited to traffic prediction and can also serve as a general framework for spatio-temporal sequence prediction in various domains. We therefore intend to adapt SASTGCN to other prediction scenarios, including weather, energy, agricultural yield, and social media prediction.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The principles and definitions of Long Short-Term Memory in Section 4.2 are elaborated here in detail.
As a typical representative of recurrent neural networks (RNN), long short-term memory (LSTM) addresses the issue of vanishing gradients present in conventional RNNs. It has been extensively employed in the realm of natural language processing, speech recognition, and other sequence modeling tasks.
LSTM uses three gates and a memory cell to regulate the information flow through the network. The input gate, forget gate, and output gate control the flow of information, while the memory cell preserves long-term knowledge. The input gate selects which information from the present input is retained in the memory cell. The forget gate determines which earlier contents of the memory cell are discarded. Finally, the output gate determines which information from the memory cell is passed to the output. The LSTM can be expressed formally as follows:

$$
\begin{aligned}
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

where $i_t$, $f_t$, and $o_t$ are the input gate, forget gate, and output gate at time step $t$, respectively. $C_t$ denotes the cell state, $\tilde{C}_t$ the candidate cell state, and $h_t$ the hidden state at time step $t$. $x_t$ represents the input at time step $t$, and $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input. $W$ and $b$ are the corresponding trainable weights and biases. $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.
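The gate mechanism described above can be sketched as a single LSTM time step in plain Python. This is an illustrative sketch of the standard LSTM cell, not the authors' implementation: the function names, the parameter dictionary keys, and the choice to apply each weight matrix to the concatenation of the previous hidden state and the current input are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params holds hypothetical keys W_i, b_i, W_f, b_f, W_o, b_o, W_c, b_c."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]   # input gate
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]   # forget gate
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]   # output gate
    c_hat = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]  # candidate state
    # Keep part of the old cell state and add part of the candidate state.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy check with hidden size 2 and input size 1: all-zero parameters make every
# gate equal to 0.5 and the candidate state equal to 0.
zeros_W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zeros_b = [0.0, 0.0]
params = {k: zeros_W for k in ("W_i", "W_f", "W_o", "W_c")}
params.update({k: zeros_b for k in ("b_i", "b_f", "b_o", "b_c")})
h_t, c_t = lstm_step([0.5], [0.0, 0.0], [1.0, -2.0], params)
```

With zero parameters the forget gate halves the previous cell state, which makes the role of each gate easy to verify by hand.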

Appendix B
This appendix describes the transformer formulation in Section 4.3. The transformer, a neural network architecture introduced by Google in 2017, is widely used for natural language processing (NLP) tasks. Compared with traditional convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the transformer offers better performance and faster training when processing long sequences. The multi-head self-attention mechanism is the core of the architecture: it establishes long-range dependencies between different positions and thereby better captures the information in the sequence. The transformer also comprises an encoder and a decoder, which map the input sequence onto a hidden space and then decode the resulting representation into the output sequence. The detailed formulation is as follows.

Self-Attention Mechanism
In the transformer, the self-attention mechanism calculates the correlation between each word and the other words to establish context relationships. Specifically, the representation vector of each word simultaneously serves as a query vector, a key vector, and a value vector. The equation is as follows:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

where $Q$, $K$, and $V$ are the query, key, and value vectors (stacked as matrices over all positions), respectively, and $d_k$ is the dimension of $K$. The similarity between the query vector and each key vector is computed and normalized to obtain the weight vector. The weights are then multiplied with the value vectors and summed to yield the output vector.
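The self-attention computation described above (query-key similarity scaled by the square root of d k, normalized into weights, then used to average the value vectors) can be sketched in plain Python. The function names are illustrative, and the softmax normalization is the standard choice for the transformer rather than something stated explicitly in the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
out_equal = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 4.0], [4.0, 8.0]])

# A query strongly aligned with the first key puts almost all weight on the first value.
out_peaked = attention([[100.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```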

Multi-Head Self-Attention Mechanism
To model dependencies among different positions more effectively, the transformer uses a multi-head self-attention mechanism. Specifically, distinct linear transformations are applied to the input vectors, yielding multiple sets of query, key, and value vectors. Self-attention is then computed on each attention head separately, and the final output vector is obtained by concatenating the output vectors of all heads and applying a further linear transformation.
Assuming there are $h$ attention heads, each with a dimension of $d_k$, the multi-head self-attention mechanism is given by:

$$
\begin{aligned}
\mathrm{head}_i &= \mathrm{Attention}\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{o}
\end{aligned}
$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the linear transformation matrices of the $i$-th attention head, and $W^{o}$ is the linear transformation matrix for the concatenated output vector. The number of attention heads $h$ and the vector dimension $d_k$ are hyperparameters that can be adjusted to suit the task.
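The project-attend-concatenate-project sequence described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the function names are hypothetical, the projection matrices are hand-picked toy values rather than learned weights, and softmax is assumed as the normalization inside each head.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


def multi_head(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per attention head."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs position by position.
    concat = [sum((h[r] for h in head_outputs), []) for r in range(len(X))]
    return matmul(concat, W_o)


# Toy setup: model dimension 4, h = 2 heads of dimension d_k = 2.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # selects dims 0-1
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # selects dims 2-3
heads = [(first_half, first_half, first_half), (second_half, second_half, second_half)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

single = multi_head([[1.0, 2.0, 3.0, 4.0]], heads, identity)
pair = multi_head([[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]], heads, identity)
```

With a single input position each head attends only to itself, so the two heads return the two halves of the input and the concatenation reproduces it.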

Encoder and Decoder
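For the encoder, given an input series $X = (x_1, \ldots, x_n)$, the encoder output $Z = (z_1, \ldots, z_n)$ is calculated via the following equations:

$$
\begin{aligned}
X' &= \mathrm{LayerNorm}\left(X + \mathrm{MultiHead}(X, X, X)\right) \\
Z &= \mathrm{LayerNorm}\left(X' + \mathrm{FeedForward}(X')\right)
\end{aligned}
$$

where $\mathrm{LayerNorm}(\cdot)$ normalizes the input vector, $\mathrm{FeedForward}(\cdot)$ is a feed-forward network layer that extracts features, and $X'$ is an intermediate representation. For the decoder, given the input $Y = (y_1, \ldots, y_m)$ and the results produced by the encoder, the decoder output $O = (O_1, \ldots, O_m)$ is determined as follows:

$$
\begin{aligned}
Y' &= \mathrm{LayerNorm}\left(Y + \mathrm{MultiHead}(Y, Y, Y)\right) \\
Y'' &= \mathrm{LayerNorm}\left(Y' + \mathrm{FeedForward}(Y')\right) \\
O &= \mathrm{LayerNorm}\left(Y'' + \mathrm{MultiHead}(Y'', Z, Z)\right)
\end{aligned}
$$

where $Y'$ and $Y''$ are intermediate representations. The first two equations are the same as those of the encoder, while the third equation represents the encoder-decoder attention layer, which aligns the output vectors of the encoder and the decoder.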
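The encoder computation can be sketched as two sub-layers, each followed by a residual addition and layer normalization. This is a minimal sketch assuming the standard post-normalization arrangement; the function names are hypothetical, the layer normalization omits the learnable scale and shift, and the attention and feed-forward sub-layers are passed in as simple stand-in functions rather than trained modules.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add_and_norm(X, sublayer_out):
    """Residual connection followed by layer normalization, row by row."""
    return [layer_norm([a + b for a, b in zip(x, s)]) for x, s in zip(X, sublayer_out)]


def encoder_layer(X, self_attention, feed_forward):
    """Apply the attention and feed-forward sub-layers, each with add-and-norm."""
    X1 = add_and_norm(X, self_attention(X))
    Z = add_and_norm(X1, feed_forward(X1))
    return Z


def uniform_attention(X):
    """Stand-in for self-attention: every position receives the mean of all rows."""
    mean_row = [sum(col) / len(X) for col in zip(*X)]
    return [list(mean_row) for _ in X]


def relu_ff(X):
    """Stand-in for the feed-forward layer: element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in X]


Z = encoder_layer([[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]], uniform_attention, relu_ff)
```

Passing the sub-layers as arguments keeps the sketch short and makes the residual-plus-normalization pattern explicit.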

Appendix C
The definitions of the MAE, RMSE, and MAPE used in Section 4.3 are presented in this appendix.
Mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) are three measures that quantify the discrepancy between the predicted and observed values. MAE is the mean of the absolute differences between the predicted and actual values, and it remains a reliable measure of accuracy when the dataset contains outliers or extreme values. The smaller the MAE, the better the accuracy of the model. RMSE is another frequently employed metric for the discrepancy between predicted and actual data points. Because it squares the errors, it is sensitive to large deviations and is most reliable when the dataset does not contain outliers. RMSE is never smaller than MAE, and smaller values of RMSE denote higher accuracy. MAPE is the average of the absolute differences between the prediction and the ground truth divided by the latter. MAPE is a useful measure when comparing the accuracy of different models or forecasting methods. The smaller the MAPE, the better the performance of the model. However, MAPE can be misleading when the real values are close to zero or when there are extreme values in the dataset. Consequently, we add a small positive value to the denominator to prevent it from being zero. The equations are as follows:

$$
\begin{aligned}
\mathrm{MAE} &= \frac{1}{k} \sum_{i=1}^{k} \left| y_i - \hat{y}_i \right| \\
\mathrm{RMSE} &= \sqrt{\frac{1}{k} \sum_{i=1}^{k} \left( y_i - \hat{y}_i \right)^2} \\
\mathrm{MAPE} &= \frac{1}{k} \sum_{i=1}^{k} \left| \frac{y_i - \hat{y}_i}{y_i + \varepsilon} \right|
\end{aligned}
$$

where $k$ represents the number of samples, $y_i$ is the label of sample $i$, $\hat{y}_i$ represents the prediction of sample $i$, and $\varepsilon$ is a very small positive value added to ensure that the denominator is not 0.
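The three metrics can be computed as in the sketch below. The function names, the default value of the small constant `eps`, and the toy speed readings are assumptions for illustration; MAPE is returned as a fraction rather than a percentage.

```python
import math


def mae(y_true, y_pred):
    """Mean of the absolute differences between labels and predictions."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute error relative to the label; eps keeps the denominator non-zero."""
    return sum(abs((y - p) / (y + eps)) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical speed readings and predictions (toy data).
y_true = [50.0, 60.0, 40.0, 20.0]
y_pred = [48.0, 63.0, 40.0, 25.0]

scores = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
}
```

On this toy data the single error of 5 on the smallest label dominates the MAPE, which illustrates why the metric can be misleading when true values are small.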

Figure A1. The distribution of the selected sensors in the METR-LA dataset.