SGGformer: Shifted Graph Convolutional Graph-Transformer for Traffic Prediction

Accurate traffic prediction is significant in intelligent cities’ safe and stable development. However, due to the complex spatiotemporal correlation of traffic flow data, establishing an accurate traffic prediction model is still challenging. Aiming to meet the challenge, this paper proposes SGGformer, an advanced traffic grade prediction model which combines a shifted window operation, a multi-channel graph convolution network, and a graph Transformer network. Firstly, the shifted window operation is used for coarsening the time series data, thus, the computational complexity can be reduced. Then, a multi-channel graph convolutional network is adopted to capture and aggregate the spatial correlations of the roads in multiple dimensions. Finally, the improved graph Transformer based on the advanced Transformer model is proposed to extract the long-term temporal correlation of traffic data effectively. The prediction performance is evaluated by using actual traffic datasets, and the test results show that the SGGformer proposed exceeds the state-of-the-art baseline.


Introduction
Intelligent transportation system (ITS) shows great potential in improving the efficiency [1], stability, and reliability of transportation systems, which makes it a research hotspot [2]. The development of ITS is mainly reflected in vehicle-environment coordination control [3], passenger demand forecasting [4], travel time estimation [5], and order scheduling [6]. The realization of these functions depends on accurate and reliable traffic forecasting. Furthermore, the improvement of computing power and data acquisition technology provides hardware and data support for the traffic prediction task, making the development of accurate traffic prediction methods a hot research topic.
The research history of scholars on traffic prediction methods mainly includes statistical, shallow machine learning, and deep learning methods. The earliest statistical methods used to predict traffic by modeling the nonlinear characteristics of the time series of traffic data include ARIMA, Gaussian regression, and Bayesian network [7][8][9]. These methods rely heavily on the periodicity of a single series and lack the ability to model the spatial correlation characteristics of traffic data, so their potential for traffic forecasting needs to be improved. Basic machine learning, such as SVM and ANN, can capture traffic data's spatial and nonlinear temporal correlation [10,11], and its prediction effect depends heavily on professional feature engineering methods. Because there is no universally recognized and practical feature selection guide for different problems, basic machine learning-based methods may not maximize actual performance in the face of complex prediction tasks. Therefore, to capture more abundant and deeper spatiotemporal correlation information to bring more performance improvement to traffic prediction, the deep learning methods provide a solution for this problem.
Deep learning provides different solutions for feature extraction of spatial and temporal correlations. In the task of spatial correlation acquisition, the early applied methods focus more on convolutional neural networks (CNNs). Thanks to local connectivity and weight-sharing characteristics of CNNs, it has a compelling feature extraction performance for traffic data in the European space [12][13][14]. However, with the in-depth study of traffic data and the development of graph neural networks (GNNs), the definition of graph networks is more consistent with the non-European characteristics of the spatial distribution of traffic flow, which cause CNNs to become difficult to adapt to the definition form of these data. Graph Convolutional Networks (GCNs) have become more mainstream spatial correlation feature extraction methods because they can capture the non-Euclidean correlation between graph network nodes [15][16][17][18]. The way to extract spatial correlation by GCNs has become the focus of research. Among them, Kipf et al. proposed the Chebyshev first-order approximate Graph Convolutional Network (GCN) to balance the performance of feature extraction and computational complexity and it became the mainstream GCN method [19], which was used in traffic prediction by a large number of movements. For example, STGCN, DCRNN, STSGCN, and other excellent models have been proposed [20][21][22][23][24], and GCN is adopted to complete the correlation feature extraction of the complex spatial domain.
When it comes to the capture task of nonlinear time correlation in traffic data, the early commonly used methods are recurrent neural networks (RNN) and their variants, Long Short-Term Memory neural network (LSTM), and Gated Recurrent Unit (GRU). The advantage of these methods is that they can retain sufficient historical information and discard useless information through recursive neurons, which leads to their application by many people to capture traffic temporal characteristics [25][26][27][28]. However, RNN-like methods tend to forget adequate long-term information for long-term sequences, leading to performance degradation [29]. Then, with the breakthrough progress in natural language processing (NLP), the model based on the Attention mechanism has become a research hotspot in capturing temporal correlation, which shows better performance in capturing long-term correlation. The attention mechanism can reduce the maximum length of the propagation path of the network signal to the theoretical shortest O (1) while avoiding the recursive structure, increasing the calculation efficiency [30]. Bai et al. introduced the attention mechanism to adjust the importance of different time points and integrated the global temporal information to improve prediction accuracy [31]. Guo et al. proposed the AST-GCN based on the attention mechanism to effectively capture the dynamic spatiotemporal correlation in traffic data by establishing the spatiotemporal attention mechanism [32]. Xu et al. dynamically modeled directional spatial dependencies by using a self-Attention mechanism to capture real-time traffic flow. Then, the temporal attention mechanism is adopted to model long-term time dependency across multiple time steps [29]. These advanced methods use the Attention mechanism in the Transformer to extract the correlation features of time series or spatial series and achieve good results. However, they do not make good use of the Seq2Seq architecture in the Transformer model. The advantage of Transformer architecture is that it separates the learning of sequence information from the prediction of future information through the encoder-decoder structure rather than doing it simultaneously as a simple RNN-type network, which increases the network's learning ability and supports the prediction of any length at the same time.
Transformer, as a revolutionary innovation based on the Attention mechanism [33], integrates the self-attention mechanism into the Seq2Seq architecture to further increase the time-related capture capability and becomes one of the future research directions. TSTNet combines GAT with a Transformer based on Seq2Seq architecture to extract the spatiotemporal correlation characteristics [34]. The transformer uses graph context embedding to encode the input temporal and spatial information. It then embeds the target node, neighbor node, and discrete time through the full connection layer into the input Transformer to find the correlation between different elements. This kind of exploration of the Transformer method is significant. Although it uses the Graph Attention Network (GAT) to learn some graph space information, it uses the embedding and concatenation of spatial features and temporal features as the input in the Transformer. Specifically, the spatial and temporal information are regarded as elements in the sequence, which is different from simply using a Transformer as a time modeling device, which may result in some loss of spatial topology information.
To fully extract the spatiotemporal correlation characteristics of traffic data, this paper inspired by advanced work of T-MGCN and Informer [30,35] proposes a model combining a shifted window operation, GCN, and Graph Transformer. This model uses a shifted window operation to coarsen time segments to reduce fluctuation noise and computational complexity; Multi-channel GCN is used to fuse spatial correlation based on different features; the improved Graph Transformer can directly extract the nonlinear time correlation characteristics of high-dimensional graph network data. Finally, the progressiveness of this method is verified by comparison with the classic baseline. The main contributions of this paper are as follows: 1.
This paper constructs a prediction network that integrates a shifted window operation, GCN, and Transformer models, namely Shifted Graph Convolutional Graph-Transformer (SGGformer). GCN realizes complex spatial correlation characteristics extraction, and Graph Transformer extracts the nonlinear temporal correlation characteristics; 2.
The shifted window operation is developed to divide time segments, reduce computational complexity, and enhance the ability to capture features of different periods; 3.
The regional relationship with the graph network is defined, and the aggregation and mapping of regional nodes under the definition of a topology network is completed based on a multi-channel GCN; 4.
The improved Graph Transformer is used to process high-dimensional graph data. The generative decoder outputs long sequences in a single step to avoid cumulative errors and significantly reduce reasoning speed.
The rest of the paper is organized as follows: In Section 2, the deep learning method for extracting spatial and temporal correlation features of traffic is introduced. In Section 3, the definitions related to traffic prediction and the structure of the SGGformer model proposed in this paper are introduced, respectively. In Section 4, data preprocessing, evaluation index establishment, comparison test with baseline, and ablation analysis are described. Then, a discussion of the results of the experiments is presented in Section 5. Finally, conclusions are drawn with future study directions in Section 6.

Related Works
This section outlines the existing methods for modeling spatial and temporal dependencies in traffic flow prediction. Because statistical and traditional machine learning methods cannot effectively model spatial dependencies, these methods are mainly based on deep learning models.

Spatial Correlation Extraction
CNNs are the first depth learning method used in traffic prediction. Because of its local perception and weight-sharing characteristics, it is widely used in traffic prediction. Toncharoen et al. used a convolutional neural network (CNN) to extract the spatial characteristics of node data along the highway [12]. Yao et al. used a CNN model to capture the spatial characteristics of traffic data distributed in regional form [13]. Wang et al. used the 3D-CNN and sparse UNet method to model the spatial correlation of traffic data [14]. Cao et al. extracted the characteristics of the target road and the surrounding roads with strong correlation through CNN [36]. However, this kind of network is more suitable for the task of a regular network, which will cause the loss of topological information in the face of an irregular traffic network. Thanks to the robust feature extraction ability of graph neural networks (GNNs) for graph information, traffic prediction is extended to the non-European domain. As a variant of GNNs, the GCN extends classical CNN to the graph domain. Recently, GCN has been widely used to model the non-European spatial correlation of traffic data, thus serving the traffic prediction task. Li et al. proposed the DCRNN which uses diffusion convolution on the directed graph to capture the diffusion information of traffic flow [21]. Yu et al. proposed the STGCN which uses spectral graph convolution on an undirected graph to model spatial correlation [20]. Guo et al. combined a spectral graph convolutional network and a temporal CNN to complete the extraction of temporal and spatial correlation [32]. Song et al. proposed the spatiotemporal synchronous graph convolution network STSGCN, and built a spatiotemporal synchronous modeling mechanism to effectively capture the complex local spatiotemporal correlation, in which the local spatial correlation extraction is completed through the superposition of multiple graph convolution operations [22]. Therefore, it can be found that the current development direction for spatial correlation extraction is based on a graph convolution network, and the difference lies in the construction method of the adjacency matrix. A reasonable construction can effectively balance the computational efficiency with the performance of spatial feature capture.

Temporal Correlation Extraction
RNN is the earliest method used to extract the time correlation of traffic data. Because gradient explosion, gradient disappearance, and other problems often occur, RNN has limitations in modeling time correlation. In order to overcome these problems, LSTM and GRU are used to build the long-term dependency relationship in the traffic sequence. Xiao et al. used a two-way LSTM (Bi-LSTM) model to extract the periodic characteristics in the daily and weekly traffic data and used the two-way characteristics of LSTM to capture the forward and backward traffic flow change trends [25]. Tian et al. used the LSTM model to effectively capture the complex temporal-related characteristics of traffic flow in the short-term traffic flow prediction task [26]. However, these models are all based on recursive processes, and there are always problems such as training, time-consumption, and long sequence information forgetting. Some scholars tried to model the temporal correlation of sequences through CNN [20,32,37]. However, its receptive field range could not meet the requirements of long-term sequence input, and its scalability was constrained as the number of hidden layers increased with the increase of sequence length. With the breakthrough and development of the NLP field, research based on the Transformer model has become a hot spot. The Attention mechanism captures the temporal correlation in the transportation field due to its strong ability to extract the correlation of sequence elements. Yao et al. proposed a new spatiotemporal dynamic network, introduced CNN to learn the dynamic similarity between regions, and designed a periodic attention-shifting mechanism to capture long-term periodicity [13]. Bai et al. used GRU to capture short-term trends and introduced attention mechanisms to adjust the importance of different time points which integrated global time information to improve prediction accuracy [31]. Guo et al. proposed ASTGCN based on the attention mechanism. Based on three kinds of periodic data, a spatiotemporal attention mechanism was established to effectively capture the dynamic spatiotemporal correlation in traffic data [32]. However, these methods only use the Attention mechanism for simple temporal correlation modeling, which cannot fully capture the deep information hidden in the time series. The application of the Transformer needs to be studied.
The transformer is a model based on the Seq2Seq architecture that integrates attention mechanism, location coding, residual connection, and layer standardization [33]. It can effectively extract the long-term temporal correlation hidden in the input sequence by directly calculating the attention weight of the sequence data at each position. It is suitable for sequence input of different lengths and can capture the deep temporal correlation existing in the long-term range. There is still much space for research based on Transformer. Song et al. proposed a model combining GAT and Transformer-TSTNet [34], which uses random walking to map the characteristics of GAT learning to spatial embedding and combines time embedding as a spatiotemporal sequence. Finally, it uses Transformer to extract the spatiotemporal correlation in the sequence.

Methodology
In this section, we define the research problem in the first part. Specifically, for the topological map network division of regional data, we introduce the time-space sequence and then describe the process of traffic classification and traffic prediction tasks. In the second part, we introduce the main structure of SGGformer, which is divided into three parts: time segment division based on a shifted window operation, spatial correlation modeling, and time correlation modeling.

Definition 1: Regionally Topological Graph Network
This paper describes the traffic data divided by regions in the form of a graph network. Specifically, the city's regional data are represented as an undirected graph G = (V, E, W), where V = {v 1 , v 2 , · · · , v N } indicates the collection of regional nodes, N is the total number of regions. Moreover, an edge e ij ∈ E denotes the correlation between region v i and v j . In addition, the weight w ij ∈ W of the edge e ij represents the degree of correlation between region v i and v j , which is measured by the topological correlation between regions, and the value is equal to the reciprocal of the number of edges passed by the shortest adjacent path between regions. Specifically, there is a strong topological correlation between two regions that are separated by sell regions in the regional network, which shows a large weight in the adjacency matrix. The adjacency matrix W of G is expressed as Equation (1), and the matrix visualization is shown in Figure 1.

Definition 2: Spatiotemporal Sequences
The time of a day is divided into T time stamps according to a certain resolution. Under each time stamp, the spatial-temporal data of the region is expressed as follows.
where t is time stamp, X t ∈ R N×F are the F traffic characteristics (average speed, average flow) in the N regions under the time stamp t . For the node v i in the i-th region in the data X t , its traffic characteristics are expressed as x n ∈ R F .

Definition 3: Traffic Grades
In order to objectively and comprehensively evaluate the traffic status, this paper takes the form of a traffic grade as the coupling representation of different traffic statuses. Specifically, different traffic statuses are coupled based on a self-organizing mapping neural network (SOM) [38,39]. The traffic grade is obtained through the training and testing process of SOM.
In the training process, firstly, the node weight vector of SOM's output layer W = {ω i , i = 1 : Cls} is initialized, where Cls represents the node number which is equal to the number of traffic grades. Then, the initial learning rate l(0), initial neighborhood radius r(0), and iterations Iter are initialized. After that, the traffic sample X is input into the SOM, and for the sample x t,n ∈ R F , t ∈ T, n ∈ N, the distance from all output layer nodes to them is calculated as follows: After the calculation, the nearest node is found as the winning node. The weights of the winning node and all nodes in the neighborhood are updated, as shown below.
where j = 1 : Cls, and the learning rate and neighborhood radius are updated as well, which is shown as follows: Through continuous circulation, all traffic samples are traversed. Finally, the training will exit when the weight of the output layer does not change significantly or the maximum number of training times is reached. This paper sets the maximum number of iterations as 200 generations.
In the test process, based on the trained SOM network parameters, the corresponding traffic grade of each regional node is obtained. Specifically, the traffic samples X are input into the SOM network. For each traffic sample x t,n ∈ R F , t ∈ T, n ∈ N, we calculate the Euclidean distance between all SOM output layer nodes and the sample. The formula is as follows: where i = 1 . . . Cls. The nearest node ω i is found to determine the traffic grade of the sample x t,n as i . By completing the traversal of all nodes, traffic grade samples Y t ∈ R N are obtained, which represent the traffic state in N regions under the timestamp t. The distribution of regional feature points after classification is shown in Figure 2. Besides, Figure 3 shows the grade change of all regions in a certain day.

Definition 4: Traffic Prediction Task
Based on the defined regional graph network and the traffic samples X ∈ R W×N×F (regional average speed, average flow) with a history of length W under a given times tamp t, the goal of the traffic prediction task is to learn a mapping function that can predict the traffic state grade under the next time stamp h, specifically as follows: where θ represents the parameters of the mapping function.

SGGformer
The overall architecture of the SGGformer model proposed in this paper is shown in Figure 4, including three main components: shifted window operation, multi-channel GCN, and Graph Transformer. The overall model input includes the urban area's historical 24 h characteristics (average flow and average speed) and the topological adjacency matrix. The final output is the entire graph traffic grade at the future time stamp through the characteristic extraction of the model. For the three parts of the model, first of all, the following introduces the shifted window operation of SGGformer, which divides the traffic data into historical stages according to the time dimension. Then, the process of establishing spatial dependency of traffic data by multi-channel GCN is described. Finally, the establishment of time dependency by the Graph Transformer is explained, which can directly process GCN to generate high-dimensional graph data sequences. Furthermore, the Seq2Seq architecture of the Transformer type network is utilized to complete the traffic state prediction at the future time. Historical traffic data is firstly divided into different segments by shifted window operation, and then multi-channel GCN operation is performed on different historical segments in parallel based on the topological adjacency matrix to obtain spatial correlation characteristics. Then, the Graph Transformer module is used to further complete the extraction of time-related features. Finally, the traffic grade of all regions is output in the whole graph corresponding to the time of practice t + h. Specifically, the Chinese characters in the regional grade map represent the names of roads and landmark sites.

Shifted Window Operation
The historical traffic data often show a certain stage. That is, the traffic flow remains relatively stable for a certain period, with the arrival of the next stage, the traffic conditions show a distinct difference. It can be understood that during the morning rush hour, the roads in the urban area become crowded. As time goes by, the congestion of the roads will improve, and the traffic will deteriorate again when it reaches noon. At the same time, the time resolution is coarsened by dividing the shifted window operation, and feature extraction is performed based on the coarsened time interval in the timedependent modeling process, which reduces the computational complexity to a certain extent. Therefore, in this paper, the historical traffic data are divided into sliding segments according to a particular time window, and the segmentation of different periods is realized according to the size and step size of the sliding window, as shown in Figure 5.
Based on the traffic data with a history length of W time stamps, this paper selects the sliding window size of (W/3) and the step size of (W/6), so the historical data will be divided into five time windows, which are represented as P i ∈ R N×(W/3)×F . The equation for the dividing window is as follows:

Spatial Correlation Modeling
SGGformer establishes spatial dependencies by building a multi-channel GCN model, as shown in Figure 6. First, the segmented data of different windows are input into the stacked multi-channel GCN in parallel. Under each segment input, the parallel mechanism is also used between different channels. After extracting the local spatial features of GCN, the channel fusion layer is used to fuse the different types of spatial features captured. Specifically, this paper mainly includes two features, namely, regional flow and average speed. The overall spatial modeling process is shown in Equation (10).
where P f low i ∈ R N×(W/3) and P speed i ∈ R N×(W/3) represent the traffic data corresponding to the two characteristic dimensions of traffic flow and average speed under window i. GCN is proposed by Kipf et al. which is widely used in the field of traffic prediction because of its simple calculation process and effective spatial feature extraction. Specifically, based on the defined topological adjacency matrix and the combination of linear transformation and activation function, it completes the feature capture of regional nodes. The GCN is the first-order approximation of the spectral graph convolution, particularly, the aggregation and mapping of the node's first-order neighbor features are completed by a first-order approximation of the Chebyshev polynomial, and the information transmission of the multi-order neighborhood can be realized by stacking several GCN layers. For a layer of GCN, the formula is as follows: whereW = W + I is the adjacency matrix considering the self-loop,D (k) = ∑ jW (k) ij represents the degree matrix, θ (k) ∈ R d×d denotes a trainable parameter matrix, W (k) ∈ R N×N is the adjacency matrix of the graph network, H (k−1) ∈ R N×d (k−1) is the node representation of the k − 1-th layer output, and H (k) indicates the k − 1-th layer output node representation after feature extraction. d (k−1) and d (k) represent the numbers of node features corresponding to the layer (k − 1) and k, respectively. For the node features of the first layer input, d (1) = W/3.

Temporal Correlation Modeling
In this paper, the constructed Graph Transformer is used to extract the temporal-related characteristics of traffic data. Its architecture is shown in Figure 7. Graph Transformer is composed of an encoder and a decoder. It is based on the improvement of the traditional Transformer. The improvement mainly consists of the following parts: 1.
The encoder and decoder are composed of high-dimensional self-attention modules with residual connections. The purpose of constructing that module is to directly extract the time characteristics of multi-dimensional graph data obtained from graph convolution; 2.
In the decoder part, thanks to the inspiration of Informer [30], its generative decoder can output an extended sequence in a single step by inputting an input with zero occupation, thus avoiding cumulative error and greatly reducing the reasoning speed. In conclusion, this paper does not make many improvements to the excellent structure of the Transformer. However, it focuses on the input form of each sub-feature extraction unit and decoder in the structure. The following will be introduced respectively from the Graph Transformer architecture: high-dimensional self-attention mechanism and zero-occupation input of the decoder. Graph Transformer Architecture:

Add & Norm
The architecture of the graph transformer, such as neural transformation models including the Transformer, has an encoder-decoder structure. Here, the encoder maps the high-dimensional historical features of nodes extracted by the Multi-Channel GCN layer to the continuous representation z = (z 1 , · · · z win ) ∈ R win×n×emb . Here, emb represents the embedded characteristic dimension of the node. Given z, the decoder is input to the matrix spliced by a partial historical feature embedding matrix and zero-occupation matrix, so as to realize the output of the spatial-temporal characteristics at the future time in one step.
Encoder: The encoder is composed of a stack of m = 2 identical layers. Each layer contains a high-dimensional autonomous willpower mechanism. We use the residual connection between different layers, and then normalize the layers. The calculation formula is as follows: where z (k) enc_in and z (k) enc_out represent the input and output node characteristics of the sublayer of the layer k, respectively. SubLayer S−Attn (.) represents the high-dimensional selfattention mechanism, as shown in this section. In order to facilitate residual connection, all sub-layers and embedded layers in the model maintain dimensions emb = 64 and LayerNorm(.) represents layer normalization functions.
Decoder: The decoder is also composed of n = 2 stacks of the same layer. In addition to the layer in the encoder, a second layer is inserted, which performs a multi-head attention mechanism on the output of the encoder stack, which is shown in Equation (13). Similar to the encoder, residual connections are used between each layer, and then, layer normalization is performed. Similar to Transformer, we modify the self-attention layer in the decoder stack to prevent locations from focusing on subsequent locations. This mask, combined with output embedding, ensures that the prediction of i locations can only rely on known outputs with locations smaller than i.
where z (k) dec_out is the output of the decoder, h indicates the zero-occupation input of the decoder, see for details in [30]. The second layer inserted is a high-dimensional attention mechanism SubLayer Attn (.), see below for details.
After the decoder completes the extraction of the temporal feature, the feature data are input to the linear layer to complete the mapping of the feature to all node grades. The calculation process is shown as follows: where ReLU(.) represents ReLU activation function, W lin (l) ∈ R l_in(l)×l_out(l) and b lin (l) ∈ R l_out(l) represents the mapping weight and offset of the layer, respectively. L linear is the number of stacked linear layers. Finally, the characteristics of all roads at various grades z l_out ∈ R N×Cls are output. High-dimensional Self-Attention Mechanism: The conventional multi-head self-attention mechanism obtains the relative relationship between elements through the dot product so that the relative relationship between elements is completely preserved in the feature extraction process of the sequence, and all elements are processed in parallel in the calculation process to improve the calculation efficiency. Considering the high number of features of elements in the sequence, parallel computing in the form of multiple heads can extract the relationship between elements in the sequence from different angles and further increase computational efficiency. Therefore, the multi-head self-attention mechanism is widely used in the feature extraction of sequences.
In this paper, the spatial feature sequence extracted by GCN in different time windows is taken as the input, and each element in the sequence represents the spatial feature in this window. The fusion and feature extraction of spatial features in different time windows are completed using the self-attention mechanism to capture the temporal correlation among them. However, compared with the input data of the conventional self-attention mechanism, the combined sequence of different spatiotemporal features in this paper has a higher dimension. Specifically, compared with the input data of the normal multi-head voluntary mechanism, the input data of the high-dimensional multi-head self-attention mechanism are in the form of a two-dimensional matrix, which increases the dimension of the number of roads. Therefore, this paper improves the conventional attention mechanism and designs a high-dimensional multi-head self-attention mechanism, which is shown in Figure 8. The specific calculation process is as follows.
We first define the matrix of query, key, and value (Q ∈ R win×N×d , K ∈ R win×N×d and V ∈ R win×N×d ), and then obtain three functional matrices by performing a linear transformation on the second dimension (i.e., node number dimension) of the embedded traffic data X ∈ R win×N×d and decomposing it into multiple heads, as shown in Equation (15).
where W Q ∈ R N×n , W K ∈ R N×N , and W V ∈ R N×N represent the learnable parameter matrix, h is the number of headers, and shape(.) is the header splitting operation of the matrix, that is, the size is converted from [win In addition, Q h ∈ R win× N h ×d , K h ∈ R win× N h ×d , and V h ∈ R win× N h ×d represent the three functional matrices corresponding to the i-th header obtained through linear mapping and the splitting operation, respectively.
Then, we further use the dot product attention operation to represent the highdimensional self-attention layer, as shown in the following.
where a t,t ∈ A ∈ R d (A ∈ R T p ×T p ×d ) represents the normalized weight matrix between combination t and combination t , t ∈ T p , t ∈ T p , and the second dimension is equal to d q = d k = d v = N h . head i represents the characteristic matrix under the i-th header. Finally, the final output of the high-dimensional multi-head self-attention mechanism is obtained by splicing and linear mapping of all Matrices corresponding to all header matrices.
The difference between the high-dimensional attention mechanism H−DAttn(.) and the high-dimensional self-attention mechanism is reflected in the input. The query, key, and value of the high-dimensional attention mechanism can be different inputs. Three functional matrices can be obtained by performing a linear transformation on the second dimension (i.e., node number dimension) of different traffic-embedded data X 1 ∈ R win 1 ×n×d , X 2 ∈ R win 2 ×n×d , X 3 ∈ R win 3 ×n×d and decomposing them into multiple heads, as shown in the formula below.
After the linear mapping operation, the corresponding Q, K , V matrices are obtained. The following operations are the same as the above self-attention mechanism and will not be repeated. Decoder Zero-occupation Input: Because the traditional Seq2Seq architecture performs multi-step prediction through continuous self-regression in the decoding process, which reduces the calculation speed in the prediction process. Referring to the experience obtained by the Informer [30], similarly, we concatenate some historical features with zero vectors as the input of the decoder, as shown in the following.

Decompose to h Headers
where X t tok ∈ R L tok ×N×d is the starting token and X t 0 ∈ R L y ×N×d is the placeholder of the target sequence (set the scalar to 0), which equals the number of the prediction length. The starting token is a sequence of length sampled from the encoder input sequence, which corresponds to a segment before the prediction time. This paper takes the prediction of the next 1 h as an example, the sequence input to the encoder is the node characteristic sequence of five historical time periods. We take the last two time periods of the sequence as the starting token, namely L tok = 2, and input X de = {X 2p , X 0 } to the decoder. Note that the decoder here predicts all outputs in one step, and does not need time-consuming "dynamic decoding" transactions in the ordinary encoder-decoder architecture, which greatly reduces the calculation time.

Loss Function
SGGformer uses Negative Log Likelihood Loss (NLLLoss) as the target loss function. After the feature extraction of the spatiotemporal network, the matrix X out ∈ R N×Cls representing all regions at different levels is obtained. Before calculating the loss value, it is necessary to calculate the probability distribution of each road corresponding to different grades. Therefore, the above matrix is operated by So f tmax operation, which is shown as follows:Ŷ where X i,j refers to the value of the corresponding grade j of the road i in the matrix, and the grade probability matrixŶ ∈ R N×Cls represents the probability of all roads corresponding to different grades, which is used as the input of the loss function. The loss value is calculated as follows: where L(Ŷ, Y) is the loss value of the prediction grade and the real grade, and Y ∈ R N×Cls represents the real level of all roads. For each road (each line), the index value corresponding to the real grade is 1, and the rest is 0. Finally, the total loss value is obtained by summing the loss values of all roads.

Data Description and Preprocessing
This experiment is based on the floating car data of a big data platform DiDi, and the vehicle track data from 1 November to 30 November 2016, are selected as the source data, which is shown as Figure 9. According to the divided sub-regions, the average speed and the average flow of the sub-regions are counted per hour, the details are described as follows. In this paper, the track data point x is preprocessed by three main steps. First, we divide the region with the size of 8 × 8 based on the rectangular distribution range of track points. Next, the track data point is counted and calculated. The original track data point TP i = {ts i , lon i , lat i }, i ∈ {1, . . . , N TP } is obtained, where ts i is the time stamp of track point i, lon i and lat i are the longitude and latitude of track point i, respectively. We can obtain the velocity characteristics of each track point by calculating the ratio of the distance between adjacent track points of vehicles and the time stamp. Finally, the regional characteristics are counted. For the regional flow and average speed, we can get the weighted average of the number of characteristic points and the speed of characteristic points at the corresponding time stamp.
According to the obtained regional data, we need to further develop the input data. Since this paper takes the 24 h historical regional traffic characteristics of the whole graph as the input, a total of 720 sets of datasets are obtained through the rolling selection method. Moreover, the specific parameters of the model and training process are shown in Table 1.

Assessment Indicators and Baseline Model
In this paper, accuracy and the quadratic-weighted Kappa coefficient (Kappa coefficient) are used as the evaluation indicators of the prediction effect. The quadratic-weighted Kappa coefficient represents the consistency between the prediction grade and the actual grade distribution, representing the prediction's accuracy and deviation. The calculation process is based on the confusion matrix, and the value is between −1 and 1. The closer the value is to 1, the higher the consistency of the prediction grade results. Specifically, the calculation method of accuracy and weighted kappa coefficient are shown as follows.
(1, i f v t =ṽ t else 0) (23) where, for the accuracy rate equation, v t is the actual grade,ṽ t is the prediction grade, n is the number of prediction grades. For the quadratic-weighted kappa coefficient equation, p is the confusion matrix, p i,j represents the frequency of occurrence of the data instances that the road with a real grade i is judged a grade j in all prediction results. p i,: represents the ratio of the number of data instances with a real grade i to the the total number of instances, p :,j represents the ratio of the number of data instances that the road is predicted as a grade j to the total number of instances, ω i,j is a weight and Cls represents the number of grades.
To evaluate SGGformer's competitive performance, we compare it with the following baseline. In these baselines, the FC-LSTM, ConvLSTM, DCRNN, and STGCN inputs are the average road speed or flow. All baselines are optimized for optimal performance. FC-LSTM: A classical cyclic network used for time series data modeling. Here, the full connection layer maps the time dimension linearly. Precisely, the dimensions are mapped to 64 and 24, respectively, and then input to the double-layer LSTM layer, and finally output the final result through the two full connection layers. The hidden layer dimension of the first full connection layer is 64, the hidden layer dimension of the double layer LSTM is 64, the hidden layer dimension of the final full connection layer is 24, and the output layer dimension is 5.
ConvLSTM: A classical hybrid neural network used for traffic spatiotemporal data modeling. Here, we first model the spatial correlation of traffic data through the stacked two-layer CNN, then input it into the two-layer LSTM layer, and finally output the final result through the two-layer full connection layer. The convolution kernel size of the first layer convolution is 8, the step size is 4, the convolution kernel size of the second layer convolution is 3, the step size is 2, the hidden layer dimension of the double-layer LSTM is 64, the hidden layer dimension of the final fully connected layer is 24, and the output layer dimension is 5.
DCRNN: It refers to the method of [20], builds a directed graph based on sensors, and gives an edge weight by measuring the proximity between sensor pairs. The dynamic traffic flow is modeled as a two-way diffusion process. Diffusion convolution is proposed to capture spatial dependence, and a cyclic neural network is used to model the time dependence. The first two layers of diffusion convolution raise the dimension to 64, and the third layer of diffusion convolution reduces the dimension to 32.
STGCN: It refers to the method of [21], which combines graph and gated time convolution to learn spatiotemporal patterns from traffic sequence data based on a graph structure. The structure consists of two layers of spatiotemporal graph convolution modules. Each module consists of a layer of temporal convolution module, a layer of spatial graph convolution module, and a layer of temporal convolution module. The first layer of the time convolution module increases the number of feature channels to 64, and the space map convolution module reduces the number of feature channels to 32, the last layer of the time convolution module increases the number of feature channels to 64, and the hidden layer dimension of the last full connection layer is 80. Figure 10 shows the traffic prediction performance between the SGGformer model and the baseline on the dataset. See Tables 2 and 3 for specific values. Firstly, the SGGformer presented in this paper shows stable and relatively optimal tables under different prediction lengths. Based on the average performance of the three prediction lengths, the comparison accuracy and quadratic-weighted kappa coefficients are optimized by 1.7% and 0.9%, respectively, compared with the optimal baseline, and the optimization effect becomes more and more evident with the increase of prediction lengths. At the same time, the effect fluctuation of the SGGfromer under different prediction lengths is less than 1%, which proves the stability and robustness of this method under different prediction lengths.  Secondly, by comparing different deep learning baselines, it can be found that the performance of different methods on the quadratic-weighted kappa coefficient is not different, which is about 0.94. For prediction accuracy, STGCN is optimal under different lengths, and its advantages are more concentrated on short-term prediction. Specifically, with the increase in the prediction length, the prediction accuracy is reduced from 5.64% to 2.48% compared with other baselines. DCRNN is on a par with STGCN in terms of its performance different from the measured length. Specifically, its accuracy in predicting the future 1 h and 6 h groups decreases by 0.4% on average, and its accuracy in predicting the future 3 h groups exceeds 0.3%. The performance of ConvLSTM under different prediction lengths is slightly worse than that of STGCN, and the gap gradually narrows with the increase of prediction lengths, with differences of 3.15%, 1.47%, and 0.75%, respectively. Overall, FC-LSTM showed a stable prediction effect but was weaker than ConvLSTM, with an average difference of 0.68%. The gap becomes more significant with the increase of prediction length.

Comparison with Baseline
In order to show the prediction effects of different methods more intuitively, the figure below shows the difference between the prediction level and the actual level of different methods in the selected period when the prediction length is 1 h, 3 h, and 6 h. The periods selected in this paper are 12, 15, 19, and 22 h lower on the first day of the test set. This is because the level distribution in these periods is relatively complex. Figure 11 shows the grade difference heat map with a prediction length of 1 h. It can be found that the SGGformer shows the most accurate effect both in terms of quantity and difference. In some periods, STGCN and DCRNN had a slight difference (12,19, and 22 h), but in some periods (15 h), the difference between the baseline method and SGGformer was noticeable. In general, the SGGformer has shown stable and excellent results in different periods. Compared with different baselines, the difference between DCRNN and STGCN is insignificant, and the effect of the ConvLSTM is worse than the two baselines. However, the distribution between the three is similar. FCLSTM has the worst effect among all comparison methods, and its distribution is not similar to other methods. Figure 12 shows the thermal diagram of the difference between the predicted and actual grades with the predicted length of 3 h. On the whole, compared with the prediction length of 1 h, there are fewer areas of error prediction, which is consistent with the results of the prediction accuracy table above. Similarly, comparing different methods, we can find that the SGGformer still shows the most robust performance, DCRNN and STGCN show slightly worse performance, and ConvLSTM and FCLSTM show even worse performance, specifically, more error areas are predicted.  Finally, the SGGformer's performance is optimal when the future prediction length is 6 h, which is shown in Figure 13. However, its effect is even worse than STGCN in some specific periods (such as 12 h periods). The performance of other baselines is analyzed as above.

Ablation Research
To study the effects of different components in the proposed SGGformer model, we repeatedly removed one of the components to conduct ablation research.

1.
Based on SGGformer, the shifted window operation is removed, specifically, GCN combines Graph Transformer to build a model to verify the rationality of the shifted window; 2.
The GCN based on SGGformer is removed, specifically, the GCN is directly input into the Graph Transformer network after the shifted window to verify the effectiveness of GCN for spatial feature capture; 3.
The Graph Transformer based on the SGGformer is detached, specifically, SGCN is simply utilized to verify the rationality of the Graph Transformer. Figure 14 and Tables 4 and 5 show the accuracy and quadratic-weighted kappa coefficients of the SGGformer and its variants at different prediction lengths. For precision performance, the SGGformer has an average increase of 1.59% compared with the variant with the best performance. Among many variants, the G-Gformer has a slight advantage over the other two, showing an average increase of 0.47% in accuracy. The remaining two variants, S-Gformer and S-GCN, have similar effects when the prediction length is 1 h and 3 h, and the S-Gformer has an increase of about 0.3% compared with S-GCN when the prediction length is 6 h.
For the quadratic-weighted kappa coefficient representing consistency, the SGGformer has an average increase of 0.47% compared with the best performing variant G-Gformer. However, the difference between the quadratic-weighted kappa coefficients of other variants is slight, about 0.1%.

Discussion
The experiment part verifies the effectiveness of SGGformer in two parts: baseline comparison and ablation analysis. According to the experimental results, we can draw the following conclusions.

1.
It can be seen from the baseline comparison results that the graph convolution based on the sliding window combined with Graph Transformer is effective. Precisely, graph convolution can effectively capture the non-European spatial correlation characteristics in the traffic data, and Graph Transformer has greater advantages in capturing nonlinear temporal correlation of the traffic data. STGCN, also a spatiotemporal property acquisition network based on a graph convolution network, shows a slightly weaker performance, indicating that GCN is more appropriate in spatial correlation properties. However, compared with the use of a Graph Transformer to capture temporal-related characteristics, STGCN and DCRNN use Gated convolution and GRU to extract features, respectively, which shows some shortcomings in performance. Compared with the former, ConvLSTM and FC-LSTM both use the capture time characteristics of LSTM, showing poorer prediction performance. The difference between the two is that the former uses ordinary convolution to capture European spatial characteristics. In contrast, the latter uses a simple, fully connected network to capture spatial correlation characteristics. Simply flattening regions undoubtedly ignores the spatial relationship between regions, resulting in a relatively worse prediction effect.

2.
The comparison results of ablation experiments verify the effectiveness of different components of SGGformer. The G-Gformer cancels the shifted window operation, the S-Gformer deletes the GCN model and extracts the temporal-dependent characteristics through the Graph Transformer, while S-GCN uses the spatial feature extraction model for traffic prediction. The prediction effect of all variants is worse than that of all SGGformers, which proves that the three components contribute to improving the prediction effect. Among them, the prediction effect of the G-Gformer is second only to SGGformer in prediction performance, which is reasonable. Because this variant only removes the operation of the shifted window operation, it can still effectively extract spatial and temporal correlations in traffic data. However, it lacks the division of time phase by the shifted window, thus reducing specific performances. The S-Gformer and S-GCN are canceled, respectively, from the space extraction module and time extraction module, which significantly reduces the feature extraction ability for traffic data, thus showing the worst performance. By comparing the two, the S-Gformer is slightly better than S-GCN, which means that the time feature extracted by the Graph Transformer is more effective than the spatial feature extracted by GCN for traffic prediction.
Based on the above analysis, we know the advantages of the SGGformer and the necessity of different components. However, the current method of testing has some shortcomings:

1.
In this paper, the spatial feature extraction network, GCN, uses a fixed adjacency matrix to extract spatial features and does not fully consider the impact of timevarying traffic flow on space.

2.
This paper mainly finds that the performance of the Seq2Seq architecture is not fully utilized for traffic grade prediction at a particular time in the future. That is, the multi-step prediction task in the future is not involved.
Therefore, using a dynamic adjacency matrix and multi-step prediction is the future direction of improvement. To achieve a dynamic adjacency matrix, we consider adding an attention mechanism to capture the relative relationship of dynamic space.

Conclusions
In this paper, the sliding window operation, multi-channel graph convolution, and improved Graph Transformer are combined to extract the spatiotemporal correlation characteristics of traffic and complete the task of traffic grade prediction. The multi-channel graph convolution is used to complete the modeling of spatial dependency, and the graph transformer is used to capture the temporal dependency. Then, in terms of accuracy and prediction consistency, this paper compares the proposed SGGformer with mainstream baseline methods. The SGGformer shows better results in both aspects, verifying the effectiveness of this method. At the same time, according to the comparison results with different variants, the positive contributions of different components to the spatiotemporal feature extraction are verified.
Although the current method has achieved good results, there is still room for improvement in capturing spatiotemporal correlation. Specifically, using an attention mechanism to construct a dynamic adjacency matrix is an effective improvement method to improve the temporal extraction ability. At the same time, using Graph Transformer based on the Seq2Seq architecture to conduct multi-step prediction is also a hot topic in traffic prediction. In future work, we will integrate these improvements to improve the model's comprehensiveness and robustness and achieve a higher prediction performance.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: