A Temporal Directed Graph Convolution Network for Traffic Forecasting Using Taxi Trajectory Data

Abstract: Traffic forecasting plays a vital role in intelligent transportation systems and is of great significance for traffic management. The main issue in traffic forecasting is how to model spatial and temporal dependence. Current state-of-the-art methods tend to apply deep learning models; however, these methods are difficult to interpret and ignore the a priori characteristics of traffic flow. To address these issues, a temporal directed graph convolution network (T-DGCN) is proposed. A directed graph is first constructed to model the movement characteristics of vehicles, and based on this, a directed graph convolution operator is used to capture spatial dependence. For temporal dependence, we couple a keyframe sequence and a transformer to learn the tendencies and periodicities of traffic flow. Using a real-world dataset, we confirm the superior performance of the T-DGCN through comparative experiments. Moreover, a detailed discussion is presented to provide the path of reasoning from the data to the model design to the conclusions.


Introduction
Traffic flow forecasting aims to estimate the traffic conditions (e.g., the velocity or travel time of traffic flow) of each segment on a road network in future time periods based on historical information [1]. It plays an important role in intelligent transportation systems (ITSs) on account of its extensive applications in urban transportation [2]. For instance, Google Maps can provide users with high-quality route planning and navigation services with the aid of traffic forecasting for the purpose of avoiding traffic congestion [3]. Despite the massive efforts made by relevant studies, high-precision and high-reliability traffic forecasting is still hampered by the nonlinear dependence of traffic flow variables in the dimensions of both space and time [1,2,4-6].
On the one hand, the time series of traffic flow variables generally present significant temporal dependence in both the short and long term [4]. Specifically, traffic conditions are highly correlated with those observed at adjacent times, and these short-term correlations gradually decay with increasing temporal distance. Additionally, the periodicity of traffic flow series on multiple temporal scales can be modeled as long-term temporal dependence. On the other hand, relevant studies have confirmed the existence of dependence between the traffic flow variables observed on topologically connected road segments with certain time lags; this is defined as spatiotemporal dependence [1,2,4]. In traffic applications such as autonomous driving and signal light control, model-based traffic simulators (e.g., LWR and PW) have been widely employed to simulate various traffic flows on road networks by considering spatiotemporal dependences [7]. However, in spite of their effectiveness in modeling the evolution of traffic flow on road networks, the lack of vehicle behavior information combined with the high computational cost fundamentally limits the application of model-based traffic simulators to real-time traffic forecasting on large-scale urban road networks [8]. Nowadays, the increasing availability of discrete trace points recorded by vehicle-mounted GPS enables the characterization of time-varying traffic flow states at the road segment level [9]. In this context, a large number of data-driven models have been specifically designed for the task of traffic flow forecasting [1,2,4-6]. Currently, there are two alternative strategies for handling spatiotemporal dependence in traffic flow forecasting based on data-driven models. The first is constructing machine learning models by modeling spatiotemporal dependence as parameters to be estimated, such as the space-time auto-regressive integrated moving average (ST-ARIMA) model [10].
The second strategy extracts implicit features derived from spatiotemporal dependence with deep learning-based forecasting methods that couple a convolutional neural network (CNN) with a recurrent neural network (RNN), such as CNN-Long Short-Term Memory (CNN-LSTM) models [11]. However, the requirement of grid partitioning in Euclidean space limits, to a large extent, the capacity of traditional CNNs to accurately capture the spatial dependence among road segments. To handle this, recent studies have constructed an undirected graph structure to express the topological relationships between road segments and have employed graph convolution neural networks on this graph to implement traffic flow forecasting [1,2,4-6,12-14].
According to related studies in the field of transportation, there are a total of three elements, i.e., drivers, vehicles, and road segments, that constitute a transportation system [15]. This means that the traffic flow on a road network is determined by both the moving characteristics of the vehicles and the driving rules on the road segments. In the road network shown in Figure 1, the flow direction and volume of moving vehicles are represented by arrows and dotted lines, respectively. Segments 4 and 2 are both spatially adjacent to segment 1, so segment pairs 4-1 and 2-1 have a consistent topological structure. However, the two segment pairs do not necessarily share similar traffic flow distributions due to the diverse driving directions of the vehicles. In addition, the driving rules on road segments cannot be represented by the topology. For instance, segments 1, 4, and 7 are all one-way roads with only one allowable driving direction, while vehicles are only allowed to turn around on segment 3 despite it being topologically connected with segment 1. There is a similar case in which the vehicles on segment 4 are prohibited from turning left into the adjacent segment 6. Based on the above discussions, we can determine that the diversity of driving directions and rules on road segments poses great challenges to current methods of anisotropic spatial dependence modeling and reliable traffic condition forecasting. To overcome the aforementioned challenges, this study develops a new traffic flow forecasting method by constructing a temporal directed graph convolution network (T-DGCN) with the combined consideration of multiterm temporal dependence and vehicle movement patterns on road networks. 
The main contributions of this study include the following aspects: (1) A directed graph is constructed based on the Markov transition probabilities of traffic flow to model the spatial dependence in an objective way, while a new spectral directed graph convolution operator is designed to address the asymmetry of the directed graph. (2) A transformer architecture with a novel global position encoding strategy is integrated to capture multiterm temporal dependence, with the aim of improving the interpretability and validity of the forecasting model.
The remainder of this article is organized as follows: Section 2 gives a full review of the relevant research. Section 3 defines the critical problem and presents the proposed T-DGCN. In Section 4, comparative experiments on real-world datasets are performed to validate the superiority of the proposed method, while Section 5 provides an attribution analysis of the experimental results. Finally, we conclude this study and provide future research directions in Section 6.

Related Work
With the extensive utilization of data mining models in traffic flow analysis during the past few decades, an enormous number of methods have been specifically designed for traffic flow forecasting based on machine learning models or deep neural networks [1,2,5,6]. These two types of methods are reviewed in detail in the following.
Machine learning-based spatiotemporal forecasting models aim to estimate the target spatial variable values at future times through parameter training with the constraint of artificially defined spatiotemporal dependence.
With the successful use of the autoregressive integrated moving average model (ARIMA) in time series forecasting [16], Hamed et al. [17] initially introduced this machine learning model to urban traffic volume forecasting. On this basis, extensively modified ARIMA models were successively proposed to improve traffic flow forecasting accuracy. For instance, the Kohonen ARIMA model used a Kohonen self-organizing map to separate the initial time series into homogeneous fragments to track the long-term temporal dependence [18]. Guo et al. [19] integrated the Kalman filter with the generalized auto-regressive conditional heteroskedasticity model to improve the performance of short-term traffic flow forecasting. In addition to ARIMA-based models, support vector regression (SVR)-based models also have outstanding performance in traffic flow forecasting [20]. For instance, Su et al. [21] utilized the incremental support vector regression (ISVR) model to implement the real-time forecasting of traffic flow states, and Gopi et al. [22] proposed a Bayesian support vector regression model, which can provide error bars along with predicted traffic states. Besides this, other common machine learning models have also been applied to the task of traffic flow forecasting. Yin et al. [23] combined fuzzy clustering with a neural network to design a fuzzy neural traffic flow forecasting approach. Cai et al. [24] constructed an improved K-nearest neighbor (KNN) graph to optimize short-term traffic flow forecasting results with the help of spatiotemporal correlation modeling. Sun et al. [25] proposed a Bayesian network-based approach to maximize the joint probability distribution between the historical traffic flow states used as antecedents and the future states to be estimated.
Considering the subjectivity in the measurement of spatiotemporal proximity effects, existing machine learning-based models are greatly limited in capturing the underlying dependence in multiple ranges in space and time. Compared to traditional machine learning models, deep neural networks have self-learning capacity without the input of any artificially extracted features. This powerful learning capability has enabled various types of deep neural networks to be utilized in the forecasting of traffic flow on road networks [1,3,6].
In essence, the traffic flow on road networks can be classified as a kind of space-time sequence data [2]. Specifically, for the traffic flow sequence on any road segment, the RNN and its variants, such as the long short-term memory (LSTM) unit [26] and the gated recurrent unit (GRU) [27], were widely utilized to learn the dependence between time-varying traffic flow states. For example, Ma et al. [28] developed a forecasting approach to analyze the evolution of traffic congestion by coupling deep restricted Boltzmann machines with an RNN that inherits congestion prediction abilities. Tian et al. [29] utilized an LSTM to determine the optimal time lags dynamically and to achieve higher forecasting accuracy and better generalization. Focusing on the spatial dimension, Wu and Tan [30] mapped the recorded traffic flow states into regular grids divided from the study area to stack sequential images in chronological order. This can leverage the local receptive field in a CNN to capture the spatial dependence of traffic flow states in planar space. However, it is well known that the transfer of traffic flow is rigidly constrained on road networks in reality, so it is necessary to measure the spatiotemporal dependence of traffic flow in the road network space. To address this issue, most studies have used each segment or sensor as the minimum spatial unit and have organized the road network into a graph based on the topological relationships between segments [1,3,6]. In this way, the idea of graph convolution can be employed to extract spatially dependent embedded features from the graph structure. For example, Zhao et al. [2] designed a T-GCN model that introduced 1st ChebNet [12] to model the spatial dependence of traffic networks. Li et al. [13] proposed a diffusion convolutional recurrent neural network (DCRNN) model that performed a diffusion graph convolution on a traffic network to aggregate the spatial neighborhood information of each node and captured long-term temporal dependence using an RNN. Yu et al. [31] constructed a 3D graph convolution network that could simultaneously capture spatial and temporal dependence in the process of feature learning.
As mentioned in Section 1, although existing methods have utilized the topological structure of traffic networks to model spatial dependence, it is still necessary to quantitatively represent the movement patterns and driving rules of vehicles on road networks to improve the rationality of traffic flow forecasting. In terms of temporal dependence, in the majority of current RNN-based strategies, the specific modeling of the tendency and periodicity characteristics in the time-varying process of traffic flow states is insufficient. That is, a large number of relevant historical observations have not yet been sufficiently exploited in an appropriate way, which restricts the accuracy of traffic flow forecasting. To solve these two problems, this study designs a new method by coupling a directed graph convolution network with a transformer structure to model anisotropic spatial dependence and multiterm temporal dependence for the purposes of self-learning the underlying spatiotemporal features of traffic flow states to obtain high-precision forecasting results.

Method
This section describes the proposed new traffic flow forecasting method. Specifically, a directed traffic graph is first constructed by using a Markov chain-based strategy, as described in Section 3.1; based on that, a spectral directed graph convolution kernel is used to capture anisotropic spatial dependence, as presented in Section 3.2. In Section 3.3, we design a keyframe sequence and employ a transformer structure for the extraction of multiterm temporal dependence features. Finally, in Section 3.4, we build the T-DGCN by assembling the spatial and temporal dependency learning modules.

A Markov Chain-Based Strategy for Constructing a Directed Traffic Graph
In this study, considering the directivity of traffic flow, we specifically represent the traffic information on road networks using a graph structure $G = (V, E, P)$, where the road segments and intersections constitute the node set $V = \{v_1, v_2, \ldots, v_M\}$ and the edge set $E = \{e_1, e_2, \ldots, e_N\}$, respectively. In this way, the traffic flow states on the road network can be abstracted into a tensor $X \in \mathbb{R}^{M \times T \times C}$, where $M$, $T$, and $C$ denote the number of segments, timestamps, and traffic flow feature dimensions, respectively. For the edges in $G$, the majority of the current related studies generally quantify the topological relationships between the road segments to obtain a symmetrical adjacency matrix $P$. To further reflect the anisotropy in traffic flow spatial dependence, this study constructs a Markov chain-based directed graph to describe the transition probabilities of the traffic flow at intersections.
From the perspective of a discrete stochastic process, the transition of the traffic flow between any pair of nodes in $G$ can be considered to follow the hypothesis of a random walk [32]. Let $rs_t$ denote the road segment on which the traffic flow is located at timestamp $t$. The transition process can be modeled using a Markov chain, i.e.,
$$P[rs_{t+1} = v_j \mid rs_t = v_i, rs_{t-1}, \ldots, rs_0] = P[rs_{t+1} = v_j \mid rs_t = v_i] \tag{1}$$
This means that the current traffic flow states can entirely determine the future distribution of traffic flow on road networks. On this basis, given any two nodes $v_i$ and $v_j$, we can calculate the transition probability of traffic flow from $v_i$ to $v_j$ as $p_{ij} = P[(rs_t = v_i) \to (rs_{t+1} = v_j)]$ and can construct the following Markov transition matrix:
$$P = \begin{bmatrix} p_{11} & \cdots & p_{1M} \\ \vdots & \ddots & \vdots \\ p_{M1} & \cdots & p_{MM} \end{bmatrix} \tag{2}$$
We recombine the road nodes into a graph structure according to the transition matrix $P$. To obtain the transition matrix, we define an intermediate variable $\gamma_{ij}$ to denote the number of vehicles that move from segment $v_i$ to $v_j$ and form the following matrix $\Gamma$:
$$\Gamma = \begin{bmatrix} \gamma_{11} & \cdots & \gamma_{1M} \\ \vdots & \ddots & \vdots \\ \gamma_{M1} & \cdots & \gamma_{MM} \end{bmatrix} \tag{3}$$
On this basis, the transition matrix $P$ can be expressed as
$$P = \mathrm{diag}(\Gamma \mathbf{1})^{-1} \Gamma \tag{4}$$
Here, $\mathbf{1}$ is a vector of all ones, so each row of $\Gamma$ is divided by its row sum. In this transition matrix, each element essentially quantifies the moving probability of traffic flow from $v_i$ to $v_j$.
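The row normalization that turns vehicle counts into transition probabilities can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `count_transitions` and `transition_matrix`, the integer segment indices, and the toy trajectories are all hypothetical. Consecutive observations on the same segment are counted as self-transitions, which is one plausible way to obtain self-loops in the resulting graph.

```python
def count_transitions(trajectories, m):
    """Count moves between consecutive segments (hypothetical helper).

    trajectories: list of segment-index sequences, one per vehicle,
                  sampled at consecutive timestamps (an assumption).
    m:            number of road segments.
    """
    gamma = [[0] * m for _ in range(m)]
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            gamma[a][b] += 1
    return gamma


def transition_matrix(gamma):
    """Row-normalize the count matrix: each row is divided by its sum."""
    P = []
    for row in gamma:
        total = sum(row)
        if total == 0:
            raise ValueError("segment with no observed outgoing transitions")
        P.append([c / total for c in row])
    return P


# Toy example with three segments and three vehicles.
trajectories = [[0, 0, 1, 2], [1, 1, 2, 0], [2, 0, 0, 1]]
gamma = count_transitions(trajectories, 3)
P = transition_matrix(gamma)
```

Raising an error on an all-zero row is a design choice for the sketch; a real pipeline would need some policy for segments that no vehicle leaves during the observation period.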

A Directed Graph Convolution Kernel for Capturing Spatial Dependence
Regarding the forecasting of space-time sequences organized using graph structures, e.g., traffic flow series, the spectral graph convolution neural network has shown powerful performance in learning dependence features on multiple spatial scales [12]. However, most spectral-based methods are limited to only working on undirected graphs [33]. According to spectral graph theory, it is necessary to find a directed Laplacian operator to implement the convolution operation on a constructed directed traffic graph without the loss of direction information. In this case, we leverage the Perron-Frobenius theorem to embed a directed Laplacian operator into the graph convolution neural network [34].
Let $r_{ij}(n) = P[(v_i \to \cdots \to v_j)_n]$ denote the probability that the state changes from $v_i$ to $v_j$ after $n$ steps; this term can be calculated using the following Chapman-Kolmogorov equation [34]:
$$r_{ij}(n) = \sum_{k=1}^{M} r_{ik}(n-1)\, p_{kj} \tag{5}$$
The connectivity of the urban road network indicates that any two road segments can be connected through the flow of vehicles ($\forall v_i, v_j \in V$, $\exists n$ such that $r_{ij}(n) > 0$), which means that the Markov chain-based directed graph is strongly connected. According to the steady-state convergence theorem, the stationary distribution of traffic flow states on road networks can be denoted as [34]:
$$\pi = \lim_{n \to \infty} \pi_0 P^n \tag{6}$$
Here, $\pi_0$ denotes the initial vector of traffic flow states. We can treat $\pi$ as a Perron vector according to the Perron-Frobenius theorem to define the Laplacian operator of a directed graph, i.e.,
$$L = I - \Pi^{1/2} P\, \Pi^{-1/2}, \quad \Pi = \mathrm{diag}(\pi) \tag{7}$$
For this asymmetric matrix, the corresponding symmetric Laplacian can be expressed as [33]
$$L_{sym} = I - \frac{1}{2}\left(\Pi^{1/2} P\, \Pi^{-1/2} + \Pi^{-1/2} P^{T} \Pi^{1/2}\right) \tag{8}$$
In this way, we symmetrize the original directed traffic graph, so we can obtain the graph convolution kernel $x * g_\theta = U\left((U^T x) \odot (U^T g_\theta)\right)$, where $U$ is the matrix of eigenvectors of $L_{sym}$. Then, this filter can be approximated using Chebyshev polynomials $T_k$ [33]:
$$x * g_\theta \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}_{sym})\, x \tag{9}$$
where $\tilde{L}_{sym} = \frac{2}{\lambda^{sym}_{max}} L_{sym} - I$ is the rescaled form of $L_{sym}$ for locating eigenvalues within $[-1, 1]$. Let $K = 2$ and $\theta = \theta_0 = -\theta_1$ and further approximate the largest eigenvalue of $L_{sym}$ as $\lambda^{sym}_{max} \approx 2$ according to [12]. The filter can be simplified as
$$x * g_\theta \approx \theta \left(I + \frac{1}{2}\left(\Pi^{1/2} P\, \Pi^{-1/2} + \Pi^{-1/2} P^{T} \Pi^{1/2}\right)\right) x \tag{10}$$
To alleviate the problems of exploding and vanishing gradients, Kipf and Welling [12] used a renormalization strategy, i.e., $I + D^{-1/2} A D^{-1/2} \to \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, by adding a self-loop to each node, $\tilde{A} = A + I$. Due to the self-loop structure of the Markov chain-based directed graph, we utilize another renormalization strategy.
Let θ = 2 and Equation (10) can be redefined as Finally, the directed graph convolution layer can be represented as Here, $\theta \in \mathbb{R}^{d_{in} \times d_{model}}$ is the learnable parameter, and $d_{in}$ and $d_{model}$ denote the dimensions of the input features and hidden features, respectively.
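The ingredients of the spectral construction in this subsection can be illustrated with a small Python sketch. It assumes a Chung-style symmetrization, in which the stationary distribution π acts as the Perron vector and the symmetrized operator is S = ½(Π^{1/2} P Π^{-1/2} + Π^{-1/2} Pᵀ Π^{1/2}) with Π = diag(π), so that the symmetric Laplacian is I − S. The function names, the power-iteration stopping rule, and the two-state toy matrix are assumptions made for illustration, and the sketch does not reproduce the renormalized layer used in the model.

```python
def stationary_distribution(P, iters=1000, tol=1e-12):
    """Approximate pi with pi = pi P by power iteration from a uniform start."""
    m = len(P)
    pi = [1.0 / m] * m
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def symmetrized_operator(P, pi):
    """S[i][j] = 0.5 * (sqrt(pi_i) P_ij / sqrt(pi_j) + sqrt(pi_j) P_ji / sqrt(pi_i))."""
    m = len(P)
    s = [x ** 0.5 for x in pi]
    return [
        [0.5 * (s[i] * P[i][j] / s[j] + s[j] * P[j][i] / s[i]) for j in range(m)]
        for i in range(m)
    ]


def symmetric_laplacian(S):
    """L_sym = I - S."""
    m = len(S)
    return [[(1.0 if i == j else 0.0) - S[i][j] for j in range(m)] for i in range(m)]


# Toy strongly connected chain with self-loops (hypothetical values).
P = [[0.9, 0.1], [0.5, 0.5]]
pi = stationary_distribution(P)
S = symmetrized_operator(P, pi)
L_sym = symmetric_laplacian(S)
```

A quick sanity check on such a sketch is that S is symmetric and that the vector of square roots of π is an eigenvector of S with eigenvalue 1, which means L_sym has a zero eigenvalue as a graph Laplacian should.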

A Transformer Structure for Learning Temporal Dependence Features
In addition to the dependence of traffic flow in the space dimension, another critical issue in traffic flow forecasting is extracting dependence features between traffic flow states at distinct timestamps [2]. The most widely used solution to this problem at present is the RNN [1]. However, current RNN-based models were not specifically designed considering the inherent time-variant characteristics of traffic flow states and tend to be overly complex, including a large number of learnable parameters. On the basis of prior knowledge, we design keyframe sequences to organize the original data and leverage a transformer structure to extract multiterm temporal dependency features.
As discussed in Section 1, the temporal dependence of traffic flow states mainly includes short-term and long-term dependence; these indicate the tendencies and periodicities of traffic flow time series, respectively. For each road segment at $t$, we first define the tendency-related sequence as $X_t(t) = \{X(t - \Delta t) \mid \Delta t \le tl,\ \Delta t \in \mathbb{N}^+\}$ by using a time lag parameter $tl$. In addition, current relevant work generally regards the periodicity as the correlations between the observations at $t$ and those at the corresponding times in the previous few days or weeks [35]. Considering the slight fluctuation in the variation cycle regarding traffic flow states, this study introduces a time window parameter $tw$ to define an interval around each periodic timestamp, within which the periodicity can be refined by embedding the local tendencies. Then, the periodicity-related sequence can be defined as $X_p(t) = \{X(t - \Delta t) \mid \Delta t \in [nT_p - tw,\ nT_p + tw],\ n \le N_p\}$ with $n, N_p, tw \in \mathbb{N}^+$ within $N_p$ cycles, where $T_p$ denotes the length of one cycle. $X_t(t)$ and $X_p(t)$ form the keyframe sequence $X_k(t)$ at timestamp $t$. By inputting each member of $X_k(t)$ to the directed graph convolution layer in parallel, we can capture a spatial feature sequence tensor $F(t) \in \mathbb{R}^{M \times (tl + N_p(1 + 2tw)) \times d_{model}}$. To facilitate the capture of time- and space-varying temporal dependence, we further employ daily periodic position embedding [36] and node2vec embedding [37] strategies to encode the absolute time and space information for each timestamp and each road segment. After that, the tensor $F(t)$ can be integrated with the space-time information by elementwise addition operations.
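The index bookkeeping behind the keyframe sequence can be made concrete with a short Python sketch. It enumerates the time lags Δt selected by the tendency-related and periodicity-related definitions. The function name `keyframe_lags` and the cycle length of 96 steps (one day at a 15 min resolution) are assumptions made for illustration; the parameter values in the example are the ones reported later in the experiments.

```python
def keyframe_lags(tl, tw, n_cycles, period):
    """Return the time lags (in steps) that make up a keyframe sequence.

    tl:       number of most recent steps (tendency-related part)
    tw:       half-width of the window around each periodic timestamp
    n_cycles: number of past cycles N_p
    period:   cycle length T_p in steps (assumed value supplied by caller)
    """
    tendency = list(range(1, tl + 1))
    periodic = [
        n * period + d
        for n in range(1, n_cycles + 1)
        for d in range(-tw, tw + 1)
    ]
    return tendency, periodic


# tl = 12, tw = 5, N_p = 3; period = 96 is an assumed daily cycle.
tendency, periodic = keyframe_lags(tl=12, tw=5, n_cycles=3, period=96)
sequence_length = len(tendency) + len(periodic)
```

With these values the keyframe sequence has 12 + 3 × (1 + 2 × 5) = 45 members, which matches the second dimension tl + N_p(1 + 2tw) of the spatial feature tensor.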
Targeting the spatial feature tensor $F(t)$, we use self-attention to calculate the implicit multirelationships on the keyframe sequence of each road segment at timestamp $t$. Basically, three subspaces, namely the query subspace $Q_s$, the key subspace $K_s$, and the value subspace $V_s$, are obtained by performing linear mapping operations on $F(t)$, i.e.,
$$Q_s = F(t)\, W_q^s, \quad K_s = F(t)\, W_k^s, \quad V_s = F(t)\, W_v^s$$
Here, $W_q^s \in \mathbb{R}^{d_{model} \times d_k}$, $W_k^s \in \mathbb{R}^{d_{model} \times d_k}$, and $W_v^s \in \mathbb{R}^{d_{model} \times d_v}$ are learnable parameters. To better capture multiterm temporal dependence, multihead attention is further introduced by concatenating $N_h$ single attention heads, i.e.,
$$\mathrm{MultiHead}(F(t)) = head_1 \bullet head_2 \bullet \cdots \bullet head_{N_h}$$
where
$$head_s = \mathrm{softmax}\!\left(\frac{Q_s K_s^{T}}{\sqrt{d_k}}\right) V_s$$
Note that '•' denotes a concatenation operator. After that, a new tensor $F_{out}(t)$ that contains the spatial-temporal features can be produced using a learnable parameter $W^o \in \mathbb{R}^{N_h d_v \times d_{model}}$:
$$F_{out}(t) = \mathrm{MultiHead}(F(t))\, W^o$$
On this basis, we can construct the transformer structure by the classical encoder-decoder method [38] to implement traffic flow forecasting. As shown in Figure 2, both the encoder and the decoder contain $N_{cell}$ identical cells. Each identical cell is mainly constituted by a multihead attention layer and a keyframe-wise fully connected feedforward layer. Residual connections and normalization layers are also integrated. Note that the decoder cell has one more multihead attention layer than the encoder cell, which has the function of calculating the multihead attention over the features of the historical keyframes and the forecasted ones.
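The attention computation over a keyframe sequence can be sketched in plain Python for a single road segment. The sketch assumes the standard scaled dot-product form, softmax(QKᵀ/√d_k)V, with the heads concatenated and projected by an output matrix; the function names, the list-of-lists matrices, and the toy weights are hypothetical, and a practical model would use a tensor library instead.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax of one row."""
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(Q[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head(F, heads, W_o):
    """Concatenate per-head outputs along the feature axis, then project.

    F:     keyframe features of one segment, shape (length, d_model)
    heads: list of (W_q, W_k, W_v) weight triples, one per head
    W_o:   output projection, shape (n_heads * d_v, d_model)
    """
    outs = [
        attention(matmul(F, W_q), matmul(F, W_k), matmul(F, W_v))
        for W_q, W_k, W_v in heads
    ]
    concat = [sum((o[i] for o in outs), []) for i in range(len(F))]
    return matmul(concat, W_o)


# Toy check: identical queries and keys give uniform attention weights,
# so every output row is the mean of the value rows.
uniform = attention([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]],
                    [[1.0, 2.0], [3.0, 4.0]])

identity = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head(identity, [(identity, identity, identity)], identity)
```

In the second example all weights are identity matrices, so each output row is simply the attention weight vector of that keyframe and sums to one.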

Temporal Directed Graph Convolution Network (T-DGCN)
With the integration of the Markov chain-based directed graph convolution layer with the transformer structure-based encoder-decoder layer, Figure 3 gives the overall architecture of the proposed T-DGCN. Specifically, for the keyframe sequences of each road segment, two Markov-based directed graph convolution layers are used to capture keyframe-wise spatial dependence to construct the spatial feature tensor $F(t)$. The network further utilizes the transformer structure-based encoder-decoder layer to learn multiterm temporal dependence features from $F(t)$. The forecasted results are ultimately output from a fully connected layer. In the training process, the goal is to minimize the error between the observed traffic flow states $Y$ on the road network and the forecasted states $\hat{Y}$. Thus, the loss function can be defined as
$$loss = \lVert Y - \hat{Y} \rVert + \delta L_{reg}$$
where $\delta$ is a weighting factor, and $L_{reg} = \sum_{i=1}^{N_\theta} \theta_i^2$ represents the $L_2$ regularization term of all learnable parameters $\theta_i$, which has the function of preventing the overfitting problem.
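The structure of this training objective, a forecasting error plus a weighted sum of squared parameters, can be sketched as follows. The choice of the Euclidean norm for the error term, the function name `regularized_loss`, and the flat parameter list are assumptions made for illustration; the source does not state which norm is used.

```python
import math


def regularized_loss(y, y_hat, params, delta):
    """Forecasting error plus delta times the sum of squared parameters.

    y, y_hat: observed and forecasted values (flat lists)
    params:   all learnable parameters flattened into one list
    delta:    weighting factor of the regularization term
    """
    # Error term: Euclidean norm of the residual (an assumed choice).
    error = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)))
    # Regularization term: sum of squared parameters.
    l_reg = sum(p * p for p in params)
    return error + delta * l_reg


perfect_fit = regularized_loss([1.0, 2.0], [1.0, 2.0], [1.0, 2.0], 0.1)
no_penalty = regularized_loss([3.0, 0.0], [0.0, 4.0], [], 0.0)
```

With a perfect fit only the penalty remains, which is why a larger δ pushes the parameters toward zero even when the forecasts are already accurate.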

Experimental Comparisons on a Real-Life Dataset
This section aims to verify the effectiveness and superiority of the proposed T-DGCN model by performing comparative experiments on real-life datasets. In Section 4.1, we describe the utilized traffic dataset, including information on the moving velocity and turning directions at intersections, on the road network of Shenzhen, China. Section 4.2 introduces the baseline methods and evaluation metrics in the experimental comparisons. Finally, the experimental results are presented to demonstrate the superior performance of the proposed model in Section 4.3.

The Description of the Real-Life Dataset
Various traffic flow datasets, such as PeMSD and METR-LA [39], have been designed for the performance evaluation of distinct forecasting models. However, they are mostly collected by fixed sensors on road segments, so they lack the turning direction information of vehicles at intersections and cannot support directed graph construction. In recent years, GPS-equipped taxicabs have been employed as mobile sensors to constantly monitor the traffic rhythm of a city and to record the turning directions of taxis on road networks [40]. In China, Shenzhen city has more than 16,000 taxis that operate on the road network [41], and relevant studies have confirmed the ability of these taxi trajectories to reflect real traffic flow states on road networks [42]. Thus, we built a new large-scale traffic dataset based on the taxi trajectories of Shenzhen. The original dataset was downloaded from the Shenzhen Municipal Government Data Open Platform [43] and contains approximately 1 billion taxi trajectory points from 1-31 January 2012, each with multiple attributes, such as the taxi ID, spatial location, timestamp, and instantaneous velocity. This study utilizes the average velocity of the vehicles on each road segment in each 15 min interval to represent the velocity of traffic flow. Figure 4 shows the spatial distribution of the road network in the study, which includes 672 interconnected road segments in major districts of Shenzhen. In the experiments, to obtain faster convergence, we normalized all the input velocity values to 0-1. According to chronological order, the first 60% of the whole dataset is used as the training set, while the following 20% and the last 20% are utilized for validation and testing, respectively.
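The preprocessing described here, scaling velocities to 0-1 and splitting the series chronologically into 60%/20%/20%, can be sketched as below. Min-max scaling is an assumption, since the source only states the target range, and the function names are hypothetical. The split uses integer percentages so that the boundaries do not depend on floating-point rounding.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def chronological_split(series, train_pct=60, val_pct=20):
    """Split a time-ordered series into training, validation, and test parts."""
    n = len(series)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return series[:i], series[i:j], series[j:]


scaled = min_max_normalize([10.0, 20.0, 30.0])

# 31 days at a 15 min resolution give 31 * 96 = 2976 time steps.
steps = list(range(31 * 96))
train, val, test = chronological_split(steps)
```

Keeping the split chronological, rather than shuffling, means the test period lies strictly after the training period, which mirrors how a forecasting model would be used in practice.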

Baseline Methods and Evaluation Metrics
To verify the superiority of the proposed traffic flow forecasting method, a total of seven representative models, namely the historical average (HA) model [44], ARIMA [16], the vector auto-regression (VAR) model [45], the support vector regression (SVR) model [46], the fully connected GRU (FC-GRU) model [27], the temporal graph convolutional network (T-GCN) model [2], and the diffusion convolutional recurrent neural network (DCRNN) model [13], were selected as the baseline methods to implement experimental comparisons with the proposed model. The first four models are traditional machine learning-based methods, while the last three models were designed by modifying and integrating state-of-the-art deep neural networks.
In addition, the following three quantitative metrics were used to assess the accuracy of the traffic forecasting results obtained by the different methods: the root mean squared error $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, the mean absolute error $MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$, and the accuracy AC, where $y_i$ and $\hat{y}_i$ represent the observed and forecasted values of the traffic flow velocity, respectively, while $\bar{y}$ denotes the average of the observations. RMSE and MAE were both utilized to measure forecasting errors, while AC indicated the forecasting precision. Therefore, higher forecasting accuracy corresponds to smaller RMSE and MAE values and larger AC values.
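The two error metrics follow their standard definitions and can be computed as in the sketch below. The function names and the sample velocities are illustrative only, and the accuracy metric is omitted because its formula is not recoverable from the text.

```python
import math


def rmse(y, y_hat):
    """Root mean squared error between observed and forecasted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error between observed and forecasted values."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


observed = [30.0, 40.0]      # hypothetical velocities
forecasted = [33.0, 36.0]
error_rmse = rmse(observed, forecasted)
error_mae = mae(observed, forecasted)
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE on the same data and reacts more strongly to a few large errors.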

Comparative Analysis of the Experimental Results
In the experiments, we aimed to forecast the traffic flow velocity on road segments by using the proposed method and the baseline methods introduced in Section 4.2. The parameters included in the baseline methods were determined by referring to the identical criterion used in original articles or related articles. Specifically, the orders were set to (3, 0, 1) in the ARIMA model. In the VAR model, the lag was set to 3. The penalty term and the number of historical observations in the SVR model were set to 0.1 and 12, respectively. For the FC-GRU and T-GCN models, we set the number of hidden units to be 100.
Regarding the proposed method, we selected the appropriate parameters by comparing the forecasting performance of the candidates on the validation set. Specifically, we designed 16 hidden units in the directed graph convolution layers. For the keyframe sequence, the length of the tendency-related sequence and the time bandwidth of the periodicity-related sequence were set to tl = 12 and tw = 5, respectively, and the number of cycles was set to N p = 3. In the transformer structure, we set the dimensions of the subspaces as d k = 8 and d v = 16, while the numbers of cells in the encoder and decoder layers were both set to 3. Additionally, to simultaneously learn the short- and long-term temporal dependence, the number of attention heads was set to N h = 2. In the training phase, we used a batch size of 64 and 1000 epochs, while the learning rate was initialized as 0.0001 and was halved when the RMSE values remained unchanged for two epochs. All of the hyperparameters are classified and listed in Table 1. The proposed T-DGCN model was optimized using adaptive moment estimation (Adam) [47] and was implemented based on the PyTorch framework [48].
Table 2 presents the quantitative evaluation results of the forecasted values obtained by different methods on the traffic flow data from the road network of Shenzhen. The deep neural network-based models (e.g., T-GCN) have significantly higher forecasting accuracy than the classical machine learning-based methods (i.e., ARIMA, VAR, and SVR). It can be concluded that deep neural networks have advantages in capturing the nonlinear features related to spatiotemporal dependence. Note that the T-GCN model has traffic forecasting performance similar to that of the FC-GRU model regardless of the forecasting step length. This illustrates that the topology-based undirected graph convolution operator has limits in modeling the spatiotemporal evolution of traffic flow.
The proposed T-DGCN model outperforms all seven baseline methods in terms of the three evaluation metrics for different step sizes. More specifically, the forecasting results of the proposed directed graph convolution-based method yield smaller RMSE and MAE values and larger AC values than the other two current deep neural network-based methods (i.e., FC-GRU and T-GCN). For example, for traffic flow forecasting in 15 min, the RMSE value of the proposed T-DGCN model is approximately 6% lower than that of the T-GCN model, while the AC value is approximately 6% higher. For the forecasting step sizes of 30 min and 45 min, the proposed method also outperforms both FC-GRU and T-GCN in terms of all three metrics, which to a large degree confirms the stable performance of the proposed method.
Furthermore, we specifically selected two road segments and visualized the results forecasted by the proposed method. The T-GCN model, which shows the best performance of the seven baseline methods, was selected as the representative for the comparisons. As shown in Figure 5, both models fit the curve of the observed traffic flow time series well. In detail, the T-GCN generates smoother forecasted results than the T-DGCN, which means that the curves produced by the T-DGCN contain more high-frequency components. In other words, the T-DGCN has obvious advantages in capturing drastic variations in traffic flow velocities.
In addition to the forecasting accuracy, comparative experiments were further conducted on the computational efficiency of both the baseline and the proposed methods. We ran all of the models on a computer with 128 GB of memory and 16 CPU cores at 2.9 GHz. Table 3 provides the efficiency evaluation results of different methods. All of the models can output one-step forecasting results within 4 s. In other words, the computational time of all of the models can meet the requirements of real-time traffic flow forecasting given different forecasting steps (i.e., 15 min, 30 min, and 45 min). For deep learning-based methods, the running time on another computer with an NVIDIA RTX 3090 GPU indicates that the computation speed can be increased by nearly 10 times. In summary, the proposed method can achieve the highest forecasting accuracy within an acceptable computational time.

Discussion and Explanation of the Experimental Results
In this section, we further analyze the experimental results obtained by the proposed T-DGCN model from three aspects, namely the spatial distribution of the forecasting errors in Section 5.1, the temporal distribution of the forecasting errors (which refer to the RMSE values in the following subsections) in Section 5.2, and the multiterm temporal dependence in Section 5.3. Based on the analysis in the above subsections, we provide the discussion in Section 5.4. The purpose of this section is to provide convincing explanations for the superior performance of the proposed method.
Taking the regions in Figure 6a as examples, the road segments in Regions 1-3, which are located at the edge of the study area, contain incomplete topological structures but have high transfer complexity values and small forecasting errors. In contrast, despite the rich topology information in the road segments of Region 4, the low transfer complexity values correspond to low forecasting accuracies. Moreover, Figure 6c presents a fitted curve to depict the relationship between the transfer complexity values and forecasting errors in a more intuitive way. An approximately negative linear relationship exists for transfer complexity values smaller than 0.2. When the transfer complexity values exceed 0.2, the forecasting accuracies remain at a higher level. Furthermore, Figure 7 visualizes the normalized Laplacian matrices of the topology-based undirected graph and the proposed Markov-based directed graph. On the one hand, the Laplacian matrix of the directed graph contains more nonzero elements, which means that the graph convolution filter can aggregate more neighborhood information than with the undirected graph structure. On the other hand, the varying values of the diagonal elements indicate that the self-influences receive more attention in the directed graph structure.
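As a rough illustration of why a Markov-based graph is directed, the sketch below estimates a first-order transition matrix from map-matched trajectories by counting moves between consecutive road segments and normalizing each row. This is only a minimal sketch under stated assumptions: the exact construction of the directed graph and of its Laplacian is defined elsewhere in the paper and is not reproduced here, and the function name, segment labels, and toy trajectories are hypothetical.

```python
def transition_matrix(trajectories, segments):
    """Estimate first-order Markov transition probabilities between road segments.

    trajectories: list of ordered road-segment ID sequences (assumed map-matched).
    segments: list of all road-segment IDs; fixes the row/column order.
    """
    index = {seg: i for i, seg in enumerate(segments)}
    n = len(segments)
    counts = [[0] * n for _ in range(n)]
    # Count every observed move from one segment to the next.
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            counts[index[src]][index[dst]] += 1
    # Row-normalize; segments with no outgoing moves keep an all-zero row.
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Toy example with three segments (hypothetical data).
segments = ["A", "B", "C"]
trajectories = [["A", "B", "C"], ["A", "C"], ["A", "B"], ["B", "C"]]
P = transition_matrix(trajectories, segments)
```

In this toy case the probability of moving from A to B is nonzero while the probability of moving from B to A is zero, so the resulting matrix is asymmetric, which an undirected topology-based adjacency matrix cannot express.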
Figure 8 displays the average hourly distribution of the forecasting errors obtained by implementing the proposed method on the testing set. The T-DGCN limits the forecasting errors to approximately four for the majority of timestamps. Interestingly, the forecasting errors during 0:00-6:00, especially those between 3:00 and 6:00, are significantly higher than those during other time periods. This distribution characteristic is highly consistent with that described in a previous study [1]. The existing inferences suggest that this may be a result of the magnitude of traffic flow speed and the noise in records. However, Figure 9a,b illustrate that the traffic flow velocities and their standard deviations are distributed homogeneously throughout the day, which contradicts the above inferences. In this research, we further calculated the average hourly distribution of the number of vehicles, shown in Figure 9c. Clearly, the average number of vehicles is very small during the early morning hours, which is in accordance with the distribution of the forecasting errors.
Figure 10 visualizes the multihead attention scores of four forecasting cases in the transformer structure. The scores quantify the contribution degree of the observations in the keyframe sequence to the traffic flow states to be forecasted. With the number of single-head attention nodes set to two, the training process automatically differentiates the two attention heads, which learn the short-term dependence (i.e., the tendency) and the long-term dependence (i.e., the periodicity) of traffic flow. Specifically, Head-2 in Case 1 has higher attention scores in the beginning parts of the tendency-related sequence, while the ending parts make more contributions to the forecasted states in Case 4. For Cases 2 and 3, the middle parts in the tendency-related sequence are considered to be more important than the beginning and ending parts by the transformer structure.
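For readers unfamiliar with how such attention scores arise, the sketch below computes the scores of a single head with the standard scaled dot-product formulation, softmax(q · k / sqrt(dk)). It assumes that the heads in the T-DGCN follow this standard transformer formulation, with dk = 8 as stated in the experimental settings; the query and key vectors are toy values, and the function names are illustrative.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(query, keys):
    """Scaled dot-product attention scores of one query against a list of keys."""
    d_k = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    return softmax(logits)

# Toy example with d_k = 8: one query and three keyframe observations.
query = [1.0] * 8
keys = [[1.0] * 8, [0.0] * 8, [-1.0] * 8]
scores = attention_scores(query, keys)
```

The scores are nonnegative and sum to one, so each can be read as the share of the forecast attributed to one observation in the keyframe sequence; here the key most aligned with the query receives the largest share.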
In addition, the heterogeneity of long-term dependence is adaptively captured, as reflected by the distributions of attention scores in the periodicity-related sequence of Head-1.
Furthermore, we utilized the autocorrelation function (ACF) to demonstrate the rationality and effectiveness of the trained two-head attention. Figure 11a shows the calculated autocorrelation coefficients of the original traffic flow time series with different time lags, where each line describes the autocorrelation for one road segment. The utilized traffic flow data clearly contain significant tendencies and periodicities that differ between road segments. Moreover, Figure 11b depicts the relationship between the autocorrelation coefficients and forecasting errors. The results indicate that the forecasting errors of the proposed method stabilize at low levels for road segments with average autocorrelation coefficients larger than 0.2.
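The autocorrelation coefficients discussed above can be computed with the usual sample estimator, sketched below for a single road segment. This is a minimal sketch rather than the exact procedure used for Figure 11: the estimator (lagged covariance divided by the lag-0 sum of squares) is an assumption, and the periodic toy series stands in for a real traffic flow series.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of `series` at a nonnegative integer lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0 or lag >= n:
        return 0.0  # constant series or lag beyond the data: no usable estimate
    covariance_sum = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance_sum / variance_sum

# Toy series with a period of 4 time steps (hypothetical data).
series = [0.0, 1.0, 0.0, -1.0] * 6
acf_lag0 = autocorrelation(series, 0)
acf_lag2 = autocorrelation(series, 2)
acf_lag4 = autocorrelation(series, 4)
```

For this toy series the coefficient is strongly positive at the period (lag 4) and negative at half the period (lag 2), which mirrors how periodicity shows up as recurring peaks in an ACF plot.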

Discussion
Through the above analysis of the experimental results, we are able to provide a comprehensive discussion regarding the outperformance of the proposed method in terms of the accuracy of traffic flow forecasting from the following three aspects.
In the spatial dimension, the directed graph structure enables the neural network to leverage more associated information with the help of the Markov transfer matrix, which is a critical factor in achieving higher traffic flow forecasting accuracies. In the temporal dimension, the multihead attention in the proposed method has the ability to adaptively learn the short-term and long-term temporal dependence of traffic flow states observed on different road segments at distinct timestamps. Together, these two factors explain why the proposed method outperforms the baseline methods.
Furthermore, in real-world applications, the sparse observations of traffic flow states in the early morning hours may increase the unreliability of space-time dependency feature learning and the associated forecasting errors. In other words, the proposed model performs better when there are more vehicles on the road network. However, traffic forecasting is more important and needed during peak hours to serve as many vehicles as possible, which is also the period with the highest forecasting accuracy of the proposed method. Hence, the T-DGCN model is able to meet the needs of realistic traffic forecasting tasks.

Conclusions
This study designed a new method called the temporal directed graph convolution network (T-DGCN) to achieve high-precision traffic flow forecasting by adaptively capturing complicated spatial and temporal dependence. Specifically, in the spatial dimension, the idea of Markov chains is introduced to construct a directed graph for a road network by taking the vehicle turning behaviors at intersections into account. On this basis, we employed a directed graph convolution operator to learn spatial dependence features. In the temporal dimension, we built a keyframe sequence for each forecasted state and used the transformer structure to capture both short-term and long-term temporal dependence. In the experiments, real-world taxi trajectory points in Shenzhen, China, were utilized to estimate historical traffic flow states on the road network, and the proposed method was compared with seven commonly used representative baseline methods using different evaluation metrics. The experimental results demonstrate the superiority of the proposed method in terms of traffic flow forecasting accuracy. In addition, we further discussed the forecasting results obtained by the proposed method in terms of the space-time distributions of the forecasting errors and the multiterm temporal dependence. To a large extent, the discussions rationalize the high forecasting accuracy of the proposed method.
In the future, we will pay attention to the following three aspects. The first is to compare the performance of model-based traffic simulators and deep learning models in real-time traffic flow forecasting. The second is to investigate the impacts of incomplete traffic flow data on the model training process and to measure the uncertainty of forecasting results by leveraging statistical models. The third is to generalize the proposed T-DGCN model to improve its applicability in diverse traffic scenarios.

Data Availability Statement:
As the data also form part of an ongoing study, the raw data cannot be shared at this time.