Intercity Online Car-Hailing Travel Demand Prediction via a Spatiotemporal Transformer Method

Abstract: Traffic prediction is a critical aspect of many real-world scenarios that require accurate traffic status predictions, such as travel demand prediction. The emergence of online car-hailing services has given people greater mobility and made intercity travel more frequent. The increase in online car-hailing demand has often led to a supply–demand imbalance, where there is a mismatch between the immediate availability of car-hailing services and the number of passengers in certain areas. Accurate prediction of online car-hailing demand improves efficiency and minimizes wasted resources and time. However, many prior related studies fail to fully utilize spatiotemporal characteristics. With the development of newer deep-learning models, this paper aims to solve the online car-hailing demand prediction problem with an ST-transformer model. The spatiotemporal characteristics of online car-hailing data are analyzed and extracted. The study region is divided into subareas, and the demand in each subarea is summed over a specific time interval. The historical demand of the areas is used to predict future demand. The ST-transformer outperformed the baseline models, namely, VAR, SVR, LSTM, LSTNet, and the transformer. The validated results suggest that the ST-transformer is more capable of capturing spatiotemporal characteristics than the other models. Additionally, it is less affected by data sparsity.


Introduction
China has developed numerous urban agglomerations as the economy and transportation networks have progressed. Some examples include the Yangtze River Delta economic circle, the Pearl River Delta economic circle, the Beijing-surroundings economic circle, the Sichuan-Chongqing economic circle, and the Yinchuan metropolitan area. Intercity travel activities, such as commuting across cities, and travel between residential and urban transport facilities, such as railway stations and airports, have become more popular as a result. With the rapid growth of transit infrastructures, services such as online intercity car-hailing have gained popularity in recent years. The convenience of online booking and on-demand services of online intercity car-hailing has been well received by users. As reported by China's Ministry of Transportation, 660 million car-hailing orders were placed in November 2020 alone, and most of them were related to intercity travel.
However, such an increase in online intercity car-hailing demand has often led to a supply-demand imbalance where there is a mismatch between the immediate availability of car-hailing services and the number of passengers in certain areas. To maintain a more well-balanced spatial distribution of vacant taxis to meet the demand in certain areas, and thereby enhance the efficiency of online car-hailing services, demand forecasting for intercity online car-hailing travel is vital. Therefore, this issue can be formulated as a travel demand forecasting problem.
Extensive studies have been conducted to increase the precision of travel demand prediction. Existing solutions to travel demand prediction problems can be classified into two major categories: statistical methods and deep-learning methods. Initially, most of the forecasting is conducted using statistical models such as autoregressive integrated moving average (ARIMA) and other alternatives. Li et al. explored human mobility patterns by using an ARIMA-based method [1]. Other studies [2][3][4] further considered factors such as spatial relationships and weather conditions with remarkable successes. However, traffic data have inherent spatiotemporal dependencies, which makes traffic prediction a highly challenging and complex task. ARIMA and other statistical methods still fail to manage such complex nonlinear relationships. Furthermore, optimizations of these models and overfitting remain very large challenges to address [5].
To address the above challenges, more recent attempts have focused on deep-learning-based methods, as these methods can fit the nonlinear relationships between future and historical travel demand well. Deep-learning-based methods have attained remarkable performance in many learning tasks [6], which has inspired applications of deep-learning techniques to traffic prediction problems. Commonly used deep-learning models for traffic prediction tasks can be classified into three categories, namely, feedforward neural networks, recurrent neural networks (RNNs), and convolutional neural networks (CNNs) [7].
A multilayer perceptron (MLP) is an example of a simple feedforward neural network, as exemplified by the work done by Wong et al. in 2017 [8], where an MLP was used to model both the supply and demand of passenger and taxi services. However, these models fail to consider both spatial and temporal interactions. Recurrent neural networks (RNNs) can use self-circulation mechanisms and deal with temporal dependencies effectively [5,7]. Hence, RNN models and their variants have also been applied by many researchers. The long short-term memory (LSTM) model developed by Hochreiter and Schmidhuber in 1997 [9] is one such RNN variant. It is a gated recurrent neural network that includes an extra hidden state compared to the original RNN. This enables LSTM models to capture longer time dependencies more efficiently and mitigates the vanishing and exploding gradient problems of the original RNN to a certain extent. This characteristic makes LSTM models a popular option for short-term traffic forecasting problems [10][11][12]. Yu et al. [13] employed an LSTM network to predict traffic under extreme conditions, proving LSTM's capability in modeling sequential dependency. Yao et al. [14] used an LSTM that considered semantic similarity among different regions to make predictions. However, these sequential models have limited scalability for longer time dependencies, where their memorization power declines [7]. Furthermore, sequential models are good at temporal dependencies but have no mechanism to deal with the spatial dependencies in traffic data. In practice, LSTM is often combined with models that capture spatial features, such as convolutional neural networks (CNNs), to form hybrid models. Such a hybrid model is then able to capture the temporal-spatial characteristics of traffic data.
CNNs have shown outstanding performance in modeling local and shift-variant features [15]. Lv et al. [16] further integrated an RNN and CNN, whereby the latter was in charge of spatial features and used LSTM to capture temporal features. The long-and short-term temporal network (LSTNet), proposed by Lau et al. [17], is another example of LSTM and CNN integration that demonstrated significant performance improvements for long-and short-time series prediction tasks. However, since the number of hidden layers grows linearly with increasing sequences, the scalability of such models is limited for longer input sequences [18]. Deeper layers also affect the efficiency of CNNs for capturing dependencies in longer sequences [19].
A review of related works shows that the utilization of spatiotemporal features is key to forecasting travel demand with high accuracy. However, all the above research mainly focused on intracity travel demand forecasting, and few studies have addressed the intercity travel demand prediction problem. In addition, compared to intracity travel demand prediction, two challenges need to be solved for intercity travel demand forecasting. First, it requires capturing long time-horizon correlations. Second, the correlation among adjacent regions may not be obvious, and many regions have high correlations with faraway regions.
To conduct accurate intercity travel demand prediction, we first utilized a transformer model to capture longer time-horizon correlations. A transformer is a popular model proposed by Vaswani et al. [19]. It is an attention-based model that no longer requires recursive feeding of sequential data, as RNN-based models do. Therefore, a transformer can preserve sequence orders and is more computationally efficient than RNNs. Strategies such as multiheaded attention and positional encoding help a transformer attain significant success in machine translation tasks [20,21]. Machine translation and traffic prediction are formulated similarly. Machine translation is a seq-2-seq learning task that aims to translate a source sentence into a target sentence [22], whereas traffic prediction tries to use historical data as an indicator of future traffic conditions. Hence, the time steps in historical traffic data can be similar to the position index of each word in the input sentences in a machine translation task [23]. For instance, Cai et al. [23] utilized a transformer to capture temporal dependencies in traffic data.
To tackle the second challenge, a spatial transformer was introduced that armed the transformer with spatial transformation capabilities [24]. If we employ a CNN model for spatial dependency extraction, the representation of correlations between two distanced regions usually requires multiple hidden layers. The spatial transformer can capture the correlation between two regions and is independent of the distance between regions.
To conclude, we will address online car-hailing demand prediction tasks using a state-of-the-art spatial-temporal transformer (ST-transformer) model. We hope that the spatial-temporal modeling capability of the ST-transformer can produce accurate predictions of travel demand and thereby provide suggestions for a more well-balanced spatial distribution of vacant taxis to meet the demand in specific areas, enhancing the efficiency of online car-hailing services and reducing resource waste. Most prior works divide the studied region into smaller subareas and tabulate the travel demand in each subarea during a time interval [14,[25][26][27]; hence, similar methods will be adopted by this paper. We will test the capacity of other models, such as LSTM, LSTNet, and a transformer, and compare them with the ST-transformer.
The rest of this paper is organized as follows. Section 2 outlines the study area and dataset used, including data processing and extraction methodologies. Section 3 defines the problem in detail and lists the methodologies adopted, with elaborations on model architectures. Section 4 describes our experimental design and presents the performances of the various models. This paper ends with conclusions and future ideas in Section 5.

Study Area
Our online car-hailing data are distributed in Yinchuan City and Shizuishan City, Ningxia Autonomous Region, China, and the geographical position is shown in Figure 1. Yinchuan City, located at 105.82-106.88 E longitude, 37.59-38.88 N latitude, is the capital of Ningxia, serving as one of the most important cities in Northwest China with more than two million people. Shizuishan City, located at 105.96-106.97 E longitude, 38.60-39.39 N latitude, is north of Yinchuan City with a resident population of approximately 800,000, serving as an important pillar of Yinchuan.
To obtain spatiotemporal data, we classify the orders into 30,000 grids with a cell size of 1240 × 1530 m. Finally, 357 grids with at least one recorded order are selected.
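A minimal sketch of this gridding step follows. The meter-per-degree conversion is an approximation, and the corner coordinates and order locations are illustrative assumptions, not values from the dataset:

```python
import math
from collections import Counter

# Hypothetical sketch: assign orders to ~1240 m x 1530 m grid cells and keep
# only cells with at least one order, mirroring the gridding described above.

LON_MIN, LAT_MIN = 105.82, 37.59   # southwest corner of the study area (approx.)
CELL_W_M, CELL_H_M = 1240, 1530    # grid cell size in meters

M_PER_DEG_LAT = 111_320.0          # meters per degree of latitude (approx.)

def cell_index(lon, lat):
    """Map a (lon, lat) point to an (ix, iy) grid cell index."""
    m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(lat))
    ix = int((lon - LON_MIN) * m_per_deg_lon / CELL_W_M)
    iy = int((lat - LAT_MIN) * M_PER_DEG_LAT / CELL_H_M)
    return ix, iy

orders = [(106.23, 38.48), (106.23, 38.48), (106.31, 38.50)]  # toy departures
counts = Counter(cell_index(lon, lat) for lon, lat in orders)
active_cells = [c for c, n in counts.items() if n >= 1]  # cells with >= 1 order
```

In a real pipeline, the same index function would also be applied to destination coordinates, and cell counts per time interval would form the demand matrix.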

Intercity Car-Hailing Data
There are 224,822 orders in our dataset, ranging from 1 January to 31 December 2020, and all orders are located within the area shown in Figure 1.
Each order records the spatial information, temporal information, passenger information, and so on. These are typical spatiotemporal data that include the time when the passengers order online and get into the car, the location where the passengers get into and out of the car, and some desensitized passenger data such as number and age, which are listed and described in Table 1.

Table 1. Intercity car-hailing data fields and descriptions.

Order time            The time when the passengers place the order online.
Departure time        The time when the passengers get into the car.
Departure location    Latitude and longitude of the departure point.
Destination location  Latitude and longitude of the destination.
Number of passengers  The number of passengers getting into the car.
Ages of passengers    The age of each passenger on the order.

Data Processing
The data coverage area of this paper is consistent with the study area. Note that the data from February are seriously distorted due to COVID-19; therefore, only data from March 1 to December 31 are included to preserve temporal coherence. As mentioned in Section 2.1, only the 357 grid units with at least one order are selected. In addition, the dataset is split chronologically into a training set, validation set, and testing set at a ratio of 7:2:1, and Z-score normalization is applied to the inputs.
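The chronological 7:2:1 split and Z-score normalization can be sketched as follows. The toy series stands in for the real demand data, and fitting the normalization statistics on the training split only is our assumption (a common practice to avoid leakage):

```python
# Sketch (illustrative): chronological 7:2:1 split and Z-score normalization.
series = list(range(100))  # stand-in for time-ordered demand values

n = len(series)
n_train, n_val = int(0.7 * n), int(0.2 * n)
train = series[:n_train]
val = series[n_train:n_train + n_val]
test = series[n_train + n_val:]

# Z-score parameters are fit on the training split only (our assumption).
mean = sum(train) / len(train)
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5

def normalize(xs):
    return [(x - mean) / std for x in xs]

train_n, val_n, test_n = normalize(train), normalize(val), normalize(test)
```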
Two prediction scenarios are designed to compare the performance of the ST-transformer and the other baselines. The passenger volume in the previous six time steps is used to predict the volume in the next time step.
Scenario 1: 1 h traffic demand prediction using the last 6 h order distribution;
Scenario 2: 2 h traffic demand prediction using the last 12 h order distribution.
Using the two scenarios above, the long- and short-term performance of the ST-transformer is compared with that of the other models.
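The history-to-target construction shared by both scenarios can be sketched as follows; the demand series below is a toy example, not data from the paper:

```python
# Sketch: build (input, target) pairs where the previous 6 time steps predict
# the next step, as in the two prediction scenarios.
demand = [3, 5, 2, 8, 6, 4, 7, 9, 1]  # toy hourly demand for one grid

STEPS = 6  # history length per sample
samples = [(demand[i:i + STEPS], demand[i + STEPS])
           for i in range(len(demand) - STEPS)]
# Each sample: six past values -> the value at the next time step.
```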

Problem Definition
The input X is a matrix recording the passenger flow in different places at different times. Matrix element x_t^s is the volume of passenger flow in grid s at time t, where s varies from 1 to 357, representing the 357 observation points that correspond to the passenger flow in the 357 grids, and t varies within the limit of one year. The output Y is the passenger flow in the next time step, t + 1:

Y = (x_{t+1}^1, x_{t+1}^2, ..., x_{t+1}^357)

Spatiotemporal Transformer Net
A spatiotemporal transformer net (ST-transformer) is used to predict the traffic flow. Noting the unevenness of the order distribution, we believe that an attention layer is needed for effective trajectory prediction.
The ST-transformer interleaves the spatial and temporal transformer networks in a single framework to solve the combined prediction problem with spatiotemporal data. The model of the ST-transformer network is shown in Figure 2. The spatial transformer focuses on the topological structure of the data and calculates the connection value between each pair of nodes. The temporal transformer is the standard part of the transformer network, which attends to sequence continuity.
Both the spatial part and the temporal part end with a specifically designed activation function called position-wise feed-forward (PFF), as shown in Figure 3. The PFF can be calculated as:

PFF(X) = ReLU(X W_1 + b_1) W_2 + b_2

The network starts with a convolution layer with 8 kernels of size (1, 1), which amplifies the data from (20, 357, 2, 6) to (20, 357, 8, 6). The input matrix is written as X_input, or X_i after the start convolution layer. The encoder is composed of a spatial transformer (ST) net and a temporal transformer (TT) net:

X_e^S = f(ST_e(X_i))
X_e^T = f(TT_e(X_i))
X_e = X_e^S ⊕ X_e^T

where ⊕ denotes a concatenation operation that doubles the dimension of the channels to (20, 357, 6, 2 × 8); X_i denotes the output of the start convolution part; and X_e^S and X_e^T denote the results of the ST and TT for the encoder, respectively. After two layers of convolution to reduce the hidden variables, we obtain a feature matrix X_F of size (20, 1, 357, 8). The TT and ST parts are set in parallel and are discussed in Sections 3.3 and 3.4.
The decoder adopts a series method to connect the ST and TT. In the post-convolution stage, 64 kernels with a size of (1, 1) and ReLU activation are used to increase the dimension. Finally, a convolution layer is utilized to obtain the predicted matrix X_output of size (20, 357, 1, 2).
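The tensor shapes quoted for the encoder can be traced with a small bookkeeping sketch. The branch operations are stubs, and the axis ordering follows the text (the channel axis of 8 doubles to 2 × 8 after concatenation); all of this is illustrative, not the actual implementation:

```python
# Illustrative shape bookkeeping for the parallel ST/TT encoder branches.

def branch_shape(x_shape):
    # Both ST_e and TT_e preserve the input shape in this sketch (assumption).
    return tuple(x_shape)

x_i = (20, 357, 8, 6)          # after the start (1, 1) convolution
st_out = branch_shape(x_i)     # shape of X_e^S
tt_out = branch_shape(x_i)     # shape of X_e^T
assert st_out == tt_out        # equal shapes are required for concatenation

# Concatenation along the channel axis doubles it: 8 -> 2 * 8,
# giving the (20, 357, 6, 2 x 8) layout quoted in the text.
x_e = (st_out[0], st_out[1], st_out[3], st_out[2] * 2)
```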

Temporal Transformer
Transformers, as efficient deep-learning models based on a self-attention mechanism, have achieved great success in many fields, such as natural language processing (NLP), computer vision (CV), and deep learning on graphs. In this article, the transformer networks are all designed to capture spatial and temporal features.
The TT part uses the multi-head attention mechanism found in transformer networks. Figure 4 shows the calculation process for a single attention mechanism, given here as a matrix calculation. This self-attention mechanism aims to calculate the correlations among the vectors in the input matrix X. To improve the fitting ability of the model, weight matrices are introduced to define the query matrix Q = X W_Q, the key matrix K = X W_K, and the value matrix V = X W_V. The self-attention mechanism can be summarized in four steps: first, the similarity is calculated as Q K^T; second, the result is divided by √d_k for scaling, where d_k denotes the dimension of K; third, softmax converts the scaled scores into probabilities; finally, the weighted sum of V is computed with these probabilities. The attention formula is as follows:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V
A mask in the shape of a strictly upper triangular matrix is used to prevent label leakage. Before the fully connected multi-head layer, dropout with a rate of 30% is applied to restrain overfitting.
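The masked scaled dot-product attention described above can be sketched in pure Python. The toy Q, K, V values and two-dimensional vectors are assumptions for illustration only:

```python
import math

# Sketch: scores = Q K^T / sqrt(d_k), a strictly upper-triangular mask to hide
# future positions, then softmax and a weighted sum of V.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V, causal=True):
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        if causal:  # mask future positions (strictly upper triangle)
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, V))
                    for c in range(len(V[0]))])
    return out

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence of 3 vectors
Y = attention(Q, K, V)
# With the causal mask, the first output attends only to the first position.
```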
A residual block is added to the TT: the input of each layer is the sum of the output and input of the previous layer, which mitigates the vanishing gradient problem. The formula of each head is:

head_k = Attention(X W_k^Q, X W_k^K, X W_k^V)

where k varies from 1 to 8, and the multi-head mechanism outputs a tensor composed of the k heads:

MultiHead(X) = Concat(head_1, ..., head_8) W^O

where W_k^Q, W_k^K, W_k^V, and W^O indicate weight matrices generated by fully connected layers, and Attention(Q, K, V) is the attention function applied to the input sequence {X_t}_{t=1}^T.

Spatial Transformer
From the equations in Section 3.3, it is noted that the transformer network pays little attention to the structural information of the input; consequently, we utilize the ST block to extract the spatial features of the trajectory and mine the correlations between grids on the map.
As shown in Figure 5, the geographical distribution can be treated as a graph G = (V, E), where V = {1, 2, ..., n} is the grid set and E = {(i, j) | i and j are connected} is the edge set. This modeling method is also known as transformer-based graph convolution (TGConv). The graph varies over time and can be described as g = {G_1, ..., G_t}, where t denotes the whole period. Assume that node i is associated with an embedding vector h_i and a neighbor set N(i), where h_i is the feature vector in the feature set h. Defining the attention logit from grid i to grid j in this fully connected graph as m_{i→j} = q_i^T k_j, the attention mechanism in Formula (10) is then rewritten as:

Att(i) = Σ_{j ∈ N(i)} softmax(m_{i→j} / √d_k) v_j
h_i' = f_out(Att(i))

where the output function f_out is designed as a fully connected layer, and h_i' is the updated embedding of node i using the TGConv. The ideas of ResNet and layer normalization are also applied in the ST.
With well-designed graph vertices and edges, the ST works as well as the TT in the multi-head framework.
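The graph-restricted attention of this section can be roughly illustrated as follows. The tiny three-node graph and the per-node query/key/value vectors are toy assumptions, and the output layer f_out is omitted (treated as identity):

```python
import math

# Toy sketch of TGConv-style attention: logits m_{i->j} = q_i^T k_j over a
# node's neighbor set, softmax-normalized, then a weighted sum of neighbor
# values. f_out is omitted here (identity stub).

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Per-node query/key/value vectors for a 3-node graph (toy values).
q = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
k = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
v = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
neighbors = {0: [0, 1, 2]}  # node 0 attends to itself and its neighbors

d_k = 2
m = [dot(q[0], k[j]) / math.sqrt(d_k) for j in neighbors[0]]
w = softmax(m)
h0_new = [sum(wj * v[j][c] for wj, j in zip(w, neighbors[0]))
          for c in range(2)]  # updated embedding of node 0 (before f_out)
```

Unlike a CNN stack, the attention weight between two grids here does not depend on how many hops apart they are, which matches the motivation for the spatial transformer.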

Results
This section presents the performance evaluation of the ST-transformer using a real-world dataset: the online car-hailing dataset of recorded trips between Yinchuan and Shizuishan, China. The study area is divided into 357 grids, and the passenger flow in each grid is aggregated into 1 h and 2 h windows.

Evaluation Metrics
The performance of the ST-transformer is evaluated in terms of the mean absolute error (MAE), the mean absolute percentage error (MAPE), and the root mean squared error (RMSE):

MAE = (1/n) Σ_{i=1}^n |y_i − ŷ_i|
MAPE = (100%/n) Σ_{i=1}^n |y_i − ŷ_i| / y_i
RMSE = √((1/n) Σ_{i=1}^n (y_i − ŷ_i)^2)

where y_i is the actual passenger volume and ŷ_i is the predicted passenger volume. The MAE is the average of the absolute differences between the predicted and actual passenger volumes and reflects the actual prediction error. The MAPE expresses the error as a percentage of the actual passenger volume. Finally, the RMSE represents the standard deviation of the error between the predicted and actual values. Lower values of all three metrics indicate better predictions. Due to the presence of zero-valued grids, the scores of such grids are excluded from all evaluation metrics.
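The three metrics can be implemented directly. The toy values below are assumptions, and zero-valued ground truths are skipped in the MAPE, as described above:

```python
import math

# Sketch of the three evaluation metrics; zero-valued ground truths are
# excluded from the MAPE to avoid division by zero.

def mae(y, yhat):
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def mape(y, yhat):
    pairs = [(a, p) for a, p in zip(y, yhat) if a != 0]  # skip zero grids
    return 100.0 * sum(abs(a - p) / a for a, p in pairs) / len(pairs)

def rmse(y, yhat):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))

y_true = [10, 0, 5, 20]  # toy actual passenger volumes
y_pred = [12, 1, 5, 16]  # toy predicted passenger volumes
```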

Baselines
To validate the effectiveness of the ST-transformer, it is compared with classic statistical models and deep-learning models. State-of-the-art models such as LSTNet and the transformer are also included in the comparison:
• VAR: The vector autoregressive model [28] is a generalized autoregressive model that captures the relations of multiple variables over time.
• SVR: Support vector regression [29] is a nonlinear regression model.
• LSTM: Long short-term memory [11] is a variant of the RNN that is more efficient at capturing longer time dependencies than the original RNN.
• LSTNet: The long- and short-term temporal network [17] is an integration of LSTM and a CNN that demonstrated significant performance improvements for long- and short-term time series prediction tasks.
• Transformer: The transformer [19] is naturally more computationally efficient than RNNs and can capture temporal features with an attention mechanism.

Table 2 compares the performance of the ST-transformer and the baseline models for the 1 h and 2 h window data. The passenger volume for the next time step is predicted using the previous six time steps of the online car-hailing dataset recording trips between Yinchuan and Shizuishan. The ST-transformer obtains outstanding results for both time windows compared to the other neural network-based and statistical methods. Generally, all the neural network-based models outperformed the VAR and SVR statistical methods. Although the SVR exhibits fairly good MAE results for both time windows, its MAPE is extremely high, suggesting that incorrectly predicted values deviate greatly from the actual values. The transformer uses an attention mechanism to capture temporal features and outperforms LSTM and LSTNet, both of which have inherent weaknesses in capturing longer time series. When the passenger volume is aggregated over a 2 h window, the performance of all models generally drops as the data become sparser. In contrast, the ST-transformer still excels and even shows better performance than for the data aggregated over a 1 h window. This may imply that the ST-transformer is less affected by data sparsity, which is attributed to its better utilization of spatial characteristics.

Experimental Results
As shown in Figure 6, the traffic demands are distributed extremely unevenly, and the orders are concentrated in several grids, which makes it difficult for traditional prediction methods. LSTM, VAR, and other models tend to forecast the orders in all grids below ten, as most grids have few car-hailing demands. Therefore, a large error is generated in specific grids with larger demands determined by geographical characteristics (traffic center, commercial building, residential quarters, and so on).
This characteristic is noticed by the attention mechanism of the ST-transformer, which performs desirably from Grid 50 to Grid 100. For some extreme cases, the ST-transformer still performs better than the other methods by nearly 25%. Figure 7 shows the temporal error distribution of the various models. The traffic demands displayed periodic patterns across the days of a month, suggesting a possible difference between weekend and weekday distributions. Supporting the previous observations, the ST-transformer obtained a rather stable performance in all three metrics, with a relatively low MAPE compared to all the other models. We can conclude from this observation that the ST-transformer indeed predicts with a smaller deviation from the ground truth.

Conclusions
To enhance the efficiency of online car-hailing services, a spatial-temporal transformer model is used to make accurate travel demand predictions. A multi-head transformer attention mechanism is designed to reflect the temporal correlations, and a graph model is established based on the geographical region for the calculation of the self-attention mechanism. The results showed that the ST-transformer produces less prediction error than LSTM, LSTNet, the transformer, and many other classical models. The data demonstrated that its advantages are more pronounced when predicting the near future. The graph-based transformer mechanism shows great superiority in dealing with dynamic graph feature learning.
More accurate demand forecasting can help deploy more online car-hailing vehicles in places with high demand, help disperse crowds, and increase road occupancy rates to reduce traffic congestion. Furthermore, improving the efficiency of online car-hailing services can increase people's confidence in using online car-hailing to travel, thereby reducing the use of private cars on the road. Therefore, we believe that a larger supply of hired vehicles will reduce the chance of congestion on the roads.