Spatiotemporal Prediction of Urban Online Car-Hailing Travel Demand Based on Transformer Network

Abstract: Online car-hailing has brought convenience to daily travel, and accurate demand prediction benefits drivers and helps managers grasp the characteristics of urban travel, so as to support decision-making. Spatiotemporal prediction in the transportation field has usually been based on recurrent neural networks (RNNs), which suffer from lengthy sequential computation and backpropagation through time. This paper describes a model based on the Transformer, which has shown success in computer vision. The study area is divided into grids, and the travel data are converted into video frames by time period, from which spatiotemporal travel demand is predicted. The predictions of the model are closest to the real data in terms of spatial distribution and travel demand when the data are divided into 10 min intervals and the travel demand of the first two hours is used to predict demand in the next hour. We experimentally compare the proposed model with the three most commonly used spatiotemporal prediction models, and the results show that our model achieves the best accuracy and training speed.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic, as a global public health emergency, has brought major challenges to the entire world. The outbreak of the virus has changed the thinking and lifestyle of residents, and maintaining social distance and avoiding contagion have become a shared concern of residents in public places. This has brought about an unprecedented reduction in public transport demand over the past three years, shifting residents' perceptions of public transport from positive to negative [1]. According to surveys, residents' propensity to use different modes of transportation has changed significantly since the COVID-19 outbreak. Public transport has been the most affected, with a remarkable decrease in the number of users, while the use of private cars has increased. However, for people who do not own a private car and do not want to use public transportation, shared mobility is a good option in many cities. Online car-hailing is a kind of shared mobility, favored for its convenient and reliable service. As the best alternative to private car travel and an important supplement to public transportation, it has gradually become one of the most important travel modes [2]. At the same time, making full use of shared travel resources can alleviate various traffic pressures. The online car-hailing platform collects and matches passenger orders and service vehicles in a new way, creating conditions for the provision of large-scale transportation services [3]. It helps build an intelligent, green, efficient, and safe integrated transportation system and promotes the sustainable development of urban transportation [4,5].
According to the national online car-hailing regulatory information interaction platform [6], 263 online car-hailing platform companies are licensed in China, and 4.053 million online car-hailing driving licenses have been issued. Online car-hailing nevertheless faces several problems: residents cannot quickly access car-hailing services, and drivers often run empty. Unlike RNN-based models, the Transformer-based model proposed here can run in parallel, maintaining good performance while speeding up training. As far as we know, this is the first time online car-hailing order data have been converted into video frame data, and the first time a Transformer network has been used to predict the travel demand of online car-hailing in time and space. The contributions of this paper are as follows: (1) Based on the Transformer architecture, a spatiotemporal prediction model, SPTformer, is proposed, and the experimental results prove that our model is competitive for the spatiotemporal prediction of residents' online car-hailing travel. (2) After the online car-hailing order data are processed into video frame data, they still contain the spatiotemporal information of the trip data, so the model can better predict online car-hailing demand on the spatiotemporal scale.
The remainder of this paper is organized as follows: in Section 2, we briefly review the existing solutions and models for solving traffic prediction problems. In Section 3, the structure of the model in this paper is described, and each part of the model is explained in detail. At the same time, the special processing method and data structure of online car-hailing data are introduced. In Section 4, extensive experiments are described using Haikou online car-hailing order data, and the existing spatiotemporal prediction methods in the transportation field are used as a comparison to test the effect of the proposed model. The results of the projections are discussed at the end. In Section 5, we summarize this research and discuss further work.

Related Work
Deep learning has found application in various fields, and much recent transportation research builds on it. This paper addresses the spatiotemporal prediction of online car-hailing travel demand.
The CNN and RNN are two basic models in traffic prediction research. To better consider the spatiotemporal correlation of ride-hailing travel demand, Zhang et al. proposed an end-to-end multi-task learning temporal convolutional neural network (MTL-TCNN), which predicts short-term passenger demand at a multi-regional level based on Didi Chuxing's ride-hailing data in Chengdu, China, and taxi data in New York City [23]. To predict short-term traffic flow, Zhang et al. [24] designed a model based on a CNN, with higher accuracy than traditional models. Chen et al. [25] proposed PCNN, which is based on deep CNNs and can model periodic traffic data and predict short-term traffic congestion. Mou et al. [26] proposed a temporal information-augmented LSTM (T-LSTM) model to predict the traffic flow of a single road segment, which can capture the intrinsic correlation between traffic flow and temporal information, thereby improving prediction accuracy. The prediction of traffic flow during peak hours is of great significance for alleviating traffic pressure. Yu et al. [27] designed a traffic flow prediction model based on long short-term memory (LSTM) to predict traffic flow in urban peak hours. Tang et al. [28] proposed ST-LSTM, which extracts spatiotemporal features from data and combines them as input. Gu et al. [29] combined an LSTM neural network and a gated recurrent unit (GRU) in a two-layer deep learning model that outperformed a single network model. However, ordinary CNNs weakly capture long-term temporal dependencies, and RNNs do not capture spatial dependencies well, which motivates their combination.
A CNN can capture the spatial basis of traffic flow, while an RNN can mine short-term changes and periodicity. Wu et al. [30] combined these in a deep learning framework, CLTFP, for the spatiotemporal prediction of traffic flow. Similarly, Zhen et al. [31] used a CNN to extract traffic spatial features, and an RNN to predict traffic flow changes. Liu et al. [32] designed a ConvLSTM model based on a CNN, which can extract the spatiotemporal features of traffic flow, and has an end-to-end deep learning architecture. To consider the temporal and spatial characteristics of traffic flow and extract the spatiotemporal correlations and variation law of traffic flow data, Li et al. [33] proposed a prediction model based on deep spatiotemporal ConvLSTM, which was experimentally shown to outperform traditional models in both accuracy and speed. Huang et al. [36] designed a ConvLSTM-Inception network (CL-IncNet) to make spatiotemporal predictions of traffic flow data. Li et al. [37] constructed a ConvLSTM network to predict taxi demand, which was shown to more accurately process spatial information. Chen et al. [38] proposed a BT-ConvLSTM model to introduce temporal information to a ConvLSTM network, which was experimentally shown to improve traffic flow prediction accuracy. Di et al. [39] proposed CPM-ConvLSTM, a spatiotemporal model to make short-term predictions of the congestion levels of road segments. To reduce resource requirements, Huang et al. [40] built a sparse convolutional recurrent network utilizing sparse gates in ConvLSTM and ConvGRU. Ranawaka et al. [41] used a ConvLSTM model with Google traffic data to predict traffic flow in the next 20, 30, and 60 min. Although the combination of CNN and RNN can capture the spatiotemporal features of traffic data, such models are computationally expensive and slow to train, due to the sequential nature of the recurrent structure.
This study uses a Transformer network to construct a prediction model. Compared with RNN-based methods, the Transformer can effectively capture long-term dependencies, can be operated in parallel, has good performance and a fast training speed, and can capture the correlation of each part of an image through self-attention. To consider the different spatial relationships between variables, Grigsby et al. [42] proposed a method called Spacetimeformer, which achieved good results in the field of spatiotemporal prediction. Xu et al. [43] proposed a new paradigm of spatiotemporal transformer networks (STTNs), which exploits dynamic directional spatial dependencies and long-term temporal dependencies, and their model performs well in long-term traffic flow prediction. Song et al. [44] proposed TSTNet, a sequence-to-sequence (Seq2Seq) spatiotemporal model based on the Transformer architecture, which can be used for urban traffic spatiotemporal flow prediction. Zhang et al. [45] used the Transformer network to propose a novel architecture called a time-fusion transformer (TFT) to predict short-term highway speeds, which has been experimentally shown to have high accuracy. Cai et al. [46] referred to Google's Transformer machine translation framework to design a Traffic Transformer network that captures the continuity and periodicity of traffic flow time series and models spatial correlations. Girdhar et al. [47] designed an anticipatory video transformer (AVT) based on a Transformer network to predict actions, with an attention module and an end-to-end model architecture. To make accurate predictions of autonomous driving trajectories, Zhang et al. [48] designed a Gatformer model based on the Transformer architecture, which makes more accurate predictions while shortening the forecasting time. Wu et al. [49] proposed an object-centric video transformer (OCVT) to predict video frames, decomposing a scene into tokens suitable for generative video transformers. Farazi et al. [50] designed an end-to-end learnable model, the frequency domain transformer network (FDTN), which can estimate and use signal transforms in the frequency domain. Wang et al. [51] designed a concise and efficient temporal Transformer network with progressive prediction, aggregating observed features in a lightweight architecture to progressively predict features. Liu et al. [52] proposed a ConvTransformer network for video frame sequence learning and synthesis. Shi et al. [53] proposed a Transformer-based video interpolation framework that uses self-attention to compute long-term dependencies. Zheng et al. [54] designed a pure Transformer-based network to predict the next step for a 3D human pose in a video. Tai et al. [55] designed a higher-order self-attention mechanism and proposed a higher-order recursive layer design, HORST. Farazi et al. [56] introduced a transformer model that enables local predictions with selectable sparsity.
The Transformer network has achieved great success in computer vision, and provides a theoretical basis for our research, since traffic data are spatiotemporal. To better predict online car-hailing demand, we convert order data into video data with spatiotemporal characteristics according to a fixed time period. To predict these video data, we propose a model called the Spatiotemporal Convolution Transformer (SPTformer), based on the Transformer architecture. Experiments show that the model is well suited to spatiotemporal prediction and performs well.

Overview
To predict the time and space of urban car-hailing trips, this study refers to the prediction of video data. We preprocess the data, convert them into frames of video data, and use a video prediction method to generate future frames. The historical travel demand sequence, x = {x_0, x_1, …, x_n} ∈ R^(H×W×C), has sequence length n, where H, W, and C are the height, width, and number of channels, respectively. Our goal is to use the m frames of sequence data up to time t, x = {x_{t−m+1}, x_{t−m+2}, …, x_t}, to predict the sequence data of k frames after time t, x̂ = {x̂_{t+1}, x̂_{t+2}, …, x̂_{t+k}}.
This paper proposes a Transformer network, SPTformer. A feature embedding module embeds the input historical sequence, x = {x_0, x_1, …, x_n} ∈ R^(H×W×C), capturing rough short-term spatial dependencies. After adding the batch dimension, the input data are five-dimensional. The positional encoding module adds position codes to the feature-embedded historical sequence, and the resulting frame feature maps are transmitted to the Encoder as input. The spatiotemporal dependence between frames in the historical sequence is extracted by a self-attention mechanism and convolution. A linear transformation is used to predict and generate the future frames, x̂ = {x̂_{t+1}, x̂_{t+2}, …, x̂_{t+k}} ∈ R^(H×W×C), so as to complete the prediction of online car-hailing demand in time and space. Figure 1 shows the framework of the model.
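As a concrete illustration of this windowing, the sketch below slices a toy demand video into an m-frame input and a k-frame target. The grid size, sequence length, and window sizes are illustrative values, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical dimensions: an H x W grid of traffic cells, C = 1 demand channel.
H, W, C = 8, 8, 1
n_frames = 30          # length of the historical demand sequence
m, k = 12, 6           # use m past frames to predict the next k frames

# Toy demand video: one frame per time slice.
video = np.random.rand(n_frames, H, W, C)

def make_sample(video, t, m, k):
    """Slice one (input, target) pair around time index t (inclusive)."""
    x = video[t - m + 1 : t + 1]    # m frames up to and including t
    y = video[t + 1 : t + 1 + k]    # k future frames
    return x, y

x, y = make_sample(video, t=15, m=m, k=k)
print(x.shape, y.shape)   # (12, 8, 8, 1) (6, 8, 8, 1)
```

Stacking many such pairs over a batch dimension yields the five-dimensional input the model consumes.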


Data Conversion
People using an online car-hailing platform must input information such as the departure point and destination, which provides a basis for the spatiotemporal prediction of travel. The proposed prediction method converts the data into two-dimensional pictures and divides them according to fixed time intervals to count trips. A fixed number of frames constitutes the experimental data. The city is divided into equal-sized grids, with fixed numbers of rows and columns. Each grid represents a small traffic area, and its order quantity represents the travel demand in that area. The travel data used in this paper cover central Haikou City, in the form of a demand matrix D_t = [X_ij], where D_t represents the demand at time t, and X_ij represents the demand at grid coordinates (i, j).
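The conversion from raw orders to demand frames can be sketched as follows. The bounding box, grid size, and slice length here are small illustrative values (not the actual Haikou extent or the paper's 60 × 60 grid), and `orders_to_frames` is a hypothetical helper.

```python
import numpy as np

# Illustrative bounding box and grid (hypothetical coordinates, coarse for readability).
LON_MIN, LON_MAX = 110.10, 110.42
LAT_MIN, LAT_MAX = 19.90, 20.10
ROWS, COLS = 6, 6
SLICE_MIN = 10  # minutes per time slice

def orders_to_frames(orders, n_slices):
    """orders: list of (minute, lon, lat) tuples -> stack of demand frames D_t."""
    frames = np.zeros((n_slices, ROWS, COLS), dtype=np.int32)
    for minute, lon, lat in orders:
        t = int(minute // SLICE_MIN)
        i = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * ROWS)
        j = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * COLS)
        if 0 <= t < n_slices and 0 <= i < ROWS and 0 <= j < COLS:
            frames[t, i, j] += 1   # one order adds one unit of demand X_ij
    return frames

orders = [(3, 110.20, 19.95), (4, 110.21, 19.95), (12, 110.30, 20.05)]
frames = orders_to_frames(orders, n_slices=2)
print(frames[0].sum(), frames[1].sum())  # 2 1
```

Each frame plays the role of one grayscale "video frame" in the prediction pipeline.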


Spatial Embedding
The data must be spatially encoded before being fed into the coding block, for which this study adopts a 3D convolution layer with the ReLU activation function; the number of convolution kernels is denoted d_model. After the data pass through the convolutional layer, representative features are extracted, so that the network can learn more effectively. The image frame to be encoded is represented as x_i ∈ R^(H×W×C), the spatially encoded data can be represented as F_i ∈ R^(H×W×d_model), and their relationship is F_i = σ(W_s ∗ x_i), where W_s is the convolution kernel, "∗" represents convolution, and σ represents the activation function.
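A minimal numpy sketch of this embedding is shown below. For brevity it uses a 1 × 1 (pointwise) kernel, under which the convolution collapses to a matrix product; the paper's model uses a larger 3D kernel, but the shape transformation F_i = ReLU(W_s ∗ x_i) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, d_model = 8, 8, 1, 16

# Toy frame x_i in R^{H x W x C}.
x = rng.standard_normal((H, W, C))

# A 1x1 convolution kernel W_s mapping C -> d_model channels (simplification:
# the actual model uses a 3D kernel over space and time).
W_s = rng.standard_normal((C, d_model))

F = np.maximum(x @ W_s, 0.0)   # pointwise convolution + ReLU
print(F.shape)                 # (8, 8, 16)
```

The ReLU guarantees non-negative features, and the channel count grows from C to d_model, matching F_i ∈ R^(H×W×d_model).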

Positional Encoding
The model has no recursive process, while our data are spatiotemporal sequences ordered in time. For the model to capture the sequential relationship between time steps during training, the spatially encoded sequence data must be injected with information about the relative or absolute position of each frame in the image sequence. Therefore, before the data enter the coding block, a positional encoding layer is added,

Pos_Enc(p, h, w, 2i) = sin(p / 10000^(2i/d_model)),
Pos_Enc(p, h, w, 2i+1) = cos(p / 10000^(2i/d_model)),

which is computed using the sine and cosine functions of different frequencies. Pos_Enc is the calculated position code, p is the absolute position of the video frame in the sequence, h is the image height, w is the width, i indexes the channel dimension, and d_model is the number of channel dimensions. The calculation result is then added, element by element, to the spatially encoded video frame data,

M_i = F_i ⊕ Pos_Enc, i ∈ [1, n],

where "⊕" indicates element-wise addition of tensors.


Encoder Layer
The coding block consists of 3D-Masked multi-head attention, 3D-Feedforward, and Add-Normalize layers. The 3D-Masked multi-head attention is the main body, calculating the spatiotemporal relationships between historical frame sequences. The 3D-Feedforward has two Conv3D layers, which better capture short-term dependencies between time steps. Add-Normalize has residual and normalization layers, which speed up training and improve stability. Its structure is shown in the coding block in Figure 1.

Scaled Dot-Product Attention [17]: The query, key, and value matrices Q, K, V ∈ R^(H×W×d_model) must be calculated from the input data. We compare all keys with their queried representations; if a query and a key are similar, the corresponding values are assumed to be related. For each Q vector, the attention weight for each value V is computed by taking the dot product of Q with every K vector. To prevent unstable gradients when d_model is large and the dot products grow, these dot products are divided by √d_k (where d_k is the channel dimension of the key vector). To avoid seeing future information, model training may only rely on the sequence before the current time, not the sequence after it, so a masking method (Mask) is inserted in the attention calculation. The results are normalized by the softmax function, and the attention weights of all V vectors are weighted and summed to obtain the final output of scaled dot-product attention,

Attention(Q, K, V) = softmax(QK^T / √d_k + Mask) V.

A Conv3D layer is used in the calculation of Q, K, and V. Compared with a linear transformation, 3D convolution can capture the short-term correlation of the sequence and extract additional features. The formula is

Q = W_Q ∗ M, K = W_K ∗ M, V = W_V ∗ M,

where W represents the convolution kernels and "∗" represents convolution.

Masked Multi-head Self-attention Layer: Following the design of multi-head attention, multiple sets of Q, K, and V are used to calculate the attention.
The encoded data are projected multiple times, the groups of Q, K, and V compute attention in parallel, and the results are concatenated, giving a better model effect than a single projection. Our model uses this feature to build multi-head attention with masks: information in different positions of the video frames is jointly modeled by multiple heads, each computing scaled dot-product attention in parallel, and the masks prevent future information leakage. The formula is

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O, head_j = Attention(Q_j, K_j, V_j),

where this paper uses Conv3D instead of linear projection to calculate Q, K, and V, so W is a convolution kernel.
Add-Normalize: This layer consists of residual and normalization layers. To alleviate gradient vanishing, which degrades the training of deep neural networks, the encoded data M and the calculated attention data M̂ are used to construct the residual connection. At the same time, to improve the generalization ability of the model, we use a batch normalization (BN) layer. Batch normalization reduces overfitting by reducing internal covariate shift [57]. It not only improves the training speed but also speeds up convergence, and it acts as a regularizer similar to Dropout, achieving a comparable effect in preventing overfitting. The formula is

A = BN(M ⊕ M̂).

3D-Feedforward: Compared with fully connected neural networks, CNNs can successfully capture spatial information in images, due to their reduced parameter count and weight sharing. To better capture dependencies between video frame features and time steps, this paper uses two Conv3D layers to construct the feedforward layer,

F = W_2 ∗ ReLU(W_1 ∗ A),

where W_1 and W_2 represent the convolution kernels, and A is the output of the normalization layer. This is the entire content of the encoding block. When the position-encoded data enter the coding block, the self-attention of the data is first calculated, and the result is passed to the Add-Normalize layer to improve the robustness of the model. Next, the data enter the 3D-Feedforward layer to help capture short-term sequence correlation, and are exported after another Add-Normalize, completing the Encoder layer calculation.
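The masked scaled dot-product attention at the heart of the coding block can be sketched as below. For brevity, each frame is flattened to a d-dimensional vector and a single head with identity projections is used, whereas the model uses Conv3D projections and multiple heads; the dimensions are illustrative.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Scaled dot-product attention over a frame sequence, with a causal
    mask so that frame t only attends to frames <= t. Q, K, V: (n, d)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # QK^T / sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # future positions
    scores[mask] = -np.inf                             # masked -> zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(1)
n, d = 5, 8   # 5 frames, flattened to d-dim features for the sketch
Q = K = V = rng.standard_normal((n, d))
out, w = masked_attention(Q, K, V)
print(out.shape)                  # (5, 8)
print(np.allclose(w[0, 1:], 0))   # True: frame 0 attends only to itself
```

The upper-triangular mask is exactly the "no future information" constraint: row t of the weight matrix is zero for every column after t.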

Prediction Layer
The final prediction layer is a Conv3D layer. To keep the number of channels of the output consistent with the input, its convolution kernel size is set to 1 and padding is applied, so the output and input have the same shape. The formula is

X̂ = W_p ∗ F,

where W_p represents the convolution kernel, and F is the output of the encoding block.

Optimizer
Mean Squared Error: This paper uses the mean squared error as the prediction loss,

MSE = (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)²,

where y_i is the real value, ŷ_i is the predicted value, and m is the number of predicted frames.
RMSprop: This study chooses RMSprop as the optimizer; it maintains an exponentially weighted moving average of past squared gradient values, which reduces the swing amplitude of the loss function and yields faster convergence.
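The loss and one optimizer step can be written out explicitly; the sketch below is a plain numpy illustration of the two formulas (in practice a framework optimizer such as RMSprop in a deep learning library would be used, and the learning rate and decay here are illustrative defaults).

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over m predicted values/frames."""
    return np.mean((y_true - y_pred) ** 2)

def rmsprop_step(param, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: divide the gradient by a moving RMS of its history."""
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])
print(mse(y_true, y_pred))   # 1/3

w, cache = np.array([1.0]), np.array([0.0])
w, cache = rmsprop_step(w, grad=np.array([0.5]), cache=cache)
print(w[0] < 1.0)            # True: the step moves against the gradient
```

Dividing by the running RMS of gradients normalizes the step size per parameter, which is what damps the oscillation of the loss.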
Training and prediction proceed as follows. During training, this study uses {x_{t−m+1}, x_{t−m+2}, …, x_t} as input and {x_{t−m+2}, x_{t−m+3}, …, x_{t+1}} as the prediction label, calculates the training loss from the prediction result, and uses it to adjust the model parameters. To predict the future k frames of sequence data, we use {x_{i−m+1}, x_{i−m+2}, …, x_i} to generate one frame, x̂_{i+1}, add x̂_{i+1} to form the new input sequence {x_{i−m+2}, …, x_i, x̂_{i+1}} for the next round of prediction, and repeat the process to obtain the k-frame sequence x̂ = {x̂_{t+1}, x̂_{t+2}, …, x̂_{t+k}} after time t. This study uses the test set to evaluate the model when training is optimal.
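The autoregressive rollout described above amounts to a sliding window over generated frames. The sketch below uses a trivial stand-in "model" (the window mean) purely to exercise the loop; a real call to the trained SPTformer would take its place, and all shapes are illustrative.

```python
import numpy as np

def rollout(model, history, k):
    """Autoregressively generate k future frames: each new frame is appended
    and the oldest frame dropped, keeping the m-frame window length fixed."""
    window = list(history)
    preds = []
    for _ in range(k):
        nxt = model(np.stack(window))   # predict one frame from the window
        preds.append(nxt)
        window = window[1:] + [nxt]     # slide the window forward by one
    return np.stack(preds)

# Stand-in model: predicts the mean of the window (illustrative only).
mean_model = lambda w: w.mean(axis=0)

history = [np.full((4, 4), float(v)) for v in (1, 2, 3)]   # m = 3 toy frames
preds = rollout(mean_model, history, k=2)
print(preds.shape)   # (2, 4, 4)
```

Note that from the second step onward the window contains generated frames, so prediction errors can compound over the k-frame horizon.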

Overview of Study Area
This study takes Haikou City, the capital city of Hainan Province and the central city of the Beibu Gulf urban agglomeration, as an example. It is located between 19°31′ and 20°04′ north latitude and between 110°07′ and 110°42′ east longitude. It is the political, economic, technological, and cultural center of Hainan Province, and is its largest transportation hub. It is the fulcrum city of China's "One Belt, One Road" strategy [58]. However, Haikou has some problems in transportation, especially in the aspect of public transport development. Compared with the evaluation indicators of public transport in China, Haikou has fewer public transport lines, and the number of buses per 10,000 people is lower than the national standard. Buses are mainly concentrated on trunk roads, and their lines are unevenly distributed. In addition, the time interval between bus departures is long, so few citizens choose to take the bus. As an important supplement to public transportation, online car-hailing is very popular among residents [59]. In 2012, Didi Chuxing, Shenzhou, Yidao, and other online car-hailing companies began to operate in Haikou City. By the end of 2016, the number of online car-hailing vehicles in Haikou had reached 10,000, including 6000 legal car-hailing drivers [60]. Figure 2 shows an overview of the study area.

Ride-Hailing Data
The online car-hailing order data used in this study come from the travel dataset published by Didi Chuxing's Gaia data open plan [61]. This study selected the daily order data of Haikou from 1 May to 31 October 2017, including order ID, order time, order type, traffic type, number of passengers, estimated road distance between departure and destination, arrival time, estimated price, duration, primary business line, and longitude and latitude of the destination and starting point. Personal information was anonymized, which did not affect the research.

Research area division: To facilitate modeling, the area should be divided into small research units, in order to avoid the complexity of map matching in large-scale network demand forecasting. We divided the study area into multiple grids for faster analysis. Since the point data are aggregated, the results are affected by the size and method of grid division, so the grid size should be determined according to actual needs. With a small-scale grid, the travel demand in each grid is low, the network complexity is high, and the actual operation is difficult, but the division describes demand more finely in terms of spatial granularity. Although the computational complexity is reduced with large-scale grids, their descriptive accuracy is poor. In this article, the study area is initially divided into a 60 × 60 grid of cells, each covering 0.09 km², and the travel demand of each grid is calculated on each time scale.

Data Preprocessing
Online car-hailing data: the order data contain missing and abnormal information, such as a missing order ID or a null estimated distance. Fields not needed for the study are deleted from the historical order data, including city ID, city area code, secondary district and county, driver sub-product line, estimated road distance between departure and destination, estimated price, duration, and primary business line. Missing order IDs are replaced with randomly generated IDs. Orders whose origin or destination latitude and longitude fall outside the study area are deleted, and only the first occurrence of duplicated orders is retained. The original data comprise 14,160,162 orders, of which 11,255,140 remain after cleaning.
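The cleaning rules above can be sketched in a few lines. This is a hedged illustration, not the actual pipeline: the field names and the area predicate are assumptions, and only the origin coordinates are tested here, whereas the paper also checks the destination.

```python
import uuid

def clean_orders(orders, in_study_area):
    """Sketch of the cleaning step.
    orders: list of dicts with hypothetical keys 'order_id', 'lng', 'lat'.
    in_study_area: predicate on (lng, lat)."""
    seen = set()
    cleaned = []
    for o in orders:
        if not in_study_area(o["lng"], o["lat"]):
            continue                                  # outside the study area
        oid = o.get("order_id") or uuid.uuid4().hex   # random ID for missing IDs
        if oid in seen:
            continue                                  # keep first occurrence only
        seen.add(oid)
        cleaned.append({**o, "order_id": oid})
    return cleaned
```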

Experimental Data Construction
Online car-hailing order data must be converted into video frame data to construct a spatiotemporal matrix, i.e., into grayscale images by time period. Each pixel represents a research unit, and its gray value represents the travel volume of that unit. The video frames are combined into a video frame dataset, which is imported into the model for spatiotemporal prediction. The method is as follows. The data for the whole study period are arranged in chronological order, and the time scale of the historical variables is divided. To judge the influence of the time division on the experimental results and select the best scheme, we divide the data into time slices of 10, 15, 20, and 30 min. The number of online car-hailing trips in each grid in each time slice is then calculated: using the longitude and latitude of each order, we determine which grid the order falls in and accumulate the travel demand of that research unit. Arranging the resulting frames in temporal order yields 26,496 frames for 10 min slices, 17,664 for 15 min, 13,248 for 20 min, and 8832 for 30 min. To control the variables, all models use the data of the first two hours to predict the data of the next hour: for each three-hour window, the first two hours are the input, and the last hour is the label used to calculate loss and accuracy. Accordingly, the 10, 15, 20, and 30 min image data are processed into video samples of 18, 12, 9, and 6 frames, respectively. The first 1400 samples are used as experimental data, with 80% for training, 10% for validation, and 10% for testing. Figure 3 shows the converted data.
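The order-to-frame conversion can be sketched as follows. This is a minimal illustration under stated assumptions: orders are given as (timestamp, longitude, latitude) tuples with time measured in seconds from the start of the study period, and `to_grid` is any mapper from coordinates to a 60 × 60 cell; none of these names come from the paper's code.

```python
import numpy as np

def orders_to_frames(orders, n_slices, slice_seconds, to_grid):
    """Aggregate orders into video frames: each frame is a 60 x 60 array
    whose cell value is the trip count in that grid during one time slice.
    Returns an array of shape (n_slices, 60, 60)."""
    frames = np.zeros((n_slices, 60, 60), dtype=np.float32)
    for t, lng, lat in orders:
        k = int(t // slice_seconds)            # which time slice the order falls in
        cell = to_grid(lng, lat)
        if 0 <= k < n_slices and cell is not None:
            frames[k, cell[0], cell[1]] += 1   # accumulate travel demand
    return frames
```

Stacking consecutive frames (18, 12, 9, or 6 per three-hour window, depending on the slice length) then yields the video samples fed to the models.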

Evaluation Indicators
This study evaluates the quality of model prediction by mean absolute error (MAE) and root mean square error (RMSE):

MAE = (1 / (n · m)) Σ_{k=1}^{n} Σ_{i=1}^{m} | ŷ_k^i − y_k^i |

RMSE = sqrt( (1 / (n · m)) Σ_{k=1}^{n} Σ_{i=1}^{m} ( ŷ_k^i − y_k^i )² )

where n is the number of predicted frames, k denotes the frame, m is the number of grids per video frame, i denotes the research area, ŷ_k^i is the predicted value, and y_k^i is the real value. MAE is the real error, which intuitively reflects the average difference between the predicted and actual values; a lower MAE indicates a more accurate prediction. RMSE reflects the difference between the predicted and real data and magnifies larger errors, so it reflects the maximum error; a smaller RMSE indicates a better prediction result. When calculating MAE and RMSE, y and ŷ are gray values, and the prediction accuracy is obtained by evaluating each grid of each frame.
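The two indicators can be computed directly over the predicted frame tensors, as in this short NumPy sketch (variable names are illustrative):

```python
import numpy as np

# y_hat and y have shape (n, H, W): n predicted frames, each an H x W grid
# of gray values (travel demand per research area).
def mae(y_hat, y):
    """Mean absolute error over all grids of all frames."""
    return float(np.mean(np.abs(y_hat - y)))

def rmse(y_hat, y):
    """Root mean square error over all grids of all frames."""
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))
```

Note how RMSE magnifies large errors: a single large deviation raises RMSE more than MAE.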


SPTformer
The SPTformer model, shown in Figure 1, includes a spatiotemporal embedding layer, which encodes the data and adds positional information; an Encoder for computing self-attention; and an output layer, consisting of a Conv3D layer, which performs the final prediction. The Encoder consists of two Encoder layers in series.
This study uses the video frame datasets divided by different time slices to compare the impact of the time division method on the prediction results. We take the first 1400 data points as experimental data; construct training, validation, and test sets in an 8:1:1 ratio; and select the optimal model parameters after experiments. We set the model dimension to 16, the number of attention heads to 4, the convolution kernel size of the spatial embedding convolutional layer to (3, 3, 3), the number of convolution kernels to 16, and use a ReLU activation function. In the Encoder block, the convolutional layers computing Q, K, and V have the same settings. The 3D feedforward has two Conv3D layers, with 32 and 16 convolution kernels in the first and second layer, respectively, each of size (3, 3, 3), using ReLU activation. The output layer is a Conv3D layer with one convolution kernel of size (1, 1, 1). This study uses MSE as the training loss, optimized with RMSprop at a 0.001 learning rate and 0.9 decay rate, and adjusts the model by backpropagation through time. During training, once the number of iterations reaches a certain level, the loss and accuracy change slowly and the model reaches its optimum, so we set the number of iterations to 50.
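The scaled dot-product self-attention at the core of the Encoder can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: SPTformer computes Q, K, and V with Conv3D layers over video frames and uses four heads, whereas here each frame is flattened to a single embedding vector, Q, K, and V are plain single-head linear projections, and all names are hypothetical.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (T, d) sequence of T frame embeddings; wq/wk/wv: (d, d) projections.
    Returns the (T, d) attended sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = x.shape[1]
    scores = q @ k.T / np.sqrt(d)                   # (T, T) frame-to-frame scores
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key frames
    return weights @ v                              # weighted sum of value frames
```

Because every frame attends to every other frame in one matrix product, the whole sequence is processed at once rather than step by step, which is the source of the parallelism discussed later.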

Compared Models
This study adopts the convolutional LSTM (ConvLSTM) network as the baseline model and uses the ConvGRU and Self-Attention ConvLSTM (SaConvLSTM) models for comparison.
The ConvLSTM neural network was first used to solve the problem of precipitation nowcasting. This structure can establish temporal relationships between two-dimensional plane data and extract spatial relationships like a CNN [62]. Its principle is similar to that of an LSTM network: there are also forgetting, input, and output gates, but the difference is the addition of convolution between the input and each gate. ConvLSTM has been widely used in spatiotemporal prediction research. The formulas are as follows:

i_t = σ(W_xi * X_t + W_hi * H_{t−1} + W_ci ⊗ C_{t−1} + b_i)
f_t = σ(W_xf * X_t + W_hf * H_{t−1} + W_cf ⊗ C_{t−1} + b_f)
C_t = f_t ⊗ C_{t−1} + i_t ⊗ tanh(W_xc * X_t + W_hc * H_{t−1} + b_c)
o_t = σ(W_xo * X_t + W_ho * H_{t−1} + W_co ⊗ C_t + b_o)
H_t = o_t ⊗ tanh(C_t)

where X_t, H_t, C_t, i_t, f_t, and o_t are all converted from two-dimensional to three-dimensional tensors: two dimensions represent the rows and columns of the network in which the grid is located, and the third represents the number of features in each grid. i_t, f_t, and o_t represent the input, forget, and output gates; X_t represents the input of the network at time t, H_t the output at time t, and C_t the cell state at time t. W and b represent the weights and biases for each gate, where W acts as a convolution kernel; "*" represents the convolution operation, and "⊗" represents the Hadamard product, as in LSTM.
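As a concrete illustration of these gate equations, the following hedged NumPy sketch performs one single-channel ConvLSTM step. It is not the implementation used in the experiments: biases are omitted, 3 × 3 kernels are assumed for the convolutional weights, and the peephole weights (W_ci, W_cf, W_co) match the state's shape since they enter via the Hadamard product.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive single-channel 2D 'same' convolution (cross-correlation)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)                     # zero-pad so output matches input size
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, W):
    """One ConvLSTM step following the gate equations above:
    '*' is convolution, the peephole terms W_c ⊗ C are Hadamard products."""
    i = sigmoid(conv2d_same(x, W["xi"]) + conv2d_same(h, W["hi"]) + W["ci"] * c)
    f = sigmoid(conv2d_same(x, W["xf"]) + conv2d_same(h, W["hf"]) + W["cf"] * c)
    c_new = f * c + i * np.tanh(conv2d_same(x, W["xc"]) + conv2d_same(h, W["hc"]))
    o = sigmoid(conv2d_same(x, W["xo"]) + conv2d_same(h, W["ho"]) + W["co"] * c_new)
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Because the new state (h_new, c_new) feeds the next step, the frames must be processed strictly in sequence, which is exactly the parallelization bottleneck noted in the Discussion.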
Since LSTM is slow to train, the GRU makes slight simplifications to increase speed. Inspired by this, the LSTM in ConvLSTM was replaced by a GRU, yielding the ConvGRU model. Like ConvLSTM, ConvGRU changes the operation between the input and each gate to convolution, and it can perform spatiotemporal prediction; unlike ConvLSTM, its recurrent computation follows the GRU. Yu et al. [63] found that ConvGRU is faster and has better spatiotemporal prediction results.
Lin et al. [64] observed that ConvLSTM relies on convolutional layers to capture spatial dependencies, which is inefficient for global context, and proposed SaConvLSTM, which introduces self-attention to extract spatial features with both global and local dependencies and to capture long-term dependencies in the spatial and temporal domains. Their experimental results show that the method achieves better prediction results with fewer parameters and higher efficiency.

Results
This paper compared the effect of the proposed model to that of other models on the same dataset, using the data of the first two hours to predict the data of the next hour. We used the datasets constructed with different time divisions, taking the first 1400 data points of each as experimental data and constructing training, validation, and test sets. The results are shown in Table 1, from which it can be seen that our model has the highest prediction accuracy on all constructed datasets, and that accuracy increases as the time scale becomes finer: the RMSE and MAE decrease with finer time periods and are lowest when the dataset is constructed with 10 min periods. To observe the training process, we visualized the change in the accuracy of each model as it was trained; Figure 4 shows the changes in MAE and RMSE. It can be seen from Figure 4 that the proposed model has the best fitting degree, its loss and accuracy curves are smoothest, and its accuracy has a rising trend.
It can be seen from the accuracy change that the training speed of the proposed model is highest in the first 20 rounds of training, after which the accuracy changes slowly and gradually flattens. At 50 training rounds, the fitting degree of the model is best, and the accuracy reaches its maximum. Compared to the other models, the MAE and RMSE curves of the proposed model are always at the bottom, so its training effect is best. The accuracy curves of the reference models are serrated, and the data fit poorly; the reference models train fastest in the first 10 rounds, then slow down, reaching their optimal values around the 30th round. To see the performance of the models more intuitively, we visualize the prediction results in Figure 5. This study selected the same dataset and used the trained models to predict it; Figure 5 shows the visualization of the prediction results of the different models. Each frame of data has an image data structure whose gray value is the travel demand.
Comparing the real map to the predictions, it can be found that the prediction map of the proposed model is the most similar to the real map: it has the smallest difference in distribution shape and intensity, and it best expresses the details. Although the other models predict the general distribution characteristics of the data, there is a big gap in the details, and their predicted travel demand differs greatly from the actual values.
After careful observation of the prediction results of each model, we found that ConvLSTM and ConvGRU produce similar forecasts, with roughly the same predictions of travel intensity, although ConvLSTM is slightly better at predicting spatial distribution. Compared with these two models, SaConvLSTM performs worse in both spatial distribution and travel intensity prediction. Overall, although the three comparison models predict the spatial distribution of the central part of the study area, their predictions for the marginal areas are poor: in the original map, the travel intensity of the edge areas is very low or there are even no travel data, yet the compared models over-predict there. In contrast, the proposed model's predictions for the edge areas are basically the same as the original data, and its predicted travel intensity is closer to the original data. Thus, our model is more competitive.
Analyzing the prediction results of the proposed model, we found that it predicts best in the city center, while predictions for the Xinbu Street, Haixiu Street, Xiuying Street, Haixiu Town, Chengxi Town, Fengxiang Street, and Binjiang Street areas are poor. The model performs better in the city center because the intensity of online car-hailing trips is higher there and residents' daily trips are more strongly cyclical; in the other areas, the small number of daily trips and the irregular use of online car-hailing degrade the predictions. Subsequent studies can analyze these areas separately to increase forecast accuracy.

Discussion
This study used the same dataset to experiment with different models. The experimental results show that our model fits the online car-hailing travel demand data best: the spatial distribution of its prediction results is closest to the original data, it better describes details, and it most accurately predicts car-hailing demand. To capture the spatial relationships between sequences, the contrasting models change the operation between the input and each gate to convolution, but the CNN receptive field is usually small, which is not conducive to capturing global features. Unlike a CNN, a Transformer can extract all the needed information from the input and its relations at the same time, thus capturing long-range dependencies.
Using the same data and training batch, the average training times of SPTformer, ConvLSTM, and ConvGRU are 16 s, 33 s, and 20 s, respectively; ConvGRU is faster than ConvLSTM but less accurate. The training time of SaConvLSTM is 44 s, so our model has the shortest training time, and it also consumes fewer GPU resources. The contrastive models use LSTM and GRU structures, which evolved from the RNN, to capture temporal relationships between sequences. Since the RNN was proposed, it has been widely used for time series problems. Generally speaking, an RNN is a for-loop structure that reuses the result of the previous iteration of the loop; in theory it should remember information seen many time steps earlier, but in practice it can hardly learn such long-term dependence. The LSTM network, a variant of the RNN, was therefore proposed, and it learns long-term dependence better than the RNN. However, like the RNN, it must process sequence data in order, so there is no room for parallelization to accelerate training. The GRU works on the same principle as the LSTM, with some simplifications and less computation, but its representation ability, as reflected in the prediction results, is not as good. SPTformer is a deep learning model built on an attention mechanism: attention enhances the relevant and important parts of the input, suppresses the irrelevant parts, and learns which parts are important through training. Compared with a recurrent neural network, SPTformer has the advantage that it does not need to process sequential data in order, so training is more parallel and the training time is reduced. In sum, our model is more competitive.
Observing the prediction results on datasets constructed with different time divisions, we found that the finer the time division, the better the prediction: a finer division gives the model more information and makes the prediction more accurate. Therefore, when forecasting travel demand, the data should be divided more finely.

Conclusions
To make more accurate spatiotemporal predictions of online car-hailing travel demand, this paper proposes a new spatiotemporal prediction model based on the Transformer architecture. We utilized positional encoding, an attention mechanism, and a 3D convolutional network to effectively capture the spatiotemporal relationships in the data, and the parallel mechanism of the Transformer network gives our model a fast training speed. This study processes the car-hailing order data into a video frame sequence, which better matches the spatiotemporal characteristics of online car-hailing travel data, and the travel intensity of online car-hailing can be read directly from the predicted results. Compared with an overall travel forecast for Haikou, our experiment obtains travel demand at the level of small research units. This research used real 2017 online car-hailing order data for Haikou City to test the performance of the proposed model, and the experiments proved the effectiveness of the proposed method. In practice, the method can be used to predict online car-hailing travel demand in the next hour. Passengers can better understand the patterns of online car-hailing demand in different regions and at different times, make more reasonable travel decisions, and travel more efficiently. Online car-hailing drivers can accurately find demand hot spots, reduce the empty-driving rate, and increase their income. Urban managers can reasonably dispatch vehicles, meet travel demand in a timely manner, improve the level of urban traffic management, and reduce the pressure on urban roads. This provides a reference for research on shared travel and promotes the development of shared mobility.
This paper considered only the impact of historical travel on future travel. Although the model achieved good results, differences can be found in some details of its performance. In real life, many factors affect residents' travel, such as weather, points of interest, holidays, and differences in travel intensity across time periods within the same day, so follow-up studies can comprehensively consider more influencing factors. In addition, the outbreak of large-scale infectious diseases such as COVID-19 has a great impact on residents' thinking and travel modes, so subsequent research can analyze and predict residents' use of online car-hailing during COVID-19. Finally, this paper divided Haikou City only at a fixed grid scale, yet the prediction results differ under different division methods; subsequent research can compare divisions of the study area at different scales to achieve the best results.