Data-Driven Real-Time Online Taxi-Hailing Demand Forecasting Based on Machine Learning Method

Featured Application: This research provides a data-driven method for forecasting online taxi-hailing demand, and it could potentially be applied to developing multi-mode transportation prediction.
Abstract: The development of the intelligent transport system has created conditions for solving the supply–demand imbalance of public transportation services. For example, forecasting the demand for online taxi-hailing could help to rebalance taxi resources. In this research, we introduce a method to forecast real-time online taxi-hailing demand. First, we analyze the relation between taxi demand and online taxi-hailing demand. Next, we propose six models containing different information based on the backpropagation neural network (BPNN) and extreme gradient boosting (XGB) to forecast online taxi-hailing demand. Finally, we present a real-time online taxi-hailing demand forecasting model considering the projected taxi demand ("PTX"). The results indicate that including more information leads to better prediction performance: including the projected taxi demand reduces MAPE from 0.190 to 0.183 and RMSE from 23.921 to 21.050, and it increases R² from 0.845 to 0.853. The analysis indicates the demand regularity of online taxi-hailing and taxi, and the experiment realizes real-time prediction of online taxi-hailing demand by considering the projected taxi demand. The proposed method can help to schedule online taxi-hailing resources in advance.


Introduction
With the development of the intelligent transportation system, travel is becoming more convenient for residents. Nevertheless, because of the information asymmetry between passengers and drivers, the spatial and temporal distributions of passengers and drivers are inconsistent, and limited urban transportation resources are wasted. Therefore, trip demand in urban areas urgently needs to be studied. Recently, online taxi-hailing has gradually become the primary trip mode for urban residents. Meanwhile, the taxi still serves as public transportation for urban residents. Under these circumstances, online taxi-hailing demand is likely to be affected by taxi demand because the two services are homogeneous. Thus, we should take taxi demand into account when studying online taxi-hailing demand.
In the past, research that focused on forecasting traffic demand was mostly based on environmental data and GPS data. Moreover, the research mined the features of GPS data and environmental

Related Work
Over the years, numerous works have been dedicated to enhancing the accuracy of trip demand forecasting. The first application of trip demand forecasting predicted trip demand with a four-step process considering spatiotemporal factors [1]. L. Moreira-Matias et al. presented a method to predict the spatial distribution of taxi demand [2]. The same authors then proposed a learning model that uses real-time data to forecast the spatiotemporal distribution of taxi-passenger demand [3], followed by a combination forecasting model for the same task [4]. K. Zhang et al. presented an adaptive forecasting method to forecast the location of hotspots and test their heat [5]. Next, N. Davis et al. proposed a time-series method that forecasts taxi demand by mining the regularities of taxi mobile app data [6]. X. Peng et al. forecasted taxi demand hotspots based on social media check-ins to reduce the imbalance between taxi supply and demand [7]. K. Zhao et al. predicted taxi demand with three forecasting methods based on the Markov model, the Lempel-Ziv-Welch model, and an ANN model, respectively [8]. Besides GPS data and environmental data, J. Xu et al. also considered historical traffic behaviors as an important variable in the taxi demand forecasting problem, and they proposed an LSTM method to forecast taxi demand in several urban areas [9]. D. Zhang improved the hidden Markov chain model and proposed a D-model to forecast taxi demand [10]. To explore the relationship between taxi and subway, Y. Bao et al. took the interaction between taxi demand and subway demand into account, examined its impact on the accuracy of taxi demand prediction, and proposed a taxi demand prediction method based on a neural network model [11]. N. Davis explored the impacts of tessellation on demand prediction and proposed a combination algorithm of different tessellation strategies to predict taxi demand [12].
The research above considered the impacts of GPS data and environmental data on prediction accuracy, but it did not take real-world event information into account. To address this problem, I. Markou et al. mined real-world event information from unstructured data and applied machine learning methods to forecast taxi demand [13]. S. Ishiguro et al. introduced real-time demographic data into taxi demand forecasting and used a stacked denoising autoencoder to explore the impact of demographic data on forecasting accuracy [14]. S. Liao compared two deep neural networks for forecasting trip demand and found that DNNs perform better than traditional machine learning methods [15]. U. Vanichrujee et al. presented an ensemble method consisting of an LSTM model, a GRU model, and an extreme gradient boosting (XGB) model to forecast taxi demand [16]. J. Xu proposed a sequence learning method considering the historical demand to forecast trip demand [17]. H. Yao et al. presented a multi-view spatiotemporal network framework to model spatiotemporal relationships and forecast traffic demand [18]. H. Yan analyzed taxi requests and proposed a Bayesian hierarchical semiparametric model to forecast taxi demand [20]. L. Kuang introduced unstructured data into a deep learning method to forecast trip demand [21]. However, the methods above ignored the destination of passengers. L. Liu proposed a method to forecast the taxi demand between origin-destination pairs [22]. I. Markou introduced real-world events into the prediction method and used the data to forecast traffic demand [23]. F. Rodrigues et al. explored the relationship between drop-off points and pick-up points and proposed a spatio-temporal LSTM model to forecast taxi demand [25]. F. Terroso-Saenz predicted taxi demand through the QUADRIVEN method based on human-generated data [26]. Y. Xu proposed a graph and time-series learning model that considers the relationships between non-adjacent regions for city-wide taxi demand prediction [27]. H. Yu proposed a deep spatiotemporal recurrent convolutional neural network to forecast traffic flow [28]. X. Liu explored the impacts of socio-economic, transport system, and land-use patterns on taxi demand forecasting [29]. A. Saadallah introduced the BRIGHT method, an ensemble of time series analysis models, to forecast taxi demand precisely [30]. A. Safikhani proposed a STAR model to analyze the spatiotemporal distribution of taxis and introduced LASSO-type penalized methods to tackle parameter estimation [31]. Recently, Z. Liu proposed a combination forecasting model that uses the random forest method and the ridge regression method to predict taxi demand in hotspots [32].
In general, the relationship between different trip modes has received comparatively little attention, so further attempts are justified. This study is initiated by a real-world case study to better understand the underlying relationship between the demands of different trip modes.

Taxi GPS Data
We obtained the GPS data from the Xi'an Taxi Management Office in Xi'an city of China. The data include location information, vehicle state information, time information, and license plate information. Moreover, the taxi GPS data were recorded every 5 s for 30 days in November 2016 and include 40 million points which are located in Xi'an city of China. The GPS data were cleaned and selected. An instance of taxi GPS data is shown in Table 1.

Online Taxi-Hailing GPS Data
Online taxi-hailing GPS data are from Didi Chuxing GAIA Initiative, and the GPS data are located in Xi'an city of China. The dataset consists of 600 million track points, and it was recorded every 2-4 s for 30 days in November 2016. An instance of online taxi-hailing GPS data is shown in Table 2.

Environmental Data
The environmental data include air quality data and meteorological data. The air quality data in Xi'an city are from the official website of Green Breathing. The meteorological data in Xi'an city were derived from the National Meteorological Information Center. This study selects the hourly environmental data of Xi'an. In general, the environmental data contain 15 dimensions for the research (Table 3).
Table 3. Environmental data structure description.

Feature Selection
In a prediction problem, it is important that the features are correlated with the dependent variable. Likewise, it is important that the features are independent of one another. When building the forecasting model, both the features that exhibit strong multicollinearity and the features that have a low correlation with the dependent variable should be eliminated to enhance the prediction accuracy. Thus, we choose the Pearson correlation analysis to test the correlation between all features and the dependent variable [33,34]. The Pearson correlation is calculated as Equation (1).
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \quad (1)$$
$\operatorname{cov}(X, Y)$ is the covariance between the features X and Y. $\sigma_X$ and $\sigma_Y$ indicate the standard deviations of the features X and Y. $\rho_{X,Y}$ is the correlation value of the features X and Y, and its value lies in the range [−1, 1]. If $\rho_{X,Y} > 0$, the two features are positively correlated; if $\rho_{X,Y} < 0$, the two features are negatively correlated. A larger absolute value of $\rho_{X,Y}$ indicates a stronger correlation between the features X and Y.
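As an illustration of Equation (1), the following minimal sketch computes the Pearson correlation with the Python standard library only. The function name `pearson` and the sample values are hypothetical and are not taken from the study's data.

```python
import math

def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and standard deviations (the 1/n factors cancel).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / (std_x * std_y)

# Hypothetical hourly demand values, for illustration only.
taxi_demand = [10, 20, 30, 40, 50]
online_demand = [12, 24, 33, 41, 55]
rho = pearson(taxi_demand, online_demand)
```

A value of `rho` close to 1 would indicate that the two series rise and fall together; a feature pair with a very high absolute correlation is a candidate for removal because of collinearity.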

BPNN
Artificial neural networks (ANNs) possess attributes of learning, generalizing, parallel processing, and error endurance. These attributes make the ANNs useful in modeling complex situations. Therefore, we employ BPNN, a type of ANN, for forecasting online taxi-hailing demand in this study [35,36]. A three-layer BPNN employed in this paper is shown in Figure 1 [37]. In Figure 1, "T" indicates the information of time factors, "E" is the information of environmental factors, and "TX" represents the information of taxi demand.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 5 of 18
The connection weights among nodes are obtained by training on data in the backpropagation process, which minimizes the least-mean-square error between the true values and the values estimated at the network's output. First, the connection weights are assigned initial values. Then, the weights are updated based on the back-propagated error between the predicted and true output values. Assuming that there are n input neurons, m hidden neurons, and one output neuron, the training process can be described as follows.
Hidden layer stage: The outputs of all neurons in the hidden layer are calculated as Equations (2) and (3).
$net_j$ is the activation value of the jth node, $y_j$ is the output of the hidden layer, and $f_H$ is the activation function of a node; the activation function is the rectified linear unit function as Equation (4).
$$f(x) = \max(0, x) \quad (4)$$
Output stage: The outputs of all neurons in the output layer are as Equation (5).
$f_o$ is the activation function as Equation (4). All weights are assigned random values initially and then modified by the delta rule according to the learning samples.
The three-layer BPNN above is the basic application of BPNN in the online taxi-hailing demand prediction method. To find the best network structure for the different forecasting models, we use the grid search algorithm to determine the network structures of the models based on BPNN.
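The hidden layer stage and the output stage described above can be sketched as a single forward pass. This is a minimal sketch under the usual formulation of a three-layer network, not the study's implementation: the weighted-sum form of the activation value, the bias terms, and all names (`forward`, `hidden_weights`, and so on) are assumptions. Only the use of the rectified linear unit in both stages comes from the text.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a network with n inputs, m hidden neurons, one output.

    hidden_weights is a list of m rows, each holding n weights (assumed layout).
    """
    hidden = []
    for weights_j, bias_j in zip(hidden_weights, hidden_biases):
        # Hidden layer stage: activation value of node j, then its output.
        net_j = sum(w * x for w, x in zip(weights_j, inputs)) + bias_j
        hidden.append(relu(net_j))
    # Output stage: weighted sum of hidden outputs passed through the same ReLU.
    net_o = sum(w * y for w, y in zip(output_weights, hidden)) + output_bias
    return relu(net_o), hidden

# Hypothetical weights for n = 2 inputs and m = 2 hidden neurons.
output, hidden = forward(
    inputs=[1.0, 2.0],
    hidden_weights=[[1.0, -1.0], [0.5, 0.5]],
    hidden_biases=[0.0, 0.0],
    output_weights=[2.0, 3.0],
    output_bias=0.0,
)
```

In training, the error between `output` and the true demand would be propagated backwards to update the weights; that step is omitted here.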

XGB
XGB is a boosting model based on classification and regression trees (CART), which takes full advantage of the residual of a base classifier [38]. The boosting algorithm combines simple tree models to establish a more precise model, and it overcomes the influence of the interference signal. The prediction is as Equation (6).
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \quad (6)$$
$f_k$ is the kth tree, K is the number of trees, and F is the set of all trees. Suppose that $S = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}$ is a known dataset with N samples, where each $x_i$ has L features and $y_i$ is its label. The objective function is as Equation (7).
$$Obj = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} r(f_k) \quad (7)$$
$\hat{y}_i$ is the predicted value of $x_i$, and l represents the difference between the true and predicted values. $r(f_k)$ is the regularization term of the kth tree, which penalizes the complexity of the model to avoid overfitting, and it is calculated as Equation (8).
$$r(f_k) = \gamma T + \frac{1}{2} \omega \lVert \vartheta \rVert^2 \quad (8)$$
$\gamma$ and $\omega$ are penalty coefficients, T is the number of leaves in the tree, and $\vartheta$ is the leaf weight.
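To make the additive prediction and the regularized objective concrete, the sketch below evaluates them for two hand-written decision stumps. It assumes the usual XGBoost form of the penalty (gamma times the number of leaves plus half of omega times the sum of squared leaf weights) and a squared-error loss. The stumps, the feature layout, and all names are hypothetical.

```python
def predict(trees, x):
    """Additive prediction: the sum of the outputs of all K trees."""
    return sum(tree(x) for tree in trees)

def regularization(leaf_weights, gamma, omega):
    """Penalty of one tree: gamma * T + 0.5 * omega * sum of squared leaf weights."""
    return gamma * len(leaf_weights) + 0.5 * omega * sum(w * w for w in leaf_weights)

def objective(samples, trees, leaf_weights_per_tree, gamma, omega):
    """Training loss (squared error, an assumption) plus the penalties of all trees."""
    loss = sum((y - predict(trees, x)) ** 2 for x, y in samples)
    penalty = sum(regularization(lw, gamma, omega) for lw in leaf_weights_per_tree)
    return loss + penalty

# Two hypothetical stumps over features x = (hour_of_day, is_workday).
def tree_hour(x):
    return 10.0 if x[0] < 12 else 30.0

def tree_workday(x):
    return -2.0 if x[1] == 0 else 2.0

trees = [tree_hour, tree_workday]
leaf_weights_per_tree = [[10.0, 30.0], [-2.0, 2.0]]
samples = [((8, 1), 12.0), ((18, 0), 28.0)]  # (features, true demand)
value = objective(samples, trees, leaf_weights_per_tree, gamma=1.0, omega=0.1)
```

Here both samples are predicted exactly, so `value` consists only of the penalty; larger trees or larger leaf weights would raise it even when the fit is perfect, which is how the term discourages overfitting.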

Evaluation Criteria
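Three accuracy measures are applied to evaluate the performance of online taxi-hailing prediction: root-mean-square error (RMSE), mean absolute percentage error (MAPE), and goodness of fit (R²), which are calculated as Equations (9)-(11).
$$RMSE = \sqrt{\frac{1}{T} \sum_{n=1}^{T} \left(C_n - \hat{C}_n\right)^2} \quad (9)$$
$$MAPE = \frac{1}{T} \sum_{n=1}^{T} \left| \frac{C_n - \hat{C}_n}{C_n} \right| \quad (10)$$
$$R^2 = 1 - \frac{\sum_{n=1}^{T} \left(C_n - \hat{C}_n\right)^2}{\sum_{n=1}^{T} \left(C_n - \bar{C}\right)^2} \quad (11)$$
$C_n$, $\hat{C}_n$, and $\bar{C}$ are the true, the predicted, and the mean value, respectively, and T is the number of samples.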
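The three measures can be computed as in the following minimal sketch, which uses their standard definitions with MAPE expressed as a fraction rather than a percentage (consistent with reported values such as 0.190). The function names and the sample values are hypothetical.

```python
import math

def rmse(true, pred):
    """Root-mean-square error."""
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(true, pred)) / len(true))

def mape(true, pred):
    """Mean absolute percentage error as a fraction; true values must be non-zero."""
    return sum(abs((c - p) / c) for c, p in zip(true, pred)) / len(true)

def r_squared(true, pred):
    """Goodness of fit: 1 - residual sum of squares / total sum of squares."""
    mean = sum(true) / len(true)
    ss_res = sum((c - p) ** 2 for c, p in zip(true, pred))
    ss_tot = sum((c - mean) ** 2 for c in true)
    return 1 - ss_res / ss_tot

# Hypothetical true and predicted demand values, for illustration only.
true_demand = [100, 200, 300]
predicted_demand = [110, 190, 300]
scores = (
    rmse(true_demand, predicted_demand),
    mape(true_demand, predicted_demand),
    r_squared(true_demand, predicted_demand),
)
```

Lower RMSE and MAPE and a higher R² indicate a better fit. MAPE is undefined for time slices with zero true demand, so such slices would need separate handling.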

Feature Selection
Before we predict the online taxi-hailing demand, we should select a reasonable set of forecasting features. Therefore, we use Python to calculate the correlations among prediction indicators, and we eliminate factors with strong collinearity and factors with low correlation with the dependent variable. Correlations among the factors are shown in Table 4. In Table 4, "OT" and "TX" indicate online taxi-hailing demand and taxi demand, respectively, "DW" is the day of the week, "HD" represents the hour of the day, and "WON" indicates whether the day is a workday. Other features are as in Table 3. As shown in Table 4, the correlations among AQ, AQI, PM2.5, PM10, and CO are greater than 0.8. Therefore, we remove AQI, PM2.5, PM10, and CO from the predictive factors. Next, we eliminate the features whose correlation with the OT factor is less than 0.2. The predictive indicators of online taxi-hailing demand are shown in Table 5. They are divided into "T", "E" and "TX": "T" indicates the time information, "E" represents the environmental factors, and "TX" contains information about taxi demand.
Then, all data are processed with the OneHotEncoder class in the sklearn.preprocessing module of scikit-learn. An instance of the DW indicator is shown in Figure 2. After the encoding of the indicators in Table 5, the dimension of the dataset is expanded to 58. Additionally, the first 23 days of November 2016 are taken as the training set, and the remaining seven days as the testing set.
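The encoding and the chronological split can be illustrated with a small standard-library sketch. It is a stand-in for, not a reproduction of, the scikit-learn encoder used in the study; the record layout (one record per day and hour), the numbering of weekdays from 1 to 7, and the function names are assumptions.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

def split_by_day(records, last_training_day=23):
    """Chronological split: days 1..last_training_day train, the rest test."""
    train = [r for r in records if r["day"] <= last_training_day]
    test = [r for r in records if r["day"] > last_training_day]
    return train, test

DAYS_OF_WEEK = [1, 2, 3, 4, 5, 6, 7]  # assumed coding of the DW indicator
encoded_dw = one_hot(3, DAYS_OF_WEEK)

# Hypothetical hourly records for the 30 days of November 2016.
records = [{"day": d, "hour": h} for d in range(1, 31) for h in range(24)]
train, test = split_by_day(records)
```

Splitting by date rather than at random keeps every test sample later in time than every training sample, which matches how a forecasting model would be used.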

Data Preprocessing
In this study, we choose the Bell Tower area as the research object, following the study of Liu et al. [32], because the Bell Tower area has the highest trip demand. The Bell Tower area is a commercial area, and its traffic demand exhibits a strong tidal phenomenon. The Bell Tower area is shown in Figure 3.
Then, we cut the taxi data and online taxi-hailing data into time slices. The trip demand for taxi and online taxi-hailing is shown in Figure 4. We find that the taxi demand and online taxi-hailing demand are regular, and that taxi demand decreases while online taxi-hailing demand increases in peak hours.
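Cutting the data into time slices amounts to counting pick-ups per interval. The sketch below assumes that pick-up timestamps have already been extracted from the GPS records and that one-hour slices are used, in line with the hourly environmental data. Both points, as well as the function name, are assumptions rather than details stated in the text.

```python
from collections import Counter
from datetime import datetime

def demand_per_slice(pickup_times, slice_minutes=60):
    """Count pick-ups per time slice, keyed by the start time of each slice."""
    counts = Counter()
    for t in pickup_times:
        minutes = t.hour * 60 + t.minute
        slice_start = minutes - minutes % slice_minutes
        key = t.replace(hour=slice_start // 60, minute=slice_start % 60,
                        second=0, microsecond=0)
        counts[key] += 1
    return dict(sorted(counts.items()))

# Hypothetical pick-up timestamps inside the study period.
pickups = [
    datetime(2016, 11, 1, 8, 5),
    datetime(2016, 11, 1, 8, 59, 30),
    datetime(2016, 11, 1, 9, 0),
]
hourly_demand = demand_per_slice(pickups)
```

The same function can be applied separately to the taxi and the online taxi-hailing pick-ups to obtain the two demand series compared in the text.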

Online Taxi-Hailing Demand Forecasting
Then, we forecast the online taxi-hailing demand in the Bell Tower area based on BPNN and XGB, and we test the prediction effects of different indicators. In the experiment, we add time factors, environmental factors, and taxi demand factors into the models based on BPNN and XGB. Models with different impacting factors are shown in Table 6. Next, we use the grid search algorithm to adjust the hyperparameters of the models; the resulting hyperparameters are listed in Table 7. The results of models with different impacting factors are shown in Figures 5 and 6. The factors of "T", "E" and "TX" are shown in Table 5.
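The grid search used to tune the hyperparameters can be sketched as an exhaustive loop over all parameter combinations. The parameter names, the candidate values, and the toy error function below are hypothetical; in practice the error function would train a BPNN or XGB model with the given parameters and return its validation error.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the lowest-error one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_error(params):
    """Stand-in for 'train a model and return its validation error'."""
    return ((params["hidden_neurons"] - 32) ** 2
            + 100 * abs(params["learning_rate"] - 0.01))

grid = {"hidden_neurons": [16, 32, 64], "learning_rate": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(grid, toy_validation_error)
```

The cost grows with the product of the candidate list lengths, so the grid is usually kept coarse.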

Then we use RMSE, MAPE and R² to test the prediction effect of the models (Table 8). Table 8 shows the RMSE, MAPE and R² of the six models on the test dataset in the Bell Tower area. Among the predictions based on BPNN, the model "BPNN + T + E + TX" performs best in predicting online taxi-hailing demand. Likewise, among the three predictions based on XGB, the model "XGB + T + E + TX" performs best.
Next, we analyze the contributions of the different sources of information. From Table 8, we find that including information about taxi demand ("TX") enhances the prediction effects of both BPNN and XGB. In the BPNN models, including information "E" leads to a MAPE reduction from 0.224 to 0.190, decreases RMSE from 28.576 to 23.921, and increases R² from 0.819 to 0.845. Likewise, including information "TX" leads to a MAPE reduction from 0.190 to 0.132, and it increases R² from 0.845 to 0.866. Meanwhile, in the XGB models, including information "E" leads to a MAPE reduction from 0.333 to 0.197, reduces RMSE from 26.296 to 21.206, and increases R² from 0.833 to 0.857. Including information "TX" leads to a MAPE reduction from 0.197 to 0.139, and it increases R² from 0.857 to 0.865. Additionally, the performance of the model "BPNN + T + E + TX" is the best among the six models in Table 8.
To evaluate the prediction performance of BPNN and XGB in different hours, we report the MAPE and RMSE of the six models by hour. Figure 7a shows that the model "BPNN + T + E + TX" obtains the lowest MAPE among the three BPNN predictions except at 6 a.m., 8 a.m., and 9 p.m. Figure 7b shows that the performance of the model "XGB + T + E + TX" is the best except at 11 a.m. and 5 p.m. From Figure 8, we see that the model "BPNN + T + E + TX" obtains the lowest RMSE among the three BPNN predictions except at 4, 7, and 9 p.m., and that the model "XGB + T + E + TX" performs best except at 11 a.m., 12 a.m., 4 p.m. and 5 p.m.

Real-Time Online Taxi-Hailing Demand
When forecasting online taxi-hailing demand with the models in Table 6, we ignored that the future taxi demand is unavailable at prediction time. To realize real-time online taxi-hailing demand prediction, we should predict the taxi demand before forecasting the online taxi-hailing demand, using the models "BPNN + T + E" and "XGB + T + E". The results of taxi demand prediction are shown in Figure 9 and Table 9.
Table 9 shows that the model "BPNN + T + E" performs better than the model "XGB + T + E" in forecasting taxi demand. Based on the predicted taxi demand ("PTX"), we forecast the online taxi-hailing demand with the model "BPNN + T + E + PTX", as shown in Figure 10. From Table 10, we find that including "PTX" leads to a MAPE reduction from 0.190 to 0.183 and an RMSE reduction from 23.921 to 21.050, and it increases R² from 0.845 to 0.853. However, because "PTX" is the projected rather than the observed taxi demand, the model "BPNN + T + E + TX" performs better than the model "BPNN + T + E + PTX". Furthermore, Figure 11 indicates that the model "BPNN + T + E + PTX" performs better than the model "BPNN + T + E" for most hours.
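The two-stage procedure described above (first project the taxi demand, then feed the projection into the online taxi-hailing model) can be sketched as follows. The two models are represented by simple hypothetical functions standing in for trained BPNN models, and the feature layout is an assumption.

```python
def predict_online_demand(time_env_features, taxi_model, online_model):
    """Two-stage forecast: project taxi demand (PTX), then use it as a feature."""
    ptx = taxi_model(time_env_features)            # stage 1: uses "T" and "E" only
    features_with_ptx = list(time_env_features) + [ptx]
    return online_model(features_with_ptx), ptx    # stage 2: "T", "E" and "PTX"

# Hypothetical stand-ins for the trained models (not fitted to any data).
def taxi_model(features):
    return 50 + 2 * features[0]

def online_model(features):
    return 20 + 3 * features[0] + 0.5 * features[-1]

# features = [hour_of_day, is_workday], an assumed layout.
online_demand, projected_taxi_demand = predict_online_demand(
    [10, 1], taxi_model, online_model
)
```

Because the second stage sees a projection instead of the observed taxi demand, any error of the first stage carries over, which is consistent with "BPNN + T + E + PTX" performing slightly worse than "BPNN + T + E + TX".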
Table 10. Prediction effects of model "BPNN + T + E", "BPNN + T + E + PTX" and "BPNN + T + E + TX".
Furthermore, more experiments on traffic demand prediction can be considered. For instance, a multi-mode traffic demand predictor could be proposed to improve the prediction accuracy by considering the interaction among multiple modes of transportation. Such a predictor could also take the interaction among different regions into account.