A Two-Phase Approach for Predicting Highway Passenger Volume

Abstract: With the continuous process of urbanization, regional integration has become an inevitable trend of future social development. Accurate prediction of passenger volume is an essential prerequisite for understanding the extent of regional integration, and it is one of the most fundamental elements for the enhancement of intercity transportation systems. This study proposes a two-phase approach to predict highway passenger volume. The dataset comprises highway passenger volume and impact factors of urban attributes. In Phase I, correlation analysis is conducted to remove highly correlated impact factors, and a random forest algorithm is employed to extract significant impact factors based on their degree of impact on highway passenger volume. In Phase II, a deep feedforward neural network is developed to predict highway passenger volume, which proved to be more accurate than both the support vector machine and multiple regression methods. The findings can provide useful information for guiding highway planning and optimizing the allocation of transportation resources.


Introduction
Recently, with the continuous process of urbanization, regional integration has become an inevitable trend of future social development in many developing countries [1,2]. In this situation, establishing a convenient and efficient intercity transportation system is a prerequisite for supporting regional integration, in which accurate prediction of passenger volume is one of the most fundamental elements required for the enhancement of intercity transportation systems [3][4][5][6].
The primary concern of passenger volume prediction is to extract relevant impact factors and build appropriate models. Firstly, multiple impact factors related to urban attributes, such as gross domestic product (GDP) and population, determine the absolute value and spatial distribution of passenger volume [7,8]. Consequently, extracting significant impact factors and further analyzing their relationship with passenger volume is recognized as a prerequisite for accurately predicting passenger volume. Secondly, prediction models have attracted wide attention, and the performance of different models has been evaluated in past research. Some typical models, including multiple logit models, machine learning models, and deep learning models, have been developed based on historical passenger volume [9,10]. Nevertheless, the prediction accuracy of the existing models was largely affected by the size of the historical passenger volume dataset [11]. Hence, models that rely on historical data cannot predict accurately when sufficient data are lacking, which is quite common for intercity transportation.
There are two key steps in the prediction of intercity passenger volume: (1) extracting the significant impact factors, (2) developing a deep learning model to achieve the prediction. Thus, it is practical to develop a two-phase approach to predicting intercity passenger volume based on impact factors reflecting urban attributes and deep learning models. As the highway is always an important intercity mode of transport with a high mode share, this study took the highway as the research object. Phase I conducted a correlation analysis to remove the highly correlated impact factors and developed a random forest (RF) algorithm to extract the significant impact factors of highway passenger volume. Then, Phase II developed a deep feedforward neural network (DFNN) to predict highway passenger volume. To overcome the existing limitations on predicting intercity passenger volume, the primary contributions of this study are as follows: (1) A total of 69 impact factors of urban attributes were collected from 280 administrative districts in China, which provides a macroscopic dataset for the prediction of highway passenger volume and overcomes the limitations of traditional travel surveys and questionnaires that only focus on a single city or single transportation corridor; (2) Multiple urban attributes, including urban economy, population, industry, income and consumption, and resource and environment, were modeled together. Furthermore, a total of 30 significant impact factors of highway passenger volume were extracted by the RF algorithm, which improves the traditional process based on subjective experience and avoids the omission of significant factors; (3) A deep learning method, DFNN, was developed to predict highway passenger volume, which proved to be more accurate than the support vector machine (SVM) and multiple regression methods and can provide more reliable information for optimizing traffic structure and reducing waste of traffic resources.
The remainder of this study is organized as follows. Section 2 gives an overview of the related literature. In Section 3, the data source is introduced, and the impact factors of urban attributes are collected and presented. Section 4 presents the underlying principles of the RF and DFNN algorithms. Section 5 presents the process of extracting the significant impact factors. In Section 6, the DFNN is developed to predict highway passenger volume, which is further compared with two benchmark methods. Finally, Section 7 draws conclusions and gives an outlook on future research.

Literature Review
This section summarizes the existing research on the above two phases: (1) extracting the significant impact factors of intercity passenger volume, (2) developing models to achieve an accurate prediction. Furthermore, the limitations of existing research are itemized at the end.
The first phase is to extract the significant impact factors. Multiple impact factors related to urban attributes, including urban economic level, urban industrial structure, population, etc., were widely studied to understand their relationship with intercity passenger volume. Firstly, the urban economic level proved to be one of the necessary impact factors of intercity passenger volume [12][13][14]. Traffic demand for business and tourism in intercity transportation increases with the development of the urban economy. The impact factors reflecting the urban economic level were found to be per-capita gross domestic product (GDP), per-capita income, industrial structure, etc., and it was verified that they had a strong correlation with intercity passenger volume [15,16]. Moreover, both population structure and population size affect the intercity passenger volume significantly. Limtanakool et al. [17] took population density and land use as variables and found that a higher population density and mixed degree of land use have a positive impact on the passenger volume of public modes in medium- and long-distance trips. A similar conclusion was also reached by related research [18]. Although the impact factors related to economic level and population have been widely studied in the existing research, those related to the quality of residents' lives, resources, and the environment were rarely studied because they are hard to quantify with one or several indicators and the corresponding dataset is difficult to obtain [19][20][21]. This problem indicates that the related research on extracting significant impact factors of intercity passenger volume is incomplete, which leads to inaccurate prediction of intercity passenger volume, especially for some tourism-driven cities and resource-driven cities.
The second phase is to develop a model to achieve an accurate prediction of intercity passenger volume. In the existing studies, multiple logit models, such as the multinomial logit model [22,23], Box-Cox logit model [24], and nested logit model [25], were developed to study the mode choice of intercity trips and deduce the intercity passenger volume of various modes by calculating the intercity travel rate of surveyed samples [26,27]. Moreover, intercity passenger volume was predicted by introducing the impact factors. Harker et al. [28] proposed a network equilibrium model that considers market price and economic mechanisms to predict intercity freight volume. Li et al. [29] predicted the passenger volume of an intercity railway with multiple indicators of passenger demand, regional economy, and regional traffic infrastructure, with an average prediction error of 3.37%. Another practical approach to predicting intercity passenger volume is based on historical passenger volume. Xie et al. [30] analyzed the spatiotemporal characteristics of intercity passenger volume and predicted intercity passenger volume during holidays, with a prediction error of 6.43%. Recently, deep learning and machine learning algorithms, represented by various neural networks, have performed remarkably well at predicting intercity passenger volume by using cellular signaling data and location-based data [4,[22][23][24][25][26][27][28][29][30][31][32]. Numerous studies have shown that prediction accuracy can be significantly improved by deep learning algorithms [33].
It is noted that the difficulties in obtaining the dataset of intercity passenger volume have been widely emphasized in past studies, especially for some intercity passenger modes of transportation that have additional requirements for an urban population, geographical location, or urban scale, such as airways, railways, and waterways. This means that the prediction of intercity passenger volume can be only conducted in a few cities [34]. In contrast, the highway has better accessibility and connects to all kinds of cities, expanding the study scope of predicting intercity passenger volume [35]. As previously stated, intercity passenger volume is largely determined by impact factors. Thus, the process of extracting significant impact factors at first, and then analyzing the interaction between intercity passenger volume and impact factors with deep learning algorithms, is practical for predicting intercity passenger volume but has rarely been studied in the existing research.
From the above analysis, the relationship between intercity passenger volume and urban attributes has been widely studied, and some typical models have been developed to predict passenger volume. Nevertheless, some limitations still exist in previous research and need further improvement, which are listed as follows: (1) Due to the restrictions of the research data, most existing research predicted intercity passenger volume for a single city or transportation corridor. As a result, the current achievements are difficult to apply to intercity transportation between all kinds of cities. (2) Existing research only focuses on common urban attributes such as the population or the economy. However, more urban attributes related to the quality of residents' lives, resources, and environment were neglected for lack of available data and quantitative indicators, leading to inaccurate prediction of intercity passenger volume, especially in some tourism-driven cities and resource-driven cities. Moreover, the selection process of significant attributes also received less attention. (3) Microscopic datasets collected from traffic surveys have been widely used for studying the choice of transportation mode in intercity trips but are not practical for predicting intercity passenger volume. In contrast, macroscopic datasets of urban attributes provide a novel approach to predicting intercity passenger volume, but have rarely been used in the existing literature.

Data Source
In this study, the dataset, including highway passenger volume and impact factors of urban attributes, was obtained from China's urban statistical yearbook. In China, the urban statistical yearbook is regularly published online to evaluate the social and economic levels. The statistical yearbook covers multiple aspects of urban attributes, including society, economy, etc. It can be downloaded for academic research, providing a novel macroscopic dataset for the prediction of highway passenger volume.
Considering the possibly complex relationships among impact factors of urban attributes, it is necessary to select appropriate impact factors for the convenience of data processing. The selection principles in this study are summarized as follows: (1) The selected impact factors reflect the urban attributes well and have a significant impact on intercity passenger volume. (2) The selected impact factors are quantifiable and comparable. (3) The selected impact factors are provided by the urban statistical yearbook and are easily accessible. It is noteworthy that some non-quantifiable factors can be made comparable by converting them into different levels. Yet in this study, most non-quantifiable factors have a high correlation with the existing quantifiable factors. Furthermore, subjective judgment and personal preference are often involved in dividing non-quantifiable factors into levels, which inevitably brings errors into the process. Accordingly, this study only focuses on the prediction of highway passenger volume with the quantifiable impact factors.
Based on the above principles, a total of 69 impact factors of urban attributes were selected from China's urban statistical yearbook. To facilitate data processing, the selected impact factors of urban attributes were divided into five categories, namely, urban economic level, urban population size and structure, per-capita income and consumption, resource and environment, and urban industrial structure. The selected impact factors of urban attributes and their information are summarized in Table A1 in Appendix A.
As the data in the statistical yearbook are aggregated at the level of the whole district or city, the authors took the administrative district as the basic unit of data collection. As a result, 3444 samples, including the selected 69 impact factors and highway passenger volume, were collected from 280 administrative districts. The records span 2003 to 2014, covering 12 years, because a unified statistical standard applied during this period and the statistical data changed smoothly without a sharp increase or decrease. Highway passenger volume was set as the sole dependent variable, and the impact factors were set as the candidate independent variables for predicting highway passenger volume.

Methodology
The flow diagram of the proposed two-phase approach and associated designed framework is shown in Figure 1. Firstly, the raw dataset, including highway passenger volume and impact factors, was collected. Then, the two-phase approach was proposed. Phase I extracted the significant impact factors with the RF algorithm and Phase II predicted highway passenger volume with the DFNN. Finally, the typical machine learning algorithm, support vector machine (SVM), was also developed for predicting highway passenger volume and compared with the DFNN, because it has a better ability to solve machine learning problems with a small sample size. Moreover, the traditional multiple regression, which is widely used for discerning the relationship between dependent variables and multiple independent variables, served as the benchmark for the prediction of highway passenger volume. All predicted models were evaluated by calculating errors, including mean absolute error (MAE) and root mean squared error (RMSE).
The fundamentals of the two primary methods used in this study are briefly discussed as follows, including the RF algorithm and the DFNN. Moreover, the evaluating indicators, MAE and RMSE, are introduced as well.

Random Forest Algorithm
In this study, the RF algorithm was used in Phase I to extract significant impact factors. The RF algorithm is an ensemble model built from multiple randomly constructed decision trees. Compared with other tree-based models, it is more robust to noise and maintains accuracy well even if some features are missing [36,37]. Moreover, existing research has shown that the RF algorithm can efficiently analyze the complex interaction among features and pick out the significant features. As a result, it is widely used for removing variables with a high correlation or low importance degree [38].
For any impact factor in Table 1, its importance degree can be calculated with the RF algorithm. After that, the selection of significant impact factors follows two processes: (1) Remove the impact factors that are highly correlated with others. (2) Determine the removed proportion and remove impact factors with a low importance degree.
The above processes of the RF algorithm, including calculating the importance degree and selecting significant impact factors, were conducted repeatedly until the number of selected significant factors was less than the set value. Finally, the selected impact factors were set as the independent variables for predicting highway passenger volume.
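The repeated removal of low-importance factors can be sketched as a simple loop. The following is a minimal illustration, not the authors' implementation: the function name `prune_low_importance`, the callable `importance_fn` (standing in for the RF importance calculation), the rounding of the number of dropped factors, and the toy scores are all assumptions introduced here.

```python
from typing import Callable, Dict, List


def prune_low_importance(
    factors: List[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    removed_proportion: float,
    set_value: int,
) -> List[str]:
    """Drop the lowest-ranked factors round by round.

    importance_fn is a placeholder for the RF importance calculation: it
    receives the factors still kept and returns one score per factor.
    The loop stops once fewer than set_value factors remain.
    """
    kept = list(factors)
    while len(kept) >= set_value:
        scores = importance_fn(kept)  # recomputed on every round
        ranked = sorted(kept, key=lambda f: scores[f])  # least important first
        # Assumption: the number dropped is rounded to the nearest integer,
        # with at least one factor removed per round.
        n_drop = max(1, round(removed_proportion * len(kept)))
        dropped = set(ranked[:n_drop])
        kept = [f for f in kept if f not in dropped]
    return kept


# Toy usage with hypothetical factor names and fixed, made-up scores.
toy_scores = {f"f{i}": i / 45 for i in range(10)}
remaining = prune_low_importance(
    factors=list(toy_scores),
    importance_fn=lambda kept: {f: toy_scores[f] for f in kept},
    removed_proportion=0.10,
    set_value=8,
)
```

Passing the importance calculation in as a callable keeps the loop independent of any particular RF library; in practice the callable would refit the forest on the remaining factors each round.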

Deep Feedforward Neural Network
Recently, neural networks have been widely used in the prediction of traffic volume and have promoted the development of deep learning [39][40][41]. The DFNN is a deep learning model composed of an input layer, several hidden layers, and an output layer [42][43][44]. The number of hidden layers defines the depth of the architecture [45]. The topological structure of the DFNN is shown in Figure 2. The theory of the DFNN is available in past research [44][45][46]. In this section, we introduce the activation function and objective function used in the DFNN algorithm.
Firstly, the rectified linear unit (ReLU) function was selected as the activation function of hidden layers and the output layer, considering that the ReLU function has a higher computing efficiency because it only activates a fraction of the neurons in each epoch. The ReLU function has been proven to be effective at avoiding gradient vanishing and overfitting, and serves as the preferred choice when developing a neural network to solve multiple problems except for the binary classification [46,47]. The ReLU function is shown in Equation (1).
$$\mathrm{ReLU}(x) = \max(0, x) \qquad (1)$$
Then, the objective function was built by minimizing the loss function of mean square error, as in Equation (2).
$$\min_{\theta} \; \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \lambda R(\theta) \qquad (2)$$
where $y_i$ represents the actual highway passenger volume and $\hat{y}_i$ represents the predicted highway passenger volume. $N$ is the number of predicted samples. $R(\cdot)$ is a regularization constraint, represented by the L2 norm of the parameter $\theta$, and $\lambda$ is the coefficient of the regularization constraint $R(\cdot)$. The objective is minimized by the gradient descent method.
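As a small numerical illustration of Equations (1) and (2), the sketch below evaluates the ReLU activation and the regularized mean-square-error objective for given predictions and parameters. It is an assumption-laden sketch rather than the authors' code: the function names are hypothetical, the parameters are flattened into a single list, and the penalty is taken as the plain (unsquared) L2 norm as described in the text.

```python
import math
from typing import Sequence


def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def l2_norm(theta: Sequence[float]) -> float:
    """L2 norm of a flattened parameter vector."""
    return math.sqrt(sum(t * t for t in theta))


def objective(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    theta: Sequence[float],
    lam: float,
) -> float:
    """Mean square error plus lam times the L2 norm of the parameters."""
    n = len(y_true)
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    return mse + lam * l2_norm(theta)


# Toy values chosen only so the arithmetic is easy to follow:
# MSE = (0 + 4) / 2 = 2.0, L2 norm of (3, 4) = 5.0, so 2.0 + 0.1 * 5.0 = 2.5.
value = objective([1.0, 2.0], [1.0, 4.0], theta=[3.0, 4.0], lam=0.1)
```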

Evaluating Indicators
To better evaluate the deviation of predicted results and assess the predicted method's performance, two indicators, MAE and RMSE, were calculated in this study. They are defined by Equations (3) and (4), respectively.
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \qquad (3)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \qquad (4)$$
where $y_i$ and $\hat{y}_i$ represent the actual highway passenger volume and the predicted highway passenger volume, respectively. $N$ is the number of predicted samples. Both MAE and RMSE represent the degree of deviation between the actual and predicted highway passenger volume. The smaller the values of MAE and RMSE, the more accurate the predicted result.
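The two indicators are straightforward to compute. The following minimal sketch uses only the Python standard library; the function names and the toy values are illustrative assumptions, not data from this study.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    )


# Toy example with absolute errors of 3 and 4.
toy_mae = mae([3.0, 5.0], [0.0, 1.0])    # (3 + 4) / 2
toy_rmse = rmse([3.0, 5.0], [0.0, 1.0])  # sqrt((9 + 16) / 2)
```

Because RMSE squares the errors before averaging, it penalizes large individual deviations more heavily than MAE, so RMSE is never smaller than MAE on the same predictions.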

Phase I: Extraction of Significant Factors
In Phase I, the RF algorithm was used for removing the highly correlated impact factors and extracting the significant impact factors. Specifically, impact factors with a high importance degree were retained and those with a low importance degree were removed. The RF algorithm has the advantage of showing the extraction of significant factors step by step and the extracted significant impact factors are interpretable, compared with some auto-encoder methods like neural networks. Finally, a dataset of significant impact factors was built for predicting highway passenger volume.
Firstly, the correlation coefficients between impact factors were calculated by correlation analysis, and fifteen groups of highly correlated impact factors were found based on the calculated correlation coefficients, which are shown in Table 1. Then, the importance degree of highly correlated impact factors in each group was calculated with the RF algorithm, as shown in Figure 3. The horizontal axis represents impact factors in each group, and the vertical axis represents the corresponding importance degree. Only the impact factor with the largest importance degree in each group was retained, and other impact factors were removed. Consequently, the importance degrees of the retained impact factors are shown in Figure 4.
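The grouping step described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the Pearson coefficient, the threshold of 0.9, the grouping of factors by connected components, and the factor names and values are all hypothetical, since the text does not state how "highly correlated" was defined.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_groups(data: Dict[str, List[float]], threshold: float) -> List[List[str]]:
    """Group factors linked by |r| >= threshold (connected components)."""
    names = list(data)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(data[a], data[b])) >= threshold:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return [g for g in groups.values() if len(g) > 1]


def keep_most_important(
    names: List[str], groups: List[List[str]], importance: Dict[str, float]
) -> List[str]:
    """Within each group keep only the factor with the largest importance."""
    dropped = set()
    for group in groups:
        best = max(group, key=lambda f: importance[f])
        dropped.update(f for f in group if f != best)
    return [n for n in names if n not in dropped]


# Hypothetical toy data: A and B move together, C does not.
toy = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.5],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
toy_groups = correlated_groups(toy, threshold=0.9)
toy_kept = keep_most_important(list(toy), toy_groups, {"A": 0.2, "B": 0.5, "C": 0.3})
```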
In this study, the removed proportion was set at 10%. Therefore, impact factors with importance degree rankings in the bottom 10% were removed. According to Figure 4a, the removed impact factors included RP, CPR, VISR, and DNH, and the remaining 37 impact factors were retained for the subsequent data processing.
Similarly, the importance degree of the impact factors was recalculated and sorted, and the impact factors whose importance degree ranked in the bottom 10% were removed, until the importance degree of the remaining impact factors reached 0.01. This process was repeated twice: PCGRP, IRE, LA, and PFE were removed in the first repetition, and DPD, PTPT, and CLPGR were removed in the second, as seen in Figure 4b,c. Finally, a total of 30 impact factors were retained, and are shown in Table 2. The category of resource and environment had more retained factors than any other, indicating that this category has a significant impact on highway passenger volume. Moreover, the importance degrees of HD, GDP, WCS, NOB, RT, HEC, TP, and TI rank in the top 25%, meaning that these eight factors significantly impact highway passenger volume. Appl. Sci. 2021, 11
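The factor counts reported above can be checked with a few lines of arithmetic. The sketch below assumes a starting count of 41 factors (inferred here from the 37 retained plus the 4 removed in the first round) and assumes that the number removed in each round is 10% of the current count rounded to the nearest integer; under these assumptions the rounds remove 4, 4, and 3 factors, matching the lists of removed factors in the text.

```python
from typing import List


def pruning_schedule(n_start: int, proportion: float, rounds: int) -> List[int]:
    """Number of factors remaining after each round of bottom-share removal."""
    counts = [n_start]
    for _ in range(rounds):
        n_drop = round(proportion * counts[-1])  # assumed rounding rule
        counts.append(counts[-1] - n_drop)
    return counts


schedule = pruning_schedule(n_start=41, proportion=0.10, rounds=3)
print(schedule)  # [41, 37, 33, 30]
```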

Table 2. Retained significant impact factors by category.
Category | Impact factors
Urban economic level | GDP, RSC, RT, GIO
Urban population size and structure | TP, NSC, WPI, WSI, WTI, PD, PLPG
Per-capita income and consumption | AWW, DB, HD, WCS, HEC
Urban industrial structure | PI, SI, TI
Resource and environment | DLA, LC, NOB, APR, APGL, GCAP, NBH, NTM, CPL, VDWW, VSDE

Model Prediction
With the significant impact factors selected in Phase I as input variables, Phase II developed the DFNN to predict highway passenger volume. The primary concern in developing the DFNN is to determine the appropriate number of hidden layers and of neurons in each hidden layer. In this study, the grid search method was adopted, with the initial range for the number of hidden layers set from 1 to 10 and that for the number of neurons set from 1 to 140. Taking MAE as the evaluating index, the result of the grid search method is shown in Figure 5. The combination of hidden layers and neurons with the minimum MAE was selected. Finally, the number of hidden layers was set to 9, and the number of neurons in each hidden layer was set to 120 in the DFNN of this study. Moreover, the number of neurons in the input layer and the output layer was set to 30 and 1, respectively, because there are 30 independent variables and 1 dependent variable.
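The grid search over network depth and width can be sketched generically as below. This is a hedged illustration only: the `evaluate` callable stands in for training a DFNN and measuring its MAE, the step of 20 neurons is an assumption (the text gives only the range), and the toy error surface is synthetic, with its minimum placed arbitrarily at 4 layers and 60 neurons rather than at the values reported in the study.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(
    layer_options: Iterable[int],
    neuron_options: Iterable[int],
    evaluate: Callable[[int, int], float],
) -> Tuple[float, int, int]:
    """Return (best_score, layers, neurons) with the lowest evaluation score."""
    best = None
    for layers, neurons in product(layer_options, neuron_options):
        score = evaluate(layers, neurons)  # e.g. validation MAE of a trained model
        if best is None or score < best[0]:
            best = (score, layers, neurons)
    return best


def toy_error(layers: int, neurons: int) -> float:
    """Synthetic stand-in for a validation MAE; not a result from the paper."""
    return (layers - 4) ** 2 + ((neurons - 60) / 20) ** 2


best_score, best_layers, best_neurons = grid_search(
    layer_options=range(1, 11),        # 1 to 10 hidden layers, as in the text
    neuron_options=range(20, 141, 20), # assumed step of 20 neurons
    evaluate=toy_error,
)
```

An exhaustive grid is simple and reproducible, but its cost grows with the product of the option counts, since every combination requires a full training run.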
Additionally, multiple epochs are needed to improve the predicted accuracy of the DFNN. Consequently, we continuously increased the number of epochs and calculated the loss on the training set and the validation set. When the loss of four consecutive epochs is less than 0.0001, it is considered that the training process has reached convergence and can be stopped. The loss of the training process is shown in Figure 6. Finally, the number of epochs of the DFNN in this study was set to 12.
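The stopping rule can be expressed as a short check over the recorded losses. The sketch below follows the literal wording of the text (the loss itself stays below 0.0001 for four consecutive epochs); an alternative reading, in which the change in loss between epochs is compared with the threshold, would need a different test. The function name and the sample losses are hypothetical.

```python
from typing import Optional, Sequence


def convergence_epoch(
    losses: Sequence[float], threshold: float = 1e-4, patience: int = 4
) -> Optional[int]:
    """Return the 1-based epoch at which the loss has been below the
    threshold for `patience` consecutive epochs, or None if that never happens."""
    streak = 0
    for epoch, loss in enumerate(losses, start=1):
        streak = streak + 1 if loss < threshold else 0
        if streak == patience:
            return epoch
    return None


# Made-up loss curve: the streak is broken at epoch 5 and completes at epoch 9.
toy_losses = [0.5, 0.01, 0.00009, 0.00008, 0.0002, 0.00005, 0.00004, 0.00003, 0.00002]
stop_at = convergence_epoch(toy_losses)
```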
Afterward, the significant impact factors were input in the developed DFNN, and the highway passenger volume was predicted. Then, evaluating indicators were calculated, showing that the MAE and RMSE of predicted highway volume from the DFNN are 2066.31 persons per day and 4176.37 persons per day, respectively.

Model Evaluation
To further evaluate the performance of the DFNN, the traditional SVM and multiple regression were used for comparison. For the SVM, the RBF kernel function with a penalty coefficient of 1000 and a gamma coefficient of 0.001 was selected by the grid search method from the alternative sets of kernel function, penalty coefficient, and gamma coefficient shown in Table 3. The final predicted results are shown in Table 4. Both the MAE and RMSE of the DFNN are lower than those of the SVM and multiple regression. The DFNN reduces the MAE and RMSE by 8.49% and 2.20%, respectively, compared with the multiple regression, and by 2.90% and 1.15%, respectively, compared with the SVM. The result indicates that the DFNN is more accurate in predicting highway volume than the SVM and multiple regression.
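The percentage reductions quoted above are relative reductions in error with respect to each benchmark. A minimal sketch of that calculation is given below; the function name is hypothetical and the numbers in the usage line are made up for illustration, not taken from Table 4.

```python
def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Made-up example: an error of 150 against a baseline of 200 is a 25% reduction.
example = relative_reduction(baseline_error=200.0, model_error=150.0)
```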

Conclusions
This study addresses several limitations of existing research on predicting highway passenger volume. The main work and results of this study are as follows: (1) A two-phase approach, in which Phase I extracts the significant impact factors and Phase II develops a deep learning model to achieve the prediction, was proposed to predict the highway passenger volume with the dataset of multiple urban attributes; (2) Phase I extracted a dataset with 30 significant factors reflecting urban economic level, urban population size and structure, per-capita income and consumption, urban industrial structure, and resource and environment with the RF algorithm and showed that they have a significant impact on highway passenger volume. This study contributes a novel approach for predicting highway passenger volume, but limitations still exist and are worth further study. More recent deep learning algorithms are expected to further improve the prediction accuracy of highway passenger volume as well as increase interpretability. As the statistical yearbook only publishes annual statistics, it is difficult to make a detailed analysis of highway passenger volume by quarter or month. Moreover, changes in statistical standards in the yearbook may cause abrupt shifts in the data, which affect the prediction accuracy. Therefore, new datasets could be introduced in future research for more accurate analysis.
Author Contributions: Conceptualization, Y.X. and W.Y.; methodology, Y.X. and J.C.; software, R.W. and J.C.; data acquisition, W.Y. and B.L.; data analysis, Y.X.; writing-original draft preparation, Y.X. and W.Y.; writing-review and editing, J.C., B.W. and Z.L. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data presented in this study are available on request from the first author.

Acknowledgments:
The authors would like to thank the students from the school of computer science and engineering of Southeast University for their assistance with the data collection.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A
Table A1. The selected impact factors of urban attributes.

Category Impact Factors Symbol Units
Urban Economic Level