A Freeway Travel Time Prediction Method Based on an XGBoost Model

Abstract: Travel time prediction plays a significant role in the traffic data analysis field as it helps in route planning and reducing traffic congestion. In this study, an XGBoost model is employed to predict freeway travel time using probe vehicle data. The effects of different parameters on model performance are investigated and discussed. The optimized model outputs are then compared with those of another well-known model (i.e., the gradient boosting model). The comparison results indicate that the XGBoost model has considerable advantages in terms of both prediction accuracy and efficiency. The developed model and analysis results can help decision makers plan, operate, and manage a more efficient highway system.


Introduction
Travel time prediction plays a significant role in traffic data analysis because it supports route planning and helps reduce traffic congestion. Traditionally, methods such as linear regression and time series models have been widely applied to predict travel times from historical travel time data. However, in terms of effectiveness, accuracy, and feasibility, these models may become outdated and replaceable. With the development of artificial intelligence technologies, various novel prediction methods have been developed in recent years. Machine learning is a data-driven approach that aims to increase the efficiency and accuracy of prediction. Recently, different machine learning approaches such as neural networks [1][2][3][4][5][6], ensemble learning [7][8][9][10][11][12], and support vector machines (SVM) [13] have been employed by researchers. Their results indicate that such approaches are adaptable and can outperform traditional models in travel time prediction. Therefore, a machine learning-based approach is selected for travel time prediction in this study. Table 1 summarizes the machine learning-based travel time prediction studies in chronological order, and a detailed description of each study reviewed is given below.
In recent years, ensemble learning-based methods have been increasingly used for traffic data analysis. The purpose of an ensemble learning algorithm is to achieve an improved result by combining the predictions of a group of individual base models. It has been shown that the combined model often generates more stable and accurate predictions in many applications [14,15].
Bagging and boosting are both ensemble techniques, in which a set of base models is combined to create a model that performs better than a single model. However, they use different re-sampling methods and can therefore perform differently and generate different outputs. Random Forest is a bagging-based method. Hamner et al. [7] applied a context-dependent Random Forest (RF) method to predict travel time based on GPS data of the cars on the road in a simulation framework. The root mean squared error (RMSE) of the RF prediction was less than 7.5%. Fan et al. [10] used the RF method to predict highway travel time based on data collected from highway electronic toll collection in Taiwan. The results can help highway drivers select optimal departure times to avoid traffic congestion and thus minimize travel time.
Boosting is another ensemble learning method. It improves prediction accuracy by developing multiple models in sequence, with each new model placing emphasis on the samples that are difficult to estimate. Zhang and Haghani [8] employed a gradient boosting regression tree method to analyze and predict freeway travel time. The authors used travel time data along freeway sections in Maryland and discussed the effects of different parameters on the proposed model and the correlations between input and output variables. The prediction results showed that the proposed model can provide considerable advantages in freeway travel time prediction. Li and Bai [9] employed a gradient boosting regression tree method to analyze and predict the travel time of freight vehicles, using travel time data and vehicle trajectory data from Ningbo, China. The prediction results showed that the proposed model is feasible in the real world. Gupta et al. [11] employed RF and gradient boosting models to predict taxi travel time in Porto, Portugal. Vehicle trajectory data were used as the database, and the gradient boosting model was found to provide better prediction results than the RF model.
In recent years, the XGBoost model has had a recognized impact in solving machine learning challenges in different application domains. It has gained popularity by winning many data science competitions (e.g., Kaggle competitions). XGBoost has also been employed in transportation-related studies. Alajali et al. [16] used the XGBoost model to predict intersection traffic volume. Dong et al. [17] employed the XGBoost model to predict short-term traffic flow based on data collected in Beijing, China. However, studies that apply the XGBoost model to freeway travel time prediction are rare. The XGBoost model therefore has the potential to be applied in freeway travel time prediction and is selected as the model of this study. This study employs an XGBoost model to predict freeway travel time using information such as time of day (TOD), day of the week (DOW), and weather. The temporal and spatial correlations between segments are also considered in the model. The relative importance of each feature in the model is investigated and quantified. The modeling results offer insights into the relationship between the features and the prediction results. The prediction results are also compared with the outputs of the gradient boosting model and indicate that the XGBoost model performs better in terms of both accuracy and efficiency.
The research findings can greatly help the decision makers plan, design, operate, and manage a more efficient highway system.

Raw Data Description
In this study, travel time data collected from the Regional Integrated Transportation Information System (RITIS) website were used to conduct the travel time prediction work. A series of major freeway segments was selected for the case study: Interstate 77 (I-77) Southbound (Figure 1) is one of the most heavily traveled Interstate highways in the Charlotte area and runs from north to south.
The selected section of I-77 Southbound starts from the intersection with US-21 (Exit 16) and ends at the interchange with Nations Ford Road (Exit 4) in the south part of the city. Twenty-six roadway segments were selected for this study, and the total length of the selected section is 15 miles.
In the probe data analytics suite of the RITIS website, the raw probe data can be downloaded for the desired section and in the desired format. The roadway section can be selected based on the Road states and countries, Traffic message channels (TMC), Directions, Zip codes, Road class, and Road name. Partial sections can be selected by choosing the begin and end intersections. The date range can be selected from 1 January 2008 to today. Any of the seven days of the week and times of day from 12:00 A.M. to 11:59 P.M. can also be selected. Travel time can be reported in either seconds or minutes. The averaging period can be set to five minutes, ten minutes, fifteen minutes, or one hour. In this study, a fifteen-minute interval is used. A sample of the raw travel time data used in this study is shown in Table 2 below.
In Table 2, the column labeled TMC code gives the identification number of each segment. The timestamp gives the time period of the record and can be used to derive further information, including TOD and DOW. The third column in this table is the travel time on the segment.
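As an illustration of how TOD and DOW can be derived from a record timestamp, the following minimal sketch maps a timestamp to a fifteen-minute time-of-day slot and a day-of-week index. The timestamp format string and the function name `time_features` are assumptions made for this example; the actual export format of the RITIS data may differ.

```python
from datetime import datetime


def time_features(timestamp: str, interval_minutes: int = 15):
    """Return (time-of-day slot, day-of-week index) for one record.

    The timestamp layout "%Y-%m-%d %H:%M:%S" is an assumption for this sketch.
    """
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # Slot 0 covers 00:00-00:14, slot 95 covers 23:45-23:59 for 15-min bins.
    tod_slot = (ts.hour * 60 + ts.minute) // interval_minutes
    # datetime.weekday(): 0 = Monday ... 6 = Sunday.
    dow = ts.weekday()
    return tod_slot, dow


example = time_features("2017-11-30 07:45:00")
print(example)
```

With a fifteen-minute averaging period, each day is divided into 96 slots, so the slot index can be used directly as the TOD feature.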

Ensemble Learning Algorithm
Ensemble learning-based algorithms consist of multiple base models (e.g., decision tree models), and each base model provides an alternative solution to the problem. The prediction results of these base models are combined by some rule (such as weighted or unweighted voting and averaging), and the final output is obtained from the combined model. Bagging and boosting are both ensemble techniques, in which a set of base models is combined to create a model that performs better than a single model. However, they use different re-sampling methods and can therefore perform differently and generate different outputs.
The idea of the boosting algorithm was first proposed by Kearns [21]. Boosting refers to a family of algorithms that convert weak learners into strong learners: several base models are combined to form a stronger model that can generalize [22]. Unlike the bagging method, in which each base model runs independently and the outputs are aggregated at the end without any preference, the boosting method improves the prediction by developing multiple models in sequence, with each new model placing emphasis on the samples that are difficult to estimate. There are many boosting algorithms, such as AdaBoost, gradient boosting, and XGBoost. Gradient boosting is a typical boosting approach that is widely used in machine learning. The word 'gradient' means that it uses a gradient descent algorithm to minimize the loss when adding new models [20]. The gradient boosting approach supports both classification and regression predictive modeling problems.
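The sequential nature of gradient boosting can be illustrated with a small, self-contained sketch. It is not the implementation used in this study: it assumes a single numeric feature, depth-one trees (stumps), and a squared-error loss, for which the negative gradient is simply the residual. The toy travel times and all function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a depth-one regression tree (one threshold) by least squares."""
    best = None
    # Dropping the largest value keeps the right-hand side non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        # Residuals are the negative gradient of 1/2 * (y - pred)^2.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Hypothetical data: time slot index vs. travel time in seconds.
x = list(range(8))
y = [60, 62, 61, 65, 90, 95, 93, 97]
model = gradient_boost(x, y)

baseline = sum(y) / len(y)
mse_base = sum((yi - baseline) ** 2 for yi in y) / len(y)
mse_model = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(round(mse_base, 2), round(mse_model, 2))
```

Each stump corrects only a fraction (the learning rate) of the remaining error, which is why a smaller learning rate generally needs more trees to reach the same training error.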
Based on previous studies, the gradient boosting model generally gives better results than Random Forest, although Random Forest has fewer parameters that need tuning and is less sensitive to them [23,24]. As a consequence, the gradient boosting model is harder to fit than Random Forest, and its stopping criteria should be chosen carefully to avoid overfitting the training data.

XGBoost Algorithm
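XGBoost is short for 'extreme gradient boosting', proposed by Chen and Guestrin [25]. In recent years, it has had a recognized impact in solving machine learning challenges in different application domains. XGBoost is much faster than other common machine learning methods since it can process large amounts of data efficiently in parallel. Therefore, the XGBoost model is selected and used to conduct the travel time prediction work. The XGBoost model is described as follows. The objective function Obj(Θ) of the XGBoost model is given below [25]:

$$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)$$

where L(Θ) = the training loss, which measures how well the model fits the training data, and Ω(Θ) = the regularization term, which measures the complexity of the model. The loss on the training data L can be expressed as:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$

In detail, the square loss for the regression problem can be expressed as:

$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$

In this study, $\hat{y}_i$ = the predicted travel time and $y_i$ = the actual travel time. The unit of travel time is seconds.

When a new tree $f_t$ is added to the model at step t, the objective function can be transformed to:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}$$

where $\Omega(f_t)$ = the complexity of tree $f_t$. To simplify the objective, the constant terms are removed from the function. XGBoost uses a second-order Taylor expansion to approximate the loss function before removing the constant terms [25]:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant}$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$, which is the first-order partial derivative of the loss function, and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$, which is the second-order partial derivative of the loss function.

After the removal of all the constants, the specific objective at step t becomes:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

In the XGBoost model, the complexity is defined as [22]:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

where T = the number of leaf nodes, γ = the penalty coefficient of the number of leaves, λ = the penalty coefficient of regularization, and $w_j$ = the score of leaf j.

After re-formulating the tree model, the objective function with the t-th tree can be written as [25]:

$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$

where $I_j = \{ i \mid q(x_i) = j \}$ is the instance set assigned to the j-th leaf. The objective function can be further compressed as:

$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. The best $w_j$ for the objective function is $w_j^* = -\frac{G_j}{H_j + \lambda}$. Therefore, the final objective function can be written as:

$$\mathrm{Obj} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

The smaller the score is, the better the structure is. XGBoost can also add branches for each leaf node. The loss reduction after the split can be expressed as [25]:

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where $\frac{G_L^2}{H_L + \lambda}$ is the score of the left node after the cut, $\frac{G_R^2}{H_R + \lambda}$ is the score of the right node after the cut, and $\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$ is the score of the combination without the cut. Finally, the best structure of the model, which minimizes the objective function, can be obtained by enumerating different tree structures.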
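To make the leaf score and the split criterion concrete, the sketch below evaluates the optimal leaf weight −G/(H+λ) and the loss reduction of one candidate split for the square loss, for which the first-order derivative is 2(ŷ − y) and the second-order derivative is 2. The travel times, the current prediction of 75 s, and the values of λ and γ are hypothetical and chosen only for illustration.

```python
def grad_hess_square_loss(y_true, y_pred):
    """Derivatives of l = (y - y_hat)^2 with respect to y_hat."""
    g = [2.0 * (p - y) for y, p in zip(y_true, y_pred)]
    h = [2.0] * len(y_true)
    return g, h


def leaf_weight(G, H, lam):
    """Optimal leaf score w* = -G / (H + lambda)."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of one node."""
    return G * G / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Loss reduction obtained by splitting one node into left and right."""
    return 0.5 * (leaf_score(G_L, H_L, lam)
                  + leaf_score(G_R, H_R, lam)
                  - leaf_score(G_L + G_R, H_L + H_R, lam)) - gamma


# Hypothetical node: four records, current prediction 75 s for each.
y_true = [60.0, 62.0, 90.0, 95.0]
y_pred = [75.0] * 4
g, h = grad_hess_square_loss(y_true, y_pred)

# Candidate split: first two records go left, last two go right.
G_L, H_L = sum(g[:2]), sum(h[:2])
G_R, H_R = sum(g[2:]), sum(h[2:])
lam, gamma = 1.0, 0.0

w_left = leaf_weight(G_L, H_L, lam)
w_right = leaf_weight(G_R, H_R, lam)
gain = split_gain(G_L, H_L, G_R, H_R, lam, gamma)
print(w_left, w_right, gain)
```

The left leaf receives a negative correction and the right leaf a positive one, and the positive gain indicates that this split lowers the regularized objective; a larger γ would make the same split less attractive.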

Model Validation

Feature Selection and Processing
The real-world travel time data provided by the RITIS website (described above) are used for this study. The data quality is sufficient, with a missing rate of less than 0.5% (4348 out of 906,048 records). Therefore, this study simply replaces each missing value with the mean of its closest surrounding values. The weather condition is also considered in this study. The weather data of the study area can be found on the www.wunderground.com website (accessed on 30 November 2017).
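The imputation step can be sketched as follows. The exact rule used in the study is not specified beyond "the mean of its closest surrounding values", so this example assumes that a missing record is replaced by the mean of the nearest valid record before it and the nearest valid record after it within the same segment's time series; the function name `fill_missing` is hypothetical.

```python
def fill_missing(series):
    """Replace None with the mean of the nearest valid values on each side.

    If only one side has a valid value, that value is used on its own.
    """
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        before = next((series[j] for j in range(i - 1, -1, -1)
                       if series[j] is not None), None)
        after = next((series[j] for j in range(i + 1, len(series))
                      if series[j] is not None), None)
        neighbours = [v for v in (before, after) if v is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Hypothetical travel times (s) for one segment with gaps.
raw = [60.0, None, 64.0, None, None, 70.0]
clean = fill_missing(raw)
print(clean)
```

Because the neighbours are always read from the original series, consecutive gaps are filled with the same value rather than with values that depend on earlier imputations.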
Based on previous studies [26,27], the features that influence the accuracy of travel time prediction include not only the basic features (such as time of day, day of the week, month, and weather) but also the spatial and temporal characteristics of the segments. Therefore, the travel times from several previous time steps and the travel times of adjacent segments are also selected and used in the model.
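A minimal sketch of how such temporal and spatial features could be assembled for one record is given below. It assumes that travel times are stored as one list per segment, that segments are ordered from upstream to downstream, and that a "change value" is the difference between two consecutive time steps; the feature names and the function `build_features` are hypothetical.

```python
def build_features(travel_time, segment, t):
    """Build lagged and neighbouring-segment features for (segment, t).

    travel_time[s][k] is the travel time (s) on segment s at time step k.
    Requires t >= 3 so that three previous steps are available.
    """
    tt = travel_time
    row = {
        "T_t-1": tt[segment][t - 1],
        "T_t-2": tt[segment][t - 2],
        "T_t-3": tt[segment][t - 3],
        # Assumed definition of the change value: consecutive difference.
        "dT_t-1": tt[segment][t - 1] - tt[segment][t - 2],
    }
    # Up to two upstream and two downstream neighbours, one step before.
    for offset, name in ((-2, "up2"), (-1, "up1"), (1, "down1"), (2, "down2")):
        neighbour = segment + offset
        if 0 <= neighbour < len(tt):
            row["T_" + name + "_t-1"] = tt[neighbour][t - 1]
    row["target_T_t"] = tt[segment][t]
    return row


# Hypothetical travel times for three consecutive segments, five time steps.
tt = [
    [50, 51, 52, 53, 54],
    [60, 62, 64, 70, 75],
    [40, 40, 41, 45, 50],
]
row = build_features(tt, segment=1, t=4)
print(row)
```

Segments at either end of the corridor have fewer neighbours, so in practice the missing neighbour columns would need a fill value or a separate treatment.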
For categorical variables, the most commonly used encoding method is one-hot encoding, which is applied here in Python. One-hot encoding converts a categorical variable into a form that machine learning algorithms can use more effectively for prediction. For example, the day-of-the-week category, which has seven values, is converted into seven dummy variables.
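The day-of-the-week example can be written out as a short sketch. The category labels and the helper `one_hot` are illustrative; the study does not state which Python routine was used for the encoding.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError("unknown category: " + str(value))
    return [1 if value == c else 0 for c in categories]


encoded = one_hot("Wed", DAYS)
print(encoded)
```

Each record therefore carries seven binary columns for the day of the week, exactly one of which is set.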
Table 3 below summarizes the basic information on the features used for this study.

Parameter Tuning Process
In the XGBoost model, there are many parameters that should be considered. In order to optimize the modeling result, it is necessary to explore the effect of different combinations of parameters on the model performance. Based on previous studies [8,17], the parameters that could be optimized include, but are not limited to: N_estimators (number of trees), learning rate (a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function), and Max_depth (maximum depth of the tree, defined as the longest path between the root node and a leaf node). Therefore, these parameters are optimized in this study.
Several optimization methods have been considered in previous studies, and the grid search method is the most widely used one. Therefore, with time efficiency in mind, the grid search method is selected as the optimization method. In this study, 80% of the traffic data is used as training data and 20% is used as testing data. The XGBoost model is fitted with various numbers of trees (N_estimators ranges from 1 to 500), maximum depths (Max_depth ranges from 5 to 10), and learning rates (ranging from 0.1 to 0.5). The number of stopping rounds is set to 50, which means that the iteration stops when there is no performance improvement for 50 rounds. The XGBoost package in Python is used in this study.
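The search space and the data split described here can be sketched with the Python standard library alone. Two details are assumptions, since they are not stated in the text: the learning rate is stepped in increments of 0.1, and the 80/20 split is taken at random rather than chronologically. The helper `train_test_split` below is a hypothetical stand-in, not a library function.

```python
import itertools
import random

MAX_TREES = 500                                # upper bound on N_estimators
MAX_DEPTHS = list(range(5, 11))                # 5, 6, ..., 10
LEARNING_RATES = [0.1, 0.2, 0.3, 0.4, 0.5]     # step of 0.1 is assumed
STOPPING_ROUNDS = 50

# Every (max_depth, learning_rate) pair is one grid point; the number of
# trees is explored within a single run, up to MAX_TREES.
grid = list(itertools.product(MAX_DEPTHS, LEARNING_RATES))


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))                     # placeholder record indices
train, test = train_test_split(records)
print(len(grid), len(train), len(test))
```

Under these assumptions the grid contains 30 parameter combinations, each of which is trained for at most 500 boosting rounds.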
Figure 2 below shows the effects of the selected parameters on the prediction results. Table 4 below presents the detailed prediction results, including the prediction results at each step, the computation time, and the optimized results. The mean absolute error (MAE) is used to evaluate the performance of the model. The equation of the MAE is provided below:

$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|$$

where m = the total number of records, $y_i$ = the actual travel time value in the test dataset of record i, and $\hat{y}_i$ = the predicted travel time value in the test dataset of record i.

Based on Figure 2, it can be concluded that the MAE value decreases as the number of trees increases, and that the slopes differ between learning rates. In general, the lower the learning rate is, the higher the initial MAE value (when the number of trees is 1). For example, when the learning rate equals 0.1, the initial MAE value is about 36.2. In comparison, Figure 2 shows that the MAE value is about 17.6 when the number of trees is 1 and the learning rate is 0.5.
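For reference, the MAE, i.e., the mean of the absolute differences between actual and predicted travel times, can be computed with a few lines of Python. The travel times below are hypothetical and only illustrate the calculation.

```python
def mean_absolute_error(actual, predicted):
    """MAE = (1/m) * sum(|y_i - y_hat_i|), in the unit of the inputs."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted travel times in seconds.
actual = [60.0, 70.0, 80.0]
predicted = [62.0, 69.0, 80.0]
mae = mean_absolute_error(actual, predicted)
print(mae)
```

Because travel time is measured in seconds in this study, the MAE values reported in Figure 2 and Table 4 are also in seconds.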
Figure 2 also shows that once the number of trees reaches 50, the MAE values become nearly the same. However, the data in Table 4 indicate that the results can still be improved slightly if the number of trees keeps increasing. Overfitting is a general problem of traditional ensemble learning methods. For example, in the gradient boosting model, the prediction error usually increases when the number of trees increases beyond the optimized point [8]. In the XGBoost model, the overfitting problem can be mitigated because the algorithm stops when there is no performance improvement after 50 iterations. Therefore, the value 'NA' in Table 4 means that the computation had already stopped before the number of trees reached those values.
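The stopping rule can be illustrated independently of any particular library. The sketch below takes a sequence of test errors, one per boosting round, and reports the best round and the number of rounds actually run for a given patience; the error values are hypothetical and the function `early_stop` is not part of the XGBoost package.

```python
def early_stop(errors, patience=50):
    """Return (best_round, rounds_run) for a sequence of per-round errors.

    Training stops once `patience` consecutive rounds bring no improvement
    over the best error seen so far.
    """
    best_round, best_error = 0, float("inf")
    for round_idx, err in enumerate(errors):
        if err < best_error:
            best_round, best_error = round_idx, err
        elif round_idx - best_round >= patience:
            return best_round, round_idx + 1
    return best_round, len(errors)


# Hypothetical MAE per round; a patience of 3 is used to keep it short.
errors = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.9]
result_short = early_stop(errors, patience=3)
result_long = early_stop(errors, patience=50)
print(result_short, result_long)
```

With a patience of 3 the run stops after six rounds and never reaches the later, lower error, which shows why the number of stopping rounds is itself a parameter that trades accuracy against computation time.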
It can be seen that the parameter max_depth does not influence the prediction results significantly, since the trends of the errors are nearly the same. However, the data in Table 4 show that as max_depth increases, the MAE decreases slightly (the optimized MAEs for max_depth from 5 to 10 are 2.02, 1.98, 1.95, 1.93, 1.91, and 1.90, respectively). The data in Table 5 show that as max_depth increases, the average computation time of the model also decreases considerably, which means that a larger value of max_depth not only increases the accuracy of the model slightly but also increases its efficiency. According to the experimental results, it can be concluded that a slower learning rate with a larger number of trees gives a higher accuracy level than a faster learning rate with a smaller number of trees. The number of trees needed to reach an optimized result is also lower for a model with a faster learning rate than for models with slower learning rates.
There is also a need to consider the tradeoff between prediction accuracy and computation time. When a large number of trees is fitted, model complexity increases and more computation time is required. Therefore, the selection of parameters such as max_depth and the number of stopping rounds is important in real-world applications.
In addition, the maximum depth of the tree also affects the optimized selection. When the learning rate and the number of trees are the same, a higher maximum depth of the tree leads to a lower error rate. A higher max_depth is also more efficient than a lower value, since the number of iterations needed to achieve optimized results is lower. In general, a higher max_depth value means a more complex tree model and requires fewer trees to be fitted with a given learning rate.

Prediction Results Analysis
In the machine learning field, the predictor variables, which are the features listed in Table 3, usually have significant impacts on the prediction results. Exploring the influence of individual features can help in understanding the data better. A higher relative importance indicates a stronger influence in predicting travel time.
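As a small illustration of what a relative importance value expresses, the sketch below converts raw per-feature importance scores into percentage shares and ranks them. It assumes that relative importance is the raw score normalized so that all shares sum to 100%; the raw scores are made up for this example and are not results of this study.

```python
def relative_importance(raw_scores):
    """Normalize raw scores to percentages and sort in descending order."""
    total = sum(raw_scores.values())
    shares = {name: 100.0 * score / total for name, score in raw_scores.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw importance scores (not values from this study).
raw = {"time_of_day": 3, "T_t-1": 6, "weather": 1}
ranking = relative_importance(raw)
print(ranking)
```

The rank of a feature is then simply its position in the sorted list.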
Table 6 presents the relative importance of each feature in the optimized XGBoost model. Each predictor variable has a different impact on the predicted travel time. Based on the importance rank of each feature, it can be found that the feature T_{t−1}, which is the travel time at time step t−1 (15 min before), contributes the most to the predicted travel time. This result is expected and consistent with a previous study [8], which demonstrates that the immediately preceding traffic condition influences the traffic condition in the future. Therefore, the feature T_{t−1} is the most important and is highly correlated with the prediction value.
The results in Table 6 show that time of day is the second-ranked feature, with a relative importance value of 34.85%, and this result is also expected. In general, travel time variability is highly correlated with the time of day. The travel time usually increases considerably during peak hours and becomes stable during non-peak hours.
The third-ranked feature is the segment ID, with a relative importance value of 12.65%. A potential reason for this ranking is that the segment ID identifies the segment and therefore carries a lot of implicit information, such as the geographic location of the segment. Usually, different segment locations lead to different travel time variability characteristics. Therefore, the segment ID is also a necessary and important feature in the model. Day of the week is the fourth-ranked feature in the model, with a relative importance value of 3.76%. This feature is also important since the travel time is highly correlated with the day of the week. Based on previous studies, traffic congestion on weekends is less frequent than on weekdays, and the travel time during peak hours on Friday is usually higher than on other weekdays [28,29]. This result is consistent with a previous study [8].
Weather is also considered in the model, with a relative importance value of 1.72%. Inclement weather conditions may have a drastic impact on travel time variability. Therefore, the weather information is also useful in travel time prediction, as adverse weather usually increases travel time. This finding is consistent with previous studies [30,31].
The travel time at time step t−1 (15 min before) is not the only feature that accounts for temporal correlation. Several other features are considered in the model, such as the travel times two and three time steps before (with relative importance values of 0.40% and 0.33%, respectively) and the travel time change values of the three previous time steps (with relative importance values of 0.24%, 0.47%, and 0.27%, respectively). These features were also used in previous studies that applied gradient boosting models to predict freeway travel time [8,29]. The travel time change features are considered in this study because they can indicate the travel time trends of the segments. However, the influences of these features are relatively small. This outcome is similar to that of a previous study [32].
To account for spatial impact, several features from one time step before are considered in the model, such as the travel times of the two upstream segments (with relative importance values of 0.29% and 0.40%, respectively) and the travel times of the two downstream segments (with relative importance values of 0.26% and 0.60%, respectively). With respect to the travel time change value, the relative importance values of the two upstream segments are both 0.28%, and the relative importance values of the two downstream segments are 0.36% and 0.69%, respectively. Based on these results, the relative importance values of the downstream segments are generally higher than those of the upstream segments. This can be explained by the spatial characteristics of the roadway: if a bottleneck occurs at a downstream segment, the upstream segment will be affected shortly afterwards.
In order to examine the accuracy and effectiveness of the XGBoost model, this study comprehensively evaluates the modeling results of the XGBoost model and compares them with those of the gradient boosting model. The prediction result of the gradient boosting model is also optimized using a grid search method. For clarity, the mean absolute percentage error (MAPE) is used to evaluate and compare the performance of the two models.
The equation of the MAPE is provided below:

$$\mathrm{MAPE} = \frac{100\%}{m} \sum_{i=1}^{m} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where m = the total number of records, $y_i$ = the actual travel time value in the test dataset of record i, and $\hat{y}_i$ = the predicted travel time value in the test dataset of record i.

Table 7 below presents the comparison between the prediction results of the optimized XGBoost model and the gradient boosting model. Based on the comparison, it can be concluded that the XGBoost model outperforms the gradient boosting model in terms of both accuracy and efficiency. The potential reasons are as follows. In general, the XGBoost model is a more regularized form of the gradient boosting model. XGBoost uses advanced regularization terms, which improve the generalization capability of the model. Therefore, the prediction results of the XGBoost model are more accurate than those of the gradient boosting model. At the same time, the computation time of the XGBoost model (25 min) is much shorter than that of the gradient boosting model (2 h). One important reason for this could be the parallel processing function. The gradient boosting model is extremely difficult to parallelize since it has sequential characteristics. In comparison, XGBoost allows the boosting work to be carried out using distributed processing engines.
Another key reason is that the XGBoost model implements an early stopping function, which means that training can be stopped when additional trees offer no improvement to the prediction results. This function not only helps prevent the overfitting problem but also improves the efficiency of the model significantly.
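The MAPE used in this comparison, i.e., the mean of the absolute errors expressed as a percentage of the actual travel times, can be computed as in the following sketch. The travel times are hypothetical; records with an actual travel time of zero are rejected because the ratio is undefined for them.

```python
def mean_absolute_percentage_error(actual, predicted):
    """MAPE in percent: (100/m) * sum(|y_i - y_hat_i| / |y_i|)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(y == 0 for y in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs((y - p) / y) for y, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical actual and predicted travel times in seconds.
actual = [50.0, 100.0, 200.0]
predicted = [55.0, 90.0, 200.0]
mape = mean_absolute_percentage_error(actual, predicted)
print(round(mape, 2))
```

Unlike the MAE, the MAPE is unit-free, which makes it convenient for comparing models across segments whose typical travel times differ.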

Conclusions
This study aims to develop a methodology for applying the XGBoost model to travel time prediction. A real-world freeway corridor is selected as the case study to examine the XGBoost prediction model so that the gap between theoretical research and the application of the developed model can be bridged.
It is found that the XGBoost model can provide reliable prediction results. The relationships between several important parameters in the model (e.g., number of trees, learning rate, and maximum depth of the tree) are discussed in this study. In detail, a slower learning rate with a larger number of trees gives a higher accuracy level than a faster learning rate with a smaller number of trees. A higher max_depth value is also more efficient than a lower value, since the number of iterations needed to achieve optimized results is lower.
The relative importance of the features shows that the travel time one time step earlier (15 min before) contributes the most to the predicted travel time. Features such as the time of day, day of the week, and weather also have higher relative importance values in the model than the other features.
The proposed XGBoost-based travel time prediction method has considerable advantages over the gradient boosting approach. The performance evaluation shows that the XGBoost-based model achieves better outcomes in terms of both prediction accuracy and efficiency.
Overall, the XGBoost-based travel time prediction model can provide reliable results with a low error rate. However, the impacts of accidents and roadworks on travel time prediction are also worth exploring. In the future, how to incorporate these features in the model will be studied if the data can be made available. Furthermore, the performance of the travel time prediction model is evaluated in this study under all conditions as a whole. In the future, the performance of the model under different traffic conditions (such as non-congested and congested conditions) can be examined and compared.

Table 1.
Prior studies on travel time prediction using machine learning approaches.

Table 2.
Sample of raw travel time data.


Table 3.
Summary of the basic information on the features used for this study.
Travel time change value of second downstream segment at time step t−1 (15 min before) | Float
T_t | Travel time at time step t | Float

Table 4.
Detailed information on selected features.

Table 5.
Optimized prediction results and computation times.

Table 6.
Relative importance of each feature and their ranks in the model.

Table 7.
Comparison between the prediction results of the optimized XGBoost model and the gradient boosting model.