Machine Learning Modeling of Forest Road Construction Costs

Abstract: The economics of the forestry enterprise are largely measured by its performance in road construction and management. The construction of forest roads requires tremendous capital outlays and usually constitutes a major component of the construction industry. The availability of cost estimation models assisting in the early stages of a project would therefore be of great help for timely costing of alternatives and more economical solutions. This study describes the development and application of such cost estimation models. First, the main cost elements and variables affecting total construction costs were determined, for which the real-world data were derived from the project bids and an analysis of 300 segments of a three-kilometer road constructed in the Hyrcanian Forests of Iran. Then, five state-of-the-art machine learning methods, i.e., linear regression (LR), K-Star, multilayer perceptron neural network (MLP), support vector machine (SVM), and instance-based learning (IBL), were applied to develop models that would estimate construction costs from the real-world data. The performance of the models was measured using the correlation coefficient (R), root mean square error (RMSE), and percent of relative error index (PREI). The results showed that the IBL model had the highest training performance (R = 0.998, RMSE = 1.4%), whereas the SVM model had the highest estimation capability (R = 0.993, RMSE = 2.44%). PREI indicated that all models but IBL (mean PREI = 0.0021%) slightly underestimated the construction costs. Despite these few differences, the results demonstrated that the cost estimations developed here were consistent with the project bids, and our models thus can serve as a guideline for better allocating financial resources in the early stages of the bidding process.


Introduction
Ensuring the sustainable management of forest resources and the economic efficiency of the forestry enterprise requires a quality transport network [1]. However, the construction and further expansion of forest road networks are associated with large costs [2,3] that should be evaluated and properly allocated to adopt management strategies such as (i) alternative selection of route locations, (ii) selection of reasonable bids on road projects, (iii) selection of road standards, (iv) tradeoffs between roading costs and harvesting costs, (v) selection of transport methods, and (vi) spatiotemporal planning of harvesting operations [4]. Estimating the costs of the various components of a roading project is a challenging task and is further complicated by environmental constraints such as variable topography, soils, and rock outcrops [1,5].
Over the last few decades, the development and application of methodologies for cost estimation of road projects have been an active research area for forest engineers. Many software packages, such as PLANS [6], PLANEX [7], NETWORK 2001 [8], and computer-aided engineering programs [9-13] have been developed for the generation of road alternatives
A proposed low-volume road in the Hyrcanian forests in northern Iran was selected for modeling and estimating the total construction costs. The slope in the region ranges from 8 to 70% and the elevation varies between 450 and 700 m above sea level. All of the trees in the roadway corridor were cut and removed after logging. The average road width was 5.5 m, while the tree clearing width of the road was 15 m on average, varying between 12 and 18 m depending on slope and soil conditions. According to data obtained from weather stations, the climate in the area is humid, and mean precipitation is 800 mm. Excavators and bulldozers were used for road construction in this project. Based on the road construction history of the area, the rock proportion is high, which leads to higher earthwork costs [22]. Further, field surveying revealed severe environmental damage, such as erosion and landslides, due to significant terrain modifications associated with road construction.

Modeling Methodology
In this section, we describe the methodology proposed to estimate road construction costs. In general, our methodology consisted of three main steps: (1) data collection, (2)

Cost Elements
Generally, six cost items are considered important in determining forest road construction costs: construction staking, clearing and piling, earthwork, finish grading, surfacing, and drainage and stream crossing structures [4]. In this study, we used the available engineering documentation and project bids to derive the cost corresponding to each cost element. A brief description of these cost elements follows.
The construction staking cost depends on the terrain condition, accessibility, and the number of staking sections per kilometer. Heavy vegetation cover, steep terrain, areas with large amounts of brush, long walk-in times, and any other conditions that shorten the workday will increase the staking costs.
The clearing and piling cost is calculated by estimating the number of hectares per kilometer along the project (i.e., the right of way) from which the trees must be cleared and the stumps removed. Depending on the conditions, the clearing operations are carried out by heavy equipment such as tractors and bulldozers or by crews with axes and power saws. Important variables affecting the clearing and piling cost are tractor size, number of trees, and tree size.

The earthwork cost is calculated by estimating the volume of common material and rock (in cubic meters per kilometer) to be excavated and/or placed as embankment to construct the road alignment. The earthwork quantity is usually calculated as the bank volume using local formulas or tables derived from the sideslope, road width, and cut and fill slope ratios. The most important variable is the excavation rate of rock, which varies with the size, shape, and hardness of the rock, and other local conditions.
The finish grading operations consist of the adjustment of the angles of the cut slope and the width slope of the subgrade. The finish grading cost depends on the area of cut slope and road surface along the road project and is calculated by determining the number of passes a grader must make for a certain width subgrade and the speed of the grader. This number can be converted to the number of hours per hectare of subgrade.
The surfacing cost depends on the type and quantity of surfacing material, the length of the haul, and the equipment used. In many forested areas, roadbed surfacing materials are scarce and expensive. In our study area, the surface layer is often made up of unsorted natural gravel extracted from streams and rock. While gravel requires only loading by front-end loaders and may be compacted, rock needs to be blasted, transported to the crusher, reloaded, hauled to the road site, spread, and compacted. The costs for each of these operations are assumed to be a function of the equipment production rates and machine rates, which together make up the surfacing cost.
The drainage cost is a function of the type of the drainage structures installed. These costs are often expressed as a cost per lineal meter, which can then be easily used in road estimating. In this study, the values for cost per lineal meter for culverts were obtained from the project documentation.

Explanatory Variables
The second component of the database used in this study was a set of variables describing local conditions and road characteristics that are thought to affect, directly or indirectly, the total cost of road construction. Following an analysis of the literature [1,4,10-14,22] and based on available data, we used eight variables: road width, earth slope, cut slope, fill slope, number of trees, size of trees, rock proportion of subsoil, and number of culverts. The data related to these variables were initially obtained from the engineering documentation of a three-kilometer road. The data were then checked and verified via observation of 300 segments across the selected road. This resulted in a collection of 4811 samples.

Model Development
For the development of the cost estimation models, we linked the total construction cost of the road in the study area to the set of explanatory variables using the five machine learning methods selected in this study. In the following, we concisely describe the methods and refer the interested reader to the corresponding literature for a full description of each method. The models were developed using the open-source Weka software on an HP laptop with an Intel® Core™ i3-3110M CPU @ 2.40 GHz, 4 GB of RAM, an x64-based processor, and the Microsoft Windows 8.1 operating system.

Linear Regression (LR)
Developed in the early 1950s, LR is a supervised machine learning algorithm proven efficient for solving a variety of linear problems. LR models the relationship between a dependent variable and one or more independent explanatory variables using linear predictor functions whose unknown model parameters are estimated from the data. Unlike logistic regression, which transforms its output using the logistic sigmoid function to return a probability value that can then be mapped to two or more discrete classes, LR returns a continuous numerical value for the dependent variable.
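As a minimal sketch of how the unknown parameters of an LR model can be estimated from data, the following fits an ordinary least squares model by solving the normal equations. This is an illustration only and not the Weka implementation used in the study; the function names and the toy data (two made-up normalized predictors) are assumptions.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    X is a list of feature rows and y a list of targets.
    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones for the intercept
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(p):
            if k != col:
                f = A[k][col] / A[col][col]
                A[k] = [a - f * b for a, b in zip(A[k], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(beta, row):
    """Continuous prediction for one feature row."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Toy data (hypothetical): cost = 2 + 3 * x1 - 1 * x2
X_toy = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y_toy = [2, 5, 1, 4, 7]
beta = fit_linear_regression(X_toy, y_toy)
print([round(b, 6) for b in beta])
```

Because the model is linear in its parameters, any curvature in the relationship between the predictors and cost cannot be captured without adding transformed predictors.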

K-Star
K-Star, developed by Cleary and Trigg [23], is an instance-based learner that classifies an instance by comparing it to a database of preclassified examples. K-Star is a lazy algorithm and belongs to the family of nearest neighbor methods. In contrast to other nearest neighbor methods, K-Star uses an entropy-based distance function, which enables the method to deal with many problems associated with a classification task. In K-Star, the calculation is governed by two main parameters, namely missing mode and global blend. The Weka software uses a default value of 20 for the global blend. The missing mode parameter determines how missing attribute values are handled by the classifier; four options are available: ignoring the instances with missing attributes, normalizing the attributes, treating missing values as maximally different, and averaging column entropy curves.

Multilayer Perceptron Neural Network (MLP)
An MLP is an alternative approach for learning class discriminants from a set of examples, using the generalized delta rule in a back-propagation learning algorithm. Generally, such a network is built from a set of neurons (nodes) arranged in several layers that consist of an input layer, an output layer, and one or several intermediate layers known as hidden layers. An MLP obtains knowledge about classes by learning from the training dataset directly; therefore, it is unnecessary to make any assumptions regarding the underlying probability density functions. Information about a priori probability can be adjusted after training [24,25] or by increasing the number of training patterns. After training (learning), the MLP classifier is specified by a set of processing elements, which are arranged in a certain topological structure and interconnected with fixed connections (weights). There is no extensive computation involved in the classification of unknown patterns and no need for retaining the training data.

Support Vector Machine (SVM)
SVM is a nonparametric supervised statistical learning technique that was introduced by Cortes and Vapnik [26] and has become one of the established solutions in machine learning and pattern recognition. SVM makes its predictions using a linear combination of kernel functions evaluated on a subset of the training data known as support vectors. SVM differs from comparable machine learning methods such as MLP in that SVM training always finds the global minimum. The main idea behind SVM is to construct a hyperplane in an N-dimensional space for separating the dataset into distinct classes. In SVM, support vectors are the training samples that lie close to the hyperplane and determine its position and orientation. The kernel function maps the input data into a high dimensional feature space [27], in which SVM tries to maximize the distance between the hyperplane and the training samples. SVM has recently found numerous applications in many fields of science.

Instance-Based Learning (IBL)
IBL is a nonparametric method used for classifying a dataset based on the similarity of a query to the nearest training samples in the feature space. IBL is an extension of the K-nearest neighbors (KNN) classifier and belongs to the lazy algorithms, in which the classifier does not generalize from the training samples in advance but postpones generalization until a query is made to the system, in contrast to eager learning techniques that generalize from the training dataset beforehand. In IBL, the KNN parameter gives the number of nearest neighbors (NNs) to use when classifying a test instance, and the result is determined through a majority vote. The IBL algorithm is effective in reducing storage requirements, in determining the associations between learning attributes, and in tolerating noise [28]. Other advantages of the IBL algorithm include its ability to simultaneously learn mixed and overlapping concept descriptions, its accelerated learning (i.e., training) rate, the integration of theory-based reasoning of real-world scenarios, and the application of voting approaches to break tie votes [29].
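The lazy, nearest-neighbor idea can be sketched in a few lines. The text above describes a majority vote for classification; because construction cost is a continuous target, this sketch assumes the common regression variant in which the costs of the k nearest training samples are averaged. The function name, the Euclidean distance, and the toy data are illustrative assumptions, not the configuration used in the study.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a continuous value as the mean target of the k nearest samples.

    No model is built in advance: all work happens at query time,
    which is what makes the method 'lazy'.
    """
    distances = sorted(
        (math.dist(row, query), cost) for row, cost in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(cost for _, cost in nearest) / len(nearest)


# Toy data (hypothetical): two normalized predictors and a cost per segment
train_X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
train_y = [10.0, 12.0, 50.0, 54.0]
print(knn_predict(train_X, train_y, (0.05, 0.0), k=2))
```

Since distances are computed directly on the predictor values, the predictors should share a common scale before the neighbor search.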

Model Training and Testing
For model training and validation, we randomly divided the data collected from field measurements into two sets. Out of 4811 samples, 3368 samples (70%) were used as the training dataset and the remaining 1443 samples (30%) were set aside for model testing. Although there is no universal guideline, the 70/30 ratio is the most common strategy for splitting data [22,30-32].
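A random 70/30 split of this kind can be sketched as follows. The function name and the fixed seed are assumptions made for reproducibility of the example; the study does not report how the random division was seeded.

```python
import random


def split_train_test(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples and split them into training and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Sample indices stand in for the 4811 field samples
train, test = split_train_test(range(4811))
print(len(train), len(test))
```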
Given the different ranges of the variables, the datasets were normalized to the range of 0 to 1 using the following formula [30]:
$$X_n = \frac{X - X_{min}}{X_{max} - X_{min}}$$
where $X_n$ is the normalized value, $X$ is the value that should be normalized, $X_{min}$ is the minimum value of $X$, and $X_{max}$ is the maximum value of $X$. Data normalization helps to avoid the circumstance in which a variable with a large domain dominates another with a small domain.
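A minimal sketch of this min-max normalization for a single variable is given below. The function name and the handling of a constant column (mapped to zeros) are assumptions; the example values use the slope range reported for the study area.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; map it to zeros (assumption)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Slopes in percent spanning the 8-70% range of the study area
print(min_max_normalize([8, 39, 70]))  # [0.0, 0.5, 1.0]
```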
Over the training phase, the optimum value for each model parameter was determined via a trial and error procedure. To do so, different values were entered arbitrarily until the best model performance was achieved. Table 1 details the optimum parameter setting of each model.
An important step in the application of MLP and SVM is the proper selection of the numbers of neurons in the hidden layer and the type of the kernel function, respectively. This step has a direct effect on the successful generalization and classification precision of these two models. In this study, we tested the performance of the ANN model by changing the number of neurons of the hidden layer in a range of 1-50 to achieve the best performance (i.e., highest R and lowest RMSE and computing time). For the SVM model, we tested the efficiency of Poly Kernel (PK), Normalized Poly Kernel (NPK), Radial Basis Function Kernel (RBFK), and Pearson Universal Kernel (PUK).
We used the correlation coefficient (R), root mean square error (RMSE), and percent of relative error index (PREI) as the performance metrics for measuring the goodness of fit (i.e., training performance) and estimation ability (i.e., testing performance) of the models. These metrics, which are among the most common performance metrics used in modeling studies [33-35], are calculated as follows:
$$R = \frac{\sum_{i=1}^{n} (C_{actual,i} - \bar{C}_{actual})(C_{predicted,i} - \bar{C}_{predicted})}{\sqrt{\sum_{i=1}^{n} (C_{actual,i} - \bar{C}_{actual})^2 \sum_{i=1}^{n} (C_{predicted,i} - \bar{C}_{predicted})^2}}$$
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (C_{actual,i} - C_{predicted,i})^2}$$
$$PREI = \frac{C_{predicted} - C_{actual}}{C_{actual}} \times 100$$
where $n$ is the number of samples, $C_{actual}$ and $C_{predicted}$ are the actual and predicted costs, and $\bar{C}_{actual}$ and $\bar{C}_{predicted}$ are the mean values of the actual and predicted costs.
R measures the degree of association between the actual and predicted values. In general, R ranges from −1 to 1; values closer to 1 indicate better model performance. RMSE measures the average magnitude of the error and should be as close as possible to zero (i.e., no error between actual and predicted costs) to indicate excellent model performance [33,36]. PREI measures the model's tendency to underestimate or overestimate the cost and should likewise be as close as possible to zero (i.e., no over- or underestimation) to indicate excellent model performance [29,37,38].
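The three metrics can be sketched as below, assuming the standard Pearson form for R, the usual definition of RMSE, and a PREI computed as the mean signed relative error in percent (negative for underestimation, positive for overestimation). The function names, the sign convention, and the toy costs are assumptions for illustration; the scale on which the study reports RMSE as a percentage is not reproduced here.

```python
import math


def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted costs."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def rmse(actual, predicted):
    """Root mean square error, in the same units as the costs."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mean_prei(actual, predicted):
    """Mean signed relative error in percent; negative means underestimation."""
    n = len(actual)
    return 100.0 * sum((p - a) / a for a, p in zip(actual, predicted)) / n


# Toy costs (hypothetical): every prediction falls slightly below the actual cost
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [98.0, 198.0, 297.0, 396.0]
print(correlation(actual, predicted), rmse(actual, predicted), mean_prei(actual, predicted))
```

In this toy case R is close to 1 even though every prediction is too low, which is why a signed metric such as PREI is useful alongside R and RMSE.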

Results and Discussion
For all five models developed in this study, the magnitude of the modeling error was computed and comparative plots of target vs. output (i.e., actual cost vs. predicted cost) were prepared. Over the training phase (Figure 2), the costs predicted by the IBL model were closest to the actual values, indicating the greatest goodness of fit to the training samples (RMSE = 1.4% and R = 0.998), with the linear regression equation Y = T + 9.9e+03, in which Y is the model output (i.e., predicted total construction cost) and T is the cost reported in the project bids. In contrast, the costs predicted by the LR model were farthest from the actual costs, yielding the highest training error (RMSE = 11.8%), the lowest R (0.834), and an equation of Y = 0.7T + 7.5e+05. The other three models used in this study were ranked from best to worst as SVM (RMSE = 2.94% and R = 0.991), MLP (RMSE = 3.2% and R = 0.9894), and K-Star (RMSE = 3.4% and R = 0.988). A possible explanation for the superiority of IBL over the other models is that most of the variables used in this study (in particular, earth slope, size of trees, rock proportion of subsoil, and number of culverts) have wide value distributions that make the search for neighboring instances easier for prediction purposes [39]. The most likely explanation for the low performance of the LR method compared to the other models is that LR assumes a predefined linear relationship between total cost and the independent variables. Although this assumption may yield promising results in other modeling studies, it does not appear to fit the context of cost estimation modeling, which is characterized by potentially very complex, nonlinear relationships.
The predictive power of the models could not be measured using the goodness of fit, because this metric uses the data with which the models were developed and shows only how well the models fit the training dataset. Conventionally, the predictive power of a model is measured via an external testing phase, during which the model is presented with unseen data [40]. The predictive power of our models was assessed over the testing phase, which yielded R values of 0.841 (LR), 0.984 (MLP), 0.987 (K-Star), 0.993 (SVM), and 0.99 (IBL) (Figure 3). In terms of the magnitude of prediction error, the SVM model made the most accurate prediction (RMSE = 2.44%), followed by IBL (RMSE = 2.86%), K-Star (RMSE = 3.48%), MLP (RMSE = 4.9%), and LR (RMSE = 11.9%).
The highest generalization and predictive performance and the capability to model nonlinear relationships using the SVM model can only be achieved if a suitable kernel function is utilized [41]. Kernel functions transform the nonlinear input space into a high dimensional feature space where nonlinear relationships and the solution to the problem are represented in a linear form [42]. Many different kernel functions exist that are used to create such a high dimensional feature space. Since the nature of real-world data is typically unknown, an a priori choice among different kernels is impossible. Therefore, during the modeling process, different kernels are tried experimentally to find the one that gives the best performance. In this study, we evaluated the goodness of fit and predictive capability of the SVM model using four kernel functions based on R, RMSE, and computing time (Table 2). Over the training phase, R ranged from 0.825 to 0.991 (mean = 0.909), RMSE ranged from 2.9% to 13.64% (mean = 8.79%), and time ranged from 13.14 to 50.37 s (mean = 26.74 s), which identified the PUK (R = 0.991, RMSE = 2.9%, time = 50.37 s) as the best kernel function for building the SVM model. Over the testing phase, R varied between 0.834 and 0.993 (mean = 0.916), RMSE varied between 2.4% and 12.48% (mean = 7.815%), and time varied between 13.17 and 49.07 s (mean = 26.63 s), which once again demonstrated the efficacy of PUK (R = 0.993, RMSE = 2.4%, time = 49.07 s) over the other kernel functions for the SVM model. The highest R and lowest RMSE achieved by PUK suggest that this kernel successfully captured the relationship between the different variables and construction cost. This efficacy stems from the flexibility of the PUK, whose two parameters allow it to change from a Gaussian into a Lorentzian peak shape [42].
However, the SVM model with PUK still showed an RMSE of 2.9% in training and 2.4% in testing, which indicates the existence of some noisy sequences and bias in the datasets and the inability of PUK to avoid them.
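To make the role of the two PUK parameters concrete, the sketch below evaluates the kernel in the form in which it is commonly written, based on the Pearson VII function: K(x, y) = 1 / [1 + (2 ||x − y|| sqrt(2^(1/ω) − 1) / σ)^2]^ω. The parameter names sigma (peak width) and omega (peak shape), the default values, and the function name are assumptions for illustration; the parameter values used in the study are not reproduced here.

```python
import math


def puk_kernel(x, y, sigma=1.0, omega=1.0):
    """Pearson VII universal kernel between two feature vectors.

    omega = 1 gives a Lorentzian peak shape; increasing omega moves the
    shape towards a Gaussian. sigma is the full width at half maximum.
    """
    dist = math.dist(x, y)
    scaled = 2.0 * dist * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega


# Identical points give the maximum similarity of 1
print(puk_kernel((0.2, 0.4), (0.2, 0.4)))
# Same pair of points under a Lorentzian-like and a more Gaussian-like shape
print(puk_kernel((0.0, 0.0), (1.0, 0.0), omega=1.0))
print(puk_kernel((0.0, 0.0), (1.0, 0.0), omega=10.0))
```

Whatever the value of omega, the kernel equals 0.5 when the distance between the two points is sigma / 2, so omega changes only the shape of the tails while sigma controls the width.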
Determining the number of neurons in the hidden layer is a crucial part of deciding the overall MLP architecture. To determine the optimum structure of the MLP model, the number of neurons in the hidden layer was investigated in a range of 1 to 50 neurons (Figure 4). The R, RMSE, and computing time obtained with the training dataset as input to the MLP model ranged from 0.9464 to 0.9894, 3.205% to 7.682%, and 0.75 to 21.16 s, respectively, yielding the best performance with 24 neurons in the hidden layer (R = 0.9894, RMSE = 3.205%, time = 10.03 s). Similarly, the model with 24 neurons performed best in the testing phase (R = 0.9914, RMSE = 2.75%, time = 9.98 s), in which R, RMSE, and computing time ranged from 0.9511 to 0.9914, 2.75% to 7.17%, and 0.76 to 20.5 s, respectively. The inefficiency of the MLP model with too few neurons in the hidden layer can be attributed to the underfitting problem [43], which hindered the MLP from adequately capturing the complicated relationships in the data and led to low model performance. However, structuring the MLP with too many neurons in the hidden layer excessively increased the computation time and may cause overfitting, in which the ANN model memorizes the idiosyncrasies of particular patterns in the training samples and loses its generalization ability [43]. Our results clearly showed that the computation time of the model increased with the number of neurons in the hidden layer. The MLP model with 24 neurons in the hidden layer offered a compromise between computation time and quality of results. Further, since the performance of the MLP model was within the range of the other models developed in this study, we can conclude that the MLP with 24 neurons successfully overcame the overfitting problem.
As a further performance analysis, the PREI metric was computed, which enabled us to examine the models based on their tendency to underestimate or overestimate the construction cost (Figure 5). Except for the IBL model, which overestimated the construction cost by 0.0021%, the models underestimated the cost, with PREI values between −0.0062% (SVM) and −0.0564% (LR) relative to the actual cost available in the project bids. From these results, the LR model was identified as the most biased predictive model, underestimating the construction cost by 0.0564%. Underestimations and overestimations both negatively affect project management [44]. Underestimations may lead to poor quality of the deliverables and hence a bad reputation for the company [45]. In this case, the project managers may come to the conclusion that, due to cost overruns and wasted resources and money, the project should be stopped [46]. Overestimations can also have serious consequences, since the manager extends "the work so as to fill the time available for its completion" according to Parkinson's law [47]. Apart from the reduced productivity rate, overestimated budgets may lead the project manager to pass over new contracts for new projects [45].
A survey of the literature related to infrastructure construction projects [48,49] shows a deviation towards underestimation of construction costs (i.e., cost overruns), which supports our findings in this study. Ghajar et al. [1] also reported that cost estimation models tend to underestimate the total construction costs of road projects in the Hyrcanian forests.

Conclusions
The task of cost estimation of road construction is one to which forest engineers have devoted much time and effort. In this study, we investigated the performance of different machine learning methods for the estimation of construction cost of forest road projects. We adopted a modeling methodology based on road project bids and field measurements to develop predictive models that enable forest engineers to estimate the construction cost of the next projects. Among the five models developed, the IBL model had the highest goodness of fit with the training dataset and the SVM model had the highest estimation capability. Except for the IBL model, which showed a cost overestimation of about 0.0021%, the other models slightly underestimated (0.0275%) the construction cost of forest roads.
Based on the performance metrics and the costs reported in the project documentation, we found these machine learning models promising for estimating the total construction cost of forest roads. This study has opened up new avenues for machine-learning-based analysis of economic targets, management strategies, and system efficiencies, which are undoubtedly important objectives of the forestry enterprise. Future work could extend this analysis to other machine learning methods as well as to metaheuristic optimization algorithms that automatically tune the methods' parameters to produce more accurate cost estimation models.