Client-Oriented Highway Construction Cost Estimation Models Using Machine Learning

Antoniou, Fani; Konstantinidis, Konstantinos

doi:10.3390/app151810237


by Fani Antoniou ¹,* and Konstantinos Konstantinidis ²

¹ Department of Environmental Engineering, International Hellenic University, Sindos, 57400 Thessaloniki, Greece

² Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(18), 10237; https://doi.org/10.3390/app151810237

Submission received: 28 August 2025 / Revised: 13 September 2025 / Accepted: 15 September 2025 / Published: 19 September 2025

(This article belongs to the Special Issue Artificial Intelligence in Civil Engineering: Latest Advances and Prospects)


Abstract

Accurate cost estimation during the conceptual and feasibility phase of highway projects is essential for informed decision making by public contracting authorities. Existing approaches often rely on pavement cross-section descriptors, general project classifications, or quantity estimates of major work categories that are not reliably available at the early planning stage, while focusing on one or more key asset categories such as roadworks, bridges or tunnels. This study makes a novel contribution to both scientific literature and practice by proposing the first early-stage highway construction cost estimation model that explicitly incorporates roadworks, interchanges, tunnels and bridges, using only readily available or easily derived geometric characteristics. A comprehensive and practical approach was adopted by developing and comparing models across multiple machine learning (ML) methods, including Multilayer Perceptron-Artificial Neural Network (MLP-ANN), Radial Basis Function-Artificial Neural Network (RBF-ANN), Multiple Linear Regression (MLR), Random Forests (RF), Support Vector Regression (SVR), XGBoost Technique, and K-Nearest Neighbors (KNN). Results demonstrate that the MLR model based on six independent variables—mainline length, service road length, number of interchanges, total area of structures, tunnel length, and number of culverts—consistently outperformed more complex alternatives. The full MLR model, including its coefficients and standardized parameters, is provided, enabling direct replication and immediate use by contracting authorities, hence supporting more informed decisions on project funding and procurement.

Keywords:

early-stage construction cost prediction; road infrastructure projects; feasibility study models; machine learning applications in civil engineering; model ranking; public sector procurement; conceptual-phase cost prediction; public contracting authorities; road infrastructure planning; tender budget estimation


1. Introduction

Transport infrastructure is essential for national economic development. The EU’s 4.5 million km road network accommodates 52% of freight transport and 81.2% of passenger journeys [1], making early and efficient lifecycle cost estimates for roads crucial. Infrastructure investment is significant, yet cost estimation remains a debated topic [2]. While accurate preliminary cost ranges with confidence limits are necessary for project viability [3], due to risks and uncertainties, project stakeholders incorporate contingency reserves to absorb unforeseen financial impacts [4] rather than adopting scientifically developed lifecycle cost estimation models.

The lifecycle of a major road project spans several decades and involves multiple stages: planning, design, tender, construction, maintenance, and management. Each phase presents environmental, social, political, financial, legal, and technical challenges requiring expertise across disciplines [5]. The lifecycle begins with planning and design and is followed by implementation through construction, the costliest phase. Next comes the longest phase, i.e., operation and maintenance, which has historically been overlooked in cost estimates but is now considered in recent studies, particularly regarding annual maintenance costs of road assets such as bridges and substructures [6,7,8,9].

Cost estimates are needed at every stage, each relying on different levels of available information. In fact, most feasibility and “go-no go” decisions concerning the procurement of large transportation projects are made during the conceptual-feasibility stages of the project [10]. Therefore, preliminary construction cost estimates are made before detailed plans exist, using conceptual designs and historical data [5]. These estimates are required early on by the owners (clients) or public highway infrastructure contracting authorities (CAs) to support cost–benefit analyses, feasibility studies, budgeting, and financial evaluation of alternatives, highlighting the need for inexpensive and reasonably accurate early-stage construction cost estimating methods based on available geometrical data [5]. Since, internationally, comparative data on major infrastructure construction costs are scarce due to variations in legal frameworks, procurement systems, and contract types [11,12], models tailor-made to national markets are necessary to consider these discrepancies.

Estimating accuracy varies by project phase. Conceptual-phase estimates typically have a ±25% accuracy [12], while feasibility estimates range from −15% to −30% on the low side and from +20% to +50% on the high side [13]. In practice, cost estimates in preliminary feasibility studies rely on past project data for parametric estimates (e.g., cost per square meter), which are used for informed decision making [3,14]. These estimates are typically accurate to within 30–40%, and even discrepancies of up to ±50–100% [15] can be regarded as “order of magnitude” discrepancies. In addition, Gardner et al. [16] challenged the common belief that incorporating more input data leads to greater estimate accuracy. Their study revealed that including more variables than necessary in conceptual cost estimates did not enhance accuracy. Instead, they found that using only the essential input data needed to achieve a reasonable level of confidence was more cost-effective.

It is proposed herein that highway construction tender cost estimation models using relatively few initial geometrical data, known during the feasibility phase, can provide better early estimates of around ±30%, thus enabling comparisons of design alternatives and facilitating procurement decision making on behalf of the public CA [12]. Therefore, this study aims to develop a best-fit cost estimation model for early road construction cost estimation based on data from the tender documents for the construction of 19 sections of the Egnatia Highway in Central and Eastern Macedonia and Thrace, which were tendered and constructed according to the Greek provisions of the relevant legislation in force at the time of their tender. The data collected for each project included the total cost of all work items and a series of technical characteristics chosen based on the professional experience of the corresponding author. The data were analyzed statistically to verify the choice of technical characteristics to be used as independent variables in the tested total cost estimation models. It was concluded that the estimated length of the main axis and service roads, number of interchanges, deck area of the structures, length of single bore tunnels and the number of culverts can be used to successfully predict the tender construction budget for a highway during the feasibility phase.

The next section presents the systematic literature review (SLR) that was carried out to determine the gap in the literature that this research work aims to fill, followed by Section 3, which presents the research objectives, data collection process and analysis, as well as a short theoretical description of each machine learning (ML) approach and their metrics of performance evaluation. Section 4 presents the models, their development, training, testing and analysis, including the model ranking and acceptance procedure employed and comparison of results. The final Section 5 includes the conclusions, limitations of the study and proposals for further research.

2. Literature Review

A systematic literature review (SLR) was conducted to collect and evaluate individual studies with the goal of assessing the originality of the proposed cost estimation models. The review followed the guidelines outlined in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist [17]. The procedure, illustrated in Figure 1, began with a comprehensive search in the Scopus (Elsevier) database. Scopus was selected over alternatives such as Web of Science and Google Scholar due to its controlled status, broader journal coverage in engineering, greater number of indexed journals, and fewer citation-counting inconsistencies [18,19,20].

The search was performed using the “Advanced Search” feature with the following keyword string: TITLE(predict* OR forecast* OR estimat*) AND TITLE(Cost) AND TITLE(highway* OR road* OR motorway* OR bridge* OR tunnel* OR asphalt OR pavement OR freeway OR underpass OR overpass). The wildcard “*” was used to capture variations of terms. Keywords were selected to align with the research objective of identifying existing early-stage cost estimation models for road infrastructure projects that encompass assets such as bridges, tunnels, and roadworks. To ensure relevance, the search was restricted to occurrences of these keywords in the title, returning 441 initial results.

These were narrowed down to 295 documents by filtering for English-language scientific publications (journal articles, conference papers, and book chapters) within the engineering field. The titles and abstracts of these were screened. It was found that 31 referred to rehabilitation costs for road pavements, while many addressed construction cost escalation prediction models based on detailed design and market information at the time of tender. A further subset of these papers claimed to support early-stage cost estimation; however, only a small number provided models that rely on geometric parameters available during the feasibility phase. As a result, 42 documents were identified that directly addressed cost estimation models for roads, tunnels, or bridges based on preliminary information. Drawing upon the corresponding author’s prior research on metro station construction cost models [21], an additional 18 documents were identified for inclusion, bringing the total to 60 documents for detailed analysis.

The results from the SLR are shown in Table 1 and Table 2 and are summarized as follows. The selected studies span from 1998 to 2025 with a peak of activity in 2022 (12 articles). The geographic distribution of the project data sources shows the USA (14 studies), Greece (10 studies) and China (6 studies) as the most frequent, while it is notable that six studies included data sets from multiple countries, i.e., the USA and Canada [22], Iran and Iraq [23,24,25,26] and Poland and Thailand [27]. The vast majority employ a variety of ML techniques either alone (32 studies) or in combination with non-ML techniques, i.e., metaheuristic optimization algorithms (5 studies), other statistical/mathematical models (4 studies), and expert systems or decision-making methods [28,29].

As shown in Table 1, the most frequently used methods in these models were Artificial Neural Networks (ANNs), employed in 20 studies dating back to 1998, and Regression Analysis (RA) in 13 proposed models. Other frequently used techniques included Support Vector Machines (SVMs) in seven studies, Random Forests (RF) and linear regression (LR) in six studies each, followed by Gaussian Process Regression (GPR) in four and Unit Cost Analysis (UCA) in three. These findings do not differ significantly from a literature review carried out by Barakchi et al. in 2017 [2], which identified parametric methods using LR, ANNs and unit cost methods as the most common approaches for transportation infrastructure cost estimation.

Of the 60 documents initially reviewed, 14 focused on early-stage construction cost estimation models specifically for bridges and 10 for tunnels. Given that the objective of this research is to develop a preliminary construction cost estimation model for new road sections, which may include tunnels and/or bridges, these 24 studies were excluded from further analysis. This decision was based on the rationale that such specialized structures warrant separate and dedicated estimation models when more detailed information about their final geometry, method of construction and geological conditions is available later in the project’s lifecycle. The remaining 36 documents underwent a comprehensive content analysis to extract key features, including the methodological approach, database size, and input and output variables. The primary goal of this evaluation was to identify existing early-stage cost estimation models that could be utilized by client organizations during the feasibility phase of a new road project, where only limited geometric information is available.

During the review, several studies were excluded for various reasons. Specifically, studies by Chou [30], He et al. [31], Swei et al. [32] and Gante et al. [33] were excluded, as they focused exclusively on cost estimation models for existing road improvement projects. Additionally, six studies [34,35,36,37,38,39] focused on developing final cost estimation models based on detailed initial construction contract budgets, which are not applicable in early feasibility assessments. Eight further studies [27,29,40,41,42,43,44,45] were excluded, as they relied on quantity take-offs of major work items, data typically unavailable during the preliminary design phase. The study by Chou et al. [46] was also excluded, as it proposed a quantity estimation model for major work items instead of a cost estimation model, requiring the users to determine unit rates in order to deduce a preliminary cost estimation. The study by Kang et al. [47] was also deemed irrelevant, as it presents a theoretical conceptual proposal for a cost prediction model for all types of projects based on CBR and the Pareto principle, without any application to existing project data and hence without a final proposal of any estimation model. Lastly, studies prior to 2000 were excluded [48,49], while others were excluded either due to lack of access to the full document [50,51,52,53,54] or the absence of critical methodological information, such as data sources, input variables, or output parameters [55]. Consequently, the final set of 8 studies presented in Table 2 was identified as offering early-stage cost estimation models applicable to new road construction projects.

The model proposed in the current research differs from those in Table 2, as it is the only one that requires, apart from the road length, geometric input variables related to all three basic assets, namely interchanges, bridges and tunnels. The detected studies have typically relied on geometrical information of the pavement cross-section, general project classifications, or quantity estimates of major work categories that are not reliably available at the early planning stage, while omitting one or more of the key asset categories. For example, Chou and O’Connor [3], Lin and Techapeeraparnich [56], Mahalakshmi and Rajasekaran [57], Meharie et al. [58,59], Mohamed and Moselhi [60], and Hoffmann and Donev [61] did not consider tunnels or interchanges, and most also excluded culverts. Furthermore, several of these studies depended on variables that are difficult to obtain at the feasibility stage, such as earthwork quantities [56,62] or contract duration [57,58], thereby limiting their practical usability during early planning. The only study found that included all three asset types was Wang et al. [62], but it requires a method of estimating earthwork quantities during the feasibility phase without any design work having been completed. In addition, they used only a Least Squares Support Vector Machine (LSSVM), enhanced by Principal Component Analysis (PCA) and Particle Swarm Optimization (PSO) to optimize the input variables. In contrast, this proposal examines the performance of seven machine learning (ML) methods to determine the best-fitting one.

Table 1. SLR content analysis (year, project type, data source country, method).

First Author	Year	Project Type	Data Source Country	Method	Method Category
Hegazy [49]	1998	Roads	Canada	ANN	ML
Adeli [48]	1998	Roads	USA	ANN	ML
Sodikov [27]	2005	Roads	Poland, Thailand	ANN	ML
Wilmot [34]	2005	Roads	USA	ANN	ML
Chou [46]	2006	Roads	USA	NLR	ML
Chou [3]	2007	Roads	USA	ES-(RDBMS)	ES/DM and DVTs
Hammad [35]	2008	Multiple project types including Roads	Jordan	MLR	ML
Bouabaz [63]	2008	Bridges	UK	ANN	ML
Pewdum [36]	2009	Roads	Thailand	ANN	ML
Chou [29]	2009	Roads	USA	ES using GLM	ML-Expert/DM
Petroutsatou [64]	2010	Tunnels	Greece	SEM/MLR/ANN	ML and S/MM
Kim [50]	2010	Roads	South Korea	CBR	ES/DM
Fragkakis [65]	2011	Bridges (foundations)	Greece	RA	ML
Kang [47]	2011	Roads	South Korea	CBR/RA	ES/DM and S/MM
Chou [30]	2011	Roads (Pavements)	Taiwan	MCS	STs
Asmar [40]	2011	Roads	USA	PERT	S/MM
Mahamid [41]	2011	Roads	West Bank	MLR	ML
Petroutsatou [66]	2012	Tunnels	Greece	ANN	ML
Rostami [22]	2013	Tunnels	Canada, USA	UCA/RA	S/MM
Mahamid [43]	2013	Roads	Saudi Arabia	MLR	S/MM
Kim [42]	2013	Roads	South Korea	CBR/AHP	ES/DM
He [31]	2014	Roads (Pavements)	China	ANN	ML
Park [51]	2014	Roads	Korea	BIM/GIS	DVTs
Antoniou [67]	2016	Bridges (overpasses)	Greece	MLR	ML
Fragkakis [68]	2016	Bridges	Greece	UCA/MLR	S/MM-ML
Peško [37]	2017	Roads	Serbia	ANN	ML
Swei [32]	2017	Roads (Pavements)	USA	RA	S/MM
Zhang [69]	2017	Bridges	USA	MARS	ML
Antoniou [5]	2018	Bridges (underpasses)	Greece	MLR	ML
Aretoulis [38]	2019	Roads	Greece	ANN	ML
Mahalakshmi [57]	2019	Roads	India	ANN	ML
Juszczyk [70]	2019	Bridges	Poland	SVM	ML
Lin [56]	2019	Roads	Thailand	ANN/MLR	ML
Meharie [58]	2020	Roads	Ethiopia	RF/SVM/ANN	ML
Juszczyk [71]	2020	Bridges	Poland	ANN	ML
Wang [62]	2021	Roads	China	PCA/PSO/LSSVM	ML and MOAs
Petroutsatou [72]	2021	Tunnels	Greece	ANN	ML
Kovacevic [73]	2021	Bridges	Serbia	ANN/RTE/RF/GBM/SVR/GPR	ML
Markiz [74]	2022	Bridges	Canada	ES/BIM	ES/DM and DVTs
Feng [55]	2022	Roads	China	RF/SVM	ML
Liu [75]	2022	Tunnels	China	SVM	ML
Meharie [59]	2022	Roads	Ethiopia	GBM with LR/SVM/ANN	ML
Mahmoodzadeh [24]	2022	Tunnels	Iran/Iraq	GPR	ML
Mahmoodzadeh [26]	2022	Tunnels	Iran/Iraq	MCS	STs
Mahmoodzadeh [25]	2022	Tunnels	Iran/Iraq	LR/GPR/SVR/DT	ML
Mahmoodzadeh [23]	2022	Tunnels	Iran/Iraq	GPR/PSO/GWO/MVO/MFO/SCA/SSO	ML and MOAs
Ghadbhan Abed [44]	2022	Roads	Iraq	LASSO/K-NN/RF	ML and MOAs
Gante [33]	2022	Roads (Improvements)	Philippines	ANN	ML
Mohamed [60]	2022	Roads	USA	ANN/SVM/RF/EML	ML
Warren [52]	2022	Roads	USA	ML	ML
Hoffmann [61]	2023	Roads	Austria	UCA	S/MM
Kovacevic [28]	2023	Bridges	Serbia	ANN/RT/MGGP/VIKOR	ML-ES/DM
Vagdatli [7]	2023	Bridges	Greece	DBN	ML
Zhang [53]	2023	Roads	USA	LASSO/GRNN	ML and MOAs
Wang [54]	2024	Roads	China	GA-BP	ML
Zhou [76]	2024	Tunnels	China	SSAE	ML
Birhanu Belete [39]	2024	Roads	Ethiopia	RA	S/MM
Vagdatli [77]	2024	Bridges	Greece	DBN	ML
Abd [45]	2024	Roads	Iraq	ANN	ML
Helaly [78]	2025	Bridges	USA	ML	ML

AHP (Analytical Hierarchy Process), ANN (Artificial Neural Network), BIM (Building Information Modeling), CBR (Case-based Reasoning), DBN (Dynamic Bayesian Networks), DM (Decision-Making), DT (Decision Trees), DVTs (Data Visualisation Tools), EML (Ensemble Machine Learning), ES (Expert System), GA-BP (Genetic Algorithm-BackPropagation Neural Network), GBM (Gradient Boosting Machine), GIS (Geographic Information System), GLM (Generalized Linear Modelling), GPR (Gaussian Process Regression), GRNN (General Regression Neural Network), GWO (Grey Wolf Optimization), K-NN (K Nearest Neighbors), LASSO (Least Absolute Shrinkage and Selection Operator), LR (Linear Regression), LSSVM (Least Squares Support Vector Machine), MARS (Multivariate Adaptive Regression Splines), MCS (Monte Carlo Simulation), MFO (Moth Flame Optimization), ML (Machine Learning), MLR (Multiple Linear Regression), MOAs (Metaheuristic Optimization Algorithms), MVO (Multi-Verse Optimization), PCA (Principal Component Analysis), PERT (Program Evaluation and Review Technique), PSO (Particle Swarm Optimization), RA (Regression Analysis), RDBMS (Relational Database Management Systems), RF (Random Forests), RT (Regression Trees), RTE (Regression Tree Ensembles), SCA (Sine Cosine Algorithm), SEM (Structural Equation Model), S/MM (Statistical/Mathematical Modelling), SSAE (Stacked Sparse AutoEncoder), SSO (Social Spider Optimization), STs (Simulation Techniques), SVM (Support Vector Machine), SVR (Support Vector Regression), UCA (Unit Cost Analysis), VR (Virtual Reality).

Table 2. Road construction cost estimation models in the extant literature.

First Author (Year) [Ref]	Database Size	Independent Variables	Dependent Variable	Comments	Method
Chou (2007) [3]	Unspecified number of projects from Texas DoT	Project Type (choice of 8), Project length (miles), Shoulder width (ft), Lane width (ft), Lanes (n), Percent trucks, Design speed (mph), Location (choice of 3), Divided highway (yes/no), Characteristics (e.g., Freeway), National Highway System (yes, no), Truck system flag (yes, no), Bridges (n), Bridge deck area (ft²), Dominant bridge material (type)	Construction Cost	Does not present the statistical models that are used for calculation of the cost estimate but presents the development of a web-based system for users to input the IVs. Does not consider IVs related to number of culverts or interchanges or existence of tunnels.	ES-RDBMS (Non-ML)
Lin (2019) [56]	51 rural road projects in Thailand	Road length (km), Lane width (m), Number of lanes per direction, Pavement type, Earthwork quantity, Miscellaneous	Construction Cost	Does not consider IV related to number of culverts or interchanges nor consider existence and size of possible bridges or tunnels. It requires a method of estimating earthwork quantities during the feasibility phase without any design work having been completed.	RA (ML)
Mahalakshmi (2019) [57]	52 road projects in India	Road classification, Topography, Type of pavement (flexible or rigid), C/S of pavement (cutting or embankment), Soil Condition (based on CBR) (%), Contract duration, Length of pavement, lane width, Pavement thickness, No. of cross drains	Construction Cost	Does not consider IV related to number of interchanges nor considers existence and size of possible bridges or tunnels. It requires a method of estimating contract duration, which is challenging during the feasibility phase without the detailed contractual scope having been defined.	ANN (ML)
Meharie (2020) [58]	74 road projects in Ethiopia	Project length, Number of bridges, Inflation rate, Project scope, Terrain type, Project type, Contract duration and Project location.	Construction Cost	Does not consider IV related to number of culverts or interchanges nor consider existence and size of possible tunnels. It requires a method of estimating contract duration, which is challenging during the feasibility phase without the detailed contractual scope having been defined.	RF/SVM/ANN (ML)
Wang (2021) [62]	60 road projects in China	Length (km), Width (m), Earthwork Quantity (m³), Number of bridges (n), Number of interchanges, Number of separated interchanges, Number of tunnels, Pavement form, Landform features and area.	Construction Cost per km	Most similar to our idea but also requires a method of estimating earthwork quantities during the feasibility phase without any design work having been completed.	PCA/PSO/LSSVM (ML)
Meharie (2022) [59]	117 road projects in Ethiopia	Number of bridges, Inflation rate, Terrain type and Project type	Construction Cost	Does not consider IV related to number of culverts or interchanges nor consider existence and size of possible tunnels	GBM + LR/SVM/ANN (ML)
Mohamed (2022) [60]	284 USA highway projects	Location, Facility Type (Roads, Bridges, Drainage, and Intelligent Transportation System), Project Scope (New Construction/Expansion, Rehabilitation/Reconstruction, and Resurfacing/Renewal), Highway Type (Rural Interstate, Urban Interstate, Rural Primary, Urban Primary, and Rural Secondary). Length (km), Number of Lanes, Technical Complexity (Non-complex, Moderately Complex, and Most Complex)	Construction Cost	Does not consider IV related to number of culverts or interchanges nor consider existence and size of possible bridges or tunnels	ANN/ SVM/ RF/ EML (ML)
Hoffmann (2023) [61]	Unspecified number of projects in Austria	Lane kilometers, Area of bridges and tunnels	Unit prices	Provides expected unit prices per lane km and per square meter of bridge or tunnel. Does not consider number of interchanges nor number of culverts.	UCA (Non-ML)

3. Methods

3.1. Research Objectives

Building on the findings of the previous SLR, this research aims to develop early-stage construction cost estimation models tailored specifically to new road sections—potentially including interchanges, tunnels, and bridges—based solely on easy-to-define geometric characteristics. This study adopts a more holistic and practical perspective by comparing models developed using the ML methods Multilayer Perceptron-Artificial Neural Network (MLP-ANN), Radial Basis Function-Artificial Neural Network (RBF-ANN), Multiple Linear Regression (MLR), Random Forests (RF), Support Vector Regression (SVR), XGBoost Technique, and K-Nearest Neighbors (KNN) to determine the best-fit and propose the most accurate model for use by highway CAs during their initial conceptual or feasibility phases. The originality of this research lies in two key contributions. (1) It is the only model identified in the reviewed literature that incorporates geometric characteristics for all three major road assets—interchanges, tunnels, and bridges—within a single unified predictive framework, and (2) it avoids reliance on complex optimization or preprocessing tools like PCA and PSO, which have limited transparency and practical usability for client organizations. Instead, the models proposed in this study are grounded in readily available early-phase inputs and designed for direct applicability during feasibility studies, where time, data, and detail are constrained.

3.2. Data Collection and Description

For the purposes of this study, data were collected from public works construction contracts for nineteen new road sections of the Egnatia Motorway in northern Greece. Egnatia Motorway, which is part of the Trans-European Network for Transport and one of the most significant projects constructed using European funding, is a dual carriageway spanning between Igoumenitsa on the west coast and Kipi Evros at the Greek–Turkish border. Each direction has two lanes and an emergency lane, except for the 50 km section between Klidi and Lagadas, which bypasses the second largest city in Greece and has three lanes and an emergency lane. The maximum speed limit is 130 km/h. It has connections to four ports and six airports and includes a number of tunnels, bridges and overpasses, connecting to many Greek cities. Egnatia Odos S.A. (EOAE) is a state-owned CA that has been responsible for the management of the design, construction, operation and maintenance of the motorway [79].

The scope of each of the nineteen contracts includes the full construction of a section of the motorway tendered between 1994 and 2006 and completed by 2008. The data were made available by the author during her tenure as Project Manager at Egnatia Odos S.A. for the Eastern Sector. Nevertheless, all tender documents are public records and have been accessible online since 2005 on the official EOAE website (www.egnatia.eu).

Each of the analyzed contracts includes a detailed budget and a technical description from which the data for this study were extracted. The technical descriptions provide comprehensive details on the physical and geometric characteristics of each project. Each detailed budget includes a complete listing of all project work items, the respective quantities based on the definitive studies, and their unit prices as defined in the New Unified Price Lists, based on which tender budgets are created in accordance with Greek public works legislation.

From the technical description the following geometric parameters were systematically extracted and were used as independent variables (IVs) in the models tested:

L_m = length of main axis (m);
W_m = width of main axis (m);
A_m = area of main axis (m²);
L_s = length of service roads (m);
W_s = width of service roads (m);
A_s = area of service roads (m²);
A_m+s = total road area (main + service) (m²);
N_ic = number of interchanges;
A_b = area of bridges (m²);
A_op = area of overpasses (m²);
A_up = area of underpasses (m²);
A_st = sum of area of all structures (m²);
L_t = total length of single bore tunnels (m);
N_c = number of culverts;
L_R = length of river training (m);
L_Rb = length of river bridges (m);
W_Rb = width of river bridges (m);
N_s = number of spans of river bridges;
A_Rb = area of river bridges (m²).

In the detailed tender budget, the cost of each work item is computed as the product of its quantity and unit price. These are then summed up to provide the total cost of work items. To compute the total tender budget, according to Greek public procurement legislation, 18% is added to the total cost of work items to cover contractor’s overhead and profit (O&P) and 9% for contingency. For the purpose of this research, the dependent variable (DV) “Total Cost” is the total cost of work items. To allow cost comparisons all prices were re-valued to 2024 prices using the harmonized index of consumer prices (HICP) for Greece as provided by Eurostat.
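The cost build-up and re-valuation described above can be illustrated with a minimal Python sketch. The work items, index values and function names are hypothetical and are not taken from the study's data. The text does not state whether the 9% contingency is applied to the bare work-item cost or to the subtotal including O&P; the sketch assumes the latter, and that choice should be treated as an assumption.

```python
def total_work_items_cost(items):
    """Sum quantity * unit price over (quantity, unit_price) pairs."""
    return sum(quantity * unit_price for quantity, unit_price in items)


def tender_budget(work_items_cost, op_rate=0.18, contingency_rate=0.09):
    """Add O&P, then contingency on the resulting subtotal.

    ASSUMPTION: contingency compounds on (cost + O&P); the source text
    does not specify the base to which the 9% is applied.
    """
    subtotal = work_items_cost * (1.0 + op_rate)
    return subtotal * (1.0 + contingency_rate)


def revalue_to_base_year(cost, index_at_tender, index_at_base):
    """Scale a historical cost to base-year prices with a price index ratio."""
    if index_at_tender <= 0:
        raise ValueError("index_at_tender must be positive")
    return cost * index_at_base / index_at_tender


# Hypothetical work items: (quantity, unit price in EUR)
items = [(1000.0, 12.5), (250.0, 80.0)]

# The DV "Total Cost" is the sum of work items, before O&P and contingency.
dv_total_cost = total_work_items_cost(items)

# Hypothetical index values for the tender year and for 2024.
cost_2024 = revalue_to_base_year(dv_total_cost, 80.0, 120.0)
```

Keeping the DV free of O&P and contingency, as the study does, means the percentages only need to be applied once, to the model's prediction, when a full tender budget is required.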

The previously described project data were recorded in an MS Excel worksheet and then transferred to IBM SPSS Statistics 23 for statistical analysis. The descriptive statistics of the data set are shown in Table 3, while the Pearson correlation coefficients for the DV against all possible IVs are shown in Table 4.
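For readers who wish to reproduce this kind of screening outside SPSS, the Pearson correlation coefficient between one candidate IV and the DV can be computed as in the following sketch. The sample values are invented for illustration and are not the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    covariance_sum = sum(a * b for a, b in zip(dx, dy))
    var_x_sum = sum(a * a for a in dx)
    var_y_sum = sum(b * b for b in dy)
    if var_x_sum == 0 or var_y_sum == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance_sum / math.sqrt(var_x_sum * var_y_sum)


# Hypothetical values: main axis length (m) and total cost (million EUR)
main_axis_length = [4200.0, 7800.0, 10500.0, 15300.0]
total_cost = [31.0, 52.0, 80.0, 110.0]
r = pearson_r(main_axis_length, total_cost)
```

A coefficient close to +1 or −1 indicates a strong linear association, which is the property being screened for when choosing IVs for a linear model.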

3.3. Machine-Learning Approaches

The estimation models examined in this research use seven different ML approaches to determine the best fit and propose the most accurate model for use by highway CAs during their initial feasibility study. In this section each method will be briefly described, providing references for in-depth methodological explanations and example applications of each method in construction management.

3.3.1. Multilayer Perceptron-Artificial Neural Network (MLP-ANN) Models

Multilayer Perceptrons (MLPs) have existed conceptually since the late 1950s but became practically usable in 1986 following the publication of David E. Rumelhart, Geoffrey Hinton, and Ronald J. Williams’ paper “Learning representations by back-propagating errors”, which introduced backpropagation for training multilayer perceptrons efficiently [80]. An MLP is a class of feedforward ANN composed of multiple layers of nodes, with each layer fully connected to the next one. Each node computes a weighted sum of its inputs and passes it through an activation function, and the network is trained with backpropagation as a supervised learning technique, enabling it to approximate complex functions and patterns by learning from data. MLP-ANNs are generally better suited for complex, high-dimensional, and large-scale problems, offering superior generalization capabilities [81,82]. For a detailed description of the process and mathematical foundations, the reader can refer to Kovacevic and Antoniou [28]. Kulkarni et al. [83] have shown that ANNs have been extensively used in construction management for prediction of costs, productivity, risks, safety, duration, disputes, etc. Specifically for cost estimation, MLP-ANN models are frequently used and have been applied to building construction and renovation cost estimation and cashflow prediction for road construction (for example, in [84,85,86,87]), bridge construction or renovation costs [28,63,73], tunnel lifecycle costs [72] and other early road construction cost estimation models [57,59,60].
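The forward pass of a small MLP can be sketched in a few lines of Python. The sketch assumes one hidden layer with a tanh activation and a linear output node, which is a common arrangement for regression but is not stated in the text above; the weights are placeholders, and training by backpropagation is deliberately omitted.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per node."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + bias
        for node_weights, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: tanh hidden layer (assumed), linear output layer."""
    hidden = [math.tanh(z) for z in dense(inputs, hidden_w, hidden_b)]
    return dense(hidden, out_w, out_b)


# Placeholder weights: 2 inputs -> 1 hidden node -> 1 output
prediction = mlp_forward(
    [0.5, 9.0],
    hidden_w=[[1.0, 0.0]],
    hidden_b=[0.0],
    out_w=[[2.0]],
    out_b=[0.0],
)
```

In practice the inputs would be scaled IVs and the weights would be learned from the training projects rather than set by hand.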

3.3.2. Radial Basis Function-Artificial Neural Network (RBF-ANN) Models

Radial Basis Function Neural Networks (RBF-ANNs) were introduced in the late 1980s by Broomhead and Lowe [88] as a type of ANN that uses radial basis functions as activation functions. RBF networks typically have three layers: input, hidden, and output, where the hidden layer performs a non-linear transformation of the input space and normally computes how close the input is to a center, usually with Gaussian activation functions and Euclidean distances [81]. For a detailed description of the process and mathematical foundations, the reader can refer to Yu et al. [89]. RBF-ANNs are known for their faster training times and their ability to effectively model localized patterns, particularly in small or simpler datasets [81]. They have been used, for example, for building construction and renovation cost prediction and cashflow prediction models for road construction [81,83,84]. Our SLR found only one bridge construction cost estimation model that employed RBF-ANN and none for early construction cost estimates for roads [71].
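The hidden-layer computation described above can be sketched as follows. This is a minimal illustration, assuming a Gaussian basis function with a single width parameter `sigma`; the function names and the default width are hypothetical, since the width used in this study is not stated.

```python
import math

def gaussian_rbf(x, center, sigma=1.0):
    """Activation of one hidden unit: exp(-||x - c||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbf_forward(x, centers, weights, bias=0.0, sigma=1.0):
    """Network output: bias plus the weighted sum of hidden activations."""
    hidden = [gaussian_rbf(x, c, sigma) for c in centers]
    return bias + sum(w * h for w, h in zip(weights, hidden))
```

An input that coincides with a center activates that unit fully (value 1), and the activation decays towards 0 as the Euclidean distance grows, which is why such networks model localized patterns.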

3.3.3. Multiple Linear Regression (MLR) Models

Linear Regression (LR) is one of the oldest and most widely used statistical modeling techniques, originating from the work of Francis Galton in the 19th century [90]. It models the relationship between a DV and one or more IVs by fitting a linear equation to observed data. As a fundamental subset of Regression Analysis (RA), LR serves as the foundation for more complex models, including multiple regression, which refers broadly to any regression model with more than one independent variable. When the relationship between the variables is assumed to be linear, the method is specifically called multiple linear regression (MLR). In this study, MLR will be employed to model the relationship between various IVs and a single cost variable, due to its effectiveness in capturing multivariable influences while maintaining interpretability. In construction management, MLR models have been extensively used for construction cost estimations for buildings (e.g., [91,92,93]) and for roads, tunnels and bridges, as seen in Table 1.

3.3.4. Random Forest (RF) Models

Random Forests (RF), introduced by Breiman in 2001 [94], are an ensemble learning method that builds upon the decision tree algorithm by combining multiple decision trees to improve predictive accuracy and control overfitting. Each individual tree in the forest is trained on a bootstrap sample of the data, and at each split, a random subset of features is considered, promoting diversity among the trees and reducing correlation. For the chosen subset of features, the algorithm evaluates all possible splits and selects the one that maximizes information gain (in regression tasks, the largest reduction in variance). The final prediction is typically made by averaging the outputs in regression tasks or by majority voting in classification. RF models are non-parametric and capable of modeling complex, non-linear relationships without requiring prior assumptions about data distribution, making them particularly well-suited for heterogeneous and high-dimensional datasets commonly found in construction management. They also offer variable importance measures, which help identify key cost drivers in predictive tasks. RFs have been applied in construction cost estimation for buildings [95,96], roads [44,55,58,60], bridges [73], and tunnels [97]. Although RF models do not require architecture tuning like neural networks, they do involve the selection of hyperparameters such as the number of trees and maximum tree depth [73]. For a detailed mathematical formulation and algorithmic structure, readers can refer to [73,97,98].

3.3.5. Support Vector Regression (SVR) Models

Support Vector Regression (SVR) is a type of Support Vector Machine (SVM) adapted for regression tasks. As cited by Smola and Scholkopf, it was developed by Vapnik in the 1990s [99]. SVR seeks to find a function that deviates from the actual observed outputs by a value no greater than a specified margin (epsilon), while keeping model complexity as low as possible. Unlike traditional regression techniques that aim to minimize the squared error, SVR employs a loss function that ignores errors within a certain threshold, allowing for greater robustness against outliers and noise. A key advantage of SVR is its ability to model non-linear relationships through the use of kernel functions—such as linear, polynomial, or radial basis function (RBF) kernels—which project input data into higher-dimensional feature spaces where a linear relationship can be identified. This flexibility makes SVR particularly useful in construction cost estimation tasks where relationships among variables may not be strictly linear [99]. While SVR models require careful tuning of hyperparameters (e.g., penalty parameter C, kernel type, and epsilon value), they have been successfully applied in various areas of construction management. As shown in our SLR, they have been applied for construction cost estimation models for bridges [71,73], tunnels [25,75] and roads [55,58,59,60]. For detailed SVR formulation and optimization strategies, readers may consult Smola and Scholkopf [99].
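The difference between the loss used by SVR and the squared error of traditional regression can be illustrated with a short sketch. The function names are illustrative only, and the numbers in the usage note are arbitrary.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)

def squared_loss(actual, predicted):
    """Loss minimized by ordinary least-squares regression."""
    return (actual - predicted) ** 2
```

For an error of 0.5 with epsilon = 1 the SVR loss is zero, while for an error of 3 it is 2 (against a squared loss of 9). The linear rather than quadratic growth is what limits the influence of outliers.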

3.3.6. XGBoost Technique

Extreme Gradient Boosting (XGBoost) is the most recent and advanced technique among the methods examined in this study. Introduced by Chen and Guestrin in 2016 [100], XGBoost was designed with efficiency and performance in mind, incorporating system-level optimizations and regularization techniques that enhance both speed and model generalization. XGBoost can capture complex, non-linear interactions between variables, as well as handle sparse data. While RFs predict the output by averaging the predictions of all created decision trees in parallel, XGBoost builds trees sequentially, with each new tree trying to correct errors of the previous trees. Therefore, XGBoost often achieves lower marginal error than RF, which can improve predictive accuracy. However, this sequential error correction process also increases the risk of overfitting if the model is not properly regularized [100].
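The sequential error-correction idea can be sketched with plain gradient boosting on a single feature. This is not XGBoost itself: it assumes squared-error loss and one-split trees (stumps), and it omits the regularization and system-level optimizations that distinguish XGBoost. All names are illustrative.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one mean per side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every threshold that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=20, learning_rate=0.1):
    """Each new stump is fitted to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)  # initial prediction: mean of the targets
    trees, preds = [], [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)
```

A random forest would instead fit each tree independently to a bootstrap sample of the original targets and average the results. Because every boosting round chases the remaining training error, the training fit keeps improving, which is the mechanism behind the overfitting risk noted above.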

Our literature review indicates that while GBMs, the precursor to XGBoost, have been implemented for bridges [73] and roads [59], XGBoost itself has not yet been applied to early construction cost estimation for bridges, roads, or tunnels. Nonetheless, XGBoost has shown promise in other areas of construction management. For instance, Coffie and Cudjoe [101] utilized XGBoost to predict construction cost overruns in Ghana, achieving commendable accuracy metrics. Similarly, a study by ForouzeshNejad et al. [102] demonstrated that XGBoost could enhance the reliability of project performance prediction in terms of cost and time overruns across various project types and geographical areas. These applications underscore XGBoost’s potential in modeling complex cost-related phenomena within construction projects.

3.3.7. K-Nearest Neighbors (KNN) Algorithm

The K-nearest neighbors (KNN) algorithm, introduced by Cover and Hart [103], is a non-parametric method used for classification and regression tasks. In regression applications, the output is the average of the values of its nearest neighbors. KNN is intuitive and effective, particularly when the data distribution is unknown. In construction management, KNN has been utilized for estimating construction costs and predicting project performance metrics. Our SLR found its application in road cost prediction [44]. In addition, a study employed KNN to predict time and cost overruns in construction projects in Jordan, highlighting its applicability in risk assessment and project planning [104]. The mathematical formulation can be found in [105].
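A minimal sketch of KNN regression, assuming Euclidean distance and an unweighted average of the neighbors; the function name and the data in the usage note are illustrative only.

```python
import math

def knn_predict(query, X_train, y_train, k=5):
    """Average the targets of the k training points closest to the query."""
    by_distance = sorted(
        (math.dist(query, x), y) for x, y in zip(X_train, y_train)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)
```

With training inputs `[[0], [1], [2], [10]]`, targets `[1, 2, 3, 100]`, a query of `[1]` and `k = 3`, the three closest points have targets 2, 1 and 3, so the prediction is 2.0 and the distant point has no influence.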

3.4. Model Evaluation Metrics

Evaluating the performance of cost estimation models for roads is essential to ensure their reliability and accuracy. This section describes the key metrics used in this study to assess model performance.

3.4.1. Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) measures the average magnitude of errors in predictions without considering their direction. It is defined as the average of the absolute differences between the predicted values and the actual values. Mathematically, MAE is expressed as:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |a_{i} - p_{i}|

(1)

where a_{i} is the actual cost of the i-th road project, p_{i} is its predicted cost, and n is the number of observed projects.
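Equation (1) can be sketched directly in code. The function name and the cost figures in the usage note are hypothetical and serve only to show the arithmetic.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted costs."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

For three hypothetical projects with actual costs of 40, 80 and 120 (EUR million) and predictions of 50, 70 and 120, the absolute errors are 10, 10 and 0, giving an MAE of about 6.67 in the same units.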

MAE provides an intuitive measure of average prediction error in the same units as the predicted variable. According to Kovacevic and Antoniou [28], lower MAE values indicate better model performance, and in principle MAE values below 10% of the average observed cost indicate a good model fit. Nevertheless, in the present study this threshold was relaxed to 30% of the average total cost across the full dataset.

The justification lies in the characteristics of the dataset; the 19 observed projects span a wide range of scope, size, and complexity, from average-sized highway roadworks to large-scale infrastructure involving major bridges and tunnels. This heterogeneity results in considerable variability in absolute and relative cost values, which in turn makes it challenging for a single predictive model to consistently achieve very low MAE values across the full spectrum of project types. By adopting a more flexible acceptance threshold, the evaluation allows for the identification of models that remain applicable across diverse project scales, rather than being narrowly optimized for a specific project size. This approach aligns with the intended use of the models as decision support tools for highway CAs, who require generalizable models for funding allocation and procurement strategy decisions across projects of varying magnitude.

3.4.2. R²

The coefficient of determination, R², evaluates the proportion of variance in the observed data that is explained by the prediction model. It is calculated as:

R^{2} = 1 - \frac{\sum_{i=1}^{n} (a_{i} - p_{i})^{2}}{\sum_{i=1}^{n} (a_{i} - \bar{a})^{2}}

(2)

where \bar{a} is the mean of the observed (actual) costs.

For a model that fits at least as well as the mean of the observed costs, R² lies between 0 and 1, with higher values indicating a better fit between predicted and observed costs. An R² above 0.7 is typically regarded as indicating strong explanatory power for cost prediction models, although acceptable thresholds can vary depending on the dataset and application context. Negative values of R² indicate predictions that are worse than simply using the mean cost as the estimate for every project, which may be due to an inappropriate model form (e.g., a linear model for non-linear data).
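The behavior of R², including the negative case, can be checked with a short sketch. It assumes the conventional definition, in which the denominator is the total sum of squares around the mean of the observed values (the form also used by scikit-learn's `r2_score`); the function name and the numbers are illustrative.

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

With actual values `[1, 2, 3]`, perfect predictions give 1, predicting the mean (2) everywhere gives 0, and the reversed predictions `[3, 2, 1]` give 1 - 8/2 = -3, i.e., worse than the mean.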

3.4.3. MAE % of Mean

The MAE % of Mean metric expresses the Mean Absolute Error (MAE) as a percentage of the average observed value, providing a normalized measure of prediction error relative to the dataset’s mean. It is calculated as:

\mathrm{MAE\%} = \frac{\mathrm{MAE}}{\bar{a}} \times 100

(3)

where \bar{a} is the mean of the observed costs.

Unlike MAPE, which normalizes errors by each individual observed value, MAE % of Mean normalizes by a single overall mean value, avoiding issues related to very small or zero observed values.

According to Burke [12], deviations of up to ±25% are generally acceptable during conceptual-feasibility phases. On the other hand, the Association for the Advancement of Cost Engineering (AACE) reports that cost estimates based on limited design input can exhibit accuracy ranging from –30% to +50% [13], while Hanioğlu [15] notes that order-of-magnitude estimates may deviate by as much as ±50–100%. In practice, parametric estimating methods, which are frequently used in this phase, typically achieve accuracy levels in the range of 30–40% [15]. Considering these references and the significant variation in scope and size across the 19 projects in the present dataset, a more flexible acceptance threshold of MAE% < 30% was adopted. This allows for identifying models that, while not meeting the strict criteria of detailed design estimates, remain sufficiently accurate and generalizable to support decision making in the early planning and procurement stages.

3.4.4. Mean Absolute Percentage Error (MAPE)

The Mean Absolute Percentage Error (MAPE) is a widely used metric that expresses the average absolute error as a percentage of each observed value. It is calculated as:

\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{a_{i} - p_{i}}{a_{i}} \right| \times 100

(4)

Unlike the MAE % of Mean metric, which normalizes errors by the overall mean of the dataset, MAPE normalizes each error by the corresponding observed value, offering a more granular view of model accuracy.
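The practical difference between the two normalizations can be seen in a small sketch. It assumes that MAE % of Mean divides by the mean of the observed costs, as the text describes; the function names and the two-project example are hypothetical.

```python
def mae_percent_of_mean(actual, predicted):
    """MAE divided by the mean observed value, as a percentage."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return mae / (sum(actual) / n) * 100.0

def mape(actual, predicted):
    """Mean of the per-project absolute percentage errors."""
    n = len(actual)
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n * 100.0
```

For one small and one large project with actual costs of 10 and 100 and predictions of 15 and 95, both absolute errors equal 5. MAE % of Mean is 5/55, about 9.1%, whereas MAPE averages 50% and 5% to give 27.5%, showing how errors on small projects weigh more heavily in MAPE.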

Again, in the context of early-stage road cost estimation, where only limited geometric information is known, higher error margins are generally tolerated, as discussed previously. It has also been noted that discrepancies of ±25–30% are widely considered acceptable in feasibility studies, with even larger deviations possible for order-of-magnitude estimates [12,13,15]. Consequently, in this study a MAPE threshold of 30% was chosen as the acceptance criterion. This is consistent with established practice in early-stage cost forecasting and reflects the practical requirements of public CAs who must make preliminary funding and procurement decisions based on limited project data.

4. Results

4.1. Models Tested

The total cost estimation models trained and tested in this study using each ML approach are summarized in Table 5. The study purposefully focuses on geometric descriptors that are consistently available during feasibility design phase. While other variables such as geological complexity, climatic factors, or regional labor/material markets may influence material, machinery and labor costs, they are either uncertain at the conceptual stage or affect contractor actual cost estimates rather than CA budget estimate needs during the feasibility phase. The selection of the IVs followed two complementary strategies. First, variables were chosen based on the corresponding author’s professional expertise, drawing on more than 22 years of experience in procurement and management of highway construction contracts. Second, additional models were constructed using IVs that exhibited the strongest correlations with the DV. In total five IV sets were defined and all were tested using seven ML methods. As a result 5 × 7 = 35 models were tested. The first two (a and b) were based on professional experience and the initial research hypothesis that emphasized specific geometric characteristics. Model a included a large set of IVs (11 in total), covering all extracted geometric variables. Model b refined this definition by (i) excluding the length of river training works, which appeared in only one dataset; (ii) using the total area of structures instead of three separate variables for the area of bridges, underpasses and overpasses; and (iii) removing the width of the main axis and service roads, as these depend on traffic load calculations, which may not be available at the conceptual design stage. Model c built upon Model b by adding the number of interchanges as an additional IV. Model d was defined solely from IVs with the highest Pearson correlation coefficients (>0.467). 
Finally, Model e refined Model d by replacing the area of bridges (already included in the total area of structures) with the total length of the main road, which is the most fundamental geometric characteristic despite its relatively modest correlation (0.278).
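The correlation-based selection used for Model d can be sketched as a simple filter. The 0.467 cut-off is taken from the text; the function names, variable names and toy data are hypothetical and do not reproduce the study's dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_by_correlation(features, target, threshold=0.467):
    """Keep the IVs whose correlation with the DV exceeds the threshold."""
    return [name for name, values in features.items()
            if pearson(values, target) > threshold]
```

With a toy target of `[10, 20, 30, 40]`, a feature `[0, 1, 2, 3]` (correlation 1) is retained, while a feature `[3, 1, 4, 2]` (correlation 0) is dropped.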

4.2. Development, Training and Testing of ML Models

The model developments were performed in Python 3.12.5, using common ML libraries including NumPy, Pandas and Scikit-learn. The enhanced support of Python libraries such as Scikit-learn and Pandas has made Python one of the most common choices for data analysis and ML applications. Python was selected as the primary programming language in this study due to its design philosophy, which emphasizes readability, concise syntax, and reusability across contexts [106].

To select the most appropriate hyperparameters for each model, trial and error was conducted for each parameter individually. The optimal configurations obtained for each machine learning approach are summarized below.

MLP-ANN:
Models a, b: MLPRegressor(hidden_layer_sizes = (3), activation = 'relu', max_iter = 3000, learning_rate_init = 0.001, random_state = 42)
Models c, d, e: same configuration as above, except hidden_layer_sizes = (10)
RBF-ANN:
All models: rbf_model = RBFNetwork(n_centers = 5); rbf_model.fit(X_train_scaled, Y_train); Y_pred_rbf = rbf_model.predict(X_test_scaled)
RF:
All models: RandomForestRegressor(n_estimators = 100, random_state = 42)
SVR:
All models: SVR(kernel = 'rbf', C = 100, gamma = 1)
XGBoost:
Models a, b: xgb.XGBRegressor(objective = 'reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, alpha = 10, n_estimators = 100, random_state = 42)
Models c, d, e: same as above, except learning_rate = 0.01, max_depth =
KNN:
All models: KNeighborsRegressor(n_neighbors = 5)

For the MLR, SVR and both ANN models, the raw input data were scaled prior to training, as these algorithms, or the interpretation of their parameters, are sensitive to the relative magnitude of the IVs. In MLR, unscaled inputs yield regression coefficients whose magnitudes reflect measurement units rather than relative influence, which hinders their comparison; in SVR, the distance-based kernel functions (e.g., RBF) depend directly on feature magnitudes; and in ANN, optimization can converge poorly if variables operate on vastly different scales. KNN can function without scaling, but because it is distance-based its performance may improve if variables are standardized; therefore, for consistency across models, KNN was also applied to scaled data. On the other hand, scaling was unnecessary for tree-based methods (RF, XGBoost), which partition the data based on feature thresholds and are thus scale-invariant.

Since the IVs were expressed in different units and magnitudes (e.g., length in kilometers, area in square meters, cost in millions), where necessary they were transformed using z-score standardization:

x^{'} = \frac{x - μ}{σ}

(5)

where x is the initial IV value, μ its mean, and σ its standard deviation. This ensured that all variables had a mean of zero and unit variance, allowing the regression model to treat each IV on an equal basis and improving numerical stability during computation.
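A minimal sketch of Equation (5), assuming the population standard deviation (the convention of scikit-learn's `StandardScaler`; the text does not state which variant was used). The function names and the sample lengths in the usage note are hypothetical.

```python
import statistics

def fit_scaler(train_values):
    """Mean and population standard deviation of one IV over the training set."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def standardize(x, mu, sigma):
    """z-score of a single value, using statistics fitted on the training data."""
    return (x - mu) / sigma
```

For hypothetical training lengths `[2, 4, 4, 4, 5, 5, 7, 9]` km the mean is 5 and the standard deviation 2, so a 9 km project maps to z = 2. Test or new projects would be transformed with the same training-set `mu` and `sigma`, not with statistics recomputed on the new data.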

4.3. Model Evaluation Metric Results and Analysis

4.3.1. Result Presentation

The performance of the total cost estimation models was initially evaluated using the four selected metrics, MAE, R², MAE%, and MAPE%. The detailed results for all tested models are presented in Table 6. To facilitate interpretation, Figure 2 presents the scatter plots of actual versus predicted values of total costs. This graphical approach highlights how closely each model approximates the ideal prediction. Points closer to the equal-value line indicate models with higher predictive accuracy than those further away.

Initial observations from Table 6 and Figure 2 indicate substantial variation in predictive accuracy across the Total Cost models. The first variable set (a), which incorporated all geometric IVs based on professional judgment, performed poorly overall, with most ML methods yielding negative R² values and high error percentages. However, the refinements applied in Model b, still grounded in professional experience but with more selective use of variables, led to a marked improvement, with MLR and MLP-ANN achieving R² values above 0.8 and substantially lower error percentages. Model c, which added interchanges, also performed reasonably well, particularly with MLR. By contrast, the correlation-driven models (d and e) showed mixed results; while e achieved acceptable accuracy under certain methods, both models were less stable overall. This suggests that professional expertise in variable selection contributed more consistently to predictive performance than reliance solely on statistical correlation, a finding that is visually reinforced in Figure 2, where the expert-based models produce predictions more tightly clustered around the 45° line.

In terms of comparison between the performance of the seven ML methods tested, MLR emerged as highly interpretable and robust on small samples, which is critical for public highway CAs. While MLP-ANN captured non-linearities, it required more tuning, was prone to overfitting in our small dataset, and therefore produced less stable results across projects. RBF-ANN trained quickly but was inconsistent and sensitive to dataset heterogeneity. SVR offered flexibility through kernel functions, but the kernel matrix used during model training could become ill-conditioned; hence, a large variation in the inputs could produce small variation in the final prediction. Although RF and XGBoost are strong in other contexts, they showed limited stability in our dataset due to the relatively small number of projects, which led them—particularly XGBoost—to overfit. KNN proved simple but highly sensitive to scaling and local density. This comparison reinforces the conclusion that easy-to-understand methods (MLR in particular) are best suited for feasibility-phase decision support by public CAs.

4.3.2. Model Ranking and Acceptance Procedure

Given the variation in performance across ML methods and model variable sets, a systematic ranking procedure was established to identify the best-performing models for further use. The procedure, summarized in Figure 3, followed a six-step approach as follows:

Step 1—Definition of acceptance thresholds

The following acceptance thresholds were defined for each model evaluation metric:

MAE < 30% of the mean Total Cost of the 19 data sample projects.
R² > 0.50.
MAE% < 30.
MAPE% < 30.

Step 2—Initial screening of models

Models that failed every one of the four acceptance thresholds were removed. From the original 35 trained and tested models (5 model definitions × 7 ML methods), only 16 models met at least one of the criteria and were retained for further evaluation, even if they scored 0 on one, two, or even three of the remaining metrics. This more flexible rule at the screening stage was adopted deliberately to avoid prematurely discarding models that might show strong performance on some metrics but weak performance on others, ensuring that potentially valuable candidates were still considered in the subsequent scoring and ranking stages.

Step 3—Model scoring against each metric

The remaining models were assigned a performance score (0–5) for each evaluation metric. A score of 0 was given if the threshold was not met, while scores from 1 (lowest acceptable) to 5 (best performance) were assigned according to the ranges shown in Table 7.

Step 4—Aggregation of scores

Each model’s scores across all evaluation metrics were averaged to obtain a single composite performance score.

Step 5—Ranking of models

Models were ranked in descending order of their average composite scores.

Step 6—Final selection for model validation

The top six models, which satisfied all acceptance thresholds and achieved a minimum score of 1 across every metric, were retained for cross-validation and comparative analysis.
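The six steps above can be sketched as a small scoring routine. The acceptance thresholds are those of Step 1, but the score bands below are hypothetical equal-width stand-ins for Table 7, which is not reproduced here, and all function names and example values are illustrative.

```python
# HYPOTHETICAL score bands standing in for Table 7.
ERROR_BANDS = [30, 24, 18, 12, 6]      # error metrics: 1 point per bound undercut
R2_BANDS = [0.5, 0.6, 0.7, 0.8, 0.9]   # R2: 1 point per bound exceeded

def score_error(value):
    """0 if the 30% threshold is missed, otherwise 1 (worst) to 5 (best)."""
    return sum(value < bound for bound in ERROR_BANDS)

def score_r2(value):
    """0 if R2 <= 0.50, otherwise 1 (worst) to 5 (best)."""
    return sum(value > bound for bound in R2_BANDS)

def rank_models(metrics_by_model):
    """metrics_by_model maps a name to (MAE as % of sample mean, R2, MAE%, MAPE)."""
    rows = []
    for name, (mae_pct_sample, r2, mae_pct, mape) in metrics_by_model.items():
        scores = [score_error(mae_pct_sample), score_r2(r2),
                  score_error(mae_pct), score_error(mape)]   # Step 3
        if not any(scores):                                  # Step 2: met no threshold
            continue
        composite = sum(scores) / len(scores)                # Step 4
        rows.append((name, composite, all(scores)))          # flag used in Step 6
    rows.sort(key=lambda row: row[1], reverse=True)          # Step 5
    return rows
```

Each returned row holds the model name, its composite score, and a flag indicating whether every threshold was met, so Step 6 reduces to keeping the top rows whose flag is true.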

4.3.3. Comparative Evaluation of Accepted Models

The resulting ranking of acceptable potential total cost estimation models following Step 5 is shown in Table 8, which summarizes the individual scores for each evaluation metric (MAE, R², MAE%, and MAPE%), the average composite score, and the final rank of each model.

The results show that Model c (MLR) achieved the highest average score (3.50), ranking first among all candidates. This was followed by Model b (MLP-ANN) with an average score of 3.25, and Model b (MLR) with 3.00. Models e (MLP-ANN, 2.75) and e (MLR, 2.00) also performed acceptably, albeit at lower ranks, while RBF-ANN-based models consistently underperformed, with average scores below 2.0 and, in some cases, failing entirely to meet minimum thresholds.

Two further insights can be drawn from these results. First, MLR models appear consistently among the highest ranks, confirming the robustness of linear methods when supported by carefully selected variables. Second, ANN performance was mixed; MLP-ANN achieved competitive results, particularly in Model b, whereas RBF-ANN proved unstable and unreliable.

The frequency of accepted models across both model variable sets and ML methods is summarized in Figure 4 and Figure 5. In terms of variable sets, Models b, c, and e dominate, while Model a appears only once. Regarding ML methods, SVR was the most frequently accepted method, followed by RBF-ANN and MLR, whereas tree-based methods contributed little to the final selection.

Overall, the results demonstrate that MLR and MLP-ANN models with expert-driven variable selection (b and c) provide the most reliable balance of accuracy, interpretability, and robustness for total cost estimation.

Based on these findings, the six top-ranking models were selected for cross-validation with independent project data. This choice ensured that only models which demonstrated both acceptable performance against all evaluation thresholds and reasonable stability across metrics were carried forward. While several lower-ranked models occasionally performed well on individual metrics, they lacked the consistency required for reliable external testing for use by public highway CAs during early-stage planning.

The next section will evaluate the generalization capacity of these models through cross-validation using data from two new road projects, providing an independent test of their practical applicability.

4.4. Cross-Validation of Selected Models Using Independent Project Data

The total cost estimation models were derived from a dataset of nineteen completed highway projects (fifteen for training and four for testing, with the split determined using a fixed random seed, random_state = 42, to ensure reproducibility). While this dataset captured a wide range of design characteristics, not all possible combinations of cost components were represented in the independent test set. To ensure a more rigorous evaluation of generalization capacity, six top-ranking models were selected for external cross-validation using two additional highway projects not included in either the training or testing datasets.

These two case-study projects, tendered in 2011 and 2022, were deliberately chosen to capture variation in size, structural complexity, and time. One project included tunnels, river bridges, and interchanges, representing a more complex configuration, while the other did not involve river bridges or tunnels, thus providing a simpler counterpart. This contrast ensured that the cross-validation covered both ends of the spectrum of possible project characteristics.

The actual total cost of work items and the technical characteristics for both projects were obtained from the tender documents for these projects as posted on the official website of EOSA, ensuring reliable inputs for validation.

The first project (X1), tendered in 2022 with a tender budget total cost of EUR 39,194,023.43 (2024 prices), involved a mainline length of 8.26 km with a standard cross-section in accordance with the Greek Road Design Guidelines (OMOE). At the feasibility phase, the following characteristics could have been established: an approximate mainline length of 8.5 km (L_m), an 11.17 km system of service roads based on visual inspection of local road maps (L_s), two interchanges (N_ic), one bridge with an estimated deck area of 2280 m² (A_b), four underpasses totaling 1500 m² (A_up), and approximately 42 culverts for preliminary drainage provision. No tunnels or major river bridges were required in this section.

The second project (X2), tendered in 2011 with a tender budget total cost of EUR 78,571,146.85 (2024 prices), had a mainline length of 16.2 km and a standard cross-section width of 22.5 m, also in accordance with OMOE. At the feasibility stage, the anticipated inputs included a 16 km mainline (L_m), a 32 km service road system to service both sides of the highway (L_s), three interchanges (N_ic), two ravine bridges of approximately 70 m and 30 m in length (A_b ≈ 1680 m²), six underpasses totaling 2250 m² (A_up), four overpasses totaling 2160 m² (A_op), and 80 culverts (N_c). In addition, a twin tunnel of approximately 1200 m single-bore length (L_t) was foreseen.

The definition of the IVs was based on a combination of tender documents and reasoned assumptions reflecting typical feasibility-stage information. Specifically, the length of the main axis (L_m), the areas of structures (A_b, A_up, A_op), and tunnel lengths (L_t) were taken directly from the tender documentation and rounded to the nearest practical value. The number of interchanges (N_ic) was also derived from the tender documents, as this information can be determined even at early design stages. By contrast, the length of service roads (L_s) was estimated as approximately double the mainline length, representing the common requirement for parallel service roads on both sides of the highway. Finally, the number of culverts (N_c) was calculated by assuming one culvert every 200 m of mainline length, which is consistent with preliminary drainage provision standards.
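The two rule-of-thumb assumptions described above can be written as a short helper. Rounding the culvert count down is an assumption made here, and the function and key names are hypothetical.

```python
def feasibility_inputs(mainline_km, culvert_spacing_m=200.0, service_road_factor=2.0):
    """Derive L_s and N_c from the mainline length using the stated rules of thumb."""
    return {
        "L_s_km": service_road_factor * mainline_km,          # roads on both sides
        "N_c": int(mainline_km * 1000 // culvert_spacing_m),  # one culvert per 200 m
    }
```

For the 16 km mainline of X2 this gives 32 km of service roads and 80 culverts, and for the 8.5 km mainline of X1 it gives 42 culverts, matching the values reported above. The service road length of X1 (11.17 km) was instead read from local road maps.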

Therefore, the model input variables shown in Table 9 were used in each of the selected IV sets b, c, and e, as appropriate to generate predicted total costs, which were compared with the actual project costs for cross-validation.

The cross-validation results, comparing predicted and actual total cost, are summarized in Table 10. The percentage deviations of predicted versus actual costs for each model in Projects X1 and X2 are illustrated in Figure 6, with black dashed lines indicating the ±30% feasibility thresholds.
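The deviation measure and the ±30% check can be expressed as follows. The sign convention (positive for overestimation) follows the way deviations are reported in this section; the function names and the example figures are hypothetical.

```python
def percent_deviation(predicted, actual):
    """Signed deviation of the prediction from the actual cost, in percent."""
    return (predicted - actual) / actual * 100.0

def within_band(predicted, actual, band_pct=30.0):
    """True if the absolute deviation does not exceed the feasibility band."""
    return abs(percent_deviation(predicted, actual)) <= band_pct
```

A prediction of 75.8 against an actual cost of 100 deviates by -24.2% and passes the check, whereas a prediction of 150 deviates by +50% and fails it.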

The results confirm that the c MLR model provided the best performance, with deviations of −24.2% for X1 and −3.6% for X2, both within the ±30% threshold, demonstrating both accuracy and stability under varied project conditions. To enhance transparency and usability, the final regression equation for total cost of work items (Y) is presented in Equation (6):

Y = 37,916,007.05 \cdot L_{m}^{'} - 8,296,717.99 \cdot L_{s}^{'} + 7,517,364.66 \cdot N_{ic}^{'} + 15,446,845.90 \cdot A_{st}^{'} + 46,184,784.74 \cdot L_{t}^{'} - 8,896,822.25 \cdot N_{c}^{'} + 82,977,372.26

(6)

The model was estimated using standardized inputs (L_{m}^{'}, L_{s}^{'}, N_{ic}^{'}, A_{st}^{'}, L_{t}^{'}, N_{c}^{'}); therefore, to replicate and apply the equation, all variables must first be standardized in accordance with Equation (5), using their mean (μ) and standard deviation (σ) over the 15 training samples, before applying the coefficients (w) provided in Table 11.
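Applying Equation (6) can be sketched as below. The coefficients and intercept are those of Equation (6); the means and standard deviations must be supplied from Table 11, which is not reproduced here, so the function takes them as arguments. The function name, dictionary keys and input units are assumptions and should follow Table 11.

```python
COEFFICIENTS = {
    "L_m": 37_916_007.05,
    "L_s": -8_296_717.99,
    "N_ic": 7_517_364.66,
    "A_st": 15_446_845.90,
    "L_t": 46_184_784.74,
    "N_c": -8_896_822.25,
}
INTERCEPT = 82_977_372.26

def predict_total_cost(raw_inputs, means, stds):
    """Standardize each IV with training statistics, then apply Equation (6)."""
    cost = INTERCEPT
    for name, weight in COEFFICIENTS.items():
        z = (raw_inputs[name] - means[name]) / stds[name]
        cost += weight * z
    return cost
```

Because the inputs are standardized, a project whose characteristics all equal the training means is predicted at the intercept (about EUR 82.98 million), and each coefficient is the change in cost for a one standard deviation increase in that variable.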

Among the predictors, tunnel length (L_t) and mainline length (L_m) carried the largest positive coefficients, confirming their dominant role in driving project costs. In contrast, the negative coefficients for service road length (L_s) and culverts (N_c) suggest their relatively weaker contribution once other major asset categories are included. For example, projects with many culverts, which are relatively low-cost items, typically do not include large, high-cost drainage or river-crossing structures, so their overall budgets tend to be lower than those of projects whose scopes are dominated by bridges.

Other models occasionally performed well for one project (e.g., e MLR slightly overestimating X2 by +5.2%) but failed for the other. Similarly, MLP-ANN models (e.g., b and e) provided estimates within the threshold for one project but exhibited significant underestimation for the other, undermining their robustness. By contrast, RBF-ANN models displayed large and inconsistent deviations, in some cases exceeding +50%, confirming their unreliability for practical application.

In conclusion, the cross-validation reinforces the earlier ranking results, identifying the c MLR model as the most reliable and practical tool for public highway CAs. Its reliance on conceptual and feasibility phase variables that are readily available in early planning, coupled with prediction accuracy that consistently falls within the ±30% industry-accepted feasibility range, makes it especially suitable for supporting funding allocation, procurement strategies, and early-stage investment decisions in road infrastructure development.

4.5. Model Application Procedure

To improve the utility of the findings for public highway CAs, the following steps can be followed to apply the proposed c MLR model in feasibility-stage cost estimation:

Step 1: Estimate inputs: mainline length (L_m), service road length (L_s), number of interchanges (N_ic), total area of structures (A_st), single-bore tunnel length (L_t), and number of culverts (N_c).
Step 2: Standardize variables using the means and standard deviations from Table 11.
Step 3: Apply coefficients from Equation (6) to obtain the predicted total cost of work items (2024 price level).
Step 4: Adjust to tender budget: In the Greek context, add 18% for overhead and profit and 9% for contingency. Other countries may adjust according to national procurement rules.
Step 5: Update prices: When applying the model after 2024 in Greece or across countries, revalue using appropriate harmonized price indices.

This stepwise procedure clarifies how public highway CAs can move from readily available feasibility inputs to a defensible tender budget estimate during the early feasibility or conceptual design phase.
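The five steps above can be sketched in Python as follows. This is an illustrative outline only: the means, standard deviations, coefficients, and intercept below are placeholders and are not the published Table 11 values, the presence of a constant term is an assumption, and the function names are hypothetical. Whether the 18% and 9% mark-ups are applied additively on the base cost or compounded is not specified here, so both options are exposed and the applicable procurement rules should be checked.

```python
# PLACEHOLDER (mean, std, coefficient) per variable: NOT the Table 11 values.
# Replace them with the published training-set statistics and weights.
PARAMS = {
    "L_m": (15000.0, 8000.0, 3.0e7),     # mainline length (m)
    "L_s": (17000.0, 15000.0, -5.0e6),   # service road length (m)
    "N_ic": (1.5, 0.7, 4.0e6),           # number of interchanges
    "A_st": (17000.0, 11000.0, 1.0e7),   # total area of structures (m2)
    "L_t": (3600.0, 3600.0, 3.5e7),      # single-bore tunnel length (m)
    "N_c": (46.0, 25.0, -3.0e6),         # number of culverts
}
INTERCEPT = 9.0e7  # placeholder; assumes the model includes a constant term


def standardize(value, mean, std):
    """Step 2: z-score using the training-set mean and standard deviation."""
    return (value - mean) / std


def predict_cost(inputs):
    """Step 3: linear combination of standardized inputs (2024 price level)."""
    total = INTERCEPT
    for name, (mean, std, weight) in PARAMS.items():
        total += weight * standardize(inputs[name], mean, std)
    return total


def tender_budget(cost, overhead_profit=0.18, contingency=0.09, compound=False):
    """Step 4: add overhead/profit and contingency (Greek defaults)."""
    if compound:
        return cost * (1.0 + overhead_profit) * (1.0 + contingency)
    return cost * (1.0 + overhead_profit + contingency)


def update_price(amount, index_base, index_target):
    """Step 5: revalue with a price index (base year to target year)."""
    return amount * index_target / index_base


# Step 1: hypothetical feasibility-stage inputs for one project.
project = {"L_m": 20000.0, "L_s": 10000.0, "N_ic": 2,
           "A_st": 25000.0, "L_t": 1500.0, "N_c": 40}

cost_2024 = predict_cost(project)
budget_2024 = tender_budget(cost_2024)
budget_later = update_price(budget_2024, index_base=100.0, index_target=103.0)
print(f"Work items: {cost_2024:,.0f}  Tender budget: {budget_2024:,.0f}  "
      f"Revalued: {budget_later:,.0f}")
```

Keeping the parameters in a single table makes it straightforward to swap in the published values, or values re-estimated on another national dataset, without touching the calculation steps.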

5. Conclusions

As highlighted in the literature review (Table 2), previous studies on early-stage highway cost estimation have largely focused on pavement cross-section characteristics or general project classifications, while omitting one or more of the key asset categories. In particular, most road cost prediction models focused on pavements and did not incorporate tunnels, major structures or interchanges and frequently excluded culverts. Moreover, several approaches relied on variables such as earthwork quantities or contract duration, which are difficult to determine reliably at the feasibility stage, thereby limiting their applicability during early planning. Against this backdrop, the present study is novel in that it proposes the first set of early-stage cost estimation models for highway projects that integrate all three major asset categories—interchanges, bridges, and tunnels—within a single predictive framework based solely on geometric information available at the conceptual phase. In addition, this research is original in avoiding reliance on complex optimization or preprocessing tools such as PCA and PSO, which reduce transparency and limit practical usability for public highway infrastructure CAs. Instead, the proposed models are grounded in readily available early-phase inputs and designed for direct applicability during feasibility studies, where time, data, and design detail are inherently constrained.

Through the systematic development and testing of 35 models (five different geometric IV sets with seven ML methods), this research established a structured model ranking and acceptance procedure based on four metrics (MAE, R², MAE%, MAPE%) and clearly defined acceptance thresholds. Application of this framework resulted in the acceptance of six top-ranking models, which were then cross-validated using data from two case-study projects. Among these, MLR consistently achieved the highest performance, confirming the robustness of linear approaches when applied with carefully selected conceptual-phase geometrical variables. MLP-ANN models also demonstrated competitive performance in some variable sets, though with greater variability. By contrast, more complex algorithms such as RBF-ANN, SVR, RF, XGBoost, and KNN were found to be less reliable, often failing to meet one or more of the acceptance thresholds.

Importantly, the results also revealed that expert-driven variable selection strategies (variable sets a, b, and c) produced more stable and accurate outcomes than purely correlation-driven strategies (d and e), underscoring the added value of engineering judgment at the feasibility stage. These findings highlight the value of transparent and interpretable methods, particularly MLR, which balance predictive accuracy with practical applicability for early-stage cost estimation. They also support the observation of Gardner et al. [16] that incorporating additional variables does not necessarily improve conceptual cost estimates. The cross-validation confirmed this conclusion, with the c MLR model emerging as the most reliable and practically applicable tool for highway CAs, delivering stable accuracy within the ±30% feasibility margin across two independent projects.

This study makes a valuable contribution by delivering the first practical MLR model of total cost of work items for conceptual-phase highway projects that integrates all major asset categories using only readily available geometric information at the conceptual-feasibility phase. The c MLR model proved to offer the best balance of accuracy, robustness, and interpretability. This makes it a particularly appropriate tool for public highway contracting authorities, who must make preliminary funding allocation and procurement decisions under conditions of limited design information. Crucially, the full MLR model, including its coefficients and standardized parameters, is provided, enabling direct replication and immediate use by CAs. The model output represents the total cost of work items at 2024 price levels; in the Greek context, this serves as the basis for the tender budget, to which authorities must by law add 18% for contractor overhead and profit and 9% for contingencies. For application in other national settings, results would need to be revalued using appropriate harmonized indices of consumer prices (Eurostat) to reflect temporal adjustments (e.g., 2024 to 2025) and cross-country comparability. Although developed using data from Greek highway projects, the overall structure of the proposed framework is flexible and transferable to other contexts, provided suitable adjustments are made for differences in regulatory frameworks, procurement systems, and cost structures. It should be noted that the goodness-of-fit of the proposed MLR model reflects its feasibility-phase application. As project conditions evolve and more detailed designs become available, prediction accuracy will naturally improve through other estimation methods tailored to later phases of the project lifecycle. The present model is therefore intended as an early decision support tool rather than a replacement for detailed design-based estimates.

This study acknowledges two key limitations: the restricted dataset size of 19 projects and the manual, trial-and-error approach to hyperparameter selection adopted for brevity. While adequate for a proof of concept, these limitations highlight avenues for further research. First, expanding the dataset to cover a larger number of projects across diverse geographical regions and project types would improve robustness and enhance generalizability. Second, more systematic methods such as Bayesian optimization could be applied for hyperparameter tuning, although the benefit of such computationally intensive techniques may be constrained by the relatively small sample sizes. An additional limitation is that the proposed model is most reliable when applied to projects containing a balanced mix of roadworks, interchanges, bridges, and tunnels. In contrast, projects dominated by unusually long bridges or tunnels may be better served by specialized cost prediction models already developed in the literature for the Greek construction industry (e.g., Petroutsatou et al. [64,66], Antoniou et al. [67], and Fragkakis et al. [68]).

Beyond these methodological extensions, future work could also explore hybrid approaches that combine the strengths of different ML methods, as well as the possibility of developing category-based models whose outputs could be aggregated for total cost prediction. In addition, the inclusion of supplementary project descriptors (for example, geotechnical complexity, climate, seismic zone or traffic demand indicators) may strengthen explanatory power where such data are available during the feasibility design stage. Finally, there is considerable potential to integrate the proposed models into BIM- and GIS-based road planning environments. At the conceptual stage, such models could be directly linked to a line representation of the road corridor on a GIS map, providing contracting authorities with automatic cost estimates as alignments and layout options are explored. In the broader context of further research, we emphasize the importance of coordinated international efforts to systematically collect and share project-level data. A joint EU–USA or wider global initiative could significantly expand the available evidence base, enabling continuous refinement of feasibility-stage cost prediction models such as the ones presented here and supporting more transparent and reliable cost estimation efforts worldwide.

This research demonstrates that highway cost estimation based on readily available geometric estimates during the conceptual phase can be both accurate and practical. It was shown that simple, expert-informed approaches outperform more complex techniques, thereby challenging prevailing assumptions in the literature while also delivering a tool of immediate relevance to CAs. By combining a transparent model evaluation framework with the delivery of a fully specified, ready-to-use MLR model, this study bridges the gap between academic research and practical decision making, providing CAs with a reliable foundation for more efficient planning, funding, and procurement of road infrastructure.

Author Contributions

Conceptualization, F.A.; methodology, F.A.; software, K.K.; validation, F.A. and K.K.; formal analysis, F.A. and K.K.; investigation, F.A.; resources, F.A. and K.K.; data curation, F.A. and K.K.; writing—original draft preparation, F.A. and K.K.; writing—review and editing, F.A.; visualization, F.A. and K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are available on request with privacy restrictions. The data are not publicly available due to contractual and GDPR constraints.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. European Commission: Directorate-General for Mobility and Transport. EU Transport in Figures: Statistical Pocketbook 2021; Publications Office: Luxembourg, 2021.
2. Barakchi, M.; Torp, O.; Belay, A.M. Cost Estimation Methods for Transport Infrastructure: A Systematic Literature Review. Procedia Eng. 2017, 196, 270–277.
3. Chou, J.-S.; O’Connor, J.T. Internet-Based Preliminary Highway Construction Cost Estimating Database. Autom. Constr. 2007, 17, 65–74.
4. Bakhshi, P.; Touran, A. An Overview of Budget Contingency Calculation Methods in Construction Industry. Procedia Eng. 2014, 85, 52–60.
5. Antoniou, F.; Konstantinidis, D.; Aretoulis, G.; Xenidis, Y. Preliminary Construction Cost Estimates for Motorway Underpass Bridges. Int. J. Constr. Manag. 2017, 18, 321–330.
6. Shim, H.S.; Lee, S.H. Developing a Probable Cost Analysis Model for Comparing Bridge Deck Rehabilitation Methods. KSCE J. Civ. Eng. 2016, 20, 68–76.
7. Vagdatli, T.; Petroutsatou, K.; Panetsos, P.; Barmpa, Z.; Fragkakis, N. Bayesian Pre-Estimation of Bridge Life-Cycle Costs. In Life-Cycle of Structures and Infrastructure Systems; CRC Press: London, UK, 2023; pp. 3252–3259.
8. Gadiraju, D.S.; Muthiah, S.R.; Khazanchi, D. A Deep Reinforcement Learning Based Approach for Bridge Health Maintenance. In Proceedings of the 2023 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 13–15 December 2023; pp. 43–48.
9. Petroutsatou, K.; Vagdatli, T.; Louloudakis, N.; Panetsos, P. Life-Cycle Maintenance Cost Model for Concrete Bridges Using Markovian Deterioration Curves. Buildings 2025, 15, 807.
10. Fragkakis, N.; Petroutsatou, K.; Marinelli, M. Preliminary Cost Estimate Model for Road Underpasses. In Proceedings of the Eighth International Conference on Construction in the 21st Century (CITC-8) “Changing the Field: Recent Developments for the Future of Engineering and Construction”, Thessaloniki, Greece, 27 May 2015.
11. Flyvbjerg, B.; Holm, M.S.; Buhl, S. Underestimating Costs in Public Works Projects: Error or Lie? J. Am. Plan. Assoc. 2002, 68, 279–295.
12. Burke, R. Project Management—Tools and Techniques; Burke Publishing: London, UK, 2010.
13. Association for the Advancement of Cost Engineering AACE. 18R-97: Cost Estimate Classification System—As Applied in Engineering, Procurement, and Construction for the Process Industries; AACE International: Fairmont, WV, USA, 2005.
14. Heralova, R.S.; Hromada, E.; Johnston, H. Cost Structure of the Highway Projects in the Czech Republic. Procedia Eng. 2014, 85, 222–230.
15. Hanioğlu, M.N. A Cost Based Approach to Project Management Planning and Controlling Construction Project Costs; Routledge: New York, NY, USA, 2023.
16. Gardner, B.J.; Gransberg, D.D.; Jeong, H.D. Reducing Data-Collection Efforts for Conceptual Cost Estimating at a Highway Agency. J. Constr. Eng. Manag. 2016, 142, 04016057.
17. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. BMJ 2009, 339, 332–336.
18. Halevi, G.; Moed, H.; Bar-Ilan, J. Suitability of Google Scholar as a Source of Scientific Information and as a Source of Data for Scientific Evaluation—Review of the Literature. J. Inf. 2017, 11, 823–834.
19. Pranckutė, R. Web of Science (WoS) and Scopus: The Titans of Bibliographic Information in Today’s Academic World. Publications 2021, 9, 12.
20. Bar-Ilan, J. Which H-Index?—A Comparison of WoS, Scopus and Google Scholar. Scientometrics 2008, 74, 257–271.
21. Antoniou, F.; Aretoulis, G.; Giannoulakis, D.; Konstantinidis, D. Cost and Material Quantities Prediction Models for the Construction of Underground Metro Stations. Buildings 2023, 13, 382.
22. Rostami, J.; Sepehrmanesh, M.; Gharahbagh, E.A.; Mojtabai, N. Planning Level Tunnel Cost Estimation Based on Statistical Analysis of Historical Data. Tunn. Undergr. Space Technol. 2013, 33, 22–33.
23. Mahmoodzadeh, A.; Nejati, H.R.; Mohammadi, M.; Hashim Ibrahim, H.; Khishe, M.; Rashidi, S.; Hussein Mohammed, A. Developing Six Hybrid Machine Learning Models Based on Gaussian Process Regression and Meta-Heuristic Optimization Algorithms for Prediction of Duration and Cost of Road Tunnels Construction. Tunn. Undergr. Space Technol. 2022, 130, 104759.
24. Mahmoodzadeh, A.; Mohammadi, M.; Abdulhamid, S.N.; Ibrahim, H.H.; Ali, H.F.H.; Nejati, H.R.; Rashidi, S. Prediction of Duration and Construction Cost of Road Tunnels Using Gaussian Process Regression. Geomech. Eng. 2022, 28, 65–75.
25. Mahmoodzadeh, A.; Nejati, H.R.; Mohammadi, M. Optimized Machine Learning Modelling for Predicting the Construction Cost and Duration of Tunnelling Projects. Autom. Constr. 2022, 139, 104305.
26. Mahmoodzadeh, A.; Mohammadi, M.; Ali, H.F.H.; Salim, S.G.; Abdulhamid, S.N.; Ibrahim, H.H.; Rashidi, S.A. Markov-Based Prediction Model of Tunnel Geology, Construction Time, and Construction Costs. Geomech. Eng. 2022, 28, 421–435.
27. Sodikov, J. Cost Estimation of Highway Projects in Developing Countries: Artificial Neural Network Approach. J. East. Asia Soc. Transp. Stud. 2005, 6, 1036–1047.
28. Kovačević, M.; Antoniou, F. Machine-Learning-Based Consumption Estimation of Prestressed Steel for Prestressed Concrete Bridge Construction. Buildings 2023, 13, 1187.
29. Chou, J.-S. Generalized Linear Model-Based Expert System for Estimating the Cost of Transportation Projects. Expert Syst. Appl. 2009, 36, 4253–4267.
30. Chou, J.-S. Cost Simulation in an Item-Based Project Involving Construction Engineering and Management. Int. J. Proj. Manag. 2011, 29, 706–717.
31. He, J.; Qi, Z.; Hang, W.; Zhao, C.; King, M. Predicting Freeway Pavement Construction Cost Using a Back-Propagation Neural Network: A Case Study in Henan, China. Balt. J. Road Bridge Eng. 2014, 9, 66–76.
32. Swei, O.; Gregory, J.; Kirchain, R. Construction Cost Estimation: A Parametric Approach for Better Estimates of Expected Cost and Variation. Transp. Res. Part B Methodol. 2017, 101, 295–305.
33. Gante, D.V.; Silva, D.L.; Leopoldo, M.P. Forecasting Construction Cost Using Artificial Neural Network for Road Projects in the Department of Public Works and Highways Region XI. In Frontiers in Artificial Intelligence and Applications; IOS Press B.V.: Amsterdam, The Netherlands, 2022; Volume 352, pp. 64–71.
34. Wilmot, C.G.; Mei, B. Neural Network Modeling of Highway Construction Costs. J. Constr. Eng. Manag. 2005, 131, 765–771.
35. Hammad, A.A.A.; Alhaj Ali, S.M.; Sweis, G.J.; Bashir, A. Prediction Model for Construction Cost and Duration in Jordan. Jordan J. Civ. Eng. 2008, 2, 250–266.
36. Pewdum, W.; Rujirayanyong, T.; Sooksatra, V. Forecasting Final Budget and Duration of Highway Construction Projects. Eng. Constr. Archit. Manag. 2009, 16, 544–557.
37. Peško, I.; Mučenski, V.; Šešlija, M.; Radović, N.; Vujkov, A.; Bibić, D.; Krklješ, M. Estimation of Costs and Durations of Construction of Urban Roads Using ANN and SVM. Complexity 2017, 2017, 2450370.
38. Aretoulis, G.N. Neural Network Models for Actual Cost Prediction in Greek Public Highway Projects. Int. J. Proj. Organ. Manag. 2019, 11, 41–64.
39. Birhanu Belete, S.; Getnet Meharie, M.; Getawa Ayalew, G. Development of a Mathematical Model to Predict the Cost of Ethiopian Roads Authority Road Projects: The Case of Ethiopia. Cogent Eng. 2024, 11, 2297492.
40. Asmar, M.E.; Hanna, A.S.; Whited, G.C. New Approach to Developing Conceptual Cost Estimates for Highway Projects. J. Constr. Eng. Manag. 2011, 137, 942–949.
41. Mahamid, I. Early Cost Estimating for Road Construction Projects Using Multiple Regression Techniques. Constr. Econ. Build. 2011, 11, 87–101.
42. Kim, S. Hybrid Forecasting System Based on Case-Based Reasoning and Analytic Hierarchy Process for Cost Estimation. J. Civ. Eng. Manag. 2013, 19, 86–96.
43. Mahamid, I. Conceptual Cost Estimate of Road Construction Projects in Saudi Arabia. Jordan J. Civ. Eng. 2013, 7, 285–294.
44. Ghadbhan Abed, Y.; Mohammed Hasan, T.; Nihad Zehawi, R. Cost Prediction for Roads Construction Using Machine Learning Models. Int. J. Electr. Comput. Eng. Syst. 2022, 13, 927–936.
45. Abd, A.M.; Kareem, Y.A.; Zehawi, R.N. Prediction and Estimation of Highway Construction Cost Using Machine Learning. Eng. Technol. Appl. Sci. Res. 2024, 14, 17222–17231.
46. Chou, J.-S.; Peng, M.; Persad, K.; O’Connor, J. Quantity-Based Approach to Preliminary Cost Estimates for Highway Projects. Transp. Res. Rec. J. Transp. Res. Board 2006, 1946, 22–30.
47. Kang, T.K.; Park, W.; Lee, Y.S. Development of CBR-Based Road Construction Project Cost Estimation System. In Proceedings of the Conference: 28th International Symposium on Automation and Robotics in Construction, Seoul, Republic of Korea, 29 June–2 July 2011; International Association for Automation & Robotics in Construction (IAARC): Oulu, Finland, 2011; pp. 1314–1319.
48. Adeli, H.; Wu, M. Regularization Neural Network for Construction Cost Estimation. J. Constr. Eng. Manag. 1998, 124, 18–24.
49. Hegazy, T.; Ayed, A. Neural Network Model for Parametric Cost Estimation of Highway Projects. J. Constr. Eng. Manag. 1998, 124, 210–218.
50. Kim, K.J.; Kim, K. Preliminary Cost Estimation Model Using Case-Based Reasoning and Genetic Algorithms. J. Comput. Civ. Eng. 2010, 24, 499–505.
51. Park, T.; Kang, T.; Lee, Y.; Seo, K. Project Cost Estimation of National Road in Preliminary Feasibility Stage Using BIM/GIS Platform. In Proceedings of the Computing in Civil and Building Engineering, Orlando, FL, USA, 23–25 June 2014; American Society of Civil Engineers: Reston, VA, USA, 2014; pp. 423–430.
52. Warren, J.; Allen, D.; Storey, J. Systematic Cost Estimating Tool for the Mississippi Department of Transportation. In Proceedings of the IISE Annual Conference and Expo 2022, Seattle, WA, USA, 21–24 May 2022; Institute of Industrial & Systems Engineers (IISE): Norcross, GA, USA, 2022.
53. Zhang, Y.; Minchin, R.E.; Flood, I.; Ries, R.J. Preliminary Cost Estimation of Highway Projects Using Statistical Learning Methods. J. Constr. Eng. Manag. 2023, 149, 04023026.
54. Wang, C.; Li, Z. Research on the Construction of Highway Engineering Cost Prediction Model Based on GA-BP Algorithm. Lect. Notes Electr. Eng. 2024, 1216, 271–279.
55. Feng, F. Cost Prediction of Municipal Road Engineering Based on Optimization of SVM Parameters by RF-WPA Hybrid Algorithm. In Lecture Notes on Data Engineering and Communications Technologies; Springer Nature: Cham, Switzerland, 2022; Volume 138, pp. 86–93.
56. Lin, W.P.; Techapeeraparnich, W. Model for Predicting Cost of Rural Road Projects in Thailand. IOP Conf. Ser. Mater. Sci. Eng. 2019, 652, 012004.
57. Mahalakshmi, G.; Rajasekaran, C. Early Cost Estimation of Highway Projects in India Using Artificial Neural Network. In Sustainable Construction and Building Materials: Select Proceedings of ICSCBM 2018; Springer: Singapore, 2019; pp. 659–672.
58. Meharie, M.G.; Shaik, N. Predicting Highway Construction Costs: Comparison of the Performance of Random Forest, Neural Network and Support Vector Machine Models. J. Soft Comput. Civ. Eng. 2020, 4, 103–112.
59. Meharie, M.G.; Mengesha, W.J.; Gariy, Z.A.; Mutuku, R.N.N. Application of Stacking Ensemble Machine Learning Algorithm in Predicting the Cost of Highway Construction Projects. Eng. Constr. Archit. Manag. 2022, 29, 2836–2853.
60. Mohamed, B.; Moselhi, O. Conceptual Estimation of Construction Duration and Cost of Public Highway Projects. J. Inf. Technol. Constr. 2022, 27, 595–618.
61. Hoffmann, M.; Donev, V. Reliable Estimation of Investment and Life-Cycle Costs from Road Projects to Single Road Assets. In Life-Cycle of Structures and Infrastructure Systems; CRC Press: London, UK, 2023; pp. 3284–3291.
62. Wang, X.; Liu, S.; Zhang, L. Highway Cost Prediction Based on LSSVM Optimized by Initial Parameters. Comput. Syst. Sci. Eng. 2021, 36, 259–269.
63. Bouabaz, M.; Hamami, M. A Cost Estimation Model for Repair Bridges Based on Artificial Neural Network. Am. J. Appl. Sci. 2008, 5, 334–339.
64. Petroutsatou, K.; Lambropoulos, S. Road Tunnels Construction Cost Estimation: A Structural Equation Model Development and Comparison. Oper. Res. 2010, 10, 163–173.
65. Fragkakis, N.; Lambropoulos, S.; Tsiambaos, G. Parametric Model for Conceptual Cost Estimation of Concrete Bridge Foundations. J. Infrastruct. Syst. 2011, 17, 66–74.
66. Petroutsatou, K.; Georgopoulos, E.; Lambropoulos, S.; Pantouvakis, J.P. Early Cost Estimating of Road Tunnel Construction Using Neural Networks. J. Constr. Eng. Manag. 2012, 138, 679–687.
67. Antoniou, F.; Konstantinidis, D.; Aretoulis, G. Analytical Formulation for Early Cost Estimation and Material Consumption of Road Overpass Bridges. Res. J. Appl. Sci. Eng. Technol. 2016, 12, 716–725.
68. Fragkakis, N.; Lambropoulos, S.; Pantouvakis, J.-P. A Computer-Aided Conceptual Cost Estimating System for Pre-Stressed Concrete Road Bridges. In Civil and Environmental Engineering: Concepts, Methodologies, Tools, and Applications; IGI Global: Hershey, PA, USA, 2016; Volume 2, pp. 563–575.
69. Zhang, Y.; Minchin, R.E. Forecasting Conceptual Costs of Bridge Projects Using Non-Parametric Regression Analysis. Proc. Int. Struct. Eng. Constr. 2017, 4, 35.
70. Juszczyk, M. On the Search of Models for Early Cost Estimates of Bridges: An SVM-Based Approach. Buildings 2019, 10, 2.
71. Juszczyk, M. Early Cost Estimates of Bridge Structures Aided by Artificial Neural Networks. In International Scientific Siberian Transport Forum; Springer International Publishing: Cham, Switzerland, 2020; pp. 10–20.
72. Petroutsatou, K.; Maravas, A.; Saramourtsis, A. A Life Cycle Model for Estimating Road Tunnel Cost. Tunn. Undergr. Space Technol. 2021, 111, 103858.
73. Miljan, K.; Ivanišević, N.; Petronijević, P. Construction Cost Estimation of Reinforced and Prestressed Concrete Bridges Using Machine Learning. J. Croat. Assoc. Civ. Eng. 2021, 73, 1–13.
74. Markiz, N.; Jrade, A. Integrating an Expert System with BrIMS, Cost Estimation, and Linear Scheduling at Conceptual Design Stage of Bridge Projects. Int. J. Constr. Manag. 2022, 22, 913–928.
75. Liu, S.; Hou, D. Construction Cost Prediction of Main Tunnel in Railway Tunnel Based on Support Vector Machine. J. Railw. Eng. Soc. 2022, 39, 108–113.
76. Zhou, J.-Q.; Liu, Q.-M.; Ma, C.-X.; Li, D. Cost Prediction of Tunnel Construction Based on Interpretative Structural Model and Stacked Autoencoder. Eng. Lett. 2024, 32, 1966–1980.
77. Vagdatli, T.; Petroutsatou, K.; Panetsos, P.; Barmpa, Z.; Fragkakis, N. Probabilistic Pre-Estimation of Life-Cycle Costs of Road Bridges Using Dynamic Bayesian Networks. Struct. Infrastruct. Eng. 2024, 20, 1–20.
78. Helaly, H.; El-Rayes, K.; Ignacio, E.-J.; Joan, H.J. Comparison of Machine-Learning Algorithms for Estimating Cost of Conventional and Accelerated Bridge Construction Methods during Early Design Phase. J. Constr. Eng. Manag. 2025, 151, 04025004.
79. Liolios, A.; Kotoulas, D.; Antoniou, F.; Konstantinidis, D. Egnatia Motorway Bridge Management Systems for Design, Construction and Maintenance. In Proceedings of the 3rd International Conference on Bridge Maintenance, Safety and Management—Bridge Maintenance, Safety, Management, Life-Cycle Performance and Cost, Porto, Portugal, 16–19 July 2006; pp. 135–137.
80. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-Propagating Errors. Nature 1986, 323, 533–536.
81. Basheer, I.A.; Hajmeer, M. Artificial Neural Networks: Fundamentals, Computing, Design, and Application. J. Microbiol. Methods 2000, 43, 3–31.
82. Ramchoun, H.; Amine, M.; Idrissi, J.; Ghanou, Y.; Ettaouil, M. Multilayer Perceptron: Architecture Optimization and Training. Int. J. Interact. Multimed. Artif. Intell. 2016, 4, 26.
83. Kulkarni, P.S.; Londhe, S.N.; Deo, M.C. Artificial Neural Networks for Construction Management: A Review. J. Soft Comput. Civ. Eng. 2017, 1, 70–88.
84. Papadimitriou, V.E.; Aretoulis, G.N. A Final Cost Estimating Model for Building Renovation Projects. Buildings 2024, 14, 1072.
85. Kim, G.-H.; Yoon, J.-E.; An, S.-H.; Cho, H.-H.; Kang, K.-I. Neural Network Model Incorporating a Genetic Algorithm in Estimating Construction Costs. Build. Environ. 2004, 39, 1333–1340.
86. Bayram, S.; Ocal, M.E.; Laptali Oral, E.; Atis, C.D. Comparison of MultiLayer Perceptron (MLP) and Radial Basis Function (RBF) for Construction Cost Estimation: The Case of Turkey. J. Constr. Eng. Manag. 2015, 22, 480–490.
87. Grigoras, A.E.; Aretoulis, G.N.; Antoniou, F.; Karatzas, S. Application of Artificial Neural Networks for the Prediction of Cashflows in Public Road Works. In Financial Evaluation and Risk Management of Infrastructure Projects; IGI Global: Hershey, PA, USA, 2023; pp. 101–130. ISBN 9781668477878.
88. Broomhead, D.S.; Lowe, D. Multivariable Functional Interpolation and Adaptive Networks. Complex Syst. 1988, 2, 321–355.
89. Yu, S.; Zhu, K.; Gao, S. A Hybrid MPSO-BP Structure Adaptive Algorithm for RBFNs. Neural Comput. Appl. 2009, 18, 769–779.
90. Galton, F. Regression towards Mediocrity in Hereditary Stature. J. Anthropol. Inst. Great Br. Irel. 1886, 15, 246–263.
91. Dang-Trinh, N.; Duc-Thang, P.; Nguyen-Ngoc Cuong, T.; Duc-Hoc, T. Machine Learning Models for Estimating Preliminary Factory Construction Cost: Case Study in Southern Vietnam. Int. J. Constr. Manag. 2023, 23, 2879–2887.
92. Kantianis, D.D. Design Morphology Complexity and Conceptual Building Project Cost Forecasting. J. Financ. Manag. Prop. Constr. 2022, 27, 387–414.
93. Sharma, S.; Goyal, P.K. Fuzzy Assessment of the Risk Factors Causing Cost Overrun in Construction Industry. Evol. Intell. 2022, 15, 2269–2281.
94. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
95. Alshboul, O.; Shehadeh, A.; Almasabha, G.; Almuflih, A.S. Extreme Gradient Boosting-Based Machine Learning Approach for Green Building Cost Prediction. Sustainability 2022, 14, 6651.
96. Goel, S.; Oberoi, S.; Vats, A. Construction Cost Estimator: An Effective Approach to Estimate the Cost of Construction in Metropolitan Areas. In Proceedings of the 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, 17–18 December 2021; pp. 122–127.
97. Ye, X.; Ding, P.; Jin, D.; Zhou, C.; Li, Y.; Zhang, J. Intelligent Analysis of Construction Costs of Shield Tunneling in Complex Geological Conditions by Machine Learning Method. Mathematics 2023, 11, 1423.
98. Liaw, A.; Wiener, M. Classification and Regression by RandomForest. R News 2002, 2, 18–22.
99. Smola, A.J.; Schölkopf, B. A Tutorial on Support Vector Regression. Stat. Comput. 2004, 14, 199–222.
100. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
101. Coffie, G.H.; Cudjoe, S.K.F. Using Extreme Gradient Boosting (XGBoost) Machine Learning to Predict Construction Cost Overruns. Int. J. Constr. Manag. 2024, 24, 1742–1750.
102. ForouzeshNejad, A.A.; Arabikhan, F.; Aheleroff, S. Optimizing Project Time and Cost Prediction Using a Hybrid XGBoost and Simulated Annealing Algorithm. Machines 2024, 12, 867.
103. Cover, T.M.; Hart, P.E. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
104. Arabiat, A.; Al-Bdour, H.; Bisharah, M. Predicting the Construction Projects Time and Cost Overruns Using K-Nearest Neighbor and Artificial Neural Network: A Case Study from Jordan. Asian J. Civ. Eng. 2023, 24, 2405–2414.
105. James, G.; Witten, D.; Hastie, T.; Tibshirani, R.; Taylor, J. An Introduction to Statistical Learning with Applications in Python; Springer: Cham, Switzerland, 2023.
106. McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython; O’Reilly Media, Inc.: Santa Rosa, CA, USA, 2018; ISBN 9781491957660.

Figure 1. PRISMA flowchart for screening and selecting research documents.

Figure 2. Actual versus predicted Total Cost values (for ML models per IV variable set [a–e]).

Figure 3. Flowchart of the model ranking and acceptance procedure.

Figure 4. Count of accepted models per variable set.

Figure 5. Count of accepted models per ML method.

Figure 6. Percentage deviations of predicted versus actual costs (Projects X1 and X2).

Table 3. Dataset descriptive statistics.

Variable	N	Minimum	Maximum	Mean	Std. Deviation
Length of Main Axis (m)	19	1900.00	32,610.00	14,856.21	8435.85
Width of Main Axis (m)	19	22.00	27.00	24.52	0.86
Area of Main Axis (m²)	19	45,600.00	811,032.00	364,176.64	208,449.02
Length of Service Roads (m)	18	63.30	59,211.00	17,044.84	15,327.77
Width of Service Roads (m)	18	4.95	7.50	6.33	0.68
Area of Service Roads (m²)	19	411.45	357,302.00	107,811.06	97,885.08
Total road area (main + service) (m²)	19	125,766.50	1,168,334.20	466,313.44	284,562.64
Number of Interchanges	14	1.00	3.00	1.57	0.65
Area of bridges (m²)	19	420.00	34,157.10	10,782.17	9006.64
Area of underpasses (m²)	17	627.33	10,890.00	3611.01	3129.30
Area of overpasses (m²)	16	540.00	25,146.00	4004.51	6018.26
Sum of area of all structures (m²)	19	3135.91	42,477.60	17,385.30	11,672.47
Total length of single bore tunnels (m)	7	280.00	9030.00	3637.92	3647.64
Number of culverts	17	6.00	90.00	46.29	24.64
Length of river training (m)	1	900.00	900.00	900.00	-
Length of river bridges (m)	11	27.50	507.00	258.53	167.72
Width of river bridges (m)	11	11.30	28.90	24.53	6.28
Number of spans of river bridges	11	1.00	18.00	7.73	5.24
Area of river bridges (m²)	11	310.75	13,178.40	6731.06	4826.87
Total Cost (€ 2024)	19	14,033,929.00	275,847,756.00	92,564,860.11	67,242,393.74

Table 4. Pearson correlation coefficients between DV and IV.

IV	Total Cost (€ 2024)
Length of main axis (L_m)	0.278
Width of main axis (W_m)	−0.125
Area of main axis (A_m)	0.262
Length of service roads (L_s)	−0.067
Width of service roads (W_s)	−0.170
Area of service roads (A_s)	−0.069
Total road area (main + service) (A_m+s)	0.168
Number of Interchanges (N_ic)	0.018
Area of bridges (A_b)	0.647 **
Area of underpasses (A_up)	0.116
Area of overpasses (A_op)	−0.083
Sum of area of all structures (A_st)	0.490 *
Total length of single bore tunnels (L_t)	0.804 **
Number of Culverts (N_c)	−0.099
Length of river training (L_R)	c
Length of river bridges (L_Rb)	0.272
Width of river bridges (W_Rb)	−0.045
Number of spans of river bridges (N_s)	0.467
Area of river bridges (A_Rb)	0.161

** Correlation is significant at the 0.01 level (2-tailed); * Correlation is significant at the 0.05 level (2-tailed); c Cannot be computed because at least one of the variables is constant.

Table 5. Models tested per ML approach.

Model	DV	IV(s)
a	Total Cost	L_m, W_m, L_s, W_s, N_ic, A_b, A_op, A_up, L_t, N_c, L_R, N_s
b	Total Cost	L_m, L_s, A_st, L_t, N_c
c	Total Cost	L_m, L_s, N_ic, A_st, L_t, N_c
d	Total Cost	N_s, L_t, A_st, A_b
e	Total Cost	L_m, A_st, L_t, N_s

Table 6. Evaluation metrics for Total Cost estimation models (Models a–e).

Total Cost Model	ML Method	MAE	R²	MAE%	MAPE%
a	MLP-ANN	107,201,806.45	−9.17	129.19	72.50
a	RBF-ANN	116,741,876.34	−9.20	140.69	81.97
a	MLR	97,888,659.68	−5.92	117.97	83.69
a	RF	50,499,385.15	−1.23	60.86	32.06
a	SVR	44,239,797.61	−0.97	53.32	26.62
a	XGBoost	50,350,221.57	−0.67	60.68	39.33
a	KNN	69,416,458.68	−2.96	83.66	45.55
b	MLP-ANN	15,590,969.66	0.86	18.79	13.64
b	RBF-ANN	23,758,675.83	0.68	28.63	22.71
b	MLR	17,752,384.39	0.81	21.39	17.29
b	RF	56,918,499.33	−0.97	68.60	46.24
b	SVR	41,372,128.39	−0.59	49.86	25.97
b	XGBoost	49,586,328.22	−0.95	59.76	33.33
b	KNN	46,084,716.58	−0.41	55.54	33.51
c	MLP-ANN	39,790,753.09	0.09	49.59	35.00
c	RBF-ANN	21,974,168.76	0.55	26.48	21.83
c	MLR	11,980,640.05	0.87	14.44	11.27
c	RF	61,522,837.83	−1.16	74.14	50.04
c	SVR	41,475,979.34	−0.72	49.98	25.13
c	XGBoost	47,477,931.22	−1.23	57.22	28.70
c	KNN	55,560,571.41	−0.97	66.69	41.76
d	MLP-ANN	86,098,841.82	−3.81	103.76	83.44
d	RBF-ANN	137,669,937.0	−12.73	165.91	144.00
d	MLR	107,177,935.78	−6.73	129.17	106.63
d	RF	61,008,294.17	−1.11	73.52	50.55
d	SVR	38,771,335.90	−0.08	46.73	28.54
d	XGBoost	51,344,221.21	−1.01	61.88	35.21
d	KNN	35,883,364.00	−0.09	43.24	24.04
e	MLP-ANN	18,300,913.98	0.76	22.06	12.75
e	RBF-ANN	28,328,341.99	0.30	34.14	29.47
e	MLR	21,232,975.87	0.68	25.59	20.55
e	RF	53,607,513.38	−0.92	64.60	45.54
e	SVR	40,234,082.82	−0.24	48.49	28.09
e	XGBoost	49,022,275.21	−1.14	59.08	31.17
e	KNN	42,612,353.80	−0.31	51.35	30.40
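
The four metrics reported in Table 6 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that MAE% is the MAE divided by the mean of the actual values, and the input numbers are toy values chosen only to show how a model that predicts worse than the mean produces a negative R², as many rows of Table 6 do. The function name `regression_metrics` is hypothetical.

```python
def regression_metrics(actual, predicted):
    """Return MAE, R2, MAE as % of mean (assumed definition), and MAPE."""
    n = len(actual)
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    mae = sum(abs_err) / n
    y_bar = sum(actual) / n
    # R2 = 1 - SS_res / SS_tot; negative when the model is worse than
    # always predicting the mean of the actual values.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    mae_pct = 100.0 * mae / y_bar
    mape = 100.0 * sum(e / abs(a) for e, a in zip(abs_err, actual)) / n
    return {"MAE": mae, "R2": r2, "MAE%": mae_pct, "MAPE%": mape}


# Toy values (not from the paper's dataset).
actual = [100.0, 200.0, 300.0]
good = regression_metrics(actual, [110.0, 190.0, 330.0])
bad = regression_metrics(actual, [300.0, 200.0, 100.0])
print(good)
print(bad)
```

In the second call the predictions are reversed relative to the actual values, so the residual sum of squares exceeds the total sum of squares and R² falls below zero.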

Table 7. Performance scoring ranges for model evaluation metrics.

Score	* MAE	** R²	* MAE%	* MAPE%
0	>27,769,458.03	≤0.00	>30.00	>30.00
1	22,215,566.43–27,769,458.03	0.01–0.49	24.01–30.00	24.01–30.00
2	16,661,674.83–22,215,566.42	0.50–0.59	18.01–24.00	18.01–24.00
3	11,107,783.22–16,661,674.82	0.60–0.69	12.01–18.00	12.01–18.00
4	5,553,891.62–11,107,783.21	0.70–0.79	6.01–12.00	6.01–12.00

* lower values correspond to higher scores, ** higher values correspond to higher scores.
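
The scoring ranges of Table 7 can be applied programmatically along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and the top band (score 5) is not listed in Table 7 as shown, so its limits are inferred here from the even spacing of the listed bands and from the scores of 5.00 that appear in Table 8.

```python
# Inclusive upper limits of each band for "lower is better" metrics,
# ordered from score 1 upward (values from Table 7).
MAE_UPPER = [27_769_458.03, 22_215_566.42, 16_661_674.82, 11_107_783.21]
PCT_UPPER = [30.00, 24.00, 18.00, 12.00]
# Inclusive lower limits of each band for R2 (values from Table 7).
R2_LOWER = [0.01, 0.50, 0.60, 0.70]

# ASSUMPTION: limits of the score-5 band, inferred rather than published.
MAE_UPPER.append(5_553_891.61)
PCT_UPPER.append(6.00)
R2_LOWER.append(0.80)


def score_lower_is_better(value, upper_limits):
    """Score = number of band limits the value stays at or below."""
    return sum(value <= u for u in upper_limits)


def score_higher_is_better(value, lower_limits):
    """Score = number of band limits the value reaches or exceeds."""
    return sum(value >= lo for lo in lower_limits)


def score_model(mae, r2, mae_pct, mape_pct):
    """Return the four metric scores and their unweighted average."""
    scores = (
        score_lower_is_better(mae, MAE_UPPER),
        score_higher_is_better(r2, R2_LOWER),
        score_lower_is_better(mae_pct, PCT_UPPER),
        score_lower_is_better(mape_pct, PCT_UPPER),
    )
    return scores, sum(scores) / len(scores)


# Model b, MLP-ANN, using its Table 6 metrics.
scores, average = score_model(15_590_969.66, 0.86, 18.79, 13.64)
print(scores, average)
```

For model b with MLP-ANN, this yields the same four scores and average as the corresponding row of Table 8.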

Table 8. Scores and ranking of selected models for Total Cost estimation.

Model	ML Method	MAE	R²	MAE%	MAPE%	Average	Rank
c	MLR	3.00	5.00	3.00	3.00	3.50	1.00
b	MLP-ANN	3.00	5.00	2.00	3.00	3.25	2.00
b	MLR	2.00	5.00	2.00	3.00	3.00	3.00
e	MLP-ANN	2.00	4.00	2.00	3.00	2.75	4.00
e	MLR	2.00	3.00	1.00	2.00	2.00	5.00
b	RBF-ANN	1.00	3.00	1.00	2.00	1.75	6.00
c	RBF-ANN	2.00	2.00	1.00	2.00	1.75	6.00
e	RBF-ANN	0.00	1.00	0.00	1.00	0.50	9.00
a	SVR	0.00	0.00	0.00	1.00	0.25	10.00
b	SVR	0.00	0.00	0.00	1.00	0.25	10.00
c	SVR	0.00	0.00	0.00	1.00	0.25	10.00
c	XGBoost	0.00	0.00	0.00	1.00	0.25	10.00
d	SVR	0.00	0.00	0.00	1.00	0.25	10.00
d	KNN	0.00	0.00	0.00	1.00	0.25	10.00
e	SVR	0.00	0.00	0.00	1.00	0.25	10.00

Table 9. Independent variables of the cross-validation projects (X1 and X2).

IV	Project X1	Project X2
L_m	8500 m	16,000 m
L_s	11,170 m	32,000 m
N_ic	2	3
A_b	2280 m²	1680 m²
A_up	1500 m²	2250 m²
A_op	0 m²	2160 m²
A_st	3780 m²	6090 m²
L_t	0 m	1200 m
N_c	43	80
N_s	0	3

Table 10. Cross-validation results for selected models (Projects X1 and X2).

Rank	Model	ML Method	X1 Actual (EUR)	X1 Predicted (EUR)	X1 Deviation	X2 Actual (EUR)	X2 Predicted (EUR)	X2 Deviation
1	c	MLR	39,194,023.43	29,726,360.79	−24.16%	78,571,146.85	75,775,227.52	−3.56%
2	b	MLP-ANN	39,194,023.43	29,805,189.53	−23.95%	78,571,146.85	15,243,355.69	−80.60%
3	b	MLR	39,194,023.43	21,310,216.12	−45.63%	78,571,146.85	57,651,670.78	−26.62%
4	e	MLP-ANN	39,194,023.43	27,929,433.09	−28.74%	78,571,146.85	61,418,111.86	−21.83%
5	e	MLR	39,194,023.43	30,219,523.12	−22.90%	78,571,146.85	82,647,440.00	5.19%
6	b	RBF-ANN	39,194,023.43	27,743,288.58	−29.22%	78,571,146.85	83,457,835.30	6.22%
6	c	RBF-ANN	39,194,023.43	42,265,903.31	7.84%	78,571,146.85	120,399,833.00	53.24%

Table 11. Coefficients, means, and standard deviations of IVs in the c MLR Model.

IV	Units	w	μ	σ
L_m	m	37,916,007.05	15,533.13	8691.05
L_s	m	−8,296,717.99	19,309.72	15,105.06
N_ic	n	7,517,364.66	1.33	0.87
A_st	m²	15,446,845.90	15,101.14	10,538.05
L_t	m	46,184,784.74	804.67	2275.03
N_c	n	−8,896,822.25	43.40	28.32
b		82,977,372.26	-	-
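
As a sketch of how the parameters in Table 11 might be applied, the example below evaluates the c MLR model for Projects X1 and X2 using the inputs of Table 9. It assumes the standardized linear form cost = b + Σ w_i (x_i − μ_i) / σ_i, which is an interpretation of the table rather than a formula quoted from the text; the names `MODEL_C`, `predict_total_cost`, and `deviation_pct` are hypothetical.

```python
# (w, mu, sigma) per independent variable, transcribed from Table 11.
MODEL_C = {
    "L_m": (37_916_007.05, 15_533.13, 8_691.05),
    "L_s": (-8_296_717.99, 19_309.72, 15_105.06),
    "N_ic": (7_517_364.66, 1.33, 0.87),
    "A_st": (15_446_845.90, 15_101.14, 10_538.05),
    "L_t": (46_184_784.74, 804.67, 2_275.03),
    "N_c": (-8_896_822.25, 43.40, 28.32),
}
INTERCEPT_B = 82_977_372.26  # row "b" of Table 11


def predict_total_cost(project):
    """ASSUMED form: b + sum of w * (x - mu) / sigma over the six IVs."""
    return INTERCEPT_B + sum(
        w * (project[name] - mu) / sigma
        for name, (w, mu, sigma) in MODEL_C.items()
    )


def deviation_pct(actual, predicted):
    """Signed percentage deviation of the prediction from the actual cost."""
    return 100.0 * (predicted - actual) / actual


# Inputs from Table 9; actual costs from Table 10.
X1 = {"L_m": 8500, "L_s": 11170, "N_ic": 2, "A_st": 3780, "L_t": 0, "N_c": 43}
X2 = {"L_m": 16000, "L_s": 32000, "N_ic": 3, "A_st": 6090, "L_t": 1200, "N_c": 80}
ACTUAL = {"X1": 39_194_023.43, "X2": 78_571_146.85}

for label, project in (("X1", X1), ("X2", X2)):
    cost = predict_total_cost(project)
    dev = deviation_pct(ACTUAL[label], cost)
    print(f"{label}: predicted EUR {cost:,.2f}, deviation {dev:.2f}%")
```

Under this assumed form, the results agree with the c MLR predictions in Table 10 to well under 1%; the small remaining difference is consistent with the rounding of μ and σ in Table 11.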

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

