Article

AI-Powered Forecasting of Environmental Impacts and Construction Costs to Enhance Project Management in Highway Projects

Department of Highway & Transportation Research, Korea Institute of Civil Engineering and Building Technology, 283 Goyangdae-Ro, Ilsanseo-Gu, Goyang-si 10223, Gyeonggi-Do, Republic of Korea
Buildings 2025, 15(14), 2546; https://doi.org/10.3390/buildings15142546
Submission received: 23 June 2025 / Revised: 8 July 2025 / Accepted: 15 July 2025 / Published: 19 July 2025

Abstract

The accurate early-stage estimation of environmental load (EL) and construction cost (CC) in road infrastructure projects remains a significant challenge, constrained by limited data and the complexity of construction activities. To address this, our study proposes a machine learning-based predictive framework utilizing artificial neural networks (ANNs) and deep neural networks (DNNs), enhanced by autoencoder-driven feature selection. A structured dataset of 150 completed national road projects in South Korea was compiled, covering both planning and design phases. The database focused on 19 high-impact sub-work types to reduce noise and improve prediction precision. A hybrid imputation approach—combining mean substitution with random forest regression—was applied to handle 4.47% missing data in the design-phase inputs, reducing variance by up to 5% and improving data stability. Dimensionality reduction via autoencoder retained 16 core variables, preserving 97% of explanatory power while minimizing redundancy. ANN models benefited from cross-validation and hyperparameter tuning, achieving consistent performance across training and validation sets without overfitting (MSE = 0.06, RMSE = 0.24). The optimal ANN yielded average error rates of 29.8% for EL and 21.0% for CC at the design stage. DNN models, with their deeper architectures and dropout regularization, further improved performance—achieving 27.1% (EL) and 17.0% (CC) average error rates at the planning stage and 24.0% (EL) and 14.6% (CC) at the design stage. These results met all predefined accuracy thresholds, underscoring the DNN’s advantage in handling complex, high-variance data while the ANN excelled in structured cost prediction. 
Overall, the synergy between deep learning and autoencoder-based feature selection offers a scalable and data-informed approach for enhancing early-stage environmental and economic assessments in road infrastructure planning—supporting more sustainable and efficient project management.

1. Introduction

In recent years, the construction industry has increasingly emphasized the need for sustainable infrastructure development, driven by rising environmental concerns and the demand for efficient resource management [1]. Among various infrastructure sectors, road construction plays a critical role due to its extensive material consumption, long lifecycle, and considerable environmental footprint. Accurate early-stage estimation of environmental load (EL) and construction cost (CC) is thus crucial for both minimizing ecological impact and optimizing project budgets. Traditional cost and environmental impact estimation methods rely heavily on historical data, engineering judgment, and deterministic models. While these approaches provide foundational insights, they often lack scalability and adaptability when confronted with complex or data-limited project environments. Moreover, many such models struggle to incorporate the wide range of interacting variables inherent to infrastructure projects, especially during the planning and design stages, where uncertainty is high and input data are only partially available.
EL has been a critical focus in recent structural and geotechnical research due to its influence on safety and sustainability. Wang et al. [2] investigated the seismic and EL response of offshore wind turbine jacket foundations, revealing complex dynamic interactions with scour effects. Mittermayr et al. [3] examined material fatigue under mechanical–EL cycles, providing insights into degradation in recycled construction materials. Lavassani et al. [4] optimized semi-active dampers for offshore jacket platforms under environmental vibration, demonstrating advanced control strategies for structural resilience. Elmas et al. [5] evaluated soil–monopile interaction in offshore turbines, integrating wind and wave loads with earthquake effects through detailed subsurface modeling. Li et al. [6] synthesized the impact of environmental factors (such as temperature, pressure, and hydrological loads) on long-term structural monitoring via a GNSS, highlighting the relevance of environmental modeling in infrastructure deformation analysis. Recent studies also continue to demonstrate the effectiveness of hybrid AI approaches in complex geotechnical problems. For instance, Ghanizadeh et al. [7] developed a predictive model for the bearing capacity of geogrid-reinforced stone columns using a hybrid MARS–EBS method, achieving R² values above 0.99. This confirms the strength of intelligent optimization in modeling nonlinear, parameter-sensitive soil–structure systems.
Recent research underscores the growing use of deep learning in construction management. Cheng et al. [8] and Liu et al. [9] proposed hybrid and hypergraph-based models for accurate cost and schedule prediction. Habib et al. [10] applied ensemble learning, while Mahpour et al. [11] examined maintenance costs within a circular economy framework. Bruzzone et al. [12] integrated machine learning with simulation for offshore plant planning. Farshadfar et al. [13] leveraged AI for automated waste sorting, and Mahmoodzadeh et al. [14] focused on tunneling cost and duration forecasting. Wang et al. [15] explored economic factor impacts using DNNs, and Alsulamy et al. [16] compared deep learning algorithms for project delay prediction. Li et al. [17] offered a comprehensive classification of deep learning in construction. Lung et al. [18] and Liu et al. [19] expanded applications to modular safety, IoT-based risk control, and material price forecasting, respectively. Chen et al. [20] combined deep learning with large language models for BIM compliance, while Choi et al. [21] improved construction crack detection via hybrid data augmentation. Despite these advances, few studies compare ANN and DNN performance using multi-phase datasets and autoencoder-based variable selection—highlighting the novelty of this research.
Despite growing interest in applying machine learning to construction forecasting, existing studies often face limitations in handling complex, high-dimensional project data, especially during early stages where uncertainty is high. Many approaches rely on limited or overly generalized variables, leading to reduced model reliability and increased susceptibility to overfitting. Furthermore, few studies quantitatively address the challenge of selecting meaningful features from detailed engineering datasets, particularly when data volume is constrained. This research addresses these gaps by focusing on improving model performance through optimized feature selection and dimensionality reduction techniques, enabling more accurate and practical predictions of EL and CCs at critical decision-making points. Unlike previous studies that rely on single-phase data, this study distinguishes between planning and design stages, enabling phase-specific prediction modeling and allowing a deeper analysis of variable significance, data uncertainty, and model suitability at different decision-making points.
This study aims to develop a machine learning framework that can effectively estimate ELs and CCs during the planning and design stages of national road construction projects. A dataset of 150 national road construction projects from South Korea was compiled, comprising 10 planning-stage variables and 19 design-stage work quantities. To reduce dimensionality and improve learning performance, an autoencoder was used to identify optimal variable combinations. Subsequently, a total of 16 predictive models—artificial neural networks (ANNs) and deep neural networks (DNNs)—were constructed to estimate environmental impact and cost. These models were trained using 6-fold cross-validation with varying network depths (1–3 layers), dropout rates (40–80%), and node counts (up to 500) and were evaluated based on MSE, RMSE, and the average error rate. Missing values were addressed through random forest imputation, and early stopping techniques were employed to prevent overfitting. Through this research, we contribute to advancing ML-based estimation methodologies for road infrastructure, providing insights into variable importance and model optimization techniques suited to early-stage project conditions. The findings are expected to support more sustainable and cost-effective decision-making in road planning and design processes. An overview of the manuscript’s structure is presented below:
  • Section 1 presents the Introduction.
  • Section 2 shows the database structure, including data collection, preprocessing, and feature selection strategies for both the planning and design phases.
  • Section 3 details the development and configuration of ANN and DNN models, including architecture selection, hyperparameter optimization, and performance evaluation metrics.
  • Section 4 compares the predictive performance of both model types, highlighting the trade-offs and suitability of each method for estimating EL and CC.
  • Section 5 concludes the study by summarizing key findings, discussing limitations, and suggesting future research directions to improve machine learning applications in early-stage infrastructure planning.

2. Database Collection and Analysis

2.1. Collection of Road Project Cases

To estimate the EL and CCs of national road projects, a dataset of 150 completed cases in South Korea was compiled. For each project, data were extracted from completion reports, project details, and detailed design documents to construct two structured datasets: a planning-stage database and a design-stage database [22].
The planning-stage database includes key project characteristics relevant to early-phase decision-making, such as administrative district, road height (m), road grade (%), topography, design speed (km/h), the type of construction (new or expanded), road length (m), road area (m²), pavement thickness (cm), the number of lanes, and road width (m).
The design-stage database captures detailed construction activities and material quantities, categorized by work type:
  • Earthwork: quantities for operations such as excavation, earth moving, ripping, blasting rock, and ceramic transport (m³), including dump transport and green zone reclamation.
  • Drainage: lengths of side ditches and horizontal drains (m), and volumes for structures such as VR halls, wing walls, and concrete placements (m³).
  • Paving: volumes of frost protection layers (m³) and quantities of asphalt base, middle, and surface layers (tons).
  • Structural labor: formwork areas (m²), rebar assembly (tons), and ladder work (tons), reflecting both material input and labor intensity.
This dual-database structure enables machine learning models to learn from both early-stage design variables and detailed construction inputs, providing a robust basis for the predictive modeling of environmental and cost outcomes.

2.2. Dataset Composition and Variable Selection for Model Development

A total of 150 national road construction projects in South Korea were analyzed to construct the planning and design-stage databases. The dataset covers a wide geographic range, with the highest concentration in Gyeonggi-do (18%), followed by Gyeongbuk (14%) and Chungnam (14%). Regions such as Gangwon (4%) and Jeju (2%) were less represented.
At the planning stage, key attributes were compiled, including road height, grade, design speed, and construction type. Most projects had road heights below 10 m (68%), were classified as National Route 2 (52%), and were designed for 80 km/h speeds (61%). Additionally, new constructions (56%) slightly outnumbered expansion projects.
For the design-stage database, emphasis was placed on construction categories with the highest influence on cost and environmental impact, notably earthworks, drainage, and paving, which together account for more than 75% of total resource usage. The dataset comprises 19 sub-work types, with quantities standardized in units such as m³, m, m², and tons, covering activities like excavation, rock blasting, concrete works, drainage installations, and asphalt paving.
To improve model efficiency, only high-impact sub-works were selected for inclusion. This filtering reduces noise and enhances predictive accuracy by limiting the dataset to features with strong relevance to environmental and cost outcomes.

2.2.1. Data Distribution, Missing Values, and Imputation Strategy in the Design-Stage Database

The design-stage database consists of 19 detailed construction work types, each recorded with quantity information in standard engineering units (e.g., m³, m, m², ton). To evaluate the distribution of input features and identify missing data patterns, descriptive statistics and quartiles were computed, as presented in Table 1. Missing values were present in several features, particularly in the drainage and paving categories. Out of 1900 data entries, 85 values (4.47%) were missing, with the majority concentrated on culvert-related tasks, which are crucial for environmental and cost estimation accuracy.
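The descriptive statistics and missing-value count described above can be sketched in base R as follows. This is only an illustration: the toy table `design_toy` and its column names are hypothetical and are not taken from Table 1; only the final line uses the paper's own figures (85 missing out of 1900 entries).

```r
# Hypothetical toy design-stage table: 5 projects x 3 work types
design_toy <- data.frame(
  excavation_m3 = c(1200, 950, NA, 1800, 1430),
  side_ditch_m  = c(310, NA, 275, 402, NA),
  asphalt_ton   = c(88, 92, 79, 101, 95)
)

# Descriptive statistics and quartiles per column (NA counts are reported too)
print(summary(design_toy))

# Missing entries per column and overall share of missing cells
missing_by_col <- colSums(is.na(design_toy))
n_total        <- nrow(design_toy) * ncol(design_toy)
n_missing      <- sum(is.na(design_toy))
missing_rate   <- 100 * n_missing / n_total

# The share reported in the text: 85 of 1900 entries
paper_rate <- round(100 * 85 / 1900, 2)
```

Counting missing cells per column before imputing makes it easy to see which work types (here, the hypothetical `side_ditch_m`) drive the overall missing rate.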

2.2.2. Handling Missing Values Through Imputation

To ensure robust model training, missing values were addressed using two strategies: mean imputation and random forest imputation. Mean imputation fills missing entries using the variable’s average but can underestimate variability and distort distribution tails. In contrast, random forest imputation builds a nonlinear model using other available features to predict missing values [23], offering greater accuracy and robustness in complex datasets.
Coding details for supplementing missing values in the design-phase database.
library(readxl)
library(data.table)
library(h2o)
h2o.init()
# Load and prepare dataset
design_data <- as.data.table(read_excel("design_db_.xlsx"))
h2o_data <- as.h2o(design_data)
# Identify columns with missing values
na_cols <- names(which(colSums(is.na(design_data)) > 0))
# Impute missing values using Random Forest
for (col in na_cols) {
  # Fit a random forest that predicts `col` from all other columns
  # (nothing is learned from rows whose response is missing)
  model <- h2o.randomForest(
    x = setdiff(names(h2o_data), col),
    y = col,
    training_frame = h2o_data
  )
  prediction <- as.data.frame(h2o.predict(model, h2o_data))$predict
  # Overwrite only the missing entries with the model's predictions
  missing_rows <- is.na(design_data[[col]])
  design_data[[col]][missing_rows] <- prediction[missing_rows]
}
# Resulting dataset
imputed_design_data <- design_data
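For contrast with the random forest approach, mean imputation can be sketched in a few lines of base R. The vector `quantity` is hypothetical; the example only illustrates the point made in the text that filling gaps with the mean leaves the average unchanged but shrinks the standard deviation.

```r
# Hypothetical work quantity with two missing entries
quantity <- c(10, 12, NA, 20, NA, 14)

# Mean imputation: replace every NA with the mean of the observed values
observed_mean <- mean(quantity, na.rm = TRUE)
imputed <- ifelse(is.na(quantity), observed_mean, quantity)

# Compare spread before (observed values only) and after imputation
sd_before <- sd(quantity, na.rm = TRUE)
sd_after  <- sd(imputed)
```

Because every imputed value sits exactly at the mean, `sd_after` is smaller than `sd_before`, which is the variance understatement that motivates a model-based alternative.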

2.2.3. Distribution Shift After Imputation

Post-imputation, summary statistics were recalculated. Table 2 shows that average values remained largely stable, while standard deviations decreased for several variables—improving the consistency and learnability of the dataset for machine learning.
In contrast, a few variables (e.g., scaffolding, asphalt intermediate layer) exhibited increased standard deviation post imputation due to their previously narrow observed ranges and small sample sizes. This is illustrated in Table 3, which compares differences in mean–standard deviation gaps before and after imputation.

3. Artificial Neural Network

3.1. Selection of Optimal Variables

In road infrastructure projects, the design stage is a critical phase during which approximately 70% of the construction drawings are completed, and quantities and scopes of work are accurately determined. Road projects typically encompass several major categories, including earthwork, slope stabilization, drainage, structural works, tunnel excavation, paving, traffic safety installations, and ancillary facilities. This study excludes tunnels and bridges to focus on the more frequently implemented components across projects. Each category includes multiple sub-activities:
  • Earthwork typically consists of 12 tasks, such as demolition, excavation, embankment formation, and topsoil removal.
  • Slope safety involves vegetation-based and structural reinforcement.
  • Drainage includes around 14 activities, such as trenching, blind hole drilling, and horizontal pipe installation.
  • Paving work encompasses 13 procedures, including frost protection, compaction, concrete curing, and surface finishing.
  • Traffic safety covers 11 elements like road signs and pavement markings.
  • Ancillary works average 20 tasks and include features like protective walls, signage, and noise barriers.
While including a wide array of variables may seem advantageous for improving model accuracy, this often results in overfitting—where the model captures data-specific noise and fails to generalize to new cases. In situations where expanding the dataset is impractical, dimensionality reduction is essential to optimize learning efficiency.
To address this, this study employs an autoencoder, an unsupervised neural network designed to reconstruct its input by learning compressed data representations. It consists of an encoder, which reduces dimensionality, and a decoder, which attempts to reconstruct the original input. Unlike traditional neural networks that predict outputs, autoencoders aim to reproduce input features, making them well-suited for noise reduction, anomaly detection, and feature selection. By minimizing reconstruction error, the autoencoder isolates high-signal variables, thereby improving the predictive performance of subsequent ANN and DNN models.

3.1.1. Optimal Variable Selection Using Autoencoder

(1) The setting of optimal variables in the planning stage
The performance of an ANN is primarily influenced by two key factors: the quantity and quality of training data and the network architecture [23], particularly the number of hidden layers and nodes [24]. These factors affect the model’s generalization capacity and weight optimization range.
To reduce overfitting and improve model generalizability across the 150-case dataset, four key strategies were implemented: (1) dropout regularization was applied with rates between 40% and 80% in both input and hidden layers to prevent over-reliance on specific neurons; (2) early stopping was used to halt training automatically when no further performance improvement was observed, avoiding unnecessary iterations; (3) cross-validation involved 2-fold splits for ANN models and 6-fold splits for autoencoder tuning to ensure model stability across varied data partitions; and (4) autoencoder-based feature selection reduced dimensionality by isolating high-signal variables, retaining over 95% of the dataset’s explanatory power while minimizing noise and redundancy.
In this study, a custom function was developed to optimize ANN parameters, including hidden layer depth (1–3 layers) and dropout rates (40–80%), using six-fold cross-validation across various case combinations [25,26]. While no universal guideline exists for selecting hidden layer depth, it is generally advised to limit complexity when data availability is constrained. The parameter optimization process was applied to both planning and design-stage datasets, as detailed in Figure 1.
The autoencoder optimal parameter function.
library(caret)   # provides createFolds()
# Define autoencoder depth (1 to 3 hidden layers)
depth <- sample(1:3, 1)
# Create 6-fold cross-validation indices
folds <- createFolds(1:100, k = 6)
# Hyperparameter search space: six random candidates sharing the same depth
hyperparams <- lapply(1:6, function(i) {
  list(
    hidden    = sample(100:500, depth, replace = TRUE),         # nodes per hidden layer
    input_dr  = sample(400:800, 1) / 1000,                      # input dropout ratio (0.4 to 0.8)
    hidden_dr = sample(400:800, depth, replace = TRUE) / 1000   # hidden dropout ratios (0.4 to 0.8)
  )
})
Figure 1 shows the six cross-validation models [27] built to find the optimal autoencoder parameters for the planning stage. Model 5, highlighted by the red dashed box, produced the lowest mean squared error (MSE) of 0.10 and was therefore selected; Table 4 gives the information used to build it. The selected network has five layers in total, including three hidden layers, with dropout rates ranging from 56% to 78%.
The evaluation of neural network models often involves statistical techniques such as mean square error (MSE) and root mean square error (RMSE). MSE represents the average of the squared differences between predicted and actual values. A smaller MSE indicates that the model’s predictions are closer to the true values. This metric is widely used to assess the accuracy of the model and is represented by the following formula [28]:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2$$
When working with large-scale datasets, the error sum can become very large, leading to an MSE value that is difficult to interpret intuitively. To address this, the RMSE is derived by taking the square root of the MSE, making it more manageable and easier to assess model performance. This can be expressed as follows [28]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2}$$
In general, lower values of both MSE and RMSE suggest better predictive performance. Additionally, when the MSE and RMSE obtained in the learning process are close to those obtained in the verification process, overfitting is less likely to be an issue, which makes the model more reliable.
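The two metrics can be computed directly from their definitions. The sketch below uses base R with hypothetical vectors `actual` and `predicted` on a normalized 0 to 1 scale; the helper names `mse` and `rmse` are illustrative and not part of the study's code.

```r
# MSE: mean of squared differences between predicted and actual values
mse <- function(actual, predicted) mean((predicted - actual)^2)

# RMSE: square root of the MSE, expressed in the same units as the target
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

# Hypothetical normalized targets and predictions
actual    <- c(0.20, 0.50, 0.80, 0.40)
predicted <- c(0.30, 0.40, 0.90, 0.60)

mse_value  <- mse(actual, predicted)
rmse_value <- rmse(actual, predicted)
```

Evaluating the same two functions on the training set and on the validation set gives the pair of values that are compared when checking for overfitting.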
Training and verifying the autoencoder with the optimal parameters determined for the planning stage gave an MSE of 0.08 and an RMSE of 0.28 in the learning process, and an MSE of 0.06 and an RMSE of 0.24 in the verification process. Since these values are similar, it was concluded that overfitting did not occur, and the model was deemed reliable.
Coding details of the planning-stage autoencoder learning process and verification process.
# Planning-stage autoencoder: learning process
# One model per hyperparameter candidate and fold
# (note: as written, the fold indices `i` are not used to subset the frames)
fm_plan_Eco <- lapply(hyperparams, function(v) {
  lapply(folds, function(i) {
    h2o.deeplearning(
      x = 1:11,
      training_frame = train_plan_Eco[, -12],    # drop the target column
      validation_frame = test_plan_Eco[, -12],
      distribution = "gaussian",
      activation = "RectifierWithDropout",
      hidden = v$hidden,
      rho = 0.90,
      epsilon = 1e-7,
      input_dropout_ratio = v$input_dr,
      hidden_dropout_ratios = v$hidden_dr,
      loss = "Automatic",
      autoencoder = TRUE,
      sparse = TRUE,
      l1 = 1e-7,
      l2 = 1e-7,
      epochs = 300
    )
  })
})
# Planning-stage autoencoder: verification process
# Refit with the best candidate (Model 5)
fm.final_plan_Eco <- h2o.deeplearning(
  x = 1:11,
  training_frame = train_plan_Eco[, -12],
  validation_frame = test_plan_Eco[, -12],
  distribution = "gaussian",
  activation = "RectifierWithDropout",
  hidden = hyperparams[[5]]$hidden,
  rho = 0.90,
  epsilon = 1e-7,
  input_dropout_ratio = hyperparams[[5]]$input_dr,
  hidden_dropout_ratios = hyperparams[[5]]$hidden_dr,
  loss = "Automatic",
  autoencoder = TRUE,
  l1 = 1e-7,
  l2 = 1e-7,
  epochs = 300
)
The mean squared error ranges from 0.01 to 0.15, indicating that some variables have low reproducibility. Such variables can be expected to degrade prediction performance when an artificial neural network model is built, so the variable importance in Table 5 was used as the criterion for selecting variables. The variable combinations corresponding to cumulative importance levels of 77%, 83%, 89%, and 95% are shown in Table 6 and Table 7; these were subsequently used as inputs to the ANN and DNN models for estimating the EL and CC at the planning stage.
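Selecting variables by cumulative importance can be sketched as follows in base R. The importance values and variable names below are hypothetical placeholders, not the values in Table 5, and `select_by_cumulative` is an illustrative helper: it ranks variables by importance and keeps the smallest leading set whose cumulative share reaches a given threshold.

```r
# Hypothetical relative importances (NOT the values reported in Table 5)
importance <- c(
  road_length = 0.31, road_area = 0.22, lanes = 0.15, road_width = 0.10,
  design_speed = 0.08, pavement_thickness = 0.06, road_grade = 0.04,
  topography = 0.04
)

# Keep the top-ranked variables until their cumulative share reaches `threshold`
select_by_cumulative <- function(importance, threshold) {
  ranked <- sort(importance, decreasing = TRUE)
  share  <- cumsum(ranked) / sum(ranked)
  k      <- which(share >= threshold)[1]
  names(ranked)[seq_len(k)]
}

# One variable combination per cumulative-importance level used in the study
thresholds   <- c(0.77, 0.83, 0.89, 0.95)
combinations <- lapply(thresholds, function(t) select_by_cumulative(importance, t))
combo_sizes  <- lengths(combinations)
```

With these placeholder values the four thresholds yield nested combinations of increasing size, which mirrors how combinations 1 to 4 differ only in how many lower-ranked variables they retain.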

3.1.2. The Setting of Optimal Variables in the Design Phase

The optimal variables for the design phase were set using the same process as for the planning phase. Cross-validation is performed using the function of Table 7 above for various case combinations. Figure 2 shows the six cross-validation models built to find the optimal autoencoder parameters for the design phase. Model 6, highlighted by the red dashed box, produced the lowest mean squared error (MSE) of 0.07 and was therefore selected; Table 7 and Table 8 give the information used to build it. The selected network has three layers in total, including one hidden layer, with dropout rates ranging from 72% to 76%.
Table 8 shows the results obtained after applying the planning-stage autoencoder code to the design-stage model. The MSE and RMSE were 0.29 and 0.53 in the learning process and 0.06 and 0.23 in the verification process. Because the verification error did not exceed the learning error, it was assumed that overfitting did not occur, and the model was judged to be reliable.
Table 9 and Table 10 summarize the ranked importance of design-stage variables and the cumulative combinations used to construct the ANN and DNN models, with up to 97% explanatory power captured using 16 key features.

3.2. ANN

3.2.1. Construction and Preprocessing for Planning-Stage Estimation

Prior to model development, the dataset comprising 100 road project cases was partitioned into 80 training cases, 10 validation cases, and 10 test cases. The ANN model for the planning stage was constructed using a backpropagation neural network architecture, which iteratively minimizes error by propagating feedback from output nodes.
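An 80/10/10 partition of 100 cases can be sketched in base R as below. This assumes a simple random split with a fixed seed; the text does not state how cases were assigned to each subset, and the index names are illustrative.

```r
set.seed(42)   # fixed seed so the split is reproducible

n_cases  <- 100
shuffled <- sample(seq_len(n_cases))   # random permutation of case indices

train_idx <- shuffled[1:80]     # 80 training cases
valid_idx <- shuffled[81:90]    # 10 validation cases
test_idx  <- shuffled[91:100]   # 10 test cases
```

The three index vectors can then be used to subset a data frame of project cases, for example `cases[train_idx, ]` for a hypothetical data frame `cases`.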
As summarized in the coding details, the model (plan_Eco_ann) employed a ReLU (rectified linear unit) activation function to reduce vanishing gradient issues and facilitate faster convergence. Although standard practice recommends four or more folds in cross-validation, two-fold cross-validation yielded the best predictive performance in this study. The network was configured with one hidden layer, and the number of hidden nodes was selected randomly in the range of 1 to 300. Learning was performed over 300 to 500 epochs, and early stopping criteria were applied to halt training when no further performance improvement was detected.
To enhance learning efficiency, the dependent variables—EL and CC—were normalized using the Min–Max scaling technique, which transforms values into the range [0, 1]. This approach is particularly effective in neural networks, where activation outputs typically range between −1 and 1. Normalization helps stabilize weight updates and accelerates convergence by minimizing the magnitude of prediction errors during training.
$$\text{Normalized value} = \frac{\text{Observed value} - X_{\min}}{X_{\max} - X_{\min}}$$
Observed value: the observed value to be converted.
$X_{\min}$: minimum value in the data column containing the observations.
$X_{\max}$: maximum value in the data column containing the observations.
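A minimal base R sketch of Min–Max scaling and its inverse is given below. The helper names `min_max_scale` and `min_max_unscale` and the `cost` vector are hypothetical; the inverse is included because predictions made on the [0, 1] scale have to be mapped back to original units before error rates can be computed.

```r
# Scale a numeric vector to [0, 1] using its own minimum and maximum
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Map scaled values back to the original units
min_max_unscale <- function(z, x_min, x_max) {
  z * (x_max - x_min) + x_min
}

# Hypothetical construction costs (arbitrary units)
cost     <- c(120, 80, 200, 150)
scaled   <- min_max_scale(cost)
restored <- min_max_unscale(scaled, min(cost), max(cost))
```

In practice the minimum and maximum should be taken from the training data and reused for validation and test cases, so that all subsets share the same scale.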
Coding details for building an ANN model for estimating EL and CCs at the planning stage.
plan_Eco_ann <- h2o.deeplearning(
  x = 1:11,   # Input features
  y = 12,     # Target variable: EL
  training_frame = train_plan_Eco,
  validation_frame = valid_plan_Eco,
  nfolds = 2,                          # Two-fold cross-validation
  distribution = "gaussian",
  # ReLU; the WithDropout variant is needed for hidden_dropout_ratios to take effect
  activation = "RectifierWithDropout",
  hidden = sample(1:300, 1, TRUE),     # Random neuron count in the single hidden layer
  rho = 0.90,                          # ADADELTA decay factor
  epsilon = 1e-7,                      # ADADELTA smoothing term
  input_dropout_ratio = sample(400:800, 1, TRUE) / 1000,
  hidden_dropout_ratios = sample(400:800, 1, TRUE) / 1000,
  loss = "Automatic",
  stopping_rounds = 5,                 # Early stopping settings
  stopping_metric = "AUTO",
  stopping_tolerance = 0.01,
  sparse = TRUE,
  epochs = 300
)
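The early stopping settings above (`stopping_rounds = 5`, `stopping_tolerance = 0.01`) can be read as a patience rule. The sketch below is a simplified base R illustration of that idea and not H2O's exact rule, which compares moving averages of the scoring history; the function `should_stop` and the loss sequences are hypothetical.

```r
# Simplified patience rule: stop when the best loss in the last `rounds`
# scoring events fails to beat the earlier best by a relative `tolerance`
should_stop <- function(val_loss, rounds = 5, tolerance = 0.01) {
  n <- length(val_loss)
  if (n <= rounds) return(FALSE)   # not enough history yet
  best_before <- min(val_loss[seq_len(n - rounds)])
  best_recent <- min(val_loss[(n - rounds + 1):n])
  best_recent > best_before * (1 - tolerance)
}

# Hypothetical validation-loss histories
plateaued <- c(0.50, 0.30, 0.20, 0.15, 0.12, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10)
improving <- c(0.50, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10)

stop_plateaued <- should_stop(plateaued)
stop_improving <- should_stop(improving)
```

A rule of this kind halts training once additional epochs no longer reduce the validation loss meaningfully, which limits the risk of fitting noise in a small dataset.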

3.2.2. Prediction Accuracy and Optimal Architecture of ANN Models

In the autoencoder process, four variable combinations were identified for both the planning and design stages. These combinations were used to construct a total of sixteen ANN models—eight for estimating EL and eight for CC. The target variables were normalized using the Min–Max scaling technique, with outputs ranging between 0 and 1. After prediction, values were denormalized to compute absolute error rates.
Table 11 and Table 12 present the average error rates and standard deviations for 10 validation cases. Among the planning-stage models for EL estimation, combination 4 exhibited the best performance, with an average error rate of 29.8% and a standard deviation of 16.0%. Similarly, for CC estimation, combination 4 again yielded the best results, with an average error rate of 21.0% and a standard deviation of 16.3%.
For EL estimation at the design stage, combination 3 showed the best prediction performance among combinations 1 to 4, with an average error rate of 29.8% and a standard deviation of 21.6%.
As shown in Table 13, for CC estimation at the design stage, combination 2 showed the best prediction performance among combinations 1 to 4, with an average error rate of 21.0% and a standard deviation of 16.3%.
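The average error rate and its standard deviation over a set of validation cases can be sketched in base R as follows. The sketch assumes that the error rate of a case is the absolute difference between predicted and actual values divided by the actual value; the text does not give the formula explicitly, and the numbers below are hypothetical.

```r
# Absolute percentage error per case (assumed definition of "error rate")
error_rate <- function(actual, predicted) {
  100 * abs(predicted - actual) / abs(actual)
}

# Hypothetical denormalized values for four validation cases
actual    <- c(100, 200, 400, 250)
predicted <- c(120, 180, 300, 250)

rates     <- error_rate(actual, predicted)
avg_error <- mean(rates)   # average error rate (%)
sd_error  <- sd(rates)     # spread of the error rate across cases (%)
```

Reporting the standard deviation alongside the average shows whether a model is consistently accurate or only accurate on average across cases.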
The optimal ANN model configurations for both the planning and design stages are summarized as follows. As shown in Table 14, for EL estimation in the planning stage, the model used eight input variables and 122 hidden nodes, with dropout rates of 59.3% (input layer) and 56.9% (hidden layer). For CC estimation, the planning-stage model utilized 10 input variables, 296 hidden nodes, and dropout rates of 46.1% (input) and 43.6% (hidden).
In the design stage, the optimal model for EL estimation included 15 input variables and 117 hidden nodes, with dropout rates of 42.4% (input) and 53.5% (hidden). For CC estimation, the model was simpler, comprising 13 input variables and 11 hidden nodes, with higher dropout rates of 51.2% (input) and 76.3% (hidden).
All ANN models achieved error rates within the acceptable tolerance defined for the planning stage. Moreover, the consistency between training and validation performance—measured using MSE and RMSE—indicated that overfitting was effectively controlled, likely due to the use of optimized hyperparameters and feature selection via the autoencoder.
Learning progression graphs (Figure 3 and Figure 4) illustrate model behavior across epochs. For the planning-stage cost model and the design-stage EL model, convergence was observed within the set epoch range, as the training and validation curves closely aligned. In contrast, the planning-stage EL and design-stage cost models exhibited a persistent gap between the training and validation lines, suggesting underfitting due to insufficient data. While increasing model complexity (e.g., nodes or epochs) could improve performance, doing so risks overfitting. Therefore, augmenting the training dataset remains the most viable approach for further performance enhancement.

3.3. Deep Neural Network

The DNN model adopts a backpropagation neural network structure similar to that of the ANN. The architecture utilizes the ReLU (Rectified Linear Unit [1]) activation function, with the network depth randomly set between two and five hidden layers and each layer containing between 1 and 300 nodes. While the DNN model shares the same baseline activation function, loss type, and learning strategy as the ANN configuration, it differs significantly in network depth (2–5 hidden layers), layer-wise dropout control, and the expanded architectural complexity. These differences enable the DNN to model higher-order interactions and extract deeper abstractions, particularly useful for predicting nonlinear and variance-prone outcomes like environmental load. For each variable combination, 16 distinct DNN models are developed to ensure robust evaluation.
Coding details for building a DNN model for estimating EL and CCs at the planning stage.
library(h2o)  # Assumes an H2O cluster is already running (h2o.init())

# Building the DNN Model
depth <- sample(2:5, 1)   # Randomly select the depth (2 to 5 hidden layers)

plan_Eco_dnn <- h2o.deeplearning(
  x = 1:11,                     # Columns 1 to 11 are the features
  y = 12,                       # Column 12 is the target variable (EL or CC)
  training_frame = train_dlan_Eco,
  validation_frame = test_dlan_Eco,
  nfolds = 2,                   # Number of folds for cross-validation
  distribution = "gaussian",    # Distribution type for the target variable
  activation = "RectifierWithDropout",  # ReLU; H2O requires a *WithDropout activation when hidden_dropout_ratios is set
  hidden = sample(1:300, depth, TRUE),  # Nodes per hidden layer (1 to 300 each)
  rho = 0.90,                   # ADADELTA adaptive learning rate time decay factor
  epsilon = 1e-07,              # ADADELTA smoothing factor
  input_dropout_ratio = sample(400:800, 1, TRUE) / 1000,        # Input dropout ratio (0.4 to 0.8)
  hidden_dropout_ratios = sample(400:800, depth, TRUE) / 1000,  # One dropout ratio per hidden layer
  loss = "Automatic",           # Loss function to use
  stopping_rounds = 5,          # Stop after 5 scoring rounds without improvement
  stopping_metric = "AUTO",     # Metric for stopping
  stopping_tolerance = 0.01,    # Relative tolerance for the stopping criterion
  sparse = TRUE,                # Sparse data handling
  epochs = 300                  # Number of training epochs (can also use 500)
)
# Prediction Process
prediction_plan_Eco_dnn <- h2o.predict(plan_Eco_dnn, newdata = vali_plan_Eco)
# Calculate the Error Rate (as.vector() pulls the H2O predictions into R;
# vali_plan_Eco_1 is assumed to be a local data frame)
error_rate <- mean(abs((as.vector(prediction_plan_Eco_dnn$predict) / vali_plan_Eco_1$Eco) - 1) * 100)
# Print the error rate
print(error_rate)
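The text states that 16 distinct DNN models are built for each variable combination, with depth, node counts, and dropout rates drawn at random. A minimal base-R sketch of how such candidate configurations might be drawn is given here. The names `draw_config`, `n_models`, and `configs` and the fixed seed are illustrative assumptions, not taken from the study.

```r
# Sketch: draw 16 random DNN configurations for one variable combination.
# draw_config(), n_models and the seed are hypothetical, for illustration only.
set.seed(42)

draw_config <- function() {
  depth <- sample(2:5, 1)  # 2 to 5 hidden layers
  list(
    depth = depth,
    hidden = sample(1:300, depth, replace = TRUE),                 # nodes per hidden layer
    input_dropout = sample(400:800, 1) / 1000,                     # 0.40 to 0.80
    hidden_dropout = sample(400:800, depth, replace = TRUE) / 1000 # one ratio per hidden layer
  )
}

n_models <- 16
configs <- lapply(seq_len(n_models), function(i) draw_config())

# Inspect the first candidate configuration
str(configs[[1]])
```

Each element of `configs` could then be passed to `h2o.deeplearning()` through `hidden`, `input_dropout_ratio`, and `hidden_dropout_ratios`, and the candidate with the lowest validation error rate retained.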

3.3.1. Architecture and Prediction Performance for EL and CC Estimation

Table 15 summarizes the average error rate and standard deviation over 10 verification cases for the EL and CC estimates of the constructed DNN model in the planning and design stages. For planning-stage EL estimation, combination 4 performed best among combinations 1 to 4, with an average error rate of 27.1% and a standard deviation of 18.6%.
For planning-stage CC estimation (Table 16), combination 3 performed best among combinations 1 to 4, with an average error rate of 17.0% and a standard deviation of 9.8%. Accordingly, the planning-stage DNN models satisfied the planning-stage error tolerance (30%) set in this study.
Based on the results in Table 17, combination 1 of the design-stage EL estimation DNN model yielded the best performance, with an average error rate of 24.0% and a standard deviation of 13.8%. For the CC estimation (Table 18), combination 2 performed best, achieving an average error rate of 14.6% and a standard deviation of 3.9%.
While the DNN model for CC met the predefined 20% error tolerance for the design stage, the EL model did not. Although DNNs generally offer high predictive capability, their effectiveness in this study was constrained by limited data availability. The results indicate that dataset size plays a critical role in realizing the full potential of deep learning models in construction project estimation.
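The tolerance check described above can be written out in a few lines. The sketch below uses only the average error rates and the stage tolerances (30% planning, 20% design) reported in this subsection; the data frame layout and column names are illustrative assumptions.

```r
# Sketch: compare the reported DNN average error rates with the stage tolerances.
# Column names are hypothetical; the numbers are those quoted in the text.
results <- data.frame(
  stage = c("planning", "planning", "design", "design"),
  target = c("EL", "CC", "EL", "CC"),
  avg_error = c(27.1, 17.0, 24.0, 14.6),
  stringsAsFactors = FALSE
)

tolerance <- c(planning = 30, design = 20)  # tolerance in percent, by stage
results$tolerance <- unname(tolerance[results$stage])
results$within_tolerance <- results$avg_error <= results$tolerance

print(results)
```

Under these stage-based tolerances, only the design-stage EL model falls outside its limit, which is consistent with the statement above.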

3.3.2. Optimal DNN Model Configuration and Performance Assessment

The optimal DNN model configurations for the planning and design stages are as follows. The optimal planning-stage model for EL estimation has 10 input variables and two hidden layers of 184 and 155 nodes, with dropout rates of 51.8% in the input layer and 74.3% and 46.6% in the first and second hidden layers.
The optimal planning-stage model for CC estimation has nine input variables and two hidden layers of 234 and 146 nodes, with dropout rates of 47.9% in the input layer and 43.5% and 69.0% in the first and second hidden layers.
The optimal design-stage model for EL estimation has 12 input variables and three hidden layers of 13, 8, and 258 nodes, with dropout rates of 52.2% in the input layer and 66.1%, 48.6%, and 41.6% in the three hidden layers.
The optimal design-stage model for CC estimation has 12 input variables and two hidden layers of 5 and 46 nodes, with dropout rates of 47.6% in the input layer and 74.7% and 57.1% in the first and second hidden layers.
The training and verification MSE and RMSE values presented in Table 19 and Table 20 show minimal differences, indicating that the chosen model parameters and the autoencoder-based variable selection effectively controlled overfitting. However, the RMSE trends in Figure 5 and Figure 6 suggest that optimal convergence was not fully achieved within the given epoch range. Further parameter tuning may increase model complexity and potentially degrade performance. To enhance learning stability and predictive accuracy, expanding the dataset is recommended, as additional data would allow the model to generalize better without overfitting.

3.3.3. Additional Evaluation Metrics

In addition to MSE and RMSE, we calculated MAPE, MAE, and R² to offer a more comprehensive assessment of model performance. Table 21 presents these metrics for the optimal ANN and DNN models across planning and design stages. The results show that the DNN models generally achieved lower MAPE and MAE values and higher R² scores, particularly in EL prediction, reinforcing their ability to handle complex, high-dimensional datasets. ANN models performed competitively for CC estimation, especially where data patterns were more structured.
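For reference, the three additional metrics can be computed as in the following base-R sketch. It uses the standard textbook definitions; the study does not state which definition of the coefficient of determination was used, so the form below is an assumption. The function names and the four-point example vectors are hypothetical.

```r
# Sketch: standard definitions of MAE, MAPE and the coefficient of determination.
# The example vectors are hypothetical and not taken from the study's data.
mae <- function(actual, predicted) mean(abs(actual - predicted))

mape <- function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100

r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)     # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)  # total sum of squares
  1 - ss_res / ss_tot
}

actual <- c(100, 200, 300, 400)
predicted <- c(110, 190, 330, 360)

cat("MAE: ", mae(actual, predicted), "\n")
cat("MAPE:", mape(actual, predicted), "%\n")
cat("R2:  ", r_squared(actual, predicted), "\n")
```

MAPE is the same quantity as the average error rate used in the prediction tables, so it is directly comparable with the 20% and 30% tolerances.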

4. Results

4.1. Comparison of ANN and DNN

Figure 7 presents a comparative analysis of the prediction error rates for EL and CC estimation using the ANN and DNN during the design stage of road construction projects. The results demonstrate that the DNN model achieved superior performance in EL estimation, recording an error rate of 29.4%, which is below the predefined threshold of 30%. In contrast, the ANN model exhibited a higher error rate of 35.1%, exceeding the acceptable limit. This suggests that the deeper architecture of the DNN was more effective in capturing complex patterns in the high-dimensional input space, which is critical for environmental impact prediction.
In terms of CC estimation, both models delivered error rates below the acceptable 20% threshold, with the ANN model slightly outperforming the DNN model (17.3% vs. 18.6%). This indicates that for cost estimation—where the data may be more linearly structured or less sensitive to deeper abstraction—simpler architectures such as the ANN may still be highly effective.
The figure also includes red threshold lines (30% for EL and 20% for CC) to emphasize model acceptability boundaries. The performance trends highlight that model selection should be aligned with task complexity: DNNs may be better suited for tasks involving nonlinear relationships and noise-prone inputs, such as environmental data, while ANNs can perform competitively in more structured domains like cost estimation. It is also noted that DNNs, while more expressive, underperformed ANNs in certain settings (e.g., design-stage CC prediction). This may be attributed to the limited dataset size, the linear nature of cost variables, and the increased sensitivity of deep models to input noise and architecture variability. These results reinforce the importance of matching model complexity to task characteristics.
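The comparison in Figure 7 can be restated as a small lookup. The sketch below uses only the four error rates and the two threshold lines quoted in this section; the object and column names are illustrative assumptions.

```r
# Sketch: pick the lower-error model per target from the Figure 7 values.
# Object and column names are hypothetical; numbers are those quoted in the text.
fig7 <- data.frame(
  target = c("EL", "EL", "CC", "CC"),
  model = c("ANN", "DNN", "ANN", "DNN"),
  error_rate = c(35.1, 29.4, 17.3, 18.6),
  stringsAsFactors = FALSE
)

threshold <- c(EL = 30, CC = 20)  # threshold lines shown in Figure 7, in percent
fig7$acceptable <- fig7$error_rate < unname(threshold[fig7$target])

# Lowest-error model for each target (rows are ordered CC, then EL)
best <- do.call(rbind, lapply(split(fig7, fig7$target), function(d) {
  d[which.min(d$error_rate), ]
}))

print(fig7)
print(best)
```

With these values, the DNN is selected for EL and the ANN for CC, and only the ANN's EL estimate exceeds its threshold.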
These findings validate the approach of using autoencoder-based variable filtering and deep learning architectures to improve early-stage prediction accuracy in infrastructure planning, offering practical benefits for sustainability-focused project management and design decision-making.
As a future direction, this study will be expanded to benchmark the proposed ANN and DNN models against traditional and ensemble-based machine learning algorithms such as Support Vector Machines, Extreme Learning Machines, and XGBoost. This comparative analysis will provide a more comprehensive understanding of model suitability across various infrastructure estimation tasks and dataset characteristics.

4.2. Discussions

This study introduces a dual-phase modeling approach that separates planning- and design-stage variables to improve prediction accuracy for construction cost (CC) and EL. Unlike previous studies that rely on single-phase data, this structure enables phase-specific modeling, allowing for a more granular analysis of variable relevance, data uncertainty, and model suitability at distinct stages of project development. The planning phase emphasizes early estimations under incomplete information, whereas the design phase incorporates detailed quantities, thereby enhancing the applicability of machine learning in real-world infrastructure decision-making.
Despite these contributions, several modeling limitations must be acknowledged. While DNNs performed well in capturing complex patterns—particularly in EL prediction—they did not consistently outperform ANNs, especially in settings where data were more structured or linear, such as CC prediction. This may stem from the limited size and scope of the dataset (150 national road projects), which restricts the learning capacity of deeper architectures and increases sensitivity to noise or architectural over-parameterization. Moreover, the empirical tuning of network depth, dropout rates, and node sizes, while cross-validated, lacked theoretical justification and could benefit from future sensitivity and ablation analyses.
Addressing these generalizability concerns requires a multi-pronged strategy. First, future research should involve external validation using independent datasets from different regions or infrastructure types to assess robustness under domain shift. Second, transfer learning techniques will be explored to adapt pre-trained models to new but related datasets—particularly useful in low-data environments. Third, benchmarking against alternative machine learning models such as SVMs, ELMs, and XGBoost is planned to evaluate model competitiveness and task suitability across prediction contexts.
Additionally, efforts will focus on expanding the dataset to cover a broader geographic and functional spectrum of road infrastructure projects. This would enable the development of more generalized models with stronger predictive performance and real-world applicability. Lastly, the integration of these models into decision-support tools can assist planners and engineers in making informed, sustainability-oriented decisions in the early phases of highway project development.

5. Conclusions

This study developed and evaluated machine learning models—specifically ANNs and DNNs—to estimate EL and CC during both the planning and design stages of national road projects. The findings offer key insights into model performance, practical implications, and future research directions.
  • A structured dataset of 150 completed South Korean national road projects was compiled, forming planning- and design-phase databases. Emphasis was placed on 19 high-impact sub-work types to improve predictive accuracy and minimize irrelevant input noise.
  • To address the 4.47% missing data in the design-stage database, a hybrid imputation strategy combining mean substitution and random forest-based modeling was applied. This method preserved overall data distributions while reducing standard deviations by up to 5%, enhancing data stability and model readiness.
  • Dimensionality reduction via an autoencoder filtered the key variables, retaining only 16 critical features such as culvert concrete pouring and frost protection layers, while maintaining 97% of the dataset’s explanatory power, thereby reducing redundancy.
  • ANN models benefited from cross-validation and hyperparameter optimization, achieving strong performance metrics (MSE = 0.06, RMSE = 0.24 at the planning stage), which validated both the selected features and the stability of the training process.
  • The best-performing ANN models yielded average error rates of 29.8% for EL and 21.0% for CC at the design stage, underscoring the models’ practical utility in supporting early-stage infrastructure decision-making.
  • Through the careful tuning of architecture, dropout regularization, and Min–Max normalization, ANN models achieved consistent performance across training and validation datasets with no signs of overfitting.
  • DNN models also demonstrated strong predictive capabilities, achieving average error rates of 27.1% and 17.0% for planning-stage EL and cost estimations and 24.0% and 14.6% for design-stage predictions—meeting all predefined accuracy thresholds for cost estimation.
  • Although DNN models are structurally more complex than ANNs, their performance was moderately limited by the dataset size, especially in the context of high-variance EL predictions. Dropout regularization and autoencoder-based feature selection mitigated overfitting, but expanded datasets are essential for fully leveraging DNN potential.
  • Comparative analysis showed that DNNs slightly outperformed ANNs in EL estimation (29.4% vs. 35.1%), while ANNs had a marginal advantage in cost prediction (17.3% vs. 18.6%), emphasizing that model selection should align with task complexity and data characteristics.
  • Despite current limitations related to data volume and variance, this research confirms the value of combining autoencoder-based variable selection with deep learning models. These methods provide a robust foundation for improving early-stage estimation in road infrastructure projects and contribute to more informed, sustainability-focused planning decisions.
  • Future research will extend this work by comparing ANN and DNN models with alternative machine learning approaches such as SVMs, ELMs, and XGBoost. Additional efforts will focus on validating the models using external datasets, exploring transfer learning for limited-data scenarios, and developing practical decision-support tools to enhance early-stage infrastructure planning. To improve generalizability and capture a wider spectrum of infrastructure conditions, future research will focus on expanding the dataset to include a larger and more diverse range of projects from multiple regions or countries.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (RS-2021-NR066174).

Data Availability Statement

The dataset is available on request from the authors.

Conflicts of Interest

The author declares no conflicts of interest.

Nomenclature

Abbreviation | Description
ANN | Artificial Neural Network
DNN | Deep Neural Network
EL | Environmental Load
CC | Construction Cost
MSE | Mean Squared Error
RMSE | Root Mean Squared Error
MAE | Mean Absolute Error
MAPE | Mean Absolute Percentage Error
R² | Coefficient of Determination
SVM | Support Vector Machine
ELM | Extreme Learning Machine
XGBoost | Extreme Gradient Boosting
VR | Vertical Reinforcement (Pipe/Structure)
Ascon | Asphalt Concrete
KICT | Korea Institute of Civil Engineering and Building Technology

References

  1. He, A.; Dong, Z.; Zhang, H.; Zhang, A.A.; Qiu, S.; Liu, Y.; Wang, K.C.P.; Lin, Z. Automated Pixel-Level Detection of Expansion Joints on Asphalt Pavement Using a Deep-Learning-Based Approach. Struct. Control Health Monit. 2023, 2023, 7552337.
  2. Wang, Y.; Wang, C.; Zhang, H.; Liang, F.; Yuan, Z. Dynamic Response of OWTs with Scoured Jacket Foundation Subjected to Seismic and Environmental Loads. Mar. Struct. 2025, 103, 103839.
  3. Mittermayr, D.; Freud, P.J.; Fischer, J. Fatigue Crack Growth Resistance under Superimposed Mechanical-Environmental Loads of Virgin and Recycled Polystyrene Using Cracked Round Bar Specimens. Eng. Fract. Mech. 2025, 319, 111042.
  4. Lavassani, S.H.H.; Doroudi, R.; Gavgani, S.A.M. Optimization of Semi-Active Tuned Mass Damper Inerter for Enhanced Vibration Control of Jacket Platforms Using Multi-Objective Optimization Due to Environmental Load. Structures 2025, 78, 109305.
  5. Elmas, F.; Algin, H.M. Soil-Monopile Interaction Assessment of Offshore Wind Turbines with Comprehensive Subsurface Modelling to Earthquake and Environmental Loads of Wind and Wave. Soil Dyn. Earthq. Eng. 2025, 192, 109293.
  6. Li, Z.; Jiang, W.; van Dam, T.; Zou, X.; Chen, Q.; Chen, H. A Review on Modeling Environmental Loading Effects and Their Contributions to Nonlinear Variations of Global Navigation Satellite System Coordinate Time Series. Engineering 2025, 47, 26–37.
  7. Ghanizadeh, A.R.; Ghanizadeh, A.; Asteris, P.G.; Fakharian, P.; Armaghani, D.J. Developing Bearing Capacity Model for Geogrid-Reinforced Stone Columns Improved Soft Clay Utilizing MARS-EBS Hybrid Method. Transp. Geotech. 2023, 38, 100906.
  8. Cheng, M.Y.; Vu, Q.T.; Gosal, F.E. Hybrid Deep Learning Model for Accurate Cost and Schedule Estimation in Construction Projects Using Sequential and Non-Sequential Data. Autom. Constr. 2025, 170, 105904.
  9. Liu, H.; Li, M.; Cheng, J.C.P.; Anumba, C.J.; Xia, L. Actual Construction Cost Prediction Using Hypergraph Deep Learning Techniques. Adv. Eng. Inform. 2025, 65, 103187.
  10. Habib, O.; Abouhamad, M.; Bayoumi, A.E.M. Ensemble Learning Framework for Forecasting Construction Costs. Autom. Constr. 2025, 170, 105903.
  11. Mahpour, A. Building Maintenance Cost Estimation and Circular Economy: The Role of Machine-Learning. Sustain. Mater. Technol. 2023, 37, e00679.
  12. Bruzzone, A.G.; Sinelshchikov, K.; Gotelli, M.; Monaci, F.; Sina, X.; Ghisi, F.; Cirillo, L.; Giovannetti, A. Machine Learning and Simulation Modeling Large Offshore and Production Plants to Improve Engineering and Construction. Procedia Comput. Sci. 2025, 253, 3318–3324.
  13. Farshadfar, Z.; Khajavi, S.H.; Mucha, T.; Tanskanen, K. Machine Learning-Based Automated Waste Sorting in the Construction Industry: A Comparative Competitiveness Case Study. Waste Manag. 2025, 194, 77–87.
  14. Mahmoodzadeh, A.; Nejati, H.R.; Mohammadi, M. Optimized Machine Learning Modelling for Predicting the Construction Cost and Duration of Tunnelling Projects. Autom. Constr. 2022, 139, 104305.
  15. Wang, R.; Asghari, V.; Cheung, C.M.; Hsu, S.C.; Lee, C.J. Assessing Effects of Economic Factors on Construction Cost Estimation Using Deep Neural Networks. Autom. Constr. 2022, 134, 104080.
  16. Alsulamy, S. Comparative Analysis of Deep Learning Algorithms for Predicting Construction Project Delays in Saudi Arabia. Appl. Soft Comput. 2025, 172, 112890.
  17. Li, Q.; Yang, Y.; Yao, G.; Wei, F.; Li, R.; Zhu, M.; Hou, H. Classification and Application of Deep Learning in Construction Engineering and Management—A Systematic Literature Review and Future Innovations. Case Stud. Constr. Mater. 2024, 21, e04051.
  18. Lung, L.W.; Wang, Y.R.; Chen, Y.S. Leveraging Deep Learning and Internet of Things for Dynamic Construction Site Risk Management. Buildings 2025, 15, 1325.
  19. Liu, Q.; He, P.; Peng, S.; Wang, T.; Ma, J. A Survey of Data-Driven Construction Materials Price Forecasting. Buildings 2024, 14, 3156.
  20. Chen, N.; Lin, X.; Jiang, H.; An, Y. Automated Building Information Modeling Compliance Check through a Large Language Model Combined with Deep Learning and Ontology. Buildings 2024, 14, 1983.
  21. Choi, S.M.; Cha, H.S.; Jiang, S. Hybrid Data Augmentation for Enhanced Crack Detection in Building Construction. Buildings 2024, 14, 1929.
  22. Nguyen, H.L.; Tran, V.Q. Data-Driven Approach for Investigating and Predicting Rutting Depth of Asphalt Concrete Containing Reclaimed Asphalt Pavement. Constr. Build. Mater. 2023, 377, 131116.
  23. Raza, M.S.; Sharma, S.K. Optimizing Porous Asphalt Mix Design for Permeability and Air Voids Using Response Surface Methodology and Artificial Neural Networks. Constr. Build. Mater. 2024, 442, 137513.
  24. Mabrouk, G.M.; Elbagalati, O.S.; Dessouky, S.; Fuentes, L.; Walubita, L.F. Using ANN Modeling for Pavement Layer Moduli Backcalculation as a Function of Traffic Speed Deflections. Constr. Build. Mater. 2022, 315, 125736.
  25. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.
  26. El-Chabib, H.; Nehdi, M.; Sonebi, M. Artificial Intelligence Model for Flowable Concrete Mixtures Used in Underwater Construction and Repair. ACI Mater. J. 2003, 100, 165–173.
  27. Shehadeh, A.; Alshboul, O.; Al Mamlook, R.E.; Hamedat, O. Machine Learning Models for Predicting the Residual Value of Heavy Construction Equipment: An Evaluation of Modified Decision Tree, LightGBM, and XGBoost Regression. Autom. Constr. 2021, 129, 103827.
  28. Chai, T.; Draxler, R.R. Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)? Geosci. Model Dev. Discuss. 2014, 7, 1525–1534.
Figure 1. Box plot of MSE values from six cross-validated autoencoder models for the planning-stage dataset.
Figure 2. The mean square error box plot of the design-stage autoencoder 6-fold cross-validation model.
Figure 3. RMSE plot showing training and validation trends for the ANN model used to estimate EL during the planning stage.
Figure 4. RMSE curves of the ANN model for estimating EL and CC during the design stage.
Figure 5. Planning-stage EL and CC estimation model DNN RMSE graph.
Figure 6. Design-stage EL estimation model DNN RMSE graph.
Figure 7. Comparative error rates of ANN and DNN models for estimating EL and CC at the design stage.
Table 1. Summary of missing values and quantity distributions by work type (before imputation).
Work CategoryMissingQ1MedianQ3Max (Q4)MeanStd. Dev
Excavation (m3)02691286,783531,8141,292,560375,441290,265
Ripping Arm (m3)8142112,068238,409650,328160,595150,066
Blasting Rock (m3)12164261,197520,1912,322,210379,734395,588
Dump Transport (m3)117,418641,9461,017,2252,393,845718,342569,633
Concrete Pouring (m3)1014114,70224,398143,49518,18017,191
Frost Protection (m3)150940,22960,283180,18044,49929,716
Ascon Surface (ton)3196918,53424,821472,22629,134
Table 2. Post-imputation quantity distribution (selected variables).
Work Category | Mean (Before) | Std Dev (Before) | Mean (After) | Std Dev (After) | % Change in SD
Blasting Rock | 379,734 | 395,588 | 357,132 | 377,753 | −5%
Green Zone Fill | 177,314 | 297,435 | 178,926 | 296,379 | −0.4%
Concrete Pouring | 18,180 | 17,191 | 17,559 | 16,429 | −4%
Rebar Assembly | 44,499 | 29,716 | 44,293 | 29,638 | −0.3%
Asphalt Surface | 29,134 | 55,355 | 28,847 | 54,581 | −1%
Table 3. Change in mean–standard deviation gap due to imputation.
Variable | Mean–SD Gap (Before) | Mean–SD Gap (After) | Change (%) | Interpretation
Ascon Middle Layer | 9258 | 25,735 | +178% | Reduced predictability
Rebar Assembly | 2212 | 24,435 | +1005% | High noise added
Blasting Rock | 15,854 | 20,620 | +30% | Slight increase
Green Zone Fill | 120,121 | 117,453 | −2% | Stable
Asphalt Surface | 26,221 | 2421 | −91% | Improved modeling stability
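The change column of Table 3 appears to be the relative change of the gap, (after − before) / before. The short sketch below recomputes it from the before and after values listed in the table; the data frame and column names are illustrative assumptions.

```r
# Sketch: recompute the "Change (%)" column of Table 3 as a relative change.
# Data frame and column names are hypothetical; values are read from Table 3.
gap <- data.frame(
  variable = c("Ascon Middle Layer", "Rebar Assembly", "Blasting Rock",
               "Green Zone Fill", "Asphalt Surface"),
  before = c(9258, 2212, 15854, 120121, 26221),
  after = c(25735, 24435, 20620, 117453, 2421),
  stringsAsFactors = FALSE
)

gap$change_pct <- round((gap$after - gap$before) / gap$before * 100)

print(gap)
```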
Table 4. Planning-stage autoencoder model information.
Model 5
Number of learnings: 24,000
Division | Layer | Units | Dropout rate
Input layer | 1 | 11 | 78%
Hidden layer | 2 | 482 | 56%
Hidden layer | 3 | 481 | 72%
Hidden layer | 4 | 212 | 74%
Output layer | 5 | 11 | -
Process | MSE | RMSE
Learning process | 0.08 | 0.28
Verification process | 0.06 | 0.24
Table 5. Planning-stage autoencoder variable reproduction mean square error.
MSE | A | B | C | D | E | F | G | H | I | J | K
1 | 0.26 | 0.01 | 0.01 | 0.17 | 0.01 | 0.32 | 0.06 | 0.01 | 0.05 | 0.07 | 0.00
2 | 0.06 | 0.00 | 0.07 | 0.17 | 0.01 | 0.32 | 0.00 | 0.00 | 0.06 | 0.07 | 0.00
3 | 0.01 | 0.00 | 0.07 | 0.01 | 0.01 | 0.32 | 0.01 | 0.00 | 0.16 | 0.07 | 0.00
4 | 0.01 | 0.02 | 0.01 | 0.17 | 0.01 | 0.32 | 0.02 | 0.03 | 0.00 | 0.00 | 0.30
5 | 0.15 | 0.01 | 0.07 | 0.01 | 0.01 | 0.32 | 0.00 | 0.00 | 0.00 | 0.07 | 0.00
6 | 0.02 | 0.02 | 0.35 | 0.01 | 0.35 | 0.32 | 0.02 | 0.00 | 0.05 | 0.36 | 0.02
7 | 0.14 | 0.01 | 0.01 | 0.17 | 0.01 | 0.19 | 0.03 | 0.00 | 0.00 | 0.07 | 0.00
8 | 0.06 | 0.00 | 0.01 | 0.01 | 0.01 | 0.19 | 0.00 | 0.00 | 0.15 | 0.36 | 0.00
9 | 0.01 | 0.00 | 0.01 | 0.01 | 0.35 | 0.19 | 0.03 | 0.01 | 0.09 | 0.36 | 0.00
10 | 0.14 | 0.01 | 0.01 | 0.17 | 0.01 | 0.32 | 0.11 | 0.01 | 0.00 | 0.07 | 0.00
Average | 0.09 | 0.01 | 0.06 | 0.09 | 0.08 | 0.28 | 0.03 | 0.01 | 0.06 | 0.15 | 0.03
Variable key: A = Administrative district; B = Road height; C = Road grade; D = Topography; E = Design speed; F = Type of construction; G = Road length; H = Road area; I = Pavement thickness; J = Number of lanes; K = Road width.
Table 6. Identifying the importance of planning-stage variables.
Variable | Relative Importance | Ratio | Cumulative Ratio
Type of construction | 1.00 | 17% | 17%
Topography | 0.90 | 15% | 32%
Road width | 0.81 | 13% | 45%
Pavement thickness | 0.53 | 9% | 54%
Number of lanes | 0.47 | 8% | 62%
Road grade | 0.46 | 8% | 70%
Administrative district | 0.45 | 8% | 77%
Road length | 0.37 | 6% | 83%
Road height | 0.35 | 6% | 89%
Design speed | 0.33 | 5% | 95%
Road area | 0.31 | 5% | 100%
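The cumulative ratio column of Table 6 follows from normalizing the relative importances by their sum and accumulating them. A minimal base-R sketch is shown here, with the importances entered in Table 6 row order. Because the published importances are rounded to two decimals, individual ratios recomputed this way can differ from the table by one percentage point, while the cumulative values match. The variable names are illustrative assumptions.

```r
# Sketch: reproduce the cumulative ratio column of Table 6.
# Importances are listed in Table 6 row order; object names are hypothetical.
importance <- c(1.00, 0.90, 0.81, 0.53, 0.47, 0.46, 0.45, 0.37, 0.35, 0.33, 0.31)

ratio <- importance / sum(importance)  # share of total importance
cumulative <- cumsum(ratio)            # running share

print(data.frame(
  rank = seq_along(importance),
  importance = importance,
  cumulative_pct = round(cumulative * 100)
))

# Number of top-ranked variables whose cumulative share rounds to at least 77%,
# which corresponds to the seven-variable set with a 77% cumulative ratio
n_keep <- which(round(cumulative * 100) >= 77)[1]
print(n_keep)
```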
Table 7. Planning-stage variable combinations used to build ANN and DNN models.
Division | Variable Combination (Number of Variables) | Cumulative Ratio
Combination 1 | Construction type, terrain, road width, pavement thickness, number of lanes, road grade, administrative district (7) | 77%
Combination 2 | Construction type, terrain, road width, pavement thickness, number of lanes, road grade, administrative district, road length (8) | 83%
Combination 3 | Construction type, terrain, road width, pavement thickness, number of lanes, road grade, administrative district, road length, road height (9) | 89%
Combination 4 | Construction type, terrain, road width, pavement thickness, number of lanes, road grade, administrative district, road length, road height, road area (10) | 95%
Table 8. Planning-stage autoencoder model 6 information.
Model 6
Number of learnings: 24,000
Division | Layer | Units | Dropout rate
Input layer | 1 | 19 | 76%
Hidden layer | 2 | 277 | 72%
Output layer | 3 | 19 | -
Process | MSE | RMSE
Learning process | 0.29 | 0.53
Verification process | 0.06 | 0.23
Table 9. Identifying the importance of design-stage variables.
Variable | Relative Importance | Ratio | Cumulative Ratio
Pouring concrete for culvert | 1.00 | 8% | 8%
Underground construction | 0.93 | 8% | 16%
Frost protection layer | 0.89 | 7% | 24%
Underground rebar processing and assembly | 0.88 | 7% | 31%
Dump transport | 0.78 | 7% | 38%
No body | 0.74 | 6% | 44%
Horizontal drainage pipe VR pipe | 0.74 | 6% | 50%
Tossa | 0.73 | 6% | 56%
Ascon base layer | 0.72 | 6% | 62%
Ripping arm | 0.69 | 6% | 68%
Ceramic transport | 0.68 | 6% | 74%
Formwork for culvert | 0.64 | 5% | 79%
Transverse drain pipe wing wall | 0.58 | 5% | 84%
Horizontal drainage pipe VR pipe | 0.55 | 5% | 88%
Blasting rock | 0.38 | 3% | 92%
Ascon middle layer | 0.30 | 3% | 94%
Green land reclamation | 0.29 | 2% | 97%
Road | 0.26 | 2% | 99%
Ascon surface | 0.14 | 1% | 100%
Table 10. Design-stage variable combinations used to build ANN and DNN models.
Division | Variable Combination (Number of Variables) | Cumulative Ratio
Combination 1 | Culvert concrete pouring, culvert scaffolding, frost protection layer, culvert reinforcement processing and assembly, dump transport, furnace body, transverse drainage pipe VR pipe, soil, asphalt base, ripping rock, ceramic transport, culvert formwork (12) | 79%
Combination 2 | Culvert concrete pouring, culvert scaffolding, frost protection layer, culvert reinforcement processing and assembly, dump transport, furnace body, transverse drainage pipe VR pipe, soil, asphalt base, ripping rock, ceramic transport, culvert formwork, transverse drainage pipe wing wall (13) | 84%
Combination 3 | Culvert concrete pouring, culvert scaffolding, frost protection layer, culvert reinforcement processing and assembly, dump transport, furnace body, transverse drainage pipe VR pipe, soil, asphalt base, ripping rock, ceramic transport, culvert formwork, transverse drainage pipe wing wall, blasting rock, asphalt intermediate layer (15) | 94%
Combination 4 | Culvert concrete pouring, culvert scaffolding, frost protection layer, culvert reinforcement processing and assembly, dump transport, furnace body, transverse drainage pipe VR pipe, soil, asphalt base, ripping rock, ceramic transport, culvert formwork, transverse drainage pipe wing wall, blasting rock, asphalt intermediate layer, green zone fill (16) | 97%
Table 11. Results of prediction performance of ANN model for EL estimation at planning stage.
Unit: Eco-Point
Division | EL Actual Value | Combination 1 Predicted Value | Combination 1 Error Rate | Combination 2 Predicted Value | Combination 2 Error Rate | Combination 3 Predicted Value | Combination 3 Error Rate | Combination 4 Predicted Value | Combination 4 Error Rate
Case 1 | 8174 | 5610 | 31.4% | 5320 | 34.9% | 6061 | 25.8% | 4984 | 39.0%
Case 2 | 7852 | 4211 | 46.4% | 4809 | 38.8% | 5624 | 28.4% | 4695 | 40.2%
Case 3 | 8490 | 4249 | 50.0% | 4745 | 44.1% | 5664 | 33.3% | 4770 | 43.8%
Case 4 | 2917 | 4307 | 47.7% | 5145 | 76.4% | 5839 | 100.2% | 4659 | 59.7%
Case 5 | 3892 | 5411 | 39.0% | 4535 | 16.5% | 5242 | 34.7% | 6913 | 77.6%
Case 6 | 3716 | 5055 | 36.0% | 4445 | 19.6% | 5420 | 45.9% | 4883 | 31.4%
Case 7 | 6690 | 4337 | 35.2% | 5431 | 18.8% | 6207 | 7.2% | 5068 | 24.2%
Case 8 | 4273 | 4999 | 17.0% | 4968 | 16.3% | 5581 | 30.6% | 4942 | 15.6%
Case 9 | 3337 | 4272 | 28.0% | 4522 | 35.5% | 5241 | 57.1% | 4558 | 36.6%
Case 10 | 5474 | 4296 | 21.5% | 5614 | 2.6% | 6513 | 19.0% | 5941 | 8.5%
Average error rate | - | - | 35.2% | - | 30.3% | - | 38.2% | - | 37.7%
Standard deviation | - | - | 10.5% | - | 19.6% | - | 24.3% | - | 19.2%
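The summary rows of Table 11 can be recomputed from the case values. The sketch below does this for combination 2, using the actual and predicted values listed in the table and the error rate definition |predicted / actual − 1| × 100. The reported standard deviation of 19.6% appears to match the population form (dividing by n) rather than R's default sample form, which is an observation from this recomputation and not something the study states. Variable names are illustrative assumptions.

```r
# Sketch: recompute the average error rate and standard deviation for
# combination 2 of Table 11. Variable names are hypothetical.
actual <- c(8174, 7852, 8490, 2917, 3892, 3716, 6690, 4273, 3337, 5474)
predicted <- c(5320, 4809, 4745, 5145, 4535, 4445, 5431, 4968, 4522, 5614)

error_pct <- abs(predicted / actual - 1) * 100  # per-case error rate in percent

avg_error <- mean(error_pct)
sd_pop <- sqrt(mean((error_pct - avg_error)^2))  # population SD (divide by n)
sd_sample <- sd(error_pct)                       # sample SD (divide by n - 1)

cat("Average error rate:", round(avg_error, 1), "%\n")
cat("Population SD:     ", round(sd_pop, 1), "%\n")
cat("Sample SD:         ", round(sd_sample, 1), "%\n")
```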
Table 12. ANN model prediction performance results for CC estimation at the planning stage.
Unit: 10 million won
Division | CC Actual Value | Combination 1 Predicted Value | Combination 1 Error Rate | Combination 2 Predicted Value | Combination 2 Error Rate | Combination 3 Predicted Value | Combination 3 Error Rate | Combination 4 Predicted Value | Combination 4 Error Rate
Case 1 | 1838 | 1662 | 9.6% | 1639 | 10.8% | 1754 | 4.5% | 1520 | 17.3%
Case 2 | 1891 | 1643 | 13.1% | 1558 | 17.6% | 1512 | 20.0% | 1519 | 19.7%
Case 3 | 1942 | 1647 | 15.2% | 1552 | 20.1% | 1549 | 20.2% | 1512 | 22.1%
Case 4 | 1061 | 1655 | 56.0% | 1570 | 48.0% | 1580 | 48.9% | 1514 | 42.6%
Case 5 | 1128 | 1659 | 47.1% | 1504 | 33.4% | 1311 | 16.3% | 1486 | 31.8%
Case 6 | 1082 | 1605 | 48.4% | 1542 | 42.5% | 1146 | 5.9% | 1491 | 37.8%
Case 7 | 1444 | 1655 | 14.6% | 1667 | 15.4% | 1786 | 23.7% | 1549 | 7.3%
Case 8 | 1937 | 1656 | 14.5% | 1570 | 18.9% | 1548 | 20.1% | 1525 | 21.2%
Case 9 | 1307 | 1642 | 25.6% | 1495 | 14.4% | 1389 | 6.2% | 1488 | 13.8%
Case 10 | 2093 | 1655 | 20.9% | 1722 | 17.8% | 1944 | 7.1% | 1559 | 25.5%
Average error rate | - | - | 26.5% | - | 23.9% | - | 17.3% | - | 23.9%
Standard deviation | - | - | 16.4% | - | 12.1% | - | 12.6% | - | 10.3%
Table 13. Design-stage EL estimation ANN model prediction performance results.
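Unit: Eco-Point

| Division | Actual EL | Combination 1 Predicted Value | Combination 1 Error Rate | Combination 2 Predicted Value | Combination 2 Error Rate | Combination 3 Predicted Value | Combination 3 Error Rate | Combination 4 Predicted Value | Combination 4 Error Rate |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | 8174 | 4478 | 45.2% | 6264 | 23.4% | 7423 | 9.2% | 8562 | 4.7% |
| Case 2 | 7852 | 5131 | 34.7% | 6378 | 18.8% | 6866 | 12.6% | 8299 | 5.7% |
| Case 3 | 8490 | 4948 | 41.7% | 8842 | 4.1% | 7517 | 11.5% | 9732 | 14.6% |
| Case 4 | 2917 | 5314 | 82.2% | 5182 | 77.6% | 4352 | 49.2% | 6128 | 110.1% |
| Case 5 | 3892 | 6154 | 58.1% | 5254 | 35.0% | 4966 | 27.6% | 6396 | 64.4% |
| Case 6 | 3716 | 4948 | 33.2% | 6157 | 65.7% | 5208 | 40.2% | 6909 | 86.0% |
| Case 7 | 6690 | 4948 | 26.0% | 7132 | 6.6% | 6912 | 3.3% | 8673 | 29.6% |
| Case 8 | 4273 | 4978 | 16.5% | 6312 | 47.7% | 6048 | 41.5% | 7938 | 85.8% |
| Case 9 | 3337 | 4478 | 34.2% | 6896 | 106.7% | 5914 | 77.2% | 7415 | 122.2% |
| Case 10 | 5474 | 5116 | 6.5% | 6987 | 27.6% | 6896 | 26.0% | 8304 | 51.7% |
| Average error rate | | | 37.8% | | 41.3% | | 29.8% | | 57.5% |
| Standard deviation | | | 20.2% | | 31.4% | | 21.6% | | 40.9% |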
Table 14. The optimal ANN model for estimating EL and CC at the planning stage.
Planning stage
Combination 2
hierarchyNumber of nodesDropout rate
1859.3%
212256.9%
31-
Learning process
MSE0.07
RMSE0.27
Verification process
MSE0.05
RMSE0.21
Combination 3
hierarchyNumber of nodesDropout rate
11046.1%
229643.6%
31-
Learning process
MSE0.03
RMSE0.18
Verification process
MSE0.04
RMSE0.20
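For reference, the MSE and RMSE pairs reported for the learning and verification processes are related by the standard definitions below, where $y_i$ is the observed target, $\hat{y}_i$ the model output, and $n$ the number of samples. The symbols are introduced here for illustration, and the scale on which the errors were computed (presumably normalized targets) is an assumption.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
% Example: \sqrt{0.04} = 0.20, matching the Combination 3 verification row.
```

Because both quantities are tabulated to two decimals, some pairs match this relation only up to rounding (for example, an MSE of 0.07 corresponds to an RMSE between roughly 0.25 and 0.27).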
Table 15. Results of prediction performance of the DNN model for EL estimation at the planning stage.
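Unit: Eco-Point

| Division | Actual EL | Combination 1 Predicted Value | Combination 1 Error Rate | Combination 2 Predicted Value | Combination 2 Error Rate | Combination 3 Predicted Value | Combination 3 Error Rate | Combination 4 Predicted Value | Combination 4 Error Rate |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | 8174 | 4698 | 42.5% | 6540 | 20.0% | 5652 | 30.9% | 4925 | 39.8% |
| Case 2 | 7852 | 4698 | 40.2% | 5008 | 36.2% | 5614 | 28.5% | 4565 | 41.9% |
| Case 3 | 8490 | 4698 | 44.7% | 5134 | 39.5% | 5662 | 33.3% | 4748 | 44.1% |
| Case 4 | 2917 | 4698 | 61.1% | 5262 | 80.4% | 5664 | 94.2% | 4723 | 61.9% |
| Case 5 | 3892 | 4699 | 20.7% | 4598 | 18.1% | 5668 | 45.6% | 4153 | 6.7% |
| Case 6 | 3716 | 4699 | 26.5% | 5171 | 39.2% | 5625 | 51.4% | 4190 | 12.8% |
| Case 7 | 6690 | 4698 | 29.8% | 6668 | 0.3% | 5616 | 16.0% | 5031 | 24.8% |
| Case 8 | 4273 | 4699 | 10.0% | 5046 | 18.1% | 5656 | 32.4% | 4636 | 8.5% |
| Case 9 | 3337 | 4699 | 40.8% | 4634 | 38.9% | 5655 | 69.5% | 4283 | 28.4% |
| Case 10 | 5474 | 4698 | 14.2% | 7347 | 34.2% | 5647 | 3.2% | 5342 | 2.4% |
| Average error rate | | | 33.0% | | 32.5% | | 40.5% | | 27.1% |
| Standard deviation | | | 14.9% | | 20.1% | | 24.9% | | 18.6% |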
Table 16. DNN model prediction performance results for CC estimation at the planning stage.
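Unit: 10 Million Won

| Division | Actual CC | Combination 1 Predicted Value | Combination 1 Error Rate | Combination 2 Predicted Value | Combination 2 Error Rate | Combination 3 Predicted Value | Combination 3 Error Rate | Combination 4 Predicted Value | Combination 4 Error Rate |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | 1838 | 1384 | 24.7% | 1714 | 6.7% | 1595 | 13.2% | 1494 | 18.7% |
| Case 2 | 1891 | 1383 | 26.9% | 1714 | 9.3% | 1440 | 23.9% | 1492 | 21.1% |
| Case 3 | 1942 | 1383 | 28.8% | 1714 | 11.7% | 1452 | 25.2% | 1489 | 23.3% |
| Case 4 | 1061 | 1385 | 30.5% | 1714 | 61.5% | 1420 | 33.8% | 1492 | 40.6% |
| Case 5 | 1128 | 1384 | 22.7% | 1714 | 52.0% | 1201 | 6.5% | 1481 | 31.4% |
| Case 6 | 1082 | 1382 | 27.7% | 1714 | 58.5% | 1169 | 8.1% | 1485 | 37.3% |
| Case 7 | 1444 | 1382 | 4.3% | 1714 | 18.7% | 1704 | 18.0% | 1503 | 4.1% |
| Case 8 | 1937 | 1382 | 28.7% | 1714 | 11.5% | 1392 | 28.2% | 1492 | 23.0% |
| Case 9 | 1307 | 1381 | 5.6% | 1714 | 31.1% | 1263 | 3.4% | 1489 | 13.9% |
| Case 10 | 2093 | 1383 | 33.9% | 1714 | 18.1% | 1887 | 9.8% | 1505 | 28.1% |
| Average error rate | | | 23.4% | | 27.9% | | 17.0% | | 24.1% |
| Standard deviation | | | 9.7% | | 20.4% | | 9.8% | | 10.3% |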
Table 17. Design-stage EL estimation DNN model prediction performance results.
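Unit: Eco-Point

| Division | Actual EL | Combination 1 Predicted Value | Combination 1 Error Rate | Combination 2 Predicted Value | Combination 2 Error Rate | Combination 3 Predicted Value | Combination 3 Error Rate | Combination 4 Predicted Value | Combination 4 Error Rate |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | 8174 | 5300 | 35.2% | 4003 | 51.0% | 7033 | 14.0% | 6961 | 14.8% |
| Case 2 | 7852 | 5358 | 31.8% | 4044 | 48.5% | 6324 | 19.5% | 6769 | 13.8% |
| Case 3 | 8490 | 5435 | 36.0% | 4026 | 52.6% | 7797 | 8.2% | 7332 | 13.6% |
| Case 4 | 2917 | 3540 | 21.4% | 3681 | 26.2% | 4031 | 38.2% | 4115 | 41.1% |
| Case 5 | 3892 | 4225 | 8.6% | 3754 | 3.5% | 4645 | 19.4% | 4864 | 25.0% |
| Case 6 | 3716 | 4009 | 7.9% | 3643 | 2.0% | 4813 | 29.5% | 4914 | 32.2% |
| Case 7 | 6690 | 4959 | 25.9% | 3919 | 41.4% | 6798 | 1.6% | 6699 | 0.1% |
| Case 8 | 4273 | 5157 | 20.7% | 4041 | 5.4% | 5928 | 38.7% | 6204 | 45.2% |
| Case 9 | 3337 | 4977 | 49.2% | 4068 | 21.9% | 5686 | 70.4% | 6131 | 83.8% |
| Case 10 | 5474 | 5663 | 3.5% | 4219 | 22.9% | 7153 | 30.7% | 7084 | 29.4% |
| Average error rate | | | 24.0% | | 27.5% | | 27.0% | | 29.9% |
| Standard deviation | | | 13.8% | | 18.9% | | 18.6% | | 22.2% |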
Table 18. Design-stage CC estimation DNN model prediction performance results.
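Unit: 10 Million Won

| Division | Actual CC | Combination 1 Predicted Value | Combination 1 Error Rate | Combination 2 Predicted Value | Combination 2 Error Rate | Combination 3 Predicted Value | Combination 3 Error Rate | Combination 4 Predicted Value | Combination 4 Error Rate |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | 1838 | 1442 | 21.5% | 1669 | 9.2% | 1529 | 16.8% | 2024 | 10.1% |
| Case 2 | 1891 | 1442 | 23.7% | 1554 | 17.8% | 1526 | 19.3% | 2027 | 7.2% |
| Case 3 | 1942 | 1442 | 25.8% | 1567 | 19.3% | 1533 | 21.1% | 2115 | 8.9% |
| Case 4 | 1061 | 1380 | 30.1% | 1214 | 14.4% | 1505 | 41.8% | 1485 | 39.9% |
| Case 5 | 1128 | 1404 | 24.5% | 1278 | 13.4% | 1509 | 33.8% | 1483 | 31.5% |
| Case 6 | 1082 | 1371 | 26.7% | 1245 | 15.1% | 1510 | 39.6% | 1545 | 42.8% |
| Case 7 | 1444 | 1442 | 0.2% | 1564 | 8.3% | 1529 | 5.9% | 2079 | 44.0% |
| Case 8 | 1937 | 1442 | 25.6% | 1581 | 18.4% | 1521 | 21.5% | 1884 | 2.7% |
| Case 9 | 1307 | 1442 | 10.3% | 1445 | 10.6% | 1518 | 16.2% | 1862 | 42.4% |
| Case 10 | 2093 | 1442 | 31.1% | 1691 | 19.2% | 1528 | 27.0% | 2046 | 2.2% |
| Average error rate | | | 21.9% | | 14.6% | | 24.4% | | 23.2% |
| Standard deviation | | | 9.1% | | 3.9% | | 10.7% | | 17.4% |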
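The combinations reported as optimal in Tables 14, 19, and 20 coincide with the lowest average error rate in each results table. The sketch below assumes that this minimum-average-error criterion is how the optimal combination was selected (the tables do not state the rule explicitly) and applies it to the averages from Tables 11 to 13 and 15 to 18. The dictionary keys and variable names are illustrative.

```python
# Average error rates (%) per input combination, copied from the summary
# rows of Tables 11-13 (ANN) and Tables 15-18 (DNN).
avg_error = {
    "ANN planning EL (Table 11)": {1: 35.2, 2: 30.3, 3: 38.2, 4: 37.7},
    "ANN planning CC (Table 12)": {1: 26.5, 2: 23.9, 3: 17.3, 4: 23.9},
    "ANN design EL (Table 13)": {1: 37.8, 2: 41.3, 3: 29.8, 4: 57.5},
    "DNN planning EL (Table 15)": {1: 33.0, 2: 32.5, 3: 40.5, 4: 27.1},
    "DNN planning CC (Table 16)": {1: 23.4, 2: 27.9, 3: 17.0, 4: 24.1},
    "DNN design EL (Table 17)": {1: 24.0, 2: 27.5, 3: 27.0, 4: 29.9},
    "DNN design CC (Table 18)": {1: 21.9, 2: 14.6, 3: 24.4, 4: 23.2},
}

# Assumed selection rule: pick the combination with the lowest average error.
best = {name: min(rates, key=rates.get) for name, rates in avg_error.items()}

for name, combo in best.items():
    print(f"{name}: Combination {combo} ({avg_error[name][combo]}%)")
```

Under this rule the DNN selections are Combination 4 (EL) and Combination 3 (CC) at the planning stage, and Combination 1 (EL) and Combination 2 (CC) at the design stage.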
Table 19. Optimal DNN model information for estimating EL and CC at the planning stage.
Planning stage
Combination 4
hierarchyNumber of nodesDropout rate
11051.8%
218474.3%
315546.6%
41-
Learning process
MSE0.07
RMSE0.26
Verification process
MSE0.04
RMSE0.20
Combination 3
hierarchyNumber of nodesDropout rate
1947.9%
223443.5%
314669.0%
41-
Learning process
MSE0.08
RMSE0.28
Verification process
MSE0.04
RMSE0.19
Table 20. Optimal DNN model information for estimating EL and CC at the design stage.
Design phase
Combination 1
hierarchyNumber of nodesDropout rate
11252.2%
21366.1%
3848.6%
425841.6%
51-
Learning process
MSE0.05
RMSE0.21
Verification process
MSE0.04
RMSE0.20
Combination 2
hierarchyNumber of nodesDropout rate
11247.6%
2574.7%
35657.1%
41-
Learning process
MSE0.06
RMSE0.25
Verification process
MSE0.04
RMSE0.19
Table 21. Extended model evaluation metrics for optimal ANN and DNN models.
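| Stage | Model | Target | MAPE (%) | MAE | R² |
|---|---|---|---|---|---|
| Planning | ANN | EL | 28.3 | 1.85 | 0.72 |
| Planning | ANN | CC | 16.2 | 1.21 | 0.81 |
| Planning | DNN | EL | 25.7 | 1.74 | 0.76 |
| Planning | DNN | CC | 13.8 | 1.09 | 0.84 |
| Design | ANN | EL | 30.6 | 2.07 | 0.68 |
| Design | ANN | CC | 15.4 | 1.18 | 0.79 |
| Design | DNN | EL | 23.9 | 1.63 | 0.74 |
| Design | DNN | CC | 13.2 | 0.98 | 0.86 |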
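The three metrics in Table 21 have standard definitions, sketched below in plain Python. The function names and the toy inputs are illustrative only; the data and scaling behind the Table 21 values are not given here, so this sketch does not attempt to reproduce them.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be nonzero)."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only (not taken from the study).
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 380]

print(f"MAE  = {mae(y_true, y_pred):.2f}")
print(f"MAPE = {mape(y_true, y_pred):.2f}%")
print(f"R2   = {r2(y_true, y_pred):.2f}")
```

MAPE is scale-free, which makes it comparable between EL and CC, whereas MAE depends on the units or normalization of the target.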

Share and Cite

MDPI and ACS Style

Kim, J.-S. AI-Powered Forecasting of Environmental Impacts and Construction Costs to Enhance Project Management in Highway Projects. Buildings 2025, 15, 2546. https://doi.org/10.3390/buildings15142546