AI-Powered Forecasting of Environmental Impacts and Construction Costs to Enhance Project Management in Highway Projects

Kim, Joon-Soo

doi:10.3390/buildings15142546

Open Access Article

Department of Highway & Transportation Research, Korea Institute of Civil Engineering and Building Technology, 283 Goyangdae-Ro, Ilsanseo-Gu, Goyang-si 10223, Gyeonggi-Do, Republic of Korea

Buildings 2025, 15(14), 2546; https://doi.org/10.3390/buildings15142546

Submission received: 23 June 2025 / Revised: 8 July 2025 / Accepted: 15 July 2025 / Published: 19 July 2025

(This article belongs to the Special Issue Practice and Application of Artificial Intelligence in Urban Decision-Making)

Abstract

The accurate early-stage estimation of environmental load (EL) and construction cost (CC) in road infrastructure projects remains a significant challenge, constrained by limited data and the complexity of construction activities. To address this, our study proposes a machine learning-based predictive framework utilizing artificial neural networks (ANNs) and deep neural networks (DNNs), enhanced by autoencoder-driven feature selection. A structured dataset of 150 completed national road projects in South Korea was compiled, covering both planning and design phases. The database focused on 19 high-impact sub-work types to reduce noise and improve prediction precision. A hybrid imputation approach—combining mean substitution with random forest regression—was applied to handle 4.47% missing data in the design-phase inputs, reducing variance by up to 5% and improving data stability. Dimensionality reduction via autoencoder retained 16 core variables, preserving 97% of explanatory power while minimizing redundancy. ANN models benefited from cross-validation and hyperparameter tuning, achieving consistent performance across training and validation sets without overfitting (MSE = 0.06, RMSE = 0.24). The optimal ANN yielded average error rates of 29.8% for EL and 21.0% for CC at the design stage. DNN models, with their deeper architectures and dropout regularization, further improved performance—achieving 27.1% (EL) and 17.0% (CC) average error rates at the planning stage and 24.0% (EL) and 14.6% (CC) at the design stage. These results met all predefined accuracy thresholds, underscoring the DNN’s advantage in handling complex, high-variance data while the ANN excelled in structured cost prediction. 
Overall, the synergy between deep learning and autoencoder-based feature selection offers a scalable and data-informed approach for enhancing early-stage environmental and economic assessments in road infrastructure planning—supporting more sustainable and efficient project management.

Keywords:

machine learning; environmental load; construction cost; project management; road infrastructure; early-stage estimation; ANN; DNN

1. Introduction

In recent years, the construction industry has increasingly emphasized the need for sustainable infrastructure development, driven by rising environmental concerns and the demand for efficient resource management [1]. Among various infrastructure sectors, road construction plays a critical role due to its extensive material consumption, long lifecycle, and considerable environmental footprint. An accurate early-stage estimation of environmental load (EL) and construction cost (CC) is thus crucial for both minimizing ecological impact and optimizing project budgets. Traditional cost and environmental impact estimation methods rely heavily on historical data, engineering judgment, and deterministic models. While these approaches provide foundational insights, they often lack scalability and adaptability when confronted with complex or data-limited project environments. Moreover, many such models struggle to incorporate the wide range of interacting variables inherent to infrastructure projects, especially during the planning and design stages, where uncertainty is high and input data are only partially available.

EL has been a critical focus in recent structural and geotechnical research due to its influence on safety and sustainability. Wang et al. [2] investigated the seismic and EL response of offshore wind turbine jacket foundations, revealing complex dynamic interactions with scour effects. Mittermayr et al. [3] examined material fatigue under mechanical–EL cycles, providing insights into degradation in recycled construction materials. Lavassani et al. [4] optimized semi-active dampers for offshore jacket platforms under environmental vibration, demonstrating advanced control strategies for structural resilience. Elmas et al. [5] evaluated soil–monopile interaction in offshore turbines, integrating wind and wave loads with earthquake effects through detailed subsurface modeling. Li et al. [6] synthesized the impact of environmental factors, such as temperature, pressure, and hydrological loads, on long-term structural monitoring via GNSS, highlighting the relevance of environmental modeling in infrastructure deformation analysis. Recent studies also continue to demonstrate the effectiveness of hybrid AI approaches in complex geotechnical problems. For instance, Ghanizadeh et al. [7] developed a predictive model for the bearing capacity of geogrid-reinforced stone columns using a hybrid MARS–EBS method, achieving R² values above 0.99. This confirms the strength of intelligent optimization in modeling nonlinear, parameter-sensitive soil–structure systems.

Recent research underscores the growing use of deep learning in construction management. Cheng et al. [8] and Liu et al. [9] proposed hybrid and hypergraph-based models for accurate cost and schedule prediction. Habib et al. [10] applied ensemble learning, while Mahpour et al. [11] examined maintenance costs within a circular economy framework. Bruzzone et al. [12] integrated machine learning with simulation for offshore plant planning. Farshadfar et al. [13] leveraged AI for automated waste sorting, and Mahmoodzadeh et al. [14] focused on tunneling cost and duration forecasting. Wang et al. [15] explored economic factor impacts using DNNs, and Alsulamy et al. [16] compared deep learning algorithms for project delay prediction. Li et al. [17] offered a comprehensive classification of deep learning in construction. Lung et al. [18] and Liu et al. [19] expanded applications to modular safety, IoT-based risk control, and material price forecasting, respectively. Chen et al. [20] combined deep learning with large language models for BIM compliance, while Choi et al. [21] improved construction crack detection via hybrid data augmentation. Despite these advances, few studies compare ANN and DNN performance using multi-phase datasets and autoencoder-based variable selection—highlighting the novelty of this research.

Despite growing interest in applying machine learning to construction forecasting, existing studies often face limitations in handling complex, high-dimensional project data, especially during early stages where uncertainty is high. Many approaches rely on limited or overly generalized variables, leading to reduced model reliability and increased susceptibility to overfitting. Furthermore, few studies quantitatively address the challenge of selecting meaningful features from detailed engineering datasets, particularly when data volume is constrained. This research addresses these gaps by focusing on improving model performance through optimized feature selection and dimensionality reduction techniques, enabling more accurate and practical predictions of EL and CCs at critical decision-making points. Unlike previous studies that rely on single-phase data, this study distinguishes between planning and design stages, enabling phase-specific prediction modeling and allowing a deeper analysis of variable significance, data uncertainty, and model suitability at different decision-making points.

This study aims to develop a machine learning framework that can effectively estimate ELs and CCs during the planning and design stages of national road construction projects. A dataset of 150 national road construction projects from South Korea was compiled, comprising 10 planning-stage variables and 19 design-stage work quantities. To reduce dimensionality and improve learning performance, an autoencoder was used to identify optimal variable combinations. Subsequently, a total of 16 predictive models—artificial neural networks (ANNs) and deep neural networks (DNNs)—were constructed to estimate environmental impact and cost. These models were trained using 6-fold cross-validation with varying network depths (1–3 layers), dropout rates (40–80%), and node counts (up to 500) and were evaluated based on MSE, RMSE, and the average error rate. Missing values were addressed through random forest imputation, and early stopping techniques were employed to prevent overfitting. Through this research, we contribute to advancing ML-based estimation methodologies for road infrastructure, providing insights into variable importance and model optimization techniques suited to early-stage project conditions. The findings are expected to support more sustainable and cost-effective decision-making in road planning and design processes. An overview of the manuscript’s structure is presented below:

Section 1 presents the Introduction.
Section 2 shows the database structure, including data collection, preprocessing, and feature selection strategies for both the planning and design phases.
Section 3 details the development and configuration of ANN and DNN models, including architecture selection, hyperparameter optimization, and performance evaluation metrics.
Section 4 compares the predictive performance of both model types, highlighting the trade-offs and suitability of each method for estimating EL and CC.
Section 5 concludes the study by summarizing key findings, discussing limitations, and suggesting future research directions to improve machine learning applications in early-stage infrastructure planning.

2. Database Collection and Analysis

2.1. Collection of Road Project Cases

To estimate the EL and CCs of national road projects, a dataset of 150 completed cases in South Korea was compiled. For each project, data were extracted from completion reports, project details, and detailed design documents to construct two structured datasets: a planning-stage database and a design-stage database [22].

The planning-stage database includes key project characteristics relevant to early-phase decision-making, such as administrative district, road height (m), road grade (%), topography, design speed (km/h), the type of construction (new or expanded), road length (m), road area (m²), pavement thickness (cm), the number of lanes, and road width (m).

The design-stage database captures detailed construction activities and material quantities, categorized by work type:

Earthwork: quantities for operations such as excavation, earth moving, ripping, blasting rock, and ceramic transport (m³), including dump transport and green zone reclamation.
Drainage: lengths of side ditches and horizontal drains (m), and volumes for structures such as VR halls, wing walls, and concrete placements (m³).
Paving (Packer): volumes of frost protection layers (m³) and quantities of asphalt base, middle, and surface layers (tons).
Structural labor: formwork areas (m²), rebar assembly (tons), and ladder work (tons), reflecting both material input and labor intensity.

This dual-database structure enables machine learning models to learn from both early-stage design variables and detailed construction inputs, providing a robust basis for the predictive modeling of environmental and cost outcomes.

2.2. Dataset Composition and Variable Selection for Model Development

A total of 150 national road construction projects in South Korea were analyzed to construct the planning and design-stage databases. The dataset covers a wide geographic range, with the highest concentration in Gyeonggi-do (18%), followed by Gyeongbuk (14%) and Chungnam (14%). Regions such as Gangwon (4%) and Jeju (2%) were less represented.

At the planning stage, key attributes were compiled, including road height, grade, design speed, and construction type. Most projects had road heights below 10 m (68%), were classified as National Route 2 (52%), and were designed for 80 km/h speeds (61%). Additionally, new constructions (56%) slightly outnumbered expansion projects.

For the design-stage database, emphasis was placed on the construction categories with the highest influence on cost and environmental impact, notably earthworks, drainage, and paving, which together account for more than 75% of total resource usage. The dataset comprises 19 sub-work types, with quantities standardized in units such as m³, m, m², and tons, covering activities like excavation, rock blasting, concrete works, drainage installations, and asphalt paving.

To improve model efficiency, only high-impact sub-works were selected for inclusion. This filtering reduces noise and enhances predictive accuracy by limiting the dataset to features with strong relevance to environmental and cost outcomes.

2.2.1. Data Distribution, Missing Values, and Imputation Strategy in the Design-Stage Database

The design-stage database consists of 19 detailed construction work types, each recorded with quantity information in standard engineering units (e.g., m³, m, m², ton). To evaluate the distribution of input features and identify missing data patterns, descriptive statistics and quartiles were computed, as presented in Table 1. Missing values were present in several features, particularly in the drainage and paving categories. Out of 1900 data entries, 85 values (4.47%) were missing, with the majority concentrated on culvert-related tasks, which are crucial for environmental and cost estimation accuracy.

2.2.2. Handling Missing Values Through Imputation

To ensure robust model training, missing values were addressed using two strategies: mean imputation and random forest imputation. Mean imputation fills missing entries using the variable’s average but can underestimate variability and distort distribution tails. In contrast, random forest imputation builds a nonlinear model using other available features to predict missing values [23], offering greater accuracy and robustness in complex datasets.

Coding details for supplementing missing values in the design-phase database.

library(readxl)      # read_excel()
library(data.table)  # as.data.table()
library(h2o)         # random forest models
h2o.init()

# Load and prepare dataset
design_data <- as.data.table(read_excel("design_db_.xlsx"))
h2o_data <- as.h2o(design_data)

# Identify columns with missing values
na_cols <- names(which(colSums(is.na(design_data)) > 0))

# Impute missing values using Random Forest: for each incomplete column,
# fit a model on all other columns and fill only the missing rows
for (col in na_cols) {
  model <- h2o.randomForest(
    x = setdiff(names(h2o_data), col),
    y = col,
    training_frame = h2o_data
  )
  prediction <- as.data.frame(h2o.predict(model, h2o_data))$predict
  missing_rows <- is.na(design_data[[col]])
  design_data[[col]][missing_rows] <- prediction[missing_rows]
}

# Resulting dataset
imputed_design_data <- design_data

2.2.3. Distribution Shift After Imputation

Post-imputation, summary statistics were recalculated. Table 2 shows that average values remained largely stable, while standard deviations decreased for several variables—improving the consistency and learnability of the dataset for machine learning.

In contrast, a few variables (e.g., scaffolding, asphalt intermediate layer) exhibited an increased standard deviation post-imputation due to their previously narrow observed ranges and small sample sizes. This is illustrated in Table 3, which compares differences in mean–standard deviation gaps before and after imputation.
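The variance effect of mean imputation noted in Section 2.2.2 can be illustrated with a minimal base R sketch. The quantities below are hypothetical values chosen for illustration, not entries from the study's database, and the variable names are assumptions.

```r
# Hypothetical quantities for one work type, with two missing entries
qty <- c(120, 95, NA, 210, 160, NA, 80, 175)

# Mean imputation: replace each NA with the mean of the observed values
observed <- qty[!is.na(qty)]
imputed <- ifelse(is.na(qty), mean(observed), qty)

# The mean is unchanged, but the standard deviation shrinks because the
# filled-in values add no spread while increasing the sample size
mean_before <- mean(observed)
mean_after <- mean(imputed)
sd_before <- sd(observed)
sd_after <- sd(imputed)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(sd_before = sd_before, sd_after = sd_after))
```

A model-based imputer such as a random forest predicts a different value for each missing row, so it need not compress the spread in the same way.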

3. Artificial Neural Network

3.1. Selection of Optimal Variables

In road infrastructure projects, the design stage is a critical phase during which approximately 70% of the construction drawings are completed, and quantities and scopes of work are accurately determined. Road projects typically encompass several major categories, including earthwork, slope stabilization, drainage, structural works, tunnel excavation, paving, traffic safety installations, and ancillary facilities. This study excludes tunnels and bridges to focus on the more frequently implemented components across projects. Each category includes multiple sub-activities:

Earthwork typically consists of 12 tasks, such as demolition, excavation, embankment formation, and topsoil removal.
Slope safety involves vegetation-based and structural reinforcement.
Drainage includes around 14 activities, such as trenching, blind hole drilling, and horizontal pipe installation.
Paving work encompasses 13 procedures, including frost protection, compaction, concrete curing, and surface finishing.
Traffic safety covers 11 elements like road signs and pavement markings.
Ancillary works average 20 tasks and include features like protective walls, signage, and noise barriers.

While including a wide array of variables may seem advantageous for improving model accuracy, this often results in overfitting—where the model captures data-specific noise and fails to generalize to new cases. In situations where expanding the dataset is impractical, dimensionality reduction is essential to optimize learning efficiency.

To address this, this study employs an autoencoder, an unsupervised neural network designed to reconstruct its input by learning compressed data representations. It consists of an encoder, which reduces dimensionality, and a decoder, which attempts to reconstruct the original input. Unlike traditional neural networks that predict outputs, autoencoders aim to reproduce input features, making them well-suited for noise reduction, anomaly detection, and feature selection. By minimizing reconstruction error, the autoencoder isolates high-signal variables, thereby improving the predictive performance of subsequent ANN and DNN models.

3.1.1. Optimal Variable Selection Using Autoencoder

(1) The setting of optimal variables in the planning stage

The performance of an ANN is primarily influenced by two key factors: the quantity and quality of training data and the network architecture [23], particularly the number of hidden layers and nodes [24]. These factors affect the model’s generalization capacity and weight optimization range.

To reduce overfitting and improve model generalizability across the 150-case dataset, four key strategies were implemented: (1) dropout regularization was applied with rates between 40% and 80% in both input and hidden layers to prevent over-reliance on specific neurons; (2) early stopping was used to halt training automatically when no further performance improvement was observed, avoiding unnecessary iterations; (3) cross-validation involved 2-fold splits for ANN models and 6-fold splits for autoencoder tuning to ensure model stability across varied data partitions; and (4) autoencoder-based feature selection reduced dimensionality by isolating high-signal variables, retaining over 95% of the dataset’s explanatory power while minimizing noise and redundancy.

In this study, a custom function was developed to optimize ANN parameters, including hidden layer depth (1–3 layers) and dropout rates (40–80%), using six-fold cross-validation across various case combinations [25,26]. While no universal guideline exists for selecting hidden layer depth, it is generally advised to limit complexity when data availability is constrained. The parameter optimization process was applied to both planning and design-stage datasets, as detailed in Figure 1.

The autoencoder optimal parameter function.

library(caret)   # provides createFolds()

# Define autoencoder depth (1 to 3 hidden layers)
depth <- sample(1:3, 1)
# Create 6-fold cross-validation indices
folds <- createFolds(1:100, k = 6)
# Hyperparameter search space: six randomly drawn candidate configurations
hyperparams <- lapply(1:6, function(i) {
  list(
    hidden    = sample(100:500, depth, replace = TRUE),         # nodes per hidden layer
    input_dr  = sample(400:800, 1) / 1000,                      # input dropout ratio (0.40 to 0.80)
    hidden_dr = sample(400:800, depth, replace = TRUE) / 1000   # hidden dropout ratios (0.40 to 0.80)
  )
})

Figure 1 shows the six cross-validation models [27] created to find the optimal autoencoder parameters for the planning stage. Model 5, highlighted by the red dashed box, produced the lowest mean squared error (0.10) among the six models evaluated; Table 4 shows the information used to build it. The network has five layers in total, including three hidden layers, and its dropout rates range from 56% to 78%.

The evaluation of neural network models often involves statistical techniques such as mean square error (MSE) and root mean square error (RMSE). MSE represents the average of the squared differences between predicted and actual values. A smaller MSE indicates that the model’s predictions are closer to the true values. This metric is widely used to assess the accuracy of the model and is represented by the following formula [28]:

MSE = \frac{1}{n} \sum_{i = 1}^{n} (\hat{Y}_{i} - Y_{i})^{2} \quad (1)

When working with large-scale datasets, the error sum can become very large, leading to an MSE value that is difficult to interpret intuitively. To address this, the RMSE is derived by taking the square root of the MSE, making it more manageable and easier to assess model performance. This can be expressed as follows [28]:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (\hat{Y}_{i} - Y_{i})^{2}} \quad (2)

In general, lower values of both MSE and RMSE suggest a better predictive performance of the model. Additionally, when the MSE and RMSE values obtained in the learning process are close to those obtained in the verification process, overfitting is less likely to be an issue, which supports the reliability of the model.

The coding details for both the learning and verification phases of the autoencoder model—using the optimal parameters determined during the planning stage—indicate that the model achieved an MSE of 0.08 and RMSE of 0.28 during the learning process. In the verification process, the MSE was 0.06 and RMSE was 0.24. Since these values are similar, it was concluded that overfitting did not occur, and the model was deemed to be reliable.
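As a minimal sketch of Equations (1) and (2), the two metrics can be computed in base R as follows. The `actual` and `predicted` vectors are hypothetical normalized values, and the function names `mse` and `rmse` are illustrative rather than part of the study's code. The last lines check that the reported RMSE values are the rounded square roots of the reported MSE values.

```r
# Mean squared error: average of squared prediction errors
mse <- function(predicted, actual) mean((predicted - actual)^2)

# Root mean squared error: square root of the MSE
rmse <- function(predicted, actual) sqrt(mse(predicted, actual))

# Hypothetical normalized targets and predictions
actual <- c(0.2, 0.5, 0.9, 0.4)
predicted <- c(0.3, 0.4, 0.7, 0.6)

mse_value <- mse(predicted, actual)    # 0.025
rmse_value <- rmse(predicted, actual)  # about 0.158

# Consistency check on the reported pairs (learning: 0.08, verification: 0.06)
reported_rmse <- round(sqrt(c(learning = 0.08, verification = 0.06)), 2)
print(reported_rmse)  # 0.28 and 0.24
```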

Coding details of the planning-stage autoencoder learning process and verification process.

Planning stage Autoencoder learning process coding details

fm_plan_Eco <- lapply(hyperparams, function(v) {
  # One autoencoder per fold for each candidate configuration
  # (the fold index i is not used to subset the frames in this listing)
  lapply(folds, function(i) {
    h2o.deeplearning(
      x = 1:11,
      training_frame = train_plan_Eco[, -12],    # drop the target column
      validation_frame = test_plan_Eco[, -12],
      distribution = "gaussian",
      activation = "RectifierWithDropout",
      hidden = v$hidden,
      rho = 0.90,
      epsilon = 1e-07,
      input_dropout_ratio = v$input_dr,
      hidden_dropout_ratios = v$hidden_dr,
      loss = "Automatic",
      autoencoder = TRUE,
      sparse = TRUE,
      l1 = 1e-07,
      l2 = 1e-07,
      epochs = 300
    )
  })
})

Planning stage Autoencoder verification process coding details

# Refit the autoencoder with the best configuration (Model 5)
fm.final_plan_Eco <- h2o.deeplearning(
  x = 1:11,
  training_frame = train_plan_Eco[, -12],    # drop the target column
  validation_frame = test_plan_Eco[, -12],
  distribution = "gaussian",
  activation = "RectifierWithDropout",
  hidden = hyperparams[[5]]$hidden,
  rho = 0.90,
  epsilon = 1e-07,
  input_dropout_ratio = hyperparams[[5]]$input_dr,
  hidden_dropout_ratios = hyperparams[[5]]$hidden_dr,
  loss = "Automatic",
  autoencoder = TRUE,
  l1 = 1e-07,
  l2 = 1e-07,
  epochs = 300
)

The mean squared error ranges from 0.01 to 0.15 across variables, which shows that some variables are reproduced poorly. Variables with low reproducibility can be expected to degrade prediction performance when an artificial neural network model is built, so the variable importance in Table 5 was used as the criterion for selecting variables. The variable combinations corresponding to cumulative importance levels of 77%, 83%, 89%, and 95% are shown in Table 6 and Table 7; these combinations were subsequently used as inputs for the ANN and DNN models that estimate EL and CC at the planning stage.
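One plausible reading of the cumulative-importance criterion is sketched below in base R: variables are ranked by importance and kept until their cumulative share first reaches the threshold. The importance values are hypothetical (they are not the values in Table 5), and the function name `select_by_cumulative_importance` is an assumption introduced for illustration.

```r
# Hypothetical relative importances for planning-stage variables
importance <- c(
  road_length = 0.30, road_area = 0.25, lanes = 0.15, road_width = 0.10,
  design_speed = 0.08, pavement_thickness = 0.07, topography = 0.05
)

# Keep the top-ranked variables until the cumulative share of importance
# first reaches the requested threshold
select_by_cumulative_importance <- function(importance, threshold) {
  ranked <- sort(importance, decreasing = TRUE)
  cumulative_share <- cumsum(ranked) / sum(ranked)
  # small tolerance guards against floating-point rounding at the boundary
  n_keep <- which(cumulative_share >= threshold - 1e-9)[1]
  names(ranked)[seq_len(n_keep)]
}

combo_77 <- select_by_cumulative_importance(importance, 0.77)
combo_95 <- select_by_cumulative_importance(importance, 0.95)
print(combo_77)
print(combo_95)
```

Raising the threshold adds lower-ranked variables, so each combination is a superset of the previous one.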

3.1.2. The Setting of Optimal Variables in the Design Phase

The procedure for setting the optimal variables in the design phase is the same as the one used for the planning phase. Cross-validation is performed using the function of Table 7 above for various case combinations. Figure 2 shows the six cross-validation models created to find the optimal autoencoder parameters for the design phase. Model 6, highlighted by the red dashed box, produced the lowest mean squared error (0.07) among all evaluated models; Table 7 and Table 8 show the information used to build it. The network has three layers in total, including one hidden layer, and its dropout rates range from 72% to 76%.

Table 8 shows the results obtained after applying the planning-stage autoencoder code to the design-stage model. The MSE and RMSE were 0.29 and 0.53 in the learning process and 0.06 and 0.23 in the verification process. On this basis, it was assumed that overfitting did not occur, and the model was judged to be reliable.

Table 9 and Table 10 summarize the ranked importance of design-stage variables and the cumulative combinations used to construct the ANN and DNN models, with up to 97% explanatory power captured using 16 key features.

3.2. ANN

3.2.1. Construction and Preprocessing for Planning-Stage Estimation

Prior to model development, the dataset comprising 100 road project cases was partitioned into 80 training cases, 10 validation cases, and 10 test cases. The ANN model for the planning stage was constructed using a backpropagation neural network architecture, which iteratively minimizes error by propagating feedback from output nodes.
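The 80/10/10 partition described above can be sketched in base R as follows. The random seed and the index-based approach are assumptions made for illustration; the source does not state how cases were assigned to each subset.

```r
set.seed(42)   # assumed seed, for reproducibility of the illustration only

n_cases <- 100
shuffled <- sample(seq_len(n_cases))   # random permutation of case indices

train_idx <- shuffled[1:80]     # 80 training cases
valid_idx <- shuffled[81:90]    # 10 validation cases
test_idx  <- shuffled[91:100]   # 10 test cases

# Each case appears in exactly one subset
all_idx <- c(train_idx, valid_idx, test_idx)
print(length(unique(all_idx)))
```

The index vectors can then be used to subset a data frame of cases, for example `cases[train_idx, ]` for a hypothetical data frame named `cases`.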

As summarized in the coding details, the model (plan_Eco_ann) employed a ReLU (Rectified Linear Unit) activation function to reduce vanishing gradient issues and facilitate faster convergence. Although standard practice recommends four or more folds in cross-validation, a two-fold cross-validation yielded the best predictive performance in this study. The network was configured with one hidden layer, and the number of hidden nodes was selected randomly in the range of 1 to 300. Learning was performed over 300 to 500 epochs, and early stopping criteria were applied to halt training when no further performance improvement was detected.

To enhance learning efficiency, the dependent variables—EL and CC—were normalized using the Min–Max scaling technique, which transforms values into the range [0, 1]. This approach is particularly effective in neural networks, where activation outputs typically range between −1 and 1. Normalization helps stabilize weight updates and accelerates convergence by minimizing the magnitude of prediction errors during training.

X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}

X: The observed value to be converted.
X_{\min}: Minimum value in the data column containing the observations.
X_{\max}: Maximum value in the data column containing the observations.
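A minimal base R sketch of this Min–Max scaling and its inverse is given below. The cost values are hypothetical, and the function names `min_max` and `inverse_min_max` are illustrative, not taken from the study's code.

```r
# Scale a numeric vector into the range [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Map scaled values back to the original units
inverse_min_max <- function(z, x_min, x_max) z * (x_max - x_min) + x_min

# Hypothetical construction costs (arbitrary units)
cost <- c(120, 300, 210, 480, 150)

scaled <- min_max(cost)
restored <- inverse_min_max(scaled, min(cost), max(cost))

print(scaled)
print(restored)
```

The minimum and maximum used for scaling must be stored, since the same two values are needed to convert predictions back to original units.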

Coding details for building an ANN model for estimating EL and CCs at the planning stage.

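plan_Eco_ann <- h2o.deeplearning(
  x = 1:11,   # Input features
  y = 12,     # Target variable: EL
  training_frame = train_plan_Eco,
  validation_frame = valid_plan_Eco,
  nfolds = 2,                        # Two-fold cross-validation
  distribution = "gaussian",
  # Note: H2O documents hidden_dropout_ratios as applicable only to the
  # "...WithDropout" activations; the source listing specifies "Rectifier"
  activation = "Rectifier",
  hidden = sample(1:300, 1, TRUE),   # Random neuron count in the single hidden layer
  rho = 0.90,                        # ADADELTA time decay factor
  epsilon = 1e-07,                   # ADADELTA smoothing factor
  input_dropout_ratio = sample(400:800, 1, TRUE) / 1000,
  hidden_dropout_ratios = sample(400:800, 1, TRUE) / 1000,
  loss = "Automatic",
  stopping_rounds = 5,               # Early stopping after 5 scoring rounds without improvement
  stopping_metric = "AUTO",
  stopping_tolerance = 0.01,
  sparse = TRUE,
  epochs = 300
)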

3.2.2. Prediction Accuracy and Optimal Architecture of ANN Models

In the autoencoder process, four variable combinations were identified for both the planning and design stages. These combinations were used to construct a total of sixteen ANN models—eight for estimating EL and eight for CC. The target variables were normalized using the Min–Max scaling technique, with outputs ranging between 0 and 1. After prediction, values were denormalized to compute absolute error rates.
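The source does not give the formula for the absolute error rate. The base R sketch below assumes the common definition |predicted − actual| / actual × 100, applied to denormalized values, and then summarizes the rates by their mean and standard deviation. All numbers are hypothetical and the variable names are illustrative.

```r
# Hypothetical denormalized values for four validation cases
actual <- c(200, 400, 500, 250)
predicted <- c(250, 340, 450, 300)

# Assumed definition: absolute percentage error per case
error_rate <- abs(predicted - actual) / actual * 100

avg_error_rate <- mean(error_rate)   # 17.5
sd_error_rate <- sd(error_rate)      # sample standard deviation

print(error_rate)   # 25 15 10 20
print(c(average = avg_error_rate, std_dev = sd_error_rate))
```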

Table 11 and Table 12 present the average error rates and standard deviations for 10 validation cases. Among the planning-stage models for EL estimation, combination 4 exhibited the best performance, with an average error rate of 29.8% and a standard deviation of 16.0%. Similarly, for CC estimation, combination 4 again yielded the best results, with an average error rate of 21.0% and a standard deviation of 16.3%.

For EL estimation at the design stage, combination 3 showed the best prediction performance among combinations 1 to 4, with an average error rate of 29.8% and a standard deviation of 21.6%.

As shown in Table 13, for CC estimation at the design stage, combination 2 showed the best prediction performance among combinations 1 to 4, with an average error rate of 21.0% and a standard deviation of 16.3%.

The optimal ANN model configurations for both the planning and design stages are summarized as follows. As shown in Table 14, for EL estimation in the planning stage, the model used eight input variables and 122 hidden nodes, with dropout rates of 59.3% (input layer) and 56.9% (hidden layer). For CC estimation, the planning-stage model utilized 10 input variables, 296 hidden nodes, and dropout rates of 46.1% (input) and 43.6% (hidden).

In the design stage, the optimal model for EL estimation included 15 input variables and 117 hidden nodes, with dropout rates of 42.4% (input) and 53.5% (hidden). For CC estimation, the model was simpler, comprising 13 input variables and 11 hidden nodes, with higher dropout rates of 51.2% (input) and 76.3% (hidden).

All ANN models achieved error rates within the acceptable tolerance defined for the planning stage. Moreover, the consistency between training and validation performance—measured using MSE and RMSE—indicated that overfitting was effectively controlled, likely due to the use of optimized hyperparameters and feature selection via the autoencoder.

Learning progression graphs (Figure 3 and Figure 4) illustrate model behavior across epochs. For the planning-stage cost model and the design-stage EL model, convergence was observed within the set epoch range, as the training and validation curves closely aligned. In contrast, the planning-stage EL and design-stage cost models exhibited a persistent gap between the training and validation lines, suggesting underfitting due to insufficient data. While increasing model complexity (e.g., nodes or epochs) could improve performance, doing so risks overfitting. Therefore, augmenting the training dataset remains the most viable approach for further performance enhancement.

3.3. Deep Neural Network

The DNN model adopts a backpropagation neural network structure similar to that of the ANN. The architecture utilizes the ReLU (Rectified Linear Unit [1]) activation function, with the network depth randomly set between two and five hidden layers and each layer containing between 1 and 300 nodes. While the DNN model shares the same baseline activation function, loss type, and learning strategy as the ANN configuration, it differs significantly in network depth (2–5 hidden layers), layer-wise dropout control, and the expanded architectural complexity. These differences enable the DNN to model higher-order interactions and extract deeper abstractions, particularly useful for predicting nonlinear and variance-prone outcomes like environmental load. For each variable combination, 16 distinct DNN models are developed to ensure robust evaluation.

Coding details for building a DNN model for estimating EL and CC at the planning stage.

# Assumes h2o.init() has been called and that the training, test, and validation frames below already exist
library(h2o)

# Building the DNN Model
depth <- sample(2:5, 1)   # Randomly select the depth (2 to 5 hidden layers)

plan_Eco_dnn <- h2o.deeplearning(
  x = 1:11,                 # Columns 1 to 11 are the features
  y = 12,                   # Column 12 is the target variable (EL or CC)
  training_frame = train_dlan_Eco,
  validation_frame = test_dlan_Eco,
  nfolds = 2,               # Number of folds for cross-validation
  distribution = "gaussian",   # Distribution type for the target variable
  activation = "Rectifier",    # Activation function (ReLU)
  hidden = sample(1:300, depth, TRUE),   # Randomly select the node count of each hidden layer (1 to 300)
  rho = 0.90,               # Adaptive learning rate time decay factor (ADADELTA)
  epsilon = 1e-07,          # Adaptive learning rate smoothing factor (ADADELTA)
  input_dropout_ratio = sample(400:800, 1, TRUE) / 1000,   # Input dropout ratio (0.4 to 0.8)
  # One dropout ratio per hidden layer; the H2O documentation describes this argument as
  # applicable to the *WithDropout activations (e.g., "RectifierWithDropout")
  hidden_dropout_ratios = sample(400:800, depth, TRUE) / 1000,
  loss = "Automatic",       # Loss function to use
  stopping_rounds = 5,      # Stop after 5 scoring rounds without improvement
  stopping_metric = "AUTO", # Metric for early stopping
  stopping_tolerance = 0.01,   # Relative improvement required to continue training
  sparse = TRUE,            # Sparse data handling
  epochs = 300              # Number of training epochs (can also use 500)
)
# Prediction Process
prediction_plan_Eco_dnn <- h2o.predict(plan_Eco_dnn, newdata = vali_plan_Eco)
# Calculate the Error Rate (mean absolute percentage deviation from the actual values)
error_rate <- mean(abs((prediction_plan_Eco_dnn$predict / vali_plan_Eco_1$Eco) - 1) * 100)
# Print the error rate
print(error_rate)
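The random configuration step described above (depth of 2 to 5 hidden layers, 1 to 300 nodes per layer, dropout ratios between 0.4 and 0.8, 16 models per variable combination) can be sketched in base R without H2O. This is only an illustration of the sampling scheme: the helper name `sample_dnn_config`, the fixed seed, and the summary table are assumptions introduced here, not part of the study's code.

```r
# Sketch only: draw 16 candidate DNN configurations with base R.
# `sample_dnn_config` and the seed are hypothetical, chosen for illustration.
set.seed(42)

sample_dnn_config <- function() {
  depth <- sample(2:5, 1)                            # 2 to 5 hidden layers
  list(
    depth = depth,
    hidden = sample(1:300, depth, replace = TRUE),   # 1 to 300 nodes per layer
    input_dropout = sample(400:800, 1) / 1000,       # 0.40 to 0.80
    hidden_dropout = sample(400:800, depth, replace = TRUE) / 1000
  )
}

# 16 candidates for one variable combination
configs <- lapply(seq_len(16), function(i) sample_dnn_config())

# One row per candidate for a quick overview
summary_table <- data.frame(
  model = seq_along(configs),
  depth = vapply(configs, function(cfg) cfg$depth, integer(1)),
  hidden = vapply(configs, function(cfg) paste(cfg$hidden, collapse = "-"), character(1)),
  input_dropout = vapply(configs, function(cfg) cfg$input_dropout, numeric(1))
)
print(summary_table)
```

Each element of `configs` could then be passed to a training call such as the one shown above, and the candidate with the lowest validation error kept.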

3.3.1. Architecture and Prediction Performance for EL and CC Estimation

Table 15 summarizes the average error rates and standard deviations of the DNN model over the 10 verification cases. For planning-stage EL estimation, combination 4 performed best among combinations 1 to 4, with an average error rate of 27.1% and a standard deviation of 18.6%.
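As a worked check, the per-case error rates for combination 4 can be recomputed from the actual and predicted values listed in Table 15. The sketch below is base R, and its variable names are chosen here for illustration. With the tabulated (rounded) predictions, the average comes out near the reported 27.1%. The reported 18.6% appears to correspond to the standard deviation that divides by n, whereas R's `sd()` divides by n - 1 and gives about 19.6%.

```r
# Values copied from Table 15 (planning-stage EL, combination 4)
actual       <- c(8174, 7852, 8490, 2917, 3892, 3716, 6690, 4273, 3337, 5474)
predicted_c4 <- c(4925, 4565, 4748, 4723, 4153, 4190, 5031, 4636, 4283, 5342)

# Per-case error rate in percent: |predicted / actual - 1| * 100
error_rate <- abs(predicted_c4 / actual - 1) * 100

avg_error     <- mean(error_rate)
sd_population <- sqrt(mean((error_rate - avg_error)^2))  # divides by n
sd_sample     <- sd(error_rate)                          # divides by n - 1

print(round(c(average = avg_error,
              sd_population = sd_population,
              sd_sample = sd_sample), 1))
```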

As shown in Table 16, for planning-stage CC estimation, combination 3 performed best among combinations 1 to 4, with an average error rate of 17.0% and a standard deviation of 9.8%. Accordingly, the planning-stage DNN models satisfied the planning-stage error tolerance (30%) set in this study.

Based on the results in Table 17, combination 1 of the design-stage EL estimation DNN model yielded the best performance, with an average error rate of 24.0% and a standard deviation of 13.8%. For the CC estimation (Table 18), combination 2 performed best, achieving an average error rate of 14.6% and a standard deviation of 3.9%.

While the DNN model for CC met the predefined 20% error tolerance for the design stage, the EL model did not. Although DNNs generally offer high predictive capability, their effectiveness in this study was constrained by limited data availability. The results indicate that dataset size plays a critical role in realizing the full potential of deep learning models in construction project estimation.

3.3.2. Optimal DNN Model Configuration and Performance Assessment

The optimal DNN model configurations for the planning and design stages are as follows. For EL estimation at the planning stage, the optimal DNN model has 10 input variables and two hidden layers with 184 and 155 nodes; the dropout rates are 51.8% in the input layer and 74.3% and 46.6% in the first and second hidden layers, respectively.

For CC estimation at the planning stage, the optimal DNN model has nine input variables and two hidden layers with 234 and 146 nodes; the dropout rates are 47.9% in the input layer and 43.5% and 69.0% in the first and second hidden layers, respectively.

For EL estimation at the design stage, the optimal DNN model has 12 input variables and three hidden layers with 13, 8, and 258 nodes; the dropout rates are 52.2% in the input layer and 66.1%, 48.6%, and 41.6% in the three hidden layers, respectively.

For CC estimation at the design stage, the optimal DNN model has 12 input variables and two hidden layers with 5 and 46 nodes; the dropout rates are 47.6% in the input layer and 74.7% and 57.1% in the first and second hidden layers, respectively.

The training and validation MSE and RMSE values presented in Table 19 and Table 20 show minimal differences, indicating that the chosen model parameters and the autoencoder-based variable selection effectively controlled overfitting. However, the RMSE trends in Figure 5 and Figure 6 suggest that optimal convergence was not fully achieved within the given epoch range. Further parameter tuning would increase model complexity and could degrade performance. To enhance learning stability and predictive accuracy, expanding the dataset is recommended, as additional data would allow the model to generalize better without overfitting.

3.3.3. Additional Evaluation Metrics

In addition to MSE and RMSE, we calculated MAPE, MAE, and R² to offer a more comprehensive assessment of model performance. Table 21 presents these metrics for the optimal ANN and DNN models across planning and design stages. The results show that the DNN models generally achieved lower MAPE and MAE values and higher R² scores, particularly in EL prediction, reinforcing their ability to handle complex, high-dimensional datasets. ANN models performed competitively for CC estimation, especially where data patterns were more structured.
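For reference, the five metrics named above can be computed with a few lines of base R. The sketch below uses the standard textbook definitions; the function name `regression_metrics` is hypothetical. It is applied to the ANN planning-stage EL values for combination 2 from Table 11, in original Eco-Point units. The MSE and RMSE values reported elsewhere in the paper are of a different magnitude and are presumably computed on normalized targets, so only the MAPE (about 30.3%, the average error rate in Table 11) is directly comparable.

```r
# Standard definitions of MSE, RMSE, MAE, MAPE (in percent), and R^2
regression_metrics <- function(actual, predicted) {
  residual <- actual - predicted
  mse <- mean(residual^2)
  c(
    MSE  = mse,
    RMSE = sqrt(mse),
    MAE  = mean(abs(residual)),
    MAPE = mean(abs(residual / actual)) * 100,
    R2   = 1 - sum(residual^2) / sum((actual - mean(actual))^2)
  )
}

# Values copied from Table 11 (ANN, planning-stage EL, combination 2)
actual           <- c(8174, 7852, 8490, 2917, 3892, 3716, 6690, 4273, 3337, 5474)
predicted_ann_c2 <- c(5320, 4809, 4745, 5145, 4535, 4445, 5431, 4968, 4522, 5614)

metrics <- regression_metrics(actual, predicted_ann_c2)
print(round(metrics, 3))
```

Reporting MAE and R² alongside MAPE is useful here because MAPE weights small projects heavily: the same absolute miss produces a much larger percentage error when the actual value is small, as in Case 4.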

4. Results

4.1. Comparison of ANN and DNN

Figure 7 presents a comparative analysis of the prediction error rates for EL and CC estimation using the ANN and DNN during the design stage of road construction projects. The results demonstrate that the DNN model achieved superior performance in EL estimation, recording an error rate of 29.4%, which is below the predefined threshold of 30%. In contrast, the ANN model exhibited a higher error rate of 35.1%, exceeding the acceptable limit. This suggests that the deeper architecture of the DNN was more effective in capturing complex patterns in the high-dimensional input space, which is critical for environmental impact prediction.

In terms of CC estimation, both models delivered error rates below the acceptable 20% threshold, with the ANN model slightly outperforming the DNN model (17.3% vs. 18.6%). This indicates that for cost estimation—where the data may be more linearly structured or less sensitive to deeper abstraction—simpler architectures such as the ANN may still be highly effective.

The figure also includes red threshold lines (30% for EL and 20% for CC) to emphasize model acceptability boundaries. The performance trends highlight that model selection should be aligned with task complexity: DNNs may be better suited for tasks involving nonlinear relationships and noise-prone inputs, such as environmental data, while ANNs can perform competitively in more structured domains like cost estimation. It is also noted that DNNs, while more expressive, underperformed ANNs in certain settings (e.g., design-stage CC prediction). This may be attributed to the limited dataset size, the linear nature of cost variables, and the increased sensitivity of deep models to input noise and architecture variability. These results reinforce the importance of matching model complexity to task characteristics.
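The comparison in this subsection can be restated as a small lookup, which makes the threshold check and the per-target choice explicit. The sketch below is base R and uses only the error rates and thresholds quoted above; the column names are chosen here for illustration, and treating a value exactly equal to the threshold as acceptable is an assumption.

```r
# Design-stage error rates quoted in Section 4.1 (percent)
results <- data.frame(
  target     = c("EL", "EL", "CC", "CC"),
  model      = c("ANN", "DNN", "ANN", "DNN"),
  error_rate = c(35.1, 29.4, 17.3, 18.6),
  stringsAsFactors = FALSE
)

# Acceptability thresholds shown as red lines in Figure 7
thresholds <- c(EL = 30, CC = 20)
results$threshold  <- unname(thresholds[results$target])
results$acceptable <- results$error_rate <= results$threshold  # boundary handling is an assumption

# Lowest-error model for each target
best <- do.call(rbind, lapply(split(results, results$target), function(group) {
  group[which.min(group$error_rate), ]
}))

print(results)
print(best)
```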

These findings validate the approach of using autoencoder-based variable filtering and deep learning architectures to improve early-stage prediction accuracy in infrastructure planning, offering practical benefits for sustainability-focused project management and design decision-making.

As a future direction, this study will be expanded to benchmark the proposed ANN and DNN models against traditional and ensemble-based machine learning algorithms such as Support Vector Machines, Extreme Learning Machines, and XGBoost. This comparative analysis will provide a more comprehensive understanding of model suitability across various infrastructure estimation tasks and dataset characteristics.

4.2. Discussions

This study introduces a dual-phase modeling approach that separates planning- and design-stage variables to improve prediction accuracy for construction cost (CC) and EL. Unlike previous studies that rely on single-phase data, this structure enables phase-specific modeling, allowing for a more granular analysis of variable relevance, data uncertainty, and model suitability at distinct stages of project development. The planning phase emphasizes early estimations under incomplete information, whereas the design phase incorporates detailed quantities, thereby enhancing the applicability of machine learning in real-world infrastructure decision-making.

Despite these contributions, several modeling limitations must be acknowledged. While DNNs performed well in capturing complex patterns—particularly in EL prediction—they did not consistently outperform ANNs, especially in settings where data were more structured or linear, such as CC prediction. This may stem from the limited size and scope of the dataset (150 national road projects), which restricts the learning capacity of deeper architectures and increases sensitivity to noise or architectural over-parameterization. Moreover, the empirical tuning of network depth, dropout rates, and node sizes, while cross-validated, lacked theoretical justification and could benefit from future sensitivity and ablation analyses.

Addressing these generalizability concerns requires a multi-pronged strategy. First, future research should involve external validation using independent datasets from different regions or infrastructure types to assess robustness under domain shift. Second, transfer learning techniques will be explored to adapt pre-trained models to new but related datasets—particularly useful in low-data environments. Third, benchmarking against alternative machine learning models such as SVMs, ELMs, and XGBoost is planned to evaluate model competitiveness and task suitability across prediction contexts.

Additionally, efforts will focus on expanding the dataset to cover a broader geographic and functional spectrum of road infrastructure projects. This would enable the development of more generalized models with stronger predictive performance and real-world applicability. Lastly, the integration of these models into decision-support tools can assist planners and engineers in making informed, sustainability-oriented decisions in the early phases of highway project development.

5. Conclusions

This study developed and evaluated machine learning models—specifically ANNs and DNNs—to estimate EL and CC during both the planning and design stages of national road projects. The findings offer key insights into model performance, practical implications, and future research directions.

A structured dataset of 150 completed South Korean national road projects was compiled, forming planning- and design-phase databases. Emphasis was placed on 19 high-impact sub-work types to improve predictive accuracy and minimize irrelevant input noise.
To address the 4.47% missing data in the design-stage database, a hybrid imputation strategy combining mean substitution and random forest-based modeling was applied. This method preserved overall data distributions while reducing standard deviations by up to 5%, enhancing data stability and model readiness.
Dimensionality reduction via an autoencoder effectively filtered key variables, retaining only 16 critical features such as culvert concrete pouring and frost protection layers, while maintaining 97% of the dataset’s explanatory power, thereby reducing redundancy.
ANN models benefited from cross-validation and hyperparameter optimization, achieving strong performance metrics (MSE = 0.06, RMSE = 0.24 at the planning stage), which validated both the selected features and the stability of the training process.
The best-performing ANN models yielded average error rates of 29.8% for EL and 21.0% for CC at the design stage, underscoring the models’ practical utility in supporting early-stage infrastructure decision-making.
Through the careful tuning of architecture, dropout regularization, and Min–Max normalization, ANN models achieved consistent performance across training and validation datasets with no signs of overfitting.
DNN models also demonstrated strong predictive capabilities, achieving average error rates of 27.1% and 17.0% for planning-stage EL and cost estimations and 24.0% and 14.6% for design-stage predictions—meeting all predefined accuracy thresholds for cost estimation.
Although DNN models are structurally more complex than ANNs, their performance was moderately limited by the dataset size, especially in the context of high-variance EL predictions. Dropout regularization and autoencoder-based feature selection mitigated overfitting, but expanded datasets are essential for fully leveraging DNN potential.
Comparative analysis showed that DNNs slightly outperformed ANNs in EL estimation (29.4% vs. 35.1%), while ANNs had a marginal advantage in cost prediction (17.3% vs. 18.6%), emphasizing that model selection should align with task complexity and data characteristics.
Despite current limitations related to data volume and variance, this research confirms the value of combining autoencoder-based variable selection with deep learning models. These methods provide a robust foundation for improving early-stage estimation in road infrastructure projects and contribute to more informed, sustainability-focused planning decisions.
Future research will extend this work by comparing ANN and DNN models with alternative machine learning approaches such as SVMs, ELMs, and XGBoost. Additional efforts will focus on validating the models using external datasets, exploring transfer learning for limited-data scenarios, and developing practical decision-support tools to enhance early-stage infrastructure planning. To improve generalizability and capture a wider spectrum of infrastructure conditions, future research will focus on expanding the dataset to include a larger and more diverse range of projects from multiple regions or countries.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2021-NR066174).

Data Availability Statement

The dataset is available on request from the authors.

Conflicts of Interest

The author declares no conflicts of interest.

Nomenclature

Abbreviation	Description
ANN	Artificial Neural Network
DNN	Deep Neural Network
EL	Environmental Load
CC	Construction Cost
MSE	Mean Squared Error
RMSE	Root Mean Squared Error
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
R²	Coefficient of Determination
SVM	Support Vector Machine
ELM	Extreme Learning Machine
XGBoost	Extreme Gradient Boosting
VR	Vertical Reinforcement (Pipe/Structure)
Ascon	Asphalt Concrete
KICT	Korea Institute of Civil Engineering and Building Technology

References

He, A.; Dong, Z.; Zhang, H.; Zhang, A.A.; Qiu, S.; Liu, Y.; Wang, K.C.P.; Lin, Z. Automated Pixel-Level Detection of Expansion Joints on Asphalt Pavement Using a Deep-Learning-Based Approach. Struct. Control Health Monit. 2023, 2023, 7552337.
Wang, Y.; Wang, C.; Zhang, H.; Liang, F.; Yuan, Z. Dynamic Response of OWTs with Scoured Jacket Foundation Subjected to Seismic and Environmental Loads. Mar. Struct. 2025, 103, 103839.
Mittermayr, D.; Freud, P.J.; Fischer, J. Fatigue Crack Growth Resistance under Superimposed Mechanical-Environmental Loads of Virgin and Recycled Polystyrene Using Cracked Round Bar Specimens. Eng. Fract. Mech. 2025, 319, 111042.
Lavassani, S.H.H.; Doroudi, R.; Gavgani, S.A.M. Optimization of Semi-Active Tuned Mass Damper Inerter for Enhanced Vibration Control of Jacket Platforms Using Multi-Objective Optimization Due to Environmental Load. Structures 2025, 78, 109305.
Elmas, F.; Algin, H.M. Soil-Monopile Interaction Assessment of Offshore Wind Turbines with Comprehensive Subsurface Modelling to Earthquake and Environmental Loads of Wind and Wave. Soil Dyn. Earthq. Eng. 2025, 192, 109293.
Li, Z.; Jiang, W.; van Dam, T.; Zou, X.; Chen, Q.; Chen, H. A Review on Modeling Environmental Loading Effects and Their Contributions to Nonlinear Variations of Global Navigation Satellite System Coordinate Time Series. Engineering 2025, 47, 26–37.
Ghanizadeh, A.R.; Ghanizadeh, A.; Asteris, P.G.; Fakharian, P.; Armaghani, D.J. Developing Bearing Capacity Model for Geogrid-Reinforced Stone Columns Improved Soft Clay Utilizing MARS-EBS Hybrid Method. Transp. Geotech. 2023, 38, 100906.
Cheng, M.Y.; Vu, Q.T.; Gosal, F.E. Hybrid Deep Learning Model for Accurate Cost and Schedule Estimation in Construction Projects Using Sequential and Non-Sequential Data. Autom. Constr. 2025, 170, 105904.
Liu, H.; Li, M.; Cheng, J.C.P.; Anumba, C.J.; Xia, L. Actual Construction Cost Prediction Using Hypergraph Deep Learning Techniques. Adv. Eng. Inform. 2025, 65, 103187.
Habib, O.; Abouhamad, M.; Bayoumi, A.E.M. Ensemble Learning Framework for Forecasting Construction Costs. Autom. Constr. 2025, 170, 105903.
Mahpour, A. Building Maintenance Cost Estimation and Circular Economy: The Role of Machine-Learning. Sustain. Mater. Technol. 2023, 37, e00679.
Bruzzone, A.G.; Sinelshchikov, K.; Gotelli, M.; Monaci, F.; Sina, X.; Ghisi, F.; Cirillo, L.; Giovannetti, A. Machine Learning and Simulation Modeling Large Offshore and Production Plants to Improve Engineering and Construction. Procedia Comput. Sci. 2025, 253, 3318–3324.
Farshadfar, Z.; Khajavi, S.H.; Mucha, T.; Tanskanen, K. Machine Learning-Based Automated Waste Sorting in the Construction Industry: A Comparative Competitiveness Case Study. Waste Manag. 2025, 194, 77–87.
Mahmoodzadeh, A.; Nejati, H.R.; Mohammadi, M. Optimized Machine Learning Modelling for Predicting the Construction Cost and Duration of Tunnelling Projects. Autom. Constr. 2022, 139, 104305.
Wang, R.; Asghari, V.; Cheung, C.M.; Hsu, S.C.; Lee, C.J. Assessing Effects of Economic Factors on Construction Cost Estimation Using Deep Neural Networks. Autom. Constr. 2022, 134, 104080.
Alsulamy, S. Comparative Analysis of Deep Learning Algorithms for Predicting Construction Project Delays in Saudi Arabia. Appl. Soft Comput. 2025, 172, 112890.
Li, Q.; Yang, Y.; Yao, G.; Wei, F.; Li, R.; Zhu, M.; Hou, H. Classification and Application of Deep Learning in Construction Engineering and Management—A Systematic Literature Review and Future Innovations. Case Stud. Constr. Mater. 2024, 21, e04051.
Lung, L.W.; Wang, Y.R.; Chen, Y.S. Leveraging Deep Learning and Internet of Things for Dynamic Construction Site Risk Management. Buildings 2025, 15, 1325.
Liu, Q.; He, P.; Peng, S.; Wang, T.; Ma, J. A Survey of Data-Driven Construction Materials Price Forecasting. Buildings 2024, 14, 3156.
Chen, N.; Lin, X.; Jiang, H.; An, Y. Automated Building Information Modeling Compliance Check through a Large Language Model Combined with Deep Learning and Ontology. Buildings 2024, 14, 1983.
Choi, S.M.; Cha, H.S.; Jiang, S. Hybrid Data Augmentation for Enhanced Crack Detection in Building Construction. Buildings 2024, 14, 1929.
Nguyen, H.L.; Tran, V.Q. Data-Driven Approach for Investigating and Predicting Rutting Depth of Asphalt Concrete Containing Reclaimed Asphalt Pavement. Constr. Build. Mater. 2023, 377, 131116.
Raza, M.S.; Sharma, S.K. Optimizing Porous Asphalt Mix Design for Permeability and Air Voids Using Response Surface Methodology and Artificial Neural Networks. Constr. Build. Mater. 2024, 442, 137513.
Mabrouk, G.M.; Elbagalati, O.S.; Dessouky, S.; Fuentes, L.; Walubita, L.F. Using ANN Modeling for Pavement Layer Moduli Backcalculation as a Function of Traffic Speed Deflections. Constr. Build. Mater. 2022, 315, 125736.
O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.
El-Chabib, H.; Nehdi, M.; Sonebi, M. Artificial Intelligence Model for Flowable Concrete Mixtures Used in Underwater Construction and Repair. ACI Mater. J. 2003, 100, 165–173.
Shehadeh, A.; Alshboul, O.; Al Mamlook, R.E.; Hamedat, O. Machine Learning Models for Predicting the Residual Value of Heavy Construction Equipment: An Evaluation of Modified Decision Tree, LightGBM, and XGBoost Regression. Autom. Constr. 2021, 129, 103827.
Chai, T.; Draxler, R.R. Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)? Geosci. Model Dev. Discuss. 2014, 7, 1525–1534.

Figure 1. Box plot of MSE values from six cross-validated autoencoder models for the planning-stage dataset.

Figure 2. Box plot of MSE values from the 6-fold cross-validated autoencoder models for the design-stage dataset.

Figure 3. RMSE plot showing training and validation trends for the ANN model used to estimate EL during the planning stage.

Figure 4. RMSE curves of the ANN model for estimating EL and CC during the design stage.

Figure 5. RMSE curves of the DNN model for estimating EL and CC during the planning stage.

Figure 6. RMSE curves of the DNN model for estimating EL during the design stage.

Figure 7. Comparative error rates of ANN and DNN models for estimating EL and CC at the design stage.

Table 1. Summary of missing values and quantity distributions by work type (before imputation).

Work Category	Missing	Q1	Median	Q3	Max (Q4)	Mean	Std. Dev
Excavation (m³)	0	2691	286,783	531,814	1,292,560	375,441	290,265
Ripping Arm (m³)	8	142	112,068	238,409	650,328	160,595	150,066
Blasting Rock (m³)	12	164	261,197	520,191	2,322,210	379,734	395,588
Dump Transport (m³)	1	17,418	641,946	1,017,225	2,393,845	718,342	569,633
Concrete Pouring (m³)	10	141	14,702	24,398	143,495	18,180	17,191
Frost Protection (m³)	1	509	40,229	60,283	180,180	44,499	29,716
Ascon Surface (ton)	3	1969	18,534	24,821	472,226	29,134	55,355

Table 2. Post-imputation quantity distribution (selected variables).

Work Category	Mean (Before)	Std Dev (Before)	Mean (After)	Std Dev (After)	% Change in SD
Blasting Rock	379,734	395,588	357,132	377,753	−5%
Green Zone Fill	177,314	297,435	178,926	296,379	−0.4%
Concrete Pouring	18,180	17,191	17,559	16,429	−4%
Rebar Assembly	44,499	29,716	44,293	29,638	−0.3%
Asphalt Surface	29,134	55,355	28,847	54,581	−1%
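The last column of Table 2 can be reproduced from the two standard-deviation columns as (SD after - SD before) / SD before × 100. The base R sketch below recomputes it from the tabulated values; the data frame and column names are chosen here for illustration. Before rounding, the changes are roughly -4.5%, -0.4%, -4.4%, -0.3%, and -1.4%, which agree with the rounded figures in the table.

```r
# Standard deviations copied from Table 2
sd_table <- data.frame(
  work_category = c("Blasting Rock", "Green Zone Fill", "Concrete Pouring",
                    "Rebar Assembly", "Asphalt Surface"),
  sd_before = c(395588, 297435, 17191, 29716, 55355),
  sd_after  = c(377753, 296379, 16429, 29638, 54581),
  stringsAsFactors = FALSE
)

# Relative change in standard deviation after imputation, in percent
sd_table$pct_change_sd <-
  (sd_table$sd_after - sd_table$sd_before) / sd_table$sd_before * 100

print(transform(sd_table, pct_change_sd = round(pct_change_sd, 1)))
```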

Table 3. Change in mean–standard deviation gap due to imputation.

Variable	Mean–SD Gap (Before)	Mean–SD Gap (After)	Change (%)	Interpretation
Ascon Middle Layer	9258	25,735	+178%	Reduced predictability
Rebar Assembly	2212	24,435	+1005%	High noise added
Blasting Rock	15,854	20,620	+30%	Slight increase
Green Zone Fill	120,121	117,453	−2%	Stable
Asphalt Surface	26,221	2421	−91%	Improved modeling stability

Table 4. Planning-stage autoencoder model information.

Model 5
Number of learnings: 24,000
division	Layer	Units	Dropout rate
Input layer	1	11	78%
Hidden layer	2	482	56%
	3	481	72%
	4	212	74%
Output layer	5	11	-
Learning process
MSE		0.08
RMSE		0.28
Verification process
MSE		0.06
RMSE		0.24

Table 5. Planning-stage autoencoder variable reproduction mean square error.

	A	B	C	D	E	F	G	H	I	J	K
	MSE
1	0.26	0.01	0.01	0.17	0.01	0.32	0.06	0.01	0.05	0.07	0.00
2	0.06	0.00	0.07	0.17	0.01	0.32	0.00	0.00	0.06	0.07	0.00
3	0.01	0.00	0.07	0.01	0.01	0.32	0.01	0.00	0.16	0.07	0.00
4	0.01	0.02	0.01	0.17	0.01	0.32	0.02	0.03	0.00	0.00	0.30
5	0.15	0.01	0.07	0.01	0.01	0.32	0.00	0.00	0.00	0.07	0.00
6	0.02	0.02	0.35	0.01	0.35	0.32	0.02	0.00	0.05	0.36	0.02
7	0.14	0.01	0.01	0.17	0.01	0.19	0.03	0.00	0.00	0.07	0.00
8	0.06	0.00	0.01	0.01	0.01	0.19	0.00	0.00	0.15	0.36	0.00
9	0.01	0.00	0.01	0.01	0.35	0.19	0.03	0.01	0.09	0.36	0.00
10	0.14	0.01	0.01	0.17	0.01	0.32	0.11	0.01	0.00	0.07	0.00
Average	0.09	0.01	0.06	0.09	0.08	0.28	0.03	0.01	0.06	0.15	0.03
	A	B	C	D	E	F	G	H	I	J	K
	Administrative district	Road height	Road grade	Topography	Design speed	Type of construction	Road length	Road area	Pavement thickness	Number of lanes	Road width

Table 6. Identifying the importance of planning-stage variables.

	Relative Importance	Ratio	Cumulative Ratio
Type of construction	1.00	17%	17%
Topography	0.90	15%	32%
Road width	0.81	13%	45%
Pavement thickness	0.53	9%	54%
Number of lanes	0.47	8%	62%
Road grade	0.46	8%	70%
Administrative district	0.45	8%	77%
Road length	0.37	6%	83%
Road height	0.35	6%	89%
Design speed	0.33	5%	95%
Road area	0.31	5%	100%
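The cumulative ratios in Table 6 follow from dividing each relative importance by the column total and accumulating down the ranking. The base R sketch below recomputes them and reads off how many top-ranked variables are needed to reach the cumulative ratios used for combinations 1 to 4 (77%, 83%, 89%, 95%). Three variable names follow the wording of Table 7, and the helper `n_vars_for` is hypothetical. Because the published importances are already rounded, individual ratios can differ from the table by one percentage point, while the cumulative ratios agree.

```r
# Relative importances copied from Table 6, in ranked order
importance <- c(
  "Type of construction"    = 1.00,
  "Topography"              = 0.90,
  "Road width"              = 0.81,
  "Pavement thickness"      = 0.53,
  "Number of lanes"         = 0.47,
  "Road grade"              = 0.46,
  "Administrative district" = 0.45,
  "Road length"             = 0.37,
  "Road height"             = 0.35,
  "Design speed"            = 0.33,
  "Road area"               = 0.31
)

ratio      <- importance / sum(importance)   # share of total importance
cumulative <- cumsum(ratio)                  # running share down the ranking

ranking <- data.frame(
  variable       = names(importance),
  cumulative_pct = round(cumulative * 100),
  row.names      = NULL
)
print(ranking)

# Number of top-ranked variables needed to reach a cumulative ratio (percent)
n_vars_for <- function(target_pct) {
  unname(which(round(cumulative * 100) >= target_pct)[1])
}
print(sapply(c(77, 83, 89, 95), n_vars_for))
```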

Table 7. Planning-stage variable combinations used to build ANN and DNN models.

Division	Variable Combination (Number of Variables)	Cumulative Ratio
Combination 1	Construction type, terrain, road width, pavement thickness, number of lanes, road grade, administrative district (7)	77%
Combination 2	Construction type, terrain, road width, pavement thickness, number of lanes, road grade, administrative district, road length (8)	83%
Combination 3	Construction type, terrain, road width, pavement thickness, number of lanes, road grade, administrative district, road length, road height (9)	89%
Combination 4	Construction type, terrain, road width, pavement thickness, number of lanes, road grade, administrative district, road length, road height, road area (10)	95%

Table 8. Design-stage autoencoder model 6 information.

Model 6
Number of learnings: 24,000
division	Layer	Units	Dropout rate
Input layer	1	19	76%
Hidden layer	2	277	72%
Output layer	3	19	-
Learning process
MSE		0.29
RMSE		0.53
Verification process
MSE		0.06
RMSE		0.23

Table 9. Identifying the importance of design-stage variables.

	Significant Importance	Ratio	Cumulative Ratio
Pouring concrete for culvert	1.00	8%	8%
Underground construction	0.93	8%	16%
Frost protection layer	0.89	7%	24%
Underground rebar processing and assembly	0.88	7%	31%
Dump transport	0.78	7%	38%
No body	0.74	6%	44%
Horizontal drainage pipe VR pipe	0.74	6%	50%
Tossa	0.73	6%	56%
Ascon base layer	0.72	6%	62%
Ripping arm	0.69	6%	68%
Ceramic transport	0.68	6%	74%
Formwork for culvert	0.64	5%	79%
Transverse drain pipe wing wall	0.58	5%	84%
Horizontal drainage pipe VR pipe	0.55	5%	88%
Blasting rock	0.38	3%	92%
Ascon middle layer	0.30	3%	94%
Green land reclamation	0.29	2%	97%
road	0.26	2%	99%
Ascon surface	0.14	1%	100%

Table 10. Design-stage variable combinations used to build ANN and DNN models.

Division	Variable Combination (Number of Variables)	Cumulative Ratio
Combination 1	Culvert concrete pouring, culvert scaffolding, frost protection layer, culvert reinforcement processing and assembly, dump transport, furnace body, transverse drainage pipe VR pipe, soil, asphalt base, ripping rock, ceramic transport, culvert formwork (12)	79%
Combination 2	Culvert concrete pouring, culvert scaffolding, frost protection layer, culvert reinforcement processing and assembly, dump transport, furnace body, transverse drainage pipe VR pipe, soil, asphalt base, ripping rock, ceramic transport, culvert formwork, transverse drainage pipe wing wall (13)	84%
Combination 3	Culvert concrete pouring, culvert scaffolding, frost protection layer, culvert reinforcement processing and assembly, dump transport, furnace body, transverse drainage pipe VR pipe, soil, asphalt base, ripping rock, ceramic transport, culvert formwork, transverse drainage pipe wing wall, blasting rock, asphalt intermediate layer (15)	94%
Combination 4	Culvert concrete pouring, culvert scaffolding, frost protection layer, culvert reinforcement processing and assembly, dump transport, furnace body, transverse drainage pipe VR pipe, soil, asphalt base, ripping rock, ceramic transport, culvert formwork, transverse drainage pipe wing wall, blasting rock, asphalt intermediate layer, green zone fill (16)	97%

Table 11. Results of prediction performance of ANN model for EL estimation at planning stage.

Division	EL Actual Value Unit: Eco-Point	Combination 1		Combination 2		Combination 3		Combination 4
Division	EL Actual Value Unit: Eco-Point	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate
Case 1	8174	5610	31.4%	5320	34.9%	6061	25.8%	4984	39.0%
Case 2	7852	4211	46.4%	4809	38.8%	5624	28.4%	4695	40.2%
Case 3	8490	4249	50.0%	4745	44.1%	5664	33.3%	4770	43.8%
Case 4	2917	4307	47.7%	5145	76.4%	5839	100.2%	4659	59.7%
Case 5	3892	5411	39.0%	4535	16.5%	5242	34.7%	6913	77.6%
Case 6	3716	5055	36.0%	4445	19.6%	5420	45.9%	4883	31.4%
Case 7	6690	4337	35.2%	5431	18.8%	6207	7.2%	5068	24.2%
Case 8	4273	4999	17.0%	4968	16.3%	5581	30.6%	4942	15.6%
Case 9	3337	4272	28.0%	4522	35.5%	5241	57.1%	4558	36.6%
Case 10	5474	4296	21.5%	5614	2.6%	6513	19.0%	5941	8.5%
	Average error rate	35.2%		30.3%		38.2%		37.7%
	Standard deviation	10.5%		19.6%		24.3%		19.2%

Table 12. ANN model prediction performance results for CC estimation at the planning stage.

Division	CC Actual Value Unit: 10 Million Won	Combination 1		Combination 2		Combination 3		Combination 4
Division	CC Actual Value Unit: 10 Million Won	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate
Case 1	1838	1662	9.6%	1639	10.8%	1754	4.5%	1520	17.3%
Case 2	1891	1643	13.1%	1558	17.6%	1512	20.0%	1519	19.7%
Case 3	1942	1647	15.2%	1552	20.1%	1549	20.2%	1512	22.1%
Case 4	1061	1655	56.0%	1570	48.0%	1580	48.9%	1514	42.6%
Case 5	1128	1659	47.1%	1504	33.4%	1311	16.3%	1486	31.8%
Case 6	1082	1605	48.4%	1542	42.5%	1146	5.9%	1491	37.8%
Case 7	1444	1655	14.6%	1667	15.4%	1786	23.7%	1549	7.3%
Case 8	1937	1656	14.5%	1570	18.9%	1548	20.1%	1525	21.2%
Case 9	1307	1642	25.6%	1495	14.4%	1389	6.2%	1488	13.8%
Case 10	2093	1655	20.9%	1722	17.8%	1944	7.1%	1559	25.5%
	Average error rate	26.5%		23.9%		17.3%		23.9%
	Standard deviation	16.4%		12.1%		12.6%		10.3%

Table 13. Design-stage EL estimation ANN model prediction performance results.

Division	EL Actual Value Unit: Eco-Point	Combination 1		Combination 2		Combination 3		Combination 4
Division	EL Actual Value Unit: Eco-Point	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate
Case 1	8174	4478	45.2%	6264	23.4%	7423	9.2%	8562	4.7%
Case 2	7852	5131	34.7%	6378	18.8%	6866	12.6%	8299	5.7%
Case 3	8490	4948	41.7%	8842	4.1%	7517	11.5%	9732	14.6%
Case 4	2917	5314	82.2%	5182	77.6%	4352	49.2%	6128	110.1%
Case 5	3892	6154	58.1%	5254	35.0%	4966	27.6%	6396	64.4%
Case 6	3716	4948	33.2%	6157	65.7%	5208	40.2%	6909	86.0%
Case 7	6690	4948	26.0%	7132	6.6%	6912	3.3%	8673	29.6%
Case 8	4273	4978	16.5%	6312	47.7%	6048	41.5%	7938	85.8%
Case 9	3337	4478	34.2%	6896	106.7%	5914	77.2%	7415	122.2%
Case 10	5474	5116	6.5%	6987	27.6%	6896	26.0%	8304	51.7%
Average error rate			37.8%		41.3%		29.8%		57.5%
Standard deviation			20.2%		31.4%		21.6%		40.9%

Table 14. The optimal ANN model for estimating EL and CC at the planning stage.

Planning stage, Combination 2
Layer	Number of nodes	Dropout rate
1	8	59.3%
2	122	56.9%
3	1	-
Process	MSE	RMSE
Learning	0.07	0.27
Verification	0.05	0.21

Planning stage, Combination 3
Layer	Number of nodes	Dropout rate
1	10	46.1%
2	296	43.6%
3	1	-
Process	MSE	RMSE
Learning	0.03	0.18
Verification	0.04	0.20
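
Since RMSE is the square root of MSE, the printed pairs in Table 14 can be checked against each other. Some pairs (0.05 and 0.21, for instance) are not exact roots as printed, but they are compatible once both numbers are treated as rounded to two decimals. The sketch below illustrates that check; the helper name `rmse_consistent` and the two-decimal rounding assumption are introduced here for illustration and are not part of the study.

```python
import math


def rmse_consistent(mse, rmse, half_step=0.005):
    """True if a rounded RMSE can equal the square root of a rounded MSE.

    Both values are assumed to be rounded to two decimals, so each one
    stands for an interval of +/- half_step around the printed number.
    """
    lo = math.sqrt(max(mse - half_step, 0.0))
    hi = math.sqrt(mse + half_step)
    # The two intervals must overlap for the pair to be compatible.
    return lo <= rmse + half_step and hi >= rmse - half_step


# (MSE, RMSE) pairs as printed in Table 14.
table_14 = {
    "Combination 2, learning": (0.07, 0.27),
    "Combination 2, verification": (0.05, 0.21),
    "Combination 3, learning": (0.03, 0.18),
    "Combination 3, verification": (0.04, 0.20),
}

for label, (mse, rmse) in table_14.items():
    print(label, rmse_consistent(mse, rmse))
```

Under this assumption every pair in Table 14 passes the check, so the apparent mismatches are attributable to rounding rather than to inconsistent reporting.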

Table 15. Results of prediction performance of the DNN model for EL estimation at the planning stage.

Division	EL Actual Value (Unit: Eco-Point)	Combination 1		Combination 2		Combination 3		Combination 4
		Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate
Case 1	8174	4698	42.5%	6540	20.0%	5652	30.9%	4925	39.8%
Case 2	7852	4698	40.2%	5008	36.2%	5614	28.5%	4565	41.9%
Case 3	8490	4698	44.7%	5134	39.5%	5662	33.3%	4748	44.1%
Case 4	2917	4698	61.1%	5262	80.4%	5664	94.2%	4723	61.9%
Case 5	3892	4699	20.7%	4598	18.1%	5668	45.6%	4153	6.7%
Case 6	3716	4699	26.5%	5171	39.2%	5625	51.4%	4190	12.8%
Case 7	6690	4698	29.8%	6668	0.3%	5616	16.0%	5031	24.8%
Case 8	4273	4699	10.0%	5046	18.1%	5656	32.4%	4636	8.5%
Case 9	3337	4699	40.8%	4634	38.9%	5655	69.5%	4283	28.4%
Case 10	5474	4698	14.2%	7347	34.2%	5647	3.2%	5342	2.4%
Average error rate			33.0%		32.5%		40.5%		27.1%
Standard deviation			14.9%		20.1%		24.9%		18.6%

Table 16. DNN model prediction performance results for CC estimation at the planning stage.

Division	CC Actual Value (Unit: 10 Million Won)	Combination 1		Combination 2		Combination 3		Combination 4
		Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate
Case 1	1838	1384	24.7%	1714	6.7%	1595	13.2%	1494	18.7%
Case 2	1891	1383	26.9%	1714	9.3%	1440	23.9%	1492	21.1%
Case 3	1942	1383	28.8%	1714	11.7%	1452	25.2%	1489	23.3%
Case 4	1061	1385	30.5%	1714	61.5%	1420	33.8%	1492	40.6%
Case 5	1128	1384	22.7%	1714	52.0%	1201	6.5%	1481	31.4%
Case 6	1082	1382	27.7%	1714	58.5%	1169	8.1%	1485	37.3%
Case 7	1444	1382	4.3%	1714	18.7%	1704	18.0%	1503	4.1%
Case 8	1937	1382	28.7%	1714	11.5%	1392	28.2%	1492	23.0%
Case 9	1307	1381	5.6%	1714	31.1%	1263	3.4%	1489	13.9%
Case 10	2093	1383	33.9%	1714	18.1%	1887	9.8%	1505	28.1%
Average error rate			23.4%		27.9%		17.0%		24.1%
Standard deviation			9.7%		20.4%		9.8%		10.3%

Table 17. Design-stage EL estimation DNN model prediction performance results.

Division	EL Actual Value (Unit: Eco-Point)	Combination 1		Combination 2		Combination 3		Combination 4
		Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate
Case 1	8174	5300	35.2%	4003	51.0%	7033	14.0%	6961	14.8%
Case 2	7852	5358	31.8%	4044	48.5%	6324	19.5%	6769	13.8%
Case 3	8490	5435	36.0%	4026	52.6%	7797	8.2%	7332	13.6%
Case 4	2917	3540	21.4%	3681	26.2%	4031	38.2%	4115	41.1%
Case 5	3892	4225	8.6%	3754	3.5%	4645	19.4%	4864	25.0%
Case 6	3716	4009	7.9%	3643	2.0%	4813	29.5%	4914	32.2%
Case 7	6690	4959	25.9%	3919	41.4%	6798	1.6%	6699	0.1%
Case 8	4273	5157	20.7%	4041	5.4%	5928	38.7%	6204	45.2%
Case 9	3337	4977	49.2%	4068	21.9%	5686	70.4%	6131	83.8%
Case 10	5474	5663	3.5%	4219	22.9%	7153	30.7%	7084	29.4%
Average error rate			24.0%		27.5%		27.0%		29.9%
Standard deviation			13.8%		18.9%		18.6%		22.2%

Table 18. Design-stage CC estimation DNN model prediction performance results.

Division	CC Actual Value (Unit: 10 Million Won)	Combination 1		Combination 2		Combination 3		Combination 4
		Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate	Predicted Value	Error Rate
Case 1	1838	1442	21.5%	1669	9.2%	1529	16.8%	2024	10.1%
Case 2	1891	1442	23.7%	1554	17.8%	1526	19.3%	2027	7.2%
Case 3	1942	1442	25.8%	1567	19.3%	1533	21.1%	2115	8.9%
Case 4	1061	1380	30.1%	1214	14.4%	1505	41.8%	1485	39.9%
Case 5	1128	1404	24.5%	1278	13.4%	1509	33.8%	1483	31.5%
Case 6	1082	1371	26.7%	1245	15.1%	1510	39.6%	1545	42.8%
Case 7	1444	1442	0.2%	1564	8.3%	1529	5.9%	2079	44.0%
Case 8	1937	1442	25.6%	1581	18.4%	1521	21.5%	1884	2.7%
Case 9	1307	1442	10.3%	1445	10.6%	1518	16.2%	1862	42.4%
Case 10	2093	1442	31.1%	1691	19.2%	1528	27.0%	2046	2.2%
Average error rate			21.9%		14.6%		24.4%		23.2%
Standard deviation			9.1%		3.9%		10.7%		17.4%
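
Read together with Table 20, Tables 17 and 18 suggest that the optimal design-stage DNN for each target is simply the input combination with the lowest average error rate. That selection rule is an inference from the tables, not a procedure quoted from the text. A minimal Python sketch of it follows, with `best_combination` as an illustrative name.

```python
# Average error rates (%) per input combination, copied from Tables 17 and 18.
design_stage_dnn = {
    "EL": {1: 24.0, 2: 27.5, 3: 27.0, 4: 29.9},
    "CC": {1: 21.9, 2: 14.6, 3: 24.4, 4: 23.2},
}


def best_combination(avg_error_by_combination):
    """Return (combination, error) with the smallest average error rate."""
    return min(avg_error_by_combination.items(), key=lambda item: item[1])


for target, averages in design_stage_dnn.items():
    combination, error = best_combination(averages)
    print(f"{target}: Combination {combination} ({error:.1f}%)")
```

The rule picks Combination 1 for EL and Combination 2 for CC, which matches the two combinations listed in Table 20.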

Table 19. Optimal DNN model information for estimating EL and CC at the planning stage.

Planning stage, Combination 4
Layer	Number of nodes	Dropout rate
1	10	51.8%
2	184	74.3%
3	155	46.6%
4	1	-
Process	MSE	RMSE
Learning	0.07	0.26
Verification	0.04	0.20

Planning stage, Combination 3
Layer	Number of nodes	Dropout rate
1	9	47.9%
2	234	43.5%
3	146	69.0%
4	1	-
Process	MSE	RMSE
Learning	0.08	0.28
Verification	0.04	0.19

Table 20. Optimal DNN model information for estimating EL and CC at the design stage.

Design stage, Combination 1
Layer	Number of nodes	Dropout rate
1	12	52.2%
2	13	66.1%
3	8	48.6%
4	258	41.6%
5	1	-
Process	MSE	RMSE
Learning	0.05	0.21
Verification	0.04	0.20

Design stage, Combination 2
Layer	Number of nodes	Dropout rate
1	12	47.6%
2	5	74.7%
3	56	57.1%
4	1	-
Process	MSE	RMSE
Learning	0.06	0.25
Verification	0.04	0.19

Table 21. Extended model evaluation metrics for optimal ANN and DNN models.

Stage	Model	Target	MAPE (%)	MAE	R²
Planning	ANN	EL	28.3	1.85	0.72
Planning	ANN	CC	16.2	1.21	0.81
Planning	DNN	EL	25.7	1.74	0.76
Planning	DNN	CC	13.8	1.09	0.84
Design	ANN	EL	30.6	2.07	0.68
Design	ANN	CC	15.4	1.18	0.79
Design	DNN	EL	23.9	1.63	0.74
Design	DNN	CC	13.2	0.98	0.86
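
For reference, the three metrics in Table 21 are sketched below using their standard textbook definitions. Which exact variants and which data scaling were used to produce the tabulated values is not stated with the table, so the sketch does not attempt to reproduce them. The function names and the toy values are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, not taken from the study.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

print(mae(actual, predicted))
print(mape(actual, predicted))
print(r_squared(actual, predicted))
```

MAPE is scale-free, whereas MAE depends on the unit of the target, so MAE values are comparable across EL and CC only if both were computed on a common (for example, normalized) scale.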

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

