Buildings
  • Article
  • Open Access

28 March 2024

Strategies for Imputing Missing Values and Removing Outliers in the Dataset for Machine Learning-Based Construction Cost Prediction

Department of Architectural Engineering, Gyeongsang National University, Jinju 52828, Republic of Korea
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Smart and Proactive Construction Safety Combined with AI, IoT, and Big Data

Abstract

Accurately predicting construction costs during the initial planning stages is crucial for the successful completion of construction projects. Recent advancements have introduced various machine learning-based methods to enhance cost estimation precision. However, the accumulation of authentic construction cost data is not straightforward, and existing datasets frequently contain many missing values, posing challenges to precise cost predictions. This study analyzes diverse imputation methods for addressing missing values in construction cost data. Additionally, it evaluates the performance of machine learning models in cost prediction after the removal of conditional outliers. The primary goal is to identify and propose optimal strategies for handling missing values in construction cost records, ultimately improving the reliability of cost predictions. According to the analysis results, median imputation is the most suitable of the single imputation methods, while lasso regression imputation produces the best outcomes among the multiple imputation methods. This research contributes to enhancing the trustworthiness of construction cost predictions by presenting a pragmatic approach to managing missing data in construction cost performance records, thereby facilitating more precise project planning and execution.

1. Introduction

1.1. Research Background and Objectives

Most stakeholders in the construction industry acknowledge the importance of cost management in construction projects, particularly with respect to economic efficiency. However, the traditional understanding of cost management has been limited to cost reduction, focused primarily on minimizing material and labor expenses and on indirect savings through process management. Cost management has thus historically been perceived as a means of achieving savings in individual cost components []. For the successful completion of construction projects, it is crucial to accurately predict and manage construction costs and risks from the early stages, particularly during the project’s initiation and planning phases. In practice, however, the majority of South Korean construction companies and related institutions tend to treat cost information merely as data to be recorded rather than proactively assessing it. Recently, there have been attempts to introduce machine learning into cost management to achieve precise predictions through quantitative analysis.
The dataset used for machine learning training should be built from the actual construction cost data of past projects. However, standards or systems for accumulating and managing cost performance data from completed projects are typically lacking. The absence of systematic record-keeping results in insufficient available construction cost data and makes collection difficult []. Consequently, little actual construction cost data has been accumulated, and the accumulated data often contain numerous missing values, constraining the training of machine learning models. Missing values are prevalent in large-scale datasets in particular, with few exceptions []. These missing values render the dataset incomplete, and using incomplete data may lead to inaccurate interpretations of the patterns inherent in the data. Therefore, imputing missing values is a crucial step in the preprocessing phase before conducting the actual analysis []. Hence, selecting an appropriate strategy for handling incomplete data is of paramount importance in the field of data analysis [].
Therefore, the purpose of this study is to apply various imputation methods for missing values in the construction cost prediction machine learning model. The study aims to select a suitable method for imputing missing values in construction cost performance data based on the comparison of prediction performance and outlier removal results, considering the correlation characteristics among construction cost influencing factors. The data used in this study were collected from 906 construction cost performance records through the public procurement agencies and the Construction Cost Information Plaza provided by the Korea Public Procurement Service. Based on the collected data, influential factors have been identified, and a dataset has been constructed to execute machine learning for predicting construction costs based on this foundation. After defining the constructed model as the base model, various methods for imputing missing values are applied and categorized into single imputation methods and multiple imputation methods. The applied single imputation methods include median imputation, mode imputation, and mean imputation. The multiple imputation methods encompass regression-based approaches such as linear regression, support vector machine regression, ridge regression, random forest regression, lasso regression, decision tree regression, and K-nearest neighbors regression imputation. An analysis has been conducted on the training outcomes of a machine learning model for predicting construction costs using the applied imputed dataset and the outlier removal process. Through this analysis, training performance was compared, and the characteristics of influencing factors were validated. Based on these results, one of the most suitable methods for handling missing values in the imputed dataset will be chosen. The study aims to propose the optimal method for imputing missing values in the dataset for construction cost prediction.

1.2. The Scope and Methodology of the Research

This study compares imputation methods applied during the preprocessing phase of constructing a dataset for training machine learning models for construction cost prediction. The applied imputation methods comprise three single imputation methods and seven multiple imputation methods, with a particular focus on regression-based approaches. By applying these methods and comparing prediction performance and outlier removal results on the imputed datasets, the study aims to select a suitable imputation method for handling missing values in construction cost data, considering the correlation among factors influencing construction costs. The specific research methods are as follows.
First, factors influencing construction costs for the planning and design stages are derived based on information available through the ‘Korea Online E-Procurement System []’ provided by the Public Procurement Service of Korea. Through various research materials, it has been identified that construction cost influencing factors include total area, building area, site area, landscape area, basement, ground level, total height, typical floor height, and parking lots. Data were collected through the Public Procurement Service’s Korea Online E-Procurement System and the Construction Cost Information Plaza []. The collected data span from 2018 to 2022, covering construction projects. These projects include structures made of reinforced concrete and steel frame construction, and the purposes of the buildings range from educational facilities to government buildings, encompassing a variety of public structures. Based on the identified influential factors, a total of 906 data samples are collected according to the previously mentioned conditions. Utilizing the data collected based on these identified influential factors, a baseline model is constructed for the implementation of machine learning.
Secondly, various methods are applied to impute missing values in the constructed dataset for the prediction of construction costs, leading to the development of a complete construction cost training dataset. The constructed dataset undergoes conditional outlier removal and Z-score processing. Through this process, changes in the number of data points are observed, and the success of imputing missing values, considering the correlation among influencing factors in the actual construction cost data, is evaluated. Subsequently, the results are compared and analyzed.
Finally, utilizing the dataset that has undergone imputation for missing values and outlier removal during the data preprocessing phase, a machine learning model for construction cost prediction is trained. Throughout this process, various aspects such as error rates, accuracy during the machine learning process, and the comprehensive analysis of the correlation between the derived factors are considered. Based on this analysis, a suitable imputation method for handling missing values in actual construction cost data is selected. The research flowchart is depicted in Figure 1.
Figure 1. Research Flowchart.

3. Model Configuration and Outlier Removal for Cost Prediction

3.1. Missing Data Imputation Method

Imputing missing values means replacing them with values inferred from the relationships with other variables and from the observations in which the missing values occur. Imputation methods are categorized into single imputation and multiple imputation, both aiming to construct a complete dataset. Single imputation provides a single value generated through a series of processes as the replacement for missing values. Multiple imputation generates N singly imputed datasets through a series of systematic procedures. Each of the N datasets is analyzed separately, and the outcomes are then combined into a unified result.

3.1.1. Single Imputation Method

The single imputation method provides a single value generated through a series of processes as the replacement for missing values. It involves replacing missing values with a single value, such as the median, mean, or mode, which is generated through a systematic procedure. This approach has the advantage of relatively easy application, as it replaces missing values with a single value for the respective variable, allowing for the derivation of a complete dataset. However, since all missing values are replaced with a single value, there is a high likelihood of introducing bias into the estimates.
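The three single imputation strategies can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's implementation; the function name `impute_single` and the sample column are hypothetical.

```python
from statistics import mean, median, mode


def impute_single(values, strategy="median"):
    """Replace None entries with one summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    # Pick the statistic by name and compute it once for the whole column.
    fill = {"median": median, "mean": mean, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical 'landscape area' column (m^2) with two missing entries.
landscape_area = [120.0, None, 300.0, 120.0, None, 900.0]

for strategy in ("median", "mean", "mode"):
    print(strategy, impute_single(landscape_area, strategy))
```

Every gap in the column receives the same fill value, which is why this approach is easy to apply but can bias estimates, as noted above.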

3.1.2. Multiple Imputation Method

The multiple imputation method was proposed to address the limitations of single imputation methods, such as the occurrence of biases in estimates. The key distinction of multiple imputations lies in generating several sets of single imputed data through a series of processes. Each of the several datasets undergoes individual analytical procedures, and subsequently, the results are amalgamated into a unified outcome. However, to conduct an analysis on a single incomplete dataset, users need to possess multiple imputed datasets. Additionally, inaccuracies may arise in the imputed data when learning inherent data patterns during the process of data imputation. This can potentially lead to a decrease in the accuracy of the imputed data.
In this study, we employed the regression model-based imputation method among various multiple imputation techniques. The regression model-based imputation typically designates the dependent variable as the variable with missing values, while the independent variables utilize the given set of other variables []. This method involves establishing a regression model, applying regression analysis, and replacing missing values with the regression-predicted values obtained from the analysis results. The imputed values for missing data vary depending on the regression model used. In this research, the employed regression models include linear regression, support vector machine regression, ridge regression, random forest regression, lasso regression, decision tree regression, and k-nearest neighbors regression.
Linear regression is a regression analysis technique that models the linear correlation between the dependent variable y and one or more independent variables X. Equation (1) represents the linear regression model. Lasso regression and ridge regression are regularization methods used in linear regression models to reduce model complexity. Support vector machine regression calculates the similarity between input values and support vectors, assigns weights accordingly, and uses them to calculate predicted values. In this research, the ‘sigmoid’ kernel was employed in the support vector machine regression to effectively model nonlinear relationships within a particular dataset. The ‘sigmoid’ kernel facilitates linear separation in problems with complex data structures by mapping the data to a higher-dimensional space. This method enhances the model’s ability to learn the inherent patterns in the data, thereby improving its predictive performance.
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, 3, \ldots, n \tag{1}$$
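To make regression-based imputation concrete, the sketch below fits the one-predictor model of Equation (1) by ordinary least squares on the rows where the target is observed, then fills each gap with the fitted prediction. It is an illustrative simplification that assumes a single, fully observed predictor; the function names and sample values are hypothetical.

```python
def fit_simple_ols(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def impute_by_regression(predictor, target):
    """Fill None entries of `target` with predictions from `predictor`.

    Assumes `predictor` has no missing values (an assumption of this sketch).
    """
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    beta0, beta1 = fit_simple_ols([x for x, _ in pairs], [y for _, y in pairs])
    return [beta0 + beta1 * x if y is None else y
            for x, y in zip(predictor, target)]


# Hypothetical columns: total area (complete) and building area (one gap).
total_area = [1000.0, 2000.0, 3000.0, 4000.0]
building_area = [300.0, 500.0, None, 900.0]
print(impute_by_regression(total_area, building_area))
```

Swapping the fitted model for ridge, lasso, tree-based, or other regressors changes only the prediction step, which is why the imputed values differ by regression model.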
The decision tree model is a model that automatically discovers rules in the data through learning and creates tree-based classification rules. Random forest regression, as shown in Figure 2a, is an ensemble technique based on decision trees. It involves generating multiple decision trees and calculating the final predicted value by averaging their predicted values. In this research, the random state value was set to 0 for two regression models, thereby specifying the initial seed value for the random number generator within the algorithm. Fixing this value to a specific number ensures that consistent results are achieved whenever the same data and parameter settings are utilized, thus guaranteeing the reproducibility of the experiment.
Figure 2. (a) Random Forest Regression Model (b) K-Nearest Neighbors Regression Model.
K-nearest neighbors regression, as depicted in Figure 2b, finds the k nearest neighbors among the existing data when a new observation is given and predicts its value from the values of those neighbors, considering their attributes. In this study, the KNN regression model used for imputing missing values was set with five neighbors. The number of neighbors, K, directly affects the model’s complexity and performance. A lower K can make the model overly sensitive to noise, leading to a higher risk of overfitting, while a higher K can oversimplify the model, causing underfitting. Therefore, K = 5 was chosen because it was expected to operate effectively across various scenarios, balancing the capture of the data’s complexity against generalization.
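A minimal sketch of KNN-based imputation with K = 5 follows. It assumes that all non-target columns are observed and uses unscaled Euclidean distance; both are simplifications of this example, and the function name and data are hypothetical.

```python
import math


def knn_impute(rows, target_idx, k=5):
    """Fill None in column `target_idx` with the mean of the k nearest complete rows."""
    feature_idx = [j for j in range(len(rows[0])) if j != target_idx]
    complete = [r for r in rows if r[target_idx] is not None]

    def distance(a, b):
        # Euclidean distance over the feature columns only (no scaling applied).
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in feature_idx))

    filled = []
    for row in rows:
        if row[target_idx] is None:
            neighbors = sorted(complete, key=lambda r: distance(row, r))[:k]
            estimate = sum(r[target_idx] for r in neighbors) / len(neighbors)
            row = row[:target_idx] + [estimate] + row[target_idx + 1:]
        filled.append(row)
    return filled


# Hypothetical rows: [ground levels, total height in m]; the last height is missing.
rows = [[1, 4.0], [2, 8.0], [3, 12.0], [4, 16.0], [5, 20.0],
        [6, 24.0], [20, 80.0], [3, None]]
print(knn_impute(rows, target_idx=1, k=5)[-1])
```

In practice the features would usually be standardized first, since a column with large values (such as site area) would otherwise dominate the distance.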

3.2. Dataset for Machine Learning Training

Constructing a dataset for machine learning models is a crucial task. Additionally, when building the dataset, it is essential to choose influential factors for predicting construction costs.
This study aims to estimate the appropriate construction costs during the planning and design phase of the architectural design process. To estimate construction costs, it is necessary to define the factors influencing construction costs, and these factors are related to the constraints provided during the design phase. Influencing factors include total area, building area, site area, landscape area, basement, ground level, total height, typical floor height, and parking lot size. Such information can be collected based on data provided by the Public Procurement Service’s Korea Online E-Procurement System and Construction Cost Information Plaza.
Based on the selected influential factors, data on construction projects completed between 2018 and 2022 were collected from the Public Procurement Service’s Korea Online E-Procurement System and the Construction Cost Information Plaza. The data primarily consist of buildings constructed with reinforced concrete and steel frame structures, with the uses of these buildings primarily being educational facilities, government buildings, and other public structures. A total of 909 cases, including overall construction costs by facility type, were collected. Cases lacking data essential for predicting construction costs (total area, building area, ground level) were excluded, resulting in a total of 906 cases used in the dataset. The number of missing values for each influencing factor in the original dataset, before the data preprocessing stage, is shown in Table 1.
Table 1. Number of Missing Values by Influencing Factor.

3.3. Construction of Machine Learning Base Model for Cost Prediction

3.3.1. Base Model Configuration

The dataset used for training the machine learning model for construction cost prediction is based on 906 cases of collected actual construction cost data. The development environment is built on Visual Studio Code, using the Python language. The primary objective of this study is to propose methods for appropriately filling in missing values in actual construction cost data. Therefore, the conditions for all prediction models are set to be the same, except for variations in the missing value imputation methods. Moreover, out of the 906 case data, 70% are utilized as training data, while the remaining 30% are used as test data. During the machine learning process, the case data are randomly extracted for each training session to divide them into training and test data groups. When dealing with complex and noisy data such as construction cost prediction, the selection of an activation function that can reduce the model’s sensitivity to noise is crucial. For this reason, the ELU (Exponential Linear Unit) activation function was chosen in this study. Additionally, the configuration of node progression is a significant factor when processing large datasets or training complex models. Therefore, to ensure that the model can thoroughly learn the diverse characteristics of the data, a larger number of nodes were used in the initial layer. Subsequently, the number of nodes was reduced to 100 → 64 → 32 → 1 in the following layers to effectively summarize and compress the important information.
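The 70/30 random split and the ELU activation described above can be sketched as follows. This is an illustration only: the value `alpha = 1.0` and the seed are assumptions of this example, not settings reported in the study.

```python
import math
import random


def split_train_test(records, train_ratio=0.7, seed=None):
    """Shuffle a copy of `records` and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


# 906 placeholder case indices stand in for the actual cost records.
train, test = split_train_test(range(906), seed=0)
print(len(train), len(test))
print(elu(2.0), elu(-2.0))
```

For negative inputs ELU saturates smoothly towards `-alpha` instead of cutting off at zero, which is the property usually cited when it is preferred for noisy data.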

3.3.2. Model Configurations by Case

The basic model configuration conditions remain constant, with variations introduced solely in the missing value imputation methods. There are a total of 10 applied missing value imputation methods, comprising 3 single imputation methods and 7 multiple imputation methods based on regression models. In single imputation methods, median imputation, mode imputation, and mean imputation are applied. In multiple imputation methods, regression model-based imputations include linear regression imputation, support vector machine regression imputation, ridge regression imputation, random forest regression imputation, lasso regression imputation, decision tree regression imputation, and k-nearest neighbors regression imputation. The model conditions for each case are detailed in Table 2.
Table 2. Model Configurations.
The default settings for imputing missing values using multiple imputation methods are as follows. For the support vector machine regression imputation method, the kernel was set to ‘sigmoid’. In the case of ridge regression and lasso regression imputation methods, ‘random state’ was set to 0 and ‘alpha’ was set to 1.0. The random forest regression and decision tree regression imputation methods had their ‘random state’ set to 0, while for the KNN imputation method, the number of neighbors was set to 5.

3.3.3. Outlier Removal Based on Conditions and Z-Score

Filling missing values using single and multiple imputation methods for construction cost actual data can be effective in completing the dataset. However, depending on the method used, there is a possibility that incorrect data may be filled in, potentially leading to gaps not only in the analysis of the characteristics of influential factors in the constructed dataset but also in the correlation analysis between influencing factors. Therefore, to confirm the presence of gaps, it is necessary to check for outliers in the replaced data values.
After the missing value imputation process, conditions were defined to remove outliers based on actual construction cost data. The study established three conditions for identifying anomalies: first, when the building area is greater than the total area; second, when the landscape area is greater than the site area; and third, when the building area is greater than the site area. Any row meeting one of these three conditions is removed in its entirety. In the next step, outliers among the remaining data are identified and removed using the Z-Score outlier detection method. This method compares the Z-Score of each data point against a predefined threshold, ‘threshold = 1.96’; data points whose Z-Score exceeds this threshold are considered outliers and removed. Outliers were removed independently for each column of the dataset. Through this process, the number of outlier data points is measured for each missing value imputation method, and it is verified whether the imputation took into account the correlation between influential factors in construction cost prediction.
The dataset, processed through the missing value imputation and outlier removal processes, is utilized for training machine learning models for construction cost prediction.
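The two-stage outlier removal can be sketched as below. The dictionary keys are hypothetical column names, and the use of the absolute Z-Score with the population standard deviation is an assumption of this example; the study reports only the threshold of 1.96 and column-wise removal.

```python
from statistics import mean, pstdev


def remove_conditional_outliers(rows):
    """Drop rows that meet any of the three area conditions."""
    return [
        r for r in rows
        if not (r["building_area"] > r["total_area"]
                or r["landscape_area"] > r["site_area"]
                or r["building_area"] > r["site_area"])
    ]


def remove_zscore_outliers(rows, columns, threshold=1.96):
    """Drop rows whose |z| exceeds `threshold` in any of `columns`."""
    kept = list(rows)
    for col in columns:
        values = [r[col] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # constant column: nothing can be flagged
        kept = [r for r in kept if abs((r[col] - mu) / sigma) <= threshold]
    return kept


# Hypothetical records; the second one has building area > total area.
rows = [
    {"total_area": 5000.0, "building_area": 1200.0, "site_area": 3000.0, "landscape_area": 400.0},
    {"total_area": 800.0, "building_area": 950.0, "site_area": 2000.0, "landscape_area": 300.0},
]
cleaned = remove_conditional_outliers(rows)
cleaned = remove_zscore_outliers(cleaned, ["total_area", "building_area"])
print(len(rows), "->", len(cleaned))
```

Counting the rows before and after each stage, per imputation method, gives the kind of comparison the study uses to judge how well imputed values respect the relationships among the influencing factors.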
Figure 3 is a diagram illustrating the data preprocessing process that involves missing value imputation and outlier removal. Based on the processed data, a comparison of the performance of various machine learning models was conducted. In this manner, the study aimed to consider the correlation between influential factors in construction cost prediction and the performance of machine learning models, ultimately seeking to choose the optimal method for imputing missing values in actual construction cost data.
Figure 3. Changes in Data Based on Data Preprocessing.

4. Analysis of the Training Results

4.1. Differences in Data Composition by Missing Value Imputation Methods

The filled values vary depending on the method used to replace missing values in the actual construction cost data. The three single imputation methods applied in this study involve assigning the same constant value to all missing values. However, the seven multiple imputation methods, mentioned as regression model-based imputation methods, fill in missing values through an analysis of the constructed data patterns. The analysis of data patterns may vary depending on the type of regression model, leading to differences in the imputation of missing values. Therefore, while it is crucial to find an appropriate imputation method through performance comparison for effective model training, it is equally important to consider an imputation method that takes into account the characteristics of data patterns and influential factors.
In this study, to assess whether characteristics between data patterns and influential factors are considered, the changes in data quantity are examined through conditional outlier removal and Z-Score outlier removal methods. For each missing value imputation method, the total amount of data after outlier removal is compared, allowing for the assessment of the level of consideration for influential factor characteristics. A suitable missing value imputation method is then selected based on the judgment of an appropriate level beyond the baseline.
The first outlier removal method, conditional outlier removal, is based on the correlation between influential factors. It defines conditions that valid data cannot satisfy and treats rows satisfying them as outliers. Three conditions are applied: when the building area is greater than the total area, when the landscape area is greater than the site area, and when the building area is greater than the site area. If any of these conditions is met, the entire row of data is deleted. The second method, Z-Score outlier removal, is a general method that uses the mean and standard deviation to remove outliers. The changes in data quantity according to the outlier removal process are shown in Table 3.
Table 3. Changes in Data Based on Data Preprocessing.
Examining the pattern of changes in data quantity shows that, under conditional outlier removal, the number of removed outliers was relatively small when the median imputation method was applied among the single imputation methods. Rows are deleted under conditional outlier removal because, during imputation, missing values were replaced with values that do not reflect the relationships among total area, landscape area, building area, and site area. Data that violate the relationships between the factors influencing construction costs may present inaccurate patterns during model training, potentially reducing the performance of machine learning models. Therefore, in this study, outliers generated during missing value imputation are removed to enhance the training performance of construction cost prediction machine learning models.
The research results demonstrate that in the case of multiple imputation methods, linear regression imputation, support vector machine regression imputation, ridge regression imputation, and lasso regression imputation methods result in fewer instances being identified as outliers compared to other regression model imputation methods. In conclusion, the study shows that, for single imputation methods, the median imputation method, and for multiple imputation methods, linear regression imputation, support vector machine regression imputation, ridge regression imputation, and lasso regression imputation methods reflect the relationships between influential factors at a relatively high level compared to other imputation methods.

4.2. Training Results by Missing Value Imputation Method

The results of the performance analysis of the construction cost prediction machine learning models, using the single imputation methods applied to the basic model configuration (median imputation, mode imputation, and mean imputation), are presented in Figure 4, Figure 5 and Figure 6. In the figures, a “wavy line” is used to omit certain loss ranges, because the sharp decrease in ‘train_loss’ and ‘validation_loss’ would otherwise reduce the readability of the graphs.
Figure 4. Training Results of Median Imputation Method.
Figure 5. Training Results of Mode Imputation Method.
Figure 6. Training Results of Mean Imputation Method.
For single imputation methods, datasets were generated by applying median imputation, mode imputation, and mean imputation for machine learning training. The results of the training show that, for the dataset with median imputation, ‘train_loss’ and ‘validation_loss’ gradually converge to 0 as epochs (training iterations) progress. As observed from the results of three training sessions conducted through k-fold cross-validation, the gap between ‘train_loss’ and ‘validation_loss’ narrows as the number of epochs increases, and then stabilizes at a certain level. On the other hand, for the dataset with mode imputation, there is a convergence trend up to 200 epochs, but as the epochs progress further, the gap between ‘train_loss’ and ‘validation_loss’ gradually widens. For the dataset with mean imputation, the gap between ‘train_loss’ and ‘validation_loss’ tends to narrow as the number of epochs increases, and both gradually converge towards zero. However, the progression of the graph is not consistent, and noise persists throughout.
For machine learning training with the multiple imputation methods, datasets were created by applying linear regression imputation, support vector machine regression imputation, ridge regression imputation, random forest regression imputation, lasso regression imputation, decision tree regression imputation, and k-nearest neighbors regression imputation. The training results are presented in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13.
Figure 7. Training Results of Linear Regression Imputation Method.
Figure 8. Training Results of Support Vector Machine Regression Imputation Method.
Figure 9. Training Results of Ridge Regression Imputation Method.
Figure 10. Training Results of Random Forest Regression Imputation Method.
Figure 11. Training Results of Lasso Regression Imputation Method.
Figure 12. Training Results of Decision Tree Regression Imputation Method.
Figure 13. Training Results of K-Nearest Neighbors Regression Imputation Method.
In the case of multiple imputation methods, various regression model-based imputation methods were applied to create a dataset for machine learning training. When training was conducted using the dataset imputed through linear regression, the ‘train_loss’ showed a converging trend to 0 as epochs progressed, while the ‘validation_loss’ trended towards 0 but leveled off at a certain value.
For the dataset imputed through support vector machine regression, the overall trend converged to 0, but ‘validation_loss’ maintained a constant value. In the case of the dataset imputed through ridge regression, ‘train_loss’ and ‘validation_loss’ converged to 0, maintaining a constant gap as epochs progressed.
When training was conducted using the dataset imputed through ridge regression, both ‘train_loss’ and ‘validation_loss’ demonstrated a tendency to converge towards zero. However, even as the number of training iterations increased, they maintained a steady state at the point of convergence.
When training was performed using a dataset imputed with Random Forest, a significant amount of noise was observed throughout the training process, resulting in unstable graph patterns. Moreover, in some folds, a crossing pattern between ‘train_loss’ and ‘validation_loss’ was observed.
Training with the dataset imputed through Lasso regression revealed that both ‘train_loss’ and ‘validation_loss’ exhibited a convergence towards zero. However, beyond a certain level of training iterations, only ‘train_loss’ continued to converge towards zero.
Upon conducting training with the dataset imputed through Decision Tree Regression, the overall graph trended towards convergence at zero. However, it was observed that minor noise occurred within the graph.
Upon training with the dataset imputed via the KNN regression approach, a notable reduction in graph noise was observed compared to other methods. Additionally, as the number of epochs increased, a trend towards zero was evident for both ‘train_loss’ and ‘validation_loss’. However, during this process, instances of ‘train_loss’ and ‘validation_loss’ intersecting were also encountered.
Unlike single imputation methods for missing values, the use of multiple imputation methods can result in situations where ‘validation_loss’ is lower than ‘train_loss’. This phenomenon can occur during the application of k-fold cross-validation, where the machine learning model undergoes training three times. The composition of training and test data, randomly extracted in each training session, can vary, leading to the possibility of lower ‘validation_loss’ in certain folds. This is because the randomly selected data in some folds may contain patterns that are easier to predict.
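The fold-to-fold variation described above follows from how k-fold cross-validation partitions the data. The sketch below shows a three-fold partition of 906 sample indices; the helper name and seed are hypothetical, and the actual splitting procedure used in the study may differ.

```python
import random


def k_fold_indices(n_samples, k=3, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(906, k=3), start=1):
    print(f"fold {fold}: train={len(train_idx)}, validation={len(val_idx)}")
```

Because each fold validates on a different random subset, a fold whose validation subset happens to be easier to predict can show a ‘validation_loss’ below its ‘train_loss’.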

4.3. Optimal Method Selection Based on Machine Learning Training Performance

To identify the optimal missing value imputation method for building the dataset of a construction cost prediction model, two aspects were assessed: how well each method preserves the relationships between influential factors, and the learning performance of the machine learning model trained on the imputed dataset. In terms of preserving correlations between influential factors, median imputation and mode imputation scored relatively high among single imputation methods. Among multiple imputation methods, linear regression imputation, support vector machine regression imputation, ridge regression imputation, and lasso regression imputation scored relatively high.
Figure 14 shows the relative training performance of the construction cost prediction model for each missing value imputation method. Among single imputation methods, median imputation has a relatively low average error rate of 16.91%. Among multiple imputation methods, lasso regression has the lowest average error rate at 15.54%, followed by decision tree regression and random forest regression.
Figure 14. Comparison of Machine Learning Training Performance (MAPE).
Therefore, considering both the level of reflection of relationships between influential factors and the training performance of the construction cost prediction machine learning model, lasso regression imputation is deemed the most appropriate for filling in missing values in actual construction cost data.
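For reference, the error rate compared in Figure 14 is the mean absolute percentage error (MAPE). The sketch below shows the standard definition only; the cost values are hypothetical and are not taken from the study's dataset, and how the study averaged the metric across folds is not restated here.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Each term is the absolute error relative to the actual value.
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical actual and predicted construction costs (arbitrary units).
actual_costs = [100.0, 200.0, 400.0]
predicted_costs = [110.0, 180.0, 400.0]
error_rate = mape(actual_costs, predicted_costs)
print(f"MAPE = {error_rate:.2f}%")
```

Because each error is divided by the actual cost, a given absolute error weighs more heavily on small projects than on large ones.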

5. Conclusions

Data preprocessing is essential for improving the predictive performance of a machine learning model. When the model is built on large-scale data, the replacement of missing values is a particularly important part of preprocessing. Missing values can be replaced in various ways. In this study, three representative single imputation methods and seven regression model-based multiple imputation methods were applied to replace missing values in actual construction cost data.
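The difference between the two families of methods can be sketched on a toy example. This is an illustration under stated assumptions, not the study's implementation: the column names and values are hypothetical, only one predictor is used, and the lasso fit uses the closed-form single-predictor solution with an arbitrarily chosen penalty `alpha`.

```python
from statistics import mean, median


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the effect of the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_1d(x, y, alpha):
    """Fit y ~ intercept + slope * x with an L1 penalty on the slope.

    Minimises (1 / (2n)) * sum((y - intercept - slope * x)^2) + alpha * |slope|,
    which has a closed-form solution when there is a single predictor.
    """
    x_mean, y_mean = mean(x), mean(y)
    n = len(x)
    covariance = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / n
    variance = sum((a - x_mean) ** 2 for a in x) / n
    slope = soft_threshold(covariance, alpha) / variance
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical records (arbitrary units); None marks a missing value.
floor_area = [1.0, 2.0, 3.0, 4.0, 5.0]
floors = [2.0, 4.0, 6.0, 8.0, None]

observed = [(a, f) for a, f in zip(floor_area, floors) if f is not None]
obs_area = [a for a, _ in observed]
obs_floors = [f for _, f in observed]

# Single imputation: every missing entry receives the same median value.
median_fill = median(obs_floors)
median_imputed = [median_fill if f is None else f for f in floors]

# Regression-based imputation: predict each missing entry from the other column.
intercept, slope = fit_lasso_1d(obs_area, obs_floors, alpha=0.25)
lasso_imputed = [
    intercept + slope * a if f is None else f
    for a, f in zip(floor_area, floors)
]

print("median imputation:", median_imputed)
print("lasso imputation: ", lasso_imputed)
```

Here the median fill ignores the other column entirely, whereas the regression-based fill follows the trend between the two columns, slightly flattened by the penalty.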
Two criteria were used to evaluate the suitability of the datasets generated by each method. Under the first criterion, how well the correlations between influential factors are preserved, median imputation, mode imputation, linear regression imputation, support vector machine regression imputation, ridge regression imputation, and lasso regression imputation were deemed appropriate. Under the second criterion, the learning performance of the construction cost prediction machine learning model, median imputation among single imputation methods and lasso regression imputation among multiple imputation methods were considered suitable because of their low average error rates. In summary, replacing missing values with the lasso regression method was concluded to be appropriate, as it both preserves the correlations between influential factors and improves the learning performance of the machine learning model for construction cost prediction.
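One way to quantify the first criterion is to compare the correlation between two influential factors before and after imputation. The sketch below assumes the Pearson correlation coefficient as the measure and uses invented values for two hypothetical factors; the study's exact procedure may differ.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x_mean, y_mean = mean(x), mean(y)
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    var_x = sum((a - x_mean) ** 2 for a in x)
    var_y = sum((b - y_mean) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical influential factors; factor_b is perfectly correlated with factor_a.
factor_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
factor_b_full = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# The same column with its first and last values missing.
factor_b_missing = [None, 4.0, 6.0, 8.0, 10.0, None]

# Median imputation fills both gaps with the median of the observed values.
fill = median([v for v in factor_b_missing if v is not None])
factor_b_imputed = [fill if v is None else v for v in factor_b_missing]

r_before = pearson(factor_a, factor_b_full)
r_after = pearson(factor_a, factor_b_imputed)
print(f"correlation with complete data:  {r_before:.3f}")
print(f"correlation after median filling: {r_after:.3f}")
```

The closer the correlation after imputation stays to the value for complete data, the better the method preserves the relationship between the two factors.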
Under the second criterion, median imputation, despite being a single imputation method, yields relatively low error rates compared with most multiple imputation methods. However, the position of the median depends on the distribution of the data and on the range between its maximum and minimum values, so the learning performance may vary with the data distribution. In particular, when larger or smaller construction projects are added to the actual construction cost data, the median can shift as additional data are collected, which makes single imputation methods difficult to apply reliably. Considering the two evaluation criteria and the potential expansion of the data range, lasso regression imputation is deemed the most appropriate of the methods examined.
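The sensitivity of the median to newly collected projects can be seen in a small example with hypothetical cost values (arbitrary units):

```python
from statistics import median

# Hypothetical costs of the projects collected so far.
costs = [10.0, 12.0, 14.0, 16.0, 18.0]
median_before = median(costs)

# Two much larger projects are added to the dataset later.
larger_projects = [60.0, 80.0]
median_after = median(costs + larger_projects)

print(f"median before: {median_before}, median after: {median_after}")
```

Every missing value that was filled with the earlier median would now be filled with a different value, even though the records containing the missing values have not changed.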
Construction costs can vary significantly depending on the conditions in each region or country, and predicting construction costs in a given area requires access to that region’s data. Securing sufficient data for construction cost prediction in each region is challenging, and even when data are obtained, some items may be impossible to collect. For these reasons, finding appropriate ways to handle missing data is essential for enhancing the usability of the collected construction cost data. The significance of this research lies in showing how missing values can be addressed in construction cost prediction, where data acquisition is difficult, and how prediction performance changes depending on the method used to handle missing data.
To improve the accuracy of construction cost prediction, it is necessary to secure a sufficient amount of construction cost data. Until enough data can be gathered, however, it is necessary to find ways to fill in missing data as well as possible, as in this study, in order to enhance the usability of the data that have been collected. Missing values in actual construction cost data often share similar data patterns, so an appropriate regression model, as demonstrated in this study, can be expected to replace missing values to some extent. However, there are many cases in which construction costs show outliers that deviate from general patterns, indicating that such models alone may have limitations in completely substituting for missing data. This research serves as an initial step in the development of machine learning-based construction cost prediction, seeking the optimal method for replacing missing data based on general models. Future research will need to focus on developing learning models that can be applied to the various conditions and cases encountered in construction cost prediction, and on ways to replace missing data under specific conditions.

Author Contributions

H.L. conceived the experiments, analyzed the data, and wrote the paper; S.Y. supervised the research. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in 2024 by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2021-KA163269).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The cost data used in this research were collected from a public institution in Korea and are unavailable due to privacy restrictions.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this study.

References

  1. Kim, S. A Study on the Prediction of Construction Costs for Apartment Buildings and Characteristics of Factors Affecting Cost Variations. Master’s Thesis, Yonsei University Graduate School of Engineering, Seoul, Republic of Korea, 2003.
  2. Seo, Y. A Study on Construction Cost Prediction Models in the Planning and Design Stages through Analysis of Actual Construction Costs. Master’s Thesis, Korea University Graduate School of Engineering, Seoul, Republic of Korea, 2009.
  3. Lee, S.; Jeon, S. An Empirical Comparison of Imputation Methods for Missing Data. KIIS Spring Conf. 2009, 19, 250–251.
  4. Lee, S. Performance Comparison of Imputation Algorithms for Missing Data. Master’s Thesis, Graduate School of Korea University, Seoul, Republic of Korea, 2020.
  5. Kim, Y. Comparison of Imputation Methods for Missing Data in Meteorological Data. Master’s Thesis, Graduate School of Inha University, Incheon, Republic of Korea, 2020.
  6. Korea Online E-Procurement System. Available online: https://www.g2b.go.kr/index.jsp (accessed on 1 March 2023).
  7. The Construction Cost Information Plaza. Available online: https://pcae.g2b.go.kr:8044/pbs/psa/psa0000/index.do (accessed on 1 March 2023).
  8. Hong, E. A Study on Construction Cost Prediction of Zero-Energy Buildings Based on Artificial Neural Networks Using CTGAN. Master’s Thesis, Graduate School of Ewha Womans University, Seoul, Republic of Korea, 2022.
  9. Kim, G.; Jo, H.; Ahn, S. A Study on the Comparison of Prediction Accuracy in Construction Cost Prediction Models Using Neural Networks and Genetic Algorithms. J. Archit. Inst. Korea 2006, 22, 111–118.
  10. Son, J.; Kim, C. A Study on the Development of Artificial Neural Network Models for Predicting Construction Costs in the Conceptual Phase of Educational Facilities. Korean J. Constr. Eng. Manag. 2006, 7, 91–99.
  11. El-Sawalhi, N.I. Support Vector Machine Cost Estimation Model for Road Projects. J. Civ. Eng. Archit. 2015, 9, 1115–1125.
  12. Hyari, K.H.; Al-Daraiseh, A.; El-Mashaleh, M. Conceptual Cost Estimation Model for Engineering Services in Public Construction Projects. J. Manag. Eng. 2015, 32, 1115–1125.
  13. Wang, R.; Asghari, V.; Cheung, C.M.; Hsu, S.C.; Lee, C.J. Assessing Effects of Economic Factors on Construction Cost Estimation Using Deep Neural Networks. Autom. Constr. 2022, 134, 104080.
  14. Elhag, T.M.S.; Boussabaine, A.H. An Artificial Neural System for Cost Estimation of Construction Projects. In Proceedings of the 14th Annual ARCOM Conference, Reading, UK, 9–11 September 1998; Volume 1, pp. 219–226.
  15. Al-Tawal, D.R.; Arafah, M.; Sweis, G.J. A Model Utilizing the Artificial Neural Network in Cost Estimation of Construction Projects in Jordan. Eng. Constr. Archit. Manag. 2021, 28, 2466–2488.
  16. Kovačević, M.; Ivanišević, N.; Petronijević, P.; Despotović, V. Construction cost estimation of reinforced and prestressed concrete bridges using machine learning. GRAĐEVINAR 2021, 73, 1–13.
  17. Park, J. Development and Comparison Study of ANN and MLR Models for Predicting Construction Budgets. Master’s Thesis, Sejong University Graduate School, Seoul, Republic of Korea, 2023.
