1. Introduction
The construction industry consistently ranks among the most hazardous sectors worldwide [1]. In Korea, construction recorded the highest fatality rate of any industry in 2022, accounting for 24.25% of all work-related deaths [
2]. Similar trends have also been observed in other developed countries, such as the United States, the United Kingdom, Australia, and Singapore [
3,
4]. Since the mid-1990s, developing preventive measures that account for the magnitude and characteristics of construction accidents, as well as their associated costs, has remained one of the most challenging areas in construction research [
5] despite continuous efforts to enhance on-site safety protocols. A key reason for this challenge is the lack of a standardized system for classifying cost components and collecting comprehensive accident data, an undertaking that requires substantial time and coordination among multiple stakeholders.
In this context, Hinze [
6] underscored the importance of approaching construction safety from an economic perspective, advocating for accurate cost assessments to encourage greater investment in preventive measures by stakeholders. Pellicer et al. [
7] proposed that a predictive tool for accident and prevention costs could optimize safety measures within budget constraints prior to project initiation. From the government’s perspective, such a tool can help in formulating occupational safety and health policies, assessing the adequacy of budgets, and preparing detailed plans. From a corporate standpoint, it facilitates the formulation of more specific accident prevention plans, the introduction of industrial safety and health technologies and internal policies, the scientific planning and budgeting of project safety and health costs, and the measurement of declines in project profitability due to accidents. Finally, from an academic perspective, it enables the economic evaluation of health and safety technologies and of the formulation and revision of construction safety standards, as well as the statistical analysis of socio-economic losses caused by accidents.
Heinrich [
8] pioneered the identification of hidden costs associated with occupational accidents, and inspired researchers to categorize them into two main groups [
9,
10]. Direct costs, such as compensation to victims and medical costs, are easily identifiable and are usually covered by insurance policies to protect contractors from liability claims [
11,
12]. In contrast, indirect costs are not directly attributable to accidents but account for a significant proportion of total accident costs. They encompass legal fees, productivity losses, labor replacement, and accident investigation expenses [
13].
Direct costs are typically easier to manage, whereas tracking indirect costs, such as lost productivity and legal claims, requires a time-consuming and complex process. Consequently, indirect costs are often excluded from safety accident databases and related reports [
10]. To simplify estimation, previous studies have frequently applied fixed ratios to direct costs [
14,
15,
16], with Heinrich’s 1:4 ratio widely adopted due to its simplicity [
15]. However, this ratio varies depending on the type, severity, and intensity of the accident, making ratio-based estimates inadequate for capturing the true indirect accident costs [
17,
18].
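The limitation of a fixed ratio can be illustrated with a minimal sketch. It assumes the 1:4 direct-to-indirect ratio cited above; the function name and the monetary figures are hypothetical and are not taken from the survey data.

```python
def ratio_based_indirect_cost(direct_cost, ratio=4.0):
    """Estimate indirect cost as a fixed multiple of direct cost (1:4 by default)."""
    if direct_cost < 0:
        raise ValueError("direct_cost must be non-negative")
    return ratio * direct_cost


# Two hypothetical accidents with equal direct costs but very different severity.
minor_injury = {"direct_cost": 10_000.0, "severity": "minor"}
fatality = {"direct_cost": 10_000.0, "severity": "fatal"}

# The fixed ratio ignores severity, so both accidents receive the same estimate.
estimates = [
    ratio_based_indirect_cost(accident["direct_cost"])
    for accident in (minor_injury, fatality)
]
```

Because the estimate depends only on the direct cost, accident type and severity cannot influence it, which is the shortcoming noted above.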
Accurate estimation of accident costs in construction is essential for quantifying risk levels and supporting preventive strategies. However, the inability of ratio-based methods to reflect variations across different accident scenarios presents a significant limitation. These shortcomings underscore the need for data-driven approaches, such as data mining and machine learning, that can deliver more accurate, work-specific risk assessments while accounting for factors such as accident type, injury severity, and other contributing variables.
Recent studies have increasingly applied machine learning to enhance construction cost and accident outcome prediction, highlighting both methodological advances and policy implications. For example, Turkyilmaz and Polat [
19] explored six machine learning classification algorithms to predict cost-overrun ratio classes using project risk indicators, enabling proactive cost control and resource planning throughout project execution. Chen et al. [
20] introduced a comprehensive and transparent framework for construction cost prediction using advanced machine learning integrated with explainable AI (XAI) and uncertainty quantification, highlighting the potential of machine learning models to revolutionize construction cost estimation. Similarly, Jafary et al. [
21] proposed an AI-augmented construction cost estimation framework that leverages ensemble Natural Language Processing (NLP) to align quantity take-offs (QTOs) with construction cost indexes (CCIs). The model automates the traditionally manual, error-prone task of mapping free-text QTO descriptions to standardized cost items, thereby improving accuracy, transparency, and efficiency in cost estimation. Collectively, these studies underscore the growing policy relevance of interpretable machine learning frameworks for improving economic forecasting, optimizing safety investments, and guiding strategic construction management decisions.
The objective of this study was to bridge this gap by leveraging data mining techniques to develop predictive models that provide more accurate estimates of indirect accident costs. To achieve this objective, we developed a two-tiered machine learning framework to forecast the indirect costs of accidents at construction sites. The framework used a dataset compiled through a survey of the top 20 contractors in Korea, ranked by their construction project evaluation amounts, conducted over approximately two years. These companies were selected because they operate dedicated departments that systematically collect and manage data on construction safety incidents.
The proposed framework combines a classification and quantification approach using machine learning algorithms to improve the accuracy of the estimation results. Unlike static ratios, machine learning models can accommodate variability across projects, accident types, and severity levels based on direct costs. Additionally, machine learning-based predictive models enable immediate indirect cost estimation, which is critical for rapid decision-making in safety management. The results of this study can help construction stakeholders to estimate indirect accident costs, thus evaluating the financial effect on different trades and tasks and facilitating the appropriate allocation of safety budgets and measures.
  3. Methodology
The research methodology consists of two stages: the creation of the construction accident database and the implementation of data mining algorithms, as shown in Figure 1. In the first stage, a comprehensive accident cost questionnaire covering companies, employees, accidents, injuries, and accident costs was formulated and distributed to major contractors to collect accident cost records. In the second stage, data mining techniques were applied to the database to develop the best indirect cost prediction model for a construction project. We used a two-tiered machine learning framework to estimate the indirect costs arising from accidents on construction sites by leveraging the compiled dataset.
The first tier classifies the total accident cost to determine the total accident cost category (i.e., the total accident cost type in 
Figure 1), and the second tier uses this classification information as an additional input variable, together with the direct cost and the accident variables, to quantify the indirect cost, as indicated in 
Figure 1. Direct costs here refer to insurance costs, which are assumed to be identifiable at the time of the accident. This section describes the questionnaire design process and the types of accident-related information collected by the questionnaire; the model development process is discussed in the next section.
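The data flow between the two tiers can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: a plain k-nearest-neighbour vote stands in for the first-tier classifier, a k-nearest-neighbour average stands in for the second-tier regressor, and the category names, feature layout, and all numbers are hypothetical toy values assumed to be already scaled.

```python
from collections import Counter
from math import dist

CATEGORY_CODES = {"low": 0, "high": 1}  # hypothetical total-cost categories


def nearest_indices(train_X, x, k):
    """Indices of the k training rows closest to x (Euclidean distance)."""
    return sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]


def predict_cost_category(train_X, train_categories, x, k=3):
    """Tier 1: classify the total accident cost category by majority vote."""
    votes = Counter(train_categories[i] for i in nearest_indices(train_X, x, k))
    return votes.most_common(1)[0][0]


def predict_indirect_cost(train_X, train_categories, train_indirect, x, k=3):
    """Tier 2: estimate indirect cost with the tier-1 category as an extra input."""
    category = predict_cost_category(train_X, train_categories, x, k)
    # Append the category code to every training row and to the query row.
    augmented_train = [
        row + [CATEGORY_CODES[c]] for row, c in zip(train_X, train_categories)
    ]
    augmented_x = list(x) + [CATEGORY_CODES[category]]
    neighbours = nearest_indices(augmented_train, augmented_x, k)
    estimate = sum(train_indirect[i] for i in neighbours) / len(neighbours)
    return category, estimate


# Toy records: [scaled direct cost, number of injured workers] (hypothetical).
train_X = [[0.1, 1], [0.2, 1], [0.3, 1], [2.0, 1], [2.5, 2], [3.0, 1]]
train_categories = ["low", "low", "low", "high", "high", "high"]
train_indirect = [0.2, 0.3, 0.5, 6.0, 8.0, 9.0]

low_case = predict_indirect_cost(train_X, train_categories, train_indirect, [0.25, 1])
high_case = predict_indirect_cost(train_X, train_categories, train_indirect, [2.6, 2])
```

The point of the sketch is the ordering: the category is predicted first from information available at the time of the accident, and only then enters the regression as an additional feature alongside the direct cost.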
  3.1. Accident Cost Questionnaire Design
We used industry-standard accident survey tables and previous research frameworks to design a customized questionnaire for assessing accident costs in the construction industry. The questionnaire consists of five parts. Part 1 collects basic project information, including the type of construction project, company name, construction cost, and duration. While company names are primarily used to identify survey participants, project type, cost, and duration provide insights into accident characteristics in different construction environments and thus influence the resulting indirect costs [
24]. Part 2 records the specific accident details for each project, such as the accident year, date, and day of the week, along with the project completion rate and the work operation in which the victim was engaged at the time of the incident. The year and date are fundamental accident data, while the day of the week helps identify possible correlations with increased accident rates, such as fatigue at the end of the week owing to consecutive days of labor-intensive tasks [
36]. Certain work processes, such as working at heights, are also highlighted for their high risk of serious accidents, resulting in substantial indirect costs [
37].
Part 3 provides detailed information on the employment of the individuals concerned, which includes variables such as age, occupation, organizational affiliation, length of service, and average wage. Studies have shown that older workers are often victims of fatal accidents [
36], and that inexperienced workers are more prone to accidents [
38]. The involvement of subcontractors, which increases uninsured costs, has also been reported [
24]. The average wage reflects the skill level of workers, which is critical to understanding safety awareness in the workplace [
38]. Part 4 includes information on injuries, such as the type of accident, details of injuries, and fatality information, which influence cost outcomes. For example, accidents such as falls or hits lead to more days of lost productivity, which drives up indirect costs [
39].
In Part 5, the financial impact of the safety incidents recorded in Parts 1 to 4 is described in detail and divided into direct and indirect costs. These cost items were identified through a literature review and consultations with accounting teams from leading contractors and construction firms in Korea. Direct costs include workers’ compensation insurance, temporary/lifelong disability insurance, medical expenses (hospitalization, treatment, medication, nursing, and caregiver costs), bereavement and funeral expenses, and property/material insurance.
Indirect costs refer to consequential expenses associated with construction accidents that are not directly covered by insurance. Specifically, the following cost categories are considered, as outlined in the studies by Jallon et al. [
10] and Haupt and Pillay [
15]: (1) compensation to victims for emotional distress and temporary or lifelong unemployment; (2) material loss costs; (3) hospital transport arrangements; (4) productivity losses, including work stoppages and reduced output due to injured workers performing only light-duty tasks; (5) potential loss of expertise and experience; (6) delay penalties and the cost of extending construction timelines; (7) uncovered medical expenses, such as diagnostic tests, prosthetics, and rehabilitation aids; (8) administrative investigation and reporting costs; (9) expenses for replacement workers, including recruitment and training; (10) equipment repair and damage costs; (11) third-party legal claims; and (12) reputational damage due to negative publicity.
Due to limitations in data collection, several indirect cost components could not be recorded during the survey period. The excluded items are: (1) lifetime unemployment benefits, (2) costs for training replacement personnel, (3) damage to the company’s reputation, (4) reduced productivity from injured workers performing light-duty tasks, (5) loss of expertise and experience, and (6) third-party legal claims. As these components were not trackable through available data sources, they were excluded from the final analysis. All other cost items listed in the previous section were included in the questionnaire. Each selected cost item was verified through face-to-face consultations with expert personnel, including construction safety managers and medical consultants, and was cross-validated with insurance companies to ensure data accuracy and transparency. For analysis purposes, direct and indirect costs were first calculated separately and then aggregated to determine the total accident cost.
For indirect costs, which are affected by inflation and evolving economic conditions over time, the Consumer Price Index (CPI) was applied to adjust all reported values to current monetary terms. The CPI served as a benchmark to standardize historical cost data, ensuring consistency with present-day economic conditions. This adjustment enhanced the accuracy and comparability of the indirect cost estimates, allowing for a more realistic and meaningful analysis aligned with contemporary financial standards.
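A CPI adjustment of this kind is commonly computed by scaling each reported amount by the ratio of the reference-year index to the index of the year in which the cost was incurred. The sketch below assumes that convention; the index values, the reference year, and the function name are placeholders and are not the figures used in the study.

```python
def adjust_to_reference_year(cost, cost_year, cpi_by_year, reference_year):
    """Convert a historical cost to reference-year money using CPI ratios."""
    if cost_year not in cpi_by_year or reference_year not in cpi_by_year:
        raise KeyError("CPI value missing for the cost year or reference year")
    return cost * cpi_by_year[reference_year] / cpi_by_year[cost_year]


# Placeholder index values for illustration only (not official statistics).
cpi_by_year = {2011: 90.0, 2016: 96.0, 2022: 108.0}

# An indirect cost of 1,000,000 reported for 2011, expressed in 2022 terms.
adjusted = adjust_to_reference_year(1_000_000.0, 2011, cpi_by_year, 2022)
```

With these placeholder indices the 2011 amount is scaled by 108/90, so costs from different accident years become comparable on a single monetary basis.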
  3.2. Data Collection Process and Survey Overview
The survey took approximately one year and ten months (from July 2020 to April 2022) because accident cost data were disorganized and rarely examined or collected by construction companies. A further difficulty was that the survey focused on indirect cost items, which required a significant amount of follow-up time, up to two years depending on the severity of the accident. To collect accident data, the researchers collaborated with the Ministry of Land, Infrastructure and Transport (MLIT) and the Korean Authority of Land and Infrastructure Safety (KALIS). An official letter with a detailed questionnaire was sent to the top 100 construction companies in Korea, ranked by annual construction capability. Twenty contractors and four public companies participated in this survey, including well-known large-scale firms typically ranked within the top 30, despite slight annual variations. These companies, with extensive experience in the construction industry, encompass both the building and civil engineering sectors. Supported by established safety management departments and robust regulations, safety and construction managers from these firms analyzed and provided accident cost data for completed or ongoing projects. To ensure consistency and minimize variations between surveyed organizations, a pro forma questionnaire was used to calculate the cost data automatically in advance. Insurance-related costs were calculated using the Employment and Workers’ Compensation Insurance Total Service [
40]. In total, 1036 accident records, including associated costs, were collected from accidents that occurred over 11 years, from 2011 to 2022.
Of these cases, 918 were non-fatal, whereas the remainder were fatal. Of the 1036 cases, only 912 included direct and indirect costs that could be used for data analysis, of which 57 were fatalities and 855 were injuries. 
Table 2 summarizes the detailed information collected from the survey. However, as some incidents occurred more than ten years ago, certain variable information could not be collected. Therefore, based on availability and relevance to accident events, 909 cases and 14 attributes (including ten categorical and four numerical) were selected for the development of the indirect cost prediction model. Each of the categorical variables and their respective categorical elements are described in the 
Supplementary Materials section.
  6. Conclusions
Accidents and their associated costs not only harm the well-being of those affected but also jeopardize project success, company profits, and public reputation. Therefore, instead of relying on uniform ratios for direct and indirect accident costs, integrating predictive analytics is crucial. This approach enables policymakers and safety managers to accurately estimate costs, prioritize high-risk areas, allocate resources effectively, and implement proactive preventive measures.
In this context, we propose a two-tiered indirect cost prediction model that uses multiple classifiers and regressors, based on a collection of 1036 accident cases. In the first tier, the k-NN classifier with ROS achieved over 90% accuracy, precision, recall, F1-score, and cross-validation score. In the second tier, the GB regressor outperformed the others, with an R² of 0.95, a training MAE of 0.1, and a training MSE of 0.21. By combining these best-performing models, a final two-tiered predictive model was developed for estimating indirect accident costs. This approach outperforms conventional statistical regression models and ratio-based estimation and effectively captures the complex nonlinear relationships between the factors contributing to construction accidents and their costs.
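For readers less familiar with the regression metrics quoted above, the following sketch computes MAE, MSE, and R² from their standard definitions. The function name and the example values are hypothetical and unrelated to the study's results.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and R^2 for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mse = sum(e * e for e in errors) / n       # mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 = 1 - SS_res / SS_tot; undefined when all observed values are equal.
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "r2": r2}


# Hypothetical scaled indirect costs: three exact predictions and one miss.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

Note that MAE and MSE are expressed in the units of the target (here, scaled values), so they are not directly comparable with errors measured on the original monetary scale.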
The proposed methodology is applicable in real time and offers an immediate and pragmatic approach. When an accident occurs on a construction site, the direct costs can be estimated almost immediately based on the medical expenses and immediate treatment of the injured. These direct cost data, available shortly after an accident, can therefore be used in the proposed model together with other relevant variables, such as the number of fatalities, the number of injuries, the type of construction and accident, the specifics of the work process, the location of the injury, the type of workplace, and the workers’ affiliation, to first categorize the total cost. This initial categorization provides an approximate range of total accident costs. The second-tier regression model then determines the indirect costs more precisely: it leverages the predicted total cost category from the first tier, in conjunction with other known variables, such as the number of fatalities, number of injuries, direct costs, and the type of direct cost, to forecast the indirect costs in real time.
This study has made several important contributions. It collected a considerable amount of data on the indirect costs of construction accidents for in-depth analysis and insight generation. It also introduced an innovative method for quantifying the indirect costs of construction accidents. The model is adaptable to incremental learning and allows for regular updates based on new data. A national accident database can be created on a countrywide scale, and the proposed model can be integrated with it to ensure continuous improvement. Accident cost is the most accurate indicator for risk assessment, and such integration can also significantly assist in risk quantification for specific construction project types based on existing records. Ultimately, the entire system can be utilized to formulate safety standards and policies from government, corporate, and academic perspectives. This study also addressed data imbalance, a notable problem in the construction industry, and demonstrated improved generalization of the classifier through the implementation of ROS.
Despite careful efforts, this study has several limitations. Notably, it does not account for certain difficult-to-track cost items, potentially leading to indirect cost estimates that underrepresent actual expenses. Another limitation is the limited dataset size, due in part to the common practice among contractors of not systematically tracking indirect cost data. These factors suggest that some indirect costs and variability may not be fully captured in the model. Data preprocessing decisions also introduced constraints. In this analysis, no outlier detection or missing-data imputation was performed, under the assumption that survey responses were internally consistent and complete. However, given the self-reported nature of the financial data, such assumptions may have introduced unintended bias. To improve data integrity and predictive performance, future research should incorporate systematic preprocessing procedures, including robust outlier detection methods and appropriate handling of missing values. Although a low MSE was achieved on the scaled values, some deviations emerged when transforming the target and output values back to their original scale. Oversampling and undersampling were employed to create a more balanced training set; however, these techniques can alter the statistical distribution of the data. In some cases, resampling may lead to overfitting or the loss of important information, thereby affecting model robustness. Moreover, this study did not include a sensitivity analysis to examine how different random seeds or sampling proportions might influence model performance. To enhance robustness and generalizability, future research should investigate the model’s stability under varying conditions. 
For example, techniques such as repeated trials with different random splits or alternative balancing methods like SMOTE, ADASYN or stratified bootstrapping could be used to ensure the model’s performance is not an artifact of a particular sample or random seed.
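The seed-sensitivity check suggested here can be outlined with a minimal sketch of random oversampling, which is how ROS is read in this illustration. The function, the class labels, and the record identifiers are hypothetical; a full check would refit and score the classifier for each seed rather than only inspecting class counts.

```python
import random
from collections import Counter


def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen minority records until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    target = max(len(group) for group in by_class.values())
    out_records, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement from the class to fill the gap to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for record in group + extra:
            out_records.append(record)
            out_labels.append(label)
    return out_records, out_labels


# Hypothetical imbalanced training set: nine injury records and one fatality.
records = list(range(10))
labels = ["injury"] * 9 + ["fatal"]

balanced_records, balanced_labels = random_oversample(records, labels, seed=0)

# Repeat the resampling with several seeds; in a full study the model would be
# retrained and evaluated inside this loop to measure performance variation.
counts_by_seed = {
    seed: Counter(random_oversample(records, labels, seed=seed)[1])
    for seed in range(5)
}
```

Resampling of this kind is normally applied to the training split only, so that duplicated minority records do not leak into the data used for evaluation.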
A further limitation is the lack of external validation. All data used for training and testing originated from this single study, so the model’s effectiveness has not been confirmed on independent accident cost datasets. Therefore, future work should prioritize validating the model on an independent dataset of construction accident records to confirm its predictive accuracy and practical utility beyond the original sample. Additionally, visual examples of construction accidents with higher indirect costs could serve as useful illustrations to better convey the extent and impact of such incidents, and their inclusion is recommended in future research.
Finally, the study’s comparative scope was limited to conventional ratio-based estimates. However, multiple other cost-estimation approaches exist in the literature, including parametric models, expert judgment-based methods, capacity factor models, and frameworks based on Bayesian or fuzzy logic, as well as methods that incorporate dimensionality reduction techniques. Future iterations of this research should incorporate more comprehensive comparisons with alternative cost estimation techniques to better contextualize the proposed model within the broader landscape of estimation methodologies.
Additionally, future studies should consider incorporating hierarchical organizational structures to capture inter-company variations in safety culture, management practices, and decision-making processes that may significantly influence accident outcomes and associated costs. For instance, companies with centralized versus decentralized safety oversight may exhibit differing patterns in risk mitigation and cost recovery. Likewise, integrating contract-specific information, such as project delivery methods (e.g., design-build vs. traditional), penalty clauses for delays, insurance coverage, or subcontractor involvement, could provide deeper insights into the financial repercussions of accidents. Such enriched modeling frameworks would better position the proposed methodology within the broader landscape of cost estimation research and clearly articulate its novel contributions.