Article

A Machine Learning Approach for Factor Analysis and Scenario-Based Prediction of Construction Accidents

1
Department of Civil Engineering, Chungnam National University (CNU), Engineering Hall #2, 99 DaeHakRo, Yuseong-gu, Daejeon 34134, Republic of Korea
2
Ninetynine Co., Ltd., Heeseong Plaza #312, 370 Wolgye-ro, Nowon-gu, Seoul 01905, Republic of Korea
*
Author to whom correspondence should be addressed.
Buildings 2025, 15(23), 4343; https://doi.org/10.3390/buildings15234343
Submission received: 29 October 2025 / Revised: 25 November 2025 / Accepted: 27 November 2025 / Published: 28 November 2025
(This article belongs to the Section Construction Management, and Computers & Digitization)

Abstract

The construction industry has persistently high accident rates, and major events continue despite strengthened safety management systems. This study analyzes 19,456 accident records from the national Construction Safety Management Integrated Information (CSI) system and applies a Light Gradient Boosting Machine (LightGBM) model to predict fatal versus injury outcomes. SHAP was used to identify influential factors and quantify each variable’s contribution. Fatal events represented about 5% of cases, reflecting substantial class imbalance. To address this, three oversampling methods—SMOTE, Borderline-SMOTE, and ADASYN—were tested. The ADASYN model showed the best performance (F1-score = 0.905, AUC = 0.879) and was selected as the final model. Oversampling was applied exclusively to the training folds during stratified 10-fold cross-validation on the training set. After identifying the optimal number of iterations, the model was retrained on the full training data and its final performance was evaluated on the independent test set. SHAP results indicated that Type of Accident, Accident Object, and Work Process were primary drivers of fatal outcomes, whereas Safety Management Plan and Public/Private Ownership helped lessen severity. Project Cost, Progress Rate, and Number of Workers moderated prediction strength through interactions with key variables. This study clarifies structural relationships among factors affecting accident outcomes using a LightGBM–SHAP framework that captures nonlinear interactions, supporting explainable artificial intelligence (AI)–based safety management and risk monitoring.

1. Introduction

The construction industry has one of the highest rates of occupational accidents, with serious and fatal incidents continuing to occur despite ongoing reinforcement of safety management systems.
In South Korea, approximately 51.9% of all occupational fatalities in 2022 occurred in the construction sector, indicating that a significant level of risk persists even under strengthened safety regulations and oversight [1].
Traditional construction safety management has relied mainly on empirical judgments or simple statistical analyses, which offer limited capacity to capture the complex processes and diverse environmental conditions present at construction sites.
To address these limitations, recent studies have applied artificial intelligence (AI) and machine learning (ML) techniques to predict accident likelihood and identify key risk factors influencing their occurrence [2,3].
In this study, actual accident case data recorded in the Construction Safety Management Integrated Information (CSI) system, operated by the Korea Infrastructure and Technology Corporation, were analyzed using the Light Gradient Boosting Machine (LightGBM) model [4] to predict accident outcomes (fatal or injury).
Furthermore, the Shapley Additive Explanations (SHAP) method [5] was used to quantitatively interpret the contribution and directional influence of each variable on the model’s predictions.
Rather than focusing solely on predictive accuracy, this study aims to provide data-driven evidence that can support practical decision-making in construction safety management.
By identifying the structural relationships between managerial factors (e.g., presence of a safety management plan, public/private ownership) and physical factors (e.g., type of accident, accident object), this research examines the potential for developing a predictive and preventive safety management framework.
The significance of this study lies in its application of explainable artificial intelligence (XAI) within the domain of construction safety management, offering not only predictive insights into accident outcomes but also practical implications for policy formulation and on-site safety improvement.
Recent studies in construction safety have primarily focused on predicting whether an accident will occur or identifying specific types of accidents, while relatively few have aimed to precisely predict the severity of accident outcomes (fatal vs. injury) or structurally explain the underlying causes.
Moreover, real-world accident datasets consist largely of multi-category variables such as project cost, progress rate, accident object, and work process, containing complex interactions that are often not fully captured in existing analyses. Fatal accidents account for an extremely small proportion of the data, making it difficult for conventional models to maintain stable predictive performance.
Additionally, the limited interpretability of prediction results restricts their practical applicability in the field.
Therefore, there is a pressing need to develop an explainable predictive model capable of identifying key determinants of accident outcomes and transparently interpreting their influence.

2. Literature Review

This section reviews prior studies that have applied ML, deep learning (DL), and explainable AI (XAI) to construction safety. By examining existing approaches and identifying their limitations, the need for the present study is established.
Early predictive research on construction safety primarily used traditional ML techniques to classify accident occurrence or accident types. Models such as Support Vector Machines (SVM), Random Forest, and Logistic Regression have been effective in analyzing relatively simple datasets and identifying major risk factors, contributing to the development of accident prevention strategies.
For example, Yin et al. [6] classified workers’ safety behaviors, and Bortey et al. [7] detected hazardous events using traffic and work-environment data. However, these models have limited capability to capture the complexity inherent in construction accident datasets, which typically include numerous categorical variables and nonlinear interactions.
Additionally, most studies have focused on predicting accident occurrence or type, leaving limited attention to predicting the severity of accident outcomes, such as distinguishing between fatal and injury cases.
DL-based studies have attempted to address some of these limitations by leveraging image- and text-based accident reports, demonstrating strengths in feature extraction from visual and document-based safety data. CNN-based models have been used to automatically classify visual or textual safety information [8,9], and hybrid models incorporating the Human Factors Analysis and Classification System (HFACS) have enabled multidimensional interpretations of human error factors [10,11].
Despite their predictive advantages, DL models remain difficult to interpret due to their complex architectures and are not optimized for tabular datasets containing multi-category accident variables. Moreover, datasets with extremely imbalanced classes, such as fatal accidents, pose challenges for stable model training.
From a practical standpoint, safety managers must understand the rationale behind model predictions, further limiting the applicability of low-interpretability DL models.
Amid these challenges, explainable AI (XAI) has emerged as a promising approach. SHAP (Shapley Additive Explanations), in particular, has been widely adopted due to its strong compatibility with tree-based models and its ability to quantify the magnitude and direction of each variable’s contribution. Bortey et al. [7] visualized risk factors in transportation safety using SHAP, and Yao and García de Soto [12] applied SHAP-based interpretations to cyber-risk assessment, presenting results that could be readily understood by decision-makers.
In South Korea, Kim et al. [13] conducted big-data-driven accident scenario analyses and demonstrated the potential for interpreting accident mechanisms through data analytics. However, in most existing studies, SHAP has been used only as a supplementary tool to explain feature importance. Integrated analytical frameworks that combine variable selection, class-imbalance adjustment, predictive modeling, and structural interpretation remain scarce.
Furthermore, studies that extend SHAP to explain the directional outcome (fatal vs. injury) or derive scenario-based risk levels based on specific combinations of factors are also highly limited.
To address these gaps, this study develops an integrated analytical framework that links ADASYN oversampling, the LightGBM predictive model, and SHAP-based structural interpretation. This framework enables accurate prediction of accident outcomes using large-scale, multi-categorical datasets, provides transparent explanations of variable contributions and interactions, and offers scenario-based interpretations that can support practical decision-making in construction safety management.

3. Materials and Methods

3.1. Dataset Overview

This study was conducted using accident case data obtained from the CSI system, which is operated and managed by the Korea Infrastructure and Technology Corporation (KISTEC).
The CSI system is a national public database established to systematically collect and manage information on construction site accidents across Korea. It includes details on accident causes, outcomes, site conditions, and management systems [14].
The dataset covers the period from 2017 to 2022, consisting of 19,456 accident cases and 135 attributes.
Each record contains essential information representing managerial, operational and environmental characteristics of construction sites, such as the accident date and time, type of accident, work process, project cost, bid ratio, progress rate, number of workers, weather conditions (temperature and humidity) and the presence of a safety management plan.
To ensure the reliability and consistency of the analysis, several preprocessing steps were performed.
Missing values and erroneous entries were first removed, followed by a Chi-square test to identify variables that showed statistically significant relationships with the dependent variable (accident outcome: fatal/injury) [15].
Through this process, 13 independent variables and one dependent variable were finalized for model construction.
The dependent variable was defined according to the victim’s accident outcome—fatal (0) or injury (1)—as a binary classification target.
In addition, temperature and humidity were included as environmental factors that influence accident risk by affecting workers’ physiological conditions and attention levels, as supported by previous studies in Korea and abroad [16,17].
Accordingly, this study incorporated a total of 16 variables, including climatic data, to reflect realistic construction site conditions in the predictive modeling process.
Among the 12,848 cleaned and validated cases, approximately 95% were injuries and 5% were fatalities, indicating a severe class imbalance structure.
This imbalance reflects the reality that construction accidents are typically caused not by single factors but by complex interactions among multiple site variables.
As summarized in Figure 1, the data processing and analytical workflow consisted of four main steps: (1) Data collection → (2) Data preprocessing → (3) Variable selection → (4) Final dataset construction.
Based on the finalized dataset, the LightGBM model was trained and subsequent SHAP-based explainability analyses were performed to interpret the predictive results.

3.2. Data Preprocessing and Class Imbalance Handling

The raw dataset used in this study contained several missing values and non-standardized entries.
To ensure the reliability and consistency of the analysis, a comprehensive preprocessing procedure was performed.
This procedure consisted of five key stages: variable refinement, variable selection, categorical definition, continuous variable adjustment, and class imbalance handling.
First, missing and erroneous data entries were removed. Among the 135 original attributes, a subset of candidate variables that were strongly related to the dependent variable (accident outcome: fatal/injury) was extracted.
To identify significant relationships, a Chi-square test was conducted, and 16 variables showing statistical significance with the dependent variable were selected [15].
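The screening step above can be illustrated with a small sketch. It is written in Python for illustration only (the study itself used R), and the contingency counts are hypothetical, not taken from the CSI data. The function computes the Pearson chi-square statistic for one candidate variable cross-tabulated against the accident outcome; a variable would be retained when the statistic exceeds the critical value for its degrees of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the variable and the outcome
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts (NOT from the CSI data):
# rows = two levels of a site variable, columns = (fatal, injury)
table = [[30, 470], [20, 980]]
stat, dof = chi_square_statistic(table)

# 3.841 is the chi-square critical value for 1 degree of freedom at alpha = 0.05
significant = stat > 3.841
print(round(stat, 3), dof, significant)
```

For variables with many categories, the degrees of freedom grow accordingly, so the critical value must be looked up for each variable separately.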
Subsequently, these variables were reorganized into three groups according to their characteristics and interpretability.
The first group, general site factors, includes project ownership (public/private), project cost (18 ranges), bid ratio (8 ranges), progress rate (10 ranges), and number of workers (6 ranges).
These variables represent fundamental attributes of project scale and organizational characteristics, reflecting potential differences in risk levels associated with site operation and management (Table 1).
Categorical variables were converted into factor-type variables in the R environment. Variables with ordinal characteristics (project cost, bid ratio, progress rate, and number of workers) were first treated as ordered factors and then transformed into general (unordered) factors for modeling purposes. For example, project cost was divided into 18 ranges from less than 10 million KRW to more than 1 billion KRW, while bid ratio was categorized into eight ranges from below 60% to above 90%.
The second group, environmental factors, included weather conditions (temperature and humidity) and temporal characteristics of the accident occurrence. The detailed structure of these variables is summarized in Table 2.
The continuous variables, temperature and humidity, ranged from −16 to 38 °C and 0% to 100%, respectively, with mean values of approximately 12 °C and 63%. To minimize scale differences and ensure balanced representation during model training, both variables were normalized prior to modeling.
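The text does not state which normalization was used. As a minimal sketch, assuming min-max scaling over the reported ranges (an assumption, not the confirmed method), the two variables could be mapped to the interval [0, 1] as follows. The function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values, lower, upper):
    """Map values from [lower, upper] onto [0, 1] (assumed normalization)."""
    span = upper - lower
    return [(v - lower) / span for v in values]


# Reported ranges: temperature -16 to 38 degrees C, humidity 0% to 100%
temperatures = [-16.0, 12.0, 38.0]
humidities = [0.0, 63.0, 100.0]

scaled_temperature = min_max_scale(temperatures, -16.0, 38.0)
scaled_humidity = min_max_scale(humidities, 0.0, 100.0)
print(scaled_temperature, scaled_humidity)
```

Tree-based models such as LightGBM split on thresholds and are largely insensitive to monotone rescaling, so the choice of normalization mainly matters for distance-based steps such as oversampling.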
These meteorological variables, temperature and humidity, have long been recognized as risk-enhancing factors that may increase accident likelihood by imposing physiological strain and reducing workers’ attention levels.
Their significance has been consistently validated in previous domestic and international studies [16,17].
The third group, accident-related factors, consists of variables directly associated with the accident itself, such as the type of accident, accident object (major/minor categories), and the presence of a safety management plan (Table 3).
The safety management plan variable was categorized into three groups—regulated sites (Type 1/2), partially regulated sites (outside Type 1/2), and non-regulated sites—to reflect the degree of compliance with legally mandated safety management systems.
The dependent variable represents the victim’s accident outcome, defined as fatal (0) or injury (1).
Among the 12,848 validated cases, fatalities accounted for 634 cases (4.9%), while injuries comprised 12,214 cases (95.1%), demonstrating a severe class imbalance (Figure 2).
Such imbalance can introduce prediction bias in machine-learning models, but also reflects the industrial reality that fatal accidents, although rare, remain highly critical events in construction sites [6].
To address this issue, oversampling techniques such as SMOTE were later applied during model training to balance the dataset.

3.3. Model Design

In this study, LightGBM was adopted to predict the outcomes of construction accidents (fatal/injury).
LightGBM is based on the Gradient Boosting Decision Tree (GBDT) algorithm and offers several advantages over conventional boosting models.
First, it employs a leaf-wise tree growth strategy, which efficiently explores deep tree structures and is suitable for construction accident data that contain complex nonlinear relationships. Second, its histogram-based splitting technique improves training speed and memory efficiency. Third, LightGBM provides automatic handling of categorical variables, allowing effective learning from multi-level categorical data such as project cost, bid ratio, and progress rate without the need for one-hot encoding [4].
In this study, LightGBM was selected as the predictive model because the CSI accident dataset consists of tabular data with multiple categorical variables, for which LightGBM is well-suited due to its strong capability in handling categorical features and large-scale datasets. Given that fatal accidents account for only about 5% of all cases, the dataset exhibits a highly imbalanced structure. The boosting-based learning mechanism of LightGBM is advantageous for modeling such sparse minority classes and provides stable predictive performance under class-imbalance conditions.
In addition, LightGBM offers high compatibility with SHAP, enabling clear interpretation of feature contributions and directional influence on prediction outcomes. For these reasons, LightGBM was deemed the most appropriate algorithm for achieving the objectives of this study.
Previous studies have demonstrated that LightGBM achieves faster training times and competitive predictive performance compared to models such as XGBoost and Random Forest [6,7].
Its robustness in large-scale, high-dimensional datasets has also been validated, particularly in imbalanced data environments where combining SMOTE oversampling with LightGBM has been shown to enhance predictive performance and variable interpretability [6].
Similarly, Bortey et al. (2024) applied SMOTE to traffic safety data to address imbalance issues and evaluated the performance of LightGBM and XGBoost models, confirming their effectiveness in safety-related prediction tasks [7].
The hyperparameters used in the LightGBM model are summarized in Table 4.
The present study emphasizes not only the improvement of predictive performance but also the interpretability of the model.
The trained LightGBM model was analyzed using the Shapley Additive Explanations (SHAP) method to evaluate the contribution of each variable to the prediction outcomes.
SHAP, based on cooperative game theory, quantifies the contribution of each feature and enables transparent interpretation of the model’s decision-making process [5].
Furthermore, Yao and García de Soto (2024) proposed a machine learning–based cyber risk assessment framework, demonstrating that explainable AI (XAI) can support policy-level decision-making in construction risk management [12].
This approach indicates that the purpose of this study goes beyond improving predictive accuracy.
It aims to provide a data-driven foundation for identifying the factors influencing fatal and injury outcomes and for developing a proactive safety management framework that can be applied to practical construction management.

4. Results

This section presents the training and performance evaluation results of the LightGBM model developed based on construction accident data.
The overall analytical process consisted of three main stages.
First, to address the class imbalance problem, three representative oversampling techniques, ADASYN, SMOTE, and Borderline-SMOTE, were applied and compared.
Second, the classification performance of the LightGBM model was evaluated to identify the most appropriate data augmentation method.
Third, feature importance analysis and SHAP (Shapley Additive Explanations) interpretation were conducted to enhance model explainability and identify key risk factors influencing fatal and injury outcomes.

4.1. Oversampling and Data Distribution

The dependent variable in this study represents the state of the victim, classified as fatal (0) or injury (1).
Because fatalities accounted for only about 5% of the total dataset, there was a high risk of model bias toward the majority class (injury).
To mitigate this imbalance, three representative oversampling methods, ADASYN, SMOTE, and Borderline-SMOTE, were implemented [18,19,20].
To visualize the effects of oversampling, Principal Component Analysis (PCA) was performed to project the original and resampled datasets onto a two-dimensional feature space. As illustrated in Figure 3, the original dataset showed that fatal cases (red) were distributed very sparsely compared to injury cases (blue). After applying the three oversampling methods, the distribution of the minority class expanded, enhancing its representativeness in the boundary regions.
In particular, SMOTE and ADASYN generated relatively balanced synthetic samples, while Borderline-SMOTE increased sample density around the class boundaries [21,22].
Oversampling was applied only to the training dataset and never to the test dataset.
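The interpolation idea shared by these methods can be sketched in a few lines. The following is a simplified, SMOTE-style illustration in Python, not the implementation used in the study: the function name `smote_like`, the toy points, and the choice of k are all hypothetical. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority (fatal) cases in a two-dimensional feature space
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

ADASYN follows the same interpolation step but generates more synthetic samples around minority points that have many majority-class neighbours, while Borderline-SMOTE restricts generation to minority points near the class boundary. Plain interpolation assumes continuous features; categorical variables require an adapted variant.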

4.2. Model Performance Evaluation

The entire dataset was randomly divided into 70% for training (8993 cases) and 30% for independent testing (3855 cases). Stratified 10-fold cross-validation was conducted on the training dataset to ensure model stability and generalization performance. Oversampling methods (ADASYN, SMOTE, and Borderline-SMOTE) were applied only to the training folds and were not used for either the validation folds or the test dataset.
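The fold construction described above can be sketched as follows. This is an illustrative Python outline with hypothetical labels (50 fatal and 950 injury cases), not the study's code; the function name `stratified_folds` is an assumption. It shows how each validation fold preserves the class ratio and where oversampling would be applied.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with the same class ratio in each fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Hypothetical training labels: 5% fatal (0), 95% injury (1)
labels = [0] * 50 + [1] * 950
folds = stratified_folds(labels, k=10)

for val_idx in folds:
    val_set = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in val_set]
    # Oversampling would be applied to the samples in train_idx only;
    # the samples in val_idx keep their original, imbalanced distribution.
```

Keeping synthetic samples out of the validation folds prevents near-duplicates of training points from inflating the validation score.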
The cross-validation results showed that the LightGBM model achieved a mean AUC of 0.996 ± 0.001, indicating highly stable classification performance, and the optimal number of boosting iterations (best iteration) was determined to be 742. Based on these results, the LightGBM model was retrained using the full training dataset with the identified optimal iteration, and the final performance was evaluated using the independent test dataset (30%).
Table 5 summarizes the comparative results among the three resampling methods.
Meanwhile, the Precision values of all models were observed to be approximately 0.99, which can be attributed to the severe class imbalance in the dataset, where fatal accidents account for only about 5% of all cases. Under such conditions, when a model predicts the majority class (injury) with high consistency, the number of false positives becomes extremely small. As a result, Precision tends to increase, while Recall becomes relatively lower. This phenomenon is consistent with previous findings that classifiers trained on imbalanced data often become biased toward the majority class, yielding artificially inflated Precision and reduced minority-class detection performance (Recall) [8,23,24].
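A short numerical sketch shows why Precision alone says little under this class ratio. It assumes that injury (label 1) is treated as the positive class, which the text does not state explicitly, and the confusion-matrix counts are hypothetical rather than taken from Table 5.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and balanced accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical test set: 950 injury (positive) and 50 fatal (negative) cases.
# Case A: every case is labelled "injury".
all_injury = binary_metrics(tp=950, fp=50, fn=0, tn=0)

# Case B: a model that recovers 42 of the 50 fatal cases at the cost of
# flagging 170 injury cases as fatal.
balanced_model = binary_metrics(tp=780, fp=8, fn=170, tn=42)

print(all_injury)
print(balanced_model)
```

In Case A, Precision already reaches 0.95 although no fatal case is detected, and Balanced Accuracy is only 0.5. In Case B, Precision is about 0.99 while Recall drops to about 0.82, which is the pattern that motivates reporting F1-score, AUC, and Balanced Accuracy together.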
Therefore, in this study, overall predictive performance was evaluated using F1-score, AUC and Balanced Accuracy rather than Precision alone. As discussed in Section 3.3, LightGBM demonstrates high learning efficiency and generalization capabilities for structured tabular datasets containing numerous categorical variables. To empirically validate this advantage, three baseline models—Logistic Regression with Elastic Net regularization, Random Forest, and XGBoost—were evaluated under the same preprocessing and ADASYN oversampling conditions.
Table 6 summarizes the predictive performance of each model.
The Logistic Regression model showed the lowest discriminative power with an AUC of 0.687. The Random Forest model demonstrated moderate improvement (AUC = 0.843, F1-score = 0.901) but was still limited in capturing complex interactions among the variables. XGBoost yielded competitive performance with an AUC of 0.886; however, its Recall (0.805) remained relatively low, indicating limited capability in detecting the minority class.
In contrast, the proposed LightGBM model achieved the most balanced performance, with an AUC of 0.879 and an F1-score of 0.905, and showed the highest Recall for fatal accidents (minority class). This indicates that LightGBM can more effectively learn nonlinear and multi-categorical variable relationships, thereby capturing the complex patterns inherent in construction accident data.
Based on these results, LightGBM demonstrated stable and well-balanced performance even under severe class-imbalance conditions, outperforming the other ML models. Accordingly, the feature-importance analysis and SHAP-based interpretation in this study were conducted using the LightGBM model trained with ADASYN oversampling.

4.3. Variable Importance

The LightGBM model can estimate the contribution of each variable to its predictive performance during the training process [4].
In this study, the Gain metric was used to evaluate feature importance, which measures the relative improvement in model accuracy attributed to each variable.
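As a rough sketch of how a gain-based ranking is formed in tree ensembles, the gains of all splits that use a feature are summed and then normalized. The split records and feature names below are hypothetical; in practice they would be read from the trained model.

```python
from collections import defaultdict


def gain_importance(splits):
    """Sum split gains per feature and normalize them to shares of the total."""
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: -item[1])
    return {feature: gain / grand_total for feature, gain in ranked}


# Hypothetical (feature, gain) records, one per split in the ensemble
splits = [
    ("type_of_accident", 40.0),
    ("accident_object", 25.0),
    ("type_of_accident", 20.0),
    ("project_cost", 15.0),
]
importance = gain_importance(splits)
print(importance)
```

Gain-based importance shows how much a feature contributed to reducing the training loss, but not whether it pushes predictions toward fatal or injury outcomes.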
The results identified Type of Human Accident, Accident Object (major/minor), Construction Cost, Progress Rate, and Bid Ratio as the most influential factors.
These findings suggest that construction accidents are not explained by a single factor but rather by interactions among multiple dimensions, including work type, project scale, contract conditions, and site environment [6].
In particular, Type of Accident and Accident Object represent direct physical risk factors, while Construction Cost and Progress Rate reflect indirect managerial or operational risk indicators associated with project scope and progress status.
Figure 4 visualizes the relative importance of all variables derived from the LightGBM model.

4.4. SHAP-Based Interpretation

While the feature importance analysis shows which variables contribute the most to model performance, it does not explain the direction of each variable's effect or the interactions at work in individual predictions.
To address this limitation, the Shapley Additive Explanations (SHAP) method was applied to quantitatively evaluate each variable’s contribution and enhance the interpretability of the LightGBM model outcomes [5].
The SHAP framework is based on game theory, specifically the concept of the Shapley value, which measures the fair contribution of each feature i to the model’s prediction f(x).
In this context, the model output can be decomposed into the sum of all individual feature contributions, as expressed in Equations (1) and (2) [5].
Equation (1) shows the definition of the SHAP value for feature i:
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!} \left[ f_{S \cup \{i\}}\left(x_{S \cup \{i\}}\right) - f_S\left(x_S\right) \right] \tag{1}$$
where $F$ denotes the entire set of features, $S$ represents a subset of $F$ excluding the feature $i$, $f_S(x_S)$ is the model output when using only the subset $S$, and $f_{S \cup \{i\}}(x_{S \cup \{i\}})$ is the output when feature $i$ is added.
$\phi_i$ therefore represents the average marginal contribution of feature $i$ to the model prediction across all possible feature combinations.
Equation (2) expresses the additive decomposition of the model prediction $f(x)$:
$$f(x) = E[f(x)] + \sum_i \phi_i \tag{2}$$
Here, $E[f(x)]$ represents the expected (baseline) model prediction when no specific feature values are provided, and the summation term $\sum_i \phi_i$ corresponds to the total contribution of all features for a given instance.
This additive property ensures the local interpretability and global consistency of the SHAP explanations.
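Equations (1) and (2) can be checked on a small example. The sketch below computes exact Shapley values by enumerating all feature subsets for a hypothetical three-feature toy model. It assumes that features outside the subset S are replaced by a fixed baseline value, so the baseline output plays the role of the expected prediction; this is a simplification for illustration, not the tree-specific algorithm used with LightGBM.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by subset enumeration, following Equation (1)."""
    n = len(x)
    features = range(n)

    def value(subset):
        # f_S(x_S): features in S take their values from x, the rest from the baseline
        point = [x[j] if j in subset else baseline[j] for j in features]
        return model(point)

    phi = []
    for i in features:
        rest = [j for j in features if j != i]
        total = 0.0
        for size in range(len(rest) + 1):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(p):
    # Hypothetical model: one additive term plus one interaction term
    return 2 * p[0] + p[1] * p[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Additivity as in Equation (2): baseline output plus contributions equals f(x)
reconstructed = toy_model(baseline) + sum(phi)
print(phi, reconstructed, toy_model(x))
```

The interaction term is shared equally between the two features that form it, which is how SHAP attributes joint effects. Exact enumeration grows exponentially with the number of features, so it is practical only for small examples.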
Figure 5 illustrates the variable importance based on the mean absolute SHAP values (mean(|SHAP|)), showing that the Safety Management Plan, Type of Accident, and Accident Object (Major) had the most significant influence on the model’s predictions.
This result indicates that the predicted accident outcome is shaped not only by physical risk factors such as accident type and accident object but also by managerial factors, particularly whether a safety management plan has been properly established [12].
Figure 6 presents the SHAP summary plot, which visualizes the distribution of SHAP values and reveals the directionality of each variable’s influence on the model output.
In the plot, negative SHAP values (left side) indicate an increased probability of fatal outcomes, whereas positive SHAP values (right side) correspond to a higher probability of injury outcomes.
As the Safety Management Plan variable increases (i.e., when a plan is implemented), SHAP values shift toward the positive direction, implying a mitigating effect on fatal accidents.
Conversely, specific categories within Type of Accident and Accident Object (Major) tend to exhibit negative SHAP values, suggesting that these factors contribute more strongly to the likelihood of fatal incidents.
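The sign convention can be made concrete with a small sketch. It assumes that the SHAP values are expressed on the log-odds scale of the injury class (label 1), which is common for gradient-boosted binary classifiers but is not stated in the text. The base value and contributions are hypothetical.

```python
import math


def injury_probability(base_value, shap_values):
    """Convert a base log-odds value plus SHAP contributions into a probability."""
    margin = base_value + sum(shap_values)
    return 1.0 / (1.0 + math.exp(-margin))


base_value = 3.0  # hypothetical baseline log-odds of an injury outcome

# No contributions: the prediction stays at the baseline
p_baseline = injury_probability(base_value, [])

# Two negative contributions (toward fatal) and one positive (toward injury)
p_case = injury_probability(base_value, [-1.5, -0.8, 0.4])

print(round(p_baseline, 4), round(p_case, 4))
```

Under these assumptions, negative contributions lower the predicted probability of injury and therefore raise the predicted probability of a fatal outcome, matching the reading of the summary plot.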
These findings demonstrate that the model captures not only statistical correlations but also the true directional relationships of risk associated with actual work conditions.
Moreover, this approach goes beyond simple model performance evaluation, providing practical insights that help identify key risk factors requiring management on construction sites and support the establishment of data-driven safety management strategies [6,12].

4.5. Interpretation of Variables

The SHAP analysis revealed that the variables with the highest contributions to the model’s predictions were the Safety Management Plan, Type of Accident, and Accident Object.
These three variables represent managerial, human, and physical dimensions, respectively, and act as key determinants of the predicted accident outcome on construction sites.
This section interprets the detailed categories and directional impacts of these variables based on their SHAP values [9].

4.5.1. Safety Management Plan

The Safety Management Plan exhibited the highest mean absolute SHAP value (mean(|SHAP|)) among all variables, indicating that it had the most substantial influence on the model’s predictions.
According to the SHAP summary plot, sites where a safety management plan was properly established (high value) showed positive SHAP values, corresponding to a lower probability of fatal outcomes and a higher likelihood of injury outcomes.
This finding implies that even when accidents occur, sites with well-implemented safety management plans are less likely to experience severe or fatal consequences.
Conversely, cases where the plan was absent or insufficient (low value) exhibited negative SHAP values, suggesting a greater likelihood of fatal accidents.
Therefore, the Safety Management Plan can be regarded not only as an administrative formality but as a critical managerial factor in mitigating fatalities on construction sites [10].

4.5.2. Type of Accident

The Type of Accident variable, which records the kind of incident involving workers, exhibited the second-highest mean absolute SHAP value.
Major categories included falls, entrapments, collisions, and slips. Among these, falls and entrapments were associated with negative SHAP values, indicating that they contributed to an increased probability of fatal outcomes.
This is likely because such accident types often occur during work at heights, operations near heavy materials, or processes with high entrapment risk, where direct physical impacts are severe.
In contrast, collisions and slips showed positive SHAP values, suggesting that they were more frequently linked to injury incidents.
Therefore, the Type of Accident serves as a key explanatory variable that distinguishes between fatal and injury outcomes, highlighting the importance of task-specific risk management on construction sites [11].

4.5.3. Accident Object

The Accident Object refers to the physical target or structure involved in the accident, reflecting the environmental and operational characteristics of the construction site.
Major categories included scaffolds, formwork, heavy materials, and equipment.
Among these, scaffolds and heavy materials were associated with negative SHAP values, indicating that they contributed to a higher probability of fatal outcomes.
This finding reflects that the model captured the increased risk of falls during scaffold installation and dismantling, as well as entrapment or struck-by incidents occurring during the handling and transport of heavy materials.
In contrast, objects related to small tools or light equipment exhibited relatively low SHAP values, implying that they were mainly associated with minor injury incidents rather than fatal cases [25].

4.5.4. Integrated Discussion

Synthesizing the interpretation of the three major variables, the occurrence and outcomes of construction accidents can be explained by the interaction among managerial factors (Safety Management Plan), physical factors (Accident Object), and human factors (Type of Accident).
In particular, the presence of a well-established safety management plan was identified as a key variable influencing both the likelihood of accidents and the severity of outcomes.
The combination of Type of Accident and Accident Object (e.g., falls + scaffolds, entrapments + construction equipment) was found to significantly increase the probability of fatal accidents, representing high-risk interaction patterns.
These findings indicate that the LightGBM–SHAP analytical framework functions not only as a predictive model but also as an explainable predictive system, enabling a deeper structural understanding of accident mechanisms.
Furthermore, the key influencing variables identified in this chapter will serve as foundational input for the scenario-based evaluation presented in the following chapter, which quantitatively examines how changes in each factor category affect the probability of accidents [26].

5. Discussion

This study trained a LightGBM model using data from the CSI system to predict accident outcomes (fatal or injury) at construction sites.
To address the severe class imbalance in the dataset, the ADASYN oversampling technique was applied, while model interpretability was enhanced through the SHAP (Shapley Additive Explanations)–based explainable AI (XAI) approach.
This chapter presents a scenario-based evaluation of how interactions and combinations of major variables influence accident outcomes (fatality likelihood), based on the SHAP results derived in Section 4.
Through this analysis, the study aims to identify the key conditions contributing to fatal accidents and to propose practical preventive strategies for construction safety management.
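As a rough illustration of why ADASYN concentrates synthetic samples near the class boundary, the sketch below reproduces only the allocation step described by He et al. [19]: each minority point receives a share of the synthetic samples proportional to the fraction of majority-class points among its k nearest neighbors. The function name, the toy coordinates, and the parameters `k` and `beta` are assumptions for illustration; the interpolation step that actually creates the synthetic points is omitted, and this is not the implementation used in the study.

```python
import math

def adasyn_allocation(X, y, minority_label, k=5, beta=1.0):
    """Number of synthetic samples ADASYN would allocate to each minority point."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_min = len(minority_idx)
    n_maj = len(y) - n_min
    total_to_generate = (n_maj - n_min) * beta  # G in the ADASYN paper

    ratios = []
    for i in minority_idx:
        # k nearest neighbours of x_i among all other points (Euclidean distance)
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        majority_neighbours = sum(1 for j in neighbours if y[j] != minority_label)
        ratios.append(majority_neighbours / k)  # r_i

    total = sum(ratios)
    if total == 0:  # no minority point lies next to the majority class
        return {i: 0 for i in minority_idx}
    # Normalise r_i and scale by G to obtain the per-point sample counts g_i
    return {i: round(r / total * total_to_generate) for i, r in zip(minority_idx, ratios)}

# Toy 2-D data (hypothetical): label 0 = fatal (minority), label 1 = injury (majority)
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0),
     (2.1, 2.0), (2.0, 2.1), (1.9, 2.0), (3.0, 3.0),
     (3.1, 3.0), (3.0, 3.1), (2.5, 2.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

allocation = adasyn_allocation(X, y, minority_label=0, k=2)
print(allocation)
```

In this toy layout the three fatal cases that sit together receive no synthetic samples, while the single fatal case surrounded by injury cases receives all of them.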

5.1. Scenario-Based Evaluation and Interpretation

The SHAP (Shapley Additive Explanations) method is grounded in cooperative game theory and fairly distributes the model’s prediction among the input variables according to their contributions [5,27].
This approach decomposes the model output not as a single value but as the sum of individual feature contributions (ϕ1, ϕ2, …), allowing the prediction to be interpreted as an additive combination of feature effects.
Consequently, even for complex and nonlinear prediction models, SHAP enables a clear understanding of the direction ( ± ) and magnitude of each variable’s influence [28].
The sign of the SHAP value indicates the direction of impact on the model output: a negative SHAP value implies that the variable contributes toward a fatal outcome, whereas a positive SHAP value indicates a mitigating influence leading toward injury outcome.
This sign-based interpretation quantitatively reveals how each variable value acts on the model’s decision boundary, showing whether it shifts the prediction toward or away from a fatal result [29].
For instance, the same variable may exhibit opposite SHAP signs depending on its magnitude or categorical condition, reflecting the nonlinear interactions among features—insights that cannot be captured through simple feature importance analysis.
Figure 7 illustrates the conceptual structure of SHAP, where each feature contribution (ϕ1, ϕ2, …) incrementally adjusts the model output f(x) from the baseline expectation E[f(z)] [5].
This visual representation clarifies that the model’s prediction is not merely a weighted sum of inputs but is instead derived from a conditional expectation framework, capturing context-dependent relationships among variables.
This SHAP-based interpretive framework mitigates the “black box” problem commonly associated with ensemble boosting models such as LightGBM, enabling a transparent understanding of their internal decision-making mechanisms [27,29].
Accordingly, the results of the SHAP-based feature contribution analysis were used in the following section (Section 5.2) to conduct a scenario-based evaluation of accident outcomes that reflects actual site conditions.
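The additive property described above, f(x) = E[f(z)] + ϕ1 + ϕ2 + …, can be checked with a small brute-force sketch. The code below computes exact Shapley values for a hypothetical three-feature scoring function by enumerating all feature coalitions. Features absent from a coalition are averaged over a small background set, which is a simplifying assumption; the study itself relies on the SHAP library with LightGBM rather than on this enumeration, and `toy_model`, `background`, and `x` are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values of `predict` at `x`; absent features are
    averaged over the rows of `background`."""
    n = len(x)

    def value(coalition):
        # Expected prediction when only the features in `coalition` are fixed to x
        total = 0.0
        for z in background:
            mixed = [x[i] if i in coalition else z[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                contribution += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(contribution)
    return phi, value(set())  # contributions and baseline E[f(z)]

def toy_model(v):
    # Hypothetical score: feature 0 raises it, feature 1 lowers it,
    # and features 1 and 2 interact.
    return 1.0 * v[0] - 2.0 * v[1] - 1.0 * v[1] * v[2]

background = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
x = [1, 1, 1]

phi, base_value = shapley_values(toy_model, x, background)
reconstructed = base_value + sum(phi)
print(phi, base_value, reconstructed, toy_model(x))
```

The baseline plus the sum of the contributions reproduces the model output for `x`, which is the property that allows each scenario's prediction to be read as a cumulative shift away from E[f(x)].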

5.2. Variable Importance and SHAP Interpretation

In this section, six hypothetical scenarios (S1–S6) were constructed based on the results of the SHAP (Shapley Additive Explanations) analysis, reflecting the management conditions and environmental factors of construction sites.
The purpose of this scenario design was not merely to visualize model predictions but to establish an experimental framework for verifying the relative influence of managerial interventions on the predicted accident outcomes (fatal or injury).
In particular, three major variables—Public/Private Sector Classification, Presence of a Safety Management Plan, and Type of Accident—were selected as key determinants.
These variables represent the interaction between direct drivers and mitigating factors, allowing for an analysis of their structural impact on accident outcomes.

5.2.1. Scenario Design

The scenarios were constructed around the key variables that exhibited the highest contributions in the LightGBM–SHAP analysis—namely, the Safety Management Plan, Type of Accident, and Public/Private Sector classification.
Each scenario was designed to reflect representative conditions of actual construction sites, distinguishing between combinations of factors that induce fatal accidents and those that mitigate risk through managerial interventions.
This approach goes beyond a simple analysis of mean(|SHAP|) feature importance, enabling a quantitative understanding of accident mechanisms based on the conditional interactions among variables.
Furthermore, the predicted value f(x) of the LightGBM–SHAP-based model is interpreted relative to the baseline expectation E[f(x)].
In the trained model, E[f(x)] was −0.533, representing the average predictive level across the entire dataset, that is, an unbiased reference point for model outputs.
In contrast, the optimal classification threshold, determined by ROC analysis using the Youden Index, was identified as f(x) = 0.9459.
Accordingly, the model classifies cases as injury when f(x) ≥ 0.9459 and as fatal when f(x) < 0.9459.
Thus, E[f(x)] serves as the reference baseline for interpretation, allowing for the quantitative comparison of each scenario’s cumulative contribution based on how far its f(x) value deviates from this baseline.
For clarity, the final classification result of each scenario was explicitly indicated in Table 7 based on the defined threshold (f(x) ≥ 0.9459 → Injury; f(x) < 0.9459 → Fatal).
Accordingly, scenarios S1–S4 are classified as fatal cases, whereas S5 and S6 are classified as injury outcomes.
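For readers unfamiliar with the Youden Index, the following minimal sketch shows how a threshold of this kind can be selected: for each candidate cut-off, J = TPR − FPR is computed and the cut-off with the largest J is retained. The scores and labels are hypothetical and do not reproduce the study's ROC analysis; the function name and the convention that label 1 denotes injury are assumptions made for the example.

```python
def youden_threshold(scores, labels, positive=1):
    """Return (threshold, J) maximising J = TPR - FPR over the observed scores.
    A case is predicted positive when score >= threshold."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == positive)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != positive)
        j = tp / n_pos - fp / n_neg  # sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical model outputs f(x) and true outcomes (1 = injury, 0 = fatal)
scores = [-1.2, -0.3, 0.4, 0.8, 1.1, 1.6, 2.3, 2.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

threshold, j_stat = youden_threshold(scores, labels)
print(threshold, round(j_stat, 3))
```

Because the threshold is chosen on the scale of the model output, it need not coincide with the baseline E[f(x)], which explains why the two reference values in this section differ.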

5.2.2. Scenario-Based Interpretation

Figure 8 and Table 7 comprehensively present the interpretation results of six hypothetical scenarios (S1–S6) derived from the SHAP (Shapley Additive Explanations) analysis.
Each scenario was constructed by combining major management and environmental factors—such as project ownership (public/private), level of safety management planning, and type of accident (fall)—to quantitatively analyze how these conditions influence the model’s predicted output f(x) in both direction and magnitude.
In the SHAP Force Plot, the yellow segments (+) indicate contributions toward the injury direction, whereas the purple segments (−) represent contributions toward the fatal direction.
The model’s baseline value (E[f(x)] = −0.533) corresponds to the average predictive level across all data (the neutral line).
When the predicted output f(x) exceeds the threshold value (0.9459), the outcome is classified as injury, whereas values below this threshold are classified as fatal.
Accordingly, each scenario’s Force Plot visualizes how f(x) deviates from the baseline, intuitively illustrating the cumulative contribution of individual factors to the predicted accident outcome.
The analysis revealed that the variables Type of Accident, Accident Object (major category), and Work Process exhibited clearly negative SHAP distributions, indicating a strong association with fatal outcomes.
In particular, fall-from-height accidents showed the most pronounced negative SHAP values, confirming their role as a direct causal factor in fatal accident occurrences.
Conversely, the variables Safety Management Plan and Public/Private demonstrated high mean absolute SHAP values (mean(|SHAP|)) but were mostly distributed in the positive direction, suggesting that they function as mitigating and preventive management factors.
In interpreting the scenario results, it is important to note that the Safety Management Plan variable (values = 1, 2, 3) reflects differences in facility-level regulatory classifications, not differences in plan quality or strength.
Type 1/2 facilities (value = 1) and other regulated facilities (value = 2) are both legally required to establish a safety management plan, whereas non-regulated facilities (value = 3) are exempt from this requirement. Accordingly, SHAP contributions for values 1 and 2 consistently appear in the positive (injury) direction, indicating their strong mitigating influence on fatal outcomes.
In contrast, the SHAP contribution for non-regulated facilities (value = 3) is occasionally positive but remains extremely small in magnitude.
This reflects the fact that “no mandated safety plan” does not directly push the prediction toward injury; rather, it provides almost no mitigating effect compared with regulated facilities.
Thus, the weak and near-zero SHAP values for value = 3 correctly represent its limited protective influence, while the large positive SHAP values for values 1 and 2 represent substantial reductions in fatality risk.
These factors do not directly cause accidents but instead act as policy- and management-level interventions that alleviate accident severity through structured administrative control.
Additionally, variables such as Project Cost, Progress Rate, and Number of Workers exhibited relatively lower SHAP magnitudes but maintained a consistent negative direction, implying that site operation characteristics, including project scale and progress rate, can influence the directional tendency of accident outcomes (fatal vs. injury).
These variables are interpreted as auxiliary factors that interact with Work Process, adjusting the overall severity level of accident outcomes [13].
Table 8 summarizes the SHAP directionality and role classification of key variable groups.
Physical variables such as Type of Accident, Accident Object, and Work Process acted as direct drivers, shifting the model predictions toward the fatal direction.
In contrast, managerial variables such as Safety Management Plan and Public/Private ownership contributed positively in the injury direction, demonstrating their risk mitigation effect.
Meanwhile, variables related to project operation, such as Project Cost, Progress Rate, and Number of Workers, exhibited lower overall contributions but tended to act as interactive moderators, influencing accident severity when combined with work process conditions.
These findings indicate that the model effectively captures the role differentiation between physical risk factors and managerial control factors, highlighting how both domains jointly shape the outcomes of construction accidents.
The integrated findings from the SHAP-based analysis indicate that the outcomes of construction accidents, whether fatal or injury, are not determined by a single factor but by the interactive structure among physical, managerial, and operational factors.
Physical factors such as Type of Accident, Accident Object, and Work Process served as direct causes, consistently shifting the model predictions toward the fatal direction.
These results reflect the inherent exposure to physical hazards and the instability of work processes that directly contribute to severe accidents.
In contrast, managerial factors, including Safety Management Plan and Public/Private ownership, adjusted the model output toward the injury direction, mitigating the severity of accident outcomes.
Notably, sites with an established Safety Management Plan exhibited prediction values that exceeded the baseline E[f(x)], demonstrating that the presence of pre-planning substantially contributes to reducing accident severity.
This implies that the model captures not merely the existence of safety documentation but also the practical implementation and effectiveness of such plans in influencing real-world safety outcomes.
Meanwhile, operational variables including Project Cost, Progress Rate, and Number of Workers showed relatively low SHAP magnitudes but exhibited subtle yet consistent interactions with other dominant factors, slightly influencing the direction of predicted outcomes.
This suggests that contextual features such as project scale, construction stage, and workforce composition function as interaction-based moderators, indirectly shaping accident severity.
Even when managerial factors are present, the model predicts that high-risk physical conditions can still lead to fatal outcomes, highlighting the structural complexity of conditional interactions among variables.
As demonstrated in Figure 8 and Table 7, fatal accidents are primarily driven by physical and operational factors, whereas managerial interventions alleviate these effects, and operational conditions further moderate their intensity.
This interpretation provides a data-driven analytical framework that incorporates conditional interaction structures, offering a more sophisticated understanding of accident outcome prediction compared to traditional single-variable accident models.
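The scenario classifications can be reproduced directly from the values reported in Table 7, the baseline E[f(x)] = −0.533, and the threshold 0.9459. The short sketch below applies the stated decision rule and also reports how far each scenario lies from the baseline and from the threshold. The variable and function names are illustrative; under SHAP's additive property, the shift from the baseline corresponds to the sum of that scenario's feature contributions.

```python
BASELINE = -0.533    # E[f(x)] reported in Section 5.2.1
THRESHOLD = 0.9459   # Youden-index threshold reported in Section 5.2.1

# Predicted outputs f(x) for scenarios S1-S6 as listed in Table 7
scenario_outputs = {
    "S1": 0.64, "S2": 0.89, "S3": -0.27,
    "S4": 0.89, "S5": 2.69, "S6": 2.83,
}

def classify(fx, threshold=THRESHOLD):
    """Decision rule from the text: f(x) >= threshold -> Injury, otherwise Fatal."""
    return "Injury" if fx >= threshold else "Fatal"

results = {
    name: {
        "prediction": classify(fx),
        "shift_from_baseline": round(fx - BASELINE, 3),
        "margin_to_threshold": round(fx - THRESHOLD, 4),
    }
    for name, fx in scenario_outputs.items()
}

for name, row in results.items():
    print(name, row)
```

All six scenarios lie above the baseline, but only S5 and S6 clear the threshold, which is consistent with the final predictions listed in Table 7.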

6. Conclusions

This study analyzed a total of 12,848 accident cases derived from the original 19,456 records in the national CSI system. The preprocessing procedure included the removal of missing values, exclusion of unclassifiable categories (“others,” “none,” “unknown”), and refinement of categorical labels, ensuring the reliability of the analytical dataset. Based on the refined data, a LightGBM–SHAP framework was applied to predict accident outcomes (fatal vs. injury) and to identify key contributing factors. The primary objective of this study was to achieve both high predictive performance and interpretability, enabling a more systematic identification of factors that lead to fatal accidents at construction sites.
1. Among the three oversampling techniques evaluated, ADASYN demonstrated the best performance in addressing the severe class imbalance, where fatal accidents accounted for only 5% of all cases. Although AUC values across the three techniques were similar (0.875–0.879), ADASYN strengthened minority samples near decision boundaries and achieved the highest predictive performance with a balanced trade-off between precision and recall (F1-score = 0.905).
2. Compared with baseline models—Logistic Regression, Random Forest, and XGBoost—LightGBM exhibited the most stable and balanced performance (AUC = 0.879, F1-score = 0.905, Recall = 0.836). This superiority reflects LightGBM’s ability to capture nonlinear patterns and variable interactions, which align with the complex multi-categorical, tabular characteristics of construction accident data.
3. The 10-fold cross-validation confirmed the stability and generalization capability of the LightGBM model. The mean AUC was 0.996 ± 0.001, and the optimal number of boosting iterations was 742, demonstrating that the model achieved consistent learning without overfitting. These findings indicate that the proposed model is robust enough to support data-driven decision-making in construction safety management.
4. SHAP-based interpretation revealed important insights into both contributing and mitigating factors of fatal accidents. Physical and operational risk factors, such as the type of accident, accident object (major/minor category), and work process, were identified as primary drivers that increase fatal severity. In contrast, managerial variables, including the presence of a safety management plan and public/private ownership, acted as mitigating factors that reduced accident severity. Operational variables such as project cost, progress rate, and number of workers functioned as moderating factors, influencing the direction and magnitude of accident outcomes through interactions with key variables. These results show that accident severity is not determined by a single factor but by the combined interaction of managerial, operational, and task-related elements.
Overall, this study confirms that the integrated ADASYN–LightGBM–SHAP framework is effective for handling imbalanced accident data and for identifying the factors influencing fatal severity in construction accidents. Notably, the consistent mitigating effect of managerial variables such as safety management planning provides data-driven evidence that strengthening safety management systems can contribute to preventing fatal accidents.
This study has some limitations in that it is based on static single-point accident records and does not incorporate temporal variations in risk factors or external environmental conditions (e.g., weather, scheduling, equipment operations). Furthermore, detailed modeling by specific accident types was limited. Future research should address these limitations by incorporating time-series accident prediction, causality-oriented SHAP analysis, real-time risk monitoring using BIM and digital twin technologies, and developing specialized models tailored to specific accident types, thereby advancing toward more precise and field-applicable predictive safety management systems.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, writing—original draft preparation and review and editing, visualization, K.-n.K., D.-g.C. and M.-j.L.; supervision, project administration, funding acquisition, M.-j.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2020-KA156208).

Data Availability Statement

All relevant data are within the manuscript.

Conflicts of Interest

Author Dae-gu Cho was employed by the company Ninetynine Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

AI: Artificial intelligence
ADASYN: Adaptive synthetic sampling
AUC: Area under the curve
CSI: Construction Safety Management Integrated Information system
DL: Deep learning
GBDT: Gradient boosting decision tree
HFACS: Human Factors Analysis and Classification System
KISTEC: Korea Infrastructure and Technology Corporation
ML: Machine learning
PCA: Principal component analysis
SHAP: Shapley additive explanations
SMOTE: Synthetic minority oversampling technique
XAI: Explainable artificial intelligence
XGBoost: Extreme gradient boosting
LightGBM: Light Gradient Boosting Machine

References

  1. Korea Occupational Safety and Health Agency (KOSHA). Annual Report on Occupational Accidents and Fatalities in Korea; KOSHA: Ulsan, Republic of Korea, 2023.
  2. Cheng, T.; Teizer, J. Real-time resource location data collection and visualization technology for construction safety and activity monitoring applications. Autom. Constr. 2013, 34, 3–15.
  3. Schultz, G.G.; Lunt, C.C.; Pew, T.; Warr, R.L. Using complementary intersection and segment analyses to identify crash hot spots. Saf. Sci. 2023, 163, 106121.
  4. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 3149–3157.
  5. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874.
  6. Yin, S.; Wu, Y.; Shen, Y.; Rowlinson, S. Development of a classification framework for construction personnel’s safety behavior based on machine learning. Buildings 2023, 13, 43.
  7. Bortey, L.; Edwards, D.J.; Roberts, C.; Rille, I. Hidden in plain sight: A data-driven approach to safety risk management for highway traffic officers. Buildings 2024, 14, 3509.
  8. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259.
  9. Yoo, J.W.; Park, J.; Park, H. Enhancing safety of construction workers in Korea: An integrated text mining and machine learning framework for predicting accident types. Int. J. Inj. Control Saf. Promot. 2024, 31, 203–215.
  10. Son, S.; Na, Y.; Han, B. Assessment of risk priorities by cause of construction safety accidents: A case study of falling accidents in South Korea. Heliyon 2024, 10, e40303.
  11. Wang, Y.; Liu, C.; Xu, H.; Geng, X.; Wang, Y.; Liu, Y. Analysis of the causes of falling accidents on building construction sites in China based on the HFACS model. Buildings 2025, 15, 1412.
  12. Yao, D.; García de Soto, B. Cyber risk assessment framework for the construction industry using machine learning techniques. Buildings 2024, 14, 1561.
  13. Kim, K.-N.; Kim, T.-H.; Lee, M.-J. Analysis of building construction jobsite accident scenarios based on big data association analysis. Buildings 2023, 13, 2120.
  14. Korea Infrastructure and Technology Corporation (KISTEC). Construction Safety Management Integrated Information (CSI) System Overview; KISTEC: Jinju, Republic of Korea, 2023.
  15. Um, K.S. An Analysis on Fall Accidents at the Apartment Construction Site by Making Up Questionaires for Employee. Master’s Thesis, Hanyang University, Seoul, Republic of Korea, 2011.
  16. Son, C.B.; Kim, K.Y.; Lee, J.Y. A study on the influence of climate factors on construction accidents. J. Korean Soc. Saf. 2005, 20, 91–97.
  17. Song, M.; Jeong, J.; Kumi, L.; Mun, H. Analysis of the effect of outdoor thermal comfort on construction accidents by subcontractor types. Sustainability 2024, 16, 4906.
  18. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  19. He, H.; Bai, Y.; Garcia, E.A.; Li, S.A. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; IEEE: New York, NY, USA, 2008; pp. 1322–1328.
  20. Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; Huang, D.-S., Zhang, X.-P., Huang, G.-B., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
  21. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
  22. Fernandez, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905.
  23. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  24. Branco, P.; Torgo, L.; Ribeiro, R.P. A survey of predictive modeling on imbalanced domains. ACM Comput. Surv. 2016, 49, 1–50.
  25. Błazik-Borowa, E.; Geryło, R.; Wielgos, P. The probability of a scaffolding failure on a construction site. Eng. Fail. Anal. 2022, 131, 105864.
  26. Yoon, S.; Chang, T.; Chi, S. Developing an integrated construction safety management system for accident prevention. J. Manag. Eng. 2024, 40, 04024051.
  27. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
  28. Molnar, C. Interpretable Machine Learning, 2nd ed.; Leanpub: Victoria, BC, Canada, 2022; Available online: https://christophm.github.io/interpretable-ml-book (accessed on 17 September 2025).
  29. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. 2018, 51, 1–42.
Figure 1. Overall research flow of the study.
Figure 2. Distribution of accident outcomes (Fatal (0) and Injury (1)).
Figure 3. Distribution of the original and resampled datasets (ADASYN, SMOTE, Borderline-SMOTE) visualized by PCA. Red dots indicate fatalities (minority class), and blue triangles represent injuries (majority class).
Figure 4. Feature importance ranking of all variables in LightGBM.
Figure 5. SHAP Feature importance plot of the LightGBM model.
Figure 6. SHAP summary plot of the LightGBM model.
Figure 7. Conceptual illustration of SHAP value computation based on conditional expectations. Each feature’s contribution (ϕ1, ϕ2, …) incrementally adjusts the model output f(x) from the baseline expectation E[f(z)]. Adapted from Lundberg and Lee (2017) [5], A Unified Approach to Interpreting Model Predictions, NeurIPS.
Figure 8. SHAP summary plot showing the magnitude and direction of variable contributions (Positive SHAP = injury direction; Negative SHAP = fatal direction).
Table 1. Construction Site Variables.

| Variable | Type | Elements |
|---|---|---|
| Public/Private | Categorical | 2 |
| Facility Type (Major) | Categorical | 4 |
| Construction Type (Major) | Categorical | 7 |
| Construction Type (Minor) | Categorical | 39 |
| Work Process | Categorical | 41 |
| Project Cost | Categorical | 18 |
| Progress Rate | Categorical | 10 |
| Bid Rate | Categorical | 8 |
| Number of Workers | Categorical | 6 |
Table 2. Environment Variables.

| Variable | Type | Range |
|---|---|---|
| Temperature | Numeric | −16~38 |
| Humidity | Numeric | 0~100 |
Table 3. Accident Variables.

| Variable | Type | Elements |
|---|---|---|
| Safety Management Plan | Categorical | 3 |
| Type of Accident (Human) | Categorical | 16 |
| Accident Object (Major) | Categorical | 9 |
| Accident Object (Minor) | Categorical | 117 |
| Presence of Fatalities and Injuries | Binary | 2 |
Table 4. Hyperparameters of the LightGBM Model.

| Parameter | Value | Tuning | Description |
|---|---|---|---|
| learning-rate | 0.05 | 0.01 | Learning rate |
| num-leaves | 31 | 64 | Number of leaf nodes |
| feature-fraction | 0.9 | 0.8 | Proportion of features used per iteration |
| bagging-fraction | 0.8 | 0.8 | Proportion of samples used per iteration |
| min_data_in_leaf | 10 | 10 | Minimum number of data per leaf |
| metric | AUC | AUC | Evaluation metric |
Table 5. Performance of LightGBM under different resampling methods.

| Method | AUC | F1-Score | Precision | Recall |
|---|---|---|---|---|
| SMOTE | 0.879 | 0.877 | 0.989 | 0.788 |
| Borderline-SMOTE | 0.875 | 0.855 | 0.988 | 0.753 |
| ADASYN | 0.879 | 0.905 | 0.987 | 0.836 |
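The F1-scores in Table 5 can be cross-checked against the reported precision and recall using F1 = 2PR / (P + R). The sketch below performs this arithmetic on the tabulated values; because the inputs are already rounded to three decimals, agreement is expected only to about the third decimal place. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as listed in Table 5
table5 = {
    "SMOTE": (0.989, 0.788, 0.877),
    "Borderline-SMOTE": (0.988, 0.753, 0.855),
    "ADASYN": (0.987, 0.836, 0.905),
}

recomputed = {
    method: round(f1_score(p, r), 3)
    for method, (p, r, _reported) in table5.items()
}

for method, (_p, _r, reported) in table5.items():
    print(f"{method}: recomputed F1 = {recomputed[method]:.3f}, reported F1 = {reported:.3f}")
```

The recomputed values match the reported ones, and they show that the advantage of ADASYN comes from its higher recall, since precision is nearly identical across the three methods.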
Table 6. Performance comparison between baseline models and the proposed LightGBM model.

| Method | AUC | F1-Score | Recall | Precision | Balanced Accuracy |
|---|---|---|---|---|---|
| Logistic Regression | 0.687 | 0.887 | 0.817 | 0.969 | 0.649 |
| Random Forest | 0.843 | 0.901 | 0.831 | 0.982 | 0.769 |
| XGBoost | 0.886 | 0.887 | 0.805 | 0.989 | 0.814 |
| LightGBM | 0.879 | 0.905 | 0.836 | 0.987 | 0.806 |
Table 7. Configuration of virtual scenarios and model interpretation results.
Table 7. Configuration of virtual scenarios and model interpretation results.
| Scenario | Ownership/Plan Condition | Predicted Output f(x) | Dominant SHAP Contributors | Interpretation Summary | Final Prediction |
|---|---|---|---|---|---|
| S1 | Private; Other regulated facility; Fall | +0.64 | Safety Plan (+), Type of Accident (−), Project Cost (−) | Private site without plan → partial mitigation; risk remains high. | Fatal |
| S2 | Private; Type 1/2 facility; Fall | +0.89 | Safety Plan (+), Work Process (+), Accident Object (−) | Formal plan increases injury tendency. | Fatal |
| S3 | Private; Non-regulated facility; Fall | −0.27 | Type of Accident (−), Project Cost (−), Work Process (−) | Lack of plan and low resources → fatal outcome likely. | Fatal |
| S4 | Public; Non-regulated facility; Fall | +0.89 | Public/Private (+), Safety Plan (+), Type of Accident (−) | Public oversight partially offsets missing plan. | Fatal |
| S5 | Public; Type 1/2 facility; Fall | +2.69 | Safety Plan (+), Public/Private (+), Work Process (+) | Full plan + public supervision → lowest fatality risk. | Injury |
| S6 | Public; Other regulated facility; Fall | +2.83 | Public/Private (+), Type of Accident (−), Safety Plan (+) | Public project mitigates fatal risk despite weak plan. | Injury |
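The f(x) column in Table 7 reads most naturally as a raw model score. If it is the log-odds margin of a binary classifier with the injury class as the positive label (an assumption, suggested by the sign convention in Table 8 but not stated here), the logistic function converts it to a probability. The sketch below applies that conversion to the six scenario scores; the names are hypothetical.

```python
import math


def sigmoid(score):
    """Logistic function: maps a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# f(x) values and final labels copied from Table 7
SCENARIO_SCORES = {"S1": 0.64, "S2": 0.89, "S3": -0.27, "S4": 0.89, "S5": 2.69, "S6": 2.83}
FINAL_PREDICTION = {"S1": "Fatal", "S2": "Fatal", "S3": "Fatal", "S4": "Fatal",
                    "S5": "Injury", "S6": "Injury"}


def injury_probabilities(scores):
    """ASSUMPTION: scores are log-odds with injury as the positive class."""
    return {name: sigmoid(s) for name, s in scores.items()}


if __name__ == "__main__":
    for name, p in injury_probabilities(SCENARIO_SCORES).items():
        print(f"{name}: f(x) = {SCENARIO_SCORES[name]:+.2f}, "
              f"P(injury) = {p:.3f}, labeled {FINAL_PREDICTION[name]}")
```

Under this reading, S2 and S4 (f(x) = +0.89, a probability of roughly 0.71) are labeled Fatal while S5 (f(x) = +2.69, roughly 0.94) is labeled Injury, so the cut-off used for the final prediction lies somewhere between those two scores rather than at zero. The table does not give its value.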
Table 8. Direction of SHAP influence and functional roles of major variables.
| Category | Major Variables | SHAP Direction | Role Type | Interpretation Summary |
|---|---|---|---|---|
| Direct Drivers | Type of Accident; Accident Object (Major); Work Process | Negative (−) → Fatal direction | Physical/Operational factors | These variables drive the model output toward the fatal direction. High negative SHAP values indicate strong contributions to fatal outcomes, representing direct causes such as fall-from-height or collapse-related processes. |
| Mitigating Factors | Safety Management Plan; Public/Private | Positive (+) → Injury direction | Managerial/Institutional factors | Management-related variables reduce the predicted fatality risk. The existence of a safety plan and public-sector supervision act as mitigating elements that shift predictions toward injury outcomes. |
| Contextual/Interactive Factors | Project Cost; Progress Rate; Number of Workers | Mainly negative (−), partly positive (+) depending on conditions | Site-operational factors | Operational characteristics such as project scale, progress rate, and workforce size show consistent directional effects and interact with work process variables, influencing the severity level of predicted accidents. |
| Environmental Factors | Temperature; Humidity | Weak impact | Weak impact | |
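The sign convention in Table 8 follows from the additive form of SHAP: the raw output for one record equals a base value plus the sum of its per-feature SHAP values, so negative contributions pull the score toward the fatal side and positive ones toward the injury side. The sketch below illustrates that bookkeeping. Every number in it is hypothetical and chosen only for illustration; none is taken from the paper, and the function names are not from any library.

```python
def raw_output(base_value, contributions):
    """SHAP additivity: model output = base value + sum of feature contributions."""
    return base_value + sum(contributions.values())


def split_by_direction(contributions):
    """Group features by sign: negative -> fatal direction, positive -> injury direction."""
    toward_fatal = {k: v for k, v in contributions.items() if v < 0}
    toward_injury = {k: v for k, v in contributions.items() if v > 0}
    return toward_fatal, toward_injury


# HYPOTHETICAL base value and SHAP values for a single record (not from the paper).
BASE_VALUE = 1.0
EXAMPLE_SHAP = {
    "Type of Accident": -0.9,
    "Work Process": -0.4,
    "Safety Management Plan": 0.6,
    "Public/Private": 0.3,
    "Temperature": 0.0,
}

if __name__ == "__main__":
    fatal, injury = split_by_direction(EXAMPLE_SHAP)
    print("toward fatal:", sorted(fatal, key=fatal.get))
    print("toward injury:", sorted(injury, key=injury.get, reverse=True))
    print(f"raw output: {raw_output(BASE_VALUE, EXAMPLE_SHAP):+.2f}")
```

A feature with a SHAP value of zero, like the hypothetical temperature entry, falls in neither group, which mirrors the weak-impact row for environmental factors.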

Kim, K.-n.; Cho, D.-g.; Lee, M.-j. A Machine Learning Approach for Factor Analysis and Scenario-Based Prediction of Construction Accidents. Buildings 2025, 15, 4343. https://doi.org/10.3390/buildings15234343
