1. Introduction
The construction industry records a large number of fatalities and injuries every year. In 2022, 1069 workers died on construction job sites, representing 19.5 percent of the 5486 total fatal occupational injuries reported in the United States according to the Bureau of Labor Statistics [1]. In addition to the human toll, the economic and social losses resulting from construction-related fatalities are substantial, reaching nearly 5 billion dollars annually due to lost production, reduced family income, pain and suffering, and decreased quality of life [2]. To mitigate these impacts and improve safety performance, ongoing research has focused on enhancing existing safety management frameworks and adopting advanced technologies such as virtual reality, wearable sensing, deep learning, and robotics [3,4,5,6,7,8,9,10]. Among these efforts, construction accident prediction, which aims to quantitatively estimate the likelihood of potential incidents [11,12,13,14], has been extensively studied because it can provide useful guidance for decision makers (e.g., safety managers) to allocate limited safety resources effectively, develop proactive safety plans, and strengthen overall safety performance [15].
To improve the performance (e.g., accuracy) of construction accident prediction, various approaches and techniques (e.g., statistics) have been continuously developed [16,17]. Among them, machine learning and deep learning-based approaches have received significant attention due to their advantages in effectively analyzing large-scale, diverse data and identifying latent risk factors that are difficult to capture with other methods (e.g., statistical models) [13]. Unfortunately, many previous studies have relied on only a few machine learning and deep learning models (e.g., decision trees), failing to compare the performance of advanced algorithms, and such approaches often involve time-consuming, complex model development processes (e.g., hyperparameter optimization). For this reason, within the field of construction accident prediction, some studies have adopted automated machine learning (AutoML) methods that automate and streamline relevant tasks (e.g., preprocessing) and compare performance across different algorithms [18]. Although this approach has partially addressed the existing challenges, its performance still depends heavily on the size of the input datasets used to train and test the algorithms, and it remains difficult for practitioners and professionals in the field to employ such methods extensively in real-world applications [19].
The emergence of large-scale foundation models marks a structural shift in artificial intelligence research beyond traditional task-specific supervised learning. Generative architectures, especially large language models (LLMs), exhibit strong generalization capabilities through prompt-based inference, enabled by advances in self-supervised pretraining, representation scaling, and cross-domain transfer learning [20,21]. Diffusion-based generative frameworks have further enhanced high-resolution image and video synthesis, enabling realistic multimedia generation from textual or latent inputs [22,23]. In parallel, comprehensive surveys emphasize how prompt-driven learning and foundation model scaling have improved generalization, content quality, and cross-domain applicability [21,24]. These advancements suggest a paradigm shift toward flexible, prompt-oriented systems capable of integrating data interpretation and content generation within unified modeling structures, thereby providing a strong technical foundation for exploring generative artificial intelligence (AI) as an alternative or complement to AutoML in construction accident prediction.
Generative AI can efficiently generate high-quality, innovative content, including text, images, and videos, based on user queries [25]. As a result, it has attracted growing attention across research fields and offers new opportunities to advance construction accident prediction. In the construction field, it has been used to generate design alternatives, analyze data for defect detection, summarize construction details, etc. [26,27,28]. Although some studies have assessed the feasibility of using generative AI to evaluate the likelihood of construction incidents [29], several gaps in the current knowledge base remain to be addressed. First, there is a lack of comparative analysis between generative AI and AutoML in terms of performance (e.g., accuracy) and training time. A preliminary study is necessary to assess the feasibility of generative AI as an alternative to AutoML for construction accident prediction. Second, the issue of data imbalance in construction accident prediction—where fatality cases are significantly fewer than injury cases—necessitates a performance comparison between generative AI and AutoML across different dataset scales. Third, while AutoML typically requires domain expertise, generative AI offers greater accessibility to non-experts. A comparative usability study is needed to evaluate the potential for methodological shifts in construction accident prediction. Lastly, there is a need for a structured framework to facilitate the practical application of generative AI as a viable alternative to existing methodologies.
To address these research gaps, this study evaluates the performance, training time, and usability of AutoML and generative AI using a dataset of 23,484 real construction accident cases. The findings provide empirical evidence on the feasibility of generative AI as a replacement for conventional construction accident prediction methods such as AutoML. Furthermore, this research proposes a structured framework to guide the application of generative AI in the construction domain.
2. Related Studies
Early efforts to predict the consequences of construction accidents primarily employed conventional ML and statistical models. These studies demonstrated that supervised learning can classify or estimate accident outcomes using structured case data, but performance is often constrained by feature quality, domain-specific preprocessing, and limited model diversity [11,14]. For example, applications to Chinese construction accident data showed that standard classifiers achieve reasonable accuracy in consequence prediction, yet sensitivity to data preparation and parameter tuning remains substantial [11]. Similarly, preprocessing pipelines and classical models for occupational accident analysis highlight both the importance and the burden of using carefully curated inputs to achieve stable performance [14].
A growing body of research has sought to refine AI-based accident prediction in construction by addressing limitations in predictive accuracy, class imbalance handling, and model interpretability. For example, researchers have developed supervised learning models for serious injury/fatality exposure using large-scale project data, highlighting the feasibility of risk-oriented prediction in real settings [30]. Interpretable frameworks for fatal accident prediction further demonstrate how explainability can support managerial decision-making rather than producing “black-box” outputs [31]. Beyond standard tabular learning, graph-based deep learning approaches have been proposed to better capture relational structures among accident records, improving representativeness and robustness compared with conventional classifiers [32]. In parallel, methodological work has shown that prediction accuracy is highly sensitive to preprocessing choices; scenario-based automated preprocessing pipelines can materially change severity-prediction performance, underscoring the need to report and justify preprocessing configurations [33]. Time-series and hybrid modeling approaches have also been explored for forecasting accident occurrences, suggesting an additional pathway for proactive safety planning when sequential patterns are present in historical records [34]. In addition to construction-focused research, closely related safety-critical domains such as transportation have produced a large body of evidence on severity prediction and explainable ML, offering transferable insights regarding class imbalance, model stability, and operational deployment [35,36]. Within construction management specifically, recent work has addressed class imbalance through integrated resampling and ML pipelines, demonstrating that minority-class handling can significantly affect outcome prediction [37]. Finally, emerging studies are beginning to explore how LLMs and generative AI can support safety analytics and risk understanding, motivating the need for careful empirical comparison between structured AutoML pipelines and deployment-oriented generative approaches [38,39].
Deep learning has expanded the scope of safety analytics from tabular accident records to image and video streams, enabling both outcome prediction and proactive hazard detection. Studies have reported improved representational power for predicting or analyzing safety incidents, including case-specific models for fall accidents and other safety outcomes [13,32]. In parallel, computer vision models detect non-compliance and personal protective equipment in real time, providing leading indicators of risk on-site and strengthening the connection between perception tasks and safety decision-making [7,8]. Collectively, these works show that deep learning can extract informative signals from complex data, but they generally require extensive data, careful hyperparameter tuning, and specialized expertise to operationalize [8,13,17].
To reduce the manual burden of model selection and tuning, researchers have explored AutoML frameworks for predicting accident severity. AutoML has been used to automate preprocessing, systematically compare algorithms, and improve reproducibility in model development for construction accident severity [18]. While these results highlight the value of pipeline automation and standardized benchmarking, reported performance remains sensitive to dataset size and class imbalance, and practical deployment still demands informed configuration and interpretation by domain experts [18,19].
Generative AI has recently been introduced to construction engineering and management for tasks such as generating design alternatives, analyzing defect data, and summarizing information, indicating potential to support data-driven decision-making across workflows [26,27,28]. In safety analytics, initial studies have examined the feasibility of using LLMs to predict the likelihood of construction incidents and provide model interpretability (e.g., saliency) for practitioner insight [29]. Broader surveys of generative AI position these models as versatile tools capable of rapidly producing high-quality content from prompts, potentially lowering barriers to advanced analytics in domains that lack extensive programming expertise [25].
Across these streams, three needs motivate the present study. First, there is limited head-to-head evidence comparing generative AI with AutoML for construction accident prediction, particularly with respect to predictive performance and training time across standardized pipelines. Second, few studies systematically evaluate method robustness under the class imbalance that characterizes accident datasets and across varying dataset scales, despite consistent acknowledgment of data scarcity challenges in safety analytics [11,14,18,19]. Third, usability for non-experts remains underexplored: AutoML typically presumes informed configuration and interpretation, whereas generative AI may offer more accessible interaction patterns for practitioners, but this has not been rigorously assessed in the context of accident prediction.
Despite the growing body of research on ML-based accident prediction, limited attention has been given to the structured application of generative AI for severity classification in construction contexts. Furthermore, few studies have directly compared performance-optimized AutoML pipelines with deployment-oriented generative AI approaches under external validation conditions. This gap underscores the need for empirical investigation into the practical trade-offs between accuracy, usability, and robustness in safety-critical environments.
3. Methodology
To achieve this objective, the study proceeded in four steps: (a) data collection and preprocessing, (b) construction of an AutoML-based classifier, (c) development of a generative AI-based classifier, and (d) model evaluation. The dataset prepared in the first step was used in both the second and third steps. Using the results of the AutoML-based classifier, this study created an ROC curve, which was then compared with the results of the generative AI-based classifier in the last step. An overview of the research approach is illustrated in Figure 1.
3.1. Data Collection and Preprocessing
This study evaluates construction accident severity classification models using fatality and injury records from the publicly available Construction Safety Management Integrated Information (CSI) system in South Korea [40]. The data include corresponding construction project information (e.g., title, type, and budget) as well as accident-related information, such as the type of incident (fatality or injury), the date the incident occurred, and associated descriptions. A total of 23,484 historical data points, covering 1043 fatalities and 22,441 injuries between 2019 and 2023, were collected, each described by 27 data features (attributes).
The CSI database was selected because it provides nationally standardized and publicly accessible construction accident records with consistent reporting formats. The 2019–2023 period was chosen to ensure sufficient historical coverage while maintaining data relevance to current construction practices. Records were included if they contained complete accident type information (fatality or injury) and sufficient project-related attributes for modeling. This selection strategy ensured data consistency, reliability, and practical relevance for predictive modeling in real-world construction safety management.
To clean the raw data, preprocessing was conducted in two steps. First, missing values were filtered or imputed, the data were normalized, and noisy data points were removed. This step was necessary because some data points were missing critical details (e.g., the specific work type an injured worker was involved in, or weather conditions), and unrealistic values were often recorded for fields such as temperature and humidity. Second, 25 professionals (e.g., authors, safety managers, and construction specialists) with an average of 10 years of experience at construction sites were consulted to remove unnecessary data attributes and further refine the dataset based on descriptive statistics. For instance, some features pertaining to post-accident information (e.g., damage description and injured body part) were removed. As a result, the cleaned dataset consisted of 12,763 data points with 18 corresponding features.
Next, this study transformed the preprocessed data into input formats suitable for ML and generative AI classifiers. For ML applications, numerical variables such as temperature and construction cost were converted to categorical variables via binning, and the resulting categorical features were further transformed into binary variables via one-hot encoding. For generative AI applications, numerical data were used without binning, and categorical features were retained in their original form, as generative models can process such inputs directly.
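The binning and one-hot encoding steps for the ML pipeline can be sketched with pandas as follows. This is a minimal illustration: the column names, bin edges, and labels below are hypothetical stand-ins, not the actual CSI features or thresholds used in the study.

```python
import pandas as pd

# Hypothetical mini-batch of accident records (illustrative columns only).
df = pd.DataFrame({
    "temperature": [9.0, 31.5, -2.0],
    "construction_cost_krw_bn": [120.0, 4.5, 55.0],
    "work_type": ["Masonry", "Rebar", "Masonry"],
})

# Step 1: bin numerical variables into categorical ranges.
df["temperature_bin"] = pd.cut(
    df["temperature"], bins=[-20, 0, 10, 20, 30, 45],
    labels=["freezing", "cold", "mild", "warm", "hot"],
)
df["cost_bin"] = pd.cut(
    df["construction_cost_krw_bn"], bins=[0, 10, 100, 1000],
    labels=["small", "medium", "large"],
)

# Step 2: one-hot encode all categorical features into binary variables.
encoded = pd.get_dummies(df[["temperature_bin", "cost_bin", "work_type"]])
print(encoded.columns.tolist())
```

Binning before one-hot encoding keeps the resulting binary feature space compact, which suits tree-based classifiers; for the generative AI pipeline, as noted above, the raw numerical and categorical values are passed through as text instead.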
The refined dataset of 12,763 data points (12,442 injuries and 321 fatalities) was imbalanced, with a 97:3 ratio. To address this imbalance and minimize its impact on performance, the well-established Synthetic Minority Over-sampling Technique (SMOTE) was employed, given its proven performance in prior literature [41].
SMOTE generates new samples for the minority class (fatalities) by interpolating between data points using the k-Nearest Neighbors (kNN) algorithm. To ensure methodological rigor, SMOTE was applied exclusively to the training dataset after dataset partitioning, while the testing and external validation datasets consisted solely of original accident records. The augmentation level (up to 5000 fatality instances) was selected to mitigate severe class imbalance while maintaining statistical stability during model training. As SMOTE creates synthetic samples through interpolation between existing minority observations, the generated data approximate realistic fatality patterns rather than introducing artificial or entirely new accident scenarios.
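The interpolation mechanism behind SMOTE can be illustrated with a short sketch. This is not the reference implementation (libraries such as imbalanced-learn provide production versions); it simply shows, on assumed toy data, how a synthetic minority sample is formed by interpolating between a minority point and one of its k nearest minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=42):
    """Minimal SMOTE sketch: synthesize n_new minority samples by moving
    a random minority point part-way toward one of its k nearest
    minority neighbors (illustrative only)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # random minority seed point
        j = idx[i, rng.integers(1, k + 1)]  # one of its k neighbors
        gap = rng.random()                  # interpolation factor in [0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(new)

rng = np.random.default_rng(0)
X_fatal = rng.normal(size=(30, 18))         # toy stand-in minority class
X_synth = smote_oversample(X_fatal, n_new=70)
print(X_synth.shape)                        # 70 synthetic fatality-like rows
```

Because each synthetic point is a convex combination of two real minority points, it stays within the envelope of observed fatality patterns, which is why the augmentation approximates realistic cases rather than inventing entirely new scenarios.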
3.2. Construction of AutoML-Based Classifier
Conventional ML model development involves multiple sequential and iterative steps. In traditional ML, these tasks are handled manually: model results are derived from various frameworks, and iterative validation, evaluation, and optimization are performed through hyperparameter tuning. This manual process can be labor-intensive and time-consuming.
AutoML streamlines this process by automating model building and evaluation, significantly reducing the time required for model development. In this study, 16 ML models (Extra Trees Classifier, Random Forest Classifier, CatBoost Classifier, Gradient Boosting Classifier, Light Gradient Boosting Machine, Extreme Gradient Boosting, Ada Boost Classifier, Decision Tree Classifier, K Neighbors Classifier, Ridge Classifier, Linear Discriminant Analysis, Logistic Regression, SVM-Linear Kernel, Quadratic Discriminant Analysis, Naive Bayes, Dummy Classifier) were utilized within the AutoML framework. For example, the random forest classifier uses ensemble learning, which combines multiple decision trees to build a robust model. It uses a bagging method to introduce randomness by training each decision tree with different bootstrap samples. XGBoost, another ensemble method, also combines decision trees but uses advanced boosting techniques to address the problem of overfitting.
Model performance depends on many factors, including dataset construction, algorithm selection, and hyperparameter settings. In this study, we used the AutoML pipeline to train and test 16 classifiers. The dataset was split 70/30 for training and testing, with stratified 10-fold cross-validation applied to the training set. This method maintains the class distribution in each fold, ensuring balanced validation. The training data is divided into 10 subsets, with each fold used once for validation and the results averaged to provide a more robust and generalized model evaluation.
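The validation protocol above (a 70/30 stratified split followed by stratified 10-fold cross-validation on the training set) can be reproduced with scikit-learn. The snippet below uses synthetic data as a stand-in for the encoded accident features, with the Extra Trees Classifier as the example model.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split,
)

# Synthetic binary-outcome data standing in for the encoded accident features.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 18))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

# 70/30 stratified split, as in the AutoML pipeline described above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Stratified 10-fold CV preserves the class distribution in each fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X_tr, y_tr, cv=cv, scoring="accuracy")

print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
test_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
```

Averaging the ten fold scores gives the more robust model estimate described above, while the held-out 30 percent provides the final test evaluation.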
3.3. Development of Generative AI-Based Classifier
Fine-tuning a Generative Pre-trained Transformer (GPT) model to develop a classifier for construction accident prediction requires a systematic process that begins with model selection. Choosing an appropriate base model is critical to the task at hand. OpenAI offers different GPT base models through its API [42], such as babbage-002, davinci-002, and the GPT-3.5-turbo series (0125, 0613, 1106). Each model has different capabilities and resources allocated for language processing tasks. For example, davinci-002 is designed for programming-related tasks but incurs high computational costs, while babbage-002 offers a good balance between performance and cost. The GPT-3.5-turbo series is well suited to tasks requiring quick responses and lower computational resources. In this work, we adopted GPT-3.5-turbo as the base model for fine-tuning, based on its proven performance in accident prediction [43]. Specifically, we opted for GPT-3.5-turbo-1106, the latest and most performance-optimized version in the turbo series at the time of the experiments.
The selected GPT model is then fine-tuned with the filtered CSI data, as explained in Section 3.1. The processed data suitable for the GPT model are formatted into a training dataset using prompt engineering techniques. Prompt engineering involves designing and optimizing the prompts used to guide a language model, such as GPT, to produce desirable results. In this study, the goal is to develop a GPT-based classifier that predicts construction accident outcomes from specific construction conditions. Therefore, effective prompt engineering is crucial to ensuring that the GPT classifier accurately interprets the given construction conditions and reliably predicts accident outcomes.
In this study, we adopt prompt template engineering [44], which involves creating templates with placeholders for specific information, such as construction conditions and details about industrial accidents (e.g., deaths and injuries). These templates serve as structured guides that help the GPT model interpret and respond to prompts accurately. Additionally, the study uses prompt answer engineering, which focuses on designing the output format and selecting the method for answer design. By specifying the desired response format, this technique ensures that the GPT model provides consistent, relevant predictions and classifications.
We created the prompt by following a manual paraphrasing approach, simulating an interaction between an inspector and a GPT model to predict the outcome of an occupational incident based on specific construction conditions.
Figure 2 illustrates the prompt process. We defined a prompt function f that generates a prompt sentence p = f([I], [C], [P]) using the following template:
“Predict whether an incident results in: [I] based on the following construction conditions: [C] Occupational Incident Type: [P]”.
An example of a prompt used in the GPT fine-tuning training is:
“Predict whether an incident results in death or injury based on the following construction conditions-Months: February, Accident Reporting Time: Before Work, Public/Private sector: Private, Weather Condition: Clear, Temperature Condition: 9.0, Humidity Condition: 38.0, Work type-Major: Architecture, Work type-Subcategory: Masonry, Work Process: Others, Province: Busan, Construction Cost (KRW, Won): 100 billion won and above, Award Possibility Rate (%): 90% and above, Construction Period (days): 1155, Progress (%): 40~49%, Number of Workers: 100 to 299 people, Construction Safety Management Plan: Subject Site (Type 1 & 2 facilities), Design Safety Review: Non-Subject”.
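For fine-tuning via the OpenAI API, each such prompt/outcome pair must be serialized into the chat-style JSONL training format. The sketch below shows one way to do this; the helper name and the small set of condition fields are illustrative, not the full 18-feature prompt used in the study.

```python
import json

def to_finetune_record(conditions: dict, outcome: str) -> dict:
    """Format one accident case as a chat-style fine-tuning example.
    (Hypothetical helper; field names are a subset for illustration.)"""
    cond_text = ", ".join(f"{k}: {v}" for k, v in conditions.items())
    user_prompt = (
        "Predict whether an incident results in death or injury "
        f"based on the following construction conditions-{cond_text}"
    )
    return {
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": outcome},  # ground-truth label
        ]
    }

case = {"Months": "February", "Weather Condition": "Clear",
        "Work type-Major": "Architecture"}
line = json.dumps(to_finetune_record(case, "Injury"))
# Each record becomes one line of the JSONL file uploaded for fine-tuning.
```

Constraining the assistant message to a single label ("Injury" or "Death") is the prompt answer engineering step described above: it fixes the output format so the fine-tuned model returns a consistent, directly comparable classification.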
3.4. Evaluation
This study evaluated model performance using accuracy, precision, recall, and F1 score, which together provide a comprehensive view of a model’s effectiveness. These metrics are calculated from the counts of true positives (T.P.), false positives (F.P.), true negatives (T.N.), and false negatives (F.N.); their mathematical expressions are displayed in Equations (1)–(4). Accuracy measures the proportion of correct predictions among all predictions made by the model. Precision is the fraction of correctly predicted incidents among all incidents the model identified as a specific outcome. Recall is the ratio of correctly predicted incidents to the total number of actual incidents for that outcome [45].
Higher precision and recall both indicate better model performance, but the two often trade off: lowering the threshold to capture more positive incidents tends to raise recall while admitting more false positives and reducing precision. Because of this trade-off, the two metrics should be balanced to optimize overall performance. The F1 score, defined as the harmonic mean of precision and recall [46], measures this balance; a higher F1 score indicates better performance.
In addition to these threshold-dependent metrics, the receiver operating characteristic (ROC) curve was used to evaluate the classifiers’ discriminative ability across different classification thresholds. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), providing a threshold-independent assessment of model performance. The area under the ROC curve (AUC) summarizes the overall classification performance, where a higher AUC indicates better discrimination between fatality and injury cases.
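These metrics can be computed directly from the confusion-matrix counts and cross-checked against scikit-learn, as the sketch below shows with toy labels and scores (1 = fatality, 0 = injury; values are illustrative only). Note that the AUC is computed from the raw scores rather than the thresholded predictions, which is what makes it threshold-independent.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
)

# Toy ground truth and model scores (1 = fatality, 0 = injury).
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.7])
y_pred  = (y_score >= 0.5).astype(int)      # default 0.5 threshold

# Confusion-matrix counts, as in Equations (1)-(4).
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

# The manual values match scikit-learn's implementations.
assert np.isclose(accuracy,  accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall,    recall_score(y_true, y_pred))
assert np.isclose(f1,        f1_score(y_true, y_pred))

# AUC uses the continuous scores, not the 0/1 predictions.
auc = roc_auc_score(y_true, y_score)
```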
4. Results
4.1. Identification of the Optimal AutoML Classifier
AutoML was utilized to evaluate the performance of 16 ML algorithms in predicting whether an incident resulted in a fatality or an injury.
Table 1 ranks the models by accuracy, with the Extra Trees Classifier achieving the highest performance. The top five models, including the 5th-ranked Light Gradient Boosting Machine, all achieved accuracy scores exceeding 0.974, indicating minimal variance among them. In contrast, the 10th-ranked Ridge Classifier achieved an accuracy of 0.7689, and models ranked below 10th achieved accuracies below 0.8.
Out of the 16 models tested, the Extra Trees Classifier was selected for further analysis. Its accuracy was evaluated across different data scales and compared with that of the generative AI model.
When using AutoML for construction accident prediction, it is essential to examine the confusion matrix, particularly given the implications of false positives (FPs) and false negatives (FNs) in construction safety management. In this classification task, a true positive (TP) refers to a case where the model correctly predicts a fatality, while a true negative (TN) indicates a correctly identified injury case. Conversely, a false positive (FP) represents an instance where the model predicts a fatality when the outcome was an injury, and a false negative (FN) denotes an actual fatality that the model classifies as an injury.
From a safety management perspective, a high FN rate has the most critical practical implications. If the model fails to detect potential accidents, preventive safety measures may not be implemented adequately, increasing the likelihood of severe incidents at construction sites. On the other hand, a high FP rate results in excessive safety interventions, imposing unnecessary operational costs and potentially inefficient resource allocation. Therefore, an optimal model should aim to minimize FN while maintaining FP at an acceptable level to ensure safety, effectiveness, and operational efficiency. Because the balance between FPs and FNs varies with the classification threshold, ROC analysis was used to evaluate model performance across a range of thresholds.
Figure 3 is a ROC curve showing that both models achieved high TP rates across various thresholds, with the curves approaching the upper-left corner. This pattern reflects effective learning and strong classification performance, indicating that the models have learned to successfully distinguish between the target classes.
4.2. Comparative Results of AutoML and Generative AI Models
Based on the classification results in Table 1, the Extra Trees Classifier was selected as the optimal model, and its performance was evaluated incrementally to assess its predictive capabilities across different dataset sizes.
The performance of the Extra Trees classifier was evaluated using 19 datasets, with sizes ranging from 50 to 5000, maintaining a 1:1 ratio between fatalities and injuries. Since the original dataset contained only 321 fatality cases—fewer than the target of 5000—data augmentation was applied to increase the number of fatalities to 5000, addressing class imbalance. For each dataset, an equal number of ‘Death’ and ‘Injury’ instances were extracted for training. The performance of the GPT-Turbo 1106 model was also assessed using these same datasets.
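The construction of these balanced subsets can be sketched as follows; the column name `label` and the toy pool sizes are assumptions for illustration, standing in for the augmented CSI records.

```python
import pandas as pd

def balanced_subset(df: pd.DataFrame, size: int, seed: int = 42) -> pd.DataFrame:
    """Draw a 1:1 Death/Injury sample of the requested total size
    (illustrative sketch; column and label names are assumptions)."""
    per_class = size // 2
    parts = [
        df[df["label"] == lbl].sample(per_class, random_state=seed)
        for lbl in ("Death", "Injury")
    ]
    # Concatenate the two class samples and shuffle the rows.
    return pd.concat(parts).sample(frac=1, random_state=seed)

# Toy pool: 60 fatality rows and 300 injury rows.
pool = pd.DataFrame({"label": ["Death"] * 60 + ["Injury"] * 300,
                     "x": range(360)})
subset = balanced_subset(pool, size=100)
print(subset["label"].value_counts().to_dict())
```

Repeating this draw for sizes from 50 to 5000 (with the augmented fatality pool) yields the 19 evaluation datasets described above.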
Table 2 shows the performance of the Extra Trees classifier across various dataset sizes. The results indicate that accuracy improves with data size, from 0.4714 for 50 instances to 0.9984 for 5000 instances, exceeding 0.9 for datasets with at least 1000 instances.
Under internal evaluation with balanced datasets (1:1 ratio), the recall of the Extra Trees classifier consistently improved as dataset size increased. Recall rose from 0.4898 at 50 instances to 0.9984 at 5000 instances, demonstrating a strong capacity to detect fatality cases when sufficient training data were available. Notably, recall exceeded 0.90 once the dataset size reached 1000 instances and approached near-perfect detection beyond 3000 instances. In contrast, the GPT-Turbo 1106 model exhibited more variable recall performance across dataset sizes. At very small dataset configurations (50 and 100 instances), recall was 0.0000, indicating a failure to correctly identify fatality cases. However, performance improved substantially beginning at 200 instances, reaching 0.8830, and peaked at 0.9420 at 400 instances. Beyond 1000 instances, recall fluctuated between approximately 0.61 and 0.83, without a consistent upward trend comparable to the Extra Trees classifier.
Overall, under controlled and balanced internal conditions, the Extra Trees classifier demonstrated superior and more stable recall performance compared to the GPT-based model, particularly as the dataset size increased. This suggests that structured ensemble learning approaches may provide stronger fatality detection capability when trained with sufficiently large and balanced datasets.
The GPT-based classifier shows a different pattern. For the smallest dataset of 50 instances, accuracy is 0.500, while for the largest dataset of 5000 instances it increases to 0.756. Notably, accuracy surpasses 0.7 for datasets with at least 1000 instances. However, unlike AutoML, the GPT-based classifier's accuracy drops to approximately 0.593 at 500 instances.
Figure 4 illustrates the accuracy trends for both the Extra Trees and GPT-Turbo models. Both models show lower accuracy with smaller datasets, but accuracy improves as the dataset size increases. The maximum accuracy for AutoML is 0.9984, while GPT-Turbo reaches 0.756, a difference of 0.2428.
These findings confirm the practicality of the GPT-Turbo model, which, despite lower accuracy compared to AutoML, offers significant ease of use due to its reduced requirements for data preprocessing and model implementation.
4.3. External Validation
External validation was conducted using the Extra Trees Classifier and the generative AI-based classifier (GPT) to evaluate their predictive performance on newly observed construction accident data. The training dataset consisted of construction accident records from 2019 to 2023, including fatality and injury cases. The external validation dataset comprised construction accident data from 2024, which included 3561 instances after applying the same preprocessing procedures as the training data, such as handling missing values and removing outliers. Among these, 226 instances represented fatalities, while 3335 instances corresponded to injury cases.
Table 3 presents the performance of the Extra Trees Classifier across various dataset sizes. The results indicate that accuracy improves as the dataset size increases, ranging from 0.4653 with 50 instances to 0.9450 with 3000 instances. However, compared to the original training dataset (2019–2023), the accuracy for external validation decreased. Notably, the overall decline in accuracy for external validation was less pronounced in the GPT-based classifier compared to AutoML, with only a marginal decrease from the original training dataset (2019–2023).
In terms of recall under external validation, distinct differences emerged between the two models. For the Extra Trees classifier, recall gradually increased with dataset size, rising from 0.4838 for 50 instances to 0.9459 for 3000 instances. However, compared to the near-perfect recall observed in the original training evaluation, the external validation results indicate a noticeable decline in fatality detection performance, particularly in smaller training configurations. This suggests that the structured ensemble model may be sensitive to distributional shifts between historical and newly observed accident data. In contrast, the GPT-based classifier demonstrated relatively strong recall even in moderate dataset sizes, reaching values above 0.87 from 200 to 800 instances and maintaining recall above 0.76 at 3000 instances. Although overall accuracy remained lower than that of AutoML, the GPT model exhibited more stable fatality detection performance across varying dataset sizes under external validation. From a safety-oriented perspective, this stability in recall may reduce the risk of false negatives when applied to newly emerging construction accident data.
Figure 5 illustrates the accuracy trends for both the Extra Trees Classifier and the GPT-Turbo model by dataset size. Both models show lower accuracy with smaller datasets, with accuracy improving as the dataset size increases, consistent with patterns observed in the original training phase. Notably, for small-scale datasets (up to 300 instances), the GPT-Turbo model achieves higher accuracy than the AutoML classifier.
5. Discussion
This study compared AutoML-based classifiers and a generative AI-based approach for construction accident severity prediction, focusing on predictive performance, usability, and robustness under external validation. The results confirm that AutoML, particularly ensemble-based models, achieves superior predictive accuracy when trained and tested under controlled conditions. This finding is consistent with prior research highlighting the effectiveness of ensemble learning for accident prediction. However, the high accuracy of AutoML is accompanied by substantial requirements for data preprocessing, feature engineering, and continuous model maintenance, which may limit its practical applicability in real-world construction safety management environments.
In contrast, the generative AI-based classifier demonstrated lower predictive accuracy but offered notable advantages in usability and operational flexibility. The GPT-based approach required minimal data preprocessing and enabled the integration of new accident cases through prompt-based inputs, thereby reducing the technical burden associated with model deployment and updating. These characteristics are particularly relevant in construction contexts, where safety managers often operate under time constraints and may lack specialized expertise in ML or data engineering.
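The prompt-based workflow described above can be illustrated with a minimal sketch. The prompt wording, record fields, and model identifier below are illustrative assumptions, not the exact configuration used in the study; the API call itself is shown commented out so the sketch stays self-contained.

```python
# Illustrative sketch of prompt-based severity classification: a raw accident
# record is formatted directly into a prompt, with no feature engineering.

def build_prompt(record: dict) -> str:
    """Turn one accident record into a classification prompt."""
    lines = [f"- {k}: {v}" for k, v in record.items()]
    return (
        "Classify the severity of the following construction accident as "
        "either FATALITY or INJURY. Answer with a single word.\n"
        + "\n".join(lines)
    )

def parse_label(reply: str) -> int:
    """Map the model's free-text reply to a binary label (1 = fatality)."""
    return 1 if "FATALITY" in reply.upper() else 0

# Hypothetical record; real fields come from the accident database.
record = {"work type": "formwork installation", "height": "5 m scaffold",
          "weather": "rain", "protective equipment": "harness not worn"}
prompt = build_prompt(record)

# The actual call would go through a chat-completion API, e.g. (not run here):
# reply = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": prompt}],
# ).choices[0].message.content
reply = "FATALITY"  # placeholder response for the sketch
label = parse_label(reply)
```

Because new cases enter through the prompt text, updating the model amounts to editing strings rather than re-running an encoding and retraining pipeline, which is the usability advantage discussed above.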
To provide a more structured interpretation of usability, implementation complexity was examined in terms of preprocessing stages and configuration requirements. The AutoML pipeline required multiple sequential steps, including data cleaning, feature encoding, class balancing (SMOTE), and hyperparameter optimization prior to model training. In contrast, the GPT-based approach primarily required prompt design and minimal data formatting. Although this comparison does not constitute a fully quantitative usability metric, it offers a clearer basis for evaluating implementation burden. A comprehensive quantitative assessment of deployment time and computational cost remains an avenue for future research.
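The sequential pipeline steps listed above (encoding, class balancing, model fitting) can be made concrete with a short sketch. To keep the example dependency-free, SMOTE is approximated here by a simple interpolation-based oversampler; the study's pipeline uses the full SMOTE algorithm, and the categorical features below are hypothetical.

```python
# Minimal sketch of an AutoML-style pipeline: one-hot encoding, a simplified
# SMOTE-like oversampler, then ensemble fitting. Illustrative data only.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

def smote_like(X, y, minority=1):
    """Oversample the minority class by interpolating between random pairs
    of minority samples (a simplified stand-in for SMOTE)."""
    X_min = X[y == minority]
    need = (y != minority).sum() - len(X_min)
    i = rng.integers(0, len(X_min), need)
    j = rng.integers(0, len(X_min), need)
    lam = rng.random((need, 1))
    synthetic = X_min[i] + lam * (X_min[j] - X_min[i])  # convex combinations
    return (np.vstack([X, synthetic]),
            np.concatenate([y, np.full(need, minority)]))

# Hypothetical categorical accident features -> one-hot encoding
raw = np.array([["fall", "rain"], ["struck-by", "clear"],
                ["fall", "clear"], ["caught-in", "rain"]] * 25)
y = np.array([1, 0, 0, 0] * 25)  # imbalanced: fatalities are the minority

X = OneHotEncoder().fit_transform(raw).toarray()
X_bal, y_bal = smote_like(X, y)
assert (y_bal == 1).sum() == (y_bal == 0).sum()  # classes now balanced

model = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
```

Each of these stages must be re-run and re-validated whenever new data arrives, which is the maintenance burden contrasted with prompt design in the paragraph above.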
The external validation results further highlight differences in model generalization. While the performance of the AutoML-based classifier declined when applied to newly observed accident data, the generative AI-based model maintained relatively stable accuracy with only limited degradation. This contrast suggests that generative AI may be less sensitive to changes in data distribution and preprocessing consistency, making it potentially more suitable for applications that require frequent data updates and adaptation to evolving operational conditions. The decline observed for AutoML may be attributable to distributional shifts between the training data (2019–2023) and the 2024 external dataset: changes in project characteristics, reporting practices, or environmental conditions can induce feature drift, which disproportionately affects structured ensemble models that rely on stable preprocessing pipelines. Monitoring such drift and periodically recalibrating the AutoML model may therefore be necessary for reliable field deployment. From an implementation perspective, reduced dependence on rigid preprocessing pipelines and feature compatibility may also enhance the feasibility of deploying generative AI within operational safety systems.
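One common way to monitor the feature drift discussed above is the population stability index (PSI), computed per feature between the training years and newly observed data. The 0.2 threshold below is a widely used rule of thumb, not a value taken from the study, and the data is synthetic.

```python
# Sketch of drift monitoring with the population stability index (PSI):
# compare a feature's training-era distribution against new observations.
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between a training-era feature and its newly observed values."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover out-of-range values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
train_feature = rng.normal(0.0, 1.0, 5000)   # e.g. a 2019-2023 feature
new_feature = rng.normal(0.6, 1.0, 3561)     # same feature, drifted, in 2024

score = psi(train_feature, new_feature)
print(f"PSI = {score:.3f} ->", "recalibrate" if score > 0.2 else "stable")
```

Tracking such a score per feature as 2024-style data accumulates would give safety managers an explicit trigger for when the AutoML model needs recalibration.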
In the context of construction safety management, predictive modeling requires a balance between predictive accuracy and operational feasibility in real-world applications. High accuracy alone does not guarantee practical usefulness if a model is difficult to implement, update, or sustain in field environments. The findings of this study suggest that generative AI can offer practical advantages in situations where ease of use, adaptability to new data, and robustness under external validation are prioritized, even at the expense of some predictive accuracy.
It is important to acknowledge that the methodological configurations of AutoML and the GPT-based model are not strictly symmetric. AutoML leverages structured preprocessing, class balancing techniques such as SMOTE, and optimized ensemble pipelines to maximize predictive accuracy, whereas the generative AI approach relies primarily on prompt-based classification with minimal preprocessing. However, both models were trained and evaluated using identical training–testing splits and the same external validation dataset to ensure experimental consistency. The objective of this comparison was not to establish strict methodological equivalence, but rather to examine practical trade-offs between performance-optimized structured pipelines and deployment-oriented generative AI models under realistic operational conditions.
Several limitations of this study should be acknowledged. The generative AI-based model was evaluated only for binary classification of accident severity, and its applicability to multi-class or more detailed severity prediction remains untested. In addition, although data augmentation techniques were applied to mitigate class imbalance, synthetic data may not fully reflect the complexity and variability of real-world construction accidents. Future research should therefore explore more complex prediction tasks, incorporate diverse datasets from different regions, and investigate deeper integration of generative AI into operational decision-support frameworks for construction safety management.
6. Conclusions
This study comparatively evaluated AutoML-based models and a generative AI-based approach for construction accident severity prediction, considering predictive performance, robustness under external validation, and operational usability. The findings reveal important differences in performance characteristics and practical feasibility between structured ensemble learning pipelines and prompt-based generative AI models.
The main conclusions of this study are as follows. First, AutoML-based models demonstrated superior predictive accuracy under controlled training conditions. In particular, the Extra Trees classifier achieved the highest performance, confirming the effectiveness of ensemble learning approaches when sufficient structured data and preprocessing are available. However, this performance advantage was reduced during external validation using newly observed accident data, indicating sensitivity to distributional changes. Second, the generative AI-based model exhibited lower overall predictive accuracy but maintained relatively stable performance under external validation and across smaller dataset scales. These findings highlight a trade-off between maximizing predictive accuracy and ensuring adaptability to evolving real-world construction environments. Third, from a practical standpoint, the generative AI approach offers advantages in usability and deployment flexibility. The GPT-based model required minimal data preprocessing and technical configuration, facilitating integration into real-world safety management workflows and lowering technical barriers for practitioners. Finally, the study provides empirical evidence that generative AI can serve as a complementary approach in construction accident prediction, particularly in contexts where adaptability, ease of implementation, and operational feasibility are prioritized. Future research should explore hybrid modeling strategies and validate performance across diverse international datasets and multi-class accident severity scenarios.
Author Contributions
Conceptualization, S.S. and K.K.; methodology, S.S.; software, S.S. and D.O.; validation, S.S., D.O., K.K. and H.P.; formal analysis, S.S.; investigation, S.S., D.O. and H.P.; resources, J.J.; data curation, S.S. and D.O.; writing—original draft preparation, S.S.; writing—review and editing, S.S., K.K., H.P. and J.J.; visualization, S.S.; supervision, K.K. and J.J.; project administration, K.K.; funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT), grant number RS-2025-00558613.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Acknowledgments
The authors are very grateful to the reviewers for carefully reading the manuscript and providing valuable suggestions.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Workers Memorial Day: Remembering the 5486 Workers Who Died on the Job in 2022. Available online: https://www.bls.gov/opub/ted/2024/workers-memorial-day-remembering-the-5486-workers-who-died-on-the-job-in-2022.htm (accessed on 22 October 2024).
- Manzo, J. The $5 Billion Cost of Construction Fatalities in the United States: A 50 State Comparison; Midwest Economic Policy Institute: United States of America, 2017. Available online: https://www.midwestepi.org/2017/05/08/the-5-billion-cost-of-construction-fatalities-in-the-united-states-a-50-state-comparison/ (accessed on 5 January 2026).
- Gao, Y.; González, V.A.; Yiu, T.W.; Cabrera-Guerrero, G.; Li, N.; Baghouz, A.; Rahouti, A. Immersive virtual reality as an empirical research tool: Exploring the capability of a machine learning model for predicting construction workers’ safety behaviour. Virtual Real. 2022, 26, 361–383. [Google Scholar] [CrossRef]
- Zhang, M.; Shu, L.; Luo, X.; Yuan, M.; Zheng, X. Virtual reality technology in construction safety training: Extended technology acceptance model. Autom. Constr. 2022, 135, 104113. [Google Scholar] [CrossRef]
- Chen, H.; Mao, Y.; Xu, Y.; Wang, R. The Impact of Wearable Devices on the Construction Safety of Building Workers: A Systematic Review. Sustainability 2023, 15, 11165. [Google Scholar] [CrossRef]
- Nnaji, C.; Awolusi, I.; Park, J.; Albert, A. Wearable Sensing Devices: Towards the Development of a Personalized System for Construction Safety and Health Risk Mitigation. Sensors 2021, 21, 682. [Google Scholar] [CrossRef]
- Fang, Q.; Li, H.; Luo, X.; Ding, L.; Luo, H.; Rose, T.M.; An, W. Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Autom. Constr. 2018, 85, 1–9. [Google Scholar] [CrossRef]
- Nath, N.D.; Behzadan, A.H.; Paal, S.G. Deep learning for site safety: Real-time detection of personal protective equipment. Autom. Constr. 2020, 112, 103085. [Google Scholar] [CrossRef]
- Linner, T.; Pan, M.; Pan, W.; Taghavi, M.; Pan, W.; Bock, T. Identification of Usage Scenarios for Robotic Exoskeletons in the Context of the Hong Kong Construction Industry. In Proceedings of the 35th ISARC, Berlin, Germany, 20–25 July 2018. [Google Scholar] [CrossRef]
- Golabchi, A.; Zindashti, N.J.; Miller, L.; Rouhani, H.; Tavakoli, M. Performance and effectiveness of a passive back-support exoskeleton in manual material handling tasks in the construction industry. Constr. Robot. 2023, 7, 77–88. [Google Scholar] [CrossRef]
- Zhu, R.; Hu, X.; Hou, J.; Li, X. Application of machine learning techniques for predicting the consequences of construction accidents in China. Process Saf. Environ. Prot. 2021, 145, 293–302. [Google Scholar] [CrossRef]
- Mostofi, F.; Toğan, V. Predicting Construction Accident Outcomes Using Graph Convolutional and Dual-Edge Safety Networks. Arab. J. Sci. Eng. 2024, 49, 13315–13332. [Google Scholar] [CrossRef]
- Kim, J.-M.; Lim, K.-K.; Yum, S.-G.; Son, S. A Deep Learning Model Development to Predict Safety Accidents for Sustainable Construction: A Case Study of Fall Accidents in South Korea. Sustainability 2022, 14, 1583. [Google Scholar] [CrossRef]
- Lee, J.Y.; Yoon, Y.G.; Oh, T.K.; Park, S.; Ryu, S.I. A Study on Data Pre-Processing and Accident Prediction Modelling for Occupational Accident Analysis in the Construction Industry. Appl. Sci. 2020, 10, 7949. [Google Scholar] [CrossRef]
- Alkaissy, M.; Arashpour, M.; Golafshani, E.M.; Hosseini, M.R.; Khanmohammadi, S.; Bai, Y.; Feng, H. Enhancing construction safety: Machine learning-based classification of injury types. Saf. Sci. 2023, 162, 106102. [Google Scholar] [CrossRef]
- Luo, X.; Li, X.; Goh, Y.M.; Song, X.; Liu, Q. Application of machine learning technology for occupational accident severity prediction in the case of construction collapse accidents. Saf. Sci. 2023, 163, 106138. [Google Scholar] [CrossRef]
- Tixier, A.J.-P.; Hallowell, M.R.; Rajagopalan, B.; Bowman, D. Application of machine learning to construction injury prediction. Autom. Constr. 2016, 69, 102–114. [Google Scholar] [CrossRef]
- Toğan, V.; Mostofi, F.; Ayözen, Y.E.; Tokdemir, O.B. Customized AutoML: An Automated Machine Learning System for Predicting Severity of Construction Accidents. Buildings 2022, 12, 1933. [Google Scholar] [CrossRef]
- Alzubaidi, L.; Bai, J.; Al-Sabaawi, A.; Santamaría, J.; Albahri, A.S.; Al-dabbagh, B.S.N.; Fadhel, M.A.; Manoufali, M.; Zhang, J.; Al-Timemy, A.H.; et al. A survey on deep learning tools dealing with data scarcity: Definitions, challenges, solutions, tips, and applications. J. Big Data 2023, 10, 46. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhang, J.; Mai, W.; Zhang, X. A comprehensive overview of Generative AI (GAI): Technologies, applications, and challenges. Neurocomputing 2025, 632, 129645. [Google Scholar] [CrossRef]
- Ding, C. Artificial intelligence in multimedia content generation. J. Soc. Inf. Disp. 2025, 34, 49–67. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhang, J.; Zhang, X.; Mai, W. Text-to-video generators: A comprehensive survey. J. Big Data 2025, 12, 253. [Google Scholar] [CrossRef]
- Cao, Y. A survey of AI-generated content (AIGC). ACM Comput. Surv. 2025, 57, 1–38. [Google Scholar] [CrossRef]
- Ma, W. Video diffusion generation: Comprehensive review and advances. Artif. Intell. Rev. 2025, 58, 338. [Google Scholar] [CrossRef]
- Sengar, S.S.; Hasan, A.B.; Kumar, S.; Carroll, F. Generative Artificial Intelligence: A Systematic Review and Applications. Multimed. Tools Appl. 2024, 84, 23661–23700. [Google Scholar] [CrossRef]
- Pan, Y.; Zhang, L. Roles of artificial intelligence in construction engineering and management: A critical review and future trends. Autom. Constr. 2021, 122, 103517. [Google Scholar] [CrossRef]
- Rafsanjani, H.N.; Nabizadeh, A.H. Towards human-centered artificial intelligence (AI) in architecture, engineering, and construction (AEC) industry. Comput. Hum. Behav. Rep. 2023, 11, 100319. [Google Scholar] [CrossRef]
- Zheng, J.; Fischer, M. BIM-GPT: A Prompt-Based Virtual Assistant Framework for BIM Information Retrieval. arXiv 2023, arXiv:2304.09333. [Google Scholar] [CrossRef]
- Yoo, B.; Kim, J.; Park, S.; Ahn, C.R.; Oh, T. Harnessing Generative Pre-Trained Transformers for Construction Accident Prediction with Saliency Visualization. Appl. Sci. 2024, 14, 664. [Google Scholar] [CrossRef]
- Oguz Erkal, E.D.; Hallowell, M.R.; Ghriss, A.; Bhandari, S. Predicting Serious Injury and Fatality Exposure Using Machine Learning in Construction Projects. J. Constr. Eng. Manag. 2024, 150, 04023169. [Google Scholar] [CrossRef]
- Koc, K.; Ekmekcioğlu, Ö.; Gurgun, A.P. Developing a National Data-Driven Construction Safety Management Framework with Interpretable Fatal Accident Prediction. J. Constr. Eng. Manag. 2023, 149, 04023010. [Google Scholar] [CrossRef]
- Koc, K.; Ekmekcioğlu, Ö.; Gurgun, A.P. Construction safety predictions with multi-head attention graph and sparse accident networks. Autom. Constr. 2023, 156, 105102. [Google Scholar] [CrossRef]
- Hall, A.T.; Durdyev, S.; Koc, K.; Ekmekcioğlu, Ö.; Tupenaite, L. Scenario-based automated data preprocessing to predict severity of construction accidents. Autom. Constr. 2022, 140, 104351. [Google Scholar] [CrossRef]
- Koc, K.; Ekmekcioğlu, Ö.; Gurgun, A.P. Accident prediction in construction using hybrid wavelet-machine learning. Autom. Constr. 2022, 133, 103987. [Google Scholar] [CrossRef]
- Ahmed, S.; Hossain, M.A.; Ray, S.K.; Bhuiyan, M.M.I.; Sabuj, S.R. A study on road accident prediction and contributing factors using explainable machine learning models: Analysis and performance. Transp. Res. Interdiscip. Perspect. 2023, 19, 100814. [Google Scholar] [CrossRef]
- Aboulola, O.I. Improving traffic accident severity prediction using MobileNet transfer learning model and SHAP XAI technique. PLoS ONE 2024, 19, e0300640. [Google Scholar] [CrossRef]
- Koc, K.; Ekmekcioğlu, Ö.; Gurgun, A.P. Prediction of construction accident outcomes based on an imbalanced dataset through integrated resampling techniques and machine learning methods. Eng. Constr. Arch. Manag. 2023, 30, 4486–4517. [Google Scholar] [CrossRef]
- Smetana, M.; de Salles, L.S.; Sukharev, I.; Khazanovich, L. Application of Large Language Models for Safety Risk Perception Extraction in Highway Construction Scenarios. Appl. Sci. 2024, 14, 1352. [Google Scholar] [CrossRef]
- Abbasianjahromi, H.; Aghakarimi, M. Safety performance prediction and modification strategies for construction projects via machine learning techniques. Eng. Constr. Arch. Manag. 2021, 30, 1146–1164. [Google Scholar] [CrossRef]
- CSI (Construction Safety Information System). Introduction to the Construction Safety Information System. Available online: https://www.csi.go.kr/intro.do (accessed on 22 October 2025).
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Ubani, S.; Polat, S.O.; Nielsen, R. ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT. arXiv 2023, arXiv:2304.14334. [Google Scholar] [CrossRef]
- Møller, A.G.; Dalsgaard, J.A.; Pera, A.; Aiello, L.M. The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024. [Google Scholar] [CrossRef]
- Smetana, M.; De Salles, L.S.; Khazanovich, L. Improving Large Language Model Assisted Categorization and Classification of Highway Construction Accidents from OSHA Databases. Preprint 2024. [Google Scholar] [CrossRef]
- Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
- Caruccio, L.; Cirillo, S.; Polese, G.; Solimando, G.; Sundaramurthy, S.; Tortora, G. Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst. Appl. 2024, 235, 121186. [Google Scholar] [CrossRef]