KLUE-BERT-Based Classification of Project Ownership in Korean Construction Accident Records for Comparative Safety Analysis of Public and Private Projects

Lee, Hye Min; Shin, Seung-Hyeon; Won, Jeong-Hun; Kim, Moon Gyu

doi:10.3390/buildings16071393

¹ Ministry of Employment and Labor, Sejong 30117, Republic of Korea

² Department of Safety Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea

* Authors to whom correspondence should be addressed.

Buildings 2026, 16(7), 1393; https://doi.org/10.3390/buildings16071393

Submission received: 12 February 2026 / Revised: 28 March 2026 / Accepted: 30 March 2026 / Published: 1 April 2026

(This article belongs to the Special Issue Next-Gen Risk Management: AI-Driven Solutions for Engineering and Construction Projects)

Abstract

Project ownership is a critical factor that shapes safety management systems and accident patterns in construction. However, the Ministry of Employment and Labor (MOEL) industrial accident database, which is the largest construction accident database in Korea, does not include project ownership information. To address this limitation, this study developed a fine-tuned KLUE-BERT framework that automatically classifies project ownership using unstructured text fields (site name, client name, and company name) in MOEL data. Training data were constructed through manual classification of the 2018–2023 approved statistics and data augmentation. The proposed model achieved a weighted F1-score of 0.9876 on the test set, compared with 0.5511 for a rule-based baseline. Multilayered statistical analyses were conducted using the classified 2014–2023 construction accident data across six key accident variables: accident type, accident cause, construction scale, accident severity, occupation, and worker tenure. The results revealed statistically significant associations between project ownership and all six variables. Public projects exhibited relatively high proportions of accidents involving construction machinery and vehicles, whereas private projects exhibited higher proportions of fall- and scaffold-related accidents. This study presents a novel artificial intelligence-based framework that generates analytical variables absent from the original data and demonstrates its utility through large-scale statistical analysis. The findings provide empirical evidence to support the development of project ownership-specific construction safety policies. Limitations include potential data leakage from pre-split augmentation and generalizability limited to Korean construction data.

Keywords: construction safety; project ownership classification; public–private comparison; KLUE-BERT; natural language processing; accident data analytics

1. Introduction

The construction industry plays a pivotal role in economic growth and infrastructure development worldwide, yet it has one of the highest occupational accident rates among industries [1,2]. According to the International Labour Organization (ILO), although construction workers account for only approximately 7% of the total industrial workforce, fatal accidents in the construction industry account for approximately 30% of all occupational fatalities [3]. In Korea, the construction industry’s fatal accident rate per 10,000 workers in 2022 was 1.61, which was approximately 3.7 times the average across all industries (0.43) [4]. High accident rates not only threaten workers’ lives and health but also lead to economic losses, including construction delays and cost increases [5].

Understanding differences in accident occurrence patterns based on the structural characteristics of projects is essential for preventing construction accidents [6,7]. The ownership of construction projects, that is, the distinction between public and private projects, is a key variable associated with structural differences in on-site safety management and accident occurrence patterns [8,9]. Public projects are subject to stringent safety management regulations under the National Contract Act and the Construction Technology Promotion Act, and a direct supervisory system is implemented by the commissioning authority. Public projects are predominantly large-scale civil engineering works, such as roads, bridges, tunnels, and dams, which involve extensive use of construction machinery and transportation vehicles. In addition, the mandatory appointment of safety managers based on construction scale under the Occupational Safety and Health Act is strictly enforced [10].

In contrast, private projects operate under a relatively autonomous safety management system, and because of the high proportion of small-scale building construction, safety management gaps are prone to occur in projects below the threshold for mandatory safety-manager appointments [10,11]. Structural differences in legal regulations, work-type composition, and safety management levels according to project ownership may have different effects on accident type, cause, and severity, and the formulation of safety management strategies tailored to each project ownership type is required. For example, public road and bridge construction sites involve frequent deployment of heavy construction machinery, leading to elevated risks of machinery-related accidents, whereas private apartment building sites, with their high proportion of finishing works, are associated with concentrated risks of falls and scaffolding-related accidents [12,13,14]. These practical differences underscore the need for ownership-specific empirical analysis, yet such analysis has been constrained by data availability.

Studies analyzing the differences in construction accident characteristics according to project ownership have been conducted in several countries, including Taiwan, Singapore, and Korea [7,11,12,13,14,15]. Cheng et al. [12,13] identified differences in accident patterns between public and nonpublic projects in Taiwan through association rule mining, and Ling et al. [14] reported differences in fatality rates between public and private projects in Singapore. However, existing studies have been limited to cases in which project ownership information is included in the source data, and analyses of large-scale databases lacking project ownership information have not been conducted.

The MOEL occupational accident data—the most extensive construction accident database in Korea—were collected under the Occupational Safety and Health Act and contain approximately 270,000 construction industry accident records spanning 2014–2023. A wide range of variables, including workplace, worker, and accident occurrence characteristics, were recorded; however, project ownership information distinguishing between public and private projects was not included [4]. The Ministry of Land, Infrastructure and Transport (MOLIT) Construction Safety Information (CSI) system contains project ownership information. However, compared to the MOEL data, its collection scale is limited, and the diversity of accident variables is insufficient. Consequently, a structural limitation remains in which accident characteristics of public and private projects cannot be analyzed using the most comprehensive construction accident data in Korea.

Recently, artificial intelligence (AI) and natural language processing (NLP) have been actively utilized in the construction safety domain for hazard identification, accident-type classification, and accident prediction [16,17,18,19,20,21,22,23,24,25,26,27,28,29]. In particular, fine-tuned pretrained language models, such as BERT, have demonstrated strong performance in classifying unstructured text in construction accident reports [20,24,28]. However, existing AI/NLP-based construction safety research has primarily focused on the automatic classification of variables already available in the source data. To date, no study has used AI models to generate new analytical variables absent from the original database and subsequently incorporate them as key variables into large-scale statistical analysis.

The objectives of this study are (1) to develop a fine-tuned KLUE-BERT framework that automatically classifies project ownership from unstructured text information (site name, client name, and company name) in MOEL construction accident data and (2) to systematically compare and analyze the accident characteristics of public and private projects across six key accident variables (accident type, accident cause, construction scale, accident severity, occupation, and worker tenure) using 245,998 classified records. By integrating AI-based text classification with multilayered statistical analysis, this study aims to address the limitations of source data and provide empirical evidence for the development of safety management policies tailored to project ownership. The proposed framework can support targeted safety interventions, improve regulatory resource allocation, and ultimately contribute to reducing accident-related costs and enhancing worker safety in the construction industry.

Specifically, this study addresses the following three research questions:

RQ1: Can a fine-tuned KLUE-BERT model accurately classify project ownership from unstructured text fields in accident records, achieving classification performance sufficient for reliable downstream statistical analysis?
RQ2: Do significant structural differences exist in accident characteristics (accident type, accident cause, construction scale, accident severity, occupation, and worker tenure) between public and private projects?
RQ3: Can the classified data provide quantitative evidence—in terms of odds ratios and trend statistics—to support the formulation of ownership-specific safety management strategies?

RQ1 is evaluated by classification metrics (F1-score, precision, recall) benchmarked against a rule-based baseline; RQ2 by chi-square tests with Bonferroni correction and effect sizes (Cramér’s V); and RQ3 by odds ratios with 95% confidence intervals, adjusted standardized residuals, and Cochran-Armitage trend tests.

The remainder of this paper is organized as follows. Section 2 reviews the literature on construction accident analyses based on project ownership and NLP-based text classification. Section 3 describes the research framework, data collection and preprocessing, classification model development, and statistical analysis methods. Section 4 presents the classification model performance and statistical analysis results. Section 5 discusses the findings and practical implications, and Section 6 presents conclusions.

2. Literature Review

2.1. Construction Accident Analysis by Project Ownership

The ownership of construction projects influences their legal framework, safety management standards, and supervisory structure, thereby affecting accident occurrence patterns [8,9,10,11]. Differences in construction accident characteristics across project ownership types have been examined in a limited number of countries, primarily Taiwan and Singapore.

In Taiwan, Cheng et al. [12] analyzed 1347 construction accident records using association rule mining and identified that although falls were the most frequent accident type in both public and non-government projects, caught-in-between accidents exhibited a relatively higher proportion in public projects. In a subsequent study, Cheng et al. [13] extended this analysis to approximately 50,000 accident records in small construction enterprises and reported that public projects exhibited a higher fatality rate, with the interaction between construction scale and project ownership significantly influencing accident severity patterns. In Singapore, Ling et al. [14] analyzed approximately 2000 construction fatality records and reported differences in fatality rates between government and private industry projects, while noting that project scale and work type may act as confounding variables.

In Korea, Jo et al. [15] analyzed MOEL construction accident data from 2011 to 2015 and reported trends in incidence and mortality rates by gender, age, construction scale, and accident type, finding that larger construction scales were associated with lower incidence and mortality rates and that falls were the most frequent accident type across all years; however, project ownership was not included as an analytical variable. Yoon et al. [7] applied 4M analysis and association rule mining to KOSHA data to identify accident patterns by major construction occupations and proposed occupation-specific safety management measures, but similarly did not classify or analyze accidents by project ownership type. Studies using MOLIT CSI data have derived accident patterns and scenarios across work types [7,11], and differences in safety behavior perception and safety management levels across project ownership types have also been reported [9,30].

A targeted literature search confirmed that, beyond Cheng et al. [12,13] in Taiwan and Ling et al. [14] in Singapore, no dedicated studies comparing construction accident characteristics between public and private projects were identified in the international literature. This scarcity underscores the novelty of the present study. Furthermore, existing studies were limited to cases in which project ownership information was already included in the source data, and no study has addressed the challenge of classifying project ownership in large-scale databases where this variable is absent.

2.2. NLP-Based Text Classification in Construction Safety Domain

NLP is increasingly utilized to analyze unstructured text in construction accident data. The evolution of NLP-based construction safety research can be broadly categorized into three generations.

The first generation relied on traditional machine learning techniques. Tixier et al. [16,17] pioneered the use of TF-IDF-based feature extraction combined with support vector machines and random forests for classifying injury types and causes from construction accident reports. Subsequent studies employed XGBoost and similar ensemble methods for predicting accident types and injury locations [19,20]. Despite achieving reasonable performance, these approaches required extensive manual feature engineering and exhibited limited scalability to diverse text types.

The second generation introduced deep learning and distributed word representations. Baker et al. [21] applied Word2Vec and Doc2Vec embeddings to construction accident report classification, reducing the dependence on manual feature engineering. Zhong et al. [22] employed convolutional neural networks for construction hazard classification, while hierarchical attention networks and topic modeling techniques such as latent Dirichlet allocation and BERTopic were applied to identify latent patterns in accident narratives [22,24].

The third generation leverages fine-tuning of pretrained language models, particularly BERT. Zhou et al. [29] demonstrated that BERT-based classification of construction accident types outperformed conventional machine learning and deep learning approaches. Kumi et al. [31] applied BERT fine-tuning to Korean construction accident data, confirming the feasibility of transfer learning for domain-specific Korean text classification. These studies have been published in leading journals including Automation in Construction and Safety Science, establishing BERT fine-tuning as the current state-of-the-art approach for construction text classification [25,29,31].

However, a critical limitation persists across all three generations: existing studies focus exclusively on the automatic classification of variables already present in the source data. No case has been reported in which NLP was used to generate analytical variables absent from the source data and the generated variables were subsequently utilized as key variables in large-scale statistical analysis. The present study introduces this new paradigm.

2.3. Korean Pretrained Language Models and KLUE-BERT

BERT is a pretrained language model built on the transformer encoder architecture. It learns bidirectional contextual information and acquires general-purpose language representations through two pretraining tasks, masked language modeling (MLM) and next sentence prediction (NSP) [32,33]. Multilingual BERT covers more than 100 languages; however, language-specific models tend to perform better for morphologically complex, agglutinative languages such as Korean [34]. KLUE-BERT is a Korean-specific pretrained model developed for the Korean Language Understanding Evaluation (KLUE) benchmark. It was pretrained on a large-scale Korean corpus and is designed to better represent Korean syntactic structures and word-order variations [34].

Text fields such as site name, client name, and company name in construction accident data consist of unstructured Korean text in which abbreviations, nonstandard notations, and proper nouns are intermixed. Accurately distinguishing between public institution and private company names requires a strong understanding of Korean context, for which KLUE-BERT is more suitable than multilingual or English-based models.

2.4. Research Gaps

Prior literature reveals three research gaps. First, comparative studies of construction accident characteristics between public and private projects remain extremely limited, with only three studies identified in Taiwan [12,13] and Singapore [14], all of which relied on databases where project ownership was already recorded as a variable. No NLP-based framework has been developed that automatically classifies project ownership in large-scale databases lacking this information, such as MOEL construction accident data. Second, NLP-based construction safety research has progressed from traditional machine learning [16,17,18,19,20] through deep learning [21,22,23,24] to pretrained language model fine-tuning [25,29,31], yet all approaches focus on classifying variables already present in the source data. The paradigm of generating new analytical variables through NLP models and utilizing the generated variables as key variables in large-scale statistical analysis remains unexplored. Third, although the Korean MOEL database contains over 245,000 construction accident records spanning a decade (2014–2023), the absence of project ownership information has prevented comprehensive comparative analysis. Even in databases where project ownership information was available, existing comparative studies [12,13,14] relied on descriptive methods or association rule mining, and comprehensive quantitative comparison has not been conducted. While construction accidents are influenced by numerous site-specific factors in practice, systematic comparison utilizing the variables recorded in databases has not been sufficiently performed. The present study addresses these research gaps.

3. Methodology

3.1. Research Framework

The overall framework of this study consisted of three main stages (Figure 1). In Stage 1 (data collection and preprocessing), the 2014–2023 construction industry occupational accident data from MOEL were collected, preprocessing was performed to exclude unclassifiable cases, the 2018–2023 approved statistics were manually classified, and training data were constructed through Easy Data Augmentation (EDA). The objective of this stage was to construct both a preprocessed dataset suitable for statistical analysis and a labeled dataset for classification model training. In Stage 2 (classification model development), a rule-based baseline model and a fine-tuned KLUE-BERT model were developed, and their performance was compared. The rule-based baseline was developed first to quantify the performance ceiling of keyword-based classification, thereby establishing the added value of the KLUE-BERT approach. In Stage 3, the optimal model was applied to the entire dataset to classify project ownership, category reclassification of occupation and tenure was performed, and statistical analyses were conducted on six key variables to examine the differences in accident characteristics across project ownership types. These six variables were selected as the key categorical variables in the MOEL data for which cross-tabulation with project ownership is analytically meaningful.

3.2. Data Collection and Preprocessing

The MOEL occupational accident data were collected under the Occupational Safety and Health Act, and statistics were compiled based on accidents approved under the Industrial Accident Compensation Insurance Act [4]. In this study, 276,257 occupational accident records from the construction industry spanning 2014–2023 were collected. The data comprised workplace characteristics (company name, site name, client name, construction scale, subindustry type, administrative district, etc.), worker characteristics (gender, age, occupation, tenure, worker classification, etc.), and accident occurrence characteristics (accident type, accident cause, etc.); however, project ownership information distinguishing between public and private projects was not included. The MOEL data were obtained from the publicly available occupational accident statistics published by the Korean Ministry of Employment and Labor. The data are compiled annually based on accident reports submitted under the Industrial Accident Compensation Insurance Act and are made available for research purposes upon approval.

From the 276,257 records, 30,259 cases were excluded based on the following criteria: (1) cases where the accident type was “unclassifiable,” “animal injury,” “off-site traffic accident,” “sports event accident,” or “act of violence”; (2) cases where the major category of accident cause or occupation was “unclassifiable”; and (3) cases where both the site name and client name were blank. After preprocessing, 245,998 records were retained for statistical analysis.
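The three exclusion criteria above can be read as a single record-level predicate. The following is a minimal sketch under stated assumptions: the field names (`accident_type`, `cause_major`, `occupation_major`, `site_name`, `client_name`) and the English category labels are hypothetical stand-ins, since the MOEL schema and its Korean category values are not given in the text.

```python
# Hypothetical field names and English category labels; the MOEL schema is not
# given in the text, so these are illustrative stand-ins only.
EXCLUDED_ACCIDENT_TYPES = {
    "unclassifiable",
    "animal injury",
    "off-site traffic accident",
    "sports event accident",
    "act of violence",
}


def is_excluded(record: dict) -> bool:
    """Return True if a record meets any of the three exclusion criteria."""
    # (1) accident type falls in one of the excluded categories
    if record.get("accident_type") in EXCLUDED_ACCIDENT_TYPES:
        return True
    # (2) major category of accident cause or occupation is unclassifiable
    if "unclassifiable" in (record.get("cause_major"), record.get("occupation_major")):
        return True
    # (3) both site name and client name are blank
    site = (record.get("site_name") or "").strip()
    client = (record.get("client_name") or "").strip()
    return not site and not client


def preprocess(records: list[dict]) -> list[dict]:
    """Keep only the records that pass all three exclusion checks."""
    return [r for r in records if not is_excluded(r)]
```

Note that criterion (3) removes a record only when both fields are blank, so a record with a site name but no client name is retained at this stage.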

To construct training data labels, manual classification was performed on construction accident records from 2018 to 2023. The classification criteria were established by cross-referencing project ownership information from the MOLIT CSI system. Public and private projects were then identified by comprehensively reviewing the text fields of client name, site name, and company name. To ensure the consistency of the manual classification, a classification guideline was prepared, and the judgment rules for ambiguous cases were predefined. Classification was conducted independently by three researchers: two with 4 and 20+ years of experience at the MOEL Occupational Safety and Health Bureau, respectively (both doctoral candidates in safety engineering), and one with a doctoral degree in safety engineering. To assess inter-rater reliability, 25,279 records from the 2023 dataset were independently classified by all three researchers. Agreement was achieved on 25,247 records (99.87%; Cohen’s κ = 0.997), and the remaining 32 disagreements were resolved through consensus. Text field completeness was verified: among the training data, the non-blank rates for site name, client name, and company name were 100.0%, 86.2%, and 100.0%, respectively. Records with any incomplete text field were excluded to ensure complete input for the classification model.

To address the class imbalance problem and ensure a sufficient training scale, EDA [35] was applied. Among the EDA techniques, random swapping (RS) and random deletion (RD), which are suitable for the characteristics of the input text, were used to augment the 133,470 labeled records (after excluding records with incomplete text fields; public: 45,040, private: 88,430) to a total of 221,893. RS randomly swaps the order of tokens within the input text to enhance the robustness of the model to word-order variations, and RD randomly deletes tokens to enable the model to learn classification capabilities from partial information. The two remaining EDA operations, synonym replacement (SR) and random insertion (RI), rely on synonym dictionaries and are unsuitable here because the input text primarily consists of proper nouns (institution names, site names, company names). EDA was applied exclusively to the public class (45,040 records), generating 88,430 augmented public records to match the private class size and address the approximately 2:1 class imbalance. After quality filtering, the final augmented dataset comprised 221,893 records. The effectiveness of this augmentation strategy is evaluated in Section 4.1.
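As an illustration of the two operations, the sketch below implements random swapping and random deletion on a token list. It is not the study's implementation: whitespace tokenization, the number of swaps, the deletion probability, and the example text are all assumptions, because the text does not report these settings.

```python
import random


def random_swap(tokens: list[str], n_swaps: int, rng: random.Random) -> list[str]:
    """Swap two randomly chosen token positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens: list[str], p: float, rng: random.Random) -> list[str]:
    """Drop each token independently with probability p, keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Made-up example text, split on whitespace (an assumption)
tokens = "Seoul Metropolitan Government road widening works".split()
swapped = random_swap(tokens, n_swaps=1, rng=rng)
deleted = random_deletion(tokens, p=0.1, rng=rng)
```

Both functions leave the vocabulary of the record unchanged, which is why they remain usable on proper nouns where dictionary-based operations are not.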

3.3. Classification Model Development

The input text for project ownership classification was constructed by selecting three text fields from the MOEL data: site name, client name, and company name, which reflect characteristics of the project owner. Each field contains distinct information: the site name includes the nature of the project (road, apartment, plant, etc.); the client name includes the commissioning organization or company; and the company name includes contractor information. A [SEP] token was inserted between each field to clearly delineate the boundaries between features, and the final input text was constructed in the format “Site name: {site name} [SEP] Client name: {client name} [SEP] Company name: {company name}.”
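The input format described above amounts to a simple string template. A minimal sketch follows; the function name `build_input` is hypothetical, and the English field labels follow the format as quoted in the text (the labels used with the Korean data may differ).

```python
def build_input(site_name: str, client_name: str, company_name: str) -> str:
    """Join the three text fields with [SEP] markers in the stated format."""
    return (
        f"Site name: {site_name} [SEP] "
        f"Client name: {client_name} [SEP] "
        f"Company name: {company_name}"
    )
```

For example, `build_input("A", "B", "C")` yields `Site name: A [SEP] Client name: B [SEP] Company name: C`.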

A rule-based classification model was constructed as a baseline for a comparative evaluation of the KLUE-BERT model performance. The rule-based model classifies by searching for public institution-related keywords (e.g., “city hall,” “provincial office,” “construction authority,” “corporation,” “land ministry,” and “education office”) in the client name and company name fields. The classification accuracy is limited for institution names not included in the keyword list or ambiguous names such as “OO Construction” that are used in both public and private contexts.
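A keyword baseline of this kind can be sketched in a few lines. The keyword tuple below contains only the English glosses listed in the text; the actual Korean keyword list is not reproduced in the source, so the list, the function name, and the example names are illustrative assumptions.

```python
# English glosses of the keywords named in the text; the real list is Korean
# and is not reproduced in the source, so this tuple is illustrative only.
PUBLIC_KEYWORDS = (
    "city hall",
    "provincial office",
    "construction authority",
    "corporation",
    "land ministry",
    "education office",
)


def rule_based_ownership(client_name: str, company_name: str) -> str:
    """Label a record 'public' if any keyword occurs in either field."""
    text = f"{client_name} {company_name}".lower()
    return "public" if any(k in text for k in PUBLIC_KEYWORDS) else "private"
```

With hypothetical inputs, `rule_based_ownership("Busan City Hall", "ABC Builders")` returns `"public"`, but so does `rule_based_ownership("Hanbit Housing Corporation", "ABC Builders")` even if that client were a private firm, which illustrates how a broad keyword can overmatch.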

In this study, the KLUE-BERT (klue/bert-base) model was fine-tuned for the project ownership classification task [34]. KLUE-BERT was pretrained on a 62-GB Korean corpus (news, Wikipedia, Modu Corpus, etc.) and consists of 12 transformer encoder layers, 768-dimensional hidden vectors, and 12 attention heads. The output vector (768 dimensions) of the [CLS] token was extracted from the final layer of the KLUE-BERT encoder, and after applying dropout (p = 0.1), the model was trained to perform binary classification of public projects (0) and private projects (1) through a linear layer (Figure 2). The 221,893 EDA-augmented records were split into training (60%), validation (20%), and test (20%) sets; the model configuration and key training hyperparameters are presented in Table 1. It should be noted that augmentation was performed before the data split, which means that augmented variants of the same original record may appear in both the training and test sets, potentially introducing data leakage. Early stopping was implemented, and the model with the highest validation F1-score was selected.
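To make the leakage concern concrete, the sketch below shows one way it could be avoided: splitting the original records first and augmenting only the training portion. This is not the procedure used in the study, which augmented before splitting; the function name, the `augment` callback, and the seed are hypothetical.

```python
import random


def split_then_augment(records, augment, seed=0, ratios=(0.6, 0.2, 0.2)):
    """Split original records first, then augment only the training part.

    `records` is a list of (text, label) pairs and `augment` maps one text to a
    list of variant texts. Validation and test hold only original records, so
    no augmented variant of a held-out record can appear in training.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    # Augment after the split so variants stay on the training side only.
    train_aug = train + [(v, y) for x, y in train for v in augment(x)]
    return train_aug, val, test
```

A side effect of this ordering is that the validation and test sets keep the class distribution of the labeled data rather than that of the augmented data.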

3.4. Statistical Analysis

Statistical analysis was performed on 245,998 accident records for which project ownership was assigned using the classification model.

Chi-square tests of independence were performed to examine the association between project ownership (public/private) and the six categorical accident variables. The Bonferroni correction was applied to control for Type I errors due to multiple testing (six tests, α_adj = 0.00833), and the bias-corrected Cramér’s V [36] was computed to evaluate the magnitude of association. Effect-size interpretation followed the criteria of Cohen [37] (negligible: V < 0.10, small: 0.10 ≤ V < 0.30, medium: 0.30 ≤ V < 0.50, large: V ≥ 0.50). The bias-corrected Cramér’s V was computed as

\tilde{V} = \sqrt{\max\left(0, \hat{\phi}^{2} - \frac{(k - 1)(r - 1)}{n - 1}\right) / \min(\tilde{k} - 1, \tilde{r} - 1)},

where \hat{\phi}^{2} = \chi^{2} / n, \tilde{r} = r - \frac{(r - 1)^{2}}{n - 1}, and \tilde{k} = k - \frac{(k - 1)^{2}}{n - 1}, with r and k the numbers of rows and columns of the contingency table and n the total number of observations.
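The bias-corrected Cramér’s V and the Bonferroni threshold can be computed directly from a contingency table of counts. The following standard-library sketch follows the definitions given above; the function name is hypothetical, and it assumes every row and column total is positive.

```python
import math


def bias_corrected_cramers_v(table: list[list[float]]) -> float:
    """Bias-corrected Cramér's V for an r x k contingency table of counts."""
    r, k = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(k)]
    n = sum(row_tot)
    # Pearson chi-square statistic from observed and expected counts
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    phi2 = chi2 / n
    # Bias corrections for phi^2 and for the table dimensions
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corr / min(k_corr - 1, r_corr - 1))


ALPHA_ADJ = 0.05 / 6  # Bonferroni-adjusted threshold for six tests
```

With samples as large as those analyzed here, the correction terms are tiny, so the corrected and uncorrected V are nearly identical; the correction matters mainly for small tables or small n.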

Among the six analytical variables, four—accident type, accident cause, construction scale, and accident severity—used the categories from the original data as is and were selected with reference to the variable classification scheme of Hwang et al. [38]. Construction scale was classified into three tiers: small (less than KRW 5 billion), medium (KRW 5–12 billion), and large (KRW 12 billion or more). Accident severity was defined as a binary variable (fatal vs. nonfatal injury), after excluding occupational diseases.

Occupation and tenure were reclassified from the detailed categories in the original data based on international standards. Occupation was classified into 9 categories from the 53 occupation (medium) categories in the MOEL data based on the sub-major group structure of ILO ISCO-08 (International Standard Classification of Occupations) [39] (Table 2). This classification is based on the work of Song et al. [40], who confirmed that the Korean Employment Classification of Occupations (KECO) is a domestic adaptation of ISCO-08, and Kang and Ryu [41], who used the same KOSHA/MOEL data structure while maintaining consistency with occupation classifications in the international construction safety literature [13,14,42,43]. Tenure was classified into five tiers (less than one month, one to six months, six months to one year, one to five years, and more than five years) from the 14 categories in the original data by synthesizing the criteria of Cheng et al. [12,13] and Dong et al. [42].
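The scale and tenure tiers can be expressed as simple threshold mappings. This sketch assumes numeric inputs (contract amount in KRW billions, tenure in months), whereas the original data record these variables as categories; the handling of values exactly on a tenure boundary is also an assumption, since the text does not specify it.

```python
def scale_tier(contract_krw_billion: float) -> str:
    """Map a contract amount (KRW billions) to the three construction-scale tiers."""
    if contract_krw_billion < 5:
        return "small"
    if contract_krw_billion < 12:
        return "medium"
    return "large"


def tenure_tier(months: float) -> str:
    """Map tenure in months to the five tenure tiers (boundary handling assumed)."""
    if months < 1:
        return "<1 month"
    if months < 6:
        return "1-6 months"
    if months < 12:
        return "6 months-1 year"
    if months < 60:
        return "1-5 years"
    return ">5 years"
```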

The statistical analysis was conducted in three stages. First, descriptive statistics and chi-square tests of independence for the six variables were used to test their overall association with project ownership. Second, adjusted standardized residual (ASR) and odds ratio (OR) analyses of accident-related variables (accident type, accident cause, construction scale, and accident severity) were performed to quantify the differences by category. Third, category-specific analyses and trend tests were conducted for worker-related variables (occupation and tenure).

The ASR measures the standardized deviation between observed and expected frequencies in each cell of a contingency table. Under the standard normal approximation, statistical significance is indicated when |ASR| > 1.96 (p < 0.05), |ASR| > 2.58 (p < 0.01), and |ASR| > 3.29 (p < 0.001) [44]. The OR quantifies the relative likelihood of occurrence of each accident category in public vs. private projects. Statistical significance was inferred when the 95% confidence interval (CI) did not include 1. As tenure is an ordinal variable, the Cochran–Armitage trend test [45] was performed to assess linear trends in the proportion of public vs. private projects across tenure categories. Additionally, the presence of monotonic trends was examined using the nonparametric Mann–Kendall test. The formulas for the key statistical measures are as follows. The adjusted standardized residual (ASR) was computed as

{ASR}_{ij} = (O_{ij} - E_{ij}) / \sqrt{E_{ij} (1 - R_{i} / n) (1 - C_{j} / n)},

where O_{ij} is the observed frequency, E_{ij} the expected frequency, R_{i} the row total, C_{j} the column total, and n the grand total. The odds ratio (OR) and its 95% confidence interval were computed as

OR = (a \times d) / (b \times c)

and

95\% CI = \exp(\ln(OR) \pm 1.96 \sqrt{1/a + 1/b + 1/c + 1/d}),

where a, b, c, and d denote the four cell counts of the corresponding 2 × 2 contingency table. OR > 1 indicates that the accident category is more likely in public projects relative to private projects; OR < 1 indicates higher likelihood in private projects. The Cochran–Armitage Z statistic was computed as

Z = \sum_{i} t_{i} (r_{i} - n_{i} \bar{p}) / \sqrt{\bar{p} (1 - \bar{p}) \sum_{i} n_{i} t_{i}^{2}},

where t_{i} are equally spaced scores, r_{i} the number of events, n_{i} the total at level i, and \bar{p} = \sum r_{i} / \sum n_{i}.
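The three measures can be sketched with the standard library as follows. Two points are assumptions rather than statements from the text: the 2 × 2 cell layout (a = public with the category, b = public without, c = private with, d = private without), and the use of the general Cochran–Armitage variance term, which reduces to the expression given above when the scores are centered so that \sum_{i} n_{i} t_{i} = 0.

```python
import math


def adjusted_standardized_residual(o, row_total, col_total, n):
    """ASR for one cell: (O - E) / sqrt(E (1 - R/n) (1 - C/n))."""
    e = row_total * col_total / n
    return (o - e) / math.sqrt(e * (1 - row_total / n) * (1 - col_total / n))


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a 95% CI for the 2x2 table [[a, b], [c, d]]."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper


def cochran_armitage_z(events, totals, scores=None):
    """Cochran-Armitage trend statistic for ordered groups."""
    scores = scores or list(range(len(totals)))
    n = sum(totals)
    p_bar = sum(events) / n
    num = sum(t * (r - m * p_bar) for t, r, m in zip(scores, events, totals))
    # General variance term; the s1 part vanishes when scores are centered.
    s1 = sum(m * t for t, m in zip(scores, totals))
    s2 = sum(m * t * t for t, m in zip(scores, totals))
    return num / math.sqrt(p_bar * (1 - p_bar) * (s2 - s1 * s1 / n))
```

Because the numerator is unchanged when a constant is added to every score, the statistic does not depend on where the equally spaced scores start.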

To assess the robustness of bivariate associations against potential confounding by year and construction scale, supplementary logistic regression analyses were conducted for key outcome variables. Binary logistic regression models were fitted with each accident category as the dependent variable (1 = category present, 0 = otherwise), project ownership as the independent variable, and year (centered at 2018) and construction scale (ordinal: Small = 0, Medium = 1, Large = 2) as covariates. Adjusted odds ratios (aOR) with 95% confidence intervals were estimated to verify whether the direction and magnitude of associations observed in bivariate analyses remained consistent after controlling for these covariates.
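The supplementary model described above can be written out as follows. This is a sketch of the specification implied by the text: the coding Public = 1 for public and 0 for private projects is an assumption (chosen so that aOR > 1 indicates higher odds in public projects, matching the bivariate OR), as is the Wald form of the confidence interval.

```latex
\operatorname{logit} \Pr(Y = 1)
  = \beta_{0} + \beta_{1}\,\mathrm{Public}
  + \beta_{2}\,(\mathrm{Year} - 2018) + \beta_{3}\,\mathrm{Scale},
\qquad
\mathrm{aOR} = \exp(\hat{\beta}_{1}),
\qquad
95\%\ \mathrm{CI} = \exp\!\left(\hat{\beta}_{1} \pm 1.96\,\mathrm{SE}(\hat{\beta}_{1})\right)
```

Here Y = 1 when the record falls in the accident category under study, and Scale takes the ordinal values 0, 1, and 2 for small, medium, and large projects.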

4. Results

4.1. Classification Model Performance

Table 3 compares the performance of the rule-based and fine-tuned KLUE-BERT models. The rule-based model achieved a weighted F1-score of only 0.5511, whereas the fine-tuned KLUE-BERT model achieved a score of 0.9876, demonstrating a marked performance improvement across all evaluation metrics (Figure 3).

The confusion matrix shows why the rule-based model performed poorly. In the test set (n = 44,379), 82.5% (14,592 of 17,685) of private projects were misclassified as public projects. This pattern reflects a structural limitation of the rule-based model’s keyword list: it contains broadly common terms (e.g., construction work, Korea) that appear frequently in both public and private project records, so a large proportion of private records matched public institution-related keywords. In contrast, the fine-tuned KLUE-BERT model achieved a recall of 0.991 for public projects (26,442 of 26,694 correctly classified) and 0.983 for private projects (17,388 of 17,685 correctly classified) (Figure 4). The macro F1-score (0.9871) was nearly identical to the weighted F1-score (0.9876), indicating balanced performance across both classes. It should be noted that the test set class distribution (public 60.2% vs. private 39.8%) reflects the augmented data distribution, not the real-world distribution (public 29.1% vs. private 70.9%). This inversion arose because EDA augmentation was applied before the train-validation-test split: the minority class (public, originally 33.7% of labeled data) was augmented at a higher ratio to address class imbalance and therefore became the majority class in all splits, including the test set.
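The relationship between the reported recalls and the macro and weighted F1-scores can be checked directly from the test-set counts given above. In the following sketch the off-diagonal cells are derived by subtraction from the reported class totals, and the function name `f1_scores` is introduced here for illustration.

```python
def f1_scores(cm):
    """cm[true][predicted] counts -> per-class F1, macro F1, weighted F1."""
    classes = list(cm)
    support = {t: sum(cm[t].values()) for t in classes}
    total = sum(support.values())
    f1 = {}
    for k in classes:
        tp = cm[k][k]
        fn = support[k] - tp
        fp = sum(cm[t][k] for t in classes if t != k)
        f1[k] = 2 * tp / (2 * tp + fp + fn)
    macro = sum(f1.values()) / len(classes)
    weighted = sum(f1[k] * support[k] for k in classes) / total
    return f1, macro, weighted

# Fine-tuned KLUE-BERT test-set counts as reported in the text.
cm = {
    "public": {"public": 26442, "private": 26694 - 26442},
    "private": {"public": 17685 - 17388, "private": 17388},
}
f1, macro_f1, weighted_f1 = f1_scores(cm)
errors = cm["public"]["private"] + cm["private"]["public"]
error_rate = errors / 44379
print(round(macro_f1, 4), round(weighted_f1, 4), round(error_rate, 4))
```

Because the weighted average uses the test-set class shares, it inherits the augmented 60/40 distribution rather than the real-world one; the macro average is not affected by that shift.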

The epoch-wise training history (Table 4) demonstrates the convergence of the model. The training loss continuously decreased from 0.1117 in epoch 1 to 0.0344 in epoch 3, the validation loss decreased from 0.1027 in epoch 1 to 0.0560 in epoch 3, and the validation F1-score increased from 0.9766 to 0.9877. The difference between the training accuracy (0.9922) and validation accuracy (0.9877) was negligible (0.45 percentage points), indicating stable convergence without overfitting during the three-epoch training.

To verify the effectiveness of EDA, the performance of the model trained on the labeled data without augmentation (133,470 records) was compared with that of the model trained on the augmented data (221,893 records). The test F1-score of the model trained on the original data was 0.9712, which was 1.64 percentage points lower than that of the model trained on the augmented data (0.9876). We consider that data augmentation through RS and RD improved the classification performance by enhancing the robustness of the model to word-order variations and partial information. To assess whether the residual misclassification (1.24% error rate) could materially affect the subsequent statistical findings, a sensitivity analysis was conducted using the confusion matrix-based correction method for non-differential misclassification bias [46]. The test set confusion matrix yields a sensitivity of 0.991 and specificity of 0.983 for the public class. Applying these rates to correct the observed 2 × 2 contingency tables for each accident category, the corrected odds ratios were computed and compared with the uncorrected (crude) values. For all eight key categories examined—including construction machinery (crude OR = 3.20, corrected OR = 3.30), fall (0.73, 0.72), transportation (2.59, 2.65), and scaffolding (0.73, 0.72)—the direction of association (OR > 1 or OR < 1) was preserved, with a maximum deviation of less than 4%. This confirms that the 1.24% misclassification rate does not substantively alter the key findings of the subsequent statistical analysis.
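The confusion matrix-based correction can be sketched as follows, assuming the standard back-calculation for non-differential exposure misclassification: within each outcome group, the classified-public count equals Se × (true public) + (1 − Sp) × (true private). The counts below are hypothetical and were constructed so that the true OR is 3; only the sensitivity (0.991) and specificity (0.983) come from the text. The function names are introduced here for illustration.

```python
def correct_exposure_counts(obs_public, obs_private, se, sp):
    """Recover true public/private counts from classified counts in one outcome group."""
    n = obs_public + obs_private
    true_public = (obs_public - (1 - sp) * n) / (se + sp - 1)
    return true_public, n - true_public

def corrected_or(a, b, c, d, se, sp):
    """a, b: classified public/private within the category; c, d: outside it."""
    a_true, b_true = correct_exposure_counts(a, b, se, sp)
    c_true, d_true = correct_exposure_counts(c, d, se, sp)
    return (a_true * d_true) / (b_true * c_true)

SE, SP = 0.991, 0.983  # reported for the public class

# Hypothetical classified counts (true table: 1000/1000 in category, 2000/6000 outside).
a, b, c, d = 1008, 992, 2084, 5916
crude = (a * d) / (b * c)
corrected = corrected_or(a, b, c, d, SE, SP)
print(crude, corrected)
```

In this example the crude OR is attenuated toward 1 relative to the corrected value, which is the expected direction of bias under non-differential misclassification.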

4.2. Descriptive Statistics and Overall Associations

When the fine-tuned KLUE-BERT model was applied to the entire dataset of 245,998 records, 71,550 (29.1%) were classified as public projects, and 174,448 (70.9%) were classified as private projects. This distribution, in which private projects account for a higher proportion than public projects, reflects the structure of the domestic construction market.

The categorical distributions of the six variables are presented in Table 5. Among accident types, falls accounted for the highest proportion of cases (32.6%), with a higher percentage in private projects (34.6%) than in public projects (27.8%). Struck-by-object accidents accounted for 19.7%, followed by slips (13.5%).

Regarding accident causes, building-, structure-, and surface-related causes accounted for the highest proportion of cases (52.1%), with a higher percentage in private projects (53.9%) than in public projects (47.7%). In contrast, construction- and mining machinery-related causes accounted for a higher proportion of cases in public projects (14.3%) than in private projects (5.0%). Additionally, transportation-related causes were approximately 2.6 times more prevalent in public projects (3.9%) than in private projects (1.5%).

For construction scale, small-scale projects (less than KRW 5 billion) accounted for 74.2% of the total, and the medium-scale category (KRW 5–12 billion) accounted for approximately twice as high a proportion in public projects (12.2%) as in private projects (6.0%).

Regarding accident severity, nonfatal injuries accounted for 98.3% of cases. The proportion of fatal injuries was slightly higher in public projects (1.8%) than in private projects (1.6%); however, the absolute difference was negligible.

For occupation, construction laborers accounted for the highest proportion (44.6%), followed by building trades workers (32.2%). The proportion of building trades workers was higher in private projects (33.8%) than in public projects (28.3%), whereas other/non-construction occupations accounted for a substantially higher proportion in public projects (5.1%) than in private projects (2.2%).

For tenure, less than one month accounted for the highest proportion (71.2%). This proportion was higher in private projects (72.4%) than in public projects (68.2%). In all tenure categories of 1 month or more, the proportion in public projects exceeded that in private projects.

The chi-square test results indicated that all six variables were significantly associated with project ownership under the Bonferroni-corrected significance level (α_adj = 0.00833) (Table 6). According to the bias-corrected Cramér’s V, the effect sizes decreased in the following order: accident cause (V = 0.1585, small), construction scale (V = 0.1062, small), occupation (V = 0.1017, small), accident type (V = 0.0970, negligible), tenure (V = 0.0496, negligible), and accident severity (V = 0.0068, negligible) (Figure 5).
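The Bonferroni threshold and the bias-corrected effect size can be illustrated with a short sketch. It assumes the Bergsma-style bias correction of Cramér's V, which the text does not name explicitly, and it treats accident severity as a 2 × 2 table (fatal/nonfatal by public/private); the χ² value of 12.38 and n = 245,998 are the figures reported for accident severity in Section 5.2.

```python
import math

def bias_corrected_cramers_v(chi2, n, n_rows, n_cols):
    """Bias-corrected Cramér's V from a chi-square statistic (assumed Bergsma form)."""
    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (n_rows - 1) * (n_cols - 1) / (n - 1))
    rows_corr = n_rows - (n_rows - 1) ** 2 / (n - 1)
    cols_corr = n_cols - (n_cols - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corr / min(rows_corr - 1, cols_corr - 1))

# Six tests at a family-wise alpha of 0.05.
alpha_adj = 0.05 / 6

# Accident severity: assumed 2 x 2 table.
v_severity = bias_corrected_cramers_v(12.38, 245_998, 2, 2)
print(round(alpha_adj, 5), round(v_severity, 4))
```

With a sample this large the correction term is tiny, so the corrected and uncorrected values differ only in the later decimals; the example mainly shows how a significant χ² can coexist with a negligible V.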

Accident cause, construction scale, and occupation exhibited effect sizes of small or greater, confirming that they had the strongest associations with project ownership. In contrast, although accident severity was statistically significant, its effect size was negligible, indicating that differences in fatal and nonfatal injury proportions between public and private projects were minimal.

In summary, the descriptive statistics and chi-square tests reveal two overarching patterns. First, all six variables showed statistically significant associations with project ownership (p < 0.001 for all, well below the Bonferroni-adjusted α = 0.00833), confirming that accident characteristics systematically differ between public and private projects. Second, the effect sizes measured by Cramér’s V indicate that Accident Causes (V = 0.1585), Construction Scale (V = 0.1062), and Occupation (V = 0.1017) exhibit small but meaningful associations, while Accident Severity (V = 0.0068) shows a negligible effect, suggesting that the public–private difference in fatality rates is statistically significant but practically minimal.

4.3. Accident Characteristics by Project Ownership

In the ASR analysis of accident types, falls exhibited the most pronounced difference, with a substantially higher proportion than expected in private projects (Table 7, Figure 6). In contrast, collision, caught-in-between, and structural collapse accidents were significantly more prevalent in public projects. In the OR analysis, drowning and oxygen deficiency exhibited high ORs in public projects, and structural collapse and collision were also significantly higher in public projects (Table 8). Falls and cut/pierced injuries were more likely to occur in private projects. The higher proportion of fall accidents in private projects is consistent with the predominance of building construction activities involving elevated work on scaffolding and temporary structures.

In the ASR analysis of accident causes, construction/mining machinery and means of land transportation exhibited large positive residuals in public projects (Table 9 and Figure 7). In contrast, stairs and ladders and scaffolding and working platforms exhibited significantly larger residuals in private projects. In the OR analysis, construction/mining machinery (OR = 3.201) and means of land transportation (OR = 2.586) exhibited the highest ORs in public projects (Table 10), whereas stairs and ladders (OR = 0.658) and scaffolding and working platforms (OR = 0.726) were more likely to occur in private projects. This elevated risk of machinery-related accidents in public projects likely reflects the higher proportion of large-scale civil engineering works (roads, bridges, dams) that require intensive use of heavy construction machinery.

In the construction scale analysis, the medium-scale category (KRW 5–12 billion) exhibited the largest positive residuals in public projects, whereas the small-scale category (<KRW 5 billion) exhibited significantly larger residuals in private projects (Table 11, Figure 8). In the OR analysis, the medium-scale category (OR = 2.197) had the highest OR in public projects, whereas the small-scale (OR = 0.760) and large-scale (OR = 0.942) categories accounted for a higher proportion in private projects.

Regarding accident severity, the OR for fatal injuries was 1.126 (95% CI [1.05–1.20]), which was statistically significant. However, given the negligible effect size (Cramér’s V = 0.0068), the substantive difference was minimal (Figure 9).

An analysis of temporal changes in the proportion of public projects from 2014 to 2023 indicated a significant increasing trend based on the Cochran–Armitage test (Z = 18.28, p < 0.001; Table 12 and Figure 10). The estimated increase was 0.60 percentage points per year, with the proportion rising from 28.57% in 2014 to 33.24% in 2023 (an increase of approximately 4.7 percentage points). The Mann–Kendall test likewise indicated a significant increasing trend (τ = 0.511, p = 0.047). It should be noted that the observed increasing trend in the proportion of public project accidents may be influenced by macroeconomic factors such as government SOC (Social Overhead Capital) budget fluctuations, private construction market cycles, and the COVID-19 pandemic (2020–2021), which temporarily increased public infrastructure investment. The year-to-year fluctuations visible in Figure 10, particularly the decline in 2015–2016 and the peak in 2020–2021, suggest that these external factors may play a significant role.
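The Mann–Kendall statistic used above can be sketched in a few lines. The implementation below assumes no tied values and uses the normal approximation with continuity correction, which may differ from the exact procedure the authors applied; the ten-point series is a hypothetical placeholder, not the yearly proportions from Table 12.

```python
import math

def mann_kendall(series):
    """Return (S, tau, z, two-sided p) under the no-ties normal approximation."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    tau = s / (n * (n - 1) / 2)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return s, tau, z, p

# Hypothetical ten-year series with an upward drift and some reversals.
toy_series = [1.0, 0.8, 1.2, 1.5, 1.4, 1.9, 2.3, 2.1, 2.6, 3.0]
s, tau, z, p = mann_kendall(toy_series)
print(s, tau, z, p)
```

Because the statistic depends only on the signs of pairwise differences, a series can show a significant upward trend while still containing individual year-to-year declines.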

4.4. Worker Characteristics by Project Ownership

In the ASR analysis of the nine occupation categories, other/non-construction exhibited the largest positive residual in public projects, followed by other skilled trades workers and electrical workers (Table 13, Figure 11). In private projects, building trades workers exhibited the largest positive residuals, reflecting the high demand for finishing work, such as masonry, plastering, waterproofing, and painting. In the OR analysis, woodworkers/installers (OR = 0.601) and metal/welding workers (OR = 0.749) were significantly more prevalent in private projects. Equipment operators exhibited no significant difference between project ownership types (OR = 1.019, 95% CI [0.973–1.066]).

In the ASR analysis of the five tenure categories, less than one month exhibited the strongest skew toward private projects, whereas the proportion of public projects was significantly higher in all categories of one month or more (Table 14). The OR increased with tenure from 0.816 (less than one month) to 1.407 (one to five years) (Figure 12). For more than five years, the OR decreased slightly to 1.341 but remained significantly above 1. The Cochran–Armitage trend test indicated a statistically significant increasing trend in the proportion of public projects with longer tenure (Z = 24.19, p < 0.001; Table 15). This increase in the public-project proportion with longer tenure may reflect the relatively longer employment duration typical of large-scale civil engineering projects and the higher proportion of experienced equipment operators in public works.
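A minimal sketch of the Cochran–Armitage trend statistic for ordered categories such as tenure is given below. The counts are hypothetical and the function name is introduced here. The scores are centered on their weighted mean before the formula from Section 3 is applied; this centering is an assumption made here so that the simplified denominator \sum n_{i} t_{i}^{2} equals the usual variance term, and it leaves the numerator unchanged.

```python
import math

def cochran_armitage_z(events, totals, scores=None):
    """Trend Z statistic for event proportions across ordered groups."""
    if scores is None:
        scores = list(range(len(totals)))  # equally spaced scores
    n = sum(totals)
    p_bar = sum(events) / n
    mean_score = sum(t * m for t, m in zip(scores, totals)) / n
    centered = [t - mean_score for t in scores]
    numerator = sum(t * (r - m * p_bar)
                    for t, r, m in zip(centered, events, totals))
    variance = p_bar * (1 - p_bar) * sum(m * t * t
                                         for t, m in zip(centered, totals))
    return numerator / math.sqrt(variance)

# Hypothetical counts for five ordered tenure categories:
# events = public-project records, totals = all records in the category.
public_counts = [270, 95, 50, 60, 25]
category_totals = [1000, 300, 150, 170, 75]
z_trend = cochran_armitage_z(public_counts, category_totals)
print(z_trend)
```

The test is sensitive to a linear trend in proportions across the scores, so it can be significant even when one category, such as the longest tenure group, dips slightly below its neighbor.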

In the occupation × project ownership analysis stratified by construction scale, the small-scale segment (χ² = 1423.37, V = 0.0896) exhibited the largest effect size, indicating that the differences in occupation composition by project ownership were most pronounced in small-scale projects. In the fatality rate analysis by occupation, equipment operators exhibited the highest fatality rate in both public (3.891%) and private projects (2.490%), and their public/private fatality rate ratio was also the highest (1.563). The fatality rate increased with tenure, from 1.54% (less than 1 month) to 3.37% (more than 5 years).

5. Discussion

5.1. Classification Model Performance and Methodological Significance

The marked performance improvement achieved by the fine-tuned KLUE-BERT model over the rule-based model originates from the fundamental differences between the two approaches. The rule-based model relies entirely on a predefined keyword list and therefore cannot capture public institution names that are absent from the list or that appear in modified form; it also frequently misclassifies private companies whose names contain public keywords (e.g., names in the format of “OO Construction”). In contrast, the fine-tuned KLUE-BERT model can comprehensively evaluate the contextual meanings of the three text fields based on the Korean semantic representations learned during pretraining. This is consistent with the findings of Kumi et al. [31], who reported that fine-tuning pretrained language models is effective for domain-specific texts.

The methodological significance of this study lies in extending the use of the NLP model beyond a simple classification tool to a framework that automatically generates analytical variables absent from the source data. While existing construction safety NLP research has focused on the automatic classification of variables already present in source data [16,21,25,29,31], this study inferred and generated variables not recorded in the database from unstructured text and utilized them as key independent variables in large-scale statistical analysis. This approach is not limited to construction accident data and can be applied to various industrial databases in which the variables required for analysis are absent from the source data.

5.2. Structural Differences in Accident Characteristics by Project Ownership and Novel Findings

The statistical analysis results confirmed the structural differences in accident occurrence characteristics between public and private projects across all six variables and identified several novel patterns that have not been reported in previous studies.

In the accident type and accident cause analyses, construction machinery- (OR = 3.20) and transportation-related (OR = 2.59) accidents were concentrated in public projects, whereas fall- (OR = 0.73) and scaffolding-related (OR = 0.73) accidents were concentrated in private projects. This reflects differences in work types between the two project ownership categories. Cheng et al. [12] found qualitative differences in accident patterns by project ownership through association rule mining but did not quantify the differences using ORs. The present study is the first to quantify the odds ratios between project ownership types for each accident type and accident cause from a large-scale dataset of 245,998 records. In particular, the OR of 15.68 for drowning indicates that the risk associated with waterside work (river maintenance, dam construction, port construction, etc.) is significantly higher in public projects than in private projects, underscoring the need for specialized safety protocols for waterside infrastructure work.

A notable finding in the construction scale analysis was that the OR of 2.197 for the medium-scale category (KRW 5–12 billion) was substantially higher than the OR of 0.942 for the large-scale category (KRW 12 billion or more). Korea’s Occupational Safety and Health Act mandates the appointment of a safety manager for construction sites with a construction cost of KRW 5 billion or more and requires a dedicated safety manager for KRW 12 billion or more. The extreme concentration of public projects at the medium scale suggests that mandatory safety-manager appointments are implemented more strictly in public projects. Empirical evidence that the compliance level of safety management regulations varies according to project ownership has not been reported in previous studies; thus, our work provides a new perspective for evaluating regulatory effectiveness. Cheng et al. [13], who analyzed the accident characteristics of small construction enterprises, also confirmed the interaction between construction scale and project ownership; however, no study has interpreted this from the perspective of regulatory thresholds.

In the occupation analysis, despite equipment operator being the only occupation with no significant proportional difference between project ownership types (OR = 1.019, 95% CI [0.973–1.066]), it exhibited the highest fatality rate in both public projects (3.89%) and private projects (2.49%). This suggests that heavy equipment work constitutes a universal risk factor transcending regulatory differences by project ownership, consistent with the findings of Dong et al. [42] and Halabi et al. [43]. However, the coexistence of proportional homogeneity and the highest fatality rates across ownership types was identified for the first time in the present study. The skewness of building trades workers toward private projects (OR = 0.775) reflects the demand for finishing work in private building construction and suggests a structural association with fall accidents in private projects.

The tenure analysis revealed a pattern in which the proportion of public projects generally increased with longer tenure, with ORs of 0.816 (less than one month), 1.106 (one to six months), 1.319 (six months to one year), 1.407 (one to five years), and 1.341 (more than five years) (Cochran–Armitage Z = 24.19, p < 0.001). This increasing trend between project ownership and tenure is a novel finding that has not been reported in previous studies and may reflect the dual labor market structure of the construction industry, characterized by the relatively longer-term employment tendency of public projects and the concentration of short-term employment in private projects. Cheng et al. [12] reported that workers with less than one month of tenure have the highest accident risk; accordingly, the concentration of short-term workers in private projects is interpreted as a structural risk factor that compounds the risk of fall accidents.

The chi-square test result for accident severity (χ² = 12.38, p < 0.001) was statistically significant; however, with a bias-corrected Cramér’s V of 0.0068 (negligible), the substantive difference was minimal. This represents a case in which a discrepancy between statistical significance and substantive meaning emerged in a large-scale dataset of 245,998 records, empirically demonstrating the importance of effect-size interpretation in large-scale construction accident data analysis.

The above findings across individual variables are not independent but form coherent causal chains linking project ownership to accident patterns. Two primary pathways emerge. In public projects, the higher proportion of large-scale civil engineering works (roads, bridges, dams) leads to intensive deployment of construction machinery and transportation vehicles, resulting in elevated construction machinery-related accident causes (OR = 3.20) and transportation-related causes (OR = 2.59), which manifest as collapse/burial and caught-in-between accident types, with a correspondingly higher proportion of experienced equipment operators and longer-tenured workers. In private projects, the higher proportion of small-to-medium scale building construction leads to a predominance of finishing works by building trades workers (OR = 0.78), resulting in elevated risks of falls from scaffolding and temporary structures, which manifest as fall-type accidents (OR = 0.73) and scaffolding-related causes (OR = 0.73), with a concentration of short-term, less experienced workers. The concentration of construction machinery-related accidents in road construction projects has also been reported by Bria et al. [47]. Kazan and Usmen [48] reported that earthmoving equipment accidents were associated with elevated injury severity, which is consistent with the high fatality rates observed among equipment operators in the present study. These pathways suggest that the differences in accident characteristics between public and private projects are structurally embedded in the fundamental differences in work type composition, rather than being attributable solely to regulatory differences. This integrated perspective aligns with and extends the findings of Cheng et al. [12,13], who identified qualitative differences in accident patterns by project ownership but did not systematically trace the mechanistic pathways linking ownership type to specific accident outcomes.

To examine whether the observed bivariate associations were confounded by year and construction scale, supplementary logistic regression analyses were conducted for five key outcome variables. The adjusted odds ratios (controlling for year and construction scale) were highly consistent with the crude odds ratios in both direction and magnitude: construction machinery (crude OR = 3.20, adjusted OR = 3.16), fall (0.73, 0.74), transportation (2.59, 2.57), scaffolding (0.73, 0.73), and collapse/burial (1.10, 1.09). All adjusted ORs remained statistically significant (p < 0.001), confirming that the bivariate associations are robust to potential confounding by temporal trends and project scale. While this exploratory analysis does not constitute a comprehensive multivariate model controlling for all potential confounders (see Section 5.4), it provides evidence that the observed structural differences are not merely artifacts of year-specific or scale-related confounding.

5.3. Practical Implications

The results of this study have implications for the formulation of construction safety policies and on-site safety management.

In public projects, emphasis should be placed on the prevention of construction machinery- and transportation-related accidents. It is necessary to enhance construction machinery operator training, introduce contact prevention systems between machinery and workers, and ensure physical separation of work zones and equipment traffic zones. Additionally, considering the high OR of drowning, the provision of water safety equipment and the strengthening of waterside work safety protocols in waterside infrastructure projects are necessary. The concentration of construction machinery-related causes in public projects (OR = 3.20, 95% CI [3.04, 3.37]) provides quantitative justification for prioritizing machinery safety interventions in public project safety management plans.

In private projects, the prevention of fall accidents and strengthening of safety management systems for small-scale projects are required. Along with the installation of safety guardrails, fall prevention nets, and strengthened management of safety harness usage at building construction sites, policy measures are needed to address safety management gaps in small-scale projects below the safety-manager appointment threshold (such as itinerant safety management services and mandatory safety checklists for small-scale projects). From a regulatory perspective, the finding that medium-scale public projects (KRW 5–12 billion) show an OR of 2.20 suggests that the current safety manager appointment threshold under the Occupational Safety and Health Act may need to be re-evaluated. Specifically, the regulatory requirement for a full-time safety manager at KRW 12 billion or more could be extended to a lower threshold for public projects to address the concentration of accidents at the medium scale.

A safety-training design that reflects the occupational characteristics of project ownership is also required. In private projects, fall prevention training for building trades workers must be strengthened, whereas in public projects, heavy equipment safety training for equipment operators must be enhanced. Furthermore, the effectiveness of on-site orientation training for newly deployed workers should be enhanced for the safety management of short-term employment workers in private projects.

The classification framework developed in this study can be integrated into MOEL’s Occupational Accident Data Management System and its annual accident statistics compilation process. By automatically generating the project ownership variable for existing data and for each new accident record, the framework would support the computation of accident statistics by project ownership, real-time monitoring of accident rate trends by project ownership, and the formulation of tailored safety policies. Its practical value is high because it expands analytical capabilities without requiring changes to the data collection format.

5.4. Limitations and Future Research

This study has several limitations. The labels of the training data were constructed based on the manual classification of 25,279 records from 2023 approved statistics and expanded to 221,893 records through EDA. Cross-referencing with CSI data and classification guidelines was conducted to ensure the consistency of the manual classification, and inter-rater reliability was assessed on 25,279 records from the 2023 dataset, yielding an agreement rate of 99.87% (Cohen’s κ = 0.997). While this high agreement provides confidence in label quality, the reliability assessment covered only one year of data; future studies should extend inter-rater reliability verification to a broader temporal sample. The EDA augmentation was performed before the train-validation-test split, which introduces a potential data leakage concern: augmented variants of the same original record may appear in both training and test sets, potentially inflating the reported classification performance (F1 = 0.9876). However, the model trained without augmentation still achieved an F1-score of 0.9712, and the sensitivity analysis confirmed that the 1.24% misclassification rate does not substantively alter the key statistical findings. Future studies should adopt an augmentation-after-split strategy to eliminate this concern entirely.
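The augmentation-after-split strategy recommended above can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: the random swap (RS) and random deletion (RD) operations work on whitespace tokens, the records are synthetic placeholders, the split fraction and number of augmented copies are arbitrary, and all function names are introduced here.

```python
import random

def random_swap(tokens, rng):
    """RS: swap two randomly chosen token positions."""
    tokens = tokens[:]
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, rng, p=0.1):
    """RD: drop each token with probability p, keeping at least one token."""
    if not tokens:
        return tokens
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]

def split_then_augment(records, rng, test_frac=0.2, n_aug=2):
    """Hold out the test set first, then augment only the training records."""
    records = records[:]
    rng.shuffle(records)
    n_test = max(1, int(len(records) * test_frac))
    test, train = records[:n_test], records[n_test:]
    augmented = []
    for text, label in train:
        tokens = text.split()
        augmented.append((text, label))
        for _ in range(n_aug):
            op = rng.choice([random_swap, random_deletion])
            augmented.append((" ".join(op(tokens, rng)), label))
    return augmented, test

# Synthetic placeholder records: (site, client, workplace text; label 0/1).
records = [(f"site{i} client{i} workplace{i}", i % 2) for i in range(10)]
train_aug, test = split_then_augment(records, random.Random(0))
print(len(train_aug), len(test))
```

Because every augmented variant is generated from a training record only, no test record can share a source record with the training data, which removes the leakage path described in this section.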

Additionally, only two approaches—the rule-based model and fine-tuned KLUE-BERT—were examined in the classification model comparison, and the statistical analysis clarified the associations between project ownership and accident variables but not causal relationships. Future studies should include comparative analysis with various pretrained models, such as KoBERT and KoELECTRA, as well as the control of confounding variables through multivariate analysis (such as logistic regression analysis). Although supplementary logistic regression analysis confirmed that key associations remained robust after controlling for year and construction scale, a comprehensive multivariate model controlling for all potential confounders was beyond the exploratory scope of this study. Future studies should conduct confirmatory analyses using hierarchical logistic regression or multilevel models.

Because this study was conducted using Korean MOEL data, its direct application to construction accident data from other countries may be difficult. However, the methodological approach of automatically generating analytical variables from unstructured text and utilizing them in statistical analysis constitutes a generalizable framework that can be extended to other countries using multilingual pretrained models.

6. Conclusions

A fine-tuned KLUE-BERT framework was developed to automatically classify project ownership information absent from MOEL construction accident data, and the accident characteristics of public and private projects were compared and analyzed across six key accident variables using 245,998 classified records. The classification model achieved an F1-score of 0.9876, and all six variables exhibited statistically significant associations with project ownership. Construction machinery- and transportation-related accidents were significantly more prevalent in public projects, whereas fall- and scaffolding-related accidents were significantly more prevalent in private projects. Structural differences by project ownership were also observed in occupation and tenure. The sensitivity analysis confirmed that the residual misclassification (1.24% error rate) does not substantively alter these findings, and the supplementary logistic regression verified robustness after controlling for year and construction scale. These findings are consistent with previous studies. Cheng et al. [12,13] in Taiwan and Ling et al. [14] in Singapore reported qualitative differences in accident patterns between public and private projects; the present study extends these findings with quantitative measures based on a larger dataset. The universal high-risk nature of equipment operation, observed in both public and private projects, is consistent with Dong et al. [42], Halabi et al. [43], and Kazan and Usmen [48]. The classification performance confirms the effectiveness of fine-tuning pretrained language models for domain-specific construction texts reported by Kumi et al. [31].

The contributions of this study are twofold. First, a framework was developed to automatically generate analytical variables that were absent from the source data using an NLP model and to utilize them as key variables in large-scale statistical analysis. This approach can be extended to various industrial safety databases in which variables required for analysis are not recorded. Second, through a comprehensive statistical analysis of a large-scale dataset (n = 245,998), structural differences in accident characteristics by project ownership were systematically elucidated, providing empirical evidence for the formulation of safety management strategies tailored to public and private projects.

The limitations of this study include the potential data leakage from EDA augmentation prior to splitting, the limited scope of comparison models, and the exploratory nature of the statistical analysis. Future studies may extend the applicability of the framework by comparing additional pretrained models, controlling for confounding variables through multivariate methods, and applying the framework to multilingual construction accident datasets.

Author Contributions

Conceptualization, H.M.L., S.-H.S. and J.-H.W.; methodology, H.M.L. and S.-H.S.; software, H.M.L. and S.-H.S.; validation, H.M.L., S.-H.S. and M.G.K.; formal analysis, S.-H.S.; investigation, S.-H.S., H.M.L. and M.G.K.; resources, H.M.L. and J.-H.W.; data curation, S.-H.S. and H.M.L.; writing—original draft preparation, H.M.L. and S.-H.S.; writing—review and editing, S.-H.S., M.G.K. and J.-H.W.; visualization, S.-H.S.; supervision, S.-H.S. and J.-H.W.; project administration, S.-H.S. and J.-H.W.; funding acquisition, J.-H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Regional Innovation System & Education (RISE) Glocal University 30 program through the Chungbuk Regional Innovation System & Education Center, funded by the Ministry of Education (MOE) and the Chungcheongbuk-do, Republic of Korea (2025-RISE-11-014).

Data Availability Statement

The data are available from the corresponding authors upon reasonable request.

Acknowledgments

This study is part of the first author’s Master’s thesis at Chungbuk National University.

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the Funding statement. This change does not affect the scientific content of the article.

References

1. ILO. Safety and Health at Work; International Labour Organization: Geneva, Switzerland, 2023.
2. Winge, S.; Albrechtsen, E. Accident types and barrier failures in the construction industry. Saf. Sci. 2018, 105, 158–166.
3. Hämäläinen, P.; Takala, J.; Kiat, T.B. Global Estimates of Occupational Accidents and Work-Related Illnesses 2017; Workplace Safety and Health Institute: Singapore, 2017.
4. Ministry of Employment and Labor (MOEL). Occupational Accident Status; MOEL: Sejong, Republic of Korea, 2024.
5. Tam, C.M.; Zeng, S.X.; Deng, Z.M. Identifying elements of poor construction safety management in China. Saf. Sci. 2004, 42, 569–586.
6. Chi, S.; Han, S. Analyses of systems theory for construction accident prevention with specific reference to OSHA accident reports. Int. J. Proj. Manag. 2013, 31, 1027–1041.
7. Yoon, Y.G.; Ahn, C.R.; Yum, S.G.; Oh, T.K. Establishment of safety management measures for major construction workers through the association rule mining analysis of the data on construction accidents in Korea. Buildings 2024, 14, 998.
8. Hallowell, M.R.; Gambatese, J.A. Construction safety risk mitigation. J. Constr. Eng. Manag. 2009, 135, 1316–1323.
9. Leather, P.J. Safety and accidents in the construction industry: A work design perspective. Work Stress 1987, 1, 167–174.
10. Lingard, H.; Rowlinson, S. Occupational Health and Safety in Construction Project Management; Routledge: London, UK, 2004.
11. Kim, K.N.; Kim, T.H.; Lee, M.J. Analysis of building construction jobsite accident scenarios based on big data association analysis. Buildings 2023, 13, 2120.
12. Cheng, C.W.; Lin, C.C.; Leu, S.S. Use of association rules to explore cause–effect relationships in occupational accidents in the Taiwan construction industry. Saf. Sci. 2010, 48, 436–444.
13. Cheng, C.W.; Leu, S.S.; Lin, C.C.; Fan, C. Characteristic analysis of occupational accidents at small construction enterprises. Saf. Sci. 2010, 48, 698–707.
14. Ling, F.Y.Y.; Liu, M.; Woo, Y.C. Construction fatalities in Singapore. Int. J. Proj. Manag. 2009, 27, 717–726.
15. Jo, B.W.; Lee, Y.S.; Kim, J.H.; Khan, R.M.A. Trend Analysis of Construction Industrial Accidents in Korea from 2011 to 2015. Sustainability 2017, 9, 1297.
16. Tixier, A.J.P.; Hallowell, M.R.; Rajagopalan, B.; Bowman, D. Automated content analysis for construction safety: A natural language processing system to extract precursors and outcomes from unstructured injury reports. Autom. Constr. 2016, 62, 45–56.
17. Tixier, A.J.P.; Hallowell, M.R.; Rajagopalan, B.; Bowman, D. Application of machine learning to construction injury prediction. Autom. Constr. 2016, 69, 102–114.
18. Zou, Y.; Kiviniemi, A.; Jones, S.W. Retrieving similar cases for construction project risk management using natural language processing techniques. Autom. Constr. 2017, 80, 66–76.
19. Goh, Y.M.; Ubeynarayana, C.U. Construction accident narrative classification: An evaluation of text mining techniques. Accid. Anal. Prev. 2017, 108, 122–130.
20. Zhang, F.; Fleyeh, H.; Wang, X.; Lu, M. Construction site accident analysis using text mining and natural language processing techniques. Autom. Constr. 2019, 99, 238–248.
21. Baker, H.; Hallowell, M.R.; Tixier, A.J.P. AI-based prediction of independent construction safety outcomes from universal attributes. Autom. Constr. 2020, 118, 103146.
22. Zhong, B.; Pan, X.; Love, P.E.D.; Sun, J.; Tao, C. Hazard analysis: A deep learning and text mining framework for accident prevention. Adv. Eng. Inform. 2020, 46, 101152.
23. Wang, Y.; Zou, P.X.W. Decoding construction accident causality: A decade of textual reports analyzed. Buildings 2025, 15, 3859.
24. Cao, Y.; Qu, Z.; Wu, S.; Chen, Y.; Skitmore, M.; Ma, X.; Wang, J. Analyzing OSHA construction accident reports using BERTopic topic modeling for thematic insights. Buildings 2026, 16, 10.
25. Lee, W.; Lee, S. Development of a knowledge base for construction risk assessments using BERT and graph models. Buildings 2024, 14, 3359.
26. Zhou, K.; Wang, J.; Ashuri, B.; Chen, J. Discovering the research topics on construction safety and health using semi-supervised topic modeling. Buildings 2023, 13, 1169.
27. Lee, J.; Ahn, S. PageRank algorithm-based recommendation system for construction safety guidelines. Buildings 2024, 14, 3041.
28. Badhan, S.J.; Samsami, R. Artificial intelligence (AI) in construction safety: A systematic literature review. Buildings 2025, 15, 4084.
29. Zhou, Z.; Wei, L.; Luan, H. Deep learning for named entity recognition in extracting critical information from struck-by accidents in construction. Autom. Constr. 2025, 173, 106106.
30. Kim, M.J.; Ahn, S.P.; Shin, S.H.; Kang, M.G.; Won, J.H. Comparison of influencing factors on safety behavior and perception between contractor managers and subcontractor workers at Korean construction sites. Buildings 2025, 15, 963.
31. Kumi, L.; Jeong, J.; Jeong, J. Data-driven automatic classification model for construction accident cases using natural language processing with hyperparameter tuning. Autom. Constr. 2024, 164, 105458.
32. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; NAACL-HLT: Minneapolis, MN, USA, 2019; pp. 4171–4186.
33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
34. Park, S.; Moon, J.; Kim, S.; Cho, W.I.; Han, J.; Park, J.; Song, C.; Kim, J.; Song, Y.; Oh, T.; et al. KLUE: Korean language understanding evaluation. In Proceedings of the NeurIPS 2021 Datasets and Benchmarks Track, Virtual, 6–14 December 2021.
35. Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. arXiv 2019, arXiv:1901.11196.
36. Bergsma, W. A bias-correction for Cramér’s V and Tschuprow’s T. J. Korean Stat. Soc. 2013, 42, 323–328.
37. Cohen, J. Statistical Power Analysis for the Behavioral Sciences; Routledge: New York, NY, USA, 2013.
38. Hwang, B.G.; Liao, P.C.; Leonard, M.P. Performance and practice use comparisons: Public vs. private owner projects. KSCE J. Civ. Eng. 2011, 15, 957–963.
39. ISCO-08; International Standard Classification of Occupations 2008: Structure, Group Definitions and Correspondence Tables. International Labour Office: Geneva, Switzerland, 2012.
40. Song, M.; Jeong, J.; Kumi, L. Quantitative risk evaluation for construction methods using accident rate analysis based on working days by occupation. Saf. Sci. 2026, 196, 107094.
41. Kang, K.; Ryu, H. Predicting types of occupational accidents at construction sites in Korea using random forest model. Saf. Sci. 2019, 120, 226–236.
42. Dong, X.S.; Fujimoto, A.; Ringen, K.; Men, Y. Fatal falls among Hispanic construction workers. Accid. Anal. Prev. 2009, 41, 1047–1052.
43. Halabi, Y.; Xu, H.; Long, D.; Chen, Y.; Yu, Z.; Alhaek, F.; Alhaddad, W. Causal factors and risk assessment of fall accidents in the U.S. construction industry: A comprehensive data analysis (2000–2020). Saf. Sci. 2022, 146, 105537.
44. Agresti, A. An Introduction to Categorical Data Analysis; John Wiley & Sons: New York, NY, USA, 1996.
45. Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics 1955, 11, 375–386.
46. Fox, M.P.; MacLehose, R.F.; Lash, T.L. Applying Quantitative Bias Analysis to Epidemiologic Data; Springer: New York, NY, USA, 2021.
47. Bria, T.A.; Chen, W.T.; Muhammad, M.; Rantelembang, M.B. Analysis of Fatal Construction Accidents in Indonesia—A Case Study. Buildings 2024, 14, 1010.
48. Kazan, E.; Usmen, M.A. Worker Safety and Injury Severity Analysis of Earthmoving Equipment Accidents. J. Saf. Res. 2018, 65, 73–81.

Figure 1. Research workflow.

Figure 2. Fine-tuned KLUE-BERT model architecture for project ownership classification.

Figure 3. Model performance comparison.

Figure 4. Confusion matrix for the fine-tuned KLUE-BERT model.

Figure 5. Bias-corrected Cramér’s V results.

Figure 6. ASR by accident type. * p < 0.05, *** p < 0.001.

Figure 7. ASR by accident cause. *** p < 0.001.

Figure 8. Construction scale comparison by project ownership. *** p < 0.001.

Figure 9. OR forest plot: six variables combined.

Figure 10. Trend in the proportion of public project accidents over 10 years.

Figure 11. ASR by occupation. ** p < 0.01, *** p < 0.001.

Figure 12. OR by tenure.

Table 1. Parameters of the fine-tuned KLUE-BERT model.

Parameter	Value
Pretrained model	klue/bert-base
Hidden size	768
Attention heads	12
Transformer layers	12
Batch size	16
Learning rate	5 × 10⁻⁵
Number of epochs	3
Warmup steps	500
Weight decay	0.01
Dropout rate	0.1
Max sequence length	128
Optimizer	AdamW
Data split (training/validation/test)	60/20/20
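
For readers who want to reproduce the setup, the Table 1 values can be gathered into a single configuration object. The following is a minimal sketch rather than the authors' code: the class name `FineTuneConfig`, its field names, and the two helper methods are hypothetical, and only the values are taken from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters copied from Table 1 (field names are illustrative)."""
    pretrained_model: str = "klue/bert-base"
    hidden_size: int = 768
    attention_heads: int = 12
    transformer_layers: int = 12
    batch_size: int = 16
    learning_rate: float = 5e-5
    num_epochs: int = 3
    warmup_steps: int = 500
    weight_decay: float = 0.01
    dropout_rate: float = 0.1
    max_seq_length: int = 128
    optimizer: str = "AdamW"
    split: tuple = (0.6, 0.2, 0.2)  # training / validation / test

    def head_dim(self) -> int:
        # Per-head width in standard multi-head attention: hidden size / heads.
        if self.hidden_size % self.attention_heads != 0:
            raise ValueError("hidden_size must be divisible by attention_heads")
        return self.hidden_size // self.attention_heads

    def split_is_valid(self) -> bool:
        # The three split fractions should cover the whole labeled set.
        return abs(sum(self.split) - 1.0) < 1e-9


config = FineTuneConfig()
print(config.head_dim(), config.split_is_valid())
```

Keeping the values in one frozen object makes it harder for a rerun to drift silently from the reported settings.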

Table 2. Occupation classification based on ISCO-08.

Category	ISCO-08 Code
Construction laborer	931
Building trades worker	711/712/713
Other skilled trades worker	-
Construction manager/supervisor	132
Metal/welding worker	721
Electrical worker	741
Equipment operator	834/81
Woodworker/installer	752
Other/non-construction	-

Table 3. Performance comparison between rule-based and fine-tuned KLUE-BERT models.

Metric	Rule-Based	KLUE-BERT	Relative improvement
Accuracy	0.6159	0.9876	+60.4%
F1-score (weighted)	0.5511	0.9876	+79.2%
F1-score (macro)	0.5030	0.9871	+96.2%
Precision (weighted)	0.5976	0.9876	+65.3%
Recall (weighted)	0.6159	0.9876	+60.4%
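
The last column of Table 3 appears to be the gain relative to the rule-based baseline, that is, (KLUE-BERT − rule-based) / rule-based. The short check below works under that assumption; the function name `relative_improvement` is illustrative, and the metric values are copied from Table 3.

```python
def relative_improvement(baseline: float, model: float) -> float:
    """Percentage gain of `model` over `baseline`, relative to the baseline."""
    return (model - baseline) / baseline * 100.0


# (rule-based, KLUE-BERT) pairs copied from Table 3.
TABLE_3 = {
    "Accuracy": (0.6159, 0.9876),
    "F1-score (weighted)": (0.5511, 0.9876),
    "F1-score (macro)": (0.5030, 0.9871),
    "Precision (weighted)": (0.5976, 0.9876),
    "Recall (weighted)": (0.6159, 0.9876),
}

improvements = {
    metric: round(relative_improvement(rule, bert), 1)
    for metric, (rule, bert) in TABLE_3.items()
}
for metric, pct in improvements.items():
    print(f"{metric}: +{pct}%")
```

Under this reading, the accuracy gain of 60.4% is relative; the absolute difference is 0.9876 − 0.6159 = 0.3717, or about 37.2 percentage points.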

Table 4. Training results of the fine-tuned KLUE-BERT model.

Epoch	Train Loss	Train Acc.	Val Loss	Val Acc.	Val F1
1	0.1117	0.9670	0.1027	0.9766	0.9766
2	0.0643	0.9846	0.0683	0.9845	0.9845
3	0.0344	0.9922	0.0560	0.9877	0.9877

Table 5. Descriptive statistics of six accident variables by project ownership (2014–2023).

Variable	Category	Public (n = 71,550)	Private (n = 174,448)	Total (n = 245,998)
Accident type	Fall	19,879	60,325	80,204
	Struck by object	14,311	34,109	48,420
	Slip	10,236	22,968	33,204
	Collision	7964	14,439	22,403
	Caught in between	7540	13,471	21,011
	Cut/pierced	4638	14,077	18,715
	Other	6982	15,059	22,041
Accident cause	Buildings/structures/surfaces	34,113	94,031	128,144
	Machinery/equipment	10,243	20,563	30,806
	Materials/products	6541	15,232	21,773
	Hand/power tools	5108	13,904	19,012
	Transportation	2812	2567	5379
	Others	12,733	28,151	40,884
Construction scale	Small (<KRW 5 billion)	50,369	132,231	182,600
	Medium (KRW 5–12 billion)	8713	10,443	19,156
	Large (≥KRW 12 billion)	12,468	31,774	44,242
Accident severity	Nonfatal injury	70,264	171,658	241,922
	Fatal	1286	2790	4076
Occupation	Construction laborer	32,307	77,300	109,607
	Building trades worker	20,273	58,904	79,177
	Other skilled trades worker	4869	8675	13,544
	Construction manager/supervisor	3462	8008	11,470
	Equipment operator	2750	6587	9337
	Other/Non-construction	3676	3772	7448
	Electrical worker	2224	4319	6543
	Metal/welding worker	1426	4610	6036
	Woodworker/installer	563	2273	2836
Tenure	Less than 1 month	48,767	126,294	175,061
	1–6 months	15,645	35,219	50,864
	6 months–1 year	3088	5767	8855
	1–5 years	3186	5592	8778
	More than 5 years	864	1576	2440

Table 6. Chi-square test results and bias-corrected Cramér’s V (Bonferroni-adjusted α = 0.00833).

Variable	N	χ²	df	p-Value	Cramér’s V	Effect Size	Sig.
Accident cause	245,998	6229.68	50	<0.001	0.1585	Small	Yes
Construction scale	245,998	2775.77	2	<0.001	0.1062	Small	Yes
Occupation	245,998	2554.35	8	<0.001	0.1017	Small	Yes
Accident type	245,998	2329.45	16	<0.001	0.0970	Negligible	Yes
Tenure	245,998	609.42	4	<0.001	0.0496	Negligible	Yes
Accident severity	245,998	12.38	1	<0.001	0.0068	Negligible	Yes
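
The Cramér's V column can be recomputed from the reported χ² values. The sketch below assumes the bias correction of Bergsma [36] and a two-row table (public vs. private), so that the number of columns is df + 1; the function name and the dictionary layout are illustrative, not the authors' code.

```python
import math


def bias_corrected_cramers_v(chi2: float, n: int, rows: int, cols: int) -> float:
    """Bias-corrected Cramér's V (Bergsma) for a rows x cols contingency table."""
    phi2 = chi2 / n
    # Subtract the expected value of phi^2 under independence; clip at zero.
    phi2_corr = max(0.0, phi2 - (rows - 1) * (cols - 1) / (n - 1))
    rows_corr = rows - (rows - 1) ** 2 / (n - 1)
    cols_corr = cols - (cols - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corr / min(rows_corr - 1, cols_corr - 1))


N = 245_998
ROWS = 2  # public vs. private (assumed table orientation)

# variable: (chi-square, df, reported V), copied from Table 6.
TABLE_6 = {
    "Accident cause": (6229.68, 50, 0.1585),
    "Construction scale": (2775.77, 2, 0.1062),
    "Occupation": (2554.35, 8, 0.1017),
    "Accident type": (2329.45, 16, 0.0970),
    "Tenure": (609.42, 4, 0.0496),
    "Accident severity": (12.38, 1, 0.0068),
}

computed = {}
for name, (chi2, df, _reported) in TABLE_6.items():
    cols = df // (ROWS - 1) + 1  # df = (rows - 1) * (cols - 1)
    computed[name] = round(bias_corrected_cramers_v(chi2, N, ROWS, cols), 4)
    print(f"{name}: V = {computed[name]:.4f}")
```

With n this large, the correction term is tiny, so the corrected values stay close to the uncorrected sqrt(χ²/n).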

Table 7. Key ASR results by accident type.

Accident Type	ASR (Public)	Interpretation
Fall	−32.54 ***	Public < Private
Collision	+22.56 ***	Public > Private
Caught in between	+22.16 ***	Public > Private
Structural collapse	+17.43 ***	Public > Private
Cut/pierced	−12.61 ***	Public < Private
Drowning	+9.12 ***	Public > Private

Note: *** p < 0.001.

Table 8. Key OR results by accident type.

Accident Type	OR	95% CI	Direction
Oxygen deficiency	15.85	[5.53–45.43]	Public >> Private
Drowning	15.68	[7.07–34.78]	Public >> Private
Structural collapse	1.867	[1.74–2.01]	Public > Private
Collision	1.393	[1.35–1.43]	Public > Private
Fall	0.729	[0.71–0.74]	Public < Private
Cut/pierced	0.831	[0.81–0.86]	Public < Private

Table 9. Key ASR results by accident cause.

Accident Causes	ASR (Public)	Interpretation
Construction/mining machinery	+45.87 ***	Public > Private
Means of land transportation	+35.46 ***	Public > Private
Floor and ground surfaces	+23.09 ***	Public > Private
Stairs and ladders	−27.98 ***	Public < Private
Scaffolding and working platforms	−22.22 ***	Public < Private

Note: *** p < 0.001.

Table 10. Key OR results by accident cause.

Accident Causes	OR	95% CI	Direction
Construction/mining machinery	3.201	[3.04–3.37]	Public >> Private
Means of land transportation	2.586	[2.45–2.73]	Public >> Private
Floor and ground surfaces	1.370	[1.33–1.41]	Public > Private
Stairs and ladders	0.726	[0.71–0.75]	Public < Private
Scaffolding and working platforms	0.658	[0.64–0.68]	Public < Private

Table 11. ASR and OR results by construction scale.

Construction Scale	ASR (Public)	OR	95% CI	Direction
Medium (KRW 5–12 billion)	+52.65 ***	2.197	[2.13–2.26]	Public >> Private
Small (<KRW 5 billion)	−27.72 ***	0.760	[0.75–0.77]	Public < Private
Large (≥KRW 12 billion)	−5.15 ***	0.942	[0.92–0.96]	Public < Private

Note: *** p < 0.001.

Table 12. Cochran–Armitage trend test for public project proportion over time.

Test	Statistic	p-Value
Cochran–Armitage Z	18.28	<0.001
Annual slope	+0.60 percentage points/year	-
R²	0.4403	-
Mann–Kendall τ	0.511	0.047

Table 13. ASR and OR results by occupation.

Occupation	ASR (Public)	OR	95% CI	Public (%)	Private (%)
Other/Non-construction	+39.11 ***	2.451	[2.340–2.567]	5.1	2.2
Building trades worker	−26.19 ***	0.775	[0.761–0.790]	28.3	33.8
Other skilled trades worker	+18.09 ***	1.395	[1.346–1.447]	6.8	5.0
Woodworker/installer	−10.89 ***	0.601	[0.548–0.659]	0.8	1.3
Metal/welding worker	−9.46 ***	0.749	[0.706–0.795]	2.0	2.6
Electrical worker	+8.85 ***	1.264	[1.200–1.331]	3.1	2.5
Construction laborer	+3.82 ***	1.035	[1.017–1.053]	45.2	44.3
Construction manager/supervisor	+2.65 **	1.057	[1.014–1.101]	4.8	4.6
Equipment operator	+0.80	1.019	[0.973–1.066]	3.8	3.8

Note: *** p < 0.001, ** p < 0.01.
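
The ASR column can be recomputed from the occupation counts in Table 5. The sketch below assumes the standard adjusted standardized residual for the public cell of each category, (O − E) / sqrt(E · (1 − row share) · (1 − column share)); the function and variable names are illustrative.

```python
import math

N_PUBLIC, N_TOTAL = 71_550, 245_998  # group sizes from the Table 5 header


def adjusted_standardized_residual(observed: int, category_total: int) -> float:
    """ASR of the public cell for one category of a 2 x k table."""
    expected = N_PUBLIC * category_total / N_TOTAL
    variance = expected * (1 - N_PUBLIC / N_TOTAL) * (1 - category_total / N_TOTAL)
    return (observed - expected) / math.sqrt(variance)


# occupation: (public count, total count, reported ASR) from Tables 5 and 13.
OCCUPATION = {
    "Other/Non-construction": (3676, 7448, 39.11),
    "Building trades worker": (20_273, 79_177, -26.19),
    "Other skilled trades worker": (4869, 13_544, 18.09),
    "Woodworker/installer": (563, 2836, -10.89),
    "Metal/welding worker": (1426, 6036, -9.46),
    "Electrical worker": (2224, 6543, 8.85),
    "Construction laborer": (32_307, 109_607, 3.82),
    "Construction manager/supervisor": (3462, 11_470, 2.65),
    "Equipment operator": (2750, 9337, 0.80),
}

asr = {
    name: adjusted_standardized_residual(public, total)
    for name, (public, total, _reported) in OCCUPATION.items()
}
for name, value in asr.items():
    print(f"{name}: ASR = {value:+.2f}")
```

Under this definition, an absolute ASR above roughly 1.96 corresponds to p < 0.05 for a single cell, which matches equipment operators being the only category without a significance marker.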

Table 14. ASR and OR results by tenure.

Tenure	ASR (Public)	OR	95% CI	Public (%)	Private (%)
Less than 1 month	−21.08 ***	0.816	[0.801–0.832]	68.2	72.4
1–6 months	+9.33 ***	1.106	[1.083–1.130]	21.9	20.2
6 months–1 year	+12.21 ***	1.319	[1.262–1.379]	4.3	3.3
1–5 years	+15.15 ***	1.407	[1.346–1.471]	4.5	3.2
More than 5 years	+6.91 ***	1.341	[1.233–1.457]	1.2	0.9

Note: *** p < 0.001.
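
The odds ratios and intervals by tenure can be recomputed from the Table 5 counts. The sketch below assumes unadjusted 2 × 2 odds ratios (each tenure category against all others) with a Wald interval on the log-odds scale, which this section does not state explicitly; the names are illustrative.

```python
import math

N_PUBLIC, N_PRIVATE = 71_550, 174_448  # group sizes from the Table 5 header


def odds_ratio_wald_ci(public_in: int, private_in: int, z: float = 1.96):
    """Odds ratio (public vs. private) for one category, with a Wald 95% CI."""
    a, b = public_in, N_PUBLIC - public_in      # public: in / not in category
    c, d = private_in, N_PRIVATE - private_in   # private: in / not in category
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# tenure: (public count, private count) from Table 5.
TENURE = {
    "Less than 1 month": (48_767, 126_294),
    "1–6 months": (15_645, 35_219),
    "6 months–1 year": (3088, 5767),
    "1–5 years": (3186, 5592),
    "More than 5 years": (864, 1576),
}

results = {name: odds_ratio_wald_ci(pub, priv) for name, (pub, priv) in TENURE.items()}
for name, (odds_ratio, lower, upper) in results.items():
    print(f"{name}: OR = {odds_ratio:.3f} [{lower:.3f}–{upper:.3f}]")
```

An interval that excludes 1 indicates a difference between public and private projects at the 5% level for that category; all five tenure intervals in Table 14 do so.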

Table 15. Public project proportion by tenure and Cochran–Armitage trend test.

Tenure	Public	Total	Public (%)
Less than 1 month	48,767	175,061	27.86
1–6 months	15,645	50,864	30.76
6 months–1 year	3088	8855	34.87
1–5 years	3186	8778	36.30
More than 5 years	864	2440	35.41

Note: Cochran–Armitage trend test: Z = 24.19, p < 0.001.
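
The trend statistic in the note can be recomputed from the counts in Table 15. The sketch below assumes equally spaced scores (0 to 4) for the five ordered tenure categories; the function name and the score choice are illustrative rather than taken from the paper.

```python
import math


def cochran_armitage_z(events, totals, scores=None):
    """Cochran–Armitage Z for a linear trend in proportions across ordered groups."""
    if scores is None:
        scores = list(range(len(totals)))  # equally spaced scores (assumption)
    n = sum(totals)
    p = sum(events) / n  # pooled proportion of public-project accidents
    # Score-weighted sum of deviations from the pooled expectation.
    t_stat = sum(s * (x - m * p) for s, x, m in zip(scores, events, totals))
    s1 = sum(s * m for s, m in zip(scores, totals))
    s2 = sum(s * s * m for s, m in zip(scores, totals))
    variance = p * (1 - p) * (s2 - s1 ** 2 / n)
    return t_stat / math.sqrt(variance)


# Public and total accident counts by tenure, copied from Table 15.
PUBLIC = [48_767, 15_645, 3088, 3186, 864]
TOTAL = [175_061, 50_864, 8855, 8778, 2440]

public_pct = [round(100 * x / m, 2) for x, m in zip(PUBLIC, TOTAL)]
z = cochran_armitage_z(PUBLIC, TOTAL)
print(public_pct)
print(f"Z = {z:.2f}")
```

The statistic does not change if the scores are shifted or rescaled (for example, 1 to 5 instead of 0 to 4). The test measures the overall linear trend, so it remains strongly positive even though the public share dips slightly in the last category (35.41% after 36.30%).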

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

