Occurrence Type Classification for Establishing Prevention Plans Based on Industrial Accident Cases Using the KoBERT Model

Song, Ju-Han; Shin, Seung-Hyeon; Kang, Sung-Yong; Won, Jeong-Hun; Yoo, Kwan-Hee

doi:10.3390/app14209450

by Ju-Han Song ¹, Seung-Hyeon Shin ², Sung-Yong Kang ³,*, Jeong-Hun Won ³ and Kwan-Hee Yoo ¹,*

¹ Department of Computer Science, Chungbuk National University, Chungdae-ro 1, Seowon-gu, Cheongju 28644, Republic of Korea

² Department of Big Data, Chungbuk National University, Chungdae-ro 1, Seowon-gu, Cheongju 28644, Republic of Korea

³ Department of Safety Engineering, Chungbuk National University, Chungdae-ro 1, Seowon-gu, Cheongju 28644, Republic of Korea

* Authors to whom correspondence should be addressed.

Appl. Sci. 2024, 14(20), 9450; https://doi.org/10.3390/app14209450

Submission received: 15 August 2024 / Revised: 8 October 2024 / Accepted: 13 October 2024 / Published: 16 October 2024

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

With increasing industrial sophistication and complexity, workplaces are increasingly prone to occupational accidents, which harm workers and employers through economic losses and decreased productivity. To overcome stagnation in industrial accident reduction, South Korea's occupational safety and health authorities have implemented new policies that address potential risks and predict on-site accidents from past cases. Cases are classified by humans according to rules, including occurrence type or original cause materials. However, human errors, subjective judgments, synonyms, and terms incorrectly used by classifiers reduce the quality of the original data and impede the development or application of policies, technologies, and methods that prevent accidents based on past cases. This study proposes three artificial intelligence models to objectively classify the occurrence type of accident cases. The models are based on a natural language processing model (KoBERT) that considers the characteristics of the Korean language. Each model is tested by sequentially adding sentence preprocessing, keyword replacement, and morphological analysis. The proposed Model 3 exhibits 93.1% accuracy, the highest among the tested models. Up to three candidate occurrence types are provided per case to assist objective classification. The accident case-based occurrence type classification model is effective for industrial accident prevention, aiding in strategy development and reducing social costs.

Keywords:

natural language processing; occurrence type; occupational accidents; KoBERT; original cause materials

1. Introduction

The growing diversity of the global population, coupled with economic growth, has precipitated a period of accelerated social transformation, necessitating a swift technological response to meet evolving societal demands. The advancement of new technologies and the advent of machinery have become imperative worldwide, and sustained industrial evolution has enhanced the capacity to respond effectively to societal changes. At the same time, as industrial sophistication and complexity have increased to meet changing societal needs, the number of workers employed in the field has grown, and the number of occupational accidents has risen correspondingly because of the hazards to which workers are exposed in their workplaces [1]. Occupational accidents impose several expenses on workers, including medical treatment and earnings losses. Such incidents also have an adverse impact on companies or employers, resulting in a loss of labor, reduced productivity, and weakened relationships [2]. Furthermore, the financial losses incurred owing to occupational accidents present a significant economic burden on the country and are widely acknowledged as a major social issue [3].

In the United States, the National Institute for Occupational Safety and Health (NIOSH) has developed a comprehensive strategic plan to address a wide range of occupational health and safety risks by identifying high-risk workplaces and at-risk workers [4]. In the United Kingdom, the changing landscape of workplaces and the expanded role of the Health and Safety Executive (HSE) are being planned to include factors not previously considered, such as occupational accidents related to stress and mental health [5]. In Germany, the Joint Strategy for Occupational Safety and Health provides for the modernization of their occupational safety and health system and strengthens the role of companies concerning safety and health [6]. In Japan, the Ministry of Health, Labor, and Welfare has set various goals for preventing occupational accidents and proposed eight priorities and specific response strategies for achieving them [7]. To effectively reduce occupational accidents and create workplaces with guaranteed life and safety of workers, South Korea has introduced and systematized a system of on-site supervision of high-risk areas with clear responsibilities and roles. The importance of occupational safety is increasingly being recognized globally, and policy support is being sought through various approaches.

Notably, South Korea has recently witnessed stagnation in the trend of reducing occupational accidents, despite several policies supporting the reduction. The situation is highly threatening in countries with a high proportion of manufacturing industries and dense populations. In response, Korea’s Ministry of Employment and Labor has emphasized a “self-discipline prevention system” based on risk assessment (listing and checking accidental risks during work or workplace processes by predicting them from previous cases) to detect and prevent risks. Additionally, it aims to accurately identify and predict workplace accidents based on previous accident cases and implement effective preventive measures. New policy directions are oriented based on accidents at industrial sites and contribute to the resolution of occupational accidents, which are recognized as major social problems. Therefore, accurate records and detailed analyses are required to utilize past cases of occupational accidents. The sources of accident cases have been collected and recorded from the Industrial Accident Compensation Approval Data (for accidents requiring medical treatment for more than four days, the Ministry of Labor and Welfare receives an application from the employer, and the details of the accident are classified and recorded by the Occupational Safety and Health Institute) and the Occupational Accident Investigation Table (for accidents requiring leave of absence for more than three days, the local office of the Ministry of Employment and Labor classifies and records the relevant accidents). The recorded information is converted into statistical data by classification workers (from the Ministry of Employment, Labor and Welfare or the Occupational Safety and Health Agency) based on an accident summary to identify the circumstances surrounding the accident.

During the process of transforming and categorizing data, those engaged in data classification frequently encounter difficulties with certain datasets. The main consequence is a decline in the quality of the original data, which is largely attributable to the subjective nature of input by workers, human error, the presence of homophones, misuse of terminology, and other factors. Moreover, classifying the occurrence type or original cause materials among accident-related information can be difficult or ambiguous, which makes objective classification infeasible. Based on existing case studies, these issues have the potential to significantly affect the predictability and prevention of future accidents.

This study first addresses the quality issues by preprocessing the accident overviews of 23 years of accident case data (approximately 2,100,000 cases) recorded at Korean industrial sites. The preprocessed data are used to develop a model that classifies the occurrence type, one of the classification criteria, using artificial intelligence (AI) algorithms. The developed model, with its objectivity and accuracy, can then be used for research and policy development aimed at preventing accidents on the basis of past cases. The AI model also learns the words in the original data that affect the quality of the accident overview, which increases its versatility. In addition, we expect that analyzing original cause materials closely related to accidents, along with the working environment and other contributing elements, will help reduce accidents. Moreover, we plan to strengthen the role of safety culture and managers by applying systems thinking to safety engineering and emphasizing organizational factors. Through this, we aim to foster a proactive safety culture within organizations and encourage active managerial participation in accident prevention.

2. Literature Review of Artificial Intelligence Model Utilization for Industrial Accident Prevention

The use of AI models for occupational safety differs from traditional statistical research applications. The differences are due to the dependence of data analysis processes on the opinions of experts or practitioners in the field, or the lack of numerical data that leads to biased results. Chi et al. [8] used a systematic analysis of accident reports and statistical methods to analyze the causes and patterns of work-related fatalities. Jacinto et al. [9] focused on an in-depth understanding of the causes of work accidents and developed strategic preventive measures. Although such studies have made significant advancements, they face limitations. Most of the methodologies focus on the quantitative aspects of incident data, which makes fully understanding the root causes of incidents or complex interaction patterns difficult [10]. AI models are being utilized to improve the results of traditional research. Machine learning (ML) models typically use a “two-step approach”, where the first and second steps involve extracting features from data and performing classification or predictions, respectively. G Ahn et al. [11] analyzed data from the 1665 accident summaries registered with the Korea Occupational Safety and Health Administration from 2007 to 2017 and extracted information. The extracted information was categorized into work type, original cause materials, occurrence type, and number of fatalities. Subsequently, they developed an ontology using a support vector machine (SVM) classification model [12]. Zhang et al. [13] used ML models including SVM, linear regression, k-nearest neighbors, decision tree, naive Bayes, and optimized ensemble models to classify construction accident causes. Such approaches require several interventions for feature extraction, have a strong dependency on domain knowledge, and are limited in their ability to handle large amounts of training data [14].

Recently, deep-learning (DL)-based models have gained traction owing to their generalizability compared to models in previous research. DL-based models are effective in handling large amounts of data and can solve complex non-linear problems with multiple layers. Unlike ML models that are optimized for numerical data, DL-based models can handle various types of data, including images, text, and audio, making them effective at introducing new approaches to the field of occupational safety and producing objective results. Zhang et al. [15] utilized a recurrent-neural-network-based [16] long short-term memory (LSTM) [17] technique to analyze NTSB aviation accident reports. To accurately and objectively identify the causes of aviation accidents, three binary classification models were developed: accident vs. incident, damaged vs. non-damaged, and fatal vs. nonfatal. Nemani et al. [18] utilized an ensemble of LSTM models to predict the remaining useful life of bearings so that machine failures could be detected and prevented early, thereby avoiding unplanned machine downtime, increased costs, and potential occupational accidents.

Although LSTM techniques have been utilized in various ways, they have limitations. In particular, because an LSTM processes the information in a sequence sequentially, it handles the text preceding a given position relatively well but has difficulty taking the text following that position into account [19]. This problem can be addressed using the bidirectional encoder representations from transformers (BERT) model developed by Google [20]. The BERT model uses a bidirectional transformer encoder to fully understand textual contexts and sentence organization. Luo et al. [21] utilized the BERT model for text analysis of chemical accident analysis reports. Soft Lexicon and BERT–transformer–conditional random fields were used together to automatically extract valid information from chemical accident reports and suggest ways to identify accident causes and preventive measures. Luo et al. [22] proposed a method for applying checklists and safety training to the BERT model using health, labor, and accident case data for identifying and preventing accident causes at construction sites.

The use of AI models has gradually become a major research trend in occupational safety due to the improvements in algorithmic limitations of initially developed AI models and the frequent applications of new models. Despite the use of constantly updated AI models, the fundamental problem of natural language processing remains a challenge. In particular, Korean, which is the language considered in this study, presents challenges for AI learning due to its complex sentence structures, the presence of multiple expressions of sentences, and the intricacies of its parts of speech.

Therefore, this study aims to improve the value of basic data for accident prevention by actively utilizing data on occupational accidents at Korean industrial sites. As a huge amount of data is considered, introducing an objective classification method through AI learning to fully reflect the characteristics of the Korean language is necessary. Consequently, the KoBERT [23] model is adopted as a classification tool for accident cases, and accident outlines recorded as actual accident cases are used as training data. To verify the classification method of the developed model, confusion matrices, precision, recall, and accuracy are utilized as the criteria of occurrence type, which distinguishes the main characteristics of the accidents.

3. Material and Methods

3.1. Overview

Between 2001 and 2023, approximately 2,100,000 occupational accidents occurred in South Korea, including deaths, injuries, and illnesses. Data are provided by the Ministry of Employment and Labor and include “occurrence type” and “accident summary”. “Accident summary” contains information that allows intuitive judgment of occupational accidents that occurred and is organized in the form of a sentence. “Occurrence type” is important information that classifies the final type of industrial accident and is categorized by a code, as listed in Table 1.

The classification performed based on “accident summary” requires correction for several errors. Human errors (subjectivity, lack of consistency, typos, and misclassification) arise mainly from the human categorization process and degrade data quality. In addition, the amount of data accumulated annually is imbalanced. Correcting these problems is essential for developing an effective AI classification model and has a significant impact on accuracy and reliability. Therefore, we improved both the quality and the quantity of the training data in this study through sentence refinement during preprocessing, replacement of keywords by code, morphological analysis, and data augmentation.

3.2. Improving the Quality of Occupational Accidents

Accident classification can help prevent recurring accidents, specialized accidents with anomalous findings, and accidents of high magnitude or frequency. To improve the effectiveness of classification, improving the quality of the “accident summary” is crucial, so a three-step preprocessing stage was applied in this study. First, unnecessary special characters in the sentences were removed so that only essential sentence elements remained. Second, repeated or redundant sentences and words were removed so that they would not distort training. Finally, unnecessary words irrelevant to the content were excluded. The proposed method is illustrated in Figure 1.
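The three preprocessing steps can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function name `preprocess` and the `STOPWORDS` set are hypothetical, and the actual list of excluded words is not given in the text.

```python
import re

# Hypothetical filler words; the words actually excluded are not listed in the text.
STOPWORDS = {"the", "etc"}

def preprocess(summary: str) -> str:
    """Clean one accident summary in three steps."""
    # Step 1: replace symbols and punctuation with spaces, keeping letters,
    # digits, and Hangul (all matched by \w in Python 3).
    text = re.sub(r"[^\w\s]", " ", summary).replace("_", " ")
    # Step 2: drop tokens that immediately repeat the previous token.
    deduped = []
    for token in text.split():
        if not deduped or deduped[-1] != token:
            deduped.append(token)
    # Step 3: drop words that carry no information about the accident.
    kept = [t for t in deduped if t.lower() not in STOPWORDS]
    return " ".join(kept)

print(preprocess("Worker ※ fell fell from the ladder!!"))
```

Step 2 here only collapses adjacent repeats; removing duplicated sentences across a whole record would need an additional pass.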

3.3. Replacement with Similar Keywords

Some quality problems remained even after data preprocessing. Typos and alternative words or expressions with the same meaning reduced the efficiency of model training. This issue was resolved by introducing similar keywords. By replacing the variant wordings or typos that refer to an “occurrence type” with similar keywords, it became possible to achieve model training with improved accuracy and efficiency. Keywords were extracted from the “accident summaries” of the last five years of accident cases. As the data were based on actual incident data, they contributed significantly to improving the prediction performance of the model. Similar keywords were categorized by “occurrence type” and expressed as frequencies by codes (Table 2).

Similar keywords help models effectively recognize different expressions or words with similar meanings. They also contribute to the generation or replacement of sufficient training data, thereby playing an important role in creating robust models that reflect the diversity of natural languages. Additionally, keywords can be used to minimize the quantitative imbalance in training data, which can improve the predictive performance of models. The process of similar keyword replacement is illustrated in Figure 2.
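The two uses of similar keywords described above, normalizing variants and generating additional training sentences, can be sketched as follows. The keyword groups and the function names `normalize` and `augment` are hypothetical placeholders; the actual groups are those summarized in Table 2.

```python
# Hypothetical keyword groups: canonical keyword -> variants with the same meaning.
SIMILAR_KEYWORDS = {
    "fall": ["fell", "dropped", "tumbled"],
    "caught": ["jammed", "pinched", "trapped"],
}

def normalize(summary, groups=SIMILAR_KEYWORDS):
    """Replace every variant with the canonical keyword of its group."""
    variant_to_canon = {v: c for c, variants in groups.items() for v in variants}
    return " ".join(variant_to_canon.get(w, w) for w in summary.split())

def augment(summary, groups=SIMILAR_KEYWORDS):
    """Create new sentences by swapping a keyword for the other members of its group."""
    words = summary.split()
    generated = []
    for i, word in enumerate(words):
        for canon, variants in groups.items():
            family = [canon] + variants
            if word in family:
                for alt in family:
                    if alt != word:
                        generated.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return generated

print(normalize("worker tumbled from scaffold"))
print(augment("hand jammed in press"))
```

Because each keyword group yields several alternatives, classes with few original summaries can be expanded more aggressively than classes that are already well represented.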

3.4. Morphological Analysis

We performed a morphological analysis to better understand the context of replaced keywords or sentences. Morphemes are the smallest units of sentence organization and comprise nouns, verbs, connecting words, and auxiliary words. This analysis assisted in understanding the role of each morpheme in a sentence and how it interacted with other morphemes. By transforming textual data into a more structured form, KoBERT models effectively learn from data, improve model performance, and enable more accurate predictions. Figure 3 illustrates the results of text extraction through morphological analysis, which contributed to the effective search and categorization of data.

3.5. Application of the KoBERT Model

The KoBERT model is pretrained on Korean Wikipedia and news articles to improve Korean language processing performance. As illustrated in Figure 4, the model is tailored specifically to Korean and, like the BERT model, uses the self-attention mechanism of the transformer to understand the meaning and relationships of sentences.

The final output classified a sentence into 1 of 27 occurrence types using the vector at the position of the [CLS] token, as illustrated in Figure 4. The [CLS] token summarizes the context of the entire sentence, which allows the class of the sentence to be determined. The KoBERT model used GluonNLP’s pretrained SentencePieceTokenizer as the tokenizer.
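The classification step can be written compactly as follows. This is a sketch that assumes a single linear layer on top of the [CLS] vector, which is the usual BERT fine-tuning setup; the symbols $h_{\mathrm{[CLS]}}$, $W$, and $b$ are introduced here for illustration, and the exact layer configuration is not specified in the text.

```latex
p = \operatorname{softmax}\!\left(W\,h_{\mathrm{[CLS]}} + b\right), \qquad
\hat{y} = \arg\max_{c \in \{1,\dots,27\}} p_c
```

Here $h_{\mathrm{[CLS]}} \in \mathbb{R}^{d}$ is the encoder output at the [CLS] position, $W \in \mathbb{R}^{27 \times d}$ and $b \in \mathbb{R}^{27}$ are the assumed classifier parameters, and $p_c$ is the predicted probability of occurrence type $c$.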

3.6. System Configuration and Settings for Deep Learning

The system used for training was configured with Ubuntu Linux, an AMD 7742 CPU, four NVIDIA A100 GPUs, 1 TB of memory, Python 3.7.16, and PyTorch 1.12, as listed in Table 3.

The training settings were configured as listed in Table 4, and each model was trained for 20 epochs to ensure sufficient training. For stable and generalized learning, adaptive moment estimation (Adam) was used as the optimizer, and the warmup ratio was set to 0.1 to prevent excessively large weight updates in the early stages of learning from causing instability. The batch size was the same for all three models and was set to 240, the largest size that our computing environment could accommodate.
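To make the role of the warmup ratio concrete, the sketch below computes the number of optimizer steps and the learning-rate multiplier during warmup. It assumes a linear ramp, and it holds the multiplier at 1.0 afterwards because the decay shape after warmup is not stated in the text. The function names and the sample count of 24,000 are hypothetical.

```python
import math

def total_training_steps(n_samples, batch_size=240, epochs=20):
    """Number of optimizer steps for the whole run."""
    return math.ceil(n_samples / batch_size) * epochs

def lr_multiplier(step, total_steps, warmup_ratio=0.1):
    """Scale factor applied to the base learning rate at a given step.

    Assumption: linear warmup over the first warmup_ratio of the steps;
    after warmup the factor is held at 1.0 (decay is not modeled here).
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0

steps = total_training_steps(24_000)  # hypothetical dataset size
print(steps, lr_multiplier(100, steps), lr_multiplier(200, steps))
```

With these illustrative numbers, the first 200 of 2000 steps use a reduced learning rate, which limits the size of the earliest weight updates.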

3.7. Constructing and Describing the Training Dataset

The augmented counts using the original data, similar keywords, and the morphological analysis method proposed in this study are listed in Table 5.

Each occurrence type in the source data represented a different degree of accident occurrence. Particularly, slips and trips, caught in/between objects, falls, and hits were items that occurred frequently owing to the work activities and movements of workers without being limited by the location of the industrial site. However, maritime and aviation accidents, asphyxiation, and drowning demonstrated a relatively small number of original data because they occurred when working in a specific place, situation, or time. To resolve the data imbalance, we augmented the data using appropriate keywords for each occurrence type; however, for occurrence types that lacked similar keywords, the problem of data imbalance was not completely resolved because of insufficient data augmentation. Therefore, we further increased the number of data points for each occurrence type using morphological analysis. The resulting dataset was applied to Models 1, 2, and 3. The dataset for Model 1 was trained using the original data presented in Table 5 without any additional tuning, and the results are shown in Table 6 and Figure 5.

In Model 2, we used the keyword replacement augmented data provided in Table 5, which utilized approximately 20,000 data points per code. The exception involved the unclassifiable category, which had fewer keyword replacements and fewer cases collected; therefore, we limited the number of augmented data to 10,000 or less. The results are presented in Table 7 and Figure 6.

For Model 3, we used the keyword replacement and morpheme-augmented data presented in Table 5 and utilized approximately 40,000 data points. As in Model 2, we limited the amount of data after augmentation to 20,000 or less for unclassifiable items. The results are presented in Table 8 and Figure 7.

The dataset for each model was divided into training, test, and validation datasets at a ratio of 8:1:1. Each classification model was trained on the training dataset using the DL settings described in Section 3.6. The test set was used to validate model performance during training, and the final classification model was evaluated using an independent validation set (original data not used for training).
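A minimal sketch of an 8:1:1 split is shown below. It assumes a simple shuffled split with a fixed seed; the function name `split_811` is hypothetical, and the sketch does not reproduce the additional constraint that the validation set consists of original, non-augmented data.

```python
import random

def split_811(items, seed=0):
    """Shuffle and split items into training, test, and validation parts (8:1:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n = len(items)
    n_train = n * 8 // 10
    n_test = n // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    valid = items[n_train + n_test:]  # remainder goes to validation
    return train, test, valid

train, test, valid = split_811(range(1000))
print(len(train), len(test), len(valid))
```

The naming follows the convention of this section: the test part is monitored during training, and the validation part is held out for the final evaluation.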

4. Results and Experimental

4.1. Validation of Each Model According to Various Parameters

The three models were analyzed in terms of accuracy and loss per epoch using the training and validation sets to indicate how well the model predicted the training or validation data. Figure 8 illustrates the accuracy and loss results of each model per epoch.

Prediction performance was interpreted as improving when the difference in accuracy between the training and validation data decreased. As the training progressed, the model fit the training data more closely, and the training accuracy converged to nearly 100%. However, when the loss diverged, the model was overfitting. Therefore, the loss curve should converge after a certain point to avoid this problem.

In Model 1, when the training accuracy converged to 100% as the epochs progressed, a difference in validation accuracy occurred. When the validation accuracy converged to 87% (more than a 5% difference in accuracy), the training accuracy reached 92%; however, there was no positive effect as the epochs progressed. As the training loss of Model 1 converged to zero, the validation loss gradually increased, which could be interpreted as a sign of overfitting. Further details are presented in Table 9.

In Model 2, the training accuracy reached 99% when the validation accuracy converged to 93% (within a 6–7% accuracy difference), indicating a tendency for accuracy to improve with epochs. The training loss in Model 2 converged to zero, while the validation loss gradually increased, as in Model 1, and then converged at epochs above 18; however, it exhibited a tendency to overfit. Further details are presented in Table 10.

In Model 3, the training accuracy was 99% when the validation accuracy converged to 95% (within a 4–5% accuracy difference), and the accuracy reached 96% when the epoch was 20. The training loss in Model 3 also converged to zero, similar to the previous models, and the validation loss converged when the number of epochs was 16 or more. Further details are presented in Table 11.

With each epoch, the model fit the training data more closely, as shown by the training accuracy reaching 100% or the training loss converging to zero. Thus, the model learned the training data almost perfectly, which also introduced the possibility of overfitting. Overfitting occurs when a model fits the training data so closely that it becomes less predictive of new data. However, the validation accuracy and loss converged to a certain value. Thus, the ability of the model to generalize to new data was established, as illustrated in Figure 9.

The performance of the classification models and their tendency to overfit were identified based on the differences in accuracy and loss between the training and validation data. Classification accuracy improved progressively from Model 1 to Model 2 to Model 3. Similarly, the smaller difference between the training and validation losses indicated that the overfitting problem was mitigated in Model 3. Therefore, Model 3, proposed in this study, exhibited the best classification performance for both the new and training data.

4.2. Comparison of Experiment Results

Based on 2.1 million “accident summary” data recorded as actual industrial accident cases in Korea, we proposed three “occurrence type” classification models to distinguish the main characteristics of the accidents using the KoBERT model. Each model was tested by sequentially adding data preprocessing, keyword replacement, and morphological analysis to the original data. The results are illustrated in Figure 10, Figure 11 and Figure 12.

A confusion matrix, validated by class and expressed in terms of precision, recall, loss, and F1 score, was established to compare the results of the overall model. The algorithm and experimental results for each classification model are further discussed.
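For reference, per-class precision, recall, and F1 score can be derived from a confusion matrix as in the sketch below. The function name and the small 2 × 2 matrix are illustrative only and are not taken from the experimental results; rows are assumed to be true classes and columns predicted classes.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) for each class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics

for p, r, f in per_class_metrics([[8, 2], [1, 9]]):
    print(round(p, 3), round(r, 3), round(f, 3))
```

A class that the model never predicts receives a precision of 0 under this convention, which corresponds to the "no classification" case reported for classes with very few training samples.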

The results of the Model 1 algorithm with only the data preprocessing method are listed in Table 12, and the precision, recall, F1 score, and loss were 87.1%, 0.878, 0.878, and 0.38, respectively. Comparing the results in Table 13 based on the data of each trained class, drowning (67%), asphyxiation (0%), and unspecified (0%), which have relatively few data (64–100), exhibited a low classification accuracy or no classification. Conversely, slips and trips, falls, being caught in/between objects, work-related diseases, cuts–lacerations–punctures, and avulsions, which comprised more than 50,000 data points, exhibited a classification accuracy of a minimum of 82% (up to 92%).

Thus, Model 1 exhibited a large variation in classification accuracy, depending on the amount of trained data.

The results of the Model 2 algorithm, which involved data preprocessing and data augmentation to improve the accuracy deviation of Model 1, are listed in Table 12, with precision, recall, F1 score, and loss of 91.3%, 0.912, 0.913, and 0.31, respectively.

As listed in Table 13, the classification results improved from 67% to 99% for the drowning class, from 0% to 90% for the unspecified class, and from 0% to 99% for asphyxiation, which had low accuracy in the Model 1 algorithm.

However, because the data augmentation equalized the amount of training data across classes, the classification accuracy decreased for some classes. Specifically, increasing the training data to 20,000 for classes with very few data (between 60 and 100) raised the classification accuracy to over 90%, whereas reducing the training data from 50,000+ to 20,000 lowered the accuracy by up to approximately 9% for the classes that were consequently trained with less data. Although an overall improvement in classification accuracy compared to Model 1 was observed, the accuracy decreased in some classes due to the equalization of training data; therefore, Model 2 required further improvement.

Model 3 was constructed by adding morphological analysis to the algorithm to address the issues in Model 2. With the improvement, the results of Model 3 were 93.1%, 0.931, 0.931, and 0.24 for precision, recall, F1 score, and loss, respectively, as shown in Table 12. As listed in Table 13, the accuracy of each class, which decreased in Model 2, was improved by up to 6%, and the classification accuracy was over 90%, excluding 6 out of 27 classes (struck by object, fall, crushing or overturning, slips and trips, work-related diseases, and hit).

Based on the experimental and analytical results of each variable model, we propose Model 3 as a valid algorithm for the objective classification of accident cases. Furthermore, the original data (untrained) of accident cases that occurred in 2023 were classified and verified using Model 3.

The results are presented in Table 12; the precision, recall, F1 score, and loss were 91.1%, 0.908, 0.909, and 0.32, respectively, confirming the possibility of using the model objectively.

The confusion matrix for each class result is illustrated in Figure 13. The model was well trained without overfitting. Therefore, the model is expected to classify new data appropriately even after the year corresponding to the validation data and can be regarded as an objective accident case classification model.

5. Conclusions

Based on 2.1 million accident summaries recorded as actual cases of occupational accidents in South Korea, this study proposes three occurrence-type classification models that use the KoBERT model to distinguish the main characteristics of accidents. Each model was constructed by sequentially performing data preprocessing, keyword replacement, and morphological analysis of original data. Compared with the other models, Model 3 demonstrated the best results.

The results of this study are anticipated to play an important role in preventing occupational accidents based on accurately classifying accident cases by occurrence type.

In Model 1, only data preprocessing was applied, and the classification performance was 87.1%, attributable to an imbalance in the initial input of the original data. Particularly, for classes with fewer data, the classification accuracy was extremely low (drowning, precision 67%) or not classified at all (asphyxiation, unspecified). To address the imbalance in the amount of data, improving the model by augmenting the training data was necessary.
Model 2 introduced data preprocessing and data augmentation with keyword replacement to solve the imbalance problem in Model 1. The improvements resulted in an improved classification accuracy for most classes (with an average classification accuracy of over 90%) and were sufficient to improve Model 1. Thus, the augmentation method of finding and replacing keywords for a specific accident summary was effective in reducing the gap between the training and real data. Nevertheless, in Model 2, the equalized training data caused a decrease in accuracy in classes with sufficiently large amounts of original data. The drawback presented a tradeoff problem caused by uniform data equalization, requiring improvements to Model 2. Therefore, we introduced a new preprocessing method for the training data.
In Model 3, morphological analysis was performed to solve the problems encountered in Model 2. Model 3 applied three techniques, namely preprocessing of the original data, data augmentation through keyword replacement, and morphological analysis, and overcame the limitations of the previous models. When Model 3 was validated with the most recent accident case data, its precision was recorded as 91%, confirming its objective utility.
Thus, Model 3 was determined to be the most appropriate for the accident summary classification model in this study, and it is expected to alleviate problems such as human error in the occurrence-type classification process for accident summaries and poor objectivity in classification.
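The keyword-replacement augmentation summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary `SIMILAR_KEYWORDS`, the function name, and the English example sentence are hypothetical, and the keyword group is only an English gloss of one row of Table 2 (the study operates on Korean text).

```python
# Hypothetical keyword groups keyed by occurrence type code,
# using the English glosses listed for code 11 (Fire) in Table 2.
SIMILAR_KEYWORDS = {
    "11": ["fire", "spark", "ignition"],
}


def augment_by_keyword_replacement(summary, code, keywords=SIMILAR_KEYWORDS):
    """Create new summaries by swapping a class keyword for its similar keywords.

    Uses plain substring matching, which is a simplification.
    """
    group = keywords.get(code, [])
    variants = []
    for kw in group:
        if kw in summary:
            for alt in group:
                if alt != kw:
                    variants.append(summary.replace(kw, alt))
    return variants


variants = augment_by_keyword_replacement(
    "The fire started near the welding machine.", "11"
)
for v in variants:
    print(v)
```

Each variant keeps the original label, so a minority class gains additional training sentences without new accident records being invented.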

However, the AI model developed in this study for classifying occurrence types from accident summaries has certain limitations. Although the model has a high potential for objective utilization, it is trained on accident cases accumulated in the past and may therefore learn from cases that were incorrectly classified by humans or that have unclear classification boundaries, which can cause reliability problems. Therefore, to examine these objectivity and reliability issues, we checked the rank at which the recorded occurrence type appears among the model's predictions, down to the lower ranks.

Figure 14 compares the existing recorded occurrence types with the results of our classification model; the agreement reached 99% when predictions up to the third rank were included. As listed in Table 14, including ranks beyond the third yielded no significant further improvement.
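The rank-based evaluation reported in Table 14 can be read as a top-k measure: a case counts as correct at rank k if the recorded occurrence type appears among the model's first k predictions. The sketch below illustrates this reading; the function name is hypothetical, and the three ranked lists are the examples shown in Table 15, Table 16 and Table 17.

```python
def top_k_accuracy(ranked_predictions, recorded_labels, k):
    """Share of cases whose recorded label is among the first k predictions."""
    hits = sum(
        1
        for ranked, recorded in zip(ranked_predictions, recorded_labels)
        if recorded in ranked[:k]
    )
    return hits / len(recorded_labels)


# Ranked outputs and recorded types from Tables 15, 16 and 17.
ranked = [
    ["Fall", "Collapse", "Struck by Object"],
    ["Overexertion and Bodily Reaction", "Slips and Trips", "Work-Related Diseases"],
    ["Struck by Object", "Collapse", "Fall"],
]
recorded = ["Fall", "Slips and Trips", "Fall"]

for k in (1, 2, 3):
    print(k, round(top_k_accuracy(ranked, recorded, k), 3))
```

On these three examples the measure rises from one in three at rank 1 to all three at rank 3, which is the same pattern, on a much smaller scale, as the increase across ranks in Table 14.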

By providing up to three candidate occurrence types per case, the model supplies objective classification data for the vast amount of existing accident records, as shown in Table 15, Table 16 and Table 17. The results can be used to double-check classification or recording results, to maintain the objectivity of the classifier or recorder, or to support the final decision on the occurrence type.
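One way to obtain up to three ranked occurrence types from a classifier is to convert its output scores into probabilities and keep the three largest. The sketch below assumes a softmax over per-class scores, which is a common choice but is not confirmed by the text; the function names, class subset, and score values are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_classes(scores, class_names, n=3):
    """Return the n most probable (class, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical scores for four of the occurrence types in Table 1.
class_names = ["Fall", "Collapse", "Struck by Object", "Fire"]
scores = [2.0, 0.9, -2.5, -4.0]

top3 = top_n_classes(scores, class_names)
for rank, (name, prob) in enumerate(top3, start=1):
    print(rank, name, f"{prob:.1%}")
```

A reviewer can then compare the recorded occurrence type against this short list instead of a single prediction.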

The classification model developed in this study is expected to inform accident prevention measures and improvements to policy systems. In particular, because it categorizes cases accurately by occurrence type, the model is expected to assist in establishing more effective accident prevention strategies. The proposed model thus contributes to a better societal understanding of occupational safety and to the development of effective measures to prevent accidents. Continued research is expected to help prevent occupational accidents and reduce societal costs.

6. Discussion and Future Research

This study primarily aims to employ artificial intelligence technology to objectively and accurately classify occupational accident cases, rather than simply identifying or precisely understanding the causes of accidents. The focus lies on leveraging the collected examples of occupational accidents for more systematic classification.

Traditional accident analysis methods have relied on case-by-case investigations of the occurrence process to determine specific causes. However, such methods face limitations when dealing with large-scale, accumulated cases, as they may narrow the scope of analysis or focus solely on particular causes or outcomes. On the other hand, artificial intelligence technology is well suited for handling large-scale recorded accident data and is highly effective for variable studies. Consequently, this study developed a model to classify accident types based on accumulated accident case data. The classification method achieved over 93% accuracy in Rank 1 results and secured over 97% accuracy in subsequent Rank 2 results. Enhancing the accuracy of AI models can be accomplished through improvements in model architecture, augmentation, or efficient preprocessing procedures.

Nevertheless, these enhancements may not directly contribute to reducing occupational accidents or accurately identifying their root causes. Accidents in industrial sites may be caused by complex mechanisms or specific factors, which is a crucial consideration for determining the potential application of this study’s findings. Therefore, to successfully implement artificial intelligence technology in the field of occupational safety and health, it is necessary to consider not only the technical completeness—such as model improvement and accuracy enhancement—but also how these technologies can contribute to the prevention of occupational accidents.

In future research, to overcome these limitations and contribute more effectively to industrial accident prevention, the research will be expanded in the following directions:

First, we will try to accurately classify the causal factors that directly contribute to accidents and thereby identify their causes. In most accident cases, the original causal material is closely related to the occurrence of the accident. Such material is commonly used or found at industrial sites (machinery, equipment, parts, products, and so on) and is typically in contact with or close to workers when an accident occurs. Because it contributes directly to the accident, identifying it allows the cause and result of the accident to be inferred more precisely from the accident case. Therefore, following the method and results of this study, a model that classifies the original causal material from the accident overview can be added. This will require more input information than the existing occurrence-type classification, and the artificial intelligence model will also need to be improved.

Second, a comprehensive accident analysis will be conducted, considering the complexity of the working environment. Modern industrial sites involve various risk factors that interact with each other due to technological advancements and system complexity. The complexity of the working environment encompasses not only direct causes such as causal factors but also systemic factors, organizational structures, human–machine interactions, and technical elements. Dekker et al. [24] emphasized that failures in complex systems cannot be explained by simple causes, and Brocal et al. [25] proposed a new risk management approach in the context of Industry 4.0. By referencing these studies, we will develop a model that comprehensively analyzes the diverse risk factors and their interactions that occur in the work environment. This will help identify systemic risk factors that are difficult to detect through causal factor analysis alone and contribute to establishing a comprehensive strategy for accident prevention.

Third, we will seek accident prevention measures that take into account motivational and cognitive biases in decision-making processes. Kahneman [26] explained cognitive biases that affect human judgment and decision-making, and Montibeller and Winterfeldt [27] analyzed motivational biases. These biases can lead to faulty decision-making, which can become a cause of accidents. Therefore, we aim to develop a system that minimizes these biases to contribute to accident prevention.

Fourth, we will research ways to strengthen safety culture and the role of managers. Leveson [28,29] applied systems thinking to safety engineering and proposed methodologies to prevent accidents, while Komljenovic et al. [30] emphasized the importance of organizational factors. Additionally, Mosey [31] highlighted the impact of safety culture and the role of managers in accident prevention. Based on these studies, we will establish strategies for building a safety culture at the organizational level and encouraging active participation by managers.

Through this future research, we aim to develop a comprehensive accident analysis and prevention system that considers both direct causes of accidents, such as causal factors, and systemic factors arising from the complexity of the work environment. This will contribute to a more practical approach to industrial accident prevention.

Author Contributions

Conceptualization, S.-Y.K. and K.-H.Y.; Formal analysis, S.-H.S., S.-Y.K. and J.-H.W.; Writing—original draft, S.-Y.K. and J.-H.S.; Supervision, S.-Y.K. and K.-H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Innovative Human Resource Development for Local Intellectualization program through an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (IITP-2024-2020-0-01462) and also by the Ministry of Employment and Labor (2023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. The data were obtained from the Ministry of Employment and Labor of the Republic of Korea and are available from the Ministry with its permission.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kang, S.Y.; Min, S.; Kim, W.S.; Won, J.H.; Kang, Y.J.; Kim, S. Types and characteristics of fatal accidents caused by multiple processes in a workplace: Based on actual cases in South Korea. Int. J. Environ. Res. Public Health 2022, 19, 2047.
2. Eo, S.H. A Study on the Protective Effects by Sector in Korea Industrial Accident Analysis and Disaster Features. Master’s Thesis, Seoul National University of Science and Technology, Seoul, Republic of Korea, 2016.
3. Park, S.Y. Comparative Analysis of Industrial Accident Rate Changes between Major Countries; Occupational Safety and Health Research Institute: Ulsan, Republic of Korea, 2020.
4. National Institute for Occupational Safety and Health (NIOSH). NIOSH Strategic Plan: FYs 2024–2028. Centers for Disease Control and Prevention. 2023. Available online: https://www.cdc.gov/niosh/about/strategicplan/pdf/V8-NIOSH-Strategic-Plan_V8_August-2023_FINAL.pdf (accessed on 15 August 2023).
5. Health and Safety Executive (HSE). Protecting People and Places: HSE Strategy 2022 to 2032. Health and Safety Executive. 2022. Available online: https://www.hse.gov.uk/aboutus/the-hse-strategy.htm (accessed on 15 May 2022).
6. Gemeinsame Deutsche Arbeitsschutzstrategie (GDA). Leitlinie Organisation des betrieblichen Arbeitsschutzes. National Occupational Safety Conference (NAK). 2017. Available online: https://www.gda-portal.de/DE/Aufsichtshandeln/Organisation (accessed on 15 May 2017).
7. Ministry of Health, Labour and Welfare (MHLW). The 13th Industrial Accident Prevention Plan. Ministry of Health, Labour and Welfare, 2018 to 2022. Available online: https://www.mhlw.go.jp/file/04-Houdouhappyou-11301000-Roudoukijunkyokuanzeneiseibu-Keikakuka/0000194556.pdf (accessed on 15 February 2018).
8. Chi, C.F.; Chang, T.C.; Ting, H.I. Accident patterns and prevention measures for fatal occupational falls in the construction industry. Appl. Ergon. 2005, 36, 391–400.
9. Jacinto, C.; Canoa, M.; Soares, C.G. Workplace and organisational factors in accident analysis within the Food Industry. Saf. Sci. 2009, 47, 626–635.
10. Sarkar, S.; Vinay, S.; Raj, R.; Maiti, J.; Mitra, P. Application of optimized machine learning techniques for prediction of occupational accidents. Comput. Oper. Res. 2019, 106, 210–224.
11. Ahn, G.; Seo, M.; Hur, S. Development of Accident Classification Model and Ontology for Effective Industrial Accident Analysis based on Textmining. J. Korean Soc. Saf. 2017, 32, 179–185.
12. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. App. 1998, 13, 18–28.
13. Zhang, F.; Fleyeh, H.; Wang, X.; Lu, M. Construction site accident analysis using text mining and natural language processing techniques. Autom. Constr. 2019, 99, 238–248.
14. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning-based text classification: A comprehensive review. ACM Comput. Surv. 2021, 54, 1–40.
15. Zhang, X.; Prabhakar, S.; Sankaran, M. Sequential deep learning from NTSB reports for aviation safety prognosis. Saf. Sci. 2021, 142, 105390.
16. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
17. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
18. Nemani, V.P.; Lu, H.; Thelen, A.; Hu, C.; Zimmerman, A.T. Ensembles of probabilistic LSTM predictors and correctors for bearing prognostics using industrial standards. Neurocomputing 2022, 491, 575–596.
19. Weijie, D.; Yunyi, L.; Jing, Z.; Xuchen, S. Long text classification based on BERT. In Proceedings of the 2021 IEEE 5th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Xi’an, China, 15 October 2021; pp. 1147–1151.
20. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
21. Luo, X.; Feng, X.; Ji, X.; Dang, Y.; Zhou, L.; Bi, K.; Dai, Y. Extraction and analysis of risk factors from Chinese chemical accident reports. Chinese J. Chem. Eng. 2023, 61, 68–81.
22. Luo, Z.; Michiyuki, H. Utilization of similar accident cases for safety education. In Proceedings of the 2022 Joint 12th International Conference on Soft Computing and Intelligent Systems and 23rd International Symposium on Advanced Intelligent Systems (SCIS&ISIS), Ise-Shima, Japan, 29 November 2022; pp. 1–4.
23. SKTBrain. KoBERT. GitHub Repository. Available online: https://github.com/SKTBrain/KoBERT (accessed on 15 November 2019).
24. Dekker, S.; Cilliers, P.; Hofmeyr, J.-H. The complexity of failure: Implications of complexity theory for safety investigations. Saf. Sci. 2011, 49, 939–945.
25. Brocal, F.; González-Gaya, C.; Komljenovic, D.; Katina, P.D.; Sebastián, M.A. Emerging risk management in Industry 4.0: An approach to improve organizational and human performance in the complex systems. Complexity 2019, 2019, 2089763.
26. Kahneman, D. Thinking, Fast and Slow; Farrar, Straus and Giroux: New York, NY, USA, 2012.
27. Montibeller, G.; Winterfeldt, D. Cognitive and Motivational Biases in Decision and Risk Analysis. Risk Anal. 2015, 35, 1230–1251.
28. Leveson, N.G. Engineering a Safer World, Systems Thinking Applied to Safety; The MIT Press: Cambridge, MA, USA, 2011.
29. Leveson, N.G. Applying system thinking to analyze and learn from events. Saf. Sci. 2011, 49, 55–64.
30. Komljenovic, D.; Loiselle, G.; Kumral, M. Organization: A new focus on mine safety improvement in a complex operational and business environment. Int. J. Min. Sci. Technol. 2017, 27, 617–625.
31. Mosey, D. Looking beyond Operator–Putting People in the Mix. NEI Magazine. 2014. Available online: https://www.neimagazine.com/advanced-reactorsfusion/looking-beyond-the-operator-4447549/?cf-view (accessed on 26 November 2014).

Figure 1. Comparison of results before and after preprocessing (***: placeholder for an accident victim's name).

Figure 2. Keyword replacement result example.

Figure 3. Example of morphological analysis results.

Figure 4. Model structure of the three proposed methods.

Figure 5. Data distribution graph by occurrence type for Model 1.

Figure 6. Data distribution graph by occurrence type for Model 2.

Figure 7. Data distribution graph by occurrence type for Model 3.

Figure 8. Accuracy and loss of models. Model 1 (a) accuracy and (b) loss; Model 2 (c) accuracy and (d) loss; Model 3 (e) accuracy and (f) loss.

Figure 9. Difference between accuracy and loss for the models. (a) Classification model performance on the difference between training and validation accuracies. (b) Overfitting tendency due to the difference between training and validation losses.

Figure 10. Confusion matrix result of Model 1.

Figure 11. Confusion matrix result of Model 2.

Figure 12. Confusion matrix result of Model 3.

Figure 13. The 2023 original data confusion matrix results.

Figure 14. Confusion matrix results when third-level information is included.

Table 1. Classification of “occurrence type”.

Code	Classification
01	Fall
02	Slips and Trips
03	Crushing or Overturning
04	Hit
05	Struck by Object
06	Collapse
07	Caught in/between Objects
08	Cuts, Lacerations, Punctures, and Avulsions
09	Electrocution
10	Explosions
11	Fire
12	Overexertion and Bodily Reaction
13	Exposure to Extreme Temperatures
14	Exposure to Hazardous Substances
15	Asphyxiation
16	Drowning
21	Occupational Diseases
22	Pneumoconiosis
23	Work-Related Diseases
31	On-Site Traffic Accidents
32	Off-Site Traffic Accidents
33	Maritime and Aviation Accidents
41	Recreational Activities
42	Acts of Violence
43	Animal-Related Incidents
49	Other Incidents
Z	Unspecified

Table 2. Similar keywords according to the “occurrence type” code.

Code	Similar Keyword (1st by Frequency)	Similar Keyword (2nd by Frequency)	Similar Keyword (3rd by Frequency)
01	get off	fall	stumble
02	fall down	Slip	Lying down
03	crush	Overturning	attack
04	shocked	collide	hit
05	drop	breakaway	fracture
06	collapse	buried	crush
07	caught	narrowed	get caught
08	stabbed	cut	cut off
09	shock	electricity	short
10	explosion	gas	inflation
11	fire	spark	ignition
12	pain	broken	luxation
13	heated	steam	incinerator
14	toxication	carbon monoxide	leakage
15	asphyxiation	manhole	chamber
16	drowning	disappearance	heavy rain
31	on-site	traffic accident	collision
32	off-site	traffic accident	collision
33	flight	ship	helicopter
43	stung	bitten

Table 3. System configuration.

Category	Details
OS	Ubuntu Linux
CPU	AMD 7742
GPU	NVIDIA A100 × 4
Memory (RAM)	1 TB
Python	3.7.16
Framework	Pytorch 1.12

Table 4. Settings for deep learning.

Parameters	Details
Learning Rate	5 × 10⁻⁵
Batch Size	240
Optimizer	Adam
Epochs	20
Warmup Ratio	0.1
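To make the role of the warmup ratio in Table 4 concrete, the sketch below derives the number of warmup steps from the batch size, epoch count, and warmup ratio, and applies a linear warmup to the learning rate. The linear ramp, the constant rate after warmup, the function name, and the training-set size of 24,000 are assumptions made for illustration; the text does not state which schedule follows the warmup phase.

```python
import math


def warmup_schedule(num_train_examples, batch_size=240, epochs=20,
                    warmup_ratio=0.1, base_lr=5e-5):
    """Return (total_steps, warmup_steps, lr_at) for a linear warmup.

    Defaults follow Table 4. After warmup the rate is held constant,
    which is an assumption of this sketch.
    """
    steps_per_epoch = math.ceil(num_train_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_ratio)

    def lr_at(step):
        # step is zero-based; ramp linearly up to base_lr during warmup
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        return base_lr

    return total_steps, warmup_steps, lr_at


# Hypothetical training-set size chosen to give round numbers.
total_steps, warmup_steps, lr_at = warmup_schedule(24000)
print(total_steps, warmup_steps, lr_at(0), lr_at(warmup_steps - 1))
```

With these settings the first tenth of all optimizer steps is spent raising the learning rate toward 5 × 10⁻⁵, which limits large updates to the pretrained weights at the start of fine-tuning.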

Table 5. Number of learning data by occurrence type code.

Code	Ranking	Original Data	Keyword Replacement Augmented Data	Keyword Replacement and Morpheme Augmented Data
02	1	400,963	490,151	980,302
07	2	343,112	401,702	803,404
01	3	300,843	364,753	729,506
04	4	184,182	216,454	432,908
23	5	171,308	211,580	423,160
05	6	169,010	215,292	430,584
08	7	162,453	191,679	383,358
32	8	93,740	139,918	279,836
12	9	79,943	105,943	211,886
13	10	57,990	90,597	181,194
21	11	36,198	41,602	83,204
03	12	34,959	46,459	92,918
41	13	22,593	31,162	62,324
06	14	14,997	26,477	52,954
11	15	9492	24,115	48,230
09	16	9259	106,823	213,646
10	17	7897	44,627	89,254
14	18	7258	211,270	422,540
42	19	6900	21,802	43,604
22	20	5857	21,992	43,984
43	21	5480	24,327	48,654
49	22	5419	190,432	380,864
31	23	4215	21,012	42,024
Z	24	2023	10,197	20,394
16	25	833	21,170	42,340
15	26	389	20,446	40,892
33	27	301	30,196	60,392

Table 6. Model 1 dataset.

Code	Train Data	Test Data	Validation Data	Total
02	320,771	40,096	40,096	400,963
07	274,490	34,311	34,311	343,112
01	240,675	30,084	30,084	300,843
04	147,346	18,418	18,418	184,182
23	137,048	17,130	17,130	171,308
05	135,208	16,901	16,901	169,010
08	129,963	16,245	16,245	162,453
32	74,992	9374	9374	93,740
12	63,955	7994	7994	79,943
13	46,392	5799	5799	57,990
21	28,960	3619	3619	36,198
03	27,969	3495	3495	34,959
41	18,075	2259	2259	22,593
06	11,998	1499	1499	14,996
11	7594	949	949	9492
09	7409	925	925	9259
10	6318	789	789	7896
14	5808	725	725	7258
42	5520	690	690	6900
22	4687	585	585	5857
43	4384	540	540	5464
49	4337	541	541	5419
31	3372	421	421	4214
Z	1619	202	202	2023
16	667	83	83	833
15	312	38	38	388
33	241	30	30	301
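The counts in Table 6 are consistent with an approximately 80/10/10 split per class. The sketch below shows one rule that reproduces nearly every row of Table 6 from its Total column (code 43 is an exception): one tenth of the total, rounded down, goes to each of the test and validation sets, and the remainder goes to training. The rule and the function name are inferred for illustration and are not stated in the text.

```python
def split_counts(total):
    """Return (train, test, validation) counts for one class.

    Assumed rule: test and validation each receive floor(total / 10);
    training receives whatever is left.
    """
    held_out = total // 10
    train = total - 2 * held_out
    return train, held_out, held_out


# Totals for codes 02, 16 and 33 in Table 6.
for code, total in [("02", 400963), ("16", 833), ("33", 301)]:
    print(code, split_counts(total))
```

Under this rule the smallest class (code 33) contributes only 30 validation cases, which helps explain why per-class results for rare classes in Model 1 are unstable.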

Table 7. Model 2 dataset.

Code	Train Data	Test Data	Validation Data	Total
02	16,000	2000	2000	20,000
07	16,000	2000	2000	20,000
01	16,000	2000	2000	20,000
04	16,000	2000	2000	20,000
23	16,000	2000	2000	20,000
05	16,000	2000	2000	20,000
08	16,000	2000	2000	20,000
32	16,000	2000	2000	20,000
12	16,000	2000	2000	20,000
13	16,000	2000	2000	20,000
21	16,000	2000	2000	20,000
03	16,000	2000	2000	20,000
41	16,000	2000	2000	20,000
06	16,000	2000	2000	20,000
11	16,000	2000	2000	20,000
09	16,000	2000	2000	20,000
10	16,000	2000	2000	20,000
14	16,000	2000	2000	20,000
42	16,000	2000	2000	20,000
22	16,000	2000	2000	20,000
43	16,000	2000	2000	20,000
49	4337	541	541	5419
31	3372	421	421	4214
Z	1619	202	202	2023
16	667	83	83	833
15	312	38	38	388
33	241	30	30	301
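Table 7 lists 20,000 cases for most classes and smaller totals for the remaining ones. A simple way to obtain such a capped distribution is to sample at most a fixed number of cases per class. The sketch below shows this idea; the function name, the random sampling, the seed, and the toy data are assumptions, since the text does not describe how cases were selected.

```python
import random


def equalize(records_by_class, cap=20000, seed=0):
    """Keep at most `cap` records per class, sampling without replacement."""
    rng = random.Random(seed)
    balanced = {}
    for code, records in records_by_class.items():
        if len(records) > cap:
            balanced[code] = rng.sample(records, cap)
        else:
            balanced[code] = list(records)
    return balanced


# Toy data: one large class and one small class, with a small cap.
toy = {"02": list(range(50)), "33": list(range(7))}
balanced = equalize(toy, cap=20)
print({code: len(records) for code, records in balanced.items()})
```

Capping discards original cases from the large classes, which is one plausible source of the accuracy decrease for data-rich classes noted for Model 2.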

Table 8. Model 3 dataset.

Code	Train Data	Test Data	Validation Data	Total
02	16,000	2000	2000	20,000
07	16,000	2000	2000	20,000
01	16,000	2000	2000	20,000
04	16,000	2000	2000	20,000
23	16,000	2000	2000	20,000
05	16,000	2000	2000	20,000
08	16,000	2000	2000	20,000
32	16,000	2000	2000	20,000
12	16,000	2000	2000	20,000
13	16,000	2000	2000	20,000
21	16,000	2000	2000	20,000
03	16,000	2000	2000	20,000
41	16,000	2000	2000	20,000
06	16,000	2000	2000	20,000
11	16,000	2000	2000	20,000
09	16,000	2000	2000	20,000
10	16,000	2000	2000	20,000
14	16,000	2000	2000	20,000
42	16,000	2000	2000	20,000
22	16,000	2000	2000	20,000
43	16,000	2000	2000	20,000
49	4337	541	541	5419
31	3372	421	421	4214
Z	1619	202	202	2023
16	667	83	83	833
15	312	38	38	388
33	241	30	30	301

Table 9. Difference between training and validation accuracies of Model 1.

Epoch	Training Accuracy	Training Loss	Validation Accuracy	Validation Loss	Accuracy Difference	Loss Difference
1	75%	0.93	87%	0.41	−12%	−0.52
2	87%	0.41	87%	0.40	0%	0.02
3	88%	0.36	88%	0.38	0%	−0.02
4	90%	0.32	88%	0.41	2%	−0.09
5	91%	0.28	88%	0.42	3%	−0.14
6	92%	0.24	87%	0.46	5%	−0.22
7	93%	0.21	87%	0.48	6%	−0.27
8	94%	0.18	87%	0.53	7%	−0.35
9	95%	0.15	87%	0.57	8%	−0.42
10	96%	0.13	87%	0.62	9%	−0.49
11	97%	0.10	87%	0.66	10%	−0.56
12	98%	0.08	87%	0.67	11%	−0.58
13	98%	0.07	87%	0.74	11%	−0.67
14	99%	0.05	87%	0.80	12%	−0.74
15	99%	0.04	87%	0.82	12%	−0.78
16	99%	0.03	87%	0.88	12%	−0.84
17	99%	0.02	87%	0.91	12%	−0.89
18	100%	0.02	87%	0.94	13%	−0.92
19	100%	0.02	87%	0.94	13%	−0.93
20	100%	0.02	87%	0.94	13%	−0.93

Table 10. Difference between training and validation accuracies of Model 2.

Epoch	Training Accuracy	Training Loss	Validation Accuracy	Validation Loss	Accuracy Difference	Loss Difference
1	71%	1.13	88%	0.41	−17%	−0.72
2	88%	0.39	90%	0.35	−2%	0.04
3	90%	0.32	91%	0.32	−1%	0.00
4	92%	0.25	91%	0.31	1%	−0.06
5	94%	0.21	91%	0.31	3%	−0.10
6	95%	0.17	92%	0.33	3%	−0.16
7	96%	0.14	92%	0.35	4%	−0.20
8	96%	0.12	92%	0.35	4%	−0.23
9	97%	0.10	92%	0.38	5%	−0.28
10	98%	0.08	92%	0.40	6%	−0.32
11	98%	0.06	92%	0.42	6%	−0.36
12	99%	0.05	92%	0.45	7%	−0.40
13	99%	0.04	93%	0.47	6%	−0.43
14	99%	0.03	93%	0.49	6%	−0.46
15	100%	0.02	93%	0.51	7%	−0.49
16	100%	0.01	93%	0.54	7%	−0.52
17	100%	0.01	93%	0.55	7%	−0.54
18	100%	0.01	93%	0.56	7%	−0.56
19	100%	0.00	93%	0.56	7%	−0.56
20	100%	0.00	93%	0.57	7%	−0.56

Table 11. Difference between training and validation accuracies of Model 3.

Epoch	Training Accuracy	Training Loss	Validation Accuracy	Validation Loss	Accuracy Difference	Loss Difference
1	67%	1.33	87%	0.46	−20%	−0.86
2	87%	0.43	89%	0.35	−2%	0.07
3	90%	0.32	91%	0.30	−1%	0.02
4	93%	0.24	92%	0.27	1%	−0.03
5	94%	0.18	92%	0.26	2%	−0.07
6	95%	0.15	93%	0.24	2%	−0.10
7	96%	0.11	93%	0.25	3%	−0.14
8	97%	0.09	93%	0.25	4%	−0.16
9	98%	0.07	94%	0.25	4%	−0.18
10	98%	0.05	94%	0.25	4%	−0.19
11	99%	0.04	94%	0.25	5%	−0.21
12	99%	0.03	95%	0.26	4%	−0.23
13	99%	0.02	95%	0.26	4%	−0.23
14	99%	0.02	95%	0.26	4%	−0.24
15	100%	0.01	95%	0.26	5%	−0.25
16	100%	0.01	95%	0.26	5%	−0.25
17	100%	0.01	95%	0.26	5%	−0.26
18	100%	0.00	95%	0.26	5%	−0.26
19	100%	0.00	95%	0.26	5%	−0.26
20	100%	0.00	96%	0.26	4%	−0.26
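The lowest validation loss in Table 11 (0.24, at epoch 6) coincides with the Model 3 loss reported in Table 12. This suggests, although the text does not state it, that the checkpoint with the minimum validation loss was the one evaluated. The sketch below selects that epoch from the validation losses of Table 11; the function name is hypothetical.

```python
# Validation loss of Model 3 for epochs 1 to 20, copied from Table 11.
VAL_LOSS_MODEL_3 = [
    0.46, 0.35, 0.30, 0.27, 0.26, 0.24, 0.25, 0.25, 0.25, 0.25,
    0.25, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26,
]


def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss (first on ties)."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


epoch = best_epoch(VAL_LOSS_MODEL_3)
print(epoch, VAL_LOSS_MODEL_3[epoch - 1])
```

Training accuracy keeps rising after this epoch while validation loss no longer falls, so later checkpoints mainly fit the training data more closely.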

Table 12. Model evaluation metrics.

Metrics	Model 1	Model 2	Model 3	Model 3 Using Unlearned 2023 Accident Data
Precision	87.1%	91.3%	93.1%	91.1%
Recall	0.878	0.912	0.931	0.908
Loss	0.38	0.31	0.24	0.32
F1 Score	0.878	0.913	0.931	0.909

Table 13. Comparison of confusion matrix results for top 5 class and bottom 5 class models.

Metrics	Model 1	Model 2	Model 3	Model 3 Using Unlearned 2023 Accident Data
Slips and Trips	91%	82%	83%	84%
Caught in/between Objects	88%	84%	90%	88%
Fall	91%	90%	89%	87%
Hit	76%	76%	75%	79%
Work-Related Diseases	91%	87%	85%	84%
On-Site Traffic Accidents	33%	97%	98%	99%
Unspecified	0%	90%	98%	98%
Drowning	67%	99%	99%	96%
Asphyxiation	0%	99%	99%	91%
Maritime and Aviation Accidents	44%	100%	100%	99%

Table 14. Results when the existing recorded occurrence type falls within the top five ranks.

Metrics	Rank 1	Rank 2	Rank 3	Rank 4	Rank 5
Precision	93.1%	97.9%	99.2%	99.4%	99.6%
Recall	0.931	0.979	0.992	0.994	0.996
F1 Score	0.931	0.979	0.992	0.994	0.996

Table 15. Example in which the existing recorded occurrence type is ranked 1.

Accident Overview		Existing Recorded Occurrence Type
On 2 November 2017, around 8 a.m., he fell when the scaffolding collapsed while laying bricks at a house building site in Naechon-myeon, Hongcheon-gun.		Fall
Accident Classification Model (Model 3)
Rank	Class	Precision
1	Fall	74.4%
2	Collapse	24.29%
3	Struck by Object	0.7%

Table 16. Example in which the existing recorded occurrence type is ranked 2.

Accident Overview		Existing Recorded Occurrence Type
On 27 November 2017, at around 17:00, the captain, a maintenance worker in the power department of the public service team, was working with a fellow worker to move the motor placed on the floor of the motor storage shelf in the public service team’s motor storage room to the motor storage shelf, and while lifting about 40 kg of the motor to the shelf by himself, he experienced pain in his back and lost strength, let go of the hand holding the motor, but he hesitated and broke his hip while hesitating, and since the pain persisted afterward, he received a fracture of the sacral bone as a result of hospital treatment, so he submits a medical treatment application.		Slips and Trips
Accident Classification Model (Model 3)
Rank	Class	Precision
1	Overexertion and Bodily Reaction	48.5%
2	Slips and Trips	31.1%
3	Work-Related Diseases	17.3%

Table 17. Example in which the existing recorded occurrence type is ranked 3.

Accident Overview		Existing Recorded Occurrence Type
While bricklaying, a safety plate collapsed and I fell, injuring my side and back.		Fall
Accident Classification Model (Model 3)
Rank	Class	Precision
1	Struck by Object	69.3%
2	Collapse	23.4%
3	Fall	4.7%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
