Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study

Ando, Hajime; Matsugaki, Ryutaro; Yamakawa, Sakumi; Ogami, Akira

doi:10.3390/occuphealth1020016

Open AccessArticle

Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study

¹

Department of Work Systems and Health, Institute of Industrial Ecological Sciences, University of Occupational and Environmental Health, Japan, Kitakyushu 807-8555, Japan

²

Department of Rehabilitation, Wakamatsu Hospital of the University of Occupational and Environmental Health, Japan, Kitakyushu 808-0024, Japan

^*

Author to whom correspondence should be addressed.

Occup. Health 2026, 1(2), 16; https://doi.org/10.3390/occuphealth1020016

Submission received: 30 January 2026 / Revised: 10 April 2026 / Accepted: 13 April 2026 / Published: 17 April 2026

Download

Browse Figures

Versions Notes

Abstract

Same-level falls are the most frequent occupational accidents, yet traditional manual analysis of accident reports is labor-intensive and limits large-scale prevention strategies. In this pilot study, we aimed to evaluate the accuracy of using large language models (LLMs) to automate the classification of occupational accident text data without task-specific pretraining. We analyzed data from 2619 same-level-fall-related injury cases, using expert manual classification as the reference standard. Four models—GPT-4o mini, GPT-4.1 mini, GPT-4.1, and o4-mini—were compared using accuracy and Cohen’s kappa. The o4-mini model demonstrated the highest performance, showing statistical superiority in the complex “causal agent” category with 72.8% accuracy. For other classification tasks, the top models achieved accuracies of 82–92%, with Cohen’s kappa coefficients > 0.7, indicating substantial agreement with expert judgments. These findings suggest that LLMs can classify occupational accident text with substantial agreement with the expert-derived reference standard in this dataset. This automated approach enables efficient, high-frequency analysis of large datasets, offering a promising tool for large-scale occupational accident surveillance and screening.

Keywords:

same-level falls; large language models; occupational accidents; traditional accident analysis; accident prevention; accident statistics; risk profiling; natural language processing

1. Introduction

The global burden of occupational injuries remains a critical public health and macroeconomic challenge. According to estimates by the International Labour Organization in 2019, approximately 2.93 million workers die annually from work-related factors, representing an increase of more than 12% compared with that in 2000 [1]. Among various occupational risks, falls represent a pervasive and escalating crisis, exacerbated by an aging global workforce. Data from the Global Burden of Disease 2023 study indicate that the global incidence of falls reached approximately 307 million among individuals aged 20 years and older [2]. In occupational settings, same-level falls are consistently a leading cause of non-fatal injuries [3], making proactive prevention an urgent priority.

To monitor and mitigate these risks, there is a growing trend among regulatory bodies to digitalize and standardize occupational accident reporting. For instance, the US Occupational Safety and Health Administration (OSHA) mandates electronic data submission via the Injury Tracking Application for specific employers based on size and industry [4]. In Europe, Eurostat oversees the European Statistics on Accidents at Work (ESAW) framework, which harmonizes administrative accident data collection across Member States [5]. However, the ESAW framework also includes detailed circumstance variables—such as working environment, specific physical activity, deviation, and associated material agents—that require narrative accident descriptions to be translated into multiple structured categories [5]. This granularity increases the analytical value of occupational accident databases [6], but also makes manual coding time-consuming and potentially variable across coders [7]. Similarly, in Japan, while occupational accident reports have historically been required in written form, the Ministry of Health, Labour, and Welfare has initiated digitalization in recent years, mandating the electronic submission of these reports starting in January 2025 [8]. This transition is expected to facilitate more timely analysis by replacing the manual entry of classification names with standardized coded inputs; however, the challenge of analyzing the remaining free-text narrative sections that describe accident circumstances still persists.

In Japan, the situation is similarly challenging. The number of injuries and fatalities resulting from occupational accidents requiring ≥4 days of leave remains high, with “same-level falls” being the most common type of incident in Japan [9]. Particularly, health/hygiene and retail industries have a high incidence of fall-related accidents, making the formulation of effective countermeasures an urgent priority.

In our previous study, we analyzed data on same-level fall-related accidents that occurred in these two sectors using the industrial accident database of the Ministry of Health, Labour, and Welfare. We found that 35.9% of accidents occurred outdoors, with slipping on snow and ice being the primary cause, particularly during winter [10]. This insight was gained through a meticulous process, where experts reviewed the text data describing the accident circumstances and carefully classified items such as “accident location” and “cause of accident.” However, this manual classification process, which has been used in various studies [11,12,13,14,15,16], is extremely time-consuming and labor-intensive. Previous studies using the same Japanese industrial accident database have used methods such as rule-based text searches for classification [14,15,16]; however, these approaches are limited in their applicability to large datasets, speed of analysis, and ability to handle the diverse expressions found in free-text descriptions.

Machine learning, which is a type of artificial intelligence (AI), can be used to address this problem. However, traditional methods require task-specific pretraining [17], which necessitates the manual preparation of a large volume of labeled training data. This reliance on manual labor for data preparation creates a dilemma, as it does not lead to a fundamental reduction in workload.

Recent advancements in large language models (LLMs) have changed the landscape. LLMs are pretrained on an extremely broad range of knowledge and are not specialized for specific tasks. Therefore, we hypothesized that, by directly leveraging the general-purpose language capabilities of LLMs, we can automate and streamline the analysis of occupational accidents, bypassing the task-specific model training process required by conventional machine learning. Although there have been reports of LLMs being used for labeling in the medical field [18], reports in the field of occupational safety and health remain limited. To test our hypothesis and explore the feasibility of future large-scale implementations, this pilot study aimed to evaluate the performances of several LLMs using expert-labeled data.

2. Materials and Methods

2.1. Study Design and Data Source

This study used a comparative accuracy evaluation design, comparing automatic classification using LLMs with a manually labeled reference standard. The study workflow is shown in Figure 1. The dataset was derived from the “2021 Survey on Industrial Accidents” database [19]. We selected this dataset because it is compiled by a national administrative agency (the Ministry of Health, Labour, and Welfare) based on occupational accident reports that employers are legally mandated to submit. This legal framework ensures that the dataset is highly reliable, nationally representative, and standardized, making it an ideal open-data source for validating our scalable classification methodology. The data represent a random sample of approximately one-quarter of the worker injury and illness reports submitted by employers and are publicly available as anonymized open data. Each case includes text data describing the specific circumstances of the accident (“Accident Description”). This comprehensive database encompasses not only unstructured free-text narratives of accident circumstances but also structured demographic and environmental data, including industry type, establishment size, accident type, date, time, and the victim’s age. The 2021 database contains 29,605 cases. From these, we selected 2690 cases classified as same-level falls in the retail trade and health care services. After excluding 71 cases in which the manual review determined that a same-level fall did not occur or was unclear, 2619 cases remained for the final analysis.

2.2. Reference Standard

The reference standard data, which served as the benchmark for evaluating the classification accuracy of AI in this study, were derived from manual classification results created by experts in our previous study [10]. The process was as follows. First, a physical therapist with expertise in the rehabilitation of same-level fall-related injuries read the accident description text for all 2619 cases. Subsequently, they created a draft classification consisting of the following six categories: “within/outside business premises,” “accident location,” “in a vehicle,” “cause of accident,” “causal agent,” and “injured body part.” Next, an occupational physician with occupational health expertise reviewed this draft classification for cases flagged as unclear or difficult to judge during the initial assessment. While the manual classification was primarily based on these narrative descriptions, the experts also had access to supplementary structured data within the database (e.g., industry type) and occasionally used this information to contextualize ambiguous cases. The entire process required a substantial amount of specialized labor. We conservatively estimated that the initial classification by the physical therapist required at least 40 h, and the subsequent review by the occupational physician required at least 3 h.

This classification framework was not derived from existing international statistical standards such as the ESAW methodology. Rather, it was developed de novo through consensus among occupational health researchers, a physical therapist, and an occupational physician. It was specifically designed to capture the granular, practical nuances of same-level fall accidents in Japan, aiming to identify actionable targets for prevention.

2.3. Automatic Classification by LLMs and Performance Evaluation

To automate the manual classification described above, we used LLMs, which are a type of AI. We selected four models with varying performances from the well-known GPT series developed by OpenAI, namely GPT-4.1 (gpt-4.1-2025-04-14), GPT-4.1 mini (gpt-4.1-mini-2025-04-14), GPT-4o mini (gpt-4o-mini-2024-07-18), and o4-mini (o4-mini-2025-04-16), to compare their performances and costs. The processing was conducted in June 2025, using the latest versions of each model available at that time.

The processing was implemented on Colab (Google LLC, Mountain View, CA, USA) [20], a cloud-based execution environment, using Python (3.11) (Python Software Foundation, Beaverton, OR, USA) and the OpenAI Python SDK (1.84.0) (OpenAI, San Francisco, CA, USA) [21]. To efficiently process thousands of cases, we used the batch application programming interface (API) feature. This method allows multiple requests to be sent as a single job, which is then processed asynchronously on the server side. Although it can take up to 24 h, it is available at half the cost of synchronous processing, making it suitable for executing a large number of mutually independent tasks, such as ours. However, we observed that batch jobs occasionally stalled. To mitigate this, we designed a system that split the tasks into 100 chunks, submitted them all at once, and monitored their status at regular intervals, enabling us to cancel or recreate any tasks that were identified as stalled.

We created a specific prompt (instruction) for each of the seven classification items under investigation and executed them as individual tasks in the batch process. Notably, a crucial step in our prompt design was ensuring that the LLMs output single-digit numerical values. This approach explicitly transformed unstructured text into structured, quantitative data, minimizing output parsing errors during batch processing. The original prompts provided to the AI are listed in Table 1.

Importantly, a strict methodological separation was maintained between the reference standard and the AI classification process. The LLMs performed the tasks using only the raw text and prompts, without access to supplementary metadata. Although the original industrial accident database is publicly available and could theoretically have been included in the pre-training corpora of the LLMs, the expert labels used as the reference standard were generated de novo for this study and have not been published. Consequently, the LLMs could not have accessed the correct answers during the classification task, ensuring a robust evaluation of their independent performance.

Furthermore, the parameters (settings) for the API calls were crucial for ensuring the reliability of the results (Table 2). In this study, we set the temperature, which controls the “creativity” of the AI, to 0. This minimizes random elements, ensures that the AI generally returns the same answer for the same input, and thus ensures the reproducibility of the research.

2.4. Accuracy Evaluation and Statistical Analyses

The results classified by each LLM were compared with the manual classification results of experts (reference standard data) to evaluate their levels of agreement. Commonly used metrics were adopted to evaluate the performance of the classification models. As our classification task was a multi-class classification with three or more categories (e.g., “accident location” having three categories: indoors, outdoors, and unknown), each metric was calculated per category and then averaged.

Accuracy: The proportion of correctly classified data from the total data. Accuracy = Number of correctly classified data/total number of data.
Precision: Of the items predicted as a certain category (e.g., “outdoors”), the proportion that were actually correct. It indicates how well false positives were suppressed. Precision = TP/(TP + FP) (TP, true positive; FP, false positive).
Recall: Of the items that are actually in a certain category (e.g., “outdoors”), the proportion that the AI was able to identify. It indicates how few false negatives there were. Recall = TP/(TP + FN) (FN, false negative).
F1-score: The harmonic mean of precision and recall. This is used when a balanced evaluation of both metrics is required. A value closer to 1 indicates a better performance. F1-score = 2 × (precision × recall)/(precision + recall).
Cohen’s kappa score: Unlike simple accuracy, this metric calculates the agreement between the model’s predictions and true labels after accounting for the agreement that could occur by chance. This makes it a more robust measure of performance, particularly for imbalanced datasets.

The precision, recall, and F1-score were calculated for each specific class. To provide a summary metric for multi-class tasks (e.g., causal agent), we reported the weighted average, which calculates the average of the metric weighted by the number of true instances for each class (support).

2.4.1. Statistical Inference and Model Comparison

To assess statistical significance, we used a stratified bootstrap method with 1000 resamples, fixing the random seed at 42 for reproducibility. For each model, 95% confidence intervals (CIs) for the performance metrics were estimated using the bias-corrected and accelerated bootstrap method, which provides higher accuracy than that of the standard percentile method.

To test for performance differences between the models, the distribution of differences in the metrics for all six possible model pairs was calculated. Bonferroni correction was applied to address the multiple comparison problem. To control the family-wise error rate at 5%, the overall significance level of α = 0.05 was divided by the number of comparisons (6), resulting in a corrected significance level of α′ ≈ 0.0083. A statistically significant difference was determined if the resulting 99.17% CI for the difference did not include 0.

2.4.2. Analysis Environment

The analyses were performed using Python (version 3.12.1) with the following key libraries: pandas (version 2.2.2), scikit-learn (version 1.6.1), SciPy (version 1.16.2), and NumPy (version 2.0.2).

2.5. Ethical Considerations

In this study, we exclusively used the publicly available, fully anonymized Survey on Industrial Accidents database provided by the Ministry of Health, Labour, and Welfare. Therefore, this study falls outside the scope of the “Ethical Guidelines for Life Science and Medical Research Involving Human Subjects” and did not require approval from an ethics review committee.

3. Results

3.1. Classification Accuracy

We automatically classified 2619 same-level fall-related accident cases using four LLMs and evaluated their accuracy against the manual classification results as the reference standard. We report the results for four representative categories: “accident location,” “cause of accident,” “causal agent,” and “injured body part.” To ensure consistency and comparability, these categories were selected to align with the classification framework used in a previous study that established manual classification data [10]. The results are summarized in Table 3. Notably, in one case, GPT-4.1 mini provided a text response instead of the instructed number, which was manually reclassified as unknown.

Overall, the oldest model, GPT-4o mini, performed the poorest. For “causal agent,” its accuracy was significantly lower, at 37.1% (95% CI, 0.35–0.39). In contrast, the o4-mini model achieved 72.8% accuracy (95% CI, 0.71–0.75), significantly outperforming GPT-4.1 (63.8%) and GPT-4.1 mini (59.9%). A strict performance hierarchy was confirmed for this item: o4-mini > GPT-4.1 > GPT-4.1 mini > GPT-4o mini. For the other items, the top three models generally achieved accuracies of 84–92%, showing no significant differences among themselves, but consistently outperformed GPT-4o mini.

For the “causal agent” category, which had the lowest accuracy, the confusion matrices for the worst-performing model (GPT-4o mini) and the best-performing model (o4-mini) are shown in Figure 2 and Figure 3, respectively.

3.2. Processing Time and Cost

The time and estimated cost for processing all 2619 cases using the Batch API for the four models are listed in Table 4. Even with the most expensive model, o4-mini, the processing was completed in approximately 90 min for approximately $11. The lightweight model, GPT-4.1 mini, took approximately 24 min and incurred a cost of $0.75.

Based on Japanese government statistics for average hourly wages (2257 JPY for physical therapists and 6445 JPY for physicians) [22,23], the estimated labor cost for manual expert classification was approximately 109,615 JPY (approximately $730; $1 = 150 JPY). This cost is more than 60 times higher than that of the most expensive LLM model used in this study (o4-mini, $10.58).

4. Discussion

In this study, we automated the classification of occupational accident text data using LLMs and verified their accuracy and practicality. The results indicate that LLMs may support or partially automate expert classification that required substantial specialized labor.

4.1. Advantages over Traditional Machine Learning

The LLM-based approach adopted in this study has a decisive advantage over traditional machine learning models, such as deep learning, which eliminates the need for task-specific training.

When building a text classification model with conventional methods, one must first manually prepare thousands of “training data” samples, which are then divided into “training sets” and “evaluation sets.” The model learns specific classification patterns from the training set and is then tested on unseen evaluation data. The process of creating and training with labeled data is a specialized and time-consuming task, posing a major obstacle to automation. For a dataset of several thousand items, such as ours, manual labeling of the training data would essentially complete the classification of the entire set, limiting the utility of machine learning.

In contrast, the method used in this study achieved high-accuracy classification with only simple prompts and without any training. Unlike humans, LLMs are not susceptible to fatigue or inconsistencies in judgment, allowing stable and high-quality classification. This “training-free” characteristic, often referred to as zero-shot learning, dramatically reduces the initial cost and preparation time for analysis, making advanced text analysis accessible to more researchers and practitioners.

4.2. Practicality of Using the Batch API

As shown in Table 4, even with the most expensive model, o4-mini, the classification of 2619 cases was completed in approximately 90 min for approximately $11. This represents a significant improvement in efficiency. While the manual expert classification required over 43 h of highly specialized labor, costing an estimated $730 based on standard wage scales, the LLM-based approach showed substantial agreement with the expert-derived reference standard at a fraction of the cost ($11) and time (90 min). This economic and temporal advantage suggests that the combination of LLMs and the Batch API may provide a practical and scalable option for large-scale data analysis in occupational safety and health. This efficiency opens the door to larger-scale and more frequent analyses that were previously prohibitive owing to cost and time constraints. For example, it could enable the monthly analysis of national occupational accident data to detect early signs of new risks or rapid, detailed factor analysis for specific industries or tasks. Worker injury reports that form the basis of this database, originally prepared and submitted manually, have been slated to be submitted electronically as a general rule since January 2025 [8]. This transition is expected to further accelerate the speed of analysis.

4.3. Interpretation of Classification Accuracy

Among the models used, GPT-4o mini was the oldest and performed the worst. Its successor, GPT-4.1, and its lightweight version, GPT-4.1 mini, achieved accuracies of 79–92% for items other than “causal agent,” with kappa coefficients being generally > 0.7. According to Landis & Koch’s criteria [24], the agreement for these three items (indoor/outdoor classification, cause of accident, and injured body part) would be rated as “substantial” or higher, with some even classified as “almost perfect.” This suggests that the LLM-based classification achieved substantial agreement with the expert-derived reference standard for several items. The lower accuracy for “causal agent” seems to stem partly from the inherent ambiguity of the categories, with significant confusion observed among “other obstacles,” “other,” and “unknown.” The definitions for these categories were not explicitly set and were left to the discretion of the human raters, which likely posed a difficult task for the AI. Furthermore, these broad categories, particularly “other obstacles” and “other,” limit the specificity required for actionable, evidence-based interventions. This categorization scheme was adopted from our previous epidemiological study [10] to maintain comparability, rather than being a limitation of the LLM approach itself. For this automated method to serve as a truly practical and scalable approach for safety management in the future, it will be essential to develop more granular classification categories and refine prompt design to extract highly specific causal agents.

However, the o4-mini model achieved a kappa coefficient > 0.6, significantly outperforming all the other models. Notably, o4-mini is the only reasoning model among those evaluated in this study [25]. It is presumed that its “reasoning” process led to judgments more aligned with human intuition. This hypothesis is supported by the fact that o4-mini showed significant superiority only in the most complex and ambiguous category and performed comparably to other newer models in simpler tasks. A study in the public health domain that attempted to annotate social media data using GPT-4 Turbo reported lower LLM accuracy for tasks requiring contextual understanding [26], which aligns with our findings. Recent studies in occupational and safety-related free-text analysis further support this interpretation. Dunstan et al. applied natural language processing and LLMs to Spanish-language work-accident reports and evaluated mechanism extraction against human annotation, demonstrating both the scalability of this approach and the persistent challenges of mechanism coding in real-world occupational narratives [27]. Similarly, Nakamura et al. applied GPT to occupational incident texts without additional training and reported that prompt design, particularly one-shot prompting, materially affected performance in specialized causal annotation tasks [28]. Taken together with our findings, these studies suggest that general-purpose LLMs are promising tools for occupational safety text processing; however, their performance is strongly influenced by task complexity, category specificity, and prompt design. We also tested newer models, and as noted, the results suggest ongoing improvements in this area.

These findings indicate that the AI-based classification possesses a certain degree of accuracy. Depending on the application, required accuracy, and speed, one could consider using generative AI for full classification, as one of the classifiers in a team, or as an auxiliary tool. It may also be useful to perform preliminary AI-based classification before manual classification to check whether the classification criteria and content are clearly defined.

4.4. Methodological Implications and Data Governance

Before the recent emergence of prompt-based LLMs, occupational accident narratives were mainly analyzed using rule-based or supervised machine-learning approaches. For example, Goh and Ubeynarayana compared six text-mining algorithms for classifying construction accident narratives from the OSHA database [29], while Goldberg applied word-embedding models to automatically code OSHA accident narratives across multiple dimensions, including body part, source, and event type [30]. More recently, Song et al. developed a fine-tuned KoBERT-based classifier for Korean industrial accident summaries and reported high performance, but this approach required large-scale training data and task-specific model development [31]. Compared with these approaches, the present pilot study focused on a zero-shot, prompt-based framework that reduced task-specific development costs, although the more ambiguous “causal agent” category remained challenging.

Although our classification framework was developed independently of ESAW, the ESAW literature provides an important international reference for the broader challenge of structuring occupational accident information. Studies using detailed ESAW variables have shown their value for accident analysis and risk profiling [6], while reliability studies have reported lower agreement for some accident-circumstance variables than for basic worker descriptors [7]. In practice, the implementation of ESAW-type structured coding in occupational accident investigation also appears to be incomplete; an analysis of 567 investigation reports found that the eight ESAW variables considered most important were properly identified and coded in only about one quarter of cases on average, highlighting the operational difficulty of recording these variables consistently [32]. In this context, the prompt-based LLM approach evaluated in our pilot study may complement existing systems by helping translate free-text narratives into structured accident variables at scale, while still requiring human oversight for ambiguous cases.

Our findings align with the evolving international paradigm of occupational safety informatics. Recent studies utilizing the ESAW framework have demonstrated the efficacy of machine learning methods, such as decision tree algorithms, in predicting accident severity based on structured occupational data [33]. While these approaches provide valuable insights, they primarily rely on pre-categorized variables and do not directly address the task of translating free-text narratives into coded variables. In contrast, the LLM-based approach demonstrated in this pilot study offers a scalable alternative capable of directly decoding complex causal agents and accident circumstances from free-text reports.

However, for such AI-driven approaches to be robust and broadly applicable, the underlying data and predictive models must adhere to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles [34]. As regulatory bodies transition to electronic reporting, maximizing the utility of occupational health data will require ensuring cross-institutional interoperability. In this context, automating the structuring of legacy text data via LLMs is a potentially useful step toward achieving this goal. Furthermore, to mitigate inherent AI risks such as algorithmic hallucinations and counterfactual biases, as highlighted in the World Health Organization guidelines for large multimodal models, maintaining “human-in-the-loop” oversight remains essential [35]. In our framework, the LLM functions as a high-speed processor of large datasets, supporting occupational safety analysis and decision-making rather than replacing human experts.

4.5. Study Limitations

This study has some limitations. First, as this was a pilot study conducted to assess initial feasibility, validation was limited to a specific dataset. Whether similar performance can be achieved for other industries or types of accidents requires separate verification. The reference classification was developed by a single physical therapist and selectively reviewed by an occupational physician for cases deemed unclear or difficult. In addition, the human reviewers occasionally referred to supplementary structured variables in the database, whereas the LLMs classified cases using narrative text alone. This difference in available information may have influenced the comparison. Although this approach follows practical workflows, future large-scale studies should use independent double coding with interrater reliability assessment to further establish the robustness of the ground truth. Second, the performance of LLMs depends on the quality of the input text data. This interpretation is consistent with a systematic review of occupational injury text analytics, which identified terminology inconsistency, variability in narrative style, and class imbalance as recurring sources of classification difficulty [36]. The data used in this study were created by various businesses, with some descriptions being detailed and others not. A lack of detail can lead to more guesswork and reduced accuracy. Our evaluation was conducted on a full year’s worth of actual data, demonstrating a certain level of accuracy, even with this variability. Third, the prompts used in this study represent only one possible approach, and the accuracy can potentially be improved with further refinement (e.g., few-shot learning, in which a small number of concrete examples are provided within the prompt to guide AI). Nevertheless, we believe that a certain level of accuracy is achievable, even with simple prompts.

5. Conclusions

This study demonstrates that in the classification of text data from occupational accident reports, LLMs may serve as useful support tools for reducing the burden of manual expert classification. Furthermore, using the Batch API, large-scale data analysis can be performed efficiently and at low cost. The application of this method facilitates large-scale, multifaceted accident analysis, which is often difficult owing to time and human resource costs, and may support more timely occupational accident surveillance and the identification of potential targets for prevention. The findings of this study represent an important step toward advancing occupational safety and health intelligence by leveraging AI technology that does not require task-specific model training. This automated approach may also support the United Nations’ Sustainable Development Goal 8, specifically Target 8.8, by enabling more timely analysis of occupational accident data [37,38].

Author Contributions

Conceptualization: H.A. and R.M.; Formal analysis: H.A.; Investigation: H.A., R.M. and S.Y.; Writing—original draft: H.A.; Writing—review & editing: H.A., R.M., S.Y. and A.O.; Visualization: H.A.; Supervision: A.O.; Project administration: H.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were not required for this study as it was based solely on publicly available, anonymized data, in accordance with local legislation and institutional requirements.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found here: https://anzeninfo.mhlw.go.jp/anzen_pgm/SHISYO_FND.html (accessed on 29 March 2026). The labeled data generated during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial intelligence
API	Application programming interface
CI	confidence interval
ESAW	European Statistics on Accidents at Work
LLM	Large language model
OSHA	Occupational Safety and Health Administration
FN	False negative
FP	False positive
TP	True positive

References

International Labour Organization. A Call for Safer and Healthier Working Environments. Available online: https://www.ilo.org/publications/call-safer-and-healthier-working-environments (accessed on 29 March 2026).
Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2023 (GBD 2023) Results. Institute for Health Metrics and Evaluation (IHME). 2024. Available online: https://ghdx.healthdata.org/gbd-2023 (accessed on 29 March 2026).
Chang, W.-R.; Leclercq, S.; Lockhart, T.E.; Haslam, R. State of science: Occupational slips, trips and falls on the same level. Ergonomics 2016, 59, 861–883. [Google Scholar] [CrossRef] [PubMed]
Occupational Safety and Health Administration. Injury Tracking Application (ITA). Available online: https://www.osha.gov/injuryreporting/ (accessed on 29 March 2026).
Eurostat. European Statistics on Accidents at Work (ESAW)—Summary Methodology—2013 Edition. Available online: https://op.europa.eu/en/publication-detail/-/publication/59b4ca26-0ac9-476a-91c6-82dbc2f0a850 (accessed on 29 March 2026).
Jacinto, C.; Soares, C.G. The added value of the new ESAW/Eurostat variables in accident analysis in the mining and quarrying industry. J. Saf. Res. 2008, 39, 631–644. [Google Scholar] [CrossRef]
Molinero-Ruiz, E.; Pitarque, S.; Fondevila-McDonald, Y.; Martin-Bustamante, M. How reliable and valid is the coding of the variables of the European Statistics on Accidents at Work (ESAW)? A need to improve preventive public policies. Saf. Sci. 2015, 79, 72–79. [Google Scholar] [CrossRef]
Ministry of Health, Labour and Welfare, Japan. The Reporting Requirements for the Report of Worker Death, Injury, or Illness Will Be Revised, and Electronic Submission Will Become Mandatory (Effective 1 January 2025). Available online: https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/koyou_roudou/roudoukijun/denshishinsei_00002.html (accessed on 29 March 2026). (In Japanese)
Ministry of Health, Labour and Welfare, Japan. Occupational Accident Statistics for 2023. Available online: https://www.mhlw.go.jp/stf/newpage_40395.html (accessed on 29 March 2026). (In Japanese)
Matsugaki, R.; Yamakawa, S.; Ando, H.; Ogami, A. Same-level fall injuries among healthcare and retail workers: Focus on outdoor incidents. Sangyo Eiseigaku Zasshi 2025, 67, 295–301. (In Japanese) [Google Scholar] [CrossRef]
Lincoln, A.E.; Sorock, G.S.; Courtney, T.K.; Wellman, H.M.; Smith, G.S.; Amoroso, P.J. Using narrative text and coded data to develop hazard scenarios for occupational injury interventions. Inj. Prev. 2004, 10, 249–254. [Google Scholar] [CrossRef] [PubMed]
Lombardi, D.A.; Pannala, R.; Sorock, G.S.; Wellman, H.; Courtney, T.K.; Verma, S.; Smith, G.S. Welding related occupational eye injuries: A narrative analysis. Inj. Prev. 2005, 11, 174–179. [Google Scholar] [CrossRef]
Bertke, S.J.; Meyers, A.R.; Wurzelbacher, S.J.; Measure, A.; Lampl, M.P.; Robins, D. Comparison of methods for auto-coding causation of injury narratives. Accid. Anal. Prev. 2016, 88, 117–123. [Google Scholar] [CrossRef][Green Version]
Wasaki, N.; Takahashi, A. Characteristics of occupational accidents caused by inattentiveness. J. Occup. Saf. Health 2024, 17, 93–104. (In Japanese) [Google Scholar] [CrossRef]
Sugama, A. Present situation of falls from step ladders and future perspectives on preventative countermeasures. J. Occup. Saf. Health 2017, 10, 55–58. (In Japanese) [Google Scholar] [CrossRef]
Hayashi, C.; Ogata, S.; Toyoda, H.; Tanemura, N.; Okano, T.; Umeda, M.; Mashino, S. Risk factors for fracture by same-level falls among workers across sectors: A cross-sectional study of national open database of the occupational injuries in Japan. Public Health 2023, 217, 196–204. [Google Scholar] [CrossRef]
Lu, H.; Ehwerhemuepha, L.; Rakovski, C. A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance. BMC Med. Res. Methodol. 2022, 22, 181. [Google Scholar] [CrossRef] [PubMed]
Balch, J.A.; Desaraju, S.S.; Nolan, V.J.; Vellanki, D.; Buchanan, T.R.; Brinkley, L.M.; Penev, Y.; Bilgili, A.; Patel, A.; Chatham, C.E.; et al. Language models for multilabel document classification of surgical concepts in exploratory laparotomy operative notes: Algorithm development study. JMIR Med. Inform. 2025, 13, e71176. [Google Scholar] [CrossRef] [PubMed]
Ministry of Health, Labour and Welfare, Japan. Database of Serious Occupational Accidents (Fatalities and Cases Involving Four or More Days of Leave). Available online: https://anzeninfo.mhlw.go.jp/anzen_pgm/SHISYO_FND.html (accessed on 29 March 2026). (In Japanese)
Google LLC. Colab. Available online: https://colab.google/ (accessed on 29 March 2026).
OpenAI Inc. Libraries | OpenAI API. Available online: https://developers.openai.com/api/docs/libraries?language=python (accessed on 29 March 2026).
Ministry of Health, Labour and Welfare, Japan. Job Tag: Physical Therapist (PT). Available online: https://shigoto.mhlw.go.jp/User/Occupation/Detail/167 (accessed on 29 March 2026). (In Japanese)
Ministry of Health, Labour and Welfare, Japan. Job Tag: Occupational Physician. Available online: https://shigoto.mhlw.go.jp/User/Occupation/Detail/583 (accessed on 29 March 2026). (In Japanese)
Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
OpenAI Inc. OpenAI o3 and o4-Mini System Card. Available online: https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf (accessed on 29 March 2026).
Kazari, K.; Chen, Y.; Shakeri, Z. Scaling public health text annotation: Zero-shot learning vs. crowdsourcing for improved efficiency and labeling accuracy. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2025, 2025, 1–4. [Google Scholar] [CrossRef]
Dunstan, J.; Campaña-Herrera, V.; Miranda, L.; Ladrón De Guevara, R.; Pincheira, P.; Rocco, V.; Moyano-Dávila, D. Sex differences in work-related accidents extracted from free text in Spanish using natural language processing. BMC Public Health 2025, 25, 2746. [Google Scholar] [CrossRef]
Nakamura, M.; Hayamizu, S.; Masanori, H.; Fuseya, T.; Iwamatsu, H.; Terada, K. Causal reasoning of occupational incident texts using large language models. Procedia Comput. Sci. 2024, 246, 820–829. [Google Scholar] [CrossRef]
Goh, Y.M.; Ubeynarayana, C.U. Construction accident narrative classification: An evaluation of text mining techniques. Accid. Anal. Prev. 2017, 108, 122–130. [Google Scholar] [CrossRef]
Goldberg, D.M. Characterizing accident narratives with word embeddings: Improving accuracy, richness, and generalizability. J. Saf. Res. 2022, 80, 441–455. [Google Scholar] [CrossRef]
Song, J.-H.; Shin, S.-H.; Kang, S.-Y.; Won, J.-H.; Yoo, K.-H. Occurrence type classification for establishing prevention plans based on industrial accident cases using the KoBERT model. Appl. Sci. 2024, 14, 9450. [Google Scholar] [CrossRef]
Salguero-Caparros, F.; Suarez-Cebador, M.; Rubio-Romero, J.C. Analysis of investigation reports on occupational accidents. Saf. Sci. 2015, 72, 329–336. [Google Scholar] [CrossRef]
Ordysiński, S. Prediction of the injury severity of accidents at work: A new approach to analysis of already existing statistical data. Appl. Sci. 2025, 15, 10666. [Google Scholar] [CrossRef]
Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018. [Google Scholar] [CrossRef]
World Health Organization. Ethics and Governance of Artificial Intelligence for Health: Guidance on Large Multi-Modal Models, 2024. Available online: https://iris.who.int/handle/10665/375579 (accessed on 29 March 2026).
Khairuddin, M.Z.F.; Hasikin, K.; Abd Razak, N.A.; Lai, K.W.; Osman, M.Z.; Aslan, M.F.; Sabanci, K.; Azizan, M.M.; Satapathy, S.C.; Wu, X. Predicting occupational injury causal factors using text-based analytics: A systematic review. Front. Public Health 2022, 10, 984099. [Google Scholar] [CrossRef]
United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development. 2015. Available online: https://sdgs.un.org/2030agenda (accessed on 29 March 2026).
United Nations Statistics Division. SDG Indicator Metadata: Indicator 8.8.1. Fatal and Non-Fatal Occupational Injuries Per 100,000 Workers, by Sex and Migrant Status. Available online: https://unstats.un.org/sdgs/metadata/files/Metadata-08-08-01.pdf (accessed on 29 March 2026).

Figure 1. Flowchart of the study design and data processing pipeline. The dataset of 2619 same-level fall narratives was processed through a dual classification pipeline to evaluate the accuracy and efficiency of four LLMs against expert manual classification. AI, artificial intelligence; LLMs, large language models.

Figure 2. Confusion matrix for the causal agents of same-level falls (model: GPT-4o mini) 0 for water, 1 for oil, 2 for snow/ice, 3 for a step/uneven surface, 4 for an obstacle other than the aforementioned, 5 for other, or 9 for unknown. Darker colors indicate higher numbers of cases. LLM, large language model.

Figure 3. Confusion matrix for the causal agents of same-level falls (model: o4-mini) 0 for water, 1 for oil, 2 for snow/ice, 3 for a step/uneven surface, 4 for an obstacle other than the aforementioned, 5 for other, or 9 for unknown. Darker colors indicate higher numbers of cases. LLM, large language model.

Table 1. Original prompts input to large language models for each classification item.

Item	Prompt
Within/outside business premises	You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify whether it occurred within the victim’s business premises. Respond with a single digit: 0 for within premises, 1 for outside premises, or 9 for unknown.
Accident location (indoor/outdoor)	You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify whether it occurred indoors. Respond with a single digit: 0 for indoors, 1 for outdoors, or 9 for unknown.
In a vehicle	You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify whether it occurred while riding in a vehicle. Vehicles include not only cars and public transportation but also bicycles and animals. Respond with a single digit: 0 if in a vehicle, 1 if not, or 9 for unknown.
Cause of accident	You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify the direct cause of the same-level fall. If multiple causes apply, select only the one you believe had the greatest impact. Respond with a single digit: 0 for slip, 1 for trip, 2 for misstep/stumble, 3 for loss of balance, 4 for other, or 9 for unknown.
Causal agent	You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify the direct cause of the same-level fall. If multiple causes apply, select only the one you believe had the greatest impact. Respond with a single digit: 0 for water, 1 for oil, 2 for snow/ice, 3 for a step/uneven surface, 4 for an obstacle other than the aforementioned, 5 for other, or 9 for unknown.
Injured body part	You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify the injured body part. If multiple parts apply, select only the one you believe was most significantly affected. If there are multiple equivalent injuries, classify as “other.” Respond with a single digit: 0 for upper limb, 1 for lower limb, 2 for back/waist, 3 for shoulder/neck, 4 for head, 5 for other, or 9 for unknown.
Same-level fall determination	You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please determine whether the victim fell in this incident. Respond with a single digit: 0 if they fell, 1 if no same-level fall occurred, 2 if they fell as a result of another cause like fainting, or 3 if it is not possible to determine.

Note: The original prompts were written in Japanese; the table presents English translations.

Table 2. Main application programming interface parameters.

Parameter	Description	Value
Model	Name of the large language model used.	“GPT-4.1, GPT-4.1-mini, GPT-4o-mini, o4-mini”
Temperature	Controls the randomness (creativity) of the output. Set to 0 to ensure reproducibility.	0
max_tokens	The maximum length of the generated response. A “token” represents a basic unit of text processed by artificial intelligence, roughly equivalent to a word or syllable. Set to a low value as a single-digit response is expected.	10

Note: The reasoning model o4-mini does not support temperature and max_tokens; thus, these were not set.

Table 3. Classification performance for each item and model.

	Accuracy	Precision	F1-Score	Kappa
Indoor/outdoor
GPT-4o mini	0.911 [0.900, 0.922] ^b	0.892 [0.878, 0.905] ^c	0.901 [0.889, 0.913] ^b	0.810 [0.787, 0.831] ^b
GPT-4.1 mini	0.921 [0.911, 0.931] ^a,b	0.914 [0.902, 0.926] ^b	0.917 [0.905, 0.928] ^a	0.835 [0.814, 0.856] ^a
GPT-4.1	0.923 [0.912, 0.933] ^a	0.916 [0.902, 0.927] ^b	0.919 [0.907, 0.930] ^a	0.838 [0.815, 0.857] ^a
o4-mini	0.913 [0.900, 0.922] ^a,b	0.929 [0.917, 0.939] ^a	0.920 [0.909, 0.929] ^a	0.824 [0.801, 0.842] ^a,b
Cause of accident
GPT-4o mini	0.796 [0.779, 0.811] ^c	0.757 [0.730, 0.785] ^c	0.751 [0.730, 0.770] ^c	0.715 [0.692, 0.734] ^c
GPT-4.1 mini	0.807 [0.791, 0.823] ^b	0.805 [0.783, 0.825] ^b	0.769 [0.749, 0.788] ^b	0.734 [0.712, 0.753] ^b
GPT-4.1	0.789 [0.772, 0.804] ^c	0.721 [0.698, 0.742] ^c	0.732 [0.710, 0.752] ^d	0.709 [0.688, 0.728] ^c
o4-mini	0.818 [0.803, 0.832] ^a	0.846 [0.829, 0.860] ^a	0.784 [0.765, 0.802] ^a	0.749 [0.729, 0.766] ^a
Causal agent
GPT-4o mini	0.371 [0.352, 0.389] ^d	0.601 [0.547, 0.634] ^c	0.332 [0.312, 0.352] ^d	0.283 [0.264, 0.301] ^d
GPT-4.1 mini	0.599 [0.580, 0.618] ^c	0.671 [0.640, 0.695] ^b	0.534 [0.513, 0.557] ^c	0.510 [0.490, 0.532] ^c
GPT-4.1	0.638 [0.618, 0.655] ^b	0.732 [0.713, 0.747] ^a	0.592 [0.570, 0.612] ^b	0.559 [0.537, 0.580] ^b
o4-mini	0.728 [0.711, 0.745] ^a	0.747 [0.729, 0.764] ^a	0.709 [0.689, 0.728] ^a	0.662 [0.640, 0.683] ^a
Injured body part
GPT-4o mini	0.798 [0.783, 0.814] ^b	0.847 [0.805, 0.859] ^b	0.790 [0.775, 0.806] ^b	0.699 [0.678, 0.721] ^b
GPT-4.1 mini	0.849 [0.835, 0.862] ^a	0.872 [0.859, 0.882] ^a	0.849 [0.835, 0.862] ^a	0.774 [0.753, 0.793] ^a
GPT-4.1	0.840 [0.825, 0.853] ^a	0.868 [0.854, 0.878] ^a	0.845 [0.831, 0.857] ^a	0.764 [0.745, 0.783] ^a
o4-mini	0.843 [0.828, 0.857] ^a	0.878 [0.866, 0.887] ^a	0.850 [0.835, 0.862] ^a	0.771 [0.749, 0.790] ^a

The precision, recall, and F1-score values are weighted averages. The 95% confidence intervals were estimated using bias-corrected and accelerated bootstrapping (1000 resamples). Statistical significance was determined using a Bonferroni-corrected alpha level of α′ = 0.0083 (0.05/6 model pairwise comparisons). Different lowercase letters indicate statistically significant differences between models within each classification item (Bonferroni-corrected p < 0.0083). Models sharing at least one letter were not significantly different. Weighted recall was not reported because it is mathematically identical to accuracy in single-label multi-class classification tasks.

Table 4. Processing time and estimated cost for batch processing with each model.

Model	Price (USD)	Time * (h:min:s)
GPT-4o mini	$0.28	2:53:55
GPT-4.1 mini	$0.75	0:24:04
GPT-4.1	$3.73	0:19:19
o4-mini	$10.58	1:27:45

* When using the batch application programming interface, jobs are executed within 24 h based on resource availability; therefore, processing times may vary considerably.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ando, H.; Matsugaki, R.; Yamakawa, S.; Ogami, A. Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study. Occup. Health 2026, 1, 16. https://doi.org/10.3390/occuphealth1020016

AMA Style

Ando H, Matsugaki R, Yamakawa S, Ogami A. Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study. Occupational Health. 2026; 1(2):16. https://doi.org/10.3390/occuphealth1020016

Chicago/Turabian Style

Ando, Hajime, Ryutaro Matsugaki, Sakumi Yamakawa, and Akira Ogami. 2026. "Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study" Occupational Health 1, no. 2: 16. https://doi.org/10.3390/occuphealth1020016

APA Style

Ando, H., Matsugaki, R., Yamakawa, S., & Ogami, A. (2026). Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study. Occupational Health, 1(2), 16. https://doi.org/10.3390/occuphealth1020016

Article Menu

Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study

Abstract

1. Introduction

2. Materials and Methods

2.1. Study Design and Data Source

2.2. Reference Standard

2.3. Automatic Classification by LLMs and Performance Evaluation

2.4. Accuracy Evaluation and Statistical Analyses

2.4.1. Statistical Inference and Model Comparison

2.4.2. Analysis Environment

2.5. Ethical Considerations

3. Results

3.1. Classification Accuracy

3.2. Processing Time and Cost

4. Discussion

4.1. Advantages over Traditional Machine Learning

4.2. Practicality of Using the Batch API

4.3. Interpretation of Classification Accuracy

4.4. Methodological Implications and Data Governance

4.5. Study Limitations

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI