Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design and Data Source
2.2. Reference Standard
2.3. Automatic Classification by LLMs and Performance Evaluation
2.4. Accuracy Evaluation and Statistical Analyses
- Accuracy: The proportion of correctly classified data from the total data. Accuracy = Number of correctly classified data/total number of data.
- Precision: Of the items predicted as a certain category (e.g., “outdoors”), the proportion that were actually correct. It indicates how well false positives were suppressed. Precision = TP/(TP + FP) (TP, true positive; FP, false positive).
- Recall: Of the items that are actually in a certain category (e.g., “outdoors”), the proportion that the AI was able to identify. It indicates how few false negatives there were. Recall = TP/(TP + FN) (FN, false negative).
- F1-score: The harmonic mean of precision and recall. This is used when a balanced evaluation of both metrics is required. A value closer to 1 indicates a better performance. F1-score = 2 × (precision × recall)/(precision + recall).
- Cohen’s kappa score: Unlike simple accuracy, this metric calculates the agreement between the model’s predictions and true labels after accounting for the agreement that could occur by chance. This makes it a more robust measure of performance, particularly for imbalanced datasets.
2.4.1. Statistical Inference and Model Comparison
2.4.2. Analysis Environment
2.5. Ethical Considerations
3. Results
3.1. Classification Accuracy
3.2. Processing Time and Cost
4. Discussion
4.1. Advantages over Traditional Machine Learning
4.2. Practicality of Using the Batch API
4.3. Interpretation of Classification Accuracy
4.4. Methodological Implications and Data Governance
4.5. Study Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| AI | Artificial intelligence |
| API | Application programming interface |
| CI | confidence interval |
| ESAW | European Statistics on Accidents at Work |
| LLM | Large language model |
| OSHA | Occupational Safety and Health Administration |
| FN | False negative |
| FP | False positive |
| TP | True positive |
References
- International Labour Organization. A Call for Safer and Healthier Working Environments. Available online: https://www.ilo.org/publications/call-safer-and-healthier-working-environments (accessed on 29 March 2026).
- Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2023 (GBD 2023) Results. Institute for Health Metrics and Evaluation (IHME). 2024. Available online: https://ghdx.healthdata.org/gbd-2023 (accessed on 29 March 2026).
- Chang, W.-R.; Leclercq, S.; Lockhart, T.E.; Haslam, R. State of science: Occupational slips, trips and falls on the same level. Ergonomics 2016, 59, 861–883. [Google Scholar] [CrossRef] [PubMed]
- Occupational Safety and Health Administration. Injury Tracking Application (ITA). Available online: https://www.osha.gov/injuryreporting/ (accessed on 29 March 2026).
- Eurostat. European Statistics on Accidents at Work (ESAW)—Summary Methodology—2013 Edition. Available online: https://op.europa.eu/en/publication-detail/-/publication/59b4ca26-0ac9-476a-91c6-82dbc2f0a850 (accessed on 29 March 2026).
- Jacinto, C.; Soares, C.G. The added value of the new ESAW/Eurostat variables in accident analysis in the mining and quarrying industry. J. Saf. Res. 2008, 39, 631–644. [Google Scholar] [CrossRef]
- Molinero-Ruiz, E.; Pitarque, S.; Fondevila-McDonald, Y.; Martin-Bustamante, M. How reliable and valid is the coding of the variables of the European Statistics on Accidents at Work (ESAW)? A need to improve preventive public policies. Saf. Sci. 2015, 79, 72–79. [Google Scholar] [CrossRef]
- Ministry of Health, Labour and Welfare, Japan. The Reporting Requirements for the Report of Worker Death, Injury, or Illness Will Be Revised, and Electronic Submission Will Become Mandatory (Effective 1 January 2025). Available online: https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/koyou_roudou/roudoukijun/denshishinsei_00002.html (accessed on 29 March 2026). (In Japanese)
- Ministry of Health, Labour and Welfare, Japan. Occupational Accident Statistics for 2023. Available online: https://www.mhlw.go.jp/stf/newpage_40395.html (accessed on 29 March 2026). (In Japanese)
- Matsugaki, R.; Yamakawa, S.; Ando, H.; Ogami, A. Same-level fall injuries among healthcare and retail workers: Focus on outdoor incidents. Sangyo Eiseigaku Zasshi 2025, 67, 295–301. (In Japanese) [Google Scholar] [CrossRef]
- Lincoln, A.E.; Sorock, G.S.; Courtney, T.K.; Wellman, H.M.; Smith, G.S.; Amoroso, P.J. Using narrative text and coded data to develop hazard scenarios for occupational injury interventions. Inj. Prev. 2004, 10, 249–254. [Google Scholar] [CrossRef] [PubMed]
- Lombardi, D.A.; Pannala, R.; Sorock, G.S.; Wellman, H.; Courtney, T.K.; Verma, S.; Smith, G.S. Welding related occupational eye injuries: A narrative analysis. Inj. Prev. 2005, 11, 174–179. [Google Scholar] [CrossRef]
- Bertke, S.J.; Meyers, A.R.; Wurzelbacher, S.J.; Measure, A.; Lampl, M.P.; Robins, D. Comparison of methods for auto-coding causation of injury narratives. Accid. Anal. Prev. 2016, 88, 117–123. [Google Scholar] [CrossRef][Green Version]
- Wasaki, N.; Takahashi, A. Characteristics of occupational accidents caused by inattentiveness. J. Occup. Saf. Health 2024, 17, 93–104. (In Japanese) [Google Scholar] [CrossRef]
- Sugama, A. Present situation of falls from step ladders and future perspectives on preventative countermeasures. J. Occup. Saf. Health 2017, 10, 55–58. (In Japanese) [Google Scholar] [CrossRef]
- Hayashi, C.; Ogata, S.; Toyoda, H.; Tanemura, N.; Okano, T.; Umeda, M.; Mashino, S. Risk factors for fracture by same-level falls among workers across sectors: A cross-sectional study of national open database of the occupational injuries in Japan. Public Health 2023, 217, 196–204. [Google Scholar] [CrossRef]
- Lu, H.; Ehwerhemuepha, L.; Rakovski, C. A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance. BMC Med. Res. Methodol. 2022, 22, 181. [Google Scholar] [CrossRef] [PubMed]
- Balch, J.A.; Desaraju, S.S.; Nolan, V.J.; Vellanki, D.; Buchanan, T.R.; Brinkley, L.M.; Penev, Y.; Bilgili, A.; Patel, A.; Chatham, C.E.; et al. Language models for multilabel document classification of surgical concepts in exploratory laparotomy operative notes: Algorithm development study. JMIR Med. Inform. 2025, 13, e71176. [Google Scholar] [CrossRef] [PubMed]
- Ministry of Health, Labour and Welfare, Japan. Database of Serious Occupational Accidents (Fatalities and Cases Involving Four or More Days of Leave). Available online: https://anzeninfo.mhlw.go.jp/anzen_pgm/SHISYO_FND.html (accessed on 29 March 2026). (In Japanese)
- Google LLC. Colab. Available online: https://colab.google/ (accessed on 29 March 2026).
- OpenAI Inc. Libraries | OpenAI API. Available online: https://developers.openai.com/api/docs/libraries?language=python (accessed on 29 March 2026).
- Ministry of Health, Labour and Welfare, Japan. Job Tag: Physical Therapist (PT). Available online: https://shigoto.mhlw.go.jp/User/Occupation/Detail/167 (accessed on 29 March 2026). (In Japanese)
- Ministry of Health, Labour and Welfare, Japan. Job Tag: Occupational Physician. Available online: https://shigoto.mhlw.go.jp/User/Occupation/Detail/583 (accessed on 29 March 2026). (In Japanese)
- Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
- OpenAI Inc. OpenAI o3 and o4-Mini System Card. Available online: https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf (accessed on 29 March 2026).
- Kazari, K.; Chen, Y.; Shakeri, Z. Scaling public health text annotation: Zero-shot learning vs. crowdsourcing for improved efficiency and labeling accuracy. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2025, 2025, 1–4. [Google Scholar] [CrossRef]
- Dunstan, J.; Campaña-Herrera, V.; Miranda, L.; Ladrón De Guevara, R.; Pincheira, P.; Rocco, V.; Moyano-Dávila, D. Sex differences in work-related accidents extracted from free text in Spanish using natural language processing. BMC Public Health 2025, 25, 2746. [Google Scholar] [CrossRef]
- Nakamura, M.; Hayamizu, S.; Masanori, H.; Fuseya, T.; Iwamatsu, H.; Terada, K. Causal reasoning of occupational incident texts using large language models. Procedia Comput. Sci. 2024, 246, 820–829. [Google Scholar] [CrossRef]
- Goh, Y.M.; Ubeynarayana, C.U. Construction accident narrative classification: An evaluation of text mining techniques. Accid. Anal. Prev. 2017, 108, 122–130. [Google Scholar] [CrossRef]
- Goldberg, D.M. Characterizing accident narratives with word embeddings: Improving accuracy, richness, and generalizability. J. Saf. Res. 2022, 80, 441–455. [Google Scholar] [CrossRef]
- Song, J.-H.; Shin, S.-H.; Kang, S.-Y.; Won, J.-H.; Yoo, K.-H. Occurrence type classification for establishing prevention plans based on industrial accident cases using the KoBERT model. Appl. Sci. 2024, 14, 9450. [Google Scholar] [CrossRef]
- Salguero-Caparros, F.; Suarez-Cebador, M.; Rubio-Romero, J.C. Analysis of investigation reports on occupational accidents. Saf. Sci. 2015, 72, 329–336. [Google Scholar] [CrossRef]
- Ordysiński, S. Prediction of the injury severity of accidents at work: A new approach to analysis of already existing statistical data. Appl. Sci. 2025, 15, 10666. [Google Scholar] [CrossRef]
- Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018. [Google Scholar] [CrossRef]
- World Health Organization. Ethics and Governance of Artificial Intelligence for Health: Guidance on Large Multi-Modal Models, 2024. Available online: https://iris.who.int/handle/10665/375579 (accessed on 29 March 2026).
- Khairuddin, M.Z.F.; Hasikin, K.; Abd Razak, N.A.; Lai, K.W.; Osman, M.Z.; Aslan, M.F.; Sabanci, K.; Azizan, M.M.; Satapathy, S.C.; Wu, X. Predicting occupational injury causal factors using text-based analytics: A systematic review. Front. Public Health 2022, 10, 984099. [Google Scholar] [CrossRef]
- United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development. 2015. Available online: https://sdgs.un.org/2030agenda (accessed on 29 March 2026).
- United Nations Statistics Division. SDG Indicator Metadata: Indicator 8.8.1. Fatal and Non-Fatal Occupational Injuries Per 100,000 Workers, by Sex and Migrant Status. Available online: https://unstats.un.org/sdgs/metadata/files/Metadata-08-08-01.pdf (accessed on 29 March 2026).



| Item | Prompt |
|---|---|
| Within/outside business premises | You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify whether it occurred within the victim’s business premises. Respond with a single digit: 0 for within premises, 1 for outside premises, or 9 for unknown. |
| Accident location (indoor/outdoor) | You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify whether it occurred indoors. Respond with a single digit: 0 for indoors, 1 for outdoors, or 9 for unknown. |
| In a vehicle | You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify whether it occurred while riding in a vehicle. Vehicles include not only cars and public transportation but also bicycles and animals. Respond with a single digit: 0 if in a vehicle, 1 if not, or 9 for unknown. |
| Cause of accident | You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify the direct cause of the same-level fall. If multiple causes apply, select only the one you believe had the greatest impact. Respond with a single digit: 0 for slip, 1 for trip, 2 for misstep/stumble, 3 for loss of balance, 4 for other, or 9 for unknown. |
| Causal agent | You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify the direct cause of the same-level fall. If multiple causes apply, select only the one you believe had the greatest impact. Respond with a single digit: 0 for water, 1 for oil, 2 for snow/ice, 3 for a step/uneven surface, 4 for an obstacle other than the aforementioned, 5 for other, or 9 for unknown. |
| Injured body part | You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please classify the injured body part. If multiple parts apply, select only the one you believe was most significantly affected. If there are multiple equivalent injuries, classify as “other.” Respond with a single digit: 0 for upper limb, 1 for lower limb, 2 for back/waist, 3 for shoulder/neck, 4 for head, 5 for other, or 9 for unknown. |
| Same-level fall determination | You are a physician and researcher specializing in occupational medicine. I will provide a text describing the circumstances of an occupational accident. Please determine whether the victim fell in this incident. Respond with a single digit: 0 if they fell, 1 if no same-level fall occurred, 2 if they fell as a result of another cause like fainting, or 3 if it is not possible to determine. |
| Parameter | Description | Value |
|---|---|---|
| Model | Name of the large language model used. | “GPT-4.1, GPT-4.1-mini, GPT-4o-mini, o4-mini” |
| Temperature | Controls the randomness (creativity) of the output. Set to 0 to ensure reproducibility. | 0 |
| max_tokens | The maximum length of the generated response. A “token” represents a basic unit of text processed by artificial intelligence, roughly equivalent to a word or syllable. Set to a low value as a single-digit response is expected. | 10 |
| Accuracy | Precision | F1-Score | Kappa | |
|---|---|---|---|---|
| Indoor/outdoor | ||||
| GPT-4o mini | 0.911 [0.900, 0.922] b | 0.892 [0.878, 0.905] c | 0.901 [0.889, 0.913] b | 0.810 [0.787, 0.831] b |
| GPT-4.1 mini | 0.921 [0.911, 0.931] a,b | 0.914 [0.902, 0.926] b | 0.917 [0.905, 0.928] a | 0.835 [0.814, 0.856] a |
| GPT-4.1 | 0.923 [0.912, 0.933] a | 0.916 [0.902, 0.927] b | 0.919 [0.907, 0.930] a | 0.838 [0.815, 0.857] a |
| o4-mini | 0.913 [0.900, 0.922] a,b | 0.929 [0.917, 0.939] a | 0.920 [0.909, 0.929] a | 0.824 [0.801, 0.842] a,b |
| Cause of accident | ||||
| GPT-4o mini | 0.796 [0.779, 0.811] c | 0.757 [0.730, 0.785] c | 0.751 [0.730, 0.770] c | 0.715 [0.692, 0.734] c |
| GPT-4.1 mini | 0.807 [0.791, 0.823] b | 0.805 [0.783, 0.825] b | 0.769 [0.749, 0.788] b | 0.734 [0.712, 0.753] b |
| GPT-4.1 | 0.789 [0.772, 0.804] c | 0.721 [0.698, 0.742] c | 0.732 [0.710, 0.752] d | 0.709 [0.688, 0.728] c |
| o4-mini | 0.818 [0.803, 0.832] a | 0.846 [0.829, 0.860] a | 0.784 [0.765, 0.802] a | 0.749 [0.729, 0.766] a |
| Causal agent | ||||
| GPT-4o mini | 0.371 [0.352, 0.389] d | 0.601 [0.547, 0.634] c | 0.332 [0.312, 0.352] d | 0.283 [0.264, 0.301] d |
| GPT-4.1 mini | 0.599 [0.580, 0.618] c | 0.671 [0.640, 0.695] b | 0.534 [0.513, 0.557] c | 0.510 [0.490, 0.532] c |
| GPT-4.1 | 0.638 [0.618, 0.655] b | 0.732 [0.713, 0.747] a | 0.592 [0.570, 0.612] b | 0.559 [0.537, 0.580] b |
| o4-mini | 0.728 [0.711, 0.745] a | 0.747 [0.729, 0.764] a | 0.709 [0.689, 0.728] a | 0.662 [0.640, 0.683] a |
| Injured body part | ||||
| GPT-4o mini | 0.798 [0.783, 0.814] b | 0.847 [0.805, 0.859] b | 0.790 [0.775, 0.806] b | 0.699 [0.678, 0.721] b |
| GPT-4.1 mini | 0.849 [0.835, 0.862] a | 0.872 [0.859, 0.882] a | 0.849 [0.835, 0.862] a | 0.774 [0.753, 0.793] a |
| GPT-4.1 | 0.840 [0.825, 0.853] a | 0.868 [0.854, 0.878] a | 0.845 [0.831, 0.857] a | 0.764 [0.745, 0.783] a |
| o4-mini | 0.843 [0.828, 0.857] a | 0.878 [0.866, 0.887] a | 0.850 [0.835, 0.862] a | 0.771 [0.749, 0.790] a |
| Model | Price (USD) | Time * (h:min:s) |
|---|---|---|
| GPT-4o mini | $0.28 | 2:53:55 |
| GPT-4.1 mini | $0.75 | 0:24:04 |
| GPT-4.1 | $3.73 | 0:19:19 |
| o4-mini | $10.58 | 1:27:45 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Ando, H.; Matsugaki, R.; Yamakawa, S.; Ogami, A. Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study. Occup. Health 2026, 1, 16. https://doi.org/10.3390/occuphealth1020016
Ando H, Matsugaki R, Yamakawa S, Ogami A. Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study. Occupational Health. 2026; 1(2):16. https://doi.org/10.3390/occuphealth1020016
Chicago/Turabian StyleAndo, Hajime, Ryutaro Matsugaki, Sakumi Yamakawa, and Akira Ogami. 2026. "Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study" Occupational Health 1, no. 2: 16. https://doi.org/10.3390/occuphealth1020016
APA StyleAndo, H., Matsugaki, R., Yamakawa, S., & Ogami, A. (2026). Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study. Occupational Health, 1(2), 16. https://doi.org/10.3390/occuphealth1020016

