Prediction of the Injury Severity of Accidents at Work: A New Approach to Analysis of Already Existing Statistical Data

Ordysiński, Szymon

doi:10.3390/app151910666


Central Institute for Labour Protection—National Research Institute, 00-701 Warsaw, Poland

Appl. Sci. 2025, 15(19), 10666; https://doi.org/10.3390/app151910666

Submission received: 25 August 2025 / Revised: 19 September 2025 / Accepted: 27 September 2025 / Published: 2 October 2025

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

This article presents a novel statistical approach for analyzing occupational accident data from the ESAW database, aiming to improve the evaluation and prediction of accident severity among specific groups of employees. The proposed method combines univariate and multivariate analytical techniques (effect size measures and classification tree methods: CHAID and CART) to identify employee groups that are both statistically robust and meaningfully distinct. The resulting model is based on six key variables describing employee and workplace characteristics, enabling accurate prediction of accident severity within these groups. The model demonstrates high reliability in predicting accident severity, achieving over 80% accuracy in a binary classification (high vs. low risk), making it a valuable tool for risk management and proactive safety planning. The findings have both theoretical and practical implications. Theoretically, the model’s strong predictive performance suggests that accident severity is not random but follows identifiable patterns linked to underlying risk factors that go beyond standard occupational or economic classification. Practically, the model allows for a more detailed and effective categorization of work environments into high- and low-risk classes, and can support safety professionals, managers, and policymakers in identifying more precisely the employee groups that are more prone to severe accidents.

Keywords:

accidents at work; accident severity predictions; data analysis; accidents injury; classification and regression trees

1. Introduction

According to Eurostat data for the last five years, an average of more than 2.2 million accidents at work took place each year in the 27 countries of the European Union [1]. Since the beginning of the registration system, data on over 27 million accidents at work have been recorded in all European Union countries [1]. All of these accidents result in suffering and inevitable costs for the injured people, their families, enterprises, and society as a whole [2]. Many studies and initiatives have sought to mitigate these losses. Much of this work has focused on risk-based analysis and risk modelling in the work environment. The basic, commonly used risk formula multiplies the probability of an accident at work by its expected or anticipated severity. Although these two factors are mathematically equally important, most studies focus on the probability of the injury rather than its consequences. The primary reason for this could be that injury severity is perceived as random [3]. However, according to some researchers (e.g., [4,5]), prediction of the severity of accidents is essential. Firstly, it can improve the accuracy of risk assessment. Secondly, a proper predictive model of the severity of accidents occurring in a given work environment can be useful in identifying groups of workers at risk of work-related injuries that may result in long-term disability, and in supporting preventive efforts accordingly [4,5].

In the European Union, statistics on accidents at work have been collected for many years based on the ESAW (European Statistics on Accidents at Work) methodology [6]. These data provide empirically validated knowledge about accidents at work and, importantly, a fairly detailed source of information on their severity and circumstances (employee and workplace environment characteristics). Although national data on accidents at work are recognised in the European Commission guidance on risk assessment at work as one of several important sources of information on the probability and severity of accidents [7], safety practitioners rarely use them to assess the potential severity of workplace accidents. This may be due to the way the data are most often analysed.

For this reason, there is a need to propose a new approach to the analysis of already existing statistical data on accidents at work in the EU that offers both the capacity to handle a wide range of employee groups (dimensionality) and the flexibility to capture the full heterogeneity of accident severity (complexity) [8]. In response to this need, this article presents a novel statistical method of analysing existing data on accidents at work (ESAW database), with the aim of developing a reliable, empirically based statistical model capable of predicting the severity of accidents among particular groups of employees.

The main contribution of this paper is a novel method for analysing occupational accident data, providing a model capable of predicting the duration of post-accident absence for groups of workers defined by their characteristics and job types based on large empirical datasets. This paper implements a robust validation approach, including temporal variation and an expanding-window time-series split, ensuring reliable predictive performance. The results of this model can be directly applied in occupational safety practice to support targeted prevention, resource planning, and risk analysis.

2. Related Works

In recent years, the application of various data mining techniques to the analysis of occupational accidents has grown substantially [9]. This trend has been partly driven by the increasing availability of large-scale data on working conditions, resulting from the digitalisation of work processes, as well as by significant advances in statistical computing and analytical methods. These developments have enabled researchers to move beyond traditional descriptive analyses toward more sophisticated predictive modelling frameworks.

In the field of occupational accident prediction, a wide range of machine learning methods have been employed, including Bayesian networks (BNs), Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Extreme Learning Machines (ELMs), and decision trees (DTs) [10]. Among these, decision trees are particularly prominent due to their frequent use in predicting the severity of occupational accidents [10]. For example, Chang et al. [11] applied Classification and Regression Tree (CART) models to examine relationships between accident severity and various driver, vehicle, and environmental characteristics. Similarly, Cheng et al. [12] used DT models to explore cause-and-effect relationships between a wide range of workplace environmental variables and the occurrence of serious accidents in the construction industry. Notably, DTs have proven particularly effective in predicting the severity of incidents and identifying high-risk scenarios, as demonstrated in the studies of Lukáčová et al. [13] and Mistikoglu et al. [14]. In the prediction of occupational accidents, commonly used decision tree algorithms include C4.5, C5.0, CART, and Chi-square Automatic Interaction Detector (CHAID) [10]. The primary objective of employing decision trees in this context is to utilize both qualitative and quantitative data to uncover hidden patterns, thereby generating predictive insights and supporting preventive strategies.

Apart from DTs, other data mining algorithms such as ANN methods are widely recognized in the field of accident analysis. Due to their effective data-driven learning and non-parametric nature, these methods are capable of handling a wide range of accident analysis issues and offer flexibility, as they do not require prior assumptions about data distribution [10]. Good examples are the studies of He et al. [15] and Yi et al. [16]. Other data mining methods such as SVMs, BNs, and ELMs have also proven useful in analysing data on accidents at work, as they have been applied to predict workplace accidents and explain their occurrence, causes, and types in many economic sectors, such as construction, mining, shipbuilding, and services. These approaches were employed in studies by Matías, Sánchez, Rivas, and Sanmiquel [8,17,18,19]. It is worth noting that many studies in the field combine various data mining methods to improve the accuracy and robustness of occupational accident predictions. A notable example is the work of Sarkar et al. [10], in which optimized machine learning algorithms, SVMs and ANNs, were applied to predict accident outcomes such as injury, near misses, and property damage based on occupational accident data. Additionally, the C5.0 decision tree algorithm was used to extract classification rules, aiming to identify the root causes of accidents. Also noteworthy are the authors’ special efforts to achieve high predictive accuracy and model robustness.

The importance of accuracy measurements in accident prediction based on data mining methods is particularly emphasized in aviation. For example, Christopher et al. [20] used a DT algorithm to identify significant attributes for airline accident prediction. Their prediction model achieved 87% accuracy. Another example is the study of Gürbüz et al. [21], who used a DT to predict fatal incidents in aviation, achieving 99% accuracy. What is particularly important for this paper is that they achieved such good results by employing data pre-processing approaches that reduced the set of predictors from 25 to 8. Alternative pre-processing approaches have also been explored in data mining analysis [22,23]. Other successful applications of data mining methods in aviation include the studies by Nazeri et al. [24], who predicted incidents and accidents, and Viademonte et al. [25], who used association rule mining and data pre-processing to predict unsafe weather conditions. Although different methods have been employed in various economic sectors, the main conclusion is that, compared to traditional statistical models, machine learning methods have demonstrated superior performance in predicting future events [10].

Some researchers have focused on predicting the severity of accidents at work using data mining and machine learning methods. A notable example is another work of Sarkar et al. [4], who applied machine learning techniques to analyse occupational accidents and predict injury severity. The novelty of their study lies in the incorporation of both reactive and proactive data, as well as the use of oversampling methods to address class imbalance. Their findings demonstrated improved prediction accuracy and identified key factors influencing injury severity, which can support the development of safety decision-making rules. Numerous other studies have demonstrated the potential of data mining and machine learning in predicting occupational accident severity across various contexts. For example, Luo et al. [26] developed a machine learning framework to predict the severity of occupational accidents in construction engineering using unstructured textual data. By addressing class imbalance and improving text segmentation, their model achieved an accuracy of 82% and identified key attributes for effective accident prevention. Kumar et al. [27] employed methods such as elastic net regression, Random Forest, and gradient boosting to predict non-fatal injury rates with moderate accuracy (54–58%) and identify job-related descriptors as key predictors. Rahman et al. [28] applied machine learning techniques to predict the severity of industrial accidents, prioritizing accurate classification of low-severity incidents. Additionally, they used clustering and time-series analyses to uncover distinct patterns and forecast accident trends, providing insights for targeted safety interventions. Similarly, Zhu et al. [29] applied ML techniques to assess 16 influencing factors in construction accidents, identifying critical variables such as accident type and reporting practices for predicting accident severity. Das et al. [30] focused on automating the classification of severe injury narratives from OSHA data using Multinomial Naïve Bayes models. By applying effective filtering strategies, they raised the sensitivity to 88.17%, which improved the prioritization of cases for manual review. Finally, Choi et al. [31] conducted a study aimed at predicting the risk of fatal accidents using a large dataset and Random Forest (RF), identifying an effective prediction model with an accuracy of almost 92%. Despite these advancements, many existing studies are limited to specific industries and often overlook critical variables related to the circumstances of accidents. To address this gap, Khairuddin et al. [32] developed an ML-based predictive model using injury records from various sectors. Among five algorithms tested, Random Forest achieved the highest accuracy and F1-score, with almost 90% accuracy for hospitalizations and 95% for amputations. They identified type of injury, contact with material agent, and part of body injured as key predictors.

To sum up, the above examples allow us to draw several important conclusions regarding the analysis of occupational accident data. First and foremost, ensuring high predictive accuracy and model robustness is essential for reliable accident prediction, and there are several methods in the literature that have demonstrated significant potential to improve prediction outcomes. A common approach involves combining various data mining techniques to enhance both accuracy and robustness in predicting accident severity. Among these techniques, decision trees have consistently shown strong performance in accident data analysis and are frequently used in predicting accident severity, based on both qualitative and quantitative data. Finally, thorough data pre-processing, such as handling missing values, addressing class imbalance, and selecting relevant features, is crucial to achieving better prediction accuracy.

Building upon these foundations, the present study contributes to the field of accident severity prediction by incorporating these best practices identified in the literature into a comprehensive methodological framework. This framework includes robust data preparation, the use of a large and diverse dataset, and the integration of multiple analytical techniques, with a particular emphasis on decision tree algorithms.

3. Material and Methods

3.1. Subject of the Analyses

Statistical data on accidents at work are regularly recorded by almost all European Union Member States, in accordance with European Parliament regulation [33,34,35]. This data collection protocol has been standardised across countries through an implementing regulation that harmonises national accident registration systems with the ESAW (European Statistics on Accidents at Work) methodology [6].

The analyses presented in this article were carried out on data regarding injuries and accidents at work registered by the Polish Central Statistical Office (CSO) in the years 2017–2019. What is specific to the Polish registration system (and very important for the analyses conducted) is that the CSO registers data on all types of accidents at work (fatal, serious, and others), including those that do not cause any work absenteeism. Accident data are also registered for all economic sectors [36] (with the exception of national defence and public security). In Poland, as in most EU Member States, the scope and principles of recording information have been fully adapted to the Eurostat methodology (ESAW) [6]. Accordingly, eighteen, mainly multinomial, variables describing the circumstances of work accidents are recorded in the database. Six of them describe the characteristics of the injured employee, while the others describe the characteristics of the working environment, the type of work performed by the injured person, and the time of the accident.

3.2. The Predicted Variables

In the CSO database of accidents at work, there are several variables indicating the eventual consequences of accidents. One of these variables is “Days lost (severity)”, which is the number of full calendar days lost when the injured employee is unfit for work due to an accident [6]. It is a suitable indicator of the consequences of accidents for several reasons. First of all, this variable is measured on a quantitative scale, which enables the use of more advanced analytical methods and allows for a more detailed classification of the consequences of accidents at work. Moreover, the number of days lost is consistent with the other variables describing the consequences of accidents. The analyses showed a very strong relationship between the number of days lost and the descriptive severity (serious accidents versus other than serious; Cohen’s d = 1.58), as well as a relationship with the type of injury (Eta squared = 0.14). Injuries that are more likely to result in long-term or permanent incapacity for work, such as traumatic amputations (loss of body parts), bone fractures, and multiple injuries, not surprisingly caused a significantly higher average number of days lost than, for example, wounds and superficial injuries or burns and frostbite. Thus, the characteristics of the “Days lost (severity)” variable make it an appropriate choice of dependent variable for evaluating and predicting the severity (consequences) of accidents at work.
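As an illustration of the two effect size measures quoted above, the following minimal sketch computes Cohen's d (with a pooled standard deviation) and Eta squared (between-group sum of squares divided by the total sum of squares). The analyses in the article were run in SPSS; the function names and the toy absence figures below are assumptions made purely for illustration and are not taken from the CSO data.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def eta_squared(groups):
    """Share of total variance in the outcome explained by group membership."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total

# Hypothetical days-lost values (not real data).
serious = [60, 90, 120]
other = [10, 20, 30]
print(cohens_d(serious, other))
print(eta_squared([serious, other]))
```

With more than two groups (for example, types of injury), only `eta_squared` applies directly, which is why a different measure is needed for dichotomous and multi-category variables.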

3.3. Methodology

The analysis was carried out in several steps, which can be grouped into three phases: preliminary data pre-processing, variable preparation, and final group identification (Figure 1). The analyses combined one-dimensional and multidimensional methods and were conducted using IBM SPSS Statistics (version 27.0), developed by IBM Corporation (Armonk, NY, USA).

3.3.1. The Initial Phase

Prior to the main analysis, an initial phase of data pre-processing was conducted, as this is a standard step in most big data analyses. This phase involved data collection and preliminary preparation, including handling missing data, unifying data formats (e.g., occupation classification), computation, and recoding variables.

3.3.2. The First Phase—Preparation

The first phase involved a series of analyses aiming to prepare severity predictors to be later used in the final model. These predictors are variables whose variance helps determine the potential severity of occupational accidents within a given group of workers.

The primary challenge in this phase was to avoid identifying an excessive number of overly dispersed groups of victims. This problem arises from the large number of variables in the dataset (18 variables describing the accident circumstances, mainly multinomial) and the wide range of values of those variables (up to 2328 categories). Consequently, the number of groups defined by all combinations of these values is so large that it is effectively equal to the number of workplace accidents registered in the dataset. Obviously, such a small number of observations within each group results in very low reliability of the conclusions.

To improve both the precision and accuracy of predictions, constraints were introduced on the classification rules of the predictors. These constraints aimed to ensure that the resulting groups were sufficiently numerous and representative, and not based on random or scattered observations, while preserving a substantial level of detail in the group classification.

This process was conducted in two steps:

Selecting the most effective predictors of severity from the variables describing accident circumstances;
Transforming selected predictors by limiting the range of their values to ensure that the identified groups were sufficiently large.

The selection of potential predictors was based on effect size measures [37], which indicate the strength of the relationship between the explanatory variable and the dependent variable. The greater the difference in the distribution of the dependent variable between the groups defined by the values of the predictor, the stronger the relationship expressed by the effect size measure, which makes it possible to rank predictors by their effectiveness.

Since the analysed variables differed in measurement levels, appropriate effect size measures were used: Cohen’s d for dichotomous variables (e.g., gender); R-squared for quantitative variables (e.g., age); and Eta-squared for nominal variables with multiple categories (e.g., occupation). To compare these effect size measures, their results were converted into a common metric, with Cohen’s d chosen for this standardization (based on [37,38,39,40]). This conversion enabled a consistent interpretation of the strength of the analysed relationships.
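The article does not spell out the conversion formulas, so the sketch below uses the widely cited textbook conversions as an assumption: a correlation r is converted with d = 2r / sqrt(1 - r^2), and Eta squared is converted through Cohen's f (f = sqrt(eta^2 / (1 - eta^2)), d = 2f), which has the same algebraic form. The predictor names and effect values are hypothetical and serve only to show how predictors on different scales could be ranked on a common metric.

```python
import math

def r_squared_to_d(r2):
    """Convert R-squared of a single quantitative predictor to Cohen's d."""
    r = math.sqrt(r2)
    return 2 * r / math.sqrt(1 - r2)

def eta_squared_to_d(eta2):
    """Convert Eta squared to Cohen's d via Cohen's f (d = 2f)."""
    return 2 * math.sqrt(eta2 / (1 - eta2))

def rank_predictors(effects):
    """effects: {name: (measure, value)}, measure in {'d', 'r2', 'eta2'}.
    Returns (name, d) pairs sorted from strongest to weakest."""
    converters = {"d": abs, "r2": r_squared_to_d, "eta2": eta_squared_to_d}
    as_d = {name: converters[m](v) for name, (m, v) in effects.items()}
    return sorted(as_d.items(), key=lambda item: item[1], reverse=True)

# Hypothetical effect sizes (not results from the article).
example = {
    "predictor_a": ("d", 0.20),     # dichotomous variable
    "predictor_b": ("r2", 0.04),    # quantitative variable
    "predictor_c": ("eta2", 0.14),  # multi-category nominal variable
}
for name, d in rank_predictors(example):
    print(f"{name}: d = {d:.3f}")
```

Other conversion conventions exist, so the resulting ranking should be read as depending on the formulas chosen here.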

After selecting the most effective predictors, the accuracy was further improved by merging small, dispersed groups (identified by values) into larger clusters. This was achieved by using the CHAID tree method (Chi-squared Automatic Interaction Detection), a decision tree technique that tests relationships with the dependent variable [41]. The CHAID algorithm selects predictor values for splitting based on statistical significance, measured by the Chi-square test. The successive divisions are only performed if a statistically significant relationship exists, making CHAID an effective method for creating larger and more homogeneous clusters.
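The merging idea behind CHAID can be sketched in a much simplified form. The code below assumes that the outcome has been dichotomised (long versus short absence), compares categories pairwise with a Pearson chi-square test on a 2x2 table, and repeatedly merges the pair that is least distinguishable until all remaining pairs differ significantly. It omits the Bonferroni adjustment and the split step of the full algorithm, it is not the SPSS implementation used in the study, and the category names and counts are hypothetical.

```python
import math
from itertools import combinations

def chi2_p_2x2(a, b, c, d):
    """p-value of the Pearson chi-square test (df = 1) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(chi2 / 2))

def merge_categories(counts, alpha=0.05):
    """counts: {category: (n_long_absence, n_short_absence)}.
    Merge the most similar pair of groups while its p-value exceeds alpha."""
    groups = {(cat,): tuple(v) for cat, v in counts.items()}
    while len(groups) > 1:
        best_pair, best_p = None, alpha
        for g1, g2 in combinations(groups, 2):
            p = chi2_p_2x2(*groups[g1], *groups[g2])
            if p > best_p:
                best_pair, best_p = (g1, g2), p
        if best_pair is None:
            break  # every remaining pair differs significantly
        g1, g2 = best_pair
        merged = tuple(x + y for x, y in zip(groups.pop(g1), groups.pop(g2)))
        groups[g1 + g2] = merged
    return groups

# Hypothetical counts of (long absence, short absence) per occupation.
counts = {"miner": (80, 20), "blaster": (78, 22),
          "clerk": (10, 90), "teacher": (12, 88)}
for members, totals in merge_categories(counts).items():
    print(members, totals)
```

In this toy example the four occupations collapse into two clusters with clearly different shares of long absences, which mirrors the goal of obtaining fewer but larger and more homogeneous groups.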

As a result, a new classification was developed that differentiated the population in terms of the dependent variable (number of days lost), proving to be as effective as the previous classifications but with far fewer, more manageable groups. This step significantly reduced the number of identified groups and produced clusters that were more homogeneous internally in terms of post-accident absence and more clearly distinct from one another.

3.3.3. The Final Phase—Group Identification

In the final phase, the previously selected and transformed predictors were used as explanatory variables in the decision tree algorithm (CART). The CART method was used to further refine the division criteria, reducing the number of selected predictors and limiting their value range, further ensuring that the identified groups were sufficiently numerous and displayed the greatest diversity in terms of accident severity.

As highlighted in Section 2, previous studies on occupational accident datasets indicated that CART decision tree algorithms are particularly effective for modelling accident data, which are often complex and multinomial. Based on this evidence, the CART algorithm was selected for the present study due to its high interpretability, robustness in handling diverse variable types, and proven performance in similar contexts, making it well suited for the analysis of occupational accident data.

The CART method, introduced by Breiman [42,43], is a recursive partitioning decision tree-growing algorithm [4] that divides the dataset through successive partitioning. In each step, the algorithm selects a single variable and a range of its values (the attribute) that most strongly differentiate the current subset of the population with respect to the dependent variable. As a result, the identified groups are described by a set of selected values of independent variables that form a hierarchical, multivariable structure consisting of nodes and value ranges, that is, the classification rules [4]. With each successive division, the CART algorithm selects a new attribute, seeking to maximize the differentiation of the population with respect to the variability of the dependent variable. As a result, the derived classification rules identify the most distinct groups while utilizing the minimal amount of information necessary. Splits of the dataset are determined using impurity measures, such as the Gini index. The Gini index for a node can be expressed by the following equation (Equation (1)) [4,44,45]:

\mathrm{Gini}(t) = 1 - \sum_{j=1}^{p} \left( \frac{n(j \mid t)}{n(t)} \right)^{2}

(1)

where:

p is the number of classes of the response attribute;
n(j|t) represents the number of records in node t that belong to class j;
n(t) is the total number of records in node t.

The Gini index can range from 0 (when all records in the node belong to a single class) to 0.5 (for two evenly distributed classes; more generally, the maximum for p classes is 1 - 1/p) [4]. Attributes for splitting are selected based on the weighted average of the Gini index for child nodes, as in Equation (2) [4]:

\mathrm{Gini}(t)_{\mathrm{split}} = \frac{n(t_{L})}{n(t)} \mathrm{Gini}(t_{L}) + \frac{n(t_{R})}{n(t)} \mathrm{Gini}(t_{R})

(2)

where t_L and t_R are the left and right child nodes of node t.
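Equations (1) and (2) can be sketched in a few lines of code. The example below computes the Gini index of a node and the weighted Gini index of a binary split, then picks the candidate split with the lowest weighted impurity. The function names, the candidate splits, and the "high"/"low" severity labels are hypothetical and are not taken from the article's data.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node, Equation (1): 1 - sum_j (n(j|t) / n(t))^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini index of the two child nodes, Equation (2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(candidates):
    """candidates: {split_name: (left_labels, right_labels)}.
    Returns the name of the split with the lowest weighted Gini index."""
    return min(candidates, key=lambda name: gini_split(*candidates[name]))

# Hypothetical node with 4 high-severity and 4 low-severity accidents.
candidates = {
    "split_A": (["high", "high", "high", "low"], ["low", "low", "low", "high"]),
    "split_B": (["high"] * 4, ["low"] * 4),
}
print(gini(["high"] * 4 + ["low"] * 4))    # 0.5 for an evenly mixed two-class node
print(gini_split(*candidates["split_A"]))  # 0.375
print(best_split(candidates))              # split_B
```

Here split_B separates the two classes perfectly, so its weighted Gini index is 0 and it is preferred over split_A.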

The attribute with the lowest value of Gini(t)_split is chosen as the root node for splitting the analysed population into two groups. The tree-growing process also includes a pruning technique to address the issue of over-fitting the model [42], using a minimal cost-complexity measure that accounts for misclassified data, the number of tree leaves, and a complexity parameter [4].
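For reference, the minimal cost-complexity measure introduced with CART is commonly written as follows, where R(T) is the misclassification cost of tree T, |\tilde{T}| is its number of leaves, and \alpha \geq 0 is the complexity parameter. The notation is the standard textbook one and is given here as an assumption, since the article describes the measure only in words:

R_{\alpha}(T) = R(T) + \alpha \, |\tilde{T}|

Larger values of \alpha penalise additional leaves more heavily, so for a given \alpha pruning retains the subtree with the smallest R_{\alpha}(T).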

4. Results

4.1. Data Preparation

4.1.1. Reducing the Number of Predictors

A one-dimensional analysis was conducted to evaluate the strength of the relationship between accident severity (number of days lost) and each of the characteristics of employees and their working environment (18 variables), in order to determine how effectively each variable predicts accident severity. The effect size measure was matched to the measurement level of each variable, and the calculated values were then converted into Cohen’s d to enable comparison.

As a result of the analyses, six variables (presented in Table 1) with a statistically strong enough effect size to be interpreted as an existing relationship (zone of desired effect) [37] were identified as effective predictors of severity.

All other variables, with a smaller value of the effect size measure, should be assessed as ineffective predictors of the severity of accidents at work, because the groups of injured persons, determined on the basis of these variables’ values, do not differentiate sufficiently in terms of the length of post-accident absence, which means that the severity cannot be effectively predicted on the basis of these variables’ values.

Nevertheless, in-depth analyses showed that predictions of the severity of accidents based on the values of any single one of these six variables are not reliable enough. This is due to the relatively low strength of the relationship and, even more so, to the low number of observations in many of the groups of workers identified on this basis. Using all six variables together to identify groups does not achieve the desired effect either, as the number of groups identified on this basis is too large and the number of injured persons within groups is often too small to be representative. For this reason, I decided to introduce further restrictions in the classification rules, this time not in the number of variables but in the range of their values.

4.1.2. Reducing the Number of Values—The Development of a New Classification

In order to reduce the number of variable values, a CHAID classification tree analysis was performed. Six variables, previously identified as effective predictors of accident severity, were used as independent variables and were transformed through the CHAID procedure. The dependent variable was the number of days lost.

The primary objective of this analysis was to reorganize variable values into fewer but more numerous categories that are more homogeneous with respect to the severity of accidents. This restructuring enabled the merging of small groups of workplace accidents into larger, more representative clusters, characterized by similar lengths of post-accident absence. The criterion for merging groups was strictly the length of absence.

As a result, a new, less fragmented classification was obtained, which increased the reliability of the results while maintaining the explanatory usefulness of the division (Figure 2). The transformation of variable values substantially reduced the number of predictor categories, without a clear loss of predictive potential for accident severity (Table 2). A noticeable reduction in effect size was observed only for the variable occupation, originally measured using a full six-digit code that divided the sample into more than 2000 groups. Although such a detailed classification provided relatively high predictive power, the excessive number of categories produced many very small groups, which undermined the representativeness of the results and the reliability of the conclusions, leaving them vulnerable to random fluctuations. From the perspective of the analysis, this reduction was therefore necessary; while the original occupational classification offered acceptable predictive accuracy, the final developed model, based on a more aggregated categorization of six variables, proved superior by combining higher predictive performance with greater robustness of the results.

The newly created classification thus provides predictive capabilities almost as effective as the original variables while offering much greater representativeness and stability of the results. Most importantly, the recoded variables with a reduced number of values could be used jointly in the final model, allowing their predictive potential to be combined. In contrast, the use of all non-recoded variables produced results that were overly fragmented and dominated by many small unrepresentative groups or even prevented the model from being computed due to hardware limitations.

Detailed results of the classification tree analysis are provided in the Supplementary Materials attached to this article. Since the CHAID analysis is only an intermediate step in the development of the final predictive model, a shortened description of the recoded classifications is presented below, focusing on those groups of workers most at risk of long-term absence due to accidents at work.

1. Occupation classification

A total of 94% of all workers classified in the node with the longest post-accident absence were underground miners and underground mining labourers. An additional 3% were workers in similar roles but employed in open-pit mines. The remaining occupations included support staff and other personnel in the mining sector not classified as miners, such as a locomotive driver or an electrical fitter for machinery and equipment used in open-pit mining. The subsequent nodes with high severity included mainly support staff and other personnel not classified as miners but clearly employed in the mining industry, such as a woodcutter, mining excavation and loading machine operator, railway track layer, electrical fitter for traction networks, blasting miner, coal processing machinery operator, and underground mining traffic supervisor. The nodes with the highest accident severity also included isolated, rarely occurring occupations not related to mining but resulting in similarly long absences, such as a chemical equipment assembler, engine mechanic, and industrial climber (cleaning building structures). Construction painters also experienced very long absences; this occupation alone formed the third node with the longest post-accident absence.

2. Material agent of physical activity

The group of workers with the longest absence was associated with only one material agent of physical activity—underground areas, namely tunnels. The two subsequent nodes with high severity included material agents also primarily related to mining, such as portable or mobile machines—for extracting materials or working the ground—mines, quarries, and plants for building and civil engineering works; excavations, trenches, wells, pits, escarpments, and garage pits; fixed machines for extracting materials or working the ground; structures, surfaces, and above-ground level parts—mobile (including scaffolding, mobile ladders, cradles, elevating platforms); vehicles—on rails, including suspended monorails; and goods. Only the fourth node contained factors less related to mining and more typical of construction, such as loads—suspended from a hoisting device or a crane; portable or mobile machines (not for working the ground)—for construction sites; other known buildings, structures, and surfaces—below ground level, in group 03 but not listed above.

3.: NACE—economic activity

Almost 100% of the node with the longest post-accident absence was composed of economic activities related to mining, with almost 98% representing the mining of hard coal and more than 2% representing the mining of lignite (brown coal). The next node in terms of severity included accidents in economic activities such as Support activities for other mining and quarrying and Support activities for petroleum and natural gas extraction, with a very small share of Marine fishing. The third node included economic activities related to construction, but not specific ones, such as Construction of other civil engineering projects n.e.c. and Specialized construction activities—excavation and geological engineering drilling. This node also comprised activities related to Support services to forestry, as well as Manufacture of lead, zinc and tin and Manufacture of industrial gases.

4.: Place of accident

It is characteristic that the node with the longest post-accident absence included exclusively accidents that occurred in Underground—mine. The second node in terms of severity contained more than half (54%) of the total number of accidents in Opencast quarry, opencast mine, excavation, trench (including opencast mines and working quarries), complemented by accidents in elevated places, whether fixed or not, such as roofs, terraces, masts, pylons, suspended platforms, or similar structures. The third node concentrated on accidents occurring at construction sites, with more than 96% taking place in buildings being demolished, repaired, maintained and in underground locations.

5.: Working process

The working process in the node with the highest severity was exclusively Excavation (100%). The second node in terms of severity also contained only one working process, namely Demolition—all types of construction. The third node included a somewhat more diversified set of processes, almost entirely related to construction, such as New construction—building, Remodelling, repairing, extending, building maintenance—all types of construction, and New construction—civil engineering, infrastructures, roads, bridges, dams, ports, and other similar types of processes. It also included Forestry type work.

6.: Age in years

The new classification of age, obtained using the CHAID method, categorized workers in a straightforward manner: the older the worker, the longer the post-accident absence. The group with the longest absence comprised workers aged 59 years and above. The subsequent ten identified groups spanned age intervals of approximately 3–4 years each.

This recoding allowed me to significantly reduce the number of groups of workers identified on the basis of each variable. However, using all six variables together to identify groups still led to too detailed a division (over 1 million groups). These results indicate that more constraints on the division rules are needed to further limit the number of identified groups. This was achieved by conducting another phase of analysis, this time using the CART method.

4.2. Final Group Identification

In this phase of analysis, the six variables that had previously been selected and transformed were used as independent variables in the CART analysis. The dependent variable was again the “number of days lost”.

Due to the lack of clear guidelines regarding the studied relationships, the CART hyperparameter settings and model selection were conducted in an exploratory manner. Multiple models with different parameter settings were tested. The selection criterion focused on the model’s ability to predict accident severity while maintaining a sufficient number of groups to ensure the representativeness and robustness of the results.

The final CART model was set to a maximum tree depth of 9. Smaller trees did not adequately partition the data and failed to provide sufficient differentiation in terms of accident severity, while larger trees excessively split the data, creating very small groups and leading to overfitting without meaningful improvement in predictive performance.

Other hyperparameters, including the minimum number of samples per split and the cost-complexity pruning parameter, were tuned to balance predictive accuracy and generalization. The final minimum number of samples per split was set to 100. Sensitivity analyses indicated that the model’s performance remained stable across reasonable variations in these parameters.
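The role of the two reported settings (maximum tree depth of 9 and a minimum of 100 samples per split) can be illustrated with a minimal regression-tree sketch. This is not the software used in the study: the toy data, the ordinal coding of the predictors, and all function names are assumptions made for illustration, and cost-complexity pruning is omitted.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (regression impurity)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return (feature, threshold, sse_after) of the best binary split, or None."""
    best = None
    for f in range(len(rows[0])):
        # Candidate thresholds: every observed value except the largest,
        # so that both children are non-empty.
        for thr in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (f, thr, score)
    return best


def grow(rows, targets, depth=0, max_depth=9, min_samples_split=100):
    """Grow a regression tree; each leaf predicts the mean number of days lost."""
    node = {"n": len(targets), "prediction": mean(targets)}
    # Stopping rules corresponding to the two hyperparameters in the text.
    if depth >= max_depth or len(targets) < min_samples_split:
        return node
    split = best_split(rows, targets)
    if split is None or split[2] >= sse(targets):
        return node  # no split reduces the impurity
    f, thr, _ = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > thr]
    node["feature"], node["threshold"] = f, thr
    node["left"] = grow([r for r, _ in left], [t for _, t in left],
                        depth + 1, max_depth, min_samples_split)
    node["right"] = grow([r for r, _ in right], [t for _, t in right],
                         depth + 1, max_depth, min_samples_split)
    return node


def count_leaves(node):
    """Number of terminal nodes, i.e., identified groups."""
    if "left" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


# Hypothetical toy data: two ordinal-coded predictors and days lost.
rows = [(i % 2, (i // 2) % 4) for i in range(400)]
days = [20 + 40 * f0 + 5 * f1 for f0, f1 in rows]
tree = grow(rows, days, max_depth=9, min_samples_split=100)
print(count_leaves(tree))
```

Raising `min_samples_split` or lowering `max_depth` in this sketch yields fewer, larger groups, which mirrors the trade-off between differentiation and group size described above.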

The results of the analysis indicate that the most important attribute of tree growth is “Material agent of physical activity” (Figure 3). The second most important is the economic activity of the enterprise, followed by employee’s occupation, place of accident, age of employee, and finally working process.

The CART analysis allowed me to identify 264 groups of persons injured in accidents at work that are well differentiated in terms of the length of post-accident incapacity for work (Figure 4). At the same time, the identified classification rules (Table 3) provide an adequate number of observations within the groups, sufficient to ensure the reliability of the conclusions.

CART classification groups sorted by expected length of post-accident absenteeism can be generally, in a very simplified manner, described as follows:

The highest accident-related absence (mining): The highest absence (over 75 days) was primarily identified in professions and economic activities (NACE) related to various types of mining (both underground and open pit), especially hard coal, but also, less frequently, metal ore mining and, slightly more often, oil and natural gas extraction. Moreover, this group also includes some economic activities, with a slightly shorter absence duration, related to the construction of specific civil engineering structures, services supporting land transportation, the distribution of electricity, the production and supply of hot water, and very few forestry-related services. Occupation: The professions performed in this group were clearly related to economic activities, and therefore included all types of mine workers (unskilled and skilled, supervisors, engineers, mechanics and operators of mining, drilling and loading equipment and machines), as well as professions used in mining, such as ironworkers (metalworkers), various steel and railway structure assemblers, and construction workers like bricklayers, assistants, and painters. This group also included, but less frequently and with shorter absences, lumberjacks/tree fellers and carpenters. Place of accident: As expected, places of accidents were mainly underground mines and quarries, open-pit mines, public transport areas, places of production, factories, workshops, and construction sites. Process: The most common working process in this group, by far, was mining and earthworks. Next in order was the factor that occurred in every group: movement, including means of transport. The third encompassed all manual works like setting up, installing, mounting, dismantling, taking apart, etc. Lastly, production and processing accounted for the remaining portion, bringing the total to over 80% of the working processes overall.
Material agent: The most common material agents of physical activity were underground facilities, tunnels, walkways, movable structures, and surfaces above ground level. What makes this group particularly unique is the significantly lower frequency of the most common material agent, “Floors and other fixed horizontal surfaces at ground level”, which appears very often in every other group. In contrast, various objects commonly used in mining, such as manually transported or moved loads, portable or mobile machines for extracting materials or earthworks, construction materials, debris and other fragmented materials, rail vehicles, stationary machines for extracting materials or earthworks, mechanized hand tools for drilling, screwing, bolting, and stationary conveyors, continuous motion transport devices and systems and their equipment, were much more common. Age: This group also included older workers, as the mean age was 49 years (almost 8 years more than in other groups) with a standard deviation of 8.6. It is worth noting that when the accident was in any way related to mining, it led to a significantly longer absence.
High accident absence (mobile machines): A long-lasting, but not the highest, absence (less than 75 but over 50 days) was specific to (NACE) road transport, selected construction and industrial processing activities, forestry, and animal breeding (especially freight and passenger transport by road; construction of roads, motorways, buildings, civil and water projects, installations and power lines; collection, treatment, and supply of water; forestry activities; poultry breeding; private security activities; industrial cleaning activities; installation of industrial machinery and equipment; production of specific products like meat, metal, and specific metal products (structures, parts, and machines), lead, zinc, and tin, builders’ carpentry and concrete, milk and cheese, wood and sawmill, paper, chemical products, bread, pastry, and processing fruits and vegetables). Material agent: What is also characteristic of this length of absence is that the most common material agents of activities were heavy vehicles, such as trucks, buses, and coaches (less frequently rail and other vehicles); stationary or mobile cranes and lifting devices; stationary continuous or vertical motion transport devices (conveyor belts, escalators, elevators, buckets, hoists), their loads, transport equipment, and accessories; machines and devices for forming by pressing, crushing, and rolling; sawing; preparing materials by grinding, separating, and mixing; packing; and construction machinery (portable or mobile). Age: The age of injured persons in this group was much higher than the average age (almost 50 and over).
Medium accident absence (retail and production sites): This was still relatively high but slightly below the average absence (about 30–49 days) and was specific to a much wider range of economic activities (NACE), which can be divided into several main sectors: manufacturing (includes the production of various goods, e.g., furniture, plastic, rubber, metal products; specialized products such as agricultural machinery, motor vehicles, electronics, doors, and windows; and food manufacturing like beverage and seafood); wholesale trade, retail sale, and services (repair and maintenance of vehicles and machinery); transport and logistics (postal services, passenger rail transport); healthcare and social care (health services, nursing home care). Place of accident: What is specific to this group is that the most common locations of accidents were industrial production sites (factories, workshops, and, less frequently, storage areas, as well as maintenance and repair areas). Occupation: The occupation of injured persons in this group was also very diversified and included various types of manufacturing machine operators and mechanics, metalworking labourers (welders, lathe/milling operators), carpenters, car drivers and forklift operators, municipal/city guards, waste loaders, various machine and equipment assemblers, window fitters, and unskilled labourers in simple jobs such as packers, cleaners, and household workers. Material agent: Apart from “solid horizontal surfaces at ground level (floors)”, the most common material agents were manually moved loads and other materials, objects, products, packaging, and machines and vehicle parts.
Frequently occurring factors were also light vehicles for transporting goods or passengers; large and small construction materials (girders, beams, bricks, tiles, etc.); various types of stationary machines and devices (mostly for cutting, milling, sawing, forming, joining); mechanized hand tools (for drilling, screwing, bolting); and warehouse accessories (racks, shelves, pallets). Age: In this group, the age was lower than the average age (31 years). However, if the age was higher than the average but other group selection criteria were still met, the post-accident absence was higher than the average.
Low-duration absence (health care/food processing): The groups with the lowest duration of absence were primarily related to (NACE) health care (hospital activities, general and specialised medical practice). In cases where the accident location was a healthcare facility, the primary material agents (material agent) involved were sharp and cutting instruments used in surgery and medical procedures (needlestick injuries and cuts), with injured employees being medical workers. It is worth noting that post-accident absences tend to be slightly longer when accidents involve the use of non-medical cutting tools (e.g., scissors) or kitchen and household tools. Such accidents are more common in restaurants, food production (e.g., meat processing), and retail rather than in medical facilities. When the injured person in such accidents is a food processing worker (e.g., butchers, fish processing workers, and related jobs) or a worker performing elementary tasks in the industry, such as a housekeeper or an office cleaner, the duration of absence is a little longer. Similarly, if the injured person is a shop assistant (especially in food sales), a cook, or a kitchen helper, the absence duration is extended even more (this group also includes warehouse workers with cut injuries and postmen bitten by pets).

The strength of the relationship between groups identified by the CART classification and the number of days lost was examined by measuring the effect size. The 264 groups identified in the CART analysis achieved Eta squared = 0.101 (Cohen’s d = 0.7), which is the highest result among all analysed predictors (for comparison, the results of the remaining predictors are presented in Table 1 and Table 2). This indicates that the identified groups are highly differentiated in terms of the length of post-accident incapacity, thus demonstrating the good predictive capabilities of the developed classification.
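As an illustration of the effect size measure, the sketch below computes Eta squared as the ratio of the between-group to the total sum of squares. The toy groups and function names are hypothetical. The conversion to Cohen’s d uses a common two-group approximation, d = 2·sqrt(η²/(1 − η²)); whether this exact conversion was used in the study is an assumption, although it is consistent with the reported pair of values.

```python
from math import sqrt
from statistics import mean


def eta_squared(groups):
    """Eta squared = SS_between / SS_total for a dict {group: [days lost, ...]}."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(len(values) * (mean(values) - grand_mean) ** 2
                     for values in groups.values())
    return ss_between / ss_total


def eta_squared_to_d(eta2):
    """Approximate Cohen's d from Eta squared (assumed conversion, see lead-in)."""
    return 2 * sqrt(eta2 / (1 - eta2))


# Hypothetical toy groups of days lost.
toy_groups = {"A": [10, 20], "B": [40, 50]}
print(eta_squared(toy_groups))
print(round(eta_squared_to_d(0.101), 1))
```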

Based on the developed classification, employees can be ranked according to the potential severity of accidents, ranging from groups most exposed to the risk of long-term post-accident absenteeism to groups in which accidents typically result in short-term incapacity for work, or even no absence at all (Figure 4). Nevertheless, due to the stochastic nature of accidents, the model has inherent limitations in its predictive capabilities, as accidents with a wide spectrum of consequences—from minor incidents leading to only a few days of absence to severe cases resulting in the maximum possible duration of incapacity—may occur occasionally in almost any group, though with different frequencies (as illustrated in Figure 4). Consequently, predictions of absence severity with one-day precision are not sufficiently reliable, and it is preferable to lower the temporal granularity of predictions in favour of improved overall accuracy (see Section 4.3).

The model, however, provides meaningful insights into group-level risk differentiation. Figure 4 shows that several groups are dominated by accidents leading to short-term absences, typically not exceeding two to three weeks, although even in these groups, isolated cases of long-term absences may occur. Most identified groups exhibit a medium duration of post-accident absence, usually ranging from about 2 weeks to 1.5–2 months. In addition, the model also identified groups of employees in which accidents frequently result in long-term absences lasting from a minimum of 1.5 months up to the maximum observed duration (6 months). However, even within these high-risk groups, occasional short-term absences are also observed, albeit much less frequently.

4.3. Evaluation Procedure—Prediction Reliability Metrics

To ensure the robustness of the results and to verify the temporal variation of the data, an expanding-window time-series validation was applied. Specifically, the dataset was divided into two splits: a training part (years 1–2) and an independent test part (year 3). The model was fitted exclusively on the training data and subsequently evaluated on the test set to avoid information leakage.
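The temporal split described above can be sketched as follows. The record layout (`year`, `group`, `days_lost`), the toy values, and the use of group medians as predictions are assumptions made for illustration; the key point is that the group-level predictions are derived from the training years only and then compared with the held-out year.

```python
from statistics import median


def temporal_split(records, train_years, test_years):
    """Split records by year so that the test year is never seen in training."""
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] in test_years]
    return train, test


def fit_group_medians(train):
    """Median days lost per group, estimated on the training part only."""
    by_group = {}
    for r in train:
        by_group.setdefault(r["group"], []).append(r["days_lost"])
    return {g: median(values) for g, values in by_group.items()}


# Hypothetical records: years 1-2 for training, year 3 for testing.
records = [
    {"year": 1, "group": "A", "days_lost": 10},
    {"year": 1, "group": "B", "days_lost": 60},
    {"year": 2, "group": "A", "days_lost": 14},
    {"year": 2, "group": "B", "days_lost": 80},
    {"year": 3, "group": "A", "days_lost": 12},
    {"year": 3, "group": "B", "days_lost": 75},
]
train, test = temporal_split(records, {1, 2}, {3})
medians = fit_group_medians(train)
abs_errors = [abs(medians[r["group"]] - r["days_lost"]) for r in test]
print(abs_errors)
```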

Because the stochastic nature of accidents limits the model, it was necessary to lower the precision of the predictions from one day to a longer period (a range from x to y days). Therefore, three different levels of prediction granularity were compared in terms of reliability:

Exact values—numeric predictions with precision to one day of absence (on the basis of median, mean, and 5% trimmed mean);
Three-class categorization—up to 1 month, from 32 to 45 days, more than 45 days;
Two-class categorization—up to 1 month, more than 1 month.
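The granularity levels listed above can be sketched as simple mappings from the number of days lost. Two assumptions are made here: “up to 1 month” is read as at most 31 days (inferred from the middle class starting at 32 days), and the 5% trimmed mean drops 5% of observations from each tail using integer flooring, which may differ from the exact procedure of the statistical package used in the study.

```python
from statistics import mean


def three_class(days_lost):
    """Three-class categorization of post-accident absence."""
    if days_lost <= 31:  # assumed boundary for "up to 1 month"
        return "up to 1 month"
    if days_lost <= 45:
        return "from 32 to 45 days"
    return "more than 45 days"


def two_class(days_lost):
    """Two-class categorization of post-accident absence."""
    return "up to 1 month" if days_lost <= 31 else "more than 1 month"


def trimmed_mean(values, percent=5):
    """Mean after dropping `percent` % of observations from each tail."""
    k = len(values) * percent // 100
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


print(three_class(40), "|", two_class(40))
print(trimmed_mean(list(range(1, 21))))
```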

To account for different levels of prediction granularity, two complementary evaluation approaches were applied:

(i): Exact value predictions: Here, the model was treated as a regression model. Predictive performance was assessed using standard regression metrics: MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and R² (Coefficient of Determination).
(ii): Categorical predictions: In addition, the numeric range of the target variable was discretized into three and two categories, and in these scenarios, performance was measured using accuracy, precision, recall, F1-score, and G-mean [4,46,47,48]. Precision, recall, and F1-score were computed per class and then macro-averaged to ensure equal treatment of all classes. G-mean was calculated as the geometric mean of class-wise recalls.
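The regression metrics named in approach (i) follow their standard definitions and can be sketched as below. The function name and the toy values are hypothetical and do not reproduce any result of the study.

```python
from math import sqrt
from statistics import mean


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R² for observed and predicted days lost."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = mean([abs(e) for e in errors])
    mse = mean([e * e for e in errors])
    rmse = sqrt(mse)
    true_mean = mean(y_true)
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - true_mean) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical observed and predicted days lost.
metrics = regression_metrics([10, 20, 30], [12, 18, 33])
print(metrics)
```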

This dual evaluation scheme allowed me to verify the reliability of the model both for exact numeric predictions and for category-based interpretations.

As four different solutions were analysed, four confusion matrices (Table 4), with multi-class prediction, were used to calculate all of the performance metrics.

Accuracy—defined as the ratio of the correctly predicted observations to the total number of observations. It is expressed by the following equation:

$Accuracy = \frac{\sum_{i = A}^{N} E_{i i}}{\sum_{i = A}^{N} \sum_{j = A}^{N} E_{i j}}$

(3)
Precision—defined as the ratio of correctly predicted observations under a particular class to the total number of observations predicted as belonging to that class. It is expressed by the following equation:

$Precision = \frac{E_{A A}}{\sum_{i = A}^{N} E_{i A}} \quad \text{(for class } A\text{)}$

(4)
Recall—defined as the ratio of correctly predicted observations under a particular class to all actual observations in that class. It is expressed by the following equation:

$Recall = \frac{E_{A A}}{\sum_{j = A}^{N} E_{A j}} \quad \text{(for class } A\text{)}$

(5)
F1-score—the harmonic mean of precision and recall. It is a special case of the F-measure, in which the parameter β denotes the relative importance of recall versus precision; in the conducted analysis, β was taken as 1. It is expressed by the following equations:

$F\text{-}measure = \frac{(1 + \beta^{2}) \times Precision \times Recall}{\beta^{2} \times Precision + Recall}$

(6)

$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad \text{(where } \beta = 1\text{)}$

(7)
G-mean—defined as the geometric mean of precision and recall. It is expressed by the following equation:

$G\text{-}mean = \sqrt{Precision \times Recall}$

(8)
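Equations (3) to (8) can be applied to a confusion matrix as sketched below. The matrix layout (rows as actual classes, columns as predicted classes) follows the indices used in Equations (4) and (5). The function name and the toy matrix are hypothetical, and G-mean is computed as written in Equation (8).

```python
from math import sqrt


def classification_metrics(E):
    """Accuracy, per-class metrics, and macro averages from a confusion matrix.

    E[i][j] is the number of observations of actual class i predicted as class j.
    """
    n = len(E)
    total = sum(sum(row) for row in E)
    accuracy = sum(E[i][i] for i in range(n)) / total
    per_class = []
    for a in range(n):
        predicted_a = sum(E[i][a] for i in range(n))  # column sum, Eq. (4)
        actual_a = sum(E[a])                          # row sum, Eq. (5)
        precision = E[a][a] / predicted_a if predicted_a else 0.0
        recall = E[a][a] / actual_a if actual_a else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)         # Eq. (7)
        g_mean = sqrt(precision * recall)             # Eq. (8)
        per_class.append({"precision": precision, "recall": recall,
                          "f1": f1, "g_mean": g_mean})
    # Macro averaging: unweighted mean over classes.
    macro = {key: sum(c[key] for c in per_class) / n
             for key in ("precision", "recall", "f1", "g_mean")}
    return accuracy, per_class, macro


# Hypothetical two-class confusion matrix (short vs. long absence).
toy_matrix = [[50, 10],
              [20, 20]]
accuracy, per_class, macro = classification_metrics(toy_matrix)
print(accuracy, macro)
```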

The results indicate that the model’s predictive performance does not differ substantially between the training and test sets, both for exact value predictions and for categorical predictions (Table 5). This consistency suggests that the findings are robust over time and that the model captures patterns that are relatively stable, despite the stochastic nature of workplace accidents.

Exact value predictions showed moderate performance, with the goodness-of-fit indicator (R² = 0.1) and the Mean Absolute Error (MAE = 31.9) in the final model indicating a limited ability to predict the severity of accidents with exact numeric precision. This outcome was expected due to the inherent randomness of accidents, which makes precise day-level predictions challenging.

Categorical predictions, on the other hand, performed substantially better. In the final model, both two- and three-class categorizations achieved nearly 53% accuracy, over 62% precision, and nearly 60% recall. Further analysis revealed that the model is particularly effective in predicting short-term absences: for the “up to 1 month” category, prediction accuracy exceeded 80%.

This tendency is likely related to the nature of the phenomenon being predicted. Notably, the analysis (Figure 4) showed that severe accidents leading to long-term absences are concentrated in specific high-risk groups, where exposure to hazards capable of causing long-term absence is higher. In contrast, minor accidents resulting in short-term absences occur across all identified work conditions and employee groups, albeit at varying frequencies, regardless of the presence of workplace hazards. Furthermore, groups of workers with minimal risk of long-term accident-related absences can be identified due to the absence of hazards with the potential to cause such incidents. Consequently, the developed model accurately identifies both employee groups mainly at risk of short-term accident-related absences, in which long-term absences occur only sporadically, and groups prone to frequent long-term absences, in which accidents causing short-term absences will nevertheless also occur occasionally.

Overall, the obtained results indicate that the model effectively distinguishes groups prone to short-term and long-term accident absence, providing practical insights for targeted workplace safety interventions. However, due to the stochastic nature of accidents, occasional mispredictions are inevitable, particularly for accidents causing long-term absences. The analysis of model-based predictions showed that predictions should be expressed in broader ranges, preferably up to one month versus more than one month (optionally with an intermediate class extending to about six weeks), rather than with single-day precision (Figure 5).

5. Discussion

The findings of this study suggest that the developed statistical model enables accurate prediction of accident severity within employee groups, based on the observed patterns of previous incidents. These predictions rely on the distribution of accident severity within groups identified using the developed statistical model. This conclusion carries both theoretical and practical significance, as previously noted by other researchers [4].

5.1. Theoretical Implications

From a theoretical perspective, the model’s predictive performance suggests that workplace injuries are not entirely random events, but rather follow identifiable patterns, linked to employee and workplace characteristics, that can be revealed through statistical analysis and data-driven approaches [4]. The variation in the frequency of accidents with different severity levels across the identified groups implies that it is possible to determine a set of variables (related to employee and workplace characteristics) that are associated with the duration of work absence resulting from an accident.

5.2. Practical Implications

On the practical side, this study addresses a longstanding objective in occupational safety: the prediction of accident severity. Unlike existing approaches that incorporate some data analysis, such as risk analysis, leading indicators, and precursor analysis, which often rely heavily on expert judgment and are prone to cognitive biases [10], the model utilizes data mining techniques to learn from past incidents and offers a more objective and scalable alternative that outperforms simple descriptive statistics [49]. The model’s performance confirms the feasibility of reliably distinguishing between severity categories, particularly accidents resulting in up to one month of absence versus those exceeding this duration. This binary classification offers a practical framework for risk assessment and preventive planning, distinguishing the most serious accidents from the less severe ones.

Therefore, the developed model can be successfully used to complement expert judgment and support more informed and effective decision-making. For example, safety practitioners can use the model to identify factors associated with more severe accidents or to classify employees according to their likelihood of experiencing serious incidents. The model provides a solid platform for such analyses.

Based on the detailed results provided in the Supplementary Materials, safety practitioners can reproduce the model and apply its findings in practice. For instance, it could be implemented as a predictive tool for assessing accident severity in different worker groups. It is possible to develop a practical tool that facilitates the use of the model’s outputs—for example, by allowing practitioners to input specific worker characteristics and job conditions to obtain the predicted severity of a potential accident.

Such information can be effectively used to support workplace risk assessment, prioritize preventive measures, differentiate employees according to their risk of long-term work incapacity, and identify the most vulnerable groups. This enables more targeted proactive prevention strategies and more effective allocation of safety resources.

5.3. Model Assumptions and Comparisons

A noteworthy advantage of this study lies in the unique structure of the Polish accident reporting system, which records all incidents, even those in which the injured employee lost no workdays. This contrasts with the Eurostat standard, which only includes cases with four or more days lost [6]. Nearly 5% of recorded accidents at work in Poland fall below that threshold, which significantly enriches the dataset for severity modelling.

To enhance prediction reliability and generate practical insights, the analysis incorporated best practices outlined in Section 2 [4,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]. This includes robust data preparation, the use of a large and diverse dataset, and the integration of multiple analytical techniques, with a particular emphasis on decision tree algorithms.

The large dataset, while beneficial, introduced high variability, leading to the identification of excessively dispersed groups of injured employees. To resolve this, a two-step process of data preparation was applied: (1) selecting the most effective severity predictors using effect size measures and (2) reducing variable ranges using the CHAID method. This data preparation step enabled the use of these predictors as explanatory variables in the final CART decision tree model.

The resulting classification is more cohesive and better distinguishes between levels of accident severity. The model demonstrates moderate predictive performance, with an overall accuracy of 53% and an average precision of 62% (80% precision for predicting incidents resulting in up to one month of absence). While some studies achieved higher accuracy in predicting aviation incidents (e.g., Gürbüz et al. [21]: 99%, Choi et al. [31]: 92%, Christopher et al. [20]: 87%), these focused on event occurrence, rather than severity. In the specific domain of accident severity prediction, fewer models have reached comparable accuracy. Luo et al. [26] reported 82% accuracy, but their study was limited to construction. Kumar et al. [27] achieved 54–58% accuracy in predicting non-fatal injuries. Khairuddin et al. [32] achieved 90–95% accuracy, but only for hospitalization and amputation outcomes. One of the most reliable severity models was developed by Sarkar et al. [4], with recall and F1-scores between 85 and 90%, though their study was limited to a steel plant and included proactive data sources. In comparison, the model developed in this study achieved 59% recall (88% for predicting absences of over one month) and a 61% F1-score. Importantly, these predictions apply across all economic sectors, suggesting good generalizability and reliability. Compared to other studies, the developed classification model can be considered a reliable tool for predicting the severity of accidents that may occur among diverse groups of employees and in various work environments.

5.4. Model Limitations

The model has some limitations in its predictive capabilities that are very hard to avoid due to the stochastic nature of accidents. It is practically impossible to define worker groups that are both sufficiently large and completely free from short- or long-term post-accident absenteeism, as accidents with varying outcomes will inevitably occur. Even in high-risk environments, minor incidents such as slips and trips are common, while in low-risk settings, severe accidents may occasionally result from rare, unfortunate events. As a result, all identified groups include incidents with the full range of severity; the key difference lies in the distribution of these outcomes, which is an aspect that the model successfully captures (Figure 4). To increase reliability, however, predictions are deliberately broad in range. This study presents and compares four approaches to managing this trade-off, each offering a different level of predictive granularity.

5.5. Future Work

Future research on the presented issue could focus on expanding the model by enlarging the dataset (more observations and more years) and exploring new methods of analysis. It is strongly recommended to supplement the model’s predictions with accident occurrence probabilities. Additionally, more effective methods for communicating prediction outcomes should be developed.

CART and CHAID decision tree algorithms were selected in this study due to their interpretability, high suitability for occupational accident datasets, including multinomial variables, and their proven effectiveness in previous studies, as highlighted in Section 2. Nonetheless, future research could explore ensemble methods such as bagging or boosting, which combine multiple weak learners to produce a stronger model. Bagging (Bootstrap Aggregating) can reduce variance by training multiple trees on different bootstrap samples and averaging their predictions, improving robustness against overfitting. Boosting sequentially trains trees to correct the errors of previous trees, often leading to higher predictive accuracy, particularly for imbalanced or complex datasets. Incorporating such ensemble methods could therefore provide more reliable predictions of post-accident absence.

6. Conclusions

This article proposes a new statistical method for analysing existing data on workplace accidents. Its aim was to develop a reliable model capable of predicting the severity of accidents occurring within specific worker groups. The model can serve as a valuable tool supporting proactive preventive measures, complementing expert opinion, and supporting more informed and effective decision-making.

The findings of the analysis have both theoretical and practical relevance. Theoretically, the model indicates that workplace injuries follow identifiable patterns that can be revealed through statistical analysis rather than occurring randomly. Practically, it contributes to the long-standing goal in occupational safety of predicting and evaluating accident severity.

Particularly noteworthy is the model’s ability to distinguish between two severity classes: accidents resulting in absences of up to one month and those exceeding that threshold. This classification may be particularly valuable for workplace risk management, as it allows for the categorization of worker groups into higher- and lower-risk classes, thus offering a strategic tool for safety planning and resource allocation.

This study shows that accident severity can be predicted using statistical models based on employee characteristics and working environment data. Nevertheless, predictions should be used cautiously and never treated as definitive: management decisions should rest on an accurate assessment of the specific, individual situation, supplemented with expert judgment and other contextual safety information.

In conclusion, the proposed model achieves predictive performance sufficient to support practical applications, which demonstrates that applied statistical methods are well suited to analysing data on accidents at work. Given their demonstrated effectiveness, such methods should be more widely adopted in both occupational safety research and practice.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app151910666/s1. Detailed results of the classification tree analyses conducted in this study: (1) CHAID analysis results for recoded predictor values, and (2) CART model classification rules.

Funding

This research and the APC were funded by the Ministry of Family, Labour and Social Policy under the 6th stage of the National Programme “Governmental Programme for Improvement of Safety and Working Conditions”, task no. 6.ZS.09, entitled: Information and educational platform “Occupational Health and Safety Management”. Programme coordinator: Central Institute for Labour Protection—National Research Institute.

Institutional Review Board Statement

This study used de-identified data recorded within the public statistics program. Therefore, ethical review and approval by an Institutional Review Board were not required.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from GUS (Central Statistical Office of Poland) and are available at https://stat.gov.pl/en/questions-and-orders/how-to-order-data/ (accessed on 26 September 2025) with the permission of GUS.

Conflicts of Interest

The author declares no conflict of interest.

Figure 1. Methodological diagram of analysis.

Figure 2. The average number of days lost, by cluster number of accident severity predictors, identified by using the CHAID classification tree method.

Figure 3. Importance of attributes in CART model in accident severity prediction.

Figure 4. Number of days lost in groups identified by CART classification, based on six previously transformed variables (by CHAID). Yellow circles—extreme values; red stars—mild outliers; black lines—whiskers; beige—interquartile range (IQR); green—median.

Figure 5. Prediction reliability metric results.

Table 1. The most effective predictors of accident severity—identified in one-dimensional analysis.

Variable	Conversion to Cohen’s d
Occupation classification	0.4825
Material agent of physical activity	0.4442
NACE	0.4291
Place of accident	0.3577
Working process	0.3203
Age in years	0.3201
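The column "Conversion to Cohen’s d" expresses each effect size on the d scale. The exact conversion used is not shown in this part of the article, so the sketch below assumes the widely used relation d = 2·sqrt(η² / (1 − η²)) between eta squared and Cohen’s d. The function names are hypothetical.

```python
import math


def eta_squared_to_d(eta_sq):
    """Convert eta squared to Cohen's d, assuming d = 2*sqrt(eta^2 / (1 - eta^2))."""
    if not 0 <= eta_sq < 1:
        raise ValueError("eta squared must lie in [0, 1)")
    return 2 * math.sqrt(eta_sq / (1 - eta_sq))


def d_to_eta_squared(d):
    """Inverse conversion: eta^2 = d^2 / (d^2 + 4)."""
    return d ** 2 / (d ** 2 + 4)


# An eta squared of 0.1 corresponds to d of about 0.667 under this assumption.
print(round(eta_squared_to_d(0.1), 3))
# The largest d in Table 1 (0.4825) corresponds to an eta squared of about 0.055.
print(round(d_to_eta_squared(0.4825), 3))
```

Under this assumed formula, an eta squared of 0.1 maps to d ≈ 0.67, which agrees with the value reported for the final model in Table 5.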

Table 2. Predictors of the severity of accidents at work, before and after segmentation using the CHAID classification tree method.

Variable	Number of Values Before Recoding	Number of Values After Recoding	Conversion to Cohen’s d Before	Conversion to Cohen’s d After
Occupation classification (O)	2328	24	0.4825	0.3809
Material agent of physical activity (M)	154	22	0.4442	0.4291
NACE (N)	598	22	0.4291	0.4135
Place of accident (P)	49	12	0.3577	0.3577
Working process (W)	30	10	0.3203	0.3203
Age in years (A)	72	10	0.3201	0.3136

Table 3. A sample set of 15 SQL rules generated by CART, identifying groups of injured persons with different lengths of accident-related absence.

Rule	Freq.	Est. No. of Days Lost
WHERE (N = 19 OR N = 24 OR N = 13 OR N = 15 OR N = 21 OR N = 20 OR N = 4 OR N = 12 OR N = 11 OR N = 22) AND (N = 19 OR N = 24 OR N = 13) AND ((P IS NULL) OR P <> 6 AND P <> 3 AND P <> 9 AND P <> 2 AND P <> 8 AND P <> 12 AND P <> 1 AND P <> 7 AND P <> 5 AND P <> 10) AND ((A IS NULL) OR A <> 3 AND A <> 2 AND A <> 1) AND (A = 7 OR A = 9 OR A = 10 OR A = 8 OR A = 6) AND (M = 11 OR M = 13 OR M = 12 OR M = 16 OR M = 1 OR M = 14 OR M = 19 OR M = 8) AND ((O IS NULL) OR O <> 4 AND O <> 10 AND O <> 12 AND O <> 13 AND O <> 17)	243	113
WHERE (N = 19 OR N = 24 OR N = 13 OR N = 15 OR N = 21 OR N = 20 OR N = 4 OR N = 12 OR N = 11 OR N = 22) AND (N = 19 OR N = 24 OR N = 13) AND (P = 6 OR P = 3 OR P = 9 OR P = 2 OR P = 8 OR P = 12 OR P = 1 OR P = 7 OR P = 5 OR P = 10) AND ((A IS NULL) OR A <> 3 AND A <> 4 AND A <> 2 AND A <> 5 AND A <> 1) AND ((M IS NULL) OR M <> 11 AND M <> 3 AND M <> 13 AND M <> 17 AND M <> 16 AND M <> 5 AND M <> 18 AND M <> 7) AND ((O IS NULL) OR O <> 14 AND O <> 16 AND O <> 1 AND O <> 15 AND O <> 5 AND O <> 7 AND O <> 3 AND O <> 17 AND O <> 19) AND ((M IS NULL) OR M <> 10 AND M <> 4 AND M <> 2 AND M <> 20 AND M <> 8) AND ((A IS NULL) OR A <> 6) AND ((O IS NULL) OR O <> 4 AND O <> 12 AND O <> 2 AND O <> 23)	122	111
WHERE (N = 19 OR N = 24 OR N = 13 OR N = 15 OR N = 21 OR N = 20 OR N = 4 OR N = 12 OR N = 11 OR N = 22) AND (N = 19 OR N = 24 OR N = 13) AND ((P IS NULL) OR P <> 6 AND P <> 3 AND P <> 9 AND P <> 2 AND P <> 8 AND P <> 12 AND P <> 1 AND P <> 7 AND P <> 5 AND P <> 10) AND ((A IS NULL) OR A <> 3 AND A <> 2 AND A <> 1) AND ((A IS NULL) OR A <> 7 AND A <> 9 AND A <> 10 AND A <> 8 AND A <> 6) AND ((M IS NULL) OR M <> 11 AND M <> 3 AND M <> 6 AND M <> 4 AND M <> 7 AND M <> 14 AND M <> 20) AND ((O IS NULL) OR O <> 6 AND O <> 16 AND O <> 1 AND O <> 15 AND O <> 5 AND O <> 2) AND ((M IS NULL) OR M <> 10 AND M <> 13 AND M <> 17 AND M <> 1 AND M <> 8) AND (W = 2 OR W = 4)	106	97
WHERE (N = 19 OR N = 24 OR N = 13 OR N = 15 OR N = 21 OR N = 20 OR N = 4 OR N = 12 OR N = 11 OR N = 22) AND (N = 19 OR N = 24 OR N = 13) AND ((P IS NULL) OR P <> 6 AND P <> 3 AND P <> 9 AND P <> 2 AND P <> 8 AND P <> 12 AND P <> 1 AND P <> 7 AND P <> 5 AND P <> 10) AND ((A IS NULL) OR A <> 3 AND A <> 2 AND A <> 1) AND (A = 7 OR A = 9 OR A = 10 OR A = 8 OR A = 6) AND ((M IS NULL) OR M <> 11 AND M <> 13 AND M <> 12 AND M <> 16 AND M <> 1 AND M <> 14 AND M <> 19 AND M <> 8) AND ((M IS NULL) OR M <> 10 AND M <> 9 AND M <> 17 AND M <> 5 AND M <> 7 AND M <> 2)	791	94
WHERE (N = 19 OR N = 24 OR N = 13 OR N = 15 OR N = 21 OR N = 20 OR N = 4 OR N = 12 OR N = 11 OR N = 22) AND (N = 19 OR N = 24 OR N = 13) AND ((P IS NULL) OR P <> 6 AND P <> 3 AND P <> 9 AND P <> 2 AND P <> 8 AND P <> 12 AND P <> 1 AND P <> 7 AND P <> 5 AND P <> 10) AND ((A IS NULL) OR A <> 3 AND A <> 2 AND A <> 1) AND ((A IS NULL) OR A <> 7 AND A <> 9 AND A <> 10 AND A <> 8 AND A <> 6) AND (M = 11 OR M = 3 OR M = 6 OR M = 4 OR M = 7 OR M = 14 OR M = 20)	732	92
WHERE (N = 19 OR N = 24 OR N = 13 OR N = 15 OR N = 21 OR N = 20 OR N = 4 OR N = 12 OR N = 11 OR N = 22) AND ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13) AND ((A IS NULL) OR A <> 7 AND A <> 9 AND A <> 10 AND A <> 8) AND ((M IS NULL) OR M <> 15 AND M <> 21 AND M <> 6 AND M <> 4 AND M <> 9 AND M <> 1 AND M <> 14 AND M <> 19 AND M <> 20) AND ((M IS NULL) OR M <> 11 AND M <> 3 AND M <> 22 AND M <> 17 AND M <> 5 AND M <> 7 AND M <> 8) AND (A = 6 OR A = 5) AND ((P IS NULL) OR P <> 6 AND P <> 8 AND P <> 12 AND P <> 11 AND P <> 10) AND (O = 6 OR O = 9 OR O = 8 OR O = 18 OR O = 15 OR O = 5 OR O = 2 OR O = 7 OR O = 3 OR O = 13 OR O = 11 OR O = 22 OR O = 20) AND ((W IS NULL) OR W <> 8 AND W <> 1)	2091	42.4
WHERE ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13 AND N <> 15 AND N <> 21 AND N <> 20 AND N <> 4 AND N <> 12 AND N <> 11 AND N <> 22) AND ((M IS NULL) OR M <> 10 AND M <> 11 AND M <> 3 AND M <> 15 AND M <> 22 AND M <> 17 AND M <> 5 AND M <> 7 AND M <> 8) AND ((A IS NULL) OR A <> 3 AND A <> 4 AND A <> 2 AND A <> 5 AND A <> 1) AND (O = 14 OR O = 8 OR O = 24 OR O = 21 OR O = 15 OR O = 2 OR O = 7 OR O = 3 OR O = 11 OR O = 20) AND ((N IS NULL) OR N <> 3 AND N <> 2 AND N <> 1 AND N <> 25 AND N <> 16 AND N <> 17 AND N <> 14) AND ((A IS NULL) OR A <> 9 AND A <> 10) AND ((P IS NULL) OR P <> 4) AND (M = 13 OR M = 12 OR M = 16 OR M = 18) AND ((P IS NULL) OR P <> 3 AND P <> 2 AND P <> 7 AND P <> 5)	1055	42.3
WHERE (N = 19 OR N = 24 OR N = 13 OR N = 15 OR N = 21 OR N = 20 OR N = 4 OR N = 12 OR N = 11 OR N = 22) AND ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13) AND ((A IS NULL) OR A <> 7 AND A <> 9 AND A <> 10 AND A <> 8) AND ((M IS NULL) OR M <> 15 AND M <> 21 AND M <> 6 AND M <> 4 AND M <> 9 AND M <> 1 AND M <> 14 AND M <> 19 AND M <> 20) AND ((M IS NULL) OR M <> 11 AND M <> 3 AND M <> 22 AND M <> 17 AND M <> 5 AND M <> 7 AND M <> 8) AND ((A IS NULL) OR A <> 6 AND A <> 5) AND ((O IS NULL) OR O <> 14 AND O <> 8 AND O <> 24 AND O <> 15 AND O <> 5 AND O <> 2 AND O <> 7 AND O <> 3 AND O <> 13 AND O <> 11 AND O <> 19 AND O <> 20 AND O <> 23) AND ((A IS NULL) OR A <> 2 AND A <> 1) AND ((P IS NULL) OR P <> 4 AND P <> 11)	4210	42.2
WHERE ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13 AND N <> 15 AND N <> 21 AND N <> 20 AND N <> 4 AND N <> 12 AND N <> 11 AND N <> 22) AND (M = 10 OR M = 11 OR M = 3 OR M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND ((M IS NULL) OR M <> 15 AND M <> 22) AND (O = 6 OR O = 4 OR O = 9 OR O = 1 OR O = 10 OR O = 12 OR O = 5 OR O = 13 OR O = 17 OR O = 22 OR O = 23) AND (A = 7 OR A = 9 OR A = 10 OR A = 8) AND ((N IS NULL) OR N <> 5 AND N <> 3 AND N <> 2 AND N <> 25 AND N <> 23 AND N <> 17 AND N <> 18 AND N <> 14) AND (M = 5 OR M = 8)	397	41.8
WHERE ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13 AND N <> 15 AND N <> 21 AND N <> 20 AND N <> 4 AND N <> 12 AND N <> 11 AND N <> 22) AND ((M IS NULL) OR M <> 10 AND M <> 11 AND M <> 3 AND M <> 15 AND M <> 22 AND M <> 17 AND M <> 5 AND M <> 7 AND M <> 8) AND ((A IS NULL) OR A <> 3 AND A <> 4 AND A <> 2 AND A <> 5 AND A <> 1) AND (O = 14 OR O = 8 OR O = 24 OR O = 21 OR O = 15 OR O = 2 OR O = 7 OR O = 3 OR O = 11 OR O = 20) AND (N = 3 OR N = 2 OR N = 1 OR N = 25 OR N = 16 OR N = 17 OR N = 14) AND ((A IS NULL) OR A <> 7 AND A <> 6) AND ((M IS NULL) OR M <> 14 AND M <> 19 AND M <> 20) AND (M = 13 OR M = 21 OR M = 12 OR M = 4) AND ((P IS NULL) OR P <> 12 AND P <> 1 AND P <> 4 AND P <> 7 AND P <> 10)	1482	41.1
WHERE ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13 AND N <> 15 AND N <> 21 AND N <> 20 AND N <> 4 AND N <> 12 AND N <> 11 AND N <> 22) AND (M = 10 OR M = 11 OR M = 3 OR M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND ((M IS NULL) OR M <> 15 AND M <> 22) AND ((O IS NULL) OR O <> 6 AND O <> 4 AND O <> 9 AND O <> 1 AND O <> 10 AND O <> 12 AND O <> 5 AND O <> 13 AND O <> 17 AND O <> 22 AND O <> 23) AND ((A IS NULL) OR A <> 7 AND A <> 9 AND A <> 10 AND A <> 8 AND A <> 5) AND ((N IS NULL) OR N <> 25 AND N <> 23 AND N <> 17 AND N <> 14) AND (O = 16 OR O = 2) AND ((A IS NULL) OR A <> 4 AND A <> 6)	1162	18.5
WHERE ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13 AND N <> 15 AND N <> 21 AND N <> 20 AND N <> 4 AND N <> 12 AND N <> 11 AND N <> 22) AND (M = 10 OR M = 11 OR M = 3 OR M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND ((M IS NULL) OR M <> 15 AND M <> 22) AND ((O IS NULL) OR O <> 6 AND O <> 4 AND O <> 9 AND O <> 1 AND O <> 10 AND O <> 12 AND O <> 5 AND O <> 13 AND O <> 17 AND O <> 22 AND O <> 23) AND ((A IS NULL) OR A <> 7 AND A <> 9 AND A <> 10 AND A <> 8 AND A <> 5) AND ((N IS NULL) OR N <> 25 AND N <> 23 AND N <> 17 AND N <> 14) AND ((O IS NULL) OR O <> 16 AND O <> 2) AND ((P IS NULL) OR P <> 2 AND P <> 1 AND P <> 4 AND P <> 7 AND P <> 10)	1827	17.9
WHERE ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13 AND N <> 15 AND N <> 21 AND N <> 20 AND N <> 4 AND N <> 12 AND N <> 11 AND N <> 22) AND (M = 10 OR M = 11 OR M = 3 OR M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22) AND ((W IS NULL) OR W <> 5 AND W <> 2 AND W <> 4) AND (M = 22)	1158	15.2
WHERE ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13 AND N <> 15 AND N <> 21 AND N <> 20 AND N <> 4 AND N <> 12 AND N <> 11 AND N <> 22) AND (M = 10 OR M = 11 OR M = 3 OR M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22) AND ((W IS NULL) OR W <> 5 AND W <> 2 AND W <> 4) AND ((M IS NULL) OR M <> 22) AND (A = 10 OR A = 8 OR A = 4 OR A = 2 OR A = 6)	2120	13.3
WHERE ((N IS NULL) OR N <> 19 AND N <> 24 AND N <> 13 AND N <> 15 AND N <> 21 AND N <> 20 AND N <> 4 AND N <> 12 AND N <> 11 AND N <> 22) AND (M = 10 OR M = 11 OR M = 3 OR M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22 OR M = 17 OR M = 5 OR M = 7 OR M = 8) AND (M = 15 OR M = 22) AND ((W IS NULL) OR W <> 5 AND W <> 2 AND W <> 4) AND ((M IS NULL) OR M <> 22) AND ((A IS NULL) OR A <> 10 AND A <> 8 AND A <> 4 AND A <> 2 AND A <> 6)	2208	11.9

(O)—occupation classification. (M)—material agent of physical activity. (N)—NACE. (P)—place of accident. (W)—working process. (A)—age in years.
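Because each rule in Table 3 accumulates one condition per tree split, later conditions on the same variable often make earlier ones redundant. The sketch below encodes the rule with an estimated 15.2 days lost (frequency 1158) as a list of conditions, checks a record against it, and collapses the repeated conditions. The data representation and the function names are assumptions made for illustration. The NULL handling follows the SQL text, where a missing value satisfies a "<>" branch through the explicit IS NULL test but never satisfies an "=" branch.

```python
# Each condition: (variable, operator, set of recoded category codes).
RULE = [
    ("N", "not_in", {19, 24, 13, 15, 21, 20, 4, 12, 11, 22}),
    ("M", "in", {10, 11, 3, 15, 22, 17, 5, 7, 8}),
    ("M", "in", {15, 22, 17, 5, 7, 8}),
    ("M", "in", {15, 22}),
    ("W", "not_in", {5, 2, 4}),
    ("M", "in", {22}),
]


def matches(record, rule):
    """Return True if the record (dict of variable -> code or None) satisfies all conditions."""
    for var, op, values in rule:
        value = record.get(var)  # a missing key is treated like SQL NULL
        if op == "in":
            ok = value is not None and value in values
        else:  # "not_in": NULL passes, mirroring "(X IS NULL) OR X <> ..."
            ok = value is None or value not in values
        if not ok:
            return False
    return True


def simplify(rule):
    """Collapse repeated conditions: intersect 'in' sets, union 'not_in' sets."""
    allowed, excluded = {}, {}
    for var, op, values in rule:
        if op == "in":
            allowed[var] = allowed[var] & values if var in allowed else set(values)
        else:
            excluded[var] = excluded.get(var, set()) | values
    result = {}
    for var in sorted(set(allowed) | set(excluded)):
        if var in allowed:
            result[var] = ("in", allowed[var] - excluded.get(var, set()))
        else:
            result[var] = ("not_in", excluded[var])
    return result


print(matches({"N": 1, "M": 22, "W": 1}, RULE))
print(simplify(RULE))
```

After simplification, the four conditions on M reduce to the single requirement M = 22, which makes the rule considerably easier to read while selecting the same records.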

Table 4. The confusion matrix of N classes.

Observed \ Predicted	A	B	…	N
A	E_AA	E_AB	…	E_AN
B	E_BA	E_BB	…	E_BN
…	…	…	…	…
N	E_NA	E_NB	…	E_NN
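The classification metrics reported in Table 5 can all be derived from a confusion matrix of this form. The sketch below assumes common definitions, which may differ in detail from those used in the article: macro precision, recall and F1 are unweighted means of the per-class values, and G-mean is the geometric mean of the per-class recalls. The example matrix is invented.

```python
import math


def classification_metrics(matrix):
    """matrix[i][j] = number of cases observed in class i and predicted as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        observed_k = sum(matrix[k])                       # row total
        predicted_k = sum(matrix[i][k] for i in range(n))  # column total
        recall = matrix[k][k] / observed_k if observed_k else 0.0
        precision = matrix[k][k] / predicted_k if predicted_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
        "g_mean": math.prod(recalls) ** (1 / n),
    }


# Invented two-class example: rows are observed classes, columns are predicted classes.
example = [[50, 10],
           [20, 20]]
for name, value in classification_metrics(example).items():
    print(f"{name}: {value:.3f}")
```

Macro averaging gives each class equal weight regardless of its size, which is why it is often preferred over plain accuracy when the severity classes are imbalanced.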

Table 5. Evaluation of precision metrics in model prediction scenarios.

Scenario	Model	Eta squared	Cohen’s d	R	R²	MAE	MSE	RMSE	Accuracy	Precision (macro)	Recall (macro)	F1 (macro)	G-mean
1. Exact values	Training model (1 + 2 year)	0.1	0.68	0.322	0.1	31.9	1956	44.2	2.02%	-	-	-	-
	Test model (year 3)	0.92	0.64	0.288	0.08	32.1	1977	44.5	2%	-	-	-	-
	Final model (3 years)	0.1	0.67	0.318	0.1	31.9	1954	44.2	2.04%	-	-	-	-
2. 3-class categorization	Training model (1 + 2 year)	-	-	-	-	-	-	-	38.0%	44.5%	42.2%	36.0%	54.5%
	Test model (year 3)	-	-	-	-	-	-	-	37.6%	44.0%	41.8%	35.2%	54.2%
	Final model (3 years)	-	-	-	-	-	-	-	38.7%	43.9%	41.9%	35.9%	55.6%
3. 2-class categorization	Training model (1 + 2 year)	-	-	-	-	-	-	-	52.6%	62.5%	59.3%	51.4%	51.4%
	Test model (year 3)	-	-	-	-	-	-	-	51.7%	62.2%	59.2%	50.7%	51.3%
	Final model (3 years)	-	-	-	-	-	-	-	52.6%	62.1%	59.3%	51.6%	52.1%
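For the exact-value scenario, the error metrics in Table 5 follow the usual definitions, which the sketch below assumes: MAE is the mean absolute difference between predicted and observed days lost, MSE is the mean squared difference, and RMSE is the square root of MSE. The observed and predicted values are invented, and the function name is hypothetical.

```python
import math


def regression_errors(observed, predicted):
    """Return MAE, MSE and RMSE for paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


# Invented days-lost values for four accidents.
observed_days = [10, 20, 60, 100]
predicted_days = [15, 15, 50, 80]
print(regression_errors(observed_days, predicted_days))
```

The reported values are consistent with these definitions: for the training model, the square root of the MSE of 1956 is approximately 44.2, matching the RMSE column.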

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
