Predicting Human and Environmental Risk Factors of Accidents in the Energy Sector Using Machine Learning

Benderouach, Kawtar; Bennis, Idriss; Mansouri, Khalifa; Siadat, Ali

doi:10.3390/app16031203

by Kawtar Benderouach ¹,²,*, Idriss Bennis ¹, Khalifa Mansouri ¹ and Ali Siadat ²

¹ Modelling and Simulation of Intelligent Industrial Systems Laboratory (M2S2I), ENSET Mohammedia, Hassan II University, Casablanca 20100, Morocco

² Laboratory of Design, Manufacturing, and Control, Arts et Metiers ParisTech, Metz Campus, 57070 Metz, France

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(3), 1203; https://doi.org/10.3390/app16031203 (registering DOI)

Submission received: 16 December 2025 / Revised: 12 January 2026 / Accepted: 18 January 2026 / Published: 24 January 2026

(This article belongs to the Special Issue AI in Industry 4.0)

Abstract

The aim of this article is to develop a machine learning (ML)-based predictive model for industrial accidents in the energy sector. The dataset used in this study was obtained from the Kaggle platform and consists of summaries derived from reports of occupational incidents resulting in injuries or deaths between 2015 and 2017. A total of 4739 accident cases were included, containing information on accident date, accident summary, degree and nature of injury, affected body part, event type, human factors, and environmental factors. Six supervised machine learning models—Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, Gradient Boosting Decision Trees (GBDT), and Extreme Gradient Boosting (XGBoost)—were developed and compared to identify the most suitable model for the data. Model performance was evaluated using accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC), which were selected to ensure reliable prediction in safety-critical accident scenarios. The results indicate that XGBoost and GBDT achieve superior performance in predicting human and environmental risk factors. These findings demonstrate the potential of machine learning for improving safety management in the energy sector by identifying risk mechanisms, enhancing safety awareness, and providing quantitative predictions of fatal and non-fatal accident occurrences for integration into safety management systems.

Keywords:

machine learning; industrial accident; prediction; human factor; environmental factor

1. Introduction

According to the International Labour Organization (ILO), almost three million people die every year in workplace accidents. Moreover, an estimated 395 million accidents and cases of occupational disease occur globally each year. While most of these accidents are non-lethal, a significant portion leave the victims temporarily disabled [1].

Across industries, the highest risks to workers are found in the construction, forestry, fishing, manufacturing, and mining sectors, which together account for around 63% of reported fatal accidents [2]. Metalworking is the sector most affected by workplace injuries [3].

Serious accidents, in particular, are often caused by electrical mishaps. The first few minutes following an electrical accident are crucial, as it is essential to turn off the electricity (without touching the victim) and call emergency services. These accidents take place primarily while working on stationary low-voltage installations (cabinets, boxes, sockets), operating electrical tools, and in the surroundings of aerial lines, transformer substations, and underground pipes.

The magnitude of these problems has motivated a large body of work on understanding, predicting, and preventing accidents in the workplace, including work supported by machine learning. Such methods allow risk factors to be detected and the severity of events to be predicted from historical data.

A few studies have investigated the use of machine learning for the analysis and prediction of workplace accidents. For instance, ref. [4] tested the Random Forests (RF) and the Stochastic Gradient Tree Boosting (SGTB) algorithms to predict injury category, body part affected, and accident severity from construction site reports.

Meanwhile, ref. [5] compared a set of algorithms (Support Vector Machines (SVM), Logistic Regression (LR), RF, k-Nearest Neighbors (k-NN), Decision Trees (DT), and Naive Bayes (NB)) for accident report classification. The SVM model exhibited the best outcome, with an F1 score ranging from 0.45 to 0.92.

In terms of cognitive factors, ref. [6] integrated Reasoned Action Theory (RAT) into machine learning models for the prediction of the percentage of risky actions of construction workers. They employed a decision tree-based model with an accuracy of 97.6%.

Likewise, ref. [7] employed several models (DT, RF, LR, k-NN, and SVM) to forecast accident severity, and the Random Forest model was found to perform best.

More recently, ref. [8] applied machine learning techniques to classify injury types in construction accidents using large-scale Australian datasets comparable to the Organization for Economic Co-operation and Development (OECD) statistics, demonstrating the effectiveness of supervised learning for handling heterogeneous accident data.

In Korea, ref. [9] developed a predictive model to determine the likelihood of fatal accidents, with a prediction rate of 91%. Reference [10] used the Random Forest algorithm to categorize accident types in a system that deals with class imbalance by subsampling, achieving 71.3% accuracy.

Some other studies, such as [11], addressed the role of safety training and fall height in the severity level of accidents. Reference [12] developed a hybrid approach that combines logistic regression with an adaptive neuro-fuzzy inference system (ANFIS) to predict the risk of scaffolding falls.

More recently, several research works have extended these analyses. Reference [13] applied ML algorithms to predict the severity of collapse-related incidents in the construction industry and showed that ML models are able to rank the critical causes. Another study [14] showed that a customized combination of multiple machine learning algorithms can be used to predict occupational accidents and highlighted the influence of feature selection on prediction performance. In the industrial field, several studies, for example [15], have highlighted the potential of classification models such as random forest or gradient boosting to anticipate dangerous events. In a different application context, ref. [16] applied several machine learning algorithms (including Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), XGBoost, and Neural Networks) to predict occupational injury types and analyze unsafe conditions using data from national parks in South Africa. In addition, ref. [17] applied learning and ensemble learning approaches to automatically identify the causes of fall-related accidents from investigation reports, enabling automatic extraction of human and environmental factors. Other research, such as [18], has proposed interpretable models for analyzing major risk indicators in the offshore oil industry, combining machine learning with explainability of results. In another study of the energy industry, a number of machine learning models were employed to classify industrial accidents, and the findings indicated that these methods can help extract useful patterns that characterize the different groups of incidents [19].

Finally, recent studies [20,21] have compared different machine learning methods applied to national accident databases, particularly in the metallurgical sector, and have incorporated feature optimization techniques to improve the detection and prevention of occupational injuries. Similarly, ref. [22] predicted occupational accidents at the level of Brazilian states using multiple machine learning models, including regression-based approaches and LightGBM, highlighting regional disparities in accident occurrence.

In the transportation domain, ref. [23] applied machine learning models to predict traffic accident severity in Montreal, showing that data-driven approaches can effectively identify high-risk scenarios and support preventive decision-making. Similarly, ref. [24] proposed a classification-based framework to identify high-risk road segments and accident severity patterns using categorical data, highlighting the effectiveness of machine learning in handling heterogeneous accident-related variables.

In addition to road traffic safety, machine learning techniques have also been widely used for the analysis of construction and industrial accidents. Reference [25] built a factor and scenario-based prediction framework of construction accidents based on a machine learning method and highlighted the significance of human, technical, and environmental factors for predicting accidents. Their results are consistent with previous studies that emphasized the importance of classification and ensemble learning techniques in occupational safety.

In a more recent study, ref. [26] used interpretable machine learning techniques along with data augmentation to discover infrastructure-related causes that contribute to accidents on the road. Their work points to the increasing relevance of model interpretability in potentially life-saving applications, where knowing how individual risk factors contribute is essential to supporting urgent decision-making.

While these works demonstrated the power of machine learning to predict accident severity and conduct risk analysis, few studies consider both fatal and non-fatal accident outcomes at the same time or analyze accidents across domains. In particular, the joint prediction of human and environmental factors in accidents, such as those that may occur in the energy sector, has received little attention in occupational health and safety predictive modeling. This research gap motivates the present study, which aims to compare multiple machine learning models to predict accident-related human and environmental factors and to support data-driven risk management strategies.

Overall, existing research demonstrates the potential of machine learning for proactive prevention of workplace accidents. However, several methodological limitations remain. A number of studies use small datasets, focus only on a subset of accidents, or do not explicitly examine class imbalance, which can have a drastic impact on performance measures (accuracy, precision, recall, and F1-score). Furthermore, the selection and reporting of evaluation metrics differ greatly between studies, which complicates comparison and lowers reproducibility. Moreover, the majority of related works focus on either construction or industry in general, or work only with structured data; few utilize natural language descriptions or pay special attention to the energy domain.

To address these gaps, this paper proposes a predictive model that uses structured variables and textual accident reports in the energy sector to identify the factors that lead to fatal and non-fatal accidents. We use a common, justified set of evaluation metrics, which supports reliable performance assessment and provides some protection against the effects of class imbalance. The goal of this study is to provide more stable, interpretable, and generalizable predictions that can be used in the safety management process as an early preventive measure, based on a data-driven approach using a relatively large dataset and several classification algorithms.

2. Materials and Methods

2.1. Data Structuring and Source Description

The dataset is based on accident cases, and it includes 4739 reported industrial accidents, where each contains the following eight key variables: accident date, narrative summary, degree of injury, nature of injury, affected body part, accident type, human factors, and environmental factors. Table 1 shows an example of an accident scenario that is extracted from the dataset.

The data were downloaded from Kaggle, where the dataset is titled “OSHA HSE DATA_ALL ABSTRACTS 15–17_FINAL”, and represent a set of industrial accident records gathered from public summaries issued by occupational safety organizations. In addition to the structured attributes, the dataset contains textual descriptions of accident scenarios, which allows for the joint analysis of information from both unstructured and structured sources.

Human and environmental factors were categorized in the original dataset using predetermined categories derived from occupational safety classifications established for various industrial domains. The authors did not conduct additional manual annotation. Before modeling, an extensive data pre-processing stage was performed to ensure the quality and consistency of the raw data. This involved addressing missing structured values by imputation (or discarding where appropriate), standardizing inconsistent labels, aggregating low-occurrence categorical values, and eliminating redundant records. In addition, the free-form narrative text descriptions were cleaned, and the categorical variables were encoded to make them compatible with the machine learning models before training.

2.2. Data Preprocessing

To obtain deeper semantic and analytical insights from the dataset, a comprehensive data preprocessing phase was carried out prior to the analysis. This step aims to enhance data quality, reduce noise, and ensure the reliability of subsequent modeling tasks. In addition to the general preprocessing of structured variables, particular attention was given to the textual data contained in the dataset.

For the textual attributes, a dedicated text preprocessing pipeline was applied. First, text normalization was performed to ensure consistency across documents. This included converting all characters to lowercase, removing punctuation marks, numbers, and special characters, and correcting encoding inconsistencies.

Next, tokenization was applied, whereby each text entry was segmented into individual tokens (words). This step enables the transformation of raw text into analyzable units suitable for natural language processing tasks.

Subsequently, stop word removal was conducted to eliminate common and non-informative words (such as articles, prepositions, and conjunctions) that do not contribute significantly to the semantic meaning of the text. This helps reduce dimensionality and improve the efficiency of the analysis.

To further refine the textual representation, stemming and lemmatization techniques were employed. Stemming reduces words to their root forms by removing suffixes, while lemmatization maps words to their canonical dictionary forms, taking into account their grammatical context. These processes help group semantically similar terms and mitigate vocabulary sparsity.
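
The text preprocessing steps described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer from an NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption); a real
# pipeline would use a complete list from an NLP library.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "was", "from", "to"}

# Suffixes removed by the naive stemmer below (assumption, not a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def normalize(text):
    """Lowercase the text and replace digits, punctuation, and symbols with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split normalized text into word tokens on whitespace."""
    return text.split()


def naive_stem(token):
    """Strip one common suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize, tokenize, drop stop words, then stem each remaining token."""
    tokens = tokenize(normalize(text))
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


example = "Employee fell 12 feet from a ladder and was hospitalized."
tokens = preprocess(example)
print(tokens)
```

Note that the naive stemmer produces truncated stems such as "hospitaliz"; lemmatization would instead map the word to a dictionary form.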

After completing the text preprocessing steps, the cleaned and standardized textual features were integrated with the structured variables. The main variables are then described qualitatively in Table 2, emphasizing what they measure, their structure, and their contribution to accident characterization.

2.3. Factors Taken into Account in the Database

Table 3 shows the factors extracted from the dataset, grouped into two main categories: human factors and environmental factors. This classification supports accident cause analysis and the development of targeted prevention strategies.

2.4. Encoding of Categorical Variables and Textual Data

Prior to model training, categorical features were encoded numerically in a manner consistent with the characteristics of each variable. The binary outcome variable, “Degree of Injury”, was defined as 1 for fatal accidents and 0 for non-fatal ones. Other categorical variables, such as “Part of Body”, “Accident Type”, “Nature of Injury”, as well as human and environmental factors, were label-encoded, with each category assigned a unique integer code to ensure computational compatibility while keeping the representation of nominal fields compact.
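
As a rough sketch of this encoding step, the snippet below builds a binary target and integer codes for one categorical column. The category strings and the helper `encode_labels` are hypothetical; the actual label values in the dataset may differ.

```python
def encode_labels(values):
    """Assign each distinct category an integer code, in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Hypothetical rows; the category strings are illustrative only.
degree_of_injury = ["Fatal", "Nonfatal", "Nonfatal", "Fatal"]
part_of_body = ["Head", "Hand", "Back", "Hand"]

# Binary target: 1 for fatal accidents, 0 for non-fatal ones.
y = [1 if d == "Fatal" else 0 for d in degree_of_injury]

# Integer codes for a nominal variable.
codes, mapping = encode_labels(part_of_body)
print(y, codes, mapping)
```

Integer codes impose an arbitrary order on nominal categories. Tree-based models are generally tolerant of this, whereas linear models may be more sensitive to it.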

For the textual component, the accident descriptions (“Accident Summary”) were first preprocessed as described in Section 2.2. The cleaned textual data were then vectorized using the Term Frequency–Inverse Document Frequency (TF-IDF) method. The n-gram range was set to (1, 2) to capture both individual terms and common two-word sequences. Table 4 summarizes the encoding and vectorization techniques applied.
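
To make the vectorization concrete, here is a small pure-Python sketch of TF-IDF over unigrams and bigrams. The weighting variant is an assumption, since the text does not specify one: raw term counts, a smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row, which mirrors the default behaviour of scikit-learn's `TfidfVectorizer`. The two toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max as strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i : i + n]))
    return grams


def tfidf(docs):
    """Return (vocabulary, rows), where rows hold L2-normalized TF-IDF weights."""
    doc_grams = [Counter(ngrams(doc)) for doc in docs]
    vocab = sorted(set().union(*doc_grams))
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for grams in doc_grams if t in grams) for t in vocab}
    # Smoothed IDF (assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for grams in doc_grams:
        row = [grams[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


# Hypothetical preprocessed documents (already tokenized).
docs = [
    ["worker", "fell", "ladder"],
    ["worker", "struck", "truck"],
]
vocab, rows = tfidf(docs)
print(vocab)
```

In this toy corpus, "worker" appears in both documents and therefore receives a lower weight than terms such as "fell" or the bigram "worker fell", which appear in only one.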

2.5. Model Configuration and Evaluation Strategy

Six machine learning models were applied to the dataset: Random Forest, Decision Tree, Logistic Regression, Support Vector Classification, Extreme Gradient Boosting (XGBoost), and Gradient Boosting Decision Trees (GBDT). Before training, categorical values were encoded as numerical values, and textual data were transformed into numerical features using the TF-IDF method. In total, 70% of the data was used for model training and 30% for testing. After training and testing the models, the evaluation metrics (accuracy, recall, precision, and F1-score) were calculated for each factor to estimate classification performance, and the receiver operating characteristic (ROC) curve was plotted. The best model was chosen according to these results, and several other outputs were also examined, such as the correlation heat map, feature importance, ROC curve, confusion matrix, and graphical representation of each factor. All steps, from data import to model testing, were performed on Google Colab. All ML models were trained using the default parameters from scikit-learn. Figure 1 shows the methodological logic followed in this study.
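
For illustration, a 70/30 hold-out split of the 4739 records could look like the sketch below. Whether the original split was shuffled, stratified, or seeded is not stated, so the shuffling and the seed value here are assumptions, and `split_70_30` is a hypothetical helper.

```python
import random


def split_70_30(rows, seed=42):
    """Shuffle a copy of rows and split it into 70% train and 30% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility (assumption)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 4739 accident records: one integer id per record.
records = list(range(4739))
train, test = split_70_30(records)
print(len(train), len(test))
```

With 4739 records this yields 3317 training cases and 1422 test cases. When classes are imbalanced, a stratified split that preserves class proportions in both parts is a common alternative.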

3. Results

3.1. Comparison of ML Models

This section presents the results of the performance evaluation of machine learning models on different aspects. Table 5 illustrates the accuracy of each model based on the test sets, which correspond to the factors used in the data (human factors, environmental factors). Each factor was analyzed individually using the set of models, and the accuracy was calculated. According to Table 5 and Figure 2, the Extreme Gradient Boosting (XGBoost) model and the Gradient Boosting Decision Tree (GBDT) showed the best accuracy compared to the other models. Although most models achieved high accuracy rates, logistic regression and support vector machine performed less well.

3.2. Heatmap

A correlation matrix is a table in which each entry gives the correlation coefficient between a pair of variables (Yi, Yj) in the dataset. The correlations between the columns were computed and displayed as a heatmap using the Seaborn Python library (version 0.12.2) in Google Colab. In a heatmap, data values are represented by colors, which helps the observer concentrate on the most important relationships when there is a lot of data; for this reason, analysts and data scientists commonly display correlation matrices this way. Figure 3 shows the generated correlation heatmap, which was used to visualize the linear correlations between the variables in the dataset. No overly strong correlations were observed, suggesting the absence of significant multicollinearity. However, a few weak associations, for example, between event type and environmental variables, could guide the choice of predictive variables.
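
The quantity behind such a heatmap is the pairwise Pearson correlation coefficient. The sketch below computes it directly for a few hypothetical encoded columns; the column names and values are invented for illustration and are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict giving the correlation between every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical integer-encoded columns.
columns = {
    "event_type": [0, 1, 2, 1, 0, 2],
    "environmental_factor": [1, 1, 2, 0, 0, 2],
    "degree_of_injury": [0, 1, 0, 0, 1, 0],
}
corr = correlation_matrix(columns)
for name, row in corr.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Pearson correlation on integer-coded nominal variables depends on the arbitrary code order, so values obtained this way are best read as a rough screening for redundancy rather than as a measure of association strength.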

3.3. Word Cloud

Word clouds, also known as tag clouds, are a popular way to visualize textual data graphically and can also serve as assessment tools. Using word clouds, investigators can quickly produce graphical representations of the words that characterize accident reports and can check whether recurring words or phrases, or their absence, in the textual descriptions point to contributing causes. Word clouds can also be used as a starting point or screening tool for large volumes of textual material. Figure 4 shows the word cloud generated from the accident descriptions, which reveals a high frequency of terms related to falls (e.g., ladder, roof, fell), impacts (e.g., truck, struck), and severity (died, hospitalized). This preliminary analysis identifies recurring themes related to potential causes of accidents, such as a lack of protection against falls or collisions with machinery. The same terms enter the supervised models through the vectorized text, from which the human and environmental causes of accidents are predicted. The word cloud itself was created as an exploratory data analysis tool showing the most frequently used words in the corpus. It was also useful for verifying whether preprocessing operations were applied as expected (e.g., stopword removal, text tokenization). While this visualization is not used to build our models, it gives an overview of the main linguistic features in the dataset.
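
A word cloud is essentially a rendering of term frequencies, so the underlying counts can be inspected directly. The sketch below counts tokens across a handful of hypothetical preprocessed summaries; `top_terms` is an illustrative helper, not part of the study's code.

```python
from collections import Counter


def top_terms(token_lists, k=3):
    """Return the k most frequent tokens across all documents with their counts."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(k)


# Hypothetical preprocessed accident summaries.
summaries = [
    ["employee", "fell", "ladder"],
    ["worker", "fell", "roof"],
    ["employee", "struck", "truck"],
    ["employee", "fell", "roof"],
]
top = top_terms(summaries)
print(top)
```

Checking such counts is also a quick way to confirm that stop words and punctuation were removed as intended before vectorization.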

3.4. Accuracy of More Efficient Models for Different Factors

The XGBoost algorithm, an improved gradient boosting framework, offers strong performance and generalization ability through the integration of regularization and second-order optimization, which has made it suitable for tasks such as pipeline safety prediction [27]. Although XGBoost is an improved version of GBDT, both the Extreme Gradient Boosting (XGBoost) and Gradient Boosting Decision Tree (GBDT) models were evaluated on the accident scenario dataset, and their evaluation metrics are displayed in Figure 5. Human factors produced the highest results, with an accuracy of 89% for the two best models, while environmental factors produced the lowest results, with an accuracy of 87% with GBDT.

3.5. Performance Evaluation of XGBoost and GBDT Models

The XGBoost and GBDT models were compared, and XGBoost performs better overall for both factors. For human factors, although GBDT attains a slightly higher accuracy than XGBoost, XGBoost has a significantly higher recall and the best F1 score, showing a better compromise between classification accuracy and detection completeness. This advantage is confirmed for environmental factors, where XGBoost reports better performance on all metrics, especially recall and F1 score, supporting its stability and trustworthiness (see Table 6). Hence, in this analysis setting, our findings indicate that XGBoost is the better choice, particularly when minimizing false negatives (FNs) and maintaining performance over time are important.

Although accuracy is reported for easy comparison, all models were evaluated comprehensively using precision, recall, F1-score, ROC curves, and confusion matrices to account for class imbalance and ensure reliable prediction of human and environmental risk factors.
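
The reason accuracy alone can mislead under class imbalance is easy to see numerically. The sketch below computes the four metrics from confusion counts for a binary case; the counts are hypothetical and chosen only to show a high accuracy alongside a low recall.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 50 positive cases out of 1000.
m = binary_metrics(tp=10, fp=5, fn=40, tn=945)
print({k: round(v, 3) for k, v in m.items()})
```

Here accuracy is 0.955 even though only 10 of the 50 positive cases are detected (recall 0.2, F1 about 0.308), which is why recall and F1 matter when false negatives are costly.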

3.6. Importance of Characteristics

Examining feature importance scores helps to identify the most and least relevant attributes for a model's predictions. This is a form of model interpretation that can be performed where conditions allow. Importance scores can also guide model optimization: features with the lowest scores are candidates for removal, while those with the highest scores are kept. For both environmental and human factors, according to Figure 6, Part of Body, Nature of Injury, and Text (Accident Summary) received the highest scores, while Event Type received the lowest score.

Figure 6 presents the comparison of factors for accident prediction, Environmental and Human, using GBDT and XGBoost models. Overall, Part of Body, Nature of Injury, and Text are the most important features. This finding is consistent with the strong semantic and contextual content of these variables: Part of Body and Nature of Injury are both directly connected to the severity/nature of the accident, and the free-text notes provide further context for the observations.

In contrast, Event Type is significantly less important. This can be explained by its low discriminatory ability and poor granularity, given that the categories (e.g., “fall”, “collision”, “incident”) are general and, in most cases, overlap with information already carried by other features. The reduced importance of human and environmental factors may indicate that these attributes carry redundant or less distinctive information compared with textual or descriptive features.

The robustness of these results in the GBDT as well as XGBoost models confirms that descriptive and textual attributes are essential for accurate accident forecasting.

3.7. Confusion Matrix

A confusion matrix, often called a contingency table, is one of the most widely used techniques for assessing model performance in classification problems. It compares predicted classes with actual classes and counts, for each class, how many predictions were correct and how many were wrong, which makes it possible to identify the kinds of mistakes the model makes. In the convention used here, every row of the table represents a predicted class, and every column represents an actual class. Figure 7 displays the confusion matrix for each of the two factors, together with the confusion matrix values and the recognition percentage of each matrix.
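
A minimal sketch of how such a matrix is tallied follows, using the orientation described in the text (rows for predicted classes, columns for actual classes). The labels and predictions are hypothetical. Note that scikit-learn's `confusion_matrix` uses the transposed layout, with rows for true labels.

```python
def confusion_matrix(actual, predicted, labels):
    """Count predictions per (predicted, actual) pair: rows = predicted, columns = actual."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix


# Hypothetical binary labels and model predictions.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted, labels=[0, 1])

# Recognition rate: correct predictions (the diagonal) over all predictions.
recognition_rate = sum(cm[i][i] for i in range(len(cm))) / len(actual)
print(cm, recognition_rate)
```

In this example the off-diagonal entries are 1 (predicted 0, actually 1) and 2 (predicted 1, actually 0), so the two error types can be read separately.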

3.8. ROC Curve for the Different Models

The Receiver Operating Characteristic curve and the AUC (Area Under the Curve) value represent an overall measure of the performance of both the XGBoost and GBDT models. They allow us to compare the overall effectiveness of different classification schemes. The ROC curve and the AUC value for the test data are shown in Figure 8.
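
For intuition, the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the labels and scores are hypothetical, and the quadratic loop is meant for clarity rather than efficiency.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

Three of the four positive-negative pairs are ordered correctly here, giving an AUC of 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.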

The ROC curves of the GBDT and XGBoost models for predicting environmental and human factors are shown in Figure 8. In general, both models are good performers, having AUC values above 0.80 in the case of environmental factors and around 0.70 for human factors.

Regarding the environmental variables, both ensembles have quite similar performance (AUC = 0.807 for GBDT and 0.802 for XGBoost), which demonstrates that the structures related to environmental variables are well captured by these two ensemble algorithms.

For human factors, the AUC values are much lower (0.730 for GBDT and 0.701 for XGBoost), indicating a more complicated estimation task that depends on subjective or less-structured information. The higher AUC of GBDT could be attributed to an improved ability to account for within-class variation or less overfitting towards the most dominant patterns.

Such observations indicate that the environment is easier to model (because of more obvious patterns within its data), whereas human factors might need additional behavioral or contextual descriptors for better prediction.

The receiver operating characteristic (ROC) curves in Figure 9 illustrate the performance of multiple classification models in predicting human and environmental factors. For environmental variables, the results indicate good discrimination, with logistic regression achieving the highest AUC, followed closely by ensemble-based methods, including Random Forest, GBDT, and XGBoost. The prediction of human factors, by contrast, has lower overall AUCs, indicating a more difficult and less stable classification problem. For human factors, nonlinear models such as GBDT, Random Forest, and SVM perform better, while the decision tree performs consistently worse for both factors. These findings emphasize the significance of models and techniques with high generalization capacity in the prediction of complex variables.

4. Discussion

The findings of this research validate the efficiency of machine learning to predict industrial accidents in terms of human and environmental factors. Of the tested models, gradient boosting-based algorithms (particularly XGBoost and GBDT) showed superior accuracy, precision, recall, and F1-score compared with conventional models like logistic regression, decision tree, and SVM. This superior performance could be attributed to their capability of modeling intricate nonlinear relationships of explanatory variables with accident occurrence, while controlling interactions among diverse risk factors.

Comparing these results to the literature, a high level of agreement is observed. References [13,14] have also indicated the good performance of gradient boosting models to classify and predict occupational accidents. Also, ref. [17] confirmed that ensemble learning methods improve the robustness of automatic accident cause identification. These findings indicate that boosting-based models are an effective way to enhance the performance of industrial risk management systems.

Early foundational studies laid the groundwork for applying machine learning to occupational safety analysis. For example, ref. [11] demonstrated the influence of safety training and fall height on injury severity using statistical and machine learning approaches. Similarly, the authors of [4] were among the first to apply ensemble learning methods to predict injury severity and affected body parts from construction accident reports. These studies established the feasibility of data-driven approaches for occupational accident analysis, which later research expanded using larger datasets and more advanced algorithms.

Analysis of the most important factors showed that variables associated with human error, together with some forms of environmental degradation (e.g., temperature or visibility conditions and workspace clutter), are highly determinant of accidents. This finding supports what has been confirmed in previous research [18,21] that addresses the integration between mechanical and human factors to obtain a comprehensive picture of industrial risk.

In addition, recent studies highlight the diversity of applications and regional contexts. In North America, ref. [21] used public OSHA data to predict injury severity using multiple machine learning models. In South America, ref. [22] applied similar techniques at the state level in Brazil, demonstrating regional disparities in accident occurrence. Reference [8] focused on construction in Australia, using large-scale datasets aligned with OECD standards to classify injury types, while [16] analyzed unsafe conditions in South African national parks using SVM, k-NN, and XGBoost, illustrating adaptability in a non-industrial context. These regional differences emphasize the importance of adaptable machine learning frameworks that can generalize across various industrial and environmental contexts.

Despite their good performance, the proposed models have some limitations. The success of predictions is highly contingent on the representativeness and balance of the available data. Although TF-IDF vectorization allowed us to handle the text-based variables appropriately, unbalanced or incomplete data can bias learning. In addition, boosting models are efficient yet still partially “black-box,” which limits their interpretability, a major concern for their application in industrial settings such as the energy sector.
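To illustrate how class imbalance can be counteracted during training, the following is a minimal sketch of inverse-frequency class weighting. The study does not state which imbalance-handling technique was applied, so this heuristic (each class weighted by n_samples / (n_classes × class count)) is an assumption chosen for illustration, and the label list is made-up toy data rather than the study's dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency.

    weight(c) = n_samples / (n_classes * count(c)), so rare classes get
    larger weights and every class contributes equally in total.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy, made-up label distribution (not taken from the study's data).
labels = ["non-fatal"] * 8 + ["fatal"] * 2
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: weight {weight}")
```

Weights of this kind can be passed to learners that accept per-class or per-sample weights, so that errors on the minority class are penalized more heavily.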

Despite these limitations, our approach introduces methodological improvements that address several of these challenges and extends the literature in several ways. Most prior research focuses on single outcomes or specific sectors, whereas this study jointly analyzes human and environmental factors and considers both fatal and non-fatal outcomes. Unlike studies relying solely on structured data, the proposed framework integrates structured variables with textual accident descriptions, enhancing the richness of extracted risk patterns. Furthermore, the use of consistent evaluation metrics and explicit handling of class imbalance addresses methodological limitations often reported in prior studies.

Building on the above discussion, we offer a few recommendations. First, explainability methods such as SHAP or LIME can be incorporated to understand the importance of each factor in the prediction. Second, combining a variety of data sources (incident reports, industrial sensors, feedback) may help make the model more robust. Lastly, incorporating these predictive models into industrial risk management systems can provide a proactive mechanism for identifying early warnings and safety information useful for safety-related decisions.
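As a rough illustration of what a model-agnostic explanation looks like, the sketch below implements permutation importance using only the Python standard library. This is not SHAP or LIME; it is a simpler technique swapped in for illustration, in which a feature's importance is the average drop in accuracy when that feature's column is shuffled. The `toy_model` function and the data rows are hypothetical stand-ins, not the trained models or data from this study.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    hits = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return hits / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier that only looks at the first feature."""
    return 1 if row[0] > 0.5 else 0


# Made-up feature rows; labels are generated by the toy model itself.
rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.4], [0.1, 0.9], [0.7, 0.3], [0.3, 0.6]]
labels = [toy_model(row) for row in rows]

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Because the toy model ignores the second feature, shuffling it never changes the predictions and its importance is exactly zero, whereas shuffling the first feature degrades accuracy.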

The proposed machine learning models demonstrate promising performance in predicting human and environmental risk factors associated with industrial accidents. In the energy sector, these models can support proactive risk identification across various domains, such as electricity generation and transmission, oil and gas operations, and nuclear energy facilities.

In the electricity sector, the models are particularly applicable to accident scenarios involving maintenance activities, electrical hazards, and environmental conditions such as weather-related risks. The availability of structured incident reports and operational logs facilitates the integration of machine learning-based risk prediction tools into safety management systems.

In the oil and gas industry, where accidents often involve complex interactions between human factors, hazardous materials, and harsh environmental conditions, the models can assist in identifying recurring risk patterns related to human error, fatigue, and unsafe operational practices. However, the diversity of operational contexts (onshore vs. offshore) and the presence of rare but high-consequence events may limit generalization without sector-specific data.

For the nuclear energy sector, although the number of accidents is relatively limited, the critical nature of potential consequences highlights the importance of robust risk assessment. The applicability of data-driven models in this domain may be constrained by data scarcity, strict confidentiality requirements, and highly regulated operational environments. As a result, machine learning models should be considered as decision-support tools rather than standalone risk assessment mechanisms.

Overall, while the proposed approach shows strong potential for enhancing safety management in the energy industry, its effectiveness depends on data quality, representativeness, and sector-specific adaptation. Future research may focus on incorporating domain-specific features, expert knowledge, and advanced modeling techniques to improve generalization across different energy subsectors.

5. Conclusions

This research showed that machine learning can be used to predict industrial accidents while accounting for human and environmental factors. Based on a structured database and TF-IDF vectorization for processing text variables, six algorithms were compared: Logistic Regression, Decision Trees, Support Vector Machines, Random Forest, XGBoost, and GBDT. The results indicate that the gradient boosting models (XGBoost and GBDT) provide the best overall performance, which confirms their capacity to capture complex relationships between the explanatory variables and accident occurrence.

The findings reported here improve the understanding of how accidents develop, emphasizing the role played by human and environmental factors in their genesis. The model also makes the risk analysis process more objective and less dependent on expert judgement, enabling decisions to be made on the basis of numerical information.

From a practical standpoint, the models can be implemented in risk management systems to reinforce prevention and forecast hazardous situations, and thus inform industrial safety policy. However, the limitations identified, in particular concerns over data quality and balance and over model interpretability, point to opportunities for further work to make predictions more transparent and actionable.

Future research directions will thus include incorporating explainability methods (e.g., SHAP, LIME), employing more sophisticated resampling techniques to address class imbalance, evaluated with various loss functions and metrics, and leveraging transformer-based language models (BERT, RoBERTa) for improved understanding of accident-related language. In the long term, we hope that this approach can contribute to the digitization and automation of industrial risk management using AI techniques in order to improve safety, reliability, and resilience in complex systems.

In this study, the default hyperparameters provided by the scikit-learn library were used for all evaluated machine learning models to ensure a fair and consistent comparison across algorithms. The main objective of this work is not to optimize or fine-tune model hyperparameters, but rather to evaluate and compare classical machine learning techniques for identifying human and environmental factors influencing industrial accidents. By adopting uniform baseline configurations, the analysis highlights the relative performance of each algorithm under comparable conditions and establishes a reference framework for future research. Hyperparameter optimization is considered a potential extension of this work and will be addressed in future studies to further enhance model performance.

Although the present study focuses on model development and performance evaluation, the practical integration of the proposed machine learning framework into existing Safety Management Systems (SMS) represents an important direction for future work. In operational settings, the predicted human and environmental risk indicators could be incorporated as decision-support inputs within SMS dashboards to enable continuous risk monitoring. Such predictions may support proactive safety interventions, including targeted training programs addressing identified human-factor vulnerabilities, adaptive operational procedures under elevated environmental risk conditions, and prioritized inspections or audits for high-risk scenarios. Future research should further investigate system interoperability, real-time data integration, and the definition of actionable risk thresholds to ensure that model outputs can be translated into effective, timely, and context-aware safety management interventions.
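The idea of actionable risk thresholds can be sketched as a simple mapping from a predicted probability to an intervention tier. The cut-off values and action labels below are hypothetical placeholders chosen only for illustration; in practice they would have to be defined and validated by safety experts for each operational context.

```python
from bisect import bisect_right

# Hypothetical cut-offs separating four risk tiers (not derived from the study).
THRESHOLDS = [0.30, 0.60, 0.85]

# Hypothetical actions, one per tier, ordered from lowest to highest risk.
ACTIONS = [
    "routine monitoring",
    "targeted training review",
    "prioritized inspection",
    "immediate intervention",
]


def risk_action(probability):
    """Map a predicted risk probability in [0, 1] to an action tier."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # bisect_right counts how many thresholds are <= probability,
    # which is the index of the tier the probability falls into.
    return ACTIONS[bisect_right(THRESHOLDS, probability)]


for p in (0.10, 0.30, 0.70, 0.90):
    print(p, "->", risk_action(p))
```

Keeping the thresholds in a single sorted list makes them easy to audit and adjust, which matters if the tiers are later calibrated against observed incident data.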

Author Contributions

Conceptualization, K.B.; methodology, K.B.; validation, I.B. and K.M.; writing—original draft preparation, K.B.; supervision, K.M. and A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://www.kaggle.com/datasets/ihmstefanini/industrial-safety-and-health-analytics-database (accessed on 17 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ILO	International Labor Organization
SGTB	Stochastic Gradient Tree Boosting
SVM	Support Vector Machines
LR	Logistic Regression
DT	Decision Trees
RF	Random Forest
NB	Naive Bayes
ML	Machine Learning
FN	False Negative
RAT	Reasoned Action Theory
XGBoost	Extreme Gradient Boosting
GBDT	Gradient Boosting Decision Trees
k-NN	k-Nearest Neighbors
TF-IDF	Term Frequency-Inverse Document Frequency
ROC	Receiver Operating Characteristic
AUC	Area Under the Curve
OECD	Organisation for Economic Co-operation and Development

References

1. Hung, K.K.C.; Kifley, A.; Brown, K.; Jagnoor, J.; Craig, A.; Gabbe, B.; Derrett, S.; Dinh, M.; Gopinath, B.; Cameron, I.D.; et al. Impacts of injury severity on long-term outcomes following motor vehicle crashes. BMC Public Health 2021, 21, 602.
2. Zermane, A.; Mohd Tohir, M.Z.; Zermane, H.; Baharudin, M.R.; Mohamed Yusoff, H. Predicting fatal fall from heights accidents using random forest classification machine learning model. Saf. Sci. 2023, 159, 106023.
3. Shabani, S.; Bachwenkizi, J.; Mamuya, S.H.; Moen, B.E. The prevalence of occupational injuries and associated risk factors among workers in iron and steel industries: A systematic review and meta-analysis. BMC Public Health 2024, 24, 2602.
4. Tixier, A.J.-P.; Hallowell, M.R.; Rajagopalan, B.; Bowman, D. Application of machine learning to construction injury prediction. Autom. Constr. 2016, 69, 102–114.
5. Goh, Y.M.; Ubeynarayana, C.U. Construction accident narrative classification: An evaluation of text mining techniques. Accid. Anal. Prev. 2017, 108, 122–130.
6. Goh, Y.M.; Ubeynarayana, C.U.; Wong, K.L.X.; Guo, B.H.W. Factors influencing unsafe behaviors: A supervised learning approach. Accid. Anal. Prev. 2018, 118, 77–85.
7. Poh, C.Q.X.; Ubeynarayana, C.U.; Goh, Y.M. Safety leading indicators for construction sites: A machine learning approach. Autom. Constr. 2018, 93, 375–386.
8. Tang, X.; Chen, A.; He, J. A modelling approach based on Bayesian networks for dam risk analysis: Integration of machine learning algorithm and domain knowledge. Int. J. Disaster Risk Reduct. 2022, 71, 102818.
9. Choi, J.; Gu, B.; Chin, S.; Lee, J.-S. Machine learning predictive model based on national data for fatal accidents of construction workers. Autom. Constr. 2020, 110, 102974.
10. Kang, K.; Ryu, H. Predicting types of occupational accidents at construction sites in Korea using random forest model. Saf. Sci. 2019, 120, 226–236.
11. Mistikoglu, G.; Gerek, I.H.; Erdis, E.; Mumtaz Usmen, P.E.; Cakan, H.; Kazan, E.E. Decision tree analysis of construction fall accidents involving roofers. Expert Syst. Appl. 2015, 42, 2256–2263.
12. Jahangiri, M.; Solukloei, H.R.J.; Kamalinia, M. A neuro-fuzzy risk prediction methodology for falling from scaffold. Saf. Sci. 2019, 117, 88–99.
13. Luo, X.; Li, X.; Goh, Y.M.; Song, X.; Liu, Q. Application of machine learning technology for occupational accident severity prediction in the case of construction collapse accidents. Saf. Sci. 2023, 163, 106138.
14. Sarkar, S.; Pramanik, A.; Maiti, J.; Reniers, G. Predicting and analyzing injury severity: A machine learning-based approach using class-imbalanced proactive and reactive data. Saf. Sci. 2020, 125, 104616.
15. Hossain, A.; Rahman, M. Machine Learning Applications in Industry Safety: Analysis and Prediction of Industrial Accidents. In Proceedings of the 2024 International Conference on Smart Systems for Applications in Electrical Sciences (ICSSES), Tumakuru, India, 3–4 May 2024.
16. Chadyiwa, M.; Kagura, J.; Stewart, A. Investigating Machine Learning Applications in the Prediction of Occupational Injuries in South African National Parks. Mach. Learn. Knowl. Extr. 2022, 4, 768–778.
17. Qi, H.; Zhou, Z.; Irizarry, J.; Lin, D.; Zhang, H.; Li, N.; Cui, J. Automatic Identification of Causal Factors from Fall-Related Accident Investigation Reports Using Machine Learning and Ensemble Learning Approaches. J. Manag. Eng. 2024, 40, 04023050.
18. Zhen, X.; Ning, Y.; Du, W.; Huang, Y.; Vinnem, J.E. An interpretable and augmented machine-learning approach for causation analysis of major accident risk indicators in the offshore petroleum industry. Process Saf. Environ. Prot. 2023, 173, 922–933.
19. Benderouach, K.; Bennis, I.; Bellat, A.; Mansouri, K.; Siadat, A. Classification of industrial accidents in the energy sector using machine learning models. In Proceedings of the 2025 5th International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Fez, Morocco, 15–16 May 2025; pp. 1–8.
20. Özkan, E.K.; Ulaş, H.B. Comparison of four machine learning methods for occupational accidents based on national data on metal sector in Turkey. Saf. Sci. 2024, 174, 106468.
21. Khairuddin, M.Z.F.; Lu Hui, P.; Hasikin, K.; Abd Razak, N.A.; Lai, K.W.; Mohd Saudi, A.S.; Ibrahim, S.S. Occupational Injury Risk Mitigation: Machine Learning Approach and Feature Optimization for Smart Workplace Surveillance. Int. J. Environ. Res. Public Health 2022, 19, 13962.
22. Toledo, J.; Moura, T. Occupational Accidents Prediction in Brazilian States: A Machine Learning Based Approach. In Proceedings of the 26th International Conference on Enterprise Information Systems, Angers, France, 28–30 April 2024; SCITEPRESS—Science and Technology Publications: Angers, France, 2024; pp. 595–602.
23. Muktar, B.; Fono, V. Toward Safer Roads: Predicting the Severity of Traffic Accidents in Montreal Using Machine Learning. Electronics 2024, 13, 3036.
24. Yumak, A.; Hengirmen Tercan, S.; Colak, U.C.; Ozcanan, S. A Machine Learning Approach to Identify High-Risk Road Segments and Accident Severity Patterns Based on Categorical Data. Appl. Sci. 2025, 15, 12824.
25. Kim, K.; Cho, D.; Lee, M. A Machine Learning Approach for Factor Analysis and Scenario-Based Prediction of Construction Accidents. Buildings 2025, 15, 4343.
26. Lee, J.; Kim, S.; Heo, T.-Y.; Lee, D. Identifying the Roadway Infrastructure Factors Affecting Road Accidents Using Interpretable Machine Learning and Data Augmentation. Appl. Sci. 2025, 15, 501.
27. Liu, W.; Chen, Z.; Hu, Y. XGBoost algorithm-based prediction of safety assessment for pipelines. Int. J. Press. Vessels Pip. 2022, 197, 104655.

Figure 1. Prediction steps using multiple classification models.

Figure 2. The accuracy rates of the models based on different factors.

Figure 3. Correlation matrix heatmap.

Figure 4. Graphical visualization of the word cloud.

Figure 5. Best model accuracy for different factors.

Figure 6. Feature importance of the different variables.

Figure 7. Confusion matrices for different factors.

Figure 8. ROC curves for human and environmental factors.

Figure 9. ROC curves for different models.

Table 1. Example of accident scenarios in the dataset.

Event Date	Text Description	Degree of Injury	Nature of Injury	Part of Body	Event Type	Environmental Factor	Human Factor
10 August 2017	“At 9:00 a.m. on 10 August 2017, an employee was operating a 400 ton Bliss…”	Non-fatal	Amputation, Crushing	Fingers	Caught in or between	Catch Point/Puncture Action	Other

Table 2. Description of key variables.

Variable	Type	Description	Examples/Categories
Accident Summary	Textual	Operator-written narrative describing how the accident occurred	“Worker injured hand while adjusting conveyor belt…”
Degree of Injury	Categorical	Level of injury impact	Fatal, Non-fatal
Body Part	Categorical	Main body part affected	Hand, Head, Leg, Eye…
Event Type	Categorical	Mechanism of occurrence	Slip/Fall, Machine Contact, Impact, Electric Shock…
Nature of Injury	Categorical	Medical classification of injury	Fracture, Burn, Cut, Amputation…
Human Factors	Multi-label categorical	Behavioral or cognitive contributors	Fatigue, inattention, Lack of training…
Environmental Factors	Multi-label categorical	Workplace conditions	Poor lighting, wet floor, noise, machine defect…

Table 3. The database’s factors.

Categorization	Accident-Causing Factors	Examples/Notes
Human Factors	Errors in judgment/dangerous situations	Misjudging equipment operation
	Distracting actions by other people	Co-worker interference during operation
	Poor perception of the work environment	Failure to notice hazards
	Malfunction of the neuromuscular system	Fatigue, sudden muscle failure
	Malfunction of safety/warning devices	Alarm or sensor failure
	Safety devices removed/inoperative	Guards removed for convenience
	Equipment unsuitable for operation	Wrong tool or machine used
	Defective equipment in use	Broken parts, worn components
	Inadequate/lack of written work practices program	Missing safety procedures
	Inappropriate equipment handling procedure	Improper lifting, handling, or setup
	Failure of lockout/tagout procedure	Machine energized during maintenance
	Inadequate/lack of housekeeping program	Slippery floors, cluttered workspace
	Inadequate/lack of technical controls	Missing interlocks or automated protections
	Inadequate/lack of respiratory protection	Exposure to dust, fumes, chemicals
	Inadequate/lack of protective clothing/equipment	Missing gloves, helmets, eye protection
	Inadequate/lack/exposure/biological monitoring	No monitoring for toxins or pathogens
	Inappropriate position for the task	Awkward posture, overreach, or strain
Environmental Factors	Hooking point/piercing action	Sharp edges, protrusions
	Action of a moving or falling object from a height	Tools or materials dropped from above
	Action of a moving or falling object above the head	Cranes, hoists, overhead loads
	Action of a flying object	Projectiles, sparks, debris
	Action of a pinch point	Moving machinery parts
	Action of a shear point	Blades or presses
	Equipment/method of handling materials	Improper lifting, conveyor belts
	Condition of the work surface/installation/layout	Slopes, clutter, poor ergonomics
	Weather, earthquake, etc.	Rain, wind, vibration, natural hazards
	Lighting	Poor illumination, glare
	Radiation conditions	UV, ionizing radiation
	Exposure to chemical action/reaction	Acid, solvent, or corrosive spills
	Gas/vapor/mist/smoke/dust	Toxic inhalation risks
	Exposure to flammable liquids/solids	Fire or explosion hazard
	Temperature ± tolerance	Extreme heat or cold
	Noise level	Hearing damage or distraction
	Overpressure/underpressure	Confined space hazards

Table 4. Encoding method.

Variable	Encoding Method
Degree of Injury	Binary Encoding → 1 = Fatal, 0 = Non-Fatal
Other categorical variables	Label Encoding → 1, 2, 3, 4… (depending on the number of categories)
Accident Summary	TF-IDF
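The three encodings in Table 4 can be sketched with the Python standard library as follows. This is an illustrative sketch only: the TF-IDF variant shown is the plain textbook form (term frequency multiplied by log(N / document frequency)) and may differ from the smoothing and normalization used in the study's implementation, the ordering of label codes (alphabetical, starting at 1) is an assumption, and the example strings are invented.

```python
import math
from collections import Counter


def label_encode(values):
    """Map each distinct category to an integer starting at 1 (alphabetical order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values], mapping


def tfidf(documents):
    """Textbook TF-IDF: (count / doc length) * log(n_docs / document frequency).

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented example values, not rows from the dataset.
degree = ["Fatal", "Non-Fatal", "Non-Fatal"]
binary = [1 if d == "Fatal" else 0 for d in degree]      # binary encoding

body_codes, body_map = label_encode(["Hand", "Head", "Hand"])  # label encoding

summaries = ["worker injured hand", "worker fell from ladder", "hand caught in press"]
vectors = tfidf(summaries)                                # TF-IDF weights

print(binary, body_codes, body_map)
print(vectors[0])
```

In the first summary, "injured" occurs in only one document and therefore receives a higher weight than "worker" or "hand", which each occur in two.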

Table 5. The accuracy rates of the models based on different factors.

Factor	Random Forest	Logistic Regression	SVM	Decision Tree	XGBoost	GBDT
Human Factor	87%	72%	74%	87%	89%	89%
Environmental Factor	86%	71%	73%	83%	88%	87%

Table 6. Performance of the XGBoost and GBDT models for human and environmental factors.

Factor	Model	Accuracy	Precision	Recall	F1-Score
Human Factor	XGBoost	0.89	0.83	0.98	0.90
Human Factor	GBDT	0.89	0.89	0.88	0.89
Environmental Factor	XGBoost	0.88	0.85	0.92	0.88
Environmental Factor	GBDT	0.87	0.90	0.75	0.82
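The F1-scores in Table 6 follow from the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes F1 from the rounded precision and recall values in the table and compares the result with the reported F1. Because the inputs are already rounded to two decimals, agreement is only expected to within about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 6.
TABLE_6 = {
    ("Human Factor", "XGBoost"): (0.83, 0.98, 0.90),
    ("Human Factor", "GBDT"): (0.89, 0.88, 0.89),
    ("Environmental Factor", "XGBoost"): (0.85, 0.92, 0.88),
    ("Environmental Factor", "GBDT"): (0.90, 0.75, 0.82),
}

# Recompute F1 from the rounded precision and recall values.
recomputed = {
    key: round(f1_score(p, r), 2) for key, (p, r, _) in TABLE_6.items()
}

for key, (_, _, reported) in TABLE_6.items():
    print(key, "recomputed:", recomputed[key], "reported:", reported)
```

The comparison also makes the precision-recall trade-off visible: for the environmental factor, GBDT has the higher precision but a much lower recall than XGBoost, which is what pulls its F1-score down.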

