Article

Chemical Safety Risk Identification and Analysis Based on Improved LDA Topic Model and Bayesian Networks

School of Resources and Safety Engineering, Central South University, Changsha 410083, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6197; https://doi.org/10.3390/app15116197
Submission received: 18 April 2025 / Revised: 21 May 2025 / Accepted: 28 May 2025 / Published: 30 May 2025

Abstract

Traditional chemical safety management relies mainly on manual inspection and empirical judgment, which is inadequate in the face of increasingly complex production environments and very large data volumes, so efficient modern technologies are urgently needed to strengthen the safety management of chemical production sites. This paper therefore studies the identification and analysis of chemical safety risk factors based on an improved LDA topic model and Bayesian networks. Thirty-three main risk factors are obtained by constructing the LDA topic model, performing text mining and thematic analysis of chemical safety accident cases, and combining the results with the socio-technical system accident model. The correlations and causal relationships between risk factors are revealed through association rule mining and Bayesian network analysis. Sensitivity and critical causal path analyses are used to indicate the possible paths and key links of accident development. The results show that the text-mining LDA topic model proposed in this paper performs well in analyzing accident reports and can effectively address the insufficient analytical capability and high subjectivity of traditional methods. The proposed method can efficiently extract the keywords of accident reports and reveal the correlation and causality between risk factors.

1. Introduction

The raw materials and products involved in chemical production are often flammable, explosive, toxic, or corrosive. Coupled with the complex and changeable production environment, this means that production safety accidents occur from time to time, endangering both lives and property while threatening social stability. The trend in chemical accidents and resulting deaths in China over the period 2016–2022 is illustrated in Figure 1. These data are from “Southern Metropolis Daily” and “The National Commercial Fire and Safety Association”. In total, 1185 chemical accidents with 1477 deaths were recorded over this period. Therefore, it is essential to strengthen the management of production safety.
Extensive research has been conducted in the field of chemical safety. P. Lassak et al. [1] chose an industrial fixed-bed reactor as the object of study and developed a mathematical model with nine parameters and analyzed the effect of uncertainty on individual model parameters and their combinations using Monte Carlo methods. Building upon traditional safety analysis methods, Rathnayaka et al. [2] introduced a predictive accident model that synergistically combines event tree and fault tree approaches with process historical data and causal analysis. Jain et al. [3] formulated a holistic framework for systematic process safety management and risk assessment tailored to technological applications. Tan [4] achieved basic safety in chemical parks by constructing a capacity assessment model for chemical parks, reduced accident losses, and gave recommendations for expanding the stock of favorable hazards in the development of chemical parks.
These traditional methods lack sufficient data collection and analysis capabilities and are highly subjective, so they cannot fully capture potential safety risks and trends. At the same time, traditional chemical safety management may fail to keep pace with technological development and cannot fully exploit new technological means to raise the efficiency and level of safety management, thereby missing opportunities to prevent accidents in advance and weakening the effectiveness of safety management.
Text mining can effectively compensate for the shortcomings of traditional methods by extracting useful information from massive text data, discovering hidden knowledge, and making classification predictions, which supports decision-making and applications. Recent developments in text mining include Senave et al.’s [5] systematic examination of text-processing stages, which underscored the significance of keyword extraction. Addressing this critical need, Zhang et al. [6] developed a novel semantic hierarchical graph model that effectively captures keyword contexts and their underlying connection patterns. The hierarchical associations between the words in the semantic graph can be effectively revealed through deep mining of feature item representations, so that high-probability keyword sets can be obtained. To optimize news retrieval, Zhang [7] created a text-processing technique that extracts keywords from news articles to obtain their essential information. Text mining sometimes uses clustering methods, and LDA topic modeling is one such method. Zhao et al. [8] systematically investigated the privacy preservation of mainstream LDA training algorithms based on Collapsed Gibbs Sampling (CGS) and proposed two LDA algorithms. Of these, the centralized privacy-preserving algorithm (HDP-LDA) prevents the leakage of mid-training statistics in CGS, while the locally private LDA (LP-LDA) variant, unlike conventional LDA, incorporates local differential privacy mechanisms to protect each data contributor’s information. Among the LDA variants, Trace Ratio LDA (TR-LDA) is a classical form given its distinct nature. The algorithm for solving TR-LDA only converges if the sample size is significantly smaller than the dimensionality of the data. To address this problem, Li et al. [9] proposed an adapted form of TR-LDA, which is applied to different datasets through a standardized format. Wu et al. [10] proposed a short text clustering algorithm (SKP-LDA) for LDA using sentiment co-occurrence statistics and extracted knowledge pairs. In this algorithm, the sentiment word co-occurrence takes diverse short texts into account, and microblog short texts are also assigned a sentiment polarity. At the same time, the model extracts knowledge pairs (topic-specific words and their relational counterparts) and integrates them into LDA for clustering, significantly enhancing the accuracy of online public opinion analysis. Jia et al. [11] employed LDA to establish a material cycle safety control system for construction sites, enhancing project safety and offering insights for systematic engineering safety management. Zhang et al. [12] utilized this idea to analyze the research lineage of forestry ecological construction, identifying frontier hotspots and forecasting future trends, thereby contributing valuable guidance for related fields.
Given the low use of text mining methods in chemical safety, this paper attempts to combine and use text mining, keyword extraction, LDA topic modeling, association rule analysis, and Bayesian networks for chemical accident analysis. Focusing on chemical production safety, this study develops a systematic risk identification and analysis framework tailored to the unique characteristics of chemical production site management. By conducting an in-depth examination of chemical accident investigation reports and leveraging text mining and analytical techniques, the research not only advances the theoretical boundaries of chemical safety management but also demonstrates substantial innovation and practical significance for real-world applications. The objective is to cut down on the frequency of accidents through robust guidance for automated analysis of accident causes and optimization of safety strategies.

2. Materials and Methods

2.1. Dataset and Data Preprocessing

This study uses chemical production accident investigation reports as the primary corpus for analysis. These reports are written by experts in the safety field after they have investigated and analyzed an accident. The official websites of China’s Ministry of Emergency Management (MEM) and of provincial and municipal emergency management departments publish the corresponding accident investigation reports. Several websites in China also collect and organize accidents in chemical production systems, such as the Safety Management Network, China Chemical Safety Association, and Chemical Safety People. To establish robust data provenance, this study downloaded the chemical-related accident investigation reports published on these websites from 2013 to 2022, totaling 843 articles. After data cleaning procedures including deduplication and quality filtering, we obtained a final corpus of 514 chemical industry accident reports. The distributions of accident severity levels and incident types are visualized in Figure 2 and Figure 3.
Figure 2 shows that most accidents are classified as general or larger accidents. Although these accidents cause relatively few casualties and limited damage, they occur frequently and can still severely affect chemical production. Figure 3 reveals that explosions account for more than 50% of all accidents, while poisoning and asphyxiation incidents represent the second most common type. This pattern primarily stems from the widespread use of combustible, explosive, toxic, and hazardous gases in chemical manufacturing processes, which frequently leads to either mixed-gas explosions or worker exposure resulting in poisoning/asphyxiation cases.
To prevent data redundancy, reduce the workload of text analysis, and minimize the interference of irrelevant information, this study further extracted the “Accident History” and “Accident Cause Analysis” sections of each accident investigation report, which served as the initial corpus for the analysis in this paper.
Data preprocessing is an essential yet labor-intensive step, as the raw corpus often includes irregular or irrelevant terms. A key objective is to segment the original Chinese text into words, so that it resembles English text in which words are separated by spaces, to facilitate subsequent analysis. The text segmentation tool used in this study is the Jieba segmenter for Python 3. Linguistic analysis reveals that the causative factors in chemical accidents predominantly manifest as either nominal phrases or verb–noun constructions, indicating their syntactic simplicity [13]. Therefore, only common nouns, organization names, other proper names, everyday verbs, and gerunds are retained during segmentation, and segmentation results with any other part of speech are automatically discarded.
The Jieba-based segmentation system relies on three dictionaries: a domain dictionary, a synonym dictionary, and a stop-word dictionary.
(1) Domain dictionary: Although Jieba comes with a built-in dictionary that contains the most common words (e.g., reactor, piping), it has significant limitations in recognizing domain-specific terminology, particularly technical terms such as ‘distillation tower’, ‘intermediate chamber’, ‘steam valve’, and ‘gas detector’ that are essential for accurate analysis of process industry documentation. When such terms are encountered, Jieba may split a single technical term into two or more words, thereby destroying terms that would otherwise be highly indicative of safety risk factors. It is therefore necessary to construct a domain-specific lexicon containing these industry terms and to load it into the Jieba segmentation system as a custom dictionary.
(2) Synonym dictionary: The prevalence of synonymous expressions in accident investigation reports introduces lexical variation that makes the segmentation output inconsistent and fragmented, which significantly increases the difficulty of cluster analysis. Therefore, each group of synonyms is replaced with a single word; for example, “pipeline” and “steam pipe” can both be represented by “pipeline”.
(3) Stop-word dictionary: The accident reports contain significant noise, including non-informative tokens, numerals without context, and extraneous punctuation marks, for instance “yes”, “3”, and “!”. These terms have no relevance to the analysis and are added to the stop-word dictionary of the Jieba segmentation system so that they are eliminated.
The three dictionaries directly affect segmentation accuracy, and any errors they introduce propagate to the downstream analysis. These dictionaries therefore need to be updated iteratively to form a Jieba segmentation system that suits this study.
A preliminary domain-specific lexicon was constructed prior to text segmentation. For this study, technical terms in chemical production were systematically extracted from the official web resources of major search engines (Baidu, Sogou, and Google), then standardized and incorporated into the specialized dictionary.
The study adopted the HIT stop-word lexicon as the baseline stop-word dictionary. Following corpus segmentation, domain experts rigorously validated the segmentation results. Identified synonymous expressions were consolidated by adding them to the synonym dictionary and standardizing the terms, while non-content tokens were appended to the extended stop-word lexicon. In this way, a set of dictionaries suited to the text segmentation of accident investigation reports in chemical safety was obtained. Applying this segmentation system to the original corpus yielded the final segmentation results used as data for subsequent analysis.
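The filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the text has already been segmented and part-of-speech tagged (with Jieba, such pairs could be obtained from `jieba.posseg.cut` after loading the domain dictionary with `jieba.load_userdict`), and the tag set, the function name `clean_tokens`, and the sample data are all hypothetical.

```python
# Part-of-speech tags to keep. These are assumed Jieba-style tags:
# n = common noun, nt = organization name, nz = other proper noun,
# v = verb, vn = gerund (verbal noun).
KEEP_POS = {"n", "nt", "nz", "v", "vn"}


def clean_tokens(tagged_tokens, synonyms, stop_words, keep_pos=KEEP_POS):
    """Filter (word, pos) pairs by part of speech, merge synonyms, drop stop words."""
    cleaned = []
    for word, pos in tagged_tokens:
        if pos not in keep_pos:
            continue  # discard parts of speech that are not of interest
        word = synonyms.get(word, word)  # map each synonym to its standard term
        if word in stop_words:
            continue  # discard non-informative tokens
        cleaned.append(word)
    return cleaned


# Hypothetical tagged output for one sentence of an accident report
# (shown in English for readability).
tagged = [
    ("steam pipe", "n"),
    ("rupture", "v"),
    ("3", "m"),
    ("yes", "v"),
    ("gas detector", "n"),
    ("!", "x"),
]
synonyms = {"steam pipe": "pipeline"}
stop_words = {"yes"}

tokens = clean_tokens(tagged, synonyms, stop_words)
print(" ".join(tokens))
```

Keeping the three dictionaries as plain data passed into the function makes it straightforward to revise them after each round of expert validation and rerun the segmentation.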

2.2. Keyword Extraction

Keyword extraction summarizes text content so that users can quickly grasp the topic and main points of a text, and it also improves the accuracy and efficiency of information retrieval. Commonly used keyword extraction algorithms are TF-IDF and BM25. These algorithms examine the associations between words and documents but overlook the role that the meaning of the words themselves plays in keyword extraction. Among words indicating safety risk factors, longer words usually carry more explicit and specialized information, and the BM25W model [14] is able to extract words that explicitly indicate safety risk factors more accurately. In addition, the specialized vocabulary in the domain lexicon is carefully screened by domain experts, so its semantic representation is clearer. The model therefore weights the BM25 model according to the semantics of the words themselves. First, the weight based on word length is shown in Equation (1), where $len(q_i)$ denotes the length of the word $q_i$ and $maxlen(d, q)$ the maximum word length in text $d$. Second, the weight based on the domain dictionary is shown in Equation (2).
$$\mathrm{weight}_{len}(q_i) = \frac{len(q_i)}{maxlen(d, q)} \tag{1}$$
$$\mathrm{weight}_{lexicon}(q_i) = \begin{cases} 0, & q_i \notin \text{domain lexicon} \\ 0.5 + \dfrac{100}{len(d)}, & q_i \in \text{domain lexicon} \end{cases} \tag{2}$$
If $q_i$ is not present in the domain lexicon, the weight is 0. Conversely, if the term exists in the domain dictionary, it is assigned an initial weight of 0.5, to which the ratio of 100 words to the length of document $d$, $100/len(d)$, is added. The sum of these two components serves as the semantic-based weighting factor, which is incorporated into the BM25 scoring formula as presented in Equation (3), where, as in standard BM25, $W_i$ is the weight of the word $q_i$ (typically its inverse document frequency) and $R(q_i, d)$ is the relevance score of $q_i$ to document $d$.
$$score(d, q, w) = \sum_{i} W_i \, R(q_i, d) \left( \mathrm{weight}_{len}(q_i) + \mathrm{weight}_{lexicon}(q_i) \right) \tag{3}$$
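To make Equations (1)–(3) concrete, the following sketch scores every word of every tokenized document. It is an illustration under stated assumptions, not the BM25W implementation of [14]: the IDF form used for $W_i$, the saturation form used for $R(q_i, d)$, and the parameters `k1` and `b` are the common BM25 defaults and are assumed here, as are the function name and the sample data.

```python
import math
from collections import Counter


def bm25w_scores(docs, domain_lexicon, k1=1.5, b=0.75):
    """Return one {word: score} dict per tokenized document (Equations (1)-(3))."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(w for d in docs for w in set(d))
    results = []
    for d in docs:
        if not d:
            results.append({})
            continue
        term_freq = Counter(d)
        max_len = max(len(w) for w in d)
        scores = {}
        for w, f in term_freq.items():
            # W_i: a commonly used smoothed IDF (assumed form).
            idf = math.log((n_docs - doc_freq[w] + 0.5) / (doc_freq[w] + 0.5) + 1.0)
            # R(q_i, d): standard BM25 term-frequency saturation (assumed form).
            r = f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avg_len))
            weight_len = len(w) / max_len  # Equation (1)
            weight_lex = 0.5 + 100 / len(d) if w in domain_lexicon else 0.0  # Equation (2)
            scores[w] = idf * r * (weight_len + weight_lex)  # one term of Equation (3)
        results.append(scores)
    return results


# Hypothetical tokenized documents and domain lexicon.
docs = [
    ["valve", "leak", "steam_valve"],
    ["valve", "fire"],
    ["leak", "training"],
]
scores = bm25w_scores(docs, domain_lexicon={"steam_valve"})
print(sorted(scores[0], key=scores[0].get, reverse=True))
```

In this toy corpus the domain term receives the highest score in the first document, which is the intended effect of the two semantic weights.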

2.3. LDA Topic Modeling

LDA is a generative probabilistic model [15] introduced by David Blei, Andrew Ng, and Michael Jordan in 2003 for revealing latent topics in textual data. The creation of LDA has brought significant progress in textual topic modeling. With the popularity of social media, the importance of textual data is increasing, posing new challenges to social science researchers. LDA models are widely used in social science research, including topic discovery and document categorization, due to their ability to extract topics from large amounts of text effectively.
Figure 4 depicts the LDA topic model framework.
The LDA model corresponding to Figure 4 has two main generative processes:
(1) $\alpha \rightarrow \theta_d \rightarrow Z_{dn}$: This process describes how the topic of the $n$th word in document $d$ is generated. First, the topic distribution $\theta_d$ of document $d$, which represents the mixture of topics in the document, is drawn from the Dirichlet distribution with hyperparameter $\alpha$. Then, the $n$th word in document $d$ is assigned its topic $Z_{dn}$ based on $\theta_d$.
(2) $\beta \rightarrow \varphi_k \rightarrow W_{dn}$: This process describes how the $n$th word in document $d$ is generated. First, $k$ topic-word distributions are obtained by sampling from the Dirichlet distribution with hyperparameter $\beta$. Subsequently, given the topic $Z_{dn}$ of the $n$th word in document $d$, the word $W_{dn}$ is generated from the corresponding topic-word distribution $\varphi_k$. This process captures the association between vocabulary and topic: a topic is determined for each word position in the document, and the word is then generated from the topic-word distribution.
The joint distribution of all variables is shown in Equation (4), where $N_d$ is the number of words in document $d$:
$$p(w_d, Z_d, \theta_d, \varphi \mid \alpha, \beta) = \prod_{n=1}^{N_d} p(w_{dn} \mid \varphi_{Z_{dn}}) \, p(Z_{dn} \mid \theta_d) \cdot p(\theta_d \mid \alpha) \cdot p(\varphi \mid \beta) \tag{4}$$
$W_{dn}$ in Equation (4) represents the $n$th word in the $d$th document. The text data are the object to be analyzed and mined, while the parameters $\alpha$ and $\beta$ are pre-set values that guide the generation of topics and vocabulary. Except for the observable word variable $W_{dn}$ and the prior parameters $\alpha$ and $\beta$, the other variables in the LDA model are hidden. Among these hidden variables, $Z_{dn}$ denotes the topic to which the $n$th word in document $d$ belongs, $\theta_d$ denotes the topic distribution of document $d$, and $\varphi_k$ denotes the vocabulary distribution of the $k$th topic. LDA establishes the link between documents and topics by inferring these hidden variables from the text data [16]. The model uses methods such as Bayesian inference for parameter estimation and learning to capture the topic structure and semantics of text.
The most commonly used parameter estimation methods for LDA topic models are Gibbs Sampling [17] and Variational Inference [18]. Compared to Variational Inference, Gibbs Sampling is a simple and intuitive Markov Chain Monte Carlo (MCMC) method [19] for sampling hidden variables and parameters from a joint distribution; given enough iterations, its samples approximate the posterior distribution and yield accurate parameter estimates. In the LDA model, Gibbs sampling updates the estimates of the model parameters step by step by sampling each variable given the other variables. Therefore, Gibbs sampling is used in this paper to construct the LDA model.
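The sampling procedure can be illustrated with a compact collapsed Gibbs sampler, in which $\theta_d$ and $\varphi_k$ are integrated out and only the topic assignments $Z_{dn}$ are resampled. This is a minimal sketch for small corpora, not the implementation used in this study; the function name, the integer word-id encoding, and the toy corpus are assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]                # document-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                                 # words assigned to each topic
    z = []                                              # topic of every word position
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(n_topics)                 # random initial topic
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Conditional distribution of Z_dn given all other assignments.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    # Point estimates of the document-topic and topic-word distributions.
    theta = [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(nkw[t][w] + beta) / (nk[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi


# Toy corpus with a vocabulary of four word ids.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, phi = lda_gibbs(toy_docs, n_topics=2, vocab_size=4, alpha=0.5, beta=0.01, n_iter=50)
```

Each row of `theta` is the estimated topic mixture of one document and each row of `phi` the estimated word distribution of one topic; both are normalized by construction.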

2.4. Association Rule Analysis

Initially introduced by Agrawal et al. [20] for market basket analysis, association rule mining serves as a computational method for discovering hidden relationships among item sets within transactional databases. As an important data mining technique, it effectively analyzes accident data, reveals accident-related patterns and causal relationships, and aids managerial decision-making, with proven applications in diverse fields [21,22].
Apriori, as a classical algorithm in association rule mining, has a solid theoretical foundation and a wide range of applications. It applies to various datasets and is easy to understand and implement. Although the Apriori algorithm is less efficient when dealing with large-scale datasets, it still performs well when the data size is not particularly large or the set of frequent items is small. The setting of the support and confidence parameters makes the results tunable and interpretable, which can meet the needs of different scenarios. The causal interdependencies among security risk factors exhibit a high degree of complexity and non-linearity, and there are many association rules with multiple sets, so this study uses the Apriori algorithm for association rule mining.
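The level-wise search at the core of Apriori can be sketched as follows. This is a simplified illustration that only finds frequent itemsets; the function name and the toy transactions are hypothetical, and no claim is made that it matches the implementation used in this study.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Toy transactions; each set lists the risk factors present in one report.
toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B"}]
frequent = apriori(toy, min_support=0.5)
```

The prune step is what keeps the search tractable: the candidate {A, B, C} is never counted because its subset {B, C} is not frequent.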

2.5. Bayesian Network Analysis

Bayesian networks provide a graphical representation of how random variables probabilistically influence one another, combining graph theory with probability theory. In the network structure, vertices correspond to random variables, while directed edges signify conditional dependency relationships between them, and the probability distribution of a node takes into account the conditional probability distribution of that node given its parent. Bayesian networks can effectively deal with uncertainty, provide flexible and accurate modeling for many practical problems, help to understand the probabilistic dependencies of complex systems and support inference and decision-making.
The three fundamental topological configurations of Bayesian networks are illustrated in Figure 5.
In the V-shaped structure of Figure 5a, one feature depends on the others, i.e., $x_1$ depends on $x_2$ and $x_3$. When the value of the child feature is unknown, the parent features are statistically independent, i.e., when $x_1$ is unknown, $x_2$ is independent of $x_3$, as shown in Equations (5) and (6):
$$\sum_{x_1} P(x_1, x_2, x_3) = \sum_{x_1} P(x_2) P(x_3) P(x_1 \mid x_2, x_3) \tag{5}$$
$$P(x_2, x_3) = P(x_2) P(x_3) \sum_{x_1} P(x_1 \mid x_2, x_3) = P(x_2) P(x_3) \times 1 = P(x_2) P(x_3) \tag{6}$$
In the common-parent structure of Figure 5b, multiple features depend on one feature, i.e., $y_2$ and $y_3$ depend on $y_1$. $y_2$ is independent of $y_3$ when $y_1$ is known, as in Equation (7):
$$P(y_2, y_3 \mid y_1) = \frac{P(y_1, y_2, y_3)}{P(y_1)} = \frac{P(y_1) P(y_2 \mid y_1) P(y_3 \mid y_1)}{P(y_1)} = P(y_2 \mid y_1) P(y_3 \mid y_1) \tag{7}$$
In the sequential structure of Figure 5c, one feature depends on a second feature, and a third feature depends on the first, i.e., $z_1$ depends on $z_2$, and $z_3$ depends on $z_1$. $z_2$ is independent of $z_3$ when $z_1$ is known, as in Equation (8):
$$P(z_2, z_3 \mid z_1) = \frac{P(z_1, z_2, z_3)}{P(z_1)} = \frac{P(z_2) P(z_1 \mid z_2) P(z_3 \mid z_1)}{P(z_1)} = \frac{P(z_1, z_2) P(z_3 \mid z_1)}{P(z_1)} = P(z_2 \mid z_1) P(z_3 \mid z_1) \tag{8}$$
The Bayesian network constructed in this paper is essentially composed of these three basic structures.
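The independence property of the V-shaped structure can be checked numerically. The sketch below builds the joint distribution of three binary variables from the factorization $P(x_2) P(x_3) P(x_1 \mid x_2, x_3)$ and marginalizes over $x_1$; all probability values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical priors of the two parent variables.
p_x2 = {0: 0.7, 1: 0.3}
p_x3 = {0: 0.6, 1: 0.4}
# Hypothetical conditional table: P(x1 = 1 | x2, x3).
p_x1_given = {(0, 0): 0.05, (0, 1): 0.40, (1, 0): 0.50, (1, 1): 0.90}


def joint(x1, x2, x3):
    """Joint probability from the V-structure factorization."""
    p1 = p_x1_given[(x2, x3)]
    return p_x2[x2] * p_x3[x3] * (p1 if x1 == 1 else 1.0 - p1)


pairs = list(product((0, 1), repeat=2))

# Summing out the child x1 should recover P(x2) * P(x3).
marginal = {(x2, x3): joint(0, x2, x3) + joint(1, x2, x3) for x2, x3 in pairs}
max_gap = max(abs(marginal[(x2, x3)] - p_x2[x2] * p_x3[x3]) for x2, x3 in pairs)

# Once x1 is observed, the parents are in general no longer independent.
p_x1 = sum(joint(1, x2, x3) for x2, x3 in pairs)
p_x2_given_x1 = sum(joint(1, 1, x3) for x3 in (0, 1)) / p_x1
p_x3_given_x1 = sum(joint(1, x2, 1) for x2 in (0, 1)) / p_x1
p_both_given_x1 = joint(1, 1, 1) / p_x1

print(max_gap)
print(p_both_given_x1, p_x2_given_x1 * p_x3_given_x1)
```

The first printed value is zero up to rounding, while the last two values differ, which shows that the independence of the parents holds only while the child is unobserved.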

3. Results

3.1. Data Preprocessing and Keyword Extraction

The raw corpus was segmented using the Jieba segmentation algorithm in Python 3.7, and from the resulting space-delimited corpus (words separated by spaces, as in English text), we identified 2898 distinct word features, a selection of which appears in Table 1. The standardized text format facilitates efficient keyword extraction in subsequent steps.
In this study, the BM25W model was used to obtain the importance scores of all feature words in the segmentation results, and some of the high-scoring feature words are shown in Table 2.
Word cloud analysis can show critical information quickly and intuitively, so the top 100 keywords by BM25W score are visualized in Figure 6. In Figure 6, the larger a word appears, the more important it is in the accident investigation reports; “equipment and facilities”, “supervision and management”, “safety education and training”, and similar terms occupy a prominent position.
By comparing the BM25W scores of feature words, it is possible to identify the words that are more thematically relevant. This screening method based on the BM25W algorithm considers not only word frequency but also the importance and contextual relevance of the words in the text, and thus reflects the contribution of the feature words to the text topic more accurately. Deleting the feature words with lower scores reduces noise and irrelevant information, which improves the performance and interpretability of the LDA topic model and enhances its effectiveness in practical applications.

3.2. LDA Topic Model Analysis

3.2.1. Estimation of the Optimal Number of Topics

Determining the number of topics in an LDA model is critical to data interpretation and clustering effectiveness. Too many topics may lead to overfitting of the model, introducing noise and reducing interpretability; too few may ignore essential information, resulting in underfitting and failing to capture the deep structure of the data. In large-scale corpora, the number of topics cannot be determined by experience alone, and a systematic estimate is needed to find the optimal number.
Established measures for determining the most suitable number of topics for LDA are perplexity and topic coherence. Perplexity reflects how uncertain the model is about the documents and tends to decrease as the number of topics increases, but too many topics may lead to overfitting. Topic coherence measures the semantic relatedness of the high-probability words within each topic and is another indicator of model quality. Many empirical studies have demonstrated the usefulness of topic coherence [23]. Generally, the higher the topic coherence score, the better the topic model. Therefore, this paper uses perplexity and topic coherence together to determine the optimal number of topics by calculating both scores under different numbers of topics. This approach balances model predictability and topic quality. It has been widely used in many studies [24,25,26,27,28] to help researchers select LDA model parameters suited to the data and thereby improve the performance and explanatory power of the topic model. Perplexity is calculated as in Equation (9):
$$Perplexity(D_{test}) = \exp\left( -\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right) \tag{9}$$
where $D_{test}$ denotes the set of all documents in the corpus, $M$ is the number of documents, $N_d$ is the word count of document $d$, $w_d$ denotes the words in document $d$, and $p(w_d)$ denotes the probability that the model generates the words $w_d$.
In this paper, we first set an upper bound n_max_topics on the number of topics, chosen so that a topic count above this bound would be unrealistic for the accident investigation reports of chemical enterprises. In this study, n_max_topics = 100, i.e., the number of topics of the LDA topic model takes integer values in the range 1–100. A Python program traverses all candidate numbers of topics. In each traversal, $\alpha$ takes the empirical value equal to the reciprocal of the current number of topics, $\beta$ is set to 0.01, and the number of sampling iterations is 1000. This yields the topic model and its perplexity and coherence scores for each candidate number of topics. The variation of perplexity and topic coherence with the number of topics is plotted in Figure 7. In the figure, when the number of topics is 50, the perplexity is at its inflection point with the smallest value, and the topic coherence is the largest. This number of topics is therefore taken as the best estimate.
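Equation (9) and the parameter grid described above can be written down directly. The sketch below is illustrative only: the function names are hypothetical, model training is deliberately left out, and the per-document log-likelihoods are assumed to come from whichever LDA implementation is used.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Equation (9): exp(-sum_d log p(w_d) / sum_d N_d)."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))


def candidate_settings(n_max_topics=100, beta=0.01, n_iter=1000):
    """Yield the hyperparameters used for each candidate number of topics."""
    for k in range(1, n_max_topics + 1):
        # alpha takes the empirical value 1 / (number of topics).
        yield {"n_topics": k, "alpha": 1.0 / k, "beta": beta, "n_iter": n_iter}


# Sanity check: a model that assigns every word probability 1/50
# has a perplexity of 50, regardless of document lengths.
lengths = [10, 30]
log_likelihoods = [n * math.log(1 / 50) for n in lengths]
uniform_perplexity = perplexity(log_likelihoods, lengths)

settings = list(candidate_settings())
print(uniform_perplexity, settings[49])
```

The uniform-model check gives a useful intuition: a perplexity of 50 means the model is, on average, as uncertain as if it were choosing among 50 equally likely words.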

3.2.2. Thematic Analysis

In this paper, LDA topic analysis yields 50 highly clustered groups of feature words, and the risk factor theme corresponding to each group is then summarized on the basis of expert experience and analysis. However, because the number of themes is itself an estimate, the results inevitably contain some unrealistic themes, which require further manual screening. After screening and removing the four noisy themes that do not conform to reality, 46 themes were finally analyzed, as presented in Table 3.
In this paper, the risk management framework based on the socio-technical systems theory proposed by Rasmussen [29] is used to analyze the risk factors of chemical safety accidents, which helps to analyze the correlations between macro- and micro-level factors in the chemical production system. Combining this risk management framework with the actual situation of the chemical industry, this study proposes five levels to categorize the risk causation of chemical accidents: regulatory authorities, chemical enterprises, site management, operators, and environment and equipment. It should be emphasized that the risk factor themes are manually summarized from the feature words under each theme. Some of the summarized risk factor themes therefore cover more than one of the original themes, in order to keep the dimensions of the risk factor themes consistent and to prevent significant errors caused by inconsistent magnitudes in subsequent calculations. These risk factor themes were integrated and deduplicated, and the final classification results are presented in Table 4.
After all the themes are obtained, the LDA model is used to compute the topic probability distribution of every accident investigation report. A report that contains a given risk factor theme is marked as 1 for that theme, and a report that does not is marked as 0. In this way, a structured Boolean (0–1) dataset is obtained, as presented in Table 5, which is analyzed using association rule mining and Bayesian network analysis in the following sections.
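One way to derive such a Boolean dataset from the document-topic distributions is sketched below. The paper does not state the rule used to decide whether a report "contains" a theme, so the probability threshold here is purely an assumption, as are the function name, the topic-to-factor mapping, and the sample values.

```python
def to_boolean_rows(doc_topic_probs, topic_to_factor, factors, threshold=0.1):
    """Mark a risk factor as 1 if any of its topics reaches the threshold."""
    rows = []
    for probs in doc_topic_probs:
        present = {
            topic_to_factor[t]
            for t, p in enumerate(probs)
            if t in topic_to_factor and p >= threshold  # noisy topics are unmapped
        }
        rows.append([1 if f in present else 0 for f in factors])
    return rows


# Hypothetical example: topics 0 and 1 were merged into risk factor C1,
# topic 2 corresponds to C2, and topic 3 was discarded as noise.
topic_to_factor = {0: "C1", 1: "C1", 2: "C2"}
factors = ["C1", "C2"]
doc_topic_probs = [
    [0.60, 0.30, 0.05, 0.05],
    [0.02, 0.03, 0.90, 0.05],
]
rows = to_boolean_rows(doc_topic_probs, topic_to_factor, factors)
print(rows)
```

Because several original topics can map to the same risk factor, the mapping is many-to-one, and a factor is marked present if any of its topics is sufficiently probable in the report.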

3.3. Association Rule Analysis

The accident level information is added to the Boolean dataset of accident risk causation information for chemical enterprises to obtain a data format suitable for association rule mining. In this paper, the accident level is categorized into general accidents (G), larger accidents (L), and major and above accidents (MS). The level of each accident investigation report is combined with the Boolean dataset obtained above to form the primary dataset for association rule mining, as detailed in Table 6.
Each entry in Table 6 corresponds to a textual record from an accident investigation report, i.e., an accident. A “1” indicates that the risk factor in the corresponding column appeared in the accident, and a “0” means that it did not. For example, as shown in Table 6, risk factor C1 appeared in the accident investigation report with serial number 5 but did not appear in the report with serial number 4. The rightmost column shows the accident level corresponding to each accident, forming a 514 × 34-dimensional dataset for association rule mining.
Support, Confidence, and Lift are three critical metrics to measure association rules, and strong association rules can be filtered based on these metrics [30].
Each accident investigation report is regarded as a “transaction”, denoted $D_n$, so the transaction set in this paper is $D = \{D_1, D_2, D_3, \ldots, D_i, \ldots, D_{514}\}$. Each safety risk factor in an accident investigation report is regarded as an “item”, denoted $F_i$, so the item set in this paper is $F = \{F_1, F_2, F_3, \ldots, F_{33}\}$.
The Support is the ratio of the number of transactions containing both itemsets $F_i$ and $F_j$ to the total number of transactions, and is calculated as shown in Equation (10):
$$\mathrm{Support}(F_i, F_j) = P(F_i \cup F_j) = \frac{P(F_i, F_j)}{P(D)} \tag{10}$$
The Confidence is the proportion of transactions containing itemset $F_i$ that also contain itemset $F_j$, and is calculated as shown in Equation (11):
$$\mathrm{Confidence}(F_i \Rightarrow F_j) = P(F_j \mid F_i) = \frac{P(F_i, F_j)}{P(F_i)} \tag{11}$$
The Lift reflects the correlation between $F_i$ and $F_j$ in an association rule, and is calculated as shown in Equation (12):
$$\mathrm{Lift}(F_i \Rightarrow F_j) = \frac{P(F_i, F_j)}{P(F_i) \times P(F_j)} = \frac{P(F_j \mid F_i)}{P(F_j)} \tag{12}$$
A value of $\mathrm{Lift}(F_i \Rightarrow F_j)$ less than 1 indicates that the appearance of $F_i$ does not promote the appearance of $F_j$; a value greater than 1 indicates that the rule $F_i \Rightarrow F_j$ is of practical significance, suggesting that the appearance of $F_i$ makes the appearance of $F_j$ more likely; and a value equal to 1 indicates that the two are independent of each other.
According to the above principles, to obtain more valuable and more strongly correlated rules between safety risk factors in chemical enterprises, after several experiments the minimum Support is set to 5%, the minimum Confidence is set to 20%, and the minimum Lift is set to 1.2. Finally, 53 strong association rules are obtained, as shown in Table 7.
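To make the three metrics and the thresholds concrete, the following Python sketch computes Support, Confidence, and Lift for every ordered pair of items in a small Boolean dataset and keeps the rules that pass the thresholds quoted above (5%, 20%, 1.2). It is a simplified illustration only: it considers single-item antecedents and consequents rather than the larger itemsets a full Apriori search would explore, and the function name `mine_pairwise_rules` and the ten toy transactions are hypothetical, not taken from Table 6.

```python
from itertools import permutations


def mine_pairwise_rules(rows, min_support=0.05, min_confidence=0.20, min_lift=1.2):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    n = len(rows)
    items = sorted(rows[0])

    def prob(*names):
        # Fraction of transactions in which every named item equals 1.
        return sum(all(row[name] == 1 for name in names) for row in rows) / n

    rules = []
    for a, b in permutations(items, 2):
        p_a, p_b, p_ab = prob(a), prob(b), prob(a, b)
        if p_a == 0 or p_b == 0:
            continue  # confidence or lift would be undefined
        confidence = p_ab / p_a   # Equation (11)
        lift = confidence / p_b   # Equation (12)
        if p_ab >= min_support and confidence >= min_confidence and lift >= min_lift:
            rules.append((a, b, p_ab, confidence, lift))
    return rules


# Ten toy transactions over three risk factor codes (hypothetical data).
columns = ("O5", "E4", "M1")
values = [
    (1, 1, 0), (1, 1, 1), (1, 1, 0), (1, 1, 0), (1, 0, 1),
    (0, 1, 1), (0, 0, 1), (0, 0, 1), (0, 0, 0), (0, 0, 0),
]
rows = [dict(zip(columns, v)) for v in values]
rules = mine_pairwise_rules(rows)
```

In this toy dataset only the rules between O5 and E4 survive, each with a Support of about 0.4, a Confidence of about 0.8, and a Lift of about 1.6; the pairs involving M1 have a Lift below 1 and are filtered out.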
The high-frequency risk factors obtained through association rule mining can be regarded as nodes in a Bayesian network, and the significant associations between risk factors correspond to its edges. The topology of the Bayesian network can be constructed from these associations to better describe the dependencies and probability distributions among risk factors. In addition, the Bayesian network can incorporate information such as accident levels, which can be used as observed or hidden nodes to improve the model’s accuracy and practicality.

4. Discussion

The Bayesian network is constructed by using the antecedents and consequents of the association rules as nodes, with directed edges pointing from each antecedent to its consequent. The structure then undergoes expert validation to filter out non-viable connections, ultimately producing the safety risk factor Bayesian network topology depicted in Figure 8. In this network, “Accident” is not a node obtained from the association rule mining but a new node representing chemical production accidents.
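The construction step can be sketched as follows, assuming the rules are available as (antecedent, consequent) pairs. The function name `rules_to_dag`, the `rejected` set standing in for the expert validation, and the example rules are illustrative assumptions rather than the contents of Table 7. Because a Bayesian network must be a directed acyclic graph, the sketch also checks for cycles using Kahn's topological sort.

```python
from collections import defaultdict, deque


def rules_to_dag(rules, rejected=frozenset()):
    """Turn (antecedent, consequent) rules into directed edges and verify
    that the result is acyclic. Returns (edges, topological_order)."""
    edges = sorted({(a, b) for a, b in rules if (a, b) not in rejected})
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for a, b in edges:
        children[a].append(b)
        indegree[b] += 1
        nodes.update((a, b))

    # Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    queue = deque(sorted(node for node in nodes if indegree[node] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(nodes):
        raise ValueError("rule set contains a cycle; remove or reverse an edge")
    return edges, order


# Hypothetical rules; the pair ("E4", "R2") plays the role of a connection
# that the experts rejected as non-viable.
rules = [("R2", "M1"), ("M1", "E6"), ("E6", "E1"), ("E1", "E4"),
         ("E4", "Accident"), ("E4", "R2")]
edges, order = rules_to_dag(rules, rejected={("E4", "R2")})
```

Without the rejected pair the example would contain the cycle R2 → M1 → E6 → E1 → E4 → R2 and the function would raise an error, which illustrates why a filtering step is needed before the topology can be used as a Bayesian network.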

4.1. Sensitive Risk Factor Analysis

Sensitivity analysis assesses how sensitive a model is to changes in its parameters and is categorized into univariate and multivariate analysis. In the chemical safety field, it helps identify key risk factors, assess their impact on safety, guide resource optimization and process improvement, and reduce the risk of accidents. Univariate analysis measures the sensitivity of the model to each factor individually, while multivariate analysis integrates the effects of multiple parameters. In addition, parametric and interval sensitivity analyses enhance model robustness and support decision-making.
The Bayesian network structure in this paper involves multiple risk factors, so multivariate sensitivity analysis is used. The sensitivity coefficient between two nodes in the Bayesian network structure is calculated as follows:
$$I_{REV}(F_i) = \frac{\max_j P(S = S_t \mid F_i = f_{ij}) - P(S = S_t)}{P(S = S_t)} \tag{13}$$
$$I_{RRV}(F_i) = \frac{P(S = S_t) - \min_j P(S = S_t \mid F_i = f_{ij})}{P(S = S_t)} \tag{14}$$
$$I_{AVG}(F_i) = \frac{I_{REV}(F_i) + I_{RRV}(F_i)}{2} \tag{15}$$
In the above equations, $S$ is a child node and $S_t$ is its state; $F_i$ is the $i$-th parent node of the child node $S$, and $f_{ij}$ is the $j$-th state of that parent node. $I_{REV}(F_i)$ denotes the risk expansion performance of the parent node $F_i$, $I_{RRV}(F_i)$ denotes its risk-reducing performance, and the average of the two, $I_{AVG}(F_i)$, is the sensitivity coefficient between the parent node $F_i$ and the child node $S$.
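As a small worked illustration of these three coefficients, the Python sketch below evaluates them for one parent node from the prior probability of the target state and the probabilities of that state conditional on each parent state. The function name and the numbers are hypothetical; in the paper these probabilities come from inference on the Bayesian network in GeNIe 4.0.

```python
def sensitivity_coefficients(p_target, p_target_given_states):
    """Compute (I_REV, I_RRV, I_AVG) for one parent node.

    p_target:              P(S = S_t), the prior probability of the target state
    p_target_given_states: parent state -> P(S = S_t | F_i = f_ij)
    """
    if not 0 < p_target <= 1:
        raise ValueError("P(S = S_t) must lie in (0, 1]")
    conditionals = list(p_target_given_states.values())
    i_rev = (max(conditionals) - p_target) / p_target  # risk expansion
    i_rrv = (p_target - min(conditionals)) / p_target  # risk reduction
    return i_rev, i_rrv, (i_rev + i_rrv) / 2


# Hypothetical numbers: prior accident probability 0.2, rising to 0.3 when
# the factor is present and falling to 0.15 when it is absent.
i_rev, i_rrv, i_avg = sensitivity_coefficients(0.2, {"present": 0.3, "absent": 0.15})
```

With these inputs the risk expansion value is about 0.5, the risk reduction value is about 0.25, and the sensitivity coefficient is about 0.375.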
In this paper, we hope to discover the sensitive factors affecting the accident (“Accident” node) through sensitivity analysis. Therefore, sensitivity analysis is performed on the Bayesian network using GeNIe 4.0, with ‘Accident’ selected as the target node. As shown in Figure 9, nodes with darker colors exhibit greater sensitivity.
The computed sensitivity values of the complete set of safety risk factors are visualized in Figure 10. In this paper, factors with a sensitivity value greater than 0.01 are regarded as sensitive factors of accidents. In descending order, they are non-compliant operation O5, inadequate equipment maintenance and management E5, operators not licensed or not accompanied by supervisors O2, inadequate risk identification for special operations O3, unreliable equipment and facilities E1, adventure organization operations O6, failure to wear protective equipment O1, abnormal pressure/temperature values E3, and careless or misrepresentation of the work site inspection M1. The analysis reveals that while numerous factors contribute to chemical production accidents, the sensitive factors are concentrated principally at the operator level. In particular, the sensitivity value of O5 exceeds that of M1 by 166.83%, while the sensitivity value of E5 exceeds that of M1 by 4.04%.
According to Human Factors Engineering (HFE) and High-Reliability Organizing (HRO) theories, operating personnel are critical in a high-risk organization because they are directly involved in operations and can respond quickly to accidents and prevent them from deteriorating. High-risk operations in chemical production, such as hot work and work at height, require operators to have specialized skills and psychological qualities, including chemical knowledge, equipment operation skills, and emergency response capabilities. They must stay alert, make quick decisions, and respond effectively to emergencies. In addition, continuous training and education are vital for improving operator competence and keeping up with safety standards. The chemical industry requires operators to be fully competent, both technically and psychologically, in dealing with challenges. Safety managers should therefore optimize safety investment, control the sensitive factors, and prevent accidents to improve production safety.

4.2. Critical Causal Path Analysis

Following the sensitivity analysis, diagnostic analysis of the Bayesian network was carried out in GeNIe 4.0. The diagnostic approach reconstructs the most credible accident causation chain through backward reasoning in the Bayesian network, so that the most probable path leading to the accident can be found. First, all evidence is cleared from the nodes. Evidence is then set on the “Accident” node, and the posterior probabilities of its parent nodes are computed by the software. For the accident node $A$, the occurrence of the accident (for example, a general accident) is assumed to be certain, i.e., $P(A = 1) = 1$, and its parent nodes are denoted $X_i$. The posterior probability of each parent node of $A$ is calculated as detailed in Equation (16):
$$P(X_i = 1 \mid A = 1) = \frac{P(A = 1, X_i = 1)}{P(A = 1)} \tag{16}$$
According to Equation (16), the parent node with the largest posterior probability is identified, evidence is set on that node, and the procedure is repeated until a root node is reached. In this way, the critical causal pathways for general accidents, larger accidents, and major and above accidents were identified, as illustrated in Figure 11a–c.
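The greedy backward tracing described above can be sketched in Python. The sketch makes a strong simplifying assumption: instead of exact Bayesian network inference as performed by GeNIe, each posterior is estimated as an empirical conditional frequency over a Boolean dataset, conditioned on all evidence set so far. The function name `trace_critical_path`, the parent map, and the six toy records are hypothetical, and the parent map is assumed to describe a directed acyclic graph.

```python
def trace_critical_path(rows, parents, target="Accident"):
    """Walk backward from the target, repeatedly setting evidence on the
    parent with the largest estimated posterior, until a root is reached."""
    evidence = {target: 1}
    path = [target]
    current = target
    while parents.get(current):
        # Keep only the records that agree with all evidence set so far.
        matching = [r for r in rows if all(r[k] == v for k, v in evidence.items())]
        if not matching:
            break  # no data left to estimate a posterior from

        def posterior(node):
            # Empirical estimate of P(node = 1 | evidence).
            return sum(r[node] for r in matching) / len(matching)

        best = max(sorted(parents[current]), key=posterior)
        path.append(best)
        evidence[best] = 1  # "set evidence" on the chosen parent
        current = best
    return path[::-1]  # root cause first, accident last


# Hypothetical network fragment and records.
parents = {"Accident": ["E4", "O5"], "E4": ["E1"], "E1": ["E6"]}
columns = ("Accident", "E4", "O5", "E1", "E6")
values = [
    (1, 1, 0, 1, 1), (1, 1, 1, 1, 1), (1, 1, 0, 1, 0),
    (1, 0, 1, 0, 0), (0, 0, 0, 0, 0), (0, 0, 1, 0, 1),
]
rows = [dict(zip(columns, v)) for v in values]
path = trace_critical_path(rows, parents)
```

In the toy records, E4 is present in three of the four accidents and O5 in two, so E4 is chosen at the first step and the trace continues through E1 to the root E6.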
As can be seen in Figure 11, although the accident categories and the propagation probabilities of the key causal factors differ, the second half of the path is the same for all kinds of accidents. It begins with the failure of safety monitoring and surveillance equipment, which may include the failure of sensors, instruments, or monitoring systems and which impairs the real-time monitoring and detection of the production process or equipment status. The failure of safety monitoring and surveillance equipment in turn makes the related equipment and facilities unreliable. Because the role of monitoring equipment is to detect and correct potential problems promptly, a monitoring failure may mean that problems with the equipment go unnoticed or that the necessary action is not taken in time. This may leave the entire production system in an unreliable state, and unreliable equipment and facilities may cause further problems in the production process, one of which is leakage. Without adequate monitoring and control, it may not be possible to detect, isolate, or repair potential sources of leakage in time, which increases the risk of leaks of hazardous substances such as chemicals and gases. Ultimately, these leaks can lead to accidents, including fires, explosions, and chemical spills, depending on the category and quantity of the leaked substance, posing a potential threat to the safety of people and the environment. The earlier accident type statistics also showed that accident types closely related to leakage problems, such as explosions, fires, poisoning, and asphyxiation, accounted for 94% of the total number of accidents, which corroborates the validity of the accident causation chain.
Inadequate regulation is the root cause of both general accidents and major and above accidents. Negligence on the part of the regulator may lead to irregular contracting and the involvement of unqualified firms in production, resulting in safety management deficiencies and non-compliance issues. This may in turn lead to lax or falsified site inspections and affect the maintenance of safety management equipment, such as safety monitoring equipment, which may then fail. Failure to wear protective equipment is particularly prominent in major and above accidents. This may be because lax or falsified inspections prevent emergency rescue and drills from being implemented effectively, so that rescue workers may have expanded the accident in its early stages by not wearing protective equipment or by attempting rescue blindly.
Larger accidents are often triggered by poor tracking and rectification of hidden dangers, which leads to an inadequate safety system, including failure to correct hidden safety problems promptly, insufficient training, and loopholes in management systems. This results in insufficiently detailed on-site inspections and inaccurate identification and assessment of risks and hazards. An insufficient safety management framework may also lead to the misrepresentation of inspection results and the covering up of problems, leading to accidents.
According to domino theory, addressing these issues can effectively prevent accidents by cutting off the most likely pathways leading to them. However, this does not mean focusing only on the critical causal pathways and ignoring other safety risk factors. Given the structural complexity of Bayesian networks and the interconnected nature of safety factors, accidents may still occur through other paths after a node is cut off. Therefore, enterprise safety management needs to cut off the critical causal paths while watching for newly formed paths, and to adjust the safety governance strategy continuously as the critical paths change.

4.3. Statistical Analysis of Frequency

Statistical frequency-based accident risk causation analysis can provide a comprehensive understanding and analysis of historical accident data, revealing potential patterns and regularities in accident occurrence. Through extensive collection of accident data, high-frequency causal factors can be identified, and their relevance and impact can be analyzed.
The frequency of each safety risk factor is calculated as follows:
$$P(w_i) = \frac{\mathrm{Count}(D_{w_i})}{\mathrm{Count}(D)} \tag{17}$$
where $\mathrm{Count}(D)$ denotes the total number of accident investigation reports, $w_i$ ($i = 1, 2, \ldots, 33$) refers to the 33 safety risk factors, and $\mathrm{Count}(D_{w_i})$ denotes the number of accident investigation reports that contain risk factor $w_i$. The results of the frequency statistics are detailed in Figure 12.
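The frequency statistic amounts to a column mean over the Boolean dataset. A minimal Python sketch follows; the function name `factor_frequencies` and the five toy reports are hypothetical, and the 0.3 cut-off is used only to illustrate how high-frequency factors could be selected.

```python
def factor_frequencies(rows):
    """Return factor -> share of reports that contain the factor."""
    total = len(rows)
    return {factor: sum(row[factor] for row in rows) / total for factor in rows[0]}


# Five toy reports over three risk factor codes (hypothetical data).
columns = ("R2", "E4", "C7")
values = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (1, 1, 0), (0, 0, 0)]
rows = [dict(zip(columns, v)) for v in values]

freq = factor_frequencies(rows)
# Factors whose frequency exceeds the cut-off, most frequent first.
high_frequency = sorted((f for f, p in freq.items() if p > 0.3),
                        key=lambda f: freq[f], reverse=True)
```

Here R2 appears in four of five reports and E4 in three, so both exceed the cut-off, while C7 appears in only one report and does not.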
As can be seen from Figure 12, the statistical frequency of the first 15 items is greater than 0.3, and there is a clear gap in frequency between them and the remaining items. These 15 items are: failure to conscientiously fulfill the responsibility of safety supervision R2, leakage problems E4, abnormal concentration of hazardous gases E2, abnormal pressure/temperature values E3, operators not licensed or not accompanied by supervisors O2, hidden dangers rectification and tracking is not in place R1, unreliable equipment and facilities E1, inadequate equipment maintenance and management E5, careless or misrepresentation of the work site inspection M1, inadequate fire and explosion prevention measures M5, inadequate safety education and training C1, failure to wear protective equipment O1, inadequate safety system C2, non-compliant operation O5, and failure of safety monitoring and control equipment E6.
It can be seen that the environment and equipment factors almost all fall within the top 15 by frequency. This shows that whatever the initial cause or trigger of an accident, it eventually manifests itself in these environment and equipment factors. Safety management personnel therefore need to strengthen the monitoring of these aspects in daily risk warning work, so as to achieve early warning and nip accidents in the bud. Apart from the environment and equipment factors, the remaining top-15 factors are distributed fairly evenly among the other four types of risk factors, which indicates that the environment and equipment factors are not the root cause of accidents; the root cause often lies with people and management.
After obtaining the results of the sensitive risk factor analysis, critical causal path analysis, and frequency statistical analysis, it was found that some factors appear in the results of all three analyses, indicating that these factors are critical in chemical enterprise accidents. The intersection of the three results therefore determines the critical risk factors leading to chemical enterprise accidents. The factors that appear in at least one analysis but not in all three are the important risk factors, and the remaining ones are the general risk factors, as given by the following formulas:
$$W_k = A_s \cap A_r \cap A_f \tag{18}$$
$$W_i = (A_s \cup A_r \cup A_f) - W_k \tag{19}$$
$$W_o = U - (A_s \cup A_r \cup A_f) \tag{20}$$
where $A_s$, $A_r$, and $A_f$ refer to sensitive factors, causal path factors, and high-frequency factors, respectively. $W_k$, $W_i$, and $W_o$ refer to critical, important, and general factors, respectively, and $U$ is the set of all risk factors in the structure of the Bayesian network. The results are detailed in Table 8.
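These set operations map directly onto Python's built-in set type. The sketch below uses six placeholder factors F1 to F6 rather than the paper's actual factor sets, so its output should not be read as a reproduction of Table 8; the function name `classify_factors` is likewise an assumption.

```python
def classify_factors(sensitive, causal_path, high_frequency, universe):
    """Split the universe of risk factors into critical, important, general."""
    union = sensitive | causal_path | high_frequency
    critical = sensitive & causal_path & high_frequency  # in all three analyses
    important = union - critical                         # in one or two analyses
    general = universe - union                           # in none of them
    return critical, important, general


# Placeholder sets (hypothetical).
universe = {"F1", "F2", "F3", "F4", "F5", "F6"}
sensitive = {"F1", "F2", "F3"}
causal_path = {"F1", "F3", "F4"}
high_frequency = {"F1", "F2", "F5"}

critical, important, general = classify_factors(
    sensitive, causal_path, high_frequency, universe
)
```

In this example F1 is critical because it appears in all three sets, F2 to F5 are important, and F6 is general. The three resulting sets are disjoint and together cover the universe.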
Through comprehensive analysis, the general factors have less influence on chemical enterprise accidents, so the analysis and the corresponding control scheme focus on the critical and important factors, i.e., unreliable equipment and facilities E1, failure to wear protective equipment O1, careless or misrepresentation of the work site inspection M1, non-compliant operation O5, inadequate equipment maintenance and management E5, operators not licensed or not accompanied by supervisors O2, inadequate risk identification for special operations O3, adventure organization operations O6, abnormal pressure/temperature values E3, leakage problems E4, failure to conscientiously fulfill the responsibility of safety supervision R2, abnormal concentration of hazardous gases E2, hidden dangers rectification and tracking is not in place R1, inadequate fire and explosion prevention measures M5, inadequate safety education and training C1, inadequate safety system C2, failure of safety monitoring and control equipment E6, and illegalized contracting C7.
This paper found that, among these risk factors, E1, O5, O1, and E4 are often characterized by visual information. For example, failure to wear protective equipment O1 may manifest itself in a poisoning accident when on-site personnel attempt a rescue without effective protective equipment, which can be detected by checking whether the rescuers are wearing masks, gas masks, or other face protection. Leakage problems E4 may manifest themselves as a pipeline or plant suddenly releasing a large amount of visible gas at some location, or even starting to produce open flames.
As for unreliable equipment and facilities E1, the most common manifestations at chemical enterprise sites are frequent failures and shutdowns, equipment aging and wear, substandard maintenance and repair, missing or damaged parts, and inefficient automation and control systems. According to trajectory intersection theory, when equipment and facilities become unreliable, casualties are very likely if a person is present at the same location, whereas casualties can be effectively avoided if no person is in the vicinity.
The chemical production site environment is complex and the incidence of accidents is high, and personnel intrusion into hazardous areas is a direct cause of accidents. A computer-vision-based method for detecting personnel intrusion into hazardous areas can effectively compensate for the shortcomings of manual supervision in chemical safety management and thus reduce the probability of accidents.
The remaining risk factors are often characterized by textual information. For example, careless or misrepresentation of the work site inspection M1 may manifest itself in safety checklists for the operation site not being filled out on time, or in the content of hazardous operations management review checklists. Inadequate safety education and training C1 may manifest itself in the enterprise not having arranged regular safety education and training courses for its employees, so that there are no corresponding class schedules or course materials, or not having carried out regular safety examinations, so that there are no examination papers or transcripts.
Together, this visual and textual information characterizes the above risk factors and can further reflect the safety management level of the enterprise. It is therefore proposed to establish a safety production platform that integrates textual and visual information. By combining images and text, such a platform can capture potential risks and hidden dangers in chemical production more clearly and provide more timely and effective early warnings and countermeasures for the relevant departments, so as to minimize safety risks in chemical production and safeguard people’s lives and property.

5. Conclusions

Because its production processes involve flammable, explosive, and toxic substances and its production environment is complex and changeable, the chemical industry experiences frequent safety accidents, which pose a severe threat to human life, property, and social stability. To address the inherent constraints of conventional chemical safety management approaches, this study presents a modern, data-driven approach to improve the efficiency of safety management.
For data preprocessing, the chemical safety accident investigation reports were segmented using the Jieba word segmentation tool, and segmentation accuracy was improved by using a domain dictionary, a synonym dictionary, and a stop-word dictionary. The BM25W algorithm was used for keyword extraction to provide input for the LDA model. Next, the chemical safety accident cases were thematically analyzed using the LDA model to identify the major risk factors. The Apriori algorithm was then used in association rule analysis to mine the associations between risk factors. Finally, a Bayesian network model was used to analyze the causal relationships between risk factors. The approach established in this research, which combines an improved LDA topic model with a Bayesian network, aims to improve the identification and analysis of chemical safety risks. Thirty-three major risk factors were identified and categorized through text mining and model analysis.
For model evaluation, sensitivity analysis was used to assess the model’s sensitivity to each risk factor and to identify the key risk factors. The possible paths of accident development were inferred through critical causal path analysis combined with Bayesian network diagnosis. Frequency statistical analysis highlighted the importance of recognizing and preventing common risk factors in chemical safety accidents and the necessity of taking targeted measures in safety management. The findings of this paper not only reveal the potential causes and patterns of chemical safety accidents but also furnish a theoretical framework for chemical plant safety management, helping enterprises to optimize the allocation of resources and improve the level of safety management.
The method proposed in this paper can effectively extract keywords from chemical safety accident reports and reveal the correlation and causality between risk factors. It also provides a new analytical tool for the chemical safety field, which helps to identify and prevent potential safety risks in advance, makes safety management more scientific and systematic, and better protects the safety of personnel and property, and is therefore of significance for promoting the sustainable development of the chemical industry. The method can deal with the complexity of the chemical production site and offers a solution for the intelligent and automated development of chemical safety management tools in the future. By processing unstructured data effectively, risk management no longer relies only on limited experience and one-sided information but can establish more comprehensive and accurate risk identification based on the analysis of a large amount of data, enriching chemical safety risk management theories and methods and promoting the transformation of chemical safety management toward intelligence. This paper also has some limitations. The chemical industry is diverse, and the types of hazardous chemicals involved are complex. Future studies will refine the analysis by distinguishing between different sub-industries of the chemical industry and their different types of hazardous chemicals in safety risk identification and analysis, so as to make the model more versatile.

Author Contributions

Conceptualization, Z.Z.; methodology, Z.Z. and J.G.; software, Z.Z. and J.G.; validation, Z.Z., J.G. and J.H.; formal analysis, Z.Z. and J.G.; investigation, Z.Z. and J.G.; resources, Z.Z.; data curation, Z.Z., J.G. and J.H.; writing—original draft preparation, Z.Z., J.G. and J.H.; writing—review and editing, Z.Z. and J.G.; visualization, Z.Z. and J.G.; supervision, J.H.; project administration, Z.Z.; funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CGS     Collapsing Gibbs Sampling
LDA     Latent Dirichlet Allocation
MCMC    Markov Chain Monte Carlo
HFE     Human Factors Engineering

References

  1. Lassak, P.; Labovsky, J.; Jelemensky, L. Influence of parameter uncertainty on modeling of industrial ammonia reactor for safety and operability analysis. J. Loss Prev. Process Ind. 2010, 23, 280–288.
  2. Rathnayaka, S.; Khan, F.; Amyotte, P. SHIPP methodology: Predictive accident modeling approach (Part I: Methodology and model description). Process Saf. Environ. Prot. 2011, 89, 151–164.
  3. Jain, P.; Pasman, H.J.; Waldram, S.P.; Rogers, W.J.; Mannan, M.S. Did we learn about risk control since Seveso? Yes, we surely did, but is it enough? An historical brief and problem analysis. J. Loss Prev. Process Ind. 2017, 49, 5–17.
  4. Tan, X.Q. Research on Evaluation Model for Safety Capacity of Chemical Industrial Park Based on Regional Risk and Its Application in Chemical Industrial Park. Master’s Thesis, South China University of Technology, Guangzhou, China, 2011.
  5. Senave, E.; Jans, M.J.; Srivastava, R.P. The application of text mining in accounting. Int. J. Account. Inf. Syst. 2023, 50, 100624.
  6. Zhang, T.T.; Chen, K.; Li, B.Z. Document keyword extraction based on semantic hierarchical graph model. Scientometrics 2023, 128, 2623–2647.
  7. Zhang, K. Web News Data Extraction Technology Based on Text Keywords. Complexity 2021, 2021, 5529447.
  8. Zhao, F.Y.; Ren, X.B.; Yang, S.S. Latent Dirichlet Allocation Model Training With Differential Privacy. IEEE Trans. Inf. Forensics Secur. 2021, 16, 1290–1305.
  9. Li, Z.X.; Nie, F.P.; Wang, R. A Revised Formation of Trace Ratio LDA for Small Sample Size Problem. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 5803–5809.
  10. Wu, D.; Yang, R.X.; Shen, C. Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm. J. Intell. Inf. Syst. 2021, 56, 1–23.
  11. Jia, Y.R.; Pang, Y.C. Analysis of the Causes of Construction Safety Accidents Based on LDA Modelling. Eng. Constr. 2025, 57, 72–78.
  12. Zhang, S.L.; Sun, H. Analysis of Knowledge Mapping and LDA Topic Modelling of Forestry Ecological Construction in China. China For. Spec. Prod. 2025, 80–82.
  13. Yang, L. Causation Analysis and Risk Study of Rail Transportation Accidents Based on Text Data. Ph.D. Thesis, Beijing Jiaotong University, Beijing, China, 2021.
  14. Zhou, Z.Y.; Huang, J.H.; Lu, Y. A new text mining-Bayesian network approach for identifying chemical safety risk factors. Mathematics 2022, 10, 4815.
  15. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  16. Wang, P.; Gao, C.; Chen, X.M. Text clustering study based on LDA model. Intell. Sci. 2015, 33, 63–68.
  17. Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 1, 5228–5235.
  18. Ramesh, N.; William, C.; John, L. Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), Omaha, NE, USA, 28–31 October 2007; pp. 349–354.
  19. Brooks, S. Markov chain Monte Carlo method and its application. J. R. Stat. Soc. Ser. D (Stat.) 1998, 47, 69–100.
  20. Agrawal, R.; Imielinski, T.; Swami, A. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, 25–28 May 1993; pp. 207–216.
  21. Chen, M.S.; Han, J.W.; Yu, P.S. Data mining: An overview from a database perspective. IEEE Trans. Knowl. Data Eng. 1996, 8, 866–883.
  22. You, M.J. Research on Coal Mine Safety Risk Identification and Evaluation Based on Text Mining. Ph.D. Thesis, China University of Mining and Technology, Beijing, China, 2022.
  23. Roder, M.; Both, A.; Hinneburg, A. Exploring the Space of Topic Coherence Measures; Association for Computing Machinery: New York, NY, USA, 2015; pp. 399–408.
  24. Cao, J.; Xia, T.; Li, J.T. A density-based method for adaptive LDA model selection. Neurocomputing 2009, 72, 1775–1781.
  25. Biggers, L.R.; Bocovich, C.; Capshaw, R. Configuring latent Dirichlet allocation based feature location. Empir. Softw. Eng. 2014, 19, 465–500.
  26. Lv, L.C.; Zhou, J.; Wang, X.Z. A framework for analyzing technology evolution based on a two-layer thematic model and its application. Data Anal. Knowl. Discov. 2022, 6, 18–32.
  27. Xiang, Z.Y.; Wu, Y.; Chen, H. A Study on Discovery of Hot Topics in Microblogs Based on the Improved Algorithm of Burst Word to Topic Modeling. Intell. J. 2022, 41, 104–112.
  28. Sievert, C.; Shrley, K.E.; Davis, L. A method for visualizing and interpreting topics. In Proceedings of the Association for Computational Linguistics, Baltimore, MD, USA, 22–27 June 2014.
  29. Rasmussen, J. Risk management in a dynamic society: A modelling problem. Saf. Sci. 1997, 27, 183–213.
  30. Nenonen, N. Analysing factors related to slipping, stumbling, and falling accidents at work: Application of data mining methods to Finnish occupational accidents and diseases statistics database. Appl. Ergon. 2012, 44, 215–224.
Figure 1. Chemical accidents and fatalities in China (2016–2022).
Figure 1. Chemical accidents and fatalities in China (2016–2022).
Applsci 15 06197 g001
Figure 2. Number of accidents of different severity.
Figure 2. Number of accidents of different severity.
Applsci 15 06197 g002
Figure 3. Percentage of different accident types.
Figure 3. Percentage of different accident types.
Applsci 15 06197 g003
Figure 4. Diagram of LDA subject model.
Figure 4. Diagram of LDA subject model.
Applsci 15 06197 g004
Figure 5. The 3 basic structures of Bayesian networks: (a) V-structure; (b) same parent structure; (c) sequential structure.
Figure 5. The 3 basic structures of Bayesian networks: (a) V-structure; (b) same parent structure; (c) sequential structure.
Applsci 15 06197 g005
Figure 6. Visualized word cloud of the top 100 keywords.
Figure 6. Visualized word cloud of the top 100 keywords.
Applsci 15 06197 g006
Figure 7. Perplexity and thematic coherence curves.
Figure 7. Perplexity and thematic coherence curves.
Applsci 15 06197 g007
Figure 8. Bayesian network structure of chemical company accidents.
Figure 9. Results of Bayesian network sensitivity analysis.
Figure 10. Sensitivity values for safety risk factors.
Figure 11. Critical causal transmission pathways for different classes of accidents: (a) critical causal transmission pathways for general accidents; (b) critical causal transmission pathways for larger accidents; (c) critical causal transmission pathways for major and above accidents.
Figure 12. Statistical frequency of accident safety risk factors.
Table 1. Partial word segmentation results.
Serial Number of Corpus | Segmentation Results
1 | Construction Site Responsible Person Construction Shop Demolition Fractionated Wastewater Distillation … Review Employee Qualifications
2 | Record Shift Monitoring Tank Monitoring Transfer Record Monitoring Tank Monitoring … Safety Education and Training
3 | Workshop personnel handover responsible for liquid chlorine workshop inspections … supervisory and management capabilities
4 | Da Fang enterprise staff enterprise workshop pipe cutting operation safety technology submission … rectification responsible for
5 | Enterprise production workshop vacuum work production epoxy resin production site … enterprise elimination
513 | Tufa production responsible person government departments production employees personnel idle play … supervisory and management responsibilities
514 | Enterprise flammable and explosive gas production process turnover personnel material suppliers … inspection and maintenance operations
Table 2. Selected feature words with high BM25W scores.
Characteristic Word | BM25W Value
corporations | 3.7427
worker | 2.7936
safety education training | 2.6337
officers | 2.5535
equipment and facilities | 2.5319
probe | 2.4833
operation | 2.4556
supervise and manage | 2.3683
scene | 2.0720
reporting | 1.7376
blast | 1.7325
illegal and irregular | 1.7306
produce | 1.6001
flammable and explosive gases | 0.9053
toxic and hazardous gas | 0.8988
special operation | 0.8837
explosive mixture | 0.7736
tank | 0.7735
asphyxia due to poisoning | 0.7690
special equipment | 0.7471
reaction kettle | 0.7407
inflammable | 0.7284
workplace | 0.7045
pressure piping | 0.6980
break apart | 0.6764
safety management system | 0.6745
dynamo | 0.6696
occupation permit | 0.6586
wear | 0.6375
limited space | 0.6354
Table 3. Theme mining results of accident risk factors in chemical enterprises.
Serial Number | Thematic Trait Words | Risk Factor Theme
1 | Supervision and management of enterprises in violation of the law duties of government departments to supervise and guide the review and inspection of the corporate sector | Failure to conscientiously fulfill safety supervision responsibilities
2 | Business Operations Operator On-Site Personnel Inspection Report Operations Site Responsible for False Reporting | Insufficiently detailed or misrepresented job site inspections
3 | Production enterprise production process phase-out workshop compliance risk operation backward and old | Outdated and obsolete production processes
4 | Lack of protection from limited space safety measures at the operating enterprise for rescuing poisoned and asphyxiated persons | Failure to wear protective equipment
5 | Enterprise personnel safety education and training operating procedures for the development of post record system leads to | Inadequate safety education and training
6 | Operations Operator Operations Ticket On-Site Supervisor License Equipment and Facilities Worksite Special Operations Safety Education and Training | Workers not licensed or not accompanied by a guardian on duty
7 | Equipment and Facilities Inspection Hidden Hazards Leakage Supervision and Management Rectification Fracture Management Production Process Supervision | Inadequate tracking of hidden dangers and corrective actions
8 | Enterprise equipment and facilities rupture agglomeration design leakage production process concentration old environment | Unreliable equipment and facilities
9 | Explosive Gas Explosive Mixture Equipment Facility Air Ignition Source Damage Inspection Combustion Mixing | Unusual concentrations of hazardous gases
10 | Vehicle Transportation Enterprises Driver Qualification Operator Tank Illegal Violations Monitoring Supervision | Inadequate transportation management
11 | Pressure beyond the equipment and facilities valve operating procedures operating post alarm interlock site | Abnormal pressure/temperature values
12 | Report the leakage of the shift supervisor in the central control room on duty to inspect the shift handover post site departure | Inadequate shift handovers or technical safety briefings
13 | Dust Burning Fire Production Explosion Protection Illegal Violations Employee Equipment Facility Personnel | Inadequate fire and explosion prevention measures
14 | Construction Workers Qualification Welding Outsourcing Supervision and Management Construction Unit Contractor Violation of Laws and Regulations Safety Technical Briefing | Non-compliance with contracting regulations
15 | Field Operator Spill Personnel Reporting Business Inspection Equipment and Facilities Smoke Operating Procedures | Leakage problem
16 | Special Operations Inspection Ventilation Development Rescue Application Safety Education Training Risk Identification Blind Workers | Inadequate risk identification for special operations
17 | Illegal explosion in the workplace illegal operation of the operator employee production process illegal operation | Weak safety awareness
18 | Business Cutting Ignition Operator Pipeline Explosion Operation Ticket Unlicensed Employee Installation | Workers not licensed or not accompanied by a guardian on duty
19 | Pressure pipeline valve leakage splash site flange removal operation shift supervisor inspection | Leakage problem
20 | Inspection and Maintenance Operations Inspection Personnel Reporting Programs Parking Responsible Measures System Failures | Inadequate management of equipment maintenance
21 | Pipeline Leak Site Safety Measures Report Responsible Person Personnel Coordination Lack of Protection | Inadequate production safety system
22 | Program Sampling Instrumentation Personnel Reporting Approval Analysis Manager Illegal Leaving | Reporting program approvals go through the motions
23 | Production Process Recycling Workshop Pilot Run Cooling Raw Material Installation Equipment Facilities Shift Supervisor Technicians | Inadequate commissioning of production processes
24 | Trial Production System Gas Stop Production Program Illegal and Illegal Equipment and Facility Reporting Supervision and Management | Organization of production in violation of the law
25 | Warehouses clustered storage combustion explosion enterprise transportation ignition violations of the law negligence | Illegal storage
26 | Flammable and explosive gases personnel inspection and maintenance concentration exceeds the inspection emission site operator return | Unusual concentrations of hazardous gases
27 | Temperature Material Pressure Splash Rise Burnout Steam Control Heating Lack of | Abnormal pressure/temperature values
28 | Storage Tank Demolition Ruptured Tank Connection Splash Leaves Pipeline Reported Leaking | Unreliable equipment and facilities
29 | Fan Bolt Job Inspection and Repair Flange Inspection and Repair Removal Report Personnel Equipment and Facilities | Inadequate management of equipment maintenance
30 | Flammables Hazardous Concentrations Pipe Falls Illegal Violations Ignition Rectification Ordinance System | Inadequate tracking of hidden dangers and corrective actions
31 | Toxic and Hazardous Gas Risk Identification Emission Risk Disconnection Pipeline Monitoring Reporting Blind System | Failure of safety monitoring and surveillance equipment
32 | Cylinder Gas Inspection Oxygen Operator Safety Education and Training Operator Certificate Recorder | Weak safety awareness
33 | Plant violations mixtures inspections are responsible for installing inspections put into operation points lead to | Inadequate safety inspections
34 | Sewage clogging of protective equipment facilities overflowing pipeline discharge agglomeration regulation development | Illegal treatment of sewage and wastewater
35 | Oil products illegal violations equipment and facilities explosion elimination tank operations main responsibility electrostatic fire protection devices | Risky organization of operations
36 | Waste Wastewater Enterprise Mixed Production Process Sampling Equipment Facility Operators Show Clustering | Illegal treatment of sewage and wastewater
37 | Fall Fracture Explosion Shock Wave Fire Protection Device Operation Impact Punch Out Risk Identification Explosion Prevention | Inadequate fire and explosion prevention measures
38 | Monitoring Records Risk Identification Operational Instrumentation Risk Driving Failure System Programs | Failure of safety monitoring and surveillance equipment
39 | Special equipment commissioning central control room instruction specification information development technician feeding defects | Directing against the rules
40 | Charging and mixing commissioning illegal and irregular contact standard replacement operation laws and regulations operating procedures | Non-compliance
41 | Safety Measures Personnel Replacement Positions Valve Management Operating Procedures Illegal Changes Looking For | Inadequate production safety system
42 | The main responsibility valve is responsible for operating procedures input development manager registration laws and regulations leadership | Failure to uphold paramount safety requirements
43 | Corrosive substances are responsible for analyzing emergency drills return laws and regulations friction oxygen technology safety measures | Non-implementation of emergency relief management
44 | Manufacturer Safety Measures Government Department Approval Compliance Equipment Facility Inspection Technician Qualification Blindness | Inadequate approval of safety management and technical measures
45 | Cleaning Dismantling Operations Violators Tank Transportation Laws and Regulations Input Skills | Non-compliance
46 | Warning Signs Employee Leads to Heated Passage Warning Signs On Duty Opinion Procedures Opened | Failure of safety warning signs
Table 4. Safety risk factors in chemical enterprises.
Classification of Risk Factors | Safety Risk Factors
R Regulatory authorities | R1 Inadequate rectification and tracking of hidden dangers
R Regulatory authorities | R2 Failure to conscientiously fulfill the responsibility of safety supervision
R Regulatory authorities | R3 Inadequate approval of safety management and technical measures
C Chemical enterprises | C1 Inadequate safety education and training
C Chemical enterprises | C2 Inadequate safety system
C Chemical enterprises | C3 Organizing production in violation of the law
C Chemical enterprises | C4 Inadequate safety inspections
C Chemical enterprises | C5 Failure to implement the main responsibility for safety
C Chemical enterprises | C6 Non-implementation of emergency relief management
C Chemical enterprises | C7 Illegal contracting
M Site management | M1 Careless or misrepresented work site inspection
M Site management | M2 Backward and obsolete production processes
M Site management | M3 Inadequate transportation management
M Site management | M4 Inadequate handover or technical safety briefings
M Site management | M5 Inadequate fire and explosion prevention measures
M Site management | M6 Illegal storage
M Site management | M7 Illegal treatment of sewage and wastewater
M Site management | M8 Inadequate commissioning of production processes
M Site management | M9 Reporting program approvals go through the motions
O Operators | O1 Failure to wear protective equipment
O Operators | O2 Operators not licensed or not accompanied by supervisors
O Operators | O3 Inadequate risk identification for special operations
O Operators | O4 Unauthorized command
O Operators | O5 Non-compliant operation
O Operators | O6 Risky organization of operations
O Operators | O7 Weak safety awareness
E Environment and equipment | E1 Unreliable equipment and facilities
E Environment and equipment | E2 Abnormal concentrations of hazardous gases
E Environment and equipment | E3 Abnormal pressure/temperature values
E Environment and equipment | E4 Leakage problems
E Environment and equipment | E5 Inadequate equipment maintenance and management
E Environment and equipment | E6 Failure of safety monitoring control equipment
E Environment and equipment | E7 Failure of safety warning signs
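Table 5. Boolean dataset of accident risk causation in chemical companies.
Serial Number | C1 | C2 | C3 | C4 | C5 | C6 | C7 | E1 | E2 | E3
1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1
2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0
6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0
7 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1
8 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1
9 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
10 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0
11 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
12 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
15 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
16 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0
17 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
19 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0
20 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0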
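The support, confidence, and lift values reported for the association rules can be illustrated on the excerpt of Table 5. The following is a minimal sketch, not the authors' implementation: it uses only the 20 rows and 10 columns shown in Table 5, so its numbers will not match those computed on the full corpus, and the helper names `to_itemsets` and `rule_metrics` are hypothetical.

```python
# Excerpt of Table 5: one string of 0/1 flags per accident report,
# in the column order C1..C7, E1..E3 (only the rows shown in the table).
COLUMNS = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "E1", "E2", "E3"]
ROWS = [
    "1100001111", "0100100001", "0000000000", "0000000010", "1001000010",
    "1000001110", "0100100011", "1101000011", "0100000001", "0001010000",
    "0001000000", "0000011100", "0000000000", "0000000000", "0100000000",
    "1100100010", "0000010001", "0000000000", "1000001010", "0000001010",
]


def to_itemsets(rows, columns):
    """Turn each Boolean row into the set of risk factors present in it."""
    return [{c for c, bit in zip(columns, row) if bit == "1"} for row in rows]


def rule_metrics(itemsets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(itemsets)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for s in itemsets if a <= s)          # reports containing the antecedent
    n_c = sum(1 for s in itemsets if c <= s)          # reports containing the consequent
    n_ac = sum(1 for s in itemsets if (a | c) <= s)   # reports containing both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


itemsets = to_itemsets(ROWS, COLUMNS)
support, confidence, lift = rule_metrics(itemsets, ["C1"], ["E2"])
print(f"C1 -> E2: support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

On this excerpt, C1 appears in 6 reports, E2 in 9, and both together in 6, so the rule C1 → E2 has support 6/20, confidence 6/6, and lift (6/6)/(9/20).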
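Table 6. Association rule mining base dataset.
Serial Number | C1 | C2 | C3 | C4 | C5 | R2 | R3 | Accident Level
1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | G
2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | G
3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | G
4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | G
5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | G
513 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | L
514 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | G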
Table 7. Strong association rules for safety risk factors in chemical enterprises.
Serial Number | Antecedent Item | Consequent Item | Support | Confidence | Lift
1 | C5 | M2 | 0.085603 | 0.698413 | 2.425568
2 | C5 | M8 | 0.052529 | 0.428571 | 2.368664
3 | M8 | E7 | 0.050584 | 0.279570 | 1.796237
4 | M1 | O3 | 0.126459 | 0.353261 | 1.665836
5 | O6 | E1 | 0.073930 | 0.603175 | 1.640380
6 | C4 | M6 | 0.060311 | 0.319588 | 1.564458
7 | M9 | E4 | 0.071984 | 0.397849 | 1.526079
8 | M9 | M4 | 0.071984 | 0.397849 | 1.526079
9 | M7 | O3 | 0.079767 | 0.310606 | 1.464693
10 | M1 | O1 | 0.175097 | 0.489130 | 1.461704
11 | O3 | O1 | 0.103113 | 0.486239 | 1.453062
12 | O2 | O3 | 0.120623 | 0.306931 | 1.447361
13 | C7 | O2 | 0.132296 | 0.557377 | 1.418276
14 | O3 | E6 | 0.095331 | 0.449541 | 1.417572
15 | C4 | M5 | 0.093385 | 0.494845 | 1.413058
16 | M7 | O1 | 0.120623 | 0.469697 | 1.403629
17 | M2 | E3 | 0.167315 | 0.581081 | 1.402233
18 | C6 | M7 | 0.052529 | 0.360000 | 1.401818
19 | M8 | E3 | 0.105058 | 0.580645 | 1.401181
20 | M9 | M2 | 0.071984 | 0.397849 | 1.381720
21 | C5 | O5 | 0.054475 | 0.444444 | 1.367931
22 | E7 | E3 | 0.087549 | 0.562500 | 1.357394
23 | C7 | O3 | 0.068093 | 0.286885 | 1.352835
24 | M6 | M5 | 0.095331 | 0.466667 | 1.332593
25 | C7 | O7 | 0.077821 | 0.327869 | 1.326965
26 | O6 | O5 | 0.052529 | 0.428571 | 1.319076
27 | M7 | E6 | 0.107004 | 0.416667 | 1.313906
28 | O4 | E6 | 0.052529 | 0.415385 | 1.309863
29 | C7 | M7 | 0.079767 | 0.336066 | 1.308619
30 | R2 | C7 | 0.142023 | 0.309322 | 1.303209
31 | M8 | M7 | 0.060311 | 0.333333 | 1.297980
32 | C6 | M2 | 0.054475 | 0.373333 | 1.296577
33 | C1 | M6 | 0.089494 | 0.264368 | 1.294143
34 | O6 | E2 | 0.068093 | 0.555556 | 1.292107
35 | C3 | M2 | 0.058366 | 0.370370 | 1.286286
36 | C2 | M8 | 0.075875 | 0.232143 | 1.283026
37 | C3 | M4 | 0.052529 | 0.333333 | 1.278607
38 | R2 | M3 | 0.132296 | 0.288136 | 1.276739
39 | M3 | E1 | 0.105058 | 0.465517 | 1.266010
40 | O2 | E5 | 0.180934 | 0.460396 | 1.265474
41 | O3 | E5 | 0.097276 | 0.458716 | 1.260855
42 | C7 | M1 | 0.107004 | 0.450820 | 1.259355
43 | O4 | E2 | 0.068093 | 0.538462 | 1.252349
44 | O6 | O2 | 0.060311 | 0.492063 | 1.252082
45 | R2 | C4 | 0.107004 | 0.233051 | 1.234929
46 | M2 | M8 | 0.064202 | 0.222973 | 1.232345
47 | M1 | O2 | 0.173152 | 0.483696 | 1.230790
48 | M5 | E2 | 0.184825 | 0.527778 | 1.227501
49 | O4 | E3 | 0.064202 | 0.507692 | 1.225135
50 | C2 | M9 | 0.071984 | 0.220238 | 1.217230
51 | M1 | E6 | 0.138132 | 0.385870 | 1.216791
52 | O4 | E1 | 0.056420 | 0.446154 | 1.213350
53 | E6 | E1 | 0.140078 | 0.441718 | 1.201285
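The three metrics in Table 7 are linked by the usual definitions: support = n(A and B)/N, confidence = n(A and B)/n(A), and lift = confidence/(n(B)/N). The sketch below back-calculates the implied report counts from a few rows of Table 7 as a consistency check. It assumes a corpus of N = 514 reports, which is inferred here from the highest serial number in Tables 1 and 6 rather than stated in the table itself; the function name `implied_counts` is hypothetical.

```python
# Assumption: corpus size inferred from the highest serial number in Tables 1 and 6.
N_REPORTS = 514

# (antecedent, consequent, support, confidence, lift) copied from Table 7.
RULES = [
    ("C5", "M2", 0.085603, 0.698413, 2.425568),  # rule 1
    ("C5", "M8", 0.052529, 0.428571, 2.368664),  # rule 2
    ("C3", "M2", 0.058366, 0.370370, 1.286286),  # rule 35
]


def implied_counts(rule, n=N_REPORTS):
    """Recover (joint, antecedent, consequent) report counts from one rule row."""
    _, _, support, confidence, lift = rule
    joint = support * n                  # reports containing both items
    n_antecedent = joint / confidence    # reports containing the antecedent
    n_consequent = confidence / lift * n  # reports containing the consequent
    return round(joint), round(n_antecedent), round(n_consequent)


for rule in RULES:
    joint, n_a, n_c = implied_counts(rule)
    print(f"{rule[0]} -> {rule[1]}: joint={joint}, n({rule[0]})={n_a}, n({rule[1]})={n_c}")
```

Under that assumption, both C5 rules imply the same antecedent count and both M2 rules imply the same consequent count, which is what the definitions require.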
Table 8. Critical, important and general factors for accidents in chemical enterprises.
Sensitive Factors | Causal Pathway Factors | High-Frequency Factors | Critical Factors | Important Factors | General Factors
O5 | R1 | R2 | E1 | O5 | M7
E5 | R2 | E4 | O1 | E5 | C5
O2 | C2 | E2 | M1 | O2 | C6
O3 | C7 | E3 | - | O3 | O7
E1 | M1 | O2 | - | O6 | O4
O6 | O1 | R1 | - | E3 | C4
O1 | E1 | E1 | - | E4 | M9
E3 | E4 | E5 | - | R2 | M3
M1 | E6 | M1 | - | E2 | E7
- | - | M5 | - | R1 | M4
- | - | C1 | - | M5 | M2
- | - | O1 | - | C1 | M8
- | - | C2 | - | C2 | C3
- | - | O5 | - | E6 | R3
- | - | E6 | - | C7 | M6
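The grouping in Table 8 appears to follow a simple set rule, which the sketch below reproduces. This rule is inferred from the table itself and is an assumption, not a statement of the authors' procedure: a factor is treated as critical if it appears in all three lists (sensitive, causal pathway, high frequency), important if it appears in at least one list but not all three, and general if it appears in none. The sketch also assumes that the six unpaired entries in the lower rows of the table (M5, C1, O1, C2, O5, E6) belong to the high-frequency column.

```python
# All 33 factor codes from Table 4: R1-R3, C1-C7, M1-M9, O1-O7, E1-E7.
GROUP_SIZES = {"R": 3, "C": 7, "M": 9, "O": 7, "E": 7}
ALL_FACTORS = {f"{g}{i}" for g, n in GROUP_SIZES.items() for i in range(1, n + 1)}

# The three source lists as read from Table 8 (column assignment is an assumption).
sensitive = {"O5", "E5", "O2", "O3", "E1", "O6", "O1", "E3", "M1"}
causal_pathway = {"R1", "R2", "C2", "C7", "M1", "O1", "E1", "E4", "E6"}
high_frequency = {"R2", "E4", "E2", "E3", "O2", "R1", "E1", "E5", "M1",
                  "M5", "C1", "O1", "C2", "O5", "E6"}

flagged = sensitive | causal_pathway | high_frequency

critical = sensitive & causal_pathway & high_frequency  # in all three lists
important = flagged - critical                           # in one or two lists
general = ALL_FACTORS - flagged                          # in none of the lists

print("critical: ", sorted(critical))
print("important:", sorted(important))
print("general:  ", sorted(general))
```

With these inputs the three groups contain 3, 15, and 15 factors respectively, matching the Critical, Important, and General columns of Table 8.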

Zhou, Z.; Guo, J.; Huang, J. Chemical Safety Risk Identification and Analysis Based on Improved LDA Topic Model and Bayesian Networks. Appl. Sci. 2025, 15, 6197. https://doi.org/10.3390/app15116197
