Chemical Safety Risk Identification and Analysis Based on Improved LDA Topic Model and Bayesian Networks

Zhou, Zhiyong; Guo, Jiahang; Huang, Jianhui

doi:10.3390/app15116197


by Zhiyong Zhou *, Jiahang Guo and Jianhui Huang

School of Resources and Safety Engineering, Central South University, Changsha 410083, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(11), 6197; https://doi.org/10.3390/app15116197

Submission received: 18 April 2025 / Revised: 21 May 2025 / Accepted: 28 May 2025 / Published: 30 May 2025


Abstract

Traditional chemical safety management relies mainly on manual inspection and empirical judgment, which cannot cope with increasingly complex production environments and very large data volumes; efficient modern technologies are urgently needed to strengthen the safety management of chemical production sites. This paper therefore studies the identification and analysis of chemical safety risk factors based on an improved LDA topic model and Bayesian networks. Thirty-three main risk factors are obtained by applying text mining, LDA topic modeling, and thematic analysis to chemical safety accident cases and combining the results with the socio-technical system accident model. The correlations and causal relationships between risk factors are revealed through association rule mining and Bayesian network analysis. Sensitivity analysis and critical causal path analysis are used to indicate the possible paths and key links of accident development. The results show that the proposed text mining LDA topic model performs well in analyzing accident reports and can effectively address the insufficient analytical capability and high subjectivity of traditional methods. The method efficiently extracts keywords from accident reports and reveals the correlations and causality between risk factors.

Keywords:

chemical safety; LDA topic model; risk factors; association rules; Bayesian networks

1. Introduction

The raw materials and products involved in chemical production are often flammable, explosive, toxic, or corrosive. Coupled with the complex and volatile production environment, this means that production safety accidents occur from time to time, endangering lives and property and threatening social stability. The trend in chemical accidents and resulting deaths in China over the period 2016–2022 is illustrated in Figure 1. These data are from “Southern Metropolis Daily” and “The National Commercial Fire and Safety Association”. Over this period there were 1185 chemical accidents with 1477 deaths. Therefore, it is essential to strengthen the management of production safety.

Extensive research has been conducted in the field of chemical safety. P. Lassak et al. [1] took an industrial fixed-bed reactor as the object of study, developed a mathematical model with nine parameters, and used Monte Carlo methods to analyze the effect of uncertainty in individual model parameters and their combinations. Building upon traditional safety analysis methods, Rathnayaka et al. [2] introduced a predictive accident model that combines event tree and fault tree approaches with process historical data and causal analysis. Jain et al. [3] formulated a holistic framework for systematic process safety management and risk assessment tailored to technological applications. Tan [4] constructed a capacity assessment model for chemical parks to achieve basic safety and reduce accident losses, and gave recommendations for expanding the stock of favorable hazards in the development of chemical parks.

These traditional methods lack sufficient data collection and analysis capabilities and are highly subjective, so they cannot fully capture potential safety risks and trends. At the same time, traditional chemical safety management may lag behind technological development and fail to use new technical means to improve the efficiency and level of safety management, missing the opportunity to prevent accidents in advance and weakening the effectiveness of safety management.

Text mining can effectively compensate for the shortcomings of traditional methods by extracting useful information from massive text data, discovering hidden knowledge, and making classification predictions that support decision-making and applications. Recent developments in text mining include the systematic examination of text-processing stages by Senave et al. [5], which underscored the significance of keyword extraction. Addressing this need, Zhang et al. [6] developed a semantic hierarchical graph model that captures keyword contexts and their underlying connection patterns; deep mining of feature item representations reveals the hierarchical associations between words in the semantic graph and yields high-probability keyword sets. To optimize news retrieval, Zhang [7] created a text-processing technique that extracts keywords from news articles to obtain their essential information. Text mining often uses clustering methods, and LDA topic modeling is one of them. Zhao et al. [8] systematically investigated privacy preservation in mainstream LDA training algorithms based on Collapsed Gibbs Sampling (CGS) and proposed two LDA algorithms. The centralized privacy-preserving algorithm (HDP-LDA) prevents the leakage of mid-training statistics in CGS, while the locally private LDA (LP-LDA) variant, unlike conventional LDA, incorporates local differential privacy mechanisms to protect each data contributor’s information. Among the LDA variants, Trace Ratio LDA (TR-LDA) is a classical form given its distinct nature. The algorithm for solving TR-LDA only converges if the sample size is significantly smaller than the dimensionality of the data. To address this problem, Li et al. [9] proposed an adapted form of TR-LDA, which is applied to different datasets through a standardized format. Wu et al. [10] proposed a short text clustering algorithm (SKP-LDA) that extends LDA with sentiment word co-occurrence statistics and extracted knowledge pairs. The sentiment word co-occurrence takes diverse short texts into account, and microblog short texts are also assigned a sentiment polarity. The model extracts knowledge pairs (topic-specific words and their relational counterparts) and integrates them into LDA for clustering, significantly enhancing the accuracy of online public opinion analysis. Jia et al. [11] employed LDA to establish a material cycle safety control system for construction sites, enhancing project safety and offering insights for systematic engineering safety management. Zhang et al. [12] used the same approach to analyze the research lineage of forestry ecological construction, identifying frontier hotspots and forecasting future trends, thereby contributing valuable guidance for related fields.

Given the low use of text mining methods in chemical safety, this paper attempts to combine and use text mining, keyword extraction, LDA topic modeling, association rule analysis, and Bayesian networks for chemical accident analysis. Focusing on chemical production safety, this study develops a systematic risk identification and analysis framework tailored to the unique characteristics of chemical production site management. By conducting an in-depth examination of chemical accident investigation reports and leveraging text mining and analytical techniques, the research not only advances the theoretical boundaries of chemical safety management but also demonstrates substantial innovation and practical significance for real-world applications. The objective is to cut down on the frequency of accidents through robust guidance for automated analysis of accident causes and optimization of safety strategies.

2. Materials and Methods

2.1. Dataset and Data Preprocessing

This study uses chemical production accident investigation reports as the primary corpus; these reports are written by safety experts after investigating and analyzing an accident. The official websites of China’s Ministry of Emergency Management (MEM) and of provincial and municipal emergency management departments publish such reports, and several Chinese websites, such as the Safety Management Network, China Chemical Safety Association, and Chemical Safety People, collect and organize accidents in chemical production systems. To establish robust data provenance, this study downloaded the chemical-related accident investigation reports published on these websites from 2013 to 2022, totaling 843 reports. After data cleaning, including deduplication and quality filtering, a final corpus of 514 chemical industry accident reports was obtained. The distributions of accident severity levels and accident types are shown in Figure 2 and Figure 3.

Figure 2 shows that most accidents are general or larger accidents. Although each of these accidents causes relatively few casualties or little damage, they occur frequently and can still severely affect chemical production. Figure 3 reveals that explosions account for more than 50% of all accidents, while poisoning and asphyxiation incidents represent the second most common type. This pattern primarily stems from the widespread use of combustible, explosive, toxic, and hazardous gases in chemical manufacturing processes, which frequently leads to either mixed-gas explosions or worker exposure resulting in poisoning or asphyxiation.

To prevent data redundancy, reduce the workload of text analysis, and minimize the interference of irrelevant information, this study extracted only the “Accident History” and “Accident Cause Analysis” sections of each accident investigation report as the initial corpus for analysis.

Data preprocessing is an essential yet labor-intensive step, as the raw corpus often includes irregular or irrelevant terms. A key objective is to segment the original Chinese text into words, which English text already provides through spaces, to facilitate subsequent analysis. The segmentation tool used in this study is the Jieba segmenter for Python 3. Linguistic analysis reveals that the causative factors in chemical accidents predominantly appear as noun phrases or verb–noun constructions, which are syntactically simple [13]. Therefore, only common nouns, organization names, other proper names, common verbs, and gerunds are retained, and tokens with any other part of speech are discarded from the segmentation results.

Segmentation with Jieba in this study relies on three dictionaries: a domain dictionary, a synonym dictionary, and a stop-word dictionary.

(1): Domain dictionary: Although Jieba ships with a segmentation dictionary that contains the most common words (e.g., reactor, piping), it has significant limitations in recognizing domain-specific terminology, particularly technical terms such as ‘distillation tower’, ‘intermediate chamber’, ‘steam valve’, and ‘gas detector’ that are essential for analyzing process industry documents. When such a term is encountered, Jieba may split it into two or more words, destroying words that would otherwise be highly indicative of safety risk factors. It is therefore necessary to construct a domain-specific lexicon containing these industry terms and to load it into Jieba as a custom dictionary.
(2): Synonym dictionary: Accident investigation reports contain many synonymous expressions. This lexical variation fragments the segmentation output and makes cluster analysis considerably harder. All synonyms are therefore replaced with a single standard word; for example, both ‘pipeline’ and ‘steam pipe’ are represented by ‘pipeline’.
(3): Stop-word dictionary: The accident reports contain considerable noise, including non-informative tokens, context-free numerals, and extraneous punctuation marks (for instance, “yes”, “3”, and “!”). These tokens carry no information relevant to the analysis and are added to the stop-word dictionary so that they are removed during segmentation.

The three dictionaries directly affect segmentation accuracy, and any errors propagate to all downstream analyses. The dictionaries therefore need to be updated over several iterations to obtain a Jieba segmentation system suited to this study.
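The role of the synonym and stop-word dictionaries can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes the text has already been segmented (for example by Jieba with the domain dictionary loaded), and the function name `normalize_tokens` and the sample tokens are hypothetical.

```python
def normalize_tokens(tokens, synonyms, stop_words):
    """Apply a synonym dictionary and a stop-word dictionary to segmented tokens.

    tokens     -- list of words produced by a segmenter (assumed, not shown here)
    synonyms   -- dict mapping a variant expression to its standard word
    stop_words -- set of tokens to discard
    """
    cleaned = []
    for tok in tokens:
        tok = synonyms.get(tok, tok)          # replace synonyms with the standard word
        if tok in stop_words or tok.isdigit():
            continue                          # drop stop words and bare numerals
        cleaned.append(tok)
    return cleaned


# Illustrative English stand-ins for segmented Chinese tokens.
tokens = ["steam pipe", "3", "yes", "gas detector", "pipeline", "!"]
print(normalize_tokens(tokens, {"steam pipe": "pipeline"}, {"yes", "!"}))
```

Applying the synonym mapping before the stop-word check means a variant is first standardized and only then filtered, so each dictionary can be maintained independently across update rounds.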

A preliminary domain-specific lexicon was constructed prior to text segmentation. For this study, technical terms in chemical production were systematically extracted from official web resources of major search engines (Baidu, Sogou, and Google), then standardized and incorporated into the domain dictionary.

The study adopted the HIT stop-word list as the baseline stop-word dictionary. After the corpus was segmented, domain experts validated the segmentation results. Synonymous expressions they identified were added to the synonym dictionary and standardized, while non-content tokens were appended to the stop-word dictionary. In this way, a set of dictionaries suited to segmenting accident investigation reports in chemical safety was obtained. Applying this system to the original corpus produced the final segmentation results used as data for subsequent analysis.

2.2. Keyword Extraction

Keyword extraction summarizes text content so that readers can quickly grasp the topic and main points of a text, and it improves the accuracy and efficiency of information retrieval. Commonly used keyword extraction algorithms are TF-IDF and BM25. Both examine word-document associations but overlook the role that word meaning plays in keyword extraction. Among words indicating safety risk factors, longer words usually carry more explicit and specialized information. In addition, the vocabulary in the domain lexicon has been carefully screened by domain experts, so its semantic representation is clearer. The BM25W model [14] exploits both observations by weighting the BM25 score according to the semantics of the words themselves, and can therefore extract words that explicitly indicate safety risk factors more accurately. First, the length-based weight is given in Equation (1), where $\mathrm{len}(q_{i})$ denotes the length of word $q_{i}$ and $\mathrm{maxlen}(d, q)$ the maximum word length in text $d$. Second, the weight based on the domain dictionary is given in Equation (2).

$$\mathrm{weight}_{len}(q_{i}) = \frac{\mathrm{len}(q_{i})}{\mathrm{maxlen}(d, q)} \tag{1}$$

$$\mathrm{weight}_{lexicon}(q_{i}) = \begin{cases} 0, & q_{i} \notin \text{domain lexicon} \\ 0.5 + \dfrac{100}{\mathrm{len}(d)}, & q_{i} \in \text{domain lexicon} \end{cases} \tag{2}$$

If $q_{i}$ is not in the domain vocabulary, the weight is 0. Conversely, if the term exists in the domain dictionary, it is assigned an initial weight of 0.5 plus the ratio of 100 words to the length of document $d$. The sum of these two components serves as the semantic weighting factor, which is incorporated into the BM25 scoring formula as shown in Equation (3), where $W_{i}$ and $R(q_{i}, d)$ are the term weight and the term-document relevance score of standard BM25.

$$\mathrm{score}(d, q, w) = \sum_{i} W_{i} \cdot R(q_{i}, d) \cdot \left(\mathrm{weight}_{len}(q_{i}) + \mathrm{weight}_{lexicon}(q_{i})\right) \tag{3}$$
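Equations (1) to (3) can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than details given in the text: $\mathrm{len}(d)$ is taken as the number of tokens in the document, $W_{i}$ is a smoothed IDF, $R(q_{i}, d)$ is the usual BM25 term-frequency saturation with the conventional defaults `k1 = 1.5` and `b = 0.75`, and each word's summand in Equation (3) is used as its keyword score.

```python
import math


def weight_len(term, doc):
    """Equation (1): word length relative to the longest word in the document."""
    return len(term) / max(len(t) for t in doc)


def weight_lexicon(term, doc, lexicon):
    """Equation (2): 0 outside the domain lexicon, else 0.5 + 100 / len(d).

    len(d) is assumed to be the token count of the document.
    """
    return 0.5 + 100 / len(doc) if term in lexicon else 0.0


def bm25w_scores(doc, corpus, lexicon, k1=1.5, b=0.75):
    """Per-word summand of Equation (3) for every distinct word in `doc`."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    scores = {}
    for term in set(doc):
        df = sum(term in d for d in corpus)                      # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)     # W_i (assumed smoothed IDF)
        tf = doc.count(term)
        rel = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))  # R(q_i, d)
        semantic = weight_len(term, doc) + weight_lexicon(term, doc, lexicon)
        scores[term] = idf * rel * semantic
    return scores


# Illustrative English stand-ins for segmented reports.
doc = ["valve", "leak", "distillation tower", "leak"]
corpus = [doc, ["valve", "fire"], ["leak", "operator"]]
scores = bm25w_scores(doc, corpus, lexicon={"distillation tower"})
print(max(scores, key=scores.get))
```

Because the lexicon weight is added only for domain terms, a long domain-specific word such as "distillation tower" is ranked above shorter generic words even when it occurs less often.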

2.3. LDA Topic Modeling

LDA is a generative probabilistic model [15] introduced by David Blei, Andrew Ng, and Michael Jordan in 2003 for revealing latent topics in textual data. Its introduction brought significant progress in textual topic modeling. With the popularity of social media, the importance of textual data is increasing, posing new challenges to social science researchers. LDA models are widely used in social science research, including topic discovery and document categorization, because they can effectively extract topics from large amounts of text.

Figure 4 depicts the LDA topic model framework.

The LDA model corresponding to Figure 4 has two main generative processes:

(1): $α \to θ_{d} \to Z_{dn}$: This process describes how the topic of the $n$th word in document $d$ is generated. First, the topic distribution $θ_{d}$ of document $d$, which represents the document’s mixture of topics, is drawn from the Dirichlet distribution with hyperparameter $α$. Then, the $n$th word in document $d$ is assigned its topic $Z_{dn}$ based on $θ_{d}$.
(2): $β \to φ_{k} \to W_{dn}$: This process describes how the $n$th word in document $d$ is generated. First, the topic-word distributions $φ_{k}$, one for each topic $k$, are sampled from the Dirichlet distribution with hyperparameter $β$. Then, given the topic $Z_{dn}$ of the $n$th word in document $d$, the word $W_{dn}$ is generated from the corresponding topic-word distribution. The key point is that a topic is first determined for each word position in the document, and the word itself is then generated from that topic’s word distribution.

The joint distribution of all variables is shown in Equation (4):

$$p(W_{d}, Z_{d}, θ_{d}, φ \mid α, β) = p(θ_{d} \mid α)\, p(φ \mid β) \prod_{n = 1}^{N_{d}} p(W_{dn} \mid φ_{Z_{dn}})\, p(Z_{dn} \mid θ_{d}) \tag{4}$$

In Equation (4), $W_{dn}$ represents the $n$th word in the $d$th document and $N_{d}$ is the number of words in that document. The text data are the object to be analyzed and mined, while the parameters $α$ and $β$ are preset values that guide the generation of topics and vocabulary. Except for the observable word variable $W_{dn}$ and the prior parameters $α$ and $β$, all other variables in the LDA model are hidden. Among these hidden variables, $Z_{dn}$ denotes the topic to which the $n$th word in document $d$ belongs, $θ_{d}$ denotes the topic distribution of document $d$, and $φ_{k}$ denotes the word distribution of the $k$th topic. LDA establishes the link between documents and topics by inferring these hidden variables from the text data [16]. The model uses methods such as Bayesian inference for parameter estimation and learning to capture the topic structure and semantics of text in support of text analysis and applications.

The most commonly used parameter estimation methods for LDA topic models are Gibbs Sampling [17] and Variational Inference [18]. Compared with Variational Inference, Gibbs Sampling is a simple and intuitive Markov Chain Monte Carlo (MCMC) method [19] that samples hidden variables and parameters from the joint distribution and, after convergence, draws samples from the posterior distribution from which parameter estimates are obtained. In the LDA model, Gibbs sampling updates the parameter estimates step by step by sampling each variable conditional on all the others. Therefore, Gibbs sampling is used in this paper to construct the LDA model.
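As an illustration of how Gibbs sampling fits an LDA model, the following is a minimal collapsed Gibbs sampler in plain Python. It is a teaching sketch under stated assumptions, not the authors' code: the function name `gibbs_lda`, the toy documents, and the symmetric scalar priors `alpha` and `beta` are all illustrative.

```python
import random


def gibbs_lda(docs, n_topics, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    n_dk = [[0] * n_topics for _ in docs]       # topic counts per document
    n_kw = [[0] * V for _ in range(n_topics)]   # word counts per topic
    n_k = [0] * n_topics                        # total words per topic
    z = []                                      # topic assignment of every word

    # Random initialization of the topic assignments Z_dn.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, k = word_id[w], z[d][n]
                # Remove the current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Full conditional p(Z_dn = j | all other assignments), up to a constant.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][v] + beta) / (n_k[j] + V * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Point estimates of the document-topic and topic-word distributions.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


docs = [["valve", "leak", "valve"], ["training", "operator", "training"], ["valve", "leak"]]
theta, phi, vocab = gibbs_lda(docs, n_topics=2, alpha=0.5, beta=0.01, n_iter=50)
```

Each sweep resamples the topic of one word at a time given all other assignments, which is the "sampling each variable given the other variables" step described above; $θ_{d}$ and $φ_{k}$ are then read off the count tables.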

2.4. Association Rule Analysis

Initially introduced by Agrawal et al. [20] for market basket analysis, association rule mining serves as a computational method for discovering hidden relationships among item sets within transactional databases. As an important data mining technique, it effectively analyzes accident data, reveals accident-related patterns and causal relationships, and aids managerial decision-making, with proven applications in diverse fields [21,22].

Apriori is a classical association rule mining algorithm with a solid theoretical foundation and a wide range of applications. It applies to many kinds of datasets and is easy to understand and implement. Although Apriori is less efficient on large-scale datasets, it still performs well when the data size is moderate or the set of frequent items is small. Its support and confidence thresholds make the results tunable and interpretable, so it can meet the needs of different scenarios. The causal interdependencies among safety risk factors are highly complex and non-linear, and many association rules involve multi-item sets, so this study uses the Apriori algorithm for association rule mining.
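The level-wise search that Apriori performs can be sketched as follows. This is a compact illustration of the general algorithm, not the configuration used in the study; the function name `apriori`, the toy transactions, and the threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise candidate generation."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(itemset):
        return sum(itemset <= t for t in sets) / n

    level = {frozenset([item]) for t in sets for item in t}   # 1-item candidates
    frequent = {}
    k = 1
    while level:
        kept = {c: s for c in level if (s := support(c)) >= min_support}
        frequent.update(kept)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in kept for s in combinations(c, k - 1))
        }
    return frequent


transactions = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}]
for itemset, sup in sorted(apriori(transactions, 0.5).items(), key=lambda x: sorted(x[0])):
    print(sorted(itemset), sup)
```

The prune step is what keeps the search tractable on a moderate number of risk factors: a multi-item set is only counted if all of its smaller subsets already met the support threshold.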

2.5. Bayesian Network Analysis

Bayesian networks provide a graphical representation of how random variables probabilistically influence one another, combining graph theory with probability theory. In the network structure, vertices correspond to random variables, while directed edges signify conditional dependency relationships between them, and the probability distribution of a node takes into account the conditional probability distribution of that node given its parent. Bayesian networks can effectively deal with uncertainty, provide flexible and accurate modeling for many practical problems, help to understand the probabilistic dependencies of complex systems and support inference and decision-making.

The three fundamental topological configurations of Bayesian networks are illustrated in Figure 5.

In the V-shaped structure of Figure 5a, one feature depends on the others, i.e., $x_{1}$ depends on $x_{2}$ and $x_{3}$. When the value of the child feature is unknown, the parent features are statistically independent, i.e., when $x_{1}$ is unknown, $x_{2}$ is independent of $x_{3}$, as shown by the following equations:

$$\sum_{x_{1}} P(x_{1}, x_{2}, x_{3}) = \sum_{x_{1}} P(x_{2}) P(x_{3}) P(x_{1} \mid x_{2}, x_{3}) \tag{5}$$

$$\begin{aligned} P(x_{2}, x_{3}) &= P(x_{2}) P(x_{3}) \sum_{x_{1}} P(x_{1} \mid x_{2}, x_{3}) \\ &= P(x_{2}) P(x_{3}) \times 1 \\ &= P(x_{2}) P(x_{3}) \end{aligned} \tag{6}$$

In the common-parent structure of Figure 5b, multiple features depend on one feature, i.e., $y_{2}$ and $y_{3}$ depend on $y_{1}$. When $y_{1}$ is known, $y_{2}$ is independent of $y_{3}$, as in Equation (7):

$$\begin{aligned} P(y_{2}, y_{3} \mid y_{1}) &= \frac{P(y_{1}, y_{2}, y_{3})}{P(y_{1})} \\ &= \frac{P(y_{1}) P(y_{2} \mid y_{1}) P(y_{3} \mid y_{1})}{P(y_{1})} \\ &= P(y_{2} \mid y_{1}) P(y_{3} \mid y_{1}) \end{aligned} \tag{7}$$

In the sequential structure of Figure 5c, one feature depends on a second, which in turn depends on a third, i.e., $z_{3}$ depends on $z_{1}$, and $z_{1}$ depends on $z_{2}$. When $z_{1}$ is known, $z_{2}$ is independent of $z_{3}$, as in Equation (8):

$$\begin{aligned} P(z_{2}, z_{3} \mid z_{1}) &= \frac{P(z_{1}, z_{2}, z_{3})}{P(z_{1})} \\ &= \frac{P(z_{2}) P(z_{1} \mid z_{2}) P(z_{3} \mid z_{1})}{P(z_{1})} \\ &= \frac{P(z_{1}, z_{2}) P(z_{3} \mid z_{1})}{P(z_{1})} \\ &= P(z_{2} \mid z_{1}) P(z_{3} \mid z_{1}) \end{aligned} \tag{8}$$

The Bayesian network constructed in this paper is essentially composed of these three basic structures.
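The independence property of the V-shaped structure can be checked numerically. The sketch below builds a small joint distribution for binary $x_{1}$, $x_{2}$, $x_{3}$ with made-up probabilities (all numbers are illustrative assumptions) and verifies Equation (6); it also shows that the parents are generally no longer independent once $x_{1}$ is observed.

```python
from itertools import product

# Illustrative (made-up) probabilities for a V-structure x2 -> x1 <- x3.
p_x2 = {0: 0.7, 1: 0.3}
p_x3 = {0: 0.4, 1: 0.6}
p_x1_is_1 = {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}  # P(x1=1 | x2, x3)


def joint(x1, x2, x3):
    """P(x1, x2, x3) = P(x2) P(x3) P(x1 | x2, x3)."""
    p1 = p_x1_is_1[(x2, x3)]
    return p_x2[x2] * p_x3[x3] * (p1 if x1 == 1 else 1 - p1)


def marginal_x2_x3(x2, x3):
    """Sum x1 out of the joint, as in Equations (5) and (6)."""
    return sum(joint(x1, x2, x3) for x1 in (0, 1))


def cond_x2_x3_given_x1(x2, x3, x1):
    """P(x2, x3 | x1) obtained by normalizing the joint over x2 and x3."""
    p_x1 = sum(joint(x1, a, b) for a, b in product((0, 1), repeat=2))
    return joint(x1, x2, x3) / p_x1


# x1 unknown: the marginal factorizes, so x2 and x3 are independent.
for a, b in product((0, 1), repeat=2):
    print((a, b), round(marginal_x2_x3(a, b), 4), round(p_x2[a] * p_x3[b], 4))

# x1 observed: the product of conditionals no longer matches the joint conditional.
p_both = cond_x2_x3_given_x1(1, 1, 1)
p_a = sum(cond_x2_x3_given_x1(1, b, 1) for b in (0, 1))
p_b = sum(cond_x2_x3_given_x1(a, 1, 1) for a in (0, 1))
print(round(p_both, 4), round(p_a * p_b, 4))
```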

3. Results

3.1. Data Preprocessing and Keyword Extraction

The raw corpus was segmented with the Jieba segmentation algorithm in Python 3.7. From the resulting space-delimited corpus (words separated by spaces, as in English text), 2898 distinct feature words were identified, a selection of which appears in Table 1. This standardized text format facilitates efficient keyword extraction in subsequent steps.

In this study, the BM25W model was used to obtain the importance scores of all feature words in the segmentation results, and some of the high-scoring feature words are shown in Table 2.

Word clouds show critical information quickly and intuitively, so the 100 keywords with the highest BM25W scores are visualized in Figure 6. In Figure 6, the larger a word appears, the more important it is in the accident investigation reports. Terms such as “equipment and facilities”, “supervision and management”, and “safety education and training” occupy a prominent position in the accident investigation reports.

Comparing the BM25W scores of feature words makes it possible to identify the words that are most thematically relevant. This screening method considers not only word frequency but also the importance and contextual relevance of the words in the text, and thus reflects more accurately how much each feature word contributes to the text topic. Deleting feature words with lower scores reduces noise and irrelevant information, which improves the performance and interpretability of the LDA topic model and enhances its effectiveness in practical applications.

3.2. LDA Topic Model Analysis

3.2.1. Estimation of the Optimal Number of Topics

Determining the number of topics in an LDA model is critical to data interpretation and clustering effectiveness. Too many topics may lead to overfitting, introducing noise and reducing interpretability; too few may ignore essential information, resulting in underfitting and failing to capture the deep structure of the data. For a large-scale corpus, the number of topics cannot be determined by experience alone, and a systematic estimate is needed to find the optimal number.

Established measures for determining the most suitable number of topics for LDA are perplexity and topic coherence. Perplexity reflects how uncertain the model is about which topics a document belongs to; it generally decreases as the number of topics increases, but too many topics may lead to overfitting. Topic coherence measures the semantic relatedness of the high-probability words within each topic and is another indicator of model quality. Many empirical studies have demonstrated the usefulness of topic coherence [23]. Generally, the higher the topic coherence score, the better the topic model. Therefore, this paper uses perplexity and topic coherence together, calculating both scores for different numbers of topics to determine the optimal topic count. This approach balances model predictability and topic quality. It has been widely used in many studies [24,25,26,27,28] to help researchers select LDA model parameters suited to the data and thereby improve the performance and explanatory power of the topic model. Perplexity is calculated as in Equation (9):

$$\mathrm{Perplexity}(D_{test}) = \exp\left(- \frac{\sum_{d = 1}^{M} \log p(w_{d})}{\sum_{d = 1}^{M} N_{d}}\right) \tag{9}$$

where $D_{test}$ denotes the set of all documents in the corpus, $M$ is the total number of documents, $N_{d}$ is the number of words in document $d$, $w_{d}$ denotes the words in document $d$, and $p(w_{d})$ denotes the probability that the words $w_{d}$ are generated.

In this paper, a maximum value n_max_topics is first set for the number of topics, i.e., a value above which the topic count would be unrealistic for accident investigation reports of chemical enterprises. In this study, n_max_topics = 100, so the number of topics of the LDA model takes integer values in the range 1~100. The Python program traverses all candidate topic counts; in each traversal, the parameter α takes its empirical value, the inverse of the current number of topics, β is set to 0.01, and the number of sampling iterations is 1000. This yields a topic model and its perplexity and coherence scores for each number of topics in the range. The variation of perplexity and topic coherence with the number of topics is plotted in Figure 7. In the figure, when the number of topics is 50, the perplexity is at its inflection point and takes its smallest value, while the topic coherence is the largest. This number of topics is therefore taken as the best estimate.
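The perplexity score of Equation (9) can be computed from a fitted model as in the following sketch. It assumes that documents are given as lists of integer word ids and that $p(w_{d})$ is approximated by the plug-in estimate $\prod_{n} \sum_{k} θ_{dk} φ_{k, w_{dn}}$; both the function name `perplexity` and this estimator are illustrative assumptions, since the text does not specify how $p(w_{d})$ is evaluated.

```python
import math


def perplexity(docs, theta, phi):
    """Equation (9) with p(w_d) taken as the product of per-word mixture probabilities.

    docs  -- list of documents, each a list of integer word ids
    theta -- theta[d][k], topic distribution of document d
    phi   -- phi[k][v], word distribution of topic k
    """
    log_lik = 0.0
    n_words = 0
    for d, doc in enumerate(docs):
        for w in doc:
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
        n_words += len(doc)
    return math.exp(-log_lik / n_words)


# Sanity check: a model that is uniform over 4 words has perplexity 4.
docs = [[0, 1, 2, 3], [3, 3]]
theta = [[0.5, 0.5], [0.2, 0.8]]
phi = [[0.25] * 4, [0.25] * 4]
print(perplexity(docs, theta, phi))
```

In a sweep over candidate topic counts, this function would be called once per fitted model (with α set to the inverse of the topic count and β = 0.01, as described above), and the resulting curve inspected together with the coherence curve.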

3.2.2. Thematic Analysis

In this paper, LDA topic analysis yields 50 highly clustered groups of feature words, and the risk factor theme corresponding to each group is then determined through expert experience and analysis. Because the number of topics is itself an estimate, the results inevitably contain some unrealistic themes, which require further manual screening. After removing four noisy themes that do not conform to reality, 46 themes were finally analyzed, as presented in Table 3.

In this paper, the risk management framework based on socio-technical systems theory proposed by Rasmussen [29] is used to analyze the risk factors of chemical safety accidents, which helps to analyze the correlations between macro- and micro-level factors in the chemical production system. Combining this framework with the actual situation of the chemical industry, this study proposes five levels to categorize the risk causation of chemical accidents: regulatory authorities, chemical enterprises, site management, operators, and environment and equipment. It should be emphasized that the risk factor themes are manually summarized from the feature words under each topic, so some summarized risk factor themes cover more than one of the original topics. This keeps the granularity of the risk factor themes consistent and prevents significant errors caused by inconsistent scales in subsequent calculations. The risk factor themes were merged and deduplicated, and the final classification results are presented in Table 4.

After all the themes are obtained, the LDA model is used to compute the topic probability distribution of every accident investigation report. Reports that contain a given risk factor theme are marked 1 and those that do not are marked 0. This produces a structured “0–1” Boolean dataset, presented in Table 5, which is analyzed with association rule mining and Bayesian network analysis in the following sections.
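One possible way to derive such a Boolean dataset from the document-topic distributions is sketched below. The text does not state the rule used to decide whether a report "contains" a theme, so the probability threshold of 0.1, the function name `binarize`, and the topic-to-factor mapping are all assumptions made for illustration.

```python
def binarize(doc_topic, topic_to_factor, factors, threshold=0.1):
    """Turn document-topic probabilities into a 0/1 matrix of risk factors.

    doc_topic       -- doc_topic[d][k], topic probabilities of report d
    topic_to_factor -- dict mapping a topic index to a risk factor label;
                       several topics may map to the same factor, and
                       screened-out noisy topics are simply absent
    factors         -- ordered list of risk factor labels (the columns)
    threshold       -- assumed cut-off above which a topic counts as present
    """
    rows = []
    for probs in doc_topic:
        present = {
            topic_to_factor[k]
            for k, p in enumerate(probs)
            if p >= threshold and k in topic_to_factor
        }
        rows.append([1 if f in present else 0 for f in factors])
    return rows


# Two reports, three topics; topics 0 and 2 were both summarized as factor C1.
doc_topic = [[0.70, 0.25, 0.05], [0.02, 0.08, 0.90]]
print(binarize(doc_topic, {0: "C1", 1: "C2", 2: "C1"}, ["C1", "C2"]))
```

Mapping several topics to one factor label mirrors the merging of themes described above, so each column of the resulting matrix corresponds to one summarized risk factor rather than one raw LDA topic.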

3.3. Association Rule Analysis

The accident level information is added to the Boolean dataset of chemical enterprise accident risk causation information to obtain an adaptive data format for association rule mining. In this paper, the accident level is categorized into general accidents (G), significant accidents (L), and significant and above accidents (MS), which corresponds to each accident investigation report and combines with the Boolean dataset of accident risk causation information of chemical enterprises obtained above to obtain the primary dataset of association rule mining, as detailed in Table 6.

Each entry in Table 6 corresponds to the textual record of one accident investigation report, i.e., one accident. A “1” indicates that the risk factor in the corresponding column appeared in the accident, whereas a “0” means that it did not. For example, as shown in Table 6, risk factor C1 appeared in the accident investigation report with serial number 5 but did not appear in the report with serial number 4. The rightmost column shows the accident level of each accident, so the dataset for association rule mining has dimensions 514 × 34.

Support, Confidence, and Lift are three critical metrics to measure association rules, and strong association rules can be filtered based on these metrics [30].

Each accident investigation report is regarded as a “transaction”, denoted $D_{n}$, so the set of transactions in this paper is $D = \{D_{1}, D_{2}, D_{3}, \dots, D_{n}, \dots, D_{514}\}$. Each safety risk factor in an accident investigation report is regarded as an “item”, denoted $F_{i}$, so the itemset of this paper is $F = \{F_{1}, F_{2}, F_{3}, \dots, F_{33}\}$.

Support is the ratio of the number of transactions containing both items $F_{i}$ and $F_{j}$ to the total number of transactions, and is calculated as shown in Equation (10):

$$\mathrm{Support}(F_{i}, F_{j}) = P(F_{i} \cup F_{j}) = \frac{P(F_{i}, F_{j})}{P(D)} \tag{10}$$

Confidence is the proportion of transactions containing $F_{i}$ that also contain $F_{j}$, and is calculated as shown in Equation (11):

$$\mathrm{Confidence}(F_{i} \to F_{j}) = P(F_{j} \mid F_{i}) = \frac{P(F_{i}, F_{j})}{P(F_{i})} \tag{11}$$

Lift reflects the correlation between $F_{i}$ and $F_{j}$ in an association rule, and is calculated as shown in Equation (12):

$$\mathrm{Lift}(F_{i} \to F_{j}) = \frac{P(F_{i}, F_{j})}{P(F_{i}) \times P(F_{j})} = \frac{P(F_{j} \mid F_{i})}{P(F_{j})} \tag{12}$$

A value of $\mathrm{Lift}(F_{i} \to F_{j})$ greater than 1 indicates that the rule $F_{i} \to F_{j}$ is of practical significance, suggesting that the appearance of $F_{i}$ makes the appearance of $F_{j}$ more likely; a value equal to 1 indicates that $F_{i}$ and $F_{j}$ are independent of each other; and a value less than 1 indicates that the appearance of $F_{i}$ does not promote the appearance of $F_{j}$ and in fact makes it less likely.

According to the above principles, and in order to obtain more valuable and more strongly correlated rules between safety risk factors in chemical enterprises, the thresholds were chosen after several experiments: the minimum Support is set to 5%, the minimum Confidence to 20%, and the minimum Lift to 1.2. With these thresholds, 53 strong association rules are obtained, as shown in Table 7.
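The filtering step can be illustrated with a short sketch that applies the standard definitions of Support, Confidence, and Lift with the thresholds stated above. It is a simplification under stated assumptions: it enumerates only single-antecedent, single-consequent rules by brute force rather than running a full Apriori search over larger itemsets, and the function name `mine_pair_rules` and the ten toy reports are hypothetical.

```python
from itertools import permutations


def mine_pair_rules(rows, min_support=0.05, min_confidence=0.20, min_lift=1.2):
    """Return (antecedent, consequent, support, confidence, lift) for one-to-one rules."""
    n = len(rows)
    items = sorted(rows[0])
    # Marginal frequency P(F) of each risk factor.
    freq = {f: sum(row[f] for row in rows) / n for f in items}
    rules = []
    for a, b in permutations(items, 2):
        if freq[a] == 0 or freq[b] == 0:
            continue  # a factor that never appears cannot form a rule
        support = sum(1 for row in rows if row[a] and row[b]) / n  # P(a, b)
        confidence = support / freq[a]                             # P(b | a)
        lift = confidence / freq[b]                                # P(b | a) / P(b)
        if support >= min_support and confidence >= min_confidence and lift >= min_lift:
            rules.append((a, b, support, confidence, lift))
    return rules


# Hypothetical 0-1 rows for ten reports and three factors (not data from the paper).
columns = ("E6", "E1", "O5")
values = [
    (1, 1, 0), (1, 1, 1), (1, 1, 0), (0, 0, 1), (0, 0, 1),
    (0, 1, 1), (0, 0, 0), (0, 0, 1), (1, 1, 0), (0, 0, 1),
]
rows = [dict(zip(columns, v)) for v in values]

rules = mine_pair_rules(rows)
for a, b, s, c, l in rules:
    print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")
```

In this toy data only the rules between E6 and E1 pass all three thresholds; the rules involving O5 are rejected because their Lift is below 1.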

The high-frequency risk factors obtained through association rule mining can be regarded as nodes in a Bayesian network, and the significant association relationships between risk factors correspond to its edges. The topology of the Bayesian network can be constructed from these correlations to better describe the dependencies and probability distributions among risk factors. In addition, the Bayesian network can incorporate information such as accident levels, which can be used as observed or hidden nodes to improve the model's accuracy and practicality.

4. Discussion

The Bayesian network is constructed by using the antecedents and consequents of the association rules as nodes, with directed edges pointing from antecedent to consequent. The resulting structure undergoes expert validation to filter out implausible connections, ultimately producing the safety risk factor Bayesian network topology depicted in Figure 8. In this network, “Accident” is not a node obtained from association rule mining but a newly added node representing chemical production accidents.
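The mapping from rules to a directed graph can be sketched as below. This is only an illustration under assumptions: the rule list is hypothetical, the helper names `build_graph` and `is_acyclic` are invented for the example, and the acyclicity check is a mechanical aid that does not replace the expert validation described above. A Bayesian network must be a directed acyclic graph, so a pair of mutually implying rules has to be resolved to a single direction before the structure is usable.

```python
from collections import deque


def build_graph(edges):
    """Map each node to the set of its children (antecedent -> consequent)."""
    children = {}
    for antecedent, consequent in edges:
        children.setdefault(antecedent, set()).add(consequent)
        children.setdefault(consequent, set())
    return children


def is_acyclic(children):
    """Kahn's algorithm: the graph is a DAG if every node can be topologically ordered."""
    indegree = {node: 0 for node in children}
    for node in children:
        for child in children[node]:
            indegree[child] += 1
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(indegree)


# Hypothetical rules (antecedent, consequent); not the rules of Table 7.
rule_edges = [("R2", "M1"), ("M1", "E6"), ("E6", "E1"), ("E1", "E4")]
# The "Accident" node is added manually, since it does not come from rule mining.
graph = build_graph(rule_edges + [("E4", "Accident")])
print(is_acyclic(graph))

# Adding the reverse rule E1 -> E6 creates a cycle that must be resolved by hand.
cyclic_graph = build_graph(rule_edges + [("E1", "E6")])
print(is_acyclic(cyclic_graph))
```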

4.1. Sensitive Risk Factor Analysis

Sensitivity analysis assesses how strongly a model's output responds to changes in its parameters and is categorized into univariate and multivariate analysis. Univariate analysis measures the sensitivity of the model to each factor individually, while multivariate analysis integrates the effects of multiple parameters. In addition, parametric and interval sensitivity analyses enhance model robustness and support decision-making. In the chemical safety field, sensitivity analysis helps identify key risk factors, assess their impact on safety, guide resource optimization and process improvement, and reduce the risk of accidents.

The Bayesian network in this paper involves multiple risk factors, so multivariate sensitivity analysis is used. The sensitivity coefficient between two nodes in the Bayesian network is calculated as follows:

$$I_{\mathrm{REV}}(F_{i}) = \frac{\max \{P(S = S_{t} \mid F_{i} = f_{ij})\} - P(S = S_{t})}{P(S = S_{t})} \tag{13}$$

$$I_{\mathrm{RRV}}(F_{i}) = \frac{P(S = S_{t}) - \min \{P(S = S_{t} \mid F_{i} = f_{ij})\}}{P(S = S_{t})} \tag{14}$$

$$I_{\mathrm{AVG}}(F_{i}) = \frac{I_{\mathrm{REV}}(F_{i}) + I_{\mathrm{RRV}}(F_{i})}{2} \tag{15}$$

In the above equations, $S$ is a child node and $S_{t}$ is its state; $F_{i}$ is the $i$-th parent node of the child node $S$, and $f_{ij}$ is a state of that parent node. $I_{\mathrm{REV}}(F_{i})$ denotes the risk expansion performance of the parent node $F_{i}$, $I_{\mathrm{RRV}}(F_{i})$ denotes its risk-reducing performance, and the average of the two, $I_{\mathrm{AVG}}(F_{i})$, is the sensitivity coefficient between the parent node $F_{i}$ and the child node $S$.
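As a worked illustration of Equations (13)–(15), the sketch below evaluates the three coefficients for one parent node. The probabilities are hypothetical, the function name `sensitivity` is invented for the example, and the code is a direct evaluation of the formulas rather than a reproduction of the sensitivity algorithm built into any Bayesian network tool.

```python
def sensitivity(prior, conditionals):
    """Evaluate I_REV, I_RRV and I_AVG for one parent node.

    prior        -- P(S = S_t), the marginal probability of the child state
    conditionals -- P(S = S_t | F_i = f_ij) for every state f_ij of the parent
    """
    i_rev = (max(conditionals) - prior) / prior  # relative increase in risk
    i_rrv = (prior - min(conditionals)) / prior  # relative decrease in risk
    i_avg = (i_rev + i_rrv) / 2
    return i_rev, i_rrv, i_avg


# Hypothetical values: P(S = S_t) = 0.20, and the child state has probability
# 0.15 when the parent factor is absent and 0.30 when it is present.
i_rev, i_rrv, i_avg = sensitivity(0.20, [0.15, 0.30])
print(f"I_REV={i_rev:.3f}, I_RRV={i_rrv:.3f}, I_AVG={i_avg:.3f}")
```

With these assumed numbers the parent can raise the probability of the child state by 50% of its baseline and lower it by 25%, giving a sensitivity coefficient of 0.375.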

The aim is to identify the sensitive factors affecting accidents (the “Accident” node). Therefore, sensitivity analysis is performed on the Bayesian network using GeNIe 4.0, with “Accident” selected as the target node. As shown in Figure 9, nodes with darker colors exhibit greater sensitivity.

The computed sensitivity values of the complete set of safety risk factors are visualized in Figure 10. In this paper, factors with a sensitivity value greater than 0.01 are regarded as the sensitive factors of accidents. In descending order, they are non-compliant operation O5, inadequate equipment maintenance and management E5, operators not licensed or not accompanied by supervisors O2, inadequate risk identification for special operations O3, unreliable equipment and facilities E1, adventure organization operations O6, failure to wear protective equipment O1, abnormal pressure/temperature values E3, and careless or misrepresentation of the work site inspection M1. The analysis reveals that, while numerous factors contribute to chemical production accidents, the sensitive factors are concentrated mainly at the operator level. In particular, the sensitivity value of O5 exceeds that of M1 by 166.83%, whereas the sensitivity value of E5 exceeds that of M1 by only 4.04%.

According to Human Factors Engineering (HFE) and High-Reliability Organizing (HRO) theories, operating personnel are critical in a high-risk organization because they are directly involved in operations and can respond quickly to accidents and prevent them from deteriorating. High-risk operations in chemical production, such as hot work and work at height, require operators to have specialized skills and psychological qualities, including chemical knowledge, equipment operation skills, and emergency response capabilities. Operators must stay alert, make quick decisions, and respond effectively to emergencies. In addition, continuous training and education are vital for improving operator competence and keeping up with safety standards. Safety managers should therefore optimize safety inputs and control the sensitive factors in order to prevent accidents and improve production safety.

4.2. Critical Causal Path Analysis

After sensitivity testing, Bayesian network diagnostic analysis was carried out in GeNIe 4.0. The diagnostic approach reconstructs the most credible accident causation chain through backward (effect-to-cause) reasoning in the Bayesian network, so that the most probable path leading to the accident can be found. First, all evidence in the network is cleared and evidence is set on the “Accident” node. Taking the general accident as an example, the accident node $A$ is set to have occurred with certainty, i.e., the evidence $A = 1$ is entered, and the posterior probability of each parent node $X_{i}$ of $A$ is then computed as detailed in Equation (16), where $P(A = 1)$ is the marginal probability of the accident before the evidence is entered:

$$P(X_{i} = 1 \mid A = 1) = \frac{P(A = 1, X_{i} = 1)}{P(A = 1)} \tag{16}$$

According to Equation (16), the parent node with the largest posterior probability is identified, evidence is set on that node, and the procedure is repeated until a root node is reached. In this way, the critical causal pathways for general accidents, larger accidents, and major and above accidents were identified, as illustrated in Figure 11a–c.
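The backward tracing procedure can be sketched on a toy network. Everything in this example is an assumption made for illustration: the four-node structure, the conditional probability tables, and the helper names `joint`, `posterior`, and `critical_path` are hypothetical, and the posterior is obtained by brute-force enumeration of the joint distribution, which is only practical for very small networks.

```python
from itertools import product

# Hypothetical binary network: R -> M -> A <- O (not the network of Figure 8).
parents = {"R": [], "M": ["R"], "O": [], "A": ["M", "O"]}
# P(node = 1 | parent states), keyed by the tuple of parent values.
cpt = {
    "R": {(): 0.3},
    "M": {(0,): 0.1, (1,): 0.6},
    "O": {(): 0.2},
    "A": {(0, 0): 0.01, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.8},
}
order = ["R", "M", "O", "A"]  # parents listed before their children


def joint(assign):
    """Joint probability of one full assignment via the chain rule."""
    p = 1.0
    for node in order:
        p_one = cpt[node][tuple(assign[q] for q in parents[node])]
        p *= p_one if assign[node] == 1 else 1.0 - p_one
    return p


def posterior(node, evidence):
    """P(node = 1 | evidence), computed by enumerating all assignments."""
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=len(order)):
        assign = dict(zip(order, values))
        if any(assign[k] != v for k, v in evidence.items()):
            continue
        p = joint(assign)
        denominator += p
        if assign[node] == 1:
            numerator += p
    return numerator / denominator


def critical_path(target):
    """Greedy backward trace: repeatedly pick the most probable parent given the evidence."""
    evidence = {target: 1}
    path = [target]
    node = target
    while parents[node]:
        node = max(parents[node], key=lambda q: posterior(q, evidence))
        evidence[node] = 1
        path.append(node)
    return path


path = critical_path("A")
print(" <- ".join(path))
```

In this toy network the trace selects M over O at the first step because its posterior given the accident is larger, and then follows M back to its only parent R.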

As can be seen in Figure 11, although the accident categories and the propagation probabilities of the key causal factors differ, the second half of the path is the same for all kinds of accidents. It starts with the failure of safety monitoring and surveillance equipment, which may include the failure of sensors, instruments, or monitoring systems and which impairs real-time monitoring and detection of the production process or equipment status. This failure in turn makes the related equipment and facilities unreliable: the role of monitoring equipment is to detect and correct potential problems promptly, so when monitoring fails, problems may go unnoticed or the necessary action may not be taken in time. The entire production system may then be left in an unreliable state, and unreliable equipment and facilities may cause further problems in the production process, one of which is leakage. Without adequate monitoring and control, potential sources of leakage cannot be detected, isolated, or repaired in time, which increases the risk of leaks of chemicals, gases, or other hazardous substances. Ultimately, these leaks can lead to accidents such as fires, explosions, and chemical spills, depending on the category and quantity of the leaked substance, posing a potential threat to the safety of people and the environment. The earlier accident type statistics also showed that accident types closely related to leakage problems, such as explosions, fires, poisoning, and asphyxiation, accounted for 94% of the total number of accidents, which corroborates the validity of this accident causation chain.

Inadequate regulation is the root cause of both general accidents and major and above accidents. Negligence on the part of the regulator may lead to irregular contracting and the involvement of unqualified firms in production, resulting in safety management deficiencies and non-compliance issues. This may in turn lead to lax or misrepresented site inspections and affect the maintenance of safety management equipment, such as safety monitoring equipment, which may then fail. Failure to wear protective equipment is particularly prominent in major and above accidents. This may be because lax or misrepresented inspections prevent emergency rescue plans and drills from being implemented effectively, so that rescue workers enlarge the accident in its early stages by not wearing protective equipment or by attempting a rescue blindly.

Larger accidents are often triggered by poor tracking and rectification of hidden dangers, which leads to inadequate safety systems, including failure to correct hidden safety problems promptly, insufficient training, and loopholes in management systems. This results in careless on-site inspections and inaccurate risk and hazard identification and assessment. An insufficient safety management framework may also lead to inspection results being misrepresented and problems being covered up, leading to accidents.

According to domino theory, improving these issues can effectively prevent accidents by cutting off the most likely pathways leading to them. However, this does not mean that only the critical causal pathways matter or that other safety risk factors can be ignored. Given the structural complexity of Bayesian networks and the interconnected nature of safety factors, accidents may still occur through other paths after a node has been cut off. Therefore, enterprise safety management needs to cut off critical causal paths while watching for newly emerging paths, and to adjust the safety governance strategy continuously as the critical paths change.

4.3. Statistical Analysis of Frequency

Statistical frequency-based accident risk causation analysis can provide a comprehensive understanding and analysis of historical accident data, revealing potential patterns and regularities in accident occurrence. Through extensive collection of accident data, high-frequency causal factors can be identified, and their relevance and impact can be analyzed.

The frequency of each safety risk factor is calculated as follows:

$$P(w_{i}) = \frac{\mathrm{Count}(D_{w_{i}})}{\mathrm{Count}(D)} \tag{17}$$

where $\mathrm{Count}(D)$ denotes the total number of accident investigation reports, $w_{i}$ $(i = 1, 2, \dots, 33)$ refers to the 33 safety risk factors, and $\mathrm{Count}(D_{w_{i}})$ denotes the number of accident investigation reports that contain risk factor $w_{i}$. The results of the frequency statistics are shown in Figure 12.
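Equation (17) amounts to a column mean over the 0–1 dataset. The sketch below shows the calculation on a hypothetical four-report sample; the function name `factor_frequencies` and the data are assumptions, and the 0.3 cut-off is the value used in the discussion of Figure 12.

```python
def factor_frequencies(rows):
    """Return Count(D_w) / Count(D) for every risk factor column."""
    total = len(rows)
    return {factor: sum(row[factor] for row in rows) / total for factor in rows[0]}


# Hypothetical 0-1 rows for four reports and three factors (not data from the paper).
rows = [
    {"E4": 1, "O5": 0, "C3": 0},
    {"E4": 1, "O5": 1, "C3": 0},
    {"E4": 1, "O5": 0, "C3": 0},
    {"E4": 0, "O5": 0, "C3": 1},
]

frequencies = factor_frequencies(rows)
high_frequency = [factor for factor, p in frequencies.items() if p > 0.3]
print(frequencies)
print(high_frequency)
```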

As can be seen from Figure 12, the statistical frequency of the first 15 items is greater than 0.3, and a clear gap in frequency separates them from the remaining items. These 15 items are: failure to conscientiously fulfill the responsibility of safety supervision R2, leakage problems E4, abnormal concentration of hazardous gases E2, abnormal pressure/temperature values E3, operators not licensed or not accompanied by supervisors O2, hidden dangers rectification and tracking is not in place R1, unreliable equipment and facilities E1, inadequate equipment maintenance and management E5, careless or misrepresentation of the work site inspection M1, inadequate fire and explosion prevention measures M5, inadequate safety education and training C1, failure to wear protective equipment O1, inadequate safety system C2, non-compliant operation O5, and failure of safety monitoring and control equipment E6.

Nearly all of the environment and equipment factors fall within the top 15 by frequency. This suggests that, whatever the initial cause or trigger of an accident, it eventually manifests itself in these environment and equipment factors. Safety management personnel therefore need to strengthen monitoring of these aspects in their daily risk warning work, so as to achieve early warning and nip accidents in the bud. The remaining top-15 factors are distributed fairly evenly among the other four types of risk factors, which indicates that environment and equipment factors are not the root cause of accidents; the root cause often lies with people and management.

With the results of the sensitive risk factor analysis, the critical causal path analysis, and the frequency statistical analysis in hand, it was found that some factors appear in all three result sets, indicating that they are critical in chemical enterprise accidents. The intersection of the three result sets therefore defines the critical risk factors; factors that appear in at least one result set but not in all three are the important risk factors; and the remaining factors are the general risk factors, as expressed by the following formulas:

$$W_{k} = A_{s} \cap A_{r} \cap A_{f} \tag{18}$$

$$W_{i} = (A_{s} \cup A_{r} \cup A_{f}) - W_{k} \tag{19}$$

$$W_{o} = U - (A_{s} \cup A_{r} \cup A_{f}) \tag{20}$$

where $A_{s}$, $A_{r}$, and $A_{f}$ refer to sensitive factors, causal path factors, and high-frequency factors, respectively. $W_{k}$, $W_{i}$, and $W_{o}$ refer to critical, important, and general factors, respectively, and $U$ is the set of all risk factors in the structure of the Bayesian network. The results are detailed in Table 8.
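Equations (18)–(20) map directly onto set operations. The following sketch uses small hypothetical sets to show the three-way classification; the set contents are illustrative only and are not the results of Table 8.

```python
# Hypothetical result sets (illustrative only, not the contents of Table 8).
A_s = {"E1", "O1", "O5"}          # sensitive factors
A_r = {"E1", "O1", "E6"}          # causal path factors
A_f = {"E1", "O1", "E4", "E6"}    # high-frequency factors
U = A_s | A_r | A_f | {"C3"}      # all factors in the network

union = A_s | A_r | A_f
W_k = A_s & A_r & A_f             # critical: present in all three analyses
W_i = union - W_k                 # important: present in some but not all
W_o = U - union                   # general: present in none

print(sorted(W_k), sorted(W_i), sorted(W_o))
```

The three resulting sets are disjoint and together cover every factor in `U`, which is a quick sanity check when the classification is applied to the full factor list.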

This comprehensive analysis shows that the general factors have less influence on chemical enterprise accidents, so the analysis and the corresponding control scheme focus on the critical and important factors. These are: unreliable equipment and facilities E1, failure to wear protective equipment O1, careless or misrepresentation of the work site inspection M1, non-compliant operation O5, inadequate equipment maintenance and management E5, operators not licensed or not accompanied by supervisors O2, inadequate risk identification for special operations O3, adventure organization operations O6, abnormal pressure/temperature values E3, leakage problems E4, failure to conscientiously fulfill the responsibility of safety supervision R2, abnormal concentration of hazardous gases E2, hidden dangers rectification and tracking is not in place R1, inadequate fire and explosion prevention measures M5, inadequate safety education and training C1, inadequate safety system C2, failure of safety monitoring and control equipment E6, and illegalized contracting C7.

This paper found that, among these risk factors, E1, O5, O1, and E4 are typically characterized by visual information. For example, failure to wear protective equipment O1 may manifest, in a poisoning incident, as on-site personnel attempting a rescue without effective protective equipment, which can be detected by checking whether the rescuers are wearing masks, gas masks, or other face protection. Leakage problems E4 may manifest as a pipeline or plant location suddenly emitting a large amount of visible gas or even open flames.

As for unreliable equipment and facilities E1, the most common manifestations at chemical enterprise sites are frequent failures and shutdowns, equipment aging and wear, substandard maintenance and repair, missing or damaged parts, and inefficient automation and control systems. According to the trajectory intersecting theory, when equipment and facilities become unreliable, casualties are very likely if a person is present at the same position, whereas casualties can be effectively avoided if no person is in the vicinity.

The chemical production site environment is complex and the incidence of accidents is high, and personnel intrusion into hazardous areas is a direct cause of accidents. A computer-vision-based method for detecting personnel intrusion into hazardous areas can effectively compensate for the shortcomings of manual supervision in chemical safety management and thereby reduce the probability of accidents.

The remaining risk factors are typically characterized by textual information. For example, careless or misrepresentation of the work site inspection M1 may manifest as the enterprise's on-site safety checklists or hazardous operations management review checklists not being filled out on time. Inadequate safety education and training C1 may manifest as the enterprise not arranging regular safety education and training courses for its employees, and therefore having no corresponding class schedules or course materials, or not holding regular safety examinations, and therefore having no examination papers or transcripts for its employees.

Visual and textual information together characterize the above risk factors and can further reflect the safety management level of the enterprise. It is therefore proposed to establish a safety production platform that integrates textual and visual information. By combining visual images with textual information, such a platform can capture potential risks and hidden dangers in chemical production more clearly and provide more timely and effective early warnings and countermeasures for the relevant departments, so as to minimize safety risks in chemical production and safeguard people's lives and property.

5. Conclusions

Because its production processes involve flammable, explosive, and toxic substances and its production environment is complex and changeable, the chemical industry experiences frequent safety accidents, which pose a severe threat to human life, property, and social stability. To address the inherent constraints of conventional chemical safety management approaches, this study presents a modern, data-driven, technology-based approach to improving the efficiency of safety management.

For data preprocessing, the text of the chemical safety accident investigation reports was segmented using the Jieba word segmentation tool, and segmentation accuracy was improved by using a domain dictionary, a synonym dictionary, and a stop-word dictionary. The BM25W algorithm was used for keyword extraction, extracting critical information from the text to provide input for the LDA model. Next, the chemical safety accident cases were thematically analyzed using the LDA model to identify the major risk factors. The Apriori algorithm was then used in association rule analysis to mine the association relationships between risk factors. Finally, a Bayesian network model was used to analyze the causal relationships between risk factors. The approach established in this research, which combines an improved LDA topic model with a Bayesian network, aims to improve the identification and analysis of chemical safety risks. Thirty-three major risk factors were identified and categorized through text mining and model analysis.

For model evaluation, the model's sensitivity to each risk factor was assessed through sensitivity analysis to identify the key risk factors. The possible paths of accident development were inferred through critical causal path analysis combined with Bayesian network diagnosis. Frequency statistical analysis highlighted the importance of recognizing and preventing common risk factors in chemical safety accidents and the necessity of taking targeted measures in safety management. The findings of this paper not only reveal the potential causes and patterns of chemical safety accidents but also furnish a theoretical framework for chemical plant safety management, helping enterprises to optimize the allocation of resources and improve their level of safety management.

The method proposed in this paper can effectively extract keywords from chemical safety accident reports and reveal the correlation and causality between risk factors. It provides a new analytical tool for the chemical safety field that helps to identify and prevent potential safety risks in advance, makes safety management more scientific and systematic, and better protects personnel and property, which is of great significance for the sustainable development of the chemical industry. The method can cope with the complexity of chemical production sites and offers an innovative basis for the future development of intelligent and automated chemical safety management tools. By processing unstructured data effectively, it no longer relies only on limited experience and one-sided information to cope with chemical safety risks; instead, it establishes more comprehensive and accurate risk identification based on the analysis of a large amount of data, enriching chemical safety risk management theories and methods and promoting the transformation of chemical safety management toward intelligence. This paper also has some limitations. The chemical industry is diverse, and the types of hazardous chemicals involved are complex. Future studies will refine the analysis by distinguishing between different sub-industries of the chemical industry and between the safety risks of different types of hazardous chemicals, so as to make the model more versatile.

Author Contributions

Conceptualization, Z.Z.; methodology, Z.Z. and J.G.; software, Z.Z. and J.G.; validation, Z.Z., J.G. and J.H.; formal analysis, Z.Z. and J.G.; investigation, Z.Z. and J.G.; resources, Z.Z.; data curation, Z.Z., J.G. and J.H.; writing—original draft preparation, Z.Z., J.G. and J.H.; writing—review and editing, Z.Z. and J.G.; visualization, Z.Z. and J.G.; supervision, J.H.; project administration, Z.Z.; funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CGS	Collapsed Gibbs Sampling
LDA	Latent Dirichlet Allocation
MCMC	Markov Chain Monte Carlo
HFE	Human Factors Engineering

References

1. Lassak, P.; Labovsky, J.; Jelemensky, L. Influence of parameter uncertainty on modeling of industrial ammonia reactor for safety and operability analysis. J. Loss Prev. Process Ind. 2010, 23, 280–288.
2. Rathnayaka, S.; Khan, F.; Amyotte, P. SHIPP methodology: Predictive accident modeling approach (Part I: Methodology and model description). Process Saf. Environ. Prot. 2011, 89, 151–164.
3. Jain, P.; Pasman, H.J.; Waldram, S.P.; Rogers, W.J.; Mannan, M.S. Did we learn about risk control since Seveso? Yes, we surely did, but is it enough? An historical brief and problem analysis. J. Loss Prev. Process Ind. 2017, 49, 5–17.
4. Tan, X.Q. Research on Evaluation Model for Safety Capacity of Chemical Industrial Park Based on Regional Risk and Its Application in Chemical Industrial Park. Master’s Thesis, South China University of Technology, Guangzhou, China, 2011.
5. Senave, E.; Jans, M.J.; Srivastava, R.P. The application of text mining in accounting. Int. J. Account. Inf. Syst. 2023, 50, 100624.
6. Zhang, T.T.; Chen, K.; Li, B.Z. Document keyword extraction based on semantic hierarchical graph model. Scientometrics 2023, 128, 2623–2647.
7. Zhang, K. Web News Data Extraction Technology Based on Text Keywords. Complexity 2021, 2021, 5529447.
8. Zhao, F.Y.; Ren, X.B.; Yang, S.S. Latent Dirichlet Allocation Model Training With Differential Privacy. IEEE Trans. Inf. Forensics Secur. 2021, 16, 1290–1305.
9. Li, Z.X.; Nie, F.P.; Wang, R. A Revised Formation of Trace Ratio LDA for Small Sample Size Problem. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 5803–5809.
10. Wu, D.; Yang, R.X.; Shen, C. Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm. J. Intell. Inf. Syst. 2021, 56, 1–23.
11. Jia, Y.R.; Pang, Y.C. Analysis of the Causes of Construction Safety Accidents Based on LDA Modelling. Eng. Constr. 2025, 57, 72–78.
12. Zhang, S.L.; Sun, H. Analysis of Knowledge Mapping and LDA Topic Modelling of Forestry Ecological Construction in China. China For. Spec. Prod. 2025, 80–82.
13. Yang, L. Causation Analysis and Risk Study of Rail Transportation Accidents Based on Text Data. Ph.D. Thesis, Beijing Jiaotong University, Beijing, China, 2021.
14. Zhou, Z.Y.; Huang, J.H.; Lu, Y. A new text mining-Bayesian network approach for identifying chemical safety risk factors. Mathematics 2022, 10, 4815.
15. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
16. Wang, P.; Gao, C.; Chen, X.M. Text clustering study based on LDA model. Intell. Sci. 2015, 33, 63–68.
17. Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101, 5228–5235.
18. Ramesh, N.; William, C.; John, L. Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), Omaha, NE, USA, 28–31 October 2007; pp. 349–354.
19. Brooks, S. Markov chain Monte Carlo method and its application. J. R. Stat. Soc. Ser. D (Stat.) 1998, 47, 69–100.
20. Agrawal, R.; Imielinski, T.; Swami, A. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, 25–28 May 1993; pp. 207–216.
21. Chen, M.S.; Han, J.W.; Yu, P.S. Data mining: An overview from a database perspective. IEEE Trans. Knowl. Data Eng. 1996, 8, 866–883.
22. You, M.J. Research on Coal Mine Safety Risk Identification and Evaluation Based on Text Mining. Ph.D. Thesis, China University of Mining and Technology, Beijing, China, 2022.
23. Roder, M.; Both, A.; Hinneburg, A. Exploring the Space of Topic Coherence Measures; Association for Computing Machinery: New York, NY, USA, 2015; pp. 399–408.
24. Cao, J.; Xia, T.; Li, J.T. A density-based method for adaptive LDA model selection. Neurocomputing 2009, 72, 1775–1781.
25. Biggers, L.R.; Bocovich, C.; Capshaw, R. Configuring latent Dirichlet allocation based feature location. Empir. Softw. Eng. 2014, 19, 465–500.
26. Lv, L.C.; Zhou, J.; Wang, X.Z. A framework for analyzing technology evolution based on a two-layer thematic model and its application. Data Anal. Knowl. Discov. 2022, 6, 18–32.
27. Xiang, Z.Y.; Wu, Y.; Chen, H. A Study on Discovery of Hot Topics in Microblogs Based on the Improved Algorithm of Burst Word to Topic Modeling. Intell. J. 2022, 41, 104–112.
28. Sievert, C.; Shirley, K.E.; Davis, L. A method for visualizing and interpreting topics. In Proceedings of the Association for Computational Linguistics, Baltimore, MD, USA, 22–27 June 2014.
29. Rasmussen, J. Risk management in a dynamic society: A modelling problem. Saf. Sci. 1997, 27, 183–213.
30. Nenonen, N. Analysing factors related to slipping, stumbling, and falling accidents at work: Application of data mining methods to Finnish occupational accidents and diseases statistics database. Appl. Ergon. 2012, 44, 215–224.

Figure 1. Chemical accidents and fatalities in China (2016–2022).

Figure 2. Number of accidents of different severity.

Figure 3. Percentage of different accident types.

Figure 4. Diagram of the LDA topic model.

Figure 5. The 3 basic structures of Bayesian networks: (a) V-structure; (b) same parent structure; (c) sequential structure.

Figure 6. Visualized word cloud of the top 100 keywords.

Figure 7. Perplexity and topic coherence curves.

Figure 8. Bayesian network structure of chemical company accidents.

Figure 9. Results of Bayesian network sensitivity analysis.

Figure 10. Sensitivity values for safety risk factors.

Figure 11. Critical causal transmission pathways for different classes of accidents: (a) critical causal transmission pathways for general accidents; (b) critical causal transmission pathways for larger accidents; (c) critical causal transmission pathways for major and above accidents.

Figure 12. Statistical frequency of accident safety risk factors.

Table 1. Partial word segmentation results.

Serial Number of Corpus	Segmentation Results
1	Construction Site Responsible Person Construction Shop Demolition Fractionated Wastewater Distillation … Review Employee Qualifications
2	Record Shift Monitoring Tank Monitoring Transfer Record Monitoring Tank Monitoring … Safety Education and Training
3	Workshop personnel handover responsible for liquid chlorine workshop inspections … supervisory and management capabilities
4	Da Fang enterprise staff enterprise workshop pipe cutting operation safety technology submission … rectification responsible for
5	Enterprise production workshop vacuum work production epoxy resin production site … enterprise elimination
…	…
513	Tufa production responsible person government departments production employees personnel idle play … supervisory and management responsibilities
514	Enterprise flammable and explosive gas production process turnover personnel material suppliers … inspection and maintenance operations

Table 2. Selected feature words with high BM25W scores.

Characteristic Word	BM25W Value
corporations	3.7427
worker	2.7936
safety education training	2.6337
officers	2.5535
equipment and facilities	2.5319
probe	2.4833
operation	2.4556
supervise and manage	2.3683
scene	2.0720
reporting	1.7376
blast	1.7325
illegal and irregular	1.7306
produce	1.6001
flammable and explosive gases	0.9053
toxic and hazardous gas	0.8988
special operation	0.8837
explosive mixture	0.7736
tank	0.7735
asphyxia due to poisoning	0.7690
special equipment	0.7471
reaction kettle	0.7407
inflammable	0.7284
workplace	0.7045
pressure piping	0.6980
break apart	0.6764
safety management system	0.6745
dynamo	0.6696
occupation permit	0.6586
wear	0.6375
limited space	0.6354

Table 3. Theme mining results of accident risk factors in chemical enterprises.

Serial Number	Thematic Trait Words	Risk Factor Theme
1	Supervision and management of enterprises in violation of the law duties of government departments to supervise and guide the review and inspection of the corporate sector	Failure to conscientiously fulfill safety supervision responsibilities
2	Business Operations Operator On-Site Personnel Inspection Report Operations Site Responsible for False Reporting	Insufficiently detailed or misrepresented job site inspections
3	Production enterprise production process phase-out workshop compliance risk operation backward and old	Outdated and obsolete production processes
4	Lack of protection from limited space safety measures at the operating enterprise for rescuing poisoned and asphyxiated persons	Failure to wear protective equipment
5	Enterprise personnel safety education and training operating procedures for the development of post record system leads to	Inadequate safety education and training
6	Operations Operator Operations Ticket On-Site Supervisor License Equipment and Facilities Worksite Special Operations Safety Education and Training	Workers not licensed or not accompanied by a guardian on duty
7	Equipment and Facilities Inspection Hidden Hazards Leakage Supervision and Management Rectification Fracture Management Production Process Supervision	Inadequate tracking of hidden dangers and corrective actions
8	Enterprise equipment and facilities rupture agglomeration design leakage production process concentration old environment	Unreliable equipment and facilities
9	Explosive Gas Explosive Mixture Equipment Facility Air Ignition Source Damage Inspection Combustion Mixing	Unusual concentrations of hazardous gases
10	Vehicle Transportation Enterprises Driver Qualification Operator Tank Illegal Violations Monitoring Supervision	Inadequate transportation management
11	Pressure beyond the equipment and facilities valve operating procedures operating post alarm interlock site	Abnormal pressure/temperature values
12	Report the leakage of the shift supervisor in the central control room on duty to inspect the shift handover post site departure	Inadequate shift handovers or technical safety briefings
13	Dust Burning Fire Production Explosion Protection Illegal Violations Employee Equipment Facility Personnel	Inadequate fire and explosion prevention measures
14	Construction Workers Qualification Welding Outsourcing Supervision and Management Construction Unit Contractor Violation of Laws and Regulations Safety Technical Briefing	Non-compliance with contracting regulations
15	Field Operator Spill Personnel Reporting Business Inspection Equipment and Facilities Smoke Operating Procedures	Leakage problem
16	Special Operations Inspection Ventilation Development Rescue Application Safety Education Training Risk Identification Blind Workers	Inadequate risk identification for special operations
17	Illegal explosion in the workplace illegal operation of the operator employee production process illegal operation	Weak security awareness
18	Business Cutting Ignition Operator Pipeline Explosion Operation Ticket Unlicensed Employee Installation	Workers not licensed or not accompanied by a guardian on duty
19	Pressure pipeline valve leakage splash site flange removal operation shift supervisor inspection	Leakage problem
20	Inspection and Maintenance Operations Inspection Personnel Reporting Programs Parking Responsible Measures System Failures	Inadequate management of equipment maintenance
21	Pipeline Leak Site Safety Measures Report Responsible Person Personnel Coordination Lack of Protection	Inadequate production safety system
22	Program Sampling Instrumentation Personnel Reporting Approval Analysis Manager Illegal Leaving	Reporting program approvals go through the motions
23	Production Process Recycling Workshop Pilot Run Cooling Raw Material Installation Equipment Facilities Shift Supervisor Technicians	Inadequate commissioning of production processes
24	Trial Production System Gas Stop Production Program Illegal and Illegal Equipment and Facility Reporting Supervision and Management	Organization of production in violation of the law
25	Warehouses clustered storage combustion explosion enterprise transportation ignition violations of the law negligence	Illegal storage
26	Flammable and explosive gases personnel inspection and maintenance concentration exceeds the inspection emission site operator return	Unusual concentrations of hazardous gases
27	Temperature Material Pressure Splash Rise Burnout Steam Control Heating Lack of	Abnormal pressure/temperature values
28	Storage Tank Demolition Ruptured Tank Connection Splash Leaves Pipeline Reported Leaking	Unreliable equipment and facilities
29	Fan Bolt Job Inspection and Repair Flange Inspection and Repair Removal Report Personnel Equipment and Facilities	Inadequate management of equipment maintenance
30	Flammables Hazardous Concentrations Pipe Falls Illegal Violations Ignition Rectification Ordinance System	Inadequate tracking of hidden dangers and corrective actions
31	Toxic and Hazardous Gas Risk Identification Emission Risk Disconnection Pipeline Monitoring Reporting Blind System	Failure of security monitoring and surveillance equipment
32	Cylinder Gas Inspection Oxygen Operator Safety Education and Training Operator Certificate Recorder	Weak security awareness
33	Plant violations mixtures inspections are responsible for installing inspections put into operation points lead to	Inadequate security inspections
34	Sewage clogging of protective equipment facilities overflowing pipeline discharge agglomeration regulation development	Illegal treatment of sewage and wastewater
35	Oil products illegal violations equipment and facilities explosion elimination tank operations main responsibility electrostatic fire protection devices	Risky organization of operations
36	Waste Wastewater Enterprise Mixed Production Process Sampling Equipment Facility Operators Show Clustering	Illegal treatment of sewage and wastewater
37	Fall Fracture Explosion Shock Wave Fire Protection Device Operation Impact Punch Out Risk Identification Explosion Prevention	Inadequate fire and explosion prevention measures
38	Monitoring Records Risk Identification Operational Instrumentation Risk Driving Failure System Programs	Failure of security monitoring and surveillance equipment
39	Special equipment commissioning central control room instruction specification information development technician feeding defects	Unauthorized command
40	Charging and mixing commissioning illegal and irregular contact standard replacement operation laws and regulations operating procedures	Non-compliant operation
41	Safety Measures Personnel Replacement Positions Valve Management Operating Procedures Illegal Changes Looking For	Inadequate production safety system
42	The main responsibility valve is responsible for operating procedures input development manager registration laws and regulations leadership	Failure to uphold paramount safety requirements
43	Corrosive substances are responsible for analyzing emergency drills return laws and regulations friction oxygen technology safety measures	Non-implementation of emergency relief management
44	Manufacturer Safety Measures Government Department Approval Compliance Equipment Facility Inspection Technician Qualification Blindness	Inadequate approval of safety management and technical measures
45	Cleaning Dismantling Operations Violators Tank Transportation Laws and Regulations Input Skills	Non-compliant operation
46	Warning Signs Employee Leads to Heated Passage Warning Signs On Duty Opinion Procedures Opened	Failure of safety warning signs

Table 4. Safety risk factors in chemical enterprises.

Classification of Risk Factors	Safety Risk Factors
R Regulatory authorities	R1 Hidden dangers rectification and tracking is not in place
	R2 Failure to conscientiously fulfill the responsibility of safety supervision
	R3 Inadequate approval of safety management and technical measures
C Chemical enterprises	C1 Inadequate safety education and training
	C2 Inadequate safety system
	C3 Organizing production in violation of the law
	C4 Inadequate security inspections
	C5 Failure to implement the main responsibility for safety
	C6 Non-implementation of emergency relief management
	C7 Non-compliance with contracting regulations
M Site management	M1 Careless or misrepresented work site inspections
	M2 Backward and obsolete production processes
	M3 Inadequate transportation management
	M4 Inadequate handover or technical safety briefings
	M5 Inadequate fire and explosion prevention measures
	M6 Illegal storage
	M7 Illegal treatment of sewage and wastewater
	M8 Inadequate commissioning of production processes
	M9 Reporting program approvals go through the motions
O Operators	O1 Failure to wear protective equipment
	O2 Operators not licensed or not accompanied by supervisors
	O3 Inadequate risk identification for special operations
	O4 Unauthorized command
	O5 Non-compliant operation
	O6 Risky organization of operations
	O7 Weak security awareness
E Environment and equipment	E1 Unreliable equipment and facilities
	E2 Abnormal concentrations of hazardous gases
	E3 Abnormal pressure/temperature values
	E4 Leakage problems
	E5 Inadequate equipment maintenance and management
	E6 Failure of safety monitoring control equipment
	E7 Failure of safety warning signs

Table 5. Boolean dataset of accident risk causation in chemical companies.

Serial Number	C1	C2	C3	C4	C5	C6	C7	E1	E2	E3
1	1	1	0	0	0	0	1	1	1	1
2	0	1	0	0	1	0	0	0	0	1
3	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	1	0
5	1	0	0	1	0	0	0	0	1	0
6	1	0	0	0	0	0	1	1	1	0
7	0	1	0	0	1	0	0	0	1	1
8	1	1	0	1	0	0	0	0	1	1
9	0	1	0	0	0	0	0	0	0	1
10	0	0	0	1	0	1	0	0	0	0
11	0	0	0	1	0	0	0	0	0	0	…
12	0	0	0	0	0	1	1	1	0	0
13	0	0	0	0	0	0	0	0	0	0
14	0	0	0	0	0	0	0	0	0	0
15	0	1	0	0	0	0	0	0	0	0
16	1	1	0	0	1	0	0	0	1	0
17	0	0	0	0	0	1	0	0	0	1
18	0	0	0	0	0	0	0	0	0	0
19	1	0	0	0	0	0	1	0	1	0
20	0	0	0	0	0	0	1	0	1	0
					…
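
The Boolean encoding in Table 5 is the input form that association rule mining works on: each row is one accident report and each column flags whether a risk factor was identified in it. As a minimal sketch, the Python below computes support, confidence and lift for one candidate rule over the 20 rows and 10 columns excerpted above. The function name `rule_metrics` and the choice of rule (C7 → E2) are illustrative assumptions, and because only the excerpt is used, the values will not match those reported for the full 514-report dataset.

```python
# Columns and rows copied from the Table 5 excerpt (one string per report).
COLUMNS = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "E1", "E2", "E3"]
ROWS = [
    "1100001111", "0100100001", "0000000000", "0000000010", "1001000010",
    "1000001110", "0100100011", "1101000011", "0100000001", "0001010000",
    "0001000000", "0000011100", "0000000000", "0000000000", "0100000000",
    "1100100010", "0000010001", "0000000000", "1000001010", "0000001010",
]

# Turn each row into a dict such as {"C1": True, "C2": True, ...}.
records = [
    {name: flag == "1" for name, flag in zip(COLUMNS, row)} for row in ROWS
]


def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(records)
    n_a = sum(1 for r in records if r[antecedent])
    n_c = sum(1 for r in records if r[consequent])
    n_both = sum(1 for r in records if r[antecedent] and r[consequent])
    support = n_both / n                      # P(A and C)
    confidence = n_both / n_a if n_a else 0.0  # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0  # P(C | A) / P(C)
    return support, confidence, lift


support, confidence, lift = rule_metrics(records, "C7", "E2")
print(f"C7 -> E2: support={support:.3f}, confidence={confidence:.3f}, lift={lift:.3f}")
```

A lift above 1 means the consequent appears more often in reports containing the antecedent than in the dataset overall, which is the sense in which a rule is read as a positive association.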

Table 6. Association rule mining base dataset.

Serial Number	C1	C2	C3	C4	C5	…	R2	R3	Accident Level
1	1	1	0	0	0	…	1	0	G
2	0	1	0	0	1	…	0	0	G
3	0	0	0	0	0	…	1	0	G
4	0	0	0	0	0	…	1	0	G
5	1	0	0	1	0	…	0	0	G
…	…	…	…	…	…	…	…	…	…
513	0	1	0	1	0	…	1	0	L
514	0	1	0	0	0	…	0	0	G

Table 7. Strong association rules for safety risk factors in chemical enterprises.

Serial Number	Antecedent	Consequent	Support	Confidence	Lift
1	C5	M2	0.085603	0.698413	2.425568
2	C5	M8	0.052529	0.428571	2.368664
3	M8	E7	0.050584	0.279570	1.796237
4	M1	O3	0.126459	0.353261	1.665836
5	O6	E1	0.073930	0.603175	1.640380
6	C4	M6	0.060311	0.319588	1.564458
7	M9	E4	0.071984	0.397849	1.526079
8	M9	M4	0.071984	0.397849	1.526079
9	M7	O3	0.079767	0.310606	1.464693
10	M1	O1	0.175097	0.489130	1.461704
11	O3	O1	0.103113	0.486239	1.453062
12	O2	O3	0.120623	0.306931	1.447361
13	C7	O2	0.132296	0.557377	1.418276
14	O3	E6	0.095331	0.449541	1.417572
15	C4	M5	0.093385	0.494845	1.413058
16	M7	O1	0.120623	0.469697	1.403629
17	M2	E3	0.167315	0.581081	1.402233
18	C6	M7	0.052529	0.360000	1.401818
19	M8	E3	0.105058	0.580645	1.401181
20	M9	M2	0.071984	0.397849	1.381720
21	C5	O5	0.054475	0.444444	1.367931
22	E7	E3	0.087549	0.562500	1.357394
23	C7	O3	0.068093	0.286885	1.352835
24	M6	M5	0.095331	0.466667	1.332593
25	C7	O7	0.077821	0.327869	1.326965
26	O6	O5	0.052529	0.428571	1.319076
27	M7	E6	0.107004	0.416667	1.313906
28	O4	E6	0.052529	0.415385	1.309863
29	C7	M7	0.079767	0.336066	1.308619
30	R2	C7	0.142023	0.309322	1.303209
31	M8	M7	0.060311	0.333333	1.297980
32	C6	M2	0.054475	0.373333	1.296577
33	C1	M6	0.089494	0.264368	1.294143
34	O6	E2	0.068093	0.555556	1.292107
35	C3	M2	0.058366	0.370370	1.286286
36	C2	M8	0.075875	0.232143	1.283026
37	C3	M4	0.052529	0.333333	1.278607
38	R2	M3	0.132296	0.288136	1.276739
39	M3	E1	0.105058	0.465517	1.266010
40	O2	E5	0.180934	0.460396	1.265474
41	O3	E5	0.097276	0.458716	1.260855
42	C7	M1	0.107004	0.450820	1.259355
43	O4	E2	0.068093	0.538462	1.252349
44	O6	O2	0.060311	0.492063	1.252082
45	R2	C4	0.107004	0.233051	1.234929
46	M2	M8	0.064202	0.222973	1.232345
47	M1	O2	0.173152	0.483696	1.230790
48	M5	E2	0.184825	0.527778	1.227501
49	O4	E3	0.064202	0.507692	1.225135
50	C2	M9	0.071984	0.220238	1.217230
51	M1	E6	0.138132	0.385870	1.216791
52	O4	E1	0.056420	0.446154	1.213350
53	E6	E1	0.140078	0.441718	1.201285
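
The three metrics in Table 7 are linked: support is the share of reports containing both items, confidence is that count divided by the antecedent count, and lift is confidence divided by the consequent's share. As a hedged sanity check, the sketch below recovers the report counts implied by a rule's reported values, assuming the dataset holds 514 reports (the last serial number in Tables 1 and 6). The helper name `implied_counts` is hypothetical, and the counts are an inference from the rounded table values, not figures stated in the article.

```python
N_REPORTS = 514  # assumption: total reports, taken from the last serial number


def implied_counts(support, confidence, lift, n=N_REPORTS):
    """Recover (both, antecedent, consequent) report counts from rule metrics."""
    both = round(support * n)                  # reports containing both items
    antecedent = round(both / confidence)      # reports containing the antecedent
    consequent = round(confidence / lift * n)  # reports containing the consequent
    return both, antecedent, consequent


# Rule 1 (C5 -> M2) and rule 17 (M2 -> E3), values copied from Table 7.
rule_1 = implied_counts(0.085603, 0.698413, 2.425568)
rule_17 = implied_counts(0.167315, 0.581081, 1.402233)
print("C5 -> M2:", rule_1)
print("M2 -> E3:", rule_17)
```

Under this assumption, M2 comes out at the same count whether it is derived as the consequent of rule 1 or as the antecedent of rule 17, which suggests the reported values are internally consistent.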

Table 8. Critical, important and general factors for accidents in chemical enterprises.

Sensitive Factor	Causal Pathway Factors	High Frequency Factors	Critical Factor	Important Factor	General Factors
O5	R1	R2	E1	O5	M7
E5	R2	E4	O1	E5	C5
O2	C2	E2	M1	O2	C6
O3	C7	E3		O3	O7
E1	M1	O2		O6	O4
O6	O1	R1		E3	C4
O1	E1	E1		E4	M9
E3	E4	E5		R2	M3
M1	E6	M1		E2	E7
		M5		R1	M4
		C1		M5	M2
		O1		C1	M8
		C2		C2	C3
		O5		E6	R3
		E6		C7	M6
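
The last three columns of Table 8 are consistent with a simple set rule over the first three: a factor is critical if it appears in all three lists, important if it appears in at least one but not all, and general if it appears in none. This rule is inferred from the table contents and is not stated in this section, so the sketch below should be read as an illustration under that assumption. The variable names are hypothetical.

```python
# Factor lists copied from the first three columns of Table 8.
SENSITIVE = {"O5", "E5", "O2", "O3", "E1", "O6", "O1", "E3", "M1"}
CAUSAL_PATH = {"R1", "R2", "C2", "C7", "M1", "O1", "E1", "E4", "E6"}
HIGH_FREQUENCY = {
    "R2", "E4", "E2", "E3", "O2", "R1", "E1", "E5",
    "M1", "M5", "C1", "O1", "C2", "O5", "E6",
}

# All 33 factor codes from Table 4: R1-R3, C1-C7, M1-M9, O1-O7, E1-E7.
ALL_FACTORS = {
    f"{prefix}{i}"
    for prefix, count in [("R", 3), ("C", 7), ("M", 9), ("O", 7), ("E", 7)]
    for i in range(1, count + 1)
}

# Assumed rule: intersection -> critical, rest of the union -> important,
# everything outside the union -> general.
any_list = SENSITIVE | CAUSAL_PATH | HIGH_FREQUENCY
critical = SENSITIVE & CAUSAL_PATH & HIGH_FREQUENCY
important = any_list - critical
general = ALL_FACTORS - any_list

print("critical: ", sorted(critical))
print("important:", sorted(important))
print("general:  ", sorted(general))
```

Applied to the lists above, this yields 3 critical, 15 important and 15 general factors, which together cover the 33 factors of Table 4.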

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhou, Z.; Guo, J.; Huang, J. Chemical Safety Risk Identification and Analysis Based on Improved LDA Topic Model and Bayesian Networks. Appl. Sci. 2025, 15, 6197. https://doi.org/10.3390/app15116197
