1. Introduction
The chemical industry is pivotal to China’s economy, but its safety remains a concern [1,2]. The industry’s high temperatures and pressures and the presence of flammable, explosive, toxic, and harmful substances often result in serious accidents [3]. Incidents in Xiangshui, Jiangsu, and Yima, Henan, caused significant casualties and economic losses. Complex chemical production processes and many interrelated risks make accident causation intricate, requiring in-depth research and solutions [4]. Accident causes involve organizational management, regulatory mechanisms, and personnel operations, each contributing differently. According to the principle of the critical few, prevention and control should target core causes, making accurate identification of key factors vital for effective chemical safety governance [5].
Chemical enterprises have accumulated a massive number of accident case reports, which are valuable but unstructured information sources. Traditional manual analysis of these reports is inefficient, prone to subjectivity, and unable to fully uncover the deep patterns they contain [6]. Text mining technology offers a new way to address this problem [7]. At present, leveraging text mining on accident reports to extract risk evolution patterns from vast amounts of unstructured data, combined with intelligent detection technologies based on sensor networks, computer vision, and AI algorithms (such as video behavior analysis, real-time equipment condition monitoring, and automatic gas leak identification), is crucial for achieving accurate risk identification and rapid response, thereby promoting the transformation of chemical safety governance into a data-driven model [8].
This study addresses the complex and coupled nature of accident causes by developing an analytical framework that combines unstructured text mining with Bayesian networks. This framework is applied to extract key contributing factors from accident investigation reports and to quantify risk transmission paths.
In terms of cause analysis of chemical accidents, one study proposed an analysis method based on the organizational-level accident triangle, which divided accidents into different levels and used the Spearman correlation coefficient to verify the rationality of the classification. This method helps to quickly identify risk factors, but it focuses mainly on organization-level classification, and its exploration of underlying accident causes remains shallow [9]. Another study analyzed representative accidents in chemical Small and Medium-sized Enterprises (SMEs) through a revised Human Factors Analysis and Classification System (HFACS) framework and provided comprehensive recommendations for preventing major chemical accidents [10]. However, that research focused mainly on human factors, with limited analysis of equipment, environment, and other factors, making it difficult to fully reveal the complex causal relationships of accidents. Further research [11] applied complex network theory and the HFACS framework to analyze 109 accidents, identifying critical nodes through Gephi topology analysis. However, this study simplified the application of complex networks and did not investigate the critical nodes in depth. Another study layered risk perception factors and constructed a system dynamics feedback loop based on an interpretive structural model, revealing that risk experience was the basic driving factor [12]. However, the factor variables in that system dynamics model were difficult to quantify, and no empirical measurement method was designed. Existing research on accident cause analysis mainly focuses on a single factor or specific link, making it difficult to fully reveal the complex causal relationships of accidents. In addition, existing methods fall short in capturing the dynamic evolution of accidents and the nonlinear relationships among factors, which affects the accuracy and reliability of the results.
In terms of quantitative analysis, one study developed a semi-quantitative method that integrates safety, economic, and aging factors to prioritize risk in industrial chemical plants, but specific high-risk projects require fully quantitative tools for more accurate assessment [13]. Another study proposed a quantitative analysis method based on network evolution to construct the security risk network of chemical enterprises. However, the computational complexity of this method is high when dealing with complex networks, which limits its application to large-scale datasets [14]. Further research analyzed 271 thermal runaway events, quantified the accident characteristics, and proposed a four-dimensional prevention strategy. However, the study’s classification of accident causes was relatively simple and failed to deeply explore the interactions between different factors [15]. Additionally, a study applied an improved HFACS model combined with the Bayesian method to study unsafe behavior in hazardous chemical storage, but its factor system did not cover macro variables, resulting in limited prediction scenarios [16]. One study proposed an improved fuzzy Bayesian network model to increase the accuracy and reliability of tank accident prediction, but the model’s efficiency needs improvement when handling large-scale data [17]. Another study used Bayesian belief networks to quantify the risk of refinery fire and explosion events, but it relied mainly on expert knowledge and offered insufficient support for data-driven analysis [18]. Overall, the efficiency and accuracy of existing quantitative methods need improvement on large-scale data, and their shallow treatment of factor interactions makes it difficult to effectively predict the development trends and potential risks of accidents.
In the application of text mining technology, one study developed a semi-automatic method based on natural language processing to construct a knowledge graph of chemical accidents, but its efficiency and accuracy in processing large-scale text data need improvement [19]. Another study developed a chemical accident case text mining method based on word embeddings and deep learning, but its classification accuracy for different accident types leaves room for improvement [20]. Further research proposed a method for accident consequence prediction and investigation based on natural language processing, but its prediction accuracy and timeliness in practical applications require further verification [21]. Another study proposed an improved text mining method to extract risk factors from reports and build a Bayesian network model, but this method is not robust enough to handle complex causal relationships [22]. Additionally, one study applied text mining and BERT models to identify the potential consequences of refinery accidents, but this approach has limited adaptability to accidents in other industries [23]. Another work applied NLP and text mining techniques to analyze pipeline accident texts, but did not go far enough in extracting dependencies between influencing factors [24]. Existing text mining approaches thus lack efficiency and accuracy on large-scale text data, need better prediction accuracy and robustness, and fall short in cross-industry accident analysis and factor dependence extraction, making it difficult to effectively support comprehensive causal analysis and risk prediction.
Although many studies have explored the causes of chemical accidents, shortcomings remain. Some studies focus on single factors, making it difficult to fully reveal the complex causal relationships of accidents [25]. Others have limitations in data collection and analytical methods that affect the accuracy and reliability of their results. Quantitative analysis methods in some studies require improvement in efficiency and accuracy when processing large-scale data, and text mining methods in other works suffer from similar limitations on large-scale text data. To address these limitations, this paper employs text mining to extract 29 key factors from a large number of chemical safety accident cases, constructs a complex network model of accident causes, quantitatively analyzes factor importance to determine the core factors, and uses Bayesian networks for quantitative analysis, revealing the key influencing factors, clarifying the accident causation mechanism, and identifying critical association paths. The identification of these core factors and paths contributes to the theoretical understanding of accident causation mechanisms.
2. Research Methods and Data Selection
The causes of safety accidents in chemical production are characterized by complexity, systematicity, and unstructured information. Faced with large volumes of unstructured Chinese accident investigation reports, traditional manual analysis and single-factor statistical methods show clear limitations in revealing multi-factor coupling and risk propagation paths. To enhance the readability and extensibility of the analytical framework in this study, we briefly outline the logical connections and applications of text mining, complex network analysis, and Bayesian networks:
Text mining is used to automatically extract key contributing factors from unstructured reports, overcoming the subjectivity and inefficiency of manual annotation and forming a quantifiable set of discrete factors. Complex network analysis constructs a systemic structural model based on factor associations, identifying core nodes and vulnerable links in the network through topological metrics to reveal the structural characteristics of risk propagation. Bayesian networks further quantify dependencies and causal strengths among factors within a probabilistic framework, supporting uncertainty reasoning and dynamic risk prediction.
These three methods form a progressive analytical chain of “information extraction → structural modeling → relationship quantification,” collectively enabling a systematic deconstruction of accident causation from textual information to network structure and finally to probabilistic mechanisms. This integrated approach not only provides an operable path for analyzing multi-source heterogeneous data in the field of chemical safety but also offers methodological insights for interdisciplinary complex system research.
2.1. Complex Networks
The development of complex network theory has provided new analytical tools for the study of real-world complex systems. This research method integrates theories and techniques from multiple disciplines, such as mathematics and computer science, reflecting its interdisciplinary nature. It aims to reveal the network topology, evolution mechanisms, and functional characteristics of complex systems [26]. At the end of the 20th century, the introduction of small-world networks and scale-free networks marked a significant breakthrough in complex network research [27], providing theoretical support for the analysis of system behavior patterns. The theory has been widely applied to Internet topology [28], social networks [29], infrastructure networks [30], and transportation systems [31], among others. Meanwhile, complex network research focuses on the dynamic evolution of systems, tracking changes such as the addition or removal of nodes and the reconfiguration of edges to explain the internal logic of system development, thereby providing theoretical support for understanding the operational mechanisms of complex systems.
Degree centrality measures a node’s local influence by calculating the number of its direct connections. In a chemical safety risk network, this metric can be used to evaluate the likelihood that a specific risk factor may lead to an accident. Betweenness centrality and closeness centrality, on the other hand, assess a node’s global influence within the network and its sensitivity to external factors from different perspectives. Such a multi-indicator evaluation approach helps to comprehensively identify the key roles of risk factors in the accident propagation process. The degree centrality of a node can be calculated according to Equation (1):

$C_D(i) = \frac{k_i}{N-1}$ (1)

where $C_D(i)$ represents the normalized degree centrality of node $i$; $k_i$ represents the total number of connections (sum of in-degree and out-degree) of node $i$; and $N$ represents the total number of nodes in the network.
Closeness centrality measures the accessibility of a node to the other nodes in the network. By quantifying the shortest path distances between nodes, it reflects the efficiency and breadth of a node’s influence in information transmission. In a chemical safety risk network, nodes with higher closeness centrality values possess more significant influence, indicating that these nodes can more rapidly establish associations with other risk factors, occupy critical positions within risk transmission paths, and thus hold greater reference value for the formulation of risk warning and prevention strategies. The closeness centrality of a node is quantified according to Equation (2):

$C_C(i) = \frac{N-1}{\sum_{j \neq i} d_{ij}}$ (2)

where $C_C(i)$ represents the closeness centrality of node $i$; $d_{ij}$ represents the shortest path length from node $i$ to node $j$; and $N$ represents the total number of nodes in the network.
Betweenness centrality measures the frequency with which a node acts as a “bridge” in the shortest paths of the network, reflecting its control over information flow or risk transmission. The betweenness centrality of a node is calculated as shown in Equation (3):

$C_B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}$ (3)

where $C_B(i)$ represents the betweenness centrality of node $i$; $\sigma_{st}$ represents the total number of shortest paths from node $s$ to node $t$; and $\sigma_{st}(i)$ represents the number of those paths that pass through node $i$.
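As a minimal illustration of Equations (1) and (2), the following sketch computes normalized degree centrality and closeness centrality on a toy undirected network of five hypothetical factor nodes (the labels a3, y4, z1, z3, and x3 are illustrative placeholders, not the study’s actual network); betweenness centrality per Equation (3) can be added analogously, e.g., via Brandes’ algorithm.

```python
from collections import deque

# Toy undirected risk-factor network (hypothetical labels).
edges = [("a3", "y4"), ("y4", "z3"), ("z3", "z1"), ("a3", "z1"), ("z1", "x3")]
nodes = sorted({v for e in edges for v in e})
adj = {v: set() for v in nodes}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
N = len(nodes)

# Degree centrality: k_i / (N - 1), Equation (1).
degree = {v: len(adj[v]) / (N - 1) for v in nodes}

def bfs_dist(src):
    """Shortest-path lengths from src in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

# Closeness centrality: (N - 1) / sum of shortest-path lengths, Equation (2).
closeness = {v: (N - 1) / sum(bfs_dist(v).values()) for v in nodes}
```

In this toy network the hub node z1 scores highest on both metrics, which is exactly the “core node” signal the network analysis in Section 3 looks for.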
2.2. Bayesian Network
As a graphical probabilistic model, the Bayesian network [32], based on Bayes’ theorem, effectively describes the probabilistic relationships among random variables. This model can achieve autonomous information learning and structure optimization under conditions of limited information and uncertainty and output precise reasoning conclusions.
In the Bayesian network system, the conditional probability table (CPT) is an important tool for representing the probability distribution of a node given the states of its parent nodes, usually presented in tabular form. This table completely records the conditional probability values of the node under various combinations of its parent node states. Taking node A with two parent nodes, B and C, as an example, its conditional probability can be calculated using Formula (4):

$P(A \mid B, C) = \frac{P(A, B, C)}{P(B, C)}$ (4)

The CPT provides the probability distribution of the node for each combination of parent node states. Through this formula, the probability distribution characteristics of node A under different combinations of the states of parent nodes B and C can be characterized.
- 2.
Joint Probability Distribution
For a Bayesian network containing random variables $X_1, X_2, \dots, X_n$, the joint probability distribution $P(X_1, X_2, \dots, X_n)$ completely describes the probability characteristics of the various state combinations of the nodes. This distribution reflects the statistical properties of all variable state combinations given the network structure. Within the Bayesian network system, the joint probability distribution can be calculated from the network’s topological structure and the conditional probability distributions of each node. Specifically, the joint probability distribution can be expressed by Formula (5):

$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$ (5)

where $\mathrm{Pa}(X_i)$ denotes the set of parent nodes of $X_i$.
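This factorization can be made concrete with a small sketch: a hypothetical three-node network B → A ← C, with assumed prior and conditional probabilities, whose joint distribution is the product of each node’s (conditional) probability. The numbers are illustrative, not estimated from the accident data.

```python
from itertools import product

# Hypothetical parameters for node A with parents B and C (binary states).
p_b = {1: 0.3, 0: 0.7}                 # P(B)
p_c = {1: 0.2, 0: 0.8}                 # P(C)
p_a_given_bc = {                       # CPT: P(A=1 | B, C)
    (1, 1): 0.9, (1, 0): 0.6,
    (0, 1): 0.5, (0, 0): 0.1,
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the factorization of Formula (5)."""
    pa1 = p_a_given_bc[(b, c)]
    return p_b[b] * p_c[c] * (pa1 if a == 1 else 1 - pa1)

# Sanity check: the joint distribution sums to 1 over all state combinations.
total = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))

# Marginal P(A=1) is obtained by summing out the parent states.
p_a1 = sum(joint(1, b, c) for b, c in product([0, 1], repeat=2))
```

The same summing-out operation underlies the probabilistic inference used later for risk prediction: fixing evidence variables and marginalizing the rest.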
2.3. Data Sources
The data used in this study are primarily derived from official Chinese chemical accident investigation reports published between 2010 and 2023. Specifically, the database comprises 537 reports released by government agencies such as the State Administration of Work Safety and the Ministry of Emergency Management, covering various safety incidents in the chemical industry. The source of the reports is limited to publicly accessible information on the Internet, and some reports contain missing information. During data preprocessing, such incomplete cases or reports were excluded, resulting in a final sample of 422 accident investigation reports for this study. These reports provide detailed records of accident types, contextual backgrounds, causal analyses, and outcomes across different chemical enterprises, offering a rich empirical foundation for the research. By systematically organizing and analyzing these official reports, we were able to extract the primary contributing factors of accidents and further construct a Bayesian network model for causal inference. The official reports are highly authoritative and reliable, encompassing long-term safety data in the chemical industry, and they capture the characteristics and trends of different types of accidents over time, thus providing invaluable first-hand information for accident analysis.
To assess the robustness of model parameter estimation, this study employs a non-parametric bootstrap method. By performing 1000 resamplings with replacement from the original accident case data, the Bayesian network is retrained each time, and path coefficients as well as sensitivity indices are recalculated. This yields an empirical distribution for each estimated parameter, based on which the 95% confidence interval and standard error are computed.
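The resampling procedure described above can be sketched as follows. The data and the statistic (the occurrence rate of a single factor across cases) are illustrative placeholders rather than the study’s actual path coefficients, but the CI and standard-error logic is the same.

```python
import random
import statistics

random.seed(42)

# Hypothetical per-case indicator: 1 if a given factor was present, else 0
# (422 cases, illustrative counts).
cases = [1] * 260 + [0] * 162

def bootstrap_ci(data, stat=statistics.mean, n_boot=1000, alpha=0.05):
    """Non-parametric bootstrap: resample with replacement, recompute the
    statistic, and read the 95% CI and SE off the empirical distribution."""
    boots = sorted(
        stat(random.choices(data, k=len(data))) for _ in range(n_boot)
    )
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    se = statistics.stdev(boots)
    return lo, hi, se

lo, hi, se = bootstrap_ci(cases)
```

In the actual study the resampled statistic would be a retrained network’s path coefficient or sensitivity index rather than a simple occurrence rate.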
2.4. Element Selection
2.4.1. Text Preprocessing and Word Segmentation
Accident reports generally include an incident summary, organizational information, and event details. To reduce redundancy, this study extracted three core components—accident process, causal analysis, and responsibility determination—to construct a text-mining corpus. All reports were standardized and saved as “.txt” files.
Among several mature Chinese word segmentation tools (e.g., Jieba, HanLP, PKUSeg, THULAC), Jieba (version 0.42.1) was selected for its balanced efficiency, accuracy, and applicability. Although Jieba provides a basic dictionary, general-purpose lexicons are insufficient for chemical safety texts. Therefore, this study incorporated domain-specific vocabularies from the Sogou Cell Lexicon, including the Specialized Lexicon for Production Safety, Chemical and Chemical Engineering Vocabulary, and the Registered Safety Engineer Lexicon, to enhance segmentation precision.
When selecting the Chinese word segmentation tool, we conducted a preliminary comparison of Jieba (version 0.42.1), HanLP (version 2.2.0), PKUSeg (version 0.3.1), and THULAC (version 0.3.2). Based on 50 randomly sampled accident reports (totaling 18,240 characters), the segmentation accuracy of the four tools was evaluated as follows (Table 1):
Based on the evaluation results presented in Table 1, Jieba achieved the highest scores in precision (0.923), recall (0.912), and F1-score (0.917), demonstrating its superior accuracy and consistency in segmenting texts related to process safety, accident reports, regulatory penalty documents, and investigation reports from safety supervision administrations. The specific word segmentation rules are detailed in the Supplementary Materials. Therefore, Jieba was selected as the Chinese word segmentation tool for this study.
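To illustrate why domain lexicons matter for segmentation, the following sketch implements forward maximum matching, a simplified dictionary-based baseline (not Jieba’s actual algorithm, which combines a prefix dictionary with HMM-based discovery of out-of-vocabulary words). With only a general lexicon, the domain term 危险化学品 (“hazardous chemicals”) is split apart; adding it to the lexicon keeps it whole, which is the effect the Sogou domain vocabularies provide.

```python
def fmm_segment(text, lexicon, max_len=6):
    """Forward maximum matching: at each position, take the longest
    lexicon word starting there, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for span in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + span]
            if span == 1 or word in lexicon:
                tokens.append(word)
                i += span
                break
    return tokens

general = {"危险", "化学品", "泄漏"}
domain = general | {"危险化学品"}   # domain lexicon adds the full term

print(fmm_segment("危险化学品泄漏", general))  # ['危险', '化学品', '泄漏']
print(fmm_segment("危险化学品泄漏", domain))   # ['危险化学品', '泄漏']
```

Keeping multi-character domain terms intact matters downstream: TF-IDF and LDA treat each token as a unit, so over-segmented terms dilute the feature signal.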
2.4.2. Feature Selection for Text Analysis
After word segmentation, the chemical accident reports generated a large vocabulary; however, only a limited proportion of these words carried meaningful information relevant to safety themes. Feature selection aims to filter representative terms from the text collection. Analysis revealed that approximately 90% of the words lacked distinctive feature information. To improve the accuracy of keyword identification and topic classification, this study calculated feature values for the segmented terms and extracted key feature words that capture the essential characteristics of chemical safety events.
The Term Frequency–Inverse Document Frequency (TF-IDF) method is a widely used technique for feature extraction in Chinese text analysis. It evaluates the importance of a term based on the product of term frequency (TF) and inverse document frequency (IDF). Term frequency reflects how often a term appears in a single document, calculated as shown in Equation (6), while inverse document frequency measures the distribution of the term across the entire document collection, computed as the logarithm of the ratio of the total number of documents to the number of documents containing the term, as shown in Equation (7). This method simultaneously considers the significance of a term within individual documents and its uniqueness across the corpus.
The TF-IDF method plays an important role in text analysis and information retrieval applications. By calculating the weight of each term, this method effectively supports natural language processing tasks such as keyword identification and document similarity analysis. In feature extraction, TF-IDF provides an efficient quantitative approach for evaluating term significance.
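A minimal sketch of Equations (6) and (7) on a toy corpus of hypothetical segmented tokens: a term concentrated in few documents receives a higher TF-IDF weight than an equally frequent term spread across the corpus.

```python
import math

# Toy corpus of segmented report fragments (hypothetical tokens).
docs = [
    ["violation", "operating", "procedures", "explosion"],
    ["training", "inadequate", "explosion"],
    ["training", "supervision", "inadequate"],
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)              # term frequency, Equation (6)
    df = sum(1 for d in docs if term in d)       # document frequency
    idf = math.log(N / df)                       # inverse doc. freq., Eq. (7)
    return tf * idf

# "explosion" appears in 2 of 3 documents, "violation" in only 1, so
# "violation" scores higher despite equal within-document frequency.
```

This discriminative weighting is what lets the feature-selection step discard the roughly 90% of segmented terms that carry no distinctive information.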
Table 2 lists a selection of terms with high TF-IDF values extracted from chemical enterprise production safety accident reports.
2.4.3. Estimation of the Optimal Number of Topics
Determining the number of causal themes in chemical accidents is essential for topic analysis. An appropriate number of topics improves classification accuracy and prevents semantic overlap or omission of key themes. For large text datasets, manual selection is inefficient and error-prone. Therefore, quantitative estimation methods are typically used.
This study employs perplexity to optimize topic number selection. In natural language processing, lower perplexity indicates better model performance, and in topic modeling, it generally corresponds to more coherent topic clustering. The perplexity is calculated as shown in Equation (8):

$\mathrm{Perplexity}(D) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d}\right)$ (8)

where $D$ denotes the text dataset; $M$ is the total number of documents in the dataset; $p(w_d)$ represents the probability, under the model, of the word sequence $w_d$ in document $d$; and $N_d$ is the number of words in document $d$.
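Equation (8) reduces to a simple computation once per-document log-likelihoods are available; the values below are hypothetical placeholders standing in for a fitted topic model’s output.

```python
import math

# Hypothetical per-document word log-likelihoods log p(w_d) under a
# fitted topic model, and the corresponding document lengths N_d.
log_p_docs = [-120.0, -95.0, -210.0]
n_words = [40, 30, 70]

# Equation (8): exp of the negative total log-likelihood per word.
perplexity = math.exp(-sum(log_p_docs) / sum(n_words))
```

Repeating this computation while sweeping the number of topics, and taking the minimum, is exactly the selection procedure used below to arrive at five topics.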
The LDA topic model employs a two-stage generative process. First, the topic distribution $\theta_d$ for document $d$ is drawn from a Dirichlet distribution with parameter $\alpha$, and the topic $z_{d,n}$ for the $n$-th word is sampled from $\theta_d$. Second, the word distribution $\varphi_k$ for topic $k$ is drawn from a Dirichlet distribution with parameter $\beta$, and the word $w_{d,n}$ in document $d$ is generated from the multinomial distribution $\varphi_{z_{d,n}}$ conditional on $z_{d,n}$. By iterating this process, the entire document corpus is generated. The core mathematical expression of the model is shown in Equation (9).
Parameter estimation in LDA topic models is commonly performed using Gibbs sampling, which is simple to implement and computationally efficient. As a special case of the Metropolis–Hastings algorithm, Gibbs sampling is based on the Markov chain principle, iteratively updating the value of each dimension while keeping others fixed until convergence is achieved. This makes it a practical choice for topic model inference.
Based on the results of Gibbs sampling, analytical expressions for $\theta$ and $\varphi$ can be derived. For a specific term $w$, the corresponding topic distribution $\theta_{d,k}$ and word distribution $\varphi_{k,w}$ are given in Equations (10) and (11), respectively:

$\theta_{d,k} = \frac{n_{d,k} + \alpha}{\sum_{k'} (n_{d,k'} + \alpha)}$ (10)

$\varphi_{k,w} = \frac{n_{k,w} + \beta}{\sum_{w'} (n_{k,w'} + \beta)}$ (11)

In the formulas, $n_{k,w}$ represents the number of times term $w$ is assigned to topic $k$, and $n_{d,k}$ denotes the number of times topic $k$ appears in document $d$.
In this study, perplexity was employed as the model performance metric, and the number of topics $K$ was systematically optimized (α = 1/K, β = 0.01, number of iterations: 1000). As shown in Figure 1, the perplexity reaches its minimum when the number of topics is five, indicating that the LDA model achieves the best topic clustering at this point. This result determines the optimal number of topic categories for chemical accident analysis.
2.4.4. Causal Factor Identification Results
Based on the determined optimal number of topics, parameters were set, and the top 10 feature words with the highest weights under each topic were selected.
The LDA topic modeling algorithm clusters semantically related feature words, with those appearing at the top of each cluster typically having the highest probability within the corresponding topic. However, the model only performs clustering and probability ranking of feature words and cannot automatically generate topic names. Therefore, the assignment of topic names still requires researchers to summarize and define them manually, combining domain knowledge and practical context.
Based on TF-IDF values and the LDA topic model, with reference to the existing literature and the mechanisms of accident causation, and combined with statistical analysis of phrases in accident reports, the contributing factors of safety accidents in chemical enterprises were identified, as shown in Table 3.
Unsafe acts (e.g., a4 (violation of operating procedures), a3 (weak safety awareness)), preconditions for unsafe acts (e.g., a2 (insufficient personnel qualification), x3 (equipment defects)), unsafe supervision (e.g., y4 (inadequate training), z3 (failure to implement primary responsibility)), and organizational influences (e.g., z1 (imperfect safety management system), y1 (lack of organizational emphasis on safety)) all correspond to the classical levels of the HFACS framework and are consistent with findings from existing chemical safety literature. Building upon the HFACS framework, this study further identifies three new factors with distinct contextual features that have received less attention in the existing literature through text mining and network analysis: w1 (insufficient supervision by social departments), z5 (formalism in the hidden danger investigation system), and y2 (illegal organization of production).
2.5. Flowchart
Based on the research methodology described above, the flowchart for this study is presented below (Figure 2).
3. Results
At present, research on the coupling effect among influencing factors of safety accidents in chemical enterprises remains relatively limited. Many studies focus solely on overall statistical identification of contributing factors without adequately distinguishing between core and peripheral factors of accidents. Furthermore, the causes of accidents in the field of chemical production safety are diverse, and the impact of each factor on accidents exhibits significant heterogeneity.
When analyzing the coupling relationships among factors in complex systems, complex network methods have demonstrated substantial advantages. By analyzing the complex network model established based on the results of association rule mining, the correlation strength among various factors can be accurately reflected. This study constructs a causal model of safety accidents in chemical enterprises based on the results of association rule mining and quantifies the intrinsic risk propagation mechanism of accidents through its statistical indicators.
3.1. Mining Association Rules of Contributing Factors in Chemical Safety Accidents
This study was conducted in a Python (version 3.12) programming environment using the mlxtend package, specifically its Apriori and Association Rules modules, to mine frequent patterns and association rules of accident contributing factors. By adjusting the support threshold in the Apriori algorithm, frequent itemsets of contributing factors were systematically extracted. The association rules were further filtered and optimized based on confidence, lift, and support, enabling the identification of strong associations.
The selection of support and confidence thresholds directly affects the quality of the mined rules. The literature indicates no consensus on standard threshold values, with typical ranges of 0.03–0.1 for support and 0.1–0.7 for confidence. To assess parameter sensitivity, multiple combinations of support and confidence were tested, and the resulting number of association rules was recorded, as shown in the surface plot (Figure 3).
The surface plot visually illustrates the variation in the number of association rules generated under different support and confidence combinations. It can be observed that in regions with low support and low confidence, the number of rules increases significantly; as both parameters increase, the number of rules gradually declines. In the parameter range with support around 0.07 and confidence around 0.5, the surface exhibits a relatively flat trend, indicating stable rule counts in this region, making it suitable for threshold selection. Using these parameters, this study generated 363 association rules. After further filtering with a lift threshold greater than 1.25, 134 strong association rules were retained. These rules reveal the interaction mechanisms among key accident contributing factors and clarify how associations among risk factors influence accident probability.
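The support, confidence, and lift metrics used for this filtering can be sketched in pure Python (the mlxtend modules compute the same quantities at scale). The six toy “reports” and factor codes below are illustrative, not drawn from the study’s dataset.

```python
# Toy transactions: each set holds the factor codes found in one report.
reports = [
    {"x1", "a2", "y4"}, {"x1", "a2"}, {"x1", "a2", "z3"},
    {"a2", "y4"}, {"x1", "z3"}, {"y4", "z3"},
]
n = len(reports)

def support(itemset):
    """Fraction of reports containing every factor in the itemset."""
    return sum(1 for r in reports if itemset <= r) / n

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    sup = support(antecedent | consequent)
    conf = sup / support(antecedent)
    lift = conf / support(consequent)
    return sup, conf, lift

sup, conf, lift = rule_metrics({"x1"}, {"a2"})
```

A lift above 1 means the consequent occurs more often with the antecedent than its base rate would predict, which is why the study keeps only rules with lift greater than 1.25.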
Table 4 presents the top 10 association rules ranked by lift, reflecting significant positive relationships between antecedent and consequent factors. For example, the rule “{x1 Lack of Certified Operation Management} → {a2 Insufficient Personnel Qualification}” has the highest lift value of 3.27. This result indicates that neglecting certification requirements for operational staff often leads to insufficient personnel qualifications, thereby significantly increasing the risk of accidents.
The top 10% of rules with the highest support were selected from the 134 association rules. As shown in Table 5, the antecedent and consequent of each rule, along with their corresponding support, confidence, and lift values, are presented.
- 2.
High-Confidence Association Rules
The top 10% of rules ranked by confidence were selected from the association rule set (Table 6).
Table 6 shows that the support of the association rules ranges from 0.07 to 0.12, confidence from 0.82 to 0.94, and lift from 1.32 to 1.60. These metrics indicate that the mined rules possess high statistical significance and practical relevance.
For example, Rule 1 in Table 6, “{a3 Weak Safety Awareness of Operators, z11 Lack of On-Site Safety Management} → {y4 Inadequate Safety Training},” has a confidence of 0.94, indicating that when both a3 and z11 are present, the probability of y4 (Inadequate Safety Training) occurring reaches 94%. This finding aligns closely with real-world safety management: enterprises exhibiting both weak personnel safety awareness and poor on-site management typically also have severe deficiencies in safety training. This empirical evidence supports the validity and reliability of the mined association rules.
3.2. Construction of the Network Model for the Causes of Safety Accidents in Chemical Enterprises
By constructing a complex network model, the interaction relationships among various factors can be effectively characterized, and the key causal elements and their associated mechanisms can be identified. Based on the association rules mined by the Apriori algorithm, this study uses Gephi (version 0.10.1) to construct a directed weighted network of the causes of chemical accidents (Figure 4). This model quantifies the association strength between factors using lift and identifies core causes and peripheral factors through structural analysis, revealing the interaction mechanisms among the elements in the accident system. In the network diagram, nodes represent the causal elements of the accident, and edges indicate direct causal relationships based on the association rules. For specific node definitions, please refer to Table 3. Node size is determined by eigenvector centrality, which reflects the relative importance of a node in the network.
3.3. Identification of Core and Peripheral Risk Factors for Safety Accidents in Chemical Enterprises
3.3.1. Identification of Core Contributing Factors for Safety Accidents in Chemical Enterprises
Given that the network nodes exhibit a hub-and-spoke distribution characteristic and that core contributing factors play a critical and decisive role in accident occurrence, effectively suppressing these core factors can significantly mitigate the severity of accidents in prevention efforts. Node importance evaluation methods in complex networks primarily encompass three categories: metrics based on neighborhood connections (node degree), metrics based on paths (betweenness centrality), and metrics based on network characteristics (eigenvector centrality). This study integrates these three network structure indicators to quantitatively assess and rank the importance of each causative factor in chemical safety accidents, thereby identifying the core contributing factors for safety accidents in chemical enterprises.
- (1)
Node Degree
In the network model, the number of connection edges associated with each causative factor node constitutes its degree centrality indicator, which directly reflects the importance of the node within the model structure.
Figure 5 illustrates the degree value distribution of each node in the chemical accident causative factor network, and the top ten nodes are selected here for in-depth analysis.
As depicted in
Figure 5, ten high-degree contributing factors have been identified in the safety accidents of chemical enterprises, including “inadequate safety education and training”, “failure to implement the primary responsibility for safety production”, and “lack of establishment or enforcement of the hidden danger investigation system”.
- (2)
Node Betweenness Centrality
The node betweenness centrality index quantifies the extent to which a specific node acts as an intermediary in the shortest paths of the network, and its numerical value directly reflects the node’s influence on network connectivity. As depicted in
Figure 6, the top ten nodes with the highest betweenness centrality in the association network of chemical accidents play a critical central role. These nodes are pivotal in the transmission of accident risks. Conducting an in-depth analysis of these high-betweenness nodes can facilitate the identification of key control points for accident prevention.
As shown in
Figure 6, in the safety accidents of chemical enterprises, ten accident-causing factors, including “inadequate safety education and training,” “failure to implement the primary responsibility for production safety,” “insufficient supervision by social departments,” and “lack of establishment or enforcement of the hidden danger investigation system,” exhibit high betweenness centrality.
- (3)
Node Eigenvector Centrality
The eigenvector centrality index evaluates the criticality of a node by considering both the importance of the node itself and its adjacent nodes, thereby reflecting the network characteristic that “the more critical the neighboring nodes, the more important the target node becomes”. As illustrated in
Figure 7, the distribution of eigenvector centrality among nodes in the chemical accident causation network is presented, with emphasis on the top ten core nodes.
As depicted in
Figure 7, in the safety accidents of chemical enterprises, ten accident-causing factors, including “inadequate safety education and training,” “failure to implement the primary responsibility for production safety,” “lack of establishment or enforcement of the hidden danger investigation system,” and “violation of operating procedures,” exhibit high eigenvector centrality.
By comprehensively considering the three network indicators—node degree, betweenness centrality, and eigenvector centrality—the top 10 factors in each indicator are screened. The causes that appear in all three indicators are identified as the core accident-causing factors. The specific results are presented in
Table 7. This multi-dimensional assessment method ensures the comprehensiveness and reliability of the identification of core causes.
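The three indicators can be combined exactly as described: rank nodes under each metric, take the top-k lists, and intersect them. The sketch below, on a small hypothetical graph (the node codes echo the paper's factor labels, but the edges are invented), computes degree and eigenvector centrality in pure Python, the latter by power iteration; betweenness centrality, which the paper also uses, is omitted here for brevity.

```python
import math

# Hypothetical undirected toy graph; the paper's actual network is built in
# Gephi from the mined association rules.
edges = [("y4", "z3"), ("y4", "a4"), ("y4", "w1"), ("y4", "a3"),
         ("z3", "a4"), ("z3", "w1"), ("a4", "a3"), ("a3", "w2")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

degree = {n: len(nbrs) for n, nbrs in graph.items()}

def eigenvector_centrality(graph, iters=200):
    """Power iteration on the adjacency structure, L2-normalised each step."""
    x = {n: 1.0 for n in graph}
    for _ in range(iters):
        x_new = {n: sum(x[m] for m in graph[n]) for n in graph}
        norm = math.sqrt(sum(v * v for v in x_new.values()))
        x = {n: v / norm for n, v in x_new.items()}
    return x

eig = eigenvector_centrality(graph)

def top(scores, k):
    """Nodes with the k highest scores."""
    return {n for n, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]}

# The paper intersects the top-10 lists of three indicators; k=3 on this toy graph.
core = top(degree, 3) & top(eig, 3)
print(core)
```

On this toy graph the hub node y4 survives the intersection, mirroring how the paper's core factors must rank highly under every indicator at once.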
3.3.2. Identification of Peripheral Contributing Factors Related to Core Contributing Factors
In safety accidents in chemical enterprises, core contributing factors are often numerous and difficult to control. If these core factors are not properly mitigated, risks may propagate along association pathways to connected peripheral factors, potentially generating new safety hazards. Specifically, ineffective hazard inspection can reduce the likelihood of inspectors detecting unsafe operations, thereby increasing accident risk. Based on the accurate identification of primary contributing factors, it is necessary to systematically analyze related secondary factors and implement preventive measures from a holistic perspective.
Eigenvector centrality can accurately measure the closeness of nodes to core nodes, effectively identifying other nodes strongly associated with the target node. In this study, a local network framework of core contributing factors was constructed based on eigenvector centrality to analyze associated peripheral factors. Taking a4 (Unsafe Operation) as an example, the antecedent and consequent elements of association rules were treated as network nodes, with their logical relationships represented as edges, forming the corresponding network topology. The individual central networks of each core causative factor are illustrated in
Figure 8.
To avoid interference from other key factors, only a4, “Unsafe Operation”, was retained as the core node during network construction. The resulting network is shown in
Figure 8a, where node size directly reflects eigenvector centrality. The study identified that the core factor a4 together with the five nodes with the highest eigenvector centrality, a1, a3, z4, y5, and w2, constitute the causative factor set. These associated nodes form the relevant causative cluster for “Unsafe Operation.”
In safety accidents within chemical enterprises, unsafe operations reflect deficiencies in the implementation of the company’s safety management system. Analysis of multiple accident cases indicates that violations of safety procedures by operators are often rooted in the failure to enforce primary safety responsibilities, the absence of an effective system of safety behavior norms, and a disconnect between operating procedures and actual production requirements, with supervision often being merely formal. Specifically, inadequate safety training may leave operators unaware of process risks; performance assessments that prioritize production over safety may indirectly encourage short-cutting operational steps; and ineffective supervision mechanisms allow unsafe behaviors to go undetected and uncorrected. Additionally, the lack of a comprehensive process safety information management system may prevent timely updates of operating procedures, making them incompatible with actual production conditions. These management shortcomings increase the likelihood that employees will overlook safety controls during operations, ultimately leading to accidents caused by unsafe practices.
Therefore, systematic prevention and control of the “Unsafe Operation” causative cluster in chemical enterprises is critical. Strengthening the implementation of safety responsibilities, improving operational procedures, enhancing behavioral safety management, and establishing effective supervision mechanisms can reduce human errors, improve intrinsic safety, and prevent major accidents. Applying the same approach to other core contributing factors allows identification of the associated factor sets for each core element, with detailed results presented in
Table 8.
3.4. Construction of Bayesian Network Model Based on Correlation Rules
This study previously applied association rule mining to analyze risk contributing factors in chemical enterprise safety accidents and constructed a complex network based on the filtered results. Building on this, a Bayesian network model was developed by integrating text-mined data with the complex network topology.
The model captures risk propagation mechanisms among contributing factors through three steps: mapping complex network nodes to Bayesian variables, establishing conditional dependencies from association rules, and refining the structure via expert input and parameter learning. This approach preserves network topology while enabling probabilistic inference, allowing quantitative identification of critical risk pathways for accident prevention.
Key thresholds for rule mining were optimized: minimum support = 0.015, minimum confidence = 0.3, and rules limited to a single antecedent. Using lift > 1, 209 statistically significant strong rules were extracted (
Table 9). This configuration balances comprehensive pattern capture with noise reduction, ensuring robust and reliable results.
3.5. Bayesian Network Structure Optimization and Learning
3.5.1. Network Structure Optimization Based on Search Scoring
Based on the extracted strong association rules, a Bayesian network model was constructed, where nodes correspond to the antecedents and consequents of the rules, each representing a specific accident-causative factor. Directed edges encode causal relationships, with edge directions determined by the logical structure of the rules—antecedents as parent nodes and consequents as child nodes—thereby forming the complete network topology. Conditional probability tables were initialized using the confidence values of the association rules and further refined with expert knowledge.
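The rule-to-structure mapping described above amounts to reading each single-antecedent rule as a directed edge and seeding the corresponding conditional-probability entry with the rule's confidence. A minimal sketch, with hypothetical rules (the factor codes are illustrative, not the paper's actual mined rules):

```python
# Hypothetical single-antecedent rules (antecedent, consequent, confidence);
# in the paper these come from Apriori with support >= 0.015, confidence >= 0.3.
rules = [
    ("w1", "z3", 0.62),
    ("z3", "y6", 0.48),
    ("y6", "a4", 0.55),
    ("a4", "Accident", 0.71),
]

parents = {}     # child -> set of parent nodes (the network topology)
cpt_seed = {}    # (parent, child) -> initial P(child present | parent present)
for ante, cons, conf in rules:
    parents.setdefault(cons, set()).add(ante)   # antecedent becomes parent node
    cpt_seed[(ante, cons)] = conf               # confidence initialises the CPT entry

print(parents["a4"])            # {'y6'}
print(cpt_seed[("y6", "a4")])   # 0.55
```

These seeded entries are only a starting point; in the paper they are subsequently refined by expert knowledge and EM parameter learning.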
Bayesian network structure optimization can be achieved via two main approaches: score-based search and constraint-based methods. Both aim to derive the optimal network topology from observed data. Constraint-based methods, however, involve iterative searches and are computationally intensive, reducing efficiency for large-scale networks. To address this, the present study employs a score-based method, specifically the K2 algorithm, for structure learning. The rationale is twofold: (1) K2 limits the number of parent nodes, reducing search space complexity and accommodating the medium-scale network in this study; (2) it allows the incorporation of expert knowledge as node ordering constraints, complementing domain-specific insights into chemical accident causation.
Score-based algorithms systematically explore candidate network structures, evaluating each using predefined criteria such as the BIC (Bayesian Information Criterion), maximum-likelihood, or K2 scores. Iteratively comparing candidate scores identifies the network configuration that best balances model complexity and data fit, enhancing generalization. K2’s minimal prior knowledge requirement and ability to learn network structures autonomously without complex assumptions make it practical across various disciplines. The mathematical formulation of K2 is presented in Equation (12):
$$P[B,D] = c \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \qquad (12)$$
where: $P[B,D]$ denotes the joint probability of the network structure B and the observed data D; c is a normalizing constant; n is the total number of nodes in the network; $q_i$ is the number of possible parent configurations for node i; $r_i$ is the number of possible states of node i; $N_{ij}$ is the total number of observations in which node i’s parents are in their j-th configuration; $N_{ijk}$ is the number of observations in which node i takes its k-th value while its parents are in the j-th configuration. This formulation quantifies the fit between the network structure and the observed data, providing a mathematical basis for model evaluation.
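Numerically, the K2 score is evaluated in log space, since the factorials overflow quickly; math.lgamma gives log N! as lgamma(N + 1). The sketch below (toy binary data, invented for illustration) scores a single binary node under two candidate parent sets and confirms that the data-supported parent structure wins:

```python
import math
from itertools import product

def log_k2_node(i_col, parent_cols, data, r=2):
    """Log of prod_j [(r-1)!/(N_ij+r-1)!] * prod_k N_ijk! for one node.

    data: list of dicts mapping column name -> 0/1; r: number of node states.
    """
    score = 0.0
    for config in product(range(r), repeat=len(parent_cols)):
        n_ijk = [0] * r
        for row in data:
            if all(row[p] == c for p, c in zip(parent_cols, config)):
                n_ijk[row[i_col]] += 1
        n_ij = sum(n_ijk)
        score += math.lgamma(r) - math.lgamma(n_ij + r)   # log (r-1)!/(N_ij+r-1)!
        score += sum(math.lgamma(n + 1) for n in n_ijk)   # log prod_k N_ijk!
    return score

# Toy binary cases (hypothetical): y4 tracks z3 closely, so the structure
# z3 -> y4 should score higher than y4 with no parent.
data = [{"z3": 1, "y4": 1}] * 6 + [{"z3": 0, "y4": 0}] * 5 + [{"z3": 1, "y4": 0}]
with_parent = log_k2_node("y4", ["z3"], data)
no_parent = log_k2_node("y4", [], data)
print(with_parent > no_parent)
```

The K2 search greedily adds, for each node in a given ordering, the parent that most increases this score, stopping at a parent-count cap, which is why the algorithm scales to medium-sized networks like the one in this study.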
The K2 algorithm employs a local search strategy, offering superior computational efficiency compared with other score-based methods for large datasets. In this study, the K2 optimization was implemented using GeNIe software, and the resulting optimized Bayesian network for chemical accidents is shown in
Figure 9.
3.5.2. Bayesian Network Parameter Learning
Based on the established association network of accidents, this study employed the GeNIe software platform to perform parameter learning, thereby quantifying the probabilistic dependencies among risk factors and between risk factors and accident outcomes. The training data for learning were derived from prior text mining results and processed into a binary feature matrix: rows correspond to individual accident cases, columns represent contributing factors, and matrix elements take values of 0 or 1, indicating the absence or presence of a factor in a specific accident. This data representation effectively preserves key relational features from the original text, providing a reliable basis for parameter estimation. The Expectation Maximization (EM) algorithm [
33] was applied to conduct parameter learning. The primary rationale lies in the fact that the encoded accident text data are incomplete, with some nodes exhibiting missing values or parent-node combinations occurring with insufficient frequency. Direct application of Maximum Likelihood Estimation would produce numerous zero probabilities in the Conditional Probability Tables, thereby compromising the inference stability of the network. The EM algorithm addresses this by estimating the expectations of the missing data in the E-step and maximizing the likelihood in the M-step, enabling stable convergence of probability learning even under conditions of incomplete or sparse data. The method is particularly well suited to structures characterized by “multiple parent nodes and high-dimensional CPTs,” effectively mitigating the probability bias or non-convergence caused by data sparsity. As one of the most mature parameter learning methods in engineering safety, reliability analysis, and risk inference, EM offers significant advantages in model robustness and inference accuracy when dealing with incomplete and uncertain accident data. The combined K2 and EM approach therefore makes full use of both the prior knowledge and the textual data structure available in this study, ensuring that the final model is credible and interpretable in terms of theoretical logic, structural stability, and parameter reliability. Parameter learning ultimately generated the conditional probability tables for all causal nodes, and the Bayesian network was updated accordingly.
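The E-step/M-step alternation is easiest to see on a single node with no parents. In the hypothetical sketch below, None marks cases where the factor's presence could not be coded from the report text; for this simple missing-at-random case, EM converges to the observed-data proportion (4 of 6 observed cases, i.e., 2/3):

```python
# Minimal EM sketch for estimating P(y4 = 1) from partly missing data.
# E-step fills each missing value with its current expected probability;
# M-step re-estimates the parameter from the completed counts.
observations = [1, 1, None, 0, 1, None, 0, 1]  # hypothetical, partly missing

p = 0.5  # initial guess for P(y4 = 1)
for _ in range(50):
    # E-step: expected count of y4 = 1 (each missing entry contributes p)
    expected_ones = sum(p if o is None else o for o in observations)
    # M-step: maximise the likelihood given the expected counts
    p = expected_ones / len(observations)
print(round(p, 4))  # 0.6667
```

With parent nodes and high-dimensional CPTs the expectations are computed per parent configuration, but the alternation is the same; this is what lets the learning remain stable where raw MLE would emit zero-probability entries.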
Figure 10 shows the Bayesian network after updating the conditional probabilities.
To quantify the uncertainty of the Conditional Probability Table (CPT) parameters, this study obtained the posterior distribution based on the Dirichlet-Multinomial model and employed posterior sampling to calculate 95% credible intervals (CIs). Specifically, for each set of parent node values, the conditional probability vector θ was assigned a Dirichlet prior, which, together with the sample frequencies, formed the Dirichlet posterior distribution. Through 10,000 rounds of posterior sampling (with the first 2000 discarded as burn-in), the 2.5% and 97.5% percentiles were taken as the parameter CI. To evaluate the predictive performance of the model, Leave-One-Out Cross-Validation (LOO-CV) was applied to the Bayesian network. Using the validation module in GeNIe, the LOO-CV mode was selected to calculate the prediction accuracy for each node. Detailed validation results are presented in
Table 10. This approach ensures the rigor and reliability of the model evaluation.
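The Dirichlet-Multinomial update for a single CPT column can be sketched directly: add the observed counts to the prior pseudo-counts and sample the posterior. The counts below are hypothetical, and one simplification is worth noting: direct Dirichlet sampling yields independent draws, so this sketch needs no burn-in (burn-in matters for MCMC-style samplers such as the one used in the study).

```python
import random

random.seed(0)

def dirichlet_sample(alphas):
    """Draw one sample from Dirichlet(alphas) via normalised Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical counts for one CPT column: factor present in 18 of 60 cases
# matching this parent configuration; uniform Dirichlet(1, 1) prior.
counts, prior = [18, 42], [1.0, 1.0]
posterior_alphas = [c + a for c, a in zip(counts, prior)]

samples = sorted(dirichlet_sample(posterior_alphas)[0] for _ in range(10_000))
lo, hi = samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))]
print(f"95% CI for P(factor present): [{lo:.3f}, {hi:.3f}]")
```

For the two-state case this posterior is simply Beta(19, 43); the interval brackets the empirical proportion 0.30 and narrows as the count of matching cases grows.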
Table 10 shows that the highest node prediction accuracy reached 0.9060, with most nodes exceeding 0.7. Overall, the network achieved a prediction accuracy of 0.7647, indicating that the constructed Bayesian network model for chemical accidents demonstrates strong predictive performance and is suitable for causal analysis and inference of accident factors.
The sensitivity analysis of Bayesian networks quantifies the extent to which changes in parent nodes influence the probabilities of child nodes, thereby enabling the identification of key factors within the model. In the GeNIe software, sensitivity analysis was performed with the Accident node as the target variable.
Table 11 presents the variables with a maximum absolute sensitivity index (MSI) greater than 0.01. The analysis results visually illustrate the degree to which different factors affect the probability of accident risk.
It is evident that in the production process of chemical enterprises, employees may trigger subsequent accidents due to insufficient safety protection, violation of operational commands, and other factors. Similarly, organizational influences and regulatory factors also play a critical role in ensuring safety. Equipment defects or malfunctions escalate risks in the chemical production process, potentially leading to severe consequences such as leaks and explosions. Moreover, inadequate equipment management may result in insufficient maintenance, further increasing the likelihood of accidents. The failure to implement primary responsibility for safety production, coupled with insufficient supervision by social departments, introduces additional safety hazards at the management and oversight levels, thereby contributing to accidents. Therefore, it is essential to implement effective regulatory and control measures for these highly sensitive factors.
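For a binary factor feeding the Accident node, the marginal accident probability is linear in that factor's prior, so its sensitivity reduces to a difference of CPT entries; a numeric sketch (all probabilities hypothetical, not the model's learned values):

```python
# Sensitivity sketch for a single parent F of the Accident node: the marginal
# P(Accident = 1) is linear in P(F = 1), so the sensitivity equals the CPT
# difference P(A=1 | F=1) - P(A=1 | F=0).
def p_accident(p_factor, p_acc_given_f1, p_acc_given_f0):
    """Marginalise the accident probability over the binary factor F."""
    return p_acc_given_f1 * p_factor + p_acc_given_f0 * (1 - p_factor)

cpt = {"f1": 0.40, "f0": 0.08}                  # hypothetical CPT entries
base = p_accident(0.30, cpt["f1"], cpt["f0"])
bumped = p_accident(0.31, cpt["f1"], cpt["f0"])
sensitivity = (bumped - base) / 0.01            # numeric derivative
print(round(base, 4), round(sensitivity, 4))    # 0.176 0.32
```

In a real multi-parent network the derivative is taken holding the other parents' distributions fixed, which is what GeNIe's sensitivity module computes; factors whose CPT rows differ most strongly dominate the ranking in Table 11.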
- 2.
Path Analysis
Based on the sensitivity analysis results, this study employs the path analysis method to quantify the correlation characteristics among contributing factors. By constructing a structural equation model, the path coefficients of each variable are calculated, and the absolute value of these coefficients directly reflects the intensity of the causal relationships. The statistical test results are presented in
Figure 11, where “*”, “**”, and “***” denote significance levels of
p < 0.1,
p < 0.05, and
p < 0.001, respectively, indicating the statistical reliability of each path relationship. This method realizes the quantitative representation of the accident-causing mechanism.
It is evident that there are 13 significant association paths for safety accidents in chemical enterprises. Among these, the effect level (i.e., the product of path coefficients) of the causal path “w1 (Insufficient Supervision) → z3 (Failure to Implement Primary Responsibility) → y6 (Illegal Construction) → a4 (Violation of Operating Procedures) → Accident” is the highest, at 0.0085 (0.024 × 0.427 × 0.239 × 0.2624). This indicates that this path is the key causal path for safety accidents in chemical enterprises, as depicted in
Figure 12. Specifically, this path highlights that supervision by social departments, implementation of the primary responsibility for production safety, prevention of illegal construction, and compliance with operating procedures are critical factors in avoiding such accidents.
4. Discussion
This study systematically investigates the risk propagation mechanisms of safety accidents in Chinese chemical enterprises by integrating text mining, complex network analysis, and Bayesian network modeling. Compared to previous qualitative studies based on limited typical cases, the analytical framework of this research overcomes the limitations of traditional linear approaches. Through systematic mining of 422 accident reports, it not only identifies 29 key factors spanning personnel, organizational, and regulatory dimensions but also reveals the nonlinear coupling and risk transmission mechanisms among accident contributing factors. The analysis reveals that the identified accident factors exhibit nonlinear coupling characteristics within a complex network, supporting a systemic rather than single-factor view of accident causation.
By incorporating network centrality indicators, this study not only differentiates core from peripheral contributing factors but also constructs a structured network of accident factors. It finds that insufficient safety training and unfulfilled production safety responsibilities play a dominant role in the risk propagation network, which aligns with the findings of previous research. However, through quantitative analysis using Bayesian networks, this study further reveals that these surface-level factors systematically lead to accidents through the transmission pathway of “regulatory gaps” to “behavioral violations”, achieving a transition from “factor identification” to “mechanism analysis.” Sensitivity analysis confirms the high influence of critical nodes such as violations of command and equipment defects, while the Bayesian network constructed based on strong association rules provides data-driven support for probabilistic inference of accident association paths.
The findings offer a scientific basis for optimizing the allocation of safety resources, ensuring accountability, and improving training systems. In particular, by proposing a “core–peripheral” factor system along with corresponding leading indicators, control measures, and audit standards, this study provides actionable decision-making tools for enterprises to implement targeted risk management under resource constraints. Additionally, the study emphasizes the need to strengthen regulatory supervision and enhance safety governance mechanisms, promoting a paradigm shift in chemical safety management from “compliance-driven” to “performance-driven.” Overall, this study establishes a comprehensive empirical framework that expands the theoretical boundaries of accident causation research and provides novel academic insights and practical guidance for chemical safety management and accident prevention.
5. Conclusions
In this study, a multidimensional causality system covering three dimensions of personnel, organization, and supervision was constructed with 29 factors. Through complex network modeling, it is revealed that factors such as “insufficient safety education” and “failure to perform primary safety duties” play key roles in the risk propagation network, and 13 important association paths are identified through Bayesian network analysis. This study bridges the theoretical and practical gaps in chemical safety research by systematically analyzing the accident causation network and quantifying risk pathways. Theoretically, it enhances the understanding of nonlinear and system-wide interactions in accident causation. Practically, it offers actionable recommendations for enterprises and policymakers to prioritize intervention strategies and optimize safety investments. Future research should investigate the dynamic and situational adaptability of the proposed framework to further improve its applicability.
By integrating text mining, complex network analysis, and Bayesian network modeling, this study systematically reveals the causal structure and critical pathways of safety accidents in Chinese chemical enterprises, providing a data-driven analytical framework and practical insights for chemical safety governance. However, the generalization and application of the research findings require careful consideration of its inherent limitations. First, the empirical foundation of this study relies entirely on publicly available accident investigation reports within China. The conclusions profoundly reflect China’s unique safety management systems, regulatory environment, and cultural context, and their applicability to chemical safety practices in other countries and regions requires further validation. Second, although the text mining method employed efficiently processes unstructured information, it may not fully capture the complex context and implicit correlations of accident causes due to inconsistencies in the quality of original reports and variations in semantic expression. Furthermore, the static association network constructed based on historical data can depict the structural relationships among risk factors but struggles to reflect their dynamic evolution over time or under external interventions. The edge weights in the network rely solely on statistical association strength and do not incorporate domain expert knowledge or calibration with multi-source data, which may oversimplify the practical managerial implications of causal influences. In the Bayesian network analysis, the conditional probability parameters are entirely learned from observational data, which essentially represents statistical association inference rather than rigorous validation of risk propagation mechanisms. Additionally, the binary treatment of node states fails to capture the continuous variation and nonlinear response characteristics of risk factor intensities. 
Finally, the methodological integration across the three stages—text mining, complex networks, and Bayesian networks—still faces challenges related to information loss and computational efficiency. Further optimization of the integrated framework is needed when dealing with larger-scale, multimodal data. Future research could be expanded in areas such as cross-regional comparisons, dynamic risk evolution modeling, multi-source data fusion, and interventional causal validation to enhance the generalizability, timeliness, and explanatory power of the conclusions, thereby promoting the continuous advancement of chemical safety governance toward intelligent and precise directions.