Article

Accident Factors Importance Ranking for Intelligent Energy Systems Based on a Novel Data Mining Strategy

1 Huizhou Power Supply Bureau, Guangdong Power Grid Corporation, Huizhou 516000, China
2 School of Electrical and Automation Engineering, East China Jiaotong University, Nanchang 330013, China
* Author to whom correspondence should be addressed.
Energies 2025, 18(3), 716; https://doi.org/10.3390/en18030716
Submission received: 10 December 2024 / Revised: 27 January 2025 / Accepted: 27 January 2025 / Published: 4 February 2025

Abstract

As global energy networks expand and smart grid technology evolves rapidly, the volume of historical power accident data has increased dramatically, containing valuable risk information that is essential for building efficient public safety early warning systems. This paper introduces an innovative text analysis method, the Sparse Coefficient Optimized Weighted FP-Growth Algorithm (SCO-WFP), which is designed to optimize the processing of power accident-related textual data and more effectively uncover hidden patterns behind accidents. The method enhances the evaluation of sparse risk factors by preprocessing, clustering analysis, and calculating piecewise weights of power accident data. The SCO-WFP algorithm is then applied to extract frequent itemsets, revealing deep associations between accident severity and risk factors. Experimental results show that, compared to traditional methods, the SCO-WFP algorithm significantly improves both accuracy and execution speed. The findings demonstrate the method’s effectiveness in mining frequent itemsets from text semantics, facilitating a deeper understanding of the relationship between risk factors and accident severity.

1. Introduction

The electric power industry plays a key role in the life and production of modern society [1], but the conditions under which electric power production is carried out are inherently complex, accidents are frequent, and the safety production situation remains grim [2,3]. As the scale of the power grid continues to expand, data on personal accidents in the power sector have grown explosively. Researchers have found a certain regularity in the occurrence of power accidents in historical data, but these historical records are still far from fully explored [4]. To explore the relationship between different risk factors and the severity levels of power accidents, mining the risk factors of past accidents from historical accident records is of great practical significance.
Against the background of the wide range of challenges in the power sector, many researchers have applied data mining techniques to different problems. Gholami et al. [5] addressed large-scale outages in distribution networks with distributed energy resources and proposed a method that aggregates, classifies, and mines the available data to effectively find the root causes of such outages. Wu et al. [6] analysed historical data on grid operation stability and mined association rules between different time periods, voltage levels, and risk levels through specific grid operation cases; the mined rules were consistent with grid operation practice. Wang et al. [7] generated typical operating scenarios for a photovoltaic (PV) power system by correlating PV output scenarios with load scenarios and incorporated weather factors so that the generated scenarios are interpretable. Zhou et al. [8] used a link prediction method to explore association rules in coal mine accident hidden-danger texts, in order to mine valuable hidden-danger information from massive text data and make predictions. Shen et al. [9] combined text mining with machine-learning-based Bayesian networks and used sensitivity analysis to determine the key risk factors of safety accidents in metro construction. However, in the field of power accidents, research that systematically mines the specific risk factors of accidents is still insufficient, which restricts accurate power safety risk decision-making and the formulation of effective prevention and control strategies. To effectively prevent and promptly respond to power accidents, the current challenge is how to mine effective accident risk factors from massive volumes of recorded data. This paper therefore comprehensively analyses the risk factors of electric power accidents based on a data mining algorithm.
The word vector generation model Word2Vec has been applied to the electric power domain several times; for example, Liu et al. [10] combined the Word2Vec model with a convolutional neural network (CNN) for electric power equipment fault diagnosis and used Word2Vec to mine the contextual semantic features of words in the descriptive text of equipment faults. However, generating word vectors with Word2Vec alone is not sufficient to fully reveal the deep structure and patterns in power accident texts. To address this, this paper further introduces the K-means algorithm for cluster analysis. K-means is a commonly used unsupervised machine learning method for grouping or classifying datasets [11]. In the electric power field, K-means clustering has been used to extract key factors from accident reports and classify the reports into different categories based on their content. Wang et al. [12] applied K-means clustering to analyse electric power marketing information, overcoming limitations of the traditional BERT model and achieving fast, high-precision clustering and identification of information. Therefore, in this paper, K-means clustering is used to accurately cluster the text before association rule mining, providing more structured input data for the subsequent mining step.
Association rule mining, as one of the important research methods in the field of data mining [13,14], is used to discover association relationships between items in a dataset. For power accident text data [15], association rule mining analyses the hidden patterns and connections in unstructured natural language information. Typical approaches include Apriori-type algorithms, which generate candidate sets, and the Frequent Pattern-Growth (FP-Growth) algorithm, which does not. Agrawal et al. [16] first proposed the Apriori algorithm for the shopping basket problem; it is an iterative search method that scans the original data several times, constructs and filters candidate sets, and exhaustively enumerates the items in the dataset. Liu et al. [17] used a binary logic "and" operation method to improve Apriori and analyse association rules among risk elements in civil aviation. However, the Apriori algorithm must scan the dataset several times before generating frequent items, which incurs a large I/O load and low efficiency, and it generates a large number of candidate sets, leading to high computational complexity and memory consumption. To address these problems, Han et al. [18] proposed the FP-Growth algorithm, which only needs to scan the database twice and constructs a Frequent Pattern Tree (FP-Tree) to compress the data. Frequent itemsets are obtained by recursively mining the conditional FP-Tree, reducing storage and computation complexity. Therefore, this paper uses the FP-Growth algorithm to reveal the potential patterns and key influencing factors of accidents.
Because different items in a dataset have different importance, many studies have used weighting to mine frequent itemsets. Xiao et al. [19] introduced a time decay factor to assign weights to transactions and designed a weighting function to mine frequent items related to the most recent transactions. Yu et al. [20] applied the entropy weighting method to calculate risk indices for different food types and used a constrained FP-Growth algorithm for association rule mining of food risk factors. However, existing weighted association rule mining methods often neglect the unique structure and importance of text data, and applying the traditional FP-Growth algorithm in the electric power field makes it difficult to fully mine potential risk factors and accident patterns from complex text data [21]. In addition, most association rule mining methods neither account for weighted differences between risk factors nor handle sparse data effectively. As a result, existing methods fall short in computational efficiency, depth of data processing, and accuracy when analysing the specific risk factors of power accidents. This paper therefore improves on the traditional FP-Growth algorithm by introducing a weighting scheme to mine frequent items from power accident records. Risk factors that occur with very low frequency but are strongly correlated with accidents are included in the calculation of the importance of accident records. The algorithm then calculates the weighted support of the risk factor feature items and sorts them in descending order of weighted support. On this basis, an FP-Tree is constructed to mine frequent itemsets, highlighting the importance of the data of concern and fully exploring potential power risk factors.
Inspired by the aforementioned research, this paper addresses the limitations of traditional association rule mining methods in fully utilizing text data. This paper proposes a novel text-weighted FP-Growth association rule mining scheme for power accidents to solve the problem of an insufficient basis for the analysis of power accidents, so as to provide a theoretical basis and decision-making support for risk management of the electric power industry and to promote safe and stable operation of the electric power system and sustainable development. The main contributions are as follows:
  • Enhanced text utilization: In this paper, we propose a text-weighted FP-Growth based association rule mining scheme for power accidents that improves the utilisation of text data by incorporating semantic information into the association rule mining process.
  • Word vector and classification: The cosine similarity is used to calculate the sub-word weights and quantify the semantic correlation between each word and accident level, thus optimising the classification of accident types and risk levels.
  • Mining sparse factors: Sparse factors are introduced to address the challenge that sparse risk factors in power accidents are not easily mined. The experimental results show that this approach improves the identification of uncommon but important risk factors by accurately assessing the importance of accident records.

2. Basic Theory

2.1. Word2vec Model

Word2Vec is a word vector generation model that learns semantic knowledge from a large text corpus in an unsupervised manner and is mainly applied in the field of Natural Language Processing (NLP), for tasks such as text classification and sentiment analysis [22,23,24]. The method converts words in text into numerical vectors and comes in two main architectures: the Continuous Bag of Words (CBOW) model and the Skip-gram model. The CBOW model predicts the target word from its contextual words, while the Skip-gram model operates in the reverse direction, predicting the context from the target word. The CBOW and Skip-gram model architectures are shown in Figure 1.
Word2Vec training is usually accelerated by negative sampling, and the softmax objective function is optimised to minimise the prediction error. After training, similar words lie closer together in the high-dimensional vector space. In this paper, a Word2Vec model is used to generate word vectors for the power accident dataset, characterising the semantic information of words so that the computer can capture their semantic and contextual relationships.
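As a minimal illustration (not the implementation used in this paper), a CBOW Word2Vec model can be trained with the gensim library roughly as follows; the toy corpus and parameter values are placeholders.

```python
# Minimal sketch of CBOW word-vector training with gensim (illustrative only;
# the corpus, vector_size, and window values are stand-ins, not the paper's code).
from gensim.models import Word2Vec

# Each record is a list of segmented words, e.g. from a power accident description.
tokenized_corpus = [
    ["operation", "error", "pole", "electrocution"],
    ["safety", "training", "insufficient", "moderate", "accident"],
]

model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    sg=0,             # sg=0 selects the CBOW architecture (sg=1 would be Skip-gram)
    min_count=1,
)

vector = model.wv["error"]                           # embedding of one word
neighbours = model.wv.most_similar("error", topn=3)  # semantically nearby words
```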

2.2. K-Means Clustering

K-means clustering is performed by determining the similarity between samples, where similarity is usually measured using the distance between sample attributes [12,25]. The algorithm is based on an iterative process that divides observations (sample points) into a predetermined number of clusters (k centres), each consisting of data points closest to its centre of mass, which is also known as the mean or centre position of the cluster.
In order to perform association rule mining on power accident data, the textual data need to be transformed into categorical data, i.e., the accident records are risk-graded and the sample data are expressed in the form of accident levels. The number of clusters k is determined by the accident-level column of the dataset, whose values are Minor, Moderate, Major, and Severe. In this paper, k is therefore set to 4, so that all segmented words in the same cluster are grouped with the representative words of one accident level. The K-means algorithm is used to divide the dataset into four clusters representing the four accident classes, which facilitates the subsequent mining of association rules between accident classes and risk factors. The main steps of the algorithm are as follows:
(1) Initialisation: Four initial centroids are randomly selected as representatives of each cluster.
(2) Assignment: The Euclidean distance d from each sample point to all centroids is computed, and each sample point is assigned to the cluster whose centroid is closest. The Euclidean distance between the sample points $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$ and $x_j = (x_{j1}, x_{j2}, \dots, x_{jn})$ is calculated as follows:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$
(3) Update: For each cluster, recalculate its centroid $c_j$ (the mean of all points in the cluster) as follows:
$$c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$
where C denotes a cluster (category) and $C_j \in \{C_1, C_2, C_3, C_4\}$.
(4) Iteration: If any cluster centroid changes, repeat steps 2 and 3 until the centroids no longer change or the preset maximum number of iterations is reached.
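As a brief illustration of this clustering step (a sketch assuming scikit-learn; the input matrix below is a random stand-in for the real reduced word vectors):

```python
# Sketch of the k = 4 clustering step with scikit-learn (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # stand-in for t-SNE-reduced word vectors, one row per word

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)  # four accident levels
labels = kmeans.fit_predict(X)        # cluster index (0-3) for each word vector
centroids = kmeans.cluster_centers_   # final centroids c_j
```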

2.3. Cosine Similarity

In fields such as NLP, information retrieval, and data analysis, cosine similarity is a statistical measure of the angle between two non-zero vectors [26]. It is based on the vector space model and calculates the cosine of the angle between the directions of these two vectors, taking values from −1 to 1.
When two vectors point in the same direction, the angle is 0 degrees and the cosine similarity is 1; if they are perpendicular (orthogonal), the angle is 90 degrees and the cosine similarity is 0; if they point in opposite directions, the angle is 180 degrees and the cosine similarity is −1. In this way, the degree of similarity between the contents of any two texts in a dataset can be quantified: the smaller the angle, the more similar the two texts. In this paper, cosine similarity is used to measure the similarity between accident descriptions, thereby revealing potentially similar events or risk factors. By calculating the cosine value between each word vector and the accident level and then multiplying by the coefficient corresponding to that accident level, the segmented word weights are obtained. The formula is as follows:
$$\mathrm{Similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$
where A and B denote two different vectors; $A \cdot B$ denotes the dot product of A and B; and $\|A\|$ and $\|B\|$ denote the norms (magnitudes) of vectors A and B, respectively.
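A plain-NumPy implementation of this measure (a straightforward sketch, not tied to the paper's code) is:

```python
# Cosine similarity between two vectors, matching the formula above.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return cos(theta) between non-zero vectors a and b, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0 (same direction)
```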

2.4. FP-Growth

Let $I = \{i_1, i_2, \dots, i_d\}$ denote the itemset consisting of all items and $D = \{T_1, T_2, \dots, T_n\}$ denote the set of all transactions. Each transaction $T_i$ contains a set of items that is a subset of I, and the support s is the ratio of the support count ($s_{count}$) to the size of the transaction dataset $|D|$, as follows:
$$s = \frac{s_{count}}{|D|}$$
Confidence c reflects the degree of confidence in an association rule, and its formula is as follows:
$$c(U \Rightarrow V) = \frac{s(U \cup V)}{s(U)}$$
where $s(U \cup V)$ is the support of the union of itemsets U and V, representing the probability of U and V occurring together, and s(U) denotes the support of itemset U.
The algorithm first introduces the FP-Tree, in which each node corresponds to an item in the frequent itemsets. Given the original dataset D and a specified minimum support threshold, the FP-Growth algorithm only needs to traverse the original data twice to build an FP-Tree that compresses the original data. The implementation is divided into two stages: constructing the FP-Tree and recursively mining the FP-Tree, as illustrated in Figure 2.
(1) Construct the FP-Tree: Identify all frequent items from each transaction and merge them to form the FP-Tree. In the FP-Tree, each internal node represents a frequent item, and the relationships between nodes reflect the order of item occurrences. The paths of the tree represent transactions with the same prefix. The specific steps are as follows:
  • Generate frequent itemset. The original transaction dataset D is scanned for the first time and all items in D are counted. The frequent 1-item set and the frequent item list F-list are filtered according to the minimum support and sorted in descending order.
  • Construct the tree structure. The original transaction dataset D is scanned for the second time, the dataset is arranged in the order of the frequent items in the F-list, and the FP-Tree is constructed for compressing and storing the dataset after ordering by frequent items.
(2) Recursively mine the FP-Tree: Frequent patterns and their corresponding support are generated by scanning the FP-Tree, followed by mining association rules that satisfy the minimum confidence threshold. The specific method is as follows:
The FP-Growth algorithm is based on the constructed FP-Tree to mine frequent items in a bottom-up principle. The total mining task is divided into several independent sub-tasks, which construct and recursively mine sub-FP-Trees to mine local frequent items. The local frequent items are connected with the suffix frequent items to generate longer frequent itemsets until no new frequent itemsets are generated. Figure 2 shows the flow of the FP-Growth algorithm for mining frequent itemsets.
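For orientation, the baseline (unweighted) FP-Growth procedure described above can be sketched with the mlxtend library on a toy one-hot transaction table; this is only an illustration and not the paper's own implementation.

```python
# Baseline FP-Growth on a one-hot transaction table using mlxtend (a sketch of the
# standard algorithm only; the transactions below are toy data, not the accident dataset).
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["operation error", "moderate accident"],
    ["operation error", "insufficient safety training", "moderate accident"],
    ["command error", "severe accident"],
]
items = sorted({i for t in transactions for i in t})
onehot = pd.DataFrame([[i in t for i in items] for t in transactions], columns=items)

frequent = fpgrowth(onehot, min_support=0.3, use_colnames=True)  # frequent itemsets

# Confidence of the rule {operation error} -> {moderate accident},
# computed from the mined supports: c(U => V) = s(U ∪ V) / s(U).
sup = {frozenset(s): v for v, s in zip(frequent["support"], frequent["itemsets"])}
u = frozenset(["operation error"])
uv = frozenset(["operation error", "moderate accident"])
print(sup[uv] / sup[u])  # 1.0 on this toy data
```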

3. FP-Growth Algorithm Based on a Text-Weighted Method

To address the insufficient research on specific risk factors of electric power accidents, this paper explores the potential correlations between these risk factors and accident levels using an association rule mining method. For accident risk factors described in text, this paper designs an FP-Growth algorithm based on text weighting to analyse the impact of different risk factors on accident level and their degree of association. The process of mining risk factors is shown in Figure 3.
In this paper, a power accident dataset is first constructed and the text is preprocessed using word segmentation. Word vectors are then generated by the CBOW model of Word2Vec, where the silhouette coefficient of each combination of the key parameters vector_size (the dimension of the word vectors) and window (the size of the context window) is calculated to obtain the optimal parameter combination. Since the generated high-dimensional word vectors degrade K-means clustering performance, the t-distributed stochastic neighbour embedding (t-SNE) algorithm is used to reduce the dimensionality before clustering, and cosine similarity is then used to calculate the segmented word weights. Since risk factors within a specific sparse range are not easy to mine, this paper introduces sparse coefficients, which, together with the segmentation weights, determine the importance of accident-related risk factors. Finally, frequent items and association rules are mined based on the text-weighted FP-Growth algorithm defined in this paper. In addition, the algorithm's running process is optimised to achieve higher efficiency.

3.1. Design and Calculation of Word Segmentation Weights

3.1.1. Text Preprocessing

In order to convert original text that does not conform to the input rules into content that the weighted FP-Growth algorithm can recognise, text preprocessing is required. For the original text corpus and the feature item descriptions, this paper uses the jieba word segmentation library [27] in precise mode to ensure the accuracy of the segmentation results and to preserve the semantic structure of the text to the greatest extent.
In this paper, we address the noise in the text data to further improve the quality and efficiency of word segmentation. Firstly, a dedicated stop word list was constructed using the Chinese stop word list of Harvard University, and all stop words were removed during segmentation, which effectively avoids interference from words with no practical significance to the thematic content of the text. Secondly, punctuation marks, irrelevant symbols (e.g., emoticons), and formatting problems in the text were cleaned up to ensure that the text data are more standardised and to provide a clear basis for subsequent analysis. These measures effectively reduce the interference of noise in the analysis of the text data, improving the accuracy and reliability of the results and helping to focus on the important content of the text.
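A compact sketch of this preprocessing step (assuming jieba for segmentation; the stop word file name and the sample sentence are illustrative placeholders):

```python
# Precise-mode jieba segmentation plus stop-word and symbol filtering (illustrative sketch).
import re
import jieba

with open("chinese_stopwords.txt", encoding="utf-8") as f:  # hypothetical stop word file
    stopwords = {line.strip() for line in f}

def preprocess(text: str) -> list[str]:
    # Drop punctuation, emoticons, and other irrelevant symbols.
    text = re.sub(r"[^\u4e00-\u9fa5a-zA-Z0-9]", " ", text)
    tokens = jieba.lcut(text, cut_all=False)  # cut_all=False is jieba's precise mode
    return [t for t in tokens if t.strip() and t not in stopwords]

tokens = preprocess("作业人员误登带电的10kV线路06号杆，发生触电事故。")
```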

3.1.2. Design and Calculation of Weights

After the word segmentation process, this paper converts sentences in the original corpus into contextual sentences composed of segmented words and uses the segmented sentence lists to train the Word2Vec model. Word2Vec employs a shallow neural network to embed words into a low-dimensional vector space in order to capture the semantic information of the words. In small datasets, the vocabulary is usually limited and the contextual information of high-frequency words is relatively sufficient. Among the two architectures of the Word2Vec model, the CBOW model trains quickly on small datasets and makes better use of this contextual information. Since the dataset constructed in this paper is small, CBOW is chosen to train the Word2Vec model. The two important parameters of Word2Vec are vector_size and window, and the appropriate combination of these parameters is selected based on silhouette coefficient to generate word vectors.
In this paper, we use the t-SNE algorithm for dimensionality reduction of generated high dimensional word vectors [28]. This algorithm is a nonlinear technique for dimensionality reduction that keeps similar data similar in low dimensional space. Then, the dimensionality reduced word vectors are clustered using K-means clustering. Based on the above process, the cosine similarity is used to calculate the segmentation weights with the following formula:
$$word_{weight} = similar(x, accident) \times 2^{i}$$
where x denotes the current segmentation; a c c i d e n t denotes the accident level; s i m i l a r x ,   a c c i d e n t denotes the cosine similarity between the current segmentation x and a c c i d e n t , and the current segmentation x belongs to the same clustering category as the accident level; and i is assigned different values based on a c c i d e n t (when the a c c i d e n t is Minor, Moderate, Major and Severe, the corresponding i is 0, 1, 2, and 3, respectively).
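A short sketch of this weighting rule (illustrative; the vectors and level names below are placeholders, and the level-to-exponent mapping follows the description above):

```python
# Word weight = cosine similarity to the accident-level vector, scaled by 2^i
# (Minor -> i=0, Moderate -> 1, Major -> 2, Severe -> 3), as defined above.
import numpy as np

LEVEL_EXPONENT = {"Minor": 0, "Moderate": 1, "Major": 2, "Severe": 3}

def word_weight(word_vec: np.ndarray, level_vec: np.ndarray, level: str) -> float:
    sim = float(np.dot(word_vec, level_vec) /
                (np.linalg.norm(word_vec) * np.linalg.norm(level_vec)))
    return sim * 2 ** LEVEL_EXPONENT[level]

# Example: a segmented word that clusters with the "Major" accident level.
w = word_weight(np.array([0.3, 0.7]), np.array([0.4, 0.6]), "Major")
```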

3.2. Text-Weighted FP-Growth Algorithm

Based on the design and calculation of segmentation weights, the improved FP-Growth algorithm presented in this paper primarily mines frequent terms by combining sparse coefficients to calculate the importance of accident records. At the same time, this paper introduces parallel computing and improves the relevant data structure during the execution of the algorithm to optimise the operation efficiency of the algorithm.

3.2.1. Sparse Coefficients

In this paper, 10 risk factors were mined as feature terms, and it was found that some feature terms with low segmentation weights in the accident dataset also reflect potential correlations with accident level. These sparse feature terms appear with low frequency but may still be strongly correlated with accidents. Therefore, this paper designs sparse coefficients and adds them to the importance calculation [29]. The calculation formulas are as follows:
$$p_{avg} = \frac{1}{c} \sum_{i} p_i, \quad p_i = \frac{n}{N}, \quad \alpha \le p_i \le \beta$$
$$coef = \frac{1}{1 - p_{avg}}$$
where $p_i$ denotes the sparsity rate of the i-th feature term; n represents the number of accident records in which the feature term is missing; N represents the total number of accident records; $p_{avg}$ denotes the average sparsity of the c feature terms whose sparsity falls within the specified range $[\alpha, \beta]$; and $coef$ denotes the sparse coefficient.
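A minimal sketch of this calculation (using the reconstruction coef = 1/(1 − p_avg) of the formula above; the counts and the [alpha, beta] range are illustrative placeholders):

```python
# Sparse coefficient: average sparsity p_avg of feature terms whose missing rate n/N
# lies in [alpha, beta], then coef = 1 / (1 - p_avg). Values below are placeholders.
def sparse_coefficient(missing_counts: list[int], total_records: int,
                       alpha: float, beta: float) -> float:
    rates = [n / total_records for n in missing_counts]  # p_i = n / N per feature term
    in_range = [p for p in rates if alpha <= p <= beta]  # sparse terms only
    if not in_range:
        return 1.0                                       # no sparse terms: no boost
    p_avg = sum(in_range) / len(in_range)
    return 1.0 / (1.0 - p_avg)

coef = sparse_coefficient(missing_counts=[350, 360, 120], total_records=373,
                          alpha=0.9, beta=0.99)
```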

3.2.2. Definition of the Weighted FP-Growth Algorithm

We define the feature term set associated with accidents as $I = \{I_1, I_2, I_3, \dots, I_n\}$ and the accident record set as $R = \{R_1, R_2, R_3, \dots, R_m\}$, where each record $R_i$ consists of feature terms from I. In an accident record $R_i$, each feature item contains multiple segmentation weights $word_{weight}$. The importance of an accident record (Record Importance, RI) is calculated from the word weights and the sparse coefficient as follows:
$$RI = \frac{\sum word_{weight}}{n} \times coef$$
where n is the number of segmentation weights.
The weighted FP-Growth algorithm filters the frequent itemsets based on a weighted support threshold ε, and the weighted support of a feature item is calculated as follows:
$$Sup(I_i) = \frac{\sum_{j=1}^{m} (I_i \in R_j) \times RI_j}{\sum_{j=1}^{m} RI_j}$$
where $Sup(I_i)$ denotes the weighted support of feature term $I_i$ and the conditional expression $I_i \in R_j$ indicates whether the feature term $I_i$ appears in accident record $R_j$: it evaluates to 1 if it does and 0 otherwise.
The algorithm calculates the weighted support of the risk factor feature terms by scanning the power accident dataset and sorts them in descending order of weighted support. Subsequently, the FP-Tree is constructed by inserting the feature terms of each accident record in this sorted order. Finally, the conditional FP-Trees are constructed recursively to mine the frequent itemsets. The specific algorithm is described as follows (Algorithm 1):
Algorithm 1 Weighted FP-Growth Algorithm
Input: accident record set R, minimum weighted support threshold ε
Output: frequent itemset FIS
1. FP-Growth(R, ε)
2. Construct the header table Table: scan R, compute $Sup(I_i)$, and build the header table in descending order of weighted support
3. Construct the FP-Tree: scan R and initialise the root of the tree as an empty node, i.e., Root = null
4. For each record $R_i$ in R:
5.   Insert the feature items $I_j$ of $R_i$ into the FP-Tree from Root in the sorted order
6.   If a node for $I_j$ already exists: merge (accumulate) the node weights;
7.   Else: create a new node and add its address to Table
8. For each item $I_i$ in Table:
9.   Construct the conditional pattern base $R_c$ of $I_i$: traverse the tree nodes indexed by Table($I_i$) to build the conditional pattern base
10.  Recursion: $FIS_c$ = FP-Growth($R_c$, ε)
11.  Combine: FIS = FIS ∪ $FIS_c$
12. Return FIS
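For concreteness, the pre-processing steps of Algorithm 1 (record importance RI and the weighted support used to order the header table) can be sketched as follows; the records, weights, and coefficients below are illustrative placeholders, not values from the accident dataset.

```python
# Record importance RI and weighted support Sup(I_i), as defined above (toy data).
from collections import defaultdict

records = [
    {"items": {"operation error", "moderate accident"},
     "word_weights": [0.8, 1.2], "coef": 1.0},
    {"items": {"command error", "severe accident"},
     "word_weights": [0.4, 2.4], "coef": 2.5},
]

# RI = (mean of the segmentation weights) x sparse coefficient.
for r in records:
    r["RI"] = sum(r["word_weights"]) / len(r["word_weights"]) * r["coef"]

total_ri = sum(r["RI"] for r in records)
sup = defaultdict(float)
for r in records:
    for item in r["items"]:
        sup[item] += r["RI"] / total_ri  # weighted support Sup(I_i)

# Header table order: feature items sorted by descending weighted support;
# items below the threshold epsilon would be pruned before building the FP-Tree.
epsilon = 0.2
header_order = [i for i in sorted(sup, key=sup.get, reverse=True) if sup[i] >= epsilon]
```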

3.2.3. Optimisation of Operational Efficiency

Firstly, this paper enhances the algorithm through the use of hash tables. A hash table is used to build the header table in Step 2, enabling rapid indexing and addition of node addresses during FP-Tree construction in Step 3. Secondly, constructing the FP-Tree requires ordering the feature items in the dataset by weighted support in descending order; when building the conditional pattern base, a pointer link is established from each feature item node to the corresponding hash table entry based on its support value, which allows efficient ranking by directly comparing support values and minimises traversal overhead. Lastly, the recursive mining of frequent items is executed in parallel using Open Multi-Processing (OpenMP), a cross-platform multi-threaded programming API for shared-memory models. This allows the header table to be traversed in parallel for the recursive mining of frequent items and significantly improves operational efficiency.
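The paper's implementation uses C++17 with OpenMP; purely as a language-neutral sketch of the same idea (each header-table item defines an independent conditional-tree mining task), the fan-out could look like this in Python:

```python
# Analogue of the OpenMP parallelisation described above (illustrative only; the
# actual implementation is C++/OpenMP). Each header-table item is mined independently.
from concurrent.futures import ProcessPoolExecutor

def mine_item(item: str) -> list[tuple[str, ...]]:
    """Placeholder for recursively mining the conditional FP-Tree of one item."""
    return [(item,)]  # a real implementation would return the local frequent itemsets

header_items = ["operation error", "command error", "supervision error"]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = pool.map(mine_item, header_items)  # one task per header-table item
    frequent_itemsets = [fi for local in results for fi in local]
```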

4. Experimental Results and Discussions

4.1. Experimental Dataset

The power accident dataset presented in this paper comprises a corpus of power accidents within the Southern Power Grid, transcribed from the National Compendium of Power Accidents and Power Safety Events (2022) and documentation from a power supply bureau. The dataset organizes and classifies power accidents and safety events into categories such as personal injury and fatality incidents, equipment failures, and power safety events. Each accident is elaborated upon from several aspects, including a brief description, detailed accounts, root causes, identified issues, preventive measures, and corrective actions. In total, the dataset contains 373 records, encompassing both original text entries and extracted dimensions of accident causation with corresponding descriptions. An example of a training sample is provided, as shown in Table 1.
This paper specifically examines the correlations between risk factor characteristics across ten dimensions (command error, operational error, supervisory error, insufficient knowledge and skills, inappropriate measures, improper use of equipment, inadequate regulations and procedures, ineffective dual prevention mechanisms for risks, insufficient risk management and control, and inadequate safety training) and the four grades of accident occurrence, categorized as Minor, Moderate, Major, and Severe.

4.2. Experiments on Word2Vec-Based Word Vector Generation

In this paper, experiments are designed on the accident text: jieba segmentation and a stop word list are used to remove segmentation noise, and word vectors are generated for the segmented text based on the Word2Vec model. Since vector_size and window are important parameters affecting the quality of the word vectors, this experiment searches for the optimal combination by adjusting these two parameters.
Firstly, the parameter space is defined: the value range of vector_size is [60, 120], and the step size is 10; the value range of window is [3, 9], and the step size is 2. Then, all the combinations of parameters are traversed by using the grid search method, and the word vectors are obtained by training the Word2Vec model. The word vectors were dimensionality reduced and then K-means clustering was performed to classify the word vectors into four classes, representing the different accident levels. Each group of experiments is repeated five times; if any one of the clustering attempts fails, the statistical result is failure. The clustering results are shown in Table 2, where T indicates clustering success and F indicates clustering failure.
From Table 2, it can be seen that when vector_size is 90 or 100 and window is 5, it can successfully cluster the word vectors into four classes. The two parameter combinations (90, 5) and (100, 5) were further evaluated using the silhouette coefficient S, calculated as follows:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
where a(i) is the average value of the degree of dissimilarity from vector i to other points within the same cluster and b(i) is the minimum value of the average degree of dissimilarity from vector i to other clusters.
The results show that the silhouette coefficient of the combination (100, 5) is 9.76% higher than that of the combination (90, 5), so a vector_size of 100 and a window of 5 are selected as the final parameter settings. Based on this parameter combination, the visualisation of the word vector clustering is shown in Figure 4, indicating that the word vectors in the power accident dataset have been successfully classified into four classes.
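A sketch of this grid search (assuming gensim and scikit-learn; the toy corpus, t-SNE perplexity, and random seeds are illustrative choices, not the paper's exact settings):

```python
# Grid search over (vector_size, window) scored by the silhouette coefficient of the
# k = 4 clustering of t-SNE-reduced word vectors (illustrative sketch with toy text).
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

corpus = [["pole", "electrocution", "operation", "error"],
          ["training", "insufficient", "safety", "education"],
          ["command", "error", "explanation", "formality"],
          ["supervision", "absent", "site", "duty"]] * 5  # stand-in for the accident text

def evaluate(vector_size: int, window: int) -> float:
    model = Word2Vec(corpus, vector_size=vector_size, window=window, sg=0, min_count=1)
    vectors = np.array([model.wv[w] for w in model.wv.index_to_key])
    reduced = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
    return silhouette_score(reduced, labels)

grid = [(vs, win) for vs in range(60, 130, 10) for win in range(3, 10, 2)]
best = max(grid, key=lambda p: evaluate(*p))
```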

4.3. Experiments on the FP-Growth Algorithm Based on Text-Weighting

This algorithm comprises two parts: text weights and sparse coefficients. First, the experiment explores the effect of text weighting on the algorithm's results. To mine as many frequent itemsets as possible, the minimum support threshold is set to 0.01. The support of an association rule reflects how frequently the association between risk factors and accident levels appears: the higher the support, the more frequently the rule occurs. The change in support of each risk factor feature term before and after text weighting is shown in Figure 5.
As illustrated in Figure 5, with the exception of the two feature items “command error” and “improper use of work equipment”, the support levels of the remaining eight feature items have improved following text weighting. Notably, the support levels for “operation error” and “insufficient safety education and training” both exceed 0.8 before and after weighting. Among these, “insufficient safety education and training” experienced the most significant increase, rising by 3.75 percent. The support levels of the other six characteristics are primarily concentrated within the range of 0.4 to 0.7, showing a slight increase post-weighting.
Because "command error" and "improper use of work equipment" occur with low frequency in the dataset, their support ranks are low, and text weighting further decreases their weighted support; the weighted support of "improper use of work equipment" is only 0.19. If the support threshold were set to 0.2, this item could not be mined into the relevant frequent items. For this reason, this paper introduces sparse coefficients and experimentally explores their effect, in combination with the text weights, on the algorithm's results. The support changes of the feature items before and after combining the sparse coefficients are shown in Figure 6.
As illustrated in Figure 6, the support of the sparse feature term "improper use of work tools" improves significantly after incorporating the sparse coefficients, increasing from 0.21 to 0.52, and the support of the sparse feature term "command error" also improves by 20%. Raising the support of sparse feature terms in this way increases the likelihood that they are mined into relevant frequent items. Additionally, with the exception of "inadequate safety education and training", whose support is unchanged before and after applying the sparse coefficients, the support of all other feature items under sparse coefficients combined with text weighting surpasses that achieved through text weighting alone.
In order to verify the effectiveness of the proposed method, under the same dataset, this paper compares the original FP-Growth algorithm, the improved FP-Growth algorithm, and the Pearson correlation coefficient method [30]. The Pearson’s correlation coefficient method is a classical method traditionally used to measure the linear relationship of variables and is widely used in correlation analysis. The results of the degree of correlation between the 10 risk factors and the accident level are shown in Figure 7.
Figure 7 shows that the associations obtained by the Pearson correlation coefficient method for the three risk factors "supervision error", "inadequate regulations and operating procedures", and "inadequate risk control" are similar to those obtained by the other two algorithms. However, for the four risk factors "command error", "insufficient knowledge and skills", "improper measures", and "improper use of work equipment", the association degree is seriously underestimated, falling below 0.2. This is because the method assumes a linear relationship between variables and considers only numerical data, ignoring the complex nonlinear relationships that may exist between risk factors and accident levels as well as the influence of semantic information.
The FP-Growth algorithm based on text weighting proposed in this paper is not only able to deal with complex text data but also gives different weights to different risk factors through text weighting, which effectively avoids the insufficiency of Pearson’s correlation coefficient method in dealing with sparse data and nonlinear relationships. Therefore, using this algorithm can more comprehensively mine the potential risk factors, make the association rule mining more in line with the reality, and provide more accurate risk decision support for the power accident early warning.

4.4. Algorithm Running Efficiency Analysis

This experiment is executed on a machine with an AMD Ryzen 7 5800H CPU (sourced in China), 32 GB of RAM, and the Windows 11 operating system. The core implementation of the algorithm is based on C++17. The running time of the algorithm under different support thresholds is shown in Figure 8a, and the memory usage in Figure 8b.
From Figure 8, it can be seen that the improved FP-Growth algorithm proposed in this paper has less running time than the original FP-Growth algorithm at different support thresholds and possesses higher running efficiency. Meanwhile, this method occupies less memory when the support threshold is lower, i.e., mining more frequent items. The improved FP-Growth algorithm saves time and space costs and verifies the efficiency of the method.

4.5. Analysis of the Correlation Between Accident Risk Levels

After the FP-Growth algorithm based on text weights and sparse coefficients obtains the frequent itemsets and their support, statistics on the frequent 2-itemsets are compiled to derive the degree of association between each accident risk feature and the different accident classes, as shown in Figure 9. As can be seen from Figure 9, the degree of association between the moderate accident class and each risk feature is relatively high, because moderate accidents account for a high proportion of the dataset while the remaining three accident classes account for a low proportion, which is consistent with objective regularity. Among the features, operation error has the highest degree of association with the moderate accident class, indicating that this pair has the highest weighted support in the whole dataset and demonstrating the importance and high frequency of operation error in accident characteristics. For the severe accident class, eight dimensions have a similar degree of association, indicating that the occurrence of severe accidents is often inextricably linked to all of these factors.
By summing the degrees of association between each risk feature and the different accident levels in Figure 9 and arranging them in descending order, a comprehensive ranking of risk features is obtained, which reveals the overall degree of association between different risk features and accident levels, as shown in Figure 10. The feature item "operation error" has the greatest impact on power accidents, with an association degree of 0.88, followed by "insufficient safety education and training" with 0.83. "Command error" has the lowest impact, below 0.4 (0.36), and the impact of the other seven risk factors lies within the range of 0.4–0.8. The experimental results show that the FP-Growth algorithm based on text weights and sparse coefficients proposed in this paper effectively mines the risk factors of power accidents and analyses the degree of association of each risk factor with accidents.

4.6. Association Rule Mining Based on the Improved FP-Growth Algorithm

The confidence of an association rule reveals the causal relationship between the risk factors and the accident level: the higher the confidence, the greater the possibility that the antecedent will trigger the consequent. To achieve early warning of power accidents, the causal relationship between the antecedent and consequent items can be analysed in depth to find the risk factors that most strongly drive the risk level. Some of the association rules mined by this algorithm are shown in Table 3, sorted by confidence.
In the historical accident data, the correlation between these risk factors is very high. From the association rules in Table 3, it can be seen that "insufficient knowledge and skills" leads to a higher probability of moderate accidents, so electric power enterprises should focus on training staff knowledge and skills. Similarly, "unsound dual prevention mechanism for risks and hazards" and "improper use of work equipment" also lead to a higher probability of moderate accidents, so power companies should pay attention to the dual prevention mechanism of risks and hazards and to employees' use of work equipment. Based on the mined association rules, the algorithm can provide effective risk decision support for the early warning of electric power accidents.

5. Conclusions

The amount of data on power accidents has risen dramatically, posing unprecedented challenges for risk assessment and safety improvement of power accidents. Accidents in smart energy systems usually involve multidimensional factors and it is difficult to capture these complex risk factors directly in unstructured textual data. Aiming at the textual description of accident risk features, this paper firstly performs word segmentation on the independently constructed datasets about electric power accidents, followed by the generation of word vectors through the Word2Vec model. The K-means clustering algorithm is employed to classify the accident records into four categories corresponding to different accident levels, following dimensionality reduction of the word vectors. Cosine similarity is then utilized to calculate text weights, leading to the definition of the FP-Growth algorithm based on text-weighting. Building on this, an improved FP-Growth algorithm explores the associations between various accident risk features and risk classes, with an emphasis on sparse feature terms. Additionally, the mining speed of frequent terms is enhanced by optimizing certain structural and procedural aspects of the FP-Growth algorithm, alongside the implementation of parallel processing strategies. Finally, the effectiveness of this algorithm is demonstrated through experimental results, providing risk decision support for the early warning of electric power personal accidents.
This method is applicable within the domain of electric power accidents, facilitating the analysis of correlations between different feature dimensions and accident levels. Safety managers are able to gain a deeper understanding of the key risk factors that influence the outcome of an incident, thus supporting faster, more accurate decision-making and aiding in the development of public safety early warning systems.

Author Contributions

Conceptualization, R.L. and J.Z.; methodology, R.L.; software, F.D.; validation, J.Z.; formal analysis, R.L.; investigation, F.D.; resources, J.Z.; data curation, R.L.; writing—original draft preparation, F.D.; writing—review and editing, R.L.; visualization, J.Z.; supervision, R.L.; project administration, F.D.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 52377103; the General Program of National Natural Science Foundation of China, grant number 52277148; and Southern Power Grid Corporation Technology Project Funding, grant number 031300KK52222091.

Data Availability Statement

The original contributions presented in the study are included in the article and further inquiries can be directed to the corresponding author.

Conflicts of Interest

R.L. and J.Z. were employed by Guangdong Power Grid Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from Southern Power Grid Corporation. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

  1. Zhang, C.Y. Network-based Infrastructure as Media: Energy, Transportation and Information(ETI) Networks Integration. Glob. J. Media Stud. 2023, 10, 56–70. [Google Scholar] [CrossRef]
  2. Zhang, K. Research on Key Technologies and Applications of Intelligent Operation and Inspection of Power Grid. Ph.D. Thesis, University of Science and Technology of China, Hefei, China, 2023. [Google Scholar]
  3. Tang, G.F.; Zhou, J.; Pang, H.; Lin, J.J.; Fan, Z.; Wu, Y.N.; He, Z.Y.; Ma, S.C.; Xue, F.; Zhou, B.R. Strategic Framework for New Electric Power System Development under the Energy Security Pattern. Strateg. Study CAE 2023, 25, 79–88. [Google Scholar]
  4. Yan, Y.Q.; Zhang, S.; Liang, Z.X.; Sheng, W. Statistics and Analysis of Electric Power Enterprises Personal Accidents in China During 2016–2021. Saf. Secur. 2023, 44, 46–51. [Google Scholar]
  5. Gholami, A.; Srivastava, A.K. ORCA: Outage Root Cause Analysis in DER-Rich Power Distribution System Using Data Fusion, Hierarchical Clustering and FP-Growth Rule Mining. IEEE Trans. Smart Grid 2024, 15, 667–676. [Google Scholar] [CrossRef]
  6. Wu, Y.; Du, X.S.; Fu, Y.H. Stable Operation of Power Grid Based on Association Rule Mining Method of Energy Big Data. Opt. Optoelectron. Technol. 2022, 20, 139–144. [Google Scholar]
  7. Wang, X.H.; Liu, X.X.; Zhong, F.C.; Li, Z.L.; Xuan, K.G.; Zhao, Z.L. A Scenario Generation Method for Typical Operations of Power Systems with PV Integration Considering Weather Factors. Sustainability 2023, 15, 15007. [Google Scholar] [CrossRef]
  8. Zhou, X.; Ma, L.N.; Ma, X.G.; Hao, Q.; Jia, H.Y.; Liu, H.; Bai, W.X.; Zhang, H.Z.; Li, S.J.; Yang, Q.F. Research on text analysis of hidden dangers of coal mine accidents based on link prediction. J. Saf. Sci. Technol. 2024, 20, 26–34. [Google Scholar]
  9. Shen, J.H.; Liu, S.P.; Zhang, J. Using Text Mining and Bayesian Network to Identify Key Risk Factors for Safety Accidents in Metro Construction. J. Constr. Eng. Manag. 2024, 150, 04024052. [Google Scholar] [CrossRef]
  10. Liu, J.F.; Ma, H.Z.; Xie, X.L.; Cheng, J. Short Text Classification for Faults Information of Secondary Equipment Based on Convolutional Neural Networks. Energies 2022, 15, 2400. [Google Scholar] [CrossRef]
  11. Ruan, G.C.; Xie, F.; Tu, S.W. Application Research Based on Word2vec Diversity in Library Recommender System. Libr. J. 2020, 39, 124–132. [Google Scholar]
  12. Wang, H.W.; Yin, P.; Duan, Z.T.; Li, Y. Research on power marketing data mining and clustering techniques based on Bert and k-meas. In Proceedings of the International Conference on Power Electronics and Artificial Intelligence (PEAI), Xiamen, China, 19 January 2024. [Google Scholar]
  13. Kasihmuddin, M.S.M.; Jamaludin, S.Z.M.; Mansor, M.A.; Wahab, H.A.; Ghadzi, S.M.S. Supervised Learning Perspective in Logic Mining. Mathematics 2022, 10, 915. [Google Scholar] [CrossRef]
  14. Gan, W.S.; Lin, J.C.W.; Yu, P.S. A Survey of Utility-Oriented Pattern Mining. IEEE Trans. Knowl. Data Eng. 2021, 33, 1306–1327. [Google Scholar] [CrossRef]
  15. Diaz-Garcia, J.A.; Ruiz, M.D.; Martin-Bautista, M.J. A survey on the use of association rules mining techniques in textual social media. Artif. Intell. Rev. 2023, 56, 1175–1200. [Google Scholar] [CrossRef] [PubMed]
  16. Agrawal, R.; Imieliński, T.; Swami, A. Mining association rules between sets of items in large databases. Proc. 1993 ACM SIGMOD Int. Conf. Manag. Data 1993, 22, 207–216. [Google Scholar] [CrossRef]
  17. Liu, W.W.; Wang, H.W.; Hou, Z.G. Analysis on risk association rules of civil aviation aircraft maintenance based on BL-Apriori. J. Saf. Sci. Technol. 2024, 20, 27–33. [Google Scholar]
  18. Han, J.; Pei, J. Mining frequent patterns without candidate generation. ACM 2000, 29, 1–12. [Google Scholar]
  19. Xiao, Y.; Meng, L.H.; Zhang, Y.R.; Gu, Y.T. An Improved FP-Growth Algorithm with Time Decay Factor and Element Attention Weight. In Proceedings of the 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA), Shenyang, China, 5 August 2024. [Google Scholar]
  20. Yu, J.B.; Ma, X.Y.; Zhao, Z.Y. Research on association analysis of food risk factors based on the improved FP-growth algorithm. Food Sci. 2024; in press. [Google Scholar]
  21. Lei, X.; Cheng, G.; Zhang, Y.J.; Guo, L.; Zhang, F.C. Association analysis of alarm information based on power network situation awareness platform. Comput. Eng. Sci. 2023, 45, 1197–1208. [Google Scholar]
  22. Wu, D.P.; Hua, G. Text Case Classification of Safety Production Accidents Based on Word2Vec Word Embedding and Clustering Model. Comput. Syst. Appl. 2021, 30, 141–145. [Google Scholar]
  23. Zhong, G.F.; Pang, X.W.; Sui, D. Text Classification Method Based on Word2Vec and AlexNet-2 with Improve Attention Mechanism. Comput. Sci. 2024, 49, 288–293. [Google Scholar]
  24. Yan, F.X.; Wang, J.H. Sentiment recognition model of Weibo comments based on SVM and Word2vec. Mod. Comput. 2024, 30, 60–64. [Google Scholar]
  25. Lv, J.; Qiu, X.L. A noisy label deep learning algorithm based on K-means clustering and feature space augmentation. CAAI Trans. Intell. Syst. 2024, 19, 267–277. [Google Scholar]
  26. Öztürk, M.M. A cosine similarity-based labeling technique for vulnerability type detection using source codes. Comput. Secur. 2024, 146, 104059. [Google Scholar] [CrossRef]
  27. Gao, P.; Li, F.; Peng, Y.H.; Zhang, C.H.; Peng, H.J. Accurate Classification Method of Power Customers Based on Jieba Chinese Word Segmentation. Hunan Electr. Power 2023, 43, 151–154. [Google Scholar]
  28. Yagahara, A.; Uesugi, M.; Yokoi, H. t-SNE Visualization of Vector Pairs of Similar and Dissimilar Definition Sentences Created by Word2vec and Doc2vec in Japanese Medical Device Adverse Event Terminology. Stud. Health Technol. Inform. 2022, 290, 1058–1059. [Google Scholar] [PubMed]
  29. Li, Y.H.; Hu, L.; Gao, W.F. Multi-Label Feature Selection Based on Sparse Coefficient Matrix Reconstruction. Chin. J. Comput. 2022, 45, 1827–1841. [Google Scholar]
  30. Li, N.; Yang, F.; Wu, H.B.; Zhang, Y.X.; Yin, L. Study on the correlation between dual-control policies on energy consumption and energy use efficiency in energy-consuming industries. High Volt. Eng. 2023, 49, 215–220. [Google Scholar]
Figure 1. CBOW and Skip-gram model architectures.
Figure 2. Process of mining FP-Growth frequent itemsets.
Figure 3. Process of mining power accident risk factors.
Figure 4. Segmentation clustering visualisation.
Figure 5. Change in support of feature items before and after text weighting.
Figure 6. Change in support of feature terms before and after combining sparse coefficients.
Figure 7. Comparison of the degree of association of the three methods.
Figure 8. (a) Comparison of running time and (b) comparison of memory usage under different support thresholds.
Figure 9. Degree of correlation between different feature terms and accident levels.
Figure 10. Comprehensive ranking of risk features.
Table 1. Specific examples of the accident dataset.

Description of the Incident:
(1) Brief description of the accident: While carrying out a comprehensive improvement of the 10 kV xx water line under the xx Power Supply Bureau of the xx Power Grid, an operation and maintenance worker of the distribution network mistakenly climbed pole No. 06 of the 10 kV xx line, which was energised near the work site, causing an electrocution accident and serious injury.
(2) How the accident happened: The xx Power Supply Bureau was carrying out a comprehensive improvement of the distribution network of the 10 kV xx line. The staff came from four power supply offices and several external construction units, and the work was divided into two groups involving a total of three work tickets. At 07:00, the line section after disconnector 01 of the 10 kV xx line was taken out of service for repair. Without safety matters being fully explained, the members of the first team (Qin X and others) signed off and went to the job site. Qin X and three others planned to replace porcelain insulators at poles 04–08 of the 10 kV xx line. At 09:56, after the rain abated, Qin X and the others reached the vicinity of pole 06 of the 10 kV xx line, which they mistakenly took to be the work site. Without verifying the pole identification, Qin X began to climb the pole and was electrocuted and injured in the process. At 10:25, Su X went up the pole, confirmed there was no voltage, rescued Qin X, and sent him to the hospital for treatment.
(3) Causes of the accident:
  • Direct cause: The field worker had a weak sense of safety and mistakenly climbed the energised pole No. 06 without checking the line name and pole number or conducting a voltage test.
  • Indirect causes: (a) The safety explanation was a mere formality: the general explanation given by the leader of the on-site coordination group, My, and the person in charge of the work ticket to all workers was not specific, and the subgroup leader did not give a safety explanation to the workers in the subgroup. (b) There was a vacuum in on-site safety supervision: the person in charge of the work did not explicitly designate a dedicated supervisor in the work permit. (c) Materials and work equipment were placed in the wrong position: the safety belt bag, voltage tester pen, foot buckles, lanyard, etc., were misplaced, which misled the worker into climbing the wrong pole.

Extractable Feature Text:
  • Command error: the safety explanation was a mere formality;
  • Operational error: mistakenly climbing pole No. 06 of the 10 kV xx line, which was running energised;
  • Supervision error: vacuum in site safety supervision duties;
  • Improper use of work equipment: wrong placement of materials and work equipment;
  • Accident level: general.
Table 2. Clustering results for different parameter combinations (T = clustering success, F = clustering failure).

Window \ Vector_Size | 60 | 70 | 80 | 90 | 100 | 110 | 120
3                    | F  | F  | F  | F  | F   | F   | F
5                    | F  | F  | F  | T  | T   | F   | F
7                    | F  | F  | F  | F  | F   | F   | F
9                    | F  | F  | F  | F  | F   | F   | F
Table 3. Confidence level of the association rule between feature items and accident level.

Antecedent                                              | Consequent        | Confidence
Insufficient knowledge and skills                       | moderate accident | 0.90
Unsound dual prevention mechanism for risks and hazards | moderate accident | 0.89
Improper use of work equipment                          | moderate accident | 0.89