Article

An ELECTRA-Based Model for Power Safety Named Entity Recognition

1 Institute of Energy and Electrical Engineering, Changchun University of Science and Technology, Changchun 130013, China
2 School of Control and Computer Engineering, North China Electric Power University, Beijing 100096, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9410; https://doi.org/10.3390/app14209410
Submission received: 31 July 2024 / Revised: 9 September 2024 / Accepted: 24 September 2024 / Published: 15 October 2024

Abstract

Power safety named entity recognition (NER) is essential for determining the cause of faults, assessing potential risks, and planning maintenance schedules, and it supports the comprehension and analysis of the content and structure of power safety documentation. Such analysis is crucial for developing a knowledge graph in the power safety domain and for augmenting the associated dataset. This paper introduces a power safety NER model based on the efficiently learning an encoder that classifies token replacements accurately (ELECTRA) model. The model employs root mean square layer normalization (RMSNorm) and the switched gated linear unit (SwiGLU) activation function in place of the conventional layer normalization (LayerNorm) and Gaussian error linear units (GeLU). It also integrates bidirectional long short-term memory (BiLSTM) with conditional random fields (CRF) to bolster performance on NER tasks. Experimental results show that the improved ELECTRA model achieves an F1 value of 93% on the constructed power safety NER dataset, outperforming the BERT-BiLSTM-CRF model by 3.3%.

1. Introduction

Named Entity Recognition (NER) is crucial for power safety. By identifying key entities in text, such as equipment, components, and operations, it significantly enhances the safety and reliability of power systems. This technology enables the extraction of relevant information from large volumes of unstructured text, improving the accuracy of fault diagnosis, the effectiveness of risk assessment, and the optimization of preventive maintenance. By classifying entity types, NER helps technicians better understand the content of the text. For example, it can distinguish between different types of switchgear, which is crucial for determining the cause of faults, assessing potential risks, and planning maintenance schedules. NER is also essential for constructing knowledge graphs [1] from power safety texts: it identifies key information, creates structured data, and supports the construction of high-quality datasets. Researching NER in power safety texts is therefore critically important [2].
NER is a subtask of natural language processing that identifies and categorizes entities with specific meanings, such as names, places, organizations, times, and numbers, into predefined categories [3]. NER in power safety texts is highly challenging due to the extensive use of specialized terminology and frequent ambiguities. In the power sector, polysemy is common; for instance, the term “switch” could refer to a circuit breaker, an isolator, or other switching devices. Its exact meaning depends on the specific context and deep industry knowledge. Additionally, the hierarchical and subordinate relationships between power entities are complex. For example, in the phrase “the high-voltage side circuit breaker of the No. 1 main transformer at a substation”, identifying the “high-voltage side circuit breaker” as part of the “No. 1 main transformer” rather than as an independent device demands high precision in entity extraction. Power systems consist of numerous components such as generators, transformers, lines, and circuit breakers. Their hierarchical relationships must be correctly identified to distinguish whether an entity is a component of another device or an independent device [4].
Traditional NER methods mainly relied on dictionary- and rule-based pattern matching and statistical learning approaches, but these often depended on building knowledge bases and dictionaries, requiring significant time and effort. In recent years, there has been significant progress in deep-learning-based NER methods. Bidirectional long short-term memory (BiLSTM) [5] is a sequence processing model composed of two long short-term memory (LSTM) networks. The conditional random field (CRF) [6] is a discriminative probabilistic model that predicts the probability distribution of an output sequence given an input sequence. Researchers have combined BiLSTM with CRF to form the BiLSTM-CRF model [5], which is used for NER tasks. Lample et al. [7] introduced architectures based on BiLSTM and CRF, as well as transition-based methods, for NER.
With the introduction of the bidirectional encoder representations from transformers (BERT) model [8] and its derivatives, such as the robustly optimized BERT (RoBERTa) [9], a lite BERT (ALBERT) [10], enhanced representation through knowledge integration (ERNIE) [11], and efficiently learning an encoder that classifies token replacements accurately (ELECTRA) [12], self-attention-based NER models have become the mainstream approach. Many researchers combine BERT, BiLSTM, and CRF for NER tasks [13]. Sun et al. [14] explored a software entity recognition method based on BERT embeddings. Hu et al. [15] introduced a knowledge graph-inspired NER method, which improved the performance of BERT by integrating knowledge graphs. Yang et al. [16] introduced a BERT-Star-Transformer-CNN-BiLSTM-CRF model, which improves the efficiency of the traditional Transformer by integrating BERT-generated vectors with a Star-Transformer. He et al. [17] introduced an NER model for the maintenance records of primary power equipment. Chen et al. [18] introduced an NER model for the Chinese power market employing whole word masking and dual feature extraction. Meng et al. [19] introduced a BERT-BiLSTM-CRF model to construct a Chinese knowledge graph of power equipment faults. Fan et al. [20] introduced a RoBERTa-BiLSTM-CRF model that extracts text entities and uses Wubi sequences and Chinese character radicals to improve feature extraction accuracy. Yang et al. [21] introduced an ELECTRA-BiLSTM-CRF model for identifying agricultural finance policy information. Fu et al. [22] introduced an NER method for drug-related personnel using ELECTRA-BiLSTM-CRF. Feng et al. [23] introduced a BERT-BiLSTM-CRF model for NER on a power domain dataset.

2. Materials and Methods

This paper introduces a model based on ELECTRA for extracting entities from power safety datasets. The proposed model incorporates a modified Transformer [24] architecture, integrating root mean square layer normalization (RMSNorm) [25] and the switched gated linear unit (SwiGLU) [26] activation function into the ELECTRA model, replacing the original layer normalization (LayerNorm) [27] and Gaussian error linear units (GeLU) activation function [28]. Furthermore, this paper combines BiLSTM and CRF to form an end-to-end sequence labeling model for the entity recognition subtask.
Figure 1 shows the specific model structure and principles. The RMSNorm-SwiGLU-ELECTRA-BiLSTM-CRF (RS-ELECTRA-BiLSTM-CRF) model consists of the RS-ELECTRA layer, the BiLSTM layer, and the CRF layer. The RS-ELECTRA layer encodes the input text, generating contextually relevant representations for each word. The BiLSTM layer further extracts features from the RS-ELECTRA output, modeling the contextual information of each word through bidirectional recurrent neural networks. The CRF layer establishes a label transition matrix on the output of the BiLSTM layer, learning the transition probabilities between labels to annotate the input sequence, as sketched in the example below.
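To make this composition concrete, the following PyTorch sketch wires an ELECTRA encoder, a BiLSTM, and a per-token emission layer together. The class name, the Hugging Face checkpoint, and the layer sizes are illustrative assumptions rather than the authors' released code: a stock ELECTRA checkpoint does not contain the RMSNorm/SwiGLU modifications of Section 2.2, and the CRF decoding step of Section 2.4 is omitted from this fragment.

```python
import torch
import torch.nn as nn
from transformers import ElectraModel  # assumes the Hugging Face transformers library is available

class ElectraBiLSTMEmitter(nn.Module):
    """Illustrative composition of the encoder, BiLSTM, and emission layers."""
    def __init__(self, checkpoint="google/electra-base-discriminator",
                 lstm_hidden=256, num_labels=9):
        super().__init__()
        self.encoder = ElectraModel.from_pretrained(checkpoint)  # stands in for RS-ELECTRA
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_hidden, num_labels)   # per-token label scores for the CRF

    def forward(self, input_ids, attention_mask):
        token_repr = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        context, _ = self.bilstm(token_repr)
        return self.emission(context)  # emission scores; the CRF layer (Section 2.4) decodes them
```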

2.1. ELECTRA

ELECTRA is an efficient and accurate pre-trained language model that learns deep, bidirectional word vectors by performing a pre-training task, replaced token detection (RTD), on a large amount of text. RTD is a novel pre-training task involving a generator and a discriminator. The generator is a small Transformer network responsible for randomly selecting some words from the input text and replacing them with other words to form a noisy text. The discriminator is a larger Transformer network responsible for determining, from the noisy text, which words have been replaced and which have not. The generator tries to deceive the discriminator, while the discriminator tries to recognize the words that the generator has replaced. Through such adversarial training, the discriminator learns more efficient and accurate word vectors and serves as an encoder for downstream tasks. The specific principle is shown in Figure 2.
The goal of the generator is to predict masked tokens using a softmax layer, while the goal of the discriminator is to determine, using a sigmoid layer, whether each token is real or was generated by the generator G. Using the Transformer network as an encoder, each model transforms the input token sequence into a context-dependent vector representation. ELECTRA uses a masked language model (MLM) to train the generator and then replaces the masked words with the generator's predictions.
The generator maps an input token sequence to a sequence of context vector representations. When the input token at position t is $x_t = [\mathrm{MASK}]$, the generator outputs the probability of generating a specific token $x_t$, computed with a softmax layer as shown in Formula (1):
$p_G\left(x_t \mid \boldsymbol{x}\right) = \dfrac{\exp\left(e(x_t)^{\top} h_G(\boldsymbol{x})_t\right)}{\sum_{x'} \exp\left(e(x')^{\top} h_G(\boldsymbol{x})_t\right)}$
where $e$ denotes the embedding of the token. The discriminator uses a sigmoid output layer to predict whether the token $x_t$ is "real" or is fake data generated by the generator, as shown in Formula (2):
$D\left(\boldsymbol{x}, t\right) = \operatorname{sigmoid}\left(w^{\top} h_D(\boldsymbol{x})_t\right)$
When training the generator to perform the masked-token prediction task, for a particular input sequence $X = \{x_1, x_2, \ldots, x_n\}$, the model first chooses a set of random positions $mask = \{m_1, \ldots, m_k\}$ with $k = 0.15n$, and replaces the tokens at the selected positions with [MASK], defined as $x^{masked} = \operatorname{REPLACE}(x, mask, [\mathrm{MASK}])$.
During training, the masked sequence is passed to the generator, which produces replacement tokens and thereby creates a corrupted instance $x^{corrupt}$; the discriminator is then trained to predict which tokens in $x^{corrupt}$ match the original input $x$. Through this process, the generator learns to maximize the likelihood of the masked tokens, while the discriminator enhances its ability to distinguish between real and generated data.
As shown in Formulas (3) and (4), $L_{MLM}$ is the loss function of the generator, and $L_{Disc}$ is the loss function of the discriminator.
$L_{MLM}\left(\boldsymbol{x}, \theta_G\right) = \mathbb{E}\left[\sum_{i \in mask} -\log p_G\left(x_i \mid \boldsymbol{x}^{masked}\right)\right]$
$L_{Disc}\left(\boldsymbol{x}, \theta_D\right) = \mathbb{E}\left[\sum_{t=1}^{n} -\mathbb{1}\left(x_t^{corrupt} = x_t\right)\log D\left(\boldsymbol{x}^{corrupt}, t\right) - \mathbb{1}\left(x_t^{corrupt} \neq x_t\right)\log\left(1 - D\left(\boldsymbol{x}^{corrupt}, t\right)\right)\right]$
As shown in Formula (5), training is carried out by minimizing the combined loss of the generator and the discriminator.
$Loss = \min_{\theta_G, \theta_D} \sum_{\boldsymbol{x} \in X} L_{MLM}\left(\boldsymbol{x}, \theta_G\right) + \lambda L_{Disc}\left(\boldsymbol{x}, \theta_D\right)$
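As a rough illustration of Formulas (3)–(5), the sketch below combines a masked-language-modeling loss for the generator with a token-level binary loss for the discriminator. The tensor layout, the function name, and the default weighting value are assumptions made for this example, not the authors' pre-training code.

```python
import torch
import torch.nn.functional as F

def rtd_loss(gen_logits, masked_labels, disc_logits, is_replaced, lam=50.0):
    """Combined ELECTRA-style objective: L_MLM + lambda * L_Disc (Formula (5)).

    gen_logits:    (batch, seq_len, vocab) generator scores
    masked_labels: (batch, seq_len) original token ids, with -100 at unmasked positions
    disc_logits:   (batch, seq_len) discriminator scores for "this token was replaced"
    is_replaced:   (batch, seq_len) float targets, 1.0 where the generator replaced the token
    """
    # Generator: cross-entropy over masked positions only (Formula (3))
    l_mlm = F.cross_entropy(gen_logits.transpose(1, 2), masked_labels, ignore_index=-100)
    # Discriminator: binary cross-entropy on every token (Formula (4))
    l_disc = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
    return l_mlm + lam * l_disc
```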

2.2. RS-ELECTRA

To enhance the performance of the ELECTRA model, this paper modifies the normalization method and activation function of the Transformer encoder's feed-forward network (FFN) layer. The original ELECTRA model uses GeLU as the activation function and LayerNorm as the normalization method. GeLU is a smooth approximation of ReLU that allows a non-zero gradient for negative inputs. LayerNorm is a normalization technique that re-centers and re-scales the inputs of each layer based on their mean and standard deviation.
Although GeLU and LayerNorm are widely used in Transformer models, they are not necessarily the best choice for ELECTRA. In fact, recent research [26] has proposed a number of alternatives that may improve the performance and efficiency of ELECTRA; two of them are RMSNorm and SwiGLU.
RMSNorm and SwiGLU improve the performance of Transformer models by normalizing the layer inputs and gating the feed-forward transformation, respectively. In this paper, we incorporate RMSNorm and SwiGLU into the ELECTRA model and propose a new variant, RS-ELECTRA. RS-ELECTRA uses RMSNorm and SwiGLU in the Transformer encoder, which decreases computational complexity and improves the model's generalization and adaptability.
RMSNorm is a normalization method that regularizes the summed inputs to a neuron layer based on their root mean square (RMS), endowing the model with rescaling invariance and implicit learning-rate adaptivity. RMSNorm is computationally simpler and more efficient than LayerNorm because it does not need to compute the mean and standard deviation of the inputs. It is defined in Formula (6):
$\bar{a}_i = \dfrac{a_i}{\operatorname{RMS}(\boldsymbol{a})}\, g_i, \quad \text{where } \operatorname{RMS}(\boldsymbol{a}) = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} a_i^{2}}$
In this formula, $\boldsymbol{a}$ represents the input vector, $\bar{\boldsymbol{a}}$ represents the output vector, $n$ denotes the dimension of the input vector, $a_i$ is the $i$-th element of the input vector, and $g_i$ is a learnable parameter.
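Formula (6) translates almost directly into a small module. The sketch below is a minimal RMSNorm implementation; the small epsilon added for numerical stability is an implementation detail assumed here rather than part of the formula.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square layer normalization (Formula (6)): rescaling only, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.g = nn.Parameter(torch.ones(dim))  # learnable gain g_i
        self.eps = eps

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        rms = a.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return a / rms * self.g
```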
The SwiGLU is a novel activation function that combines the advantages of the Swish [29] activation function and Gated Linear Units (GLU) [30].
The Swish function is defined as shown in Formula (7):
$\operatorname{Swish}(x) = x \cdot \operatorname{sigmoid}(\beta x)$
where $\beta$ is a learnable parameter. Compared to the rectified linear unit (ReLU) [31] activation function, Swish provides a smoother transition near zero, which can lead to better optimization and faster convergence. Swish has been shown to perform better than ReLU in many applications, especially in deep networks.
GLU is a neural network layer composed of a linear transformation and a gating mechanism. It is defined as shown in Formula (8):
$\operatorname{GLU}(x) = \operatorname{sigmoid}\left(W_1 x + b\right) \otimes \left(V x + c\right)$
GLU is similar to Swish in that it combines a linear function with a nonlinear function. In GLU, the linear branch is gated by the sigmoid activation function.
In SwiGLU, the Swish function is used to gate the linear function of GLU. This allows SwiGLU to simultaneously gain the advantages of Swish and GLU while overcoming their respective shortcomings. The principle of SwiGLU is shown in Formula (9).
$\operatorname{SwiGLU}(x) = \operatorname{Swish}\left(W_1 x + b\right) \otimes \left(V x + c\right)$
where $x$ is the input vector, $W_1$ and $V$ are weight matrices, $b$ and $c$ are bias vectors, and $\otimes$ represents element-wise multiplication.
The Swish function is defined as $\operatorname{Swish}(x) = x \cdot \sigma(\beta x)$, where $\sigma$ is the sigmoid function and $\beta$ is a learnable parameter, typically set to 1.
SwiGLU maintains certain linear properties while providing nonlinear expressive capabilities, thereby enhancing the representational power and learning ability of neural networks. In large language models, SwiGLU is particularly suitable for handling complex semantic relationships and long-distance dependencies due to its advantages in nonlinearity, gating characteristics, gradient stability, and learnable parameters.
To replace the GeLU activation function in the Transformer with the SwiGLU activation function, the model's structure must be modified. Specifically, in an FFN layer using SwiGLU, there are usually two linear transformations with the SwiGLU activation function inserted between them. These two linear transformations are defined by the weight matrices $W_1$ and $W_2$ and the bias vectors $b_1$ and $b_2$. The input to the SwiGLU activation function is the output of the first linear transformation, and its output serves as the input to the second linear transformation.
Specifically, if the dimension of the input vector $x$ is $d_{model}$, then the dimension of the weight matrix $W_1$ for the first linear transformation is $d_{model} \times d_{ff}$, where $d_{ff}$ is the dimension of the hidden layer. The output dimension of the SwiGLU activation function remains $d_{ff}$, and this output is then processed by the second linear transformation $W_2$, whose dimension is $d_{ff} \times d_{model}$, returning the output dimension to the original $d_{model}$.
To keep the overall number of parameters unchanged, the size of the hidden layer $d_{ff}$ is usually adjusted when introducing the SwiGLU activation function. The hidden-layer dimension is adjusted as shown in Formula (10):
$d_{ff} = \text{multiple} \times \left\lfloor \dfrac{\frac{2 \tilde{d}_{ff}}{3} + \text{multiple} - 1}{\text{multiple}} \right\rfloor$
where multiple is a fixed value used to adjust the size of the hidden layer, and $\tilde{d}_{ff}$ is the size of the encoder layers and the pooler layer.
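Putting Formulas (9) and (10) together, a feed-forward block with SwiGLU might look like the sketch below. The layer names, default sizes, and rounding helper are assumptions for illustration, and β in Swish is fixed to 1 (i.e., SiLU), as is common in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adjust_hidden_dim(d_ff_tilde: int, multiple: int = 256) -> int:
    """Formula (10): shrink the hidden size to roughly 2/3 and round up to a multiple."""
    return multiple * ((2 * d_ff_tilde // 3 + multiple - 1) // multiple)

class SwiGLUFFN(nn.Module):
    """Feed-forward layer using SwiGLU (Formula (9)) in place of GeLU."""
    def __init__(self, d_model: int = 768, d_ff_tilde: int = 3072, multiple: int = 256):
        super().__init__()
        d_ff = adjust_hidden_dim(d_ff_tilde, multiple)
        self.w1 = nn.Linear(d_model, d_ff)  # gate branch: Swish(W1 x + b)
        self.v = nn.Linear(d_model, d_ff)   # linear branch: V x + c
        self.w2 = nn.Linear(d_ff, d_model)  # second linear transformation back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.v(x))  # Swish with beta = 1 is SiLU
```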
By replacing GeLU and LayerNorm in ELECTRA's Transformer encoder with SwiGLU and RMSNorm, we can expect improved performance on various natural language understanding tasks.

2.3. BiLSTM

BiLSTM captures contextual information and long-distance dependencies in text and avoids vanishing or exploding gradients through its gating mechanism. It processes sequence data with two LSTMs running in opposite directions. By fusing the forward and backward context of each word, BiLSTM obtains a richer and more complete representation of the word. The structure is shown in Figure 3.
The key components of LSTM are the cell state and various gating mechanisms, including the forget gate, input gate, and output gate. As shown in Formulas (11)–(16):
$f_t = \sigma\left(W_f \cdot \left[h_{t-1}, x_t\right] + b_f\right)$
The forget gate determines which information to discard from the cell state. $f_t$ represents the forget gate state at time step $t$, $\sigma$ is the sigmoid function, $W_f$ is the weight of the forget gate, $b_f$ is the bias of the forget gate, $h_{t-1}$ is the previous hidden state, and $x_t$ is the current input.
$i_t = \sigma\left(W_i \cdot \left[h_{t-1}, x_t\right] + b_i\right)$
The input gate determines which new information is stored in the cell state. $i_t$ represents the input gate state at time step $t$, $W_i$ is the weight of the input gate, and $b_i$ is the bias of the input gate.
$\tilde{C}_t = \tanh\left(W_C \cdot \left[h_{t-1}, x_t\right] + b_C\right)$
where $\tilde{C}_t$ is the new candidate value, which may be added to the cell state, $W_C$ represents the corresponding weight, and $b_C$ is the bias.
$o_t = \sigma\left(W_o \cdot \left[h_{t-1}, x_t\right] + b_o\right)$
The output gate determines how much information from the current cell state is output to the hidden state. $o_t$ represents the output gate state at time step $t$, $W_o$ represents the weight of the output gate, and $b_o$ is the bias of the output gate.
$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$
The old cell state is updated by scaling it with the forget gate and then adding the product of the input gate and the new candidate value.
$h_t = o_t \cdot \tanh\left(C_t\right)$
The hidden state at time step $t$, $h_t$, is output based on the output gate and the tanh of the cell state.
This model uses BiLSTM to further encode the output of ELECTRA and obtain a hidden-layer representation for each word. Specifically, given the input text $X = \{x_1, x_2, \ldots, x_n\}$, let $s_i$ be the ELECTRA output vector of the $i$-th word. The model feeds $s_i$ into a forward LSTM to obtain the forward hidden state $h_i^f$ of each word and into a backward LSTM to obtain the backward hidden state $h_i^b$. It then concatenates $h_i^f$ and $h_i^b$ to obtain the bidirectional hidden state $h_i$ of each word, as shown in Formula (17):
$h_i^{f} = \operatorname{LSTM}_f\left(s_i, h_{i-1}^{f}\right), \quad i = 1, 2, \ldots, n$
$h_i^{b} = \operatorname{LSTM}_b\left(s_i, h_{i+1}^{b}\right), \quad i = n, n-1, \ldots, 1$
$h_i = \left[h_i^{f}; h_i^{b}\right], \quad i = 1, 2, \ldots, n$
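Formula (17) corresponds to a standard bidirectional LSTM applied over the ELECTRA outputs; PyTorch concatenates the forward and backward hidden states internally. A minimal sketch, with the batch size chosen arbitrarily and the dimensions assumed to match Section 3.3:

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=768, hidden_size=256,
                 batch_first=True, bidirectional=True)

s = torch.randn(8, 128, 768)  # ELECTRA outputs s_i: (batch, seq_len, 768)
h, _ = bilstm(s)              # h_i = [h_i^f ; h_i^b]: (batch, seq_len, 512)
```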

2.4. CRF

CRF is a statistical modeling method primarily used for predicting the labels of sequence data. The core idea of CRF is that, given a set of input sequences, the model learns to impose certain constraints on the output sequences to ensure that the predicted label sequences are globally optimal. Unlike other sequence labeling models, CRF takes into account the interdependencies between labels: it does not simply choose the best label independently for each position but considers the context of the entire sequence to make decisions.
This model uses a CRF to label the output of the BiLSTM, obtaining the entity type of each word. Specifically, given an input text $X = \{x_1, x_2, \ldots, x_n\}$ and the hidden state $h_i$ of each word obtained from the BiLSTM, the model feeds $h_i$ into a fully connected layer to obtain the score vector $P_i$ for each word, as shown in Formula (18):
$P_i = W h_i + b, \quad i = 1, 2, \ldots, n$
In this context, W and b represent the weight matrix and bias vector of the fully connected layer, respectively.
The scoring function for a label sequence is shown in Formula (19), where $X = \{x_1, x_2, \ldots, x_n\}$ represents the sequence of input tokens, $Y = \{y_1, y_2, \ldots, y_n\}$ represents the predicted labels for the input sequence, and each $y_i$ is a label from the set $\{1, 2, \ldots, K\}$, with $K$ being the number of possible labels.
$S\left(X, Y\right) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$
where $A$ represents the transition score matrix and $A_{i,j}$ represents the score of transitioning from label $i$ to label $j$. The BiLSTM processes the input sequence and provides scores for each possible label at each position in the sequence. Specifically, $P_{i,j}$ is the score of label $j$ for word $i$ as output by the BiLSTM hidden layer. The probability of a label sequence is then calculated as shown in Formula (20):
$P\left(Y \mid X\right) = \dfrac{1}{Z(X)} \exp\left(S\left(X, Y\right)\right)$
where $Z(X)$ is the partition function that normalizes the probabilities.
The log-likelihood of the label sequence used during training is shown in Formula (21):
$\log P\left(y \mid X\right) = S\left(X, y\right) - \log\left(\sum_{\tilde{y} \in Y_X} e^{S\left(X, \tilde{y}\right)}\right)$
where $y$ is the actual label sequence, $\tilde{y}$ ranges over candidate label sequences, and $Y_X$ represents the set of all possible label sequences for $X$.
Finally, the sequence with the highest probability is used as the final predicted label sequence.
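For illustration, the sequence score of Formula (19) and the final highest-probability decoding can be sketched as follows. The emission matrix P and transition matrix A are assumed to be produced by the layers above, and the start/end boundary transitions are omitted for brevity; this is a sketch of the idea, not the authors' CRF implementation.

```python
import torch

def sequence_score(emissions, transitions, tags):
    """Formula (19), boundary terms omitted: sum of A[y_{i-1}, y_i] and P[i, y_i]."""
    score = emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score = score + transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

def viterbi_decode(emissions, transitions):
    """Return the label sequence with the highest score (the final prediction step)."""
    n, k = emissions.shape
    score = emissions[0].clone()  # best score ending in each label at position 0
    backpointers = []
    for i in range(1, n):
        total = score.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

emissions = torch.randn(6, 9)    # placeholder P: 6 tokens, 9 labels
transitions = torch.randn(9, 9)  # placeholder A
print(viterbi_decode(emissions, transitions))
```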

3. Results

This paper conducts experiments on datasets composed of documents related to power safety. It compares the performance of the model proposed in this paper with the BiLSTM-CRF and BERT-BiLSTM-CRF models in the entity recognition task.

3.1. Dataset

This paper employed power safety texts from documents, complemented by manually annotated entity types. Sources encompass “Compilation of National Power Accidents and Power Safety Incidents”, “Power Safety Work Regulations of the State Grid Corporation—Substation Section”, “Power Safety Work Regulations of the State Grid Corporation—Line Section” and “Twenty-Five Key Requirements for Preventing Power Generation Accidents—Case Warning Teaching Material”, as shown in Table 1.
This experiment removed duplicate content and non-electrical text from the documents and then extracted 6282 high-quality samples for training and testing. Based on the features of power safety texts, the entity types were classified into nine categories: equipment, component, operation, status, condition, result, reason, purpose, and other (see Table 2 for details). In total, there were 43,975 entities. The dataset is divided into training, validation, and test sets in an 8:1:1 ratio.
The dataset adopts the Begin-Inside-Outside (BIO) annotation system to label the entity type for each word. For example, for the sentence ‘The high-voltage bushing of the transformer should be checked regularly,’ the entity type annotation is ‘B-Component I-Component I-Component O B-Equipment I-Equipment O B-Operation I-Operation I-Operation’.
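A small helper can illustrate how entity spans are turned into BIO tags. The tokenization and span positions below are hypothetical and follow the English translation of the example sentence; the actual dataset annotation may differ in granularity.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) entity spans into one BIO tag per token (end is exclusive)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["The", "high-voltage", "bushing", "of", "the", "transformer",
          "should", "be", "checked", "regularly"]
spans = [(1, 3, "Component"), (5, 6, "Equipment"), (8, 9, "Operation")]  # hypothetical spans
print(spans_to_bio(tokens, spans))
```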

3.2. Evaluation Standard

This paper uses Precision, Recall, and F1-score as evaluation metrics to assess the recognition performance of RS-ELECTRA-BiLSTM-CRF on power safety NER.
Precision is the accuracy of the model’s predictions. It calculates the proportion of samples that the model correctly predicted as positive out of all the samples it predicted as positive. The formula is (22):
$P = \dfrac{TP}{TP + FP}$
TP is the number of samples correctly predicted as positive, and FP is the number of samples incorrectly predicted as positive.
Recall is the proportion of positive samples that the model captures. The formula is (23):
$R = \dfrac{TP}{TP + FN}$
FN is the number of positive samples incorrectly predicted as negative.
The F1-score is the harmonic mean of precision and recall, aiming to find a balance between the two. The formula is (24):
$F1 = \dfrac{2 \times P \times R}{P + R}$
The higher the F1-score, the better the model’s performance.
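These metrics follow directly from entity-level counts. The tiny sketch below applies Formulas (22)–(24); the counts are placeholders, not values from the paper.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from entity-level counts (Formulas (22)-(24))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf(tp=930, fp=80, fn=70))  # placeholder counts
```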

3.3. Experimental Environment

The experimental environment consists of a 64-bit Windows 11 system, an Intel i5-13490F 3.20 GHz processor (Intel Corporation, Santa Clara, CA, USA), 32 GB of memory, and an NVIDIA GeForce RTX 4060 Ti 16 GB GPU (NVIDIA Corporation, Santa Clara, CA, USA). The experiment used Python 3.10 as the programming language and PyTorch 1.3.12 as the deep learning framework. The training parameters are set as follows. The RS-ELECTRA model uses a 12-layer Transformer structure with 12 attention heads and a hidden dimension of 768. The BiLSTM layer uses a hidden dimension of 256. The CRF layer uses a linear-chain CRF with nine labels (eight named entity categories and an 'other' category). The hyperparameter settings are shown in Table 3.
The model is trained with a learning rate of $2 \times 10^{-5}$, which controls the step size of weight updates, using the Adam optimizer. It processes data in batches of 32 samples, runs for 1000 epochs to ensure thorough learning, applies a dropout rate of 0.1 to prevent overfitting, and handles input sequences up to 128 tokens long, as reflected in the sketch below.
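Under the hyperparameters of Table 3, a training setup could look roughly like the sketch below. The placeholder data, vocabulary size, and stand-in classifier are assumptions made so the fragment runs on its own; the real model would be the RS-ELECTRA-BiLSTM-CRF of Figure 1 trained with the CRF negative log-likelihood.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 64 sequences of 128 token ids, 9 label classes (stand-in for the real dataset)
inputs = torch.randint(0, 21128, (64, 128))   # 21128 is an assumed vocabulary size
labels = torch.randint(0, 9, (64, 128))
loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)  # batch size 32

# Stand-in model: embedding + dropout + per-token classifier in place of RS-ELECTRA-BiLSTM-CRF
model = nn.Sequential(nn.Embedding(21128, 768), nn.Dropout(0.1), nn.Linear(768, 9))
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # Table 3: Adam, learning rate 2e-5
criterion = nn.CrossEntropyLoss()

for epoch in range(1000):  # Table 3: 1000 epochs
    for input_ids, tags in loader:
        optimizer.zero_grad()
        logits = model(input_ids)                       # (batch, seq_len, num_labels)
        loss = criterion(logits.transpose(1, 2), tags)  # token-level stand-in for the CRF loss
        loss.backward()
        optimizer.step()
```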

3.4. Experimental Results

The experimental results are shown in Table 4. The RS-ELECTRA-BiLSTM-CRF model shows excellent performance on the power safety NER task, outperforming the other models on the power safety dataset with a precision of 92%, a recall of 93%, and an F1-score of 93%. In comparison, the BiLSTM-CRF model achieves an F1-score of only 87%, because it struggles to capture semantic connections in the textual context. The RS-ELECTRA and BERT models, through replaced-token detection and masked language modeling, respectively, learn contextual features well, thereby overcoming the effects of polysemous words and noise words and obtaining complete sentence features.
In addition, the BERT-BiLSTM-CRF model achieved a precision of 90%, a recall of 91%, and an F1-score of 90%. The model proposed in this paper improved these by 2.2%, 2.1%, and 3.3%, respectively. This is because the improved ELECTRA model more effectively extracts semantic information and contextual relationships from the text, making it more robust in understanding text.
To further analyze the performance disparities in entity recognition for different models, this paper computed the Precision, Recall, and F1-score for each entity type using the BiLSTM-CRF, BERT-BiLSTM-CRF, and RS-ELECTRA-BiLSTM-CRF models. The results are shown in Table 5 and Figure 4.
Table 5 shows that the model proposed in this paper achieves improvements in the recognition of all entity types. In particular, the "equipment" type achieved an F1-score of 0.96, the "component" type achieved 0.98, and the "operation" type achieved 0.96. Compared with BERT-BiLSTM-CRF, the F1 values of these types improved by 4.3%, 4.2%, and 5.3%, respectively. The model exhibits the best performance on vocabulary from the electrical engineering field. This improvement stems from the ELECTRA model's use of the RTD task, in which the generator attempts to replace words in the input with other words and the discriminator determines whether each word has been replaced. This method helps the model better understand differences between words, especially specialized terms, as it must distinguish original from replaced words.
As can be seen from Figure 4, the precision, recall, and F1 values of the "equipment" type are significantly higher than those of the "status" type. The dataset shows that the "equipment" type comprises 7690 entities, while the "status" type comprises only 2692. This difference indicates that some entity types are more common in the dataset. Generally, high-frequency entity types have higher prediction accuracy because the model has more samples to learn from during training, allowing it to capture the features of these entity types more effectively. In contrast, low-frequency entity types may not have enough samples for the model to learn their features adequately, resulting in lower prediction accuracy. However, for the "result" type, which comprises 5383 entities, performance remains relatively low. This is because the "result" entity type is highly context-dependent and requires understanding specific contexts or sentence structures, making it difficult for the model to recognize accurately without sufficient context. Additionally, the "result" entity type has many nested relationships with other entity types, such as "equipment" and "component" entities, leading to missed predictions.
Compared to the BERT-BiLSTM-CRF model, the model proposed in this paper shows improved accuracy for all entity types. The improvement is particularly significant for low-frequency entities, owing to the replacement of the original GeLU activation function with SwiGLU. The SwiGLU activation function combines a gating mechanism with two parallel branches, enabling it to capture complex patterns in the input data that a single nonlinear function, such as GeLU, might miss. Additionally, the gating mechanism dynamically adjusts the activation based on the input data, enhancing the model's feature extraction capability and allowing it to learn which information is critical and which can be ignored. This enables the model to better adapt to different data distributions, thereby improving the accuracy of recognizing various entity types.
In the model's error analysis, the "equipment" and "component" types are frequently confused with each other. For "operation" identification, the highest error rate involves mistaking it for the "reason" type, because "reason" often includes operational details, which complicates distinguishing the two. The "status" type shows a clear issue with missed identifications, primarily because identifying a piece of equipment's operational status requires a deep understanding of the context. The "condition" type, often lacking context, is frequently misidentified as the "other" type. The "reason" and "result" types, which have a strong logical connection and significant overlap with other types, make it difficult for the model to capture their contextual features, leading to scattered identification errors. The "purpose" type is misidentified as the "result" type because the two occur sequentially, with "purpose" often immediately preceding "result".
In summary, the RS-ELECTRA-BiLSTM-CRF model demonstrates its advantages in entity recognition tasks, particularly for entity categories with abundant electrical engineering terminology. This model aids power maintenance personnel in troubleshooting equipment failures, assessing operational risks, and planning maintenance schedules based on power safety texts. It also provides technical support for constructing knowledge graphs or datasets in the power safety domain.

4. Discussion

This paper combines RMSNorm and SwiGLU with the ELECTRA model to propose the RS-ELECTRA model, which is integrated with BiLSTM and CRF to build a model for named entity recognition in power safety texts. Experiments on the power safety dataset showed that this model improved Precision by 2.2%, Recall by 2.1%, and F1-score by 3.3% compared to the BERT-BiLSTM-CRF model, and improved Precision by 6.9%, Recall by 5.6%, and F1-score by 6.8% compared to the BiLSTM-CRF model. This demonstrates that the proposed method makes progress in NER for power safety texts.
This research has broad applications in the power industry. When processing fault reports, maintenance logs, safety rule documents, accident investigation reports, etc., NER can quickly extract important entities, providing data support for fault analysis, risk assessment, and accident prevention. Additionally, the application of NER can promote the construction of intelligent monitoring systems. By analyzing text data generated by the system in real time, it can promptly identify safety risks and automatically generate warning information, thereby reducing the likelihood of accidents.
There are still many areas for improvement in this research. The power safety dataset is relatively small and only covers a part of the power safety field. In the future, larger and more diverse power safety data from various fields can be considered to improve the model’s generalization ability and adaptability. CRF can establish global constraints between entities and relationships, but it also has some limitations, such as the inability to handle overlapping or nested entities or relationships, and difficulty in extending to multivariable or complex relationships. In the future, more flexible and general annotation systems, such as Span [32] or Joint [33], can be considered as sequence labelers to improve the model’s adaptability and robustness in complex NER scenarios.

Author Contributions

Conceptualization, P.L. and Z.S.; methodology, Z.S.; software, Z.S. and B.Z.; validation, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Inner Mongolia Electric Power Research Institute (Project Number: 0722-2023FE7015NMF).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, J.; Du, N.; Xu, J.; Liu, X.; Song, Y.; Qiu, L.; Zhao, Y.; Sun, M. Application and research of knowledge graph in electric power field. Electr. Power Inf. Commun. Technol. 2020, 18, 60–66. [Google Scholar]
  2. Sharma, A.; Amrita.; Chakraborty, S.; Kumar, S. Named entity recognition in natural language processing: A systematic review. In Proceedings of the Second Doctoral Symposium on Computational Intelligence: DoSCI 2021; Springer: Singapore, 2022; pp. 817–828. [Google Scholar]
  3. Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticæ Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
  4. Li, J.; Sun, A.; Han, J.; Li, C. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 2022, 34, 50–70. [Google Scholar] [CrossRef]
  5. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar]
  6. Sutton, C. An Introduction to Conditional Random Fields. Found. Trends Mach. Learn. 2012, 4, 267–373. [Google Scholar] [CrossRef]
  7. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. arXiv 2016, arXiv:1603.01360. [Google Scholar]
  8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  9. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  10. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2020, arXiv:1909.11942. [Google Scholar]
  11. Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; Wu, H. ERNIE: Enhanced Representation through Knowledge Integration. arXiv 2019, arXiv:1904.09223. [Google Scholar]
  12. Clark, K.; Luong, M.T.; Le, Q.V. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  13. Pakhale, K. Comprehensive Overview of Named Entity Recognition: Models, Domain-Specific Applications and Challenges. arXiv 2023, arXiv:2309.14084. [Google Scholar]
  14. Sun, C.; Tang, M.; Liang, L.; Zou, W. Software Entity Recognition Method Based on BERT Embedding. In Proceedings of the Machine Learning for Cyber Security: Third International Conference, ML4CS 2020, Guangzhou, China, 8–10 October 2020; Chen, X., Yan, H., Yan, Q., Zhang, X., Eds.; Springer: Cham, Switzerland, 2020; pp. 33–47. [Google Scholar] [CrossRef]
  15. Hu, W.; He, L.; Ma, H.; Wang, K.; Xiao, J. KGNER: Improving Chinese Named Entity Recognition by BERT Infused with the Knowledge Graph. Appl. Sci. 2022, 12, 7702. [Google Scholar] [CrossRef]
  16. Yang, R.; Gan, Y.; Zhang, C. Chinese Named Entity Recognition Based on BERT and Lightweight Feature Extraction Model. Information 2022, 13, 515. [Google Scholar] [CrossRef]
  17. He, L.; Zhang, X.; Li, Z.; Xiao, P.; Wei, Z.; Cheng, X.; Qu, S. A Chinese Named Entity Recognition Model of Maintenance Records for Power Primary Equipment Based on Progressive Multitype Feature Fusion. Complexity 2022, 2022, e8114217. [Google Scholar] [CrossRef]
  18. Chen, Y.; Liang, Z.; Tan, Z.; Lin, D. Named Entity Recognition in Power Marketing Domain Based on Whole Word Masking and Dual Feature Extraction. Appl. Sci. 2023, 13, 9338. [Google Scholar] [CrossRef]
  19. Meng, F.; Yang, S.; Wang, J.; Xia, L.; Liu, H. Creating Knowledge Graph of Electric Power Equipment Faults Based on BERT–BiLSTM–CRF Model. J. Electr. Eng. Technol. 2022, 17, 2507–2516. [Google Scholar] [CrossRef]
  20. Fan, X.; Zhang, Q.; Jia, Q.; Lin, J.; Li, C. Research on Named Entity Recognition Method Based on Deep Learning in Electric Power Public Opinion Field. In Proceedings of the 2022 International Conference on Computer Network, Electronic and Automation (ICCNEA), Xi’an, China, 23–25 September 2022; pp. 140–143, ISSN: 2770-7695. [Google Scholar] [CrossRef]
  21. Yang, A.; Xia, Y. Study of agricultural finance policy information extraction based on ELECTRA-BiLSTM-CRF. Appl. Math. Nonlinear Sci. 2023, 8, 2541–2550. [Google Scholar] [CrossRef]
  22. Fu, Y.; Bu, F. Research on Named Entity Recognition Based on ELECTRA and Intelligent Face Image Processing. In Proceedings of the 2021 IEEE International Conference on Emergency Science and Information Technology (ICESIT), Chongqing, China, 22–24 November 2021; pp. 781–786. [Google Scholar] [CrossRef]
  23. Feng, J.; Wang, H.; Peng, L.; Wang, Y.; Song, H.; Guo, H. Chinese Named Entity Recognition Within the Electric Power Domain. In Proceedings of the Emerging Information Security and Applications: 4th International Conference, EISA 2023, Hangzhou, China, 6–7 December 2023; Springer: Singapore, 2024; pp. 133–146, ISSN: 1865-0937. [Google Scholar] [CrossRef]
  24. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  25. Zhang, B.; Sennrich, R. Root Mean Square Layer Normalization. arXiv 2019, arXiv:1910.07467. [Google Scholar]
  26. Shazeer, N. GLU Variants Improve Transformer. arXiv 2020. [Google Scholar] [CrossRef]
  27. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  28. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2023. [Google Scholar] [CrossRef]
  29. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017. [Google Scholar] [CrossRef]
  30. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. arXiv 2017. [Google Scholar] [CrossRef]
  31. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR, Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  32. Eberts, M.; Ulges, A. Span-based Joint Entity and Relation Extraction with Transformer Pre-training. arXiv 2021. [Google Scholar] [CrossRef]
  33. Bekoulis, G.; Deleu, J.; Demeester, T.; Develder, C. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Syst. Appl. 2018, 114, 34–45. [Google Scholar] [CrossRef]
Figure 1. RS-ELECTRA-BiLSTM-CRF model structure diagram.
Figure 2. ELECTRA pre-training process.
Figure 3. Long short-term memory network structure diagram.
Figure 4. Performance comparison for each entity type of different models.
Table 1. The documents of power safety.

Document Name | Number
Compilation of National Power Accidents and Power Safety Incidents | 162,431
Power Safety Work Regulations of the State Grid Corporation—Substation Section | 52,248
Power Safety Work Regulations of the State Grid Corporation—Line Section | 43,881
"Twenty-Five Key Requirements for Preventing Power Generation Accidents"—Case Warning Teaching Material | 110,592
Table 2. The distribution of entity types in the dataset.

Entity | Description | Example | Number
Equipment | Name of power equipment | Transformer | 7690
Component | Constituent part or accessory of power equipment | Bushing | 6152
Operation | Method or step of operating power equipment | Maintenance | 5384
Status | Working state or attribute of power equipment | Energized | 2692
Condition | Prerequisite or limiting condition for operating power equipment | Pressure | 1538
Result | Consequence or impact of operating power equipment | Short Circuit | 5383
Reason | Cause or inducement of power equipment failure or accident | Overload | 2843
Purpose | Purpose or intention of operating power equipment | Protect | 4229
Other | Entities not belonging to the above types | 110 KV | 8064
Table 3. Experimental parameter settings.

Hyperparameter | Setting
learning rate | 2 × 10⁻⁵
optimizer | Adam
batch size | 32
epoch | 1000
dropout | 0.1
max seq length | 128
Table 4. Experimental results of each model.

Model | Precision | Recall | F1-Score
BiLSTM-CRF | 0.86 | 0.88 | 0.87
BERT-BiLSTM-CRF | 0.90 | 0.91 | 0.90
RS-ELECTRA-BiLSTM-CRF | 0.92 | 0.93 | 0.93
Table 5. Experimental results for each entity type of different models.

Entity | BiLSTM-CRF (P / R / F) | BERT-BiLSTM-CRF (P / R / F) | RS-ELECTRA-BiLSTM-CRF (P / R / F)
Equipment | 0.89 / 0.83 / 0.87 | 0.94 / 0.89 / 0.92 | 0.96 / 0.97 / 0.96
Component | 0.90 / 0.88 / 0.89 | 0.94 / 0.95 / 0.94 | 0.98 / 0.99 / 0.98
Operation | 0.88 / 0.88 / 0.88 | 0.93 / 0.90 / 0.91 | 0.96 / 0.96 / 0.96
Status | 0.73 / 0.83 / 0.76 | 0.88 / 0.92 / 0.90 | 0.93 / 0.86 / 0.89
Condition | 0.76 / 0.82 / 0.79 | 0.91 / 0.83 / 0.87 | 0.93 / 0.89 / 0.91
Result | 0.86 / 0.71 / 0.78 | 0.87 / 0.85 / 0.86 | 0.87 / 0.94 / 0.90
Reason | 0.79 / 0.77 / 0.78 | 0.86 / 0.90 / 0.88 | 0.92 / 0.86 / 0.88
Purpose | 0.84 / 0.77 / 0.81 | 0.92 / 0.91 / 0.92 | 0.93 / 0.95 / 0.94
Other | 0.83 / 0.84 / 0.83 | 0.92 / 0.90 / 0.91 | 0.92 / 0.92 / 0.92
