Named Entity Recognition for Few‑Shot Power Dispatch Based on Multi‑Task

: In view of the fact that entity nested and professional terms are difficult to identify in the field of power dispatch, a multi‑task‑based few‑shot named entity recognition model (FSPD‑NER) for power dispatch is proposed. The model consists of four modules: feature enhancement, seed, expansion, and implication. Firstly, the masking strategy of the encoder is improved by adopting whole‑word masking, using a RoBERTa (Robustly Optimized BERT Pretraining Approach) encoder as the embedding layer to obtain the text feature representation, and an IDCNN (Iterated Dilated CNN) module to enhance the feature. Then the text is cut into one Chinese character and two Chi‑ nese characters as a seed set, the score for each seed is calculated, and if the score is greater than the threshold value ω , they are passed to the expansion module as candidate seeds; next, the candidate seeds need to be expanded left and right according to offset γ to obtain the candidate entities; finally, to construct text implication pairs, the input text is used as a premise sentence, the candidate entity is connected with predefined label templates as hypothesis sentences, and the implication pairs are passed to the RoBERTa encoder for the classification task. The focus loss function is used to allevi‑ ate label imbalance during training. The experimental results of the model on the power dispatch dataset show that the precision, recall, and F1 scores of the recognition results in 20‑shot samples are 63.39%, 61.97%, and 62.67%, respectively, which is a significant performance improvement com‑ pared to existing methods.


Introduction
In recent years, China has attached great importance to the development of the electric power business, and through the implementation of a series of policies and measures, the electric power industry has been innovating in the direction of intelligence [1,2] and sustainable development [3], which has become a strong support for China's economic development. Along with the rapid development of information technology and the comprehensive construction of intelligent power grids, a large amount of power dispatch text data has been accumulated, of which about 80% are unstructured and semi-structured data and only about 20% are structured data. In order to improve the efficiency of staff and assist professionals in making decisions on power dispatch, it is necessary to extract useful information from relevant texts and build a power dispatch knowledge graph [4,5] system. Named entity recognition (NER) is a sub-task of information extraction that is widely used in fields such as intelligent question and answer [6], recommendation systems [7], and knowledge graphs [8], and its main task is to identify meaningful entities from relevant texts and classify them. For example, the task of named entity recognition in the field of power dispatch is to identify the relevant entities in the power dispatch data. Entity recognition is a fundamental part of building knowledge graphs, and improving the recognition of entities in power dispatch texts is of great significance for building power dispatch knowledge graphs.
Named entity recognition tasks have the following main challenges in the field of power dispatch: (1) Lack of annotated public datasets. (2) The power dispatch text has strong specialized terms, and it is difficult for the conventional model to identify the entities in it. There is also the problem of nested entities in the text, such as when the entity "移动基站专变" contains two entities ("移动基站专变" and "专变"). Text annotation using sequential annotation requires special processing to recognize nested entities. (3) Conventional tokenizers do not consider the existence of word boundary ambiguity or separators to represent word boundaries in Chinese. In addition, there is no word splitter for power dispatch text splitting, and there is a certain error in the splitting results when using conventional word splitters. (4) When the pre-trained language model trains the Chinese corpus, the character masking strategy is adopted, and the semantic information extracted is only at the character level, which cannot fully capture the contextual semantic information in the text. In addition, using the character masking strategy model may predict the content of the masking position in advance, but it cannot effectively infer professional terms.
In order to solve the above problems, a model based on few-shot power dispatch entity recognition is proposed to improve the recognition of power dispatch entities, and the contributions of this paper are as follows: (1) Pre-processing the unstructured data of power dispatch provided by Guangxi Power Grid, referring to the national standard electrical terminology specification, adopting the span-based approach to standardized text data for entity annotation, and constructing the named entity dataset of power dispatch. (2) Proposing a multi-task-based few-shot named entity recognition model for power dispatch and training it using a dataset based on span representation so that the model can effectively identify nested entities. (3) Improving the dynamic character masking strategy used by the RoBERTa encoder by replacing the WordPiece masking strategy with the whole-word masking strategy. (4) Using the focal loss function [9] (focal loss, FL) to solve the sample size imbalance problem.
Compared with the BERT-CRF, BERT-LSTM-CRF, NNShot, and StructShot models, the experimental results show that the proposed model has better recognition performance.

Related Work
In early entity recognition tasks [10], the rule-based approach achieved high accuracy and low recall because it was relatively easy to implement and did not require training. However, the shortcomings of this method require the construction of specialized domain knowledge and a large amount of human resources, as well as the improvement in generalization capability. Compared with the rule-based approach, the core of the statistical model-based approach is to select a suitable training model for a specific research context, omitting many tedious rule designs and spending less time to train a manually annotated corpus to improve training efficiency. At the same time, the statistical modelbased approach only needs to retrain the model for the domain-specific training set in the face of different domain-specific rules, so this approach has better robustness. With the rapid development of deep learning, the application of deep learning to named entity recognition tasks is gradually becoming mainstream and achieving better results. Among the typical models are: Li [11] built a name entity recognition model of selectable word vectors based on bidirectional long short-term memory (BiLSTM) and conditional random field [12] (CRF) to identify defective texts of power equipment. Wang [13] proposed a character set entity recognition model based on several features. The model combines character embedding, left-neighbor entropy, and lexicality to represent domain features of power dispatch text, using BiLSTM to predict character sequence labels and CRF to optimize the predicted labels. Zheng [14] proposed the Bi-order-Transformer-CRF model, which trains power billing word vectors and designs neural cosine similarity functions to distinguish similar entities and alleviate the problem of entity boundary ambiguity. Liu [15] studied the more widely used entity recognition model (BERT-CRF) and proposed an active learning strategy based on uncertainty that considers both intermediate and input results and alleviates the model's reliance on large amounts of manually labeled data. Meng [16] fused the pre-trained models BERT [17], BiLSTM, and CRF to construct a faulty entity recognition model for electrical equipment. Attention mechanisms [18] have developed rapidly in the field of computer vision, and researchers have used convolutional neural networks (CNNs) for entity recognition tasks to obtain features of high-dimensional data. Zheng [19] combined the attention mechanism with CNNs to obtain the contextual relationships of the electricity billing text and then used BiGRU to extract high-level features and CRF to obtain the output label sequences. Power dispatch text data present many kinds of entities in a small volume, but there is an insufficient corpus set for deep learning model training and data annotation has a high cost, so the above traditional entity recognition model cannot achieve better results.
Recently, few-shot named entity recognition, which aims to identify entities based on a small number of label instances in each category, has received a lot of attention. Cui [20] proposed the TemplateNER model to enumerate all text spans of a text as candidate entity labels by an exhaustive method and construct the corresponding prompt templates [21] to pass into the Seq-to-Seq model [22] for classification. Ma [23] abandoned the template construction process and retained the word prediction paradigm of pre-trained models to predict similarly related tagged words at entity locations while also exploring principled approaches to automatically search for suitable tagged words, reformulating the NER task as a language modeling problem without templates. Wang [24] proposed a seminal span-based prototypical network (SpanProto) that tackles few-shot NER via a two-stage approach, including span extraction and mention classification, in order to make more efficient use of boundary information. Chen [25] designed Self-describing Networks (SDNets), a Seq2Seq generation model that can universally describe mentions using concepts, automatically map novel entity types to concepts, and adaptively recognize entities on demand. However, the above three models are all general-domain-oriented, the datasets used by the models are all in English, and the datasets used for training are based on sequence labeling methods, which cannot effectively identify nested entities.
In summary, the research on named entity recognition of power dispatch text has not been carried out in depth, and the existing dataset of power dispatch is not sufficient to support the effective learning of traditional deep learning models. Therefore, this paper proposes a FSPD-NER model, which uses a span-based representation to represent entities to solve the problem of nested entities and a few-shot approach to train the model to solve the problem of an insufficient training corpus.

Data Acquisition and Preprocessing
The structured and unstructured data were extracted from the power dispatch data provided by Guangxi Power Grid. Before labeling the data, all data were converted to unstructured data, organized, and denoised, and finally, 828 pieces of data were extracted.

Entity Annotation
Referring to the national electric power industry terminology specification dictionary and Guangxi power grid business needs, nine kinds of entity labels were designed: time, voltage level, transmission line, station, organization, equipment, person name, address, and other. Entity annotation was performed by means of the label-studio tool, which labels the start position, end position, and entity type of the entity in the sentence. A labeling example is shown in Figure 1. and other. Entity annotation was performed by means of the label-studio tool, which labels the start position, end position, and entity type of the entity in the sentence. A labeling example is shown in Figure 1.  Table 1, and the distribution of entities is shown in Figure 2.   Table 1, and the distribution of entities is shown in Figure 2.

Method Flow
In view of the lack of public datasets in the field of power dispatch, the lack of a large amount of corpora for training, and the existence of many professional terms and nested entities in the dataset, the traditional model is poor at entity recognition. Therefore, a multi-task-based named few-shot entity recognition method in the field of power dispatch is proposed. The method is mainly divided into five parts: power dispatch corpus construction, construction of seed tags, tag expansion, construction of textual implication pairs, and tag classification. The flow chart of the method is shown in Figure 3, and the model architecture is shown in Figure 4.

Method Flow
In view of the lack of public datasets in the field of power dispatch, the lack of a large amount of corpora for training, and the existence of many professional terms and nested entities in the dataset, the traditional model is poor at entity recognition. Therefore, a multi-task-based named few-shot entity recognition method in the field of power dispatch is proposed. The method is mainly divided into five parts: power dispatch corpus construction, construction of seed tags, tag expansion, construction of textual implication pairs, and tag classification. The flow chart of the method is shown in Figure 3, and the model architecture is shown in Figure 4.

Method Flow
In view of the lack of public datasets in the field of power dispatch, the lack of a large amount of corpora for training, and the existence of many professional terms and nested entities in the dataset, the traditional model is poor at entity recognition. Therefore, a multi-task-based named few-shot entity recognition method in the field of power dispatch is proposed. The method is mainly divided into five parts: power dispatch corpus construction, construction of seed tags, tag expansion, construction of textual implication pairs, and tag classification. The flow chart of the method is shown in Figure 3, and the model architecture is shown in Figure 4.   Firstly, the power dispatch dataset is input into the RoBERTa [26] encoder to obtain a deep bidirectional representation, and the IDCNN module is used to obtain longer-distance contextual and entity boundary information. The parallelism of the GPU is fully utilized to accelerate the training speed. Second, the text is sliced into one character and two characters as seed entity sets, e.g., "1", "1", "0", …, "南津", "津站", and a candidate seed for each seed score that is greater than the threshold ω is calculated. The candidate seeds are expanded to the left or to the right to obtain the candidate entities. Finally, the candidate entities are spliced with the predefined label templates as the hypothesis of the implication task, and the input text is used as the premise of the implication task to obtain a textual implication pair. The textual implication pairs are input into RoBERTa for training and classification.

RoBERTa and Whole-Word Masking
RoBERTa is an extension and improvement of BERT. The model builds on BERT by replacing the static masking strategy with a dynamic masking strategy (Masked Language Modeling, MLM) that randomly masks each training to improve the adaptability of the model. Meanwhile, RoBERTa uses larger datasets, larger batch sizes, longer pre-trained times, longer input sequences, and more data augmentation techniques to improve the model's capabilities. Liu [26] found during their iterative experiments on the RoBERTa model that adding the Next Sentence Prediction (NSP) task did not improve the performance of the model, so the NSP task was removed. After the above improvements, RoB-ERTa obtained a more stable and robust performance than BERT.
For text sequences, after encoding by the encoder, the output Embedding of each word in the set consists of three parts: Token Embedding, Segment Embedding, and Position Embedding, as shown in Figure 5. The sequence vector is input into the bidirectional transformer for feature extraction, and finally, the sequence vector containing rich semantic features is obtained. Firstly, the power dispatch dataset is input into the RoBERTa [26] encoder to obtain a deep bidirectional representation, and the IDCNN module is used to obtain longerdistance contextual and entity boundary information. The parallelism of the GPU is fully utilized to accelerate the training speed. Second, the text is sliced into one character and two characters as seed entity sets, e.g., "1", "1", "0", …, "南津", "津站", and a candidate seed for each seed score that is greater than the threshold ω is calculated. The candidate seeds are expanded to the left or to the right to obtain the candidate entities. Finally, the candidate entities are spliced with the predefined label templates as the hypothesis of the implication task, and the input text is used as the premise of the implication task to obtain a textual implication pair. The textual implication pairs are input into RoBERTa for training and classification.

RoBERTa and Whole-Word Masking
RoBERTa is an extension and improvement of BERT. The model builds on BERT by replacing the static masking strategy with a dynamic masking strategy (Masked Language Modeling, MLM) that randomly masks each training to improve the adaptability of the model. Meanwhile, RoBERTa uses larger datasets, larger batch sizes, longer pre-trained times, longer input sequences, and more data augmentation techniques to improve the model's capabilities. Liu [26] found during their iterative experiments on the RoBERTa model that adding the Next Sentence Prediction (NSP) task did not improve the performance of the model, so the NSP task was removed. After the above improvements, RoBERTa obtained a more stable and robust performance than BERT.
For text sequences, after encoding by the encoder, the output Embedding of each word in the set consists of three parts: Token Embedding, Segment Embedding, and Position Embedding, as shown in Figure 5. The sequence vector is input into the bidirectional transformer for feature extraction, and finally, the sequence vector containing rich semantic features is obtained. The RoBERTa model uses the dynamic masking strategy to train the text and extract the bidirectional representation of depth. This strategy dynamically and randomly masks part of the tokens, allowing the pre-trained model to infer the contents of the masked position, and the position of the MASK of the model is different in each round. For example, in the sentence "10 kV 怀集线", the masked sentence in the first round of training is "10 kV 怀集[MASK]", the second round is "10 kV[MASK]集线", and so on. However, in order to improve the reasoning ability of the power dispatch dataset, the character masking strategy is changed to the whole-word masking strategy.
Unlike the whole-word masking proposed by Cui [27], this paper incorporates proper nouns from the grid dictionary before using the word splitter to make the model splitting results more accurate; then multiple consecutive [MASK] marks are used to mask the complete words, and the model predicts the masked words, which makes the model obtain more feature information and alleviates the semantic incompleteness when making Chinese predictions due to bias, as shown in Figure 6. An example of the improved masking of text is shown in Table 2.  The RoBERTa model uses the dynamic masking strategy to train the text and extract the bidirectional representation of depth. This strategy dynamically and randomly masks part of the tokens, allowing the pre-trained model to infer the contents of the masked position, and the position of the MASK of the model is different in each round. For example, in the sentence "10 kV怀集线", the masked sentence in the first round of training is "10 kV怀集[MASK]", the second round is "10 kV[MASK]集线", and so on. However, in order to improve the reasoning ability of the power dispatch dataset, the character masking strategy is changed to the whole-word masking strategy.
Unlike the whole-word masking proposed by Cui [27], this paper incorporates proper nouns from the grid dictionary before using the word splitter to make the model splitting results more accurate; then multiple consecutive [MASK] marks are used to mask the complete words, and the model predicts the masked words, which makes the model obtain more feature information and alleviates the semantic incompleteness when making Chinese predictions due to bias, as shown in Figure 6. An example of the improved masking of text is shown in Table 2. The RoBERTa model uses the dynamic masking strategy to train the text and extract the bidirectional representation of depth. This strategy dynamically and randomly masks part of the tokens, allowing the pre-trained model to infer the contents of the masked position, and the position of the MASK of the model is different in each round. For example, in the sentence "10 kV 怀集线", the masked sentence in the first round of training is "10 kV 怀集[MASK]", the second round is "10 kV[MASK]集线", and so on. However, in order to improve the reasoning ability of the power dispatch dataset, the character masking strategy is changed to the whole-word masking strategy.
Unlike the whole-word masking proposed by Cui [27], this paper incorporates proper nouns from the grid dictionary before using the word splitter to make the model splitting results more accurate; then multiple consecutive [MASK] marks are used to mask the complete words, and the model predicts the masked words, which makes the model obtain more feature information and alleviates the semantic incompleteness when making Chinese predictions due to bias, as shown in Figure 6. An example of the improved masking of text is shown in Table 2.

IDCNN Module
When convolutional neural networks (CNNs) are applied to process text, after the convolution operation, the last layer of the network may only acquire a small portion of the information from the original input data. In order to obtain more contextual information, more convolutional layers are added, and the network becomes complex and prone to overfitting. To solve this problem, the Dropout mechanism is introduced, but it also introduces more hyper-parameters, making the model large, redundant, and difficult to train.
To solve the above problem, Yu [28] proposed Dilation Convolution, which expands the perceptual field of a convolutional neural network and reduces information loss by increasing the spacing of convolutional kernels. Dilation Convolution increases the expansion width on top of the original convolutional neural network. During the convolution operation, the size of the convolution kernel does not change, and the convolution kernel skips the data in the middle of the expansion width. In this way, a convolutional kernel of the same size can acquire wider input matrix data, thus increasing the perceptual field of view of the convolutional kernel. The schematic diagram of the inflated convolution is shown in Figure 7. Figure 7a indicates the normal convolution operation with a convolution kernel size of 3 × 3; Figure 7b indicates the convolution with an expansion width of 2, and the perceptual field of view increases to 7 × 7, skipping the center adjacent nodes and causing two voids to appear, which capture the nodes adjacent to the center directly; Figure 7c shows an expansion width of 4, and the perceptual field of view expands to 15 × 15, causing six voids to appear, which capture a larger range of node information. The expansion convolution maximizes the effectiveness and accuracy of the model.

IDCNN Module
When convolutional neural networks (CNNs) are applied to process text, after the convolution operation, the last layer of the network may only acquire a small portion of the information from the original input data. In order to obtain more contextual information, more convolutional layers are added, and the network becomes complex and prone to overfitting. To solve this problem, the Dropout mechanism is introduced, but it also introduces more hyper-parameters, making the model large, redundant, and difficult to train.
To solve the above problem, Yu [28] proposed Dilation Convolution, which expands the perceptual field of a convolutional neural network and reduces information loss by increasing the spacing of convolutional kernels. Dilation Convolution increases the expansion width on top of the original convolutional neural network. During the convolution operation, the size of the convolution kernel does not change, and the convolution kernel skips the data in the middle of the expansion width. In this way, a convolutional kernel of the same size can acquire wider input matrix data, thus increasing the perceptual field of view of the convolutional kernel. The schematic diagram of the inflated convolution is shown in Figure 7. Figure 7a indicates the normal convolution operation with a convolution kernel size of 3 × 3; Figure 7b indicates the convolution with an expansion width of 2, and the perceptual field of view increases to 7 × 7, skipping the center adjacent nodes and causing two voids to appear, which capture the nodes adjacent to the center directly; Figure 7c shows an expansion width of 4, and the perceptual field of view expands to 15 × 15, causing six voids to appear, which capture a larger range of node information. The expansion convolution maximizes the effectiveness and accuracy of the model.  The IDCNN-specific implementation process is as follows: first of all, the input width Ln of the nth layer is calculated as shown in Equation (1); with the increase in the number of layers n, the length of the covered sequence increases exponentially, so as to cover a long sequence of text. The IDCNN-specific implementation process is as follows: first of all, the input width L n of the nth layer is calculated as shown in Equation (1); with the increase in the number of layers n, the length of the covered sequence increases exponentially, so as to cover a long sequence of text.
With c i as the input representation of RoBERTa, the output of the next layer is obtained after the inflated convolution of the input of the previous layer. As the number of Electronics 2023, 12, 3476 9 of 15 layers of the inflated convolution increases, the area of the perceptual field of view of the convolution kernel also improves. The calculation process for each layer of convolution is shown in Equation (2), where D represents the nth layer of convolution with input width L n and σ is the ReLU activation function.
Multiple inflated convolutional layers are stacked to form an inflated convolutional block, and B() represents the convolutional block, for which k iterations are performed to obtain c k i , as shown in Equation (3).
The Dropout mechanism is introduced to improve model generalization by randomly discarding the output of iteratively inflated convolutional blocks at a certain rate. p k i is the final model probability output, as shown in Equation (4).

Seed Module
Given an input text X = {x 1 , x 2 , x 3 , …, x n } of n characters, the input text is sliced into a set S = {s 1 , s 2 , s 3 , …, s 2n−1 } of one character and two characters, where s i = (l i , r i ) and l i and r i denote the left boundary (start position) and right boundary (end position) of the seed, respectively.
The purpose of the seed task is to find one character or two characters that overlap with the full entity and have the potential to expand into the full entity. The text is passed into the RoBERTa encoder to obtain the representation v. For the seed s i = (l i , r i ), the fraction calculation p seed i procedure is as follows: where v p i represents performing mean pooling on all representations of the seed span interval, fusing it with the output of [CLS] as the representation of the seed v seed i . Finally, the seed representation is passed into the multilayer perceptron, and the final seed fraction p seed i is obtained. According to the set seed score threshold ω, a seed score greater than the threshold ω is considered a candidate seed.

Expansion Module
With the help of the regression task, the candidate seeds are expanded to obtain the candidate entities. The set candidate seed s i = (l i , r i ) of the left boundary l i and the right boundary r i can be shifted at most γ to obtain the longest entity as 2 + 2γ. The value of the seed entity after shifting to the left is w l i , the value after shifting to the right is w r i , and the span of the expanded entity is w i . The calculation process is as follows: Next, the entity span of the above expansion is calculated to determine the candidate entities. The left and right boundary offsets o i are calculated as follows: where o i ∈ R 2 ; its first element is denoted as o l i , which represents the final left offset of the entity span; and the second element is denoted as o r i , which represents the final right offset of the entity span, i ∈ [−γ, γ]. The left and right boundaries of the final entity can be determined, and the calculation process is as follows: During the calculation of the result, if o l i > o r i is considered invalid, it is chosen to be discarded.

Implication Module
To construct the required implication pairs for the implication task, candidate entities and label templates are spliced as hypothesis sentences, and input text is used as premise sentences to form complete implication pairs. For the ith candidate entity span c i , the implication pair is of the form (X, P j i ), where P j i = {c i is a/an e i }, e i ∈ E, E is the set of label templates, and X is the input text.
The implication pairs are fed to the shared RoBERTa encoder for training, and the v [CLS] P j i of the encoder output is obtained. The result of performing text classification is: where the implication label can be defined as follows:

Focal Loss
When building the Guangxi power dispatch data corpus, it is found that the number of various types of entities is unevenly distributed according to the statistical results, which leads to an uneven distribution of loss functions during model training and prefers labeled entities with a larger number of samples, making the label recognition of entities with a smaller number of samples less effective. To alleviate this problem, this paper uses the focal loss function to reduce the weight of the samples that are easy to classify and increase the weight of the samples that are difficult to classify. The focal loss function improves the weight factor in the cross-entropy loss function by introducing a focal factor α, which is used to adjust the weights of easy-to-classify samples and hard-to-classify samples. When the focus factor is larger, the model will pay more attention to the hard-toclassify samples, thus improving the classification precision of the model, which is calculated as in Equation (18).
where p t represents the probability that the label is correctly recognized; the larger the p t , the higher the confidence of recognition, and the easier the label is recognized. (1 − p t ) τ is the modulation factor, τ ≥ 0. The focus factor α ∈ [0, 1] is used to adjust the weights of the variation samples and balance the distribution of the loss function.

Results
The experimental environment uses the Pytorch framework, CUDA version 11.1, an Ubuntu system, and an NVIDIA RTX3090 (24G) graphics card. The threshold ω size of the seed score is 0.6, and the entity offset γ is set to 5. The AdamW optimization algorithm is used, and the learning rate (lr) is 3 × 10 −5 . The Dropout mechanism is introduced, and the value is set to 0.5. The batch_size parameter has a value of 4, and the number of iterations (epochs) is 30. One evaluation is performed at the end of each iteration, and the model with the highest F1 score is saved. The dataset is constructed using the named entity recognition corpus for power dispatch included in Section 3 of this paper, and the train and eval sets are set up with three different sample sizes for the comparison experiments (5-shot, 10-shot, and 20-shot sample sizes). The inference phase is validated using the test set, which contains 287 pieces of data. Table 3 depicts other parameter settings: sampling_processes represents the number of threads that process data, with the size of the parameter set depending on the individual's computer; eval_batch_size specifies the batch size of data in the verification phase; and lr_warmup is the ratio of total training iterations to warm-up in a linear increase/decrease program, used for gradient clipping to prevent gradient explosion. The setting of the values of the above parameters is determined by repeated experiments.

Evaluation Indicators
The precision (P), recall (R), and F1 scores are used as evaluation indices of model performance, and the specific formula calculation process is as follows: In the above equation, TruePositive denotes the number of entities correctly predicted by the model, that is, the number of samples that the model predicts as positive cases and are actually positive cases; PredictPositive denotes the total number of entities identified by the model, including correctly predicted entities and incorrectly predicted entities; Actual-Positive denotes the total number of entities present in the dataset, that is, the number of real entities.

Results and Analysis
Three groups of comparison experiments are set up to evaluate the effectiveness of the model. The first group evaluates the performance analysis of different masking strategies; the second group includes the experiments comparing the model of this paper with other models; and the third group includes the ablation experiments on the model.
(1) Performance analysis of different masking strategies In order to verify that the whole-word masking strategy can effectively improve the recognition effect of the model, the FSPD-NER model using WordPiece masking and wholeword masking is compared and experimented with, and the results are shown in Table 4. From the above table, it can be seen that when the FSPD-NER model adopts the Word-Piece masking strategy, compared with the BERT pre-trained model, the RoBERTa pretrained model improves the precision, recall, and F1 scores by 1.34%, 2.90%, and 2.22%, respectively, under 20 samples. This is because the RoBERTa model uses a larger corpus for training to capture the contextual relevance of text, and the model recognition effect is improved. The FSPD-NER model adopts the RoBERTa pre-trained model; compared with the use of the WordPiece masking strategy, the precision, recall, and F1 scores of the whole-word masking strategy increase by 1.11%, 5.26%, and 3.30%, respectively, under 20 samples, which is a comparison of IDs 2 and 4. To prove that using whole-word masking can effectively improve the performance of the model, the FSPD-NER model in the following text uniformly uses the RoBERTa pre-trained model and the whole-word masking strategy.
(2) Comparative analysis of the performance of different models In order to verify the effectiveness of the model proposed in this paper for named entity recognition in few-shot power dispatch data, four different models were experimentally set up for comparison. The results show that the FSPD-NER model proposed in this paper exhibits high recognition precision in the few-shot case. The specific comparison results are shown in Table 5.  Table 5 shows the experimental results of different models on the power dispatch dataset. BERT-CRF utilizes the BERT pre-trained model as the embedding layer to fully consider the location information and contextual semantic information of characters and learns the annotation constraint rules and adjacent label information from CRF to obtain a globally optimal annotation sequence to alleviate the annotation bias problem. The F1 scores of the model are 20.14%, 23.94%, and 33.90% under the conditions of 5-shot, 10-shot, and 20-shot samples, respectively. BERT-LSTM-CRF is based on BERT-CRF and incorporates LSTM to consider time series information to alleviate the information-forgetting problem. However, due to the small number of training samples and the fact that the training corpus in the pre-trained model is relatively extensive, it results in the model learning more irrelevant information and introducing more noise during the training process, which makes the recognition of the model less effective, with F1 scores of 15.91%, 24.10%, and 41.70% under the conditions of 5-shot, 10-shot, and 20-shot sample sizes, respectively. The experimental results show that the traditional named entity recognition model does not perform well in corpora with a small sample size. The NNShot [29] model mainly obtains the contextual vector representation of each word in the sentence, calculates the word similarity in the vector space using the nearest-neighbor principle, and selects the nearest word category for labeling. The backbone of the StructShot [29] model is based on NNShot, and the Viterbi algorithm is used to decode the prediction. These two models use corpus sets from other domains to train the models and only use the power dispatch dataset for validation in the prediction phase. The models incorporate more noise, and the obtained semantic information contains more error information. In addition, the corpus of the models is labeled with sequences, which cannot alleviate the entity nested problem and makes the recognition performance of the models poor. The F1 scores of NNShot were 21.24%, 25.23%, and 23.69% for the conditions of 5-shot, 10-shot, and 20-shot sample sizes, respectively; the F1 scores of StructShot were 26.71%, 31.21%, and 32.60%.
The FSPD-NER model includes data enhancement, seed, expand, and entailment modules. This method reconstructs the named entity recognition task into the classification task of the language model and uses the power dispatch dataset for training to reduce the introduction of noise. In addition, using the whole-word masking strategy, the model can learn the contextual semantic information of the whole word and improve its reasoning ability. At the same time, using the span method to label the corpus can solve the problem that the above model cannot make full use of the boundary information. Under 5-shot, 10-shot, and 20-shot sample sizes, the recognition performance was significantly improved compared with other models, with F1 values of 52.23%, 55.94%, and 62.67%, respectively.

(3) Ablation experiments
To verify the effectiveness of the different modules of the FSPD-NER model proposed in this paper, ablation experiments were performed on the model in the 20-shot sample. Table 6 shows the results of the precision, recall, and F1 scores of the model. The results of the ablation experiments show that the FSPD-NER model adopts the WordPiece masking strategy, and the F1 score is reduced by 3.30%, which proves that the whole-word masking strategy can effectively improve the learning ability of the model and facilitate the model's ability to extract more semantic information. The FSPD-NER model removes the IDCNN module, and the F1 score decreases by 1.41%, indicating that fusing IDCNN modules can extract more contextual semantic information. The FSPD-NER model replaces the focal loss function with the cross-entropy loss function, and the F1 score is reduced by 1.11%, indicating that the focal loss function can alleviate the sample imbalance problem. In short, each part of the model is indispensable.

Conclusions
In view of the lack of labeled public datasets, nested entities, and many professional terms in the data of power dispatch, this paper proposes a few-shot power dispatch named entity recognition (FSPD-NER) model based on multi-tasks. This method uses the RoBERTa pre-trained model as the embedding layer, fully considering the position information and contextual semantic information of characters, and improves the masking strategy of the pre-trained model based on the characteristics of Chinese semantics to enhance the inference ability of the model and further extract the semantic information of Chinese text using the IDCNN module, allowing the model to obtain deeper semantic information. The seed module divides the text into one Chinese character and two Chinese characters and selects seeds that overlap with entities as candidate seeds; the expand stage extends the candidate seeds to the left and right directions to obtain the candidate entities. The implication module mainly constructs implication pairs to achieve text classification. The focus loss function is used in model training to alleviate the sample imbalance problem. The unstructured data provided by Guangxi Power Grid are labeled based on the spanwise approach, and the model is experimentally analyzed on this dataset. The experimental results show that the model achieves better performance compared with the benchmark model, with precision, recall, and F1 scores of 63.39%, 61.97%, and 62.67% in the 20-shot sample experiment.
In future work, we will focus on exploring the feature fusion method applicable to the field of power dispatching so as to improve the effect of named entity recognition. In addition, the correlation between entities and entity relationship extraction will be studied to better mine the available information in the field of power dispatching. Eventually, we plan to apply our model to other domains to verify its generalization ability.