applied sciences

: Food safety is closely related to human health. Therefore, named entity recognition tech‑ nology is used to extract named entities related to food safety, and building a regulatory knowledge graph in the field of food safety can help relevant authorities to regulate food safety issues and mit‑ igate the hazards caused by food safety problems. However, there is no publicly available named entity recognition dataset in the food safety domain. In contrast, the non‑standardized Chinese short texts generated from user comments on the web contain rich implicit information that can help iden‑ tify named entities in specific domains (e.g., food safety domain) where the corpus is scarce. There‑ fore, in this paper, named entities related to food safety are extracted from these unstandardized texts on the web. However, the existing Chinese named entity identification methods are mainly for standardized texts. Meanwhile, these unstandardized texts have the following problems: (1) their corpus size is small; (2) there are various new and wrong words and noise; (3) and they do not fol‑ low strict syntactic rules. These problems make the recognition of Chinese named entities for online texts more challenging. Therefore, this paper proposes the ERNIE‑Adv‑BiLSTM‑Att‑CRF model to improve the recognition of food safety domain entities in unstandardized texts. Specifically, adver‑ sarial training is added to the model training as a regularization method to alleviate the influence of noise on the model, while self‑attention is added to the BiLSTM‑CRF model to capture features that significant impact entity classification and improve the accuracy of entity classification. This paper conducts experiments on the public dataset Weibo NER and the self‑built food domain dataset Food. The experimental results show that our model achieves a SOTA performance of 72.64% and a good performance of 69.68% for F1 values on the public and self‑built datasets, respectively. The validity and reasonableness of our model are verified. In addition, the paper further analyses the impact of various components and settings on the model. The study has practical implications in the field of food safety.


Introduction
Food is the material basis for human survival, and food safety is closely related to human health. Especially in recent years, food safety problems (such as dyed steamed buns, expired meat, tainted bean sprouts, etc.) have occurred frequently. Some food safety incidents (e.g., the "melamine" milk powder, the "poisoned ginger", and the "smelly feet salt") were not brought to the attention of the public until they were exposed on the Internet and finally resolved. The issue of food safety has attracted a lot of public attention and has been a hot topic in the society. Food safety is not only related to people's health and safety but also has a significant impact on social stability and the credibility of the government. Although the state from the legislation to the supervision of all levels attach great importance to this topic, food safety issues such as excessive pesticide residues, genetically modified issues with melamine, Sudanese red pigment, "lean meat essence", "gutter oil", 1.
To assist the relevant authorities in making rapid decisions and risk warnings; 2.
To inform and alert the relevant authorities to take timely measures to reduce the probability of diseases caused by food safety problems; 3.
To help the general public obtain food safety information from a large amount of cluttered and redundant data and to take suitable precautions in advance.
Named Entity Recognition (NER) technology can quickly and accurately acquire implicit information related to food safety in internet users' comments. It is used as data support for comprehensive research and judgment, intelligent decision-making, and dynamic early warning of food safety risk situations.
NER aims to identify predefined semantic types (e.g., names of people, places, organizations, and domain proper names) from text [5]. NER is one of the most fundamental and essential aspects of natural language processing (NLP), which has a wide range of applications in many scenarios, such as information extraction, question and answer systems, and machine translation. Named entities (NEs) are classified into generic classes (e.g., names of people, places) and domain-specific classes (e.g., proteins, drugs, diseases). The NER task is usually regarded as a sequence annotation task and is solved by statistical or neural network approaches [6].
With the rapid development of neural networks, deep learning-based methods [7] are widely used in NER tasks [8][9][10][11] and achieve state-of-the-art results. Deep learningbased methods do not rely on the manual construction of features and can automatically obtain models by training large amounts of data. Among them, based on the excellent sequence-modeling ability of one-way Long-Short Term Memory (LSTM) models, many methods use LSTM-conditional random field (CRF) as the main framework of NER tasks and the fusion of various relevant features on this basis. BiLSTM-CRF [8] is the most common method. In [12,13], state-of-the-art performance was achieved with this method as the main framework. In the low-resource domain, to improve entity recognition for the NER task, adversarial training is added to make the model more robust and improve the generalization ability during the training process [14,15].
Compared with English, there are no natural boundaries in Chinese. The blurred boundaries of unitary vocabulary, complex entity structures, and diverse expressions make Chinese NER more difficult. In the field of Chinese NER, a distinction between words and characters exists in Chinese. A character-based tagging strategy is generally used to tag named entities [16,17]. Compared with word-based and character-word union-based methods, the character-based method can avoid the problem of word separation error transmission and generally have superior performance [18,19]. However, the characterbased Chinese NER methods have a general problem: they cannot characterize the polysemy of words. It means that the same word expresses different meanings in different Appl. Sci. 2023, 13, 2849 scenarios, but the representation of the word vector is the same, which is not consistent with the objective fact.
In recent years, pre-trained language models have become a critical fundamental technology in NLP, especially the Bidirectional Encoder Representations from Transformers (BERT) [20] model proposed by Google. BERT utilizes the multilayer self-attention bidirectional modeling capability of Transformer [21] to achieve state-of-the-art (SOTA) results in several NLP tasks by predicting masked characters. Pre-trained Language Models can be used for character representation. This is because better character representations not only contain rich syntactic and semantic information but also allow the modeling of polysemous words. However, the modeling objects of BERT are mainly focused on original language signals and less on the use of semantic knowledge units for modeling. This problem is particularly evident in Chinese. For example, BERT is modeled by predicting Chinese characters, and it is difficult for the model to learn the intact semantic representation from the larger semantic units. Therefore, Baidu improved the Chinese direction of BERT by proposing the Enhanced Representation from Knowledge Integration model (ERNIE), a semantic representation model for knowledge enhancement [22]. It empowers the model to learn the hidden knowledge implied in a large amount of text. As shown in Figure 1, the ERNIE enables the model to learn the semantic representation of complete concepts by masking semantic units such as words and entities. Compared to BERT, ERNIE directly models priori semantic knowledge units to enhance the semantic representation of the model. Appl. Sci. 2023, 13, 2849 3 of 16 character-based Chinese NER methods have a general problem: they cannot characterize the polysemy of words. It means that the same word expresses different meanings in different scenarios, but the representation of the word vector is the same, which is not consistent with the objective fact. In recent years, pre-trained language models have become a critical fundamental technology in NLP, especially the Bidirectional Encoder Representations from Transformers (BERT) [20] model proposed by Google. BERT utilizes the multilayer self-attention bidirectional modeling capability of Transformer [21] to achieve state-of-the-art (SOTA) results in several NLP tasks by predicting masked characters. Pre-trained Language Models can be used for character representation. This is because better character representations not only contain rich syntactic and semantic information but also allow the modeling of polysemous words. However, the modeling objects of BERT are mainly focused on original language signals and less on the use of semantic knowledge units for modeling. This problem is particularly evident in Chinese. For example, BERT is modeled by predicting Chinese characters, and it is difficult for the model to learn the intact semantic representation from the larger semantic units. Therefore, Baidu improved the Chinese direction of BERT by proposing the Enhanced Representation from Knowledge Integration model (ERNIE), a semantic representation model for knowledge enhancement [22]. It empowers the model to learn the hidden knowledge implied in a large amount of text. As shown in Figure 1, the ERNIE enables the model to learn the semantic representation of complete concepts by masking semantic units such as words and entities. Compared to BERT, ERNIE directly models priori semantic knowledge units to enhance the semantic representation of the model. Comparison of learning methods between BERT and ERNIE model. ERNIE learns the semantic representation of complete concepts by masking Chinese characters and entities ("苹果" and "农药味"), while BERT learns the semantic representation by masking individual Chinese characters ("果" and "农").
The NER mission has penetrated various areas, such as healthcare, finance, and law. However, there is no publicly available dataset in the food safety domain. Therefore, in domains where corpus resources are scarce, few-shot learning has emerged as an effective entity extraction method. [23] constructed a food recognition classifier based on few-shot learning by limiting the training samples. However, compared to standard news-like canonical texts, named entity recognition based on the content of Internet users' comments has the following main challenges: • The corpus is small and contains many types of entities; • Users write comments as they wish, with frequent new words and mistakes, containing noise such as Internet phrases and emoticons; • The text does not follow strict syntactic rules [24]. These challenges make it difficult for few-shot learning to achieve good entity recognition results. Therefore, to obtain as many practical features as possible from the noisy mixed small-scale corpus to improve the recognition performance of food-safe named entities in unstandardized Chinese texts, this paper proposes the ERNIE-Adv-BiLSTM-Att-CRF model. Comparison of learning methods between BERT and ERNIE model. ERNIE learns the semantic representation of complete concepts by masking Chinese characters and entities (" 苹果" and " 农药味"), while BERT learns the semantic representation by masking individual Chinese characters (" 果" and " 农").
The NER mission has penetrated various areas, such as healthcare, finance, and law. However, there is no publicly available dataset in the food safety domain. Therefore, in domains where corpus resources are scarce, few-shot learning has emerged as an effective entity extraction method. [23] constructed a food recognition classifier based on fewshot learning by limiting the training samples. However, compared to standard news-like canonical texts, named entity recognition based on the content of Internet users' comments has the following main challenges:

•
The corpus is small and contains many types of entities; • Users write comments as they wish, with frequent new words and mistakes, containing noise such as Internet phrases and emoticons; • The text does not follow strict syntactic rules [24].
These challenges make it difficult for few-shot learning to achieve good entity recognition results. Therefore, to obtain as many practical features as possible from the noisy mixed small-scale corpus to improve the recognition performance of food-safe named entities in unstandardized Chinese texts, this paper proposes the ERNIE-Adv-BiLSTM-Att-CRF model.
Our work has the following contributions:

1.
We mine food safety-related information in unstandardized texts by building a food safety domain dataset to train a food safety domain NER model. Mining food safety-related information in unstandardized texts helps build a regulatory knowledge graph in the food safety domain; 2.
We use ERNIE, a model pre-trained by a large-scale corpus, to generate context-based word vectors so that the model learns the semantic representation of complete concepts and enhances the semantic representation of the model. We are adding selfattention to the BiLSTM-CRF sequence annotation model to capture the most critical semantic information in sentences and improve the entity recognition performance of unstandardized texts; 3.
We added adversarial training as a regularization method to the model training to make the model more robust and improve the model's generalization ability.
In Section 4.2, the ERNIE-Adv-BiLSTM-Att-CRF model will improve NER tasks' performance with the unstandardized Chinese. As can be seen from the experimental results, our model outperforms other SOTA models on the public dataset Weibo NER with an F1 value of 72.64% and outperforms other baseline models on a self-built food domain dataset with an F1 value of 69.68%.

Related Work
A deep learning-based approach is used for unstandardized Chinese NER in our work. Collobert et al. [8] proposed a CNN-CRF model based on a convolutional neural network (CNN), which obtained competitive performance compared to various best statistical models. Huang et al. [9] proposed a BiLSTM-CRF model to solve the text sequence labeling problem as a benchmark model. Ma et al. [25] and Chiu et al. [11] infused character features extracted by CNN to enhance the word-level representation based on the BiLSTM-CRF framework. In the low-resource domain, Zhou et al. [16] enhanced the robustness of NER models by adding adversarial perturbations to the original samples. In the Chinese NER task, Dong et al. [26] composed a sequence of word roots for each character and used LSTM networks to obtain the root information of Chinese characters. Zhang et al. [27] proposed the Lattice-LSTM method by replacing the traditional LSTM cells with a lattice LSTM. It cleverly encodes the Chinese characters and all the potential words matched with the lexicon. Based on the Lattice-LSTM, Wei et al. [28] proposed the word-character LSTM (WC-LSTM) model to alleviate the impact of word separation errors by adding word information to the start and end characters of a word.
Chinese characters, as pictographs, contain potential glyph information. Ref. [29] proposed the fused glyph network FGN to extract the interaction information between distributed representations of characters and glyph representations through a fusion mechanism. Ref. [30] proposed the FLAT model by converting the Lattice structure into a planar structure composed of spans. This model makes full use of Lattice information and has good parallelization capability based on the power of the Transformer and well-designed positional encoding. The SLK-NER model proposed in [31] uses global semantic information to fuse lexical knowledge through an attention mechanism to integrate more informative words into a character-based model and alleviate lexical boundary conflicts. In NER domain research, introducing extra knowledge is a common way to improve model performance. Ref. [32] proposed an AESINER model that efficiently uses attention integration to encode and fuse different types of syntactic information (e.g., lexical annotation, constituent syntactic information, and dependent syntactic information) to help the model identify named entities. In social media such as Weibo and Twitter, many short texts generated by users contain various types of entities. Some entities are not written following standard syntactic conventions (e.g., abbreviated by users at will), resulting in a small probability of such entities showing sparsity, making recognizing such entities more difficult. In response to this question, previous studies have used domain information (e.g., gazetteers and embeddings trained on prominent social media texts) and external features (e.g., lexical tags) to help improve the performance of NER on social media [33,34]. However, these methods require extra work to obtain this information, and there is noise in the results. For example, training embeddings in the social media domain can bring Appl. Sci. 2023, 13, 2849 5 of 16 many unique expressions to the vocabulary. Therefore, [35] proposed the SA-NER model to enhance the recognition of named entities with semantic expansion. However, the F1 value of this model only reached 69.8% on the Weibo dataset.
However, most of the studies in the literature above are based on canonical texts and large-scale corpora. In contrast, the food safety domain corpus is scarce, and the implicit information in non-canonical texts on the web is not fully utilized. In addition, the above papers rarely enhance the entity recognition performance based on the features of the irregular Chinese text.
Therefore, to better represent syntactic semantic information in different contexts, we use ERNIE, a pre-training model more suitable for the non-standardized Chinese text NER task. ERNIE is based on character representation word vectors. Inspired by the attentionbased BiLSTM model proposed in [36] for the relationship classification task, self-attention is added to the sequence annotation model BiLSTM-CRF for the NER task mechanism to obtain the most critical features for entity classification. The difference is that we use character-level feature vectors for entity classification instead of sentence-level features. Meanwhile, to improve the robustness and generalization of the model, we train the model adversarially by adding appropriate adversarial perturbations to the original samples.

Methods
In this section, we will introduce the ERNIE-Adv-BiLSTM-Att-CRF model in detail, as shown in Figure 2, which consists of six main parts:

•
Input layer: Input samples in terms of sentences into the model; • Embedding layer: The input sentences are represented using the ERNIE pre-training model to obtain a context-based word vector that contains rich implicit information; • Adversarial training: Creating adversarial samples by adding an adversarial perturbation to the word embedding layer to train the model adversarially; • BiLSTM layer: Using the BiLSTM network to learn the dependencies on the observed sequences and selectively picking higher-order features to integrate; • Attention layer: Generates a weight vector and multiplies the weight vector with the state of the hidden layer at each moment of BiLSTM to obtain the feature vector after self-attention; • CRF layer: Output the best-predicted label sequence. information (e.g., gazetteers and embeddings trained on prominent social media texts) and external features (e.g., lexical tags) to help improve the performance of NER on social media [33,34]. However, these methods require extra work to obtain this information, and there is noise in the results. For example, training embeddings in the social media domain can bring many unique expressions to the vocabulary. Therefore, [35] proposed the SA-NER model to enhance the recognition of named entities with semantic expansion. However, the F1 value of this model only reached 69.8% on the Weibo dataset. However, most of the studies in the literature above are based on canonical texts and large-scale corpora. In contrast, the food safety domain corpus is scarce, and the implicit information in non-canonical texts on the web is not fully utilized. In addition, the above papers rarely enhance the entity recognition performance based on the features of the irregular Chinese text.
Therefore, to better represent syntactic semantic information in different contexts, we use ERNIE, a pre-training model more suitable for the non-standardized Chinese text NER task. ERNIE is based on character representation word vectors. Inspired by the attentionbased BiLSTM model proposed in [36] for the relationship classification task, self-attention is added to the sequence annotation model BiLSTM-CRF for the NER task mechanism to obtain the most critical features for entity classification. The difference is that we use character-level feature vectors for entity classification instead of sentence-level features. Meanwhile, to improve the robustness and generalization of the model, we train the model adversarially by adding appropriate adversarial perturbations to the original samples.

Methods
In this section, we will introduce the ERNIE-Adv-BiLSTM-Att-CRF model in detail, as shown in Figure 2, which consists of six main parts:

•
Input layer: Input samples in terms of sentences into the model; • Embedding layer: The input sentences are represented using the ERNIE pre-training model to obtain a context-based word vector that contains rich implicit information; • Adversarial training: Creating adversarial samples by adding an adversarial perturbation to the word embedding layer to train the model adversarially; • BiLSTM layer: Using the BiLSTM network to learn the dependencies on the observed sequences and selectively picking higher-order features to integrate; • Attention layer: Generates a weight vector and multiplies the weight vector with the state of the hidden layer at each moment of BiLSTM to obtain the feature vector after self-attention; • CRF layer: Output the best-predicted label sequence.

Word Embedding
As with BERT, ERNIE uses a multi-layer Transformer as the primary encoder. A sentence s = {w 1 , w 2 , · · · , w t } containing t tokens (characters) is connected with unique token CLS and SEP, where CLS denotes the beginning of the sentence and SEP denotes the end of the sentence. As shown in Figure 3, for each token in the sentence, ERNIE represents it as Embeddings constructed by summing Token Embeddings, Segment Embeddings, and Position Embeddings through the embedding layer, i.e., E w i = E token + E seg + E pos . The sentence is vectorized into emb = {E w 1 , E w 2 , · · · , E w t } and input to the bidirectional Transformer for feature extraction. The contextual information of each token in the sentence is captured using the Self-Attention Mechanism of the Transformer to generate a sequence vector x = {x 1 , x 2 , · · · , x t } containing rich semantic features. In other words, the rich text features are extracted using the pre-trained model ERNIE to obtain a batch_size * max_seq_len * emb_size output vector, used as the classification task's sequence representation. Where batch_size is the batch size of the processed data, max_seq_len is the maximum length of the input sentences, and emb_size is the embedding dimension of each character.

Word Embedding
As with BERT, ERNIE uses a multi-layer Transformer as the primary encoder. A sentence = { 1 , 2 , ⋯ , } containing tokens (characters) is connected with unique token CLS and SEP, where CLS denotes the beginning of the sentence and SEP denotes the end of the sentence. As shown in Figure 3, for each token in the sentence, ERNIE represents it as Embeddings constructed by summing Token Embeddings, Segment Embeddings, and Position Embeddings through the embedding layer, i.e., = + + . The sentence is vectorized into = { 1 , 2 , ⋯ , } and input to the bidirectional Transformer for feature extraction. The contextual information of each token in the sentence is captured using the Self-Attention Mechanism of the Transformer to generate a sequence vector = { 1 , 2 , ⋯ , } containing rich semantic features. In other words, the rich text features are extracted using the pre-trained model ERNIE to obtain a ℎ_ * _ _ * _ output vector, used as the classification task's sequence representation. Where ℎ_ is the batch size of the processed data, _ _ is the maximum length of the input sentences, and _ is the embedding dimension of each character.

Bidirectional LSTM Network
The Long Short Terms Memory (LSTM) unit contains three specially designed gates (input gate, forgetting gate and output gate) for controlling the information transmission in a sequence. It can well solve the gradient disappearance and gradient explosion problems in the training process of traditional recurrent neural networks (RNN); at the same time, it has good sequence modelling ability and can better model long-range dependencies.
The Bidirectional Long Short Terms Memory (BiLSTM) network is a modification of RNN. It contains two sub-networks, forward and reverses LSTM, which can process contextual information simultaneously. The BiLSTM layer acts as a sequence encoder in NER tagging and inputs the word vector sequence of the input = { 1 , 2 , ⋯ , } sentence into the BiLSTM network for feature extraction. The probability of each token corresponding to the tag sequence = { 1 , 2 , ⋯ , } is output, and is the number of tags. Specifically, ERNIE's word vector sequence output is encoded using the BiLSTM network. The forward LSTM network obtains the hidden forward state (historical features), and the reverse LSTM network obtains the backward hidden state (future features). The output of the hidden layer of the BiLSTM network is represented as: The corresponding values between the forward and backward hidden states obtained by BiLSTM are summed to obtain ℎ, where ℎ = {ℎ 1 , ℎ 2 , ⋯ , ℎ } is the hidden representation of the character. Input ℎ into the Attention layer and use self-attention to obtain further the features that have the most significant impact on entity classification.

Bidirectional LSTM Network
The Long Short Terms Memory (LSTM) unit contains three specially designed gates (input gate, forgetting gate and output gate) for controlling the information transmission in a sequence. It can well solve the gradient disappearance and gradient explosion problems in the training process of traditional recurrent neural networks (RNN); at the same time, it has good sequence modelling ability and can better model long-range dependencies.
The Bidirectional Long Short Terms Memory (BiLSTM) network is a modification of RNN. It contains two sub-networks, forward and reverses LSTM, which can process contextual information simultaneously. The BiLSTM layer acts as a sequence encoder in NER tagging and inputs the word vector sequence of the input x = {x 1 , x 2 , · · · , x t } sentence into the BiLSTM network for feature extraction. The probability of each token corresponding to the tag sequence y = {y 1 , y 2 , · · · , y n } is output, and n is the number of tags. Specifically, ERNIE's word vector sequence output is encoded using the BiLSTM network. The forward LSTM network obtains the hidden forward state (historical features), and the reverse LSTM network obtains the backward hidden state (future features). The output of the hidden layer of the BiLSTM network is represented as: The corresponding values between the forward and backward hidden states obtained by BiLSTM are summed to obtain h, where h = {h 1 , h 2 , · · · , h t } is the hidden representation of the character. Input h into the Attention layer and use self-attention to obtain further the features that have the most significant impact on entity classification.

Attention Mechanism
The attention mechanism is a selection mechanism used to allocate limited information processing power, which is essentially a weighted summation, i.e., assigning higher weights to essential characters and smaller weights to other characters. The pre-trained model ERNIE uses a multi-layer Transformer as the primary encoder. The multi-headed attention mechanism in the Transformer structure can extract features of the text itself from multiple perspectives and levels. However, the "degree of influence" between the output information obtained from each time point of the LSTM is the same. In order to highlight the most critical part of the output information for entity classification, this paper adds self-attention after the BiLSTM network to capture the most critical semantic-level information in the sentence and automatically focus on the features that have a decisive impact on entity classification.
The matrix H is composed of the hidden state vector h output by the BiLSTM. Assume that w is the matrix parameter to be trained and w T is the transpose of w, satisfying the following equation: where H ∈ R d w ×T . d w is the word vector dimension in the sentence. α is the attention weight coefficient. The output h of the BiLSTM is weighted and summed to obtain r. The dimensions of w, α, and r are d w , T, and d w , respectively. After self-attention, we obtain the sentence representation vector containing the most critical information:

CRF Layer
Conditional Random Field (CRF) is a class of discriminative models best suited for prediction tasks. It is widely used in sequence labelling problems. Although the BiLSTM-Att network can handle long-range textual information and obtain more critical features for entity classification, the dependencies between neighboring tags are not effectively handled. The CRF layer is used as a sequence decoder for the NER tagger. The standard Viterbi algorithm obtains a globally optimal labeled sequence in the final decoding stage.
All outputs of the BiLSTM-Self-Attention (BiLSTM-Att) network are input to the CRF layer as a score matrix P. For a sequence of predicted tags y = {y 0 , y 1 , · · · , y n+1 }, and the score is defined as: where P is a matrix of size t * n, P i,j corresponds to the score of the i th word in sen tence x = {x 1 , x 2 , · · · , x t } corresponding to the j th label. A is a matrix of transferred scores, and A i,j represents the scores transferred from label i to label j. The prediction sequence of y 0 and y n+1 are the two tags start and end that mark the beginning and end of the sentence, so A is a square matrix of size n + 2. CRF uses potential functions to estimate the conditional distribution probability P(x|w) of the output tag sequence y for sequence x. The formula is shown below: where φ(x, y) is the feature vector and w is the parameter vector. Z(w, x) is the cumulative sum of the conditional distribution probabilities P(y|x, w) for all possible tags y.
Training the model utilizing the maximum conditional likelihood function: The CRF layer learns some constraints from the training dataset, which reduces the sequence of invalid predicted tags and ensures that the final output tag sequence is valid. When decoding the sequences, the tag sequence with the highest prediction score is selected as the best answer: x = argmax y P(y|x, w)

Regularization
In NLP tasks, adversarial training is no longer used to defend against gradient-based malicious attacks but more to strengthen the regularization of classification models. Adversarial training can be generalized to the following maximum minimization formulation: where inside the middle brackets is a maximization. D, x and y denote the training set, input samples and sample labels, respectively; θ, L(x, y; θ) and ∆ x are the model parameters, the loss of individual samples and the adversarial perturbation superimposed on the input, respectively; and Ω is the perturbation space. The perturbation space is generally small to avoid damage to the original input samples. max(L) is the optimization objective, i.e., finding the perturbation that maximizes the loss of a single sample. Meanwhile, the model parameters θ of the neural network are optimized using the outer layer E (x,y) to minimize them. When the perturbation is fixed, the model has a minor loss of the training data. In simple terms, the sample loss should be as significant as possible after adding the perturbation. In contrast, the model loss should be as small as possible, thus making the model more robust and avoiding the bias of the model inference results caused by small perturbations on the samples. In this paper, we borrow the Fast Gradient Method (FGM) from [37] for the text classification task and add an adversarial perturbation ∆ x to the word embedding to train the model adversarially, with ∆ x defined as follows: where g = ∆ x L(x, y, θ) is the gradient of the input sample and ∥ g ∥ 2 is the L 2 parametrization of g.
Compute ∆ x from the gradient of the word embedding matrix and add it to the current embedding, which is equivalent to: Compute its forward loss, backpropagate to obtain the adversarial gradient, accumulate to the original gradient, recover the embedding, and update the parameters based on the gradient with the accumulated adversarial gradient.

Experiments
In Sections 4.1 and 4.2, the data set used for the experiments, the settings of the relevant parameters, and the evaluation standard settings are presented. The proposed model in this paper is compared with some SOTA models, and the main experimental results are presented in Section 4.3. To verify the effectiveness of the components in our model, a series of ablation experiments are performed in Section 4.4. Our model is tested three times on each dataset, and this is used to calculate the average values of Precision (P), Recall (R), and F-score (F1). The bolded numbers in tables represent the better performance of our model over the comparison model. The Weibo NER dataset is generated based on historical data filtered from Sina Weibo between November 2013 and December 2014. It contains 1890 microblog messages, annotated based on the annotation standard of DEFT ERE of LDC2014. The entity categories in this dataset are divided into four categories: Person, Organization, Address, and Geopolitical entity; and each category can be subdivided into specific (NAM, e.g., " 张三" labelled as "PER.NAM") and generic (NOM, e.g., "NOM" for " 男人").
The Food dataset was generated by filtering negative reviews about food (fruit category) from the sentiment/opinion/review propensity analysis dataset online_shopping_ 10_cats of the Chinese Natural Language Processing Language Resources Project (https: //github.com/liuhuanyong/ChineseNLPCorpus (accessed on 20 August 2021)) and manually annotated using the NER annotation tool YEDDA [38]. First, food-safe named entities were extracted from the sentences, as shown in Figure 4; then, as shown in Figure 5, the sentences were processed into sequence labels to feed into the NER model for training. These reviews come from the Jingdong e-commerce platform and contain 1914 messages. Under the guidance of food safety-related experts, based on the content of the review, we assessed consumer evaluations about a particular type of fruit sold by a store on an ecommerce platform with a particular problem and the appearance of a specific symptom by consumers after consumption; the entity categories of this dataset are divided into five categories: type of fruit (FRU), description of the risk present (SIG), e-commerce platform sold (ECP), store sold (MER), and symptoms appeared (SYM). The Weibo NER dataset is generated based on historical data filtered from Sina Weibo between November 2013 and December 2014. It contains 1890 microblog messages, annotated based on the annotation standard of DEFT ERE of LDC2014. The entity categories in this dataset are divided into four categories: Person, Organization, Address, and Geopolitical entity; and each category can be subdivided into specific (NAM, e.g., "张三" labelled as "PER.NAM") and generic (NOM, e.g., "NOM" for "男人").
The Food dataset was generated by filtering negative reviews about food (fruit category) from the sentiment/opinion/review propensity analysis dataset online_shop-ping_10_cats of the Chinese Natural Language Processing Language Resources Project (https://github.com/liuhuanyong/ChineseNLPCorpus (accessed on 20 August 2021)) and manually annotated using the NER annotation tool YEDDA [38]. First, food-safe named entities were extracted from the sentences, as shown in Figure 4; then, as shown in Figure  5, the sentences were processed into sequence labels to feed into the NER model for training. These reviews come from the Jingdong e-commerce platform and contain 1914 messages. Under the guidance of food safety-related experts, based on the content of the review, we assessed consumer evaluations about a particular type of fruit sold by a store on an e-commerce platform with a particular problem and the appearance of a specific symptom by consumers after consumption; the entity categories of this dataset are divided into five categories: type of fruit (FRU), description of the risk present (SIG), e-commerce platform sold (ECP), store sold (MER), and symptoms appeared (SYM).

Figure 4.
Manual annotation instance. The term "苹果" represents the apple, indicating the presence of risky fruit categories, labeled "FRU", and "好大一股农药味" indicates that the apple strongly smells of pesticides, which is labeled as food safety risk description "SIG".  . Manual annotation instance. The term " 苹果" represents the apple, indicating the presence of risky fruit categories, labeled "FRU", and " 好大一股农药味" indicates that the apple strongly smells of pesticides, which is labeled as food safety risk description "SIG".

Dataset and Experimental Setup
• Datasets: This paper constructed a food safety domain dataset for Food to conduct experiments. To ensure the fairness of the experiments, a widely used public dataset, Weibo NER, is chosen to validate the validity and reasonableness of our model. Both datasets use standard BIO annotation to represent the named entity tags of tokens in the input sentences.
The Weibo NER dataset is generated based on historical data filtered from Sina Weibo between November 2013 and December 2014. It contains 1890 microblog messages, annotated based on the annotation standard of DEFT ERE of LDC2014. The entity categories in this dataset are divided into four categories: Person, Organization, Address, and Geopolitical entity; and each category can be subdivided into specific (NAM, e.g., "张三" labelled as "PER.NAM") and generic (NOM, e.g., "NOM" for "男人").
The Food dataset was generated by filtering negative reviews about food (fruit category) from the sentiment/opinion/review propensity analysis dataset online_shop-ping_10_cats of the Chinese Natural Language Processing Language Resources Project (https://github.com/liuhuanyong/ChineseNLPCorpus (accessed on 20 August 2021)) and manually annotated using the NER annotation tool YEDDA [38]. First, food-safe named entities were extracted from the sentences, as shown in Figure 4; then, as shown in Figure  5, the sentences were processed into sequence labels to feed into the NER model for training. These reviews come from the Jingdong e-commerce platform and contain 1914 messages. Under the guidance of food safety-related experts, based on the content of the review, we assessed consumer evaluations about a particular type of fruit sold by a store on an e-commerce platform with a particular problem and the appearance of a specific symptom by consumers after consumption; the entity categories of this dataset are divided into five categories: type of fruit (FRU), description of the risk present (SIG), e-commerce platform sold (ECP), store sold (MER), and symptoms appeared (SYM). . Manual annotation instance. The term "苹果" represents the apple, indicating the presence of risky fruit categories, labeled "FRU", and "好大一股农药味" indicates that the apple strongly smells of pesticides, which is labeled as food safety risk description "SIG".  The training, validation and test sets of the two datasets segmented according to the number of sentences are shown in Table 1. The Weibo NER dataset is its initial segmentation [33] and has not been changed in this paper. Furthermore, the named entity labels for both datasets are shown in Table 2. According to the default configuration, the output vector size of each character is set to 768, and the dropout rate of ERNIE is 0.1. The LSTM is set as a bidirectional network, the hidden layer size is 768, the number of layers is 1, and the dropout rate of LSTM is set to 0.5. The initial learning rate is a critical parameter that needs to be adjusted according to the target task. AdamW is used as an optimizer for pre-training model fine-tuning and NER model training, both of which have different learning rates. For pre-training model fine-tuning, the initial learning rate is 3 × 10 −5 for the Weibo NER dataset and 8 × 10 −5 for the Food dataset; for NER model training, the LSTM learning rate is 2 × 10 −5 for both datasets 2 × 10 −2 for CRF. The difference between the optimal learning rates of ERNIE and BERT is extensive and requires a higher initial learning rate. Since the models' weights are randomly initialized at the beginning of training, choosing a more extensive learning rate at this time may bring instability (oscillation) to the model. In order to stabilize the model, the warmup learning rate is chosen to make the model converge faster and better. The initial warm-up step number is set to 80.

Evaluation Standard Setting
In this paper, three experimental results of precision, recall, and F1 value are used as performance measurement criteria. Their calculation equations are as follows: In the precision (P) calculation Equation (13), TP i denotes the number of positive classes correctly predicted by the model and FP i denotes the number of positive classes predicted by the model from the negative classes.
In the recall (R) calculation Equation (14), TP i is the same as the above-mentioned equation and FN i denotes the number of negative classes predicted by the model from the positive classes.
Since precision and recall are a pair of contradictory metrics, in order to better evaluate the performance of the classifier, the harmonic mean F1 score of precision and recall is used as an evaluation standard to evaluate the comprehensive performance of the model.

Experimental Results
The results of the experiments on the datasets Weibo NER and Food are shown in Tables 3 and 4. Comparisons are made with other SOTA models on the Weibo NER dataset and some baseline models on the Food dataset. The ERNIE-Adv-BiLSTM-Att-CRF model proposed in this paper takes ERNIE-BiLSTM-CRF as the baseline and adds adversarial training and self-attention. In the SOTA model shown in Table 3: 1.
Locate and Label uses a two-stage entity recognizer that locates entities and labels boundaries for nested NER; 2.
SA-NER uses semantic expansion to improve the performance of NER; 3.
AESINER improves the named entity recognition ability of the model by introducing extra knowledge; 4.
SLK-NER uses an attention mechanism to fuse lexical knowledge into characterbased models; 5.
FLAT without the pre-trained model BERT uses Lattice information; 6.
BERT-LMCRF is a BERT model that uses BiLSTM-CRF as a NER tagger; 7.
FLAT + BERT is a SOTA model based on BERT; 8.
FGN is a fused glyph network based on BERT. Table 3 shows the data of each SOTA model are the experimental results in their original papers. Two baseline models are selected in Table 4 for comparison with the model in this paper: 1.
ERNIE-based and using softmax for entity classification; 2.
ERNIE-based and using BiLSTM-CRF as the NER sequence coder-decoder.
On the Food dataset, it can be seen from Table 4 that our model is significantly higher than the other two baseline models in terms of the F1 value, which is 69.68%. The improvement is 3.61% and 1.96%, respectively. It can be seen that adding the BiLSTM-CRF network after ERNIE is better than directly classifying the output of ERNIE for prediction, with an F1 value improvement of 1.65%. After adding adversarial training to the model training process and self-attention in BiLSTM-CRF, the model is further improved with another F1 value improvement of 1.96%.
From this, we can see that our model can alleviate the impact of noise on NER performance in Weibo NER and Food, two small-scale datasets with noise confounding. After self-attention, the features with a high impact on entity classification among all the features output by BiLSTM get greater weights, making entity recognition better.

Ablation Experiments
To demonstrate that adding self-attention and adversarial training can effectively improve the NER performance of small-scale datasets with noisy interference based on the ERNIE-BiLSTM-CRF as the baseline model, a series of ablation experiments are conducted in this paper using the Weibo NER dataset as an example. As shown in Table 5, the experimental results illustrate the effects of self-attention and adversarial training on NER performance. The experimental results show that the dataset Weibo NER has an F1 value of 70.00% on the baseline model ERNIE-BiLSTM-CRF; only self-attention is added to the baseline model, and the F1 value is 70.35%, which is a mere 0.35% improvement. From this, it can be concluded that even if self-attention is added to the BiLSTM-CRF model to assign greater weights to those features that impact entity classification, the noise still dramatically influences the model. Similarly, adding adversarial training to the baseline model only, the F1 value is 70.60%, which is just a 0.60% improvement. Although adversarial training can alleviate the effect of noise on entity recognition during model training, it does not capture the most important features without adding self-attention for the BiLSTM-CRF model. Ultimately, it does not obtain good entity labeling performance. While adding both simultaneously, the model has a remarkable improvement with an F1 value of 72.64%. Compared with the three models mentioned above, the F1 values are improved by 2.64%, 2.29%, and 2.04%, respectively. This shows that adding both self-attention and adversarial training can effectively improve the performance of small-scale nonstandard Chinese NER with noise by assigning more weight to the features that help entity recognition while reducing the effect of noise. The validity of our model for unstandardized Chinese NER is demonstrated.
Based on the above experiments, it can be seen that adding adversarial perturbation to the original samples and adding the self-attention mechanism to the BiLSTM-CRF network can both alleviate the effect of noise on the model and capture the features that are beneficial to entity classification. In addition, this paper further analyzes the effects of different pre-training models and adversarial training methods on entity recognition, as detailed in Tables 6 and 7. • Pre-trained Language Model: As shown in Table 6, the NER performance of different pre-trained models is analyzed with the other settings of our model held constant. BERT-base (https://github.com/google-research/bert (accessed on 12 September 2021)) and RoBERTa-wwm-ext (https://github.com/ymcui/Chinese-BERT-wwm (accessed on 21 September 2021)) are Chinese pre-trained models. Google publishes the former, and the latter is published by Xunfei Joint Lab of Harbin Institute of Technology. It should be noted that RoBERTa-wwm-ext is not the original RoBERTa model, but only a BERT model trained in a similar way to Roberta training, i.e., RoBERTalike BERT.
It can be seen that the BERT-base-based NER model is a minor performance, with an F1 value of 68.88%, because the Chinese in BERT-base is sliced at character granularity, which does not consider Chinese word separation (CWS) in traditional NLP tasks. Instead, RoBERTa-wwm-ext uses Chinese Wikipedia (both simplified and traditional) for training and applies the Whole Word Masking (WWM) technique to Chinese. At the same time, the LTP of Harbin Institute of Technology is used as a word-splitting tool to Mask all Chinese characters that form the same word instead of being limited to Masking a single Chinese character in BERT-base. The F1 value is 70.69%, 1.81% higher than the F1 value of the BERT-base.
The F1 value of the ERNIE-based NER model is 72.64%, which is 3.76% and 1.95% higher than the BERT-base and RoBERTa-wwm-ext, respectively. Because the experimen-tal datasets, Weibo and Food, are annotated from the unstandardized text generated by user comments on social media and e-commerce platforms. However, BERT-base and RoBERTa-wwm-ext use Wikipedia data for training, and they are more effective in modeling canonical text. ERNIE adds web data such as Baidu Encyclopedia, Baidu News, and Baidu Post, and it has advantages in modeling such unstandardized text. Therefore, BERT-base and RoBERTa-wwm-ext are not as effective as ERNIE for entity recognition in our datasets.

•
Adversarial Training: The Fast Gradient Method (FGM), the Project Gradient Descent (PGD) [39], and the Free Large Batch Adversarial Training (FreeLB) [40] are three adversarial training methods, i.e., three different adversarial perturbation generation methods. Table 7 shows the different performances of the three adversarial training modalities. The experimental results show that FGM outperforms the remaining two on the NER task with an F1 value of 72.64%. 1.78% and 1.14% higher than FreeLB and PGD, respectively. That is to say, adding the perturbation generated by FGM to word embedding can obtain higher accuracy of entity classification. FGM is more suitable for small sample NER models.

Conclusions
In order to obtain as many practical features as possible from the noisy mixed smallscale corpus to improve the performance of named entity recognition of unstandardized Chinese text, we propose the ERNIE-Adv-BiLSTM-Att-CRF model: • ERNIE, a pre-trained model with advantages for modelling unstandardized Chinese text, is chosen to generate context-based word vectors that retain rich implicit information; • Adversarial training is added to the model training as a regularization tool to alleviate the effect of dataset noise on the NER model; • Self-attention is added to the BiLSTM network to automatically focus on the features that have a decisive impact on entity classification and encode them in sequence; • Sequence decoding is performed in the CRF layer to obtain the best label corresponding to each token.
The experimental results show that our approach obtains SOTA performance on the public dataset Weibo NER with an F1 value of 72.64%. As shown in Table 3, our model outperforms other SOTA models on the microblog NER dataset. Compared with the FLAT, SLK-NER, Locate and Label, and AESINER models, the F1 values obtained significant improvements of 12.32%, 8.64%, 3.48%, and 2.86%, respectively; compared with the BERT-LMCRF, FLAT + BERT, SA-NER, and FGN models, our F1 values also obtained significant improvements of improved by 5. 52%, 4.09%, 2.84%, and 1.39%, respectively. Good performance was also obtained on the self-built in-domain dataset Food, with an F1 value of 69.68%. As can be seen from Table 4, our model is significantly higher than the other two baseline models in terms of F1 values, with improvements of 3.61% and 1.96%, respectively. The effectiveness of our method in the non-standardized NER task is fully demonstrated. Moreover, the performance of different pre-trained models and adversarial training methods are discussed in the ablation experiments.
Based on the experiments in this paper, it is demonstrated that our model can extract specific entity information from non-standardized web texts. This is useful for NER tasks in corpus-poor domains, such as the food safety domain. To address the problem that there is no publicly available NER dataset in the food safety domain, our approach can extract named entities related to food safety from short non-standardized Chinese texts generated from web users' comments. In this way, a regulatory knowledge graph in the food safety domain can be constructed to help relevant authorities to regulate food safety issues and mitigate the harm caused by food safety problems.