Fine-Grained Mechanical Chinese Named Entity Recognition Based on ALBERT-AttBiLSTM-CRF and Transfer Learning

Abstract: Manufacturing text often exists as unlabeled data; its entities are fine-grained and difficult to extract. As a result, the knowledge utilization rate in the manufacturing industry is low. This paper proposes a novel Chinese fine-grained NER (named entity recognition) method based on symmetry lightweight deep multinetwork collaboration (ALBERT-AttBiLSTM-CRF) and model transfer considering active learning (MTAL) to study fine-grained named entity recognition on Chinese text with few labels. The method is divided into two stages. In the first stage, ALBERT-AttBiLSTM-CRF was applied and verified on the CLUENER2020 dataset (a public dataset) to obtain a pretrained model; the experiments show that the model obtains an F1 score of 0.8962, an improvement of 9.2% over the best baseline algorithm. In the second stage, the pretrained model was transferred to the Manufacturing-NER dataset (our dataset), and an active learning strategy was used to optimize the model. The final F1 score on Manufacturing-NER was 0.8931 after the model transfer, compared with 0.8576 before it, an improvement of 3.55%. The method effectively transfers existing knowledge from public source data to scientific target data, addressing named entity recognition with scarce labeled domain data.


Introduction
Manufacturing relies on production experience, so mining and reusing industry knowledge is particularly important. As manufacturing and technology continue to evolve, more and more researchers are focusing on mining manufacturing data with advanced technologies to support production [1]. Technologies such as deep learning and artificial intelligence have been applied to traditional manufacturing industries. Nonetheless, most current research focuses on tasks such as signal processing [2,3] and image detection [4,5], whereas the knowledge of the manufacturing industry often exists as unstructured textual data, such as manufacturing standards, documents, product specifications, patents, and technical reports [6]. Unstructured textual data contain a large amount of information about manufacturing. How to mine prior empirical knowledge from these unstructured data has become an important problem for industry researchers.
Natural language processing (NLP) is a well-developed and essential branch of artificial intelligence that studies how to mine valuable information from unstructured data and use it. NER (named entity recognition) is an important direction in NLP that aims to obtain meaningful entities and entity types, such as name, location, and organization, from large amounts of textual data. NER is a vital task in information extraction from textual data, knowledge question answering systems, knowledge graphs, and other research areas.
As research on NER has progressed, scholars around the world have begun to study textual data in languages such as Korean [7], Spanish [8], and Arabic [9]. Chinese scholars have also been paying increasing attention to Chinese named entity recognition (C-NER). Unlike English, where each word represents a meaning, in Chinese several characters often form a word to express a meaning. Moreover, Chinese has no distinct separation marker like the space in English, which makes word and sentence segmentation difficult. At the same time, one word can have multiple meanings and one meaning can be expressed by many words, making Chinese more complicated to handle than English. Currently, good results have been achieved on C-NER in the public domain, such as on the MSRA dataset and the People's Daily dataset. However, these datasets contain only a small number of entity types (three types). As professional domains develop, the need for knowledge graph construction and question answering systems in these domains is becoming more and more urgent, and Chinese named entity recognition in professional fields is particularly important. Professional domain data differ from public domain data in that the amount of domain-specific data is small and the entity types are numerous. This is a fine-grained named entity recognition problem [10].
NER methods come in four main types: dictionary-based, rule-based, statistical learning-based, and deep learning-based. In the earlier methods, such as the hidden Markov model (HMM) and conditional random field (CRF), label sequences are obtained from state transition probabilities. Although researchers achieved good performance with them, these methods depend on a large amount of experience, knowledge, and human effort for rule design and feature engineering. When the data to be processed change, the rules and features need to be reformulated [11,12]. Deep learning, by contrast, does not need manual feature extraction and has strong adaptability [13]. Therefore, more and more researchers are studying NER using deep learning. Although convolutional neural networks (CNN) have obtained good results in the computer vision domain, they are not well suited for the text domain because they are not sequential. Deep network structures such as recurrent neural networks (RNN), long short-term memory networks (LSTM), and gated recurrent units (GRU) have achieved excellent results on sequential data. The emergence of BiLSTM has addressed two shortcomings of these structures: they cannot adapt to longer sequences and cannot extract bi-directional information. Some scholars combine deep learning with machine learning: deep networks extract contextual features from the text, and a machine learning model then predicts the output tag sequence, which has achieved good results.
There is a need for better-performing NER on manufacturing domain text so that manufacturing knowledge can be fed back to actual production. In this paper, a novel Chinese fine-grained NER method based on ALBERT-AttBiLSTM-CRF and transfer learning is proposed. The method performs fine-grained named entity recognition on Chinese domain text data. ALBERT-AttBiLSTM-CRF consists of the following parts. Firstly, a lite, pre-trained deep bi-directional representation model (ALBERT) is used to extract the word embeddings of the text data. The bidirectional long short-term memory network (BiLSTM) is then used to extract features from the input vectors and obtain contextual information on the text. The self-attention mechanism is used to focus on the related tokens in the different sentences of a document and address the tagging inconsistency problem. The feature matrix output by the previous network is labeled by the CRF to get a tag sequence. Then, a model transfer considering active learning (MTAL) method is proposed. The model trained on the source domain is transferred to the target domain. Based on the new model, active learning is used to label the unlabeled data, continuously augmenting the amount of data and improving the performance of the model. Finally, the proposed method is applied to public datasets and the domain datasets we created. The experimental results are better than those of other methods, demonstrating the effectiveness of the approach. This method can be used to identify entities in domain text data and provide data support for downstream domain knowledge graphs, knowledge question answering systems, etc.
The remainder of the paper is organized as follows: Section 2 reviews techniques related to text representation, NER in domain fields, and applications of text data augmentation for named entity recognition. Section 3 introduces the proposed fine-grained C-NER method based on symmetry lite deep multi-network collaboration and transfer learning considering active learning. Section 4 applies the proposed approach to the CLUENER2020 dataset and our dataset and compares the results with other methods to verify its feasibility and performance. Finally, Section 5 provides the main conclusions and indicates future directions of the work.

Word Representation Technologies
Word representation is an essential task in NLP. The Neural Language Model (NLM) [14] is a class of language models that overcome the curse of dimensionality by modeling natural language sequences using distributed representations of words. Unlike the class-based n-gram model, the NLM can recognize two similar words without losing the ability to encode each word differently. NLMs share a concept and its context with other related terms. Collobert and Weston [15] used the word vector approach as a useful tool for handling downstream tasks and introduced a neural network model structure that laid the groundwork for many improvements and enhancements to NLM methods. Mikolov et al. [16] presented Word2vec, which introduced several enhancements for learning higher-quality vectors more rapidly. RNN is one of the best technical solutions for dealing with the dynamic input sequences prevalent in NLP. Sutskever et al. [17] proposed the seq2seq model based on RNN and LSTM, which uses neural networks to map one sequence to another. In this method, the encoder neural network represents sentence symbols as vectors. A decoder neural network then outputs the predicted symbols one by one according to the encoder state, taking the previously predicted symbols as the input of each step. The emergence of pre-training has accelerated the development of word representation and addressed the problem of one word having multiple meanings. ELMo [18] is a pretrained contextual word embedding model, which uses a BiLSTM language model consisting of a forward and a backward language model. The goal of GPT (Generative Pre-Training) [19] is to learn a general representation that can be applied to many tasks. The purpose of BERT [20] is to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on the left and right context in all layers. ERNIE 2.0 [21] does not rely on a small fixed set of pre-training tasks; instead, it continuously introduces new pre-training tasks to help the model learn semantic information continuously and efficiently. All of the above methods have achieved outstanding results in the field of NLP.
With the adoption of these techniques in various industries, researchers have started to pay attention to speeding up operations by keeping the number of models and parameters as small as possible with little loss of accuracy. The following improved approaches are representative. XLNet [22] is a generalized autoregressive pre-training method that (1) enables bi-directional contextual learning by maximizing the expected likelihood over all permutations of the factorization order, and (2) overcomes limitations of BERT (Bidirectional Encoder Representations from Transformers). ALBERT (A Lite BERT) [23] reduces memory consumption and increases the training speed of BERT, addressing in particular GPU/TPU (Graphics Processing Unit/Tensor Processing Unit) memory limitations and long training times.

Named Entity Recognition for Domain Fields
Zhou et al. [24] proposed a deep neural network for bug-specific entity recognition (DBNER) using LSTM with CRF. DBNER extracts multiple features from a large amount of bug data and uses an attention mechanism to improve the consistency of entity tags in bug reports. Qiu et al. [25] proposed an attention-based BiLSTM with a CRF layer (ATT-BiLSTM-CRF) for geological named entity recognition (GNER) to extract entities from geoscience reports. Georgescu [26] enhanced the process of diagnosing and detecting possible vulnerabilities within an Internet of Things (IoT) system through an NER-based solution. Harsh Vardhan et al. [27] proposed a simple and powerful method, applicable to legal documents from different statutory bodies, that correctly identifies numerous entities in order to find information relevant to specific issues.
In terms of Chinese named entity recognition, Yin et al. [28] proposed a BiLSTM-CRF based on radical-level features and self-attention to solve Chinese clinical NER problems. He and Sun [29] proposed a unified feature model for the NER problem in Chinese social media, which can learn from out-of-domain corpora and unannotated in-domain text. The unified model includes two methods, for cross-domain learning and semi-supervised learning. Yin et al. [30] proposed an annotation strategy that considers fuzzy entity boundaries, combined it with domain expert knowledge, and constructed MilitaryCorpus based on microblog data. In their model, the BERT-based bidirectional encoder layer produces character-level vector representations, the BiLSTM layer extracts context features to form a feature matrix, and the CRF layer finally generates the optimal tag sequence.

Text Data Augmented in NER
The main difficulty of domain-specific NER lies in the lack of a canonical labeled corpus. Deep network models usually require a large annotated corpus for training; otherwise, they overfit easily. Therefore, directly applying a deep network model to a specific field is often not very effective [31]. The study of NER in this paper is based on an enterprise atlas. It is necessary to apply NER technology in the enterprise domain, and NER in the enterprise domain also faces the problem of a small corpus.
Shen et al. [32] combined deep learning with active learning to reduce the amount of labeled training data. They introduced a lightweight NER model, CNN-CNN-LSTM, to speed up operations. The model consists of convolutional character and word encoders and an LSTM tag decoder. Luka Gligic et al. [33] bootstrapped neural network (NN) models through transfer learning by pretraining word embeddings on a secondary task performed on unannotated electronic health records (EHRs), and used the output embeddings as the basis for a range of NN architectures. Van Cuong Tran et al. [34] assembled a method that uses active learning (AL) and self-learning to reduce the labeling workload for the NER task on tweet streams by using both machine-labeled and manually labeled data. CRFs were also chosen as the algorithm for selecting highly reliable cases. Kung [35] built a Mandarin NER module based on transfer learning to collect and analyze damage information in disaster management. To deal with the lack of labeled data in the cultural relics domain, Zhang et al. [36] put forward Cultural Relics SCRNER (semi-supervision model for Cultural Relics' Named Entity Recognition), which trains a BiLSTM and CRF model in a semi-supervised way to obtain effective recognition performance. Liu et al. [37] proposed an active learning method for BERT-CRF, a model that has been widely used in NER.
Researchers apply pre-trained deep neural networks to natural language processing by fine-tuning the pre-trained model for the real-world problem at hand. In general language models, the probability of an intermediate word is often calculated from the preceding and following words. However, in specific natural language processing problems, the contextual representation of a word, the polysemy of a word, and the syntactic features of a sentence also need to be addressed. To solve this problem, Devlin et al. [20] proposed BERT. The Transformer, which models text with an attention mechanism, is a crucial component of BERT. BERT uses a bidirectional Transformer as the encoder to fuse the preceding and following context of each token. The input representation of a word is the sum of three parts: Token Embeddings, Segment Embeddings, and Positional Embeddings. Token Embeddings represent a word vector, which can be either a word vector or a character vector in Chinese processing, and the word vector used in this paper is more in line with Chinese characteristics. The [CLS] token is placed before the first word, and the [SEP] token is used as a separator between sentences. Segment Embeddings are used to distinguish the two sentences. Positional Embeddings are position information learned by the model.
The main module of the encoding unit is self-attention, given in Equation (1), where Q, K, and V are the input word vector matrices, d_k is the input vector dimension, and QK^T captures the relationships between input word vectors. These relationships reflect, to some extent, the importance of the different words in the sentence. They are then used as weights to obtain a new representation of each word. This representation contains both the word itself and its relationship to the other words, and is therefore a more global expression than a simple word vector.
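The self-attention computation described above can be sketched in a few lines. This is a minimal illustration using the standard scaled dot-product form softmax(QK^T / sqrt(d_k)) V; as a simplifying assumption, the toy token vectors are used directly as Q, K, and V, whereas a real Transformer first applies learned linear projections. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(q, k, v):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)  # pairwise relationships between tokens
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input (assumption): three tokens with two-dimensional vectors,
# used directly as Q, K, and V without learned projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one, so every output vector is a weighted average of the value vectors, which is what makes the new representation "more global" than the original word vector.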
To extend the model's ability to focus on different locations and increase the "representation subspace" of attention units, the Transformer employs a "multi-head" model, as shown in Equations (2) and (3).
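For reference, the multi-head formulation from the original Transformer literature is reproduced below; it is assumed, not confirmed, that Equations (2) and (3) follow this form. The projection matrices W_i^Q, W_i^K, W_i^V, and W^O are learned parameters, and h is the number of heads.

```latex
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
\end{aligned}
```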
Moreover, to address the degradation problem in deep learning, the Transformer encoding unit incorporates residual connections and layer normalization, as shown in Equations (4) and (5).
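A minimal sketch of the residual connection followed by layer normalization is given below. It assumes the common "add and norm" arrangement and omits the learnable gain and bias of layer normalization for brevity; the function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


# Toy example: x is the sub-layer input, sublayer_out the sub-layer output.
normed = add_and_norm([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5])
```

The residual path lets the input pass through unchanged, so a deep stack can fall back to the identity mapping when a sub-layer contributes little, which is how the degradation problem is mitigated.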
During text processing, the order of words is an important feature, and the Transformer uses position embedding to capture it. Each position is numbered, and each number corresponds to a vector. By combining the position vector and the word vector, position information is introduced to each word so that attention can distinguish words in different positions and learn the position information, as shown in Equations (6) and (7). Here the padded sequence length is 512, 2i and 2i + 1 denote the even and odd dimensions, respectively, and d is 64.
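The even/odd position scheme above can be sketched as follows, assuming the sinusoidal form from the original Transformer literature, with sine on even dimensions and cosine on odd dimensions. The function name and the base constant 10000 are taken from that convention rather than from this paper.

```python
import math


def positional_encoding(max_len, d_model):
    """Sinusoidal position vectors: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for even_dim in range(0, d_model, 2):  # even_dim plays the role of 2i
            angle = pos / (10000 ** (even_dim / d_model))
            pe[pos][even_dim] = math.sin(angle)
            if even_dim + 1 < d_model:
                pe[pos][even_dim + 1] = math.cos(angle)
    return pe


# Settings from the text: padded sequence length 512 and d = 64.
pe = positional_encoding(512, 64)
```

Each row of `pe` is added to the word vector at the same position, so two identical words at different positions receive different input representations.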
ALBERT is a lite BERT that uses factorized embedding parameterization, cross-layer parameter sharing, and other methods to effectively decrease the number of parameters and accelerate training. The architecture of ALBERT is shown in Figure 1. The next sentence prediction (NSP) pretraining task in BERT has been used with proper natural language processing. The main parameters of the BERT network structure are in the Token Embedding, the encoder's feed-forward layers, and the multi-head self-attention, of which the Token Embedding has the largest number of parameters. Therefore, ALBERT applies factorized embedding, similar to matrix decomposition in recommender systems, to the Token Embedding. It also applies cross-layer parameter sharing in the encoder, which decreases the number of parameters significantly.
In BERT, the word embedding dimension is E and the hidden layer output dimension is H, where E = H. The vocabulary size is V, so the embedding matrix has O(V × H) parameters. However, the word-level embedding does not include the contextual connection of the word, whereas the output of the hidden layer includes not only the original meaning of the word but also its contextual information. The hidden layer representation therefore carries more information, so we should let H ≫ E. In ALBERT, the word embedding dimension is smaller than the hidden layer output dimension. In NLP tasks, the embedding matrix has size E × V. Since the vocabulary is large, if E = H, the number of parameters in the embedding matrix will be large, and the updates during back-propagation will be sparse.
Symmetry 2020, 12, 1986 7 of 21
In ALBERT, factorized embedding parameterization is used to decrease the number of parameters. The one-hot vector is first mapped to a low-dimensional space of size E and then projected to the high-dimensional hidden space of size H through a second matrix, as shown in Equation (8), thereby reducing the number of embedding parameters:
O(V × H) → O(V × E + E × H)    (8)
For example, in the ALBERT-base model, when E = 128, the number of parameters is 12 million, and when E = 768, the number of parameters is 108 million.
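To make the saving from factorized embedding concrete, the short sketch below counts embedding parameters with and without factorization. The vocabulary size of 30,000 is an illustrative assumption, not a value reported in this paper, and the function name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Count embedding parameters; factorized when embed_size is given."""
    if embed_size is None:
        # BERT-style: one V x H matrix.
        return vocab_size * hidden_size
    # ALBERT-style: a V x E matrix followed by an E x H projection.
    return vocab_size * embed_size + embed_size * hidden_size


# Illustrative sizes (assumed): V = 30000, H = 768, E = 128.
unfactorized = embedding_params(30000, 768)
factorized = embedding_params(30000, 768, 128)
print(unfactorized, factorized)  # 23040000 3938304
```

With these assumed sizes, the embedding parameters shrink by roughly a factor of six, and the saving grows as H increases while E stays fixed.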
Table 1 shows the parameter results when E takes different values. Without cross-layer parameter sharing, E = 768 performs better; with parameter sharing, E = 128 performs better. ALBERT adopts the scheme in which the Transformer layers share both the fully connected layer and the attention layer. That is, all parameters of the hidden layers are shared. Compared with a Transformer of the same size without sharing, performance decreases somewhat, but the number of parameters is greatly reduced and training is considerably faster. The reduction in parameters is shown in Equation (9), where L is the number of hidden layers.
To compensate for the loss of performance due to parameter reduction, ALBERT employs a new training task based on an inter-sentence coherence loss. In ALBERT, it is argued that, compared with the masked language model (MLM) task, the next sentence prediction (NSP) task adopted by BERT reduces the performance of downstream tasks. This is because NSP contains two subtasks, topic prediction and coherence prediction, and the former is much simpler than the latter. ALBERT therefore replaces NSP with sentence-order prediction (SOP), which retains only the coherence prediction. In this task, the positive sample is the same as in NSP: two consecutive sentences from the same document. The negative sample is obtained by swapping the order of the two sentences in a positive example. Figure 2 shows the model visualization of BERT and ALBERT.
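The effect of cross-layer sharing on the encoder size can be approximated as follows. The sketch assumes the usual rough count of 4H² attention parameters plus 8H² feed-forward parameters per layer (feed-forward inner size 4H, biases and layer normalization ignored); these approximations and the function name are assumptions, not figures from this paper.

```python
def encoder_params(hidden_size, num_layers, shared):
    """Approximate Transformer encoder parameter count, ignoring biases and layer norm."""
    attention = 4 * hidden_size ** 2     # query, key, value, and output projections
    feed_forward = 8 * hidden_size ** 2  # H x 4H and 4H x H matrices
    per_layer = attention + feed_forward
    # With cross-layer sharing, one set of layer parameters is reused L times.
    return per_layer if shared else num_layers * per_layer


# Illustrative sizes (assumed): H = 768 and L = 12.
not_shared = encoder_params(768, 12, shared=False)
all_shared = encoder_params(768, 12, shared=True)
print(not_shared, all_shared)  # 84934656 7077888
```

Under these assumptions the encoder parameters drop by a factor of L, while the amount of computation per forward pass stays the same because the shared layer is still applied L times.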
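The construction of positive and negative samples for sentence-order prediction can be sketched as below. The function name and the label convention (1 for original order, 0 for swapped order) are illustrative assumptions.

```python
def make_sop_examples(sentences):
    """Build (first, second, label) triples from consecutive sentences of one document."""
    examples = []
    for first, second in zip(sentences, sentences[1:]):
        examples.append((first, second, 1))  # original order: positive sample
        examples.append((second, first, 0))  # swapped order: negative sample
    return examples


document = ["s1", "s2", "s3"]
sop_examples = make_sop_examples(document)
```

Because both samples come from the same document, topic cues cannot separate them, so the model has to rely on coherence between the two sentences.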

Feature Extraction Based on AttBiLSTM
In NLP, the data involved often have backward and forward correlations, and traditional feed-forward NNs cannot handle this type of data well. The RNN structure is divided into the input layer, the intermediate layer, and the output layer. However, RNN does not handle this problem very well when the sequence data become longer. LSTM, as a particular recurrent neural network, can effectively mitigate gradient explosion and gradient vanishing when processing long sequences by using a specially designed gate structure to selectively retain contextual information. In a traditional LSTM network, the current cell cannot act on the "input gate" and "output gate" at the next moment, which causes the whole cell to lose part of the information about the previous sequence. To solve this problem, Gers and Schmidhuber [38] proposed the peephole connection, as expressed in Equations (10)-(14). The data from the cell at the previous moment are entered into the "input gate" and "forget gate" along with the data from the current moment. The output data from the "forget gate" are simultaneously entered into the cell; the data from the cell are entered into the "output gate" at the current moment and also into the "input gate" at the next moment. The data output from the "output gate", together with the activated cell data, form the output of the whole memory cell. The structure of the improved LSTM cell is shown in Figure 3.
In Equations (10)-(14), σ is the sigmoid activation function; ⊗ is the element-wise multiplication operation; tanh is the hyperbolic tangent activation function; x_t is the unit input; i_t, f_t, and o_t are the input gate, forget gate, and output gate at moment t; w and b are the weight matrices and bias vectors of the input gate, forget gate, and output gate; c_t is the state at moment t; and h_t is the output at moment t.
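One step of a peephole LSTM cell can be sketched as below. This follows a common peephole formulation consistent with the gate description in this section: the input and forget gates see the previous cell state, and the output gate sees the current cell state. The parameter names (`W_xi`, `W_hi`, `w_ci`, and so on), the diagonal peephole weights, and the helper functions are assumptions made for illustration, not the notation of Equations (10)-(14).

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w_x, x, w_h, h, b):
    """Compute W_x x + W_h h + b for matrices stored as lists of rows."""
    return [
        sum(wx * xv for wx, xv in zip(row_x, x))
        + sum(wh * hv for wh, hv in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(w_x, w_h, b)
    ]


def peephole_lstm_step(x, h_prev, c_prev, p):
    """One peephole LSTM step; p maps hypothetical parameter names to values."""
    zi = affine(p["W_xi"], x, p["W_hi"], h_prev, p["b_i"])
    zf = affine(p["W_xf"], x, p["W_hf"], h_prev, p["b_f"])
    zc = affine(p["W_xc"], x, p["W_hc"], h_prev, p["b_c"])
    zo = affine(p["W_xo"], x, p["W_ho"], h_prev, p["b_o"])
    # Input and forget gates peek at the previous cell state.
    i = [sigmoid(z + w * c) for z, w, c in zip(zi, p["w_ci"], c_prev)]
    f = [sigmoid(z + w * c) for z, w, c in zip(zf, p["w_cf"], c_prev)]
    c = [fv * cp + iv * math.tanh(z) for fv, cp, iv, z in zip(f, c_prev, i, zc)]
    # Output gate peeks at the current cell state.
    o = [sigmoid(z + w * cv) for z, w, cv in zip(zo, p["w_co"], c)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


def zero_params(n_in, n_hid):
    """All-zero parameters, used only to make the toy example deterministic."""
    p = {}
    for gate in "ifco":
        p["W_x" + gate] = [[0.0] * n_in for _ in range(n_hid)]
        p["W_h" + gate] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b_" + gate] = [0.0] * n_hid
    for gate in "ifo":
        p["w_c" + gate] = [0.0] * n_hid
    return p


# With zero parameters every gate equals 0.5, so c = 0.5 * 2.0 = 1.0.
h_t, c_t = peephole_lstm_step([1.0], [0.0], [2.0], zero_params(1, 1))
```

A BiLSTM runs one such recurrence from left to right and another from right to left, then combines the two hidden states at each position.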
In natural language processing problems, each word is influenced by the words before and after it, so the textual context must be taken into account [18]. Therefore, in this paper, BiLSTM is used for feature extraction. BiLSTM is a combination of a forward LSTM and a backward LSTM. The output of the BiLSTM is expressed in Equations (15)-(17).
To better understand the relationship between entities and other words, word vector representation is more suitable for this document. Based on BiLSTM feature extraction, the attention mechanism is introduced to solve the problem of inconsistent entity tags. In this part, the BiLSTM output $h_t$ is taken as the input of the attention layer. The association between words is obtained by calculating the attention between entity words and other words. The attention weight $w_{t,j}$ is expressed in Equation (18).
Here, $\mathrm{score}(x_t, x_j)$ is determined by the Euclidean distance, as shown in Equation (19), where the weight matrix $W_\alpha$ is a parameter of the model and • is the element-wise product. Then, a global vector $g_t$ is computed as a weighted sum of the BiLSTM outputs $h_j$, as shown in Equation (20).
Furthermore, the global vector and the BiLSTM output of the target word are concatenated into a vector $[g_t; h_t]$ and passed to the tanh function to generate the output of the attention layer, $z_t$, as shown in Equation (21).
Then, the tanh layer on top of the attention layer is used to predict the confidence scores of the word. The network outputs one confidence score for each possible tag of the word, as shown in Equation (22).
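The attention steps in this part (weights from pairwise scores, the global vector, and the concatenation passed through tanh) can be sketched as below. Several details are assumptions made for illustration: the score is taken as a scalar weight times the squared Euclidean distance, the weight `w_a` is set to a negative value so that closer tokens receive higher weight, and the projection matrix `w_z` is a hypothetical name.

```python
import math


def euclidean_score(x_t, x_j, w_a=-1.0):
    """Score from the squared Euclidean distance, scaled by an assumed scalar weight."""
    return w_a * sum((a - b) ** 2 for a, b in zip(x_t, x_j))


def attention_weights(xs, t, w_a=-1.0):
    """Softmax over the scores between token t and every token j."""
    scores = [euclidean_score(xs[t], x_j, w_a) for x_j in xs]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_vector(hs, weights):
    """Weighted sum of the BiLSTM outputs h_j."""
    dim = len(hs[0])
    return [sum(w * h[d] for w, h in zip(weights, hs)) for d in range(dim)]


def attention_output(g_t, h_t, w_z):
    """tanh of a linear map applied to the concatenation [g_t; h_t]."""
    concat = g_t + h_t
    return [math.tanh(sum(w * v for w, v in zip(row, concat))) for row in w_z]


# Toy example: tokens 0 and 2 are identical, so they attend to each other equally.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
hs = xs  # stand-in for BiLSTM outputs
w = attention_weights(xs, 0)
g = global_vector(hs, w)
z = attention_output(g, hs[0], [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
```

In this toy case the repeated token receives the same weight as the target token itself, which illustrates how mentions of the same entity elsewhere in a document can reinforce a consistent tag.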
$$e_t = \tanh(W_g z_t) \tag{22}$$

Sequence Tag Generation Based on CRF
When making predictions for the labels of entities, the interdependence and influence between the labels should be considered. When the output layer of a named entity recognition framework predicts the labeling results, the softmax function can usually be used to calculate the label probabilities. Nonetheless, the softmax layer predicts the label of each word independently, without considering the interdependence and influence between labels. Because the relationship between adjacent tags in the sequence is ignored, the tags predicted with the softmax function can be unreasonable, for example, the beginning tag B-PER of a person name followed by the non-beginning tag I-LOC of a place name. The CRF layer can use the dependency information between adjacent tags for sentence-level annotation: it adds a transition score matrix over the tags, computes the optimal solution for the whole sequence, and thus obtains the globally optimal tag sequence, avoiding the unreasonable labels predicted by the softmax layer. The structure of the CRF is shown in Figure 4.
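The kind of inconsistency described above can be made explicit with a small check on adjacent tags. The sketch assumes a BIO tagging scheme with tags of the form B-TYPE, I-TYPE, and O; the function is illustrative only, since a CRF learns such preferences through transition scores rather than hard rules.

```python
def is_valid_transition(prev_tag, next_tag):
    """Return True if next_tag may follow prev_tag under the BIO scheme."""
    if next_tag.startswith("I-"):
        # An inside tag must continue an entity of the same type.
        return prev_tag[:2] in ("B-", "I-") and prev_tag[2:] == next_tag[2:]
    # B-* and O may follow any tag.
    return True


print(is_valid_transition("B-PER", "I-LOC"))  # False
print(is_valid_transition("B-PER", "I-PER"))  # True
```

An independent softmax over each position has no access to `prev_tag`, which is why it can emit the first combination, whereas a learned transition matrix can assign it a very low score.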
Given an input sequence $X = (x_1, x_2, \ldots, x_n)$ with the corresponding tag sequence $y = (y_1, y_2, \ldots, y_n)$, the score of the tag sequence is expressed in Equation (23):
$$S(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{23}$$
where $A$ is the transition matrix, $A_{ij}$ is the transition score from label $i$ to label $j$, $y_i$ is the $i$-th label in the sentence, $k$ is the number of label types, and $A \in \mathbb{R}^{(k+2) \times (k+2)}$, with $y_0$ and $y_{n+1}$ denoting the start and end tags of the sentence. Assuming that the length of the sentence is $n$, the score matrix of the output layer is $P \in \mathbb{R}^{n \times k}$, and the matrix element $P_{ij}$ is the output score of the $i$-th word under the $j$-th label.
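Normalizing $S(X, y)$ with the softmax function gives the probability distribution of the output sequence $y$, as expressed in Equation (24):

$$p(y \mid X) = \frac{e^{S(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{S(X, \tilde{y})}} \tag{24}$$

Let $Y_X$ denote the set of all possible tag sequences for $X$, with $\tilde{y}$ a candidate sequence and $y$ the actual tag sequence. Then the log-likelihood for a given training sample is expressed as Equation (25):

$$\log p(y \mid X) = S(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{S(X, \tilde{y})} \tag{25}$$

The model is trained by maximizing this log-likelihood function, and the optimal tag sequence for an input sequence is the one with the highest score, as expressed in Equation (26):

$$y^{*} = \arg\max_{\tilde{y} \in Y_X} S(X, \tilde{y}) \tag{26}$$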
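To make Equations (23)-(26) concrete, the following is a minimal pure-Python sketch of the CRF computations, not the authors' implementation. It assumes the transition matrix is stored as a (k+2) x (k+2) nested list whose last two indices are hypothetical start and end tags, and the toy scores in `A` and `P` are invented for illustration only.

```python
import math


def sequence_score(P, A, y, start, end):
    """Equation (23): transition scores (including start/end) plus emission scores."""
    path = [start] + list(y) + [end]
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    return transition + emission


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(P, A, k, start, end):
    """Log of the denominator of Equation (24), computed with the forward algorithm."""
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [
            _logsumexp([alpha[p] + A[p][j] for p in range(k)]) + P[i][j]
            for j in range(k)
        ]
    return _logsumexp([alpha[j] + A[j][end] for j in range(k)])


def log_likelihood(P, A, y, k, start, end):
    """Equation (25): log p(y | X)."""
    return sequence_score(P, A, y, start, end) - log_partition(P, A, k, start, end)


def viterbi(P, A, k, start, end):
    """Equation (26): highest-scoring tag sequence by dynamic programming."""
    score = [A[start][j] + P[0][j] for j in range(k)]
    backpointers = []
    for i in range(1, len(P)):
        step, new_score = [], []
        for j in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + A[p][j])
            step.append(best_prev)
            new_score.append(score[best_prev] + A[best_prev][j] + P[i][j])
        backpointers.append(step)
        score = new_score
    last = max(range(k), key=lambda j: score[j] + A[j][end])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    return path[::-1]


# Toy example (assumed values): k = 2 labels, index 2 = start tag, index 3 = end tag.
K, START, END = 2, 2, 3
A = [
    [0.5, -1.0, 0.0, 0.2],
    [-0.3, 0.8, 0.0, 0.1],
    [0.1, -0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
P = [[1.0, 0.2], [0.3, 0.9], [0.4, 0.6]]  # n = 3 words, k = 2 labels

best_path = viterbi(P, A, K, START, END)
best_log_prob = log_likelihood(P, A, best_path, K, START, END)
```

The forward algorithm is used here so that the normalizer in Equation (24) does not require enumerating all $k^n$ tag sequences; for the toy sizes above, a brute-force enumeration gives the same value and is a convenient sanity check.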

ALBERT-AttBiLSTM-CRF Method
The ALBERT-AttBiLSTM-CRF model is shown in Figure 5. It consists of the ALBERT layer, the BiLSTM layer, the Self-attention layer, and the CRF layer. The right half of Figure 5 lists the code architecture of the ALBERT-AttBiLSTM-CRF method. The ALBERT layer obtains word embeddings that reflect the context of the text, the BiLSTM layer extracts the context features of the text, the Self-attention layer combines the document matching method with the attention mechanism, and the CRF layer captures the dependencies between adjacent tags and obtains the tag sequence with the highest probability.
We calculated the character feature, sentence feature, and position feature of each word in the sentence, and added the three together to get the total representation of the word. The final feature vector was obtained by passing this representation through a multilayer Transformer, in which the input of each Transformer layer is the output of the previous one. The word-level feature vector produced by ALBERT is then used as the input of the next step, AttBiLSTM. In AttBiLSTM, the output sequences of the forward and backward LSTM hidden layers are calculated separately, and the output sequences of the two directions are spliced to obtain the final feature matrix. This feature matrix is used as the input of the next step, Self-attention, where a more suitable vector representation of the entity is obtained. In the CRF, the transition matrix is defined based on the feature matrix output by the previous step, and the globally optimal tag sequence is obtained through the relationship between adjacent tags.

Model Transfer Considering Active Learning (MTAL)
The main difficulty of domain-specific NER lies in the lack of a standard labeled corpus. However, a deep network model usually requires a large labeled corpus for training; otherwise, it easily overfits. Therefore, applying a deep network model directly to a specific domain often does not work well, and domain NER has to solve the problem of an insufficient labeled corpus.

Transfer Learning
In some new fields, it is difficult to get enough labeled data to train a supervised model for the task at hand. Therefore, it is imperative to establish a reliable model using only a small amount of labeled training data. Transfer learning is well suited to this problem; it applies existing knowledge to solve problems in different but related fields.
Transfer learning is a machine learning method that transfers knowledge from the source domain to the target domain to achieve a better learning effect in scenarios where the source domain data are sufficient and the target domain data are scarce. The value of transfer learning lies in the following three aspects. Firstly, the existing knowledge domain data can be reused, so that the large amount of work already done is not discarded. Secondly, there is no need to devote considerable cost to recollecting and calibrating a vast new dataset. Thirdly, a model can be quickly transferred and applied to emerging fields, which is an advantage when time is limited.

Active Learning
The self-learning algorithm selects automatically labeled data with a confidence level higher than a threshold and adds them to the labeled training samples, iteratively expanding the training set so that the model can learn more data features and improve its effect and generalization performance. Although self-learning can improve model performance from unlabeled data without human intervention, it has the disadvantage that the samples selected in each iteration contain too little new information, so the model converges too slowly: for data labeled with a high confidence level, the model has evidently already learned most of their features [39].
Active learning allows the learning algorithm to proactively propose which data should be labeled when unlabeled data are abundant, labeled data are scarce, and manual labeling is expensive. The selected data are sent to experts for manual labeling, and the labeled data are then added to the training samples. This algorithm is similar to self-learning in that it selects valuable samples from the unlabeled data and adds them to the labeled data for iterative learning. However, one significant difference between the two is that active learning requires human intervention, while self-learning does not.

Model Transfer Method Considering Active Learning
If the previously proposed lite multinetwork collaboration model is directly transferred to the manufacturing domain, it faces a small labeled domain corpus. If the ALBERT-AttBiLSTM-CRF model is trained only with this small labeled corpus, the model overfits severely and generalizes poorly, so its recognition performance drops when a new corpus is encountered. The combination of active learning and self-learning (MTAL) is therefore applied to the training process of the model proposed in this paper (details are shown in Algorithm 1). The conditional probability p(Y|X) output by the CRF layer is chosen as the confidence level of the prediction, and confidence thresholds are set. During the iterative process, samples with confidence levels above the threshold are added directly to the training samples, while samples with confidence levels below the threshold are handed over for manual labeling. This alleviates the scarcity of domain corpus data and improves the generalizability of the model.

ALBERT-AttBiLSTM-CRF Model Transfer Considering Active Learning Method
By combining the methods proposed in Sections 3.1 and 3.2, we obtain a novel Chinese fine-grained NER method based on lite multinetwork collaboration and active learning (the ALBERT-AttBiLSTM-CRF Model Transfer Considering Active Learning method), as shown in Figure 6. First, ALBERT is used to extract the word vectors of the text data. An AttBiLSTM then captures features from the input vectors and obtains the contextual information of the text. The feature matrix output by the previous network is labeled by the CRF to obtain the labeled sequence. Then, a model transfer method considering active learning produces a new model by transferring the model trained on the source domain to the target domain. Based on the new model, active learning is used to label the unlabeled data, continuously augmenting the amount of data and improving the quality of the model.

Evaluation Metrics
To validate the effectiveness of the proposed methodology, comparative experiments are designed against mainstream NER methods. To verify the robustness of the algorithm on fine-grained entities, the CLUENER2020 dataset is selected for validation. K-fold cross-validation was used, with the corpus partitioned into 10 equal parts, each obtained by stratified sampling. In each experiment, nine parts were used for training and one part for testing, in rotation. The metrics evaluated were precision P, recall R, and F1 score, as shown in Equations (27)-(29), where TP is the number of correctly recognized entities, FP is the number of incorrectly recognized entities, and FN is the number of entities that were not recognized:

$$P = \frac{TP}{TP + FP} \tag{27}$$

$$R = \frac{TP}{TP + FN} \tag{28}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{29}$$

Finally, the results of the 10 experiments were averaged, and the average is used as the indicator of model performance.
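The metric computation and fold averaging described above can be sketched as follows. This is an illustrative sketch only: it assumes entity-level exact matching, with each entity represented as a hypothetical `(start, end, type)` tuple, which the paper does not state explicitly, and the function names are not from the original work.

```python
def precision_recall_f1(gold_entities, predicted_entities):
    """Entity-level P, R, and F1, counting a prediction as correct only on exact match."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_over_folds(fold_scores):
    """Average a list of (P, R, F1) tuples, one tuple per cross-validation fold."""
    n = len(fold_scores)
    return tuple(sum(scores[i] for scores in fold_scores) / n for i in range(3))


# Toy example: four gold entities, three predictions, two of them exactly right.
gold = [(2, 5, "MATERIAL"), (5, 7, "PART"), (9, 11, "SHAPE"), (12, 14, "PART")]
pred = [(2, 5, "MATERIAL"), (5, 7, "PART"), (9, 10, "SHAPE")]
p, r, f1 = precision_recall_f1(gold, pred)  # P = 2/3, R = 1/2, F1 = 4/7
```

In a 10-fold setting, `precision_recall_f1` would be called once per held-out fold and the ten tuples passed to `average_over_folds`.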


Dataset Acquisition and Annotation
CLUENER2020 Dataset
CLUENER2020 [40] is a fine-grained dataset for C-NER that contains 10 entity categories. In addition to the usual labels that are common to other datasets, such as person name, organization, and location, it also provides further types for more real-world scenarios. This dataset is more significant than other current C-NER datasets, reflecting real-world application results.

Manufacturing-NER Dataset
In the field of manufacturing, knowledge and rules are usually presented as unstructured data, mainly text, such as product standard documents, product design specifications, patents, and technical reports. To mine the knowledge and rules from these textual data, it is essential to process them. NER is an essential technology in NLP. It can extract all kinds of meaningful entities from text data and is a critical prerequisite for constructing knowledge question answering systems and knowledge graphs. At present, NER is widely applied in the general knowledge domain, where it has achieved remarkable results on proper nouns such as place names, person names, and organization names, and on meaningful quantifiers such as time and date. Scholars have also conducted research in specialized fields, but less research has been conducted in traditional manufacturing industries. Since there is no dataset related to manufacturing, this paper uses a Scrapy crawler to obtain text data from web pages, documents, patents, technical reports, and so on. The architecture diagram of the corpus crawler is shown in Figure 7.
The Scrapy framework includes an engine, scheduler, downloader, spiders (crawlers), item pipeline, downloader middleware, spider middleware, and scheduling middleware. The engine controls the data processing flow of the entire system and triggers transaction processing. The scheduler receives requests from the engine, puts them into a queue, and returns them when the engine requests them again. The downloader downloads web page content and returns it to the spider. The spider extracts the information it needs from a particular web page. The item pipeline is responsible for storing and filtering the data.
After obtaining the corresponding textual data, we annotate the corpus. We use the YEDDA system (a lightweight collaborative text span annotation tool) to manually annotate eight types of named entities in the Chinese corpus, as shown in Figure 8. To obtain high-quality label predictions and add the corresponding constraints, we use the BIO annotation method. Each element is annotated as "B-X," "I-X," or "O." "B-X" means that the element belongs to a fragment of type X and is at the beginning of the fragment, "I-X" means that the element belongs to a fragment of type X and is inside the fragment, and "O" indicates that the element is not of any type. Dataset descriptions of CLUENER2020 and Manufacturing-NER are shown in Table 2.
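As an illustration of the BIO scheme, the sketch below converts character-level entity spans into BIO tags and back. The sentence, the span offsets, and the type names `MATERIAL` and `PART` are hypothetical examples chosen for illustration; they are not taken from the Manufacturing-NER dataset.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags):
    """Recover (start, end, label) spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical sentence: "采用不锈钢轴承" ("uses stainless steel bearings").
text = "采用不锈钢轴承"
spans = [(2, 5, "MATERIAL"), (5, 7, "PART")]
tags = spans_to_bio(text, spans)
# tags == ["O", "O", "B-MATERIAL", "I-MATERIAL", "I-MATERIAL", "B-PART", "I-PART"]
```

Because Chinese text is tagged per character here, two adjacent entities are separated only by the "B-" prefix of the second one, which is the constraint that the CRF layer later exploits.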

Baseline Algorithms
BERT [20]: BERT is a Transformer-based model that is pre-trained on text data and then fine-tuned on different downstream tasks, such as NER.
RoBERTa [41]: RoBERTa builds on the masked language modeling strategy of BERT and modifies key hyper-parameters, including removing BERT's next-sentence prediction pre-training objective.
ALBERT [23]: ALBERT is a lightweight BERT that achieves excellent results while reducing the number of model parameters through factorization and parameter sharing.
BiLSTM-CRF [42]: The BiLSTM-CRF model combines the ability of the BiLSTM to remember context with the strength of the CRF, which constrains the output labels through the transition probability matrix.
Human Performance [40]: To better understand the difficulty of the task and the performance of modern models compared with humans, human performance tests were conducted in the experiment.
En2BiLSTM-CRF [43]: This model contains an initial encoding layer, an enhanced encoding layer, and a decoding layer, and combines the advantages of pre-trained model encoding, dual BiLSTM, and a residual connection mechanism.
ALBERT-AttBiLSTM-CRF (ours): Character-level embeddings are obtained from the ALBERT language model pretrained on large-scale text and fed into the BiLSTM to extract contextual information; finally, the corresponding entity labels are obtained via the CRF. According to the size of the ALBERT pretraining corpus and the number of model parameters, ALBERT editions are divided into tiny, base, large, and so on.


Results and Discussion
The ALBERT-AttBiLSTM-CRF is applied to the CLUENER2020 fine-grained dataset along with the other benchmark algorithms, and the NER results are shown in Figure 9. The ALBERT-AttBiLSTM-CRF is better than the best RoBERTa result provided by the publisher of the CLUENER2020 dataset in terms of P, R, and F1 score. In particular, it has a 5.4% higher F1 score. To thoroughly verify the effect of different ALBERT versions on entity recognition, we compared four versions, tiny, base, large, and xlarge, in Figure 10. Table 3 shows the scores of the different algorithms on CLUENER2020. The overall effect does not improve monotonically as the pretraining corpus size and the number of parameters increase. The best results appear with the large edition, where the P, R, and F1 score are 0.9253, 0.8702, and 0.8962, respectively. These are 13.27, 5.33, and 9.2 percentage points higher than the corresponding RoBERTa benchmark results (0.7926, 0.8169, and 0.8042 in Table 3).
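The reported improvements over the RoBERTa benchmark are absolute differences between scores, expressed in percentage points. A small check using the values listed in Table 3 (the variable names are illustrative only):

```python
# Scores taken from Table 3 of this paper.
roberta = {"P": 0.7926, "R": 0.8169, "F1": 0.8042}
ours_large = {"P": 0.9253, "R": 0.8702, "F1": 0.8962}

# Absolute gain in percentage points for each metric.
gains = {
    metric: round((ours_large[metric] - roberta[metric]) * 100, 2)
    for metric in ours_large
}
print(gains)  # {'P': 13.27, 'R': 5.33, 'F1': 9.2}
```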

Table 3. Comparison of different algorithms on CLUENER2020.

| Method | Precision | Recall | F1-Score |
|---|---|---|---|
| BiLSTM-CRF [42] | 0.7106 | 0.6897 | 0.7000 |
| ALBERT-BiLSTM-CRF [44] | 0.8876 | 0.8270 | 0.8555 |
| ALBERT-CRF [43] | 0.8094 | 0.6120 | 0.6936 |
| ALBERT-BiLSTM [43] | 0.7736 | 0.8132 | 0.7925 |
| En2BiLSTM-CRF [43] | 0.9156 | 0.8337 | 0.8720 |
| ALBERT [23] | 0.7992 | 0.6459 | 0.7107 |
| BERT [20] | 0.7724 | 0.8046 | 0.7882 |
| RoBERTa [41] | 0.7926 | 0.8169 | 0.8042 |
| Human Performance [40] | 0.6574 | 0.6217 | 0.6341 |
| ALBERT-AttBiLSTM-CRF (ours) | 0.9253 | 0.8702 | 0.8962 |

To solve the problem of scarce labeled domain data, a deep transfer learning solution is proposed that addresses the target domain problem using mature source domain data and establishes a reliable model that makes accurate predictions from existing knowledge. The model-based approach learns features through the adaptive characteristics of neural network models, extracts label features from the public data, and transfers the parameter weights of the deep neural network to achieve transfer learning of the source domain data features. Still, the manufacturing domain data lack technical entity labels, so labeling the data is challenging, and we adopt a self-learning method to label data and increase the amount of labeled data. The previous experimental results show that ALBERT-AttBiLSTM-CRF_large achieved the best results on the public CLUENER2020 dataset, so the ALBERT-AttBiLSTM-CRF_large model was used for model transfer. The comparison of results before and after the transfer is shown in Figure 11.

Figure 11 shows that, after pretraining and fine-tuning the model, the F1 score after transfer learning reached 0.8931, which is 3.55 percentage points higher than before the transfer (0.8576). For the entity types structure, material, model, function, shape, part, and parameter, the F1 scores reached 0.8029, 0.9458, 0.8847, 0.7447, 0.9528, 0.8791, and 0.5027, respectively. In the manufacturing field, characteristics such as structure, material, and shape are fixed and can be easily identified based on their lexical and entity composition structure. The naming of basic components is relatively standardized, and the character sequences appearing in the entity names are fixed, so they can be accurately identified by the designed model. By contrast, the expressions of functions and parameters are more random and appear in more formats, and different words are used to describe the same function in different corpora, which makes these entities harder to identify.

Conclusions and Future Work
In this paper, a novel fine-grained Chinese named entity recognition method (ALBERT-AttBiLSTM-CRF Model Transfer Considering Active Learning) for domain text data is proposed, based on lite deep multinetwork collaboration and transfer learning considering active learning. First, we addressed the shortcomings of traditional pretrained language models, which have a large number of parameters, long computation times, and poor applicability to industrial fields. We used a lite pretrained model to produce the embedding representation of the text data, and a BiLSTM for feature extraction, which effectively captures the contextual information of the text. The extracted features were fed into the conditional random field to obtain the entity labels of the text. Then, considering the situation where there are few domain data and no useful public dataset, we proposed a model transfer method considering active learning by combining active learning and transfer learning. This method used the public dataset to pretrain the proposed ALBERT-AttBiLSTM-CRF_large model and labeled domain data to adjust the model parameters, with active learning on unlabeled domain data used to optimize the results of domain named entity recognition. To verify the validity of the proposed methodology (the ALBERT-AttBiLSTM-CRF_large method), it was compared with mainstream methods such as BERT and RoBERTa on the CLUENER2020 dataset; compared with the current best benchmark, RoBERTa-wwm-large-ext, the results improved by 9.2%. Using the ALBERT-AttBiLSTM-CRF_large model, which had the best results on CLUENER2020, as the source model, we transferred to the Manufacturing-NER dataset using MTAL. The transfer results showed an improvement of 3.55%, supporting the effectiveness of the ALBERT-AttBiLSTM-CRF Model Transfer Considering Active Learning method. Although this article applies the method to the manufacturing field, it can be applied to other fields such as healthcare; it is only necessary to fine-tune the model on the text data and parameters of the related field to improve its accuracy.
Future work is needed on the problem that individual entity classes are not well identified in fine-grained NER. Fine-grained named entity recognition suffers from class imbalance, because the number of entities differs greatly from one category to another [45]. Unbalanced data can lead to low recognition accuracy for categories with a small number of entities. The next step will be to study oversampling and undersampling of textual data to improve the overall entity recognition effect.

Figure 1. The framework of a lite BERT language representation.

Figure 2. Model visualization of BERT and ALBERT.

Feature Extraction Based on AttBiLSTM
In NLP, the data involved often have backward and forward correlations, and traditional feedforward NNs are no longer able to handle this type of data well. The RNN structure is divided into the input layer, the intermediate layer, and the output layer. However, RNN does not handle this problem

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ denote the intermediate states of the forward and backward LSTM outputs, respectively. Then the entire intermediate state of the BiLSTM output at time $t$ is $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
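The concatenation of forward and backward states can be illustrated with a small sketch. This is not the paper's implementation: a scalar tanh recurrence stands in for the LSTM cell, and the names `run_rnn`, `bidirectional_states`, `w_in`, and `w_rec` are hypothetical. It only shows how the backward pass is re-aligned to time order before the two states are paired at each step.

```python
import math

def run_rnn(inputs, w_in, w_rec):
    """Run a toy scalar tanh recurrence (a stand-in for an LSTM cell)."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(inputs, w_in=0.5, w_rec=0.3):
    """Return h_t = [forward h_t; backward h_t] for every time step t."""
    forward = run_rnn(inputs, w_in, w_rec)
    # Process the reversed sequence, then flip back so index t matches forward[t].
    backward = run_rnn(inputs[::-1], w_in, w_rec)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

states = bidirectional_states([1.0, 2.0, 3.0])
```

In this sketch, the first component of `states[t]` depends only on inputs up to step t, and the second only on inputs from step t onward, which is why the pair carries context from both directions.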

Figure 5. The architecture of the proposed ALBERT-AttBiLSTM-CRF model.

Algorithm 1. Model Transfer Considering Active Learning (L, U, M, Conf)
Input: L: labeled training dataset; U: unlabeled training dataset; M: proposed model; Conf: confidence level.
Output: Trained model M'.
1: // Model transfer
2: Use the training set L to train to get the model M;
3: Use model M to predict the unlabeled dataset U;
4: Calculate the score of the conditional probability P(Y|X) of the model's CRF layer as the confidence level Conf;
5: // Active learning
6: for each Conf of samples do
7:     Select all samples U_high with Conf ≥ Conf_high, add them to L (L = L + U_high), and delete U_high from U (U = U − U_high);
8:     Select all samples U_low with Conf < Conf_low, re-annotate them manually, add them to L (L = L + U_low), and delete U_low from U (U = U − U_low);
9:     Follow the above steps to iterate n times until the model M' converges.
10: end for
11: return Trained model M'.
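The selection loop of Algorithm 1 might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: `train`, `annotate`, and the returned `predict` callable are hypothetical placeholders for model training, manual annotation, and CRF-based prediction, and convergence is approximated by a fixed number of rounds `n_rounds`.

```python
def mtal_round(labeled, unlabeled, predict, annotate, conf_high, conf_low):
    """One selection round: split U by confidence, as in steps 7 and 8."""
    remaining = []
    for x in unlabeled:
        tags, conf = predict(x)  # hypothetical: returns (tag sequence, confidence)
        if conf >= conf_high:
            labeled.append((x, tags))         # keep the model's own labels
        elif conf < conf_low:
            labeled.append((x, annotate(x)))  # manual re-annotation
        else:
            remaining.append(x)               # stays in U for a later round
    return labeled, remaining

def mtal(labeled, unlabeled, train, annotate, conf_high, conf_low, n_rounds):
    """Retrain after each round; stop early once U is empty (an assumption)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    predict = train(labeled)  # hypothetical: returns a predict(x) callable
    for _ in range(n_rounds):
        if not unlabeled:
            break
        labeled, unlabeled = mtal_round(
            labeled, unlabeled, predict, annotate, conf_high, conf_low
        )
        predict = train(labeled)
    return predict, labeled, unlabeled
```

Samples whose confidence falls between the two thresholds are neither trusted nor sent for annotation; they remain in U and are scored again after retraining.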
the previous network is labeled by the CRF to obtain the labeled sequence. Then, a model transfer method considering active learning is proposed to get a new model by transferring the model obtained by training on the source domain to the target domain. Based on the new model, active learning is used to label the unlabeled data to continuously augment the amount of data and improve the quality of the model.
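Algorithm 1 uses the conditional probability from the CRF layer as the confidence that drives active learning. As a rough illustration, the probability P(Y|X) of a tag sequence under a linear-chain CRF can be computed from emission and transition scores with the forward algorithm. The sketch below assumes plain Python lists for the scores and omits start and end transitions; the function names are hypothetical and the paper does not specify its implementation.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(emissions, transitions, tags):
    """Unnormalized score of one tag path: emission plus transition terms."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

def log_partition(emissions, transitions):
    """log Z(X): log-sum of the scores of all tag paths (forward algorithm)."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + emissions[t][j]
            for j in range(n_tags)
        ]
    return logsumexp(alpha)

def sequence_confidence(emissions, transitions, tags):
    """P(Y|X) = exp(score(X, Y) - log Z(X))."""
    return math.exp(
        sequence_score(emissions, transitions, tags)
        - log_partition(emissions, transitions)
    )
```

Here `emissions[t][j]` is the score of tag j at position t and `transitions[i][j]` is the score of moving from tag i to tag j. The confidence of the predicted path could then be compared against the high and low thresholds.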

Figure 6. The architecture of the proposed ALBERT-AttBiLSTM-CRF Model Transfer Considering Active Learning.

Figure 9. The results of baseline algorithms on CLUENER2020.

Figure 11. The results on Manufacturing-NER before and after MTAL (model transfer considering active learning).

Table 1. Parameter analysis of BERT and ALBERT models.

Table 2. Dataset descriptions of CLUENER2020 and Manufacturing-NER.

Table 3. Comparison of different algorithms on CLUENER2020.