4.2.3. Comparison Experiments of Benchmark Models
This subsection aims to verify the effectiveness of the features proposed in this paper for optimizing the performance of Chinese named entity recognition. It first conducts self-comparative experiments to evaluate the impact of radical and glyph feature extraction, lexical enhancement, pre-training modules, and their combinations on entity recognition performance. Then, the proposed multi-feature entity recognition model is compared with other classical Chinese named entity recognition methods that enhance lexical information to validate its effectiveness.
The specific self-comparative experiments include the following:
- Radical: only the radical and glyph features are used to enhance character features, followed by entity recognition with the LSTM-CRF network.
- Lexicon: only the lexical enhancement strategy is used to enhance character features.
- Radical+Lexicon: combines radical and glyph features with character features, followed by lexical enhancement.
- Radical+Lexicon+BERT-wwm: combines radical and glyph features, lexical enhancement features, and BERT pre-training features.

The comparative experiments with other methods include the following:
- Lattice-LSTM [14]: introduced lexical enhancement into CER but had processing issues.
- WC-LSTM [19]: proposed a lexical encoding strategy that supports parallel computation and portability but suffered from information loss.
- LR-CNN (2019) [34]: used attention mechanisms to handle lexical conflicts but was computationally complex.
- SoftLexicon (2020) [20]: adaptive lexical enhancement that fully utilizes all matched lexical information, with strong model independence.
- FLAT (2020) [18]: introduced position encoding and self-attention mechanisms, solving issues of lexical loss and long-distance dependencies. Precision (P) and Recall (R) are not reported for FLAT; it is typically evaluated with the F1 score, which balances precision and recall (see the definition below) and therefore better reflects performance where the trade-off between false positives and false negatives matters.
- MECT (2021) [35]: improved on FLAT by integrating a dual-stream attention mechanism with radical and glyph information to enhance accuracy.

Additionally, the paper compares models such as Word-LSTM-CRF (direct segmentation), Char-LSTM-CRF (no segmentation), and BERT-LSTM-CRF (inputting characters into a pre-trained BERT model).
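For reference, the F1 score used as the primary metric throughout these comparisons is the standard harmonic mean of precision (P) and recall (R):

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$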
The models were run on four Chinese named entity recognition public datasets: Resume, Weibo, MSRA, and OntoNotes4, with the following results:
- (1) Resume dataset [14] comparative experiment results.
It can be seen from Table 4 that, on the Resume dataset, all models achieve high scores owing to the high quality of the text and the dense entity information. Models based on the Radical and Lexicon strategies outperformed the Char-LSTM-CRF model in precision, recall, and F1 score, demonstrating the effectiveness of the radical and glyph features and the lexical enhancement features. The Lexicon model performed better than the Radical model, indicating that lexical enhancement contributes more to the performance improvement.
The Radical+Lexicon model surpassed the models using the Radical or Lexicon strategy alone and matched the performance of the latest MECT lexical enhancement strategy, confirming the effectiveness of the gating-based multi-feature fusion strategy. The Radical+Lexicon+BERT-wwm strategy performed best, with the highest precision, recall, and F1 score (96.44%). Compared with the traditional BERT-LSTM-CRF model, the radical and glyph features and the lexical enhancement strategy also improved the BERT pre-trained model, although the gain was limited because the Resume dataset is already well annotated.
As shown in Figure 7, the loss of the Radical+Lexicon+BERT-wwm multi-feature fusion model declines fastest and converges to the smallest value, indicating the effectiveness of the proposed multi-feature fusion model and its advantage in training speed.
In summary, the multi-feature fusion entity recognition strategy proposed in this paper performs best, achieving the highest F1 score, followed by the Radical+Lexicon fusion strategy without pre-training features, then the Lexicon enhancement strategy alone, the Radical (radical and glyph feature) strategy alone, and finally the simple Char-LSTM-CRF and Word-LSTM-CRF models.
- (2) Weibo dataset [14] comparative experiment results.
It can be seen from Table 5 that, in the comparative experiments on the Weibo dataset, the scores of all models are generally low because of the diversity of entity categories and the irregular, colloquial text. The F1 score of the Word-LSTM-CRF model was 5.44 percentage points lower than that of the character-based Char-LSTM-CRF model, indicating that, in irregular texts, incorrect segmentation harms entity boundary recognition and thus reduces performance.
Models based on the Radical and Lexicon strategies achieved higher precision, recall, and F1 scores than Char-LSTM-CRF, demonstrating that radical and glyph features and lexical enhancement can improve model performance. The Lexicon strategy was especially effective for irregular texts like Weibo, with an improvement of 9.95 percentage points. The Radical+Lexicon model outperformed the models using only the Radical or Lexicon strategy and exceeded the latest MECT strategy, making it suitable for scenarios with small training sets and irregular annotations.
The Radical+Lexicon+BERT-wwm strategy achieved the highest F1 score of 69.40%, an increase of 4.99 percentage points compared to the Radical+Lexicon model, indicating that, for irregular texts like Weibo, BERT features, with their large-scale unsupervised corpus knowledge, significantly impact performance improvement. Compared to the traditional BERT model, the Radical+Lexicon+BERT-wwm model improved by 2 percentage points, demonstrating that the multi-feature fusion strategy is still applicable on BERT models, further enhancing the performance of BERT entity recognition models.
The ranking on Weibo therefore mirrors that on Resume: the proposed multi-feature fusion strategy achieves the highest F1 score, followed by the Radical+Lexicon strategy without pre-training features, the Lexicon strategy alone, the Radical strategy alone, and finally the simple Char-LSTM-CRF and Word-LSTM-CRF models.
- (3) MSRA dataset [14] comparative experiment results.
It can be seen from Table 6 that all models perform well on the MSRA dataset because of its large scale and small number of categories. The F1 score of the Word-LSTM-CRF model is 2 percentage points lower than that of the Char-LSTM-CRF model, again indicating that character-based models outperform segmentation-based models.
Models based on the Radical and Lexicon strategies have higher F1 scores than the Char-LSTM-CRF model that uses only character sequences, proving the effectiveness of the radical and glyph features and the lexical enhancement features. The Radical+Lexicon model outperforms the models using either Radical or Lexicon alone and is slightly better than the latest MECT strategy, by 0.25 percentage points, validating the effectiveness of the multi-feature fusion strategy.
The Radical+Lexicon+BERT-wwm multi-feature fusion model achieves the highest F1 score in all experiments, reaching 95.40%, while the Radical+Lexicon model also achieves 95.08%. However, the improvement from BERT features is not significant, suggesting that, for large-scale corpora like MSRA, traditional models already perform well, and BERT features offer limited performance enhancement. Compared to the traditional BERT model, the Radical+Lexicon+BERT-wwm model shows a slight increase of 0.4 percentage points in F1 score.
- (4) OntoNotes4 dataset [14] comparative experiment results.
From Table 7, it is observed that the F1 score of the Word-LSTM-CRF model is 1.33 percentage points higher than that of the Char-LSTM-CRF model, indicating that character-based recognition models are not always superior to word segmentation-based models.
Models based on the Radical and Lexicon strategies significantly outperform the Char-LSTM-CRF model, which only uses character sequences. This demonstrates the effectiveness of both radical and glyph feature enhancement and lexical feature enhancement, with lexical enhancement proving more effective than the radical and glyph features. The Radical+Lexicon model surpasses the models using either the Radical or Lexicon strategy alone and is 2.2 percentage points higher than the newly proposed MECT strategy, validating the effectiveness of the gating-based multi-feature fusion strategy. The Radical+Lexicon+BERT-wwm model proposed in this paper achieves the highest F1 score among all algorithms, 81.89%, confirming the effectiveness of the multi-feature fusion model, although the improvement over the BERT-LSTM-CRF model, which reaches 81.82%, is marginal.
4.2.4. Optimization Effect Verification Experiment
- (1) Experiment to Verify the Optimization Effect of the Adaptive DSC Loss Function
The purpose of this experiment is to verify the effect of the adaptive DSC loss function on the performance of the multi-feature semantic enhancement entity recognition model. The model used is the Radical+Lexicon+BERT-wwm model, and the main variable is the loss function, i.e., the traditional Cross-Entropy (CE) loss or the adaptive DSC loss. The F1 value is again used as the evaluation metric, with a higher F1 value indicating better entity recognition performance. The experimental results are shown in Table 8 and Figure 8.
From Table 8, it is evident that the multi-feature fusion models using the adaptive DSC loss function achieve higher F1 scores on all four CER datasets than those using the cross-entropy (CE) loss function.
The improvement was more pronounced in the less standard Weibo and Resume datasets, where there are many entity categories with significant category bias. In contrast, the MSRA and OntoNotes4 datasets, with fewer entity categories, more balanced data, and more annotated samples, showed less improvement with the DSC loss function.
Taking the Resume dataset as an example, the Radical+Lexicon+BERT-wwm+DSC model reached a precision of 96.61% and a recall of 96.26%, a small gap between the two, whereas the Radical+Lexicon+BERT-wwm+CE model reached a precision of 97.61% and a recall of 92.26%, a much larger gap.
This suggests that the adaptive DSC loss function balances Precision and Recall, bringing them closer together, thereby improving the F1 score.
The results in Table 8 show that models using the adaptive loss function achieved higher F1 values. Therefore, the adaptive DSC loss function can optimize the entity recognition performance of models, especially on smaller datasets and those with sample category bias.
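For concreteness, the following is a minimal token-level sketch of a self-adjusting Dice-style (DSC) loss in the spirit of Li et al. (2020), written in PyTorch. The hyper-parameters `alpha` and `gamma`, and how such a term would be combined with the CRF layer's sequence-level objective, are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_dsc_loss(logits: torch.Tensor, targets: torch.Tensor,
                      alpha: float = 0.6, gamma: float = 1.0) -> torch.Tensor:
    """Token-level self-adjusting Dice (DSC) loss.

    logits: (N, C) label scores per token; targets: (N,) gold label ids.
    alpha and gamma are illustrative values, not the paper's settings.
    """
    probs = F.softmax(logits, dim=-1)                          # (N, C)
    p_gold = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # prob. of the gold label
    weight = (1.0 - p_gold).pow(alpha)   # down-weights easy, well-classified tokens
    dsc = (2.0 * weight * p_gold + gamma) / (weight * p_gold + 1.0 + gamma)
    return (1.0 - dsc).mean()
```

In the CE setting, this term would simply be `F.cross_entropy(logits, targets)`; the down-weighting of easy tokens is consistent with the smaller precision-recall gap observed above.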
- (2) Experiment to Verify the Effectiveness of Data Augmentation Optimization
This experiment aims to verify the impact of data augmentation on the performance of the multi-feature semantic enhancement entity recognition model. Data augmentation under BERT-like pre-trained models shows no significant effect and may even degrade performance by introducing noise; it is also difficult to distinguish whether a performance gain comes from the augmentation strategy or from the BERT features. Therefore, this experiment does not use BERT pre-trained features. Experiment 1 uses the Radical+Lexicon feature fusion model, while Experiment 2 applies data augmentation to the training set under the same Radical+Lexicon strategy. The comparison before and after data augmentation on the Resume dataset is shown in Figure 9.
Before data augmentation, the data distribution in the Resume dataset was very imbalanced. After augmenting samples with fewer quantities, data augmentation significantly increased the presence of underrepresented entity categories like names and locations, reducing the impact of uneven entity distribution.
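To illustrate one way such augmentation can be carried out, the sketch below shows a simple mention-replacement scheme over BIO-labelled character sequences: mentions of underrepresented categories are swapped with other gold mentions of the same category to synthesize new training sentences. This is a hypothetical illustration only; the paper's actual augmentation procedure may differ.

```python
import random
from typing import Dict, List, Tuple

# A sentence is a list of (character, BIO-label) pairs.
Sentence = List[Tuple[str, str]]

def collect_mentions(data: List[Sentence]) -> Dict[str, List[str]]:
    """Gather every gold mention string per entity category."""
    mentions: Dict[str, List[str]] = {}
    for sent in data:
        i = 0
        while i < len(sent):
            _, lab = sent[i]
            if lab.startswith("B-"):
                cat, j = lab[2:], i + 1
                while j < len(sent) and sent[j][1] == f"I-{cat}":
                    j += 1
                mentions.setdefault(cat, []).append("".join(c for c, _ in sent[i:j]))
                i = j
            else:
                i += 1
    return mentions

def augment_sentence(sent: Sentence, mentions: Dict[str, List[str]],
                     rare_cats: set) -> Sentence:
    """Swap each mention of a rare category for another gold mention of that category."""
    out: Sentence = []
    i = 0
    while i < len(sent):
        ch, lab = sent[i]
        if lab.startswith("B-") and lab[2:] in rare_cats:
            cat, j = lab[2:], i + 1
            while j < len(sent) and sent[j][1] == f"I-{cat}":
                j += 1
            new = random.choice(mentions[cat])
            out.append((new[0], f"B-{cat}"))
            out.extend((c, f"I-{cat}") for c in new[1:])
            i = j
        else:
            out.append((ch, lab))
            i += 1
    return out
```

Sentences produced this way preserve the label structure while increasing the number of rare-category mentions seen during training.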
In this experiment, the F1 value is used as the evaluation metric, with higher F1 values indicating better entity recognition performance. It can be seen from Table 9 that, after data augmentation, the Radical+Lexicon model achieved higher precision, recall, and F1 values on all four CER datasets than the model without data augmentation; on the Weibo dataset in particular, the F1 score increased by 1.05 percentage points. This suggests that data augmentation brings clear gains in scenarios with many category labels, irregular annotations, and small sample sizes, while in settings with fewer category labels and more training data, such as the MSRA dataset, the improvement is less pronounced. These results indicate that augmenting datasets with imbalanced category labels can effectively enhance overall CER performance.
4.2.5. Ablation Control Experiment
- (1) Ablation control experiment on the feature fusion mode
The purpose of this experiment is to verify the impact of the multi-feature fusion based on the gating mechanisms proposed in this paper on the performance of entity recognition models. The experiment was conducted on four Chinese open-source entity recognition datasets: Resume, Weibo, MSRA, and OntoNotes4. Experiment 1 is based on the Radical+Lexicon+BERT-wwm+DSC model and employs the feature fusion strategy based on the gating mechanisms proposed in this paper. Experiment 2, under the same conditions as Experiment 1, directly uses linear concatenation for the fusion of the extracted four features.
Table 10 and Figure 10 show that the gating-based multi-feature fusion significantly improves entity recognition performance on small-scale datasets such as Weibo and Resume, with F1 scores increasing by 1.12 and 2.89 percentage points, respectively. This validates the effectiveness of the proposed multi-feature fusion method.
However, on large-scale and well-annotated datasets like OntoNotes4, the performance improvement with the gating mechanism-based multi-feature fusion is not significant, showing only a 0.26 percentage point increase. On the MSRA dataset, the F1 score of the linear concatenation strategy even surpasses the gating mechanism-based feature fusion model by 0.11 percentage points. This may be because the models on large-scale datasets with standardized annotations are already performing at a high level and are difficult to further improve. On the other hand, linear concatenation, by increasing the number of parameters in the fully connected layer, fits better in datasets with more training data. Therefore, future work could consider introducing more complex fusion strategies, such as multi-layer attention mechanisms or more intricate gating network models, to utilize more parameters for better model fitting.
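For concreteness, the sketch below contrasts a gate-based fusion module with the plain linear-concatenation baseline of Experiment 2, in PyTorch. The dimensions and the exact gating formulation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse a character feature with an auxiliary feature (radical/lexicon/BERT)
    through a learned gate instead of plain concatenation."""
    def __init__(self, char_dim: int, aux_dim: int) -> None:
        super().__init__()
        self.proj = nn.Linear(aux_dim, char_dim)
        self.gate = nn.Linear(char_dim * 2, char_dim)

    def forward(self, char_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        aux = torch.tanh(self.proj(aux_feat))
        g = torch.sigmoid(self.gate(torch.cat([char_feat, aux], dim=-1)))
        return g * char_feat + (1.0 - g) * aux   # element-wise gated mixture

class ConcatFusion(nn.Module):
    """Baseline of Experiment 2: a linear projection of the concatenated features."""
    def __init__(self, char_dim: int, aux_dim: int) -> None:
        super().__init__()
        self.proj = nn.Linear(char_dim + aux_dim, char_dim)

    def forward(self, char_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([char_feat, aux_feat], dim=-1))
```

The gate lets the model decide, per dimension, how much of each auxiliary feature to mix into the character representation, whereas the concatenation baseline relies entirely on the following linear layer, which may explain why it fits better when more training data are available, as noted above.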
- (2) Ablation control experiment with different dictionaries
This experiment aims to verify the impact of different dictionaries on the performance of multi-feature fusion entity recognition models, using datasets including the Resume and Weibo datasets. Experiment 1 utilizes the Radical+Lexicon+YJ dictionary model, while Experiment 2, under the same conditions, employs the Radical+Lexicon+LS dictionary for lexical feature fusion.
As seen in Table 11, the effectiveness of lexical enhancement varies with the dictionary used. On the Resume dataset, the F1 score with the YJ dictionary is 0.76 percentage points higher than with the LS dictionary; conversely, on the Weibo dataset, the Radical+Lexicon model based on the LS dictionary outperforms the YJ-based one by 0.75 percentage points. This indicates that integrating domain-specific dictionaries is important for improving domain entity recognition performance. The LS dictionary, sourced from Zhihu and Weibo, adapts better to the Weibo dataset but is less effective than the YJ dictionary in the more formal language of the Resume dataset.
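As background on how a dictionary feeds the lexical enhancement, the sketch below shows a simple prefix-trie lookup that collects the candidate dictionary words starting at each character position; swapping the YJ dictionary for the LS dictionary changes only the word list passed to `build_trie`. The matching and weighting actually used in the paper may differ.

```python
from typing import Dict, List, Tuple

class TrieNode:
    """Node of a character-level prefix trie built from a dictionary."""
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False

def build_trie(words: List[str]) -> TrieNode:
    """Insert every dictionary word (e.g. from the YJ or LS lexicon) into a trie."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def match_words(sentence: str, root: TrieNode, max_len: int = 8) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) spans of dictionary words found in the sentence."""
    spans = []
    for i in range(len(sentence)):
        node = root
        for j in range(i, min(i + max_len, len(sentence))):
            node = node.children.get(sentence[j])
            if node is None:
                break
            if node.is_word:
                spans.append((i, j + 1, sentence[i:j + 1]))
    return spans
```

The matched spans are then grouped per character (for example, into B/M/E/S word sets in SoftLexicon-style enhancement) and merged with the character features, so a dictionary that matches the target domain yields more informative word sets.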
In summary, the multi-feature fusion entity recognition model proposed in this paper effectively utilizes lexical features, radical and glyph features, and pre-training features to uncover hidden semantic information, thereby enhancing contextual semantic analysis and improving entity recognition accuracy. The proposed multi-feature fusion strategy performs best across all four datasets, followed by the Radical+Lexicon strategy without pre-training features, then the Lexicon enhancement strategy alone, the Radical (radical and glyph feature) strategy alone, and finally the Char-LSTM-CRF and Word-LSTM-CRF models. In addition, the proposed data augmentation and adaptive DSC loss function optimizations improve sequence entity recognition performance to varying degrees.