Leveraging Part-of-Speech Tagging Features and a Novel Regularization Strategy for Chinese Medical Named Entity Recognition

Abstract: Chinese Medical Named Entity Recognition (Chinese-MNER) aims to identify potential entities and their categories in unstructured Chinese medical text. Existing methods for this task mainly incorporate dictionary knowledge on top of the traditional BiLSTM-CRF or BERT architecture. However, constructing high-quality dictionaries is typically time-consuming and labor-intensive, and relying on them may also damage the robustness of NER models. Moreover, the limited amount of annotated Chinese-MNER data can easily lead to over-fitting during training. To deal with these problems, we put forward a BERT-BiLSTM-CRF model that integrates part-of-speech (POS) tagging features and a regularization method (BBCPR) for Chinese-MNER. In BBCPR, we first leverage a POS fusion layer to incorporate external syntax knowledge. Next, we design a novel REgularization method with Adversarial training and Dropout (READ) to improve the model robustness. Specifically, READ reduces the difference between the predictions of two sub-models by minimizing the bidirectional KL divergence between the adversarial and original output distributions for the same sample. Comprehensive evaluations on two public data sets, namely cMedQANER and cEHRNER from the Chinese Biomedical Language Understanding Evaluation benchmark (ChineseBLUE), demonstrate the superiority of our proposal in Chinese-MNER. In addition, an ablation study shows that READ can effectively improve the model performance. Our proposal does well in exploring technical terms and identifying word boundaries.


Introduction
Named Entity Recognition (NER) is one of the core tasks in natural language processing (NLP) [1,2]; its purpose is to identify the entities in unstructured text and determine their categories [3]. As an essential component of many downstream NLP tasks, such as relation extraction [4], information retrieval [5], and sarcasm detection [6], NER remains an active research direction and attracts much attention in the NLP community. Most previous works are devoted to the English NER task and achieve promising performance by integrating word-level features [7]. Compared with English, East Asian languages (e.g., Chinese) typically lack explicit word boundaries and have complex composition forms, which makes it more challenging to develop a competitive NER model for them. For example, the performance of current Chinese state-of-the-art (SOTA) models is much lower than that of the English SOTA models, with a gap of nearly 10% in terms of the F1 metric [8]. Moreover, recent studies pay more attention to domain-specific NER, e.g., in medicine, which is much more complicated and requires external domain expertise [9,10].
In the current work, we focus on Chinese Medical Named Entity Recognition (Chinese-MNER), which is treated as a character-level sequence labeling problem, whereas NER for English is word-level [11]. Recently, deep learning methods have been extensively employed in Chinese-MNER [10,12-15] due to their excellent ability to automatically extract features from massive data. For instance, previous works leverage the Bi-directional Long Short-Term Memory (BiLSTM) network to acquire sequence features and achieve comparable results [16]. In addition, because pre-trained language models are highly effective at extracting contextual features, transformer-based models (e.g., BERT [17]) are becoming a new paradigm for Chinese-MNER [15,18-21].
In the medical domain in particular, external expertise is beneficial for understanding technical terms and identifying word boundaries, which motivates recent research to incorporate dictionary knowledge on top of the traditional BiLSTM-CRF or BERT architecture [9,10]. However, constructing high-quality dictionaries is typically time-consuming and labor-intensive, and relying on them may also damage the generalization and robustness of NER models [22]. Compared with dictionary knowledge, part-of-speech (POS) tagging features [23] are now readily available and do not require additional manpower or material resources. The POS tagging features [24] can be regarded as supervised signals that guide the model to explicitly identify word boundaries, because they contain potential word segmentation information [25]. Therefore, we argue that POS tagging features are more suitable for Chinese-MNER than dictionary knowledge. Last but not least, due to the high degree of specialization and the restrictions of ethics and privacy, annotated Chinese-MNER data are difficult to obtain and usually small in scale, which can easily result in over-fitting when training a Chinese-MNER model [26].
To alleviate the above issues, we present a BERT-BiLSTM-CRF with POS and Regularization (BBCPR) model for Chinese-MNER, which leverages a POS fusion layer to incorporate external syntax knowledge and introduces a novel REgularization method with Adversarial training and Dropout (READ) to improve the model robustness. In general, our proposal is based on a combined MC-BERT [27] and BiLSTM-CRF modeling framework. We first utilize MC-BERT to generate the context representation of each token in the Chinese medical text. Then, we design a POS fusion layer to integrate the POS tagging features and send the fused representations into a BiLSTM module as inputs. Finally, a standard conditional random fields (CRF) [25] layer is employed for decoding the sequence labels. In particular, besides the traditional learning objective, we introduce an additional Kullback-Leibler (KL) divergence loss based on READ. In detail, READ generates adversarial word embeddings through the Fast Gradient Method (FGM) together with a dropout mechanism, and these embeddings are subsequently passed to a Softmax layer to predict the label distributions. After that, we regularize the model predictions by minimizing the bidirectional KL divergence between the adversarial and original output distributions for the same sample [28].
To verify the effectiveness of the proposal, we conduct comprehensive experiments on two public data sets from the Chinese Biomedical Language Understanding Evaluation (ChineseBLUE) benchmark [27], i.e., cMedQANER and cEHRNER. The experimental results suggest that our BBCPR model is superior to the SOTA baseline, and the overall improvement in F1 score on the cEHRNER and cMedQANER datasets is 2.48% and 2.87%, respectively. Furthermore, the effectiveness of our designed modules is verified by the ablation studies.
In summary, the major contributions of this research are as follows:
• We design a POS fusion layer that can explicitly learn the word boundary feature for the task of Chinese-MNER by incorporating the POS tagging features.
• We put forward a novel regularization approach, READ, to alleviate the over-fitting problem for Chinese-MNER and enhance the robustness of the model on small data.
• We conduct comprehensive experiments on two public datasets. The performance comparisons over several competitive baselines indicate the superiority of our proposal.

Related Work
In this section, we first summarize prior studies of Chinese-MNER and illustrate the differences between our proposal and prior works in Section 2.1. Then, we describe the related regularization and adversarial training methods in Section 2.2.

Deep Learning-Based Chinese-MNER
Deep learning approaches have been extensively applied to the task of Chinese-MNER [10,12,14,15] due to their excellent ability to automatically extract features from massive data. Before pre-trained language models became popular, most prior works leveraged convolutional neural networks [13,29,30], recurrent neural networks [13,31], and their variants (e.g., bidirectional long short-term memory [11,32,33]) to represent the contextual features [15,16,34]. In addition, they usually adopt conditional random fields to predict the label sequence. Among these models, the BiLSTM combined with CRF yields the best performance [16].
In the past several years, owing to the outstanding ability of pre-trained language models in representation learning, transformer-based models (e.g., BERT [17]) have become a new paradigm for Chinese-MNER [15,18-21]. BERT can apply prior semantic knowledge obtained from large unlabeled corpora to downstream tasks through fine-tuning [17]. For instance, Xu et al. [20] leverage the contextual features learned by BERT to enrich the word semantics and incorporate them into a BiLSTM-CRF model. Inspired by BERT, Lee et al. [18] introduce a pre-trained biomedical language representation model for biomedical text mining. Similarly, RoBERTa [19], a variant of BERT that uses dynamic masking and eliminates the next sentence prediction task in pre-training, is also applied to learn medical features.
In the medical domain in particular, external expertise can help the model understand technical terms and identify word boundaries, which motivates recent research to incorporate extra knowledge on top of the traditional BiLSTM-CRF or BERT architecture [9,10]. For example, Li et al. [13] propose to incorporate a pre-trained medical dictionary as the model input. In addition, Dong et al. [35] adopt a radical-level LSTM to capture the pictographic root characteristics of Chinese characters.
However, constructing high-quality dictionaries is typically time-consuming and labor-intensive, and relying on them may also damage the generalization and robustness of NER models [22]. Compared with dictionary knowledge, POS tagging features [23,24] are now readily available and do not require additional manpower or material resources. Therefore, in the present work, a POS fusion layer is designed to incorporate the POS tagging features, which can act as supervised signals that guide the model to explicitly identify word boundaries.

Regularization vs. Adversarial Training
When trained on a small training set, deep learning-based models usually generalize poorly to the test data. To prevent deep neural networks from over-fitting, most recent works introduce regularization methods into their models, including L1 and L2 weight penalties, dropout, and batch normalization [36].
Dropout is a typical regularization method and has been widely used to regularize fully connected neural networks due to its simplicity and efficiency [37]. It randomly drops neurons from each layer of the neural network with probability p during training [38]. On this basis, Wan et al. [39] propose a novel type of dropout, called DropConnect. Instead of randomly setting activation units within each layer to zero, it randomly sets weights within the network to zero. However, the above methods typically work on fully connected layers. In a convolution layer, the activation units are spatially correlated, so information can still flow through the network even if some neurons are dropped. To deal with this issue, Ghiasi et al. [40] design a structured dropout method named DropBlock, which regularizes convolutional networks by dropping units in contiguous regions of the feature map together. Instead of using dropout alone, some works combine it with other training frameworks. For instance, Gao et al. [41] take the standard dropout as noise and integrate it into a contrastive learning framework, which advances the SOTA in sentence embedding. Liu et al. [42] use dropout to generate positive sentence samples in the feature space and then train the encoder with a contrastive learning-based objective. When using dropout, an inconsistency between training and inference may arise because of the randomness that dropout introduces. In response to this problem, Wu et al. [28] propose the R-Drop method, which regularizes the output distributions of two sub-models by minimizing the bidirectional KL divergence for each data sample during training.
In addition to over-fitting, the robustness of the model is also an urgent problem, since traditional neural networks are easily fooled by slightly perturbed samples [22]. To address this issue, adversarial training has recently been introduced into representation learning; among these methods, FGM [43] is a popular method for generating adversarial examples, which makes neural network models robust against perturbations. The basic principle is to add perturbations that construct adversarial samples during model training, thus enhancing the model's robustness when it encounters adversarial samples. Goodfellow et al. [44] propose a rapid and simple approach for producing adversarial examples and show that adversarial training can give an extra regularization benefit beyond that of dropout alone. In addition, the experiments in that paper demonstrate that adversarial back-propagation, as a stand-alone regularization method, performs well in improving the generalization and robustness of the network.
Inspired by the above approaches, in this study we design a new regularization method that combines R-Drop and FGM to deal with the over-fitting problem and enhance the model robustness. Alongside generating adversarial samples, we apply dropout twice and reduce the distance between the output distributions of the two resulting sub-models by minimizing the KL divergence.

Approach
In this work, we focus on the problem of Chinese-MNER. Here, we first formally define the Chinese-MNER problem and introduce the main notations employed in this study (see Section 3.1). Then, we present the technical details of our BBCPR model (see Section 3.2). Finally, we describe how to integrate the designed regularization method READ into BBCPR (see Section 3.3).

Problem Definition and Notations
The Chinese-MNER task is intended to identify entities (such as diseases, symptoms, and drugs) in unstructured Chinese medical text and predict their types. In this paper, following Chen and Kong [8], Li et al. [15], Xu et al. [20], Zhou et al. [45], and Liang et al. [46], the Chinese-MNER task is treated as a sequence labeling problem. Given a piece of Chinese medical text $X = \{x_1, x_2, \ldots, x_n\}$ with $n$ tokens as the input, the objective of a Chinese-MNER algorithm is to assign each token $x_i$ in $X$ a BIO tag (Begin, Inside, Outside) and finally obtain a label sequence $Y = \{y_1, y_2, \ldots, y_n\}$ as the output. A real-world instance of labeled entities in Chinese medical text is presented in Table 1. For clarity, we summarize the major notations used in this article in Table 2.

Table 1. An instance of the labeled entities in Chinese medical text. B-s denotes the beginning of a symptom entity, I-s denotes the interior of a symptom entity, and O stands for a token outside any entity.

Sentence (English translation): Patients with severe colds are prone to high fever and vomiting.
Entity type: disease disease person person symptom symptom symptom symptom

Table 2. Major notations applied in this paper.

| Variable | Description |
| --- | --- |
| $x_i$ | The expression of the $i$-th token in Chinese medical text |
| $e_i$ | The BERT embedding of the $i$-th token |
| $e'_i$ | The adversarial embedding of the $i$-th token |
| $m_i$ | The output embedding of BERT for the $i$-th token |
| $t_i$ | The POS tagging features of the $i$-th token |
| $p_i$ | The POS embedding of the $i$-th token |
| $v_i$ | The output fusion embedding of the POS fusion layer |
| $H$ | The final hidden representations produced by BiLSTM |
| $P$ | The input matrix of the CRF layer |
| $y_i$ | Prediction label of the $i$-th token in Chinese medical text |
| $P(Y \mid X)$ | The probability distribution of the original BERT embedding $E$ |
| $P'(Y \mid X)$ | The probability distribution of the adversarial embedding $E'$ |
| $\mathcal{L}$, $\mathcal{L}_{NER}$, $\mathcal{L}_{R}$ | The final, basic NER, and regularization method losses |
| $\lambda$ | The trade-off parameter for balancing $\mathcal{L}_{NER}$ and $\mathcal{L}_{R}$ |
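To make the BIO formulation concrete, the following minimal Python sketch converts a predicted tag sequence into entity spans. The function name `bio_to_spans`, the short type codes (`d`, `s`), and the rule of silently ignoring a stray `I-` tag are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert a BIO tag sequence into (start, end_exclusive, type) spans."""
    spans: List[Tuple[int, int, str]] = []
    start, etype = None, None
    # A sentinel "O" at the end closes an entity that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        # The open entity continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # Assumption: an "I-" tag with no open entity is ignored.
    return spans


if __name__ == "__main__":
    # Hypothetical tag sequence: a 2-token disease, one O, a 3-token symptom.
    print(bio_to_spans(["B-d", "I-d", "O", "B-s", "I-s", "I-s"]))
```

Representing entities as half-open spans makes it straightforward to compare predicted and gold entities by set intersection.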

Model Architecture
The overall architecture of our BBCPR model is presented in Figure 1. It contains four layers: the MC-BERT layer (see Section 3.2.1), the POS fusion layer (see Section 3.2.2), the BiLSTM layer (see Section 3.2.3), and the CRF layer (see Section 3.2.4). Since BERT-BiLSTM-CRF is the state-of-the-art workflow in NER [15,16,20], we follow Li et al. [15] and adopt the same workflow in this work. Unlike that work, however, we replace BERT with MC-BERT [27], which is specially pre-trained and more appropriate for the field of medicine. Moreover, we design a POS fusion layer and propose the READ strategy to improve model performance in our workflow. Below, we present each component of BBCPR in detail.

MC-BERT
BERT is pre-trained on the Wikipedia corpus [17]. However, medical texts typically contain professional terms that seldom appear in general corpora. To bridge this semantic gap, Zhang et al. [27] propose a variant of BERT, i.e., MC-BERT, which is further trained on a Chinese medical corpus and performs well in extracting medical contextual features. Thus, MC-BERT is adopted in this work.
For the input $X = \{x_1, x_2, \ldots, x_n\}$, we first adopt MC-BERT to convert the tokens into a sequence of BERT embeddings $E = \{e_1, e_2, \ldots, e_n\}$ by summing the position embedding, segment embedding, and token embedding.
Generally, MC-BERT retains the same structure as BERT, which is composed of a stack of $L$ identical layers. For convenience, the input of the first layer and the output of the $l$-th layer are denoted as $M^0$ and $M^l$, respectively. The output representations $M^{l-1}$ of the previous layer are fed into the Multi-head Self-Attention (MSA) sub-layer to acquire the contextual-level representation $\tilde{M}^l$. Next, we obtain the output representation of each encoder layer by feeding the contextual-level representation through a Feed-Forward Network (FFN) sub-layer. We formulate these operations as:

$$\tilde{M}^l = \mathrm{LN}\big(M^{l-1} + \mathrm{MSA}(M^{l-1})\big),$$

$$M^l = \mathrm{LN}\big(\tilde{M}^l + \mathrm{FFN}(\tilde{M}^l)\big),$$

where LN denotes layer normalization, $M^l \in \mathbb{R}^{n \times d_{bert}}$, $l \in \{1, 2, 3, \ldots, L\}$, and $d_{bert}$ denotes the hidden size of MC-BERT. Then, the final output embedding $M^L = \{m_1, m_2, \ldots, m_n\}$ is fed to the POS fusion layer.

POS Fusion Layer
MC-BERT processes the input Chinese medical text as a collection of tokens $X$ and generates token-level embeddings $M$. However, in Chinese the word is generally recognized as the smallest unit of semantic expression, so token-level processing loses semantic information and also makes it harder to extract entity boundaries.
To tackle this issue, we design a POS fusion layer to incorporate the POS tagging features into the BBCPR model. Unlike the existing dictionary strategy, where labor costs are invariably high [10], the POS tagging features are simple, straightforward, and easily accessible. POS refers to the grammatical category of a word, such as verb, noun, modal particle, or adjective. POS tags can accurately label common words and thus help distinguish medical entities from the common words at their edges [25].
Formally, we employ the Baidu LAC toolkit [47] to generate the POS tagging features for each token in $X$ as:

$$T = \{t_1, t_2, \ldots, t_n\},$$

where $T$ is the collection of POS tags and $t_i$ is the POS tag of the $i$-th token $x_i$. Next, to map the discrete POS tags into a continuous semantic space for model training, we generate the corresponding POS embedding $p_i$ for each POS tag $t_i$ through a learnable embedding with parameter $W_p \in \mathbb{R}^{d_p}$, where $d_p$ stands for the POS-embedding dimension. Subsequently, the concatenation operation is utilized to fuse each MC-BERT output embedding $m_i$ with its corresponding POS embedding $p_i$:

$$v_i = [m_i ; p_i],$$

where $v_i \in \mathbb{R}^{d_{bert} + d_p}$ is the fusion embedding of the $i$-th token $x_i$.
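The fusion step can be illustrated with a small, dependency-free Python sketch. It assumes that a word-level tagger has already produced (word, tag) pairs and that every character inherits the tag of its word; this expansion rule, the helper names, and the tag strings are assumptions for illustration and do not reproduce the LAC API or the authors' code.

```python
import random
from typing import Dict, List, Tuple


def char_level_pos(words: List[Tuple[str, str]]) -> List[str]:
    """Give every character the POS tag of the word that contains it."""
    return [tag for word, tag in words for _ in word]


def build_pos_table(tags: List[str], d_p: int, seed: int = 0) -> Dict[str, List[float]]:
    """One randomly initialised (standard normal) embedding of size d_p per tag."""
    rng = random.Random(seed)
    return {t: [rng.gauss(0.0, 1.0) for _ in range(d_p)] for t in sorted(set(tags))}


def fuse(m: List[List[float]], pos_tags: List[str],
         table: Dict[str, List[float]]) -> List[List[float]]:
    """v_i = [m_i ; p_i]: concatenate each token embedding with its POS embedding."""
    if len(m) != len(pos_tags):
        raise ValueError("need exactly one POS tag per token embedding")
    return [m_i + table[t] for m_i, t in zip(m, pos_tags)]


if __name__ == "__main__":
    # Hypothetical segmentation and tags for a four-character input.
    tags = char_level_pos([("感冒", "n"), ("发烧", "v")])
    table = build_pos_table(tags, d_p=3)
    m = [[0.0] * 4 for _ in tags]          # stand-in for MC-BERT outputs, d_bert = 4
    v = fuse(m, tags, table)
    print(tags, len(v), len(v[0]))         # each fused vector has d_bert + d_p entries
```

Because characters of the same word share one tag, a change of tag between neighbouring characters hints at a word boundary, which is the signal the fusion layer is meant to expose.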

BiLSTM Layer
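Afterwards, to acquire more comprehensive context features of entities, we further employ a BiLSTM layer to encode the fusion embeddings, which can make the most of both past and future input features. Following Huang et al. [16], in our proposed BBCPR model, the operations inside an LSTM unit at step $t$ can be expressed as below:

$$i_t = \sigma(W_i [h_{t-1} ; v_t] + b_i),$$

$$f_t = \sigma(W_f [h_{t-1} ; v_t] + b_f),$$

$$o_t = \sigma(W_o [h_{t-1} ; v_t] + b_o),$$

$$\tilde{C}_t = \tanh(W_c [h_{t-1} ; v_t] + b_c),$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$

$$h_t = o_t \odot \tanh(C_t),$$

where $o_t$, $f_t$, and $i_t$ represent the output, forget, and input gates; $v_t$ and $h_t$ represent the input vector and the hidden state at step $t$; $\sigma$ is the Sigmoid function; and $\odot$ denotes the element-wise product. $W_i$, $W_f$, $W_o$, and $W_c$ are the weight matrices, and $b_i$, $b_f$, $b_c$, and $b_o$ represent the bias parameters. $C_t$ and $\tilde{C}_t$ stand for the cell state and the candidate cell state at step $t$, respectively. We formulate the computation of BiLSTM as follows:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(v_t, \overrightarrow{h}_{t-1}),$$

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(v_t, \overleftarrow{h}_{t+1}),$$

$$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t],$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the hidden state at step $t$ of the forward LSTM and the backward LSTM, respectively. After that, the hidden representation of the input Chinese medical text $X$ produced by the BiLSTM module can be denoted as $H = \{h_t\}_{t=1}^{n} \in \mathbb{R}^{n \times 2d_{LSTM}}$, where $d_{LSTM}$ is the hidden size of the LSTM.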
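As a rough illustration of the gate computations and the bidirectional pass, the sketch below implements a textbook LSTM step in plain Python. It assumes the common parameterisation in which each gate applies one weight matrix to the concatenation of the previous hidden state and the current input; the paper's exact parameterisation may differ, and all names here are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]
Params = Dict[str, Tuple[Matrix, Vector]]  # gate name -> (W, b)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def lstm_step(v_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Params) -> Tuple[Vector, Vector]:
    """One LSTM step; every gate acts on the concatenation [h_prev ; v_t]."""
    z = h_prev + v_t

    def gate(name: str, act) -> Vector:
        W, b = params[name]
        return [act(a) for a in affine(W, z, b)]

    i_t, f_t, o_t = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    c_tilde = gate("c", math.tanh)
    c_t = [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def bilstm(inputs: List[Vector], fwd: Params, bwd: Params, d: int) -> List[Vector]:
    """Run a forward and a backward LSTM and concatenate their states per step."""
    def run(seq: List[Vector], params: Params) -> List[Vector]:
        h, c, out = [0.0] * d, [0.0] * d, []
        for v in seq:
            h, c = lstm_step(v, h, c, params)
            out.append(h)
        return out

    forward = run(inputs, fwd)
    backward = run(list(reversed(inputs)), bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

With hidden size `d`, each row of the result has `2 * d` entries, matching the shape of $H$ described above.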

CRF Layer
For a typical NER task, adjacent labels are sequentially related and must follow certain constraint rules. For example, the label I-symptom must appear after the label B-symptom. Because the BiLSTM only focuses on long-term contextual features rather than the dependency between labels, a CRF layer is the preferred choice for decoding the final sequence labels in current research [4,15,16,20], as it can model the sequential relationships between labels by learning the constraints between adjacent labels.
We first convert the output $H$ of the BiLSTM to the input matrix $P$ of the CRF with a linear function as follows:

$$P = H W_p^{\top} + b_p,$$

where $W_p \in \mathbb{R}^{k \times 2d_{LSTM}}$ and $b_p \in \mathbb{R}^{k}$ are learnable parameters, and $k$ is the number of label types. The CRF module is subsequently deployed to compute the conditional probability $P(Y \mid X)$ of a label sequence $Y = \{y_1, y_2, \ldots, y_n\}$ given a Chinese medical text $X = \{x_1, x_2, \ldots, x_n\}$. Formally, the probability $P(Y \mid X)$ can be computed as:

$$s(X, Y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},$$

$$P(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big)},$$

in which $Y$ is the ground-truth label sequence during training and $\tilde{y}$ ranges over $Y_X$, the set of all possible label sequences. $A_{y_i, y_{i+1}}$ stands for the transition score from the label $y_i$ to the label $y_{i+1}$, with the transition matrix $A$ being a learnable model parameter. $P_{i, y_i}$ indicates the non-normalized probability that the $i$-th token is mapped to the named entity label $y_i$. During training, the negative log-likelihood (NLL) loss $-\log P(Y \mid X)$ is minimized to optimize the model. Moreover, to predict the labels of $X$, the Viterbi algorithm [48] is applied to decode the globally optimal label sequence. The output label sequence $Y^*$ with the highest score is produced as:

$$Y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y}).$$
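Viterbi decoding over emission and transition scores can be sketched as follows. This is a generic dynamic-programming implementation in plain Python, not the authors' code; for brevity it assumes there are no dedicated start or end transitions, and the label indices in the usage example are hypothetical.

```python
from typing import List, Tuple


def viterbi(emissions: List[List[float]],
            transitions: List[List[float]]) -> Tuple[List[int], float]:
    """Return the highest-scoring label path and its score.

    emissions[i][y]   : unnormalised score of label y at position i
    transitions[a][b] : score of moving from label a to label b
    """
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])        # best score of any path ending in each label
    backptr: List[List[int]] = []     # backptr[i-1][y] = best previous label
    for i in range(1, n):
        new_score, ptr = [], []
        for y in range(k):
            best_prev = max(range(k), key=lambda a: score[a] + transitions[a][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + emissions[i][y])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    last = max(range(k), key=lambda y: score[y])
    path = [last]
    for ptr in reversed(backptr):     # follow back-pointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


if __name__ == "__main__":
    # Hypothetical labels: 0 = O, 1 = B-symptom, 2 = I-symptom.
    A = [[0.0, 0.0, -1e4],            # O -> I-symptom is heavily penalised
         [0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
    P = [[1.0, 0.8, 0.0],             # position 0 slightly prefers O
         [0.0, 0.0, 2.0]]             # position 1 strongly prefers I-symptom
    print(viterbi(P, A))
```

In this toy case a per-token argmax would pick O followed by I-symptom, which violates the BIO constraint; the transition penalty lets the decoder prefer B-symptom followed by I-symptom instead.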

Regularization Method
Owing to the high degree of specialization and the restrictions of ethics and privacy, annotated Chinese-MNER data are hard to gather and generally small in scale, so models are more prone to over-fitting. Dropout has been used in most research works to alleviate the over-fitting problem; in essence, it is a form of perturbation [37]. In addition, Goodfellow et al. [44] propose an adversarial training strategy that increases the diversity of samples by adding noise perturbations and apply it in the field of computer vision. Building on this work, Miyato et al. [49] extend adversarial training to the text classification task. However, most existing research works only consider the addition of a single perturbation [42,43].
Focusing on the Chinese-MNER task, we expect to improve the performance of our proposal in identifying named entities by increasing the diversity of perturbations. To this end, we put forward a new regularization mechanism that regularizes two distributions of the same sample, i.e., the original distribution and the distribution affected by both the adversarial perturbation and the dropout perturbation. We name this mechanism READ, that is, REgularization method with Adversarial training and Dropout. Specifically, READ regularizes the model predictions from two sub-models produced by the dropout and the adversarial perturbation. Unlike previous works that consider only a single perturbation, READ amplifies the variability of the same sample by combining different perturbations, thus enhancing the robustness of the model. The architecture of READ is shown in Figure 2.

For the input $X = \{x_1, x_2, \ldots, x_n\}$ and the output sequence $Y = \{y_1, y_2, \ldots, y_n\}$, we first feed $X$ into the pre-trained language model MC-BERT to acquire the embeddings $E$. In READ, we additionally apply an adversarial perturbation to the original embeddings $E$ to generate the adversarial embeddings $E'$ as follows:

$$E' = E + \delta,$$

where $\delta$ is an adversarial perturbation produced by FGM [49]. FGM uses the $L_2$-norm to scale the gradients to achieve a better perturbation, which is calculated as:

$$\delta = \epsilon \cdot \frac{g}{\lVert g \rVert_2},$$

where $\epsilon$ is a constant that controls the perturbation degree and $g$ denotes the gradient of the loss with respect to the embeddings $E$. As shown in Figure 2, after randomly applying different dropout masks to the neural network, we obtain two different sub-models for training. Then, we separately feed the original BERT embeddings $E$ and the adversarial embeddings $E'$ into these two sub-models to produce two different distributions for the output label sequence, that is, $P(Y \mid X)$ and $P'(Y \mid X)$, which generate two losses.
We therefore take the average of the two losses as the basic NER learning objective $\mathcal{L}_{NER}$:

$$\mathcal{L}_{NER} = \frac{1}{2}\big(-\log P(Y \mid X) - \log P'(Y \mid X)\big). \tag{23}$$

The adversarial perturbation and dropout noise shift the representation away from that of the original input. In the training step, READ reduces the difference between the predictions of the two sub-models by minimizing the bidirectional KL divergence between these two output distributions of the same sample. Formally, we denote this process as follows:

$$\mathcal{L}_{R} = \frac{1}{2}\big(D_{KL}(P \,\|\, P') + D_{KL}(P' \,\|\, P)\big), \tag{24}$$

where $\mathcal{L}_{R}$ denotes the loss function of the regularization method and $D_{KL}(P \,\|\, P')$ denotes the KL divergence between the two distributions $P$ and $P'$. The ultimate training goal is to minimize the joint loss $\mathcal{L}$ for data $(X, Y)$, which is calculated as:

$$\mathcal{L} = \mathcal{L}_{NER} + \lambda \mathcal{L}_{R}, \tag{25}$$

in which $\lambda$ is the trade-off parameter for balancing $\mathcal{L}_{NER}$ and $\mathcal{L}_{R}$. We further provide pseudo-code detailing the major steps of READ in Algorithm 1.
    Feed embedding $E'$ to model $f$, obtain the distribution $P'(Y \mid X) = f_\theta(E')$;
    Calculate negative log-likelihood loss $\mathcal{L}_{NER}$ by Equation (23);
    Calculate regularization loss $\mathcal{L}_{R}$ by Equation (24);
    Update parameters $\theta$ to minimize total loss $\mathcal{L}$ of Equation (25).
end
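The quantities that READ combines can be illustrated numerically with a short plain-Python sketch. As a simplification, it treats each prediction as a single categorical distribution (for example, one token's Softmax output) rather than a CRF sequence distribution, and it assumes the conventional factor of one half in the bidirectional KL term. The function names are hypothetical and the gradient is supplied by the caller rather than computed by back-propagation.

```python
import math
from typing import List


def fgm_perturbation(grad: List[float], epsilon: float) -> List[float]:
    """delta = epsilon * g / ||g||_2 (a zero vector if the gradient vanishes)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def kl(p: List[float], q: List[float]) -> float:
    """D_KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def read_loss(p: List[float], p_adv: List[float], gold: int, lam: float) -> float:
    """Averaged NLL of both predictions plus lam times the bidirectional KL."""
    l_ner = 0.5 * (-math.log(p[gold]) - math.log(p_adv[gold]))
    l_r = 0.5 * (kl(p, p_adv) + kl(p_adv, p))
    return l_ner + lam * l_r


if __name__ == "__main__":
    print(fgm_perturbation([3.0, 4.0], epsilon=1.0))        # scaled to unit L2 norm
    print(read_loss([0.7, 0.2, 0.1], [0.5, 0.3, 0.2], gold=0, lam=2.0))
```

When the two predictions coincide, the KL term vanishes and the joint loss reduces to the ordinary negative log-likelihood, so the regularizer only penalises disagreement between the two sub-models.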

Datasets and Evaluation Metrics
To confirm the effectiveness of our proposal, we evaluate the model performance on two public datasets, cMedQANER (https://github.com/alibaba-research/ChineseBLUE/tree/master/data/cMedQANER) and cEHRNER (https://github.com/alibaba-research/ChineseBLUE/tree/master/data/cEHRNER), released by the ChineseBLUE benchmark [27]. The cMedQANER and cEHRNER datasets are annotated from Chinese community question answering and Chinese electronic health records, respectively. In detail, cMedQANER contains 2063 annotated instances with eleven kinds of medical named entities, such as Crowd and Body. cEHRNER contains 999 annotated samples with seven kinds of medical named entities, for instance Operation, Diagnosis, and Disease. The statistics of cMedQANER and cEHRNER are shown in Table 3. We also list the specific statistics of the various kinds of entities in cMedQANER and cEHRNER in Table 4 and Table 5, respectively. The limited number of annotated instances and the complex types of medical named entities make it challenging for Chinese-MNER models to achieve promising performance.
To measure the model performances for Chinese-MNER, we employ the scores of F1, precision (P), and recall (R) for evaluation, which are widely used metrics for sequence labeling tasks [34].
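For reference, precision, recall, and F1 over entity spans can be computed as in the sketch below. It assumes exact-match, entity-level scoring, in which a predicted entity counts as correct only if its boundaries and type both match a gold entity; the text above does not specify the matching criterion, so this is one common convention rather than the authors' confirmed protocol.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 under exact span matching."""
    tp = len(gold & pred)                       # spans that agree exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical gold and predicted entities for one sentence.
    gold = {(0, 2, "d"), (3, 6, "s")}
    pred = {(0, 2, "d"), (3, 5, "s"), (7, 8, "s")}
    print(prf1(gold, pred))
```

In this example only one of the three predicted spans matches exactly, so a prediction with a wrong boundary is penalised in both precision and recall.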

Research Questions
We comprehensively examine the effectiveness of our proposed BBCPR model by focusing on the following research questions: (RQ1) Can BBCPR achieve better performance than the competitive baselines for the Chinese-MNER task? (See Section 5.

Experimental Details
In this paper, all the experiments are conducted in Python with the deep learning toolkit PyTorch (https://pytorch.org). We run each experiment on both the cMedQANER and cEHRNER datasets five times under random seeds and report the average results as well as the standard deviation. We employ the Baidu LAC toolkit [47] to obtain the POS tagging features for the cMedQANER and cEHRNER datasets. In addition, the pre-trained language model MC-BERT [27] is selected as the contextual embedding layer for our model, with 12 layers, a hidden size of 768, and 12 self-attention heads. We set the BiLSTM hidden size and the feed-forward network dimension to 256 and 1024, respectively. The POS embedding is randomly initialized from the standard normal distribution, and its size is 512. The model is trained under a mini-batch strategy, and the maximum sequence length and the batch size are 256 and 32, respectively. Following Li et al. [15] and Xu et al. [20], we employ the Adam optimizer [51] with β1 = 0.9 and β2 = 0.998. The learning rate of the BiLSTM and MC-BERT is 7 × 10^-5, and the learning rate of the CRF is set to 5 × 10^-3. During training, we adopt a linear decay schedule to vary the learning rate, with the weight decay being 0.01. The dropout rate and the regularization loss weight λ are set to 0.2 and 2.0, respectively. Our model is trained for at most 50 epochs, and the best-performing model is used for testing.
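The reported settings can be collected in one place, as in the following sketch. The field names and the `linear_decay` helper are hypothetical; in particular, the schedule below assumes a plain linear decay to zero without warm-up, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    # Values are those reported in the text; the field names are illustrative.
    bert_layers: int = 12
    bert_hidden: int = 768
    bert_heads: int = 12
    bilstm_hidden: int = 256
    ffn_dim: int = 1024
    pos_embed_dim: int = 512
    max_seq_len: int = 256
    batch_size: int = 32
    adam_betas: Tuple[float, float] = (0.9, 0.998)
    lr_encoder: float = 7e-5      # shared by MC-BERT and the BiLSTM
    lr_crf: float = 5e-3
    weight_decay: float = 0.01
    dropout: float = 0.2
    reg_lambda: float = 2.0       # weight of the regularization loss
    max_epochs: int = 50


def linear_decay(base_lr: float, step: int, total_steps: int) -> float:
    """Learning rate after `step` updates, decaying linearly to zero (assumed form)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step, 0), total_steps) / total_steps
    return base_lr * (1.0 - frac)


if __name__ == "__main__":
    cfg = TrainConfig()
    print(cfg.bert_hidden + cfg.pos_embed_dim)    # input width of the BiLSTM
    print(linear_decay(cfg.lr_crf, step=50, total_steps=100))
```

Keeping separate learning rates for the encoder and the CRF reflects the reported setup, in which the CRF parameters are updated with a much larger step size than the pre-trained encoder.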

Results and Discussion
In this section, we first compare the overall performance of BBCPR with that of the competitive baseline models (see Section 5.1). Next, we analyze the effectiveness of each component we propose in BBCPR (see Section 5.2). Furthermore, we explore the influence of different pre-trained language models (see Section 5.3) and different hyper-parameters (see Section 5.4). Finally, we present a case study to clearly demonstrate the superiority of BBCPR (see Section 5.5).

Overall Evaluation
To answer RQ1, we compare the overall entity recognition performance of the baselines and our proposal in terms of all evaluation metrics on the cMedQANER and cEHRNER datasets. Table 6 shows the detailed results of the models discussed.
First, we focus on the performance of the baselines. According to Table 6, compared to the traditional statistical learning model, i.e., HMM, deep learning-based models report obvious improvements in terms of all metrics. For example, BiLSTM-CRF beats HMM by 23.95% and 44.50% in F1 score on the cMedQANER and cEHRNER datasets, respectively, which demonstrates that the capability of learning context features is essential for the Chinese-MNER task. In particular, we can observe that the models using MC-BERT as an encoder show improvements of 7.08-8.87% and 4.59-5.31% in F1 over BiLSTM-CRF on the cMedQANER and cEHRNER datasets, respectively, indicating an impressive superiority of BERT in representation learning. In addition, equipping a model with an additional CRF layer yields better performance than the original model, which may be attributed to the ability of the CRF to model the dependencies between tags in a sequence.
Next, we focus on the comparison between the baselines and our proposal. As revealed in Table 6, BBCPR achieves the best performance among all discussed models on both the cMedQANER and cEHRNER datasets. In particular, our approach achieves SOTA performance, with improvements of 2.44% and 2.87% in F1 score over the best baseline, MC-BERT-BiLSTM-CRF, on the two datasets, respectively. A similar phenomenon holds for P and R. For the metric P, our proposed model beats the best baseline by 2.50% and 4.05% on the cMedQANER and cEHRNER datasets, respectively. For the metric R, our method shows 2.36% and 1.68% improvements over the best baseline on the cMedQANER and cEHRNER datasets, respectively. The improvements achieved by BBCPR can be explained by the fact that POS tagging features, which imply potential word segmentation, provide an extra supervision signal for distinguishing the edges between ordinary words and medical entities. In addition, the adversarial samples generated by FGM can enhance the model's robustness.

Ablation Study
To answer RQ2, we conduct comprehensive ablation studies on the cMedQANER and cEHRNER datasets to verify the effectiveness of each key module of BBCPR. The detailed results are presented in Table 7. It can be observed that when any module is removed, the model performance decreases noticeably in terms of almost all metrics, which verifies the effectiveness of the proposed modules in BBCPR. For the metrics F1 and R, removing both the POS fusion layer and the DP mechanism (see Row 6, Table 7) results in the biggest drop in model performance: on the cMedQANER and cEHRNER datasets, the performance decreases by 1.54% and 2.11% in terms of F1 and by 1.68% and 1.36% in terms of R, respectively. This observation indicates that incorporating POS tagging features into the neural network can evidently improve the Chinese-MNER model, since POS tagging features add extra information about potential entity boundaries. At the same time, the DP mechanism helps alleviate the over-fitting problem and thus reduces the prediction error. As for the metric P, removing READ (see Row 2, Table 7) degrades the model performance the most, with decreases of 1.55% and 3.29% on the cMedQANER and cEHRNER datasets, respectively. This may be because the READ module increases the variability of the original samples, thus enhancing the learning ability of the model.
In addition, we can observe that removing AP (see Row 4, Table 7) or removing DP (see Row 3, Table 7) leads to a 0.70% and 0.82% decrease in F1 on the cMedQANER dataset, respectively. Similar results can be found on the cEHRNER dataset, where the model shows a 0.58% and 0.96% drop. The reason why DP contributes more than AP may be that DP is applied to the whole model, while AP only acts on the BERT embedding layer. It is also worth mentioning that using either AP or DP alone is not as effective as combining the two, i.e., READ. For example, without READ (see Row 2, Table 7), the decreases in F1 are 1.29% and 1.82% on the two datasets, respectively. The reason may be that each perturbation on its own is relatively simple, so employing only one mechanism introduces only a small perturbation, whereas diverse perturbations increase the dissimilarity between the representations of the same sample.
Moreover, with DP used by default (see Row 7, Table 7), adding AP (see Row 5, Table 7) results in a 0.32% and 0.70% increase in F1 on the cMedQANER and cEHRNER datasets, respectively. Similarly, when adding POS (see Row 4, Table 7), the F1 score increases by 0.41% and 0.90%, respectively. Compared with AP, POS thus has a greater impact on a network that uses dropout by default. We attribute this phenomenon to the fact that POS directly enriches the features of entities and thus brings useful information to the model, whereas AP enhances the learning ability of the model only indirectly, by adding perturbations. With DP used by default, the F1 score drops by 1.10% and 1.48% on the two datasets when both POS and AP are removed, which indicates that the diversity added by POS and AP together has a positive impact on the model performance.

Influence of Pre-Trained Language Models
To answer RQ3, we further conduct comparative experiments to analyze the effect of different pre-trained language models, namely BERT, BERT-WWM, RoBERTa, MacBERT, and MC-BERT. First, BERT obtains informative contextual representations by employing the masked language model and next sentence prediction training objectives. BERT-WWM additionally employs a whole word masking strategy for the Chinese corpus. RoBERTa robustly optimizes the BERT pre-training method through four simple and effective modifications. MacBERT replaces the words to be masked with similar words in the Chinese corpus, while MC-BERT is further trained on a Chinese biomedical corpus on the basis of BERT.
As shown in Figure 3, MC-BERT outperforms the other pre-trained models on almost all evaluation metrics. In particular, as reflected in Figure 3a, MC-BERT shows improvements of nearly 0.83-1.67%, 0.59-1.65%, and 0.74-2.21% in terms of F1, P, and R, respectively, over the other pre-trained language models on the cMedQANER dataset. Figure 3b indicates that, on the cEHRNER dataset, MC-BERT improves F1 by nearly 0.66-1.17% and P by 0.12-2.48%, while the change in R ranges from −0.16% to 1.60%, compared with the other pre-trained language models. A possible reason is that MC-BERT adopts the whole entity masking strategy and the whole span masking strategy to inject medical domain knowledge for Chinese biomedical text, which helps generate better contextual representations for the Chinese-MNER task. Accordingly, we choose the MC-BERT model [27] as the contextual embedding component in our experiments.

Analysis on Different Hyperparameters
To answer RQ4, we conduct experiments on the cMedQANER and cEHRNER datasets to explore the performance of BBCPR under different hyperparameters, i.e., the POS embedding size and the trade-off parameter λ. We set the POS embedding size to 128, 256, 384, 512, 640, and 768 in turn. As displayed in Figure 4a, the performance of our model initially improves as the POS embedding size increases, and the difference between the best and worst performance can be as high as 0.97%, 0.82%, and 1.12% in terms of F1, P, and R, respectively, on the cMedQANER dataset. Figure 4b shows similar outcomes on the cEHRNER dataset. This is probably because increasing the embedding size raises the model complexity and thus yields a more powerful representation capability. However, when the size increases further, the performance becomes worse due to over-fitting. As shown in Figure 4, all the metrics achieve their best performance when the POS embedding size is 512. Therefore, we choose 512 as the POS embedding size in our experiments.
Further, we analyze the effect of the trade-off parameter λ. We vary λ in {1, 2, 3, 4, 5, 10} and conduct extensive experiments. According to Figure 5a,b, a λ that is either too large or too small makes our model perform poorly. The model performs best at λ = 2 and then performs progressively worse as λ increases. Specifically, for the other values of λ, the F1 score drops by 0.33-1.45% on the cMedQANER dataset (Figure 5a) and by 0.41-1.80% on the cEHRNER dataset (Figure 5b). In terms of P and R, the variations are 0.15-1.64% and 0.24-1.24% on the cMedQANER dataset, and 0.19-1.99% and 0.53-1.60% on the cEHRNER dataset. Since all the metrics achieve their best performance at λ = 2, we select 2 as the regularization loss weight for our proposed BBCPR model.
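The role of λ can be illustrated with a minimal sketch, assuming that the training objective has the form of a task loss plus λ times a bidirectional KL term between the original and adversarial output distributions. The averaging factor of 1/2, the per-token categorical distributions, the function names, and the toy logits are all assumptions made for illustration; the exact objective is the one defined in the method section.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def bidirectional_kl(p, q):
    # Average of both directions; the 1/2 factor is an assumption.
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def total_loss(task_loss, p_original, p_adversarial, lam=2.0):
    """Task loss plus the lambda-weighted regularization term."""
    return task_loss + lam * bidirectional_kl(p_original, p_adversarial)


# Hypothetical tag distributions for one token from the two sub-models.
p_original = softmax([2.0, 0.5, -1.0])
p_adversarial = softmax([1.5, 0.7, -0.8])

for lam in (1, 2, 3, 4, 5, 10):
    print(lam, total_loss(0.8, p_original, p_adversarial, lam=lam))
```

A larger λ places more weight on agreement between the two outputs relative to the task loss, which is consistent with the observation that very large values hurt performance.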

Case Study
In this section, we conduct a case study to demonstrate the superiority of our model over the baselines. In particular, we compare the predictions of our proposal and MC-BERT-BiLSTM-CRF on two cases from the cMedQANER and cEHRNER datasets, respectively. The input medical texts and the corresponding predictions for the two cases are presented in Figure 6.
As shown in Figure 6, in Case #1, the baseline model fails to identify "药物治疗 (drug treatment)" as a whole unit. In contrast, our model completely and correctly recognizes the entity "药物治疗 (drug treatment)", as its four tokens have the same part of speech (i.e., noun). Likewise, in Case #2, the baseline MC-BERT-BiLSTM-CRF identifies "升结肠恶性肿瘤及肝内外胆管结石 (malignant tumor of ascending colon and intrahepatic and extrahepatic bile duct stones)" as a single entity, even though the token "及 (and)" is a conjunction, while "瘤 (tumor)" and "肝 (liver)" are nouns. Our model, on the other hand, accurately recognizes the two entities "升结肠恶性肿瘤 (malignant tumor of ascending colon)" and "肝内外胆管结石 (intrahepatic and extrahepatic bile duct stones)" by identifying the clear entity boundaries before and after "及 (and)". Overall, the above cases demonstrate that BBCPR can explicitly learn word boundary information by introducing the POS tagging features, which is conducive to improving the entity recognition accuracy.
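The boundary cue in Case #2 can be made concrete with a small rule-based sketch. This is only an illustration of why a conjunction tag signals a boundary; it is not how BBCPR works, since the model learns from POS embeddings rather than applying a hard rule. The function name and the tag labels `"n"` and `"c"` are hypothetical, and tagging every character other than "及" as a noun is a simplifying assumption.

```python
def split_on_conjunctions(chars, pos_tags, conj_tag="c"):
    """Split a character sequence into spans at tokens tagged as conjunctions."""
    spans, current = [], []
    for ch, tag in zip(chars, pos_tags):
        if tag == conj_tag:
            # A conjunction closes the current span and is itself dropped.
            if current:
                spans.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        spans.append("".join(current))
    return spans


# Case #2 text; tags are assumed: "c" for the conjunction, "n" for everything else.
text = "升结肠恶性肿瘤及肝内外胆管结石"
tags = ["c" if ch == "及" else "n" for ch in text]

spans = split_on_conjunctions(list(text), tags)
print(spans)  # ['升结肠恶性肿瘤', '肝内外胆管结石']
```

With uniform noun tags, as in Case #1, the same function returns the whole string as one span, mirroring the intuition that tokens sharing a part of speech are more likely to belong to one entity.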

Conclusions and Future Work
In this work, we propose a model named BBCPR to improve performance on the Chinese-MNER task, which leverages a POS fusion layer to explicitly learn word boundary information by incorporating external syntax knowledge. We also design a novel regularization method, READ, to deal with the over-fitting problem and improve the model robustness. In detail, READ regularizes the predictions of the two sub-models by minimizing the bidirectional KL divergence between the adversarial output and original output distributions for the same sample. Comprehensive experiments conducted on two benchmark datasets confirm the advantage of our proposal for the Chinese-MNER task. In addition, an ablation study shows that the POS fusion layer and READ can effectively improve the model performance.
For future research, we want to explore how to obtain more features for entities by introducing contrastive learning [41], which can pull entities of the same type closer together and push entities of different types apart [52]. Furthermore, we are interested in verifying the effectiveness of our proposal in other domains, e.g., the financial and legal domains. Finally, mining more potential supervisory signals from unlabeled samples and then training the model in an unsupervised setting may also be a promising direction.