A Neural N-Gram-Based Classifier for Chinese Clinical Named Entity Recognition

Abstract: Clinical Named Entity Recognition (CNER) focuses on locating named entities in electronic medical records (EMRs), and the obtained results play an important role in the development of intelligent biomedical systems. In addition to research on alphabetic languages, the study of non-alphabetic languages has attracted considerable attention as well. In this paper, a neural model is proposed to extract entities from EMRs written in Chinese. To avoid the erroneous noise introduced by Chinese word segmentation, we employ character embeddings as the only feature, without extra resources. In our model, concatenated n-gram character embeddings are used to represent the context semantics. A self-attention mechanism is then applied to model long-range dependencies among the embeddings. The concatenation of the new representations obtained by the attention module is fed into a bidirectional long short-term memory (BiLSTM) network, followed by a conditional random field (CRF) layer to extract entities. An empirical study is conducted on the CCKS-2017 Shared Task 2 dataset to evaluate our method, and the experimental results show that our model outperforms other approaches.


Introduction
With the rapid development of information technology, medical institutions have widely adopted electronic medical records (EMRs) to facilitate the collection of data including patient health information, diagnostic tests, procedures performed and clinical decision making. EMRs contain valuable clinical data and a large amount of patient medical information that can have critical implications for future health care delivery. However, most EMRs are in an unstructured format, which makes information difficult to extract for building intelligent biomedical systems and, most importantly, can hinder large-scale knowledge discovery. It is therefore urgent to explore effective approaches for converting EMRs into structured forms in order to improve the quality of care delivery.
The task of Information Extraction (IE) refers to identifying and recognizing instances of structured semantics (e.g., predefined classes of entities and relationships among entities) in unstructured or semi-structured text [1]. The continued expansion of EMRs has attracted researchers' interest and led to an active research topic called Biomedical Information Extraction (BioIE). BioIE aims to discover structured information from unstructured clinical notes and narratives that can be used by clinicians, researchers and applications. In general, there are three main subtasks in BioIE: (1) Named Entity Recognition (NER), which aims to categorize entity names in the clinical and biomedical domains; (2) Relation Extraction (RE), which targets the detection of semantic relations between entities; and (3) Event Extraction (EE), which explores a more detailed alternative, producing a formal representation of the knowledge within the targeted documents [2]. BioIE is an active research topic at the crossroads of Chemistry, Biology, Medicine, Computational Engineering and Natural Language Processing, and a growing number of workshops and conferences are testimony to its continuing importance and potential. The IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM) and the International Workshop on Health Text Mining and Information Analysis (LOUHI) provide interdisciplinary forums for researchers interested in the automated processing of and information extraction from health documents. Two distinguished computational linguistics conferences, ACL and COLING, and their affiliated workshops have long considered information extraction and clinical text modeling as topics of interest [3].
NER occupies a very important position in the field of natural language processing (NLP) and has been studied extensively for decades, especially in general domains such as news articles. It focuses on identifying and classifying all mentions of a subset of nouns such as places, persons and organizations [4]. NER plays an essential role as a pre-processing step for downstream tasks including question answering, information retrieval and relation extraction. With the ever-increasing volume of EMRs, clinical named entity recognition (CNER) has become of great interest to researchers, and many competitions have been organized to promote development and stimulate the community [5]. Unlike NER in the general domain, CNER deals with a large number of clinical terms and professional designations; it aims to recognize entities in EMRs (such as diseases, symptoms and body parts) and to benefit other intelligent clinical systems for health monitoring, prognosis, diagnostics and treatment. In addition to research on CNER in English, other languages have also gained prominence, and Chinese is one of the core research topics. Compared with CNER in alphabetic languages represented by English, Chinese CNER faces the following challenges [6][7][8]:


- An ambiguous chunk of text may correspond to the same character sequence but to different named entities. For example, "泌尿道感染" (urinary tract infection) could refer to a disease entity or a symptom entity depending on the context.
- There are no clear word boundaries in Chinese text, and the quality of word segmentation significantly impacts the performance of NER. For example, "小腸切除術" (small bowel resection) is a treatment entity if it is considered as one segmentation unit. However, if the word segmentation model splits it into "小腸" (small intestine) and "切除術" (resection), their entity types become body part and treatment, respectively.
- Because doctors casually use Chinese abbreviations for clinical entities, the same entity may have multiple expressions. For example, "盲腸炎" and "闌尾炎" can both refer to appendicitis.
In this paper, we propose an n-gram based neural network that models the Chinese CNER task as a sequence labelling problem. Given a sentence, we represent the text at the character level to avoid the erroneous noise introduced by Chinese word segmentation, and we employ character embeddings as the only feature. More specifically, adjacent character embeddings are combined into n-gram features (unigrams, bigrams and trigrams). These are then passed to a self-attention mechanism to learn long-term dependencies. Finally, a Bidirectional Long Short-Term Memory (BiLSTM) network is applied to encode the sequential structure and capture contextual features, followed by a Conditional Random Field (CRF) layer that considers the correlations between adjacent tags when predicting the label sequence.
There are two main contributions in this paper:


We propose an Att-BiLSTM-CRF model that performs the Chinese CNER task based on combinations of n-gram character embeddings of different lengths without using external knowledge. Unlike other approaches in the literature, which rely on domain-specific resources and may limit the ability to generalize, our model is readily scalable to other datasets.


We assess the effectiveness of the proposed model on the CCKS-2017 Shared Task 2 dataset. Our model obtains an F-score of 89.33% and performs better than other competitive methods, including CNN-, BiLSTM- and BERT-based models, whose F-scores lie in the range of 87.75% to 88.51%.
The remainder of this paper is organized as follows. Section 2 reviews several techniques related to the work of this paper. The proposed model is described in Section 3. We explain the experimental setup and report the results in Section 4. In Section 5, we present conclusions and also discuss future research avenues.

Related Work
Since the volume of EMRs has grown considerably over recent decades, the CNER problem has drawn much interest, and a great deal of research effort has been devoted to its study and development. Broadly, four representative types of methods have been proposed to perform the task: rule-based, dictionary-based, machine learning and deep learning approaches [9][10][11].
In the early stage, rule-based approaches were dominant in solving the CNER problem. They rely on heuristic information, handcrafted features [12,13] and lexical resources [14,15] to detect clinical entities. Although they played a critical role in the past, rule-based approaches depend heavily on expert domain knowledge, which makes them difficult to transfer to different fields.
Traditionally, dictionary-based methods take advantage of existing clinical vocabularies to extract entities and have been widely applied due to their simplicity [15,16]. Several clinical ontologies and vocabularies, such as MeSH [17] and SNOMED-CT [18], have been developed for this purpose. However, the performance of these methods is limited by the size of the lexicon and can lead to low recall when the input data contains many out-of-dictionary entities. Additionally, similar to the rule-based approaches, dictionary-based approaches lack generalizability and require tremendous human effort to build the lexicons.
Since machine learning methods have been successfully used for sequence labelling tasks such as POS tagging, NER and chunking, the CNER task has also been transformed into a sequence labelling problem and solved with various machine learning algorithms. Typically, feature engineering is performed on the input sentence to convert the data into a numerical representation. Three typical supervised sequence tagging models (HMM, MEMM and CRF) based on n-gram and position features have been evaluated for named entity recognition in traditional Chinese clinical records, where CRF achieved better performance than the other two classifiers [19]. A Support Vector Machine (SVM) with word-shape and part-of-speech features has been applied to recognize biomedical named entities, obtaining a precision of 84.24% and a recall of 80.76% [20]. However, most of these machine learning methods rely on pre-defined features (such as lexical, syntactic and semantic features) and are difficult to generalize to different datasets.
In recent years, as deep learning techniques have advanced rapidly and achieved significant success across various applications, the prevailing approaches have shifted to deep learning methods. The Long Short-Term Memory (LSTM) network is well suited to learning temporal relations and has been widely used in NLP tasks [21]. A BiLSTM-CRF approach, a neural network system based on bidirectional LSTMs and a CRF, has been proposed to solve the Chinese CNER problem using specialized word embeddings as feature representations and external health-domain lexicons as the knowledge base [22]; the system reports an F-score of 87.95% on the CCKS-2017 (Task 2) CNER dataset. Another bidirectional RNN-CRF model for Chinese CNER adopts concatenated n-gram embeddings and also includes word segmentation information, part-of-speech tagging and a medical entity vocabulary as additional features [23]. Unlike previous research relying on such miscellaneous information, in this paper we present a neural n-gram based classifier that requires no external resources.

The Proposed Approach
The Chinese CNER task is modeled as a sequence labelling problem in this work. Given an input sequence X with t characters (i.e., X = (x1, x2, …, xt)), the goal is to label each character xi with a predefined tag based on the tagging scheme, producing an output sequence Y = (y1, y2, …, yt). We use BIO as the annotation strategy, where B denotes the beginning of an entity, I denotes the inside of an entity and O denotes a character outside any entity. In addition, the B and I tags are followed by an entity type, such as B-BODY and I-BODY for the Body entity type. The tagging result for an input sentence "左側髖部正常" (the left hip is normal) is displayed in Table 1.

Table 1. An entity tagging example of "左側髖部正常".
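To make the scheme concrete, the following sketch shows how BIO tags map back to entity spans for the example sentence. The tag sequence and the helper function are illustrative only (they are our additions, not part of the proposed model); the tag names follow the paper's B-BODY/I-BODY convention.

```python
# One plausible BIO tagging of the example sentence "左側髖部正常"
# (the left hip is normal): "左側髖部" is a Body entity, the rest is O.
sentence = list("左側髖部正常")
tags = ["B-BODY", "I-BODY", "I-BODY", "I-BODY", "O", "O"]

def extract_entities(chars, labels):
    """Recover (entity_text, entity_type) spans from a BIO-tagged sequence."""
    entities, current, etype = [], [], None
    for ch, tag in zip(chars, labels):
        if tag.startswith("B-"):
            if current:                      # close the previous entity
                entities.append(("".join(current), etype))
            current, etype = [ch], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(ch)               # continue the open entity
        else:                                # O tag (or stray I-) closes it
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities
```

Decoding the tag sequence above recovers the single Body entity "左側髖部".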
The proposed Att-BiLSTM-CRF model shown in Figure 1 is composed of six building blocks: the Embeddings, N-gram, Attention, Concatenation, BiLSTM and CRF layers. The Embeddings Layer converts each input character into an embedding vector, and the N-gram Layer applies n-gram techniques to the embeddings to form n-gram embeddings (n from 1 to 3). The Attention Layer employs self-attention on the n-gram embeddings, and the Concatenation Layer combines the self-attention representations. The BiLSTM Layer captures the features of the concatenated results, and the CRF Layer then takes the BiLSTM output to decode the tag sequence. The n-gram character embedding method and the details of the neural entity recognition model are discussed in the following sections.

N-Gram Character Embeddings
In an n-gram model, a widely used concept in NLP, each sentence is represented by sequences of n consecutive units. To reduce segmentation ambiguity for Chinese words, we use the character as the basic unit rather than the word. For example, given the input Chinese sentence "胃部疼痛" (stomach-ache), the unigrams are {胃, 部, 疼, 痛}, the bigrams are {胃部, 部疼, 疼痛} and the trigrams are {胃部疼, 部疼痛}.
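The n-gram extraction above can be written in a few lines of Python (a minimal illustration; the function name is ours):

```python
def char_ngrams(sentence, n):
    """Return the character n-grams of a sentence, using characters
    (not words) as the basic unit, as in the paper."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

# The example from the text: "胃部疼痛" (stomach-ache)
s = "胃部疼痛"
unigrams = char_ngrams(s, 1)   # ['胃', '部', '疼', '痛']
bigrams = char_ngrams(s, 2)    # ['胃部', '部疼', '疼痛']
trigrams = char_ngrams(s, 3)   # ['胃部疼', '部疼痛']
```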
For a sentence X = (x1, x2, …, xt) with t characters, the embedding process transforms each character into a distributed, dense vector representation in R^d, where d is the size of the character embedding. The Embeddings Layer, a part of the neural network, is initialized with random vectors and learns to represent all the characters in the training set during the training stage. Each character is mapped to an embedding vector once training is completed. To better encode the input sentence, we use an n-gram character embeddings model rather than an n-gram character model: an n-gram character embedding is represented by concatenating the embeddings of n consecutive characters. The N-gram Layer of Figure 1 shows the unigram, bigram and trigram character embeddings.
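A minimal NumPy sketch of n-gram character embeddings by concatenation. Everything here is illustrative rather than taken from the paper: the embedding size, the random initialization, and the zero-padding at the sentence boundary (one common convention for keeping all three n-gram sequences at length t, so they can later be aligned and concatenated).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # illustrative embedding size
sentence = "胃部疼痛"
vocab = {ch: i for i, ch in enumerate(sorted(set(sentence)))}
E = rng.standard_normal((len(vocab), d))  # randomly initialised lookup table

def ngram_embeddings(sent, n):
    """Concatenate the embeddings of n consecutive characters.
    Zero-padding past the last character keeps the output at t rows,
    giving a (t, n * d) matrix."""
    ids = [vocab[ch] for ch in sent]
    pad = np.zeros(d)
    rows = []
    for i in range(len(ids)):
        parts = [E[ids[i + k]] if i + k < len(ids) else pad for k in range(n)]
        rows.append(np.concatenate(parts))
    return np.stack(rows)

uni = ngram_embeddings(sentence, 1)   # shape (4, d)
bi = ngram_embeddings(sentence, 2)    # shape (4, 2 * d)
tri = ngram_embeddings(sentence, 3)   # shape (4, 3 * d)
```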

Neural Entity Recognition Model
In this section, we discuss the proposed approach to deal with Chinese CNER in the EMRs. The neural model adopted in this research mainly relies on Attention, BiLSTM and CRF layers to obtain a more semantic representation of Chinese characters.
Attention Layer: The attention method has been widely used in many tasks, especially NLP applications, to capture the context information and the dependencies between tokens in a given sentence [24,25]. The mechanism computes attention weights between every pair of tokens and uses a weighted summation to obtain the representation [26]. The calculation of attention on our n-gram character embeddings is as follows. Given an input sequence $E_n = (e_1, e_2, \ldots, e_t)$, where $e_i \in \mathbb{R}^{1 \times d_n}$ is the n-gram character embedding (n from 1 to 3) and $d_n$ is the size of the embedding, $E_n$ is converted to a query $Q_n$, key $K_n$ and value $V_n$ through linear transformations:

$$Q_n = E_n W_q, \qquad K_n = E_n W_k, \qquad V_n = E_n W_v$$

where $W_q$, $W_k$, $W_v$ are learnable parameters and $Q_n, K_n, V_n \in \mathbb{R}^{t \times d_n}$. The attention output is then calculated as:

$$A_n = \mathrm{Attention}(Q_n, K_n, V_n) = \mathrm{softmax}\!\left(\frac{Q_n K_n^{\top}}{\sqrt{d_n}}\right) V_n$$

In this paper, we use a special kind of attention, self-attention, which learns the representation of each unit in a sentence by attending to all units within the same sentence. In the self-attention mechanism, the query $Q_n$, key $K_n$ and value $V_n$ are all derived from the same input $E_n$.
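The computation described above can be sketched with NumPy (a minimal single-head illustration; the scaling by the square root of the embedding size follows the standard scaled dot-product formulation, and all dimensions and variable names here are our assumptions):

```python
import numpy as np

def self_attention(E_n, Wq, Wk, Wv):
    """Scaled dot-product self-attention over n-gram embeddings E_n (t x d_n).
    Q, K and V all come from the same input, as in self-attention."""
    Q, K, V = E_n @ Wq, E_n @ Wk, E_n @ Wv
    d_n = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_n)                 # (t, t) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                              # (t, d_n) new representation

rng = np.random.default_rng(1)
t, d_n = 4, 8
E_n = rng.standard_normal((t, d_n))
Wq, Wk, Wv = (rng.standard_normal((d_n, d_n)) for _ in range(3))
A_n = self_attention(E_n, Wq, Wk, Wv)
```

In the full model this is applied separately to the unigram, bigram and trigram embedding sequences, producing A1, A2 and A3.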
BiLSTM Layer: After the above self-attention calculation, we concatenate $A_1$, $A_2$ and $A_3$ to obtain the final embedding matrix $A \in \mathbb{R}^{t \times (d_1 + d_2 + d_3)}$, as shown in the Concatenation Layer of Figure 1. We then pass A into a BiLSTM layer. The LSTM is designed to capture long-term dependencies by introducing gated memory units that address the gradient problems and control the information flow [27]. At each timestep t, for the given input $A = (a_1, a_2, \ldots, a_t)$, the LSTM updates its hidden state $h_t$ based on the current input $a_t$ and the previous hidden state $h_{t-1}$ by computing the following equations:

$$
\begin{aligned}
i_t &= \sigma(W_i a_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f a_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o a_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c a_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $i_t$, $f_t$, $o_t$ and $c_t$ are the input gate, forget gate, output gate and cell vector, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.

CRF Layer: In the CNER task, there are several constraints and dependencies in the BIO tagging scheme; for instance, an I tag must follow a B tag. It is therefore important to take these factors into consideration, and we adopt a CRF to predict the label sequence by learning the correlations between the current label and its neighbors [28]. Given the input sequence obtained from the output of the BiLSTM Layer, $h = (h_1, h_2, \ldots, h_t)$, we use $y = (y_1, y_2, \ldots, y_t)$ to denote a sequence of labels for h. The CRF model defines the conditional probability distribution over all label sequences y given h as follows [29]:

$$p(y \mid h; W, b) = \frac{\prod_{i=1}^{t} \exp\!\left(W_{y_{i-1}, y_i}^{\top} h_i + b_{y_{i-1}, y_i}\right)}{\sum_{y'} \prod_{i=1}^{t} \exp\!\left(W_{y'_{i-1}, y'_i}^{\top} h_i + b_{y'_{i-1}, y'_i}\right)}$$

where $W$ denotes the weight and $b$ the bias term corresponding to the neighboring label pair $(y_{i-1}, y_i)$. To train the CRF on a training dataset $\{(h^{(i)}, y^{(i)})\}$, where the superscript i indexes the i-th example, parameter estimation is performed by maximizing the conditional log-likelihood:

$$L(W, b) = \sum_{i} \log p\!\left(y^{(i)} \mid h^{(i)}; W, b\right)$$

During inference, the optimal label sequence $y^{*}$ for a test input z is derived by maximizing the conditional probability with the Viterbi algorithm [30]:

$$y^{*} = \arg\max_{y} \, p(y \mid z; W, b)$$
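The Viterbi decoding step used at inference time can be sketched as follows (a minimal NumPy implementation of linear-chain Viterbi; the emission and transition scores are illustrative stand-ins for the BiLSTM outputs and the learned CRF parameters):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF.
    emissions:   (t, n_tags) per-position tag scores (from the BiLSTM);
    transitions: (n_tags, n_tags) score of moving from tag i to tag j.
    Returns the highest-scoring tag sequence as a list of tag indices."""
    t, n = emissions.shape
    score = emissions[0].copy()           # best score ending in each tag
    backptr = np.zeros((t, n), dtype=int)
    for i in range(1, t):
        # cand[j, k] = best score ending in j, then transitioning to k
        cand = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]          # backtrack from the best final tag
    for i in range(t - 1, 0, -1):
        best.append(int(backptr[i][best[-1]]))
    return best[::-1]
```

A strongly negative transition score (e.g., from O into an I tag) is how a CRF layer discourages tag sequences that violate the BIO constraints.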

Dataset and Evaluation Metrics
We conduct the empirical evaluation on the CCKS-2017 Shared Task 2 benchmark dataset [31]. This dataset contains 400 EMRs in total, of which 300 are used as the training set and the remaining 100 as the testing set. Each EMR has four sections: general items, medical history, diagnosis and treatment, and discharge summary. There are five categories of clinical entities: body, exam, disease, symptom and treatment. Table 2 lists the statistics of clinical named entities for each category; there are 29,866 entities for training and 9,493 for testing. In this research, we use the character-level BIO annotation mode, where B means that the character is at the beginning of an entity, I means that the character is inside an entity and O means that the character does not belong to any entity. Since there are five clinical entity categories in the CCKS-2017 dataset, this results in ten B/I annotation labels plus one O label, yielding eleven labels in total.
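The eleven-label inventory described above can be enumerated directly (the exact tag strings used in the dataset's annotations may differ; the category names here are illustrative English abbreviations):

```python
# Five entity categories, each with a B- and an I- tag, plus the O tag.
categories = ["BODY", "EXAM", "DISEASE", "SYMPTOM", "TREATMENT"]
labels = [f"{prefix}-{cat}" for cat in categories for prefix in ("B", "I")]
labels.append("O")
# 5 categories * 2 prefixes + 1 = 11 labels in total
```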
The evaluation measures for entity recognition are three standard performance indicators, namely Precision (P), Recall (R) and F-score (F). Precision measures how accurate the proposed method is in predicting entity categories, while Recall reflects how completely it retrieves them. The F-score is defined as the harmonic mean of Precision and Recall, serving as an overall measure. The three metrics are calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2 \times P \times R}{P + R}$$

where TP is the number of true positives, FP the number of false positives and FN the number of false negatives.
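These formulas translate directly into code (a minimal sketch; the zero-division guards are our addition for edge cases):

```python
def precision_recall_f1(tp, fp, fn):
    """Entity-level metrics: P = TP/(TP+FP), R = TP/(TP+FN),
    F = 2*P*R/(P+R), the harmonic mean of P and R."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```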

Experiment and Results
To study the effectiveness of the proposed model, two experiments are conducted. The first evaluates the model against other competitive algorithms; in the second, we test our model on different lengths of n-gram character embeddings. The experiments are carried out on a Windows system with an Intel(R) Core(TM) i7-8750H CPU, 8 GB RAM and an NVIDIA GeForce GTX 1050 Ti GPU. The neural network model is composed of a character embedding lookup table, a self-attention layer, a BiLSTM layer and a final CRF layer. The hyper-parameter settings of the model are shown in Table 3. The competing systems (listed in full in Table 4) include, among others:

ID-CNN-CRF: a Convolutional Neural Network-based model with iterated dilated convolutions and a domain-specific lexicon for word-embedding matching [34,35].

BERT-BiLSTM-CRF: a pre-trained language model (BERT) to enhance the semantic representation, combined with a BiLSTM network and a CRF layer [36].
The comparison results are shown in Table 4. Our model obtains the best F-score among all competitors, pushing the F-score to 89.33% and outperforming the second-best system (RD-CNN-CRF) by 0.82%. In general, LSTM-based approaches obtain better Recall, while CNN-based methods perform better in Precision. RD-CNN-CRF achieves the best Precision (88.64%) and our approach is second best (88.53%). The best Recall is reported by BERT-BiLSTM-CRF (90.48%) and our system ranks second (90.13%).

We perform the second experiment to further investigate the effect of different lengths of n-gram embeddings. Table 5 reports the results for the 1-gram, 2-gram and 3-gram settings and for our approach, which combines all of them. Our model achieves the highest F-score (89.33%) and Precision (88.53%). In terms of Recall, the 2-gram method is best (90.47%) and our model achieves the second-highest score (90.13%).

In addition to the overall performance evaluation discussed above, we also show the detailed results for all five entity categories in Table 6. Our model achieves the highest F-scores in four of the five entity categories, the exception being the "Body" category. The most challenging entity types are Disease and Treatment, for which the F-scores of all models are below 80%.

Though our approach is generally applicable, several limitations remain. First, since our model uses a BiLSTM layer, which cannot fully exploit the GPU for parallel processing, it is important to address this issue to ensure both high performance and high computational efficiency. Second, our embeddings layer learns distributed representations from scratch, without any pre-trained embeddings. We expect that pre-trained embeddings learned from large Chinese medical corpora could further help in the Chinese CNER task.

Conclusions
In this study, we propose a neural model based on n-gram character embeddings that learns richer semantic information about Chinese characters to address the problem of Chinese clinical named entity recognition. The method avoids relying on external resources and knowledge bases. We conduct experiments on the CCKS-2017 Shared Task 2 dataset with five categories of clinical named entities. The empirical studies show that our approach performs better than other CNN- and LSTM-based baselines.
Future work will investigate obtaining more contextualized representations for named entity recognition. Joint learning trains a single model to handle multiple tasks with the aim of improving performance on all of them; we plan to exploit the recognition of EMR sections (general items, medical history, diagnosis and treatment, discharge summary) and train it jointly with Chinese CNER to boost performance. In addition, apart from the Embeddings and N-gram layers, which are used to preprocess Chinese characters, the other layers of our approach are expected to be applicable to different languages. Another possible avenue for future work is therefore to extend the model to other languages in order to maximize the usefulness of our method.