1. Introduction
With the rapid development of the medical industry, data mining on Clinical Electronic Medical Records (CEMR) plays an important role in precision medicine, providing the underlying technology for downstream tasks such as medical record summarization and computer-aided diagnosis. State-of-the-art Named Entity Recognition (NER) methods use Long Short-Term Memory (LSTM) networks to extract features and then employ a Conditional Random Field (CRF) to obtain the optimal tag sequence [1,2,3,4]. However, the temporal structure of LSTM-based models is computationally expensive and inefficient, especially when faced with massive medical records: the output at the current time step depends on the output of the previous step, so the computation cannot be parallelized. CNN models tend to predict faster than LSTM models with their temporal structure, and practitioners often have to choose between the strong performance and the low efficiency of LSTM models on NLP tasks [5,6]. It would therefore be very beneficial if a CNN model could directly replace the LSTM structure while achieving similar performance on NLP tasks. Recent works [7,8] attempted to apply Convolutional Neural Networks (CNNs) to NER. Many deep learning networks are built on the CNN architecture because it makes better use of the GPU and thus offers excellent speed. Nevertheless, CNN-based models perform worse than LSTM-based models because they neglect global semantic information. Recently, Iterated Dilated CNNs (ID-CNNs) [9] were proposed to efficiently aggregate broad context; a structural comparison between the ID-CNN model and the traditional Bi-LSTM model is shown in Figure 1. However, this model ignores the significance of the word-order feature and the local context in the text. As shown in Figure 1, the natural temporal structure of the traditional Bi-LSTM model enables positional information to be captured, whereas the ID-CNN model has no temporal structure: it cannot capture the relative relationship between words and ignores word-order information. For example, swapping the positions of two words in Figure 1 has a large impact on the Bi-LSTM model, while the output of the ID-CNN model is not affected.
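To make the contrast concrete, the following sketch (in PyTorch, with hypothetical channel sizes and dilation rates rather than the exact configuration used in this paper) stacks three dilated 1-D convolutions in the style of an ID-CNN block; the dilation rate doubles at each layer, so the receptive field grows exponentially with depth while all positions are processed in parallel.

import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    # One ID-CNN-style block: a stack of dilated 1-D convolutions whose
    # dilation rates double, widening the receptive field without pooling.
    def __init__(self, channels=128, kernel_size=3, dilations=(1, 2, 4)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations])
        self.act = nn.ReLU()

    def forward(self, x):              # x: (batch, channels, seq_len)
        for conv in self.convs:
            x = self.act(conv(x))      # padding keeps the sequence length fixed
        return x

# Receptive field of the block: 1 + (1 + 2 + 4) * (3 - 1) = 15 tokens.
tokens = torch.randn(8, 128, 50)       # a batch of 50-token sentences
print(DilatedBlock()(tokens).shape)    # torch.Size([8, 128, 50])

All fifty positions are convolved simultaneously, which is the source of the speed advantage over the step-by-step Bi-LSTM; note, however, that nothing in such a block encodes absolute position, which is the gap addressed in this paper.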
To address these issues, we propose an attention-based ID-CNNs-CRF model. We first introduce position embedding to capture word-order information. Then, an attention mechanism, which has shown good performance on NLP tasks, is applied to the ID-CNNs-CRF model to strengthen the influence of critical words. Finally, we apply a CRF to obtain the optimal tag sequence. Experimental results on two CEMR datasets demonstrate that the proposed model achieves better prediction performance, with a higher F1-score than the baseline methods.
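As an illustration of the final decoding step only (a minimal NumPy sketch with random scores, not the training procedure of our model), the CRF layer selects the globally best tag sequence by Viterbi decoding over the per-token emission scores and a tag-transition matrix, rather than choosing each tag independently.

import numpy as np

def viterbi_decode(emissions, transitions):
    # emissions: (seq_len, num_tags) scores from the encoder
    # transitions: (num_tags, num_tags) learned tag-to-tag compatibilities
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = []
    for t in range(1, seq_len):
        # total[i, j] = best score ending in tag i at t-1, then moving to tag j
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1]                   # highest-scoring tag index per token

emissions = np.random.randn(6, 5)       # 6 tokens, 5 BIO-style tags
transitions = np.random.randn(5, 5)
print(viterbi_decode(emissions, transitions))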
The remainder of the paper is structured as follows: Section 2 introduces the related work. Section 3 describes the details of the proposed method. Section 4 demonstrates the proposed methodology with a series of experiments. Finally, the conclusions of this study are given in Section 5.
2. Related Work
Benefiting from the implementation of digital medicine, the digitization of medical records has been promoted for many years. Named Entity Recognition (NER) on CEMR, designed to identify critical entities of interest, is a basic step in medical information extraction. Traditionally, many simple, straightforward methods, such as rule-based and heuristic-search-based methods, have been utilized to identify critical medical entities [10]. These methods tend to achieve low recall because they cannot cover all medical entities. Although rule-based methods seem better than dictionary-based methods, formulating large numbers of rules requires extensive domain knowledge from medical professionals [11]. With the expansion of medical data, these time-consuming and labor-intensive early methods appear clumsy. Nevertheless, such approaches persist because they can be used as components of other systems to achieve good performance [12,13].
NER methods based on statistical machine learning include the Support Vector Machine (SVM) [14], the Hidden Markov Model (HMM) [15,16], the Maximum Entropy Hidden Markov Model (MEHMM) [17], and the Conditional Random Field (CRF) [18]. Zhou et al. [16] presented an HMM-based NER system to deal with the special phenomena in the biomedical domain. Suwias et al. [19] utilized a CRF-based machine learning system named Nersuite to achieve an F-score of 88.46% on the Gellus corpus. At present, CRF-based medical named entity recognition is the best statistical machine learning approach because it takes the transitions between tags into account. These methods do not need to match against large medical entity dictionaries, nor do experts need to craft rules for entity boundaries, but they have decent performance only when the features are well chosen: features such as Parts-Of-Speech (POS) tags, lexical features, and capitalization must be formulated by linguists and domain experts, which means that feature engineering is required.
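For illustration, the sketch below (with a hypothetical feature set, not the exact template of any cited system) shows the kind of hand-crafted token features such feature engineering produces for a linear-chain CRF: word identity, POS tag, capitalization, affixes, and the same cues for neighbouring tokens.

def token_features(sentence, pos_tags, i):
    # Hand-crafted features for the i-th token, in the dictionary format
    # commonly fed to linear-chain CRF toolkits.
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "pos": pos_tags[i],
        "prefix3": word[:3],
        "suffix3": word[-3:],
    }
    if i > 0:
        features["-1:word.lower"] = sentence[i - 1].lower()
        features["-1:pos"] = pos_tags[i - 1]
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["+1:word.lower"] = sentence[i + 1].lower()
        features["+1:pos"] = pos_tags[i + 1]
    else:
        features["EOS"] = True
    return features

sent = ["Patient", "denies", "chest", "pain"]
pos = ["NN", "VBZ", "NN", "NN"]
print(token_features(sent, pos, 2))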
In recent years, deep learning methods have been developed for NER. LSTM architectures capable of learning long-term dependencies have been put forward [20], especially bidirectional recurrent neural network architectures [20,21,22]. Lample et al. [1] utilized bidirectional LSTMs and conditional random fields to improve NER in multiple languages without resorting to any language-specific knowledge. Marc-Antoine et al. [3] utilized NeuroCRF to obtain an F1-score of 89.28% on the WikiNER dataset. Ling et al. proposed a neural network approach named attention-based Bi-LSTM-CRF for document-level chemical NER; the approach leverages document-level global information obtained by the attention mechanism to enforce tagging consistency across multiple instances of the same token in a document [23]. However, these models are inefficient because they process sentences sequentially. To alleviate this problem, Emma et al. [7] applied the ID-CNNs architecture to speed up the network and showed that the test-time speed of the ID-CNN models was 1.42-times faster than that of the Bi-LSTM models. However, such models tend to ignore the word-order feature and the local context compared with LSTM-based models. As shown in Figure 1, the ID-CNN model fails to take advantage of word-order information between words, so the model itself behaves more like a bag-of-words model. More importantly, owing to the stacked structure of the ID-CNN, the output of the last layer has too large a receptive field, so it lacks perception of the local context. On the one hand, we need to make the ID-CNN model incorporate positional information; on the other hand, when predicting the output at the current position, the model should consider more context around that position rather than only global information.
In this paper, in order to improve the performance of the ID-CNNs-CRF model, position embedding is utilized to introduce word-order information, and an attention mechanism is applied to focus on critical words by assigning them different weights.
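A minimal sketch of these two additions is given below (PyTorch, with hypothetical dimensions, and with concatenation of the two embeddings and a simple additive attention assumed for illustration; the exact combination and attention form used in this paper are described in Section 3): a learned position embedding is attached to each word embedding, and attention produces per-token weights that emphasize critical words before the features are passed on.

import torch
import torch.nn as nn

class PositionAwareAttention(nn.Module):
    # Word embedding + learned position embedding, followed by a simple
    # additive attention that re-weights the token representations.
    def __init__(self, vocab_size=5000, max_len=200, word_dim=100, pos_dim=25):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.pos_emb = nn.Embedding(max_len, pos_dim)   # one vector per position index
        self.score = nn.Linear(word_dim + pos_dim, 1)   # attention scorer

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        batch, seq_len = token_ids.shape
        positions = torch.arange(seq_len, device=token_ids.device)
        pos = self.pos_emb(positions).expand(batch, -1, -1)
        x = torch.cat([self.word_emb(token_ids), pos], dim=-1)
        weights = torch.softmax(self.score(x), dim=1)    # (batch, seq_len, 1)
        return x * weights                               # weighted token features

model = PositionAwareAttention()
out = model(torch.randint(0, 5000, (4, 50)))
print(out.shape)                                         # torch.Size([4, 50, 125])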
5. Conclusions
In this paper, we proposed an attention-based ID-CNNs-CRF model for NER on CEMR. Firstly, a word representation combined with position embedding was used as the input of our model to capture word-order information. Secondly, we stacked four dilated CNN blocks to obtain broad semantic information and make up for the weakness of CNN-based models on language tasks. Then, the attention mechanism was applied to pay more attention to the characteristics of the current word and increase the performance of the model. Finally, the CRF was utilized to obtain the optimal tag sequence. The experiments on two CEMR datasets demonstrated that the attention-based ID-CNNs-CRF was superior to state-of-the-art methods while offering a faster test time. As we know, Bi-LSTM-CRF has a temporal structure, and its output at each step depends on the output of the previous step; this temporal structure is time consuming, especially when dealing with long text. Our model has good parallelism and can make full use of the GPU for parallel computing. Compared with the ID-CNNs-CRF, our model obtained better performance (improvements of 5.95%, 7.48%, and 7.08% in precision, recall, and F1-score, respectively). This demonstrates that position embedding and the attention mechanism brought a substantial performance improvement to the ID-CNNs model. In addition, our model outperformed Bi-LSTM-CRF, showing that our attention-based ID-CNNs-CRF is also an effective token encoder for structured inference. The model we proposed was 22% faster than the Bi-LSTM-CRF. However, our model showed no significant improvement on entity types with fewer samples; therefore, our future work is to study how to improve the recognition of these entities with fewer samples.