With the rapid development of life science and technology, the biomedical literature has grown exponentially. For example, the biomedical literature repository MEDLINE collects over 9.6 billion records and grows at 30–50 million records a year (https://www.ncbi.nlm.nih.gov/guide/literature/). This literature contains vast amounts of potential medical information that could be useful to biomedical research, industrial medicine manufacturing, and so forth.
The first step in automatically extracting potential medical information from the vast amounts of biomedical literature is to develop a named entity recognition system. Such a system is a crucial part of therapeutic relation extraction systems and applications, such as those for drug-drug interactions [1] and adverse drug reactions [2].
Drug-Named Entity Recognition (DNER) is the task of locating drug entity mentions in unstructured medical texts and classifying them into predefined categories. Our research interest in DNER has two main motivations: first, new drugs are discovered rapidly and constantly; second, drug naming rules are not strictly followed.
The DNER task is challenging for the following reasons: (1) supervised training data are limited; (2) new drug entity names appear constantly; (3) the authors of biomedical literature do not always follow the proposed standardized naming rules or formats. Previous work on DNER usually involves two steps: the first is to construct orthographic features or train word embeddings [3], and the second is to apply machine learning methods such as Conditional Random Fields (CRF) [4], support vector machines [5], maximum entropy [6] and so forth. CRF has become the preferred choice for DNER because it is one of the most reliable sequence labeling methods and has shown good performance on different kinds of named entity recognition (NER) tasks, for example NER in the newswire domain [8]. Researchers have also explored biomedical knowledge resources, for example by constructing a new drug dictionary [10]. In recent years, neural network methods have been widely used to develop DNER systems [12]; these methods can learn feature representations from the raw input automatically and avoid a costly feature engineering process.
The long short-term memory (LSTM) network is suitable for processing variable-length input and has a long-term memory. Details are given in Section 2.1. Recent LSTM-based methods have achieved great success in NLP tasks such as NER [14] and sequence tagging [15]. In our work, inspired by [15], we construct a model that uses bi-directional LSTMs with a sequential conditional random field layer on top (LSTM-CRF). Because entity names usually consist of multiple tokens, a tag decision is needed for each token. To account for dependencies across output labels in the DNER task, we use a CRF to make the classification decisions instead of applying a softmax function in the output layer of the recurrent neural network.
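As a small illustration of per-token tag decisions for multi-token entity names, the sketch below converts entity spans into token-level tags. It assumes a BIO tagging scheme and a generic entity type called `drug`; both are illustrative assumptions, and the function name `spans_to_bio` is hypothetical rather than taken from this work.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, type) into per-token BIO tags.

    The BIO scheme is an assumption made for illustration only.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return tags


# Toy sentence with one two-token drug name and one single-token drug name.
tokens = ["Acetylsalicylic", "acid", "may", "interact", "with", "warfarin", "."]
spans = [(0, 2, "drug"), (5, 6, "drug")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under such a scheme an `I-` tag is only meaningful directly after a `B-` or `I-` tag of the same type, which is one kind of dependency between neighbouring output labels.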
The bi-directional LSTM-CRF is an end-to-end solution. It learns features from a dataset automatically during training, so it is not necessary to design hand-crafted features, and biomedical knowledge resources are not a prerequisite either. The bi-directional LSTMs learn how to identify named entities from sentence-level information and take the combination of word embeddings and character embeddings as input. Their outputs are fed into the CRF layer, which outputs the label sequence with the maximum score calculated by Equation (14).
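To make the idea of selecting the label sequence with the maximum score concrete, the following is a minimal sketch of Viterbi decoding. It assumes the usual linear-chain form in which a sequence score is the sum of per-token scores (standing in for the bi-directional LSTM outputs) and label-to-label transition scores; the exact scoring function of this work is the one defined in the model section, and all numbers and names below are toy assumptions.

```python
from typing import List, Tuple


def sequence_score(emissions: List[List[float]],
                   transitions: List[List[float]],
                   path: List[int]) -> float:
    """Score of one label path: per-token scores plus transition scores."""
    total = emissions[0][path[0]]
    for t in range(1, len(path)):
        total += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return total


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> Tuple[List[int], float]:
    """Return the highest-scoring label path and its score."""
    n_labels = len(transitions)
    best = list(emissions[0])           # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_labels):
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])  # follow back-pointers to the start
    path.reverse()
    return path, best[last]


# Toy example with labels 0 = O, 1 = B, 2 = I (hypothetical scores).
emissions = [[2.0, 0.5, 0.0], [0.2, 1.0, 1.2], [0.1, 0.3, 1.5]]
transitions = [[0.5, 0.0, -10.0],   # from O: moving directly to I is heavily penalised
               [0.0, -1.0, 1.0],    # from B
               [0.0, -1.0, 0.5]]    # from I
path, score = viterbi_decode(emissions, transitions)
greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
print(path, score, greedy)
```

In this toy case, choosing the best label independently per token gives O, I, I, which starts an entity with an inside tag, whereas decoding over the whole sequence gives O, B, I.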
We observe that relying solely on word embeddings ignores explicit character-level features such as prefixes and suffixes [16]. However, some character sequences within words exhibit orthographic features that can be useful indicators for DNER, particularly when the word embeddings are poorly trained. To address this issue, we combine word embedding features with character-level information to form the final word representations in the LSTM-CRF models. Experimental results on the DDIExtraction2011 and DDIExtraction2013 datasets show that the proposed method achieves performance comparable to the state-of-the-art results in the original challenge reports, without any constructed drug dictionary or hand-built features.
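The combination of word-level and character-level information can be sketched as a simple concatenation. The example below is only a toy illustration: the vectors are made up, the dimensions are arbitrary, and the character-level part is an order-insensitive average of per-character vectors, used here as a stand-in for whatever learned character-level encoder a real model would use. All names (`word_vectors`, `char_vector`, `word_representation`) are hypothetical.

```python
from typing import Dict, List

WORD_DIM = 4  # toy word-embedding size (assumption)
CHAR_DIM = 3  # toy character-embedding size (assumption)

# Toy word-embedding table; unknown words share the "<UNK>" vector.
word_vectors: Dict[str, List[float]] = {
    "aspirin": [0.1, 0.3, -0.2, 0.5],
    "<UNK>": [0.0, 0.0, 0.0, 0.0],
}


def char_vector(ch: str) -> List[float]:
    """Deterministic toy character embedding; a trained model would learn these."""
    code = ord(ch)
    return [(code % 7) / 7.0, (code % 11) / 11.0, (code % 13) / 13.0]


def char_representation(word: str) -> List[float]:
    """Average the character vectors of a word (stand-in for a learned encoder)."""
    if not word:
        return [0.0] * CHAR_DIM
    vectors = [char_vector(c) for c in word]
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(CHAR_DIM)]


def word_representation(word: str) -> List[float]:
    """Concatenate the word embedding with the character-level representation."""
    key = word.lower()
    word_part = word_vectors.get(key, word_vectors["<UNK>"])
    return word_part + char_representation(key)


known = word_representation("aspirin")
unseen_a = word_representation("warfarin")
unseen_b = word_representation("rifampicin")
print(len(known), unseen_a[WORD_DIM:], unseen_b[WORD_DIM:])
```

The two unseen words receive the same word-embedding part, but their character-level parts differ, so the final representations remain distinguishable even where the word embedding carries no information.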
The remainder of the paper is organized as follows. Section 2 describes the LSTM-CRF model, together with the different embeddings for the input layer and the training procedure. Section 3 reports the experimental results. Finally, Section 4