Attention-Based LSTM with Filter Mechanism for Entity Relation Classification

Abstract: Relation classification is an important research area in the field of natural language processing (NLP), which aims to recognize the relationship between two tagged entities in a sentence. The noise caused by irrelevant words and the word distance between the tagged entities may affect the relation classification accuracy. In this paper, we present a novel model, a multi-head attention long short-term memory (LSTM) network with a filter mechanism (MALNet), to extract text features and classify the relation of two entities in a sentence. In particular, we combine an LSTM with an attention mechanism to obtain the shallow local information and introduce a filter layer based on the attention mechanism to strengthen the useful information. Besides, we design a semantic rule for marking the key words between the target words and construct a key word layer to extract their semantic information. We evaluated the performance of our model on the SemEval-2010 Task 8 dataset and the KBP-37 dataset. We achieved an F1-score of 86.3% on the SemEval-2010 Task 8 dataset and an F1-score of 61.4% on the KBP-37 dataset, which shows that our method is superior to the previous state-of-the-art methods.


Introduction
Relation classification is an important natural language processing (NLP) task. It is the key step in many natural language applications such as information extraction [1,2], construction of knowledge base [3,4], and question answering [5,6]. Relation classification aims to extract valid information and classify the relationships between entities in sentences.
As shown in Figure 1, the sentence contains an example of the cause-effect (e1-e2) relation between the nominals "women" and "accident". <e1>, </e1>, <e2>, and </e2> are four position indicators which specify the starting and ending of the nominals [7]. It is obvious that the relation is easy to extract when the sentence is short. However, classification may be difficult for long sentences. Consider one more example sentence.
(NER) obtained by NLP tools like WordNet; 2. we designed an interactive sentence-level attention filter architecture to retain effective local feature information and designed a semantic rule to extract key words for learning more complicated features; 3. we conducted experiments on the SemEval-2010 Task 8 dataset and the KBP-37 dataset, and the experimental results demonstrate that our MALNet model performs better than the previous state-of-the-art methods on both datasets. The remainder of this paper is structured as follows. Section 2 presents related work on relation classification. Section 3 introduces the architecture of our MALNet model. Section 4 provides the experimental setup and results. Section 5 concludes the paper.

Related Work
Due to the practical significance of relation classification, a lot of research has been devoted to it. In recent years, deep neural networks have shown good performance on relation classification [8,21]. The construction of high-level features has experienced a shift from human design to deep learning [9]. Feature-based methods apply hand-crafted features such as named entities [13], the shortest dependency path [10,11], and the left or right tokens of the tagged entities [21] to this field, but constructing these features can cause accumulative errors and introduce noisy information.
For convolutional neural network (CNN) techniques, Zeng et al. [21] proposed a deep CNN to address this task. They used a CNN to model sentence-level and lexical-level features, including entities, the left and right tokens of entities, and WordNet hypernyms of entities. Santos et al. [18] proposed a classification-by-ranking CNN (CR-CNN) model using a new ranking loss to reduce the impact of artificial classes; they showed that the new ranking loss performs better than the common cross-entropy loss. Huang et al. [26] proposed an attention-based CNN (Attention-CNN) for semantic relation extraction, which employs a word-level attention mechanism to obtain the critical information for the relation representation. These methods have limitations in learning sequence features because of the limitations of convolution kernels. In addition, convolutional neural networks are less effective at classifying relations in long sentences.
On the other hand, RNN-based models show outstanding performance in processing text sequences for relation classification. Zhang et al. [16] proposed bidirectional long short-term memory networks (BLSTM) to capture context information; a bidirectional LSTM performs better than a standard LSTM. Zhang et al. [27] proposed an RCNN (Recurrent Convolutional Neural Networks) model, which combines the advantages of RNNs and CNNs: it not only addresses long-term dependencies with the RNN, but also extracts richer features with the CNN. Zhou et al. [15] used attention-based bidirectional long short-term memory networks to capture the most important semantic information in a sentence. This model does not rely on NLP tools for lexical resources and obtained state-of-the-art performance. Existing research indicates that RNNs perform better than CNNs due to their context sensitivity. Recently, some researchers have proposed attention-based models [23]. Cao et al. [28] exploited a bidirectional long short-term memory network with adversarial training to extract sentence-level features; to enhance robustness, they leverage attention mechanisms to better learn the most influential features. Lee et al. [29] proposed bidirectional LSTM networks with entity-aware attention to learn more semantic features. Yan et al. [25] presented SDP-LSTM, a novel neural network that classifies the relation of two entities in a sentence. The architecture leverages the shortest dependency path (SDP) between two entities and multichannel recurrent neural networks with long short-term memory (LSTM) units to pick up heterogeneous information along the SDP.
We summarize the past methods and their problems, and then put forward the MALNet model. This model can enhance robustness and capture context information for the relation classification task.

Our Model
In this section, we give an overview of the MALNet model. We introduce an attention-based BLSTM layer to extract word-level features and construct a sentence-level attention filter layer to retain useful information. We also design a semantic rule for extracting key words. As shown in Figure 3, our model consists of five main components:

• Word Representation Layer: The sentence is mapped into real-valued vectors called word embeddings.
• Attention-based BLSTM Layer: This layer consists of two channels for extracting word-level features. One channel uses a bidirectional LSTM to capture context information and focuses on the useful features with an attention mechanism. The other channel directly utilizes an attention mechanism to capture the similarity between words. We construct two channels to fit the features better.
• Sentence Level Attention Filter Layer: This layer constructs a filter module to process the noise. We take all lexical-level features into account, aiming to filter out noise and retain effective information for classification.
• Key Word Layer: In this layer, we analyze the sentence structure between the two target words, extract the key words for convolution, and provide auxiliary signals for classification.
• Classification Layer: The high-level features are fed into this layer. We calculate the relation score to identify the relationship.

(1) Word Representation Layer

Let us consider an input sentence denoted by S = {w_1, w_2, ..., w_n}, where n is the number of words. To represent a word, we embed each word into a low-dimensional real-valued vector called a word embedding [30]. Let v_i be the one-hot encoding of w_i; the embedding layer encodes the sparse one-hot representation v_i into a real-valued vector by looking it up in the matrix W ∈ R^(d_w × |V|), where d_w is the dimension of the word vectors and |V| is the vocabulary size. The word representation X = {x_1, x_2, ..., x_n} maps each word w_i to a low-dimensional real-valued vector x_i, which is fed into the next layer. Our experiments directly utilize pre-trained weights of the publicly available Embeddings from Language Models (ELMo) [31]. The bidirectional language model can obtain a context-dependent representation of the current word, and we fine-tune the word embeddings for relation classification.
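As a minimal sketch of the lookup described above (the vocabulary, dimensions, and random weights here are toy illustrations, not the paper's actual configuration):

```python
import numpy as np

# Toy vocabulary and embedding matrix W ∈ R^(d_w × |V|);
# names and sizes are illustrative only (the paper uses ELMo, d_w = 1024).
vocab = {"<pad>": 0, "the": 1, "car": 2, "crashed": 3}
d_w, V = 4, len(vocab)
rng = np.random.default_rng(0)
W = rng.standard_normal((d_w, V))

def embed(sentence):
    """Map tokens to columns of W (equivalent to x_i = W v_i for one-hot v_i)."""
    ids = [vocab.get(tok, 0) for tok in sentence]
    return W[:, ids].T               # shape (n, d_w): one row x_i per word

X = embed(["the", "car", "crashed"])
print(X.shape)                       # (3, 4)
```

Unknown tokens fall back to the padding index here; a real implementation would use a dedicated UNK embedding.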
(2) Attention-based BLSTM Layer To capture the word-level features effectively, we design a two-channel module to extract them. In the first channel, we use multi-head attention to extract word-level features. Since the attention mechanism neglects the order of the sequence [23], we use it to capture the shallow features. Regardless of sentence length and the distance between entities, the attention mechanism models the strength of relevance between representation pairs [32].
We can regard an attention mechanism part as a mapping of a query and key-value pairs to an output. An attention function of the query and the key is adopted to compute the weight of each value, then the output is determined as a weighted sum of the values [23].
For multi-head attention, we use the word representation X = {x_1, x_2, ..., x_n} to initialize the query Q, key K, and value V. Given the matrices Q, K, and V, the scaled dot-product attention and the multi-head output are calculated by the following equations:

Attention(Q, K, V) = softmax(QK^T / √(d_k/r)) V
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
M = (head_1 ⊕ head_2 ⊕ ... ⊕ head_r) W^M

where W_i^Q, W_i^K, W_i^V ∈ R^((d_k/r) × d_k) are the projection matrices, W^M ∈ R^(d_k × d_k) is a mapping parameter from the input space to the representation space, r is the number of attention heads, d_k is the dimension of the word vectors, and ⊕ represents the concatenation operator.
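A rough numpy sketch of this channel follows; the weights are random toy parameters and the function name is ours, so treat it as an illustration of the computation rather than the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, r=2, seed=0):
    """Self-attention with Q = K = V = X, r heads, head outputs
    concatenated and mapped by W_M (all weights are toy random values)."""
    n, d_k = X.shape
    d_h = d_k // r                                   # per-head dimension d_k / r
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(r):
        Wq, Wk, Wv = (rng.standard_normal((d_k, d_h)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_h))          # scaled dot-product attention
        heads.append(A @ V)
    W_M = rng.standard_normal((d_k, d_k))
    return np.concatenate(heads, axis=1) @ W_M       # shape (n, d_k)

X = np.random.default_rng(1).standard_normal((5, 8))
M = multi_head_attention(X, r=2)
print(M.shape)  # (5, 8)
```

Note that the output keeps the input dimensionality d_k, matching the W^M mapping above.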
In the second channel, we combine long short-term memory (LSTM) recurrent neural networks with the attention mechanism. Long short-term memory networks are an improvement over general recurrent neural networks [24]: they achieve good results and alleviate the vanishing gradient problem. However, a standard LSTM processes the sequence monotonically in time order and can only capture information from left to right or from right to left, which splits the context information. We therefore chose a bidirectional LSTM network to capture the features.
As shown in Figure 4, the BLSTM network contains two sub-networks for the left and right sequence context, a forward layer and a backward layer. We take the word representation X = {x_1, x_2, ..., x_n} as input; at time step t, the LSTM unit is computed as:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where i_t, f_t, and o_t are the input gate, forget gate, and output gate, respectively. The parameters W_i, U_i; W_f, U_f; and W_o, U_o are the weight matrices of the input, forget, and output gates, and b_i, b_f, and b_o are their bias vectors. The parameters W_c and U_c are the weight matrices of the new memory content c̃_t, and b_c is its bias vector. h_t is the LSTM hidden state, c_t is the current cell state, ⊙ denotes element-wise multiplication, and tanh is the hyperbolic tangent function. At time step t, the output of the BLSTM is:

h_t = →h_t ⊕ ←h_t

where →h_t and ←h_t are the hidden states of the forward and backward LSTMs at time step t, h_t ∈ R^(2d_h) represents the concatenation of the hidden states, d_h is the hidden size of the BLSTM, and ⊕ is the concatenation operator. We then take the two entity vectors h_t^{e1} and h_t^{e2} from the BLSTM output, which represent the tagged entities' context information at time step t.
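The gate equations above can be sketched as a single LSTM step; the weight names and sizes below are toy stand-ins, not the trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step following the gate equations above; P holds toy
    weight matrices (W_*: input weights, U_*: recurrent weights, b_*: biases)."""
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])      # input gate
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])      # forget gate
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])      # output gate
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])  # new memory
    c = f * c_prev + i * c_tilde                                 # cell state c_t
    h = o * np.tanh(c)                                           # hidden state h_t
    return h, c

d_w, d_h = 4, 3
rng = np.random.default_rng(0)
P = {k: rng.standard_normal((d_h, d_w if k[0] == "W" else d_h))
     for k in ("Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc")}
P.update({b: np.zeros(d_h) for b in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(rng.standard_normal(d_w), np.zeros(d_h), np.zeros(d_h), P)
print(h.shape, c.shape)  # (3,) (3,)
```

A BLSTM would run this step forward and backward over the sequence and concatenate the two hidden states at each t.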
Due to the variability of sentence length and of the distance between entities [33,34], although the BLSTM can capture context information, it performs poorly on long texts. To solve this problem, we add an attention layer after the BLSTM layer, which can handle longer texts effectively. In the attention part, we construct the query Q from the vectors h_t^{e1} and h_t^{e2}, while K and V are the output of the BLSTM. To maintain dimensional consistency, we extend h_t^{e1} and h_t^{e2} to the same dimension as the BLSTM output. The scaled dot-product attention is calculated by the following equations:

head_j = softmax((Q W_j^Q)(K W_j^K)^T / √d_l) (V W_j^V)
H_i = (head_1 ⊕ head_2 ⊕ ... ⊕ head_r) W^C

where W_j^Q, W_j^K, W_j^V ∈ R^(d_l × d_l) are the projection matrices, W^C ∈ R^(d_l × r) is a mapping parameter from the input space to the representation space, r is the number of attention heads, d_l = 2 d_h is the dimension of the BLSTM output, and H_i is the attention result for entity i over the BLSTM output. Through the attention mechanism, we obtain the outputs H_1 and H_2.

(3) Sentence Level Attention Filter Layer
In recent years, attention mechanisms have been used for text classification [35], question answering [36], and named entity recognition [37]. The words in a sentence carry different levels of importance [38]. To distinguish valid features from invalid ones effectively, we fine-tune the attention mechanism to build an attention filter layer. Firstly, we concatenate the outputs of the two channels in the attention-based BLSTM layer:

T = M ⊕ H_1 ⊕ H_2

where M is the output of the multi-head attention channel, H_1 and H_2 are the entity attention outputs of the BLSTM channel, and ⊕ represents the concatenation operator. Then we introduce latent entity types [29]. There may be more than one possible relationship between a particular entity and other entities. We use LET (Latent Entity Types) to extract the latent types and improve the ability to extract relationships. The latent type vectors Y_1 and Y_2 are computed from the entity vectors h_t^{e1} and h_t^{e2} with a LET weight matrix W ∈ R^(2d_h × h), where the hyperparameter h is the number of latent types. Traditional models take the position embeddings as part of the word representation, which may amplify noise and disturb the original word representation [19,28]. In this paper, we instead introduce them into the filter layer. The position embeddings of the i-th word are encoded as p_i^{e1}, p_i^{e2} ∈ R^(d_p), where d_p is the dimension of the position embeddings. For the sentence, the position embeddings can be represented as p_1, p_2 ∈ R^(d_p × n), where n is the number of words.
All of the high-level features of the sentence may affect the classification. Based on this idea, we take the following features into consideration, as shown in Figure 5: (1) the latent entity type vectors Y_1 and Y_2; (2) the position embeddings p_1, p_2 of the sentence, which reflect each word's position relative to the entities; (3) the concatenated output T of the attention-based BLSTM layer. We construct a high-level feature selection matrix C from all of these features: C = T ⊕ Y_1 ⊕ Y_2 ⊕ p_1 ⊕ p_2, C ∈ R^(d_w + 2d_h + 2d_p), where ⊕ represents the concatenation operator. Finally, in order to better handle sentences of different lengths, we introduce the advanced features into the self-attention computation, but they only participate in selecting the factors that determine the relation of the sentence; they are not part of the output. The representation R of the sentence is calculated by the following equations:

α = softmax(W_h^T tanh(W_w^T C))
R = C α

where W_w ∈ R^((d_w + 2d_h + 2d_p) × e) and e is the attention size, W_h ∈ R^e is a weight vector, α represents the weights of the filter layer, and R ∈ R^(d_w + 2d_h + 2d_p). Through this layer, we obtain the high-level features.
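A hedged sketch of this filtering idea follows: score each word's feature column with a small self-attention projection and keep a weighted combination. The exact formulation in the paper may differ; all weights and names here are ours:

```python
import numpy as np

def attention_filter(C, e=6, seed=0):
    """Score the n columns of C (one high-level feature vector per word)
    and return their attention-weighted combination. W_w, W_h are toy
    random weights standing in for learned parameters."""
    d, n = C.shape
    rng = np.random.default_rng(seed)
    W_w = rng.standard_normal((d, e))
    W_h = rng.standard_normal(e)
    scores = np.tanh(C.T @ W_w) @ W_h          # one scalar score per word
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                       # softmax filter weights
    return C @ alpha                           # weighted sum over words

C = np.random.default_rng(1).standard_normal((10, 7))  # d features x n words
R = attention_filter(C)
print(R.shape)  # (10,)
```

Words that the scorer marks as noise receive small weights and contribute little to R.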

(4) Key Word Layer
Previous work uses the shortest dependency path (SDP) to mark important words in long sentences and gives the key words higher weights. However, we find that the key words assigned by SDP cannot always express the core meaning of a sentence, and not all sentences yield key words through SDP.
As shown in Figure 6, the blue boxes represent the target words, and the red boxes represent the words on the SDP between them. It is obvious that the word "injured" is the predicate which connects "lawsuits" and "fans". However, SDP treats "ensued" and "from" as the key words and assigns them high weights, while "injured", also a key word in this sentence, is ignored by SDP. Besides, not all sentences yield key words via SDP. We tested the 8000 sentences in the training set of the SemEval-2010 Task 8 dataset and found 16 sentences with no SDP between the two target words. In those sentences the high-weighting operation fails, all words have the same effect on classification, and the importance of the key words cannot be highlighted.
In order to better extract the key words, we design a semantic rule tailored to the relation classification task. Because the words between the two target words basically cover all of the possible information, we take out all of these words. We then apply the following rule: (1) remove all parallel words such as "and", "or", and "but"; (2) remove all adverbial words based on part-of-speech tagging, keeping all actions between the two target words to provide effective information for classification. The core words of the sentences processed by this rule are shown in Figure 7. We extract the key words from each sentence by the rule and then splice their word vectors together to reduce the distance between them. Because the spliced sequence is shorter than the original sentence, we can use a convolutional neural network (CNN) directly.
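The rule can be sketched as a simple filter over the words between the two entities. The POS tags here are supplied by hand and the tag convention is ours; in practice they would come from a tagger:

```python
# Hedged sketch of the semantic rule: drop coordinating conjunctions
# (rule 1) and adverbials (rule 2), keeping actions and other content words.
CONJUNCTIONS = {"and", "or", "but"}

def key_words(tokens_between, pos_tags):
    """tokens_between: words between <e1> and <e2>;
    pos_tags: parallel list of coarse POS tags ('ADV' marks adverbials)."""
    kept = []
    for tok, pos in zip(tokens_between, pos_tags):
        if tok.lower() in CONJUNCTIONS:   # rule (1): remove parallel words
            continue
        if pos == "ADV":                  # rule (2): remove adverbial words
            continue
        kept.append(tok)
    return kept

tokens = ["quickly", "and", "badly", "injured", "the"]
tags   = ["ADV", "CONJ", "ADV", "VERB", "DET"]
print(key_words(tokens, tags))  # ['injured', 'the']
```

The surviving words' vectors would then be spliced together before the convolution step.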
The advantage of a convolutional neural network is that it can extract local features effectively. All of the key words of the sentence S are represented as a list of vectors (x_1, x_2, ..., x_i), where x_j corresponds to the j-th key word. As shown in Figure 8, with a sliding window of size k, the CNN can extract local features between key words.
z_j = f(W_CNN · S_{j:j+k-1})

where S ∈ R^(d_w × i) is the matrix of spliced key-word vectors, W_CNN ∈ R^(d_w × k) is a weight matrix with channel size K, and f is a non-linear activation function. Through the CNN operation on the core vocabulary, we extract features that assist our classification.
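This windowed convolution can be sketched as follows; the filter weights, activation choice (ReLU), and max-pooling are illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np

def conv_keywords(X, k=2, n_filters=3, seed=0):
    """Slide a window of size k over the spliced key-word vectors and
    apply n_filters filters (toy random weights), then max-pool over
    window positions to get a fixed-size feature vector."""
    n, d_w = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_filters, k * d_w))
    feats = []
    for j in range(n - k + 1):
        window = X[j:j + k].reshape(-1)          # concatenate k word vectors
        feats.append(np.maximum(W @ window, 0))  # ReLU activation
    return np.max(np.stack(feats), axis=0)       # max-pool -> (n_filters,)

X = np.random.default_rng(1).standard_normal((5, 4))  # 5 key words, d_w = 4
z = conv_keywords(X, k=2, n_filters=3)
print(z.shape)  # (3,)
```

Max-pooling makes the output length independent of how many key words the rule kept.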

(5) Classification Layer
The features extracted by the Key Word Layer help the classification process, so we add the output Z of the key word layer, weighted by a coefficient α, to the output R of the Sentence Level Attention Filter Layer:

O = R + α Z

We obtain a high-level sentence representation O for the relation, which we can use directly to predict the label ŷ. The classification is performed by a softmax classifier, and the probability p̂(y | S, θ) is:

p̂(y | S, θ) = softmax(W_s O + b_s)

where y is a target relation class, S represents the sentence, R represents the output of the Attention Filter Layer, and θ represents all of the trainable parameters of the whole network. The relation label with the highest probability is identified as the final result:

ŷ = argmax_y p̂(y | S, θ)

To make the distinction between classes clearer, we make some adjustments based on the ranking loss function [18]:

L = log(1 + exp(γ(m⁺ − s_θ(x)_{y⁺}))) + log(1 + exp(γ(m⁻ + s_θ(x)_{c⁻}))) + λ‖θ‖²

where y⁺ represents the correct label and c⁻ represents the highest-scoring sample among all incorrect relation types. s_θ(x)_{y⁺} and s_θ(x)_{c⁻} are the softmax scores of the correct relation label and of the negative category chosen with the highest probability among all incorrect relation types. λ is the L2 regularization hyperparameter. In our experiments, we set γ to 2.0, m⁺ to 1.0, and m⁻ to 0.0. As the loss decreases, the first score s_θ(x)_{y⁺} increases and the second score s_θ(x)_{c⁻} decreases. Compared with the cross-entropy loss, this loss function better distinguishes positive labels from negative labels and enlarges the margin. We introduce L2 regularization to avoid overfitting and improve generalization. In the training phase, we optimize the loss function with the Adadelta algorithm.
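The CR-CNN-style ranking loss above can be sketched numerically (without the L2 term); the function name is ours, and γ, m⁺, m⁻ use the paper's stated values:

```python
import numpy as np

def ranking_loss(scores, y_pos, gamma=2.0, m_pos=1.0, m_neg=0.0):
    """Ranking loss sketch: push the correct class score above m_pos
    and the best wrong-class score below -m_neg (L2 term omitted)."""
    s_pos = scores[y_pos]
    s_neg = max(s for i, s in enumerate(scores) if i != y_pos)  # best wrong class
    return (np.log1p(np.exp(gamma * (m_pos - s_pos)))
            + np.log1p(np.exp(gamma * (m_neg + s_neg))))

# A confidently correct prediction costs less than a confidently wrong one.
good = ranking_loss(np.array([3.0, -2.0, -1.5]), y_pos=0)
bad  = ranking_loss(np.array([-1.0, 2.0, 0.5]), y_pos=0)
print(good < bad)  # True
```

As the correct score rises and the best wrong score falls, both log terms shrink toward zero, which matches the behavior described above.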

Experiment
To evaluate the effectiveness of our MALNet model, we conducted experiments on the SemEval-2010 Task 8 dataset and the KBP-37 dataset. Our model outperforms the compared methods.

Datasets
We evaluated our model on two datasets: the SemEval-2010 Task 8 dataset and the KBP-37 dataset. In order to analyze the difference between the datasets, we counted the sentence lengths and the distances between the two tagged entities. The statistics are shown in Figures 9 and 10. In the SemEval-2010 Task 8 dataset, around 98% of entity distances are less than 15 and around 85% of sentence lengths are less than 30, which means that most of the dataset consists of short sentences. On the contrary, the KBP-37 dataset contains a large number of long sentences, and the proportion of entity distances greater than 15 reaches a quarter. Figure 10 shows that SemEval-2010 Task 8 focuses on short sentences and KBP-37 on long sentences. Figure 11 shows the statistical properties of these two datasets. The SemEval-2010 Task 8 dataset is a commonly used benchmark for relation classification [14]. The total number of relations is 19. There are 10,717 annotated sentences, consisting of 8000 samples for training and 2717 samples for testing. We used the Macro-F1 score (excluding "Other").
KBP-37 is a dataset that contains more specific entities and relations [17]. More of its entities are names of persons, organizations, or cities, which means more noise in the sentences. The total number of relations is 37. In addition, the KBP-37 dataset has more long sentences, and its many organization and place names mean that unseen words affect the classification accuracy.

Figure 11. The statistical properties of these two datasets.

Experiment Settings
In our experiments, we set the word embedding size d_w to 1024. The dimension of the position embeddings d_p was set to 50. The batch size was set to 20. To avoid over-fitting, we applied dropout to the word embedding layer, the output of the BLSTM, and the self-attention filter layer, with dropout rates of 0.3, 0.3, and 0.5, respectively. In the BLSTM layer, we set the LSTM hidden size to 300. We set the learning rate to 0.1 and the decay rate to 0.9. The regularization parameter λ was set to 0.001. The number of attention heads was set to 4. In the key word layer, we set the window sizes k to 2, 3, and 4, and the number of convolution kernels to 50. We set the weight coefficient α to 1 for the SemEval-2010 dataset and to 0.2 for the KBP-37 dataset. All of the hyperparameters are shown in Table 1.

Experiment Results
We compared our model with previous models on both the SemEval-2010 Task 8 and KBP-37 datasets. Tables 2 and 3 report the performance of our model and other methods on these datasets.
On the SemEval-2010 Task 8 dataset, we compared our model with baseline CNN and RNN methods. Nguyen et al. [20] applied a perspective-CNN network to the relation classification task; they tested multiple convolution kernel sizes and achieved an F1-score of 82.8%. Due to the limitation of the convolution kernel size, Zhou et al. [15] introduced an attention mechanism into a BLSTM to extract sentence features more effectively; they achieved 84%, which was the state-of-the-art result at the time. Dos Santos et al. [18] constructed a classification-by-ranking CNN (CR-CNN) to tackle the relation classification task, improving the loss function on the basis of a CNN. Adilova et al. [34] proposed a supervised ranking CNN model, an improvement on CR-CNN, and achieved 84.39% on this dataset. Zhang et al. [27] proposed an RCNN model that combines an RNN and a CNN in the network structure; this model achieved an F1-score of 83.7%. Cao et al. [28] applied adversarial training to the BLSTM model and achieved 83.6%. Zhang et al. [39] proposed a model named BiLSTM-CNN (Bi-directional Long Short-Term Memory-Convolutional Neural Networks) and added position embeddings to the input. J. Lee et al. [29] applied latent attention to a BLSTM and achieved 85.2%. G. Tao et al. [40] proposed a subsequence-level entity attention LSTM and achieved 84.7%. Our MALNet outperforms the previous methods and achieved 86.4% on this dataset. To demonstrate the filtering effect of the filter layer, we designed a series of comparative experiments. When we removed both the Attention Filter Layer and the key word layer from the original MALNet, the F1-score decreased to 84.3%. When we removed only the filter layer, our network achieved 85.6%; when we removed only the key word layer, it achieved 85.1%. To assess the generalization ability of our model across datasets, we also compared the same methods with our model on the KBP-37 dataset.
The perspective-CNN does not perform very well on the KBP-37 dataset because of the limitation of its convolution kernels on long sentences. Models using an RNN performed better than CNNs on long sentences. The supervised ranking CNN [34] achieved 61.26% on this dataset, and the Entity-ATT-LSTM achieved 58.1%. In the experiment on the KBP-37 dataset, we obtained an F1-score of 61.4%. When we removed the Attention Filter Layer and the key word layer from the original MALNet, the F1-score decreased to 58.3%, which shows the effectiveness of these modules. When we removed only the filter layer, our network achieved 59.3%; when we removed only the key word layer, it achieved 60.1%.
Compared with the reference models, we conclude that our model is more robust and adapts well to both short and long sentences. The comparative experiments also show that the modules we designed are effective for relation classification.

Conclusions
In this paper, we propose a novel neural network model, MALNet, for relation classification. Our model uses raw text with word embeddings and position embeddings as input. To extract primary features, we designed an attention-based BLSTM layer, which transforms the semantic information into high-level features. We then construct a Sentence Level Attention Filter Layer to effectively preserve the features that facilitate classification. Since we found that external tools for syntax analysis introduce noise, we construct a semantic rule instead and extract key words to build a key word layer that helps the relation classification task. Our experiments were carried out on two datasets: one with mostly short sentences, the other with mostly long sentences. To highlight the importance of the filter module and the key word layer, we also conducted comparative experiments. The results show that our MALNet model outperforms previous state-of-the-art methods on both the SemEval-2010 Task 8 and KBP-37 datasets, and the modules we designed prove effective for relation classification.
In the future, we will consider introducing Named Entity Recognition (NER) into relation classification. Entity recognition can mark the nouns in a sentence, which can cooperate with the classification network to refine the sentence structure.