Abstractive Sentence Compression with Event Attention

Abstract: Sentence compression aims at generating a shorter sentence from a long and complex source sentence while preserving the important content of the source sentence. Since it enhances comprehensibility and readability for readers, sentence compression is needed for summarizing news articles, in which event words play a key role in delivering the meaning of the source sentence. Therefore, this paper proposes an abstractive sentence compression model with event attention. In compressing a sentence from a news article, event words should be preserved as important information. To this end, event attention is proposed, which focuses on the event words of the source sentence in generating a compressed sentence. The global information in the source sentence is as significant as the event words, since it captures the information of the whole source sentence. Thus, the proposed model generates a compressed sentence by combining both attentions. According to the experimental results, the proposed model outperforms both the standard sequence-to-sequence model and the pointer generator on three datasets: the MSR dataset, the Filippova dataset, and a Korean sentence compression dataset. In particular, it achieves a 122% higher BLEU score than the sequence-to-sequence model. Therefore, the proposed model is effective in sentence compression.


Introduction
Sentence compression is an NLP task which aims at generating a compact sentence from a long and complex source sentence while preserving the important content of the source sentence. Since it generates a shorter and more condensed sentence than its source, it is also known as sentence-level summarization. The need for sentence compression is increasing with the rapid growth of web and mobile content, since sentence compression allows web and mobile users to grasp and understand the content efficiently. That is, sentence compression not only enhances the comprehensibility and readability of the content for users, but also saves time and cost [1,2]. Thus, sentence compression is regarded as a significant and useful technique.
There have been many studies on sentence compression, and they fall into two approaches: a deletion-based approach and an abstractive approach. The deletion-based approach generates a target sentence by removing unimportant and unnecessary words from the source sentence [3][4][5][6][7][8]. That is, a target compressed sentence is a subsequence of a source sentence. Filippova et al. solved sentence compression as a binary classification task with an LSTM-based model [9]. In this model, an LSTM (Long Short-Term Memory) network determines whether each word in a source sentence is kept in or deleted from the compressed sentence.

Related Work
There have been several studies on sentence compression, and they are divided into two kinds of approaches: the deletion-based approach and the abstractive approach. Sentence compression by the deletion-based approach decides whether each word in a source sentence remains in the target sentence. That is, it regards sentence compression as a binary sequence labeling problem. Linguistic information is usually used to determine the sequence labels [25][26][27], but some previous studies compressed a sentence by pruning the dependency parse tree of the source sentence [7,10]. Filippova et al. proposed an unsupervised compression method that prunes the dependency parse tree [28]. In their method, the weights of the edges in a dependency parse tree are computed using syntactic and length constraints, and then used to prune the tree. Later, they proposed a supervised version which extends the unsupervised method [6]. For supervised learning of sentence compression, they also released a dataset. In this version, the weight of every edge in a parse tree is computed by a linear function of lexical, syntactic, and semantic features, and the relevance weights of the features are determined using the released data. All these models depend greatly on the accuracy of parsing a source sentence, but current techniques for parsing natural language sentences are not fully reliable yet.
With the growth of neural networks, some studies have adopted neural networks to determine word drops. Filippova et al. claimed that syntactic information could be unnecessary for sentence compression [9]. Thus, they adopted an LSTM model to determine whether a word should be deleted from the compressed sentence based on the deletion/retention of its previous word in the source sentence. In addition, Hasegawa et al. used an LSTM constrained by a target length [29]. These methods do not use any syntactic information, but some experiments showed that they could be improved if syntactic information were incorporated [7].
One of the main problems in using neural networks for sentence compression is that named entities in a source sentence are usually treated as unknown words, since the neural models use a pretrained word embedding, and a high frequency of unknown words often degrades model performance. Wang et al. hypothesized that incorporating syntactic information into a compression model would help sentence compression and domain adaptability [7]. Thus, they proposed an LSTM-based model which uses syntactic information such as POS tags and parsing information. They showed empirically that syntactic information is influential and improves the robustness of a model in cross-domain applications. However, the deletion-based approach has the chronic limitation of being unable to generate expressive sentences.
The abstractive approach to sentence compression can be regarded as sentence paraphrasing. Thus, it can generate unseen words and produce more expressive compressed sentences than the deletion-based approach. Inspired by neural machine translation, Rush et al. applied an attention-based sequence-to-sequence model to sentence compression [11]. Since a sequence-to-sequence model is structurally simple, their model has the merit that it does not require any pre- or post-processing for sentence compression. On the other hand, Vu et al. adopted a memory-augmented recurrent neural network [30]. The memory-augmented model uses a memory matrix to obtain a better understanding of a source sentence. However, these studies are vulnerable to named entities since they generate a sentence with a fixed vocabulary set. To overcome this problem, Yu et al. proposed an operation network that mimics human sentence compression [15]. Since this model is capable of copying, editing, and generating words, it can imitate human compression as well as overcome the limits of the deletion-based approach. However, due to the nature of sequence-to-sequence models, these studies have difficulty compressing long and complex sentences.
Some novel attention mechanisms have been proposed to address this difficulty. Chopra et al. proposed a convolutional attention-based model [31] in which the attention helps to understand a source sentence by capturing its context. On the other hand, Kamigaito et al. incorporated a higher-order syntactic attention into a sequence-to-sequence model [8]. The attention is computed with a chain of dependency relations in the dependency graph of a source sentence. The problem with these studies is that the attention is somewhat helpful in understanding a sentence, but fails to focus on the words that deliver important information, such as an event, in a long and complex sentence. Therefore, this paper proposes an event attention which enables the network to regard event words as salient words.

Sentence Compression by a Sequence-to-Sequence Model
Given a source sentence $X = (x_1, x_2, x_3, \ldots, x_n)$, sentence compression aims to generate a target sentence $Y = (y_1, y_2, y_3, \ldots, y_m)$, where $n$ and $m$ are sentence lengths and $n > m$. Thus, a sentence compression model can be formulated by a conditional probability

$$P(Y|X) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, X). \tag{1}$$

Since optimizing Equation (1) is an objective of standard sequence-to-sequence models [16,18], a sequence-to-sequence model can be used as a sentence compression model. A sequence-to-sequence model has an encoder and a decoder. The encoder, which is usually implemented as a bi-directional RNN [32], takes the source sentence $X = (\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_n)$, where $\mathbf{x}_i$ is an embedding vector of the word $x_i$. At each time step $t$, the encoder outputs a forward hidden state $\overrightarrow{h}_t = f(\mathbf{x}_t, \overrightarrow{h}_{t-1})$ and a backward hidden state $\overleftarrow{h}_t = f(\mathbf{x}_t, \overleftarrow{h}_{t+1})$, where the function $f$ is a recurrent unit such as GRU [16] or LSTM [33]. The hidden state $h_t$ at time $t$ is expressed as a concatenation of $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. That is,

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t], \tag{2}$$

where $[a; b]$ denotes the concatenation of a vector $a$ and a vector $b$. The concatenation of the final forward and backward states of the encoder, $Q = [\overrightarrow{h}_n; \overleftarrow{h}_1]$, is used as a representation of the source sentence $X$.
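To make the encoder concrete, the following is a minimal PyTorch sketch of Equation (2). This is an illustration, not the authors' implementation; the class and variable names, and the use of PyTorch itself, are assumptions. The dimensions match the experimental setting reported later (256-dimensional embeddings, 512-dimensional hidden states).

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bi-directional LSTM encoder: returns per-token states h_i and a
    source-sentence representation Q (Equation (2))."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hid_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # hid_dim is split across the two directions so that the
        # concatenated state h_i = [fwd; bwd] has size hid_dim.
        self.lstm = nn.LSTM(emb_dim, hid_dim // 2, bidirectional=True,
                            batch_first=True)

    def forward(self, src_ids: torch.Tensor):
        emb = self.embedding(src_ids)            # (batch, n, emb_dim)
        h, (h_n, _) = self.lstm(emb)             # h: (batch, n, hid_dim)
        # Concatenate the final forward and backward states as Q.
        q = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, hid_dim)
        return h, q

# Usage: encode a batch of two 7-token sentences (toy vocabulary of 1000).
encoder = BiLSTMEncoder(vocab_size=1000)
h, q = encoder(torch.randint(0, 1000, (2, 7)))
print(h.shape, q.shape)  # torch.Size([2, 7, 512]) torch.Size([2, 512])
```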
The decoder, which is also implemented as an RNN, generates a target compressed sentence $Y$ with the vectors computed by the encoder. It computes a decoder state $s_t$ at each time $t$ as

$$s_t = f(s_{t-1}, \mathbf{y}_{t-1}), \tag{3}$$

where $\mathbf{y}_{t-1}$ is the embedding vector of the previously generated word. Then, $P_{vocab}$, the probability distribution over all words, is obtained by

$$P_{vocab} = \mathrm{softmax}(W_v s_t + b_v), \tag{4}$$

where $W_v$ and $b_v$ are the parameters to be tuned. That is, the decoder state $s_t$ is fed to the softmax function to produce $P_{vocab}$. Finally, the word with the highest probability in $P_{vocab}$ is chosen as the output word at time $t$. Since the sequence-to-sequence model is trained with a set of pairs of a long source sentence and its short compressed sentence, it eventually learns how to transform a long sentence into a short one, which is sentence compression.
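A matching sketch of one decoder step, Equations (3) and (4), under the same assumptions as above; the start-of-sentence id and the greedy argmax are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Single LSTM decoder step: s_t from s_{t-1} and y_{t-1} (Equation (3)),
    then a softmax over the vocabulary (Equation (4))."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hid_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)  # W_v, b_v

    def step(self, prev_word: torch.Tensor, state):
        y_prev = self.embedding(prev_word)          # (batch, emb_dim)
        s_t, cell_state = self.cell(y_prev, state)  # Equation (3)
        p_vocab = torch.softmax(self.out(s_t), dim=-1)  # Equation (4)
        return p_vocab, (s_t, cell_state)

# Usage: one greedy decoding step from a zero-initialized state.
decoder = Decoder(vocab_size=1000)
state = (torch.zeros(2, 512), torch.zeros(2, 512))
p_vocab, state = decoder.step(torch.tensor([1, 1]), state)  # 1 = <sos> id
next_word = p_vocab.argmax(dim=-1)  # word with the highest probability
```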

Abstractive Sentence Compression with Event Attention
Event words in news articles play a key role in delivering the meaning of a source sentence, since an event means something that happens or occurs [34]. For instance, Table 1 shows compressed sentences of the source sentence "If IBM has miscalculated the demand, it will suffer badly as both the high operating costs and depreciation on the huge capital investment for the East Fishkill factory drag down earnings.", in which the event words are shown in bold. Since the event words convey the main actions of the source sentence, they are usually preserved in the compressed sentence. As a result, the compressed sentences in Table 1 contain at least one of the words 'miscalculated', 'suffer', and 'drag'. Therefore, event words must be regarded as salient words in generating a compressed sentence.
To focus on the event words of a source sentence in generating a compressed sentence, the event words should first be extracted. In this paper, the event words are extracted using the event extraction system proposed by Chambers et al. [35]. When a source sentence $X = (x_1, x_2, x_3, \ldots, x_n)$ is given, the system returns an event mask vector $E_X = (e_1, e_2, e_3, \ldots, e_n)$, where $e_i$ is 1 if $x_i$ is an event word and 0 otherwise. After that, a compressed sentence of $X$ is generated by a sequence-to-sequence model. Figure 1 depicts the proposed sequence-to-sequence model which generates a compressed sentence from a vectorized source sentence $X$ and the event mask vector $E_X$. For a given source sentence $X$, the encoder outputs the hidden states $h_i$ and the source sentence representation $Q$ by Equation (2).

Table 1. An example of a source sentence and its compressed sentences in the MSR dataset.

Source sentence:
· If IBM has miscalculated the demand, it will suffer badly as both the high operating costs and depreciation on the huge capital investment for the East Fishkill factory drag down earnings.

Compressed sentences:
· If IBM has miscalculated the demand, high operating costs and depreciation will drag down earnings.
· Miscalculation will lead IBM to have both high operating costs and capital investment depreciation for the East Fishkill earnings.
· If IBM miscalculated demand, it will suffer as high operating costs and depreciation on capital investment for the factory lower earnings.
· IBM will suffer if it miscalculates the demand as the operating costs and depreciation on investment for East Fishkill drag down earnings.
· If IBM has miscalculated the demand, it will suffer badly as both the high operating costs and depreciation on the huge capital.

To compress the source sentence elaborately, two types of attention, event attention and global attention, are adopted in the proposed model. The event attention allows the model to focus on the event words in the source sentence, while the global attention provides an overall understanding of the source sentence. The event attention weight $\alpha_i$ of an encoder hidden state $h_i$ is computed using $h_i$, the source sentence representation $Q$, and the event mask $e_i$. That is,

$$\alpha_i = \operatorname{softmax}\left( V_e^{\top}\, g(W_h h_i + W_q Q + W_e e_i + b_{eattn}) \right), \tag{5}$$

where the softmax is taken over all source positions $i$, $V_e$, $W_h$, $W_q$, and $W_e$ are weight parameters learned from the training data, $b_{eattn}$ is a bias vector, and $g$ is a non-linear function.
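As a concrete reading of Equation (5), here is a minimal PyTorch sketch of the event attention. It is an illustrative reconstruction, not the authors' released code; tanh for $g$ and the softmax normalization over source positions are assumptions carried over from the equation above.

```python
import torch
import torch.nn as nn

class EventAttention(nn.Module):
    """Event attention (Equation (5)): scores each encoder state h_i using
    h_i, the sentence representation Q, and the 0/1 event mask e_i."""
    def __init__(self, hid_dim: int = 512, attn_dim: int = 512):
        super().__init__()
        self.w_h = nn.Linear(hid_dim, attn_dim, bias=False)   # W_h
        self.w_q = nn.Linear(hid_dim, attn_dim, bias=False)   # W_q
        self.w_e = nn.Linear(1, attn_dim, bias=False)         # W_e
        self.b = nn.Parameter(torch.zeros(attn_dim))          # b_eattn
        self.v_e = nn.Linear(attn_dim, 1, bias=False)         # V_e

    def forward(self, h: torch.Tensor, q: torch.Tensor, e: torch.Tensor):
        # h: (batch, n, hid_dim), q: (batch, hid_dim), e: (batch, n) in {0, 1}
        scores = self.v_e(torch.tanh(                         # g = tanh
            self.w_h(h) + self.w_q(q).unsqueeze(1)
            + self.w_e(e.unsqueeze(-1)) + self.b)).squeeze(-1)
        return torch.softmax(scores, dim=1)  # alpha_i over source positions

# Usage with the encoder sketch above: the mask marks positions 2 and 5
# as event words, so those states tend to receive heavier weights.
h, q = torch.randn(2, 7, 512), torch.randn(2, 512)
e = torch.zeros(2, 7); e[:, [2, 5]] = 1.0
alpha = EventAttention()(h, q, e)  # (batch=2, n=7), each row sums to 1
```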
Since the event attention helps the decoder focus on event words during decoding, it pays less attention to the other salient words that deliver the overall meaning of the source sentence. Thus, the attention proposed by Bahdanau et al. [20] is adopted as the global attention of the proposed model, since it is one of the most widely used standard attention mechanisms [36]. When the decoder generates the $t$-th word, the global attention weight $\beta_i^t$ of $h_i$ is computed by

$$\beta_i^t = \operatorname{softmax}\left( V_g^{\top}\, g(W_h h_i + W_s s_{t-1} + b_{gattn}) \right), \tag{6}$$

where $V_g$, $W_h$, and $W_s$ are weight parameters learned from the training data, $b_{gattn}$ is a bias vector for the global attention, and $g$ is a non-linear function. The two attention weights are then combined by a weighted sum. That is, the final attention weight $\gamma_i^t$ of a hidden state $h_i$ at the $t$-th word generation is

$$\gamma_i^t = \lambda \alpha_i + (1 - \lambda) \beta_i^t, \tag{7}$$

where $\lambda$ ($0 \le \lambda \le 1$) is a hyper-parameter that controls the ratio of the event attention to the global attention.
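A companion sketch for Equations (6) and (7), under the same assumptions as the previous sketch; `combine_attention` shows how $\lambda$ mixes the two weight vectors while keeping a normalized distribution.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Bahdanau-style global attention (Equation (6))."""
    def __init__(self, hid_dim: int = 512, attn_dim: int = 512):
        super().__init__()
        self.w_h = nn.Linear(hid_dim, attn_dim, bias=False)  # W_h
        self.w_s = nn.Linear(hid_dim, attn_dim, bias=True)   # W_s, b_gattn
        self.v_g = nn.Linear(attn_dim, 1, bias=False)        # V_g

    def forward(self, h: torch.Tensor, s: torch.Tensor):
        # h: (batch, n, hid_dim), s: (batch, hid_dim) decoder state
        scores = self.v_g(torch.tanh(self.w_h(h) + self.w_s(s).unsqueeze(1)))
        return torch.softmax(scores.squeeze(-1), dim=1)  # beta_i^t

def combine_attention(alpha, beta, lam: float = 0.6):
    """Final attention gamma_i^t = lam * alpha_i + (1 - lam) * beta_i^t
    (Equation (7)); lam = 0.6 is the value tuned on the MSR validation set."""
    return lam * alpha + (1.0 - lam) * beta

# Usage: combine the event attention from the previous sketch with the
# global attention for the current decoder state s.
h, s = torch.randn(2, 7, 512), torch.randn(2, 512)
beta = GlobalAttention()(h, s)
alpha = torch.softmax(torch.randn(2, 7), dim=1)  # stand-in for Equation (5)
gamma = combine_attention(alpha, beta)           # still sums to 1 per row
```

Since both $\alpha$ and $\beta^t$ are normalized distributions over the same source positions, their convex combination in Equation (7) remains a valid attention distribution.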
The final attention weights $\gamma_i^t$ are used to compute the context vector $c_t$ as

$$c_t = \sum_{i=1}^{n} \gamma_i^t h_i, \tag{8}$$

and the context vector is used in generating $y_t$, the $t$-th word, by the decoder. Since the decoder depends on $c_t$ as well as $s_t$ in generating $y_t$, Equation (4) is rewritten as

$$P_{vocab} = \mathrm{softmax}(W_v [s_t; c_t] + b_v). \tag{9}$$

Then, the probability of choosing a word $w$ from a vocabulary set $V$ is $P(w) = P_{vocab}(w)$. The proposed model is trained to minimize the risk $R$ over the training set $T$, defined as the negative log-likelihood

$$R(\theta) = -\sum_{(X, Y) \in T} \log P(Y|X; \theta), \tag{10}$$

where $\theta$ is the set of all parameters of the proposed model. Since the model above is based on a sequence-to-sequence model, it suffers from the chronic out-of-vocabulary problem [37]. To solve this problem, the model is extended to host the copy technique of the pointer generator [19]. The copy technique deals with the out-of-vocabulary problem by allowing the model to copy unknown words of a source sentence into the target sentence. That is, after expanding the vocabulary set to $V' = V \cup U$, where $U$ is the set of unknown words in the source sentence, $P(w \in V')$ is updated as

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: x_i = w} \gamma_i^t \tag{11}$$

by introducing a soft switch $p_{gen}$ which decides whether to generate a word from $P_{vocab}$ or to copy a word from the source sentence. If $p_{gen}$ is higher than $1 - p_{gen}$, the word $w$ tends to be generated from $P_{vocab}$; otherwise, it tends to be copied from the source sentence, where the word to be copied is determined by the final attention weights $\gamma_i^t$. The switch $p_{gen}$ is a generation probability computed by a network with two linear layers. Thus, $p_{gen}$ is computed using the context vector $c_t$, the decoder state $s_t$, and the previous decoder output $\mathbf{y}_{t-1}$ as

$$p_{gen} = \sigma\big(W_c'\, g(W_c [c_t; s_t; \mathbf{y}_{t-1}] + b_c) + b_c'\big), \tag{12}$$

where $W_c$, $W_c'$, $b_c$, and $b_c'$ are the learnable weight parameters of the network and $\sigma$ is the sigmoid function.
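The copy mechanism of Equations (11) and (12) can be sketched as follows. This is illustrative only; `src_ext_ids`, which maps each source token to an id in the extended vocabulary $V' = V \cup U$, is an assumed preprocessing artifact, and the two-layer switch network follows the reconstruction above.

```python
import torch
import torch.nn as nn

class PointerSwitch(nn.Module):
    """Two-layer network producing the soft switch p_gen (Equation (12))."""
    def __init__(self, hid_dim: int = 512, emb_dim: int = 256):
        super().__init__()
        self.lin1 = nn.Linear(2 * hid_dim + emb_dim, hid_dim)  # W_c, b_c
        self.lin2 = nn.Linear(hid_dim, 1)                      # W_c', b_c'

    def forward(self, c_t, s_t, y_prev):
        x = torch.cat([c_t, s_t, y_prev], dim=-1)
        return torch.sigmoid(self.lin2(torch.tanh(self.lin1(x))))

def final_distribution(p_vocab, gamma, src_ext_ids, p_gen, ext_vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on w)
    (Equation (11)), over the extended vocabulary V' = V + unknowns."""
    batch = p_vocab.size(0)
    dist = torch.zeros(batch, ext_vocab_size)
    dist[:, :p_vocab.size(1)] = p_gen * p_vocab
    # Scatter-add the copy probabilities onto the source tokens' ids.
    dist.scatter_add_(1, src_ext_ids, (1.0 - p_gen) * gamma)
    return dist

# Usage: a vocabulary of 1000 plus 2 source-only unknown words.
p_vocab = torch.softmax(torch.randn(2, 1000), dim=-1)
gamma = torch.softmax(torch.randn(2, 7), dim=-1)
src_ext_ids = torch.randint(0, 1002, (2, 7))    # ids in extended vocab
p_gen = PointerSwitch()(torch.randn(2, 512), torch.randn(2, 512),
                        torch.randn(2, 256))    # (batch, 1)
p_final = final_distribution(p_vocab, gamma, src_ext_ids, p_gen, 1002)
```

Because $P_{vocab}$ and $\gamma^t$ are each normalized and mixed with weights $p_{gen}$ and $1 - p_{gen}$, each row of `p_final` again sums to one.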

Experimental Setting
The proposed model is evaluated on three datasets: the MSR dataset [21] (see Supplementary Materials), the Filippova dataset [9] (see Supplementary Materials), and a Korean sentence compression dataset. Simple statistics on each dataset are given in Table 2. The MSR dataset proposed by Toutanova et al. is used for abstractive sentence compression. It consists of newswires, business letters, journals, and technical documents from the Open American National Corpus, and its compressed sentences were created manually. In this dataset, the numbers of training, validation, and test examples are 21,145, 1908, and 3370 pairs, respectively. The average length of source sentences is 193.2, while that of target sentences is 133.6. The Filippova dataset [9] is used for deletion-based sentence compression. This dataset was automatically produced from Google News using the method proposed by Filippova et al. [6]. The total number of sentence pairs is 10,000, and target sentences have 13.7 fewer words than source sentences on average. The last dataset, the Korean sentence compression dataset, is designed for testing morphologically complex languages. It contains 3117 source sentences from Korean news articles, and the sentences are split 8:1:1 into training, validation, and test sets, respectively. Its target sentences were created by a native speaker and are on average 6.01 words shorter than the source sentences.

For the experiments below, the dimension of word embeddings is set to 256, and that of all hidden layers is set to 512. LSTM [33] is used for the function f in Equation (3), and the hyperbolic tangent is used for the function g in Equations (5) and (6). The proposed network is trained with a batch size of 64 and a learning rate of 0.001, and is optimized with the Adam optimizer [38]. The value of λ in Equation (7) is 0.6, which is estimated using the MSR validation set.

BLEU [23], ROUGE [22], and compression ratio (CR) [24] are used as evaluation metrics, where CR is computed as

$$CR = \frac{1}{u} \sum_{i=1}^{u} \frac{\text{No. of tokens in the } i\text{-th compressed sentence}}{\text{No. of tokens in the } i\text{-th source sentence}},$$

where $u$ is the number of sentences in a test set (a short computational sketch follows below). Since the test set consists of pairs of a source sentence and its target sentence, the golden compression ratio can be computed from the test set. The golden compression ratio on the MSR dataset is 68.60%, while those on the Filippova dataset and the Korean dataset are 42.62% and 47.52%, respectively.

Two baselines are adopted to compare the proposed model with existing models. One baseline is the standard sequence-to-sequence model proposed by Cho et al. [16], and the other is the pointer generator [19], in which the copy technique is applied to the sequence-to-sequence model.

Table 3 shows the sentence compression performance on the MSR dataset. In this table, 'seq-to-seq' and 'PG' denote the sequence-to-sequence model and the pointer generator, respectively. The model of Yu et al. is based on a sequence-to-sequence model with a deletion decoder and a copy-generator decoder [15]. Unlike the proposed model, it first deletes words from a source sentence using the deletion decoder and then either generates words or copies from the source sentence through the copy-generator decoder. According to the table, the use of event attention improves its base model. That is, 'seq-to-seq + Event' shows higher performance than 'seq-to-seq' in ROUGE metrics, and 'PG + Event' outperforms 'PG' in all metrics. In particular, 'PG + Event' achieves the best performance.
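As a concrete reference for the CR metric defined above, here is a minimal sketch; whitespace tokenization is an assumption for illustration, as the paper does not state its tokenizer.

```python
def compression_ratio(sources: list[str], compressions: list[str]) -> float:
    """Corpus-level compression ratio: the mean, over u sentence pairs, of
    (tokens in the compressed sentence) / (tokens in the source sentence)."""
    assert len(sources) == len(compressions)
    ratios = [len(c.split()) / len(s.split())
              for s, c in zip(sources, compressions)]
    return 100.0 * sum(ratios) / len(ratios)  # as a percentage

# Usage: the golden CR of a test set is obtained from its reference targets.
src = ["If IBM has miscalculated the demand , it will suffer badly ."]
tgt = ["IBM will suffer if it miscalculates the demand ."]
print(f"{compression_ratio(src, tgt):.2f}%")  # 75.00%
```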
On the other hand, the performance of Yu's model is lower than that of 'PG + Event', even though it is also based on the copy mechanism. This is because, in Yu's model, errors by the deletion decoder easily propagate to the copy-generator decoder. The compression ratios, denoted as CR, are also given in this table to show how much each method compresses source sentences. The value within parentheses denotes the difference between the compression ratio of a method and the golden ratio; the smaller the absolute value of the difference, the better a method compresses source sentences. The golden compression ratio on the MSR dataset is 68.60%, and the difference between it and the compression ratio of 'PG + Event' is the smallest at −0.63. This result implies that the proposed model is an effective compressor as well as a good writer.

Table 4 shows the results of sentence compression on the Filippova dataset. When the event attention is applied to a baseline model, the performance of the baseline improves. That is, 'seq-to-seq + Event' is better than 'seq-to-seq', and 'PG + Event' outperforms 'PG' for all evaluation metrics. Thus, the proposed model works effectively even on a dataset built for deletion-based compression. In compression ratio, however, the proposed model is slightly worse than 'PG'. The golden compression ratio of the Filippova dataset is 42.62%, and the compression ratio of 'PG' is 42.45% while that of 'PG + Event' is 41.70%. The main reason 'PG' is better than 'PG + Event' in compression ratio is that this dataset is designed for deletion-based compression.

To verify that the proposed model works for morphologically complex languages, we conducted sentence compression on the Korean dataset. Table 5 shows the results. Overall, the proposed model outperforms all baselines; that is, 'PG + Event' shows the best performance for all metrics. One thing to note is that the performance difference between 'PG + Event' and 'seq-to-seq + Event' is large. This is because the Korean dataset is relatively small, yet its vocabulary size is large; under this circumstance, the copy mechanism is very helpful in solving the out-of-vocabulary problem. In compression ratio, both 'PG' and 'PG + Event' are relatively close to the golden compression ratio of the Korean dataset. Please note that the overall performance of the proposed model on this dataset is lower than those on the other datasets. This is because it is difficult to generate good Korean sentences with a small dataset, since Korean is a morphologically complex language. Nevertheless, even on this dataset, the proposed model outperforms its competitors.

Table 6 presents some examples of compressed sentences by 'PG' and 'PG + Event'. Sentences 1 and 2 in this table are from the MSR dataset, while Sentence 3 comes from the Filippova dataset. Bold words indicate event words in the source sentences. The important phrases of Sentence 1 are '80% of youth will report increased supervised time in safe environments' and '80% of participants will report increased conflict resolution skills', and these phrases contain the event words 'report increased'. The compressed sentence by 'PG' is "Anticipated outcomes from the spring survey include supervised increased conflict % of participants will report increased conflict resolution.", which misses important information such as 'supervised time in safe environments' and thus distorts the meaning of the source sentence.
On the other hand, the compressed sentence by the proposed 'PG + Event' delivers more precise meaning than that by 'PG', since it contains '80% of youth will report in safe environments' and 'they will report increased conflict resolution skills.'

Table 6. Some examples of compressed sentences by compression models.

Sentence 1 (MSR dataset)
Source: Anticipated outcomes from the spring survey include: 80% of youth will report increased supervised time in safe environments. 80% of participants will report increased conflict resolution skills.
Target: Expected outcomes from survey: 80% of youth will report more time in safe places. 80% of people will report greater conflict resolution skills.
PG: Anticipated outcomes from the spring survey include supervised increased conflict % of participants will report increased conflict resolution.
PG + Event: Anticipated outcomes from the spring survey include: 80% of youth will report in safe environments. They will report increased conflict resolution skills.

Sentence 2 (MSR dataset)
Source: Support is needed both to maintain and expand these comprehensive programs. Please help the American Cancer Society continue its vital work.
Target: Maintenance and expansion of our programs needs support. Help the American Cancer Society continue.
PG: nature much to maintain these comprehensive programs. The American Cancer Society's vital work.
PG + Event: Support the American Cancer Society continue its vital work and help to maintain.

Sentence 3 (Filippova dataset)
Source: why are homeowners reporting that their glass door suddenly shattered?
Target: reporting their glass door suddenly shattered
PG: why are homeowners glass door strong
PG + Event: reporting their glass door shattered
To examine the attention weights during decoding, the weights for Sentence 1 are shown in Figure 2. The figure shows the source sentence on the horizontal axis and the generated sentence on the vertical axis; the darker the color of a word cell, the heavier the attention weight of the word. The upper part shows the compressed sentence by 'PG' and the bottom part that by 'PG + Event'. The bold tokens are those recognized as event words. In the attention by 'PG', after the word 'include', attention should have been given to 'youth' and 'increased supervised time' or 'in the safe environments', but it is given to a wrong word, 'conflict'. As a result, 'PG' generates 'participants will report increased conflict resolution'. In the bottom part of the figure, the attention weights by 'PG + Event' are relatively ideal. The model attends to all event words, and the words are generated in the compressed sentence. In addition, globally important phrases such as 'youth', 'safe environments', and 'conflict resolution skills' receive attention, and the compressed sentence is generated semantically correctly. Since the proposed model is designed to consider both event words and global information, it can generate an effective compressed sentence.

Sentence 2 in Table 6 also shows the superiority of the proposed model. The important phrases of the source sentence are 'support is needed to maintain and expand' and 'help the American Cancer Society continue'. As a compressed sentence, 'PG' generates 'nature much to maintain these comprehensive programs. The American Cancer Society's vital work', which is wrong both semantically and grammatically. On the other hand, the compressed sentence by 'PG + Event' is 'Support the American Cancer Society continue its vital work and help to maintain.' This sentence contains the event word 'support' and delivers the meaning of the source sentence correctly.

Lastly, Sentence 3 is sampled from the Filippova dataset and is shorter than the other examples. The key information of the source sentence is 'report that their glass door shattered'. 'PG + Event' generates 'reporting their glass door shattered', in which salient words such as 'report' and 'shatter' are involved. As a result, one can find out that the glass door shattered from this compressed sentence, but not from the compressed sentence by 'PG'. The sentence by 'PG' is grammatically incomplete, and the fact related to 'homeowners reporting' is missing. Figure 3 shows the attention weights for the source sentence of Sentence 3, where the top shows the weights by 'PG' and the bottom those by 'PG + Event'. 'PG' does not regard 'reporting' as a salient word, and thus does not generate the word in the compressed sentence. In addition, even though 'shattered' receives attention, 'PG' generates 'strong' instead. On the other hand, 'PG + Event' pays attention to 'reporting' and 'shatter' as important event information. Furthermore, the globally salient phrase 'their glass door' is also focused on by 'PG + Event'. As a result, 'PG + Event' is able to generate a compressed sentence which is semantically and grammatically correct. These examples show that using event attention together with global attention is effective for sentence compression.

Conclusions
We have proposed an abstractive sentence compression model with event attention. Sentence compression is the task of generating a compact sentence from a source sentence while preserving the important content of the source sentence. However, existing models for sentence compression have the limitation that their attention often fails to focus on the important context of a source sentence, especially when the source sentence is long and complex. In this paper, we handled this problem with event attention. The proposed event attention focuses on event words, since event words are important information for sentence compression and deliver the meaning of source sentences. In addition to event attention, global attention was also used; it helps to understand source sentences because it captures the global information of a source sentence. Therefore, the proposed model compresses source sentences by combining event and global attention. For evaluation, the proposed model was compared with two baselines on three standard datasets. According to the experimental results, it outperforms all baselines on all three datasets: the MSR dataset, the Filippova dataset, and the Korean sentence compression dataset. In particular, it shows a 122% higher BLEU score than the sequence-to-sequence model on the MSR dataset. These results show that event words are valuable information and the proposed model is effective for sentence compression. For future work, we will extend the proposed model to multi-sentence compression. The current model compresses just one or two sentences, but real news articles consist of multiple sentences; thus, multi-sentence compression is required for real applications.