Skeleton to Abstraction: An Attentive Information Extraction Schema for Enhancing the Saliency of Text Summarization
Information 2018, 9, 217

Abstract: Current popular abstractive summarization is based on an attentional encoder-decoder framework. In this architecture, the decoder generates a summary from the full text, so it is often disturbed by irrelevant information, and the generated summaries suffer from low saliency. Moreover, we observed how people write summaries and found that they write a summary based on the necessary information rather than the full text. Thus, in order to enhance the saliency of abstractive summarization, we propose an attentive information extraction model. It consists of a multi-layer perceptron (MLP) gated unit that pays more attention to the important information of the source text and a similarity module that encourages high similarity between the reference summary and the important information. Before the summary decoder, the MLP and the similarity module work together to extract the important information for the decoder, thus obtaining the skeleton of the source text. This effectively reduces the interference of irrelevant information with the decoder and therefore improves the saliency of the summary. Our proposed model was tested on the CNN/Daily Mail and DUC-2004 datasets, and achieved a ROUGE-1 F-score of 42.01 and a ROUGE-1 recall of 33.94, respectively. These results outperform the state-of-the-art abstractive model on the same datasets. In addition, subjective human evaluation indicated that the saliency of the generated summaries was further enhanced.


Introduction
With the rapid development of Internet technology, people are exposed to vast amounts of text information every day, such as news, blogs, reports and papers. When we are faced with a large amount of disorganized information, quickly and accurately locating the required information becomes a problem to be solved. Automatic text summarization provides an efficient solution to this task. Text summarization automatically creates a shorter version containing the main idea of the source text. We can judge whether an article is interesting to us based on the shorter version. This can greatly reduce the time consumed in retrieving information.
Text summarization is generally divided into two branches, namely, extractive and abstractive. Extractive summarization selects some sentences from the source text to compose a summary. Abstractive summarization generates novel sentences as the summary based on the semantics of the source text. Abstractive summarization is thus more difficult than copying sentences from the source text, and most past work has focused on extractive summarization [1][2][3][4]. However, in recent years, abstractive summarization based on deep learning has also made great progress. Most current abstractive models are built on the encoder-decoder framework. In order to improve the accuracy of the decoder, Bahdanau et al. [5] added an attention mechanism to the encoder-decoder framework and achieved state-of-the-art performance in machine translation (MT). Due to the similarities between MT and text summarization, subsequent text summarization work followed the MT model. Under this framework, the encoder reads the source text and captures its semantics, the decoder generates summary words, and the attention mechanism aligns the input and the output information to make the output more reliable.
Despite the similarities, abstractive summarization is a very different problem from MT. In MT, the decoder must receive all contents of the source text; in text summarization, the decoder only needs the important information from the source text to generate a summary. Humans also write summaries like this: the important information is first extracted, and then only the important information is considered while writing the summary. A good summary should be concise and have high saliency, that is, it should contain more key information. However, in current abstractive models, summary generation is based on all contents of the source text. Under this condition, when the source text contains plenty of information irrelevant to the summary, the encoder cannot correctly represent the semantics of the text. The decoder is then influenced by this irrelevant information, and the saliency of the summary declines. As shown in Figure 1, the generated summary has poor saliency.
Based on the above discussion, in order to reduce the interference of irrelevant information with the decoder and thereby improve the saliency of the generated summary, this paper proposes an attentive information extraction model. The model is designed with reference to the way that humans write summaries. When people write a summary, they first read and understand the source text; then they outline the important information and filter out the information that is useless to the summary; next, they compare the important information with the true semantics to ensure that the outlined information is correct; finally, they write the summary. The current attentional encoder-decoder model is able to read and understand the source text as well as write a summary. However, preliminarily outlining the important information and ensuring its correctness have not been realized. Thus, we first use an extra attention mechanism, namely, a multi-layer perceptron (MLP) network, to obtain the important information after the encoder and before the decoder. The important information is the skeleton of the source text. Furthermore, because the semantics of the reference summary and the source text are consistent, we calculate semantic similarity scores between the reference summary and the extracted important information to ensure the correctness of the extracted information. In order to further enhance the ability of the MLP network, we maximize the similarity score to encourage high semantic similarity between the reference summary and the extracted information. As one of the targets of the abstractive model is to maximize the probability of target words, we consider that the decoder already has good writing ability. We therefore skip the decoder when maximizing the score, so that the encoder's semantic representation capability and the MLP network's ability to extract information are improved as much as possible without affecting the decoder's ability to write a summary. Our model extracts the important information before the decoder, and the decoder generates summaries according to this information. Because the decoder is less influenced by irrelevant information, it can capture the main idea of the source text more completely and accurately, and the saliency of the summary is higher.
We conducted experiments on the CNN/Daily Mail and DUC-2004 datasets. Our model achieved a ROUGE-1 F-score of 42.01 and a ROUGE-1 recall of 33.94, respectively, and outperformed the state-of-the-art abstractive model on the same datasets. In addition, anonymous and subjective human evaluation indicated that the saliency of the summaries generated by our model was further enhanced, and that their readability was stronger than that of the baseline model.

Related Work
The current abstractive model is based on an encoder-decoder model [6]. This model was originally used in the field of MT. In order to improve the accuracy of the decoder, Bahdanau et al. [5] added the attention mechanism to the model and obtained state-of-the-art results in MT. Due to the strong similarity between text summarization and MT tasks, the current popular text summarization models mostly follow this structure.
In the early days of text summarization studies, most of the work was done around extractive summarization [1][2][3][4][7][8][9]. However, in recent years, the study of text summarization has mainly focused on abstractive summarization. Rush et al. [10] proposed a data-driven network model to generate summaries. They used a convolutional neural network (CNN) to encode the source text and a neural language model to decode a summary. State-of-the-art results were obtained on the DUC-2004 and Gigaword datasets. In an extension of this work, Chopra et al. [11] used a recurrent neural network (RNN) instead of the neural language model in the decoder, resulting in further improvement on these datasets. As an RNN can better represent serialized data, Nallapati et al. [12] implemented both the encoder and the decoder using an RNN and constructed the multi-sentence summarization dataset CNN/Daily Mail.
Under the framework of an attentional encoder and decoder, researchers began to address the problems of repeatability, poor readability, and out-of-vocabulary (OOV) words. Vinyals et al. [13] used the pointer mechanism in the encoder-decoder network model to solve the OOV problem. Experiments showed that the mechanism can achieve good results. Gu et al. [14], Gulcehre et al. [15], and Nallapati et al. [12] also adopted the pointer mechanism in abstractive summarization to solve the OOV problem. See et al. [16] used a similar mechanism to generate summaries. In order to solve the problem of repeatability, they introduced the coverage mechanism [17]. Their experiments achieved state-of-the-art results on the CNN/Daily Mail dataset. Suzuki et al. [18] mitigated the repeatability of summaries by estimating the upper-bound frequency of each target word in the encoder and controlling the output words in the decoder. Nema et al. [19] processed the sentences input into the model so that they were orthogonal to each other, thereby reducing the repeatability of the generated summaries. Li et al. [20] added latent structured information to the decoder and introduced an editing vector [21] to edit the generated summary, thereby enhancing the readability of the summary. Recently, Paulus et al. [22] applied reinforcement learning (RL) to generate a summary and adopted the attention mechanism inside the decoder.
In addition, Xu [23] used a multi-layer perceptron (MLP) model inside the encoder to predict the weight of each sentence in the source text. This model reduced the interference of irrelevant sentences when generating summaries. Zhou et al. [24] also adopted an MLP model after the encoder to weaken the irrelevant information and improve the model performance. Ma et al. [25] added a similarity comparison module between the generated summaries and the original text after the decoder to improve the semantic relevance of the summary. Ma et al. [26] combined text sentiment classification with text summarization tasks and proposed a hierarchical end-to-end model with a highway network, which achieved good experimental results on the Amazon online review dataset. In another experiment, Ma et al. [27] proposed a supervised learning model to improve the encoder's text representation ability, thereby improving the summarization results. Hsu et al. [28] combined extractive and abstractive summarization to generate a summary, which improved the informativity and readability of summaries. Lin et al. [29] controlled the information flow from encoder to decoder to improve the semantic relevance of the summary. Li et al. [30] also combined extractive with abstractive models to generate summaries and improve the informativity of the summary. Celikyilmaz et al. [31] presented deep communicating agents in an encoder-decoder architecture to address the challenges of representing a long document for abstractive summarization.
On top of solving the problems of OOV words and repeatability, our model follows the idea of Zhou et al. [24] and adopts an extra attention mechanism to extract the important information. In order to ensure the correctness of the extracted information and enhance the ability of the extra attention mechanism, we calculate the semantic similarity between the reference summary and the extracted information, and maximize the similarity score to encourage high similarity between the two. Experiments show that our model outperformed the state-of-the-art abstractive model and that the saliency of the summaries generated by our model was further enhanced.

Proposed Model
In this section, we introduce our proposed model in detail. In Section 3.1, we introduce the flow diagram of our model. In Section 3.2, we give an overview of the various parts of the model. In Section 3.3, we describe every part of the model in detail.

Model Flow Diagram
The flow diagram of our model is shown in Figure 2.
The input of the model is the source text and the output is the generated summary. First, the source text is embedded into a series of word vectors. Next, these word vectors are encoded to achieve the reading and understanding of the source text. Then, we adopt an extra attention mechanism after the encoder to obtain the important information for generating the summary, thereby reducing the interference of useless information with the decoder. At the training stage, in order to improve the performance of important information extraction, we compare the semantics of the reference summary with the semantics of the important information to obtain the cosine similarity score. Note that the reference summary does not exist at the test stage. Finally, the decoder generates summaries according to the important information.

Model Overview
The concrete architecture of our model is shown in Figure 3. It mainly consists of five parts: the source text encoder, the extra attention, the cosine similarity module, the reference summary encoder and the summary decoder. The text encoder uses a bidirectional long short-term memory network (Bi-LSTM) to read and represent the source text. It maps the source text to the semantic vector space, forming a series of semantic vectors. After the text encoder, we use the extra attention mechanism to measure the importance of each word in the source text. The extra attention mechanism is an MLP network. At each time step, the output of the MLP is a weight vector that represents the importance of the word for the text. Then, we use these weight vectors to weight the hidden states and form a series of new semantic vectors. These vectors represent the important information of the source text. Next, we also use a Bi-LSTM to encode the reference summary. We compute the cosine similarity between the semantics of the reference summary and the semantics of the important information to ensure the correctness of the extracted information input into the decoder. In order to enhance the ability of the extra attention mechanism, we feed the similarity score to the encoder and the MLP, which are trained to maximize it. Lastly, the model adopts a unidirectional LSTM to decode the important information and generate summaries. During the generation of summaries, we also use a traditional attention mechanism to provide different attention scores for different parts of the source text at different time steps.

Model Details
In this section, we introduce our model in detail. We divided our model into five parts in Section 3.2. Since the extra attention and similarity modules work together to extract the useful information and filter out the irrelevant information, we can merge them into an attentive information extraction module. Our model thus has three large blocks, namely, the text encoder, attentive information extraction and the summary decoder. The text encoder reads and understands the source text. Attentive information extraction is responsible for extracting the important information for summary generation. The summary decoder writes the summary words.

Text Encoder
The text encoder imitates the process of a human reading and understanding the source text. This part is responsible for mapping the source text to the semantic vector space and forming a series of semantic vectors. As an RNN can better represent serialized data, the encoder is implemented using an RNN. However, the general RNN has difficulty with long-term dependencies and suffers from vanishing gradients, thus we adopted its variant, the LSTM (http://colah.github.io/posts/2015-08-Understanding-LSTMs/).
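In order to obtain more complete and accurate vector representations of the source text, we used a Bi-LSTM to encode it. The forward LSTM reads the word vectors from left to right, resulting in a series of hidden states $(\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_n)$. The backward LSTM reads the word vectors in the opposite direction and obtains a series of hidden states $(\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_n)$. We concatenate the two hidden states at each position:
$$\overrightarrow{h}_i = \mathrm{LSTM}(x_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \mathrm{LSTM}(x_i, \overleftarrow{h}_{i+1}), \qquad h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$$
where $x_i$ represents the i-th word in the source text and $h_i$ represents the semantics of the contents surrounding the i-th word (the preceding words from the forward pass and the following words from the backward pass).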

Attentive Information Extraction
Although the architecture of text summarization borrows from machine translation (MT), it is a very different problem from MT. In MT, the decoder must fully receive all the information from the source text. In summarization, the effective information of the text is enough for the decoder. However, the current abstractive model generates a summary based on all contents of the source text, which prevents the encoder from representing the text correctly. This makes the information input into the decoder imprecise, so the generated summary will not be accurate, which is not what we expect. In addition, we observed the process of humans writing a summary. They outline the important information in the text before writing a summary, which reduces the interference of useless information. Thus, we follow the way humans write summaries and propose an attentive information extraction model to solve the problem.
After the text encoder, the network generates a series of hidden states $(h_1, h_2, h_3, \ldots, h_n)$. In order to completely and accurately represent the entire text $H$, we concatenate the final hidden states of the forward and backward LSTMs:
$$H = [\overrightarrow{h}_n; \overleftarrow{h}_1]$$
In order to extract the important information, we apply an extra attention mechanism. Here, we introduce a weight vector $g_i$. It represents the importance of the i-th word for the full text. $H$ and $h_i$ are input into an MLP to obtain $g_i$. Then, $h_i$ and $g_i$ are multiplied element-wise to get $h'_i$, which represents the extracted information at time step i.
$$g_i = \sigma(W_s h_i + V_i H + b), \qquad h'_i = h_i \odot g_i$$
where $W_s$, $V_i$ and $b$ are learnable parameters, $\sigma$ is the sigmoid function and $\odot$ is the element-wise product. After information extraction, the important information of the source text is strengthened and the unnecessary or useless information is weakened or ignored. We add these new states $(h'_1, h'_2, h'_3, \ldots, h'_n)$ to represent the semantics of the source text, namely $V_s$:
$$V_s = \sum_{i=1}^{n} h'_i$$
It is input into the decoder to generate summaries.
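The gating step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes a sigmoid gate of the form g_i = sigmoid(W_s h_i + V H + b), element-wise weighting h'_i = g_i * h_i, and V_s as the sum of the gated states. The function names and the use of a single shared matrix `V` (rather than a position-indexed one) are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Logistic function used as the gate activation (an assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def extract_skeleton(hidden_states, H, W_s, V, b):
    """Gate every encoder state and sum the gated states.

    hidden_states: list of n encoder states h_i, each of length d
    H:             whole-text representation of length k
    W_s (d x d), V (d x k), b (length d): hypothetical gate parameters
    Returns (gated_states, V_s).
    """
    VH = matvec(V, H)  # identical for every position, so compute it once
    gated_states = []
    for h in hidden_states:
        Wh = matvec(W_s, h)
        gate = [sigmoid(a + c + bias) for a, c, bias in zip(Wh, VH, b)]
        gated_states.append([g * x for g, x in zip(gate, h)])
    # V_s is the element-wise sum of all gated states.
    V_s = [sum(column) for column in zip(*gated_states)]
    return gated_states, V_s
```

With all gate parameters set to zero, every gate value is 0.5, so each state is simply halved; trained parameters would instead push the gate toward 1 for salient words and toward 0 for irrelevant ones.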
After the MLP, the extracted important information is passed to the decoder, but we cannot guarantee its correctness. As the semantics of the reference summaries and the source texts are consistent, we compare the semantics of the reference summary with those of the extracted information to ensure the semantic correctness of the extracted information. Following the source text encoder, we also use a Bi-LSTM to encode the reference summary, and add all hidden states to represent its semantics $V_t$:
$$r_i = [\overrightarrow{r}_i; \overleftarrow{r}_i], \qquad V_t = \sum_{i=1}^{s} r_i$$
where $s$ is the length of the reference summary and $r_i$ is the hidden state of the i-th word in the reference summary.
Here, we adopt cosine similarity to measure the semantic similarity between the reference summary and the extracted information:
$$\cos(V_s, V_t) = \frac{V_s \cdot V_t}{\lVert V_s \rVert \, \lVert V_t \rVert}$$
This tells us whether the extracted information is correct or not. The larger the similarity score, the more accurate the extracted information.
In order to improve the information extraction ability of the MLP, the similarity score is fed back to the network. In the training process, we maximize the score to encourage high semantic similarity between the reference summary and the extracted information. In our model, we do so by minimizing the negative log likelihood of the similarity, $-\log \cos(V_s, V_t)$.
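The similarity objective described above can be illustrated with a short sketch. It assumes the loss is the negative logarithm of the cosine similarity between V_s and V_t; the clamping constant `eps` and the function names are assumptions added so that the logarithm stays defined when the similarity is zero or negative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_loss(V_s, V_t, eps=1e-8):
    """Negative log of the cosine similarity, clamped at eps (assumption)."""
    return -math.log(max(cosine_similarity(V_s, V_t), eps))
```

Minimizing this quantity is equivalent to maximizing the similarity score: the loss is zero when the two vectors point in the same direction and grows as they diverge.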
As the current summarization model's objective is to maximize the probability of the target word, namely, to minimize its negative log likelihood, we believe that the decoder has good writing skills.
$$loss_t = -\log P(w_t^*)$$
where $w_t^*$ is the target word at time step t. Therefore, in order not to affect the decoder's writing ability, we feed the similarity score back to the encoder and the MLP, skipping the decoder. The final loss function is as follows:
$$loss = \frac{1}{m} \sum_{t=1}^{m} loss_t - \lambda \log \cos(V_s, V_t)$$
where $\lambda$ is a hyper-parameter and $m$ is the length of the generated summary.
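As a rough numerical illustration of how the two objectives might be combined, the sketch below averages the per-step negative log likelihoods over the m decoding steps and adds the similarity term weighted by λ. The exact form of the combination, the clamping constant and the function name are assumptions; the routing of the similarity gradient to the encoder and MLP only (skipping the decoder) is a property of the training graph and is not represented here.

```python
import math


def total_loss(target_probs, similarity, lam=0.001, eps=1e-8):
    """Combine the decoder loss with the similarity loss.

    target_probs: probability the decoder assigned to each target word
    similarity:   cosine similarity between V_s and V_t
    lam:          weight of the similarity term (0.001 in the experiments)
    """
    m = len(target_probs)
    # Average negative log likelihood of the target words.
    seq_loss = sum(-math.log(max(p, eps)) for p in target_probs) / m
    # Negative log of the similarity score, clamped to keep log defined.
    sim_loss = -math.log(max(similarity, eps))
    return seq_loss + lam * sim_loss
```

With a small λ the similarity term acts as a gentle regularizer on the extracted representation rather than dominating the word-level objective.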

Summary Decoder
We use a unidirectional LSTM to generate summaries after the text encoder and attentive information extraction. We use $V_s$ (Section 3.3.2), which represents all important information in the source text, to initialize the LSTM hidden state.
$$s_t = \mathrm{LSTM}(y_t, s_{t-1})$$
where $s_t$ and $y_t$ are the hidden state and the input of the LSTM at time step t, respectively. During decoding, we also use a traditional attention mechanism to pay attention to different parts of the important information, that is, the extracted states $h'_i$ (Section 3.3.2), at different time steps.
$$e_i^t = V^\top \tanh(W_h h'_i + W_t s_t + b_i^t), \qquad a_i^t = \frac{\exp(e_i^t)}{\sum_{j=1}^{n} \exp(e_j^t)}, \qquad c_t = \sum_{i=1}^{n} a_i^t h'_i$$
where $W_h$, $W_t$, $V$ and $b_i^t$ are learnable parameters, $a_i^t$ is the attention score at time t for the i-th word in the source text, and $c_t$ is the context vector at time t. Finally, in order to handle OOV words and the repeatability of summaries, we also use the pointer and coverage mechanisms [16]. Now, the attention score is calculated as follows:
$$e_i^t = V^\top \tanh(W_h h'_i + W_t s_t + W_c c_i^t + b_i^t), \qquad c_i^t = \sum_{t'=0}^{t-1} a_i^{t'}$$
where $c_i^t$ is the sum of the attention scores of the i-th word before time t and, following [16], $W_c$ is an additional learnable parameter. We penalize the model when it repeatedly attends to the same location of the source text, namely, we minimize the minimum of the sum of the attention scores of the i-th word so far and the attention score of the i-th word at the current moment. Thus the loss term $loss_t$ is changed as follows:
$$loss_t = -\log P(w_t^*) + \mu \sum_{i} \min(a_i^t, c_i^t)$$
where $\mu$ is a hyper-parameter.
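The coverage penalty can be made concrete with a small sketch. It assumes the penalty at each step is the sum over source positions of min(a_i, c_i), where c_i accumulates the attention the position has received at earlier steps, as in the coverage mechanism of [16]. The function name is hypothetical and the attention distributions are supplied by hand rather than produced by a decoder.

```python
def coverage_step(attention, coverage):
    """Return the coverage penalty for one decoding step and the updated coverage.

    attention: attention scores a_i at the current step (sums to 1)
    coverage:  accumulated attention c_i from all previous steps
    """
    # Positions that were already attended to contribute min(a_i, c_i).
    penalty = sum(min(a, c) for a, c in zip(attention, coverage))
    # The current attention is added to the running coverage afterwards.
    new_coverage = [a + c for a, c in zip(attention, coverage)]
    return penalty, new_coverage


# Example: attending to the first word twice in a row is penalized.
coverage = [0.0, 0.0, 0.0]
first_penalty, coverage = coverage_step([0.7, 0.2, 0.1], coverage)
second_penalty, coverage = coverage_step([0.6, 0.3, 0.1], coverage)
```

The first step incurs no penalty because nothing has been covered yet; the second step is penalized mostly for returning to the first word. In training, this penalty would be scaled by µ and added to the per-step loss.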

Experiment
In this section, we will introduce our experiments in detail, including the datasets we used, the implementation details and the evaluation metrics.

Datasets
We trained our model on the CNN/Daily Mail dataset [12,32]. This is a news dataset that contains 312,085 documents with multi-sentence summaries. On average, each text contains 766 words spanning 29.74 sentences and the corresponding summary contains 53 words spanning 3.72 sentences. We followed the same pre-processing method described in See et al. [16] to process our datasets. The large size of the dataset makes the process of training very slow. Thus we filtered out texts longer than 500. Eventually, we had 70,065 training pairs, 3806 validation pairs and 3212 test pairs.
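The length filter mentioned above could look like the following sketch. It assumes that length is measured in whitespace-separated tokens and that the data is held as (text, summary) string pairs; both are assumptions for illustration, and the function name is hypothetical.

```python
def filter_by_length(pairs, max_len=500):
    """Keep only (text, summary) pairs whose text has at most max_len tokens."""
    return [
        (text, summary)
        for text, summary in pairs
        if len(text.split()) <= max_len
    ]
```

For example, `filter_by_length([("a b c", "a"), ("a b c d", "b")], max_len=3)` keeps only the first pair.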
In addition, we also tested our model on the DUC-2004 dataset for tasks 1 and 2 [33]. Although DUC-2004 is old, it contains many manual abstracts written by experts and provides a standard dataset for summarization. Thus it is widely used in academic research and industrial applications. Since most past works were evaluated on DUC-2004, we also used it to evaluate our model. However, it is too small to train neural networks, so we only used it to test our model. The corpus contains 500 documents. Each document has four manual reference summaries and each reference summary contains 75 bytes on average.

Experiment Details
All our experiments were implemented in Python 3 and TensorFlow 1.2.0. Table 1 shows our model parameters at the training stage. We set all hidden state sizes to 256 and word embedding sizes to 128. The vocabulary size was 50,000. We did not pre-train the word vectors; they were randomly initialized at the beginning of the training process. We fixed the maximum length of the input text at 400. In addition, the length of the reference summary was 53 on average, so we set the maximum length of the generated summaries to 100. When we tested our model on DUC-2004, we changed the maximum length of the generated summaries to 25, because the reference summary contained 75 bytes on average. The batch size was 16. We adopted the beam search algorithm to generate summaries and set the beam size to four; thus the batch size was also changed to four at the test stage. We used AdaGrad [34] to optimize our model, with a learning rate of 0.15 and an initial accumulator value of 0.1. The hyper-parameters λ and µ were set to 0.001 and 1, respectively.
At the end of training, the seq2seq loss, namely $loss_t$ (in Section 3.3.2), converged to about 2.6 from an initial value of about 7.0, and the coverage loss converged to 0.2 from an initial value of about 0.5.
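The settings above can be collected in one place, as in the following minimal sketch. The field names are hypothetical (they are not taken from our code base), and `adagrad_step` is the textbook AdaGrad update written in NumPy rather than the TensorFlow optimizer used in the experiments; it only illustrates how the learning rate and the initial accumulator value enter the update.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the text; the field names are hypothetical."""
    hidden_size: int = 256
    embedding_size: int = 128
    vocab_size: int = 50_000
    max_enc_steps: int = 400
    max_dec_steps: int = 100          # 25 when testing on DUC-2004
    batch_size: int = 16              # 4 at the test stage (equal to the beam size)
    beam_size: int = 4
    learning_rate: float = 0.15
    initial_accumulator: float = 0.1
    lambda_weight: float = 0.001
    mu_weight: float = 1.0


def adagrad_step(param, grad, accum, lr):
    """One standard AdaGrad update; returns the new parameter and accumulator."""
    accum = accum + grad ** 2                    # accumulate squared gradients
    param = param - lr * grad / np.sqrt(accum)   # per-coordinate scaled step
    return param, accum


cfg = TrainConfig()
w = np.zeros(3)
acc = np.full_like(w, cfg.initial_accumulator)   # accumulator starts at 0.1
g = np.array([1.0, -2.0, 0.0])                   # illustrative gradient
w, acc = adagrad_step(w, g, acc, cfg.learning_rate)
```

Starting the accumulator at a positive value keeps the first step finite even for coordinates whose gradient is zero.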

ROUGE
We evaluated our model using ROUGE [35]. ROUGE is a common evaluation metric in text summarization. It measures the overlap of lexical units, such as unigrams, bigrams and the longest common subsequence, between reference summaries and generated summaries. ROUGE-N is calculated as follows:

$$\text{ROUGE-N} = \frac{\sum_{S \in \{\text{reference summaries}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\text{reference summaries}\}} \sum_{gram_n \in S} Count(gram_n)}$$

where N represents the length of the n-gram, {reference summaries} is the set of reference summaries, $Count_{match}(gram_n)$ is the number of n-grams co-occurring in the reference summaries and the generated summary, and $Count(gram_n)$ is the number of n-grams in the reference summaries. For the CNN/Daily Mail dataset, we calculated ROUGE F1 (https://blog.csdn.net/u014380165/article/details/77493978). For the DUC-2004 dataset, however, most past work was evaluated using ROUGE recall, which is also the official DUC metric, so we used ROUGE recall to evaluate our model.
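To make the n-gram counting concrete, the following is a simplified sketch of ROUGE-N recall and F1. It assumes lowercased whitespace tokenization and applies no stemming or stop-word handling, so it will not reproduce the numbers of the official ROUGE script; the function names are illustrative. The two sentences are the reference and generated summaries quoted in the caption of Figure 5.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """Matched n-grams divided by the number of n-grams in the references."""
    cand = ngram_counts(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngram_counts(ref, n)
        total += sum(ref_counts.values())
        # Clip each match by how often the n-gram occurs in the candidate.
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0


def rouge_n_f1(candidate, reference, n=1):
    """Harmonic mean of n-gram precision and recall against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Cambodian government rejects opposition's call for talks abroad"
generated = ("Cambodian leader Hun Sen rejected opposition parties "
             "demands for talks outside the country")
recall = rouge_n_recall(generated, [reference], n=1)
f1 = rouge_n_f1(generated, reference, n=1)
```

Without stemming, "rejects" and "rejected" do not match in this sketch, which is one reason its scores are lower than those of the official script.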

Human evaluation
For text summarization, ROUGE to some extent only evaluates the literal similarity between the reference summary and the generated summary. There is no suitable way to evaluate the saliency of the generated summaries automatically. Thus, in order to evaluate the saliency of the summaries generated by our model, we randomly selected some examples for visual evaluation. We compared the summaries generated by our model and by the model of See et al. [16] in terms of informativity. The summary containing more key information has higher saliency.
In addition, in order to make the evaluation more representative, we randomly picked more examples. Each example contains three parts, namely, the reference summary, the summary generated by the model of See et al. [16] and the summary generated by our model. We assigned them to three different human evaluators, who scored each summary. The saliency scoring criteria are shown in Table 2. Finally, we collected the results of the evaluators and calculated the mean value. Note that we did not tell the evaluators which summary was generated by our model and which by the model of See et al. [16]; we only told them which summary was the reference summary. Readability is also an important indicator of summary quality. Thus we also randomly selected some examples and assigned them to three different evaluators, who scored each summary according to its syntax and grammar. The scoring details are shown in Table 3. As before, we took the mean value over the three evaluators as the final readability result. The readability evaluation was also anonymous.

Weight heat map
We visualized the weight vector obtained by the MLP, namely, $g_i$ (in Section 3.3.2), to check whether our model extracted important information before the decoder. However, because $g_i$ is a high-dimensional vector, it is difficult to visualize directly, so we converted it to a scalar. A text is most relevant to itself. Thus, we calculated the weight vector between the source text and itself as the gold vector. Then, we calculated the Euclidean distance between the gold vector and the weight vector at each time step, which converts each high-dimensional weight vector to a scalar:

$$a_i = \sqrt{\sum_{j=1}^{k} \left(G_j - g_{i,j}\right)^2}$$

where G is the gold vector, computed from the source text H (in Section 3.3.2) paired with itself, k is the dimension of the weight vector and $a_i$ is the scalar corresponding to $g_i$. We visualize the $a_i$ as the weight heat map.
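The reduction of each weight vector to a scalar can be sketched in a few lines of NumPy. This is an illustration only: the array shapes and the function name are assumptions, and the gold vector is taken as a given input rather than computed by the MLP.

```python
import numpy as np


def weight_scalars(weights, gold):
    """Euclidean distance between each time step's weight vector and the gold vector.

    weights: array of shape (T, k), one k-dimensional weight vector per time step.
    gold:    array of shape (k,), the gold vector.
    Returns an array of shape (T,) with one scalar per time step.
    """
    weights = np.asarray(weights, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return np.linalg.norm(weights - gold, axis=1)


# Toy input: two time steps, two dimensions.
a = weight_scalars([[0.0, 0.0], [3.0, 4.0]], [0.0, 0.0])
```

The resulting scalars can then be mapped to colors, one per source word, to draw the heat map.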

Results and Discussion
In this section, we report ROUGE F1 for the CNN/Daily Mail test set and ROUGE recall for the DUC-2004 test set. We use the pyrouge package (https://pypi.org/project/pyrouge/) and the official ROUGE script (https://github.com/summanlp/evaluation/tree/master/ROUGE-RELEASE-1.5.5) to obtain our ROUGE scores. We then present the results of the saliency and readability evaluations, followed by the weight heat map. Finally, we discuss our results.

Results
The reference summaries of CNN/Daily Mail and DUC-2004 differ in length, so we set different maximum decoder steps at the test stage. For CNN/Daily Mail, the reference summaries contain 53 words on average, so we set the maximum decoder steps to 100. Table 4 shows the results for CNN/Daily Mail. Our model achieves state-of-the-art results without reinforcement learning (RL): we trained it using maximum likelihood (ML) only. The experiments of Celikyilmaz et al. [31] show that RL can noticeably improve ROUGE scores, so applying RL may become part of our future work. The compared models are described below.
Words-lv2k-temp-att: Nallapati et al. [12] used a pointer mechanism to handle out-of-vocabulary (OOV) words and a feature-rich encoder to embed the words.
Pointer-Generator + Coverage: See et al. [16] also adopted a pointer to handle OOV words and introduced an extended vocabulary. In addition, the model used a coverage mechanism to reduce repetition. This model was our baseline.
ML, with intra-attention: Paulus et al. [22] used an attention mechanism inside the decoder to reduce repetition.
Controlled summarization: Fan et al. [36] presented a neural summarization model that lets users specify high-level attributes, such as the desired length, style and entities, in order to control the shape of the generated summaries to better suit their needs.
End2end w/inconsistency loss: Hsu et al. [28] combined an extractive model with an abstractive model to generate summaries.
DCA MLE + SEM + RL: Celikyilmaz et al. [31] presented deep communicating agents in an encoder-decoder architecture to address the challenges of representing a long document for abstractive summarization, and trained their model using reinforcement learning.
DCA MLE + SEM: Celikyilmaz et al. [31] trained this variant of their model without RL.
For DUC-2004, the reference summaries contain 75 bytes on average, so we changed the maximum decoder steps to 25. In addition, because most past work on DUC-2004 was evaluated using ROUGE recall, we also report ROUGE recall for DUC-2004. The results are shown in Table 5. They show that our model outperforms the state-of-the-art baseline model in ROUGE-1 and ROUGE-L recall by at least 1.9 points.
ABS+: Rush et al. [10] used a CNN to encode the source text and a neural language model to decode.
Words-lv5k-1sent: Nallapati et al. [12] trained the model on the first sentence of the source text and adopted the large vocabulary trick based on an attentional encoder-decoder model.
C2R + Atten: Chopra et al. [11] used a CNN to encode and an RNN to decode, which outperformed the ABS+ model.
SEASS: Zhou et al. [24] adopted selective encoding to extend the seq2seq model, which reduced the burden on the decoder.
AC-ABS: Li et al. [37] employed an actor-critic framework to enhance the traditional abstractive model and improve the quality of the generated summaries.
We randomly selected three examples for visual evaluation. The result is shown in Figure 4. The summaries generated by our model capture more of the key information contained in the source text, which indicates that they have higher saliency than the summaries of See et al. [16]. In addition, we randomly picked 100 examples and assigned them to three different people to score anonymously. The result of the saliency evaluation is shown in Table 6. The summaries generated by our model have a higher relevance score than those of the baseline model, so our proposed model enhances the saliency of text summarization. We also randomly selected 100 examples and assigned them to three different evaluators to score for readability. The syntax and grammar scores are presented in Table 7. The summaries generated by our model had a higher syntax score and a higher grammar score than those generated by the model of See et al. [16], so we can say that the summaries generated by our model are more readable.
Table 6. Saliency evaluation results. See et al. [16] is the summary generated by the model of See et al. [16] and our model is the summary generated by our model.

Finally, we randomly selected an example to visualize the weight $a_i$ (in Section 4.2). Figure 5 shows the result, in which we can see that key words in the source text were picked by the MLP, such as "Cambodian", "rejected", "demands", "talks", "outside", "political", "Government", "opposition", "asked", "meeting", etc. This shows that our extra attention mechanism and similarity module determined the importance of each word in the source text and obtained the important information before the decoder. This effectively reduced the interference of irrelevant information with the decoder. Therefore, the generated summaries contain more key information; that is, their saliency is higher.

Discussion
Current abstractive models implicitly apply attention mechanisms to extract the key information while the summaries are being generated. We think the model benefits from explicitly extracting key information before the decoder, so we propose an attentive information extraction model to obtain the important information before the decoder. The results in Section 5.1 showed that our model effectively reduced the interference of irrelevant information in the source text. This makes the summary more accurate and its saliency higher than that of the baseline model [16]. In addition, the readability evaluation showed that the summaries generated by our model were more readable.
However, the target of an abstractive model is not only to generate a summary with higher saliency, but also to generate novel n-grams, as the reference summaries do. In order to evaluate the abstractive ability of our model, we conducted a detailed statistical analysis of the percentage of new n-grams for DUC 2004 and CNN/Daily Mail. The result is shown in Figures 6 and 7.

From Figures 6 and 7, we can see that although our model is abstractive, it does not produce new n-grams as often as the reference summaries. For CNN/Daily Mail, the model of See et al. [16] produced more novel n-grams than our model. However, we can also see in Figure 4 that although the model of See et al. produced more new n-grams, most of them were erroneous. Although our model produced fewer novel n-grams, the saliency of its summaries was higher and most of the summaries were correct. Thus, our attentive information extraction schema is still useful, even though most of the content was copied from the source text. For DUC 2004, the results of our model and of See et al. [16] differed less. The number of novel n-grams for DUC 2004 was lower than for CNN/Daily Mail on the whole. A possible reason is that the summaries were too short, so the generation process was already over by the time the model began to generate new n-grams in large numbers. In this situation, the results of our model and of the model of See et al. were similar.
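The statistic plotted in Figures 6 and 7 can be sketched as follows. This is a simplified illustration that assumes lowercased whitespace tokenization and counts every n-gram occurrence in the summary; the exact counting convention behind the figures may differ, and the function name and example sentences are made up for the demonstration.

```python
def novel_ngram_percentage(source, summary, n):
    """Percentage of n-grams in the summary that never occur in the source."""
    src_tokens = source.lower().split()
    sum_tokens = summary.lower().split()
    src_ngrams = {tuple(src_tokens[i:i + n]) for i in range(len(src_tokens) - n + 1)}
    sum_ngrams = [tuple(sum_tokens[i:i + n]) for i in range(len(sum_tokens) - n + 1)]
    if not sum_ngrams:
        return 0.0
    novel = sum(1 for gram in sum_ngrams if gram not in src_ngrams)
    return 100.0 * novel / len(sum_ngrams)


source = "the cat sat on the mat"
summary = "the cat lay on the mat"
p1 = novel_ngram_percentage(source, summary, 1)   # only "lay" is new
p2 = novel_ngram_percentage(source, summary, 2)   # "cat lay" and "lay on" are new
```

A single new word changes up to n of the surrounding n-grams, which is why the percentage typically grows with n.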
Additionally, the probability of generating a novel word also provides a measure of the abstractive ability of the model. Here we used $P_{gen}$ [16] to represent this probability. In order to measure the abstractive ability of our model, we recorded the value of $P_{gen}$ at the beginning and at the end of training. We found that $P_{gen}$ started at about 0.26 and then increased, converging to about 0.55 by the end of training. This shows that the model first learns mostly to copy, then learns to generate. However, during the test stage, $P_{gen}$ was very low, with a mean value of only 0.16. The result was similar for See et al. [16]. For this reason, we agree with See et al. that the difference arises because the model receives word-by-word supervision in the form of the reference summary during training, but not during testing. This is far from the purpose of abstractive summarization. Solving this problem without affecting the performance of the model is part of our future work. One option is to additionally train our model with RL, using the rate of abstraction as a reward so that the model produces more novel n-grams. Similarly, if we want the generated summary to be more similar to the reference summary, we can use RL to reward a larger ROUGE score and thereby improve the performance of the model. This may also improve the degree of abstraction.
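To illustrate why a low $P_{gen}$ means mostly copying, the sketch below mixes a vocabulary distribution and a copy (attention) distribution in the way the pointer-generator model of See et al. [16] does. It is a simplified NumPy illustration with made-up numbers: all source words are assumed to be in the vocabulary, so the extended vocabulary for OOV words is omitted, and the function name is hypothetical.

```python
import numpy as np


def final_distribution(p_gen, vocab_dist, attention, source_ids):
    """Mix generation and copying: p_gen * P_vocab + (1 - p_gen) * copy mass.

    vocab_dist: probabilities over the vocabulary, shape (V,).
    attention:  attention weights over source positions, shape (T,).
    source_ids: vocabulary id of the word at each source position, shape (T,).
    """
    final = p_gen * np.asarray(vocab_dist, dtype=float)
    copy_mass = (1.0 - p_gen) * np.asarray(attention, dtype=float)
    # Add the copy mass of each source position to its word id
    # (repeated ids accumulate).
    np.add.at(final, np.asarray(source_ids), copy_mass)
    return final


# With the mean test-time value of 0.16, most of the mass comes from copying.
dist = final_distribution(0.16, [0.5, 0.3, 0.2, 0.0], [0.6, 0.4], [3, 1])
```

In this toy case, word 3 has zero generation probability but still receives the largest final probability because it carries most of the attention.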
In addition, our model extracts the important information before the decoder, thereby enhancing the saliency of the summary, but Figure 5 shows that it cannot completely filter out all irrelevant information, such as "accusing", "parties", "the", etc. We may apply a hard attention mechanism [38] to solve this problem in the future, although hard attention may lose information if the information extraction mechanism is not good enough. Moreover, the ROUGE-2 recall decreased noticeably for DUC-2004 (see Section 5.1). This is a negative outcome. In order to understand the reason, we increased Max_dec_steps (in Section 4.2) to 30; in this case the ROUGE-2 recall was 9.75. When we increased it further to 35, the ROUGE-2 recall continued to increase. The result is shown in Table 8. We can infer that the result declines because CNN/Daily Mail does not match DUC-2004: the reference summary length for CNN/Daily Mail was 53 words on average, but for DUC-2004 it was only 75 bytes. Even so, the ROUGE-1 and ROUGE-L recall increased. We mainly considered ROUGE-L, which represents the rate of the longest common subsequence between the reference summary and the generated summary.
Because we adopted three LSTM models, training was very slow. To speed it up, we applied some tricks, such as discarding source texts with a length over 500 and setting Max_enc_steps and Max_dec_steps to 250 and 50, respectively, in the early stages of training. As CNN/Daily Mail is a news dataset, we assume that the key information appears in the first half of the text, which is why we initially fixed the maximum input length at 250. When the model began to converge, we changed the two limits to 400 and 100. This effectively sped up the training.
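The two-stage length schedule described above can be sketched as follows. The function names are hypothetical, and the `converging` flag stands in for whatever criterion is used to decide that the model has begun to converge, which is not specified here.

```python
def keep_example(src_tokens, max_len=500):
    """Drop source texts whose length exceeds max_len."""
    return len(src_tokens) <= max_len


def length_limits(converging):
    """(Max_enc_steps, Max_dec_steps): short limits early, full limits later."""
    return (400, 100) if converging else (250, 50)


def truncate_pair(src_tokens, tgt_tokens, converging):
    """Cut a (source, summary) pair to the limits of the current training stage."""
    max_enc, max_dec = length_limits(converging)
    return src_tokens[:max_enc], tgt_tokens[:max_dec]


src = ["w"] * 450   # placeholder source text of 450 tokens
tgt = ["s"] * 60    # placeholder summary of 60 tokens
early = truncate_pair(src, tgt, converging=False)
late = truncate_pair(src, tgt, converging=True)
```

Shorter sequences mean fewer recurrent steps per batch, which is where the speed-up in the early stage comes from.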
In summary, although our model has the above weaknesses, the anonymous subjective human evaluation showed that the saliency of the generated summaries was enhanced and that their readability was better than that of the baseline model. The results for CNN/Daily Mail and DUC-2004 also outperformed the state-of-the-art baseline model. In the future, we will use RL to encourage our model to write summaries more abstractively and try to adopt a hard attention mechanism before the decoder to extract important information.

Conclusions
In this work, our target was to enhance the saliency of the summary in abstractive text summarization. In order to achieve this, we proposed an attentive information extraction model to obtain the skeleton of the source text, namely, the important information for the decoder. We conducted our experiments using CNN/Daily Mail and DUC-2004. The experiments showed that our proposed model can effectively extract important information from the source text before the decoder. We achieved a 42.01 ROUGE-1 F-score and a 33.94 ROUGE-1 recall for the CNN/Daily Mail and DUC 2004 datasets, respectively. Our results outperformed the state-of-the-art abstractive model by at least 1.33 points for the CNN/Daily Mail dataset and by at least 1.9 points for DUC 2004. Finally, human evaluation showed that the saliency of the summaries generated by our model was further enhanced and that their readability was better than that of the baseline model. As part of our future work, we plan to apply RL and hard attention mechanisms to the abstractive model to further improve its performance.

Figure 1. An example of abstractive text summarization. Green font is the key information in the source text. Red font is the key information obtained by the current abstractive model.

Figure 2. The flow diagram of our model. At the training stage, the Generated Summary in the figure does not exist. Our target is to train an abstractive model, namely, the part drawn by the blue dotted line. At the test stage, our input is only the source text and the part represented by the red dotted line does not exist; the output is the summary generated using the abstractive model.

Figure 3. Our proposed model. Before the decoder, an extra attention mechanism is used to extract important information and compare the similarity of the reference summary and the extracted information to ensure the correctness of the necessary information. In order to enhance the ability of the extra attention mechanism in extracting information, the similarity score is fed back to the network, skipping the decoder.


Figure 4. Examples of abstractive summarization. Green font is the key information of the source text and red font represents the effective information generated by the abstractive model. The source text (1-3) represents original texts, the gold summary is the reference summary, the generated summary is the summary by the model of See et al. [16] and our model represents the summary by our proposed model.

Figure 5. The weight heat map. The words in the picture are the source text. A darker color indicates a greater weight, meaning that the corresponding word is more important. The reference summary was "Cambodian government rejects opposition's call for talks abroad". The summary generated by our model was "Cambodian leader Hun Sen rejected opposition parties demands for talks outside the country".

Figure 6. The percentage of new n-grams for CNN/Daily Mail. A larger percentage indicates stronger abstraction.

Figure 7. The percentage of new n-grams for DUC 2004. A larger percentage indicates stronger abstraction.

Table 1. Our model parameters at the training stage. Max_enc_steps and Max_dec_steps are the maximum allowed lengths of the source text fed to the encoder and of the generated summary, respectively.

Table 2. Saliency scoring criteria. Relevance indicates the informativity of the summary generated by the model (our model or the model of See et al. [16]). Scores range from 0 to 5; a higher score indicates higher saliency.

Table 3. Readability scoring criteria. Scores range from 1 to 5; a higher score indicates stronger readability.

Table 4. ROUGE F1 on CNN/Daily Mail. All our ROUGE scores have a 95% confidence interval in the official ROUGE script.

Table 5. ROUGE recall on DUC-2004. All our ROUGE scores have a 95% confidence interval in the official ROUGE script.

Table 7. Readability evaluation results. See et al. [16] is the summary generated by the model of See et al. [16] and our model is the summary generated by our model. A/B: A represents the syntax score and B the grammar score.