1. Introduction
Automatic summarization [1] is a technique that uses computers to understand and analyze text in order to generate concise summaries covering the topics of the original text. It is an important research direction in Natural Language Processing (NLP), and also a pre-task for many downstream applications, such as automated question answering systems [2] and news headline generation [3].
Existing automatic summarization methods can be divided into two main categories: extractive and abstractive. Extractive methods compose a summary by extracting important text units from the original text. Abstractive methods are more similar to manual summarization: they restate the original text via techniques such as synonymous substitution and sentence abbreviation, producing summaries that largely contain words or phrases not found in the original text and that are more fluent than summaries produced by extractive methods.
The Sequence to Sequence (Seq2Seq) [4] summary generation framework incorporating deep neural networks has been widely studied in recent years. Moreover, the Seq2Seq method with an attention mechanism [5] directly addresses the vanishing-gradient problem caused by excessively long sentences, thus improving the quality of generated summaries.
The performance of summary generation is also affected by out-of-vocabulary (OOV) and redundant words. To this end, See et al. [6] proposed the pointer-generator network, which either copies words from the original text or generates new words from a fixed vocabulary, and uses a coverage mechanism to alleviate word redundancy, thus improving the quality of generated summaries.
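The mixture of copying and generating in the pointer-generator network [6] can be illustrated with a minimal sketch (the function name and dictionary-based representation are ours for illustration, not See et al.'s code):

```python
# Illustrative sketch of the pointer-generator final word distribution:
# the model mixes a vocabulary distribution with a copy distribution
# derived from the attention weights over the source words.

def final_distribution(p_gen, p_vocab, attention, source_words):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on w."""
    p_final = {w: p_gen * p for w, p in p_vocab.items()}
    for word, attn in zip(source_words, attention):
        # OOV source words receive probability mass only via the copy term,
        # which is how the pointer mechanism handles out-of-vocabulary words.
        p_final[word] = p_final.get(word, 0.0) + (1.0 - p_gen) * attn
    return p_final
```

For example, with `p_gen = 0.8`, a source-only word still obtains 20% of the attention mass placed on it, so it can appear in the summary even though it is absent from the fixed vocabulary.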
In general, the Seq2Seq framework combined with an attention mechanism provides the basis for studies of automatic summarization. However, existing approaches treat summary generation as a translation process from the original text to the summary; moreover, the attention mechanism is mostly built between characters of the original text and characters of the summary, with little research from the perspective of the topics of the original text. In terms of evaluation metrics, the widely used ROUGE [7] is a recall-based metric: it counts n-gram overlap between the generated summary and the reference summary, without evaluating topic consistency between the summary and the original text. As a result, the topics of a summary may deviate from the original text despite high ROUGE scores.
We argue that manually writing a summary amounts to reading and understanding the full text to find the words and phrases most relevant to the original topics, and then writing a summary that matches those topics. Therefore, we propose an Abstract Summarization method Combining Global Topics (ACGT) that improves summary quality by constructing a global topic information extractor and an attention module that incorporates topic information. Our main contributions are as follows:
We propose a summary generation method incorporating global topic information: we model the topics of the document and update the document representation by fusing the text's topic information with word embeddings through an information fusion module.
We propose the nTmG (n-TOPIC-m-GRAM) method to extract key topic information from the original text. This method is designed to avoid the noise caused by introducing topics.
Empirical studies show that the proposed method outperforms baseline methods. They also show that the number of incorporated topics is tightly correlated with summary quality, which provides empirical evidence for subsequent automatic summarization studies that combine topics.
2. Related Works
The mainstream automatic summarization methods can be divided into two types: extractive and abstractive. Extractive methods extract words, sentences, and other semantic units from the original text; representative examples include semantic-information-based methods [8] and structural-information-based methods [9,10]. Abstractive summarization is closer to manual summarization: it restates the original text with words, sentences, and phrases that differ from the original, and thus challenges a model's ability to understand, represent, and generate text. In recent years, the application of deep neural networks and attention mechanisms in machine translation has promoted research on abstractive summarization with an encoder–decoder structure. In 2015, Rush et al. [5] first introduced an encoder–decoder structure and attention mechanism to the abstractive summarization task. Later, Nallapati et al. [11] combined an attention mechanism with an RNN and utilized information such as stop words and document structure; their ROUGE-L improved by 1.47% over Rush et al. on the DUC-2004 and Gigaword datasets.
In 2018, Gehrmann et al. [12] used a content selector as a bottom-up attention step to constrain the model to likely phrases. This approach improves the ability to compress text while still generating fluent summaries. Celikyilmaz et al. [13] introduced deep communicating agents in an encoder–decoder architecture: the encoding task is divided among cooperative agents, each responsible for representing a part of the text, to address the challenge of representing a long document for abstractive summarization. Empirical results show that multiple communicating encoders produce higher-quality summaries than baseline methods.
In addition, OOV words and redundant words have a significant impact on summary generation. To this end, Gulcehre et al. [14] handled OOV words with two strategies, finding a similar word in the preset vocabulary or copying the original word, with a two-layer perceptron deciding between them during decoding. Gu et al. [15] proposed COPYNET, which adds the output word probabilities of two modules, Generate-Mode and Copy-Mode, at the decoder to obtain the final word distribution. Vinyals et al. [16] presented a pointer network that uses the attention weights over the input sequence as a pointer and outputs a word probability distribution over the input sequence. Furthermore, See et al. [6] proposed a method combining generation and copying to solve the OOV problem, and introduced a coverage mechanism to reduce redundant words.
To improve summary quality, researchers have analyzed the relationship between the original text and the summary from different perspectives. Ruan et al. [17] proposed a novel approach to formulate, extract, encode, and inject hierarchical structure information explicitly into an extractive summarization model; their HiStruct model outperforms baselines on CNN/Daily Mail, PubMed, and arXiv. Mao et al. [18] presented DYLE, a dynamic latent extraction approach for abstractive summarization that treats extracted text segments as latent variables and employs dynamic segment-level attention weights during decoding. Experimental results show that DYLE outperforms existing methods on GovReport and QMSum.
In recent years, researchers have treated summary generation as machine translation and proposed many models. However, there are still significant differences between the two tasks: a summary need only retain the key information of the original text, not all of it, so it is preferable for the summary to cover the topics of the original text well. This is inadequately reflected in current evaluation metrics, since the widely used ROUGE metric is recall-based.
Therefore, researchers have carried out topic-oriented research. Li et al. [19] proposed UCTOPIC, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining; it outperforms the state-of-the-art phrase representation model by 38.2% NMI on average across four entity clustering tasks. Bahrainian et al. [20] introduced NEWTS, the first topical summarization corpus, based on the well-known CNN/Daily Mail dataset and annotated via online crowdsourcing; the goal was to create a dataset that supports topic-focused summarization and enables study of the relationship between original-text topics and summaries. Li et al. [21] proposed a hierarchical contrastive learning mechanism to unify the mixed granularity of semantic meaning in the input text, covering both common vocabulary and topic vocabulary.
Subsequently, researchers introduced the Latent Dirichlet Allocation (LDA) topic model [22] to the summarization task. Wu et al. [23] proposed an extractive summarization method based on LDA, which calculates sentence weights according to the position and title of each sentence in the document, and extracts sentences according to these weights to form summaries. Liu et al. [24] presented a multi-document extractive method based on important LDA topics. For abstractive summarization, Yang et al. [25] proposed a hybrid summarization model based on topic awareness, adding document topics to aid summary generation. Guo et al. [26] used LDA to obtain topic words, constructed a composite attention mechanism, and combined it with a Generative Adversarial Network (GAN) [27] to generate summaries.
Comparing these previous works, many fusion modules are constructed and multistep attention is performed, which complicates the models. Specifically, in 2021, Yang et al. [25] proposed a topic-aware summary generation method for long texts. Although it achieves better ROUGE scores than its baselines, it introduces noise by feeding all of the document's topics into the model, making the generated summaries redundant. This inspired us to propose a summary generation method that combines global topic information. Our method focuses on the topic information that has a significant impact on the original text and succinctly integrates it into the representation of the original text, improving summary quality. The proposed model is validated on two standard datasets.
3. Motivation
The purpose of the summary is to generate a short overview text that clarifies the important points of the article. The summary should cover the global topic information of the original text. The global topic information mentioned in this paper refers to the part of the document topics that have an important impact on summary generation. The relationship between global topic information and a summary is analyzed below.
In 2003, Blei et al. [22] proposed LDA, which provides a method for discovering the underlying topics of documents. In recent years, LDA has been introduced into automatic summarization tasks by many researchers and has achieved advanced results [28,29]. LDA is also the basis of our proposed ACGT model. For any document d, let V denote the encoding of its word sequence; the document topic distribution vector T of d generated by the LDA model reflects the global topic of d, and T can be integrated into V to obtain a representation V′ that fuses global topic information, which is then used by the decoder. Therefore, we need to focus on two issues: one is to avoid the noise caused by introducing topics, and the other is to improve the effectiveness of the topics' introduction.
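The fusion of T into V is not yet specified at this point; one plausible, minimal sketch (the function name and the additive attention-weighted fusion are our assumptions, not the paper's exact module) of folding a topic vector into the word representations:

```python
import math

def fuse_topic(V, T):
    """Hypothetical fusion: attend each word vector v_i to the topic
    vector T and add the attention-weighted topic information to it."""
    # dot-product alignment between each word vector and the topic vector
    scores = [sum(v_j * t_j for v_j, t_j in zip(v, T)) for v in V]
    # numerically stable softmax over the alignment scores
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    Z = sum(exp)
    alphas = [e / Z for e in exp]
    # V'_i = v_i + alpha_i * T : words aligned with the topic absorb more of it
    return [[v_j + a * t_j for v_j, t_j in zip(v, T)]
            for v, a in zip(V, alphas)]
```

Words whose embeddings align with the topic vector receive the largest share of the topic information, which matches the intuition that topic fusion should emphasize topic-relevant words.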
In this paper, we analyze more than 280,000 examples in the CNN/Daily Mail dataset; the topic relevance of the original text and summary is shown in Figure 1. For 183,794 examples (63%), the TOP 1 topic of the original text is also the TOP 1 topic of the target summary. This suggests that the TOP 1 topic of the original text appears in the summary with high probability, which inspired us to propose a method that guides summary generation with the TOP N topics of the original text.
To avoid the noise caused by the introduction of topics, we propose the nTmG (n-TOPIC-m-GRAM) method, which selects the TOP m highest-probability words from the n topics with the largest probability in the original text's topic distribution. Then, to enhance the effectiveness of the nTmG information, we fuse the mean vector of these words with the original representation through an attention mechanism.
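Under one plausible reading of nTmG (m words from each of the n most probable topics; the function name and data layout are ours for illustration), the selection can be sketched as:

```python
def ntmg(doc_topic_dist, topic_words, n, m):
    """nTmG sketch (assumed reading): take the n topics with the highest
    probability in the document's topic distribution, then the m most
    probable words from each of those topics."""
    # indices of the n most probable topics for this document
    top_topics = sorted(range(len(doc_topic_dist)),
                        key=lambda k: doc_topic_dist[k], reverse=True)[:n]
    keywords = []
    for k in top_topics:
        # topic_words[k] is a list of (word, probability) pairs for topic k
        ranked = sorted(topic_words[k], key=lambda wp: wp[1], reverse=True)
        keywords.extend(w for w, _ in ranked[:m])
    return keywords
```

In practice, `doc_topic_dist` and `topic_words` would come from a trained LDA model; restricting to the TOP n topics is what keeps low-probability topics from injecting noise.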
5. Experiments
5.1. Dataset
Experiments are conducted on an English long-document dataset and a Chinese short-text dataset, respectively. The English dataset is CNN/Daily Mail (CNN/DM) [31], which contains 287,227 training pairs and 11,490 validation pairs. The basic statistics are shown in Table 1: the average length of the original text in the training set is 766 words over 29.74 sentences, the average length of the target summary is 53 words over 3.72 sentences, and the ratio of summary length to original length is 1/14.45.
The Chinese dataset is LCSTS (Large-scale Chinese Short Text Summarization dataset), contributed by Hu et al. [32] based on content published on Weibo by authoritative certified users such as China Daily, with a scale of more than 2 million examples. The dataset consists of three parts, as shown in Table 2. Part I is the training set; Part II is randomly sampled from Part I, with human scores from 1 to 5 added, where 1 indicates the lowest correlation between the document and the summary and 5 the highest. Part III is independent of the first two parts and also carries 1–5 human scores. For a fair comparison, following the data setup of the baseline model [32], this paper takes Part I as the training set and the Part III data with scores above 3 as the test set.
5.2. Evaluation Metrics
We use the standard ROUGE-1, ROUGE-2, and ROUGE-L metrics [33] to measure summary quality. ROUGE-N is calculated according to:

$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{Ref\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{Ref\}} \sum_{gram_n \in S} Count(gram_n)}$$

where $gram_n$ represents an n-gram, $\{Ref\}$ represents the reference summary, $Count_{match}(gram_n)$ represents the number of n-grams that appear in both the generated summary and the reference summary, and $Count(gram_n)$ indicates the number of n-grams appearing in the reference summary.

ROUGE-L is used to measure the readability of the generated summary, and its calculation is shown in the following equations:

$$R_{lcs} = \frac{LCS(X, Y)}{m}, \qquad P_{lcs} = \frac{LCS(X, Y)}{n}, \qquad F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

where $LCS(X, Y)$ is the length of the longest common subsequence of X and Y, m and n are the lengths of the reference summary and the generated summary, and $R_{lcs}$ and $P_{lcs}$ represent the recall rate and precision rate, respectively. Since ROUGE cannot directly evaluate Chinese summaries, Chinese characters are first converted into numeric identifiers before evaluation.
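The recall-oriented ROUGE-N computation above can be sketched in a few lines (a simplified single-reference version for illustration, not the official ROUGE toolkit):

```python
from collections import Counter

def rouge_n(generated, reference, n):
    """Recall-oriented ROUGE-N: overlapping n-gram count divided by the
    number of n-grams in the reference summary."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    gen, ref = ngrams(generated), ngrams(reference)
    # clipped overlap: an n-gram counts at most as often as it appears
    # in each of the two summaries
    overlap = sum(min(count, gen[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Because the denominator counts reference n-grams, a generated summary that simply repeats reference words cannot exceed a score of 1, but an overly long summary is not penalized, which is why ROUGE is described as recall-based.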
5.3. Experimental Setup
The experiments in this paper are conducted with the PyTorch deep learning framework on an NVIDIA GeForce RTX 3090 Ti GPU. Training uses the ADAGRAD [34] optimizer with a learning rate of 0.15. For the CNN/DM dataset, we follow the preprocessing of See et al. and use the non-anonymized version of the data. Tokenization is performed with the Stanford CoreNLP toolkit; the original text is truncated to 400 tokens, the summary to 100 tokens for training and 120 for testing, and the preset vocabulary is set to 50 k. For the LCSTS dataset, four special characters are first inserted into the documents: <PAD> as a padding character, <UNK> as the OOV placeholder, and <s> and </s> as sentence start and end identifiers. The vocabulary size is set to 40 k in the character-based setting and 50 k in the word-based setting, using the JIEBA segmentation tool. In the coverage mechanism, the weight of the coverage loss is set to 1.
In our approach, we use a bidirectional LSTM in the encoder and a unidirectional LSTM in the decoder, with a hidden dimension of 256 in both. Our model uses 128-dimensional word embeddings, and the batch size is set to 16. We use beam search with a beam size of 4 to generate summaries.
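The beam search decoding mentioned above can be sketched generically (the `step_fn` interface and token names are our assumptions for illustration, not the paper's implementation):

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=4, max_len=20):
    """Keep the beam_size partial sequences with the highest cumulative
    log-probability; step_fn(seq) returns (token, probability) pairs."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                finished.append((seq, score))  # hypothesis is complete
                continue
            for tok, p in step_fn(seq):
                candidates.append((seq + [tok], score + math.log(p)))
        if not candidates:  # every beam has finished
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    best_seq, _ = max(finished or beams, key=lambda c: c[1])
    return best_seq
```

In the actual model, `step_fn` would be one decoder step producing the next-token distribution; summing log-probabilities instead of multiplying probabilities avoids numerical underflow on long sequences.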
5.4. Experimental Results
We conducted four runs of each experiment; the final results are the arithmetic mean of the four runs, and the standard deviations are presented in Table 3, Table 4 and Table 5.
We chose eight representative state-of-the-art models for comparison, with the pointer-generator network with coverage mechanism as our baseline.
Lead-3 [35]: A traditional simple extractive summarization model that extracts the first three sentences of the article as the summary.
RNN [33]: An RNN is used as the encoder and decoder, and the final hidden-layer vector is used as the input of the decoder.
RNN context [32]: An RNN is used as the encoder and decoder; the weighted sum of all hidden vectors on the encoding side is used to decode the summary.
ABS [5]: Generates the summary using an attention-based encoder–decoder structure, as proposed by Rush et al.
CopyNet [15]: A hybrid mechanism that obtains information from the memory unit and encodes the content and location of the text, mainly to handle OOV words.
PGEN [16]: A Seq2Seq + Attention structure with a pointer network that allows copying words from the original text or generating new words from a fixed vocabulary.
PGEN + Cov [6]: Combines a pointer network with an attention-based encoder–decoder, and alleviates the problem of generating redundant words with a coverage mechanism.
Key information guide model [36]: Fuses key document information, including people, time, and place, in the form of keywords or key sentences into the generation module using a multi-view attention approach to guide summary generation.
The comparison experiments show that the proposed ACGT outperforms all baseline methods on both the CNN/DM and LCSTS datasets. On the CNN/DM dataset, ACGT yields gains of 0.96/2.44/1.03 in ROUGE-1/2/L scores over PGEN + Cov. On the LCSTS dataset (word-based), the ROUGE-1/2/L scores of ACGT improve by 1.19/1.03/0.85 over PGEN + Cov, and by 1.57/0.80/0.87 on the LCSTS dataset (character-based). This demonstrates that the topic-combining summary generation method of ACGT is effective.
5.5. Ablation Experiment
In the ablation experiment, to further illustrate the influence of the key topic words introduced by ACGT, this paper experimentally analyzes the correlation between summary quality and the number of topic terms. The number of terms extracted from each topic in the CNN/DM and LCSTS datasets ranges from 1 to 10, and the ROUGE values of the summaries generated by ACGT are shown in Figure 5 and Figure 6.
Overall, the number of key topic words has an impact on summary quality: ROUGE improves slightly when key topic words are added to the model, and tends to increase with the number of terms. At the same time, performance on both datasets remains stable as the number of key topic words varies, indicating that ACGT is not sensitive to this hyperparameter.
We believe that, although TOP m key topic words from the TOP n topics are fused, the attention mechanism in our model sufficiently suppresses the noise caused by the introduced words, so summary performance remains stable as the number of key topic words changes. For CNN/DM, the number of terms used in this experiment is nine; for the LCSTS dataset, six.