T5-Based Model for Abstractive Summarization: A Semi-Supervised Learning Approach with Consistency Loss Functions

Abstract: Text summarization is a prominent task in natural language processing (NLP) that condenses lengthy texts into concise summaries. Despite the success of existing supervised models, they often rely on datasets of well-constructed text pairs, which can be insufficient for languages with limited annotated data, such as Chinese. To address this issue, we propose a semi-supervised learning method for text summarization. Our method is inspired by the cycle-consistent adversarial network (CycleGAN) and treats text summarization as a style transfer task. The model is trained with a procedure and loss functions similar to those of CycleGAN and learns to transfer the style of a document to that of its summary and vice versa. Our method can be applied to multiple languages, but this paper focuses on its performance on Chinese documents. We trained a T5-based model and evaluated it on two datasets, CSL and LCSTS; the results demonstrate the effectiveness of the proposed method.


Introduction
Automatic text summarization is a crucial task in natural language processing (NLP) that aims to condense the core information of a given corpus into a brief summary. With the exponential growth of textual data, including documents, articles, and news, automatic summarization has become increasingly important.
Text summarization methods can be classified into two categories: extractive and abstractive. Extractive summarization selects the most important sentences from the original corpus based on statistical or linguistic features, whereas abstractive summarization generates a summary by semantically understanding the text and expressing it in a new way [1]. Abstractive summarization is more challenging than extractive summarization, but it is also considered superior, as it avoids the coherence and consistency issues of summaries generated with extractive methods.
Deep learning has achieved state-of-the-art results in NLP, and many researchers have shifted their focus to abstractive summarization. The sequence-to-sequence (seq2seq) model [2] combined with an attention mechanism has become a benchmark in abstractive summarization [3][4][5]. However, these methods require well-constructed datasets, which can be difficult and costly to build.
In this paper, we propose a semi-supervised learning method for text summarization that treats summarization as a style transfer task. Our approach uses a Text-to-Text Transfer Transformer (T5) model as the text generator and trains it with loss functions from the cycle-consistent adversarial network (CycleGAN) for semantic transfer.
The remainder of this paper is structured as follows. In Section 2, we review previous research related to our work. Section 3 describes our method of text summarization in detail. Section 4 presents the experimental results of our proposed model. In Section 5, we perform an extensive ablation study to validate the effectiveness of our model. Finally, we summarize our work in Section 6.

Automatic Text Summarization
Automatic text summarization is a crucial task in the field of natural language processing (NLP), and it has received significant attention from researchers in recent years. Over the years, a range of methods and models have been proposed to improve the quality of automatic text summaries. In the early days of NLP research, traditional approaches to text summarization were based on sentence ranking algorithms that evaluated the importance of sentences in a given text. These methods used statistical features, such as frequency and centrality, to rank sentences and select the most important ones to form a summary [6][7][8].
With the advent of machine learning techniques in the 1990s, researchers applied these methods to NLP to improve the quality of summaries. In automatic text summarization, the problem is mostly framed as sequence classification: models are trained to differentiate summary sentences from non-summary sentences [9][10][11][12]. These methods are referred to as extractive, as they essentially extract important phrases or sentences from the text without fully understanding their meaning. Thanks to the tremendous success of deep learning techniques, many extractive summarization studies have been proposed based on techniques including the encoder-decoder classifier [13], recurrent neural networks (RNNs) [14], sentence embeddings [15], reinforcement learning, and long short-term memory (LSTM) networks [16].
Moreover, the development of deep learning has given rise to abstractive summarization. Abstractive summarization has improved significantly and has become a crucial area of research in the NLP field. Researchers have made remarkable progress by leveraging deep learning techniques such as RNNs [3], LSTMs [17], and classic seq2seq models [4,5].
With the introduction of the transformer architecture in 2017 [18], transformer-based models have significantly outperformed other models in many NLP tasks. The architecture was naturally applied to text summarization, leading to the development of several models based on pre-trained language models, including BERT [19], BART [20], and T5 [21]. These models have demonstrated remarkable performance on various NLP tasks, including text summarization.

Text Style Transfer
Text style transfer is a task in the field of NLP that focuses on modifying the style of a text without altering its content. This task has received considerable attention from researchers due to its potential applications in many areas, such as creative writing, machine translation, and sentiment analysis.
The early methods for text style transfer mainly focused on rule-based approaches, where linguistic patterns and attributes were manually defined and applied to modify the style of text [22]. These methods, though simple and effective, are limited by the fixed set of rules that they rely on, which may not adapt well to changing styles and genres.
With the advent of deep learning, several machine-learning-based approaches have been proposed. The most well-known method is the sequence-to-sequence (seq2seq) model [2]. Seq2seq models have been used in various NLP tasks, such as text summarization and machine translation, due to their ability to encode a source text and generate a target text.
Recently, generative adversarial networks (GANs) [23] were applied to the task of text style transfer. The idea of GANs is to train two neural networks: a generator and a discriminator. The generator tries to generate text that is indistinguishable from the target style, while the discriminator tries to differentiate between the generated text and real target text.

Cycle-Consistent Adversarial Network
The cycle-consistent adversarial network (CycleGAN) is a generative adversarial network (GAN) architecture for image-to-image translation tasks. This approach has been widely used in various domains, including but not limited to image style transfer, domain adaptation, and super-resolution. The key idea of CycleGAN is to train two generator-discriminator pairs. One generator translates an image from the source domain to the target domain, while the other translates an image from the target domain back to the source domain. The discriminator in each pair is trained to distinguish translated images from real images in the corresponding domain. The cycle consistency loss is introduced to force the translated image to be transformable back into the original image.
Figure 1 illustrates how CycleGAN works in one direction. CycleGAN was originally focused on style transfer in computer vision. For example, Zhu et al. [24] proposed CycleGAN for unpaired image-to-image translation, where there is no one-to-one mapping between the source and target domains. The method has been widely used in tasks such as colorization, super-resolution, and style transfer. Based on CycleGAN, different models have been proposed for face transfer [25], Chinese handwritten character generation [26], image generation from text [27], image correction [28], and tasks in the audio field [29][30][31].
One of the highlights of CycleGAN is the implementation of two consistency losses in addition to the original GAN loss: the identity mapping loss and the cycle consistency loss. The identity mapping loss requires that source data remain unchanged by the transformation if they are already in the target domain. The cycle consistency loss follows the idea of back translation: the result of back translation should match the original source. These two loss functions enable the CycleGAN model to maintain strong consistency during the transfer procedure, making it possible to handle unpaired images and achieve outstanding results.
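The interplay of the two losses can be sketched with toy token-level functions; all names below are hypothetical illustrations, not part of CycleGAN itself, and a simple mismatch count stands in for the real sequence loss.

```python
# Toy illustration of CycleGAN's two consistency losses.

def token_mismatch_loss(reference, generated):
    """Stand-in sequence loss: fraction of differing positions."""
    length = max(len(reference), len(generated))
    if length == 0:
        return 0.0
    mismatches = sum(r != g for r, g in zip(reference, generated))
    mismatches += abs(len(reference) - len(generated))
    return mismatches / length

def identity_mapping_loss(to_target, target_sample):
    # A sample already in the target domain should pass through unchanged.
    return token_mismatch_loss(target_sample, to_target(target_sample))

def cycle_consistency_loss(forward, backward, source_sample):
    # Translating to the target domain and back should recover the input.
    return token_mismatch_loss(source_sample, backward(forward(source_sample)))

# Hypothetical generators: uppercase tokens play the role of the target style.
to_upper = lambda tokens: [t.upper() for t in tokens]
to_lower = lambda tokens: [t.lower() for t in tokens]

print(identity_mapping_loss(to_upper, ["A", "B"]))        # 0.0: already in the target style
print(cycle_consistency_loss(to_upper, to_lower, ["a"]))  # 0.0: the cycle restores the input
```

A perfectly consistent generator pair incurs zero loss under both criteria; any deviation from identity or from a faithful round trip is penalized.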

Text-to-Text Transfer Transformer
The Text-to-Text Transfer Transformer (T5) [21] is a state-of-the-art pre-trained language model based on the transformer architecture. It adopts a unified text-to-text framework that can handle any natural language processing (NLP) task by converting both the input and output into natural language texts. T5 can be easily scaled up by varying the number of parameters (from 60M to 11B), which enables it to achieve superior performance on various NLP benchmarks. Moreover, T5 employs a full-attention mechanism that allows it to capture long-range dependencies and complex semantic relations in natural language texts. T5 has been successfully applied to many NLP tasks, such as machine translation, text summarization, question answering, and sentiment analysis [21].
The T5 model follows the typical encoder-decoder structure, and its architecture is shown in Figure 2. A key feature of T5's text-to-text framework is the use of different prefixes to indicate different tasks, turning all NLP problems into text generation problems. For example, to perform sentiment analysis on a given sentence, T5 simply adds the prefix "sentiment:" before the sentence and generates either "positive" or "negative" as the output. This feature makes it possible to train a single model that can perform multiple tasks without changing its architecture or objective function.
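The prefix mechanism can be illustrated with a toy dispatcher; the lookup table below is only a stand-in for a single set of shared model weights, and the task behaviors are hypothetical.

```python
# Toy dispatcher illustrating the text-to-text framing: every task is
# "string in, string out", and the prefix alone selects the behavior.

def text_to_text(model, input_text):
    prefix, _, body = input_text.partition(": ")
    return model[prefix](body)

toy_model = {
    "sentiment": lambda s: "positive" if "good" in s else "negative",
    "summarize": lambda s: s.split(". ")[0] + ".",
}

print(text_to_text(toy_model, "sentiment: the movie was good"))          # positive
print(text_to_text(toy_model, "summarize: First point. Second point."))  # First point.
```

In the real T5 model, of course, a single transformer handles all prefixes with one set of weights; no per-task branching exists.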

Overall
This section presents the foundation of our semi-supervised method for automatic text summarization. Unlike existing models, which rely heavily on paired text for supervised training, our approach leverages a small paired dataset followed by a semi-supervised training process with unpaired corpora. The algorithm used in our method is illustrated in Algorithm 1, where L denotes the loss incurred by comparing two texts.
Our approach is inspired by the CycleGAN architecture, which uses two generators to facilitate style transfer in the two respective directions. The first part of our method comprises a warm-up step that employs real text pairs to clarify the tasks of the style transfer models T_a2s and T_s2a and to produce basic outputs. The subscripts a2s and s2a, which stand for "article-to-summary" and vice versa, indicate the transfer direction. The second part adopts a training procedure similar to that of CycleGAN, with consistency loss functions to further train the models without supervision.
Specifically, the identity mapping loss ensures that a text is not summarized if it is already a summary and vice versa. The corresponding training procedure calls the model to re-generate an identity of the input text. The loss is then calculated by measuring the difference between the original text and the generated identity. This part is designed to train the model to identify the characteristics of the two distinct text domains. In the following sections of the paper, a superscript idt indicates re-generated identity texts.
In contrast, the cycle consistency loss trains the model to reconstruct a summary after expanding it, or vice versa. The corresponding training procedure follows a cyclical process: for a real summary s, the model T_s2a first expands it into a fake article. The term "fake" indicates that the text is generated by our model rather than drawn from the datasets. Next, the fake article is sent to T_a2s to re-generate its summary. For real articles, the same cycle is applied. This part is designed to train the model to transfer texts between the two domains. In the following, a superscript fake indicates texts generated by the models, and a superscript cyc indicates the final outputs of such a cycle.
Algorithm 1 Semi-supervised automatic text summarization.
1: for each batch ∈ gold_batches do
2:   (article, summary) ← batch
3:   Warm up T_a2s and T_s2a on (article, summary) with the supervised loss L
4: end for
5: for all (a_i, s_i) such that a_i ∈ Articles and s_i ∈ Summaries do
6:   (a_i^idt, s_i^idt) ← (T_s2a(a_i), T_a2s(s_i))    ▷ Re-expand and re-summarize
7:   (L_a^idt, L_s^idt) ← (L(a_i, a_i^idt), L(s_i, s_i^idt))    ▷ Identity mapping loss
8:   (s_i^fake, a_i^fake) ← (T_a2s(a_i), T_s2a(s_i))    ▷ Generate fake summary and article
9:   (a_i^cyc, s_i^cyc) ← (T_s2a(s_i^fake), T_a2s(a_i^fake))    ▷ Restore article and summary
10:  (L_a^cyc, L_s^cyc) ← (L(a_i, a_i^cyc), L(s_i, s_i^cyc))    ▷ Cycle consistency loss
11:  Loss ← λ_idt(L_a^idt + L_s^idt) + λ_cyc(L_a^cyc + L_s^cyc)    ▷ Total loss
12:  Back-propagation of Loss
13: end for
As observed, despite integrating the CycleGAN loss functions, we refrain from constructing a full GAN architecture for our task. This decision arises from two factors: firstly, the difficulty of back-propagating through the discrete sampling involved in text generation; secondly, the lack of discernible improvement over our method during development, together with the inherent instability of adversarial training.
The back-propagation of gradients for text generation in a GAN framework is a difficult problem, primarily due to the discrete nature of text data. Consequently, GAN models for text generation often resort to reinforcement learning or the Gumbel-softmax approximation. These techniques are complicated and may render the training process unstable, leading to sub-optimal summaries. Moreover, we found no clear evidence of improved performance from GAN-based models in our task compared with our semi-supervised method with CycleGAN loss functions. Therefore, we conclude that our approach presents a promising solution for automatic text summarization and is better suited to our task given its simplicity and effectiveness.

Style Transfer Model
As mentioned previously, we view the summarization task as a style transfer problem. To accomplish this, we employ a T5 model, which offers several advantages over alternative models. Firstly, the native tasks of the T5 model align well with the requirements of the style transfer task. Secondly, by modifying the prefix of the input text, a T5 model can perform tasks in both directions, i.e., from text to summary and vice versa.
As illustrated in Figure 3, a single T5 model can perform the tasks of T_a2s and T_s2a outlined in Algorithm 1 by changing the prefix of the input text. Therefore, we require only one generator for both directions, unlike in the original CycleGAN architecture. The versatility of the T5 model in undertaking various natural language processing tasks has been well documented in recent research. The model's pre-training process enables it to perform a wide range of tasks, including question answering, text classification, and text generation. By leveraging the strengths of the T5 model, our approach provides an effective solution to the problem of automatic text summarization.

Training with the T5 Model
Our training procedure consists of two parts: a supervised part and an unsupervised part. In the supervised part, we use a small amount of labeled data for warm-up, following the same procedure as the original T5 model. We fine-tune the T5 model with pairs of articles and summaries, using different prefixes to indicate the generation direction. The loss function for the supervised part is cross-entropy, the same loss used in the original T5 model.
In the unsupervised part, we adopt a training procedure inspired by the CycleGAN architecture, incorporating the identity mapping loss and the cycle consistency loss. The identity mapping loss deters the model from re-summarizing a summary or expanding a full article by minimizing the difference between the input and output texts. Meanwhile, the cycle consistency loss ensures that the model preserves the source text after a cyclical transfer by minimizing the difference between the input and reconstructed texts. Figure 4 illustrates these two processes. We use a single T5 model for both generation tasks, distinguished by different prefixes. Given an article a and its summary s, the T5 model generates a fake summary s^fake from a and a fake article a^fake from s. To indicate the desired task, we prepend a prefix string to the input text. The generation process can be formulated as follows:
s^fake = T_s(a),  a^fake = T_e(s)
where T_s(·) and T_e(·) denote the T5 model with the summary prefix and the expansion prefix, respectively.
The training process follows a typical supervised paradigm: a cross-entropy loss [32] measures the difference between two texts, and the model is trained via back-propagation:
L(t, t̂) = −∑_{i=1}^{C} p_i(t) log p_i(t̂)
where C is the vocabulary size and p_i(·) is the probability of the i-th word in the vocabulary.
For the rest of the dataset, where articles and summaries are unpaired, we calculate the two consistency losses. The identity mapping loss is computed by re-summarizing a summary or re-expanding an article:
a^idt = T_e(a),  s^idt = T_s(s)
L_idt = L(a, a^idt) + L(s, s^idt)
For the cycle consistency loss, the model first generates s^fake and a^fake as stated before; it then regenerates a^cyc and s^cyc from s^fake and a^fake:
a^cyc = T_e(s^fake),  s^cyc = T_s(a^fake)
L_cyc = L(a, a^cyc) + L(s, s^cyc)
The training algorithm is thus adapted as in Algorithm 2 (T denotes the T5 model, ⊕ the concatenation of texts). We use P_s and P_e to denote prefix_summarize and prefix_expand, respectively.
Algorithm 2 Semi-supervised text summarization with a single T5 model.
1: Set prefix_summarize and prefix_expand as P_s and P_e
2: for each batch ∈ gold_batches do
3:   (article, summary) ← batch
4:   (s^fake, a^fake) ← (T(P_s ⊕ article), T(P_e ⊕ summary))    ▷ Supervised warm-up
5:   Loss ← L(summary, s^fake) + L(article, a^fake)
6:   Back-propagation of Loss
7: end for
8: for all (a_i, s_i) such that a_i ∈ Articles and s_i ∈ Summaries do
9:   (a_i^idt, s_i^idt) ← (T(P_e ⊕ a_i), T(P_s ⊕ s_i))    ▷ Re-expand and re-summarize
10:  (L_a^idt, L_s^idt) ← (L(a_i, a_i^idt), L(s_i, s_i^idt))    ▷ Identity mapping loss
11:  (s_i^fake, a_i^fake) ← (T(P_s ⊕ a_i), T(P_e ⊕ s_i))    ▷ Generate fake summary and article
12:  (a_i^cyc, s_i^cyc) ← (T(P_e ⊕ s_i^fake), T(P_s ⊕ a_i^fake))    ▷ Restore article and summary
13:  (L_a^cyc, L_s^cyc) ← (L(a_i, a_i^cyc), L(s_i, s_i^cyc))    ▷ Cycle consistency loss
14:  Loss ← λ_idt(L_a^idt + L_s^idt) + λ_cyc(L_a^cyc + L_s^cyc)    ▷ Total loss
15:  Back-propagation of Loss
16: end for
Here, the hyperparameters λ_idt and λ_cyc control the weights of the two types of losses.
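The unsupervised update can be sketched as a pure function. Here `generate` and `seq_loss` are hypothetical stand-ins for the shared T5 model and the cross-entropy comparison; the prefix strings are assumed, and the lambda weights follow the values used later in the paper.

```python
# Sketch of the unsupervised step (identity mapping + cycle consistency).

P_S, P_E = "summarize: ", "expand: "  # assumed prefix strings

def consistency_losses(generate, seq_loss, article, summary,
                       lambda_idt=0.1, lambda_cyc=0.2):
    # Identity mapping: re-expanding an article / re-summarizing a summary
    # should leave each text unchanged.
    a_idt, s_idt = generate(P_E, article), generate(P_S, summary)
    loss_idt = seq_loss(article, a_idt) + seq_loss(summary, s_idt)

    # Cycle: transfer each text to the other domain, then transfer it back.
    s_fake, a_fake = generate(P_S, article), generate(P_E, summary)
    a_cyc, s_cyc = generate(P_E, s_fake), generate(P_S, a_fake)
    loss_cyc = seq_loss(article, a_cyc) + seq_loss(summary, s_cyc)

    return lambda_idt * loss_idt + lambda_cyc * loss_cyc

# A perfectly consistent toy generator (lowercase = article domain,
# uppercase = summary domain) incurs zero total loss.
toy_generate = lambda prefix, text: text.upper() if prefix == P_S else text.lower()
toy_loss = lambda a, b: 0.0 if a == b else 1.0
print(consistency_losses(toy_generate, toy_loss, "an article", "A SUMMARY"))  # 0.0
```

Note that neither loss ever compares a generated summary against a gold summary, which is why this step needs no paired data.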

Experiments
This section presents the experimental details for evaluating the performance of our method.
CSL [33] is the first scientific document dataset in Chinese; it consists of the meta-information of 396,209 papers obtained from the National Engineering Research Center for Science and Technology Resources Sharing Service (NSTR) and spans 2010 to 2020. In our experiments, we used the paper titles and abstracts to form summary-article pairs for training and evaluation. To facilitate evaluation and comparison, we used the subset of CSL included in the Chinese Language Generation Evaluation (CLGE) [35] benchmark. This subset comprises 3500 computer science papers.
LCSTS [34] is a large dataset of 2,108,915 Chinese news articles published on Weibo, the most popular Chinese microblogging website. The data include news titles and contents posted by verified media accounts. As with CSL, we used the news titles and contents to create summary-article pairs for our experiments.
Examples from these datasets can be viewed in Figures A1 and A2.
For the unsupervised training part, our model did not have access to the matched summary-article pairs. Instead, we intentionally broke the pairs and randomly shuffled the data, ensuring that the model did not receive matched data during this part of the training.
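The pair-breaking step amounts to shuffling the two sides of the corpus independently; the function name and seed below are illustrative, not from the paper's code.

```python
# Sketch of the pair-breaking step: articles and summaries are shuffled
# independently, so the alignment information is discarded before the
# unsupervised stage.
import random

def break_pairs(pairs, seed=0):
    articles = [a for a, _ in pairs]
    summaries = [s for _, s in pairs]
    rng = random.Random(seed)
    rng.shuffle(articles)
    rng.shuffle(summaries)
    return articles, summaries

pairs = [("article-%d" % i, "summary-%d" % i) for i in range(5)]
articles, summaries = break_pairs(pairs)
# The same texts survive, but the i-th article no longer matches the i-th summary.
```

Every text is still seen during training; only the pairing is destroyed.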

Implementation Details
The original datasets contained well-paired texts. We used only a fraction of the paired data during the warm-up stage. The unsupervised part used text samples of the corresponding dataset without pair information.
Since the original T5 model does not support Chinese, we chose Mengzi [36], a high-performing, lightweight (103M parameters) pre-trained language model for Chinese, for our experiments (Mengzi comprises a family of pre-trained models, of which we used the T5-based one).
We used the AdamW optimizer with the learning rate, β1, β2, ε, and weight decay set to 5 × 10−5, 0.9, 0.999, 1 × 10−6, and 0.01, respectively. Moreover, we used a cosine decay schedule for the learning rate. We restricted sentences in each batch to a maximum of 512 tokens and set the batch size to 8. The two consistency losses were weighted with factors of 0.1 for the identity mapping loss and 0.2 for the cycle consistency loss. The higher weight for the cycle consistency loss reflects its direct contribution to the model's ability to transfer texts, which was the primary objective of the task. In contrast, the identity mapping loss helps preserve the characteristics of the input texts but does not directly contribute to the summarization process. All experiments were conducted with Python 3.7.12, PaddlePaddle 2.3, and PyTorch 1.11 on an NVIDIA Tesla V100 32GB GPU. For clarity, the hyperparameter settings used in our experiments are presented in Table 1.
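The cosine decay schedule can be written as a pure function of the training step. This is a minimal sketch; the exact warm-up and floor behavior of the framework's built-in scheduler may differ, and `min_lr` is an assumed parameter.

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=5e-5, min_lr=0.0):
    """Cosine decay from base_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_decay_lr(0, 1000))     # 5e-05 at the start
print(cosine_decay_lr(1000, 1000))  # 0.0 at the end
```

The rate falls slowly at first, fastest around the midpoint, and flattens out near the end of training.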

Results
In this section, we present the results of our proposed approach for automatic text summarization and compare its performance with baselines on four commonly used evaluation metrics: the ROUGE-1, ROUGE-2, ROUGE-L [37], and BLEU [38] scores. ROUGE is the acronym for Recall-Oriented Understudy for Gisting Evaluation, and BLEU is the acronym for BiLingual Evaluation Understudy.
The evaluation metrics play a critical role in assessing the effectiveness of a summarization model. The ROUGE and BLEU scores are widely used to evaluate the quality of generated summaries. ROUGE measures the overlap between the generated summary and the reference summary at the n-gram level, whereas BLEU assesses the quality of the summary by computing the n-gram precision between the generated summary and the reference summary. By comparing the performance of our proposed model with the baselines on these four metrics, we can determine the effectiveness of our approach in automatic text summarization. To provide clarity, we present the formal definitions of these metrics as follows:
ROUGE-N = (∑_{S ∈ References} ∑_{gram_n ∈ S} Count_match(gram_n)) / (∑_{S ∈ References} ∑_{gram_n ∈ S} Count(gram_n))
where n stands for the length of the n-gram, gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries. This quantity is a recall; by switching the reference and the summary, we obtain the precision. The final ROUGE-N score is the F1 score of the two. We used ROUGE-1 and ROUGE-2 in our experiments. ROUGE-L is based on the longest common subsequence (LCS); it is calculated in the same way as ROUGE-N but with the n-gram match replaced by the LCS.
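A minimal, single-reference sketch of the ROUGE-N computation (clipped n-gram overlap, reported as F1 per the definition above; real evaluations aggregate over multiple references):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Toy ROUGE-N F1 between one candidate and one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped co-occurrence count
    if not cand or not ref:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(rouge_n("the cat sat".split(), "the cat ran".split(), n=1))  # 2/3
```

Two of three unigrams match, so precision and recall are both 2/3 and the F1 is 2/3.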
BLEU is defined as
BLEU = BP · exp(∑_{n=1}^{N} w_n log p_n)
where p_n is the proportion of correctly predicted n-grams among all predicted n-grams. Typically, N = 4 kinds of grams are used with uniform weights w_n = 1/N. BP is the brevity penalty, which penalizes predictions that are too short:
BP = 1 if c > r;  BP = exp(1 − r/c) if c ≤ r
where c is the predicted length and r is the target length.
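A single-reference sketch of sentence-level BLEU with uniform weights w_n = 1/N and the brevity penalty defined above (no smoothing; production scorers such as sacreBLEU handle smoothing and corpus-level aggregation):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU against one reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        precisions.append(sum((cand_counts & ref_counts).values()) / total)  # clipped p_n
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; real implementations apply smoothing
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) / max_n for p in precisions))

cand = "the cat sat on the mat".split()
print(bleu(cand, cand))  # 1.0 for an exact match
```

An exact match gives all p_n = 1 and BP = 1, hence a score of 1.0; any missing n-gram order drives the geometric mean, and thus the score, toward 0.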
We conducted experiments on two Chinese datasets: CSL [33], which consists of abstracts from the scientific literature and their corresponding titles, and LCSTS [34], which consists of Chinese news articles and their corresponding human-written summaries. Due to the lack of research on semi-supervised Chinese summarization, all baselines used in this study were fully supervised models proposed by the organizers of the corresponding datasets. For the CSL dataset, we conducted the supervised part of the experiment with two fractions of the original dataset: one using 50 paired samples and the other using 250, while the remaining data were used for the unsupervised part of our method. For the LCSTS dataset, which is larger than CSL, we conducted the experiments with 200 and 1000 paired samples.
We also performed an ablation study comparing against the T5 model trained with labeled data only, without our proposed loss functions. The T5 models in Table 2 refer to the results obtained in these cases.
Table 2 illustrates the performance of the baselines and our proposed approach on the CSL dataset, while Table 3 shows the results on the LCSTS dataset. The results presented in Tables 2 and 3 demonstrate that our method achieved performance comparable to that of early supervised large models and even outperformed them on several metrics, despite using only a lightweight model and a limited amount of data. However, the performance of recent supervised models was still better than that of our semi-supervised method. For instance, on CSL, our best results achieved over 93% of the fully supervised BERT-base's performance on every metric, significantly outperforming LSTM-seq2seq and ALBERT-tiny. Regarding LCSTS, our model outperformed the best early fully supervised model, RNN-context-Char, by about 6%, and reached approximately 81% of the ROUGE-L of recent models such as mT5 and CPM2. The experimental results confirm the effectiveness of our proposed approach in automatic text summarization.
In addition to comparing our results with those of other models, it is important to highlight the comparison between our models and the original T5 models trained without unsupervised learning. This comparison sheds light on the effectiveness of incorporating unsupervised learning techniques in our approach, as evidenced by the improved summarization performance, particularly when well-paired data or "gold batches" were limited. Our semi-supervised method notably improved the performance on every metric compared with the fully supervised T5 model trained on a limited amount of labeled data. When labeled text pairs were extremely rare, the proposed method significantly improved the performance on every metric, especially the BLEU score (from 3.85 to 33.95 on CSL and from 3.99 to 10.56 on LCSTS). As the number of gold batches increased, the original T5 achieved better results, while our method still improved its performance. This demonstrates the effectiveness of our approach in leveraging the information contained in unlabeled data.
A selection of the experimental outputs is presented in Figures A1 and A2.

Conclusions
This study presents a novel semi-supervised learning method for abstractive summarization. To achieve this, we employed a T5-based model to process texts and utilized an identity mapping constraint and a cycle consistency constraint to exploit the information contained in unlabeled data. The identity mapping constraint ensures that the input and output of the model have a similar representation, whereas the cycle consistency constraint ensures that the input text can be reconstructed from the output summary. Through this approach, we aim to improve the generalization ability of the model by leveraging unlabeled data while requiring only a limited number of labeled examples.
A key contribution of this study is the successful application of CycleGAN's training process and loss functions to NLP tasks, particularly text summarization. Our method demonstrates significant advantages in addressing the problem of limited annotated data and showcases its potential for wide applicability in a multilingual context, especially when handling Chinese documents. Despite not modifying the model architecture, our approach effectively leverages the strengths of the original T5 model while incorporating the benefits of semi-supervised learning.
Our proposed method was evaluated on various datasets, and the experimental results demonstrate its effectiveness in generating high-quality summaries with a limited number of labeled examples. In addition, our method employs lightweight models, making it computationally efficient and practical for real-world applications.
Our approach can be particularly useful in scenarios where obtaining large amounts of labeled data is challenging, such as when working with rare languages or specialized domains.
It is worth noting that our proposed method can be further improved by using more advanced pre-training techniques or by fine-tuning on larger datasets. Additionally, exploring different loss functions and architectures could also lead to better performance.
In summary, our study introduces a novel semi-supervised learning approach for abstractive summarization, which leverages the information contained in unlabeled data and requires only a few labeled examples. The proposed approach offers a practical and efficient method for generating high-quality summaries, and the experimental results demonstrate its effectiveness on various datasets.

Limitations and Future Work
In this section, we discuss the limitations of our proposed T5-based abstractive summarization method and suggest directions for future work to address these limitations.
Semi-supervised training requirement: Our model cannot be trained entirely in an unsupervised manner. Instead, it requires a small amount of labeled data for a "warm-up" in a semi-supervised training setting. In our experiments, we found that the performance of the model trained in a completely unsupervised fashion was inferior to that of the semi-supervised approach. Future work could explore ways to reduce the reliance on labeled data or investigate alternative unsupervised training techniques to improve the model's performance.
Room for improvement in model performance: Although our model can match the performance of some earlier supervised training models, there is still a gap between its performance and that of more recent state-of-the-art models. Future research could focus on refining the model architecture, incorporating additional contextual information, or exploring novel training strategies to further enhance the performance of our proposed method.
Domain adaptability: The adaptability of our model to other domains remains to be tested through further experimentation. Our current results demonstrate the model's effectiveness on specific datasets, but its generalizability to different contexts and domains is still an open question. Future work could involve testing the model on a diverse range of datasets and languages, as well as developing techniques for domain adaptation to improve its applicability across various settings.

Figure 2 .
Figure 2. Architecture of the T5 model. One of the key features of T5's text-to-text framework is the use of different prefixes to indicate different tasks, thus transforming all NLP problems into text generation problems. For example, to perform sentiment analysis on a given sentence, T5 simply adds the prefix "sentiment:" before the sentence and generates either "positive" or "negative" as the output. This feature makes it possible to train a single model that can perform multiple tasks without changing its architecture or objective function.
(a) Identity mapping loss (b) Cycle consistency loss

Figure 4 .
Figure 4. CycleGAN losses of the proposed model.

Figure A1 .
Figure A1. Some experimental results on CSL with human translation.

Figure A2 .
Figure A2. Some experimental results on LCSTS with human translation.

Table 1 .
Hyperparameters used to train the model.