Article

Desentiment: A New Method to Control Sentimental Tendency During Summary Generation

School of Computer Science, University of Science and Technology of China, Hefei 230026, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Information 2025, 16(6), 453; https://doi.org/10.3390/info16060453
Submission received: 4 April 2025 / Revised: 18 May 2025 / Accepted: 21 May 2025 / Published: 28 May 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Abstractive summarization systems commonly offer no options for sentimental tendency, which limits summary personalization and flattens the understanding of text content. Recognizing the crucial role of sentimental tendency in shaping reader interest and perception, such as prompting hopeful outlooks or critical evaluations, we introduce the summaries with multiple optional sentimental tendencies (SMOST) task, which involves generating summaries with various sentiment options and particularly benefits the news domain. Due to the scarcity of labeled data for sentiment-supervised summarization, we utilize sentiment sentences from the original texts as positive samples during training, augmented with a prompt learning method. Our method achieves better results on the CNN/DailyMail and XSum datasets in terms of sentiment scores while having only a small influence on the semantic information of the summaries. Further analysis also shows that our method captures the different distributions of sentiment and semantic information across datasets.

1. Introduction

Sentimental tendency is crucial in news summarization and advice generation tasks, as it influences reader interest and perception [1]. Optimistic summaries may prompt hopeful outlooks, while skeptical ones lead to critical evaluations [2]. These tendencies shape readers’ attention and interpretation. We focus on generating summaries with multiple optional sentimental tendencies (SMOST), because sentimental tendency involves creative synthesis, allowing flexibility in narrative tone [3] in abstractive summarization.
The SMOST task can be viewed as the union of two sub-tasks, namely text summarization and sentiment conversion. Current summarization methods can control some specified aspects of a summary, such as aspect-based control [4], content control [5], style control [6], and granularity control [7], but lack sentiment control. Moreover, these earlier controllable summarization methods need sentiment-labeled datasets for training and evaluation [8], which, to the best of our knowledge, do not exist for text summarization. On the other hand, several sentiment-related approaches have been proposed for sentiment transformation, including a data-driven method [9], a hybrid approach [10] combining rule-based techniques with neural networks, and a reinforcement learning-based approach [11]. However, these approaches are not tailored to text summarization, since the sentiment transfer task requires at least two sequences with different sentimental tendencies, which text summarization datasets do not provide. Therefore, research on SMOST is limited, and the two sub-tasks of summarization and sentiment transfer have only been studied separately. In addition, our experiments illustrate that combining the solutions of the two sub-tasks leads to Semantic Churn during Sentiment Transfer (SCST). We therefore propose reframing SMOST as a Sentimental-Supervised Summarization Task (S3T), integrating sentimental tendencies and summarization.
For S3T, we extract sentiment-aligned sentences as candidates for summarization. In a previous work, BRIO [12] is used to introduce contrastive learning that enables the model to learn from the human-generated candidate summaries. Another work [13] uses sentence clustering and pre-trained models to find similar candidate summaries as references on text summarization tasks. However, these methods only focus on the semantic aspect and have not been applied on S3T or need a huge amount of human-generated data for augmentation. Inspired by these previous works, we augment texts with sentiment tags and integrate sentimental similarity into the loss function to address SCST. Nonetheless, this approach encounters challenges when the original text lacks sentences aligning with desired sentiments, termed Sentiment Mismatch.
To mitigate Sentiment Mismatch, we use the latent prompt method as a supplement. In a previous work, LOTUS [14] was proposed, which uses control sequences as latent prompts to control the length of summaries. In another study, MACSum [15] was used to mix the soft labels and latent prompts to control the topic of summaries. However, previous latent prompt methods focus on the whole dataset, not a single data item, making it difficult for generating each summary with specified sentimental tendencies. Inspired by the works above, we propose a label prompt method integrating sentimental validity and semantic clarity, embedding sentimental control sequences into the text.
We use sentiment labels to form prompts that augment the inputs, generate summaries from these augmented inputs, and finally modify the loss to control the sentimental tendency; we call this method the Desentiment Model (DM). The main innovations of our DM are as follows:
  • We formulate the fusion of sentiment analysis and text summarization as a task of sentimental-supervised summarization (S3T), aiming to produce summaries with specific sentimental orientations for textual content.
  • We balance the sentiment requirement and semantic requirement of S3T by weighting the sentimental loss and semantic loss.
  • To guide the model to generate summaries with the intended sentimental tendency, we define the intended sentimental tendency as the prompt for predicting and the sentimental tendency of the ground truth summary as the prompt for training.
Extensive experiments and evaluations validate our design’s effectiveness.

2. Method

The Sentimental-Supervised Summarization Task (S3T) takes the original text $I$ and the desired sentimental tendency $l_e$ as input, and outputs $S_g$, which needs to meet two requirements: (1) Seman-Req, to summarize $I$'s content, and (2) Senti-Req, to have the sentimental tendency $l_e$.
Our model for the sentimental summarization generation task is composed of three parts: a sentiment prompter, a summary calibrator, and a sentiment calibrator. The structure of our model is illustrated in Figure 1. The sentiment prompter first takes in the original text, adds a latent prompt, and outputs the modified input. Then, the summary calibrator takes the modified input and generates the summary used to calculate the semantic loss. After that, the sentiment calibrator selects the sentiment sentence set used to calculate the sentiment loss with the summary. Finally, the weighted sum of the sentiment and semantic losses guides the update of the SumModel in the summary calibrator.

2.1. Sentiment Prompter

Why does the sentiment prompter/latent prompt matter? The sentiment prompter uses a latent prompt to alleviate the SCST problem: the latent prompt lets the model understand the target of the semantic loss function when there is a lack of sentiment sentences to form the sentiment loss. What distinguishes this module from other latent prompt methods is the use of ground truth labels to help the model understand the loss objective for individual samples during the SCST state in training. The training and testing processes therefore differ in this approach, whereas previous works typically employed identical latent prompts for both training and testing, targeting the entire dataset.
The sentiment prompter concatenates a sentiment label to the original text $I$ as a prompt and outputs a modified input text $I'$, as shown in Equation (1):
$$I' = \mathrm{concat}(\text{``summarize:''},\ l,\ I),$$
where $I$ is the sentence set of the original input text, $\mathrm{concat}(\cdot)$ is string concatenation, and $l \in \{\mathrm{POSITIVE}, \mathrm{NEGATIVE}\}$ is the sentiment label annotated by BART-large-sst2, a fine-tuned variant of the BART-large model that achieves 95.3% accuracy on the SST-2 sentiment analysis task [16]. BART-large-sst2 is trained for 3 epochs with an initial learning rate of $2 \times 10^{-5}$, the AdamW optimizer, a linear learning rate scheduler, and a batch size of 32 (more details can be found in http://proceedings.mlr.press/v139/zanella-beguelin21a/zanella-beguelin21a-supp.pdf, accessed on 16 April 2024). The sentimental tendency $l$ is our desired $l_e$ when testing and the sentimental tendency of the ground truth summary when training.
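To make Equation (1) concrete, the sketch below builds the prompted input with an off-the-shelf sentiment classifier. It is a minimal sketch, not the authors' released code; the widely available DistilBERT SST-2 checkpoint stands in for BART-large-sst2, whose exact hub name we do not assume.

```python
# Minimal sketch of the sentiment prompter (Equation (1)).
# The DistilBERT SST-2 checkpoint stands in for BART-large-sst2.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def build_prompted_input(text, label=None, reference_summary=None):
    """Return concat("summarize:", l, I) as in Equation (1).

    Training: l is the sentiment label of the ground truth summary,
    so we classify reference_summary. Testing: l is the desired
    tendency l_e, passed directly as label ("POSITIVE"/"NEGATIVE").
    """
    if label is None:
        label = classifier(reference_summary, truncation=True)[0]["label"]
    return f"summarize: {label} {text}"

# Inference-time example: request a positive summary.
modified_input = build_prompted_input(
    "The quarterly report was mixed, but exports grew strongly.",
    label="POSITIVE",
)
```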

2.2. Summary Calibrator

Why does the summary calibrator/summary generator matter? The summary calibrator is a transformer-architecture model that generates the summary. It contains the parameters that are updated during training according to the loss in Section 2.4. This component is similar to those of other text summarization works.
The summary calibrator is fed the modified input $I'$ and generates a summary $S_g$ to meet the Seman-Req, as shown in Equation (2):
$$S_g = \mathrm{summarize}(I'),$$
where the output summary $S_g$ is generated by the model $\mathrm{summarize}(\cdot)$, which can be any summarization model, such as BART [17] or Pegasus [18]; we usually choose SOTA models.
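The sketch below is one plausible instantiation of $\mathrm{summarize}(\cdot)$ with a HuggingFace seq2seq model; the checkpoint `facebook/bart-large-cnn` is used purely for illustration, while the paper itself uses BART-base on CNN/DM and pegasus-x-base on XSum (Section 3.1.4).

```python
# Sketch of the summary calibrator's generation step (Equation (2)).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

def summarize(modified_input):
    """Generate S_g from the prompted input I'."""
    batch = tokenizer(modified_input, truncation=True, return_tensors="pt")
    output_ids = model.generate(**batch, num_beams=4, max_length=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```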

2.3. Sentiment Calibrator

Why does the sentiment calibrator matter? This module is inspired by data augmentation and regards the selected sentences as references for the generated summary. What distinguishes it from other data augmentation methods is that the augmentation approach, which selects sentiment sentences from the original text, not only incorporates emotional information but also retains part of the original text's key content, which helps prevent significant semantic drift. In contrast, previous data augmentation methods typically relied on manual annotation or additional generation processes, which often led to distribution inconsistency.
The sentiment calibrator selects the sentences of the original text $I$ that meet the Senti-Req and outputs a sentence set $S_e$, formulated as Equation (3):
$$S_e = \{\, s \mid \mathrm{Sim}(E_s, E_e) \ge \delta,\ s \in I \,\},$$
where $S_e$ contains all sentences with the sentimental tendency $l_e$, the cosine similarity $\mathrm{Sim}(\cdot)$ measures the similarity between the sentimental tendency of sentence $s$ and the desired sentimental tendency $l_e$, $E_s$ is the probability distribution of the sentimental tendency of sentence $s$, $E_e$ is the probability distribution of $l_e$, and $\delta$ is a threshold. All probability distributions are obtained from BART-large-sst2. In our implementation, we set $\delta$ to 0.7.
The sentiment calibrator thus selects the sentences that carry sentiment information in the original text to calculate the sentiment loss in Section 2.4.1. For some data items there may not be enough sentiment sentences, and the sentiment loss will then be smaller than expected; this is the Sentiment Mismatch problem. The difficulty is alleviated by the sentiment prompter described in Section 2.1, and we analyze its effect in Section 5.
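A minimal sketch of the selection rule in Equation (3) follows; `sentiment_distribution` is an assumed helper standing for a query to BART-large-sst2 that returns a sentence's class-probability vector.

```python
# Sketch of the sentiment sentence selection in Equation (3).
import numpy as np

def cosine(a, b):
    """Cosine similarity between two probability vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_sentiment_set(sentences, desired_dist, sentiment_distribution,
                         delta=0.7):
    """Keep sentences whose sentiment distribution E_s is within cosine
    similarity delta of the desired distribution E_e."""
    return [s for s in sentences
            if cosine(sentiment_distribution(s), desired_dist) >= delta]

# Example desired tendency: strongly POSITIVE, encoded as [p_neg, p_pos].
# (The exact encoding of E_e is an assumption.)
desired_dist = np.array([0.05, 0.95])
```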

2.4. Loss Function

Our loss function is the weighted sum of sentiment loss and semantic loss.

2.4.1. Sentiment Loss

The sentiment loss is proposed to let the model meet the Senti-Req, as shown in Equation (4):
$$L_{\mathrm{Senti}} = \mathrm{CE}(\mathrm{logits}(S_g), \mathrm{logits}(S_e)),$$
where the cross entropy $\mathrm{CE}(\cdot)$ calculates the loss between the probability distributions of the sentence set $S_e$ and the generated summary $S_g$, and $\mathrm{logits}(\cdot)$ converts a text to its probability distribution.

2.4.2. Semantic Loss

The semantic loss is proposed to make the generated summary meet the Seman-Req, as shown in Equation (5):
$$L_{\mathrm{Seman}} = \mathrm{CE}(\mathrm{logits}(S_g), \mathrm{logits}(S_a)),$$
where the cross entropy $\mathrm{CE}(\cdot)$ calculates the loss between the probability distributions of the ground truth summary $S_a$ and the generated summary $S_g$.

2.4.3. Total Loss

The total loss balances the Seman-Req and Senti-Req with a weight $\gamma$, as shown in Equation (6):
$$L_{\mathrm{tot}} = L_{\mathrm{Seman}} + \gamma L_{\mathrm{Senti}},$$
where $\gamma \ge 0$ is the weight of the sentimental loss. In our experiments, a value of $\gamma$ in the range $[0.2, 0.5]$ balances the Senti-Req and Seman-Req. Weighted losses are commonly used in multi-objective deep learning, so we use one here to balance the Senti-Req and Seman-Req.
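A minimal PyTorch sketch of Equations (4)-(6) follows, under the simplifying assumption that the sentiment target $S_e$ and the reference $S_a$ have already been tokenized and aligned to the generator's output length; a real implementation must handle the length mismatch.

```python
# Sketch of the weighted loss in Equations (4)-(6), in PyTorch.
import torch
import torch.nn.functional as F

def total_loss(gen_logits, sentiment_ids, reference_ids, gamma=0.3):
    """L_tot = L_Seman + gamma * L_Senti, with gamma in [0.2, 0.5].

    gen_logits:    (batch, seq, vocab) decoder logits for S_g
    sentiment_ids: (batch, seq) token ids of the sentiment set S_e
    reference_ids: (batch, seq) token ids of the ground truth S_a
    Padding positions are marked with -100 and ignored.
    """
    vocab = gen_logits.size(-1)
    flat = gen_logits.view(-1, vocab)
    l_senti = F.cross_entropy(flat, sentiment_ids.view(-1), ignore_index=-100)
    l_seman = F.cross_entropy(flat, reference_ids.view(-1), ignore_index=-100)
    return l_seman + gamma * l_senti
```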

3. Experiments

To verify the sentiment and semantic capabilities of our method, we compare it with SOTA models. Then, to validate the effectiveness of the sentiment prompter and sentiment calibrator, we conduct ablation studies. Furthermore, to investigate the behavior of the sentiment calibrator, we perform comparative experiments on different subsets of the datasets.

3.1. Experimental Settings

3.1.1. Datasets

Since there is currently no dataset specific to the S3T task, we use two widely used text summarization datasets in our experiments.
CNN/DailyMail (CNN/DM) [19] is a large-scale news dataset that treats news articles as source documents and the associated highlights as summaries.
XSum [20] is a highly abstractive dataset of articles from the British Broadcasting Corporation (BBC).
To add sentiment labels to these datasets, we use the open-sourced sentiment classifier BART-large-sst2 for the S3T, because it was trained on the SST-2 dataset, whose distribution is similar to those of the CNN/DM and XSum datasets.

3.1.2. Baselines

We choose related SOTA models from the sentiment transfer and text summarization fields, since there is no model specific to the S3T task.
BRIO [12] is a SOTA summarization method using a novel training paradigm that assumes a non-deterministic distribution, so that different candidate summaries are assigned probability mass according to their quality.
FGST [21] is a SOTA method for transferring the sentimental tendency of a text while preserving the original semantic content, using a cycle reinforcement learning algorithm to tackle the lack of parallel data. FGST can be chained with BRIO as a serial baseline for the SMOST task: BRIO predicts summaries and FGST transfers them to the target sentimental tendencies.

3.1.3. Evaluation Metrics

For the S3T, we employ two categories of metrics, covering the semantic and sentiment aspects.
For semantic evaluation, we use three metrics commonly used in text summarization: ROUGE-1, ROUGE-2, and ROUGE-L. The ROUGE series assesses summary quality by comparing model-generated summaries with human-generated references, analyzing overlap in n-grams, word sequences, and word pairs [22]. For every ROUGE score, the higher the score, the better the Seman-Req is met.
To assess sentiment performance in the S3T, we introduce "Senti-Scores" to address the lack of evaluation metrics for the S3T. For each sample of the test dataset, we use cosine similarity to measure the agreement between our desired sentimental tendency $l_e$ and the sentimental tendency $l_g$ of the generated summary, through Equation (7):
$$m = \mathrm{Sim}(E_e, E_g),$$
where $\mathrm{Sim}(\cdot)$ is the cosine similarity function, $E_e$ is the probability distribution of the sentimental tendency (pdst) we desire, and $E_g$ is the pdst that the generated summary actually has.
Then, to evaluate the performance on the whole test dataset, we have to aggregate the metric values of all test data samples. Averaging all metric values is commonly employed, but to better evaluate the accuracy when sentiment changes happen, we design four metrics for different cases, and each case corresponds to a subset of the test dataset.
The metric $m$'s values over all test samples in each case are averaged, as shown in Equation (8):
$$T_i = \sum_{d \in D_i} m \,/\, |D_i|, \quad i = 1, 2, 3, 4,$$
$$D_1 = \{\, d \mid l_a = l_I,\ l_I = l_e,\ d = \{S_a, I\} \in D_{\mathrm{test}} \,\},$$
$$D_2 = \{\, d \mid l_a \ne l_I,\ l_I = l_e,\ d = \{S_a, I\} \in D_{\mathrm{test}} \,\},$$
$$D_3 = \{\, d \mid l_a = l_I,\ l_I \ne l_e,\ d = \{S_a, I\} \in D_{\mathrm{test}} \,\},$$
$$D_4 = \{\, d \mid l_a \ne l_I,\ l_I \ne l_e,\ d = \{S_a, I\} \in D_{\mathrm{test}} \,\},$$
where $T_i$ is the sentiment score of case $i$, $l_a$ is the sentiment label of the ground truth summary $S_a$, and $l_I$ is that of the original text $I$.
$T_1$ and $T_2$ represent the scenarios where the ground truth summary matches the target sentiment: in $T_1$, the sentimental tendencies of both the ground truth summary and the original text align with the desired sentiment; in $T_2$, the ground truth summary matches the target sentiment, but the original text's sentiment diverges from this desired tendency.
$T_3$ and $T_4$ describe different situations. In $T_3$, the original text's sentiment aligns with the desired sentiment, but the ground truth summary's sentiment does not. In $T_4$, neither the original text's nor the ground truth summary's sentiment matches the desired sentiment.
When the summary sentiment $E_s$ and the desired sentiment $E_e$ align, the case reduces to conventional text summarization, since the ground truth summary satisfies the sentiment requirement and can be used for training directly. When they diverge, the evaluation metric better reflects the model's sentiment performance. Thus, $T_3$ and $T_4$ are the main focus of our analysis, as they thoroughly scrutinize the model's sentiment performance: using the ground truth summary alone cannot guide the model to satisfy the sentiment requirement.
For every $T_i$, the higher the score, the better the Senti-Req is met.
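The sketch below shows one way to aggregate the per-item metric $m$ of Equation (7) into the four scores of Equation (8); the item fields (`l_a`, `l_I`, `m`) are hypothetical names for the quantities defined above.

```python
# Sketch of the Senti-Scores T_1..T_4 in Equation (8).
from collections import defaultdict

def senti_scores(items, l_e):
    """items: dicts with labels 'l_a', 'l_I' and per-item metric 'm'."""
    sums, counts = defaultdict(float), defaultdict(int)
    for d in items:
        key = (d["l_a"] == d["l_I"], d["l_I"] == l_e)
        case = {(True, True): 1, (False, True): 2,
                (True, False): 3, (False, False): 4}[key]
        sums[case] += d["m"]
        counts[case] += 1
    return {i: sums[i] / counts[i] for i in sorted(counts)}
```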

3.1.4. Parameters

On the CNN/DM dataset, we use BART-base as the backbone model; the batch size is set to 6, the learning rate is set to $2 \times 10^{-3}$, and training is performed for 15 epochs. BART-base was chosen because it performed best on the CNN/DM dataset in past text summarization research.
On the XSum dataset, we use pegasus-x-base as the backbone model; the batch size is set to 10, the learning rate is set to $2 \times 10^{-3}$, and training is performed for 10 epochs. Pegasus was chosen because it performed best on XSum in previous studies.
All results reported in the tables are averages of three runs with the same parameters and different random seeds. All training was performed on an NVIDIA GeForce RTX 2080 Ti.

3.2. Comparison with SOTA Models

We compare our method with SOTA models in the aspect of Senti-Req and Seman-Req, respectively.

3.2.1. Senti-Req Comparison

To confirm that the generated summaries exhibit the intended sentiment characteristics, we calculate the $T_{1/2/3/4}$ metric values on the summaries generated by the compared methods. The results are presented in Table 1. For both the CNN/DM and XSum datasets, our observations are as follows:
In the experiments on the CNN/DM dataset, our method improved the $T_{1/2/3/4}$ scores by approximately 2 points over the BART and BRIO approaches, indicating that our approach better satisfies the Senti-Req. This improvement is likely due to our approach utilizing both the prompter and calibrator modules. Additionally, the BRIO+FGST method outperformed our approach by more than 20 points on the $T_{1/2/3/4}$ scores. This may be because that method strongly alters the sentimental tone of the generated summary without adequately preserving the semantic information of the original text, as explained in Section 3.2.2 (Seman-Req Comparison).
In the experiments on the XSum dataset, our method improved the $T_{2/3/4}$ scores by approximately 2 points over the BART and BRIO approaches, indicating that our approach better satisfies the Senti-Req. This improvement is likely due to the use of both the prompter and calibrator modules. The decrease in the $T_1$ score might be attributed to the fact that the summaries in the XSum dataset are typically short, one-sentence texts, which differ significantly from the distribution of the candidates selected from the original text for loss computation. This discrepancy did not occur on the CNN/DM dataset because the ground truth summaries in CNN/DM are longer, making their distribution closer to that of the candidates. Additionally, the BRIO+FGST method outperformed our approach by more than 20 points on the $T_{1/2/3/4}$ scores. This may be because that method introduces a strong sentimental transformation in the generated summaries without adequately preserving the semantic information of the original text, as explained in Section 3.2.2 (Seman-Req Comparison).
In summary, the results validate the effectiveness of our proposed method in influencing the sentimental tendency of generated summaries.

3.2.2. Seman-Req Comparison

To evaluate whether the Seman-Req is satisfied when the Senti-Req has been satisfied, we test our method on CNN/DM and XSum and make the following observations:
On the CNN/DM dataset, our method achieved ROUGE-1/2/L scores comparable to the earlier BART approach but saw a decrease of no more than 3 points compared to BRIO. This suggests that our method introduced some semantic loss, likely due to the combination of the prompter and calibrator, which caused the model to focus more on summarizing facts with the intended sentimental tendency, leading to greater divergence from the ground truth summary. On the other hand, our model has an advantage of more than 20 points over the serial model BRIO+FGST, indicating that the latter approach has significant issues in preserving the original text's semantics, which is the SCST we mentioned in the introduction. Our analysis of the experimental results revealed that BRIO+FGST generates sentences with strong emotions but almost no semantic connection to the original text, which seriously violates the Seman-Req of the S3T problem. Therefore, we did not consider the sentiment scores from BRIO+FGST as valid for comparison.
On the XSum dataset, our method achieved ROUGE-1/2/L scores comparable to the earlier BART approach but saw a decrease of about 1 point compared to BRIO. This suggests that our method introduced some semantic loss, likely due to the combination of the prompter and calibrator, which caused the model to focus more on summarizing facts with the intended sentimental tendency, leading to greater divergence from the ground truth summary. On the other hand, our model has an advantage of more than 20 points over the serial model BRIO+FGST, indicating that the latter approach has significant issues in preserving the original text's semantics. Upon analysis, we believe the specific reason for this phenomenon is the same as in the corresponding case on the CNN/DM dataset and do not elaborate further.
Overall, the results indicate that our sentiment calibration and prompt methods do not significantly impair semantic information generation during text summarization.

4. Ablation Study

To evaluate the effectiveness of sentiment prompter and sentiment calibrator modules, respectively, we perform ablation studies on the CNN/DM and XSum datasets.

4.1. Effectiveness of Sentiment Prompter

We compare the results of our proposed Desentiment and Desentiment-p, which removes the sentiment prompter module from Desentiment.
The results are shown in Table 2. For simplicity, we only consider the main evaluation metrics $T_{3/4}$ for the Senti-Req and R-1 for the Seman-Req. We make the following observations:

4.1.1. Senti-Req Comparison

On both the CNN/DM and XSum datasets, using the sentiment prompter led to approximately a one-point improvement in the $T_4$ sentiment score, while the $T_3$ score remained roughly unchanged. This demonstrates the independent effectiveness of the sentiment prompter in controlling sentimental tendency. The reason the $T_3$ score changes less than $T_4$ is that the sentimental tendency of the original texts in the $T_3$ case is inconsistent with the intended sentimental tendency, so the selected candidates are less sentimentally intense than those for $T_4$. We explore the impact of the quality and quantity of the candidates selected from the original text in Section 5 (Analysis of Sentiment Mismatch).

4.1.2. Seman-Req Comparison

On the CNN/DM dataset, using the sentiment prompter results in approximately a one-point decrease in the R-1 score. This is because the added prompt introduces information unrelated to the original text, thereby altering the data distribution. On the XSum dataset, using the sentiment prompter has little to no effect on the R-1 score. This is likely because the ground truth summaries in the XSum dataset are more concise, focusing only on the key content of the original text, so the addition of a prompt has minimal impact on the summary.
In conclusion, the sentiment prompter is effective while preserving semantic information.

4.2. Effectiveness of Sentiment Calibrator

We compare Desentiment and Desentiment-c, which removes the sentiment calibrator from Desentiment. The experimental results are shown in Table 3.
From Table 3, we can conclude the following:
Table 3. The R-1 and T-3, T-4 values with and without the sentiment calibrator. Boldface: better than the baseline. R-1/2/L are ROUGE-1/2/L F1 scores. T-1/2/3/4 are sentiment scores $T_{1/2/3/4}$.

| Model | R-1 | T-3 | T-4 |
|---|---|---|---|
| CNN/DM | | | |
| Desentiment | 44.42 | −28.74 | 19.33 |
| Desentiment-c | 45.94 | −29.39 | 18.04 |
| XSum | | | |
| Desentiment | 47.10 | −29.51 | 3.87 |
| Desentiment-c | 48.22 | −29.80 | 1.79 |

4.2.1. Senti-Req Comparison

On both the CNN/DM and XSum datasets, using the sentiment calibrator led to approximately a one-point improvement in the $T_4$ sentiment score, while the $T_3$ score remained roughly unchanged. This demonstrates the independent effectiveness of the sentiment calibrator in controlling sentimental tendency.

4.2.2. Seman-Req Comparison

On both the CNN/DM and XSum datasets, using the sentiment calibrator results in approximately a one-point decrease in the R-1 score. This is because the modified loss function causes the semantics of the generated summary to deviate from those of the ground truth summary.
Overall, the sentiment calibrator module is effective on its own and achieves better results when combined with sentimental tokens as prompts. Nevertheless, it noticeably impacts ROUGE scores, necessitating a further balance via the sentimental weight in the loss function (see Section 6 for a detailed analysis).

5. Analysis of Sentiment Mismatch

Sentiment Mismatch means that few sentences of the original text can be put into the sentiment sentence set $S_e$ of Equation (3), or that the sentences in $S_e$ cannot express the main information of the original text. To investigate the impact of Sentiment Mismatch on the sentiment calibrator more deeply, we test our method on sub-datasets with different proportions of $S_e$ in the original text and different semantic matching levels of the sentences in $S_e$. We define the following variables, as shown in Equation (9):
$$S = \{\, s \mid R_1(s, S_a) > \delta,\ s \in S_e \,\}, \quad \mathrm{SentiP} = |S_e| / |I|, \quad \mathrm{Overlap} = |S| / |S_e|,$$
where $S_e$, $I$, and $S_a$ are defined in the Method section, $S$ contains all sentences whose ROUGE-1 scores are higher than the threshold $\delta$ (0.7 in our experiments), and $R_1(\cdot)$ calculates the ROUGE-1 score.
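Per document, the two statistics can be computed as sketched below; `rouge1` is an assumed helper returning a ROUGE-1 F1 score in [0, 1].

```python
# Sketch of the SentiP and Overlap statistics in Equation (9).
def mismatch_stats(sentences, sentiment_set, reference_summary, rouge1,
                   delta=0.7):
    """sentences: all sentences of I; sentiment_set: S_e from Equation (3)."""
    s_overlap = [s for s in sentiment_set
                 if rouge1(s, reference_summary) > delta]
    senti_p = len(sentiment_set) / len(sentences)
    overlap = len(s_overlap) / len(sentiment_set) if sentiment_set else 0.0
    return senti_p, overlap
```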

5.1. Sentiment Experiments

To explore how SentiP and Overlap affect the sentiment scores, we perform the following experiments. The results are shown in Table 4 and Table 5.
In the first row of both tables, all values are zero due to the absence of instances in the fourth data category. In the second row, with SentiP below 0.5, the $T_4$ scores decline slightly below 0, because the sentimental tendency is far from the intended one. In the third and fourth rows, the $T_4$ values remain above 0, showing a positive correlation between $T_4$ scores and SentiP. The $T_4$ values for CNN/DM exceed those for XSum in these rows, likely due to differing sentimental tendencies across the datasets, suggesting lower sentimental coherence between source texts and summaries in XSum. Also, there is a consistent decrease from left to right within each row, indicating that as Overlap improves, the sentimental tendencies diminish slightly.

5.2. Semantic Experiments

To explore how SentiP affects the ROUGE scores of the final results, we perform experiments on the CNN/DM and XSum datasets. The results are shown in Table 6 and Table 7, from which we make the following observations. The R-1 values in Table 6 and Table 7 show minimal fluctuation with SentiP, because we combine the sentiment and semantic losses during summarization, which stabilizes the R-1 scores despite variations in SentiP.
Table 6 and Table 7 also reveal an incremental increase in the R-1 values from the left to the right columns, although it is not strictly linear. This is because Overlap can strongly affect our method, since we use a weighted loss.
From the above analysis, we conclude that the degrees to which the Senti-Req and Seman-Req are satisfied have a complex correlation with the values of Overlap and SentiP, which is further discussed in Section 6. These observations highlight the potential of integrating sentiment and semantic metrics in text summarization. Understanding this relationship enhances our grasp of effective summarization techniques and informs the development of more nuanced algorithms.

6. Study of Correlation Between Sentiment and Semantics

In the SMOST task performed by humans, a commonly observed phenomenon is that semantic accuracy tends to decrease as the sentiment of the summary intensifies. This phenomenon also appeared in our experiments. Therefore, to explore the relationship between sentimental tendency conversion and semantics preservation, we train with different sentimental weights $\gamma$ and analyze the correlation between $T_4$ and R-1. The results are shown in Figure 2, and we make the following observations:
As the sentiment weight $\gamma$ increases, the sentiment scores improve but the semantic scores decline, due to misalignment between the sentiment sentence set $S_e$ and the original summaries. The rise in sentiment scores causes a significant drop in semantic scores, highlighting the impact of sentiment factors on semantic structure. This trade-off shows that prioritizing sentiment enhances sentimental expressiveness but reduces semantic coherence.
In Figure 2, the range $0.2 < \gamma < 0.5$ balances semantic preservation and sentiment. Rigorous testing confirms that this range maintains semantic integrity. In some cases, surpassing this range may be necessary to enhance sentiment, with $\gamma$ kept within $0 < \gamma < 1$.
Our methodology uses the $\gamma$ parameter to control sentimental expression in text summaries. This control, however, may compromise some semantic fidelity.

7. Human Evaluation

In order to understand whether our scheme can effectively make humans perceive the sentimental changes in the summaries, we conduct human evaluation experiments.

7.1. Experiment Settings

7.1.1. Data Preparation

We test on a randomly selected 200-item subset drawn from the CNN/DM and XSum datasets; each sample has three generated summaries: a positive summary, a neutral summary (generated without our method), and a negative summary.

7.1.2. Data Collection

We invite four experienced annotators to perform the following tests, in which all summaries are provided in random order: (1) PosExp: selecting the more positive one between the positive summary and the neutral summary; (2) NegExp: selecting the more negative one between the negative summary and the neutral summary.

7.1.3. Evaluation

Null hypothesis (H0): the probability that an annotator's choice is consistent with the original annotation is 50%; that is, there is no significant difference. Alternative hypothesis (H1): the proportion of annotators' selections consistent with the original annotations is significantly higher than 50%. The p value we report is the proportion of annotator choices in the experiment that are consistent with the original sample.
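Note that the value reported here is an agreement proportion rather than a classical significance level. Under the stated H0/H1, a standard one-sided binomial test could be run as sketched below; the counts are hypothetical.

```python
# Sketch of a one-sided binomial test for the human evaluation,
# H0: agreement rate = 0.5 vs. H1: agreement rate > 0.5.
from scipy.stats import binomtest

n_items = 200   # samples shown to one annotator
n_agree = 162   # choices matching the original annotation (hypothetical)
result = binomtest(n_agree, n_items, p=0.5, alternative="greater")
print(result.pvalue)  # a small value lets us reject H0
```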

7.2. Result

As can be seen from Table 8, the p values are higher than 0.5, which shows that the changes in sentimental tendency can be detected by humans in both the positive and negative settings. Also, the p values on the CNN/DM dataset are slightly higher than those on XSum. This is because the summaries generated for XSum are shorter, making it harder to determine the sentimental tendency from the perspective of fact bias. In conclusion, the results of the human evaluation demonstrate the effectiveness of our method and evaluation metrics.

8. Comparison with LLM

Since LLMs have excellent generalization capabilities and emergent knowledge, we compare our method with LLMs to analyze performance on the S3T.

8.1. Experiment Setting

We use chatglm3-6B [23] and llama2-7B [24] as the baselines for the comparative experiments, because these two models are the best large models on which the consumer-grade 2080 Ti graphics card can perform inference.
Llama 2 is a series of pre-trained and fine-tuned large language models (LLMs) with parameter sizes ranging from 7 billion to 70 billion. Its fine-tuned variants, called Llama 2-Chat, are optimized for conversational scenarios.
ChatGLM3 is a new-generation dialogue pre-training model jointly released by Zhipu AI and Tsinghua University's KEG Laboratory, and ChatGLM3-6B is an open-source model in the ChatGLM3 series. While retaining many excellent features of the previous two generations, such as smooth dialogue and a low deployment threshold, ChatGLM3-6B introduces a more powerful base model, more complete function support, and a more comprehensive open-source series.
For llama2-7B and chatglm3-6B to perform the S3T, the prompt we use is shown in Equation (10):
$$\text{please give me a } l_e \text{ summary of the following text in one sentence } texts,$$
where $l_e$ is the sentimental tendency we want (positive or negative), and $texts$ is the original text.
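A trivial sketch of this prompt construction follows; the chat-template plumbing specific to each model is omitted.

```python
# Sketch of the zero-shot S3T prompt in Equation (10).
def build_llm_prompt(desired_label, source_text):
    tendency = "positive" if desired_label == "POSITIVE" else "negative"
    return (f"please give me a {tendency} summary of the following text "
            f"in one sentence\n{source_text}")
```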

8.2. Result and Analysis

As shown in Table 9, our method, Desentiment, outperforms chatglm3-6B and llama2-7B on the sentiment evaluation metrics $T_1$, $T_3$, and $T_4$. This contrast is primarily attributable to our deliberate use of sentimental candidates during fine-tuning, whereas chatglm3-6B and llama2-7B rely solely on zero-shot inference, which evidently falls short of comparable performance.
Looking more closely, our method encounters challenges in matching the LLMs on $T_2$. This divergence can be attributed to discrepancies between the ground truth summary's sentiment inclination and the original text's inherent sentiment tendency: selecting candidate sentiment sentences from the original text is less effective in this case than the LLMs' inherent generalization ability.
On the semantic side, Desentiment not only outperforms its counterparts in sentiment evaluation but also achieves higher ROUGE scores. This indicates that our method captures the semantic nuances embedded in the text more reliably, whereas chatglm3-6B and llama2-7B occasionally fail to comprehend the task, which manifests in comparatively lower semantic scores and underscores the value of a fine-tuned approach such as ours.
Table 9 also shows that, on the S3T, the ROUGE scores and the $T_3$, $T_4$ performance of llama2-7B are slightly better than those of chatglm3-6B. This may be because chatglm3-6B is a bilingual model, so its performance on this English task is not as good as llama2-7B's.
In summary, while LLMs boast impressive task generalization and zero-shot inference capabilities, our empirical findings underscore the tangible advantages of our model, particularly within the constraints of consumer-grade hardware such as the 2080 Ti graphics card. Our method exhibits superior performance on the S3T, reinforcing its efficacy and relevance in real-world applications.

9. Discussion and Limitation

Our Desentiment model demonstrates significant advancements in handling the Sentimental-Supervised Summarization Task (S3T), particularly excelling in sentiment scores on both the CNN/DailyMail and XSum datasets. For instance, on the CNN/DM dataset, we observed improvements of approximately two points in the $T_{1/2/3/4}$ metrics compared to the BART and BRIO approaches. This enhancement can largely be attributed to the integration of the sentiment prompter and calibrator modules, which together refine the generation of summaries with specific sentimental orientations.
However, our method also has certain limitations. A notable challenge arises from the Sentiment Mismatch issue, where original texts lack sentences aligning with the desired sentiments. Although we mitigate this problem using the latent prompt method, it still occasionally results in lower-than-expected sentiment losses. Additionally, our model showed a slight decrease in the $T_1$ score on the XSum dataset, likely due to the shorter summaries typical of XSum, causing discrepancies between their distribution and that of the candidate sentences selected from the original text.
Looking forward, several challenges may emerge when applying our model to different domains or languages. For example, industry-specific text data might require the retraining of sentiment classifiers to improve accuracy. Moreover, exploring more flexible sentiment control mechanisms, such as those incorporating reinforcement learning, could provide new avenues for addressing Semantic Churn during Sentiment Transfer (SCST) and further enhancing model performance. Additionally, future work should consider how to better balance semantic fidelity and sentiment accuracy, especially for datasets with highly condensed summaries like XSum.
In conclusion, while our Desentiment model offers promising improvements in generating summaries with controlled sentimental tendencies, there remain areas for refinement. By addressing these limitations and continuing to innovate, we aim to develop even more robust and versatile summarization tools that can cater to a wide range of applications and user preferences.

10. Conclusions

In this work, we formulate the task of generating summaries with multiple optional sentimental tendencies (SMOST) as a Sentimental-Supervised Summarization Task (S3T). Our proposed Desentiment model demonstrates significant improvements in controlling sentimental tendencies during summary generation while maintaining the semantic integrity of the generated summaries in the S3T. However, several limitations must be acknowledged. Firstly, although the sentiment prompter alleviates the Sentiment Mismatch problem, when the original text does not contain sentences that align well with the intended sentiment, Sentiment Mismatch can still affect the quality of the generated summaries. Secondly, integrating both the prompter and calibrator modules introduces some degree of semantic loss, which may lead to greater divergence from the ground truth summaries. This effect is more pronounced for shorter summaries, such as those in the XSum dataset, indicating a need for further refinement in handling highly abstractive summarization tasks.

Author Contributions

Conceptualization, H.C. and J.L.; methodology, H.C. and J.L.; software, H.C.; validation, H.C. and J.L.; formal analysis, H.C. and J.L.; writing—original draft preparation, H.C.; writing—review and editing, J.L.; visualization, H.C.; supervision, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data used in this study are publicly available from the following sources: The XSum dataset can be accessed at https://github.com/EdinburghNLP/XSum; the CNN/Daily Mail dataset is available at https://huggingface.co/datasets/abisee/cnn_dailymail. Both datasets were used for training and evaluating the models reported in this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, X.; Wu, P.; Zou, C.; Xie, H.; Wang, F.L. Sentiment lossless summarization. Knowl. Based Syst. 2021, 227, 107170. [Google Scholar] [CrossRef]
  2. Calvo, R.A.; Kim, S.M. Emotions in text: Dimensional and categorical models. Comput. Intell. 2013, 29, 527–543. [Google Scholar] [CrossRef]
  3. Li, C.; Xu, W.; Li, S.; Gao, S. Guiding generation for abstractive text summarization based on key information guide network. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 2 (Short Papers). Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 55–60. [Google Scholar] [CrossRef]
  4. Amplayo, R.K.; Angelidis, S.; Lapata, M. Aspect-controllable opinion summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Online and Punta Cana, Dominican Republic. Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 6578–6593. [Google Scholar] [CrossRef]
  5. Dou, Z.-Y.; Liu, P.; Hayashi, H.; Jiang, Z.; Neubig, G. GSum: A general framework for guided neural abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4830–4842. [Google Scholar] [CrossRef]
  6. Cao, S.; Wang, L. Inference time style control for summarization. arXiv 2021, arXiv:2104.01724. [Google Scholar] [CrossRef]
  7. Zhong, M.; Liu, Y.; Ge, S.; Mao, Y.; Jiao, Y.; Zhang, X.; Xu, Y.; Zhu, C.; Zeng, M.; Han, J. Unsupervised multi-granularity summarization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4980–4995. [Google Scholar] [CrossRef]
  8. Urlana, A.; Mishra, P.; Roy, T.; Mishra, R. Controllable Text Summarization: Unraveling Challenges, Approaches, and Prospects—A Survey. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.-W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1603–1623. [Google Scholar] [CrossRef]
  9. Wen, Z.; Cao, C.; Yang, R.; Wang, S. Decode with template: Content preserving sentiment transfer. In Proceedings of the Language Resources and Evaluation, Language Resources and Evaluation, Marseille, France, 11–16 May 2020. [Google Scholar]
  10. Xie, Y.; Xu, J.; Qiao, L.; Liu, Y.; Huang, F.; Li, C. Generative sentiment transfer via adaptive masking. arXiv 2023, arXiv:2302.12045. [Google Scholar] [CrossRef]
  11. Liu, G.; Feng, Z.; Gao, Y.; Yang, Z.; Liang, X.; Bao, J.; He, X.; Cui, S.; Li, Z.; Hu, Z. Composable text controls in latent space with ODEs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 16543–16570. [Google Scholar] [CrossRef]
  12. Liu, Y.; Liu, P.; Radev, D.; Neubig, G. Brio: Bringing order to abstractive summarization. arXiv 2022, arXiv:2203.16804. [Google Scholar]
  13. Constantin, D.; Mihăescu, M.C.; Heras, S.; Jordán, J.; Palanca, J.; Julián, V. Using Data Augmentation for Improving Text Summarization. In Proceedings of the Intelligent Data Engineering and Automated Learning–IDEAL 2024: 25th International Conference, Valencia, Spain, 20–22 November 2024; Proceedings, Part II. Springer: Berlin/Heidelberg, Germany, 2024; pp. 132–144, ISBN 978-3-031-77737-0. [Google Scholar] [CrossRef]
  14. Zhang, Y.; Zhang, X.; Wang, X.; Chen, S.; Wei, F. Latent prompt tuning for text summarization. arXiv 2022, arXiv:2211.01837. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Liu, Y.; Yang, Z.; Fang, Y.; Chen, Y.; Radev, D.; Zhu, C.; Zeng, M.; Zhang, R. Macsum: Controllable summarization with mixed attributes. arXiv 2023, arXiv:2211.05041. [Google Scholar] [CrossRef]
  16. Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv 2005, arXiv:cs/0506075. [Google Scholar] [CrossRef]
  17. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar]
  18. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P.J. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv 2020, arXiv:1912.08777. [Google Scholar] [CrossRef]
  19. Nallapati, R.; Zhou, B.; dos Santos, C.N.; Gulcehre, C.; Xiang, B. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv 2016, arXiv:1602.06023. [Google Scholar] [CrossRef]
  20. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
  21. Luo, F.; Li, P.; Yang, P.; Zhou, J.; Tan, Y.; Chang, B.; Sui, Z.; Sun, X. Towards fine-grained text sentiment transfer. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2013–2022. [Google Scholar] [CrossRef]
  22. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81. Available online: https://aclanthology.org/W04-1013/ (accessed on 20 May 2025).
  23. Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. Glm-130b: An open bilingual pre-trained model. arXiv 2023, arXiv:2210.02414. [Google Scholar] [CrossRef]
  24. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
Figure 1. Desentiment framework: the sentiment prompter generates a sentiment control sequence as part of the model input to affect the semantic information of generated summaries; the word ’POSITIVE’, referred to as the Intended Tendency, is the sentimental tendency we want and can be replaced by ’NEGATIVE’ for a negative tendency. In the summary calibrator, the SumModel is the summary generator. In the sentiment calibrator, the model input contains the source text and sentiment labels; the sentiment selector obtains sentiment sentences according to the Intended Tendency, and the resulting sentiment set is used to calculate the sentiment loss.
Figure 2. Correlation of sentiment and semantics.
Table 1. Comparison of our model and multiple baselines on three summarization metrics and four sentiment metrics. Boldface: significantly better than the baseline. *: results whose semantics collapse, having better sentiment scores but problems with semantics; examples are analyzed in Section 3.2.2. R-1/2/L are ROUGE-1/2/L F1 scores. T-1/2/3/4 are sentiment scores $T_{1/2/3/4}$.

| System | R-1 | R-2 | R-L | T-1 | T-2 | T-3 | T-4 |
|---|---|---|---|---|---|---|---|
| CNN/DM | | | | | | | |
| BART | 44.16 | 21.28 | 40.90 | 38.46 | −31.21 | −29.71 | 16.50 |
| BRIO-ctr | 47.28 | 22.93 | 44.15 | 38.84 | −29.95 | −29.58 | 17.52 |
| BRIO + FGST | 15.64 | 4.79 | 13.68 | 67.58 * | 25.65 * | 20.37 * | 34.98 * |
| Desentiment (Ours) | 44.42 | 21.13 | 41.47 | 40.41 | −28.18 | −28.74 | 19.33 |
| XSum | | | | | | | |
| PEGASUS | 47.46 | 24.69 | 39.53 | 48.51 | −15.61 | −30.02 | 1.13 |
| BRIO-ctr | 48.13 | 25.13 | 39.84 | 47.72 | −14.48 | −29.84 | 1.63 |
| BRIO + FGST | 14.28 | 5.74 | 12.97 | 69.61 * | 27.48 * | 22.76 * | 20.59 * |
| Desentiment (Ours) | 47.10 | 24.73 | 39.37 | 46.79 | −10.40 | −29.51 | 3.87 |
Table 2. The R-1 and T-3, T-4 values with and without the sentiment prompter. Boldface: better than the baseline. R-1/2/L are ROUGE-1/2/L F1 scores. T-1/2/3/4 are sentiment scores $T_{1/2/3/4}$.

| Model | R-1 | T-3 | T-4 |
|---|---|---|---|
| CNN/DM | | | |
| Desentiment | 44.42 | −28.74 | 19.33 |
| Desentiment-p | 45.03 | −28.67 | 18.33 |
| XSum | | | |
| Desentiment | 47.10 | −29.51 | 3.87 |
| Desentiment-p | 47.10 | −29.79 | 2.56 |
Table 4. The change in $T_4$ with SentiP and Overlap on the XSum dataset.

| SentiP \ Overlap | 0.25 | 0.5 | 0.75 | 1.0 |
|---|---|---|---|---|
| 0.25 | 0.00 | 0.00 | 0.00 | 0.00 |
| 0.5 | −1.77 | −1.03 | −0.61 | −4.23 |
| 0.75 | 8.68 | 6.36 | −2.49 | −4.17 |
| 1.0 | 12.67 | 8.53 | 13.48 | −0.35 |
Table 5. The change in $T_4$ with SentiP and Overlap on the CNN/DM dataset.

| SentiP \ Overlap | 0.25 | 0.5 | 0.75 | 1.0 |
|---|---|---|---|---|
| 0.25 | 0.00 | 0.00 | 0.00 | 0.00 |
| 0.5 | −0.08 | −2.03 | −0.78 | −5.11 |
| 0.75 | 13.33 | 29.75 | 21.59 | 36.13 |
| 1.0 | 21.96 | 23.37 | 20.90 | 25.60 |
Table 6. The change in R-1 with SentiP and Overlap on the XSum dataset.

| SentiP \ Overlap | 0.25 | 0.5 | 0.75 | 1.0 |
|---|---|---|---|---|
| 0.25 | 48.34 | 47.96 | 46.87 | 51.23 |
| 0.5 | 44.48 | 46.40 | 46.68 | 49.60 |
| 0.75 | 45.09 | 47.62 | 46.12 | 51.23 |
| 1.0 | 47.56 | 49.54 | 50.66 | 49.58 |
Table 7. The change in R-1 with SentiP and Overlap on the CNN/DM dataset.

| SentiP \ Overlap | 0.25 | 0.5 | 0.75 | 1.0 |
|---|---|---|---|---|
| 0.25 | 46.03 | 45.57 | 46.92 | 49.67 |
| 0.5 | 43.48 | 44.63 | 45.13 | 47.75 |
| 0.75 | 42.72 | 47.23 | 49.04 | 47.77 |
| 1.0 | 42.88 | 47.98 | 47.30 | 51.54 |
Table 8. Average p value across the two datasets. $P_{pos}$ is the average p value of all annotators in PosExp, and $P_{neg}$ is that in NegExp.

| Dataset | $P_{pos}$ | $P_{neg}$ |
|---|---|---|
| CNN/DM | 0.81 | 0.79 |
| XSum | 0.77 | 0.75 |
Table 9. Comparison of our model and multiple LLM baselines on three summarization metrics and four sentiment metrics. Boldface: significantly better than the baseline. R-1/2/L are ROUGE-1/2/L F1 scores. T-1/2/3/4 are sentiment scores $T_{1/2/3/4}$.

| System | R-1 | R-2 | R-L | T-1 | T-2 | T-3 | T-4 |
|---|---|---|---|---|---|---|---|
| CNN/DM | | | | | | | |
| llama2-7B | 18.25 | 6.56 | 14.57 | 22.41 | −14.41 | −17.53 | 14.73 |
| chatglm3-6B | 16.75 | 6.34 | 14.85 | 32.82 | −7.46 | −25.82 | 11.35 |
| Desentiment (Ours) | 44.42 | 21.13 | 41.47 | 40.41 | −28.18 | −28.74 | 19.33 |
| XSum | | | | | | | |
| llama2-7B | 18.72 | 6.59 | 16.02 | 29.76 | −6.21 | −11.56 | 3.65 |
| chatglm3-6B | 16.54 | 6.43 | 14.42 | 44.21 | 3.85 | −33.84 | 0.29 |
| Desentiment (Ours) | 47.10 | 24.73 | 39.37 | 46.79 | −10.40 | −29.51 | 3.87 |