Abstractive Summarizers Become Emotional on News Summarization

Abstract: Emotions are central to understanding contemporary journalism; however, they are overlooked in automatic news summarization. Summaries are an entry point to the source article and could favor some emotions to captivate the reader. Nevertheless, the emotional content of summarization corpora and the emotional behavior of summarization models are still unexplored. In this work, we explore the usage of established methodologies to study both. Using these methodologies, we study the emotional content of two widely used summarization corpora, CNN/DAILYMAIL and XSUM, and the capabilities of three state-of-the-art transformer-based abstractive systems for eliciting emotions in the generated summaries: BART, PEGASUS, and T5. The main findings are as follows: (i) emotions are persistent in the two summarization corpora, (ii) summarizers approach moderately well the emotions of the reference summaries, and (iii) more than 75% of the emotions introduced by novel words in generated summaries are present in the reference ones. The combined use of these methodologies has allowed us to conduct a satisfactory study of the emotional content in news summarization.


Introduction
Storytelling is an important aspect of journalism that aims to share facts or ideas in the best way to reach the audience, captivate its attention, and convince it. Hence, news often does not directly re-tell events, but rather gives an interpretation of those events by a human, whose feelings can often become an important part of the story's meaning [1]. Besides, there is clear evidence that using emotional cues helps to catch our attention and prolong our engagement [2]. For this reason, emotions have become an important dynamic in how news is produced and consumed, central to our understanding of journalism [3,4].
Given how online newspapers produce news articles, our entry points to a story are the headline and the summary. If they catch our attention, we will likely read the source article. Therefore, we would expect that human summarizers favor emotional content when generating summaries and headlines, potentially over/under-emphasizing some emotions compared to the source article [1]. Table 1 illustrates this with two summaries for the same article that evoke different emotions.
Few works have explored emotions in automatic news summarization [1], although emotions have been considered in summarization for other domains such as dialogue or microblogging [5,6].
Nowadays, pre-trained language models are the standard approach for developing state-of-the-art abstractive summarization systems for news articles. Their capabilities to summarize news articles have been proven across a broad set of corpora, standing out in terms of phrase-overlapping metrics like ROUGE [7]. However, the emotional behavior of these systems is still unexplored. Along with other summarization aspects such as abstraction [8], faithfulness, or factuality [9], emotional behavior can shed light on how to develop better summarizers.

Table 1. An example of two different summaries for the same article. Using the NRC lexicon, we highlight the words that convey emotions (the emotions are listed in brackets). Phrases and emotions in blue refer to positive aspects, and those marked in red to negative aspects.

Article
Penglais Farm (Aberystwyth University) will have a total of 1000 rooms, but only 700 will be ready [anticipation] this month to welcome [joy] students. The university said developer Balfour Beatty confirmed [trust] the remaining 300 rooms will be ready [anticipation] during the 2015-2016 academic year. Balfour Beatty has been asked to comment. The unfinished [¬anticipation] rooms have not been let to students.

Summary 1
Hundreds of rooms at a student halls development at Aberystwyth University will not be ready [¬anticipation] for the new term.

Summary 2
700 rooms at Aberystwyth University will be ready [anticipation] to welcome [joy] students this month.

In this work, we explore the usage of established methodologies to study the emotional content of summarization corpora and the emotional behavior of summarization models. Using these methodologies, we carry out the first study about the emotional content of news articles and their summaries. This study is mainly based on two measures to quantify the emotional content in texts at the word level: emotion density and emotion ratio [1], and is divided into two stages. First, we study the emotional content of two widely used news summarization corpora in the literature: CNN/DAILYMAIL [10] and XSUM [11]. Second, we study the capabilities of abstractive summarizer models for eliciting emotions in the generated summaries that match the emotions introduced by humans in reference summaries. This study has been performed on three state-of-the-art transformer-based systems [12]: BART [13], PEGASUS [14], and T5 [15]. This work aims to answer the following questions: (i) what and how frequent are the emotions in documents and summaries of both corpora; (ii) how emotion densities and ratios of the generated summaries correlate with densities and ratios of the reference summaries; and (iii) whether the emotions of novel words that appear in the generated summaries but not in the source articles match emotions of their reference summary. For reproducibility purposes, the software used in this work is freely available on GitHub (https://github.com/ELiRF/EmotionsInNewsSummarization, accessed on 10 January 2024).

Related Work
Automatic summarization has been addressed in the literature using mainly extractive or abstractive approaches. Extractive approaches build summaries by selecting text directly from the document [16–18], while abstractive systems build the summaries by paraphrasing text from the document [19,20]. Recently, strong efforts have been made in developing abstractive systems by focusing on encoder-decoder architectures pre-trained in self-supervised ways [13–15]. One of the best-known problems of these systems is hallucination, where the models are prone to generate content in the summaries that is not directly inferable from the source document. Several works aim to reduce hallucinations or improve the factual consistency of abstractive summarizers, e.g., employing content planning [21], reinforcement learning [22], or constraining the generation [23]. Abstractive summarizers could also be guided, for instance, to work better on aggregating semantic information [8], with specific topics [24], or to represent better the keywords and relationships among the entities [25,26].
Along with hallucinations, factuality, and abstractivity, emotions are also important to study in summarization systems and in the corpora used to train them. Since summaries are an entry point to the source article, the emotions elicited in the summaries directly impact the perception of the users. A few works have considered emotions in dialogue or microblog summarization [5,6], but, to our knowledge, only [1] has studied emotions in automatic news summarization. They proposed an emotion-aware news summarization system and introduced the concepts of emotion densities and ratios, which we use extensively in our work. As in [1], we use them to study salient emotions in human-written summaries, here for two widely used summarization corpora (CNN/DAILYMAIL and XSUM). Different from [1], we also study the emotional behavior of abstractive summarization systems, and we do not ground emotions to predefined categories since (i) articles from the considered categories are discarded, (ii) current summarization corpora do not consider categories, and (iii) we aim to obtain global insights of emotions at newspaper-level.
Emotions have also been studied outside the scope of news summarization, to understand the affective state of users in applications such as e-commerce [27], opinion analysis in social media [28,29], or healthcare [30,31]. Emotions have also been studied in the news domain to detect fake news [32] or the stance toward specific targets [33]. To our knowledge, our work is the first to analyze emotions in automatic news summarization to obtain insights from the emotional content of news summarization corpora and the emotional behavior of abstractive summarizers.

Emotional Content Measures
We aim to quantify (i) how frequent an emotion is in a text and (ii) which emotions increase/decrease their frequency in summaries compared to their frequency in articles. We base our study on the methodology introduced in [1].
Following this methodology, we assume that the presence of an emotional word in a text is enough to convey some degree of an emotion. Although this assumption oversimplifies the problem because of the inherent limitations of lexicons, such as the lack of compositionality or ambiguity, having a moderately accurate fine-grained view of emotions in texts is useful. We use the NRC lexicon [34] (version 0.92), which contains 27 k words and their associations with the eight basic emotions in Plutchik's wheel (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust). Ten thousand of these words were manually annotated through crowdsourcing, and the remaining 17 k words are WordNet synonyms of the annotated words. We use the NRC lexicon through the NRCLEX Python package to detect words with emotions. Words from texts and the NRC lexicon are lemmatized to deal with inflections.
To measure how frequent an emotion e is in a text t, we use the emotion density (ED) defined in Equation (1).

$$\mathrm{ED}(e, t) = \frac{\mathrm{count}(e, t)}{|t|} \qquad (1)$$

where count(e, t) is the number of words in the text t that convey the emotion e following the NRC lexicon, and |t| is the number of words in t. We compute the emotion density on articles, reference summaries, and generated summaries.
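As an illustration, the emotion density could be computed as in the following minimal sketch. It is not the released implementation: `TOY_LEXICON`, `count_emotion`, and `emotion_density` are hypothetical names, the lexicon entries are a tiny hand-written stand-in for the NRC lexicon, and tokens are assumed to be lowercased and lemmatized beforehand.

```python
# Hypothetical stand-in for the NRC lexicon: lemma -> set of emotions.
TOY_LEXICON = {
    "ready": {"anticipation"},
    "welcome": {"joy"},
    "confirm": {"trust"},
}


def count_emotion(emotion, tokens, lexicon=TOY_LEXICON):
    """count(e, t): number of tokens in the text associated with `emotion`."""
    return sum(1 for tok in tokens if emotion in lexicon.get(tok, set()))


def emotion_density(emotion, tokens, lexicon=TOY_LEXICON):
    """ED(e, t) = count(e, t) / |t|. Returns 0.0 for an empty text (our convention)."""
    if not tokens:
        return 0.0
    return count_emotion(emotion, tokens, lexicon) / len(tokens)


# Lemmatized version of Summary 2 in Table 1 (10 tokens, 1 anticipation word).
summary = "700 room will be ready to welcome student this month".split()
print(emotion_density("anticipation", summary))  # 0.1
```

Returning 0.0 for an empty text is a convenience choice of this sketch, not something the methodology prescribes.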
To quantify emotions that appear more/less frequently in a summary than in an article, we use the emotion ratio (ER). The emotion ratio of an emotion e in an article-summary pair is defined in Equation (2).

$$\mathrm{ER}(e, a, s) = \frac{\mathrm{ED}(e, s)}{\mathrm{ED}(e, a)} \qquad (2)$$

where a is an article and s a summary. When ER(e, a, s) > 1, we say the emotion e is overemphasized in the summary. On the contrary, when ER(e, a, s) < 1, we say that the emotion is underemphasized in the summary. Intuitively, emotions that are more frequent in reference summaries than in source articles should also be more frequent in generated summaries [1]. To measure this, we compute the emotion ratios for both reference summaries and summaries generated by abstractive summarizers.
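A worked example may help to read the emotion ratio. The sketch below is illustrative only: the function names and the one-entry lexicon are hypothetical, tokens are assumed to be lemmatized, and returning `None` when ED(e, a) = 0 is a convention of this sketch for the undefined case.

```python
def emotion_density(emotion, tokens, lexicon):
    """ED(e, t): fraction of tokens associated with `emotion` (0.0 if empty)."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if emotion in lexicon.get(tok, set()))
    return hits / len(tokens)


def emotion_ratio(emotion, article, summary, lexicon):
    """ER(e, a, s) = ED(e, s) / ED(e, a); None when ED(e, a) = 0 (undefined)."""
    ed_article = emotion_density(emotion, article, lexicon)
    if ed_article == 0:
        return None
    return emotion_density(emotion, summary, lexicon) / ed_article


# Hypothetical one-entry lexicon and toy token lists.
lexicon = {"ready": {"anticipation"}}
article = ["ready"] * 2 + ["room"] * 18   # ED = 2/20 = 0.1
summary = ["ready"] + ["room"] * 4        # ED = 1/5  = 0.2

ratio = emotion_ratio("anticipation", article, summary, lexicon)
print(ratio)  # about 2.0, i.e., anticipation is overemphasized in the summary
```

In this toy pair, anticipation is twice as dense in the summary as in the article, so it counts as overemphasized, while the ratio for joy is undefined because the article contains no joy word.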

Summarization Corpora
To conduct our study about emotions in news corpora and abstractive models, we choose two reference corpora in English news summarization: CNN/DAILYMAIL and XSUM. Both corpora are publicly available on the HuggingFace hub: https://huggingface.co/datasets/cnn_dailymail (CNN/DAILYMAIL, version 3.0.0, accessed on 10 January 2024), and https://huggingface.co/datasets/xsum (XSUM, accessed on 10 January 2024). Table 2 shows the number of samples and statistics for documents and summaries for both corpora.

Emotions in Summarization Corpora
First, we study how frequently the articles and summaries contain emotional words. To this aim, Table 3 shows the percentage of articles and summaries that have at least one word of an emotion. Most articles in both corpora show some emotion, and it is common to see all the emotions co-occurring (77% of articles in XSUM and 95% in CNN/DAILYMAIL have words representing all the emotions at some point). This is not the case in the summaries: the percentage of summaries that elicit each emotion is lower than the percentage of articles, especially in XSUM, and all the emotions co-occur less often than in the articles. Fear, sadness, anticipation, and trust are the emotions that appear in the largest number of articles and summaries.
We carried out a study of the most frequent combinations of emotions in the summaries. The study shows that there are larger combinations of emotions in CNN/DAILYMAIL than in XSUM, likely because its summaries are longer. Interestingly, summaries of CNN/DAILYMAIL are twice as long as XSUM ones, but it is four times more likely that all emotions appear in them (23.08% vs. 5.68%). Of the CNN/DAILYMAIL summaries, 52.43% fall in the top-10 combinations, while, in XSUM, the top-10 combinations account for 29.74% of the summaries. Figures A1 and A2 of Appendix A show this study.
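The combination study can be pictured as counting, for each summary, the set of emotions that its words convey. The following is a minimal sketch under stated assumptions: `emotions_in` and `top_combinations` are hypothetical helpers, the lexicon is a toy stand-in for the NRC lexicon, tokens are assumed to be lemmatized, and summaries without emotional words are counted as the empty combination, which may differ from how the figures in Appendix A were produced.

```python
from collections import Counter


def emotions_in(tokens, lexicon):
    """Set of emotions conveyed by at least one token of the text."""
    found = set()
    for tok in tokens:
        found |= lexicon.get(tok, set())
    return frozenset(found)


def top_combinations(texts, lexicon, k=10):
    """Return the k most frequent emotion combinations with their percentage."""
    counts = Counter(emotions_in(tokens, lexicon) for tokens in texts)
    total = len(texts)
    return [(sorted(combo), 100.0 * n / total) for combo, n in counts.most_common(k)]


# Hypothetical lexicon entries and toy summaries.
lexicon = {
    "ready": {"anticipation"},
    "welcome": {"joy"},
    "attack": {"fear", "anger"},
}
summaries = [
    ["room", "ready", "welcome"],
    ["welcome", "student", "ready"],
    ["attack", "city"],
    ["room", "student"],
]
print(top_combinations(summaries, lexicon, k=1))
```

Using a `frozenset` as the counter key makes the combination independent of word order, so the first two toy summaries fall into the same bucket.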
We found 27.8k examples in XSUM (12%) and 2.8k in CNN/DAILYMAIL (0.9%) where the reference summaries elicit at least one emotion that does not appear in the article. For both corpora, the most frequent emotions in these cases are disgust, anger, and surprise, and the least frequent ones are anticipation and trust. Table A1 of Appendix B shows one example from XSUM.
Second, by focusing on emotion densities and ratios, we study how frequently each emotion is elicited in articles and summaries and what emotions are over/under-emphasized in the summaries. Figure 1 shows Kernel Density Estimation (KDE) plots of emotion densities and ratios for each emotion in both corpora. In these plots, the x-axes represent values of either emotion densities or ratios, and the y-axes represent the estimated probability density. The figure shows that emotion densities and ratios are similarly distributed. Regarding the articles (first column in Figure 1), trust concentrates the largest number of articles with higher ED(e, a). In contrast, disgust concentrates the largest number of articles with lower ED(e, a). The distribution of fear is the most skewed. Despite the differences among the ED(e, s) and ED(e, a) distributions, emotions in summaries (second column of Figure 1) show a similar behavior: trust concentrates the largest number of summaries with higher ED(e, s), and disgust with lower ones. In XSUM, the distributions are shifted toward higher values of ED(e, s) compared to CNN/DAILYMAIL.
Regarding the ratios ER(e, a, s) (third column in Figure 1), there is a tendency to overemphasize the emotion fear in both corpora, as suggested by the median. Surprise, disgust, and joy are underemphasized in both corpora. Interestingly, disgust is the emotion with the highest density in the tail, when ER(e, a, s) is higher than ∼3 in CNN/DAILYMAIL and ∼4 in XSUM. In CNN/DAILYMAIL, the summaries tend to overemphasize emotions, especially the negative ones, while in XSUM they tend to overemphasize fear. In Table A2 of Appendix B, we show an example from XSUM where the emotion ratio of negative emotions is high. Anger, surprise, disgust, and joy show a median emotion ratio of 0 in XSUM. Therefore, the central tendency is not to include words with these emotions in the summary.

Emotions in Summarization Systems
In this section, we describe our study of the emotional behavior of three widely used state-of-the-art abstractive summarizers.
We consider two baselines commonly used in the literature for completeness: LEAD and RANDOM. LEAD extracts the first sentence of the source article in XSUM and the first three sentences in CNN/DAILYMAIL. RANDOM extracts the same number of sentences as LEAD, but randomly selected from the source article. Additionally, we use an oracle to represent the best hypothetical summarization model. The oracle selects the sentence in the source article that maximizes the averaged ROUGE F1 scores for each sentence in the reference summary.
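The two baselines and the oracle can be sketched as follows. This is only an illustration under several assumptions: the regular-expression sentence splitter is naive, all function names are hypothetical, and `unigram_f1` is a simple unigram-overlap F1 used here as a stand-in for the averaged ROUGE F1 scores, so the selected sentences may differ from those of a ROUGE-based oracle.

```python
import random
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def lead(article, n):
    """LEAD: first n sentences (n = 1 for XSUM, n = 3 for CNN/DAILYMAIL)."""
    return split_sentences(article)[:n]


def random_baseline(article, n, seed=0):
    """RANDOM: n sentences sampled without replacement from the article."""
    sentences = split_sentences(article)
    return random.Random(seed).sample(sentences, min(n, len(sentences)))


def unigram_f1(candidate, reference):
    """Unigram-overlap F1: a stand-in for ROUGE, not ROUGE itself."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def oracle(article, reference_summary):
    """For each reference sentence, pick the best-scoring article sentence."""
    sentences = split_sentences(article)
    if not sentences:
        return []
    return [
        max(sentences, key=lambda s: unigram_f1(s, ref_sentence))
        for ref_sentence in split_sentences(reference_summary)
    ]


article = (
    "Only 700 rooms will be ready this month. "
    "The developer confirmed the rest later. "
    "Balfour Beatty has been asked to comment."
)
print(lead(article, 1))
print(oracle(article, "Rooms will not be ready this month."))
```

A fixed seed is used in `random_baseline` only to make the sketch reproducible.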
For reproducibility, we show the results of these systems on the test sets in terms of ROUGE and BERTSCORE, measures commonly used in the literature for summarization [12]. The results are shown in Table A4 of Appendix C. The hyper-parameters used for the abstractive summarizers are shown in Table A5 of Appendix D.

Emotional Coherence and Bias
We analyze how emotion densities and emotion ratios of the generated summaries correlate with the corresponding metrics of the reference summaries. To this aim, we introduce two metrics based on the Pearson correlation coefficient.

Emotional Coherence
Emotional coherence measures how the emotion densities for an emotion e in the generated summaries correlate with the emotion densities for e in the reference summaries. In that sense, it quantifies the strength and direction of the relation between the proportion of words with an emotion e in a generated summary and the proportion of words with that emotion in the reference summary. The emotional coherence for an emotion e is computed as the Pearson correlation between the emotion densities in the reference summaries $y = \{\mathrm{ED}(e, s_1), \ldots, \mathrm{ED}(e, s_N)\}$ and in the generated summaries $\hat{y} = \{\mathrm{ED}(e, \hat{s}_1), \ldots, \mathrm{ED}(e, \hat{s}_N)\}$. Figure 2 shows the emotional coherence between reference summaries and summaries generated by each model for all the emotions and corpora.
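The definition above amounts to one Pearson correlation per emotion. A minimal pure-Python sketch, assuming the per-sample densities have already been computed, could look as follows; `pearson` and `emotional_coherence` are hypothetical names, and returning NaN for a constant input is a convention of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def emotional_coherence(reference_densities, generated_densities):
    """Correlation between ED(e, s_i) and ED(e, s_hat_i) for one emotion e."""
    return pearson(reference_densities, generated_densities)


# Toy densities of one emotion for three reference/generated summary pairs.
reference = [0.00, 0.10, 0.20]
generated = [0.00, 0.20, 0.40]
print(emotional_coherence(reference, generated))  # close to 1.0
```

In the toy example, the generated summaries double every density, yet the coherence is still close to 1: the metric captures whether densities move together, not whether they are equal.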
We observe that all the emotional coherences are higher than 0, suggesting positive relationships between the emotion densities. Abstractive models generally present a coherence higher than 0.5 in negative emotions (fear, anger, sadness, and disgust) and a coherence between 0.35 and 0.5 in the other emotions (anticipation, trust, surprise, and joy). Hence, abstractive models approximate better the emotion densities of negative emotions.
In XSUM, T5 is the abstractive model with the lowest emotional coherence and PEGASUS the one with the highest. In CNN/DAILYMAIL, BART generally has a slightly higher emotional coherence than T5 and PEGASUS. All the abstractive systems show a similar emotional coherence in both corpora.
Baseline systems also show higher emotional coherence in negative emotions. However, different from abstractive ones, these systems show low emotional coherence in XSUM. LEAD shows an emotional coherence very similar to that of abstractive systems in CNN/DAILYMAIL (slightly higher for some emotions). Hence, the first sentences of the source articles preserve the expected emotion densities of CNN/DAILYMAIL summaries moderately well, similarly to the abstractive models. All the systems have higher emotional coherence than RANDOM in both corpora. The oracle shows the highest coherence in CNN/DAILYMAIL, suggesting that the emotional coherence of the abstractive models could be increased if they focused on better sentences from the source (in terms of ROUGE with respect to the reference summary). This is not the case in XSUM, where abstractive systems have higher coherence than the oracle, which suggests that focusing on the best sentences of the articles would not help to increase the emotional coherence in XSUM.

Emotional Bias
Emotional bias measures how the emotion ratios for an emotion e in the generated summaries correlate with the emotion ratios for e in the reference summaries. Hence, it quantifies the strength and direction of the relation between the emphasis, relative to the source article, placed on an emotion e in a generated summary and the emphasis placed on that emotion in the reference summary. The emotional bias for an emotion e is computed as the Pearson correlation between the emotion ratios in the reference summaries $y = \{\mathrm{ER}(e, a_1, s_1), \ldots, \mathrm{ER}(e, a_N, s_N)\}$ and in the generated summaries $\hat{y} = \{\mathrm{ER}(e, a_1, \hat{s}_1), \ldots, \mathrm{ER}(e, a_N, \hat{s}_N)\}$. To compute the emotional bias, we discard all those examples where the emotion ratio is undefined (when ED(e, a) = 0).
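The filtering step matters in practice, since a single article without words of emotion e makes both ratios of that sample undefined. The sketch below illustrates one possible way to compute the emotional bias from precomputed densities; the function names are hypothetical and the Pearson implementation is a plain textbook one, not necessarily the one used for the reported results.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (NaN if either input is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def emotional_bias(article_ed, reference_ed, generated_ed):
    """Correlate ER(e, a_i, s_i) with ER(e, a_i, s_hat_i) for one emotion e.

    Samples with ED(e, a_i) = 0 are discarded because both ratios are undefined.
    """
    ref_ratios, gen_ratios = [], []
    for ed_a, ed_s, ed_s_hat in zip(article_ed, reference_ed, generated_ed):
        if ed_a == 0:
            continue
        ref_ratios.append(ed_s / ed_a)
        gen_ratios.append(ed_s_hat / ed_a)
    return pearson(ref_ratios, gen_ratios)


# Toy densities for four samples; the second one has ED(e, a) = 0 and is dropped.
article_ed = [0.1, 0.0, 0.2, 0.1]
reference_ed = [0.2, 0.3, 0.2, 0.0]
generated_ed = [0.1, 0.5, 0.1, 0.0]
print(emotional_bias(article_ed, reference_ed, generated_ed))  # close to 1.0
```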
Figure 3 shows the emotional bias between reference summaries and summaries generated by each model for all the emotions and corpora. In almost all the cases, the emotional biases are higher than 0, suggesting positive relationships between emotion ratios. The strength of the correlations is notably lower than for emotional coherence (Figure 2), which suggests that it is more difficult to approximate the emotion ratios than the emotion densities.
The abstractive systems show higher emotional bias in XSUM than in CNN/DAILYMAIL. In XSUM, T5 is the abstractive model with the lowest emotional bias. PEGASUS shows the highest emotional bias for almost all emotions in CNN/DAILYMAIL and XSUM. All the abstractive systems show, in XSUM, the lowest emotional bias for anger and surprise, and the highest emotional bias for sadness, fear, and trust. The emotional biases of abstractive systems are similar for all the emotions in CNN/DAILYMAIL. Baseline models, LEAD and RANDOM, show a low emotional bias in CNN/DAILYMAIL and a negligible one (close to 0) in XSUM. The low emotional bias of LEAD indicates that the first sentences of the source articles do not show the emotion ratios expected in the summaries of either CNN/DAILYMAIL or XSUM. In CNN/DAILYMAIL, LEAD shows a slightly lower emotional bias than abstractive models for all the emotions, but in XSUM, the difference with respect to abstractive models is large. All the systems show higher emotional bias than RANDOM. The oracle shows the highest emotional bias in CNN/DAILYMAIL but not in XSUM, where the abstractive models stand out. This suggests again that abstractive models could increase their emotional bias in CNN/DAILYMAIL if they focused on better sentences from the source article (in terms of ROUGE with respect to the reference summary), but not in XSUM.

Emotions of Novel Words
Abstractive summarizers are moderately good at generating summary-worthy novel words that are not present in the source. These novel words could convey a set of emotions. However, whether the emotions of the novel words are those expected in the reference summary is still unclear. We study it on the test sets of CNN/DAILYMAIL and XSUM by computing the precision between the emotions of the novel words in a generated summary and all the emotions in the reference summary.
Let $E_s$ be the set of emotions in a reference summary and $E_{\hat{s}}$ the set of emotions of the novel words in a generated one. Precision (P) for N samples is computed as shown in Equation (3), where the intersection refers to the emotions in common between those found in the novel words and the reference summary. We only consider those cases where there are novel words with emotions in the generated summary ($|E_{\hat{s}}| > 0$) and the reference one has words with emotions ($|E_s| > 0$).
We also compute the recall (R) to see how many of the emotions in the reference summary are covered by the emotions of the novel words in the generated summary.Recall is computed as shown in Equation (4).
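Since precision and recall are described in terms of set intersections, they can be illustrated with a few lines of Python. The sketch below assumes that both measures are averaged over the N retained samples, with |E_ŝ ∩ E_s| / |E_ŝ| as the per-sample precision and |E_ŝ ∩ E_s| / |E_s| as the per-sample recall; this averaging is our reading of the description, and the function names and lexicon entries are hypothetical.

```python
def novel_word_emotions(article_tokens, generated_tokens, lexicon):
    """Emotions conveyed by generated tokens that do not occur in the article."""
    article_vocab = set(article_tokens)
    emotions = set()
    for tok in generated_tokens:
        if tok not in article_vocab:
            emotions |= lexicon.get(tok, set())
    return emotions


def precision_recall(samples):
    """Average per-sample precision and recall over (E_s, E_s_hat) pairs.

    Pairs where either set is empty are skipped, mirroring the filtering
    described in the text. Returns (None, None) if no pair remains.
    """
    kept = [(ref, gen) for ref, gen in samples if ref and gen]
    if not kept:
        return None, None
    precision = sum(len(ref & gen) / len(gen) for ref, gen in kept) / len(kept)
    recall = sum(len(ref & gen) / len(ref) for ref, gen in kept) / len(kept)
    return precision, recall


# Toy (reference emotions, novel-word emotions) pairs; the third is skipped.
samples = [
    ({"fear", "sadness"}, {"fear"}),
    ({"joy"}, {"joy", "trust"}),
    (set(), {"anger"}),
]
print(precision_recall(samples))  # (0.75, 0.75)
```

In the toy data, the first pair contributes a precision of 1 and a recall of 0.5, and the second pair the opposite, which yields 0.75 for both averages.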
Table 4 shows precision and recall for each model and corpus, along with other data statistics used to compute them. Most of the novel words generated by the models have emotions that match those of the reference summaries since precision is higher than 75% in all cases. PEGASUS is the system that shows the highest precision in both corpora. T5 has slightly higher precision than BART in CNN/DAILYMAIL, but not in XSUM.
The precision of all the models is higher in CNN/DAILYMAIL than in XSUM. The abstractive models generate more novel words in XSUM (4.9 novel words per summary) than in CNN/DAILYMAIL (0.9 novel words per summary). Generating more novel words is therefore likely to introduce more unexpected emotions. Table A3 of Appendix B shows an example from XSUM where the emotions of the novel words in a summary generated by PEGASUS do not match exactly the emotions of the reference summary. Both in CNN/DAILYMAIL and XSUM, PEGASUS generates novel words in more samples than BART and T5 (lowest Samples w/o novel words). However, for a larger number of samples than BART and T5 in CNN/DAILYMAIL, the novel words generated by PEGASUS do not convey emotions (highest Samples $|E_{\hat{s}}| = 0$). By contrast, PEGASUS generates emotional novel words for a slightly larger number of samples than BART and T5 in XSUM (lowest Samples $|E_{\hat{s}}| = 0$). We notice that the models generate more novel words in XSUM than in CNN/DAILYMAIL, but the number of samples where novel words do not convey emotions is similar in both corpora.
Interestingly, the recall is between 34% and 51%, which suggests that the emotions of the novel words are enough to cover at least about a third of the overall emotional content of the reference summaries. BART has the highest recall in CNN/DAILYMAIL and PEGASUS in XSUM. Although it is difficult to explain why, the number of emotions in the reference summaries (lower in XSUM than in CNN/DAILYMAIL) could play a big role.
Considering the overall results of the two corpora, the difference in recall is significant. We attribute it to the different rates at which novel words are introduced in each corpus. Since the XSUM corpus is much more abstractive in nature than CNN/DAILYMAIL, summaries generated for it incorporate a greater number of novel words and, therefore, more emotional content.

Discussion
We summarize the most important contributions and findings of this work in relation to the objectives stated in the introduction section.
Emotional content of summarization corpora. First, we found that 99% of articles and 70% of summaries of the studied corpora contain at least one emotion. We also found that 12% of the reference summaries in XSUM and 0.9% in CNN/DAILYMAIL elicit at least one emotion that does not appear in the article. Second, we applied two measures, emotion density and emotion ratio, to articles and summaries of both corpora and analyzed the results. Regarding the articles, we observed that trust concentrates the largest number of articles with higher emotion densities. In contrast, disgust concentrates the largest number of articles with lower emotion densities. Regarding emotions in summaries, we noticed a similar behavior. In XSUM, the distributions are shifted toward higher values of emotion densities compared to CNN/DAILYMAIL. Regarding the emotion ratios, there is a tendency to overemphasize the emotion fear in both corpora. In CNN/DAILYMAIL, the summaries tend to overemphasize emotions, especially the negative ones, while in XSUM they tend to overemphasize fear.
Emotional behavior of summarization models. We introduced two new measures, emotional coherence and emotional bias, to measure how the emotion densities and ratios of generated summaries correlate with those of the reference. We found that all the emotional coherences are higher than 0, suggesting positive relationships between the emotion densities. Abstractive models generally present a coherence higher than 0.5 in negative emotions (fear, anger, sadness, and disgust) and a coherence between 0.35 and 0.5 in the other emotions. Additionally, we found a higher emotional bias in XSUM than in CNN/DAILYMAIL. In XSUM, T5 is the abstractive model with the lowest emotional bias. PEGASUS shows the highest emotional bias for almost all emotions in CNN/DAILYMAIL and XSUM. We also analyzed whether the novel words generated by the summarization models convey the emotions expected in their reference summaries. We observed that most of the novel words generated by the models have emotions that match those of the reference summaries. Interestingly, the recall is between 34% and 51%, which suggests that the emotions of the novel words are enough to cover at least about a third of the emotions in the reference summaries.
Finally, we should remark that the proposed methodology is valid for studying emotions in summarization regardless of the method used to detect emotions. However, the approach used in this work presents some limitations since we assumed that the presence of an emotional word in a text is enough to convey some degree of an emotion. Although this assumption oversimplifies the problem because of the inherent limitations of lexicons, such as the lack of compositionality or ambiguity, having a moderately accurate fine-grained view of emotions in texts is helpful. Therefore, we detected emotions at the word level using lexicons, although other alternatives could be used.

Conclusions
We studied the prevalence of emotions in news summarization corpora, specifically, how much these emotions are emphasized in the summaries compared to the source article and the capabilities of state-of-the-art abstractive summarizers for eliciting expected emotions in the generated summaries.
A large percentage of articles and summaries in CNN/DAILYMAIL and XSUM elicit emotions, especially fear, sadness, anticipation, and trust. Our findings also suggest that reference summaries in CNN/DAILYMAIL overemphasize negative emotions, while XSUM underemphasizes all the emotions except fear. Abstractive summarizers approach moderately well the emotion densities in the summaries. However, they do not show the same emotional bias as human summarizers when emphasizing emotions in the summaries. Finally, we noticed that most of the novel words generated by the models convey emotions expected in the reference summaries, especially in CNN/DAILYMAIL, where the models generate few novel words.
In future work, we plan to develop news summarization models with controllable text generation driven by the emotions of the reference summaries and via prompting [36], which could produce better emotional coherence in the generated summaries and, potentially, reduce undesired biases towards some emotions and stances.

Figure 2. Emotional coherence of each model for each emotion in CNN/DAILYMAIL and XSUM. Correlations are statistically significant (p-value is 0 in all the cases).

Figure 3. Emotional bias of each model for each emotion in CNN/DAILYMAIL and XSUM. Correlations are statistically significant (p-value is 0 in all the cases).

Table 2. Statistics for the two corpora: CNN/DAILYMAIL and XSUM. From left to right: corpus size, average document and summary length (in terms of words and sentences), and vocabulary size in document and summary.

Table 3. Percentage of articles and summaries in both corpora containing at least one word of an emotion.

Table 4. Precision and recall of the emotions in the novel words generated by each model, compared to the emotions of the reference summaries. The number of samples without (w/o) novel words in the generated summary and w/o emotions in the novel words of the generated summary ($|E_{\hat{s}}| = 0$) are also shown. The last column indicates the number of samples finally considered in the evaluation. We also show percentages of samples in the test sets.

Figure A1. Ten most frequent combinations of emotions in the summaries of CNN/DAILYMAIL. Bar labels indicate the percentage of summaries in the whole corpus.

Figure A2. Ten most frequent combinations of emotions in the summaries of XSUM. Bar labels indicate the percentage of summaries in the whole corpus.