Article

Detecting Fine-Grained Emotions in Literature

1
Jožef Stefan Institute, 1000 Ljubljana, Slovenia
2
Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7502; https://doi.org/10.3390/app13137502
Submission received: 29 May 2023 / Revised: 19 June 2023 / Accepted: 22 June 2023 / Published: 25 June 2023
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)

Abstract

Emotion detection in text is a fundamental aspect of affective computing and is closely linked to natural language processing. Its applications span various domains, from interactive chatbots to marketing and customer service. This research specifically focuses on its significance in literature analysis and understanding. To facilitate this, we present a novel approach that involves creating a multi-label fine-grained emotion detection dataset, derived from literary sources. Our methodology employs a simple yet effective semi-supervised technique. We leverage textual entailment classification to perform emotion-specific weak-labeling, selecting examples with the highest and lowest scores from a large corpus. Utilizing these emotion-specific datasets, we train binary pseudo-labeling classifiers for each individual emotion. By applying this process to the selected examples, we construct a multi-label dataset. Using this dataset, we train models and evaluate their performance within a traditional supervised setting. Our model achieves an F1 score of 0.59 on our labeled gold set, showcasing its ability to effectively detect fine-grained emotions. Furthermore, we conduct evaluations of the model’s performance in zero- and few-shot transfer scenarios using benchmark datasets. Notably, our results indicate that the knowledge learned from our dataset exhibits transferability across diverse data domains, demonstrating its potential for broader applications beyond emotion detection in literature. Our contribution thus includes a multi-label fine-grained emotion detection dataset built from literature, the semi-supervised approach used to create it, as well as the models trained on it. This work provides a solid foundation for advancing emotion detection techniques and their utilization in various scenarios, especially within cultural heritage analysis.

1. Introduction

Emotions are an integral part of human beings. Although most prominent in neuroscience, psychology, and behavioral science, emotion is also studied in the context of other fields, including healthcare, industrial design, architecture, and computer science, particularly in the context of Affective Computing [1]. The communication of emotion forms an integral part of human interaction but is often hard to convey in text. As humans shifted towards communicating over text messages, the need to better convey emotions became obvious. This led to the widespread adoption of emoticons and later emojis. These have even been adopted by chatbot developers to communicate more effectively with their users. However, humans have been communicating through text since long before the modern age. Emotion has long been a crucial aspect of storytelling, including writing and understanding literature [2,3,4], even if the particular set of emotions expressed can change significantly over time [5].

1.1. Motivation

Emotion detection in cultural heritage, literature in particular (though not limited to fiction), is the focus of this work. For our immediate motivation and near-future application, this work follows Ref. [6], which combined emotion detection with the detection of olfactory stimuli to begin associating certain emotions with sensory stimuli. The authors used only five emotions because their emotion detection model was trained on the Tales dataset [7]. We want to allow this type of cultural heritage analysis to be conducted with fine-grained emotions.
The most relevant dataset for emotion detection in literature, Tales, consists of 1207 sentences extracted from fairy tales obtained from Project Gutenberg (https://www.gutenberg.org/ (accessed on 22 June 2023)) and originally annotated with one of Ekman’s six basic emotions [8]: anger, fear, disgust, happiness, sadness, and surprise. Later these were reduced to just five emotions by merging anger and disgust. This brings to immediate attention the shortcomings we hope to address:
  • The use of coarse-grained emotions stemming from the use of “basic” or “top level” categories in psychological models of emotions;
  • A single emotion label per sentence;
  • The small size of the dataset by current machine learning standards.
Coarse-grained “basic” emotions are not sufficient for any application or study that requires identifying a more specific emotion. For example, a generic “sadness” emotion category can include distinct emotional states such as grief, despair, disappointment, and nostalgia. Further, it is debatable whether these emotion categories are appropriate at all for annotating text and whether their use, rather than that of more specific emotions, complicates the annotation task [9,10].
With regard to datasets and models with a single emotion label per sentence, the question becomes what to do if a sentence expresses multiple emotions. For example, happiness and sadness can be expressed in the same sentence: “They celebrated their final reunion, knowing that their paths would soon diverge, leaving behind a profound sense of joy tinged with the ache of farewell.” This has led to more recent work focusing on multi-label emotion detection (e.g., Refs. [11,12,13]).
Cultural heritage, such as the study of literature, is a knowledge-intensive domain. Datasets that focus on literature typically require expert or near-expert annotators (e.g., graduate students and their advisors) [14,15], or at least more substantial efforts beyond simple crowdsourcing. This often results in small, imbalanced datasets [16]. We believe the approach we propose might also be used in other problems and domains that have so far relied on small expert-annotated datasets or zero-shot transfer, such as healthcare [17].

1.2. Overview

To address the issues mentioned above, we introduce a large multi-label fine-grained emotion detection dataset. To create it, and overcome the difficulty in annotating literature, we propose to use a semi-supervised learning approach. We begin with sentences from Project Gutenberg (Section 2.1), which we preprocess and de-duplicate (Section 2.2). We create a fine-grained emotion taxonomy (Section 2.3) and leverage Natural Language Inference (abbreviated NLI, and also known as Textual Entailment) [18] for weak supervision (Section 2.4) through zero-shot text classification [19], picking the highest-scored sentences for each emotion. In the next step of our approach, we train a binary classifier for each emotion based on the weak labels and use each of them to perform pseudo-labeling for all examples. In this way, from binary weakly labeled examples, we create a multi-label dataset through pseudo-labeling (Section 2.5). Since we use emotion-specific binary classifiers to create the final labels of the dataset, the probability assigned to each emotion by its respective classifier can be used as a soft label or converted to a hard label with a threshold (probability ≥ 0.5). We provide a detailed analysis of the dataset in Section 3, including the label distribution (Section 3.1), label correlation (Section 3.2), and label sentiment (Section 3.3). Finally, we train two supervised models, described in Section 2.6, on this dataset, with the main purpose of assessing the quality of our dataset. To evaluate our dataset and models, we begin with the typical supervised evaluation scenario against a human-labeled subset of our dataset (Section 4.2). Next, we evaluate models trained on our dataset in a zero-shot transfer learning setting by mapping the output of our classifier (Section 4.3). As our last experiment, we evaluate few-shot transfer learning based on fine-tuning the models trained on our dataset (Section 4.4). In Section 5, we discuss our results. Finally, in Section 6 we present our conclusions, where we see the biggest room for improvement, and our plans for future work.
Our primary research objective is to explore the effectiveness of a semi-supervised approach using NLI to create a fine-grained multi-label dataset for training supervised emotion detection models specialized for literature. This approach aims to tackle the challenges presented by the complex nature of the problem and the domain, which hinder accurate annotation by crowdworkers and incur high costs for expert annotation. Our secondary research objective is to provide a dataset and an associated taxonomy that include a set of emotion labels not present in previous work and are geared towards literature analysis.

1.3. Related Work

Annotating training data for machine learning, especially in knowledge-intensive contexts, is a major challenge. Emotion and sentiment annotation, in particular, is notoriously difficult and filled with results showing low agreement between annotators despite significant efforts by the researchers involved in the studies [7,11,12,13,14,20,21,22,23]. In part, this has been attributed to the subjectivity inherent in the task and to the choice of annotation schemas [9,10]. Early works in the field of emotion detection from text based their annotation schemas on well-established emotion classification theories from the field of psychology, such as the six basic emotions of Ekman [8] and the eight emotions of Plutchik’s Wheel of Emotions [24]. This includes well-known benchmark datasets in the field, such as Tales [7] and ISEAR [25], all the way to UnifyEmotions [11], which was aggregated from 14 previously released emotion classification datasets. More recently, researchers have begun to move away from these annotation schemas and release datasets with more emotion labels, a trend that has been naturally accompanied by a move away from multiclass annotation and towards multi-label annotation. UnifyEmotions itself, despite being based on many multiclass datasets, attempted to move towards multi-label by recovering the information from the original annotation files and ignoring disagreements. The relatively high profile of the SemEval 2018 Affect in Tweets shared task (SemEval-2018 Task-1C) [12] also served to place more focus on multi-label emotion classification and on more fine-grained emotion labels. The shared task added optimism, pessimism, and love to Plutchik’s 8 emotions, for which it introduced a crowdsourced multi-label dataset with 11k tweets. Another interesting dataset is XED [26]. Although it uses Plutchik’s 8 emotions as labels, the annotation was multi-label and contains 34k examples drawn from subtitles. The most ambitious work in this direction was GoEmotions [13], which introduced the largest crowdsourced dataset to date, consisting of 58k Reddit comments annotated with 27 emotion labels. Our work goes further in this direction and introduces a multi-label dataset with 200k examples annotated with 38 emotions. Since it would be nearly impossible to crowdsource fine-grained emotions in literature, as we cannot reasonably expect crowd workers to produce good labels for the difficult-to-interpret language of literature, nor for the costs of such a task to be reasonable at scale, we instead opted for a semi-supervised approach. To summarize, the key differences are that our work uses a semi-supervised approach instead of expert annotators or crowdsource workers, introduces more fine-grained emotions (a total of 38), and is much larger (with 200k examples). Given that the application focus of our work is literature analysis, the best pre-existing alternative, Tales, was a single-label, small dataset (1.2k examples) annotated with only 5 emotion labels.
Multi-label fine-grained emotion detection from text is not just challenging for human annotators. It is also a challenge for machine learning algorithms. Similar to most tasks in Natural Language Processing, emotion detection approaches have moved toward the transformer architecture [27] which now dominates the state-of-the-art across the field (for examples of leaderboards in NLP, see https://gluebenchmark.com/leaderboard or https://huggingface.co/spaces/autoevaluate/leaderboards (accessed on 22 June 2023)). In XED, the authors used BERT [28] and reported a macro-averaged F1 score of 0.54 over their 8 emotions plus neutral. In GoEmotions, the authors also used BERT and reported a macro-averaged F1 of 0.46 over their 27 emotions plus neutral. A recent evaluation of ChatGPT on this dataset reported only 26% of that [29]. On the SemEval 2018 dataset, recent work reports a 0.60 macro-averaged F1 using RoBERTa [30] over the 11 emotions plus neutral, which was an improvement compared to the previous best-reported results such as SpanEmo’s 0.58 F1 [31]. We also choose RoBERTa as our model and report a very similar F1 score of 0.59 over our 38 emotions, which is impressive considering it is a semi-supervised dataset with more fine-grained labels than previous datasets.
The main method we use for building our semi-supervised dataset is ranking through a model trained on NLI. The viability of NLI-based zero-shot classification to provide labels for emotion classification was established in Ref. [19], with model ensembles being covered more thoroughly in Ref. [32] and multiple templates in Ref. [33]. Both focused on the evaluation of NLI on existing emotion datasets, while this work focuses on creating a new dataset. We use two hypotheses for each emotion, where the main difference between them is the label rather than the template, which is similar to some prompts explored in Ref. [33]. Both of these previous works focused primarily on coarse-grained emotions and on the evaluation of zero-shot classification, while the primary focus of ours is on fine-grained emotions and creating a new dataset. Zero-shot fine-grained emotion classification was performed in Ref. [34], but only to serve as input to a sentiment classifier rather than as a direct focus on the emotion classification task. Related work has also shown that the results of NLI-based zero-shot text classification can be improved in different ways: by using ensembles of different hypotheses [33], ensembles of different models, and by moving from zero-shot to few-shot [32]. Interestingly, fine-tuning pretrained NLI models for zero-shot emotion classification can also be done without labeling data by using self-training [35]. In that work, the authors evaluate the entailment accuracy of their self-training approach on GoEmotions and conclude that it significantly improves results over the base NLI models. The key difference between our work and the related work is that, rather than evaluating an NLI zero-shot based approach on existing datasets, we use it to build a multi-label fine-grained dataset and evaluate fully supervised models trained on it.
Soft labeling refers to assigning each instance a probability, such as a likelihood or confidence, that it belongs to each class. In comparison, when using hard or crisp labels, instances are assigned to a class with a probability of 1 (a certain event in probability theory). Soft labels can be captured from human annotators [36] but are more common in semi-supervised approaches [37], where they are often part of a self-training or co-training process, of label propagation, or derived from a pseudo-labeling classifier. Soft labels can be converted to hard labels, usually by thresholding at a certain probability. Commonly, if an example has a probability greater than 0.5 of belonging to a class, it is assigned that class label when converting to hard labels. Soft labels are often credited with better results than hard labels, attributed to providing robustness to noise and implicit regularization (e.g., Refs. [38,39,40,41]). In this work, we will compare the use of soft and hard labels on our dataset.
The key differences between our work and previous works that introduced new multi-label emotion datasets [12,13,26] can be summarized as follows:
  • Our work focuses on introducing a semi-supervised approach instead of crowdsourced workers for annotating training data;
  • We introduce a dataset with more fine-grained emotions (38 labels) compared to previous datasets (11, 27, or 8 labels);
  • Our work introduces the first multi-label or fine-grained emotion dataset for literature.
Compared to previous work that uses NLI-based zero-shot emotion classification [19,32,33]:
  • We use NLI for binary ranking of candidates instead of directly providing labels;
  • The final labels of our datasets are provided by a binary classifier for each emotion rather than by directly using NLI, allowing us to use only the highest NLI ranked examples for each emotion.
Finally, following previous work in semi-supervised learning, we compare the use of soft and hard labels in the context of this work.

1.4. Contribution

The purpose of this work is to improve the ability of researchers to analyze emotions in literature by providing a fine-grained emotion detection dataset and classifier. Further, we want to advance the state-of-the-art in emotion classification by providing a methodology that helps to develop emotion datasets. We tackle the issue that existing emotion classification datasets for literature are small and annotated with coarse, rather than fine-grained, labels. Annotating emotions is hard and expensive in general, and even harder in literature; we overcome this challenge by using a semi-supervised approach.
Our main contribution is a semi-supervised approach for creating multi-label emotion classification datasets, which we believe can be applied to similar problems, especially within the cultural heritage domain and other domains with a high cost of annotation. This approach can be used to create large balanced datasets without any human labeling.
Our second contribution is the dataset itself, with 38 fine-grained emotion labels, and its respective taxonomy and definitions that can be used for further research in emotion classification as well as for research related to improving the quality of semi-supervised data, such as methods for dealing with noisy labels. It includes more emotion labels than any other dataset, with GoEmotions [13] having only 27 and the only other literature sentence-level emotion detection dataset, Tales, only having 5 emotions. Similar to GoEmotions, but unlike Tales, it is also multi-label. Despite this, it still manages to have a balanced number of examples per label, which GoEmotions does not. We also provide an analysis of emotion correlation and emotion sentiment on our dataset that can serve to further inform future work in emotion detection and refine emotion taxonomies.
Our final contribution is the model trained on our dataset, evaluated in multiple common scenarios and public benchmark datasets. The evaluation shows that the model trained on our dataset performs on par with models trained on crowdsourced datasets, and it provides good zero and few-shot transfer ability. Making the model public gives researchers in other fields, especially those analyzing literature, the necessary tools to perform emotion detection.
Our main contributions can be summarized as follows:
  • A novel semi-supervised approach capable of creating fine-grained multi-label emotion classification datasets;
  • A large, balanced dataset with 38 fine-grained emotion labels, surpassing existing datasets;
  • A more comprehensive taxonomy and definitions for emotion detection from text;
  • Analysis of emotion correlation and sentiment within the dataset, informing future work in emotion detection;
  • Publicly available, easy-to-use trained models for researchers.

2. Materials and Methods

We start the description of materials and methods by describing Project Gutenberg, which contains a collection of free eBooks, from which we extract and filter sentences (Section 2.1 and Section 2.2). Given our taxonomy (Section 2.3), we use a pretrained NLI model to perform binary weak labeling (Section 2.4). We train models to perform pseudo-labeling and create the final multi-label emotion dataset (Section 2.5). These models, as well as subsequent models trained as part of our later experiments (Section 4), are detailed in Section 2.6.

2.1. Data

The biggest and most common source of open literature data is Project Gutenberg (PG). We use all English language books based on both metadata and language identification. The objective of the steps described in this section is to prepare the text for the steps described in the following sections. First, since the objective is to perform sentence-level emotion detection, we must convert the books into sentences. Given that PG book files contain some data that is not from the body of the book text, or that is otherwise problematic (it may not be in English, or it may not be properly segmented into a sentence), we also apply some common heuristics to remove such potentially problematic sentences:
  • Strip away template text added to the book (https://github.com/c-w/gutenberg/ (accessed on 22 June 2023));
  • Check the language of the book based on the characters in the interval [1000:20,000];
  • Split the book into sentences based on the newline character first and then using a sentence tokenizer;
  • Discard any sentences in all uppercase (most likely headings or template text);
  • Discard any sentences that do not start with a character of the English alphabet;
  • Discard any sentences with less than 6 tokens or more than 40 (whitespace delimited);
  • Discard any sentences that are not identified as English or any sentences that contain 10-token segments that are not identified as English;
  • Randomly shuffle the sentences and limit their total number to 10 M to reduce the amount of compute required in the following sections.
To perform sentence segmentation, we used the NLTK [42] implementation of Punkt [43]. For Language identification, we used the fasttext language identification model [44,45].
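The following is a minimal sketch of this segmentation and filtering pipeline, assuming the NLTK Punkt models and the fastText language identification model (lid.176.bin) have been downloaded; the helper names and the English-probability threshold are our own, and the template-stripping and 10-token-segment checks are omitted for brevity.

```python
import re
import fasttext
from nltk.tokenize import sent_tokenize  # NLTK's Punkt sentence tokenizer

# Assumes the fastText language identification model has been downloaded from
# https://fasttext.cc/docs/en/language-identification.html
lid_model = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.5) -> bool:
    """Return True if fastText identifies the text as English."""
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

def book_to_sentences(book_text: str) -> list:
    """Split a (template-stripped) book into sentences and apply the heuristics."""
    sentences = []
    for line in book_text.split("\n"):           # split on newlines first
        for sent in sent_tokenize(line):         # then apply the sentence tokenizer
            tokens = sent.split()                # whitespace-delimited tokens
            if not (6 <= len(tokens) <= 40):     # length filter
                continue
            if sent.isupper():                   # likely a heading or template text
                continue
            if not re.match(r"[A-Za-z]", sent):  # must start with an English letter
                continue
            if not is_english(sent):             # language filter
                continue
            sentences.append(sent)
    return sentences
```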

2.2. Deduplication

We remove duplicates and near duplicates using several processes. The first is exact case-insensitive string matching, which removes exact duplicates. The second is MinHash Locality Sensitive Hashing [46] using the datasketch library (https://ekzhu.com/datasketch/ (accessed on 22 June 2023)) to estimate the Jaccard similarity between sentences. For pairs above the threshold of 0.9 we calculate the exact Jaccard similarity; if it is above this threshold we consider them duplicates. For this process, each sentence is represented as space-delimited token n-grams with n = (1, 2, 3) with text first being standardized by normalizing white spaces, removing control characters, removing accents, and converting it to lower case.
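A sketch of the near-duplicate removal step using the datasketch library mentioned above, assuming a greedy pass over the already exactly deduplicated sentences; the shingling and thresholds follow the description, while the helper names and the number of permutations are illustrative, and the accent and control-character stripping is omitted for brevity.

```python
from datasketch import MinHash, MinHashLSH

def shingles(text: str) -> set:
    """Normalize whitespace and case, then build 1-, 2-, and 3-token shingles."""
    tokens = " ".join(text.lower().split()).split()
    grams = set()
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

def deduplicate(sentences, threshold=0.9, num_perm=128):
    """Keep sentences whose Jaccard similarity to all kept sentences is below the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept, kept_shingles = [], {}
    for idx, sent in enumerate(sentences):
        grams = shingles(sent)
        mh = MinHash(num_perm=num_perm)
        for g in grams:
            mh.update(g.encode("utf8"))
        # LSH returns candidate near-duplicates; confirm with the exact Jaccard similarity.
        is_duplicate = any(
            len(grams & kept_shingles[c]) / len(grams | kept_shingles[c]) >= threshold
            for c in lsh.query(mh)
        )
        if not is_duplicate:
            key = str(idx)
            lsh.insert(key, mh)
            kept_shingles[key] = grams
            kept.append(sent)
    return kept
```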

2.3. Emotion Taxonomy

We developed a taxonomy of 38 fine-grained emotions, plus neutral, based primarily on existing work in NLP as well as work in other fields. The emotion labels and their respective definitions are listed in Table 1. The definitions are based on dictionary definitions and synonyms, primarily from the Oxford English Dictionary (https://www.oed.com/ (accessed on 22 June 2023)) and the Merriam-Webster dictionary (https://www.merriam-webster.com/ (accessed on 22 June 2023)).
Our taxonomy directly includes all the Ekman basic emotions. From the Plutchik basic emotions, “Anticipation” is the only one directly missing, as it is a compound of very different emotions that we do include, such as “optimism” and “boredom”. With regard to the GoEmotions taxonomy, 26 of the 27 emotions are present, with only “realization” missing. We found it difficult to provide a meaningful, non-trivial, unambiguous definition that would separate “realization” from “surprise” and be clear for potential human annotators. The emotions “desire”, “love”, and “nostalgia” are included based on the requirements of historians involved in our project. Additionally, we spent some time looking for other emotions in the Wikipedia page for emotion classification (https://en.wikipedia.org/wiki/Emotion_classification (accessed on 22 June 2023)). After exploratory experiments, we settled on those described in Table 1; however, we see the potential for adding more in the future.

2.4. Weak-Labeling

An overview of the weak labeling process is shown in Figure 1. We assign weak labels to example sentences using Zero-Shot classification with BART [47] pretrained for NLI (https://huggingface.co/facebook/bart-large-mnli (accessed on 22 June 2023)). Textual entailment is a classification task where, given a pair of texts, one being a hypothesis and the other a premise, the objective of the classifier is to decide whether the hypothesis is true given the premise. More specifically, in the MNLI [48] dataset which was used to train the off-the-shelf NLI model used in this work, it is formulated as a 3-class multiclass problem: given a premise-hypothesis pair, the classifier must decide whether they are in entailment, in contradiction, or are neutral in relation to each other. Using the target sentence as the premise and creating a sentence expressing the hypothesis that the premise expresses a given emotion, the probability of entailment becomes the probability that the emotion is expressed in the target sentence. To create this probability, we use the unnormalized outputs from the network, corresponding to “entailment” and “contradiction” and pass them through the Softmax.
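As an illustration of this scoring step, the sketch below computes the entailment probability for a premise-hypothesis pair with the facebook/bart-large-mnli checkpoint, renormalizing only the entailment and contradiction logits with a softmax; the label indices are read from the model configuration and should be verified for other checkpoints, and the example premise is ours.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_probability(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis (neutral class discarded)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # Look up the class indices in the model config (for this checkpoint:
    # contradiction / neutral / entailment); adjust if the label names differ.
    contradiction_id = model.config.label2id["contradiction"]
    entailment_id = model.config.label2id["entailment"]
    two_way = torch.softmax(logits[[contradiction_id, entailment_id]], dim=0)
    return two_way[1].item()

# Example: score one sentence for one emotion with an emotion hypothesis.
score = entailment_probability(
    "He could not hold back his tears as the ship left the harbor.",
    "This expresses the emotion sadness",
)
```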
Weak labels are assigned in a two-stage process for each emotion. First, we use the NLI model with the template This expresses the emotion {emotion} to rank all examples obtained in Section 2.2 by the probability of entailment. Second, we load the top 150k examples by this score and re-rank them using the NLI model with the template Speaker or someone {emotion definition}. Table 2 shows the ranking and re-ranking hypotheses resulting from applying our templates to an emotion described in Table 1. From here, we take the top 6000 sentences by their re-ranked order and assign to them the respective emotion label.
Next, we filter the examples to ensure semantic diversity, using semantic embeddings provided by a MiniLM [49] model (https://huggingface.co/nreimers/MiniLM-L6-H384-uncased (accessed on 23 June 2023)) pretrained on multiple semantic similarity tasks [50]. Examples with a cosine similarity greater than 0.7 were discarded. Finally, we filter examples using simple token and stem [51] counting. The values were manually tweaked per emotion, based on the output. The maximum stem count was set to 600, while the maximum token count varied between 1000 and 2000. This variation is to account for the variation in the vocabulary used to express different emotions. For example, “boredom” has a much narrower vocabulary than “disgust”. The top 6000 examples by score are selected as positive examples for the emotion.
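A sketch of the semantic diversity filter, assuming a sentence-embedding wrapper around a MiniLM model; we illustrate it with the sentence-transformers API and the all-MiniLM-L6-v2 checkpoint, which stands in for the exact model referenced above.

```python
import torch
from sentence_transformers import SentenceTransformer, util

# Illustrative checkpoint; the dataset was built with a MiniLM model
# pretrained on multiple semantic similarity tasks.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def diversity_filter(sentences, max_similarity=0.7):
    """Greedily keep sentences whose cosine similarity to every previously
    kept sentence is at most the threshold."""
    embeddings = encoder.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)
    kept, kept_embeddings = [], []
    for sent, emb in zip(sentences, embeddings):
        if kept_embeddings:
            sims = util.cos_sim(emb, torch.stack(kept_embeddings))
            if sims.max().item() > max_similarity:
                continue  # too close to an already selected example
        kept.append(sent)
        kept_embeddings.append(emb)
    return kept
```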
The same process is repeated for selecting negative examples, except that we pick the lowest-scored examples, taking the bottom 300,000 examples by ranking score, followed by the bottom 10,000 examples by the re-ranking score. A separate process handles neutral (no emotion) examples. First, the sentences are filtered by the NLI model using the hypothesis This text expresses emotion. Only sentences with a probability below 0.2 are selected. These are then filtered again by the NLI model using the hypothesis The speaker or someone expresses or feels {emotion} for each emotion in our taxonomy. Only sentences with a probability below 0.5 for all emotions are selected as neutral examples. Approximately 5100 neutral examples were created in this way. At the end of this process, we have a dataset with binary labels for each emotion, totaling 38 datasets, each with 26,000 examples, plus the neutral examples.

2.5. Pseudo-Labeling

In this section, we create our final dataset. We do this by converting the emotion-specific binary datasets created by weak-labeling in Section 2.4 into a multi-label dataset by pseudo-labeling [52] the data. For each emotion, we use the weakly labeled examples to train a binary classifier for that emotion. This classifier is then used to classify all examples in our dataset except the neutral examples (these are already multi-label by definition). Figure 2 shows a diagram of this process. The classifier is described in Section 2.6 and is trained with a label smoothing factor [53] of 0.2 with the other hyperparameters detailed in Table 3.
The output of each binary classifier is a Softmax score that can be interpreted as the probability that the sentence expresses that particular emotion. We explore the use of this score as a soft label in Section 4. Applying the threshold of ≥0.5 to these scores can be used to create hard labels (0 or 1).
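The sketch below shows how the per-emotion pseudo-label probabilities can be collected into one multi-label record with both soft and thresholded hard labels; the classifier interface and the emotion subset are placeholders.

```python
import numpy as np

EMOTIONS = ["joy", "sadness", "fear"]  # illustrative subset of the 38 labels

def pseudo_label(sentence, binary_classifiers, threshold=0.5):
    """Combine emotion-specific binary classifiers into one multi-label record.

    `binary_classifiers` is assumed to map each emotion name to a callable that
    returns the positive-class softmax score P(emotion | sentence).
    """
    soft = np.array([binary_classifiers[e](sentence) for e in EMOTIONS])
    hard = (soft >= threshold).astype(int)  # hard labels via the 0.5 threshold
    return {"text": sentence, "soft_labels": soft.tolist(), "hard_labels": hard.tolist()}
```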
To make the dataset more manageable, we reduce its size to 200K examples by subsampling it using stratified splitting (https://github.com/trent-b/iterative-stratification (accessed on 23 June 2023)) [54] and further split it into train, validation, and test sets with a further small sample set aside for manual annotation. The size is still more than large enough to allow for further pruning or subsampling, while not being so big as to hinder experimentation. Pre-splitting into different subsets allows for easier comparison of any future works based on this dataset.
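A sketch of the stratified subsampling step, assuming the iterative-stratification package linked above and a binary hard-label matrix; function and variable names are illustrative.

```python
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

def stratified_subsample(texts, hard_labels, n_keep=200_000, seed=0):
    """Subsample a multi-label dataset while approximately preserving label proportions."""
    y = np.asarray(hard_labels)                  # shape: (n_examples, n_labels)
    idx = np.arange(len(texts)).reshape(-1, 1)
    splitter = MultilabelStratifiedShuffleSplit(
        n_splits=1, test_size=n_keep / len(texts), random_state=seed
    )
    _, keep_idx = next(splitter.split(idx, y))   # the "test" side is the kept subsample
    return [texts[i] for i in keep_idx], y[keep_idx]
```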

2.6. Supervised Classification

In this work, we train two sets of supervised text classifiers. The first set consists of the binary emotion-specific classifiers that perform the pseudo-labeling process described in Section 2.5. The second set consists of the classifiers used in the experiments described in Section 4. Both sets follow the same architecture, as shown in Figure 3. This is a pre-trained language model based on the transformer architecture [27] with a classification head for task-specific fine-tuning, an approach which has come to dominate the state of the art in NLP. We use RoBERTa [55] as the encoder, which replicates BERT [28] with better training. It also replaces BERT’s character-level Byte-Pair Encoding (BPE) [56] with a byte-level implementation similar to GPT [57]. This step is commonly called tokenization, as it segments text into a list of integer identifiers mapped to embeddings in the model’s embedding layer. We do not use additional preprocessing or text encoding steps. The classification head consists of dropout [58], followed by a fully connected dense layer, a hyperbolic tangent activation, dropout, and a fully connected dense projection layer. This classification head is used by the Transformers library in the “RobertaForSequenceClassification” model. We adopt it as it allows for easier replication of our work and for our model weights to be shared without the need to share additional code (see https://huggingface.co/lrei/emolit-roberta-large-soft for an example (accessed on 23 June 2023)).
The pseudo-labeling models use the smaller BASE transformer with 12 transformer layers, a hidden dimension of 768, with 12 attention heads (L = 12, H = 768, A = 12), totaling 125 M parameters. The models in the experiments section primarily use the LARGE transformer (L = 24, H = 1024-hidden, A = 16, 355 M parameters). Additionally, in Section 4.2 we also provide results for a smaller and faster model based on the DISTIL (https://huggingface.co/distilroberta-base (accessed on 23 June 2023)) transformer (L = 6, H = 768, A = 12, 82 M parameters) [59].
Pseudo-labeling models use the Softmax as the final activation function, whereas the models in the experiments use either the Softmax or Sigmoid based on whether the target dataset is multiclass or multi-label, respectively. The final FC layer has a size of H × N where N is the number of classes, for pseudo-labeling N = 2, for models trained on our data N = 38, while models trained on other datasets have dataset-specific values of N. The dropout probability used was always 0.1.
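A minimal sketch of this architecture using the Transformers library: RobertaForSequenceClassification with 38 output labels and a sigmoid for multi-label prediction. The checkpoint, maximum sequence length, and threshold mirror the values given in the text, while the function name is ours.

```python
import torch
from transformers import AutoTokenizer, RobertaForSequenceClassification

NUM_EMOTIONS = 38
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=NUM_EMOTIONS,
    problem_type="multi_label_classification",  # BCE loss during fine-tuning
)

def predict_emotions(sentence: str, threshold: float = 0.5):
    """Return the indices of the emotions predicted for a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=48)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    probs = torch.sigmoid(logits)  # independent per-label probabilities
    return (probs >= threshold).nonzero(as_tuple=True)[0].tolist()
```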
All models are trained using mixed precision (FP16) and the Adam [60] optimizer with the linear scheduler without warmup. Hyperparameters are taken directly from Ref. [55] without any hyperparameter tuning specific to the dataset. For pseudo-labeling, we used a batch size of 32 with a learning rate of $3 \times 10^{-5}$. Experiment-specific hyperparameters are presented in their respective sections. All models trained on our dataset were limited during training to a maximum sequence length of 48. The computational complexity of the transformer architecture, resulting from the use of self-attention, is $O(L \cdot n^2 \cdot H)$, where L is the number of self-attention layers, n is the sequence length, and H is the hidden dimension size. A neural network’s memory requirements are defined by the number of parameters it has and how they are stored. The LARGE models have 355 M parameters, and we use 2 bytes per parameter (with FP16 inference), which results in approximately 700 MB of memory used by the model. With 160k training examples and the hyperparameters used, the LARGE model takes about 3 h to train on an NVIDIA GeForce RTX 3090.

2.7. Evaluation Metrics

Throughout Section 4, we will report experimental results using the macro-averaged F1 metric (also known as maF1) as a single number summary. This is the standard metric for reporting multi-label classification results in the field (see Section 1.3), especially with imbalanced test sets. In this section, we provide a brief explanation of this metric and the other metrics used: precision and recall.
Multi-label classification is a type of classification problem where each instance can be assigned to multiple classes simultaneously. This is in contrast to binary classification, where an instance is assigned to one of two classes, and to multiclass classification, where each instance is assigned to exactly one class out of a set of mutually exclusive classes. In multi-label classification, each class label can be independently predicted, and an instance can belong to none, one, or multiple labels. In the context of multi-label classification, the terms True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) are defined based on the prediction and ground truth of each class separately.
  • True Positives (TP): For a specific label, TP represents the instances where the model correctly predicts the presence of that label, and the ground truth also indicates the presence of that label. It is the count of instances that are correctly identified as positive for a particular label;
  • False Positives (FP): For a specific label, FP occurs when the model predicts the presence of that label, but the ground truth indicates the absence of that label. It is the count of instances that are incorrectly classified as positive for a particular label;
  • True Negatives (TN): For a specific label, TN represents the instances where the model correctly predicts the absence of that label, and the ground truth also indicates the absence of that label. It is the count of instances that are correctly identified as negative for a particular label;
  • False Negatives (FN): For a specific label, FN happens when the model predicts the absence of that class, but the ground truth indicates the presence of that label. It is the count of instances that are incorrectly classified as negative for a particular label.
Rather than directly reporting these raw counts, it is common to report precision, recall, and F1, which provide a more comprehensive measure of the model’s performance as they take into account proportions rather than raw counts.
  • Precision measures the proportion of true positive predictions out of the total positive predictions (Equation (1)). It focuses on the correctness of positive predictions. A high precision indicates a low rate of false positives;
  • Recall measures the proportion of true positive predictions out of the total positive instances in the dataset (Equation (2)). It focuses on capturing all positive instances without missing any. A high recall indicates a low rate of false negatives;
  • F1 score is the harmonic mean of precision and recall. It provides a balanced measure that combines both precision and recall into a single metric (Equation (3)). It is commonly used when both precision and recall are equally important, providing an overall measure of the model’s performance.
$precision = \frac{TP}{TP + FP}$ (1)
$recall = \frac{TP}{TP + FN}$ (2)
$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$ (3)
Accuracy is calculated as the ratio of the total number of correct predictions (both true positives and true negatives) to the total number of instances in the dataset (Equation (4)). In datasets where the number of negative instances is much higher than the number of positive instances, the accuracy is usually misleading. A classifier that always predicts the majority class (negative) would achieve high accuracy due to the imbalance, even if it fails to correctly identify positive instances. In multi-label datasets with many labels, it is common for every label to have many more negatives than positives. Considering our problem, with 38 emotion labels, a given sentence is unlikely to express any one given emotion. This will be shown in Section 3, where we will see that most examples have at most two emotion labels. Thus, the ratio of positives to negatives in any test set is highly imbalanced; e.g., less than 2% of examples are positives in multiple classes and no class has more than 18% positives in our gold set. The accuracy score will be entirely dominated by negatives, with classes having high accuracy based on a good true negative rate.
$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (4)
In multi-label classification tasks, where there are multiple labels to predict, the macro-average is often used to calculate metrics such as precision, recall, and F1 score. Macro-average calculates the metric separately for each label and then takes the average across all labels. It treats each label equally, regardless of the label’s frequency or imbalance in the dataset. This approach ensures that each label’s performance contributes equally to the overall metric. The macro-average F1 score (Equation (5)) is calculated as a simple average of the class-specific F1 scores ($F1_i$) over all classes ($C$) in the test set [61].
$maF1 = \frac{1}{C} \sum_{i=1}^{C} F1_i$ (5)
This is the primary measure used in the literature and, sometimes, the only one reported (e.g., Ref. [26]), and it is thus necessary for comparison with existing work. We will also report precision and recall, both per class and macro-averages.
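As an illustration, the metrics above can be computed with scikit-learn from binary indicator matrices; the arrays below are toy values.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# y_true and y_pred are (n_examples, n_labels) binary indicator matrices.
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)       # Equation (5)
macro_precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)      # one F1 per label
```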

3. Data Analysis

The purpose of this section is to provide an understanding of one of the key results of our work: the emotion detection dataset we built. To do so, we provide its basic statistics, namely, the total number of sentences, how many sentences per emotion, and how many emotions per sentence (Section 3.1). We also provide a brief analysis of the correlation between emotions (Section 3.2) and summarize the sentiment associated with each emotion (Section 3.3).

3.1. Label Distribution

Label counts for our dataset are shown in Table 4. Datasets created from randomly sampled data tend to have a long-tailed label distribution. This is generally more pronounced when fine-grained labels are used. For example, in GoEmotions, the most common emotion label is “admiration” which has 53 times (53.64 times, measured on the train set) as many examples as the least common emotion label, “grief”, despite the use of a “pilot model” to balance emotions. In our case, the most common label, “nostalgia” (Appendix C explains why), is only 2 times more common than the least common label, “greed”. The balanced distribution and the large size of our dataset (200k sentences) should allow for the use of automatic pruning methods to reduce noise or sampling for the creation of manually labeled datasets.
Table 5 shows the number of labels per example as a percentage. More than 60% of the examples have between 0 and 2 emotion labels, while only around 13% have 5 or more emotion labels. For comparison, in GoEmotions, 83% of the examples have only 1 label. This is because the authors of GoEmotions specifically optimized their taxonomy to minimize the number of commonly overlapping emotions and limit the number of emotions. In our work, we instead tried to have the highest number of emotions that we could, for a first attempt, while still being able to clearly define and separate them. This more fine-grained approach to the taxonomy is designed to enable more precise study. For example, “despair” might be very similar to “sadness”, but the difference between them can be very significant, not just when analyzing fiction but also non-fiction. One example is medical texts, where “despair” can be more closely related to depression than “sadness” is. Section 3.2 deals with emotion correlation.

3.2. Label Correlation

We calculated the correlation between each pair of emotions over the sentences in our dataset, that is, between emotions that occur in the same sentence (since sentences can express multiple emotions given a multi-label formulation), and present the results in Figure 4. The highest and lowest correlated label pairs are summarized in Table 6. The results are mostly common sense, given the definitions, and are extremely similar to those in GoEmotions, with the top pairs being the same where possible (e.g., annoyance and anger, joy and excitement, nervousness and fear). Strongly negative correlations between emotions are likely less common in literature than in social media due to the purposeful use of contrast. For example, it is common to praise one character while in the same sentence showing contempt for others (“I liked your spirit; I hate these tame, perfectly conventional girls; they bore me to death”), or to show hope in the middle of despair (“I feel at times very much deprest indeed, but have no doubt but that all things will work for good”).
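A sketch of how this correlation can be computed directly from the binary label matrix with pandas; the columns shown are a toy subset of the emotions.

```python
import numpy as np
import pandas as pd

# One binary column per emotion, one row per sentence (toy subset of the labels).
labels = pd.DataFrame({
    "joy":        [1, 0, 1, 0, 1],
    "excitement": [1, 0, 1, 0, 0],
    "sadness":    [0, 1, 0, 1, 0],
})

corr = labels.corr()  # Pearson correlation between emotion columns
# Rank emotion pairs by correlation, excluding the diagonal (each pair appears twice).
pairs = corr.where(~np.eye(len(corr), dtype=bool)).stack().sort_values(ascending=False)
print(pairs.head())
```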

3.3. Label Sentiment

We performed sentence-level sentiment analysis on our dataset by using an off-the-shelf sentiment model (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english (accessed on 23 June 2023)). For each emotion, we count every positive and negative sentence labeled with that emotion. The resulting ratio is shown in Figure 5 with a summary of the most positive and negative emotions in Table 7.
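A sketch of this per-emotion sentiment tally using the off-the-shelf model named above through the Transformers pipeline API; the data structure assumed for the labeled examples is ours.

```python
from collections import Counter
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def positive_ratio(examples):
    """Fraction of positively classified sentences per emotion.

    `examples` is assumed to be an iterable of (sentence, [emotion labels]) pairs.
    """
    positive, total = Counter(), Counter()
    for sentence, emotions in examples:
        label = sentiment(sentence)[0]["label"]  # "POSITIVE" or "NEGATIVE"
        for emotion in emotions:
            total[emotion] += 1
            if label == "POSITIVE":
                positive[emotion] += 1
    return {e: positive[e] / total[e] for e in total}
```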
Again, the results are mostly common sense. Where they are not immediately obvious, they are easier to understand in conjunction with the label correlations in Figure 4 and by keeping in mind the definitions in Table 1. In Section 3.2 we discussed how the use of contrast results in weak negative correlations between emotions that should be opposites by common sense reasoning. In part, this helps explain why the sentiment of emotions is generally not more extreme (i.e., closer to 100% negative or positive). The emotion that jumps to our attention is “envy”. At first, by common sense, it seems like it ought to be a more negative emotion, but the relatively high correlation with “admiration” helps explain why it is not: admirable things are enviable, but that is not necessarily negative. For example, “Even the pattern from which she is working, the silk, the gold, the lawn, made happy by her touch, are sanctified, are envied”.

4. Experimental Results

4.1. Evaluation Data

We use four well-established, public emotion benchmark datasets. Tales is the most relevant public benchmark dataset for our work. It is an affect dataset with 1207 sentences built from literature, namely fairy tales by B. Potter, H.C. Andersen, and the Brothers Grimm [7]. It was annotated with the six basic Ekman emotions; however, anger and disgust were merged into a single label. GoEmotions [13] is the second most relevant public benchmark dataset for our work. It is a fine-grained emotion dataset consisting of 54,260 Reddit posts annotated by crowd workers with 27 emotions plus a neutral label. ISEAR (International Survey on Emotion Antecedents and Reactions) [25] is a collection of 7666 sentences, each answering a question assigned to one of 7 emotions: joy, fear, anger, sadness, disgust, shame, and guilt. EMOINT [62] contains crowdsourced annotations for 7102 tweets annotated with anger, joy, sadness, and fear. The Tales dataset is relevant because it is built from literature. GoEmotions is relevant because it is annotated with fine-grained emotions. We included ISEAR and EMOINT as benchmark datasets in these experiments because they are commonly used in related work. Both are also part of UnifyEmotions [11] and are thus easily accessible.
In addition to the public benchmark datasets, we also manually annotated a subset of our dataset with our own taxonomy. This is a set of 727 sentences annotated with all 38 emotions and detailed in Table 4 under the header “Gold”. It was created by iterative stratified sampling from the original dataset obtained in Section 2.5. A total of 1000 sentences were originally extracted, but to reduce the annotation work, 100 expected neutrals were discarded, and ultimately only the first 727 examples were manually labeled (labeling was performed by the first author).

4.2. Supervised Evaluation

Our first experiment consists of training two models on the train split of our dataset and evaluating them on our labeled gold split. The model architecture is described in Section 2.6. The training was done for 10 epochs, with the best epoch on the validation split being used to provide the final results. This setting is based on the original RoBERTa paper [55]. These models are effectively a baseline designed to evaluate our dataset.
The models differ only in the fact that one is trained on hard labels (0 or 1) created from applying the 0.5 threshold on the soft labels that the other model is trained on. The soft labels are the probability output of the pseudo-labeling classifiers described in Section 2.5. The results are shown in Table 8 with the corresponding hyperparameters shown in Table 9.
The macro averaged scores for both models are essentially the same at 0.59 F1. The results appear to be in line with what a similar model would achieve when trained and evaluated on GoEmotions, which is annotated by crowd workers while our dataset was automatically annotated. This is a good outcome. We expected the use of soft labels to provide some regularization and improved results on the human-labeled data, but this was not the case. Additionally, we compare the results with different model sizes without changing the architecture. Table 10 shows the summary comparison of different encoder sizes. There was no significant difference between using the BASE encoder and the LARGE encoder when fine-tuned and evaluated on our dataset with similar hyperparameters. The DISTIL model, which uses the distilroberta-base encoder, had a slightly worse F1 score. When comparing our RoBERTa classifier with the more common BERT baseline, we see that RoBERTa outperforms BERT in both the DISTIL and BASE sizes. The results for the LARGE size are the same. We can also see that RoBERTa LARGE and BASE have the same score but are still better than the DISTIL size. This is a data point that indicates likely diminishing returns from increasing the model size and suggests that future work should focus on improving the dataset.
Finally, in Table 11, we compare with the results provided by related work, showing that the baseline results in our dataset, created by our semi-supervised approach, are similar to crowdsourced multi-label emotion datasets. This is true even though our taxonomy is more fine-grained and thus needs to distinguish between very similar emotions that are grouped together in the other datasets.

4.3. Zero-Shot Transfer

Our taxonomy (Table 1) includes all emotions used in the selected public benchmark datasets, except for “realization” in GoEmotions. As such, it is possible to directly evaluate a model trained on our data on these datasets without any adaptation by simply mapping labels (see Appendix A). To handle “realization”, we simply merge it with “surprise” (based on the definition and mapping provided in Ref. [13]). This creates the dataset we refer to in this section as “GoEmotions26”. This experimental setup is, conceptually, the inverse of UnifyEmotions [11], where instead the labels of the target dataset are mapped into a common taxonomy and the classifiers are trained and evaluated on that. Such an evaluation makes the datasets and classifiers use a more coarse-grained taxonomy than originally intended.
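As an illustration of the label-mapping step, one simple way to project the 38-emotion predictions onto a target dataset's label set is to take the maximum probability over the mapped source emotions; whether this matches the exact aggregation used is not specified here, and the mapping fragment below is hypothetical, with the actual maps given in Table A1 (Appendix A).

```python
# Hypothetical fragment of a target-label -> source-emotions map; the complete
# maps used in the experiments are listed in Table A1 (Appendix A).
LABEL_MAP = {
    "joy":     ["joy", "happiness", "excitement"],
    "sadness": ["sadness", "grief", "despair"],
    "fear":    ["fear", "nervousness"],
}
SOURCE_EMOTIONS = ["joy", "happiness", "excitement", "sadness", "grief",
                   "despair", "fear", "nervousness"]

def map_predictions(source_probs):
    """Project per-emotion probabilities (aligned with SOURCE_EMOTIONS) onto target labels."""
    index = {emotion: i for i, emotion in enumerate(SOURCE_EMOTIONS)}
    return {
        target: max(source_probs[index[e]] for e in emotions)
        for target, emotions in LABEL_MAP.items()
    }
```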
To compare the results, we train and evaluate models using 5-fold cross validation on Tales, ISEAR, and EMOINT. These models follow the same architecture described in Section 2.6 and are trained with the hyperparameters in Table 12. We also trained a model on GoEmotions, using the same model architecture, following the exact same procedure as the supervised model trained on EmoLit (Section 4.2) and the same hyperparameters (Table 9), with the model from the best epoch selected on the basis of the GoEmotions development set. Table 13 shows the results.
In this experiment, training the model on soft labels outperformed hard labels. Both models trained on our dataset outperformed the model trained on GoEmotions on the other datasets, with the best being the model trained with soft labels, which attains an average score of 0.62 over the first 3 datasets, followed by the model trained on hard labels at 0.61 and, lastly, the model trained on GoEmotions with 0.59. Unsurprisingly, training models on the same dataset (relative to evaluation) significantly outperforms zero-shot transfer. The conclusion of this experiment is that models trained on our dataset learn about the emotions they are trained to distinguish and can be used in zero-shot emotion classification. Surprisingly, the model trained on GoEmotions (Reddit posts) did not outperform models trained on our dataset (literature) when evaluated on EMOINT (tweets), but it did match our results on Tales (literature).

4.4. Few-Shot Transfer

In this experiment, we evaluate the models trained on our dataset in a common transfer learning scenario. This largely mirrors the experimental setup in Ref. [13]: we replace the classification head with a newly initialized classification head for each benchmark dataset and fine-tune the models on a sample of the data. The amount of training data used from each target dataset is 50, 100, 150, and 200 examples, and sampling was stratified. Evaluation is conducted on the remaining data in the benchmark dataset with results shown in Table 14. The baseline is the same model architecture (see Section 2.6) fine-tuned on the few-shot examples. The comparison model was trained on the GoEmotions dataset (the same model from Section 4.3). All fine-tuning in these experiments used a batch size of 8 over 8 epochs with a learning rate of $2 \times 10^{-5}$ (summarized in Table 15). The results are an average of three runs.
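A sketch of the head replacement used for this few-shot fine-tuning, loading the public emotion model named in Section 2.6 with a freshly initialized classification head sized for the target dataset; the target label count is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TARGET_NUM_LABELS = 5  # e.g., the Tales label set (illustrative)

tokenizer = AutoTokenizer.from_pretrained("lrei/emolit-roberta-large-soft")
model = AutoModelForSequenceClassification.from_pretrained(
    "lrei/emolit-roberta-large-soft",
    num_labels=TARGET_NUM_LABELS,
    ignore_mismatched_sizes=True,  # discard the 38-emotion head, initialize a new one
)
# The model can then be fine-tuned on the 50-200 target examples with the
# hyperparameters summarized in Table 15.
```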
All models trained on emotions significantly outperform the baseline at a low number of examples. As the number of examples increases, the difference between fine-tuning a model pretrained for emotion classification and just the base model pretrained on the self-supervised language modelling objective decreases. At 200 examples, they are either identical or near-identical. The actual results on each benchmark are very similar for all models pretrained on emotion datasets, whether ours or GoEmotions. We can conclude that our dataset performs equally well in fine-tuning based few-shot transfer learning. Zero-shot results (Table 13) are significantly better for Tales and ISEAR, indicating that adaptation that does not discard the classification head would likely be preferable. Zero-shot would also be preferable if fewer than 200 examples were available for Tales, fewer than 100 for ISEAR, and around 50 or fewer for EMOINT.

5. Discussion

Our work has demonstrated the feasibility and potential of zero-shot classification in the context of building a fine-grained emotion detection dataset for literature analysis. Unlike previous research that focused on the direct use of zero-shot classification, we combined it with common semi-supervised learning approaches to construct a dataset for training supervised models. By leveraging the zero-shot NLI model, we ranked a large set of unlabeled sentences for each emotion and selected the most likely positive examples as weakly labeled data. These examples, along with their negative counterparts, were utilized to build binary pseudo-labeling models. By classifying all examples using these emotion-specific pseudo-labeling models, we obtained a multi-label dataset. To the best of our knowledge, this is the first publicly released dataset for multi-label fine-grained emotion detection in literature, with the most extensive emotion taxonomy of any dataset available, incorporating emotion labels not present in any other dataset. It is also the largest emotion detection dataset available and is two orders of magnitude bigger than the previously available literature dataset (Tales).
In the realm of emotion detection, our study showcases the possibility of having a fine-grained emotion detection dataset specifically tailored for literature. The supervised models trained on our dataset achieved promising results on our human-labeled gold dataset, even without substantial adaptation for handling noisy data and without hyperparameter tuning. The F1 score of 0.59 (Table 8) is similar to what is expected with crowdsourced multi-label emotion datasets (see Section 1.3 and Table 11). This is the case despite the fact that our dataset has more fine-grained labels and thus would be expected to be more challenging to model. It is an unexpected result that a semi-supervised dataset would provide results this good, especially before adding any techniques for handling noise. This validates our semi-supervised approach.
The results comparing RoBERTa to BERT and the different model sizes (Table 10) suggest diminishing returns from increasing the quality of the underlying pretrained large language model, which can suggest better returns for focusing efforts on improving the dataset (e.g., filtering noisy examples) or the ability to learn from it (e.g., handling noisy labels).
Furthermore, our experiments in transfer learning, including zero-shot (Section 4.3) and few-shot scenarios (Section 4.4), reveal the transferability of the learned emotion detection capabilities. Models trained on our automatically generated dataset perform comparably to models trained on GoEmotions, a crowdsourced dataset. In the zero-shot scenario, for the literature dataset Tales, we achieve 93% of the Cross-Validation F1-score (Table 13). In the few-shot fine-tuning scenario, transfer models trained on our dataset surpass the models trained only on the Tales examples until the 200 examples threshold (Table 14). Both these transfer learning experiments further help validate the quality of our dataset and thus validate the approach used to create it.
Although the use of soft labels did not improve the results when evaluated on the gold subset of our dataset, all transfer experiments show superior results for the model trained with soft labels, implying better transferability with soft labels. If the ultimate use of any model trained on our dataset is transfer learning, using the soft labels seems preferable.

6. Conclusions

In conclusion, this work proposed a novel semi-supervised approach for dataset creation, leveraging NLI zero-shot classification to construct a balanced multi-label fine-grained emotion detection dataset. We have demonstrated the effectiveness of our proposed approach, which compares favorably with crowdsourcing. We showed good performance for a model trained on our dataset when evaluated on a human labeled gold set. We also showed the transferability of emotion detection learned from our dataset to public benchmark datasets. We have provided a novel dataset and taxonomy for emotion detection with 38 emotion labels and made a model trained on our dataset public. We opened avenues for further exploration and improvement in emotion detection research, within the domain of literature analysis, cultural heritage analysis, and more broadly, any domain with a high cost of annotation.
The publication of our dataset and model will contribute to future research in emotion detection and in literature analysis. Within the scope of our project, Odeuropa (https://odeuropa.eu (accessed on 23 June 2023)), we will employ our model to investigate the relationship between olfactory sensations and emotions. Researchers can now conduct similar studies using our model, our dataset, or derived datasets. The size of our dataset allows for the creation of expert-annotated subsets that are still sufficiently large to train supervised fine-grained emotion detection models effectively.
From a methodological standpoint, there is ample room for improvement. Firstly, the emotion taxonomy can be expanded and refined to include additional categories, such as “distrust” or “loneliness”, which are significant for the study of the human condition. Secondly, the data selection steps should be adjusted to better handle incomplete and noisy sentences; more sophisticated token counting, such as using a tokenizer instead of simple whitespace delimitation, would also help. Thirdly, incorporating techniques from related work that improve zero-shot classification through ensembles of prompts and models could enhance the performance of our approach. Fine-tuning the NLI model with a few labeled examples (few-shot learning) has also shown promise in related work and could be explored if a small amount of human labeling is feasible, such as 5 to 10 examples per class. Fourthly, when constructing the initial binary models, more effort could be devoted to reducing label noise or mitigating its impact beyond label smoothing; minimal parameter tuning with access to some labeled data could also be explored. At this stage, hard negative example mining, i.e., selecting negative examples with high semantic similarity to the positive examples, could be beneficial: it forces the classifier to learn the expression of a specific emotion rather than surface regularities. Finally, while large-scale human annotation remains prohibitively expensive, recent advances in large language models could enable chatbot assistants to perform the annotation task; multiple assistants could, to some extent, simulate multiple annotators. Immediate future work can also address whatever noise is present in our dataset. Noisy labels are a common issue with semi-supervised dataset creation, and many approaches have been proposed over the years to handle them, including data filtering (removing likely noisy examples) and robust training procedures (learning from noisy labels). Our dataset provides an additional platform for evaluating such approaches.
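For the hard-negative mining idea above, one possible starting point is to rank candidate negatives by embedding similarity to the weak positives. The sketch below uses a Sentence-BERT style encoder as an assumed similarity proxy; the model name and the cut-off are illustrative choices, not part of our released code.

```python
# Sketch of hard-negative mining: pick the negatives most similar to the positives.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def mine_hard_negatives(positives, candidates, top_k=100):
    pos_emb = encoder.encode(positives, convert_to_tensor=True, normalize_embeddings=True)
    cand_emb = encoder.encode(candidates, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(cand_emb, pos_emb).max(dim=1).values  # best match to any positive
    order = sims.argsort(descending=True)
    return [candidates[int(i)] for i in order[:top_k]]
```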

Author Contributions

Conceptualization, L.R.; Data curation, L.R.; Funding acquisition, D.M.; Methodology, L.R.; Software, L.R.; Supervision, D.M.; Validation, L.R.; Writing—original draft, L.R.; Writing—review and editing, D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the European Union through the Odeuropa EU H2020 project under grant agreement No 101004469.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Label Maps

Table A1 gives the output label maps between our taxonomy and the benchmark datasets used in Section 4.3. Labels not included are not used. See also Table 1 and Figure 4. The mapping to GoEmotions is one-to-one, except that GoEmotions “remorse” maps to our “guilt” (a simple name change; the definition is the same) and GoEmotions “realization” maps to “surprise”. A minimal sketch of applying such a mapping follows Table A1.
Table A1. Label mappings from benchmark datasets to ours.

Tales → EmoLit
  angry-disgusted → anger, annoyance, disapproval, disgust
  happy → excitement, amusement, joy, relief, gratitude, optimism
  fearful → fear, nervousness
  sad → disappointment, despair, sadness, grief
  surprised → surprise

ISEAR → EmoLit
  fear → fear, nervousness
  shame → embarrassment
  guilt → guilt
  disgust → disgust
  anger → anger, annoyance, frustration
  joy → approval, relief, gratitude, joy, optimism
  sadness → grief, sadness

EMOINT → EmoLit
  anger → anger, annoyance
  joy → joy
  fear → fear, nervousness
  sadness → boredom, despair, sadness
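A minimal sketch of how such a mapping can be applied when scoring an EmoLit model against the Tales classes; collapsing the fine-grained probabilities with a max is an illustrative choice, not necessarily the exact evaluation code.

```python
# Sketch: collapse EmoLit per-emotion probabilities into Tales classes (Table A1).
TALES_TO_EMOLIT = {
    "angry-disgusted": ["anger", "annoyance", "disapproval", "disgust"],
    "happy": ["excitement", "amusement", "joy", "relief", "gratitude", "optimism"],
    "fearful": ["fear", "nervousness"],
    "sad": ["disappointment", "despair", "sadness", "grief"],
    "surprised": ["surprise"],
}

def to_tales(emolit_probs):
    """Take the maximum probability over the mapped fine-grained emotions."""
    return {tales: max(emolit_probs.get(e, 0.0) for e in fine)
            for tales, fine in TALES_TO_EMOLIT.items()}

print(to_tales({"fear": 0.91, "nervousness": 0.40, "joy": 0.05}))
# {'angry-disgusted': 0.0, 'happy': 0.05, 'fearful': 0.91, 'sad': 0.0, 'surprised': 0.0}
```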

Appendix B. NLI Hypothesis Comparison

Although many results have been published on using NLI for zero-shot classification (see Section 1), since our purpose was to build a multi-label literature emotion dataset, we conducted our own evaluation in a setting as close to ours as possible to gauge how it would perform for this work. This appendix shows the results of evaluating our zero-shot classifier on the Tales dataset. Since we are building a multi-label dataset, where each emotion is considered independently of all others, and Tales is a multiclass dataset, we treat all positive examples of a class in Tales as negative examples of the other classes. Here we map “fearful” to “fear”, “happy” to “joy”, and “sad” to “sadness”. Table A2 shows these results at different thresholds for the probability of the hypothesis being true given the premise; a small evaluation sketch follows the table. The trend is that the long hypothesis has higher recall and the short hypothesis has higher precision. The idea behind using the short hypothesis for ranking was to ensure that the top-ranked examples were mostly true positives, while re-ranking with the long hypothesis was a way to ensure diversity among the selected positive examples.
Table A2. NLI hypothesis binary evaluation at different thresholds on Tales: precision (P), recall (R), and F1-score per emotion.

Threshold | Hypothesis | Fear P / R / F1    | Joy P / R / F1     | Sadness P / R / F1
0.5       | Short      | 0.21 / 0.98 / 0.35 | 0.81 / 0.96 / 0.88 | 0.33 / 0.99 / 0.50
          | Long       | 0.16 / 1.00 / 0.27 | 0.39 / 1.00 / 0.56 | 0.39 / 1.00 / 0.56
0.6       | Short      | 0.22 / 0.97 / 0.36 | 0.83 / 0.94 / 0.88 | 0.35 / 0.99 / 0.51
          | Long       | 0.16 / 1.00 / 0.28 | 0.39 / 1.00 / 0.56 | 0.32 / 1.00 / 0.49
0.7       | Short      | 0.24 / 0.96 / 0.38 | 0.84 / 0.92 / 0.88 | 0.36 / 0.99 / 0.53
          | Long       | 0.17 / 1.00 / 0.28 | 0.40 / 1.00 / 0.57 | 0.33 / 0.99 / 0.50
0.8       | Short      | 0.27 / 0.93 / 0.42 | 0.86 / 0.87 / 0.87 | 0.39 / 0.98 / 0.55
          | Long       | 0.17 / 1.00 / 0.29 | 0.42 / 1.00 / 0.59 | 0.35 / 0.99 / 0.52
0.9       | Short      | 0.36 / 0.89 / 0.52 | 0.87 / 0.76 / 0.81 | 0.45 / 0.92 / 0.60
          | Long       | 0.18 / 1.00 / 0.31 | 0.46 / 0.99 / 0.63 | 0.38 / 0.98 / 0.55
0.95      | Short      | 0.51 / 0.75 / 0.60 | 0.88 / 0.64 / 0.74 | 0.53 / 0.88 / 0.66
          | Long       | 0.19 / 1.00 / 0.32 | 0.53 / 0.99 / 0.69 | 0.42 / 0.95 / 0.58
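For reference, a small sketch of the thresholded binary evaluation behind Table A2, assuming the per-sentence entailment probabilities for one emotion are already computed; the data shown here is toy.

```python
# Sketch: precision/recall/F1 of a thresholded NLI score for one emotion.
from sklearn.metrics import precision_recall_fscore_support

def evaluate_at_threshold(probs, gold, threshold):
    preds = [int(p >= threshold) for p in probs]
    p, r, f1, _ = precision_recall_fscore_support(gold, preds, average="binary", zero_division=0)
    return p, r, f1

probs = [0.97, 0.62, 0.10, 0.88]   # P(hypothesis | premise) for "fear"
gold = [1, 0, 0, 1]                # 1 if the Tales label is "fearful", else 0
for threshold in (0.5, 0.7, 0.9, 0.95):
    print(threshold, evaluate_at_threshold(probs, gold, threshold))
```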

Appendix C. Nostalgia

During exploration of the taxonomy, we investigated whether it was possible to have separate “nostalgia” and “homesickness” emotions. The dictionary definition of nostalgia is “a wistful or excessively sentimental yearning for return to or of some past period or irrecoverable condition”, and the word often covers nostalgia towards things such as products, TV shows, and facilities. Nevertheless, it was originally coined to describe acute homesickness and is still used in that sense (https://www.merriam-webster.com/dictionary/nostalgia (accessed on 23 June 2023)). In the early explorations that led to this work, the two emotions were easy to separate on social media, but when applied to Project Gutenberg books, over 70% of the examples weakly labeled as nostalgia were actually examples of homesickness, and most of the others were strongly associated with locations. We therefore merged the two immediately after pseudo labeling in Section 2.5 by taking the maximum probability of either and assigning it to the label “nostalgia”. This is why “nostalgia” is the largest class in our dataset; otherwise, the largest class would be “disappointment”.
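A one-line version of the merge described above, where the names and the dictionary layout are illustrative:

```python
# Sketch: fold "homesickness" into "nostalgia" after pseudo labeling by taking the max.
def merge_nostalgia(label_probs):
    probs = dict(label_probs)
    probs["nostalgia"] = max(probs.get("nostalgia", 0.0), probs.pop("homesickness", 0.0))
    return probs

print(merge_nostalgia({"nostalgia": 0.31, "homesickness": 0.84, "sadness": 0.12}))
# {'nostalgia': 0.84, 'sadness': 0.12}
```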

References

  1. Picard, R.W. Affective Computing; MIT Press: Cambridge, MA, USA, 2000. [Google Scholar]
  2. Johansen, J.D. Feelings in literature. Integr. Psychol. Behav. Sci. 2010, 44, 185–196. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Oatley, K. Emotions and the story worlds of fiction. In Narrative Impact; Psychology Press: Mahwah, NJ, USA, 2003; pp. 39–69. [Google Scholar]
  4. Hogan, P.C. What Literature Teaches Us about Emotion; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
  5. Frevert, U. Emotions in History—Lost and Found; Central European University Press: Budapest, Hungary, 2011. [Google Scholar]
  6. Massri, M.B.; Novalija, I.; Mladenić, D.; Brank, J.; Graça da Silva, S.; Marrouch, N.; Murteira, C.; Hürriyetoğlu, A.; Šircelj, B. Harvesting Context and Mining Emotions Related to Olfactory Cultural Heritage. Multimodal Technol. Interact. 2022, 6, 57. [Google Scholar] [CrossRef]
  7. Alm, C.O.; Roth, D.; Sproat, R. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, BC, Canada, 6–8 October 2005; pp. 579–586. [Google Scholar]
  8. Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
  9. Williams, L.; Arribas-Ayllon, M.; Artemiou, A.; Spasić, I. Comparing the utility of different classification schemes for emotive language analysis. J. Classif. 2019, 36, 619–648. [Google Scholar] [CrossRef] [Green Version]
  10. Öhman, E. Emotion Annotation: Rethinking Emotion Categorization. In Proceedings of the DHN Post-Proceedings, Riga, Latvia, 21–23 October 2020; pp. 134–144. [Google Scholar]
  11. Bostan, L.A.M.; Klinger, R. An Analysis of Annotated Corpora for Emotion Classification in Text. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 2104–2119. [Google Scholar]
  12. Mohammad, S.; Bravo-Marquez, F.; Salameh, M.; Kiritchenko, S. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, New Orleans, LA, USA, 5–6 June 2018; pp. 1–17. [Google Scholar] [CrossRef] [Green Version]
  13. Demszky, D.; Movshovitz-Attias, D.; Ko, J.; Cowen, A.; Nemade, G.; Ravi, S. GoEmotions: A Dataset of Fine-Grained Emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4040–4054. [Google Scholar]
  14. Kim, E.; Klinger, R. Who feels what and why? annotation of a literature corpus with semantic roles of emotions. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 1345–1359. [Google Scholar]
  15. Menini, S.; Paccosi, T.; Tonelli, S.; Van Erp, M.; Leemans, I.; Lisena, P.; Troncy, R.; Tullett, W.; Hürriyetoğlu, A.; Dijkstra, G.; et al. A multilingual benchmark to capture olfactory situations over time. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, Dublin, Ireland, 26–27 May 2022; pp. 1–10. [Google Scholar]
  16. Rei, L.; Mladenic, D.; Dorozynski, M.; Rottensteiner, F.; Schleider, T.; Troncy, R.; Lozano, J.S.; Salvatella, M.G. Multimodal metadata assignment for cultural heritage artifacts. Multimed. Syst. 2023, 29, 847–869. [Google Scholar] [CrossRef]
  17. Pita Costa, J.; Rei, L.; Stopar, L.; Fuart, F.; Grobelnik, M.; Mladenić, D.; Novalija, I.; Staines, A.; Pääkkönen, J.; Konttila, J.; et al. NewsMeSH: A new classifier designed to annotate health news with MeSH headings. Artif. Intell. Med. 2021, 114, 102053. [Google Scholar] [CrossRef] [PubMed]
  18. Dagan, I.; Glickman, O.; Magnini, B. The pascal recognising textual entailment challenge. In Proceedings of the Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment: First PASCAL Machine Learning Challenges Workshop, MLCW 2005, Southampton, UK, 11–13 April 2005; pp. 177–190. [Google Scholar]
  19. Yin, W.; Hay, J.; Roth, D. Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3914–3923. [Google Scholar] [CrossRef] [Green Version]
  20. Andreevskaia, A.; Bergler, S. CLaC and CLaC-NB: Knowledge-based and corpus-based approaches to sentiment tagging. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, 23–24 June 2007; pp. 117–120. [Google Scholar]
  21. Bermingham, A.; Smeaton, A.F. A study of inter-annotator agreement for opinion retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, 19–23 July 2009; pp. 784–785. [Google Scholar]
  22. Russo, I.; Caselli, T.; Rubino, F.; Boldrini, E.; Martínez-Barco, P. EMOCause: An Easy-adaptable Approach to Extract Emotion Cause Contexts. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011), Portland, OR, USA, 24 June 2011; pp. 153–160. [Google Scholar]
  23. Mohammad, S. A practical guide to sentiment annotation: Challenges and solutions. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, San Diego, CA, USA, 16 June 2016; pp. 174–179. [Google Scholar]
  24. Plutchik, R. A general psychoevolutionary theory of emotion. In Theories of Emotion; Elsevier: Amsterdam, The Netherlands, 1980; pp. 3–33. [Google Scholar]
  25. Scherer, K.R.; Wallbott, H.G. Evidence for universality and cultural variation of differential emotion response patterning. J. Personal. Soc. Psychol. 1994, 66, 310. [Google Scholar] [CrossRef] [PubMed]
  26. Öhman, E.; Pàmies, M.; Kajava, K.; Tiedemann, J. XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 6542–6552. [Google Scholar] [CrossRef]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  28. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  29. Kocoń, J.; Cichecki, I.; Kaszyca, O.; Kochanek, M.; Szydło, D.; Baran, J.; Bielaniewicz, J.; Gruza, M.; Janz, A.; Kanclerz, K.; et al. ChatGPT: Jack of all trades, master of none. Inf. Fusion 2023, 99, 101861. [Google Scholar] [CrossRef]
  30. Ameer, I.; Bölücü, N.; Siddiqui, M.H.F.; Can, B.; Sidorov, G.; Gelbukh, A. Multi-label emotion classification in texts using transfer learning. Expert Syst. Appl. 2023, 213, 118534. [Google Scholar] [CrossRef]
  31. Alhuzali, H.; Ananiadou, S. SpanEmo: Casting Multi-label Emotion Classification as Span-prediction. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 1573–1584. [Google Scholar] [CrossRef]
  32. Basile, A.; Pérez-Torró, G.; Franco-Salvador, M. Probabilistic Ensembles of Zero- and Few-Shot Learning Models for Emotion Classification. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), Online, 1–3 September 2021; pp. 128–137. [Google Scholar]
  33. Plaza-del Arco, F.M.; Martín-Valdivia, M.T.; Klinger, R. Natural Language Inference Prompts for Zero-shot Emotion Classification in Text across Corpora. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 6805–6817. [Google Scholar]
  34. Tesfagergish, S.G.; Kapočiūtė-Dzikienė, J.; Damaševičius, R. Zero-Shot Emotion Detection for Semi-Supervised Sentiment Analysis Using Sentence Transformers and Ensemble Learning. Appl. Sci. 2022, 12, 8662. [Google Scholar] [CrossRef]
  35. Gera, A.; Halfon, A.; Shnarch, E.; Perlitz, Y.; Ein-Dor, L.; Slonim, N. Zero-Shot Text Classification with Self-Training. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 1107–1119. [Google Scholar]
  36. Peterson, J.C.; Battleday, R.M.; Griffiths, T.L.; Russakovsky, O. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9617–9626. [Google Scholar]
  37. Van Engelen, J.E.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020, 109, 373–440. [Google Scholar] [CrossRef] [Green Version]
  38. El Gayar, N.; Schwenker, F.; Palm, G. A study of the robustness of KNN classifiers trained using soft labels. In Proceedings of the Artificial Neural Networks in Pattern Recognition: Second IAPR Workshop, ANNPR 2006, Ulm, Germany, 31 August–2 September 2006; pp. 67–80. [Google Scholar]
  39. Thiel, C. Classification on soft labels is robust against label noise. In Proceedings of the Knowledge-Based Intelligent Information and Engineering Systems: 12th International Conference, KES 2008, Zagreb, Croatia, 3–5 September 2008; pp. 65–73. [Google Scholar]
  40. Galstyan, A.; Cohen, P.R. Empirical comparison of “hard” and “soft” label propagation for relational classification. In Proceedings of the International Conference on Inductive Logic Programming, Corvallis, OR, USA, 25–27 October 2007; pp. 98–111. [Google Scholar]
  41. Zhao, Z.; Wu, S.; Yang, M.; Chen, K.; Zhao, T. Robust machine reading comprehension by learning soft labels. In Proceedings of the 28th International Conference on Computational Linguistics, Online, 8–13 December 2020; pp. 2754–2759. [Google Scholar]
  42. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009. [Google Scholar]
  43. Kiss, T.; Strunk, J. Unsupervised multilingual sentence boundary detection. Comput. Linguist. 2006, 32, 485–525. [Google Scholar] [CrossRef]
  44. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. arXiv 2016, arXiv:1607.01759. [Google Scholar]
  45. Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. FastText.zip: Compressing text classification models. arXiv 2016, arXiv:1612.03651. [Google Scholar]
  46. Broder, A.Z. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), Positano, Italy, 11–13 June 1997; pp. 21–29. [Google Scholar]
  47. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
  48. Williams, A.; Nangia, N.; Bowman, S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 1112–1122. [Google Scholar]
  49. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Adv. Neural Inf. Process. Syst. 2020, 33, 5776–5788. [Google Scholar]
  50. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019. [Google Scholar]
  51. Porter, M.F. An algorithm for suffix stripping. Program 1980, 14, 130–137. [Google Scholar] [CrossRef]
  52. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the Workshop on Challenges in Representation Learning, ICML, Atlanta, GA, USA, 16–21 June 2013; Volume 3, p. 896. [Google Scholar]
  53. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef] [Green Version]
  54. Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the Stratification of Multi-label Data. In Proceedings of the Machine Learning and Knowledge Discovery in Databases, Athens, Greece, 5–9 September 2011; Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 145–158. [Google Scholar]
  55. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  56. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1715–1725. [Google Scholar] [CrossRef]
  57. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners; Technical Report; OpenAI: San Francisco, CA, USA, 2019. [Google Scholar]
  58. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  59. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  60. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  61. Schutze, H.; Manning, C.D.; Raghavan, P. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008; p. 281. [Google Scholar]
  62. Mohammad, S.; Bravo-Marquez, F. Emotion Intensities in Tweets. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 65–77. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Weak labeling process using an existing NLI model to identify positive and negative examples of sentences expressing emotions. The key steps, Ranking and Re-Ranking (in bold), use different templates described in this section. The inputs and outputs of each step are underlined: from sentences, to candidates, to examples.
Figure 2. Pseudo labeling using weakly labeled data to train a model. Positives in green, negatives in red.
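The sketch below illustrates the pseudo-labeling loop of Figure 2, using a simple logistic-regression probe over sentence embeddings as a stand-in for the transformer-based binary classifiers actually used; the encoder name and the use of predicted probabilities as soft labels are assumptions for illustration.

```python
# Sketch of pseudo labeling: fit a binary model on weak labels, then score every sentence.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def pseudo_label(weak_positives, weak_negatives, all_sentences):
    texts = weak_positives + weak_negatives
    labels = [1] * len(weak_positives) + [0] * len(weak_negatives)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(encoder.encode(texts), labels)
    return clf.predict_proba(encoder.encode(all_sentences))[:, 1]  # soft label per sentence
```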
Figure 3. Architecture of the classifier model. The main blocks are the transformer encoder and the classifier (in bold). Only the output of the start token is passed to the classifier; all other encoder outputs are unused (in orange). Example classification shown with the positive label in green and the negative label in red.
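A minimal PyTorch sketch of the encoder-plus-classifier layout in Figure 3, with illustrative names and sizes; the released implementation may differ in details such as dropout.

```python
# Sketch: transformer encoder whose start-token output feeds a multi-label head.
import torch.nn as nn
from transformers import AutoModel

class EmotionClassifier(nn.Module):
    def __init__(self, encoder_name="roberta-base", num_labels=38, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        start = hidden[:, 0]                          # only the start token is used
        return self.classifier(self.dropout(start))   # one logit per emotion
```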
Figure 4. Emotion correlation calculated over the sentences in our dataset. The correlations are mostly in line with common sense, apart from the relatively small magnitude of the negative correlations. The results are in line with other datasets.
Figure 5. Sentiment of emotions: ratio of positive and negative sentiment sentences labeled with each emotion in our dataset.
Table 1. Emotion Taxonomy.

admiration: finds something admirable, impressive or worthy of respect
amusement: finds something funny, entertaining or amusing
anger: is angry, furious, or strongly displeased; displays ire, rage, or wrath
annoyance: is annoyed or irritated
approval: expresses a favorable opinion, approves, endorses or agrees with something or someone
boredom: feels bored, uninterested, monotony, tedium
calmness: is calm, serene, free from agitation or disturbance, experiences emotional tranquility
caring: cares about the well-being of someone else, feels sympathy, compassion, affectionate concern towards someone, displays kindness or generosity
courage: feels courage or the ability to do something that frightens one, displays fearlessness or bravery
curiosity: is interested, curious, or has strong desire to learn something
desire: has a desire or ambition, wants something, wishes for something to happen
despair: feels despair, helpless, powerless, loss or absence of hope, desperation, despondency
disappointment: feels sadness or displeasure caused by the non-fulfillment of hopes or expectations, being let down, expresses regret due to the unfavorable outcome of a decision
disapproval: expresses an unfavorable opinion, disagrees or disapproves of something or someone
disgust: feels disgust, revulsion, finds something or someone unpleasant, offensive or hateful
doubt: has doubt or is uncertain about something, bewildered, confused, or shows lack of understanding
embarrassment: feels embarrassed, awkward, self-conscious, shame, or humiliation
envy: is covetous, feels envy or jealousy; begrudges or resents someone for their achievements, possessions, or qualities
excitement: feels excitement or great enthusiasm and eagerness
faith: expresses religious faith, has a strong belief in the doctrines of a religion, or trust in god
fear: is afraid or scared due to a threat, danger, or harm
frustration: feels frustrated; upset or annoyed because of inability to change or achieve something
gratitude: is thankful or grateful for something
greed: is greedy, rapacious, avaricious, or has selfish desire to acquire or possess more than what one needs
grief: feels grief or intense sorrow, or grieves for someone who has died
guilt: feels guilt, remorse, or regret to have committed wrong or failed in an obligation
indifference: is uncaring, unsympathetic, uncharitable, or callous, shows indifference, lack of concern, coldness towards someone
joy: is happy, feels joy, great pleasure, elation, satisfaction, contentment, or delight
love: feels love, strong affection, passion, or deep romantic attachment for someone
nervousness: feels nervous, anxious, worried, uneasy, apprehensive, stressed, troubled or tense
nostalgia: feels nostalgia, longing or wistful affection for the past, something lost, or for a period in one’s life; feels homesickness, a longing for one’s home, city, or country while being away; longing for a familiar place
optimism: feels optimism or hope, is hopeful or confident about the future, that something good may happen, or the success of something
pain: feels physical pain or experiences physical suffering
pride: is proud, feels pride from one’s own achievements, self-fulfillment, or from the achievements of those with whom one is closely associated, or from qualities or possessions that are widely admired
relief: feels relaxed, relief from tension or anxiety
sadness: feels sadness, sorrow, unhappiness, depression, dejection
surprise: is surprised, astonished or shocked by something unexpected
trust: trusts or has confidence in someone, or believes that someone is good, honest, or reliable
Table 2. Example of NLI hypothesis for the emotion “fear”. See also Table 1.

Short (Ranking): This expresses the emotion fear.
Long (Re-ranking): Speaker or someone is afraid or scared due to a threat, danger, or harm.
Table 3. Hyperparameter values used for training the binary classifiers.

Batch Size: 32
Learning Rate: 3 × 10^-5
Max Epochs: 2
Smoothing: 0.2
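The smoothing value in Table 3 refers to label smoothing of the binary targets. A minimal sketch of one common formulation, under the assumption that the released code may use a slightly different variant:

```python
# Sketch: smooth hard 0/1 targets towards 0.5 before the BCE loss (smoothing = 0.2).
import torch

def smooth_targets(targets, smoothing=0.2):
    return targets * (1.0 - smoothing) + 0.5 * smoothing

print(smooth_targets(torch.tensor([0.0, 1.0])))  # tensor([0.1000, 0.9000])
```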
Table 4. The number of examples per emotion for each data subset, including the human-labeled gold dataset. We nearly achieved a balanced distribution of examples per emotion label.

Emotion        | Train   | Validation | Test   | Gold
admiration     | 7700    | 963        | 962    | 110
amusement      | 7427    | 928        | 928    | 52
anger          | 8556    | 1070       | 1070   | 72
annoyance      | 10,730  | 1341       | 1341   | 57
approval       | 8531    | 1066       | 1066   | 98
boredom        | 8113    | 1014       | 1014   | 53
calmness       | 8573    | 1072       | 1072   | 45
caring         | 8972    | 1122       | 1121   | 64
courage        | 8484    | 1061       | 1060   | 42
curiosity      | 7738    | 967        | 967    | 67
desire         | 10,160  | 1270       | 1270   | 81
despair        | 10,009  | 1251       | 1251   | 44
disappointment | 12,133  | 1517       | 1517   | 39
disapproval    | 11,130  | 1391       | 1391   | 111
disgust        | 8987    | 1123       | 1123   | 72
doubt          | 9012    | 1127       | 1127   | 43
embarrassment  | 9642    | 1205       | 1205   | 22
envy           | 9942    | 1243       | 1243   | 14
excitement     | 10,794  | 1349       | 1349   | 38
faith          | 8442    | 1055       | 1055   | 13
fear           | 11,556  | 1445       | 1445   | 39
frustration    | 11,162  | 1395       | 1395   | 54
gratitude      | 11,279  | 1410       | 1410   | 14
greed          | 7423    | 928        | 928    | 25
grief          | 10,972  | 1372       | 1371   | 14
guilt          | 8660    | 1082       | 1082   | 13
indifference   | 8549    | 1069       | 1069   | 37
joy            | 9404    | 1175       | 1175   | 61
love           | 8838    | 1105       | 1105   | 50
nervousness    | 7747    | 968        | 968    | 24
nostalgia      | 14,805  | 1851       | 1851   | 29
optimism       | 9560    | 1195       | 1195   | 37
pain           | 10,014  | 1252       | 1252   | 22
pride          | 10,744  | 1343       | 1343   | 27
relief         | 9317    | 1165       | 1165   | 25
sadness        | 9589    | 1199       | 1199   | 52
surprise       | 9818    | 1227       | 1227   | 36
trust          | 8606    | 1076       | 1076   | 43
neutral        | 22,803  | 2890       | 2919   | 15
Sentences      | 160,000 | 20,000     | 20,000 | 727
Table 5. Number of emotions per example (%). The majority of sentences have two or fewer emotion labels.

No. of Emotions | Examples (%)
0 | 14.3
1 | 27.4
2 | 20.8
3 | 14.4
4 | 10.0
5 | 6.6
6 | 4.3
7 | 2.3
Table 6. Summary of emotion correlation: highest and lowest correlated emotions. The results are very similar to GoEmotions.

Highest Correlation  | Lowest Correlation
despair–sadness      | optimism–pain
calmness–relief      | annoyance–optimism
fear–nervousness     | approval–frustration
anger–annoyance      | frustration–gratitude
excitement–joy       | disappointment–optimism
Table 7. Summary of emotion sentiment: most positive and most negative emotions. Most positive and most negative emotions match common sense expectations.

Most Positive: Emotion (Positive %) | Most Negative: Emotion (Negative %)
admiration (92)                     | frustration (88)
approval (91)                       | boredom (86)
optimism (90)                       | despair (86)
trust (89)                          | annoyance (83)
joy (89)                            | pain (82)
Table 8. Results on the human-annotated portion of the EmoLit dataset. The soft and hard label models had the same average F1 score. The results are in line with related work (e.g., GoEmotions).

Emotion        | Hard Labels P / R / F1 | Soft Labels P / R / F1
admiration     | 0.74 / 0.31 / 0.45     | 0.72 / 0.31 / 0.43
amusement      | 0.73 / 0.87 / 0.79     | 0.75 / 0.87 / 0.80
anger          | 0.70 / 0.65 / 0.68     | 0.71 / 0.68 / 0.69
annoyance      | 0.51 / 0.74 / 0.60     | 0.50 / 0.72 / 0.59
approval       | 0.83 / 0.51 / 0.63     | 0.80 / 0.49 / 0.61
boredom        | 0.67 / 0.94 / 0.78     | 0.64 / 0.92 / 0.75
calmness       | 0.65 / 0.82 / 0.73     | 0.63 / 0.82 / 0.71
caring         | 0.73 / 0.83 / 0.77     | 0.69 / 0.80 / 0.74
courage        | 0.47 / 0.67 / 0.55     | 0.46 / 0.67 / 0.54
curiosity      | 0.76 / 0.82 / 0.79     | 0.76 / 0.84 / 0.79
desire         | 0.82 / 0.79 / 0.81     | 0.81 / 0.77 / 0.78
despair        | 0.72 / 0.70 / 0.71     | 0.70 / 0.70 / 0.70
disappointment | 0.44 / 0.46 / 0.45     | 0.40 / 0.44 / 0.42
disapproval    | 0.47 / 0.23 / 0.31     | 0.48 / 0.25 / 0.33
disgust        | 0.84 / 0.38 / 0.52     | 0.79 / 0.36 / 0.50
doubt          | 0.72 / 0.49 / 0.58     | 0.61 / 0.44 / 0.51
embarrassment  | 0.58 / 0.64 / 0.61     | 0.50 / 0.73 / 0.59
envy           | 0.28 / 0.86 / 0.42     | 0.28 / 0.93 / 0.43
excitement     | 0.55 / 0.68 / 0.61     | 0.57 / 0.68 / 0.62
faith          | 0.39 / 0.85 / 0.54     | 0.45 / 0.77 / 0.57
fear           | 0.48 / 0.41 / 0.44     | 0.42 / 0.41 / 0.42
frustration    | 0.54 / 0.57 / 0.56     | 0.53 / 0.61 / 0.57
gratitude      | 0.28 / 0.79 / 0.42     | 0.26 / 0.71 / 0.38
greed          | 0.57 / 0.64 / 0.60     | 0.55 / 0.68 / 0.61
grief          | 0.27 / 0.86 / 0.41     | 0.31 / 0.93 / 0.46
guilt          | 0.43 / 0.69 / 0.53     | 0.45 / 0.77 / 0.57
indifference   | 0.66 / 0.89 / 0.76     | 0.65 / 0.84 / 0.73
joy            | 0.84 / 0.43 / 0.57     | 0.77 / 0.44 / 0.56
love           | 0.69 / 0.66 / 0.67     | 0.69 / 0.72 / 0.71
nervousness    | 0.54 / 0.54 / 0.54     | 0.55 / 0.46 / 0.50
nostalgia      | 0.29 / 0.97 / 0.44     | 0.27 / 0.97 / 0.42
optimism       | 0.52 / 0.43 / 0.47     | 0.50 / 0.38 / 0.43
pain           | 0.33 / 0.55 / 0.41     | 0.42 / 0.73 / 0.53
pride          | 0.46 / 0.59 / 0.52     | 0.48 / 0.59 / 0.53
relief         | 0.54 / 0.84 / 0.66     | 0.51 / 0.84 / 0.64
sadness        | 0.70 / 0.60 / 0.65     | 0.67 / 0.62 / 0.64
surprise       | 0.68 / 0.69 / 0.68     | 0.71 / 0.69 / 0.70
trust          | 0.71 / 0.63 / 0.67     | 0.76 / 0.65 / 0.70
macro-average  | 0.58 / 0.66 / 0.59     | 0.58 / 0.66 / 0.59
std            | 0.17 / 0.18 / 0.13     | 0.16 / 0.18 / 0.13
Table 9. Hyperparameter values used for training the supervised model on our dataset.

Batch Size: 16
Learning Rate: 2 × 10^-5
Max Epochs: 10
Table 10. Comparison between different model sizes trained on our training set and evaluated on the gold set. The F1 score is similar across all RoBERTa model sizes, although the distilled model is slightly worse (0.58 vs. 0.59). BERT underperforms RoBERTa at all but the LARGE size.

Encoder Name       | Encoder Architecture     | Encoder Parameters | F1 (Macro)
RoBERTa-large      | L = 24, H = 1024, A = 16 | 355 M              | 0.59
BERT-large         | L = 24, H = 1024, A = 16 | 340 M              | 0.59
RoBERTa-base       | L = 12, H = 768, A = 12  | 125 M              | 0.59
BERT-base          | L = 12, H = 768, A = 12  | 110 M              | 0.58
DistilRoBERTa-base | L = 6, H = 768, A = 12   | 82 M               | 0.58
DistilBERT-base    | L = 6, H = 768, A = 12   | 66 M               | 0.56
Table 11. Comparison with the results reported in related work, using BASE model sizes.

Dataset              | Emotions | Model   | F1 (Macro)
EmoLit (ours)        | 38       | RoBERTa | 0.59
EmoLit (ours)        | 38       | BERT    | 0.58
SemEval-2018 Task-1C | 11       | RoBERTa | 0.60 [30]
XED                  | 8        | BERT    | 0.54 [26]
GoEmotions           | 27       | BERT    | 0.46 [13]
Table 12. Hyperparameter values used for training the Cross-Validation models.

Batch Size: 8
Learning Rate: 2 × 10^-5
Epochs: 3
Table 13. Zero-shot transfer experiment macro F1 on benchmark datasets and relative percentage compared to training on the same dataset (Self). The best transfer results correspond to a model trained on our dataset with soft labels.

Dataset              | Tales      | ISEAR      | EMOINT     | GoEmotions26
Self                 | 0.83       | 0.76       | 0.82       | 0.52 ¹
Transfer
GoEmotions           | 0.74 (89%) | 0.52 (68%) | 0.50 (61%) | NA ²
EmoLit (Hard Labels) | 0.74 (89%) | 0.53 (70%) | 0.57 (70%) | 0.28 (54%)
EmoLit (Soft Labels) | 0.77 (93%) | 0.56 (74%) | 0.60 (73%) | 0.28 (54%)

¹ The model was trained on the 27-emotion train split of this dataset; the mapping to 26 emotions was applied to the output. ² Zero-shot evaluation is not possible.
Table 14. Fine-tuning transfer learning on benchmark datasets (macro F1). Transfer models outperform the baseline at a low number of examples. The best transfer results depend on the target dataset.

                     | Tales (Literature)        | ISEAR (Self-Reporting)    | EMOINT (Tweets)
Examples             | 50 / 100 / 150 / 200      | 50 / 100 / 150 / 200      | 50 / 100 / 150 / 200
Baseline             | 0.20 / 0.57 / 0.64 / 0.82 | 0.17 / 0.43 / 0.25 / 0.62 | 0.17 / 0.21 / 0.47 / 0.72
GoEmotions           | 0.58 / 0.75 / 0.80 / 0.81 | 0.47 / 0.58 / 0.62 / 0.63 | 0.53 / 0.62 / 0.67 / 0.68
EmoLit (Hard Labels) | 0.60 / 0.72 / 0.79 / 0.81 | 0.36 / 0.52 / 0.55 / 0.60 | 0.56 / 0.65 / 0.69 / 0.71
EmoLit (Soft Labels) | 0.61 / 0.74 / 0.79 / 0.81 | 0.36 / 0.54 / 0.55 / 0.62 | 0.59 / 0.67 / 0.70 / 0.72
Table 15. Hyperparameter values used for few-shot fine-tuning.

Batch Size: 8
Learning Rate: 2 × 10^-5
Epochs: 8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
