Mixing and Matching Emotion Frameworks: Investigating Cross-Framework Transfer Learning for Dutch Emotion Detection

: Emotion detection has become a growing ﬁeld of study, especially seeing its broad application potential. Research usually focuses on emotion classiﬁcation, but performance tends to be rather low, especially when dealing with more advanced emotion categories that are tailored to speciﬁc tasks and domains. Therefore, we propose the use of the dimensional emotion representations valence, arousal and dominance (VAD), in an emotion regression task. Firstly, we hypothesize that they can improve performance of the classiﬁcation task, and secondly, they might be used as a pivot mechanism to map towards any given emotion framework, which allows tailoring emotion frameworks to speciﬁc applications. In this paper, we examine three cross-framework transfer methodologies: multi-task learning, in which VAD regression and classiﬁcation are learned simultaneously; meta-learning, where VAD regression and emotion classiﬁcation are learned separately and predictions are jointly used as input for a meta-learner; and a pivot mechanism, which converts the predictions of the VAD model to emotion classes. We show that dimensional representations can indeed boost performance for emotion classiﬁcation, especially in the meta-learning setting (up to 7% macro F1-score compared to regular emotion classiﬁcation). The pivot method was not able to compete with the base model, but further inspection suggests that it could be efﬁcient, provided that the VAD regression model is further improved.


Introduction
Since the year 2000, sentiment analysis is one of the most well-studied research domains in natural language processing (NLP), not in the least because of its broad application potential. Companies and organisatons use it to learn more about (potential) customers or clients [1] or to gauge their online reputation [2]. Traditionally, sentiment analysis focused on the study of polarity with the goal of classifying textual instances as either positive or negative [1]. However, from a company perspective, it can be more interesting to pinpoint exactly what customers are talking about: for example, if they mention a product in an online review, they can be very specific as to which aspects they liked (e.g., quality and design) and/or disliked (e.g., user-friendliness). This had led to the emergence of aspectbased sentiment analysis, which focuses on the detection of sentiment expressions on the aspect or feature level [3].
In more recent years, the field advanced from analysing polarity to detecting more finegrained emotions [4]. The goal in emotion analysis is to analyse specific emotional states such as anger, joy and sadness or emotional dimensions like valence and arousal. By studying emotions, companies get more hands-on insights into which customer responses require an immediate action. For example, understanding when a customer is clearly angry or sad is more insightful than the label negative in the framework of customer response management or when thinking about employing chatbots for customer support. Moreover, the emotions of interest might vary regarding the application or domain.
When dealing with emotion analysis for specific applications, this leads to an important methodological issue, namely the choice of emotion framework. Many studies focus on the classification of textual utterances into basic emotion categories following the frameworks of Ekman (anger, disgust, fear, joy, sadness and surprise) [5] and Plutchik (anger, anticipation, disgust, fear, joy, sadness, surprise and trust) [6]. However, multiple researchers have emphasized the need of studying emotions not only in terms of basic emotion categories, but based on emotional dimensions like valence, arousal and dominance (VAD) as well [7,8].
In earlier work, we have already criticized the apparent arbitrariness with which an emotion framework is chosen for studies in NLP [9]. Mostly, a data-driven motivation or experimentally grounded choice is lacking. However, some researchers see benefits in tailoring the emotion label set to the task at hand. In the case of crisis communication, for example, it would be appropriate to employ the crisis-related emotion framework of Jin et al. [10], as proposed by Hoste et al. [11].
Although the emotional nuances in different label sets could be useful, tailoring these sets to specific applications or domains might introduce different challenges: (a) resources will need to be created for every specific application and domain, (b) emotion detection resources will be scattered over different frameworks, and (c) emotion detection systems will not be generalizable.
Cross-framework transfer learning methods could mitigate these challenges. Finetuning pre-trained models, multi-task learning or label space mapping methods can considerably decrease the amount of required training data, as this allows for the transfer of knowledge across divergent emotion frameworks.
A straight-forward approach to shift between frameworks is to map discrete categories into a three-dimensional space, which corresponds to Mehrabian and Russell's claim that all affective states can be represented by the dimensions valence, arousal and dominance [12]. This mapping to and from the VAD space can be regarded as a pivot mechanism. Regardless of the specific mapping technique (e.g., linear regression, kNN or lexicon-based mappings), this idea opens possibilities. Given an accurate mapping technique and a well-performing emotion analysis system that predicts values for valence, arousal and dominance, the predicted VAD values can be converted to any categorical emotion label set. Emotion frameworks can then easily be tailored to specific tasks and domains, broadening their scope of application in e.g., customer service management or conversational agents.
Moreover, previous experiments for Dutch emotion detection revealed that the classification of emotional categories (anger, fear, joy, love and sadness) is very challenging, while more promising results were found for VAD regression [13]. Transferring information from the regression task to improve performance on the classification task would therefore be an interesting line of research.
This study investigates the potential of dimensional representations and revolves around two research questions: (a) can dimensional representations serve as an aid in the prediction of emotion categories and (b) can dimensional representations contribute in tailoring label sets to specific tasks and domains?
Our research focuses on Dutch emotion detection and will make use of the EmotioNL dataset [13]. We examine three cross-framework transfer methodologies, namely multitask learning, meta-learning and the aforementioned pivot mechanism. In the multi-task setting, the VAD regression task and classification task are learned simultaneously. In the meta-learner approach, two systems are trained separately, one for VAD regression and one for emotion classification. We will investigate whether a multi-task learner or a metalearner that exploits both sources of information is favorable compared to a system that only uses one source. These models will be compared to a system relying on a pivot method, using solely dimensional representations. The code is publicly available at https://github.com/LunaDeBruyne/Mixing-Matching-Emotion-Frameworks (accessed on 30 September 2021).
We thus contribute to the field of emotion analysis in NLP by leveraging dimensional representations to increase the performance of emotion classification and by proposing a method to tailor label sets to specific applications.
The remainder of this paper is organised as follows: in Section 2, related work on the combination of categorical and dimensional frameworks in emotion detection is discussed. Section 3 describes the materials and methods of our study and gives an overview of the used data (Section 3.1) and a description of the experimental setup (Section 3.2). Results are reported in Section 4 and further discussed in Section 5. This paper ends with a conclusion in Section 6.

Related Work
Our previous work on Dutch emotion detection focused on the prediction of the classes joy, love, anger, fear, sadness or neutral and the emotional dimensions valence, arousal and dominance in Dutch Twitter messages and captions from reality TV-shows [13]. We found that the classification results were low (54% accuracy for tweets and 48% for captions). However, the results for emotional dimensions were more promising (0.64 Pearson's r for both domains). This observation, together with the issue of having specialised categorical labels for specific tasks/domains, reinforces the urgency to focus more on dimensional models and investigate their potential of aiding emotion classification by means of transfer learning.
Multi-task learning settings have proven successful in many tasks related to emotion and sentiment analysis [14,15]. Although there are not many studies that perform transfer learning with multiple emotion frameworks, there are various studies that employ multitask learning by jointly training emotion detection with sentiment analysis [16,17] or other related tasks [18]. All of these studies suggest that multi-task frameworks outperform single-task experiments and thus motivate the idea to train emotion classification and VAD regression jointly, especially as VAD probably contains more valuable emotional information than sentiment (which only contains the first dimension: valence).
Various studies have also investigated how to deal with disparate label spaces. Mostly, this involves a mapping between categorical and dimensional frameworks, e.g., in the work of Stevenson et al. [19] and Buechel and Hahn [20,21]. In these studies, scores for valence, arousal and dominance were used to predict intensity values for the basic emotion categories happiness, anger, sadness, fear and disgust, and vice versa. To this end, linear regression [19], a kNN model [20] and a multi-task feed-forward network [21] were used. Especially, this last method provided promising results, where a Pearson correlation of 0.877 was obtained for mapping dimensions to categories and 0.853 for the other direction.
A straightforward approach is to map discrete categories directly into the VAD space, which corresponds to Mehrabian and Russell's claim that all affective states can be represented by the dimensions valence, arousal and dominance [12]. Figure 1 shows the positions of Ekman's basic emotions in the VAD space, based on the scores of these terms in Mehrabian and Russell [12]. Calvo and Mac Kim employ this idea and apply it directly to the task of emotion detection [22]. They obtain lexicon scores for emotion words related to the categories anger/disgust, fear, joy and sadness by looking them up in the Affective Norms for English Words (ANEW) [23], and map the center of each of these categories in the VAD space. Then, they calculate VAD scores for sentences (again using the ANEW lexicon), which are placed in the emotional space as well. By computing cosine similarity between the sentence and the previously mapped emotion categories, the emotional category of the sentence can be determined. This lexicon-based mapping approach has as an advantage that no annotated categories are needed, in contrast to the previously discussed approaches which do require annotated categories to learn a mapping. Besides mapping between emotion frameworks, a similar line of research deals with the unification of disparate label spaces in emotion and sentiment resources. Examples of merging sentiment lexica are [24][25][26] for emotion lexica and [27] for emotion datasets. Techniques exist out of Bayesian models [24], variational autoencoders [25,26] and rulebased combination techniques [27] to map lexica or datasets with different labels into the same space.

Materials and Methods
In this section, we describe the data and experimental setup to thoroughly investigate the potential of dimensional representations in (a) improving emotion classification, and (b) tailoring the label set to specific tasks and domains by mapping emotional dimensions to categories.

Data
For this study, the EmotioNL dataset is used [13]. This dataset consists of Dutch data in two domains: Twitter posts (Tweets subcorpus) and utterances from reality TV-shows (Captions subcorpus). The Tweets subcorpus consists of 1000 tweets that all contain at least one out of a list of 72 emojis. The Captions subcorpus consists of 1000 utterances from transcriptions of three emotionally loaded Flemish reality TV-shows (Blind getrouwd; Bloed, zweet en luxeproblemen; and Ooit vrij), more or less equally distributed over the shows (335 instances from Blind getrouwd, 331 from Bloed, zweet en luxeproblemen and 334 from Ooit vrij).
All data were annotated with both categorical labels and dimensions. For the categorical annotation, the instances were labeled with one out of six labels: joy, love, anger, fear, sadness, or neutral. The dimensional annotations are real-valued scores from 0 to 1 for the dimensions valence, arousal and dominance. An annotated example of one instance per domain is shown in Table 1.

Experimental Setup
To answer the first research question, we will look at two ways to leverage dimensional representations in aiding emotion classification, namely multi-task learning and a stackingbased meta-learning method. The second question, namely whether dimensions can be mapped to categories to tailor label sets to specific applications, will be investigated by means of a pivot method, where we employ predictions from a dimensional model to predict emotion classes. Each of these models together with a baseline model are described in closer detail in the following sections.

Base Model: RobBERT
We will employ the Dutch transformer model RobBERT [28], the Dutch version of the robustly optimized RoBERTa [29], which is trained on 39 GB of common crawl data [30]. It consists of 12 self-attention layers with 12 heads, and has 117 M trainable parameters. Previous experiments showed that this model achieves the best performance for emotion detection [13] in comparison to the BERT-based BERTje model [31]. We implement the model using HuggingFace's Transformers library [32]. The finetuning process uses AdamW optimizer [33] and the ReduceLROnPlateau learning rate scheduler with learning rate 5e − 5. The loss function is Binary Cross Entropy for the classification task and Mean Squared Error loss for regression. We set dropout to 0.2 and use GELU as activation function in the implementation of [34]. The maximum sequence length is 64 tokens. The model is trained for 5 epochs for classification and 10 for regression with a batch size of 64 on an Nvidia Tesla V100 GPU. As we are dealing with small datasets (1000 instances per domain and task), the model is evaluated using 10-fold cross-validation.

Multi-Task Learning
In this setting, the classification (categories) and regression (dimensions) models are trained simultaneously (see Figure 2). We use the same architecture and hyperparameters as in the base model. The RobBERT feature encoder allows for hard parameter sharing where the learning of features for the emotion classes and VAD prediction happens simultaneously, but has separate task-specific output layers. The losses (Binary Cross Entropy for emotion classification and Mean Squared Error loss for VAD) are averaged according to pre-defined weights. We test three different ratios: one where VAD and classification are weighed equally (both 0.5), one where classification outweighs VAD (0.75 for classification and 0.25 for VAD) and one where VAD has the largest weight (0.75 for VAD and 0.25 for classification).

Meta-Learner
The meta-learner approach is another way of leveraging the information in dimensional representations as these are combined with categorical inputs. However, in this setting, no parameters are shared between the tasks. Instead, a stacking ensemble is used in which two base models are trained, one for VAD regression and one for emotion classification. The predictions (or probabilities in the case of classification) are concatenated (six values for classification and three values for VAD) and used as input for a meta-learner algorithm, in this case a support vector machine for classification and a linear regression model for VAD. A diagram of the proposed architecture is depicted in Figure 3.
Nested cross-validation is used for this approach: when training the base models, for every test fold, a model is trained on eight folds and predictions are saved on the remaining fold. This is repeated, so that, for every test fold, predictions for the other nine folds have been made. These predictions will be used in the training phase of the metalearner. Afterwards, the model is trained again using regular cross-validation, in order to save predictions for the test fold, based on training on nine folds. These predictions are then used during the test phase of the meta-learner. The support vector machine is trained with default settings in the Scikit-learn library [35]: we use a linear kernel and 1.0 as regularization parameter C. Hinge loss and L2 penalty is used for classification. The balanced mode is used, meaning that class weights are taken into account, inversely proportional to class frequencies.

Pivot Method
Contrary to the two previous approaches, the pivot method makes only use of the VAD annotations instead of both the dimensional and categorical data. It starts from predicting VAD scores through a transformer model, and these predictions are then transformed to classes by means of a rule-based mapping (see Figure 4). Although several mapping techniques have been investigated in related work (see Section 2), these approaches are not eligible for a pivot method, as they all rely on data in a bi-representational format and thus also require categorical data for a mapping to be learned. However, the idea of a pivot is to be able to map to any possible label set, without having to rely on any annotations for those labels.
The rule-based mapping works as follows: we look up the emotion terms from our label set (anger, fear, joy, love and sadness) in the definition list with VAD scores of Mehrabian and Russell [12] and scale them to a range from 0 to 1 to match the VAD annotation framework of the dataset. The scores can be found in Table 2. Following [22], we place both the textual instances to be classified and the vectors for the categorical emotion terms in the three-dimensional space.
We start by drawing some general rules for anger, fear, joy and sadness, as shown in Table 3 (at this point, love and neutral are not taken into consideration). If a class cannot be matched based on these rules, then we calculate cosine distance between the instance that needs to be classified and each emotion class vector (here love and neutral are included, the last one being defined as {0.5, 0.5, 0.5}). The class which has the smallest cosine distance to the instance is then assigned.   Table 3. Mapping rule used in the pivot method.
if V < 0.5 and A > 0.5 and D > 0.5 : class ← anger elif V < 0.5 and A > 0.5 and D < 0.5 : class ← f ear elif V > 0.5 and A > 0.5 and D < 0.5: class ← joy elif V < 0.5 and A < 0.5 and D < 0.5: class ← sadness else: Find class with smallest cosine distance

Evaluation
Our experiments will be evaluated using three metrics: macro-averaged F1 (F1), micro-averaged F1 (Acc.) and cost-corrected accuracy (CC-Acc.). Cost-corrected accuracy is similar to normal accuracy, but, here, a cost matrix with specific weights is taken into account [13]. This way, misclassifications within the correct polarity are punished less than misclassifications in the opposite polarity (e.g., misclassifying an instance of fear as sadness has a lower weight than misclassifying love as anger).

Results
We report results for the three metrics (macro F1, accuracy and cost-corrected accuracy) for the base transformer model, the multi-task model in its three settings (equal weights, higher weight for classification and higher weight for regression), the meta-learner and the pivot model. The results for Tweets are shown in Table 4 for categories and Table 5 for VAD, while results for Captions are shown in Tables 6 and 7.   The results of the base models are rather similar in both domains. As also observed in De Bruyne et al. [13], the performance is notably low for categories, especially regarding macro F1-score (only 0.347 for Tweets and 0.372 for Captions). Note that we are dealing with imbalanced datasets, which explains the discrepancy between macro-F1 and accuracy (instances per category in Tweets subcorpus: n_anger = 188, n_fear = 51, n_joy = 400, n_love = 44, n_sadness = 98, n_neutral = 219; Captions subcorpus: n_anger = 198, n_fear = 96, n_joy = 340, n_love = 45, n_sadness = 186, n_neutral = 135). Scores for dimensions seem more promising, although results are hard to compare as we are dealing with different metrics (r = 0.635 for Tweets and 0.641 for Captions).
When we look at multi-framework settings (multi-task and metalearner), we see that performance goes up for the categories (from 0.347 to 0.420 in the meta-learning setting for Tweets and from 0.372 to 0.407 for Captions), while it drops or stays constant for the dimensions (from 0.635 to 0.638 and from 0.641 to 0.643 for the meta-learner in Tweets and Captions, respectively) . This observation confirms that categories benefit more from the additional information of dimensions than in the opposite direction and corroborates the assumption that the VAD model is more robust than the classification model.
The boost in performance for categories is especially clear for the meta-learner setting, where scores improve for all evaluation metrics in both domains (increase of no less than 7% macro F1 and around 2% (cost-corrected) accuracy for Tweets and around 3% in all metrics for Captions).
For the multi-task approach, only macro F1 increased for categories, while for Captions, (cost-corrected) accuracy also went up in two out of three settings. When taking all metrics into account, the largest increase was found in the setting where VAD had the largest weight (noted in Tables 4 and 6 as Multi-task (0.25)).
For the pivot method, the primary objective was not to outperform the base model, but to be on par with it. However, looking at the performance, we observe a steep drop in performance for all metrics (e.g., for Tweets accuracy and Captions F1 the decrease is almost 10%). The loss in cost-corrected accuracy is smaller. Error analysis will need to clarify whether predictions made in the pivot approach are useful (see Section 5). However, based on these results, it does not seem that the pivot method is an effective approach to predict emotion categories.

Discussion
The results in Section 4 suggest that VAD dimensions can help in predicting emotional categories, as the VAD regression model seems more robust than the classification model. However, the pivot method did not seem an effective approach to predict emotion categories. In this section, we will take a look at the correlation between categories and VAD dimensions as annotated in our dataset and perform an error analysis on the predictions of the pivot method. Finally, we give some suggestions for future research directions.

Correlation between Categories and Dimensions
The point biserial correlation coefficient is used to measure correlation between a continuous and a binary variable. This allows us to assess the correlation between each emotion category (either 0 or 1, so the binary variable) and each one of the VAD dimensions (continuous). The results are shown in Table 8 (Tweets) and Table 9 (Captions).  Arousal is (weakly) positively correlated with anger and joy (Tweets: r = 0.08 and r = 0.20, respectively; Captions: r = 0.34 and r = 0.09). Sadness has a negative correlation with this dimension in Captions (r = −0.16). Strikingly, neutral has a notable negative correlation with arousal (r = −0.29 in Tweets and r = −0.34 in Captions). This goes a bit against our assumption that the neutral state is the center of the VAD space, although it is not completely counter-intuitive that neutral sentences were judged as having low arousal instead of medium arousal.
Contrary to what some studies claim [36], the dominance dimension seems more correlated with emotion categories than arousal. Especially with sadness, with which dominance is negatively correlated, the correlation is rather high (r = −0.46 in Tweets and r = −0.45 in Captions). In the Captions subset, fear and joy are rather highly correlated with dominance as well (r = −0.31 and r = 0.42, respectively).
The dimensional and categorical annotations in our dataset are thus correlated, but not for each dimension-category pair and certainly not always to a great extent. These observations do seem to suggest that a mapping could be learned. Indeed, various studies have already successfully accomplished this [19][20][21]. However, our goal is not to learn a mapping because then there would still be a need for annotations in the target label set. Instead, a mapping should be achieved without relying on any categorical annotation. The correlations shown in Tables 8 and 9 thus seem too low to directly map VAD predictions to categories through a rule-based approach, as was proven in the results of the presented pivot method.
For comparison, we did try to learn a simple mapping using an SVM. This is a similar approach as the one depicted in Figure 3, but now only the VAD predictions are used as input for the SVM classifier. Results of this learned mapping are shown in Table 10. Especially for the Tweets subset, results for the learned mapping are on par with the base model, suggesting that a pivot method based on a learned mapping could actually be operative. Apart from looking at correlation coefficients, we also try to visualise the relation between categories and dimensions in our data. We do this by plotting each annotated instance in the three-dimensional space according to its dimensional annotation, while at the same time visualising its categorical annotation through colours. Figures 5 and 6 visualise the distribution of data instances in the VAD space according to their dimensional and categorical annotations. On the valence axis, we clearly see a distinction between the anger (blue) and joy (green) cloud. In the negative valence area, anger is more or less separated from sadness and fear on the dominance axis, although sadness and fear seem to overlap rather strongly. In addition, joy and love show a notable overlap.
Average vectors per emotion category are shown in Figures 7 and 8. It is striking that these figures, although they are based on annotated real-life data (tweets and captions), are very similar to the mapping of individual emotion terms as defined by Mehrabian [12] ( Figure 1), although the categories with higher valence or dominance are shifted a bit more to the neutral point of the space. Again, it is clear that joy and love are very close to each other, while the negative emotions (especially anger with respect to fear and sadness) are better separated.    Although the average VAD per category values corresponds well to the definitions of Mehrabian [12], which are used in our mapping rule, the individual data points are very much spread out over the VAD space. This results in quite some overlap between the classes. Moreover, many (predicted) data points within a class will actually be closer to the center of the VAD space than it is to the average of its class. However, this is somewhat accounted for in our mapping rule by first checking conditions and only calculating cosine distance when no match is found (see Table 3). Nevertheless, inferring emotion categories purely based on VAD predictions does not seem efficient.

Error Analysis
In order to get some more insights into the decisions of our proposed models, we perform an error analysis on the classification predictions. We show the confusion matrices of the base model, the best performing multi-framework model (which is the meta-learner) and the pivot model. Then, we randomly select a number of instances and discuss their predictions.
Confusion matrices for Tweets are shown in Figures 9-11, and those of the Captions subset are shown in Figures 12-14. Although the base model's accuracy was higher for the Tweets subset than for the Captions, the confusion matrices show that there are less misclassifications per class in Captions, which corresponds to its overall higher macro F1 score (0.372 compared to 0.347). Overall, the classifiers perform poorly on the smaller classes (fear and love).
For both subsets, the diagonal in the meta-learner's confusion matrix is more pronounced, which indicates more true positives. The most notable improvement is for fear. Besides fear, love and sadness are the categories that benefit most from the meta-learning model. There is an increase of respectively 17%, 9% and 13% F1-score in the Tweets subset and one of 8%, 4% and 6% in Captions.
The pivot method clearly falls short. In the Tweets subset, only the predictions for joy and sadness are acceptable, while anger and fear get mixed up with sadness. In the Captions subset, the pivot method fails to make good predictions for all negative emotions.      To get more insights into the misclassifications, ten instances (five from the Tweets subcorpus and five from Captions) were randomly selected for further analysis. These are shown in Table 11 (an English translation of the instances is given in Appendix A). In all given instances (except instance 2), the base model gave a wrong prediction, while the meta-learner outputted the correct class. In particular, the first example is interesting, as this instance contains irony. At first glance, the sunglasses emoji and the words "een politicus liegt nooit" (politicians never lie) seem to express joy, but context makes us understand that this is in fact an angry message. Probably, the valence information present in the VAD predictions is the reason why the polarity was flipped in the meta-learner prediction. Note that the output of the pivot method is a negative emotion as well, albeit sadness. In three cases of the shown instances, the base model predicted emotions in the wrong polarity, which were, in turn, of the correct valence (or polarity) for the meta-learner and pivot method (instance 1, 6 and 8). Indeed, although the performance of the pivot method was low regarding macro F1 and accuracy, the cost-corrected accuracy (which takes the polarity into account) was reasonably good.
What is striking is that, out of the seven examples where the pivot failed to make a correct prediction (all instances except 7, 8 and 10), four examples would have been correctly classified based on gold VAD values. This could indicate that the main problem with the pivot method is not the mapping rule, but incorrect predictions made by the VAD regression model. To investigate this, the pivot experiments were repeated, but this time using gold VAD values instead of predicted ones. As can be observed in Table 12, results are now on par with the base classification model. This suggests that our pivot method could be efficient, provided that the VAD regression model is further improved upon.

Future Work
Our experiments showed that dimensional emotion representations can help in improving the performance of emotion classification models in the EmotioNL dataset. The pivot-based approach was not successful, although we found evidence that this method might be beneficial when the VAD regression model is further improved.
This leads to several suggestions for future work. First of all, we suggest that these methods are validated on other datasets and other languages than Dutch. Furthermore, we want to improve the VAD model by testing different model architectures and investigate its effect on the usability of the pivot method. The pivot method could then be investigated when used for mapping to label sets different than the one described in this paper. Finally, an interesting research direction could be to look at other modalities, e.g., facial emotion recognition (FER). A well-known problem in FER is the poor performance in real-time testing because of the bad quality of datasets [37]. It would be interesting to investigate whether dimensional representations might be of help here as well.

Conclusions
In emotion detection studies, researchers usually opt for either categorical or dimensional emotion frameworks. Our previous work on Dutch emotion detection showed that the classification of emotional categories is a very challenging task, but that a regression task for predicting valence, arousal and dominance achieves more promising results. In this paper, we have therefore investigated whether transferring information from the regression to the classification task can improve performance. Moreover, we have examined the potential of dimensional representations to be used as a pivot mechanism, which allows tailoring emotion frameworks to specific tasks and domains.
Our results reveal that dimensional representations can indeed boost the baseline emotion classification's performance, especially in a meta-learning setting. Moreover, while categories do benefit from the additional VAD information, the opposite does not hold, which further underlines the assumption that the VAD model is more robust than the classification model. The pivot method was not able to compete with the base model and revealed a substantial drop in performance. However, further inspection revealed that the rule-based mapping itself does perform on par with the base model when gold VAD values are used. This suggests that the pivot method could be efficient, provided that the VAD regression model is further improved. This opens up possibilities to tailor emotion frameworks to specific tasks and domains and thus broaden their application scope.