Mapping Discrete Emotions in the Dimensional Space: An Acoustic Approach

Abstract: A frequently used procedure to examine the relationship between categorical and dimensional descriptions of emotions is to ask subjects to place verbal expressions representing emotions in a continuous multidimensional emotional space. This work chooses a different approach. It aims at creating a system predicting the values of Activation and Valence (AV) directly from the sound of emotional speech utterances, without the use of their semantic content or any other additional information. The system uses X-vectors to represent the sound characteristics of an utterance and a Support Vector Regressor for the estimation of the AV values. The system is trained on a pool of three publicly available databases with dimensional annotation of emotions. The quality of the regression is evaluated on the test sets of the same databases. Mapping of categorical emotions to the dimensional space is tested on another pool of eight categorically annotated databases. The aim of the work was to test whether, in each unseen database, the predicted values of Valence and Activation place emotion-tagged utterances in the AV space in accordance with expectations based on Russell's circumplex model of affective space. Due to the great variability of speech data, clusters of emotions create overlapping clouds. Their average location can be represented by centroids. A hypothesis on the position of these centroids is formulated and evaluated. The system's ability to separate the emotions is evaluated by measuring the distance between the centroids. It can be concluded that the system works as expected and that the positions of the clusters follow the hypothesized rules. Although the variance in individual measurements is still very high and the overlap of emotion clusters is large, it can be stated that the AV coordinates predicted by the system lead to an observable separation of the emotions in accordance with the hypothesis. Knowledge from the training databases can therefore be used to predict AV coordinates of unseen data of various origins. This could be used to detect high levels of stress or depression. With the appearance of more dimensionally annotated training data, systems predicting emotional dimensions from speech sound will become more robust and usable in practical applications in call-centers, avatars, robots, information-providing systems, security applications, and many more.


Introduction
According to Scherer's component process definition of emotion [1], vocal expression is one of the components of emotion fulfilling the function of communication of reaction and behavioral intention. It is therefore reasonable to assume that some information on the speaker's emotion can be extracted from the speech signal.
We dared to call our article "Mapping discrete emotions into the dimensional space: An acoustic approach", paraphrasing the title of the work [2], to draw attention to the fact that many authors attempt to identify the relationship between categorical and dimensional descriptions of emotions by trying to place a verbal term (label) expressing an emotion, i.e., the name of the category, in the dimensional space ([2,3], and others). This position could then wrongly be taken as typical also for the vocal (acoustic) realizations of speech utterances under the particular emotion. Evaluation of word terms designating emotions is a different task from evaluation of the emotion contained in the sound of speech utterances; nevertheless, a correlation between the placement of emotion labels and the placement of the respective emotional utterances can intuitively be assumed. This work presents a system capable of predicting continuous values of Activation and Valence from the acoustic signal of an utterance and thus finding a position in the AV space for the emotion presented vocally in the particular segment of speech.
Affect, in psychology, refers to the underlying experience of feeling, emotion, or mood [4]. The AV space can be used to represent affective properties not only of emotional but also of stressful, insisting, warning, or calming speech, vocal manifestations of a physical condition, such as pain, or a mental condition, such as depression. Coordinates in the AV space can be used to map and compare different types of affective manifestations. For example, one can try to use emotional databases to train a speech stress indicator or an anxiety and depression detector. This work offers a system for predicting such coordinates from the sound of an emotional utterance. However, it must always be kept in mind that representation in two-dimensional space greatly reduces the affective (and acoustic) information, and the functionality of such indicative mapping must always be well verified with respect to the needs of the application.

Discrete (Categorical) versus Dimensional (Continuous) Characterization of Emotions
The properties of emotions are usually described either categorically, by assigning the emotion to one of the predefined categories or classes, or dimensionally, by defining the coordinates of the emotion in a continuum of a multidimensional emotional space [5]. Affective states (i.e., emotion, mood, and feeling) are structured in two fundamental dimensions: Valence and Arousal [6]. Russell has proposed a circumplex model of affect and has categorized verbal expressions in the English language in the two-dimensional space of Arousal-Valence (AV) [3]. The degree-of-arousal dimension is also called activation-deactivation [7], or engagement-disengagement. In this work, we adopt this two-dimensional approach.
As all three dimensionally annotated databases call the dimensions Activation and Valence, we use this terminology from now on and neglect the difference between the terms Arousal and Activation. The term Arousal will be used only when referring to Russell's work.
In many application scenarios, such as automatic voice information services, avatars, or customer services, it would be useful to have an estimate of the emotion or stress in the speaker's voice available. The system could then take the affective state of the customer into account and adapt the mode of communication.

Issues in Predicting Emotional Dimensions from the Sound of an Utterance
The possibilities of the human articulation system are physiologically limited. The acoustic cues of emotions are highly non-specific; the vocal realization of an utterance can be very similar in the presence of different emotions. Affective states form a continuum, and dividing emotions into disjoint classes is an extreme oversimplification. Real emotions are complex; they almost never appear in pure form but rather in mixtures. The meaning of terms describing emotions is ambiguous and culturally and linguistically dependent. Projections of various utterances into the AV space cannot be expected to be well separable with respect to emotion category. However, certain trends in their placement can be expected.
As noted by Gunes and Schuller [5], Activation is known to be well accessible in particular by acoustic features, while Valence, or positivity, is known to be well accessible by linguistic features. Estimating Valence from the sound itself can therefore be particularly challenging. Oflazoglu and Yildirim [8] even claim that the regression performance of their system for the Valence dimension is low and that "This result indicates that acoustic information alone is not enough to discriminate emotions in Valence dimension" ([8], p. 9).
A special issue is that very little is known about the mutual dependency of the dimensions of the emotional space [9,10]. The authors of this research have noticed that it is very hard for annotators to evaluate Valence independently of Activation when semantic information is unavailable. Emotions with low activation are often assigned Valence values in the center of the range.
Activation and Dominance show even higher interdependencies. In the analysis of their Turkish emotional database, Oflazoglu and Yildirim [8] show in Figure 8 of their paper the distribution of Activation and Dominance, which appears as a narrow cloud lying on the diagonal, indicating a strong dependence between the ratings of the Activation and Dominance dimensions. Nevertheless, extending the representation of the space to three dimensions (Activation, Valence, Dominance) can help to differentiate emotions (for example, to distinguish Anger from Fear). In this work, Dominance is not addressed.
Ekman argued that emotion is fundamentally genetically determined, so that facial expressions of discrete emotions are interpreted in the same way across most cultures or nations [11,12]. However, the inner image of emotion in a person's mind and the idea of how it is to be presented in speech depend to a large extent on the person's experience and education, and on the culture in which they live. Lim argues that culture constrains how emotions are felt and expressed and that cross-cultural differences in emotional arousal level have consistently been found: "Western culture is related to high-arousal emotions, whereas Eastern culture is related to low-arousal emotions" [12]. In this work, we examine the vocal manifestations of emotions in four Western languages (English, German, Italian, and Serbian) and, as a first approximation, consider the task of automatic prediction of Activation and Valence from sound to be culture independent. One of the results of this work may thus be information on whether the proposed approach also works on languages other than the one it was trained on.
The biggest problem is that there is no ground truth information available. One has to rely on the values estimated by annotators and consider them as ground truth. However, the number of annotators is often small and the reliability of the evaluation is debatable.
The available emotional speech databases were designed for various purposes, which also means they differ in methodology and annotation convention, instructions to annotators, choice of emotional categories, or even language. Moreover, the annotation of emotions was often done with the help of video, face and body gestures, text, or semantic information. This information may be absent (not reflected) in the sound modality. The sound-based predictor then misses this information in the training process.
Another problem is the small volume and limited representativeness of the data available for emotional training. To obtain as much data as possible for regressor training and to cover more variability, three publicly available databases with annotated Activation and Valence (AV) dimensions were combined into one pool.
Different emotional databases contain different sets of emotions. In this work, only the emotions that occur in the majority of the available emotional databases are addressed, namely, Angry, Happy, Neutral, and Sad.
The differences in definitions, methodology, and conditions of creation of individual databases have to be taken into account when evaluating the reliability and informative value of the obtained results.

Hypothesis
Emotional space is a multidimensional continuum. The cues of emotions in the voice are highly non-specific. Emotions are often present in mixtures, and the meaning (inner representation) of the emotional terms in both speakers and raters is culture dependent. So, the areas into which the individual realizations of emotions are projected in the dimensional space largely overlap. Nevertheless, we assume that the centroids of the clusters of points to which the utterances are projected in the AV space should meet certain expectations considering their emotion category.
In order to illustrate the expected distribution of emotions in the AV space, we present in Figure 1 the placement of the stimulus words Anger, Happy, and Sad in the space of pleasure-displeasure and degree of arousal according to Russell [3]. Neutral emotion was not addressed in his work. For simplicity, it can be assumed that Neutral emotion should be located at the origin of the coordinate system.
Figure 1. Placement of the stimulus words Anger, Happy, and Sad in the space of pleasure-displeasure (x-axis) and degree of arousal (y-axis) according to Russell [3].
Due to the various sources of uncertainty in dimension prediction and the early phase of research, the hypothesis can only be formulated very vaguely. Our working hypothesis is that when predicting the values of Activation and Valence, the centroid of the Angry emotion utterances cluster should have a higher Activation value and a lower Valence value than the centroid of Neutral utterances. The centroid of the Happy emotion utterances cluster should have a higher Activation value and a higher Valence value than the centroid of Neutral utterances. Sad emotion is less pronounced, and its centroid may lie close to the Neutral utterances; in any case, it should have observably lower overall Valence than Neutral and considerably lower Arousal than the Angry emotion.

Training Databases
Three databases were available to the authors, in which values of Activation and Valence were annotated. Each of these three "training databases" was randomly divided into its training set (90% of data) and test set (remaining 10%). This ratio was chosen to preserve as much training data as possible.
IEMOCAP [13]. The Interactive Emotional Dyadic Motion Capture database is an acted, multimodal, and multispeaker database in English (10 speakers, 10,000 utterances). It contains 12 h of audiovisual data. The actors perform improvisations or scripted scenarios. The IEMOCAP database is annotated by multiple annotators with categorical labels, such as anger, happiness, sadness, and neutrality, as well as dimensional labels: Valence, Activation, and Dominance.
MSP-IMPROV [14]. The MSP-IMPROV corpus is a multimodal emotional database in English (12 speakers, 8500 utterances). Pairs of actors improvised emotion-specific situations. Categorical labels, such as anger, happiness, sadness, and neutrality, as well as the dimensional labels Valence, Activation, and Dominance, are provided.
VaM [15]. The database consists of 12 h of audio-visual recordings of the German TV talk show Vera am Mittag (47 speakers, 1000 utterances). This corpus contains spontaneous and emotional speech in German recorded from unscripted, authentic discussions. The emotion labels are given on a continuous valued scale for three emotion primitives: Valence, Activation, and Dominance.
Recognizing emotions from facial expressions is a common research topic nowadays (see e.g., [16,17]), and categorical annotation is often based on facial expressions. A part of the VaM database, "VaM Faces", includes such a categorical annotation of emotion based on the facial expression, which can be linked to the corresponding speech utterance. However, this information is available only for a very small number of utterances, and the emotion information contained in the facial expression may not be present in the vocal presentation. Therefore, this categorical annotation of VaM was not used in this work.
The AV dimensions in all three databases were annotated using a five-point Self-Assessment Manikin (SAM) scale [18]. The final rating is the mean of the ratings of all raters. The values on the AV axes were mapped to the range from 1 to 5 in this work.
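As an illustration of this annotation convention, the following minimal sketch (with made-up ratings) computes the final per-utterance value as the mean over raters on the 1-5 SAM scale:

```python
import numpy as np

# Hypothetical SAM ratings: rows = utterances, columns = raters (scale 1-5).
valence_ratings = np.array([[2, 3, 2],
                            [5, 4, 4],
                            [1, 2, 1]])

# Final ground-truth Valence per utterance: the mean over all raters.
valence_gt = valence_ratings.mean(axis=1)
print(valence_gt)  # [2.33 4.33 1.33], already within the 1-5 range
```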
In addition to training on individual databases, we also trained on a mixture of all three databases, which we will refer to as MIX3, and on a mixture of two larger databases, IEMOCAP and MSP-IMPROV, which we will call MIX2.

System Architecture
In the areas of applied machine learning, such as text or vision, embeddings extracted from discriminatively trained neural networks are the state of the art. They are now also used in speaker recognition [29]. The approaches that have been successfully applied in speaker recognition are often adopted in emotion recognition (see e.g., [30][31][32]).

X-Vector Approach to Signal Representation
The approach used in this work is based on neural network embeddings called X-vectors [29]. The X-vector extractor is based on Deep Neural Networks (DNN), and its training requires large amounts of training data. Ideally, the training data should also include information describing emotions. However, to the knowledge of the authors of this work, no sufficiently large training database with annotated emotions, suitable for training an emotion-focused extractor from scratch, is available.

X-Vector Extractor Training Phase
The X-vectors generated by an extractor trained on speaker verification datasets provide primarily information on speaker identity. However, it was shown that they can also serve as a source of information on the age, sex, language, and affective state of the speaker [33]. Therefore, the X-vector extractor was trained on the speaker verification databases VoxCeleb [34] (1250 speakers, 150,000 utterances) and VoxCeleb2 [35] (6000 speakers, 1.1 million utterances). The volume of training data was further augmented using reverberation and noising [36]. The feature extraction module transforms sound into representative features: 30-dimensional Mel Frequency Cepstral Coefficients (MFCCs) with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 s [29]. An energy-based Voice Activity Detector (VAD) was used to filter out silence frames. The result of the training is a DNN (the X-vector extractor model). In the X-vector extraction process, an MFCC feature matrix is fed to the input of this DNN, and an X-vector of size 512 is output.
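For readers who want to reproduce this stage without training an extractor from scratch, a publicly available pretrained VoxCeleb X-vector model can provide comparable 512-dimensional embeddings. The sketch below uses SpeechBrain; the model name and API are assumptions about that library, not the authors' own extractor:

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained VoxCeleb X-vector extractor from SpeechBrain: an assumed public
# analogue of the extractor described above, not the authors' model.
extractor = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_xvect")

signal, fs = torchaudio.load("utterance.wav")  # a 16 kHz mono utterance
xvector = extractor.encode_batch(signal)       # tensor of shape (1, 1, 512)
```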

Regression Model Training Phase
The training and test sets for regression are organized in pairs representing particular utterances: an X-vector and the corresponding value of the perceived Valence (for the Valence regressor) or Activation (for the Activation regressor). The Scikit-learn library was used for training of the Support Vector Regressor (SVR) [37]. Default settings were used for the SVR.
The regression models trained in this phase are able to predict the value of Valence or Activation, respectively, from the input X-vector representing the incoming utterance.
Various types of regressors were tested: AdaBoost regressor, Random Forest regressor, Gradient Boosting regressor, Bagging regressor, Decision Tree regressor, K-neighbors regressor, and Multi-layer Perceptron regressor, but none of them gave consistently better results than Support Vector Regressor.
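A minimal sketch of this regression training step, assuming the X-vectors have already been extracted and paired with the mean annotator ratings (placeholder random data stands in for the real training pool):

```python
import numpy as np
from sklearn.svm import SVR

# Placeholder data standing in for the real training pool:
# 512-dimensional X-vectors and perceived AV ratings on the 1-5 scale.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 512))       # X-vectors
y_valence = rng.uniform(1, 5, size=1000)     # perceived Valence
y_activation = rng.uniform(1, 5, size=1000)  # perceived Activation

# One SVR per dimension; default hyperparameters, as stated above.
svr_valence = SVR().fit(X_train, y_valence)
svr_activation = SVR().fit(X_train, y_activation)
```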

Prediction Phase
In the prediction phase, the utterances from the pool of test databases undergo X-vector extraction followed by prediction of the Valence and Activation values. The result is a pair of values indicating the coordinates of each utterance in the AV space.
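In code, this phase reduces to applying the two fitted regressors to each incoming X-vector; a sketch reusing the regressor names from the training sketch above:

```python
import numpy as np

def predict_av(xvectors, svr_activation, svr_valence):
    """Map a batch of 512-dim X-vectors to (Activation, Valence) pairs.

    svr_activation and svr_valence are fitted regressors as in the training
    sketch above; each row of the result places one utterance in the AV plane.
    """
    return np.column_stack([svr_activation.predict(xvectors),
                            svr_valence.predict(xvectors)])
```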

Overall Architecture
The overall architecture of the system is shown in Figure 2. As described in Section 4.1, the whole process has three phases. In the first phase, we trained the X-vector extractor (the X-vector model) on large speaker verification databases. In the second phase, we trained regressors for Valence and Activation on dimensionally annotated databases. In the third phase, the prediction of AV dimension values for the addressed emotion categories in the categorically annotated test databases was performed. In a real-world application, the test databases in the prediction phase would be replaced by a live speech audio input.

Visualization of Results
The results are presented in the form of figures and tables. The figures show the positions of utterances in the AV plane. The Seaborn statistical data visualization library [38] was used for visualization. Due to variability, the utterances belonging to one emotion in a certain database create clouds, or clusters, in the AV space. The center of gravity of each cluster, its centroid, is marked with a small circle of the corresponding color. The clusters are depicted in the form of clouds with contour lines representing iso-proportion levels. The graphs were plotted using the kdeplot function, with the lowest iso-proportion level at which to draw a contour line set to 0.3 [39].
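A sketch of this visualization with synthetic stand-in data (the column names are illustrative, not taken from the paper's code):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for predicted AV coordinates of emotion-tagged utterances.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Valence": rng.normal(3.0, 0.5, size=400),
    "Activation": rng.normal(3.0, 0.5, size=400),
    "Emotion": rng.choice(["Angry", "Happy", "Neutral", "Sad"], size=400),
})

# Contour clouds with the lowest iso-proportion level set to 0.3.
ax = sns.kdeplot(data=df, x="Valence", y="Activation", hue="Emotion", thresh=0.3)

# Centroid of each cluster: its mean position, drawn as a small circle.
for emotion, group in df.groupby("Emotion"):
    ax.plot(group["Valence"].mean(), group["Activation"].mean(), "o")
plt.show()
```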

Ground Truth-Original AV Values Indicated by Annotators
The original AV values indicated by annotators (perceptual Activation and perceptual Valence values) are considered in our work as ground truth. Figure 3 presents the emotions as they were rated in the original annotations. As various corpora contain different sets of emotions, only the four emotions present in all databases were chosen for comparison: Angry, Happy, Neutral, and Sad.
The granularity of the IEMOCAP data is caused by the fact that there were very few annotators. It can be seen that the layout of the centroids of emotion clusters is similar for IEMOCAP and MSP-IMPROV. The graph for the original VaM annotation is absent, as VaM does not include annotation of emotion categories for the vocal modality.


Regression Evaluation-AV Values Estimated on Combinations of the Test Sets
Comparing the figures, it can be seen how the knowledge from the annotated values in the datasets (Figure 3) is reflected in the predicted values on the test set (Figure 4).
It can be seen that the distances of the centroids are considerably reduced. Either the scales are transformed, or the resolution, i.e., the ability to separate the emotions, was influenced by the regression. This can be caused by the fact that the training set does not include samples representing the whole AV plane; for some values it has many realizations, while for others they are completely missing. It is neither representative nor balanced.
As it is not sufficient to validate the regressor just from the figures, the Concordance Correlation Coefficient (CCC) and the Mean Absolute Error (MAE) were used as regression quality measures to compare annotated and predicted values of Activation and Valence.
CCC is a correlation measure that was used, for instance, in the OMG-Emotion Challenge at the IEEE World Congress on Computational Intelligence in 2018 [39].

Let $N$ be the number of testing samples, $\{y_i\}_{i=1}^{N}$ be the true Valence (Arousal) levels, and $\{\hat{y}_i\}_{i=1}^{N}$ be the estimated Valence (Arousal) levels. Let $\mu$ and $\sigma$ be the mean and standard deviation of $\{y_i\}$, respectively; $\hat{\mu}$ and $\hat{\sigma}$ be the mean and standard deviation of $\{\hat{y}_i\}$, respectively; and $\rho$ be the Pearson correlation coefficient between $\{y_i\}$ and $\{\hat{y}_i\}$. Then, the CCC is computed as:

$$CCC = \frac{2\rho\sigma\hat{\sigma}}{\sigma^2 + \hat{\sigma}^2 + (\mu - \hat{\mu})^2}$$

CCC is still being used by many authors together with the traditional error measure MAE:

$$MAE = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $N$ stands for the total number of data points.
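Both measures are straightforward to implement; a sketch following the definitions above:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient per the formula above."""
    mu, mu_hat = y_true.mean(), y_pred.mean()
    sigma, sigma_hat = y_true.std(), y_pred.std()
    rho = np.corrcoef(y_true, y_pred)[0, 1]  # Pearson correlation
    return (2 * rho * sigma * sigma_hat
            / (sigma**2 + sigma_hat**2 + (mu - mu_hat)**2))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.abs(y_true - y_pred).mean()
```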
The results of further experiments, evaluating the regression quality with various training and test sets by means of CCC and MAE, are presented in Table 2. The SVR trained on MIX2 gives in general slightly better results than the SVR trained on MIX3, i.e., on all three datasets. This may indicate that the vocal manifestation of emotions in VaM is less pronounced and less prototypical; the data and the annotation may differ more from the other two databases. Moreover, VaM is in German, while IEMOCAP and MSP-IMPROV contain English speech.
The results also show that the model obtained by training on a mixture of databases is more universal and achieves better results on the mixed test set. In some cases, it also achieves better results for individual databases than a model trained on their own training set.
Both CCC and MAE show that the quality of prediction is better for Activation than for Valence, which is in line with the observation of Oflazoglu and Yildirim [8].

Cross-Corpus Experiments, AV Values Estimated by Regression on "Unseen" Corpora
In these experiments, the utterances from the categorically annotated emotional speech corpora are input to the AV predictor. The result is represented by predicted values of Activation and Valence for each utterance.
Cross-corpus emotion recognition has been addressed by many works, but most of them focus on a categorical approach or try to identify to which quadrant of the AV space an utterance belongs (see e.g., [40]). Our approach tries to predict continuous values of the AV dimensions. Figure 5 presents clusters of emotions, estimated by the regressor trained on MIX2 and tested on different unseen emotional corpora. Experiments were also performed with MIX3, but the regressor using MIX2 performed better (Table 3).
Based on the figures, it is now possible to try to interpret the results obtained by the regressor on the corpora with annotated emotion categories.
The results of the EmoDB database confirm the observation that it contains strongly prototypical emotions [41]. The overlap of emotion clusters is smaller compared to other corpora. The clusters are significantly more differentiated, especially on the Activation axis, which suggests that the actors performed full-blown emotions with a large range of arousal.
The results of the EMOVO database suggest, as also observed in other databases, that Valence for Sad does not reach values as low as expected. The Sad cluster is located on the Valence axis even further towards higher values than the Neutral emotion cluster. According to the predicted AV values, the sound realization of Sad utterances seems to be hardly distinguishable from that of Neutral ones in this database. It can be speculated that one of the possible sources of variance may be inter-cultural difference, as the regressor was trained on English databases and EMOVO is Italian, but this possibility would need more extensive research.
The CREMA-D, RAVDESS, eNTERFACE, and JL Corpus databases give roughly the expected results (see Section 2.2), although the cluster differentiation is relatively small. The centroids of Sad in CREMA-D and the JL Corpus have positions on the Valence axis similar to that of Neutral. The eNTERFACE database does not contain the Neutral emotion, so the other three emotions cannot be compared to it.
Although the differentiation of clusters is not pronounced for the SAVEE database, it basically meets the expected trends. The exception is again the Sad emotion, which has a higher mean value of Activation than one might expect and approximately the same mean value of Valence as the Neutral emotion.
In the Canadian TESS database, the mutual placement of the Angry, Happy, and Neutral emotions is fully in line with the hypothesis. However, the centroid of the Sad cluster again achieves higher values of Activation and Valence than expected.
GEES is a Serbian database intended for speech synthesis, which means that the prototypical emotions are presented very clearly and with high intensity. Therefore, the emotion centroids are placed at the expected positions. It is no surprise that these positions are practically identical to those of another highly prototypical database, the German EmoDB.

Centroid Distance as a Measure of Regression Quality
In the following experiment, the distance between the centroids of the Angry and Happy emotion clusters on the Valence axis (for Valence regression) and the distance between the centroids of the Angry and Sad emotion clusters on the Activation axis (for Activation regression) were taken as an ad hoc objective measure of the ability of the regressor to differentiate between emotions. The evaluation of the regression quality using distances between centroids is presented in Table 3.
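A sketch of this ad hoc measure, assuming a table of predicted AV values with one row per utterance (the column names are illustrative):

```python
import pandas as pd

def centroid_distances(df):
    """Separation measures from the text: the Angry-Happy centroid distance
    on the Valence axis and the Angry-Sad centroid distance on the Activation
    axis. df is assumed to have columns 'Emotion', 'Valence', and
    'Activation' holding the predicted values."""
    centroids = df.groupby("Emotion")[["Valence", "Activation"]].mean()
    d_valence = abs(centroids.loc["Angry", "Valence"]
                    - centroids.loc["Happy", "Valence"])
    d_activation = abs(centroids.loc["Angry", "Activation"]
                       - centroids.loc["Sad", "Activation"])
    return d_valence, d_activation
```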
The two regressors have similar results, but in 15 of 20 cases the one trained on MIX2 (without VaM) has better resolution, and in two cases the results were the same. So, the conclusion could be that adding VaM data to the training set does not improve the universality of regression models and slightly degrades the performance of the regressors.
As was said in Section 3.1, due to the small amount of data in the corpora, we have allocated only 10% of the data for regression quality testing. To evaluate the possible impact of test data selection, we performed a 10-fold regression test on the "winning" mixture MIX2. The results of the individual folds showed only negligible differences, with very low standard deviations both for Valence and Activation (see Table 4), and confirmed that 10% of the data is in this case a sufficiently representative sample for testing.
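One way to reproduce such a fold-stability check with scikit-learn (placeholder data again stands in for the MIX2 pool; the CCC function from the sketch above is redefined here so the snippet runs on its own):

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

def ccc(y_true, y_pred):
    # Concordance Correlation Coefficient, same definition as above.
    rho = np.corrcoef(y_true, y_pred)[0, 1]
    return (2 * rho * y_true.std() * y_pred.std()
            / (y_true.var() + y_pred.var()
               + (y_true.mean() - y_pred.mean())**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 512))   # placeholder X-vectors (MIX2 pool)
y = rng.uniform(1, 5, size=500)   # placeholder Valence ratings

folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVR(), X, y, cv=folds, scoring=make_scorer(ccc))
print(f"CCC over folds: mean={scores.mean():.3f}, std={scores.std():.3f}")
```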

Overall Picture of Emotion Positions in the AV Space
We displayed emotion centroids for each database in one figure to assess whether the same emotion category from different databases has a similar location in the AV space, and whether that location corresponds to the hypothesized positions ( Figure 6).
Centroids of the Angry, Happy, and Neutral emotion clusters form well-distinguishable groups located in the AV space in an expected manner. This fact confirms that the system can evaluate the position of the perceived emotion in the AV space from the sound of utterances.
However, the group of the Sad emotion shows considerable variance and largely overlaps with the Neutral emotion. Sad utterances from some of the databases also achieve higher Valence values than expected.

Discussion and Conclusions
Due to the small volume and small number of training databases, the "ground truth" data are very sparse and unreliable. They cover only a small fraction of the variety of possible manifestations of emotions in speech. Moreover, the training data are not available for all parts of the AV plane, and the frequencies of occurrence of training samples representing different points of the AV space are far from balanced. A substantial part of the data belongs to the less intensely expressed emotions, which hardly differ from neutral speech. Examples of intense emotions, with extremely low or high Valence and Activation values, are rare. This also leads to a certain narrowing of the range of predicted AV values, which is well observable when comparing the positions of the emotional category centroids from annotator ratings in Figure 3 with the positions of the respective centroids estimated by the regressor in Figure 4.
It is not possible to make general statements about the absolute position of individual emotions in the AV space, but it is reasonable to evaluate their relative position.
From the results obtained by the proposed system, it can be seen that in general Anger has higher Activation and lower Valence, and Happy has higher Activation and higher Valence, than the Neutral emotion. Valence predicted by the proposed system for Sad utterances does not reach such low values as could be expected with respect to the values in original annotations (Figure 3) of the training databases and with respect to Russell's circumplex model. A valuable observation is that, despite the fact that the training data were in English, the emotions from the German, Serbian, and Italian databases were also placed in accordance with the hypothesis.
Due to the variety of sources of uncertainty in speech data and non-specificity of vocal cues of emotion, the clusters of emotions acquired by regression are close to each other and they overlap considerably. However, centroids of corresponding emotion clusters from various unseen databases form observable groups, which are well separable for Angry-Happy-Sad and Angry-Happy-Neutral triplets of emotions. The locations of these groups in the AV space correspond to hypothesized expectations for Angry, Happy, and Neutral emotions.
Some models (e.g., LSTM model as presented by Parry et al. [42] in Figure 2a of their paper) seem to be more successful in determining the affiliation of utterances to individual databases than in identifying emotions. This only confirms the fact that the utterances reflect various technical and methodological aspects of the design of databases, cultural and linguistic differences, and the like. It is therefore difficult to identify emotions from acoustic characteristics of voice. However, we have proven in our experiments that measurement of coordinates of speech utterances in emotional space is in principle feasible, but the resolution and the ability to differentiate various emotions is better for high-activity emotions (Angry-Happy), than for low-activity ones (Sad-Neutral). This may be caused by technical aspects of the solution, but also by the lack of reliable training data, inconsistencies in annotation, diversity of inner psychological interpretation of emotional categories, cultural and linguistic differences, and differences in methodology. At the same time, however, it is highly probable that the sound of speech expressing low-activity emotions contains much less marked distinctive features and is very similar to neutral speech.
In the meantime, the authors have obtained access to an additional dimensionally annotated database, the OMG-Emotion Behavior Dataset [39], so one of the future steps will be analyzing, processing, and incorporating this dataset into the training database pool. Other areas of possible improvement are: fine-tuning of the X-vector extractor for the emotion recognition task, experimenting with combinations of different analysis timeframes, experimenting with various representative features, as well as experiments with new machine learning algorithms and regressor architectures. Normalization of the axis scales and finding the position of the origin (center) of the AV space also need to be addressed.
The research on the measurement of AV dimensions from speech sound is in its infancy: the predicted values have high variance, and the ranges and units of the dimension axes are not well defined. However, with new databases, an increasing volume of training data, more precise and representative annotation, and improved regression techniques, it will certainly be possible to achieve significantly higher accuracy and better applicability of AV dimension estimation. Such a system could be used in practical applications in call-centers, avatars, robots, information-providing systems, security applications, and many more.
The designed regressor is currently utilized for Valence prediction in a stress detector from speech in the Air Traffic Management security tools developed in the European project SATIE (Horizon 2020, No. 832969), and in a depression detection module developed in the Slovak VEGA project No. 2/0165/21.

Conflicts of Interest:
The authors declare no conflict of interest.