A Study of Cross-Linguistic Speech Emotion Recognition Based on 2D Feature Spaces

Abstract: In this research, a study of cross-linguistic speech emotion recognition is performed. For this purpose, emotional data of different languages (English, Lithuanian, German, Spanish, Serbian, and Polish) are collected, resulting in a cross-linguistic speech emotion dataset of more than 10,000 emotional utterances. Despite the bi-modal character of the databases gathered, our focus is on the acoustic representation only. The assumption is that the speech audio signal carries sufficient emotional information for the emotion to be detected and retrieved. Several two-dimensional acoustic feature spaces, such as cochleagrams, spectrograms, mel-cepstrograms, and a fractal dimension-based space, are employed as the representations of speech emotional features. A convolutional neural network (CNN) is used as a classifier. The results show the superiority of cochleagrams over the other feature spaces utilized. In the CNN-based speaker-independent cross-linguistic speech emotion recognition (SER) experiment, an accuracy of over 90% is achieved, which is close to the monolingual case of SER.


Introduction
The discovery of expressivity in speech is one of the unresolved challenges. Among the relevant topics related to expressivity are the issues of emotion recognition. The recognition of emotion is a part of sentiment analysis [1,2]. Emotions are expressed through both verbal and non-verbal processes. The information derived from the speech signal, facial expressions, gestures, eye contact, and physiological status (e.g., EEG, EMG, respiration signals) is considered in studies in the area of emotion recognition [3-5]. Emotion recognition based on human speech, however, has the most applications: not only the evident ones related to direct communication between people or human-computer interaction but also those concerning affective state recognition, e.g., recognizing social cognitive skills, diagnosing emotion deficits, psychotherapy, etc. The speech emotion recognition task is a classification task by nature: the unknown emotional speech pattern is assigned to a particular emotional class based on the expressive attributes (called features) in the speech signal. Therefore, the first stage of the recognition process is to search for and extract such features. In the next step, the extracted features are classified to identify the emotional class that the analyzed speech utterance belongs to. Most current studies report speech emotion classification rates of 70%-90% [6].

• Methods based on a two-pass classification scheme consisting of language identification and language-specific speech emotion recognition [14,41];
• Direct cross-linguistic emotion recognition using different language databases, analyzing the speech signal and recognizing emotions based on these features or on a parameter description that results from redundancy checking (e.g., based on principal component analysis, PCA) [13,42,43].
The speech features for emotion recognition can be grouped into several categories: local, global, continuous, qualitative, spectral, and TEO (Teager energy operator)-based features [8,10]. In our research, we have used four different 2D feature spaces: cochleagrams, spectrograms, mel-cepstrograms, and a fractal dimension-based space. A comparison of the results obtained in the conducted research with the results available in the literature is also made.
The recognition of emotion is generally performed using feature vector extraction and classical machine learning algorithms, including random forests, support vector machines, k-nearest neighbors, various types of neural networks, hidden Markov models, and their variations [33].
In recent years, deep-learning algorithms have brought new opportunities to machine learning. A very promising property of deep neural networks (DNNs) is that they can learn high-level invariant features from raw data, which is potentially helpful for emotion recognition [13,14,44-46]. Most of the experience in emotion classification, however, comes from speech recognition tasks. For example, a convolutional DNN delivers good results in the recognition of isolated Lithuanian words [47]. Therefore, we apply the same strategy in the current experiment while investigating archetypical primary emotions: anger, happiness, fear, sadness, and neutral [48]. These emotions are the most apparent and distinct in human life.
The paper is organized as follows: it starts with the data preparation and the details of the analysis methodology; then, the experimental results are presented and discussed. The concluding remarks include the future development of this research.
Table 1. Examples of classification performance for the task of speech emotion recognition on the datasets analyzed (CNN: convolutional neural network; kNN: k-nearest neighbor classifier; LSTM: long short-term memory network; SVM: support vector machine; RF: random forest classifier; RNN: recurrent neural network; GMM: Gaussian mixture model).

Methodology
In this study, we investigated the speech emotion classification based on deep learning. The classification is performed by providing the convolutional neural network (CNN) algorithm with two-dimensional feature maps extracted from emotional speech utterances.

Description of Feature Spaces
Although neural networks can operate with one-dimensional data, our experimental results have shown that a higher discriminative power can be obtained for the 2D feature spaces [48,49]. The feature maps are presented as matrices of short-time features converted into greyscale images. All the images obtained by applying specific feature space conversion were scaled to the same size. The following speech signal feature spaces were chosen to be investigated: spectrograms, mel-cepstrograms, cochleagrams, and fractal dimension-based features.
The conventional scheme of speech signal processing and feature extraction was implemented: the speech signal is segmented and windowed (Hamming window was applied), the feature vector is extracted for each segment, thus obtaining the sequence of feature vectors. Arrangement of these vectors into arrays gives us two-dimensional feature maps that can be converted to bitmap images. The images are the outcome of the speech signal analysis and are delivered into CNN to make a decision on the encoded emotion.
The analysis parameters were as follows: the length of the analysis frame was 512 samples; an overlap of 400 samples was applied with the truncation of the last frame.
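The framing step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `frame_signal` is hypothetical, and "truncation of the last frame" is read here as dropping trailing samples that do not fill a whole frame, which is an assumption.

```python
import numpy as np

FRAME_LEN = 512             # frame length in samples (from the text)
OVERLAP = 400               # overlap in samples (from the text)
HOP = FRAME_LEN - OVERLAP   # 112 samples between consecutive frame starts


def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    """Split a 1-D signal into overlapping, Hamming-windowed frames.

    Assumption: trailing samples that do not fill a complete frame
    are discarded (one reading of 'truncation of the last frame').
    Returns an array of shape (n_frames, frame_len).
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

With these parameters, consecutive frames start 112 samples apart, so a 1000-sample signal yields five complete frames.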

Spectrograms
A spectrogram is constructed from a series of short-time spectra of the signal frames. For this purpose, a discrete Fourier transform (DFT) was applied. This transform is given by the following formula:
$$X_k(m) = \sum_{n=0}^{N-1} x_k(n)\, w(n)\, e^{-j 2\pi m n / M},$$
where $X_k(m)$ ($m = 0, \ldots, M-1$) are the DFT coefficients of frame $k$, $M$ is the number of DFT coefficients, $x_k(n)$ are the samples of the analyzed $k$-th short-time segment, $N$ is the number of samples in the frame, and $w(n)$ is the window function.
In order to obtain a spectrogram with all values equally significant, a logarithm-based dynamic range compression of the magnitude spectrum $|X_k(m)|$ was applied, where $X_k(m)$ ($m = 0, \ldots, M-1$) are the Fourier transform coefficients of frame $k$, and $M$ is the number of these coefficients.
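As an illustration of the spectrogram construction, the sketch below computes a log-magnitude spectrogram from already windowed frames and maps it to a greyscale image. Several details are assumptions rather than the authors' settings: the natural logarithm with a small offset `eps`, the use of the one-sided spectrum (`rfft`), the FFT length of 512, and the min-max scaling to 8-bit grey levels. The function names are hypothetical.

```python
import numpy as np


def log_spectrogram(frames, n_fft=512, eps=1e-10):
    """Log-compressed magnitude spectrogram of windowed frames.

    frames: array of shape (n_frames, frame_len), already windowed.
    Returns an array of shape (n_fft // 2 + 1, n_frames), i.e.
    frequency along rows and time along columns.
    Assumption: natural log of |X| plus a small offset eps.
    """
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.log(np.abs(spectrum) + eps).T


def to_greyscale(feature_map):
    """Min-max scale a 2-D feature map to 8-bit grey levels (0..255)."""
    lo, hi = feature_map.min(), feature_map.max()
    if hi == lo:
        return np.zeros(feature_map.shape, dtype=np.uint8)
    scaled = (feature_map - lo) / (hi - lo)
    return np.round(255 * scaled).astype(np.uint8)
```

The same `to_greyscale` step could be applied to any of the other feature maps before resizing them to a common image size.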

Mel-Cepstrograms
In the spectrogram, linear frequency scaling is used. Meanwhile, the mel-frequency scale is a quasi-logarithmic scale that approximates the resolution of the human auditory system. The cepstrogram is created according to the classical calculation scheme: DFT-based spectrum is processed by a filter bank (modeling the human auditory system), then a logarithm-based transformation is applied with a subsequent cosine transform. In this research, we applied 13 triangular filters, thus obtaining a 13th-order mel-cepstrum. Cepstral vectors are organized into columns, thus obtaining 2D mel-cepstrograms, i.e., cepstral feature maps.
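The classical calculation scheme mentioned above (spectrum, triangular mel filter bank, logarithm, cosine transform) can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the common mel formula 2595·log10(1 + f/700), filters spanning 0 Hz to the Nyquist frequency, a power spectrum as filter-bank input, an unnormalized DCT-II, and a caller-supplied sample rate (not given in the text). All function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, center, right = hz_edges[i:i + 3]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct2(x):
    """Unnormalized DCT-II along the last axis."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * j + 1) / (2 * n))
    return x @ basis.T


def mel_cepstrogram(frames, sample_rate, n_filters=13, n_fft=512, eps=1e-10):
    """Mel-cepstral feature map of shape (n_filters, n_frames)."""
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return dct2(np.log(energies + eps)).T
```

With 13 filters, each column of the resulting map is a 13-element cepstral vector, matching the column-wise arrangement described in the text.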

Cochleagrams
Lyon claims that the spectrogram image has too few dimensions to be a full auditory image [50]. In contrast, the cochleagram has an additional axis orthogonal to the frequency axis, which provides such an image. In the literature, a cochleagram is a speech representation that shows how our brain processes information received from the ear [50,51]. It is also treated as a computational model of the peripheral auditory system [52].
The cochleagram is constructed by performing auditory filtering. In this study, a bank of gammatone filters covering the range of 20 Hz to 20 kHz is used.
The impulse response of the gammatone filter can be expressed by the following formula [53]:
$$g(t) = A\, t^{n-1} e^{-2\pi B t} \cos(2\pi f_0 t + \phi),$$
where $n$ is the filter order (we set $n = 4$), $\phi$ is the phase (we use the minimum phase), $B$ is the filter bandwidth, $f_0$ is the center frequency, $t$ is the time, and $A$ is the filter amplitude. The filter amplitude $A$ is described by the automatic gain control [54]. The filter bandwidth $B$ is calculated for each filter, and the center frequency values are spaced on the equivalent rectangular bandwidth (ERB) scale [55,56]. The number of filters used in this research is equal to 50.
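A rough sketch of a gammatone-based cochleagram is given below. It is not the authors' implementation, and several choices are assumptions: the bandwidth is taken as B = 1.019·ERB(f0) with the Glasberg-Moore ERB approximation 24.7·(4.37·f0/1000 + 1), the phase is set to zero rather than minimum phase, the amplitude A is replaced by a simple peak normalization instead of automatic gain control, and each cochleagram cell is the log mean energy of a filter output within a frame. The function names are hypothetical.

```python
import numpy as np


def erb_bandwidth(f0):
    """Glasberg-Moore ERB approximation in Hz (assumed bandwidth model)."""
    return 24.7 * (4.37 * f0 / 1000.0 + 1.0)


def erb_space(f_low, f_high, n_filters):
    """Center frequencies equally spaced on the ERB-rate scale."""
    to_erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    to_hz = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return to_hz(np.linspace(to_erb(f_low), to_erb(f_high), n_filters))


def gammatone_ir(f0, sample_rate, order=4, phase=0.0, duration=0.05):
    """Sampled gammatone impulse response, peak-normalized (A is assumed)."""
    t = np.arange(int(round(duration * sample_rate))) / sample_rate
    b = 1.019 * erb_bandwidth(f0)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) \
        * np.cos(2 * np.pi * f0 * t + phase)
    return g / np.max(np.abs(g))


def cochleagram(x, sample_rate, n_filters=50, f_low=20.0, f_high=20000.0,
                frame_len=512, hop=112, eps=1e-10):
    """Log frame energies of gammatone filter outputs.

    sample_rate must exceed 2 * f_high. Returns (map, centers), where
    map has shape (n_filters, n_frames).
    """
    x = np.asarray(x, dtype=float)
    centers = erb_space(f_low, f_high, n_filters)
    n_frames = 1 + (len(x) - frame_len) // hop
    rows = np.empty((n_filters, n_frames))
    for i, f0 in enumerate(centers):
        g = gammatone_ir(f0, sample_rate)
        n_fft = len(x) + len(g) - 1
        # Causal filtering via FFT-based linear convolution.
        y = np.fft.irfft(np.fft.rfft(x, n_fft) * np.fft.rfft(g, n_fft),
                         n_fft)[:len(x)]
        for j in range(n_frames):
            seg = y[j * hop:j * hop + frame_len]
            rows[i, j] = np.mean(seg ** 2)
    return np.log(rows + eps), centers
```

For a pure tone, the strongest row of the resulting map is expected near the filter whose center frequency is closest to the tone frequency.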

Fractal Dimension-Based Features
A fractal dimension-based analysis was employed for speech signal analysis considering the nonlinearity of emotional speech signals. Fractal dimension characterizes and differentiates the irregularity, self-similarity, and nonlinearity of different speech emotions. In this study, we selected the Katz, Castiglioni, and Higuchi fractal dimension-based features, whose effectiveness in classifying speech emotion was justified in a previous study [6].
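To illustrate what such features measure, the following sketch computes two of the named fractal dimension estimates for a single signal frame. It is a hedged illustration only: the Katz estimate is given in the amplitude-only form found in common open-source implementations, the Higuchi estimate uses an assumed maximum interval `k_max = 8`, the Castiglioni variant is omitted, and how the per-frame values are arranged into a 2D feature map is not specified in the text.

```python
import numpy as np


def katz_fd(x):
    """Katz fractal dimension (amplitude-only form; an assumption).

    FD = log(L / a) / log(d / a), with L the total curve length,
    a the mean step, and d the largest distance from the first sample.
    Undefined for constant signals.
    """
    x = np.asarray(x, dtype=float)
    steps = np.abs(np.diff(x))
    total_len = steps.sum()
    mean_step = steps.mean()
    extent = np.max(np.abs(x - x[0]))
    return np.log(total_len / mean_step) / np.log(extent / mean_step)


def higuchi_fd(x, k_max=8):
    """Higuchi fractal dimension: slope of log L(k) versus log(1/k)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    log_inv_k, log_len = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):
            n_steps = (n - 1 - m) // k
            if n_steps < 1:
                continue
            sub = x[m:m + n_steps * k + 1:k]   # every k-th sample from m
            curve = np.abs(np.diff(sub)).sum()
            lengths.append(curve * (n - 1) / (n_steps * k) / k)
        log_inv_k.append(np.log(1.0 / k))
        log_len.append(np.log(np.mean(lengths)))
    slope, _ = np.polyfit(log_inv_k, log_len, 1)
    return float(slope)
```

A straight line gives a dimension of 1 under both estimates, whereas white noise gives a Higuchi dimension close to 2, which is the kind of irregularity contrast these features are meant to capture.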

CNN Architecture
As mentioned earlier, the convolutional neural network class of DNNs was applied in this experiment. The feature dimensionality was selected according to its significance, the resulting images were scaled to the same size, and the input size of the classifier was thus fixed. The CNN topology is composed of three convolutional layers; after each of them, the "relu" activation function, max pooling of size 3 by 3 (2 by 2 in the last layer), and batch normalization were applied. Then, a flatten layer of size 64 with batch normalization was added, and the "relu" activation function was utilized. The last layer acts as an emotion classifier with the "softmax" activation function. The Adam optimization algorithm was used to train the network. The use of batch normalization, together with the additive white noise (described in the next section), helps to prevent network overfitting. A more detailed description of the network is presented in an earlier article by the authors [49].
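The layer sequence described above can be traced with a small shape calculation. The sketch below only tracks tensor shapes and does not train anything; the input size (128 by 128), the filter counts (32, 64, 128), "same" convolution padding, and non-overlapping pooling are all hypothetical values chosen for illustration, since the text does not state them.

```python
def pooled_size(size, pool):
    """Output length of non-overlapping max pooling (floor division)."""
    return size // pool


def cnn_shapes(input_hw=(128, 128), filters=(32, 64, 128), pools=(3, 3, 2),
               dense_units=64, n_classes=5):
    """List (layer name, output shape) pairs for the described topology.

    Assumptions: greyscale input, 'same' padding in the convolutions,
    hypothetical filter counts and input size.
    """
    h, w = input_hw
    shapes = [("input", (h, w, 1))]
    for i, (f, p) in enumerate(zip(filters, pools), start=1):
        shapes.append((f"conv{i} + relu", (h, w, f)))
        h, w = pooled_size(h, p), pooled_size(w, p)
        shapes.append((f"maxpool{i} {p}x{p} + batchnorm", (h, w, f)))
    shapes.append(("flatten", (h * w * filters[-1],)))
    shapes.append(("dense + batchnorm + relu", (dense_units,)))
    shapes.append(("softmax", (n_classes,)))
    return shapes


if __name__ == "__main__":
    for name, shape in cnn_shapes():
        print(f"{name:32s} {shape}")
```

Under these assumed sizes, the three pooling stages shrink a 128 by 128 map to 42, 14, and finally 7 pixels per side before the 64-unit layer and the five-class softmax output.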

Datasets
The datasets were created by imitating real-world principles to some extent: different language data were collected without any assumption or pre-emphasis applied to the signal. Moreover, the data were not normalized or balanced, i.e., different numbers of speakers, utterances, and sentences were employed for the different languages. In addition, all the data were augmented and processed automatically by employing unified algorithms. The primary assumption was that the data collection process should be similar to the one that is common for real-world services. However, it should clearly be stated that all datasets consist of emotional speech containing recordings of non-professionally acting speakers.
For this study, emotion datasets in six languages were chosen: Lithuanian (Lithuanian Spoken Language Emotions Database recordings) [16]; English, which consists of recordings from three independent databases, i.e., RAVDESS (24 professional female and male actors, vocalizing two lexically-matched statements in a neutral North American accent) [17], SAVEE (recordings from four male actors expressing seven different emotions) [18], and TESS (two English-native actresses) [19]; Polish, which includes the numbers and separate command word records of the Polish Emotional Speech Dataset [24,25]; Spanish (Spanish emotional speech synthesis database recordings) [22]; Serbian (Serbian emotional speech GEES database recordings) [21]; and German (Berlin emotional speech database recordings) [23].
These datasets share the same five archetypal primary emotions: anger, sadness, fear, neutral, and happiness [48]. In this study, the TESS, SAVEE, and RAVDESS datasets are merged in order to get one big English language dataset. Summary statistics of the datasets are given in Table 2. To augment the datasets (a CNN requires a large amount of data to learn from), the following two methods were used: white noise of seven levels (0 dB, 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, and 30 dB) was added, and Wiener filtering was applied, thus providing cleaned recordings. This enabled us to increase data variability, decrease overfitting, and expand our datasets (Table 2) as much as nine times.
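The noise-based part of the augmentation can be sketched as below. Interpreting the seven levels as signal-to-noise ratios is an assumption, as are the function names; the Wiener filtering step is not reproduced here. The count at the end shows one way to arrive at the ninefold expansion: the original recording, seven noisy versions, and one Wiener-filtered version.

```python
import numpy as np

# Noise levels from the text, read here as SNR values in dB (assumption).
SNR_LEVELS_DB = (0, 5, 10, 15, 20, 25, 30)


def add_white_noise(x, snr_db, rng):
    """Return x plus white Gaussian noise scaled to the requested SNR."""
    x = np.asarray(x, dtype=float)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return x + rng.standard_normal(len(x)) * np.sqrt(noise_power)


def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal relative to its clean version."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))


# Original + seven noisy copies + one Wiener-filtered copy per utterance.
VERSIONS_PER_UTTERANCE = 1 + len(SNR_LEVELS_DB) + 1
```

At 0 dB the noise has the same power as the speech, while at 30 dB it is a thousand times weaker, so the seven levels span heavily corrupted to nearly clean recordings.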

Experimental Results
The experimental investigation consisted of two parts. To make the experimental procedure easy to follow, its block diagram is shown in Figure 1.
As we see from Figure 1, in the first part of the experiment, emotion recognition based on the various feature spaces was carried out on separate databases. The goal of this part was to check the quality of training for each of the datasets used in the experiment. All the datasets were split according to the 80/20 rule (80% of the data were used for model training/validation and 20% for testing). In addition, the 80% portion was split according to the same scheme, i.e., 80% was used for model training and 20% for model validation. Test sets contain samples of the same emotion uttered by the same speaker, with a repetitive emotion sample recording. To evaluate the classifier performance, both the overall network accuracy and the averaged F1 score were calculated. The results are given in Table 3.
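The nested 80/20 rule and the averaged F1 score can be illustrated with the sketch below. It assumes a plain random split (no stratification or speaker grouping) and macro-averaging of the per-class F1 scores; neither detail is stated in the text, and the function names are hypothetical.

```python
import numpy as np


def nested_split(n_items, test_frac=0.2, val_frac=0.2, seed=0):
    """Random 80/20 test split, then 80/20 train/validation split.

    Returns index arrays (train, val, test). With the default fractions
    this yields 64% / 16% / 20% of the data.
    """
    order = np.random.default_rng(seed).permutation(n_items)
    n_test = int(round(test_frac * n_items))
    test, rest = order[:n_test], order[n_test:]
    n_val = int(round(val_frac * len(rest)))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (macro-averaging assumed)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```

For a dataset of 1000 utterances, this split gives 640 training, 160 validation, and 200 test items.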
Table 3 shows that cochleagrams enabled reaching an overall test accuracy (Test Acc.) and F1 score equal to 1, or 0.99 in the case of the Polish language. For spectrograms, the test values averaged around 0.89; for mel-cepstrograms, 0.87; and for fractal dimension-based features, 0.78. In addition, Table 3 shows the training accuracy (Training Acc.) and validation accuracy (Validation Acc.) values.
The results obtained from the first part of the experiments let us conclude that all datasets and all feature spaces can be used in the second part of the experiment. This part of the experiment tests the effectiveness of the signal feature spaces in a cross-linguistic speech emotion recognition task. To this end, the training and validation datasets are the same for all testing datasets (as described earlier). Two types of tests were investigated: language-dependent and language-independent. The language-dependent tests were performed on the 20% of data held out from each emotion database used in the network training. Test data were not utilized for network training and validation. For the language-independent test, we employed datasets that were not a part of the neural network training.
The classification results revealed the same tendency as in the first part of the experiments. The highest classification accuracies were achieved for the same feature space, i.e., cochleagrams. However, in terms of classification accuracy, the spectrogram-based representation also returns very high scores. The results of these two feature spaces are shown in Tables 4-7. The datasets that were used in the training process are highlighted in bold font. For the classification results presented in Table 4, the Lithuanian dataset was used for training and the language-dependent test. For spectrograms, the network training accuracy is 0.9636, and the validation accuracy is 0.8215. For cochleagrams, the training and validation accuracies are 0.9999 and 0.9681, respectively. The remaining datasets were used for the language-independent test.
The results obtained when the Lithuanian, English, and Serbian datasets were used for the network training are shown in Table 5. In this case, the network performance is as follows: for spectrograms, the training accuracy is 0.9628 and the validation accuracy is 0.8807; for cochleagrams, the training accuracy is 0.9967 and the validation accuracy is 0.9677.
The results shown in Table 6 include classification accuracies for the case of the Lithuanian, English, Serbian, and Spanish datasets used as training sets. For spectrograms, the training accuracy is 0.9462 and the validation accuracy is 0.8641; for cochleagrams, the training accuracy is 0.9623 and the validation accuracy is 0.9295.
The experiment with all datasets used for training was carried out as well. The results are given in Table 7.
The network training and validation accuracies are as follows (see Table 7): 0.9450 and 0.8623, respectively, for spectrograms, and 0.9587 and 0.9265, respectively, for cochleagrams.
Mel-cepstrograms and fractal dimension-based features yielded similar classification tendencies. As the classification results of these two feature spaces are slightly worse than those of spectrograms, we decided not to report them here.
There are additional studies on SER with which the performance of our investigation can be compared. Our investigation is similar, to some extent, to the work of Fu et al. [39] and Neumann and Vu [40].
Fu et al. [39] used the same emotion labels (angry, fear, happy, neutral, sad) and trained the model using combinations of various corpora (German, Chinese, and French). For emotion recognition, the authors used a neural network combining a one-dimensional CNN and a bi-directional long short-term memory (LSTM) network with an attention mechanism. The authors employed eleven different features concatenated into a one-dimensional 34th-order feature vector, in contrast to the two-dimensional feature spaces we investigated; most importantly, they focus on and report results for different training methods. Thus, as the conditions are different, it is not possible to directly compare the results reported by Fu et al. [39] with those obtained in this study. The results of both studies can be characterized as scattered. This is inevitable for cross-lingual speech emotion recognition.
Neumann and Vu [40] present results for cross-lingual and multilingual emotion recognition as well. The authors employ English and French speech corpora and an attentive convolutional neural network (ACNN), with the output of 26 logMel filter banks as input features. The categorical emotion labelling is focused on a binary classification task of arousal (low/high) and valence (negative/positive) of the emotional speech. These facts lead to the same conclusion: it is not possible to directly compare the results presented by Neumann and Vu [40] with those obtained in our study.
We would emphasize the following aspects of our study as advantages:
• Two-dimensional feature maps enable us to deliver temporal information in addition to the selected acoustic features of the emotion. This should preserve the global (suprasegmental) features of the emotion for the analysis in the network and enhance the decision process. A purely sequential frame-by-frame analysis of speech prevents this, as all suprasegmental information is lost.
• We extract the same features for all the languages we analyze. This analysis scheme does not require any merging, mixing, or any other joint processing of different language data. Furthermore, our feature maps are based on a single analysis technique, but we cannot rule out the probable need to join different analysis techniques to improve the discriminative power of the feature maps.
• We have analyzed six different languages, which makes our study truly multilingual. The results of the cross-lingual emotion recognition are not impressive, but they reveal very clear challenges for the future: the need for multilingual emotional speech data (especially for low-resource languages), the undefined feature systems for multilingual emotions, and the possible variation of the results for different languages.

Conclusions
With the same network architecture, the best training, validation, and test performance results were achieved by the cochleagram-based representation. This feature space enables us to reach an overall test accuracy and F1 score on separate databases equal to 1, or 0.99 in the case of the Polish language. In the cross-linguistic speech emotion recognition task, the test accuracy and F1 score vary from 0.88 to 0.96 when all datasets (Lithuanian, English, Serbian, Spanish, Polish, and German) are used as training sets.
It may be assumed that the best results in classification occurred for the cochleagram feature space because of one more dimension that it has, compared to other feature spaces. It may be that in this axis, which is orthogonal to the frequency axis, features associated with human emotions are contained. However, this statement requires a thorough analysis, which we should follow in future research. One additional suggestion in terms of the methodology would be to attempt to fuse a variety of the employed features as opposed to evaluating each feature space separately.
The experimental results indicated that emotions are language dependent. However, as the number of languages in the training set increases, the accuracy of language-independent classification (for a language not used in the training step) slightly increases. In future research, the authors plan to expand the number of datasets for different languages in order to determine whether high emotion recognition results can be achieved for languages not used in the training process.
There is an additional conclusion related to this research study regarding data preparation. We have used the Polish database in tests, as well. However, the problem was that tagging was performed only on words due to the lack of automatic annotation of this database. This resulted in the lower classification accuracies obtained for this language compared to other languages. Therefore, we came to the obvious conclusion: very short utterances do not contain attributes of the spoken emotion. Thus, the next step should be to work on a strategy of the multilingual parser to acquire additional data from the speech utterances available on the Internet and prepare them automatically for parameterization and classification.
Finally, a question arises whether a ground truth database may be formulated containing emotionally "loaded" utterances, utilizing such techniques as, e.g., crowdsourcing [57] applied for both "producing" emotions in speech as well as evaluating gathered utterances. Such an experiment may result in more reliable datasets for the in-depth training process.
In future research, we intend to explore other factors, e.g., speech phoneme properties (like duration, accent, energy), which may also affect the quality of recognition. We will investigate whether the considerable gap between emotion recognition for seen and unseen languages exists because of different language properties or because of non-stationary and nonlinear speech signal properties. In addition, for data augmentation, the combination of STFT-based and Hilbert-Huang transform (HHT)-based features may be used, as such an approach may improve the overall performance of a classifier [58]. Moreover, future research will investigate the problem of cross-language speech emotion recognition in terms of domain adaptation and transfer learning. It should be mentioned, however, that data from the different datasets were not pre-emphasized or balanced. This means that the task of transfer learning for emotion recognition is more difficult than it has to be because, apart from language, there are also differences in recording conditions and dataset representation.