Applied Sciences
  • Article
  • Open Access

22 November 2023

Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages

1 School of Information Science and Technology, Northwest University, Xi’an 710127, China
2 School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK
3 Faculty of Computing and Information Technology, University of Sialkot, Sialkot 51040, Punjab, Pakistan
4 School of Information and Engineering, Chang’an University, Xi’an 710064, China
This article belongs to the Special Issue Natural Language Processing (NLP) and Applications

Abstract

In a conventional speech emotion recognition (SER) task, a classifier for a given language is trained on a pre-existing dataset for that same language. However, where training data for a language do not exist, data from other languages can be used instead. We experiment with cross-lingual and multilingual SER, working with Amharic, English, German, and Urdu. For Amharic, we use our own publicly available Amharic Speech Emotion Dataset (ASED). For English, German, and Urdu, we use the existing RAVDESS, EMO-DB, and URDU datasets. We followed previous research in mapping labels for all of the datasets to just two classes: positive and negative. Thus, we can compare performance on different languages directly and combine languages for training and testing. In Experiment 1, monolingual SER trials were carried out using three classifiers, AlexNet, VGGE (a proposed variant of VGG), and ResNet50. The results, averaged for the three models, were very similar for ASED and RAVDESS, suggesting that Amharic and English SER are equally difficult. The results also suggested that German SER is more difficult and that Urdu SER is easier. In Experiment 2, we trained on one language and tested on another, in both directions for each of the following pairs: Amharic↔German, Amharic↔English, and Amharic↔Urdu. The results with Amharic as the target suggested that using English or German as the source gives the best result. In Experiment 3, we trained on several non-Amharic languages and then tested on Amharic. The best accuracy obtained was several percentage points greater than the best accuracy in Experiment 2, suggesting that a better result can be obtained when using two or three non-Amharic languages for training than when using just one non-Amharic language. Overall, the results suggest that cross-lingual and multilingual training can be an effective strategy for training an SER classifier when resources for a language are scarce.

1. Introduction

Emotions assist individuals to communicate and to comprehend others’ points of view [1]. Speech emotion recognition (SER) is the task of comprehending emotion in a voice signal, regardless of its semantic content [2].
SER datasets are not available in all languages. Moreover, the quantity and quality of the training data that are available varies considerably from one language to another. For example, when evaluated across several datasets, differences in corpus language, speaker age, labeling techniques, and recording settings significantly influence model performance [3,4]. This encourages the development of more robust SER systems capable of identifying emotion from data in different languages. This can then permit the implementation of voice-based emotion recognition systems in real time for an extensive variety of industrial and medical applications.
The majority of research on SER has concentrated on a single corpus, without considering cross-lingual and cross-corpus effects. One reason is that, in comparison to the list of spoken languages, we only have a small number of corpora for the study of speech analysis [5]. Furthermore, even when only considering the English language, accessible resources vary in quality and size, resulting in the dataset sparsity problem observed in SER research. In such instances, learning from a single data source makes it challenging for SER to function effectively. As a result, more adaptable models that can learn from a wide range of resources in several languages are necessary for practical applications.
Several researchers have investigated cross-corpus SER in order to enhance classification accuracy across several languages. These works employed a variety of publicly accessible databases to highlight the most interesting trends [6]. Even though some research has addressed the difficulty of cross-corpus SER, as described in Schuller et al. [6], the challenges posed by minority languages such as Amharic have not been investigated.
Amharic is the second-largest Semitic language in the world after Arabic, and it is also the national language of Ethiopia [7]. In terms of the number of speakers and the significance of its politics, history, and culture, it is one of the 55 most important languages in the world [8]. Dealing with such languages is critical to the practicality of next-generation systems [9], which must be available for many languages.
In our previous work [10], we created the first spontaneous emotional dataset for Amharic. This contains 2474 recordings made by 65 speakers (25 male, 40 female) and uses five emotions: fear, neutral, happy, sad, and angry. The Amharic Speech Emotion Dataset (ASED) is publicly available for download (https://github.com/Ethio2021/ASEDV1 accessed on 17 October 2023). This dataset allows us to carry out the work reported here.
The contributions of this paper are as follows:
  • We investigate different scenarios for monolingual, cross-lingual, and multilingual SER using datasets for Amharic and three other languages (English, German, and Urdu).
  • We experiment with a novel approach in which a model is trained on data in several non-Amharic languages before being tested on Amharic. We show that training on two non-Amharic languages gives a better result than training on just one.
  • We present a comparison of deep learning techniques in these tasks: AlexNet, ResNet50, and VGGE.
  • To the best of our knowledge, this is the first work that shows the performance tendencies of Amharic SER utilizing several languages.
The structure of this paper is as follows: Section 2 presents previous work. Section 3 explains our approach, datasets, and feature extraction methods for SER. Section 4 presents the proposed deep learning architecture and experimental settings. Sections 5–8 describe the experiments and outcomes. Finally, Section 9 gives conclusions and next steps.

3. Approach

Many factors influence SER accuracy in a cross-corpus and multilingual context. The dataset utilized, the features extracted from the speech signals, and the neural network classifiers implemented to identify emotion are all essential aspects that may significantly impact the results. Our SER method is summarized in Figure 1. We use four corpora (ASED, RAVDESS, EMO-DB, and URDU) to test SER in Amharic, English, German, and Urdu, respectively.
Figure 1. Network architecture of proposed VGGE based on well-known VGG model.
One difficulty faced with this research is that datasets use different sets of emotion labels, as can be seen in Table 2. Following previous work [25,26], we address this by mapping labels into just two classes, positive valence and negative valence, as indicated in the table. For example, ASED uses five emotions. For this dataset, therefore, we map the two emotions Neutral and Happy to positive valence, and the three remaining emotions, Fear, Sadness, and Angry, to negative valence. Analogous mappings are performed for the other datasets.
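To make the mapping concrete, the sketch below shows one way the label-to-valence scheme of Table 2 could be expressed in code. It is only an illustration, assuming plain-text label strings; the actual datasets encode emotions differently (for example, RAVDESS uses numeric codes in its file names).

```python
# Illustrative mapping from dataset-specific emotion labels to the two
# valence classes of Table 2. Label spellings are assumptions; the real
# datasets may use different spellings or numeric codes.
VALENCE_MAP = {
    "ASED":    {"neutral": "positive", "happy": "positive",
                "fear": "negative", "sadness": "negative", "angry": "negative"},
    "RAVDESS": {"neutral": "positive", "happy": "positive",
                "calm": "positive", "surprise": "positive",
                "fear": "negative", "sadness": "negative",
                "angry": "negative", "disgust": "negative"},
    "EMO-DB":  {"neutral": "positive", "happiness": "positive",
                "anger": "negative", "sadness": "negative",
                "fear": "negative", "disgust": "negative", "boredom": "negative"},
    "URDU":    {"neutral": "positive", "happy": "positive",
                "angry": "negative", "sad": "negative"},
}

def to_valence(dataset: str, label: str) -> str:
    """Map a dataset-specific emotion label to 'positive' or 'negative'."""
    return VALENCE_MAP[dataset][label.lower()]
```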
Further details are provided below concerning the chosen datasets, the feature extraction approach, and the classifiers used.

3.1. Speech Emotion Databases

ASED [10] is for Amharic and was created by the authors in previous work. It uses five emotions and consists of 2474 recordings made by 65 speakers (25 male, 40 female). Recording was performed at 16 kHz and 16 bits. The ASED dataset is accessible to the public for research purposes (see URL earlier).
RAVDESS [27] is for English, uses eight emotions, and contains just two sentences. The 24 speakers (12 male, 12 female) are professional actors. Interestingly, emotions in this dataset are ‘self-induced’ [28], rather than acted. Moreover, there are two levels of each emotion. There are 4320 utterances. Project investigators selected the best two clips for each speaker and each emotion. Recording was at 48 kHz and 16 bits, and it was carried out in a professional recording studio at Ryerson University.
EMO-DB [29] is for German, uses seven emotions, and contains ten everyday sentences: five made of one phrase, and five made of two phrases. There are ten speakers (five male, five female), nine of whom are qualified in acting, and about 535 raw utterances in total. Recording was performed at 16 kHz and 16 bits and was carried out in the anechoic chamber of the Technical Acoustics Department at the Technical University Berlin.
URDU [17] is for Urdu, uses four emotions, and comprises 400 audio recordings from Urdu TV talk shows. There are 38 speakers (27 male, 11 female). Emotions are not acted but occur naturally during the conversations between guests on the talk shows.
Table 2. Datasets used in the experiments. The table also shows the mapping from the emotion labels in each dataset into just two valence labels which can be used across them all: positive and negative.
Aspect            | ASED                 | RAVDESS                        | EMO-DB                                 | URDU
Language          | Amharic              | English                        | German                                 | Urdu
Recordings        | 2474                 | 1440                           | 535                                    | 400
Sentences         | 27                   | 2                              | 10                                     | -
Participants      | 65                   | 24                             | 10                                     | 38
Emotions          | 5                    | 8                              | 7                                      | 4
Positive valence  | Neutral, Happy       | Neutral, Happy, Calm, Surprise | Neutral, Happiness                     | Neutral, Happy
Negative valence  | Fear, Sadness, Angry | Fear, Sadness, Angry, Disgust  | Anger, Sadness, Fear, Disgust, Boredom | Angry, Sad
References        | [10]                 | [27]                           | [29]                                   | [17]

3.2. Data Preprocessing

Before proceeding to feature extraction, a number of pre-processing steps were performed on the datasets, as shown in Figure 2. Recordings were first downsampled to 16 kHz and converted to mono. As can be seen in Table 3, most of the sound clips in the datasets are 5 s in length or less. A few are longer than this. Therefore, for our experiments, we extended any shorter clips to 5 s by adding silence to the end. Conversely, any longer clips were cut off in order to make them exactly 5 s long.
Figure 2. Data preprocessing.
Table 3. Statistics of original clip lengths in seconds for all datasets. In the table, 1–2.0 means 1 s ≤ d < 2 s, where d is the clip duration.
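As a rough illustration of the preprocessing steps described above, the following sketch resamples a clip to 16 kHz mono and pads or truncates it to exactly 5 s. It is a minimal example assuming librosa is used for loading, as in the feature-extraction stage; the paper does not give its exact preprocessing code.

```python
import numpy as np
import librosa

TARGET_SR = 16000            # 16 kHz, mono
TARGET_LEN = 5 * TARGET_SR   # fixed clip length of 5 s

def preprocess(path):
    """Load a clip, downsample to 16 kHz mono, and fix its length to 5 s."""
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    if len(y) < TARGET_LEN:                      # short clip: pad with silence
        y = np.pad(y, (0, TARGET_LEN - len(y)))
    else:                                        # long clip: cut off the excess
        y = y[:TARGET_LEN]
    return y
```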

3.3. Feature Extraction for SER

A vast amount of information reflecting emotional characteristics is present in the speech signal. One of the key issues within SER research is the choice of which features should be used.
Previously, traditional feature extraction methods, such as prosodic features, were used for SER [30,31], including the variance, intensity, spoken word rate, and pitch. However, some traditional features are shared across different emotions, as discussed by Gangamohan et al. [30]. For example, as observed in Table 11.2 of Gangamohan et al., angry and happy utterances have similar trends in F0 and speaking rate, compared to neutral speech.
Manually extracted traditional features may work well with traditional machine learning classifiers, where a set of features or attributes describes each instance in a dataset [32]. In contrast, deep learning can itself determine which features to focus on in order to recognize verbal emotions. The main aim of feature extraction has therefore become finding sets of feature vectors that give a compact representation of the input audio signal. Spectral extraction methods convert the input sound waveform into a discrete shape or feature vector. The speech signal is not stationary overall, but over a short period of time it can be treated as approximately stationary; such a short segment is called a frame. The acoustic model extracts features from the frames [33,34]. Feature extraction thus aims to retain useful information while discarding irrelevant information, and the resulting feature vectors are fed into deep learning models. In short, spectral extraction methods convert audio signals into vectors that deep learning models can handle; a model can then be trained to learn the features of each emotion and hence classify it. This is one reason why deep learning models can outperform traditional machine learning models.
After reviewing many works on SER, it is clear that Mel-Frequency Cepstral Coefficients (MFCCs) are widely used in audio classification and speech emotion recognition [35]. MFCCs are coefficients that represent the short-term power spectrum of a sound; they are computed through a series of steps that imitate the response of the human cochlea. The Mel scale is significant because it approximates human perception of sound rather than being a linear scale [36]. In our previous work [10], we compared MFCCs to alternatives and found them to be the best, which is why we chose MFCC features for the present study.
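A minimal MFCC extraction sketch is shown below, assuming the 5 s preprocessed clips from the previous step. The number of coefficients (here 40) is an assumption, since the paper does not state it.

```python
import librosa

N_MFCC = 40  # assumed number of coefficients; not specified in the paper

def extract_mfcc(y, sr=16000):
    """Compute an MFCC matrix (n_mfcc x frames) for one preprocessed clip."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    return mfcc[..., None]  # add a channel axis for 2D-CNN input
```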

4. Architectures and Settings

Most prior research uses CNN-based models for SER [37]. Among such models, the notable ones include AlexNet [38], VGG [39,40], and ResNet50 [41,42]. This section provides a short overview of the models. Our proposed model, VGGE, is a variant of VGG.
  • AlexNet is one of the famous CNN models used in applications such as image classification and recognition and is widely employed for SER classification [43]. It achieved an outstanding result at the ImageNet competition in 2012 [44].
  • VGG [39] appeared in 2014, created by the Visual Geometry Group at the University of Oxford. It is well known that the early CNN layers capture the general features of sounds such as wavelength, amplitude, etc., and later layers capture more specific features such as the spectrum and the cepstral coefficients of waves. This makes a VGG-style model suitable for the SER task. After some experimentation, we found that a model based on VGG but using four layers gave the best performance. We call this proposed model VGGE and use it for our experiments. Figure 1 shows the settings for VGGE.
  • ResNet [42] was launched in late 2015. This was the first time that networks with more than a hundred layers were trained. Subsequently, it has been applied to SER classification [41].
Concerning the experimental setup, the standard code for AlexNet and ResNet50 was downloaded and used for the experiments. For VGGE, the network configuration was altered, as shown in Figure 1. For the other models, the standard network configuration and parameters were used.
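For illustration only, the following is a minimal Keras sketch of a VGG-style network with four convolutional blocks and a two-class softmax output, in the spirit of the VGGE description above. The filter counts, kernel sizes, and dense-layer width are assumptions; the actual configuration is the one specified in Figure 1.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def build_vgge(input_shape, n_classes=2):
    """VGG-style model with four conv blocks; layer sizes are illustrative."""
    model = Sequential()
    for i, filters in enumerate([32, 64, 128, 128]):   # four conv blocks
        if i == 0:
            model.add(Conv2D(filters, (3, 3), activation="relu",
                             padding="same", input_shape=input_shape))
        else:
            model.add(Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(n_classes, activation="softmax"))   # positive / negative
    return model
```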
In all experiments, the librosa v0.7.2 library [45] was used to extract MFCC features.
We used the Keras deep learning library, version 2.0, with a Tensorflow 1.6.0 backend to build the classification models. The models were trained using a machine with an NVIDIA GeForce GTX 1050. Our model employed the Adam optimization algorithm with categorical cross-entropy as the loss function; training was terminated after 100 epochs, and the batch size was set to 32.
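Given the stated settings (Adam, categorical cross-entropy, 100 epochs, batch size 32), a training call might look roughly as follows. It reuses the build_vgge sketch above; the dummy arrays and the assumed input shape (40 MFCCs × 157 frames × 1 channel) are placeholders for the real extracted features.

```python
import numpy as np
from keras.utils import to_categorical

# Dummy data with an assumed MFCC shape; in practice these arrays come from
# the preprocessing, MFCC-extraction, and valence-mapping steps above.
X_train = np.random.rand(64, 40, 157, 1)
y_train = np.random.randint(0, 2, size=64)
X_val = np.random.rand(16, 40, 157, 1)
y_val = np.random.randint(0, 2, size=16)

model = build_vgge(input_shape=X_train.shape[1:], n_classes=2)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, to_categorical(y_train, 2),
          validation_data=(X_val, to_categorical(y_val, 2)),
          epochs=100, batch_size=32)
```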

5. Experiments

Two methods are primarily utilized for speaker-independent SER [46]: The first method is Leave One Speaker Out (LOSO) [21,47,48]. Here, when the corpus contains n speakers, we use n − 1 speakers for training and the remaining speaker for testing. For cross-validation, the experiment is repeated n times with a different test speaker each time. In the second method, the training and testing sets are determined in advance [17,49,50].
In our work, we followed the second approach. For the first monolingual experiment (train on a corpus, test on the same corpus), the data were split into training, testing, and validation sets randomly five times, ensuring each time that the split sets were speaker-independent. As shown in Table 4, all of the datasets were split into 70% train, 20% test, and 10% validation. In the first experiment, we also carried out a sentence-independent study in which the sentences used for training were not used for testing.
Table 4. Class distribution between the train, validation, and test partitions.
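One way to produce such a speaker-independent 70/20/10 split is sketched below using scikit-learn's GroupShuffleSplit with speaker IDs as groups. This is only an assumption about tooling; the paper does not say how its splits were generated, and the split ratios are approximate because the grouping operates on speakers rather than on individual clips.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def speaker_independent_split(files, speakers, seed=0):
    """Split file indices roughly 70/20/10 so no speaker crosses partitions."""
    files, speakers = np.asarray(files), np.asarray(speakers)
    # First peel off about 30% of speakers for test + validation.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
    train_idx, rest_idx = next(outer.split(files, groups=speakers))
    # Then split that 30% into roughly 20% test and 10% validation (2:1).
    inner = GroupShuffleSplit(n_splits=1, test_size=1 / 3, random_state=seed)
    test_rel, val_rel = next(inner.split(files[rest_idx],
                                         groups=speakers[rest_idx]))
    return train_idx, rest_idx[test_rel], rest_idx[val_rel]
```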
The second and third experiments are the cross-lingual experiment (train on a corpus in one language, test on a corpus in another language) and the multilingual experiment (train on two or three corpora joined together, each in a different non-Amharic language, and test on the Amharic ASED corpus). In these experiments, the speakers in the validation sets are not seen in the training sets. Moreover, the speakers in the testing set are by definition not the same as those in the training and validation sets, as they are from different datasets.
Figure 3 shows that the label distribution is balanced across partitions. Performance in the monolingual, cross-lingual, and multilingual SER experiments on Amharic is evaluated using accuracy and F1-score. We have shared the names of the audio files belonging to the train, validation, and test partitions used in the experiments (https://github.com/Ethio2021/File-names accessed on 17 October 2023). For each experiment, the models were trained five times, and the average result was reported.
Figure 3. Class distribution within datasets.
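As a small illustration of this protocol, the sketch below trains a model five times and averages accuracy and F1-score over the runs. It reuses the hypothetical build_vgge helper from the earlier sketch, and macro averaging of the F1-score is an assumption, since the paper does not specify the averaging mode.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from keras.utils import to_categorical

def evaluate_over_runs(X_tr, y_tr, X_te, y_te, n_runs=5):
    """Train n_runs times and report mean accuracy and macro F1-score."""
    accs, f1s = [], []
    for _ in range(n_runs):
        model = build_vgge(input_shape=X_tr.shape[1:], n_classes=2)
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(X_tr, to_categorical(y_tr, 2),
                  epochs=100, batch_size=32, verbose=0)
        y_pred = np.argmax(model.predict(X_te), axis=1)
        accs.append(accuracy_score(y_te, y_pred))
        f1s.append(f1_score(y_te, y_pred, average="macro"))
    return np.mean(accs), np.mean(f1s)
```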

6. Experiment 1: Comparison of SER Methods for Monolingual SER

6.1. Outline

The aim was to carry out an initial comparison of the proposed VGGE model with the two existing models discussed above, AlexNet and ResNet50. Four datasets were used: ASED, RAVDESS, EMO-DB, and URDU. To allow comparison with the other experiments, the emotion labels for each dataset were mapped onto just two labels, positive valence and negative valence, as shown by the scheme in Table 2. This follows the standard approach found in other work [25,26]. When comparing to other papers, we should bear in mind the label mapping that we needed to adopt in order to undertake the later cross-lingual and multilingual experiments. Looking at the table, we can see that the ASED, RAVDESS, EMO-DB, and URDU datasets originally had five, eight, seven, and four emotion classes, respectively, and that these are now being mapped into just two classes: positive and negative emotions. This simplifies the task, which can account for higher performance figures than in other published works.
Experiment 1 has two parts. In Experiment 1.1, the groups of speakers used for training and testing were varied. In Experiment 1.2, the dataset sentences used for training and testing were varied.

6.2. Experiment 1.1: Independence of Speakers

The results of this experiment, expressed as accuracy, are shown in Table 5. Recall that each of the four datasets is monolingual and that we are training and testing on the same language here. We can see that VGGE was the best on ASED (Amharic) and EMO-DB (German), ResNet50 was the best on RAVDESS (English), and AlexNet was the best on URDU (Urdu).
Table 5. Experiment 1.1: Monolingual SER results expressed as accuracy for different models and datasets (train in one language, test in the same language). Training and testing speakers are varied in this experiment. All datasets are monolingual: the languages are Amharic (ASED), German (EMO-DB), English (RAVDESS), and Urdu (URDU).
It is interesting to look at the average figures on the bottom row of the table. ASED (82.53%) and RAVDESS (82.71%) are very close, EMO-DB (77.78%) is 4.75% lower than ASED, and URDU (84.58%) is 2.05% higher than ASED. Generally, the differences are not that large when we consider that the languages have very different characteristics and that the datasets were created independently by different researchers. Moreover, recall that the original data are being mapped onto two sentiment classes from the original four to eight classes (see Section 6.1 and Table 2).
Subject to these points, we might conclude that Amharic and English monolingual mono-corpus SER are of similar difficulty, German is more difficult, and Urdu is easier. As languages, English and German are perhaps the most similar, since they are both within the Germanic branch of the Indo-European language group. Urdu is also Indo-European but from the Indo-Iranian branch. Finally, Amharic is from the Semitic branch of the Afro-Asiatic group.

6.3. Experiment 1.2: Independence of Sentences

Recall that each dataset consists of a set of sentences, each spoken in every emotion, with the exception of the URDU dataset, which is based on TV talk-show conversation and in which individual sentences are not identified. Hence, URDU was not used here.
In this experiment, sentences were either used for training or testing. For each of the datasets shown in the table, the proposed VGGE model, along with AlexNet and ResNet50, was trained using MFCC features. Each model was trained five times using an 80%/20% train/test split, and the average results were computed.
The results are in Table 6. The trends are similar to those of Experiment 1.1. This time, VGGE is the best on ASED and RAVDESS, while ResNet50 is the best on EMO-DB. Concerning the averages, ASED and RAVDESS are fairly close (84.46%, 81.11%), while EMO-DB is lower (66.67%). So, within the context of these particular datasets and this task, this again suggests that Amharic and English monolingual SER are of similar difficulty, while German SER is more difficult.
Table 6. Experiment 1.2: Monolingual SER results, expressed as accuracy, for the different datasets. Training and testing sentences are varied in this experiment.

7. Experiment 2: Comparison of SER Methods for Amharic Cross-Lingual SER

The aim was to compare the three models AlexNet, VGGE, and ResNet50 (Section 4) when applied to cross-lingual SER. This time, the systems are trained on data in one language and then tested on data in another language. Firstly, the three models are trained on ASED and tested on EMO-DB, then trained on EMO-DB and tested on ASED, and so on, for different combinations. To allow this cross-training, dataset-specific emotion labels are mapped into two classes, positive and negative, using the same method as for Experiment 1.
Once again, MFCC features were used for all models. The network configuration for VGGE was the same as in the preceding Experiment (Figure 1). For the other models, the standard configuration and settings were used.
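Schematically, a cross-lingual trial only changes which corpus supplies the training data and which supplies the test data. The fragment below illustrates this for ASED→EMO-DB; load_corpus is a hypothetical helper returning MFCC features and two-class valence labels produced by the earlier preprocessing, MFCC, and label-mapping sketches, and evaluate_over_runs is the placeholder evaluation loop sketched earlier.

```python
# Cross-lingual trial: train on one corpus, test on another (e.g., ASED -> EMO-DB).
# load_corpus and evaluate_over_runs are hypothetical helpers from the sketches above.
X_src, y_src = load_corpus("ASED")      # source language: Amharic
X_tgt, y_tgt = load_corpus("EMO-DB")    # target language: German
acc, f1 = evaluate_over_runs(X_src, y_src, X_tgt, y_tgt, n_runs=5)
print("ASED -> EMO-DB  accuracy %.2f  macro-F1 %.2f" % (acc, f1))
```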
The results are presented in Table 7. As line 1 of the table shows, we first trained on ASED and evaluated on EMO-DB (henceforth written ASED→EMO-DB). VGGE gave the best accuracy (66.67%), followed closely by AlexNet (65.80%) and then ResNet50 (64.06%). For EMO-DB→ASED, VGGE was best (64.22%), also followed by AlexNet (62.39%) and then ResNet50 (58.72%).
Table 7. Experiment 2: Cross-lingual SER results (train in one language, test in another language). The languages are Amharic (ASED), German (EMO-DB), English (RAVDESS), and Urdu (URDU).
Next, for ASED→RAVDESS, AlexNet was best (66.00%), followed by ResNet50 (61.75%) and VGGE (59.25%). For RAVDESS→ASED, AlexNet was best (65.87%), closely followed by ResNet50 (64.16%) and then VGGE (61.43%).
Thirdly, we used ASED→URDU. Here, ResNet50 was best (61.56%), followed by AlexNet (60.00%) and VGGE (59.69%). For URDU→ASED, ResNet50 was best (61.33%), followed by VGGE (60.00%) and AlexNet (50.67%).
It is interesting that for ASED↔EMO-DB, VGGE was best; for ASED↔RAVDESS, AlexNet was best; and for ASED↔URDU, ResNet50 was best. What is more, the figures for AlexNet on ASED↔RAVDESS in the two directions (66.00%, 65.87%, difference 0.13%) were very close, as were those for ResNet50 on ASED↔URDU (61.56%, 61.33%, difference 0.23%), while those for VGGE on ASED↔EMO-DB (66.67%, 64.22%, difference 2.45%) were slightly further apart.
We can therefore conclude that the performance of the three models was very similar overall. This is supported by the average accuracy figures for AlexNet, VGGE, and ResNet50 (61.79%, 61.88%, 61.93%, respectively) which are also very close (only 0.14% from the smallest to the biggest).
The results in the table also show that the average F1-score performance for VGGE (55.99%) is higher than that for AlexNet (54.81%, 1.18% lower) and ResNet50 (53.30%, 2.69% lower). Hence, it is concluded from these results that the prediction performance of VGGE was best, closely followed by AlexNet and then ResNet50. However, the range of F1-scores is small, only 2.69% from the smallest to the biggest, indicating only a slight difference in performance between different scenarios.
Regarding the results as a whole, two points can be made. First, the accuracy obtained by training on one language and testing on another is surprisingly good. Second, the best language to train on when testing on Amharic seems to vary by model. For AlexNet, it is RAVDESS (65.87%); for VGGE, it is EMO-DB (64.22%); and for ResNet50, it is RAVDESS again (64.16%).
Finally, we can compare our results for this experiment (Table 7) with those given for previous cross-lingual studies in Section 2. Generally, they seem comparable. Our average results are around 62%. In the previous studies, we see 56.8% [9], 57.87% [17], 65.3% [19], and 62.5% [22]. The highest is 71.62% [14]. In looking at these figures, we must remember that the exact methods and evaluation criteria used in previous experiments vary, so exact comparisons are not possible. Many different languages and datasets are used, emotion labels may need to be combined or transformed in different ways, and so on. Please refer to Section 2 for the details regarding these figures.

8. Experiment 3: Multilingual SER

In the previous experiment, we trained in one language and tested in another. In this final experiment, we trained on several non-Amharic languages and then tested on Amharic.
The same three models were used, AlexNet, VGGE, and ResNet50, with the same settings and training regime as in the previous experiments.
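Pooling training corpora amounts to concatenating their feature and label arrays before training. The fragment below sketches the EMO-DB+URDU→ASED case, again using the hypothetical load_corpus and evaluate_over_runs helpers from the earlier sketches.

```python
import numpy as np

# Multilingual trial: pool two (or three) non-Amharic corpora for training,
# then test on Amharic. load_corpus and evaluate_over_runs are the same
# placeholder helpers used in the earlier sketches.
X_de, y_de = load_corpus("EMO-DB")   # German
X_ur, y_ur = load_corpus("URDU")     # Urdu
X_am, y_am = load_corpus("ASED")     # Amharic (test only)

X_train = np.concatenate([X_de, X_ur])
y_train = np.concatenate([y_de, y_ur])
acc, f1 = evaluate_over_runs(X_train, y_train, X_am, y_am, n_runs=5)
print("EMO-DB+URDU -> ASED  accuracy %.2f  macro-F1 %.2f" % (acc, f1))
```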
Table 8 shows the results. Recall that the languages are Amharic (ASED), German (EMO-DB), English (RAVDESS), and Urdu (URDU). The first three rows for each model show the results when two datasets were used for training, EMO-DB+RAVDESS, EMO-DB+URDU, and RAVDESS+URDU. The fourth row uses all three datasets for training, i.e., EMO-DB+ RAVDESS+URDU. In all cases, testing is with ASED.
Table 8. Experiment 3: Multilingual SER results (train in two or three non-Amharic languages, test in Amharic). The languages are Amharic (ASED), German (EMO-DB), English (RAVDESS), and Urdu (URDU).
The best overall performance in the table is for VGGE, training with EMO-DB+URDU (69.94%). The average figure for VGGE over all the dataset training combinations is also the best (66.44%).
When RAVDESS is added to EMO-DB+URDU to make EMO-DB+RAVDESS+URDU, the performance of VGGE falls by 1.53% to 68.41%. In the results presented in Table 9, the upper right-hand column shows the average accuracy, and the lower right-hand column the average F1-score. In this case, we see that the highest figures over all three models are for all three datasets (67.12% and 59.79%, respectively).
Table 9. Experiment 3: Multilingual SER average results. The languages are Amharic (ASED), German (EMO-DB), English (RAVDESS), and Urdu (URDU).
However, the most interesting result here is that the best accuracy figure in Table 8 (EMO-DB+URDU→ASED, VGGE, 69.94%) is higher than the best accuracy figure in Table 7 with ASED as the target, (RAVDESS→ASED, AlexNet, 65.87%) by 4.07%. In other words, training on German and Urdu gives a better result for Amharic than training on English alone. Moreover, the best overall average accuracy figure in Table 8 (VGGE, 66.44%) is higher than the best overall average accuracy figure in Table 7 (ResNet50, 61.93%) by 4.51%.
Once again, the results in Table 8 show that the average F1-score performance for VGGE (62.78%) is higher than that for AlexNet (55.81%, 6.97% lower) and ResNet50 (51.65%, 11.13% lower). Furthermore, the best overall average F1-score figure in Table 8 (VGGE, 62.78%) is higher than the best overall average F1-score figure in Table 7 (VGGE, 55.99%) by 6.79%.
These results suggest that by using several non-Amharic datasets for training, we can obtain a better result, by several percentage points, than when using one non-Amharic dataset for training, when testing on Amharic throughout.
Compared with the existing studies discussed in Section 2, there are only three that present multilingual experiments. Lefter et al. [13] report that training on three datasets, EMO-DB, DES, and ENT, and testing on EMO-DB gave the best result, better than their cross-lingual trials. This concurs with our own findings, where the average results for Experiment 3 (Table 8, bottom line) were higher than those of Experiment 2 (Table 7, bottom line). Latif et al. [17] found that training on EMO-DB, EMOVO, and SAVEE and testing on URDU gained a better result than using just two training datasets. Latif et al. [19] also obtained the best result when training on three datasets.

9. Conclusions

In this paper, we first proposed a variant of the well-known VGG model, which we call VGGE, and then applied AlexNet, VGGE, and ResNet50 to the task of speech emotion recognition, focusing on the Amharic language. This was made possible by the existence of the publicly available Amharic Speech Emotion Dataset (ASED), which we created in our previous work [10]. In Experiment 1, we trained the three models on four datasets: ASED (Amharic), RAVDESS (English), EMO-DB (German), and URDU (Urdu). In each case, a model was trained on one dataset and then tested on that same dataset. Speaker-independent and sentence-independent training variants were tried. The results suggested that Amharic and English monolingual SER are almost equally difficult on the datasets we used for these languages, while German is harder, and Urdu is easier.
In Experiment 2, we trained on SER data in one language and tested on data in another language, for various language pairs. When ASED was the target, the best dataset to train on was RAVDESS for AlexNet and ResNet50, and EMO-DB for VGGE. This could indicate that, in terms of SER, Amharic is more similar to English and German than it is to Urdu.
In Experiment 3, we combined datasets for two or three different non-Amharic languages for training and used the Amharic dataset for testing. The best result in Experiment 3 (EMO-DB+URDU→ASED, VGGE, 69.94%) was 4.07% higher than the best result in Experiment 2 (RAVDESS→ASED, AlexNet, 65.87%). In addition, the best overall average figure in Experiment 3 (VGGE, 66.44%) was 4.51% higher than the best overall average figure in Experiment 2 (ResNet50, 61.93%). These findings suggest that if several non-Amharic datasets are used for SER training, the results can be better than if one non-Amharic dataset is used, when testing on Amharic throughout. Overall, the experiments demonstrate how cross-lingual and multilingual approaches can be used to create effective SER systems for languages with little or no training data, confirming the findings of previous studies. Future work could involve improving SER performance when training on non-target languages and trying to predict which combination of source languages will give the best result.

Author Contributions

Conceptualization, E.A.R. and R.S.; methodology, E.A.R.; software, E.A.R. and E.A.; validation, E.A.R., E.A. and M.M.; formal analysis, E.A.R.; investigation, E.A.R., R.S., S.A.K., M.M. and E.A.; resources, E.A.R.; data curation, E.A.R.; writing—original draft preparation, E.A.R., J.M., M.A.B. and S.A.C.; writing—review and editing, R.S., S.A.K. and J.M.; visualization, E.A.R. and E.A.; supervision, J.F. and R.S.; project administration, E.A.R. and R.S.; funding acquisition, J.F. and R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under grant 2020YFC1523302.

Data Availability Statement

Publicly available datasets were analyzed in this study. Our Amharic Speech Emotion Dataset (ASED) is also publicly available at https://github.com/Ethio2021/ASED_V1.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zvarevashe, K.; Olugbara, O. Ensemble learning of hybrid acoustic features for speech emotion recognition. Algorithms 2020, 13, 70. [Google Scholar] [CrossRef]
  2. Khan, M.U.; Javed, A.R.; Ihsan, M.; Tariq, U. A novel category detection of social media reviews in the restaurant industry. Multimed. Syst. 2020, 29, 1–14. [Google Scholar] [CrossRef]
  3. Zhang, B.; Provost, E.M.; Essl, G. Cross-corpus acoustic emotion recognition from singing and speaking: A multi-task learning approach. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5805–5809. [Google Scholar]
  4. Zhang, Z.; Weninger, F.; Wöllmer, M.; Schuller, B. Unsupervised learning in cross-corpus acoustic emotion recognition. In Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, Waikoloa, HI, USA, 11–15 December 2011; pp. 523–528. [Google Scholar]
  5. Wang, D.; Zheng, T.F. Transfer learning for speech and language processing. In Proceedings of the 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hong Kong, China, 16–19 December 2015; pp. 1225–1237. [Google Scholar]
  6. Wöllmer, M.; Stuhlsatz, A.; Wendemuth, A.; Rigoll, G. Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Trans. Affect. Comput. 2010, 1, 119–131. [Google Scholar]
  7. Mossie, Z.; Wang, J.H. Social network hate speech detection for Amharic language. Comput. Sci. Inf. Technol. 2018, 41–55. [Google Scholar]
  8. Mengistu, A.D.; Bedane, M.A. Text Independent Amharic Language Dialect Recognition using Neuro-Fuzzy Gaussian Membership Function. Int. J. Adv. Stud. Comput. Sci. Eng. 2017, 6, 30. [Google Scholar]
  9. Albornoz, E.M.; Milone, D.H. Emotion recognition in never-seen languages using a novel ensemble method with emotion profiles. IEEE Trans. Affect. Comput. 2015, 8, 43–53. [Google Scholar] [CrossRef]
  10. Retta, E.A.; Almekhlafi, E.; Sutcliffe, R.; Mhamed, M.; Ali, H.; Feng, J. A new Amharic speech emotion dataset and classification benchmark. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–22. [Google Scholar] [CrossRef]
  11. Sailunaz, K.; Dhaliwal, M.; Rokne, J.; Alhajj, R. Emotion detection from text and speech: A survey. Soc. Netw. Anal. Min. 2018, 8, 1–26. [Google Scholar] [CrossRef]
  12. Schuller, B.; Batliner, A.; Steidl, S.; Seppi, D. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Commun. 2011, 53, 1062–1087. [Google Scholar] [CrossRef]
  13. Lefter, I.; Rothkrantz, L.J.; Wiggers, P.; Van Leeuwen, D.A. Emotion recognition from speech by combining databases and fusion of classifiers. In International Conference on Text, Speech and Dialogue; Springer: Berlin/Heidelberg, Germany, 2010; pp. 353–360. [Google Scholar]
  14. Xiao, Z.; Wu, D.; Zhang, X.; Tao, Z. Speech emotion recognition cross language families: Mandarin vs. western languages. In Proceedings of the 2016 International Conference on Progress in Informatics and Computing (PIC), Shanghai, China, 23–25 December 2016; pp. 253–257. [Google Scholar]
  15. Sagha, H.; Matejka, P.; Gavryukova, M.; Povolnỳ, F.; Marchi, E.; Schuller, B.W. Enhancing Multilingual Recognition of Emotion in Speech by Language Identification. Interspeech 2016, 2949–2953. [Google Scholar]
  16. Meftah, A.; Seddiq, Y.; Alotaibi, Y.; Selouani, S.A. Cross-corpus Arabic and English emotion recognition. In Proceedings of the 2017 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Bilbao, Spain, 18–20 December 2017; pp. 377–381. [Google Scholar]
  17. Latif, S.; Qayyum, A.; Usman, M.; Qadir, J. Cross lingual speech emotion recognition: Urdu vs. western languages. In Proceedings of the 2018 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 17–19 December 2018; pp. 88–93. [Google Scholar]
  18. Latif, S.; Rana, R.; Younis, S.; Qadir, J.; Epps, J. Cross corpus speech emotion classification-an effective transfer learning technique. arXiv 2018, arXiv:1801.06353. [Google Scholar]
  19. Latif, S.; Qadir, J.; Bilal, M. Unsupervised adversarial domain adaptation for cross-lingual speech emotion recognition. In Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, UK, 3–6 September 2019; pp. 732–737. [Google Scholar]
  20. Goel, S.; Beigi, H. Cross lingual cross corpus speech emotion recognition. arXiv 2020, arXiv:2003.07996. [Google Scholar]
  21. Bhaykar, M.; Yadav, J.; Rao, K.S. Speaker dependent, speaker independent and cross language emotion recognition from speech using GMM and HMM. In Proceedings of the 2013 National conference on communications (NCC), New Delhi, India, 15–17 February 2013; pp. 1–5. [Google Scholar]
  22. Zehra, W.; Javed, A.R.; Jalil, Z.; Khan, H.U.; Gadekallu, T.R. Cross corpus multi-lingual speech emotion recognition using ensemble learning. Complex Intell. Syst. 2021, 7, 1845–1854. [Google Scholar] [CrossRef]
  23. Duret, J.; Parcollet, T.; Estève, Y. Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data. arXiv 2023, arXiv:2306.17199. [Google Scholar]
  24. Pandey, S.K.; Shekhawat, H.S.; Prasanna, S.R.M. Multi-cultural speech emotion recognition using language and speaker cues. Biomed. Signal Process. Control 2023, 83, 104679. [Google Scholar] [CrossRef]
  25. Deng, J.; Zhang, Z.; Marchi, E.; Schuller, B. Sparse autoencoder-based feature transfer learning for speech emotion recognition. In Proceedings of the 2013 humaine association conference on affective computing and intelligent interaction, Geneva, Switzerland, 2–5 September 2013; pp. 511–516. [Google Scholar]
  26. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202. [Google Scholar] [CrossRef]
  27. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
  28. Stanislavski, C. An Actor Prepares (New York). Theatre Art. 1936, 38. [Google Scholar]
  29. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005. [Google Scholar]
  30. Gangamohan, P.; Kadiri, S.R.; Yegnanarayana, B. Analysis of emotional speech—A review. In Toward Robotic Socially Believable Behaving Systems-Volume I; Springer: Berlin/Heidelberg, Germany, 2016; pp. 205–238. [Google Scholar]
  31. Fairbanks, G.; Hoaglin, L.W. An experimental study of the durational characteristics of the voice during the expression of emotion. Commun. Monogr. 1941, 8, 85–90. [Google Scholar] [CrossRef]
  32. Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech emotion recognition using deep learning techniques: A review. IEEE Access 2019, 7, 117327–117345. [Google Scholar] [CrossRef]
  33. Dey, N.A.; Amira, S.M.; Waleed, S.N.; Nhu, G. Acoustic sensors in biomedical applications. In Acoustic Sensors for Biomedical Applications; Springer: Berlin/Heidelberg, Germany, 2019; pp. 43–47. [Google Scholar]
  34. Almekhlafi, E.; Moeen, A.; Zhang, E.; Wang, J.; Peng, J. A classification benchmark for Arabic alphabet phonemes with diacritics in deep neural networks. Comput. Speech Lang. 2022, 71, 101274. [Google Scholar] [CrossRef]
  35. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894. [Google Scholar] [CrossRef]
  36. Shaw, A.; Vardhan, R.H.; Saxena, S. Emotion recognition and classification in speech using Artificial neural networks. Int. J. Comput. Appl. 2016, 145, 5–9. [Google Scholar] [CrossRef]
  37. Mustaqeem; Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2020, 20, 183. [Google Scholar]
  38. Kumbhar, H.S.; Bhandari, S.U. Speech Emotion Recognition using MFCC features and LSTM network. In Proceedings of the 2019 5th International Conference On Computing, Communication, Control And Automation (ICCUBEA), Pune, India, 19–21 September 2019; pp. 1–3. [Google Scholar]
  39. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  40. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv 2016, arXiv:1611.06440. [Google Scholar]
  41. George, D.; Shen, H.; Huerta, E.A. Deep Transfer Learning: A new deep learning glitch classification method for advanced LIGO. arXiv 2017, arXiv:1706.07446. [Google Scholar]
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  43. Sajjad, M.; Kwon, S. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [Google Scholar]
  44. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  45. Sharmin, R.; Rahut, S.K.; Huq, M.R. Bengali Spoken Digit Classification: A Deep Learning Approach Using Convolutional Neural Network. Procedia Comput. Sci. 2020, 171, 1381–1388. [Google Scholar] [CrossRef]
  46. Shinde, A.S.; Patil, V.V. Speech Emotion Recognition System: A Review. SSRN 3869462. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3869462 (accessed on 10 October 2023).
  47. Deb, S.; Dandapat, S. Multiscale amplitude feature and significance of enhanced vocal tract information for emotion classification. IEEE Trans. Cybern. 2018, 49, 802–815. [Google Scholar] [CrossRef] [PubMed]
  48. Wang, K.; Su, G.; Liu, L.; Wang, S. Wavelet packet analysis for speaker-independent emotion recognition. Neurocomputing 2020, 398, 257–264. [Google Scholar] [CrossRef]
  49. Swain, M.; Sahoo, S.; Routray, A.; Kabisatpathy, P.; Kundu, J.N. Study of feature combination using HMM and SVM for multilingual Odiya speech emotion recognition. Int. J. Speech Technol. 2015, 18, 387–393. [Google Scholar] [CrossRef]
  50. Kuchibhotla, S.; Vankayalapati, H.D.; Anne, K.R. An optimal two stage feature selection for speech emotion recognition using acoustic features. Int. J. Speech Technol. 2016, 19, 657–667. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
