Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages

In a conventional speech emotion recognition (SER) task, a classifier for a given language is trained on a pre-existing dataset for that same language. However, where training data for a language does not exist, data from other languages can be used instead. We experiment with cross-lingual and multilingual SER, working with Amharic, English, German and Urdu. For Amharic, we use our own publicly available Amharic Speech Emotion Dataset (ASED). For English, German and Urdu we use the existing RAVDESS, EMO-DB and URDU datasets. We followed previous research in mapping labels for all datasets to just two classes, positive and negative. Thus we can compare performance on different languages directly, and combine languages for training and testing. In Experiment 1, monolingual SER trials were carried out using three classifiers, AlexNet, VGGE (a proposed variant of VGG), and ResNet50. Results averaged for the three models were very similar for ASED and RAVDESS, suggesting that Amharic and English SER are equally difficult. By comparison, German SER is more difficult, and Urdu SER is easier. In Experiment 2, we trained on one language and tested on another, in both directions for each pair: Amharic<->German, Amharic<->English, and Amharic<->Urdu. Results with Amharic as target suggested that using English or German as source will give the best result. In Experiment 3, we trained on several non-Amharic languages and then tested on Amharic. The best accuracy obtained was several percent greater than the best accuracy in Experiment 2, suggesting that a better result can be obtained when using two or three non-Amharic languages for training than when using just one non-Amharic language. Overall, the results suggest that cross-lingual and multilingual training can be an effective strategy for training a SER classifier when resources for a language are scarce.


Introduction
Emotions assist individuals to communicate and to comprehend others' points of view [50]. Speech emotion recognition (SER) is the task of comprehending emotion in a voice signal, regardless of its semantic content [17]. SER datasets are not available in all languages. Moreover, the quantity and quality of the training data which is available varies considerably from one language to another. For example, when evaluated across several datasets, differences in corpus language, speaker age, labeling techniques, and recording settings significantly influence model performance [48,49]. This encourages the development of more robust SER systems capable of identifying emotion from data in different languages. This can then permit the implementation of voice-based emotion recognition systems in real-time for an extensive variety of industrial and medical applications.
The majority of research on SER has concentrated on a single corpus, without considering cross-lingual and cross-corpus effects. One reason is that, in comparison to the list of spoken languages, we only have a small number of corpora for the study of speech analysis [44]. Furthermore, even when only considering the English language, accessible resources vary in quality and size, resulting in the dataset sparsity problem observed in SER research. In such instances, learning from a single data source makes it challenging for SER to function effectively. As a result, more adaptable models that can learn from a wide range of resources in several languages are necessary for practical applications.
Several researchers have investigated cross-corpus SER in order to enhance classification accuracy across several languages. These works employed a variety of publicly accessible databases to highlight the most interesting trends [37]. Even though some research has addressed the difficulty of cross-corpus SER, as described in Schuller et al. [37], the challenges posed by minority languages such as Amharic have not been investigated. Amharic is the second-largest Semitic language in the world after Arabic and it is also the national language of Ethiopia [30]. In terms of the number of speakers and the significance of its politics, history, and culture, it is one of the 55 most important languages in the world [28]. Dealing with such languages is critical to the practicality of next-generation systems [1], which must be available for many languages.
In previous work [32], we created the first spontaneous emotional dataset for Amharic. This contains 2,474 recordings made by 65 speakers (25 male, 40 female) and uses five emotions: fear, neutral, happy, sad, and angry. The Amharic Speech Emotion Dataset (ASED) is publicly available for download (see footnote 3). The ASED dataset allows us to carry out the work reported here.
The contributions of this paper are as follows:
• We investigate different scenarios for monolingual, cross-lingual and multilingual SER using datasets for Amharic and three other languages (English, German and Urdu).
• We experiment with a novel approach in which a model is trained on data in several non-Amharic languages before being tested on Amharic. We show that training on two non-Amharic languages gives a better result than training on just one.
• To the best of our knowledge, this is the first work that shows the performance tendencies of Amharic SER utilizing several languages.
The structure of this paper is as follows: Section 2 presents previous work. Section 3 explains our approach, datasets, and feature extraction methods for SER. Section 4 presents the proposed deep learning architecture and experimental settings. Section 5 describes the experiments and outcomes. Finally, Section 6 gives conclusions and next steps.
Concerning cross-lingual studies, Lefter et al. [25] carried out an early study in which they trained a SER classifier on one or more datasets and then tested on another. In a cross-lingual setting, training on ENT and testing on DES gave the lowest Equal Error Rate for Anger (29.9%).
Albornoz et al. [1] proposed a SER classifier for emotion detection, focusing on emotion identification in unknown languages. The results showed what could be expected from a system trained with a different language, reaching 45% on average. The standard multi-class SVM performed better than the classifier implemented using Emotion Profiles (EP). On average, the Standard Classifier (SC) reached 56.8%, whereas the Emotional Profile Classifier (EPC) obtained 52.1%.
Xiao et al. [46] examined SER for Mandarin Chinese vs. Western languages like German and Danish. The authors concentrated on gender-specific SER and attained classification rates that were higher than chance but lower than baseline accuracy. The best classification rate in the cross-language family test on male speech samples (71.62%) was obtained when the Chinese Dual mode Emotional Speech Database (CDESD) was used for training and EMO-DB was used for testing.
Sagha et al. [33] utilized language detection to improve cross-lingual SER. They found that using a language identifier followed by network selection rather than a network trained on all existing languages was superior for recognizing the emotions of a speaker whose language is unknown. On average, the Language IDentification (LID) approach for selecting training corpora was superior to using all the available corpora when the spoken language was not known.
Meftah et al. [27] proposed Deep Belief Networks (DBN) for cross-corpus SER and evaluated them in comparison with MLP via emotional speech corpora for Arabic (KSUEmotions) and English (EPST). Training on one dataset and testing on the other yielded similar results for both directions and both models. The best result was Arabic→English using DBN (Valence 53.22%, Arousal 57.2%).
Latif et al. [23] extracted eGeMAPS features from their raw audio data. They used SVM with a Gaussian kernel for classifying data into their respective categories. The best result was when training on EMO-DB and then testing on URDU (57.87%).
Latif et al. [24] also used eGeMAPS features and they employed five different corpora for three different languages to investigate cross-corpus and cross-language emotion recognition using Deep Belief Networks (DBNs). IEMOCAP performs well on EMO-DB compared to FAU-AIBO, even though both of the latter datasets are German.
Latif et al. [22] studied SER using languages from various language families, such as Urdu vs. Italian or German. The best cross-lingual results were obtained by training on URDU and testing on EMO-DB (65.3%) and the worst were obtained by training on URDU and testing on SAVEE (53.2%).
Goel et al. [13] used transfer learning to carry out multi-task learning experiments and discovered that traditional machine learning architectures [44,3] can perform as well as deep learning neural networks for SER provided the researchers pick appropriate input features. Training the model on IEMOCAP and testing on EMO-DB obtained the best performance (65%).
Zehra et al. [47] presented an ensemble learning approach for cross-corpus machine learning SER, utilizing the SAVEE, URDU, EMO-DB, and EMOVO databases. The method employed three of the most prominent machine learning algorithms, Sequential Minimal Optimization (SMO), Random Forest (RF), and Decision Tree (J48), plus a majority voting mechanism. The ensemble approach was worse than the other classifiers except when training on EMOVO and testing on URDU (62.5%).
Jarod et al. [8] used prosody prediction and employed eight different corpora for five European languages to investigate cross-lingual and multilingual emotion recognition using Wav2Vec2XLSR. The multilingual setup outperformed the monolingual one for all selected European languages, except English, by a very small margin.
Pandey et al. [31] proposed a SER classifier for emotion detection, focusing on learning emotions, irrespective of culture. They also used 3D Mel-Spectrogram features (henceforth MelSpec) and employed five different corpora for five languages to investigate cross-lingual emotion recognition using an Attention-Gated Tensor Factorized Neural Network (AG-TFNN). The best result was Fold2→German using 3D TFNN. In addition, Fold5→Telugu had better performance when compared to Fold4→Hindi, even though both languages are of Indian origin.
We now consider multilingual approaches in which several datasets in different languages are used for training. In addition to the cross-lingual experiments referred to earlier, Lefter et al. [25] also carried out some multilingual work in which they trained on various pairs or triples of datasets chosen from EMO-DB, DES and ENT, and tested on each of these individually. The best result was obtained by training on all three and testing on EMO-DB (Equal Error Rate 20.5%).
Figure 1: Block diagram of our approach for SER.
Latif et al. [23] used four different corpora (SAVEE, EMOVO, EMO-DB and URDU) for four different languages to investigate multilingual emotion recognition using Support Vector Machines (SVM). When training on EMO-DB, EMOVO and SAVEE and testing on URDU, a result of 70.98% was achieved, which was higher than that for any pair of these datasets.
Latif et al. [22] also used SAVEE, EMOVO, EMO-DB and URDU. The best performance was training on SAVEE, EMOVO, URDU and testing on EMO-DB (68%). The worst performance was training on the same three datasets and testing on EMOVO (61.8%).
Regarding the model used, Latif et al. [23], Albornoz et al. [1], Lefter et al. [25], and Sagha et al. [33] are all based on SVMs. Meftah et al. [27] and Latif et al. [24] utilized DBN; Goel et al. [13], Jarod et al. [8] and Pandey et al. [31] applied machine learning and deep learning methods; Zehra et al. [47] used ensemble methods; and lastly Xiao et al. [46] and Latif et al. [22] applied GAN and SMO respectively.
Concerning the earlier studies, we observe that the SVM algorithm performs poorly on large datasets. It also performs poorly in situations with more features per data point, especially in multi-class situations. When DBN with low-level acoustic features was compared with DBN with eGeMAPS features, the latter performed significantly better. Additionally, deep learning models outperform conventional classifiers. However, the model of Goel et al. [13] extracts features quite well but requires a lot of training time. As previously indicated, an ensemble strategy only provided the best performance in one scenario. Furthermore, the existing techniques in SER lack preprocessing operations. We conclude that, across many datasets, binary classification achieves higher performance than multi-class classification. In addition, none of the previous work has focused on the Amharic language.
Here, we first present the preprocessing strategy before the extraction of features from the signal. Second, we propose an architecture, based on the VGG model, which offers good results. Third, we provide a classification benchmark for Amharic and three non-Amharic languages using deep neural networks. Finally, we contrast the effectiveness of our novel training scenarios to demonstrate the efficiency of cross-lingual and multilingual approaches.

Approach
Many factors influence SER accuracy in a cross-corpus and multilingual context. The dataset utilized, the features extracted from the speech signals, and the neural network classifiers implemented to identify emotion are all essential aspects that might significantly impact the results. Our SER method is summarized in Figure 1. We use four corpora (ASED, RAVDESS, EMO-DB, URDU) to test SER in Amharic, English, German and Urdu respectively.
One difficulty faced with this research is that datasets use different sets of emotion labels, as can be seen in Table 2. Following previous work [6,9] we address this by mapping labels onto just two classes, positive valence and negative valence, as indicated in the table.
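The label mapping described above can be sketched in a few lines. This is a minimal illustration only: the authoritative assignment is the one in Table 2, and the dictionary below, including the choice to treat neutral as positive, is an assumption made for the example, as are the names `VALENCE_MAP`, `to_valence` and `map_labels`.

```python
# Illustrative mapping from dataset-specific emotion labels to two valence
# classes. ASSUMPTION: this assignment is hypothetical; the mapping actually
# used in the experiments is the one given in Table 2.
VALENCE_MAP = {
    "happy": "positive",
    "neutral": "positive",
    "angry": "negative",
    "sad": "negative",
    "fear": "negative",
}


def to_valence(label):
    """Map one emotion label (case-insensitive) to 'positive' or 'negative'."""
    key = label.strip().lower()
    if key not in VALENCE_MAP:
        # Fail loudly rather than silently guessing a valence class.
        raise KeyError(f"no valence mapping defined for label {label!r}")
    return VALENCE_MAP[key]


def map_labels(labels):
    """Map a sequence of emotion labels to valence labels."""
    return [to_valence(label) for label in labels]
```

Once every corpus is expressed in the same two labels, utterances from different datasets can be pooled or compared directly, which is what the cross-lingual and multilingual experiments rely on.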
Further details on the chosen datasets, feature extraction, and classifier are provided below.
EMO-DB [4] is for German, uses five emotions and contains ten everyday sentences, five made of one phrase, five made of two phrases. There are ten speakers (5 male, 5 female), nine qualified in acting and about 535 raw utterances in total. Recording is done at 16 kHz and 16 bits, and was carried out in the anechoic chamber of the Technical Acoustics Department at the Technical University Berlin.
RAVDESS [26] is for English, uses eight emotions and contains just two sentences. The 24 speakers (12 male, 12 female) are professional actors. Interestingly, emotion in this dataset is 'self-induced' [42], rather than acted. Moreover, there are two levels of each emotion. There are 4,320 utterances. Project investigators selected the best two clips for each speaker and each emotion. Recording was at 48 kHz and 16 bits, and it was carried out in a professional recording studio at Ryerson University.
URDU [23] is for Urdu, uses four emotions and comprises 400 audio recordings from Urdu TV talk shows. There are 38 speakers (27 male, 11 female). Emotions are not acted, but occur naturally during the conversations between guests on the talk shows.
ASED [32] is for Amharic and was created by the authors in previous work. It uses five emotions and consists of 2,474 recordings made by 65 speakers (25 male, 40 female). Recording is done at 16 kHz and 16 bits. The ASED dataset is accessible to the public for research purposes (see URL in earlier footnote).

Data Preprocessing
Before proceeding to feature extraction, a number of pre-processing steps were performed on the datasets as shown in Figure 3. Recordings were first downsampled to 16 kHz and converted to mono. Most of the sound clips in the datasets are 5 seconds in length or less. A few are longer than this. Therefore, we extended any shorter clips to 5 seconds by adding silence to the end. Conversely, any longer clips were cut off in order to make them exactly 5 seconds long. The statistics of clip lengths are shown in Table 3.
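The length normalisation step can be sketched as follows. This is a minimal illustration that assumes the clip has already been downsampled to 16 kHz mono and is held as a plain sequence of samples, and that "silence" means zero-valued samples; the function name `fix_length` is hypothetical.

```python
TARGET_SR = 16_000   # sampling rate after downsampling, in Hz
TARGET_SECONDS = 5   # fixed clip length described in the text


def fix_length(samples, sr=TARGET_SR, seconds=TARGET_SECONDS):
    """Return a list of exactly sr * seconds samples.

    Shorter clips are padded at the end with zeros (assumed to represent
    silence); longer clips are truncated.
    """
    target = sr * seconds
    if len(samples) >= target:
        return list(samples[:target])
    return list(samples) + [0.0] * (target - len(samples))
```

With the defaults, every clip becomes 80,000 samples long, so all inputs to the later feature extraction step have the same shape.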

Feature extraction for SER
A vast amount of information reflecting emotional characteristics is present in the speech signal. One of the key issues within SER research is the choice of features which should be used.
Previously, traditional feature extraction methods such as prosodic features were used for SER [10,11], including the variance, intensity, spoken word rate, and pitch.However, some traditional features are shared across different emotions, as discussed by Gangamohan et al. [11].For example, as observed in Table 11.2 of Gangamohan et al., angry and happy utterances have similar trends in F0 and speaking rate, compared to neutral speech.
Manually extracted traditional features may work well with traditional classification methods in machine learning, where a set of features or attributes describes each instance in a dataset [16]. In contrast, however, deep learning can itself determine which features to focus on to recognize verbal emotions. Finding some set of feature vectors or properties that can give a compact representation of the input audio signal has therefore become the main aim of feature extraction methods. The spectrum extraction methods convert the input sound waveform to some discrete shape or feature vector. Normally, the speech signal is not stationary, but over a short period of time it behaves as a stationary signal. This short, detached snap is called a frame. The acoustic model extracts features from the frames [7,2]. Feature extraction deals with obtaining useful information for reference by removing irrelevant information. These extracted feature vectors are fed into deep learning models. In short, spectrum extraction methods can convert audio signals into vectors that deep learning models can handle. The model can then be trained to learn the features of each emotion and hence classify it. Overall, this is one reason why deep learning models can perform better than traditional machine learning models.
After reviewing many works on SER, it is clear that Mel-Frequency Cepstral Coefficients (MFCC) are widely used in audio classification and speech emotion recognition [15]. MFCCs are coefficients that express the short-term power spectrum of a sound. They are computed through a series of steps that imitate the human cochlea, thereby converting audio signals. The Mel scale is significant because it approximates the human perception of sound instead of being a linear scale [39]. In previous work [32] we compared MFCC to alternatives and found it to be the best. This is the reason we choose MFCC features for the present study.
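To make the non-linearity of the Mel scale concrete, the sketch below converts between Hz and Mel and places filter-bank band edges at equal Mel spacing. It assumes the commonly used formula m = 2595 log10(1 + f/700); the text does not state which Mel variant was used for the MFCC features, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mel (assumed 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_bands):
    """Edges (in Hz) of n_bands filters spaced equally on the Mel scale.

    n_bands overlapping triangular filters need n_bands + 2 edge points.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
```

Because the edges are equally spaced in Mel rather than in Hz, the bands are narrow at low frequencies and wide at high frequencies, which is the perceptual property referred to above.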

Architectures and Settings
Most prior research uses CNN-based models for SER [21]. Among such models, the notable ones include AlexNet [20], VGG [41,29], and ResNet50 [14,12]. This section provides a short overview of the models. Our proposed model, VGGE, is a variant of VGG.
• AlexNet is one of the famous CNN models used in applications such as image classification and recognition, and is widely employed for SER classification [35]. It achieved an outstanding result at the ImageNet competition in 2012 [18].
• VGG [41] appeared in 2014, created by the Visual Geometry Group at the University of Oxford. It is well known that the early CNN layers capture the general features of sounds such as wavelength, amplitude, etc., and later layers capture more specific features such as the spectrum and the cepstral coefficients of waves. This makes a VGG-style model suitable for the SER task. After some experimentation, we found that a model based on VGG but using four layers gave the best performance. We call this proposed model VGGE and used it for our experiments. Figure 2 shows the settings for VGGE.
• ResNet [14] was launched in late 2015. This was the first time that networks having more than a hundred layers were trained. Subsequently it has been applied to SER classification [12].
Concerning the experimental setup, the standard code for AlexNet and ResNet50 was downloaded and used for the experiments. For VGGE, the network configuration was altered, as shown in Figure 2. For the other models, the standard network configuration and parameters were used.
We used the Keras deep learning library, version 2.0, with a TensorFlow 1.6.0 backend to build the classification models. The models were trained using a machine with an NVIDIA GeForce GTX 1050. Our model employed the Adam optimization algorithm, with categorical cross-entropy as the loss function; training was terminated after 100 epochs and batch size was set to 32.
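For readers unfamiliar with the loss function named above, the following is a minimal pure-Python sketch of categorical cross-entropy over one-hot targets. It is written for illustration and is not the Keras implementation; the function name and the clipping constant `eps` are assumptions.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over examples of -sum_c y_true[c] * log(y_pred[c]).

    y_true: list of one-hot rows, e.g. [1, 0] for the first class.
    y_pred: list of predicted probability rows of the same shape.
    eps:    lower bound on probabilities to avoid log(0) (an assumption).
    """
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true)
```

With two valence classes, a model that always outputs probabilities (0.5, 0.5) has a loss of log 2, roughly 0.693, which is a useful reference point when reading training curves.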

Experiments
Two methods are mostly utilized for speaker-independent SER [40]. The first method is Leave One Speaker Out (LOSO) [3,5,45]: when the corpus contains n speakers, we use n − 1 speakers for training and the remaining speaker for testing. For cross-validation, the experiment is repeated n times with a different test speaker each time. In the second method, the training and testing sets have been determined previously [23,43,19].
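The LOSO procedure can be sketched as a generator of folds. The representation of the corpus as a list of (speaker_id, item) pairs and the name `loso_folds` are assumptions made for the example.

```python
def loso_folds(utterances):
    """Yield (held_out_speaker, train, test) once per speaker.

    utterances: list of (speaker_id, item) pairs (a hypothetical layout).
    For n distinct speakers this yields n folds, each training on n - 1
    speakers and testing on the remaining one.
    """
    speakers = sorted({spk for spk, _ in utterances})
    for held_out in speakers:
        train = [u for u in utterances if u[0] != held_out]
        test = [u for u in utterances if u[0] == held_out]
        yield held_out, train, test
```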
In our work, we followed the second approach.For the first monolingual experiment (train on a corpus, test on the same corpus) the data was split into training, testing, and validation sets randomly five times, each time ensuring that the split sets are speaker-independent.As shown in Table 4, all the datasets were split 70% train, 20% test, and 10% validation.
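One way to obtain such a speaker-independent partition is to split the set of speakers rather than the set of utterances. The sketch below does this under stated assumptions: the 70/20/10 proportions are applied to speakers (the text gives them for the datasets as a whole, so utterance counts will only approximate these proportions), and the data layout and function name are hypothetical.

```python
import random


def speaker_independent_split(utterances, seed, fractions=(0.7, 0.2, 0.1)):
    """Split (speaker_id, item) pairs into train/test/val with no shared speakers.

    ASSUMPTION: the fractions are applied to the list of speakers, so the
    utterance-level proportions are only approximately 70/20/10.
    """
    speakers = sorted({spk for spk, _ in utterances})
    rng = random.Random(seed)          # seeded for a reproducible split
    rng.shuffle(speakers)
    n_train = round(fractions[0] * len(speakers))
    n_test = round(fractions[1] * len(speakers))
    groups = {
        "train": set(speakers[:n_train]),
        "test": set(speakers[n_train:n_train + n_test]),
        "val": set(speakers[n_train + n_test:]),
    }
    return {
        name: [u for u in utterances if u[0] in spk_set]
        for name, spk_set in groups.items()
    }
```

Calling the function with five different seeds gives five random splits, in the spirit of the repeated partitioning described above.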
In the first experiment, we also carried out a sentence-independent study where sentences used for training were not used for testing.
The second and third experiments are the cross-lingual experiment (train on a corpus in one language, test on a corpus in another language) and the multilingual experiment (train on two or three corpora joined together, each in a different non-Amharic language, and test on the Amharic ASED corpus). In these experiments, the speakers in the validation sets are not seen in the training sets. Moreover, the speakers in the testing set are by definition not the same as those in the training and validation sets, as they are from different datasets.
Figure 5 shows a label distribution that is balanced across partitions. The performance of the proposed classification of Amharic language data used in monolingual, cross-lingual, and multilingual SER experiments is evaluated using F1-score and accuracy. We have shared the file names for the audio files that belonged to train, validation, and test partitions in the experiments (see footnote 4). For each experiment, the models were trained five times and the average result was reported.
The aim of Experiment 1 was to carry out an initial comparison of the proposed VGGE model with the two existing models discussed above, AlexNet and ResNet50. Four datasets were used, ASED, RAVDESS, EMO-DB, and URDU. To allow comparison with the other experiments, the emotion labels for each dataset were mapped onto just two labels, Positive valence and Negative valence, as shown by the scheme in Table 2. This follows the standard approach found in other work [6,9]. When comparing to other papers, we should bear in mind the label mapping which we needed to adopt in order to undertake the later cross-lingual and multilingual experiments. Looking at the table, we can see that the ASED, RAVDESS, EMO-DB and URDU datasets originally had five, eight, seven and four emotion classes respectively, and that these are now being mapped onto just two classes, positive and negative emotion. This simplifies the task, which can account for higher performance figures than in other published works.
Experiment 1 has two parts. In Experiment 1.1, the groups of speakers used for training and testing were varied. In Experiment 1.2, the dataset sentences used for training and testing were varied.
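The two evaluation measures used throughout the experiments, accuracy and F1-score, can be computed as in the sketch below. The text does not say how the F1-score is averaged over the two valence classes; macro-averaging is assumed here, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1-score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, classes=("positive", "negative")):
    """Unweighted mean of per-class F1 (ASSUMED averaging scheme)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
```

Reporting both measures is informative because accuracy alone can look acceptable when a model favours one valence class, whereas the F1-score drops in that situation.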

Experiment 1.1: Independence of Speakers
Results of the experiment are shown in Table 5. We can see that VGGE was the best on ASED (Amharic) and EMO-DB (German), ResNet50 was the best on RAVDESS (English) and AlexNet was the best on URDU (Urdu).
It is interesting to look at the average figures on the bottom row of the table. ASED (82.53%) and RAVDESS (82.71%) are very close, EMO-DB (77.78%) is 4.75% lower than ASED, and URDU (84.58%) is 2.05% higher than ASED. Generally, the differences are not that large when we consider that the languages have very different characteristics, and that the datasets were created independently by different researchers. Moreover, recall that the original data is being mapped onto two sentiment classes from the original four-to-eight classes (see Section 6.1 and Table 2). Subject to these points, we might conclude that Amharic and English monolingual mono-corpus SER are of similar difficulty, German is more difficult and Urdu is easier. As languages, English and German are perhaps the most similar, since they are both within the Germanic branch of the Indo-European language group. Urdu is also Indo-European, but from the Indo-Iranian branch. Finally, Amharic is from the Semitic branch of the Afro-Asiatic group.

Experiment 1.2: Independence of Sentences
Recall that the datasets all consist of different sentences spoken in every emotion, with the exception of the URDU dataset, based on TV talk show conversation, where individual sentences are not identified. Hence, URDU was not used here.
In this experiment, sentences were used either for training or for testing. For each of the datasets shown in the table, the proposed VGGE model, along with AlexNet and ResNet50, was trained using MFCC features. Each model was trained five times using an 80%/20% train/test split, and the average results were computed.
Results are in Table 6. The trends are similar to Experiment 1.1. This time, VGGE is the best on ASED and RAVDESS, while ResNet50 is the best on EMO-DB. Concerning the averages, ASED and RAVDESS are fairly close (84.46%, 81.11%), while EMO-DB is lower (66.67%). So this again suggests that, within the context of these particular datasets and this task, Amharic and English monolingual SER are of similar difficulty, while German SER is more difficult.
Experiment 2: Comparison of SER methods for Amharic cross-lingual SER
The aim was to compare the three models AlexNet, VGGE, and ResNet50 (Section 4) when applied to cross-lingual SER. This time, the systems are trained on data in one language and then tested on data in another language. Firstly, the three models are trained on ASED and then tested on EMO-DB, then trained on EMO-DB and tested on ASED, and so on, for different combinations. To allow this cross-training, dataset-specific emotion labels are mapped onto two classes, Positive and Negative, using the same method as for Experiment 1.
Once again, MFCC features were used for all models. The network configuration for VGGE was the same as in the preceding experiment (Figure 2). For the other models, the standard configuration and settings were used.
Results are shown in Table 7. We can conclude that the performance of the three models was very similar overall. This is supported by the average accuracy figures for AlexNet, VGGE and ResNet50 (61.79%, 61.88%, 61.93%) which are also very close, only 0.14% from the smallest to the biggest. The results in the table also show that the average F1-score performance for VGGE (55.99%) is higher than that for AlexNet (54.81%, 1.18% lower) and ResNet50 (53.30%, 2.69% lower). Hence, it is concluded from these results that the prediction performance of VGGE was best, closely followed by AlexNet and then ResNet50. However, the range of F1-scores is small, only 2.69% from the smallest to the biggest, indicating only a slight difference in performance between different scenarios.
Regarding the results as a whole, two points can be made. First, the accuracy obtained by training on one language and testing on another is surprisingly good. Second, the best language to train on when testing on Amharic seems to vary by model; for AlexNet it is RAVDESS (65.87%), for VGGE it is EMO-DB (64.22%) and for ResNet50 it is RAVDESS again (64.16%).
Finally, we can compare our results for this experiment (Table 7) with those given for previous cross-lingual studies in Section 2. Generally, they seem comparable. Our average results are around 62%. In the previous studies we see 56.8% [1], 57.87% [23], 65.3% [22], and 62.5% [47]. The highest is 71.62% [46]. In looking at these figures, we must remember that the exact methods and evaluation criteria used in previous experiments vary, so exact comparisons are not possible. Many different languages and datasets are used, emotion labels may need to be combined or transformed in different ways, and so on. Please refer to Section 2 for the details regarding these figures.

Experiment 3: Multilingual SER
In the previous experiment, we trained in one language and tested in another. In this final experiment, we trained on several non-Amharic languages and then tested on Amharic.
The same three models were used, AlexNet, VGGE and ResNet50, with the same settings and training regime as in the previous experiments. When RAVDESS is added to EMO-DB+URDU to make EMO-DB+RAVDESS+URDU, the performance of VGGE falls 1.53% to 68.41%. In the results presented in Table 9, the upper right-hand column shows the average accuracy. These results suggest that, when testing on Amharic, using several non-Amharic datasets for training gives a better result, by several percent, than using one non-Amharic dataset.
Comparing with previous studies in Section 2, there are only three which present multilingual experiments. Lefter et al. [25] report that training on three datasets, EMO-DB, DES and ENT, and testing on EMO-DB gave the best result, better than their cross-lingual trials. This concurs with our own findings, where average results for Experiment 3 (Table 8, bottom line) were higher than those of Experiment 2 (Table 7, bottom line). Latif et al. [23] found that training on EMO-DB, EMOVO and SAVEE and testing on URDU gained a better result than using just two training datasets. Latif et al. [22] also obtained the best result when training on three datasets.

Conclusions
In this paper, we first proposed a variant of the well-known VGG model, which we call VGGE, and then applied AlexNet, VGGE and ResNet50 to the task of Speech Emotion Recognition, focusing on the Amharic language. This was made possible by the existence of the publicly available Amharic Speech Emotion Dataset (ASED) which we created in previous work [32]. In Experiment 1, we trained the three models on four datasets, ASED (Amharic), RAVDESS (English), EMO-DB (German), and URDU (Urdu). In each case, a model was trained on one dataset and then tested on that same dataset. Speaker-independent and sentence-independent training variants were tried. The results suggested that Amharic and English monolingual SER are almost equally difficult on the datasets we used for these languages, German is harder, and Urdu is easier.
In Experiment 2, we trained on SER data in one language and tested on data in another language, for various language pairs. When ASED was the target, the best dataset to train on was RAVDESS for AlexNet and ResNet50, and EMO-DB for VGGE. This could indicate that, in terms of SER, Amharic is more similar to English and German than it is to Urdu.
In Experiment 3, we combined datasets for two or three different non-Amharic languages for training, and used the Amharic dataset for testing. The best result in Experiment 3 (EMO-DB+URDU→ASED, VGGE, 69.94%) was 4.07% higher than the best result in Experiment 2 (RAVDESS→ASED, AlexNet, 65.87%). In addition, the best overall average figure in Experiment 3 (VGGE, 66.44%) was 4.51% higher than the best overall average figure in Experiment 2 (ResNet50, 61.93%). These findings suggest that if several non-Amharic datasets are used for SER training, the results can be better than if one non-Amharic dataset is used, when testing is on Amharic. Overall, the experiments demonstrate how cross-lingual and multilingual approaches can be used to create effective SER systems for languages with little or no training data, confirming the findings of previous studies. Future work could involve improving SER performance when training on non-target languages, and trying to predict which combination of source languages will give the best result.

Figure 2: Network architecture of proposed VGGE, based on well-known VGG model.

Figure 4: Distribution of utterance lengths for all datasets, based on duration ranges. 2.0 − 3 in the figure means 2s <= d < 3s where d is the utterance length, and the same for the other ranges.
• We present a comparison of deep learning techniques in these tasks: AlexNet, ResNet50, and VGGE.

Table 2: Datasets used in the experiments. The table also shows the mapping from the emotion labels in each dataset onto just two valence labels which can be used across them all: Positive and Negative.

Table 4: Class distribution between the train, validation, and test partitions.

Table 5: Experiment 1.1: Monolingual SER results for different datasets (train in one language, test in the same language). Two valence values are used for all datasets, created with the mappings shown in Table 2.

Table 6: Experiment 1.2: Monolingual SER results for the different datasets. Training and testing sentences are varied in this experiment.

Table 7: Experiment 2: Cross-lingual SER results (train in one language, test in another language).

Table 8 shows the results. The first three rows for each model show the results when two datasets were used for training, EMO-DB+RAVDESS, EMO-DB+URDU and RAVDESS+URDU. The fourth row uses all three datasets for training, i.e. EMO-DB+RAVDESS+URDU. In all cases, testing is with ASED.
The best overall performance in the table is for VGGE, training with EMO-DB+URDU (69.94%).The average figure for VGGE over all the dataset training combinations is also the best (66.44%).

Table 8: Experiment 3: Multilingual SER results (train in two or three non-Amharic languages, test in Amharic).