1. Introduction
The recognition of emotions is a relatively difficult and complex task [1], even for humans. Many people would say that they can perform this task efficiently; however, they usually have the advantage of recognizing emotions from several cues at once, such as body language, facial expression, and voice timbre or prosody. Meanwhile, speech emotion recognition (SER) is a potentially significant step toward the future, as it presents a huge variety of use cases.
SER, in contrast, recognizes emotions using only one modality, voice recordings, which makes the task more complex. It also introduces an additional medium, the microphone, which may capture noise [2]. Achieving decent results on this type of problem could make machines more humanized, as there is nothing more human-like than emotions. Enabling machines to understand human moods and intentions could find use in fields such as security [3], medicine, emergency call centers [4], telemarketing, and daily work in social institutions. These examples show that such a tool could support not only machines but also people who are less able to distinguish the emotions of their interlocutors or who have perception disabilities.
Many research results can be found in this field of study [5,6,7,8], but they are usually obtained on a single dataset. Such an approach does not guarantee the same results after the models are deployed. Some of the available datasets involve few actors, which is particularly unfavorable for neural networks, as they require vast amounts of data to perform well in every environment. Moreover, using multiple datasets can reduce the model’s tendency to learn the characteristics of the recordings themselves, since datasets tend to have diversified sound characteristics due to the variety of recording equipment used.
Having more data does not necessarily mean better results, as the data first have to be prepared appropriately. Data preparation for this type of study can be time-consuming, and there is no standard approach that guarantees the best possible results. The data can be prepared in several ways, with spectrograms and mel-spectrograms being among the most common representations. This research aims to compare which of the two yields better CNN performance when used for training. Since such comparisons are nowhere to be found in the literature, exploring them could save time for many researchers.
What is more, it is common to randomly split data into training and test sets when dealing with machine learning algorithms. This is usually a good approach, but not always, and if performed incautiously, it can lead to false results. Sometimes the data are interdependent. For example, multiple instances may represent the same object repeatedly but in different environments, or a group of samples may share the same characteristics, as in the case of audio data containing multiple recordings per emotion prepared by the same actor. In such cases, it is necessary to divide the data so that these instances do not appear in both the training and test sets. Unfortunately, there are still solutions provided by researchers in which no attention was paid to this issue, probably by oversight [5,9,10]. In order to draw attention to the problems this may entail, it was decided to conduct research in this area as well.
Many cases of expressed emotions are perceivable, yet they can be quite hard to label. The people who prepare the datasets approach this problem differently. Sometimes labels are assigned based on the opinion of the speaker, of psychologists, or of a group of people asked to evaluate the speaker’s emotions. Until now, the quality of the datasets has been verified only by their authors, not by other researchers. That is why the decision was made to perform an additional study on data quality; specifically, label correctness was examined to verify how reliable the datasets are.
This research aims to address all these challenges by describing the preparation and usage of five different datasets. Combining datasets is also advantageous because, in most cases, they are prepared in different environments and with different resources.
2. Related Works
This section presents various algorithms used in speech emotion recognition introduced by other researchers. Algorithms and machine learning methods cannot be discussed without first discussing the data format. When working with audio, there is a variety of representations to choose from: the raw waveform, or 2D forms such as spectrograms, mel-spectrograms, mel-frequency cepstral coefficients (MFCCs), and many more.
If the input data are in raw audio format, one possible approach, besides the classical recurrent neural networks and their variations, is to use the WaveNet architecture [11]. Researchers created a solution [12] that is able to classify emotions from speech based on raw audio data and employed this architecture as a backbone.
For the 2D audio data format, Long Short-Term Memory (LSTM) networks are frequently used in conjunction with CNNs [13,14,15,16]. LSTMs can remember long-term relationships in the input signal, which is beneficial when dealing with sequential input data such as audio, while CNNs can learn features from high-dimensional input data [17]. Many variants of LSTMs have already been introduced, one of which is the Dual-Sequence LSTM architecture [13]. Wang et al. include two LSTM architectures in their work: the first is a basic LSTM that processes MFCCs extracted from a speech sample, and the second is a dual-sequence LSTM fed simultaneously with two mel-spectrograms of different time-frequency resolutions. The final classification is based on the average of the outputs of the standard LSTM and the dual-sequence LSTM. Zang et al., in their paper “A Study on Speech Emotion Recognition Model Based on Mel-Spectrogram and CapsNet”, used mel-spectrograms to classify emotions from voices and compared the performance of a Support Vector Machine (SVM), a CNN, an LSTM, and CapsNet. The results presented in the paper show the dominance of the capsule network over the other studied algorithms. The literature also includes studies on datasets in different languages [18].
Another approach without which this section would not be complete is the use of transformers [19]. For example, Tan and Soleymani, in their work published in 2022, used a pre-trained audio-visual transformer for emotion recognition and achieved promising results. In addition, they used spectrograms, mel-spectrograms, and features extracted by TRILL [20] for the auditory modality.
Another essential aspect may be the speed of inference, which matters if the goal is to implement a system that works in real time. In this case, lightweight convolutional neural networks can be an appropriate choice [12]. CNNs have proven to be an efficient method that can be optimized and reduced in size without significant performance losses, and they are commonly used when the input data are represented in 2D format [16,21,22,23].
3. Selected Approach
The method studied in this paper recognizes emotions based on spectrograms [7] and mel-spectrograms [24] retrieved from short voice recordings. The literature shows that when using CNNs, the primary approaches to input data are spectrograms, mel-spectrograms, or raw speech signals [25]. In this study, it was decided to use both spectrograms and mel-spectrograms and investigate which of the two feature extraction methods performs better. After a systematic literature review, it remains ambiguous which data type is better suited for speech emotion classification, and a comparison of their performance is nowhere to be found. These types of feature extraction were chosen because previous studies have shown that they are more suitable than, for example, the raw signal combined with a CNN [26]. This is why it is important to measure which data type to choose and to support the decision with experimental results. Previous studies by many researchers [16,17,21,22,23,27] have proven that the convolutional neural network (CNN) architecture is suitable for the problem, so this relatively simple architecture is used in all experiments in this work. The goal is not to develop the best possible model but to highlight the challenges researchers face. However, it is common to use only a single dataset [5,6], which does not necessarily show whether the trained models can be utilized in a real environment. In this article, five different datasets are employed, namely CREMA-D, RAVDESS, SAVEE, IEMOCAP, and TESS; such an amount of data has not been used in studies on SER before.
4. Datasets
The chosen datasets have been used in several configurations, both individually and mixed with others. They differ in the number of emotions and in the distribution of actors by gender. Information on each dataset is included in the following subsections.
4.1. CREMA-D
CREMA-D stands for Crowd-sourced Emotional Multimodal Actors Dataset [28]. It provides multimodal data, i.e., both visual and audio recordings. The dataset consists of 7443 clips prepared with the participation of 91 actors. Categorical emotion labels were obtained through crowdsourcing from 2443 raters, making this dataset reliable. The emotions presented in CREMA-D are neutral, happiness, anger, disgust, fear, and sadness. Having such examples from many different actors makes it possible to train neural networks on this dataset alone.
4.2. RAVDESS
RAVDESS stands for the Ryerson Audio-Visual Database of Emotional Speech and Song [29]. This dataset delivers two types of data: speech and song. It consists of 7356 files from 24 professional actors (12 male and 12 female), of which only 1440 are speech audio-only files. The emotions represented in RAVDESS are neutral, calm, happy, sad, angry, fearful, surprised, and disgusted. Not all of them were used in this article: calm and surprise were dropped because they rarely occur in the other datasets, and the RAVDESS dataset is too small to be used alone for deep neural network (DNN) training. The number of examples per emotion is therefore slightly uneven; for the “neutral” emotion, there are only 48 files compared to 96 for the others.
4.3. SAVEE
The name SAVEE is short for Surrey Audio-Visual Expressed Emotion [30]. The dataset consists of audio, visual, and audio-visual modalities. Its authors collected recordings of four English male actors expressing seven emotions: anger, disgust, fear, happiness, sadness, surprise, and neutrality. The dataset has an unbalanced class distribution, as it contains 90 examples of the “neutral” emotion, twice as many as of each of the others. It is also relatively small for artificial neural network (ANN) training, so no research was conducted on this dataset alone. However, it was a perfect fit to combine with the RAVDESS dataset, which contains fewer “neutral” examples but is also not big enough on its own.
4.4. TESS
The Toronto Emotional Speech Set (TESS) [31] represents seven emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral. It contains recordings of two female actors aged 26 and 64 and consists of 2800 audio files. Even though the number of audio files is large, preparing a CNN model solely on this dataset would be almost impossible, as only two actors took part in the recordings and half of the dataset would have to be reserved for testing. That is why it is used only in combination with the CREMA-D, RAVDESS, SAVEE, and IEMOCAP datasets.
4.5. IEMOCAP
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [32] is commonly used for emotion classification [8,33]. It provides multiple modalities: motion-capture face information, speech, videos, head movement and head angle information, dialog transcriptions, and word-level, syllable-level, and phoneme-level alignment. The IEMOCAP dataset presents a highly unbalanced class distribution, which is why it was used only for additional studies with four emotions. Since the “happy” emotion is considered crucial to recognize and has few occurrences in the dataset, it was merged with the “excitement” emotion, which is a common technique in other articles [8,34]. Another common approach presented in the same articles is to use this dataset to recognize only four emotions because of the insufficient number of examples of the other emotions.
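As a minimal, hedged illustration of this relabeling step, the Python sketch below maps raw IEMOCAP annotations to the four retained classes; the short label strings (“exc”, “hap”, “sad”, “ang”, “neu”) follow commonly used IEMOCAP abbreviations and are assumptions rather than the authors’ exact code.

```python
# Hedged sketch of the 4-class IEMOCAP label handling described above.
# The raw label strings are assumed to follow the usual IEMOCAP abbreviations.

# Emotions kept for the 4-class task; "excitement" is folded into "happy".
KEEP = {"neu": "neutral", "hap": "happy", "sad": "sad", "ang": "angry"}


def map_iemocap_label(raw_label: str):
    """Return the merged 4-class label, or None if the utterance is dropped."""
    if raw_label == "exc":      # merge excitement into happiness
        raw_label = "hap"
    return KEEP.get(raw_label)  # e.g. "fru", "fea", "xxx" -> None (discarded)


# Example: map_iemocap_label("exc") -> "happy"; map_iemocap_label("fru") -> None
```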
4.6. Comments on Datasets
The main focus was to gather as much data as possible, but some restrictions had to be established. A selected dataset had to be annotated with at least five different emotions and provided as uncompressed audio files in the wav format. Another restriction was that at least two actors had to take part in preparing the dataset. This made it possible to place samples of one actor in the training set and of another in the test set, because having interdependent data in both sets could produce misleading results. This issue is discussed by data analysts in Chapter 6 of [35] and pertains to some papers [36,37]. The number of labeled recordings retrieved for each emotion from each dataset is shown in Table 1. Datasets with fewer than four actors are used only in combination with other datasets, never alone. Finally, only audio files of recorded speech were selected from each dataset, as this research considers only this modality.
5. Architecture
Recent advancements in the field of SER have demonstrated that Convolutional Neural Networks (CNNs) can produce satisfying results [7,8]. Creating the best possible model was not the goal of this research; the following sections describe studies that were conducted to answer specific questions and address specific problems. Considering the above, a basic model architecture was developed and used in all experiments, with some minor adjustments to make training possible. The base model consists of several convolutional and max-pooling layers with dropout, followed by a flatten layer and two dense layers. The architecture is shown in Figure 1.
The input data used in experiments are spectrograms and mel-spectrograms extracted from raw wav files. The sampling rate is 22,050 Hz. The sizes of both data types were 231 × 349 × 3.
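For illustration, a minimal Keras sketch of the base model described above follows. The overall layout (stacked convolutional and max-pooling blocks with dropout, a flatten layer, and two dense layers) and the 231 × 349 × 3 input shape follow the text; the specific filter counts, dropout rate, dense width, optimizer, and the helper name `build_base_cnn` are illustrative assumptions, not the authors’ exact configuration.

```python
# A minimal sketch of the base CNN described above; concrete layer sizes are
# assumptions, since the paper only specifies the overall layout and input shape.
import tensorflow as tf
from tensorflow.keras import layers, models


def build_base_cnn(num_classes: int = 6) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(231, 349, 3)),          # spectrogram / mel-spectrogram image
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),        # first dense layer
        layers.Dense(num_classes, activation="softmax"),  # emotion probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```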
6. Performance of Different Feature Extraction Methods
There are many options with regard to the input data format for emotional speech recognition. For example, one can use raw audio data in wav format [12,38] and feed it to a neural network, or use Mel-Frequency Cepstral Coefficients (MFCCs) [39,40,41], another audio representation format. However, this section focuses on establishing whether the basic models perform better with data in the form of a spectrogram or a mel-spectrogram.
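As a hedged sketch of how the two compared representations can be produced, the snippet below uses librosa to compute a linear-frequency spectrogram and a mel-spectrogram from the same recording at the 22,050 Hz sampling rate used in this work; the STFT and mel parameters (n_fft, hop_length, n_mels) are illustrative assumptions and are not claimed to reproduce the authors’ exact settings.

```python
# Hedged sketch: extracting the two compared input representations with librosa.
# The STFT/mel parameters below are illustrative, not the authors' settings.
import librosa
import numpy as np


def extract_features(path: str, sr: int = 22050):
    y, _ = librosa.load(path, sr=sr)

    # Linear-frequency spectrogram in dB.
    stft = librosa.stft(y, n_fft=1024, hop_length=256)
    spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

    # Mel-spectrogram in dB: the same signal mapped onto a mel frequency scale.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    return spectrogram_db, mel_db
```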
Previous experiments focused on creating a CNN model and measuring its performance. For a better comparison between experimental results, a similar architecture was used throughout, consisting of several convolutional and max-pooling layers followed by two dense layers. Only minor adjustments were made to achieve better results, as the experiments were run on datasets that differed in size. The main focus was on basic convolutional neural networks. The chosen approach was to design artificial neural networks and train them from scratch using two input variants: spectrograms and mel-spectrograms. No padding was applied to the spectrograms, whereas padding was introduced for the mel-spectrograms. Dropout layers were added for regularization. The experiments aimed to show the differences in performance between similar CNNs trained on different datasets. In each experiment, two separate models were developed, one fed with spectrograms and the other with mel-spectrograms, so that the difference between these two input types could also be checked. Moreover, the experiments were performed on individual datasets and on combinations of multiple datasets. The division into training and test sets was controlled, ensuring that actors from the training set did not reappear in the test set.
The results presented in Table 2 indicate a slight advantage of using mel-spectrograms instead of spectrograms, as in almost every experiment the models using mel-spectrograms achieved better scores. Two tasks of different complexity can be distinguished: the classification of four emotions and the classification of six emotions. The best result for classifying four emotions was obtained by a model trained on four datasets (CREMA-D, SAVEE, RAVDESS, TESS), with a test accuracy of 55.89%. For the recognition of six emotions, the highest test accuracy was 57.42%, achieved by the model trained and tested on the CREMA-D dataset. To verify the reliability of the study, the popular ResNet18 architecture [42] was also used. ResNet18 achieved better results on spectrograms than the custom CNNs did on the same combination of data, which allows its results to be compared against those of the custom CNNs using mel-spectrograms.
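For reference, a minimal sketch of how an off-the-shelf ResNet18 can be adapted to a 6-class SER task is given below; the use of torchvision, training from scratch (weights=None), and the helper name `build_resnet18` are assumptions, since the paper does not specify the exact implementation of the reference model.

```python
# Hedged sketch of a ResNet18 reference model for 6-class SER.
# Framework choice (PyTorch/torchvision) is an assumption, not the authors' code.
import torch.nn as nn
from torchvision import models


def build_resnet18(num_classes: int = 6) -> nn.Module:
    net = models.resnet18(weights=None)                    # train from scratch
    net.fc = nn.Linear(net.fc.in_features, num_classes)    # replace the 1000-way head
    return net
```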
Based on the results described above, mel-spectrograms performed better and were therefore used in the further experiments presented later in the article.
7. The Importance of Data Division into Training and Test Sets
As is generally known, incorrectly split data in an SER task can lead to misleading results from deep learning algorithms. Suppose, for example, that recordings of the same actor and the same emotion are mixed between the training and test sets [36,37]. In this case, a high score may be observed during evaluation; still, because the model learns specific features of the given actors, it is not well prepared for a completely new speaker, which leads to failure during deployment. The reason for such behavior is the model’s ability to distinguish particular actors and their emotions, rather than the ability to classify the emotions of unknown speakers in a real-life environment.
However, there has been no attempt in the literature to show what impact this can have on performance. Three experiments were conducted, first on the TESS and IEMOCAP datasets and then on all datasets combined, to demonstrate the difference in performance between correctly and incorrectly split data. In the proper split, samples of the same actor are not mixed between the training and test sets. First, the model was trained and evaluated on properly split data; subsequently, the same procedure was repeated for data split without separating actors between the training and test sets. Finally, the two variants were compared for all three cases.
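The actor-independent split described above can be implemented, for example, with scikit-learn’s GroupShuffleSplit, as in the hedged sketch below; the `actor_ids` array, the 80/20 split ratio, and the helper name are assumptions, not the authors’ exact procedure.

```python
# Hedged sketch of an actor-independent train/test split.  GroupShuffleSplit
# keeps every sample of a given actor on exactly one side of the split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def actor_independent_split(features, labels, actor_ids, test_size=0.2, seed=42):
    """features, labels, actor_ids: NumPy arrays with one entry per sample."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(features, labels, groups=actor_ids))
    # Sanity check: no actor appears in both sets.
    assert set(actor_ids[train_idx]).isdisjoint(set(actor_ids[test_idx]))
    return train_idx, test_idx
```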
Table 3 presents the results of the three experiments: on two individual datasets and on a combination of all datasets. Similarly to the previous experiment, ResNet18 was used as a reference. The most striking difference can be seen in the first experiment, on the TESS dataset. Nevertheless, in all cases the random split produced better accuracy. This is due to the model’s ability to distinguish between actors and their emotions rather than emotions alone, and there is a risk that such a model would not be able to handle a completely new actor.
8. Human-Based Speech Emotion Classification on CREMA-D
An additional study was conducted in which humans classified emotions in recordings from the CREMA-D dataset in order to verify the achievable results. This dataset was chosen because it is the largest and one of the most recent. Cao et al. [28] published a detailed description of the data collection process along with some statistical analysis.
This dataset was selected for the study because it was labeled through crowdsourcing, a process that relies on labeling data by volunteers. In the experiment, 54 volunteers of Polish nationality, aged 22–58, participated. The subjects were asked to classify the emotions in the presented recordings. Thirty recordings, five per emotion, were chosen at random. The experiment took the form of an online questionnaire, and the recordings could be replayed as many times as the participants requested.
The confusion matrix is presented in Table 4. The most challenging emotions to classify were disgust and sadness, for which the correct choices were not the most common answers. The interviewees classified anger with the highest accuracy, 76%. Overall, the mean score achieved during the study was 14.63 out of a possible 30 (48.76% accuracy), with individual scores ranging from 2 to 21; the median was 15.
To verify whether the answers were selected at random, a one-sample t-test was performed. The null hypothesis (H0) states that the sample mean equals 5, which corresponds to a chance-level accuracy of 16.67%. The null hypothesis was rejected with a p-value lower than 0.001. These results confirm that humans are able to distinguish emotions from the recordings.
Another computed statistic was the confidence interval for the population mean. The 95% confidence interval obtained in this study is (0.4562848288, 0.5177891712), which means that, based on the questionnaire results, one may be 95% confident that the population mean accuracy falls between ~45.63% and ~51.78%.
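Both statistics can be reproduced with SciPy as in the hedged sketch below; because the per-participant scores are not published here, the `scores` array is filled with synthetic placeholder values, so the snippet only illustrates the procedure (a one-sample t-test against the chance-level mean of 5 and a 95% confidence interval for the mean accuracy), not the reported numbers.

```python
# Hedged sketch of the reported statistics; `scores` below is SYNTHETIC
# placeholder data (54 participants, 2-21 correct answers each out of 30),
# not the actual questionnaire results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.integers(2, 22, size=54).astype(float)   # placeholder per-participant scores

# One-sample t-test against the chance-level mean of 5 correct answers (16.67%).
t_stat, p_value = stats.ttest_1samp(scores, popmean=5.0)

# 95% confidence interval for the population mean accuracy.
accuracy = scores / 30.0
ci_low, ci_high = stats.t.interval(0.95, df=len(accuracy) - 1,
                                   loc=accuracy.mean(), scale=stats.sem(accuracy))

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"95% CI for mean accuracy: ({ci_low:.4f}, {ci_high:.4f})")
```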
The article describing the CREMA-D dataset [28] states that the labeling process used crowdsourcing, but those labels are attached only as additional information; the labels encoded in each file name correspond to the emotion the actors were asked to portray. The study presented above therefore provides another set of possible labels for the 30 audio files used, so these files now have three possible labels. These annotations are compared in a table in the Appendix. Comparing the labels from this study with those from the CREMA-D crowdsourcing revealed six disagreements, which is significant for a sample of 30 recordings. These disagreements raise concerns about the number of annotators used in CREMA-D, as each audio file was annotated by only 7–11 people, compared to the 54 participants in the study above. These conclusions also show how complex the data preparation process is, as no specific guideline clearly states all the steps necessary to produce a good dataset for artificial intelligence (AI) models.
9. Conclusions
This study investigated the difference in the performance of CNN models using spectrograms and mel-spectrograms. Two different architectures were employed: the popular ResNet-18 and a custom CNN architecture similar to the classic LeNet. Most of the conducted experiments demonstrated that using mel-spectrograms as the feature extraction method significantly improves the accuracy metric. There was only one experiment in which models trained on spectrograms outperformed those trained on mel-spectrograms, and only by a small margin. This leads to the conclusion that it is usually better to choose mel-spectrograms as the data processing method. Despite the effectiveness of mel-spectrograms in speech recognition, an outcome one could have anticipated, many authors in the literature still use spectrograms. Our goal, therefore, was to quantify the differences in the effectiveness of CNN training in both cases. The results are presented in the form of a benchmark covering different datasets, their combinations, and the type of input data.
Additionally, a model trained on all of the gathered datasets was prepared. Despite not showing the best accuracy, it has the greatest potential for a real-world environment. Although the data were collected from different actors and using different microphones, the performance of the trained model should still be verified in further studies in real-life scenarios.
The performance of the much bigger ResNet-18 architecture was slightly worse than that of our custom model. This shows that, with the available data, there is no real advantage in using bigger models, as a model of about 800 thousand parameters (the custom CNN) outperformed one with 11 million (ResNet-18). The smaller model is also quicker at inference after deployment and significantly lighter. One of the goals of this article is to show how important the proper splitting of datasets into training and test sets is. Researchers always strive for the best possible performance of their AI models, but the problem of interdependent data is sometimes overlooked [36,37]. Even though some research claims excellent results in SER, these results are difficult to compare with others unless a specific dataset-splitting strategy is reported [5,9,10,35]. Some papers addressed the problem, but unfortunately the results cannot be verified directly, as the software is not always shared with the research community. Therefore, we decided to present the results of experimental comparisons in this regard.
Another consideration of this article is how well prepared for SER the currently available datasets are. The study presented in Section 8 shows that a dataset might not be of the highest possible quality. Based on this research and the participants’ comments, the authors found no particular need to classify more than four emotions. For many people, emotions like disgust or sadness are hardly ever conveyed in speech, and it is even harder for the actors preparing the datasets to imitate them. If the examined subjects classify emotions from speech with an average accuracy of just under 49% (the study presented in Section 8), it is difficult to establish what should be expected from machine learning models. Namely, will 100% accuracy be possible in the future, and what would it mean for people if machines could classify emotions that cannot even be described by the speakers being tested? Since these problems cannot be solved at this stage of research development, they remain rhetorical questions.
The source code of the software developed by the authors has been shared on the GitHub platform [43].