Article

Analysis and Research on Spectrogram-Based Emotional Speech Signal Augmentation Algorithm

Huawei Tao, Sixian Li, Xuemei Wang, Binkun Liu and Shuailong Zheng
1 Key Laboratory of Grain Information Processing and Control, Henan University of Technology, Ministry of Education, Zhengzhou 450001, China
2 Henan Key Laboratory of Grain Storage Information Intelligent Perception and Decision Making, Henan University of Technology, Zhengzhou 450001, China
3 Institute for Complexity Science, Henan University of Technology, Zhengzhou 450001, China
* Authors to whom correspondence should be addressed.
Entropy 2025, 27(6), 640; https://doi.org/10.3390/e27060640
Submission received: 29 March 2025 / Revised: 10 May 2025 / Accepted: 13 June 2025 / Published: 15 June 2025

Abstract
Data augmentation techniques are widely applied in speech emotion recognition to increase the diversity of data and enhance the performance of models. However, existing research has not deeply explored the impact of these data augmentation techniques on emotional data. Inappropriate augmentation algorithms may distort emotional labels, thereby reducing the performance of models. To address this issue, in this paper we systematically evaluate the influence of common data augmentation algorithms on emotion recognition from three dimensions: (1) we design subjective auditory experiments to intuitively demonstrate the impact of augmentation algorithms on the emotional expression of speech; (2) we jointly extract multi-dimensional features from spectrograms based on the Librosa library and analyze the impact of data augmentation algorithms on the spectral features of speech signals through heatmap visualization; and (3) we objectively evaluate the recognition performance of the model by means of indicators such as cross-entropy loss and introduce statistical significance analysis to verify the effectiveness of the augmentation algorithms. The experimental results show that “time stretching” may distort speech features, affect the attribution of emotional labels, and significantly reduce the model’s accuracy. In contrast, “reverberation” (RIR) and “resampling” within a limited range have the least impact on emotional data, enhancing the diversity of samples. Moreover, their combination can increase accuracy by up to 7.1%, providing a basis for optimizing data augmentation strategies.

1. Introduction

In modern society, emotional speech recognition technology has become a core component of human–computer interaction systems, with widespread applications in fields such as mental health monitoring [1], customer service quality inspection [2], age-appropriate product design [3], and intelligent transportation systems (ITS) [4]. Accurate recognition of emotion not only helps machines to understand users’ emotions but also significantly enhances system intelligence and user experience.
However, the field of speech emotion recognition (SER) is confronted with a serious issue of data sparsity. Taking the commonly used SAVEE [5] and EMODB [6] datasets as examples, the numbers of samples are merely 480 and 800, respectively. In the face of the complex and diverse emotional expressions of human beings, the coverage of these samples is extremely limited. As a result, data augmentation technology has gradually become an important approach to enhancing the generalization ability of models. Existing studies have examined the role of augmentation algorithms in improving model performance; for example, Atmaja et al. [7] pointed out that for text-independent data (covering the scenario where both the speaker and the text are independent), data augmentation methods can generally boost the performance of speech emotion recognition. Other studies [8] have also highlighted the effectiveness of data augmentation techniques in enhancing the performance of speech emotion recognition. Nevertheless, most studies stop at the level of overall performance improvement, evaluating the augmentation effect through metrics such as accuracy without exploring the possible impact of augmentation algorithms on emotional expression. In our previous work [9], we only explored the impact of classical acoustic data augmentation methods on model performance, focusing on model construction and a basic analysis of these methods; we did not examine in depth how data augmentation techniques affect the emotional data in speech emotion recognition. Moreover, understanding the impact on model performance alone is insufficient for a comprehensive understanding of how data augmentation operates in this field.
In speech emotion recognition, the spectral features of speech signals such as fundamental frequency [10,11], formants [12], and their temporal variations [13] also play a crucial role. With the development of deep learning technology, spectrograms have become increasingly important in speech emotion recognition. When distinguishing emotions with the same arousal level, traditional prosodic features such as intonation, intensity, and speaking rate often lead to a high degree of confusion [14]. Spectrograms can more comprehensively capture the spectral characteristics of speech signals, including fundamental frequency, formants, and their temporal variations, thereby achieving significant advantages in recognizing emotional expressions. A.M. Badshah et al. [15] combined spectrograms with deep learning for emotion recognition and demonstrated the effectiveness of this method on multiple datasets. Zhang et al. [16] explored the integration of spectrograms and emotion recognition models and proposed a new architecture. Biswas et al. [17] proposed a model for recognizing emotions from speech data using logarithmic spectrograms and deep convolutional neural networks (CNNs). H. Li et al. [18] introduced a transformer-based model called MelTrans which extracts key clues from speech data by learning the core features and long-range dependencies in Mel spectrograms of speech. However, most current studies merely use spectrograms as visual aids for auxiliary analysis. They neither thoroughly investigate whether data augmentation techniques affect the core spectral features of spectrograms, such as fundamental frequency and formants, nor systematically quantify the specific extent of such impacts.
In response to the above-mentioned research gaps and challenges, this paper systematically evaluates the impacts of five common data augmentation algorithms: adding noise [19,20], time stretching [21], fundamental frequency transformation [22], resampling [23], and reverberation (RIR) [24,25]. We investigate the effects of these techniques on the spectral features of spectrograms and the performance of emotion recognition, then expound on the advantages and limitations of each method. This study is carried out from three dimensions. First, to analyze subjective auditory perception, listening experiments are designed to evaluate the quality of the augmented speech, noise perception, and the intensity of emotional expression. Second, jointly extracting multi-dimensional features such as the average spectral amplitude, fundamental frequency, formant frequency, and time domain duration from the spectrogram allows us to use heatmap visualization to conduct an in-depth exploration of the influence mechanism of data augmentation algorithms on the spectral characteristics of speech signals. Third, we provide an analysis of objective recognition performance. The cross-entropy loss calculated during the training and testing processes is utilized to objectively evaluate the recognition performance of the model. The misclassification patterns within emotion categories are observed through confusion matrices to comprehensively assess the performance of different data augmentation algorithms in emotion recognition tasks. In addition, this paper also explores the combined application of different data augmentation techniques and studies the impact of their synergistic effects on the performance of the model. Through these approaches, this paper aims to reveal the profound impacts of data augmentation on spectral characteristics and emotion recognition performance, providing a scientific basis for the optimization and application of data augmentation techniques in emotional speech processing.

2. Related Theory

2.1. Spectrogram Generation Principle

The generation of spectrograms is based on the fundamental theories of signal processing. A standard spectrogram is typically obtained by performing a short-time Fourier transform (STFT) on an audio signal, then presenting the results on a linear frequency scale [26]. The specific formulas are provided below.
For an audio signal x(t), frame division is carried out first:

x_m(t) = x(t) \cdot w(t - m T_s)

where x_m(t) represents the m-th frame, w(t) is the window function, and T_s is the frame shift interval. Then, the short-time Fourier transform (STFT) is applied to each frame:

X_m(f) = \int_{-\infty}^{+\infty} x_m(t) \, e^{-j 2 \pi f t} \, dt.

Finally, taking time (corresponding to different values of m) as the horizontal axis and linear frequency f as the vertical axis, the magnitude values |X_m(f)| of the spectrum X_m(f) of each frame are used as the amplitude of each point, and the result is presented as a standard spectrogram.
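For illustration, the following minimal Python sketch computes such a spectrogram with the librosa and numpy libraries; the file name and frame parameters are placeholders, not settings prescribed by this paper.

import numpy as np
import librosa

# Load a waveform (hypothetical file name) and compute the STFT frame by frame.
audio, sr = librosa.load("example.wav", sr=16000)
n_fft = 1024        # window length in samples
hop_length = 256    # frame shift T_s in samples
stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length, window="hann")
spectrogram = np.abs(stft)  # |X_m(f)|: rows are linear frequency bins, columns are frames m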

2.2. Data Augmentation Algorithms

The accuracy of emotional speech recognition models is highly dependent on the richness of data, and speech data augmentation techniques are commonly used to address the issue of data scarcity. By applying various transformations to speech signals, these techniques can improve the generalization ability of the trained model and enhance its ability to recognize different emotional states [27]. Table 1 lists common data augmentation algorithms and their implementation methods in spectrograms.
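As a rough illustration of how the transformations in Table 1 are typically realized at the waveform level, the sketch below uses librosa and numpy; all parameter values are placeholders rather than the settings reported later in Table 4.

import numpy as np
import librosa

def add_noise(y, snr_db=30):
    # additive Gaussian noise scaled to a target signal-to-noise ratio
    p_signal = np.mean(y ** 2)
    k = np.sqrt(p_signal / (10 ** (snr_db / 10)))
    return y + k * np.random.randn(len(y))

def time_stretch(y, rate=1.2):
    # speed up (rate > 1) or slow down (rate < 1) without changing pitch
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y, sr, n_steps=3):
    # shift the pitch by n_steps semitones without changing duration
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def resample_round_trip(y, sr=16000, inter_sr=20000):
    # resample to an intermediate rate and back, simulating other sampling conditions
    up = librosa.resample(y, orig_sr=sr, target_sr=inter_sr)
    return librosa.resample(up, orig_sr=inter_sr, target_sr=sr)

def add_reverberation(y, rir):
    # convolve with a measured or simulated room impulse response
    wet = np.convolve(y, rir)[: len(y)]
    return wet / (np.max(np.abs(wet)) + 1e-8)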

3. Proposed Method

3.1. Subjective Listening Design

A subjective listening experiment was designed to verify the changes in speech quality and emotional expression resulting from different augmentation algorithms based on human auditory perception. The design of the listening experiment is shown in Figure 1.
Sample Selection: A number of emotional speech segments were randomly selected from the original corpus. Data augmentation was applied with various parameter settings, and representative samples were chosen to ensure multidimensional coverage of emotional categories, speech samples, and augmentation algorithms.
Audience Participation: Three listeners with professional training and expertise in audio analysis who had no hearing impairments were invited to participate in the experiment. They provided detailed feedback on each augmentation algorithm based on their auditory perception. The listeners were asked to describe the following aspects:
  • Changes in Audio Quality: The listeners described differences in clarity, distortion, and other audio qualities between the augmented audio and the original audio. They were asked whether the speech still sounded natural and whether there were any unpleasant effects (e.g., blurring, distortion).
  • Noise Perception: The listeners noted whether the augmented audio introduced noise, specifying the type of noise (e.g., background noise, buzzing, echoes) and the severity of its impact (mild/significant/severe).
  • Changes in Emotional Expression Intensity: The listeners described whether the emotional expression in the augmented audio had changed, such as whether the emotion had been weakened or enhanced. They were also asked whether they could perceive any changes in the authenticity or intensity of the emotion.

3.2. Spectral Characteristics Analysis and Feature Extraction of the Spectrogram

Spectrograms can display the energy distribution, time domain dynamic characteristics, and frequency variation trends of speech signals. Figure 2 shows information on frequency and time, fundamental frequency (F0), formants, and dynamic changes within the spectrogram. These features provide rich time–frequency information for speech emotion recognition [28]. In research on speech emotion recognition, many scholars have utilized spectrograms to analyze speech features; however, most of these studies are merely at the level of intuitive observation or the extraction of single features. In our previous research work, we evaluated the effect of data augmentation by comparing the visual differences of spectrograms; however, that study lacked a quantitative analysis of the feature changes.
Thus, in this paper we adopt a method of joint extraction and quantitative analysis of multiple features. The Librosa library [29] is used to extract four key features: average spectral amplitude, fundamental frequency, formant frequency, and time domain duration. The specific formulas are as follows:
The short-time Fourier transform (STFT) is used to extract average spectral features:
spectral\_amplitude\_mean = \frac{1}{M \cdot N} \sum_{m=0}^{M-1} \sum_{k=0}^{N-1} |X(m, k)|
where M is the number of frames and N is the number of frequency points.
The probabilistic YIN algorithm (PYIN) [30] is used to extract the fundamental frequency:
f_0(i) = \frac{\mathrm{SamplingRate}}{\mathrm{period}_i} \ \text{for each frame } i, \qquad f_{0\_\mathrm{mean}} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} f_0(i), & \text{if } f_0(i) \text{ is valid} \\ 0, & \text{if } f_0(i) \text{ is invalid} \end{cases}

where f_0(i) is the fundamental frequency estimated for each frame i.
In this study, the second-order linear predictive coding (LPC) method [31] is adopted to extract the formant frequencies of audio signals and the average value of the positive formant frequencies is calculated as a feature.
The signal length calculation used to extract time domain features (time duration) is as follows:
time\_duration = \frac{\mathrm{len}(audio)}{sr}

where len(audio) refers to the number of audio signal samples and sr refers to the sampling rate.
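A minimal sketch of this joint feature extraction with the Librosa library is shown below; the pyin frequency bounds and the low LPC order are illustrative assumptions consistent with the description above, not values prescribed by the paper.

import numpy as np
import librosa

def extract_features(y, sr):
    # (1) average spectral amplitude over all frames and frequency bins
    mag = np.abs(librosa.stft(y))
    spectral_amplitude_mean = mag.mean()

    # (2) mean fundamental frequency from probabilistic YIN (unvoiced frames set to 0)
    f0, voiced, _ = librosa.pyin(y, fmin=50, fmax=500, sr=sr)
    f0_mean = np.nan_to_num(f0).mean()

    # (3) mean positive formant frequency from a second-order LPC fit
    a = librosa.lpc(y, order=2)
    roots = np.roots(a)
    angles = np.angle(roots[np.imag(roots) > 0])
    formant_mean = float((angles * sr / (2 * np.pi)).mean()) if len(angles) else 0.0

    # (4) time-domain duration in seconds
    time_duration = len(y) / sr

    return spectral_amplitude_mean, f0_mean, formant_mean, time_duration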
By comparing the original features with the augmented features and visualizing the normalized results through heatmaps, the effects of different augmentation algorithms on these features are demonstrated. Figure 3 shows the main design process for feature extraction.
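A possible sketch of this comparison step, assuming the extract_features helper above and a dictionary mapping augmentation names to waveform-level functions (e.g., lambdas that fix the sampling rate), normalizes the feature changes and renders them as a heatmap with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

feature_names = ["spectral_amplitude", "f0", "formant", "duration"]

def feature_change_heatmap(y, sr, augmentations):
    # augmentations: {name: function mapping a waveform to its augmented version}
    base = np.array(extract_features(y, sr))
    rows = []
    for name, aug in augmentations.items():
        feats = np.array(extract_features(aug(y), sr))
        rows.append((feats - base) / (np.abs(base) + 1e-8))  # normalized change vs. original
    matrix = np.array(rows)
    plt.imshow(matrix, cmap="coolwarm", aspect="auto")
    plt.xticks(range(len(feature_names)), feature_names, rotation=45)
    plt.yticks(range(len(augmentations)), list(augmentations.keys()))
    plt.colorbar(label="relative change vs. original")
    plt.tight_layout()
    plt.show()
    return matrix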

3.3. Objective Emotion Classification Evaluation Design

In recent years, convolutional neural networks (CNNs) have demonstrated remarkable effectiveness in multiple fields, including object detection [32], text classification [33], and ultra-wideband (UWB) communication [34]. Their adoption in the field of speech emotion recognition (SER) is mainly attributed to their powerful feature extraction capabilities [35].
Recurrent neural networks (RNNs) are proficient at handling sequential data with temporal dependencies, such as text and speech; however, because of their sequential processing, they have difficulty capturing global information [36]. In the task of speech emotion classification, emotional information exists not only in the time series of speech but also in the local spatial features of the spectrogram. Given this limitation of RNNs, CNNs can effectively capture local spatial patterns through convolutional layers, making them more suitable for extracting the relevant features.
Numerous studies have confirmed the effectiveness of CNNs in speech emotion recognition. For example, ref. [37] demonstrated through experiments that a model constructed using CNNs can accurately recognize emotions in speech and achieve a high accuracy rate. In [38], the authors successfully applied CNNs to the task of speech emotion recognition. These research findings not only showcase the excellent performance of this method when dealing with actual speech data but also further validate the feasibility and advantages of CNNs in this field.
In view of these previous results, for this paper we select a model in which the feature extractor and the classifier are based on the CNN architecture (see Table 2 and Table 3). Our aim is to conduct a more in-depth evaluation of the impact of data augmentation on the effect of emotion classification.
During the training phase, the feature extractor and the classifier work closely together, as shown in Figure 4. The feature extractor employs a CNN to process the input speech signal. Selecting a relatively small kernel size is beneficial for capturing fine-grained local features, and the combination of two different kernel sizes enables the extraction of features from spectrograms at different scales. Through global average pooling, the high-dimensional feature map is transformed into a one-dimensional feature vector, which is then fed into the classifier part for further processing. For the classifier, the input layer receives the one-dimensional feature vectors from the feature extractor. The first fully connected layer utilizes the ReLU activation function and the dropout mechanism. The ReLU function can effectively alleviate the problem of gradient vanishing and accelerate the convergence speed of the network [39]. Dropout enhances the generalization ability of the model and prevents overfitting to the training data during the training process [40], preparing for the second fully connected layer to output the probability distribution of emotional categories. These two fully connected layers are set up to gradually adjust the feature dimensions and extract more discriminative features to suit the final emotion classification task.
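The following PyTorch sketch illustrates this architecture at a high level. The channel counts, padding, number of residual blocks, and dropout rate are simplified assumptions; only the two kernel shapes, the global average pooling, and the fully connected layer sizes follow Tables 2 and 3.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class EmotionCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # two parallel convolutions with different kernel shapes, then channel-wise concat
        self.conv_a = nn.Conv2d(1, 16, kernel_size=(10, 2), padding=(5, 1))
        self.conv_b = nn.Conv2d(1, 16, kernel_size=(2, 8), padding=(1, 4))
        self.backbone = nn.Sequential(
            nn.Conv2d(32, 256, 3, stride=4, padding=1),  # reduce to a coarse feature grid
            ResidualBlock(256),
            ResidualBlock(256),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)              # global average pooling
        self.classifier = nn.Sequential(
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, num_classes),                  # logits; softmax applied in the loss
        )
    def forward(self, x):                                # x: [B, 1, 64, 200] Mel spectrogram
        feats = torch.cat([self.conv_a(x), self.conv_b(x)], dim=1)
        feats = self.backbone(feats)
        feats = self.pool(feats).flatten(1)              # [B, 256] feature vector
        return self.classifier(feats)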
In the process of constructing current emotion models, the selection of optimization objectives is of utmost importance. It is worth noting that the cross-entropy loss is widely adopted as the optimization objective [41,42]. During the training process, this loss function plays a dual critical role; not only does it evaluate the performance of the model by measuring the difference between the predicted probability distribution and the true label distribution [43], it also effectively alleviates the problem of class imbalance in emotion classification tasks. Specifically, in each training iteration, the model first makes predictions on the input data based on its current parameters. Then, the cross-entropy loss value is calculated to reflect the difference between the predicted distribution and the true distribution. Gradients are computed through backpropagation to guide the optimizer (such as Adam) to update the parameters of the model, thereby reducing the loss and improving the classification performance.
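As a concrete illustration of this training step (a minimal sketch, not the authors' exact script), cross-entropy loss and Adam can be combined as follows, assuming a PyTorch DataLoader of spectrogram/label batches:

import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, criterion=nn.CrossEntropyLoss()):
    model.train()
    for spectrograms, labels in loader:          # spectrograms: [B, 1, 64, 200]
        optimizer.zero_grad()
        logits = model(spectrograms)
        loss = criterion(logits, labels)         # difference between predicted and true distributions
        loss.backward()                          # gradients via backpropagation
        optimizer.step()                         # parameter update by the optimizer

# Usage idea with the settings reported in Section 4.1:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)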
In the evaluation phase, the classification results are visualized through a confusion matrix to observe the misclassification of different emotional categories, thereby quantifying the specific impact of data augmentation methods on emotion recognition performance. In addition, metrics such as the weighted accuracy (WA), F1-score, and precision [44] are introduced to comprehensively evaluate the classification performance of different data augmentation methods.
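A minimal sketch of this evaluation using scikit-learn metrics is given below; treating weighted accuracy as overall accuracy is our assumption here, and the model and loader objects follow the sketches above.

import torch
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score

def evaluate(model, loader):
    model.eval()
    preds, trues = [], []
    with torch.no_grad():
        for spectrograms, labels in loader:
            logits = model(spectrograms)
            preds.extend(logits.argmax(dim=1).tolist())
            trues.extend(labels.tolist())
    cm = confusion_matrix(trues, preds)                      # per-class misclassification pattern
    wa = accuracy_score(trues, preds)                        # weighted accuracy, here overall accuracy
    f1 = f1_score(trues, preds, average="weighted")
    prec = precision_score(trues, preds, average="weighted")
    return cm, wa, f1, prec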

4. Experiment Analysis

4.1. Experimental Environment and Dataset

Our experiments were conducted on a Windows 10 operating system with the following hardware configuration: NVIDIA GeForce RTX 3060 Ti GPU, CUDA version 12.2, and the Python 3.9 development environment. The parameter settings for the augmentation algorithms are listed in Table 4.
The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset [45] was chosen for the experiment. IEMOCAP contains recordings of ten actors in dyadic (two-person) conversations, organized into five sessions with two participants each, for a total of 10,039 utterances. Following the methodology commonly used in the existing literature, this study focuses on four emotion categories: “Anger”, “Sadness”, “Happiness”, and “Neutral”; samples from the “Excitement” category were merged into the “Happiness” category.
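As an illustration of this label handling, a minimal mapping could look like the sketch below; the raw category codes are assumed to follow the dataset's usual abbreviations ("ang", "sad", "hap", "exc", "neu").

LABEL_MAP = {
    "ang": "Anger",
    "sad": "Sadness",
    "hap": "Happiness",
    "exc": "Happiness",   # "Excitement" merged into "Happiness"
    "neu": "Neutral",
}

def keep_utterance(raw_label):
    # discard utterances outside the four target categories
    return raw_label in LABEL_MAP

def map_label(raw_label):
    return LABEL_MAP[raw_label]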
The model described in Section 3.3 was trained with the Adam optimizer (learning rate: 0.001, weight decay: 0.000001) and the cross-entropy loss function. The batch size was set to 64 and the model was trained for 50 epochs, using five-fold cross-validation for model optimization.
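A minimal sketch of this five-fold protocol with scikit-learn is shown below; it only builds the stratified splits, with training and evaluation of each fold left as indicated in the comments.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validation_folds(labels, n_splits=5, seed=0):
    # Yield (train_idx, test_idx) index pairs for stratified five-fold cross-validation.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    dummy = np.zeros(len(labels))          # features are not needed to build the splits
    for train_idx, test_idx in skf.split(dummy, labels):
        yield train_idx, test_idx

# Usage idea: for each fold, apply the augmentation to the training partition only,
# train for 50 epochs with batch size 64 (Section 4.1), then evaluate on the held-out fold.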
The five data augmentation algorithms listed in Table 4 were independently applied to the training data, and the model performance was evaluated on the test set.

4.2. Subjective Auditory Perception Analysis

Based on the feedback from the listeners, the impact of each augmentation algorithm is summarized in Table 5. Among them, the time stretching algorithm stands out particularly in terms of its impact on emotional expression. After being processed by the time stretching data augmentation method, the speech originally conveying a sense of “sadness” was mistakenly judged by the listeners as expressing “cheerfulness” or “anger”. This phenomenon fully reveals the severe interference of this algorithm with the conveyance of emotional information in speech, causing it to deviate from its original emotional tone.

4.3. Impact of Augmentation Algorithms on Spectrogram Spectral Characteristics

To analyze the fundamental reasons that listeners misjudged “sadness” as “happiness”, we provide a visual analysis of how different data augmentation techniques affect the characteristics of spectrograms, aiming to reveal the acoustic factors that lead to these perceivable emotional changes. The results are illustrated in Figure 5, which presents the normalized evaluation outcomes of the effects of these algorithms on spectrogram features, encompassing average frequency, fundamental frequency (F0), formants, and temporal characteristics. The comparison yields several observations. First, reverberation (RIR) significantly affects the formant features, causing blurring in the spectrogram and reducing the clarity of emotional expression. Noise addition (add_noise) affects both frequency and temporal features, slightly decreasing audio clarity and making emotional expression less distinct. Pitch Shifting (pitch_shifting) causes subtle changes in both frequency and temporal features, which may lead to confusion between the “Anger” and “Sadness” emotion categories. Resampling (resample) has a considerable impact on frequency and temporal features, but minimal effect on the fundamental frequency and formants. Emotional expression is only mildly disrupted. On the other hand, time stretching (time_stretch) significantly affects frequency and formant features, weakening the naturalness of emotions. It flattens “Happiness” and diminishes the layered quality of “Sadness”.

4.4. Objective Emotion Classification Evaluation Results Analysis

The above analysis focuses on examining the impact of data augmentation algorithms on spectrogram features from subjective and visual perspectives. To comprehensively evaluate the influence of these algorithms on model performance, we conducted objective evaluation experiments. In a comparison of performance metrics (Figure 6), including weighted accuracy (WA), F1-score, and precision, the reverberation (RIR) method outperformed all others across all metrics, demonstrating optimal overall performance and stability in emotion classification tasks, making it suitable for speech emotion recognition scenarios. The resampling method ranked second, with slightly lower metrics than reverberation. Its stable performance and simplicity make it another viable option. In contrast, noise addition and pitch shifting showed moderate performance but were inferior to the top two methods. The time stretching method exhibited the lowest performance, particularly in terms of the F1-score, indicating a significant negative impact on classification performance which makes it unsuitable as a primary choice. Thus, reverberation is the best choice, followed by resampling.
To further validate the reliability of the performance differences among the methods, we compared the reverberation (RIR) algorithm against the other four algorithms on the three performance metrics of weighted accuracy, F1-score, and precision using independent two-sample t-tests. A t-test determines whether the means of two groups of data differ significantly: it computes a t-statistic and, from the corresponding t-distribution, the probability (p-value) of observing the measured difference or a more extreme one under the null hypothesis that the two group means are equal. The results are shown in Figure 7, and a minimal code sketch of the test is given after the list below.
  • Weighted Accuracy (WA): The significance markers for the noise addition and time stretching methods are ***, indicating that their p-values relative to reverberation (RIR) are less than 0.001 for the WA metric. There is therefore an extremely significant difference between these methods and reverberation in improving the weighted accuracy, and the difference in effects is unlikely to be caused by random factors. The pitch shifting method is marked with **, indicating that its p-value is less than 0.01 and that there is a significant difference compared to reverberation.
  • F1-Score: The significance markers for the noise addition, time stretching, and pitch shifting methods are ***. This indicates that p < 0.001 compared with reverberation, showing an extremely significant difference. The resampling method is marked with *, indicating that p < 0.05, meaning that there is a certain degree of significant difference compared to reverberation. This shows that the differences in the effects of these methods on the F1-score compared to reverberation are not accidental.
  • Precision: The noise addition method is marked with **, indicating that its p-value is less than 0.01 and there is a significant difference compared to reverberation. The time stretching and resampling methods are marked with *** for p < 0.001, showing an extremely significant difference compared to reverberation. This reflects that there are obvious differences between these methods and reverberation in terms of precision performance and that these differences are caused by the methods themselves rather than random factors.
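The sketch below shows how such a two-sample t-test and its significance marker can be computed with scipy; the per-fold score arrays are placeholders for illustration only, not the values behind Figure 7.

from scipy import stats

rir_scores = [0.68, 0.67, 0.69, 0.66, 0.68]      # placeholder per-fold metric values for RIR
other_scores = [0.63, 0.64, 0.62, 0.65, 0.63]    # placeholder per-fold values for a compared method

t_stat, p_value = stats.ttest_ind(rir_scores, other_scores)
if p_value < 0.001:
    marker = "***"
elif p_value < 0.01:
    marker = "**"
elif p_value < 0.05:
    marker = "*"
else:
    marker = "n.s."      # not significant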
The results of the statistical significance analysis indicate that there are real and varying degrees of performance differences between the reverberation algorithm and the other four algorithms across all performance metrics. This provides a quantitative basis for our previous conclusions regarding the performance of different methods and strongly supports the view that reverberation is the best choice among the five data augmentation methods.
To identify the reasons for the performance differences caused by different data augmentation methods, we visualize the confusion matrix in Figure 8 to observe the misclassification distributions of the different methods. Among the tested algorithms, reverberation (RIR) demonstrated the best performance in classifying the “Anger” and “Happiness” emotion categories, with relatively low misclassification rates. In contrast, time stretching caused the most severe interference with the classification of “Sadness” and “Anger” emotions, leading to significant performance degradation.
Table 6 further elaborates the classification performance of each method. The reverberation method not only has the highest accuracy but also maintains the integrity of emotional categories. In contrast, the time stretching method has the greatest negative impact on emotion classification. The main reason for this is that it blurs the boundaries between “Sadness” and “Anger”, which also explains its poor performance. The noise addition, pitch shifting, and resampling methods perform stably when classifying the “Happiness” emotion but cause a great deal of interference when distinguishing between the “Neutral” and “Sadness” emotions. This blurring of the boundaries between the “Neutral” and “Sadness” emotions ultimately leads to the inferior performance of these methods compared to the reverberation method.
When speech processed with the five data augmentation methods is classified by the model, the misclassification rates of “Happiness” and “Anger” are generally low, while those of “Sadness” and “Anger” are relatively high. The reason for this is that “Happiness” and “Anger” are emotions with high arousal levels and obvious characteristic contrasts. The former is characterized by high pitch, fast speaking speed, and a positive rhythm in speech, while the latter features a low pitch and strong intonation changes. These significant characteristic differences make it easier for the model to distinguish between them, reducing the classification error rate. In contrast, “Sadness” and “Anger” both belong to negative emotions and have a moderate arousal level, resulting in many overlapping speech features. For example, the low volume and slow speaking speed of “Sadness” are similar to the speech performance of “Anger” when it is suppressed. These similarities make it highly likely for the model to make misclassifications during emotional classification.
In summary, reverberation not only enhances the features in speech that distinguish between the “Anger” and “Happiness” categories, it also avoids excessive interference with the discrimination of other emotion types, thereby achieving the best overall classification performance. For these reasons, reverberation is the optimal choice among these five data augmentation methods. Although resampling is less effective than reverberation in distinguishing certain emotions, it demonstrates relatively stable performance in classifying the “Happiness” category and can improve the model’s performance, ranking second as a result. On the other hand, the time stretching method causes the most significant classification interference, especially at the boundaries of emotion categories such as “Sadness” and “Anger”, which greatly increases the misclassification rate and significantly reduces the classification performance. Thus, this method should be avoided where possible in practical applications.
Although the individual advantages of resampling and reverberation impulse response (RIR) have been confirmed, the potential synergistic effects generated by combining different data augmentation methods remain to be studied. To address this issue, we systematically combined the various augmentation techniques, constructed multiple combination strategies, and evaluated their impacts on the performance of the model, with the results shown in Figure 9. The results indicate that the combination of resampling and RIR outperforms the other combinations, which verifies the synergistic effect of these two methods in enhancing data diversity and model generalization ability.
In order to further confirm the effectiveness of the resampling and reverberation augmentation techniques and demonstrate the superiority of the proposed model, the performance of classic network models (ResNet18, VGG16, GoogleNet, DenseNet) and of the proposed model composed of a feature extractor and a classifier was compared before and after the application of these two data augmentation techniques, using the weighted accuracy (WA) as the key indicator. This experiment strictly adhered to the settings outlined in Section 4.1, and all models adopted a unified training strategy to ensure the reliability and comparability of the results. The experimental results are shown in Table 7. For benchmark models such as ResNet18, VGG16, and GoogleNet, applying the resampling and RIR algorithms significantly improved the weighted accuracy (WA), preliminarily verifying the effectiveness of these data augmentation techniques in enhancing the performance of classic network models. After integrating these two augmentation techniques, the weighted accuracy of the proposed model increased by 7.10%, fully highlighting its advantage in adapting to data augmentation techniques.
Meanwhile, the proposed model was compared with several mainstream algorithms on the IEMOCAP dataset, with the results shown in Table 8. The experimental results show that, compared with the mainstream algorithms, the weighted accuracy of the proposed model is up to 5% higher, significantly outperforming the other algorithms.
In conclusion, the experimental results verify the effectiveness of the research findings from two aspects. First, this experiment further confirms that the resampling and RIR augmentation techniques described in this paper can effectively improve model performance in speech emotion recognition tasks. Second, it highlights that, after combining the resampling and RIR data augmentation techniques, the proposed model has a higher performance ceiling and stronger adaptability than traditional network models, providing a new reference direction for data augmentation and model optimization in the field of speech emotion recognition.

5. Conclusions

This study provides a comprehensive evaluation of the effects of five common data augmentation algorithms (noise addition, time stretching, pitch shifting, resampling, and reverberation) on emotion recognition. The experimental results indicate that reverberation is the most effective augmentation method. It significantly improves sample diversity with minimal interference to emotional features, leading to relatively balanced classification performance. On the other hand, time stretching has the most significant negative impact on emotion boundaries, particularly in classification of the “Sadness” category, where it severely disturbs classification accuracy and significantly reduces overall classification performance. Through experiments conducted across three dimensions, we observed notable performance differences between the various data augmentation methods in terms of emotion recognition, offering valuable insights for optimizing future emotion recognition models.
Notably, in our experiments exploring combinations of different data augmentation techniques, the combination of resampling and reverberation demonstrated a significant synergistic effect, not only enhancing the diversity of the data but also significantly improving the generalization ability of the model. This performance improvement is achieved without modifying the network architecture itself. This finding further confirms the potential of combining multiple data augmentation methods for optimizing emotion recognition models.
Based on the results of this study, future research directions can be concentrated on exploring adaptive data augmentation strategies. Through such strategies, appropriate data augmentation methods can be adaptively selected according to different emotional characteristics and model requirements so as to improve efficiency and further optimize the performance of emotion recognition. At the same time, during the process of data augmentation for the original speech data, it is highly likely that the augmented data may introduce new noise or distort the original emotional information, either due to environmental factors or the characteristics of the algorithm itself. Therefore, methods for reducing noise interference are also worthy of more in-depth research, which will help reduce the negative impact of noise on emotion recognition models.

Author Contributions

Conceptualization and methodology, H.T. and S.L.; software, S.L.; validation, S.L.; formal analysis, H.T., X.W. and B.L.; investigation, S.Z.; data curation, S.L.; writing—original draft preparation, H.T. and S.L.; writing—review and editing, H.T. and S.L.; visualization, S.L.; funding acquisition H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Innovative Funds Plan of Henan University of Technology (2022ZKCJ13), the Open Project of the Scientific Research Platform of the Henan University of Technology Grain Information Processing Center (No. KFJJ2023011), the Natural Science Project of the Henan Provincial Department of Science and Technology, Technology Research Projects (No. 242102211027), and the Fund of the Institute of Complexity Science, Henan University of Technology (CSKFJJ-2025-49).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found at: [IEMOCAP] [https://sail.usc.edu/iemocap/] (accessed on 29 December 2022).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Song, M.; Triantafyllopoulos, A.; Yang, Z.; Takeuchi, H.; Nakamura, T.; Kishi, A.; Ishizawa, T.; Yoshiuchi, K.; Jing, X.; Karas, V.; et al. Daily Mental Health Monitoring from Speech: A Real-World Japanese Dataset and Multitask Learning Analysis. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  2. Gao, X.; Song, H.; Li, Y.; Zhao, Q.; Li, W.; Zhang, Y.; Chao, L. Research on Intelligent Quality Inspection of Customer Service Under the “One Network” Operation Mode of Toll Roads. In Proceedings of the 2022 3rd International Conference on Information Science, Parallel and Distributed Systems (ISPDS), Guangzhou, China, 22–24 July 2022; pp. 369–373. [Google Scholar]
  3. Zhang, B.-Y.; Wang, Z.-S. Design Factors Extraction of Companion Robots for the Elderly People Living Alone Based on Miryoku Engineering. In Proceedings of the 2022 15th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 17–18 December 2022; pp. 110–113. [Google Scholar]
  4. Zhang, K.; Zhou, F.; Wu, L.; Xie, N.; He, Z. Semantic understanding and prompt engineering for large-scale traffic data imputation. Inf. Fusion 2024, 102, 102038. [Google Scholar] [CrossRef]
  5. Jackson, P.; Haq, S. Surrey Audio-Visual Expressed Emotion (Savee) Database; University of Surrey: Guildford, UK, 2014. [Google Scholar]
  6. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. Interspeech 2005, 5, 1517–1520. [Google Scholar]
  7. Atmaja, B.T.; Sasou, A. Effects of data augmentations on speech emotion recognition. Sensors 2022, 22, 5941. [Google Scholar] [CrossRef] [PubMed]
  8. Barhoumi, C.; Ayed, Y.B. Improving Speech Emotion Recognition Using Data Augmentation and Balancing Techniques. In Proceedings of the 2023 International Conference on Cyberworlds (CW), Sousse, Tunisia, 3–5 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 282–289. [Google Scholar]
  9. Tao, H.; Shan, S.; Hu, Z.; Zhu, C. Strong generalized speech emotion recognition based on effective data augmentation. Entropy 2022, 25, 68. [Google Scholar] [CrossRef]
  10. Arias, P.; Rachman, L.; Liuni, M.; Aucouturier, J.J. Beyond correlation: Acoustic transformation methods for the experimental study of emotional voice and speech. Emot. Rev. 2021, 13, 12–24. [Google Scholar] [CrossRef]
  11. Liu, L.; Götz, A.; Lorette, P.; Tyler, M.D. How tone, intonation and emotion shape the development of infants’ fundamental frequency perception. Front. Psychol. 2022, 13, 906848. [Google Scholar] [CrossRef] [PubMed]
  12. Bergelson, E.; Idsardi, W.J. A neurophysiological study into the foundations of tonal harmony. Neuroreport 2009, 20, 239–244. [Google Scholar] [CrossRef] [PubMed]
  13. Teng, X.; Meng, Q.; Poeppel, D. Modulation Spectra Capture EEG Responses to Speech Signals and Drive Distinct Temporal Response Functions. Eneuro 2021, 8. [Google Scholar] [CrossRef]
  14. Luengo, I.; Navas, E.; Hernáez, I. Feature analysis and evaluation for automatic emotion identification in speech. IEEE Trans. Multimed. 2010, 12, 490–501. [Google Scholar] [CrossRef]
  15. Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017; pp. 1–5. [Google Scholar]
  16. Zhang, S.; Zhao, Z.; Guan, C. Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 5764–5769. [Google Scholar]
  17. Biswas, M.; Sahu, M.; Agrebi, M.; Singh, P.K.; Badr, Y. Speech Emotion Recognition Using Deep CNNs Trained on Log-Frequency Spectrograms. In Innovations in Machine and Deep Learning: Case Studies and Applications; Springer Nature: Cham, Switzerland, 2023; pp. 83–108. [Google Scholar]
  18. Li, H.; Li, J.; Liu, H.; Liu, T.; Chen, Q.; You, X. Meltrans: Mel-spectrogram relationship-learning for speech emotion recognition via transformers. Sensors 2024, 24, 5506. [Google Scholar] [CrossRef]
  19. Feng, T.; Hashemi, H.; Annavaram, M.; Narayanan, S.S. Enhancing Privacy Through Domain Adaptive Noise Injection For Speech Emotion Recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7702–7706. [Google Scholar]
  20. Mujaddidurrahman, A.; Ernawan, F.; Wibowo, A.; Sarwoko, E.A.; Sugiharto, A.; Wahyudi, M.D.R. Speech emotion recognition using 2D-CNN with data augmentation. In Proceedings of the 2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM), Pekan, Malaysia, 24–26 August 2021; pp. 685–689. [Google Scholar]
  21. Braunschweiler, N.; Doddipatla, R.; Keizer, S.; Stoyanchev, S. A Study on Cross-Corpus Speech Emotion Recognition and Data Augmentation. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 24–30. [Google Scholar]
  22. Zhao, W.; Yin, B. Environmental sound classification based on pitch shifting. In Proceedings of the 2022 International Seminar on Computer Science and Engineering Technology (SCSET), Indianapolis, IN, USA, 8–9 January 2022; pp. 275–280. [Google Scholar]
  23. Hailu, N.; Siegert, I.; Nürnberger, A. Improving Automatic Speech Recognition Utilizing Audio-codecs for Data Augmentation. In Proceedings of the 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 21–24 September 2020; pp. 1–5. [Google Scholar]
  24. Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224. [Google Scholar]
  25. Lin, W.; Mak, M.W. Robust Speaker Verification Using Population-Based Data Augmentation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7642–7646. [Google Scholar]
  26. Smith, S. Digital Signal Processing: A Practical Guide for Engineers and Scientists; Newnes: Brierley Hill, UK, 2003. [Google Scholar]
  27. Yi, L.; Mak, M.W. Improving speech emotion recognition with adversarial data augmentation network. IEEE Trans. Neural Networks Learn. Syst. 2020, 33, 172–184. [Google Scholar] [CrossRef] [PubMed]
  28. Rudd, D.H.; Huo, H.; Xu, G. Leveraged mel spectrograms using harmonic and percussive components in speech emotion recognition. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Chengdu, China, 16–19 May 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 392–404. [Google Scholar]
  29. McFee, B.; Raffel, C.; Liang, D. librosa: Audio and Music Signal Analysis in Python. In Proceedings of the Python in Science Conference, Austin, TX, USA, 6–12 July 2015. [Google Scholar]
  30. Mauch, M.; Dixon, S. PYIN: A fundamental frequency estimator using probabilistic threshold distributions. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 659–663. [Google Scholar]
  31. Araujo, A.D.L.; Violaro, F. Formant frequency estimation using a Mel-scale LPC algorithm. In Proceedings of the ITS’98 Proceedings. SBT/IEEE International Telecommunications Symposium (Cat. No. 98EX202), Sao Paulo, Brazil, 9–13 August 1998; IEEE: Piscataway, NJ, USA, 1998; pp. 207–212. [Google Scholar]
  32. Xu, L.; Zhao, F.; Xu, P.; Cao, B. Infrared target recognition with deep learning algorithms. Multimed. Tools Appl. 2023, 82, 17213–17230. [Google Scholar] [CrossRef]
  33. Wang, B.; Sun, Y.; Chu, Y.; Min, C.; Yang, Z.; Lin, H. Local discriminative graph convolutional networks for text classification. Multimed. Syst. 2023, 29, 2363–2373. [Google Scholar] [CrossRef]
  34. Li, J.; Liu, S.; Gao, Y.; Lv, Y.; Wei, H. UWB (N) LOS identification based on deep learning and transfer learning. IEEE Commun. Lett. 2024, 28, 2111–2115. [Google Scholar] [CrossRef]
  35. Liu, Z.T.; Han, M.T.; Wu, B.H.; Rehman, A. Speech emotion recognition based on convolutional neural network with attention-based bidirectional long short-term memory network and multi-task learning. Appl. Acoust. 2023, 202, 109178. [Google Scholar] [CrossRef]
  36. Ahmed, S.F.; Alam, M.S.B.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Gandomi, A.H. Deep learning modelling techniques: Current progress, applications, advantages, and challenges. Artif. Intell. Rev. 2023, 56, 13521–13617. [Google Scholar] [CrossRef]
  37. Ullah, R.; Asif, M.; Shah, W.A.; Anjam, F.; Ullah, I.; Khurshaid, T.; Alibakhshikenari, M. Speech emotion recognition using convolution neural networks and multi-head convolutional transformer. Sensors 2023, 23, 6212. [Google Scholar] [CrossRef]
  38. Chauhan, K.; Sharma, K.K.; Varma, T. Speech emotion recognition using convolution neural networks. In Proceedings of the 2021 international conference on artificial intelligence and smart systems (ICAIS), Coimbatore, India, 25–27 March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1176–1181. [Google Scholar]
  39. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 30, p. 3. [Google Scholar]
  40. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  41. Vaessen, N.; Van Leeuwen, D.A. Fine-Tuning Wav2Vec2 for Speaker Recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7967–7971. [Google Scholar] [CrossRef]
  42. Dai, D.; Wu, Z.; Li, R.; Wu, X.; Jia, J.; Meng, H. Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7405–7409. [Google Scholar] [CrossRef]
  43. Xu, C.; Zhu, Z.; Wang, J.; Wang, J.; Zhang, W. Understanding the role of cross-entropy loss in fairly evaluating large language model-based recommendation. arXiv 2024, arXiv:2402.06216. [Google Scholar]
  44. Sharma, M. Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6907–6911. [Google Scholar] [CrossRef]
  45. Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  47. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  48. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  49. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  50. Ibrahim, K.M.; Perzo, A.; Leglaive, S. Towards Improving Speech Emotion Recognition Using Synthetic Data Augmentation from Emotion Conversion. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10636–10640. [Google Scholar] [CrossRef]
  51. Pappagari, R.; Villalba, J.; Żelasko, P.; Moro-Velazquez, L.; Dehak, N. CopyPaste: An Augmentation Method for Speech Emotion Recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6324–6328. [Google Scholar] [CrossRef]
  52. Tiwari, U.; Soni, M.; Chakraborty, R.; Panda, A.; Kopparapu, S.K. Multi-Conditioning and Data Augmentation Using Generative Noise Model for Speech Emotion Recognition in Noisy Conditions. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7194–7198. [Google Scholar] [CrossRef]
Figure 1. Listening experiment design.
Figure 2. Spectrogram feature analysis.
Figure 3. Main design process for feature extraction.
Figure 4. Working process of the feature extractor and classifier.
Figure 5. Evaluation results for the impact of augmentation algorithms on spectrogram features.
Figure 6. Comparison of model performance metrics under different data augmentation methods.
Figure 7. Statistical significance analysis.
Figure 8. Emotion classification confusion matrix: (a) reverberation impulse response (RIR), (b) noise addition (add_noise), (c) time stretching (time_stretch), (d) resampling (resample), (e) pitch shifting (pitch_shifting).
Figure 9. Impact of strategies combining different data augmentation methods on model performance.
Table 1. Implementation methods of data augmentation algorithms in spectrograms.

Data Augmentation Algorithm | Implementation in Spectrograms
Time Stretching | Modifies the time axis of the spectrogram to adjust speech speed or rhythm
Noise Addition | Adds background noise or environmental noise to the spectrogram
Pitch Shifting | Adjusts the amplitudes and phases of frequency components to change the frequency distribution, simulating changes in pitch or fundamental frequency
Reverberation (RIR) | Adds reverb in the time domain to simulate different spatial environments, making the frequency distribution in the spectrogram more complex and blurring the distinct frequency boundaries
Resampling | Adjusts the frequency resolution of the spectrogram to simulate speech under different sampling conditions
Table 2. Feature extraction structure.

Component | Description
Input Layer | Receives the preprocessed Mel spectrogram with an input size of [B, 1, 64, 200]
Convolutional Layer | Two convolutional operations are performed to extract multi-channel features, with kernel sizes of (10, 2) and (2, 8)
Feature Fusion | The convolutional results are concatenated along the channel dimension to obtain an enhanced feature representation
Residual Module | Multi-layer residual modules are applied to progressively enhance the feature representation, with a final output size of [B, 256, 4, 13]
Global Average Pooling | Converts the high-dimensional feature map into a one-dimensional feature vector, with an output size of [B, 256]
Table 3. Classifier structure.

Component | Description
Input Layer | Receives the one-dimensional feature vector output from the feature extractor, with size [B, 256]
Fully Connected Layer 1 | Maps the features to a 64-dimensional space, using ReLU activation and Dropout to prevent overfitting
Fully Connected Layer 2 | Outputs the probability distribution over the emotion categories (4 classes: Anger, Happiness, Sadness, Neutral) using Softmax
Table 4. Data augmentation parameters.

Data Augmentation Algorithm | Parameter Settings | Fixed Parameters
Time Stretching | Stretch rates: [0.8, 1.2, 1.5, 1.8, 2.0] | —
Noise Addition | Signal power p_signal = (1/N) Σ_{i=1}^{N} x_i^2; noise coefficient k = sqrt(p_signal / 10^{SNR/10}); noisy speech x_noisy = x + k·ε, ε ~ N(0, 1) | SNR = 120 dB
Resampling | Intermediate sampling rate: 20,000 Hz; first resample from 16,000 Hz to 20,000 Hz, then resample from 20,000 Hz back to the original sampling rate | Original audio sampling rate: 16,000 Hz
Pitch Shifting | Pitch shift steps: [−6, −3, 0, 3, 6]; semitones per octave: 12 | —
Reverberation (RIR) | Source positions: [[1, 1, 1.75], [5, 4, 1.75], [9, 7, 1.75]]; microphone positions: [[0.5, 0.5, 1.2], [4, 3, 1.2], [8, 6, 1.2]] | 3D space dimensions: [10, 8, 3.5]
Table 5. Listening experiment results.

Augmentation Algorithm | Listener’s Audio Quality Evaluation | Listener’s Noise Perception | Listener’s Description of Emotional Expression Changes
Original Audio | Clear, natural, no distortion | No noise | Emotional expression is clear and unchanged, with the best speech quality
Noise Addition | Slightly blurred, mild distortion | Increased background noise (slight noise) | The noise makes the details of emotional conveyance less clear
Time Stretching | Change in speech rate, slight distortion | No noticeable noise | Excessively fast changes in speaking speed lead to the original sad emotion being misjudged as cheerful
Resampling | Audio slightly distorted, frequency variation | No noticeable noise | Distortion affects emotional transmission, especially noticeable in intense emotions (e.g., anger)
Pitch Shifting | Timbre change, good clarity | No noticeable noise | Timbre changes cause subtle emotional differences, especially confusion between anger and sadness
Reverberation (RIR) | The audio has reverberation and is slightly blurry | Slight noise | The speech overlaps, affecting the clear conveyance of emotions
Table 6. Analysis of evaluation results.

Augmentation Method | Good Recognition Performance | High Misclassification Rate | Other Observations
Time Stretching | “Anger” (71.78%) and “Happiness” (72.18%) | “Sadness” (48.93% misclassified as “Anger”) | “Neutral” performs poorly, with blurred boundaries, significantly reducing model performance
Reverberation (RIR) | “Anger” (62.72%) and “Happiness” (64.66%) | “Sadness” (33.57% misclassified as “Anger”), “Neutral” (19.35% misclassified as “Sadness”) | Overall stability is better than other methods
Noise Addition | “Happiness” (75.94%) | “Sadness” (43.93% misclassified as “Anger”) | Suitable for increasing data diversity but has a significant impact on boundaries
Pitch Shifting | “Sadness” (66.79%) and “Happiness” (66.92%) | “Anger” (35.19% misclassified as “Sadness”) | “Anger” and “Sadness” confusion is significant
Resampling | “Happiness” (69.17%) | “Anger” (20.56% misclassified as “Sadness”), “Neutral” (35.48% misclassified as “Sadness”) | “Neutral” performs worst, with severe confusion
Table 7. Comparison results of performance improvements for different network models on the IEMOCAP dataset.

Network | Baseline WA (%) | Baseline + RIR + Resample WA (%) | Improvement (%)
ResNet18 [46] | 60.28 | 62.39 | +2.11
VGG16 [47] | 52.58 | 60.94 | +8.36
GoogleNet [48] | 56.52 | 59.26 | +2.74
DenseNet [49] | 59.28 | 63.32 | +4.04
Ours | 60.64 | 67.74 | +7.10
Table 8. Comparison results between the proposed algorithm and the mainstream algorithms on the IEMOCAP dataset.

Methods | WA (%)
DAMIF [9] | 66.51
ADAN+SVM [27] | 64.78
WADAN+SVM [27] | 63.82
Speech synthesis [50] | 66.06
COPYPASTE [51] | 63.78
Generative noise [52] | 62.74
Ours | 67.74
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
