Prosodic Spatio-Temporal Feature Fusion with Attention Mechanisms for Speech Emotion Recognition
Abstract
1. Introduction
2. Related Works
3. Proposed Method
3.1. Prosodic Features Extraction Design
3.2. Spatio-Temporal Features
3.3. Fusion Configuration
3.4. Classifier
3.5. Training Strategy
3.6. Evaluation Metrics
- Accuracy measures the proportion of correctly classified samples but may be misleading for imbalanced emotional classes.
- Precision indicates the fraction of correctly predicted positive samples among all predicted positives, ensuring the reliability of stress predictions.
- Recall (sensitivity) measures the proportion of correctly detected positive cases; in stress and anxiety detection, recall is critical since missing stressed cases (false negatives) can be more harmful than false positives.
- F1-score, the harmonic mean of precision and recall, balances sensitivity and reliability; a minimal computation sketch follows this list.
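A minimal sketch, assuming scikit-learn is available, of how these metrics can be computed from per-utterance predictions. The `y_true`/`y_pred` arguments are placeholders, and the label names simply mirror the eight classes reported in Section 4.3; this is an illustration, not the authors' evaluation code.

```python
# Hedged sketch: per-class and averaged metrics with scikit-learn.
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

LABELS = ["disgust", "calm", "sad", "happy", "fear", "angry", "neutral", "surprise"]

def report_metrics(y_true, y_pred):
    """Print per-class precision/recall/F1 and return the aggregate scores."""
    print(classification_report(y_true, y_pred, labels=LABELS, digits=4))
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "macro_precision": macro_p,
            "macro_recall": macro_r,
            "macro_f1": macro_f1}
```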
4. Results and Discussion
4.1. Prosodic Features Analysis
4.2. Spectrogram Features Analysis
4.3. Results
- The MTMFS makes the largest contribution to classification stability and accuracy (a hedged extraction sketch for these input representations follows this list).
- Prosodic features directly improve recall in the stress and anxiety classes.
- The CQTS adds depth to the representation but with a more moderate impact.
- The attention mechanism ensures adaptive integration between features, maintaining a balance between precision and recall.
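A minimal extraction sketch for the two spectrogram inputs and one prosodic cue, using librosa. This is an illustration under assumptions, not the authors' pipeline: librosa has no multitaper estimator, so a standard Mel spectrogram stands in for the MTMFS, and prosodic descriptors such as jitter, shimmer, HNR, pause rate, and speech rate would typically come from a dedicated tool such as Praat/Parselmouth.

```python
# Hedged sketch of the input representations, using librosa.
import librosa
import numpy as np

def extract_inputs(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)

    # Mel spectrogram as a stand-in for the MTMFS:
    # 64 Mel filters, 25 ms frames, 10 ms hop, FFT = 2048 (per the configuration table).
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048,
        win_length=int(0.025 * sr), hop_length=int(0.010 * sr), n_mels=64)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Constant-Q transform spectrogram (CQTS): 84 bins, 12 bins per octave.
    cqt = np.abs(librosa.cqt(y=y, sr=sr, n_bins=84, bins_per_octave=12))
    cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)

    # One prosodic cue as an example: fundamental frequency (pitch) via pYIN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    return mel_db, cqt_db, f0
```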
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Bhangale, K.B.; Kothandaraman, M. Speech Emotion Recognition Using the Novel PEmoNet (Parallel Emotion Network). Appl. Acoust. 2023, 212, 109613.
- Waleed, G.T.; Shaker, S.H. Speech Emotion Recognition on MELD and RAVDESS Datasets Using CNN. Information 2025, 16, 518.
- Liztio, L.M.; Sari, C.A.; Setiadi, D.R.I.M.; Rachmawanto, E.H. Gender Identification Based on Speech Recognition Using Backpropagation Neural Network. In Proceedings of the 2020 International Seminar on Application for Technology of Information and Communication (iSemantic), Semarang, Indonesia, 19–20 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 88–92.
- Guo, X.; Mai, G.; Mohammadi, Y.; Benzaquén, E.; Yukhnovich, E.A.; Sedley, W.; Griffiths, T.D. Neural Entrainment to Pitch Changes of Auditory Targets in Noise. Neuroimage 2025, 314, 121270.
- Kuuluvainen, S.; Kaskivuo, S.; Vainio, M.; Smalle, E.; Möttönen, R. Prosody Enhances Learning of Statistical Dependencies from Continuous Speech Streams in Adults. Cognition 2025, 262, 106169.
- Shan, Y. Prosodic Modulation of Discourse Markers: A Cross-Linguistic Analysis of Conversational Dynamics. Speech Commun. 2025, 173, 103271.
- Guo, P.; Huang, S.; Li, M. DDA-MSLD: A Multi-Feature Speech Lie Detection Algorithm Based on a Dual-Stream Deep Architecture. Information 2025, 16, 386.
- Ayvaz, U.; Gürüler, H.; Khan, F.; Ahmed, N.; Whangbo, T.; Akmalbek Bobomirzaevich, A. Automatic Speaker Recognition Using Mel-Frequency Cepstral Coefficients Through Machine Learning. Comput. Mater. Contin. 2022, 71, 5511–5521.
- Prabakaran, D.; Sriuppili, S. Speech Processing: MFCC Based Feature Extraction Techniques - An Investigation. J. Phys. Conf. Ser. 2021, 1717, 012009.
- Sood, M.; Jain, S. Speech Recognition Employing MFCC and Dynamic Time Warping Algorithm. In Innovations in Information and Communication Technologies (IICT-2020); Springer: Cham, Switzerland, 2021; pp. 235–242.
- Wijaya, N.N.; Setiadi, D.R.I.M.; Muslikh, A.R. Music-Genre Classification Using Bidirectional Long Short-Term Memory and Mel-Frequency Cepstral Coefficients. J. Comput. Theor. Appl. 2024, 1, 243–256.
- Saleem, N.; Gao, J.; Khattak, M.I.; Rauf, H.T.; Kadry, S.; Shafi, M. DeepResGRU: Residual Gated Recurrent Neural Network-Augmented Kalman Filtering for Speech Enhancement and Recognition. Knowl.-Based Syst. 2022, 238, 107914.
- Li, Y.; Kang, S. Deep Neural Network-based Linear Predictive Parameter Estimations for Speech Enhancement. IET Signal Process. 2017, 11, 469–476.
- Karapiperis, S.; Ellinas, N.; Vioni, A.; Oh, J.; Jho, G.; Hwang, I.; Raptis, S. Investigating Disentanglement in a Phoneme-Level Speech Codec for Prosody Modeling. In Proceedings of the 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, China, 2–5 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 668–674.
- Sivasathiya, G.; Kumar, A.D.; Ar, H.R.; Kanishkaa, R. Emotion-Aware Multimedia Synthesis: A Generative AI Framework for Personalized Content Generation Based on User Sentiment Analysis. In Proceedings of the 2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), Bengaluru, India, 4–6 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1344–1350.
- Colunga-Rodriguez, A.A.; Martínez-Rebollar, A.; Estrada-Esquivel, H.; Clemente, E.; Pliego-Martínez, O.A. Developing a Dataset of Audio Features to Classify Emotions in Speech. Computation 2025, 13, 39.
- Wang, N.; Zhang, X.; Sharma, A. A Research on HMM Based Speech Recognition in Spoken English. Recent Adv. Electr. Electron. Eng. (Formerly Recent Pat. Electr. Electron. Eng.) 2021, 14, 617–626.
- Srivastava, D.R.K.; Pandey, D. Speech Recognition Using HMM and Soft Computing. Mater. Today Proc. 2022, 51, 1878–1883.
- Turki, T.; Roy, S.S. Novel Hate Speech Detection Using Word Cloud Visualization and Ensemble Learning Coupled with Count Vectorizer. Appl. Sci. 2022, 12, 6611.
- Hao, C.; Li, Y. Simulation of English Speech Recognition Based on Improved Extreme Random Forest Classification. Comput. Intell. Neurosci. 2022, 2022, 1948159.
- Dua, S.; Kumar, S.S.; Albagory, Y.; Ramalingam, R.; Dumka, A.; Singh, R.; Rashid, M.; Gehlot, A.; Alshamrani, S.S.; AlGhamdi, A.S. Developing a Speech Recognition System for Recognizing Tonal Speech Signals Using a Convolutional Neural Network. Appl. Sci. 2022, 12, 6223.
- Shashidhar, R.; Patilkulkarni, S.; Puneeth, S.B. Combining Audio and Visual Speech Recognition Using LSTM and Deep Convolutional Neural Network. Int. J. Inf. Technol. 2022, 14, 3425–3436.
- Hema, C.; Garcia Marquez, F.P. Emotional Speech Recognition Using CNN and Deep Learning Techniques. Appl. Acoust. 2023, 211, 109492.
- Oruh, J.; Viriri, S.; Adegun, A. Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition. IEEE Access 2022, 10, 30069–30079.
- Orken, M.; Dina, O.; Keylan, A.; Tolganay, T.; Mohamed, O. A Study of Transformer-Based End-to-End Speech Recognition System for Kazakh Language. Sci. Rep. 2022, 12, 8337.
- Song, Q.; Sun, B.; Li, S. Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 10028–10038.
- Gondohanindijo, J.; Muljono; Noersasongko, E.; Pujiono; Setiadi, D.R.M. Multi-Features Audio Extraction for Speech Emotion Recognition Based on Deep Learning. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 198–206.
- Tyagi, S.; Szénási, S. Optimizing Speech Emotion Recognition with Deep Learning and Grey Wolf Optimization: A Multi-Dataset Approach. Algorithms 2024, 17, 90.
- Bhanbhro, J.; Memon, A.A.; Lal, B.; Talpur, S.; Memon, M. Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models. Signals 2025, 6, 22.
- Yu, S.; Meng, J.; Fan, W.; Chen, Y.; Zhu, B.; Yu, H.; Xie, Y.; Sun, Q. Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion. Electronics 2024, 13, 2191.
- Wei, Z.; Ge, C.; Su, C.; Chen, R.; Sun, J. A Deep Learning Model for Speech Emotion Recognition on RAVDESS Dataset. Int. J. Adv. Comput. Sci. Appl. 2025, 16, 316–323.
- Makhmudov, F.; Kutlimuratov, A.; Cho, Y.-I. Hybrid LSTM–Attention and CNN Model for Enhanced Speech Emotion Recognition. Appl. Sci. 2024, 14, 11342.
- Kim, J.-Y.; Lee, S.-H. Accuracy Enhancement Method for Speech Emotion Recognition From Spectrogram Using Temporal Frequency Correlation and Positional Information Learning Through Knowledge Transfer. IEEE Access 2024, 12, 128039–128048.
- Huang, Z.; Ji, S.; Hu, Z.; Cai, C.; Luo, J.; Yang, X. ADFF: Attention Based Deep Feature Fusion Approach for Music Emotion Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; ISCA: Singapore, 2022; pp. 4152–4156.
- de Souza, D.B.; Bakri, K.J.; de Souza Ferreira, F.; Inacio, J. Multitaper-Mel Spectrograms for Keyword Spotting. IEEE Signal Process. Lett. 2022, 29, 2028–2032.
- McAllister, T.; Gambäck, B. Music Style Transfer Using Constant-Q Transform Spectrograms. In Artificial Intelligence in Music, Sound, Art and Design; Springer International Publishing: Cham, Switzerland, 2022; pp. 195–211.
- Raju, V.V.N.; Saravanakumar, R.; Yusuf, N.; Pradhan, R.; Hamdi, H.; Saravanan, K.A.; Rao, V.S.; Askar, M.A. Enhancing Emotion Prediction Using Deep Learning and Distributed Federated Systems with SMOTE Oversampling Technique. Alex. Eng. J. 2024, 108, 498–508.
- Ding, Z.; Wang, Z.; Zhang, Y.; Cao, Y.; Liu, Y.; Shen, X.; Tian, Y.; Dai, J. Trade-Offs between Machine Learning and Deep Learning for Mental Illness Detection on Social Media. Sci. Rep. 2025, 15, 14497.
- Modi, N.; Kumar, Y.; Mehta, K.; Chaplot, N. Physiological Signal-Based Mental Stress Detection Using Hybrid Deep Learning Models. Discov. Artif. Intell. 2025, 5, 166.
- Pathirana, A.; Rajakaruna, D.K.; Kasthurirathna, D.; Atukorale, A.; Aththidiye, R.; Yatipansalawa, M. A Reinforcement Learning-Based Approach for Promoting Mental Health Using Multimodal Emotion Recognition. J. Futur. Artif. Intell. Technol. 2024, 1, 124–142.
- Wang, Y.; Huang, J.; Zhao, Z.; Lan, H.; Zhang, X. Speech Emotion Recognition Using Multi-Scale Global–Local Representation Learning with Feature Pyramid Network. Appl. Sci. 2024, 14, 11494.
| Component | Proposed Configuration |
|---|---|
| Prosodic Features | 40–60 features (pitch, jitter, shimmer, intensity, HNR, pause rate, speech rate) |
| Prosody Encoder | Dense layers (128 → 64), ReLU activation, Dropout 0.3, Layer Normalization |
| Spectrogram Input | Parallel MTMFS (64 Mel filters, 25 ms frames, 10 ms hop, FFT = 2048) and CQTS (84 bins, 12 bins per octave) |
| CNN Encoder | 3 convolutional blocks (filters [32, 64, 128], kernel size 3 × 3, BatchNorm, ReLU, MaxPooling) |
| Temporal Encoder | BiGRU per branch (2 layers, 128 hidden units, Dropout 0.3) → Concatenation → Layer Normalization |
| Fusion Layer | Multi-Head Attention (8 heads); prosodic embedding normalized prior to fusion |
| Classifier | Dense layers [64, 128, 256], ReLU activation, Dropout 0.5, Softmax output (8 classes) |
| Optimizer | Adam (learning rate = 1 × 10⁻⁴) |
| Training Strategy | 100 epochs, batch size 32, Early Stopping (patience = 10, monitoring validation loss and macro F1), 5-fold subject-independent CV |
| Evaluation Metrics | Precision, Recall, F1-score, Accuracy |
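A hedged sketch of how the configuration in the table above could be assembled in Keras. The framework choice, input shapes, the pooling/reshaping between the CNN and BiGRU stages, and the use of the prosodic embedding as the attention query are assumptions that the table does not fix; this is an illustrative sketch, not the authors' implementation.

```python
# Hedged architecture sketch matching the configuration table (assumed details noted).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_prosodic=50, mel_shape=(64, 300, 1), cqt_shape=(84, 300, 1), n_classes=8):
    # Prosody encoder: Dense 128 -> 64, ReLU, Dropout 0.3, Layer Normalization.
    pros_in = layers.Input(shape=(n_prosodic,), name="prosodic")
    p = layers.Dense(128, activation="relu")(pros_in)
    p = layers.Dropout(0.3)(p)
    p = layers.Dense(64, activation="relu")(p)
    p = layers.LayerNormalization()(p)

    def spectro_temporal_branch(inp):
        # CNN encoder: 3 conv blocks (32/64/128 filters, 3x3, BatchNorm, ReLU, MaxPooling).
        x = inp
        for n_filters in (32, 64, 128):
            x = layers.Conv2D(n_filters, 3, padding="same")(x)
            x = layers.BatchNormalization()(x)
            x = layers.Activation("relu")(x)
            x = layers.MaxPooling2D(2)(x)
        # Assumed bridging step: move time to the first axis, flatten frequency x channels.
        x = layers.Permute((2, 1, 3))(x)
        x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
        # Temporal encoder: 2-layer BiGRU, 128 hidden units, Dropout 0.3.
        x = layers.Bidirectional(layers.GRU(128, return_sequences=True, dropout=0.3))(x)
        x = layers.Bidirectional(layers.GRU(128, return_sequences=True, dropout=0.3))(x)
        return x

    mel_in = layers.Input(shape=mel_shape, name="mtmfs")
    cqt_in = layers.Input(shape=cqt_shape, name="cqts")
    st = layers.Concatenate(axis=1)([spectro_temporal_branch(mel_in),
                                     spectro_temporal_branch(cqt_in)])
    st = layers.LayerNormalization()(st)

    # Fusion: Multi-Head Attention (8 heads); the normalized prosodic embedding
    # is used here as the query over the spatio-temporal sequence (assumed roles).
    query = layers.Reshape((1, 64))(p)
    fused = layers.MultiHeadAttention(num_heads=8, key_dim=32)(query=query, value=st, key=st)
    fused = layers.Flatten()(fused)

    # Classifier: Dense [64, 128, 256], ReLU, Dropout 0.5, Softmax over 8 classes.
    x = layers.Concatenate()([fused, p])
    for units in (64, 128, 256):
        x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(inputs=[pros_in, mel_in, cqt_in], outputs=out)
    # Training per the table: Adam (lr = 1e-4), 100 epochs, batch size 32,
    # EarlyStopping(patience=10) on validation loss, 5-fold subject-independent CV.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```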
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Disgust | 0.9788 | 0.9635 | 0.9711 | 192 |
| Calm | 0.9738 | 0.9688 | 0.9713 | 192 |
| Sad | 0.9846 | 1.0000 | 0.9922 | 192 |
| Happy | 0.9738 | 0.9688 | 0.9713 | 192 |
| Fear | 0.9796 | 1.0000 | 0.9897 | 192 |
| Angry | 0.9744 | 0.9896 | 0.9819 | 192 |
| Neutral | 0.9574 | 0.9375 | 0.9474 | 96 |
| Surprise | 0.9788 | 0.9635 | 0.9711 | 192 |
| Accuracy | - | - | 0.9764 | 1440 |
| Macro Avg | 0.9752 | 0.9740 | 0.9745 | 1440 |
| Weighted Avg | 0.9763 | 0.9764 | 0.9763 | 1440 |
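As a consistency check, the macro and weighted averages in the table can be reproduced directly from the per-class precisions and supports; the values below are copied from the table itself.

```python
# Reproduce the macro and weighted average precision from the table above.
precision = {"Disgust": 0.9788, "Calm": 0.9738, "Sad": 0.9846, "Happy": 0.9738,
             "Fear": 0.9796, "Angry": 0.9744, "Neutral": 0.9574, "Surprise": 0.9788}
support = {"Disgust": 192, "Calm": 192, "Sad": 192, "Happy": 192,
           "Fear": 192, "Angry": 192, "Neutral": 96, "Surprise": 192}

total = sum(support.values())                                      # 1440
weighted = sum(precision[c] * support[c] for c in precision) / total
macro = sum(precision.values()) / len(precision)
print(f"weighted precision = {weighted:.4f}, macro precision = {macro:.4f}")
# Expected output: weighted precision = 0.9763, macro precision = 0.9752 (matching the table)
```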
| Configuration | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| Without prosody features | 93.78 | 94.01 | 93.69 | 93.85 |
| Without MTMFS features | 93.23 | 93.42 | 93.23 | 93.32 |
| Without CQTS features | 95.18 | 95.25 | 95.18 | 95.21 |
| Without attention mechanism | 95.53 | 95.67 | 95.57 | 95.61 |
| Proposed (full) | 97.64 | 97.63 | 97.64 | 97.63 |
| Method | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| SVM [16] | 72.40 | 72.20 | 72.10 | - |
| HuBERT + DPCNN + CAF [30] | 81.86 | - | - | 82.84 |
| K-SVM + GWO [28] | 87.00 | 88.00 | 85.00 | 86.00 |
| 1D CNN + Feature Fusion [2] | 91.90 | 90.50 | 91.10 | 90.80 |
| CNN + LSTM [32] | 95.70 | 93.49 | 94.99 | 94.20 |
| MTMFS + GS + CQTS + PEmoNet [1] | 97.41 | 97.53 | 97.53 | 97.26 |
| Ours | 97.64 | 97.63 | 97.64 | 97.63 |