Leveraging MFCC and Mel-Spectrogram Representations for Deep Learning-Based Speech Recognition


| Parameter | Value |
|---|---|
| Sampling rate | 32,000 Hz |
| FFT window size | 512 samples |
| Hop length | 128 samples |
| Number of Mel bands | 20 |
| Number of MFCC coefficients | 20 |
| Feature normalization | z-score (mean and std) |

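
To make the feature-extraction settings above more concrete, the following minimal sketch converts the tabulated window and hop sizes into milliseconds and overlap, counts the frames a clip would yield, and applies z-score normalization. The helper names, the 1 s clip length, the absence of edge padding, and the use of the population standard deviation are assumptions for illustration; the source does not say whether normalization is applied globally or per coefficient.

```python
from statistics import fmean, pstdev

SAMPLE_RATE = 32_000  # Hz, from the table
N_FFT = 512           # FFT window size in samples, from the table
HOP_LENGTH = 128      # hop length in samples, from the table


def frame_timing(sample_rate, n_fft, hop_length):
    """Return (window in ms, hop in ms, overlap in %) for an STFT setup."""
    window_ms = 1000 * n_fft / sample_rate
    hop_ms = 1000 * hop_length / sample_rate
    overlap_pct = 100 * (n_fft - hop_length) / n_fft
    return window_ms, hop_ms, overlap_pct


def num_frames(n_samples, n_fft, hop_length):
    """Count full frames that fit in the signal (assumes no edge padding)."""
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop_length


def zscore(values):
    """Standardize values to zero mean and unit (population) std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant input: avoid dividing by zero
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    print(frame_timing(SAMPLE_RATE, N_FFT, HOP_LENGTH))
    # Hypothetical 1 s clip (32,000 samples); the clip length is an assumption.
    print(num_frames(SAMPLE_RATE, N_FFT, HOP_LENGTH))
```

Under these assumptions, a 512-sample window at 32,000 Hz spans 16 ms with a 4 ms hop (75% overlap), and a 1 s clip yields 247 unpadded frames. Libraries that center-pad the signal would report a different frame count.
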
| Parameter | Value |
|---|---|
| Convolutional layers | 32 and 64 filters, kernel size (3,3) |
| Pooling layers | MaxPooling2D with pool size (2,2) after each conv layer |
| Dense layers | 128 neurons (ReLU) |
| Optimizer | Adam |
| Loss function | Sparse categorical cross-entropy |
| Batch size | 32 |
| Epochs | 26 |
| Training split | 80% |
| Test split | 20% |
| Random state | 42 |
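
As a rough check on the size of the network described above, the sketch below walks a feature map through the two convolution and pooling stages and counts trainable parameters. It assumes unpadded ("valid") convolutions with stride 1 and a hypothetical input of 20 MFCCs by 247 frames by 1 channel; none of these are stated in the source. The output layer is omitted because the number of classes is not given.

```python
def conv2d_shape(shape, filters, kernel=(3, 3)):
    """Output shape of an unpadded, stride-1 2D convolution (assumption)."""
    height, width, _ = shape
    return (height - kernel[0] + 1, width - kernel[1] + 1, filters)


def conv2d_params(in_channels, filters, kernel=(3, 3)):
    """Weights plus one bias per filter."""
    return (kernel[0] * kernel[1] * in_channels + 1) * filters


def maxpool_shape(shape, pool=(2, 2)):
    """Output shape of non-overlapping max pooling (floor division)."""
    height, width, channels = shape
    return (height // pool[0], width // pool[1], channels)


def summarize(input_shape):
    """List (layer name, output shape, parameter count) for the tabulated stack."""
    rows = []
    shape = input_shape
    for filters in (32, 64):  # filter counts from the table
        params = conv2d_params(shape[2], filters)
        shape = conv2d_shape(shape, filters)
        rows.append((f"conv2d_{filters}", shape, params))
        shape = maxpool_shape(shape)
        rows.append(("maxpool", shape, 0))
    flat = shape[0] * shape[1] * shape[2]
    rows.append(("flatten", (flat,), 0))
    rows.append(("dense_128", (128,), (flat + 1) * 128))
    return rows


if __name__ == "__main__":
    # Hypothetical input: 20 MFCCs x 247 frames x 1 channel.
    for name, shape, params in summarize((20, 247, 1)):
        print(f"{name:<10} {str(shape):<16} {params}")
```

With this hypothetical input, almost all parameters sit in the 128-unit dense layer, so the input length (number of frames) largely determines the model size.
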

| Separation time (ms) | Overlap (%) | Epochs | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|---|---|
| 20 | 25 | 26 | 94.45 | 94.50 | 94.40 |
| 20 | 30 | 26 | 93.72 | 93.70 | 93.58 |
| 20 | 35 | 26 | 94.94 | 95.01 | 94.93 |
| 30 | 25 | 26 | 95.37 | 95.44 | 95.34 |
| 30 | 30 | 26 | 93.38 | 93.50 | 93.30 |
| 30 | 35 | 26 | 93.65 | 93.74 | 93.60 |
| 40 | 25 | 26 | 93.01 | 92.99 | 92.91 |
| 40 | 30 | 26 | 94.55 | 94.57 | 94.51 |
| 40 | 35 | 26 | 95.08 | 95.10 | 94.98 |

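
The sweep in the table above can be translated into sample counts with a short calculation. This sketch assumes that the separation time is the analysis frame length, that overlap is the share of a frame reused by the next one, and that the 32,000 Hz sampling rate applies to these runs; the function name is hypothetical.

```python
SAMPLE_RATE = 32_000  # Hz, assumed to apply to every run in the sweep


def frame_and_hop(frame_ms, overlap_pct, sample_rate=SAMPLE_RATE):
    """Convert a frame length (ms) and overlap (%) to (frame, hop) in samples."""
    frame = sample_rate * frame_ms // 1000
    hop = frame * (100 - overlap_pct) // 100
    return frame, hop


if __name__ == "__main__":
    for frame_ms in (20, 30, 40):
        for overlap_pct in (25, 30, 35):
            frame, hop = frame_and_hop(frame_ms, overlap_pct)
            print(f"{frame_ms} ms, {overlap_pct}% overlap -> "
                  f"frame={frame} samples, hop={hop} samples")
```

For example, under these assumptions the best-scoring row (30 ms, 25% overlap) corresponds to a 960-sample frame with a 720-sample hop.
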
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lopez-Olvera, J.A.; Perez-Meana, H.M.; Garcia-Rios, E.; Escamilla-Hernandez, E. Leveraging MFCC and Mel-Spectrogram Representations for Deep Learning-Based Speech Recognition. Eng. Proc. 2026, 123, 22. https://doi.org/10.3390/engproc2026123022

