Proceeding Paper

Leveraging MFCC and Mel-Spectrogram Representations for Deep Learning-Based Speech Recognition †

by Jose Antonio Lopez-Olvera 1, Hector Manuel Perez-Meana 1,*, Elizabeth Garcia-Rios 2 and Enrique Escamilla-Hernandez 1

1 ESIME Culhuacan, Instituto Politecnico Nacional, Mexico City 04440, Mexico
2 Instituto Tecnologico Superior del Occidente del Estado de Hidalgo, Mixquiahuala 42700, Mexico
* Author to whom correspondence should be addressed.
Presented at the First Summer School on Artificial Intelligence in Cybersecurity, Cancun, Mexico, 3–7 November 2025.
Eng. Proc. 2026, 123(1), 22; https://doi.org/10.3390/engproc2026123022
Published: 5 February 2026
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)

Abstract

This work proposes an audio feature pipeline to support machine learning tasks through the extraction of Mel-frequency cepstral coefficients and Mel-spectrograms, which are then used as the input to a convolutional neural network trained to perform the classification task. This approach enables the creation of a feature-rich dataset and an end-to-end pipeline that narrows the gap between raw audio and machine-learning-ready models, with applications in sound classification, speech recognition, and spatial audio analysis.

Speech is one of the most natural and efficient forms of human communication. In the field of artificial intelligence, enabling machines to understand spoken language has been a long-standing goal. Among speech technologies, Mel-frequency cepstral coefficients (MFCCs) [1] have become one of the most widely adopted representations. MFCCs are inspired by the human auditory system, mapping frequencies onto the Mel scale to better match human pitch perception, and they provide a compact set of features that capture both the spectral envelope and the timbral properties of speech.
The present work explores the design of a speech recognition system based on MFCC feature extraction and a convolutional neural network (CNN) [2] as the classifier, leveraging the perceptual strengths of MFCCs and the learning capabilities of the model, as shown in Figure 1.
This study uses the Common Voice dataset, version "Common Voice Delta Segment 10.0" [3], which contains 25 recorded hours and 4 validated hours. Although the audio data are of high quality, the corpus is divided into several subsets, such as 'reported.tsv', 'dev.tsv', and 'validated.tsv'. In this project, data from the 'validated.tsv' file were used. In particular, the last five clients (IDs 118 to 122) contributed approximately 112 s, 82 s, 206 s, 125 s, and 308 s of audio, respectively.
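To make the data-selection step concrete, the sketch below shows how the 'validated.tsv' subset might be loaded and the last five speakers isolated. It assumes a local copy of the corpus; the directory name and the mapping of IDs 118 to 122 to the last five unique speaker entries are illustrative assumptions, since Common Voice anonymizes client IDs as hashes.

```python
# Minimal data-selection sketch, assuming a local Common Voice download.
# The path and the "last five speakers" indexing are illustrative.
import pandas as pd

CV_DIR = "cv-corpus-10.0-delta"  # hypothetical local directory

# validated.tsv is tab-separated and includes client_id and path columns.
df = pd.read_csv(f"{CV_DIR}/validated.tsv", sep="\t")

# Map each unique speaker hash to a sequential index and keep the last five.
speakers = df["client_id"].unique()
last_five = set(speakers[-5:])  # assumed to correspond to IDs 118-122
subset = df[df["client_id"].isin(last_five)]

print(subset.groupby("client_id")["path"].count())
```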
When conducting Fourier analysis on finite-length signals, the Hamming window [4] minimizes spectral leakage, which makes it an important tool in signal processing. By gently tapering the signal at its edges, it reduces the abrupt discontinuities that allow energy from one frequency to spread into neighboring ones. As a result, the spectrum exhibits lower side lobes, which lessens leakage, at the cost of a slightly widened main lobe and thus a marginal loss of frequency resolution. This window is defined by Equation (1).
$w(n) = 0.54 - 0.46 \cos\left(\dfrac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)$
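As a quick check, the explicit formula in Equation (1) can be computed with NumPy and compared against the library's built-in Hamming window, which implements the same definition:

```python
# Equation (1) computed explicitly and verified against np.hamming,
# which uses the same 0.54 - 0.46*cos(2*pi*n/(N-1)) definition.
import numpy as np

N = 512  # window length, matching the FFT size in Table 1
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Equation (1)

assert np.allclose(w, np.hamming(N))  # library implementation agrees
```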
To convert a signal from the time domain to the frequency domain, the FFT is used, since it allows the perceptually meaningful features on the Mel scale to be extracted efficiently. This transform is presented in Equation (2) [5].
$S_i(k) = \displaystyle\sum_{n=1}^{N} S_i(n)\, w(n)\, e^{-2j\pi k n / N} \qquad (2)$
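A minimal sketch of Equation (2) follows: the signal is split into frames, each frame is multiplied by the Hamming window, and the FFT of each windowed frame is taken. Frame and hop sizes follow Table 1; the input is assumed to be a mono signal at 32 kHz.

```python
# Framing + windowing + per-frame FFT, per Equation (2).
import numpy as np

def windowed_fft(y, n_fft=512, hop=128):
    w = np.hamming(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * w, axis=1)  # one spectrum per frame

spectra = windowed_fft(np.random.randn(32000))  # e.g., 1 s of audio
print(spectra.shape)  # (n_frames, n_fft // 2 + 1)
```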
MFCCs were extracted from each input speech signal, which was segmented into frames of 20 ms, 30 ms, and 40 ms to capture different temporal resolutions of the speech features. Furthermore, for each frame length, overlaps of 25%, 30%, and 35% were applied to reduce the loss of information at the frame boundaries. Combining frame lengths and overlaps allows a better characterization of the speech signal in both its spectral and temporal dimensions, improving the accuracy of the feature analysis. The specific parameters for MFCC extraction are summarized in Table 1.
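The extraction step could look like the following librosa sketch, using the parameter values from Table 1. The file path is a placeholder, and the z-score normalization follows the "Feature normalization" row of Table 1.

```python
# MFCC extraction sketch with the Table 1 parameters.
import librosa
import numpy as np

y, sr = librosa.load("clip.mp3", sr=32000)  # resample to 32 kHz

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=20,       # number of MFCC coefficients (Table 1)
    n_fft=512,       # FFT window size
    hop_length=128,  # hop length
    n_mels=20,       # number of Mel bands
)

# z-score normalization: zero mean, unit variance per coefficient.
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
```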
To prepare the CNN for training, the parameters shown in Table 2 were used, yielding the results displayed in Table 3, which demonstrate that both the segmentation length and the overlap ratio significantly influence the precision, recall, and F1-score. The best results were obtained with a 30 ms frame and a 25% overlap; this configuration appears to provide an optimal trade-off between temporal resolution and contextual information, enabling the model to capture discriminative speaker features more effectively. The training curves and the normalized confusion matrix are shown in Figure 2, where each label on the confusion matrix (0 to 4) represents one speaker.
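A minimal Keras sketch consistent with Table 2 is shown below. The input shape is illustrative (it depends on the chosen frame/overlap configuration), and the ReLU activations on the convolutional layers are an assumption, since Table 2 only specifies ReLU for the dense layer.

```python
# CNN sketch following the Table 2 parameters: two conv blocks (32 and 64
# filters, 3x3 kernels, 2x2 max pooling), a 128-neuron ReLU dense layer,
# Adam, sparse categorical cross-entropy, batch size 32, 26 epochs.
import tensorflow as tf
from sklearn.model_selection import train_test_split

def build_model(input_shape, n_speakers=5):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(n_speakers, activation="softmax"),
    ])

model = build_model(input_shape=(20, 63, 1))  # e.g., 20 MFCCs x 63 frames
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: MFCC feature maps, y: integer speaker labels (0-4), both assumed to
# come from the extraction step above.
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.20, random_state=42)  # 80/20 split, seed 42
# model.fit(X_train, y_train, batch_size=32, epochs=26,
#           validation_data=(X_test, y_test))
```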

Author Contributions

J.A.L.-O. is the principal author of this paper, responsible for the conception, development, and implementation of the methodology, as well as conducting the experiments and analyzing the results. H.M.P.-M., E.G.-R., and E.E.-H. supervised the proposal, the design of experiments, the analysis of results, and the writing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study come from the publicly available Mozilla Common Voice Corpus, Delta Segment 10.0, specifically the 'validated.tsv' subset. The dataset is available through the Mozilla Common Voice project at https://commonvoice.mozilla.org (accessed on 19 September 2025) and is distributed under the Creative Commons CC0 license.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tirronen, S.; Kadiri, S.; Alku, P. The Effect of the MFCC Frame Length in Automatic Voice Pathology Detection. J. Voice 2024, 38, 975–982.
2. Vidhi, S.; Seeja, K.R. Speech Emotion Recognition Using Mel Spectrogram and Convolutional Neural Networks (CNN). Procedia Comput. Sci. 2025, 258, 3693–3702.
3. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. arXiv 2020, arXiv:1912.06670.
4. Yan, Y.; Simons, S.O.; van Bemmel, L.; Reinders, L.G.; Franssen, F.M.E.; Urovi, V. Optimizing MFCC parameters for the automatic detection of respiratory diseases. Appl. Acoust. 2025, 228, 110299.
5. Lee, T.-Y.; Huang, C.-H.; Chen, W.-C.; Liu, M.-J. A low-area dynamic reconfigurable MDC FFT processor design. Microprocess. Microsyst. 2016, 42, 227–234.
Figure 1. Workflow for the proposed method.
Figure 2. Results of the different combinations of segmentation and overlap: (a) normalized confusion matrix using a frame length of 30 ms and an overlap of 25%; (b) training and validation accuracy and loss curves of the proposed speaker recognition model over 26 epochs.
Table 1. MFCC extraction parameters.
Parameter                    Value
Sampling rate                32,000 Hz
FFT window size              512
Hop length                   128
Number of Mel bands          20
Number of MFCC coefficients  20
Feature normalization        z-score (mean and std)
Table 2. CNN training parameters.
Parameter             Value
Convolutional layers  32 and 64 filters, kernel size (3,3)
Pooling layers        MaxPooling2D with pool size (2,2) after each conv layer
Dense layers          128 neurons (ReLU)
Optimizer             Adam
Loss function         Sparse categorical cross-entropy
Batch size            32
Epochs                26
Training split        80%
Test split            20%
Random state          42
Table 3. Performance metrics for different frame lengths and overlaps.
Frame length (ms)  Overlap (%)  Epochs  Precision  Recall  F1-Score
20                 25           26      94.45      94.50   94.40
20                 30           26      93.72      93.70   93.58
20                 35           26      94.94      95.01   94.93
30                 25           26      95.37      95.44   95.34
30                 30           26      93.38      93.50   93.30
30                 35           26      93.65      93.74   93.60
40                 25           26      93.01      92.99   92.91
40                 30           26      94.55      94.57   94.51
40                 35           26      95.08      95.10   94.98
Metrics are averaged over multiple training runs.