Proceeding Paper

Leveraging MFCC and Mel-Spectrogram Representations for Deep Learning-Based Speech Recognition †

by Jose Antonio Lopez-Olvera 1, Hector Manuel Perez-Meana 1,*, Elizabeth Garcia-Rios 2 and Enrique Escamilla-Hernandez 1

1 ESIME Culhuacan, Instituto Politecnico Nacional, Mexico City 04440, Mexico
2 Instituto Tecnologico Superior del Occidente del Estado de Hidalgo, Mixquiahuala 42700, Mexico
* Author to whom correspondence should be addressed.
Presented at the First Summer School on Artificial Intelligence in Cybersecurity, Cancun, Mexico, 3–7 November 2025.
Eng. Proc. 2026, 123(1), 22; https://doi.org/10.3390/engproc2026123022
Published: 5 February 2026
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)

Abstract

This work proposes an audio feature pipeline to support machine learning tasks through the extraction of Mel-frequency cepstral coefficients and Mel-spectrograms, which are then used as the input to a convolutional neural network trained to perform the classification task. This approach enables the creation of a feature-rich dataset and an end-to-end pipeline that narrows the gap between raw audio and machine-learning-ready models, with applications in sound classification, speech recognition, and spatial audio analysis.

Speech is one of the most natural and efficient forms of human communication. In the field of artificial intelligence, enabling machines to understand spoken language has been a long-standing goal. Among speech technologies, Mel-frequency cepstral coefficients (MFCCs) [1] have become one of the most widely adopted representations. MFCCs are inspired by the human auditory system, mapping frequencies onto the Mel scale to better match human pitch perception, and they provide a compact set of features that capture both the spectral envelope and the timbral properties of speech.
The present work explores the design of a speech recognition system based on MFCC feature extraction and a convolutional neural network (CNN) [2] as the classifier, leveraging the perceptual strengths of MFCCs and the learning capabilities of the model, as shown in Figure 1.
This study uses the Common Voice dataset, version "Common Voice Delta Segment 10.0" [3], which contains 25 recorded hours and 4 validated hours. Although the audio data are of high quality, the corpus is divided into several subsets, such as 'reported.tsv', 'dev.tsv', and 'validated.tsv'. In this project, data from the 'validated.tsv' file were used. In particular, the last five clients (IDs 118 to 122) contributed approximately 112 s, 82 s, 206 s, 125 s, and 308 s of audio, respectively.
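To make the data-selection step concrete, the sketch below shows how the 'validated.tsv' subset might be loaded and the last five speakers isolated. It assumes a local copy of the corpus; the directory name and the mapping of IDs 118 to 122 to the last five unique speaker entries are illustrative assumptions, since Common Voice anonymizes client IDs as hashes.

```python
# Minimal data-selection sketch, assuming a local Common Voice download.
# The path and the "last five speakers" indexing are illustrative.
import pandas as pd

CV_DIR = "cv-corpus-10.0-delta"  # hypothetical local directory

# validated.tsv is tab-separated and includes client_id and path columns.
df = pd.read_csv(f"{CV_DIR}/validated.tsv", sep="\t")

# Map each unique speaker hash to a sequential index and keep the last five.
speakers = df["client_id"].unique()
last_five = set(speakers[-5:])  # assumed to correspond to IDs 118-122
subset = df[df["client_id"].isin(last_five)]

print(subset.groupby("client_id")["path"].count())
```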
When conducting Fourier analysis on finite-length signals, the Hamming window [4] minimizes spectral leakage, which makes it an important tool in signal processing. By gently tapering the signal at its edges, it reduces the abrupt discontinuities that allow energy from one frequency to spread into neighboring ones. As a result, the spectrum exhibits lower side lobes, which lessens leakage, at the cost of a slightly widened main lobe and thus a marginal loss of frequency resolution. This window is defined by Equation (1).
$w(n) = 0.54 - 0.46 \cos\left(\dfrac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)$
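As a quick check, the explicit formula in Equation (1) can be computed with NumPy and compared against the library's built-in Hamming window, which implements the same definition:

```python
# Equation (1) computed explicitly and verified against np.hamming,
# which uses the same 0.54 - 0.46*cos(2*pi*n/(N-1)) definition.
import numpy as np

N = 512  # window length, matching the FFT size in Table 1
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Equation (1)

assert np.allclose(w, np.hamming(N))  # library implementation agrees
```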
To convert a signal from the time domain to the frequency domain, the FFT is used, since it allows the perceptually meaningful features on the Mel scale to be extracted efficiently. This transform is presented in Equation (2) [5].
$S_i(k) = \displaystyle\sum_{n=1}^{N} S_i(n)\, w(n)\, e^{-2j\pi k n / N} \qquad (2)$
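A minimal sketch of Equation (2) follows: the signal is split into frames, each frame is multiplied by the Hamming window, and the FFT of each windowed frame is taken. Frame and hop sizes follow Table 1; the input is assumed to be a mono signal at 32 kHz.

```python
# Framing + windowing + per-frame FFT, per Equation (2).
import numpy as np

def windowed_fft(y, n_fft=512, hop=128):
    w = np.hamming(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * w, axis=1)  # one spectrum per frame

spectra = windowed_fft(np.random.randn(32000))  # e.g., 1 s of audio
print(spectra.shape)  # (n_frames, n_fft // 2 + 1)
```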
MFCCs were extracted from each input speech signal, which was segmented into frames of 20 ms, 30 ms, and 40 ms to capture different temporal resolutions of the speech features. Furthermore, for each frame length, overlaps of 25%, 30%, and 35% were applied to reduce the loss of information at the frame boundaries. Combining frame lengths and overlaps allows a better characterization of the speech signal in both its spectral and temporal dimensions, improving the accuracy of the feature analysis. The specific parameters for MFCC extraction are summarized in Table 1.
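The extraction step could look like the following librosa sketch, using the parameter values from Table 1. The file path is a placeholder, and the z-score normalization follows the "Feature normalization" row of Table 1.

```python
# MFCC extraction sketch with the Table 1 parameters.
import librosa
import numpy as np

y, sr = librosa.load("clip.mp3", sr=32000)  # resample to 32 kHz

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=20,       # number of MFCC coefficients (Table 1)
    n_fft=512,       # FFT window size
    hop_length=128,  # hop length
    n_mels=20,       # number of Mel bands
)

# z-score normalization: zero mean, unit variance per coefficient.
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
```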
To prepare the CNN for training, the parameters shown in Table 2 were used, yielding the results displayed in Table 3, which demonstrate that both the segmentation length and the overlap ratio significantly influence the precision, recall, and F1-score. The best results were obtained with a 30 ms frame and a 25% overlap; this configuration appears to provide an optimal trade-off between temporal resolution and contextual information, enabling the model to capture discriminative speaker features more effectively. The training curves and the normalized confusion matrix are shown in Figure 2, where each label on the confusion matrix (0 to 4) represents one speaker.
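A minimal Keras sketch consistent with Table 2 is shown below. The input shape is illustrative (it depends on the chosen frame/overlap configuration), and the ReLU activations on the convolutional layers are an assumption, since Table 2 only specifies ReLU for the dense layer.

```python
# CNN sketch following the Table 2 parameters: two conv blocks (32 and 64
# filters, 3x3 kernels, 2x2 max pooling), a 128-neuron ReLU dense layer,
# Adam, sparse categorical cross-entropy, batch size 32, 26 epochs.
import tensorflow as tf
from sklearn.model_selection import train_test_split

def build_model(input_shape, n_speakers=5):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(n_speakers, activation="softmax"),
    ])

model = build_model(input_shape=(20, 63, 1))  # e.g., 20 MFCCs x 63 frames
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: MFCC feature maps, y: integer speaker labels (0-4), both assumed to
# come from the extraction step above.
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.20, random_state=42)  # 80/20 split, seed 42
# model.fit(X_train, y_train, batch_size=32, epochs=26,
#           validation_data=(X_test, y_test))
```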

Author Contributions

J.A.L.-O. is the principal author of this paper, responsible for the conception, development, and implementation of the methodology, as well as conducting the experiments and analyzing the results. H.M.P.-M., E.G.-R., and E.E.-H. supervised the proposal, the design of experiments, the analysis of results, and the writing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study come from the publicly available Mozilla Common Voice Corpus, Delta Segment 10.0, specifically the 'validated.tsv' subset. The dataset is available through the Mozilla Common Voice project at https://commonvoice.mozilla.org (accessed on 19 September 2025) and is distributed under the Creative Commons CC0 license.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tirronen, S.; Kadiri, S.; Alku, P. The Effect of the MFCC Frame Length in Automatic Voice Pathology Detection. J. Voice 2024, 38, 975–982.
2. Vidhi, S.; Seeja, K.R. Speech Emotion Recognition Using Mel Spectrogram and Convolutional Neural Networks (CNN). Procedia Comput. Sci. 2025, 258, 3693–3702.
3. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. arXiv 2020, arXiv:1912.06670.
4. Yan, Y.; Simons, S.O.; van Bemmel, L.; Reinders, L.G.; Franssen, F.M.E.; Urovi, V. Optimizing MFCC parameters for the automatic detection of respiratory diseases. Appl. Acoust. 2025, 228, 110299.
5. Lee, T.-Y.; Huang, C.-H.; Chen, W.-C.; Liu, M.-J. A low-area dynamic reconfigurable MDC FFT processor design. Microprocess. Microsyst. 2016, 42, 227–234.
Figure 1. Workflow for the proposed method.
Figure 2. Results of the different combinations of segmentation and overlap: (a) normalized confusion matrix using a frame length of 30 ms and an overlap of 25%; (b) training and validation accuracy and loss curves of the proposed speaker recognition model over 26 epochs.
Table 1. MFCC extraction parameters.
Parameter                    Value
Sampling rate                32,000 Hz
FFT window size              512
Hop length                   128
Number of Mel bands          20
Number of MFCC coefficients  20
Feature normalization        z-score (mean and std)
Table 2. CNN training parameters.
Parameter             Value
Convolutional layers  32 and 64 filters, kernel size (3,3)
Pooling layers        MaxPooling2D with pool size (2,2) after each conv layer
Dense layers          128 neurons (ReLU)
Optimizer             Adam
Loss function         Sparse categorical cross-entropy
Batch size            32
Epochs                26
Training split        80%
Test split            20%
Random state          42
Table 3. Performance metrics for different frame lengths and overlaps.
Frame length (ms)  Overlap (%)  Epochs  Precision  Recall  F1-Score
20                 25           26      94.45      94.50   94.40
20                 30           26      93.72      93.70   93.58
20                 35           26      94.94      95.01   94.93
30                 25           26      95.37      95.44   95.34
30                 30           26      93.38      93.50   93.30
30                 35           26      93.65      93.74   93.60
40                 25           26      93.01      92.99   92.91
40                 30           26      94.55      94.57   94.51
40                 35           26      95.08      95.10   94.98
Metrics are averaged over multiple training runs.