A Comparative Experimental Study on Simple Features and Lightweight Models for Voice Activity Detection in Noisy Environments
Abstract
1. Introduction
- We extend the MATLAB deep-learning VAD tutorial example [23] into a larger and more controlled testbed, where the noise types and SNR values are fixed, both single-SNR and multi-SNR conditions are prepared, and all models share a single MATLAB implementation, so that different lightweight architectures and feature sets can be compared under exactly the same settings.
- We design and evaluate several small BiLSTM and CNN–BiLSTM networks (around 5M parameters and smaller) and examine how CNN kernel length, number of channels, and dropout placement affect VAD performance, showing that a relatively compact CNN front-end already provides most of the benefit and that further increasing CNN size yields only marginal or no additional gains for the considered tasks.
- We compare three kinds of input features on the same backbones, namely the 9-dimensional spectral-plus-periodicity set, 13- and 39-dimensional MFCCs, and FBANKs, and observe that MFCC-based systems consistently outperform the baseline spectral–periodicity features and are generally preferable to FBANKs in terms of recall and F1-score for these lightweight models.
- By combining the above empirical findings, including seed-robustness, feature-sensitivity, compression, and runtime analyses, we provide concrete guidelines on how to choose model size, CNN front-end configuration, and feature representation for noise-robust VAD on devices with limited memory and computation (e.g., embedded or edge platforms).
- Overall, the study is intended as a benchmark guideline publication: the proposed MATLAB-based setup offers a simple, transparent, and reproducible reference benchmark for future work on lightweight VAD, rather than introducing a fundamentally new deep-learning architecture.
2. The Backbone VAD Network
3. Presented Schemes
3.1. Baseline Model and Its Extension with Dropout
3.2. Architectural Innovation: CNN-Enhanced Model Variants
3.3. Alternative Feature Sets
4. Experimental Setup
4.1. Noise Types
4.2. Dataset Partitioning
4.3. Audio Preprocessing
4.4. Mixture Generation and SNR Setting
4.5. Feature Extraction and Sequence Buffering
4.6. Evaluation Metrics for VAD
- Area Under the Receiver Operating Characteristic Curve (AUROC): The AUROC score represents the probability that a randomly chosen positive sample (speech) is ranked higher than a randomly chosen negative sample (non-speech) by the VAD model. It summarizes diagnostic performance across all discrimination thresholds and is particularly valuable when evaluating models on imbalanced datasets.
- Accuracy: Accuracy denotes the proportion of correctly classified speech and non-speech frames out of the total number of frames. It provides a direct measure of overall system effectiveness but may be less informative in scenarios with significant class imbalance: $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$ where $TP$, $TN$, $FP$, and $FN$ denote the number of true positives, true negatives, false positives, and false negatives, respectively.
- Recall: Recall quantifies the ability of the model to identify actual speech frames, reflecting its robustness against missed detections: $$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
- Precision: Precision is the fraction of detected speech frames that are truly speech, indicating the model's resistance to false alarms: $$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
- F1-score: The F1-score is the harmonic mean of precision and recall, offering a single measure that balances missed detections against false alarms: $$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
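The five metrics above can be computed directly from frame-level labels and model scores. The following is a minimal sketch, not the study's evaluation code: the function names `auroc` and `vad_metrics`, the 0.5 decision threshold, and the toy data are all assumptions made for illustration. AUROC is computed by its pairwise-ranking definition, with ties counted as one half.

```python
from typing import Dict, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random speech frame outscores a random non-speech frame."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one frame of each class")
    # Count correctly ordered (speech, non-speech) pairs; ties contribute one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def vad_metrics(labels: Sequence[int], scores: Sequence[float],
                threshold: float = 0.5) -> Dict[str, float]:
    """Frame-level accuracy, recall, precision, and F1 at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of dividing by zero for degenerate inputs.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy example (illustrative values only): 1 = speech, 0 = non-speech.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
print(vad_metrics(labels, scores))
print("AUROC:", auroc(labels, scores))
```

Unlike the thresholded metrics, AUROC depends only on the ordering of the scores, so it does not change when the decision threshold is moved.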
5. Experimental Results and Discussions
5.1. Comparative Analysis of Model Variants
- Model (1): Baseline BiLSTM. The baseline implements two stacked BiLSTM layers trained on the original feature set. Its performance reaches AUROC , accuracy , recall , precision , and F1 score . While recall is relatively high, the lower precision and F1 score indicate frequent false positives. This model is also prone to overfitting on modestly sized datasets.
- Model (2): BiLSTM with Dropout. Introducing dropout layers (rate ) leads to observable improvements: AUROC increases to , accuracy to , and both precision and F1 score also increase. Dropout thus reduces overfitting and improves robustness to data variability.
- Model (3): CNN-Enhanced (). Augmenting the network with a 1D CNN layer () yields substantially better performance: AUROC , accuracy , precision , and F1 score , all the best among the compared models. The expansion to 32 feature channels enables more discriminative short-term feature extraction, supplying richer input to the following layers.
- Model (4): Increased Filters (). Doubling the number of CNN filters to 64 yields slightly reduced performance: AUROC , accuracy . The marginal drop suggests that simply increasing the filter count does not guarantee improvement and may introduce redundancy or overfitting.
- Model (5): Reduced Kernel Size (). With a smaller kernel, Model (5) achieves AUROC and accuracy . Recall and precision remain strong, but overall results fall short of Model (3). While smaller kernels can better capture short-term dynamics, an excessive reduction in temporal context may limit performance on longer sequences.
- Model (6): Enlarged Kernel Size (). Increasing the kernel size to 7 raises recall to (the highest in the group), but AUROC and F1 score decline to and , respectively. This reveals a trade-off: a larger kernel detects more of the true speech frames but is less precise, resulting in more false alarms.
- Summary of Architectural Trends: The results highlight the importance of CNN kernel size and filter count for VAD performance. Moderate kernel sizes and appropriately chosen filter counts help balance recall and precision. Dropout remains important for generalization, especially when working with limited data. These findings are consistent across the evaluated metrics and offer practical guidance for model selection.
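To make the notion of temporal context concrete, the sketch below applies a single-channel 1D convolution with zero "same" padding to an impulse and counts how many output frames respond. It is an illustration only: the helper `conv1d_same`, the all-ones kernels, and the 15-frame input are assumptions, and the study's CNN front-end uses learned multi-channel filters rather than this toy kernel.

```python
from typing import List, Sequence


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Single-channel 1D convolution (cross-correlation form) with zero 'same' padding."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("use an odd kernel length so the window is centered")
    half = k // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    # Output frame t sees input frames t-half .. t+half.
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


# A single active frame in the middle of a 15-frame sequence.
impulse = [0.0] * 15
impulse[7] = 1.0

for k in (3, 7):
    out = conv1d_same(impulse, [1.0] * k)
    active = sum(1 for v in out if v != 0.0)
    print(f"kernel length {k}: {active} output frames respond to the impulse")
```

Each output frame of a length-k kernel aggregates k neighboring input frames, which is one way to read the recall and precision trade-off between shorter and longer kernels described above.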
5.2. Comparative Analysis of Feature Variants
5.2.1. MFCC Features
5.2.2. FBANK Features
5.3. Multi-SNR Training and Testing
- Compared to the single-SNR setup, both the baseline model and the baseline features benefit from increased data variability, resulting in better generalization and overall performance.
- With richer data, the regularization effect of dropout (Model (2)) becomes less pronounced. Performance improvements over the baseline are noticeably reduced compared to results obtained under low-data regimes.
- The inclusion of a CNN front-end in Model (5) consistently enhances VAD performance, demonstrating its effectiveness in more challenging, real-world acoustic environments.
- Across all architectures, the 13-dimensional MFCC features outperform the baseline features in every metric, reaffirming the importance of robust feature design.
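Multi-SNR data of the kind discussed above is commonly produced by scaling a noise segment so that the speech-to-noise power ratio hits a chosen target before adding it to the clean speech. The sketch below shows that scaling under stated assumptions: `mix_at_snr` and `signal_power` are hypothetical helper names, the synthetic sine and Gaussian signals stand in for real recordings, and the SNR grid of 0 and 5 dB is illustrative rather than the study's full configuration.

```python
import math
import random
from typing import List, Sequence, Tuple


def signal_power(x: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> Tuple[List[float], float]:
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech, p_noise = signal_power(speech), signal_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("both signals must have non-zero power")
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


# Synthetic stand-ins for a speech clip and a noise clip (assumed data).
rng = random.Random(0)
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

for snr_db in (0.0, 5.0):  # illustrative SNR grid
    mixture, gain = mix_at_snr(speech, noise, snr_db)
    scaled_noise_power = signal_power([gain * n for n in noise])
    achieved = 10.0 * math.log10(signal_power(speech) / scaled_noise_power)
    print(f"target {snr_db:+.1f} dB, achieved {achieved:+.2f} dB")
```

A single-SNR condition uses one target value for every mixture, whereas a multi-SNR condition draws the target from a set of values so that the model sees a wider range of noise levels during training.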
5.3.1. Robustness Across Random Seeds
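As a worked summary of seed robustness, the snippet below aggregates the per-seed AUROC and F1 values reported in this study's seed table (seeds 1, 11, 21, 31, and 41). The choice of mean, sample standard deviation, and range as summary statistics is an assumption made for illustration, not necessarily the analysis the authors performed.

```python
import statistics

# Per-seed results copied from the seed-robustness table: seed -> (AUROC, F1).
SEED_RESULTS = {
    1: (0.9720, 0.9060),
    11: (0.9687, 0.9045),
    21: (0.9619, 0.8919),
    31: (0.9576, 0.8878),
    41: (0.9668, 0.8969),
}


def summarize(values):
    """Mean, sample standard deviation, and max-min range of a list of scores."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "range": max(values) - min(values),
    }


auroc_summary = summarize([auroc for auroc, _ in SEED_RESULTS.values()])
f1_summary = summarize([f1 for _, f1 in SEED_RESULTS.values()])

for name, summary in (("AUROC", auroc_summary), ("F1", f1_summary)):
    print(f"{name}: mean={summary['mean']:.4f}, "
          f"stdev={summary['stdev']:.4f}, range={summary['range']:.4f}")
```

The mean AUROC over the five seeds is 0.9654 and the mean F1 is about 0.8974, with a spread between the best and worst seed of under 0.02 for both metrics.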
5.3.2. Feature-Sensitivity Analysis with Split MFCCs
5.3.3. Inference Speed Analysis
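The real-time factor (RTF) figures in the timing table follow from dividing processing time by audio duration, with values below 1 meaning faster than real time. The sketch below recomputes the per-duration RTFs and the GPU-to-CPU time ratio from the reported timings; the helper name `real_time_factor` is an assumption, and the measurements themselves are copied from the table rather than re-measured.

```python
from typing import List, Tuple

# (audio duration in s, CPU time in s, GPU time in s), copied from the timing table.
TIMINGS: List[Tuple[float, float, float]] = [
    (1, 0.320170, 0.054194),
    (5, 0.115336, 0.060100),
    (10, 0.131569, 0.115162),
    (20, 0.292913, 0.211770),
    (40, 0.671083, 0.448069),
]


def real_time_factor(processing_s: float, duration_s: float) -> float:
    """RTF = processing time / audio duration."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return processing_s / duration_s


rows = []
for duration, cpu_s, gpu_s in TIMINGS:
    cpu_rtf = real_time_factor(cpu_s, duration)
    gpu_rtf = real_time_factor(gpu_s, duration)
    gpu_over_cpu = gpu_s / cpu_s  # below 1 means the GPU run was faster
    rows.append((duration, cpu_rtf, gpu_rtf, gpu_over_cpu))
    print(f"{duration:>3} s: CPU RTF={cpu_rtf:.4f}, GPU RTF={gpu_rtf:.4f}, "
          f"GPU/CPU={gpu_over_cpu:.2f}x")
```

The reciprocal of the RTF gives the real-time speedup, so an RTF of 0.0149 corresponds to processing roughly 67 seconds of audio per second of computation.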
5.4. Benchmark Comparisons with Silero and ITU-T G.729
5.4.1. Performance on the Original Test Set (Google Speech Command with Noise)
5.4.2. Performance on TIMIT Test Set with White Noise
5.5. Visual Comparison of VAD Results
- As shown in panels (a) and (b), the baseline model and features perform reasonably well at 5 dB SNR, but VAD accuracy degrades considerably in severe noise ( dB SNR). The predicted probabilities deviate from the ground-truth labels, resulting in both missed speech activity and increased false alarms under high noise.
- In panels (c) and (d), Model (5) with 13-dimensional MFCCs demonstrates significantly enhanced robustness. Even at dB SNR, its probability curves closely track the ground truth, and speech/non-speech boundaries are much more accurate.
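How closely a probability curve tracks the ground truth can also be checked numerically by thresholding the frame probabilities, grouping consecutive speech frames into segments, and counting frame-level misses and false alarms. The following is a minimal sketch under assumptions: the helper names, the 0.5 threshold, and the toy probability sequence are illustrative and are not taken from the study's figures.

```python
from typing import List, Sequence, Tuple


def frames_to_segments(decisions: Sequence[int]) -> List[Tuple[int, int]]:
    """Group consecutive speech frames into (start, end) index pairs, end exclusive."""
    segments = []
    start = None
    for i, d in enumerate(decisions):
        if d and start is None:
            start = i  # a speech segment begins
        elif not d and start is not None:
            segments.append((start, i))  # the segment ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(decisions)))
    return segments


def frame_errors(truth: Sequence[int], decisions: Sequence[int]) -> Tuple[int, int]:
    """Return (missed speech frames, false-alarm frames)."""
    missed = sum(1 for t, d in zip(truth, decisions) if t == 1 and d == 0)
    false_alarms = sum(1 for t, d in zip(truth, decisions) if t == 0 and d == 1)
    return missed, false_alarms


# Toy frame probabilities and ground-truth labels (illustrative values only).
probs = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.6, 0.7, 0.1]
truth = [0, 0, 1, 1, 1, 1, 1, 1, 0]

decisions = [1 if p >= 0.5 else 0 for p in probs]
print("predicted segments:", frames_to_segments(decisions))
print("ground-truth segments:", frames_to_segments(truth))
print("missed, false alarms:", frame_errors(truth, decisions))
```

In this toy case the dip at frame 5 splits one true speech segment into two predicted segments, which is the kind of boundary error that becomes more frequent as the noise level rises.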
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Ramírez, J.; Segura, J.C.; Benítez, C.; De La Torre, A.; Rubio, A. Efficient voice activity detection algorithms using long-term speech information. Speech Commun. 2004, 42, 271–287.
2. Ramirez, J.; Gorriz, J.M.; Segura, J.C. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness. Robust Speech Recognit. Underst. 2007, 6, 1–22.
3. Sohn, J.; Kim, N.S.; Sung, W. A Statistical Model-Based Voice Activity Detection. IEEE Signal Process. Lett. 1999, 6, 1–3.
4. Moattar, M.H.; Homayounpour, M.M. A Simple But Efficient Real-Time Voice Activity Detection Algorithm. EURASIP J. Adv. Signal Process. 2009, 2009, 1–11.
5. Carlin, M.A.; Elhilali, M. A Framework for Speech Activity Detection Using Adaptive Auditory Receptive Fields. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 2422–2433.
6. Sofer, A.; Chazan, S.E. CNN self-attention voice activity detector. arXiv 2022, arXiv:2203.02944.
7. Ong, W.Q.; Tan, A.W.C.; Vengadasalam, V.V.; Tan, C.H.; Ooi, T.H. Real-Time Robust Voice Activity Detection Using the Upper Envelope Weighted Entropy Measure and the Dual-Rate Adaptive Nonlinear Filter. Entropy 2017, 19, 487.
8. Tripathi, K.; Kumar, C.V.; Wasnik, P. Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion. arXiv 2025, arXiv:2506.01365.
9. Ramírez, J.; Segura, J.C.; Benítez, C.; de la Torre, A.; Rubio, A. An Effective Subband Order-Statistics-Based Voice Activity Detector With Noise Reduction for Robust Speech Recognition. IEEE Trans. Speech Audio Process. 2005, 13, 953–964.
10. Benyassine, A.; Shlomot, E.; Su, H.Y.; Massaloux, D.; Lamblin, C.; Petit, J.P. ITU-T Recommendation G.729 Annex B: A Silence Compression Scheme for CS-ACELP. IEEE Commun. Mag. 1997, 35, 64–73.
11. Chuangsuwanich, E.; Glass, J. Robust Voice Activity Detector for Real World Applications Using Harmonicity and Modulation Frequency. In Proceedings of the Interspeech, Florence, Italy, 28–31 August 2011; pp. 2645–2648.
12. Priebe, D.; Ghani, B.; Stowell, D. Efficient Speech Detection in Environmental Audio Using Acoustic Recognition and Knowledge Distillation. Sensors 2024, 24, 46.
13. Qin, Q.; Zhu, Y. Robust Audio–Visual Speaker Localization in Noisy Aircraft Cabins for Inflight Medical Assistance. Sensors 2025, 25, 5827.
14. Tashev, I.; Mirsamadi, S. DNN-Based Causal Voice Activity Detector. In Interspeech. 2016. Available online: https://www.researchgate.net/profile/Ivan-Tashev/publication/315955578_DNN-based_Causal_Voice_Activity_Detector/links/58ed2241a6fdcc61cc106e8e/DNN-based-Causal-Voice-Activity-Detector.pdf (accessed on 4 January 2025).
15. Hughes, T.; Mierle, K. Recurrent Neural Networks for Voice Activity Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 7378–7382.
16. Gimeno, P.; Ribas, D.; Ortega, A.; Miguel, A.; Lleida, E. Unsupervised Adaptation of Deep Speech Activity Detection Models to Unseen Domains. Appl. Sci. 2022, 12, 1832.
17. Wilkinson, N.; Niesler, T. A Hybrid CNN-BiLSTM Voice Activity Detector. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021.
18. Xu, X.; Jouvet, D.; Essid, S.; Richard, G. A Lightweight Framework for Online Voice Activity Detection in the Wild. In Proceedings of the INTERSPEECH 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3615–3619.
19. Ding, S.; Rikhye, R.; Liang, Q.; He, Y.; Wang, Q.; Narayanan, A.; O’Malley, T.; McGraw, I. Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 3744–3748.
20. Sarkar, E.; Prasad, R.; Magimai.-Doss, M. Unsupervised Voice Activity Detection by Modeling Source and System Information Using Zero Frequency Filtering. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 476–480.
21. Ball, J. Voice Activity Detection (VAD) in Noisy Environments. arXiv 2023, arXiv:2312.05815.
22. Zhu, Z.; Zhang, L.; Pei, K.; Chen, S. A Robust and Lightweight Voice Activity Detection Algorithm for Speech Enhancement at Low Signal-to-Noise Ratio. Digit. Signal Process. 2023, 137, 104151.
23. MathWorks. Train Voice Activity Detection in Noise Model Using Deep Learning. 2021. Available online: https://www.mathworks.com/help/audio/ug/train-voice-activity-detection-in-noise-model-using-deep-learning.html (accessed on 14 November 2025).
24. Warden, P. Speech Commands: A Public Dataset for Single-Word Speech Recognition. 2017. Available online: https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz (accessed on 14 November 2025).
25. Sakhnov, K.; Burykh, S.; Zadorozhny, A. Approach for Energy-Based Voice Detector with Adaptive Scaling Factor. IAENG Int. J. Comput. Sci. 2009, 36, 390–396.
26. Tan, Z.H.; Lindberg, B. High-Accuracy, Low-Complexity Voice Activity Detection Based on a Posteriori SNR Weighted Energy. In Proceedings of the International Conference on Spoken Language Processing 2009, Brighton, UK, 6–10 September 2009; pp. 1679–1682.
27. Lee, J.; Choo, Y.; Park, H.G. Voice Activity Detection in Noisy Environments Based on Double-Combined Fourier Transform and Line Fitting. Math. Probl. Eng. 2014, 2014, 146040.
28. Tan, Z.H.; Sarkar, A.K.; Dehak, N. rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method. arXiv 2019, arXiv:1906.03588.
29. Tan, Z.H.; Sarkar, A.K.; Dehak, N.; Perochon, S. A Presentation and Short Discussion of rVAD-fast, a Fast Voice Activity Detector. Image Process. On Line 2022, 12, 1–20.
30. Team, S. Silero VAD: Pre-Trained Enterprise-Grade Voice Activity Detector (VAD), Number Detector and Language Classifier. 2024. Available online: https://github.com/snakers4/silero-vad (accessed on 21 December 2025).
31. Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S.; Dahlgren, N.L. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM; Technical Report NISTIR 4930; U.S. Department of Commerce, National Institute of Standards and Technology: Gaithersburg, MD, USA, 1993.





| Baseline 9 Features | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|
| Model (1) (baseline) | | | | | | 5.113 |
| Model (2) (with Dropout) | | | | | | 5.121 |
| Model (3) () | | | | | | 5.285 |
| Model (4) () | | | | | | 5.492 |
| Model (5) () | | | | | | 5.283 |
| Model (6) () | | | | | | 5.288 |

| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|---|
| Model (1) (baseline) | baseline feature | | | | | | 5.113 |
| Model (1) (baseline) | 13-dim MFCC | | | | | | 5.138 |
| Model (2) (with Dropout) | baseline feature | | | | | | 5.121 |
| Model (2) (with Dropout) | 13-dim MFCC | | | | | | 5.146 |
| Model (3) () | baseline feature | | | | | | 5.285 |
| Model (3) () | 13-dim MFCC | | | | | | 5.288 |
| Model (5) () | baseline feature | | | | | | 5.283 |
| Model (5) () | 13-dim MFCC | | | | | | 5.285 |
| Model (5+) () | 13-dim MFCC | | | | | | 5.490 |

| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|---|
| Model (1) (baseline) | 13-dim MFCC | | | | | | 5.138 |
| Model (1) (baseline) | 39-dim MFCC | | | | | | 5.301 |
| Model (2) (with Dropout) | 13-dim MFCC | | | | | | 5.146 |
| Model (2) (with Dropout) | 39-dim MFCC | | | | | | 5.308 |
| Model (5) () | 13-dim MFCC | | | | | | 5.285 |
| Model (5) () | 39-dim MFCC | | | | | | 5.294 |
| Model (5+) () | 13-dim MFCC | | | | | | 5.490 |
| Model (5+) () | 39-dim MFCC | | | | | | 5.510 |

| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|---|
| Model (1) (baseline) | 13-dim MFCC | | | | | | 5.138 |
| Model (1) (baseline) | 13-dim FBANK | | | | | | 5.138 |
| Model (2) (with Dropout) | 13-dim MFCC | | | | | | 5.146 |
| Model (2) (with Dropout) | 13-dim FBANK | | | | | | 5.146 |
| Model (5) () | 13-dim MFCC | | | | | | 5.285 |
| Model (5) () | 13-dim FBANK | | | | | | 5.285 |
| Model (5+) () | 13-dim MFCC | | | | | | 5.490 |
| Model (5+) () | 13-dim FBANK | | | | | | 5.490 |

| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|---|
| Model (1) (baseline) | 39-dim MFCC | | | | | | 5.301 |
| Model (1) (baseline) | 39-dim FBANK | | | | | | 5.301 |
| Model (2) (with Dropout) | 39-dim MFCC | | | | | | 5.308 |
| Model (2) (with Dropout) | 39-dim FBANK | | | | | | 5.308 |
| Model (5) () | 39-dim MFCC | | | | | | 5.294 |
| Model (5) () | 39-dim FBANK | | | | | | 5.294 |
| Model (5+) () | 39-dim MFCC | | | | | | 5.510 |
| Model (5+) () | 39-dim FBANK | | | | | | 5.510 |

| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|---|---|
| Model (1) (baseline) | baseline feature | | | | | |
| Model (1) (baseline) | 13-dim MFCC | | | | | |
| Model (2) (with Dropout) | baseline feature | | | | | |
| Model (2) (with Dropout) | 13-dim MFCC | | | | | |
| Model (5) () | baseline feature | | | | | |
| Model (5) () | 13-dim MFCC | | | | | |

| Metric | Original Model (5) with BiLSTM(200,200) | Reduced Model (5) with BiLSTM(64,64) | Reduced Model (5) with BiLSTM(50,50) |
|---|---|---|---|
| AUROC | 96.91 | 95.62 | 96.18 |
| Accuracy | 91.12 | 89.39 | 89.65 |
| Recall | 93.12 | 90.75 | 92.96 |
| Precision | 87.38 | 85.78 | 84.81 |
| F1 score | 90.16 | 88.20 | 88.69 |
| Model size | 5.294 MB | 641.58 kB | 426.12 kB |

| Seed | AUROC | F1 |
|---|---|---|
| 1 | 0.9720 | 0.9060 |
| 11 | 0.9687 | 0.9045 |
| 21 | 0.9619 | 0.8919 |
| 31 | 0.9576 | 0.8878 |
| 41 | 0.9668 | 0.8969 |

| MFCC Configuration | AUROC | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| 13D MFCC (1–13) | 0.9067 | 0.8783 | 0.8959 | ||
| Low-order MFCC 1–6 | 0.9679 | 0.9350 | |||
| High-order MFCC 7–13 |

| Duration (s) | CPU Time (s) | CPU RTF | GPU Time (s) | GPU RTF | GPU/CPU |
|---|---|---|---|---|---|
| 1 | 0.320170 | 0.3202 | 0.054194 | 0.0542 | 0.17× |
| 5 | 0.115336 | 0.0231 | 0.060100 | 0.0120 | 0.52× |
| 10 | 0.131569 | 0.0132 | 0.115162 | 0.0115 | 0.88× |
| 20 | 0.292913 | 0.0146 | 0.211770 | 0.0106 | 0.72× |
| 40 | 0.671083 | 0.0168 | 0.448069 | 0.0112 | 0.67× |

| Key Performance Metrics | Value |
|---|---|
| CPU Average Inference Time | 0.189 s |
| CPU Average RTF | 0.0167 |
| CPU Real-Time Speedup | 59.7× |
| GPU Average Inference Time | 0.168 s |
| GPU Average RTF | 0.0149 |
| GPU Real-Time Speedup | 67.1× |

| Metrics | Model (5) | Silero VAD | ITU-T G.729 |
|---|---|---|---|
| AUROC | 96.91% | 67.51% | – |
| Accuracy | 91.12% | 63.07% | 67.69% |
| Recall | 93.12% | 63.03% | 31.38% |
| Precision | 87.38% | 55.12% | 88.80% |
| F1 score | 90.16% | 58.81% | 45.56% |

| AUROC | Original Model (5) BiLSTM(200,200) | Reduced Model (5) BiLSTM(64,64) | Reduced Model (5) BiLSTM(50,50) | Silero VAD |
|---|---|---|---|---|
| dB | 88.92% | 89.45% | 88.80% | 61.54% |
| dB | 88.99% | 89.17% | 89.03% | 89.28% |
| 0 dB | 88.92% | 89.19% | 88.91% | 94.77% |
| 5 dB | 88.97% | 88.79% | 88.89% | 96.77% |
| Size | 5.294 MB | 641.58 kB | 426.12 kB | 2.33 MB |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Su, B.-Y.; Chen, B.; Huang, S.-C.; Hung, J.-W. A Comparative Experimental Study on Simple Features and Lightweight Models for Voice Activity Detection in Noisy Environments. Electronics 2026, 15, 263. https://doi.org/10.3390/electronics15020263

