Article

A Comparative Experimental Study on Simple Features and Lightweight Models for Voice Activity Detection in Noisy Environments

1 Department of Electrical Engineering, National Chi Nan University, Nantou County 545301, Taiwan
2 Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei City 106308, Taiwan
3 Realtek Semiconductor Corp., Hsinchu County 30076, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 263; https://doi.org/10.3390/electronics15020263
Submission received: 15 November 2025 / Revised: 30 December 2025 / Accepted: 4 January 2026 / Published: 7 January 2026

Abstract

This work presents a comparative study of voice activity detection in noise using simple acoustic features and relatively compact recurrent models within a controlled MATLAB-based framework. For each utterance, nine baseline spectral-plus-periodicity features, MFCCs, and FBANKs are extracted and passed to several lightweight BiLSTM-based networks, either alone or preceded by a 1D CNN layer. The main experiments are carried out at a fixed SNR to separate the influence of the network structure and the feature type, and an additional series with four SNR levels is used to assess whether the same performance trends hold when the SNR varies. The results show that adding a compact CNN front-end before the BiLSTM consistently improves detection scores, that MFCCs generally outperform the baseline spectral–periodicity features and often give better recall/F1 than FBANKs for the considered lightweight models, and that CNN(3, 32)+BiLSTM with 13-dimensional MFCCs offers a favorable trade-off between accuracy, robustness across SNRs, and model size. Because all conditions share a single MATLAB implementation with fixed noise types, SNR values, and evaluation metrics, this work is positioned as a benchmark and practical guideline publication for noise-robust, resource-constrained VAD, rather than as a proposal of a completely new deep-learning architecture.

1. Introduction

Voice activity detection (VAD) is a fundamental stage in speech and audio processing, separating speech segments from non-speech periods. For various applications, such as telecommunication coding, automatic speech recognition (ASR), speech enhancement, and noise-robust voice interfaces, accurate detection is crucial. In real-world use, a reliable VAD enables efficient resource allocation, provides voice services with low latency, and delivers a great user experience. Most traditional VAD algorithms are based on energy and employ simple rules, such as short-term energy, zero-crossing rates, and spectral entropy. These methods work well in clean or controlled settings but degrade in the presence of strong or non-stationary background noise and competing speakers [1,2,3,4]. Over the last two decades, statistical models have gained popularity as they help address these problems. Hidden Markov models (HMMs), Gaussian mixture models (GMMs), and support vector machines (SVMs) enhance discrimination through supervised learning from labeled data. They usually integrate context modeling or prior probability adaptation to achieve superior outcomes [2,3,5,6]. Furthermore, these VAD systems often incorporate numerous handcrafted features, including fundamental frequency, periodicity, and higher-order statistics, to further enhance noise robustness [7].
The field has made significant progress thanks to the development of algorithms for extracting and combining multiple features. For example, feature-fusion frameworks use Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) features, and modulation spectrum features to improve detection accuracy in low-SNR situations [4,8]. Feature voting and ensemble approaches exploit the complementary strengths of different discriminative features, while subband order statistics filtering supports SNR estimation even when the noise is strong or changes quickly [4,5,9]. These methods have shaped industry standards, such as ITU-T G.729, which utilizes adaptive algorithms to adjust VAD sensitivity in real time as the sound environment changes [10,11]. Recent work published in MDPI journals has investigated envelope-based VAD and efficient speech detectors for ecological monitoring and audio-visual applications, emphasizing the need to balance robustness with computational efficiency [7,12,13].
Deep learning has substantially transformed the VAD landscape. Neural architectures, such as feedforward, convolutional (CNN), and recurrent neural networks (RNNs), can model intricate, high-dimensional patterns, allowing systems to differentiate speech from speech-like noise and reverberant backgrounds [14,15]. Raw waveform and spectro-temporal representations have also become popular inputs, reducing the need for complicated hand-crafted pre-processing. Hybrid models, such as DNN-HMMs, and adaptation-oriented solutions make systems more resilient to adverse acoustic conditions and speaker variability [8,16]. More recent work has built on these ideas with convolutional-recurrent architectures such as CNN-BiLSTM and CRDNN-style models [17,18], on-device and personalized VAD systems that use attention and self-supervised pre-training to improve performance [19], and unsupervised methods that model source and system information for reliable detection under mismatched conditions [20]. Other research has examined modern designs for VAD in noisy settings, such as Transformer and self-attention style architectures [21], as well as robust, lightweight algorithms aimed at low-SNR speech enhancement and embedded deployment [22]. Recent studies show that combining long-term signal statistics, temporal context windows, and multi-task or hybrid signal/deep learning strategies works well for real-world VAD in noise environments that change substantially [5,6,8,12,16].
Despite these advances, important gaps remain in systematic comparison and benchmarking. The comparative effectiveness of various feature types (MFCC, Mel-spectrogram, PLP, modulation-domain) and network architectures (e.g., BiLSTM, CNN, Transformer) in the presence of diverse noise sources, non-stationary backgrounds, and extreme SNR conditions remains insufficiently understood. Reproducible research protocols for assessing model generalizability across languages, genders, and spontaneous conversational speech are also still lacking [9,10,11].
This study builds upon the MATLAB deep-learning VAD tutorial example [23], extending it into a larger and more controlled testbed for systematic evaluation of multiple modern and classical VAD models. A standardized experimental protocol is used to support best-practice discussions and to guide future research in robust speech activity detection.
The main contributions and goals of this work are summarized as follows:
  • We extend the MATLAB deep-learning VAD tutorial example [23] into a larger and more controlled testbed, where the noise types and SNR values are fixed, both single-SNR and multi-SNR conditions are prepared, and all models share a single MATLAB implementation, so that different lightweight architectures and feature sets can be compared under exactly the same settings.
  • We design and evaluate several small BiLSTM and CNN–BiLSTM networks (around 5M parameters and smaller) and examine how CNN kernel length, number of channels, and dropout placement affect VAD performance, showing that a relatively compact CNN front-end already provides most of the benefit and that further increasing CNN size yields only marginal or no additional gains for the considered tasks.
  • We compare three kinds of input features on the same backbones, namely the 9-dimensional spectral-plus-periodicity set, 13- and 39-dimensional MFCCs, and FBANKs, and observe that MFCC-based systems consistently outperform the baseline spectral–periodicity features and are generally preferable to FBANKs in terms of recall and F1-score for these lightweight models.
  • By combining the above empirical findings, including seed-robustness, feature-sensitivity, compression, and runtime analyses, we provide concrete guidelines on how to choose model size, CNN front-end configuration, and feature representation for noise-robust VAD on devices with limited memory and computation (e.g., embedded or edge platforms).
  • Overall, the study is intended as a benchmark guideline publication: the proposed MATLAB-based setup offers a simple, transparent, and reproducible reference benchmark for future work on lightweight VAD, rather than introducing a fundamentally new deep-learning architecture.

2. The Backbone VAD Network

The MATLAB platform provides a comprehensive and well-documented tutorial framework [23] for creating deep-learning-based VAD systems, which serves as a convenient experimental baseline for systematically comparing new developments. This reference system extracts a set of acoustic features from the audio input that are meant to capture both the spectral characteristics and the temporal dynamics that are important for speech activity.
The baseline model uses eight spectral-domain features: spectral centroid, the “center of mass” of the spectrum; crest factor, the ratio of the spectral peak to the average spectral level; spectral entropy, a measure of how complex the spectrum is; spectral flux, which tracks how the spectrum changes from frame to frame; spectral kurtosis and skewness for higher-order statistical characterization; rolloff point as a marker of the frequency distribution; and spectral slope for overall spectral tilt. Along with these spectral descriptors, a periodicity measure from harmonic ratio analysis enhances sensitivity to the periodic speech segments that commonly distinguish speech from background noise.
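As a rough illustration of three of these descriptors, the sketch below computes the spectral centroid, spectral entropy, and spectral crest factor of a single frame from its magnitude spectrum. The function names, the naive DFT, and the exact normalizations are assumptions made for this example; the MATLAB feature extractors used in [23] may follow different conventions.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """One-sided magnitude spectrum via a naive O(N^2) DFT (adequate for short frames)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_descriptors(frame, sample_rate):
    """Return centroid (Hz), entropy (nats), and crest factor of one frame."""
    mag = magnitude_spectrum(frame)
    total = sum(mag)
    if total == 0:
        # Silent frame: descriptors are undefined, so return zeros by convention.
        return {"centroid_hz": 0.0, "entropy": 0.0, "crest": 0.0}
    n = len(frame)
    freqs = [k * sample_rate / n for k in range(len(mag))]
    # Centroid: magnitude-weighted mean frequency.
    centroid = sum(f * m for f, m in zip(freqs, mag)) / total
    # Entropy: treat the normalized spectrum as a probability distribution.
    probs = [m / total for m in mag if m > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Crest: spectral peak relative to the mean spectral level.
    crest = max(mag) / (total / len(mag))
    return {"centroid_hz": centroid, "entropy": entropy, "crest": crest}
```

A pure tone concentrates its energy in one bin, which gives low entropy and a high crest factor, whereas a flat spectrum gives the opposite pattern.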
The model’s architecture has two stacked bidirectional long short-term memory (BiLSTM) layers, each with 200 hidden units. This approach makes it possible to add long-range context to the audio sequence in both forward and backward directions. This is crucial for discerning the subtle temporal cues that are present in complex sound environments. The outputs of the BiLSTM layers go into a fully connected (dense) layer, and then a sigmoid output unit makes probabilistic decisions on whether each frame is speech or not.
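To give a sense of the size of this backbone, the following back-of-envelope sketch counts learnable parameters for two stacked BiLSTM layers followed by a dense layer. It assumes the common LSTM parameterization with one bias vector per gate and a single output unit; both are assumptions of this example, so the totals are estimates and need not match the parameter counts reported in the paper's tables.

```python
def lstm_params(input_size, hidden_size):
    """Parameters of one unidirectional LSTM: 4 gates, each with
    input weights, recurrent weights, and one bias vector (assumed)."""
    return 4 * hidden_size * (input_size + hidden_size + 1)

def bilstm_params(input_size, hidden_size):
    """A BiLSTM holds a forward and a backward LSTM of equal size."""
    return 2 * lstm_params(input_size, hidden_size)

def baseline_params(num_features, hidden_size=200, num_outputs=1):
    """Two stacked BiLSTM layers plus a fully connected output layer."""
    first = bilstm_params(num_features, hidden_size)
    # The second layer sees the concatenated forward/backward outputs.
    second = bilstm_params(2 * hidden_size, hidden_size)
    dense = 2 * hidden_size * num_outputs + num_outputs
    return first + second + dense

for dim in (9, 13, 39):
    print(dim, baseline_params(dim))
```

Under these assumptions the second BiLSTM layer dominates the total, so changing the input dimensionality from 9 to 13 or 39 features alters the count only slightly.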
The MATLAB reference implementation also has thorough steps for preprocessing data, like normalization and augmentation, as well as step-by-step instructions for computing features, training the network, and evaluating metrics. This method makes sure that researchers can correctly reproduce baseline results and confidently compare the changes they suggest.
Our research aims to enhance the established pipeline [23] by advancing the baseline in two principal areas: architectural innovation and feature improvement. We introduce convolutional neural network (CNN)-based front-ends to enhance initial feature extraction and explore alternative feature sets, such as MFCC and Mel-spectrogram (FBANK) representations, to capture speech-relevant information more effectively. We evaluate the effects of these enhancements on VAD accuracy and robustness through a series of tests in various noise environments and at several signal-to-noise ratios, providing additional insights for the development of noise-robust speech detection systems.

3. Presented Schemes

3.1. Baseline Model and Its Extension with Dropout

The baseline VAD model introduced in the previous section is restated here for clarity and to facilitate comparisons. This architecture, referred to as Model (1) and shown in Figure 1a, consists of a sequence input layer, two stacked BiLSTM layers (each with 200 hidden units), a fully connected (FC) layer, and a sigmoid output layer. The BiLSTM layers capture both past and future contexts, making it easier to extract temporal features. The FC layer takes the sequence representation and turns it into class logits. The sigmoid function then maps these logits to the estimated probability that each frame contains speech.
We add a dropout layer after each BiLSTM layer to improve the baseline model’s generalization capabilities. This gives us Model (2), which is shown in Figure 1b. The purpose of this change is to reduce overfitting and increase the system’s resistance to noise and data changes. The results of the experiment (see Section 5) will show that adding the dropout layer noticeably improves VAD performance. Consequently, this dropout-enhanced architecture is classified as the extended baseline model for all ensuing experiments and comparisons.

3.2. Architectural Innovation: CNN-Enhanced Model Variants

To improve VAD performance further, we propose including a one-dimensional Convolutional Neural Network (1D CNN) layer at the input stage of the dropout-augmented baseline model (Model (2)). We then systematically examine how changing the kernel size, number of filters, and other design choices for the CNN affects the results. The main reason for adding a CNN block is to use its capacity to find local temporal cues and filter out noise before modeling a recurrent sequence. Figure 2a–d show the exact architectural alternatives that were evaluated.
Models (3) through (6) vary only the kernel size and the number of filters of the convolutional layer; the remaining layers (batch normalization, ReLU activation, BiLSTM, dropout, fully connected layer, and sigmoid) are kept unchanged. Holding the rest of the architecture fixed ensures that performance comparisons are fair and controlled.
In this study, the one-dimensional convolutional layer is termed CNN(x, y), where x is the size of the kernel and y is the number of output channels (filters). This notation clearly shows how the convolutional block is set up, including its receptive field and ability to extract local patterns.
Specifically, Model (3) uses CNN(5, 32) as the first convolutional configuration, with a kernel size of 5 and 32 filters. Model (4) boosts extraction power by using CNN(5, 64), which keeps the kernel size but doubles the number of filters. Model (5) adjusts the convolution to CNN(3, 32), which uses a smaller kernel to focus on even more localized temporal patterns. On the other hand, Model (6) employs CNN(7, 32), which increases the kernel size while keeping the number of filters the same to give a wider temporal context.
By systematically changing x and y in CNN(x, y) across these models, we aim to study how the size of the convolutional kernel and the number of channels affect VAD performance in different acoustic settings.
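The CNN(x, y) notation can be made concrete with a small sketch of a one-dimensional convolution along the time axis. The zero "same" padding, the odd kernel size, and the omission of batch normalization are simplifying assumptions of this example rather than details confirmed by the paper; the function names are likewise hypothetical.

```python
def conv1d_same_relu(frames, kernels, biases):
    """Apply a CNN(x, y) layer followed by ReLU.

    frames:  T x C_in feature sequence (list of lists)
    kernels: y filters, each an x x C_in weight matrix (x assumed odd)
    biases:  y bias values
    Returns a T x y sequence (zero padding keeps the length T).
    """
    num_frames = len(frames)
    num_inputs = len(frames[0])
    kernel_size = len(kernels[0])
    pad = kernel_size // 2
    zero = [0.0] * num_inputs
    padded = [zero] * pad + frames + [zero] * pad
    output = []
    for t in range(num_frames):
        row = []
        for kernel, bias in zip(kernels, biases):
            acc = bias
            for k in range(kernel_size):
                for c in range(num_inputs):
                    acc += kernel[k][c] * padded[t + k][c]
            row.append(max(acc, 0.0))  # ReLU
        output.append(row)
    return output

def cnn_params(kernel_size, num_filters, num_inputs):
    """Weights plus biases of a CNN(kernel_size, num_filters) layer."""
    return kernel_size * num_inputs * num_filters + num_filters
```

For instance, under these assumptions CNN(5, 32) applied to the nine baseline features has 5 × 9 × 32 + 32 = 1472 parameters, which is small relative to a BiLSTM layer with 200 hidden units.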

3.3. Alternative Feature Sets

The baseline VAD model employs eight spectral-domain features—spectral centroid, crest factor, spectral entropy, spectral flux, spectral kurtosis, spectral skewness, rolloff point, and spectral slope—along with a harmonic ratio-based periodicity metric that captures temporal modulations characteristic of voiced speech. Together, these descriptors provide a compact yet informative representation of both spectral structure and temporal regularity in acoustic signals.
To further improve the discriminative power of the input features, we extend our investigation to feature types commonly used in automatic speech recognition (ASR), namely the Mel-frequency cepstral coefficients (MFCCs) and mel-filter-bank (FBANK) features. MFCCs are derived from the short-term power spectrum of the speech signal, transformed onto the Mel scale to approximate the nonlinear frequency perception of the human ear. A discrete cosine transform (DCT) is then applied to decorrelate the Mel-filter-bank energies, producing a set of coefficients that efficiently capture the envelope of the log Mel spectrum. These coefficients largely encode the spectral shape imposed by the vocal tract, making them especially useful for distinguishing speech from noise.
Comparatively, the FBANK features keep the original log filterbank energies before applying the Discrete Cosine Transform (DCT). This approach retains the localized spectral cues and finer harmonic details that might be lost with the DCT smoothing used in MFCCs. Recent studies suggest that using mel-filter-bank representations can be beneficial when paired with deep neural networks, especially convolutional models, as they provide higher-resolution spectral data that is well-suited for identifying spatial patterns.
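The relationship between the two representations can be sketched as follows: FBANK features are the log Mel-filter-bank energies themselves, and MFCCs are obtained by applying a DCT to those energies and keeping the leading coefficients. The unnormalized DCT-II and the function names below are assumptions for illustration; practical toolkits usually add orthonormal scaling and may apply liftering.

```python
import math

def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II of a sequence, truncated to num_coeffs outputs."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]

def fbank_to_mfcc(log_mel_energies, num_coeffs=13):
    """Convert one frame of log Mel-filter-bank energies (FBANK) to MFCCs.

    Keeping only the first num_coeffs coefficients retains the smooth
    spectral envelope and discards fine detail across Mel bands.
    """
    return dct_ii(log_mel_energies, num_coeffs)
```

A frame whose log energies are identical in every Mel band maps to a single non-zero coefficient (the zeroth), which illustrates how the DCT compacts smooth spectral shapes into few values.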
We compare VAD models based on MFCCs and FBANKs with a baseline that uses original spectral features to see whether these representations, which are designed with perceptual considerations and Automatic Speech Recognition (ASR) in mind, can enhance the sensitivity and robustness of speech detection in different acoustic environments.

4. Experimental Setup

To systematically assess the performance of the proposed VAD models under controlled and reproducible conditions, a comprehensive experimental setup was designed. All major procedures—including data preparation, noise addition, feature extraction, and evaluation—are clearly delineated below to ensure the validity and comparability of results. For all experiments, the threshold for classifying speech versus non-speech at the final system output was determined to achieve the equal error rate (EER) on the validation set. This allows for fair and unbiased model comparison regardless of prior label distribution or class imbalance.
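One simple way to select such a threshold is to scan candidate values on the validation scores and keep the one where the false-positive and false-negative rates are closest. The sketch below is an assumed brute-force procedure with a hypothetical function name, not the paper's actual implementation, and it presumes that both classes occur in the validation labels.

```python
def eer_threshold(scores, labels):
    """Return (threshold, fpr, fnr) minimizing |FPR - FNR|.

    scores: per-frame speech probabilities
    labels: 1 for speech, 0 for non-speech
    A frame is classified as speech when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for threshold in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        fpr = false_pos / negatives
        fnr = false_neg / positives
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, threshold, fpr, fnr)
    return best[1], best[2], best[3]
```

The threshold found on the validation set would then be applied unchanged to the test set, so that the reported accuracy, recall, and precision all refer to the same operating point.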

4.1. Noise Types

Extending the baseline settings [23], four stationary noise types from the Google Speech Commands dataset [24] were used for training and validation: exercise bike, running tap, pink noise, and “doing the dishes”. For testing, only white noise was used. The non-stationary “dude miaowing” noise was excluded due to its instability.

4.2. Dataset Partitioning

The dataset sizes were enlarged by a factor of four compared to the baseline settings [23] to enhance experimental stability, resulting in 4000 s for training and 800 s each for validation and testing, all sampled at 16 kHz.

4.3. Audio Preprocessing

Each audio signal, denoted by the discrete-time sequence $x[n]$, was segmented into overlapping frames using a Hamming window of duration 0.05 s (corresponding to 800 samples). To standardize the amplitude across all recordings, each signal was normalized according to
$$x_{\mathrm{norm}}[n] = \frac{x[n]}{\max_{n} |x[n]|},$$
where $x_{\mathrm{norm}}[n]$ represents the amplitude-normalized signal with its peak value scaled to unity.
Speech detection was performed with MATLAB’s detectSpeech function, with speech segments extended by 4000 samples (0.25 s) before and after to preserve context. Random silences (0–2 s) were inserted after each detected speech segment to simulate natural pauses.
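The peak normalization and the 4000-sample extension of detected segments can be sketched as below. Segment boundaries are assumed to be half-open sample index pairs, and merging of segments that overlap after extension is an assumption of this example; the paper does not state how such overlaps are handled.

```python
def peak_normalize(signal):
    """Scale a signal so that its largest absolute sample equals 1."""
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal] if peak > 0 else list(signal)

def extend_segments(segments, num_samples, margin=4000):
    """Extend each (start, end) speech segment by `margin` samples on both
    sides, clip to the signal length, and merge segments that now overlap."""
    extended = sorted((max(0, start - margin), min(num_samples, end + margin))
                      for start, end in segments)
    merged = []
    for start, end in extended:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous segment: grow it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

At the 16 kHz sampling rate used here, the default margin of 4000 samples corresponds to the 0.25 s of context mentioned above.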

4.4. Mixture Generation and SNR Setting

Every 10-s fragment was mixed with a randomly selected noise segment, yielding 400 ten-second mixture segments per training set. The signal-to-noise ratio (SNR) was controlled by adjusting the noise energy according to
$$E_{\mathrm{noise}} = \alpha \cdot E_{\mathrm{speech}},$$
where $\alpha$ corresponds to the target SNR (e.g., for SNR = 10 dB, $\alpha = 3.16$). Both speech and mixture signals were normalized after mixing. The primary experiments employed single-SNR training and testing conditions, with both conducted at 10 dB SNR. In the final stage, the evaluation was extended to multi-SNR training and testing, which included four SNR levels: 10, 5, 0, and −5 dB.
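As a hedged illustration of SNR-controlled mixing, the sketch below scales a noise segment so that the mixture reaches a target SNR under the usual definition SNR = 10 log10(E_speech / E_noise). Under this convention, the value 3.16 quoted for 10 dB equals 10^(10/20), the speech-to-noise amplitude ratio. The function names and the scaling convention are assumptions of this example and may differ from the scaling used in the paper's MATLAB code.

```python
import math

def energy(signal):
    """Sum of squared samples."""
    return sum(s * s for s in signal)

def amplitude_ratio(snr_db):
    """Speech-to-noise amplitude ratio implied by an SNR in dB."""
    return 10 ** (snr_db / 20)

def scale_noise_to_snr(speech, noise, snr_db):
    """Return a scaled copy of `noise` such that
    10*log10(energy(speech) / energy(scaled_noise)) equals snr_db."""
    target_noise_energy = energy(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_energy / energy(noise))
    return [gain * n for n in noise]

def mix(speech, noise, snr_db):
    """Mix speech with noise at the target SNR, then peak-normalize."""
    scaled = scale_noise_to_snr(speech, noise, snr_db)
    mixture = [s + n for s, n in zip(speech, scaled)]
    peak = max(abs(m) for m in mixture)
    return [m / peak for m in mixture]
```

The same routine covers the multi-SNR condition by calling it once per target level, including negative values for mixtures in which the noise energy exceeds the speech energy.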

4.5. Feature Extraction and Sequence Buffering

A Hann window (256 samples, ∼16 ms) with 50% overlap (128-sample hop, 8 ms) was used to extract features. Each feature was normalized to zero mean and unit variance. Sample-wise VAD labels were aligned to frame-wise features. Features were buffered in 8-s sequences (1000 frames/sequence, 75% overlap), greatly increasing the number of training samples and maintaining temporal structure for BiLSTM training.
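The label alignment and sequence buffering described above can be sketched as follows. The majority-vote rule for converting sample-wise labels to frame-wise labels (with ties counted as speech) is an assumption of this example, as are the function names; the window, hop, sequence length, and overlap defaults follow the values stated in the text.

```python
def frame_labels(sample_labels, frame_length=256, hop=128):
    """Convert sample-wise 0/1 labels to one label per analysis frame
    by majority vote within each frame (ties count as speech)."""
    labels = []
    for start in range(0, len(sample_labels) - frame_length + 1, hop):
        window = sample_labels[start:start + frame_length]
        labels.append(1 if 2 * sum(window) >= frame_length else 0)
    return labels

def buffer_sequences(frames, sequence_length=1000, overlap=0.75):
    """Split a frame sequence into fixed-length, overlapping sequences.
    With 1000 frames and 75% overlap, consecutive sequences start
    250 frames (2 s at an 8 ms hop) apart."""
    hop = max(1, round(sequence_length * (1 - overlap)))
    return [frames[start:start + sequence_length]
            for start in range(0, len(frames) - sequence_length + 1, hop)]
```

Because each new sequence advances by only a quarter of its length, a given recording yields roughly four times as many training sequences as non-overlapping buffering would.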

4.6. Evaluation Metrics for VAD

The assessment of voice activity detection (VAD) systems relies on comprehensive evaluation metrics that quantify detection performance across diverse operating conditions. In this work, we employ several standard metrics to evaluate both classical and deep-learning-based VAD algorithms:
  • Area Under the Receiver Operating Characteristic Curve (AUROC): The AUROC score represents the probability that a randomly chosen positive sample (speech) is ranked higher than a randomly chosen negative sample (non-speech) by the VAD model. It summarizes diagnostic performance across all discrimination thresholds and is particularly valuable when evaluating models on imbalanced datasets.
  • Accuracy: Accuracy denotes the proportion of correctly classified speech and non-speech frames out of the total number of frames. It provides a direct measure of overall system effectiveness but may be less informative in scenarios with significant class imbalance.
    $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
    where $TP$, $TN$, $FP$, and $FN$ denote the number of true positives, true negatives, false positives, and false negatives, respectively.
  • Recall: Recall quantifies the ability of the model to identify actual speech frames, reflecting its robustness against missed detections.
    $$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
  • Precision: Precision is the fraction of detected speech frames that are truly speech, indicating the model’s resistance to false alarms.
    $$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
  • F1-score: The F1-score is the harmonic mean of precision and recall, offering a single measure that balances missed detections against false alarms.
    $$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Together, these metrics provide a holistic view of VAD system performance, allowing for fair comparison across models, datasets, and noise conditions.
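The metrics above can be computed directly from confusion counts and frame scores, as in the following minimal sketch. The function names are hypothetical, and the pairwise AUROC computation is a simple quadratic-time formulation chosen for clarity rather than efficiency.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 from confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "f1": f1_score(precision, recall),
    }

def auroc(scores, labels):
    """Probability that a random speech frame scores higher than a random
    non-speech frame; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

As a consistency check on the formulas, the precision of 59.54% and recall of 89.32% reported for the baseline model in Section 5.1 give `f1_score(0.5954, 0.8932)` ≈ 0.7145, in agreement with the reported F1 score of 71.45%.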

5. Experimental Results and Discussions

We first evaluate all models using training and testing data at a fixed signal-to-noise ratio (SNR) of 10 dB for straightforward comparison. The respective results are shown in Section 5.1 and Section 5.2. Then, in Section 5.3, we extend the analysis to a broader range of SNR conditions, providing unified benchmarking across diverse noise environments.

5.1. Comparative Analysis of Model Variants

Here, we examine the performance of each VAD model variant, based on the evaluation reported in Table 1. By systematically varying CNN kernel size, filter number, and regularization strategies, we have the following observations:
  • Model (1), Baseline BiLSTM:
    The baseline implements two stacked BiLSTM layers trained on the original feature set. Its performance reaches AUROC 76.53%, accuracy 68.83%, recall 89.32%, precision 59.54%, and F1 score 71.45%. While recall is relatively high, the lower precision and F1 score indicate frequent false positives. This model is also sensitive to overfitting on modestly sized datasets.
  • Model (2), BiLSTM with Dropout:
    The introduction of dropout layers (rate 0.3) leads to observable improvements: AUROC increases to 78.92%, accuracy to 71.14%, and both precision and F1 score also increase. Thus, dropout effectively reduces overfitting and improves robustness to data variability.
  • Model (3), CNN-Enhanced (CNN(5, 32)):
    Augmenting the network with a 1D CNN layer (CNN(5, 32)) yields substantially better performance: AUROC 91.20%, accuracy 80.90%, precision 73.36%, and F1 score 80.16%, the best values among the compared models. The expansion to 32 feature channels enables more discriminative short-term feature extraction, supplying richer input to the following layers.
  • Model (4), Increased Filters (CNN(5, 64)):
    Doubling the number of CNN filters to 64 yields slightly reduced performance: AUROC 90.91% and accuracy 80.82%. The marginal drop suggests that simply increasing the filter count does not guarantee improvement and may introduce redundancy or overfitting.
  • Model (5), Reduced Kernel Size (CNN(3, 32)):
    With a smaller kernel, Model (5) achieves AUROC 84.05% and accuracy 75.16%. Recall and precision remain strong, but overall results fall short of Model (3). While smaller kernels can better capture short-term dynamics, excessive reduction in temporal context may limit performance on longer sequences.
  • Model (6), Enlarged Kernel Size (CNN(7, 32)):
    Increasing the kernel size to 7 raises recall to 92.04% (the highest in the group), but AUROC and F1 score decline to 83.39% and 71.70%, respectively. This reveals a trade-off: a larger kernel detects more speech frames but at lower precision, resulting in more false alarms.
  • Summary of Architectural Trends:
    The results highlight the importance of CNN kernel size and filter count for VAD performance. Moderate kernel sizes and appropriately chosen filter dimensions can help balance recall and precision effectively. The use of dropout remains vital for improving generalization, especially when working with limited data. These findings are consistent across the evaluated metrics and offer practical guidance for model selection.

5.2. Comparative Analysis of Feature Variants

To systematically evaluate the effectiveness of alternative features on various VAD models, we primarily adopt MFCC features as replacements for the original baseline representations. After identifying the better-performing candidate models using MFCCs, we further evaluate these models with filter bank (FBANK) features to directly compare the performance differences between FBANK and MFCC.

5.2.1. MFCC Features

The results in Table 2 demonstrate the consistent benefit of replacing baseline features with 13-dimensional MFCCs across all evaluated VAD model architectures. Models (4) and (6), which use larger CNN kernel sizes or more filters, are omitted here for baseline features due to the poor performance observed in prior experiments (see Table 1).
Across each architecture, the adoption of 13-dim MFCCs results in clear improvements in AUROC, accuracy, recall, precision, and F1-score. For example, the baseline BiLSTM (Model (1)) exhibits substantial gains in accuracy (from 68.83% to 81.13%) and F1-score (from 71.45% to 80.97%). Adding dropout (Model (2)) further increases accuracy and robustness against overfitting. Additional performance gains are seen with the introduction of CNN architectures (Model (3)), likely due to more effective temporal feature extraction.
Notably, Model (5), which uses a compact CNN kernel (size 3, 32 channels), achieves the best balance of accuracy and F1-score among all configurations considered. The newly added Model (5+), with an increased number of CNN channels (from 32 to 64), offers only marginal further gains, indicating that expanding channel width quickly reaches diminishing returns.
In essence, switching to 13-dim MFCC features leads to superior VAD performance for all tested architectures. Fine-tuning model structure—especially kernel size and channel count—remains important for optimal results, but excessive widening of networks provides minimal additional benefit.
The results in Table 3 provide a direct comparison between using 13-dimensional static MFCC features and 39-dimensional extended MFCCs (which include delta and delta–delta) across several VAD models that all share a BiLSTM backbone, optionally preceded by a CNN front-end.
For both the baseline BiLSTM (Model (1)) and BiLSTM with dropout (Model (2)), expanding the feature set to 39 dimensions consistently leads to moderate improvements in AUROC, accuracy, precision, and F1-score. The same trend is observed for Model (5+), where the combination of a CNN front-end, BiLSTM, and 39-dimensional MFCCs yields the highest evaluation metrics.
A subtle but important exception is found for Model (5), which equips the BiLSTM with a compact CNN front-end (CNN(3, 32)): here the 13-dimensional MFCCs actually outperform the higher-dimensional version in recall and remain competitive across the other metrics. This suggests that, in compact CNN+BiLSTM configurations, higher-dimensional features may introduce redundancy or overfitting, so lower-dimensional MFCCs may suffice for good generalization.
Parameter counts remain similar regardless of MFCC dimensionality, indicating that these performance trends are primarily driven by feature–model interaction rather than increased model capacity.
In summary, while 39-dimensional MFCCs generally bring incremental benefits to BiLSTM and CNN+BiLSTM architectures, optimal performance is architecture-dependent. In particular, compact CNN+BiLSTM models may favor simpler, lower-dimensional features, emphasizing the importance of jointly tailoring feature design and network structure to the target VAD scenario.
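For reference, the 39-dimensional representation discussed in this subsection appends delta and delta–delta coefficients to the 13 static MFCCs. A common regression-based formulation is sketched below; the window of ±2 frames, the replication of edge frames, and the function names are assumptions of this example, since the paper does not state these details.

```python
def delta(frames, window=2):
    """Regression-based delta features over +/- `window` frames.
    Frames beyond the sequence edges are replaced by the edge frame."""
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    result = []
    for t in range(num_frames):
        row = []
        for c in range(len(frames[0])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][c]
                behind = frames[max(t - n, 0)][c]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        result.append(row)
    return result

def add_deltas(frames, window=2):
    """Concatenate static, delta, and delta-delta features per frame
    (13 static MFCCs become 39-dimensional vectors)."""
    d1 = delta(frames, window)
    d2 = delta(d1, window)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

Since the deltas are deterministic functions of neighboring static frames, a recurrent or convolutional model can in principle learn similar temporal differences itself, which is one possible reading of why the compact CNN+BiLSTM model gains little from the 39-dimensional input.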

5.2.2. FBANK Features

In this section, we further evaluate the VAD models using FBANK features and directly compare their performance to the results previously obtained with MFCCs. Table 4 and Table 5 present the detailed performance metrics of each model when using the 13-dimensional and 39-dimensional versions of MFCC and FBANK features, respectively. These tables highlight the relative effectiveness of different feature types across several model architectures.
A closer inspection of Table 4 and Table 5 reveals that the relative performance of FBANK and MFCC features depends on both the underlying network architecture and the specific evaluation metric.
For the baseline BiLSTM (Model (1)), MFCC features retain a clear lead in recall and F1-score across both 13- and 39-dimensional settings, which is important for reliably detecting speech activity. However, 13-dim FBANK achieves the highest accuracy and precision in this model, suggesting that it can reduce false alarms in simple sequence networks.
Introducing dropout (Model (2)) shifts the pattern: while MFCC maintains its recall advantage, FBANK features (both dimensionalities) provide higher AUROC, accuracy, and precision. This indicates that FBANK’s richer spectral representation is beneficial once regularization is added, improving robustness without severely impacting sensitivity.
For models with a CNN front-end (Models (5) and (5+)), the trade-off becomes more subtle. FBANK features display strong performance in accuracy, precision, and F1-score, especially as feature dimensionality increases (e.g., 39-dim FBANK in Model (5) achieves competitive F1 and outstanding accuracy). However, MFCCs generally deliver higher recall, and their F1-scores remain competitive, if not superior, in compact models. This suggests that MFCCs may allow models to better capture weak speech segments even as complexity increases.
When the network width is further enhanced (Model (5+)), both FBANK and MFCC features yield strong metrics, but MFCC retains a slight edge in recall and overall F1-score with 39-dim input. FBANK features, meanwhile, appear to enable the highest AUROC and accuracy, particularly for deeper architectures.
Across all settings, the differences in parameter count are negligible and do not explain the performance trends. Instead, the observed effects stem from the interaction between feature representation and model structure.
In summary, MFCC features provide more consistent and generally stronger results in recall and F1-score—key indicators of VAD effectiveness—across most models and settings. FBANK offers gains in accuracy and false-alarm reduction and can be further leveraged by regularized or deep CNN+BiLSTM architectures, especially with higher-dimensional representations. Thus, the choice between MFCC and FBANK should be informed by the application’s priorities: sensitivity for speech detection or specificity for minimizing false positives.

5.3. Multi-SNR Training and Testing

To keep the comparison of models and features simple, the preceding experiments used a single-SNR setup in which both the training and test utterances had SNR = 10 dB. To assess whether those results generalize to more complex environments, we now extend the training and testing datasets to include four SNR levels: −10, −5, 0, and 5 dB. This results in a multi-SNR training and testing scenario. As in the single-SNR setting, the noise types used for training differ from those used for testing.
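The multi-SNR data can be pictured as the same clean signal mixed with noise that is rescaled for each target SNR. The following is a minimal sketch of that rescaling in Python, not the MATLAB pipeline used in this study; the function names (`power`, `mix_at_snr`, `snr_db_of`) and the toy tone-plus-Gaussian signals are assumptions introduced only for illustration.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, scaled_noise) so that power(speech)/power(scaled_noise) hits snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_noise = power(noise)
    if p_noise == 0.0:
        raise ValueError("noise segment is silent")
    # Solve P_speech / (g^2 * P_noise) = 10^(snr_db / 10) for the noise gain g.
    gain = math.sqrt(power(speech) / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


def snr_db_of(speech, scaled_noise):
    """SNR in dB between a clean signal and the noise that was added to it."""
    return 10.0 * math.log10(power(speech) / power(scaled_noise))


# Toy signals (hypothetical): a 200 Hz tone as "speech", Gaussian samples as "noise".
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 200.0 * t / 16000.0) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
for target in (-10, -5, 0, 5):
    _, scaled = mix_at_snr(speech, noise, target)
    print(target, round(snr_db_of(speech, scaled), 3))
```

Computing the power over the whole utterance is one possible convention; a pipeline that measures power only over speech-active frames would give a different gain for the same nominal SNR.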
As previously observed, Model (2) improves performance by introducing a dropout layer to mitigate overfitting, while Model (5) incorporates an additional CNN(3,32) front-end to preprocess the speech features. Both modifications yield consistent improvements in various VAD evaluation metrics without a notable increase in model complexity. Furthermore, the 13-dimensional MFCC features significantly outperform the original baseline features. Therefore, in this multi-SNR training and testing scenario, we focus on comparing the VAD performance of the original baseline features and 13-dimensional MFCC features for Model (1) (baseline), Model (2), and Model (5).
Table 6 summarizes the performance of these model architectures and feature sets under the multi-SNR training and testing scenario. From this table, we observe the following:
  • Compared to the single-SNR setup, both the baseline model and the baseline features benefit from increased data variability, resulting in better generalization and overall performance.
  • With richer data, the regularization effect of dropout (Model (2)) becomes less pronounced. Its improvement over the baseline is noticeably smaller than in the single-SNR setup, where less training variability was available.
  • The inclusion of a CNN front-end in Model (5) consistently enhances VAD performance, demonstrating its effectiveness in the more varied and more challenging multi-SNR condition.
  • Across all architectures, the 13-dimensional MFCC features outperform the baseline features in every metric, reaffirming the importance of robust feature design.
Overall, these results indicate that both advanced model architectures and careful feature selection contribute substantially to VAD accuracy under multi-SNR scenarios.
In particular, when using Model (5) with 13-dimensional MFCC features, the performance on test data across different SNR conditions is illustrated in Figure 3 and Figure 4.
Figure 3 shows that the proposed VAD maintains strong discriminative power across all tested SNRs, with TPRs approaching 1.0 for FPRs below 0.1 even at −10 dB. The corresponding AUC values, ranging from about 0.96 at −10 dB to nearly 0.98 at 0 dB, indicate only a modest degradation as noise increases and confirm that the classifier remains highly robust in severely noisy conditions.
Figure 4 further summarizes the impact of SNR on accuracy, recall, precision, and F1 score, highlighting a generally consistent performance trend over the −10 dB to 5 dB range. Accuracy and F1 score remain around 0.89–0.91 across SNRs, while recall peaks near 0 dB and precision improves noticeably at positive SNR. This pattern suggests that the model accepts a small loss in precision to retain higher recall at very low SNRs but overall maintains a stable and well-balanced operating point across the considered noise levels.
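For readers less familiar with the AUC values quoted for Figure 3, the area under the ROC curve equals the probability that a randomly chosen speech frame receives a higher score than a randomly chosen non-speech frame. A minimal pure-Python sketch of that pairwise definition is given below; the function name `auroc` and the toy scores are illustrative assumptions, and the quadratic loop is meant for clarity rather than for large test sets.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of speech vs. non-speech scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one speech and one non-speech frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # speech frame correctly ranked above non-speech frame
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy frame-level posteriors (hypothetical) with binary speech labels.
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
print(auroc(toy_scores, toy_labels))  # 3 of the 4 speech/non-speech pairs are ordered correctly
```

Because this quantity depends only on the ranking of scores, it is unaffected by the decision threshold, which is why it complements the threshold-dependent accuracy, precision, recall, and F1 values in Figure 4.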
In the original Model (5), the BiLSTM(200,200) configuration follows the default setting provided in [23], and here we deliberately reduce the hidden units to (64,64) and (50,50) to examine how much performance is affected by compressing the recurrent layers. The results shown in Table 7 indicate that the original BiLSTM(200,200) variant still achieves the strongest overall performance (AUROC 96.91%, accuracy 91.12%, F1 90.16%), but the reduced models remain remarkably competitive despite their much smaller sizes.
Specifically, shrinking the BiLSTM to (64,64) cuts the model size from 5.294 MB to 641.58 kB, while AUROC and F1 drop by only about 1–2 percentage points. Further reducing to BiLSTM(50,50) yields an even smaller model of 426.12 kB with AUROC 96.18% and F1 88.69%, values very close to those of the original configuration. These findings suggest that the MATLAB tutorial’s large BiLSTM setting contains substantial redundancy and that Model (5) can be aggressively compressed while largely preserving detection performance, which is particularly attractive for VAD deployment on memory-constrained or edge devices.
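The large size reduction follows from how recurrent weights scale with the hidden size. The sketch below counts weights for a textbook bidirectional LSTM layer (four gates, one bias vector per gate, two directions) and stacks two such layers. It is an assumption-laden illustration: the input width of 32 is taken from the CNN(3,32) channel count as a guess at what the first BiLSTM sees, only the recurrent layers are counted, and the totals are not expected to reproduce the sizes in Table 7.

```python
def bilstm_params(input_dim, hidden):
    """Weights in one bidirectional LSTM layer (4 gates, single bias per gate)."""
    per_direction = 4 * (hidden * (input_dim + hidden) + hidden)
    return 2 * per_direction


def stacked_bilstm_params(input_dim, hidden_sizes):
    """Total weights for stacked BiLSTM layers; each layer outputs 2 * hidden features."""
    total = 0
    dim = input_dim
    for hidden in hidden_sizes:
        total += bilstm_params(dim, hidden)
        dim = 2 * hidden  # forward and backward outputs are concatenated
    return total


# Hypothetical input width of 32 features per frame (assumed, see lead-in).
for sizes in ((200, 200), (64, 64), (50, 50)):
    n = stacked_bilstm_params(32, sizes)
    print(sizes, n, "weights, about", round(n * 4 / 1000), "kB as float32")
```

The hidden-to-hidden term grows with the square of the hidden size, so reducing 200 units to 64 or 50 removes most of the recurrent weights, which is in line with the order-of-magnitude reduction reported above.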

5.3.1. Robustness Across Random Seeds

To strengthen the statistical rigor of the evaluation, the entire multi-SNR training and testing pipeline of Model (5) was repeated with five different random seeds, affecting clean utterance concatenation, silence insertion, and noise segment selection. The resulting test AUROC and F1 scores for each seed are summarized in Table 8, providing a direct view of how performance varies under different random realizations of the data generation and training process.
Averaging over the five runs yields a mean AUROC of 0.9654 with a standard deviation of 0.0054, and a mean F1 score of 0.8974 with a standard deviation of 0.0072. Using a t-distribution with 4 degrees of freedom, the corresponding 95% confidence intervals are 0.9654 ± 0.0060 for AUROC and 0.8974 ± 0.0080 for F1, i.e., [0.9594, 0.9714] and [0.8894, 0.9054], respectively. In addition, all individual runs in Table 8 lie within a narrow band of about ±0.01 around the mean for both metrics.
Taken together, these statistics indicate that the proposed model consistently achieves high AUROC and F1 values across different random seeds and that the variability induced by randomness in data preparation and training is relatively small compared with the absolute performance level. In other words, the gains reported in the main result tables are not tied to a single favorable initialization or noise realization but remain stable under repeated experiments. This supports the claim that the observed improvements in Model (5) over the considered baselines are both consistent and statistically meaningful.
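As an illustration of how a t-based interval over five seeds can be computed, the following sketch uses only the Python standard library. The per-seed values are hypothetical placeholders, not the entries of Table 8, and the sketch uses the sample standard deviation (n − 1 denominator); a different standard-deviation convention would change the half-width slightly.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776


def mean_ci95_five_runs(values):
    """Return (mean, sample_std, half_width) of a 95% t-interval for exactly five runs."""
    if len(values) != 5:
        raise ValueError("the hard-coded critical value assumes exactly five runs")
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation, n - 1 denominator
    half_width = T_CRIT_95_DF4 * std / math.sqrt(len(values))
    return mean, std, half_width


# Hypothetical AUROC values for five seeds (placeholders for illustration only).
example_auroc = [0.960, 0.970, 0.965, 0.958, 0.972]
mean, std, half = mean_ci95_five_runs(example_auroc)
print(f"{mean:.4f} +/- {half:.4f} (sample std {std:.4f})")
```

The interval describes the uncertainty of the mean over seeds; the spread of individual runs is better read from the standard deviation or from the per-seed table itself.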

5.3.2. Feature-Sensitivity Analysis with Split MFCCs

To gain further insight into which acoustic features drive the decisions of Model (5) under noisy conditions, we conducted a simple feature-sensitivity analysis on the 13-dimensional MFCCs used as input. Specifically, we compared three configurations: (i) using all 13 MFCCs (baseline), (ii) using only the lower-order coefficients 1–6, which are more closely related to short-time energy and the overall spectral envelope, and (iii) using only the higher-order coefficients 7–13, which carry finer-grained, higher-frequency details. The resulting mean performance and standard deviations over five random seeds are summarized in Table 9.
As shown in Table 9, using all 13 coefficients yields an AUROC of 0.9630 ± 0.0078 and an F1 score of 0.8959 ± 0.0076, which serves as the reference. When restricting the input to the lower-order MFCCs 1–6, the AUROC slightly improves to 0.9679 ± 0.0016, while the F1 score remains essentially unchanged at 0.8955 ± 0.0047, and recall even increases to 0.9350 ± 0.0085. In contrast, using only the higher-order MFCCs 7–13 leads to a modest degradation in both accuracy and F1 (0.8850 ± 0.0032), even though the AUROC (0.9606 ± 0.0028) stays close to the baseline.
These results indicate that Model (5) relies more heavily on the low-order MFCCs, which encode global spectral-shape and energy-related cues, whereas high-order MFCCs alone are not sufficient to sustain the same level of detection performance. They provide an interpretable view of feature importance at the level of MFCC groups: the proposed VAD is most sensitive to coarse spectral-envelope information (MFCC 1–6), and high-frequency detail (MFCC 7–13) plays a complementary but less dominant role, helping to explain the model’s robustness across different SNRs.
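The three input configurations can be sketched as simple column selections on a per-frame MFCC matrix. The snippet below is a hypothetical illustration in plain Python (the name `split_mfcc` and the dummy frames are assumptions); it only shows how coefficients 1–6 and 7–13 would be separated before being passed to a model, and says nothing about how the MFCCs themselves are extracted.

```python
def split_mfcc(frames):
    """Split 13-dim MFCC frames into low-order (1-6) and high-order (7-13) groups."""
    low, high = [], []
    for frame in frames:
        if len(frame) != 13:
            raise ValueError("expected 13 MFCCs per frame")
        low.append(frame[0:6])    # coefficients 1-6 (zero-based indices 0..5)
        high.append(frame[6:13])  # coefficients 7-13 (zero-based indices 6..12)
    return low, high


# Dummy frames (hypothetical): each value equals its one-based coefficient index.
dummy_frames = [[float(k) for k in range(1, 14)] for _ in range(3)]
low, high = split_mfcc(dummy_frames)
print(len(low[0]), len(high[0]))
```

Note that the two subsets have different widths (6 versus 7 coefficients), so the first layer of the network has to be sized accordingly for each configuration.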

5.3.3. Inference Speed Analysis

Table 10 and Table 11 together report the inference speed of the proposed Model (5), i.e., the CNN(3,32) architecture using 13-dimensional MFCC features, evaluated on utterances of different durations as well as summarized by average metrics across all test utterances. The measurements show that the proposed neural VAD runs well within real time on both CPU and GPU for utterances up to 40 s, with CPU and GPU average real-time factors (RTFs) of about 0.0167 and 0.0149, respectively, corresponding to real-time speedups of roughly 59.7× on CPU and 67.1× on GPU. For example, at 10 s the CPU RTF is about 0.0132 and the GPU RTF is roughly 0.0115, meaning that the model processes 1 s of audio in about 13 ms and 12 ms on CPU and GPU, respectively. Across durations from 5 s to 40 s, the RTF values remain relatively stable and close to the global averages, indicating that the implementation scales approximately linearly with signal length and does not incur significant per-utterance overhead.
When compared conceptually with conventional VADs, such as simple energy/threshold-based detectors or rVAD-fast [25,26,27,28,29], the computational cost of the proposed model is clearly higher, since those traditional methods are explicitly designed for low complexity and real-time operation on CPUs or even embedded hardware. Energy-based and adaptive-threshold VADs typically require only a small number of arithmetic operations per frame, while rVAD-fast replaces the more expensive pitch analysis in rVAD with simpler spectral measures to achieve roughly an order-of-magnitude speed-up [25,26,28,29]. Although a direct, quantitative performance comparison with these conventional methods is beyond the scope of this work, the per-duration and averaged RTF values in Table 10 and Table 11 consistently show that the proposed neural VAD still operates comfortably in real time on standard desktop hardware, complementing classical low-complexity VADs as a viable option when higher modeling capacity is desired.
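The relation between the reported RTF values and the speedup figures is simple arithmetic: the RTF is processing time divided by audio duration, and the speedup is its reciprocal. A minimal sketch follows; the function names are illustrative, and the processing time of 0.132 s is back-computed from the quoted 10 s CPU RTF rather than taken from a measurement.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio; below 1 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf):
    """How many seconds of audio are processed per second of compute."""
    return 1.0 / rtf


# A 10 s utterance processed in 0.132 s gives the quoted CPU RTF of about 0.0132,
# i.e., roughly 13 ms of compute per second of audio.
rtf_10s = real_time_factor(0.132, 10.0)
print(round(rtf_10s, 4), round(1000.0 * rtf_10s, 1), "ms per second of audio")

# The average GPU RTF of 0.0149 corresponds to a speedup of about 67.1x.
print(round(speedup(0.0149), 1))
```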

5.4. Benchmark Comparisons with Silero and ITU-T G.729

The primary goal of the VAD method proposed in this work is to provide a systematic improvement and analysis framework over a MATLAB tutorial example, rather than to engineer a new VAD system that aggressively targets state-of-the-art performance on every benchmark. Nevertheless, to better position our approach (the presented Model (5) for 13-dim MFCCs trained in the multi-SNR scenario) within the landscape of both classical and modern VAD techniques, this subsection includes comparisons with a widely used contemporary neural method, Silero VAD [30], as well as the traditional standard ITU-T G.729 VAD [10], which together serve as strong external baselines.
Silero VAD is a family of deep neural network-based voice activity detectors developed by the Silero team and trained on large-scale, heterogeneous speech corpora encompassing multiple languages, speakers, noise types, and recording conditions. According to its official documentation, Silero employs a lightweight neural architecture (e.g., compact convolutional and recurrent layers) that takes spectral features as input and is trained with extensive data augmentation and carefully curated labels to achieve robust speech/non-speech discrimination across a broad range of real-world scenarios. Its training data and setup are entirely different from the datasets used in this work, so Silero is used here as an off-the-shelf, pretrained SOTA system rather than a model retrained on our corpus. In parallel, the ITU-T G.729 VAD represents a widely deployed, standards-based classical detector that relies on hand-crafted features and rule-based decision logic designed for low-bit-rate speech coding; it provides a strong traditional baseline for assessing how the proposed neural VAD compares with a well-established, low-complexity method.

5.4.1. Performance on the Original Test Set (Google Speech Commands with Noise)

On the original test set, which is based on the Google Speech Commands dataset mixed with various noises at the four SNR levels of the multi-SNR scenario (−10, −5, 0, and 5 dB), the results in Table 12 show that Model (5) substantially outperforms both Silero VAD and the ITU-T G.729 VAD in overall discriminative ability and balanced detection quality. Model (5) achieves an AUROC of 96.91% and an F1 score of 90.16%, indicating that it can simultaneously maintain high recall (93.12%) and high precision (87.38%) across these mixed-SNR conditions.
In contrast, Silero VAD attains noticeably lower accuracy (63.07%) and F1 (58.81%), because both its recall (63.03%) and precision (55.12%) are much weaker than those of Model (5), suggesting that Silero both misses many speech segments and produces more false alarms when faced with the same mixture of noises and SNRs. The ITU-T G.729 VAD exhibits competitive precision (88.80%) but very low recall (31.38%) and F1 (45.56%), showing that it is extremely conservative: it triggers speech decisions only rarely and thus fails to detect a large portion of true speech frames under these noisy conditions. Overall, Table 12 indicates that, in this multi-SNR test scenario, Model (5) provides the best trade-off between missed detections and false alarms and is clearly more robust than both the neural Silero VAD and the classical G.729 VAD baseline.
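The F1 scores discussed here are the harmonic mean of precision and recall, which is why a detector with high precision but low recall still receives a low F1. The short sketch below, with illustrative function names and hypothetical frame counts, shows the computation from frame-level counts and checks it against the precision and recall values quoted above for Model (5) and Silero VAD.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Frame-level metrics from counts of true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from_precision_recall(precision, recall)


# Precision/recall pairs quoted in the text, converted from percentages to fractions.
print(round(100 * f1_from_precision_recall(0.8738, 0.9312), 2))  # Model (5): 90.16
print(round(100 * f1_from_precision_recall(0.5512, 0.6303), 2))  # Silero VAD: 58.81

# Hypothetical counts: 90 speech frames detected, 10 false alarms, 30 speech frames missed.
print(precision_recall_f1(90, 10, 30))
```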

5.4.2. Performance on TIMIT Test Set with White Noise

The TIMIT test set [31] is a widely used English read-speech corpus for benchmarking ASR and related tasks. It contains 1344 utterances (8 sentences per speaker) from 168 speakers (112 male and 56 female), covering a variety of American dialect regions, with a total duration of 5186 s (about 1.44 h). To examine how the proposed method behaves on such a public benchmark and to compare it against the state-of-the-art Silero VAD, we evaluate several variants of Model (5) and Silero on the TIMIT test set corrupted by additive white noise at SNRs of −10, −5, 0, and 5 dB, as summarized in Table 13. All TIMIT utterances used here are strictly unseen during training and validation, and the same corrupted signals and frame-wise ground-truth labels are used for every model, so that the Model (5) variants and Silero are evaluated under exactly the same test conditions. All Model (5) variants used here are pre-trained on the original multi-SNR Google Speech Commands training data and are not fine-tuned on TIMIT, so both our models and Silero operate as pre-trained VADs without adaptation to this benchmark.
Table 13 shows that all three versions of Model (5) achieve very stable AUROC values of around 88.8–89.5% across all SNRs, indicating that Model (5) maintains nearly constant detection quality even when the SNR drops to −10 dB. The medium-size configuration BiLSTM(64,64) attains the highest AUROC at −10 dB (89.45%) while reducing the model size from 5.294 MB to 641.58 kB, and the smallest version BiLSTM(50,50) preserves almost the same AUROC with only 426.12 kB, demonstrating that the model can be aggressively compressed with negligible loss on this benchmark. This SNR-insensitive behavior is consistent with the earlier results on the noisy Google Speech Commands test set (Figure 3 and Figure 4), where Model (5) also exhibited very similar performance across SNRs between −10 and 5 dB, without any marked degradation as the SNR changed within this range. A plausible explanation is that all Model (5) variants are trained in a multi-SNR setup with four SNR levels (−10, −5, 0, and 5 dB), so the models learn to handle this SNR range robustly even when evaluated on a different corpus (TIMIT) and a different noise type (white noise) that share only the SNR range but not the underlying speech or noise characteristics.
Silero VAD, on the other hand, exhibits a very different behavior: its AUROC is only 61.54% at −10 dB but increases sharply to 89.28% at −5 dB and further to 94.77% and 96.77% at 0 and 5 dB, respectively. This indicates that Silero is highly effective when the SNR is moderate or high, roughly matching the Model (5) variants at −5 dB and clearly outperforming them at 0 and 5 dB, but is more vulnerable in extremely noisy conditions than the proposed models, which keep their AUROC near 89% even at −10 dB. Overall, the results on the TIMIT benchmark with white-noise corruption show a clear trade-off: Silero is preferable at moderate and high SNRs, whereas the pre-trained Model (5) variants maintain more stable performance when the SNR is pushed down to −10 dB.

5.5. Visual Comparison of VAD Results

In addition to the quantitative comparisons using standard VAD evaluation metrics presented earlier, we further illustrate the practical effects of different feature and model selections using two representative utterances. Figure 5 shows example VAD results for both high-SNR (5 dB) and low-SNR (−10 dB) cases, allowing a direct visual comparison of the detection performance achieved by the baseline model/features and by Model (5) equipped with 13-dimensional MFCC features. We make the following observations:
  • As shown in panels (a) and (b), the baseline model and features perform reasonably well at 5 dB SNR, but VAD accuracy degrades considerably in severe noise (−10 dB SNR). The predicted probabilities deviate from the ground-truth labels, resulting in both missed speech activity and increased false alarms under high noise.
  • In panels (c) and (d), Model (5) with 13-dimensional MFCCs demonstrates significantly enhanced robustness. Even at −10 dB SNR, its probability curves closely track the ground truth, and speech/non-speech boundaries are much more accurate.
Overall, the results demonstrate that employing MFCC features with a lightweight CNN–BiLSTM architecture (Model (5)) clearly improves VAD reliability and noise robustness across a range of acoustic conditions.

6. Conclusions and Future Work

This work extended the MATLAB deep-learning VAD tutorial into a controlled and reproducible testbed and used it to systematically compare lightweight BiLSTM and CNN–BiLSTM architectures with several simple feature sets under both single-SNR and multi-SNR noisy conditions. Within this unified framework, compact CNN(3,32)+BiLSTM models combined with MFCC features consistently outperform the original tutorial baseline and spectral–periodicity features, show favorable robustness across different SNR levels, and can be compressed to substantially smaller BiLSTM configurations with only minor degradation in AUROC, F1-score, and accuracy while maintaining real-time operation on standard CPU/GPU hardware. The analyses of multiple random seeds, feature subsets, and model sizes further support the stability of these trends and highlight the practical impact of CNN front-end design and feature choice for lightweight VAD.
The study is, therefore, positioned as an applied benchmarking and guideline publication rather than a proposal of a new deep-learning architecture: it offers a simple MATLAB-based testbed and empirical recommendations for building noise-robust, resource-constrained VAD systems using standard BiLSTM/CNN components and MFCC/FBANK features. Future work will aim to move this framework closer to real deployment by adding evaluations on real recorded hardware data and public device-oriented corpora, and by incorporating explicit FLOP and memory optimization via pruning, quantization, and knowledge distillation, together with latency and power profiling on low-power CPUs, GPUs, and MCUs. Additional extensions include broadening the benchmarks to more VAD datasets and languages and examining modern attention-based or Transformer/CRDNN components in a controlled manner to see how much extra capacity can be introduced without losing the lightweight property. Beyond these applied directions, a more theoretical analysis of the observed empirical trends—for example, why certain CNN front-end sizes, compression levels, and feature choices (MFCC versus FBANK or spectral–periodicity features) lead to different robustness–accuracy trade-offs—will be an important topic for future work, going beyond the primarily experimental scope of the present study.

Author Contributions

Conceptualization, B.-Y.S. and J.-W.H.; methodology, B.-Y.S. and J.-W.H.; software, B.-Y.S.; validation, B.-Y.S. and J.-W.H.; formal analysis, J.-W.H., B.-Y.S., B.C. and S.-C.H.; investigation, J.-W.H.; resources, J.-W.H., B.C. and S.-C.H.; data curation, J.-W.H. and B.-Y.S.; writing—original draft preparation, J.-W.H.; writing—review and editing, J.-W.H.; visualization, J.-W.H. and B.-Y.S.; supervision, J.-W.H., B.C. and S.-C.H.; project administration, J.-W.H., B.C. and S.-C.H.; funding acquisition, J.-W.H., B.C. and S.-C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Shih-Chieh Huang is employed by Realtek Semiconductor Corp. The remaining authors declare no conflicts of interest.

References

  1. Ramírez, J.; Segura, J.C.; Benítez, C.; De La Torre, A.; Rubio, A. Efficient voice activity detection algorithms using long-term speech information. Speech Commun. 2004, 42, 271–287.
  2. Ramirez, J.; Gorriz, J.M.; Segura, J.C. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness. Robust Speech Recognit. Underst. 2007, 6, 1–22.
  3. Sohn, J.; Kim, N.S.; Sung, W. A Statistical Model-Based Voice Activity Detection. IEEE Signal Process. Lett. 1999, 6, 1–3.
  4. Moattar, M.H.; Homayounpour, M.M. A Simple But Efficient Real-Time Voice Activity Detection Algorithm. Eurasip J. Adv. Signal Process. 2009, 2009, 1–11.
  5. Carlin, M.A.; Elhilali, M. A Framework for Speech Activity Detection Using Adaptive Auditory Receptive Fields. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 2422–2433.
  6. Sofer, A.; Chazan, S.E. CNN self-attention voice activity detector. arXiv 2022, arXiv:2203.02944.
  7. Ong, W.Q.; Tan, A.W.C.; Vengadasalam, V.V.; Tan, C.H.; Ooi, T.H. Real-Time Robust Voice Activity Detection Using the Upper Envelope Weighted Entropy Measure and the Dual-Rate Adaptive Nonlinear Filter. Entropy 2017, 19, 487.
  8. Tripathi, K.; Kumar, C.V.; Wasnik, P. Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion. arXiv 2025, arXiv:2506.01365.
  9. Ramírez, J.; Segura, J.C.; Benítez, C.; de la Torre, A.; Rubio, A. An Effective Subband Order-Statistics-Based Voice Activity Detector With Noise Reduction for Robust Speech Recognition. IEEE Trans. Speech Audio Process. 2005, 13, 953–964.
  10. Benyassine, A.; Shlomot, E.; Su, H.Y.; Massaloux, D.; Lamblin, C.; Petit, J.P. ITU-T Recommendation G.729 Annex B: A Silence Compression Scheme for CS-ACELP. IEEE Commun. Mag. 1997, 35, 64–73.
  11. Chuangsuwanich, E.; Glass, J. Robust Voice Activity Detector for Real World Applications Using Harmonicity and Modulation Frequency. In Proceedings of the Interspeech, Florence, Italy, 28–31 August 2011; pp. 2645–2648.
  12. Priebe, D.; Ghani, B.; Stowell, D. Efficient Speech Detection in Environmental Audio Using Acoustic Recognition and Knowledge Distillation. Sensors 2024, 24, 46.
  13. Qin, Q.; Zhu, Y. Robust Audio–Visual Speaker Localization in Noisy Aircraft Cabins for Inflight Medical Assistance. Sensors 2025, 25, 5827.
  14. Tashev, I.; Mirsamadi, S. DNN-Based Causal Voice Activity Detector. In Interspeech. 2016. Available online: https://www.researchgate.net/profile/Ivan-Tashev/publication/315955578_DNN-based_Causal_Voice_Activity_Detector/links/58ed2241a6fdcc61cc106e8e/DNN-based-Causal-Voice-Activity-Detector.pdf (accessed on 4 January 2025).
  15. Hughes, T.; Mierle, K. Recurrent Neural Networks for Voice Activity Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 7378–7382.
  16. Gimeno, P.; Ribas, D.; Ortega, A.; Miguel, A.; Lleida, E. Unsupervised Adaptation of Deep Speech Activity Detection Models to Unseen Domains. Appl. Sci. 2022, 12, 1832.
  17. Wilkinson, N.; Niesler, T. A Hybrid CNN-BiLSTM Voice Activity Detector. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021.
  18. Xu, X.; Jouvet, D.; Essid, S.; Richard, G. A Lightweight Framework for Online Voice Activity Detection in the Wild. In Proceedings of the INTERSPEECH 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3615–3619.
  19. Ding, S.; Rikhye, R.; Liang, Q.; He, Y.; Wang, Q.; Narayanan, A.; O’Malley, T.; McGraw, I. Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 3744–3748.
  20. Sarkar, E.; Prasad, R.; Magimai.-Doss, M. Unsupervised Voice Activity Detection by Modeling Source and System Information Using Zero Frequency Filtering. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 476–480.
  21. Ball, J. Voice Activity Detection (VAD) in Noisy Environments. arXiv 2023, arXiv:2312.05815.
  22. Zhu, Z.; Zhang, L.; Pei, K.; Chen, S. A Robust and Lightweight Voice Activity Detection Algorithm for Speech Enhancement at Low Signal-to-Noise Ratio. Digit. Signal Process. 2023, 137, 104151.
  23. MathWorks. Train Voice Activity Detection in Noise Model Using Deep Learning. 2021. Available online: https://www.mathworks.com/help/audio/ug/train-voice-activity-detection-in-noise-model-using-deep-learning.html (accessed on 14 November 2025).
  24. Warden, P. Speech Commands: A Public Dataset for Single-Word Speech Recognition. 2017. Available online: https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz (accessed on 14 November 2025).
  25. Sakhnov, K.; Burykh, S.; Zadorozhny, A. Approach for Energy-Based Voice Detector with Adaptive Scaling Factor. Iaeng Int. J. Comput. Sci. 2009, 36, 390–396.
  26. Tan, Z.H.; Lindberg, B. High-Accuracy, Low-Complexity Voice Activity Detection Based on a Posteriori SNR Weighted Energy. In Proceedings of the International Conference on Spoken Language Processing 2009, Brighton, UK, 6–10 September 2009; pp. 1679–1682.
  27. Lee, J.; Choo, Y.; Park, H.G. Voice Activity Detection in Noisy Environments Based on Double-Combined Fourier Transform and Line Fitting. Math. Probl. Eng. 2014, 2014, 146040.
  28. Tan, Z.H.; Sarkar, A.K.; Dehak, N. rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method. arXiv 2019, arXiv:1906.03588.
  29. Tan, Z.H.; Sarkar, A.K.; Dehak, N.; Perochon, S. A Presentation and Short Discussion of rVAD-fast, a Fast Voice Activity Detector. Image Process. Line 2022, 12, 1–20.
  30. Team, S. Silero VAD: Pre-Trained Enterprise-Grade Voice Activity Detector (VAD), Number Detector and Language Classifier. 2024. Available online: https://github.com/snakers4/silero-vad (accessed on 21 December 2025).
  31. Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S.; Dahlgren, N.L. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM; Technical Report NISTIR 4930; U.S. Department of Commerce, National Institute of Standards and Technology: Gaithersburg, MD, USA, 1993.
Figure 1. Architectures of the VAD baseline model ((a): Model (1)) and the baseline model with dropout ((b): Model (2)).
Figure 2. Architectures of the presented CNN-enhanced VAD model variants ((a–d): Models (3), (4), (5), (6)).
Figure 3. The ROC curves (with the corresponding AUROC values) for test sets at different SNRs, using the presented Model (5) with 13-dim MFCC.
Figure 4. The accuracy, precision, recall, and F1 scores for test sets at different SNRs, using the presented Model (5) with 13-dim MFCC.
Figure 5. The VAD results for two utterances at 5 dB and −10 dB, respectively, processed by baseline (model and features) and Model (5) with 13-dim MFCC. (a) baseline model and features, 5 dB SNR; (b) baseline model and features, −10 dB SNR; (c) Model (5) and 13-MFCC, 5 dB SNR; (d) Model (5) and 13-MFCC, −10 dB SNR.
Table 1. Performance comparison of VAD model variants using baseline features. The bold values indicate the best results for each metric among the compared models.

| Model (baseline 9 features) | AUROC (%) | Accuracy (%) | Recall (%) | Precision (%) | F1 Score (%) | # Param. (M) |
|---|---|---|---|---|---|---|
| Model (1) (baseline) | 76.53 | 68.83 | 89.32 | 59.54 | 71.45 | 5.113 |
| Model (2) (with Dropout) | 78.92 | 71.14 | 88.72 | 61.81 | 72.86 | 5.121 |
| Model (3) (CNN(5,32)) | 91.20 | 80.90 | 88.35 | 73.36 | 80.16 | 5.285 |
| Model (4) (CNN(5,64)) | 90.91 | 80.82 | 88.47 | 73.20 | 80.11 | 5.492 |
| Model (5) (CNN(3,32)) | 84.05 | 75.16 | 86.21 | 66.66 | 75.19 | 5.283 |
| Model (6) (CNN(7,32)) | 83.39 | 68.28 | 92.04 | 58.72 | 71.70 | 5.288 |
Table 2. Performance comparison of VAD model variants using baseline features and 13-dimensional MFCCs. Models (4) and (6) are omitted for baseline features due to poor performance. All values except parameter count (# Param.) are shown as percentages. The results highlight the impact of feature extraction and architectural tuning, with the newly added Model (5+) assessing the effect of increasing the number of CNN channels from 32 to 64 (kernel size fixed at 3). The bold values indicate the best results for each metric among the compared models.

| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|---|
| Model (1) (baseline) | baseline feature | 76.53 | 68.83 | 89.32 | 59.54 | 71.45 | 5.113 |
| Model (1) (baseline) | 13-dim MFCC | 92.30 | 81.13 | 91.94 | 72.33 | 80.97 | 5.138 |
| Model (2) (with Dropout) | baseline feature | 78.92 | 71.14 | 88.72 | 61.81 | 72.86 | 5.121 |
| Model (2) (with Dropout) | 13-dim MFCC | 92.50 | 83.60 | 91.29 | 75.98 | 82.94 | 5.146 |
| Model (3) (CNN(5,32)) | baseline feature | 91.20 | 80.90 | 88.35 | 73.36 | 80.16 | 5.285 |
| Model (3) (CNN(5,32)) | 13-dim MFCC | 92.88 | 85.75 | 89.43 | 80.21 | 84.57 | 5.288 |
| Model (5) (CNN(3,32)) | baseline feature | 84.05 | 75.16 | 86.21 | 66.66 | 75.19 | 5.283 |
| Model (5) (CNN(3,32)) | 13-dim MFCC | 93.75 | 86.52 | 90.23 | 81.04 | 85.39 | 5.285 |
| Model (5+) (CNN(3,64)) | 13-dim MFCC | 93.09 | 87.33 | 89.23 | 83.02 | 86.02 | 5.490 |
Table 3. Performance comparison of VAD model variants using 13-dimensional (static) MFCC and 39-dimensional extended MFCCs (13 static, 13 delta, 13 delta-delta). All values except parameter count (# Param.) are shown as percentages (%). Models (3), (4) and (6) are omitted due to inferior performance in prior studies. The bold values indicate the best results for each metric among the compared models.
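| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|---|
| Model (1) (baseline) | 13-dim MFCC | 92.30 | 81.13 | 91.94 | 72.33 | 80.97 | 5.138 |
| Model (1) (baseline) | 39-dim MFCC | 93.78 | 85.74 | 91.21 | 79.26 | 84.82 | 5.301 |
| Model (2) (with Dropout) | 13-dim MFCC | 92.50 | 83.60 | 91.29 | 75.98 | 82.94 | 5.146 |
| Model (2) (with Dropout) | 39-dim MFCC | 93.97 | 84.57 | 92.38 | 76.92 | 83.95 | 5.308 |
| Model (5) (CNN(3,32)) | 13-dim MFCC | 93.75 | 86.52 | 90.23 | 81.04 | 85.39 | 5.285 |
| Model (5) (CNN(3,32)) | 39-dim MFCC | 94.38 | 87.42 | 89.20 | 83.20 | 86.10 | 5.294 |
| Model (5+) (CNN(3,64)) | 13-dim MFCC | 93.09 | 87.33 | 89.23 | 83.02 | 86.02 | 5.490 |
| Model (5+) (CNN(3,64)) | 39-dim MFCC | 94.55 | 87.71 | 90.85 | 82.70 | 86.58 | 5.510 |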
Table 4. Performance comparison of VAD model variants using 13-dimensional (static) MFCC and 13-dimensional FBANK. All values except parameter count (# Param.) are shown as percentages. Models (4) and (6) are omitted due to poor performance in prior studies. The bold values indicate the best results for each metric among the compared models.
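| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|---|
| Model (1) (baseline) | 13-dim MFCC | 92.30 | 81.13 | 91.94 | 72.33 | 80.97 | 5.138 |
| Model (1) (baseline) | 13-dim FBANK | 89.39 | 82.31 | 84.93 | 76.96 | 80.75 | 5.138 |
| Model (2) (with Dropout) | 13-dim MFCC | 92.50 | 83.60 | 91.29 | 75.98 | 82.94 | 5.146 |
| Model (2) (with Dropout) | 13-dim FBANK | 92.90 | 85.44 | 86.19 | 81.53 | 83.79 | 5.146 |
| Model (5) (CNN(3,32)) | 13-dim MFCC | 93.75 | 86.52 | 90.23 | 81.04 | 85.39 | 5.285 |
| Model (5) (CNN(3,32)) | 13-dim FBANK | 93.32 | 88.39 | 86.45 | 86.89 | 86.69 | 5.285 |
| Model (5+) (CNN(3,64)) | 13-dim MFCC | 93.09 | 87.33 | 89.23 | 83.02 | 86.02 | 5.490 |
| Model (5+) (CNN(3,64)) | 13-dim FBANK | 94.10 | 87.73 | 86.76 | 85.38 | 86.07 | 5.490 |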
Table 5. Performance comparison of VAD model variants using 39-dimensional MFCC and 39-dimensional FBANK. All values except parameter count (# Param.) are shown as percentages. Models (4) and (6) are omitted due to poor performance in prior studies. The bold values indicate the best results for each metric among the compared models.
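| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score | # Param. (M) |
|---|---|---|---|---|---|---|---|
| Model (1) (baseline) | 39-dim MFCC | 93.78 | 85.74 | 91.21 | 79.26 | 84.82 | 5.301 |
| Model (1) (baseline) | 39-dim FBANK | 91.40 | 83.94 | 86.65 | 78.72 | 82.50 | 5.301 |
| Model (2) (with Dropout) | 39-dim MFCC | 93.97 | 84.57 | 92.38 | 76.92 | 83.95 | 5.308 |
| Model (2) (with Dropout) | 39-dim FBANK | 92.90 | 85.39 | 88.44 | 80.16 | 84.10 | 5.308 |
| Model (5) (CNN(3,32)) | 39-dim MFCC | 94.38 | 87.42 | 89.20 | 83.20 | 86.10 | 5.294 |
| Model (5) (CNN(3,32)) | 39-dim FBANK | 94.12 | 88.17 | 85.21 | 87.40 | 86.29 | 5.294 |
| Model (5+) (CNN(3,64)) | 39-dim MFCC | 94.55 | 87.71 | 90.85 | 82.70 | 86.58 | 5.510 |
| Model (5+) (CNN(3,64)) | 39-dim FBANK | 94.21 | 88.17 | 86.13 | 86.70 | 86.42 | 5.510 |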
Table 6. Performance comparison of VAD model variants using baseline features and 13-dimensional static MFCCs under multi-SNR training and testing conditions. All values are reported as percentages. The bold values indicate the best results for each metric among the compared models.
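| Model | Feature | AUROC | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|---|---|
| Model (1) (baseline) | baseline feature | 93.80 | 84.48 | 90.14 | 77.84 | 83.54 |
| Model (1) (baseline) | 13-dim MFCC | 95.38 | 88.21 | 93.04 | 82.28 | 87.33 |
| Model (2) (with Dropout) | baseline feature | 93.94 | 84.48 | 90.10 | 77.86 | 83.54 |
| Model (2) (with Dropout) | 13-dim MFCC | 95.33 | 88.05 | 93.02 | 82.04 | 87.18 |
| Model (5) (CNN(3,32)) | baseline feature | 96.29 | 89.37 | 89.12 | 86.88 | 87.99 |
| Model (5) (CNN(3,32)) | 13-dim MFCC | 96.91 | 91.12 | 93.12 | 87.38 | 90.16 |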
Table 7. Performance of Model (5) variants with different hidden units in the two BiLSTM layers tested on 13-dim MFCCs. All values except parameter count (# Param.) are shown as percentages. Here, BiLSTM(x, y) denotes two stacked BiLSTM layers with x and y hidden units, respectively.
| Metric | Original Model (5) with BiLSTM(200,200) | Reduced Model (5) with BiLSTM(64,64) | Reduced Model (5) with BiLSTM(50,50) |
|---|---|---|---|
| AUROC | 96.91 | 95.62 | 96.18 |
| Accuracy | 91.12 | 89.39 | 89.65 |
| Recall | 93.12 | 90.75 | 92.96 |
| Precision | 87.38 | 85.78 | 84.81 |
| F1 score | 90.16 | 88.20 | 88.69 |
| Model size | 5.294 MB | 641.58 kB | 426.12 kB |
Table 8. Test AUROC and F1 scores of Model (5) over five random seeds in multi-SNR training and testing.
| Seed | AUROC | F1 |
|---|---|---|
| 1 | 0.9720 | 0.9060 |
| 11 | 0.9687 | 0.9045 |
| 21 | 0.9619 | 0.8919 |
| 31 | 0.9576 | 0.8878 |
| 41 | 0.9668 | 0.8969 |
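To summarize the seed-to-seed spread in Table 8, the per-seed scores can be reduced to a mean and a standard deviation. The sketch below does this with Python's standard library. It assumes the sample standard deviation (n − 1 in the denominator); the source does not state which estimator the authors used, and the names `seed_scores` and `summarize` are illustrative only.

```python
from statistics import mean, stdev

# Per-seed test scores copied from Table 8: seed -> (AUROC, F1).
seed_scores = {
    1: (0.9720, 0.9060),
    11: (0.9687, 0.9045),
    21: (0.9619, 0.8919),
    31: (0.9576, 0.8878),
    41: (0.9668, 0.8969),
}


def summarize(values):
    """Return (mean, sample standard deviation) of a sequence of scores."""
    values = list(values)
    return mean(values), stdev(values)  # stdev uses n - 1 (assumed estimator)


auroc_mean, auroc_std = summarize(a for a, _ in seed_scores.values())
f1_mean, f1_std = summarize(f for _, f in seed_scores.values())

print(f"AUROC: {auroc_mean:.4f} +/- {auroc_std:.4f}")
print(f"F1:    {f1_mean:.4f} +/- {f1_std:.4f}")
```

Reporting a spread alongside the mean makes it easier to judge whether differences between configurations exceed run-to-run variation.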
Table 9. Performance of Model (5) with different subsets of 13-dimensional MFCC features (five seeds). The bold values indicate the best results for each metric among the compared models.
| MFCC Configuration | AUROC | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| 13D MFCC (1–13) | 0.9630 ± 0.0078 | 0.9067 ± 0.0061 | 0.9143 ± 0.0140 | 0.8783 ± 0.0095 | 0.8959 ± 0.0076 |
| Low-order MFCC 1–6 | 0.9679 ± 0.0016 | 0.9042 ± 0.0052 | 0.9350 ± 0.0085 | 0.8595 ± 0.0150 | 0.8955 ± 0.0047 |
| High-order MFCC 7–13 | 0.9606 ± 0.0028 | 0.8970 ± 0.0035 | 0.9117 ± 0.0000 | 0.8654 ± 0.0000 | 0.8850 ± 0.0032 |
Table 10. Summary of inference speed for the proposed Model (5) (CNN(3,32)) with 13-dim MFCCs.
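| Duration (s) | CPU Time (s) | CPU RTF | GPU Time (s) | GPU RTF | GPU/CPU |
|---|---|---|---|---|---|
| 1 | 0.320170 | 0.3202 | 0.054194 | 0.0542 | 0.17× |
| 5 | 0.115336 | 0.0231 | 0.060100 | 0.0120 | 0.52× |
| 10 | 0.131569 | 0.0132 | 0.115162 | 0.0115 | 0.88× |
| 20 | 0.292913 | 0.0146 | 0.211770 | 0.0106 | 0.72× |
| 40 | 0.671083 | 0.0168 | 0.448069 | 0.0112 | 0.67× |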
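The RTF and GPU/CPU columns of Table 10 are consistent with the usual definitions: the real-time factor is processing time divided by audio duration, and the last column is GPU time divided by CPU time. The following Python sketch recomputes both from the timing columns under that assumption; `real_time_factor` and `timings` are illustrative names, not taken from the authors' code.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time per second of audio; values below 1 are faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


# (audio duration, CPU time, GPU time), all in seconds, copied from Table 10.
timings = [
    (1, 0.320170, 0.054194),
    (5, 0.115336, 0.060100),
    (10, 0.131569, 0.115162),
    (20, 0.292913, 0.211770),
    (40, 0.671083, 0.448069),
]

for duration, cpu_s, gpu_s in timings:
    cpu_rtf = real_time_factor(cpu_s, duration)
    gpu_rtf = real_time_factor(gpu_s, duration)
    ratio = gpu_s / cpu_s  # GPU/CPU time ratio; below 1 means the GPU is faster
    print(f"{duration:>2} s: CPU RTF {cpu_rtf:.4f}, GPU RTF {gpu_rtf:.4f}, GPU/CPU {ratio:.2f}x")
```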
Table 11. Key performance metrics for the proposed Model (5) (CNN(3,32)) with 13-dim MFCCs.
| Key Performance Metrics | Value |
|---|---|
| CPU Average Inference Time | 0.189 s |
| CPU Average RTF | 0.0167 |
| CPU Real-Time Speedup | 59.7× |
| GPU Average Inference Time | 0.168 s |
| GPU Average RTF | 0.0149 |
| GPU Real-Time Speedup | 67.1× |
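The speedup figures in Table 11 appear to be the reciprocal of the average RTF. This is an assumption, since the source does not spell out the formula, but it reproduces the reported values to within rounding. A minimal Python sketch, with the hypothetical helper name `real_time_speedup`:

```python
def real_time_speedup(avg_rtf: float) -> float:
    """Seconds of audio processed per second of compute (assumed to be 1 / RTF)."""
    if avg_rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / avg_rtf


# Average RTFs copied from Table 11 (already rounded to four decimals).
cpu_speedup = real_time_speedup(0.0167)
gpu_speedup = real_time_speedup(0.0149)

print(f"CPU speedup: {cpu_speedup:.1f}x (reported 59.7x)")
print(f"GPU speedup: {gpu_speedup:.1f}x (reported 67.1x)")
```

Because the inputs here are rounded RTFs, the recomputed CPU value can differ from the table in the first decimal place.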
Table 12. Performance comparison on the original test set (Google Speech Command with Noise).
| Metrics | Model (5) | Silero VAD | ITU-T G.729 |
|---|---|---|---|
| AUROC | 96.91% | 67.51% | |
| Accuracy | 91.12% | 63.07% | 67.69% |
| Recall | 93.12% | 63.03% | 31.38% |
| Precision | 87.38% | 55.12% | 88.80% |
| F1 score | 90.16% | 58.81% | 45.56% |
Table 13. AUROC of different models on the TIMIT test set with additive white noise at various SNRs, together with model size. The AUROC scores are computed from frame-wise decisions on the entire corrupted TIMIT test set, without any overlap with the Google Commands data used for training and validation, which avoids test–train leakage and ensures an unbiased comparison between all pre-trained systems. The bold values indicate the best results for each metric among the compared models.
| AUROC | Original Model (5) BiLSTM(200,200) | Reduced Model (5) BiLSTM(64,64) | Reduced Model (5) BiLSTM(50,50) | Silero VAD |
|---|---|---|---|---|
| −10 dB | 88.92% | 89.45% | 88.80% | 61.54% |
| −5 dB | 88.99% | 89.17% | 89.03% | 89.28% |
| 0 dB | 88.92% | 89.19% | 88.91% | 94.77% |
| 5 dB | 88.97% | 88.79% | 88.89% | 96.77% |
| Size | 5.294 MB | 641.58 kB | 426.12 kB | 2.33 MB |
MDPI and ACS Style

Su, B.-Y.; Chen, B.; Huang, S.-C.; Hung, J.-W. A Comparative Experimental Study on Simple Features and Lightweight Models for Voice Activity Detection in Noisy Environments. Electronics 2026, 15, 263. https://doi.org/10.3390/electronics15020263
