Speech signals for the proposed methodology were collected from the TIMIT database [31] because it provides transcriptions down to the word and phoneme levels. Each TIMIT sentence lasts roughly 3.5 s, of which about 90% is speech. To change the ratio of speech to non-speech regions to 40%:60% [32], silence was appended to the original speech of the TIMIT corpus. For experimental purposes, speech signals were selected randomly from the TIMIT database: around 910 from the training set and 320 from the test set. Nonstationary noises for the experiment were collected from the AURORA2 database [33] and resampled to 16 kHz where necessary. The speech signals were contaminated by various nonstationary noises at different SNR levels (−10 dB to 10 dB). Four noises were selected for the experiments: airport, babble, car, and train. Babble noise, as the name suggests, consists of multiple speakers talking in the background; airport and train noises include some speech elements along with the ambient noise; car noise is car-interior noise containing an impulsive event at a particular instant.
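The paper does not detail its mixing procedure, but contaminating clean speech with noise at a prescribed SNR is typically done by scaling the noise to the required power before addition. A minimal sketch, assuming a hypothetical `mix_at_snr` helper (not from the paper):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db`, then add it to `speech` sample-by-sample.
    Hypothetical helper; the paper does not specify its mixing code."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Required noise power for the target SNR: P_s / P_n' = 10^(snr_db/10)
    target_p_noise = p_speech / (10 ** (snr_db / 10.0))
    gain = math.sqrt(target_p_noise / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]
```

At 0 dB the scaled noise power equals the speech power; more negative SNRs scale the noise up accordingly.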
Performance Evaluation
Performance evaluation of a VAD algorithm can be performed both subjectively and objectively. In subjective evaluation, a human listener checks for VAD errors, whereas objective evaluation relies on numerical computations. However, subjective evaluation alone is insufficient to examine VAD performance, because listening tests such as ABC fail to account for the effect of false alarms [32,34,35]. Hence, numerical computations through objective evaluation are used to report the performance of the proposed VAD algorithm.
VAD performance is calculated using (17) and (18),

HRns = NSns,ns / NSns,  (17)

and

HRs = NSs,s / NSs,  (18)

where HRns and HRs denote the hit rates of non-speech and speech, i.e., the fractions of non-speech and speech frames correctly detected among the non-speech and speech frames, respectively. NSns and NSs refer to the number of non-speech and speech frames in the whole database, respectively, while NSns,ns and NSs,s refer to the number of frames correctly classified as non-speech and speech. The overall accuracy rate is given by (19),

Accuracy = (NSns,ns + NSs,s) / (NSns + NSs).  (19)

The best performance is achieved when the three parameters in Equations (17)–(19) are maximized.
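The three metrics in Equations (17)–(19) can be computed directly from frame-level labels; a minimal sketch (the function name `vad_metrics` is illustrative, not from the paper):

```python
def vad_metrics(labels, decisions):
    """Compute HRs, HRns, and accuracy per Eqs. (17)-(19).
    `labels` and `decisions` are sequences of 1 (speech) / 0 (non-speech)."""
    ns = sum(1 for l in labels if l == 0)    # NS_ns: non-speech frames in total
    s = sum(1 for l in labels if l == 1)     # NS_s: speech frames in total
    ns_ns = sum(1 for l, d in zip(labels, decisions)
                if l == 0 and d == 0)        # NS_ns,ns: correct non-speech
    s_s = sum(1 for l, d in zip(labels, decisions)
              if l == 1 and d == 1)          # NS_s,s: correct speech
    hr_ns = ns_ns / ns                       # Eq. (17)
    hr_s = s_s / s                           # Eq. (18)
    accuracy = (ns_ns + s_s) / (ns + s)      # Eq. (19)
    return hr_s, hr_ns, accuracy
```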
Performance of the proposed FuzzyEn feature was evaluated with the proposed SVM classifier and compared against a k-NN classifier. In this work, 10-fold cross validation was used to ensure the reliability of the classifiers: the feature vectors were randomly partitioned so that, in each fold, 90% of the features train the model and the remaining 10% serve as test data. The mean and standard deviation of the error rates obtained with the two classifiers for the proposed FuzzyEn feature are shown in Table 1. From Table 1, it can be inferred that, at low SNR levels, the proposed FuzzyEn feature with the SVM classifier achieves the lowest error rate under the various noisy conditions, except for car noise at 5 dB.
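The 10-fold protocol described above can be sketched as an index generator; this is an illustrative sketch of the splitting scheme, not the authors' MATLAB code:

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation:
    each fold serves once as the ~10% test split while the remaining
    ~90% of the indices train the model."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # random partition of the features
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Averaging the per-fold error rates gives the mean and standard deviation reported in Table 1.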
All experiments were conducted with MATLAB version 7.11 on a 2.4 GHz Intel® Core™ i7 processor running Windows 10 with 8 GB main memory.
Figure 4 shows the average computing time for the SVM classifier to classify the speech and non-speech frames under the various noises. From the figure, it can be inferred that the computing time increases with the number of frames.
Figure 5a–c shows the mean of the sensitivity, specificity, and F-measure for the various noises (airport, babble, car, and train) for the FuzzyEn-based VAD using spectral subtraction. Similarly, Figure 6a–c shows the standard deviation of the sensitivity, specificity, and F-measure for the same noises. The parameters were computed for SNR levels ranging from −10 dB to 10 dB, and the graphs show that the proposed method detects speech and non-speech frames well, especially under low-SNR conditions; the exception is car interior noise, which contains a complex audible event and for which the sensitivity decreases compared with the other noises.
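For reference, the sensitivity, specificity, and F-measure plotted in Figures 5 and 6 can be derived from frame-level counts using their standard definitions (a generic sketch; the paper does not list its exact formulas):

```python
def detection_scores(labels, decisions):
    """Sensitivity, specificity, and F-measure from frame-level
    speech (1) / non-speech (0) labels vs. VAD decisions.
    Assumes both classes occur and at least one speech detection."""
    tp = sum(1 for l, d in zip(labels, decisions) if l == 1 and d == 1)
    tn = sum(1 for l, d in zip(labels, decisions) if l == 0 and d == 0)
    fp = sum(1 for l, d in zip(labels, decisions) if l == 0 and d == 1)
    fn = sum(1 for l, d in zip(labels, decisions) if l == 1 and d == 0)
    sensitivity = tp / (tp + fn)              # speech frames recovered
    specificity = tn / (tn + fp)              # non-speech frames recovered
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f_measure
```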
The performance of the proposed FuzzyEn-SVM based VAD (referred to as FE-SVM based VAD hereafter) was compared against the standard VAD method, ITU G.729 Annex B [16], and against [13,36]. It was also compared using a k-NN classifier with and without spectral subtraction. The final VAD decisions were made, and the performance metrics accuracy, HRs, and HRns were computed for the different noises at five SNRs (−10, −5, 0, 5, and 10 dB).
In Figure 7, the accuracy of the FE-SVM based VAD is compared against the various VAD algorithms for SNR levels ranging from −10 dB to 10 dB for airport, babble, car, and train noise. The figure clearly shows that G.729 suffers poor accuracy for airport noise, around ~50% on average across SNRs, whereas the VAD algorithms of [13,36] achieve better accuracy for airport noise, varying gradually across SNR levels and averaging ~71% and ~73%, respectively. The proposed FE-SVM based VAD outperforms the rest, yielding ~92% on average, and ~90% without spectral subtraction (SS). Similarly, for the k-NN classifier, the accuracy is ~89% and ~88% with and without SS, respectively. For babble noise, the proposed FE-SVM based VAD averages ~93% accuracy, and ~89% without SS, whereas [13] produces ~81%, and [36] and G.729 manage ~64% and ~50%, respectively.
Similarly, with the k-NN classifier, an accuracy of ~88% and ~89% was obtained with and without SS, respectively. For car noise, [36] dominates with an average of ~93%, and the proposed FE-SVM based VAD produces ~90%, and ~84% without SS, whereas [13] yields around ~82% and G.729 an accuracy of ~74%. The accuracy of the k-NN classifier is ~88% both with and without SS. Finally, for train noise, G.729 yields an accuracy of ~60% and [13,36] around ~82% and ~86%, respectively, but the proposed FE-SVM based VAD outperforms them with the best accuracy, ~96% and ~97% with and without SS, respectively; the k-NN classifier reaches ~92% and ~90% with and without SS. This shows that the proposed FE-SVM based VAD yields a better accuracy rate for all noises except car interior noise. That exception is due to the effect of SS, which cancels a portion of the speech along with the complex audible event encountered in the car interior noise.
Figure 8 and Figure 9 show the hit-rate performance metrics at five SNRs for the different kinds of noise, computed for G.729B, [13,36], and the proposed FE-SVM based VAD. Figure 8 provides comparison results for HRs, and Figure 9 provides HRns results for the various VAD algorithms. It is clearly observed that, among the VAD algorithms considered, G.729B produces the best performance for HRs at the different SNR levels and noises but the worst performance for HRns. For airport noise, the proposed FE-SVM based VAD lags in HRs behind [13] (~12%) and [36] (~11%), and for FE-SVM − SS, HRs lags behind [13] (~14%) and [36] (~13%). For babble noise, the proposed FE-SVM lags behind [13] (~10%) and exceeds [36] (~9%), and for FE-SVM − SS the HRs performance lags behind [13] (~0.4%) and exceeds [36] (~0.2%). For car noise, the proposed FE-SVM based VAD lags behind both [13,36] (~13% and ~9%, respectively). Finally, for train noise, the proposed FE-SVM based VAD offers better HRs than [36] (~2.4%) and lags slightly behind [13] (~0.4%). The numbers in brackets indicate the approximate average speech and non-speech hit rates by which the proposed FE-SVM based VAD is better or worse than [13,36]. Most VAD algorithms rely on post-processing techniques such as hangover schemes [37], which smooth the frame-level decisions after the initial VAD decisions are made. This scheme reduces the risk of low-energy regions at the ends of speech being falsely rejected as noise. Similarly, for HRns performance, as shown in Figure 9, the proposed FE-SVM based VAD outperforms all the VADs considered at the different SNR levels for the various noises. The average HRns of the proposed FE-SVM based VAD is ~96% for airport noise, ~98% for babble noise, ~95% for car noise, and ~96% for train noise. Without the post-processing scheme, the proposed FE-SVM based VAD still averages ~90% for HRs and above ~95% for HRns.
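A hangover scheme of the kind cited in [37] can be sketched as follows; the hangover length `hang` is an illustrative assumption, not a value from the paper:

```python
def apply_hangover(decisions, hang=4):
    """Hangover-style smoothing of frame-level VAD decisions:
    after a speech frame (1), keep the next `hang` frames labelled
    as speech so that low-energy speech tails are not rejected
    as noise. `hang=4` is an illustrative default."""
    out = []
    counter = 0
    for d in decisions:
        if d == 1:
            counter = hang   # reset the hangover window on speech
            out.append(1)
        elif counter > 0:
            counter -= 1     # still inside the hangover window
            out.append(1)
        else:
            out.append(0)
    return out
```

Published hangover schemes vary the window length with SNR or frame energy; this fixed-length version only illustrates the smoothing idea.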
In Table 2, the average performance and average standard deviation of the various VADs are compared with the proposed FE-SVM based VAD. From Table 2, it is clear that in terms of accuracy, the proposed FE-SVM based VAD is the best among all reference VAD algorithms considered here. The results show that the proposed VAD outperforms the existing VADs under all noisy conditions at the different SNR levels, with accuracy ~13% higher than the VAD proposed by [13] and ~14% higher than [36]. Similarly, for the hit rates, the proposed FE-SVM based VAD exceeds [13,36] by ~36% and ~35%, respectively, for HRns and lags behind [13,36] by ~9% and ~6% for HRs. Under clean conditions, the accuracy rate of the proposed FE-SVM based VAD is ~15% higher than that of [13] and ~2.5% higher than that of [36]. For the hit rates under clean conditions, the proposed FE-SVM based VAD lags behind [13,36] for HRs by ~2% and ~1%, respectively, while for HRns it leads [13,36] by ~36% and ~8%, respectively.
The performance metrics confirm that the proposed FE-SVM based VAD detects speech and non-speech frames efficiently, especially under low-SNR conditions. Even for babble noise, the proposed FE-SVM based VAD manages a ~90% accuracy rate, showing that the proposed FuzzyEn feature discriminates speech from background noise well. The HRns for all noises is ~95% and above; therefore, the FuzzyEn feature is well-suited for various speech applications such as compression and speech coding.