Voice Activity Detection Using Fuzzy Entropy and Support Vector Machine

Entropy 2016, 18, 298

Abstract: This paper proposes a support vector machine (SVM) based voice activity detection (VAD) method that uses fuzzy entropy (FuzzyEn) to improve detection performance under noisy conditions. The proposed VAD uses FuzzyEn as a feature extracted from noise-reduced speech signals to train an SVM model for speech/non-speech classification. The method was tested by adding real background noises at signal-to-noise ratios (SNR) ranging from −10 dB to 10 dB to speech signals collected from the TIMIT database. The analysis shows that the FuzzyEn feature discriminates well between noise and noise-corrupted speech. The efficacy of the SVM classifier was validated using 10-fold cross validation. Furthermore, the results obtained by the proposed method were compared with those of standardized VAD algorithms as well as recently developed methods. The comparison suggests that the proposed method detects speech more efficiently under various noisy environments, with an accuracy of 93.29%, and that the FuzzyEn feature detects speech efficiently even at low SNR levels.


Introduction
Voice activity detection (VAD) is a speech-processing technique that discriminates speech from non-speech regions. Silence, noise, or other unrelated acoustic information can be treated as non-speech regions. The challenge for VAD is to detect speech under low signal-to-noise ratio (SNR) scenarios and under the influence of nonstationary noises, both of which cause significant errors. The main applications of VAD are speech coding [1] and speech recognition [2], where it serves as a preprocessing stage. Its applications extend to mobile communications [3], transmission of speech signals over the internet [4], and noise suppression in digital hearing aids [5].
VAD is a predominant stage in many speech processing applications, and its basic design can be summarized by the following steps: a noise reduction step, a feature extraction step, and, finally, a classification step that distinguishes speech from non-speech regions. The noise reduction step plays a crucial role in VAD, because it detects speech pauses in the process of estimating the noise present in the speech signal. The well-known noise reduction algorithms proposed in [6,7] are widely used in robust speech recognition tasks, and they help VAD attain high performance.
In the feature extraction step, acoustic features are usually used to distinguish speech from noise. Traditional VADs rely on energy [8]; others use the zero crossing rate (ZCR) and the energy difference between speech and non-speech [9]. Some algorithms use correlation coefficients [10], others wavelet coefficients [11] or cepstral distance features [12]. Ramirez et al. [13] proposed long-term spectral divergence, and other proposed features include speech periodicity [14] and the ratio of periodic to aperiodic speech components [15]. However, these traditional features lack robustness under noisy conditions, so various algorithms combine multiple features to detect speech under various noisy conditions. All-band and sub-band spectra of the Wiener filter are used in [16], and higher order statistics in [17]. These features improve the accuracy of the VAD in some conditions, but still fall short at low SNR. VAD accuracy and performance also rely on the final classification phase.
Finally, in the classification step, VAD results rely on a decision rule applied to the features extracted from the speech signal. The decision rule is either a simple threshold-based approach or a statistical model. Classifiers based on machine learning algorithms (MLA) are also used to produce VAD results. One option is a neural network. Supervised neural networks have been widely used in classification, but their disadvantage is a costly training procedure. Other frequently used classifiers are the k-nearest neighbor (k-NN) and support vector machine (SVM) classifiers [18][19][20]. The k-NN algorithm is a nonparametric MLA in which a sample is classified by the majority vote of its neighbors. Another nonparametric MLA, proposed by [21], is the SVM classifier, which is a powerful tool for classifying speech and non-speech signals because it converges faster in the training phase than other classifiers. In our proposed method, the SVM classifier is used for VAD under various noisy conditions. Irrespective of the decision rule or classifier, the selection of appropriate features always affects the performance of VAD, because no single feature or set of features is known to improve VAD performance under all noisy conditions. The VAD problem therefore remains extremely challenging for researchers. In our proposed method, fuzzy entropy is introduced as the feature extracted from the speech signal. Although entropy-based feature extraction discriminates speech from noise well, it fails to discard coughs and excessive breathing, which are treated as non-speech, from speech signals. Fuzzy entropy (FuzzyEn) [22] was introduced, based on fuzzy set theory, to measure the complexity of time series data. FuzzyEn is a modification of the sample entropy (SampEn) algorithm [23][24][25][26][27].
Since then, FuzzyEn has been used successfully in feature extraction. Like SampEn-based algorithms, FuzzyEn retains certain characteristics such as excluding self-matches. Additionally, SampEn uses the Heaviside function as the tolerance for accepting or discarding the similarity between two vectors; by measuring similarity with fuzzy sets instead, FuzzyEn overcomes this limitation, because its exponential membership function makes it change smoothly as the parameters vary.
In this study, FuzzyEn was used as the input feature to the SVM for VAD, and its performance was investigated under various noisy conditions (airport, babble, car, and train) at different SNR levels (−10 dB, −5 dB, 0 dB, 5 dB, and 10 dB). The significance of the FuzzyEn feature was also tested using a k-NN classifier, and the results of the proposed algorithm were compared against those of other algorithms. The rest of this article is organized as follows: Section 2 describes the proposed methodology, Section 3 presents the result analysis, and Section 4 concludes.

Proposed Methodology
Voice activity detection usually makes a binary decision on the presence of speech for each frame of the noisy signal. The motivation for the proposal is to identify a robust feature that would improve the accuracy of the VAD. The block diagram of the proposed VAD is shown in Figure 1, and its implementation steps are explained in detail in the following subsections.

Preprocessing
Noise reduction is used as a preprocessing step in the proposed methodology. It suppresses the background noise present in the noisy speech signal. The input noisy speech $s_k(t)$ is obtained by corrupting the clean speech $x_k(t)$ with the additive noise $v_k(t)$, as in (1),

$$s_k(t) = x_k(t) + v_k(t). \tag{1}$$

To reduce the effect of background noise on the noisy speech signal, the spectral subtraction method proposed by Boll [6] is used to enhance the spectrum of the speech signal, because the statistical and structural properties of the speech signal get corrupted when the SNR falls below −5 or −10 dB, or in the presence of complex audible events. The spectrum of the noise $V_k(f)$ was estimated during speech-inactive periods and subtracted from the spectrum of the current frame $S_k(f)$, resulting in an estimate of the spectrum $X_k(f)$ of the noise-reduced speech, as in (2),

$$X_k(f) = S_k(f) - V_k(f). \tag{2}$$

The primary interest of this paper is the evaluation of the proposed VAD classifier under noise-cancelled conditions. A ground-truth measurement is used to identify speech and non-speech frames, which allows a sufficient evaluation using a standard approach to noise cancellation. The impact of noise on the speech is illustrated in Figure 2, where the clean speech signal is corrupted by additive white noise at SNRs of 0 and 5 dB. As seen in Figure 2, the histogram of clean speech shows a leptokurtic, heavy-tailed distribution, while lowering the SNR causes the histogram to become mesokurtic with a medium-tailed distribution.
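The preprocessing step can be illustrated with a minimal sketch, under stated assumptions. The function names (`mix_at_snr`, `dft`, `idft`, `spectral_subtract`) are hypothetical and not from the paper. The subtraction shown is the basic magnitude form with half-wave rectification, not necessarily the full method of [6], and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(v * v for v in noise) / len(noise)
    # Gain that brings the noise power to p_clean / 10^(SNR/10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * v for x, v in zip(clean, noise)]


def dft(frame):
    """Naive O(N^2) DFT; a real system would use an FFT library."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of `dft`, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtract(noisy_frame, noise_mag):
    """Subtract a noise magnitude estimate per bin, keeping the noisy phase."""
    cleaned = []
    for s, v in zip(dft(noisy_frame), noise_mag):
        mag = max(abs(s) - v, 0.0)  # half-wave rectification: no negative magnitudes
        cleaned.append(cmath.rect(mag, cmath.phase(s)))
    return idft(cleaned)
```

In this sketch `noise_mag` would be the per-bin magnitude averaged over frames known to be speech-inactive, which matches the ground-truth-based noise estimate described above.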

Framing
Since the speech signal is nonstationary, the noise-reduced speech is divided into a sequence of short frames of equal size, typically 20-40 ms long. In most cases each frame is to be categorized as speech or noise, so the VAD problem can be treated as a binary classification problem. In this paper, the noise-reduced speech, sampled at 16 kHz, is divided into 32 ms frames with a frame shift of 10 ms and windowed using a Hanning window. Each frame therefore consists of 512 samples, and the number of frames varies with the length of the speech signal. These values were obtained experimentally.
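The framing arithmetic above (32 ms and 10 ms at 16 kHz give 512 and 160 samples) can be sketched as follows. The names `hann` and `frame_signal` are hypothetical, and dropping the incomplete trailing frame is an assumption of this sketch rather than something the paper specifies.

```python
import math

FS = 16000                      # sampling rate in Hz, from the text
FRAME_LEN = int(0.032 * FS)     # 32 ms -> 512 samples
FRAME_SHIFT = int(0.010 * FS)   # 10 ms -> 160 samples


def hann(n):
    """Symmetric Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(signal, frame_len=FRAME_LEN, shift=FRAME_SHIFT):
    """Split `signal` into overlapping, windowed frames.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption of this sketch).
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

With these settings a 1 s signal (16,000 samples) yields 97 frames of 512 samples each.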

Feature Extraction-Fuzzy Entropy (FuzzyEn)
Let $s(i)$ be the sample speech sequence, where $i = 1, 2, 3, \ldots, N$ ($N = 512$ in the present case). The sequence is reconstructed in phase space with an embedding dimension $m$; the reconstructed phase-space speech vector $S_i^m$ is given in (3) and is generalized by removing the baseline as in (4). For a given vector $S_i^m$, the similarity degree $D_{ij}$ to its neighboring vector $S_j^m$ is defined by a fuzzy function, given in (5), where $d_{ij}^m$ is the maximum absolute difference of the scalar components of $S_i^m$ and $S_j^m$, given in (6). Here $\mu(d_{ij}^m, r)$ is the fuzzy membership function, which is given by the exponential function in (7), where $m$, $n$, and $r$ are the embedding dimension, the gradient, and the width of the fuzzy membership function, respectively.
For each $S_i^m$, averaging the similarity degrees $D_{ij}$ of all neighboring vectors $S_j^m$ gives (8). Next, construct $\phi^m(r)$ as in (9) and $\phi^{m+1}(r)$ as in (10). From these, FuzzyEn(m, r) of the speech is defined as in (11).

Selection of FuzzyEn Parameters
Three parameters are crucial in calculating FuzzyEn and must be fixed first. The first parameter, m, is the embedding dimension, which sets the length of the sequences to be compared. The other two parameters, r and n, determine the width (similarity tolerance) and the gradient of the exponential function, respectively. Figure 3 illustrates the impact of different parameter selections on the exponential function. In Figure 3a, the width r of the exponential membership function is fixed at 0.2 and n takes the values 1, 2, and 5. Similarly, in Figure 3b, n is set to 2 and r varies between 0.1 and 0.3. Experimentally, results are best when the width r is scaled by the standard deviation (SD) of the signal and n is small. In this work, the embedding dimension m is 2, and the exponential function parameters n and r are set to 2 and 0.2 times the SD, respectively. Generally, too large an embedding dimension might lead to loss of useful information, and underestimating the similarity tolerance r leads to higher noise sensitivity. The FuzzyEn parameter values were therefore selected using the Mann-Whitney U-test: the lowest p-value was obtained for the combination m = 2, n = 2, and r = 0.2 times SD.
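The equations referenced as (3)-(11) did not survive in this copy of the text. For orientation, the FuzzyEn definition as it is commonly stated in the literature [22] is reproduced below. The mapping of each line to the equation numbers used above is an assumption, and the membership function follows the form $\exp(-d^n/r)$ given in the caption of Figure 3.

```latex
\begin{align}
S_i^m &= \{s(i), s(i+1), \ldots, s(i+m-1)\} - s_0(i), \qquad i = 1, \ldots, N-m+1 \tag{3} \\
s_0(i) &= \frac{1}{m}\sum_{j=0}^{m-1} s(i+j) \tag{4} \\
D_{ij} &= \mu\!\left(d_{ij}^m, r\right) \tag{5} \\
d_{ij}^m &= \max_{k \in \{0, \ldots, m-1\}}
  \left| \big(s(i+k) - s_0(i)\big) - \big(s(j+k) - s_0(j)\big) \right| \tag{6} \\
\mu\!\left(d_{ij}^m, r\right) &= \exp\!\left(-\frac{\left(d_{ij}^m\right)^n}{r}\right) \tag{7} \\
\phi_i^m &= \frac{1}{N-m-1}\sum_{j=1,\, j \neq i}^{N-m} D_{ij} \tag{8} \\
\phi^m(r) &= \frac{1}{N-m}\sum_{i=1}^{N-m} \phi_i^m \tag{9} \\
\phi^{m+1}(r) &= \frac{1}{N-m}\sum_{i=1}^{N-m} \phi_i^{m+1} \tag{10} \\
\mathrm{FuzzyEn}(m, r) &= \ln \phi^m(r) - \ln \phi^{m+1}(r) \tag{11}
\end{align}
```

Here $\phi_i^{m+1}$ is obtained in the same way as $\phi_i^m$, using vectors of length $m+1$.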
FuzzyEn is computed for every frame of the speech signal, and these values are used as the features for the SVM classifier. The length of the feature vector therefore equals the total number of frames of the input speech signal. A frame is labeled as speech if more than half of its samples are speech; otherwise it is labeled as noise.
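A per-frame FuzzyEn computation and the frame-labeling rule can be sketched as below, using the parameters stated in the text (m = 2, n = 2, r = 0.2 times SD). The names `fuzzy_entropy`, `_phi`, and `label_frame` are hypothetical. Using the population SD, returning 0 for a constant frame, and expecting amplitudes roughly within [−1, 1] (so that the exponential does not underflow to zero) are assumptions of this sketch.

```python
import math


def _phi(s, dim, count, n, r):
    """Mean fuzzy similarity over all ordered pairs of baseline-removed vectors."""
    vecs = []
    for i in range(count):
        seg = s[i:i + dim]
        base = sum(seg) / dim                       # local baseline of the vector
        vecs.append([x - base for x in seg])
    total = 0.0
    for i in range(count):
        for j in range(count):
            if i != j:                              # exclude self-matches
                d = max(abs(a - b) for a, b in zip(vecs[i], vecs[j]))
                total += math.exp(-(d ** n) / r)    # membership exp(-d^n / r)
    return total / (count * (count - 1))


def fuzzy_entropy(s, m=2, n=2, r_factor=0.2):
    """FuzzyEn(m, r) = ln(phi^m) - ln(phi^(m+1)), with r = r_factor * SD(s)."""
    mean = sum(s) / len(s)
    sd = math.sqrt(sum((x - mean) ** 2 for x in s) / len(s))
    if sd == 0.0:
        return 0.0          # constant frame: treated as perfectly regular (assumption)
    r = r_factor * sd
    count = len(s) - m      # same number of vectors for dimensions m and m + 1
    return math.log(_phi(s, m, count, n, r)) - math.log(_phi(s, m + 1, count, n, r))


def label_frame(sample_labels):
    """Return 1 (speech) if more than half of the 0/1 sample labels are speech."""
    return 1 if sum(sample_labels) > len(sample_labels) / 2 else 0
```

The double loop is quadratic in the frame length, so a 512-sample frame needs roughly 260,000 distance evaluations per dimension; a vectorized implementation would be preferable in practice.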

Support Vector Machine (SVM)
In this work, SVM is used as the classifier because it constructs an optimal decision function $f(x)$ that accurately predicts unseen data in two classes by minimizing the error function shown in (12), where $g(x)$ is the decision boundary derived from the training set samples $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^m$, with the corresponding target classes $y_i \in \mathbb{R}^m$. The decision boundary is a hyperplane, given in (13), where w and b are derived based on the classification accuracy of the linear problems. Generally, a nonlinear SVM model is trained to minimize the objective function in (14), where $\phi(x_i)$ is a mapping function that maps $x_i$ to a higher dimensional feature space, $\xi_i$ is the misclassification error, and C controls the tradeoff between the cost of misclassification and the margin. The mapping of the input training set into a higher dimensional space is done through a kernel function $K(x_i, y_i)$. Three types of nonlinear kernel function are usually considered: the polynomial kernel, the multilayer kernel, and the radial basis function (RBF) kernel. In this work, the RBF kernel function was used because of its excellent generalization and low computational cost [28].
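The equations referenced as (12)-(14) are missing from this copy. As a point of reference only, the textbook soft-margin formulation that matches the symbols described above ($w$, $b$, $\phi$, $\xi_i$, $C$) is shown below. The class labels $y_i \in \{-1, +1\}$ are an assumption, and the paper's exact expressions may differ.

```latex
\begin{align}
g(x) &= w^{T}\phi(x) + b, \qquad f(x) = \operatorname{sign}\big(g(x)\big) \\
\min_{w,\, b,\, \xi} \quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i \\
\text{subject to} \quad & y_i \big(w^{T}\phi(x_i) + b\big) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \ldots, N
\end{align}
```

A larger $C$ penalizes training errors more heavily at the expense of a narrower margin.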
The RBF kernel function is given by (15), where the parameter $\gamma = 1/(2\sigma^2)$ controls the width of the Gaussian function. For this kernel function, the error function of the classifier is given by (16).

The k-nearest neighbor algorithm classifies unknown samples based on the closest training samples in the feature space [29,30]. A distance or similarity measure determines the closeness of the k nearest neighbors. Here, the Euclidean distance is used to find the nearest neighbors of a new feature vector. The class with the majority of the neighbors' votes is declared the class of the new feature vector. The k-NN classifier is considered here to show the significance of the FuzzyEn feature.
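Both ideas above are small enough to sketch directly. The kernel below uses the standard Gaussian form $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$ with $\gamma = 1/(2\sigma^2)$, which is assumed to correspond to (15). The k-NN function uses the Euclidean distance and majority voting as described. The names `rbf_kernel` and `knn_predict` and the default `k=3` are hypothetical, since the paper does not state its value of k here.

```python
import math
from collections import Counter


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2) with gamma = 1 / (2 * sigma^2)."""
    gamma = 1.0 / (2.0 * sigma ** 2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

For a one-dimensional FuzzyEn feature, each training point is a single-element list such as `[0.12]`.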

Results and Discussions
Speech signals for the proposed methodology were collected from the TIMIT database [31], because it provides transcriptions down to the word and phoneme levels. Each TIMIT sentence lasts around 3.5 s, of which 90% is speech. To change the proportions of speech and non-speech regions to 40% and 60% [32], respectively, silence was added to the original speech of the TIMIT corpus. For the experiments, speech signals were selected randomly from the TIMIT database: around 910 from the training dataset and 320 from the test dataset. Nonstationary noises were collected from the AURORA2 database [33] and resampled to 16 kHz where needed. The speech signals were contaminated by these noises at SNR levels from −10 dB to 10 dB. Four noises were selected: airport, babble, car, and train. As its name suggests, babble noise consists of multiple speakers speaking in the background. Airport and train noises include some speech elements along with their characteristic noises. Car noise is car interior noise with an impulse noise at a particular instant.
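As a worked example of the silence padding, the amount of silence needed to move a typical sentence from 90% speech to 40% speech follows from simple proportions. The helper name `silence_to_add` is hypothetical, and the 3.5 s duration is the approximate figure quoted above.

```python
def silence_to_add(duration_s, speech_fraction, target_speech_fraction):
    """Seconds of silence to append so that speech makes up the target fraction."""
    speech_s = duration_s * speech_fraction          # speech content stays fixed
    total_s = speech_s / target_speech_fraction      # new total duration
    return total_s - duration_s


extra = silence_to_add(3.5, 0.90, 0.40)
print(round(extra, 3))  # 4.375
```

A 3.5 s sentence contains 3.15 s of speech, so the padded signal must last 7.875 s, which means adding about 4.4 s of silence.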

Performance Evaluation
Performance evaluation of a VAD algorithm can be performed both subjectively and objectively. In subjective evaluation, a human listener checks for VAD errors, whereas objective evaluation relies on numerical computations. Subjective evaluation alone is insufficient to examine VAD performance, because listening tests like ABC fail to consider the effects of false alarms [32,34,35]. Hence, numerical computations through objective evaluation are used to report the performance of the proposed VAD algorithm.
VAD performance is calculated using Equations (17)-(19). The best performance is achieved when the three parameters defined in Equations (17)-(19) are at their maximum.
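The equations themselves are missing from this copy, so the sketch below uses the conventional frame-level definitions of the metrics named in this paper (accuracy, speech hit rate, non-speech hit rate, and F-measure). Treating the speech hit rate as sensitivity and the non-speech hit rate as specificity is an assumption, as is the function name `vad_metrics`. The sketch also assumes both classes occur in the reference and at least one frame is detected as speech.

```python
def vad_metrics(reference, decision):
    """Frame-level scores from 0/1 labels (1 = speech, 0 = non-speech)."""
    pairs = list(zip(reference, decision))
    tp = sum(1 for r, d in pairs if r == 1 and d == 1)   # speech detected as speech
    tn = sum(1 for r, d in pairs if r == 0 and d == 0)   # noise detected as noise
    fp = sum(1 for r, d in pairs if r == 0 and d == 1)   # false alarm
    fn = sum(1 for r, d in pairs if r == 1 and d == 0)   # missed speech
    hr_s = tp / (tp + fn)        # speech hit rate
    hr_ns = tn / (tn + fp)       # non-speech hit rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "hr_s": hr_s,
        "hr_ns": hr_ns,
        "f_measure": 2 * precision * hr_s / (precision + hr_s),
    }
```

All four scores lie in [0, 1], and a perfect detector scores 1 on each.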
The performance of the proposed FuzzyEn feature was evaluated with the proposed SVM classifier and compared against the k-NN classifier. In this work, 10-fold cross validation was used to ensure the reliability of the classifier. In 10-fold cross validation, the feature vector was randomly divided in a 90:10 split, where 90% of the features were used to train the SVM model and 10% were used as test data. The mean and standard deviation of the error rates obtained with the two classifiers for the proposed FuzzyEn feature are compared in Table 1. From Table 1, it is inferred that at low SNR levels the proposed FuzzyEn feature with the SVM classifier gives the lowest error rate for the various noisy conditions, except for car noise at 5 dB.

Figure 4 shows the average computing time for the SVM classifier to classify speech and non-speech frames under various noises. The figure shows that the computing time increases with the number of frames. Figure 5a-c shows the mean sensitivity, specificity, and F-measure for airport, babble, car, and train noise for the FuzzyEn-based VAD using spectral subtraction. Similarly, Figure 6a-c shows the standard deviation of the sensitivity, specificity, and F-measure for the various noises. These parameters were computed for SNR levels ranging from −10 dB to 10 dB, and the graphs show that the proposed method detects speech and non-speech frames well, especially under low SNR conditions. The exception is car interior noise, which contains a complex audible event and for which the sensitivity is lower than for the other noises.
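The 10-fold cross validation procedure described above can be sketched generically. The names `k_fold_indices` and `cross_validate` are hypothetical, and `train_fn` and `predict_fn` are placeholder callables standing in for whichever classifier (SVM or k-NN) is being evaluated. The fixed shuffle seed is an assumption made for reproducibility.

```python
import random
import statistics


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(features, labels, train_fn, predict_fn, k=10):
    """Return the mean and standard deviation of the per-fold error rate."""
    errors = []
    for held_out in k_fold_indices(len(features), k):
        test_set = set(held_out)
        train_idx = [i for i in range(len(features)) if i not in test_set]
        # Train on the remaining folds (about 90% of the data for k = 10).
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        wrong = sum(predict_fn(model, features[i]) != labels[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return statistics.mean(errors), statistics.stdev(errors)
```

Each sample is held out exactly once, so the reported mean and standard deviation summarize ten independent test splits.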
The performance of the proposed FuzzyEn-SVM based VAD (referred to as FE-SVM based VAD hereafter) was compared against the standard VAD method, ITU G.729 Annex B [16], and against [13,36]. It was also compared with the k-NN classifier, with and without spectral subtraction. The final VAD decisions were made, and performance metrics such as accuracy, $HR_s$, and $HR_{ns}$ were computed for the different noises at five SNRs (−10, −5, 0, 5, and 10 dB).
In Figure 7, the accuracy of the FE-SVM based VAD is compared against the other VAD algorithms at SNR levels ranging from −10 dB to 10 dB for airport, babble, car, and train noise. The figure clearly shows that G.729 suffers from poor accuracy for airport noise, ~50% on average across the SNRs, whereas the VAD algorithms of [13,36] achieve better accuracy for airport noise, varying gradually across SNR levels and averaging ~71% and ~73%, respectively. The proposed FE-SVM based VAD outperforms the rest, yielding ~92% on average, and ~90% without spectral subtraction (SS). For the k-NN classifier, the accuracy is ~89% and ~88% with and without SS, respectively. For babble noise, the proposed FE-SVM based VAD has an average accuracy of ~93%, and ~89% without SS, whereas [13] achieves ~81%, and [36] and G.729 manage ~64% and ~50%, respectively.
Similarly, for babble noise with the k-NN classifier, accuracies of ~88% and ~89% were obtained with and without SS, respectively. For car noise, [36] dominates with an average of ~93%; the proposed FE-SVM based VAD achieves ~90%, and ~84% without SS, whereas [13] yields ~82% and G.729 ~74%. The accuracy of the k-NN classifier is ~88% both with and without SS. Finally, for train noise, G.729 yields an accuracy of ~60%, and [13,36] reach ~82% and ~86%, respectively, but the proposed FE-SVM based VAD gives the best accuracy, ~96% and ~97% with and without SS, respectively. The k-NN classifier achieves ~92% and ~90% with and without SS, respectively. This shows that the proposed FE-SVM based VAD yields a better accuracy rate for all noises except the car interior noise. This exception is due to the effect of SS, where a portion of the speech was cancelled along with the complex audible event encountered in the car interior noise.
Figures 8 and 9 show the hit-rate metrics at five SNRs for the different kinds of noise, computed for G.729B, [13,36], and the proposed FE-SVM based VAD algorithms. Figure 8 compares $HR_s$ and Figure 9 compares $HR_{ns}$ across the VAD algorithms. Among the algorithms considered, G.729B gives the best $HR_s$ at the different SNR levels for the different noises but the worst $HR_{ns}$. For airport noise, the proposed FE-SVM based VAD trails [13] (~12%) and [36] (~11%) in $HR_s$, and without SS (FE-SVM − SS) it trails [13] (~14%) and [36] (~13%). For babble noise, the proposed FE-SVM trails [13] (~10%) and exceeds [36] (~9%), and FE-SVM − SS trails [13] (~0.4%) and exceeds [36] (~0.2%) in $HR_s$. For car noise, the proposed FE-SVM based VAD trails both [13,36] (~13% and ~9%, respectively). Finally, for train noise, the proposed FE-SVM based VAD offers a better $HR_s$ than [36] (~2.4%) and trails [13] (~0.4%). The numbers in brackets indicate the approximate average hit-rate margin by which the proposed FE-SVM based VAD is better or worse than [13,36]. Most VAD algorithms rely on post-processing techniques such as hangover schemes [37], which smooth the frame-level decisions after the initial VAD decisions are made. Such a scheme reduces the risk that low-energy regions at the ends of speech are falsely rejected as noise. For $HR_{ns}$, as shown in Figure 9, the proposed FE-SVM based VAD outperforms all the VADs considered at the different SNR levels for the various noises. The average $HR_{ns}$ of the proposed FE-SVM based VAD is ~96% for airport noise, ~98% for babble noise, ~95% for car noise, and ~96% for train noise. Even without a hangover post-processing scheme, the proposed FE-SVM based VAD yields ~90% on average for $HR_s$ and ~95% or above for $HR_{ns}$.
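For readers unfamiliar with hangover schemes, a minimal illustration follows. It is a generic sketch rather than the scheme of [37], and the results reported here for the proposed method were obtained without such post-processing. The function name `apply_hangover` and the default of five frames are hypothetical.

```python
def apply_hangover(decisions, hang_frames=5):
    """Keep the speech flag raised for `hang_frames` frames after each speech frame."""
    smoothed = []
    counter = 0
    for d in decisions:
        if d == 1:
            counter = hang_frames   # re-arm the hangover on every speech frame
            smoothed.append(1)
        elif counter > 0:
            counter -= 1            # still inside the hangover period
            smoothed.append(1)
        else:
            smoothed.append(0)
    return smoothed
```

With `hang_frames=2`, the decision sequence `[0, 1, 1, 0, 0, 0, 0]` becomes `[0, 1, 1, 1, 1, 0, 0]`, so the two frames after the end of speech are retained.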
In Table 2, the average performance and average standard deviation of the various VADs are compared with those of the proposed FE-SVM based VAD. Table 2 shows that, in terms of accuracy, the proposed FE-SVM based VAD is the best among all the reference VAD algorithms considered here. On average over the noisy conditions and SNR levels, its accuracy is ~13% higher than that of the VAD proposed by [13] and ~14% higher than that of [36]. For the hit rates, the proposed FE-SVM based VAD exceeds [13,36] by ~36% and ~35%, respectively, for $HR_{ns}$ and trails [13,36] by ~9% and ~6% for $HR_s$. For clean conditions, the accuracy of the proposed FE-SVM based VAD is ~15% higher than that of [13] and ~2.5% higher than that of [36]. For the hit rates, it trails [13,36] in $HR_s$ by ~2% and ~1%, respectively, and leads [13,36] in $HR_{ns}$ by ~36% and ~8%, respectively. These performance metrics indicate that the proposed FE-SVM based VAD detects speech and non-speech frames efficiently, especially under low SNR conditions. For babble noise, the proposed FE-SVM based VAD maintains an accuracy of ~90%, showing that the FuzzyEn feature discriminates well between speech and background noise. The $HR_{ns}$ for all the noises is ~95% or above, so the FuzzyEn feature is well suited to speech applications such as compression and speech coding.

Conclusions
In this paper, a FuzzyEn feature-based VAD has been presented. The significance of the feature is examined experimentally under various nonstationary noises at different SNR levels. The efficacy of the feature is compared across two classifiers, namely, SVM and k-NN. The performance of the classifier is analyzed using a 10-fold cross-validation scheme. The results show that the proposed FE-SVM based VAD outperforms the standard VAD by ~18% and recently developed VADs by ~9% in terms of accuracy. Similarly, at lower SNRs, around −5 dB and −10 dB, the proposed method proves robust under noisy conditions.

Figure 1. Block diagram for the proposed fuzzy entropy and support vector machine based voice activity detection (VAD).

Figure 2. Speech signals (a-c) with additive noise [0, 5 dB], with amplitude along the vertical axis and time (s) along the horizontal axis; the corresponding histograms (d-f), with amplitude along the horizontal axis and frequency along the vertical axis.

Figure 3. Exponential function (exp(−d^n/r)) for different parameter selections. (a) Exponential membership function with n fixed at 2 and r varied from 0.1 to 0.3; (b) exponential membership function with r fixed at 0.2 and n varied from 1 to 5.

Figure 4. Average computing time of the SVM classifier for airport, babble, car, and train noise.

Figure 5. Average of (a) sensitivity, (b) specificity, and (c) F-measure for the proposed FuzzyEn based VAD for airport, babble, car, and train noise.

Figure 6. Standard deviation of (a) sensitivity, (b) specificity, and (c) F-measure for the proposed FuzzyEn based VAD for airport, babble, car, and train noise.

NS_ns and NS_s refer to the number of non-speech and speech frames in the whole database, respectively, while NS_ns,ns and NS_s,s refer to the number of frames classified correctly as non-speech and speech frames, respectively. The overall accuracy rate is given by (19).
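The frame counts defined above can be turned into hit rates and an accuracy figure with a short sketch. The equations themselves are not reproduced in this part of the text, so the code assumes the usual ratio definitions: each hit rate is the number of correctly classified frames of a class divided by the total number of frames of that class, and accuracy is the total number of correct frames divided by all frames. The function name `vad_metrics` and its variable names are hypothetical.

```python
def vad_metrics(reference, predicted):
    """Return (HRns, HRs, accuracy) from frame labels (0 = non-speech, 1 = speech).

    Assumption: standard ratio definitions, which may differ in detail
    from the paper's own equations.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label sequences must be non-empty and equally long")

    pairs = list(zip(reference, predicted))
    ns_s = sum(1 for r in reference if r == 1)               # NS_s: speech frames
    ns_ns = len(reference) - ns_s                            # NS_ns: non-speech frames
    ns_s_s = sum(1 for r, p in pairs if r == 1 and p == 1)   # NS_s,s: correct speech
    ns_ns_ns = sum(1 for r, p in pairs if r == 0 and p == 0) # NS_ns,ns: correct non-speech

    # A hit rate is undefined when the class is absent from the reference.
    hr_s = ns_s_s / ns_s if ns_s else float("nan")
    hr_ns = ns_ns_ns / ns_ns if ns_ns else float("nan")
    accuracy = (ns_s_s + ns_ns_ns) / (ns_s + ns_ns)
    return hr_ns, hr_s, accuracy


if __name__ == "__main__":
    ref = [0, 0, 0, 1, 1, 1, 1, 0]
    est = [0, 1, 0, 1, 1, 0, 1, 0]
    print(vad_metrics(ref, est))
```

Reporting HRs and HRns separately, as the results section does, matters because a detector can reach high accuracy on a noise-dominated database while still clipping speech.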

Table 1. Error rate (%) of the VAD with the support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers for corrupted speech signals under various noise types at different signal-to-noise ratio (SNR) levels.

Table 2. Average performance of different algorithms under various noisy conditions at five SNR levels and under clean conditions, with overall combined accuracy, HRs, and HRns.