Auditory Device Voice Activity Detection Based on Statistical Likelihood-Ratio Order Statistics

Abstract: This paper proposes a technique for improving statistical-model-based voice activity detection (VAD) in noisy environments for use in an auditory hearing aid. The proposed method is implemented for a uniform polyphase discrete Fourier transform filter bank satisfying an auditory device time latency of 8 ms. The proposed VAD technique provides an online unified framework to overcome the frequent false rejection of the statistical-model-based likelihood-ratio test (LRT) in noisy environments. The method is based on the observation that the sparseness of speech and background noise causes high false-rejection error rates in statistical LRT-based VAD; the false rejection rate increases as the sparseness increases. We demonstrate that the false-rejection error rate can be reduced by incorporating likelihood-ratio order statistics into a conventional LRT VAD. We confirm experimentally that the proposed method reduces the average detection error rate by a relative 15.8% compared to a conventional VAD, with only minimal change in the false acceptance probability, for three different noise conditions with signal-to-noise ratios ranging from 0 to 20 dB.


Introduction
The goal of voice activity detection (VAD) is to detect the presence or absence of speech in a sound signal. VAD is increasingly difficult in noisy situations, especially for nonstationary noise such as babble noise. VAD has steadily gained research interest in the speech community in recent years, especially for applications such as selectively encoding and transmitting data in telecommunications, estimating noise statistics in speech enhancement, and detecting endpoints in speech recognition [1][2][3]. We focus on VAD's function in auditory hearing aid speech processing.
Individuals with hearing impairment have difficulty understanding relevant speech content in their daily lives. Attempts have been made to address this problem using auditory devices such as hearing aids, which are widely used to compensate for hearing loss and match the dynamic range [4]. However, many individuals avoid using hearing aids, often because of noise contamination of the speech signal entering the ear; only 23% of hearing-impaired people use hearing aid auditory devices [5][6][7]. This has motivated progress to improve complex speech perception for hearing aid users by reducing the effects of background noise on the targeted speech signal. This improvement is usually accomplished by preserving the characteristics of speech using short-term spectral amplitude (STSA) analysis, for which statistical speech enhancement techniques including Wiener filters and minimum mean square error (MMSE) estimation have been widely used [8,9]. These techniques are strongly dependent on the a priori signal-to-noise ratio (SNR) obtained by noise power spectral densities (PSDs), which can be reliably estimated in noise-only intervals [10,11]. Eventually, in the coupled systems of speech enhancement, a priori SNRs, PSDs, and noise-only intervals, the speech-enhancing performance is strongly dependent on accurate noise-only interval estimation. Thus, the auditory hearing aid must

Auditory Device VAD Implementation
As described in Section 1, the auditory filter bank should have uniformly spaced narrow frequency bands and at least 60 dB stopband attenuation, preferably higher [20,21]. Furthermore, the filter bank must have a low computational cost and a time latency of less than 10 ms. These restrictions are satisfied by a uniform polyphase DFT filter bank based on the fast Fourier transform (FFT). In this paper, a 32-channel filter bank with an 8 ms time delay at a 16 kHz sampling rate is employed [20,21,27], on which the LRT VAD is implemented, as depicted in Figure 1.
Whether the target speech is active ($H_1$) or not ($H_0$) is determined by applying a VAD algorithm to $X_k(\ell)$ at the $k$th frequency bin ($k = 0, 1, \ldots, K/2$). Simultaneously, the speech signal at the $q$th frequency band down-sampled by a factor of 16, $\hat{s}_q(n_{\downarrow 16})$, can be obtained from the real part of the complex component $X_{2q}(\ell)$ and can be used to estimate the envelope power of each band.
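As a rough illustration of how the spectral components $X_k(\ell)$ might be produced, the following sketch uses a plain windowed FFT as a stand-in for the uniform polyphase DFT filter bank. The FFT size K = 128, the frame advance of 16 samples, the Hann window (in place of the prototype low-pass filter of [27]), and the function name `analyze_frames` are all assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

FS = 16_000      # sampling rate in Hz (from the text)
K = 128          # FFT size; assumed equal to the 128-point window length
HOP = 16         # frame advance in samples; assumed from the 16-fold down-sampling

def analyze_frames(x, k=K, hop=HOP):
    """Return X[l, k_bin] for k_bin = 0..k/2 using a windowed FFT.

    A Hann window stands in for the prototype low-pass filter of the
    polyphase DFT filter bank; this is an illustrative simplification.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(k)
    n_frames = 1 + (len(x) - k) // hop
    frames = np.stack([x[l * hop : l * hop + k] for l in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)   # shape (n_frames, k/2 + 1)

# One second of a 1 kHz tone as a toy input.
t = np.arange(FS) / FS
X = analyze_frames(np.sin(2 * np.pi * 1000 * t))

latency_ms = 1000 * K / FS        # a 128-sample window spans 8 ms at 16 kHz
bands = X[:, 0:64:2].real         # real part of X_{2q}, q = 0..31 (32 bands)
```

Under these assumptions the analysis yields K/2 + 1 = 65 bins per frame, and a 128-sample window at 16 kHz is consistent with the 8 ms delay quoted above.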

Conventional Statistical LRT-Based VAD
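Assuming that speech and noise signals are additive, the detection of voice activity at the $\ell$th segmented frame is accomplished by deciding between two hypotheses, $H_0$ and $H_1$:

$$H_0: \mathbf{X}(\ell) = \mathbf{N}(\ell), \qquad H_1: \mathbf{X}(\ell) = \mathbf{S}(\ell) + \mathbf{N}(\ell), \tag{1}$$

where $\mathbf{X}(\ell)$, $\mathbf{S}(\ell)$, and $\mathbf{N}(\ell)$ are $(K/2 + 1)$-dimensional vectors composed of the spectral components ($k = 0, 1, \ldots, K/2$) of the input signal, speech, and noise, i.e., $X_k(\ell)$, $S_k(\ell)$, and $N_k(\ell)$, respectively, such that

$$\mathbf{X}(\ell) = [X_0(\ell), X_1(\ell), \ldots, X_{K/2}(\ell)]^T, \tag{2}$$

$$\mathbf{S}(\ell) = [S_0(\ell), S_1(\ell), \ldots, S_{K/2}(\ell)]^T, \tag{3}$$

$$\mathbf{N}(\ell) = [N_0(\ell), N_1(\ell), \ldots, N_{K/2}(\ell)]^T. \tag{4}$$

Appl. Sci. 2020, 10, 5026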

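Assuming that $S_k(\ell)$ and $N_k(\ell)$ follow complex Gaussian distributions, the probability density functions conditioned on $H_0$ and $H_1$ are given by

$$p(X_k(\ell) \mid H_0) = \frac{1}{\pi \hat{\lambda}_{N,k}(\ell)} \exp\left(-\frac{|X_k(\ell)|^2}{\hat{\lambda}_{N,k}(\ell)}\right) \tag{5}$$

and

$$p(X_k(\ell) \mid H_1) = \frac{1}{\pi \left[\hat{\lambda}_{N,k}(\ell) + \hat{\lambda}_{S,k}(\ell)\right]} \exp\left(-\frac{|X_k(\ell)|^2}{\hat{\lambda}_{N,k}(\ell) + \hat{\lambda}_{S,k}(\ell)}\right), \tag{6}$$

where $\hat{\lambda}_{N,k}(\ell)$ and $\hat{\lambda}_{S,k}(\ell)$ are estimates of the noise variance $\lambda_{N,k}(\ell)$ and the speech variance $\lambda_{S,k}(\ell)$, respectively, at the $k$th frequency bin. Here, the hat symbol $\hat{\ }$ denotes an estimated value. The $k$th local LR $\Lambda_k(\ell)$ under $X_k(\ell)$ can be estimated [2,4] as the ratio between $p(X_k(\ell) \mid H_1)$ and $p(X_k(\ell) \mid H_0)$ as

$$\Lambda_k(\ell) \equiv \frac{p(X_k(\ell) \mid H_1)}{p(X_k(\ell) \mid H_0)} = \frac{1}{1 + \hat{\xi}_k(\ell)} \exp\left(\frac{\hat{\gamma}_k(\ell)\, \hat{\xi}_k(\ell)}{1 + \hat{\xi}_k(\ell)}\right), \tag{7}$$

where $\hat{\xi}_k(\ell)$ is the a priori SNR estimate, which is obtained using the decision-directed (DD) approach of [2,3,10] (Equation (8)) with a smoothing parameter $0 \le \alpha < 1$. In Equations (7) and (8), $\hat{\gamma}_k(\ell)$ is called the a posteriori SNR, which is expressed as

$$\hat{\gamma}_k(\ell) = \frac{|X_k(\ell)|^2}{\hat{\lambda}_{N,k}(\ell)}, \tag{9}$$

where the noise variance estimate $\hat{\lambda}_{N,k}(\ell)$ is obtained via a recursive procedure with a smoothing parameter $\beta$ (Equation (10)). Then, from the averaged log value of the LR $\Lambda_k(\ell)$ in Equation (7), a decision rule can be established using a decision threshold $\eta$:

$$\log \Lambda(\ell) = \frac{1}{\tilde{K}} \sum_{k=0}^{\tilde{K}-1} \log \Lambda_k(\ell) \underset{H_0}{\overset{H_1}{\gtrless}} \eta, \tag{11}$$

where $\tilde{K} = K/2 + 1$, and $\Lambda(\ell)$ is referred to as the global LR in this paper.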
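The conventional decision rule can be sketched in code as follows. This is a minimal sketch, not the author's implementation: the decision-directed update and the first-order noise recursion are written in their common textbook forms, which are assumed here and may differ from Equations (8) and (10); the Wiener-gain amplitude estimate, the default threshold `eta=0.2`, the function name `conventional_lrt_vad`, and the synthetic test signal are likewise illustrative assumptions. No hang-over scheme is included.

```python
import numpy as np

def conventional_lrt_vad(X, lam_n0, alpha=0.97, beta=0.98, eta=0.2):
    """Frame-wise LRT VAD on spectra X of shape (n_frames, n_bins).

    lam_n0 is the initial noise variance per bin. Returns the boolean
    decisions and the global log LR per frame.
    """
    n_frames, n_bins = X.shape
    lam_n = np.asarray(lam_n0, dtype=float).copy()
    prev_speech_pow = np.zeros(n_bins)     # previous speech power estimate (zero at start)
    decisions = np.zeros(n_frames, dtype=bool)
    global_llr = np.zeros(n_frames)
    for l in range(n_frames):
        power = np.abs(X[l]) ** 2
        gamma = power / lam_n                                  # a posteriori SNR
        # Decision-directed a priori SNR (textbook form, assumed).
        xi = alpha * prev_speech_pow / lam_n + (1 - alpha) * np.maximum(gamma - 1, 0)
        log_lr = gamma * xi / (1 + xi) - np.log1p(xi)          # local log LR
        global_llr[l] = log_lr.mean()                          # average over all bins
        decisions[l] = global_llr[l] > eta
        # Wiener-gain speech power estimate for the next frame (assumed).
        prev_speech_pow = (xi / (1 + xi)) ** 2 * power
        if not decisions[l]:
            # Recursive noise update in frames judged noise-only (assumed form).
            lam_n = beta * lam_n + (1 - beta) * power
    return decisions, global_llr

# Synthetic check: 50 noise-only frames followed by 50 frames with speech-like energy.
rng = np.random.default_rng(0)
n_bins = 65
noise = (rng.standard_normal((100, n_bins)) + 1j * rng.standard_normal((100, n_bins))) / np.sqrt(2)
speech = np.sqrt(5) * (rng.standard_normal((50, n_bins)) + 1j * rng.standard_normal((50, n_bins)))
X = noise.copy()
X[50:] += speech                           # roughly 10 dB SNR in every bin
decisions, global_llr = conventional_lrt_vad(X, lam_n0=np.ones(n_bins))
```

In this toy signal the speech energy is spread over every bin, which is the non-sparse situation in which averaging over all bins works well.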
If $\log \Lambda(\ell)$ is reduced by increasing $\Delta\eta$, the false rejection probability (FRP) increases. We argue that, because of the sparseness of speech and noise, $\Delta\eta$ exists in all noisy speech samples. For the two hypotheses in Equation (1), it has been assumed in Equations (2)-(4) that a speech signal $S_k$ is present in every frequency bin for $H_1$ and that a noise signal $N_k$ is present in every frequency bin for both $H_0$ and $H_1$. However, speech and most types of noise (apart from white noise) do not have their energy equally distributed over all frequency bins [28,29]. Thus, to reflect the sparseness states of speech and noise, we decompose $H_0$ and $H_1$ into four states according to the presence or absence of speech and noise in the $k$th frequency bin, as shown in Table 1.

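Table 1. Four sparseness states according to the presence or absence of speech and noise in the $k$th frequency bin.

Speech \ Noise    Present                       Absent
Present           State 1 (speech and noise)    State 2 (speech only)
Absent            State 3 (noise only)          State 4 (neither)

The minimum value of the noise components is specified as $\varepsilon_k$. The superscript 1 denotes the state in which both the speech and noise components exist at the $k$th frequency bin; 2 and 3 denote the speech-only and noise-only states, respectively; and 4 denotes the state with neither speech nor noise. Based on this notation of the four sparseness states, the LRT in Equation (11) can be expressed as

$$\log \Lambda(\ell) = \frac{1}{\tilde{K}} \left[ \sum_{k^1} \log \Lambda_{k^1}(\ell) + \sum_{k^2} \log \Lambda_{k^2}(\ell) + \sum_{k^3} \log \Lambda_{k^3}(\ell) + \sum_{k^4} \log \Lambda_{k^4}(\ell) \right] \underset{H_0}{\overset{H_1}{\gtrless}} \eta, \tag{13}$$

where $\tilde{K} = \mathrm{num}(k^1) + \mathrm{num}(k^2) + \mathrm{num}(k^3) + \mathrm{num}(k^4)$ and $\mathrm{num}(\cdot)$ is the number of frequency bins in the respective state. Restricting the average in Equation (13) to the bins in which speech is present gives

$$\frac{1}{\tilde{K}'} \left[ \sum_{k^1} \log \Lambda_{k^1}(\ell) + \sum_{k^2} \log \Lambda_{k^2}(\ell) \right] \underset{H_0}{\overset{H_1}{\gtrless}} \eta, \tag{14}$$

where $\tilde{K}' = \mathrm{num}(k^1) + \mathrm{num}(k^2)$. Consequently, the solution to improve the robustness against false rejection is based on estimating $\log \Lambda_k(\ell)$ from the four sparseness states. Accordingly, we suggest an LR order statistics approach. In the next subsection, we explore the use of LR order statistics for a VAD that is robust against false rejection.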

VAD Based on LR Order Statistics
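First, the log LRs $\log \Lambda_0(\ell), \log \Lambda_1(\ell), \ldots, \log \Lambda_{\tilde{K}-1}(\ell)$ are arranged in descending order of magnitude as

$$\log \Lambda_{\{0\}}(\ell) \ge \log \Lambda_{\{1\}}(\ell) \ge \cdots \ge \log \Lambda_{\{\tilde{K}-1\}}(\ell), \tag{15}$$

where the subscript $\{k\}$ is the new index of the log LR after ordering. We then denote the elements of the new log LR set as $\Psi_k(\ell)$, such that $\Psi_k(\ell) \equiv \log \Lambda_{\{k\}}(\ell)$. Thus, the left side in Equation (14) can be expressed as

$$\frac{1}{\tilde{K}'} \sum_{k=0}^{\tilde{K}'-1} \Psi_k(\ell). \tag{16}$$

Finally, using the LR order statistics, the LRT rule becomes

$$\frac{1}{\tilde{K}'} \sum_{k=0}^{\tilde{K}'-1} \Psi_k(\ell) \underset{H_0}{\overset{H_1}{\gtrless}} \eta. \tag{17}$$

From Equation (17), the problem of separately estimating observed noisy speech reduces to the estimation of $\tilde{K}'$, which is a tuning parameter used to control the FRP robustness. When $\tilde{K}' = \tilde{K}$, the proposed VAD in Equation (14) equals the conventional VAD in Equation (11).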
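The ordering and truncated averaging described above can be sketched as follows. The function name `order_statistics_log_lr`, the rounding used to turn a percentage into a bin count, and the toy log LR values are assumptions for illustration only.

```python
import numpy as np

def order_statistics_log_lr(log_lr, m_percent=25.0):
    """Average the largest K' local log LRs of each frame.

    K' is taken as M percent of the number of bins, rounded to the
    nearest integer and kept at 1 or more (the rounding is an assumption).
    """
    log_lr = np.asarray(log_lr, dtype=float)
    k_tilde = log_lr.shape[-1]
    k_prime = max(1, int(round(m_percent / 100.0 * k_tilde)))
    psi = -np.sort(-log_lr, axis=-1)       # descending order: Psi_0 >= Psi_1 >= ...
    return psi[..., :k_prime].mean(axis=-1)

# Toy frame: speech dominates 4 of 16 bins; the other bins are slightly negative.
log_lr = np.array([3.0, 2.5, 2.0, 1.5] + [-0.1] * 12)
conventional = log_lr.mean()                             # average over all bins
proposed = order_statistics_log_lr(log_lr, 25.0)         # average over the top 25%
same_as_conventional = order_statistics_log_lr(log_lr, 100.0)
```

In this sparse toy frame the conventional average is diluted by the twelve bins without speech, whereas the average over the four largest values is not; with all bins retained, the two statistics coincide.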

Experiments and Results
We evaluated the proposed algorithm by counting detection errors and comparing it with conventional VAD under various noise types and SNR conditions. Speech utterances of approximately 57 s in duration were obtained from four speakers (two males and two females) from the TIMIT database (DB) [28] and mixed with three noises (white, babble, and Volvo noise) from the NOISEX-92 DB [29]. Based on the clean signals, 65.7% of the samples in the speech material were marked as active (49.3% voiced and 16.4% unvoiced). The noise signals were then additively mixed with the speech at SNRs ranging from 0 to 20 dB in 5 dB steps. Signals were segmented using the 128-point LPF window in [27] and overlapped with each previous segment by one-eighth. We implemented the statistical LRT-based VAD method proposed in [12] with a conventional hang-over scheme to investigate how much the proposed LR order statistics approach reduced the detection error rate. Moreover, we set $\alpha = 0.97$ in Equation (8) and $\beta = 0.98$ in Equation (10) for the experiments; both values were determined empirically.

Figure 2a illustrates an example waveform of male speech at 5 dB SNR in babble noise, and Figure 2b illustrates the conventional $\log \Lambda_k$ and the proposed descending version of $\log \Lambda_k$, $\Psi_k$, at approximately 0.6 s (a voiced interval) of this waveform. The proposed $\Psi_k$ was concentrated at low frequencies and was less widely distributed than $\log \Lambda_k$. Figure 2c illustrates the global log LR curves obtained from the local LRs of $\log \Lambda_k$ and $\Psi_k$. To quantize the tuning parameter $\tilde{K}'$ in Equation (17), the value was converted to a percentage: $M = 100 \times \tilde{K}'/\tilde{K}$ (%). The proposed method (at $M = 25\%$) produced higher log LRs than the conventional method. At $M = 100\%$ (or $\tilde{K}' = \tilde{K}$), the proposed method was identical to the conventional VAD in Equation (11).
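The additive mixing at a target SNR can be sketched as below. This is an illustrative sketch: the function name `mix_at_snr` is hypothetical, random signals stand in for the TIMIT and NOISEX-92 recordings, and the signal power is averaged over the whole utterance, which is an assumption since the text does not state how the SNR was measured.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add.

    Power is averaged over the whole signal (an assumption); `noise`
    must be at least as long as `speech`.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    gain = np.sqrt(np.mean(speech ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16_000)   # placeholder for a TIMIT utterance
noise = rng.standard_normal(16_000)    # placeholder for a NOISEX-92 recording

achieved = {}
for snr_db in range(0, 25, 5):         # 0, 5, 10, 15, 20 dB
    noisy, scaled_noise = mix_at_snr(speech, noise, snr_db)
    achieved[snr_db] = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
```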
To investigate the effect of the control constant $\tilde{K}'$ in Equation (17) in detail, the FRP was measured by comparing the detection results of the proposed method with the true voice activity intervals (Figure 3). The ground-truth voice activity intervals were determined manually by the author. The figure illustrates FRP histograms for white noise (left) and babble noise (right) environments at 5 dB SNR for three different values of $M$. The FRPs decreased considerably with decreasing $M$ for both white noise and babble noise. The optimal value of $M = 25\%$ was adopted for the subsequent experiments.
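Comparing frame-wise decisions with ground-truth labels can be sketched as follows. The definitions used here (FRP as the fraction of speech frames labeled non-speech, FAP as the fraction of non-speech frames labeled speech) are the usual ones and are assumed rather than quoted from the paper; the function names and the toy labels are hypothetical.

```python
import numpy as np

def false_rejection_probability(decisions, truth):
    """Fraction of true speech frames that the VAD marked as non-speech."""
    decisions = np.asarray(decisions, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    return float(np.mean(~decisions[truth]))

def false_acceptance_probability(decisions, truth):
    """Fraction of true non-speech frames that the VAD marked as speech."""
    decisions = np.asarray(decisions, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    return float(np.mean(decisions[~truth]))

# Toy labels: 4 speech frames followed by 6 non-speech frames.
truth = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
decisions = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0], dtype=bool)

frp = false_rejection_probability(decisions, truth)    # one of four speech frames missed
fap = false_acceptance_probability(decisions, truth)   # one of six non-speech frames accepted
```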

Figure 2. Step-by-step illustration of (a) the waveform of babble-noisy speech at 5 dB signal-to-noise ratio (SNR), (b) local likelihood ratio (LR) estimates over frequency, and (c) global log LR estimates over time.
We also measured the false acceptance probability (FAP) to further investigate the effectiveness of the proposed LR order statistics in an LRT-based VAD method. Figure 4 depicts the results of the comparison between the proposed and conventional methods. The results are presented in the form of a detection error tradeoff graph, similar to receiver operating characteristic (ROC) curves. Depicted are the results of the conventional method with $M = 100\%$ and the proposed method with $M = 25\%$ for speech in white noise and babble noise at 5 dB SNR. The detection error tradeoff curve of the proposed method was always closer to the bottom-left corner than that of the conventional VAD for both noise conditions, demonstrating that the proposed method is more robust against false detection than the conventional method.
Finally, we computed the relative detection error reduction rate (RDER) of the proposed method ($M = 25\%$) with respect to the method without LR order statistics under various noise and SNR conditions. The detection error rate was defined as (FAP + FRP)/2. The decision threshold was explicitly determined to minimize the detection error under each noise and SNR condition. Table 2 shows that the proposed method yielded a detection error reduction in all noise environments at all SNRs. According to the table, the proposed method reduced the average detection error rate by a relative 15.8% compared to a conventional VAD, with only minimal change in the false acceptance probability, for three different noise conditions with SNRs ranging from 0 to 20 dB. This finding implies that an LRT-based VAD employing the proposed LR order statistics can be an effective solution for reducing the detection error by improving the robustness against false rejection in noisy environments.
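The evaluation procedure above can be sketched as follows. The detection error (FAP + FRP)/2 and the per-condition threshold search follow the text; the RDER formula (error reduction relative to the conventional error, in percent), the threshold grid, the function names, and the toy log LR values are assumptions made for illustration and do not reproduce any result from Table 2.

```python
import numpy as np

def detection_error(global_llr, truth, eta):
    """Detection error (FAP + FRP) / 2 for a given decision threshold."""
    decisions = global_llr > eta
    frp = np.mean(~decisions[truth])       # speech frames rejected
    fap = np.mean(decisions[~truth])       # non-speech frames accepted
    return 0.5 * (fap + frp)

def min_detection_error(global_llr, truth, thresholds):
    """Smallest detection error over a grid of thresholds, with its threshold."""
    errors = [detection_error(global_llr, truth, eta) for eta in thresholds]
    best = int(np.argmin(errors))
    return float(errors[best]), float(thresholds[best])

def rder(err_conventional, err_proposed):
    """Relative detection error reduction in percent (assumed definition)."""
    return 100.0 * (err_conventional - err_proposed) / err_conventional

# Toy global log LRs: 4 speech frames followed by 8 non-speech frames.
truth = np.array([True] * 4 + [False] * 8)
conv_llr = np.array([0.9, 0.3, 0.8, 0.7, 0.2, 0.4, 0.1, 0.3, 0.2, 0.1, 0.35, 0.15])
prop_llr = np.array([2.0, 1.2, 1.9, 1.5, 0.3, 1.3, 0.2, 0.5, 0.3, 0.2, 0.6, 0.25])

thresholds = np.linspace(0.0, 2.0, 41)
err_conv, eta_conv = min_detection_error(conv_llr, truth, thresholds)
err_prop, eta_prop = min_detection_error(prop_llr, truth, thresholds)
reduction = rder(err_conv, err_prop)
```

Each method receives its own threshold here, mirroring the statement that the threshold was chosen to minimize the detection error under each condition.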

Conclusions
We presented a solution for reducing the false rejection error of an LRT-based VAD in terms of auditory device speech-processing performance. The proposed LR order statistics consider that false rejections are linked to the sparseness of speech and additive noise signals. Accordingly, a spectral LRT-based VAD employing the proposed method was developed for a uniform polyphase DFT filter bank to satisfy auditory device hearing aid requirements regarding low computational cost and algorithm processing delay. Our experimental results confirmed that the LRT-based VAD having the proposed LR order statistics relatively reduced the average detection error rate by 15.8% compared to a conventional VAD, with only minimal change in the false acceptance probability under all tested noise conditions in the tested SNR range between 0 and 20 dB.
Author Contributions: S.M.K. contributed to the research idea and the framework of this study. S.M.K. also contributed ideas on how to formulate LR order statistics for VAD and performed the experiments for the activity detection of voice in noisy environments. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest: The author declares no conflict of interest.