Late Reverberant Spectral Variance Estimation for Single-Channel Dereverberation Using Adaptive Parameter Estimator

Abstract: The estimation of the late reverberant spectral variance (LRSV) is of paramount importance in most reverberation suppression algorithms. This letter proposes an improved single-channel LRSV estimator that builds on the Habets LRSV estimator by adding an adaptive parameter estimator. Instead of estimating the direct-to-reverberation ratio (DRR), the proposed LRSV estimator directly estimates the parameter κ of a generalized statistical model, because the experimental results show that even the κ calculated from the measured ground truth DRR may not be the optimal parameter for the LRSV estimator. Experimental results using synthetic reverberant signals demonstrate that the proposed estimator outperforms conventional approaches.


Introduction
Speech signals received within a room usually contain reverberation, which impairs the intelligibility of speech in communication scenarios such as mobile phones and hearing aids. Reverberation also degrades the recognition performance of automatic speech recognition systems. Hence, speech dereverberation remains an important problem.
Dereverberation techniques can be divided into reverberation cancellation [1] and reverberation suppression [2,3] depending on whether or not the acoustic impulse response (AIR) needs to be estimated during the dereverberation [4]. The major part of most reverberation suppression methods is the estimation of late reverberant spectral variance (LRSV), which remains a challenging task due to its high time variability [5].
Habets proposed a single-channel LRSV estimator [3] based on a generalized statistical model [6] to suppress late reverberation, and it still performs very well compared with more recent methods [5]. However, in the Habets LRSV estimator, two parameters (i.e., the reverberation time T60 and the parameter κ, which is related to the direct-to-reverberation ratio (DRR)) must be given in advance or estimated online.
To the authors' knowledge, there are numerous reverberation time estimation methods, whereas there are few single-channel online DRR estimation methods [7]. Besides, according to practical experience, it may also be inappropriate to obtain κ indirectly by estimating the DRR, because even the κ calculated from the measured ground truth DRR may not be the optimal κ for the Habets estimator. A detailed discussion can be found in Section 5. Therefore, unlike traditional methods that use an estimated DRR to calculate κ, the present work proposes a blind adaptive κ estimator that can improve the performance of the Habets LRSV estimator and make it more practical. Inspired by the optimally-modified log-spectral amplitude (OM-LSA) algorithm [8], this letter differentiates between the direct sound presence and absence hypotheses and derives the conditional direct sound presence probability, which is used to apply a time-varying recursive average to the estimated κ. The proposed κ estimator is evaluated and compared with an existing κ estimator [9] and with the κ calculated from the measured ground truth DRR. The evaluation results show that the proposed κ estimator performs better than the conventional κ estimator and the measured κ under all evaluation conditions. In addition, the quality of the dereverberated speech is evaluated and compared with a method using the recursive maximum-sparseness-power-prediction-model (MSPP) [10].

Problem Formulation
The reverberant signal results from the convolution of the anechoic speech signal and a causal AIR. The anechoic speech signal can be expressed in the short-time Fourier transform (STFT) domain by $S(k, l)$, where $k$ and $l$ are the frequency and frame indices, respectively. According to the convolutive transfer function (CTF) model [2], the reverberant speech signal $Z(k, l)$ can be expressed as Equation (1)
$$Z(k, l) = \sum_{l' \ge 0} H(k, l')\, S(k, l - l'), \tag{1}$$
where $H(k, l)$ represents the AIR, which can be split into three components as Equation (2)
$$H(k, l) = \begin{cases} H_d(k), & l = 0, \\ H_e(k, l), & 1 \le l < N_e, \\ H_l(k, l), & l \ge N_e, \end{cases} \tag{2}$$
where $H_d(k)$ is the direct sound, $H_e(k, l)$ consists of early reflections, $H_l(k, l)$ represents later reflections, and $N_e$ usually corresponds to approximately 20-50 ms. The late reverberant speech component $Z_l(k, l) = \sum_{l' \ge N_e} H_l(k, l')\, S(k, l - l')$ mainly decreases the speech fidelity and intelligibility [4] and needs to be suppressed. Hence, the main challenge is to derive an estimator for the spectral variance of the late reverberant speech component (i.e., $\lambda_l(k, l) = E\{|Z_l(k, l)|^2\}$), where $E\{\cdot\}$ denotes the expectation operator. Once $\lambda_l(k, l)$ is given, a spectral enhancement method [11] can be used to suppress the late reverberation.
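As a rough illustration of the CTF model, the following Python sketch convolves a toy CTF with a toy STFT sequence in a single frequency bin and isolates the late part. The function name `ctf_convolve`, the toy values, and the convention that the late part starts at frame lag `n_e` are assumptions made for this example, not details taken from the letter.

```python
def ctf_convolve(h, s, start=0):
    """Sum h[lp] * s[l - lp] over lp >= start, for every frame l (one frequency bin)."""
    out = []
    for l in range(len(s)):
        acc = 0j
        # Causality: only lags that reach back to frame 0 or later contribute.
        for lp in range(start, min(len(h), l + 1)):
            acc += h[lp] * s[l - lp]
        out.append(acc)
    return out


# Hypothetical toy values for a single frequency bin k.
h = [1.0, 0.5, 0.25, 0.125]      # CTF coefficients H(k, l'), l' = 0..3
s = [1 + 0j, 0j, 0j, 0j, 0j]     # S(k, l): an impulse in frame 0
n_e = 2                          # assumed first frame lag of the late part

z = ctf_convolve(h, s)           # full reverberant signal Z(k, l)
z_late = ctf_convolve(h, s, n_e) # late reverberant component only
```

With an impulse as input, `z` simply reproduces the CTF coefficients, and `z_late` keeps only the coefficients at lags `n_e` and above.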

Brief Review of Habets Late Reverberant Spectral Variance Estimator
The underlying theory for the present work is based on the LRSV estimator derived by Habets [3]. The Habets method is based on a generalized statistical model, which is an improvement on Polack's statistical model [4]. Let $H_r(k, l)$ represent the early and late reflections. Then, the corresponding spectral variance $\lambda_{h_r}(k, l)$ can be written as Equation (3), where $T_{60}(k)$ is the frequency-dependent reverberation time, $f_s$ denotes the sampling frequency, $R$ is the discrete time shift, and $\kappa(k)$ is a prior parameter that is related to the DRR. Assuming that the direct component $Z_d(k, l) = H_d(k) S(k, l)$ and the reverberant component $Z_r(k, l) = \sum_{l' \ge 1} H_r(k, l')\, S(k, l - l')$ are uncorrelated, the spectral variance $\lambda_z(k, l) = E\{|Z(k, l)|^2\}$ can be expressed as the sum of the direct component spectral variance $\lambda_d(k, l) = E\{|Z_d(k, l)|^2\}$ and the reverberant component spectral variance $\lambda_r(k, l)$, as Equation (4)
$$\lambda_z(k, l) = \lambda_d(k, l) + \lambda_r(k, l), \qquad \lambda_r(k, l) = \sum_{l' \ge 1} \lambda_{h_r}(k, l')\, \lambda_s(k, l - l'), \tag{4}$$
where $\lambda_s(k, l)$ is the spectral variance of $S(k, l)$. The reverberant component $\lambda_r(k, l)$ can be further split into early reverberation $\lambda_e(k, l)$ and late reverberation $\lambda_l(k, l)$, as Equation (5)
$$\lambda_r(k, l) = \underbrace{\sum_{l'=1}^{N_e - 1} \lambda_{h_r}(k, l')\, \lambda_s(k, l - l')}_{\lambda_e(k, l)} + \underbrace{\sum_{l' \ge N_e} \lambda_{h_r}(k, l')\, \lambda_s(k, l - l')}_{\lambda_l(k, l)}, \tag{5}$$
and the main purpose is to derive an estimator for the LRSV $\lambda_l(k, l)$. Combining Equations (3) and (4), $\lambda_r(k, l)$ can be obtained by Equation (6). Finally, according to Equations (3) and (5), $\lambda_l(k, l)$ can be obtained from $\lambda_r(k, l)$ as Equation (7).
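The following Python sketch illustrates how such an estimator could be run on one frequency bin. It assumes the first-order recursions commonly attributed to the Habets estimator [3], namely λ_r(l) = d[(1 − κ)λ_r(l − 1) + κλ_z(l − 1)] and λ_l(l) = d^(N_e − 1) λ_r(l − N_e + 1) with d = exp(−13.8R/(T60 f_s)). These forms, the function name `habets_lrsv`, and all argument names are assumptions for illustration and should be checked against Equations (6) and (7) of the original letter.

```python
import math


def habets_lrsv(lambda_z, t60, fs, hop, kappa, n_e):
    """Sketch of a Habets-style LRSV recursion for one frequency bin.

    lambda_z : list of reverberant spectral variances, one per frame
    t60      : reverberation time in seconds
    fs       : sampling frequency in Hz
    hop      : frame shift R in samples
    kappa    : model parameter related to the DRR
    n_e      : assumed frame lag at which late reverberation starts
    """
    # Per-frame energy decay; 13.8 is approximately 6 * ln(10).
    decay = math.exp(-13.8 * hop / (t60 * fs))

    # Assumed recursion for the reverberant spectral variance.
    lam_r = [0.0] * len(lambda_z)
    for l in range(1, len(lambda_z)):
        lam_r[l] = decay * ((1.0 - kappa) * lam_r[l - 1] + kappa * lambda_z[l - 1])

    # Assumed relation between late and total reverberant variance.
    lam_l = [0.0] * len(lambda_z)
    for l in range(n_e - 1, len(lambda_z)):
        lam_l[l] = decay ** (n_e - 1) * lam_r[l - n_e + 1]
    return lam_r, lam_l


# Hypothetical input: a single burst of energy in frame 0.
lam_r, lam_l = habets_lrsv([1.0, 0.0, 0.0, 0.0, 0.0],
                           t60=2.0, fs=16000, hop=128, kappa=0.5, n_e=3)
```

In this sketch a smaller `kappa` makes the reverberant estimate respond less to the most recent input frame, which is the role the letter assigns to κ as a DRR-related weight.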

Parameter Estimation
In the Habets LRSV estimator, two parameters (i.e., T60 and κ) must be given in advance. The reverberation time T60 can be determined by applying Schroeder's method to the AIR. The parameter κ is related to the DRR and can be calculated [3] by solving Equation (8), which is expressed in terms of h²(n), where h(n) represents the AIR. In practice, however, the Habets LRSV estimator is often used without knowing these two parameters. T60 estimation has been well investigated and numerous blind approaches can be found, whereas DRR estimation is less mature and there are few online single-channel estimation algorithms [7]. Therefore, the reverberation time T60 is assumed to be known in the following, and the present work focuses on κ estimation. Most existing κ estimators [4,9] treat κ as a frequency-independent parameter. Hence, this letter also derives a fullband κ estimator, which can make the LRSV estimator more practical and accurate.
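For readers unfamiliar with Schroeder's method, the following Python sketch shows one common way to obtain T60 from an AIR by backward integration of the squared impulse response. It is a fullband illustration only; the function name `schroeder_t60`, the fit range of −5 dB to −25 dB, and the synthetic exponential test response are assumptions of this example and are not taken from the letter.

```python
import math


def schroeder_t60(h, fs, hi_db=-5.0, lo_db=-25.0):
    """Estimate T60 (seconds) from an impulse response h sampled at fs."""
    # Energy decay curve: backward cumulative sum of the squared response.
    edc = [0.0] * len(h)
    acc = 0.0
    for n in range(len(h) - 1, -1, -1):
        acc += h[n] * h[n]
        edc[n] = acc

    # Collect (time, level in dB) points inside the chosen fit range.
    pts = []
    for n, e in enumerate(edc):
        if e <= 0.0:
            break
        level = 10.0 * math.log10(e / edc[0])
        if lo_db <= level <= hi_db:
            pts.append((n / fs, level))
    if len(pts) < 2:
        raise ValueError("not enough points in the fit range")

    # Least-squares slope in dB per second, extrapolated to a 60 dB decay.
    mt = sum(t for t, _ in pts) / len(pts)
    md = sum(d for _, d in pts) / len(pts)
    slope = (sum((t - mt) * (d - md) for t, d in pts)
             / sum((t - mt) ** 2 for t, _ in pts))
    return -60.0 / slope


# Synthetic exponentially decaying response with a known T60 of 0.5 s.
fs = 1000
true_t60 = 0.5
delta = 3.0 * math.log(10.0) / (true_t60 * fs)
h = [math.exp(-delta * n) for n in range(fs)]
t60_est = schroeder_t60(h, fs)
```

For measured AIRs the decay is not perfectly exponential, so the choice of fit range affects the result; the letter applies the method per 1/3-octave subband, which would require band-pass filtering before this step.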

Proposed κ Estimator
Inspired by the OM-LSA algorithm [8], this letter proposes an adaptive κ estimator using a probability-based framework. Two hypotheses, $H_0(l)$ and $H_1(l)$, indicate direct sound absence and presence in the $l$th frame, respectively, as in Equation (9)
$$H_0(l): Z(k, l) = Z_r(k, l), \qquad H_1(l): Z(k, l) = Z_d(k, l) + Z_r(k, l). \tag{9}$$
When the direct sound is absent, the desired κ can be directly estimated according to Equation (6). Accordingly, the proposed κ estimation strategy is to recursively average past estimates of κ during periods of direct sound absence, and to hold the estimate during direct sound presence. Specifically, the proposed κ estimator is as follows in Equation (10)
$$H_0(l): \hat{\kappa}(l) = \alpha_\kappa\, \hat{\kappa}(l - 1) + (1 - \alpha_\kappa)\, \tilde{\kappa}(l), \qquad H_1(l): \hat{\kappa}(l) = \hat{\kappa}(l - 1), \tag{10}$$
where $\alpha_\kappa$ denotes a smoothing parameter, $\tilde{\kappa}(l)$ denotes the estimated κ in the $l$th frame, and $\hat{\kappa}(l)$ denotes the recursively averaged estimate. Under direct sound uncertainty, the frame conditional direct sound presence probability is defined as $p(l) \triangleq P(H_1(l) \mid Z(k, l), k = 0, 1, \dots, K - 1)$, and the recursive averaging can be carried out as in Equation (11)
$$\hat{\kappa}(l) = \tilde{\alpha}_\kappa(l)\, \hat{\kappa}(l - 1) + [1 - \tilde{\alpha}_\kappa(l)]\, \tilde{\kappa}(l), \tag{11}$$
where $\tilde{\alpha}_\kappa(l) = p(l) + \alpha_\kappa (1 - p(l))$ is a time-varying smoothing parameter that is adjusted by the frame conditional direct sound presence probability $p(l)$. Two parts of the proposed κ estimator remain to be determined: (1) the frame conditional direct sound presence probability, $p(l)$; (2) the estimated κ in the $l$th frame, $\tilde{\kappa}(l)$.
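The hold-or-average strategy described above can be sketched in a few lines of Python. The sketch assumes a standard first-order recursive average whose smoothing factor is p(l) + α_κ(1 − p(l)), as stated in the text; the function name `update_kappa` and the example numbers are hypothetical, and the default of 0.75 is the α_κ value reported in the Setup section.

```python
def update_kappa(kappa_prev, kappa_frame, p, alpha=0.75):
    """One step of probability-weighted recursive averaging of kappa.

    kappa_prev  : smoothed estimate from the previous frame
    kappa_frame : raw estimate obtained in the current frame
    p           : direct sound presence probability of the current frame
    alpha       : fixed smoothing parameter used when the direct sound is absent
    """
    # Time-varying smoothing factor: 1 when p = 1 (hold), alpha when p = 0.
    alpha_t = p + alpha * (1.0 - p)
    return alpha_t * kappa_prev + (1.0 - alpha_t) * kappa_frame


# Hypothetical values: previous estimate 1.0, current frame estimate 0.2.
held = update_kappa(1.0, 0.2, p=1.0)      # direct sound surely present
averaged = update_kappa(1.0, 0.2, p=0.0)  # direct sound surely absent
mixed = update_kappa(1.0, 0.2, p=0.5)     # uncertain frame
```

When `p` is 1 the previous estimate is kept unchanged, and when `p` is 0 the update reduces to plain recursive averaging with `alpha`, so intermediate probabilities interpolate smoothly between the two hypotheses.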

Frame Conditional Direct Sound Presence Probability
Let us assume that the STFT coefficients, $Z_d(k, l)$ and $Z_r(k, l)$, are complex Gaussian variables. Then, applying Bayes' rule [8], the conditional direct sound presence probability $p(k, l) \triangleq P(H_1(l) \mid Z(k, l))$ can be written as Equation (12)
$$p(k, l) = \left\{ 1 + \frac{q(l)}{1 - q(l)}\, \big(1 + \xi(k, l)\big) \exp\big(-\upsilon(k, l)\big) \right\}^{-1}, \tag{12}$$
where $q(l) \triangleq P(H_0(l))$ is the a priori probability for direct sound absence, $\xi(k, l)$ is the a priori signal-to-reverberation ratio (SRR), $\gamma(k, l)$ is the a posteriori SRR, and $\upsilon(k, l) \triangleq \gamma(k, l)\, \xi(k, l) / (1 + \xi(k, l))$. Note that $\gamma(k, l)$ can be calculated directly, whereas $q(l)$ and $\xi(k, l)$ need to be determined.
Considering that $\lambda_z(k, l)$ decays frame by frame during periods of direct sound absence $H_0(l)$, the a priori probability for direct sound absence $q(l)$ can be defined as in Equation (13), where $u(\cdot)$ is the unit step function. Then, the a priori SRR $\xi(k, l)$ can be obtained via recursive averaging as in Equation (14), in which the previous value $\xi(k, l - 1)$ is weighted by the smoothing parameter $\alpha_\xi$. After $p(k, l)$ is determined, the frame conditional direct sound presence probability $p(l)$ is taken as the average of $p(k, l)$ over all frequency bins,
$$p(l) = \frac{1}{K} \sum_{k=0}^{K-1} p(k, l).$$
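As an illustration of how the presence probability could be evaluated, the following Python sketch uses the Gaussian-model expression from the OM-LSA framework [8], p = 1 / (1 + q/(1 − q) (1 + ξ) exp(−υ)) with υ = γξ/(1 + ξ), and then averages over frequency bins. The function names and example values are hypothetical, and the computation of q(l) and ξ(k, l) themselves (Equations (13) and (14)) is not reproduced here.

```python
import math


def presence_probability(xi, gamma, q):
    """Direct sound presence probability for one bin (OM-LSA-style formula).

    xi    : a priori SRR
    gamma : a posteriori SRR
    q     : a priori probability of direct sound absence
    """
    if q >= 1.0:
        return 0.0  # absence is certain a priori
    if q <= 0.0:
        return 1.0  # presence is certain a priori
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * math.exp(-v))


def frame_probability(xis, gammas, q):
    """Average the per-bin probabilities over all frequency bins of one frame."""
    probs = [presence_probability(x, g, q) for x, g in zip(xis, gammas)]
    return sum(probs) / len(probs)


# Hypothetical frame with two bins and an uninformative prior q = 0.5.
p_frame = frame_probability([0.0, 10.0], [1.0, 20.0], 0.5)
```

With a zero a priori SRR the formula returns the prior probability of presence, while a large a priori and a posteriori SRR drive the result towards 1.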

Estimated κ in Each Frame
Under the direct sound absence hypothesis H0(l), Equation (4) becomes λ_z(k, l) = λ_r(k, l). Substituting this into Equation (6) yields Equation (15). After some algebra, Equation (15) can be rewritten as Equation (16), an explicit expression for κ (κ = exp 13.8R …). Then, the estimated κ in the lth frame is determined in Equation (17) by averaging Equation (16) over frequency. Note that the numerator and the denominator of Equation (16) are averaged separately in order to avoid division by zero.
Equation (17) is similar to the conventional estimator in Equation (18) [9]. However, Equation (17) is derived under the direct sound absence hypothesis using Equation (6). Hence, the proposed estimator uses a probability-based framework to update κ, rather than the simple heuristic used in the conventional estimator. Further comparison can be found in Section 5.
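To make the per-frame estimate more concrete, the following Python sketch works through one possible form of it. It assumes that the reverberant variance follows λ_r(l) = d[(1 − κ)λ_r(l − 1) + κλ_z(l − 1)] with d = exp(−13.8R/(T60 f_s)); setting λ_r(l) = λ_z(l) under direct sound absence and solving for κ then gives κ = [λ_z(l)/d − λ_r(l − 1)] / [λ_z(l − 1) − λ_r(l − 1)]. This is a reconstruction under that assumption rather than the letter's exact Equations (16) and (17), and the function name `kappa_frame` is hypothetical.

```python
import math


def kappa_frame(lam_z_now, lam_z_prev, lam_r_prev, t60, fs, hop, eps=1e-12):
    """Fullband kappa estimate for one frame under direct sound absence.

    lam_z_now, lam_z_prev : reverberant spectral variance per bin, frames l and l-1
    lam_r_prev            : reverberant-component variance per bin, frame l-1
    t60                   : reverberation time per bin, in seconds
    Returns None when the averaged denominator is too close to zero.
    """
    num = 0.0
    den = 0.0
    for zn, zp, rp, t in zip(lam_z_now, lam_z_prev, lam_r_prev, t60):
        growth = math.exp(13.8 * hop / (t * fs))  # inverse of the frame decay
        num += growth * zn - rp
        den += zp - rp
    # Numerator and denominator are averaged separately over frequency
    # before dividing, instead of averaging per-bin ratios.
    num /= len(lam_z_now)
    den /= len(lam_z_now)
    if abs(den) < eps:
        return None
    return num / den


# Hypothetical data generated from the assumed recursion with kappa = 0.4.
d = math.exp(-13.8 * 128 / (2.0 * 16000))
z_prev = [2.0, 4.0]
r_prev = [1.0, 1.0]
z_now = [d * (0.6 * r + 0.4 * z) for r, z in zip(r_prev, z_prev)]
kappa_est = kappa_frame(z_now, z_prev, r_prev, [2.0, 2.0], 16000, 128)
```

Averaging the numerator and denominator separately means a single bin with a near-zero denominator cannot dominate the estimate, which matches the motivation given in the text.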

Performance Evaluation
In this section, the performance of the LRSV estimator using the proposed κ estimator is evaluated. The performance using κ obtained by four other methods is also evaluated: the conventional κ estimator [9], the measured ground truth κ calculated from the measured DRR and T60 (fullband and subband) according to Equation (8), and the scanning-optimal κ obtained by a scanning method that steps κ from 0.05 to 1.5 at intervals of 0.01. In addition, the quality of the dereverberated speech obtained with the proposed method is evaluated and compared with a recent method using recursive MSPP [10].
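The scanning method amounts to a one-dimensional grid search. A minimal Python sketch is given below; the function name `scan_kappa` is hypothetical, and the quadratic cost is only a stand-in for the mean LSD that would be computed by running the LRSV estimator with each candidate κ.

```python
def scan_kappa(cost, start=0.05, stop=1.5, step=0.01):
    """Return the grid value of kappa that minimizes cost, plus the grid itself."""
    n = int(round((stop - start) / step)) + 1
    # Rounding to two decimals avoids accumulated floating-point drift.
    grid = [round(start + i * step, 2) for i in range(n)]
    best = min(grid, key=cost)
    return best, grid


# Stand-in cost with a known minimum at 0.37 (not an actual LSD curve).
best, grid = scan_kappa(lambda k: (k - 0.37) ** 2)
```

The grid from 0.05 to 1.5 in steps of 0.01 contains 146 candidates, so each scan requires 146 runs of the LRSV estimator per AIR.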

Setup
The signals processed in this letter are synthetic reverberant signals created by convolving AIRs measured in a real hall with a reverberation time of 2 s (from an open database [12]) with a male speaker signal of 15 s length. Six AIRs (referred to as AIR 1 to AIR 6) with κ ranging from 0.12 to 1.54 are adopted. Figure 1 shows the signal used in the experiment with and without reverberation. As mentioned in Section 4, the ground truth κ (fullband and 1/3-octave subband) is calculated from the measured DRR and T60 via Equation (8). The reverberation time T60 is assumed to be known; hence, the T60 used in this work is determined directly in 1/3-octave subbands by applying Schroeder's method to the AIRs. For evaluation purposes, the ground truth late reverberant speech component $z_l(n)$ is defined as the anechoic male speaker signal convolved with the tail of the AIR starting 50 ms after the direct sound. The other parameters are chosen empirically as $\kappa(0) = 1$, $\alpha_\kappa = 0.75$, and $\alpha_\xi = 0.95$, similar to reference [8]. All experiments are carried out in MATLAB.
The log spectral distortion (LSD) [4] is adopted to evaluate the LRSV estimator by computing the root mean square (RMS) value of the difference between the estimated LRSV $\hat{\lambda}_l(k, l)$ and the ground truth LRSV $\lambda_l(k, l)$, which is defined as Equation (19)
$$\mathrm{LSD}_{\mathrm{late}}(l) = \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \Big[ \mathcal{L}\{\hat{\lambda}_l(k, l)\} - \mathcal{L}\{\lambda_l(k, l)\} \Big]^2}, \tag{19}$$
where $\mathcal{L}\{\cdot\} = \max\{10 \lg|\cdot|, \delta\}$ is the log spectrum confined to a 50 dB dynamic range and $\delta = \max_{k,l}\{10 \lg|\cdot|\} - 50$. The mean LSD (referred to as LSD) is obtained by averaging Equation (19) over all frames. In addition, the lower and upper semi-variances of the error $e(k, l)$ are also calculated to evaluate the LRSV estimator [5], as in Equation (20), where $\bar{e} = \mathrm{mean}_{k,l}\{e(k, l)\}$ is the mean value of $e(k, l)$.
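The LSD computation can be sketched as follows in Python. The sketch assumes that the RMS is taken over the K frequency bins of each frame and that the 50 dB floor δ is computed separately for the estimated and the ground truth spectra; the function names and toy spectra are hypothetical.

```python
import math


def clipped_log(spec, dyn_db=50.0, tiny=1e-300):
    """Log spectrum in dB, limited to dyn_db below its own global maximum."""
    db = [[10.0 * math.log10(max(abs(v), tiny)) for v in frame] for frame in spec]
    delta = max(max(frame) for frame in db) - dyn_db
    return [[max(v, delta) for v in frame] for frame in db]


def lsd_frames(est, ref):
    """Per-frame LSD between two spectra given as lists of frames of bins."""
    log_est, log_ref = clipped_log(est), clipped_log(ref)
    out = []
    for fe, fr in zip(log_est, log_ref):
        mean_sq = sum((a - b) ** 2 for a, b in zip(fe, fr)) / len(fe)
        out.append(math.sqrt(mean_sq))
    return out


# Toy example: the estimate is 10 dB too high in every bin of both frames.
ref = [[1.0, 1.0], [1.0, 1.0]]
est = [[10.0, 10.0], [10.0, 10.0]]
per_frame = lsd_frames(est, ref)
mean_lsd = sum(per_frame) / len(per_frame)
```

A uniform 10 dB overestimate yields an LSD of 10 dB in every frame, which gives a quick sanity check on any implementation of the measure.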
To evaluate the robustness of the proposed estimator to noise, white noise was added to the synthetic reverberant signals at a variable RSNR [5], whose definition involves the additive noise spectral variance λ_v(k, l). Figure 2 depicts the mean LSD for the Habets LRSV estimator using κ obtained by different methods, including the measured ground truth κ (fullband and subband), the proposed κ estimator, the conventional κ estimator, and the scanning method.

Results and Analysis
As shown in Figure 2, a scanning-optimal κ can be obtained for each AIR as the value at which the corresponding LSD late reaches a minimum during the scanning process. Such a scanning-optimal κ is far from the measured ground truth κ, which indicates that the measured κ may not be the optimal κ for the Habets LRSV estimator. Although the measured fullband κ performs better for AIR 4 and the measured subband κ performs better for AIR 2, they perform poorly for the other AIRs. As for the proposed κ estimator, the LSD late value reaches the minimum for three AIRs and is close to the minimum for the other AIRs. This suggests that the proposed κ estimator not only performs much better than the conventional κ estimator and the measured ground truth κ (both fullband and subband), but also performs about as well as the scanning-optimal κ obtained by the scanning method. It is worth mentioning that the scanning-optimal κ may not be the true optimal κ, but it can still be regarded as an appropriate κ considering the experimental results.
Figure 3 shows the averaged log error obtained using all AIRs for varying RSNR. As the RSNR decreases, all estimators show an increasingly positive bias, which means that the LRSV estimator performs worse with background noise and should be used after a denoising algorithm. However, the whisker bars of the proposed κ estimator are always shorter than those of the other methods. In other words, the proposed κ estimator yields lower variance, which suggests that it is more robust to background noise.
Figure 4 compares the measured ground truth κ with the scanning-optimal κ. As depicted there, the scanning-optimal κ is not obviously related to the measured ground truth κ. This precludes simply applying a bias correction to the measured κ, which is sometimes done in practice.
The reason for the mismatch between the measured ground truth κ and the scanning-optimal κ may be that the generalized statistical model is a simplified approximation of the AIR, which causes an estimation error in Equation (6) that varies with the anechoic speech spectral variance λ_s(k, l). To compensate for that error, the value of κ needs to be modified, so the measured ground truth κ is not the scanning-optimal κ for the Habets LRSV estimator. To test this explanation, 13 different anechoic speech signals of 15 s length are used to obtain the corresponding scanning-optimal κ for each AIR. The results are shown in Figure 5. For different speech signals, the scanning-optimal κ varies without an obvious pattern and can differ by up to 0.54, which indicates that the optimal κ for the Habets LRSV estimator may be related not only to the DRR but also to the speech signal. In other words, it may be less effective to obtain κ indirectly via a blind DRR estimation algorithm, whereas estimating κ directly, as the proposed method does, may achieve better performance. However, this letter uses only 13 different anechoic speech signals of 15 s length each, along with six AIRs, which is not enough to prove this hypothesis; further research using more speech signals and more AIRs is needed.

Speech Dereverberation
Furthermore, the quality of the dereverberated speech obtained with the estimated LRSV is evaluated, with the log-spectral amplitude gain function [11] adopted to suppress the late reverberant speech component. A method using recursive MSPP [10] is also evaluated as a reference. The measures are the segmental SRR and LSD (averaged over all frames) between the estimated and true early speech component [3,4], the short-time objective intelligibility (STOI) [13], and the perceptual evaluation of speech quality (PESQ) [14]. The results are averaged over all AIRs and presented in Table 1. The proposed method achieves the best performance in three measures and performs only slightly worse in LSD, which supports the superiority of the proposed estimator over conventional approaches. The results also indicate that the LRSV estimator using the proposed method performs even better than that using the measured ground truth κ (fullband and subband). Because a single measure is not convincing on its own, this letter uses four measures to jointly judge the performance of the proposed method. Hence, although MSPP achieves a better LSD than the proposed method, the proposed algorithm is still considered superior when all four measures are taken into account.

Conclusions
This work improves the Habets LRSV estimator by proposing an adaptive κ estimator. We differentiate between the direct sound presence and absence hypotheses, and derive the frame conditional direct sound presence probability p(l) using Bayes' rule. Under the direct sound absence hypothesis, the estimated κ in the lth frame, κ(l), is obtained under the assumption that λ_z(k, l) = λ_r(k, l) given H0(l). Finally, κ(l) is recursively averaged with a time-varying smoothing parameter α_κ(l), which is adjusted by the frame conditional direct sound presence probability p(l).
The proposed κ estimator has been evaluated and compared with the conventional κ estimator and a recently proposed recursive MSPP method. Experimental results show that the LRSV estimator using the proposed κ estimator outperforms the other methods. It is also found that the ground truth κ calculated from the measured DRR is not necessarily the optimal κ for the LRSV estimator, since the optimal κ may be affected by the speech signal. This suggests estimating κ directly and adaptively rather than obtaining it from a blind DRR estimation algorithm, which may be a less effective approach. However, further research is needed to confirm this hypothesis.

Conflicts of Interest:
The authors declare no conflict of interest.