Robustness and Sensitivity Tuning of the Kalman Filter for Speech Enhancement

Abstract: Inaccurate estimates of the linear prediction coefficient (LPC) parameters and noise variance introduce bias into the Kalman filter (KF) gain and degrade speech enhancement performance. Existing methods tune the biased Kalman gain, but only in stationary noise conditions. This paper introduces a tuning of the KF gain for speech enhancement in real-life noise conditions. First, we estimate the noise from each noisy speech frame using a speech presence probability (SPP) method and compute the noise variance. Then, we construct a whitening filter (with its coefficients computed from the estimated noise) to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. We then construct the KF with the estimated parameters, where a robustness metric offsets the bias in the KF gain during speech absence and a sensitivity metric does so during speech presence, to achieve better noise reduction. The noise variance and the speech model parameters are adopted as a speech activity detector. The reduced-bias Kalman gain enables the KF to suppress the noise significantly, yielding the enhanced speech. Objective and subjective scores on the NOIZEUS corpus demonstrate that the enhanced speech produced by the proposed method exhibits higher quality and intelligibility than some benchmark methods.

The Kalman filter (KF) was first used for speech enhancement by Paliwal and Basu [12]. In the KF, a speech signal is represented by an autoregressive (AR) process, whose parameters comprise the linear prediction coefficients (LPCs) and the prediction error variance. The LPC parameters and the noise variance are used to construct the KF recursion equations. The KF gives a linear MMSE estimate of the current state of the clean speech given the observed noisy speech for each sample within a frame. Therefore, the performance of a KF-based speech enhancement algorithm (SEA) largely depends on how accurately the LPC parameters and noise variance are estimated. Experiments demonstrated that the KF shows excellent performance in stationary white Gaussian noise (WGN) conditions when the LPC parameters are estimated from clean speech [12]. On the contrary, LPC parameters and noise variance computed directly from the noisy speech are inaccurate and unreliable, which leads to performance degradation.
In [13], Gibson et al. introduced an augmented KF (AKF) to enhance colored noise-corrupted speech. In this SEA, the clean speech and the noise signal are represented by two AR processes. The speech and noise LPC parameters are incorporated in an augmented matrix form to construct the recursive equations of the AKF. In [13], the AKF processes the colored noise-corrupted speech iteratively (usually 3-4 iterations) to eliminate the noise, yielding the enhanced speech. Specifically, the LPC parameters for the current frame are computed from the corresponding filtered speech frame of the previous iteration of the AKF. Although the enhanced speech of the AKF demonstrates an improvement in signal-to-noise ratio (SNR), it suffers from musical noise and speech distortion. Therefore, this method [13] does not adequately address the inaccurate LPC parameter estimation issue in practice.
In [17], So and Paliwal proposed a modulation-domain KF (MDKF) for speech enhancement. It was claimed that the modulation domain models the long-term correlation of speech better than the time domain. It was shown that the MDKF exhibits better objective scores than the time-domain KF (TDKF), particularly in the oracle case (LPC parameters computed from clean speech). However, clean speech is unobserved in practice. For practical applications, they combined a traditional MMSE-STSA estimator [9] with the MDKF. Specifically, the MMSE-STSA was used to pre-filter the noisy speech in the acoustic domain. Then, the pre-filtered speech was transformed into the modulation domain prior to computing the LPC parameters. Therefore, they do not adequately address LPC parameter estimation directly from the noisy speech in the modulation domain. Technically, the characteristics of the speech signal in the acoustic domain are entirely different from those in the modulation domain. Due to this limitation, it is quite difficult to assess the performance of the MDKF for speech enhancement in practice. Roy et al. introduced a sub-band iterative KF (SBIT-KF)-based SEA [18]. Among the 16 decomposed sub-bands (SBs) of the noisy speech for a given utterance, this method enhances only the high-frequency SBs using an iterative KF, with the assumption that the impact of noise on the low-frequency SBs is negligible. However, the low-frequency SBs can also be affected by noise, typically when operating in real-life noise conditions. As demonstrated in [13], the SBIT-KF [18] also suffers from speech distortion due to the iterative processing of the noisy speech by the KF.
In [19], Saha et al. proposed a robustness metric and a sensitivity metric for tuning the biased KF gain in instrument engineering applications. Later on, So et al. applied the tuning of the KF gain to speech enhancement in the WGN condition [20,21]. Specifically, the enhanced speech (for each sample within a noisy speech frame) is given by recursively averaging the observed noisy speech and the predicted speech, weighted by a scalar KF gain [20]. However, inaccurate estimates of the LPC parameters introduce bias into the KF gain, leaking significant residual noise into the enhanced speech. In [20], a robustness metric is used to offset the bias in the KF gain for speech enhancement. However, So et al. further showed that the robustness metric strongly suppresses the KF gain in speech regions, resulting in distorted speech [21]. In [21], a sensitivity metric was instead used to offset the bias in the KF gain, which produced less distorted speech. In [22], George et al. proposed a robustness metric-based tuning of the AKF (AKF-RMBT) for enhancing colored noise-corrupted speech. As in [20], the adjusted AKF gain is underestimated in speech regions, resulting in distorted speech.
The existing KF methods [20,21] address the tuning of the biased Kalman gain in the WGN condition under the prior assumption that the impact of WGN on the LPCs is negligible. Although the AKF method [22] performs tuning of the biased gain in colored noise conditions, it still produces distorted speech. In this paper, we address the tuning of the KF gain for speech enhancement in real-life noise conditions. For this purpose, we estimate the noise from each noisy speech frame using an SPP-based method to compute the noise variance. To minimize bias in the LPC parameters, we compute them from pre-whitened speech. Then, the KF is constructed with the estimated parameters. To achieve better noise reduction, the robustness metric is applied to offset the bias in the Kalman gain during speech absence, and the sensitivity metric during speech presence, of the noisy speech. We also adopt the noise variance and the AR model parameters as a speech activity detector. The reduced-bias KF gain exhibits better suppression of noise in the enhanced speech. The performance of the proposed SEA is compared against some benchmark methods using objective and subjective testing.
The structure of this paper is as follows: Section 2 describes the KF for speech enhancement, including the paradigm shift of the KF recursive equations and the impact of the biased KF gain on KF-based speech enhancement in WGN and real-life noise conditions. In Section 3, we describe the proposed SEA, which includes the proposed parameter estimation and the proposed Kalman gain tuning algorithm. Following this, Section 4 describes the experimental setup in terms of the speech corpus, objective and subjective evaluation metrics, and specifications of the competitive SEAs. The experimental results are then presented in Section 5. Finally, Section 6 gives some concluding remarks.

Kalman Filter for Speech Enhancement
Assuming that the noise, $v(n)$, is additive and uncorrelated with the clean speech, $s(n)$, at sample $n$, the noisy speech, $y(n)$, can be represented as:

$$y(n) = s(n) + v(n). \tag{1}$$

The clean speech, $s(n)$, can be represented by a $p$-th order autoregressive (AR) model as ([23], Chapter 8):

$$s(n) = \sum_{i=1}^{p} a_i s(n-i) + w(n), \tag{2}$$

where $\{a_i;\ i = 1, 2, \ldots, p\}$ are the LPCs and $w(n)$ is assumed to be white noise with zero mean and variance $\sigma_w^2$. Equations (1) and (2) can be used to form the following state-space model (SSM) of the KF (where the bold variables denote vector/matrix quantities, as opposed to unbolded variables for scalar quantities):

$$\mathbf{x}(n) = \boldsymbol{\Phi}\mathbf{x}(n-1) + \mathbf{d}w(n),$$
$$y(n) = \mathbf{c}^{\top}\mathbf{x}(n) + v(n).$$

In the above SSM:
1. $\mathbf{x}(n)$ is a $p \times 1$ state vector at sample $n$, given by $\mathbf{x}(n) = [s(n),\ s(n-1),\ \ldots,\ s(n-p+1)]^{\top}$;
2. $\boldsymbol{\Phi}$ is a $p \times p$ state transition matrix, represented as:
$$\boldsymbol{\Phi} = \begin{bmatrix} a_1 & a_2 & \cdots & a_{p-1} & a_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix};$$
3. $\mathbf{d}$ and $\mathbf{c}$ are the $p \times 1$ measurement vectors for the excitation noise and observation, written as $\mathbf{d} = \mathbf{c} = [1,\ 0,\ \ldots,\ 0]^{\top}$;
4. $y(n)$ is the observed noisy speech at sample $n$.
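As an illustration, the SSM matrices above can be assembled directly from a set of LPCs. The following is a minimal NumPy sketch (the function name is ours, not the paper's):

```python
import numpy as np

def make_state_space(lpc):
    """Assemble Phi, d, c of the SSM above from the LPCs {a_i}.

    Minimal sketch (function name ours): the first row of Phi holds the
    LPCs, the sub-diagonal shifts the state, and d = c = [1, 0, ..., 0]^T.
    """
    p = len(lpc)
    Phi = np.zeros((p, p))
    Phi[0, :] = lpc                 # first row: a_1 ... a_p
    Phi[1:, :-1] = np.eye(p - 1)    # shift s(n-1) ... s(n-p+1) down
    d = np.zeros((p, 1)); d[0, 0] = 1.0
    c = d.copy()
    return Phi, d, c
```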
During the operation of the KF, the noisy speech, $y(n)$, is windowed into non-overlapped and short (e.g., 20 ms) frames. For a particular frame, the KF recursively computes an unbiased linear MMSE estimate, $\hat{\mathbf{x}}(n|n)$, of the state vector, $\mathbf{x}(n)$, given the observed noisy speech up to sample $n$, i.e., $y(1), y(2), \ldots, y(n)$, using the following equations [12]:

$$\hat{\mathbf{x}}(n|n-1) = \boldsymbol{\Phi}\hat{\mathbf{x}}(n-1|n-1), \tag{7}$$
$$\boldsymbol{\Psi}(n|n-1) = \boldsymbol{\Phi}\boldsymbol{\Psi}(n-1|n-1)\boldsymbol{\Phi}^{\top} + \sigma_w^2\mathbf{d}\mathbf{d}^{\top}, \tag{8}$$
$$\mathbf{K}(n) = \boldsymbol{\Psi}(n|n-1)\mathbf{c}\left[\mathbf{c}^{\top}\boldsymbol{\Psi}(n|n-1)\mathbf{c} + \sigma_v^2\right]^{-1}, \tag{9}$$
$$\hat{\mathbf{x}}(n|n) = \hat{\mathbf{x}}(n|n-1) + \mathbf{K}(n)\left[y(n) - \mathbf{c}^{\top}\hat{\mathbf{x}}(n|n-1)\right], \tag{10}$$
$$\boldsymbol{\Psi}(n|n) = \left[\mathbf{I} - \mathbf{K}(n)\mathbf{c}^{\top}\right]\boldsymbol{\Psi}(n|n-1). \tag{11}$$

In the above Equations (7)-(11), $\boldsymbol{\Psi}(n|n-1)$ and $\boldsymbol{\Psi}(n|n)$ are the error covariance matrices of the a priori and a posteriori state estimates, $\hat{\mathbf{x}}(n|n-1)$ and $\hat{\mathbf{x}}(n|n)$; $\mathbf{K}(n)$ is the Kalman gain; $\sigma_v^2$ is the variance of the additive noise, $v(n)$; and $\mathbf{I}$ is the identity matrix. During the processing of each frame, the estimated LPC parameters, $(\{a_i\}, \sigma_w^2)$, and the noise variance, $\sigma_v^2$, remain unchanged for that frame, while $\mathbf{K}(n)$, $\boldsymbol{\Psi}(n|n)$, and $\hat{\mathbf{x}}(n|n)$ are continually updated on a sample-wise basis. As demonstrated in [20,21], the estimated speech at sample $n$ is given by $\hat{s}(n|n) = \mathbf{c}^{\top}\hat{\mathbf{x}}(n|n)$. Once all noisy speech frames have been processed, synthesis of the enhanced frames yields the enhanced speech, $\hat{s}(n)$.
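The recursion in Equations (7)-(11) can be sketched per frame as follows. This is an illustrative NumPy implementation under the paper's assumptions (frame-constant LPC parameters and noise variance); the function name and the zero/identity initialization of the state and covariance are our own choices:

```python
import numpy as np

def kalman_filter_frame(y, lpc, var_w, var_v):
    """Run the KF recursion of Equations (7)-(11) over one noisy frame.

    `lpc`, `var_w` (excitation variance), and `var_v` (noise variance) are
    the frame's estimated parameters and stay fixed, while K(n), Psi(n|n),
    and x(n|n) update sample-wise. Initialization is our assumption.
    """
    p = len(lpc)
    Phi = np.zeros((p, p))
    Phi[0, :] = lpc                 # first row holds the LPCs
    Phi[1:, :-1] = np.eye(p - 1)    # sub-diagonal shifts the state
    c = np.zeros((p, 1)); c[0, 0] = 1.0
    d = c.copy()
    x = np.zeros((p, 1))            # a posteriori state estimate
    Psi = np.eye(p)                 # a posteriori error covariance
    s_hat = np.empty(len(y), dtype=float)
    for n, yn in enumerate(y):
        x_prior = Phi @ x                                            # (7)
        Psi_prior = Phi @ Psi @ Phi.T + var_w * (d @ d.T)            # (8)
        K = Psi_prior @ c / ((c.T @ Psi_prior @ c).item() + var_v)   # (9)
        innov = yn - (c.T @ x_prior).item()
        x = x_prior + K * innov                                      # (10)
        Psi = (np.eye(p) - K @ c.T) @ Psi_prior                      # (11)
        s_hat[n] = (c.T @ x).item()   # s_hat(n|n) = c^T x(n|n)
    return s_hat
```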

Paradigm Shift of Recursive Equations
The paradigm shift of the recursive Equations (7)-(11) transforms them into scalar form, which aids the understanding and analysis of the KF operation in the speech enhancement context. The simplification starts with the output of the KF, which is re-written as [20,21]:

$$\hat{s}(n|n) = \mathbf{c}^{\top}\hat{\mathbf{x}}(n|n). \tag{12}$$

To transform the a posteriori state estimate, $\hat{\mathbf{x}}(n|n)$, from vector to scalar notation, we multiply both sides of Equation (10) by $\mathbf{c}^{\top}$, i.e.,

$$\mathbf{c}^{\top}\hat{\mathbf{x}}(n|n) = \mathbf{c}^{\top}\hat{\mathbf{x}}(n|n-1) + \mathbf{c}^{\top}\mathbf{K}(n)\left[y(n) - \mathbf{c}^{\top}\hat{\mathbf{x}}(n|n-1)\right]. \tag{13}$$

According to Equation (12), $\mathbf{c}^{\top}\hat{\mathbf{x}}(n|n-1)$ is also given by:

$$\hat{s}(n|n-1) = \mathbf{c}^{\top}\hat{\mathbf{x}}(n|n-1). \tag{14}$$

In Equation (13), $\mathbf{c}^{\top}\mathbf{K}(n)$ represents the first component, $K_0(n)$, of the Kalman gain vector, $\mathbf{K}(n)$, i.e.,

$$K_0(n) = \mathbf{c}^{\top}\mathbf{K}(n). \tag{15}$$

Substituting Equation (9) into Equation (15) gives:

$$K_0(n) = \frac{\mathbf{c}^{\top}\boldsymbol{\Psi}(n|n-1)\mathbf{c}}{\mathbf{c}^{\top}\boldsymbol{\Psi}(n|n-1)\mathbf{c} + \sigma_v^2}. \tag{16}$$

With Equation (8), $\mathbf{c}^{\top}\boldsymbol{\Psi}(n|n-1)\mathbf{c}$ of Equation (16) is simplified as:

$$\mathbf{c}^{\top}\boldsymbol{\Psi}(n|n-1)\mathbf{c} = \mathbf{c}^{\top}\boldsymbol{\Phi}\boldsymbol{\Psi}(n-1|n-1)\boldsymbol{\Phi}^{\top}\mathbf{c} + \sigma_w^2\mathbf{c}^{\top}\mathbf{d}\mathbf{d}^{\top}\mathbf{c}. \tag{17}$$

The linear algebra operation on $\sigma_w^2\mathbf{c}^{\top}\mathbf{d}\mathbf{d}^{\top}\mathbf{c}$ gives:

$$\sigma_w^2\mathbf{c}^{\top}\mathbf{d}\mathbf{d}^{\top}\mathbf{c} = \sigma_w^2, \tag{18}$$

and $\mathbf{c}^{\top}\boldsymbol{\Phi}\boldsymbol{\Psi}(n-1|n-1)\boldsymbol{\Phi}^{\top}\mathbf{c}$ represents the transmission of the a posteriori error variance by the speech model from the previous time sample, $n-1$, denoted as [21]:

$$\alpha^2(n) = \mathbf{c}^{\top}\boldsymbol{\Phi}\boldsymbol{\Psi}(n-1|n-1)\boldsymbol{\Phi}^{\top}\mathbf{c}. \tag{19}$$

Substituting Equations (18) and (19) into Equation (17) gives:

$$\mathbf{c}^{\top}\boldsymbol{\Psi}(n|n-1)\mathbf{c} = \alpha^2(n) + \sigma_w^2. \tag{20}$$

From Equations (20) and (16), $K_0(n)$ is given by:

$$K_0(n) = \frac{\alpha^2(n) + \sigma_w^2}{\alpha^2(n) + \sigma_w^2 + \sigma_v^2}. \tag{21}$$

Substituting Equations (12), (14), and (15) into Equation (13) gives:

$$\hat{s}(n|n) = \hat{s}(n|n-1) + K_0(n)\left[y(n) - \hat{s}(n|n-1)\right]. \tag{22}$$

Re-arranging Equation (22) yields:

$$\hat{s}(n|n) = \left[1 - K_0(n)\right]\hat{s}(n|n-1) + K_0(n)y(n). \tag{23}$$

Equation (23) implies that an accurate estimate of $\hat{s}(n|n)$ (the output of the KF) will be achieved if $K_0(n)$ becomes unbiased. However, in practice, inaccurate estimates of $(\{a_i\}, \sigma_w^2)$ and $\sigma_v^2$ introduce bias into $K_0(n)$, resulting in a degraded $\hat{s}(n|n)$. In [19], Saha et al.
introduced a robustness metric, $J_2(n)$, and a sensitivity metric, $J_1(n)$, to quantify the level of robustness and sensitivity of the KF, which can be used to offset the bias in $K_0(n)$. In the speech enhancement context, the $J_2(n)$ and $J_1(n)$ metrics can be computed by simplifying the mean squared error, $\mathbf{c}^{\top}\boldsymbol{\Psi}(n|n)\mathbf{c}$, of the KF output, $\hat{s}(n|n)$, as [20,21]:

$$\mathbf{c}^{\top}\boldsymbol{\Psi}(n|n)\mathbf{c} = \mathbf{c}^{\top}\left[\mathbf{I} - \mathbf{K}(n)\mathbf{c}^{\top}\right]\boldsymbol{\Psi}(n|n-1)\mathbf{c}. \tag{24}$$

Substituting Equations (15) and (20) into Equation (24) gives:

$$\Psi_0(n) = \left[1 - K_0(n)\right]\left[\alpha^2(n) + \sigma_w^2\right], \tag{25}$$

where $\Psi_0(n) = \mathbf{c}^{\top}\boldsymbol{\Psi}(n|n)\mathbf{c}$ is the scalar a posteriori mean squared error, and $J_2(n)$ and $J_1(n)$ are the robustness and sensitivity metrics of the KF given in [20,21]. The KF-based SEAs in [20,21] address the tuning of $K_0(n)$ using the $J_2(n)$ and $J_1(n)$ metrics for speech enhancement in the WGN condition, as described next.
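The scalar form can be verified numerically: the first component of the vector gain computed via Equations (8) and (9) must coincide with the closed form of Equation (21). A small self-contained check (all parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
lpc = np.array([0.6, -0.3, 0.1, 0.05])     # arbitrary LPCs
var_w, var_v = 0.5, 2.0                    # arbitrary variances

# An arbitrary symmetric positive-definite Psi(n-1|n-1)
A = rng.standard_normal((p, p))
Psi = A @ A.T + p * np.eye(p)

Phi = np.zeros((p, p)); Phi[0, :] = lpc; Phi[1:, :-1] = np.eye(p - 1)
c = np.zeros((p, 1)); c[0, 0] = 1.0
d = c.copy()

# Vector route: Equations (8) and (9)
Psi_prior = Phi @ Psi @ Phi.T + var_w * (d @ d.T)
K = Psi_prior @ c / ((c.T @ Psi_prior @ c).item() + var_v)

# Scalar route: Equations (19)-(21)
alpha2 = (c.T @ Phi @ Psi @ Phi.T @ c).item()
K0 = (alpha2 + var_w) / (alpha2 + var_w + var_v)

print(abs((c.T @ K).item() - K0))  # agreement up to floating-point error
```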

Impact of Biased K_0(n) on KF-Based Speech Enhancement in WGN Condition
We analyze the shortcomings of the existing KF-based SEAs [20,21] in terms of the bias in $K_0(n)$. For this purpose, we conducted an experiment with the utterance sp05 ("Wipe the grease off his dirty face") of the NOIZEUS corpus ([1], Chapter 12) (sampled at 8 kHz), corrupted with WGN [24] at 5 dB SNR. In [20,21], a 20 ms non-overlapped rectangular window was considered for converting $y(n)$ into frames as:

$$y(n, k) = y(kM + n), \tag{28}$$

where $k \in \{0, 1, 2, \ldots, N-1\}$ is the frame index, $N$ is the total number of frames in an utterance, and $M$ is the total number of samples in each frame, i.e., $n \in \{0, 1, 2, \ldots, M-1\}$.
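The non-overlapped framing of Equation (28) amounts to a simple reshape; a minimal sketch (the function name, and dropping trailing samples that do not fill a whole frame, are our own choices):

```python
import numpy as np

def frame_signal(y, M):
    """Non-overlapped rectangular framing, y(n, k) = y(kM + n).

    Minimal sketch of Equation (28); trailing samples that do not fill a
    whole frame are dropped (our assumption).
    """
    y = np.asarray(y)
    N = len(y) // M                  # total number of frames
    return y[:N * M].reshape(N, M)   # frames[k, n] = y(kM + n)
```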
In [20], So et al. first analyze $K_0(n)$ in the oracle case, where $(\{a_i\}, \sigma_w^2)$ ($p = 10$) and $\sigma_v^2$ are computed from each frame of the clean speech and the noise signal, $s(n, k)$ and $v(n, k)$.
It can be seen that $K_0(n)$ approaches 1 during speech presence in the noisy speech, which passes almost clean speech to the output (e.g., 0.16-0.33 s or 0.9-1.06 s in Figure 1d,e). Conversely, $K_0(n)$ remains at approximately 0 during speech absence, which does not pass any corrupting noise (e.g., 0-0.15 s or 1.8-2.19 s in Figure 1d,e). As a result, the KF-oracle method produces enhanced speech with less residual background noise as well as less speech distortion (Figure 1e).
In the non-oracle case, it is also observed that $J_2(n) \approx 1$ typically during speech pauses of $y(n, k)$ (e.g., 0-0.15 s or 1.8-2.19 s in Figure 1c). Therefore, the $J_2(n)$ metric is found to be useful in tuning the biased $K_0(n)$ [20]. Figure 1d reveals that the tuned $K_0(n) \approx 0$ during speech pauses. However, $K_0(n)$ is over-suppressed during speech presence of $y(n, k)$, resulting in distorted speech, as shown in Figure 1g. To address this, So et al. proposed a $J_1(n)$ metric-based tuning of $K_0(n)$ [21]. It can be seen from Figure 1c that $J_1(n)$ lies around 0.5 during speech pauses (e.g., 0-0.15 s or 1.8-2.19 s), whereas it approaches 0 in speech regions (e.g., 0.16-0.33 s or 0.9-1.06 s). Therefore, the tuning of $K_0(n)$ is performed by subtracting the $J_1(n)$ metric from the biased gain [21]. It can be seen from Figure 1d that the resulting $K_0(n)$ is closely similar to the oracle $K_0(n)$, which minimizes distortion in the enhanced speech (Figure 1h) as compared to Figure 1g.
Technically, real-life (colored/non-stationary) noise may contain time-varying amplitudes, which impact $(\{a_i\}, \sigma_w^2)$ significantly, as opposed to the negligible impact of WGN on these parameters [20,21]. Therefore, the assumption $\hat{\sigma}_w^2 = \sigma_w^2 + \sigma_v^2$ made in [20,21] is invalid for real-life noise conditions. Moreover, the existing methods [20,21] do not analyze the impact of the noise variance, $\sigma_v^2$, on $K_0(n)$. According to Equation (21), in addition to $\alpha^2(n)$ and $\sigma_w^2$, $\sigma_v^2$ is also an important parameter for computing $K_0(n)$ accurately. In light of these observations, the methods in [20,21] are not applicable to speech enhancement in real-life noise conditions. Therefore, we performed a detailed analysis of the biasing effect of $K_0(n)$ on KF-based speech enhancement in real-life noise conditions.
To analyze $K_0(n)$ and its impact on KF-based speech enhancement, we repeated the experiment in Figure 1, except that the utterance sp05 was corrupted with a typical real-life non-stationary noise, babble [24], at 5 dB SNR. A 32 ms rectangular window with 50% overlap ([25], Section 7.2.1) was considered for converting $y(n)$ into frames, $y(n, k)$ (as in Equation (28)).
In the AKF-RMBT method, the speech LPC parameters were computed from the pre-whitened speech so that the $J_2(n)$ metric could be utilized for the tuning of the biased $K_0(n)$ in colored noise conditions ([22], Figure 5d). As in [20], the $J_2(n)$ metric-based tuning of $K_0(n)$ still produces distorted speech. In addition, the noise LPC parameters, computed from the initial speech pauses, are kept constant during the processing of all noisy speech frames of an utterance. The whitening filter was also constructed with these constant noise LPCs to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. As a result, the tuning of $K_0(n)$ [22] becomes irrelevant in conditions with time-varying amplitudes, such as babble noise.
Motivated by the shortcomings of [20-22], we propose a $J_2(n)$ and $J_1(n)$ metric-based tuning of the KF gain, $K_0(n)$, for speech enhancement in real-life noise conditions.

[Figure 2: ... $\sigma_v^2$ computed in the oracle and non-oracle cases; (f) $J_2(n)$ and $J_1(n)$ computed from the noisy speech in (b); spectrograms of enhanced speech produced by (g) the KF-oracle method and (h) the KF-non-oracle method.]

Proposed Speech Enhancement Algorithm
Figure 3 shows the block diagram of the proposed SEA. Firstly, $y(n)$ is converted into frames, $y(n, k)$, with the same setup as used in Section 2.3.
To carry out the tuning of $K_0(n)$ in real-life noise conditions, the $J_2(n)$ and $J_1(n)$ metrics, which are biased in such conditions (Figure 2f), should exhibit characteristics similar to those observed in the WGN condition (Figure 1c). This can be achieved by improving the estimates of $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ and $\hat{\sigma}_v^2$, as described in Section 3.1.

Parameter Estimation
It is known that $(\{a_i\}, \sigma_w^2)$ are very sensitive to real-life noises. Since the clean speech, $s(n, k)$, is unavailable in practice, it is difficult to estimate these parameters accurately. Therefore, we first focus on estimating the noise, $\hat{v}(n, k)$, for each noisy speech frame using the speech presence probability (SPP) method (described in Section 3.2) [26] to compute $\hat{\sigma}_v^2$. Given $\hat{v}(n, k)$, $\hat{\sigma}_v^2$ is computed as:

$$\hat{\sigma}_v^2 = \frac{1}{M}\sum_{n=0}^{M-1}\hat{v}^2(n, k).$$

To reduce the bias in the estimated $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ for each noisy speech frame, we computed them from the corresponding pre-whitened speech, $y_w(n, k)$, using the autocorrelation method [23]. The frame-wise $y_w(n, k)$ was obtained by applying a whitening filter, $H_w(z)$, to $y(n, k)$. $H_w(z)$ is given by [23]:

$$H_w(z) = 1 - \sum_{j=1}^{q}\hat{b}_j z^{-j},$$

where the coefficients $\{\hat{b}_j\}$ ($q = 20$) are computed from $\hat{v}(n, k)$ using the autocorrelation method [23].
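The two estimation steps above, LPC analysis by the autocorrelation method and pre-whitening with $H_w(z)$, can be sketched as follows. This is our own minimal implementation of the standard Levinson-Durbin recursion, not the paper's code; the function names and the per-sample normalization of the error variance are assumptions:

```python
import numpy as np

def lpc_autocorr(x, order):
    """LPC via the autocorrelation method (Levinson-Durbin recursion).

    Returns predictor coefficients {a_i} (so that x(n) ~ sum_i a_i x(n-i))
    and a per-sample prediction-error variance. Sketch only.
    """
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)     # a[1..i] holds the current predictor
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a_new = a.copy()
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        err *= (1.0 - k * k)
    return a[1:], err / len(x)

def whiten(y, b):
    """Apply H_w(z) = 1 - sum_j b_j z^{-j} to pre-whiten a frame."""
    y = np.asarray(y, dtype=float)
    yw = y.copy()
    for j, bj in enumerate(b, start=1):
        yw[j:] -= bj * y[:-j]
    return yw
```

In use, the noise LPCs `b` would come from `lpc_autocorr` applied to the estimated noise frame, and the speech LPCs from `lpc_autocorr` applied to the whitened frame.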

Proposed v(n, k) Estimation Method
The proposed noise estimation is performed in the acoustic domain using the SPP method [26]; for full details, we refer the reader to [26], and we briefly review the SPP-based noise estimation in this section. For this purpose, the noisy speech, $y(n)$ (Equation (1)), is analyzed frame-wise using the short-time Fourier transform (STFT):

$$Y_k(m) = S_k(m) + V_k(m),$$

where $Y_k(m)$, $S_k(m)$, and $V_k(m)$ denote the complex-valued STFT coefficients of the noisy speech, the clean speech, and the noise signal, respectively, for time-frame index $k$ and frequency-bin index $m \in \{0, 1, \ldots, 255\}$. A Hamming window with 50% overlap was used in the STFT analysis ([25], Section 7.2.1). In polar form, $Y_k(m)$, $S_k(m)$, and $V_k(m)$ can be expressed as $Y_k(m) = R_k(m)e^{j\phi_k(m)}$, $S_k(m) = A_k(m)e^{j\varphi_k(m)}$, and $V_k(m) = D_k(m)e^{j\theta_k(m)}$, where $R_k(m)$, $A_k(m)$, and $D_k(m)$ are the magnitude spectra of the noisy speech, the clean speech, and the noise signal, respectively, and $\phi_k(m)$, $\varphi_k(m)$, and $\theta_k(m)$ are the corresponding phase spectra. We processed each frequency bin of the single-sided noisy speech power spectrum, $R_k^2(m)$, with $m \in \{0, 1, \ldots, 128\}$ (containing the DC and Nyquist frequency components), to estimate the noise power spectrum, $\hat{D}_k^2(m)$. To initialize the algorithm, we considered the first frame ($k = 0$) of $R_0^2(m)$ as silent, giving an estimate of the noise power, $\hat{D}_0^2(m) = R_0^2(m)$. The noise PSD, $\hat{\lambda}_0(m)$, was also initialized as $\hat{\lambda}_0(m) = \hat{D}_0^2(m)$. For $k \geq 1$, using the speech presence uncertainty principle [26], an MMSE estimate of $\hat{D}_k^2(m)$ at the $m$-th frequency bin is given by:

$$\hat{D}_k^2(m) = P(H_0^m|R_k(m))R_k^2(m) + P(H_1^m|R_k(m))\hat{\lambda}_{k-1}(m),$$

where $P(H_0^m|R_k(m))$ and $P(H_1^m|R_k(m))$ are the conditional probabilities of speech absence and speech presence, respectively, given $R_k(m)$ at the $m$-th frequency bin.
The simplified $P(H_1^m|R_k(m))$ estimate is given by:

$$P(H_1^m|R_k(m)) = \left[1 + (1 + \xi_{\mathrm{opt}})\exp\!\left(-\frac{R_k^2(m)}{\hat{\lambda}_{k-1}(m)}\cdot\frac{\xi_{\mathrm{opt}}}{1 + \xi_{\mathrm{opt}}}\right)\right]^{-1},$$

where $\xi_{\mathrm{opt}}$ is the optimal a priori SNR. (The simplification results from assuming equal a priori probabilities of speech absence and presence, $P(H_0) = P(H_1)$ [26].)
In [26], the optimal choice for $\xi_{\mathrm{opt}}$ is found to be $10\log_{10}(\xi_{\mathrm{opt}}) = 15$ dB, and the noise PSD is updated by recursive smoothing:

$$\hat{\lambda}_k(m) = \eta\hat{\lambda}_{k-1}(m) + (1 - \eta)\hat{D}_k^2(m),$$

where the smoothing constant, $\eta$, is set to 0.9. The IDFT of $P_v(m)e^{j\phi_k(m)}$ yields the estimated noise, $\hat{v}(n, k)$, where $P_v(m) = \sqrt{\hat{\lambda}_k(m)}$. To ensure conjugate symmetry, the components of $P_v(m)$ at $m \in \{1, 2, \ldots, 127\}$ are mirrored onto $m \in \{129, 130, \ldots, 255\}$ before taking the IDFT. We justify the improvement in the $\hat{v}(n, k)$ estimate obtained with the SPP method [26] by analyzing the tuning parameters of the KF in Section 3.3.
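The SPP-based noise PSD tracking reviewed above can be sketched as follows, assuming the update rules of [26] exactly as stated here; the function name and array layout are our own:

```python
import numpy as np

def spp_noise_psd(power_frames, xi_opt_db=15.0, eta=0.9):
    """Track the noise PSD with the SPP rules reviewed above (after [26]).

    `power_frames` holds the noisy power spectra R_k^2(m), one row per
    frame. The first frame is assumed silent (lambda_0(m) = R_0^2(m)).
    """
    xi = 10.0 ** (xi_opt_db / 10.0)            # 15 dB optimal a priori SNR
    lam = np.asarray(power_frames[0], dtype=float).copy()
    track = [lam.copy()]
    for R2 in power_frames[1:]:
        # P(H1 | R_k(m)): per-bin speech presence probability
        p_h1 = 1.0 / (1.0 + (1.0 + xi) * np.exp(-(R2 / lam) * xi / (1.0 + xi)))
        # MMSE noise-power estimate under speech presence uncertainty
        d2 = (1.0 - p_h1) * R2 + p_h1 * lam
        lam = eta * lam + (1.0 - eta) * d2     # recursive smoothing, eta = 0.9
        track.append(lam.copy())
    return np.array(track)
```

A frame with much higher power than the tracked PSD gets a speech-presence probability near 1, so the noise estimate is barely updated there, which is the behavior the method relies on.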

Proposed K_0(n) Tuning Method
Firstly, we constructed the KF with $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ and $\hat{\sigma}_v^2$ and extracted the tuning parameters, as shown in Figure 4. It can be seen from Figure 4a that $[\alpha^2(n) + \hat{\sigma}_w^2]$ achieves similar characteristics to the KF-oracle method (Figure 2d). Unlike $\hat{\sigma}_v^2$ in the non-oracle case (Figure 2e), $\hat{\sigma}_v^2$ becomes lower than $[\alpha^2(n) + \hat{\sigma}_w^2]$, as usually occurs in the oracle case (Figure 2d). The improvement in these parameters also enables the $J_2(n)$ and $J_1(n)$ metrics (Figure 4b) to achieve quite similar characteristics to those in the WGN condition (Figure 1c). Therefore, the $J_2(n)$ and $J_1(n)$ metrics (Figure 4b) are now eligible to dynamically tune $K_0(n)$ in real-life noise conditions. However, our investigation reveals that the $J_2(n)$ metric is useful in tuning $K_0(n)$ only during speech pauses, since it underestimates $K_0(n)$ during speech presence of the noisy speech [21]. On the contrary, since the $J_1(n)$ metric approaches 0 in speech regions of the noisy speech, according to Equation (33), it minimizes the underestimation of $K_0(n)$. In light of these observations, for each sample of $y(n, k)$, we incorporated the $J_2(n)$ metric during speech pauses and the $J_1(n)$ metric during speech presence to dynamically offset the bias in $K_0(n)$. The proposed tuning algorithm requires a speech activity detector that operates on a sample-by-sample basis, whereas existing speech activity detectors operate on a frame-by-frame basis; in addition, incorporating an external speech activity detector would make the proposed tuning algorithm more complex. To cope with these issues, we found that the KF parameters themselves can be adopted as a speech activity detector that operates on a sample-by-sample basis. Specifically, $[\alpha^2(n) + \hat{\sigma}_w^2]$ and $\hat{\sigma}_v^2$ can be adopted as a speech activity detector for each sample of $y(n, k)$. For example, during speech pauses, the condition $\hat{\sigma}_v^2 \geq [\alpha^2(n) + \hat{\sigma}_w^2]$ holds (e.g., 0-0.15 s or 1.8-2.19 s of Figure 4a). Conversely, $[\alpha^2(n) + \hat{\sigma}_w^2] \gg \hat{\sigma}_v^2$ is found in speech
regions (e.g., 0.16-0.33 s or 0.9-1.06 s of Figure 4a). Therefore, at sample $n$, if $\hat{\sigma}_v^2 \geq [\alpha^2(n) + \hat{\sigma}_w^2]$, $y(n, k)$ is termed silent and the decision parameter (denoted by $\zeta$) is set as $\zeta(n) = 0$; otherwise, speech activity occurs and $\zeta(n) = 1$. Figure 5 reveals that the flags detected by the proposed method (0/1: silent/speech) are closely similar to those of the reference (0/-1: silent/speech, generated by visually inspecting the utterance sp05). At sample $n$, if $\zeta(n) = 0$, the adjusted $K_0(n)$ in the proposed SEA is given by the $J_2(n)$ metric-based tuning in Equation (40). To justify the validity of this $K_0(n)$, Figure 6a shows the numerator and the denominator of Equation (40), computed from the noisy speech in Figure 2b. It can be seen that $\hat{\alpha}^2(n) \approx 0$ during speech pauses (e.g., 0-0.15 s or 1.8-2.19 s of Figure 6a), so, according to Equation (40), $K_0(n) \approx 0$ there. However, since this condition does not hold during speech presence (e.g., 0.16-0.33 s or 0.9-1.06 s of Figure 6a), Equation (40) may underestimate $K_0(n)$, as in the WGN experiment (Figure 1d). Thus, $J_2(n)$ metric-based tuning of $K_0(n)$ during speech activity of $y(n, k)$ is inappropriate.
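The sample-wise activity decision $\zeta(n)$ reduces to a single comparison of the KF parameters; a minimal sketch (function name ours):

```python
import numpy as np

def speech_activity_flag(alpha2, var_w, var_v):
    """Sample-wise speech activity decision from the KF parameters.

    zeta(n) = 0 (silence) when var_v >= alpha2(n) + var_w, else 1 (speech),
    mirroring the comparison described above. `alpha2` may be a scalar or
    an array of per-sample values.
    """
    return np.where(np.asarray(var_v) >= np.asarray(alpha2) + var_w, 0, 1)
```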
As discussed earlier, we carried out the tuning of the biased $K_0(n)$ using the $J_1(n)$ metric during speech activity of $y(n, k)$. However, our further investigation of the $J_1(n)$ metric-based tuning in Equation (33) reveals that the subtraction of $J_1(n)$ from the biased $K_0(n)$ may still produce an underestimated $K_0(n)$. To cope with this problem, at sample $n$, if $\zeta(n) = 1$, we found a more effective solution for tuning the biased $K_0(n)$ using the $J_1(n)$ metric, given by Equation (41). To justify the validity of this $K_0(n)$, the numerator and the denominator of Equation (41) are shown in Figure 6b. It can be seen that $[\alpha^2(n) + \hat{\sigma}_w^2] \gg \hat{\sigma}_v^2$ during speech presence of $y(n, k)$ (e.g., 0.16-0.33 s or 0.9-1.06 s), which causes $K_0(n)$ to approach 1.
To examine the performance of the proposed tuning algorithm in real-life non-stationary noise conditions, we repeated the experiment in Figure 2. It can be seen from Figure 7a that $K_0(n)$ is closely similar to the oracle $K_0(n)$. Specifically, it maintains a smooth transition at the edges, and the temporal changes in the speech regions closely match the oracle $K_0(n)$. Conversely, the AKF-RMBT method [22] produces a significantly underestimated $K_0(n)$ in speech regions. Therefore, the reduced-bias $K_0(n)$ of the proposed method is more appropriate for mitigating the risk of distortion in the enhanced speech than that of the AKF-RMBT method [22]. We also repeated the experiment in Figure 2, except that the utterance sp05 was corrupted by colored (f16) noise at 5 dB SNR. Figure 7b reveals that the biasing effect in $K_0(n)$ is reduced significantly and that $K_0(n)$ is closely similar to the oracle $K_0(n)$. However, the AKF-RMBT method [22] still produced an underestimated $K_0(n)$ in speech regions. In light of this comparative study, it is evident that the proposed method adequately addresses the tuning of the biased $K_0(n)$ in both real-life non-stationary and colored noise conditions.

Corpus
For the objective experiments, 30 phonetically balanced utterances belonging to six speakers (three male and three female) were taken from the NOIZEUS corpus ([1], Chapter 12). The clean speech recordings were two to four seconds long, depending on the utterance ([1], Chapter 12). We generated a noisy speech data set by mixing the clean speech with real-world non-stationary (babble, street) and colored (factory2 and f16) noise recordings at multiple SNR levels (from -5 dB to +15 dB, in 5 dB increments). This provided 30 examples per condition, with 20 conditions in total. The street noise recording was taken from [27], and the rest of the noise recordings were taken from [24]. All clean speech and noise recordings in the noisy speech data set are single-channel with a sampling frequency of 8 kHz.
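The data set generation above reduces to scaling each noise recording to a target global SNR before adding it to the clean speech; a minimal sketch (the function name and the truncation of the noise to the speech length are our assumptions):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add a noise recording to clean speech at a target global SNR.

    The noise is scaled so that 10*log10(P_speech / P_noise) equals
    `snr_db`, then added sample-wise. The noise is truncated to the
    speech length (our assumption about length handling).
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(clean)]
    p_s = np.mean(clean ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```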

Objective Evaluation
The objective measures were used to evaluate the quality and intelligibility of the enhanced speech with respect to the corresponding clean speech. The following objective evaluation metrics were used in this paper: PESQ and SDR (dB) for quality, and STOI for intelligibility.

Spectrogram Evaluation
We also analyzed the spectrograms of the enhanced speech produced by the proposed and competing methods to visually assess the level of residual noise as well as distortion. For this purpose, we generated noisy speech by corrupting the utterance sp05 with babble (non-stationary) and f16 (colored) noises at 5 dB SNR.

Subjective Evaluation
The subjective evaluation was carried out through a series of blind AB listening tests ([5], Section 3.3.4). To perform these tests, we used the same noisy speech data set (Section 4.3). In this test, the enhanced speech produced by the six SEAs, as well as the corresponding clean and noise-corrupted speech signals, was played as stimulus pairs to the listeners. Specifically, the test was performed on a total of 112 stimulus pairs (56 for each utterance), played in a random order to each listener, excluding comparisons of the same method with itself.
For each stimulus pair, the listener indicated whether the first or the second stimulus was perceptually better, or gave a third response indicating that no difference was found between them. For pairwise scoring, 100% is given to the preferred method, 0% to the other, and 50% to each for a similar-preference response. The participants could re-listen to the stimuli if required. Ten English-speaking listeners participated in the blind AB listening tests. The average of the preference scores given by the listeners, termed the mean preference score (%), was used to compare the efficiency of the SEAs.

Specifications of the Competitive SEAs
The performance of the proposed SEA was evaluated by comparing it with the following benchmark SEAs ($p$: order of $\{a_i\}$; $\sigma_w^2$: excitation variance of the AR model; $w$: analysis frame duration (ms); $s$: analysis frame shift (ms)).
KF-oracle: KF where $(\{a_i\}, \sigma_w^2)$ and $\sigma_v^2$ are computed from the clean speech and the noise signal; $p = 10$, $w = 32$ ms, $s = 16$ ms, and a rectangular window is used for framing.
AKF-IT [13]: AKF operating with two iterations, where the initial $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ are computed from the noisy speech and then re-estimated from the processed speech after the first iteration; $p = 10$, noise LPC order $q = 10$, $w = 20$ ms, $s = 0$ ms, and a rectangular window is used for framing.
Proposed: robustness and sensitivity tuning of the KF, where $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ and $\hat{\sigma}_v^2$ are computed from the pre-whitened speech and the estimated noise; $p = 20$, $q = 20$, $w = 32$ ms, $s = 16$ ms, a rectangular window is used for time-domain frames, and a Hamming window is used for acoustic frames.

Objective Quality Evaluation
Figure 8 shows the average PESQ score (computed over all frames for each test condition in Section 4.1) for each SEA. It can be seen that the KF-oracle method exhibits the highest PESQ score for all test conditions. This is because $(\{a_i\}, \sigma_w^2)$ and $\sigma_v^2$ are computed from the clean speech and the noise signal. The improvement in the average PESQ score for the KF-non-oracle method is marginal compared with the noisy speech. The proposed SEA shows a considerable PESQ score improvement over the benchmark methods across the test conditions. The average PESQ score of the proposed method is also very similar to that of the KF-oracle method. This is because the reduced-bias Kalman gain obtained by the proposed tuning algorithm is closely similar to that of the KF-oracle method (Figure 7). Amongst the benchmark methods, MDKF-MMSE [17] shows relatively competitive PESQ scores, followed by AKF-RMBT [22], for all tested conditions. On the other hand, the AKF-IT method [13] exhibits lower PESQ scores than the other benchmark methods across the test conditions, since its enhanced speech suffers from distortion and musical noise. In light of this comparative study, it is evident that the proposed method delivers better quality enhanced speech than the competing methods for all tested conditions.

Figure 9 shows the average SDR (dB) score (computed over all frames for each test condition in Section 4.1) for each SEA. As in the experiment of Figure 8, the KF-oracle method shows the highest SDR score for all test conditions, while the noisy speech shows the lowest. The proposed SEA consistently demonstrates an SDR score improvement over the competing methods across the test conditions. Amongst the competing methods, MDKF-MMSE [17] shows relatively competitive SDR scores for all tested conditions (Figure 9a-d). In light of this comparative study, it is evident that the proposed SEA introduces less distortion in the enhanced speech than the competing methods for all tested conditions.

Objective Intelligibility Evaluation
Figure 10 shows the average STOI score (computed over all frames for each test condition in Section 4.1). As in the PESQ comparison (Section 5.1), the KF-oracle method achieves the highest STOI score for all tested conditions. The proposed method consistently outperforms all competing methods across the tested conditions in terms of STOI score improvement, and its STOI scores are also very similar to those of the KF-oracle method. Amongst the benchmark methods, MDKF-MMSE [17] is found to be competitive with the proposed method for all tested conditions. Conversely, the noisy speech shows the lowest STOI scores for all tested conditions. In light of this comparative study, it is evident that the proposed method produces more intelligible enhanced speech than the competing methods for all tested conditions.

Spectrogram Analysis of the SEAs
Figures 11 and 12 compare the spectrograms of enhanced speech produced by each SEA for the noisy speech data set (Section 4.2). Typically, noise reduction improves visibly when going from the KF-non-oracle method to the KF-oracle method. Specifically, the biased gain of the KF-non-oracle method passes significant residual noise into the enhanced speech (Figures 11c and 12c). Additionally, the poor estimates of the a priori SNR introduce a high degree of residual noise in the enhanced speech produced by the MMSE-STSA method [9] (Figures 11d and 12d). The degree of residual noise decreases in the enhanced speech produced by the AKF-IT method [13] (Figures 11e and 12e); however, the residual noise appears as musical noise, and the enhanced speech is also distorted because the AKF processes the noisy speech iteratively. The AKF-RMBT method [22] exhibits less residual noise in the enhanced speech, but it suffers from distortion due to the underestimated Kalman gain (Figures 11f and 12f). The MDKF-MMSE method [17] produces less distorted speech (Figures 11g and 12g) than the AKF-RMBT method (Figures 11f and 12f). It can be seen that the proposed method produces enhanced speech with significantly less residual background noise and speech distortion (Figures 11h and 12h) than MDKF-MMSE [17] (Figures 11g and 12g). In addition, the enhanced speech produced by the proposed method is very similar to that of the KF-oracle method (Figures 11i and 12i).

In the blind AB listening tests, the highest mean preference scores were recorded for the clean speech (100%) and the KF-oracle method (82%). Among the benchmark methods, MDKF-MMSE [17] is found to be the most preferred (67%), followed by AKF-RMBT [22] (63%). In light of the blind AB listening tests, it is evident that the enhanced speech produced by the proposed method attains the best perceived quality amongst all tested methods for both male and female utterances corrupted by real-life non-stationary as well as colored noises.
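The spectrogram comparisons above rest on a standard short-time log-magnitude analysis, which can be sketched as follows. The frame length, hop, and window are illustrative assumptions rather than the paper's exact analysis settings:

```python
import numpy as np

def spectrogram_db(x, frame=256, hop=128):
    """Log-magnitude spectrogram: Hann-windowed frames -> FFT -> dB.
    Returns an array of shape (num_frames, frame // 2 + 1)."""
    window = np.hanning(frame)
    frames = []
    for start in range(0, len(x) - frame + 1, hop):
        segment = x[start:start + frame] * window
        spectrum = np.fft.rfft(segment)
        # Small floor avoids log(0) in silent frames.
        frames.append(20.0 * np.log10(np.abs(spectrum) + 1e-12))
    return np.array(frames)
```

Residual noise shows up in such a plot as broadband energy between speech harmonics, which is how the biased-gain artifacts in Figures 11c and 12c are identified visually.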

Conclusions
Robustness and sensitivity metric-based tuning of the Kalman filter gain for single-channel speech enhancement has been investigated in this paper. First, the noise variance was computed from the noise estimated for each noisy speech frame using a speech presence probability method. A whitening filter was also constructed to pre-whiten each noisy speech frame prior to computing the LPC parameters. Then, the robustness and sensitivity metrics were incorporated differently depending on the speech activity of the noisy speech to dynamically offset the bias in the Kalman gain. The noise variance and the AR model parameters were adopted as a speech activity detector. It is shown that the reduced-biased Kalman gain enables the KF to minimize the noise effect significantly, with objective and subjective evaluations on the NOIZEUS corpus confirming that the proposed method produces enhanced speech of higher quality and intelligibility than the benchmark methods.
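The KF recursion at the core of the method can be sketched as follows for an AR(p) speech model in companion state-space form. This is a minimal numpy sketch of the standard (untuned) recursion: it assumes the LPCs {a_i}, excitation variance σ_w², and noise variance σ_v² have already been estimated, and it omits the proposed robustness/sensitivity tuning of the gain:

```python
import numpy as np

def kalman_filter_speech(y, lpc, var_w, var_v):
    """Standard KF for an AR(p) speech model x(n) = sum_i a_i x(n-i) + w(n),
    observed as y(n) = x(n) + v(n). Returns the enhanced samples."""
    p = len(lpc)
    # Companion-form transition matrix built from the LPCs.
    F = np.zeros((p, p))
    F[0, :] = lpc
    F[1:, :-1] = np.eye(p - 1)
    H = np.zeros(p)
    H[0] = 1.0                       # observe the current sample only
    Q = np.zeros((p, p))
    Q[0, 0] = var_w                  # excitation drives the first state
    x = np.zeros(p)
    P = np.eye(p)
    out = np.empty(len(y))
    for n, yn in enumerate(y):
        # Predict.
        x = F @ x
        P = F @ P @ F.T + Q
        # Gain and update (this gain is biased when var_w or var_v
        # are estimated inaccurately from the noisy speech).
        K = P @ H / (H @ P @ H + var_v)
        x = x + K * (yn - H @ x)
        P = P - np.outer(K, H) @ P
        out[n] = x[0]
    return out
```

With accurate parameters (the oracle case), this recursion is the linear MMSE estimator; the bias offsetting proposed in the paper acts on the gain K inside this loop.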

Figure 2. Biasing effect of K_0(n): (a,b) spectrograms of the clean speech and the noisy speech (sp05 corrupted with 5 dB babble noise), (c) K_0(n) computed in the oracle and non-oracle cases, (d,e) [α²(n) + σ_w²] and σ_v² computed in the oracle and non-oracle cases, (f) J_2(n) and J_1(n) computed from the noisy speech in (b), (g,h) spectrograms of the enhanced speech produced by the KF-oracle and KF-non-oracle methods.

Figure 3. Block diagram of the proposed KF-based SEA.

Figure 5. Comparing the detected flags of Figure 2b to those of the reference corresponding to Figure 2a.

Figure 8. Average PESQ score comparison between the proposed and benchmark SEAs on the NOIZEUS corpus corrupted with: (a) babble, (b) street, (c) factory2, and (d) f16 noises for a wide range of SNR levels (from −5 dB to 15 dB).

Figure 10. Average STOI score comparison between the proposed and benchmark SEAs on the NOIZEUS corpus corrupted with: (a) babble, (b) street, (c) factory2, and (d) f16 noises for a wide range of SNR levels (from −5 dB to 15 dB).

Figure 13. The mean preference score (%) comparison between the proposed and benchmark SEAs for the utterance sp05 corrupted with 5 dB non-stationary babble noise.
When P(H_1^m | R_k(m)) = 1 occurs at the m-th frequency bin, it causes stagnation, which stops the update of D̂²_k(m) [26] (Equation (37)). Unlike monitoring the status of P(H_1^m | R_k(m)) = 1 over a long time, as reported in [26], we simply resolve this issue by setting P(H_1^m | R_k(m)) = 0.99 once this condition occurs, prior to updating D̂²_k(m) with Equation (37). With the estimated D̂²_k(m), λ̂_k(m) is then updated.
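The stagnation safeguard described above can be sketched per frame as follows. The smoothing constant and the MMSE combination follow the usual SPP-based noise-tracking form of [26] and are assumptions here, not the paper's exact Equation (37):

```python
import numpy as np

def update_noise_psd(spp, periodogram, noise_psd_prev, alpha=0.8):
    """One SPP-based noise-PSD update over the frequency bins of a frame.
    spp:            speech presence probability P(H1|R) per bin
    periodogram:    noisy periodogram |R_k(m)|^2 per bin
    noise_psd_prev: previous noise-PSD estimate per bin"""
    spp = np.asarray(spp, dtype=float).copy()
    # Safeguard against stagnation: if the SPP saturates at 1, the noise
    # estimate at that bin would never be updated again, so clamp to 0.99.
    spp[spp >= 1.0] = 0.99
    # MMSE estimate of the noise periodogram under the SPP.
    noise_periodogram = (1.0 - spp) * np.asarray(periodogram, dtype=float) \
        + spp * noise_psd_prev
    # Recursive smoothing of the noise PSD.
    return alpha * noise_psd_prev + (1.0 - alpha) * noise_periodogram
```

With the clamp in place, even a bin stuck at P(H_1^m | R_k(m)) = 1 keeps leaking a small fraction of the observed periodogram into the noise estimate, so the tracker recovers instead of stagnating.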