Dual-Channel Speech Enhancement Based on Extended Kalman Filter Relative Transfer Function Estimation †

: This paper deals with speech enhancement in dual-microphone smartphones using beamforming along with postﬁltering techniques. The performance of these algorithms relies on a good estimation of the acoustic channel and speech and noise statistics. In this work we present a speech enhancement system that combines the estimation of the relative transfer function (RTF) between microphones using an extended Kalman ﬁlter framework with a novel speech presence probability estimator intended to track the noise statistics’ variability. The available dual-channel information is exploited to obtain more reliable estimates of clean speech statistics. Noise reduction is further improved by means of postﬁltering techniques that take advantage of the speech presence estimation. Our proposal is evaluated in different reverberant and noisy environments when the smartphone is used in both close-talk and far-talk positions. The experimental results show that our system achieves improvements in terms of noise reduction, low speech distortion and better speech intelligibility compared to other state-of-the-art approaches.


Introduction
Speech-related services are pervasively available on mobile devices such as smartphones or tablets. However, reverberant and noisy environments, where these devices are frequently used, often degrade speech signal quality and/or intelligibility [1]. Many current devices include several microphones, so that multi-channel speech processing techniques can be applied to reduce the distortions, which improves the noise reduction performance compared to single-channel approaches. This is our research focus.
The most common multi-channel speech processing technique is beamforming [2], which applies spatial filtering to the noisy speech signals captured by several microphones. One of these beamformers is the well-known Minimum Variance Distortionless Response (MVDR) beamformer [3], which has the advantage of being able to reduce the noise power without introducing speech distortion. The performance of MVDR depends on an accurate estimation of the noise spatial characteristics and the acoustic transfer function (ATF) between the target speaker and the microphones. For the estimation of the noise spatial statistics, methods based on multi-channel speech presence probability (SPP) [4,5] are commonly used when there is no prior knowledge of the signal propagation or the spatial structure of the noise. These techniques update the noise information when speech absence is detected, which allows for the tracking of time-varying noise signals.
However, the estimation of ATFs is a more challenging task, as these depend on the speaker's location, room acoustics and microphone responses. One possible solution is based on the estimation of the beamformer weights for a certain reference microphone. As a result, the problem becomes that of estimating a relative transfer function (RTF) between the reference microphone and the rest of microphones [6]. The most common RTF estimators are based on sub-space search, using estimates of the noisy speech and the noise spatial correlations. This category includes techniques such as covariance subtraction [7], covariance whitening [8,9] and eigenvalue decomposition [10,11]. A weighted least-squares RTF estimator was proposed in Reference [12] but it is unfeasible for real-time applications. Other approaches based on online expectation-maximization [13,14] have been analyzed to jointly estimate the RTF and the clean speech and noise statistics under the assumption of a spatially-white noise field. Recently, we proposed an extended Kalman filter (eKF) framework for the estimation of the RTF between two microphones [15] and evaluated its performance on a dual-microphone smartphone under different noisy and reverberant environments. We showed that this technique obtains better estimation accuracy than the sub-space-based estimators while allowing for the tracking of the RTF variability, especially in highly reverberant scenarios.
Despite beamforming algorithms are often used in multi-microphone devices, their performance is quite limited on dual-microphone smartphones mainly due to the reduced number of microphones, their particular placement on the device and the short separation between them [16]. Therefore, alternatives to beamforming are necessary to obtain a good performance in these situations. One possible approach is the use of single-channel filters for the reference microphone using statistical information obtained from the dual-channel signal. For example, the power level difference (PLD) algorithm proposed in Reference [17] exploits the clean speech power difference between microphones when the smartphone is used in close-talk (CT) conditions (i.e., when the loudspeaker of the smartphone is placed at the ear of the user). An estimate of the single-channel noise statistics is obtained for both microphones and a Wiener filter is calculated from them. In the case of smartphones used in far-talk (FT) conditions (i.e., when the user holds the device at a distance from her/his face), the algorithm proposed in Reference [18] exploits the spatial properties of the noisy speech and noise signals. This spatial information is used along with a single-channel SPP detector to estimate the noise at the reference channel and the signal-to-noise ratio (SNR), which are needed for the filter design. This proposal was later extended to general multi-channel devices in Reference [19].
While the techniques mentioned above apply a single-channel filtering at the reference microphone, better performance can be achieved when the filter is designed to operate at the beamformer output, what is known as postfiltering [1]. Moreover, the multi-channel Wiener filter can be expressed as an MVDR beamformer followed by a single-channel Wiener filter [6]. Several postfilters based on the Wiener filter have been proposed in the literature [20][21][22], mainly differing in the assumptions made about the noise field. The authors of Reference [23] also evaluate the use of non-linear postfilters, showing improvements with respect to the linear approaches. The work in Reference [24] evaluates the performance of a generalized sidelobe canceler (GSC) beamformer along with an SPP estimator and a non-linear postfilter, showing that the SPP information is useful in the postfiltering design. The postfiltering approach has also been studied on dual-microphone scenarios [25,26]. For example, in Reference [27] we extended our eKF-RTF framework with the use of postfilters along with MVDR beamforming and the SPP estimator of Reference [4] in order to improve noise reduction. These postfilters exploit the SPP and the RTFs previously obtained to estimate the required single-channel statistics. Both linear and non-linear postfilters were evaluated, showing better performance than a standalone MVDR and other state-of-the-art enhancement algorithms intended to be used in smartphones.
In this work we analyze and further extend our dual-microphone speech enhancement eKF-RTF framework presented in References [15,27]. This extension is developed in a threefold way. First, we present a more detailed derivation of the eKF-RTF estimator where we improve the estimation of the a priori RTF statistics without any simplification in the estimation of the covariance matrices. Second, we analyze the a posteriori SPP estimation and the importance of the a priori speech absence probability (SAP). Then, we propose a novel SAP estimator suitable for dual-microphone smartphones, which exploits the spatial structure of the noise and the power differences between microphones, showing improvements with respect to the estimator used in our previous work. Third, we take advantage of the better estimates of SPP, RTFs and single-channel statistics to redefine the postfilters described in Reference [27] and propose new ones. As a result, we end up with a comprehensive dual-microphone speech enhancement system for smartphones that shows a great performance in terms of both speech quality and intelligibility when compared to other state-of-the-art approaches.
The rest of the paper is organized as follows: in Section 2, an overview of our dual-microphone speech enhancement system is given, presenting the constituent elements of the algorithm. Next, in Section 3, the eKF-RTF framework is detailed. The estimation of the noise statistics and the SPP along with the new proposals for SAP estimation on smarthphones are developed in Section 4. In Section 5, the different postfilters are described and the estimation of the required single-channel statistics is addressed. In Section 6, the experimental framework and results are presented and discussed, while Section 7 finally summarizes the conclusions.

System Overview
The proposed enhancement system for dual-microphone smartphones is depicted in Figure 1. The microphones capture the noisy speech signals y m (n), where m indicates the microphone index (m = 1, 2). We assume an additive noise distortion model which in the short-time Fourier transform (STFT) domain can be expressed as where Y m (t, f ), X m (t, f ) and N m (t, f ) represent, respectively, noisy speech, clean speech and noise signal STFTs, t is the frame index and f the frequency bin. The two channel components can be stacked in a vector, where [·] indicates matrix transposition. Similarly, we define vectors X(t, f ) and N(t, f ). In the following, we will consider that each frequency component can be processed independently from the others, which is commonly referred to as narrowband approximation [6]. Figure 1. Overview of the dual-microphone enhancement system for smartphones.

SCMs and SPP Estimation
As shown in Figure 1, the beamforming block requires two previous procedures. First, a speech presence probability (SPP)-based algorithm is employed for the estimation of the noisy speech and noise spatial correlation matrices (SCM), Φ Y (t, f ) and Φ N (t, f ), respectively, and the SPP, p x (t, f ).
Then, an extended Kalman filter (eKF)-based estimator obtains the relative transfer function (RTF) between the secondary microphone (m = 2) and the reference microphone (m = 1), defined as Once the SCM matrices and RTF have been computed, the dual-channel noisy speech vector Y(t, f ) is processed using a Minimum Variance Distortionless Response (MVDR) beamformer, which applies spatial filtering yielding an output signal Z(t, f ) defined as where (·) H indicates a Hermitian transpose. The coefficients of the beamformer, F(t, f ), are estimated as [2], where is the steering vector normalized to the reference channel. Finally, the speech signal at the beamformer output is enhanced by a single-channel postfilter for additional noise reduction. A spectral gain G(t, f ) is obtained using the power spectral density (PSD) of the speech at the reference microphone φ x 1 (t, f ), the PSD of the residual noise φ v (t, f ) and the SPP p x (t, f ). The above single-channel statistics (φ x 1 (t, f ) and φ v (t, f )) are estimated from the noisy speech and noise SCMs and the RTF. The gain function is further processed by a musical noise reduction algorithm [28] and finally applied to the beamformer output, thus obtaining where X 1 (t, f ) is the estimate of the clean speech signal at the reference microphone. In the following sections, the description of the different parts of the system is addressed.

Extended Kalman Filter-Based Relative Transfer Function Estimation
The proposed system requires knowledge of the RTF between the two microphones, namely A 21 ( f , t). Among other approaches, the computation of this RTF can be addressed as the estimation of a variable that changes across frames in terms of a stochastic model. Given this dynamic model, the noise statistics and the noisy observations, we proposed in [15] the tracking of the RTF using an extended Kalman filter (eKF), showing a better estimation performance in comparison with other state-of-the-art approaches. Next, we describe in detail the derivation of this eKF-based RTF estimation.
First, we formulate the narrowband approximation for a given frequency of the noisy speech signal at the secondary microphone in terms of the reference microphone as where the frequency index f has been omitted for the sake of simplicity. In addition, complex variables are expressed as stacked vectors of their real and imaginary parts. For example, given Y m (t) = Y r m,t + jY i m,t with m = 1, 2 , we can define and, in a similar way, Using these definitions for the model variables in terms of vectors, we propose the following dynamic and observation models needed by the Kalman filter: • Dynamic model for the RTF a 21,t : We assume that the state vector a 21,t is a random walk stochastic process which can be expressed as where is a zero-mean multivariate white Gaussian noise that models the variability of the RTF. A detailed discussion on this model is provided in Section 3.2.

•
Observation model for the noisy speech at the secondary microphone, y 2,t : It is defined using the distortion model in (8) as where the noises are assumed to be zero-mean multivariate Gaussian variables, n 1,t n 2,t ∼ N 0, Σ n 11 ,t Σ n 12 ,t Σ n 21 ,t Σ n 22 ,t , where Σ n ij ,t = E n i,t n j,t is a noise covariance matrix whose estimation is addressed in Section 4. Additionally, we assume that there is no correlation between a 21,t and n m,t while n 1,t and n 2,t are correlated. The model h is non linear because of the product between the variables a 21,t and n 1,t . This model depends on the observation y 1,t , which acts as a model parameter.
Finally, the Kalman filter framework is applied to obtain a (recursive) minimum mean square error (MMSE) estimate of a 21,t . This is a two-step procedure that is applied frame-by-frame for all frequencies: 1. The prediction step, using the model (12), is applied for every frame t > 0, where P t|t−1 = E a 21,t − a 21,t|t−1 a 21,t − a 21,t|t−1 (19) are error covariance matrices. The Kalman filter is initialized using the overall mean and covariance of the RTF, that is, a 21,0 = µ a 21 and P 0 = Σ a 21 . Further details about these parameters are provided in Section 3.2.
2. The updating step is applied to correct the previous estimation with the observations y 1,t and y 2,t (whose relationship is given by Equation (14)), where is the Kalman gain, and are necessary statistics, the estimation of which is developed in the next subsection. These equations correspond to the most general Kalman filter model as described in Reference [29]. We define the equations in terms of µ y,t , S y,t and C ay,t because the non-linearity of h in Equation (14) makes their estimation non-trivial.
To deal with non-linear functions, two variants of the Kalman filter are widely used: the eKF and the unscented Kalman filter (uKF) [29]. In the case of the model of (14), it could be demonstrated that both eKF and uKF yield the same closed-form expressions for µ y,t , S y,t and C ay,t . We choose the eKF approach because it gives a more stable computational solution and it is easier to implement than the uKF one. In the next subsection the eKF approach is presented.

Vector Taylor Series Approximation
The eKF is based on applying a linearization over the prediction and observation models using first-order vector Taylor series (VTS). In our case, the prediction model is linear, so the linearization is only applied to function h in the observation model. The first-order VTS approximates the model of Equation (14) as where are the Jacobian matrices required for the VTS approach. Finally, using (26), the noisy speech statistics can be estimated as [29] µ y,t h a 21,t|t−1 , 0; y 1,t , S y,t H t P t|t−1 H t + J n 1 ,t Σ n 11 ,t J n 1 ,t + Σ n 22 ,t + J n 1 ,t Σ n 12 ,t + Σ n 21 ,t J n 1 ,t ,

A Priori RTF Statistics
The dynamic model for the RTF vector a 21,t presented in Equation (12) accounts for the process variability across time in terms of a white noise w t . This variability has a twofold meaning: first, the possible temporal variations of the acoustic channel due to environment changes, head or smartphone movements, etc., and, second, the inaccuracy of the narrowband approximation of Equation (8) [6]. This is due to the fact that, especially in reverberant environments where the analysis window of the STFT is shorter than the impulse response of the acoustic channel, the transfer function is not multiplicative but convolutive. This convolutive transfer function expands in both time and frequency dimensions. Thus, nearby frames and frequencies are correlated, which violates the narrowband approximation [6]. In order to account for the variability associated to this effect, we will assume the statistical model proposed in (12) so that, although the channel could be time-invariant, the RTF, as defined in Equation (3), is time-variant.
The eKF-based RTF estimator requires the a priori statistics of the RTF. These statistics are the overall mean vector and covariance matrix of the RTF, respectively, and also the covariance of the RTF variability across frames, Additionally, we can define the overall correlation matrix of the RTF as The previous statistics can be estimated in advance using a training set of dual-channel clean speech utterances in different acoustic environments and device positions. In order to avoid outliers which might yield useless estimates, we select, on an utterance basis, time-frequency bins where the speech power at the reference channel is large enough (higher than the maximum power at that frequency in the utterance minus 3 dB). For those bins, we estimate the RTF using (3), which yields a set of RTFs A 21,t vectors, while a sample covariance matrix Q (l) is computed from consecutive RTF vectors. Finally, the global sample statistics µ a 21 , R a 21 and Q are obtained by averaging the particular utterance-dependent statistics. The sample covariance Σ a 21 can then be estimated using µ a 21 and R a 21 in (35).

Speech Presence Probability-Based Noise Statistics Estimation
The eKF-RTF estimator and the beamforming algorithm also require knowledge of the noise spatial correlation matrix (SCM) Φ N (t, f ). In this work, we follow the multi-channel speech presence probability (SPP) approach described in Reference [4]. This estimation method is based on the recursive updating of the noise SCM in those time-frequency bins where speech is absent. First, two hypotheses, H x and H n , are considered for speech presence and speech absence, respectively, Assuming zero-mean random variables, the SCM of the noise will be updated by means of the following recursion, where the forgetting factor is computed as whereα ∈ [0, 1) is an updating constant and is the a posteriori SPP. The estimation of this probability is detailed in Subsection 4.1. From now on, the time and frequency indices are omitted whenever possible. The noise statistics for the RTF estimator (required by (30)) are directly derived from Φ N . Assuming a zero-mean, symmetric circular complex Gaussian distribution for N [30] and with φ N ij = E N i N * j , the following relations can be demonstrated, where I 2 is the 2-dimensional identity matrix.

A Posteriori SPP Estimation
The a posteriori SPP p x (t, f ) allows us to control the updating procedure of Equation (38) for the computation of the noise statistics. Nevertheless, p x (t, f ) is also exploited in two additional parts of our system:

•
The estimation of the RTF presented in the previous section is only accurate in time-frequency bins where speech is present. The a posteriori SPP indicates those bins where speech presence is more likely. Therefore, in our implementation we only update the eKF in those bins where p x (t, f ) > p thr , with p thr being a predefined probability threshold. Otherwise, the previous values are preserved.

•
The postfiltering performance can be improved if additional information about SPP is provided, as shown later in Section 5.
The estimation of the a posteriori SPP can be addressed assuming complex multivariate Gaussian distributions for the noisy speech Y [4,5], according to the two hypotheses previously formulated (see (36) and (37)). Using the Bayes' rule, the a posteriori SPP at each time-frequency bin can be calculated as where q n = P (H n ) is the a priori speech absence probability (SAP) and are the likelihoods of observing the noisy speech signal under the different hypotheses, with det [·] being the matrix determinant operator and M the number of microphones. Equation (46) can be redefined using these likelihoods, yielding the following expression for the SPP, Then, the a posteriori SPP can be estimated at each frame for all frequencies according to the following two-iteration algorithm: • Initialization: Estimate the noisy SCM with a recursive updating, whereα is an updating constant as in (39). Also, estimate the a priori SAP q n (t, f ) (see Section 4.2).

A Priori SAP Estimation
To obtain an accurate a posteriori SPP that allows for robust tracking of the noise statistics, the a priori SAP is a key parameter. Methods on single-channel noise tracking, as the fixed a priori SNR algorithm of Reference [31] or the minima controlled recursive averaging (MCRA) algorithm [32], estimate this SAP in terms of the a priori SNR. The MCRA framework was extended to multi-channel speech signals in Reference [4]. This is the SPP estimator that we used in our previous works [15,27]. A major drawback of the MCRA scheme is the lack of robustness in case of a time-varying SNR, which makes noise changes to be detected as speech presence. More recently, the authors of [5] proposed to use spatial information in the SAP estimation, specifically the coherent-to-diffuse ratio (CDR). Alternatively, the power level difference (PLD) between microphones is used in Reference [17] to update the noise statistics, as it is a good indicator of speech presence in CT conditions. In this work, we propose a novel SAP estimator for dual-microphone smartphones that combines the CDR spatial information with the PLD between the reference and the secondary microphones.
First, an SCMΦ Y (t, f ) is estimated using a rectangular window of eight past frames as in Reference [5]. Then, we calculate (1) the power spectral density (PSD) ratio between microphones as whereφ Y ij is an estimate of E Y i Y * j , and (2) the short-term complex coherence between microphones, Both terms are needed by the SAP estimators. The PLD-based a priori SAP is based on the PLD noise estimator for CT conditions described in Reference [17]. In that work, a normalized difference of the noisy speech PSDs was defined as The noise statistics are updated by this parameter, as it gives information about speech presence. That is, assuming similar noise power at both microphones and that speech is more attenuated at the secondary microphone with respect to the primary one, ∆ φ nPLD (t, f ) is close to one when speech is present and tends to zero otherwise. Thus, an a priori SAP based on this indicator can be estimated as where q PLD (t, f ) is upper-bounded by 1. This estimator is valid for CT conditions, but it can also be useful in FT conditions for frequencies where speech at the secondary microphone is attenuated.
On the other hand, the CDR is another good indicator of speech presence [33], which is defined as where ψ X 12 (t, f ) is the clean speech short-term complex coherence (defined as in (52)) and is the diffuse noise field complex coherence, with F s the sampling frequency, K the number of frequency bins, d mic the distance between microphones and c v the speed of sound. While higher values of the CDR indicate the presence of a strong coherent component, often a speech signal, lower values indicate that a diffuse component is dominant, which is more common for noise signals. In practice, the CDR is obtained using the estimator proposed in Reference [33], where φ Y 12 (t, f ) is the phase ofφ Y 12 (t, f ). Although the CDR is positive and real-valued, estimation errors yield complex-valued results, so the real-part (·) is taken in (57). Additionally, a frequency-averaged CDR with a normalized Hamming window h W is computed as with w = 10 (the window width is 2w + 1). Then, the local a priori SAP estimate is obtained as in Reference [5] using Γ Y 12 (t, f ), where q min and q max define the minimum and maximum values of the function, respectively, c is an offset parameter and ρ controls the steepness of the transition region. Similarly, the global a priori SAP estimate q global (t, f ) is computed using Γ Y 12 (t, f ) in (59). The CDR-based a priori SAP is then obtained as in Reference [5], This estimator is the same as that proposed in Reference [5], except for the term representing the frame a priori SAP. This term was neglected as it did not increase the performance in our preliminary experiments.
Finally, the a priori SAP estimates obtained by PLD and CDR are combined to achieve a more robust joint decision. Assuming statistical independence between both estimators, the final a priori SAP estimate for dual-microphone smartphones is calculated as the joint probability of the speech absence decision by each estimator, The above estimator can be used in both CT and FT conditions to obtain prior information about speech presence. Moreover, as it only allows for noise statistics updating when both estimates indicate speech absence, this estimator is expected to be more robust.

Postfiltering Approaches for Dual-Microphone Smartphones
As introduced above, the performance achieved by beamforming algorithms is limited in our scenario mainly due to the reduced number of microphones and the particular placement on the smartphone. In this section we propose different alternatives for single-channel postfiltering at the beamformer output. These postfilters make use of the previously estimated parameters to design a gain function which is applied to the beamformer output signal for further noise reduction. In the next subsections we describe several postfiltering techniques as well as our proposals for the estimation of the clean speech and noise single-channel statistics required by the former ones.

Parametric Wiener Filtering
Under the distortionless constraint of MVDR, if we assume an accurate estimation of the RTF, we can approximate the single-channel speech signal Z(t, f ) at the beamformer output as where V(t, f ) is the residual noise at the beamformer output, The PSDs of the clean speech signal X 1 (t, f ) and the residual noise V(t, f ) are given by φ x 1 (t, f ) and φ v (t, f ), respectively. The corresponding Wiener filter (WF) for X 1 (t, f ) is then defined as where is the a priori SNR. The noise reduction performance of WF can be improved if we consider the a posteriori SPP p x (t, f ) in the postfiltering design. Our goal is to decrease the gain factor for time-frequency bins where speech is absent. This yields the parametric WF (pWF) [34] where µ(t, f ) is an SPP-driven trade-off parameter. To obtain this parameter, we use the same mapping function as in Reference [5], which is similar to the mapping function of (59). We can see that µ(t, f ) ∈ [µ min , µ max ] and therefore, the lower the a posteriori SPP, the higher µ(t, f ), thus improving the noise reduction performance.

Optimally Modified Log-Spectral Amplitude Estimator
Instead of estimating a WF filter over the signal statistics, other approaches try to directly make an MMSE estimation of the clean speech signal amplitude |X 1 (t, f )|. One of these is the log-spectral amplitude (LSA) estimator proposed in Reference [35] and defined as where is the LSA gain function, and is the a posteriori SNR. As can be seen, this estimator is equivalent to applying a gain function over the noisy speech signal Z(t, f ) that depends not only on the a priori SNR but also on the a posteriori SNR. The obtained gain function is similar to WF for high a priori SNRs and has lower values than WF when the a priori SNR is lower and the a posteriori SNR is higher (i.e., when noise tends to dominate). The estimator of (68) is only valid under the assumption of speech presence. To improve the performance, the authors of Reference [36] proposed a different estimator that also takes into account the a posteriori SPP and different gain functions for speech presence and absence. That is the optimally modified LSA (OMLSA) estimator, which is defined as where G H v is a constant attenuation applied when speech is absent, whose value is usually −25 dB [36].

Single-Channel Speech and Noise PSD Estimators
The computation of the gain functions previously presented relies on the estimation of the single-channel PSDs of the clean speech at the reference channel φ x 1 (t, f ) and the residual noise at the beamformer output φ v (t, f ). The estimation of the residual noise PSD is straightforward if estimates of the SCM Φ N (t, f ) and the steering vector d(t, f ) are available. In this case, the residual noise PSD can be obtained at each time-frequency bin as On the other hand, a clean speech PSD estimate is more difficult to obtain because of its higher variability. In Reference [27], we proposed two different estimators that make use of the noisy speech and noise statistics and the RTF between microphones. In the following, we describe both clean speech PSD estimators, omitting time and frequency indices for simplicity.

Power Level Difference-Based Estimation
The PLD-based estimator is derived from the method in Reference [17], which exploits the PLD between the microphones of a smartphone used in CT conditions; that is, a more attenuated clean speech component is expected at the secondary microphone with respect to the primary one. This PLD is defined in terms of the noisy speech PSDs at the microphone inputs, where it is assumed that the power at the reference microphone is always higher than the one at the secondary microphone. Assuming that the noise PSD difference (∆φ N = φ N 11 − φ N 22 ) can be neglected when compared to ∆ φ PLD , the clean speech PSD can be estimated as in [17], Although this estimator offers good performance in CT conditions, the previous assumptions are not longer valid in FT conditions.

Minimum Variance Distortionless Response-Based Estimation
This estimator calculates the PSD directly at the beamformer output by spectral subtraction, taking into account the distortionless property of MVDR. The estimator is defined as so that it fully exploits the spatial information by using the SCM matrices of the noisy speech and noise signals. The combination of both channels by means of the beamformer weights (F) allows for a more robust estimation than directly taking the first element of Φ Y − Φ N . Negative PSDs may be obtained due to the estimator variance, so (75) is bounded by 0.

Experimental Evaluation
The performance of the different estimators and speech enhancement algorithms discussed along this paper is evaluated by means of objective speech quality and intelligibility metrics. Two different well-known objective metrics are used:

•
The Perceptual Evaluation Speech Quality (PESQ) [37] metric is utilized to evaluate the speech quality of the enhanced speech signal. This metric gives a mean opinion score between one and five. The higher the PESQ values, the better the speech quality.

•
The Short-Time Objective Intelligibility (STOI) [38] metric is intended to evaluate the speech intelligibility of the enhanced speech signal. The resulting score is a value between zero and one. The higher the STOI value, the better the speech intelligibility.
PESQ and STOI are both intrusive metrics, which means that they need a clean reference. As a reference, we use the clean speech signal at the reference microphone, x 1 (n).
Additionally, in order to evaluate the RTF estimation accuracy, we use the speech distortion (SD) index [2]. This index measures the distortion level on the clean speech signal at the beamformer output, namelyx 1 (n) (inverse STFT of F H (t, f )X(t, f )). The idea behind using this metric is that, because of the distortionless property of MVDR, more accurate RTF estimates should yield lower speech distortion. The SD index is measured segmentally across the speech signal, in such a way that the SD value at the i-th segment is obtained as where T is the number of samples per segment. The segmental SD values are averaged to achieve the final SD index, which is a value between zero and one. The lower the SD value, the lower the speech distortion. As in Reference [5], silence frames are excluded from this evaluation by means of calculating the median of the segment-wise signal power and removing those segments with a power 15 dB lower than that median. We evaluate the proposed algorithms by using simulated dual-channel noisy speech recordings from a dual-microphone smartphone. Two different databases were developed for each device use mode: close-talk (CT) and far-talk (FT). To simulate the recordings, clean speech signals are filtered using dual-channel acoustic impulse responses and real dual-channel environmental noise is added at different signal-to-noise ratios (SNRs). We evaluate four different reverberation environments and eight different noises, which are matched as indicated in Table 1, yielding eight different acoustic environments (including both reverberation and noise). Details about the methodology used to obtain the acoustic impulse responses and the noises can be found in References [15,39]. Clean speech signals are obtained from the TIMIT database [40,41] and downsampled to 16 kHz. In particular, a total of 850 clean speech utterances from different speakers are employed. All the utterances have a length of around seven seconds, which is achieved by same-speaker utterance concatenation. Two different sets are then defined, namely training and test. Speakers do not overlap across sets. Moreover, the number of utterances from female and male speakers is balanced. The distribution of utterances and the number of speakers in the training and test sets are indicated in Table 2. The training set consists of reverberated clean speech utterances and is only used to estimate the a priori statistics for the eKF-RTF estimator. Such utterances were obtained by filtering each clean speech utterance with a set of dual-channel acoustic impulse responses which model four reverberant environments, thereby yielding a total of 2800 training utterances. Sixteen different acoustic impulse responses were considered for each acoustic environment. For each utterance, the impulse response was randomly selected.
On the other hand, the test set consists of utterances contaminated according to the eight noisy environments defined in Table 1. This set is intended to evaluate the different algorithms proposed in this work. Noises were added to the reverberated speech at six different SNRs from −5 dB to 20 dB, so a total of 7200 test utterances was obtained. In order to simulate reverberation, ten different acoustic impulse responses, in turn also different from those of the training set, were considered for each acoustic environment and, again, randomly applied.
For STFT computation, a 512-point DFT was applied using a 32 ms square-root Hann window with 50% overlap. This results in a total of 257 frequency bins for each time frame. The noisy speech and noise SCMs were estimated using an updating constantα = 0.9. The parameters for CDR-based a priori SAP calculation were set as [5]: d mic = 13 cm, q min = 0.1, q max = 0.998, c = 3 and ρ = 2.5. The SPP threshold for RTF updating was set as p thr = 0.9. Finally, for the SPP-driven trade-off parameter of the parametric Wiener filter, the following parameter values were used [5]: µ min = 1, µ max = 4, c = −3 and ρ = 4.
The algorithm was implemented in Python and it is available at [42]. The computational burden of the implementation was evaluated on a PC with an Intel Core i7-4790 CPU. The algorithm works on a frame-by-frame basis, so that the algorithmic delay is in this case the duration of a frame, that is, 32 ms. The average performance of the whole system (i.e., including SPP and RTF estimation, MVDR beamforming and postfiltering) on this machine achieved approximately 8x faster than real-time.
In the following, the performance of the different techniques is tested. In order to simplify the reading of the results tables, an acronym is provided (in parentheses) for each considered technique after its description.

Experimental Results: Performance of SAP Estimators
First, we compare the different a priori SAP estimators when used along with our eKF-RTF estimator for MVDR beamforming (i.e., no postfilter is applied yet). This comparison is shown in Table 3 in terms of PESQ and STOI. The techniques evaluated are the multi-channel version of MCRA (MCRA) [4], CDR-based SAP estimation (CDR) [5] and our proposed PLD-based SAP estimator (PLDn) and its combination with CDR-based estimation (P&C), both presented in Section 4.2. Results for the noisy speech signal at the reference microphone (Noisy) are given as a baseline. In addition, we show the results achieved by the eKF-RTF estimator with an oracle estimation of the noise SCM (eKF-OracleN) as a performance upper-bound. This estimation was obtained using the true noise signals in a recursive procedure similar to Equation (50). The speech presence probability obtained from clean speech was used in the eKF-RTF estimator to obtain these oracle results.
For CT conditions, the best results are obtained for the eKF-PLDn system. The speech power difference between microphones for CT conditions makes that the PLD-based SAP estimator can easily detect those bins where speech is absent. This power difference reduces the CDR ratio, defined in Equation (57), of the multi-channel signal in the presence of speech, leading the CDR-based SAP method to underestimate the speech presence and decrease the performance of the noise tracking algorithm. Therefore, the combination of both approaches does not improve the single decision based on PLD between channels.
On the other hand, for FT conditions, speech power at both channels is more similar and CDR increases under the presence of speech. This is especially true at higher SNRs, where the CDR-based SAP detector outperforms the PLD-based one. However, the performance of the CDR-based detector degrades more severely at lower SNRs, while the PLD detector is more robust in these conditions. Finally, the combination of both detectors increases the performance in terms of both noise reduction and speech intelligibility, keeping a performance similar to the PLD one at lower SNRs and improving at higher SNRs.
To sum up, our proposals improve the tracking of the noise statistics in dual-microphone smartphones. The eKF-PLDn proposal is the best solution in CT conditions, with eKF-P&C having a similar performance. The joint decision proposed in eKF-P&C achieves the best results in FT conditions at higher SNRs, while eKF-PLDn performs slightly better at lower SNRs.

Experimental Results: Performance of RTF Estimators
In this subsection, we compare our eKF-RTF estimator with the well-known eigenvalue decomposition (EVD) [11] and covariance whitening (CW) [9] sub-space methods for RTF estimation. The results are shown in Table 4. In addition, we show the results obtained with an oracle estimation of the RTF (OracleC) as a performance upper-bound. This oracle RTF was obtained from the clean speech signals using Equation (3) for time-frequency bins where speech presence was detected (using the speech presence probability obtained from clean speech), while the RTF of the previous frame was reused for the remaining ones. The evaluation is carried out in terms of speech distortion (SD) and the speech intelligibility metric STOI. We evaluate SD instead of PESQ because here we are only interested in the distortion introduced over the reference clean speech due to RTF estimation errors when using a distortionless beamformer. The comparison is performed using the best SAP estimator obtained for each device placement according to Table 3 (eKF-PLDn for CT and eKF-P&C for FT). A similar improvement due to our proposed SAP estimators, as the one observed in Table 3 for the eKF-RTF estimator, is expected for the other RTF estimators (due to the fact that these estimators would also take advantage of more accurate estimates of the noise SCM). We compare the RTF estimators using only one SAP estimator in order to narrow the number of possible system combinations. For a fair comparison, we use the same RTF initialization and the same updating scheme based on SPP for the different systems.
From Table 4, we can see that our eKF proposal obtains slightly better results in terms of STOI and much lower speech distortion than the other approaches, particularly in FT conditions. As we previously analyzed in a former study [15], the distortionless property of MVDR involves none or very low speech distortion if an accurate estimate of the RTF is available. Therefore, we can conclude that our estimator tracks better the RTF variability than the other approaches. This is more noticeable in FT conditions, where the reverberation level increases and the secondary microphone captures similar power to the reference one, thereby making the RTF more variable and its tracking more challenging.
Furthermore, Table 5 shows the SD results for the different approaches grouped by reverberant environment (averaging by noise environment and SNR level). The oracle results for each reverberant environment are also shown, as they give an upper-bound reference of the performance of the different approaches. This can be useful when different reverberant environments are compared, as some of them include more challenging noise environments (e.g., cafe in low reverberation). It can be observed that the increase of the reverberation level makes the RTF estimation more difficult. The performance of the different algorithms degrades with the reverberation level, as the variations of the RTF are harder to track, although our proposal is more robust against reverberant environments than sub-space approaches.  (64)) applied to the MVDR beamformer output. Table 6 shows a comparison between the estimator based on the power difference (WF-Ps) of Equation (74), the one based on the distortionless constraint of MVDR (WF-Ms) of Equation (75) and a standalone MVDR beamformer (i.e., with no postfiltering) as a reference, all of them with the best SAP configurations determined above. This comparison is done in terms of speech quality and intelligibility.
Results show that WF-Ms performs better than WF-Ps, obtaining slightly better results in CT conditions and clearly outperforming it in FT conditions. The WF-Ms estimation does not make any assumptions about the noise power similarity between microphones as the other approach does. This makes the WF-Ms estimation procedure, which also exploits the cross-correlation elements of the noisy speech and noise SCMs, more robust. Moreover, the power difference assumption is no longer valid in FT conditions, leading the WF-Ps approach to degrade the performance in this scenario, particularly at higher SNRs.  2.81 2.34 1.95 1.62 1.36 1.20 97.18 94.20 88.72 80.11 67. Finally, in Table 7 we compare the pWF and OMLSA postfilters with a basic WF postfilter and two other related state-of-the-art dual-channel speech enhancement algorithms intended for smartphones: the PLD-based single-channel WF filter of Reference [17] (PLDwf) and the SPP-and coherence-based single-channel WF filter of Reference [18] (SPPCwf) for CT and FT conditions, respectively. The MVDR-based clean speech PSD estimator (Ms) is used for the different proposed postfilters (WF, pWF and OMLSA). The comparison is done in terms of PESQ and STOI metrics. For CT conditions, both pWF and OMLSA outperform the WF and PLDwf approaches, with OMLSA achieving the best results in noise reduction performance and speech intelligibility. PLDwf achieves more noise reduction (better PESQ) than our WF due to the fact that the former introduces an overestimation of the noise. Such an overestimation also means more speech distortion, so the speech intelligibility is lower compared to our WF approach. The use of SPP information in our postfilters allows for larger noise reduction in frequency bins where speech is absent without additional speech distortion. The availability of accurate SPPs in CT conditions makes OMLSA the best approach in this case, as the LSA estimator performs better than the WF when bins where speech is present are clearly differentiated from those where speech is absent.
On the other hand, the SPPs obtained in FT conditions are less accurate for the postfiltering task, thereby degrading the performance of the OMLSA estimator. Nevertheless, this SPP information is still useful for the pWF, which, in general, outperforms the basic WF, especially at higher SNRs. Moreover, both WF and pWF outperform SPPCwf in terms of noise reduction and speech intelligibility. In summary, the availability of accurate RTF and SPP estimates, as those provided by our proposal, clearly helps to improve the performance of postfiltering in FT mode.

Conclusions
In this paper we have proposed a dual-channel speech enhancement algorithm intended for dual-microphone smartphones. Our proposal brings the use of a novel speech presence probability calculation method for the estimation of noise statistics needed by our extended Kalman filter RTF estimator, allowing for online processing of the noisy speech signal. The noise and acoustic channel information is used for speech enhancement through MVDR beamforming. To improve the noise reduction performance with low speech distortion, we also proposed several postfiltering techniques which make use of both single-channel statistics estimated at the beamformer output and information about speech presence. Moreover, our system can take advantage of information about the user position in relation to the smartphone (available through the smartphone's sensors). The experimental results indicate that our proposal achieves more accurate estimates of RTFs and SPPs than other related state-of-the-art algorithms, which yields low speech distortion and better speech quality and intelligibility. Furthermore, the proposed postfilters improve the noise reduction performance compared to other algorithms specifically intended for dual-microphone smartphones without degrading the speech intelligibility.
Regarding future work, we think that the large margin of improvement revealed by the oracle results, with respect to the estimation of the noise statistics and the acoustic channel, is a boost to this research. In particular, we will generalize our extended Kalman filter framework to general multi-microphone devices and also investigate the estimation of the speech presence probability and the development of postfiltering techniques for multi-channel speech enhancement.