Improved Speech Spatial Covariance Matrix Estimation for Online Multi-Microphone Speech Enhancement

Online multi-microphone speech enhancement aims to extract target speech from multiple noisy inputs by exploiting the spatial information as well as the spectro-temporal characteristics with low latency. Acoustic parameters such as the acoustic transfer function and the speech and noise spatial covariance matrices (SCMs) should be estimated in a causal manner to enable the online estimation of the clean speech spectra. In this paper, we propose an improved estimator for the speech SCM, which can be parameterized with the speech power spectral density (PSD) and relative transfer function (RTF). Specifically, we adopt the temporal cepstrum smoothing (TCS) scheme to estimate the speech PSD, which is conventionally estimated with temporal smoothing. Furthermore, we propose a novel RTF estimator based on a time difference of arrival (TDoA) estimate obtained by the cross-correlation method. Finally, we propose refining the initial estimate of the speech SCM by utilizing the estimates of the clean speech spectrum and clean speech power spectrum. The proposed approach showed superior performance in terms of the perceptual evaluation of speech quality (PESQ) scores, extended short-time objective intelligibility (eSTOI), and scale-invariant signal-to-distortion ratio (SISDR) in our experiments on the CHiME-4 database.


Introduction
Speech enhancement is essential to ensure the satisfactory perceptual quality and intelligibility of speech signals in many speech applications, such as hearing aids and speech communication with mobile phones and hands-free systems. Currently, devices with multiple microphones are popular, which has enabled multi-microphone speech enhancement, exploiting spatial information as well as the spectro-temporal characteristics of the input signals. One of the most popular approaches to multi-microphone speech enhancement is spatial filtering in the time-frequency domain, which aims to extract a target speech signal from multiple microphone signals contaminated by background noise and reverberation by suppressing sounds from directions other than the target direction [6][7][8][9][10][11].
There have been various types of spatial filters with different optimization criteria [6][7][8][9][10]. Among them, the minimum mean square error (MMSE) criterion for speech spectra estimation led to the multi-channel Wiener filter (MWF), which has shown decent performance [10,12,21,22]. It has been shown that the MWF solution can be decomposed into the concatenation of the minimum-variance distortionless-response (MVDR) beamformer and the single-channel postfilter [11,12]. Spatial filters often require the estimation of acoustic parameters such as the relative transfer function (RTF) between the microphones and the speech and noise spatial covariance matrices (SCMs), which should be estimated from the noisy observations.
For applications such as speech communication and hearing aids, time delays are crucial, and thus an online algorithm is required for multi-microphone speech enhancement. The work in [25] extended the single-channel minima controlled recursive averaging (MCRA) framework [49,50] for noise estimation to the multi-channel case by introducing the multi-channel speech presence probability (SPP) [24]. In [26], a coherence-to-diffuseness ratio (CDR)-based a priori SPP estimator under the expectation-maximization (EM) framework was proposed to improve robustness in nonstationary noise scenarios. In [25,26], the speech SCM was estimated with the maximum likelihood (ML) approach, while the multi-channel decision-directed (DD) estimator was proposed in [29]. In [27], the recursive EM (REM) algorithm, which performs an iterative estimation of the latent variables and model parameters in the current frame, was exploited by defining the exponentially weighted log-likelihood of the data sequence. The speech SCM was decomposed into the speech power spectral density (PSD) and RTF under the rank-1 approximation, and these components were estimated by an ML approach using the EM algorithm in [27].
In this paper, we propose an improved speech SCM estimation for online multi-microphone speech enhancement. First, we adopt the temporal cepstrum smoothing (TCS) approach [51] to estimate the speech PSD, which has not yet been tried in multi-channel cases. Furthermore, we propose an RTF estimator based on time difference of arrival (TDoA) estimation using the cross-correlation method. Finally, we propose refining the acoustic parameters by exploiting the clean speech spectrum and clean speech power spectrum estimated in the first pass. The experimental results show that the proposed speech enhancement framework exhibited improved performance in terms of the perceptual evaluation of speech quality (PESQ) scores, extended short-time objective intelligibility (eSTOI), and scale-invariant signal-to-distortion ratio (SISDR) for the CHiME-4 database. Additionally, we performed an ablation study to understand how each sub-module contributed to the performance improvement.
The remainder of this paper is organized as follows. Section 2 briefly introduces the previous work on multi-microphone speech enhancement depending on various classes of approaches and then summarizes the main contributions of our proposal. Section 3 reviews the previous MMSE multi-channel speech enhancement approach and explains the conventional speech and noise SCM estimation. Section 4 presents the proposed speech SCM estimation based on the novel speech PSD and RTF estimators. Section 5 outlines the experimental results that demonstrate the superiority of the proposed method compared with the baseline in terms of speech quality and intelligibility. Finally, a conclusion is provided in Section 6.

Previous Work and Contributions
Recently, many approaches to multi-microphone speech enhancement have been proposed. In [33], the estimation of the speech PSD reduces to seeking a unitary matrix and the square roots of PSDs based on the factorization of the speech SCM. The RTF estimate was recursively updated based on these estimates. They also proposed desmoothing the generalized eigenvalues to maintain the non-stationarity of the estimated PSDs. Furthermore, these parameter estimates were then exploited for a Kalman filter-based speech separation algorithm [35]. In the context of sound field analysis, ref. [34] proposed a masking scheme under the non-negative tensor factorization model and [36] exploited the sparse representation in a spherical harmonic domain. The work in [37] proposed a multi-channel non-negative factorization algorithm in the ray space transform domain.
Deep-learning-based approaches have also been proposed, which can be categorized into several types. One is the combination of deep learning with conventional beamforming methods, in which deep neural networks (DNNs) are employed to implement beamforming [38,39]. In [38], the complex spectral mapping approach was proposed to estimate the speech and noise SCMs. In contrast, ref. [39] reformulated the MVDR beamformer as a factorized form associated with two complex components and estimated them using a DNN, instead of estimating the parameters of the MVDR beamformer. Another approach is neural beamforming, in which a DNN directly learns the relationship between multiple noisy inputs and outputs in an end-to-end way [40][41][42][43]. The authors of [40] defined spatial regions and proposed a non-linear filter that suppresses signals from the undesired region while preserving signals from the desired region. In [41], the authors proposed an end-to-end system to estimate the time-domain filter-and-sum beamformer coefficients using a DNN. This approach was later replaced with implicit filtering in latent space [42]. The authors of [43] built a causal neural filter comprising modules for fixed beamforming, beam filtering, and residual refinement in the beamspace domain.
One of the popular approaches that adapt the spatial filter according to the dynamic acoustic condition is the informed filter, which is computed by utilizing the instantaneous acoustic parametric information [15][16][17][18]. Refs. [15,16] exploited the instantaneous direction of arrival (DoA) estimates to find the time-varying RTF used to construct the spatial filter, and [18] formulated a Bayesian framework under the DoA uncertainty. In [19], the eigenvector decomposition was applied to the estimated speech SCM to extract the steering vectors, which were used for the MVDR beamformer. The aforementioned approaches often adopted classical techniques such as ESPRIT [52] or MUSIC [53] for DoA estimation, which may be improved by incorporating more sophisticated sound localization [47,48].
Another set of studies focuses on the estimation of the acoustic parameters. An EM algorithm [14] was employed to perform a joint estimation of the signals and acoustic parameters. While clean speech signals were obtained in the E-step, the PSDs of the signals, the RTF, and the SCMs were estimated in the M-step. As the previous EM algorithm processed all of the signal samples at once, REM algorithms [27,28] overcame this issue by carrying out frame-wise iterative processing to handle online scenarios. For the speech PSD estimation, ref. [32] proposed an instantaneous PSD estimation method based on generalized principal components to preserve the non-stationarity of speech signals. For the RTF estimation, previous approaches mainly exploited the sample SCMs [46]. The covariance subtraction (CS) approaches [44,45] estimated the RTF by taking the normalized first column of the SCM obtained by subtracting the noise SCM from the noisy speech SCM, assuming that the rank of the speech SCM was one. On the other hand, the covariance whitening (CW) approaches [30,54] normalized the dewhitened principal eigenvector of the whitened noisy input SCM to obtain the RTF.
In this paper, we propose an improved speech SCM estimation method for the online multi-microphone speech enhancement system based on the MVDR beamformer-Wiener filter factorization. The main contributions of our proposals are as follows:

1. A speech PSD estimator based on the TCS scheme, which takes knowledge of the speech signal in the cepstral domain into account;
2. An RTF estimator based on the TDoA estimate, which takes advantage of the information from all frequency bins, especially when the signal-to-noise ratio (SNR) is low;
3. The refinement of the acoustic parameter estimates by exploiting the clean speech spectrum and clean speech power spectrum estimated in the first pass.

Signal Model
Suppose that there is an array of M microphones in a noisy and reverberant room. Assuming that a single speech source and noises are additive, the observed microphone signals are given as

\mathbf{y}(l,k) = \mathbf{g}(l,k)\,S_1(l,k) + \mathbf{v}(l,k) = \mathbf{s}(l,k) + \mathbf{v}(l,k),   (1)

where \mathbf{y}(l,k) = [Y_1(l,k), \ldots, Y_M(l,k)]^T, \mathbf{v}(l,k) = [V_1(l,k), \ldots, V_M(l,k)]^T, and Y_m(l,k), S_m(l,k), and V_m(l,k) are the short-time Fourier transform (STFT) coefficients of the microphone signal, clean speech, and background noises, including reverberations, at the m-th microphone, respectively, and \mathbf{g}(l,k) = [1, g_2(l,k), \ldots, g_M(l,k)]^T is the RTF vector for the direct path from the desired speech source to the microphones. We assume that S_m(l,k) and V_m(l,k) are uncorrelated as in [16], although early reflections may disrupt this assumption. The SCM for the input signal \mathbf{y}(l,k), \Phi_y(l,k), is given by

\Phi_y(l,k) = E[\mathbf{y}(l,k)\mathbf{y}^H(l,k)] = \Phi_s(l,k) + \Phi_v(l,k),   (2)

where E[\cdot] denotes mathematical expectation, and \Phi_s(l,k) = E[\mathbf{s}(l,k)\mathbf{s}^H(l,k)] and \Phi_v(l,k) = E[\mathbf{v}(l,k)\mathbf{v}^H(l,k)] are the SCMs of \mathbf{s}(l,k) and \mathbf{v}(l,k), respectively.
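As a concrete illustration of the SCM statistics used throughout this paper, the following minimal NumPy sketch recursively smooths outer products of per-bin microphone snapshots. The smoothing factor and data layout are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def recursive_scm(frames, alpha=0.99):
    """Recursively smoothed spatial covariance matrix (SCM) estimate.

    frames: (L, M) complex STFT coefficients for one frequency bin
            across L frames and M microphones.
    Returns the (M, M) SCM after processing all frames.
    """
    L, M = frames.shape
    phi = np.zeros((M, M), dtype=complex)
    for l in range(L):
        y = frames[l][:, None]                      # (M, 1) snapshot
        phi = alpha * phi + (1 - alpha) * (y @ y.conj().T)
    return phi

# Example: two uncorrelated, unit-variance complex channels, so the
# SCM should converge toward the identity matrix.
rng = np.random.default_rng(0)
frames = (rng.standard_normal((2000, 2))
          + 1j * rng.standard_normal((2000, 2))) / np.sqrt(2)
phi = recursive_scm(frames)
```

The same recursion, with an SPP-dependent smoothing factor, underlies the noise SCM estimator reviewed in Section 3.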

MWF and MVDR-Wiener Filter Factorization
The objective of multi-microphone speech enhancement is to estimate the clean speech S_1(l,k) from the noisy observation \mathbf{y}(l,k), and we assume that prior knowledge of the source location or the RTF is not available. One of the popular approaches is the MWF, which is a linear MMSE estimator for the clean speech S_1(l,k), i.e.,

\hat{S}_1(l,k) = \mathbf{w}_{mwf}^H(l,k)\,\mathbf{y}(l,k),   (3)

where \mathbf{w}_{mwf}(l,k) denotes the MWF described as [6]

\mathbf{w}_{mwf}(l,k) = \frac{\Phi_v^{-1}(l,k)\,\Phi_s(l,k)}{1 + \mathrm{tr}[\Phi_v^{-1}(l,k)\,\Phi_s(l,k)]}\,\mathbf{e}_1,   (4)

where \mathbf{e}_1 = [1\;\mathbf{0}_{1\times M-1}]^T, in which \mathbf{0} is a zero vector, and \mathrm{tr}[\cdot] denotes the trace of a matrix. It is noted that only the noise and speech SCMs, \Phi_v(l,k) and \Phi_s(l,k), need to be estimated to implement the MWF. Previous work often adopted the multi-channel MCRA approach for noise SCM estimation, whereas ML estimation was employed for speech SCM estimation [25,26]. The MWF can be decomposed into the MVDR beamformer, \mathbf{w}_{mvdr}, and a single-channel Wiener postfilter, w_{wiener}, as [11,12]

\mathbf{w}_{mwf} = w_{wiener}\,\mathbf{w}_{mvdr},   (5)

which makes it possible to consider the spatial filtering depending on the RTF \mathbf{g} and the energy-based postfiltering w_{wiener} separately. Note that the frame and frequency indices are omitted for notational convenience. Let the output of the MVDR beamformer be Z, i.e.,

Z = \mathbf{w}_{mvdr}^H\,\mathbf{y},   (6)

where the MVDR beamformer is given as

\mathbf{w}_{mvdr} = \frac{\Phi_v^{-1}\mathbf{g}}{\mathbf{g}^H\Phi_v^{-1}\mathbf{g}}.   (7)

With the distortionless constraint of the MVDR beamformer, the beamformer output can be expressed as [27]

Z = S_1 + O,   (8)

where O is assumed to follow the Gaussian distribution with variance

\phi_o = \frac{1}{\mathbf{g}^H\Phi_v^{-1}\mathbf{g}}.   (9)

The clean speech spectrum can be obtained by applying the single-channel Wiener filter to the beamformer output Z, as

\hat{S}_1 = w_{wiener}\,Z = \frac{\phi_s}{\phi_s + \phi_o}\,Z,   (10)

where \phi_s = E[|S_1|^2] is the speech PSD at the first microphone. Figure 1 illustrates the block diagram of the multi-microphone speech enhancement system based on the MVDR-Wiener filter factorization. The noisy speech \mathbf{y} is processed by the MVDR beamformer and the Wiener filter sequentially, for which the acoustic parameters \Phi_v, \mathbf{g}, \phi_s, and \phi_o need to be estimated. Existing methods for parameter estimation are presented in the next subsection.
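The MVDR-Wiener factorization above can be sketched per time-frequency bin as follows. This is an illustrative NumPy sketch with our own variable names; the toy 2-microphone example only checks the distortionless property of the MVDR weights and the scalar Wiener scaling.

```python
import numpy as np

def mvdr_weights(phi_v, g):
    """MVDR beamformer: w = Phi_v^{-1} g / (g^H Phi_v^{-1} g)."""
    a = np.linalg.solve(phi_v, g)          # Phi_v^{-1} g without explicit inverse
    return a / (g.conj() @ a)

def mwf_weights(phi_v, g, phi_s, phi_o):
    """MWF via the MVDR-Wiener factorization: MVDR weights scaled by the
    single-channel Wiener gain phi_s / (phi_s + phi_o)."""
    return (phi_s / (phi_s + phi_o)) * mvdr_weights(phi_v, g)

# Toy example for one frequency bin with M = 2 microphones
g = np.array([1.0, 0.5 + 0.5j])
phi_v = np.array([[1.0, 0.2], [0.2, 1.0]], dtype=complex)
w = mvdr_weights(phi_v, g)                 # satisfies w^H g = 1
w_mwf = mwf_weights(phi_v, g, 2.0, 1.0)    # Wiener gain 2/3 times MVDR
```

Using `np.linalg.solve` instead of forming the matrix inverse is the usual numerically preferable choice for the small per-bin systems involved.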

Speech and Noise SCM Estimation
As for the estimation of the SCM of noise, \Phi_v, the multi-channel MCRA approach [25] is widely used, which is given as

\Phi_v(l,k) = \tilde{\alpha}_v(l,k)\,\Phi_v(l-1,k) + (1-\tilde{\alpha}_v(l,k))\,\mathbf{y}(l,k)\mathbf{y}^H(l,k),   (11)

where \tilde{\alpha}_v(l,k) = \lambda + p(H_1(l,k)|\mathbf{y}(l,k))(1-\lambda) is an SPP-dependent smoothing parameter with a constant 0 < \lambda < 1. This method updates \Phi_v more when the SPP is low and vice versa. The a posteriori SPP p(H_1|\mathbf{y}) can be obtained using Bayes' rule as

p(H_1|\mathbf{y}) = \frac{p(H_1)\,p(\mathbf{y}|H_1)}{p(H_0)\,p(\mathbf{y}|H_0) + p(H_1)\,p(\mathbf{y}|H_1)},   (12)

where H_0 and H_1 denote the hypotheses for speech absence and presence, respectively, and p(\mathbf{y}|H_0) and p(\mathbf{y}|H_1) are modeled as complex multivariate Gaussian distributions, as follows:

p(\mathbf{y}|H_0) = \frac{1}{\pi^M \det[\Phi_v]}\exp\!\left(-\mathbf{y}^H\Phi_v^{-1}\mathbf{y}\right),   (13)

p(\mathbf{y}|H_1) = \frac{1}{\pi^M \det[\Phi_y]}\exp\!\left(-\mathbf{y}^H\Phi_y^{-1}\mathbf{y}\right),   (14)

in which \det[\cdot] denotes the determinant of a matrix. Then, p(H_1|\mathbf{y}) becomes [24]

p(H_1|\mathbf{y}) = \left[1 + \frac{1-p(H_1)}{p(H_1)}\,\frac{\det[\Phi_y]}{\det[\Phi_v]}\exp\!\left(\mathbf{y}^H(\Phi_y^{-1}-\Phi_v^{-1})\mathbf{y}\right)\right]^{-1},   (15)

where p(H_1) is the a priori SPP, which can be estimated using the CDR-based [26] or DNN-based [27] method.
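The SPP-dependent noise SCM update can be sketched as follows (a minimal NumPy sketch; the value of the smoothing constant and the toy snapshot are illustrative assumptions). The key behavior is that the effective smoothing factor approaches 1 when speech is likely present, freezing the noise estimate, and falls back to the constant when speech is likely absent.

```python
import numpy as np

def update_noise_scm(phi_v, y, spp, lam=0.9):
    """One multi-channel MCRA-style recursion step for the noise SCM.

    phi_v: (M, M) previous noise SCM estimate
    y:     (M,) current noisy snapshot for one frequency bin
    spp:   a posteriori speech presence probability in [0, 1]
    """
    alpha = lam + spp * (1.0 - lam)         # SPP-dependent smoothing in [lam, 1]
    return alpha * phi_v + (1.0 - alpha) * np.outer(y, y.conj())

phi_v = np.eye(2, dtype=complex)
y = np.array([3.0 + 0j, 0.0])
frozen = update_noise_scm(phi_v, y, spp=1.0)   # speech present: no update
adapted = update_noise_scm(phi_v, y, spp=0.0)  # speech absent: full update
```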
The speech SCM is usually estimated with the ML approach, which is defined as [25,26]

\Phi_s^{ml}(l,k) = \Phi_y(l,k) - \Phi_v(l,k),   (16)

where \Phi_y is obtained by recursive smoothing as

\Phi_y(l,k) = \alpha_y\,\Phi_y(l-1,k) + (1-\alpha_y)\,\mathbf{y}(l,k)\mathbf{y}^H(l,k).   (17)

Under the rank-1 approximation for the clean speech SCM, \Phi_s can be further refined using the decomposition of \Phi_s with the speech PSD and RTF given by [8]

\Phi_s(l,k) = \phi_s(l,k)\,\mathbf{g}(l,k)\mathbf{g}^H(l,k).   (18)

Adopting the covariance subtraction (CS) approach, which extracts the normalized first column vector of the ML estimator of the speech SCM \Phi_s^{ml}, the estimator for the RTF is given as [46]

\mathbf{g}^{cs}(l,k) = \frac{\Phi_s^{ml}(l,k)\,\mathbf{e}_1}{\mathbf{e}_1^T\,\Phi_s^{ml}(l,k)\,\mathbf{e}_1},   (19)

where the denominator represents the speech PSD, i.e.,

\phi_s^{cs}(l,k) = \mathbf{e}_1^T\,\Phi_s^{ml}(l,k)\,\mathbf{e}_1,   (20)

in which the superscript cs indicates the CS approach.
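The covariance subtraction step can be illustrated on synthetic data where the rank-1 model holds exactly, so the true RTF and speech PSD are recovered. This is a sketch with hypothetical values, not the paper's estimator code.

```python
import numpy as np

def rtf_covariance_subtraction(phi_y, phi_v):
    """Covariance subtraction: under the rank-1 speech model, the first
    column of Phi_y - Phi_v equals phi_s * g, so normalizing by its first
    entry (the speech PSD at the reference microphone) recovers the RTF."""
    phi_s = phi_y - phi_v
    return phi_s[:, 0] / phi_s[0, 0], phi_s[0, 0].real

# Synthetic rank-1 speech SCM plus an identity noise SCM
g_true = np.array([1.0, 0.4 - 0.3j, -0.2 + 0.6j])
phi_s_true = 2.0
phi_v = np.eye(3, dtype=complex)
phi_y = phi_s_true * np.outer(g_true, g_true.conj()) + phi_v
g_cs, psd_cs = rtf_covariance_subtraction(phi_y, phi_v)
```

In practice, the subtraction uses smoothed estimates of both SCMs, so the recovery is only approximate; the noise in this normalized first column at low SNR is precisely what motivates the TDoA-based alternative proposed in Section 4.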
In the REM framework [27], the ML estimator for the RTF based on the observed noisy speech is obtained as [27]

\mathbf{g}^{ml}(l,k) = \frac{\sum_{i\le l}\beta^{\,l-i}\,\mathbf{y}(i,k)\,\hat{S}_1^*(i,k)}{\sum_{i\le l}\beta^{\,l-i}\,\widehat{|S_1|^2}(i,k)},   (21)

where the summations in the numerator and the denominator can be computed with recursive averaging. The numerator can be thought of as the estimate of the cross-correlation between \mathbf{y} and \hat{S}_1 in (10), \mathbf{r}_{ys}, given by

\mathbf{r}_{ys}(l,k) = \beta\,\mathbf{r}_{ys}(l-1,k) + (1-\beta)\,\mathbf{y}(l,k)\,\hat{S}_1^*(l,k),   (22)

and the denominator can be considered to be the estimate of the speech PSD obtained by the recursive smoothing of the estimated clean speech power spectrum,

\phi_s^{ts}(l,k) = \beta\,\phi_s^{ts}(l-1,k) + (1-\beta)\,\widehat{|S_1|^2}(l,k),   (23)

where the superscript ts indicates it is a temporally smoothed estimate, and \widehat{|S_1|^2}(l,k) is the MMSE estimator of |S_1|^2 under the speech presence uncertainty, given by

\widehat{|S_1|^2}(l,k) = p(H_1|Z)\left(|\hat{S}_1(l,k)|^2 + \frac{\phi_s\,\phi_o}{\phi_s+\phi_o}\right),   (24)

where we let p(H_1|Z) = p(H_1|\mathbf{y}), as in [27]. With \mathbf{r}_{ys} in (22) and \phi_s^{ts} in (23), \mathbf{g}^{ml} in (21) can be expressed as

\mathbf{g}^{ml}(l,k) = \frac{\mathbf{r}_{ys}(l,k)}{\phi_s^{ts}(l,k)}.   (25)

Figure 2 illustrates the block diagram of the proposed speech enhancement system. As in [25][26][27], the estimation of the speech and relevant statistical parameters is performed twice for each frame, which was shown to be effective for online speech enhancement. In this paper, we propose an improved method for speech SCM estimation, i.e., speech PSD estimation and RTF estimation with a rank-1 approximation, using the speech enhancement system described in Figure 2. Note that the proposed modules are highlighted with red boxes.
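The recursive-averaging view of the ML RTF estimator can be sketched as one update step; in the degenerate noiseless case where the snapshot is exactly the RTF times the clean spectrum, a single step already recovers the RTF. The function and variable names are ours, and the example is a toy check rather than the REM algorithm itself.

```python
import numpy as np

def update_rtf_stats(r_ys, phi_s_ts, y, s1_hat, s1_pow, beta=0.9):
    """One recursive-averaging step for the cross-correlation between the
    noisy input and the estimated clean speech, and for the temporally
    smoothed speech PSD; their ratio gives the ML RTF estimate."""
    r_ys = beta * r_ys + (1.0 - beta) * y * np.conj(s1_hat)
    phi_s_ts = beta * phi_s_ts + (1.0 - beta) * s1_pow
    return r_ys, phi_s_ts, r_ys / phi_s_ts

# Noiseless toy case: y = g * S1, so the ratio recovers g in one step
g_true = np.array([1.0, 0.7 + 0.2j])
s1 = 2.0 - 1.0j
r_ys, phi_ts, g_ml = update_rtf_stats(
    np.zeros(2, dtype=complex), 0.0, g_true * s1, s1, abs(s1) ** 2)
```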

Proposed Speech SCM Estimation
In the first pass, we exploit the noisy input \mathbf{y}(l) in the current frame and the noise SCM estimate \Phi_v(l-1) obtained in the previous frame to estimate the acoustic parameters in the current frame and perform beamforming and postfiltering, as explained in Section 3.2.
The ML estimate of the speech PSD at the first microphone using an instantaneous estimate of the PSD of the input noisy signal can be obtained as

\phi_s^{ml}(l,k) = \max\!\left(|Y_1(l,k)|^2 - \phi_v(l-1,k),\; \phi_{s,min}(l,k)\right),   (26)

where \phi_{s,min}(l,k) is a certain minimum value for the speech PSD estimate, which is set as \xi_{min}\,\phi_v(l-1,k) with a tunable parameter \xi_{min}, and \phi_v(l,k) denotes the noise PSD at the first microphone. To estimate the speech PSDs, ML estimation with temporal smoothing has been commonly used, as described in (16) and (17) [25][26][27]. However, this approach occasionally results in undesired temporal smearing of speech [51]. In this paper, we propose to apply TCS [51] to \phi_s^{ml} in (26). TCS is a selective temporal smoothing technique in the cepstral domain motivated by the observation that, although the excitation component resides in a limited number of cepstral coefficients dependent on the pitch frequency, the speech spectral envelope is well-represented by the cepstral bins with low indices [55]. Specifically, TCS consists of the following procedure: First, the cepstrum of the ML speech PSD estimate, \phi_s^{ml,ceps}(l,q), is computed by the inverse discrete Fourier transform (IDFT) of \phi_s^{ml}. Next, selective smoothing is applied to \phi_s^{ml,ceps}(l,q), in which the cepstral bins that are less relevant to speech are smoothed more and those representing the spectral envelope and fundamental frequency are smoothed less. Finally, the discrete Fourier transform is used to convert the smoothed cepstrum \phi_s^{ceps}(l,q) into the TCS-based speech PSD estimate in the spectral domain, \phi_s^{tcs}(l,k). The bias compensation for the reduced variance due to the cepstral smoothing can be found in [56], and a detailed description of the adaptation of the smoothing parameters and the fundamental frequency estimation is given in [51]. In this paper, we denote the aforementioned procedure of TCS as an operation:

\phi_s^{tcs,f}(l) = \mathrm{TCS}(\phi_s^{ml}(l)),   (27)

in which the superscript f indicates that this is the estimate in the first pass.
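The three TCS steps (IDFT to the cepstral domain, quefrency-selective recursive smoothing, DFT back) can be sketched as below. This is a much-simplified sketch: the pitch-adaptive protection of the excitation quefrency from [51] and the bias compensation from [56] are omitted, the smoothing constants are placeholders, and smoothing is applied to the log-PSD as is common for cepstral processing.

```python
import numpy as np

def temporal_cepstrum_smoothing(psd, prev_ceps,
                                alpha_env=0.2, alpha_rest=0.9, n_env=20):
    """Selective recursive smoothing of a PSD estimate in the cepstral domain:
    low quefrencies carrying the spectral envelope are smoothed lightly
    (alpha_env), the remaining, less speech-relevant bins heavily (alpha_rest).
    psd is a one-sided PSD of length K/2 + 1 for a K-point DFT."""
    log_psd = np.log(np.maximum(psd, 1e-12))
    ceps = np.fft.irfft(log_psd)                  # real cepstrum, length K
    if prev_ceps is None:
        prev_ceps = ceps                          # first frame: no smoothing
    alpha = np.full(ceps.shape, alpha_rest)
    alpha[:n_env] = alpha_env                     # envelope quefrencies
    alpha[-n_env:] = alpha_env                    # mirrored counterparts
    ceps_s = alpha * prev_ceps + (1.0 - alpha) * ceps
    psd_s = np.exp(np.fft.rfft(ceps_s).real)      # back to the spectral domain
    return psd_s, ceps_s

psd = np.full(129, 4.0)                           # flat PSD, 256-point DFT
psd_s, ceps_s = temporal_cepstrum_smoothing(psd, None)
```

A flat PSD has all its cepstral energy at quefrency zero, so it passes through unchanged, which is a convenient sanity check for the transform round trip.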
In this paper, we model the RTF vector \mathbf{g} as a relative array propagation vector, which depends on the DoA [16]. Note that the conventional approaches in [27,44] estimate the RTF for each frequency using the input statistics in that frequency bin, ignoring the inter-frequency dependencies. In the presence of heavy noise, the accurate estimation of the RTF may become difficult, and thus it would be beneficial to estimate the TDoA by utilizing the input signal in all frequency bins and to reconstruct the RTF using this simple model. The TDoA for the desired speech can be obtained from the estimate of the cross-PSD of the desired speech, \phi_{s_{1m}}(l,k) = E[S_1(l,k)S_m^*(l,k)], using the cross-correlation method [57]. The TDoA estimate \hat{\tau}_m between the first and the m-th microphones is given by

\hat{\tau}_m(l) = \arg\max_{\tau}\;\Re\!\left\{\sum_k \hat{\phi}_{s_{1m}}(l,k)\,e^{\,j2\pi k\tau/K}\right\},   (28)

in which \hat{\phi}_{s_{1m}}(l,k) is the estimate of \phi_{s_{1m}}(l,k) and K is the DFT size. Then, the TDoA-based RTF estimator can be obtained as

\hat{g}_m^{tdoa}(l,k) = e^{-j2\pi k\hat{\tau}_m(l)/K}.   (29)

In the first pass, the cross-PSD estimate \hat{\phi}_{s_{1m}} can be obtained by taking the (1,m) element of the ML speech SCM estimate \Phi_s^{ml}(l,k) in (16) as

\hat{\phi}_{s_{1m}}^{f}(l,k) = \mathbf{e}_1^T\,\Phi_s^{ml}(l,k)\,\mathbf{e}_m,   (30)

where \mathbf{e}_m = [\mathbf{0}_{(m-1)}^T\;1\;\mathbf{0}_{(M-m)}^T]^T, in which \mathbf{0}_n is an all-zero vector of length n; \mathbf{g}^{tdoa,f} can be computed using (28) and (29) with \hat{\phi}_{s_{1m}}^{f}, and \Phi_s^{f} can be obtained as in (18) using \phi_s^{tcs,f}(l) in (27) and \mathbf{g}^{tdoa,f}. The noise SCM is estimated with the multi-channel MCRA approach in (11) utilizing p(H_1|\mathbf{y}) in (15) computed with \Phi_s^{f} and \Phi_v. Then, we can compute the beamformer output Z in (6) and \phi_o in (9), and the estimate for the speech spectrum, \hat{S}_1, can be obtained as in (10).
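The cross-correlation TDoA step and the reconstruction of the RTF from it can be sketched as follows. The inverse DFT of the cross-PSD is the cross-correlation, whose maximizing lag (searched within a physically plausible range) gives the TDoA; the RTF is then rebuilt as a pure-delay phase ramp. The lag search range and DFT size are illustrative assumptions.

```python
import numpy as np

def tdoa_cross_correlation(cross_psd, nfft, fs, max_lag):
    """TDoA via the cross-correlation method: inverse-DFT the one-sided
    cross-PSD estimate and pick the maximizing lag within +/- max_lag
    samples (negative lags wrap circularly)."""
    cc = np.fft.irfft(cross_psd, n=nfft)
    lags = np.concatenate([np.arange(max_lag + 1), np.arange(-max_lag, 0)])
    best = lags[np.argmax(cc[lags])]      # negative indices wrap to the end
    return best / fs

def rtf_from_tdoa(tau, freqs):
    """Relative array propagation model: g_m(f) = exp(-j 2 pi f tau_m)."""
    return np.exp(-2j * np.pi * freqs * tau)

nfft, fs, delay = 256, 16000, 3
k = np.arange(nfft // 2 + 1)
cross_psd = np.exp(-2j * np.pi * k * delay / nfft)   # ideal 3-sample delay
tau = tdoa_cross_correlation(cross_psd, nfft, fs, max_lag=10)
g_hat = rtf_from_tdoa(tau, k * fs / nfft)
```

Because the ideal cross-PSD of a pure delay is exactly this phase ramp, the recovered RTF matches the cross-PSD, which illustrates why a single TDoA estimate shared across all bins can stabilize the RTF at low SNR.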
In the second pass, we estimate the acoustic parameters again by additionally utilizing the estimates for the clean speech spectrum, clean speech power spectrum, and a posteriori SPP, computed in the first pass. These refined parameters are in turn used to estimate the clean speech once again.
To refine the estimate of the speech PSD, we apply the TCS to the clean speech power spectrum estimate \widehat{|S_1|^2} in (24) as

\phi_s^{tcs,r}(l) = \mathrm{TCS}(\widehat{|S_1|^2}(l)),   (31)

in which the superscript r indicates it is the refined estimate in the second pass. As \widehat{|S_1|^2} would be less affected by the noise compared with \phi_s^{ml} by virtue of the beamforming and the MMSE estimation, \phi_s^{tcs,r}(l) would be more accurate than \phi_s^{tcs,f}(l). As for the RTF estimation, \mathbf{r}_{ys} in (22) is evaluated with \hat{S}_1 in (10), as in [27]. Instead of using \mathbf{r}_{ys} divided by the estimate of the speech PSD at the first microphone to obtain the RTF, as in [27], we again estimate the RTF based on the TDoA; \hat{\phi}_{s_{1m}} can be computed by extracting the m-th element of \mathbf{r}_{ys} as

\hat{\phi}_{s_{1m}}^{r}(l,k) = \mathbf{e}_m^T\,\mathbf{r}_{ys}(l,k),   (32)

in contrast to (30). The TDoA-based RTF estimate in the second pass, \mathbf{g}^{tdoa,r}, can be obtained through (28) and (29) with \hat{\phi}_{s_{1m}}^{r}. As in the first pass, \Phi_s^{r} is computed with \phi_s^{tcs,r} in (31) and \mathbf{g}^{tdoa,r}, and p(H_1|\mathbf{y}) in (15) is updated with \Phi_s^{r}. Then, p(H_1|\mathbf{y}) and \Phi_v are obtained again using (15) and (11), and the beamformer output Z and \phi_o are updated using (6) and (9). The final clean speech estimate \hat{S}_1 is obtained by (10) using \mathbf{g}^{tdoa,r}, \Phi_v, and \phi_s^{tcs,r}. The whole procedure of the proposed online multi-microphone speech enhancement method is summarized in Algorithm 1.

Experimental Settings
To demonstrate the superiority of the proposed algorithm, we conducted a set of experiments to evaluate the performance of multi-microphone speech enhancement on the simulated set of the CHiME-4 database [58]. In this database, a mobile tablet device with six microphones was used for recording, of which the three microphones numbered 1, 2, and 3 were located at the top left, center, and right, with an inter-microphone distance of approximately 10 cm each, while the other three microphones numbered 4, 5, and 6 were placed at the bottom left, center, and right, respectively [58]. The vertical distance between pairs of microphones was approximately 19 cm [58]. All microphones were located on the frontal surface, except for microphone 2. The bus (BUS), cafe (CAF), pedestrian area (PED), and street junction (STR) types of noise were used, and the SNR was between 0 and 15 dB. The training set consisted of 7138 utterances spoken by 83 speakers, whereas the development and evaluation sets were 1640 utterances and 1320 utterances, respectively, each from 4 different speakers. The sampling rate for the signals used in the experiments was 16 kHz, and the square-root Hann window was applied to a 32 ms signal with a 16 ms frame shift. The 512-point DFT was applied to the windowed signal. The reference channel for the algorithms and evaluations was microphone 5, located at the bottom center of the device.
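The analysis front-end described above can be sketched as follows. This is an illustrative NumPy sketch matching the stated settings (square-root Hann, 512-sample frames, 256-sample shift, 512-point DFT); a symmetric Hann window is used here for simplicity, whereas a periodic variant may be preferable in a full analysis-synthesis chain.

```python
import numpy as np

def stft_analysis(x, frame_len=512, hop=256):
    """Square-root Hann analysis as in the experiments: 32 ms (512-sample)
    frames with a 16 ms (256-sample) shift at 16 kHz, followed by a
    512-point DFT, returning (n_frames, 257) one-sided spectra."""
    win = np.sqrt(np.hanning(frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[l * hop : l * hop + frame_len] * win
                       for l in range(n_frames)])
    return np.fft.rfft(frames, n=frame_len, axis=1)

# One second of audio at 16 kHz yields 61 frames of 257 frequency bins
spec = stft_analysis(np.random.default_rng(0).standard_normal(16000))
```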
We set the a posteriori SPP, p(H_1|\mathbf{y}), to zero for the first 10 frames instead of computing it using (15), based on the assumption that speech would be absent in the initial periods, which helped the fast stabilization of the algorithm, as in [27]. To mitigate speech distortion at the expense of increased residual noise [59], the lower bounds for p(H_1|\mathbf{y}) used to compute \widehat{|S_1|^2} in (24) and for the Wiener gain in (10) were configured to 0.5 and −18 dB, respectively. The parameter values for \lambda and \xi_{min} were set to 0.9 and −10 dB, respectively. For the TCS schemes in (27) and (31), we followed the procedure in [51], employing the same parameter values except for the constant smoothing parameter \bar{\alpha}_{const}, which was determined empirically as a function of the quefrency index q.
For the DNN-based a priori SPP estimation, we adopted the DNN architecture in [27], which consisted of a uni-directional long short-term memory (LSTM) layer of 512 dimensions, followed by three fully-connected layers of 256 dimensions. The activation functions were the rectified linear unit (ReLU) for the first three layers and sigmoidal activation for the last layer, which produced a 257-dimensional output vector. The input for the DNN was the noisy log magnitude spectrum at the reference microphone, and the training target was binary for each bin, which was set by thresholding the instantaneous SNR [13].

Experimental Results
To demonstrate the superior performance of the proposed speech enhancement method, we evaluated the wideband PESQ score [60], eSTOI [61], and SISDR [62]. As we focused on the online framework in which the algorithm only uses the current and previous audio samples for frame-wise processing, online algorithms designed in this way were chosen as the baseline methods. Depending on the a priori SPP estimator, we compared the proposed method with the MWF in [26] when the CDR-based a priori SPP estimator [26] was adopted, whereas the REM approach [27] was used for the performance comparison when the DNN-based a priori SPP estimator [27] was employed. As in [27], two versions of the REM approach using the Wiener postfilter and the Kalman postfilter, denoted by DNN-REMWF and DNN-REMKF, were included in the experiment. The configuration parameters for the compared methods were set as in the original papers. Tables 1-3 show the average PESQ score, eSTOI, and SISDR for each method depending on the noise type, respectively. The proposed method with the CDR-based a priori SPP estimator, CDR-Proposed, outperformed the previous approach in [26] by 0.39 in terms of the average PESQ score, 0.022 in terms of the eSTOI, and 4.3 dB in terms of the SISDR on average. With the DNN-based a priori SPP estimator, the proposed method, DNN-Proposed, improved the performance of DNN-REMKF by 0.21 in terms of the average PESQ score, 0.017 in terms of the eSTOI, and 1.2 dB in terms of the SISDR on average. Table 4 shows the PESQ scores, eSTOIs, and SISDRs for the baselines and the proposed method depending on the SNR. As the SNRs for the utterances in the evaluation set of the CHiME-4 database are distributed as in Figure 3, we divided the evaluation set into three groups depending on the SNR: low SNR below 6.5 dB, medium SNR between 6.5 and 8.5 dB, and high SNR above 8.5 dB.
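Of the three metrics, SISDR is simple enough to sketch directly: the estimate is projected onto the reference, and the ratio of target energy to residual energy is invariant to any rescaling of the estimate. Conventions differ on whether the signals are mean-centered first; that step is omitted in this sketch.

```python
import numpy as np

def si_sdr(est, ref):
    """Scale-invariant SDR in dB: project the estimate onto the reference
    and compare the target energy with the residual energy."""
    s_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    e_res = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / np.dot(e_res, e_res))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)   # roughly 20 dB SISDR
```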
It can be seen that all the measures were improved in all SNR ranges, and the performance improvements were more pronounced in low SNRs. From the results, we may conclude that the proposed speech SCM estimation approach could improve the performance of the multi-microphone speech enhancement method, regardless of the adoption of the DNN for the a priori SPP estimation.

Ablation Study
Additionally, we carried out an ablation study to analyze how much each module in the proposed system contributed to the performance improvement. We proposed the speech PSD estimator \phi_s^{tcs,r} in (31) and the RTF estimator \mathbf{g}^{tdoa,r} in (29). The previous approaches were the speech PSD estimator using recursive smoothing, \phi_s^{ts} in (23), and the ML estimator of the RTF, \mathbf{g}^{ml} in (25). The performances of the systems replacing the proposed modules one by one with the conventional modules are summarized in Table 5. DNN-REMWF is also included, which uses \phi_s^{ts}, \mathbf{g}^{ml}, and the Wiener postfilter but adopts a different noise SCM estimator derived from the EM framework. The proposed system with the conventional speech PSD and RTF estimators, \phi_s^{ts} + \mathbf{g}^{ml}, showed the same average PESQ score and improved eSTOI and SISDR compared with DNN-REMWF [27]. Among the systems in the same framework, the introduction of the proposed speech PSD estimator improved the average PESQ scores by relatively large margins of 0.12 and 0.19, whereas it did not increase the eSTOIs and SISDRs. On the other hand, employing the proposed RTF estimator improved all three metrics. From the results, we may conclude that both the proposed speech PSD and RTF estimators contributed to the performance improvement.

Computational Complexity
Additionally, we compared the computational complexity of the baseline and proposed methods in terms of the normalized processing time of the MATLAB implementations of the methods. The processing times for each algorithm, normalized by the processing time of the proposed algorithm, are given in Table 6. In this experiment, the a priori SPP was estimated by a DNN for all cases. As they depend on implementation details and settings such as the number of microphones, the sampling frequency, and the dimensions of the DFT, the numbers given in the table should only be used as a rough indication. To see how much additional computational burden the refinement in the second pass incurred, the proposed method without the second pass (denoted as woSP) is included. From the table, it can be seen that the computational complexity of the proposed method was higher than those of MWF [26] and REMWF [27] but less than that of REMKF [27]. Table 6. Comparison of the normalized processing time when the a priori SPP was obtained by a DNN.

Conclusions
Multi-microphone speech enhancement exploits spatial information and spectro-temporal characteristics to reduce noise in the input. Online algorithms are required for applications sensitive to time delays, such as speech communication and hearing aids. In this paper, we propose an improved estimator of the speech SCM for online multi-microphone speech enhancement. Using the decomposition of the speech SCM under a rank-1 approximation, we propose improved estimators for the speech PSD and RTF. For speech PSD estimation, we adopt the TCS scheme, which exploits knowledge of the speech signal in the cepstral domain to provide a better estimate of the speech PSD than the ML estimate. The RTF is estimated based on the TDoA estimate, which summarizes the information from all frequency bins. These estimators are evaluated once with the input statistics and refined with the estimated clean speech spectrum and power spectrum obtained in the first pass. Our proposed speech enhancement method showed improved performance in terms of the PESQ score, eSTOI, and SISDR in various noise environments on the CHiME-4 dataset, compared with other online multi-microphone speech enhancement algorithms.
Future work may include the incorporation of other spatial cues such as the interchannel level differences on top of the inter-channel phase differences [47] into the RTF estimation without resorting to the far-field assumption. We may also investigate a deep learning approach to estimate acoustic parameters such as the speech and noise PSDs and RTF in a causal manner in the MVDR-Wiener filter factorization framework.