Dual-Mic Speech Enhancement Based on TF-GSC with Leakage Suppression and Signal Recovery

Abstract: The transfer function-generalized sidelobe canceller (TF-GSC) is one of the most popular structures for the adaptive beamformer used in multi-channel speech enhancement. Although the TF-GSC has shown decent performance, a certain amount of steering error is inevitable, which causes leakage of speech components through the blocking matrix (BM) and distortion in the fixed beamformer (FBF) output. In this paper, we propose to suppress the leaked signal in the output of the BM and restore the desired signal in the FBF output of the TF-GSC. To reduce the risk of attenuating speech in the adaptive noise canceller (ANC), the speech component in the output of the BM is suppressed by applying a gain function similar to the square-root Wiener filter, assuming that a certain portion of the desired speech leaks into the BM output. Additionally, we propose to restore the attenuated desired signal in the FBF output by adding some of the microphone signal components back, depending on how the microphone signals are related to the FBF and BM outputs. The experimental results showed that the proposed TF-GSC outperformed the conventional TF-GSC in terms of perceptual evaluation of speech quality (PESQ) scores under various noise conditions and directions of arrival (DoAs) of the desired and interfering sources.

One of the useful implementations of adaptive beamformers [17] is the generalized sidelobe canceller (GSC) [17], which consists of a fixed beamformer (FBF) that tries to pass only the signals from the desired direction, a blocking matrix (BM) that attempts to filter out the signals from the desired direction, and an adaptive noise canceller (ANC) that removes the signals related to the BM output from the FBF output. The transfer function GSC (TF-GSC) [18][19][20][21][22][23][24][25][26][27] is one of the most popular variants of the GSC; it steers a beam toward a desired source based on the estimated acoustic transfer function (TF) [2] ratio vector and achieves high noise reduction performance while maintaining low speech distortion [28]. However, the steering error [29] arising from inaccurate TF ratio estimation may introduce signal leakage in the BM output and speech distortion in the FBF output, which degrades the performance of speech enhancement, especially in the presence of diffuse or nonstationary noise.
One of the typical remedies to this problem is to equip a single-channel speech enhancement module as a post-processor [9][10][11][12][13][14][15][16]. The spectral gain based on the optimally modified log spectral amplitude (OM-LSA) estimator [30] is applied to the output of the TF-GSC in [12][13][14], where the transient beam-to-reference ratio (TBRR) [31] is utilized for hypothesis testing [32] to determine if the input contains desired speech, transient interference, or stationary noises, which in turn affects the estimation of a priori signal absence probability, a priori signal-to-noise ratio (SNR), noise power spectra, and spectral gain function. In [15,16], the results of the hypothesis testing were further utilized in the parameter updates for the TF-GSC. Nevertheless, a certain amount of the steering error is practically inevitable, resulting in signal leakage in the BM output and signal attenuation in the FBF output, which limits the performance of speech enhancement using TF-GSC.
In this paper, we propose to modify the outputs of the FBF and the BM by utilizing both outputs and the microphone signals. Specifically, we apply a gain function similar to the square-root Wiener filter to the BM output to suppress the desired speech signal leaking into it. Moreover, the attenuated signal in the FBF output is recovered to an extent by adding back a certain amount of the appropriate microphone signal. The experimental results show that the proposed TF-GSC achieves better performance than the conventional TF-GSC in various noise scenarios. Moreover, in contrast to the conventional TF-GSC, the proposed method provides consistent performance for various directions of arrival (DoAs) of the desired and interfering sources.
The remainder of this paper is organized as follows. Section 2 summarizes the conventional TF-GSC and the postfiltering. The proposed TF-GSC with leakage suppression and signal recovery is introduced in Section 3. Section 4 outlines the experimental results. Finally, a conclusion is provided in Section 5.

Summary of TF-GSC and Postfiltering
Let $x_p(m)$, $n_p(m)$, and $s(m)$ denote the input microphone signal and the interfering signal at the $p$-th microphone and the desired speech at the source at time $m$, respectively. With the additive noise assumption, $x_p(m)$ is represented as:

$$x_p(m) = a_p(m) * s(m) + n_p(m), \quad p = 1, 2, \tag{1}$$

where $a_p(m)$ is the acoustic impulse response from the desired source to the $p$-th microphone and $*$ denotes the convolution operation. The short-time Fourier transform (STFT) coefficients $X_p(n,k)$, $N_p(n,k)$, and $S(n,k)$ for frame $n$ and frequency $k$ are related as:

$$\mathbf{X}(n,k) = \mathbf{A}(n,k)\,S(n,k) + \mathbf{N}(n,k), \tag{2}$$

in which:

$$\mathbf{X}(n,k) = [X_1(n,k),\, X_2(n,k)]^T, \quad \mathbf{A}(n,k) = [A_1(n,k),\, A_2(n,k)]^T, \quad \mathbf{N}(n,k) = [N_1(n,k),\, N_2(n,k)]^T,$$

where $A_p(n,k)$ is the acoustic TF from the desired source to the $p$-th microphone for frame $n$ and frequency $k$.

A block diagram of the TF-GSC is shown in Figure 1 in black, which mainly consists of three blocks [2,7,8,18]. The FBF conserves signals from a desired direction while rejecting other signals as much as possible. The output of the FBF $\mathbf{W}(n,k)$, $Y_{FBF}(n,k)$, is given by:

$$Y_{FBF}(n,k) = \mathbf{W}^H(n,k)\,\mathbf{X}(n,k),$$

where $\mathbf{W}(n,k)$ is constrained to satisfy $\mathbf{W}^H(n,k)\,\mathbf{A}(n,k) = F^*(n,k)$, in which the prespecified filter $F^*(n,k)$ is usually assumed to be a simple delay [9]. However, since the actual TFs are very difficult to obtain, the TF ratios are estimated instead and the optimal $\mathbf{W}(n,k)$ is represented as [18]:

$$\mathbf{W}(n,k) = \frac{\mathbf{H}(n,k)}{\|\mathbf{H}(n,k)\|^2}\,F(n,k),$$

in which $\mathbf{H}(n,k)$ is the acoustic TF ratio vector:

$$\mathbf{H}(n,k) = \frac{\mathbf{A}(n,k)}{A_1(n,k)} = [1,\, H_2(n,k)]^T, \quad H_2(n,k) = \frac{A_2(n,k)}{A_1(n,k)}.$$

In the other branch, the BM $\mathbf{B}(n,k)$ tries to block the signals from the desired direction while passing all other signals, resulting in the noise reference $Y_U(n,k)$:

$$Y_U(n,k) = \mathbf{B}^H(n,k)\,\mathbf{X}(n,k),$$

in which:

$$\mathbf{B}(n,k) = [-H_2^*(n,k),\, 1]^T.$$

The third block, the ANC, tries to remove the components related to $Y_U(n,k)$ from the FBF output $Y_{FBF}(n,k)$ using the normalized least-mean-square (NLMS) algorithm:

$$Y(n,k) = Y_{FBF}(n,k) - G^*(n,k)\,Y_U(n,k),$$

where $G(n,k)$ is the ANC filter updated by:

$$G(n+1,k) = G(n,k) + \mu\,\frac{Y_U(n,k)\,Y^*(n,k)}{P_{est}(n,k)},$$

in which $\mu$ is a step size and the noise reference power $P_{est}(n,k)$ is updated as:

$$P_{est}(n,k) = \rho_U\,P_{est}(n-1,k) + (1-\rho_U)\,|Y_U(n,k)|^2,$$

where $\rho_U$ is a smoothing factor. For the FBF and the BM to operate effectively, accurate estimation of the TF ratios is essential.
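The three blocks described above can be sketched for one STFT frame of a two-microphone array as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function name `tf_gsc_frame`, the default values of `mu` and `rho_u`, the regularizer `eps`, and the concrete forms used for the FBF (W = H F / ||H||^2 with H = [1, H2]^T), the BM (Y_U = X2 - H2 X1), and the NLMS recursion are assumptions that follow the standard TF-GSC formulation of [18].

```python
import numpy as np


def tf_gsc_frame(X, H2, G, P_prev, mu=0.1, rho_u=0.9, F=1.0, eps=1e-12):
    """Process one STFT frame with a two-microphone TF-GSC (illustrative sketch).

    X      : complex array (2, K), microphone spectra X_1 and X_2
    H2     : complex array (K,), TF ratio A_2 / A_1 per frequency bin
    G      : complex array (K,), current ANC filter
    P_prev : float array (K,), noise-reference power of the previous frame
    Returns (Y, Y_fbf, Y_u, G_next, P_est).
    """
    X1, X2 = X

    # FBF: Y_FBF = W^H X with W = H F / ||H||^2 and H = [1, H2]^T
    norm = 1.0 + np.abs(H2) ** 2
    Y_fbf = np.conj(F) * (X1 + np.conj(H2) * X2) / norm

    # BM: Y_U = B^H X with B = [-H2^*, 1]^T, which cancels any signal X = H S
    Y_u = X2 - H2 * X1

    # ANC output: subtract the filtered noise reference from the FBF output
    Y = Y_fbf - np.conj(G) * Y_u

    # Recursive power estimate of the noise reference, then NLMS filter update
    P_est = rho_u * P_prev + (1.0 - rho_u) * np.abs(Y_u) ** 2
    G_next = G + mu * Y_u * np.conj(Y) / (P_est + eps)
    return Y, Y_fbf, Y_u, G_next, P_est
```

With an exact TF ratio, a pure speech input X = [S, H2 S] yields a zero noise reference and an undistorted output Y = S; any error in `H2` makes `Y_u` nonzero for speech, which is the leakage that the remainder of the paper addresses.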
In [18], the TF ratios were estimated using least squares, assuming that the TF ratios change slowly in time compared to the desired signal and that the background noise signals are stationary. The TF ratio is updated when the desired signal is present in the last $L$ frames, based on the estimates $\hat{\Phi}_{X_p X_1}(n,k)$ ($p = 1, 2$) of the cross-power spectral density between $X_p$ and $X_1$ given by:

$$\hat{\Phi}_{X_p X_1}(n,k) = \rho_P\,\hat{\Phi}_{X_p X_1}(n-1,k) + (1-\rho_P)\,X_p(n,k)\,X_1^*(n,k),$$

where $\rho_P$ is a smoothing factor. However, the TF ratio estimation suffers from diffuse or nonstationary noises, resulting in steering error, which causes leakage of speech components through the BM and speech distortion in the FBF output.

One way to mitigate this problem is to apply a postfilter based on the OM-LSA estimator [12][13][14][15][16]. The first step of this postfiltering is the hypothesis test, which determines whether the output of the TF-GSC contains desired speech, transient interference, or stationary noise using the TBRR [31]. The result of the hypothesis test is utilized to determine the a priori signal absence probability and, finally, the spectral gain function. Additionally, the hypothesis test result from the postfilter is further used to control the update of the parameters of the TF-GSC in [15,16]. This update is expressed in terms of $\mathcal{L}$, the set of frame indices that contain the desired signal component in the analysis interval from the $(n-L+1)$-th frame to the $n$-th frame, its size $|\mathcal{L}|$, and the threshold $N_d$ for the TF ratio update. It is noted that the value of $L$ should be chosen carefully: it must be large enough to provide accurate estimates of the ensemble averages, yet not so large that the quasi-stationarity assumption for the acoustic TF is violated [15,16,18]. The controlled update has the effect of relaxing the assumption on the stationarity of the TF ratio, which reduces the steering error to an extent.

Even with the improved TF ratio estimation, however, a certain amount of steering error is practically inevitable, leaving room for improvement in the performance.
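The estimation procedure summarized above can be illustrated with a short NumPy sketch. It assumes the least-squares estimator of [18] in its usual form: the cross-PSD between X_p and X_1 is modeled as the TF ratio times the auto-PSD of X_1 plus a constant bias from stationary noise, so the ratio is the slope of a line fitted over the analysis frames. The function names, the default `rho_p`, and the way the speech-presence mask and the threshold `n_d` gate the update are illustrative assumptions, not the exact rules of [15,16].

```python
import numpy as np


def smooth_cross_psd(phi_prev, Xp, X1, rho_p=0.9):
    """First-order recursive estimate of the cross-PSD between X_p and X_1."""
    return rho_p * phi_prev + (1.0 - rho_p) * Xp * np.conj(X1)


def ls_tf_ratio(phi_p1, phi_11, eps=1e-12):
    """Least-squares slope of phi_p1 against phi_11 across frames.

    phi_p1 : complex array (L, K), cross-PSD estimates over L frames
    phi_11 : real array (L, K), auto-PSD estimates of microphone 1
    """
    mean_p1 = phi_p1.mean(axis=0)
    mean_11 = phi_11.mean(axis=0)
    # Covariance of the two PSD tracks divided by the variance of phi_11;
    # a frame-independent bias in phi_p1 cancels in the numerator.
    num = (phi_p1 * phi_11).mean(axis=0) - mean_p1 * mean_11
    den = (phi_11 ** 2).mean(axis=0) - mean_11 ** 2
    return num / (den + eps)


def gated_tf_ratio_update(H_prev, phi_p1, phi_11, speech_frames, n_d):
    """Re-estimate the TF ratio only if enough frames contain desired speech.

    speech_frames : boolean array (L,), hypothesis-test result per frame
    n_d           : minimum number of speech frames required for an update
    """
    idx = np.flatnonzero(speech_frames)
    if idx.size < n_d:
        return H_prev  # too few speech frames: keep the previous estimate
    return ls_tf_ratio(phi_p1[idx], phi_11[idx])
```

The estimator relies on the speech power varying across frames; if the auto-PSD of microphone 1 is nearly constant over the interval, the denominator approaches zero and the estimate becomes unreliable, which is one reason the choice of the interval length matters.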

Proposed TF-GSC with Leakage Suppression and Signal Recovery
To alleviate the problems caused by the steering error, this paper proposes a leakage suppression module to suppress the leaked desired signal in the output of the BM and a signal recovery module to restore the attenuated desired signal in the FBF output, both of which are shown in Figure 1 in red. The postfiltering and the feedback from it are also adopted for both the conventional and the proposed TF-GSC, but are omitted from the figure.

Leakage Suppression
Although the FBF and BM do not operate perfectly, the ratio of the FBF output $Y_{FBF}$ to the BM output $Y_U$ bears information on the input SNR. The leakage of the desired signal into the BM output becomes more severe as the input SNR increases, and vice versa. Therefore, the leakage suppression module applies a spectral gain $G_U(n,k)$ to the BM output $Y_U(n,k)$ to produce the modified noise reference $\bar{Y}_U(n,k)$:

$$\bar{Y}_U(n,k) = G_U(n,k)\,Y_U(n,k),$$

in which $G_U(n,k)$ has a form similar to the square-root Wiener filter, with a tuning parameter $\alpha < 1$ representing the attenuation of the desired signal component in the BM output. The leakage suppression attenuates the desired signal component in $Y_U(n,k)$ and thus reduces the signal cancellation in the ANC.
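As an illustration of the leakage suppression step, the sketch below applies a square-root Wiener-like gain to the BM output. The exact gain of the proposed method is not reproduced here. The form used, in which the leaked speech power is taken to be `alpha` times the FBF output power and subtracted from the BM output power, as well as the function name, the gain floor `g_min`, and the default parameter values are assumptions chosen only to show the intended behavior: the gain decreases as the FBF-to-BM power ratio, a proxy for the input SNR, grows.

```python
import numpy as np


def leakage_suppression(Y_fbf, Y_u, alpha=0.1, g_min=0.1, eps=1e-12):
    """Attenuate leaked speech in the noise reference (hypothetical gain form).

    Y_fbf, Y_u : complex arrays (K,), FBF and BM outputs for one frame
    alpha      : assumed fraction of the FBF output power leaking into the BM
    g_min      : assumed lower bound on the gain
    Returns (Y_u_bar, gain).
    """
    p_u = np.abs(Y_u) ** 2
    p_leak = alpha * np.abs(Y_fbf) ** 2          # assumed leaked speech power
    p_noise = np.maximum(p_u - p_leak, 0.0)      # remaining noise power
    gain = np.sqrt(p_noise / (p_u + eps))        # square-root Wiener-like gain
    gain = np.maximum(gain, g_min)
    return gain * Y_u, gain
```

When the FBF output is weak relative to the BM output (low SNR), the gain stays close to one and the noise reference passes almost unchanged; when the FBF output dominates, the gain falls toward the floor, so less of the leaked speech reaches the ANC.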

Signal Recovery
Once the desired signal is attenuated by the FBF with steering error, the following modules such as the ANC and the postfilter cannot restore it, while they can suppress the residual interference in the FBF output. In this regard, the proposed signal recovery module adds a certain amount of the appropriate microphone signal to the FBF output.
To determine which microphone signal should be utilized to recover the attenuated desired signal, we evaluate which microphone signal is closer to $Y_{FBF}$ and which is closer to $Y_U$. Specifically, we assess the cosine similarities between $X_1$, $X_2$ and $Y_{FBF}$, $Y_U$, i.e., the cosine values of the Euclidean angles between $\mathbf{X}_p(n)$ and $\mathbf{Y}_{FBF}(n)$ and between $\mathbf{X}_p(n)$ and $\mathbf{Y}_U(n)$ [33,34]. They are computed from inner products over the $K$ frequency bins, with recursive smoothing governed by a smoothing parameter $\lambda$. Using these similarities, two similarity differences, $SD_{X_{12}Y_{FBF}}(n)$ and $SD_{X_{21}Y_U}(n)$, are computed, where $SD_{X_{12}Y_{FBF}}(n)$ is a measure of how much more similar $X_1$ is to $Y_{FBF}$ than $X_2$ is, and $SD_{X_{21}Y_U}(n)$ represents how much closer $X_2$ is to $Y_U$ than $X_1$ is. Thus, if both $SD_{X_{12}Y_{FBF}}(n)$ and $SD_{X_{21}Y_U}(n)$ are positive (or negative), $X_1$ (or $X_2$) contains more of the desired signal. If the absolute value of $SD_{X_{12}Y_{FBF}}(n)$ is small, however, the desired source may be located at the broadside, and thus the average of the two microphone signals would provide a better reference of the desired signal. When the signs of $SD_{X_{12}Y_{FBF}}(n)$ and $SD_{X_{21}Y_U}(n)$ differ, we anticipate that the signal restoration may not be reliable, and the FBF output is not modified. This selection rule is also described in Figure 2.

Using the selected microphone signal, the FBF output $Y_{FBF}(n,k)$ is modified by adding the selected signal scaled by a gain $G_X(n,k)$, which determines how much of the microphone signal is added to restore the attenuated desired signal. Since adding the microphone signal in low-SNR environments may be harmful, $G_X(n,k)$ is designed to take higher values at higher SNRs, with a tuning parameter $\beta > 1$.
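The selection logic described above can be sketched as follows. This is an illustration under several assumptions: the similarities are computed per frame without the recursive smoothing, the threshold `tau` that decides when the similarity difference is "small" is hypothetical, the small-difference test is applied before the sign test, and the recovery gain `G_x` is passed in by the caller because its exact form is not reproduced here.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """|<a, b>| / (||a|| ||b||) for two complex spectra with K bins."""
    return np.abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


def similarity_differences(X1, X2, Y_fbf, Y_u):
    """Return (sd_fbf, sd_u): how much closer X1 is to Y_FBF than X2 is,
    and how much closer X2 is to Y_U than X1 is."""
    sd_fbf = cosine_similarity(X1, Y_fbf) - cosine_similarity(X2, Y_fbf)
    sd_u = cosine_similarity(X2, Y_u) - cosine_similarity(X1, Y_u)
    return sd_fbf, sd_u


def select_recovery_signal(X1, X2, sd_fbf, sd_u, tau=0.05):
    """Choose the signal added back to the FBF output (tau is hypothetical)."""
    if abs(sd_fbf) < tau:
        return 0.5 * (X1 + X2)      # source near broadside: use the average
    if sd_fbf > 0 and sd_u > 0:
        return X1                   # microphone 1 holds more desired signal
    if sd_fbf < 0 and sd_u < 0:
        return X2                   # microphone 2 holds more desired signal
    return np.zeros_like(X1)        # signs disagree: leave the FBF output as is


def recover_fbf(Y_fbf, X_sel, G_x):
    """Add the selected microphone signal, scaled by G_x, to the FBF output."""
    return Y_fbf + G_x * X_sel
```

Returning a zero vector when the signs disagree keeps `recover_fbf` free of special cases, since adding zero leaves the FBF output unmodified.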

Experimental Results
To evaluate the performance of the proposed method, the acoustic environments depicted in Figure 3 were simulated using the image method [35]. The dimensions of the room were [6.7 m, 6.1 m, 2.9 m]. Two microphones were located at [3 m, 3 m, 1.5 m] and [3.14 m, 3 m, 1.5 m], i.e., 14 cm apart, which is typical for modern smartphones. The reverberation times (RT60s) were 300 ms and 500 ms. The distance between the desired source and the microphone array was 0.4 m, while that for the interfering source was 0. Twenty utterances spoken by 13 male and 7 female speakers were selected randomly from the TIMIT database [36] as the desired speech signals. Babble, F16, and Factory1 noises from the NOISEX-92 database [37] and restaurant and street noises from the AURORA 2 database [38] were used as diffuse noise, which was constructed using the arbitrary noise field generator [39]. Five competing talker utterances were selected randomly from the TIMIT database as directional noises. The SNRs for the diffuse or directional noise were 0, 5, 10, 15, and 20 dB. The sampling rate was 8 kHz, and a 256-point STFT with a Tukey window was used with a frame shift of 160 samples.
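Noisy test signals at the SNRs listed above can be produced by scaling the noise before adding it to the speech. The sketch below is one common way to do this and is not taken from the paper: it assumes the SNR is defined as the ratio of the average powers of the whole speech and noise signals at a single microphone, and the function name `mix_at_snr` is hypothetical.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db.

    speech, noise : real arrays of equal length (time-domain signals)
    Returns (mixture, scaled_noise).
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power is p_speech / 10^(snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise
```

For a two-microphone simulation, the same scale factor would be applied to the noise at both microphones so that the spatial characteristics of the noise field are preserved.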
The empirically determined values of the parameters for the conventional TF-GSC with postfiltering [16] and the proposed one with two additional modules are summarized in Table 1. The ANC filter $G$ and the noise reference power for the ANC update $P_{est}$ were allowed to converge before the performance was evaluated. The parameters not included in the table were set to be the same as in [16], which produced the highest average performance. Perceptual evaluation of speech quality (PESQ) scores [40] were used to evaluate the performance. Figures 4 and 5 show the average PESQ scores and the 95% confidence intervals (CIs) for the input microphone signal, the output of the conventional TF-GSC with postfiltering, and the output of the proposed scheme in diffuse noise environments as a function of the SNR, the RT60, and the azimuth of the desired source, averaged over the five noise types. The conventional algorithm improved the PESQ scores in all conditions, but the improvement diminished as the desired source moved from the broadside direction (0°) to the end-fire direction (−90°). The proposed leakage suppression and signal recovery were considered to be effective in maintaining the performance for all DoAs of the desired source. The improvement in performance over the conventional TF-GSC was statistically significant when the desired source was located near the end-fire direction (−90° and −60°) and the SNR was higher than 0 dB, but was insignificant when the desired source was located at the broadside (0°). This may be because the steering error was smaller when the desired source was located in the broadside direction, and because the signal recovery module essentially has no impact at the broadside, as it just adds $(X_1(n) + X_2(n))/2$, which is a scaled version of the FBF output, to the FBF output. A similar tendency was observed for both RT60s. The PESQ score differences between the proposed and baseline beamformers for each noise type are shown in Figure 6. The proposed beamformer improved the PESQ scores in both stationary and nonstationary noise environments. The signal recovery module may work more effectively when the desired source is located in the end-fire direction by recovering the attenuated desired signal using the closer microphone signal.

Figure 6. Difference of the average PESQ scores for the proposed and baseline beamformers in diffuse noise environments for five noise types with the RT60s of (a) 300 ms and (b) 500 ms.
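The average scores and 95% confidence intervals reported for each condition can be obtained from the per-utterance PESQ scores as sketched below. The paper does not state how the intervals were computed; this example assumes the normal approximation (mean plus or minus 1.96 standard errors), and the function name `mean_ci95` is hypothetical.

```python
import numpy as np


def mean_ci95(scores):
    """Mean and normal-approximation 95% confidence interval of the mean.

    scores : sequence of per-utterance scores for one test condition
    Returns (mean, lower, upper).
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    # Standard error of the mean using the unbiased sample standard deviation
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width
```

Under this convention, non-overlapping intervals of two methods in the same condition are a conservative indication that the difference is significant; a paired test on the per-utterance differences would be the more direct check.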
Another set of experiments was carried out on speech enhancement in the presence of directional interference, namely a competing talker. Figures 7 and 8 show the average PESQ scores and CIs for the conventional and proposed TF-GSC depending on the locations of the desired and interfering sources for two different RT60s. As in the case of the diffuse noises, the proposed method showed higher PESQ scores than the conventional TF-GSC, achieving almost the same performance for all DoAs of the desired source in contrast to the baseline TF-GSC. The improvement in performance was significant except when the desired source was located at the broadside, as in the diffuse noise case. The PESQ scores were improved even when the desired and interfering sources were located in the same direction, although the improvement was smaller. This may be because the interfering source was located farther away than the desired source, as is often the case in practical scenarios. Therefore, $Y_U$ might contain more reverberant components from the competing talker and $Y_{FBF}$ might include more of the desired signal, which enables the leakage suppression module to operate properly.

Conclusions
In this paper, we introduced two additional modules to mitigate the effect of the steering error of the TF-GSC with postfiltering. The leakage suppression module suppresses the leaked desired signal in the output of the BM by applying a spectral gain similar to the square-root Wiener filter. The signal recovery module, in turn, restores the attenuated desired signal in the FBF output by adding a certain amount of the appropriate microphone signal, which is chosen by examining the cosine similarities between the microphone signals and the outputs of the FBF and BM. The experimental results showed that the two proposed modules improved the performance of the conventional TF-GSC with postfiltering, both in diffuse noise environments and in the presence of a competing talker, achieving almost the same PESQ scores for all of the DoAs of the desired signal.