A Unified Speech Enhancement System Based on Neural Beamforming With Parabolic Reflector

Abstract: This paper presents a unified speech enhancement system that removes both background noise and interfering speech in severe noise environments by jointly utilizing a parabolic reflector model and a neural beamformer. First, the amplification property of the paraboloid is discussed, which significantly improves the Signal-to-Noise Ratio (SNR) of a desired signal. An appropriate paraboloid channel is then analyzed and designed through the boundary element method. In addition, a time-frequency masking approach and a mask-based beamforming approach are discussed and incorporated into the enhancement system. Notably, the signals provided by the paraboloid and the beamformer are complementary. Finally, these signals are employed in a learning-based fusion framework to further improve the system performance in low SNR environments. Experiments demonstrate that the system is effective and robust in five different noisy conditions (speech interfered with factory, pink, destroyer engine, volvo, and babble noise), as well as at different noise levels. Compared with the original noisy speech, the average improvements in objective metrics are about ΔSTOI = 0.28, ΔPESQ = 1.31, and ΔfwSegSNR = 11.9.


Introduction
Perceived quality and intelligibility of speech signals are degraded by pervasive noise. This presents challenges to many applications, such as speech communication, hearing aids, and speech recognition. For these applications, speech enhancement is crucial to recover signals from the noisy speech. The enhancements offered by multichannel devices are usually greater than those of single-channel devices [1]. Recent studies indicate that it is beneficial to extract a desired speech signal by beamforming in noisy and reverberant environments, especially in high-level background noise [2,3].
Traditional beamforming methods require a priori knowledge of the Direction of Arrival (DoA) or the transfer functions from an acoustic source to the microphones [4]. Estimating this spatial information for a microphone array is challenging in adverse acoustic conditions. Motivated by the auditory masking effect, the time-frequency (T-F) masking technique applies a real-valued or binary mask to the signal's spectrum to filter out unwanted components: the mask preserves speech-dominant T-F units and attenuates noise-dominant T-F units [5]. Advanced beamforming operations require an estimate of the cross-power spectral density matrix of the noise. These statistics can be obtained by estimating spectral masks for speech and noise, so beamformers with a mask estimation network can also enhance the quality of speech. In neural beamformers [6,7], networks first estimate the time-frequency masks, and the masks are then applied to the signal's spectrum to predict speech and noise statistics. With these statistics, multichannel filter coefficients are computed based on well-studied beamforming designs, such as Minimum Variance Distortionless Response (MVDR) beamformers [8], Linearly Constrained Minimum Variance (LCMV) beamformers [9], and Generalized Eigenvalue (GEV) beamformers [10]. Gannot et al. [11] explored many popular data-dependent spatial filter design criteria and recognized several well-known beamforming criteria as special cases. Chakrabarty et al. [12] proposed a Convolutional Neural Network (CNN)-based mask estimator that learns from all the channels simultaneously. Their results show that utilizing multi-channel information is beneficial, although the approach is array-dependent.
Many researchers choose to work on single-channel mask predictors because they can be applied to all kinds of array configurations [13]. Recent studies have mainly focused on how to design an efficient network structure for single-channel mask prediction, which can provide a more accurate mask that assigns a proportion of each T-F bin to each of the sources. A Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells [14] was used to estimate the time-domain spatial filter weights of a filter-and-sum beamformer for each channel separately; the weights are then convolved with the input signal to obtain the enhanced signal. This work was later extended to estimate frequency-domain spatial filter weights in [15]. The recording quality of a microphone has an important impact on the speech enhancement performance of the system. It is therefore crucial to select the channel that captures the signal with the fewest unwanted components as the reference microphone to provide data for mask prediction. Ochiai et al. [16] proposed an attention mechanism for reference microphone selection, while Lei Sun [17] adopted data cleaning and augmentation operations to obtain data that yield better performance in the following stages. However, these methods mainly rely on a learning-based approach to select the desired data from existing channels, which makes the system more complicated. Furthermore, there are no significant differences between the channels of homogeneous sensors, so more effective signals cannot be acquired in this way. Given the trends above, this paper proposes a novel method for speech enhancement that combines acoustic focusing and Deep Neural Network (DNN)-based multi-channel beamforming. The microphone array speech enhancement framework is extended by integrating the speech enhancement component from the parabolic reflector into the mask-based beamformer. The performance of the two enhancement approaches with two different types of mask application is evaluated, as is the performance of a traditional Delay and Sum Beamformer (DSB). The proposed system is also compared with a Complex Ideal Ratio Mask (CIRM) estimation method using a DNN, presented in [18], and its performance for different noise types is analyzed.
The rest of the paper is organized as follows. In Section 2, the design of the proposed system is described, including the signal model, the acoustic focus performance of the paraboloid, as well as masking and beamforming-based enhancement approaches. In Section 3, the experimental setups are presented. In Section 4, results and analyses are reported. Section 5 concludes the paper.

Materials and Methods
In this section, the signal model is presented first. Then, a quantitative analysis of the acoustic focusing characteristics of the paraboloid is given, along with the design of a parabolic reflector. Following that, two different ways to incorporate the masks in a speech enhancement system are presented. Finally, the integration of the parabolic reflector (PR) model with multi-microphone beamforming to achieve higher speech enhancement performance is described.

Signal Model
A multi-channel data model with static sources and diffuse noise can be written as Equation (1), where L denotes the number of microphones and Q denotes the number of source signals. The notation '*' indicates convolution, and t indexes a time sample. y_i(t) denotes the signal at microphone i, and s_j(t) denotes the j-th source signal. h_{i,j}(t) is the Room Impulse Response (RIR), which models sound propagation from source j to microphone i. An array of L microphones was utilized throughout this work. In the Short Time Fourier Transform (STFT) domain, if the environment can be assumed anechoic, the vector of received signals, y(n, k) = [Y(n, k, 1), ..., Y(n, k, L)]^T, at time frame n and frequency bin k is given by Equation (2). The noise is divided into two components: diffuse noise, denoted by v_d(n, k), and spatially uncorrelated microphone self-noise, denoted by v(n, k).
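As a reading aid, Equations (1) and (2) can be written in the following form, which is consistent with the symbol definitions above. The time-domain noise term $n_i(t)$ and the desired-signal vector $\mathbf{x}_d(n,k)$ are notational assumptions, not symbols confirmed by the text.

```latex
\begin{align}
  % Eq. (1): time-domain model; n_i(t) is an assumed symbol for the total noise at microphone i
  y_i(t) &= \sum_{j=1}^{Q} h_{i,j}(t) * s_j(t) + n_i(t), \qquad i = 1, \dots, L \\
  % Eq. (2): anechoic STFT-domain model with the two noise components named in the text
  \mathbf{y}(n,k) &= \mathbf{x}_d(n,k) + \mathbf{v}_d(n,k) + \mathbf{v}(n,k)
\end{align}
```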

Principles
Sten [19] studied the acoustic properties of a paraboloid. Two geometrical characteristics of a parabola, illustrated in Figure 1, are essential for its application as an acoustic reflector.

1. The angle between OP and the tangent at the point of contact P equals the angle between FP and the same tangent. In acoustics, any incident wave travelling parallel to the central axis is therefore reflected to the same position, the focus.
2. For a given line at right angles to the axis, the sum of the lengths of OP and PF is a constant. This means that sound waves of the same frequency that travel parallel to the central axis arrive at the focus with the same phase.
Figure 1. l and a are the depth and the focal length of the paraboloid. The reflector diameter d is a function of the ratio l/a. F represents the focus point, P represents the reflection point, and OP is a straight line parallel to the central axis.
As a consequence, the sound pressure is amplified at the focus F. According to [19], the amplification of sound arriving parallel to the axis is given by Equation (3), where F_p represents the amplification factor of the sound pressure at the focus. λ denotes the wavelength of sound, which equals the ratio of sound speed to sound frequency.

Performance Analysis and Validation
As shown in Equation (3), F_p depends on three factors: a, l, and λ. Figure 2 illustrates the relationship between F_p, l/a, and the sound frequency when the diameter d is fixed at 1 m. The gray plane (F_p = 0 dB) is the base plane, where there is neither amplification nor attenuation. The depth of the paraboloid increases with the ratio l/a. The trend in Figure 2 is intuitive: the amplification increases with either the frequency or the depth of the paraboloid. However, as the depth increases, the curve flattens, so deepening the paraboloid yields diminishing gains in amplification. Using the indirect boundary element method [20], a series of simulations was run in LMS Virtual.Lab software. The results show that sound converges in the parabolic model at the focus, where the sound pressure level is at its maximum.
To verify the feasibility of the PR-based method, results from an experiment performed in a real anechoic chamber were compared with the computer simulation results. As shown in Figure 3, both the simulation and the experimental results followed a trend similar to that predicted by Equation (3). There were also outliers in the measured curve. This is because the actual paraboloid was made of plastic rather than the rigid material assumed in theory, so sound waves partially penetrated the paraboloid and ideal focusing could not be achieved. Overall, the experimental results supported the above theory. Based on these results, the PR system was implemented, and the amplification function of the PR was fitted with reference to Equation (3). The ratio l/a = 4 was chosen as the key parameter of the PR system. The other structural parameters of the paraboloid were a = 40 mm, l = 160 mm, and d = 320 mm.
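The structural parameters quoted above can be cross-checked with elementary parabola geometry. The sketch below assumes the profile r^2 = 4·a·z (a standard parabola with focal length a, which is not a formula stated in the text) and a sound speed of 343 m/s; the function names are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value (not given in the text)


def reflector_diameter(focal_length, depth):
    """Aperture diameter of a paraboloid with profile r^2 = 4 * a * z at depth z = l."""
    return 2.0 * math.sqrt(4.0 * focal_length * depth)


def wavelength(frequency_hz, speed=SPEED_OF_SOUND):
    """Wavelength as the ratio of sound speed to sound frequency."""
    return speed / frequency_hz


a_mm, l_mm = 40.0, 160.0                      # focal length and depth from the text
d_mm = reflector_diameter(a_mm, l_mm)
print(f"l/a = {l_mm / a_mm:.0f}, d = {d_mm:.0f} mm")   # l/a = 4, d = 320 mm
```

Under this assumed profile, a = 40 mm and l = 160 mm reproduce the stated diameter d = 320 mm, so the three quoted parameters are mutually consistent.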

Two Approaches to Utilize Masks
The Ideal Ratio Mask (IRM) is a soft mask given by Equation (4), where ref_m denotes the reference microphone and m indexes the corresponding neural beamformer. Y(n, k, ref_m) denotes the signal recorded at the reference microphone, and X_d(n, k, ref_m) represents the estimated clean speech signal. The corresponding noise mask is 1 − I_IRM. Two different approaches for incorporating a mask to obtain a desired signal are discussed below.
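A minimal NumPy sketch of a magnitude-ratio mask at the reference microphone follows. It assumes that Equation (4) takes the common form |X_d| / |Y|; the clipping to [0, 1], the `eps` guard, and the function name are assumptions made for illustration.

```python
import numpy as np


def ideal_ratio_mask(clean_stft, noisy_stft, eps=1e-8):
    """Soft speech mask per T-F bin, assumed here to be |X_d| / |Y| clipped to [0, 1]."""
    ratio = np.abs(clean_stft) / (np.abs(noisy_stft) + eps)
    return np.clip(ratio, 0.0, 1.0)


# Toy spectra: 3 frames x 4 frequency bins of clean speech plus noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
noise = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
noisy = clean + noise

speech_mask = ideal_ratio_mask(clean, noisy)
noise_mask = 1.0 - speech_mask        # noise mask, as stated in the text
```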

Direct Mask Application
In this approach, the mask is applied directly to the T-F representation of the signal at the reference microphone, that is, the microphone used to calculate the ideal mask. An estimate of the desired signal is given by Equation (5), where I_(·) represents the mask. Note that the phase of X̂_d(n, k, ref_m) is equal to the phase of Y(n, k, ref_m), and the desired signal waveform is obtained by an inverse STFT of the estimated magnitude combined with this phase.
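Direct mask application amounts to an element-wise product in the T-F domain. The sketch below assumes that Equation (5) has the form X̂_d = I · Y, which is consistent with the statement that the estimate keeps the noisy phase; the toy arrays and names are illustrative.

```python
import numpy as np


def apply_mask(mask, noisy_stft):
    """Scale the noisy magnitude by a real-valued mask while keeping the noisy phase."""
    magnitude = mask * np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    return magnitude * np.exp(1j * phase)


# Toy reference-microphone spectrum (2 frames x 2 bins) and a mask in (0, 1].
noisy = np.array([[1.0 + 1.0j, -2.0 + 0.5j],
                  [0.3 - 0.4j, 1.5 + 0.0j]])
mask = np.array([[0.9, 0.1],
                 [0.5, 1.0]])

estimate = apply_mask(mask, noisy)
```

An inverse STFT of `estimate` would then give the time-domain waveform.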

Neural Beamformer
The frequency-domain beamformer is used to reduce computational complexity. w(n, k) denotes the corresponding beamforming filter coefficients. Using a spatial filtering approach, an estimate of the desired signal is obtained as a linear combination of the microphone signals y(n, k) at each T-F bin, as given in Equation (6), where the superscript H represents the conjugate transpose. In this work, the MVDR criterion was used to compute the filter coefficients: the coefficients minimize the power of the noise components at the output, as in Equation (7), subject to the distortionless constraint in Equation (8), where a(n, k) denotes the Relative Transfer Function (RTF) vector.
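For reference, the filtering operation and the MVDR criterion described above are commonly written as follows. These are the textbook forms, given as an assumption about the shape of Equations (6) to (8); $\boldsymbol{\Phi}_n(n,k)$ is the noise PSD matrix introduced in the next paragraph.

```latex
\begin{align}
  % Eq. (6): beamformer output as a linear combination of the microphone signals
  \hat{X}_d(n,k) &= \mathbf{w}^{H}(n,k)\,\mathbf{y}(n,k) \\
  % Eq. (7): minimise the noise power at the output
  \mathbf{w}_{\mathrm{MVDR}}(n,k) &= \arg\min_{\mathbf{w}} \; \mathbf{w}^{H}\,\boldsymbol{\Phi}_n(n,k)\,\mathbf{w} \\
  % Eq. (8): distortionless constraint on the desired signal
  \text{subject to} \quad & \mathbf{w}^{H}\,\mathbf{a}(n,k) = 1
\end{align}
```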
Considering the individual signal components in Equation (2) to be uncorrelated, the Power Spectral Density (PSD) matrix of the microphone signals can be expressed as Equation (9), where E{·} represents the expectation operator, Φ_{x_d}(n, k) denotes the rank-one PSD matrix of the desired signal, and Φ_n(n, k) denotes the PSD matrix of the noise components. The PSD matrices are robustly estimated using the expectation with respect to time-frequency masks, as in Equations (10) and (11): the outer products y(n, k) y^H(n, k) are weighted by the speech mask I_(·)(n, k) for Φ_{x_d}(n, k) and by the noise mask 1 − I_(·)(n, k) for Φ_n(n, k). The closed-form solution to the optimization problem is given by Equation (12) [21]. Adopting the formulation in [21], the explicit dependence of the filter on the relative transfer functions can be eliminated, which yields Equation (13), where u is a one-hot vector representing the reference microphone and Tr(·) represents the matrix trace operation.
The overall flowchart of the proposed speech enhancement framework is illustrated in Figure 4. The circular microphone array captures the noisy speech. The speech is processed by a neural beamformer, which removes most of the interference but also loses much high-frequency information. In this approach, any microphone in the circular array can be defined as the reference microphone to estimate the mask. The physical amplification of the designed PR model contributes significantly to speech enhancement, although it also introduces some distortion into the target speech. The signal captured by the microphone at the focus and the output of neural beamformer 1 carry complementary information about the desired speech in the frequency domain. By processing the signals from the two approaches mentioned earlier, the fusion operation aims to make a trade-off between speech distortion and speech intelligibility. The reflector microphone signal is utilized to estimate the mask, and the direct mask application is used to enhance the desired speech.
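Returning to Equations (10), (11), and (13), the mask-based PSD estimation and the RTF-free MVDR filter can be sketched in NumPy as below. This is a minimal illustration rather than the authors' implementation: normalising by the mask sum, using one time-invariant filter per frequency bin, the diagonal loading, and all function names are assumptions.

```python
import numpy as np


def masked_psd(stft, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency bin.

    stft: (frames, bins, mics) complex; mask: (frames, bins) real in [0, 1].
    Normalising by the mask sum is an assumption about Equations (10) and (11).
    """
    outer = stft[..., :, None] * stft[..., None, :].conj()      # y y^H per T-F bin
    weighted = mask[..., None, None] * outer
    return weighted.sum(axis=0) / (mask.sum(axis=0)[:, None, None] + eps)


def mvdr_weights(psd_speech, psd_noise, ref_mic=0, diag_load=1e-6):
    """RTF-free MVDR filter w = (Phi_n^-1 Phi_x / Tr(Phi_n^-1 Phi_x)) u per bin."""
    n_mics = psd_noise.shape[-1]
    loaded = psd_noise + diag_load * np.eye(n_mics)   # diagonal loading (assumed)
    numerator = np.linalg.solve(loaded, psd_speech)   # Phi_n^-1 Phi_x, (bins, M, M)
    trace = np.trace(numerator, axis1=-2, axis2=-1)[:, None, None]
    u = np.zeros(n_mics)
    u[ref_mic] = 1.0                                  # one-hot reference vector
    return (numerator / trace) @ u                    # (bins, M)


def beamform(weights, stft):
    """X_hat(n, k) = w^H(k) y(n, k) for every frame n and bin k."""
    return np.einsum("km,nkm->nk", weights.conj(), stft)


# Toy usage with random data standing in for array recordings and a DNN mask.
rng = np.random.default_rng(0)
frames, bins, mics = 50, 4, 7
stft = rng.standard_normal((frames, bins, mics)) \
    + 1j * rng.standard_normal((frames, bins, mics))
mask = rng.uniform(0.05, 0.95, size=(frames, bins))

phi_x = masked_psd(stft, mask)
phi_n = masked_psd(stft, 1.0 - mask)
weights = mvdr_weights(phi_x, phi_n, ref_mic=0)
enhanced = beamform(weights, stft)                    # (frames, bins)
```

For an exactly rank-one speech PSD matrix, this filter passes the desired signal as observed at the reference microphone without distortion, which is why no explicit RTF estimate is needed.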
The Generalized Cross-Correlation (GCC) method [22] is adopted to align the data from the different processing methods. The masked data are utilized by the MVDR beamformer. During postfiltering, the output of the MVDR beamformer is multiplied by the IRM obtained from the mask estimation to produce the final output. In the proposed system, each neural beamformer is an MVDR beamformer based on a single-channel IRM estimation. With the same network structure, two higher-quality signals are used as input for the second beamforming stage. The enhancement performance is improved with little additional system complexity. It is also possible to combine different types of neural beamformers to meet other requirements.
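The alignment step can be illustrated with a short delay estimator. The text names only the GCC method, so the PHAT weighting used below is one common choice assumed for this sketch, and the function name and `max_lag` parameter are hypothetical.

```python
import numpy as np


def gcc_phat_delay(sig, ref, max_lag=None, eps=1e-12):
    """Estimate the delay (in samples) of `sig` relative to `ref` with GCC-PHAT."""
    n = len(sig) + len(ref)                        # zero-pad to avoid circular wrap-around
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + eps                   # PHAT weighting (an assumed choice)
    cc = np.fft.irfft(cross, n=n)
    max_lag = n // 2 if max_lag is None else min(max_lag, n // 2)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))   # lags -max_lag..+max_lag
    return int(np.argmax(cc)) - max_lag


# Toy usage: a white-noise reference and a copy delayed by 5 samples.
rng = np.random.default_rng(0)
reference = rng.standard_normal(512)
delayed = np.concatenate((np.zeros(5), reference[:-5]))
lag = gcc_phat_delay(delayed, reference, max_lag=32)
aligned = np.roll(delayed, -lag)                   # circular shift; trim the edges in practice
```

Once the lag between the reflector channel and the beamformer output is known, one signal is shifted by that amount before the two are passed to the fusion stage.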

Experiment Setup
Considering the number of physical channels of the data acquisition equipment and the structural parameters of the paraboloid, a uniform circular array with L = 7 microphones was set up for all experiments, with a distance of 18 cm between the microphones. The other microphone was at the focus of the parabolic reflector, and the circular array and the paraboloid were combined into a single unit, as shown in Figure 5. The input signals, sampled at Fs = 16 kHz, were transformed into the STFT domain using a Discrete Fourier Transform (DFT) length of 256 and 50% overlap, that is, 16 ms frames with an 8 ms frame shift. McRoomSim [23] was used to generate the Room Impulse Responses (RIRs) required to simulate different acoustic conditions. The room was configured as an anechoic chamber with a sound absorption coefficient of 1, which means that there was no reverberation or other noise in the room. To show that the system does not depend on the spatial position of the source, eight source positions were set at different angles. Table 1 shows the configuration used to generate the dataset. For each position, 100 speech signals were convolved with the simulated RIRs of the specific setup in the training stage, and 60 speech signals in the testing stage. The proposed system was evaluated on the IEEE database [24]. Each clean utterance was mixed with the speech interference together with one of the isolated noises (babble, volvo, destroyer engine, pink, and factory noise) at SNRs from −18 dB to 7 dB. The speech interference was a WSJ0 utterance of around 10 s. The other five noises were non-stationary, and each signal was around 4 min long. Random cuts from the first half of each noise were mixed with the training utterances to create the training mixtures, and cuts from the second half were mixed with the testing utterances to create the testing mixtures.
Acoustic conditions are shown in Figure 5, and different source positions are shown in Table 1 in detail.
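The framing figures and the SNR-controlled mixing described in this section can be checked with a few lines of NumPy. This is a sketch under assumptions: the text does not state how the SNR is measured when speech interference and noise are combined, so the helper below simply scales a single interference signal against the clean speech, and its name is hypothetical.

```python
import numpy as np

FS = 16_000          # sampling frequency in Hz
N_FFT = 256          # DFT length
HOP = N_FFT // 2     # 50% overlap

frame_ms = 1000 * N_FFT / FS      # 16.0 ms frame
shift_ms = 1000 * HOP / FS        # 8.0 ms frame shift
n_bins = N_FFT // 2 + 1           # 129 one-sided frequency bins


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain


# Toy usage with white noise standing in for an utterance and a noise cut.
rng = np.random.default_rng(0)
speech = rng.standard_normal(FS)          # 1 s stand-in for a clean utterance
noise = rng.standard_normal(FS)           # 1 s stand-in for a noise cut
mixture, gain = mix_at_snr(speech, noise, snr_db=-18.0)
```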

Training
Using the IRM defined in Equation (4) as the learning target, a DNN was designed whose output can be interpreted as the probability that speech is present in each time-frequency bin. The magnitude, the second derivative of the magnitude, and the phase of the received signals for each STFT time frame were provided directly as input to the system [12]. Amplitude Modulation Spectrogram (AMS) and Relative Spectral Transform and Perceptual Linear Prediction (RASTA-PLP) [25] features were also appended to the feature vector. Restricted Boltzmann Machine (RBM)-based pre-training [26] was used for DNN training. Supervised fine-tuning of the parameters throughout the whole network was performed using the Mean Square Error (MSE) criterion, as given in Equation (14), where Î(n, k) are the vectors of the reference IRM and N is the total number of input frames. The DNN architecture was 2075-1024-1024-1024-129: the input layer had 2075 units (415 × 4 + 415, covering two left and two right context frames and the current frame), each of the four hidden layers had 1024 units, and the output layer had 129 units (the DFT length was 256, so 256/2 + 1 is the total number of frequency bins). The dropout rate was 0.2. The momentum rate was set to 0.5 for the first five epochs and 0.9 for the remaining 35 epochs. The mini-batch size was 512. Sigmoid activation functions were used for all hidden layers and the output layer. Other values were evaluated as well; however, this combination performed best empirically.
In the following experimental evaluations, the method is called PR when the noisy speech is processed through the physical model. The DSB used for comparison utilized ideal parametric information [27]. When the IRM estimated by the DNN is applied directly to a reference microphone signal, the method is termed IRM-F.
Similarly, when the estimated CIRM, which jointly estimates the real and imaginary components of the STFT [18], is applied directly, the corresponding method is referred to as CIRM-F. When the estimated IRM is used to estimate the power spectral density matrices within an MVDR beamformer, the method is referred to as IRM-BF. The method proposed in this work is named PR-IRM-BF.
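The layer sizes and training settings described above can be checked with a small NumPy forward pass. This is a sketch, not the authors' network: it follows the layer string 2075-1024-1024-1024-129 (which lists three hidden layers), uses random initialisation in place of RBM pre-training, omits dropout and the parameter update, and assumes that Equation (14) averages the per-frame squared error over N frames.

```python
import numpy as np

FEATURES_PER_FRAME = 415
CONTEXT = 2                                            # two left and two right context frames
INPUT_DIM = FEATURES_PER_FRAME * (2 * CONTEXT + 1)     # 2075
OUTPUT_DIM = 256 // 2 + 1                              # 129 frequency bins
LAYER_SIZES = [INPUT_DIM, 1024, 1024, 1024, OUTPUT_DIM]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(sizes, rng):
    """Small random weights; a stand-in for the RBM pre-training used in the paper."""
    return [(0.01 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, features):
    """Sigmoid units in every layer, so each output lies in (0, 1) like a mask value."""
    h = features
    for w, b in params:
        h = sigmoid(h @ w + b)
    return h


def mse_loss(estimated_mask, reference_mask):
    """Mean over frames of the squared error summed over bins (assumed form of Eq. 14)."""
    return np.mean(np.sum((estimated_mask - reference_mask) ** 2, axis=1))


def momentum_for_epoch(epoch):
    """0.5 for the first five epochs (indices 0-4), 0.9 for the remaining 35."""
    return 0.5 if epoch < 5 else 0.9


rng = np.random.default_rng(0)
params = init_params(LAYER_SIZES, rng)
batch = rng.standard_normal((8, INPUT_DIM))            # 8 frames of stacked features
estimated = forward(params, batch)                     # (8, 129) mask estimates
```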

Results
The enhanced speech signals from each approach were evaluated in terms of three well-known objective metrics, namely the Perceptual Evaluation of Speech Quality (PESQ), the Short-Time Objective Intelligibility (STOI) score, and the frequency-weighted Segmental SNR (fwSegSNR).
As presented in Table 2, the comparison experiments in five mixed-noise situations showed that PR and IRM-BF improved PESQ, STOI, and fwSegSNR relative to the original noisy speech in all mixed-noise situations. In the presence of noise, the beamforming operation removes speech interference but degrades the continuity of the target speech. PR preserves the target speech but is not effective at eliminating high-energy speech interference. Thus, each of the two isolated systems (PR and IRM-BF) has its own shortcomings when addressing severe noise. In extremely low SNR environments, the proposed system significantly outperformed the compared methods by exploiting the complementarity of PR and IRM-BF, achieving a trade-off that gives much better results than either individual enhancement. At 7 dB, the proposed system degraded performance in terms of fwSegSNR and STOI, possibly due to signal distortions. Specifically, when the original noisy speech was of relatively high quality, the PR model could not show its advantage in improving the SNR of the desired signal, while the fusion operations introduced extra noise. For IRM estimation, the improvement in fwSegSNR and STOI achieved by each mask application was much higher than that of the DSB beamformer. The CIRM-F method led to larger improvements in fwSegSNR but smaller improvements in PESQ and STOI.

Discussion
To illustrate the effectiveness of the speech enhancement system more clearly, Figure 6 presents a test utterance corrupted by mixed noise (destroyer engine + speech interference) at −2 dB and enhanced by the proposed system. Compared with the spectrogram of the PR-enhanced speech, the speech processed by the neural beamformer had most of the interference removed but lost much of its high-frequency information. This confirms the earlier description that the two signals share some components, differ in others, and complement each other. Although the PR model also introduced some distortion into the target speech, the spectrogram indicates that the PR model performs well in data cleaning.
Through the fusion operation, the processed speech achieves a trade-off between speech distortion and speech intelligibility, yielding better enhancement performance. As seen in Figure 6e, the noise power is largely suppressed.

Conclusions
A parabolic reflector, which provides physical amplification, was proposed. It focuses the target speech out of the noisy speech and provides a heterogeneous channel that captures relatively clean data for mask estimation. When the source-to-array distance is long, it is difficult to collect effective signals; the physical amplification model reduces the system complexity and provides favorable data in this case. Moreover, by utilizing the complementary information between the parabolic reflector and the microphone array, the proposed signal fusion system achieves better performance under mixed noise and speech interference, especially in low SNR environments. In the future, we will extend the current work in several ways, such as upgrading the neural beamformer module to track more phase information from the microphone array. Our most important future work is to acquire measured RIRs and to apply the framework to far-field multi-talker microphone array speech enhancement.