Speech Enhancement Based on Fusion of Both Magnitude/Phase-Aware Features and Targets

Recently, supervised learning methods, especially deep neural network (DNN)-based methods, have shown promising performance in single-channel speech enhancement. Generally, those approaches extract the acoustic features directly from the noisy speech to train a magnitude-aware target. In this paper, we propose to extract the acoustic features not only from the noisy speech but also from the pre-estimated speech, noise and phase separately, and then fuse them into a new complementary feature for the purpose of obtaining a more discriminative acoustic representation. In addition, on the basis of learning a magnitude-aware target, we also utilize the fusion feature to learn a phase-aware target, thereby further improving the accuracy of the recovered speech. We conduct extensive experiments, including a performance comparison with typical existing methods, a generalization ability evaluation on unseen noise, an ablation study, and a subjective test by human listeners, to demonstrate the feasibility and effectiveness of the proposed method. Experimental results prove that the proposed method is able to improve the quality and intelligibility of the reconstructed speech.


Introduction
Speech enhancement has been studied extensively as a fundamental signal processing task to recover received signals that are easily degraded under noisy, adverse conditions. Nowadays, speech enhancement is widely used in the fields of speech analysis, speech recognition, speech communication, and so forth. The aim of speech enhancement is to recover and improve speech quality and intelligibility via different techniques and algorithms, such as unsupervised methods including spectral subtraction [1,2], Wiener filtering [3], statistical model-based estimation [4,5], the subband forward algorithm [6], the subspace method [5,7], and so on. Generally, these unsupervised methods are based on statistical signal processing and typically work in the frequency domain; they essentially implement speech enhancement by estimating the gain function and the noise. Voice activity detection (VAD) [8,9] is a simple approach to estimating and updating the noise spectrum, but its performance under non-stationary noise is unsatisfactory. The minima controlled recursive averaging (MCRA) algorithm and its improved version (IMCRA) enhanced the estimation of non-stationary noise [10].
Recent approaches formulate speech enhancement as a supervised learning problem, where the discriminative patterns of speech and background noise are learned from training data [11]. The performance of supervised speech enhancement algorithms is affected by three key components, namely the learning machine, the training target and the acoustic feature. (1) Learning machine. Compared with traditional learning machines such as the support vector machine (SVM) [12], the data-driven deep neural network (DNN) has shown its strong power in adverse environments and has received much attention [13][14][15][16][17][18].

As shown in Figure 2, we pre-estimate the speech s̃(t) and noise ñ(t) and calculate the phase from the mixed signal x(t), then fuse the features extracted from them separately with the features extracted from the mixed signal to construct a refined acoustic representation. To our knowledge, this study is the first to extract features from pre-estimated speech and noise, and especially from the phase. Secondly, in order to make full use of the phase information, we also incorporate the instantaneous frequency deviation (IFD) as a phase-aware training target to estimate the phase spectrogram. However, differently from Reference [27], we employ an independent DNN (see Figure 2) to train the phase-aware target instead of jointly training both the magnitude and phase targets with a single DNN, for the purpose of reducing computational complexity. Extensive experiments conducted on the TIMIT corpus [36] show that the proposed method outperforms the existing methods in terms of speech quality, speech intelligibility and speech distortion.

The rest of this paper is organized as follows. The details of the proposed method are described in Section 2: the pre-estimation of noise and speech is introduced in Section 2.1, the proposed feature fusion method in Section 2.2, the calculation of the magnitude-aware and phase-aware training targets in Section 2.3, the network structure and training strategy in Section 2.4, and the speech enhancement procedure in Section 2.5. In Section 3, we present the experimental data, comparison methods and evaluation metrics. We conduct a series of experiments and analyze the results in Section 4: extensive comparison experiments in Section 4.1, the generalization ability of the comparison methods on unseen noise in Section 4.2, and a deeper insight into the proposed method through an ablation study in Section 4.3. Finally, we conclude this study and outline future work in Section 5.

Proposed Method
In this paper, n(t) and s(t) represent the interference noise and the clean speech, respectively. x(t) represents the noisy speech used in the training stage, which is a mixture of n(t) and s(t). z(t) and y(t) represent the signal to be enhanced and the final recovered signal, respectively. N(k, l), S(k, l), X(k, l), Z(k, l) and Y(k, l) denote the STFT spectra corresponding to n(t), s(t), x(t), z(t) and y(t), respectively.

Pre-Estimation of Noise and Speech
Equations (1) and (2) represent the generation of noisy speech in the time and frequency domains:

x(t) = s(t) + n(t), (1)

X(k, l) = S(k, l) + N(k, l), (2)

where k and l denote the frequency bin index and the frame index, respectively. To estimate the independent noise and speech from the noisy speech, we first apply the STFT with time shift L and DFT length N to convert the time-domain noisy speech into a spectro-temporal spectrogram, in which the harmonic structure of the speech can be observed clearly, and then use dedicated methods to obtain the estimated noise and speech, respectively.
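As a concrete illustration, the following is a minimal sketch of this analysis front end in Python, assuming a 16 kHz signal, 20 ms frames with a 10 ms shift (L = 160), and a 512-point DFT (consistent with the 257-dimensional phase feature used later); the signals here are random placeholders, not the actual corpus.

```python
import numpy as np
from scipy.signal import stft

fs = 16000    # sampling rate (Hz)
frame = 320   # 20 ms frame length at 16 kHz
L = 160       # 10 ms time shift between adjacent frames
N = 512       # DFT length, giving N // 2 + 1 = 257 frequency bins

def to_spectrogram(x):
    """Convert a time-domain signal x(t) into its STFT spectrogram X(k, l)."""
    _, _, X = stft(x, fs=fs, nperseg=frame, noverlap=frame - L, nfft=N)
    return X  # complex array of shape (257, number of frames)

# Equations (1) and (2): the mixture in the time and frequency domains.
s = np.random.randn(fs)        # placeholder clean speech (1 s)
n = 0.5 * np.random.randn(fs)  # placeholder noise
x = s + n                      # x(t) = s(t) + n(t)
X = to_spectrogram(x)          # X(k, l) = S(k, l) + N(k, l) by linearity
```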

Noise Estimation
We apply the IMCRA [10] method to estimate the noise. It is extremely important in noise estimation to effectively track the a priori signal-to-noise ratio (SNR) ξ(k, l), the a posteriori SNR γ(k, l), and the noisy power spectral density S_p(k, l), with S_pmin(k, l) being its corresponding minimum value [37,38]. The conditional probability of speech presence can be expressed as

p(k, l) = {1 + [q(k, l) / (1 − q(k, l))] (1 + ξ(k, l)) exp(−v(k, l))}^(−1), with v(k, l) = γ(k, l) ξ(k, l) / (1 + ξ(k, l)),

where q(k, l) denotes the a priori probability of speech absence, and λ_s(k, l) and λ_n(k, l) denote the variances of the desired speech and the noise at the T-F bin (k, l), respectively. Two empirical constants are set to B_min = 1.66 and ζ_0 = 1.67. Furthermore, the noise spectrum can be estimated by recursive averaging,

σ̂_n^2(k, l + 1) = α̃(k, l) σ̂_n^2(k, l) + [1 − α̃(k, l)] |X(k, l)|^2, with α̃(k, l) = α + (1 − α) p(k, l),

where α = 0.85 is the smoothing factor and σ̂_n^2(k, l) represents the noise power spectral density estimate at time frame l and frequency bin k. The pre-estimated noise amplitude Ñ(k, l) is the square root of σ̂_n^2(k, l), and the corresponding pre-estimated noise ñ(t) can be obtained via the inverse Fourier transform of Ñ(k, l). Figure 3 illustrates an example of noise estimation: the left subplot is the spectrogram of an actual noise and the right one is the spectrogram of the noise estimated from a noisy speech.
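The recursive update above can be sketched as follows, assuming the speech presence probability p(k, l) has already been produced by the IMCRA machinery; this is a simplified fragment, not the full IMCRA algorithm.

```python
import numpy as np

alpha = 0.85  # smoothing factor, as in the text

def update_noise_psd(sigma2_prev, X_frame, p):
    """One recursive-averaging update of the noise PSD estimate.

    sigma2_prev : previous estimate sigma_n^2(k, l), shape (K,)
    X_frame     : current noisy STFT frame X(k, l), complex, shape (K,)
    p           : speech presence probability p(k, l), shape (K,)
    """
    # Time-varying smoothing: update slowly where speech is likely present.
    alpha_tilde = alpha + (1.0 - alpha) * p
    return alpha_tilde * sigma2_prev + (1.0 - alpha_tilde) * np.abs(X_frame) ** 2
```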

Speech Estimation
There are many suitable methods, such as the minimum mean-square error (MMSE) estimator, the Log-MMSE estimator [37,39], and Bayesian statistics [5], which can be utilized to estimate the independent speech from the noisy speech. Among them, the Log-MMSE estimator performs better in terms of noise suppression and speech distortion reduction. It obtains the optimized speech by minimizing the error between the logarithmic amplitudes of the pre-estimated speech and the actual speech, that is, the pre-estimated amplitude is S̃_k = exp(E[log S_k | X(k, l)]). Writing Z_k for log S_k, the moment generating function Φ_{Z_k|X(k,l)}(µ) of Z_k given X(k, l) is

Φ_{Z_k|X(k,l)}(µ) = E[S_k^µ | X(k, l)] = ∫∫ S_k^µ p(S_k, φ | X(k, l)) dS_k dφ,

where p is the conditional probability density, µ is an index variable, and φ is the phase. After the related calculations, this evaluates to

Φ_{Z_k|X(k,l)}(µ) = [λ_s(k, l) / (1 + ξ(k, l))]^(µ/2) Γ(µ/2 + 1) Φ(−µ/2, 1; −v(k, l)),

where Γ and Φ are the gamma function and the confluent hypergeometric function, respectively. Taking the derivative of Φ_{Z_k|X(k,l)}(µ) at µ = 0 yields the conditional mean of log S(k, l). Substituting this result into the exponential above and simplifying, the pre-estimated speech can be obtained from the statistical model of the Fourier coefficients as

S̃(k, l) = [ξ(k, l) / (1 + ξ(k, l))] exp{(1/2) ∫_{v(k,l)}^{∞} (e^{−t} / t) dt} |X(k, l)|.
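For illustration, the final gain can be computed compactly as in the sketch below, assuming the a priori SNR ξ(k, l) and a posteriori SNR γ(k, l) are already tracked (e.g., by decision-directed estimation); scipy's exp1 evaluates the exponential integral appearing in the gain formula.

```python
import numpy as np
from scipy.special import exp1  # exp1(v) = integral of exp(-t)/t from v to infinity

def log_mmse_estimate(X, xi, gamma):
    """Return the pre-estimated speech spectrum S~(k, l) from X(k, l)."""
    v = xi * gamma / (1.0 + xi)                    # v(k, l)
    gain = (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))
    return gain * X                                # S~(k, l) = G(k, l) X(k, l)
```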

Feature Fusion
In this paper, we propose fusing the features extracted from the noisy speech (F_x), from the pre-estimated independent speech (F_s), and from the pre-estimated independent noise (F_n), together with the phase feature (F_p).
For F s and F x , we utilize the complementary acoustic feature set which has been widely accepted by recent studies as the representation vector.
For F_n, we only extract the amplitude modulation spectrogram (AMS) [30] as its representation, obtaining a 15-dimensional feature vector.
For the purpose of further improving the discrimination of the acoustic feature and enhancing the ability of the DNN to fit the phase-aware training target, we propose to employ the IFD of the noisy speech as the phase feature F_p, which is a 257-dimensional vector. Following the idea proposed in Reference [27], the IFD can be calculated as

IF_X(k, l) = principle(arg(X(k, l + 1) X*(k, l))),
IFD_X(k, l) = principle(IF_X(k, l) − ε(k)), with ε(k) = 2πkL/N, (18)

where X*(k, l) denotes the complex conjugate of the complex number X(k, l). IF_X can also be understood as the negative derivative of the phase spectrum along the time axis [40], and Formula (18) is its expression in the complex number field. The function principle(·) denotes the selection of principal values, which projects the phase difference onto [−π, +π], and the function arg(·) calculates the phase angle of a complex number [27]. ε(k) is the center frequency, where L is the time shift between two adjacent frames and N is the length of the discrete Fourier transform. IFD_X(k, l) measures how far an IF value strays from its center frequency, so the role of ε(k) is to eliminate the striation caused by different center frequencies, which makes the structure of the speech more apparent [28].
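The following is a direct translation of Equation (18) into code, under the front-end settings sketched in Section 2.1 (hop L = 160, DFT length N = 512); the helper principle() is one straightforward way to project angles onto the principal range.

```python
import numpy as np

def principle(theta):
    """Project phase values onto the principal range [-pi, pi]."""
    return np.angle(np.exp(1j * theta))

def ifd_feature(X, L=160, N=512):
    """Instantaneous frequency deviation IFD_X(k, l) of a spectrogram X."""
    K = X.shape[0]
    # Instantaneous frequency: phase advance between adjacent frames.
    IF = np.angle(X[:, 1:] * np.conj(X[:, :-1]))
    # Center frequency epsilon(k): expected phase advance of bin k over hop L.
    eps = 2.0 * np.pi * L * np.arange(K) / N
    return principle(IF - eps[:, None])  # one column per frame transition
```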
The final fusion feature is the concatenation of the above features,

F = [F_x, F_s, F_n, F_p], (21)

and its total dimension is 764.

Training Target
In addition to the widely used magnitude-aware training target, we also utilize an independent phase-aware training target to train the corresponding phase mask. For the phase-aware target, we follow the idea proposed in Reference [27] and utilize the IFD calculated on the clean speech as the phase target, so as to improve the estimation of the final speech.
To balance the training errors between the phase-aware target (IFD) and the magnitude-aware target, we normalize the IFD into the range [0, 1). As shown in Figure 2, DNN1 is used to train the phase model: the fusion feature (F) serves as the network input and the clean-speech IFD_s (that is, T_P) as the network output, and training yields model-1.
As for the magnitude-aware target, in our implementation we utilize the ideal amplitude mask (IAM), which is named FFT-MASK in Reference [25]. The IAM is defined as the ratio of the STFT magnitudes of the clean speech s(t) and the noisy speech x(t):

IAM(k, l) = |S(k, l)| / |X(k, l)|,

where |S(k, l)| and |X(k, l)| represent the spectral magnitudes of the clean speech signal and the mixed signal within a T-F unit, respectively. As shown in Figure 2, the IAM corresponds to the amplitude target (T_R).
In the second network training process, the IAM and the fusion feature (F) are used as the output and input of DNN2, respectively, to train model-2.
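The two targets can then be assembled per utterance as in the following sketch, which reuses to_spectrogram and ifd_feature from the earlier snippets; mapping the IFD from [−π, π] into [0, 1) is one straightforward reading of the normalization described above.

```python
import numpy as np

def magnitude_target(S, X, eps=1e-8):
    """IAM (FFT-MASK): ratio of clean to noisy STFT magnitudes."""
    return np.abs(S) / (np.abs(X) + eps)

def phase_target(s, L=160, N=512):
    """Normalized clean-speech IFD (T_P), mapped into [0, 1)."""
    ifd_s = ifd_feature(to_spectrogram(s), L=L, N=N)
    return (ifd_s + np.pi) / (2.0 * np.pi)
```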

Network Structure and Training Strategy
In order to obtain the optimal fitting effect, we propose to use two DNNs with the same structure to train the IAM (magnitude mask) and the IFD (phase mask) respectively, instead of jointly training the two masks with a single DNN as in Reference [27]. Each DNN employs a five-layer structure, including one input layer, three hidden layers, and one output layer. Each hidden layer consists of 2048 rectified linear units (ReLU). The sigmoid activation function is adopted for the input and output layers. In the training process, each DNN is optimized by minimizing the mean square error (MSE). The learning rate decreases linearly from 0.008 to 0.0001. The scaling factor for adaptive stochastic gradient descent is set to 0.0015. During network training, the number of epochs for back-propagation training is set to 30, and the batch size is set to 32.
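As a sketch, one of the two identical networks could be realized in PyTorch as below, matching the stated sizes (three hidden layers of 2048 ReLU units, sigmoid output, MSE loss, a 764-dimensional input and a 257-dimensional mask output); the choice of plain SGD and the exact scheduler mapping are assumptions here, and the 0.0015 scaling factor of the adaptive scheme is not modeled.

```python
import torch
import torch.nn as nn

class MaskDNN(nn.Module):
    """Five-layer DNN used for both the magnitude mask and the phase mask."""
    def __init__(self, in_dim=764, out_dim=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
            nn.Linear(2048, out_dim), nn.Sigmoid(),  # mask output in (0, 1)
        )

    def forward(self, x):
        return self.net(x)

model = MaskDNN()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.008)
# Decay the learning rate linearly from 0.008 to 0.0001 over 30 epochs.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0001 / 0.008, total_iters=30)
```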

Speech Enhancement
As shown in Figure 4, to reconstruct the enhanced speech y(t) from the mixed signal z(t), the fusion feature F is first calculated according to the description in Section 2.2. The fusion feature F is then input to model-1 and model-2 (obtained in the training stage, see Figure 2) to produce the estimated phase target M_P and the estimated amplitude target M_R, respectively. Applying the STFT to z(t) yields φ_z(k, l) and |Z(k, l)|, that is, Z(k, l) = |Z(k, l)| e^{jφ_z(k,l)}. The estimated amplitude spectrum is then obtained as |Y(k, l)| = |Z(k, l)| · M_R.
The phase reconstruction process is more complicated [27]. First, we use φ_z(k, l) as the initial phase, then calculate IF_X according to Equation (18) and reconstruct the phase along the time axis as a reliability-weighted combination over neighboring frames, where s(i) denotes the proximity weight with −N ≤ i ≤ N, unwrap(·) is an unwrapping function which makes the phase spectrogram smooth along the time axis, and M̂(k, l + i) is the reliability index.
Finally, the phase can be reconstructed along the frequency axis, where k_1 < k < k_2, with k_1 and k_2 being the two adjacent harmonic bands of the k-th frequency band, and W(k) being the discrete Fourier transform of the k-th window function. The final enhanced speech y(t) can then be reconstructed by the inverse STFT of |Y(k, l)| and φ_y(k, l).
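The enhancement stage can be summarized by the following sketch, which reuses to_spectrogram from Section 2.1 and treats the trained model as a plain callable mapping the fusion feature to a mask; the IFD-driven phase refinement is simplified here to reusing the noisy phase, so this fragment covers only the magnitude path of Figure 4.

```python
import numpy as np
from scipy.signal import istft

def enhance(z, F, model2, fs=16000, frame=320, L=160, N=512):
    """Reconstruct y(t) from the mixture z(t) via the magnitude mask M_R."""
    Z = to_spectrogram(z)            # Z(k, l) = |Z(k, l)| e^{j phi_z(k, l)}
    M_R = model2(F)                  # estimated amplitude target, (frames, 257)
    Y_mag = np.abs(Z) * M_R.T        # |Y(k, l)| = |Z(k, l)| * M_R
    phi = np.angle(Z)                # initial phase phi_z(k, l)
    Y = Y_mag * np.exp(1j * phi)     # model-1's M_P would refine phi here
    _, y = istft(Y, fs=fs, nperseg=frame, noverlap=frame - L, nfft=N)
    return y
```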

Experimental Data
In the experiment, the TIMIT speech database [36] was used for performance evaluation. We chose 380 speakers to form a training set and 40 speakers to constitute a test set; each speaker has 10 clean utterances, and the test and training sets are completely disjoint in our experiment. The total experimental data thus consist of 3800 clean training utterances and 400 clean test utterances. We also selected 4 types of noise, namely babble, factory1, factory2 and buccaneer1, from the NOISEX-92 database [41] as the added noise. Each noise signal was separated into two parts, one for constructing the training mixtures and the other for testing. We mixed each clean utterance with 16 short noise segments at the SNR levels of −5 dB, −3 dB and 0 dB respectively, where the 16 noise segments came from the 4 types of noise, each type contributing 4 random segments for training and 1 random segment for testing. Thus there are 60,800 training utterances and 6400 test utterances at each SNR level. All corpora were re-sampled to 16 kHz and converted to T-F units with the frame length set to 20 ms and the frame shift set to 10 ms.
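For concreteness, mixing a clean utterance with a noise segment at a target SNR can be done as in the sketch below; this is the standard construction and an assumption about the exact mixing procedure used.

```python
import numpy as np

def mix_at_snr(s, n, snr_db):
    """Scale noise n so that the mixture s + n has the requested SNR (dB)."""
    n = n[: len(s)]                  # noise segment at least as long as s
    p_s = np.mean(s ** 2)            # speech power
    p_n = np.mean(n ** 2)            # noise power
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return s + scale * n

# Example: a -5 dB mixture, the hardest condition in the experiments.
# x = mix_at_snr(clean_utterance, babble_segment, snr_db=-5)
```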

Comparison Methods
In our experiment, we compared the proposed method to (1) a classical DNN-based speech enhancement method [25] which follows the scheme shown in Figure 1; (2) the method proposed in Reference [26], which exploited the CRM as the training target; and (3) a state-of-the-art method proposed in Reference [27], which improved the speech enhancement by incorporating a phase-aware training target. For convenience of description, in this paper we term the methods in References [25][26][27] and the proposed method as DNN, DNN-CRM, DNN-IFD, and DNN-MP, respectively. In addition, we also provide the result of the unprocessed noisy speech, termed NOISY, as the baseline to evaluate the improvement of the different methods.
For a fair comparison, all comparison methods utilized a deep neural network as the learning machine, with the same structure and learning strategy as proposed in the original literature. For the training target, DNN, DNN-IFD, and DNN-MP utilized the IAM as the magnitude-aware training target, while DNN-CRM utilized the CRM following its original setting. DNN-IFD and DNN-MP utilized the IFD as the phase-aware training target. For the acoustic features, DNN, DNN-CRM and DNN-IFD utilized the complementary features [25], and DNN-MP utilized the proposed fusion feature.
Evaluation Metrics
Four objective metrics were adopted: the perceptual evaluation of speech quality (PESQ), the short-time objective intelligibility (STOI), the extended STOI (ESTOI), and the signal-to-distortion ratio (SDR). PESQ can effectively estimate speech quality and its score ranges from −0.5 to 4.5; the higher the PESQ score, the better the predicted speech quality. STOI evaluates the objective intelligibility of a degraded speech signal by computing the correlation of the temporal envelopes of the degraded signal and its clean reference, and it has been shown empirically that the STOI score is strongly correlated with human speech intelligibility scores. ESTOI evaluates objective intelligibility by computing the spectral correlation coefficients of the degraded signal and its clean reference in short time segments; unlike STOI, ESTOI does not assume that frequency bands are mutually independent. Both STOI and ESTOI scores range from 0 to 1, and higher scores indicate better predicted intelligibility. The SDR score is computed by blind source separation evaluation measurements and has been widely used for evaluating speech quality [27].
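In practice, PESQ, STOI and ESTOI can be computed with publicly available Python packages; the `pesq` and `pystoi` packages below are third-party tools chosen for illustration, not the implementations used by the paper.

```python
from pesq import pesq    # ITU-T P.862 implementation
from pystoi import stoi  # STOI / ESTOI implementation

def score(clean, enhanced, fs=16000):
    """Objective quality/intelligibility scores for one utterance pair."""
    return {
        "PESQ": pesq(fs, clean, enhanced, "wb"),            # range -0.5 .. 4.5
        "STOI": stoi(clean, enhanced, fs, extended=False),  # range 0 .. 1
        "ESTOI": stoi(clean, enhanced, fs, extended=True),  # range 0 .. 1
    }
```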

Experiment 1: Speech Enhancement Performance Comparison
Comprehensive experimental results are listed in Table 1. Comparing the speech enhancement performance of the different methods shows that: (1) all DNN-based speech enhancement methods can effectively improve the speech quality and intelligibility of the original noisy speech for each type of added noise at every SNR level. From the perspective of SNR, the higher the SNR of the original noisy signal, the higher the speech intelligibility and quality of the final recovered signal, which is consistent with common sense. In terms of noise, among the four types of noise, Factory2 and Buccaneer1 seem to be easier to handle, while Babble and Factory1 are more challenging.

For the purpose of eliminating the magnitude difference between the four evaluation metrics (ref. Table 1), we divided the values of PESQ and SDR by 10 in Figure 5. It is not difficult to see from Figure 5 that the performance of DNN, DNN-CRM, DNN-IFD, and DNN-MP increases stepwise for all four evaluation metrics. This statistical result once again confirms the assertions of previous studies [26,27], namely that the performance of DNN can be improved by replacing the IAM target with the CRM target (DNN-CRM), and that the performance can be further improved by adding a phase-aware target on top of the magnitude-aware target, which proves the importance of phase information in speech enhancement (DNN-IFD). Our method achieves the leading performance due to the fusion of both magnitude/phase-aware features and targets.

We also compared the spectrograms of the same noisy sentence, the No. 395 sentence randomly selected from the test set at the −5 dB SNR level, before and after speech enhancement with the different comparison methods. Figures 6-9 correspond to the cases of adding babble, factory1, factory2 and buccaneer1 noise, respectively. In each figure, the annotation CLEAN represents the clean speech without added noise, and NOISY represents the noisy speech after adding the specific noise to the CLEAN speech. DNN, DNN-CRM, DNN-IFD, and DNN-MP represent the speech recovered by the corresponding method. Comparing the spectrograms of CLEAN and NOISY in each figure illustrates that at the −5 dB SNR level, the clean signal is heavily polluted by the added noise. For the No. 395 sentence, DNN and DNN-IFD perform better in processing factory2 and buccaneer1, but poorly on babble and factory1, while DNN-CRM is the opposite: it performs better on babble and factory1 but poorly on factory2 and buccaneer1. In contrast, the proposed DNN-MP achieves considerably good noise-reduction results for each type of noise.

Experiment 2: Generalization Ability Evaluation on Unseen Noise
To investigate the generalization ability of the comparison methods, we tested their performance on an unseen noise. In this experiment, we trained the models with 15,200 utterances from the TIMIT database mixed with buccaneer1 noise at the −5 dB SNR level, and tested the performance on utterances mixed with buccaneer2 noise from the NOISEX-92 database at the same SNR level. Although both buccaneer1 and buccaneer2 are cockpit noises, they differ in the aircraft speed and altitude at which the noise was recorded. From Table 2, it can be seen that the generalization ability of the proposed DNN-MP is superior to that of all the other comparison methods.

Experiment 3: Ablation Study
The foregoing experiments prove that the proposed method (DNN-MP) is superior to existing methods in terms of both speech enhancement performance and generalization ability. The good performance of DNN-MP comes from two aspects: one is the fusion of multiple features extracted from the pre-estimated speech, the pre-estimated noise, and the phase, beyond those extracted from the noisy speech only; the other is the comprehensive utilization of magnitude-aware and phase-aware training targets.

In order to further explore the effects of the aforementioned components on the speech enhancement performance, we conducted two groups of ablation studies. In the first group, we kept the network structure and the training targets (IAM and IFD) of the DNN-MP method unchanged, and successively removed the pre-estimated noise feature (F_n), the pre-estimated speech feature (F_s), the phase feature (F_p), and their combinations, that is, F_n and F_p, F_n and F_s, F_p and F_s, from the fusion feature F (ref. Equation (21)), for the purpose of investigating their roles in the fusion feature. The experimental results tested with buccaneer1 noise at the −5 dB SNR level are listed in Table 3, where −F_n represents removing F_n from the fusion feature and −F_n,p represents removing both F_n and F_p. As can be seen from the table, among the three independent features, F_s and F_p (especially F_s) have a significant impact on the performance of DNN-MP. Taking the metric PESQ as an example, the removal of F_p and F_s results in drops from 2.22 to 2.17 and 2.15, respectively; when both F_p and F_s are removed, the result drops to 2.12. In contrast, F_n has relatively little impact on DNN-MP, which can be derived from the fact that the pairs DNN-MP and −F_n, −F_p and −F_n,p, and −F_s and −F_n,s have almost the same PESQ values. We hope that this finding can guide researchers to further optimize fusion features.
In the second group of experiments, we kept the network structure, the fusion feature (F) and the magnitude-aware training target (IAM) of the DNN-MP method unchanged, but removed the phase-aware training target (IFD). The experimental result tested with buccaneer1 noise at the −5 dB SNR level is termed DNN-M and listed in Table 4. Compared to DNN-MP, the performance (PESQ) of DNN-M drops to 2.13, which demonstrates the important role of the phase-aware training target in speech enhancement.

Experiment 4: Subjective Test by Human Listeners
In order to give a more comprehensive understanding of the performance of the various comparison methods, we also conducted a subjective test with human listeners. We recruited 36 volunteers aged between 16 and 60 years old. All volunteers had no background knowledge of the related speech enhancement research fields, so the fairness of the test results is not affected by personal preferences for particular comparison methods. We randomly selected 20 noisy speech sentences from the test set and processed them with the four aforementioned comparison methods, that is, DNN, DNN-CRM, DNN-IFD and DNN-MP. Thus, for each speech sentence, we obtained four de-noised versions. Each tester was asked to select the one of the four that he/she thought had the best de-noising effect, and to fill out the questionnaire shown in Table 5. In the table, the top row presents the numbers of the speech sentences randomly selected from the test set, S1, . . . , S36 in the leftmost column denote the testers, and four symbols mark the methods DNN, DNN-CRM, DNN-IFD, and DNN-MP, respectively.
Based on the original questionnaires, we counted the results of the subjective test. As shown in Table 6, the cumulative number of votes over all sentences is 36 × 20 = 720, of which 264 votes are for the DNN-MP method, 225 for the DNN-CRM method, 193 for the DNN-IFD method, and 38 for the DNN method. Among the 20 noisy speech sentences, thirteen processed by the proposed DNN-MP method are considered to have the best quality, namely sentences No. 24, 28, 74, 172, 179, 207, 239, 267, 306, 321, 391, 398 and 400. Two noisy speech sentences, No. 155 and No. 163, processed by DNN-IFD are considered to have the best processing effect. Another two, No. 234 and No. 377, processed by the DNN-CRM method are considered to have the best quality. In addition, DNN-CRM and DNN-IFD are tied for the best quality on sentences No. 136 and No. 235, while for No. 36, DNN-CRM and DNN-MP are tied. The subjective test once again proves that the proposed method can remove noise more effectively than the other comparison methods.

Conclusions
In this study, we propose a novel DNN-based single-channel speech enhancement method that fuses magnitude-aware and phase-aware information in both the feature and the training-target aspects. Extensive experiments demonstrate that the proposed method (DNN-MP) is superior to the comparison methods in terms of both speech enhancement performance (speech quality and intelligibility) and generalization ability. The experiments and analysis show that the good performance of the proposed method comes from two aspects: one is the fusion of multiple features extracted from the pre-estimated speech, the pre-estimated noise, and the phase, beyond those extracted from the noisy speech only; the other is the comprehensive utilization of magnitude-aware and phase-aware training targets.