1. Introduction
Speech enhancement has been studied extensively as a fundamental signal processing technique for reconstructing signals that are easily degraded under adverse noisy conditions. It is now widely used in speech analysis, speech recognition, speech communication, and related fields. The aim of speech enhancement is to recover and improve speech quality and intelligibility via different techniques and algorithms, such as unsupervised methods including spectral subtraction [1,2], Wiener filtering [3], statistical model-based estimation [4,5], the subband forward algorithm [6], and the subspace method [5,7]. These unsupervised methods are generally based on statistical signal processing and typically operate in the frequency domain; they essentially perform speech enhancement by estimating a gain function and the noise. Voice activity detection (VAD) [8,9] is a simple approach to estimating and updating the noise spectrum, but its performance under non-stationary noise is unsatisfactory. Minima controlled recursive averaging (MCRA) and improved MCRA (IMCRA) were proposed to improve the estimation of non-stationary noise [10].
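To illustrate the idea behind such minima-controlled estimators, the sketch below recursively smooths the noisy power spectrum per frequency bin and takes the noise level as the minimum of the smoothed power over a sliding window. It is a crude minimum-statistics style stand-in, not the full MCRA/IMCRA algorithm, and the smoothing factor and window length are illustrative choices.

```python
import numpy as np

def noise_psd_estimate(power_spec, alpha=0.9, win=40):
    """Crude minimum-statistics style noise estimator.

    power_spec: magnitude-squared spectrogram, shape (frames, bins).
    The noisy power is smoothed recursively per bin, and the noise
    level is taken as the minimum of the smoothed power over a
    sliding window of `win` frames.
    """
    smoothed = np.empty_like(power_spec)
    s = power_spec[0].copy()
    for t in range(len(power_spec)):
        s = alpha * s + (1.0 - alpha) * power_spec[t]
        smoothed[t] = s
    noise = np.empty_like(power_spec)
    for t in range(len(power_spec)):
        lo = max(0, t - win + 1)
        noise[t] = smoothed[lo:t + 1].min(axis=0)
    return noise
```

Because speech energy is sparse in time and frequency, the tracked minima tend to follow the noise floor even when speech is present; MCRA and IMCRA refine this idea with bin-wise speech-presence probabilities.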
Recent approaches formulate speech enhancement as a supervised learning problem, in which the discriminative patterns of speech and background noise are learned from training data [11]. The performance of supervised speech enhancement algorithms is determined by three key components: the learning machine, the training target, and the acoustic features. (1) Learning machine. Compared with traditional learning machines such as the support vector machine (SVM) [12], the data-driven deep neural network (DNN) has shown strong performance in adverse environments and has received much attention [13,14,15,16,17,18]. A DNN is an artificial neural network (ANN) with multiple hidden layers between the input and output layers. Each layer contains multiple neurons, and neurons in adjacent layers are connected through different functions. Like shallow ANNs, DNNs can model complex non-linear relationships. Owing to their hierarchical structure and the distributed representation at each layer, the representational power of a DNN is exponentially greater than that of a shallow model with the same number of nonlinear computational units [19]. Most recent research focuses on improving performance by stacking and modifying DNN structures [14,20,21]. (2)
Training targets. The training target is key to the quality of the recovered speech, and many well-designed binary masks [22,23] and ratio masks [13,23,24,25,26] have been proposed. In Reference [25], Wang et al. showed that the ideal amplitude mask (IAM) achieves better noise reduction than the ideal binary mask (IBM) [23]. Liang et al. proposed the optimal ratio mask (ORM) and proved through theoretical analysis that it can further improve the signal-to-noise ratio (SNR) over the ideal ratio mask (IRM) [24]. Bao et al. proposed the corrected ratio mask (CRM), which separately preserves and masks more speech and noise information, and showed that it outperforms conventional ratio masks and several other enhancement algorithms [26]. Recently, phase has shown a strong relationship with speech quality. Zheng et al. proposed a phase-aware DNN-based speech enhancement method [27] that uses the instantaneous frequency deviation (IFD) [28] as one of its training targets, thereby overcoming the difficulty of processing a highly unstructured phase spectrogram. (3)
Acoustic features. As the input to the learning machine, acoustic features play an important role in learning the desired training target. Early studies in supervised speech separation used only a few features, such as pitch-based features [29] and the amplitude modulation spectrogram (AMS) [30], for monaural separation. Following the work of Wang et al. [25], recent studies often adopt a complementary feature set as the acoustic representation, composed of the AMS [30], the relative spectral transformed perceptual linear prediction coefficients (RASTA-PLP) [29,31], the Mel-frequency cepstral coefficients (MFCC) [32], and the Gammatone frequency cepstral coefficients (GFCC) [33].
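As an example of one member of this feature set, the sketch below computes MFCCs from a waveform using only NumPy: framing, Hamming window, power spectrum, triangular mel filterbank, log compression, and a DCT-II. The frame length, hop size, and filter counts are common illustrative defaults, not necessarily the configuration used in the cited work.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC extractor: frame -> Hamming window -> power
    spectrum -> mel filterbank -> log -> DCT-II."""
    n_fft = frame_len
    # frame the signal into overlapping windows
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II to decorrelate; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T
```

GFCC follows the same recipe with a Gammatone filterbank in place of the mel filterbank; AMS and RASTA-PLP involve additional modulation-domain and temporal filtering steps.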
A general scheme of existing DNN-based speech enhancement methods is shown in Figure 1. In the training stage, the complementary acoustic features are extracted directly from the noisy speech, which is a mixture of clean speech and noise, and are used to train the DNN. A magnitude-aware training target is computed from the short-time Fourier transform (STFT) [5] spectra of the clean speech and the mixed speech. In the enhancement stage, the complementary acoustic features extracted from the noisy signal to be enhanced are fed into the trained model to obtain an estimated magnitude mask. The enhanced speech magnitude is then calculated as the product of the estimated magnitude mask and the STFT magnitude of the signal to be enhanced. The final recovered speech is obtained by applying the inverse STFT to the recombined signal consisting of the enhanced magnitude and the phase of the noisy signal.
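This scheme can be sketched end to end in a few lines: compute an STFT, form a magnitude-aware target from the clean and noise spectra (here the IRM as a representative ratio mask), and, in the enhancement stage, apply an estimated mask to the noisy magnitude before inverting with the noisy phase. The window and hop sizes below are illustrative, and the DNN that would predict the mask is omitted.

```python
import numpy as np

def stft(x, frame=512, hop=256):
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n)[:, None]
    return np.fft.rfft(x[idx] * win, frame)

def istft(spec, frame=512, hop=256):
    win = np.hanning(frame)
    n = spec.shape[0]
    out = np.zeros(frame + hop * (n - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(spec, frame)
    for t in range(n):
        out[t * hop:t * hop + frame] += frames[t] * win
        norm[t * hop:t * hop + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)

# training stage: a magnitude-aware target (here the IRM)
def ideal_ratio_mask(clean_spec, noise_spec):
    s, n = np.abs(clean_spec) ** 2, np.abs(noise_spec) ** 2
    return np.sqrt(s / (s + n + 1e-10))

# enhancement stage: apply an estimated mask, reuse the noisy phase
def enhance(noisy, mask, frame=512, hop=256):
    spec = stft(noisy, frame, hop)
    mag = np.abs(spec) * mask          # masked magnitude
    phase = np.angle(spec)             # noisy phase is kept as-is
    return istft(mag * np.exp(1j * phase), frame, hop)
```

In practice the mask passed to `enhance` comes from the trained DNN; the oracle IRM above is only computable when the clean and noise signals are known, that is, during training.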
We note that recent research on DNN-based speech enhancement focuses mainly on training target design and DNN structure optimization [18]. In our opinion, at least two issues, although they have not received widespread attention so far, may play an important role in further improving enhancement performance. One concerns acoustic feature extraction; the other concerns the full use of phase information. We observe that acoustic features extracted directly from the noisy (mixed) signal cannot effectively characterize the distinct properties of the speech and the noise, which hinders network training and target learning. Intuitively, if the acoustic features can be extracted separately from the speech and the noise, it will be easier to construct discriminative acoustic features. On the other hand, phase has shown a strong relationship with speech quality [34,35], and phase processing has received more attention than ever before. In Reference [27], Zheng et al. showed that, on the basis of the existing scheme in Figure 1, incorporating the instantaneous frequency deviation (IFD) [28] as a phase-aware training target to jointly estimate the phase spectrogram can further improve enhancement performance.
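For reference, the IFD is commonly computed from a phase spectrogram as the frame-to-frame phase increment minus the expected increment of each bin's centre frequency, wrapped to (-pi, pi]. The sketch below follows this common definition and may differ in detail from the exact formulation in References [27,28].

```python
import numpy as np

def principal(x):
    """Wrap phase values to the interval (-pi, pi]."""
    return np.angle(np.exp(1j * x))

def instantaneous_frequency_deviation(phase, hop, n_fft):
    """IFD from a phase spectrogram of shape (frames, bins):
    the phase increment between consecutive frames minus the
    expected increment 2*pi*hop*k/n_fft of bin k, wrapped."""
    k = np.arange(phase.shape[1])
    expected = 2.0 * np.pi * hop * k / n_fft
    dphi = np.diff(phase, axis=0)
    return principal(dphi - expected)
```

For a stationary sinusoid sitting exactly on a bin centre, the IFD in that bin is zero; deviations from zero encode how far the local instantaneous frequency departs from the bin centre, which is what makes the IFD a more structured (and hence more learnable) target than the raw phase.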
This paper proposes two main improvements to existing DNN-based speech enhancement methods. First, we propose a novel discriminative complementary feature that fuses information from multiple sources. As shown in Figure 2, we pre-estimate the speech and the noise and compute the phase from the mixed signal, then fuse the features extracted separately from them with the features extracted from the mixed signal to construct a refined acoustic representation. To our knowledge, this is the first study to extract features from pre-estimated speech and noise, and in particular from the phase. Second, to make full use of the phase information, we also adopt the IFD as a phase-aware training target to estimate the phase spectrogram. Unlike Reference [27], however, we employ an independent DNN (see Figure 2) to train the phase-aware target, rather than a single DNN that jointly trains both the magnitude and phase targets, in order to reduce computational complexity. Extensive experiments on the TIMIT corpus [36] show that the proposed method outperforms existing methods in terms of speech quality, speech intelligibility, and speech distortion.
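Assuming each source (the mixture, the pre-estimated speech, the pre-estimated noise, and the phase) has already been converted into a frame-level feature matrix, the fusion step could be as simple as a per-frame concatenation, as sketched below. The function and its signature are purely illustrative and are not the paper's implementation.

```python
import numpy as np

def fuse_features(feat_mix, feat_speech_est, feat_noise_est, feat_phase):
    """Concatenate per-frame feature matrices (frames x dims) from the
    mixture, pre-estimated speech, pre-estimated noise, and phase into
    one fused feature vector per frame."""
    parts = [feat_mix, feat_speech_est, feat_noise_est, feat_phase]
    assert len({p.shape[0] for p in parts}) == 1, "frame counts must match"
    return np.concatenate(parts, axis=1)
```

The resulting feature dimension is the sum of the individual dimensions, so the DNN input layer must be sized accordingly.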
The rest of this paper is organized as follows. The details of the proposed method are described in Section 2: the pre-estimation of noise and speech is introduced in Section 2.1, the proposed feature fusion method in Section 2.2, the calculation of the magnitude-aware and phase-aware training targets in Section 2.3, the network structure and training strategy in Section 2.4, and the speech enhancement procedure in Section 2.5. In Section 3, we present the experimental data, comparison methods, and evaluation metrics. We conduct a series of experiments and analyze the results in Section 4: we present extensive comparison experiments in Section 4.1, analyze the generalization ability of the compared methods on unseen noise in Section 4.2, and gain deeper insight into the proposed method through an ablation study in Section 4.3. Finally, we conclude this study and discuss future work in Section 5.