A Supervised Speech Enhancement Approach with Residual Noise Control for Voice Communication

For voice communication, it is important to extract the speech from its noisy version without introducing unnatural residual noise. By studying the subband mean-squared error (MSE) of the speech for unsupervised speech enhancement approaches and revealing its relationship with the existing loss functions for supervised approaches, this paper derives a generalized loss function that takes residual noise control into account for supervised approaches. Our generalized loss function contains the well-known MSE loss function and many other often-used loss functions as special cases. Compared with traditional loss functions, the generalized loss function is more flexible in making a good trade-off between speech distortion and noise reduction, because a group of well-studied noise shaping schemes can be introduced to control the residual noise in practical applications. Objective and subjective test results verify the importance of residual noise control for supervised speech enhancement approaches.


I. INTRODUCTION
Speech enhancement plays an important role in noisy environments for many applications, such as speech communication, speech interaction, and speech translation. Numerous researchers have devoted great effort to separating the speech from its noisy version, and various approaches have been proposed in the last five decades. Conventional approaches include spectral subtraction [1], statistical methods [2,3], and subspace-based methods [4], which have proved to be valid when the additive noise is stationary or quasi-stationary. However, their performance often degrades heavily under non-stationary and low signal-to-noise ratio (SNR) conditions. Moving ahead with deep learning, supervised approaches have gradually shown their powerful capability to suppress both stationary and highly non-stationary noise signals, mainly because of the highly nonlinear mapping ability of deep neural networks (DNN) [5], [6]. In DNN-based algorithms, the minimum mean-squared error (MSE) is often adopted as the loss criterion to update the weights of the network. Nevertheless, using this criterion directly may cause some problems. First, although the MSE is the most often used criterion, it is not strongly correlated with speech perception [7,8].
Second, global MSE optimization usually yields an over-smoothed estimate that omits some important detailed information. To solve these problems, many new criteria that consider speech perception have been proposed in recent years [9]-[12]. The first is to use perceptually weighted MSE functions, which weight the loss differently in different time-frequency (T-F) regions [10,13]. The second is to use objective metrics as loss functions; for example, the perceptual evaluation of speech quality (PESQ) [14], the short-time objective intelligibility (STOI) [15], and the scale-invariant speech distortion ratio (SI-SDR) [16] have been adopted as loss functions. In [17], speech distortion and residual noise are considered separately in the loss function, which is called the components loss (CL).
Note that all the above-mentioned loss functions aim at suppressing noise as much as possible in noise-only segments. In other words, in noise-only segments, the amount of noise reduction is expected to be infinite. As we know, this aim cannot be achieved in most cases, for several reasons. First, the noise is often stochastic, so the estimation accuracy is inevitably constrained by the limited number of available observations [18,19]. Second, there is a great variety of noise signals, so a DNN model cannot be expected to distinguish all of them correctly from the speech in each T-F unit. Therefore, when the noise cannot be suppressed as completely as expected, the unnatural residual noise may degrade speech quality considerably [20], which needs to be considered carefully. In this paper, we derive a generalized loss function that introduces multiple manual parameters to flexibly balance speech distortion and noise attenuation. More specifically, residual noise control is introduced for voice communication [21,22]. By theoretical derivation, the MSE and other often-used loss functions are shown to be special cases of the proposed generalized loss function.
The remainder of the paper is structured as follows. Section II formulates the problem. Section III derives the generalized loss in detail and introduces used network architecture. Section IV is the experimental settings. Results and analysis are given in Section V. Section VI presents some conclusions.

II. PROBLEM FORMULATION
In the time domain, the noisy signal can be modelled as

x(n) = s(n) + d(n),    (1)

where s(n) is the clean speech and d(n) is the additive noise.
In the frequency domain, (1) can be written as

X_l(k) = S_l(k) + D_l(k),    (2)

where X_l(k), S_l(k), and D_l(k) are, respectively, the discrete Fourier transforms (DFT) of x(n), s(n), and d(n), with frame index l and frequency bin k.
For practical applications, we only have the time-domain noisy signal x(n) or its frequency-domain version X_l(k), so the problem becomes how to estimate s(n) or S_l(k) from the noisy signal. It is common to use the minimum MSE (MMSE) as a criterion in unsupervised speech enhancement approaches. Before introducing the MMSE, we first define the subband square error as

J_x[M_l(k)] = |f(S_l(k)) − g(S_l(k), D_l(k), M_l(k))|²,    (3)

where M_l(k) is a nonlinear spectral gain function, f(a) is a function of a variable a, and g(a, b, c) is a function of three variables a, b, and c. When f(a) = |a| and g(a, b, c) = |(a + b)c|, minimizing E{J_x[M_l(k)]} leads to the MMSE amplitude estimator in [2], where E{·} is the expectation operator. When f(a) = log(|a|) and g(a, b, c) = log(|(a + b)c|), minimizing E{J_x[M_l(k)]} leads to the MMSE log-spectral amplitude estimator in [3]. More complicated forms of f(a) and g(a, b, c) can be chosen; for example, many perceptually weighted error criteria can be included, as discussed in [7]. For supervised approaches, the subband square error is often summed over all frames and frequency bins to define the loss function in the fullband, which is

J_x = Σ_l Σ_k |f(S_l(k)) − g(S_l(k), D_l(k), M_l(k))|².    (4)

One can see that, when f(a) = log(|a|) and g(a, b, c) = log(|(a + b)c|), minimizing J_x is to minimize the MSE of the log-spectral amplitude between the clean speech and the estimated speech, which is the training target in [6]. Note that (3) and (4) are quite similar; the most obvious difference is that J_x[M_l(k)] is the subband square error, while J_x is the fullband square error. The other difference is that the nonlinear spectral gain can be derived theoretically by minimizing E{J_x[M_l(k)]} when the probability density functions (p.d.f.s) of the speech and the noise are both given, whereas it is difficult to derive the nonlinear spectral gain by minimizing J_x; instead, this gain is often mapped from the input noisy features after training the supervised machine learning model. In all, it seems that any subband square error function can be generalized to a fullband one as a supervised training target.
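As a small concrete illustration of the fullband loss in (4), the following NumPy sketch evaluates it for the log-spectral amplitude choice f(a) = log(|a|), g(a, b, c) = log(|(a + b)c|). The function name and the eps regularizer are ours, not from the paper:

```python
import numpy as np

def fullband_log_mse(S, D, M, eps=1e-8):
    """Fullband loss of Eq. (4) with f(a) = log|a| and
    g(a, b, c) = log|(a + b)c|: the squared log-spectral amplitude
    error between the clean and the estimated speech, summed over
    all frames l and frequency bins k.

    S, D: complex clean-speech and noise spectra, shape (frames, bins).
    M: real-valued spectral gain of the same shape."""
    clean = np.log(np.abs(S) + eps)          # f(S_l(k))
    est = np.log(np.abs((S + D) * M) + eps)  # g(S_l(k), D_l(k), M_l(k))
    return np.sum((clean - est) ** 2)
```

When the gain exactly restores the clean magnitude, i.e. M = |S|/|S + D|, the loss vanishes; any other gain gives a positive loss.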
III. PROPOSED ALGORITHM
Using only the MMSE as a criterion, it is difficult to balance speech distortion and noise reduction. This section derives a more generalized fullband loss function.

A. Trade-off Criterion in Subband
In traditional speech enhancement approaches, speech distortion and noise reduction in the subband can be considered separately. The subband square error of the speech and the subband residual noise can be, respectively, given by

J_s[M_l(k)] = |f(S_l(k)) − g(S_l(k), D_l(k), M_l(k))|²    (5)

and

J_d[M_l(k)] = |h(S_l(k), D_l(k), M_l(k))|²,    (6)

where h(a, b, c) is a function of three variables a, b, and c. When f(a) = |a|, g(a, b, c) = |ac|, and h(a, b, c) = |bc|, E{J_s[M_l(k)]} and E{J_d[M_l(k)]} become the MSE of the speech magnitude and the residual noise power in the subband, respectively, which are identical with [23, (8.31) and (8.32)]. By minimizing the subband MSE of the speech under a residual noise constraint, an optimization problem can be formulated to derive the nonlinear spectral gain:

min_{M_l(k)} E{J_s[M_l(k)]}  subject to  E{J_d[M_l(k)]} = |λ̄(β_l(k), D_l(k))|²,    (7)

where λ̄(β, b) is a function of two variables β and b, and β_l(k) ∈ [0, 1] can be both a frequency- and frame-dependent factor that is introduced to control the residual noise flexibly. The optimal spectral gain in (7) can be found theoretically by the Lagrange multiplier method from

L[M_l(k)] = E{J_s[M_l(k)]} + µ (E{J_d[M_l(k)]} − |λ̄(β_l(k), D_l(k))|²),    (8)

where µ ≥ 0 is a Lagrange multiplier. When f(a) = |a|, g(a, b, c) = |ac|, h(a, b, c) = |bc|, and |λ̄(β, b)|² = βE{|b|²}, the optimal spectral gain can be derived from (8) and the constraint in (7), which can be given by

M_l(k) = ξ_l(k) / (µ_l(k) + ξ_l(k)),    (9)

where ξ_l(k) is the a priori SNR and µ_l(k), which depends on the Lagrange multiplier and β_l(k), has a very complicated expression. Moreover, it is not easy to accurately estimate the noise power spectral density in non-stationary noise environments [24]-[26]. However, this optimization can be solved more easily by supervised approaches. To transfer this problem, we need to define the fullband square error of the speech and the fullband residual noise power to derive the loss function for supervised approaches.
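For the magnitude-domain choices above, the gain in (9) is a Wiener-like function of the a priori SNR. A minimal sketch (our own naming) makes the trade-off role of the multiplier explicit:

```python
import numpy as np

def tradeoff_gain(xi, mu):
    """Spectral gain of Eq. (9): M = xi / (mu + xi), where xi is the
    a priori SNR per T-F bin and mu >= 0 comes from the Lagrange
    multiplier. A larger mu shrinks the gain: more noise reduction,
    but also more speech distortion."""
    xi = np.asarray(xi, dtype=float)
    return xi / (mu + xi)
```

With mu = 1 this is the classical Wiener gain; increasing mu pushes every gain value toward zero.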

B. Trade-off Criterion in Fullband
The fullband MSE of the speech and the fullband residual noise can be, respectively, given by

J_s = Σ_l Σ_k |f(S_l(k)) − g(S_l(k), D_l(k), M_l(k))|²    (10)

and

J_d = Σ_l Σ_k |h(S_l(k), D_l(k), M_l(k))|².    (11)

The loss function without any constraints can be given by

J = J_s + µ J_d,    (12)

where (12) is the same as the newly proposed components loss function given in [17]. The loss function with residual noise control is

J_con = J_s + µ Σ_l Σ_k ( |h(S_l(k), D_l(k), M_l(k))| − |λ̄(β_l(k), D_l(k))| )²,    (13)

where λ̄(β_l(k), D_l(k)) sets the desired residual noise level. It is obvious that (13) is a generalization of (12): (13) reduces to (12) when |λ̄(β_l(k), D_l(k))|² ≡ 0. One can observe that β_l(k) is both frequency- and frame-dependent, so it can control the residual noise in each time-frequency bin.

C. A Generalized Loss Function
We further generalize the subband square errors in (5) and (6): the square is substituted by a variable γ ≥ 0 and an additional variable α is introduced as a power on the spectra, so that (5) and (6) become, respectively,

J_s^{γ,α}[M_l(k)] = |f(S_l^α(k)) − g(S_l^α(k), D_l^α(k), M_l^α(k))|^γ    (14)

and

J_d^{γ,α}[M_l(k)] = |h(S_l^α(k), D_l^α(k), M_l^α(k))|^γ.    (15)

Analogously, with the residual noise control, the optimization problem in the subband becomes

min_{M_l(k)} E{J_s^{γ,α}[M_l(k)]}  subject to  E{J_d^{γ,α}[M_l(k)]} = |λ̄(β_l^α(k), D_l^α(k))|^γ.    (16)

By setting f(a) = |a|, g(a, b, c) = |ac|, h(a, b, c) = |bc|, and λ̄(β, b) = β|b|, one can derive a generalized gain function with the Lagrange multiplier method, which is

M_l(k) = ( ξ_l^{c_1}(k) / (µ_l(k) + ξ_l^{c_1}(k)) )^{c_2},    (17)

where c_1 = αγ/(2γ − 2) and c_2 = 1/α, and (17) is identical to [27, (6)]. Note that [27, (6)] was given intuitively without theoretical derivation. When γ = 2 and α = 1, (17) reduces to (9). When γ = 2, one can get M_l(k) = ((ξ_l(k))^α / (µ_l(k) + (ξ_l(k))^α))^{1/α}, which has already been derived and presented in [27, (22)]. Similarly, the generalized loss function for supervised approaches can be given by

J_{γ,α} = Σ_l Σ_k |f(S_l^α(k)) − g(S_l^α(k), D_l^α(k), M_l^α(k))|^γ + µ Σ_l Σ_k ( |h(S_l^α(k), D_l^α(k), M_l^α(k))| − |λ̄(β_l^α(k), D_l^α(k))| )^γ,    (18)

where the λ̄ term controls the residual noise. Eq. (18) is a generalized loss function that includes (12) and (13): (18) reduces to (13) when γ = 2 and α = 1, and it further reduces to (12) by setting |λ̄(β_l^α(k), D_l^α(k))|^γ ≡ 0. It is interesting to see that (3) can also be separated into two components, where one is the MSE of the speech and the other is related to the residual noise. When f(a) = a and g(a, b, c) = (a + b)c, and assuming the speech and the noise are uncorrelated, we have

E{J_x[M_l(k)]} = |1 − M_l(k)|² E{|S_l(k)|²} + |M_l(k)|² E{|D_l(k)|²},    (19)

where E{J_s[M_l(k)]} = |1 − M_l(k)|² E{|S_l(k)|²} relates to the power of speech distortion and E{J_d[M_l(k)]} = |M_l(k)|² E{|D_l(k)|²} relates to the power of residual noise.
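The γ = 2 closed form quoted above can be sketched directly; a minimal NumPy version (the helper name is ours):

```python
import numpy as np

def generalized_gain(xi, mu, alpha):
    """gamma = 2 case of the generalized gain in Eq. (17):
    M = (xi^alpha / (mu + xi^alpha))^(1/alpha).
    With alpha = 1 this reduces to the Wiener-type gain of Eq. (9)."""
    xi_a = np.asarray(xi, dtype=float) ** alpha
    return (xi_a / (mu + xi_a)) ** (1.0 / alpha)
```

The compression exponent alpha reshapes how aggressively low-SNR bins are attenuated while leaving the alpha = 1 Wiener behavior as a special case.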
Thus, E{J_x[M_l(k)]} is a combination of speech distortion and residual noise, so the fullband MSE loss function of the complex spectrum is also a special case of the generalized loss function in (18). If f(a) = |a| and g(a, b, c) = |(a + b)c| are chosen, the decomposition of E{J_x[M_l(k)]} is more complicated than (19), which will not be discussed further due to limited space.
In this letter, we emphasize the importance of introducing the residual noise control. We apply f(a) = |a|, g(a, b, c) = |ac|, h(a, b, c) = |bc|, and λ̄(β, b) = β|b|, although more complicated expressions can be chosen when taking perceptual quality into account. Accordingly, we have

J_{γ,α} = Σ_l Σ_k | |S_l(k)|^α − M_l^α(k)|S_l(k)|^α |^γ + µ Σ_l Σ_k | M_l^α(k)|D_l(k)|^α |^γ    (20)

and

J_{γ,α,con} = Σ_l Σ_k | |S_l(k)|^α − M_l^α(k)|S_l(k)|^α |^γ + µ Σ_l Σ_k | M_l^α(k)|D_l(k)|^α − β_l^α(k)|D_l(k)|^α |^γ,    (21)

where α is set to a constant value and β_l(k) is constant over frequency for simplicity; that is, β_l(k) ≡ β_0 and α = 1 are used in the following. We only study the impact of β_0, µ, and γ on supervised approaches.
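With these simplifications (β_l(k) ≡ β_0, α = 1), the trained loss can be sketched as below. The exact grouping of the γ-power and the β_0 floor is our reading of (21), so treat this as an assumption rather than the authors' reference implementation:

```python
import numpy as np

def generalized_loss(S_mag, D_mag, M, gamma=2.0, mu=1.0, beta0=0.1):
    """Sketch of J_{gamma,alpha,con} in (21) with alpha = 1 and
    beta_l(k) ≡ beta0. The residual noise M|D| is pulled toward the
    preset floor beta0|D| instead of toward zero, which preserves a
    natural noise floor."""
    speech_term = np.sum(np.abs(S_mag - M * S_mag) ** gamma)
    noise_term = np.sum(np.abs(M * D_mag - beta0 * D_mag) ** gamma)
    return speech_term + mu * noise_term
```

Note that M ≡ beta0 zeroes the residual-noise term while leaving speech distortion, and M ≡ 1 does the opposite; this is exactly the trade-off the loss balances.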

IV. EXPERIMENTAL SETUP

A. Dataset
Experiments are conducted on the TIMIT corpus, where 1000 and 200 utterances are randomly chosen as the training and evaluation datasets, respectively. 125 types of environmental noise [6,28] are used to generate noisy utterances at SNR levels ranging from -5 dB to 15 dB with an interval of 5 dB. For model testing, an additional 10 male and 10 female utterances are chosen and mixed with unseen noise signals taken from NOISEX92 [29], with SNRs ranging from -5 dB to 10 dB with an interval of 5 dB.
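Mixing a speech and a noise signal at a prescribed SNR amounts to scaling the noise; a minimal sketch of the procedure described above (the helper name is ours, not from the paper):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio of the mixture
    equals snr_db, as in the -5 dB to 15 dB mixing above. Returns the
    mixture and the applied noise scale."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, scale
```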

B. Network Architecture
The U-Net is chosen as the network in this letter, as it has been widely adopted for speech separation tasks [30]. As shown in Fig. 1, the network consists of a convolutional encoder and decoder, both of which are comprised of five convolutional blocks, where a 2-D convolution layer is adopted, followed by batch normalization (BN) and an exponential linear unit (ELU). Skip connections are introduced to compensate for the information loss during the feature compression process. Note that the mapping target is the gain function, so the sigmoid function is adopted to ensure that the output ranges from 0 to 1. A causal mechanism is introduced to achieve real-time processing, where only the past frames are involved in the convolution calculation. The tensor output size of each layer is given in the (Channels, TimeStep, Feat) format, as shown in Fig. 1.
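The causal mechanism can be illustrated with a 1-D convolution along the time axis that left-pads the input with k − 1 zero frames, so each output frame depends only on current and past frames. This is a simplified stand-in for the causal 2-D convolutions in the U-Net, not the authors' exact layer:

```python
import numpy as np

def causal_conv_time(x, kernel):
    """Causal convolution along the time axis: output frame t uses
    only input frames t-k+1 .. t, thanks to the left zero-padding.
    x: (time, feat) features; kernel: (k, feat); returns (time,)."""
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    return np.array([np.sum(padded[t:t + k] * kernel)
                     for t in range(x.shape[0])])
```

Changing a future input frame leaves all earlier output frames untouched, which is exactly the property needed for frame-by-frame real-time processing.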

C. Loss Functions and Training Models
This letter chooses three loss functions as baselines: the MSE in (4), the time-domain MSE-based loss (TMSE) [30], and the recently proposed SI-SDR-based loss [30]. As a T-F domain network is used, an additional fixed iSTFT-like layer is needed to transform the estimated T-F spectrum back into the time domain for the TMSE- and SI-SDR-based losses [31]. These baselines are compared with the proposed generalized loss function given in (18) with (20) and (21). All the models are trained with the Adam optimizer [32], a variant of stochastic gradient descent (SGD).

Fig. 1. The network architecture adopted in this study. The input is the noisy magnitude spectra and the output is the estimated gain functions.
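Adam [32] keeps exponential moving averages of the gradient and its square and uses their bias-corrected ratio as the step direction. A toy NumPy version of one update step (illustrative only; real training uses a framework implementation):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: refresh the biased first/second moment
    estimates m and v, bias-correct them by 1 - b^t, and step against
    the corrected ratio. t is the 1-based step index."""
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Because of the bias correction, the very first step has magnitude close to the learning rate regardless of the gradient scale.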

V. RESULTS AND ANALYSIS

A. Objective Evaluation
This letter uses four objective measurements: noise attenuation (NA) [21], speech attenuation (SA) [21], PESQ [14], and SDR [33]. The test results w.r.t. γ, β_0, and µ are shown in Fig. 2, where γ = 1, 2, 3, β_0 = −10 dB, −20 dB, −30 dB, and µ = 0.5, 1, 2, 3, 4 are considered. The test results of the three baselines are also presented for comparison. From this figure, one can observe the following phenomena. First, increasing β_0 decreases NA. This is because the residual noise control mechanism is introduced into the optimization, which means that, during the training process, the residual noise in the estimated spectra gradually approaches the preset residual noise threshold. As a consequence, the characteristic of the residual noise is expected to be effectively preserved, which is further confirmed by the subjective listening tests below. Second, increasing µ is beneficial to noise suppression while introducing more speech distortion. As the generalized loss can be viewed as a joint optimization of both speech distortion and noise reduction, a larger µ leads to smaller gain values, as (17) states: on the one hand, more interference is suppressed; on the other hand, more speech components are inevitably discarded. Third, increasing γ has a negative influence on NA and SA. Finally, among the various parameter configurations, (2, −30 dB, 0.5), (2, −30 dB, 1), and (2, −20 dB, 1) can be chosen, because relatively better performance is obtained on all four objective metrics. One can also observe that the three competing loss functions achieve better performance on some objective metrics while suffering much worse performance on others. For example, SI-SDR and TMSE have larger SDR values, while their PESQ scores are even lower than that of the MSE, which is consistent with the study in [16].

B. Subjective Evaluation
To evaluate the speech quality of the proposed generalized loss (GL) function, a subjective evaluation is conducted among GL and the baselines, following the subjective testing procedures of [34]. In this comparison, we choose the parameter configuration (2, −20 dB, 1) for the proposed GL function. The experiment is conducted in a standard listening room with 10 listeners. The listening material consists of 20 utterances (half from male and half from female speakers in the TIMIT corpus), each mixed with one of five noises: aircraft, babble, bus, cafeteria, and car. Four SNR conditions are selected for mixing, i.e., −5 dB, 0 dB, 5 dB, and 10 dB. A speech pause of 3 s duration is inserted before each utterance, so the duration of each listening utterance is about 13 s. Each listener writes down the index of the utterance they prefer, considering both noise naturalness and speech quality. As in [34], an "Equal" option is also provided if no subjective preference can be given. To avoid inertia, the utterance order in each pair is shuffled. The averaged subjective results are presented in Table I. From this table, one can observe that the proposed GL function with residual noise control achieves better performance in the subjective test, which can be explained by the fact that the proposed GL method effectively recovers the speech components while preserving the characteristic of the background noise to some extent, compared with all the baselines.

VI. CONCLUSION
This letter derives a generalized loss function that can flexibly balance noise attenuation and speech distortion through multiple manual parameters. In addition, the MSE and other typical loss functions are shown to be its special cases. Both objective and subjective tests show that it is important to control the residual noise for supervised speech enhancement approaches, as the residual noise becomes much more natural. Further work could concentrate on combining the residual noise control scheme with objective-metric-based loss functions to further improve the naturalness of the residual noise.