1. Introduction
Speech enhancement is essential for various speech applications such as robust speech recognition, hearing aids, and mobile communications [1,2,3,4]. The main objective of speech enhancement is to improve the quality and intelligibility of noisy speech by suppressing the background noise or interferences.
In the early studies on speech enhancement, the minimum mean-square error (MMSE)-based spectral amplitude estimator algorithms [5,6] were popular, producing enhanced signals with low residual noise. However, the MMSE-based methods have been reported to be ineffective in non-stationary noise environments due to their stationarity assumption on speech and noise. An effective way to deal with non-stationary noise is to utilize a priori information extracted from a speech or noise database (DB), an approach known as template-based speech enhancement. One of the most well-known template-based schemes is the non-negative matrix factorization (NMF)-based speech enhancement technique [7,8]. NMF is a latent factor analysis technique that discovers the underlying part-based non-negative representations of the given data. Since there is no strict assumption on the speech and noise distributions, the NMF-based speech enhancement technique is robust to non-stationary noise environments. However, since the NMF-based algorithm assumes that all data can be described as a linear combination of a finite set of bases, it is known to suffer from speech distortion that this representational form cannot cover.
In the past few years, deep neural network (DNN)-based speech enhancement has received tremendous attention due to its ability to model complex mappings [9,10,11,12]. These methods map the noisy spectrogram to the clean spectrogram via neural networks such as the convolutional neural network (CNN) [11] or the recurrent neural network (RNN) [12]. Although the DNN-based speech enhancement techniques have shown promising performance, most of them focus only on modifying the magnitude spectra. This can cause a phase mismatch between the clean and enhanced speech, since the DNN-based speech enhancement methods usually reuse the noisy phase for waveform reconstruction. For this reason, there has been growing interest in phase-aware speech enhancement [13,14,15], which exploits the phase information during training and reconstruction. To circumvent the difficulty of phase estimation, end-to-end (E2E) speech enhancement techniques, which directly enhance the noisy speech waveform in the time domain, have been developed [16,17,18]. Since the E2E speech enhancement techniques operate in a waveform-to-waveform manner without any consideration of the spectra, their performance is not dependent on the accuracy of the phase estimation.
The E2E approaches, however, rely on distance-based loss functions between the time-domain waveforms. Since these distance-based costs do not take human perception into account, the E2E approaches are not guaranteed to achieve good scores on human-perception-related metrics, e.g., the perceptual evaluation of speech quality (PESQ) [19] and short-time objective intelligibility (STOI) [20]. Recently, generative adversarial network (GAN) [21]-based speech enhancement techniques have been developed to overcome the limitation of the distance-based costs [22,23,24,25,26]. The adversarial losses of GAN provide an alternative objective function that reflects human auditory properties, which can make the distribution of the enhanced speech close to that of the clean speech. To our knowledge, SEGAN [22] was the first attempt to apply GAN to the speech enhancement task; it used the noisy speech as conditional information for a conditional GAN (cGAN) [27]. In [26], an approach to replace the vanilla GAN with an advanced GAN, such as the Wasserstein GAN (WGAN) [28] or the relativistic standard GAN (RSGAN) [29], was proposed based on the SEGAN framework.
Even though the GAN-based speech enhancement techniques have been found successful, two important issues remain: (1) training instability and (2) a lack of consideration of the various characteristics of speech. Since GAN aims at finding the Nash equilibrium of a mini-max problem, its training is known to be unstable. A number of efforts have been devoted to stabilizing GAN training in image processing, by modifying the loss function [28] or the generator and discriminator structures [30,31]. In speech processing, however, this problem has not yet been extensively studied. Moreover, since most of the GAN-based speech enhancement techniques directly employ the models used in image generation, it is necessary to modify them to suit the inherent nature of speech. For instance, the GAN-based speech enhancement techniques [22,24,26] commonly use a U-Net generator that originated from image processing tasks. Since the U-Net generator consists of multiple CNN layers, it is insufficient for capturing the temporal information of the speech signal. In regression-based speech enhancement, a modified U-Net structure that adds RNN layers to capture the temporal information of speech showed prominent performance [32]. For speech synthesis, an alternative loss function that depends on multiple window lengths and fast Fourier transform (FFT) sizes was proposed in [33]; it generated speech of good quality and also considered the speech characteristics in the frequency domain.
In this paper, we propose novel generator and discriminator structures for GAN-based speech enhancement that reflect the speech characteristics while ensuring stable training. The conventional generator is trained to find a mapping function from the noisy speech to the clean speech by using sequential convolution layers, which is considered ineffective, especially for speech data. In contrast, the proposed generator progressively estimates the wide frequency range of the clean speech via a novel up-sampling layer.
In the early stage of GAN training, it is too easy for the conventional discriminator to differentiate real samples from fake samples for high-dimensional data. This often causes GAN to fail to reach the equilibrium point due to vanishing gradients [30]. To address this issue, we propose a multi-scale discriminator that is composed of multiple sub-discriminators processing speech samples at different sampling rates. Even when the training is in its early stage, the sub-discriminators at low sampling rates cannot easily differentiate the real samples from the fake ones, which contributes to stabilizing the training; a minimal sketch of this downsampling idea follows the contribution list below. Empirical results showed that the proposed generator and discriminator were successful in stabilizing GAN training and outperformed the conventional GAN-based speech enhancement techniques. The main contributions of this paper are summarized as follows:
We propose a progressive generator to reflect the multi-resolution characteristics of speech.
We propose a multi-scale discriminator to stabilize the GAN training without additional complex training techniques.
The experimental results showed that the multi-scale structure is an effective solution for both deterministic and GAN-based models, outperforming the conventional GAN-based speech enhancement techniques.
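To make the multi-scale idea concrete, the sketch below shows one simple way to obtain speech views at different sampling rates by repeated average pooling; this is an illustrative PyTorch sketch, and the pooling factor, the number of scales, and the function name are our own assumptions, not the implementation described in this paper.

```python
import torch.nn.functional as F

def multi_scale_views(wave, num_scales=3):
    # wave: (batch, 1, samples) speech waveform.
    # Each average-pooling step halves the sampling rate, removing
    # high-frequency detail before a sub-discriminator sees the signal.
    views = [wave]
    for _ in range(num_scales - 1):
        wave = F.avg_pool1d(wave, kernel_size=2, stride=2)
        views.append(wave)
    return views
```

Low-rate views discard fine detail, so the corresponding sub-discriminators cannot trivially separate real from generated speech early in training, which is the stabilizing effect described above.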
The rest of the paper is organized as follows: In Section 2, we introduce GAN-based speech enhancement. In Section 3, we present the progressive generator and multi-scale discriminator. Section 4 describes the experimental settings and performance measurements. In Section 5, we analyze the experimental results. We draw conclusions in Section 6.
2. GAN-Based Speech Enhancement
An adversarial network models the complex distribution of the real data via a two-player mini-max game between a generator and a discriminator. Specifically, the generator takes a randomly sampled noise vector $z$ as input and produces a fake sample $\hat{x} = G(z)$ to fool the discriminator. On the other hand, the discriminator is a binary classifier that decides whether an input sample is real or fake. In order to generate a realistic sample, the generator is trained to deceive the discriminator, while the discriminator is trained to distinguish between the real sample $x$ and $\hat{x}$. In an adversarial training process, the generator and the discriminator are alternately trained to minimize their respective loss functions. The loss functions for the standard GAN can be defined as follows:

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] - \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right],$$

$$\mathcal{L}_G = -\mathbb{E}_{z \sim p_z(z)}\left[\log D(G(z))\right],$$

where $z$ is a randomly sampled vector from $p_z(z)$, which is usually a normal distribution, and $p_{\mathrm{data}}(x)$ is the distribution of the clean speech in the training dataset.
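For concreteness, a minimal PyTorch sketch of one alternating update under these losses is given below; the module names, the probability-output discriminator, and the non-saturating generator loss are assumptions made for the example, not details of the systems discussed in this paper.

```python
import torch

def gan_step(generator, discriminator, x_real, z, opt_g, opt_d, eps=1e-8):
    # Discriminator update: push D(x) toward 1 and D(G(z)) toward 0.
    x_fake = generator(z).detach()            # detach: no gradient into G
    d_real = discriminator(x_real)            # probabilities in (0, 1)
    d_fake = discriminator(x_fake)
    loss_d = -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update (non-saturating form): push D(G(z)) toward 1.
    loss_g = -torch.log(discriminator(generator(z)) + eps).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```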
Since GAN was initially proposed for unconditional image generation, which has no exact target, it is inadequate to directly apply GAN to speech enhancement, which is a regression task estimating the clean target corresponding to the noisy input. For this reason, GAN-based speech enhancement employs a conditional generation framework [27] in which both the generator and discriminator are conditioned on the noisy waveform $c$ that has the clean waveform $x$ as the target. By concatenating the noisy waveform $c$ with the randomly sampled vector $z$, the generator $G$ can produce a sample that is closer to the clean waveform $x$. The training process of the cGAN-based speech enhancement is shown in Figure 1a, and the loss functions of the cGAN-based speech enhancement are

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_{\mathrm{data}}(x),\, c \sim p_{\mathrm{data}}(c)}\left[\log D(x, c)\right] - \mathbb{E}_{z \sim p_z(z),\, c \sim p_{\mathrm{data}}(c)}\left[\log\left(1 - D(G(z, c), c)\right)\right],$$

$$\mathcal{L}_G = -\mathbb{E}_{z \sim p_z(z),\, c \sim p_{\mathrm{data}}(c)}\left[\log D(G(z, c), c)\right],$$

where $p_{\mathrm{data}}(x)$ and $p_{\mathrm{data}}(c)$ are respectively the distributions of the clean and noisy speech in the training dataset.
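A hedged PyTorch sketch of these conditional losses follows; the channel-wise concatenation of the condition and the (batch, channel, time) waveform shapes are illustrative assumptions rather than the exact interface used by the systems cited above.

```python
import torch

def cgan_losses(generator, discriminator, x, c, eps=1e-8):
    # x: clean waveform (batch, 1, T); c: noisy waveform (batch, 1, T).
    z = torch.randn_like(x)                          # randomly sampled vector z
    x_hat = generator(torch.cat([z, c], dim=1))      # G(z, c): enhanced speech
    d_real = discriminator(torch.cat([x, c], dim=1))               # D(x, c)
    d_fake = discriminator(torch.cat([x_hat.detach(), c], dim=1))  # for D update
    loss_d = -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
    # For the generator update, gradients must flow through x_hat.
    loss_g = -torch.log(discriminator(torch.cat([x_hat, c], dim=1)) + eps).mean()
    return loss_d, loss_g
```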
In the conventional training of the cGAN, the probabilities that a sample comes from the real data, $D(x, c)$, and from the generated data, $D(G(z, c), c)$, should both reach the ideal equilibrium point $0.5$. However, unlike the expected ideal equilibrium point, they both have a tendency to become 1 because the generator cannot influence the probability of the real sample $D(x, c)$. In order to alleviate this problem, RSGAN [29] proposed a discriminator that estimates the probability that the real sample is more realistic than the generated sample. This discriminator makes the probability of the generated sample $D(G(z, c), c)$ increase when that of the real sample $D(x, c)$ decreases, so that both probabilities can stably reach the Nash equilibrium state. In [26], the experimental results showed that, compared to other conventional GAN-based speech enhancement techniques, the RSGAN-based speech enhancement technique improved the stability of training and enhanced the speech quality. The training process of the RSGAN-based speech enhancement is given in Figure 1b, and the loss functions of the RSGAN-based speech enhancement can be written as:

$$\mathcal{L}_D = -\mathbb{E}_{(x_r, x_f)}\left[\log \sigma\left(C(x_r) - C(x_f)\right)\right],$$

$$\mathcal{L}_G = -\mathbb{E}_{(x_r, x_f)}\left[\log \sigma\left(C(x_f) - C(x_r)\right)\right],$$

where the real and fake data-pairs are defined as $x_r = (x, c)$ and $x_f = (G(z, c), c)$, and $C(\cdot)$ is the output of the last layer in the discriminator before the sigmoid activation function $\sigma(\cdot)$, i.e., $D(\cdot) = \sigma(C(\cdot))$.
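The relativistic losses above can be written compactly with the numerically stable identity $-\log \sigma(a) = \mathrm{softplus}(-a)$; the following PyTorch sketch assumes the discriminator already returns the pre-sigmoid outputs $C(x_r)$ and $C(x_f)$.

```python
import torch.nn.functional as F

def rsgan_losses(c_real, c_fake):
    # c_real = C(x_r), c_fake = C(x_f): pre-sigmoid discriminator outputs
    # for the real pair (x, c) and the fake pair (G(z, c), c).
    # Note: -log(sigmoid(a)) == softplus(-a), which is numerically stable.
    loss_d = F.softplus(-(c_real - c_fake)).mean()   # -log σ(C(x_r) - C(x_f))
    loss_g = F.softplus(-(c_fake - c_real)).mean()   # -log σ(C(x_f) - C(x_r))
    return loss_d, loss_g
```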
In order to stabilize GAN training, two penalties are commonly used: a gradient penalty for the discriminator [28] and an $L_1$ loss penalty for the generator [24]. First, the gradient penalty regularization for the discriminator is used to prevent exploding or vanishing gradients. This regularization penalizes the model if the $L_2$ norm of the discriminator gradient moves away from 1, in order to satisfy the Lipschitz constraint. The modified discriminator loss function with the gradient penalty is as follows:

$$\mathcal{L}_D^{GP} = \mathcal{L}_D + \lambda_{gp}\, \mathbb{E}_{(\tilde{x}, c) \sim p_{(\tilde{x}, c)}}\left[\left(\left\|\nabla_{\tilde{x}} C(\tilde{x}, c)\right\|_2 - 1\right)^2\right],$$

where $p_{(\tilde{x}, c)}$ is the joint distribution of $c$ and $\tilde{x}$, $\epsilon$ is sampled from a uniform distribution in $[0, 1]$, and $\tilde{x} = \epsilon x + (1 - \epsilon)\hat{x}$ is the sample from the interpolation between the clean speech $x$ and the generated speech $\hat{x}$. $\lambda_{gp}$ is the hyper-parameter that balances the gradient penalty loss and the adversarial loss of the discriminator.
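A sketch of this penalty in PyTorch is shown below; the interpolation scheme, the one-centered squared term, and the default $\lambda_{gp} = 10$ follow the common WGAN-GP recipe and are not taken from this paper's experimental settings.

```python
import torch

def gradient_penalty(discriminator, x, x_hat, c, lambda_gp=10.0):
    # Interpolate between clean speech x and generated speech x_hat.
    eps = torch.rand(x.size(0), 1, 1, device=x.device)     # ε ~ U[0, 1]
    x_tilde = (eps * x + (1.0 - eps) * x_hat.detach()).requires_grad_(True)
    d_out = discriminator(torch.cat([x_tilde, c], dim=1))  # C(x̃, c)
    grads = torch.autograd.grad(outputs=d_out.sum(), inputs=x_tilde,
                                create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)  # L2 norm per sample
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()     # one-centered penalty
```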
Second, several prior studies [22,23,24] found it effective to use an additional loss term that minimizes the $L_1$ loss between the clean speech $x$ and the generated speech $\hat{x}$ for the generator training. The modified generator loss with the $L_1$ loss is defined as

$$\mathcal{L}_G^{L_1} = \mathcal{L}_G + \lambda_{L_1}\, \mathbb{E}\left[\left\|x - \hat{x}\right\|_1\right],$$

where $\|\cdot\|_1$ is the $L_1$ norm, and $\lambda_{L_1}$ is a hyper-parameter for balancing the $L_1$ loss and the adversarial loss of the generator.
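As a final sketch, the combined generator objective can be computed as below; $\lambda_{L_1} = 100$ is a placeholder commonly used in the cGAN literature, not the value used in this paper.

```python
import torch

def generator_loss_with_l1(adv_loss_g, x, x_hat, lambda_l1=100.0):
    # L1 reconstruction term between clean x and generated x_hat,
    # added to the adversarial generator loss with weight lambda_l1.
    l1 = torch.mean(torch.abs(x - x_hat))
    return adv_loss_g + lambda_l1 * l1
```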
6. Conclusions
In this paper, we proposed a novel GAN-based speech enhancement technique utilizing a progressive generator and a multi-scale discriminator. In order to reflect the speech characteristics, we introduced a progressive generator that can progressively estimate the wide frequency range of the speech by incorporating an up-sampling layer. Furthermore, to accelerate and stabilize the training, we proposed a multi-scale discriminator consisting of a number of sub-discriminators operating at different sampling rates.
For the performance evaluation of the proposed methods, we conducted a set of speech enhancement experiments using the VoiceBank-DEMAND dataset. The results showed that the proposed technique provides more stable GAN training while achieving consistent performance improvements on objective and subjective measures for speech enhancement. We also confirmed the semi-real-time feasibility of the approach by observing only a small increase in the real-time factor (RTF) between the baseline generator and the progressive generator.
As the proposed network mainly focused on the multi-resolution attribute of speech in the time domain, one possible future study is to expand the proposed network to utilize the multi-scale attribute of speech in the frequency domain. Since the progressive generator and multi-scale discriminator can also be applied to GAN-based speech reconstruction models, such as neural vocoders for speech synthesis and codecs, we will also study the effects of the proposed methods on these models.