Lightweight End-to-End Speech Enhancement Generative Adversarial Network Using Sinc Convolutions

Abstract: Generative adversarial networks (GANs) have recently garnered significant attention for their use in speech enhancement tasks, in which they generally process and reconstruct speech waveforms directly. Existing GANs for speech enhancement rely solely on the convolution operation, which may not accurately characterize the local information of speech signals, particularly high-frequency components. Sinc convolution has been proposed to allow a network to learn more meaningful filters in the input layer, and has achieved remarkable success in several speech signal processing tasks. Nevertheless, Sinc convolution for speech enhancement remains an under-explored research direction. This paper proposes Sinc-SEGAN, a novel generative adversarial architecture for speech enhancement, which merges two powerful paradigms: Sinc convolution and the speech enhancement GAN (SEGAN). The proposed system has two highlights. First, it works in an end-to-end manner, avoiding the distortion caused by imperfect phase estimation. Second, it derives a customized filter bank, tuned compactly and efficiently for the desired application. We empirically study the influence of different configurations of Sinc convolution, including the placement of the Sinc convolution layer, the length of input signals, the number of Sinc filters, and the kernel size of Sinc convolution. Moreover, we employ a set of data augmentation techniques in the time domain, which further improve system performance and generalization. Compared to competitive baseline systems, Sinc-SEGAN outperforms all of them with drastically reduced system parameters, demonstrating its suitability for practical usage, e.g., hearing aid design and cochlear implants. Additionally, the data augmentation methods further boost Sinc-SEGAN performance across classic objective evaluation criteria for speech enhancement.


Introduction
Speech enhancement is the task of removing or attenuating background noise from a speech signal, and it is generally concerned with improving the intelligibility and quality of degraded speech [1]. It is widely used as a preprocessor in speech-related applications, including robust automatic speech recognition systems [2] and communication systems, e.g., speech coding [3], hearing aid design [4], and cochlear implants [5]. Conventional speech enhancement approaches include the Wiener filter [6], time-frequency masking [7-9], signal approximation [10,11], spectral mapping [12,13], etc. In the last few years, supervised methods for speech enhancement that leverage deep learning have become mainstream. Well-known models include deep de-noising autoencoders [14], convolutional neural networks (CNNs) [15], and recurrent neural networks [16,17].
There exists a class of generative methods relying on generative adversarial networks (GANs) [18], which have CNNs as the backbone and have been verified to be efficient for speech enhancement [2,19-23]. GANs designate a generator for the enhancement mapping and a discriminator for distinguishing real signals from fake ones. With the information transmitted from the discriminator, the generator learns to produce outputs that resemble the realistic distribution of clean signals. Most past attempts deal with magnitude spectrum inputs, as it is often claimed that short-time phase is unimportant for speech enhancement [24]. However, further studies [25] demonstrate that a clean phase spectrum can deliver significant improvements to speech quality.
CNNs are the most popular deep learning architecture for processing raw speech inputs. This is due to their weight sharing, local filters, and pooling features, which allow the extraction of robust and invariant representations. The first convolutional layer is a critical part of waveform-based CNNs [26], since it processes high-dimensional inputs and extracts low-level speech representations for deeper layers. However, it is susceptible to vanishing gradient problems. To alleviate this issue, Ravanelli et al. [26] proposed the Sinc convolution to learn more meaningful filters in the input layer. Unlike a standard CNN, the Sinc-convolution layer convolves the waveform with a set of parametrized Sinc functions that implement band-pass filters, and only needs to learn the low and high cutoff frequencies. Consequently, the Sinc convolution is faster-converging and lightweight. It performs extraordinarily well in capturing selective speech clues, e.g., the pitch region, the first formant, and the second formant, which are essential for resembling clean speech signals. It has also achieved remarkable success in several fields of speech signal processing, e.g., speech recognition [27,28], speaker identification [26], keyword spotting [29], etc.
Unfortunately, Sinc convolution for speech enhancement is still an under-explored research direction. There is no available model fusing these two powerful paradigms: the Sinc convolution operating over raw speech waveforms and generative adversarial models for speech enhancement. Therefore, we propose to bridge this gap by merging the Sinc convolution and the speech enhancement GAN (SEGAN), resulting in a customizable, lightweight, and interpretable system, termed Sinc-SEGAN. The contributions of this paper are summarized as follows:
• Transfer the success achieved by the Sinc convolution in the fields of speech and speaker recognition to end-to-end speech enhancement.
• Optimize the SEGAN architecture from the seminal work [19], and adapt the original Sinc convolution layer to fit the advanced SEGAN.
• Analyze the learned filters of the Sinc convolution layer.
• Apply data augmentation techniques directly on raw speech waveforms to further improve system performance.
Experimental results show that the proposed Sinc-SEGAN overtakes a set of competitive baseline models, especially on higher-level perceptual quality and speech intelligibility. Additionally, the number of system parameters is drastically reduced, down to merely 17.7% of that of the baseline system. Moreover, data augmentation techniques further boost system performance across all classic objective evaluation metrics of speech enhancement. Analysis of the Sinc filters shows that the learned filters are tuned precisely to capture critical speech clues. These results demonstrate the potential applications of the proposed system, e.g., portable devices, hearing aid design, and cochlear implants. Notably, the proposed Sinc-SEGAN system is generic enough to be applied to existing GAN models for speech enhancement to further improve performance.

Related Works
Pascual et al. [19] focus on generative architectures for speech enhancement, which leverage the ability of deep learning to learn complex functions from large example sets. The enhancement mapping is accomplished by the generator, whereas the discriminator, by discriminating between real and fake signals, transmits information to the generator so that the generator can learn to produce outputs that resemble the realistic distribution of clean signals. The proposed system learns from different speakers and noise types, and incorporates them together into the same shared parametrization, which makes the system simple and generalizable in those dimensions.
On the basis of [19], Phan et al. [30] point out that all existing SEGAN systems execute the enhancement mapping in a single stage with a single generator, which may not be optimal. In this light, they hypothesize that multi-stage enhancement mapping would be better than a single-stage one. To this end, they divide the enhancement process into multiple stages, each containing an enhancement mapping. Each mapping is conducted by a generator, and each generator is tasked to further correct the output produced by its predecessor. These generators are cascaded to gradually enhance a noisy input signal and yield a refined enhanced signal. They propose two improved SEGAN frameworks, namely iterated SEGAN (ISEGAN) and deep SEGAN (DSEGAN). In the ISEGAN system, the generator parameters are shared, constraining ISEGAN's generators to apply the same mapping iteratively, as its name implies. DSEGAN's generators have their own independent parameters, allowing them to learn different mappings flexibly. However, DSEGAN's generators have N_G times as many parameters as ISEGAN's, where N_G is the number of generators.
Phan et al. [31] reveal that the existing class of GANs for speech enhancement solely rely on the convolution operation, which may obscure temporal dependencies across the sequence input. To remedy this issue, they propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of SEGAN, referred to as SASEGAN. Furthermore, they empirically study the effect of placing the self-attention layer at (de)convolutional layers with varying layer indices, including all layers (as long as memory allows).
As Pascual et al. [19] report, they open the exploration of generative architectures for speech enhancement to progressively incorporate further speech-centric design choices for performance improvement. This paper aims to further optimize SEGAN by fusing in a powerful paradigm: Sinc convolution. To the best of our knowledge, although Sinc convolution has achieved great success and been widely utilized in several fields of speech signal processing, e.g., speech recognition [27] and speaker verification [26], its application to the speech enhancement task remains unexplored.

Speech Enhancement GANs
Given a dataset X = {(x*_1, x̃_1), (x*_2, x̃_2), ..., (x*_N, x̃_N)} consisting of N pairs of raw signals (clean speech signal x* and noisy speech signal x̃), speech enhancement aims to find a mapping f_θ(x̃): x̃ → x̂ that transforms the raw noisy signal x̃ into the enhanced signal x̂, where θ contains the parameters of the enhancement network.
Conforming to the GAN design [18], the generator learns an effective mapping that imitates the real data distribution to generate novel samples related to those of the training set, i.e., x̂ = Gen(x̃). In contrast, the discriminator plays the role of a classifier that distinguishes real samples, coming from the dataset the generator is imitating, from fake samples made up by the generator. The discriminator guides θ towards the distribution of clean speech signals by classifying (x*, x̃) as real and (x̂, x̃) as fake. Eventually, the generator learns to produce enhanced signals x̂ good enough to fool the discriminator, such that the discriminator classifies (x̂, x̃) as real. Following the least-squares formulation adopted by SEGAN, the objective functions read

min_Dis L(Dis) = (1/2) E_{x*,x̃}[(Dis(x*, x̃) − 1)²] + (1/2) E_{z,x̃}[Dis(Gen(z, x̃), x̃)²],
min_Gen L(Gen) = (1/2) E_{z,x̃}[(Dis(Gen(z, x̃), x̃) − 1)²] + λ ||Gen(z, x̃) − x*||₁,

where Dis(·) is the discriminator module, Gen(·) is the generator module, and z denotes a latent variable. Inspired by the effectiveness of the L1 norm in the image manipulation domain [32,33], it is added to the generator loss to encourage more fine-grained and realistic results. The influence of the L1 norm is regulated by the hyper-parameter λ. Pascual et al. [19] set λ = 100 empirically, and hence we adopt this value in all experiments.
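As a concrete sketch of these two objectives (a minimal NumPy version under the least-squares formulation; the function names `dis_loss` and `gen_loss` are ours):

```python
import numpy as np

def dis_loss(d_real, d_fake):
    # Least-squares discriminator loss: push Dis(x*, x~) toward 1
    # and Dis(Gen(z, x~), x~) toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def gen_loss(d_fake, enhanced, clean, lam=100.0):
    # Adversarial term plus the L1 term weighted by lambda = 100.
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)
    l1 = np.mean(np.abs(enhanced - clean))
    return adv + lam * l1
```

The L1 term pulls the enhanced waveform toward the clean reference sample-by-sample, while the adversarial term rewards outputs the discriminator scores as real.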

Sinc-Convolution
Different from a standard convolutional layer, which performs a set of time-domain convolutions between the input waveform and a bank of finite impulse response filters, Sinc convolution carries out the convolution with a predefined function g that depends on only a few learnable parameters φ:

y[n] = x[n] * g[n, φ],

where x[n] is a chunk of the speech waveform and y[n] is the filtered output. Inspired by standard filtering in digital signal processing, Ravanelli et al. [26] define g as a filter bank of rectangular band-pass filters. In the frequency domain, the magnitude of a generic band-pass filter can be written as the difference of two low-pass filters:

G[f, f_1, f_2] = rect(f / (2 f_2)) − rect(f / (2 f_1)),

where f_1 and f_2 are the learnable low and high cutoff frequencies, and rect(·) is the rectangular function in the magnitude frequency domain. Transformed to the time domain, the reference function g becomes

g[n, f_1, f_2] = 2 f_2 sinc(2π f_2 n) − 2 f_1 sinc(2π f_1 n).

The Sinc function here is the unnormalized one, i.e., sinc(x) = sin(x)/x. Ravanelli et al. [26] initialize the filters with the cutoff frequencies of the mel-scale filter bank, which has the advantage of directly allocating more filters in the lower part of the spectrum. Fainberg et al. [34] experimented with different initialization schemes, but observed no benefit for the downstream task. Note that two constraints need to be satisfied: f_1 ≥ 0 and f_2 ≥ f_1. To this end, the cutoff frequencies actually used in the time-domain expression are

f̃_1 = |f_1|,    f̃_2 = f_1 + |f_2 − f_1|.

To smooth out the abrupt discontinuities at the ends of g, a Hamming window w[n] [35] of length L is applied:

g_w[n, f_1, f_2] = g[n, f_1, f_2] · w[n],    w[n] = 0.54 − 0.46 cos(2πn / L).

It has also been reported that no significant performance difference appears when adopting other window functions [26]. As the filters g are symmetric, they can be computed efficiently by evaluating one side of the filter and mirroring the result to the other half. Moreover, the symmetry does not introduce any phase distortion, preserving the essence of processing raw inputs for speech enhancement.
Another remarkable property of Sinc convolution is its small parameter scale. If a CNN layer is composed of F filters of length M, a standard CNN employs F × M parameters. Unlike a standard convolutional layer, only two parameters are employed for each Sinc filter, regardless of its length. For instance, if F = 64 and M = 251, the standard CNN layer employs 64 × 251 = 16,064 (approximately 16 K) parameters, whereas a Sinc-convolution layer employs only 2 × 64 = 128. This property offers the possibility of obtaining selective filters with many taps at a negligible parameter increment.
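The filter construction above can be sketched in a few lines of NumPy (the helper `sinc_filters` and its defaults are ours, not taken from the original SincNet code; note that `np.sinc` is the normalized sinc, so `2*fc*np.sinc(2*fc*n)` yields the low-pass kernel sin(2π fc n)/(π n)):

```python
import numpy as np

def sinc_filters(f1, f2, length=251, sr=16000):
    """Build F rectangular band-pass filters from cutoff arrays f1, f2 (Hz)."""
    # Enforce the constraints f1 >= 0 and f2 >= f1.
    f1 = np.abs(f1)
    f2 = f1 + np.abs(f2 - f1)
    # Symmetric time axis in seconds, centered at zero.
    n = np.arange(-(length // 2), length // 2 + 1) / sr

    def lowpass(fc):
        # Ideal low-pass kernel with cutoff fc, shape (F, length).
        return 2 * fc[:, None] * np.sinc(2 * fc[:, None] * n[None, :])

    g = lowpass(f2) - lowpass(f1)      # band-pass = difference of low-passes
    w = np.hamming(length)             # smooth the truncation discontinuities
    return g * w[None, :]

filters = sinc_filters(np.array([100.0, 300.0]), np.array([200.0, 600.0]))
```

Each filter here is fully determined by its two cutoffs, so a 64-filter layer carries 128 learnable parameters however long the kernels are.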

Sinc-SEGAN Architecture
We investigate two different deployments of the Sinc convolution illustrated in Section 4: (i) sitting before the first layer of the generator's encoder and the discriminator, and behind the last layer of the generator's decoder, referred to as the addition architecture, and (ii) acting as a substitute for the first standard convolutional layer of the generator's encoder and the discriminator, and the last standard convolutional layer of the generator's decoder, referred to as the substitution architecture.
In the case of the addition architecture, the generator makes use of an encoder-decoder architecture. The first layer of the encoder is a Sinc convolution, using 64 filters of length 251 and stride = 1, followed by a max pooling layer (stride = 2). Thereafter, there are five one-dimensional strided convolutional layers with a common filter width of 31 and stride = 4, each followed by parametric rectified linear units (PReLUs) [36]. At the sixth layer of the encoder, the encoding vector c ∈ R^{8×1024} is stacked with the noise sample z ∈ R^{8×1024}, sampled from the distribution N(0, I), and presented to the decoder.
The decoder component mirrors the encoder architecture, with the same number of filters and filter width, to reverse the encoding process through deconvolutions, namely fractional-strided transposed convolutions. The last layer of the decoder is also a Sinc convolution (filter number = 64, kernel size = 251, and stride = 1), preceded by an unpooling layer (stride = 2) for upsampling. Learnable skip connections are deployed to connect each encoding layer with its corresponding decoding layer, allowing information to be summed into the decoder feature maps. The learnable vector a_l multiplies every channel k of its corresponding shuttle layer l by a scalar factor γ_{l,k}. Hence, for the input h_j of the jth decoder layer, the response of the corresponding lth encoder layer is added as

h̃_j = h_j + a_l ⊙ h_l,

where ⊙ is an element-wise product along channels. The discriminator has an architecture similar to the encoder of the generator. However, it receives a two-channel input, i.e., (x*, x̃) or (x̂, x̃), and utilizes virtual batch-norm [37] before LeakyReLU [38] activation with α = 0.3. Moreover, the discriminator is topped with a 1 × 1 convolutional layer that reduces the dimension of the output of the last convolutional layer for the subsequent classification by the softmax layer. To sum up, in the addition architecture, the generator consists of 12 layers and the discriminator consists of six layers. We illustrate the addition architecture of Sinc-SEGAN in Figure 1.
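The learnable skip connection can be sketched as follows (a minimal NumPy version; the shape convention (batch, channels, time) and the function name are our assumptions):

```python
import numpy as np

def learnable_skip(h_dec, h_enc, a):
    """Sum the encoder (shuttle) response into the decoder feature map,
    scaling each encoder channel by its learnable factor in the vector a.

    h_dec, h_enc: arrays of shape (batch, channels, time); a: shape (channels,).
    """
    # Reshape a so it broadcasts over the batch and time axes.
    return h_dec + a.reshape(1, -1, 1) * h_enc
```

With `a` initialized to ones this reduces to a plain additive skip connection; training then learns how much of each encoder channel to pass through.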
The substitution architecture, as illustrated in Figure 2, is similar to the first case, but the original first layer of the encoder is replaced by a Sinc convolution with 64 filters of length 251. Its stride = 4, in line with the standard convolutional layer it replaces, and the max pooling layer is removed. Thereafter, there are four one-dimensional strided convolutional layers with a common filter width of 31 and stride = 4, each followed by PReLUs [36]. At the final layer of the encoder, the encoding vector c ∈ R^{16×1024} is stacked with the noise sample z ∈ R^{16×1024}, sampled from the distribution N(0, I), and presented to the decoder.
The decoder component again mirrors the encoder to reverse the encoding process through deconvolutions. The last deconvolutional layer of the decoder is likewise replaced with a Sinc convolution (filter number = 64, kernel size = 251, and stride = 4), and the upsampling layer is no longer needed.
The first layer of the discriminator is also substituted with the Sinc convolution layer with 64 filters of length 251 and stride = 1. In summary, in the substitution architecture, the generator consists of 10 layers and the discriminator consists of five layers.

Database
To assess the performance of the proposed Sinc-SEGANs, we report objective measures on the Valentini benchmark [39]. This publicly available dataset originates from the Voice Bank corpus [40], which contains 30 speakers; 28 speakers are included in the training set, while the remaining 2 form the test set. For the noisy training set, 40 different noisy conditions are considered by combining 10 sorts of intrusions (2 artificially generated and 8 derived from the Demand database [41]) with 4 signal-to-noise ratios (SNRs) each (15, 10, 5, and 0 dB). For the test set, 20 different noisy conditions are considered by combining 5 sorts of intrusions (all originating from the Demand database) with 4 SNRs each (17.5, 12.5, 7.5, and 2.5 dB). There are 10 utterances in each adverse condition per training speaker, and 20 utterances in each condition per test speaker. Notably, the test set is entirely unseen during training; that is, no overlap exists in either speakers or adverse conditions. All utterances are downsampled to 16 kHz. Table 1 summarizes the data structure of the employed corpus [39].

Evaluation Metrics
We evaluate the performance of the proposed system on six classic objective evaluation criteria for speech enhancement (the higher, the better): PESQ, CSIG, CBAK, COVL, SSNR, and STOI. All metrics compare the enhanced signal with its corresponding clean reference over the test set defined in Section 6.1. (All criteria are calculated with the implementations provided in [1], published at https://www.crcpress.com/downloads/K14513/K14513_CD_Files.zip (accessed on 15 February 2021).)

Implementation Details
The networks are trained for 100 epochs with a batch size of 100. Different from previous works [19,30,31], we utilize the Adam optimizer [46] (β1 = 0 and β2 = 0.9) with the two-timescale update rule (TTUR) [47]. Following a recent successful approach to training GANs quickly and stably [48], we set the learning rate of the discriminator to 0.0004 and that of the generator to 0.0001; i.e., the discriminator has a four-times-faster learning rate to virtually emulate multiple discriminator iterations per generator update. We extract raw speech chunks of length L = 16,384 (approximately 1 s) with 50% overlap as the input, to avoid any speech distortion caused by handcrafted features. A high-frequency pre-emphasis filter with coefficient 0.95 is applied to each waveform chunk before presenting it to the networks, as this has been shown to help correct some high-frequency artifacts in the de-noising setup. During testing, raw speech chunks are extracted from the test utterances without overlap, and the outputs are correspondingly de-emphasized and concatenated to form the enhanced waveforms. We also conduct ablation tests on the influence of the input length, the number of filters, and the kernel size. All experiments are implemented in PyTorch [49].
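The chunking and pre-emphasis pipeline described above can be sketched as follows (the helper names and the first-sample convention of the pre-emphasis filter are our assumptions; the coefficient 0.95, chunk length 16,384, and 50% overlap are taken from the text):

```python
import numpy as np

def preemphasis(x, coeff=0.95):
    # High-frequency pre-emphasis: y[n] = x[n] - coeff * x[n-1], y[0] = x[0].
    return np.concatenate(([x[0]], x[1:] - coeff * x[:-1]))

def deemphasis(y, coeff=0.95):
    # Exact inverse filter, applied to network outputs at test time.
    x = np.empty_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + coeff * x[n - 1]
    return x

def chunk(x, length=16384, overlap=0.5):
    # Extract fixed-length raw-waveform chunks with the given overlap ratio.
    hop = int(length * (1.0 - overlap))
    return [x[i:i + length] for i in range(0, len(x) - length + 1, hop)]
```

Because `deemphasis` is the exact inverse of `preemphasis`, round-tripping a waveform through both recovers it to numerical precision.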

Data Augmentation
We employ three data augmentation methods on the dataset. First, we apply a random time shift between 0 and 4 s. Second, we shuffle the noises within one batch to form new noisy mixtures, termed ReMix. Third, we deploy a band-stop filter with a stop-band between frequencies f_1 and f_2 (termed Band Mask), sampled to remove 20% of the frequencies uniformly in the mel scale; this is a time-domain counterpart of the SpecAugment [50] scheme used in automatic speech recognition.
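Of the three augmentations, Band Mask is the most involved. A minimal sketch follows (the helper names are ours, and for simplicity the stop-band is applied by zeroing FFT bins rather than with an explicit filter, which the paper does not specify):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def band_mask(wave, sr=16000, width=0.2, rng=np.random):
    """Remove a band covering `width` of the mel axis, with the band's
    position sampled uniformly in mel scale (a sketch of Band Mask)."""
    m_max = hz_to_mel(sr / 2)
    m1 = rng.uniform(0.0, m_max * (1.0 - width))      # uniform in mel scale
    f1, f2 = mel_to_hz(m1), mel_to_hz(m1 + width * m_max)
    spec = np.fft.rfft(wave)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    spec[(freqs >= f1) & (freqs <= f2)] = 0.0         # ideal stop-band
    return np.fft.irfft(spec, n=len(wave))
```

Sampling the band uniformly in mel rather than in Hz biases the mask toward the lower, perceptually denser part of the spectrum, mirroring the mel-scale masking of SpecAugment.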

Baseline
For comparison, we take the seminal work [19] and the other SEGAN variants [30,31] introduced in Section 2 as baseline systems. For [30], we choose the results of ISEGAN with two shared generators and DSEGAN with two independent generators as baseline results (the case of N_G = 2). This is done for two reasons. On the one hand, the number of generators multiplies the parameter count. On the other hand, Phan et al. [30] report only a marginal impact of ISEGAN's number of iterations, and no significant performance improvement for DSEGAN depths larger than N_G = 2. In [31], detailed results are presented regarding the placement of the self-attention layer in the generator and the discriminator. To ensure a fair comparison, we choose the average result of coupling the self-attention layer with a single (de)convolutional layer (referred to as SASEGAN-avg) and the result of coupling self-attention layers with all (de)convolutional layers (referred to as SASEGAN-all). It is worth noting that, as stated in [31], SASEGAN-all slightly improves upon SASEGAN-avg, but these gains come at the cost of increased computation time and memory requirements.

Ablation Tests on the Configuration of Sinc Convolution
We empirically study the impacts of the placement of the Sinc convolution layer, the input length, the number of Sinc filters, and the kernel size of the Sinc convolution. Table 2 lists the configurations of five ablation tests, and Table 3 shows their results across the criteria introduced in Section 6.2. Comparing these experiments, we can draw the following conclusions:
• Increasing the number of Sinc filters degrades system performance, since more filters introduce more complexity and training accordingly becomes more difficult.
• Decreasing the kernel size of the Sinc convolution deteriorates performance, since a smaller kernel limits the ability to extract representative speech clues.
• Systems benefit from a longer input length, as more context information is included.
• The addition architecture outperforms the substitution architecture, as the former is deeper.
These experimental results explain why we choose an input length of 1 s, 64 Sinc filters, and a kernel size of 251.

Table 4 presents the performance and parameter comparisons between the proposed Sinc-SEGANs (+augment) and the previous SEGAN variants [19,30,31] on the Valentini benchmark [39]. The results indicate that the substitution architecture outperforms the baseline systems on PESQ, CBAK, and COVL. Considering the designs of these criteria, this suggests that speech signals enhanced by Sinc-SEGAN-sub (substitution is shortened to sub) have higher general perceptual quality and are reasonably comprehensible for users. Comparable results are achieved on CSIG, SSNR, and STOI. Note that although Sinc-SEGAN-sub underperforms DSEGAN [30] on CSIG and SSNR, and SASEGAN-all [31] on STOI, it outperforms SEGAN [19], ISEGAN [30], and SASEGAN-avg [31] across all criteria. Additionally, the number of Sinc-SEGAN-sub parameters is merely 31.0% of that of SEGAN, ISEGAN, or SASEGAN-avg, 29.4% of SASEGAN-all, and 17.7% of DSEGAN. In turn, Sinc-SEGAN-add (addition is shortened to add) outperforms all baseline systems, with a parameter scale that is 71% of SEGAN, ISEGAN, or SASEGAN-avg, 67% of SASEGAN-all, and 41% of DSEGAN. Moreover, the data augmentation methods deliver further improvements, leading to the best performance across all evaluation metrics without additional parameters. These results validate the efficacy of Sinc convolution.

Table 4. Results on objective metrics of the proposed systems (Sinc-SEGANs) against previous SEGAN variants using the Valentini benchmark [39]. The number of parameters (Params) is given in millions (M).

Ablation Tests on Augmentation Methods
In order to better understand the influence of the different data augmentation methods on the overall performance, we execute ablation tests, reporting system performance on all evaluation criteria for each method in Table 5. The results suggest that each of these data augmentation methods contributes to the overall performance. The time shift augmentation produces the most significant improvement, followed by the Band Mask algorithm. Surprisingly, the ReMix augmentation shows only a limited contribution to the overall performance.

Interpretation of Sinc Convolution
Inspecting learned filters is a valuable practice that provides insights into what the network is actually learning. To this end, we visualize the learned low and high cutoff frequencies of the Sinc-convolution filters in Figure 3. The green area marks the low and high cutoff frequencies of each filter; the corresponding mel filters (purple area) are given for comparison. We plot four examples of the learned Sinc filters of the proposed system in Figure 4. As shown in Figure 4, the learned Sinc filters are rectangular band-pass filters in the frequency domain, echoing their definition in Section 4. Furthermore, Figure 5 exhibits example spectrograms of speech signals enhanced by SEGAN [19], DSEGAN [30], SASEGAN-all [31], and the proposed Sinc-SEGAN-add. As observed in Figure 3, the Sinc convolution learns a filter bank containing more filters with high cutoff frequencies. In addition, a tendency towards higher amplitudes is noticeable, indicating an inclination of the Sinc convolution to directly process raw speech waveforms. As we can see from Figures 4 and 5, Sinc convolution is specifically designed to implement rectangular band-pass filters. Considering the speech waveform distribution in the time domain, this design makes Sinc convolution suitable for extracting narrow-band speech clues, e.g., the pitch region, the first formant, and the second formant, in accordance with the results of the seminal work [26].

Figure 5. Spectrograms of speech enhanced by (c) SEGAN [19], (d) DSEGAN [30], (e) SASEGAN-all [31], and (f) the proposed Sinc-SEGAN-add. The (a) clean and (b) noisy spectrograms are also exhibited for reference.

Conclusions
This paper proposes Sinc-SEGAN, a system that merges the Sinc convolution layer with an optimized SEGAN to capture more underlying representative speech characteristics. Sinc-SEGAN processes raw speech waveforms directly to prevent the distortion caused by imperfect phase estimation. We investigate two different deployments of the Sinc convolution: (i) sitting before the first layer of the encoder and the discriminator, and behind the last layer of the decoder, referred to as Sinc-SEGAN-add, and (ii) acting as a substitute for the first standard convolutional layers of the encoder and the discriminator, and the last standard convolutional layer of the decoder, referred to as Sinc-SEGAN-sub. Ablation tests are conducted on the influence of the input length, the number of Sinc filters, and the kernel size of the Sinc convolution. To train the proposed system more efficiently, we also employ three data augmentation methods in the time domain. Experimental results show that Sinc-SEGAN-sub yields enhanced signals with higher-level perceptual quality and speech intelligibility, even with drastically reduced system parameters. The proposed Sinc-SEGAN-add, in turn, overtakes all baseline systems across all classic objective evaluation criteria, with up to ~50% fewer parameters than the baseline systems. Moreover, the data augmentation methods further boost system performance. Analysis of the Sinc filters reveals that the learned filter bank is tuned precisely to select narrow-band speech clues, and is hence suitable for speech enhancement tasks in the time domain. Our future effort will be devoted to applying the Sinc convolution to other classic speech enhancement models, to further mitigate the lack of its application in the field of speech enhancement.

Data Availability Statement: The dataset [39] utilized by this paper is publicly available at the publisher webpage: https://datashare.ed.ac.uk/handle/10283/1942 (accessed on 5 February 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: