Exploring Multi-Stage GAN with Self-Attention for Speech Enhancement

Abstract: Multi-stage or multi-generator generative adversarial networks (GANs) have recently been demonstrated to be effective for speech enhancement. Existing multi-generator GANs for speech enhancement use only convolutional layers to synthesise clean speech signals. This reliance on the convolution operation may mask the temporal dependencies within the signal sequence. This study explores self-attention to address the temporal dependency issue in multi-generator speech enhancement GANs and thereby improve their enhancement performance. We empirically study the effect of integrating a self-attention mechanism into the convolutional layers of the multiple generators in multi-stage or multi-generator speech enhancement GANs, specifically the ISEGAN and the DSEGAN networks. The experimental results show that introducing a self-attention mechanism into ISEGAN and DSEGAN improves their speech enhancement quality and intelligibility across the objective evaluation metrics. Furthermore, we observe that adding self-attention to the ISEGAN's generators not only improves its enhancement performance but also bridges the performance gap between the ISEGAN and the DSEGAN with a smaller model footprint. Overall, our findings highlight the potential of self-attention in improving the enhancement performance of multi-generator speech enhancement GANs.


Introduction
Speech and audio signal processing applications, such as speech recognition [1] and hearing aids [2], require clean speech signals [3]. However, real-world speech signals are inevitably affected by background noise, which can degrade speech quality and intelligibility. Speech enhancement algorithms aim to approximate clean speech signals from distorted speech signals by removing the background noise contained in the noisy signal [4]. The remarkable success of deep learning has inspired the speech and audio signal processing research community to shift from traditional speech enhancement algorithms [5,6] to deep neural network (DNN)-based algorithms [7]. These include convolutional neural networks (CNNs) [8,9] and long short-term memory (LSTM) recurrent neural networks (RNNs) [10,11,12], which are discriminative methods, and generative methods, such as variational auto-encoders (VAEs) [13,14,15] and generative adversarial networks (GANs) [1,16,17]. The GAN-based methods have been demonstrated to be more promising for speech enhancement tasks [18], and they are more robust to different types of noise [19,20,21] than their discriminative counterparts [22,23]. In the wake of the seminal work on the speech enhancement GAN (SEGAN) by Pascual et al. [16], there have been several improvements to GAN-based speech enhancement methods [18]. For example, self-attention SEGAN (SASEGAN) [23] was introduced to learn temporal dependencies across the signal sequence, and MetricGAN [24] directly optimized the generator with respect to evaluation metrics such as PESQ and STOI to improve on the performance of SEGAN. Furthermore, several loss functions, including the relativistic loss function [17], the metric loss function [25], and the Wasserstein loss function [25], have also been proposed to stabilize the SEGAN training process. These methods are based on the conventional GAN [26].
Multi-stage GANs which involve the use of multiple generators [27] or multiple discriminators [28] to generate samples in multiple stages or levels have become increasingly popular in the field of speech enhancement. Multi-stage GANs have been adopted to refine noisy input signals in speech enhancement tasks [22,29,30]. In [22], multiple generators are employed to refine noisy input signals, whereas [29,30] utilized multiple discriminators to address acoustic degradation, such as noise, reverb, and equalization distortion, aiming to enhance speech clarity and intelligibility. Moreover, MelGAN [31] has shown the effectiveness of multi-stage GANs in high-quality mel-spectrogram inversion.
The multi-stage GANs have demonstrated successful performance in speech enhancement. However, the existing multi-stage GANs for speech enhancement rely on convolutional backbones [27,28], which may not be optimal for capturing the temporal dependencies of an input signal sequence [32,33]. This work considers the temporal dependency problem of the convolutional backbone of multi-stage GANs. The self-attention mechanism [34] has successfully been used for temporal dependency modelling in acoustic [35] and speech recognition [32,36] tasks. Compared to RNN and LSTM, self-attention is computationally efficient and more flexible in modelling temporal dependencies of long input signal sequences [25,35]. Motivated by the work in SASEGAN [23], which adopted the concept of non-local attention [37,38] to optimise the performance of SEGAN [16], we introduce the self-attention mechanism in multi-stage GANs for speech enhancement tasks. In this work, we consider multi-stage GANs of multiple generators for speech enhancement (i.e., multi-generator speech enhancement GANs). Here, we aim to optimise the enhancement performance of iterated SEGAN (ISEGAN) and deep SEGAN (DSEGAN), the multi-stage GAN speech enhancement algorithms introduced by Phan et al. [22] (see Figure 1). We leverage the sequential modelling capability of self-attention to give the multiple generators of ISEGAN (two shared generators) and DSEGAN (two independent generators) the ability to capture temporal dependencies across an input signal sequence.

Figure 1. A representation of SEGAN with a single generator G (conventional GAN) and multi-stage GANs of multiple generators: ISEGAN with two shared generators G and DSEGAN with two independent generators G1 and G2 [22] (Figure adapted from [22]).

The main contributions of this paper are summarized as follows:

• To enhance the ISEGAN and DSEGAN networks, we incorporate the self-attention mechanism inspired by the implementation in [23,37]. We refer to these enhanced versions as ISEGAN-Self-Attention and DSEGAN-Self-Attention, respectively.
• Across the commonly used objective evaluation metrics, the proposed ISEGAN-Self-Attention and DSEGAN-Self-Attention demonstrate better speech enhancement performance than the ISEGAN and the DSEGAN baselines, respectively.
• We also demonstrate that, with the self-attention mechanism, the ISEGAN can achieve enhancement performance competitive with the DSEGAN using only half of the model footprint of the DSEGAN.
• Furthermore, we investigate the effect of the self-attention mechanism applied at different stages of the multiple generators in the ISEGAN and the DSEGAN networks with respect to their enhancement performance.
The rest of the paper is organized as follows: Section 2 presents the background of the study. The proposed ISEGAN-Self-Attention and DSEGAN-Self-Attention are presented in Section 3. Section 4 describes the experimental setup of the study. The results are presented and discussed in Section 5. Finally, Section 6 concludes the paper with some future directions.

Conventional SEGAN
In a speech enhancement task, a given noisy speech (raw audio) signal can be represented as $\tilde{x} = x + n \in \mathbb{R}^{T}$, where $x \in \mathbb{R}^{T}$ is the clean signal and $n \in \mathbb{R}^{T}$ is the noise signal that corrupts the speech signal. This noise is often considered additive background noise. The main goal of speech enhancement is to remove the noise from the raw audio signal by finding an enhancement mapping function $f$ such that $f(\tilde{x}): \tilde{x} \rightarrow x$. One such mapping function that has been adopted with great success is the conditional GAN method [39,40].
The conditional GAN method was first employed in speech enhancement in the seminal work by Pascual et al. [16], which is referred to as the speech enhancement GAN (SEGAN). In SEGAN, the generator $G$ (for emphasis, a single generator) acts as the enhancement mapping function: it is provided with the corrupted raw audio signal $\tilde{x}$ and produces $\hat{x} = G(z, \tilde{x})$, where $z$ is a latent variable. The generator $G$ is trained to produce the enhanced output signal $\hat{x}$ simultaneously with the discriminator $D$, which learns to distinguish between the enhanced output signal $\hat{x}$ and the real clean signal $x$ by classifying the pair $(x, \tilde{x})$ as real and the pair $(\hat{x}, \tilde{x})$ as fake. The training procedure is illustrated in Figure 2, and Equations (1) and (2) are the least-squares objective functions for training the discriminator $D$ and the generator $G$, respectively.
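For reference, the least-squares objectives referred to as Equations (1) and (2) can be written in the form commonly used for SEGAN [16]. This is a hedged sketch rather than a verbatim copy of the original equations: the $L_1$ regularisation term and its weight $\lambda$ are assumptions taken from the usual SEGAN formulation and are not stated in the text above.

```latex
\begin{align}
\min_{D} \mathcal{L}(D) &= \tfrac{1}{2}\,\mathbb{E}_{x,\tilde{x}}\!\left[\big(D(x,\tilde{x}) - 1\big)^{2}\right]
  + \tfrac{1}{2}\,\mathbb{E}_{z,\tilde{x}}\!\left[D\big(G(z,\tilde{x}),\tilde{x}\big)^{2}\right], \tag{1}\\
\min_{G} \mathcal{L}(G) &= \tfrac{1}{2}\,\mathbb{E}_{z,\tilde{x}}\!\left[\big(D\big(G(z,\tilde{x}),\tilde{x}\big) - 1\big)^{2}\right]
  + \lambda\,\big\lVert G(z,\tilde{x}) - x \big\rVert_{1}. \tag{2}
\end{align}
```

The first objective pushes $D$ towards 1 on real pairs and 0 on fake pairs; the second pushes $G$ to make $D$ output 1 on its enhanced signals while staying close to the clean target.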
Several improvements have been introduced to the SEGAN algorithm focusing on improving the quality of the enhanced speech signals [17,23,24,41,42,43]. For instance, Deepak and Verhulst [17] presented an improvement in enhancement by introducing a cost function with a gradient penalty. The self-attention module [37] was incorporated into the SEGAN to prevent the convolutional layers from obscuring the temporal dependency in an input sequence of signals [23]. Additionally, MetricGAN [24] demonstrated that evaluation metrics can be used to improve the performance of conditional GAN-based methods in speech enhancement by using metrics that are directly related to speech signals (e.g., PESQ and STOI), rather than Euclidean distances, to compute the loss in the SEGAN.

SEGAN with Multiple Generators
Based on the principle of chaining multiple generators to build the generator of a GAN [27], which has shown improved performance in image reconstruction, Phan et al. [22] introduced multi-stage GANs of multiple generators (i.e., ISEGAN and DSEGAN) for speech enhancement. The ISEGAN and DSEGAN learn multiple enhancement mappings with a chained generator $G$ composed of $N$ generators such that $G = G_1 \rightarrow G_2 \rightarrow \cdots \rightarrow G_N$, where $N > 1$, to perform a speech enhancement task. The use of multiple generators aims to leverage the diversity of their outputs to enhance the overall quality of generated speech. In ISEGAN, the generators share their parameters, resulting in a common mapping function that is iteratively applied at all enhancement stages. Thus, ISEGAN generators can be considered as a single iterated generator $G$ with $N$ iterations. Sharing the generators' parameters makes the memory footprint of the ISEGAN smaller. Unlike the ISEGAN, the parameters of the DSEGAN generators are independent, which allows for flexible learning of different enhancement mappings at different stages of the network. In contrast to ISEGAN generators, the generators of DSEGAN can be considered as a deep generator $G$ with a depth of $N$. Figure 3 illustrates ISEGAN and DSEGAN with the number of generators $N = 2$. When the number of generators $N = 1$, both the ISEGAN and the DSEGAN can be viewed as the conventional SEGAN (see Figure 2).
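The difference between shared (ISEGAN) and independent (DSEGAN) generator parameters can be illustrated with a toy sketch. Everything here is hypothetical: each "generator" is a single residual linear map standing in for the real encoder-decoder network, the latent input is omitted, and the helper names are illustrative only.

```python
import numpy as np

def make_generator(rng, dim):
    """Toy stand-in for one enhancement generator: a single linear map."""
    return {"W": rng.normal(scale=0.1, size=(dim, dim))}

def apply_generator(params, x):
    # Residual linear map; a real generator is an encoder-decoder CNN.
    return x + x @ params["W"]

def chain(generators, x_noisy):
    """Apply G_1 -> G_2 -> ... -> G_N and return every stage output."""
    outputs, x = [], x_noisy
    for g in generators:
        x = apply_generator(g, x)
        outputs.append(x)
    return outputs

def count_unique_params(generators):
    # Shared generators are the same object, so each object is counted once.
    unique = {id(g): g for g in generators}
    return sum(g["W"].size for g in unique.values())

rng = np.random.default_rng(0)
dim, N = 16, 2
shared = make_generator(rng, dim)
isegan = [shared] * N                                   # ISEGAN: tied parameters
dsegan = [make_generator(rng, dim) for _ in range(N)]   # DSEGAN: independent

x_noisy = rng.normal(size=(1, dim))
isegan_out = chain(isegan, x_noisy)
dsegan_out = chain(dsegan, x_noisy)
print(count_unique_params(isegan), count_unique_params(dsegan))
```

With N = 2 the tied chain holds half as many distinct parameters as the independent chain, which is the memory-footprint argument made above.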
Like the conventional SEGAN, the chained generators $G$ in both ISEGAN and DSEGAN are conditional GAN generators. Given an enhancement stage $n$, the generator $G_n \in G$ is provided with the output $\hat{x}_{n-1}$ of generator $G_{n-1} \in G$ to produce an enhanced signal $\hat{x}_n$ such that $\hat{x}_n = G_n(z_n, \hat{x}_{n-1})$, where $1 \leq n \leq N$ and $z_n$ is the latent representation. Hence, the corrupted raw audio signal is considered as $\hat{x}_0 \equiv \tilde{x}$, and the final enhanced signal of the last generator $G_N \in G$ is $\hat{x} \equiv \hat{x}_N$. Here, the discriminator $D$ is trained to classify the pair of signals $(x, \tilde{x})$ as real and $(\hat{x}_1, \tilde{x}), (\hat{x}_2, \tilde{x}), \ldots, (\hat{x}_N, \tilde{x})$ as fake pairs of signals. Equations (3) and (4) are the least-squares objective functions for training the discriminator $D$ and the chained generators $G$, respectively.

Figure 3. Illustration of the training process of a multi-generator speech enhancement GAN with two generators. In (a), the discriminator $D$ is trained to classify $(x, \tilde{x})$ as real pairs, and in (b), the discriminator $D$ is trained to classify the pairs $(\hat{x}_1, \tilde{x})$ and $(\hat{x}_2, \tilde{x})$ generated by $G_1$ and $G_2$, respectively, as fake. Then, in (c), the generators $G_1$ and $G_2$ are trained to fool the discriminator $D$ into classifying the pairs $(\hat{x}_1, \tilde{x})$ and $(\hat{x}_2, \tilde{x})$ as real.
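The multi-generator least-squares objectives can be sketched numerically as follows. This is an illustration under stated assumptions, not the exact form of Equations (3) and (4): averaging the fake-pair terms over the N stages, adding a weighted L1 term per stage, and the `l1_weights` argument itself are assumptions modelled on the usual formulation of such objectives, and the L1 distance is taken as a mean absolute error for simplicity.

```python
import numpy as np

def discriminator_loss(d_real, d_fakes):
    """Least-squares D loss: real pair -> 1, each of the N fake pairs -> 0."""
    n = len(d_fakes)
    real_term = 0.5 * np.mean((d_real - 1.0) ** 2)
    fake_term = sum(np.mean(d ** 2) for d in d_fakes) / (2.0 * n)
    return real_term + fake_term

def generator_loss(d_fakes, enhanced, clean, l1_weights):
    """Least-squares G loss plus a weighted mean-absolute error per stage."""
    n = len(d_fakes)
    adv = sum(np.mean((d - 1.0) ** 2) for d in d_fakes) / (2.0 * n)
    l1 = sum(w * np.mean(np.abs(x_hat - clean))
             for w, x_hat in zip(l1_weights, enhanced))
    return adv + l1

# Hypothetical discriminator scores for a batch of 4 and N = 2 stages.
clean = np.zeros(8)
perfect_d = discriminator_loss(np.ones(4), [np.zeros(4), np.zeros(4)])
fooled_d = discriminator_loss(np.zeros(4), [np.ones(4), np.ones(4)])
perfect_g = generator_loss([np.ones(4), np.ones(4)], [clean, clean], clean, [50.0, 100.0])
print(perfect_d, fooled_d, perfect_g)
```

A discriminator that scores every pair correctly reaches a loss of zero, and a generator that fools the discriminator while reproducing the clean signal at every stage also reaches zero.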

Self-Attention Block
The concept of the attention mechanism was first introduced by Bahdanau et al. [44] to address the alignment and translation problem in sequence-to-sequence modelling. The idea behind attention is to enable the decoder in an encoder-decoder model to access all the encoded input by introducing attention weights, which focus on the positions containing relevant information for generating output tokens. Since its inception, the self-attention [34] variant has been adopted in speech and audio signal processing. Self-attention was first adopted in acoustic models [35], and later in GAN models [23] (i.e., SEGAN) for speech enhancement.
Here, we briefly describe the underlying framework of the self-attention block [23,37] (see Figure 4) that we adopt to investigate the impact of self-attention on the efficiency and quality of enhanced speech produced by SEGAN with multi-generators. Given an output $F \in \mathbb{R}^{L \times C}$ (i.e., a feature map) of a convolutional layer in SEGAN, where $L$ is the time dimension and $C$ is the number of channels, the query $Q$, key $K$, and value $V$ matrices are, respectively, obtained by the following transformations:

$$Q = F W_Q, \quad K = F W_K, \quad V = F W_V,$$

where $W_Q$, $W_K$, and $W_V \in \mathbb{R}^{C \times \frac{C}{k}}$ are weight matrices obtained through a $1 \times 1$ convolutional layer operation of $\frac{C}{k}$ filters, and $k$ is a scalar used to reduce the channel dimension $C$ of the feature space for memory reduction. To further improve memory efficiency, the time dimensions of $K$ and $V$ are reduced by a factor of $p$. Hence, the sizes of the matrices $Q$, $K$, and $V$ become $Q \in \mathbb{R}^{L \times \frac{C}{k}}$, $K \in \mathbb{R}^{\frac{L}{p} \times \frac{C}{k}}$, and $V \in \mathbb{R}^{\frac{L}{p} \times \frac{C}{k}}$. The attention map $A$ and attentive output $O$ are then computed using these matrices. The attention map $A$ is obtained by computing the softmax of the dot product of the query $Q$ and the key $K$ (see Equation (5)), and the attentive output $O$ is computed by performing a matrix multiplication of the attention map $A$ and the value matrix $V$ (see Equation (6)):

$$A = \mathrm{softmax}\left(Q K^{\top}\right), \quad A \in \mathbb{R}^{L \times \frac{L}{p}}, \tag{5}$$

$$O = (A V)\, W_O, \quad W_O \in \mathbb{R}^{\frac{C}{k} \times C}, \tag{6}$$

where $W_O$ is the weight matrix of a $1 \times 1$ convolutional layer of $C$ filters applied to $AV$ to restore the size of the attentive output $O$ to the original size $L \times C$. A shortcut connection is also used to facilitate information propagation.
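The self-attention block described above can be sketched in NumPy. Several details are assumptions made for illustration: the time reduction of K and V is implemented as max pooling, the shortcut is written as `beta * O + F` with a scalar `beta`, and the values k = 8 and p = 4 are example settings. A 1 × 1 convolution over an (L, C) feature map is equivalent to a matrix product along the channel axis, which is what the code uses.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def max_pool_time(x, p):
    """Reduce the time axis of an (L, D) array by a factor p (p must divide L)."""
    L, D = x.shape
    return x.reshape(L // p, p, D).max(axis=1)

def self_attention_block(F, W_q, W_k, W_v, W_o, p, beta=0.0):
    Q = F @ W_q                          # (L, C/k)
    K = max_pool_time(F @ W_k, p)        # (L/p, C/k)
    V = max_pool_time(F @ W_v, p)        # (L/p, C/k)
    A = softmax(Q @ K.T, axis=-1)        # (L, L/p); each row sums to 1
    O = (A @ V) @ W_o                    # (L, C)
    return beta * O + F, A               # shortcut connection back to F

rng = np.random.default_rng(0)
L, C, k, p = 64, 32, 8, 4                # example sizes, not the paper's
F = rng.normal(size=(L, C))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(C, C // k)) for _ in range(3))
W_o = rng.normal(scale=0.1, size=(C // k, C))
out, A = self_attention_block(F, W_q, W_k, W_v, W_o, p, beta=0.5)
print(out.shape, A.shape)
```

The attention map has L × L/p entries instead of L × L, which is where the memory saving from the factor p comes from.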
ISEGAN-Self-Attention and DSEGAN-Self-Attention Networks

Multi-Generator G
The $N$ generators with self-attention of the multi-generator $G = G_1 \rightarrow G_2 \rightarrow \cdots \rightarrow G_N$, $N > 1$, for both the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks have the same architectural structure, which follows the network architecture in Phan et al. [22]. Figure 5 illustrates an example of the ISEGAN-Self-Attention and DSEGAN-Self-Attention network architectures for a multi-generator $G = G_1 \rightarrow G_2$.

Figure 5. Illustration of the ISEGAN- and DSEGAN-Self-Attention architecture with an encoder of two generators, $G_1$, $G_2$, featuring the (de)convolutional layers integrated with self-attention blocks, and a discriminator $D$.
Each generator $G_n$, $1 \leq n \leq N$, in the multi-generator $G$ network consists of an encoder-decoder architecture with fully convolutional layers, following the implementation in [16,22,23]. The encoder of each generator $G_n$ is composed of 11 one-dimensional strided convolutional layers with a common filter width of 31 and a stride length of 2, each followed by parametric rectified linear units (PReLUs) [45]. The number of filters increases progressively from 16 to 1024 through the set {16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024}, resulting in feature maps of sizes 8192 × 16, 4096 × 32, 2048 × 32, 1024 × 64, and so on until 8 × 1024. The last feature map $c \in \mathbb{R}^{8 \times 1024}$ of the encoder is then concatenated with a noise sample $z \in \mathbb{R}^{8 \times 1024}$, drawn from the normal distribution $\mathcal{N}(\mu, \sigma^2)$, and the result is passed to the decoder of generator $G_n$. The decoder of generator $G_n$ uses deconvolutions to reverse the encoding process by mirroring the encoder architecture, as indicated in Figure 5. To facilitate the reconstruction of the waveform from the encoded features, skip connections connect each encoding layer to its corresponding decoding layer. This allows information from the waveform to bypass the bottleneck of the encoder and flow directly into the decoding stage [16], which helps to preserve the details of the waveform and maintain the quality of the reconstructed output.
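The feature-map sizes produced by the encoder follow directly from halving the time dimension at every layer. The short sketch below reproduces them under two assumptions that are not stated explicitly above: an input segment of 16,384 samples (implied by the 8192 × 16 size of the first feature map) and the 11-entry filter list of the standard SEGAN encoder, including two 256-filter layers.

```python
def encoder_feature_maps(input_len, filters, stride=2):
    """Return the (length, channels) shape after each strided conv layer."""
    shapes, length = [], input_len
    for channels in filters:
        length //= stride          # 'same' padding assumed: only the stride shrinks L
        shapes.append((length, channels))
    return shapes

# Assumed values: 16384-sample input and the 11-layer SEGAN filter configuration.
FILTERS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]
shapes = encoder_feature_maps(16384, FILTERS)
for i, (length, channels) in enumerate(shapes, start=1):
    print(f"layer {i:2d}: {length} x {channels}")
```

Eleven halvings reduce 16,384 samples to 8 time steps, matching the 8 × 1024 bottleneck that is concatenated with the latent noise.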
The self-attention block presented in Section 3.1 is integrated into each generator $G_n$ of both the ISEGAN-Self-Attention and the DSEGAN-Self-Attention network architectures. The self-attention block is coupled with the convolutional and deconvolutional layers in the generators $G_1, G_2, \ldots, G_N$ of ISEGAN and DSEGAN to construct the ISEGAN-Self-Attention and DSEGAN-Self-Attention networks, respectively. SASEGAN [23] indicated that self-attention improves performance at the cost of a little additional memory irrespective of the number of self-attention blocks placed in the network. Based on this, we added the self-attention block in each generator $G_n$ at the (de)convolutional layer positions $l = 4, 6, 10$, which were the most effective positions in the SASEGAN. As in the ISEGAN and the DSEGAN setups, the chained generators in the ISEGAN-Self-Attention share the same parameters, whereas in the DSEGAN-Self-Attention the generators' parameters are independent. Thus, the ISEGAN-Self-Attention has a smaller memory footprint, while the independently learned parameters of the DSEGAN-Self-Attention allow each generator to contribute to a specific aspect of the speech enhancement process.

Discriminator D
Following Phan et al. [22], we use the same discriminator architecture for the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks. Recall that both the ISEGAN-Self-Attention and the DSEGAN-Self-Attention network architectures have multiple generators $G_1, G_2, \ldots, G_N$ but a single discriminator $D$ (see Figure 5). The architecture of the discriminator $D$ is akin to the encoder part of the generators, except that $D$ accepts a two-channel input. As adopted in previous works [22,23], we also incorporate virtual batch-normalization [46] prior to the leaky ReLU activation [47] with the hyperparameter α = 0.3. The self-attention blocks are placed in the same positions as in the encoder of the generators. We add a 1D convolutional layer with a filter size of one to the discriminator $D$ to reduce the size of the output of the last convolutional layer from 8 × 1024 to 8 before classification is performed with a softmax layer.

Baseline and Objective Evaluation
To have a comparative baseline reference, the ISEGAN and the DSEGAN networks [22] are used as the main baseline to compare the performance of the proposed ISEGAN-Self-Attention and DSEGAN-Self-Attention networks. Additionally, the traditional speech enhancement network, SEGAN [16], and the state-of-the-art method, the SASEGAN [23], are selected as baseline networks as well. With the implementations of the self-attention blocks in the proposed ISEGAN-Self-Attention and DSEGAN-Self-Attention networks, we consider the SASEGAN as one of the best gauges to observe how well the self-attention performs in multi-generator settings as compared to a single generator. Also, as adopted in SEGAN, we compare the performance to the noisy signals and to signals filtered via the Wiener method based on a priori signal-to-noise estimation [48]. The objective evaluation metrics used in the selected baselines are adopted to evaluate the quality of the generated enhanced speech signals. These metrics include the five objective signal-quality metrics (PESQ, CSIG, CBAK, COVL, and SSNR) suggested by Loizou [4], as well as the speech intelligibility metric (STOI) introduced by Taal et al. [49]. The five objective signal-quality metrics are computed using the implementation in SEGAN [16]. This set of metrics focuses on assessing various aspects of speech signal quality. Specifically, PESQ and STOI measure perceived speech quality and speech intelligibility, respectively. SSNR and CBAK evaluate the distortion caused by background noise, while COVL provides an overall quality assessment of the speech signal. Additionally, SSNR compares the enhanced speech signal with the original clean speech on a segment-by-segment basis, offering insights into the preservation of speech in various segments [4,49].
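To make the segment-by-segment nature of SSNR concrete, the following is a minimal sketch of a segmental SNR computation. It is not the implementation used in SEGAN [16]: the frame length, hop size, the absence of a window function, and the clipping range of [-10, 35] dB are assumptions chosen for illustration.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, hop=256,
                  min_db=-10.0, max_db=35.0, eps=1e-10):
    """Mean of per-frame SNRs in dB, each clipped to [min_db, max_db]."""
    n = min(len(clean), len(enhanced))
    clean, enhanced = clean[:n], enhanced[:n]
    snrs = []
    for start in range(0, n - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal = np.sum(c ** 2)
        noise = np.sum((c - e) ** 2)            # residual distortion in this frame
        snr = 10.0 * np.log10((signal + eps) / (noise + eps))
        snrs.append(np.clip(snr, min_db, max_db))
    return float(np.mean(snrs))

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)                   # synthetic stand-in for speech
noisy = clean + 0.1 * rng.normal(size=16000)
print(segmental_snr(clean, noisy))               # roughly 20 dB for this noise level
print(segmental_snr(clean, clean))               # identical signals clip at max_db
```

Clipping each frame keeps silent or perfectly reconstructed frames from dominating the average, which is why SSNR reflects how well speech is preserved across segments rather than over the utterance as a whole.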

Dataset
The dataset used in the experiments is the noisy speech dataset for the speech enhancement task, which was put together by Valentini-Botinhao et al. [50] to facilitate the comparison of speech enhancement approaches. It comprises 30 speakers, recorded at 48 kHz, from the Voice Bank corpus [51]. We downsampled all the utterances from 48 kHz to 16 kHz. This dataset is used because it offers a variety of learning features for the optimal enhancement of speech signals combined with different noise conditions, and it was adopted to evaluate the baseline methods mentioned in Section 4.1. Following the baseline methods [16,22,23], the noisy training set comprised 40 different conditions, which were made by combining 10 types of noise (2 artificial and 8 obtained from the Demand database [52]) with signal-to-noise ratios (SNRs) of 15, 10, 5, and 0 dB. Similarly, for the test set, 20 different conditions were made by combining 5 types of noise obtained from the Demand database with 4 SNRs each: 17.5, 12.5, 7.5, and 2.5 dB.
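The condition counts and the idea of mixing noise at a target SNR can be sketched as follows. The mixing function is a generic power-based illustration and an assumption: the exact procedure used to build the dataset in [50] may differ (for example, in how speech level is measured), and the signals here are synthetic placeholders.

```python
import numpy as np
from itertools import product

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Noise types are indexed by integers; only their counts come from the text.
train_conditions = list(product(range(10), [15, 10, 5, 0]))        # 10 noises x 4 SNRs
test_conditions = list(product(range(5), [17.5, 12.5, 7.5, 2.5]))  # 5 noises x 4 SNRs
print(len(train_conditions), len(test_conditions))

rng = np.random.default_rng(0)
clean, noise = rng.normal(size=16000), rng.normal(size=16000)
noisy = mix_at_snr(clean, noise, snr_db=5.0)
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(achieved, 2))
```

The test SNRs are deliberately offset from the training SNRs, so the evaluation measures generalisation to unseen noise levels as well as unseen noise types.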

Experimental Settings
This work aims at improving the performance of the existing multi-generator SEGAN networks (i.e., ISEGAN and DSEGAN) with a self-attention mechanism. We conduct several experiments on the ISEGAN and DSEGAN networks with and without the self-attention block illustrated in Figure 4 using the training setup presented in Figure 3. We set three experimental objectives:
• To quantitatively show the effects of a self-attention mechanism in the generators of a multi-generator SEGAN.
• To qualitatively show the effects of a self-attention mechanism in the generation of a multi-generator SEGAN.
• To investigate the overall performance of the model in terms of training the parameters and the generation of the enhanced speech.
We set up the experiments with the number of generators $N \in \{2, 3, 4\}$ and with only one discriminator in the ISEGAN network with and without self-attention, and in the DSEGAN network with and without self-attention. The TensorFlow deep learning framework [53] was used to implement the networks, and all the experiments were performed on a machine with an 8 GB GeForce RTX 2080 GPU. Following previous works [22,23], each network was trained for 100 epochs using the RMSprop optimization technique [54] with a mini-batch size of 64 and a learning rate of $2 \times 10^{-4}$. After every epoch, we sampled raw speech segments at 16 kHz to monitor performance during training of the different network setups for $N \in \{2, 3, 4\}$ generators and to investigate the effects of self-attention in the multi-generator SEGANs for speech denoising and for generating synthetic speech signals. To further understand the impact of the self-attention mechanism in a specific generator of the multi-generator SEGAN setup, we performed an ablation study by placing the self-attention block in one generator at a time and at different convolutional layers of the generator.
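The experimental grid described above can be written down as a small configuration sketch. The class and field names are hypothetical and only serve to enumerate the setups; the hyperparameter values are those stated in the text.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ExperimentConfig:
    network: str                 # "ISEGAN" or "DSEGAN"
    self_attention: bool         # with or without the self-attention block
    num_generators: int          # N in {2, 3, 4}
    epochs: int = 100
    batch_size: int = 64
    learning_rate: float = 2e-4
    optimizer: str = "RMSprop"
    sample_rate_hz: int = 16000

def build_grid():
    """Enumerate every network / self-attention / N combination."""
    return [ExperimentConfig(net, sa, n)
            for net, sa, n in product(("ISEGAN", "DSEGAN"), (False, True), (2, 3, 4))]

grid = build_grid()
print(len(grid))
for cfg in grid[:3]:
    print(cfg)
```

Two networks, two attention settings, and three values of N give twelve training runs, each with a single discriminator.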

Results and Discussion
This section presents the experimental results and ablation study on the effectiveness of self-attention in the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks. The results are reported quantitatively (see Tables 1 and 2) and through spectrograms (see Figure 6) to show the quality of the enhanced speech signals produced by the networks. The quantitative results presented in Tables 1 and 2 are based on the averaged objective evaluations of the ISEGAN-Self-Attention and the DSEGAN-Self-Attention, respectively, on the 824 wave files in the test set of the noisy dataset introduced by Valentini-Botinhao et al. [50]. In Table 1, we compare the results of the ISEGAN-Self-Attention with the attention blocks at the fourth and fifth convolutional layers of the encoder and the decoder to the baseline results reported in [22,23]; the same comparison is made for the DSEGAN-Self-Attention network in Table 2.

Table 1. Performance comparison between ISEGAN-Self-Attention and the baseline methods based on the objective evaluation metrics. The highest score per metric is highlighted in bold, and the second best is underlined.

Columns: Metric, Noisy, Wiener, SEGAN, SASEGAN, ISEGAN, ISEGAN-Self-Attention.

Table 2. Performance comparison between DSEGAN-Self-Attention and the baseline methods based on the objective evaluation metrics. The highest score per metric is highlighted in bold, and the second best is underlined.

Figure 6. Spectrograms of a speech signal generated using Librosa [55]: (a) the target clean speech signals, (b) the noisy speech signals, and (c) the enhanced speech signals with the different networks. The enhanced speech signal on the top row is produced by the ISEGAN-Self-Attention and the one on the bottom row is produced by DSEGAN-Self-Attention. The regions marked with red rectangles have high noise levels with little or no intelligible speech signal, and the regions marked with yellow rectangles contain intelligible speech signals with little additive noise. It can be seen in (c) that the networks are able to remove additive noise in the regions marked with red rectangles as well as in the regions marked with yellow rectangles. The spectrograms in (c) show that the speech signals are preserved while the networks remove the noise.
As expected, the objective evaluations of both the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks showed an improvement over the ISEGAN and the DSEGAN without self-attention in all the speech quality metrics. For example, when N = 2, the ISEGAN-Self-Attention network achieved average improvements of 18.75%, 6.78%, and 10.64% in PESQ, COVL, and SSNR, respectively, over ISEGAN without self-attention; with N = 4, average improvements of 8.90% and 0.03% in CBAK and SSNR, respectively, were also achieved (see Table 1). The DSEGAN-Self-Attention also gained average improvements of 15.31%, 0.84%, 1.61%, 6.14%, and 5.63% in PESQ, CSIG, CBAK, COVL, and SSNR, respectively, over DSEGAN without self-attention (see Table 2). However, in terms of speech intelligibility (STOI), on average, no significant improvement was achieved by either ISEGAN-Self-Attention or DSEGAN-Self-Attention over ISEGAN and DSEGAN, respectively.
Furthermore, the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks improved over other baseline methods, including the SEGAN [16] and the SASEGAN [23]. For the ISEGAN-Self-Attention, with N = 2, we observed average improvements of 17.57%, 6.12%, and 17.56% in PESQ, COVL, and SSNR, respectively, over SEGAN and SASEGAN. When N = 4, average improvements of 7.13% and 17.56% in CBAK and SSNR, respectively, were achieved over SEGAN and SASEGAN as well. Similarly, in Table 2, with N = 2, the DSEGAN-Self-Attention improved over the SASEGAN on average by 19.77%, 5.79%, 6.12%, 9.78%, and 19.51% in PESQ, CSIG, CBAK, COVL, and SSNR, respectively. In terms of speech intelligibility (STOI), both ISEGAN-Self-Attention and DSEGAN-Self-Attention networks outperformed SEGAN and SASEGAN. For the various multi-generator G setups, when N = 2, the ISEGAN-Self-Attention and the DSEGAN-Self-Attention on average achieved better results in all five objective signal-quality metrics (PESQ, CSIG, CBAK, COVL, and SSNR) than when N = 3 and N = 4. In the case of STOI, the ISEGAN-Self-Attention improved the enhancement performance when N = 3, whereas DSEGAN-Self-Attention was competitive with the baseline methods. When N = 4, the ISEGAN-Self-Attention performed better than the DSEGAN-Self-Attention in the CBAK and SSNR signal-quality metrics. Overall, the results suggest that N = 2 is the best setup for both ISEGAN-Self-Attention and DSEGAN-Self-Attention for enhancing noisy speech signals. Figure 6 presents the spectrograms to visualise the quality and intelligibility of the enhanced speech signals in the N = 2 setup of the ISEGAN-Self-Attention and DSEGAN-Self-Attention networks. From the spectrograms, we can see that the networks are capable of removing the additive noise from the speech signal. This suggests that, by leveraging the self-attention mechanism, the networks are able to learn temporal dependencies in an input sequence of signals.
To demonstrate that the self-attention mechanism is capable of learning temporal dependencies in a speech signal, we compare the clean speech signal's amplitude at time t with the pattern of speech features in the enhanced speech's spectrograms at the same time t. We investigate whether the speech features are preserved within the time intervals to determine whether the self-attention mechanism captured the temporal dependencies in the speech signal. We expect that whenever there are high frequencies, there should be a high pitch in the audio signals and vice versa. In Figure 7, we can observe the patterns of speech features, which are marked with red-, black-, and yellow-coloured rectangles, demonstrating the preservation of the temporal dependencies in the enhanced speech. The analysis in Figure 7 shows that the integration of self-attention into ISEGAN and DSEGAN helps improve the enhancement of the speech signal in terms of signal distortion and speech distortion. The perceived speech quality is preserved, which indicates that the speech components from the inputs are preserved over time. Although speech intelligibility is difficult for most speech enhancement algorithms to improve [56], the self-attention mechanism maintains a decent score. The results indicate that, as the speech signal progresses with time, the self-attention mechanism improves the modelling of temporal dependencies in both ISEGAN-Self-Attention and DSEGAN-Self-Attention, more so than it improves the general quality of the generated enhanced speech.

Figure 7. A comparison of the input speech signal, clean speech, and the spectrogram of the enhanced speech. The speech features in the spectrograms are marked with red-, black-, and yellow-coloured rectangles for the corresponding wave signal to demonstrate the preservation of the temporal dependence in the enhanced speech signal.

Ablation Study
To understand the effect of the self-attention blocks in the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks, we ablate the generators in each network to observe the contribution of the self-attention in multi-generator speech enhancement GAN in generating enhanced speech signals. In this study, our focus was to find out which generator with the self-attention blocks in the multi-generator of the networks has the most contribution to the networks' performance. For instance, if the self-attention blocks are only placed in the first generator, how does it contribute to enhancing the speech signals? Here, we only considered the setup for N = 2 for both the ISEGAN-Self-Attention and the DSEGAN-Self-Attention. This choice was made on the basis that the setup for N = 2 produced the best results, in both quality and intelligibility of the generated signals, among all the experiments we performed. With N = 2, we had 4 different cases to investigate (i.e., 2 cases for each network) by removing all the self-attention blocks from one of the two generators in the network to observe the contributions of the remaining self-attention blocks. For example, in the setup for case 1 in the ISEGAN-Self-Attention network, we removed all the self-attention blocks in the second generator and trained the network with the same training settings mentioned in Section 4.
In Table 3, we present the results of the study and compare them with the ISEGAN and DSEGAN results. We observed that, in both ISEGAN-Self-Attention and DSEGAN-Self-Attention, having the self-attention blocks in only the first generator (i.e., $G_1$) produces better-enhanced speech signals than placing the self-attention blocks only in the second generator (i.e., $G_2$). This result indicates that the second generator $G_2$ refines the output of its predecessor $G_1$; the first generator $G_1$ is therefore responsible for capturing the core features and generating almost-enhanced samples, which the second generator $G_2$ only needs to refine. We also observed that placing the self-attention blocks in only one generator of the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks did not show much improvement over ISEGAN and DSEGAN and, in most cases, trailed behind ISEGAN and DSEGAN, respectively. To achieve the best performance, the self-attention blocks should therefore be integrated into more than one generator of the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks.

Table 3. The results of the ablation study on the N = 2 (i.e., 2 generators) setup of the ISEGAN-Self-Attention and the DSEGAN-Self-Attention networks. The results are compared with ISEGAN and DSEGAN. The setup with the highest score per objective evaluation metric is shown in boldface.

Conclusions
This paper introduced a self-attention block into multi-generator speech enhancement GANs (ISEGAN and DSEGAN) to improve their ability to model temporal dependencies when enhancing speech signals. The self-attention block can be integrated at different convolutional layers of the multiple generators of the ISEGAN and the DSEGAN networks or, given sufficient memory, at all the convolutional layers of the ISEGAN's and the DSEGAN's multiple generators. Our experiments demonstrate that integrating a self-attention mechanism into the ISEGAN and DSEGAN networks, which we respectively called the ISEGAN-Self-Attention and DSEGAN-Self-Attention networks, improves the enhancement performance of ISEGAN and DSEGAN across the objective signal-quality metrics. In addition, the ISEGAN-Self-Attention and the DSEGAN-Self-Attention outperformed the SEGAN and the SASEGAN baselines in all the objective evaluation metrics. Furthermore, we observe that the use of self-attention helps bridge the performance gap between the ISEGAN and the DSEGAN, demonstrating the effectiveness of the self-attention mechanism for both the shared-parameter multi-generator speech enhancement GAN (ISEGAN) and the non-shared-parameter multi-generator speech enhancement GAN (DSEGAN). Overall, our findings highlight the potential benefits of self-attention as a valuable technique for enhancing the performance of multi-generator speech enhancement systems.

Conflicts of Interest:
The authors declare no conflict of interest.
