Light-Weight Self-Attention Augmented Generative Adversarial Networks for Speech Enhancement

Abstract: Generative adversarial networks (GANs) have shown their superiority for speech enhancement. Nevertheless, most previous attempts had convolutional layers as the backbone, which may obscure long-range dependencies across an input sequence due to the convolution operator's local receptive field. One popular solution is substituting recurrent neural networks (RNNs) for convolutional neural networks, but RNNs are computationally inefficient because their temporal iterations cannot be parallelized. To circumvent this limitation, we propose an end-to-end system for speech enhancement by applying the self-attention mechanism to GANs. We aim to achieve a system that is flexible in modeling both long-range and local interactions and can be computationally efficient at the same time. Our work is implemented in three phases: firstly, we apply the stand-alone self-attention layer in speech enhancement GANs. Secondly, we employ locality modeling on the stand-alone self-attention layer. Lastly, we investigate the functionality of the self-attention augmented convolutional speech enhancement GANs. Systematic experimental results indicate that, equipped with the stand-alone self-attention layer, the system outperforms baseline systems across classic evaluation criteria with up to 95% fewer parameters. Moreover, locality modeling can be a parameter-free approach for further performance improvement, and self-attention augmentation also overtakes all baseline systems with an acceptable increase in parameters.

Additionally, generative adversarial networks (GANs) [13] have been demonstrated to be efficient for speech enhancement [14][15][16][17][18][19], where the generative training results in fewer artifacts than discriminative models. Conforming to a GAN's principle, the generator G is designated for learning an enhancement mapping that can imitate the clean data distribution to generate enhanced samples. In contrast, the discriminator D plays the role of a classifier that distinguishes the real sample, coming from the dataset that G is imitating, from the fake sample made up by G. Simultaneously, D guides the parameter updating of G towards the distribution of clean speech signals. Nevertheless, most previous attempts had convolutional layers as the backbone, limiting the network's ability to capture long-range dependencies due to the convolution operator's local receptive field. To remedy this issue, one popular solution is substituting RNNs for CNNs, but RNNs are computationally inefficient because their temporal iterations cannot be parallelized.
In 2017, Vaswani et al. [20] proposed the self-attention mechanism, dispensing with RNNs and CNNs entirely. Compared to discriminative deep learning models, self-attention is computationally efficient. Compared to DNNs, it possesses far fewer parameters. Compared to CNNs, it is flexible in modeling both long-range and local dependencies. Compared to RNNs, it is based on matrix multiplication, which is highly parallelizable and easily accelerated. The self-attention mechanism has been successfully used for different human-machine communication tasks [21][22][23][24][25][26], including speech enhancement tasks [27,28]. Nevertheless, there are still two problems in the previous works. Firstly, some of them did not adopt adversarial training [28,29] and therefore suffer from distortion derived from handcrafted loss functions. Secondly, some works used discriminative models as the architecture backbone (e.g., DNN [16], CNN [30] or LSTM [19]). However, DNNs are computationally inefficient due to their huge parameter scale, CNNs command an extraordinary ability to model local information but experience difficulties in capturing long-range dependencies, and RNNs are computationally inefficient because their temporal iterations cannot be parallelized.
To combine the adversarial training and the self-attention mechanism, Zhang et al. [31] proposed the self-attention generative adversarial network for image synthesis, which introduces the self-attention mechanism into convolutional GANs. In their work, the self-attention module is complementary to convolutional layers and helps with modeling long-range and multi-level dependencies across image regions. In the same year, Ramachandran et al. [32] provided the theoretical basis for substituting the self-attention mechanism for discriminative models. They verify that self-attention layers can completely replace convolutional layers and achieve state-of-the-art performance on vision tasks. Afterwards, Cordonnier et al. [33] presented evidence that self-attention layers can perform convolution and attend to pixel-grid patterns similarly to convolutional layers.
Nonetheless, Yang et al. [34] suggested that the self-attention mechanism might fully attend to all elements, dispersing the attention distribution, and thus overlook the relation of neighboring elements and phrasal patterns. Guo et al. [35] indicated that the generalization ability of the self-attention mechanism is weaker than CNNs or RNNs, especially on moderate-sized datasets, and the reason can be attributed to its unsuitable inductive bias of the self-attention structure. To this end, Yang et al. [34] proposed a parameter-free convolutional self-attention model to enhance the feature extraction of neighboring elements and validate its effectiveness and universality. Guo et al. [35] regarded self-attention as a matrix decomposition problem and proposed an improved self-attention module by introducing locality linguistic constraints. Xu et al. [36] proposed a hybrid attention mechanism via a gating scalar for leveraging both the local and global information, and verified that these two types of contexts are complementary to each other.
Inspired by prior works, this paper presents a series of speech enhancement GANs (SEGANs) equipped with a self-attention mechanism in three ways: first, we deploy the stand-alone self-attention layer in a SEGAN. Next, we employ locality modeling on the stand-alone self-attention layer. Finally, we investigate the functionality of the self-attention augmented convolutional SEGAN. We aim to probe the performance of a SEGAN equipped (i) with stand-alone standard self-attention layers, (ii) with stand-alone hybrid (global and local) self-attention layers, and (iii) with self-attention augmented convolutional layers. In addition, we also calculate the parameter scales of these proposed models.
Our work has four highlights. Firstly, we deploy adversarial training to alleviate the distortion introduced by handcrafted loss functions, and hence the enhancement module is expected to capture more underlying structural characteristics. Secondly, we employ self-attention layers to obtain a more flexible ability to capture both long-range and local interactions. Thirdly, the locality modeling of the self-attention layer is achieved via a parameter-free method. Lastly, we utilize raw speech waveforms as inputs of the system to avoid any distortion introduced by handcrafted features.
We evaluate the proposed systems in terms of various objective evaluation criteria. Systematic experimental results reveal that, equipped with the stand-alone self-attention layer, the proposed system outperforms baseline systems across these criteria with up to 95% fewer parameters. In addition, the locality modeling on the stand-alone self-attention layer delivers further performance improvements without adding any parameters. Moreover, the self-attention augmented SEGAN outperforms all baseline systems and achieves the best SSNR and STOI results of our work, with an acceptable increase in parameters.

Related Works
Pascual et al. [14] open the exploration of generative architectures for speech enhancement, leveraging the ability of deep learning to learn complex functions from large example sets. The enhancement mapping is accomplished by the generator G, whereas the discriminator D, by discriminating between real and fake signals, transmits information to G so that G can learn to produce outputs that resemble the realistic distribution of the clean signals. The proposed system learns from different speakers and noise types, and incorporates them together into the same shared parametrization, which makes the system simple and generalizable in those dimensions.
On the basis of [14], Phan et al. [37] indicate that all existing SEGAN systems execute the enhancement mapping in a single stage with a single generator, which may not be optimal. In this light, they hypothesize that it would be better to carry out multi-stage enhancement mapping rather than a single-stage one. To this end, they divide the enhancement process into multiple stages, each containing an enhancement mapping. Each mapping is conducted by a generator, and each generator is tasked to further correct the output produced by its predecessor. All these generators are cascaded to enhance a noisy input signal gradually and yield a refined enhanced signal. They propose two improved SEGAN frameworks, namely iterated SEGAN (ISEGAN) and deep SEGAN (DSEGAN). In the ISEGAN system, the generators share a common set of parameters, constraining them to apply the same mapping iteratively, as the name implies. DSEGAN's generators have their own independent parameters, allowing them to learn different mappings flexibly. However, DSEGAN has N_G times as many generator parameters as ISEGAN, where N_G is the number of generators. A sketch of the two mapping schemes is given below.
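The two cascading schemes can be summarized by a minimal sketch (PyTorch-style Python; the function and variable names are ours and purely illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

def enhance_isegan(generator: nn.Module, noisy: torch.Tensor, n_stages: int = 2) -> torch.Tensor:
    """ISEGAN: one shared generator applied iteratively (shared parameters)."""
    x = noisy
    for _ in range(n_stages):
        x = generator(x)      # each stage further corrects its predecessor's output
    return x

def enhance_dsegan(generators: nn.ModuleList, noisy: torch.Tensor) -> torch.Tensor:
    """DSEGAN: a cascade of independently parameterized generators (N_G times the parameters)."""
    x = noisy
    for g in generators:
        x = g(x)
    return x
```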
Afterwards, Phan et al. [30] revealed that the existing class of GANs for speech enhancement solely relies on the convolution operation, which may obscure temporal dependencies across the sequence input. To remedy this issue, they propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of the SEGAN, referred to as SASEGAN. Furthermore, they empirically studied the effect of placing the self-attention layer at the (de)convolutional layers with varying layer indices, including all layers as long as memory allows.
As Pascual et al. [14] state, they open the exploration of generative architectures for speech enhancement to progressively incorporate further speech-centric design choices for performance improvement. This study aims to further optimize SEGAN, especially its variant with a self-attention mechanism. Unlike [37], we preserve the single-generator architecture to maintain the light-weight parameter scale. The authors of [30] focused on coupling only one self-attention layer to one convolutional layer in the encoder. Namely, a maximum of three layers of SEGAN are equipped with the self-attention mechanism each time: one convolutional layer of the encoder, one deconvolutional layer of the decoder, and one convolutional layer of the discriminator. Although they also experimented with the performance of SASEGAN-all, i.e., simply coupling self-attention layers to all (de)convolutional layers, we query whether there are more optimized coupling combinations. (Actually, ref. [30] only couples the self-attention layer to the 3rd-11th layers in the encoder, decoder, and discriminator because of the memory limitation, although they refer to it as SASEGAN-all.) For example, can coupling the self-attention mechanism to the 10th and 11th (de)convolutional layers outperform SASEGAN-all with even fewer parameters? In addition, inspired by [32,33], we explore the feasibility of completely substituting self-attention layers for (de)convolutional layers, namely SEGAN with stand-alone self-attention layers. Moreover, to take full advantage of the self-attention layer's flexibility in modeling both long-range and local dependencies, we introduce the parameter-free locality modeling [34] of the self-attention mechanism in SEGAN. To the best of our knowledge, the following three explorations: the stand-alone self-attention layer, locality modeling on self-attention layers, and optimized combinations of coupling self-attention layers with convolutional layers, have not been carried out by previous works in the SEGAN class.

Self-Attention Mechanism
Self-attention [20] relates the information over different positions of the entire input sequence for computing the attention distribution using scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$

where $Q$, $K$, and $V$ denote the query, key, and value matrices, and $d_k$ is the dimension of the keys.
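For concreteness, a minimal PyTorch sketch of this formula (our illustration, not the paper's implementation) is:

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q: (L_q, d_k), k: (L_k, d_k), v: (L_k, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (L_q, L_k)
    attn = torch.softmax(scores, dim=-1)                 # attention distribution over positions
    return attn @ v                                      # (L_q, d_v)
```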

Speech Enhancement GANs
Given a dataset $\mathcal{X} = \{(x^*_1, \tilde{x}_1), (x^*_2, \tilde{x}_2), \cdots, (x^*_N, \tilde{x}_N)\}$ consisting of $N$ pairs of raw signals, a clean speech signal $x^*$ and a noisy speech signal $\tilde{x}$, speech enhancement aims to find a mapping $f_{\theta}(\tilde{x}): \tilde{x} \rightarrow \hat{x}$ that transforms the raw noisy signal $\tilde{x}$ into the enhanced signal $\hat{x}$, where $\theta$ contains the parameters of the enhancement network.
SEGANs designate the generator G for the enhancement mapping, i.e., $\hat{x} = G(\tilde{x})$, while designating the discriminator D to guide the training of G by classifying $(x^*, \tilde{x})$ as real and $(\hat{x}, \tilde{x})$ as fake. Eventually, G learns to produce enhanced signals $\hat{x}$ good enough to fool D such that D classifies $(\hat{x}, \tilde{x})$ as real.
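As a concrete illustration of these roles, the following is a hedged sketch of the least-squares adversarial objectives with the L1 term that we adopt (see Section on the stand-alone self-attention SEGAN); D, G, and the tensor names are illustrative:

```python
import torch

def d_loss(D, clean, noisy, enhanced):
    """D should score (clean, noisy) pairs as real (1) and (enhanced, noisy) pairs as fake (0)."""
    real = D(clean, noisy)
    fake = D(enhanced.detach(), noisy)
    return 0.5 * ((real - 1.0) ** 2).mean() + 0.5 * (fake ** 2).mean()

def g_loss(D, clean, noisy, enhanced, lam=100.0):
    """G tries to make D score (enhanced, noisy) as real; the L1 term keeps outputs close to the clean signal."""
    fake = D(enhanced, noisy)
    return 0.5 * ((fake - 1.0) ** 2).mean() + lam * (enhanced - clean).abs().mean()
```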

Stand-Alone Self-Attention Speech Enhancement GANs
In this section, we demonstrate the self-attention layer adapted in GANs [31], which enables both the generator and the discriminator to efficiently model relations between widely separated spatial regions. Given the feature map $F \in \mathbb{R}^{L \times C}$ as the input of the self-attention layer, where $L$ is the time dimension and $C$ is the number of channels, the query matrix $Q$, the key matrix $K$, and the value matrix $V$ are obtained via the transformations

$$Q = F W_Q, \qquad K = F W_K, \qquad V = F W_V,$$

where $W_Q$, $W_K$, and $W_V \in \mathbb{R}^{C \times C/b}$ denote the learnt weight matrices of the $1 \times 1$ convolutional layers, and $b$ is a factor for reducing the channel numbers. Additionally, a max pooling layer with filter width and stride size of $p$ is deployed on $K$ and $V$ to reduce the number of keys and values for memory efficiency. Therefore, the dimensions of the matrices are $Q \in \mathbb{R}^{L \times C/b}$ and $K, V \in \mathbb{R}^{(L/p) \times C/b}$.

The attention map $A \in \mathbb{R}^{L \times L/p}$ is then computed as

$$A = \mathrm{softmax}\big(Q K^{\top}\big),$$

where $a_{j,i}$ denotes the extent to which the model attends to the $i$th location when synthesizing the $j$th output position. The output of the attention layer $O$ is then computed as

$$O = (A V)\, W_O.$$

With the weight matrix $W_O$ realized by a $1 \times 1$ convolution layer of $C$ filters, the shape of $O$ is restored to the original shape $L \times C$. Eventually, a learnable scalar $\beta$ weighs the output of the attention layer in the final output

$$F_{\mathrm{out}} = \beta\, O + F.$$

Various loss functions have been proposed to improve the training of GANs, e.g., Wasserstein loss [38], relativistic loss [39], metric loss [38], and least-squares loss [40]. In our work, the least-squares loss [40] with binary coding is utilized instead of the cross-entropy loss. Due to the effectiveness of the $L_1$ norm in the image manipulation domain [41], it is deployed in G to gain more fine-grained and realistic results, with a scalar $\lambda$ controlling the magnitude of the $L_1$ norm. Consequently, the loss functions of G and D are

$$\mathcal{L}_G = \frac{1}{2}\,\mathbb{E}\big[(D(G(\tilde{x}), \tilde{x}) - 1)^2\big] + \lambda\,\|G(\tilde{x}) - x^*\|_1,$$

$$\mathcal{L}_D = \frac{1}{2}\,\mathbb{E}\big[(D(x^*, \tilde{x}) - 1)^2\big] + \frac{1}{2}\,\mathbb{E}\big[D(G(\tilde{x}), \tilde{x})^2\big].$$

We illustrate the diagram of a simplified self-attention layer with $L = 9$, $C = 6$, $p = 3$, and $b = 2$ in Figure 1.
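A hedged PyTorch sketch of this layer for 1-D feature maps of shape (batch, C, L) follows; the class and variable names are ours, not the authors' code:

```python
import torch
import torch.nn as nn

class SelfAttention1d(nn.Module):
    def __init__(self, channels: int, b: int = 8, p: int = 4):
        super().__init__()
        c_red = max(channels // b, 1)                          # reduced channel number C/b
        self.query = nn.Conv1d(channels, c_red, kernel_size=1)
        self.key = nn.Conv1d(channels, c_red, kernel_size=1)
        self.value = nn.Conv1d(channels, c_red, kernel_size=1)
        self.out = nn.Conv1d(c_red, channels, kernel_size=1)   # W_O, restores the C channels
        self.pool = nn.MaxPool1d(kernel_size=p, stride=p)      # fewer keys and values
        self.beta = nn.Parameter(torch.zeros(1))               # learnable scalar, initialized to 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query(x)                                      # (B, C/b, L)
        k = self.pool(self.key(x))                             # (B, C/b, L/p)
        v = self.pool(self.value(x))                           # (B, C/b, L/p)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)    # attention map A: (B, L, L/p)
        o = self.out((attn @ v.transpose(1, 2)).transpose(1, 2))  # (B, C, L)
        return self.beta * o + x                               # F_out = beta * O + F
```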

Locality Modeling for Stand-Alone Self-Attention Layers
As illustrated in Figure 2, for the query $Q$, we restrict its attention region (e.g., $K = \{k_1, \cdots, k_l, \cdots, k_L\}$) to a local scope with a fixed window size $M + 1$ ($M \leq L$) centered at the position $l$:

$$\hat{K} = \{k_{l-M/2}, \cdots, k_l, \cdots, k_{l+M/2}\},$$

and analogously for the value matrix $\hat{V}$. When we apply the locality modeling to the self-attention layer, the factor $p$ should be discarded to help preserve the original neighborhood of the centered position. Accordingly, the local attention map and the output of the attention layer are modified as

$$\hat{A} = \mathrm{softmax}\big(Q \hat{K}^{\top}\big), \qquad \hat{O} = \big(\hat{A}\, \hat{V}\big)\, W_O.$$
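A hedged sketch of this parameter-free locality modeling, implemented as a mask that restricts attention to a window of size M + 1 centered on each position (our illustration, single-head and unbatched):

```python
import torch

def local_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, window: int) -> torch.Tensor:
    # q, k, v: (L, d); `window` is M + 1 (odd, so the window is centered on each position).
    L = q.size(0)
    scores = q @ k.transpose(0, 1)                         # (L, L), no pooling of keys/values
    pos = torch.arange(L)
    mask = (pos[:, None] - pos[None, :]).abs() > window // 2
    scores = scores.masked_fill(mask, float("-inf"))       # keep only the local neighborhood
    return torch.softmax(scores, dim=-1) @ v               # local attention output
```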

Attention Augmented Convolutional SEGAN
We implement an attention augmented convolutional SEGAN by coupling the self-attention layer with the (de)convolutional layer(s). We expect this architecture to benefit from the fact that the distance-aware information from the convolutional layer and the distance-agnostic dependencies modeled by the self-attention layer complement each other.
To this end, we introduce two learnable parameters, $\kappa$ and $\gamma$, to weigh the input feature map $F$ and the output feature map $O$ of the coupled layer (with self-attention mechanism and convolution) in the augmented output $\hat{F}$ as

$$\hat{F} = \kappa\, F + \gamma\, O.$$
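A hedged sketch of such a coupled block follows, reusing the SelfAttention1d sketch from Section 4.2; the names, kernel settings, and the exact wiring of the two branches are our illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class AttentionAugmentedConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 31, stride: int = 2,
                 b: int = 8, p: int = 4, init: float = 0.25):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride, padding=kernel // 2)
        self.attn = SelfAttention1d(out_ch, b=b, p=p)     # earlier sketch; its beta residual starts at 0
        self.kappa = nn.Parameter(torch.tensor(init))     # weighs the feature map F
        self.gamma = nn.Parameter(torch.tensor(init))     # weighs the attention output O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.conv(x)                                  # distance-aware, local features
        o = self.attn(f)                                  # distance-agnostic, global dependencies
        return self.kappa * f + self.gamma * o            # augmented output F_hat
```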

Experimental Setups
We systematically evaluate the effectiveness of (i) the stand-alone self-attention layer on SEGAN, (ii) the locality modeling on the stand-alone self-attention layer, and (iii) the attention augmented convolutional SEGAN.

Dataset
We conduct all experiments on the publicly available dataset introduced in [42]. The dataset includes 30 speakers from the Voice Bank corpus [43], with 28 speakers selected for the training set and 2 for the test set. The noisy training set contains 40 different conditions, obtained from 10 types of noise (2 artificial and 8 from the Demand database [44]) at 4 signal-to-noise ratios (SNRs) (15, 10, 5, and 0 dB). There are approximately 10 sentences per condition per training speaker. The noisy test set contains 20 different conditions, obtained from 5 types of noise at 4 SNRs (17.5, 12.5, 7.5, and 2.5 dB). There are around 20 sentences per condition per test speaker. Notably, the speakers and conditions in the test set are entirely different from those in the training set, i.e., the test set is completely unseen during training.

Network Architecture
We first introduce the architecture of SEGAN, as all three variants, namely (i) SEGAN with stand-alone self-attention layers, (ii) stand-alone self-attention layers with locality modeling, and (iii) attention augmented convolutional SEGAN, are based on it.
The classic enhancement systems are based on the short-time Fourier analysis/synthesis framework [1]. They assume that the short-time phase is not important for speech enhancement [49], and they only process the spectrum magnitude. However, further studies [50] show the intensive relation between the clean phase spectrum and the speech quality. Therefore, we use raw waveforms as inputs to our systems. We extract approximately one-second waveform chunks (∼16,384 samples) with a sliding window every 500 ms. G makes use of an encoder-decoder structure. The encoder is composed of 11 one-dimensional strided convolutional layers of filter width 31 and stride 2, followed by parametric rectified linear units (PReLUs) [51]. Along the encoder's depth, the number of filters per layer increases (from 16 in the first layer up to 1024 in the last) to compensate for the shrinking duration of the convolutional output. At the 11th layer of the encoder, the encoding vector $c \in \mathbb{R}^{8 \times 1024}$ is stacked with the noise $z \in \mathbb{R}^{8 \times 1024}$, sampled from the distribution $\mathcal{N}(0, I)$, and presented to the decoder. The decoder mirrors the encoder architecture entirely to reverse the encoding process by means of the transposed convolution, termed deconvolution. There are skip connections connecting each convolutional layer to its homologous deconvolutional layer. They bypass the compression performed in the middle of the model and allow the fine-grained details (e.g., phase, alignment) of speech signals to flow into the decoding stage directly. If all information were forced through the bottleneck, much useful low-level information could be lost in the compression. In addition, skip connections offer a better training behavior, as the gradients can flow more deeply through the whole structure [52]. Notably, skip connections and the addition of the latent vector double the number of feature maps in every layer.
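A hedged PyTorch sketch of this encoder-decoder skeleton follows. The intermediate filter counts are illustrative placeholders (only the first-layer width of 16 and the 1024-channel, length-8 encoding are taken from the text), and the output nonlinearity is simplified:

```python
import torch
import torch.nn as nn

class SEGANGenerator(nn.Module):
    def __init__(self, enc_channels=(1, 16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024)):
        super().__init__()
        # 11 strided 1-D convolutions: filter width 31, stride 2, PReLU activations.
        self.enc = nn.ModuleList([
            nn.Conv1d(enc_channels[i], enc_channels[i + 1], 31, stride=2, padding=15)
            for i in range(len(enc_channels) - 1)])
        self.enc_act = nn.ModuleList([nn.PReLU() for _ in self.enc])
        # The decoder mirrors the encoder; input channels are doubled by the latent
        # vector z (at the bottleneck) or by the skip connection (all other layers).
        dec_channels = enc_channels[::-1]
        self.dec = nn.ModuleList([
            nn.ConvTranspose1d(2 * dec_channels[i], dec_channels[i + 1], 31, stride=2,
                               padding=15, output_padding=1)
            for i in range(len(dec_channels) - 1)])
        self.dec_act = nn.ModuleList([nn.PReLU() for _ in self.dec])

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, 1, 16384), an approximately one-second waveform chunk
        skips, h = [], noisy
        for conv, act in zip(self.enc, self.enc_act):
            h = act(conv(h))
            skips.append(h)
        z = torch.randn_like(h)                          # noise z ~ N(0, I), same shape as c
        h = torch.cat([h, z], dim=1)                     # stack the encoding c with z
        for i, (deconv, act) in enumerate(zip(self.dec, self.dec_act)):
            h = act(deconv(h))
            if i < len(self.dec) - 1:                    # skip connections double the feature maps
                h = torch.cat([h, skips[-(i + 2)]], dim=1)
        return h                                         # enhanced waveform, (batch, 1, 16384)
```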
The discriminator D resembles the encoder's structure with the following differences: (i) it receives a pair of raw audio chunks as the input, i.e., $(x^*, \tilde{x})$ or $(\hat{x}, \tilde{x})$; (ii) it utilizes virtual batch-norm before the LeakyReLU [53] activation with α = 0.3; (iii) after the last activation layer, a 1 × 1 convolution layer reduces the features required for the final classification neuron from 8 × 1024 to 8.
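A hedged sketch of the discriminator head described in (iii), with illustrative names (the preceding encoder-like stack and the virtual batch-norm are omitted for brevity):

```python
import torch
import torch.nn as nn

class DiscriminatorHead(nn.Module):
    def __init__(self, channels: int = 1024, length: int = 8):
        super().__init__()
        self.act = nn.LeakyReLU(0.3)
        self.reduce = nn.Conv1d(channels, 1, kernel_size=1)   # 8 x 1024 -> 8 features
        self.classify = nn.Linear(length, 1)                  # final classification neuron

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, 1024, 8), computed from the input pair of raw audio chunks
        h = self.reduce(self.act(features)).flatten(1)        # (batch, 8)
        return self.classify(h)                               # real/fake score
```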

SEGAN with Stand-Alone Self-Attention Layers
We substitute the self-attention layer, illustrated in Section 4.2, for the (de)convolutional layers of both G and D. Figure 3a,b demonstrate an example of substituting the self-attention layer for the lth (de)convolutional layers. When the stand-alone self-attention layer is deployed at the lth convolutional layer of the encoder, the mirroring lth deconvolutional layer of the decoder and the lth convolutional layer of the discriminator are also replaced with it. To keep the dimension of the feature map per layer in accordance with that of SEGAN, a max pooling layer with a kernel size of 2 and a stride of 2 follows every stand-alone self-attention layer in the encoder. Accordingly, upsampling needs to be deployed before the 11th deconvolutional layer in the decoder to ensure that the same feature dimensions flow through the skip connections. We experiment with two interpolation methods: nearest and bilinear. The results suggest that bilinear interpolation outperforms the nearest one, so we utilize bilinear interpolation for upsampling in all our experiments. Theoretically, the stand-alone self-attention layer can be placed at any number, even all, of the (de)convolutional layers. We empirically study the effect of placing the stand-alone self-attention layer at (de)convolutional layers with lower or higher layer indices as well as layer combinations, provided that memory allows.
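A hedged sketch of how the substitution keeps the feature-map dimensions consistent (our illustration; it reuses the SelfAttention1d sketch from Section 4.2 and omits channel-width changes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandAloneAttentionDown(nn.Module):
    """Stands in for a stride-2 convolutional layer in the encoder."""
    def __init__(self, channels: int, b: int = 8, p: int = 4):
        super().__init__()
        self.attn = SelfAttention1d(channels, b=b, p=p)    # earlier sketch
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)  # halves the time dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.attn(x))

def upsample_before_deconv(x: torch.Tensor) -> torch.Tensor:
    # Doubles the time dimension so the skip connections keep matching shapes; on 1-D
    # feature maps this corresponds to the (bi)linear interpolation reported in the text.
    return F.interpolate(x, scale_factor=2, mode="linear", align_corners=False)
```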
The D component follows the same structure as G's encoder stage. We set b = 8 and p = 4 for the self-attention layer, set λ to 100, and initialize β = 0.

Stand-Alone Self-Attention Layer with Locality Modeling
The general architecture remains the same as in Section 5.3.1. However, the factor p is eliminated to help preserve the original neighborhood of the centered position. Namely, the max pooling layers in Figure 1 are discarded for the matrices K and V. The factor b remains 8. We conduct ablation tests on the impact of the window size (M + 1) and of the placement of the window on the system performance.
Peters et al. [54] and Raganato and Tiedemann [55] indicated that higher layers of a system tend to learn semantic information while lower layers capture more surface and lexical information. Therefore, we apply locality modeling primarily to the lower layers, in line with the configurations in [56,57]. We thus expect the representations to be learned in a hierarchical fashion, namely, the distance-aware and local information extracted by the lower layers can complement the distance-agnostic and global information captured by the higher layers.

Attention Augmented Convolutional SEGAN
Instead of replacement (Section 5.3.1), we couple stand-alone self-attention layers with (de)convolutional layers of both the generator and the discriminator. As illustrated in Figure 4, when the stand-alone self-attention layer is coupled with the lth convolutional layer of the encoder, the mirroring lth deconvolutional layer of the decoder and the lth convolutional layer of the discriminator are also coupled with it. Spectral normalization is applied to all the (de)convolutional layers. For the scalars γ and κ, we experiment with three initialization pairs: γ = 0 and κ = 0; γ = 0.75 and κ = 0.25 (inspired by the results provided by [58]); and γ = 0.25 and κ = 0.25, finding that the best results are obtained when both γ and κ are initialized as 0.25. We empirically study the effect of coupling the self-attention layer with lower and higher layer indices as well as different layer combinations. The two factors introduced in Section 4.2 are set to p = 4 and b = 8.
Since coupling the self-attention layer with a single convolutional layer has already been studied in detail by [30], this study focuses on finding more optimized coupling combinations.

Baseline Systems
For comparison, we take the seminal work [14] and the other SEGAN variants [30,37] introduced in Section 2 as baseline systems. From [37], we choose the results of ISEGAN with two shared generators and DSEGAN with two independent generators as baseline results (the case of N_G = 2) for two reasons. On one hand, the number of generators leads to an exponential parameter increment. On the other hand, Phan et al. [37] indicated that increasing ISEGAN's number of iterations or DSEGAN's depth beyond N_G = 2 has only marginal impact, with no significant performance improvements. The authors of [30] present detailed results on the influence of the self-attention layer placement in the generator and the discriminator. We choose the average result of coupling the self-attention layer with a single (de)convolutional layer (referred to as SASEGAN-avg) and the result of coupling self-attention layers with all (de)convolutional layers (referred to as SASEGAN-all) to ensure a fair comparison. It is worth noting that [30] states that, compared to SASEGAN-avg, the results of SASEGAN-all are slightly better, but these gains are achieved at the cost of increased computation time and memory requirements.

Configurations
Networks are trained with RMSprop [59] for 100 epochs with a minibatch size of 50. A high-frequency preemphasis filter of coefficient 0.95 is applied to both training and test samples. In the test stage, we slide the window through the whole duration of our test utterance without overlap, and the enhanced outputs are deemphasized and concatenated at the end of the stream.
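A hedged sketch of the pre-emphasis applied to training/test samples and the matching de-emphasis applied to the enhanced output (coefficient 0.95; the function names are ours):

```python
import numpy as np

def preemphasis(x: np.ndarray, coeff: float = 0.95) -> np.ndarray:
    # y[n] = x[n] - coeff * x[n-1]
    return np.append(x[0], x[1:] - coeff * x[:-1])

def deemphasis(y: np.ndarray, coeff: float = 0.95) -> np.ndarray:
    # Inverse filter: x[n] = y[n] + coeff * x[n-1]
    x = np.zeros_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + coeff * x[n - 1]
    return x
```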

Results
We systematically evaluate the effectiveness of (i) the stand-alone self-attention layer on SEGAN, (ii) the locality modeling on the stand-alone self-attention layer, and (iii) attention augmented convolutional SEGAN. Table 1 exhibits the performance of SEGAN equipped with the stand-alone self-attention layer. The upper part displays the results of baseline systems, while the lower part displays the results of our work.
As shown in Table 1, when we replace the 6th and 10th (de)convolutional layers with the stand-alone self-attention layer, the system overtakes all baselines across all metrics and achieves the best SSNR result, with 47% fewer parameters than DSEGAN or 12% fewer than SASEGAN-all. Furthermore, when we adopt the stand-alone self-attention layer at the 9th to 11th (de)convolutional layers, it still yields comparable or even better (STOI) results, with parameters plunging drastically to merely 5% (95% fewer) of DSEGAN or 9% (91% fewer) of SASEGAN-all. When l = 4, the parameter scale of the proposed system is closest to that of the baseline systems. Under such circumstances, it outperforms baseline systems in PESQ, CBAK, COVL, and STOI, and achieves the best results in PESQ, CBAK, and COVL. When we substitute the self-attention layer for the lth (6 ≤ l ≤ 11) (de)convolutional layers, the system performance plummets, as the parameters of the whole system are only 1% of SASEGAN-all [30] and 0.6% of DSEGAN [37]. The results reveal that the stand-alone layer can be a powerful and light-weight primitive for speech enhancement.

Table 1. Effects of the stand-alone self-attention layer(s) on speech enhancement GANs (SEGANs). We denote the proposed architecture with the stand-alone self-attention layer(s) at the lth (de)convolutional layer(s) as standalone-l. Values that overtake all baseline systems are in bold. Values with an asterisk are the best ones achieved for each metric.

Next, we investigate the effects of locality modeling and its window size. Prior studies [56,60] indicate that lower layers usually extract lower-level features, so they should attend more to the local field. Additionally, they prove empirically that modeling locality on lower layers achieves better performance. Therefore, we only apply the locality modeling to layers no deeper than the 6th one. As plotted in Figure 5, a tiny window size limits the receptive field too much and hence degrades performance, as it deprives the layer of the ability to model long-range and multi-level dependencies. A window size of 14 appears to be superior to the other settings, approximately consistent with [34] on machine translation tasks. When the window size continues to increase, the performance tends towards that of the system without windows, which is self-explanatory. Complete results on the six criteria are exhibited in Table 2. Compared to Table 1, employing locality modeling on the 4th layer yields the most significant improvement, in accordance with the conclusion in [56,60]. It also achieves the best or comparable results across all criteria, which demonstrates the functionality of the locality modeling without further computational cost. An explanation for the undesirable SSNR is that the suboptimal upsampling method introduces speech distortion, which also manifests in CSIG. Importantly, a fixed-size window is not the state-of-the-art approach in the field of locality modeling of the self-attention mechanism. We choose it because it is parameter free, corresponding to our goal of a light-weight system. It is worth noting that the proposed SEGAN with stand-alone self-attention layers is general enough to combine with other, more advanced locality modeling approaches [61,62] in cases where computational complexity is secondary.

Lastly, we investigate the functionality of the attention augmented convolutional networks according to the augmented-output formulation $\hat{F} = \kappa F + \gamma O$.
We choose combinations from lower-to-middle layers (augmentation-4,6), middle-to-higher layers (augmentation-6,10), and all layer ranges (augmentation-4,10 and augmentation-4,6,10). As displayed in Table 3, coupling the self-attention layer on the 4th and 6th layers is more competitive on CSIG, COVL, and STOI (the best results in Table 3), and it achieves the best STOI performance of our work. In contrast, adding the self-attention layer on the 6th and 10th layers overtakes all baseline systems across all metrics, and it gives the best result on SSNR. The combination of the 4th and 10th layers still outperforms the baseline systems, except on SSNR. However, the combination of the 4th, 6th, and 10th layers only outperforms the baseline systems on PESQ and STOI, although it still yields decent results on the other metrics. These results demonstrate the efficiency of the attention augmentation for the convolutional SEGAN. Nevertheless, it is worth noting that the system parameters inevitably increase when coupling the self-attention layer to (de)convolutional layers.

Discussion
In general, the biggest advantage of applying the stand-alone self-attention layer in SEGAN is that it simultaneously outperforms the baseline systems and decreases the model complexity drastically. In particular, when applying the stand-alone self-attention layer as the 6th and 10th layers of the system, the resultant system overtakes all baselines across all metrics and achieves the best SSNR results with only ∼50% of the parameters of DSEGAN [37]. In addition, locality modeling can be an effective auxiliary to stand-alone self-attention layers, further improving their performance without any extra parameter increment. Notably, locality modeling on a lower self-attention layer delivers more perceptible performance improvements, consistent with [34,56,57]. The self-attention augmented SEGAN, in turn, performs modestly better; although it is less light-weight than the other two approaches, it still has 42% fewer parameters than DSEGAN.
However, different placements of the stand-alone self-attention layer or the coupled self-attention layer lead to different performance improvements, and the compromise between system performance and system complexity is always ineluctable. We only present the achieved performance and the homologous model complexity for representative placements, which readers can take for reference according to the desired application.

Conclusions
We integrate the self-attention mechanism with SEGAN to improve its flexibility in modeling both long-range and local dependencies for speech enhancement in three ways, namely, applying the stand-alone self-attention layer, modeling locality on the stand-alone self-attention layer, and coupling the self-attention layer with the (de)convolutional layers. The proposed systems deliver consistent performance improvements. The main merit of the stand-alone self-attention layer is its low model complexity, and it can perform even better when equipped with locality modeling. In contrast, the self-attention augmented convolutional SEGAN delivers more stable improvements, although it increases the model complexity.
Importantly, the locality modeling method utilized in this study is basic. We choose it to achieve the goal of a light-weight system, but more advanced locality modeling approaches can easily be incorporated. Moreover, all the approaches described in this paper are generic enough to be applied to existing SEGAN models for further performance improvements. We leave these topics for future studies.