CGA-MGAN: Metric GAN Based on Convolution-Augmented Gated Attention for Speech Enhancement

In recent years, neural networks based on attention mechanisms have seen increasing use in speech recognition, separation, and enhancement, as well as other fields. In particular, the convolution-augmented transformer (conformer) has performed well, as it combines the advantages of convolution and self-attention. Recently, the gated attention unit (GAU) was proposed; compared with traditional multi-head self-attention, GAU-based approaches are both effective and computationally efficient. In this paper, we propose a network for speech enhancement called CGA-MGAN, a MetricGAN based on convolution-augmented gated attention. CGA-MGAN captures local and global correlations in speech signals simultaneously by fusing convolution and gated attention units. Experiments on Voice Bank + DEMAND show that our proposed CGA-MGAN model achieves excellent performance (3.47 PESQ, 0.96 STOI, and 11.09 dB SSNR) with a relatively small model size (1.14 M parameters).


Introduction
Speech enhancement (SE) systems are usually used at the frontend of automatic speech recognition systems [1], communication systems [2], and hearing aids [3] to remove noise from speech. Methods based on traditional signal processing, such as spectral subtraction [4], Wiener filtering [5], and minimum mean square error estimation [6], are widely used in speech enhancement. Although these methods perform well on stationary noise, they struggle with nonstationary noise. With the development of deep neural networks (DNNs), DNN-based models have been used more frequently in speech enhancement in recent years.
Traditional time-frequency domain DNN models reconstruct the speech magnitude spectrum by estimating a mask function [7][8][9] or directly predicting the magnitude spectrum of clean speech [10], ignoring the role of phase information. However, phase information improves speech perception quality under a low signal-to-noise ratio (SNR) [11,12]. In [13], researchers proposed recovering magnitude and phase information simultaneously in the time-frequency (TF) domain by estimating the complex ratio mask (CRM) [14]. However, due to the compensation effect between magnitude and phase [15], the simultaneous enhancement of magnitude and phase reduces the accuracy of magnitude estimation [16]. In [17], researchers proposed a decoupling-style phase-aware method: a two-path network first estimates the magnitude and then refines the complex spectrum using residual learning, which effectively alleviates the compensation effect.

Our contributions are as follows:
• We construct an encoder-decoder structure including gating blocks, using the decoupling-style phase-aware method, that can collaboratively estimate the magnitude and phase information of clean speech in parallel and avoid the compensation effect between magnitude and phase;
• We propose a convolution-augmented gated attention unit (CGAU) that can capture time and frequency dependence with lower computational complexity and achieve better results than the conformer;
• The proposed approach outperforms previous approaches on the Voice Bank + DEMAND dataset [28], and an ablation experiment verifies our design choices.
The remainder of this paper is organized as follows: Section 2 introduces the related work of speech enhancement. Section 3 analyzes the specific architecture of the CGA-MGAN model we propose. Section 4 introduces the experimental setup, including the dataset used for the experiment, the network's training setup, and the experimental results' evaluation indicators. Section 5 analyzes the experimental results, compares them with some existing models, and conducts an ablation experiment. Finally, Section 6 summarizes this work and suggests some future research directions.

Related Works
This paper focuses on a convolution-augmented gated attention MetricGAN model for speech enhancement. In this section, we briefly introduce MetricGAN, conformers, and their basic working principles. In addition, we review the structure of the standard transformer and briefly introduce the basic principle of the new transformer variant, the gated attention unit (GAU).

MetricGAN
Before introducing MetricGAN, we first describe how a general GAN is used for speech enhancement. A GAN can approximate the real data distribution by employing an alternating mini-max training scheme between the generator and the discriminator. By using the least-squares GAN method [29] to minimize the following loss function, we can train the generator to map noisy speech, x, to clean speech, y, and thereby generate enhanced speech:
$$L_{G(LSGAN)} = \mathbb{E}_x\big[(D(G(x), x) - 1)^2\big] \quad (1)$$

Here, G represents the generator and D represents the discriminator. By minimizing the following loss function, we can train D to distinguish between clean speech and enhanced speech:

$$L_{D(LSGAN)} = \mathbb{E}_{x,y}\big[(D(y, x) - 1)^2\big] + \mathbb{E}_x\big[(D(G(x), x) - 0)^2\big] \quad (2)$$

Here, E represents the expectation; the target label 1 corresponds to clean speech, and 0 corresponds to enhanced speech. G(x) represents the enhanced speech, and D(·) represents the output of the discriminator. MetricGAN consists of a generator and a discriminator, the same as a general GAN. The generator enhances speech. The discriminator treats the objective evaluation function as a black box and trains a surrogate of that evaluation function. During training, the discriminator and the generator are updated alternately to guide the generator to produce higher-quality speech.
Unlike a general GAN, MetricGAN introduces a function, Q(I), representing the normalized metric to be simulated. Formula (3) is the loss function of the generator network:

$$L_G = \mathbb{E}_x\big[(D(G(x), y) - s)^2\big] \quad (3)$$

where s represents the expected metric score. When s = 1, the generator is trained to produce enhanced speech that is close to clean speech.
Formula (4) is the loss function of the discriminator network:

$$L_D = \mathbb{E}\big[(D(y, y) - Q(y, y))^2 + (D(G(x), y) - Q(G(x), y))^2\big] \quad (4)$$

where Q(I) represents the target evaluation metric normalized between 0 and 1, and I represents the speech pair to be evaluated. When the inputs are both clean speech, I = (y, y); when the inputs are an enhanced speech and the corresponding clean speech, I = (G(x), y).
The training process for MetricGAN can be condensed into the following schema:
• Input noisy speech, x, into the generator to generate enhanced speech, G(x);
• Input the clean-clean speech pair, (y, y), into the discriminator to compute the output, D(y, y), and compute Q(y, y) through the objective evaluation function;
• Input the enhanced-clean speech pair, (G(x), y), into the discriminator to compute the output, D(G(x), y), and compute Q(G(x), y) through the objective evaluation function;
• Compute the loss functions of the generator and the discriminator and update the weights of both networks.
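The schema above can be sketched numerically. This is a toy illustration, not the paper's implementation: `G`, `D`, and `Q` below are stand-in functions (a scaling "denoiser", a bounded similarity score, and a clipped metric in place of PESQ), used only to show how the two losses are assembled from the four quantities D(y, y), Q(y, y), D(G(x), y), and Q(G(x), y).

```python
import numpy as np

# Toy stand-ins for the generator, discriminator, and normalized metric Q.
# All of these are illustrative placeholders, not the paper's networks.
def G(x):
    return 0.9 * x                       # "generator": toy denoiser

def D(a, b):
    return 1.0 / (1.0 + np.mean((a - b) ** 2))   # surrogate score in (0, 1]

def Q(a, b):
    # normalized objective metric (stand-in for normalized PESQ)
    return float(np.clip(1.0 - np.mean(np.abs(a - b)), 0.0, 1.0))

def metricgan_losses(x, y, s=1.0):
    """One pass of the MetricGAN training schema described above."""
    g = G(x)                             # enhanced speech G(x)
    d_clean = D(y, y)                    # D on the clean-clean pair
    d_enh = D(g, y)                      # D on the enhanced-clean pair
    loss_g = (d_enh - s) ** 2            # generator pushes D(G(x), y) -> s
    loss_d = (d_clean - Q(y, y)) ** 2 + (d_enh - Q(g, y)) ** 2
    return loss_g, loss_d

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)           # "clean" speech
x = y + 0.1 * rng.standard_normal(16000) # "noisy" speech
loss_g, loss_d = metricgan_losses(x, y)
print(loss_g, loss_d)
```

In a real system, the two losses would drive alternating gradient updates of the generator and discriminator weights.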

Gated Attention Unit
Recently, Hua et al. [30] proposed a new transformer variant called GAU. Compared with the standard transformer, it is faster, occupies less memory, and performs better.
The standard transformer is composed of alternating attention blocks and feed-forward network (FFN) layers, where each FFN consists of two multi-layer perceptron (MLP) layers. The attention block uses multi-head self-attention (MHSA), as shown in Figure 1a. Unlike the standard transformer, GAU has only one layer type, which makes networks stacked with GAU modules simpler and easier to understand. GAU replaces the FFN layer with the gated linear unit (GLU), whose structure is shown in Figure 1b. The strong performance of GLU allows GAU to weaken its dependence on attention: GAU can use single-head self-attention (SHSA) instead of MHSA while achieving the same or even better results than the standard transformer [30]. This not only improves computing speed but also reduces memory occupation. The structure of GAU is shown in Figure 2.
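The GAU computation can be sketched as follows. This is a minimal single-head sketch with random placeholder weights; the shared low-dimensional representation Z, the gate branch U, and the squared-ReLU attention kernel follow Hua et al. [30], while the per-dimension scale/offset on Z is shown as an identity for brevity.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

def gau(x, d=64, e=128, rng=None):
    """Minimal single-head gated attention unit (GAU) sketch.
    Weights are random placeholders; shapes follow [30]."""
    rng = rng or np.random.default_rng(0)
    T, C = x.shape
    Wu = rng.standard_normal((C, e)) * 0.02
    Wv = rng.standard_normal((C, e)) * 0.02
    Wz = rng.standard_normal((C, d)) * 0.02
    Wo = rng.standard_normal((e, C)) * 0.02
    u = swish(x @ Wu)                      # gate branch
    v = swish(x @ Wv)                      # value branch
    z = swish(x @ Wz)                      # shared low-dim representation Z
    q = 1.0 * z + 0.0                      # per-dim scale/offset (identity here)
    k = 1.0 * z + 0.0
    a = np.maximum(q @ k.T / T, 0.0) ** 2  # relu^2 attention weights
    return (u * (a @ v)) @ Wo              # gate * attended values, project back

x = np.random.default_rng(1).standard_normal((10, 32))
y = gau(x)
print(y.shape)
```

Note how the gate U replaces the transformer's separate FFN layer: attention output and gating are fused into one block.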

Conformer
The conformer was first used in speech recognition and can also be used for speech enhancement. Since a pronunciation unit is composed of multiple adjacent speech frames, the convolution mechanism can better extract fine-grained local feature patterns, such as pronunciation unit boundaries. The conformer combines the convolution and self-attention modules and exploits the advantages of both. The main structure of the conformer is shown in Figure 3. The conformer block consists of four parts: the first feed-forward module, an MHSA module, a convolution module, and the second feed-forward module. The detailed structure of the convolution block is shown in Figure 4, and the detailed structure of the feed-forward module is shown in Figure 5. Inspired by Macaron Net [31], the conformer adopts a Macaron-style structure: the convolution module and the MHSA module are placed between two feed-forward modules. By stacking conformer blocks, speech features are extracted step by step to achieve speech recognition or speech enhancement.
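The Macaron-style wiring can be sketched as pure residual plumbing. The sub-modules are passed in as callables (identities here), since only the block structure is being illustrated; the half-step (0.5×) feed-forward residuals follow the original conformer formulation.

```python
import numpy as np

def conformer_block(x, ffn1, mhsa, conv, ffn2, norm):
    """Macaron-style conformer block: half-step FFN, MHSA, convolution,
    half-step FFN, then a final normalization. Only the wiring is shown."""
    x = x + 0.5 * ffn1(x)   # first half-step feed-forward residual
    x = x + mhsa(x)         # multi-head self-attention residual
    x = x + conv(x)         # convolution-module residual
    x = x + 0.5 * ffn2(x)   # second half-step feed-forward residual
    return norm(x)

identity = lambda t: t      # placeholder sub-modules
x = np.ones((4, 8))
y = conformer_block(x, identity, identity, identity, identity, identity)
print(y[0, 0])
```

With identity sub-modules the residual chain multiplies each input value by (1.5 · 2 · 2 · 1.5) = 9, which makes the residual structure easy to verify.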

Limitations and Our Approach
MetricGAN was an important step in applying GANs to speech enhancement: its discriminator simulates evaluation metrics that are otherwise nondifferentiable so that they can serve as a loss function. However, the performance of the MetricGAN generator limits its speech enhancement effect. In our CGA-MGAN model, we follow the MetricGAN idea to build a discriminator, and we build a generator with an encoder-decoder structure including gating blocks using the decoupling-style phase-aware method, which greatly improves the network's speech enhancement performance.
In addition, although GAU has been applied in natural language processing, previous research has not yet brought it to the field of speech enhancement; this paper is the first application of GAU to speech enhancement. Compared with the Macaron-style structure used in the conformer, our proposed CGAU uses GLU to replace the conformer's two feed-forward modules, replaces MHSA with SHSA, and tightly integrates the convolution module with GAU, significantly reducing the computational complexity of the network. A convolution-augmented GAU constructed this way can extract global and local features simultaneously to obtain better speech quality.

Methodology
In this section, we introduce the composition of the CGA-MGAN model, including the encoder-decoder structure of the generator, the structure of the two-stage CGA block, and the structure of the metric discriminator. Finally, we introduce the loss function of the generator and the metric discriminator.
The architecture of the generator is shown in Figure 6. The generator consists of an encoder-decoder structure, including gating blocks using the decoupling-style phase-aware method, and four two-stage CGA blocks. First, it takes a discrete noisy signal, y ∈ R^{B×N×1}, with N samples as the input. Then, we convert the input signal to Y_o ∈ R^{B×T×F×1} in the time-frequency domain using the short-time Fourier transform (STFT), where T represents the number of frames and F represents the number of frequency bins of the complex spectrogram. After that, a power-law compression with a compression exponent of 0.3 is applied to the spectrum:

$$Y = |Y_o|^{c}\, e^{jY_p}$$

where c is the compression exponent and Y_p is the noisy phase. Then, the magnitude, Y_m, real component, Y_r, and imaginary component, Y_i, of the compressed spectrum are concatenated as [Y_m; Y_r; Y_i] ∈ R^{B×T×F×3}, which forms the input of the encoder.
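The power-law compression step can be sketched directly: the magnitude is raised to the exponent c while the phase is left unchanged. The example below uses c = 0.5 only because it yields round numbers; the paper uses c = 0.3.

```python
import numpy as np

def compress(spec, c=0.3):
    """Power-law compression of a complex spectrogram: |Y|^c, phase kept."""
    mag, phase = np.abs(spec), np.angle(spec)
    return (mag ** c) * np.exp(1j * phase)

spec = np.array([[4.0 + 0.0j, 0.0 + 9.0j]])   # toy complex spectrogram
out = compress(spec, c=0.5)
print(np.abs(out))   # magnitudes 4 and 9 become 2 and 3
```

Compression of this kind reduces the dynamic range of the spectrogram, so loud and quiet time-frequency bins contribute more evenly to training.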

Encoder and Decoder
The encoder of the generator consists of five encoder blocks with concatenation operations. The last encoder block halves the frequency dimension to reduce complexity. Each encoder block consists of a Conv2D (two-dimensional convolution) layer, an instance normalization layer, and a parametric ReLU (PReLU) activation layer. The output feature of the encoder is Y_enc ∈ R^{B×T×F′×C}, where F′ = F/2 and C = 64.
The decoder of the generator consists of three decoder blocks: a magnitude mask estimation decoder block and two complex spectrum estimation decoder blocks, which output the multiplicative mask of the magnitude, X′_m; the real component, X′_r; and the imaginary component, X′_i, of the spectrum in parallel. Each decoder block consists of five gated blocks and a Conv2D layer. The first gated block upsamples the frequency dimension back to F. Each gated block consists of a Conv2D Transpose layer and two Conv2D blocks; the structure of the Conv2D block in the decoder is the same as that in the encoder block. The gated block, shown in Figure 7, learns features from the encoder and suppresses their unwanted parts. After the gated blocks, the Conv2D layer compresses the number of channels to obtain X′_m, X′_r, and X′_i. Then, we multiply X′_m with the magnitude, Y_m, of the noisy speech to obtain the preliminary estimated magnitude.
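The decoupling-style reconstruction can be sketched as follows. The exact coupling is an assumed formulation based on the description here (masked magnitude combined with the noisy phase, then refined by the complex residual branches); function and variable names are illustrative.

```python
import numpy as np

def reconstruct(noisy_mag, noisy_phase, mag_mask, r_res, i_res):
    """Assumed decoupling-style reconstruction: apply the magnitude mask,
    couple with the noisy phase, then add the complex residuals."""
    mag = noisy_mag * mag_mask                 # preliminary magnitude estimate
    xr = mag * np.cos(noisy_phase) + r_res     # refined real component
    xi = mag * np.sin(noisy_phase) + i_res     # refined imaginary component
    return xr + 1j * xi

mag = np.array([2.0])        # toy noisy magnitude
phase = np.array([0.0])      # toy noisy phase
out = reconstruct(mag, phase, np.array([0.5]),   # mask X'_m
                  np.array([0.1]), np.array([0.0]))  # residuals X'_r, X'_i
print(out)
```

The magnitude path and the complex residual path are learned in parallel, which is what lets the network avoid the magnitude-phase compensation effect.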
As a supplement to spectrum estimation, the preliminary estimated magnitude is coupled with the noisy speech phase, Y_p, to obtain a roughly denoised complex spectrum. Then, it is added element-wise with the output (X′_r, X′_i) of the complex spectrum estimation decoder blocks to obtain the final complex spectrum:

$$\hat{X}_m = X'_m \odot Y_m$$
$$\hat{X}_r = \hat{X}_m \cos(Y_p) + X'_r, \qquad \hat{X}_i = \hat{X}_m \sin(Y_p) + X'_i$$

where X̂_m, X̂_r, and X̂_i represent the magnitude, the real component, and the imaginary component of the spectrum of the enhanced speech.

Two-Stage CGA Block
The two-stage CGA block consists of two cascaded CGAUs, namely, the time convolution-augmented gated attention unit (CGAU-T) and the frequency convolution-augmented gated attention unit (CGAU-F), as shown in Figure 8. First, the input feature map, D ∈ R^{B×T×F′×C}, is reshaped to D_T ∈ R^{BF′×T×C} and input into CGAU-T to capture the time dependence. Then, the output D_T^o and the input D_T are added element-wise and reshaped to D_F ∈ R^{BT×F′×C}, which is input into CGAU-F to capture the frequency dependence. Finally, the output D_F^o and the input D_F are added element-wise and reshaped to the final output, D_o ∈ R^{B×T×F′×C}.
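The reshaping in the two-stage block folds one axis into the batch dimension so that attention runs over the other axis; the round trip below verifies that the final reshape recovers the original layout (the CGAU computations themselves are elided).

```python
import numpy as np

# Reshaping used by the two-stage CGA block: time attention first
# (frequency folded into the batch), then frequency attention
# (time folded into the batch).
B, T, F, C = 2, 10, 4, 8
D = np.random.default_rng(0).standard_normal((B, T, F, C))

D_t = D.transpose(0, 2, 1, 3).reshape(B * F, T, C)    # (BF', T, C) for CGAU-T
# ... CGAU-T runs here; its output is added element-wise to D_t ...
D_f = D_t.reshape(B, F, T, C).transpose(0, 2, 1, 3).reshape(B * T, F, C)  # (BT, F', C)
# ... CGAU-F runs here; its output is added element-wise to D_f ...
D_o = D_f.reshape(B, T, F, C)                         # back to (B, T, F', C)

print(D_t.shape, D_f.shape, np.allclose(D, D_o))
```

Because each attention pass only ever sees one axis as its sequence dimension, the cost grows linearly in the other axis instead of quadratically in T·F′.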

The CGAU-T and CGAU-F blocks have the same structure but different reshaping operations. The structure of the CGAU, shown in Figure 9, is composed of a convolution block and a GAU, with the input connected to the output by a residual connection. We use the same structure as the convolution block in the conformer: it starts with layer normalization; the feature map is then fed into a gating mechanism composed of a point-wise convolution followed by a GLU; the output of the GLU is fed into a depth-wise convolution layer and activated by the swish function; finally, a point-wise convolution layer restores the channel number. Taking CGAU-T as an example, the input D_T is split into two branches, one of which is fed into the convolution block.
The query, key, and value are all derived from the convolution block output, D_T^{co} ∈ R^{BF′×T×C}:

$$Z = \phi(D_T^{co} W_z), \qquad V = \phi(D_T^{co} W_v)$$
$$Q = s_q \odot Z + b_q, \qquad K = s_k \odot Z + b_k$$

Scaled dot-product attention is then applied to the query, key, and value:

$$A = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + b\right) V$$

where φ represents the swish activation function, and W_z and W_v represent the learnable parameter matrices of the linear layers. Z is a shared representation; Q and K are simple affine transformations that apply per-dimension scalars and offsets to Z to obtain the query and the key. V represents the values in the self-attention mechanism, b represents the rotary position coding [32], and d represents the attention dimension. The other branch of the input D_T passes through a linear layer and is activated by swish to obtain U:

$$U = \phi(D_T W_u)$$

Finally, the Hadamard product of U and A is computed and fed into a linear layer to obtain the output D_T^o, so that the convolution-augmented attention information is introduced into the gated linear unit:

$$D_T^o = (U \circ A)\, W_o$$

where ∘ represents the Hadamard product. In addition, our proposed CGAU uses softmax as the attention activation function instead of the ReLU² used in GAU.

Metric Discriminator
The metric discriminator can mimic the metric score, which is nondifferentiable, so that it can be used as the loss function. In this paper, we use the normalized PESQ as the metric score. As shown in Figure 10, the metric discriminator consists of four Conv2D layers. The channels are 32, 64, 128, and 256. After four Conv2D layers, there is a global average pooling to handle the variable-length input. Finally, there are two linear layers and one sigmoid activation. When training the discriminator, we take both inputs as clean magnitudes to estimate the maximum metric score. Then, we take the inputs as the clean magnitudes and the enhanced magnitudes to estimate the metric score of the enhanced speech to approach their corresponding PESQ label. In addition, the trained generator can render enhanced speech approaching a normalized PESQ label of one.


Figure 10. Detailed structure of the metric discriminator.

Loss Function
The loss function of the generator includes three terms:

L_G = α L_TF + β L_GAN + γ L_time,

where α, β, and γ are the weighting coefficients of the three loss terms in the total loss. Here, we take α as 1, β as 0.05, and γ as 0.2. L_TF represents the combination of the magnitude loss, L_Mag, and the phase-aware loss, L_RI:

L_TF = m L_Mag + (1 − m) L_RI,

where m represents the chosen weight, and we take m = 0.7. L_GAN represents the adversarial loss:

L_GAN = E[ (D(X_m, X̂_m) − 1)² ],

where D represents the discriminator, and X_m and X̂_m denote the clean and enhanced magnitude spectra. Correspondingly, the adversarial loss in the discriminator is

L_D = E[ (D(X_m, X_m) − 1)² + (D(X_m, X̂_m) − Q_PESQ)² ],

where Q_PESQ is the normalized PESQ score, whose value range is [0, 1]. In addition, some studies show that adding a time loss, L_time, can improve the enhanced speech [20]:

L_time = E[ ‖x − x̂‖₁ ],

where x and x̂ denote the clean and enhanced time-domain waveforms.
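A compact NumPy sketch of the generator's weighted loss is given below. The excerpt does not fully specify the per-term forms, so the mean-squared magnitude/RI losses and the L1 time loss here are common choices assumed for illustration.

```python
import numpy as np

def generator_loss(mag_clean, mag_est, ri_clean, ri_est,
                   wav_clean, wav_est, d_score,
                   alpha=1.0, beta=0.05, gamma=0.2, m=0.7):
    """Weighted generator loss L_G = alpha*L_TF + beta*L_GAN + gamma*L_time.
    MSE for the magnitude/RI terms and L1 for the time term are assumptions;
    d_score is the discriminator output for the enhanced speech."""
    l_mag = np.mean((mag_clean - mag_est) ** 2)    # magnitude loss L_Mag
    l_ri = np.mean((ri_clean - ri_est) ** 2)       # phase-aware (real/imag) loss L_RI
    l_tf = m * l_mag + (1.0 - m) * l_ri            # combined TF loss L_TF
    l_gan = np.mean((d_score - 1.0) ** 2)          # adversarial term: push D toward 1
    l_time = np.mean(np.abs(wav_clean - wav_est))  # L1 time-domain loss [20]
    return alpha * l_tf + beta * l_gan + gamma * l_time
```

With identical clean and enhanced signals and a discriminator score of 1, every term vanishes, which is the fixed point the generator is trained toward.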

Experiments
In this section, we first introduce the composition of the Voice Bank + DEMAND dataset used for network training; then, we describe the training settings and the six evaluation indicators used for speech enhancement.

Datasets and Settings
The publicly available Voice Bank + DEMAND dataset was chosen to evaluate our model. The speech data come from the CSTR VCTK Corpus, and the background noise data come from the DEMAND database. The training set includes 11,572 utterances from 28 speakers, and the test set includes 824 utterances from 2 unseen speakers. The training set is generated with eight real and two artificial background noise types at SNR levels ranging from 0 to 15 dB in 5 dB steps. The test set is generated with five unseen background noise types at SNR levels ranging from 2.5 to 17.5 dB in 5 dB steps.
All utterances are resampled to 16 kHz. For the training set, we slice the utterances into 2 s segments; the test set is not sliced. We use a Hamming window of length 25 ms with a hop length of 6.25 ms. Since we apply power-law compression with a compression coefficient of 0.3 to the spectrum [33] after the STFT, the compression is reversed on the final estimated complex spectrum before the inverse STFT recovers the time-domain signal. In addition, we use the AdamW optimizer to train both the generator and the discriminator for 100 epochs with a batch size of four. The learning rate of the discriminator is set to 0.001, twice that of the generator. Both learning rates are halved every 30 epochs.
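The analysis front end described above can be sketched as follows. This NumPy version uses a plain framed FFT rather than a library STFT, and the exact framing/padding conventions of the paper may differ; the window and hop lengths follow the text.

```python
import numpy as np

def compress_spectrum(wav, sr=16000, win_ms=25, hop_ms=6.25, c=0.3):
    """Framed STFT with a Hamming window, then power-law magnitude
    compression with coefficient c [33]. Padding conventions are assumptions."""
    win_len = int(sr * win_ms / 1000)      # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)          # 100 samples at 16 kHz
    window = np.hamming(win_len)
    n_frames = 1 + (len(wav) - win_len) // hop
    frames = np.stack([wav[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)
    mag, phase = np.abs(spec), np.angle(spec)
    return (mag ** c) * np.exp(1j * phase)  # compressed complex spectrum

def decompress(spec, c=0.3):
    """Reverse the power-law compression on the estimated complex spectrum."""
    return (np.abs(spec) ** (1.0 / c)) * np.exp(1j * np.angle(spec))
```

Compression with c = 0.3 flattens the dynamic range of the magnitudes, which is commonly reported to ease network training; decompression restores the original scale before waveform reconstruction.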

Evaluation Indicators
We use six objective measures to evaluate the quality of the enhanced speech. For all metrics, higher scores indicate better performance.

Results and Discussion
In this section, we first conduct a comparison with baselines to verify the performance of the proposed model. Then, the effectiveness of CGA-MGAN is verified using ablation experiments. The experimental results are discussed and explained.

Baselines and Results Analysis
We objectively compare the proposed CGA-MGAN with existing models, including classical models, such as SEGAN [37], DCCRN [13], and Conv-TasNet [20], and state-of-the-art (SOTA) baselines. SEGAN is the earliest application of GANs to speech enhancement and is trained directly on time-domain waveforms. DCCRN builds a complex convolutional recurrent network that simultaneously recovers magnitude and phase information in the TF domain by estimating the CRM. Conv-TasNet is a fully convolutional end-to-end time-domain speech separation network that can also be used for speech enhancement.
For methods based on generative models, we choose four baselines: TDCGAN [38], MetricGAN+ [39], UNIVERSE [40], and CDiffuSE [41]. TDCGAN first introduced dilated convolution and depthwise separable convolution into the GAN framework to build a time-domain speech enhancement system; it dramatically reduces network parameters and achieves a better enhancement effect than SEGAN. MetricGAN+ is an improved MetricGAN network for speech enhancement. It not only uses enhanced and clean speech to train the discriminator but also uses noisy speech to minimize the distance between the discriminator and the target objective metrics. Moreover, MetricGAN+'s generator uses a learnable sigmoid function for mask estimation, which improves the generator's enhancement ability. UNIVERSE builds a generative score-based diffusion model with a multi-resolution conditioning network based on mixture density networks and achieves good speech enhancement results. CDiffuSE learns the characteristics of noise signals and incorporates them into the diffusion and reverse processes, making the model highly robust to changes in noise characteristics.
For methods based on improved transformers, we choose four baselines: SE-Conformer [25], DB-AIAT [17], DPT-FSNet [42], and DBT-Net [43]. SE-Conformer first introduced the convolution-augmented transformer to speech enhancement; the proposed architecture attends to the whole sequence through MHSA and uses convolutional neural networks to capture short-term and long-term temporal information. DB-AIAT constructs a dual-branch network using the decoupling-style phase-aware method: it first estimates the magnitude spectrum coarsely and then compensates for the spectral details missed by the magnitude-masking branch. DPT-FSNet integrates sub-band and full-band information in a transformer-based dual-branch frequency-domain speech enhancement network. DBT-Net proposes a dual-branch network to estimate magnitude and phase information; interaction modules allow features learned in one branch to facilitate information flow between the branches and suppress undesired components.
As shown in Table 1, compared with MetricGAN+, the improved version of MetricGAN, CGA-MGAN achieves a 0.32 higher PESQ score. In addition, compared with advanced generative models currently used in speech enhancement, CGA-MGAN achieves better performance. Finally, compared with recent methods based on improved transformers, such as DPT-FSNet, our method is also better in almost all evaluation scores while keeping a relatively small model size (1.14 M). * "-" denotes that the result was not provided in the original paper.

Ablation Study
To investigate the contribution of the different components of CGA-MGAN to enhancement performance, we conducted an ablation study. Several variants of the CGA-MGAN model are compared in Table 2: (i) removing the convolution block in CGAU, that is, replacing CGAU with GAU (w/o Conv. Block); (ii) replacing CGAU in the two-stage block with a conformer (using a conformer); (iii) removing the gating blocks in the decoder (w/o gating decoders); (iv) removing the phase compensation mechanism and enhancing only the magnitude spectrum (Mag-only); (v) removing the discriminator (w/o discriminator). All variants use the same configuration as CGA-MGAN. As shown in Table 2, all these variants underperform the proposed CGA-MGAN. Comparing CGA-MGAN with (i), a decrease of 0.1 in PESQ can be observed, because GAU cannot extract the short-term features of speech well once the convolution block is removed. Comparing CGA-MGAN with (ii), a decrease of 0.14 in PESQ can be observed. This shows that the proposed CGAU is superior to traditional convolution-augmented transformers for speech enhancement. Comparing CGA-MGAN with (iii) and (iv), we find that the gating blocks and the phase-aware method are essential to CGA-MGAN. The gating block retains important features from the encoder and suppresses irrelevant ones; it can also control the information flow of the network and model complex interactions, which is quite effective for improving the network's speech enhancement ability [22]. In addition, the decoupling-style phase-aware method processes the coarse-grained and fine-grained regions of the spectrum in parallel, so lost spectral details can be compensated for while avoiding the compensation effect between magnitude and phase [17]. Finally, comparing CGA-MGAN with (v), we observe that removing the discriminator has a negative impact on all reported scores, which confirms the advantage of using MetricGAN.
In addition, we compare the computational complexity of CGA-MGAN and variant (ii) using real-time factors (RTFs), measured by averaging five runs on an Intel Xeon Silver 4210 CPU. The RTF of CGA-MGAN is 0.017, while that of (ii) is 0.025. This shows that the proposed CGAU has lower computational complexity than traditional convolution-augmented transformers.
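The RTF measurement above can be reproduced with a few lines of standard-library Python. The `enhance_fn` argument is a hypothetical placeholder for the model's enhancement call; the averaging-over-runs scheme follows the text.

```python
import time

def real_time_factor(enhance_fn, audio_seconds, runs=5):
    """RTF = average processing time / audio duration over several runs.
    enhance_fn is a placeholder for the enhancement call on a fixed clip."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        enhance_fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / runs / audio_seconds
```

An RTF below 1 means the system processes audio faster than real time; the reported values (0.017 and 0.025) are both far below that threshold on CPU.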

Conclusions
In this work, we propose CGA-MGAN for speech enhancement, which combines the advantages of convolution and self-attention. Our approach pairs CGAU, which captures time and frequency dependencies with lower computational complexity and better results, with an encoder-decoder structure that includes gating blocks and adopts the decoupling-style phase-aware method, so that the magnitude and phase of clean speech are estimated collaboratively while avoiding the compensation effect. Experiments on Voice Bank + DEMAND show that the proposed CGA-MGAN performs excellently at a relatively small model size (1.14 M). In the future, we will further study the application of CGAU to other speech tasks, such as separation and dereverberation.