
DeCGAN: Speech Enhancement Algorithm for Air Traffic Control

Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Flight University of China, Jianyang 641400, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(5), 245; https://doi.org/10.3390/a18050245
Submission received: 11 March 2025 / Revised: 14 April 2025 / Accepted: 21 April 2025 / Published: 24 April 2025
(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Abstract

Air traffic control (ATC) communication is susceptible to speech noise interference, which undermines the quality of civil aviation speech. To resolve this problem, we propose a speech enhancement model, termed DeCGAN, based on the DeConformer generative adversarial network. The model's generator is built on the DeConformer module, which combines a time frequency channel attention (TFC-SA) module and a deformable convolution-based feedforward neural network (DeConv-FFN) to capture both long-range dependencies and local features of speech signals. The outputs from two branches, the mask decoder and the complex decoder, are combined to produce the enhanced speech signal. An evaluation metric discriminator is then used to derive speech quality evaluation scores, and adversarial training is applied to generate higher-quality speech. Experiments were performed to compare DeCGAN with other speech enhancement models on the ATC dataset. The experimental results demonstrate that the proposed model is highly competitive with existing models: DeCGAN achieves a perceptual evaluation of speech quality (PESQ) score of 3.31 and a short-time objective intelligibility (STOI) value of 0.96.

1. Introduction

A key technique in speech analysis and recognition, speech enhancement removes noises and undesired distortions to improve the quality of speech signals for enhanced communication. The technique can effectively eliminate or suppress unwanted acoustic components while preserving essential speech information, thereby improving the clarity and comprehensibility of speech.
Many models have been popular for speech enhancement, including Wiener filtering [1], spectral subtraction [2], Kalman filtering [3], minimum mean square error estimation [4], and subspace approaches [5], but these traditional methods suffer a significant decline in performance when the speech signals are contaminated by non-stationary noise or the signal-to-noise ratio (SNR) is low.
With exceptional abilities in modeling nonlinear relationships, deep learning represents a viable solution to speech enhancement. Deep learning speech enhancement models can be broadly divided into two types—time-domain models and time-frequency (TF) domain models. The former focuses on capturing the temporal features of speech, in which the raw speech waveforms are used as both inputs and outputs; the latter takes into account not only the temporal but also the spectral features in speech signals, using algorithms like short-time Fourier transform (STFT) to convert the signals from the time domain to the frequency domain and produce spectrograms that contain magnitude and phase information at various time and frequency points. This information serves as both input and output.
Despite their capacity to manage complex nonlinear noise, time-domain deep learning models frequently fall short of optimal performance. This shortcoming arises from their exclusive focus on temporal features, which prevents them from fully capturing spectral information and results in information loss. TF-domain deep learning approaches can be further categorized into two types: magnitude-domain models and complex-domain models. In the former, the magnitude spectra of clean speech are modeled directly, and the enhanced magnitude spectra are combined with the noisy phase before being converted back into time-domain signals by the inverse short-time Fourier transform (ISTFT). One shortcoming of magnitude-domain models is that they overlook the phase information, which has been found to play an important role in enhancing the quality of speech [6]. Complex-domain models make up for this drawback, as they consider both the magnitude and the phase information, but implementing these models can be very difficult because the phase components are random [7] and complex mask values are hard to estimate [8].
After the emergence of the Transformer model [9], numerous studies applied it to various speech-processing tasks. One notable representative is SepFormer [10], which utilizes a dual-path transformer architecture to effectively capture both local and global contextual dependencies in the speech signal. The model's self-attention mechanism and dual-path structure contribute to its superior performance in improving separation quality and speech intelligibility compared to previous methods. However, SepFormer primarily targets general speech separation tasks, and its design does not directly address the unique challenges of Air Traffic Control (ATC) communications, such as complex background noise, rapid speech rates, and channel distortions. A further limitation, shared by most deep learning speech enhancement models, is the lack of a direct correlation between the training objective and commonly used assessment indicators, which means that the assessment can fall short of expectations even with an optimized objective loss function.
To address this issue, Fu et al. proposed MetricGAN [11], a time-frequency domain speech enhancement model that leverages a discriminator to determine the evaluation metric function in order to facilitate more direct optimization of speech quality and intelligibility metrics, resulting in generated speech that aligns more closely with human auditory perception. Cao et al. [12] introduced CMGAN, a Conformer-based MetricGAN model. Similar to MetricGAN, CMGAN employs an evaluation metric-based optimization strategy within its metric discriminator, while integrating the advantages of the Conformer model. Specifically, in the time and frequency domains, the CMGAN adopts a dual-stage Conformer architecture to capture long-term dependencies and local features. However, since multi-scale and stage-wise modeling is employed in the CMGAN for the time and frequency domains, the direct correlation between the temporal and spectral information is considerably diminished. Furthermore, the lack of explicit modeling in the channel dimension restricts its ability to effectively capture global information.
Besides speech enhancement, recent research such as the Res-NeXt-Mssm-CTC [13] has shown satisfactory results in automatic speech recognition (ASR) for air traffic control. The Res-NeXt-Mssm-CTC mainly focuses on optimizing transcription accuracy directly. However, the applications of speech enhancement are broader than that of speech recognition in air traffic control scenarios. On the one hand, speech enhancement directly improves the clarity and intelligibility of communications between pilots and controllers. On the other hand, while speech recognition converts speech to text for transcription, its performance is highly dependent on the quality of the input signal. Speech enhancement can serve as a front-end pre-processing step, benefiting not only ASR systems but also other applications such as speaker verification, emotion detection, and situational awareness systems in air traffic control. Therefore, to meet this need, we propose a novel speech enhancement model to denoise speech in the air traffic control domain.
Given the reviews above, DeCGAN, a MetricGAN-style model based on the DeConformer architecture, is put forth in the present work to address speech enhancement. At the core of the DeCGAN model is the combination of a generator and a metric discriminator. Unlike conventional dual-branch models, which are typically designed separately for magnitude masking and complex spectrum optimization, the generator employs a shared encoder that processes the concatenated magnitude and complex (real and imaginary) components as input; the metric discriminator excels in estimating and optimizing a black-box non-differentiable metric without adversely affecting other metrics. Meanwhile, to address the limitations of traditional Conformer models, our DeCGAN has a DeConformer module in which the Time Frequency Channel Attention (TFC-SA) and Deformable Convolution-based Feedforward Neural Network (DeConv-FFN) modules are combined to model long-term dependencies across the dimensions of time, frequency, and channel. This design facilitates more efficient local feature processing and enhances the sequential extraction of temporal and spectral information from the encoder. Subsequently, the architecture splits into a dedicated mask decoder for magnitude estimation and a complex decoder that refines both the real and imaginary components.
The main contributions of our methods are as follows:
  • This paper proposes DeCGAN, whose DeConformer-based generator integrates TFC-SA and DeConv-FFN, enabling the simultaneous capture of global long-range dependencies and local fine-grained details in speech signals.
  • By employing a mask decoder and complex decoder, we effectively overcome the phase recovery challenges in conventional complex-domain methods, thereby achieving enhanced speech of higher quality.
  • The experimental results demonstrate that DeCGAN outperforms other algorithms in processing civil aviation control speech.
The paper is organized as follows. Section 2 provides the details of our DeCGAN model. In Section 3, we demonstrate the effectiveness of our method through a series of experiments. Then, Section 4 utilizes several ablation studies to further analyze the framework of DeCGAN. We conclude our paper and discuss future research directions in Section 5.

2. Materials and Methods

In an air traffic control environment, complex background noise poses significant challenges to effectively capturing and utilizing multi-dimensional information in the time-frequency domain, including the temporal, spectral, channel, and phase features that are essential for speech enhancement. Therefore, in the present work, the DeCGAN model is proposed specifically to manage complex noise issues in ATC communications.
The DeCGAN model comprises two primary components: a generator, based on the DeConformer module, and a discriminator, informed by evaluation metric scores. The generator extracts multi-level time-frequency features to produce enhanced speech, while the discriminator assesses the resulting enhanced speech using objective metrics. This adversarial process encourages the generator to create higher-quality speech. The subsequent sections of this article offer a comprehensive introduction to the model’s architecture and design principles.

2.1. Generator

The generator architecture of DeCGAN is illustrated in Figure 1. Specifically designed to capture features in the time-frequency domain and generate enhanced speech signals, the generator comprises a densely connected dilated convolutional encoder, a DeConformer module, a mask decoder, and a complex decoder. The generator first uses the densely connected dilated convolutional encoder to process the input spectrograms. After that, the DeConformer module refines the extracted features. The network then splits into two branches: one branch passes the features through the mask decoder, and the other through the complex decoder. Finally, the outputs from both decoders are merged and converted back to the time domain using an inverse short-time Fourier transform. The details of the generator architecture are shown below.
In the DeCGAN model, for a given noisy speech signal y \in \mathbb{R}^{L \times 1}, its representation in the time domain is converted into the frequency domain using the STFT; in this process, a complex spectrogram Y_0 \in \mathbb{R}^{T \times F \times 2} that encompasses both magnitude and phase information is generated, where T and F represent the time and frequency dimensions. Subsequently, power-law compression, a method used to compress the dynamic range of the magnitude spectrum so that smaller signals are relatively amplified and larger signals are slightly attenuated, is employed to derive Y, the compressed spectrogram [14]:
Y = |Y_0|^c e^{j Y_p} = Y_m e^{j Y_p} = Y_r + j Y_i ,   (1)
where Y_m, Y_p, Y_r, and Y_i are the magnitude, phase, real part, and imaginary part, respectively, of the noisy spectrogram after power-law compression. The power-law compression coefficient, denoted as c, is set to 0.3, in accordance with the literature [14]. This also helps stabilize neural network training by preventing excessively large or small values in the feature representation. The power-law compression of this complex spectrogram ensures that the relative importance of quieter sounds is aligned with that of louder sounds, thereby better matching human auditory perception. Y_0 represents the original complex spectrogram obtained directly from the STFT, which retains the full amplitude and phase information of the input signal. In contrast, Y_m = |Y_0|^c is derived by applying power-law compression to the magnitude component of Y_0. Each variable serves a distinct purpose in this dual representation: Y_0 is essential for accurate phase reconstruction and retaining the complete spectral characteristics, while Y_m provides a compressed magnitude that aligns better with human auditory perception and facilitates more robust network training. Together, they enable the model to separately and effectively process amplitude and phase information during the enhancement process.
The compressed real component Y_r and imaginary component Y_i of the spectrogram are concatenated with the magnitude Y_m to create the input feature [Y_m, Y_r, Y_i] \in \mathbb{R}^{B \times T \times F \times 3}. Here, the batch size is defined as B; the batch input to the model can be denoted as Y_{in} \in \mathbb{R}^{B \times T \times F \times 3}, which is the generator's input.
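As a concrete illustration, the following is a minimal sketch of this input pipeline in PyTorch (the framework used in Section 3). The tensor names mirror the notation above, while the clamping constant and the exact tensor layout are assumptions made for the example.

```python
import torch

def prepare_generator_input(y: torch.Tensor, n_fft: int = 512, hop: int = 256, c: float = 0.3):
    """y: noisy waveform of shape (B, L); returns the (B, T, F, 3) feature [Y_m, Y_r, Y_i]."""
    window = torch.hamming_window(n_fft, device=y.device)
    # Complex spectrogram Y_0 of shape (B, F, T)
    Y0 = torch.stft(y, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
    mag, phase = Y0.abs(), Y0.angle()
    # Power-law compression of the magnitude (Equation (1)); the noisy phase Y_p is kept
    Y_m = mag.clamp(min=1e-8) ** c
    Y_r, Y_i = Y_m * torch.cos(phase), Y_m * torch.sin(phase)
    # Stack the three channels and rearrange to (B, T, F, 3)
    return torch.stack([Y_m, Y_r, Y_i], dim=-1).permute(0, 2, 1, 3)
```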
Encoder: As shown in Figure 1, an encoder, comprising two convolutional modules and a dilated convolution module at its center, is used to extract features from Y_{in}. This dilated convolution module comprises four dilated convolution blocks, with {1, 2, 4, 8} as the corresponding dilation factors. Through dense residual connections, the output from each layer is directly transmitted to all subsequent convolution blocks, effectively aggregating feature information from various levels and thereby capturing the multi-level features of the spectrum. The dilated convolution module further enhances the ability to capture global contextual features while maintaining a balance between kernel size and computational complexity. The final convolution block compresses the frequency dimension to F' = F/2 to reduce computational complexity. The feature map generated by the encoder is represented as F_{in} \in \mathbb{R}^{B \times T \times F' \times C} and subsequently passes into the DeConformer module.
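The densely connected dilated convolution module can be sketched as follows; only the dilation factors {1, 2, 4, 8} and the dense connections are taken from the text, while the kernel size, normalization, and activation are assumptions.

```python
import torch
import torch.nn as nn

class DenseDilatedBlock(nn.Module):
    def __init__(self, channels: int = 64, kernel: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i, d in enumerate([1, 2, 4, 8]):
            self.blocks.append(nn.Sequential(
                # input width grows because every earlier output is concatenated (dense connection)
                nn.Conv2d(channels * (i + 1), channels, kernel, padding=d, dilation=d),
                nn.InstanceNorm2d(channels),
                nn.PReLU(),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for block in self.blocks:
            out = block(torch.cat(feats, dim=1))   # aggregate features from all previous levels
            feats.append(out)
        return feats[-1]
```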
DeConformer: As mentioned before, the traditional Conformer model can capture long-range dependencies while extracting local features, making it a popular approach for speech recognition and separation. However, this approach has to address two challenges when applied to speech enhancement: first, the correlations between the inputs of the model will be reduced because of the stage-wise modeling of signals in the time and frequency domains, and capturing global information is difficult because of the lack of modeling of signals in the channel dimension; second, in the modeling of detailed information, the model is over-reliant on the self-attention mechanism for local feature extraction, which is inefficient in terms of computational resource utilization. Furthermore, exploiting the autocorrelation of noisy data to mine local feature correlations may introduce errors, thereby affecting the accuracy of the modeling. To address these issues, the DeConformer module is proposed, as illustrated in Figure 2. This DeConformer module comprises the TFC-SA and DeConv-FFN. The TFC-SA processes the input feature map in three parallel branches along the channel, time, and frequency dimensions. Each branch computes a one-dimensional attention feature via global pooling and a pointwise convolution, then merges them through elementwise multiplication and a final pointwise convolution layer. Next, the output is passed to the DeConv-FFN, which contains two parallel branches of deformable convolution layers. One branch has two deformable convolutions, and the other has four. Each branch produces an output feature map, which is then fed into a channel-attention block. Finally, the outputs from these branches are combined, and a residual connection adds the module's input back to the result, concluding one DeConformer block. The details of the TFC-SA and DeConv-FFN modules are given below.
Specifically, we define the input feature to the i-th DeConformer module as f_{i-1}; then, the operation of the DeConformer module is as follows:
\hat{f}_i = f_{i-1} + \text{TFC-SA}(\text{LN}(f_{i-1})) ,   (2)
f_i = \hat{f}_i + \text{DeConv-FFN}(\text{LN}(\hat{f}_i)) ,   (3)
where LN represents layer normalization, \hat{f}_i is the output feature of TFC-SA, and f_i is that of DeConv-FFN.
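A minimal sketch of this residual structure is given below, assuming channel-last tensors of shape (B, T, F', C) so that nn.LayerNorm acts on the channel dimension; the TFC-SA and DeConv-FFN submodules are passed in (illustrative sketches of both follow later in this section).

```python
import torch.nn as nn

class DeConformerBlock(nn.Module):
    def __init__(self, channels: int, tfc_sa: nn.Module, deconv_ffn: nn.Module):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(channels), nn.LayerNorm(channels)
        self.tfc_sa, self.deconv_ffn = tfc_sa, deconv_ffn

    def forward(self, f_prev):
        # Equation (2): f_hat_i = f_{i-1} + TFC-SA(LN(f_{i-1}))
        f_hat = f_prev + self.tfc_sa(self.norm1(f_prev))
        # Equation (3): f_i = f_hat_i + DeConv-FFN(LN(f_hat_i))
        return f_hat + self.deconv_ffn(self.norm2(f_hat))
```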
The structure of TFC-SA is also shown in Figure 2. For a given feature map F_{in} \in \mathbb{R}^{B \times T \times F' \times C}, C, T, and F' are the number of channels, time frames, and frequency units, respectively. This module consists of three branches that generate one-dimensional energy features: F_c \in \mathbb{R}^{C \times 1} for the channel dimension, F_t \in \mathbb{R}^{T \times 1} for the time dimension, and F_f \in \mathbb{R}^{F' \times 1} for the frequency dimension. Each branch operates independently and computes queries and keys via global pooling and a 1 × 1 convolutional layer. The attention feature maps are generated through an activation function: M_c \in \mathbb{R}^{C \times C} is the attention feature map for the channel branch, M_t \in \mathbb{R}^{T \times T} for the time branch, and M_f \in \mathbb{R}^{F' \times F'} for the frequency branch. In parallel, F_{in} is processed by a 2D 1 × 1 convolutional layer to obtain the value feature F_v. To generate the final result, F_v is multiplied sequentially by M_c, M_t, and M_f, and another 2D 1 × 1 convolutional layer processes the multiplication result to yield the output F_{out} \in \mathbb{R}^{B \times T \times F' \times C}. In this manner, TFC-SA computes a global relationship based on the features of each dimension.
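A simplified sketch of this attention pattern for a channel-first (B, C, T, F') tensor is shown below. For brevity, each attention map is built directly from the pooled energy feature with a softmax, so the separate 1 × 1 query/key projections described above are omitted; the sketch illustrates the mechanism rather than the exact implementation.

```python
import torch
import torch.nn as nn

class TFCSelfAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.value = nn.Conv2d(channels, channels, 1)   # 1x1 conv producing F_v
        self.out = nn.Conv2d(channels, channels, 1)     # final 1x1 conv producing F_out

    @staticmethod
    def _attn(x_1d: torch.Tensor) -> torch.Tensor:
        # x_1d: (B, D) pooled energy feature -> (B, D, D) attention map over dimension D
        q, k = x_1d.unsqueeze(2), x_1d.unsqueeze(1)     # (B, D, 1) and (B, 1, D)
        return torch.softmax(q * k, dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m_c = self._attn(x.mean(dim=(2, 3)))            # channel attention map M_c: (B, C, C)
        m_t = self._attn(x.mean(dim=(1, 3)))            # time attention map M_t: (B, T, T)
        m_f = self._attn(x.mean(dim=(1, 2)))            # frequency attention map M_f: (B, F', F')
        v = self.value(x)
        v = torch.einsum("bcd,bdtf->bctf", m_c, v)      # apply channel attention
        v = torch.einsum("bts,bcsf->bctf", m_t, v)      # apply time attention
        v = torch.einsum("bfs,bcts->bctf", m_f, v)      # apply frequency attention
        return self.out(v)
```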
The DeConv-FFN module, inspired by [15], introduces a channel attention mechanism to facilitate adaptive channel selection while enhancing feature modeling capabilities through local processing with varying receptive field sizes. The module's architecture is also available in Figure 2. In DeConv-FFN, there are two parallel branches, each of which has a varied number of deformable convolutional layers to produce feature maps with distinct receptive field sizes. The outputs of these branches are independently processed using the channel attention mechanism. For the input feature I, the two branches have two and four deformable convolution layers, respectively, which concurrently generate the outputs f_2 and f_4:
f_2 = H_2(I), \quad f_4 = H_4(I) ,   (4)
Specifically, the input feature I is fed into each branch separately. H_2 applies two deformable convolution blocks in sequence, while H_4 applies four. Each block expands the sampling grid via learned offsets, enabling the network to capture more flexible local patterns. After passing through these respective blocks, the two branches output their feature maps as f_2 and f_4.
After calculating f_2 and f_4, we follow the steps specified in Equation (5) to compute the output S:
v = \text{GlobalPooling}(\text{Concat}(f_2, f_4)),
w_2, w_4 = \text{sigmoid}(H_{FC1}(v)), \text{sigmoid}(H_{FC2}(v)),
S = f_2 \cdot w_2 + f_4 \cdot w_4 ,   (5)
where H_{FC1} and H_{FC2} represent two different fully connected layers. Therefore, we use two separate fully connected layers to produce two scalar values, w_2 and w_4; the sigmoid function is then applied to each scalar individually. Unlike conventional and dilated convolution using standard or zero-padding sampling grids, deformable convolution enhances the sampling grid through trainable offsets, which are learned by additional convolutional layers from the input feature maps to enable the capturing of information beyond regular local neighborhoods. Through this mechanism, the receptive field is expanded, and the speech features can be modeled more flexibly.
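The two-branch fusion of Equations (4) and (5) can be sketched with torchvision's DeformConv2d, where the sampling offsets are predicted by an auxiliary convolution; the kernel size, activation, and offset layout are assumptions not specified in the text.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """One deformable convolution whose sampling offsets are learned from the input."""
    def __init__(self, channels: int, kernel: int = 3):
        super().__init__()
        # 2 offsets (x, y) per kernel position
        self.offset = nn.Conv2d(channels, 2 * kernel * kernel, kernel, padding=kernel // 2)
        self.deform = DeformConv2d(channels, channels, kernel, padding=kernel // 2)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.deform(x, self.offset(x)))

class DeConvFFN(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.branch2 = nn.Sequential(*[DeformBlock(channels) for _ in range(2)])  # H_2
        self.branch4 = nn.Sequential(*[DeformBlock(channels) for _ in range(4)])  # H_4
        self.fc2 = nn.Linear(2 * channels, 1)   # H_FC1
        self.fc4 = nn.Linear(2 * channels, 1)   # H_FC2

    def forward(self, x):
        f2, f4 = self.branch2(x), self.branch4(x)
        v = torch.cat([f2, f4], dim=1).mean(dim=(2, 3))          # GlobalPooling(Concat(f2, f4))
        w2, w4 = torch.sigmoid(self.fc2(v)), torch.sigmoid(self.fc4(v))
        return f2 * w2[..., None, None] + f4 * w4[..., None, None]  # S = f2*w2 + f4*w4
```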
Decoder: Using the DeConformer module, both local and global time-frequency features are deeply extracted and fed into the mask decoder and the complex decoder, respectively. The former predicts a mask to enhance the magnitude feature of the speech signal: the mask is applied to the time-frequency points of the input spectrum to augment effective information while suppressing background noise. The complex decoder focuses on processing the real and imaginary parts of the complex spectrum, while also considering phase information recovery to ensure the integrity of the recovered spectrum. Although the two decoders serve different functions, they share a similar structure, as shown in Figure 1. Both decoders have four dilated convolution blocks, with {1, 2, 4, 8} as the dilation factors; they use convolutional layers to upsample the frequency dimension, restoring it to the original frequency size F. In the last convolutional layer of the mask decoder, the output channel is set to 1, and a convolutional layer with a parametric rectified linear unit (PReLU) activation function is used to generate the final amplitude mask. Through element-wise multiplication of the amplitude mask Mask \in \mathbb{R}^{T \times F \times 1} with the magnitude spectrum of the original speech input Y_m, background noise is suppressed and the enhanced magnitude \hat{X}^{Mask} \in \mathbb{R}^{T \times F \times 1} is obtained. The masked magnitude is then multiplied by trigonometric functions of the phase, which can take on positive or negative values, thereby reconstructing a complex spectrum with the correct phase. In parallel, the complex decoder processes the real and imaginary parts of the complex spectrum and outputs the real part \hat{X}_r^{Comp} \in \mathbb{R}^{T \times F \times 1} and the imaginary part \hat{X}_i^{Comp} \in \mathbb{R}^{T \times F \times 1}. Finally, the magnitude-enhanced complex spectrum is added element-wise to (\hat{X}_r^{Comp}, \hat{X}_i^{Comp}) to obtain the final complex spectrum (\hat{X}_r, \hat{X}_i), as shown in Equation (6):
\hat{X}_r^{Mask} = \hat{X}^{Mask} \cos(\theta_X), \quad \hat{X}_i^{Mask} = \hat{X}^{Mask} \sin(\theta_X),
\hat{X} = \hat{X}_r + j \hat{X}_i = (\hat{X}_r^{Mask} + \hat{X}_r^{Comp}) + j(\hat{X}_i^{Mask} + \hat{X}_i^{Comp}) .   (6)
The reconstructed complex spectrum \hat{X} undergoes inverse power-law compression, after which the frequency-domain signal is converted back to the time domain via the ISTFT to obtain the enhanced clean speech \hat{x}.
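Putting the decoder outputs together, Equation (6) followed by inverse compression and the ISTFT can be sketched as below; the (B, T, F) tensor shapes and the use of the noisy phase as θ_X are assumptions made for the illustration.

```python
import torch

def reconstruct_waveform(mask, Y_m, theta, comp_r, comp_i, n_fft=512, hop=256, c=0.3):
    """mask, Y_m, theta, comp_r, comp_i: (B, T, F) tensors; theta is the phase used as theta_X."""
    X_mask = mask * Y_m                                    # enhanced (compressed) magnitude
    X_r = X_mask * torch.cos(theta) + comp_r               # Equation (6), real part
    X_i = X_mask * torch.sin(theta) + comp_i               # Equation (6), imaginary part
    spec = torch.complex(X_r, X_i)
    # Inverse power-law compression of the magnitude, keeping the reconstructed phase
    spec = torch.polar(spec.abs().clamp(min=1e-8) ** (1.0 / c), spec.angle())
    window = torch.hamming_window(n_fft, device=spec.device)
    # ISTFT expects (B, F, T) complex input
    return torch.istft(spec.transpose(1, 2), n_fft=n_fft, hop_length=hop, window=window)
```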

2.2. Discriminator

There is typically a lack of direct correlation between the objective function and common assessment indicators in speech enhancement. Therefore, an assessment result is likely to fall short of expectations when dealing with an optimized objective loss. Furthermore, non-differentiable metrics such as the perceptual evaluation of speech quality (PESQ) [16] and short-time objective intelligibility (STOI) [17] cannot be directly used as loss functions. Therefore, a discriminator is designed here, based on the MetricGAN model. By incorporating the assessment metrics into adversarial training, MetricGAN enhances speech, and DeCGAN further improves this approach. Specifically, the discriminator in DeCGAN is designed to model evaluation metrics, adding the evaluation score to the loss function during model training to improve speech-denoising performance.
Figure 3 shows the structure of the DeCGAN discriminator. As the figure shows, the discriminator comprises four convolutional modules, with the number of channels set at 16, 32, 64, and 128. Each convolutional module includes convolutional layers, instance normalization layers, and a PReLU activation function. The convolutional layers extract time-frequency features, the normalization layers normalize the features, and the PReLU activation function helps alleviate the vanishing gradient problem. After these convolutional blocks, a global average pooling layer aggregates the feature maps into a fixed-length feature vector, which is crucial for handling variable-length inputs. This vector is then fed into two fully connected layers that further process the features, and finally, a sigmoid activation function produces a scalar output representing the predicted speech quality score. This discriminator plays an important role in guiding the generator during adversarial training by directly linking the generated speech to perceptual quality metrics.
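A minimal sketch of such a metric discriminator is given below; the kernel size, stride, fully connected width, and the choice of stacking the clean and enhanced magnitude spectra as two input channels are assumptions.

```python
import torch
import torch.nn as nn

class MetricDiscriminator(nn.Module):
    def __init__(self, in_channels: int = 2):             # clean + enhanced magnitude spectra
        super().__init__()
        layers, prev = [], in_channels
        for ch in [16, 32, 64, 128]:                       # four convolutional modules
            layers += [nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1),
                       nn.InstanceNorm2d(ch),
                       nn.PReLU()]
            prev = ch
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Linear(128, 64), nn.PReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x_clean_mag, x_enh_mag):
        x = torch.stack([x_clean_mag, x_enh_mag], dim=1)   # (B, 2, T, F)
        feat = self.features(x).mean(dim=(2, 3))           # global average pooling -> (B, 128)
        return self.head(feat)                             # predicted quality score in (0, 1)
```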
During the training of the discriminator, the amplitude spectra of clean speech signals are used as inputs with the maximum PESQ/STOI metric score as the target. The metric scores for enhanced speech are then estimated, using the amplitude spectra of the denoised speech signals as inputs, so that the generated speech is driven toward the score values of clean speech signals.

2.3. Loss Function

In the present work, a generator loss function and a metric discriminator loss function, based on the output complex spectrum (\hat{X}_r, \hat{X}_i), are designed to improve the post-training quality of speech. The generator loss function consists of the combined loss, the adversarial loss, and the time-domain loss:
L_G = \alpha L_{TF} + \beta L_{GAN} + \gamma L_{Time} ,   (7)
where \alpha, \beta, and \gamma represent the weight coefficients of the three loss terms. For convenience, we set these weight coefficients as \alpha = 1, \beta = 0.05, and \gamma = 0.2. After that, we normalize the loss function using softmax. L_{TF} represents the combined loss of the magnitude loss L_{Mag} and the phase-aware loss L_{RI}:
\hat{X}_m = \sqrt{\hat{X}_r^2 + \hat{X}_i^2},
L_{Mag} = \mathbb{E}_{X_m, \hat{X}_m}\left[ \| X_m - \hat{X}_m \|^2 \right],
L_{RI} = \mathbb{E}_{X_r, \hat{X}_r}\left[ \| X_r - \hat{X}_r \|^2 \right] + \mathbb{E}_{X_i, \hat{X}_i}\left[ \| X_i - \hat{X}_i \|^2 \right],
L_{TF} = m L_{Mag} + (1 - m) L_{RI} ,   (8)
where \hat{X}_m and X_m represent the magnitude spectrograms of the denoised and clean speech samples, respectively; m is the weight coefficient, which was found to allow the model to reach the best speech-denoising performance when set to 0.7. X_r and X_i represent the real and imaginary parts of the complex spectrum of clean speech samples, respectively.
The adversarial loss of the generator is defined as L_{GAN}:
L_{GAN} = \mathbb{E}_{X_m, \hat{X}_m}\left[ \| D(X_m, \hat{X}_m) - 1 \|^2 \right] ,   (9)
where D denotes the discriminator. Accordingly, the adversarial loss expression for the discriminator is:
L_D = \mathbb{E}_{X_m}\left[ \| D(X_m, X_m) - 1 \|^2 \right] + \mathbb{E}_{X_m, \hat{X}_m}\left[ \| D(X_m, \hat{X}_m) - Q_{PESQ} \|^2 \right] ,   (10)
where Q_{PESQ} represents the normalized PESQ score. In this algorithm, the PESQ score is normalized to the range [0, 1]. Additionally, it has been reported that introducing a time-domain loss can significantly optimize the speech enhancement performance, which can be expressed as L_{Time}:
L_{Time} = \mathbb{E}_{x, \hat{x}}\left[ \| x - \hat{x} \|_1 \right] ,   (11)
where \hat{x} represents the enhanced time-domain speech signal, and x represents the target signal, i.e., the clean time-domain signal. In this context, \| \cdot \|_1 denotes the \ell_1-norm, while \| \cdot \|_2 denotes the \ell_2-norm. We intentionally did not incorporate alternative loss functions such as L1/L2 loss or perceptual loss. On the one hand, while L1/L2 losses are commonly used in GAN frameworks, these loss functions tend to measure only pixel-level or signal-level differences and may fall short of capturing the semantic content and perceptual characteristics of complex speech signals. This limitation is particularly critical in air traffic control communications, where preserving intelligibility and subtle speech nuances is paramount. On the other hand, although perceptual loss can potentially offer a more human-aligned evaluation by comparing high-level features, its computational complexity makes it less suitable for real-time processing requirements. Therefore, we designed our loss function to balance adversarial training with reconstruction and frequency-domain consistency, ensuring both enhanced speech quality and operational efficiency under the specific conditions of our study.
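For reference, the training losses of Equations (7)-(11) can be sketched as follows; the mapping of PESQ to [0, 1] and the use of standard MSE/L1 helpers are assumptions, and D denotes the metric discriminator.

```python
import torch
import torch.nn.functional as F

def generator_loss(D, X_m, X_r, X_i, Xh_m, Xh_r, Xh_i, x, x_hat,
                   alpha=1.0, beta=0.05, gamma=0.2, m=0.7):
    L_mag = F.mse_loss(Xh_m, X_m)
    L_ri = F.mse_loss(Xh_r, X_r) + F.mse_loss(Xh_i, X_i)
    L_tf = m * L_mag + (1 - m) * L_ri                        # Equation (8)
    L_gan = torch.mean((D(X_m, Xh_m) - 1.0) ** 2)            # Equation (9)
    L_time = F.l1_loss(x_hat, x)                             # Equation (11)
    return alpha * L_tf + beta * L_gan + gamma * L_time      # Equation (7)

def discriminator_loss(D, X_m, Xh_m, pesq_score):
    q = (pesq_score + 0.5) / 5.0                             # assumed PESQ-to-[0, 1] normalization
    return torch.mean((D(X_m, X_m) - 1.0) ** 2) + torch.mean((D(X_m, Xh_m) - q) ** 2)  # Equation (10)
```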

3. Results and Analysis

Experiments were performed on a Linux Ubuntu 20.04 operating system with CUDA 11.4 and the PyTorch 1.11 framework. The hardware configuration comprises an Intel Xeon Silver 4110 CPU and two NVIDIA RTX 2080 Ti GPUs, each equipped with 11 GB of dedicated memory.

3.1. Dataset

The dataset utilized in this study comprises two components. The first component is an extended version of the VoiceBank [18] + Demand [19] dataset. Given the limited availability of ATC-related speech data, and to more accurately simulate environmental noise interference in ATC communications, the noise types in the VoiceBank + Demand dataset were categorized and expanded. Specifically, typical interference noises pertinent to real ATC scenarios, such as coughing, clapping, yawning, machine noise, mouse clicks, keyboard typing, and background conversations, were selected. This resulted in a total of 10,375 noisy-clean audio pairs, with audio durations ranging from 4 to 10 s.
To address the complex electromagnetic interference and rapid speech rates commonly observed in ATC communications, various real ATC environmental noises, including electromagnetic interference and engine noise, were extracted and mixed with the VoiceBank + Demand dataset at SNRs of 0 dB, 5 dB, 10 dB, and 15 dB to simulate authentic ATC noise conditions. Furthermore, the speech rate in the dataset was adjusted to reflect the fast-paced nature of ATC communications. The enhanced VoiceBank + Demand dataset was subsequently utilized for both model training and testing. Importantly, the noise samples in our dataset were randomly balanced across the different categories. Each type of interference noise was uniformly sampled to ensure a roughly equal number of noisy-clean pairs for each noise class. Moreover, rigorous stratified sampling techniques were employed during dataset construction to maintain a consistent representation of all noise types. This balanced design not only reflects the diverse acoustic environments encountered in real-world ATC communications but also effectively mitigates the risk of model overfitting to any specific noise category. Therefore, the extended version of the VoiceBank + Demand dataset supports robust model training with improved generalization capability, ensuring that the model performs reliably across various noise conditions encountered in actual ATC scenarios.
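As an illustration of this mixing step, the sketch below scales a noise clip so that it is added to a clean utterance at a target SNR; the function name and the looping of short noise clips are illustrative choices rather than details from the paper.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Loop/trim the noise to the length of the clean utterance
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_clean / p_scaled_noise) equals snr_db
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g., noisy = mix_at_snr(clean, engine_noise, snr_db=5)
```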
The second part of the dataset comprises authentic ATC speech collected from the Southwest Air Traffic Management Bureau of the Civil Aviation Administration of China. This dataset includes 4000 noisy-clean audio pairs, with a mean duration of 5.4 s per sample. The median duration is 4.8 s per sample, and the standard deviation of sample durations is 1.2 s, indicating moderate variability in the sample lengths.
The test set was set to contain five types of noises relevant to real ATC scenarios but that were not present in the training set to assess the model’s generalization capacity. These noises were introduced to the speech data to create test samples, with the SNR set at 2.5 dB, 7.5 dB, 12.5 dB, and 17.5 dB. This approach aims to evaluate the model performance regarding speech enhancement under previously unseen noise conditions and to validate its robustness in complex, noisy environments.

3.2. Experimental Setup

All audio was resampled to 16 kHz for the present work. The training set audio was divided into 2-s segments, while the test set retained its original length without segmentation. The STFT module employed a Hamming window of length 512 with a window shift of 256 samples, resulting in a 50% overlap. After STFT processing, the generated spectrogram contained 257 frequency components, and the time dimension was determined by the actual duration of the audio. In the generator, the number of DeConformer modules was 5, the batch size was 4, and the number of channels was 64. For the metric discriminator, the numbers of channels were set to 16, 32, 64, and 128. During training, the AdamW optimizer was employed to optimize the generator and the discriminator over 100 epochs in total. The learning rates for the generator and the discriminator were set to 0.0005 and 0.001, respectively, with dynamic adjustments based on the training loss. In the generator loss L_G, the weights were set as \alpha = 1, \beta = 0.05, and \gamma = 2.
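The setup above can be summarized in a small configuration sketch; the placeholder modules stand in for the actual generator and metric discriminator, and the scheduler behind the dynamic learning-rate adjustments is not specified, so it is omitted.

```python
import torch
import torch.nn as nn

# Hyperparameters reported in Section 3.2
config = dict(sample_rate=16000, segment_seconds=2, n_fft=512, hop_length=256,
              window="hamming", num_deconformer_blocks=5, batch_size=4, channels=64,
              disc_channels=[16, 32, 64, 128], epochs=100,
              lr_generator=5e-4, lr_discriminator=1e-3)

# Placeholders; the actual DeCGAN generator and metric discriminator are assumed elsewhere.
generator, discriminator = nn.Linear(1, 1), nn.Linear(1, 1)
opt_g = torch.optim.AdamW(generator.parameters(), lr=config["lr_generator"])
opt_d = torch.optim.AdamW(discriminator.parameters(), lr=config["lr_discriminator"])
```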

3.3. Evaluation Metrics

In the present work, the quality of enhanced speech samples was evaluated by five popular metrics, as shown below:
  • PESQ: Perceptual evaluation of speech quality, which has values within the range of [−0.5, 4.5]; a higher value indicates better quality.
  • STOI: Short-time objective intelligibility, which is within the range of [0, 1]; the speech is considered to have been fully understood when it equals 1.
  • CSIG: Mean opinion score (MOS) [20] for the prediction of signal distortion, with a value ranging from 1 to 5.
  • CBAK: The MOS prediction of background noise intrusiveness, with a value ranging from 1 to 5.
  • COVL: The MOS prediction of the overall effect, which has a value within [1, 5].
A higher value of any of these metrics corresponds to higher performance.
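PESQ and STOI, in particular, can be computed with the widely used pesq and pystoi Python packages, as in the sketch below, assuming 16 kHz waveforms as NumPy arrays; the composite CSIG, CBAK, and COVL measures follow [20] and require an additional toolkit.

```python
from pesq import pesq
from pystoi import stoi

def evaluate(clean, enhanced, fs=16000):
    return {
        "PESQ": pesq(fs, clean, enhanced, "wb"),            # wide-band PESQ, roughly [-0.5, 4.5]
        "STOI": stoi(clean, enhanced, fs, extended=False),  # range [0, 1]
    }
```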

3.4. Results Analysis

Our proposed model was compared with several recent and representative open-source models to verify its effectiveness in enhancing speech. For each category of noise reduction method, two to three models were chosen. Representative time-domain speech enhancement methods include SEGAN [21], TSTNN [22], and DEMUCS [23]. For time-frequency domain speech enhancement methods, two state-of-the-art models, PHASEN [24] and CMGAN, were evaluated. Among the models that utilize metric discriminators, MetricGAN+ [25] was selected as the most representative model for comparison. Table 1 presents their respective performance.
According to the data presented in Table 1, the DeCGAN algorithm exhibits competitive performance across multiple evaluation metrics. This suggests that DeCGAN effectively minimizes signal distortion during speech enhancement and excels in suppressing background noise. For instance, in comparison to the PHASEN model, which is designed with magnitude and phase awareness, DeCGAN improves the PESQ (by 0.36), CSIG (by 0.46), CBAK (by 0.41), COVL (by 0.55), and STOI scores. This is attributable to the ability of the mask and complex decoders to receive more accurate magnitude and phase spectra. When compared to the more advanced CMGAN model, DeCGAN achieves higher scores for CSIG, CBAK, and COVL, indicating that the DeConformer module, through the integration of TFC-SA and deformable convolution, can more accurately model speech signals both globally and locally. Such capability is particularly evident in its ability to restore detailed information and minimize noise interference. Although our DeCGAN model achieves a slightly lower PESQ score than CMGAN and the same STOI score, it still exhibits significant improvement in these metrics compared to the other models. This indicates that, while CMGAN may slightly outperform DeCGAN in terms of perceptual quality as measured by PESQ, DeCGAN more effectively restores signal details and suppresses background noise, resulting in a more balanced overall enhancement performance.
In addition, a segment of air traffic control speech test data was randomly selected for visual analysis, and the enhancement results of different models were compared. The time-domain waveform and spectrogram of the audio enhanced by various models are presented below. Specifically, the following representative models were selected for comparative analysis: MetricGAN+, which is based on a metric discriminator; DEMUCS, a representative model of time-domain speech enhancement; PHASEN, a model for time-frequency domain complex speech enhancement; and CMGAN, the baseline model for this algorithm. Through visual analysis, a direct comparison of the performance of different models in speech enhancement can be made.
Figure 4 and Figure 5 show the time-domain waveforms and spectrograms of speech signals enhanced by different models.
Figure 4a and Figure 5a depict the signals for noisy speech, where much more noise is present in the time-domain waveforms and the spectrograms in comparison to the clean speech signals, as depicted in Figure 4f and Figure 5f. The noise was drawn from the test set, which included real ATC background noises that were not present during training. Examples of such noise sources include engine sounds and mechanical interference. Figure 4b and Figure 5b illustrate the speech enhancement results obtained from the DEMUCS model, which demonstrate distortion in several segments of the time-domain waveform. Although high-frequency noise is reduced in the spectrogram, a substantial amount of background noise remains, and the model’s effectiveness in addressing mid- to low-frequency noise is inadequate, resulting in significant residual background noise. In contrast, Figure 4c and Figure 5c present the speech enhancement results achieved with the MetricGAN+ model, showing improved performance in processing mid- to low-frequency bands. However, notable energy attenuation is observed in the high-frequency bands, likely due to the over-suppression of high-frequency noise. This over-suppression results in visible distortion and excessive attenuation in the time-domain waveform, leading to an overall processing effect that remains suboptimal. Figure 4d and Figure 5d illustrate the results from the PHASEN model, demonstrating strong performance in the mid- to high-frequency bands. However, there is residual energy present in the low-frequency bands, resulting in a mechanical, noise-like sound that affects perceptual quality. While some distortion is evident in the time-domain waveform, the overall performance surpasses that of the first two models. In contrast, Figure 4e and Figure 5e present the results from the baseline CMGAN model, showing effective denoising across all frequency bands. Although a small amount of noise residue remains in the low-frequency band, the overall enhancement closely resembles clean speech, and the time-domain waveform exhibits fewer distortions and instances of over-suppression compared to the earlier models.
Finally, Figure 4g and Figure 5g present the results from the DeCGAN model proposed in this paper. The time-domain waveform clearly illustrates that the model effectively suppresses distortion while preserving the details, bringing it closer to clean speech. The model successfully removes noise across all frequency bands and can effectively eliminate irrelevant phonemes during periods of silence. The overall denoising effect, in terms of both detail enhancement and noise reduction, is significantly superior. Through comparative experiments, it is evident that the proposed model outperforms others in speech enhancement, delivering superior denoising effects and improved speech quality and intelligibility.

4. Discussion

This discussion section explores the effectiveness of the different components of our DeCGAN model through ablation studies. The specific experimental settings are described below.
We first examined the contributions of the TFC-SA and DeConv-FFN blocks. Case 1 represents our DeCGAN model with its default configuration. For Case 2, we replaced the deformable convolution in the DeConv-FFN with dilated convolution; for Case 3, we completely removed all DeConv-FFN modules; finally, Case 4 introduced a triple-path self-attention (triple-path SA) mechanism to replace TFC-SA, which extends the dual-path Transformer structure by incorporating an additional Transformer path to process the channel dimension.
Next, the denoising performance of the two decoders was examined: in Case 5, only the mask decoder was retained, while in Case 6, only the complex decoder was retained. Finally, we removed the metric discriminator to observe its impact on the model's speech enhancement performance.
The experimental settings and results are presented in Table 2 and Table 3.
A comparison between Cases 1, 2, and 3 verified the effectiveness of the DeConv-FFN module, indicating that in speech enhancement tasks, deformable convolution is more effective than dilated convolution in extracting local speech features by capturing the feature dependencies of each point within a local region. Next, by comparing Cases 1 and 4, we found that Case 4 showed slightly lower performance than Case 1, which demonstrated the effectiveness of TFC-SA. However, when the DeConv-FFN module was removed, performance decreased significantly; in this case, the model relied solely on TFC-SA. Conversely, when TFC-SA was replaced with the triple-path SA, the performance approached that of the model with the DeConv-FFN module intact. These results reveal the effective role of the DeConformer in balancing the functions of the Transformer in extracting global features and of the CNN in extracting local features. By analyzing Cases 1, 7, and 8, we found that the performance barely improved when the number of DeConformer modules increased. Based on this analysis, the final model employs five DeConformer modules, achieving a balance between computational efficiency and feature extraction accuracy.
By comparing Case 5 and Case 6, we could evaluate the denoising performance of the two decoders. The mask decoder utilizes only the magnitude spectrum as input; the denoised frequency spectrum was obtained by combining the reconstructed magnitude spectrum with the original phase. In contrast, in the complex decoder, the complex spectrum was used as the input to output the reconstructed complex spectrum. When comparing the two approaches, it was observed that both the mask decoder, which relies solely on the magnitude spectrum and disregards phase information, and the complex decoder, which uses only the complex spectrum, exhibited lower performance than the DeCGAN model across all metrics. This finding suggests that the integration of the mask decoder and complex decoder within the DeCGAN model creates a complementary structure that is essential for significantly enhancing speech quality and intelligibility.
Finally, by comparing Case 7 and Case 1, the impact of the metric discriminator on speech enhancement was verified. A decline in all metrics was observed, indicating that incorporating evaluation scores into the model training, as designed with the metric discriminator, significantly improved the reconstruction quality of the speech.

5. Conclusions

In this paper, we present the DeCGAN speech enhancement algorithm, specifically designed for ATC. The algorithm effectively removes noise and irrelevant phonemes from audio, thereby restoring higher speech quality. In the proposed algorithm, a unique generator structure is introduced to restore both magnitude and phase information. The metric discriminator in adversarial training addresses the non-differentiability issue of evaluation metrics, ensuring that the generated speech better aligns with both subjective and objective perception standards. Our DeCGAN model has broad prospects for applications in civil aviation radio air-ground communication, as it can improve speech quality and intelligibility and reduce the chance of misjudgments, thereby ensuring efficient and safe communication.
In the future, to address the high computational costs caused by the overall complexity of the algorithm, we will explore model compression and optimization techniques to enhance real-time processing capabilities. Additionally, future work will focus on evaluating and adapting the model to diverse domains and noise environments to further extend its applicability beyond ATC communications. After that, we plan to evaluate the performance of our model under babble noise interference in future work, which may inspire future research on speech enhancement techniques across various noise environments. Lastly, we will also consider conducting a sensitivity analysis on the key parameters involved in the loss function, to better understand their impact on model performance. Such investigations will not only support the reproducibility and interpretability of our framework but will also provide a stronger foundation for designing more effective and perceptually aligned speech enhancement systems in future studies.

Author Contributions

Conceptualization, H.L. and Y.H.; methodology, Y.H.; software, J.K.; validation, H.L., Y.H. and H.C.; formal analysis, Y.H.; investigation, Y.H.; resources, H.L.; data curation, Y.H.; writing—original draft preparation, Y.H. and H.C.; writing—review and editing, H.L. and J.K.; visualization, Y.H.; supervision, H.C. and J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been supported by the Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Administration of China (No. F2024KF01A).

Data Availability Statement

This study utilized the following open-source dataset: the VoiceBank + Demand dataset: “https://datashare.ed.ac.uk/handle/10283/1942 (accessed on 1 March 2025)”. The private dataset used in this study is not publicly available, due to privacy restrictions. For more information, please contact the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ATC: Air Traffic Control
TFC-SA: Time Frequency Channel Attention
DeConv-FFN: Deformable Convolution-based Feedforward Neural Network
SNR: Signal-to-Noise Ratio
TF: Time Frequency
STFT: Short-Time Fourier Transform
CNN: Convolutional Neural Network
GAN: Generative Adversarial Network
PReLU: Parametric Rectified Linear Unit
PESQ: Perceptual Evaluation of Speech Quality
STOI: Short-Time Objective Intelligibility
MOS: Mean Opinion Score
ISTFT: Inverse Short-Time Fourier Transform

References

  1. Lim, J.; Oppenheim, A. All-pole modeling of degraded speech. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 197–210. [Google Scholar] [CrossRef]
  2. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120. [Google Scholar] [CrossRef]
  3. Paliwal, K.; Basu, A. A speech enhancement method based on Kalman filtering. In Proceedings of the ICASSP’87. IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, USA, 6–9 April 1987; IEEE: Piscataway, NJ, USA, 1987; Volume 12, pp. 177–180. [Google Scholar]
  4. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 1109–1121. [Google Scholar] [CrossRef]
  5. Ephraim, Y.; Van Trees, H. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 1995, 3, 251–266. [Google Scholar] [CrossRef]
  6. Lu, Y.-X.; Ai, Y.; Du, H.-P.; Ling, Z.-H. Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 236–250. [Google Scholar] [CrossRef]
  7. Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv 2020, arXiv:2008.00264. [Google Scholar]
  8. Park, H.J.; Kang, B.H.; Shin, W.; Kim, J.S.; Han, S.W. Manner: Multi-view attention network for noise erasure. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 7842–7846. [Google Scholar]
  9. Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  10. Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is all you need in speech separation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 21–25. [Google Scholar]
  11. Fu, S.W.; Liao, C.F.; Tsao, Y.; Lin, S.D. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 2031–2041. [Google Scholar]
  12. Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based metric GAN for speech enhancement. arXiv 2022, arXiv:2203.15149. [Google Scholar]
  13. Liang, H.; Chang, H.; Kong, J. Speech Recognition for Air Traffic Control Utilizing a Multi-Head State-Space Model and Transfer Learning. Aerospace 2024, 11, 390. [Google Scholar] [CrossRef]
  14. Braun, S.; Tashev, I. A consolidated view of loss functions for supervised deep learning-based speech enhancement. In Proceedings of the 2021 44th International Conference on Telecommunications and Signal Processing (TSP), Brno, Czech Republic, 26–28 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 72–76. [Google Scholar]
  15. Xu, X.; Tu, W.; Yang, Y. Selector-enhancer: Learning dynamic selection of local and non-local attention operation for speech enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13853–13860. [Google Scholar]
  16. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 2, pp. 749–752. [Google Scholar]
  17. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
  18. Veaux, C.; Yamagishi, J.; King, S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, 25–27 November 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1–4. [Google Scholar]
  19. Thiemann, J.; Ito, N.; Vincent, E. The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings. In Proceedings of the Meetings on Acoustics, Montreal, QC, Canada, 2–7 June 2013; AIP Publishing: Melville, NY, USA, 2013; Volume 19. [Google Scholar]
  20. Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 2007, 16, 229–238. [Google Scholar] [CrossRef]
  21. Pascual, S.; Bonafonte, A.; Serra, J. SEGAN: Speech enhancement generative adversarial network. arXiv 2017, arXiv:1703.09452. [Google Scholar]
  22. Wang, K.; He, B.; Zhu, W.P. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7098–7102. [Google Scholar]
  23. Defossez, A.; Synnaeve, G.; Adi, Y. Real time speech enhancement in the waveform domain. arXiv 2020, arXiv:2006.12847. [Google Scholar]
  24. Yin, D.; Luo, C.; Xiong, Z.; Zeng, W. PHASEN: A phase-and-harmonics-aware speech enhancement network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9458–9465. [Google Scholar]
  25. Fu, S.W.; Yu, C.; Hsieh, T.A.; Plantinga, P.; Ravanelli, M.; Lu, X.; Tsao, Y. MetricGAN+: An improved version of MetricGAN for speech enhancement. arXiv 2021, arXiv:2104.03538. [Google Scholar]
Figure 1. Architecture of the generator in DeCGAN.
Figure 2. Network architecture of DeConformer.
Figure 3. Network structure of the metric discriminator.
Figure 4. The time-domain waveforms of speech enhanced by different models: (a) noisy; (b) DEMUCS; (c) MetricGAN+; (d) PHASEN; (e) CMGAN; (f) clean; (g) DeCGAN. The red arrows reflect the distortion in the time-domain waveforms.
Figure 5. Spectrograms of enhanced speech signals with the different models: (a) noisy; (b) DEMUCS; (c) MetricGAN+; (d) PHASEN; (e) CMGAN; (f) clean; (g) DeCGAN. Brighter colors indicate stronger signal presence at a given frequency and time, while darker colors indicate weaker signal presence.
Table 1. Comparison of the evaluation metric scores of different models. The best results are in bold.
Method | PESQ | CSIG | CBAK | COVL | STOI
Noisy | 1.93 | 3.29 | 2.34 | 2.58 | 0.92
SEGAN | 2.18 | 3.42 | 2.84 | 2.75 | 0.92
TSTNN | 2.92 | 4.04 | 3.67 | 3.47 | 0.95
DEMUCS | 3.03 | 4.25 | 3.3 | 3.58 | 0.95
PHASEN | 2.95 | 4.15 | 3.45 | 3.57 | –
MetricGAN+ | 3.11 | 4.08 | 3.06 | 3.59 | –
CMGAN | 3.37 | 4.57 | 3.84 | 4.07 | 0.96
DeCGAN | 3.31 | 4.61 | 3.86 | 4.12 | 0.96
Table 2. The experimental settings of the ablation studies.
Case Index | Dilated Conv | Deformable Conv | Tri-Path SA | TFC-SA | No. of Blocks | Mask Decoder | Complex Decoder | Discriminator
1 | – | ✓ | – | ✓ | 5 | ✓ | ✓ | ✓
2 | ✓ | – | – | ✓ | 5 | ✓ | ✓ | ✓
3 | – | – | – | ✓ | 5 | ✓ | ✓ | ✓
4 | – | ✓ | ✓ | – | 5 | ✓ | ✓ | ✓
5 | – | ✓ | – | ✓ | 5 | ✓ | – | ✓
6 | – | ✓ | – | ✓ | 5 | – | ✓ | ✓
7 | – | ✓ | – | ✓ | 4 | ✓ | ✓ | ✓
8 | – | ✓ | – | ✓ | 6 | ✓ | ✓ | ✓
Table 3. The experimental results of the ablation studies. The best results are in bold.
Case Index | PESQ | CSIG | CBAK | COVL | STOI
1 | 3.31 | 4.61 | 3.86 | 4.12 | 0.96
2 | 3.18 | 4.52 | 3.76 | 3.98 | 0.95
3 | 2.94 | 3.98 | 3.05 | 3.54 | 0.94
4 | 3.26 | 4.6 | 3.79 | 4.08 | 0.96
5 | 3.23 | 4.42 | 3.72 | 3.96 | 0.96
6 | 3.17 | 4.36 | 3.53 | 3.78 | 0.95
7 | 3.21 | 4.49 | 3.74 | 3.98 | 0.96
8 | 3.38 | 4.59 | 3.76 | 4.08 | 0.96
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
