DeCGAN: Speech Enhancement Algorithm for Air Traffic Control
Abstract
1. Introduction
- This paper proposes DeCGAN, a generator based on DeConformer that integrates TFC-SA and DeConv-FFN, enabling the simultaneous capture of global long-range dependencies and local fine-grained details in speech signals.
- By employing a mask decoder and a complex decoder, we effectively overcome the phase-recovery challenges of conventional complex-domain methods, yielding higher-quality enhanced speech.
- Experimental results demonstrate that DeCGAN outperforms comparable algorithms in processing civil aviation air traffic control speech.
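The mask-and-complex decoder combination can be sketched as follows. This is an illustrative reconstruction of the general formulation used by CMGAN-style enhancers (which DeCGAN builds on), not the authors' implementation; the function and variable names are hypothetical. The mask decoder scales the noisy magnitude while keeping the noisy phase, and the complex decoder adds a real/imaginary correction that refines the phase:

```python
import math

def combine_decoders(noisy_real, noisy_imag, mask, res_real, res_imag):
    """Combine a magnitude mask with a predicted complex residual,
    per time-frequency bin (illustrative sketch, hypothetical names)."""
    enhanced_real, enhanced_imag = [], []
    for yr, yi, m, rr, ri in zip(noisy_real, noisy_imag, mask, res_real, res_imag):
        mag = math.hypot(yr, yi)      # |Y(t, f)|, noisy magnitude
        phase = math.atan2(yi, yr)    # noisy phase, kept by the mask path
        masked = m * mag              # masked magnitude
        enhanced_real.append(masked * math.cos(phase) + rr)
        enhanced_imag.append(masked * math.sin(phase) + ri)
    return enhanced_real, enhanced_imag
```

With a unit mask and zero residual the spectrogram passes through unchanged; a non-trivial residual lets the network move the phase away from the noisy estimate, which a mask alone cannot do.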
2. Materials and Methods
2.1. Generator
2.2. Discriminator
2.3. Loss Function
3. Results and Analysis
3.1. Dataset
3.2. Experimental Setup
3.3. Evaluation Metrics
- PESQ: Perceptual evaluation of speech quality, which has values within the range of [−0.5, 4.5]; a higher value indicates better quality.
- STOI: Short-time objective intelligibility, with values in the range [0, 1]; a value of 1 indicates fully intelligible speech.
- CSIG: Mean opinion score (MOS) [20] for the prediction of signal distortion, with a value ranging from 1 to 5.
- CBAK: The MOS prediction of background noise intrusiveness, with a value ranging from 1 to 5.
- COVL: The MOS prediction of the overall effect, which has a value within [1, 5].
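All of these metrics are reference-based: they compare the enhanced signal against the clean recording. PESQ, STOI, and the composite MOS predictors require dedicated toolkits, but the same comparison pattern can be illustrated with a plain signal-to-noise-ratio computation (a minimal stdlib sketch, not one of the metrics above; the helper name is hypothetical):

```python
import math

def snr_db(clean, degraded):
    """Signal-to-noise ratio in dB between a clean reference and a
    degraded signal of the same length (illustrative helper)."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((s - x) ** 2 for s, x in zip(clean, degraded))
    if noise_power == 0:
        return float("inf")   # identical signals: no noise
    return 10.0 * math.log10(signal_power / noise_power)
```

A higher value means the residual error is smaller relative to the reference, mirroring the "higher is better" convention of the metrics listed above.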
3.4. Results Analysis
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Full Form |
---|---|
ATC | Air Traffic Control |
TFC-SA | Time Frequency Channel Attention |
DeConv-FFN | Deformable Convolution-based Feedforward Neural Network |
SNR | Signal-to-Noise Ratio |
TF | Time Frequency |
STFT | Short-Time Fourier Transform |
CNN | Convolutional Neural Network |
GAN | Generative Adversarial Network |
PReLU | Parametric Rectified Linear Unit |
PESQ | Perceptual Evaluation of Speech Quality |
STOI | Short-Time Objective Intelligibility |
MOS | Mean Opinion Score |
ISTFT | Inverse Short-Time Fourier Transform |
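Several of the abbreviations above (STFT, ISTFT, TF) refer to the transform pair that moves speech between the waveform and time-frequency domains. As a minimal single-frame illustration (a naive O(N²) DFT sketch, not the paper's implementation, which would use an FFT over overlapping windowed frames):

```python
import cmath

def dft(frame):
    """Discrete Fourier transform of one analysis frame (naive O(N^2))."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT; together with dft this mirrors the STFT/ISTFT pair
    used to go from waveform to TF domain and back."""
    n = len(spectrum)
    return [sum(c * cmath.exp(2j * cmath.pi * k * t / n)
                for k, c in enumerate(spectrum)).real / n
            for t in range(n)]
```

The round trip `idft(dft(frame))` recovers the original frame, which is the property the enhancement pipeline relies on when it modifies the TF representation and resynthesizes the waveform.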
References
1. Lim, J.; Oppenheim, A. All-pole modeling of degraded speech. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 197–210.
2. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
3. Paliwal, K.; Basu, A. A speech enhancement method based on Kalman filtering. In Proceedings of the ICASSP’87. IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, USA, 6–9 April 1987; IEEE: Piscataway, NJ, USA, 1987; Volume 12, pp. 177–180.
4. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 1109–1121.
5. Ephraim, Y.; Van Trees, H. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 1995, 3, 251–266.
6. Lu, Y.-X.; Ai, Y.; Du, H.-P.; Ling, Z.-H. Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 236–250.
7. Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv 2020, arXiv:2008.00264.
8. Park, H.J.; Kang, B.H.; Shin, W.; Kim, J.S.; Han, S.W. Manner: Multi-view attention network for noise erasure. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 7842–7846.
9. Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
10. Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is all you need in speech separation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 21–25.
11. Fu, S.W.; Liao, C.F.; Tsao, Y.; Lin, S.D. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 2031–2041.
12. Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based metric GAN for speech enhancement. arXiv 2022, arXiv:2203.15149.
13. Liang, H.; Chang, H.; Kong, J. Speech Recognition for Air Traffic Control Utilizing a Multi-Head State-Space Model and Transfer Learning. Aerospace 2024, 11, 390.
14. Braun, S.; Tashev, I. A consolidated view of loss functions for supervised deep learning-based speech enhancement. In Proceedings of the 2021 44th International Conference on Telecommunications and Signal Processing (TSP), Brno, Czech Republic, 26–28 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 72–76.
15. Xu, X.; Tu, W.; Yang, Y. Selector-enhancer: Learning dynamic selection of local and non-local attention operation for speech enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13853–13860.
16. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 2, pp. 749–752.
17. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136.
18. Veaux, C.; Yamagishi, J.; King, S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, 25–27 November 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1–4.
19. Thiemann, J.; Ito, N.; Vincent, E. The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings. In Proceedings of the Meetings on Acoustics, Montreal, QC, Canada, 2–7 June 2013; AIP Publishing: Melville, NY, USA, 2013; Volume 19.
20. Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 2007, 16, 229–238.
21. Pascual, S.; Bonafonte, A.; Serra, J. SEGAN: Speech enhancement generative adversarial network. arXiv 2017, arXiv:1703.09452.
22. Wang, K.; He, B.; Zhu, W.P. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7098–7102.
23. Defossez, A.; Synnaeve, G.; Adi, Y. Real time speech enhancement in the waveform domain. arXiv 2020, arXiv:2006.12847.
24. Yin, D.; Luo, C.; Xiong, Z.; Zeng, W. PHASEN: A phase-and-harmonics-aware speech enhancement network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9458–9465.
25. Fu, S.W.; Yu, C.; Hsieh, T.A.; Plantinga, P.; Ravanelli, M.; Lu, X.; Tsao, Y. MetricGAN+: An improved version of MetricGAN for speech enhancement. arXiv 2021, arXiv:2104.03538.
Method | PESQ | CSIG | CBAK | COVL | STOI |
---|---|---|---|---|---|
Noisy | 1.93 | 3.29 | 2.34 | 2.58 | 0.92 |
SEGAN | 2.18 | 3.42 | 2.84 | 2.75 | 0.92 |
TSTNN | 2.92 | 4.04 | 3.67 | 3.47 | 0.95 |
DEMUCS | 3.03 | 4.25 | 3.3 | 3.58 | 0.95 |
PHASEN | 2.95 | 4.15 | 3.45 | 3.57 | — |
MetricGAN+ | 3.11 | 4.08 | 3.06 | 3.59 | — |
CMGAN | 3.37 | 4.57 | 3.84 | 4.07 | 0.96 |
DeCGAN | 3.31 | 4.61 | 3.86 | 4.12 | 0.96 |
Case Index | Dilated Conv | Deformable Conv | Tri-Path SA | TFC-SA | No. of Blocks | Mask Decoder | Complex Decoder | Discriminator |
---|---|---|---|---|---|---|---|---|
1 | √ | √ | 5 | √ | √ | √ | ||
2 | √ | √ | 5 | √ | √ | √ | ||
3 | √ | 5 | √ | √ | √ | |||
4 | √ | √ | 5 | √ | √ | √ | ||
5 | √ | √ | 5 | √ | √ | |||
6 | √ | √ | 5 | √ | √ | |||
7 | √ | √ | 4 | √ | √ | √ | ||
8 | √ | √ | 6 | √ | √ | √ |
Case Index | PESQ | CSIG | CBAK | COVL | STOI |
---|---|---|---|---|---|
1 | 3.31 | 4.61 | 3.86 | 4.12 | 0.96 |
2 | 3.18 | 4.52 | 3.76 | 3.98 | 0.95 |
3 | 2.94 | 3.98 | 3.05 | 3.54 | 0.94 |
4 | 3.26 | 4.6 | 3.79 | 4.08 | 0.96 |
5 | 3.23 | 4.42 | 3.72 | 3.96 | 0.96 |
6 | 3.17 | 4.36 | 3.53 | 3.78 | 0.95 |
7 | 3.21 | 4.49 | 3.74 | 3.98 | 0.96 |
8 | 3.38 | 4.59 | 3.76 | 4.08 | 0.96 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liang, H.; He, Y.; Chang, H.; Kong, J. DeCGAN: Speech Enhancement Algorithm for Air Traffic Control. Algorithms 2025, 18, 245. https://doi.org/10.3390/a18050245