Symmetric Combined Convolution with Convolutional Long Short-Term Memory for Monaural Speech Enhancement
Abstract
1. Introduction
- Existing CNN-based methods often rely on larger kernels or deeper architectures to boost model capacity, which inflates the parameter count, reduces efficiency, and makes training more difficult (see the parameter-count sketch after this list).
- A fixed kernel size favors either global or local patterns, so a single kernel cannot extract global and local information simultaneously, which degrades model performance.
- CNNs and LSTMs are limited in modeling long-term temporal and spatial features: CNNs mainly capture local patterns, while LSTMs struggle with high-dimensional information flow, leading to information loss.
- Many state-of-the-art methods require large parameter sizes and heavy computation, making them unsuitable for deployment on resource-constrained devices.
- To design a parameter-efficient speech enhancement model that achieves competitive performance while reducing computational costs, ensuring suitability for resource-constrained devices.
- To develop an architecture capable of modeling both local features and long-term temporal/spatial dependencies without significant information loss.
- To comprehensively evaluate the proposed method against state-of-the-art baselines using PESQ and STOI metrics under noisy conditions.
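The parameter argument in the first limitation above can be made concrete. The sketch below is illustrative only (the channel count and kernel size are assumed, not taken from the paper): it counts the weights of one standard convolution against a depthwise-plus-pointwise factorization of the same receptive field, the decomposition the proposed combined convolution builds on.

```python
import torch.nn as nn

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

C, K = 64, 7  # assumed channel count and (large) kernel size

# Standard convolution: C * C * K * K weights plus C biases.
standard = nn.Conv2d(C, C, kernel_size=K, padding=K // 2)

# Depthwise-separable factorization of the same K x K receptive field:
# a per-channel K x K depthwise conv followed by a 1 x 1 pointwise conv.
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=K, padding=K // 2, groups=C),  # depthwise
    nn.Conv2d(C, C, kernel_size=1),                            # pointwise
)

print(n_params(standard))   # 64*64*7*7 + 64 = 200,768
print(n_params(separable))  # 64*7*7 + 64 + 64*64 + 64 = 7,360
```

For the same 7 × 7 receptive field, the factorized form uses roughly 27× fewer parameters, which is why enlarging kernels is far cheaper in depthwise form.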
2. Problem Statement
3. Proposed Method
3.1. Proposed Network Architecture
3.2. Combined Convolution Block
1. Depthwise Convolution with Large Kernels:
2. Pointwise Convolution with 1 × 1 Kernel:

1. Depthwise Convolution with Depth Multiplier:
2. Pointwise Convolution with 1 × 1 Kernel: In Equation (8), the subscript denotes the output channel index of the pointwise convolution. A minimal sketch of this two-path block follows below.
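Section 3.2 lists two depthwise-then-pointwise paths: one with a large depthwise kernel and one with a depth multiplier. The block below is a minimal sketch of that structure, not the authors' implementation; the kernel sizes, the depth-multiplier value, and the fusion of the two paths by elementwise summation are all assumptions.

```python
import torch
import torch.nn as nn

class CombinedConvBlock(nn.Module):
    """Minimal sketch of a combined convolution block: two depthwise->pointwise
    paths fused by summation (the fusion strategy is an assumption)."""

    def __init__(self, channels: int, large_kernel: int = 7, depth_mult: int = 2):
        super().__init__()
        # Path 1: depthwise convolution with a large kernel, then 1x1 pointwise.
        self.large_dw = nn.Conv2d(channels, channels, large_kernel,
                                  padding=large_kernel // 2, groups=channels)
        self.large_pw = nn.Conv2d(channels, channels, kernel_size=1)
        # Path 2: depthwise convolution with a depth multiplier (each input
        # channel expands to depth_mult output channels), then a 1x1 pointwise
        # conv projecting back to the original channel count.
        self.mult_dw = nn.Conv2d(channels, channels * depth_mult, kernel_size=3,
                                 padding=1, groups=channels)
        self.mult_pw = nn.Conv2d(channels * depth_mult, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.large_pw(self.large_dw(x))
        b = self.mult_pw(self.mult_dw(x))
        return a + b  # assumed fusion by elementwise sum


x = torch.randn(1, 16, 161, 100)       # (batch, channels, freq, time); shapes assumed
print(CombinedConvBlock(16)(x).shape)  # torch.Size([1, 16, 161, 100])
```

By the paper's symmetric encoder–decoder design, the combined deconvolution block of Section 3.3 would plausibly mirror this structure with transposed convolutions (nn.ConvTranspose2d) in the decoder; that mirroring is the natural reading of the architecture rather than a confirmed detail.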
3.3. Combined Deconvolution Block
3.4. Dilated Encoder and Decoder
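The section title indicates dilated convolutions in the encoder and decoder. The usual motivation, consistent with the No-DIA row of the component analysis in Section 4.6 (assuming DIA denotes the dilation component), is that exponentially increasing dilation rates widen the receptive field at no extra parameter cost; the schedule below is an assumed example, not the paper's configuration.

```python
import torch.nn as nn

# Assumed dilation schedule 1, 2, 4: with 3 x 3 kernels the stacked receptive
# field grows to 1 + 2*(1 + 2 + 4) = 15 along each axis, yet the parameter
# count is identical to three undilated 3 x 3 convolutions.
dilated_stack = nn.Sequential(*[
    nn.Conv2d(16, 16, kernel_size=3, padding=d, dilation=d)
    for d in (1, 2, 4)
])
```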
3.5. Grouped Convolutional Long Short-Term Memory
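Convolutional LSTM (Shi et al., 2015) replaces the fully connected gate transforms of a standard LSTM with convolutions, preserving the spatial layout of feature maps. The grouped variant named in this section's title is sketched below under one common assumption: the channel axis is split into groups, each processed by an independent, smaller ConvLSTM cell, which divides the gate-convolution weights roughly by the number of groups. The group count, kernel size, and tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell (after Shi et al., 2015): all four gates are computed by
    one convolution over the concatenated input and hidden state."""

    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class GroupedConvLSTM(nn.Module):
    """Grouped ConvLSTM sketch: the channel axis is split into `groups` slices,
    each handled by its own small ConvLSTMCell, and the hidden states are
    re-concatenated along the channel axis."""

    def __init__(self, channels: int, groups: int = 2):
        super().__init__()
        assert channels % groups == 0
        self.g = channels // groups
        self.cells = nn.ModuleList(
            ConvLSTMCell(self.g, self.g) for _ in range(groups))

    def forward(self, seq):
        # seq: (batch, steps, channels, height, width)
        b, t, _, hgt, wid = seq.shape
        states = [(seq.new_zeros(b, self.g, hgt, wid),
                   seq.new_zeros(b, self.g, hgt, wid)) for _ in self.cells]
        outputs = []
        for step in range(t):
            hs = []
            for k, cell in enumerate(self.cells):
                x_k = seq[:, step, k * self.g:(k + 1) * self.g]
                states[k] = cell(x_k, states[k])
                hs.append(states[k][0])          # collect hidden states
            outputs.append(torch.cat(hs, dim=1))
        return torch.stack(outputs, dim=1)       # same shape as seq


# Example: 16 channels in 4 groups over a 10-step sequence of 32x8 maps.
y = GroupedConvLSTM(16, groups=4)(torch.randn(1, 10, 16, 32, 8))
print(y.shape)  # torch.Size([1, 10, 16, 32, 8])
```

With G groups, each gate convolution shrinks from (2C × 4C) to G copies of (2C/G × 4C/G) channel mappings, cutting the recurrent weights roughly by a factor of G, which matches the parameter drop in the No-GC ablation row of Section 4.6.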
4. Experiments
4.1. Data and Setup
4.2. Baselines
4.3. Results and Analysis of the First Experiment
4.4. Results and Analysis of the Second Experiment
4.5. Spectrum Analysis
4.6. Component Analysis
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Experiment | Setup | Training/Testing Details |
---|---|---|
Experiment 1 | Clean speech corpora: TIMIT, VCTK; noise corpus: DEMAND; SNR levels: −15, −10, −5, 0 dB | Training: 108,000 mixtures; testing: 2000 mixtures; noise types: OMEETING, TMETRO, STRAFFIC, DLIVING, NRIVER |
Experiment 2 | Clean speech corpus: TIMIT; noise corpora: NOISEX-92, Nonspeech Database; SNR levels: −5, 0, 5 dB | Training: 66,000 mixtures; testing: 1200 mixtures; noise types: Machine (seen), Traffic & Car, Machinegun (semi-unseen), Crowd (unseen) |
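Both experiments are scored with PESQ and STOI, as stated in the objectives. For reference, the snippet below computes the two metrics with the widely used pesq and pystoi packages; it is a generic illustration rather than the authors' evaluation script, and the 16 kHz wideband setting is an assumption based on the TIMIT/VCTK corpora.

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def evaluate(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000):
    """Return (PESQ, STOI) for one utterance; 16 kHz wideband is assumed."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')            # roughly -0.5 to 4.5
    stoi_score = stoi(clean, enhanced, fs, extended=False)  # 0 to 1
    return pesq_score, stoi_score  # tables below report STOI as a percentage
```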
Measures | STOI p-Value | STOI H0 | PESQ p-Value | PESQ H0 |
---|---|---|---|---|
Noisy | | (+) | | (+) |
LSTM | | (+) | | (+) |
BLSTM | | (+) | | (+) |
CFN | | (+) | | (+) |
Measures | STOI p-Value | STOI H0 | PESQ p-Value | PESQ H0 |
---|---|---|---|---|
Noisy | | (+) | | (+) |
GRN | | (+) | | (+) |
AECNN | | (+) | | (+) |
Component | STOI (%) | PESQ | Parameters (Millions) |
---|---|---|---|
Full | 78.54 | 2.06 | 7.2 |
No-CCB | 76.79 | 2.02 | 3.7 |
No-CDB | 77.29 | 2.03 | 4.3 |
No-CCB-CDB | 75.18 | 1.92 | 1.8 |
No-DIA | 78.01 | 2.03 | 7.2 |
No-GC | 78.06 | 2.04 | 5.6 |