A Lightweight LCGRU–Wave-SkipConvNet Framework for Speech–Noise Separation in Urban Acoustic Environments and Performing-Arts Spaces Toward Sustainable and Equitable Acoustic Communication
Abstract
1. Introduction
2. Methods and Materials
2.1. Speech and Noise Separation Preprocessing Based on the LCGRU Model
2.2. Post-Processing Denoising Techniques Based on the Wave-SkipConvNet Model
3. Results
3.1. Experimental Dataset and Hardware Settings
3.2. Evaluation Indicator Explanation
3.3. Analysis of Speech and Noise Separation Effect Based on the LCGRU Model
3.4. Performance Analysis of the Post-Processing Denoising Model Based on Wave-SkipConvNet
3.5. Computational Complexity of the Model and Analysis of Ablation Experiments
4. Discussion
4.1. Technical Performance Analysis of the Proposed Framework
4.2. Implications for Sustainable Acoustic Environments
4.3. Limitations and Future Work
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Parameter | Experimental Setup |
|---|---|
| Room size | 10 m × 8 m × 5 m |
| Number of microphones | 4 |
| Active sources per mixture | 2, including one target speech source and one interfering source |
| Candidate source positions | 3–5 source-position settings |
| Source-to-microphone distance | 3–6 m |
| Sampling rate | 16 kHz |
| Clean speech dataset | SiSEC |
| Noise dataset | MUSAN |
| SNR range | 0–15 dB |
| Sound-source angle | 0–180° |
| Target angle error | 0–15° |
| Reverberation time, RT60 | 0.2–0.4 s |
| Image-source order, max_order | 15 |
| Speed of sound | 343 m/s |
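
The setup table can be read as a recipe for generating one simulated mixture. The following is a minimal sketch, assuming uniform sampling within each listed range (the table does not state the distribution) and a plain power-ratio definition of SNR. The names `SimConfig`, `sample_config`, and `scale_noise_to_snr` are hypothetical and not taken from the paper, and the room impulse response simulation itself is left out.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class SimConfig:
    # Fixed values copied from the setup table.
    room_dim_m: tuple = (10.0, 8.0, 5.0)
    n_mics: int = 4
    fs_hz: int = 16_000
    max_order: int = 15
    speed_of_sound: float = 343.0
    # Values drawn per mixture from the ranges in the table.
    snr_db: float = 0.0
    rt60_s: float = 0.2
    source_angle_deg: float = 0.0
    source_distance_m: float = 3.0


def sample_config(rng: random.Random) -> SimConfig:
    """Draw one configuration; uniform sampling is an assumption."""
    return SimConfig(
        snr_db=rng.uniform(0.0, 15.0),
        rt60_s=rng.uniform(0.2, 0.4),
        source_angle_deg=rng.uniform(0.0, 180.0),
        source_distance_m=rng.uniform(3.0, 6.0),
    )


def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [gain * n for n in noise]


rng = random.Random(0)
cfg = sample_config(rng)
# Stand-in signals: a 220 Hz tone as "speech" and Gaussian noise, 1 s each.
speech = [math.sin(2 * math.pi * 220 * t / cfg.fs_hz) for t in range(cfg.fs_hz)]
noise = [rng.gauss(0.0, 1.0) for _ in range(cfg.fs_hz)]
scaled = scale_noise_to_snr(speech, noise, cfg.snr_db)
mixture = [s + n for s, n in zip(speech, scaled)]
achieved_snr_db = 10.0 * math.log10(power(speech) / power(scaled))
```

In a real pipeline the tone and Gaussian noise would be replaced by utterances and noise clips from the listed datasets, and the scaled signals would be convolved with simulated room responses before mixing.
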

| Model | Main Structure | PESQ | STOI | SI-SDR (dB) | segSNR (dB) | Parameters (M) | FLOPs (G) | RTF | Peak Memory |
|---|---|---|---|---|---|---|---|---|---|
| U-Net | Encoder–decoder CNN | 2.15 ± 0.12 | 0.82 ± 0.02 | 6.84 ± 0.45 | 9.75 ± 0.38 | 8.62 | 34.5 | 0.43 | 719 |
| CNN-GRU | CNN + GRU | 2.58 ± 0.09 | 0.86 ± 0.02 | 9.15 ± 0.38 | 11.84 ± 0.34 | 6.45 | 28.2 | 0.36 | 606 |
| CNN-LSTM | CNN + LSTM | 2.65 ± 0.08 | 0.87 ± 0.02 | 9.88 ± 0.35 | 12.31 ± 0.32 | 7.80 | 32.6 | 0.41 | 679 |
| Conv-TasNet | Time-domain TCN | 3.12 ± 0.06 | 0.91 ± 0.01 | 13.45 ± 0.28 | 14.26 ± 0.27 | 5.08 | 22.4 | 0.29 | 524 |
| FullSubNet | Full-band/sub-band fusion | 3.20 ± 0.05 | 0.92 ± 0.01 | 13.92 ± 0.24 | 14.73 ± 0.25 | 5.86 | 24.1 | 0.31 | 548 |
| GTCRN | Grouped temporal convolutional recurrent network | 3.04 ± 0.07 | 0.90 ± 0.02 | 12.86 ± 0.31 | 13.88 ± 0.29 | 2.42 | 12.6 | 0.21 | 437 |
| SepFormer | Transformer-based separation | 3.29 ± 0.05 | 0.93 ± 0.01 | 14.38 ± 0.23 | 15.05 ± 0.24 | 9.72 | 41.3 | 0.52 | 836 |
| LCGRU–Wave-SkipConvNet | LCGRU + time-domain Wave-SkipConvNet | 3.45 ± 0.04 | 0.94 ± 0.01 | 15.62 ± 0.21 | 15.96 ± 0.22 | 3.15 | 14.8 | 0.18 | 391 |
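
Two of the columns above are simple to compute by hand. The sketch below assumes the commonly used definitions: SI-SDR projects the estimate onto the reference after mean removal, and the real-time factor (RTF) is processing time divided by audio duration. The paper's own evaluation code is not shown here, so `si_sdr_db` and `real_time_factor` are hypothetical names and may differ in detail from what the authors used.

```python
import math


def si_sdr_db(estimate, reference):
    """Scale-invariant SDR in dB (common definition, with mean removal)."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    est = [v - mean_e for v in estimate]
    ref = [v - mean_r for v in reference]
    # Optimal scaling of the reference that best explains the estimate.
    alpha = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    num = sum(t * t for t in target)
    den = sum(d * d for d in residual)
    return 10.0 * math.log10(num / den)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1 means the model runs faster than real time."""
    return processing_seconds / audio_seconds


fs = 16_000
n = 1_600  # 20 full periods of a 200 Hz tone
reference = [math.sin(2 * math.pi * 200 * t / fs) for t in range(n)]
# Error orthogonal to the reference with one tenth of its amplitude.
error = [0.1 * math.cos(2 * math.pi * 200 * t / fs) for t in range(n)]
estimate = [r + e for r, e in zip(reference, error)]

score = si_sdr_db(estimate, reference)            # close to 20 dB
rescaled_score = si_sdr_db([2.0 * v for v in estimate], reference)
rtf = real_time_factor(0.18, 1.0)                 # 1 s of audio in 0.18 s
```

Doubling the estimate leaves the score unchanged, which is the scale invariance that distinguishes SI-SDR from plain SNR.
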

| Network Configuration | Purpose of Ablation | PESQ | STOI | SI-SDR (dB) | segSNR (dB) |
|---|---|---|---|---|---|
| w/o LCGRU pre-processing | Remove the front-end separation module | 2.76 ± 0.08 | 0.85 ± 0.02 | 10.64 ± 0.36 | 10.42 ± 0.35 |
| Standard GRU instead of LCGRU gating | Test convolutional gating vs. fully connected GRU | 3.18 ± 0.06 | 0.90 ± 0.02 | 13.72 ± 0.28 | 13.86 ± 0.27 |
| w/o LSTM bottleneck | Remove temporal bottleneck modeling | 3.02 ± 0.07 | 0.84 ± 0.03 | 12.91 ± 0.33 | 13.15 ± 0.31 |
| SkipConv replaced by standard skip connection | Test multi-scale SkipConv fusion | 2.95 ± 0.09 | 0.89 ± 0.02 | 12.54 ± 0.35 | 12.68 ± 0.33 |
| 3-frame past window | Test shorter temporal context | 3.21 ± 0.06 | 0.91 ± 0.02 | 14.02 ± 0.27 | 14.16 ± 0.26 |
| 5-frame past window | Test medium temporal context | 3.34 ± 0.05 | 0.93 ± 0.01 | 14.86 ± 0.24 | 15.02 ± 0.24 |
| 9-frame past window | Test longer temporal context | 3.39 ± 0.05 | 0.93 ± 0.01 | 15.08 ± 0.23 | 15.21 ± 0.23 |
| 7-frame past window, proposed | Proposed temporal context setting | 3.45 ± 0.04 | 0.94 ± 0.01 | 15.62 ± 0.21 | 15.96 ± 0.22 |
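
The ablation rows are easier to compare as differences from the proposed setting. The short script below is only a reading aid: it copies the mean values from the table (the ± spreads are ignored) and subtracts each row from the proposed configuration. The dictionary keys and variable names are hypothetical shorthand for the table rows.

```python
# Mean values copied from the ablation table; spreads are ignored.
PROPOSED = {"PESQ": 3.45, "STOI": 0.94, "SI-SDR": 15.62, "segSNR": 15.96}

ABLATIONS = {
    "w/o LCGRU pre-processing": {"PESQ": 2.76, "STOI": 0.85, "SI-SDR": 10.64, "segSNR": 10.42},
    "Standard GRU gating": {"PESQ": 3.18, "STOI": 0.90, "SI-SDR": 13.72, "segSNR": 13.86},
    "w/o LSTM bottleneck": {"PESQ": 3.02, "STOI": 0.84, "SI-SDR": 12.91, "segSNR": 13.15},
    "Standard skip connection": {"PESQ": 2.95, "STOI": 0.89, "SI-SDR": 12.54, "segSNR": 12.68},
    "3-frame past window": {"PESQ": 3.21, "STOI": 0.91, "SI-SDR": 14.02, "segSNR": 14.16},
    "5-frame past window": {"PESQ": 3.34, "STOI": 0.93, "SI-SDR": 14.86, "segSNR": 15.02},
    "9-frame past window": {"PESQ": 3.39, "STOI": 0.93, "SI-SDR": 15.08, "segSNR": 15.21},
}

# Drop of each metric relative to the proposed 7-frame configuration.
drops = {
    name: {metric: round(PROPOSED[metric] - value, 2) for metric, value in row.items()}
    for name, row in ABLATIONS.items()
}

# Configuration whose removal or change costs the most SI-SDR.
largest = max(drops, key=lambda name: drops[name]["SI-SDR"])

for name, row in sorted(drops.items(), key=lambda kv: -kv[1]["SI-SDR"]):
    print(f"{name}: SI-SDR drop {row['SI-SDR']:.2f} dB, PESQ drop {row['PESQ']:.2f}")
```

On these means, removing the LCGRU front end costs the most SI-SDR (about 4.98 dB), while the 9-frame window is closest to the proposed setting (about 0.54 dB lower).
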

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhang, B.; Lu, Y.; Wang, D.; Liu, H. A Lightweight LCGRU–Wave-SkipConvNet Framework for Speech–Noise Separation in Urban Acoustic Environments and Performing-Arts Spaces Toward Sustainable and Equitable Acoustic Communication. Sustainability 2026, 18, 6242. https://doi.org/10.3390/su18126242

