A Lightweight LCGRU–Wave-SkipConvNet Framework for Speech–Noise Separation in Urban Acoustic Environments and Performing-Arts Spaces Toward Sustainable and Equitable Acoustic Communication

Zhang, Baoli; Lu, Yanping; Wang, Dandan; Liu, Hongyan

doi:10.3390/su18126242

¹ Department of Theoretical Teaching, School of Arts, Qingdao University, Qingdao 266071, China

² Department of Marine Convergence Design Engineering, Pukyong National University, 45, Yongso-ro, Nam-Gu, Busan 48513, Republic of Korea

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(12), 6242; https://doi.org/10.3390/su18126242

Submission received: 9 April 2026 / Revised: 4 June 2026 / Accepted: 15 June 2026 / Published: 17 June 2026

(This article belongs to the Special Issue Soundscapes, Tranquillity and Urban Wellbeing: Towards Sustainable and Equitable Acoustic Environments)

Abstract

Urban acoustic environments and performing-arts spaces strongly influence speech communication quality, acoustic comfort, and public wellbeing, particularly in noise-exposed shared environments such as transport hubs, campuses, healthcare spaces, public service facilities, music-education settings, and rehearsal or performance-related spaces. To address speech–noise separation in low signal-to-noise ratio and acoustically complex scenarios, this study proposes a lightweight two-stage deep learning framework termed LCGRU–Wave-SkipConvNet. In the preprocessing stage, a Lightweight Convolutional Gated Recurrent Unit (LCGRU) model is employed to achieve preliminary separation of target speech and background noise by capturing both spatial and temporal acoustic features. In the post-processing stage, a Wave-SkipConvNet model is introduced to further suppress residual noise and enhance speech quality. Experimental results demonstrate that the proposed framework achieves superior performance under different signal-to-noise ratios, sound-source angles, and target angle errors. For example, in the preprocessing stage, the LCGRU model achieved a perceptual evaluation of speech quality (PESQ) score of 2.64 at source angles between 0° and 30°, outperforming the convolutional neural network-long short-term memory (CNN-LSTM) model by 1.17. In the post-processing stage, the Wave-SkipConvNet model achieved higher short-time objective intelligibility (STOI) and segmental signal-to-noise ratio (segSNR) values than the comparison models under different SNR conditions. The proposed framework provides an effective and deployment-oriented AI solution for improving speech accessibility and acoustic comfort in urban acoustic environments and performing-arts spaces. Beyond speech enhancement, it offers practical potential for supporting healthier, more inclusive, and more equitable acoustic environments in noise-sensitive public and educational spaces. 
It should be noted that this study focuses on the objective acoustic environment and signal-level speech enhancement, rather than subjective soundscape perception, musical perception, or human perceptual evaluation.

Keywords: speech–noise separation; urban acoustic environments; performing-arts spaces; urban wellbeing; acoustic comfort; equitable acoustic environments; environmental noise monitoring; lightweight deep learning; smart acoustic sensing

1. Introduction

Environmental noise has become a persistent challenge in urban and built environments, affecting speech intelligibility, communication quality, work efficiency, learning performance, and public wellbeing. In sustainable acoustic environments, especially public buildings, transport hubs, campuses, healthcare spaces, and other densely occupied areas, the ability to separate target speech from background noise is essential for improving acoustic comfort and supporting intelligent environmental management [1,2,3]. However, in real-world scenarios, speech signals are often degraded by traffic noise, crowd noise, reverberation, and interfering speakers, which makes accurate speech extraction difficult [4]. In typical urban public spaces such as transportation hubs and the halls of medical institutions, the mixture of traffic, human voices, and equipment noise seriously reduces the recognition accuracy of voice interaction systems. This study follows the definitions of the ISO 12913-1:2014 standard [5] and focuses on the objective acoustic environment, that is, the sound signals received by microphone arrays after modification by the physical environment, rather than on soundscapes, which involve subjective human cognition and socio-cultural factors. In these complex urban acoustic environments, balancing real-time performance against separation accuracy while extracting highly intelligible target speech has become a technical bottleneck for intelligent sound monitoring and for the construction of accessible and inclusive environments.

Existing methods for speech–noise separation mainly include blind source separation, auditory scene analysis, and data-driven machine learning approaches [6,7]. Although these methods are effective under controlled conditions, traditional model-based approaches often require strong assumptions or prior information, and some scene analysis methods remain computationally expensive and sensitive to environmental changes [8]. By contrast, deep learning has shown strong potential for modeling nonlinear acoustic patterns and improving separation performance in complex, noisy environments. Established architectures such as the convolutional neural network–long short-term memory (CNN-LSTM) model, the convolutional recurrent network (CRN), and the standard Wave-U-Net have achieved good results in specific scenarios, but their practical deployment is limited by high computational complexity or by an insufficient ability to model phase over long sequences. How to extract spatiotemporal and phase features effectively in acoustic environments while keeping the model lightweight therefore remains an important research question.

Recent studies have proposed a variety of deep learning methods for speech enhancement and target speech extraction [9]. Sharma, B.K. et al. proposed a speech separation system based on frequency-domain blind source separation to address speech extraction in multi-speaker remote conferencing. Their results show that the system can select the optimal frequency signal to restore the original sound and that it improves the quality of the user’s speech signal, as assessed through a comprehensive graphical evaluation of the room impulse response and signal characteristics [10]. Xie, J. et al. developed an end-to-end separation model for sound extraction from overlapping signals in passive acoustic monitoring in the field. The model uses inter-channel spatial features as supplementary input and introduces a polarized self-attention mechanism to estimate the spectral amplitude mask, reaching a source-to-distortion ratio of 10.00 dB [11]. These studies indicate that neural models can effectively improve speech quality and intelligibility under low signal-to-noise ratio (SNR) conditions, but many methods still rely on relatively complex architectures or incur high computational costs.

Meanwhile, encoder–decoder architectures such as U-Net have been widely used because of their ability to fuse multi-scale features through skip connections [12,13]. Basir, S. et al. proposed a supervised separation method combining the short-time Fourier transform with U-Net to separate monaural target signals from mixed signals. Objective measurements on the GRID corpus confirm that the method effectively reconstructs the target time-domain signal and significantly improves speech quality and intelligibility compared with existing methods [14]. Sindhu, R. proposed a multi-scale model that combines a nested U-Net, a time-frequency attention mechanism, and a dilated dense network to address the limitations of traditional methods in non-stationary environments. The results show that the model overcomes the aliasing problem associated with dilated convolution and significantly reduces information loss [15]. Although U-Net was initially developed for image-related tasks, its structural advantages have inspired time-domain speech enhancement research [16]. In addition, recurrent units such as GRU and LSTM are effective in modeling temporal dependencies in sequential signals [17,18]. However, conventional recurrent structures may introduce a substantial computational burden or fail to preserve detailed time–frequency information when applied directly to noisy speech processing.

In summary, although existing studies have improved speech–noise separation to some extent, many methods still face challenges in balancing separation accuracy, robustness, and computational efficiency, which limits their applicability in sustainable acoustic environments requiring continuous monitoring and low-resource deployment. To address these issues, this study proposes a lightweight two-stage framework that integrates an LCGRU model for speech–noise separation preprocessing with a Wave-SkipConvNet model for post-processing denoising. The main contributions of this study are threefold. First, a lightweight LCGRU module is developed for causal speech–noise separation by replacing fully connected recurrent operations with convolutional gating operations, thereby preserving local time–frequency information while reducing computational complexity. Second, a Wave-SkipConvNet post-processing module is introduced by combining multi-scale skip convolution with an LSTM bottleneck layer, which enhances temporal dependency modeling and improves waveform reconstruction quality. Third, the proposed two-stage framework is evaluated under different SNRs, sound-source angles, and target angle errors, demonstrating its potential for deployment-oriented speech enhancement in complex urban acoustic environments.

2. Methods and Materials

To support practical deployment in sustainable acoustic environments, the proposed framework was designed to combine robust speech–noise separation, residual noise suppression, and relatively low computational complexity. The overall pipeline consists of a preprocessing stage based on LCGRU and a post-processing stage based on Wave-SkipConvNet [19]. For the separation stage, a Temporal Convolutional Network (TCN) with causal convolutions was introduced to avoid the delay inherent in traditional methods, and an LCGRU model was proposed to improve the performance of the causal separation system. To handle the residual noise that remains in the separated speech, U-Net was then considered as a post-processing denoising technique. To overcome the shortcomings of the traditional U-Net, a time-domain speech enhancement model, Wave-SkipConvNet, was adopted to further improve the separation of speech and noise.

No chemicals, reagents, commercial cell lines, biological samples, or laboratory materials were used in this study. The experiments were conducted using publicly available speech and noise datasets and simulated acoustic environments.

2.1. Speech and Noise Separation Preprocessing Based on the LCGRU Model

A noisy speech signal is the superposition of clean speech and noise. Following the ISO 12913-1:2014 standard, the variables studied and processed here strictly concern the acoustic environment as a set of objective physical properties, rather than soundscapes, which involve subjective human perception, memory, and socio-cultural expectations. Therefore, the term “acoustic environment” is used in the experimental and methodological sections to refer to the physical sound field and microphone-received signals, whereas “soundscape” is only used in the broader contextual sense of urban sound management. To enable the models to learn to distinguish target speech from background noise, a database-based controlled acoustic modeling method was adopted. Specifically, a publicly available clean speech benchmark database (SiSEC) and an environmental noise corpus (MUSAN) were combined with room impulse responses (RIRs) generated by the image source (mirror sound source) method to simulate the spatiotemporal mixing process in acoustic environments. On this basis, microphone array (MA) technology was introduced to separate noise and speech. This technology mainly uses the time differences between the signals received by multiple microphones to estimate the spatial position of the sound source (SS) [20]. During speech processing, the time-frequency spectrum of the noisy speech signal is obtained first, and the Non-Negative Logarithmic Amplitude Spectrum (NLAS) feature is then computed from it, as shown in Equation (1).

Z = \log(|Y| + 1)

(1)

In Equation (1), Z represents the NLAS feature and Y represents the time-frequency spectrum, which is obtained after a short-time Fourier transform. In addition, the study sets the prior condition for speech separation as a directional feature (DF). Among them, NLAS contains the amplitude characteristics of all interference sources and target sources, while DF is determined based on the distance between the target source and the MA [21]. The relative position relationship between the SS and the MA is shown in Figure 1.
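Equation (1) can be illustrated with a minimal sketch. The nested-list layout (frames by frequency bins) and the toy spectrum values are assumptions made here for illustration only; they are not data or code from the study.

```python
import math

def nlas(spectrum):
    """Equation (1): Z = log(|Y| + 1) for every complex time-frequency bin."""
    return [[math.log(abs(y) + 1.0) for y in frame] for frame in spectrum]

# Toy 2-frame, 3-bin complex spectrum (hypothetical values).
Y = [[0j, 3 + 4j, -1 + 0j],
     [1j, 0.5 + 0j, 2 - 2j]]
Z = nlas(Y)
```

Adding 1 before taking the logarithm keeps every feature non-negative and maps silent bins to exactly zero, which is what the "non-negative" in NLAS refers to.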

As shown in Figure 1, there are fixed and moving SSs in space. The target SS is a moving SS, namely a human voice. From the coordinates of the target SS and the MA, the horizontal angle of the target SS relative to the MA can be computed. Assuming the coordinate of the MA is ({Mic}_{x}, {Mic}_{y}, {Mic}_{z}) and the coordinate of the target SS is ({Tar}_{x}, {Tar}_{y}, {Tar}_{z}), the horizontal angle between the two is calculated as shown in Equation (2).

θ_{t} = \operatorname{atan2}({Tar}_{y} - {Mic}_{y},\; {Tar}_{x} - {Mic}_{x})

(2)
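As a small worked example of Equation (2), the sketch below computes the horizontal angle for two hypothetical source positions. The coordinates are invented for illustration; only the atan2 form comes from the text.

```python
import math

def horizontal_angle(mic, tar):
    """Equation (2): horizontal angle of the target seen from the array, in radians."""
    return math.atan2(tar[1] - mic[1], tar[0] - mic[0])

mic = (5.0, 4.0, 1.5)        # hypothetical array position in metres
tar_front = (6.0, 5.0, 1.5)  # hypothetical target, diagonal to the array
tar_back = (4.0, 4.0, 1.5)   # hypothetical target, directly behind along x

theta_front_deg = math.degrees(horizontal_angle(mic, tar_front))
theta_back_deg = math.degrees(horizontal_angle(mic, tar_back))
```

Using atan2 rather than a plain arctangent of the ratio keeps the quadrant information, so a source behind the array is not confused with one in front of it.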

The Interaural Phase Difference (IPD) characterizes the structure of an MA, and IPD exists between different channels [22]. The expression for IPD is shown in Equation (3).

{IPD}^{(p)}(t, f) = ∠Y^{(m_{1})}(t, f) - ∠Y^{(m_{2})}(t, f)

(3)

In Equation (3), m represents the microphone index, with m_{1} and m_{2} denoting the two microphones of a pair, and the MA has a total of M microphones. {IPD}^{(p)} represents the IPD of the p-th pair of channels, with a total of P pairs. ∠Y^{(m_{1})} represents the phase angle of channel m_{1}. From the IPD and the calculated horizontal angle, the DF can be obtained, as shown in Equation (4).

DF(t, f) = \sum_{p=1}^{P} \langle e^{j\,{TPD}^{(p)}(θ_{t}, f)},\; e^{j\,{IPD}^{(p)}(t, f)} \rangle

(4)

In Equation (4), TPD represents the phase delay with which a plane wave of a specific frequency reaches each microphone, which is calculated as shown in Equation (5).

{TPD}^{(p)}(θ_{t}, f) = \frac{2 π f Δ_{p} \cos θ_{t}}{c}

(5)

In Equation (5), f represents the acoustic frequency, c represents the speed of sound, which was set to 343 m/s in the simulation, and Δ_{p} denotes the distance between the two microphones of the p-th pair. All delay-related and steering-vector calculations were checked and recalculated using the speed of sound rather than the speed of light. The value of DF reflects the cosine distance between the IPD and the steering vector. In the separation of speech and noise, speech signals are temporal signals, and modeling their temporal correlation is crucial [23]. However, previous non-causal symmetric-window methods introduce fixed delays, making it difficult for speech separation systems to meet real-time requirements [24,25]. To address this issue, the study uses a TCN with causal convolutions. With causal input features, the model uses only past and current signals and no future feature information, thus avoiding latency issues. The structure of the TCN is illustrated in Figure 2.
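Equations (3)–(5) can be combined in a short sketch for a single time-frequency bin. Two points are assumptions of this sketch rather than statements from the study: the angle brackets in Equation (4) are read as the real inner product of two unit phasors, which equals cos(TPD − IPD), and the toy observations are built so that their IPD matches the TPD of the true direction. The spacings, frequency, and angles are hypothetical.

```python
import cmath
import math

C_SOUND = 343.0  # speed of sound in m/s, as stated in the text

def ipd(y_m1, y_m2):
    """Equation (3): phase difference between the two channels of a pair."""
    return cmath.phase(y_m1) - cmath.phase(y_m2)

def tpd(theta_t, freq_hz, spacing_m, c=C_SOUND):
    """Equation (5): phase delay expected for a plane wave arriving from theta_t."""
    return 2.0 * math.pi * freq_hz * spacing_m * math.cos(theta_t) / c

def directional_feature(pairs, theta_t, freq_hz, spacings):
    """Equation (4), with <e^{ja}, e^{jb}> read as cos(a - b) (assumed interpretation)."""
    return sum(math.cos(tpd(theta_t, freq_hz, d) - ipd(y1, y2))
               for (y1, y2), d in zip(pairs, spacings))

freq = 1000.0               # Hz (hypothetical)
spacings = [0.05, 0.10]     # metres between the microphones of each pair (hypothetical)
true_theta = math.pi / 3    # direction the toy wave actually comes from

# Toy observations whose phase difference equals the TPD of the true direction.
pairs = [(cmath.exp(1j * tpd(true_theta, freq, d)), 1 + 0j) for d in spacings]

df_match = directional_feature(pairs, true_theta, freq, spacings)
df_miss = directional_feature(pairs, 2 * math.pi / 3, freq, spacings)
```

Under this reading, DF reaches its maximum P when the observed IPDs agree with the steering direction and drops when the assumed target angle is wrong, which is why it can serve as a directional prior for separation.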

As shown in Figure 2, the TCN includes 1 × 1 convolutional blocks, ReLU activation functions, depthwise convolution (D-Conv) blocks, and normalization operations. The TCN models long-term features by stacking these blocks and enlarges the receptive field by changing the dilation rate, thereby modeling longer-term temporal correlations. Residual connections are also introduced to avoid overfitting during model training. In addition, the input and output of the previous time step are taken into account to improve the ability of the separation model to capture the temporal features of noisy speech signals. The conventional choice for this purpose is a recurrent neural network, but when modeling speech signals this network is prone to vanishing gradients. Long Short-Term Memory (LSTM) networks were proposed to avoid this problem. An LSTM consists of three gates, namely input, forget, and output gates, and uses this gating mechanism to filter out redundant feature information and to transmit information. However, the gated unit of an LSTM is relatively complex and time-consuming to execute. The Gated Recurrent Unit (GRU) was proposed subsequently. Unlike LSTM, GRU contains only two gates, namely reset and update gates, which correspond to the forget and output gates of LSTM, respectively. Compared with LSTM, GRU not only achieves effective information transmission but also has lower complexity [26,27]. Both LSTM and GRU models can capture long-term dependencies in speech signals, but their fully connected structures ignore the time-frequency characteristics of speech signals. In response to this issue, the study further introduced the LCGRU network model.
This model replaces the fully connected structure of conventional gated recurrent networks with convolutional kernels, so that time-frequency features are preserved when modeling speech signals. In addition, this model can alleviate the performance degradation observed in causal speech separation, better capture and use the feature information of speech signals at previous time steps, and improve the separation of speech and noise. The LCGRU model structure is shown in Figure 3.
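The causal, dilated convolution on which the TCN described above relies can be sketched as follows. The kernel values, zero-padding of the past, and the dilation schedule are illustrative assumptions; the study does not list them in this section.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k*dilation]; samples before t = 0 count as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:          # never look at future samples
                acc += w * x[idx]
        y.append(acc)
    return y

def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal layers, one convolution per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)

x = [1.0, 2.0, 3.0, 4.0]
y = causal_dilated_conv1d(x, kernel=[1.0, 1.0], dilation=2)  # [1.0, 2.0, 4.0, 6.0]
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])  # 31 frames
```

Because each output frame depends only on the current and earlier inputs, the layer adds no look-ahead delay, while doubling the dilation per layer makes the receptive field grow much faster than the layer count.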

In Figure 3, x_{t} and x_{t-1} represent the network inputs at the current and previous time steps, which helps capture temporal dependencies in the sequence. f_{t} represents the forget gate, which controls which previous memories are forgotten and which are retained. W_{x} represents the convolutional kernel of the network, used to extract features from the input data. \tilde{h}_{t} represents the candidate hidden state, used to update the network’s state information. h_{t-1} represents the network output at the previous time step. σ represents the Sigmoid activation function. The weighted feature vector of x_{t} is given by Equation (6).

\hat{x}_{t} = σ(W_{x} * x_{t}) ⊙ x_{t}

(6)

In Equation (6), * represents the convolution operation and \hat{x}_{t} represents the weighted feature vector. The weighted feature vector of x_{t-1} is given by Equation (7).

\hat{x}_{t-1} = σ(W_{x-1} * x_{t-1}) ⊙ x_{t-1}

(7)

The weighted feature vector of h_{t-1} is given by Equation (8).

\hat{h}_{t-1} = σ(W_{h-1} * h_{t-1}) ⊙ h_{t-1}

(8)

The forget gate is given by Equation (9).

f_{t} = σ(W_{t} * \hat{x}_{t} + W_{t-1} * \hat{x}_{t-1} + b_{f})

(9)

In Equation (9), b_{f} represents the bias term of the forget gate. The candidate hidden state is given by Equation (10).

\tilde{h}_{t} = \tanh(W_{h} * x_{t} + b_{h})

(10)

In Equation (10), b_{h} represents the bias term of the candidate hidden state. Finally, the output of the network is obtained as shown in Equation (11).

h_{t} = f_{t} ⊙ \tilde{h}_{t} + (1 - f_{t}) ⊙ \hat{h}_{t-1}

(11)

In Equation (11), ⊙ represents the Hadamard product used in gating mechanisms. The lightweight structure of LCGRU is beneficial for speech enhancement tasks that require real-time response or resource-constrained deployment, such as smart campuses, public service spaces, and intelligent acoustic sensing terminals. Although the LCGRU model performs well in terms of processing speed and mitigation of vanishing gradients, in environments with extremely low signal-to-noise ratios or strong reverberation this lightweight architecture may over-smooth the output, losing some high-frequency speech features and even producing slight metallic artifacts in the reconstructed speech. A subsequent noise reduction module is therefore needed to repair these defects.
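To make the data flow of Equations (6)–(11) concrete, the following sketch performs one LCGRU step on a single feature vector. It is not the authors' implementation: the one-dimensional "same" convolution along the feature axis, the odd kernel lengths, the scalar biases, the dictionary keys, and all numeric values are assumptions chosen for illustration.

```python
import math

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def conv_same(x, kernel):
    """1-D convolution with zero padding that keeps the input length (odd kernel assumed)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out

def self_gate(kernel, v):
    """sigma(W * v) ⊙ v, the weighting used in Equations (6)-(8)."""
    return [sigmoid(c) * a for c, a in zip(conv_same(v, kernel), v)]

def lcgru_step(x_t, x_prev, h_prev, p):
    x_hat = self_gate(p["W_x"], x_t)                # Equation (6)
    x_prev_hat = self_gate(p["W_x_prev"], x_prev)   # Equation (7)
    h_prev_hat = self_gate(p["W_h_prev"], h_prev)   # Equation (8)
    a = conv_same(x_hat, p["W_t"])
    b = conv_same(x_prev_hat, p["W_t_prev"])
    f_t = [sigmoid(u + v + p["b_f"]) for u, v in zip(a, b)]               # Equation (9)
    h_cand = [math.tanh(c + p["b_h"]) for c in conv_same(x_t, p["W_h"])]  # Equation (10)
    return [f * hc + (1.0 - f) * hp                                       # Equation (11)
            for f, hc, hp in zip(f_t, h_cand, h_prev_hat)]

# Hypothetical parameters and inputs for a 4-bin feature vector.
params = {"W_x": [0.1, 0.5, 0.1], "W_x_prev": [0.2, 0.3, 0.2], "W_h_prev": [0.0, 1.0, 0.0],
          "W_t": [0.3, 0.3, 0.3], "W_t_prev": [0.1, 0.1, 0.1], "W_h": [0.5, 0.0, -0.5],
          "b_f": 0.0, "b_h": 0.0}
h_t = lcgru_step([0.2, 0.8, 0.4, 0.1], [0.1, 0.7, 0.5, 0.0], [0.0, 0.3, -0.2, 0.1], params)
```

Every weight here is a short kernel rather than a dense matrix, so the parameter count does not grow with the feature dimension, which is the source of the "lightweight" property the text describes.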

2.2. Post-Processing Denoising Techniques Based on the Wave-SkipConvNet Model

The LCGRU model separates multi-channel noise from speech, but residual noise remains after processing, so post-processing noise reduction is needed to improve the quality of the target speech. For this purpose, the study introduced U-Net, a mapping-based noise removal method. The denoising process includes two phases, namely the training phase and the enhancement phase. The training phase learns the nonlinear relationship between clean and noisy speech to obtain the denoising model [28]. The enhancement phase removes noise and produces the enhanced speech signal. The U-Net structure is shown in Figure 4.

In Figure 4, the U-Net is an encoder–decoder (ED) network. In each encoder layer, downsampling reduces the spatial resolution while adding the corresponding feature information. The specific convolution operation is shown in Equation (12).

F_{enc}^{l} = δ(W_{l} * F_{enc}^{l-1} + b_{l})

(12)

In Equation (12), F_{enc}^{l} represents the output feature map of the encoder at layer l, W_{l} represents the convolution kernel at layer l, b_{l} is the convolution bias term, and δ represents the ReLU activation function [29]. The decoder uses skip connections to merge the features produced by the encoder with the features of the corresponding decoder layer, while upsampling through deconvolution. The specific expression is shown in Equation (13).

F_{dec}^{l} = σ(W_{l}^{'} * F_{dec}^{l-1} + b_{l}^{'})

(13)

In Equation (13), F_{dec}^{l} represents the decoder output feature map at layer l, W_{l}^{'} represents the convolution kernel of the deconvolution layer, and b_{l}^{'} is the deconvolution bias term. The fused feature sequence is normalized and used as input to the next layer. This structure lets the network capture features at different resolutions and thereby achieve better output accuracy. The traditional U-Net mainly operates on amplitude spectra in the frequency domain, which effectively captures the strength characteristics of signals. However, frequency-domain methods cannot fully exploit the phase of speech signals, which carries key information about the signal’s time structure and waveform shape and plays a crucial role in signal recovery and speech enhancement. To overcome this limitation, an improved U-Net, namely Wave-SkipConvNet, is used in this study. Its design is inspired by U-Net but adapted to the characteristics of time-domain signals. Specifically, Wave-SkipConvNet improves the capture of time-domain features by introducing a resampling mechanism and by computing and fusing information from different time scales, which raises the network’s sensitivity to temporal changes and signal details. In addition, Wave-SkipConvNet exploits the long-term correlation of speech signals to further enhance the expressive power of the model. In terms of network structure, the skip connections of the U-Net are modified: the direct concatenation of the encoder output with the decoder is replaced by multiple convolution modules. The skip connection operation is expressed in Equation (14).

F_{fusion}^{l} = \operatorname{concat}(F_{enc}^{l}, F_{dec}^{l})

(14)

In Equation (14), F_{fusion}^{l} represents the feature map after skip connection fusion at layer l. This adjustment allows the decoder to directly use richer feature maps, rather than just the output of the encoder. The design of replacing direct skip connections with multiple convolutional modules was inspired in part by the SkipConvNet architecture proposed by Kothapally et al. [30]. However, the original SkipConvNet was designed for dereverberation based on smoothed spectral magnitude representations, whereas the proposed Wave-SkipConvNet is designed as a time-domain waveform enhancement module. This domain-level difference matters for the present task because the proposed framework aims to suppress residual noise after LCGRU-based separation while preserving waveform continuity, temporal fine structure, and phase-related reconstruction details. In addition, the present model embeds an LSTM bottleneck layer between the encoder and decoder to further enhance the transmission of dynamic temporal features. Although SkipConv blocks can enrich feature fusion, deep skip connections may suffer from phase discontinuity and unstable time-domain reconstruction when processing severely damaged waveforms, so the hyperparameters must be constrained more carefully during training. The network structure is shown in Figure 5.

As shown in Figure 5, Wave-SkipConvNet consists of a bottleneck layer, skip convolutional layers, and encoder and decoder (ED) layers. Like U-Net, it is an ED structure with multiple convolutional layers, in which the encoder performs downsampling and the decoder performs upsampling. The network has L layers, and the time resolution of each encoder layer is 1/2 that of the previous layer, so that high-dimensional features are computed on coarse time scales. At the same time, the upsampling blocks in the decoder compute local high-resolution features. Combining the features computed by the encoder and decoder enables multi-scale feature prediction and ensures that the decoder uses all temporal features to produce the enhanced output. In the ED structure of Wave-SkipConvNet, the upsampling layer uses linear interpolation along the time axis, and the downsampling layer discards every other time step. In this architecture, skip connections link the encoder and decoder, and each skip connection contains multiple SkipConv blocks. The SkipConv block is shown in Figure 6.
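The resampling described above (decimation by discarding every other time step, and upsampling by linear interpolation in time) can be sketched as follows. The output-length convention n → 2n − 1 for interpolation is an assumption of this sketch; the text does not state how the boundary is handled.

```python
def downsample(x):
    """Keep every other time step, halving the time resolution."""
    return x[::2]

def upsample_linear(x):
    """Insert the midpoint between neighbouring samples: length n becomes 2n - 1."""
    if not x:
        return []
    out = []
    for a, b in zip(x, x[1:]):
        out.append(a)
        out.append((a + b) / 2.0)
    out.append(x[-1])
    return out

signal = [0.0, 1.0, 2.0, 3.0, 4.0]
coarse = downsample(signal)          # [0.0, 2.0, 4.0]
restored = upsample_linear(coarse)   # [0.0, 1.0, 2.0, 3.0, 4.0]
```

Interpolation has no learnable parameters, so the decoder's learned convolutions operate on an already smooth upsampled sequence; for this linear ramp the round trip is exact, whereas a general waveform loses the detail above the coarser rate, which the skip connections are meant to supply.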

In Figure 6, the SkipConv block includes a one-dimensional convolution with a kernel size of 5 and a residual connection. The convolution does not change the input size. After the convolution, the output is normalized and passed to the next SkipConv block or to the decoder for feature learning. In the skip connections, the deeper the encoder layer, the fewer SkipConv blocks are used. The reason is that the feature dimension tends to become more complex with the increase in network layers, but the compatibility between high-level Encoder features and Decoder is low. Therefore, fewer transformations need to be performed on Decoder features to improve their compatibility. In addition, to capture the temporal dynamics of speech, three LSTM layers are inserted between the encoder and decoder of the Wave-SkipConvNet model to improve its separation performance. The LSTM layer uses a feature window of 7 past frames and 1 current frame, as shown in Figure 7.

In Figure 7, 8 frames of feature vectors are concatenated to form a long vector, which is used as the network input to extract temporal information from speech. In this way, the network considers the features within a time window at each time step, which improves its ability to process time-series data. Each LSTM layer contains 1024 feature units. The recurrent computation of the LSTM layer is shown in Equation (15).

h_{t} = \operatorname{LSTM}(h_{t-1}, F_{input}^{t})

(15)

In Equation (15), F_{input}^{t} represents the input feature at the current time. Overall, the LCGRU model first separates multi-channel noise from speech, and the Wave-SkipConvNet structure then performs post-processing noise reduction to suppress residual noise, so as to achieve effective speech and noise separation. By further suppressing residual noise and improving recovered speech quality, the Wave-SkipConvNet module enhances the applicability of the framework in acoustically complex and sustainability-oriented environments where clear speech communication is essential.
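The causal context window fed to the LSTM bottleneck (7 past frames plus the current frame, concatenated into one long vector) can be sketched as below. Zero-filling the frames that precede the start of the utterance is an assumption of this sketch, and the two-dimensional toy frames are hypothetical.

```python
def stack_context(frames, past=7):
    """For each step t, concatenate frames t-past ... t into one long vector.

    Frames before the start of the sequence are zero-filled (assumed convention).
    """
    dim = len(frames[0])
    zero = [0.0] * dim
    stacked = []
    for t in range(len(frames)):
        window = []
        for k in range(t - past, t + 1):
            window.extend(frames[k] if k >= 0 else zero)
        stacked.append(window)
    return stacked

# Ten toy frames of dimension 2 (hypothetical values).
frames = [[float(t), float(t) + 0.5] for t in range(10)]
stacked = stack_context(frames)   # each entry has 8 * 2 = 16 values
```

Because the window never reaches past the current frame, the stacked input preserves the causal, low-latency behaviour of the preprocessing stage.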

3. Results

This section first reports the speech and noise separation performance of the preprocessing model LCGRU under different SNRs, SS angles, and target SS angle errors, and compares it with several advanced models. The post-processing model Wave-SkipConvNet is then validated for its effectiveness in separating speech and noise, and the combined LCGRU–Wave-SkipConvNet model is compared with the standalone LCGRU model to verify its superior separation performance.

3.1. Experimental Dataset and Hardware Settings

The pure speech data used in the study was taken from the SiSEC database, with a total duration of approximately 30 h; the noise data was taken from the MUSAN corpus, including environmental sounds, background vocals, and broadband noise in urban environments. A total of 4000 mixed speech samples were generated under different SNRs, sound-source angles, and target angle errors. The samples were divided into non-overlapping training, validation, and testing sets at a ratio of 70%, 15%, and 15%, corresponding to 2800, 600, and 600 samples, respectively. The speakers in the testing set were strictly independent of those in the training and validation sets to ensure an objective evaluation of speaker-independent generalization. Considering that it is difficult to obtain absolutely pure target speech as a reference truth for loss calculation in real scene recording, the experiment uses the pyroomacoustics toolkit to generate room impulse responses based on the mirror sound source method. The training and testing of the model were completed on a unified hardware platform equipped with an Intel Core i9-10900K processor (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce RTX 3090 GPU with 24 GB memory (NVIDIA Corporation, Santa Clara, CA, USA), and 64 GB system memory. Room impulse responses were generated using the pyroomacoustics toolkit, version 0.7.3. The software environment was based on Python 3.8 and PyTorch 1.10 deep learning frameworks. The model optimization adopts the Adam optimizer, with an initial learning rate set to 0.001 and a batch size set to 16. In terms of data preprocessing, all speech and noise files are uniformly resampled to 16 kHz before mixing, and peak amplitude normalization is performed to prevent clipping. In addition, the selection of hyperparameters is determined through empirical grid search on the validation set to ensure optimal convergence. 
To ensure the reliability of the reported results, all models were independently trained five times with different random seeds. Unless otherwise specified, the numerical results are reported as mean ± standard deviation across independent training runs rather than across individual test samples. Statistical significance between the proposed model and comparison models was assessed using paired tests under the same testing conditions.
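The reporting protocol above (mean ± standard deviation over five seeds, followed by a paired test) can be illustrated with a short sketch. The text does not name the paired test, so a paired t-test is assumed here, and the per-seed scores are made-up illustrative values rather than results from the study.

```python
import math
import statistics

# Hypothetical per-seed scores for two models (illustrative values only).
# Index i refers to the same seed and test condition in both lists.
proposed = [3.0, 3.1, 2.9, 3.2, 3.0]
baseline = [2.6, 2.8, 2.7, 2.7, 2.7]


def mean_std(values):
    """Mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """t statistic of a paired t-test on matched observations."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


mean_p, std_p = mean_std(proposed)
t_stat = paired_t_statistic(proposed, baseline)

# Two-sided critical values of Student's t with n - 1 = 4 degrees of freedom.
T_CRIT_05 = 2.776  # p < 0.05
T_CRIT_01 = 4.604  # p < 0.01
significant_at_01 = abs(t_stat) > T_CRIT_01
```

With only five runs per model the test has four degrees of freedom, so the difference between paired runs must be large relative to its spread before p < 0.01 is reached.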

3.2. Evaluation Indicator Explanation

The experiment quantified the separation performance using four objective evaluation indicators. First, Perceptual Evaluation of Speech Quality (PESQ) was calculated using the wideband MOS-LQO mode of the ITU-T P.862.2 implementation at a sampling rate of 16 kHz. In the wideband setting, the PESQ MOS-LQO score is reported on an approximately 1.0–4.5 scale, where a higher value indicates better perceived speech quality. PESQ values were calculated using the Python PESQ package, version 0.0.4, in wideband mode for all compared models, and the reported values represent the mean ± standard deviation over five independent runs on the speaker-disjoint testing set. Second, Short-Time Objective Intelligibility (STOI) reflects the objective intelligibility of speech over short time segments and ranges from 0 to 1; a value closer to 1 indicates higher intelligibility. Third, the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) measures the energy ratio of the target signal to the residual noise and interference, independent of the overall scale of the estimate. Fourth, the Segmental Signal-to-Noise Ratio (segSNR) divides the signal into frames, computes the SNR of each frame, and averages the results, so it can effectively reflect improvements in local signal quality.
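The two energy-based indicators can be written down compactly. The sketch below follows the commonly used definitions of SI-SDR and segSNR; the 400-sample frame (25 ms at 16 kHz), the non-overlapping framing, and the clamping range of -10 to 35 dB are assumptions, since the text does not give the exact settings used in the experiments.

```python
import math


def si_sdr_db(reference, estimate):
    """Scale-invariant SDR in dB (signals are assumed to be zero-mean)."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy  # optimal scaling of the reference
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10 * math.log10(target_energy / residual_energy)


def seg_snr_db(reference, estimate, frame_len=400, floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [floor_db, ceil_db]."""
    eps = 1e-12  # avoids log of zero in silent or perfectly estimated frames
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        est = estimate[start:start + frame_len]
        sig = sum(r * r for r in ref)
        err = sum((r - e) ** 2 for r, e in zip(ref, est))
        snr = 10 * math.log10((sig + eps) / (err + eps))
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    return sum(frame_snrs) / len(frame_snrs)


# Toy check: a 200 Hz sine plus an orthogonal error at one tenth of its amplitude.
fs = 16000
reference = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
error = [0.1 * math.cos(2 * math.pi * 200 * n / fs) for n in range(fs)]
estimate = [r + e for r, e in zip(reference, error)]

si_sdr = si_sdr_db(reference, estimate)
si_sdr_rescaled = si_sdr_db(reference, [3.0 * x for x in estimate])
seg_snr = seg_snr_db(reference, estimate)
```

Rescaling the estimate leaves SI-SDR unchanged, which is the property that distinguishes it from a plain SNR; segSNR has no such invariance and penalizes level mismatches frame by frame.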

3.3. Analysis of Speech and Noise Separation Effect Based on the LCGRU Model

This study used the pyroomacoustics platform to generate the room impulse responses required for the experiments and evaluated the effectiveness of the proposed speech separation method on multi-channel microphone-array (MA) speech signals in these simulated environments. The testing and parameter settings are listed in Table 1. According to Table 1, the experiment was conducted in a room with dimensions of 10 m × 8 m × 5 m, which contained four microphones and four SSs located at different coordinates. In the data generation stage, each simulation scenario contained an interference source and a target source, and 3–5 SSs were required to verify the performance of the model. The sampling rate of the speech signal was 16 kHz. The clean SS signals were speech segments from the Signal Separation Evaluation Campaign (SiSEC) database, and the noise data came from the Music, Speech, and Noise (MUSAN) corpus. As described in Section 3.1, a total of 4000 mixed speech samples were generated under different SNRs, sound-source angles, and target angle errors, and were divided into training, validation, and testing sets at a ratio of 70%, 15%, and 15%, respectively.
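One way to obtain a 2800/600/600 split whose testing speakers never appear in training or validation is sketched below. The corpus layout (100 speakers with 40 mixtures each), the identifiers, and the function name are hypothetical; the text does not report the number of speakers or how samples were assigned.

```python
import random

# Hypothetical corpus layout: 100 speakers with 40 mixtures each (4000 total).
samples = [(f"spk{speaker:03d}", f"mix{speaker:03d}_{k:02d}")
           for speaker in range(100) for k in range(40)]


def speaker_disjoint_split(samples, n_val=600, n_test=600, seed=0):
    """Hold out whole speakers for testing, then split the rest by sample."""
    rng = random.Random(seed)
    speakers = sorted({spk for spk, _ in samples})
    rng.shuffle(speakers)

    # Add complete speakers to the testing set until it is large enough.
    test_speakers, test = set(), []
    for spk in speakers:
        if len(test) >= n_test:
            break
        test_speakers.add(spk)
        test.extend(s for s in samples if s[0] == spk)

    # Training and validation share speakers but not samples.
    remaining = [s for s in samples if s[0] not in test_speakers]
    rng.shuffle(remaining)
    val, train = remaining[:n_val], remaining[n_val:]
    return train, val, test


train, val, test = speaker_disjoint_split(samples)
```

With unequal numbers of mixtures per speaker the testing set would only approximate 600 samples, so the exact counts here depend on the assumed layout.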

To assess the robustness of the preprocessing model under different noise conditions, the study first examined the speech–noise separation performance of LCGRU and the comparison models across multiple SNR levels. The corresponding results are presented in Figure 8. As shown in Figure 8a, the LCGRU model had a notable advantage in PESQ over the other models at all tested SNRs. When the SNR was 0 dB, the PESQ score of the LCGRU model was 2.53 ± 0.06, which was 1.39 higher than that of U-Net, and the difference was statistically significant (p < 0.01). When the SNR was 15 dB, the PESQ score of the LCGRU model increased to 3.25 ± 0.05, significantly higher than those of the comparison models (p < 0.05). As shown in Figure 8b, when the SNR was 0 dB, the SI-SDR of the LCGRU model reached 10.68 ± 0.32 dB, which was 6.57 dB and 4.35 dB higher than that of the CNN-LSTM and CNN-GRU models, respectively (p < 0.01). When the SNR was 6 dB, the SI-SDR of the LCGRU model improved to 12.08 ± 0.26 dB, which was 7.83 dB higher than that of U-Net (p < 0.01). The LCGRU model therefore provided better speech–noise separation performance under all tested SNR conditions.

Overall, Figure 8 confirms that the LCGRU model can maintain superior separation quality across both low- and high-SNR conditions, indicating strong robustness to background noise intensity.

In practical environments, the angle between SSs also affects how easily they can be distinguished: the larger the angle between SSs, the easier they are to tell apart. To investigate the influence of the angle between the target and the interfering SS on speech–noise separation, an experiment was conducted in which the angles of the target SS and the interference SS relative to the MA were varied. The microphones were randomly arranged in the room, and the distance from the speaker to the center point of the microphones was set to 3–6 m. Two SSs were placed in the room, one as the target SS and the other as the interference SS, and the SNR was set to 0 dB. The separation performance of each model under different SS angles is shown in Figure 9. As shown in Figure 9a, the PESQ of each model gradually improved as the SS angle increased. When the angle between the SSs was 0–30°, the PESQ of the LCGRU model was 2.64 ± 0.05, which was 1.17 higher than that of the CNN-LSTM model, and the difference was statistically significant (p < 0.01). When the angle between the SSs was 90–180°, the PESQ of the LCGRU model improved to 3.08 ± 0.04, which was 0.79 higher than that of the CNN-GRU model (p < 0.01). In Figure 9b, the SI-SDR of each model gradually increased with the SS angle. When the SS angle was between 30° and 60°, the SI-SDR of the LCGRU model was 8.87 ± 0.28 dB. When the SS angle was between 90° and 180°, its value increased to 9.87 ± 0.21 dB, which was significantly higher than that of the other models (p < 0.01).
This indicated that the LCGRU model could effectively adapt to changes in the position of the SS and had strong robustness. In addition to noise intensity, the spatial separation between sound sources can also affect speech extraction performance in practical acoustic environments. To investigate whether the proposed model can adapt to changes in source positions, experiments were further conducted under different sound-source angles, as shown in Figure 9.
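The source-angle conditions can be made concrete with a little geometry. The sketch below computes the horizontal angle between a target and an interferer as seen from the array center and maps it to the four ranges reported in the experiments. The function names, the handling of range boundaries, and the example coordinates are assumptions for illustration only.

```python
import math


def azimuth_deg(array_center, source):
    """Horizontal-plane direction of `source` seen from `array_center`, in degrees."""
    dx = source[0] - array_center[0]
    dy = source[1] - array_center[1]
    return math.degrees(math.atan2(dy, dx))


def angular_separation_deg(array_center, target, interferer):
    """Smallest angle between the two source directions, in [0, 180]."""
    diff = abs(azimuth_deg(array_center, target)
               - azimuth_deg(array_center, interferer))
    return min(diff, 360.0 - diff)


def angle_bin(separation_deg):
    """Map a separation to the ranges used in the source-angle experiments."""
    if separation_deg < 30:
        return "0-30"
    if separation_deg < 60:
        return "30-60"
    if separation_deg < 90:
        return "60-90"
    return "90-180"


# Hypothetical positions (metres) inside a 10 m x 8 m floor plan.
center = (5.0, 4.0)
target = (8.0, 4.0)            # 3 m from the array center
wide_interferer = (5.0, 8.0)   # 4 m away, at a right angle to the target
close_interferer = (8.0, 5.0)  # about 3.2 m away, close to the target direction

wide = angular_separation_deg(center, target, wide_interferer)
close = angular_separation_deg(center, target, close_interferer)
```

Here the first interferer falls in the 90–180° range and the second in the 0–30° range, the two extremes discussed for Figure 9.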

The results in Figure 9 indicate that the LCGRU model is not only effective under varying noise levels but also robust to changes in spatial source distribution. To confirm the robustness of the LCGRU model to target speech angle errors, errors were introduced into the target speech coordinates to investigate their influence on the separation capability of the model. The separation performance of each model under different target SS angle errors is shown in Figure 10. In Figure 10a, the PESQ of each model gradually decreased as the target speech coordinate error increased, but the decline of the LCGRU model was notably smaller than that of the others. When the target speech coordinate error was 0°, the PESQ of the LCGRU model was 2.76 ± 0.03, significantly higher than that of the other models (p < 0.01). When the error increased to 15°, the PESQ of the model dropped to 2.63 ± 0.05, a decrease of only 0.13, whereas the PESQ of U-Net decreased by 0.42. As shown in Figure 10b, the SI-SDR of each model also gradually decreased as the target speech coordinate error increased. When the target speech coordinate error was 2.5°, the SI-SDR of the LCGRU model was 8.05 ± 0.17 dB, while that of the CNN-LSTM model was only 3.64 ± 0.21 dB, and the difference was statistically significant (p < 0.01). When the error increased to 10°, the SI-SDR of the LCGRU model dropped to 6.54 ± 0.24 dB, and that of the CNN-LSTM model dropped to 2.36 ± 0.27 dB. The LCGRU model could thus maintain high PESQ and SI-SDR even when there were errors in the target speech coordinates, providing more reliable speech separation performance.

The trends observed in Figure 10 further demonstrate that the LCGRU model maintains relatively stable performance even when target angle estimation is imperfect, which enhances its practical applicability. These findings indicate that the LCGRU model can maintain stable speech–noise separation performance under acoustically adverse conditions, which is important for sustainable acoustic environments requiring robust speech extraction from dynamically changing background noise. In practical terms, this capability is relevant to intelligent sound monitoring in public buildings, transportation spaces, and other noise-sensitive environments.

3.4. Performance Analysis of the Post-Processing Denoising Model Based on Wave-SkipConvNet

To confirm the advantage of the Wave-SkipConvNet model in post-processing denoising, three comparison models were trained and compared with it. In Model 1, the bottleneck layer is a convolutional layer and the skip connections are direct connections. In Model 2, the bottleneck layer is an LSTM layer and the skip connections are direct connections. In Model 3, the bottleneck layer is a convolutional layer and the skip connections are SkipConv blocks. In Model 4, the bottleneck layer is an LSTM layer and the skip connections are SkipConv blocks; this is the Wave-SkipConvNet model. The experiment investigated the separation performance of the four models under different SNRs, with an SNR range of 0–30 dB. The evaluation metrics were Short-Time Objective Intelligibility (STOI) and Segmental Signal-to-Noise Ratio (segSNR). STOI evaluates how well listeners can understand speech after noise cancellation, speech enhancement, or separation processing, and segSNR evaluates signal quality in speech enhancement or separation tasks. The separation performance of each post-processing model under different SNRs is shown in Figure 11. In Figure 11a, the STOI of each model gradually increased with the SNR, and the STOI of the Wave-SkipConvNet model was higher than that of the other models at all tested SNRs. When the SNR was 30 dB, the STOI of the Wave-SkipConvNet model reached 0.986 ± 0.006, whereas that of Model 1 was 0.921 ± 0.012. As shown in Figure 11b, the segSNR of each model gradually rose with the SNR. When the SNR was 6 dB, the segSNR of the Wave-SkipConvNet model was 13.96 ± 0.38 dB. When the SNR increased to 24 dB, the segSNR improved to 18.87 ± 0.41 dB. In comparison, the segSNR of Model 3 was 9.04 ± 0.35 dB at 6 dB SNR and increased to 14.31 ± 0.39 dB at 24 dB SNR, which remained lower than that of the Wave-SkipConvNet model.
The Wave-SkipConvNet model proposed in this study therefore showed a clear advantage in post-processing denoising. After validating the preprocessing model, the study further evaluated whether the proposed post-processing module could effectively suppress residual noise and improve speech intelligibility. For this purpose, four post-processing structures were compared under different SNR conditions, as shown in Figure 11.
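The four post-processing structures form a 2 × 2 design over the bottleneck type and the skip-connection type. A small sketch makes explicit which design factor each pairwise comparison isolates; the class and function names are hypothetical and only encode the descriptions given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostProcessingVariant:
    name: str
    bottleneck: str  # "conv" or "lstm"
    skip: str        # "direct" or "skipconv"


variants = [
    PostProcessingVariant("Model 1", bottleneck="conv", skip="direct"),
    PostProcessingVariant("Model 2", bottleneck="lstm", skip="direct"),
    PostProcessingVariant("Model 3", bottleneck="conv", skip="skipconv"),
    # Model 4 is the Wave-SkipConvNet configuration.
    PostProcessingVariant("Model 4", bottleneck="lstm", skip="skipconv"),
]


def changed_factors(a, b):
    """List the design factors in which two variants differ."""
    return [f for f in ("bottleneck", "skip") if getattr(a, f) != getattr(b, f)]


model_1, model_2, model_3, model_4 = variants
bottleneck_effect = changed_factors(model_4, model_3)  # LSTM vs. convolution
skip_effect = changed_factors(model_4, model_2)        # SkipConv vs. direct
both_changed = changed_factors(model_4, model_1)
```

Read this way, the comparison with Model 3 attributes gains to the LSTM bottleneck, the comparison with Model 2 attributes gains to the SkipConv blocks, and Model 1 differs from Wave-SkipConvNet in both factors.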

Therefore, Figure 11 demonstrates that the Wave-SkipConvNet design is effective for post-processing denoising and provides clear benefits in both intelligibility and signal quality. The study next verified the separation performance of the models at different SS angles, with an SNR of 10 dB and an SS angle error of 0°. The experimental results are shown in Figure 12. According to Figure 12a, when the angle between the SSs was 0–30°, the STOI of the Wave-SkipConvNet model was 0.935 ± 0.015, which was 0.049 higher than that of Model 2, and the difference was statistically significant (p < 0.01). When the angle between the SSs was between 90° and 180°, the STOI of the Wave-SkipConvNet model increased to 0.981 ± 0.011, while that of Model 2 was only 0.941 ± 0.014 (p < 0.01). Because Model 2 differs from Wave-SkipConvNet only in its direct skip connections, this indicates that using SkipConv blocks as skip connections improves the separation of speech and noise. As shown in Figure 12b, when the SS angle was between 30° and 60°, the segSNR of the Wave-SkipConvNet model was 15.02 ± 0.45 dB. When the SS angle was between 90° and 180°, its value increased to 16.48 ± 0.42 dB, which was significantly higher than that of Model 3 (p < 0.01). Model 3 uses a convolutional bottleneck with SkipConv skip connections, so this comparison isolates the bottleneck: the LSTM bottleneck of Wave-SkipConvNet outperformed the convolutional bottleneck, which is consistent with the ability of LSTM layers to capture long-term dependencies in time-series data such as speech. The improvement in STOI and segSNR suggests that the proposed post-processing strategy can enhance speech clarity and listener comprehension in practical acoustic scenes. This is particularly valuable in public acoustic environments where speech accessibility and communication quality are closely related to service efficiency and environmental comfort.
Following the SNR-based comparison in Figure 11, the study next examined whether the post-processing models remained effective when the relative positions of sound sources changed. The corresponding separation results under different sound-source angles are shown in Figure 12.

These findings suggest that the proposed post-processing strategy preserves its advantage not only under varying noise intensities but also under changing spatial acoustic configurations. The study further validated the separation performance of the post-processing models under different target SS angle errors. The SNR was set to 10 dB, and the SS angle was set to 60–90°. The experimental results are shown in Figure 13. In Figure 13a, the STOI of each post-processing model gradually decreased as the target SS angle error increased. When the target SS angle error was 0°, the STOI of the Wave-SkipConvNet model was 0.974 ± 0.010, whereas the STOI values of Models 1, 2, and 3 were 0.850 ± 0.018, 0.894 ± 0.016, and 0.951 ± 0.012, respectively. When the error increased to 15°, the STOI of the Wave-SkipConvNet model remained at 0.946 ± 0.013, while that of Model 1 decreased to 0.761 ± 0.021. As shown in Figure 13b, when the target SS angle error was 5°, the segSNR of the Wave-SkipConvNet model reached 17.84 ± 0.43 dB. The corresponding values of Models 1, 2, and 3 were 3.64 ± 0.29 dB, 7.14 ± 0.33 dB, and 9.87 ± 0.36 dB, respectively. When the error increased to 12.5°, the segSNR of the Wave-SkipConvNet model decreased to 15.88 ± 0.40 dB, whereas that of Model 2 decreased to 4.67 ± 0.31 dB. These results indicate that the proposed post-processing model maintained stronger robustness under target-angle estimation errors. To further evaluate robustness in more realistic acoustic scenarios, the study introduced target sound-source angle errors into the post-processing evaluation. The resulting STOI and segSNR trends are presented in Figure 13.

Figure 13 confirms that the Wave-SkipConvNet model remains more stable than the comparison models when localization uncertainty is introduced. The study finally evaluated the separation performance of the front-end speech–noise separation model LCGRU combined with the post-processing denoising model Wave-SkipConvNet. The LCGRU model alone was compared with the model combining LCGRU and Wave-SkipConvNet under different SNRs, SS angles, and target SS angle errors. The evaluation indicators were PESQ, STOI, and segSNR. The results under the different conditions are shown in Figure 14. As shown in Figure 14a, compared with the standalone LCGRU model, the integrated LCGRU–Wave-SkipConvNet framework achieved consistently higher PESQ values under different SNR conditions. To avoid overemphasizing a single condition-specific peak value, the average PESQ across the tested conditions is reported. The proposed framework achieved an average PESQ of 3.45 ± 0.04, which was higher than that of the standalone LCGRU model and the other comparison models. This result indicates that the post-processing Wave-SkipConvNet module effectively improved perceived speech quality after LCGRU-based preliminary separation. As shown in Figure 14b, the STOI of the LCGRU–Wave-SkipConvNet model remained higher than 0.95 under different sound-source angles, which was significantly higher than that of the LCGRU model alone (p < 0.01). As shown in Figure 14c, the segSNR of the LCGRU–Wave-SkipConvNet model was higher than that of the standalone LCGRU model under different target sound-source angle errors. When the error was 10°, the segSNR of the LCGRU–Wave-SkipConvNet model was 21.90 ± 0.25 dB, whereas that of the standalone LCGRU model was 18.84 ± 0.30 dB (p < 0.01).
After separately validating the preprocessing and post-processing modules, the study finally evaluated the overall effectiveness of the integrated LCGRU–Wave-SkipConvNet framework. To determine whether the combined design offers additional benefits over the standalone LCGRU model, comparative experiments were conducted under different SNRs, sound-source angles, and target angle errors, as shown in Figure 14.

Taken together, Figure 14 demonstrates that the integration of front-end separation and back-end denoising yields a more robust and practically valuable framework for complex acoustic environments. Overall, the integrated framework demonstrates stronger robustness than the standalone preprocessing model, indicating its potential as a system-level solution for smart acoustic sensing and sustainable environmental speech enhancement.

3.5. Computational Complexity of the Model and Analysis of Ablation Experiments

To evaluate the practicality and comparative performance of the proposed framework, a comprehensive comparison was conducted with several representative speech enhancement and separation models, including U-Net [31], CNN-GRU, CNN-LSTM [32], Conv-TasNet [33], FullSubNet [34], GTCRN [35], and SepFormer [36]. segSNR was also included in Table 2 because it was used as one of the main objective metrics throughout the experiments. All compared models were trained or adapted using the same sampling rate, dataset split, optimizer setting, and evaluation scripts as the proposed model. The reported values were obtained from five independent runs with different random seeds. As shown in Table 2, the proposed LCGRU–Wave-SkipConvNet framework achieved the best overall balance between speech separation performance and computational efficiency.

Although SepFormer achieved competitive performance due to its Transformer-based sequence modeling ability, it required substantially higher computational cost, including larger FLOPs, higher RTF, and greater peak memory usage. FullSubNet and Conv-TasNet also showed strong results, but their PESQ, SI-SDR, and segSNR values remained lower than those of the proposed framework. Compared with these baselines, LCGRU–Wave-SkipConvNet achieved higher PESQ, STOI, SI-SDR, and segSNR while maintaining fewer parameters and lower inference cost. These results indicate that the proposed framework is more suitable for deployment-oriented speech enhancement in sustainable acoustic environments. To further clarify the contribution of each key component, additional ablation experiments were conducted under SNR = 5 dB. A standard fully connected GRU was used to replace the convolutional gates while keeping the remaining architecture unchanged, and 3-frame, 5-frame, 7-frame, and 9-frame past windows were compared to verify the rationality of the 7-frame setting. The results are shown in Table 3.
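The real-time factor (RTF) used in the efficiency comparison above is conventionally the processing time divided by the duration of the processed audio, so a value below 1 means the model runs faster than real time. The sketch below measures it for a placeholder function; `placeholder_enhancer` is a hypothetical stand-in and does not represent any of the compared models, and the measurement protocol actually used in the study is not described in the text.

```python
import time


def real_time_factor(process, audio, sample_rate):
    """Wall-clock processing time divided by the duration of the audio."""
    start = time.perf_counter()
    process(audio)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)


def placeholder_enhancer(audio):
    """Stand-in for a model's forward pass (does no real enhancement)."""
    return [0.5 * x for x in audio]


# One second of silence at 16 kHz as a dummy input.
rtf = real_time_factor(placeholder_enhancer, [0.0] * 16000, sample_rate=16000)
```

In practice RTF depends on the hardware, batch size, and whether warm-up runs are excluded, so values are only comparable when measured under the same conditions.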

As shown in Table 3, removing the LCGRU preprocessing module caused a clear performance decrease, confirming the importance of front-end causal separation. More importantly, when the convolutional gates were replaced by a standard fully connected GRU, all evaluation metrics decreased. This result indicates that the improvement of LCGRU is not only due to recurrent temporal modeling but also closely related to the convolutional gating mechanism, which better preserves local acoustic structure during sequence modeling. In addition, removing the LSTM bottleneck or replacing SkipConv with standard skip connections also reduced the overall performance, demonstrating the complementary roles of temporal bottleneck modeling and multi-scale skip feature fusion. The temporal-window ablation further shows that the 7-frame past window provided the best overall performance. A shorter window, such as 3 or 5 past frames, may not provide sufficient temporal context for residual noise suppression, whereas a longer 9-frame window did not bring further improvement and may introduce redundant temporal information. Therefore, the 7-frame past window was retained as the final setting because it achieved the best balance between temporal context modeling and computational efficiency.
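The past-window setting in the ablation can be pictured as a causal context: for each frame, the model sees only that frame and a fixed number of earlier frames, with no look-ahead. The sketch below assumes that a 7-frame past window means seven previous frames plus the current one, with zero padding at the start of the signal; the text does not specify either detail, and the function name is hypothetical.

```python
def causal_context(frames, t, n_past=7):
    """Frames t - n_past .. t (no look-ahead), zero-padded before the first frame."""
    zero = [0.0] * len(frames[0])
    return [frames[i] if i >= 0 else zero for i in range(t - n_past, t + 1)]


# Ten toy feature frames of size 3; frame i is filled with the value i.
frames = [[float(i)] * 3 for i in range(10)]

context_at_start = causal_context(frames, t=2)  # needs five padded frames
context_at_end = causal_context(frames, t=9)    # frames 2 to 9, no padding
```

Under this reading, widening the window from 7 to 9 past frames adds two older frames per step, which increases the input size without adding any future information.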

4. Discussion

4.1. Technical Performance Analysis of the Proposed Framework

The experimental results demonstrate that the proposed framework achieved consistent advantages in speech–noise separation under different acoustic conditions. In the preprocessing stage, the LCGRU model showed better performance than the comparison models in terms of PESQ and SI-SDR, especially under low-SNR and variable source-angle conditions. This advantage can be attributed to the combined use of convolutional operations and gated recurrent units. Specifically, the convolutional structure helps preserve and extract local time–frequency characteristics of noisy speech, while the recurrent gating mechanism improves temporal dependency modeling. Compared with conventional fully connected recurrent structures, LCGRU provides a more efficient way to capture sequential acoustic information while maintaining a relatively lightweight architecture [37].

In the post-processing stage, the Wave-SkipConvNet module further improved speech quality by suppressing residual noise that remained after the initial separation process. Its encoder–decoder structure, multi-scale feature fusion, and skip convolution design enhanced the recovery of useful speech details while reducing distortion. In addition, the introduction of temporal modeling components strengthened the network’s ability to handle dynamic acoustic patterns. As a result, the integrated LCGRU–Wave-SkipConvNet framework outperformed the standalone LCGRU model across multiple experimental conditions. These findings indicate that the two-stage design is effective not only because of stronger denoising capability, but also because it enables complementary processing between front-end separation and back-end enhancement.

Although the basic components are well established, this study validates the effectiveness of a specific lightweight collaborative architecture. On the one hand, it confirms the complementary enhancement mechanism of cross-domain features. The front-end LCGRU filters out high-energy noise in the time-frequency domain, while the back-end Wave-SkipConvNet performs fine-grained repair in the time domain. The ablation experiments support this complementarity, since removing or simplifying components of either stage reduced separation performance. On the other hand, it eases the usual trade-off between separation accuracy and computational cost. The proposed model has only 3.15 M parameters and requires 14.8 G FLOPs, while also showing lower RTF and peak memory usage than the included comparison models. These results suggest that the proposed framework provides a favorable balance between speech-enhancement performance and computational efficiency.

Compared with previous studies, this framework follows a similar technical logic in combining local feature extraction and sequence modeling but places more emphasis on the coordination between lightweight preprocessing and post-processing for residual noise. This design improves robustness and makes the framework more suitable for complex acoustic scenarios. Therefore, from a technical perspective, the proposed model provides a balanced solution in terms of separation accuracy, robustness, and practical deployment potential, which is highly consistent with the development trend of modern deep neural network speech enhancement technology in balancing speech naturalness and clarity [38].

4.2. Implications for Sustainable Acoustic Environments

Beyond algorithmic performance, the proposed framework has practical implications for sustainable acoustic environments. First, robust speech–noise separation can support acoustic quality improvement in public and shared spaces such as campuses, hospitals, offices, and transport facilities, where excessive background noise may reduce communication efficiency and environmental comfort [39,40]. Second, the lightweight design of the LCGRU module provides potential for low-resource or edge deployment in intelligent acoustic monitoring systems. As Paikrao et al. pointed out, neural network-based speech enhancement architecture can greatly promote personalized health monitoring and response efficiency of consumer electronics in scenarios such as smart hospitals [41]. Third, the enhanced speech intelligibility achieved by the combined framework may contribute to more inclusive and health-oriented sound environments, especially in scenarios that require clear public announcements, service communication, or continuous environmental sensing. Therefore, the proposed method is not only a speech processing model, but also a potential technical component of intelligent and sustainable sound environment management. From a sustainability perspective, clearer speech communication and more robust acoustic sensing can contribute to healthier, more accessible, and more efficient public environments.

4.3. Limitations and Future Work

Although the proposed framework achieved promising results in simulated acoustic settings, several limitations should be acknowledged. First, the current experiments were mainly conducted using simulated room impulse responses and benchmark datasets, and further validation in real-world urban and indoor acoustic environments is still needed. Second, the study focused on speech separation performance, while direct assessment of environmental sustainability indicators, user perception, or energy efficiency was not included. Future work will consider combining practical deployment experiments, more diverse environmental noise scenarios, and multimodal acoustic sensing strategies. For example, contextual visual cues from the surrounding environment could be integrated to help identify complex types of noise and thereby further support sustainable acoustic environment assessment and management.

5. Conclusions

This study proposed a lightweight LCGRU–Wave-SkipConvNet framework for speech–noise separation in sustainable acoustic environments. The LCGRU model effectively captured spatial and temporal acoustic features in the preprocessing stage, while the Wave-SkipConvNet model further improved speech quality through residual noise suppression in the post-processing stage. Experimental results demonstrated that the proposed framework consistently outperformed comparison models under different SNRs, source-angle conditions, and target angle errors. The findings suggest that the framework can serve as an effective and deployment-oriented solution for speech enhancement in acoustically complex scenes. In terms of deployment prospects, the lightweight two-stage framework is well matched to the practical needs of urban environments. Owing to its low memory usage and computing latency, the system could be integrated into the station announcement systems of urban public transportation or the automated interactive terminals of medical institutions to improve voice interaction in noisy environments. However, idealized acoustic simulation has limitations, as it cannot fully reproduce the nonlinear distortion of microphones and the complex environmental noise found in real physical spaces. Future research will focus on introducing complex real-world benchmark test sets containing severe nonlinear distortion and multi-speaker overlap to further enhance the model's generalization ability.

Author Contributions

Conceptualization, B.Z. and Y.L.; methodology, B.Z. and H.L.; software, Y.L. and B.Z.; validation, D.W. and Y.L.; formal analysis, Y.L.; investigation, B.Z.; resources, H.L.; data curation, Y.L. and B.Z.; writing—original draft preparation, Y.L. and B.Z.; writing—review and editing, H.L.; visualization, B.Z. and Y.L.; supervision, H.L.; project administration, H.L.; funding acquisition, H.L.; Y.L. and B.Z. contributed equally to this work and should be considered co-first authors. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Brain Korea 21 Program for Leading Universities and Students (BK21 FOUR) and the Pukyong National University Industry-University Cooperation Foundation’s 2025 Post-Doc. Support Project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, C.; Zhu, L.; Guo, C.; Liu, T.; Zhang, Z. Intelligent blind source separation technology based on OTFS modulation for LEO satellite communication. China Commun. 2022, 19, 89–99.
Li, S.; Cai, M.; Han, M.; Dai, Z. Noise reduction based on CEEMDAN-ICA and cross-spectral analysis for leak location in water-supply pipelines. IEEE Sens. J. 2022, 22, 13030–13042.
Hou, C.; Liu, G.; Tian, Q.; Zhou, Z.; Hua, L.; Lin, Y. Multisignal modulation classification using sliding window detection and complex convolutional network in frequency domain. IEEE Internet Things J. 2022, 9, 19438–19449.
Shi, T.; Qi, Y.; Wu, B. Hybrid free space optical communication and radio frequency MIMO system for photonic interference separation. IEEE Photon. Technol. Lett. 2022, 34, 149–152.
ISO 12913-1:2014; Acoustics—Soundscape—Part 1: Definition and Conceptual Framework. International Organization for Standardization: Geneva, Switzerland, 2014.
Tomashevskyy, O.; Tkachuk, O. Convolutional neural network-based sound source separation in the time-frequency domain. Comput. Syst. Inf. Technol. 2026, 1, 156–171.
Tambe, T.; Yang, E.Y.; Ko, G.G.; Chai, Y.; Hooper, C.; Donato, M.; Wei, G.Y. A 16-nm SoC for noise-robust speech and NLP edge AI inference with Bayesian sound source separation and attention-based DNNs. IEEE J. Solid-State Circuits 2023, 58, 569–581.
Zmolikova, K.; Delcroix, M.; Ochiai, T.; Kinoshita, K.; Černocký, J.; Yu, D. Neural target speech extraction: An overview. IEEE Signal Process. Mag. 2023, 40, 8–29.
Carrasco, V.; Arenas, J.P.; Huijse, P.; Espejo, D.; Vargas, V.; Viveros-Muñoz, R.; Poblete, V.; Vernier, M.; Suárez, E. Application of Deep Learning to Enforce Environmental Noise Regulation in an Urban Setting. Sustainability 2023, 15, 3528.
Sharma, B.K.; Kumar, M.; Meena, R.S. Development of a speech separation system using frequency domain blind source separation technique. Multimed. Tools Appl. 2024, 83, 32857–32872.
Xie, J.; Shi, Y.; Ni, D.; Milling, M.; Liu, S.; Zhang, J.; Schuller, B.W. Automatic bird sound source separation based on passive acoustic devices in wild environment. IEEE Internet Things J. 2024, 11, 16604–16617.
Xi, J.; Xu, Z.; Zhang, W.; Zhao, L.; Xie, Y. Speech Enhancement Algorithm Based on Microphone Array and Lightweight CRN for Hearing Aid. Electronics 2024, 13, 4394.
Cheong, S.; Kim, M.; Shin, J.W. Postfilter for Dual Channel Speech Enhancement Using Coherence and Statistical Model-Based Noise Estimation. Sensors 2024, 24, 3979.
Basir, S.; Hossain, M.N.; Hosen, M.S.; Ali, M.S.; Riaz, Z.; Islam, M.S. U-NET: A supervised approach for monaural source separation. Arab. J. Sci. Eng. 2024, 49, 12679–12691.
Sindhu, R. Speech enhancement using nested U-net with time frequency attention and D3 net. Multimed. Tools Appl. 2025, 84, 42155–42193.
Teng, J.; Zhang, C.; Gong, H.; Liu, C. Machine Learning-Based Urban Noise Appropriateness Evaluation Method and Driving Factor Analysis. PLoS ONE 2024, 19, e0311571.
Zeng, X.; Zhang, X.; Wang, M. A Feature Integration Network for Multi-Channel Speech Enhancement. Sensors 2024, 24, 7344.
Cherukuru, P.; Mustafa, M.B. CNN-Based Noise Reduction for Multi-Channel Speech Enhancement System with Discrete Wavelet Transform (DWT) Preprocessing. PeerJ Comput. Sci. 2024, 10, e1901.
Wu, H.; Liu, Y.; Tu, Y.; Sun, Y.; Gan, D.; Song, Y.; Rao, Y. Multi-source separation under two “blind” conditions for fiber-optic distributed acoustic sensor. J. Light. Technol. 2022, 40, 2601–2611.
Priebe, D.; Ghani, B.; Stowell, D. Efficient Speech Detection in Environmental Audio Using Acoustic Recognition and Knowledge Distillation. Sensors 2024, 24, 2046.
Zarei, F.; Nik-Bakht, M.; Lee, J.; Zarei, F. Urban-Scale Acoustic Comfort Map: Fusion of Social Inputs, Noise Levels, and Citizen Comfort in Open GIS. Processes 2024, 12, 2864.
Wang, J.; Hu, X. Convolutional neural networks with gated recurrent connections. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3421–3435.
Xi, J.; Xu, Z.; Zhang, W.; Xie, Y.; Zhao, L. Speech Enhancement Algorithm Based on Microphone Array and Multi-Channel Parallel GRU-CNN Network. Electronics 2025, 14, 681.
Yousif, S.T.; Mahmmod, B.M. Speech Enhancement Algorithms: A Systematic Literature Review. Algorithms 2025, 18, 272.
Ruan, H.; Liao, L.; Chen, K.; Lu, J. Speech Extraction under Extremely Low SNR Conditions. Appl. Acoust. 2024, 224, 110149.
Hao, F.; Li, X.; Zheng, C. X-TF-GridNet: A Time–Frequency Domain Target Speaker Extraction Network with Adaptive Speaker Embedding Fusion. Inf. Fusion 2024, 112, 102550.
Yang, Z.; Guan, S.; Zhang, X.-L. Deep Ad-Hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation. Speech Commun. 2022, 140, 87–97.
Li, Y.; Lu, S.; Mathé, P.; Pereverzev, S.V. Two-layer networks with the ReLU k activation function: Barron spaces and derivative approximation. Numer. Math. 2024, 156, 319–344.
Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 807–814.
Kothapally, V.; Xia, W.; Ghorbani, S.; Hansen, J.H.; Xue, W.; Huang, J. SkipConvNet: Skip convolutional neural network for speech dereverberation using optimally smoothed spectral mapping. arXiv 2020, arXiv:2007.09131.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9351, pp. 234–241.
Li, Z.; Basit, A.; Daraz, A.; Jan, A. Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network. PLoS ONE 2024, 19, e0291240.
Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266.
Hao, X.; Su, X.; Horaud, R.; Li, X. FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6633–6637.
Rong, X.; Sun, T.; Zhang, X.; Hu, Y.; Zhu, C.; Lu, J. GTCRN: A Speech Enhancement Model Requiring Ultralow Computational Resources. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 971–975.
Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention Is All You Need in Speech Separation. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 21–25.
Meng, F.; Fan, X.Y.; Semnani, A.; Zhang, L.; Xu, J.; Zhao, P.; Zhang, Q. Reconstructing missing acoustic log with multilevel wavelet decomposition and gated recurrent unit networks. SPE J. 2025, 30, 5895–5912.
O’Shaughnessy, D. Speech Enhancement—A Review of Modern Methods. IEEE Trans. Hum. Mach. Syst. 2024, 54, 110–120.
Chen, P.; Dai, Y.; Zhen, M. Effects of thermal and acoustic environments on human comfort in urban and suburban campuses in the cold region of China. Environ. Sci. Pollut. Res. 2024, 31, 30735–30749.
Zhang, J.; Liu, C.; Luther, M.; Chil, B.; Zhao, J.; Liu, C. Students’ sound environment perceptions in informal learning spaces: A case study on a university campus in Australia. Eng. Constr. Archit. Manag. 2025, 32, 109–130.
Paikrao, P.D.; Mukherjee, A.; Ghosh, U.; Goswami, P.; Novak, M.; Jain, D.K.; Narwade, P. Data Driven Neural Speech Enhancement for Smart Healthcare in Consumer Electronics Applications. IEEE Trans. Consum. Electron. 2024, 70, 4828–4838.

Figure 1. Relative positions of the sound source (SS) and the microphone array (MA).

Figure 2. Schematic diagram of TCN structure.

Figure 3. LCGRU model structure.

Figure 4. Schematic diagram of U-Net structure.

Figure 5. Wave-SkipConvNet network architecture.

Figure 6. SkipConv network framework.

Figure 7. LSTM network structure with 7-frame feature window.

Figure 8. Separation performance of the models under different SNRs.

Figure 9. Separation performance of the models at different SS angles.

Figure 10. Separation performance of the models under different target SS angle errors.

Figure 11. Separation performance of the post-processing models under different SNRs.

Figure 12. Separation performance of the post-processing models at different SS angles.

Figure 13. Separation performance of the post-processing models under different target SS angle errors.

Figure 14. Separation performance of the models under different conditions.

Table 1. Experimental environment and parameter settings.

Parameter	Experimental Setup
Room size	10 m × 8 m × 5 m
Number of microphones	4
Active sources per mixture	2 (one target speech source and one interfering source)
Candidate source positions	3–5 source-position settings
Source-to-microphone distance	3–6 m
Sampling rate	16 kHz
Clean speech dataset	SiSEC
Noise dataset	MUSAN
SNR range	0–15 dB
Sound-source angle	0–180°
Target angle error	0–15°
Reverberation time (RT60)	0.2–0.4 s
Image-source order (max_order)	15
Speed of sound	343 m/s
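
Table 1 lists an SNR range of 0 to 15 dB for the mixtures. As an illustration only, the sketch below shows one common way to scale a noise signal so that a speech/noise pair reaches a chosen SNR before mixing. The function names `mix_at_snr` and `measure_snr_db` are hypothetical, the signals are plain Python lists, and nothing here is taken from the authors' data pipeline, which may differ (for example, by applying the scaling after room simulation).

```python
import math


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in x) / len(x)


def measure_snr_db(speech, noise):
    """SNR in dB defined as 10*log10(speech power / noise power)."""
    return 10 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` to reach `target_snr_db`, then add it to `speech`.

    Returns (mixture, scaled_noise). Assumes equal-length, non-silent inputs.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_s, p_n = power(speech), power(noise)
    if p_s == 0 or p_n == 0:
        raise ValueError("signals must be non-silent")
    # Solve p_s / (gain^2 * p_n) = 10^(target/10) for the noise gain.
    gain = math.sqrt(p_s / (p_n * 10 ** (target_snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


if __name__ == "__main__":
    # Toy signals, purely for demonstration (not SiSEC or MUSAN data).
    speech = [math.sin(0.1 * i) for i in range(1000)]
    noise = [((i * 7919) % 13 - 6) / 6 for i in range(1000)]
    for snr in (0, 5, 10, 15):  # values inside the range given in Table 1
        _, scaled = mix_at_snr(speech, noise, snr)
        print(snr, round(measure_snr_db(speech, scaled), 6))
```

The gain is derived from signal powers over the whole clip, so the resulting SNR is a global figure; segment-level SNR will vary with the speech envelope.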

Table 2. Comparison of comprehensive performance and computational complexity of different models.

Model	Main Structure	PESQ	STOI	SI-SDR (dB)	segSNR (dB)	Parameters (M)	FLOPs (G)	RTF	Peak Memory
U-Net	Encoder–decoder CNN	2.15 ± 0.12	0.82 ± 0.02	6.84 ± 0.45	9.75 ± 0.38	8.62	34.5	0.43	719
CNN-GRU	CNN + GRU	2.58 ± 0.09	0.86 ± 0.02	9.15 ± 0.38	11.84 ± 0.34	6.45	28.2	0.36	606
CNN-LSTM	CNN + LSTM	2.65 ± 0.08	0.87 ± 0.02	9.88 ± 0.35	12.31 ± 0.32	7.80	32.6	0.41	679
Conv-TasNet	Time-domain TCN	3.12 ± 0.06	0.91 ± 0.01	13.45 ± 0.28	14.26 ± 0.27	5.08	22.4	0.29	524
FullSubNet	Full-band/sub-band fusion	3.20 ± 0.05	0.92 ± 0.01	13.92 ± 0.24	14.73 ± 0.25	5.86	24.1	0.31	548
GTCRN	Grouped temporal convolutional recurrent network	3.04 ± 0.07	0.90 ± 0.02	12.86 ± 0.31	13.88 ± 0.29	2.42	12.6	0.21	437
SepFormer	Transformer-based separation	3.29 ± 0.05	0.93 ± 0.01	14.38 ± 0.23	15.05 ± 0.24	9.72	41.3	0.52	836
LCGRU–Wave-SkipConvNet	LCGRU + time-domain Wave-SkipConvNet	3.45 ± 0.04	0.94 ± 0.01	15.62 ± 0.21	15.96 ± 0.22	3.15	14.8	0.18	391

Note: Values are reported as mean ± standard deviation over five independent runs. RTF denotes the real-time factor, and peak memory was measured during inference under the same hardware environment.
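
For readers unfamiliar with the real-time factor, the sketch below uses the usual definition, processing time divided by audio duration, so that values below 1 indicate faster-than-real-time processing. The helper names `real_time_factor` and `seconds_to_process` are hypothetical, the 16 kHz default follows Table 1, and the timing method is an assumption rather than the authors' measurement protocol.

```python
import time


def real_time_factor(process, samples, sample_rate=16_000):
    """Time one call of `process(samples)` and divide by the audio duration."""
    if not samples:
        raise ValueError("samples must be non-empty")
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def seconds_to_process(rtf, audio_seconds):
    """Processing time implied by a given RTF for a clip of `audio_seconds`."""
    return rtf * audio_seconds


if __name__ == "__main__":
    # Stand-in "model": a trivial gain, used only to exercise the timer.
    one_second = [0.0] * 16_000
    print(real_time_factor(lambda x: [0.5 * v for v in x], one_second))
    # Reading Table 2: an RTF of 0.18 implies about 1.8 s for a 10 s clip.
    print(seconds_to_process(0.18, 10.0))
```

In practice a single timed call is noisy; averaging over several clips after a warm-up run gives a steadier estimate.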

Table 3. Ablation analysis of the LCGRU–Wave-SkipConvNet framework under SNR = 5 dB.

Network Configuration	Purpose of Ablation	PESQ	STOI	SI-SDR (dB)	segSNR (dB)
w/o LCGRU pre-processing	Remove the front-end separation module	2.76 ± 0.08	0.85 ± 0.02	10.64 ± 0.36	10.42 ± 0.35
Standard GRU instead of LCGRU gating	Test convolutional gating vs. fully connected GRU	3.18 ± 0.06	0.90 ± 0.02	13.72 ± 0.28	13.86 ± 0.27
w/o LSTM bottleneck	Remove temporal bottleneck modeling	3.02 ± 0.07	0.84 ± 0.03	12.91 ± 0.33	13.15 ± 0.31
SkipConv replaced by standard skip connection	Test multi-scale SkipConv fusion	2.95 ± 0.09	0.89 ± 0.02	12.54 ± 0.35	12.68 ± 0.33
3-frame past window	Test shorter temporal context	3.21 ± 0.06	0.91 ± 0.02	14.02 ± 0.27	14.16 ± 0.26
5-frame past window	Test medium temporal context	3.34 ± 0.05	0.93 ± 0.01	14.86 ± 0.24	15.02 ± 0.24
9-frame past window	Test longer temporal context	3.39 ± 0.05	0.93 ± 0.01	15.08 ± 0.23	15.21 ± 0.23
7-frame past window, proposed	Proposed temporal context setting	3.45 ± 0.04	0.94 ± 0.01	15.62 ± 0.21	15.96 ± 0.22

Note: All ablation variants were evaluated under SNR = 5 dB, and values are reported as mean ± standard deviation over five independent runs.
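
To make the ablation results easier to compare, the following sketch computes the drop in each metric relative to the proposed 7-frame configuration, using only the mean values copied from Table 3. The short variant labels and the helper names `drops` and `largest_drop` are hypothetical, and the standard deviations are ignored, so small differences between means should be read with that spread in mind.

```python
# Mean values copied from Table 3 (SNR = 5 dB): PESQ, STOI, SI-SDR (dB), segSNR (dB).
METRICS = ("PESQ", "STOI", "SI-SDR", "segSNR")
PROPOSED = (3.45, 0.94, 15.62, 15.96)
ABLATIONS = {
    "w/o LCGRU pre-processing": (2.76, 0.85, 10.64, 10.42),
    "standard GRU gating": (3.18, 0.90, 13.72, 13.86),
    "w/o LSTM bottleneck": (3.02, 0.84, 12.91, 13.15),
    "standard skip connection": (2.95, 0.89, 12.54, 12.68),
    "3-frame past window": (3.21, 0.91, 14.02, 14.16),
    "5-frame past window": (3.34, 0.93, 14.86, 15.02),
    "9-frame past window": (3.39, 0.93, 15.08, 15.21),
}


def drops(variant):
    """Per-metric drop of `variant` relative to the proposed configuration."""
    values = ABLATIONS[variant]
    return {m: round(p - v, 2) for m, p, v in zip(METRICS, PROPOSED, values)}


def largest_drop(metric):
    """Name of the ablation variant with the largest drop in `metric`."""
    return max(ABLATIONS, key=lambda name: drops(name)[metric])


if __name__ == "__main__":
    for name in ABLATIONS:
        print(f"{name:26s}", drops(name))
    for metric in METRICS:
        print(metric, "->", largest_drop(metric))
```

On these means, removing the LCGRU front end gives the largest drop in PESQ, SI-SDR, and segSNR, while removing the LSTM bottleneck gives the largest drop in STOI.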
