Article

Design of a Dual-Path Speech Enhancement Model

1 Intelligent Signal Processing Lab, Yonsei University, Wonju 26493, Republic of Korea
2 Department of Electric and Semiconductor Engineering, Gangneung-Wonju National University, Gangneung 25457, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6358; https://doi.org/10.3390/app15116358
Submission received: 9 May 2025 / Revised: 30 May 2025 / Accepted: 4 June 2025 / Published: 5 June 2025
(This article belongs to the Special Issue Application of Deep Learning in Speech Enhancement Technology)

Abstract

Although both noise suppression and speech restoration are fundamental to speech enhancement, many deep neural network (DNN)-based approaches tend to focus disproportionately on one, often overlooking the importance of their joint handling. In this study, we propose a dual-path architecture designed to balance noise suppression and speech restoration. The main path consists of an encoder and two specialized decoders: one dedicated to estimating the clean speech spectrum and the other to predicting a noise suppression mask. To reinforce the joint modeling of noise suppression and speech restoration, we introduce an auxiliary refinement path. This path consists of a separate encoder–decoder structure and is designed to further refine the enhanced speech by incorporating complementary information learned independently from the main path. By using this dual-path architecture, the model better preserves fine speech details while reducing residual noise. Experimental results on the VoiceBank + DEMAND dataset show that our model surpasses conventional methods across multiple evaluation metrics in the causal setup. Specifically, it achieves a PESQ score of 3.33, reflecting improved speech quality, and a CSIG score of 4.48, indicating enhanced intelligibility. Furthermore, it demonstrates superior noise suppression, achieving an SNRseg of 10.44 and a CBAK score of 3.75.

1. Introduction

Speech enhancement (SE) refers to the process of improving the quality and intelligibility of speech signals by reducing background noise and distortions. It plays a crucial role in various applications [1,2,3,4,5], including automatic speech recognition (ASR), speech coding, and communication systems, which often demand low-latency or real-time processing, and therefore, require causal models that operate without future information.
An effective SE model must achieve two fundamental objectives: removing unwanted background noise and restoring speech signals. By successfully addressing both aspects, an SE model can enhance speech clarity, intelligibility, and naturalness. Traditional SE techniques, including statistics-based methods, spectral subtraction, and Wiener filtering, have been widely used [6,7]. However, their effectiveness is limited in real-world scenarios with highly non-stationary noises, such as speech babble and wind noise. These types of noise exhibit complex spectral variations over time, whereas these traditional methods rely on strong stationarity assumptions, namely fixed noise statistics or time-invariant spectral properties. With the advent of deep learning, deep neural network (DNN)-based SE models have demonstrated superior performance, leveraging large-scale datasets and powerful feature extraction capabilities [8,9,10,11]. In particular, frequency-domain SE models have gained attention for their effectiveness even in real-world scenarios [12,13].
The multi-stage SE system, one of the most successful approaches among frequency-domain SE models, has demonstrated remarkable performance by sequentially refining different aspects of the speech signal [14,15,16]. This approach allows for more precise enhancement of frequency components, leading to improved speech intelligibility and quality. For example, Lee and Kang [14] adopt a two-stage approach, where the first stage estimates the magnitude spectrum, and the second stage refines both the magnitude and phase spectra. However, this sequential processing may introduce latency and limit feature interaction between stages, as each stage can only refine the output of the previous one rather than jointly optimizing both.
Building on this, researchers have further extended the multi-stage approach into a multi-branch structure, where multiple branches operate in parallel rather than sequentially [17,18,19]. This parallel processing allows different aspects of the speech signal to be handled in different branches. For example, Li et al. [17] utilize two decoders with one shared encoder for magnitude recovery and complex spectral detail compensation, thus enhancing SE performance without additional latency. On the other hand, Fu et al. [18] adopt two parallel branches (using two encoder–decoder pairs) operating in the complex and magnitude domains, respectively, equipped with attention modules.
However, recent studies have shown that some SE models, instead of improving performance, actually degrade ASR results when used as a preprocessing step, and the speech intelligibility metric scores tend not to improve [20]. This unexpected finding suggests that during the SE process, the model may overly focus on noise suppression, leading to the distortion of linguistic information essential for speech recognition. From this perspective, it is essential for SE models to achieve consistently high performance across various evaluation metrics.
In this paper, we propose a dual-path SE model, which extends the multi-branch architecture by introducing an auxiliary refinement path. This design enables the model to suppress noise and maintain speech components more effectively without sacrificing critical speech features, such as spectral tilt and formant information, or violating causal constraints. Experimental results show that our model achieves better performance across various noise reduction and speech restoration metrics than a single-path SE model. Notably, our model surpasses recently proposed SE models in all metrics, confirming its effectiveness.

2. Background

The Glance and Gaze Network (GaGNet) [17] introduces a collaborative two-branch framework, where the Glance branch focuses on noise suppression in the magnitude domain, while the Gaze branch restores fine spectral details in the complex domain. This division of tasks allows the model to effectively suppress background noise while preserving important speech characteristics. Similarly, Uformer [18], a U-Net-based model, employs a two-branch architecture that integrates dilated complex-valued and real-valued processing within a Conformer network. By incorporating time-frequency attention mechanisms and a hybrid encoder–decoder structure, Uformer captures long-range dependencies while maintaining real-time processing efficiency, making it well-suited for practical SE applications. Furthermore, Complex Nested U-Net with Two-Branch (CNUNet-TB) [19] builds upon the two-branch framework by incorporating multiscale feature extraction blocks, allowing the model to capture both local and global spectral information. Additionally, it utilizes complex masking and mapping techniques with a shared encoder structure, enhancing its ability to handle intricate speech signals and improve SE performance.
While these models demonstrate strong capabilities in both noise suppression and speech restoration, they often struggle to consistently balance these two objectives. In particular, aggressive noise suppression may come at the cost of speech distortion, while overly fine-grained reconstruction may retain residual noise. This trade-off can degrade speech intelligibility and naturalness, especially in highly non-stationary noise environments. Furthermore, not all of these studies clearly analyze the specific role of each branch in noise suppression or speech restoration, making it difficult to understand the source of performance trade-offs and optimize branch interactions effectively. These limitations—and the lack of a clear understanding of each branch’s role in previous models—underscore the need for a more flexible and interpretable architecture that can adaptively enhance speech quality while preserving linguistic clarity.

3. Proposed Multi-Branch SE Model

3.1. Baseline Model

Figure 1 shows the schemas of our baseline model. The baseline model is based on the Nested U-Net (U2-Net) architecture [19,21,22], which extends the standard U-Net by incorporating nested sub-U-Net structures within each layer. This design facilitates multi-scale feature extraction by efficiently capturing both global contextual information and local spectral details [21]. We call the sub-U-Net structure Complex Feature Extraction Units (CFEUs).
As shown in Figure 1, each encoder and decoder comprises a stack of six CFEUs, which are responsible for progressively extracting and refining spectral features. Each CFEU block consists of an INCONV layer, followed by a variable number of convolutional layers for the encoding path (E-CONV) and the decoding path (D-CONV). The number of convolutional layers $N_i$ at each depth is defined as $N_i = [6, 5, 4, 4, 4, 3]$. The INCONV layer serves as the initial transformation step and consists of a (1 × 1) convolution, layer normalization, and PReLU activation. The E-CONV layer is a variation of the INCONV layer, where the convolution kernel is changed to (2 × 3), with the dimensions corresponding to the time and frequency axes. The D-CONV layer mirrors the structure of the E-CONV layer but employs sub-pixel convolution for upsampling, facilitating high-resolution reconstruction. By stacking multiple CFEU blocks in both the encoder and decoder paths, the model is able to extract and process features at multiple scales, effectively handling complex speech signals.
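To make the CFEU building blocks concrete, the following minimal PyTorch sketch implements the shared INCONV/E-CONV pattern described above (convolution, layer normalization, PReLU) with causal padding on the time axis. It is a real-valued simplification for illustration only: channel counts, the padding scheme, and the example tensor shapes are assumptions, and the actual CFEUs operate on complex-valued features.

```python
import torch
import torch.nn as nn

# Depth schedule from the text: number of E-CONV/D-CONV layers per CFEU depth.
N_LAYERS_PER_CFEU = [6, 5, 4, 4, 4, 3]

class ConvUnit(nn.Module):
    """Shared INCONV / E-CONV pattern: convolution -> layer norm -> PReLU.
    INCONV uses a (1 x 1) kernel; E-CONV uses (2 x 3) on (time x frequency),
    padded so that no future time frames are used (causal)."""
    def __init__(self, in_ch, out_ch, n_freq, kernel=(1, 1)):
        super().__init__()
        pad_t, pad_f = kernel[0] - 1, kernel[1] // 2
        # ConstantPad2d order: (freq_left, freq_right, time_past, time_future)
        self.pad = nn.ConstantPad2d((pad_f, pad_f, pad_t, 0), 0.0)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel)
        self.norm = nn.LayerNorm(n_freq)   # normalizes over the frequency axis
        self.act = nn.PReLU()

    def forward(self, x):                  # x: (batch, channels, time, freq)
        return self.act(self.norm(self.conv(self.pad(x))))

# Example: one INCONV followed by one E-CONV; the (time, freq) shape is preserved.
x = torch.randn(1, 16, 100, 257)           # 100 frames, 257 frequency bins
y = ConvUnit(16, 16, 257, (2, 3))(ConvUnit(16, 16, 257, (1, 1))(x))
print(y.shape)                              # torch.Size([1, 16, 100, 257])
```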
In addition to these convolutional operations, the model incorporates CDDense and CCTFA layers [19], which are complex-valued extensions of DDense [23] and CTFA layers [22,24]. These specialized layers extract features in the complex domain, allowing for more effective SE in the time-frequency representation.
The detailed structure of CDDense can be found in Figure 1c. CDDense consists of an input convolution layer, an output convolution layer, and six Depthwise-Separable Convolution (DSCONV) layers. Each DSCONV layer is composed of a depthwise convolution followed by a pointwise convolution, layer normalization, and a PReLU activation function. This configuration ensures efficient and effective feature extraction with long-range dependencies.
The CCTFA module, illustrated in Figure 1b, consists of time attention (TA) and frequency attention (FA), each computed by averaging along the opposite axis and passing the result through two pairs of one-dimensional convolution layers and nonlinear activation functions (ReLU and Sigmoid). The final time-frequency attention map is obtained by element-wise multiplication of the TA and FA maps in the complex domain [19].
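The attention computation can be illustrated with a short real-valued sketch. It follows the TA/FA description above (average over the opposite axis, two 1-D convolutions with ReLU and Sigmoid, element-wise combination of the two maps), but the channel-reduction factor and the use of a global time average are simplifying assumptions; the actual CCTFA block is complex-valued and causal.

```python
import torch
import torch.nn as nn

class TimeFreqAttention(nn.Module):
    """Real-valued sketch of a CCTFA-style block. Time attention (TA) averages
    the feature map over frequency, frequency attention (FA) averages it over
    time; each passes through two 1-D convolutions with ReLU and Sigmoid, and
    the two maps are broadcast-multiplied to weight the input."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        def branch():
            return nn.Sequential(
                nn.Conv1d(channels, hidden, kernel_size=1), nn.ReLU(),
                nn.Conv1d(hidden, channels, kernel_size=1), nn.Sigmoid())
        self.time_att, self.freq_att = branch(), branch()

    def forward(self, x):                        # x: (batch, ch, time, freq)
        ta = self.time_att(x.mean(dim=3))        # average over freq -> (B, C, T)
        fa = self.freq_att(x.mean(dim=2))        # average over time -> (B, C, F)
        att = ta.unsqueeze(3) * fa.unsqueeze(2)  # outer product -> (B, C, T, F)
        return x * att                           # attention-weighted features

# Example usage on the same (batch, ch, time, freq) layout as above.
feats = torch.randn(1, 16, 100, 257)
print(TimeFreqAttention(16)(feats).shape)        # torch.Size([1, 16, 100, 257])
```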

3.2. Overall Architecture

Figure 2 illustrates the overall architecture of the proposed Dual-Path Complex Nested U-Net (DPCNU) model. It consists of two parallel processing paths—a main path and an auxiliary path—that collaboratively enhance noisy speech signals.
The enhancement process begins with the short-time Fourier transform (STFT) module, which converts the noisy time-domain signal $y(t)$ into a complex-valued spectrogram $Y(k, l)$. This spectrogram serves as the input to both the main and auxiliary paths, where each path extracts distinct yet complementary features through its respective encoder-bottleneck-decoder structure.
The main path consists of one encoder (Encoder 1), one bottleneck (Bottleneck 1), and two decoders (Decoder 1 and Decoder 2). This path plays the primary role in enhancing the speech signal by producing two separate estimates: (1) a noise suppression mask $\bar{M}(k, l)$, which is multiplied element-wise with $Y(k, l)$ to generate a denoised spectrogram; (2) a spectral estimate $\bar{S}(k, l)$, which directly represents an enhanced version of the target signal. The processed spectrograms are then transformed back into the time domain, producing $\bar{x}_M(t)$ and $\bar{x}_S(t)$.
The auxiliary path consists of a separate encoder (Encoder 2), bottleneck (Bottleneck 2), and decoder (Decoder 3). Unlike the main path, it focuses solely on estimating a noise suppression mask $\tilde{M}(k, l)$. This mask is also multiplied element-wise with $Y(k, l)$ to generate a filtered spectrogram, which is then transformed back into the time domain, producing $\tilde{x}(t)$. The auxiliary path serves as a complementary mechanism, refining noise suppression by capturing additional spectral cues.
During training, the main and auxiliary paths are trained independently. Training the two paths separately allows each to concentrate on a distinct aspect of the enhancement process, enabling the model to capture complementary information more effectively. However, during inference, their outputs are combined through element-wise summation and averaging, resulting in the final enhanced speech signal $\hat{x}(t)$. This fusion strategy effectively balances noise suppression and spectral integrity by leveraging the complementary strengths of both paths. More details regarding the design motivation and performance analysis are presented in Section 5.1.
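A minimal sketch of the inference-time fusion is given below. The exact fusion rule (here, averaging the three time-domain estimates from the masking, mapping, and auxiliary outputs) and the function names are illustrative assumptions; they do not reproduce the released implementation.

```python
import torch

def dual_path_inference(Y, main_path, aux_path, istft):
    """Inference-time fusion of the two paths. Y is the complex STFT of the
    noisy input; main_path returns a mask and a mapped spectrum, aux_path
    returns a second mask; the resulting time-domain signals are averaged."""
    M_main, S_main = main_path(Y)      # noise-suppression mask and mapped spectrum
    M_aux = aux_path(Y)                # auxiliary noise-suppression mask

    x_mask = istft(M_main * Y)         # masking estimate, main path
    x_map = istft(S_main)              # mapping estimate, main path
    x_aux = istft(M_aux * Y)           # masking estimate, auxiliary path

    # Element-wise summation and averaging of the path outputs.
    return (x_mask + x_map + x_aux) / 3.0
```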

3.3. Loss Function

We use a joint loss function to optimize our proposed model, composed of magnitude loss ($L_m$) and complex spectrogram loss ($L_c$), given by

$$L_m = \sum_{k,l} \left\| \, |\hat{X}(k,l)| - |X(k,l)| \, \right\|_1,$$

$$L_c = \sum_{k,l} \left\| \hat{X}_R(k,l) - X_R(k,l) \right\|_1 + \sum_{k,l} \left\| \hat{X}_I(k,l) - X_I(k,l) \right\|_1,$$

$$L = \lambda_1 L_m + \lambda_2 L_c,$$
where $|\hat{X}(k,l)|$ and $|X(k,l)|$ are the magnitude spectra of the enhanced and target speech signals in the time-frequency domain, respectively. $X_R(k,l)$ and $X_I(k,l)$ denote the real and imaginary components of $X(k,l)$, and $\|\cdot\|_1$ denotes the $\ell_1$ norm. $\lambda_1$ and $\lambda_2$ are the coupling coefficients determined considering the dynamic range of each term. We use 0.9 and 0.1 for $\lambda_1$ and $\lambda_2$, respectively; these values were determined empirically [13].
The magnitude loss is widely used in SE tasks as it encourages the model to approximate the spectral envelope of clean speech, which strongly correlates with perceptual quality. Meanwhile, the complex spectrogram loss complements this by accounting for both magnitude and phase components, thereby supporting the reconstruction of natural-sounding and intelligible speech.
To improve training stability and emphasize perceptually important low-frequency components, we apply a power compression with a compression factor of 0.2 to both input and target magnitude spectrograms. This operation effectively reduces the dynamic range of the spectral values, allowing the model to focus more on low-energy but informative regions such as harmonic structures in the lower frequency bands [25]. Additionally, the compression reduces the risk of exploding gradients from high-magnitude regions and promotes more balanced loss contributions across the frequency spectrum [25].
By jointly optimizing the compressed magnitude loss and the complex spectrogram loss, the model benefits from enhanced learning stability and improved fidelity in both the energy distribution and fine spectral details.
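For illustration, a minimal PyTorch sketch of the joint loss with power compression is shown below, assuming a compression factor of 0.2 and the coupling coefficients 0.9 and 0.1 from above; the choice of reduction (mean over all time-frequency bins) is an implementation assumption.

```python
import torch

def power_compress(spec, alpha=0.2):
    """Magnitude compression with exponent alpha while keeping the phase."""
    mag, phase = spec.abs().clamp(min=1e-8), spec.angle()
    return torch.polar(mag ** alpha, phase)

def se_loss(X_hat, X, lambda_m=0.9, lambda_c=0.1, alpha=0.2):
    """Joint loss L = lambda_1 * L_m + lambda_2 * L_c on power-compressed
    complex spectrograms (enhanced X_hat, clean target X)."""
    Xc_hat, Xc = power_compress(X_hat, alpha), power_compress(X, alpha)
    L_m = (Xc_hat.abs() - Xc.abs()).abs().mean()          # magnitude term
    L_c = (Xc_hat.real - Xc.real).abs().mean() + \
          (Xc_hat.imag - Xc.imag).abs().mean()            # complex term
    return lambda_m * L_m + lambda_c * L_c

# Example: random complex spectrograms of shape (batch, freq, frames).
X_hat = torch.randn(2, 257, 100, dtype=torch.complex64)
X = torch.randn(2, 257, 100, dtype=torch.complex64)
print(se_loss(X_hat, X))
```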

4. Experimental Setup

4.1. Dataset

We evaluated our proposed model using the VoiceBank + DEMAND (VBD) [26] dataset, a commonly used open dataset in recent SE research. The VBD dataset contains 11,572 training utterances spoken by 28 speakers and 824 test utterances spoken by two speakers. During the training, ten types of noise, including eight from the DEMAND database and two artificially generated ones, were added to the utterances at SNRs of 0, 5, 10, and 15 dB. For evaluation, we added five types of unseen noise from the DEMAND database to the test utterances at SNRs of 2.5, 7.5, 12.5, and 17.5 dB.

4.2. Implementation Details

All signals and noises had a sampling rate of 16 kHz, with a window length, hop size, and FFT length of 32 ms, 16 ms, and 512 samples, respectively. The Adam optimizer was used with a learning rate of 0.00075. The chunk size was 2 s, and the batch size was 8.
The kernel size of all convolution layers in the encoder/decoder is (2 × 3) on the (time axis × frequency axis), except for the INCCONV, OUTCCONV, and CCTFA blocks. The kernel size of INCCONV and OUTCCONV is set to 1.
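For reference, a short sketch of the analysis/synthesis configuration implied by these settings is given below; the Hann window is an assumption, since the window type is not specified.

```python
import torch

# STFT settings from Section 4.2: 16 kHz audio, 32 ms window (512 samples),
# 16 ms hop (256 samples), 512-point FFT. Hann window assumed.
SR, N_FFT, WIN, HOP = 16000, 512, 512, 256

def analysis(wave):
    """Complex spectrogram Y(k, l) of a (batch, samples) waveform."""
    return torch.stft(wave, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                      window=torch.hann_window(WIN), return_complex=True)

def synthesis(spec, length=None):
    """Inverse STFT back to the time domain."""
    return torch.istft(spec, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                       window=torch.hann_window(WIN), length=length)

# Round trip on a 2-second chunk (the training chunk size), batch of 8.
chunk = torch.randn(8, 2 * SR)
spec = analysis(chunk)                                 # (8, 257, frames), complex
print(synthesis(spec, length=chunk.shape[-1]).shape)   # torch.Size([8, 32000])
```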

4.3. Evaluation Metrics

In this paper, we employ a comprehensive set of evaluation metrics to assess SE performance from the perspectives of both noise suppression and speech restoration. Specifically, we use seven objective metrics: wideband perceptual evaluation of speech quality (WB-PESQ) [27]; mean opinion score (MOS) predictions for signal distortion (CSIG), background noise distortion (CBAK), and overall quality (COVL) [28]; short-time objective intelligibility (STOI) [29]; segmental signal-to-noise ratio (SNRseg) and word error ratio (WER). Among these, CBAK and SNRseg reflect the model’s ability to suppress background noise, while CSIG, STOI, and WER are more indicative of speech content restoration. WB-PESQ and COVL, on the other hand, serve as composite metrics that capture both aspects.
WB-PESQ evaluates perceptual speech quality by comparing the enhanced speech with a clean reference, with scores typically ranging from 0 (poor) to 4.5 (excellent). CSIG, CBAK, and COVL are MOS-based metrics that range from 1 to 5, where higher scores indicate better preservation of speech, more effective noise reduction, and improved overall quality, respectively. STOI assesses intelligibility based on time-frequency correlations between the clean and enhanced signals and is expressed as a percentage, where 100% represents perfect intelligibility. SNRseg measures the segmental signal-to-noise ratio on a frame-by-frame basis in decibels (dB), where higher values reflect stronger noise suppression. Finally, WER is computed using transcriptions generated by the Whisper model [30] and measures intelligibility by comparing enhanced speech transcriptions against those of the clean reference; lower scores indicate better performance.
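As an illustration of how the noise-suppression metrics are computed, the sketch below implements a conventional segmental SNR; the frame length, hop, and the standard clipping range of [-10, 35] dB are assumptions rather than the exact configuration used in our evaluation.

```python
import numpy as np

def snr_seg(clean, enhanced, frame_len=512, hop=256, eps=1e-10):
    """Segmental SNR in dB, averaged over frames of the clean/enhanced pair."""
    n = min(len(clean), len(enhanced))
    clean, enhanced = clean[:n], enhanced[:n]
    scores = []
    for start in range(0, n - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        noise_energy = np.sum((c - e) ** 2) + eps
        snr = 10.0 * np.log10(np.sum(c ** 2) / noise_energy + eps)
        scores.append(np.clip(snr, -10.0, 35.0))   # conventional clipping range
    return float(np.mean(scores))
```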

5. Results and Discussion

5.1. Ablation Test

We conducted an ablation test to evaluate the effectiveness of the proposed dual-path SE model. The results, summarized in Table 1 and Figure 3, compare four model configurations: a single-path masking-based model (M), a single-path mapping-based model (S), a single-path two-branch model (M + S) with a shared encoder and separate decoders, and the final dual-path SE model (Proposed) that integrates M and M + S.
In the case of M + S in Table 1, the pre-trained weights of model S are used as initial values in Encoder 1, Bottleneck 1, and Decoder 2. These weights are then fine-tuned during the retraining of the entire network. The decision to load the pre-trained weights was based on experimental results. More detailed implementation information, including the code and demo samples, can be found at https://github.com/seorim0/Dual_Path_Speech_Enhancement_Model (accessed on 7 May 2025).
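A hedged sketch of this initialization strategy is shown below; the module names, checkpoint path, and state-dict layout are illustrative assumptions and do not correspond to the repository's actual identifiers.

```python
import torch

def load_pretrained_s(model_ms, ckpt_path="pretrained_S.pt"):
    """Initialize Encoder 1 / Bottleneck 1 / Decoder 2 of the M + S model from a
    pre-trained mapping model S, then fine-tune the whole network. Assumes the
    checkpoint stores a plain state_dict with matching parameter names."""
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own_state = model_ms.state_dict()
    wanted = ("encoder1", "bottleneck1", "decoder2")   # hypothetical module names
    # Copy only parameters whose names and shapes match the target modules.
    matched = {k: v for k, v in pretrained.items()
               if k.startswith(wanted) and k in own_state
               and v.shape == own_state[k].shape}
    own_state.update(matched)
    model_ms.load_state_dict(own_state)
    return model_ms   # all parameters remain trainable for fine-tuning
```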
Table 1 shows that the M + S model improves WB-PESQ and COVL, enhancing the overall speech quality. However, its lower CBAK and SNRseg scores indicate weakened background noise suppression and segmental SNR preservation. To analyze this trade-off, we examined the three key components contributing to CSIG, CBAK, and COVL: weighted spectral slope (WSS), log-likelihood ratio (LLR), and frequency-weighted segmental SNR (fwSNRseg). Unlike SNRseg, which measures frame-wise SNR uniformly across all frequency bands, fwSNRseg places more emphasis on the perceptually important frequencies, making it more aligned with human auditory perception. The results are illustrated in Figure 3.
As shown in Figure 3, the M + S model, which uses a two-branch decoder and is trained jointly, achieves a higher fwSNRseg and a significantly better WSS score than M and S, each of which uses a one-branch decoder. This indicates improved noise suppression in perceptually important bands and better preservation of formant locations [31]. However, M + S yields a worse LLR score than either M or S, suggesting increased spectral distortion, particularly in formant amplitude and spectral tilt [31]. This result is unexpected, and the degraded LLR explains why the jointly trained M + S model shows no noticeable improvement in CSIG, CBAK, STOI, and SNRseg over the one-branch models.
These observations motivated the design of our proposed dual-path model (Proposed), which combines the masking-based model (M) with the two-branch model (M + S) to address the limitations identified in the latter. The final model was constructed by independently training M and M + S and then integrating their learned features. We believe this separation plays a crucial role in the effectiveness of our model, as can be clearly observed in Figure 3. Notably, the model was primarily designed to achieve high-quality SE by jointly addressing noise suppression and speech intelligibility.
As shown in Table 1, WB-PESQ increases by 0.11, reflecting improved speech quality, while CSIG, CBAK, and COVL improve by approximately 0.07, 0.06, and 0.1, respectively. STOI also increases by 1%, confirming better speech clarity. Additionally, as shown in Figure 3, the auxiliary path mitigates the LLR degradation observed in the M + S model, stabilizing the preservation of spectral tilt and formant amplitude while enhancing fwSNRseg (16.61 in Figure 3) and SNRseg (10.44 in Table 1). Although the exact cause of the LLR degradation is not fully clear, joining two independently trained single-path models into a dual-path model clearly yields an improvement, since the outputs of the two paths can be averaged. This is also supported by the clear performance gains shown in Figure 3.
The proposed model’s WSS score increased slightly compared with M + S, indicating a relative decline in the preservation of the formant location. However, the improved LLR and fwSNRseg scores suggest that the auxiliary path enhances the modeling of spectral tilt and formant amplitude and reflects a more accurate reconstruction in the perceptually important frequency regions, which play a critical role in subjective speech quality. Consequently, the overall enhancement performance becomes more balanced. Additionally, the WER is further reduced, confirming that speech content is more effectively restored.
In summary, the proposed dual-path model achieves balanced improvements across all metrics. It records a WB-PESQ of 3.33 (+0.11), with CSIG, CBAK, and COVL improved by 0.07, 0.06, and 0.10, respectively, compared with the single-path masking model. STOI increases by 1%, and WER is further reduced. The model achieves a fwSNRseg of 16.61 and SNRseg of 10.44, indicating the effective preservation of the perceptually important frequency components. While the WSS slightly increased, better LLR and fwSNRseg confirm a more stable modeling of the spectral tilt and formant amplitude, leading to a more perceptually coherent enhancement.

5.2. Comparisons with Recently Proposed SE Models

We compared our proposed model with several recent SE models on the VBD dataset. Specifically, we selected models that ensure causality while demonstrating strong performance on VBD. The compared models are as follows: DEMUCS [32], CTS-Net [33], GaGNet [17], FullSubNet+ [34], NUNet-TLS [22], FRCRN [35], CompNet [36], FDFNet [37], and CNUNet-TB [19]. The parameter counts of the compared models were obtained from their respective reference papers; when they were unavailable, we left the corresponding fields blank. The comparison results are presented in Table 2, where the highest scores are highlighted in bold and the second-highest scores are underlined.
As shown in Table 2, achieving top performance across all metrics is challenging. For example, aside from our model DPCNU, FRCRN achieves the highest WB-PESQ and CBAK scores, whereas NUNet-TLS attains the highest CSIG and COVL scores. Additionally, CTS-Net and CompNet achieve comparable WB-PESQ scores, yet their CSIG and CBAK scores differ significantly. In contrast, DPCNU consistently outperforms all compared models. Specifically, DPCNU achieves a WB-PESQ score of 3.33, surpassing the next-best model, FRCRN, which records a score of 3.21. Regarding CSIG, CBAK, and COVL, DPCNU scores 4.48, 3.75, and 3.95, respectively. Additionally, it attains an STOI score of 96% and an SNRseg score of 10.44 dB, the highest among all evaluated models.
Among the compared models, GaGNet, NUNet-TLS, and CNUNet-TB are based on nested U-Net architectures, while CTS-Net, GaGNet, FDFNet, and CNUNet-TB adopt two-stage or two-branch designs. Compared with these models, DPCNU demonstrates substantial performance improvements, highlighting its effectiveness in SE.
Our model consists of 18.69 million parameters, making it relatively large compared to other recent SE models. Nevertheless, it achieves significantly higher scores across all evaluation metrics. Notably, the lightweight version, DPCNU(S), reduces the number of convolution kernels by half, resulting in 4.78 million parameters. Despite its compact size, it achieves the second-best performance across all metrics. This demonstrates that DPCNU(S) effectively balances computational efficiency and SE quality.
Although the proposed model demonstrates strong performance across multiple metrics, it has limitations. First, the use of two independent paths increases the model complexity and training time compared with one-branch networks. Second, the model is currently trained and tested on additive noise scenarios. Future work will focus on reducing the computational cost of the dual-path architecture through model compression and efficient training strategies, aiming for faster training and real-time deployment. In addition, we plan to evaluate the model’s robustness in more diverse acoustic scenarios, including reverberation, speaker overlap, and unseen environments.

6. Conclusions

In this paper, we proposed a dual-path causal SE model that extends the multi-branch architecture by incorporating an auxiliary refinement path. This structure allows for improved noise suppression while preserving critical speech components, addressing the common trade-offs observed in existing SE models. The experimental results demonstrate that our proposed model consistently outperforms conventional single-path SE models, including both one-branch and multi-branch SE models, across all evaluation metrics. By achieving superior scores across various objective measures, including speech quality, intelligibility, and noise suppression, our model shows potential for practical SE applications.

Author Contributions

Software, S.H.; Validation, S.W.P. and Y.P.; Writing—original draft, S.H., S.W.P. and Y.P.; Writing—review and editing, S.H., S.W.P. and Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in GitHub at https://github.com/jim-schwoebel/voice_datasets (accessed on 7 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:
SE: Speech enhancement
DNN: Deep neural network
ASR: Automatic speech recognition
CFEU: Complex feature extraction unit
CCTFA: Complex causal time-frequency attention
CDDense: Complex dilated dense network
DSCONV: Depthwise-separable convolution
FA: Frequency attention
TA: Time attention
STFT: Short-time Fourier transform
VBD: VoiceBank + DEMAND
WB-PESQ: Wideband perceptual evaluation of speech quality
MOS: Mean opinion score
CSIG: Composite metric for signal distortion
CBAK: Composite metric for background noise
COVL: Composite metric for overall quality
STOI: Short-time objective intelligibility
SNRseg: Segmental signal-to-noise ratio
WER: Word error ratio
dB: Decibel
M: A single-path masking-based model
S: A single-path mapping-based model
M + S: A single-path two-branch model (masking and mapping branches with a shared encoder)
WSS: Weighted spectral slope
LLR: Log-likelihood ratio
fwSNRseg: Frequency-weighted segmental SNR

References

  1. Ochiai, T.; Watanabe, S.; Hori, T.; Hershey, J. Multichannel end-to-end speech recognition. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 2632–2641. [Google Scholar]
  2. Li, J.; Sakamoto, S.; Hongo, S.; Akagi, M.; Suzuki, Y. Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication. Speech Commun. 2011, 53, 677–689. [Google Scholar] [CrossRef]
  3. Yamin, M.; Sen, A.A. Improving Privacy and Security of User Data in Location Based Services. Int. J. Ambient. Comput. Intell. (IJACI) 2018, 9, 19–42. [Google Scholar] [CrossRef]
  4. Dey, N.; Ashour, A.S.; Shi, F.; Fong, S.J.; Tavares, J.M.R.S. Medical Cyber-Physical Systems: A Survey. J. Med. Syst. 2018, 42, 74. [Google Scholar] [CrossRef] [PubMed]
  5. Das, N.; Chakraborty, S.; Chaki, J.; Padhy, N.; Dey, N. Fundamentals, Present and Future Perspectives of Speech Enhancement. Int. J. Speech Technol. 2021, 24, 883–901. [Google Scholar] [CrossRef]
  6. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 1109–1121. [Google Scholar] [CrossRef]
  7. Yousheng, X.; Jianwen, H. Speech enhancement based on combination of wiener filter and subspace filter. In Proceedings of the 2014 International Conference on Audio, Language and Image Processing, Shanghai, China, 7–9 July 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 459–463. [Google Scholar]
  8. Xu, Y.; Du, J.; Dai, L.R.; Lee, C.H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 23, 7–19. [Google Scholar] [CrossRef]
  9. Weninger, F.; Erdogan, H.; Watanabe, S.; Vincent, E.; Le Roux, J.; Hershey, J.R.; Schuller, B. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proceedings of the Latent Variable Analysis and Signal Separation: 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, 25–28 August 2015; Proceedings 12. Springer: Cham, Switzerland, 2015; pp. 91–99. [Google Scholar]
  10. Wang, D.; Chen, J. Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1702–1726. [Google Scholar] [CrossRef] [PubMed]
  11. Nuthakki, R.; Masanta, P.; Yukta, T.N. A Literature Survey on Speech Enhancement Based on Deep Neural Network Technique. In Proceedings of the 4th International Conference on Communications and Cyber Physical Engineering (ICCCE 2021), Hyderabad, India, 9–10 April 2021; Springer: Singapore, 2022; pp. 7–16. [Google Scholar] [CrossRef]
  12. Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv 2020, arXiv:2008.00264. [Google Scholar]
  13. Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based Metric GAN for Speech Enhancement. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022. [Google Scholar] [CrossRef]
  14. Lee, J.; Kang, H.G. Two-Stage Refinement of Magnitude and Complex Spectra for Real-Time Speech Enhancement. IEEE Signal Process. Lett. 2022, 29, 2188–2192. [Google Scholar] [CrossRef]
  15. Ju, Y.; Rao, W.; Yan, X.; Fu, Y.; Lv, S.; Cheng, L.; Wang, Y.; Xie, L.; Shang, S. TEA-PSE: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System for ICASSP 2022 DNS Challenge. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9291–9295. [Google Scholar] [CrossRef]
  16. Zhang, Z.; Zhang, L.; Zhuang, X.; Qian, Y.; Li, H.; Wang, M. FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9276–9280. [Google Scholar] [CrossRef]
  17. Li, A.; Zheng, C.; Zhang, L.; Li, X. Glance and gaze: A collaborative learning framework for single-channel speech enhancement. Appl. Acoust. 2022, 187, 108499. [Google Scholar] [CrossRef]
  18. Fu, Y.; Liu, Y.; Li, J.; Luo, D.; Lv, S.; Jv, Y.; Xie, L. Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7417–7421. [Google Scholar] [CrossRef]
  19. Hwang, S.; Park, S.W.; Park, Y. Causal Speech Enhancement Based on a Two-Branch Nested U-Net Architecture Using Self-Supervised Speech Embeddings. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  20. Cutler, R.; Saabas, A.; Naderi, B.; Ristea, N.C.; Braun, S.; Branets, S. ICASSP 2023 Speech Signal Improvement Challenge. IEEE Open J. Signal Process. 2024, 5, 662–674. [Google Scholar] [CrossRef]
  21. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  22. Hwang, S.; Park, S.W.; Park, Y. Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022. [Google Scholar]
  23. Xiang, X.; Zhang, X.; Chen, H. A Nested U-Net With Self-Attention and Dense Connectivity for Monaural Speech Enhancement. IEEE Signal Process. Lett. 2021, 29, 105–109. [Google Scholar] [CrossRef]
  24. Zhang, Q.; Qian, X.; Ni, Z.; Nicolson, A.; Ambikairajah, E.; Li, H. A Time-Frequency Attention Module for Neural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 462–475. [Google Scholar] [CrossRef]
  25. Li, A.; Zheng, C.; Peng, R.; Li, X. On the Importance of Power Compression and Phase Estimation in Monaural Speech Dereverberation. JASA Express Lett. 2021, 1, 014401. [Google Scholar] [CrossRef] [PubMed]
  26. Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152. [Google Scholar]
  27. Rix, A.; Beerends, J.; Hollier, M.; Hekstra, A. Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No.01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752. [Google Scholar] [CrossRef]
  28. Hu, Y.; Loizou, P.C. Evaluation of Objective Measures for Speech Enhancement. In Proceedings of the Interspeech 2006, Pittsburgh, PA, USA, 17–21 September 2006; pp. 1447–1450. [Google Scholar]
  29. Martin-Donas, J.M.; Gomez, A.M.; Gonzalez, J.A.; Peinado, A.M. A deep learning loss function based on the perceptual evaluation of the speech quality. IEEE Signal Process. Lett. 2018, 25, 1680–1684. [Google Scholar] [CrossRef]
  30. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; Mcleavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of Machine Learning Research, Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR: Cambridge, MA, USA, 2023; Volume 202, pp. 28492–28518. [Google Scholar]
  31. Loizou, P.C. Speech Enhancement: Theory and Practice; CRC Press: Boca Raton, FL, USA, 2007; pp. 479–487. [Google Scholar]
  32. Defossez, A.; Synnaeve, G.; Adi, Y. Real time speech enhancement in the waveform domain. arXiv 2020, arXiv:2006.12847. [Google Scholar]
  33. Li, A.; Liu, W.; Zheng, C.; Fan, C.; Li, X. Two Heads are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1829–1843. [Google Scholar] [CrossRef]
  34. Chen, J.; Wang, Z.; Tuo, D.; Wu, Z.; Kang, S.; Meng, H. FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 7857–7861. [Google Scholar]
  35. Zhao, S.; Ma, B.; Watcharasupat, K.; Gan, W. FRCRN: Boosting feature representation using frequency recurrence for monaural speech enhancement. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 9281–9285. [Google Scholar]
  36. Fan, C.; Zhang, H.; Li, A.; Xiang, W.; Zheng, C.; Lv, Z.; Wu, X. CompNet: Complementary network for single-channel speech enhancement. Neural Netw. 2023, 168, 508–517. [Google Scholar] [CrossRef] [PubMed]
  37. Zhang, Y.; Zou, H.; Zhu, J. A Two-Stage Framework in Cross-Spectrum Domain for Real-Time Speech Enhancement. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024. [Google Scholar]
Figure 1. Schemas of (a) the baseline model, (b) CDDense block, and (c) CFEU-i block.
Figure 2. Overall architecture of proposed multi-branch SE model called DPCNU.
Figure 3. Performance comparison to verify the effectiveness of the models in terms of WSS, LLR, and fwSNRseg. Lower values for WSS and LLR scores are better (↓), while higher values for fwSNRseg are better (↑). M refers to the case when only the masking branch is used, and S refers to the case when only the mapping branch is used. M + S is the case when a masking branch and a mapping branch are combined. Proposed is our dual-path SE model.
Table 1. Ablation test for dual-path SE model design. M refers to the case where only the masking branch is used, and S refers to the case where only the mapping branch is used. M + S is the case when a masking branch and a mapping branch are combined. Proposed is our dual-path SE model. Improvements over M and S are indicated with a red upward arrow (↑), while decreases in performance are marked with a blue downward arrow (↓). If there is no change in performance, no arrow is displayed. The best score is in bold.
Model | WB-PESQ | CSIG | CBAK | COVL | STOI (%) | SNRseg | WER (%)
Noisy | 1.97 | 3.35 | 2.44 | 2.63 | 91 | 1.69 | 5.08
M | 3.22 | 4.41 | 3.69 | 3.85 | 95 | 10.38 | 2.82
S | 3.19 | 4.39 | 3.64 | 3.82 | 95 | 9.79 | 3.09
M + S | 3.29 ↑ | 4.41 | 3.68 ↓ | 3.89 ↑ | 95 | 9.66 ↓ | 2.61 ↑
Proposed | 3.33 ↑ | 4.48 ↑ | 3.75 ↑ | 3.95 ↑ | 96 ↑ | 10.44 ↑ | 2.55 ↑
Table 2. Performance comparison with recent SE models on VoiceBank + DEMAND. All systems in this Table satisfy causality. The best score is in bold, and the second-best score is underlined.
Model | Params. | WB-PESQ | CSIG | CBAK | COVL | STOI | SNRseg
Noisy | - | 1.97 | 3.35 | 2.44 | 2.63 | 91 | 1.69
DEMUCS [32] | 33.5 M | 2.93 | 4.22 | 3.25 | 3.52 | 95 | -
CTS-Net [33] | 4.35 M | 2.92 | 4.25 | 3.46 | 3.59 | - | -
GaGNet [17] | 5.94 M | 2.94 | 4.26 | 3.45 | 3.59 | - | -
FullSubNet+ [34] | 8.67 M | 2.88 | 3.86 | 3.42 | 3.57 | - | -
NUNet-TLS [22] | 2.83 M | 3.04 | 4.38 | 3.47 | 3.74 | 95 | 8.27
FRCRN [35] | 6.9 M | 3.21 | 4.23 | 3.64 | 3.73 | - | -
CompNet [36] | 4.26 M | 2.90 | 4.16 | 3.37 | 3.53 | - | -
FDFNet [37] | 4.43 M | 3.05 | 4.23 | 3.55 | 3.65 | - | -
CNUNet-TB [19] | 2.98 M | 3.18 | 4.37 | 3.62 | 3.81 | 95 | 9.41
DPCNU | 18.69 M | 3.33 | 4.48 | 3.75 | 3.95 | 96 | 10.44
DPCNU(S) | 4.78 M | 3.25 | 4.47 | 3.69 | 3.89 | 95 | 10.10