Employing Huber and TAP Losses to Improve Inter-SubNet in Speech Enhancement
Abstract
1. Introduction
- The type of audio signal it operates on. For instance, time-domain methods [10,14,15] process the raw time-domain waveform of the noisy speech signal, while frequency-domain methods [16,17,18,19] transform the noisy time-domain signal to the frequency domain using a short-time Fourier transform (STFT) and then learn a mapping from the noisy STFT representation to an enhanced STFT.
- The target that operates on the noisy representation to produce a cleaner output. For example, masking-based methods [20,21,22] estimate a mask or weight for each time-frequency bin of noisy speech to indicate the likelihood that the signal is speech rather than noise and then apply the mask to suppress noise. Conversely, mapping-based methods [14,23,24] directly map the noisy speech to a clean speech signal, with the DNN learning a complex mapping function to transform the noisy speech into a more speech-like signal.
- MSE is sensitive to outliers and can suffer from gradient explosion issues, which can impede the training process (illustrated by the numerical sketch after this list).
- By only minimizing the estimation error of the cIRM, the MSE loss function overlooks the inherent variability in the importance of different time-frequency bins of a speech signal. This is important because human perception is not uniform across all frequency bands.
- In scenarios with low signal-to-noise ratios (SNRs), MSE can cause the model to excessively reduce certain frequency regions where noise dominates, negatively impacting the quality of the speech component.
- While human auditory perception operates on a non-linear frequency scale, the cIRM is defined on a linear frequency scale. This discrepancy may further limit the ability of the MSE criterion to guide cIRM estimation toward perceptually relevant details.
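To make the first point above concrete, the following minimal NumPy sketch (an illustration added here, not taken from the paper) contrasts the gradient magnitude of the squared-error loss with that of the Huber loss introduced later in Section 3.1: the squared-error gradient grows without bound as the residual grows, whereas the Huber gradient is capped at the threshold, here arbitrarily set to 1.0.

```python
import numpy as np

def grad_squared_error(r):
    # Derivative of 0.5 * r**2 with respect to r: grows linearly with the residual.
    return r

def grad_huber(r, delta=1.0):
    # Derivative of the Huber loss: clipped to [-delta, +delta] for large residuals.
    return float(np.clip(r, -delta, delta))

for r in [0.1, 1.0, 10.0, 100.0]:  # residuals, the last two acting as outliers
    print(f"r={r:6.1f}  squared-error grad={grad_squared_error(r):6.1f}  Huber grad={grad_huber(r):4.1f}")
```

For an outlier residual of 100, the squared-error gradient is 100 while the Huber gradient stays at 1.0, which is the behavior the Huber-based cIRM loss of Section 3 exploits.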
2. Inter-SubNet
- Initially, the input signal is transformed into a complex-valued spectrogram $X \in \mathbb{C}^{F \times T}$ using the short-time Fourier transform (STFT). Here, $F$ and $T$ represent the total number of frequency bins and frames, respectively. The magnitude part of the spectrogram, denoted as $|X| \in \mathbb{R}^{F \times T}$, is obtained by calculating the absolute value of each entry in the complex spectrogram. $|X|$ is then split into sub-band units using an “unfold” process. Each unit is generated by concatenating a frequency bin vector with its adjacent frequency bin vectors, extending $N$ frequency bins on each side. That is,
$$\mathbf{x}_f = \big[\, |X_{f-N}|, \ldots, |X_f|, \ldots, |X_{f+N}| \,\big] \in \mathbb{R}^{(2N+1) \times T}, \qquad f = 0, 1, \ldots, F-1.$$
Note that circular Fourier frequencies are used for those frequency indices $f$ with $f < N$ or $f > F - N - 1$.
- The sub-band units pass through a stack of two SubInter-LSTM (SIL) blocks and a fully connected layer to produce the complex ideal ratio mask (cIRM), $\widehat{M}$. Each SIL block comprises a SubInter module, an LSTM network, and a group normalization layer (G-Norm). The SubInter module plays a crucial role in capturing local and global spectral information, and its flowchart is depicted in Figure 2. This module takes the sub-band units $\{\mathbf{x}_f\}$ as input and outputs enhanced sub-bands through a series of procedures (an illustrative code sketch of the unfold operation and of this module is given after this list):
- (a) The dimension transformation (permutation) and a linear layer are applied to the input sub-band units:
$$\mathbf{h}_f = \mathrm{Linear}_1\!\big(\mathrm{Permute}(\mathbf{x}_f)\big), \qquad f = 0, 1, \ldots, F-1.$$
- (b) The first hidden representation is averaged across all sub-bands and then passed through another linear layer:
$$\mathbf{g} = \mathrm{Linear}_2\!\Big(\tfrac{1}{F}\textstyle\sum_{f=0}^{F-1} \mathbf{h}_f\Big).$$
- (c) Each of the sub-bands $\mathbf{h}_f$ is concatenated with $\mathbf{g}$. Then, the set of all updated sub-bands undergoes the third linear layer and the dimension transformation (permutation). Additionally, a residual connection is implemented to add back the original units $\mathbf{x}_f$. Therefore, the ultimate output for this module is as follows:
$$\tilde{\mathbf{x}}_f = \mathrm{Permute}\!\Big(\mathrm{Linear}_3\!\big([\mathbf{h}_f, \mathbf{g}]\big)\Big) + \mathbf{x}_f, \qquad f = 0, 1, \ldots, F-1.$$
Upon completion of the SubInter module, the LSTM network acquires the local and complementary global spectral information embedded within these sub-band features $\{\tilde{\mathbf{x}}_f\}$, which acts like the LSTM layers of the sub-band model in FullSubNet+ [27]. A G-Norm layer is then used to normalize the LSTM outputs.
- The output cIRM, $\widehat{M}$, is element-wise multiplied with the input complex-valued spectrogram as
$$\widehat{S} = \widehat{M} \otimes X.$$
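The following PyTorch sketch illustrates the unfold operation and the three SubInter steps described above. It is a simplified rendering under assumptions of our own: the names (unfold_subbands, SubInterSketch), the tensor layout [batch, frequency, time, sub-band width], and the handling of the permutation steps (folded into applying each linear layer along the last axis) are illustrative and do not reproduce the authors' released implementation.

```python
import torch
import torch.nn as nn

def unfold_subbands(mag, N):
    """Split a magnitude spectrogram mag of shape [B, F, T] into F sub-band
    units of width 2N+1, using circular padding along the frequency axis."""
    padded = torch.cat([mag[:, -N:, :], mag, mag[:, :N, :]], dim=1)  # [B, F+2N, T]
    return padded.unfold(1, 2 * N + 1, 1)                            # [B, F, T, 2N+1]

class SubInterSketch(nn.Module):
    """Rough sketch of SubInter steps (a)-(c); layer sizes and layouts are
    simplifications of the textual description, not the original model."""
    def __init__(self, width, hidden):
        super().__init__()
        self.lin1 = nn.Linear(width, hidden)        # step (a): per-sub-band projection
        self.lin2 = nn.Linear(hidden, hidden)       # step (b): projection of the sub-band average
        self.lin3 = nn.Linear(2 * hidden, width)    # step (c): fuse local and global features

    def forward(self, x):                            # x: [B, F, T, 2N+1] sub-band units
        h = self.lin1(x)                             # (a) local representation of every sub-band
        g = self.lin2(h.mean(dim=1, keepdim=True))   # (b) global summary averaged over sub-bands
        g = g.expand_as(h)                           # broadcast the global feature to every sub-band
        out = self.lin3(torch.cat([h, g], dim=-1))   # (c) concatenate and project back
        return out + x                               # residual connection to the original units

# Example with illustrative sizes: F = 257 bins, T = 192 frames, N = 15 neighbors per side.
units = unfold_subbands(torch.randn(1, 257, 192), N=15)      # [1, 257, 192, 31]
enhanced_units = SubInterSketch(width=31, hidden=64)(units)  # same shape as units
```

In the actual model, two such modules are stacked with LSTM and G-Norm layers (the SIL blocks) before the fully connected layer that outputs the cIRM.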
3. Inter-SubNet with New Loss Function
3.1. Huber Loss
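For reference, the Huber loss [33] of a residual $r = y - \hat{y}$, written here with generic symbols $r$ and a transition point $\delta > 0$, is defined as
$$
\mathrm{Huber}_{\delta}(r) =
\begin{cases}
\tfrac{1}{2}\, r^{2}, & |r| \le \delta,\\
\delta\,|r| - \tfrac{1}{2}\,\delta^{2}, & |r| > \delta.
\end{cases}
$$
It behaves like the squared error for small residuals but grows only linearly for large ones, which bounds its gradient at $\pm\delta$ and makes it robust to outliers.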
3.2. TAP Loss
- Frequency-related: These parameters capture characteristics related to the frequency content of the speech signal.
- Energy- or amplitude-related: These parameters reflect the loudness and energy levels of the speech signal, critical for understanding the intensity and dynamics of speech.
- Spectral-balance-related: These parameters assess the energy distribution across different frequency bands, providing insights into the tonal balance and clarity of the speech.
- Other temporal parameters: These parameters capture the temporal dynamics of the speech signal, including variations over time that are important for conveying speech rhythm and flow.
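A minimal PyTorch sketch of how such a TAP-style objective can be computed is given below. The function name tap_loss and the argument tap_estimator are placeholders of our own: tap_estimator stands for a differentiable network, as in TAPLoss [34], that maps a waveform to a per-frame trajectory of the acoustic parameters listed above, and the plain L1 distance is an assumption that may differ from the exact formulation in [34].

```python
import torch

def tap_loss(enhanced_wave, clean_wave, tap_estimator):
    """TAP-style loss sketch: L1 distance between the temporal acoustic
    parameter trajectories of enhanced and clean speech.

    tap_estimator is a placeholder for a differentiable estimator that maps a
    waveform [batch, samples] to parameters [batch, frames, num_params]."""
    params_enhanced = tap_estimator(enhanced_wave)
    params_clean = tap_estimator(clean_wave)
    return torch.mean(torch.abs(params_enhanced - params_clean))
```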
3.3. Discussion of Some Other Losses
3.3.1. Robust Loss
3.3.2. Perceptual Loss
- Perceptual Evaluation of Speech Quality (PESQ) [36,37]: PESQ is widely used for assessing speech quality. However, the original PESQ algorithm is non-differentiable, making it unsuitable as a direct loss function. To address this limitation, researchers have trained neural networks to mimic PESQ’s behavior, creating differentiable surrogate loss functions that enable gradient-based optimization.
- Short-Time Objective Intelligibility (STOI) [38]: STOI measures speech intelligibility and has been used as a metric to evaluate speech enhancement methods. Like PESQ, STOI’s non-differentiable nature poses challenges for direct optimization. To overcome this, researchers have explicitly developed differentiable versions or approximations of STOI for training deep-learning SE models.
- Phone-Fortified Perceptual Loss (PFPL) [39]: The PFPL integrates phonetic information through wav2vec, a self-supervised learning framework that captures rich phonetic information. It uses Wasserstein distance to quantify the distributional differences between the phonetic feature representations of clean and enhanced speech. This approach allows for assessing how well the enhanced speech preserves phonetic attributes compared to the original clean speech. PFPL has also shown a strong correlation with PESQ and STOI.
3.4. Presented Method
- The new loss is calculated as the mean Huber loss measured between the ground-truth cIRM $M$ and the estimated cIRM $\widehat{M}$. It is expressed as follows:
$$\mathcal{L}_{\mathrm{Huber}} = \frac{1}{FT}\sum_{f,t} \mathrm{Huber}_{\delta}\!\big(M_{f,t} - \widehat{M}_{f,t}\big),$$
where the Huber function is applied to the real and imaginary parts of each time-frequency bin and the results are averaged.
- The new loss is a combination of TAP loss with the MSE-based cIRM loss:
$$\mathcal{L}_{\mathrm{MSE+TAP}} = \mathcal{L}_{\mathrm{MSE}} + \alpha\,\mathcal{L}_{\mathrm{TAP}},$$
where $\alpha$ is the weight assigned to the TAP loss.
- The new loss is a combination of TAP loss with the mean Huber-based cIRM loss:
$$\mathcal{L}_{\mathrm{Huber+TAP}} = \mathcal{L}_{\mathrm{Huber}} + \alpha\,\mathcal{L}_{\mathrm{TAP}}.$$
- Forward Pass: In the forward pass, the network processes the noisy speech signal to generate the complex ideal ratio mask (cIRM). This mask is then applied to the time-frequency representation of the noisy signal, resulting in enhanced speech output.
- Loss Calculation: The estimated cIRM mask is evaluated against the ground-truth mask using MSE or Huber loss. Simultaneously, we assess the enhanced speech against the clean target speech using TAP loss. This loss function specifically measures how effectively the enhanced speech retains crucial temporal acoustic parameters.
- Backward Pass: The total loss, which combines both MSE/Huber loss and TAP loss, is used to compute the gradient with respect to all network parameters by applying the chain rule through each layer. Compared with the MSE loss on the cIRM alone, the TAP loss additionally requires back-propagating through the layers that perform the inverse STFT and those responsible for estimating the TAPs.
- Updating Network Parameters: The computed gradients are used to update the parameters of the Inter-SubNet framework with an optimizer.
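The following sketch ties these steps together for the Huber + TAP variant. It is an illustration under our own assumptions: model (the cIRM estimator), tap_estimator, the optimizer, and the values of alpha and delta are placeholders rather than the paper's exact components or hyperparameters.

```python
import torch
import torch.nn.functional as F

def training_step(model, tap_estimator, optimizer, noisy_wave, clean_wave,
                  noisy_spec, cirm_target, n_fft=512, hop=256, alpha=0.02, delta=1.0):
    """One illustrative optimization step with the combined Huber + TAP objective."""
    optimizer.zero_grad()
    window = torch.hann_window(n_fft, device=noisy_wave.device)

    # Forward pass: estimate a complex-valued cIRM with the same shape as noisy_spec.
    cirm_est = model(noisy_wave)

    # cIRM loss: Huber distance over the real and imaginary parts of the mask.
    loss_cirm = F.huber_loss(torch.view_as_real(cirm_est),
                             torch.view_as_real(cirm_target), delta=delta)

    # Apply the mask and invert the STFT to obtain the enhanced waveform.
    enhanced = torch.istft(cirm_est * noisy_spec, n_fft, hop_length=hop,
                           window=window, length=clean_wave.shape[-1])

    # TAP loss: L1 distance between acoustic-parameter trajectories (Section 3.2).
    loss_tap = torch.mean(torch.abs(tap_estimator(enhanced) - tap_estimator(clean_wave)))

    # Backward pass: gradients also flow through the iSTFT and the TAP estimator.
    loss = loss_cirm + alpha * loss_tap
    loss.backward()
    optimizer.step()
    return loss.detach()
```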
4. Experimental Setup
- A Hanning window with a 32 ms frame size and a 16 ms frame shift was applied to the input signals to obtain the spectrogram (see the configuration sketch after this list).
- The sub-band unit was set to have frequency bins, and the number of frames T for the input-target sequence pairs during training was set to 192.
- For the first SIL block, the SubInter module and the LSTM contained 102 hidden units and 384 hidden units, respectively. For the second SIL block, the SubInter module and the LSTM contained 307 hidden units and 384 hidden units, respectively.
- The batch size was reduced from the initial value of 20 in the script to 4 to accommodate the GPU constraints of the servers in our laboratory.
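The STFT setting above corresponds to the following configuration sketch. The 16 kHz sampling rate is an assumption on our part (common for this type of corpus and model); under it, the 32 ms/16 ms setting maps to a 512-sample FFT and a 256-sample hop.

```python
import torch

SAMPLE_RATE = 16_000                          # assumed sampling rate (16 kHz speech)
FRAME_MS, SHIFT_MS = 32, 16                   # frame size and frame shift from the setup above

n_fft = SAMPLE_RATE * FRAME_MS // 1000        # 512 samples per frame
hop_length = SAMPLE_RATE * SHIFT_MS // 1000   # 256 samples per shift
window = torch.hann_window(n_fft)

def spectrogram(wave):
    """Complex spectrogram with F = n_fft // 2 + 1 = 257 frequency bins."""
    return torch.stft(wave, n_fft=n_fft, hop_length=hop_length,
                      win_length=n_fft, window=window, return_complex=True)
```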
- Wideband Perceptual Evaluation of Speech Quality (WB-PESQ) [36]: This metric rates the level of enhancement for wideband speech signals, typically sampled at 16 kHz or higher. WB-PESQ indicates the quality difference between the enhanced and clean speech signals, and it ranges from -0.5 to 4.5.
- Narrowband Perceptual Evaluation of Speech Quality (NB-PESQ) [37]: Similar to WB-PESQ, NB-PESQ evaluates the quality of speech signals but is limited to narrowband ones, typically sampled at 8 kHz. WB-PESQ involves high-fidelity speech processing, whereas NB-PESQ is more suited for traditional telephony.
- Short-Time Objective Intelligibility (STOI) [38]: This metric measures objective intelligibility by comparing short-time time-frequency (TF) representations of the clean and processed utterances, computed with the discrete Fourier transform (DFT). STOI ranges from 0 to 1; a higher STOI score corresponds to better intelligibility.
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) [43]: This metric usually reflects the degree of artifact distortion between the processed utterance $\hat{s}$ and its clean counterpart $s$, and is formulated as
$$\mathrm{SI\text{-}SDR} = 10 \log_{10} \frac{\left\| \dfrac{\langle \hat{s}, s \rangle}{\|s\|^{2}}\, s \right\|^{2}}{\left\| \dfrac{\langle \hat{s}, s \rangle}{\|s\|^{2}}\, s - \hat{s} \right\|^{2}}.$$
(A small NumPy sketch of this computation follows this list.)
- Composite Measure for Overall Quality (COVL) [44]: This metric is for the Mean Opinion Score (MOS) prediction of the overall quality of the enhanced speech. It reflects how listeners perceive the overall quality of the speech signal. COVL scores range from 0 to 5, where a higher score indicates better overall speech quality. COVL is commonly used in the development of SE systems to quantify improvements in speech quality as perceived by human listeners.
- Composite Measure for Signal Distortion (CSIG) [44]: This metric refers to the MOS prediction of signal distortion, which ranges from 0 to 5, with higher scores indicating lower distortion and better quality of the speech signal. It evaluates how much distortion has been introduced during the enhancement process. CSIG is useful for assessing the effectiveness of SE algorithms in preserving the integrity of the original speech while reducing noise and other distortions.
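The SI-SDR formula above can be computed with a few lines of NumPy, as sketched below; the function name si_sdr and the small eps constant are our own additions for numerical safety.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between an enhanced signal and its clean reference."""
    estimate = estimate - estimate.mean()      # remove DC offsets, as is conventional
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    distortion = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(distortion ** 2) + eps))
```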
5. Experimental Results and Discussion
- TAP loss fine-tunes Inter-SubNet to enhance its performance. The combination of MSE and TAP losses leads to better results in almost all SE metrics (except SI-SDR) compared to when individual losses are used alone.
- When the weight $\alpha$ for the TAP loss is set within a suitable range, similarly good performance is achieved. However, increasing $\alpha$ to 0.04 degrades some metrics.
- The introduction of TAP loss does not lead to an improvement in the SI-SDR metric. This may be due to the fact that SI-SDR is more concerned with physical distortion rather than the perceptual interference that TAP loss addresses.
| Loss | WB-PESQ | NB-PESQ | STOI | SI-SDR | COVL | CSIG |
|---|---|---|---|---|---|---|
| (baseline) | 2.843 | 3.588 | 0.943 | 18.427 | 3.419 | 4.000 |
|  | 2.746 | 3.478 | 0.939 | 17.338 | 3.442 | 4.127 |
|  | 2.958 | 3.622 | 0.946 | 18.279 | 3.587 | 4.196 |
|  | 2.915 | 3.591 | 0.946 | 18.307 | 3.564 | 4.193 |
|  | 2.913 | 3.605 | 0.946 | 18.407 | 3.562 | 4.191 |
|  | 2.923 | 3.587 | 0.945 | 18.236 | 3.557 | 4.172 |
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ochieng, P. Deep neural network techniques for monaural speech enhancement: State of the art analysis. arXiv 2023, arXiv:2212.00369. [Google Scholar] [CrossRef]
- Xu, L.; Zhang, T. Fractional feature-based speech enhancement with deep neural network. Speech Commun. 2023, 153, 102971. [Google Scholar] [CrossRef]
- Hao, X.; Xu, C.; Xie, L. Neural speech enhancement with unsupervised pre-training and mixture training. Neural Netw. 2023, 158, 216–227. [Google Scholar] [CrossRef] [PubMed]
- Wang, Y.; Han, J.; Zhang, T.; Qing, D. Speech Enhancement from Fused Features Based on Deep Neural Network and Gated Recurrent Unit Network. EURASIP J. Adv. Signal Process. 2021. [Google Scholar] [CrossRef]
- Skariah, D.; Thomas, J. Review of Speech Enhancement Methods using Generative Adversarial Networks. In Proceedings of the 2023 International Conference on Control, Communication and Computing (ICCC), Thiruvananthapuram, India, 19–21 May 2023; pp. 1–4. [Google Scholar] [CrossRef]
- Karjol, P.; Ajay Kumar, M.; Ghosh, P.K. Speech Enhancement Using Multiple Deep Neural Networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5049–5052. [Google Scholar] [CrossRef]
- Lu, X.; Tsao, Y.; Matsuda, S.; Hori, C. Speech enhancement based on deep denoising autoencoder. In Proceedings of the Interspeech 2013, Lyon, France, 25–29 August 2013; pp. 436–440. [Google Scholar] [CrossRef]
- Cohen, I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process. 2003, 11, 466–475. [Google Scholar] [CrossRef]
- Fu, S.W.; Tsao, Y.; Lu, X. SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 3768–3772. [Google Scholar] [CrossRef]
- Luo, Y.; Mesgarani, N. TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 696–700. [Google Scholar] [CrossRef]
- Pang, J.; Li, H.; Jiang, T.; Wang, H.; Liao, X.; Luo, L.; Liu, H. A Dual-Channel End-to-End Speech Enhancement Method Using Complex Operations in the Time Domain. Appl. Sci. 2023, 13, 7698. [Google Scholar] [CrossRef]
- Fan, J.; Yang, J.; Zhang, X.; Yao, Y. Real-time single-channel speech enhancement based on causal attention mechanism. Appl. Acoust. 2022, 201, 109084. [Google Scholar] [CrossRef]
- Yang, L.; Liu, W.; Meng, R.; Lee, G.; Baek, S.; Moon, H.G. Fspen: An Ultra-Lightweight Network for Real Time Speech Enahncment. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10671–10675. [Google Scholar] [CrossRef]
- Hu, Y.; Liu, Y.; Lv, S.; Zhang, S.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 2472–2476. [Google Scholar] [CrossRef]
- Koh, H.I.; Na, S.; Kim, M.N. Speech Perception Improvement Algorithm Based on a Dual-Path Long Short-Term Memory Network. Bioengineering 2023, 10, 1325. [Google Scholar] [CrossRef] [PubMed]
- Nossier, S.A.; Wall, J.; Moniri, M.; Glackin, C.; Cannings, N. A Comparative Study of Time and Frequency Domain Approaches to Deep Learning based Speech Enhancement. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar] [CrossRef]
- Zhang, Z.; Li, X.; Li, Y.; Dong, Y.; Wang, D.; Xiong, S. Neural Noise Embedding for End-To-End Speech Enhancement with Conditional Layer Normalization. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7113–7117. [Google Scholar] [CrossRef]
- Yin, D.; Luo, C.; Xiong, Z.; Zeng, W. PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network. Proc. Aaai Conf. Artif. Intell. 2020, 34, 9458–9465. [Google Scholar] [CrossRef]
- Zhao, H.; Zarar, S.; Tashev, I.; Lee, C.H. Convolutional-Recurrent Neural Networks for Speech Enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2401–2405. [Google Scholar] [CrossRef]
- Graetzer, S.; Hopkins, C. Comparison of ideal mask-based speech enhancement algorithms for speech mixed with white noise at low mixture signal-to-noise ratios. J. Acoust. Soc. Am. 2022, 152, 3458–3470. [Google Scholar] [CrossRef] [PubMed]
- Routray, S.; Mao, Q. Phase sensitive masking-based single channel speech enhancement using conditional generative adversarial network. Comput. Speech Lang. 2022, 71, 101270. [Google Scholar] [CrossRef]
- Williamson, D.S.; Wang, Y.; Wang, D. Complex Ratio Masking for Monaural Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 483–492. [Google Scholar] [CrossRef] [PubMed]
- Tan, K.; Wang, D. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3229–3233. [Google Scholar] [CrossRef]
- Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 3642–3646. [Google Scholar] [CrossRef]
- Chen, J.; Rao, W.; Wang, Z.; Lin, J.; Wu, Z.; Wang, Y.; Shang, S.; Meng, H. Inter-Subnet: Speech Enhancement with Subband Interaction. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Hao, X.; Su, X.; Horaud, R.; Li, X. Fullsubnet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6633–6637. [Google Scholar] [CrossRef]
- Chen, J.; Wang, Z.; Tuo, D.; Wu, Z.; Kang, S.; Meng, H. FullSubNet+: Channel Attention Fullsubnet with Complex Spectrograms for Speech Enhancement. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7857–7861. [Google Scholar] [CrossRef]
- Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2019, 27, 1256–1266. [Google Scholar] [CrossRef] [PubMed]
- Isik, U.; Giri, R.; Phansalkar, N.; Valin, J.M.; Helwani, K.; Krishnaswamy, A. PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 2487–2491. [Google Scholar]
- Lv, S.; Hu, Y.; Wu, J.; Xie, L. DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 2816–2820. [Google Scholar]
- Choi, H.S.; Park, S.; Lee, J.H.; Heo, H.; Jeon, D.; Lee, K. Real-Time Denoising and Dereverberation with Tiny Recurrent U-Net. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 5789–5793. [Google Scholar] [CrossRef]
- Li, A.; Liu, W.; Zheng, C.; Fan, C.; Li, X. Two Heads are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2021, 29, 1829–1843. [Google Scholar] [CrossRef]
- Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73–101. [Google Scholar] [CrossRef]
- Zeng, Y.; Konan, J.; Han, S.; Bick, D.; Yang, M.; Kumar, A.; Watanabe, S.; Raj, B. TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Charbonnier, P.; Blanc-Feraud, L.; Aubert, G.; Barlaud, M. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of the International Conference on Image Processing, Austin, TX, USA, 13–16 November 1994; Volume 2, pp. 168–172. [Google Scholar] [CrossRef]
- ITU-T. Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs; Technical Report P.862.2; International Telecommunication Union: Geneva, Switzerland, 2005. [Google Scholar]
- ITU-T. Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs; Technical Report P.862; International Telecommunication Union: Geneva, Switzerland, 2001. [Google Scholar]
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
- Hsieh, T.A.; Yu, C.; Fu, S.W.; Lu, X.; Tsao, Y. Improving Perceptual Quality by Phone-Fortified Perceptual Loss Using Wasserstein Distance for Speech Enhancement. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 196–200. [Google Scholar] [CrossRef]
- Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152. [Google Scholar] [CrossRef]
- Thiemann, J.; Ito, N.; Vincent, E. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings. In Proceedings of the 21st International Congress on Acoustics, Montreal, QC, Canada, 2–7 June 2013; pp. 1–6. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
- Roux, J.L.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR—Half-baked or Well Done? In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 626–630. [Google Scholar] [CrossRef]
- Hu, Y.; Loizou, P.C. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 229–238. [Google Scholar] [CrossRef]
- Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based Metric GAN for Speech Enhancement. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 936–940. [Google Scholar] [CrossRef]
- Lu, Y.X.; Ai, Y.; Ling, Z.H. MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023; pp. 3834–3838. [Google Scholar] [CrossRef]
- Park, H.J.; Kang, B.H.; Shin, W.; Kim, J.S.; Han, S.W. MANNER: Multi-View Attention Network For Noise Erasure. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7842–7846. [Google Scholar] [CrossRef]
- Kim, E.; Seo, H. SE-Conformer: Time-Domain Speech Enhancement Using Conformer. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 2736–2740. [Google Scholar] [CrossRef]
| Loss | WB-PESQ | NB-PESQ | STOI | SI-SDR | COVL | CSIG |
|---|---|---|---|---|---|---|
| (baseline) | 2.843 | 3.588 | 0.943 | 18.427 | 3.419 | 4.000 |
|  | 2.900 | 3.620 | 0.946 | 18.811 | 3.493 | 4.078 |
|  | 2.746 | 3.478 | 0.939 | 17.338 | 3.442 | 4.127 |
| Loss | WB-PESQ | NB-PESQ | STOI | SI-SDR | COVL | CSIG |
|---|---|---|---|---|---|---|
|  | 2.900 | 3.620 | 0.946 | 18.811 | 3.493 | 4.078 |
|  | 2.746 | 3.478 | 0.939 | 17.338 | 3.442 | 4.127 |
|  | 2.955 | 3.609 | 0.946 | 18.287 | 3.584 | 4.197 |
|  | 2.934 | 3.614 | 0.946 | 18.146 | 3.579 | 4.201 |
|  | 2.888 | 3.582 | 0.946 | 17.865 | 3.542 | 4.179 |
| Loss | WB-PESQ | NB-PESQ | STOI | SI-SDR | COVL | CSIG |
|---|---|---|---|---|---|---|
|  | 2.955 | 3.609 | 0.946 | 18.287 | 3.584 | 4.197 |
|  | 2.900 | 3.620 | 0.946 | 18.811 | 3.493 | 4.078 |
|  | 2.746 | 3.478 | 0.939 | 17.338 | 3.442 | 4.127 |
| (baseline) | 2.843 | 3.588 | 0.943 | 18.427 | 3.419 | 4.000 |
| Metric | Revised Inter-SubNet | SEGAN | CMGAN | MP-SENet | MANNER | SE-Conformer |
|---|---|---|---|---|---|---|
| NB-PESQ | 3.61 | 2.16 | 3.41 | 3.50 | 3.21 | 3.13 |
| STOI | 0.95 | 0.92 | 0.96 | 0.96 | 0.95 | 0.95 |
| Loss | Memory Usage | Training Time | Inference Time (Each Utterance) | Model Size (No. of Parameters) |
|---|---|---|---|---|
| (baseline) | 284 MiB | 26 h, 40 min | 10 ms | 2.2928 M |
|  | 284 MiB | 25 h, 24 min | 10 ms | 2.2928 M |
|  | 284 MiB | 25 h, 3 min | 13 ms | 2.2928 M (+5.2106 M) |
|  | 284 MiB | 29 h, 37 min | 11 ms | 2.2928 M (+5.2106 M) |
|  | 284 MiB | 29 h, 7 min | 11 ms | 2.2928 M (+5.2106 M) |