A Dual Stream Generative Adversarial Network with Phase Awareness for Speech Enhancement
Abstract
1. Introduction
- In the generator, we propose a novel dual-stream structure that incorporates phase into the enhancement process, alleviating the phase mismatch problem while further improving enhancement quality (a sketch of this structure follows this list).
- Within this dual-stream structure, we add information communication (IC) between the two prediction streams so that each stream benefits from the other's features.
- We propose Transformer-based Mask Estimated Blocks (MEB), which extract acoustic features more effectively for mask estimation.
- In the discriminator, inspired by MetricGAN [18], we design a perception-guided discriminator that models specific evaluation metrics to enable targeted optimisation.
- We conduct comprehensive experiments to validate the design. The experimental results show that DPGAN yields a measurable improvement in speech enhancement over current methods.
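To make these contributions concrete, the following is a minimal sketch of the dual-stream generator with information communication, written in PyTorch. The module names (`MaskEstimatedBlock`, `DualStreamGenerator`), layer sizes, the residual form of the IC exchange, and the bounded-mask/additive-phase parameterisation are all illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only: a magnitude stream and a phase stream, each built
# from Transformer-based mask estimated blocks (MEB), with information
# communication (IC) realised as a cross-stream feature exchange.
import torch
import torch.nn as nn


class MaskEstimatedBlock(nn.Module):
    """Stand-in for an MEB: a small Transformer encoder over time frames."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):  # x: (batch, time, dim)
        return self.encoder(x)


class DualStreamGenerator(nn.Module):
    """Two parallel streams predict a magnitude mask and a phase correction."""

    def __init__(self, freq_bins: int = 257, dim: int = 256):
        super().__init__()
        self.mag_in = nn.Linear(freq_bins, dim)
        self.pha_in = nn.Linear(freq_bins, dim)
        self.mag_meb = MaskEstimatedBlock(dim)
        self.pha_meb = MaskEstimatedBlock(dim)
        # IC: each stream receives a projection of both streams' features.
        self.ic_mag = nn.Linear(2 * dim, dim)
        self.ic_pha = nn.Linear(2 * dim, dim)
        self.mag_out = nn.Sequential(nn.Linear(dim, freq_bins), nn.Sigmoid())
        self.pha_out = nn.Linear(dim, freq_bins)

    def forward(self, mag, phase):  # each: (batch, time, freq_bins)
        m = self.mag_meb(self.mag_in(mag))
        p = self.pha_meb(self.pha_in(phase))
        shared = torch.cat([m, p], dim=-1)  # information communication
        m = m + self.ic_mag(shared)
        p = p + self.ic_pha(shared)
        mask = self.mag_out(m)               # bounded magnitude mask in (0, 1)
        phase_res = self.pha_out(p)          # additive phase correction
        return mag * mask, phase + phase_res


# Smoke test on spectrogram-shaped random tensors.
g = DualStreamGenerator()
mag, pha = torch.rand(2, 100, 257), torch.rand(2, 100, 257)
enh_mag, enh_pha = g(mag, pha)
print(enh_mag.shape, enh_pha.shape)  # torch.Size([2, 100, 257]) twice
```

A sigmoid-bounded magnitude mask and an additive phase residual are common choices in phase-aware enhancement; Sections 3.2, 3.3 and 3.4 give the authors' actual formulation.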
2. Related Work
2.1. T-F Domain Methods
2.2. Time Domain Methods
2.3. GAN-Based Methods
3. Methods
3.1. Overview
3.2. Dual Stream Generator with Phase Awareness
3.3. Information Communication
3.4. Mask Estimated Blocks
3.5. Perception-Guided Discriminator
3.6. Loss Function
4. Results
4.1. Implementation Details
4.2. Quantitative Evaluation
4.3. Acoustic Analysis
4.4. Image Analysis
4.5. Ablation Study
5. Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Yang, L.P.; Fu, Q.J. Spectral subtraction-based speech enhancement for cochlear implant patients in background noise. J. Acoust. Soc. Am. 2005, 117, 1001–1004. [Google Scholar] [CrossRef] [PubMed]
- Martin, R. Speech enhancement using MMSE short time spectral estimation with gamma distributed speech priors. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, 13–17 May 2002; Volume 1, pp. 1–253. [Google Scholar]
- Habets, E.A.P. Single- and multi-microphone speech dereverberation using spectral enhancement. Diss. Abstr. Int. 2007, 68. [Google Scholar]
- Nakatani, T.; Yoshioka, T.; Kinoshita, K.; Miyoshi, M.; Juang, B.H. Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 1717–1731. [Google Scholar] [CrossRef]
- Germain, F.G.; Mysore, G.J.; Fujioka, T. Equalization matching of speech recordings in real-world environments. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 609–613. [Google Scholar]
- Défossez, A.; Synnaeve, G.; Adi, Y. Real time speech enhancement in the waveform domain. arXiv 2020, arXiv:2006.12847. [Google Scholar]
- Liu, C.L.; Fu, S.W.; Li, Y.J.; Huang, J.W.; Wang, H.M.; Tsao, Y. Multichannel speech enhancement by raw waveform-mapping using fully convolutional networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1888–1900. [Google Scholar] [CrossRef] [Green Version]
- Williamson, D.S.; Wang, Y.; Wang, D. Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 24, 483–492. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Ephrat, A.; Mosseri, I.; Lang, O.; Dekel, T.; Wilson, K.; Hassidim, A.; Freeman, W.T.; Rubinstein, M. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv 2018, arXiv:1804.03619. [Google Scholar] [CrossRef] [Green Version]
- Yin, D.; Luo, C.; Xiong, Z.; Zeng, W. PHASEN: A phase-and-harmonics-aware speech enhancement network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9458–9465. [Google Scholar]
- Binkowski, M.; Donahue, J.; Dieleman, S.; Clark, A.; Elsen, E.; Casagrande, N.; Cobo, L.C.; Simonyan, K. High Fidelity Speech Synthesis with Adversarial Networks. arXiv 2019, arXiv:1909.11646. [Google Scholar]
- Kumar, K.; Kumar, R.; de Boissiere, T.; Gestin, L.; Teoh, W.Z.; Sotelo, J.; de Brébisson, A.; Bengio, Y.; Courville, A.C. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
- Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Pandey, A.; Wang, D. On adversarial training and loss functions for speech enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5414–5418. [Google Scholar]
- Michelsanti, D.; Tan, Z.H. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification. arXiv 2017, arXiv:1709.01703. [Google Scholar]
- Donahue, C.; Li, B.; Prabhavalkar, R. Exploring speech enhancement with generative adversarial networks for robust speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5024–5028. [Google Scholar]
- Fu, S.W.; Liao, C.F.; Tsao, Y.; Lin, S.D. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 2031–2041. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120. [Google Scholar] [CrossRef] [Green Version]
- Hu, G.; Wang, D. Speech segregation based on pitch tracking and amplitude modulation. In Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575), New Platz, NY, USA, 24 October 2001; pp. 79–82. [Google Scholar]
- Srinivasan, S.; Roman, N.; Wang, D. Binary and ratio time-frequency masks for robust speech recognition. Speech Commun. 2006, 48, 1486–1501. [Google Scholar] [CrossRef]
- Wang, Y.; Narayanan, A.; Wang, D. On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1849–1858. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Paliwal, K.; Wójcicki, K.; Shannon, B. The importance of phase in speech enhancement. Speech Commun. 2011, 53, 465–494. [Google Scholar] [CrossRef]
- Erdogan, H.; Hershey, J.R.; Watanabe, S.; Le Roux, J. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 708–712. [Google Scholar]
- Trabelsi, C.; Bilaniuk, O.; Serdyuk, D.; Subramanian, S.; Santos, J.F.; Mehri, S.; Rostamzadeh, N.; Bengio, Y.; Pal, C.J. Deep Complex Networks. arXiv 2017, arXiv:1705.09792. [Google Scholar]
- Choi, H.S.; Kim, J.H.; Huh, J.; Kim, A.; Ha, J.W.; Lee, K. Phase-aware speech enhancement with deep complex u-net. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv 2020, arXiv:2008.00264. [Google Scholar]
- Takahashi, N.; Agrawal, P.; Goswami, N.; Mitsufuji, Y. PhaseNet: Discretized Phase Modeling with Deep Neural Networks for Audio Source Separation. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2713–2717. [Google Scholar]
- Sreenivas, T.; Kirnapure, P. Codebook constrained Wiener filtering for speech enhancement. IEEE Trans. Speech Audio Process. 1996, 4, 383–389. [Google Scholar] [CrossRef]
- Paliwal, K.; Basu, A. A speech enhancement method based on Kalman filtering. In Proceedings of the ICASSP’87. IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, USA, 6–9 April 1987; Volume 12, pp. 177–180. [Google Scholar]
- Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
- Rethage, D.; Pons, J.; Serra, X. A wavenet for speech denoising. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5069–5073. [Google Scholar]
- Stoller, D.; Ewert, S.; Dixon, S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. arXiv 2018, arXiv:1806.03185. [Google Scholar]
- Défossez, A.; Usunier, N.; Bottou, L.; Bach, F. Music source separation in the waveform domain. arXiv 2019, arXiv:1911.13254. [Google Scholar]
- Luo, Y.; Chen, Z.; Yoshioka, T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 46–50. [Google Scholar]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
- Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888. [Google Scholar]
- Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is all you need in speech separation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 21–25. [Google Scholar]
- Wang, K.; He, B.; Zhu, W.P. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7098–7102. [Google Scholar]
- Kim, E.; Seo, H. SE-Conformer: Time-Domain Speech Enhancement Using Conformer. In Proceedings of the Interspeech, Brno, Czechia, 30 August–3 September 2021; pp. 2736–2740. [Google Scholar]
- Pascual, S.; Bonafonte, A.; Serra, J. SEGAN: Speech enhancement generative adversarial network. arXiv 2017, arXiv:1703.09452. [Google Scholar]
- Baby, D.; Verhulst, S. Sergan: Speech enhancement using relativistic generative adversarial networks with gradient penalty. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 106–110. [Google Scholar]
- Su, J.; Jin, Z.; Finkelstein, A. HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 4506–4510. [Google Scholar]
- Kolbæk, M.; Tan, Z.H.; Jensen, J. Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5059–5063. [Google Scholar]
- Fu, S.W.; Wang, T.W.; Tsao, Y.; Lu, X.; Kawai, H. End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1570–1584. [Google Scholar] [CrossRef] [Green Version]
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No. 01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752. [Google Scholar]
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199. [Google Scholar]
- Fu, S.W.; Liao, C.F.; Tsao, Y. Learning with learned loss function: Speech enhancement with quality-net to improve perceptual evaluation of speech quality. IEEE Signal Process. Lett. 2019, 27, 26–30. [Google Scholar] [CrossRef] [Green Version]
- Fu, S.W.; Yu, C.; Hsieh, T.A.; Plantinga, P.; Ravanelli, M.; Lu, X.; Tsao, Y. MetricGAN+: An improved version of MetricGAN for speech enhancement. arXiv 2021, arXiv:2104.03538. [Google Scholar]
- Koizumi, Y.; Niwa, K.; Hioka, Y.; Kobayashi, K.; Haneda, Y. DNN-based source enhancement to increase objective sound quality assessment score. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1780–1792. [Google Scholar] [CrossRef] [Green Version]
- Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv 2018, arXiv:1802.05957. [Google Scholar]
- Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the SSW, Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152. [Google Scholar]
- Thiemann, J.; Ito, N.; Vincent, E. The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings. In Proceedings of the Meetings on Acoustics ICA2013, Montreal, QC, Canada, 2–7 June 2013; Volume 19, p. 035081. [Google Scholar]
- Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv 2021, arXiv:2106.04624. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
- Hu, G.; Wang, D. A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 2067–2079. [Google Scholar]
- Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 2007, 16, 229–238. [Google Scholar] [CrossRef]
- Phan, H.; McLoughlin, I.V.; Pham, L.; Chén, O.Y.; Koch, P.; De Vos, M.; Mertins, A. Improving GANs for speech enhancement. IEEE Signal Process. Lett. 2020, 27, 1700–1704. [Google Scholar] [CrossRef]
- Liu, G.; Gong, K.; Liang, X.; Chen, Z. Cp-gan: Context pyramid generative adversarial network for speech enhancement. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6624–6628. [Google Scholar]
- Soni, M.H.; Shah, N.; Patil, H.A. Time-frequency masking-based speech enhancement using generative adversarial network. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5039–5043. [Google Scholar]
- Li, Y.; Sun, M.; Zhang, X. Perception-guided generative adversarial network for end-to-end speech enhancement. Appl. Soft Comput. 2022, 128, 109446. [Google Scholar] [CrossRef]
- Huang, H.; Wu, R.; Huang, J.; Lin, J.; Yin, J. DCCRGAN: Deep Complex Convolution Recurrent Generator Adversarial Network for Speech Enhancement. In Proceedings of the 2022 International Symposium on Electrical, Electronics and Information Engineering (ISEEIE), Chiang Mai, Thailand, 25–27 February 2022; pp. 30–35. [Google Scholar]
- Giri, R.; Isik, U.; Krishnaswamy, A. Attention wave-u-net for speech enhancement. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019; pp. 249–253. [Google Scholar]
- Lv, S.; Fu, Y.; Xing, M.; Sun, J.; Xie, L.; Huang, J.; Wang, Y.; Yu, T. S-dccrn: Super wide band dccrn with learnable complex feature for speech enhancement. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7767–7771. [Google Scholar]
- Li, H.; Yamagishi, J. Multi-metric optimization using generative adversarial networks for near-end speech intelligibility enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3000–3011. [Google Scholar] [CrossRef]
- Van Kuyk, S.; Kleijn, W.B.; Hendriks, R.C. An instrumental intelligibility metric based on information theory. IEEE Signal Process. Lett. 2017, 25, 115–119. [Google Scholar] [CrossRef] [Green Version]
- Kates, J.M.; Arehart, K.H. The hearing-aid speech perception index (HASPI). Speech Commun. 2014, 65, 75–93. [Google Scholar] [CrossRef]
- Jensen, J.; Taal, C.H. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 2009–2022. [Google Scholar] [CrossRef]
| Network | Domain | PESQ | STOI | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|
| Noisy | - | 1.97 | 0.92 | 3.35 | 2.44 | 2.63 |
| SEGAN [42] | T | 2.16 | 0.925 | 3.48 | 2.94 | 2.80 |
| SERGAN [43] | T | 2.51 | 0.938 | 3.79 | 3.24 | 3.14 |
| iSEGAN [62] | T | 2.24 | 0.933 | 3.23 | 2.95 | 2.69 |
| DSEGAN [62] | T | 2.39 | 0.933 | 3.46 | 3.11 | 2.90 |
| Wave-U-Net [67] | T | 2.62 | - | 3.91 | 3.35 | 3.27 |
| CP-GAN [63] | T | 2.64 | 0.942 | 3.93 | 3.33 | 3.28 |
| MMSE-GAN [64] | TF | 2.53 | 0.93 | 3.80 | 3.12 | 3.14 |
| PGGAN [65] | T | 2.81 | 0.944 | 3.99 | 3.59 | 3.36 |
| DCCRGAN [66] | TF | 2.82 | 0.949 | 4.01 | 3.48 | 3.40 |
| S-DCCRN [68] | TF | 2.84 | 0.940 | 4.03 | 2.97 | 3.43 |
| MetricGAN [18] | TF | 2.86 | - | 3.99 | 3.18 | 3.42 |
| HiFi-GAN [44] | T | 2.94 | - | 4.07 | 3.07 | 3.49 |
| DCU-Net-16 [27] | TF | 2.93 | 0.93 | 4.10 | 3.77 | 3.52 |
| PHASEN [10] | TF | 2.99 | - | 4.18 | 3.45 | 3.50 |
| MetricGAN+ [51] | TF | 3.15 | 0.927 | 4.14 | 3.12 | 3.52 |
| DPGAN(S) (ours) | TF | 2.67 | 0.948 | 3.43 | 3.15 | 3.29 |
| DPGAN(P) (ours) | TF | 3.32 | 0.908 | 3.93 | 3.07 | 3.54 |
| Network | Domain | PESQ | STOI |
|---|---|---|---|
| SEGAN [42] | T | 2.21 | 0.885 |
| MetricGAN [18] | TF | 2.81 | 0.908 |
| HiFi-GAN [44] | T | 2.89 | 0.922 |
| PHASEN [10] | TF | 2.95 | 0.931 |
| MetricGAN+ [51] | TF | 3.08 | 0.928 |
| DPGAN(S) | TF | 2.72 | 0.943 |
| DPGAN(P) | TF | 3.21 | 0.915 |
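The PESQ [47] and STOI [48] columns in the two tables above are standard objective metrics and can be reproduced with the third-party `pesq` and `pystoi` packages. The snippet below is a generic evaluation sketch, not the authors' pipeline; the file names are placeholders, and 16 kHz mono audio is assumed, since wideband PESQ requires it.

```python
# Generic objective evaluation of an enhanced utterance against its clean
# reference. "clean.wav" / "enhanced.wav" are placeholder file names.
import soundfile as sf
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

clean, fs = sf.read("clean.wav")       # reference signal
enhanced, _ = sf.read("enhanced.wav")  # model output, same length and rate

# Wideband PESQ: roughly -0.5 to 4.5, higher is better.
pesq_score = pesq(fs, clean, enhanced, "wb")

# STOI: intelligibility estimate in [0, 1], higher is better.
stoi_score = stoi(clean, enhanced, fs, extended=False)

print(f"PESQ: {pesq_score:.2f}  STOI: {stoi_score:.3f}")
```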
PESQ and STOI of the ablated variants at four SDR levels (dB):

| Network | PESQ 17.5 dB | PESQ 12.5 dB | PESQ 7.5 dB | PESQ 2.5 dB | PESQ Avg. | STOI 17.5 dB | STOI 12.5 dB | STOI 7.5 dB | STOI 2.5 dB | STOI Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| DPGAN(BLSTM) | 2.99 | 3.31 | 3.13 | 2.89 | 3.08 | 0.915 | 0.942 | 0.935 | 0.896 | 0.922 |
| +phase | 2.85 | 3.20 | 2.99 | 2.68 | 2.93 | 0.893 | 0.922 | 0.925 | 0.908 | 0.912 |
| +IC(P) | 3.08 | 3.36 | 3.19 | 2.93 | 3.14 | 0.902 | 0.925 | 0.918 | 0.900 | 0.911 |
| +IC(S) | 2.55 | 2.81 | 2.61 | 2.38 | 2.59 | 0.940 | 0.958 | 0.933 | 0.929 | 0.940 |
| DPGAN(MEB) | 3.16 | 3.50 | 3.22 | 2.97 | 3.21 | 0.921 | 0.947 | 0.935 | 0.909 | 0.928 |
| +phase | 2.95 | 3.28 | 3.06 | 2.79 | 3.02 | 0.908 | 0.935 | 0.926 | 0.895 | 0.916 |
| +IC(P) | 3.29 | 3.60 | 3.34 | 3.05 | 3.32 | 0.899 | 0.930 | 0.919 | 0.884 | 0.908 |
| +IC(S) | 2.61 | 2.90 | 2.75 | 2.42 | 2.67 | 0.940 | 0.965 | 0.957 | 0.930 | 0.948 |
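The ablation rows above vary only the generator; all variants are trained against the perception-guided discriminator described in the contributions and Section 3.5, which follows the MetricGAN idea [18]: the discriminator learns to regress an evaluation metric's score, and the generator is then trained to push that predicted score toward its maximum. Below is a minimal sketch of that training signal; the discriminator architecture, the input shapes, and the use of a PESQ score normalised to [0, 1] as the regression target are illustrative assumptions.

```python
# Illustrative MetricGAN-style training signal for a perception-guided
# discriminator. Shapes assumed: magnitude spectrograms of (batch, time, freq).
import torch
import torch.nn as nn


class MetricDiscriminator(nn.Module):
    """Maps a (reference, test) spectrogram pair to a predicted metric score."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=5, stride=2), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1), nn.Sigmoid())  # predicted score in (0, 1)

    def forward(self, ref_mag, test_mag):
        x = torch.stack([ref_mag, test_mag], dim=1)  # (batch, 2, time, freq)
        return self.net(x)


def discriminator_loss(d, clean, enhanced, target_score):
    """D learns the metric: clean vs. clean should score 1.0; clean vs.
    enhanced should score the measured, normalised metric (target_score)."""
    return (((d(clean, clean) - 1.0) ** 2).mean()
            + ((d(clean, enhanced.detach()) - target_score) ** 2).mean())


def generator_loss(d, clean, enhanced):
    """G is rewarded when D predicts the maximum score for enhanced speech."""
    return ((d(clean, enhanced) - 1.0) ** 2).mean()
```

Because the learned metric network is differentiable, the generator receives a gradient that points directly toward a higher predicted score. This is consistent with the DPGAN(P)/DPGAN(S) split in the tables above, which appears to correspond to optimising the discriminator toward PESQ and STOI respectively.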