Lightweight End-to-End Speech Enhancement Generative Adversarial Network Using Sinc Convolutions
Abstract
1. Introduction
- Transfer the success of Sinc convolutions in speech and speaker recognition to end-to-end speech enhancement.
- Optimize the SEGAN architecture of the seminal work [19], and extend the original Sinc convolution layer to fit the advanced SEGAN (a minimal sketch of such a layer follows this list).
- Analyze the filters learned by the Sinc convolution layer.
- Apply data augmentation directly to raw speech waveforms to further improve system performance.
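As background for the following sections, here is a minimal PyTorch sketch of a SincNet-style layer [26,27], in which each output channel is a band-pass filter parametrized only by its learnable low and high cutoff frequencies. The initialization, normalization, and naming below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincConv(nn.Module):
    """Minimal SincNet-style layer: each output channel is a learned
    band-pass filter defined by two cutoff frequencies only."""

    def __init__(self, out_channels=64, kernel_size=251, sample_rate=16000):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel size keeps the filter symmetric"
        self.kernel_size = kernel_size
        # Learnable low cutoff and bandwidth per filter (Hz), spread linearly
        # at initialization (illustrative; mel-spaced in the SincNet paper).
        edges = torch.linspace(30.0, sample_rate / 2 - 100.0, out_channels + 1)
        self.low_hz = nn.Parameter(edges[:-1].unsqueeze(1))
        self.band_hz = nn.Parameter((edges[1:] - edges[:-1]).unsqueeze(1))
        # Fixed (non-learned) time axis in seconds and Hamming window.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1)
        self.register_buffer("t", n / sample_rate)
        self.register_buffer("window",
                             torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):  # x: (batch, 1, time)
        f_low = torch.abs(self.low_hz)            # enforce f_low >= 0
        f_high = f_low + torch.abs(self.band_hz)  # enforce f_high >= f_low
        # Ideal band-pass = difference of two sinc low-pass impulse responses:
        # h(t) = 2*f_high*sinc(2*f_high*t) - 2*f_low*sinc(2*f_low*t)
        filters = (2 * f_high * torch.sinc(2 * f_high * self.t)
                   - 2 * f_low * torch.sinc(2 * f_low * self.t)) * self.window
        filters = filters / filters.abs().max(dim=1, keepdim=True).values
        return F.conv1d(x, filters.unsqueeze(1), padding=self.kernel_size // 2)
```

With these defaults, `SincConv(out_channels=64, kernel_size=251)` corresponds to the 64-filter, kernel-size-251 configuration evaluated in Section 7.1.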
2. Related Work
3. Speech Enhancement GANs
4. Sinc Convolution
5. Sinc–SEGAN Architecture
6. Experimental Setup
6.1. Database
6.2. Evaluation Metrics
- PESQ: perceptual evaluation of speech quality, using the wide-band version recommended in ITU-T P.862.2 [42] (in the range of [−0.5, 4.5])
- STOI: short-time objective intelligibility [43] (in the range of [0, 100])
- SSNR: segmental SNR [44] (in the range of [0, ∞); a minimal computation sketch follows this list)
- CBAK: mean opinion score (MOS) prediction of the intrusiveness of background noise [45] (in the range of [1, 5])
- CSIG: MOS prediction of signal distortion attending only to the speech signal [45] (in the range of [1, 5])
- COVL: MOS prediction of the overall effect [45] (in the range of [1, 5])
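For reproducibility, here is a minimal NumPy sketch of the segmental SNR computation referenced above. The 30 ms framing, the conventional per-frame clipping range of [−10, 35] dB, and the function name are assumptions, not necessarily the exact variant used in [44]; common third-party implementations of PESQ and STOI are noted in the trailing comments.

```python
import numpy as np


def segmental_snr(clean, enhanced, frame_len=480, min_db=-10.0, max_db=35.0):
    """Mean of per-frame SNRs (dB), each frame clipped to a conventional
    [-10, 35] dB range (frame_len=480 is ~30 ms at 16 kHz)."""
    eps = 1e-10
    n_frames = min(len(clean), len(enhanced)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        d = s - enhanced[i * frame_len:(i + 1) * frame_len]
        snr = 10.0 * np.log10((np.sum(s**2) + eps) / (np.sum(d**2) + eps))
        snrs.append(np.clip(snr, min_db, max_db))
    return float(np.mean(snrs))

# PESQ and STOI, assuming the third-party `pesq` and `pystoi` packages:
#   from pesq import pesq;   pesq(16000, clean, enhanced, 'wb')   # wide-band
#   from pystoi import stoi; stoi(clean, enhanced, 16000) * 100   # percent
```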
6.3. Implementation Details
6.4. Data Augmentation
6.5. Baseline
7. Results
7.1. Ablation Tests on the Configuration of Sinc Convolution
- Increasing the number of Sinc filters degrades system performance: more filters add model complexity and make training correspondingly harder.
- Decreasing the kernel size of the Sinc convolution deteriorates performance, since a smaller kernel limits the layer's ability to extract representative speech cues.
- Systems benefit from longer input lengths, as more context information is included.
- The addition architecture outperforms the substitution architecture, as the former is deeper (the difference between the two variants is sketched below).
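For concreteness, the following hypothetical sketch contrasts the two variants compared above, assuming that "sub" substitutes the first strided convolution of the SEGAN generator encoder with the Sinc layer, while "add" prepends the Sinc layer to the unmodified encoder front-end (hence a deeper network). Channel counts, kernel sizes, and activations are placeholders, not the paper's exact configuration.

```python
import torch.nn as nn


def sinc_segan_frontend(variant: str, sinc_layer: nn.Module) -> nn.Sequential:
    """Generator encoder front-end for the two Sinc-SEGAN variants
    (placeholder shapes)."""
    first_conv = nn.Conv1d(64, 64, kernel_size=31, stride=2, padding=15)
    if variant == "sub":
        # The Sinc layer takes over the role of the first convolution.
        return nn.Sequential(sinc_layer, nn.PReLU())
    if variant == "add":
        # The Sinc layer is prepended; the original first conv is kept.
        return nn.Sequential(sinc_layer, nn.PReLU(), first_conv, nn.PReLU())
    raise ValueError(f"unknown variant: {variant}")
```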
7.2. Performance and Parameter Comparisons with Baseline Systems
7.3. Ablation Tests on Augmentation Methods
7.4. Interpretation of Sinc Convolution
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| CNN | convolutional neural network |
| GAN | generative adversarial network |
| SEGAN | speech enhancement generative adversarial network |
| SNR | signal-to-noise ratio |
| MOS | mean opinion score |
| SSNR | segmental SNR |
| STOI | short-time objective intelligibility |
| CBAK | MOS prediction of the intrusiveness of background noise |
| CSIG | MOS prediction of signal distortion attending only to the speech signal |
| COVL | MOS prediction of the overall effect |
| PESQ | perceptual evaluation of speech quality |
References
1. Loizou, P. Speech Enhancement: Theory and Practice, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2017.
2. Donahue, C.; Li, B.; Prabhavalkar, R. Exploring speech enhancement with generative adversarial networks for robust speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5024–5028.
3. Zhao, Z.; Liu, H.; Fingscheidt, T. Convolutional neural networks to enhance coded speech. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 27, 663–678.
4. Reddy, C.K.A.; Shankar, N.; Bhat, G.S.; Charan, R.; Panahi, I. An individualized super-Gaussian single microphone speech enhancement for hearing aid users with smartphone as an assistive device. IEEE Signal Process. Lett. 2017, 24, 1601–1605.
5. Goehring, T.; Bolner, F.; Monaghan, J.J.; Van Dijk, B.; Zarowski, A.; Bleeck, S. Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users. Hear. Res. 2017, 344, 183–194.
6. Lim, J.; Oppenheim, A. All-pole modeling of degraded speech. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 197–210.
7. Narayanan, A.; Wang, D. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7092–7096.
8. Wang, Y.; Narayanan, A.; Wang, D. On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1849–1858.
9. Nie, S.; Liang, S.; Xue, W.; Zhang, X.; Liu, W. Two-stage multi-target joint learning for monaural speech separation. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 1503–1507.
10. Weninger, F.; Hershey, J.R.; Le Roux, J.; Schuller, B. Discriminatively trained recurrent neural networks for single-channel speech separation. In Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Atlanta, GA, USA, 3–5 December 2014; pp. 577–581.
11. Erdogan, H.; Hershey, J.R.; Watanabe, S.; Le Roux, J. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 708–712.
12. Xu, Y.; Du, J.; Dai, L.R.; Lee, C.H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 23, 7–19.
13. Nie, S.; Liang, S.; Liu, W.; Zhang, X.; Tao, J. Deep learning based speech separation via NMF-style reconstructions. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 2043–2055.
14. Lu, X.; Tsao, Y.; Matsuda, S.; Hori, C. Ensemble modeling of denoising autoencoder for speech spectrum restoration. In Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014.
15. Fu, S.W.; Hu, T.Y.; Tsao, Y.; Lu, X. Complex spectrogram enhancement by convolutional neural network with multi-metrics learning. In Proceedings of the 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), Tokyo, Japan, 25–28 September 2017; pp. 1–6.
16. Weninger, F.; Eyben, F.; Schuller, B. Single-channel speech separation with memory-enhanced recurrent neural networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 3709–3713.
17. Sun, L.; Du, J.; Dai, L.R.; Lee, C.H. Multiple-target deep learning for LSTM-RNN based speech enhancement. In Proceedings of the 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), San Francisco, CA, USA, 1–3 March 2017; pp. 136–140.
18. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014, arXiv:1406.2661.
19. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech enhancement generative adversarial network. arXiv 2017, arXiv:1703.09452.
20. Higuchi, T.; Kinoshita, K.; Delcroix, M.; Nakatani, T. Adversarial training for data-driven speech enhancement without parallel corpus. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; pp. 40–47.
21. Qin, S.; Jiang, T. Improved Wasserstein conditional generative adversarial network speech enhancement. EURASIP J. Wirel. Commun. Netw. 2018, 2018, 1–10.
22. Li, Z.X.; Dai, L.R.; Song, Y.; McLoughlin, I. A conditional generative model for speech enhancement. Circuits Syst. Signal Process. 2018, 37, 5005–5022.
23. Pascual, S.; Serrà, J.; Bonafonte, A. Towards generalized speech enhancement with generative adversarial networks. arXiv 2019, arXiv:1904.03418.
24. Wang, D.; Lim, J. The unimportance of phase in speech enhancement. IEEE Trans. Acoust. Speech Signal Process. 1982, 30, 679–681.
25. Paliwal, K.; Wójcicki, K.; Shannon, B. The importance of phase in speech enhancement. Speech Commun. 2011, 53, 465–494.
26. Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with SincNet. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 1021–1028.
27. Ravanelli, M.; Bengio, Y. Interpretable convolutional filters with SincNet. arXiv 2018, arXiv:1811.09725.
28. Parcollet, T.; Morchid, M.; Linares, G. E2E-SincNet: Toward fully end-to-end speech recognition. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7714–7718.
29. Mittermaier, S.; Kürzinger, L.; Waschneck, B.; Rigoll, G. Small-footprint keyword spotting on raw audio data with sinc-convolutions. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7454–7458.
30. Phan, H.; McLoughlin, I.V.; Pham, L.; Chén, O.Y.; Koch, P.; De Vos, M.; Mertins, A. Improving GANs for speech enhancement. IEEE Signal Process. Lett. 2020, 27, 1700–1704.
31. Phan, H.; Nguyen, H.L.; Chén, O.Y.; Koch, P.; Duong, N.Q.; McLoughlin, I.; Mertins, A. Self-attention generative adversarial network for speech enhancement. arXiv 2020, arXiv:2010.09132.
32. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
33. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544.
34. Fainberg, J.; Klejch, O.; Loweimi, E.; Bell, P.; Renals, S. Acoustic model adaptation from raw waveforms with SincNet. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 897–904.
35. Rabiner, L.; Schafer, R. Theory and Applications of Digital Speech Processing; Prentice Hall Press: Hoboken, NJ, USA, 2010.
36. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
37. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. arXiv 2016, arXiv:1606.03498.
38. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. Proc. ICML 2013, 30, 3.
39. Valentini-Botinhao, C. Noisy Speech Database for Training Speech Enhancement Algorithms and TTS Models; University of Edinburgh, School of Informatics, Centre for Speech Technology Research: Edinburgh, UK, 2017.
40. Veaux, C.; Yamagishi, J.; King, S. The Voice Bank corpus: Design, collection and data analysis of a large regional accent speech database. In Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, 25–27 November 2013; pp. 1–4.
41. Thiemann, J.; Ito, N.; Vincent, E. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings. In Proceedings of Meetings on Acoustics ICA2013; Acoustical Society of America: Montreal, QC, Canada, 2013; Volume 19, p. 035081.
42. ITU-T Recommendation P.862.2: Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs; International Telecommunication Union: Geneva, Switzerland, 2005.
43. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136.
44. Quackenbush, S.R. Objective Measures of Speech Quality. Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA, USA, 1995.
45. Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 2007, 16, 229–238.
46. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
47. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv 2017, arXiv:1706.08500.
48. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363.
49. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037.
50. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779.
Table 1. Statistics of the noisy speech corpus (Voice Bank + DEMAND) used for training and testing.

| Subset | Speakers | Utterances | Noise Types | SNRs (dB) |
|---|---|---|---|---|
| Training | 28 | 11,200 | 10 | 15, 10, 5, 0 |
| Test | 2 | 800 | 5 | 17.5, 12.5, 7.5, 2.5 |
Table 2. Configurations of the ablation experiments on the Sinc convolution.

| Experiment | A | B | C | D | E |
|---|---|---|---|---|---|
| Architecture | sub | sub | sub | sub | add |
| Input length | 1 s | 1 s | 1 s | 250 ms | 1 s |
| Number of Sinc filters | 64 | 80 | 64 | 64 | 64 |
| Kernel size of Sinc convolution | 251 | 251 | 101 | 251 | 251 |

Table 3. Results of the ablation experiments (higher is better for all metrics).

| Experiment | PESQ | CSIG | CBAK | COVL | SSNR | STOI |
|---|---|---|---|---|---|---|
| A | 2.37 | 3.55 | 3.13 | 2.97 | 8.68 | 93.40 |
| B | 2.32 | 3.49 | 2.84 | 2.91 | 5.51 | 92.99 |
| C | 2.40 | 3.46 | 3.07 | 2.89 | 8.66 | 93.39 |
| D | 2.36 | 3.57 | 3.07 | 2.94 | 8.70 | 93.37 |
| E | 2.39 | 3.69 | 3.23 | 3.00 | 8.71 | 93.53 |
Table 4. Performance and parameter comparisons with baseline systems.

| Architecture | Params (M) | PESQ | CSIG | CBAK | COVL | SSNR | STOI |
|---|---|---|---|---|---|---|---|
| Noisy | — | 1.97 | 3.35 | 2.44 | 2.63 | 1.69 | 92.10 |
| SEGAN [19] | 294 | 2.16 | 3.48 | 2.94 | 2.79 | 7.66 | 93.12 |
| ISEGAN [30] | 294 | 2.24 | 3.23 | 2.93 | 2.68 | 8.19 | 93.29 |
| DSEGAN [30] | 513 | 2.35 | 3.56 | 3.10 | 2.94 | 8.70 | 93.25 |
| SASEGAN-avg [31] | 295 | 2.33 | 3.52 | 3.05 | 2.90 | 8.08 | 93.33 |
| SASEGAN-all [31] | 310 | 2.35 | 3.55 | 3.10 | 2.91 | 8.30 | 93.49 |
| Sinc–SEGAN-sub | 91 | 2.37 | 3.55 | 3.13 | 2.97 | 8.68 | 93.40 |
| Sinc–SEGAN-add | 210 | 2.39 | 3.69 | 3.23 | 3.00 | 8.71 | 93.53 |
| Sinc–SEGAN-add + augment | 210 | 2.86 | 3.87 | 3.66 | 3.15 | 8.87 | 94.96 |
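As a side note, the "Params (M)" column can be reproduced for any PyTorch model with a one-liner; a minimal sketch, assuming `model` is an instantiated network (whether the count covers the generator alone or generator plus discriminator is not specified in this outline):

```python
def count_params_millions(model) -> float:
    """Total number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```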
Table 5. Ablation of the data augmentation methods applied to Sinc–SEGAN-add.

| Sinc–SEGAN-add | PESQ | CSIG | CBAK | COVL | SSNR | STOI |
|---|---|---|---|---|---|---|
| +BM | 2.44 | 3.55 | 3.37 | 3.05 | 8.79 | 93.75 |
| +BM, +ReMix | 2.45 | 3.57 | 3.40 | 3.07 | 8.81 | 93.80 |
| +BM, +ReMix, +shift | 2.86 | 3.87 | 3.66 | 3.15 | 8.87 | 94.96 |
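To make the compared augmentations concrete, here is a hedged NumPy sketch of two of them, operating directly on raw waveforms as the paper proposes. ReMix is assumed to re-pair clean utterances with freshly drawn noise at a random training SNR, and shift is assumed to be a random temporal shift of the input; BM is omitted because its exact definition is not given in this outline. All function names and parameter values are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def remix(clean_batch, noise_batch, snr_choices_db=(0, 5, 10, 15)):
    """ReMix sketch: pair each clean utterance with a randomly drawn noise
    clip, scaled to a randomly chosen training SNR."""
    mixed = []
    for clean in clean_batch:
        noise = np.resize(noise_batch[rng.integers(len(noise_batch))], len(clean))
        snr_db = rng.choice(snr_choices_db)
        # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
        scale = np.sqrt(np.sum(clean**2) / (np.sum(noise**2) * 10 ** (snr_db / 10)))
        mixed.append(clean + scale * noise)
    return mixed


def random_shift(wave, max_shift=1600):
    """Shift sketch: circularly shift the raw waveform by up to max_shift
    samples (100 ms at 16 kHz) in either direction."""
    return np.roll(wave, int(rng.integers(-max_shift, max_shift + 1)))
```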
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Li, L.; Wudamu; Kürzinger, L.; Watzel, T.; Rigoll, G. Lightweight End-to-End Speech Enhancement Generative Adversarial Network Using Sinc Convolutions. Appl. Sci. 2021, 11, 7564. https://doi.org/10.3390/app11167564