Sparse Signal Recovery through Long Short-Term Memory Networks for Compressive Sensing-Based Speech Enhancement
Abstract
1. Introduction
2. Theoretical Background
2.1. Framing and De-Framing
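As a rough illustration of frame blocking and overlap-add de-framing with a Hamming window (the standard procedure described in the windowing references cited below), the sketch that follows uses illustrative frame length and hop size, not the paper's settings:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split x into overlapping Hamming-windowed frames (illustrative sizes)."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] * win
                     for i in range(n_frames)])

def deframe_signal(frames, hop=128):
    """Overlap-add reconstruction; divide by the summed window envelope."""
    frame_len = frames.shape[1]
    win = np.hamming(frame_len)
    n = hop * (len(frames) - 1) + frame_len
    out = np.zeros(n)
    norm = np.zeros(n)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f
        norm[i * hop : i * hop + frame_len] += win
    return out / np.maximum(norm, 1e-12)
```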
2.2. Voice Activity Detector
- Spectral shape
- Spectro-temporal modulations
- Voicing
- Long-term variability (a simplified single-stream sketch follows this list)
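The full front-end combines these four streams (Van Segbroeck et al., cited below). As a much simpler stand-in, the toy gate below marks a frame as speech when it is both energetic and spectrally non-flat; the thresholds are arbitrary assumptions, and this is not the detector used in the paper:

```python
import numpy as np

def simple_vad(frames, energy_thresh_db=-40.0, flatness_thresh=0.5):
    """Toy VAD: a frame is 'speech' if it is energetic and not spectrally flat.
    Simplified stand-in for the multi-stream front-end described above;
    thresholds are arbitrary assumptions, not values from the paper."""
    spec = np.abs(np.fft.rfft(frames, axis=1)) + 1e-12
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    # Spectral flatness: geometric mean / arithmetic mean of the magnitude spectrum.
    flatness = np.exp(np.mean(np.log(spec), axis=1)) / np.mean(spec, axis=1)
    return (energy_db > energy_thresh_db) & (flatness < flatness_thresh)
```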
2.3. Compressive Sensing
- The measurement process is modelled as $y = \Phi x$, where $y \in \mathbb{R}^{M}$ is the observation vector, $\Phi \in \mathbb{R}^{M \times N}$ (with $M \ll N$) is the sensing matrix, and $x \in \mathbb{R}^{N}$ is the signal to be recovered (a numerical sketch follows this list).
- $x$ must be sparse, i.e., $\|x\|_{0} = K \ll N$.
- $\Phi$ must be a full rank matrix.
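As a minimal numerical illustration of this model, the sketch below compresses a $K$-sparse signal with a random Gaussian $\Phi$ and recovers it with scikit-learn's orthogonal matching pursuit (one of the greedy baselines compared later); the sizes $N$, $M$, $K$ are assumptions for the sketch, not the paper's configuration:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
N, M, K = 256, 64, 8          # signal length, measurements, sparsity (illustrative)

# K-sparse signal and a random Gaussian sensing matrix (full rank w.h.p.).
x = np.zeros(N)
support = rng.choice(N, K, replace=False)
x[support] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

y = Phi @ x                   # compressed measurements, M << N

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False).fit(Phi, y)
print(np.allclose(omp.coef_, x, atol=1e-6))   # typically exact in the noiseless case
```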
2.4. RNN and LSTM Neural Networks
2.4.1. Recurrent Neural Networks (RNN)
2.4.2. Long Short-Term Memory (LSTM)
- The input gate ($i_t$) controls how much fresh information is admitted into the memory cell.
- The output gate ($o_t$) controls the output flow of the cell.
- The forget gate ($f_t$) determines when the internal state information is erased.
- The input modulation gate ($g_t$) controls the main candidate input to the memory cell.
- The internal state ($c_t$) carries the cell's internal recurrence.
- The hidden state ($h_t$) carries information from earlier data samples within the context window (the standard cell update equations are given after this list).
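For completeness, these are the standard LSTM cell updates in this notation (following the Hochreiter and Schmidhuber formulation cited below; $\sigma$ is the logistic sigmoid, $\odot$ the element-wise product, and the weight names $W_{\ast}$, $U_{\ast}$, $b_{\ast}$ are the usual convention rather than symbols taken from the paper):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```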
3. Proposed Approach
3.1. Deep Learning System Modelling
3.2. Enhancement Algorithm
3.2.1. Training of LSTM
3.2.2. Speech Decompression and Denoising Using Trained LSTM
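The outline implies a per-frame pipeline: compress each windowed frame, let the trained LSTM estimate the sparse coefficients, then de-frame. The sketch below is only a schematic of one plausible wiring, reusing the framing helpers from the Section 2.1 sketch; `trained_lstm` and `Phi` are placeholders, not the authors' code:

```python
import numpy as np

def enhance(noisy, trained_lstm, Phi, frame_len=256, hop=128):
    """Schematic decompression/denoising stage (hypothetical wiring only).

    trained_lstm: callable mapping a measurement vector y to an estimated
                  sparse frame x_hat (placeholder for the trained network).
    Phi:          sensing matrix used at compression time.
    """
    frames = frame_signal(noisy, frame_len, hop)   # helper from Section 2.1 sketch
    recovered = []
    for f in frames:
        y = Phi @ f                                # compressed measurements
        x_hat = trained_lstm(y)                    # LSTM sparse recovery
        recovered.append(x_hat)
    return deframe_signal(np.stack(recovered), hop)
```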
4. Dataset
4.1. NOIZEUS Dataset
4.2. NOISEX-92 Dataset
5. Performance Evaluation Metrics
5.1. Perceptual Evaluation of Speech Quality (PESQ)
5.2. Short-Time Objective Intelligibility (STOI)
5.3. Signal-to-Distortion Ratio (SDR)
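Of the three metrics, SDR has a compact closed form. Following Vincent et al. (cited below), the estimate is decomposed as $\hat{s} = s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}$, and:

```latex
\mathrm{SDR} = 10 \log_{10}
\frac{\lVert s_{\text{target}} \rVert^{2}}
     {\lVert e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}} \rVert^{2}}
```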
6. Results Analysis
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Das, N.; Chakraborty, S.; Chaki, J.; Padhy, N.; Dey, N. Fundamentals, Present and Future Perspectives of Speech Enhancement. Int. J. Speech Technol. 2020, 24, 883–901.
- Donoho, D.L. For Most Large Underdetermined Systems of Linear Equations the Minimal ℓ1-Norm Solution Is Also the Sparsest Solution. Commun. Pure Appl. Math. 2006, 59, 797–829.
- Ahani, S.; Ghaemmaghami, S.; Wang, Z.J. A Sparse Representation-Based Wavelet Domain Speech Steganography Method. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 80–91.
- Donoho, D.L.; Tsaig, Y.; Drori, I.; Starck, J.-L. Sparse Solution of Underdetermined Systems of Linear Equations by Stagewise Orthogonal Matching Pursuit. IEEE Trans. Inf. Theory 2012, 58, 1094–1121.
- Crespo Marques, E.; Maciel, N.; Naviner, L.; Cai, H.; Yang, J. A Review of Sparse Recovery Algorithms. IEEE Access 2019, 7, 1300–1322.
- Yang, H.; Hao, D.; Sun, H.; Liu, Y. Speech Enhancement Using Orthogonal Matching Pursuit Algorithm. In Proceedings of the 2014 International Conference on Orange Technologies, Xi'an, China, 20–23 September 2014; pp. 101–104.
- de Paiva, N.M.; Marques, E.C.; de Barros Naviner, L.A. Sparsity Analysis Using a Mixed Approach with Greedy and LS Algorithms on Channel Estimation. In Proceedings of the 2017 3rd International Conference on Frontiers of Signal Processing (ICFSP), Paris, France, 6–8 September 2017; pp. 91–95.
- Shinde, P.P.; Shah, S. A Review of Machine Learning and Deep Learning Applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; pp. 1–6.
- Ljung, L.; Andersson, C.; Tiels, K.; Schön, T.B. Deep Learning and System Identification. IFAC-PapersOnLine 2020, 53, 1175–1181.
- Glorot, X.; Bengio, Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
- Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent Advances in Convolutional Neural Networks. Pattern Recognit. 2018, 77, 354–377.
- Salehinejad, H.; Sankar, S.; Barfett, J.; Colak, E.; Valaee, S. Recent Advances in Recurrent Neural Networks. arXiv 2017, arXiv:1801.01078.
- Staudemeyer, R.C.; Morris, E.R. Understanding LSTM—A Tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv 2019, arXiv:1909.09586.
- Graves, A.; Mohamed, A.; Hinton, G. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
- Gonzalez, J.; Yu, W. Non-Linear System Modeling Using LSTM Neural Networks. IFAC-PapersOnLine 2018, 51, 485–489.
- Wang, Y. A New Concept Using LSTM Neural Networks for Dynamic System Identification. In Proceedings of the 2017 American Control Conference (ACC), Seattle, WA, USA, 24–26 May 2017; pp. 5324–5329.
- Hamid, O.K. Frame Blocking and Windowing Speech Signal. J. Inf. 2018, 4, 8.
- Prabhu, K.M.M. Window Functions and Their Applications in Signal Processing; Taylor & Francis: Abingdon, UK, 2014; ISBN 978-1-4665-1584-0.
- Segbroeck, M.V.; Tsiartas, A.; Narayanan, S. A Robust Frontend for VAD: Exploiting Contextual, Discriminative and Spectral Cues of Human Voice. Interspeech 2013, 5, 704–708.
- Kim, B.-H.; Pyun, J.-Y. ECG Identification for Personal Authentication Using LSTM-Based Deep Recurrent Neural Networks. Sensors 2020, 20, 3069.
- Kolen, J.F.; Kremer, S.C. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. In A Field Guide to Dynamical Recurrent Networks; IEEE: Piscataway, NJ, USA, 2001; pp. 237–243; ISBN 978-0-470-54403-7.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
- Hu, C.; Wu, Q.; Li, H.; Jian, S.; Li, N.; Lou, Z. Deep Learning with a Long Short-Term Memory Networks Approach for Rainfall-Runoff Simulation. Water 2018, 10, 1543.
- Hu, Y.; Loizou, P.C. Subjective Comparison and Evaluation of Speech Enhancement Algorithms. Speech Commun. 2007, 49, 588–601.
- Varga, A.; Steeneken, H.J.M. Assessment for Automatic Speech Recognition: II. NOISEX-92: A Database and an Experiment to Study the Effect of Additive Noise on Speech Recognition Systems. Speech Commun. 1993, 12, 247–251.
- Al-Radhi, M.S.; Csapó, T.G.; Németh, G. Continuous Noise Masking Based Vocoder for Statistical Parametric Speech Synthesis. IEICE Trans. Inf. Syst. 2020, E103-D, 1099–1107.
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual Evaluation of Speech Quality (PESQ)—A New Method for Speech Quality Assessment of Telephone Networks and Codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752.
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 4214–4217.
- Vincent, E.; Gribonval, R.; Fevotte, C. Performance Measurement in Blind Audio Source Separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1462–1469.
- Cevher, V.; Waters, A. The CoSaMP Algorithm. In ELEC 639: Graphical Models Lecture Notes; Rice University: Houston, TX, USA, 2008.
- Haneche, H.; Boudraa, B.; Ouahabi, A. A New Way to Enhance Speech Signal Based on Compressed Sensing. Measurement 2020, 151, 107117.
- Martin-Doñas, J.M.; Gomez, A.M.; Gonzalez, J.A.; Peinado, A.M. A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality. IEEE Signal Process. Lett. 2018, 25, 1680–1684.
| Name of the Parameter | Value |
|---|---|
| First layer | Sequence input layer, with size equal to that of the observation vector in Equation (3) |
| Second layer | LSTM layer with 50 hidden units |
| Third layer | Fully connected layer with output size 50 |
| Fourth layer | Dropout layer with dropout probability 0.25 |
| Fifth layer | Fully connected layer with output size equal to that of the sparse signal vector in Equation (3) |
| Sixth layer | Regression layer |
| Maximum epochs | 250 |
| Optimizer | Adam |
| Learning rate | 0.01 |
| Gradient threshold | 1.0 |
| Batch training size | 20 |
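The layer stack above reads like a MATLAB Deep Learning Toolbox configuration. As a hedged illustration, a rough PyTorch equivalent might look as follows; the input and output sizes `M` and `N` are placeholders for the measurement and sparse-vector lengths of Equation (3), not values from the paper:

```python
import torch
import torch.nn as nn

M, N = 64, 256  # measurement / sparse-vector lengths: placeholders

# Rough PyTorch analogue of the layer stack in the table above
# (a sketch, not the authors' exact MATLAB model).
class SparseRecoveryLSTM(nn.Module):
    def __init__(self, in_size=M, hidden=50, out_size=N, p_drop=0.25):
        super().__init__()
        self.lstm = nn.LSTM(in_size, hidden, batch_first=True)  # LSTM, 50 hidden units
        self.fc1 = nn.Linear(hidden, 50)                         # fully connected, 50
        self.drop = nn.Dropout(p_drop)                           # dropout 0.25
        self.fc2 = nn.Linear(50, out_size)                       # fully connected, N

    def forward(self, y):                  # y: (batch, time, M)
        h, _ = self.lstm(y)
        return self.fc2(self.drop(self.fc1(h)))

model = SparseRecoveryLSTM()
opt = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam, learning rate 0.01
loss_fn = nn.MSELoss()                               # regression layer ~ MSE loss
# Train for 250 epochs with batch size 20, clipping gradients at 1.0 each step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```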
| SNR (dB) | OMP PESQ | OMP STOI | OMP SDR (dB) | CoSaMP PESQ | CoSaMP STOI | CoSaMP SDR (dB) | StOMP PESQ | StOMP STOI | StOMP SDR (dB) | Proposed PESQ | Proposed STOI | Proposed SDR (dB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.801 | 0.688 | 9.740 | 1.759 | 0.634 | 7.226 | 1.796 | 0.608 | 8.576 | 2.383 | 0.820 | 6.647 |
| 5 | 2.319 | 0.764 | 9.786 | 2.069 | 0.689 | 7.229 | 2.173 | 0.699 | 8.453 | 2.593 | 0.808 | 10.982 |
| 10 | 2.57 | 0.793 | 13.230 | 2.403 | 0.734 | 12.002 | 2.509 | 0.768 | 12.642 | 2.680 | 0.841 | 13.593 |
| 15 | 2.759 | 0.837 | 16.099 | 2.611 | 0.776 | 15.445 | 2.708 | 0.817 | 15.748 | 2.778 | 0.848 | 15.162 |
| 20 | 2.957 | 0.877 | 17.650 | 2.760 | 0.824 | 17.471 | 2.874 | 0.869 | 17.574 | 2.985 | 0.889 | 16.226 |
| SNR (dB) | OMP PESQ | OMP STOI | OMP SDR (dB) | CoSaMP PESQ | CoSaMP STOI | CoSaMP SDR (dB) | StOMP PESQ | StOMP STOI | StOMP SDR (dB) | K-SVDCS PESQ | K-SVDCS STOI | K-SVDCS SDR (dB) | Proposed PESQ | Proposed STOI | Proposed SDR (dB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.78 | 0.652 | 1.138 | 1.917 | 0.683 | 1.098 | 1.969 | 0.644 | 1.002 | 1.96 | 0.66 | -- | 2.547 | 0.807 | 4.237 |
| 5 | 2.216 | 0.736 | 6.565 | 2.251 | 0.726 | 4.693 | 2.308 | 0.715 | 4.942 | 2.28 | 0.72 | -- | 2.583 | 0.813 | 7.664 |
| 10 | 2.434 | 0.811 | 10.708 | 2.513 | 0.759 | 9.592 | 2.513 | 0.771 | 9.876 | 2.52 | 0.79 | -- | 2.720 | 0.826 | 11.471 |
| 15 | 2.606 | 0.827 | 14.032 | 2.646 | 0.774 | 13.743 | 2.709 | 0.805 | 13.770 | 2.69 | 0.81 | -- | 2.752 | 0.842 | 14.035 |
| 20 | 2.667 | 0.859 | 16.552 | 2.772 | 0.816 | 16.496 | 2.853 | 0.847 | 16.505 | 2.85 | 0.83 | -- | 2.861 | 0.866 | 15.182 |
| SNR (dB) | OMP PESQ | OMP STOI | OMP SDR (dB) | CoSaMP PESQ | CoSaMP STOI | CoSaMP SDR (dB) | StOMP PESQ | StOMP STOI | StOMP SDR (dB) | Proposed PESQ | Proposed STOI | Proposed SDR (dB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.684 | 0.659 | 2.138 | 1.887 | 0.629 | 1.037 | 1.942 | 0.591 | 1.461 | 2.527 | 0.849 | 4.917 |
| 5 | 2.101 | 0.751 | 7.399 | 2.183 | 0.682 | 5.005 | 2.268 | 0.69 | 6.002 | 2.648 | 0.835 | 9.625 |
| 10 | 2.509 | 0.787 | 11.496 | 2.496 | 0.745 | 10.128 | 2.577 | 0.756 | 10.669 | 2.926 | 0.837 | 12.788 |
| 15 | 2.654 | 0.836 | 14.828 | 2.632 | 0.756 | 14.233 | 2.747 | 0.808 | 14.422 | 2.873 | 0.849 | 14.370 |
| 20 | 2.838 | 0.862 | 16.959 | 2.747 | 0.796 | 16.758 | 2.878 | 0.839 | 16.808 | 2.951 | 0.903 | 13.453 |