A Reinforcement Learning Approach to Speech Coding
Abstract
1. Introduction
2. Current Standardized Speech Codecs
3. Reinforcement Learning and Stochastic Control: Terminology and Methods
- A model of the environment that we call the System Model;
- A control policy that describes the behavior of the learning agent;
- A cost function instead of a reward function;
- A cost-to-go that tries to emulate a Value Function.
4. Speech Coding as Reinforcement Learning
- A System Model;
- A Cost Function;
- A Cost-to-Go function;
- A Control Policy.
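The four components listed above can be illustrated with a toy sketch. This is not the authors' codec: the first-order model coefficient `A`, the four-level control alphabet `CONTROLS`, the squared-error cost, and the exhaustive lookahead search are all assumptions chosen only to show how a system model, a cost function, a cost-to-go, and a control policy fit together.

```python
from itertools import product

A = 0.9                               # hypothetical AR(1) coefficient of the system model
CONTROLS = (-1.0, -0.25, 0.25, 1.0)   # hypothetical 4-level control alphabet


def system_model(state, control):
    """System model: next reconstructed sample from the previous one and the control."""
    return A * state + control


def cost(target, reconstructed):
    """Cost function: squared error for a single sample (an assumed choice)."""
    return (target - reconstructed) ** 2


def cost_to_go(state, controls, targets):
    """Cost-to-go: summed cost of applying a control sequence starting from `state`."""
    total = 0.0
    for u, s in zip(controls, targets):
        state = system_model(state, u)
        total += cost(s, state)
    return total


def control_policy(state, targets):
    """Control policy: release the first control of the cheapest lookahead path."""
    best = min(product(CONTROLS, repeat=len(targets)),
               key=lambda seq: cost_to_go(state, seq, targets))
    return best[0]


def encode(samples, lookahead=3):
    """Run the policy over a sample sequence; returns (controls, reconstruction)."""
    state, chosen, recon = 0.0, [], []
    for k in range(len(samples)):
        window = samples[k:k + lookahead]   # shorter near the end of the input
        u = control_policy(state, window)
        state = system_model(state, u)
        chosen.append(u)
        recon.append(state)
    return chosen, recon
```

Calling `encode(samples, lookahead=1)` reduces the search to a greedy sample-by-sample decision, while larger `lookahead` values defer each decision until more of the input has been examined, at a search cost that grows exponentially in this exhaustive form.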
4.1. The System Model
5. Learning (Parameter Adaptation)
5.1. AR and MA Parameter Adaptation
5.2. Quasi-Periodic Excitation Adaptation
6. Error Shaping
7. The Value Function
8. Control Policy
8.1. Control Tree Sequences
8.2. Control Tree Gain Adaptation
9. L Step Lookahead Cost Function
10. Exploitation and Exploration
11. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
AMR-WB | Adaptive Multirate-Wideband |
AR | Autoregressive |
MA | Moving Average |
PESQ-MOS | Perceptual Evaluation of Speech Quality-Mean Opinion Score |
RLS | Recursive Least Squares |
VAD/CNG | Voice Activity Detection/Comfort Noise Generation |
Appendix A. DPCM and Tree Coding
Appendix A.1. DPCM
Appendix A.2. Tree Coding
Appendix B. Pitch Lag Adaptation
Appendix C. Pitch Stability Test
Appendix C.1. Stability Test
- If , the following is sufficient for stability:
- (a)
- If , the satisfaction of the two following conditions is sufficient for stability:
- (a)
- (b)
- (i) or (ii)
Appendix C.2. Stabilization Procedure
- If ,
- If ,
- If ,
Appendix D. Voice Activity Detection/Comfort Noise Generation (VAD/CNG)
Control | Value Function | F1 | F2 | M1 | M2 | Avg | Std Dev |
---|---|---|---|---|---|---|---|
Random 4-2 | PESQ-MOS | 3.377 | 3.376 | 3.443 | 3.556 | 3.438 | 0.073 |
Random 4-2 | Rate (kbit/s) | 5.92 | 5.57 | 4.09 | 5.41 | 5.25 | |
Random 4-2 plus 5-level | PESQ-MOS | 3.425 | 3.515 | 3.544 | 3.569 | 3.513 | 0.054 |
Random 4-2 plus 5-level | Rate (kbit/s) | 6.03 | 5.66 | 4.15 | 5.49 | 5.33 | |
Random 4-2, 5-level, 5pol | PESQ-MOS | 3.471 | 3.562 | 3.584 | 3.599 | 3.554 | 0.050 |
Random 4-2, 5-level, 5pol | Rate (kbit/s) | 6.03 | 5.66 | 4.15 | 5.49 | 5.33 | |

Control | Value Function | F1 | F2 | M1 | M2 | Avg | Std Dev |
---|---|---|---|---|---|---|---|
Random 4-2 | PESQ-MOS | 3.646 | 3.581 | 3.784 | 3.699 | 3.678 | 0.074 |
Random 4-2 | Rate (kbit/s) | 12 | 12 | 12 | 12 | 12 | |
Random 4-2 plus 5-level | PESQ-MOS | 3.747 | 3.798 | 4.022 | 3.793 | 3.840 | 0.107 |
Random 4-2 plus 5-level | Rate (kbit/s) | 12.24 | 12.24 | 12.24 | 12.24 | 12.24 | |
Random 4-2, 5-level, 5pol | PESQ-MOS | 3.766 | 3.827 | 4.039 | 3.819 | 3.863 | 0.104 |
Random 4-2, 5-level, 5pol | Rate (kbit/s) | 12.24 | 12.24 | 12.24 | 12.24 | 12.24 | |
AMR, Narrowband | PESQ-MOS | 4.04 | 4.001 | 4.089 | 4.063 | 4.048 | 0.032 |
AMR, Narrowband | Rate (kbit/s) | 12.2 | 12.2 | 12.2 | 12.2 | 12.2 | |

Control | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 | Avg | Std Dev |
---|---|---|---|---|---|---|---|---|---|---|
Random 4-2 | 3.826 | 3.543 | 3.597 | 3.577 | 3.602 | 3.54 | 3.634 | 3.669 | 3.624 | 0.087 |
Random 4-2 plus 5-level | 3.878 | 3.769 | 3.632 | 3.512 | 3.721 | 3.677 | 3.624 | 3.787 | 3.700 | 0.106 |
Random 4-2, 5-level, 5pol | 3.861 | 3.772 | 3.639 | 3.514 | 3.733 | 3.68 | 3.638 | 3.79 | 3.703 | 0.102 |
AMR, Narrowband | 3.978 | 3.96 | 3.721 | 3.818 | 3.923 | 3.634 | 3.954 | 3.697 | 3.836 | 0.128 |

Control | M4 | M5 | M6 | M7 | M8 | M9 | M10 | Avg | Std Dev |
---|---|---|---|---|---|---|---|---|---|
Random 4-2 | 3.645 | 3.684 | 3.856 | 3.662 | 3.796 | 3.848 | 3.828 | 3.760 | 0.086 |
Random 4-2 plus 5-level | 3.806 | 3.792 | 3.877 | 3.723 | 3.801 | 3.878 | 3.793 | 3.810 | 0.050 |
Random 4-2, 5-level, 5pol | 3.804 | 3.797 | 3.88 | 3.736 | 3.806 | 3.875 | 3.808 | 3.815 | 0.046 |
AMR, Narrowband | 3.984 | 4.091 | 3.973 | 3.902 | 3.824 | 3.745 | 4.052 | 3.939 | 0.114 |

Control | Average | Standard Deviation |
---|---|---|
Random 4-2 | 3.687 | 0.110 |
Random 4-2 plus 5-level | 3.751 | 0.101 |
Random 4-2, 5-level, 5pol | 3.756 | 0.098 |
AMR, Narrowband | 3.884 | 0.132 |
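
The Avg and Std Dev columns in the tables above can be checked by hand. The short sketch below recomputes them for the first "Random 4-2" PESQ-MOS row (speakers F1, F2, M1, M2). The use of the population standard deviation (dividing by N) is an inference from the reported numbers, not something the source states.

```python
from statistics import mean, pstdev

# PESQ-MOS scores for speakers F1, F2, M1, M2 (first "Random 4-2" row above).
scores = [3.377, 3.376, 3.443, 3.556]

avg = mean(scores)
std = pstdev(scores)  # population form (divide by N); assumed, not stated in the source

print(f"Avg = {avg:.3f}, Std Dev = {std:.3f}")
```

For this row the result matches the tabulated 3.438 and 0.073, whereas the sample form (`statistics.stdev`, dividing by N - 1) would give roughly 0.085.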

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Gibson, J.; Oh, H. A Reinforcement Learning Approach to Speech Coding. Information 2022, 13, 331. https://doi.org/10.3390/info13070331