A Light-Weight Autoregressive CNN-Based Frame Level Transducer Decoder for End-to-End ASR
Abstract
1. Introduction
2. Models
2.1. Inference
2.2. Training
3. Experiments and Results
4. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wang, D.; Wang, X.; Lv, S. An overview of end-to-end automatic speech recognition. Symmetry 2019, 11, 1018. [Google Scholar] [CrossRef]
- Graves, A.; Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 21–26 June 2014; pp. 1764–1772. [Google Scholar]
- Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 4960–4964. [Google Scholar]
- Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545. [Google Scholar] [CrossRef]
- Hannun, A.; Lee, A.; Xu, Q.; Collobert, R. Sequence-to-sequence speech recognition with time-depth separable convolutions. arXiv 2019, arXiv:1904.02619. [Google Scholar]
- Kriman, S.; Beliaev, S.; Ginsburg, B.; Huang, J.; Kuchaiev, O.; Lavrukhin, V.; Leary, R.; Li, J.; Zhang, Y. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 6124–6128. [Google Scholar]
- Graves, A.; Jaitly, N.; Mohamed, A.R. Hybrid speech recognition with deep bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–12 December 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 273–278. [Google Scholar]
- Shewalkar, A.; Nyavanandi, D.; Ludwig, S.A. Performance evaluation of deep neural networks applied to speech recognition: RNN, LSTM and GRU. J. Artif. Intell. Soft Comput. Res. 2019, 9, 235–245. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Zhang, Q.; Lu, H.; Sak, H.; Tripathi, A.; McDermott, E.; Koo, S.; Kumar, S. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 7829–7833. [Google Scholar]
- Wang, Y.; Mohamed, A.; Le, D.; Liu, C.; Xiao, A.; Mahadeokar, J.; Huang, H.; Tjandra, A.; Zhang, X.; Zhang, F.; et al. Transformer-based acoustic modeling for hybrid speech recognition. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 6874–6878. [Google Scholar]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
- Chan, W.; Saharia, C.; Hinton, G.; Norouzi, M.; Jaitly, N. Imputer: Sequence modelling via imputation and dynamic programming. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1403–1413. [Google Scholar]
- Higuchi, Y.; Watanabe, S.; Chen, N.; Ogawa, T.; Kobayashi, T. Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict. arXiv 2020, arXiv:2005.08700. [Google Scholar]
- Fujita, Y.; Watanabe, S.; Omachi, M.; Chang, X. Insertion-based modeling for end-to-end automatic speech recognition. arXiv 2020, arXiv:2005.13211. [Google Scholar]
- Lee, J.; Watanabe, S. Intermediate loss regularization for CTC-based speech recognition. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 6224–6228. [Google Scholar]
- Nozaki, J.; Komatsu, T. Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions. arXiv 2021, arXiv:2104.02724. [Google Scholar]
- Graves, A. Sequence transduction with recurrent neural networks. arXiv 2012, arXiv:1211.3711. [Google Scholar]
- Ghahremani, P.; BabaAli, B.; Povey, D.; Riedhammer, K.; Trmal, J.; Khudanpur, S. A pitch extraction algorithm tuned for automatic speech recognition. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 2494–2498. [Google Scholar]
- Kudo, T. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv 2018, arXiv:1804.10959. [Google Scholar]
- Higuchi, Y.; Chen, N.; Fujita, Y.; Inaguma, H.; Komatsu, T.; Lee, J.; Nozaki, J.; Wang, T.; Watanabe, S. A comparative study on non-autoregressive modelings for speech-to-text generation. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 47–54. [Google Scholar]
- Collobert, R.; Puhrsch, C.; Synnaeve, G. Wav2letter: An end-to-end convnet-based speech recognition system. arXiv 2016, arXiv:1609.03193. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 5206–5210. [Google Scholar]
- Siivola, V.; Pellom, B.L. Growing an n-gram language model. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005. [Google Scholar]
- Ko, T.; Peddinti, V.; Povey, D.; Khudanpur, S. Audio augmentation for speech recognition. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015. [Google Scholar]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar]
- Kim, S.; Hori, T.; Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 4835–4839. [Google Scholar]
- Watanabe, S.; Hori, T.; Kim, S.; Hershey, J.R.; Hayashi, T. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE J. Sel. Top. Signal Process. 2017, 11, 1240–1253. [Google Scholar] [CrossRef]
- Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Soplin, N.E.Y.; Heymann, J.; Wiesner, M.; Chen, N.; et al. Espnet: End-to-end speech processing toolkit. arXiv 2018, arXiv:1804.00015. [Google Scholar]
| Model | Autoregressive | Decoding Level [Length] | Special Tokens | Training Algorithm |
|---|---|---|---|---|
| Transformer [9] | O | token level [U] | <sos>, <eos> | Cross-Entropy |
| CTC [12] | X | frame level [T] | blank | Forward/backward |
| RNN-T [18] | O | token + frame level [T + U] | blank | Forward/backward |
| This work | O | frame level [T] | space, re-appearance | Forward/backward |
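
The space and re-appearance tokens listed for this work replace CTC's blank in the frame-level label alphabet. As a rough illustration only, the sketch below shows one plausible way such tokens could be applied when building targets from a wordpiece sequence: an explicit <space> token at word boundaries, and a <reapp> token standing in for the second of two identical consecutive wordpieces. The token names and the mapping are assumptions made for this example, not the authors' exact recipe.

```python
# Illustrative sketch only: one plausible reading of the "space" and
# "re-appearance" special tokens, not the authors' published recipe.
# Assumptions: word boundaries become an explicit <space> token (in place of
# CTC's blank), and the second of two identical consecutive wordpieces is
# rewritten as <reapp> so the label sequence never contains an immediate repeat.

SPACE = "<space>"
REAPP = "<reapp>"

def to_frame_level_labels(wordpieces):
    """Map a wordpiece sequence onto the assumed frame-level label alphabet."""
    labels = []
    for piece in wordpieces:
        if piece == " ":                      # word boundary -> explicit space token
            labels.append(SPACE)
        elif labels and labels[-1] == piece:  # immediate repeat -> re-appearance token
            labels.append(REAPP)
        else:
            labels.append(piece)
    return labels

print(to_frame_level_labels(["he", "he", " ", "said"]))
# ['he', '<reapp>', '<space>', 'said']
```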
| Model (Special Token) | Train-Clean-100 (%) | Test (%) |
|---|---|---|
| CTC (blank) | 5,986,615 (66.7) | 1,283,175 (68.1) |
| This work (space) | 4,119,724 (45.9) | 888,293 (47.1) |
| | Train-Clean-100 | Dev-Clean | Dev-Other | Test-Clean | Test-Other | Sum (%) |
|---|---|---|---|---|---|---|
| 2 consecutive same tokens | 72,555 | 1410 | 1219 | 1311 | 1104 | 77,599 (1.1%) |
| 3 consecutive same tokens | 16 | 0 | 0 | 1 | 1 | 18 (0.0003%) |
| Decoder Type | | Forward Variable α(t, u) | Backward Variable β(t, u) |
|---|---|---|---|
| CTC | wordpiece (even u) | (α(t−1, u−2) + α(t−1, u−1) + α(t−1, u)) ∗ p_t(y_u) | (β(t+1, u+2) + β(t+1, u+1) + β(t+1, u)) ∗ p_t(y_u) |
| | blank (odd u) | (α(t−1, u−1) + α(t−1, u)) ∗ p_t(y_u) | (β(t+1, u+1) + β(t+1, u)) ∗ p_t(y_u) |
| RNN-T | | α(t−1, u) ∗ p_{t−1,u}(∅) + α(t, u−1) ∗ p_{t,u−1}(y_u) | β(t+1, u) ∗ p_{t,u}(∅) + β(t, u+1) ∗ p_{t,u}(y_{u+1}) |
| This work | | (α(t−1, u−1) + α(t−1, u)) ∗ p_{t,u}(y_u) | (β(t+1, u+1) + β(t+1, u)) ∗ p_{t,u}(y_u) |
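
For concreteness, here is a minimal sketch of the frame-level forward recursion attributed to this work in the table above, α(t, u) = (α(t−1, u−1) + α(t−1, u)) ∗ p_{t,u}(y_u), evaluated in log space for numerical stability. The (T, U) array with log_probs[t, u] = log p_{t,u}(y_u) is an assumed stand-in for the decoder's per-frame, per-label output; it is not taken from the authors' code.

```python
# Minimal sketch of the frame-level forward recursion from the table above:
#   alpha(t, u) = (alpha(t-1, u-1) + alpha(t-1, u)) * p_{t,u}(y_u)
# computed in log space. The log_probs interface is an assumption made for
# this example, not the authors' implementation.

import numpy as np

def frame_level_log_forward(log_probs):
    """log_probs: (T, U) array with log_probs[t, u] = log p_{t,u}(y_u).

    Returns log alpha of shape (T, U); the sequence log-likelihood used by the
    forward/backward training algorithm would be log_alpha[T - 1, U - 1]."""
    T, U = log_probs.shape
    log_alpha = np.full((T, U), -np.inf)
    log_alpha[0, 0] = log_probs[0, 0]  # the first frame must emit the first label
    for t in range(1, T):
        for u in range(U):
            stay = log_alpha[t - 1, u]  # keep emitting the current label
            step = log_alpha[t - 1, u - 1] if u > 0 else -np.inf  # advance one label
            log_alpha[t, u] = np.logaddexp(stay, step) + log_probs[t, u]
    return log_alpha
```

A backward pass over β(t, u) would mirror this loop from t = T−1 down to 0, matching the backward-variable column of the table.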
| | Model | Params (M) | Test-Clean WER (%) | Test-Other WER (%) |
|---|---|---|---|---|
| Autoregressive | This work | 0.18 | 6.8 | 18.4 |
| | Transformer [9] | 9.6 | 7.2 | 18.0 |
| | RNN-T [18] | 1.0 | 6.6 | 18.3 |
| Non-autoregressive | CTC [12] | 0 | 7.7 | 20.7 |
| | Self-conditioned CTC [17,21] | 0 | 6.9 | 19.7 |
| | Intermediate CTC [16,21] | 0 | 7.1 | 20.2 |
| Model | Encoding Time (s) | Decoding Time (s) | Relative Decoding Time | Training Time (h) |
|---|---|---|---|---|
| This work | 368 | 48.6 | 1.0 | 75.9 |
| Transformer [9] | 350 | 1500 | 30.9 | 41.7 |
| RNN-T [18] | 365 | 80.1 | 1.65 | 38.1 |
| CTC [12] | 371 | 3.63 | 0.074 | 24.6 |
| Kernel Size | Layers | Channels | # Trainable Params | Dev-Clean WER (%) | Dev-Other WER (%) | Test-Clean WER (%) | Test-Other WER (%) |
|---|---|---|---|---|---|---|---|
| 6 | 1 | 96 | 115 k | 6.6 | 18.4 | 7.0 | 18.7 |
| 6 | 1 | 128 | 177 k | 6.4 | 18.3 | 6.8 | 18.4 |
| 6 | 1 | 192 | 340 k | 6.4 | 18.3 | 6.9 | 18.8 |
| 6 | 2 | 128 | 276 k | 6.5 | 18.6 | 6.9 | 18.8 |
| Model | Relative Decoding Time | Params (M) | Test-Clean WER (%) | Test-Other WER (%) |
|---|---|---|---|---|
| Original RNN-T (LSTM) | 1.65 | 1.0 | 6.6 | 18.3 |
| RNN-T (LSTM → CNN) | 1.83 | 0.27 | 6.8 | 18.3 |
| This work (CNN) | 1.0 | 0.18 | 6.8 | 18.4 |
| Model | Beam Size | Relative Decoding Time | Test-Clean WER (%) | Test-Other WER (%) |
|---|---|---|---|---|
| This work | 1 | 1.0 | 6.8 | 18.4 |
| This work | 10 | 24.14 | 6.8 | 18.3 |
| RNN-T | 1 | 1.65 | 6.6 | 18.3 |
| RNN-T | 10 | 30.36 | 6.3 | 17.9 |
| Transformer | 1 | 30.9 | 7.2 | 18.0 |
| Transformer | 10 | 347.77 | 6.7 | 17.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).