A Novel Single-Word Speech Recognition on Embedded Systems Using a Convolution Neuron Network with Improved Out-of-Distribution Detection
Abstract
1. Introduction
- A single-word speech recognition system for embedded devices, built on a CNN, is introduced. The system does not run continuously; it is triggered only when speech activity is detected.
- The “negative branch” method is introduced, which uses out-of-distribution examples as a regularizer during training. In testing, this training approach improved the anomaly detection performance of artificial neural networks compared with traditional methods. Further experiments explored the factors that influence the performance of models trained with this method.
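The idea of regularizing with out-of-distribution (OOD) examples can be illustrated with a minimal sketch. The text here does not spell out how the negative branch is constructed, so the code below is not the authors' method: it is a generic stand-in that assumes a penalty pushing the softmax output toward a uniform distribution on OOD inputs, plus a maximum-softmax threshold for detection. The function names, the `weight` factor, and the `threshold` value are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard classification loss for one in-distribution example."""
    return -math.log(softmax(logits)[label])


def uniform_penalty(logits):
    """Cross-entropy between a uniform target and the softmax output.

    Assumed regularizer (not taken from the paper): it is smallest when
    the network is maximally uncertain about an OOD input.
    """
    probs = softmax(logits)
    return -sum(math.log(p) for p in probs) / len(probs)


def regularized_loss(in_batch, ood_batch, weight=0.5):
    """Mean classification loss plus a weighted mean OOD penalty.

    in_batch:  list of (logits, label) pairs for known words
    ood_batch: list of logits for inputs outside the known classes
    weight:    hypothetical trade-off factor
    """
    ce = sum(cross_entropy(l, y) for l, y in in_batch) / len(in_batch)
    reg = sum(uniform_penalty(l) for l in ood_batch) / len(ood_batch)
    return ce + weight * reg


def is_out_of_distribution(logits, threshold=0.5):
    """Flag an input as OOD when the top softmax probability is low."""
    return max(softmax(logits)) < threshold


if __name__ == "__main__":
    confident = [10.0] + [0.0] * 7   # 8 outputs, as in the final layer
    uncertain = [0.0] * 8
    print(is_out_of_distribution(confident))
    print(is_out_of_distribution(uncertain))
    print(regularized_loss([(confident, 0)], [uncertain]))
```

With a regularizer of this kind, a network that stays confident on OOD inputs pays a larger penalty than one that spreads its probability mass evenly, which is what makes a simple confidence threshold usable at inference time.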
2. System
2.1. Speech Recognition System
2.2. Endpoint Detection
2.3. Digital Signal Processing for Feature Extraction
2.4. Convolution Neuron Network
2.5. Hardware Description
3. Method
3.1. Dataset
3.2. Training with Negative Branch
3.3. Evaluation Metrics
4. Implementation Detail
5. Experiment and Results
5.1. Comparison of the Traditional Method and Negative Branch Method
5.2. Number of Feature Maps
5.3. Deployment on a Microcontroller
6. Conclusions and Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Layer | N (Channels or In-Out Features) | Kernel Size | Padding | Stride | Activation | Params |
---|---|---|---|---|---|---|
Conv2d | 10 | 5,5 | 2,2 | 1,1 | ReLU | 260 |
MaxPool2d | 10 | 2,2 | 0,0 | 1,1 | - | - |
BatchNorm2d | 10 | - | - | - | - | 20 |
Conv2d | 20 | 5,5 | 2,2 | 1,1 | ReLU | 5020 |
MaxPool2d | 20 | 2,2 | 1,1 | 1,1 | - | - |
Conv2d | 40 | 5,5 | 2,2 | 1,1 | ReLU | 20,040 |
MaxPool2d | 40 | 2,2 | 0,0 | 1,1 | - | - |
Linear | 480-80 | - | - | - | ReLU | 38,480 |
BatchNorm1d | 80 | - | - | - | - | 160 |
Linear | 80-8 | - | - | - | Softmax | 648 |
Total | - | - | - | - | - | 64,628 |
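The parameter counts in the table above can be reproduced with a few lines of arithmetic. The sketch below assumes a single input channel for the first convolution, bias terms on every convolutional and linear layer, and two learnable parameters (scale and shift) per batch-normalization feature; these assumptions are inferred from the listed counts rather than stated in the table, and the helper names are illustrative.

```python
def conv2d_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output channel."""
    kh, kw = kernel
    return in_channels * out_channels * kh * kw + out_channels


def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features


def batchnorm_params(num_features):
    """Learnable scale and shift only; running statistics not counted."""
    return 2 * num_features


# Layers with trainable parameters, in table order.
# The single input channel of the first Conv2d is an assumption.
layer_params = [
    ("Conv2d 1->10", conv2d_params(1, 10, (5, 5))),
    ("BatchNorm2d 10", batchnorm_params(10)),
    ("Conv2d 10->20", conv2d_params(10, 20, (5, 5))),
    ("Conv2d 20->40", conv2d_params(20, 40, (5, 5))),
    ("Linear 480->80", linear_params(480, 80)),
    ("BatchNorm1d 80", batchnorm_params(80)),
    ("Linear 80->8", linear_params(80, 8)),
]

total_params = sum(count for _, count in layer_params)

if __name__ == "__main__":
    for name, count in layer_params:
        print(f"{name:16s} {count:>7,d}")
    print(f"{'Total':16s} {total_params:>7,d}")
```

Under these assumptions the per-layer values and the total of 64,628 agree with the table; the pooling layers contribute no parameters.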
Device | STM32H743VIT6 |
---|---|
Processor | 32-bit Arm® Cortex®-M7 |
CPU Frequency | 480 MHz (max) |
Hardware Accelerator | Double-precision FPU, L1 Cache |
DSP Extension | Single-cycle 16/32-bit MAC; single-cycle dual 16-bit MAC; 8/16-bit SIMD arithmetic |
ROM | 2 MB |
RAM | 1 MB |
Id | Layer | Duration (ms) | Proportion |
---|---|---|---|
0 | Conv2dPool | 3.548 | 19.6% |
1 | BN | 0.043 | 0.2% |
2 | Conv2dPool | 6.864 | 38.0% |
3 | Conv2dPool | 7.035 | 39.0% |
4 | Transpose | 0.058 | 0.3% |
5 | Dense | 0.490 | 2.7% |
6 | NL | 0.002 | 0.0% |
7 | BN | 0.003 | 0.0% |
8 | Dense | 0.010 | 0.1% |
9 | Eltwise | 0.004 | 0.0% |
10 | NL | 0.003 | 0.0% |
Total | - | 18.060 | - |
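The proportion column in the timing table follows directly from the per-layer durations. The short sketch below recomputes it from the listed values and also sums the share taken by the three fused convolution and pooling stages; the variable names are illustrative, and the durations are copied from the table.

```python
# Per-layer inference durations in milliseconds, copied from the table.
durations_ms = [
    ("Conv2dPool", 3.548),
    ("BN", 0.043),
    ("Conv2dPool", 6.864),
    ("Conv2dPool", 7.035),
    ("Transpose", 0.058),
    ("Dense", 0.490),
    ("NL", 0.002),
    ("BN", 0.003),
    ("Dense", 0.010),
    ("Eltwise", 0.004),
    ("NL", 0.003),
]

total_ms = sum(ms for _, ms in durations_ms)

# Share of total latency per layer, rounded to one decimal place.
proportions = [(name, round(100 * ms / total_ms, 1)) for name, ms in durations_ms]

# Combined share of the three convolution + pooling stages.
conv_ms = sum(ms for name, ms in durations_ms if name == "Conv2dPool")
conv_share = 100 * conv_ms / total_ms

if __name__ == "__main__":
    for layer_id, (name, pct) in enumerate(proportions):
        print(f"{layer_id:2d} {name:11s} {pct:5.1f}%")
    print(f"total: {total_ms:.3f} ms, convolution stages: {conv_share:.1f}%")
```

The three convolution stages account for roughly 96.6% of the 18.060 ms total, so any further latency optimization on this device would mainly have to target those layers.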
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chen, J.; Teo, T.H.; Kok, C.L.; Koh, Y.Y. A Novel Single-Word Speech Recognition on Embedded Systems Using a Convolution Neuron Network with Improved Out-of-Distribution Detection. Electronics 2024, 13, 530. https://doi.org/10.3390/electronics13030530