Front-End of Vehicle-Embedded Speech Recognition for Voice-Driven Multi-UAVs Control
Abstract
Featured Application
1. Introduction
2. Speech Recognition Scheme for Voice-Driven Multi-UAV Control
2.1. Conventional Speech Recognition Schemes
2.2. Proposed Speech Recognition Scheme for Voice-Driven Multi-UAV Control
3. Vehicle-Embedded Speech Recognition Front-End for Multi-UAV Control
3.1. Procedures of Vehicle-Embedded Speech Recognition Front-End
3.2. Noise Reduction for UAV Environments
3.2.1. Conventional Noise Reduction Approaches
3.2.2. MMSE Enhancement Based on Spectral Energy Variation for Noise Reduction in UAV Environments
3.3. Voice Trigger for Multi-UAV Control
3.3.1. Conventional Voice Trigger
3.3.2. Multi-Channel Voice Trigger for Multi-UAV Control
4. Experiments and Discussion
4.1. Validation of Proposed Noise Reduction Approach
4.1.1. Experimental Setup
4.1.2. Experimental Results and Discussion
4.2. Validation of Proposed Voice Trigger Approach
4.2.1. Experimental Setup
4.2.2. Results and Discussion
4.3. Discussion on Evaluation Results
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Appendix A
Appendix B
References
| Approach | Equal Error Rate (%) |
|---|---|
| Conventional | 32.1 |
| Proposed_DC1 | 26.0 |
| Proposed_DC2 | 21.3 |
Equal Error Rate (%):

| Approach | Participant1 | Participant2 | Participant3 | Participant4 | Participant5 | Average |
|---|---|---|---|---|---|---|
| CNN-based | 4.32 | 3.76 | 4.93 | 7.29 | 4.68 | 4.99 |
| Proposed | 5.78 | 5.12 | 4.84 | 8.61 | 5.43 | 5.96 |
| Approach | Number of Parameters (Training Phase) | Computational Intensity (Decoding Phase) |
|---|---|---|
| CNN-based | 199,936 | 513,536 |
| Proposed | 120,192 | 9,984 |
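The tables above report equal error rate (EER): the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR), so a lower EER means a better-balanced trigger detector. The following is a minimal sketch of how EER is typically computed from detection scores; the function, variable names, and synthetic data are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER: the threshold at which FAR and FRR coincide.
    scores: detection scores (higher = more trigger-like).
    labels: 1 for true trigger utterances, 0 for non-trigger utterances.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = (labels == 1).sum()
    neg = (labels == 0).sum()
    best_gap, eer = 1.0, 1.0
    # Sweep every observed score as a candidate decision threshold.
    for t in np.sort(np.unique(scores)):
        far = ((scores >= t) & (labels == 0)).sum() / neg  # false acceptances
        frr = ((scores < t) & (labels == 1)).sum() / pos   # false rejections
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Synthetic check: overlapping score distributions give an EER near 31%,
# comparable in magnitude to the conventional system in the first table.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500),    # trigger utterances
                         rng.normal(0.0, 1.0, 500)])   # non-trigger utterances
labels = np.concatenate([np.ones(500, int), np.zeros(500, int)])
print(f"EER = {equal_error_rate(scores, labels):.3f}")
```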
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Park, J.-S.; Na, H.-J. Front-End of Vehicle-Embedded Speech Recognition for Voice-Driven Multi-UAVs Control. Appl. Sci. 2020, 10, 6876. https://doi.org/10.3390/app10196876