Open Access Article

End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture

1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
2. College of Fine Arts and Design, Tianjin Normal University, Tianjin 300387, China
3. School of Mathematical Sciences, Harbin Normal University, Harbin 150080, China
4. School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China
*
Author to whom correspondence should be addressed.
Sensors 2020, 20(7), 1809; https://doi.org/10.3390/s20071809
Received: 22 February 2020 / Revised: 16 March 2020 / Accepted: 17 March 2020 / Published: 25 March 2020
Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides us with a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. After this, the improved ASR system was used in the APED task for Mandarin, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that, in terms of accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–hidden Markov model (DNN–HMM) architecture, and performs better on the F-measure metrics, which are especially suitable for the requirements of the APED task.
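The hybrid objective mentioned in the abstract can be sketched as a weighted interpolation of the CTC loss and the attention-based seq2seq loss. The sketch below shows only this basic interpolation; the paper's contribution is an adaptive weight, and the function and variable names here (`hybrid_loss`, `lam`) are illustrative assumptions, not identifiers from the paper.

```python
def hybrid_loss(ctc_loss: float, attention_loss: float, lam: float) -> float:
    """Interpolate the two training losses:

        L = lam * L_ctc + (1 - lam) * L_attention

    lam = 1.0 trains with CTC only; lam = 0.0 with attention only.
    In the paper's improved architecture, lam is chosen adaptively
    rather than fixed, which is what enhances the complementarity
    of the two models.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```

For example, with a CTC loss of 2.0, an attention loss of 4.0, and an equal weight of 0.5, the combined loss is 3.0.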
Keywords: automatic pronunciation error detection; ASR; CTC; attention-based; seq2seq model; end-to-end; CAPT
MDPI and ACS Style

Zhang, L.; Zhao, Z.; Ma, C.; Shan, L.; Sun, H.; Jiang, L.; Deng, S.; Gao, C. End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture. Sensors 2020, 20, 1809.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
