Open Access Review

An Overview of End-to-End Automatic Speech Recognition

D. Wang 1,2,*, X. Wang 1,2,* and S. Lv 1,2
1 Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
2 College of Computer, National University of Defense Technology, Changsha 410073, China
* Authors to whom correspondence should be addressed.
Symmetry 2019, 11(8), 1018
Received: 30 June 2019 / Revised: 21 July 2019 / Accepted: 3 August 2019 / Published: 7 August 2019


Automatic speech recognition, especially large-vocabulary continuous speech recognition, is an important problem in machine learning. For a long time, the hidden Markov model (HMM)-Gaussian mixture model (GMM) was the mainstream speech recognition framework. Recently, however, the HMM-deep neural network (DNN) model and the end-to-end model, both built on deep learning, have achieved performance beyond HMM-GMM, and the two perform comparably. The HMM-DNN model is nevertheless limited by factors inherited from the HMM, such as forced segmentation and alignment of the data, independence assumptions, and separate training of multiple modules, whereas the end-to-end model offers a simplified structure, joint training, direct output, and no need for forced data alignment. The end-to-end model is therefore an important research direction in speech recognition. In this paper we review the development of end-to-end models. We first introduce the basic ideas, advantages, and disadvantages of HMM-based and end-to-end models, and argue that the end-to-end model is the direction in which speech recognition is developing. We then focus on the principles, progress, and research hotspots of three end-to-end models, namely connectionist temporal classification (CTC)-based, recurrent neural network (RNN)-transducer, and attention-based models, and compare them in detail both theoretically and experimentally. Finally, we point out their respective advantages and disadvantages and the possible future development of the end-to-end model. Automatic speech recognition is a pattern recognition task in the field of computer science, which is a subject area of Symmetry.
Keywords: automatic speech recognition; end-to-end; deep learning; neural network; CTC; RNN-transducer; attention; HMM
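
The abstract notes that end-to-end models do not need forced data alignment. For the CTC-based family, this rests on a many-to-one mapping from frame-level paths to label sequences: repeated symbols are merged and a blank symbol is then removed, so many different alignments yield the same transcript. The following is a minimal sketch of that collapse rule only, not of training or decoding as described in the paper; the function name `ctc_collapse` and the use of `-` as the blank symbol are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol, chosen only for this illustration


def ctc_collapse(path: str, blank: str = BLANK) -> str:
    """Map a frame-level CTC path to its label sequence.

    Step 1: merge each run of identical consecutive symbols into one.
    Step 2: drop every blank symbol from the merged sequence.
    """
    merged = (symbol for symbol, _ in groupby(path))
    return "".join(symbol for symbol in merged if symbol != blank)


# Three different frame-level alignments of the same seven-frame utterance.
paths = ["cc-aa-t", "c-a-tt-", "-ca--t-"]
labels = [ctc_collapse(p) for p in paths]
```

All three paths collapse to the same label sequence, which is why no single alignment has to be fixed in advance. The blank also lets a path encode a genuine double letter: `hel-lo` keeps both `l` symbols, whereas `hello` without a blank would merge them.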

Figure 1

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

Share & Cite This Article

MDPI and ACS Style

Wang, D.; Wang, X.; Lv, S. An Overview of End-to-End Automatic Speech Recognition. Symmetry 2019, 11, 1018.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Symmetry EISSN 2073-8994, published by MDPI AG, Basel, Switzerland