Article

An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading

Souheil Fenghour, Daqing Chen, Kun Guo, Bo Li and Perry Xiao
1 School of Engineering, London South Bank University, London SE1 0AA, UK
2 Xi’an VANXUM Electronics Technology Co., Ltd., Xi’an 710129, China
3 School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710129, China
* Author to whom correspondence should be addressed.
Academic Editor: Miguel Ángel Conde
Sensors 2021, 21(23), 7890; https://doi.org/10.3390/s21237890
Received: 15 October 2021 / Revised: 18 November 2021 / Accepted: 20 November 2021 / Published: 26 November 2021
(This article belongs to the Section Sensors and Robotics)
Abstract: As an alternative approach, viseme-based lipreading systems have demonstrated promising performance in decoding videos of people uttering entire sentences. However, the overall performance of such systems is significantly affected by the efficiency of the viseme-to-word conversion during the lipreading process. As shown in the literature, this conversion has become a bottleneck: a system's performance can drop dramatically from a high viseme classification accuracy (e.g., over 90%) to a comparatively low word classification accuracy (e.g., just over 60%). The underlying cause of this phenomenon is that roughly half of the words in the English language are homophemes, i.e., a single set of visemes can map to multiple words, e.g., “time” and “some”. In this paper, to tackle this issue, a deep learning network model with an attention-based Gated Recurrent Unit is proposed for efficient viseme-to-word conversion and compared against three other approaches. The proposed approach features strong robustness, high efficiency, and short execution time, and has been verified with analysis and practical experiments on predicting sentences from the benchmark LRS2 and LRS3 datasets. The main contributions of the paper are as follows: (1) a model is developed that is effective in converting visemes to words, discriminates between homopheme words, and is robust to incorrectly classified visemes; (2) the proposed model uses only a few parameters and, therefore, requires little overhead and time to train and execute; and (3) an improved performance in predicting spoken sentences from the LRS2 dataset, with an attained word accuracy rate of 79.6%, an improvement of 15.0% over state-of-the-art approaches.
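To make the viseme-to-word conversion concrete, the sketch below shows one plausible shape for such a model: a bidirectional GRU encoder over viseme tokens, with a simple attention layer that pools a context vector to inform per-step word predictions. This is an illustrative assumption only; the vocabulary sizes, dimensions, and attention form are not taken from the paper, and the authors' actual architecture may differ.

```python
# Minimal sketch (not the authors' code): an attention-based GRU mapping a
# sequence of viseme class IDs to per-step word logits. All sizes below are
# illustrative placeholders.
import torch
import torch.nn as nn

class VisemeToWordGRU(nn.Module):
    def __init__(self, n_visemes=20, n_words=10000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_visemes, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hid_dim, 1)            # scalar score per step
        self.classifier = nn.Linear(2 * hid_dim, n_words)

    def forward(self, visemes):
        # visemes: (batch, seq_len) integer viseme IDs
        enc_out, _ = self.encoder(self.embed(visemes))   # (batch, seq_len, 2*hid)
        weights = torch.softmax(self.attn(enc_out), dim=1)  # attention over time
        # pooled context vector, broadcast back to every time step
        context = (weights * enc_out).sum(dim=1, keepdim=True)
        fused = enc_out + context
        return self.classifier(fused)                    # (batch, seq_len, n_words)

model = VisemeToWordGRU()
dummy = torch.randint(0, 20, (2, 12))                    # two viseme sequences
print(model(dummy).shape)                                # torch.Size([2, 12, 10000])
```

The attended context gives every time step a view of the whole viseme sequence, which is the kind of sentence-level evidence needed to disambiguate homophemes that are indistinguishable from their local visemes alone.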
Keywords: deep learning; lip reading; neural networks; speech recognition; robustness; augmentation; visemes; Gated Recurrent Unit; recurrent neural networks
MDPI and ACS Style

Fenghour, S.; Chen, D.; Guo, K.; Li, B.; Xiao, P. An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading. Sensors 2021, 21, 7890. https://doi.org/10.3390/s21237890

AMA Style

Fenghour S, Chen D, Guo K, Li B, Xiao P. An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading. Sensors. 2021; 21(23):7890. https://doi.org/10.3390/s21237890

Chicago/Turabian Style

Fenghour, Souheil, Daqing Chen, Kun Guo, Bo Li, and Perry Xiao. 2021. "An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading" Sensors 21, no. 23: 7890. https://doi.org/10.3390/s21237890

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
