FlexLip: A Controllable Text-to-Lip System

Speech and Dialogue Research Lab, University “Politehnica” of Bucharest, 060042 Bucharest, Romania
Faculty of Mathematics and Computer Science, “Babeș-Bolyai” University, 400347 Cluj-Napoca, Romania
Department of Communications, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania
Zevo Technology, 077042 Roșu, Chiajna, Romania
Author to whom correspondence should be addressed.
Academic Editors: Bruce Denby, Tamás Gábor Csapó and Michael Wand
Sensors 2022, 22(11), 4104
Received: 16 May 2022 / Revised: 26 May 2022 / Accepted: 26 May 2022 / Published: 28 May 2022
(This article belongs to the Special Issue Future Speech Interfaces with Sensors and Machine Intelligence)
Converting text input into video content is becoming an important topic in synthetic media generation. Several methods have been proposed, some of which reach close-to-natural performance in constrained tasks. In this paper, we tackle a subproblem of text-to-video generation: converting text into lip landmarks. We do this using a modular, controllable system architecture and evaluate each of its components individually. Our system, named FlexLip, is split into two separate modules, text-to-speech and speech-to-lip, both built on controllable deep neural network architectures. This modularity makes each component easy to replace and allows fast adaptation to new speaker identities by disentangling or projecting the input features. We show that with as little as 20 min of data for the audio generation component and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system, taking into consideration several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models and the data contained therein, and the identity of the target speaker. With regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the shape of the lips in our model.
Keywords: text-to-lip; speech synthesis; text-to-speech; speech-to-lip; zero-shot adaptation; generative models; deep learning; artificial intelligence; objective measures
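
The two-module design described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the FlexLip implementation: the `TextToLip` class, the `Audio` and `LipFrame` types, and the two dummy stages are hypothetical placeholders that stand in for trained neural models. The sketch only shows how a text-to-speech stage and a speech-to-lip stage can be chained while remaining independently replaceable.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

# Hypothetical types, chosen only for illustration.
Audio = List[float]                    # mono waveform samples
LipFrame = List[Tuple[float, float]]   # (x, y) lip landmarks for one video frame


@dataclass
class TextToLip:
    """Chains two replaceable stages: text -> speech -> lip landmarks."""
    text_to_speech: Callable[[str], Audio]
    speech_to_lip: Callable[[Audio], List[LipFrame]]

    def __call__(self, text: str) -> List[LipFrame]:
        audio = self.text_to_speech(text)
        return self.speech_to_lip(audio)


def dummy_tts(text: str, samples_per_char: int = 4) -> Audio:
    # Placeholder for a real TTS model: emits silence whose
    # length grows with the number of characters in the text.
    return [0.0] * (len(text) * samples_per_char)


def dummy_speech_to_lip(audio: Audio, samples_per_frame: int = 8) -> List[LipFrame]:
    # Placeholder for a real speech-to-lip model: emits one
    # two-point "closed mouth" frame per full chunk of audio.
    n_frames = len(audio) // samples_per_frame
    closed = [(0.0, 0.0), (1.0, 0.0)]
    return [list(closed) for _ in range(n_frames)]


pipeline = TextToLip(dummy_tts, dummy_speech_to_lip)
frames = pipeline("hello")

# Swapping one stage leaves the other untouched.
silent_pipeline = replace(pipeline, speech_to_lip=lambda audio: [])
no_frames = silent_pipeline("hello")
```

Because each stage is just a callable with a fixed input and output type, a stage trained on a different speaker or on a different amount of data could be substituted without changing the rest of the pipeline, which is the property the abstract attributes to the modular design.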
MDPI and ACS Style

Oneață, D.; Lőrincz, B.; Stan, A.; Cucu, H. FlexLip: A Controllable Text-to-Lip System. Sensors 2022, 22, 4104.

