Open Access Article
Predicting and Synchronising Co-Speech Gestures for Enhancing Human–Robot Interactions Using Deep Learning Models
by Enrique Fernández-Rodicio 1,*, Christian Dondrup 2, Javier Sevilla-Salcedo 1, Álvaro Castro-González 1 and Miguel A. Salichs 1
1 Department of Systems Engineering and Automation, University Carlos III of Madrid, Av. de la Universidad, 30, 28911 Leganés, Spain
2 The Interaction Lab, School of Mathematical and Computer Sciences, Campus The Avenue, Heriot-Watt University, Edinburgh EH14 4AS, Scotland, UK
* Author to whom correspondence should be addressed.
Biomimetics 2025, 10(12), 835; https://doi.org/10.3390/biomimetics10120835
Submission received: 12 November 2025 / Revised: 8 December 2025 / Accepted: 9 December 2025 / Published: 13 December 2025
Abstract
In recent years, robots have started to be used in tasks that involve interacting with humans. For this to be possible, humans must perceive robots as suitable interaction partners, which can be achieved by giving the robots an animate appearance. One way to endow a robot with a lively appearance is to give it the ability to produce expressions of its own, that is, to combine multimodal actions to convey information. However, this becomes a challenge when the robot has to use gestures and speech simultaneously, as the non-verbal actions need to support the message communicated by the verbal component. In this manuscript, we present a system that, based on a robot's utterances, predicts the corresponding gesture and synchronises it with the speech. A deep learning-based prediction model labels the robot's speech with the types of expressions that should accompany it. Then, a rule-based synchronisation module connects the different gestures to the correct parts of the speech. For the prediction model, we tested two different approaches: (i) a combination of recurrent neural networks and conditional random fields; and (ii) transformer models. The results show that the proposed system can properly select co-speech gestures under the time constraints imposed by real-world interactions.
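As a rough illustration of the kind of prediction model the abstract describes, the sketch below tags each word of an utterance with a gesture type using a bidirectional LSTM. It is a minimal sketch under stated assumptions, not the authors' implementation: the label set, class names, and dimensions are invented for illustration. In approach (i), a CRF layer (e.g. the pytorch-crf package) would be stacked on the per-token emission scores to model tag-transition constraints; in approach (ii), a transformer encoder would replace the LSTM.

# Illustrative sketch only: label set, names, and hyperparameters below are
# assumptions, not details taken from the paper.
import torch
import torch.nn as nn

GESTURE_TAGS = ["NONE", "BEAT", "DEICTIC", "ICONIC", "EMBLEM"]  # hypothetical label set

class GestureTagger(nn.Module):
    """BiLSTM token tagger: assigns a gesture type to each word of an utterance.
    A CRF layer over the emissions would add tag-transition constraints,
    corresponding to approach (i) in the abstract."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_tags=len(GESTURE_TAGS)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq, embed_dim)
        h, _ = self.lstm(x)         # (batch, seq, hidden_dim)
        return self.emit(h)         # per-token emission scores over gesture tags

# Toy usage: tag a 6-word utterance with a randomly initialised model.
model = GestureTagger(vocab_size=1000)
tokens = torch.randint(1, 1000, (1, 6))
tags = model(tokens).argmax(dim=-1)
print([GESTURE_TAGS[int(t)] for t in tags[0]])

The rule-based synchronisation module would then consume these per-token labels, for example by triggering each predicted gesture at the text-to-speech onset time of the first word it covers; the abstract does not specify the rules, so that alignment strategy is likewise an assumption.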