Search Results (4)

Search Parameters:
Keywords = audio-driven talking-head generation

19 pages, 10385 KB  
Article
All’s Well That FID’s Well? Result Quality and Metric Scores in GAN Models for Lip-Synchronization Tasks
by Carina Geldhauser, Johan Liljegren and Pontus Nordqvist
Electronics 2025, 14(17), 3487; https://doi.org/10.3390/electronics14173487 - 31 Aug 2025
Viewed by 882
Abstract
This exploratory study investigates the usability of performance metrics for generative adversarial network (GAN)-based models for speech-driven facial animation. These models transfer speech information from an audio file to a still image to generate talking-head videos in a small-scale “everyday usage” setting. Two models, LipGAN and a custom implementation of a Wasserstein GAN with gradient penalty (L1WGAN-GP), are examined for their visual performance and their scores on commonly used metrics. Quantitative comparisons using FID, SSIM, and PSNR on the GRIDTest dataset show mixed results, and the metrics fail to capture the local artifacts crucial for lip synchronization, pointing to limitations in their applicability to video animation tasks. The study points to the inadequacy of current quantitative measures and emphasizes the continued necessity of human qualitative assessment for evaluating talking-head video quality.
(This article belongs to the Special Issue New Trends in AI-Assisted Computer Vision)
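
For reference, the sketch below shows how the frame-level metrics discussed in this abstract (SSIM and PSNR) are typically computed with scikit-image; the function, data layout, and data range are assumptions for illustration, not the authors' evaluation code. The FID definition is noted in a comment, since it compares global Inception-feature statistics rather than local mouth-region detail, which is one reason such scores can miss lip-sync artifacts.

```python
# Illustrative sketch only (not the article's code). Assumes paired video
# frames are available as uint8 RGB numpy arrays of identical shape.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_metrics(real_frames, generated_frames):
    """Average SSIM and PSNR over paired frames of a talking-head clip."""
    ssim_scores, psnr_scores = [], []
    for real, fake in zip(real_frames, generated_frames):
        ssim_scores.append(
            structural_similarity(real, fake, channel_axis=-1, data_range=255)
        )
        psnr_scores.append(
            peak_signal_noise_ratio(real, fake, data_range=255)
        )
    return float(np.mean(ssim_scores)), float(np.mean(psnr_scores))

# FID, by contrast, compares feature statistics of whole frame sets:
#   FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2))
# i.e., a distance between Gaussian fits of Inception features, which carries
# no explicit notion of localized mouth-region errors.
```
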
21 pages, 1111 KB  
Article
Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis
by Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler and Pål Halvorsen
Big Data Cogn. Comput. 2025, 9(3), 59; https://doi.org/10.3390/bdcc9030059 - 4 Mar 2025
Viewed by 3229
Abstract
This paper explores advancements in real-time talking-head generation, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI’s Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. Although interviewer training systems are considered a potential application, the primary contribution of this work is the improvement of the technical foundations necessary for creating responsive AI avatars. These advancements enable more immersive interactions and expand the scope of AI-driven applications, including educational tools and simulated training environments.
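
As context for the abstract above, the sketch below shows one way Whisper's encoder can serve as an audio feature extractor using the open-source `whisper` package; the model size, input file name, and downstream use are assumptions for illustration, not the paper's integrated system.

```python
# Illustrative sketch: extract frame-level encoder features from OpenAI's
# Whisper for use as audio conditioning in a talking-head renderer.
import torch
import whisper

model = whisper.load_model("base")        # encoder-decoder; only the encoder is used here
audio = whisper.load_audio("speech.wav")  # hypothetical input file
audio = whisper.pad_or_trim(audio)        # Whisper operates on 30 s windows
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    features = model.encoder(mel.unsqueeze(0))  # (1, 1500, d_model) per-frame embeddings

# Downstream, these embeddings would stand in for a conventional AFE front end
# (e.g., a DeepSpeech- or HuBERT-style extractor) feeding the renderer.
```
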

17 pages, 3610 KB  
Article
Multi-Level Feature Dynamic Fusion Neural Radiance Fields for Audio-Driven Talking Head Generation
by Wenchao Song, Qiong Liu, Yanchao Liu, Pengzhou Zhang and Juan Cao
Appl. Sci. 2025, 15(1), 479; https://doi.org/10.3390/app15010479 - 6 Jan 2025
Cited by 1 | Viewed by 2900
Abstract
Audio-driven cross-modal talking head generation, which aims to generate a talking head video that corresponds to a given audio sequence, has advanced significantly in recent years. Among these approaches, NeRF-based methods can generate videos of a specific person with more natural motion than one-shot methods. However, previous approaches fail to distinguish the importance of different regions, resulting in the loss of information-rich region features. To alleviate this problem and improve video quality, we propose MLDF-NeRF, an end-to-end method for talking head generation that achieves better vector representations through multi-level feature dynamic fusion. Specifically, we design two modules in MLDF-NeRF to enhance the cross-modal mapping between audio and different facial regions. We first develop a multi-level tri-plane hash representation that uses three sets of tri-plane hash networks with varying resolution limits to capture the dynamic information of the face more accurately. We then introduce multi-head attention and design an efficient audio-visual fusion module that explicitly fuses audio features with image features from the different planes, improving the mapping between audio features and spatial information. The design also helps minimize interference from facial areas unrelated to audio, improving the overall quality of the representation. Quantitative and qualitative results indicate that the proposed method can effectively generate talking heads with natural motion and realistic details, and that it outperforms previous methods in image quality, lip sync, and other aspects.
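
To make the fusion idea concrete, here is a minimal PyTorch sketch of audio-visual fusion via multi-head cross-attention; the module name, dimensions, and the choice of treating sampled plane features as queries over audio features are illustrative assumptions, not the MLDF-NeRF implementation.

```python
# Minimal cross-attention fusion sketch in the spirit of the module described above.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, plane_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # plane_feats: (B, N_points, dim) features sampled from one tri-plane level
        # audio_feats: (B, T_audio, dim)  audio embeddings for the current window
        fused, _ = self.attn(query=plane_feats, key=audio_feats, value=audio_feats)
        return self.norm(plane_feats + fused)  # residual path keeps audio-unrelated regions stable

fusion = AudioVisualFusion()
out = fusion(torch.randn(2, 1024, 64), torch.randn(2, 16, 64))  # -> (2, 1024, 64)
```
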

25 pages, 1159 KB  
Review
Application of a 3D Talking Head as Part of Telecommunication AR, VR, MR System: Systematic Review
by Nicole Christoff, Nikolay N. Neshov, Krasimir Tonchev and Agata Manolova
Electronics 2023, 12(23), 4788; https://doi.org/10.3390/electronics12234788 - 26 Nov 2023
Cited by 12 | Viewed by 4144
Abstract
In today’s digital era, the realms of virtual reality (VR), augmented reality (AR), and mixed reality (MR), collectively referred to as extended reality (XR), are reshaping human–computer interaction. XR technologies are poised to overcome geographical barriers, offering innovative solutions for enhancing emotional and social engagement in telecommunications and remote collaboration. This paper delves into the integration of artificial intelligence (AI)-powered 3D talking heads within XR-based telecommunication systems. These avatars replicate human expressions, gestures, and speech, effectively minimizing physical constraints in remote communication. The contributions of this research encompass an extensive examination of audio-driven 3D head generation methods and the establishment of comprehensive evaluation criteria for 3D talking head algorithms within Shared Virtual Environments (SVEs). As XR technology evolves, AI-driven 3D talking heads promise to revolutionize remote collaboration and communication.
(This article belongs to the Section Electronic Multimedia)
