Talking Head Generation Through Generative Models and Cross-Modal Synthesis Techniques
Abstract
1. Introduction
1.1. What Is Talking Head Generation
1.2. Why Is Talking Head Generation Important
1.3. Our Contributions
- We offer a comprehensive literature review spanning the diverse domains used to perform talking-head generation.
- We introduce novel taxonomies and a unifying framework that categorize talking-head generation methodologies, providing deeper cross-domain integration than prior surveys.
- We compile an overview of the datasets and evaluation metrics most commonly used to assess talking-head generation models, analyzing their applicability, limitations, and interrelations.
- We explore future research directions, highlighting challenges and potential advancements in talking-head generation, with an emphasis on practical deployment considerations and the trade-offs among realism, temporal consistency, and computational cost.
2. Related Work
3. Discussion
3.1. Overview of Domains
3.1.1. GAN-Based Talking Head Generation
3.1.2. NeRF-Based and Neural Rendering Approaches
3.1.3. Diffusion-Based and Transformer-Based Methods
3.2. Input Modalities in THG
3.2.1. Audio-Driven Generation
3.2.2. Multimodal and Cross-Modal Fusion
3.3. Application Contexts: Virtual Reality and Robotics
3.3.1. Virtual Reality Deployment
3.3.2. Robotics and Human–Robot Interaction
3.4. Overview of Application Approaches Used in Talking Head Generation
3.4.1. Conversational Head Generation
3.4.2. Speech-Driven Animation
3.5. Overview of Datasets
3.5.1. Division into Training and Evaluation Sets
3.5.2. Important Datasets and Their Stats
3.5.3. Dataset Thoroughness (Covering Edge Cases)—VoxCeleb2
3.5.4. Segregation into Image/Video/Audio Sets
3.5.5. Segregation by Modality
3.6. Overview of Evaluation Metrics
3.6.1. Image Quality Metrics
3.6.2. Video Quality and Temporal Consistency
3.6.3. Audio Quality and Audio–Visual Alignment
3.6.4. Realism and Identity Preservation
3.6.5. Lip Synchronization and Mouth Shape Accuracy
3.6.6. Motion Transfer and Instruction-Level Evaluation
3.6.7. Qualitative and Human-Centered Evaluation
3.6.8. Metric Selection Rationale
3.7. Overview of Operating Parameters
Parameters Used Across Approaches
4. Future Directions
4.1. Advancing Multimodal Learning
4.2. Refining Motion Synthesis and Expression Dynamics
4.3. Enhancing Model Robustness and Generalization
4.4. Innovations in Real-Time Processing and Interactive Systems
4.5. Personalization and Identity Preservation
4.6. Cross-Domain Applications and Multilingual Adaptability
4.7. Lack of Standardized Benchmarks and Unified Evaluation Protocols
4.8. Ethical Considerations, Misuse Mitigation, and Regulatory Frameworks
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Pan, Y.; Tan, S.; Cheng, S.; Lin, Q.; Zeng, Z.; Mitchell, K. Expressive Talking Avatars. IEEE Trans. Vis. Comput. Graph. 2024, 30, 2538–2548.
- Song, L.; Wu, W.; Fu, C.; Loy, C.C.; He, R. Audio-driven dubbing for user generated contents via style-aware semi-parametric synthesis. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1247–1261.
- Zhen, R.; Song, W.; He, Q.; Cao, J.; Shi, L.; Luo, J. Human-computer interaction system: A survey of talking-head generation. Electronics 2023, 12, 218.
- Chen, L.; Cui, G.; Kou, Z.; Zheng, H.; Xu, C. What comprises a good talking-head video generation?: A survey and benchmark. arXiv 2020, arXiv:2005.03201.
- Sun, X.; Zhang, L.; Zhu, H.; Zhang, P.; Zhang, B.; Ji, X.; Zhou, K.; Gao, D.; Bo, L.; Cao, X. VividTalk: One-shot audio-driven talking head generation based on 3D hybrid prior. arXiv 2023, arXiv:2312.01841.
- Toshpulatov, M.; Lee, W.; Lee, S. Talking human face generation: A survey. Expert Syst. Appl. 2023, 219, 119678.
- Aneja, D.; Li, W. Real-time lip sync for live 2D animation. arXiv 2019, arXiv:1910.08685.
- Gowda, S.N.; Pandey, D.; Gowda, S.N. From pixels to portraits: A comprehensive survey of talking head generation techniques and applications. arXiv 2023, arXiv:2308.16041.
- Wu, H.; Jia, J.; Xing, J.; Xu, H.; Wang, X.; Wang, J. MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D Face Animation. arXiv 2023, arXiv:2303.09797.
- Fan, H.; Ling, H. MART: Motion-aware recurrent neural network for robust visual tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 566–575.
- Bernardo, B.; Costa, P. A Speech-Driven Talking Head based on a Two-Stage Generative Framework. In Proceedings of the 16th International Conference on Computational Processing of Portuguese; Association for Computational Linguistics: Santiago de Compostela, Spain, 2024; pp. 580–586.
- Chen, Z. A Survey on Talking Head Generation. J. Comput.-Aided Des. Comput. Graph. 2023, 35, 1457–1468.
- Hong, F.T.; Zhang, L.; Shen, L.; Xu, D. Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3397–3406.
- Tan, S.; Ji, B.; Pan, Y. Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style. arXiv 2024, arXiv:2403.06365.
- Doukas, M.C.; Zafeiriou, S.; Sharmanska, V. HeadGAN: One-shot neural head synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 14398–14407.
- Li, L.; Wang, S.; Zhang, Z.; Ding, Y.; Zheng, Y.; Yu, X.; Fan, C. Write-a-speaker: Text-based emotional and rhythmic talking-head generation. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence (AAAI): Palo Alto, CA, USA, 2021; Volume 35, pp. 1911–1920.
- Cosatto, E.; Graf, H.P. Photo-realistic talking-heads from image samples. IEEE Trans. Multimed. 2000, 2, 152–163.
- Chen, L.; Cui, G.; Liu, C.; Li, Z.; Kou, Z.; Xu, Y.; Xu, C. Talking-head generation with rhythmic head motion. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 35–51.
- Lahiri, A.; Kwatra, V.; Frueh, C.; Lewis, J.; Bregler, C. LipSync3D: Data-efficient learning of personalized 3D talking faces from video using pose and lighting normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 2755–2764.
- Ma, Y.; Wang, S.; Hu, Z.; Fan, C.; Lv, T.; Ding, Y.; Deng, Z.; Yu, X. StyleTalk: One-shot talking head generation with controllable speaking styles. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2023; Volume 37, pp. 1896–1904.
- Wang, S.; Ma, Y.; Ding, Y.; Hu, Z.; Fan, C.; Lv, T.; Deng, Z.; Yu, X. StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4331–4347.
- Li, S. OPHAvatars: One-shot photo-realistic head avatars. arXiv 2023, arXiv:2307.09153.
- Guo, Y.; Chen, K.; Liang, S.; Liu, Y.J.; Bao, H.; Zhang, J. AD-NeRF: Audio driven neural radiance fields for talking head synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 5784–5794.
- Sun, Y.; He, R.; Tan, W.; Yan, B. Instruct-NeuralTalker: Editing Audio-Driven Talking Radiance Fields with Instructions. arXiv 2023, arXiv:2306.10813.
- Lin, H.; Wu, Z.; Zhang, Z.; Ma, C.; Yang, X. Emotional Semantic Neural Radiance Fields for Audio-Driven Talking Head. In CAAI International Conference on Artificial Intelligence; Springer Nature: Cham, Switzerland, 2022; pp. 532–544.
- Wang, T.C.; Mallya, A.; Liu, M.Y. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 10039–10049.
- Ren, Y.; Li, G.; Chen, Y.; Li, T.H.; Liu, S. PIRenderer: Controllable portrait image generation via semantic neural rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 13759–13768.
- Milis, G.; Filntisis, P.P.; Roussos, A.; Maragos, P. Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism. arXiv 2023, arXiv:2312.06613.
- Corona, E.; Zanfir, A.; Bazavan, E.G.; Kolotouros, N.; Alldieck, T.; Sminchisescu, C. VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis. arXiv 2024, arXiv:2403.08764.
- Blanz, V.; Vetter, T. A morphable model for the synthesis of 3D faces. In Seminal Graphics Papers: Pushing the Boundaries; Springer: Berlin/Heidelberg, Germany, 2023; Volume 2, pp. 157–164.
- Sheng, Z.; Nie, L.; Zhang, M.; Chang, X.; Yan, Y. Stochastic Latent Talking Face Generation Towards Emotional Expressions and Head Poses. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 2734–2748.
- Chai, Y.; Shao, T.; Weng, Y.; Zhou, K. Personalized Audio-Driven 3D Facial Animation via Style-Content Disentanglement. IEEE Trans. Vis. Comput. Graph. 2022, 30, 1803–1820.
- Prudhvi, Y.; Adinarayana, T.; Chandu, T.; Musthak, S.; Sireesha, G. Vocal Visage: Crafting Lifelike 3D Talking Faces from Static Images and Sound. Int. J. Innov. Res. Comput. Sci. Technol. 2023, 11, 13–17.
- Wang, J.; Zhao, K.; Ma, Y.; Zhang, S.; Zhang, Y.; Shen, Y.; Zhao, D.; Zhou, J. FaceComposer: A unified model for versatile facial content creation. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2024; Volume 36.
- Pan, Y.; Zhang, R.; Cheng, S.; Tan, S.; Ding, Y.; Mitchell, K.; Yang, X. Emotional voice puppetry. IEEE Trans. Vis. Comput. Graph. 2023, 29, 2527–2535.
- Alexanderson, S.; Nagy, R.; Beskow, J.; Henter, G.E. Listen, denoise, action! Audio-driven motion synthesis with diffusion models. ACM Trans. Graph. (TOG) 2023, 42, 44.
- Cheng, K.; Cun, X.; Zhang, Y.; Xia, M.; Yin, F.; Zhu, M.; Wang, X.; Wang, J.; Wang, N. VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild. In SIGGRAPH Asia 2022 Conference Papers; ACM: New York, NY, USA, 2022; pp. 1–9.
- Gong, Y.; Zhang, Y.; Cun, X.; Yin, F.; Fan, Y.; Wang, X.; Wu, B.; Yang, Y. ToonTalker: Cross-domain face reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 7690–7700.
- Ji, X.; Zhou, H.; Wang, K.; Wu, Q.; Wu, W.; Xu, F.; Cao, X. EAMM: One-shot emotional talking face via audio-based emotion-aware motion model. In ACM SIGGRAPH 2022 Conference Proceedings; ACM: New York, NY, USA, 2022; pp. 1–10.
- Richard, A.; Lea, C.; Ma, S.; Gall, J.; De la Torre, F.; Sheikh, Y. Audio- and gaze-driven facial animation of codec avatars. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 41–50.
- Li, P.; Zhao, H.; Liu, Q.; Tang, P.; Zhang, L. TellMeTalk: Multimodal-driven talking face video generation. Comput. Electr. Eng. 2024, 114, 109049.
- Liu, Y.; Lin, L.; Yu, F.; Zhou, C.; Li, Y. MODA: Mapping-once audio-driven portrait animation with dual attentions. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 23020–23029.
- Wang, Z.; He, W.; Wei, Y.; Luo, Y. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head. Displays 2023, 80, 102552.
- Liu, M.; Li, Y.; Zhai, S.; Guan, W.; Nie, L. Towards Realistic Conversational Head Generation: A Comprehensive Framework for Lifelike Video Synthesis. In Proceedings of the 31st ACM International Conference on Multimedia; ACM: New York, NY, USA, 2023; pp. 9441–9445.
- Li, B.; Li, H.; Liu, H. Driving Animatronic Robot Facial Expression From Speech. arXiv 2024, arXiv:2403.12670.
- Ginosar, S.; Bar, A.; Kohavi, G.; Chan, C.; Owens, A.; Malik, J. Learning individual styles of conversational gesture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 3497–3506.
- Fan, Y.; Lin, Z.; Saito, J.; Wang, W.; Komura, T. FaceFormer: Speech-driven 3D facial animation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 18770–18780.
- Prajwal, K.R.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C.V. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2020; pp. 484–492.
- Ji, X.; Zhou, H.; Wang, K.; Wu, W.; Loy, C.C.; Cao, X.; Xu, F. Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 14080–14089.
- Lu, Y.; Chai, J.; Cao, X. Live speech portraits: Real-time photorealistic talking-head animation. ACM Trans. Graph. (TOG) 2021, 40, 1–17.
- Wen, X.; Wang, M.; Richardt, C.; Chen, Z.Y.; Hu, S.M. Photorealistic audio-driven video portraits. IEEE Trans. Vis. Comput. Graph. 2020, 26, 3457–3466.
- Zhang, W.; Cun, X.; Wang, X.; Zhang, Y.; Shen, X.; Guo, Y.; Shan, Y.; Wang, F. SadTalker: Learning realistic 3D motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 8652–8661.
- Sun, Z.; Lv, T.; Ye, S.; Lin, M.G.; Sheng, J.; Wen, Y.H.; Yu, M.; Liu, Y.J. DiffPoseTalk: Speech-driven stylistic 3D facial animation and head pose generation via diffusion models. arXiv 2023, arXiv:2310.00434.
- Zhang, C.; Wang, C.; Zhang, J.; Xu, H.; Song, G.; Xie, Y.; Luo, L.; Tian, Y.; Guo, X.; Feng, J. DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation. arXiv 2023, arXiv:2312.13578.
- Hu, H.; Wang, X.; Sun, J.; Fan, Y.; Guo, Y.; Jiang, C. VectorTalker: SVG Talking Face Generation with Progressive Vectorisation. arXiv 2023, arXiv:2312.11568.
- Zhao, W.; Wang, Y.; He, T.; Yin, L.; Lin, J.; Jin, X. Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape. arXiv 2023, arXiv:2310.20240.
- Wu, H.; Zhou, S.; Jia, J.; Xing, J.; Wen, Q.; Wen, X. Speech-Driven 3D Face Animation with Composite and Regional Facial Movements. In Proceedings of the 31st ACM International Conference on Multimedia; ACM: New York, NY, USA, 2023; pp. 6822–6830.
- Wang, S.; Li, L.; Ding, Y.; Fan, C.; Yu, X. Audio2Head: Audio-driven one-shot talking-head generation with natural head motion. arXiv 2021, arXiv:2107.09293.
- Song, L.; Wu, W.; Qian, C.; He, R.; Loy, C.C. Everybody’s talkin’: Let me talk as you want. IEEE Trans. Inf. Forensics Secur. 2022, 17, 585–598.
- Chen, L.; Maddox, R.K.; Duan, Z.; Xu, C. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 7832–7841.
- Filntisis, P.P.; Retsinas, G.; Paraperas-Papantoniou, F.; Katsamanis, A.; Roussos, A.; Maragos, P. Visual speech-aware perceptual 3D facial expression reconstruction from videos. arXiv 2022, arXiv:2207.11094.
- Siarohin, A.; Woodford, O.J.; Ren, J.; Chai, M.; Tulyakov, S. Motion representations for articulated animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 13653–13662.
- Deng, Y.; Yang, J.; Xu, S.; Chen, D.; Jia, Y.; Tong, X. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2019.
- Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; Sebe, N. First order motion model for image animation. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2019; Volume 32.
- Zhang, Z.; Li, L.; Ding, Y.; Fan, C. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 3661–3670.
- Rakesh, V.K.; Mazumdar, S.; Maity, R.P.; Pal, S.; Das, A.; Samanta, T. Advancements in talking head generation: A comprehensive review of techniques, metrics, and challenges. Vis. Comput. 2025, 42, 9.
- Hong, F.T.; Xu, Z.; Zhou, Z.; Zhou, J.; Li, X.; Lin, Q.; Lu, Q.; Xu, D. Audio-visual controlled video diffusion with masked selective state spaces modeling for natural talking head generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2025; pp. 12549–12558.
- Song, W.; Liu, Q.; Liu, Y.; Zhang, P.; Cao, J. Multi-level feature dynamic fusion neural radiance fields for audio-driven talking head generation. Appl. Sci. 2025, 15, 479.
- Li, Y.; Shen, X. Audio-driven single image talking face animation with transformers. Sci. Rep. 2026, 16, 3796.
Comparative overview of representative method categories in talking-head generation:

| Method Category | Key Strengths | Limitations | Typical Use Case | Realism | Temporal Consistency | Computational Cost | Data Requirements | Generalization Ability |
|---|---|---|---|---|---|---|---|---|
| Transformer-based reenactment [38] | Cross-domain flexibility, identity preservation | Sensitive to extreme poses | Cartoon-to-real transfer | Moderate | Moderate | Moderate | High (multi-domain data) | High across domains |
| 3D reconstruction-based models [30] | Geometric consistency, robustness | High computational cost | High-fidelity synthesis | High | High | High | High (3D supervision or multi-view) | Moderate |
| One-shot generation [26] | Minimal reference data | Limited pose diversity | Personalized avatars | Moderate–High | Moderate | Moderate | Low | Low–Moderate |
| GAN-based models [11] | High visual realism | Training instability | Photo-realistic avatars | High | Low–Moderate | High | High | Moderate |
| Motion-aware RNNs [39] | Natural head motion | Limited texture detail | Speech-driven animation | Moderate | High | Low–Moderate | Moderate | High |
| Attention-based models [16] | Strong audio–visual synchronization | Data-intensive training | Multimodal synthesis | High | High | High | High | Moderate–High |

Overview of prominent talking-head generation datasets and their characteristics:

| Dataset Name | Type | Data Volume | Subjects | Image Available | Obvious Head Movements | Collection Environment |
|---|---|---|---|---|---|---|
| 100STYLE | Image | 4,000,000 frames | 100 subjects | No | No | Motion capture studio |
| LSP | Image | 2000 images | – | Yes | Yes | Flickr (images) |
| VOCASET | Image | 29 min (60 fps) | 12 subjects | Yes | Yes | Standardized phonetic protocol |
| BIWI | Image | 15,000 images | 20 subjects | Yes | Yes | Automotive setup (Kinect) |
| UvA-NEMO | Video | 1240 smile videos | 400 subjects | Yes | Yes | Controlled lab environment |
| TED-Talks | Video | 3035 videos | – | Yes | Yes | TED stage recordings |
| CelebV-HQ | Video | 35,666 video clips | 15,653 subjects | Yes | Yes | YouTube interviews |
| Tai-Chi-HD | Video | 250 videos | – | Yes | Yes | Controlled environment |
| MEAD | Audio–visual | 40 h | 60 subjects | Yes | Yes | Controlled lab environment |
| GRID | Audio–visual | 27.5 h | 34 subjects | No | No | Controlled lab environment |
| LRW | Audio–visual | 173 h | 1000+ subjects | Yes | Yes | BBC (TV/interviews) |
| TSG Zero-EGGS | Audio–visual | 67 sequences | 1 subject | Yes | Yes | Controlled environment |
| Motorica Dance | Audio–visual | 6.0 h | 8 subjects | Yes | Yes | Motion capture studio |
| 3D-VTFSET | Audio–visual | 20.0 h | 300 subjects | Yes | Yes | YouTube videos |
| TCD-TIMIT | Audio–visual | 6913 sentences | 62 subjects | Yes | Yes | Controlled lab environment |
| CREMA-D | Audio–visual | 11.1 h | 91 subjects | No | No | Controlled lab environment |
| LRS/LRS3 | Audio–visual | 438 h | 5000+ subjects | Yes | Yes | TED/YouTube |
| HDTF | Audio–visual | 15.8 h | 362 subjects | Yes | Yes | High-resolution video |
| VoxCeleb2 | Audio–visual | 2400+ h | 6112 subjects | Yes | Yes | YouTube interviews |
| RAVDESS | Audio–visual | 7356 speeches/songs | 24 subjects | Yes | Yes | Controlled emotional recording |

Representative tools/algorithms and the datasets on which they are commonly trained or evaluated:

| Tool/Algorithm | Datasets |
|---|---|
| GANs | VoxCeleb1, CelebV |
| Flow2Flow | CelebV-HQ, VoxCeleb2 |
| Audio2head | VoxCeleb, GRID, LRW |
| SadTalker | VoxCeleb, HDTF |
| VividTalk | HDTF, VoxCeleb |
| CVTHead | VoxCeleb1, VoxCeleb2 |
| HeadGAN | VoxCeleb |
| ToonTalker | VoxCeleb, CelebA-HQ |

Summary of widely used evaluation metrics, what they assess, and their strengths and limitations:

| Metric | Evaluates | Strengths | Limitations |
|---|---|---|---|
| FID | Image realism | Captures distribution-level realism | Sensitive to dataset size; ignores temporal consistency |
| PSNR | Pixel similarity | Simple and interpretable | Poor correlation with perceptual quality |
| SSIM | Structural similarity | Captures luminance and structure | Favors smooth outputs |
| LPIPS | Perceptual similarity | Aligns well with human perception | Computationally expensive |
| CPBD | Sharpness | Models human blur perception | Ignores motion coherence |
| LMD | Lip-sync accuracy | Accurate spatial lip alignment | Depends on landmark detection quality |
| LLVE | Temporal lip motion | Captures motion smoothness | Sensitive to frame noise |
| MCD | Audio quality | Measures spectral similarity | Ignores prosody and emotion |
| CSIM | Identity preservation | Maintains identity consistency | Does not ensure expression realism |
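
To make a few of these metrics concrete, the sketch below computes PSNR, SSIM, landmark distance (LMD), and identity cosine similarity (CSIM) for a single frame. It is a minimal illustration assuming scikit-image and NumPy; the helper names are our own, and the random arrays stand in for real frames, mouth landmarks, and face-recognition embeddings (e.g., from ArcFace).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(reference: np.ndarray, generated: np.ndarray) -> dict:
    """PSNR and SSIM between a ground-truth and a generated uint8 RGB frame."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
    return {"psnr_db": psnr, "ssim": ssim}

def lmd(lm_ref: np.ndarray, lm_gen: np.ndarray) -> float:
    """Mean Euclidean distance between corresponding (x, y) mouth landmarks."""
    return float(np.mean(np.linalg.norm(lm_ref - lm_gen, axis=-1)))

def csim(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Cosine similarity between identity embeddings of reference and output faces."""
    return float(emb_ref @ emb_gen / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)))

# Toy usage: random data standing in for real frames, landmarks, and embeddings.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
gen = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
print(frame_metrics(ref, gen))
print(lmd(rng.standard_normal((20, 2)), rng.standard_normal((20, 2))))
print(csim(rng.standard_normal(512), rng.standard_normal(512)))
```

In practice, these frame-level scores are averaged over all frames of a generated video, which is why they must be paired with the temporal-consistency and audio-visual alignment metrics discussed above.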

Representative talking-head generation models, their driving modalities, generative mechanisms, and key trade-offs:

| Model | Year | Driving Modality | Generative Mechanism | Key Intermediate Representation | Strengths | Limitations |
|---|---|---|---|---|---|---|
| Speech-Driven 3D Face Animation with Composite and Regional Facial Movements | 2023 | Audio | Parametric + Regression | 3D mesh regions | Fine-grained regional control | Requires high-quality 3D data |
| Everybody’s Talkin’: Let Me Talk as You Want | 2022 | Audio | Neural Rendering | Latent motion codes | Flexible speaking style control | Limited explicit geometry modeling |
| A Morphable Model for the Synthesis of 3D Faces | 2023 | Audio | 3D Parametric Model | 3DMM coefficients | Strong geometric consistency | Less expressive fine details |
| FaceComposer | 2024 | Audio/Text | Unified Generative Model | Disentangled latent factors | Versatile multi-task generation | High model complexity |
| LipSync3D | 2021 | Audio | Regression-based | Normalized pose & lighting parameters | Data-efficient personalization | Limited expressiveness |
| ADNeRF | 2021 | Audio | Neural Radiance Fields | NeRF density & color fields | High photorealism | Computationally expensive |
| Audio-Driven 3D Face Animation | 2022 | Audio | Parametric + Neural | 3D facial parameters | Stable lip-sync | Limited stylistic diversity |
| StyleTalk | 2023 | Audio | GAN-based | Style and motion embeddings | One-shot style control | Sensitive to pose variation |
| CVTHead | 2024 | Audio | Transformer-based | Vertex feature embeddings | Precise geometric control | Heavy training requirements |
| MODA | 2023 | Audio | Attention-based | Dual attention maps | Efficient one-shot animation | Moderate visual realism |
| ToonTalker | 2023 | Audio | Transformer-based | Cross-domain latent features | Strong domain transfer | Cartoon-to-real gap sensitivity |
| DiffPoseTalk | 2023 | Audio | Diffusion Models | 3D pose & expression latents | Diverse and natural motion | Higher inference latency |
| VLOGGER | 2024 | Audio/Image | Diffusion Models | Spatiotemporal latent maps | Strong temporal consistency | Not real-time |
| DREAM-Talk | 2023 | Audio | Two-stage Diffusion | Emotion & lip refinement codes | Emotionally expressive output | Computational overhead |
| VividTalk | 2024 | Audio | Two-stage Hybrid | Head pose & mouth disentanglement | Accurate synchronization | Requires strong priors |
| Depth-Aware GAN | 2022 | Audio | GAN-based | 3D-aware depth features | Improved realism | Requires 3D preprocessing |
| Audio2Head | 2021 | Audio | Modular Neural Framework | Head pose & expression vectors | Natural head motion | Complex pipeline |
| InstructNeuralTalker | 2023 | Audio/Text | NeRF + Instruction Learning | Editable radiance fields | Interactive editing | Heavy training cost |
| SadTalker | 2023 | Audio | 3D Motion Learning | 3D motion coefficients | High visual quality | Identity drift over long videos |
| HeadGAN | 2021 | Audio | GAN-based | Latent identity embeddings | One-shot synthesis | Temporal inconsistency under large pose variation |
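
For the NeRF-based entries above (ADNeRF, Instruct-NeuralTalker), the "density & color fields" listed as the key intermediate representation enter the standard volume-rendering equation from the NeRF literature; audio-driven variants such as AD-NeRF additionally condition both fields on a per-frame audio feature $\mathbf{a}$:

$$
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t), \mathbf{a}\big)\,\mathbf{c}\big(\mathbf{r}(t), \mathbf{d}, \mathbf{a}\big)\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma\big(\mathbf{r}(s), \mathbf{a}\big)\,ds\right),
$$

where $\mathbf{r}(t)$ is a camera ray, $\mathbf{d}$ the viewing direction, $\sigma$ the volume density, $\mathbf{c}$ the emitted color, and $T(t)$ the accumulated transmittance. Rendered pixels are supervised against video frames, which explains both the high photorealism and the high computational cost noted in the table.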