MAVAGEN: Multimodal Avatar Generation Framework for Personalized Human–Computer Interaction
Abstract
1. Introduction
- We propose MAVAGEN, a novel multimodal avatar generation framework for synthesizing personalized upper-body digital avatars with controllable emotional expression.
- We introduce an attribute-conditioned multimodal generation pipeline that integrates desired gender and age attributes with textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, RGB-appearance features, and acoustic features within a unified diffusion-based architecture.
- A quantitative evaluation shows that MAVAGEN achieves the best overall avatar quality among the evaluated human animation methods, with the lowest FID (48.20) and FVD (592.00) of the compared systems.
- We introduce a novel EmoAcc measure that quantifies the agreement between the target emotion specified for the avatar and the emotion expressed by the generated avatar.
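
The agreement described in the last contribution can be illustrated with a minimal sketch. It assumes that EmoAcc reduces to a plain accuracy over discrete emotion labels, where the predicted label comes from some emotion recognizer applied to the generated avatar. The function name `emo_acc`, the label set, and the sample values are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
from typing import Sequence


def emo_acc(target_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Fraction of generated clips whose recognized emotion equals the target emotion.

    Assumption: one discrete target label and one recognized label per clip.
    """
    if len(target_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not target_labels:
        raise ValueError("at least one sample is required")
    # Count the clips where the recognized emotion matches the requested one.
    matches = sum(t == p for t, p in zip(target_labels, predicted_labels))
    return matches / len(target_labels)


# Hypothetical example: four generated clips, three of which express the target emotion.
targets = ["happy", "sad", "angry", "neutral"]
recognized = ["happy", "sad", "neutral", "neutral"]
print(emo_acc(targets, recognized))  # 3 of 4 labels match -> 0.75
```

Under this reading, a score of 0.88 would mean that the recognized emotion matched the target in 88% of the generated samples.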
2. Related Work
2.1. Emotion-Aware Avatar Generation and Animation
2.2. Multimodal Emotion Modeling and Affective Interaction
2.3. Personalized and Controllable Avatar Synthesis
3. Methods
3.1. LLM-Based Chatbot
3.2. Avatar Image Retrieval
3.3. Multimodal Feature Extraction
3.3.1. Landmark Features
3.3.2. Appearance Features and Depth Features
3.3.3. Emotion Features and Text Features
3.3.4. Acoustic Features
3.4. Multimodal Fusion Model
4. Experiments and Results
4.1. Research Corpus
4.2. Experimental Setup
4.3. Performance Measures
4.4. Loss Function
4.5. Results and Ablation Study
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| CEPrompt | Cross-modal Emotion-aware Prompting |
| COGMEN | COntextualized Graph Neural Network based Multimodal Emotion recognitioN |
| CSIM | Cosine Similarity of Motion Features |
| DER-GCN | Dialog and Event Relation-aware Graph Convolutional neural Network |
| E-FID | Expression FID |
| EmoAcc | Emotion-preservation Accuracy |
| FID | Fréchet Inception Distance |
| FVD | Fréchet Video Distance |
| HESP | Human Expression-Sensitive Prompting |
| HKC | Head-Keypoint Consistency |
| HKV | Head-Keypoint Variance |
| LLM | Large Language Model |
| MAVAGEN | Multimodal Avatar Generation |
| MMGCN | Multimodal Fused Graph Convolutional Network |
| MOS | Mean Opinion Score |
| OWSM-CTC | Open Whisper-style Speech Model Connectionist Temporal Classification |
| PSNR | Peak Signal-to-Noise Ratio |
| SSIM | Structural Similarity Index Measure |
| TelME | Teacher-leading Multimodal fusion network for ERC |
| TTS | Text-to-Speech |
| VAE | Variational Autoencoder |
References
- Gabriel, S.; Puri, I.; Xu, X.; Malgaroli, M.; Ghassemi, M. Can AI Relate: Testing Large Language Model Response for Mental Health Support. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Miami, FL, USA, 12–16 November 2024; pp. 2206–2221.
- Fei, H.; Zhang, H.; Wang, B.; Liao, L.; Liu, Q.; Cambria, E. EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 61–71.
- Zhang, H.; Meng, Z.; Luo, M.; Han, H.; Liao, L.; Cambria, E.; Fei, H. Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark. In Proceedings of the ACM on Web Conference (WWW), Sydney, NSW, Australia, 28 April–2 May 2025; pp. 2872–2881.
- Li, Y.; Kazemeini, A.; Mehta, Y.; Cambria, E. Multitask Learning for Emotion and Personality Traits Detection. Neurocomputing 2022, 493, 340–350.
- Wen, Z.; Cao, J.; Yang, Y.; Yang, R.; Liu, S. Affective-NLI: Towards Accurate and Interpretable Personality Recognition in Conversation. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications (PerCom), Biarritz, France, 11–15 March 2024; pp. 184–193.
- Ryumina, E.; Markitantov, M.; Ryumin, D.; Karpov, A. OCEAN-AI Framework with EmoFormer Cross-Hemiface Attention Approach for Personality Traits Assessment. Expert Syst. Appl. 2024, 239, 122441.
- Chen, Y.; Xing, X.; Lin, J.; Zheng, H.; Wang, Z.; Liu, Q.; Xu, X. SoulChat: Improving LLMs’ Empathy, Listening, and Comfort Abilities through Fine-tuning with Multi-turn Empathy Conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023; pp. 1170–1183.
- Chen, Y.; Yan, S.; Liu, S.; Li, Y.; Xiao, Y. EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 2149–2176.
- Kyung, J.; Heo, S.; Chang, J.H. Enhancing Multimodal Emotion Recognition through ASR Error Compensation and LLM Fine-Tuning. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4683–4687.
- Xie, Y.; Sun, C.; Cao, Z.; Liu, B.; Ji, Z.; Liu, Y.; Shan, L. A Dual Contrastive Learning Framework for Enhanced Multimodal Conversational Emotion Recognition. In Proceedings of the International Conference on Computational Linguistics (COLING), Abu Dhabi, United Arab Emirates, 9–14 January 2025; pp. 4055–4065.
- Cheng, Z.; Cheng, Z.Q.; He, J.Y.; Wang, K.; Lin, Y.; Lian, Z.; Peng, X.; Hauptmann, A. Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning. Adv. Neural Inf. Process. Syst. (NeurIPS) 2024, 37, 110805–110853.
- Xiao, M.; Xie, Q.; Kuang, Z.; Liu, Z.; Yang, K.; Peng, M.; Han, W.; Huang, J. HealMe: Harnessing Cognitive Reframing in Large Language Models for Psychotherapy. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 1707–1725.
- Bhattacharyya, S.; Wang, J.Z. Evaluating Vision-Language Models for Emotion Recognition. In Findings of the Association for Computational Linguistics: NAACL 2025; Chiruzzo, L., Ritter, A., Wang, L., Eds.; Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 1798–1820.
- Wang, Y.; Guo, J.; Bai, J.; Yu, R.; He, T.; Tan, X.; Sun, X.; Bian, J. Instructavatar: Text-guided emotion and motion control for avatar generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8132–8140.
- Liu, T.; Ma, Z.; Chen, Q.; Chen, F.; Fan, S.; Chen, X.; Yu, K. VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 5586–5594.
- Song, W.; Ding, Y.; Hou, F.; Li, S.; Hao, A.; Hou, X. CtrlAvatar: Controllable Avatars Generation via Disentangled Invertible Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6959–6967.
- Liu, H.; Sun, W.; Di, D.; Sun, S.; Yang, J.; Zou, C.; Bao, H. Moee: Mixture of emotion experts for audio-driven portrait animation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26222–26231.
- Wei, X.; Chen, P.; Lu, M.; Chen, H.; Tian, F. Graphavatar: Compact head avatars with gnn-generated 3d gaussians. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8295–8303.
- Cha, H.; Lee, I.; Joo, H. Perse: Personalized 3d generative avatars from a single portrait. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 15953–15962.
- Zhang, D.; Liu, Y.; Lin, L.; Zhu, Y.; Chen, K.; Qin, M.; Li, Y.; Wang, H. HRAvatar: High-Quality and Relightable Gaussian Head Avatar. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26285–26296.
- Feng, W.Q.; Han, D.; Zhou, Z.K.; Li, S.; Liu, X.; Wan, P.; Zhang, D.; Wang, M. GPAvatar: High-fidelity Head Avatars by Learning Efficient Gaussian Projections. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 250–259.
- Corona, E.; Zanfir, A.; Bazavan, E.G.; Kolotouros, N.; Alldieck, T.; Sminchisescu, C. Vlogger: Multimodal diffusion for embodied avatar synthesis. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 15896–15908.
- Qi, X.; Pan, J.; Li, P.; Yuan, R.; Chi, X.; Li, M.; Luo, W.; Xue, W.; Zhang, S.; Liu, Q.; et al. Weakly-supervised emotion transition learning for diverse 3d co-speech gesture generation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 10424–10434.
- Yariv, G.; Gat, I.; Benaim, S.; Wolf, L.; Schwartz, I.; Adi, Y. Diverse and aligned audio-to-video generation via text-to-video model adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6639–6647.
- Qin, M.; Liu, Y.; Xu, Y.; Zhao, X.; Liu, Y.; Wang, H. High-fidelity 3d head avatars reconstruction through spatially-varying expression conditioned neural radiance field. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4569–4577.
- Lin, W.; Zheng, C.; Yong, J.H.; Xu, F. Relightable and animatable neural avatars from videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 3486–3494.
- Hu, J.; Liu, Y.; Zhao, J.; Jin, Q. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2021; pp. 5666–5675.
- Joshi, A.; Bhat, A.; Jain, A.; Singh, A.; Modi, A. COGMEN: COntextualized GNN based Multimodal Emotion recognitioN. In Proceedings of the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Seattle, WA, USA, 10–15 July 2022; pp. 4148–4164.
- Li, D.; Wang, Y.; Funakoshi, K.; Okumura, M. Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023; pp. 16051–16069.
- Yun, T.; Lim, H.; Lee, J.; Song, M. TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation. In Proceedings of the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Mexico City, Mexico, 16–21 June 2024; pp. 82–95.
- Ai, W.; Shou, Y.; Meng, T.; Li, K. DER-GCN: Dialog and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialog Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4908–4921.
- Zhou, H.; Huang, S.; Zhang, F.; Xu, C. CEPrompt: Cross-Modal Emotion-Aware Prompting for Facial Expression Recognition. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11886–11899.
- Liu, Y.; Huang, Y.; Liu, S.; Zhan, Y.; Chen, Z.; Chen, Z. Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting. In Proceedings of the ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 5722–5731.
- Murzaku, J.; Rambow, O. OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs. arXiv 2025.
- Lan, X.; Xue, J.; Qi, J.; Jiang, D.; Lu, K.; Chua, T.S. ExpLLM: Towards Chain of Thought for Facial Expression Recognition. IEEE Trans. Multimed. 2025, 27, 3069–3081.
- Wu, Z.; Jiang, L.; Li, X.; Fang, C.; Qin, Y.; Li, G. Hierarchically controlled deformable 3D gaussians for talking head synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8532–8540.
- Xu, Q.; Yuan, S.; Wei, Y.; Wu, J.; Wang, L.; Wu, C. Multiple Feature Refining Network for Visual Emotion Distribution Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8924–8932.
- Xiang, J.; Gao, X.; Guo, Y.; Zhang, J. Flashavatar: High-fidelity head avatar with efficient gaussian embedding. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 1802–1812.
- Cai, H.; Xiao, Y.; Wang, X.; Li, J.; Guo, Y.; Fan, Y.; Gao, S.; Zhang, J. HERA: Hybrid Explicit Representation for Ultra-Realistic Head Avatars. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 260–270.
- Zhan, Y.; Shao, T.; Yang, Y.; Zhou, K. Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26297–26307.
- Zhou, Z.; Ma, F.; Fan, H.; Chua, T.S. Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 15941–15952.
- Zhuang, J.; Kang, D.; Bao, L.; Lin, L.; Li, G. Dagsm: Disentangled avatar generation with gs-enhanced mesh. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 292–303.
- Hu, L. Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 8153–8163.
- Meng, R.; Zhang, X.; Li, Y.; Ma, C. EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 5489–5498.
- Kapitanov, A.; Kvanchiani, K.; Nagaev, A.; Kraynov, R.; Makhliarchuk, A. HaGRID–HAnd Gesture Recognition Image Dataset. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 4–8 January 2024; pp. 4572–4581.
- Bazarevsky, V.; Kartynnik, Y.; Vakunov, A.; Raveendran, K.; Grundmann, M. BlazeFace: Sub-Millisecond Neural Face Detection on Mobile GPUs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019.
- Kartynnik, Y.; Ablavatski, A.; Grishchenko, I.; Grundmann, M. Real-Time Facial Surface Geometry from Monocular Video on Mobile GPUs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019.
- Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-Device Real-Time Hand Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020.
- Bazarevsky, V.; Grishchenko, I.; Raveendran, K.; Zhu, T.; Zhang, F.; Grundmann, M. BlazePose: On-Device Real-Time Body Pose Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020.
- Dao, T.; Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; pp. 10041–10071. Available online: https://proceedings.mlr.press/v235/dao24a.html (accessed on 31 March 2026).
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. (NeurIPS) 2014, 27, 2366–2374.
- Ke, B.; Qu, K.; Wang, T.; Metzger, N.; Huang, S.; Li, B.; Obukhov, A.; Schindler, K. Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 1–18.
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014.
- Boncelet, C. Image noise models. In The Essential Guide to Image Processing; Academic Press: Boston, MA, USA, 2009; pp. 143–167.
- Peng, Y.; Sudo, Y.; Shakeel, M.; Watanabe, S. OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 10192–10209.
- Tan, C.; Gao, Z.; Wu, L.; Xu, Y.; Xia, J.; Li, S.; Li, S.Z. Temporal attention unit: Towards efficient spatiotemporal predictive learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18770–18782.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
- Huang, Z.; Tang, F.; Zhang, Y.; Cun, X.; Cao, J.; Li, J.; Lee, T.Y. Make-your-anchor: A diffusion-based 2d avatar generation framework. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 6997–7006.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
- Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Unterthiner, T.; Van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv 2018.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Prajwal, K.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 484–492.
- Markitantov, M.; Ryumina, E.; Dvoynikova, A.; Karpov, A. Multi-Lingual Approach for Multi-Modal Emotion and Sentiment Recognition Based on Triple Fusion. Inf. Fusion 2026, 132, 104207.
- Zhang, Y.; Gu, J.; Wang, L.W.; Wang, H.; Cheng, J.; Zhu, Y.; Zou, F. MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance. In Proceedings of the International Conference on Machine Learning (ICML), Vancouver, BC, Canada, 13–19 July 2025; Volume 267, pp. 74896–74910.


| Methods | FID ↓ | FVD ↓ | SSIM ↑ | PSNR ↑ | E-FID ↓ | Sync-D ↓ | Sync-C ↑ | HKC ↑ | HKV ↑ | CSIM ↑ | EmoAcc ↑ | MOS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AnimateAnyone [43] | 60.10 | 1030.12 | 0.726 | 20.40 | 3.900 | 14.10 | 0.950 | 0.805 | 23.70 | 0.380 | – | – |
| MimicMotion [65] | 55.20 | 635.40 | 0.705 | 19.10 | 2.700 | 8.10 | 1.450 | 0.902 | 24.70 | 0.520 | – | – |
| EchoMimicV2 [44] | 50.10 | 605.30 | 0.736 | 21.90 | 2.240 | 7.10 | 7.150 | 0.921 | 25.20 | 0.555 | – | – |
| MAVAGEN (ours) | 48.20 | 592.00 | 0.741 | 21.95 | 2.250 | 6.85 | 7.40 | 0.929 | 25.30 | 0.563 | 0.88 | 6.97 ± 2.35 |

| Methods | FID ↓ | FVD ↓ | SSIM ↑ | PSNR ↑ | E-FID ↓ | Sync-D ↓ | Sync-C ↑ | HKC ↑ | HKV ↑ | CSIM ↑ | EmoAcc ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| w/o text features | 48.90 | 600.00 | 0.738 | 21.80 | 2.300 | 6.98 | 7.25 | 0.925 | 25.05 | 0.559 | 0.84 |
| w/o emotion vector | 49.90 | 612.00 | 0.734 | 21.70 | 2.330 | 7.12 | 7.02 | 0.918 | 24.75 | 0.552 | 0.80 |
| w/o landmark-based pose | 49.10 | 605.00 | 0.737 | 21.75 | 2.310 | 7.00 | 7.15 | 0.922 | 24.90 | 0.556 | 0.83 |
| w/o depth geometry | 48.60 | 597.00 | 0.740 | 21.85 | 2.280 | 6.93 | 7.33 | 0.927 | 25.15 | 0.561 | 0.86 |
| w/o acoustic features | 50.10 | 617.00 | 0.735 | 21.65 | 2.350 | 7.18 | 6.88 | 0.915 | 24.55 | 0.548 | 0.78 |
| w/o text + emotion | 50.50 | 622.00 | 0.731 | 21.55 | 2.380 | 7.24 | 6.78 | 0.910 | 24.35 | 0.545 | 0.75 |
| w/o text + landmarks | 49.90 | 614.00 | 0.734 | 21.65 | 2.340 | 7.14 | 6.93 | 0.914 | 24.50 | 0.547 | 0.79 |
| w/o text + depth | 49.40 | 610.00 | 0.736 | 21.70 | 2.320 | 7.07 | 7.03 | 0.918 | 24.65 | 0.549 | 0.81 |
| w/o text + audio | 50.80 | 627.00 | 0.730 | 21.50 | 2.400 | 7.28 | 6.68 | 0.908 | 24.25 | 0.542 | 0.73 |
| w/o emotion + landmarks | 50.30 | 620.00 | 0.732 | 21.60 | 2.370 | 7.19 | 6.83 | 0.912 | 24.40 | 0.544 | 0.77 |
| w/o emotion + depth | 50.00 | 616.00 | 0.734 | 21.65 | 2.350 | 7.13 | 6.90 | 0.914 | 24.50 | 0.546 | 0.78 |
| w/o emotion + audio | 51.30 | 630.00 | 0.728 | 21.45 | 2.420 | 7.32 | 6.58 | 0.906 | 24.15 | 0.540 | 0.70 |
| w/o landmarks + depth | 49.70 | 612.00 | 0.735 | 21.68 | 2.330 | 7.09 | 6.98 | 0.916 | 24.55 | 0.548 | 0.80 |
| w/o landmarks + audio | 51.00 | 625.00 | 0.729 | 21.50 | 2.390 | 7.26 | 6.73 | 0.909 | 24.30 | 0.543 | 0.72 |
| w/o depth + audio | 50.60 | 623.00 | 0.732 | 21.58 | 2.360 | 7.21 | 6.80 | 0.911 | 24.40 | 0.545 | 0.74 |
| straightforward baseline | 52.80 | 640.00 | 0.668 | 19.40 | 2.800 | 7.30 | 6.70 | 0.828 | 22.70 | 0.510 | – |
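
One way to read the single-modality rows of the ablation table is to compute how much EmoAcc drops when each feature stream is removed from the full model. The sketch below does that arithmetic using only the EmoAcc values reported in the two tables (0.88 for the full MAVAGEN model). The helper name `emoacc_drops` and the ranking by drop size are illustrative choices, not part of the paper's evaluation protocol.

```python
# EmoAcc of the full model and of each single-modality ablation, copied from the tables.
FULL_EMOACC = 0.88
SINGLE_ABLATIONS = {
    "text features": 0.84,
    "emotion vector": 0.80,
    "landmark-based pose": 0.83,
    "depth geometry": 0.86,
    "acoustic features": 0.78,
}


def emoacc_drops(full: float, ablated: dict[str, float]) -> list[tuple[str, float]]:
    """Return (modality, EmoAcc drop) pairs sorted from the largest drop to the smallest."""
    # Round to two decimals to match the precision of the reported values.
    drops = {name: round(full - value, 2) for name, value in ablated.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


ranking = emoacc_drops(FULL_EMOACC, SINGLE_ABLATIONS)
for name, drop in ranking:
    print(f"w/o {name}: EmoAcc drops by {drop:.2f}")
```

With the reported numbers, removing acoustic features costs the most EmoAcc (0.10) and removing depth geometry the least (0.02). This comparison is descriptive only, since the tables report no variance for these entries.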
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Axyonov, A.; Ryumina, E.; Ryumin, D.; Karpov, A. MAVAGEN: Multimodal Avatar Generation Framework for Personalized Human–Computer Interaction. Multimodal Technol. Interact. 2026, 10, 55. https://doi.org/10.3390/mti10050055

