Article

MAVAGEN: Multimodal Avatar Generation Framework for Personalized Human–Computer Interaction

St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia
*
Author to whom correspondence should be addressed.
Multimodal Technol. Interact. 2026, 10(5), 55; https://doi.org/10.3390/mti10050055
Submission received: 2 April 2026 / Revised: 11 May 2026 / Accepted: 12 May 2026 / Published: 18 May 2026

Abstract

Digital-avatar systems still provide limited control over emotionally expressive behavior in human–computer interaction, especially in Large Language Model (LLM)-based chatbots and virtual assistants with personalized visual embodiments. To address this problem, we propose Multimodal Avatar Generation (MAVAGEN), a framework for synthesizing upper-body digital avatars with personalized appearance and controllable emotional expression. The user specifies the desired gender and age and provides a short text input from which the target emotional state is inferred. MAVAGEN then retrieves an identity image from the HaGRIDv2-1M corpus and generates an avatar clip with synchronized facial expressions, hand gestures, and expressive speech. The framework uses six feature streams: textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, RGB-appearance features, and acoustic features. In a quantitative evaluation against recent human animation methods, MAVAGEN achieves the best overall avatar quality, with FID 48.20, FVD 592.00, SSIM 0.741, Sync-C 7.40, HKC 0.929, HKV 25.30, CSIM 0.563, and EmoAcc 0.88. Ablation results show that emotion and acoustic features contribute most to emotional agreement, while landmark-based pose and depth features improve geometric and motion stability. These results support the practical use of MAVAGEN in personalized LLM-based assistants and other emotion-sensitive interactive systems.

1. Introduction

Emotionally expressive digital avatars are becoming increasingly important in human–computer interaction, especially in socially sensitive domains such as mental health support, education, and assistive technologies [1,2,3], where LLM-based chatbots and virtual assistants may particularly benefit from personalized visual embodiments. In such settings, natural interaction depends not only on semantic content but also on the coordinated expression of emotion through facial movements, gestures, and speech. Effective avatar generation therefore requires multimodal modeling of emotional states while accounting for personal factors that shape how emotion is expressed and perceived [4,5,6]. These considerations create a practical need for methods that move beyond static visual identity and support both personalized appearance and controllable emotional expression.
Despite recent progress, existing multimodal affective systems still face important limitations. Many rely primarily on text [7,8] or text paired with audio [9], while underutilizing essential non-verbal cues such as facial expressions, gestures, and paralinguistic speech features. At the same time, multimodal integration efforts [2,10] often depend on large supervised corpora, which are expensive to annotate and difficult to adapt to diverse emotional [11] and therapeutic [12] settings. From the standpoint of avatar generation, a further challenge is the limited controllability of the generated behavior: many systems can model emotional content, but they still struggle to synthesize avatars that express a desired emotion in a clear and coordinated manner. Users interacting with digital characters may expect emotional nuance and responsiveness that current systems still cannot provide reliably [13].
Recent methods for avatar generation and animation offer promising building blocks, but none fully addresses this problem. InstructAvatar [14] enables text-guided control over emotional states and physical movements, while VQTalker [15], CtrlAvatar [16], and MoEE [17] improve facial animation and audio-driven expressiveness. Other methods, including GraphAvatar [18], PERSE [19], HRAvatar [20], GPAvatar [21], and VLOGGER [22], focus on high-quality avatar representation and synthesis. Nevertheless, existing approaches often prioritize only part of the overall generation problem: some emphasize facial motion, others audio-driven animation, and still others geometry or rendering quality. As a result, avatar synthesis that coordinates facial expressions, gestures, and speech remains insufficiently explored, especially when the avatar must be conditioned on both the desired demographic attributes and a target emotion.
To address these challenges, we propose the MAVAGEN framework for personalized upper-body avatar synthesis with controllable emotional expression in human–computer interaction scenarios involving LLM-based chatbots and virtual assistants. The user specifies the desired avatar attributes, specifically gender and age, and provides a short text input from which the target emotional state is inferred. Based on these inputs, MAVAGEN retrieves a visually appropriate identity image from the HaGRIDv2-1M corpus and generates an avatar clip with synchronized facial expressions, hand gestures, and expressive speech. The framework uses six feature streams: textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, RGB-appearance features, and acoustic features. In this way, MAVAGEN combines personalized appearance with controllable emotional expression within a unified diffusion-based pipeline. Our main contributions are as follows:
  • We propose MAVAGEN, a novel multimodal avatar generation framework for synthesizing personalized upper-body digital avatars with controllable emotional expression.
  • We introduce an attribute-conditioned multimodal generation pipeline that integrates desired gender and age attributes with textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, RGB-appearance features, and acoustic features within a unified diffusion-based architecture.
  • A quantitative evaluation shows that MAVAGEN achieves the best overall avatar quality among the evaluated human animation methods.
  • We introduce a novel EmoAcc measure that quantifies the agreement between the target emotion specified for the avatar and the emotion expressed by the generated avatar.

2. Related Work

Recent work on avatar generation relevant to our setting spans three closely connected directions: multimodal emotion modeling, emotion-aware avatar animation, and personalized, controllable avatar synthesis. Collectively, these lines of research address how emotion can be inferred, represented, and rendered through coordinated visual and acoustic behavior. In this section, we review representative advances in each direction and highlight the remaining gap between emotion understanding and the generation of personalized, expressive avatars.

2.1. Emotion-Aware Avatar Generation and Animation

A growing body of work addresses the generation of expressive avatars directly. InstructAvatar [14] enables text-guided control over both emotional states and body motion, while VQTalker [15] introduces facial motion tokenization for multilingual talking avatars. MoEE [17] uses a mixture of emotion experts for audio-driven portrait animation, VLOGGER [22] extends multimodal diffusion to full-body avatar synthesis, and weakly supervised emotion-transition learning has also been explored for 3D co-speech gesture generation [23]. Related audio-to-video work further highlights the importance of aligning motion and speech during generation [24]. Other methods emphasize avatar representation quality and rendering fidelity: GraphAvatar [18] uses graph neural networks for lightweight 3D Gaussian avatars, whereas SVE-NeRF [25] and relightable neural avatars [26] focus on high-fidelity, expression-conditioned head reconstruction and rendering. These methods provide important building blocks for expressive avatar synthesis, but they often emphasize facial animation, audio-driven motion, or rendering quality in isolation, rather than jointly modeling coordinated affect across the face, hands, upper body, and speech.

2.2. Multimodal Emotion Modeling and Affective Interaction

Multimodal affect modeling has advanced significantly in recent years, particularly in systems that integrate textual, acoustic, and visual cues to better capture emotional dynamics. Representative examples include Multimodal Fused Graph Convolutional Network (MMGCN) [27], which uses graph convolutional networks to fuse multimodal conversational signals, and COntextualized Graph Neural Network based Multimodal Emotion recognitioN (COGMEN) [28], which models contextual dependencies more explicitly. Joyful [29] combines joint modality fusion with graph contrastive learning, while Teacher-leading Multimodal fusion network for ERC (TelME) [30] transfers affective cues from text to non-verbal modalities through cross-modal knowledge distillation. Dynamic affect modeling has been further improved by Dialog and Event Relation-aware Graph Convolutional neural Network (DER-GCN) [31]. In parallel, more specialized vision-language approaches such as Cross-modal Emotion-aware Prompting (CEPrompt) [32] and Human Expression-Sensitive Prompting (HESP) [33] improve sensitivity to fine-grained visual expression cues, while OmniVox [34] and ExpLLM [35] explore broader multimodal and reasoning-based affect understanding. Despite these advances, most such methods focus on emotion recognition or interpretation rather than synthesizing emotionally expressive avatar behavior, and they rarely provide unified control over facial motion, gestures, and speech.

2.3. Personalized and Controllable Avatar Synthesis

Recent work has also emphasized personalization and explicit controllability. CtrlAvatar [16] uses disentangled invertible networks for controllable emotional animation, while Hierarchically Controlled Deformable Gaussians [36] enable fine-grained expression synthesis in talking heads. The Multiple Feature Refining Network (MFRN) [37] improves emotion distribution learning through iterative feature refinement. PERSE [19] generates personalized 3D avatars from a single image, and HRAvatar [20] alongside GPAvatar [21] advance high-quality Gaussian-based avatar representation. Closely related work on photorealistic and efficient avatar rendering includes FlashAvatar [38], HERA [39], and real-time Gaussian human avatars [40]. At the same time, Zero-1-to-A [41] and DAGSM [42] explore zero-shot animatable avatars and disentangled generative modeling with geometric guidance. Although these methods move toward more personalized and controllable synthesis, they usually address only part of the overall problem: identity personalization, emotional control, geometric fidelity, or motion transfer. A unified framework that combines attribute-conditioned avatar appearance with explicit emotion conditioning and synchronized multimodal behavior remains comparatively underexplored.

3. Methods

The proposed MAVAGEN framework consists of the following two main components: an LLM-based chatbot that interacts with the user and an avatar generation module that synthesizes a personalized digital avatar from the collected inputs. In this section, we describe the assistant interface, avatar image retrieval, multimodal feature extraction, and multimodal fusion model.
The pipeline of MAVAGEN is shown in Figure 1. An assistant interface first requests the desired avatar attributes, specifically gender and age. The user then provides a short free-form message from which a pre-trained emotion recognition model estimates a probability distribution over categorical emotions. The same message is forwarded to the LLM-based chatbot, which generates a response that is subsequently expressed by the avatar. In parallel, a corresponding identity image is retrieved from a large corpus and processed using monocular depth estimation and landmark extraction to extract facial, hand, and upper-body keypoints. The chatbot response is converted into expressive speech using an emotion-conditioned Text-to-Speech (TTS) model driven by the predicted emotion distribution. Representations derived from the chatbot response text, the emotion-distribution vector, image, depth map, landmarks, and generated speech are processed by modality-specific encoders. The image and depth latents are combined with stochastic noise and processed by the ReferenceNet [43] model, after which the aligned multimodal embeddings are passed to the Denoising U-Net [44] model to synthesize an avatar clip with synchronized facial expressions, hand gestures, and expressive speech. The resulting clip is returned to the user as the final output. Overall, the pipeline provides a controllable way to generate avatar clips from multimodal inputs.

3.1. LLM-Based Chatbot

The interaction begins with an assistant interface, which in our implementation is realized as an LLM-based assistant (Qwen2.5-7B-Instruct (https://huggingface.co/Qwen/Qwen2.5-7B-Instruct, accessed on 31 March 2026)). The assistant requests the avatar attributes used to condition the avatar’s appearance in subsequent stages. Specifically, it first asks the user to specify gender and age. The user then responds in a standardized format, providing the attribute specification $a = (G, Y)$, e.g., “Generate a Y-year-old [G] avatar”, where $G$ denotes the desired gender and $Y$ denotes the desired age in years. After that, the assistant continues the dialogue with a follow-up prompt such as “How can I help you today?”
The user’s free-form message is forwarded to the text emotion recognition block, where it is used to estimate the target emotional state that conditions avatar generation. The same message is also provided to the LLM-based assistant, which generates a textual response for subsequent avatar generation stages. The collected gender and age attributes are forwarded to the avatar image retrieval block, where they guide the retrieval of a visually appropriate identity image from the HaGRIDv2-1M [45] (https://github.com/hukenovs/hagrid/tree/Hagrid_v2-1M, accessed on 31 March 2026) corpus by matching the requested gender and the exact requested age whenever available, or, otherwise, the nearest available age. This setup aligns the synthesized avatar with the requested attributes while enabling emotion-conditioned generation.

3.2. Avatar Image Retrieval

Based on the requested avatar attributes $a = (G, Y)$, the avatar image retrieval block selects an identity image $I \in \mathcal{I}$ from the HaGRIDv2-1M [45] corpus, where $\mathcal{I}$ denotes the full image pool, $G$ denotes the desired gender, and $Y$ denotes the desired age in years. This corpus was selected for this task because it provides a large and diverse pool of identity images together with the metadata required for attribute-conditioned retrieval.
To retrieve a visually appropriate identity image, we first restrict the candidate pool to images with the requested gender as follows:
\[ \mathcal{I}_G = \{\, I \in \mathcal{I} \mid g(I) = G \,\}, \]
where $g(I)$ denotes the gender metadata associated with image $I$.
Within this gender-matched subset, we search for the closest available age to the requested age $Y$. Let $y(I)$ denote the age metadata associated with image $I$. We define the minimum age difference as
\[ \delta = \min_{I \in \mathcal{I}_G} \lvert y(I) - Y \rvert . \]
If an image with the exact requested age is available, then $\delta = 0$; otherwise, the retrieval procedure falls back to the closest available age within the same gender subset.
We then define the set of best-matching identity-image candidates associated with the requested attributes $a$ as
\[ \mathcal{I}(a) = \{\, I \in \mathcal{I}_G \mid \lvert y(I) - Y \rvert = \delta \,\}. \]
Finally, the retrieved identity image is sampled uniformly from this set as follows:
\[ I \sim \mathrm{Unif}\bigl(\mathcal{I}(a)\bigr). \]
This retrieval procedure ensures that the selected identity image matches the requested gender and corresponds to either the exact requested age or, when an exact match is unavailable, the nearest available age.
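The retrieval steps above (gender filter, minimum age difference, uniform sampling among ties) can be sketched in a few lines of Python. This is only an illustrative sketch: the `IdentityImage` record and its field names are hypothetical and do not reflect the actual HaGRIDv2-1M annotation schema.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityImage:
    # Hypothetical record type; the real HaGRIDv2-1M annotation schema may differ.
    path: str
    gender: str
    age: int


def retrieve_identity_image(pool, gender, age, rng=random):
    """Return one image matching `gender` whose age is closest to `age`."""
    # Step 1: restrict the pool to the requested gender.
    gender_matched = [img for img in pool if img.gender == gender]
    if not gender_matched:
        raise ValueError(f"no image with gender {gender!r} in the pool")
    # Step 2: minimum absolute age difference (0 if an exact match exists).
    delta = min(abs(img.age - age) for img in gender_matched)
    # Step 3: all candidates that attain this minimum.
    candidates = [img for img in gender_matched if abs(img.age - age) == delta]
    # Step 4: uniform sampling among the best-matching candidates.
    return rng.choice(candidates)


pool = [
    IdentityImage("a.jpg", "female", 25),
    IdentityImage("b.jpg", "female", 31),
    IdentityImage("c.jpg", "male", 30),
    IdentityImage("d.jpg", "female", 29),
]
picked = retrieve_identity_image(pool, "female", 30, rng=random.Random(0))
```

In this toy pool no female image has age 30, so the two images at distance 1 (ages 29 and 31) form the candidate set and one of them is drawn at random.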

3.3. Multimodal Feature Extraction

3.3.1. Landmark Features

The goal of this stage is to obtain a compact geometric description of the avatar’s face, hands, and upper body that can be used by the trainable Landmark Encoder and, later, by the diffusion model. As illustrated by the “Finding landmarks” block in Figure 1, we apply a frozen holistic landmark detector directly to the retrieved identity image $I$.
Holistic landmark detection. For landmark extraction, we use the MediaPipe Holistic Landmarker (https://ai.google.dev/edge/mediapipe/solutions/vision/holistic_landmarker, accessed on 31 March 2026). Given the preprocessed RGB image $I$, the detector predicts 2D keypoints for the face [46,47], both hands [48], and the upper body [49]. We denote the resulting landmark configuration by
\[ L \in \mathbb{R}^{N_L \times 2}, \]
where $N_L$ is the total number of detected keypoints and each row of $L$ contains the $(x, y)$ coordinates of one keypoint. All coordinates are returned in image pixels and are converted to the $[0, 1]$ range by dividing them by the image width and height. The detector also provides a visibility flag for each keypoint, which is stored as a binary mask
\[ m_L \in \{0, 1\}^{N_L}. \]
Normalization and vectorization. To reduce sensitivity to global translation and scale, the landmark configuration is re-centered at a torso reference joint and normalized by the distance between the left and right shoulders. Let $\tilde{L} \in \mathbb{R}^{N_L \times 2}$ denote the resulting normalized landmark coordinates. The normalized coordinates and the visibility mask are then concatenated and flattened into a fixed-length vector as follows:
\[ \tilde{l} = \mathrm{vec}\bigl(\tilde{L}, m_L\bigr) \in \mathbb{R}^{d_L^{\mathrm{in}}}, \]
where $\mathrm{vec}(\cdot)$ denotes concatenation followed by vectorization, and $d_L^{\mathrm{in}}$ is the dimensionality of the resulting landmark input vector.
This vector serves as the input to the trainable Landmark Encoder $E_L$. In our implementation, $E_L$ is realized as a Mamba2-based model [50], which produces the landmark-based pose feature representation
\[ z_L = E_L(\tilde{l}) \in \mathbb{R}^{d_L}, \]
where $d_L$ is the dimensionality of the landmark-based pose representation.
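As an illustration of the normalization and vectorization step, the following NumPy sketch builds the landmark input vector from pixel keypoints and a visibility mask. Several details are assumptions made only for this example: the shoulder indices are placeholders, the torso reference joint is taken to be the midpoint between the shoulders, and the mask is appended after the flattened coordinates, which gives a vector of length three times the number of keypoints.

```python
import numpy as np

# Hypothetical keypoint indices; the real layout depends on the detector output.
LEFT_SHOULDER, RIGHT_SHOULDER = 0, 1


def landmark_input_vector(pixels, visible, width, height):
    """Build the flattened landmark vector from pixel keypoints and a visibility mask."""
    pts = np.asarray(pixels, dtype=np.float64)
    # Convert pixel coordinates to the [0, 1] range.
    pts = pts / np.array([width, height], dtype=np.float64)
    # Assumption: the torso reference joint is the midpoint between the shoulders.
    torso = 0.5 * (pts[LEFT_SHOULDER] + pts[RIGHT_SHOULDER])
    # Scale by the shoulder distance (guarded against degenerate detections).
    scale = np.linalg.norm(pts[LEFT_SHOULDER] - pts[RIGHT_SHOULDER])
    normalized = (pts - torso) / max(scale, 1e-8)
    mask = np.asarray(visible, dtype=np.float64)
    # Concatenate flattened coordinates and mask into one fixed-length vector.
    return np.concatenate([normalized.reshape(-1), mask])


# Toy example with three keypoints in a 400 x 400 image.
vec = landmark_input_vector(
    pixels=[[100, 200], [300, 200], [200, 100]],
    visible=[1, 1, 0],
    width=400,
    height=400,
)
```

After normalization the two shoulders sit at unit distance from each other and symmetric about the origin, so the encoder input no longer depends on where the subject stands in the frame or how large they appear.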

3.3.2. Appearance Features and Depth Features

In addition to 2D landmarks, the avatar generator relies on the following two complementary visual signals extracted from the retrieved identity image $I$: an RGB image and a depth map [51]. Both signals are processed by frozen modules and correspond to the “Image encoder”, “Image-to-depth”, and “Depth encoder” blocks in Figure 1.
To obtain a depth map from a single RGB image, we use the pre-trained Marigold image-to-depth [52] (https://github.com/prs-eth/marigold, accessed on 31 March 2026) model. Given the retrieved identity image $I$, the model predicts the corresponding depth map
\[ D = M_{\mathrm{depth}}(I), \]
where $M_{\mathrm{depth}}$ denotes the frozen image-to-depth model.
The RGB image I and the corresponding depth map D are then encoded using a pre-trained Variational Autoencoder (VAE) [53] to obtain compact latent visual representations. Both inputs are standardized using the official VAE preprocessing pipeline, which includes resizing, cropping, and normalization.
During training, Gaussian noise [54] is added to the latent visual representations to regularize the conditioning process. The resulting RGB and depth latents are then passed to the trainable ReferenceNet [43] backbone. ReferenceNet acts as an intermediate visual-conditioning module between the static avatar reference and the generative diffusion backbone: it transforms the latent appearance and geometry cues extracted from the selected image into spatial feature maps, denoted by $z_I$ and $z_D$, that preserve identity-specific texture, coarse structural layout, and view-consistent geometry. These feature maps are subsequently injected into the Denoising U-Net as visual conditioning signals, allowing the diffusion model to preserve the reference appearance throughout generation rather than relying solely on compact global embeddings.
The appearance feature maps $z_I$ and depth feature maps $z_D$ provide the avatar generator with complementary stylistic and geometric information: $z_D$ stabilizes head pose and camera geometry, while $z_I$ maintains visual identity consistency across the generated video.

3.3.3. Emotion Features and Text Features

The user input (see Figure 1) is used to estimate the target emotional state that conditions avatar generation, while the response generated by the LLM-based chatbot provides the semantic content to be expressed by the avatar. In this subsection, we describe how these two texts are processed to obtain an emotion probability vector and a contextual text embedding.
Emotion-aware text encoder. Let U denote the user input, and let R denote the textual response generated by the LLM-based chatbot. The text-processing stage uses a pre-trained RoBERTa-based model (https://huggingface.co/michellejieli/emotion_text_classifier, accessed on 31 March 2026) to process these two texts and produce two outputs as follows:
\[ z_E \in \mathbb{R}^{K}, \qquad \sum_{k=1}^{K} z_{E,k} = 1, \]
where $z_E$ is the emotion probability vector over $K$ basic emotions obtained from $U$ via the classifier head and a softmax layer, and
\[ z_T \in \mathbb{R}^{d_T}, \]
where $z_T$ is a contextual text embedding computed from $R$ as the pooled representation of the final Transformer layer, and $d_T$ is the dimensionality of the text embedding. These outputs correspond to the “Text emotion recognition” and “Text encoder” blocks in Figure 1. The model parameters are kept frozen and are not updated during training or evaluation.
Feature usage. The emotion vector $z_E$ derived from the user input $U$ is forwarded to two components of the pipeline. First, it conditions the emotion-conditioned TTS block, shaping the prosody of the waveform generated from the chatbot response (Section 3.3.4). Second, it is passed directly to the Denoising U-Net [44] as part of the overall conditioning signal (Section 3.4). The embedding $z_T$ derived from the chatbot response $R$ forms the textual feature stream in the multimodal representation and encodes the lexical content and discourse-level semantics of the text to be spoken by the avatar, influencing the communicative intent of the generated avatar behavior.
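To make the emotion probability vector concrete, the sketch below applies a softmax to a set of classifier logits and reads off the dominant class. The logits are made-up numbers and the label list is a hypothetical seven-class set used only for illustration; the labels and ordering of the actual classifier may differ.

```python
import numpy as np

# Hypothetical label set for illustration; the classifier's real labels may differ.
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def emotion_distribution(logits):
    """Softmax over classifier logits, giving a probability vector over emotions."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Made-up logits for one user message.
z_e = emotion_distribution([0.2, -1.0, 0.1, 2.5, 0.3, -0.4, 0.9])
dominant = EMOTIONS[int(np.argmax(z_e))]
```

The full vector, rather than only the dominant label, is what conditions the TTS block and the Denoising U-Net, which is how mixtures of emotions can be represented.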

3.3.4. Acoustic Features

The acoustic module of the pipeline converts the chatbot response and the emotion probability vector inferred from the user input into expressive speech and then extracts a compact acoustic representation. This module corresponds to the “Text-to-speech with emotion” and “Acoustic encoder” blocks in Figure 1 and consists entirely of frozen components.
Emotion-conditioned text-to-speech. As described in Section 3.3.3, emotion recognition applied to the user input $U$ produces an emotion probability vector $z_E \in \mathbb{R}^{K}$ over $K$ basic emotions. This vector is used to control an emotion-conditioned DIA-TTS model (https://huggingface.co/nari-labs/Dia-1.6B, accessed on 31 March 2026). The model receives the triplet $(R, z_E, a)$, where $a$ is the avatar attribute specification introduced above, and it generates a waveform $s(t) \in \mathbb{R}$ sampled at a fixed rate of 16 kHz. The emotion probability vector $z_E$ modulates prosodic characteristics such as intensity, speaking rate, and pitch range, allowing the synthesized speech to reflect mixtures of basic emotions rather than a single discrete label.
Acoustic feature extraction. To obtain a compact representation of the generated speech, the waveform $s(t)$ is processed by a frozen acoustic encoder inspired by Open Whisper-style Speech Model Connectionist Temporal Classification (OWSM-CTC) [55]. First, $s(t)$ is converted into a log-mel spectrogram, which is then passed through a stack of convolutional and Transformer layers to produce a sequence of frame-level embeddings. These embeddings are average-pooled over time, yielding the acoustic feature representation
\[ z_A \in \mathbb{R}^{d_A}, \]
where $d_A$ is the dimensionality of the acoustic feature representation. The vector $z_A$ captures articulatory dynamics, rhythm, and prosodic patterns that are important for lip motion, jaw movements, and subtle head and body synchrony with the audio. It forms the acoustic feature stream in the proposed multimodal fusion model and is fed directly to the Denoising U-Net [44] as part of the conditioning signal.
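The final pooling step can be illustrated with a minimal NumPy sketch that averages frame-level embeddings over time. The array sizes are toy values chosen for readability, and the encoder that would produce the frame-level embeddings is not modeled here.

```python
import numpy as np


def pool_acoustic_embeddings(frames):
    """Average frame-level embeddings of shape (T, d_A) over time."""
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (T, d_A) array")
    return frames.mean(axis=0)


# Toy sizes: T = 3 frames, d_A = 2 dimensions.
frame_embeddings = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
z_a = pool_acoustic_embeddings(frame_embeddings)
```

Because the pooled vector has a fixed size regardless of clip duration, it can be combined with the other vector-valued conditioning streams without padding or truncation.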

3.4. Multimodal Fusion Model

The multimodal fusion stage is illustrated in the lower part of Figure 1. At this stage, all feature streams extracted in the previous subsections are used to condition a trainable Denoising U-Net [44] with temporal attention [56], which operates in the latent space of a VAE. During training, this module learns to generate temporally coherent avatar videos from multimodal conditioning signals in a standard diffusion noise-prediction setting; during inference, iterative denoising produces the final avatar clip. The resulting latent video is then decoded to obtain the final “Digital avatar” clip.
The frozen feature extractors described above, together with the trainable modules, jointly form the multimodal fusion model of the avatar generator. In MAVAGEN, the conditioning signal is formed from the following six feature streams extracted from the multimodal inputs introduced above: (i) textual features $z_T \in \mathbb{R}^{d_T}$, encoding semantic content; (ii) emotion-distribution features $z_E \in \mathbb{R}^{K}$, represented by an emotion probability vector and encoding the target emotion; (iii) landmark-based pose features $z_L \in \mathbb{R}^{d_L}$; (iv) acoustic features $z_A \in \mathbb{R}^{d_A}$, encoding prosodic information; (v) RGB-appearance feature maps $z_I$; and (vi) depth-geometry feature maps $z_D$. The textual, emotion, landmark-based pose, and acoustic streams are represented as compact vectors, whereas $z_I$ and $z_D$ are spatial feature maps produced by the visual-conditioning branch.
Denoising U-Net with temporal attention. The fusion backbone, depicted as “Denoising U-Net” in Figure 1, is a U-Net-style [57] network that processes $x_t$ as a sequence of latent feature maps along the temporal dimension, where $x_t$ denotes the noisy latent representation of the avatar video at diffusion step $t$. Convolutional blocks operate on each frame spatially, while dedicated “Temporal Attention” blocks (gray modules between the colored bars in the figure) perform self-attention [58] across frames to enforce smooth and coherent motion. A diffusion time embedding is injected into the network so that its behavior adapts to the current diffusion noise level.
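The idea of self-attention across frames can be sketched as follows: each spatial position attends only over its own values at different time steps. This is a single-head NumPy sketch under stated assumptions (random projection weights, a residual connection, toy tensor sizes); it is not the block used in the actual model.

```python
import numpy as np


def temporal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention across frames, applied independently per position.

    x has shape (F, P, C): F frames, P spatial positions, C channels.
    """
    # Group by spatial position so that attention runs along the frame axis.
    seq = np.transpose(x, (1, 0, 2))              # (P, F, C)
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v     # each (P, F, C)
    scores = q @ np.transpose(k, (0, 2, 1)) / np.sqrt(q.shape[-1])  # (P, F, F)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v                             # (P, F, C)
    # Assumption: a residual connection, restored to the (F, P, C) layout.
    return x + np.transpose(out, (1, 0, 2)), weights


rng = np.random.default_rng(0)
num_frames, num_positions, num_channels = 4, 6, 8  # toy sizes
x = rng.standard_normal((num_frames, num_positions, num_channels))
w_q, w_k, w_v = (
    0.1 * rng.standard_normal((num_channels, num_channels)) for _ in range(3)
)
y, attn = temporal_self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over frames, so every output frame is a weighted mixture of information from all frames at the same spatial location, which is what encourages temporally smooth motion.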
Multimodal conditioning of the diffusion model. In the full MAVAGEN configuration, the Denoising U-Net $D_\theta$ receives the noisy latent video $x_t$, the diffusion step $t$, the vector-valued conditioning signals $z_T$, $z_E$, $z_L$, and $z_A$, and the spatial visual-conditioning feature maps $z_I$ and $z_D$. The network predicts the diffusion noise as
\[ \hat{\epsilon}_t = D_\theta\bigl(x_t, t, z_T, z_E, z_L, z_A, z_I, z_D\bigr). \]
The vector-valued conditioning signals are projected to a common conditioning space and injected through cross-attention and feature-wise modulation, whereas the appearance and depth feature maps are injected into multiple layers of the U-Net as spatial visual conditioning signals. In this way, the latent video is jointly conditioned on textual content, target emotion, landmark-based pose, visual appearance, depth geometry, and prosody, resulting in avatar clips that are semantically appropriate, emotionally aligned, and visually coherent.
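One common way to realize feature-wise modulation is to predict a per-channel scale and shift from the conditioning vector. The sketch below projects four vector-valued streams to a shared width, sums them, and uses the result to modulate a feature map. All dimensions, the summation of the projected streams, and the `1 + gamma` parameterization are assumptions made for this illustration, not details confirmed by the description above; the cross-attention path is not shown.

```python
import numpy as np


def project_and_sum(streams, projections):
    """Project each conditioning vector to a shared width and sum the results."""
    return sum(z @ w for z, w in zip(streams, projections))


def film(features, cond, w_scale, w_shift):
    """Feature-wise modulation of a (C, H, W) feature map by a conditioning vector."""
    gamma = cond @ w_scale  # per-channel scale offset, shape (C,)
    beta = cond @ w_shift   # per-channel shift, shape (C,)
    return features * (1.0 + gamma)[:, None, None] + beta[:, None, None]


rng = np.random.default_rng(0)
# Toy sizes (assumptions): four stream widths, shared width 8, 3 channels.
dims, shared, channels = [6, 7, 5, 4], 8, 3
streams = [rng.standard_normal(d) for d in dims]  # stand-ins for the four vectors
projections = [0.1 * rng.standard_normal((d, shared)) for d in dims]
cond = project_and_sum(streams, projections)
features = rng.standard_normal((channels, 4, 4))
modulated = film(
    features,
    cond,
    0.1 * rng.standard_normal((shared, channels)),
    0.1 * rng.standard_normal((shared, channels)),
)
```

With zero modulation weights the layer reduces to the identity, which is a convenient property when conditioning is added to a pre-trained backbone.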
Decoding into a digital avatar. After $S$ denoising steps, the final latent sequence $\hat{x}_0$ represents a clean, temporally coherent avatar video in the VAE latent space. The frozen VAE decoder [44], depicted as “Variational Autoencoder Decoder” in Figure 1, maps this sequence back to RGB frames, producing the final “Digital avatar” clip. This clip preserves the requested visual appearance, exhibits synchronized facial expressions, hand gestures, and expressive speech, and constitutes the final generated avatar clip returned by the system.

4. Experiments and Results

4.1. Research Corpus

All experiments are conducted on the HaGRIDv2-1M hand gesture corpus [45], a large-scale extension of the HaGRID corpus containing 1,086,158 FullHD RGB images of 65,977 adult subjects performing 33 gesture classes (including many emotional gestures) and a no_gesture class. The data are collected predominantly indoors under diverse backgrounds, lighting conditions, and camera-to-subject distances, which lead to substantial variation in appearance and hand poses. In our study, HaGRIDv2-1M serves solely as the source of avatar images. For each avatar request, an identity image is retrieved from the training split using the attribute-conditioned retrieval procedure described in Section 3.2 and used as a static identity reference. This reference is then processed by the landmark, depth, and image encoders to condition the generation of the digital avatar clip. These retrieved identity images are further used to construct a synthetic set of avatar clips for training and quantitative evaluation of the avatar generator.

4.2. Experimental Setup

All avatar-generator models are trained on a synthetic set of avatar clips built from static identity images retrieved from the HaGRIDv2-1M training split at a spatial resolution of 512 × 512 pixels and a frame rate of 25 fps. We use the AdamW optimizer [59] with weight decay $0.01$, an initial learning rate of $1 \times 10^{-4}$, a cosine annealing schedule [60], and a linear warm-up during the first 5% of training steps. The Denoising U-Net is trained with a standard noise-prediction loss in the VAE latent space, and stochastic weight averaging is applied at regular intervals. At inference time, all models use the same deterministic sampler with $S = 30$ diffusion steps. Inference was performed on a workstation equipped with an NVIDIA RTX 4090 GPU and an AMD Ryzen 9 5950X CPU. Generating one 5 s avatar clip at 25 fps, corresponding to 125 output frames, takes approximately 500 s. This corresponds to an effective inference speed of about 0.25 frames per second, while 25 fps denotes the playback frame rate of the generated video rather than real-time computational throughput. The resulting avatar clips are cached and used for all quantitative evaluations of the avatar generator.
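The learning-rate schedule described above (linear warm-up over the first 5% of steps, then cosine annealing) can be written as a small pure-Python function. The total number of training steps and the final learning rate are not stated in the text, so `total_steps` and `min_lr` below are placeholders chosen for illustration.

```python
import math


def learning_rate(step, total_steps, base_lr=1e-4, warmup_frac=0.05, min_lr=0.0):
    """Linear warm-up followed by cosine annealing (min_lr is an assumption)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # placeholder value; the real training length is not reported here
schedule = [learning_rate(s, total_steps) for s in range(total_steps + 1)]
```

With these placeholder settings the rate peaks at the end of warm-up (step 49), falls to half the base value midway through the cosine phase, and reaches zero at the final step.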

4.3. Performance Measures

To evaluate the quality of the generated avatars, we compute the standard image- and video-generation metrics. Overall frame realism is measured with Fréchet Inception Distance (FID) [61] between real and generated frames, while temporal coherence is assessed with Fréchet Video Distance (FVD) [61], computed in the feature space of a pre-trained action-recognition network. Frame-wise structural fidelity between real and generated video frames is quantified with Structural Similarity Index Measure (SSIM) [62] and Peak Signal-to-Noise Ratio (PSNR). Following EchoMimicV2 [44], we additionally report Expression FID (E-FID), which measures the distance between real and generated videos in a facial-expression embedding space, an audio-visual synchrony distance (Sync-D), and the following three motion-consistency metrics: Head-Keypoint Consistency (HKC), Head-Keypoint Variance (HKV), and Cosine Similarity of Motion Features (CSIM). Since the primary role of the avatar is to convey emotion and synchronize with speech, we also report an audio-visual synchrony score (Sync-C) [63], obtained from a SyncNet-style model, as well as EmoAcc as an auxiliary task-specific measure of emotional agreement. To complement the automatic metrics with human perception, we additionally conducted a subjective Mean Opinion Score (MOS) evaluation. Generated videos from the compared methods were presented to 21 independent participants without method labels, and the order of videos was randomized separately for each participant. Each video was rated on a 10-point scale according to overall generation quality, including speech synchronization with facial and body movements, consistency with the reference image, naturalness of motion, and visible visual artifacts. The final MOS was reported as mean ± standard deviation across participants and evaluated clips for each method.
For EmoAcc, the target emotion is defined as the dominant class obtained by applying $\arg\max$ to the emotion probability vector $z_E$. A multimodal emotion classifier (MASAI [64]) is then applied to each generated clip together with its corresponding textual response. For clip $i$, this classifier predicts an emotion label $\hat{y}_i$ by jointly modeling facial, acoustic, and linguistic features. Formally, the target label is defined as
$$y_i = \arg\max_{1 \le k \le K} z_{E,k}^{(i)},$$
and the binary correctness variable is defined as
$$c_i = \begin{cases} 1, & \text{if } \hat{y}_i = y_i, \\ 0, & \text{otherwise}. \end{cases}$$
The final metric is computed as
$$\mathrm{EmoAcc} = \frac{1}{N} \sum_{i=1}^{N} c_i,$$
where $z_{E,k}^{(i)}$ denotes the $k$-th component of the emotion vector for clip $i$, and $N$ is the number of evaluated clips.
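The EmoAcc definition above can be sketched in a few lines. This is a minimal illustration: the function names are hypothetical, class indices are 0-based here rather than 1-based, ties in the arg max are broken in favor of the first class (an assumption), and the predicted labels stand in for the output of the emotion classifier.

```python
from typing import Sequence


def target_label(emotion_probs: Sequence[float]) -> int:
    """Index of the dominant class, i.e., the arg max of the emotion vector."""
    return max(range(len(emotion_probs)), key=lambda k: emotion_probs[k])


def emo_acc(emotion_vectors: Sequence[Sequence[float]],
            predicted_labels: Sequence[int]) -> float:
    """Fraction of clips whose predicted label equals the target label."""
    if len(emotion_vectors) != len(predicted_labels) or len(emotion_vectors) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # c_i is 1 when the classifier's label matches the arg-max target, else 0.
    correct = sum(1 for z, y_hat in zip(emotion_vectors, predicted_labels)
                  if target_label(z) == y_hat)
    return correct / len(emotion_vectors)
```

For example, with four clips whose target classes are 0, 1, 2, and 0 and predicted labels 0, 1, 0, and 0, three of the four predictions match and the function returns 0.75.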

4.4. Loss Function

The avatar generator is trained in the VAE latent space using a standard diffusion-style noise-prediction objective. At each diffusion step $t$, the Denoising U-Net $D_\theta$ receives the noisy latent video representation together with the multimodal conditioning signals described in Section 3.4 and predicts the injected Gaussian noise $\hat{\epsilon}_t$. The training objective is the mean-squared error between the predicted and true noise as follows:
$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}\left[ \left\lVert \epsilon_t - \hat{\epsilon}_t \right\rVert_2^2 \right].$$
Here, $\epsilon_t$ denotes the Gaussian noise injected at diffusion step $t$, $\hat{\epsilon}_t$ denotes the noise predicted by $D_\theta$, and $\mathbb{E}[\cdot]$ denotes expectation over the training samples, diffusion steps, and noise realizations. All frozen modules remain fixed during training, while the trainable components of the generator are optimized jointly through this diffusion loss.
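A Monte Carlo estimate of this objective can be sketched as follows. This is a toy illustration under several assumptions that the text does not state: a DDPM-style forward process parameterized by cumulative products `alpha_bars`, latents represented as flat lists, uniform sampling of the diffusion step, and a hypothetical `predict_noise` callable standing in for the Denoising U-Net (the multimodal conditioning signals are omitted).

```python
import math
import random
from typing import Callable, List, Sequence


def add_noise(x0: Sequence[float], eps: Sequence[float], alpha_bar: float) -> List[float]:
    """Assumed forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def squared_l2(eps_true: Sequence[float], eps_pred: Sequence[float]) -> float:
    """Squared L2 distance between the true and the predicted noise."""
    return sum((t - p) ** 2 for t, p in zip(eps_true, eps_pred))


def diffusion_loss(latents: Sequence[Sequence[float]],
                   predict_noise: Callable[[List[float], int], List[float]],
                   alpha_bars: Sequence[float],
                   rng: random.Random) -> float:
    """Average the squared noise-prediction error over a batch of latents."""
    total = 0.0
    for x0 in latents:
        t = rng.randrange(len(alpha_bars))        # random diffusion step
        eps = [rng.gauss(0.0, 1.0) for _ in x0]   # injected Gaussian noise
        x_t = add_noise(x0, eps, alpha_bars[t])
        eps_hat = predict_noise(x_t, t)           # stand-in for the U-Net
        total += squared_l2(eps, eps_hat)
    return total / len(latents)
```

In an actual training loop, this scalar would be computed with an automatic-differentiation framework and minimized with respect to the trainable generator parameters only, with the frozen modules excluded from the update.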

4.5. Results and Ablation Study

In this subsection, we report the quantitative results of MAVAGEN and analyze the contribution of individual feature streams. We evaluate the quality of the generated avatar videos against existing human animation methods and then study ablations over the five non-appearance feature streams (textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, and acoustic features; the RGB image is always kept as the base modality).
Avatar generation performance measures and ablations. Table 1 summarizes the objective comparison of the avatar generator against existing human animation methods, while Table 2 reports the single-stream and pairwise ablation results together with a straightforward baseline. We follow the evaluation protocol of EchoMimicV2 [44] and compare MAVAGEN with AnimateAnyone [43], MimicMotion [65], and EchoMimicV2, reporting FID, FVD, SSIM, PSNR, E-FID, two audio-visual synchrony metrics (Sync-D, Sync-C), head-keypoint metrics (HKC, HKV), CSIM, the auxiliary, task-specific metric Emotion-preservation Accuracy (EmoAcc), and the subjective MOS. Lower FID/FVD/E-FID/Sync-D and higher SSIM/PSNR/Sync-C/HKC/HKV/CSIM/EmoAcc/MOS indicate better performance.
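Because the reported metrics point in different directions, a small helper makes comparisons less error-prone. The sketch below only encodes the directions stated in the text; the set and function names are hypothetical.

```python
# Direction of improvement for each reported metric, as stated in the text.
LOWER_IS_BETTER = {"FID", "FVD", "E-FID", "Sync-D"}
HIGHER_IS_BETTER = {"SSIM", "PSNR", "Sync-C", "HKC", "HKV", "CSIM", "EmoAcc", "MOS"}


def is_better(metric: str, candidate: float, baseline: float) -> bool:
    """Return True if `candidate` strictly improves on `baseline` for `metric`."""
    if metric in LOWER_IS_BETTER:
        return candidate < baseline
    if metric in HIGHER_IS_BETTER:
        return candidate > baseline
    raise KeyError(f"unknown metric: {metric}")
```

For instance, `is_better("E-FID", 2.25, 2.24)` is false, which matches the marginally higher E-FID of MAVAGEN relative to EchoMimicV2 reported in the paper.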
Compared to AnimateAnyone and MimicMotion, MAVAGEN substantially improves frame- and sequence-level visual quality (as indicated by the lower FID/FVD and higher SSIM), while remaining competitive with EchoMimicV2. The full MAVAGEN model achieves better FID and FVD than EchoMimicV2 as well as slightly higher SSIM and synchrony scores (Sync-C), at the cost of a marginally higher E-FID and essentially unchanged PSNR. In addition, MAVAGEN achieves an EmoAcc of 0.88, indicating strong agreement between the target emotion and the emotion expressed by the generated avatar. The MOS results in Figure 2 further indicate that MAVAGEN achieves the highest subjective perceived quality among the compared methods. Some representative generated videos and qualitative comparisons are available on the project webpage: https://smil-spcras.github.io/MAVAGEN/, accessed on 31 March 2026.
Rows labeled “w/o…” in Table 2 correspond to ablations of the five non-appearance feature streams. Removing the explicit emotion feature stream or the acoustic features leads to the largest drops in Sync-C and EmoAcc, highlighting the central role of prosody and explicit emotion control. Landmark-based pose features and depth-geometry features also contribute noticeably by improving HKC/HKV and stabilizing CSIM, while their removal causes a moderate degradation in visual metrics. Pairwise ablations further confirm the complementary roles of textual features, emotion-distribution features, and acoustic features: jointly dropping emotion-distribution features and acoustic features yields the lowest EmoAcc and one of the weakest synchrony scores, whereas combinations that keep at least one of these signals degrade more gracefully. The last row reports the straightforward baseline for reference. In all variants, the RGB-appearance feature stream is retained and serves as the base appearance cue.

5. Discussion

The experiments show that MAVAGEN improves both the visual and emotional quality of the generated avatar clips relative to recent human animation methods, while maintaining temporal coherence and audio-visual synchrony. In particular, MAVAGEN achieves competitive or improved results in terms of FID, FVD, SSIM, PSNR, E-FID, Sync-D, Sync-C, HKC, HKV, and CSIM, while also yielding a high EmoAcc value as an auxiliary measure of emotional agreement. These results indicate that multimodal conditioning is an effective mechanism for generating avatar clips that are visually plausible, temporally stable, and emotionally aligned with the target emotion, which is particularly important for personalized LLM-based chatbots and virtual assistants operating in emotion-sensitive human–computer interaction scenarios.
The comparison in Table 1 shows that MAVAGEN is competitive with state-of-the-art human animation methods such as AnimateAnyone [43], MimicMotion [65], and EchoMimicV2 [44], achieving a favorable trade-off between visual quality, synchrony, and motion consistency, while also yielding strong emotional agreement, as measured by the auxiliary EmoAcc metric. The ablation study in Table 2 indicates that not all feature streams contribute equally to these gains. The emotion probability vector $z_E$ and the acoustic feature vector $z_A$ are the most influential: removing either causes the largest drops in Sync-C, motion-consistency metrics, and EmoAcc, highlighting the central roles of prosody and explicit emotion control in avatar generation. Landmark-based pose features and RGB-appearance features mainly improve perceived naturalness and motion stability through richer non-verbal behavior and identity consistency, while depth-geometry features primarily stabilize head pose and geometry with a smaller but still positive effect on the performance metrics. Taken together, the pairwise ablations suggest that emotion-distribution features and acoustic features are the main drivers of emotional agreement and synchrony, while textual features remain complementary and landmark-based pose features and depth-geometry features primarily support motion stability and visual coherence. In this sense, MAVAGEN can be viewed as a modular avatar-generation framework in which each feature stream has a clear functional role, making the framework suitable for practical use in personalized LLM-based chatbots and other emotion-sensitive interactive systems.
The proposed method has several limitations. The avatar pool is restricted to HaGRIDv2-1M, which may introduce demographic and contextual biases; the emotion representation is limited to a small set of basic categories; and the current system conditions the avatar’s appearance only on gender and age attributes. In addition, MAVAGEN focuses on text-conditioned emotional speaking behavior and does not explicitly model natural listening or idle states, such as eye contact, gaze shifts, head nods, or backchannels. Addressing these behaviors would require real-time multimodal analysis of the interlocutor and more advanced dialogue-level behavior control, which is beyond the scope of the present study. Moreover, although the system uses a probability distribution over basic emotions, the current speech generation stage selects the maximum-probability emotion rather than synthesizing blended emotional speech, since the employed text-to-speech model does not explicitly support compound emotional states. Generating an avatar clip for each request remains computationally expensive, and the method alone does not guarantee safe or clinically appropriate behavior in sensitive domains. These constraints motivate future work on richer emotion representations (e.g., continuous valence-arousal spaces and compound emotional states), explicit modeling of listening and idle behaviors, more expressive text-to-speech models, more efficient diffusion backbones, broader evaluations across diverse user populations and application domains, and more detailed analyses of safety and bias in real-world emotion-sensitive applications.

6. Conclusions

In this work, we introduced MAVAGEN, a multimodal avatar generation framework for synthesizing personalized upper-body digital avatars with controllable emotional expression. The framework combines six feature streams (textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, RGB-appearance features, and acoustic features) and is designed for personalized human–computer interaction scenarios involving LLM-based chatbots and virtual assistants.
The evaluation results show that MAVAGEN achieves the strongest overall performance across most reported metrics. In terms of frame- and sequence-level realism, MAVAGEN attains an FID of 48.20 and an FVD of 592.00, as well as the highest SSIM of 0.741 and a competitive PSNR of 21.95. Expression fidelity remains on par with the strongest baseline (an E-FID of 2.25 vs. 2.24 for EchoMimicV2), while audio-visual synchrony and motion consistency are slightly improved: Sync-D decreases to 6.85, Sync-C increases to 7.40, HKC and HKV reach values of 0.929 and 25.30, and CSIM grows to 0.563. As an auxiliary, task-specific metric, EmoAcc reaches a value of 0.88 for the full model, indicating strong agreement between the target emotion and the emotion expressed by the generated avatar.
The ablation study confirms the contribution of each non-appearance feature stream. Removing only one feature stream leads to moderate degradation, with the strongest effects observed for the emotion-distribution features and acoustic features (e.g., EmoAcc drops from 0.88 to 0.80 and 0.78, respectively, and Sync-C drops from 7.40 to 7.02 and 6.88). Depth-geometry features and landmark-based pose features mainly stabilize geometry and motion (HKC/HKV and CSIM decrease when they are removed), while textual features support fine-grained control over expression and synchrony. Pairwise ablations further highlight the complementary roles of these signals: the combination without both emotion-distribution and acoustic feature streams yields the lowest EmoAcc (0.70) and one of the weakest synchrony scores (Sync-C 6.58), while configurations that retain at least one of these feature streams degrade more gently across all metrics. The subjective MOS evaluation also shows that MAVAGEN receives the highest perceived quality score among the compared methods.
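To put the quoted ablation numbers on a common scale, the relative drops with respect to the full model can be computed as in the sketch below. The row labels are hypothetical, and the assignment of the two single-stream values to the emotion-distribution and acoustic streams assumes that the values are listed in the same order as the streams in the text.

```python
# Full-model values and ablation values quoted in the text.
FULL = {"EmoAcc": 0.88, "Sync-C": 7.40}

# Assumed mapping: single-stream values follow the order "emotion, acoustic".
ABLATIONS = {
    "w/o emotion": {"EmoAcc": 0.80, "Sync-C": 7.02},
    "w/o acoustic": {"EmoAcc": 0.78, "Sync-C": 6.88},
    "w/o emotion + acoustic": {"EmoAcc": 0.70, "Sync-C": 6.58},
}


def relative_drop(full: float, ablated: float) -> float:
    """Relative decrease in percent with respect to the full model."""
    return 100.0 * (full - ablated) / full


if __name__ == "__main__":
    for name, scores in ABLATIONS.items():
        drops = {m: relative_drop(FULL[m], v) for m, v in scores.items()}
        print(name, {m: f"{d:.1f}%" for m, d in drops.items()})
```

Under these assumptions, removing both streams lowers EmoAcc by roughly 20% and Sync-C by roughly 11% relative to the full model, which is larger than either single-stream removal.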
In general, the results indicate that MAVAGEN effectively leverages these multimodal signals to improve visual quality, temporal coherence, audio-visual synchrony, and emotion accuracy in avatar generation, and that all five non-appearance feature streams contribute meaningfully to the final performance. These findings support the practical use of MAVAGEN in personalized LLM-based chatbots and other emotion-sensitive interactive systems. Future work will focus on extending MAVAGEN in several directions. First, we plan to move beyond basic discrete emotion categories toward richer affective representations, including continuous valence-arousal spaces and compound emotional states. Second, we plan to incorporate natural listening and idle-state behaviors, such as eye contact, gaze shifts, head nods, and neutral waiting dynamics, which are essential for realistic interactive conversations. Finally, we will improve computational efficiency for real-time interaction and scale evaluations to more diverse user populations, avatar attributes, and interactive settings.

Author Contributions

Conceptualization, A.A. and A.K.; methodology, E.R. and A.A.; software, D.R.; validation, E.R. and A.A.; formal analysis, D.R.; investigation, E.R. and A.A.; resources, D.R.; data curation, A.A.; writing–original draft preparation, A.A. and E.R.; writing–review and editing, A.A. and E.R.; visualization, E.R. and A.A.; supervision, A.K.; project administration, A.A.; funding acquisition, A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research is financially supported by the Russian Science Foundation, project No. 24-71-00112 (https://rscf.ru/project/24-71-00112/, accessed on 31 March 2026), with the exception of Section 3.3, which was conducted under Russian state research No. FFZF-2025-0003.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The HaGRIDv2-1M dataset used as the source of avatar identity images is publicly available at https://github.com/hukenovs/hagrid/tree/Hagrid_v2-1M, accessed on 31 March 2026. Representative generated avatar videos and qualitative examples are available on the project webpage at https://smil-spcras.github.io/MAVAGEN/, accessed on 31 March 2026. Additional data supporting the reported results are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CEPrompt: Cross-modal Emotion-aware Prompting
COGMEN: COntextualized Graph Neural Network based Multimodal Emotion recognitioN
CSIM: Cosine Similarity of Motion Features
DER-GCN: Dialog and Event Relation-aware Graph Convolutional neural Network
E-FID: Expression FID
EmoAcc: Emotion-preservation Accuracy
FID: Fréchet Inception Distance
FVD: Fréchet Video Distance
HESP: Human Expression-Sensitive Prompting
HKC: Head-Keypoint Consistency
HKV: Head-Keypoint Variance
LLM: Large Language Model
MAVAGEN: Multimodal Avatar Generation
MMGCN: Multimodal Fused Graph Convolutional Network
MOS: Mean Opinion Score
OWSM-CTC: Open Whisper-style Speech Model Connectionist Temporal Classification
PSNR: Peak Signal-to-Noise Ratio
SSIM: Structural Similarity Index Measure
TelME: Teacher-leading Multimodal fusion network for ERC
TTS: Text-to-Speech
VAE: Variational Autoencoder

References

  1. Gabriel, S.; Puri, I.; Xu, X.; Malgaroli, M.; Ghassemi, M. Can AI Relate: Testing Large Language Model Response for Mental Health Support. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Miami, FL, USA, 12–16 November 2024; pp. 2206–2221.
  2. Fei, H.; Zhang, H.; Wang, B.; Liao, L.; Liu, Q.; Cambria, E. EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 61–71.
  3. Zhang, H.; Meng, Z.; Luo, M.; Han, H.; Liao, L.; Cambria, E.; Fei, H. Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark. In Proceedings of the ACM on Web Conference (WWW), Sydney, NSW, Australia, 28 April–2 May 2025; pp. 2872–2881.
  4. Li, Y.; Kazemeini, A.; Mehta, Y.; Cambria, E. Multitask Learning for Emotion and Personality Traits Detection. Neurocomputing 2022, 493, 340–350.
  5. Wen, Z.; Cao, J.; Yang, Y.; Yang, R.; Liu, S. Affective-NLI: Towards Accurate and Interpretable Personality Recognition in Conversation. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications (PerCom), Biarritz, France, 11–15 March 2024; pp. 184–193.
  6. Ryumina, E.; Markitantov, M.; Ryumin, D.; Karpov, A. OCEAN-AI Framework with EmoFormer Cross-Hemiface Attention Approach for Personality Traits Assessment. Expert Syst. Appl. 2024, 239, 122441.
  7. Chen, Y.; Xing, X.; Lin, J.; Zheng, H.; Wang, Z.; Liu, Q.; Xu, X. SoulChat: Improving LLMs’ Empathy, Listening, and Comfort Abilities through Fine-tuning with Multi-turn Empathy Conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023; pp. 1170–1183.
  8. Chen, Y.; Yan, S.; Liu, S.; Li, Y.; Xiao, Y. EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 2149–2176.
  9. Kyung, J.; Heo, S.; Chang, J.H. Enhancing Multimodal Emotion Recognition through ASR Error Compensation and LLM Fine-Tuning. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4683–4687.
  10. Xie, Y.; Sun, C.; Cao, Z.; Liu, B.; Ji, Z.; Liu, Y.; Shan, L. A Dual Contrastive Learning Framework for Enhanced Multimodal Conversational Emotion Recognition. In Proceedings of the International Conference on Computational Linguistics (COLING), Abu Dhabi, United Arab Emirates, 9–14 January 2025; pp. 4055–4065.
  11. Cheng, Z.; Cheng, Z.Q.; He, J.Y.; Wang, K.; Lin, Y.; Lian, Z.; Peng, X.; Hauptmann, A. Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning. Adv. Neural Inf. Process. Syst. (Neurips) 2024, 37, 110805–110853.
  12. Xiao, M.; Xie, Q.; Kuang, Z.; Liu, Z.; Yang, K.; Peng, M.; Han, W.; Huang, J. HealMe: Harnessing Cognitive Reframing in Large Language Models for Psychotherapy. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 1707–1725.
  13. Bhattacharyya, S.; Wang, J.Z. Evaluating Vision-Language Models for Emotion Recognition. In Findings of the Association for Computational Linguistics: NAACL 2025; Chiruzzo, L., Ritter, A., Wang, L., Eds.; Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 1798–1820.
  14. Wang, Y.; Guo, J.; Bai, J.; Yu, R.; He, T.; Tan, X.; Sun, X.; Bian, J. Instructavatar: Text-guided emotion and motion control for avatar generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8132–8140.
  15. Liu, T.; Ma, Z.; Chen, Q.; Chen, F.; Fan, S.; Chen, X.; Yu, K. VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 5586–5594.
  16. Song, W.; Ding, Y.; Hou, F.; Li, S.; Hao, A.; Hou, X. CtrlAvatar: Controllable Avatars Generation via Disentangled Invertible Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6959–6967.
  17. Liu, H.; Sun, W.; Di, D.; Sun, S.; Yang, J.; Zou, C.; Bao, H. Moee: Mixture of emotion experts for audio-driven portrait animation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26222–26231.
  18. Wei, X.; Chen, P.; Lu, M.; Chen, H.; Tian, F. Graphavatar: Compact head avatars with gnn-generated 3d gaussians. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8295–8303.
  19. Cha, H.; Lee, I.; Joo, H. Perse: Personalized 3d generative avatars from a single portrait. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 15953–15962.
  20. Zhang, D.; Liu, Y.; Lin, L.; Zhu, Y.; Chen, K.; Qin, M.; Li, Y.; Wang, H. HRAvatar: High-Quality and Relightable Gaussian Head Avatar. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26285–26296.
  21. Feng, W.Q.; Han, D.; Zhou, Z.K.; Li, S.; Liu, X.; Wan, P.; Zhang, D.; Wang, M. GPAvatar: High-fidelity Head Avatars by Learning Efficient Gaussian Projections. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 250–259.
  22. Corona, E.; Zanfir, A.; Bazavan, E.G.; Kolotouros, N.; Alldieck, T.; Sminchisescu, C. Vlogger: Multimodal diffusion for embodied avatar synthesis. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 15896–15908.
  23. Qi, X.; Pan, J.; Li, P.; Yuan, R.; Chi, X.; Li, M.; Luo, W.; Xue, W.; Zhang, S.; Liu, Q.; et al. Weakly-supervised emotion transition learning for diverse 3d co-speech gesture generation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 10424–10434.
  24. Yariv, G.; Gat, I.; Benaim, S.; Wolf, L.; Schwartz, I.; Adi, Y. Diverse and aligned audio-to-video generation via text-to-video model adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6639–6647.
  25. Qin, M.; Liu, Y.; Xu, Y.; Zhao, X.; Liu, Y.; Wang, H. High-fidelity 3d head avatars reconstruction through spatially-varying expression conditioned neural radiance field. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4569–4577.
  26. Lin, W.; Zheng, C.; Yong, J.H.; Xu, F. Relightable and animatable neural avatars from videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 3486–3494.
  27. Hu, J.; Liu, Y.; Zhao, J.; Jin, Q. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2021; pp. 5666–5675.
  28. Joshi, A.; Bhat, A.; Jain, A.; Singh, A.; Modi, A. COGMEN: COntextualized GNN based Multimodal Emotion recognitioN. In Proceedings of the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Seattle, WA, USA, 10–15 July 2022; pp. 4148–4164.
  29. Li, D.; Wang, Y.; Funakoshi, K.; Okumura, M. Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023; pp. 16051–16069.
  30. Yun, T.; Lim, H.; Lee, J.; Song, M. TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation. In Proceedings of the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Mexico City, Mexico, 16–21 June 2024; pp. 82–95. [Google Scholar] [CrossRef]
  31. Ai, W.; Shou, Y.; Meng, T.; Li, K. DER-GCN: Dialog and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialog Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4908–4921. [Google Scholar] [CrossRef] [PubMed]
  32. Zhou, H.; Huang, S.; Zhang, F.; Xu, C. CEPrompt: Cross-Modal Emotion-Aware Prompting for Facial Expression Recognition. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11886–11899. [Google Scholar] [CrossRef]
  33. Liu, Y.; Huang, Y.; Liu, S.; Zhan, Y.; Chen, Z.; Chen, Z. Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting. In Proceedings of the ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 5722–5731. [Google Scholar] [CrossRef]
  34. Murzaku, J.; Rambow, O. OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs. arXiv 2025. [Google Scholar] [CrossRef]
  35. Lan, X.; Xue, J.; Qi, J.; Jiang, D.; Lu, K.; Chua, T.S. ExpLLM: Towards Chain of Thought for Facial Expression Recognition. IEEE Trans. Multimed. 2025, 27, 3069–3081. [Google Scholar] [CrossRef]
  36. Wu, Z.; Jiang, L.; Li, X.; Fang, C.; Qin, Y.; Li, G. Hierarchically controlled deformable 3D gaussians for talking head synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8532–8540. [Google Scholar] [CrossRef]
  37. Xu, Q.; Yuan, S.; Wei, Y.; Wu, J.; Wang, L.; Wu, C. Multiple Feature Refining Network for Visual Emotion Distribution Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8924–8932. [Google Scholar] [CrossRef]
  38. Xiang, J.; Gao, X.; Guo, Y.; Zhang, J. Flashavatar: High-fidelity head avatar with efficient gaussian embedding. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 1802–1812. [Google Scholar]
  39. Cai, H.; Xiao, Y.; Wang, X.; Li, J.; Guo, Y.; Fan, Y.; Gao, S.; Zhang, J. HERA: Hybrid Explicit Representation for Ultra-Realistic Head Avatars. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 260–270. [Google Scholar]
  40. Zhan, Y.; Shao, T.; Yang, Y.; Zhou, K. Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26297–26307. [Google Scholar]
  41. Zhou, Z.; Ma, F.; Fan, H.; Chua, T.S. Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 15941–15952. [Google Scholar] [CrossRef]
  42. Zhuang, J.; Kang, D.; Bao, L.; Lin, L.; Li, G. Dagsm: Disentangled avatar generation with gs-enhanced mesh. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 292–303. [Google Scholar] [CrossRef]
  43. Hu, L. Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 8153–8163. [Google Scholar] [CrossRef]
  44. Meng, R.; Zhang, X.; Li, Y.; Ma, C. EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 5489–5498. [Google Scholar] [CrossRef]
  45. Kapitanov, A.; Kvanchiani, K.; Nagaev, A.; Kraynov, R.; Makhliarchuk, A. HaGRID–HAnd Gesture Recognition Image Dataset. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 4–8 January 2024; pp. 4572–4581. [Google Scholar] [CrossRef]
  46. Bazarevsky, V.; Kartynnik, Y.; Vakunov, A.; Raveendran, K.; Grundmann, M. BlazeFace: Sub-Millisecond Neural Face Detection on Mobile GPUs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  47. Kartynnik, Y.; Ablavatski, A.; Grishchenko, I.; Grundmann, M. Real-Time Facial Surface Geometry from Monocular Video on Mobile GPUs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  48. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-Device Real-Time Hand Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  49. Bazarevsky, V.; Grishchenko, I.; Raveendran, K.; Zhu, T.; Zhang, F.; Grundmann, M. BlazePose: On-Device Real-Time Body Pose Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  50. Dao, T.; Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; pp. 10041–10071. Available online: https://proceedings.mlr.press/v235/dao24a.html (accessed on 31 March 2026).
  51. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. (NeurIPS) 2014, 27, 2366–2374. [Google Scholar]
  52. Ke, B.; Qu, K.; Wang, T.; Metzger, N.; Huang, S.; Li, B.; Obukhov, A.; Schindler, K. Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 1–18. [Google Scholar] [CrossRef] [PubMed]
  53. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  54. Boncelet, C. Image noise models. In The Essential Guide to Image Processing; Academic Press: Boston, MA, USA, 2009; pp. 143–167. [Google Scholar] [CrossRef]
  55. Peng, Y.; Sudo, Y.; Shakeel, M.; Watanabe, S. OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 10192–10209. [Google Scholar] [CrossRef]
  56. Tan, C.; Gao, Z.; Wu, L.; Xu, Y.; Xia, J.; Li, S.; Li, S.Z. Temporal attention unit: Towards efficient spatiotemporal predictive learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18770–18782. [Google Scholar] [CrossRef]
  57. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  58. Huang, Z.; Tang, F.; Zhang, Y.; Cun, X.; Cao, J.; Li, J.; Lee, T.Y. Make-your-anchor: A diffusion-based 2d avatar generation framework. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 6997–7006. [Google Scholar] [CrossRef]
  59. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  60. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  61. Unterthiner, T.; Van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv 2018. [Google Scholar] [CrossRef]
  62. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  63. Prajwal, K.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 484–492. [Google Scholar] [CrossRef]
  64. Markitantov, M.; Ryumina, E.; Dvoynikova, A.; Karpov, A. Multi-Lingual Approach for Multi-Modal Emotion and Sentiment Recognition Based on Triple Fusion. Inf. Fusion 2026, 132, 104207. [Google Scholar] [CrossRef]
  65. Zhang, Y.; Gu, J.; Wang, L.W.; Wang, H.; Cheng, J.; Zhu, Y.; Zou, F. MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance. In Proceedings of the International Conference on Machine Learning (ICML), Vancouver, BC, Canada, 13–19 July 2025; Volume 267, pp. 74896–74910. [Google Scholar]
Figure 1. Pipeline of the proposed MAVAGEN framework. Colors indicate different processing blocks.
Figure 2. Distribution of subjective MOS ratings for the compared human animation methods. Boxplots show participant ratings on a 10-point scale; green triangles denote mean values, orange lines denote medians, and the values above the boxes report mean ± standard deviation.
Table 1. Quantitative and subjective comparison of MAVAGEN with existing human animation methods. Lower FID/FVD/E-FID/Sync-D and higher SSIM/PSNR/Sync-C/HKC/HKV/CSIM/EmoAcc/MOS indicate better performance. EmoAcc is an auxiliary, task-specific metric and is only defined for MAVAGEN among the compared methods. MOS is reported as mean ± standard deviation on a 10-point scale.
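| Methods | FID ↓ | FVD ↓ | SSIM ↑ | PSNR ↑ | E-FID ↓ | Sync-D ↓ | Sync-C ↑ | HKC ↑ | HKV ↑ | CSIM ↑ | EmoAcc ↑ | MOS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AnimateAnyone [43] | 60.10 | 1030.12 | 0.726 | 20.40 | 3.900 | 14.10 | 0.950 | 0.805 | 23.70 | 0.380 | - | 2.64 ± 1.44 |
| MimicMotion [65] | 55.20 | 635.40 | 0.705 | 19.10 | 2.700 | 8.10 | 1.450 | 0.902 | 24.70 | 0.520 | - | 3.59 ± 2.20 |
| EchoMimicV2 [44] | 50.10 | 605.30 | 0.736 | 21.90 | 2.240 | 7.10 | 7.150 | 0.921 | 25.20 | 0.555 | - | 5.26 ± 2.16 |
| MAVAGEN (ours) | 48.20 | 592.00 | 0.741 | 21.95 | 2.250 | 6.85 | 7.40 | 0.929 | 25.30 | 0.563 | 0.88 | 6.97 ± 2.35 |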
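As a reading aid for Table 1, the relative change of MAVAGEN against the strongest baseline, EchoMimicV2 [44], can be computed per metric. The following is a minimal sketch, not part of the MAVAGEN code: the helper `relative_gain` and the dictionary names are hypothetical and introduced here only for illustration, while the numbers are copied from the table. The sign is flipped for lower-is-better metrics so that a positive value always means MAVAGEN is better.

```python
# Values copied from Table 1; all names below are illustrative assumptions.
LOWER_IS_BETTER = {"FID", "FVD", "E-FID", "Sync-D"}

echomimic_v2 = {
    "FID": 50.10, "FVD": 605.30, "SSIM": 0.736, "PSNR": 21.90,
    "E-FID": 2.240, "Sync-D": 7.10, "Sync-C": 7.150,
    "HKC": 0.921, "HKV": 25.20, "CSIM": 0.555,
}
mavagen = {
    "FID": 48.20, "FVD": 592.00, "SSIM": 0.741, "PSNR": 21.95,
    "E-FID": 2.250, "Sync-D": 6.85, "Sync-C": 7.40,
    "HKC": 0.929, "HKV": 25.30, "CSIM": 0.563,
}


def relative_gain(metric: str, ours: float, baseline: float) -> float:
    """Relative change in percent; positive means `ours` is better."""
    change = (ours - baseline) / baseline * 100.0
    # For lower-is-better metrics a decrease is an improvement.
    return -change if metric in LOWER_IS_BETTER else change


gains = {m: relative_gain(m, mavagen[m], echomimic_v2[m]) for m in mavagen}

for metric, gain in gains.items():
    print(f"{metric:7s} {gain:+.2f}%")
```

Under this sign convention, every listed metric comes out positive except E-FID, where EchoMimicV2 is marginally lower (2.240 vs. 2.250).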
Table 2. Ablation results for MAVAGEN together with the straightforward baseline. EmoAcc is an auxiliary, task-specific metric.
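| Methods | FID ↓ | FVD ↓ | SSIM ↑ | PSNR ↑ | E-FID ↓ | Sync-D ↓ | Sync-C ↑ | HKC ↑ | HKV ↑ | CSIM ↑ | EmoAcc ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| w/o text features | 48.90 | 600.00 | 0.738 | 21.80 | 2.300 | 6.98 | 7.25 | 0.925 | 25.05 | 0.559 | 0.84 |
| w/o emotion vector | 49.90 | 612.00 | 0.734 | 21.70 | 2.330 | 7.12 | 7.02 | 0.918 | 24.75 | 0.552 | 0.80 |
| w/o landmark-based pose | 49.10 | 605.00 | 0.737 | 21.75 | 2.310 | 7.00 | 7.15 | 0.922 | 24.90 | 0.556 | 0.83 |
| w/o depth geometry | 48.60 | 597.00 | 0.740 | 21.85 | 2.280 | 6.93 | 7.33 | 0.927 | 25.15 | 0.561 | 0.86 |
| w/o acoustic features | 50.10 | 617.00 | 0.735 | 21.65 | 2.350 | 7.18 | 6.88 | 0.915 | 24.55 | 0.548 | 0.78 |
| w/o text + emotion | 50.50 | 622.00 | 0.731 | 21.55 | 2.380 | 7.24 | 6.78 | 0.910 | 24.35 | 0.545 | 0.75 |
| w/o text + landmarks | 49.90 | 614.00 | 0.734 | 21.65 | 2.340 | 7.14 | 6.93 | 0.914 | 24.50 | 0.547 | 0.79 |
| w/o text + depth | 49.40 | 610.00 | 0.736 | 21.70 | 2.320 | 7.07 | 7.03 | 0.918 | 24.65 | 0.549 | 0.81 |
| w/o text + audio | 50.80 | 627.00 | 0.730 | 21.50 | 2.400 | 7.28 | 6.68 | 0.908 | 24.25 | 0.542 | 0.73 |
| w/o emotion + landmarks | 50.30 | 620.00 | 0.732 | 21.60 | 2.370 | 7.19 | 6.83 | 0.912 | 24.40 | 0.544 | 0.77 |
| w/o emotion + depth | 50.00 | 616.00 | 0.734 | 21.65 | 2.350 | 7.13 | 6.90 | 0.914 | 24.50 | 0.546 | 0.78 |
| w/o emotion + audio | 51.30 | 630.00 | 0.728 | 21.45 | 2.420 | 7.32 | 6.58 | 0.906 | 24.15 | 0.540 | 0.70 |
| w/o landmarks + depth | 49.70 | 612.00 | 0.735 | 21.68 | 2.330 | 7.09 | 6.98 | 0.916 | 24.55 | 0.548 | 0.80 |
| w/o landmarks + audio | 51.00 | 625.00 | 0.729 | 21.50 | 2.390 | 7.26 | 6.73 | 0.909 | 24.30 | 0.543 | 0.72 |
| w/o depth + audio | 50.60 | 623.00 | 0.732 | 21.58 | 2.360 | 7.21 | 6.80 | 0.911 | 24.40 | 0.545 | 0.74 |
| straightforward baseline | 52.80 | 640.00 | 0.668 | 19.40 | 2.800 | 7.30 | 6.70 | 0.828 | 22.70 | 0.510 | - |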
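The contribution of each feature stream to emotional agreement can be read from Table 2 by subtracting each single-stream ablation from the full-model EmoAcc of 0.88 reported in Table 1. The sketch below is an illustrative calculation only; the variable names are hypothetical and the values are copied from the two tables.

```python
# EmoAcc of the full MAVAGEN model (Table 1).
FULL_EMOACC = 0.88

# EmoAcc after removing a single feature stream (Table 2).
single_stream_ablations = {
    "text features": 0.84,
    "emotion vector": 0.80,
    "landmark-based pose": 0.83,
    "depth geometry": 0.86,
    "acoustic features": 0.78,
}

# Absolute EmoAcc drop per removed stream, rounded to the table precision.
drops = {
    name: round(FULL_EMOACC - acc, 2)
    for name, acc in single_stream_ablations.items()
}

# Streams ordered from largest to smallest drop.
ranking = sorted(drops, key=drops.get, reverse=True)

for name in ranking:
    print(f"w/o {name:20s} EmoAcc drop = {drops[name]:.2f}")
```

With these values, removing acoustic features gives the largest EmoAcc drop (0.10), followed by the emotion vector (0.08), while removing depth geometry gives the smallest (0.02).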
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Axyonov, A.; Ryumina, E.; Ryumin, D.; Karpov, A. MAVAGEN: Multimodal Avatar Generation Framework for Personalized Human–Computer Interaction. Multimodal Technol. Interact. 2026, 10, 55. https://doi.org/10.3390/mti10050055
