Article

Multimodal Feature-Guided Audio-Driven Emotional Talking Face Generation

1 School of Intelligence Science and Technology, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
2 Beijing Key Laboratory of Super Intelligent Technology for Urban Architecture, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(13), 2684; https://doi.org/10.3390/electronics14132684
Submission received: 26 May 2025 / Revised: 28 June 2025 / Accepted: 30 June 2025 / Published: 2 July 2025

Abstract

Audio-driven emotional talking face generation aims to generate talking face videos with rich facial expressions and temporal coherence. Current diffusion model-based approaches predominantly depend on either single-label emotion annotations or external video references, which often struggle to capture the complex relationships between modalities, resulting in less natural emotional expressions. To address these issues, we propose MF-ETalk, a multimodal feature-guided method for emotional talking face generation. Specifically, we design an emotion-aware multimodal feature disentanglement and fusion framework that leverages Action Units (AUs) to disentangle facial expressions and models the nonlinear relationships among AU features using a residual encoder. Furthermore, we introduce a hierarchical multimodal feature fusion module that enables dynamic interactions among audio, visual cues, AUs, and motion dynamics. This module is optimized through global motion modeling, lip synchronization, and expression subspace learning, enabling full-face dynamic generation. Finally, an emotion-consistency constraint module is employed to refine the generated results and ensure the naturalness of expressions. Extensive experiments on the MEAD and HDTF datasets demonstrate that MF-ETalk outperforms state-of-the-art methods in both expression naturalness and lip-sync accuracy. For example, it achieves an FID of 43.052 and E-FID of 2.403 on MEAD, along with strong synchronization performance (LSE-C of 6.781, LSE-D of 7.962), confirming the effectiveness of our approach in producing realistic and emotionally expressive talking face videos.

1. Introduction

Audio-driven emotional talking face generation is a computer vision task that synthesizes realistic, expressive talking face videos from an input audio signal. The goal is to produce lip-synchronized facial animations that not only match the speech content but also convey natural and nuanced emotional expressions. This task injects revolutionary potential into scenarios such as virtual avatars, film and television production, online education, and human–computer interaction [1,2].
The architectural innovations of Generative Adversarial Networks (GANs) and the refined control of Diffusion Models (DMs) have provided crucial support for this technology. In particular, the introduction of cross-modal attention mechanisms and fine-grained audio encoding has made talking face generation a research hotspot in recent years. Most current research on talking face generation focuses on lip synchronization and video quality, while relatively few studies tackle the difficulties associated with emotional expression [3,4,5,6,7,8]. Emotion is a core element of human communication, and emotionally expressive talking face generation produces videos with natural facial expressions that significantly enhance interactive realism. However, avoiding rigid or unnatural facial animations remains a key challenge in the field.
Early work on emotional talking face generation [9,10] relied on single emotional labels as the emotion source. While straightforward, this approach struggles to model complex and dynamic emotional variations. To address this, EVP [11] introduced a disentanglement framework that implicitly controls video emotions by isolating expressive features from audio. However, EVP’s effectiveness depends critically on two factors: (1) the accuracy of speech content representation and (2) the performance of its emotion disentanglement module. This dependency often leads to inconsistent emotional expressions.
Moreover, when using reference videos as the emotion source, practical limitations arise. As noted by [12], selecting suitable videos that exactly match the target emotional style is nontrivial. The process must account for video resolution, occlusion patterns, and audio–visual temporal alignment, significantly increasing implementation complexity.
Recent advances in video diffusion models have significantly enhanced the realism of audio-driven talking face generation. However, many state-of-the-art diffusion-based approaches [13,14,15,16] share a common limitation: they rely on cross-modal attention mechanisms that perform global feature integration, which fails to adequately model the fine-grained and dynamic relationships between audio signals and localized facial regions (e.g., lips, eye movements, and eyebrow gestures). To address these constraints, we propose a novel multimodal emotion disentanglement and fusion framework. Our approach explicitly captures localized emotional representations through Action Unit (AU) analysis, thereby eliminating reliance on external reference videos or static audio features. This design enables the robust generation of complex emotional expressions while preserving temporal coherence and enhancing realism.
To address the above issues, we propose a Multimodal Feature-guided Emotional talking face generation method (MF-ETalk). Our method first constructs an emotion-aware multimodal feature disentanglement and fusion framework. This framework includes an AU-guided expression disentanglement module, which decomposes complex expressions into independent Action Units (AUs). To capture the non-linear interactions among AUs, a residual encoder is employed. This encoder maps low-dimensional AU representations into high-dimensional continuous embeddings while preserving the granularity of features through residual learning. We further develop a multimodal feature hierarchical fusion module that establishes audio–visual–AU associations through four spatially organized feature subspaces (global motion, lip, expression, and AU) and implements expression–AU interaction via dual-pathway attention fusion [17], enabling local feature enhancement and dynamic weighting for natural lip-synced expression generation with identity preservation. Finally, we introduce an emotion consistency loss function that optimizes generated faces through emotion recognition feedback to ensure expression naturalness and emotional alignment with target emotions. The general workflow and the general effect of our method are shown in Figure 1. In addition, the comprehensive experiments demonstrate the superior performance of our method in both qualitative and quantitative metrics.
In summary, our contributions are as follows:
  • We propose a Multimodal Feature-Guided Emotional talking face generation method (MF-ETalk), which effectively addresses the problem of emotional expression and provides a practical approach to generating complex emotional videos.
  • We design an emotion-aware multimodal feature disentanglement and fusion framework. It achieves multi-level emotion modeling and optimization from granular details to overall dynamics by employing AU-guided local expression disentanglement with residual encoding, alongside multimodal feature hierarchical fusion, significantly enhancing the naturalness of generated facial expressions.
  • We introduce an emotion consistency loss function to make the expression on the generated face more vivid and expressive. This function utilizes the HSEmotionRecognizer algorithm to recognize emotions in the generated face.
The rest of this paper is organized as follows. Section 2 reviews the related work, Section 3 details our proposed MF-ETalk method, Section 4 presents and analyzes the experimental results, Section 5 discusses the findings and limitations, and Section 6 concludes the paper and outlines future work.

2. Related Work

2.1. Emotional Talking Face Generation

Emotional talking face generation refers to methods that generate facial animations driven by audio input, facial images, and emotional signals, ensuring that the generated face is not only synchronized with the audio, but also capable of expressing emotional content. Compared to lip-sync generation methods, few studies focus on emotional consistency in talking face generation, which is crucial for effective communication. Wang et al. [19] released the MEAD dataset, a large-scale multimodal emotional audiovisual dataset that provides rich data support for multimodal emotion analysis. It can be used to explore emotion-related information in audio and visual modalities, with emotions encoded using one-hot vectors. Ji et al. [11] proposed the EVP model, which decomposes speech into two disentangled spaces: an emotion space independent of duration and a content space dependent on duration. The disentangled features can be used to derive dynamic 2D emotional facial keypoints, but the extracted information mainly represents local emotional displacement and neglects other facial expressions such as those involving the mouth. Liang et al. [20] proposed the GC-AVT model, which finely processes audio and visual information. For audio, it extracts multi-dimensional features including phonemes, prosody, and emotion. For visual information, it models actions and expressions of different facial components (such as eyes, mouth, eyebrows) separately. Zhang et al. [21] extracted facial geometry, expression, and head pose parameters from audio features and 3D face models, then combined them in the feature space to obtain a new parameterized face representation. Zhang et al. [22] proposed the SadTalker model, which generates 3D motion coefficients (head pose and expression) of a 3D deformable model from audio. It explicitly models the relationships between audio and different types of motion coefficients. The generated 3D motion coefficients are then mapped to a proposed unsupervised 3D keypoint space for facial rendering to generate the final video. Wang et al. [23] proposed the EmotiveTalk model, which adopts a dual-branch architecture to separately process emotional features and speech content, and dynamically adjusts the output through a cross-modal alignment module. Zheng et al. [24] introduced the MEMO model, a memory-guided diffusion model that builds a memory bank to store typical expression and lip shape features. During generation, the model dynamically retrieves and fuses relevant memory fragments.

2.2. Diffusion Models

Diffusion models (DMs) have rapidly become the leading paradigm for generative modeling, delivering state-of-the-art performance across diverse generation tasks such as image synthesis [25,26], video generation [27,28], and audio synthesis [29]. Rooted in non-equilibrium thermodynamics principles, DMs operate through a two-phase noise diffusion process. During training, the forward process systematically corrupts input data by incrementally adding Gaussian noise across multiple timesteps until reaching near-isotropic Gaussian noise; conversely, the learned reverse process reconstructs high-quality samples by progressively denoising from random noise through iterative refinement.
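To make the two-phase process concrete, the following is a minimal, generic DDPM-style sketch of the forward noising step; the linear beta schedule and tensor shapes are illustrative assumptions rather than details taken from any of the surveyed models.

```python
import torch

def forward_diffuse(x0, t, alphas_cumprod):
    """DDPM-style forward corruption: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise.

    x0:             (B, C, H, W) clean latents
    t:              (B,) integer timesteps
    alphas_cumprod: (T,) cumulative products of (1 - beta_t)
    """
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise  # the denoiser is trained to recover `noise` from (x_t, t)

# Illustrative linear beta schedule
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
```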
The success of DMs can be attributed to their stable training process and their ability to generate diverse and high-fidelity samples. Empirical evidence from [30,31] demonstrates that DM-based frameworks consistently surpass conventional GAN-based methods across key metrics, most notably in producing temporally coherent facial animations with precise lip synchronization and micro-expression preservation. However, existing DM implementations exhibit a critical limitation: they cannot achieve fine-grained control over emotion-specific facial dynamics. Although current approaches successfully integrate global emotional cues, they remain incapable of accurately coordinating localized facial action units (AUs) to faithfully reflect target emotional states. Our work overcomes this fundamental challenge by introducing a novel multimodal feature-guided architecture that enables precise spatiotemporal control of emotional expressions in diffusion-based talking face synthesis.

2.3. Multimodal Fusion Strategies

The effective integration of multimodal information, especially the synchronization between audio and visual cues, plays a pivotal role in synthesizing realistic and expressive talking faces [32]. Researchers have investigated various fusion methodologies in this domain, including but not limited to early fusion, late fusion, and cross-modal attention mechanisms.
Early fusion approaches commonly employ feature concatenation or element-wise addition operations to integrate multimodal inputs either at the network’s input layer or during initial processing stages [33]. While computationally efficient to implement, these techniques often fail to adequately model the sophisticated temporal dynamics and nonlinear interdependencies characteristic of cross-modal interactions.
Late fusion strategies independently process individual modalities before combining their final predictions or high-level feature representations at the output stage [34]. While this architecture preserves modality-specific characteristics through dedicated processing pathways, it potentially overlooks crucial cross-modal correlations that could be leveraged during intermediate network stages.
Cross-modal attention mechanisms have gained significant attention in recent years for their ability to dynamically learn the alignment and interaction between different modalities [35,36]. In the context of talking face generation, cross-modal attention is often used to attend to relevant audio features when generating the corresponding visual frames, particularly to achieve lip synchronization [37,38]. However, as discussed in the introduction, directly applying cross-modal attention to the entire audio and visual features might limit the fine-grained control over local facial details related to emotional expression.
To address these limitations, our proposed method introduces a hierarchical multimodal fusion strategy that considers the distinct characteristics of audio, global visual motion, lip movements, and local facial action units. By disentangling and then selectively fusing these features at different levels, we aim to achieve both accurate lip synchronization and natural emotional expression in the generated talking faces. Furthermore, we introduce a dual-pathway attention mechanism specifically designed to model the intricate interactions between overall facial expressions and fine-grained action units, enabling more nuanced control over the emotional output.

3. Methods

3.1. Overview

Our MF-ETalk framework builds upon an enhanced Stable Diffusion v1.5 (SDv1.5) architecture [39] to develop a novel multimodal feature-guided emotional talking face generation system. Its goal is to generate high-quality talking faces $G_{\mathrm{Emo}} \in \mathbb{R}^{H \times W \times 3}$ with rich emotional expressions. The model consists of three main parts: the generative network architecture, the emotion-aware multimodal feature disentanglement and fusion framework, and the emotion consistency loss function. (1) The generative network architecture takes a noise latent vector and a reference image as inputs and progressively generates talking face images through multiple rounds of feature extraction and reconstruction with the help of components such as the denoising U-Net and the reference U-Net (Section 3.2). (2) The emotion-aware multimodal feature disentanglement and fusion framework feeds the driving audio, the reference image, and the disentangled action units into the audio encoder, face encoder, and AU encoder, respectively, and uses a gating mechanism to guide the hierarchical fusion of multimodal features and establish multimodal associations (Section 3.3). (3) The emotion consistency loss function measures the difference between the emotional expression of the generated face and the expected emotion; combined with the emotion scores output by HSEmotionRecognizer, it further improves the vividness of the generated expressions (Section 3.4). The overall structure of the model is shown in Figure 2.

3.2. Generative Model Network Architecture

Denoising U-Net Network. The core objective of the Denoising U-Net is to denoise multi-frame latent representations that have been corrupted by noise under varying conditions (as illustrated in Figure 2). This network builds upon the architecture of Stable Diffusion v1.5 [39] and integrates three distinct attention mechanisms into each transformer block: reference attention, audio attention, and temporal attention, which enables deep fusion of multimodal features across spatial and temporal dimensions. The reference attention module is designed to align key features between the current video frame and the reference image. By using the features extracted from the Reference U-Net as keys and values, it guides the Denoising U-Net to accurately preserve static information such as facial details and background textures, thus maintaining identity consistency and scene coherence throughout the generated sequence. To model the correlation between visual and audio modalities, the audio attention mechanism captures spatially aligned audio–visual features. For instance, rhythm variations in the audio can correspond to lip movements or gesture amplitudes in the video, and spatial attention weighting enables semantic alignment between visual content and sound, resulting in more expressive and emotionally synchronized audiovisual outputs. Furthermore, to handle temporal dependencies across video frames, the temporal attention mechanism utilizes a self-attention structure to analyze motion trajectories and dynamic patterns over time. Through frame-wise feature similarity computation, this approach effectively models sophisticated temporal dynamics, including motion continuity and illumination variations, while simultaneously mitigating visual artifacts like flickering and abrupt frame transitions to enhance the generated video’s temporal consistency.
Reference U-Net. The Reference U-Net operates as a feature provider for the Denoising U-Net, leveraging reference image information to guide video generation while preserving facial textures and background details (Figure 3). Architecturally synchronized with the Denoising U-Net, its transformer blocks employ self-attention to extract reference features that serve as keys/values for the Denoising U-Net’s attention layers. Crucially, the Reference U-Net processes the clean reference image only once per diffusion pass, preventing noise-induced feature corruption while optimizing computational efficiency—particularly advantageous for extended video generation. The design replaces conventional text prompts with null tokens in cross-attention layers, eliminating textual semantic interference and ensuring exclusive focus on visual features (e.g., facial geometry, background composition). Both U-Nets maintain identical spatial hierarchies (64 × 64, 32 × 32, etc.), enabling natural feature integration through concatenation or weighted fusion. This architectural symmetry ensures semantic alignment across network depths; deeper layers capture identity-level features, while shallower layers preserve fine details (e.g., facial hair). For video synthesis, the system extracts multi-level features from a frozen Reference U-Net (initial frame as reference), while the Denoising U-Net sequentially processes noisy latents using these reference features combined with audio embeddings and temporal context.
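As a rough illustration of how the three attention mechanisms could be composed inside one transformer block, consider the PyTorch sketch below; the feature dimensions, the use of `nn.MultiheadAttention`, and the reshaping scheme are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DenoisingTransformerBlock(nn.Module):
    """Illustrative composition of reference, audio, and temporal attention.
    Dimensions and module layout are assumptions for exposition."""

    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temp_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, z, ref_feats, audio_feats):
        # z:           (B, F, HW, C) noisy latent tokens for F video frames
        # ref_feats:   (B, HW, C)    features from the frozen Reference U-Net (keys/values)
        # audio_feats: (B, F, Ta, C) audio embeddings aligned to each frame
        B, F, HW, C = z.shape
        zs = z.reshape(B * F, HW, C)
        ref = ref_feats.unsqueeze(1).expand(B, F, HW, C).reshape(B * F, HW, C)
        aud = audio_feats.reshape(B * F, -1, C)
        zs = zs + self.ref_attn(self.norms[0](zs), ref, ref)[0]      # identity / background consistency
        zs = zs + self.audio_attn(self.norms[1](zs), aud, aud)[0]    # audio-visual alignment
        # Temporal self-attention: tokens at the same spatial location attend across frames
        zt = zs.reshape(B, F, HW, C).permute(0, 2, 1, 3).reshape(B * HW, F, C)
        zt = zt + self.temp_attn(self.norms[2](zt), self.norms[2](zt), self.norms[2](zt))[0]
        return zt.reshape(B, HW, F, C).permute(0, 2, 1, 3)           # back to (B, F, HW, C)
```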

3.3. Emotion-Aware Multimodal Feature Disentanglement

Most existing diffusion-based methods [15,18,40] adopt cross-attention mechanisms to incorporate audio guidance into video generation. Others utilize a single, manually defined emotion label to generate emotionally expressive videos during inference [9,10]. However, cross-attention mechanisms rely on fixed audio features, which limits the depth of interaction between audio and video during the diffusion process. At the same time, using a static, human-defined emotion label for the entire video fails to capture complex emotional transitions, resulting in unnatural facial expressions.
To address these issues, we design a novel emotion-aware multimodal feature disentanglement and fusion framework to enhance the richness of facial expressions. As illustrated in Figure 3, the framework consists of two key strategies:

3.3.1. AU-Guided Expression Decoupling Module

Traditional methods for handling facial expressions often rely on features extracted from a single modality (such as images or videos), making it difficult to fully capture the subtle variations in facial expressions. The Facial Action Coding System (FACS) is a comprehensive and objective system for describing facial movements [41]. It defines a set of basic facial action units, each representing specific facial muscle activity (such as eyebrow raising or lip corner pulling) [42,43]. Therefore, we introduce a novel fine-grained expression decoupling module based on facial action units, which extracts AU features from a reference face image and decomposes complex expressions into multiple independent local action units.
We introduce the OpenFace tool [44] (version 2.2.0; https://github.com/TadasBaltrusaitis/OpenFace, accessed on 29 June 2025). By analyzing video frames, it detects facial keypoints and estimates the activity levels of AUs, converting complex facial expressions into manipulable data. AUs do not exist independently but are interrelated and jointly contribute to forming complex facial expressions. To effectively learn the continuity and combinability of AU features, an AU encoder based on a multi-layer perceptron (MLP) and residual connections is designed. This encoder can capture the non-linear relationships among AUs while preserving the details of the original AU information through residual connections. Specifically, the input of the AU encoder is the original AU vector $y_{AU}$, and the output is the AU feature representation $h_{AU}$ mapped into a high-dimensional latent space through the residual encoder. The encoding process is shown in Equation (1):
$h_{AU} = \mathrm{MLP}(y_{AU}) + y_{AU}$. (1)
The MLP consists of two fully connected layers with a ReLU activation function applied in between, as defined in Equation (2):
$\mathrm{MLP}(y_{AU}) = W_2 \cdot \mathrm{ReLU}(W_1 y_{AU} + b_1) + b_2$, (2)
where $W_1$, $W_2$ and $b_1$, $b_2$ are the weight matrices and bias terms of the MLP, respectively. With this design, the model can learn the continuity of different AU features and model the dependency relationships among AUs. Meanwhile, low-dimensional AU representations are mapped to a high-dimensional continuous latent space through a multi-layer perceptron, enabling fine-grained expression control and generating more natural facial expressions.
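A minimal PyTorch sketch of the residual AU encoder in Equations (1) and (2) is given below; the AU dimensionality (17 intensities, as OpenFace typically outputs) and the hidden width are assumptions. Note that the residual sum in Equation (1) requires the MLP output to match the input dimensionality; any lift to a higher-dimensional latent space would be applied by a subsequent projection.

```python
import torch
import torch.nn as nn

class AUEncoder(nn.Module):
    """Residual MLP encoder for Action Unit vectors, following Eqs. (1)-(2).
    The AU dimensionality and hidden width are illustrative assumptions."""

    def __init__(self, au_dim=17, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(au_dim, hidden_dim),   # W1, b1
            nn.ReLU(),
            nn.Linear(hidden_dim, au_dim),   # W2, b2
        )

    def forward(self, y_au):
        # h_AU = MLP(y_AU) + y_AU: the residual keeps the raw AU intensities accessible
        return self.mlp(y_au) + y_au

# Usage: y_au holds AU intensity values (e.g., from OpenFace) for a batch of frames
y_au = torch.rand(8, 17)
h_au = AUEncoder()(y_au)   # (8, 17) AU embedding
```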

3.3.2. Multimodal Feature Hierarchical Fusion Module

In multimodal data processing, audio and visual inputs originate from different modalities, resulting in discrepancies in their feature representations and temporal/spatial distributions. Traditional single-level fusion methods struggle to bridge the dimensional gap between the temporal nature of speech signals and the spatial nature of visual features [45]. We propose a hierarchical multimodal feature fusion module to achieve efficient alignment between speech signals and visual features (as shown in Figure 3). It establishes audio–visual–AU multimodal associations in the spatial dimension and reorganizes features at three semantic levels. Additionally, it maintains temporal coherence by using cross-attention mechanisms.
To achieve the above-mentioned goals, we construct a hierarchy-specific feature projection space. In the $l$-th Transformer block of the diffusion model, the input latent feature $z_l \in \mathbb{R}^{B \times H \times W \times C}$ is decomposed into three motion-latent spaces: the global motion space, the lip space, and the expression space. The global motion space $z_l^{\mathrm{full}}$, the lip space $z_l^{\mathrm{lip}}$, and the expression space $z_l^{\mathrm{exp}}$ are expressed in Equations (3)–(5):
$z_l^{\mathrm{full}} = \mathrm{Attn}_0(\mathrm{Norm}(z_l), A_{\mathrm{audio}}) \otimes M_{\mathrm{full}}$, (3)
$z_l^{\mathrm{lip}} = \mathrm{Attn}_1(\mathrm{Norm}(z_l), A_{\mathrm{audio}}) \otimes M_{\mathrm{lip}}$, (4)
$z_l^{\mathrm{exp}} = \mathrm{Attn}_2(\mathrm{Norm}(z_l), A_{\mathrm{audio}}) \otimes M_{\mathrm{exp}}$, (5)
where $M_{\mathrm{full}} \in \{0,1\}^{H \times W}$ is the full-face region mask, $\mathrm{Attn}_i$ denotes cross-attention conditioned on the audio embedding $A_{\mathrm{audio}}$, and $M_{\mathrm{lip}}$, $M_{\mathrm{exp}}$ are generated from the facial keypoints extracted by MediaPipe.
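As a concrete reading of Equations (3)–(5), the sketch below applies audio-conditioned cross-attention to the normalized latent and then masks each branch; tensor shapes, the head count, and the use of `nn.MultiheadAttention` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SubspaceDecomposition(nn.Module):
    """Sketch of Eqs. (3)-(5): audio cross-attention followed by region masking.
    Dimensions and module choices are illustrative assumptions."""

    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Attn_0 / Attn_1 / Attn_2 for the global, lip, and expression branches
        self.attn = nn.ModuleList(nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3))

    def forward(self, z_l, a_audio, m_full, m_lip, m_exp):
        # z_l: (B, HW, C) latent tokens; a_audio: (B, Ta, C) audio embedding;
        # masks: (B, HW, 1) binary region masks (lip/expression masks from facial keypoints)
        zn = self.norm(z_l)
        z_full = self.attn[0](zn, a_audio, a_audio)[0] * m_full   # Eq. (3)
        z_lip = self.attn[1](zn, a_audio, a_audio)[0] * m_lip     # Eq. (4)
        z_exp = self.attn[2](zn, a_audio, a_audio)[0] * m_exp     # Eq. (5)
        return z_full, z_lip, z_exp   # z_exp is later enriched with AU features (Eq. (6))
```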
To achieve refined expression generation, we introduce Action Unit (AU) feature guidance in the expression semantics, establish fine-grained cross-modal associations, and construct an AU-enhanced expression space $z_l^{\mathrm{au\text{-}exp}}$. This feature space further incorporates the AU feature vector $h_{AU}$ on the basis of the expression space $z_l^{\mathrm{exp}}$, thereby endowing the generated expressions with enhanced structural integrity and physiological plausibility, which is encoded as shown in Equation (6):
$z_l^{\mathrm{au\text{-}exp}} = \mathrm{Fusion}\!\left(\mathrm{Attn}_1(\mathrm{Norm}(z_l), A_{\mathrm{audio}}),\, h_{AU}\right) \otimes M_{\mathrm{exp}}$. (6)
Among them, the Fusion module implements the audio–AU feature interaction through dual-channel attention. The specific process is shown in Equations (7)–(10):
$Q_{au} = \mathrm{Linear}(h_{AU}) \in \mathbb{R}^{B \times N \times D}$, (7)
$K_{audio} = \mathrm{Linear}\!\left(\mathrm{Attn}_2(z_l^{\mathrm{exp}})\right) \in \mathbb{R}^{B \times HW \times D}$, (8)
$\alpha_{ij} = \mathrm{Softmax}\!\left(\frac{Q_{au,i}\, K_{audio,j}^{\top}}{\sqrt{D}}\right)$, (9)
$z_l^{\mathrm{au\text{-}exp}} = \sum_{j} \alpha_{ij} V_j + \mathrm{MLP}(h_{AU})$, (10)
In Equation (10), $V_j = \mathrm{Conv}_{1\times 1}(z_l^{\mathrm{exp}})$ is the visual feature at spatial position $j$, and $\alpha_{ij}$ quantifies the regulation strength of the $i$-th AU on region $j$. This mechanism can be explained as follows. A specific AU (such as AU04, corrugator supercilii) enhances the latent feature response of the relevant muscle region in a targeted manner through the attention weight $\alpha$. When AU04 is activated, the attention weight $\alpha$ selectively amplifies feature responses in the corresponding facial region (the glabellar and medial eyebrow areas). The dual-pathway attention [46] mechanism works collaboratively: (1) Temporal–Spatial Alignment Pathway: It establishes dynamic associations between audio spectrograms and facial Action Units (AUs) via cross-modal attention, resolving temporal asynchrony issues. Specifically, given audio features $f_a \in \mathbb{R}^{T \times d}$ and visual features $f_v \in \mathbb{R}^{H \times W \times d}$, the spatio-temporal attention weights are computed as in Equation (11):
$\alpha_{t,i,j} = \mathrm{softmax}\!\left(\frac{Q_a(t)\, K_v(i,j)^{\top}}{\sqrt{d}}\right)$, (11)
where t denotes the time step, and ( i , j ) represents spatial positions. (2) Semantic Enhancement Pathway: This pathway hierarchically aggregates AU features using local–global attention, enabling feature reorganization across three semantic levels: global motion, lips, and micro-expressions. It employs a gating mechanism defined by Equation (12):
$g = \sigma\!\left(w_g \left[\, f_{au} \,\|\, f_{vis} \,\right]\right)$, (12)
where $[\, f_{au} \,\|\, f_{vis} \,]$ denotes the fused feature formed by concatenating the AU feature $f_{au}$ and the visual feature $f_{vis}$ along the channel dimension.
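The dual-channel interaction can be sketched as follows: the first path implements the scaled dot-product AU attention of Equations (7)–(10) (Equation (11) is the analogous computation with per-timestep audio queries), and the second path applies the concatenation gate of Equation (12). Projection sizes, the pooling of AU features, and the way the gate output is consumed are assumptions, since these details are not fully specified above.

```python
import torch
import torch.nn as nn

class DualPathwayAUFusion(nn.Module):
    """Sketch of Eqs. (7)-(12): scaled dot-product AU attention plus a concatenation gate.
    Projection sizes and the use of the gate output are illustrative assumptions."""

    def __init__(self, dim=128):
        super().__init__()
        self.q_au = nn.Linear(dim, dim)       # Eq. (7): queries from AU embeddings
        self.k_vis = nn.Linear(dim, dim)      # Eq. (8): keys from the expression-branch features
        self.v_vis = nn.Conv1d(dim, dim, 1)   # 1x1 projection standing in for Conv_{1x1} in Eq. (10)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(2 * dim, 1)     # w_g in Eq. (12), applied to [f_au || f_vis]

    def forward(self, h_au, z_exp):
        # h_au: (B, N, D) per-AU embeddings; z_exp: (B, HW, D) expression-branch visual features
        d = h_au.shape[-1]
        q = self.q_au(h_au)                                       # (B, N, D)
        k = self.k_vis(z_exp)                                     # (B, HW, D)
        v = self.v_vis(z_exp.transpose(1, 2)).transpose(1, 2)     # (B, HW, D), values of Eq. (10)
        alpha = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # Eq. (9): (B, N, HW)
        z_au_exp = alpha @ v + self.mlp(h_au)                     # Eq. (10): attended values + MLP(h_AU)
        # Semantic-enhancement gate, Eq. (12); here it simply reweights AU context against visuals
        pooled_au = h_au.mean(dim=1, keepdim=True).expand_as(z_exp)           # broadcast AU context
        g = torch.sigmoid(self.gate(torch.cat([pooled_au, z_exp], dim=-1)))   # (B, HW, 1)
        return z_au_exp, g * z_exp + (1 - g) * pooled_au
```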
To achieve spatially adaptive feature fusion, we designed a set of heterogeneous activation functions to generate region weight maps. Meanwhile, the final output fuses the contributions of each branch through a gating mechanism, realizing multi-scale feature aggregation. The specific formulas are shown in Equations (13)–(16):
$W_l^{\mathrm{full}} = \mathrm{Sigmoid}\!\left(\mathrm{Conv}_{3\times 3}(z_l^{\mathrm{full}})\right)$, (13)
$W_l^{\mathrm{lip}} = \mathrm{Tanh}\!\left(\mathrm{Conv}_{3\times 3}(z_l^{\mathrm{lip}})\right)$, (14)
$W_l^{\mathrm{au\text{-}exp}} = \mathrm{Softplus}\!\left(\mathrm{Conv}_{3\times 3}(z_l^{\mathrm{au\text{-}exp}})\right)$, (15)
$z_l^{\mathrm{out}} = W_l^{\mathrm{full}} \otimes z_l^{\mathrm{full}} + W_l^{\mathrm{au\text{-}exp}} \otimes z_l^{\mathrm{au\text{-}exp}} + W_l^{\mathrm{lip}} \otimes z_l^{\mathrm{lip}}$, (16)
In Equation (16), $\otimes$ represents element-wise multiplication. The weight map $W_l^{*}$ captures long-range spatial dependencies through dilated convolution. This module can achieve accurate mouth-shape synchronization and natural expression changes driven by audio while preserving facial features.
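The region-weighted aggregation of Equations (13)–(16) maps directly onto a few convolution and activation calls; in the sketch below, the channel count and dilation setting are illustrative assumptions (dilation is included only because the text mentions dilated convolution).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionWeightedFusion(nn.Module):
    """Sketch of Eqs. (13)-(16): branch-specific weight maps and element-wise aggregation.
    Channel count and dilation settings are illustrative assumptions."""

    def __init__(self, channels=320, dilation=2):
        super().__init__()
        pad = dilation  # keeps spatial size for a 3x3 kernel
        self.conv_full = nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation)
        self.conv_lip = nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation)
        self.conv_exp = nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation)

    def forward(self, z_full, z_lip, z_au_exp):
        # each input: (B, C, H, W) features from the corresponding branch
        w_full = torch.sigmoid(self.conv_full(z_full))    # Eq. (13)
        w_lip = torch.tanh(self.conv_lip(z_lip))          # Eq. (14)
        w_exp = F.softplus(self.conv_exp(z_au_exp))       # Eq. (15)
        # Eq. (16): element-wise (Hadamard) weighted sum of the three branches
        return w_full * z_full + w_exp * z_au_exp + w_lip * z_lip
```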
Through the above two key strategies, our method has significantly improved the richness and naturalness of facial expression generation. The AU decoupling and encoding method can capture the detailed features of facial expressions, and the multimodal feature fusion module implements the effective integration of different modal information.

3.4. Emotion Consistency Loss

Traditional methods often lack an effective assessment of the emotional consistency of the generated facial expressions, so the results may not conform to the expected emotional state. To address this issue, we introduce the HSEmotionRecognizer algorithm to perform emotion recognition on the generated faces and optimize them through a designed loss function, ensuring that the generated faces are natural in expression and conform to the expected emotions. HSEmotionRecognizer is a deep-learning-based emotion recognition model that can identify seven basic emotions (such as happiness, sadness, and anger) from facial images [47].
The k-th generated facial image is input into HSEmotionRecognizer to obtain the predicted emotion distribution, which is then compared with the ground-truth emotion labels. First, emotion probability prediction is achieved. The emotion probability distribution is extracted through the HSEmotionRecognizer with frozen parameters, as shown in Equation (17):
$p_k = f_{\mathrm{HSE}}(I_k) \in \mathbb{R}^{7}$, (17)
The seven-dimensional probability vector corresponds to the seven basic emotions: neutral, happy, sad, surprised, fearful, disgusted, and angry.
Then the maximum confidence loss is calculated. It computes the absolute deviation between the optimal predicted probability of each generated image frame and the theoretical maximum value (1.0), as shown in Equation (18):
$L_{\mathrm{emo}}(k) = \left\lVert\, 1 - \max(p_k) \,\right\rVert_{1}$, (18)
This design forces the generator to output facial expressions with clear emotional inclinations.
Finally, we implement the multi-frame consistency constraint, which conducts a temporal average of the losses over S consecutive frames, as shown in Equation (19):
$L_{\mathrm{emo}} = \frac{1}{S} \sum_{k=1}^{S} L_{\mathrm{emo}}(k)$, (19)
By minimizing this loss function, we can optimize the generated facial expressions to make them more accurate and natural in emotional expression.
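A sketch of the loss in Equations (17)–(19) is shown below; the `emotion_model` argument stands in for a frozen HSEmotionRecognizer-style classifier whose exact interface is not assumed here, and the softmax is included only in case the classifier returns logits rather than probabilities.

```python
import torch

def emotion_consistency_loss(frames, emotion_model):
    """Sketch of Eqs. (17)-(19) for one clip of S generated frames.

    frames:        (S, 3, H, W) generated face images
    emotion_model: frozen classifier mapping a face image to 7 emotion scores; it stands in
                   for HSEmotionRecognizer, whose exact interface is an assumption here.
    """
    scores = emotion_model(frames)                      # recognizer parameters stay frozen
    p = torch.softmax(scores, dim=-1)                   # Eq. (17): p_k = f_HSE(I_k) in R^7
    per_frame = (1.0 - p.max(dim=-1).values).abs()      # Eq. (18): distance from a confident prediction
    return per_frame.mean()                             # Eq. (19): temporal average over S frames

# Freezing the recognizer so that gradients flow only into the generator:
# for param in emotion_model.parameters():
#     param.requires_grad_(False)
```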

4. Results

4.1. Datasets and Metrics

Datasets. We conduct experiments on two public datasets: MEAD and HDTF. MEAD is a multimodal emotional dataset that contains high-resolution audio–video pairs, covering various emotion categories such as happiness, sadness, and anger. It includes participants of different ages, genders, and ethnicities, making it suitable for emotion recognition, expression generation, and multimodal fusion research. HDTF focuses on high-definition talking face generation. It provides high-quality videos with synchronized audio, and its large-scale, multilingual data supports cross-speaker generalization and lip-sync research. The two datasets are complementary in application; MEAD emphasizes emotion diversity and fine-grained analysis, while HDTF focuses on facial motion and audio alignment in natural dialogue scenarios. We perform training and testing on MEAD, and cross-domain testing on HDTF, to validate the model’s robustness in emotional expression and general talking face generation.
Metrics. The proposed framework adopts a multidimensional evaluation system, which is systematically validated through three core metrics. (1) Generation quality at the image level, measured by the Fréchet Inception Distance (FID). (2) Lip-sync accuracy, evaluated by two multimodal metrics: LSE-C (Lip Sync Error–Confidence) and LSE-D (Lip Sync Error–Distance). The alignment between speech and lip movements is assessed using a bidirectional mechanism; higher LSE-C and lower LSE-D indicate better synchronization. (3) Expression semantic fidelity, measured by Expression-FID (E-FID), which quantifies the distribution difference of facial expression features between generated and real videos, evaluating the semantic consistency of micro-expressions.
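For reference, FID (and, when computed over facial expression features, E-FID) is the standard Fréchet distance between Gaussian fits of two feature distributions; the sketch below shows that computation, with the choice of feature extractor left open.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussian fits of two feature sets (the core of FID / E-FID).

    feats_real, feats_gen: (N, D) arrays of features, e.g. Inception pool features for FID
    or facial-expression features for E-FID (the extractor choice is left open here).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```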

4.2. Implementation Details

All training and inference experiments are conducted on a computing cluster equipped with four NVIDIA L20 GPUs (NVIDIA, Santa Clara, CA, USA). This framework employs a two-stage training process. Both the initial stage and the enhancement stage undergo 30,000 iterations of training, with a fixed batch size of 4 and a video resolution of 512 × 512. In the initial stage, the model primarily learns to generate basic video frames from static images, focusing on mastering basic motion patterns and video frame generation capabilities. In the enhancement stage, the model further optimizes its generation abilities, especially the coherence of consecutive frames. This is achieved by introducing latent space constraints from the motion module, using the first two real video frames as continuity constraints to ensure temporal coherence in the generated video sequence. Regarding key training configurations, both stages use a constant learning rate of $1 \times 10^{-5}$ to ensure stable convergence and avoid overfitting. Additionally, the initialization weights of the motion module are inherited from the pre-trained AnimateDiff model, providing the model with a good starting point.
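For readability, the stated hyperparameters can be summarized in a configuration sketch; the key names and structure below are hypothetical and do not reflect the authors' actual code or configuration files.

```python
# Hypothetical configuration sketch summarizing the setup described above;
# key names and structure are illustrative, not taken from the authors' code.
train_config = {
    "gpus": 4,                          # NVIDIA L20
    "resolution": (512, 512),
    "batch_size": 4,
    "learning_rate": 1e-5,              # constant in both stages
    "stage1": {
        "iterations": 30_000,
        "objective": "frame generation from a static reference image",
    },
    "stage2": {
        "iterations": 30_000,
        "objective": "temporal coherence via the motion module",
        "context_frames": 2,            # first two real frames as continuity constraints
        "motion_module_init": "pre-trained AnimateDiff weights",
    },
}
```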

4.3. Comparisons with Other Methods

We selected the following representative open-source models as baseline methods for comparative experiments: Wav2Lip (2020) [3], PC-AVS (2021) [6], EAMM (2022) [12], SadTalker (2023) [22], DreamTalk (2023) [48], AniPortrait (2024) [49], EchoMimic (2024) [15], Hallo (2024) [18]. Early audio-driven facial generation methods mainly focused on improving the accuracy of lip-sync. For example, Wav2Lip combined the adversarial training of audio features and video frames, achieving high-precision lip-matching for the first time and greatly enhancing the visual–auditory consistency [3]. Building on this, PC-AVS introduced a pose decoupling design, separating pose and motion, and improving the coordination between head movement and speech to make the movements more natural [6].
Subsequent research began to focus on the dynamic expression of facial expressions and emotions. EAMM proposed region-specific control of facial muscle movements, enabling fine-grained control of facial expression changes and making the expressions richer and more natural [12]. SadTalker introduced an emotion intensity parameter to achieve controllable expression generation, providing greater flexibility and controllability [22]. In recent years, diffusion models have gradually become the mainstream technology. DreamTalk and AniPortrait used a temporal diffusion framework to generate coherent facial movements, capturing the temporal coherence of facial movements and enhancing the realism of the generated videos [48,49]. EchoMimic enhanced the diversity of expressions through multimodal feature fusion, combining audio and video features to make expressions more comprehensively reflect emotional information [15]. Hallo achieved enhanced control over expression diversity and pose changes, optimizing the generation efficiency and shortening the generation time [18].

4.4. Qualitative Results

As shown in Figure 4, the talking face images generated by the proposed model are presented and compared with state-of-the-art methods (Wav2Lip, PC-AVS, EAMM, SadTalker, DreamTalk, AniPortrait, EchoMimic, Hallo). Additionally, the reference videos (from the HDTF dataset) on the right side of the dotted line were unseen during the training process to ensure the fairness and objectivity of the comparison.
Figure 4 demonstrates the superior performance of our model across three key aspects: expression naturalness, lip-sync accuracy, and visual quality. Comparative methods including PC-AVS, EAMM, and DreamTalk exhibit significant blurring and detail loss, particularly in facial micro-features such as forehead wrinkles and skin textures (Positions 1–3), while Wav2Lip introduces noticeable chin artifacts that compromise visual fidelity. Although Wav2Lip achieves relatively better lip synchronization, its mouth movements still deviate from the ground truth (Position 4), with other methods showing more pronounced viseme articulation errors (Positions 5–8). Regarding expression generation, PC-AVS and DreamTalk produce overly rigid facial animations with minimal eye movement, while EAMM, SadTalker, Hallo, and EchoMimic generate unnatural expressions—SadTalker being limited to basic blinking motions (Positions 9–10). In contrast, our method maintains exceptional visual quality with preserved facial details and artifact-free rendering, achieves photorealistic lip-sync alignment, and captures subtle expression dynamics including eyebrow micro-movements and natural eyelid kinematics, collectively producing vivid facial animations.

4.5. Quantitative Results

To achieve a more comprehensive and objective evaluation of the model’s performance, we conduct quantitative experiments on the MEAD and HDTF datasets and compare the experimental results with other advanced methods. Table 1 and Table 2 show the detailed quantitative experimental results, covering visual quality indicators (FID), lip–audio synchronization indicators (LSE-C, LSE-D), and expression semantics indicators (E-FID). Here, “↓” indicates that a smaller value is better, and “↑” indicates that a larger value is better. The best results are highlighted in bold, and the second-best results are highlighted with an underline.
Table 1 shows the quantitative results of the method on the MEAD dataset. The method demonstrates significant advantages in visual quality and expression generation. From the perspective of core indicators, the proposed method has an FID value as low as 43.052 (↓), a reduction of 3.639 compared to the second-best method, Hallo, indicating that the distribution difference between the generated video and the real data is smaller; the E-FID value of 2.403 (↓) leads by a margin of 0.411, highlighting its breakthrough in the fidelity of expression semantics. In terms of lip–audio synchronization, the LSE-C indicator of 6.781 (↑) is close to the optimal 6.535 of Wav2Lip, and the LSE-D value of 7.962 (↓) is only 0.037 higher than that of Wav2Lip, reaching a top-level accuracy in synchronization. Although AniPortrait shows an abnormal performance in LSE-C (3.233), its FID and E-FID are severely degraded, reflecting that the method achieves a better balance among multiple indicators.
Table 2 shows the quantitative results of the method on the HDTF dataset. For the more challenging HDTF dataset, the method further validates its robustness. With an FID value of 35.348 (↓), it surpasses the second-best method, Hallo, by 1.632, demonstrating stable visual quality across datasets; the E-FID value of 3.521 (↓) also remains among the top, only 0.614 higher than that of Hallo, but with more outstanding lip–audio synchronization performance. The LSE-C value of 7.501 (↑) is almost on par with Hallo’s 7.535, while the LSE-D value of 7.628 (↓) is lower than Hallo’s 7.728, resulting in a smaller synchronization error. It is worth noting that the method ranks among the top two in both the FID and lip–audio synchronization indicators, and has no significant weaknesses in the four indicators, indicating its superior comprehensive performance in complex scenarios. Compared with early methods such as Wav2Lip, the scheme has achieved a dual improvement in synchronization accuracy and expression naturalness.

4.6. Ablation Study

An ablation study is conducted on the MEAD dataset to validate the effectiveness of the core components of the proposed method and assess their contributions to overall performance. The experimental settings and corresponding quantitative results are presented in Table 3, while qualitative comparisons are illustrated in Figure 5. Here, MFDF and Loss denote the key modules introduced in this work. In the experimental design, the symbol ✓ indicates the inclusion of the corresponding module, and the symbol × indicates its removal. Specifically, setting ‘a’ is the baseline model (containing only the basic framework), while setting ‘c’ represents our complete method (including both the MFDF and Loss modules). To comprehensively evaluate the performance under different configurations, we employ four metrics: FID, LSE-C, LSE-D, and E-FID, covering three dimensions: visual quality, lip-sync, and expression consistency. ↓ indicates that a smaller value is better, and ↑ indicates that a larger value is better. The best results are highlighted in bold, and the second-best results are underlined.
The effectiveness of the emotion perception framework. To verify the effectiveness of the emotion perception (MFDF) framework, we conducted comparative experiments. As shown in Table 3, introducing the MFDF framework (Setting b) led to notable improvements across several key metrics compared to the baseline model (Setting a). The FID score dropped from 46.691 to 45.179 (↓ 3.3%), indicating that the multimodal feature decoupling and fusion mechanism effectively improves the realism of static facial textures. The E-FID score also decreased by 6.3% (from 2.814 to 2.637), suggesting a significant reduction in the spatial distance of the features related to emotions. Additionally, the LSE-C score increased by 2.1% (from 6.561 to 6.702), demonstrating that the emotion perception module enhances lip-synchronization accuracy by establishing a stronger mapping between emotion and lip shape. Although the LSE-D score rose slightly by 2.3% (from 8.201 to 8.389), this may result from increased diversity in lip movement patterns due to the added emotion features. According to qualitative observations in Figure 5, this change is linked to the MFDF module activating a richer range of subtle facial expressions—such as natural eyelid tremors and dynamic movements of the mouth corners. These physiological-level details, while diverging from standard expression templates, significantly enhance the perceived naturalness of the expressions.
Effectiveness of the emotion-consistency loss function. To verify the effectiveness of the emotion-consistency loss function, the experimental results are shown in Table 3 and Figure 5. When the loss function is introduced based on Setting b (Setting c), all indicators are comprehensively optimized; FID is further reduced by 4.7% (from 45.179 to 43.052), indicating that the loss function constrains the geometric distance between the generated features and the target emotion. The E-FID indicator drops sharply from 2.637 to 2.403 (↓ 8.9%), suggesting that the distribution alignment between the generated expression and the expected emotion label is significantly improved. The LSE indicators exhibit a “double-optimal” characteristic. While LSE-C increases by 1.2% (from 6.702 to 6.781), LSE-D decreases by 5.1% (from 8.389 to 7.962), indicating that the loss function improves synchronization accuracy while maintaining the diversity of lip movements by decoupling emotion and speech features. The qualitative comparison in Figure 5 shows that the complete model (Setting c) can accurately present the characteristic of tightly closed lips when pronouncing the /p/ phoneme, and the degree of downward sloping of the corners of the mouth is 15–20% greater than that in Setting b, confirming that the loss function establishes an effective cross-modal supervision mechanism.

4.7. User Study

To complement the quantitative evaluation metrics, we conducted a user study to assess the naturalness and expressiveness of the generated talking face videos from different models. The goal was to evaluate how well the generated videos reflect realistic and emotionally expressive facial dynamics from a human perceptual perspective.
We randomly selected representative frames from videos generated by four competing methods, including ours, and anonymized them by removing any method-related or identity-revealing information. A total of 24 participants were recruited for this study, among which 12 had prior background in computer vision, and 12 had no relevant technical experience, to ensure both professional judgment and unbiased objectivity.
Each participant was shown a randomized set of image groups corresponding to the same audio input, and asked to select the sample that was (a) the most natural and (b) the most expressive, without knowing which model produced which output. Participants were not restricted in decision time and were instructed to base their decisions purely on visual impression.
After collecting all responses, we computed the average ranking scores for each model across both evaluation aspects. Our method ranked first in naturalness and second in expressiveness, and achieved the highest overall average score across both criteria. Detailed user scores and rankings are presented in Table 4. This result suggests that our approach achieves a strong balance between visual realism and emotional clarity, as perceived by human observers.
This subjective evaluation demonstrates that the proposed model not only outperforms existing methods in standard metrics (FID, E-FID, LSE-C/D), but also produces results that are perceptually preferred by users, thus reinforcing its practical applicability in human-centered scenarios such as virtual avatars and interactive agents.

5. Discussion

Experimental results demonstrate that the proposed emotion-aware framework significantly enhances the expressiveness and naturalness of generated facial expressions through fine-grained emotion feature modeling. By independently modeling and fusing local Action Units (AUs) with global motion and lip features, our method achieves a more nuanced and realistic portrayal of emotions compared to existing approaches. Meanwhile, the emotion-consistency loss function further improves the accuracy and fidelity of generated expressions by enforcing consistency between the generated facial expressions and the intended emotional cues, effectively decoupling multimodal feature constraints that might hinder natural expression. The synergistic optimization of these two core components enables the final model to achieve optimal performance across three key evaluation metrics: visual quality, lip-sync accuracy, and expression consistency, as evidenced by both quantitative and qualitative results presented in Section 4.
Our findings align with the hypothesis that fine-grained modeling of facial action units and the incorporation of an explicit emotion consistency constraint are crucial for generating emotionally expressive talking faces. This builds on previous research that highlighted the importance of considering emotional cues in talking face generation [20,50], and extends it by demonstrating the effectiveness of a localized AU-driven approach combined with a discriminative loss function for emotion accuracy. The improved performance across visual quality and lip-sync accuracy also suggests that our multimodal feature fusion strategy effectively integrates different information streams without compromising other essential aspects of talking face generation.
Although our contributions improve the naturalness of emotional expression and generate vivid talking face results, there are still some limitations that warrant further investigation. First, high-resolution generation is constrained by GPU memory capacity. The current model supports a maximum resolution of 512 × 512, which is not sufficient for 4 K ultra-high-definition video generation. As high-resolution displays become more common, low-resolution videos may appear blurry and lose fine details, potentially limiting the practical application of our method in scenarios requiring high visual fidelity. Second, the cross-lingual generalization ability still needs further validation. Most experiments are based on English audio data, and the model has limited adaptability to other languages, especially tonal ones. The intricate relationship between audio prosody and emotional expression can vary significantly between languages, and our current model may not fully capture these nuances. In a global context, audio-driven talking face generation should be able to handle diverse language inputs to be truly versatile.

6. Conclusions

This paper introduces MF-ETalk, a unified framework designed to tackle the persistent challenge of generating emotionally expressive talking face videos from audio. While recent diffusion-based approaches have advanced visual quality, they often fall short in producing emotionally coherent expressions due to limited modeling of localized facial dynamics and weak cross-modal alignment. To address these limitations, MF-ETalk integrates emotional awareness directly into the generative process through a multimodal design. Specifically, the framework leverages Action Units (AUs) as semantically meaningful facial cues to model the nonlinear dynamics of emotional expressions. These AU features are disentangled and embedded into a hierarchical fusion architecture that captures fine-grained interactions among audio, visual, and motion streams. Furthermore, an emotion-consistency constraint guides the generation toward semantically aligned and temporally coherent facial movements. MF-ETalk provides a physiologically informed solution that mitigates the common issue of emotional averaging, while maintaining accurate speech–lip alignment. This makes it a strong candidate for real-world applications in human–computer interaction, digital avatars, and emotion-aware systems.
In future work, we aim to improve high-resolution generation through memory-efficient architectures, progressive generation, or super-resolution techniques. We also plan to enhance cross-lingual generalization by exploring language-agnostic emotion representations or incorporating language-specific cues. Furthermore, we will investigate the model’s ability to handle subtle and complex emotional variations and improve the robustness to variations in input audio and identity. Integrating external knowledge about facial expressions and emotions may further enhance the realism and controllability of the generated talking faces.

Author Contributions

X.W. provided and implemented the main idea of the research and managed the project. Y.H. and Y.L. wrote the first draft and edited the paper. X.G. provided suggestions and revised the paper. F.Y. (corresponding author) supervised the work. G.Z. contributed to the research design and data analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant number 62402026, and grant number 62176018; in part by the R&D Program of Beijing Municipal Education Commission under Grant KM202410016010; Innovation Project for Master’s Students of Beijing University of Civil Engineering and Architecture under Grant PG2025105.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

Thanks to my supervisors for their help.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUs: Action Units
GANs: Generative Adversarial Networks
DMs: Diffusion Models
MF-ETalk: Multimodal Feature-guided Emotional Talking face generation method
SDv1.5: Stable Diffusion v1.5
FACS: Facial Action Coding System
MLP: Multi-Layer Perceptron
AU: Action Unit
FID: Fréchet Inception Distance
LSE-C: Lip Sync Error–Confidence
LSE-D: Lip Sync Error–Distance
E-FID: Expression-FID
MFDF: Multimodal Feature Decoupling and Fusion (emotion-aware framework)

References

  1. Mitsea, E.; Drigas, A.; Skianis, C. A Systematic Review of Serious Games in the Era of Artificial Intelligence, Immersive Technologies, the Metaverse, and Neurotechnologies: Transformation Through Meta-Skills Training. Electronics 2025, 14, 649.
  2. Toshpulatov, M.; Lee, W.; Lee, S. Talking human face generation: A survey. Expert Syst. Appl. 2023, 219, 119678.
  3. Prajwal, K.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 484–492.
  4. Zhou, Y.; Han, X.; Shechtman, E.; Echevarria, J.; Kalogerakis, E.; Li, D. MakeItTalk: Speaker-aware talking-head animation. ACM Trans. Graph. 2020, 39, 1–15.
  5. Wang, S.; Li, L.; Ding, Y.; Yu, T.; Xia, Z.; Ma, L. Audio-driven One-shot Talking-head Generation with Natural Head Motion. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 1098–1105.
  6. Zhou, H.; Sun, Y.; Wu, W.; Loy, C.C.; Wang, X.; Liu, Z. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4176–4186.
  7. Wang, J.; Qian, X.; Zhang, M.; Tan, R.T.; Li, H. Seeing what you said: Talking face generation guided by a lip reading expert. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14653–14662.
  8. Zhong, W.; Fang, C.; Cai, Y.; Wei, P.; Zhao, G.; Lin, L.; Li, G. Identity-Preserving Talking Face Generation with Landmark and Appearance Priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9729–9738.
  9. Eskimez, S.E.; Zhang, Y.; Duan, Z. Speech driven talking face generation from a single image and an emotion condition. IEEE Trans. Multimed. 2022, 24, 3480–3490.
  10. Sinha, S.; Biswas, S.; Yadav, R.; Namboodiri, V.P.; Jawahar, C.; Kumar, R. Emotion-controllable generalized talking face generation. In Proceedings of the International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 1320–1327.
  11. Ji, X.; Zhou, H.; Wang, K.; Liu, W.W.; Hong, F.; Qian, C.; Loy, C.C. Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 14080–14089.
  12. Ji, X.; Zhou, H.; Wang, K.; Hong, F.; Wu, W.; Qian, C.; Loy, C.C. EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model. In Proceedings of the ACM SIGGRAPH 2022 Conference, Vancouver, BC, Canada, 7–11 August 2022; p. 61.
  13. Tian, L.; Wang, Q.; Zhang, B.; Bo, L. EMO: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 244–260.
  14. Cui, J.; Li, H.; Yao, Y.; Zhu, H.; Shang, H.; Cheng, K.; Zhou, H.; Zhu, S.; Wang, J. Hallo2: Long-duration and high-resolution audio-driven portrait image animation. arXiv 2024, arXiv:2410.07718.
  15. Chen, Z.; Cao, J.; Chen, Z.; Li, Y.; Ma, C. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 2403–2410.
  16. Zhang, J.; Mai, W.; Zhang, Z. EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion. arXiv 2024, arXiv:2409.07255.
  17. Dong, Z.; Hu, C.; Zhu, L.; Ji, X.; Lai, C.S. A Dual-Pathway Driver Emotion Classification Network Using Multi-Task Learning Strategy: A Joint Verification. IEEE Internet Things J. 2025, 12, 14897–14908.
  18. Xu, M.; Li, H.; Su, Q.; Shang, H.; Zhang, L.; Liu, C.; Wang, J.; Yao, Y.; Zhu, S. Hallo: Hierarchical audio-driven visual synthesis for portrait image animation. arXiv 2024, arXiv:2406.08801.
  19. Wang, K.; Wu, Q.; Song, L.; Liu, W.; Qian, C.; Loy, C.C. MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 700–717.
  20. Liang, B.; Pan, Y.; Guo, Z.; Zou, Y.; Yan, J.; Xie, W.; Yang, Y. Expressive talking head generation with granular audio-visual control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3387–3396.
  21. Zhang, Z.; Li, L.; Ding, Y.; Fan, C. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3661–3670.
  22. Zhang, W.; Cun, X.; Wang, X.; Zhang, Y.; Shen, X.; Guo, Y.; Shan, Y.; Wang, F. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 8652–8661.
  23. Wang, H.; Weng, Y.; Li, Y.; Zhou, H.; Qian, C.; Lin, D. EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion. arXiv 2024, arXiv:2411.16726.
  24. Zheng, L.; Zhang, Y.; Guo, H.; Pan, J.; Tan, Z.; Lu, J.; Tang, C.; An, B.; Yan, S. MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation. arXiv 2024, arXiv:2412.04448.
  25. Sauer, A.; Boesel, F.; Dockhorn, T.; Blattmann, A.; Esser, P.; Rombach, R. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In Proceedings of the SIGGRAPH Asia 2024 Conference Papers, Tokyo, Japan, 3–6 December 2024; pp. 1–11.
  26. Teng, Y.; Wu, Y.; Shi, H.; Ning, X.; Dai, G.; Wang, Y.; Li, Z.; Liu, X. Dim: Diffusion mamba for efficient high-resolution image synthesis. arXiv 2024, arXiv:2405.14224.
  27. Liu, Y.; Cun, X.; Liu, X.; Wang, X.; Zhang, Y.; Chen, H.; Liu, Y.; Zeng, T.; Chan, R.; Shan, Y. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 22139–22149.
  28. Bar-Tal, O.; Chefer, H.; Tov, O.; Herrmann, C.; Paiss, R.; Zada, S.; Ephrat, A.; Hur, J.; Liu, G.; Raj, A.; et al. Lumiere: A space-time diffusion model for video generation. In Proceedings of the SIGGRAPH Asia 2024 Conference Papers, Tokyo, Japan, 3–6 December 2024; pp. 1–11.
  29. Huang, Z.; Luo, D.; Wang, J.; Liao, H.; Li, Z.; Wu, Z. Rhythmic foley: A framework for seamless audio-visual alignment in video-to-audio synthesis. In Proceedings of the ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025; pp. 1–5.
  30. Truong, V.T.; Dang, L.B.; Le, L.B. Attacks and defenses for generative diffusion models: A comprehensive survey. ACM Comput. Surv. 2025, 57, 1–44.
  31. Stypułkowski, M.; Vougioukas, K.; He, S.; Zięba, M.; Petridis, S.; Pantic, M. Diffused heads: Diffusion models beat gans on talking-face generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 5091–5100. [Google Scholar]
  32. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974. [Google Scholar] [CrossRef]
  33. Hossain, M.R.; Hoque, M.M.; Dewan, M.A.A.; Hoque, E.; Siddique, N. AuthorNet: Leveraging attention-based early fusion of transformers for low-resource authorship attribution. Expert Syst. Appl. 2025, 262, 125643. [Google Scholar] [CrossRef]
  34. Shen, M.; Zhang, S.; Wu, J.; Xiu, Z.; AlBadawy, E.; Lu, Y.; Seltzer, M.; He, Q. Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation. In Proceedings of the ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  35. Fontanini, T.; Ferrari, C.; Lisanti, G.; Bertozzi, M.; Prati, A. Semantic image synthesis via class-adaptive cross-attention. IEEE Access 2025, 13, 10326–10339. [Google Scholar] [CrossRef]
  36. Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognit. 2024, 145, 109913. [Google Scholar] [CrossRef]
  37. Diao, X.; Cheng, M.; Barrios, W.; Jin, S. Ft2tf: First-person statement text-to-talking face generation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 28 February–4 March 2025; pp. 4821–4830. [Google Scholar]
  38. Jang, Y.; Kim, J.H.; Ahn, J.; Kwak, D.; Yang, H.S.; Ju, Y.C.; Kim, I.H.; Kim, B.Y.; Chung, J.S. Faces that speak: Jointly synthesising talking face and speech from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 8818–8828. [Google Scholar]
  39. Luo, S.; Tan, Y.; Patil, S.; Gu, D.; von Platen, P.; Passos, A.; Huang, L.; Li, J.; Zhao, H. Lcm-lora: A universal stable-diffusion acceleration module. arXiv 2023, arXiv:2311.05556. [Google Scholar]
  40. Wang, C.; Tian, K.; Zhang, J.; Guan, Y.; Luo, F.; Shen, F.; Jiang, Z.; Gu, Q.; Han, X.; Yang, W. V-express: Conditional dropout for progressive training of portrait video generation. arXiv 2024, arXiv:2406.02511. [Google Scholar]
  41. Jacob, G.M.; Stenger, B. Facial action unit detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7680–7689. [Google Scholar]
  42. Pumarola, A.; Agudo, A.; Martinez, A.M.; Sanfeliu, A.; Moreno-Noguer, F. Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 818–833. [Google Scholar]
  43. Liu, Z.; Liu, X.; Chen, S.; Liu, J.; Wang, L.; Bi, C. Multimodal Fusion for Talking Face Generation Utilizing Speech-Related Facial Action Units. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–24. [Google Scholar] [CrossRef]
  44. Baltrušaitis, T.; Robinson, P.; Morency, L.P. Openface: An open source facial behavior analysis toolkit. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10. [Google Scholar]
  45. Shen, S.; Zhao, W.; Meng, Z.; Li, W.; Zhu, Z.; Zhou, J.; Lu, J. Difftalk: Crafting diffusion models for generalized audio-driven portraits animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1982–1991. [Google Scholar]
  46. Liang, M.; Cao, X.; Du, J. Dual-pathway attention based supervised adversarial hashing for cross-modal retrieval. In Proceedings of the 2021 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju Island, Republic of Korea, 17–20 January 2021; pp. 168–171. [Google Scholar]
  47. Fonteles, J.; Davalos, E.; Ashwin, T.; Zhang, Y.; Zhou, M.; Ayalon, E.; Lane, A.; Steinberg, S.; Anton, G.; Danish, J.; et al. A first step in using machine learning methods to enhance interaction analysis for embodied learning environments. In Proceedings of the International Conference on Artificial Intelligence in Education, Recife, Brazil, 8–12 July 2024; pp. 3–16. [Google Scholar]
  48. Ma, Y.; Zhang, S.; Wang, J.; Wang, X.; Zhang, Y.; Deng, Z. DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models. arXiv 2023, arXiv:2312.09767. [Google Scholar]
  49. Wei, H.; Yang, Z.; Wang, Z. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv 2024, arXiv:2403.17694. [Google Scholar]
  50. Zhang, C.; Wang, C.; Zhang, J.; Xu, H.; Song, G.; Xie, Y.; Luo, L.; Tian, Y.; Guo, X.; Feng, J. DREAM-Talk: Diffusion-based realistic emotional audio-Driven method for single image talking face generation. arXiv 2023, arXiv:2312.13578. [Google Scholar]
Figure 1. The proposed emotional talking face generation method takes a reference image and driving audio as inputs. By leveraging an advanced talking face generative model, our approach surpasses prior works in both realism and expressiveness. As highlighted by the red arrows in our results, the generated videos exhibit superior emotion perception, delivering more natural and precise facial expressions. In contrast, previous methods (e.g., [18]) generate less refined emotional portrayals and fail to capture the nuanced dynamics of audio-driven facial expressions.
Figure 2. Overview of the proposed generation framework. The Multimodal Feature Decoupling and Fusion Framework Based on Emotion Perception (Section 3.3) contains two key modules that process the reference image and the driving audio. The resulting features are fed into a Denoising U-Net, which interacts with a Reference U-Net, while the emotion-consistency loss function (Section 3.4) enforces emotional coherence. Finally, a Face Decoder with multiple integrated attention mechanisms generates the output video.
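As a rough illustration of how the emotion-consistency constraint mentioned above can be imposed, the following is a minimal PyTorch sketch. It assumes a frozen, pretrained emotion encoder (the `emotion_encoder` argument is a hypothetical stand-in) that maps face frames to emotion embeddings; the exact formulation used in Section 3.4 may differ.

```python
import torch
import torch.nn.functional as F

def emotion_consistency_loss(generated_frames, target_frames, emotion_encoder):
    """Hedged sketch: penalize divergence between the emotion embeddings of
    generated frames and those of the target (ground-truth) frames.

    generated_frames, target_frames: tensors of shape (B, T, C, H, W).
    emotion_encoder: any frozen network mapping (N, C, H, W) -> (N, D) embeddings.
    """
    b, t, c, h, w = generated_frames.shape
    gen_emb = emotion_encoder(generated_frames.reshape(b * t, c, h, w))
    tgt_emb = emotion_encoder(target_frames.reshape(b * t, c, h, w))
    # Cosine distance keeps the constraint insensitive to embedding norm.
    return (1.0 - F.cosine_similarity(gen_emb, tgt_emb, dim=-1)).mean()
```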
Figure 3. Structure of the Multimodal Feature Decoupling and Fusion Framework based on Emotion Perception (Section 3.3). It consists of two main modules: the AU-Guided Expression Decouple Module, which applies cross-attention to the reference image and driving audio, and the Multimodal Feature Hierarchical Fusion Module, which fuses the resulting features through convolutional and activation layers to produce the final fused feature.
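The caption above describes the two modules only at block-diagram level. The snippet below is a minimal PyTorch sketch of the general pattern, cross-attention from audio tokens to AU/visual tokens followed by a convolutional fusion head; the module names (`AUCrossAttention`, `HierarchicalFusion`), dimensions, and layer choices are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AUCrossAttention(nn.Module):
    """Sketch of AU-guided decoupling: audio tokens attend to AU/visual tokens."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, au_tokens):
        # audio_tokens: (B, T_a, dim), au_tokens: (B, T_v, dim)
        attended, _ = self.attn(query=audio_tokens, key=au_tokens, value=au_tokens)
        return self.norm(audio_tokens + attended)

class HierarchicalFusion(nn.Module):
    """Sketch of the fusion head: concatenate modalities, mix with conv + activation."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, expr_tokens, motion_tokens):
        x = torch.cat([expr_tokens, motion_tokens], dim=-1)   # (B, T, 2*dim)
        return self.fuse(x.transpose(1, 2)).transpose(1, 2)   # back to (B, T, dim)

# Example usage with random features standing in for real audio/AU/motion encoders.
if __name__ == "__main__":
    B, T, D = 2, 16, 256
    audio, aus, motion = (torch.randn(B, T, D) for _ in range(3))
    expr = AUCrossAttention(D)(audio, aus)
    fused = HierarchicalFusion(D)(expr, motion)
    print(fused.shape)  # torch.Size([2, 16, 256])
```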
Figure 4. Examples of talking face generation from our method and other state-of-the-art methods. The left side of the dashed line is from the MEAD dataset. The right side of the dashed line is from the HDTF dataset.
Figure 5. Qualitative ablation study on the MEAD dataset, showing the impact of the different modules in detail. Setting a refers to the baseline model; Setting b includes the MFDF framework; Setting c incorporates both the MFDF framework and the emotion-consistency loss function. Red boxes indicate changes in the eyebrows and lips.
Table 1. Quantitative comparisons on the MEAD dataset.

Method        Year    FID ↓     LSE-C ↑   LSE-D ↓   E-FID ↓
Wav2Lip       2020    69.296    6.535     7.925     3.335
PC-AVS        2021    128.191   5.628     8.836     8.823
EAMM          2022    138.802   4.351     9.890     8.598
SadTalker     2023    120.127   6.709     8.103     5.118
DreamTalk     2023    148.664   5.910     8.278     8.616
AniPortrait   2024    85.708    3.233     10.917    3.753
Hallo         2024    46.691    6.561     8.201     2.814
EchoMimic     2024    100.182   5.419     9.447     4.571
Ours          –       43.052    6.781     7.962     2.403
Table 2. Quantitative comparisons on the HDTF dataset.

Method        Year    FID ↓     LSE-C ↑   LSE-D ↓   E-FID ↓
Wav2Lip       2020    42.681    6.752     8.979     3.837
PC-AVS        2021    100.763   7.413     8.184     6.711
EAMM          2022    126.153   4.448     10.686    5.419
SadTalker     2023    106.031   7.517     7.778     4.095
DreamTalk     2024    133.078   6.503     8.156     5.354
AniPortrait   2024    54.309    4.026     10.537    4.128
Hallo         2024    36.980    7.535     7.728     3.907
EchoMimic     2024    81.230    5.371     9.594     3.679
Ours          –       35.348    7.501     7.628     3.521
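LSE-C and LSE-D reported in Tables 1 and 2 are the standard SyncNet-based lip-sync confidence and distance scores. The sketch below shows, under the usual definitions of these metrics, how such scores can be aggregated from pre-computed per-window audio and visual embeddings; the embedding extraction itself (a SyncNet-style model) is assumed to be available separately, and details may differ from the evaluation code actually used in this paper.

```python
import numpy as np

def lse_metrics(audio_emb, video_emb, max_offset=15):
    """Hedged sketch of SyncNet-style LSE-D / LSE-C aggregation.

    audio_emb, video_emb: arrays of shape (T, D), per-window embeddings assumed
    to be temporally aligned when the video is perfectly in sync.
    """
    t = min(len(audio_emb), len(video_emb))
    min_dists, confidences = [], []
    for i in range(t):
        # Distance from video window i to audio windows at nearby temporal offsets.
        lo, hi = max(0, i - max_offset), min(t, i + max_offset + 1)
        dists = np.linalg.norm(audio_emb[lo:hi] - video_emb[i], axis=1)
        min_dists.append(dists.min())
        # Confidence: how much the best offset stands out from the typical offset.
        confidences.append(np.median(dists) - dists.min())
    lse_d = float(np.mean(min_dists))    # lower is better
    lse_c = float(np.mean(confidences))  # higher is better
    return lse_c, lse_d
```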
Table 3. Quantitative ablation study on the MEAD dataset.

Setting     MFDF   Loss   FID ↓    LSE-C ↑   LSE-D ↓   E-FID ↓
a           ×      ×      46.691   6.561     8.201     2.814
b           ✓      ×      45.179   6.702     8.389     2.637
c (Ours)    ✓      ✓      43.052   6.781     7.962     2.403
Table 4. Comparative user study on perceived naturalness and expressiveness.

Method Name   Naturalness (%)   Expressiveness (%)   Composite Score (%)
Wav2Lip       16.7              0                    16.7
PC-AVS        0                 0                    0
EAMM          8.3               0                    8.3
SadTalker     16.7              16.7                 16.7
DreamTalk     0                 8.3                  8.3
AniPortrait   8.3               0                    8.3
Hallo         8.3               33.3                 20.8
EchoMimic     16.7              25                   20.85
Ours          33.3              16.7                 25