1. Introduction
With the rapid advancement of large language models (LLMs) [1], the verbal interaction capabilities of humanoid robots [2] have improved significantly, enabling fluent and context-aware verbal communication. However, non-verbal communication, including facial expressions, eye gaze, and head movements, which is equally critical in social interaction, has developed relatively slowly. Facial expressions are the primary medium for conveying emotions, intentions, and interpersonal signals; they directly shape the user's perception of warmth and trust and the willingness to engage with the robot, making them an indispensable part of human–robot interaction (HRI) [3,4]. Research in psychology and HRI has confirmed that even millisecond-level timing mismatches in facial movements can drastically reduce the naturalness and rapport of an interaction [5], thereby weakening the user's willingness to continue interacting with the robot. Therefore, enabling robots to generate geometrically accurate, timely, and emotionally coherent [6] facial expressions remains a core challenge for their deployment in real-world high-frequency social scenarios.
The current approaches to robotic facial behavior generation can be broadly divided into two categories: pre-programmed facial behavior patterns [7,8] and reactive imitation [9,10]. Pre-programmed methods typically rely on handcrafted scripts or fixed motion patterns; although they can generate precise and repeatable movements, they lack the flexibility required for anticipatory and dynamically adaptive facial behavior in natural interaction. Reactive imitation methods detect human expressions and mirror them; although their computational cost is lower, the perception–processing–execution pipeline inevitably introduces latency, leading to lagged, mechanical, and insincere robot expressions. Both approaches are essentially post hoc responses that lack anticipatory capability: they cannot synchronize with the onset and peak of human expressions and respond only passively after observing human actions. In contrast, anticipatory co-expression, the ability to predict near-future facial dynamics and generate synchronized responses, is considered key to achieving smooth turn-taking, affective alignment, and sustained HRI rapport [11,12,13].
To this end, data-driven facial expression forecasting methods have gradually attracted attention. Existing studies have attempted to predict future facial movements from visual or multimodal inputs, finding that predicting imminent emotional events (such as smile peaks) can improve interaction naturalness [11,14]. At the same time, recent work has explored more structured and interactive forms of facial motion modeling, including timeline-based control for facial action generation [15], autoregressive head generation for real-time conversational behaviors [16], and physically deployed robotic facial systems with speech-synchronized or hybrid-actuated motion control [17,18]. Although these studies highlight the growing importance of temporal structure, interaction realism, and deployment feasibility in facial dynamics modeling, they are primarily developed for facial motion generation, speech-driven animation, or robot actuation rather than short-horizon, subject-separated forecasting of 3D facial landmark dynamics. Current forecasting-oriented approaches therefore still have clear limitations in actual HRI scenarios: most studies target limited expression sets or controlled recording environments, and their generalization to unconstrained conversations has not been fully verified. More critically, when mean squared error (MSE) is used as the optimization objective, models easily converge to a conservative mean-reversion [19] or posterior-collapse [20] solution: the generated expression trajectories are numerically close to the dataset average but perceptually static and lacking vitality. Notably, trivial methods such as Copy-Last-Frame can achieve extremely low MSE. This observation reveals a clear distortion–dynamics trade-off: optimizing only numerical distortion may suppress the high-frequency motion dynamics that humans interpret as expressive and natural [21].
To address the above issues, this study revisits short-horizon facial landmark forecasting on the large-scale multimodal emotional talking-face dataset MEAD [22]. To decouple facial motion modeling from specific robot hardware constraints and focus on anticipatory dynamics themselves, we adopt 3D facial landmarks as a compact, actuation-compatible representation of facial behavior. To improve predictive fidelity while alleviating mean reversion, we propose a peak-aware GRU framework (PAGF) that explicitly models the temporal structure of future facial motion. Inspired by hierarchical prediction strategies [23], PAGF decomposes forecasting into two stages: a peak planning stage, which estimates the timing and intensity of a salient motion peak together with a global motion direction, and a peak-conditioned trajectory generation stage, which produces short-horizon landmark trajectories through temporal gating and structured motion composition. In addition, we introduce peak-consistency and temporal-shape regularization to improve the preservation of peak-related facial dynamics and temporal alignment. Experiments under a subject-separated protocol show that the proposed framework achieves a more favorable distortion–dynamics trade-off than representative static and recurrent baselines.
The main contributions of this paper are as follows:
We establish a subject-separated benchmark protocol for short-horizon 3D facial landmark forecasting on the MEAD dataset and evaluate methods using distortion, dynamics, and temporal-alignment metrics.
We analyze the mean reversion effect induced by pointwise reconstruction objectives and quantify the resulting distortion–dynamics trade-off through systematic comparison with strong baselines.
We propose a peak-aware recurrent forecasting framework that decomposes prediction into peak planning and peak-conditioned trajectory generation, and we study the contribution of peak-consistency and temporal-shape regularization to dynamics preservation.
The experimental results show that the proposed method better preserves peak-related facial dynamics while maintaining competitive 24-step prediction error.
The remainder of this paper is organized as follows: Section 2 reviews the related work; Section 3 elaborates on the proposed method and framework; Section 4 presents the experimental settings, results, and analysis; Section 5 discusses the significance and limitations of the study; finally, Section 6 concludes the paper.
2. Related Work
This work lies at the intersection of facial expression modeling, landmark-based motion representation, temporal forecasting, and human–robot co-expression. In the following, we review the most relevant literature and clarify the main differences between our study and the existing lines of work.
Facial Expression Analysis and Generation. Early research on facial expression processing focused primarily on static recognition, training CNN-based models to classify discrete emotion categories or estimate continuous affective dimensions from single images or short clips [24,25]. With the advent of large-scale datasets and 3D face models, the research focus has gradually shifted towards dynamic expression modeling and generation, encompassing expression transfer, reenactment, and talking-face synthesis [26]. These methods typically operate in the image or video domain, leveraging techniques such as GANs [27] and diffusion models [28] to generate photorealistic frames conditioned on emotion labels, text, or speech. More recent studies have also highlighted temporally structured facial motion generation: timeline-based control has been introduced to enable finer temporal specification of facial actions, while autoregressive interactive head generation has been explored for more realistic conversational motion synthesis [15,16]. These methods are primarily developed for generation-oriented settings, whereas our work focuses on short-horizon forecasting of 3D facial landmark dynamics.
In parallel, another line of research employs 2D or 3D facial landmarks as an intermediate representation for expression generation and animation [29]. Compared to direct pixel-level generation, landmark trajectories offer compact, interpretable representations that are naturally compatible with the actuation requirements of physical robots or digital avatars [30]. Our study also adopts a landmark-based representation. However, a core distinction from existing work is that we explicitly formulate future expression generation as a temporal forecasting problem based on real-world talking-face data rather than learning a one-shot mapping from static conditions to full sequences.
Temporal Forecasting of Facial and Bodily Motion. Temporal forecasting has been extensively applied in human motion research, including trajectory prediction, human pose forecasting, and co-speech gesture generation [31]. Standard approaches typically employ recurrent neural networks (RNNs), temporal convolutions, or transformers to extrapolate future sequences from recent history. It is well established that, under naive mean squared error (MSE) objectives, such models tend to converge to the mean trajectory, exhibiting overly smooth or static behavior, a phenomenon known as "mean reversion" or "mode averaging" [19]. In the context of facial motion, several works have attempted to forecast future landmarks or blendshape coefficients for talking heads or avatars. However, most studies either assume relatively regular motion patterns or focus solely on specific expressions. Furthermore, they often lack systematic comparison with strong baselines, such as Copy-Last-Frame or Seq2Seq [32,33], on noisy real-world talking-face datasets (e.g., MEAD). This work revisits the facial expression forecasting problem on MEAD, explicitly contrasting the behavior of various baselines and peak-aware recurrent models through dual assessments of short-horizon and long-horizon performance.
Human–Robot Facial Co-Expression. The core objective of human–robot facial co-expression is to enable robots to generate facial or bodily expressions that are temporally aligned with a human partner, thereby enhancing empathy, interaction rapport, and overall experience quality [11]. A representative framework proposed by Hu et al. anticipates the human's smile apex approximately 800 ms in advance via a predictive model and then maps this anticipated state to robot actuation commands using an inverse model [14]. Despite its significant influence, this framework has notable limitations: it targets a single expression type and focuses primarily on the apex state, failing to cover the full dynamical trajectory from onset to peak and offset. Recent robotic face studies have also advanced physically deployed facial motion systems, including humanoid lip-motion synchronization and neural-driven animatronic face control under actuation constraints [17,18]. These studies highlight the importance of deployment-level validation and actuator-aware control in practical HRI. In contrast, our work remains focused on short-horizon visual forecasting of 3D facial landmark dynamics rather than speech-driven robot actuation or physical facial implementation.
Our study is inspired by the idea of anticipating salient future facial events but differs from existing apex-oriented co-expression frameworks in several important respects. First, we study short-horizon facial landmark forecasting on the MEAD dataset, which contains multiple expression categories and intensity levels rather than focusing on a single expression type, such as smiles in controlled episodes. Second, instead of predicting only an apex state followed by interpolation, we propose a peak-aware recurrent framework that decomposes forecasting into peak planning and peak-conditioned trajectory generation, enabling explicit modeling of peak timing, peak intensity, and short-horizon facial motion evolution. Third, we evaluate the proposed method under a subject-separated protocol against representative static, interpolation-based, and recurrent baselines using distortion, dynamics, and temporal-alignment metrics. In this sense, our work extends the anticipatory co-expression perspective from apex-level event prediction to short-horizon landmark-level facial motion forecasting.
3. Methods
As illustrated in Figure 1, the proposed framework consists of three main components: a shared temporal encoder, a peak-aware forecasting module, and a peak-conditioned control module. The encoder first extracts a latent representation from a short history of facial landmark sequences. Based on this representation, the peak-aware forecasting module predicts key motion attributes of the upcoming expression dynamics, including the peak timing, peak intensity, and a global motion direction. These peak-aware parameters guide the peak-conditioned control module, which generates short-horizon landmark trajectories through temporal gating and motion composition. Long-term facial motion is then obtained by autoregressively rolling the control module forward.
3.1. Problem Formulation
Let $\mathbf{Y} = (\mathbf{Y}_{t+1}, \dots, \mathbf{Y}_{t+S})$ denote the future landmark sequence to be predicted over a short horizon of length $S$, where $S$ denotes the number of future prediction steps. To characterize the strength of facial motion, we define the motion amplitude at step $s$ as
$$a_{s} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \mathbf{p}_{i,\,t+s} - \mathbf{p}_{i,\,t+s-1} \right\rVert_{2},$$
where $\mathbf{p}_{i,t}$ denotes the coordinates of the $i$-th landmark at time $t$. Based on the amplitude sequence, we define the ground-truth peak step within the prediction horizon as
$$s^{\ast} = \arg\max_{1 \le s \le S} a_{s},$$
and further define the corresponding normalized peak timing and peak intensity as
$$\tau^{\ast} = s^{\ast}/S, \qquad a^{\ast} = a_{s^{\ast}}.$$
The construction of these peak-aware targets is illustrated in Figure 2. These variables characterize the timing and intensity of the dominant motion peak within the short-horizon planning window and provide compact conditioning signals for subsequent trajectory generation.
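To make the target construction concrete, the following minimal NumPy sketch derives the peak-aware targets from a future landmark segment; the function name and the frame-to-frame amplitude definition are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def peak_targets(Y, S):
    """Construct peak-aware supervision targets from a future landmark
    segment Y of shape (S+1, N, 3), where Y[0] is the last observed frame.

    Returns (tau_star, a_star): the normalized peak timing in (0, 1] and
    the peak motion amplitude.
    """
    # Motion amplitude a_s: mean per-landmark displacement between
    # consecutive frames (one plausible reading of the definition above).
    diffs = np.linalg.norm(Y[1:] - Y[:-1], axis=-1)  # (S, N)
    a = diffs.mean(axis=1)                           # (S,)
    s_star = int(np.argmax(a)) + 1                   # peak step in 1..S
    tau_star = s_star / S                            # normalized timing
    a_star = float(a[s_star - 1])                    # peak intensity
    return tau_star, a_star
```

For example, a segment whose only motion occurs between the third and fourth future frames yields a peak step of 3 and a normalized timing of 3/S.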
3.2. Shared Temporal Encoder
Let $\mathbf{X} = (\mathbf{X}_{t-T+1}, \dots, \mathbf{X}_{t})$ denote the observed history of facial landmark configurations over the past $T$ frames, where each frame $\mathbf{X}_{t} \in \mathbb{R}^{N \times 3}$ contains the 3D coordinates of $N$ facial keypoints extracted from the input video. To model temporal dependencies in facial motion, each landmark frame is first reshaped into a vector representation $\mathbf{x}_{t} \in \mathbb{R}^{3N}$. The resulting sequence $(\mathbf{x}_{t-T+1}, \dots, \mathbf{x}_{t})$ is then fed into a GRU encoder [34] to obtain a compact representation of recent expression dynamics:
$$\mathbf{H} = \mathrm{GRU}\left(\mathbf{x}_{t-T+1}, \dots, \mathbf{x}_{t}\right) \in \mathbb{R}^{d},$$
where $\mathbf{H}$ denotes the final hidden state summarizing the short-term facial motion context and $d$ is the hidden dimension of the GRU. This latent representation serves as a shared motion context for the subsequent peak-aware forecasting and trajectory generation process.
3.3. Peak-Aware Forecasting
Given the latent representation $\mathbf{H}$ produced by the shared temporal encoder, we first predict a set of peak-aware motion attributes that characterize the upcoming facial motion dynamics. Rather than directly extrapolating high-dimensional landmark trajectories, we estimate several intermediate control variables that explicitly describe when the next salient motion event will occur, how strong it will be, and along which geometric direction the facial landmarks are expected to evolve. Concretely, the peak-aware forecasting module predicts the peak timing $\hat{\tau}$, the peak intensity $\hat{a}$, and a global motion direction $\hat{\mathbf{d}}$.
The peak timing is predicted from the latent state $\mathbf{H}$ by a timing head that models the normalized prediction horizon using $K$ discretized temporal bins:
$$\mathbf{z} = f_{\tau}(\mathbf{H}) \in \mathbb{R}^{K},$$
where $f_{\tau}$ denotes the timing prediction head and $\mathbf{z}$ are the logits associated with the temporal bins. The corresponding decoded timing estimate is denoted by $\hat{\tau}$, which represents the normalized location of the next peak within the prediction horizon.
In parallel, the peak intensity is predicted as
$$\hat{a} = f_{a}(\mathbf{H}),$$
where $f_{a}$ denotes the intensity prediction head and $\hat{a}$ represents the predicted motion amplitude at the peak instant. These two variables provide explicit temporal and magnitude priors for the subsequent trajectory generation process.
To further characterize the spatial pattern of the forthcoming facial motion, we infer a global motion direction from the same latent representation:
$$\hat{\mathbf{d}} = \frac{f_{d}(\mathbf{H})}{\lVert f_{d}(\mathbf{H}) \rVert_{2} + \epsilon},$$
where $f_{d}$ denotes a projection from the latent space to the landmark displacement space and $\epsilon$ is a small constant for numerical stability. This normalization removes the magnitude component and allows $\hat{\mathbf{d}}$ to encode only the dominant spatial direction of the forthcoming facial deformation. Taken together, $\hat{\tau}$, $\hat{a}$, and $\hat{\mathbf{d}}$ constitute the peak-aware control signals used by the subsequent peak-conditioned control module, which generates a short-horizon motion trajectory by modulating the global motion direction with a temporal gate.
3.4. Peak-Conditioned Control
Given the peak-aware motion attributes predicted from the latent representation, we generate future facial motion through a peak-conditioned control mechanism. Instead of directly regressing the landmark coordinates at each future step, we explicitly factorize the predicted motion into three components: a temporal gate controlling the evolution over time, a peak intensity determining the motion magnitude, and a global motion direction specifying the geometric deformation pattern. This factorized design provides a structured and interpretable formulation for short-horizon trajectory generation.
For each future step $s \in \{1, \dots, S\}$, we first construct a step-dependent conditioning variable based on the normalized temporal index and the predicted peak-aware attributes. In practice, the temporal gate is generated by a lightweight recurrent decoder conditioned on the current step, the peak-aware variables, and the motion context:
$$g_{s} = f_{g}\left(s/S,\ \hat{\tau},\ \hat{a},\ \mathbf{H}\right),$$
where $f_{g}$ denotes the temporal-gating function and $g_{s}$ denotes the gate value at step $s$. Intuitively, $g_{s}$ models how strongly the facial motion should evolve at each future step under the guidance of the predicted peak timing and motion intensity.
Given the temporal gate $g_{s}$, the predicted peak intensity $\hat{a}$, and the global motion direction $\hat{\mathbf{d}}$, the motion increment at time $t+s$ is computed as
$$\Delta\mathbf{X}_{t+s} = g_{s}\,\hat{a}\,\hat{\mathbf{d}}.$$
The predicted facial landmark configuration is then obtained by adding the motion increment to the current frame $\mathbf{X}_{t}$:
$$\hat{\mathbf{X}}_{t+s} = \mathbf{X}_{t} + \Delta\mathbf{X}_{t+s}.$$
In this way, the future facial trajectory is generated as a peak-conditioned deformation process, where the temporal evolution is governed by $g_{s}$, the motion magnitude is controlled by $\hat{a}$, and the geometric displacement pattern is constrained by $\hat{\mathbf{d}}$.
3.5. Autoregressive Rollout
The peak-conditioned control module predicts a short future landmark segment of length $S$. To obtain a longer prediction horizon, we deploy the model in a rolling autoregressive manner. At rollout iteration $r$, the model takes the most recent history window of length $T$ as input and predicts a short future segment $\hat{\mathbf{Y}}^{(r)}$. This predicted segment is then appended to the history buffer, while the oldest frames are discarded so that the input length remains fixed. The updated history is subsequently re-encoded by the shared temporal encoder, and the same peak-aware forecasting process is repeated. By iterating this procedure, the model produces a long-horizon facial landmark trajectory up to the target horizon $S_{\mathrm{total}}$.
Formally, if the history window before rollout iteration $r$ is denoted by $\mathbf{X}^{(r)}$, then the model predicts a segment $\hat{\mathbf{Y}}^{(r)}$ and updates the history window as
$$\mathbf{X}^{(r+1)} = \left[\mathbf{X}^{(r)},\ \hat{\mathbf{Y}}^{(r)}\right]_{-T:},$$
where only the most recent $T$ frames are retained after each rollout. In our implementation, the rolling prediction process is repeated until reaching the target horizon of $S_{\mathrm{total}} = 24$ steps.
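The rolling procedure can be summarized concisely; here `predict_segment` is a hypothetical stand-in for the full encoder–forecasting–control pipeline, and the buffer logic mirrors the update rule above.

```python
import numpy as np

def rollout(predict_segment, history, S_total):
    """Rolling autoregressive deployment (sketch).

    predict_segment: maps a history window (T, N, 3) to a segment (S, N, 3)
    history        : (T, N, 3) initial observed window
    S_total        : target long-horizon length in frames
    """
    history = history.copy()
    T = history.shape[0]
    outputs = []
    while sum(seg.shape[0] for seg in outputs) < S_total:
        seg = predict_segment(history)                  # (S, N, 3)
        outputs.append(seg)
        # Append the prediction and keep only the last T frames.
        history = np.concatenate([history, seg])[-T:]
    return np.concatenate(outputs)[:S_total]
```

Because each iteration re-encodes a window containing the model's own predictions, this loop also exposes error accumulation, which the evaluation protocol in Section 4 exploits.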
3.6. Training Objectives
The proposed model is trained to jointly optimize short-horizon trajectory accuracy and the reliability of the predicted peak-aware motion attributes. To this end, we employ a composite objective consisting of a control term, a peak prediction term, a peak-integrity term, and a correlation regularization term for high-dynamic samples.
For short-horizon trajectory supervision, let $\hat{\mathbf{D}}_{s}, \mathbf{D}_{s} \in \mathbb{R}^{N \times 3}$ denote the predicted and ground-truth landmark displacement matrices at step $s \in \{1, \dots, S\}$, where $N$ is the number of facial landmarks and $S$ is the short prediction horizon. We define the control loss as
$$\mathcal{L}_{\mathrm{ctrl}} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{amp}}\,\mathcal{L}_{\mathrm{amp}} + \lambda_{\mathrm{sm}}\,\mathcal{L}_{\mathrm{sm}},$$
where $\mathcal{L}_{\mathrm{rec}} = \frac{1}{S}\sum_{s=1}^{S}\lVert \hat{\mathbf{D}}_{s} - \mathbf{D}_{s} \rVert_{F}^{2}$ is a Frobenius-norm reconstruction term on landmark coordinates, $\mathcal{L}_{\mathrm{amp}}$ enforces consistency of the motion-amplitude sequence, and $\mathcal{L}_{\mathrm{sm}}$ regularizes temporal smoothness through first-order motion differences.
To encourage accurate prediction of the key peak-aware attributes, we further supervise the estimated peak timing and peak intensity. Let $\tau^{\ast}$ and $a^{\ast}$ denote the ground-truth normalized peak timing and peak intensity, respectively. The peak prediction loss is defined as
$$\mathcal{L}_{\mathrm{peak}} = \mathcal{L}_{\mathrm{time}} + \lambda_{a}\,\mathcal{L}_{\mathrm{int}},$$
where $\mathcal{L}_{\mathrm{time}}$ is a classification loss over discretized temporal bins for peak timing and $\mathcal{L}_{\mathrm{int}}$ penalizes the error in the predicted peak intensity.
To preserve the structural fidelity of the predicted facial configuration around the most salient motion event, we introduce a peak-integrity loss evaluated at the ground-truth peak step $s^{\ast}$:
$$\mathcal{L}_{\mathrm{integ}} = \lVert \hat{\mathbf{X}}_{t+s^{\ast}} - \mathbf{X}_{t+s^{\ast}} \rVert_{F}^{2}.$$
In addition, for high-dynamic samples, we impose a correlation regularization term on the predicted and ground-truth motion-amplitude sequences,
$$\mathcal{L}_{\mathrm{corr}} = 1 - \rho\left(\hat{\mathbf{a}}, \mathbf{a}\right),$$
where $\rho(\cdot,\cdot)$ denotes the Pearson correlation coefficient.
The final training objective is given by
$$\mathcal{L} = \mathcal{L}_{\mathrm{ctrl}} + \lambda_{\mathrm{peak}}\,\mathcal{L}_{\mathrm{peak}} + \lambda_{\mathrm{integ}}\,\mathcal{L}_{\mathrm{integ}} + \lambda_{\mathrm{corr}}\,\mathcal{L}_{\mathrm{corr}}.$$
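As a concrete illustration of the trajectory-level supervision, the sketch below implements a plausible form of the control loss and the Pearson-based correlation term in NumPy; the weighting and reduction choices are our assumptions, not the paper's exact configuration.

```python
import numpy as np

def pearson(x, y, eps=1e-8):
    """Pearson correlation of two 1-D sequences."""
    x = x - x.mean()
    y = y - y.mean()
    return float((x * y).sum() / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

def corr_loss(a_hat, a):
    """Correlation regularizer 1 - rho on motion-amplitude sequences."""
    return 1.0 - pearson(a_hat, a)

def control_loss(D_hat, D, lam_amp=1.0, lam_sm=1.0):
    """Composite short-horizon control loss (illustrative weights).

    D_hat, D: (S, N, 3) predicted / ground-truth displacement sequences.
    """
    # Frobenius-norm reconstruction term, averaged over steps.
    rec = ((D_hat - D) ** 2).sum(axis=(1, 2)).mean()
    # Amplitude-consistency term on mean per-landmark displacement norms.
    a_hat = np.linalg.norm(D_hat, axis=-1).mean(axis=1)
    a_gt = np.linalg.norm(D, axis=-1).mean(axis=1)
    amp = ((a_hat - a_gt) ** 2).mean()
    # Smoothness term on first-order differences of the prediction.
    sm = (np.diff(D_hat, axis=0) ** 2).sum(axis=(1, 2)).mean()
    return rec + lam_amp * amp + lam_sm * sm
```

Note that the smoothness term penalizes the prediction's own motion differences, so it acts as a regularizer even when reconstruction is perfect.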
4. Experiments
4.1. Experimental Setup
Dataset and Preprocessing. We evaluate our framework on the MEAD talking-face dataset [22], which contains high-quality audio-visual recordings of 60 actors expressing 8 emotions at 3 intensity levels. In this study, rather than pursuing large-scale identity coverage, we adopt a strictly controlled subject-independent setting to systematically analyze the mean reversion phenomenon in short-horizon facial forecasting. Specifically, we use continuous landmark sequences from Actors 1–5 for training and strictly hold out Actor 6 for testing. Although this protocol operates on a subset of the full MEAD diversity, it provides a clean, highly focused testbed. This design explicitly isolates motion modeling from identity overfitting, allowing us to rigorously verify whether the proposed peak-aware mechanism can preserve high-frequency expressive dynamics under cross-subject transfer without relying on identity memorization.
Evaluation Protocol. We evaluate the model in a long-horizon forecasting setting with a target prediction horizon of 24 frames (approximately 0.8 s). In our implementation, we adopt a rolling autoregressive strategy in which the model predicts a short future segment from the current history window, appends the predicted segment to the history buffer, and discards the oldest frames so that the input length remains fixed. The updated history is then fed back into the model for the next rollout iteration. Repeating this procedure yields a full predicted trajectory up to the target horizon. This evaluation protocol exposes the model to its own previous predictions and therefore tests its resistance to error accumulation over time. Unless otherwise stated, we report results on the held-out Actor 6 (unseen during training) while using Actors 1–5 for training. All compared methods share the same data preprocessing pipeline, sequence sampling strategy, and optimization schedule to ensure a fair comparison.
4.2. Compared Methods and Evaluation Metrics
We intentionally restrict our baselines to recurrent and kinematic models to align with the real-time constraints of anticipatory co-expression. Although large-scale architectures like diffusion and transformer models achieve state-of-the-art offline synthesis, their high computational complexity precludes lightweight autoregressive control. Focusing on methodologically comparable models ensures a fair and practically meaningful evaluation for short-horizon online forecasting.
Copy-Last-Frame [19]: A zero-velocity baseline that repeats the last observed history frame for all future steps. While trivial, it serves as a strong baseline for MSE due to the prevalence of static or slow-moving segments in talking-face data.
Linear extrapolation [35]: A kinematic baseline that assumes constant velocity, extrapolating motion based on the difference between the last two history frames. This baseline highlights the non-linearity of facial dynamics.
GRU Seq2Seq [32]: A standard encoder–decoder gated recurrent unit (GRU) network trained with a pointwise MSE loss. This represents the generic sequence modeling approach without specific structural priors for facial peaks.
Hu-style Apex+Interp [11]: An implementation of the anticipatory co-expression framework. It predicts a single future apex state (time and intensity) and generates the trajectory via linear interpolation. This serves as the primary state-of-the-art comparison for peak-based generation.
PAGF-base: Our base peak-aware model incorporates the decoupled planning head and the gated directional control head. It is trained with the standard trajectory reconstruction objective and peak supervision but without the additional peak-integrity and correlation regularization terms used in the final model.
PAGF+Corr (Final): The complete framework builds upon PAGF-base by adding the peak-integrity loss and the shape-aware correlation loss on high-dynamic segments. This variant is designed to improve temporal alignment and dynamic expressiveness without sacrificing trajectory fidelity.
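The two training-free baselines can be stated precisely in a few lines of NumPy, which also clarifies why Copy-Last-Frame is so competitive under MSE on largely static segments.

```python
import numpy as np

def copy_last_frame(history, S):
    """Zero-velocity baseline: repeat the last observed frame S times.
    history: (T, N, 3) observed landmark window."""
    return np.repeat(history[-1:], S, axis=0)

def linear_extrapolation(history, S):
    """Constant-velocity baseline: extrapolate the difference between
    the last two observed frames over S future steps."""
    v = history[-1] - history[-2]                  # (N, 3) velocity
    steps = np.arange(1, S + 1).reshape(-1, 1, 1)  # (S, 1, 1)
    return history[-1] + steps * v
```

On near-static segments the zero-velocity output is already close to the ground truth, whereas the constant-velocity output diverges linearly with the horizon, matching the behaviors reported in Section 4.4.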
To quantify temporal dynamics beyond pointwise distortion, we report three complementary metrics: jerk ratio, AmpCorr, and lag. Given the ground-truth trajectory $\mathbf{Y}$ and the predicted trajectory $\hat{\mathbf{Y}}$, we first define the discrete jerk energy as
$$J(\mathbf{Y}) = \sum_{s} \left\lVert \mathbf{Y}_{s+3} - 3\mathbf{Y}_{s+2} + 3\mathbf{Y}_{s+1} - \mathbf{Y}_{s} \right\rVert^{2},$$
and compute the corresponding jerk ratio by
$$\mathrm{JerkRatio} = J(\hat{\mathbf{Y}}) / J(\mathbf{Y}),$$
where values closer to 1 indicate better preservation of dynamic variation.
To characterize the temporal evolution of facial motion intensity, we define the frame-level motion-amplitude sequences as $a_{s} = \frac{1}{N}\sum_{i=1}^{N}\lVert \mathbf{p}_{i,s} - \mathbf{p}_{i,s-1} \rVert_{2}$ for the ground truth and $\hat{a}_{s}$ analogously for the prediction. Based on these sequences, the amplitude correlation is measured by
$$\mathrm{AmpCorr} = \rho\left(\hat{\mathbf{a}}, \mathbf{a}\right),$$
while the temporal synchronization error is defined as
$$\mathrm{lag} = \lvert \delta^{\ast} \rvert.$$
Here, $\delta^{\ast}$ denotes the optimal temporal shift that maximizes the cross-correlation between the predicted and ground-truth motion-amplitude sequences, and the lag metric is the absolute value of this shift.
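For reproducibility of the metric definitions, the following NumPy sketch implements the three dynamics metrics; the third-order-difference jerk and the integer-shift search for the lag are plausible discretizations rather than the paper's exact implementation.

```python
import numpy as np

def jerk_energy(traj):
    """Sum of squared third-order finite differences of a (T, N, 3) trajectory."""
    jerk = np.diff(traj, n=3, axis=0)
    return float((jerk ** 2).sum())

def jerk_ratio(pred, gt, eps=1e-8):
    """Predicted-over-ground-truth jerk energy; values near 1 are best."""
    return jerk_energy(pred) / (jerk_energy(gt) + eps)

def amplitude(traj):
    """Frame-level motion amplitude: mean landmark displacement per frame."""
    return np.linalg.norm(np.diff(traj, axis=0), axis=-1).mean(axis=1)

def amp_corr(pred, gt, eps=1e-8):
    """Pearson correlation of predicted and ground-truth amplitude sequences."""
    a_hat, a = amplitude(pred), amplitude(gt)
    a_hat = a_hat - a_hat.mean()
    a = a - a.mean()
    return float((a_hat * a).sum()
                 / (np.linalg.norm(a_hat) * np.linalg.norm(a) + eps))

def lag(pred, gt, max_shift=8):
    """Absolute value of the integer shift maximizing the cross-correlation
    of the centered motion-amplitude sequences."""
    a_hat, a = amplitude(pred), amplitude(gt)
    best, best_c = 0, -np.inf
    for k in range(-max_shift, max_shift + 1):
        if k >= 0:
            x, y = a_hat[k:], a[:len(a) - k]
        else:
            x, y = a_hat[:len(a_hat) + k], a[-k:]
        if len(x) < 2:
            continue
        c = float(np.dot(x - x.mean(), y - y.mean()))
        if c > best_c:
            best_c, best = c, k
    return abs(best)
```

A prediction identical to the ground truth yields a jerk ratio of 1, an AmpCorr of 1, and a lag of 0; delaying the motion burst by two frames yields a lag of 2.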
4.3. Implementation Details
To ensure a fair comparison, all learned models share an identical GRU encoder–decoder backbone. The encoder and decoder each use a single GRU layer with a hidden dimension of $d$. The planning head comprises two-layer multilayer perceptrons (MLPs) with a hidden size of 256 for both the peak-time classification (logits over temporal bins) and the peak-intensity regression. The motion head is implemented as a linear projection mapping the latent space to the landmark displacement space.
Training was performed using the AdamW optimizer [36,37] with a batch size of 32 for 200 epochs; the initial learning rate and weight decay are listed with the other hyperparameters in Table 1. Early stopping was applied based on the validation MSE at the 24th step (MSE@24) to prevent overfitting. For the shape-aware correlation loss, the high-dynamic threshold was determined as the 75th percentile of the amplitude standard deviation computed over the control windows of the training set. All experiments were implemented in PyTorch 2.10.0 and executed on a single NVIDIA RTX 4060 GPU. The detailed hyperparameters and configuration are summarized in Table 1.
4.4. Long-Horizon Distortion Analysis
We emphasize that pointwise MSE can favor degenerate near-static rollouts on MEAD because a substantial portion of sequences exhibit limited facial motion within short horizons. In such cases, repeating the last frame yields low distortion but fails to reproduce the temporal evolution of expressions. Therefore, we treat MSE as a distortion measure rather than a proxy for expressive motion quality and complement it with dynamics and synchrony metrics. We rigorously evaluate the long-horizon forecasting capability by measuring the mean squared error (MSE) over a 24-frame rollout trajectory. The results, detailing global MSE and region-specific jaw/mouth MSE, are presented in Table 2.
A critical observation is that the trivial Copy-Last-Frame baseline achieves the lowest global MSE among all methods (Table 2). This counter-intuitive result confirms the severity of the "mean reversion" problem in talking-face datasets: since facial expressions are sparse and often return to a neutral state, a static predictor statistically minimizes the error, albeit at the cost of freezing all motion. In contrast, the linear extrapolation baseline exhibits rapid divergence, highlighting the non-linear nature of facial dynamics and the inadequacy of simple kinematic assumptions for long-term prediction.
Among data-driven approaches, GRU Seq2Seq and Hu-style Apex+Interp achieve reasonable distortion levels but still incur higher errors than the static baseline. This indicates that generic recurrent models struggle to balance motion generation with trajectory stability. Our proposed PAGF-base effectively suppresses the artifacts of unconstrained generation, achieving a low distortion comparable to Copy-Last-Frame. Building on this, the final PAGF+Corr variant maintains a competitive distortion profile in both global and jaw/mouth MSE (Table 2), only marginally higher than the static baseline. Crucially, as discussed in the subsequent dynamics analysis, this slight increase in MSE is a necessary trade-off to restore the dynamic vitality and temporal alignment that static baselines completely lack.
4.5. Dynamics and Perception–Distortion Trade-Off
Low numerical distortion (MSE) does not guarantee high perceptual quality. In fact, it often implies a collapse to static mean poses. To quantify this trade-off, we evaluate three complementary metrics: (i) jerk ratio, serving as a proxy for motion smoothness and vitality; (ii) AmpCorr, measuring the shape alignment between predicted and ground-truth intensity profiles; and (iii) lag, indicating temporal synchronization accuracy.
As summarized in Table 3, the baselines exhibit extreme behaviors. Copy-Last-Frame achieves near-zero distortion but zero dynamic energy, representing a trivial static solution. Conversely, standard GRU Seq2Seq improves correlation but introduces excessive jitter and higher spatial error. Our proposed PAGF+Corr successfully navigates this trade-off: it restores natural dynamic energy (jerk ratio comparable to the ground truth) and significantly improves temporal alignment compared to static baselines, validating that the peak-aware planning mechanism effectively steers generation away from mean reversion without inducing unstable artifacts.
However, we also find that the absolute magnitude of AmpCorr is relatively small across methods. This is expected because (i) many MEAD clips remain close to a neutral state, yielding low-variance amplitude sequences that suppress Pearson correlation values, and (ii) minor temporal shifts around expression transitions can substantially reduce correlation even when the overall motion magnitude is plausible. Consequently, we additionally report the correlation distribution on a high-dynamic subset (Figure 3) to isolate segments with expressive motion, where temporal-shape alignment becomes more discriminative.
4.6. High-Dynamic Analysis and Qualitative Visualization
To rigorously assess the model's capability in capturing expressive motion, we isolate a high-dynamic subset in which the standard deviation of the ground-truth amplitude exceeds the 75th-percentile threshold defined in Section 4.3. A comprehensive analysis of this subset is presented in Figure 3, Figure 4, Figure 5 and Figure 6.
Quantitative Trends (Figure 3 and Figure 4). The mean amplitude trajectory plot (Figure 3) reveals that, while Seq2Seq (blue) suffers from amplitude decay over time, PAGF+Corr (red) robustly maintains the expression intensity, closely tracking the ground-truth band. The amplitude correlation distribution (Figure 4) further confirms that our model achieves a consistently higher median correlation than the baselines, reducing the variance of poor predictions.
Qualitative and Frame-Wise Analysis (Figure 5 and Figure 6). To further examine the predicted dynamics on a representative high-intensity expression, we provide both qualitative landmark overlays and frame-wise mouth-region error curves. As shown in Figure 5, the Copy-Last baseline remains close to a static estimate and fails to capture the onset of the motion, while the generic Seq2Seq model initiates the movement but produces an attenuated peak, consistent with the over-smoothing tendency observed in the quantitative analysis. By contrast, PAGF+Corr more closely matches the ground-truth trajectory in both peak timing and motion magnitude, resulting in a more faithful temporal evolution of the expression. This tendency is further supported by Figure 6, where PAGF+Corr maintains a more favorable mouth-region error profile over most rollout steps, particularly around the main dynamic transition region.
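A frame-wise mouth-region error curve of this kind can be computed as follows. The sketch assumes the common iBUG 68-point layout, in which indices 48–67 cover the mouth; if a different landmark convention is used, the slice must be adjusted accordingly.

```python
import numpy as np

# Assumed iBUG 68-point convention: mouth landmarks occupy indices 48-67.
MOUTH = slice(48, 68)

def mouth_error_curve(pred, gt):
    """Per-frame mean Euclidean error over the mouth landmarks, yielding the
    frame-wise curve used in rollout analysis. Shapes: (T, 68, 2) -> (T,)."""
    err = np.linalg.norm(pred[:, MOUTH] - gt[:, MOUTH], axis=-1)  # (T, 20)
    return err.mean(axis=-1)

# Sanity check: shifting only the mouth points by (1, 1) produces a constant
# error of sqrt(2) per frame.
gt = np.zeros((10, 68, 2))
pred = gt.copy()
pred[:, 48:68] += 1.0
curve = mouth_error_curve(pred, gt)
assert curve.shape == (10,) and np.allclose(curve, np.sqrt(2))
```

Plotting this curve per rollout step is what separates a model that tracks the dynamic transition from one whose error balloons exactly where the expressive motion occurs.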
5. Discussion
The Perception–Distortion Trade-Off. Our empirical results on the MEAD dataset reveal a fundamental conflict in facial expression forecasting: optimizing for numerical fidelity (MSE) often comes at the expense of dynamic expressiveness. The trivial Copy-Last-Frame baseline achieves the lowest global MSE, creating a “numerical ceiling” that learned models struggle to surpass. This confirms that, under the assumption of minimal motion energy, which holds for the majority of a talking-face video, the statistically optimal strategy for minimizing loss is to predict no motion. However, in the context of human–robot interaction (HRI), such a strategy is catastrophic: it results in a “zombie-like” agent that appears unresponsive to human emotional cues. Conversely, standard autoregressive models (e.g., GRU Seq2Seq) attempt to model dynamics but often suffer from “mean reversion” (or posterior collapse), producing trajectories that drift towards the average face, limiting the robot’s ability to display high-intensity empathetic expressions.
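The "numerical ceiling" argument is easy to verify with a toy example. A minimal sketch of the copy-last-frame baseline (function name and shapes are illustrative): on a low-motion segment, repeating the last observed frame already achieves a tiny MSE despite carrying zero motion.

```python
import numpy as np

def copy_last_frame(history, horizon):
    """Trivial baseline: repeat the last observed frame across the whole
    prediction horizon. history: (T_in, L, 2) -> (horizon, L, 2)."""
    return np.repeat(history[-1:], horizon, axis=0)

rng = np.random.default_rng(3)
neutral = rng.normal(size=(68, 2))
# A mostly-neutral future segment with only small residual motion.
future = neutral + 0.01 * rng.normal(size=(10, 68, 2))
pred = copy_last_frame(neutral[None], horizon=10)
mse = float(np.mean((pred - future) ** 2))
assert mse < 1e-3   # near-zero distortion from a zero-motion prediction
```

Because most talking-face frames sit near this low-motion regime, any learned model trained purely on MSE is pushed toward the same static solution, which is precisely the mean-reversion failure discussed above.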
Efficacy of Peak-Aware Planning. The proposed peak-aware GRU framework (PAGF) addresses this dichotomy by decoupling the “what/when” (planning) from the “how” (control). By explicitly predicting the timing and intensity of the upcoming expression peak, the model injects a strong high-level structural prior into the generation process. This prevents the decoder from collapsing to the mean, as it is strictly conditioned to reach a specific target state. Our ablation study demonstrates that, while geometric constraints alone are insufficient to cure static behaviors, the introduction of peak integrity and shape-aware correlation losses significantly boosts the jerk energy and amplitude correlation. This suggests that, for robotic expression generation, supervising the shape and trend of the motion is as critical as supervising the absolute coordinate positions.
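To make the loss design concrete, here is a hedged sketch of what a shape-aware correlation term and a peak-integrity term could look like on amplitude sequences. The exact formulations and weightings used in PAGF are not reproduced; the function names are assumptions for illustration.

```python
import numpy as np

def corr_loss(pred_amp, gt_amp, eps=1e-8):
    """Shape-aware correlation loss: 1 - Pearson r between predicted and
    ground-truth amplitude sequences. A flat (mean-reverted) prediction has
    zero covariance with the target and is maximally penalised."""
    p = pred_amp - pred_amp.mean()
    g = gt_amp - gt_amp.mean()
    r = (p * g).sum() / (np.sqrt((p**2).sum() * (g**2).sum()) + eps)
    return 1.0 - float(r)

def peak_integrity_loss(pred_amp, gt_amp):
    """Peak-integrity term: squared error between sequence maxima,
    discouraging attenuated expression peaks."""
    return float((pred_amp.max() - gt_amp.max()) ** 2)

t = np.linspace(0, np.pi, 20)
gt_amp = np.sin(t)                              # one clear expression peak
flat = np.full_like(gt_amp, gt_amp.mean())      # mean-reverted prediction
assert corr_loss(gt_amp, gt_amp) < 1e-6         # perfect shape -> ~0 loss
assert corr_loss(flat, gt_amp) > 0.9            # flat output -> heavy penalty
assert peak_integrity_loss(flat, gt_amp) > 0.1  # attenuated peak is penalised
```

Note that both terms are invisible to a pointwise MSE: the flat prediction above can have a modest MSE yet incurs near-maximal correlation loss, which is the mechanism by which such terms steer training away from static outputs.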
Computational Cost and Practical Considerations. Beyond prediction quality, the computational implications of the proposed framework should also be considered. Compared with static or single-step baselines, PAGF introduces additional overhead through peak-aware forecasting and rolling autoregressive control. However, the model remains structurally lightweight, relying on a GRU-based encoder–decoder and short control segments rather than large-scale generative architectures. In the present study, we evaluate the method in an offline benchmark setting and therefore do not claim a fully deployed real-time robotic implementation. A more complete assessment of inference latency, rollout cost, and robot-in-the-loop execution remains an important direction for future work.
Limitations. Despite the promising results, the current framework has several limitations. First, it relies solely on visual history, whereas facial expressions in conversation are often correlated with speech prosody and semantic context. In situations where visual cues are weak or ambiguous, such as the onset of a surprise reaction before substantial muscle movement, the model may struggle to predict the exact peak timing. Second, the present evaluation focuses on landmark trajectories; mapping these kinematic plans to the physical constraints of specific servo-driven robot faces remains a downstream engineering challenge. Recent robotic face studies further suggest that bridging facial kinematics to practical HRI deployment requires actuator-level mapping, latency-sensitive synchronization, and physical robot validation [17,18]. In addition, the current formulation models only the dominant peak within each short prediction horizon. While this approximation is suitable for the present short-horizon setting, more complex facial motion patterns with multiple salient peaks would require an extended formulation, such as multi-event planning or hierarchical temporal modeling. More broadly, although the proposed framework is motivated by anticipatory facial co-expression in HRI, the present study evaluates this capability through offline landmark-based proxy metrics rather than end-to-end robot deployment or user studies. Accordingly, the reported gains indicate improved temporal synchronization and dynamic fidelity at the kinematic level, but they should be interpreted as preliminary evidence rather than a definitive validation of interaction-level benefits. Future work will further examine these advantages through robot-in-the-loop experiments and user-level interaction studies.
6. Conclusions
In this paper, we presented a systematic study of short-horizon facial expression forecasting, a critical capability for anticipatory human–robot co-expression. We identified and quantified the severe mean-reversion phenomenon induced by standard MSE training on the MEAD dataset. To overcome this, we proposed the peak-aware GRU framework (PAGF), a hierarchical architecture that explicitly models short-horizon peak timing and intensity and generates structured trajectories via peak-conditioned directional gating. Our experiments demonstrate that PAGF successfully navigates the perception–distortion trade-off. Unlike static baselines that minimize error by freezing motion, our method generates vibrant synchronized facial dynamics that closely match ground-truth intensity profiles while maintaining a competitive distortion error relative to generic Seq2Seq models. This work provides a robust baseline for data-driven robot expression control, highlighting the importance of structural priors in modeling stochastic human behaviors.
In future work, we plan to extend this framework to a multimodal setting by incorporating audio and text modalities to improve the anticipation horizon. Additionally, we aim to deploy the generated trajectories on a physical facial expression robot to evaluate the impact of anticipatory co-expression on user trust and engagement in real-world interactions.