1. Introduction
With the rapid advancement of large language models (LLMs) [1], the verbal interaction capabilities of humanoid robots [2] have improved significantly, enabling fluent and context-aware verbal communication. However, non-verbal communication, including facial expressions, eye gaze, and head movements, which is equally critical in social interaction, has developed relatively slowly. Facial expressions are the primary medium for conveying emotions, intentions, and interpersonal signals; they directly shape the user's perception of warmth and trust and the willingness to engage with the robot, making them an indispensable part of human–robot interaction (HRI) [3,4]. Research in psychology and HRI has confirmed that even millisecond-level timing mismatches in facial movements can drastically reduce the naturalness and rapport of an interaction [5], thereby weakening the user's willingness to continue interacting with the robot. Therefore, enabling robots to generate geometrically accurate, timely, and emotionally coherent [6] facial expressions remains a core challenge for their deployment in real-world high-frequency social scenarios.
The current approaches to robotic facial behavior generation can be broadly divided into two categories: pre-programmed facial behavior patterns [7,8] and reactive imitation [9,10]. Pre-programmed methods typically rely on handcrafted scripts or fixed motion patterns; although they can generate precise and repeatable movements, they lack the flexibility required for anticipatory and dynamically adaptive facial behavior in natural interaction. Reactive imitation methods detect human expressions and mirror them; although their computational cost is lower, the perception–processing–execution pipeline inevitably introduces latency, leading to lagged, mechanical, and insincere robot expressions. Both approaches are essentially post hoc responses that lack anticipatory capability: they cannot synchronize with the onset and peak of human expressions and respond only passively after observing human actions. In contrast, anticipatory co-expression, the ability to predict near-future facial dynamics and generate synchronized responses, is considered key to achieving smooth turn-taking, affective alignment, and sustained HRI rapport [11,12,13].
To this end, data-driven facial expression forecasting methods have gradually attracted attention. Existing studies have attempted to predict future facial movements from visual or multimodal inputs, finding that predicting imminent emotional events (such as smile peaks) can improve interaction naturalness [11,14]. At the same time, recent work has explored more structured and interactive forms of facial motion modeling, including timeline-based control for facial action generation [15], autoregressive head generation for real-time conversational behaviors [16], and physically deployed robotic facial systems with speech-synchronized or hybrid-actuated motion control [17,18]. Although these studies highlight the growing importance of temporal structure, interaction realism, and deployment feasibility in facial dynamics modeling, they are primarily developed for facial motion generation, speech-driven animation, or robot actuation rather than short-horizon, subject-separated forecasting of 3D facial landmark dynamics. Current forecasting-oriented approaches therefore still have clear limitations in actual HRI scenarios: most studies target limited expression sets or controlled recording environments, and their generalization to unconstrained conversations has not been fully verified. More critically, when mean squared error (MSE) is used as the optimization objective, models easily converge to a conservative mean-reversion [19] or posterior-collapse [20] solution: the generated expression trajectories are numerically close to the dataset average but perceptually static and lacking vitality. Notably, trivial methods such as Copy-Last-Frame can achieve extremely low MSE. This observation reveals a clear distortion–dynamics trade-off: optimizing only numerical distortion may suppress the high-frequency motion dynamics that humans interpret as expressive and natural [21].
To address the above issues, this study revisits short-horizon facial landmark forecasting on the large-scale multimodal emotional talking-face dataset MEAD [22]. To decouple facial motion modeling from specific robot hardware constraints and focus on anticipatory dynamics themselves, we adopt 3D facial landmarks as a compact, actuation-compatible representation of facial behavior. To improve predictive fidelity while alleviating mean reversion, we propose a peak-aware GRU framework (PAGF) that explicitly models the temporal structure of future facial motion. Inspired by hierarchical prediction strategies [23], PAGF decomposes forecasting into two stages: a peak planning stage, which estimates the timing and intensity of a salient motion peak together with a global motion direction, and a peak-conditioned trajectory generation stage, which produces short-horizon landmark trajectories through temporal gating and structured motion composition. In addition, we introduce peak-consistency and temporal-shape regularization to improve the preservation of peak-related facial dynamics and temporal alignment. Experiments under a subject-separated protocol show that the proposed framework achieves a more favorable distortion–dynamics trade-off than representative static and recurrent baselines.
The main contributions of this paper are as follows:
We establish a subject-separated benchmark protocol for short-horizon 3D facial landmark forecasting on the MEAD dataset and evaluate methods using distortion, dynamics, and temporal-alignment metrics.
We analyze the mean reversion effect induced by pointwise reconstruction objectives and quantify the resulting distortion–dynamics trade-off through systematic comparison with strong baselines.
We propose a peak-aware recurrent forecasting framework that decomposes prediction into peak planning and peak-conditioned trajectory generation, and we study the contribution of peak-consistency and temporal-shape regularization to dynamics preservation.
The experimental results show that the proposed method better preserves peak-related facial dynamics while maintaining competitive 24-step prediction error.
The remainder of this paper is organized as follows: Section 2 reviews the related work; Section 3 elaborates on the proposed method and framework; Section 4 presents the experimental settings, results, and analysis; Section 5 discusses the significance and limitations of the study; finally, Section 6 concludes the paper.
2. Related Work
This work lies at the intersection of facial expression modeling, landmark-based motion representation, temporal forecasting, and human–robot co-expression. In the following, we review the most relevant literature and clarify the main differences between our study and the existing lines of work.
Facial Expression Analysis and Generation. Early research on facial expression processing focused primarily on static recognition, training CNN-based models to classify discrete emotion categories or estimate continuous affective dimensions from single images or short clips [24,25]. With the advent of large-scale datasets and 3D face models, the research focus has gradually shifted towards dynamic expression modeling and generation, encompassing expression transfer, reenactment, and talking-face synthesis [26]. These methods typically operate in the image or video domain, leveraging techniques such as GANs [27] and diffusion models [28] to generate photorealistic frames conditioned on emotion labels, text, or speech. More recent studies have also highlighted temporally structured facial motion generation: timeline-based control has been introduced to enable finer temporal specification of facial actions, while autoregressive interactive head generation has been explored for more realistic conversational motion synthesis [15,16]. These methods are primarily developed for generation-oriented settings, whereas our work focuses on short-horizon forecasting of 3D facial landmark dynamics.
In parallel, another line of research employs 2D or 3D facial landmarks as an intermediate representation for expression generation and animation [29]. Compared to direct pixel-level generation, landmark trajectories offer compact, interpretable representations that are naturally compatible with the actuation requirements of physical robots or digital avatars [30]. Our study also adopts a landmark-based representation. However, a core distinction from existing work is that we explicitly formulate future expression generation as a temporal forecasting problem based on real-world talking-face data rather than learning a one-shot mapping from static conditions to full sequences.
Temporal Forecasting of Facial and Bodily Motion. Temporal forecasting has been extensively applied in human motion research, including trajectory prediction, human pose forecasting, and co-speech gesture generation [31]. Standard approaches typically employ recurrent neural networks (RNNs), temporal convolutions, or transformers to extrapolate future sequences from recent history. It is well established that, under naive mean squared error (MSE) objectives, such models tend to converge to the mean trajectory, exhibiting overly smooth or static behavior, a phenomenon known as "mean reversion" or "mode averaging" [19]. In the context of facial motion, several works have attempted to forecast future landmarks or blendshape coefficients for talking heads or avatars. However, most studies either assume relatively regular motion patterns or focus solely on specific expressions. Furthermore, they often lack systematic comparison with strong baselines, such as Copy-Last-Frame or Seq2Seq [32,33], on noisy real-world talking-face datasets (e.g., MEAD). This work revisits the facial expression forecasting problem on MEAD, explicitly contrasting the behavior of various baselines and peak-aware recurrent models through dual assessments of short-horizon and long-horizon performance.
Human–Robot Facial Co-Expression. The core objective of human–robot facial co-expression is to enable robots to generate facial or bodily expressions that are temporally aligned with a human partner, thereby enhancing empathy, interaction rapport, and overall experience quality [11]. A representative framework proposed by Hu et al. anticipates the human's smile apex approximately 800 ms in advance via a predictive model and then maps this anticipated state to robot actuation commands using an inverse model [14]. Despite its significant influence, this framework has notable limitations: it targets a single expression type and focuses primarily on the apex state, failing to cover the full dynamical trajectory from onset to peak and offset. Recent robotic face studies have also advanced physically deployed facial motion systems, including humanoid lip-motion synchronization and neural-driven animatronic face control under actuation constraints [17,18]. These studies highlight the importance of deployment-level validation and actuator-aware control in practical HRI. In contrast, our work remains focused on short-horizon visual forecasting of 3D facial landmark dynamics rather than speech-driven robot actuation or physical facial implementation.
Our study is inspired by the idea of anticipating salient future facial events but differs from existing apex-oriented co-expression frameworks in several important respects. First, we study short-horizon facial landmark forecasting on the MEAD dataset, which contains multiple expression categories and intensity levels rather than focusing on a single expression type, such as smiles in controlled episodes. Second, instead of predicting only an apex state followed by interpolation, we propose a peak-aware recurrent framework that decomposes forecasting into peak planning and peak-conditioned trajectory generation, enabling explicit modeling of peak timing, peak intensity, and short-horizon facial motion evolution. Third, we evaluate the proposed method under a subject-separated protocol against representative static, interpolation-based, and recurrent baselines using distortion, dynamics, and temporal-alignment metrics. In this sense, our work extends the anticipatory co-expression perspective from apex-level event prediction to short-horizon landmark-level facial motion forecasting.
3. Methods
As illustrated in Figure 1, the proposed framework consists of three main components: a shared temporal encoder, a peak-aware forecasting module, and a peak-conditioned control module. The encoder first extracts a latent representation from a short history of facial landmark sequences. Based on this representation, the peak-aware forecasting module predicts key motion attributes of the upcoming expression dynamics, including the peak timing, peak intensity, and a global motion direction. These peak-aware parameters guide the peak-conditioned control module, which generates short-horizon landmark trajectories through temporal gating and motion composition. Long-term facial motion is then obtained by autoregressively rolling the control module forward.
3.1. Problem Formulation
Let $\mathbf{Y} = (\mathbf{Y}_{t+1}, \dots, \mathbf{Y}_{t+S})$ denote the future landmark sequence to be predicted over a short horizon of length $S$, where $S$ denotes the number of future prediction steps. To characterize the strength of facial motion, we define the motion amplitude at step $s$ as
$$a_{s} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \mathbf{p}_{i,\,t+s} - \mathbf{p}_{i,\,t+s-1} \right\rVert_{2},$$
where $\mathbf{p}_{i,t}$ denotes the coordinates of the $i$-th landmark at time $t$. Based on the amplitude sequence, we define the ground-truth peak step within the prediction horizon as
$$s^{\ast} = \arg\max_{1 \le s \le S} a_{s},$$
and further define the corresponding normalized peak timing and peak intensity as
$$\tau^{\ast} = s^{\ast}/S, \qquad a^{\ast} = a_{s^{\ast}}.$$
The construction of these peak-aware targets is illustrated in Figure 2. These variables characterize the timing and intensity of the dominant motion peak within the short-horizon planning window and provide compact conditioning signals for subsequent trajectory generation.
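To make the target construction concrete, the following minimal NumPy sketch derives the peak-aware targets from a future landmark segment; the function name and the frame-to-frame amplitude definition are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def peak_targets(Y, S):
    """Construct peak-aware supervision targets from a future landmark
    segment Y of shape (S+1, N, 3), where Y[0] is the last observed frame.

    Returns (tau_star, a_star): the normalized peak timing in (0, 1] and
    the peak motion amplitude.
    """
    # Motion amplitude a_s: mean per-landmark displacement between
    # consecutive frames (one plausible reading of the definition above).
    diffs = np.linalg.norm(Y[1:] - Y[:-1], axis=-1)  # (S, N)
    a = diffs.mean(axis=1)                           # (S,)
    s_star = int(np.argmax(a)) + 1                   # peak step in 1..S
    tau_star = s_star / S                            # normalized timing
    a_star = float(a[s_star - 1])                    # peak intensity
    return tau_star, a_star
```

For example, a segment whose only motion occurs between the third and fourth future frames yields a peak step of 3 and a normalized timing of 3/S.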
3.2. Shared Temporal Encoder
Let $\mathbf{X} = (\mathbf{X}_{t-T+1}, \dots, \mathbf{X}_{t})$ denote the observed history of facial landmark configurations over the past $T$ frames, where each frame $\mathbf{X}_{t} \in \mathbb{R}^{N \times 3}$ contains the 3D coordinates of $N$ facial keypoints extracted from the input video. To model temporal dependencies in facial motion, each landmark frame is first reshaped into a vector representation $\mathbf{x}_{t} \in \mathbb{R}^{3N}$. The resulting sequence $(\mathbf{x}_{t-T+1}, \dots, \mathbf{x}_{t})$ is then fed into a GRU encoder [34] to obtain a compact representation of recent expression dynamics:
$$\mathbf{H} = \mathrm{GRU}\left(\mathbf{x}_{t-T+1}, \dots, \mathbf{x}_{t}\right) \in \mathbb{R}^{d},$$
where $\mathbf{H}$ denotes the final hidden state summarizing the short-term facial motion context and $d$ is the hidden dimension of the GRU. This latent representation serves as a shared motion context for the subsequent peak-aware forecasting and trajectory generation process.
3.3. Peak-Aware Forecasting
Given the latent representation $\mathbf{H}$ produced by the shared temporal encoder, we first predict a set of peak-aware motion attributes that characterize the upcoming facial motion dynamics. Rather than directly extrapolating high-dimensional landmark trajectories, we estimate several intermediate control variables that explicitly describe when the next salient motion event will occur, how strong it will be, and along which geometric direction the facial landmarks are expected to evolve. Concretely, the peak-aware forecasting module predicts the peak timing $\hat{\tau}$, the peak intensity $\hat{a}$, and a global motion direction $\hat{\mathbf{d}}$.
The peak timing is predicted from the latent state $\mathbf{H}$ by a timing head that models the normalized prediction horizon using $K$ discretized temporal bins:
$$\mathbf{z} = f_{\tau}(\mathbf{H}) \in \mathbb{R}^{K},$$
where $f_{\tau}$ denotes the timing prediction head and $\mathbf{z}$ are the logits associated with the temporal bins. The corresponding decoded timing estimate is denoted by $\hat{\tau}$, which represents the normalized location of the next peak within the prediction horizon.
In parallel, the peak intensity is predicted as
$$\hat{a} = f_{a}(\mathbf{H}),$$
where $f_{a}$ denotes the intensity prediction head and $\hat{a}$ represents the predicted motion amplitude at the peak instant. These two variables provide explicit temporal and magnitude priors for the subsequent trajectory generation process.
To further characterize the spatial pattern of the forthcoming facial motion, we infer a global motion direction from the same latent representation:
$$\hat{\mathbf{d}} = \frac{f_{d}(\mathbf{H})}{\lVert f_{d}(\mathbf{H}) \rVert_{2} + \epsilon},$$
where $f_{d}$ denotes a projection from the latent space to the landmark displacement space and $\epsilon$ is a small constant for numerical stability. This normalization removes the magnitude component and allows $\hat{\mathbf{d}}$ to encode only the dominant spatial direction of the forthcoming facial deformation. Taken together, $\hat{\tau}$, $\hat{a}$, and $\hat{\mathbf{d}}$ constitute the peak-aware control signals used by the subsequent peak-conditioned control module, which generates a short-horizon motion trajectory by modulating the global motion direction with a temporal gate.
3.4. Peak-Conditioned Control
Given the peak-aware motion attributes predicted from the latent representation, we generate future facial motion through a peak-conditioned control mechanism. Instead of directly regressing the landmark coordinates at each future step, we explicitly factorize the predicted motion into three components: a temporal gate controlling the evolution over time, a peak intensity determining the motion magnitude, and a global motion direction specifying the geometric deformation pattern. This factorized design provides a structured and interpretable formulation for short-horizon trajectory generation.
For each future step $s \in \{1, \dots, S\}$, we first construct a step-dependent conditioning variable based on the normalized temporal index and the predicted peak-aware attributes. In practice, the temporal gate is generated by a lightweight recurrent decoder conditioned on the current step, the peak-aware variables, and the motion context:
$$g_{s} = f_{g}\left(s/S,\ \hat{\tau},\ \hat{a},\ \mathbf{H}\right),$$
where $f_{g}$ denotes the temporal-gating function and $g_{s}$ denotes the gate value at step $s$. Intuitively, $g_{s}$ models how strongly the facial motion should evolve at each future step under the guidance of the predicted peak timing and motion intensity.
Given the temporal gate $g_{s}$, the predicted peak intensity $\hat{a}$, and the global motion direction $\hat{\mathbf{d}}$, the motion increment at time $t+s$ is computed as
$$\Delta\mathbf{X}_{t+s} = g_{s}\,\hat{a}\,\hat{\mathbf{d}}.$$
The predicted facial landmark configuration is then obtained by adding the motion increment to the current frame $\mathbf{X}_{t}$:
$$\hat{\mathbf{X}}_{t+s} = \mathbf{X}_{t} + \Delta\mathbf{X}_{t+s}.$$
In this way, the future facial trajectory is generated as a peak-conditioned deformation process, where the temporal evolution is governed by $g_{s}$, the motion magnitude is controlled by $\hat{a}$, and the geometric displacement pattern is constrained by $\hat{\mathbf{d}}$.
3.5. Autoregressive Rollout
The peak-conditioned control module predicts a short future landmark segment of length $S$. To obtain a longer prediction horizon, we deploy the model in a rolling autoregressive manner. At rollout iteration $r$, the model takes the most recent history window of length $T$ as input and predicts a short future segment $\hat{\mathbf{Y}}^{(r)}$. This predicted segment is then appended to the history buffer, while the oldest frames are discarded so that the input length remains fixed. The updated history is subsequently re-encoded by the shared temporal encoder, and the same peak-aware forecasting process is repeated. By iterating this procedure, the model produces a long-horizon facial landmark trajectory up to the target horizon $S_{\mathrm{total}}$.
Formally, if the history window before rollout iteration $r$ is denoted by $\mathbf{X}^{(r)}$, then the model predicts a segment $\hat{\mathbf{Y}}^{(r)}$ and updates the history window as
$$\mathbf{X}^{(r+1)} = \left[\mathbf{X}^{(r)},\ \hat{\mathbf{Y}}^{(r)}\right]_{-T:},$$
where only the most recent $T$ frames are retained after each rollout. In our implementation, the rolling prediction process is repeated until reaching the target horizon of $S_{\mathrm{total}} = 24$ steps.
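The rolling procedure can be summarized concisely; here `predict_segment` is a hypothetical stand-in for the full encoder–forecasting–control pipeline, and the buffer logic mirrors the update rule above.

```python
import numpy as np

def rollout(predict_segment, history, S_total):
    """Rolling autoregressive deployment (sketch).

    predict_segment: maps a history window (T, N, 3) to a segment (S, N, 3)
    history        : (T, N, 3) initial observed window
    S_total        : target long-horizon length in frames
    """
    history = history.copy()
    T = history.shape[0]
    outputs = []
    while sum(seg.shape[0] for seg in outputs) < S_total:
        seg = predict_segment(history)                  # (S, N, 3)
        outputs.append(seg)
        # Append the prediction and keep only the last T frames.
        history = np.concatenate([history, seg])[-T:]
    return np.concatenate(outputs)[:S_total]
```

Because each iteration re-encodes a window containing the model's own predictions, this loop also exposes error accumulation, which the evaluation protocol in Section 4 exploits.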
3.6. Training Objectives
The proposed model is trained to jointly optimize short-horizon trajectory accuracy and the reliability of the predicted peak-aware motion attributes. To this end, we employ a composite objective consisting of a control term, a peak prediction term, a peak-integrity term, and a correlation regularization term for high-dynamic samples.
For short-horizon trajectory supervision, let $\hat{\mathbf{D}}_{s}, \mathbf{D}_{s} \in \mathbb{R}^{N \times 3}$ denote the predicted and ground-truth landmark displacement matrices at step $s \in \{1, \dots, S\}$, where $N$ is the number of facial landmarks and $S$ is the short prediction horizon. We define the control loss as
$$\mathcal{L}_{\mathrm{ctrl}} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{amp}}\,\mathcal{L}_{\mathrm{amp}} + \lambda_{\mathrm{sm}}\,\mathcal{L}_{\mathrm{sm}},$$
where $\mathcal{L}_{\mathrm{rec}} = \frac{1}{S}\sum_{s=1}^{S}\lVert \hat{\mathbf{D}}_{s} - \mathbf{D}_{s} \rVert_{F}^{2}$ is a Frobenius-norm reconstruction term on landmark coordinates, $\mathcal{L}_{\mathrm{amp}}$ enforces consistency of the motion-amplitude sequence, and $\mathcal{L}_{\mathrm{sm}}$ regularizes temporal smoothness through first-order motion differences.
To encourage accurate prediction of the key peak-aware attributes, we further supervise the estimated peak timing and peak intensity. Let $\tau^{\ast}$ and $a^{\ast}$ denote the ground-truth normalized peak timing and peak intensity, respectively. The peak prediction loss is defined as
$$\mathcal{L}_{\mathrm{peak}} = \mathcal{L}_{\mathrm{time}} + \lambda_{a}\,\mathcal{L}_{\mathrm{int}},$$
where $\mathcal{L}_{\mathrm{time}}$ is a classification loss over discretized temporal bins for peak timing and $\mathcal{L}_{\mathrm{int}}$ penalizes the error in the predicted peak intensity.
To preserve the structural fidelity of the predicted facial configuration around the most salient motion event, we introduce a peak-integrity loss evaluated at the ground-truth peak step $s^{\ast}$:
$$\mathcal{L}_{\mathrm{integ}} = \lVert \hat{\mathbf{X}}_{t+s^{\ast}} - \mathbf{X}_{t+s^{\ast}} \rVert_{F}^{2}.$$
In addition, for high-dynamic samples, we impose a correlation regularization term on the predicted and ground-truth motion-amplitude sequences,
$$\mathcal{L}_{\mathrm{corr}} = 1 - \rho\left(\hat{\mathbf{a}}, \mathbf{a}\right),$$
where $\rho(\cdot,\cdot)$ denotes the Pearson correlation coefficient.
The final training objective is given by
$$\mathcal{L} = \mathcal{L}_{\mathrm{ctrl}} + \lambda_{\mathrm{peak}}\,\mathcal{L}_{\mathrm{peak}} + \lambda_{\mathrm{integ}}\,\mathcal{L}_{\mathrm{integ}} + \lambda_{\mathrm{corr}}\,\mathcal{L}_{\mathrm{corr}}.$$
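As a concrete illustration of the trajectory-level supervision, the sketch below implements a plausible form of the control loss and the Pearson-based correlation term in NumPy; the weighting and reduction choices are our assumptions, not the paper's exact configuration.

```python
import numpy as np

def pearson(x, y, eps=1e-8):
    """Pearson correlation of two 1-D sequences."""
    x = x - x.mean()
    y = y - y.mean()
    return float((x * y).sum() / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

def corr_loss(a_hat, a):
    """Correlation regularizer 1 - rho on motion-amplitude sequences."""
    return 1.0 - pearson(a_hat, a)

def control_loss(D_hat, D, lam_amp=1.0, lam_sm=1.0):
    """Composite short-horizon control loss (illustrative weights).

    D_hat, D: (S, N, 3) predicted / ground-truth displacement sequences.
    """
    # Frobenius-norm reconstruction term, averaged over steps.
    rec = ((D_hat - D) ** 2).sum(axis=(1, 2)).mean()
    # Amplitude-consistency term on mean per-landmark displacement norms.
    a_hat = np.linalg.norm(D_hat, axis=-1).mean(axis=1)
    a_gt = np.linalg.norm(D, axis=-1).mean(axis=1)
    amp = ((a_hat - a_gt) ** 2).mean()
    # Smoothness term on first-order differences of the prediction.
    sm = (np.diff(D_hat, axis=0) ** 2).sum(axis=(1, 2)).mean()
    return rec + lam_amp * amp + lam_sm * sm
```

Note that the smoothness term penalizes the prediction's own motion differences, so it acts as a regularizer even when reconstruction is perfect.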
4. Experiments
4.1. Experimental Setup
Dataset and Preprocessing. We evaluate our framework on the MEAD talking-face dataset [22], which contains high-quality audio-visual recordings of 60 actors expressing 8 emotions at 3 intensity levels. In this study, rather than pursuing large-scale identity coverage, we adopt a strictly controlled subject-independent setting to systematically analyze the mean reversion phenomenon in short-horizon facial forecasting. Specifically, we use continuous landmark sequences from Actors 1–5 for training and strictly hold out Actor 6 for testing. Although this protocol operates on a subset of the full MEAD diversity, it provides a clean, highly focused testbed. This design explicitly isolates motion modeling from identity overfitting, allowing us to rigorously verify whether the proposed peak-aware mechanism can preserve high-frequency expressive dynamics under cross-subject transfer without relying on identity memorization.
Evaluation Protocol. We evaluate the model in a long-horizon forecasting setting with a target prediction horizon of 24 frames (approximately 0.8 s). In our implementation, we adopt a rolling autoregressive strategy in which the model predicts a short future segment from the current history window, appends the predicted segment to the history buffer, and discards the oldest frames so that the input length remains fixed. The updated history is then fed back into the model for the next rollout iteration. Repeating this procedure yields a full predicted trajectory up to the target horizon. This evaluation protocol exposes the model to its own previous predictions and therefore tests its resistance to error accumulation over time. Unless otherwise stated, we report results on the held-out Actor 6 (unseen during training) while using Actors 1–5 for training. All compared methods share the same data preprocessing pipeline, sequence sampling strategy, and optimization schedule to ensure a fair comparison.
4.2. Compared Methods and Evaluation Metrics
We intentionally restrict our baselines to recurrent and kinematic models to align with the real-time constraints of anticipatory co-expression. Although large-scale architectures like diffusion and transformer models achieve state-of-the-art offline synthesis, their high computational complexity precludes lightweight autoregressive control. Focusing on methodologically comparable models ensures a fair and practically meaningful evaluation for short-horizon online forecasting.
Copy-Last-Frame [19]: A zero-velocity baseline that repeats the last observed history frame for all future steps. While trivial, it serves as a strong baseline for MSE due to the prevalence of static or slow-moving segments in talking-face data.
Linear extrapolation [35]: A kinematic baseline that assumes constant velocity, extrapolating motion based on the difference between the last two history frames. This baseline highlights the non-linearity of facial dynamics.
GRU Seq2Seq [32]: A standard encoder–decoder gated recurrent unit (GRU) network trained with a pointwise MSE loss. This represents the generic sequence modeling approach without specific structural priors for facial peaks.
Hu-style Apex+Interp [11]: An implementation of the anticipatory co-expression framework. It predicts a single future apex state (time and intensity) and generates the trajectory via linear interpolation. This serves as the primary state-of-the-art comparison for peak-based generation.
PAGF-base: Our base peak-aware model incorporates the decoupled planning head and the gated directional control head. It is trained with the standard trajectory reconstruction objective and peak supervision but without the additional peak-integrity and correlation regularization terms used in the final model.
PAGF+Corr (Final): The complete framework builds upon PAGF-base by adding the peak-integrity loss and the shape-aware correlation loss on high-dynamic segments. This variant is designed to improve temporal alignment and dynamic expressiveness without sacrificing trajectory fidelity.
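The two training-free baselines can be stated precisely in a few lines of NumPy, which also clarifies why Copy-Last-Frame is so competitive under MSE on largely static segments.

```python
import numpy as np

def copy_last_frame(history, S):
    """Zero-velocity baseline: repeat the last observed frame S times.
    history: (T, N, 3) observed landmark window."""
    return np.repeat(history[-1:], S, axis=0)

def linear_extrapolation(history, S):
    """Constant-velocity baseline: extrapolate the difference between
    the last two observed frames over S future steps."""
    v = history[-1] - history[-2]                  # (N, 3) velocity
    steps = np.arange(1, S + 1).reshape(-1, 1, 1)  # (S, 1, 1)
    return history[-1] + steps * v
```

On near-static segments the zero-velocity output is already close to the ground truth, whereas the constant-velocity output diverges linearly with the horizon, matching the behaviors reported in Section 4.4.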
To quantify temporal dynamics beyond pointwise distortion, we report three complementary metrics: jerk ratio, AmpCorr, and lag. Given the ground-truth trajectory $\mathbf{Y}$ and the predicted trajectory $\hat{\mathbf{Y}}$, we first define the discrete jerk energy as
$$J(\mathbf{Y}) = \sum_{s} \left\lVert \mathbf{Y}_{s+3} - 3\mathbf{Y}_{s+2} + 3\mathbf{Y}_{s+1} - \mathbf{Y}_{s} \right\rVert^{2},$$
and compute the corresponding jerk ratio by
$$\mathrm{JerkRatio} = J(\hat{\mathbf{Y}}) / J(\mathbf{Y}),$$
where values closer to 1 indicate better preservation of dynamic variation.
To characterize the temporal evolution of facial motion intensity, we define the frame-level motion-amplitude sequences as $a_{s} = \frac{1}{N}\sum_{i=1}^{N}\lVert \mathbf{p}_{i,s} - \mathbf{p}_{i,s-1} \rVert_{2}$ for the ground truth and $\hat{a}_{s}$ analogously for the prediction. Based on these sequences, the amplitude correlation is measured by
$$\mathrm{AmpCorr} = \rho\left(\hat{\mathbf{a}}, \mathbf{a}\right),$$
while the temporal synchronization error is defined as
$$\mathrm{lag} = \lvert \delta^{\ast} \rvert.$$
Here, $\delta^{\ast}$ denotes the optimal temporal shift that maximizes the cross-correlation between the predicted and ground-truth motion-amplitude sequences, and the lag metric is the absolute value of this shift.
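For reproducibility of the metric definitions, the following NumPy sketch implements the three dynamics metrics; the third-order-difference jerk and the integer-shift search for the lag are plausible discretizations rather than the paper's exact implementation.

```python
import numpy as np

def jerk_energy(traj):
    """Sum of squared third-order finite differences of a (T, N, 3) trajectory."""
    jerk = np.diff(traj, n=3, axis=0)
    return float((jerk ** 2).sum())

def jerk_ratio(pred, gt, eps=1e-8):
    """Predicted-over-ground-truth jerk energy; values near 1 are best."""
    return jerk_energy(pred) / (jerk_energy(gt) + eps)

def amplitude(traj):
    """Frame-level motion amplitude: mean landmark displacement per frame."""
    return np.linalg.norm(np.diff(traj, axis=0), axis=-1).mean(axis=1)

def amp_corr(pred, gt, eps=1e-8):
    """Pearson correlation of predicted and ground-truth amplitude sequences."""
    a_hat, a = amplitude(pred), amplitude(gt)
    a_hat = a_hat - a_hat.mean()
    a = a - a.mean()
    return float((a_hat * a).sum()
                 / (np.linalg.norm(a_hat) * np.linalg.norm(a) + eps))

def lag(pred, gt, max_shift=8):
    """Absolute value of the integer shift maximizing the cross-correlation
    of the centered motion-amplitude sequences."""
    a_hat, a = amplitude(pred), amplitude(gt)
    best, best_c = 0, -np.inf
    for k in range(-max_shift, max_shift + 1):
        if k >= 0:
            x, y = a_hat[k:], a[:len(a) - k]
        else:
            x, y = a_hat[:len(a_hat) + k], a[-k:]
        if len(x) < 2:
            continue
        c = float(np.dot(x - x.mean(), y - y.mean()))
        if c > best_c:
            best_c, best = c, k
    return abs(best)
```

A prediction identical to the ground truth yields a jerk ratio of 1, an AmpCorr of 1, and a lag of 0; delaying the motion burst by two frames yields a lag of 2.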
4.3. Implementation Details
To ensure a fair comparison, all learned models share an identical GRU encoder–decoder backbone. The encoder and decoder each use a single GRU layer with a hidden dimension of $d$. The planning head comprises two-layer multilayer perceptrons (MLPs) with a hidden size of 256 for both the peak-time classification (logits over temporal bins) and the peak-intensity regression. The motion head is implemented as a linear projection mapping the latent space to the landmark displacement space.
Training was performed using the AdamW optimizer [36,37] with a batch size of 32 for 200 epochs; the initial learning rate and weight decay are listed with the other hyperparameters in Table 1. Early stopping was applied based on the validation MSE at the 24th step (MSE@24) to prevent overfitting. For the shape-aware correlation loss, the high-dynamic threshold was determined as the 75th percentile of the amplitude standard deviation computed over the control windows of the training set. All experiments were implemented in PyTorch 2.10.0 and executed on a single NVIDIA RTX 4060 GPU. The detailed hyperparameters and configuration are summarized in Table 1.
4.4. Long-Horizon Distortion Analysis
We emphasize that pointwise MSE can favor degenerate near-static rollouts on MEAD because a substantial portion of sequences exhibit limited facial motion within short horizons. In such cases, repeating the last frame yields low distortion but fails to reproduce the temporal evolution of expressions. Therefore, we treat MSE as a distortion measure rather than a proxy for expressive motion quality and complement it with dynamics and synchrony metrics. We rigorously evaluate the long-horizon forecasting capability by measuring the mean squared error (MSE) over a 24-frame rollout trajectory. The results, detailing global MSE and region-specific jaw/mouth MSE, are presented in Table 2.
A critical observation is that the trivial Copy-Last-Frame baseline achieves the lowest global MSE among all methods (Table 2). This counter-intuitive result confirms the severity of the "mean reversion" problem in talking-face datasets: since facial expressions are sparse and often return to a neutral state, a static predictor statistically minimizes the error, albeit at the cost of freezing all motion. In contrast, the linear extrapolation baseline exhibits rapid divergence, highlighting the non-linear nature of facial dynamics and the inadequacy of simple kinematic assumptions for long-term prediction.
Among data-driven approaches, GRU Seq2Seq and Hu-style Apex+Interp achieve reasonable distortion levels but still incur higher errors than the static baseline. This indicates that generic recurrent models struggle to balance motion generation with trajectory stability. Our proposed PAGF-base effectively suppresses the artifacts of unconstrained generation, achieving a low distortion comparable to Copy-Last-Frame. Building on this, the final PAGF+Corr variant maintains a competitive distortion profile in both global and jaw/mouth MSE (Table 2), only marginally higher than the static baseline. Crucially, as discussed in the subsequent dynamics analysis, this slight increase in MSE is a necessary trade-off to restore the dynamic vitality and temporal alignment that static baselines completely lack.
4.5. Dynamics and Perception–Distortion Trade-Off
Low numerical distortion (MSE) does not guarantee high perceptual quality. In fact, it often implies a collapse to static mean poses. To quantify this trade-off, we evaluate three complementary metrics: (i) jerk ratio, serving as a proxy for motion smoothness and vitality; (ii) AmpCorr, measuring the shape alignment between predicted and ground-truth intensity profiles; and (iii) lag, indicating temporal synchronization accuracy.
As summarized in Table 3, the baselines exhibit extreme behaviors. Copy-Last-Frame achieves near-zero distortion but zero dynamic energy, representing a trivial static solution. Conversely, standard GRU Seq2Seq improves correlation but introduces excessive jitter and higher spatial error. Our proposed PAGF+Corr successfully navigates this trade-off: it restores natural dynamic energy (jerk ratio comparable to the ground truth) and significantly improves temporal alignment compared to static baselines, validating that the peak-aware planning mechanism effectively steers generation away from mean reversion without inducing unstable artifacts.
However, we also find that the absolute magnitude of AmpCorr is relatively small across methods. This is expected because (i) many MEAD clips remain close to a neutral state, yielding low-variance amplitude sequences that suppress Pearson correlation values, and (ii) minor temporal shifts around expression transitions can substantially reduce correlation even when the overall motion magnitude is plausible. Consequently, we additionally report the correlation distribution on a high-dynamic subset (Figure 3) to isolate segments with expressive motion, where temporal-shape alignment becomes more discriminative.
4.6. High-Dynamic Analysis and Qualitative Visualization
To rigorously assess the model's capability in capturing expressive motion, we isolate a high-dynamic subset in which the standard deviation of the ground-truth amplitude exceeds the 75th-percentile threshold defined in Section 4.3. A comprehensive analysis of this subset is presented in Figure 3, Figure 4, Figure 5 and Figure 6.
Quantitative Trends (Figure 3 and Figure 4). The mean amplitude trajectory plot (Figure 3) reveals that, while Seq2Seq (blue) suffers from amplitude decay over time, PAGF+Corr (red) robustly maintains the expression intensity, closely tracking the ground-truth band. The amplitude correlation distribution (Figure 4) further confirms that our model achieves a consistently higher median correlation than the baselines, reducing the variance of poor predictions.
Qualitative and Frame-Wise Analysis (Figure 5 and Figure 6). To further examine the predicted dynamics on a representative high-intensity expression, we provide both qualitative landmark overlays and frame-wise mouth-region error curves. As shown in Figure 5, the Copy-Last baseline remains close to a static estimate and fails to capture the onset of the motion, while the generic Seq2Seq model initiates the movement but produces an attenuated peak, consistent with the over-smoothing tendency observed in the quantitative analysis. By contrast, PAGF+Corr more closely matches the ground-truth trajectory in both peak timing and motion magnitude, resulting in a more faithful temporal evolution of the expression. This tendency is further supported by Figure 6, where PAGF+Corr maintains a more favorable mouth-region error profile over most rollout steps, particularly around the main dynamic transition region.
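A frame-wise mouth-region error curve of this kind can be computed as follows. The sketch assumes the common iBUG 68-point layout, in which indices 48–67 cover the mouth; if a different landmark convention is used, the slice must be adjusted accordingly.

```python
import numpy as np

# Assumed iBUG 68-point convention: mouth landmarks occupy indices 48-67.
MOUTH = slice(48, 68)

def mouth_error_curve(pred, gt):
    """Per-frame mean Euclidean error over the mouth landmarks, yielding the
    frame-wise curve used in rollout analysis. Shapes: (T, 68, 2) -> (T,)."""
    err = np.linalg.norm(pred[:, MOUTH] - gt[:, MOUTH], axis=-1)  # (T, 20)
    return err.mean(axis=-1)

# Sanity check: shifting only the mouth points by (1, 1) produces a constant
# error of sqrt(2) per frame.
gt = np.zeros((10, 68, 2))
pred = gt.copy()
pred[:, 48:68] += 1.0
curve = mouth_error_curve(pred, gt)
assert curve.shape == (10,) and np.allclose(curve, np.sqrt(2))
```

Plotting this curve per rollout step is what separates a model that tracks the dynamic transition from one whose error balloons exactly where the expressive motion occurs.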
5. Discussion
The Perception–Distortion Trade-Off. Our empirical results on the MEAD dataset reveal a fundamental conflict in facial expression forecasting: optimizing for numerical fidelity (MSE) often comes at the expense of dynamic expressiveness. The trivial Copy-Last-Frame baseline achieves the lowest global MSE, creating a “numerical ceiling” that learned models struggle to surpass. This confirms that, under the assumption of minimal motion energy, which holds for the majority of a talking-face video, the statistically optimal strategy for minimizing loss is to predict no motion. However, in the context of human–robot interaction (HRI), such a strategy is catastrophic: it results in a “zombie-like” agent that appears unresponsive to human emotional cues. Conversely, standard autoregressive models (e.g., GRU Seq2Seq) attempt to model dynamics but often suffer from “mean reversion” (or posterior collapse), producing trajectories that drift towards the average face, limiting the robot’s ability to display high-intensity empathetic expressions.
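The "numerical ceiling" argument is easy to verify with a toy example. A minimal sketch of the copy-last-frame baseline (function name and shapes are illustrative): on a low-motion segment, repeating the last observed frame already achieves a tiny MSE despite carrying zero motion.

```python
import numpy as np

def copy_last_frame(history, horizon):
    """Trivial baseline: repeat the last observed frame across the whole
    prediction horizon. history: (T_in, L, 2) -> (horizon, L, 2)."""
    return np.repeat(history[-1:], horizon, axis=0)

rng = np.random.default_rng(3)
neutral = rng.normal(size=(68, 2))
# A mostly-neutral future segment with only small residual motion.
future = neutral + 0.01 * rng.normal(size=(10, 68, 2))
pred = copy_last_frame(neutral[None], horizon=10)
mse = float(np.mean((pred - future) ** 2))
assert mse < 1e-3   # near-zero distortion from a zero-motion prediction
```

Because most talking-face frames sit near this low-motion regime, any learned model trained purely on MSE is pushed toward the same static solution, which is precisely the mean-reversion failure discussed above.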
Efficacy of Peak-Aware Planning. The proposed peak-aware GRU framework (PAGF) addresses this dichotomy by decoupling the “what/when” (planning) from the “how” (control). By explicitly predicting the timing and intensity of the upcoming expression peak, the model injects a strong high-level structural prior into the generation process. This prevents the decoder from collapsing to the mean, as it is strictly conditioned to reach a specific target state. Our ablation study demonstrates that, while geometric constraints alone are insufficient to cure static behaviors, the introduction of peak integrity and shape-aware correlation losses significantly boosts the jerk energy and amplitude correlation. This suggests that, for robotic expression generation, supervising the shape and trend of the motion is as critical as supervising the absolute coordinate positions.
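To make the loss design concrete, here is a hedged sketch of what a shape-aware correlation term and a peak-integrity term could look like on amplitude sequences. The exact formulations and weightings used in PAGF are not reproduced; the function names are assumptions for illustration.

```python
import numpy as np

def corr_loss(pred_amp, gt_amp, eps=1e-8):
    """Shape-aware correlation loss: 1 - Pearson r between predicted and
    ground-truth amplitude sequences. A flat (mean-reverted) prediction has
    zero covariance with the target and is maximally penalised."""
    p = pred_amp - pred_amp.mean()
    g = gt_amp - gt_amp.mean()
    r = (p * g).sum() / (np.sqrt((p**2).sum() * (g**2).sum()) + eps)
    return 1.0 - float(r)

def peak_integrity_loss(pred_amp, gt_amp):
    """Peak-integrity term: squared error between sequence maxima,
    discouraging attenuated expression peaks."""
    return float((pred_amp.max() - gt_amp.max()) ** 2)

t = np.linspace(0, np.pi, 20)
gt_amp = np.sin(t)                              # one clear expression peak
flat = np.full_like(gt_amp, gt_amp.mean())      # mean-reverted prediction
assert corr_loss(gt_amp, gt_amp) < 1e-6         # perfect shape -> ~0 loss
assert corr_loss(flat, gt_amp) > 0.9            # flat output -> heavy penalty
assert peak_integrity_loss(flat, gt_amp) > 0.1  # attenuated peak is penalised
```

Note that both terms are invisible to a pointwise MSE: the flat prediction above can have a modest MSE yet incurs near-maximal correlation loss, which is the mechanism by which such terms steer training away from static outputs.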
Computational Cost and Practical Considerations. Beyond prediction quality, the computational implications of the proposed framework should also be considered. Compared with static or single-step baselines, PAGF introduces additional overhead through peak-aware forecasting and rolling autoregressive control. However, the model remains structurally lightweight, relying on a GRU-based encoder–decoder and short control segments rather than large-scale generative architectures. In the present study, we evaluate the method in an offline benchmark setting and therefore do not claim a fully deployed real-time robotic implementation. A more complete assessment of inference latency, rollout cost, and robot-in-the-loop execution remains an important direction for future work.
Limitations. Despite the promising results, the current framework has several limitations. First, it relies solely on visual history, whereas facial expressions in conversation are often correlated with speech prosody and semantic context. In situations where visual cues are weak or ambiguous, such as the onset of a surprise reaction before substantial muscle movement, the model may struggle to predict the exact peak timing. Second, the present evaluation focuses on landmark trajectories; mapping these kinematic plans to the physical constraints of specific servo-driven robot faces remains a downstream engineering challenge. Recent robotic face studies further suggest that bridging facial kinematics to practical HRI deployment requires actuator-level mapping, latency-sensitive synchronization, and physical robot validation [17,18]. In addition, the current formulation models only the dominant peak within each short prediction horizon. While this approximation is suitable for the present short-horizon setting, more complex facial motion patterns with multiple salient peaks would require an extended formulation, such as multi-event planning or hierarchical temporal modeling. More broadly, although the proposed framework is motivated by anticipatory facial co-expression in HRI, the present study evaluates this capability through offline landmark-based proxy metrics rather than end-to-end robot deployment or user studies. Accordingly, the reported gains indicate improved temporal synchronization and dynamic fidelity at the kinematic level, but they should be interpreted as preliminary evidence rather than a definitive validation of interaction-level benefits. Future work will further examine these advantages through robot-in-the-loop experiments and user-level interaction studies.
6. Conclusions
In this paper, we presented a systematic study of short-horizon facial expression forecasting, a critical capability for anticipatory human–robot co-expression. We identified and quantified the severe mean-reversion phenomenon induced by standard MSE training on the MEAD dataset. To overcome this, we proposed the peak-aware GRU framework (PAGF), a hierarchical architecture that explicitly models short-horizon peak timing and intensity and generates structured trajectories via peak-conditioned directional gating. Our experiments demonstrate that PAGF successfully navigates the perception–distortion trade-off. Unlike static baselines that minimize error by freezing motion, our method generates vibrant synchronized facial dynamics that closely match ground-truth intensity profiles while maintaining a competitive distortion error relative to generic Seq2Seq models. This work provides a robust baseline for data-driven robot expression control, highlighting the importance of structural priors in modeling stochastic human behaviors.
In future work, we plan to extend this framework to a multimodal setting by incorporating audio and text modalities to improve the anticipation horizon. Additionally, we aim to deploy the generated trajectories on a physical facial expression robot to evaluate the impact of anticipatory co-expression on user trust and engagement in real-world interactions.