Article

Towards Realistic Human Motion Prediction with Latent Diffusion and Physics-Based Models

1 School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523820, China
2 School of Artificial Intelligence, Dongguan City University, Dongguan 523419, China
3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 605; https://doi.org/10.3390/electronics14030605
Submission received: 25 December 2024 / Revised: 24 January 2025 / Accepted: 1 February 2025 / Published: 4 February 2025

Abstract

Many applications benefit from the prediction of 3D human motion based on past observations, e.g., human–computer interaction and autonomous driving. However, while existing encoding–decoding methods achieve good performance, prediction over a horizon of seconds still suffers from accumulated errors and a scarcity of motion switching. In this paper, we propose a Latent Diffusion and Physical Principles Model (LDPM) to achieve accurate human motion prediction. Our framework predicts human motion by learning information about the latent space, generating motion from noise, and incorporating physical control of body motion, where physical principles estimate the next frame through the Euler–Lagrange equation. The framework effectively accomplishes motion switching and reduces the error accumulated over time. The proposed architecture is evaluated on three challenging datasets: Human3.6M (Human 3D Motion Capture Dataset), HumanEva-I (Human Evaluation dataset I), and AMASS (Archive of Motion Capture as Surface Shapes). We experimentally demonstrate the significant superiority of the proposed framework over the prediction range of seconds.

1. Introduction

The task of human motion prediction focuses on predicting future motion sequences by observing past motion, and it has many applications, such as autonomous driving [1], intelligent robotics [2], and animation [3]. Traditional animation production often requires animators to draw movements frame by frame, a process that takes a lot of time and effort. By exploiting physical laws, human motion prediction can generate realistic and plausible human motion sequences, which allows animators to work more efficiently and enhances the user's sense of immersion. When the user interacts virtually, the system determines the movement trend of the human body and provides timely and accurate information, e.g., predicting the patient's movement trajectory to give appropriate guidance during rehabilitation training. The predicted motions need to be continuous, diverse, and realistic, which makes this task complex. Traditional methods include Gaussian models [4] and restricted Boltzmann machines [5], which encode simple representations of body motion as latent representations that are then decoded into predictions [6,7,8], but accurate prediction requires richer representations of motion.
Recently, motion data have begun to be used widely, and researchers have made great progress. Researchers have proposed various deep learning models to model motion sequences, such as models based on recurrent neural networks (RNNs) [9,10], convolutional neural networks (CNNs) [11,12,13], and transformers [14,15]. More intuitively, the model shown in Figure 1a utilizes graph convolutional layers to process observation sequences, create motion embeddings, and decode them into future prediction sequences. This is a fully data-driven approach focused on learning motion patterns from structured input. Variants of generative adversarial networks (GANs) [16,17] and Variational Auto-Encoders (VAEs) [18] have been used to model data distributions [19,20]. Prior encoding–decoding methods address the human motion prediction problem by predicting future motions based on existing motion frames, but different sampling phases during training increase the computational effort. While these methods perform well in some cases, such as when distinguishing basic actions, they face challenges in real-world applications that require a wider diversity of motions, such as the transition from walking to sitting. Although these methods can capture complex motions, they often ignore the fact that human motion is governed by physical principles and that the human body follows the laws of physics. Therefore, while good prediction results can be achieved, methods that rely on hyperparameter balancing and loss constraints often have difficulty producing consistent, applicable results. This highlights the need to carefully design parameters that balance the various loss functions and training requirements. The core argument of this paper is that by incorporating the basic principles of motion diffusion and physical motion into the learning process, prediction accuracy can be significantly improved, especially over short time intervals.
The human body is a physical system in which multiple body parts work in concert to form complex motions. Muscle strength and exercise habits vary greatly from person to person, and the same person can move differently under different environmental conditions and emotional states. From a physical point of view, accurately modeling the rotation of joints is not a simple task. Since human movement couples multiple degrees of freedom, accurate prediction becomes much harder, e.g., the coordination of hands and feet during walking. Given the joint angles of the human body, the Euler–Lagrange equation [21] describes human motion through generalized positional coordinates and second-order ordinary differential equations (ODEs), imposing constraints that improve the quality of human motion prediction [22,23,24], and motion diffusion models provide motion information to the prediction from random noise to improve its performance [25]. Researchers in data-driven estimation have not yet explored this synergy of physics and neural networks in depth. Previous surveillance systems recorded behavior at a given time point, but human motion prediction based on diffusion models can predict future actions, which plays a crucial role in crime prevention and public safety. For example, in airports with high traffic flow, predicting the movements of suspicious people allows measures to be taken in advance, effectively reducing the likelihood of potential dangers. Traditional prediction methods have limitations in dealing with the uncertainty and diversity of human movements. Emerging diffusion models, on the other hand, achieve significant results by learning the latent distribution of the data to predict the future state of human movement. Therefore, as shown in Figure 1b, our work focuses on fusing physics and deep learning by applying diffusion to predictive models trained in an end-to-end manner, without the need for multi-stage training, which avoids balancing multiple loss hyperparameters.
In recent years, physics-based and diffusion-based deep learning models have received increasing attention in terms of prediction accuracy and training speed [26,27,28,29]. Neural networks can efficiently solve partial differential equations [30]. Inspired by these methods, we do not use encoding–decoding methods but instead propose a new approach for human motion prediction that is both diffusion-based and physically driven. To the best of our knowledge, a diffusion model with denoising procedures can generate motion sequences from random noise, but is prone to producing unsmooth results. Concretely, we use observation sequences, combine the noising and denoising mechanisms, use physics-informed deep learning, and incorporate the Euler–Lagrange equations. We demonstrate that the proposed framework (LDPM) effectively improves the prediction accuracy of existing frameworks and obtains more physically plausible prediction results.
In summary, the main contributions of our work are as follows:
  • To address the problems mentioned above, we propose a new framework. The diffusion-based and physically driven approach has the advantages of long-term prediction, high motion switching capability, and reduced error accumulation. The future motion is predicted by an efficient combination of physical principles and random noise.
  • It is experimentally demonstrated that the proposed method obtains better performance and outperforms existing methods, providing a new perspective for future research.

2. Related Works

This section reviews existing work on human motion prediction and then analyzes latent diffusion-based and physically driven models that improve related human dynamics models.

2.1. Human Motion Prediction

Traditional methods [12,31] try to predict movement in a deterministic way, and earlier statistical methods [4] only captured consistent patterns of movement over time. With the widespread availability of large amounts of motion data, deep learning has a clear advantage in generating complex motion. Yuan et al. [25] considered physical factors and generated motion trajectories more in line with the real world using a diffusion model based on a physics-guided mechanism, which required high-end hardware for experimentation. Zhang et al. [26] incorporated physics knowledge into neural networks to learn complex patterns and reduce noise learning for human motion prediction. Karniadakis et al. [27] provided a priori knowledge through physical laws and integrated it into machine learning algorithms to reduce false predictions, but this ran the risk of introducing approximation errors and loss of accuracy. Encoder–decoder models have been employed for prediction [32,33,34,35,36], using multiple loss constraints [11] to achieve diversity and accuracy, learning basis vectors, applying sampling strategies, decoupling network parameters and Gaussian distributions, and using linear interpolation to control smoothing, all of which raise training requirements. RNNs and their variants capture the temporal information of human movements but fail to capture spatial dependencies; feature extraction by convolutional operations has low computational complexity, but simple splicing or averaging does not yet achieve optimal performance.
To address this problem, some researchers have used CNNs, GCNs [37,38,39], and spatio-temporal graph convolution [31]. Zhong et al. [31] proposed a spatio-temporally gated adjacency graph convolutional network that focuses on channel information, balances spatio-temporal weights, and flexibly adapts to the input, but it increases the number of model parameters. Li et al. [38] proposed a symbiotic graph neural network that combines action, structural, and bipartite graphs to explain movement changes and comprehensively exploit inter-joint relationships, but validation of model robustness is lacking. Chen et al. [39] proposed a skeleton-parted graph scattering network that uses multi-channel filters to obtain graph spectral bands, model different body parts, and extract fine features, but it has high computational cost and requires significant debugging effort. Recently, Guo et al. [40] showed better performance using a multilayer perceptron (MLP), which performs nonlinear transformations by stacking fully connected layers whose neurons interact with all neurons of the previous layer to capture complex patterns in the input data, but this leads to a large number of multiply–add operations, especially when dealing with high-dimensional data. Bouazizi et al. [12] proposed an MLP-based prediction model that independently applies two different MLPs to avoid the complexity and computational difficulties of RNN, CNN, and GCN methods, but the model's performance is sensitive to processing details and data inputs. Xu et al. [41] utilized geometric isotropy in motion data to improve model accuracy. Even so, these methods are still affected by physical motion and require many stages of training.

2.2. Latent Diffusion and Physically Driven Models

Human dynamics are modeled with equations describing physical motion, and a diffusion model of human motion can be trained end-to-end to generate rich actions. Physics-based models can improve on purely data-driven estimation and are widely used for different tasks. The same is true for diffusion models in research tasks such as image/video generation [42,43,44,45], drug discovery [46], and 3D reconstruction [47,48,49,50,51]. Among them, in monocular 3D human reconstruction, some methods estimate parameters of the Euler–Lagrange equation, improve data-driven estimation [23], and simulate realistic motion. Traditional image synthesis is prone to detail loss and image blurring, whereas latent diffusion models can perform diffusion operations on the data in a latent space to generate high-resolution images [43], offering a new perspective on traditional image synthesis. Probabilistic audio-to-visual diffusion priors [44] play a significant role in audiovisual synthesis: traditional synthesis methods suffer from generative incoherence, whereas introducing prior knowledge and using diffusion models to fuse audio and visual features can improve the realism and naturalness of the results. Diffusion models can also be used in video processing and generation [45], i.e., modeling inter- and intra-frame relationships in video. Using a geometric diffusion mechanism [46], the complex relationships between atoms in a molecule can be modeled to generate reasonable molecular conformations.
A 3D shape prior combined with a text-to-image diffusion model [47] provides a new perspective for generating 3D models directly from text descriptions. Scalable semantic transfer over multiple label domains [49] enables human parsing, which accurately delineates different parts of the human body and their semantic information and provides a more accurate semantic basis for human pose estimation. Unlike existing prediction work, we integrate physical principles into a diffusion model to generate accurate future motion predictions. Cai et al. [22] predict human movement by constructing multiscale hypergraphs specialized for intangible cultural heritage (ICH) dance videos. Zhang et al. [29] used the Euler–Lagrange equations to infer kinematic forces from observed motions, and then used these forces as additional inputs to a motion prediction model to improve performance. BeLFusion [30] decomposes model training into multiple stages to predict motion through a latent-space diffusion model, which can be limited by the codec. In this paper, we propose a new framework that adds a physically driven model to the diffusion model for end-to-end motion prediction.

3. Proposed Framework

Given input information such as text or action labels, our goal is to generate a physically plausible human motion $x_{T+1:T+N}$ of length N. We propose a Latent Diffusion and Physical Principles Model (LDPM) for human motion generation. The LDPM starts with a noisy motion $x_{1:T}$ and models a denoising distribution that denoises motions from diffusion time step T to N (N < T). Iterative application of the model denoises the motion into a clean motion $x_{1:T}$, which becomes the final output $x_{T+1:T+N}$. An overview of our proposed LDPM is shown in Figure 2. First, we maintain the temporal continuity of the generated results by introducing a noise scheduling strategy and a diffusion process. Meanwhile, a physics- and data-driven motion prediction model, PhysDa, is introduced; the model iteratively predicts future body configurations from the input historical motion data. Finally, we explore an LDPM-based training process that effectively mitigates prediction errors accumulated over time.

3.1. Latent Diffusion Model

We take the sequence of observed $T_{obs}$ frames of motion as $x_{1:T_{obs}} = (x_1, x_2, \ldots, x_{T_{obs}}) \in \mathbb{R}^{T_{obs} \times 3J}$, where $x_t \in \mathbb{R}^{3J}$ are the coordinates of the joints at frame $t$ and $J$ is the number of joints. Given the observed motions $x_{1:T_{obs}}$, the goal of the human motion prediction (HMP) problem is to predict the following N motions: $x_{T_{obs}+1:T_{obs}+N} = (x_{T_{obs}+1}, x_{T_{obs}+2}, \ldots, x_{T_{obs}+N}) \in \mathbb{R}^{N \times 3J}$.
A sample $z$ lies in the latent space $V$ and can be mapped to the coordinate space. A latent diffusion model is then trained to predict $z$. The generative sequence is formulated as follows:
$$p(y \mid x) = \int p(y, z \mid x)\, dz = \int p(y \mid z, x)\, p(z \mid x)\, dz,$$
where $p(y \mid z, x)$ represents the conditional probability of generating the output, and $p(z \mid x)$ represents the distribution of the latent variable.
Latent diffusion models predict the perturbation $\epsilon = f(z_t, t, x)$. Once trained, the network $f$ can infer $z$. We choose a parameterization so that $z_0 = f(z_t, t, x)$. An approximation $z_0$ of $z$ is predicted in every denoising step, and we sample the input of the next denoising step $z_{t-1}$ by diffusing it $t-1$ times. We use $q(z_{t-1} \mid z_0)$ to describe this diffusion process. We use the discrete cosine transform (DCT) and the inverse discrete cosine transform (iDCT) operations. Looking specifically at Algorithm 1, and as shown in Figure 2, one linear layer of TransLinear takes the DCT spectrum at step size $t$ as input, and another linear layer serves as the output to map the joint dimensions. At the same time, N TransLinear blocks are stacked in TransLinear; features are extracted through two fully connected layers (TFCs), and the signals are passed and reduced layer by layer, while the iDCT is combined to recover the motion data. The core prediction process module (PPM) proposed in this paper is shown in Figure 3. We designed the PPM module, and several PPM modules are merged to form the prediction processing unit (PPU), which performs human motion prediction analysis based on physical principles and data-driven principles, respectively; “FC” stands for the fully connected layer, “LN” and “ReLU” represent the layer normalization and activation layers, and “Transpose” stands for the exchange of the spatial and temporal dimensions. Figure 4 shows the parameters M, C, and G used in the physical principle. “MLP” represents a framework that has the same architecture as “$\mathrm{MLP}_h$” but differs in the number of MLP blocks.
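For concreteness, the sketch below gives one possible PyTorch reading of a TransLinear-style block assembled from the components named above (FC, LN, ReLU, Transpose, and a skip connection); the layer widths, ordering, and the class name TransLinearBlock are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class TransLinearBlock(nn.Module):
    """Hypothetical TransLinear-style block: two fully connected layers (TFCs) with
    LayerNorm, ReLU, a spatial/temporal transpose, and a skip connection."""

    def __init__(self, num_freq: int, feat_dim: int):
        super().__init__()
        self.fc_time = nn.Linear(num_freq, num_freq)  # mixes along the DCT/temporal axis
        self.fc_feat = nn.Linear(feat_dim, feat_dim)  # mixes along the joint/feature axis
        self.norm = nn.LayerNorm(feat_dim)
        self.act = nn.ReLU()

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, num_freq, feat_dim), e.g., the truncated DCT spectrum of a motion clip
        residual = y
        h = self.fc_time(y.transpose(1, 2)).transpose(1, 2)  # "Transpose": act on the temporal axis
        h = self.act(self.norm(self.fc_feat(h)))             # FC -> LN -> ReLU on the feature axis
        return h + residual                                  # skip connection

block = TransLinearBlock(num_freq=10, feat_dim=51)           # e.g., 10 DCT rows, 3J = 51 features
out = block(torch.randn(4, 10, 51))                          # (batch, num_freq, feat_dim)
```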
Algorithm 1 DCT and iDCT
Input: sequences: $x = [x_1, x_2, \ldots, x_T] \in \mathbb{R}^{T \times 3J}$.
Output: sequences: $\hat{x} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_T]$.
  1: DCT transfer operation:
  2: $D \in \mathbb{R}^{(T+N) \times (T+N)}$;
  3: $y = \mathrm{DCT}(x) = Dx$;
  4: Generate sequences:
  5: $y \in \mathbb{R}^{N \times 3J}$;
  6: $y_t = \sqrt{\bar{\gamma}_t}\, y + \sqrt{1 - \bar{\gamma}_t}\, \mu$, $\mu \sim \mathcal{N}(0, I)$;
  7: iDCT transfer operation:
  8: $\hat{x} = \mathrm{iDCT}(y) = D^{T} y$;
  9: return $\hat{x}$.
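As a concrete illustration of Algorithm 1, the sketch below builds an orthonormal DCT-II basis matrix $D$, projects a motion sequence into the DCT domain, and inverts it with $D^{T}$; the basis construction, the truncation length L, and the tensor shapes are standard choices assumed here rather than taken from the paper's code.

```python
import math
import torch

def dct_basis(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis D of size n x n, so that y = D @ x and x = D.T @ y."""
    k = torch.arange(n).unsqueeze(1)  # frequency index (rows)
    t = torch.arange(n).unsqueeze(0)  # time index (columns)
    D = torch.cos(math.pi / n * (t + 0.5) * k) * math.sqrt(2.0 / n)
    D[0] *= 1.0 / math.sqrt(2.0)      # orthonormal scaling of the DC row
    return D

T, J = 25, 17                          # observed frames and joints (Human3.6M-style)
x = torch.randn(T, 3 * J)              # motion sequence in joint-coordinate space
D = dct_basis(T)

L = 10                                 # keep only the first L low-frequency rows
y = D[:L] @ x                          # truncated DCT spectrum, shape (L, 3J)
x_rec = D[:L].T @ y                    # smooth reconstruction via the truncated iDCT
```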
Given an N-frame motion sequence $x_{T+1:T+N} \in \mathbb{R}^{N \times 3J}$, we transform the sequence into the DCT domain through the DCT(·) operation:
$$y = \mathrm{DCT}(x) = Dx,$$
where $D \in \mathbb{R}^{N \times N}$ is the predefined DCT basis, and $y \in \mathbb{R}^{N \times 3J}$ are the DCT coefficients. Once the DCT operation is complete, we can recover the motion sequence from the DCT domain through the iDCT(·) operation:
$$x = \mathrm{iDCT}(y) = D^{T} y.$$
To achieve smoothness when switching between human motions, we perform the DCT and iDCT operations using only the first L rows of $D$ and $D^{T}$.
We use three different noise scheduling strategies: square root, cosine, and sigmoid scheduling. These strategies adjust the level of noise added at different time steps, ensuring that smooth and stable predictions are generated during the iterative process. Specifically, we perform the above DCT operation on $x \in \mathbb{R}^{N \times 3J}$ and keep the first E frequency components, obtaining the spectrum $y_0 \in \mathbb{R}^{E \times 3J}$. The noisy DCT spectrum $y_t$ at time step $t$ can be computed by the reparameterization trick:
$$y_t = \sqrt{\bar{\gamma}_t}\, y_0 + \sqrt{1 - \bar{\gamma}_t}\, \mu,$$
where $\bar{\gamma}_t = \prod_{i=1}^{t} \gamma_i$, $\gamma_i \in [0, 1]$ are predefined variance parameters, and $\mu \sim \mathcal{N}(0, I)$.
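The sketch below illustrates this forward noising step on the truncated DCT spectrum using a cosine-style variance schedule; the schedule constants, the number of diffusion steps, and the helper name cosine_gamma_bar are illustrative assumptions.

```python
import math
import torch

def cosine_gamma_bar(num_steps: int, s: float = 0.008) -> torch.Tensor:
    """Cumulative products gamma_bar_t under a cosine-style variance schedule."""
    u = torch.linspace(0, num_steps, num_steps + 1) / num_steps
    f = torch.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:]               # gamma_bar_1 ... gamma_bar_T, each in (0, 1]

num_steps = 1000
gamma_bar = cosine_gamma_bar(num_steps)

E, J = 10, 17
y0 = torch.randn(E, 3 * J)              # clean truncated DCT spectrum y_0
t = 300                                 # an arbitrary diffusion time step
mu = torch.randn_like(y0)               # Gaussian noise mu ~ N(0, I)
y_t = gamma_bar[t].sqrt() * y0 + (1 - gamma_bar[t]).sqrt() * mu  # reparameterization trick
```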

3.2. Physics-Based and Data-Driven Motion Prediction Model

We first elaborate the rules of the Euler–Lagrange equations, looking specifically at Algorithm 2, and then introduce the model's dynamics process.
Algorithm 2 Training procedure of LDPM
Input: Observation sequences $x_{1:T} = [x_1, x_2, \ldots, x_T]$, time step $t+1$, physical feature $H_k$, and geometric feature $H_s$.
Output: Future sequences $\hat{y}_{T+1:T+N}$.
  1: Extract features:
  2: Use MLPs to obtain two features:
  3: $H_k = \mathrm{MLP}_k(y_{1:T})$,
  4: $H_s = \mathrm{MLP}_s(y_{T+1:T+3})$;
  5: Dynamic modeling:
  6: $M_{T+N} = \mathrm{MLP}_M(H_k \oplus H_s)$;
  7: $C_{T+N} = \mathrm{MLP}_C(H_k \oplus H_s)$;
  8: $G_{T+N} = \mathrm{MLP}_G(H_k \oplus H_s)$;
  9: Data-driven:
 10: Use the fully connected layer:
 11: $y_{\mathrm{data}} = \mathrm{MLP}_{\mathrm{data}}(y_{1:T})$;
 12: Fuse physics and data:
 13: Weighted addition for prediction:
 14: $y_{\mathrm{fusion}} = (1 - w_t)\, y_{\mathrm{physics}} + w_t\, y_{\mathrm{data}}$;
 15: return $y_{\mathrm{fusion}}$.
When applied to modeling human motion, the Euler–Lagrange equations use variables that comprehensively define the configuration of the physical system, inspired by the widely used SMPL human model. SMPL models the body as a 3D mesh; the pose parameters capture joint angles, while the shape parameters represent body shape coefficients governing physical aspects such as height and other proportions. Using forward kinematics, the positions of the vertices and body joints of the 3D human model can be determined. The subject's body shape is assumed to remain constant. Thus, the motion trajectory within a world coordinate system is described by combining the pose parameters with the translation parameters H. Consequently, we define the generalized coordinate as
$$y = (\delta, H).$$
In the implementation, $\delta$ denotes the Euler angles of the body joints. Our approach takes biomechanical constraints into account. This strategy effectively eliminates redundant-angle solutions and ensures a more accurate representation of joint movements.
Building upon the coordinate system outlined above, we denote the velocity and acceleration at frame t as $\dot{y}_t$ and $\ddot{y}_t$, respectively. The dynamics of the human body, governed by the Euler–Lagrange equation, can then be expressed as follows:
$$M_t \ddot{y}_t + C_t = G_t.$$
The Euler–Lagrange equations are important for modeling and analyzing human motion through two key processes. Forward dynamics is concerned with solving the equations to predict the body configuration in the subsequent frame based on the physical parameters $M_t$ and $C_t$ and the applied forces $G_t$. Starting from a defined system state, we first compute $\ddot{y}_t$, which is then used to forecast future motion through Euler's method, represented as follows:
$$\ddot{y}_t = M_t^{-1}(G_t - C_t),$$
$$\dot{y}_{t+1} = \dot{y}_t + \ddot{y}_t \Delta T,$$
$$y_{t+1} = y_t + \dot{y}_t \Delta T,$$
where $\Delta T$ denotes the time interval between consecutive frames, and $M_t^{-1}$ represents the inverse of the inertia matrix. In contrast, inverse dynamics focuses on estimating the unknown physical parameters from existing data. In this study, we introduce a novel approach that integrates the Euler–Lagrange equations into the motion prediction framework. By explicitly modeling them, we develop a physics-based motion prediction model, which is detailed below.
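A minimal sketch of this forward-dynamics Euler step, assuming the physical parameters $M_t$, $C_t$, and $G_t$ are already available as tensors (e.g., produced by the MLPs introduced below); the helper name and the placeholder values are illustrative only.

```python
import torch

def euler_forward_step(y, y_dot, M, C, G, dt):
    """One forward-dynamics Euler step: solve M * y_ddot + C = G, then integrate."""
    # y, y_dot, C, G: (75,) generalized position, velocity, bias force, and force
    # M: (75, 75) generalized inertia matrix; dt: time interval between frames
    y_ddot = torch.linalg.solve(M, G - C)  # acceleration from the Euler-Lagrange equation
    y_dot_next = y_dot + y_ddot * dt       # integrate acceleration into velocity
    y_next = y + y_dot * dt                # integrate velocity into position
    return y_next, y_dot_next

# example usage with placeholder physical parameters
dim, dt = 75, 1.0 / 50.0                   # 75-dim generalized coordinates, 50 Hz frames
y, y_dot = torch.zeros(dim), torch.zeros(dim)
M, C, G = torch.eye(dim), torch.zeros(dim), torch.randn(dim)
y, y_dot = euler_forward_step(y, y_dot, M, C, G, dt)
```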
The in-depth modeling of human motion is first achieved through a physics-based predictive model built on the Euler–Lagrange equation, which is expressed in a generalized coordinate system as
$$M_t \ddot{y}_t + C_t = G_t,$$
where $M_t \in \mathbb{R}^{75 \times 75}$ denotes the total inertia matrix, which is determined by the generalized position and parameters such as body mass and inertia; $C_t \in \mathbb{R}^{75}$ denotes the generalized bias force; and $G_t \in \mathbb{R}^{75}$ denotes the generalized force. We follow (3) and use processes such as forward and inverse dynamics to model body dynamics. Using geometric features extracted from the three frames that, together with the Euler–Lagrange equations, fully specify a moment t, we use MLPs to extract the physical features $H_k$ and the geometric features $H_s$ from the motion history, and in turn estimate these physical parameters:
$$H_k = \mathrm{MLP}_k(y_{1:T}),$$
$$H_s = \mathrm{MLP}_s(y_{T+1:T+3}).$$
The generalized inertia matrix M, the bias force C, and the external force G are then predicted by additional MLPs:
$$M_{T+N} = \mathrm{MLP}_M(H_k \oplus H_s),$$
$$C_{T+N} = \mathrm{MLP}_C(H_k \oplus H_s),$$
$$G_{T+N} = \mathrm{MLP}_G(H_k \oplus H_s).$$
Our approach is specifically designed to maximize the expressive capabilities of neural networks while strictly adhering to the underlying physics equations. This ensures that the relationships among the physical parameters are faithfully preserved. Using the physical parameters, the physics-based model employs an ODE solver to compute $y_{\mathrm{physics}}$, where forward dynamics is applied to predict future states.
For training of the physics-based model, we utilize
$$y_{\mathrm{physics}} = \mathrm{MLP}_{\mathrm{data}}(y_{1:T}),$$
where $y_{1:T}$ represents the ground-truth positions.
To forecast future motion, the physics-based motion prediction model is applied iteratively, generating the configuration for the next frame based on prior estimates and historical input motion. However, this approach primarily focuses on the immediate vicinity of the current frame, which can lead to significant error accumulation over longer prediction horizons. To mitigate this challenge, we incorporate a data-driven model and a fusion model, designed to reduce error accumulation, as detailed below.
Current approaches have made significant progress in leveraging data to model long-term dependencies, enabling the direct prediction of future human motion from the input historical motion sequence $y_{1:T}$ using neural networks. We therefore introduce a data-driven forecasting model that utilizes historical motion data to capture long-term temporal dependencies. With a fully connected layer, we generate future data-driven predictions:
$$y_{\mathrm{data}} = \mathrm{MLP}_{\mathrm{data}}(y_{1:T}).$$
Based on the prediction $y_{\mathrm{physics}}$ of the physical model above, the prediction for each frame is dynamically adjusted by a weight $w_t$ that fuses in the data-driven prediction, and the final fused output is
$$y_{\mathrm{fusion}} = (1 - w_t)\, y_{\mathrm{physics}} + w_t\, y_{\mathrm{data}}.$$
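A hedged sketch of the physics/data fusion described above is given below; the MLP widths, the concatenation of $H_k$ and $H_s$, the identity offset used to keep $M$ well conditioned, and the choice of a learnable scalar fusion weight are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PhysDaFusion(nn.Module):
    """Illustrative fusion of physics-based and data-driven next-frame predictions."""

    def __init__(self, hist_len: int, dim: int = 75, hidden: int = 256):
        super().__init__()
        self.mlp_k = nn.Sequential(nn.Linear(hist_len * dim, hidden), nn.ReLU())  # physical feature H_k
        self.mlp_s = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU())         # geometric feature H_s
        self.mlp_M = nn.Linear(2 * hidden, dim * dim)   # generalized inertia matrix
        self.mlp_C = nn.Linear(2 * hidden, dim)         # generalized bias force
        self.mlp_G = nn.Linear(2 * hidden, dim)         # generalized external force
        self.mlp_data = nn.Linear(hist_len * dim, dim)  # data-driven prediction
        self.w = nn.Parameter(torch.tensor(0.0))        # fusion weight w_t (assumed learnable)
        self.dim = dim

    def forward(self, y_hist, y, y_dot, dt):
        # y_hist: (B, hist_len, dim) motion history; y, y_dot: (B, dim) current state
        B = y_hist.shape[0]
        h = torch.cat([self.mlp_k(y_hist.flatten(1)),
                       self.mlp_s(y_hist[:, -3:].flatten(1))], dim=-1)
        M = self.mlp_M(h).view(B, self.dim, self.dim) + torch.eye(self.dim)  # keep M invertible
        C, G = self.mlp_C(h), self.mlp_G(h)
        y_ddot = torch.linalg.solve(M, (G - C).unsqueeze(-1)).squeeze(-1)     # forward dynamics
        y_physics = y + y_dot * dt                                            # Euler step (position)
        y_dot_next = y_dot + y_ddot * dt                                      # Euler step (velocity)
        w = torch.sigmoid(self.w)                                             # keep weight in (0, 1)
        y_fusion = (1 - w) * y_physics + w * self.mlp_data(y_hist.flatten(1))
        return y_fusion, y_dot_next
```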

3.3. Training of the Model

We train the latent diffusion model [24] to learn to sample in this space. The diffusion model is a probabilistic model that learns the data distribution $p(y)$ by gradually removing noise starting from random Gaussian noise. Our proposed model, LDPM, initializes the noise scheduling and prediction modules, and during the diffusion process PhysDa corrects for the introduced noise at each time step, allowing the constraints of the physical model to be reflected in the prediction. The final generated motion data not only have data-driven accuracy but also conform to physical plausibility. During the sampling process, we perform several iterations through the PhysDa module to generate the predictions step by step, so that they gradually approximate the real motion trajectories. We use a noise prediction network parameterized by $\theta$ to generate the predicted noise $\mu_\theta(y_t, t)$, and optimize the parameters $\theta$ with the noise prediction loss $\mathcal{L}$:
$$\mathcal{L} = \mathbb{E}_{\mu, t}\left[\, \lVert \mu - \mu_\theta(y_t, t) \rVert^2 \,\right].$$
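A minimal sketch of one step of this noise-prediction objective, assuming denoiser is the TransLinear-style network and gamma_bar is a cumulative variance schedule such as the one sketched in Section 3.1; the function name and batching details are illustrative.

```python
import torch

def diffusion_training_step(denoiser, y0, gamma_bar, optimizer):
    """One optimization step of L = E_{mu,t}[ || mu - mu_theta(y_t, t) ||^2 ]."""
    # y0: (B, E, 3J) clean truncated DCT spectra; gamma_bar: (T,) cumulative schedule
    B = y0.shape[0]
    t = torch.randint(0, len(gamma_bar), (B,))        # random diffusion step per sample
    g = gamma_bar[t].view(B, 1, 1)
    mu = torch.randn_like(y0)                         # ground-truth Gaussian noise
    y_t = g.sqrt() * y0 + (1 - g).sqrt() * mu         # forward-noised spectrum
    loss = ((mu - denoiser(y_t, t)) ** 2).mean()      # noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```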

4. Experiments

Datasets. We use the datasets Human3.6M [52], HumanEva-I [53], AMASS [54] to evaluate our model. Human3.6M is a comprehensive 3D human motion capture dataset widely used for motion prediction, pose estimation, and action recognition. It includes 11 participants performing 15 everyday activities, with high-resolution 3D joint data for 17 key joints. The dataset offers both 2D and 3D annotations, making it ideal for 2D-to-3D pose estimation and motion prediction tasks. HumanEva-I features synchronized 2D video frames and 3D motion capture data from multiple subjects performing activities like walking and gesturing. It is suited for multi-view pose estimation and motion prediction, with standardized protocols for reproducibility. AMASS is a large-scale collection of 3D motion capture data, unified in the SMPL format, facilitating motion modeling, 3D pose estimation, and motion prediction. It covers a wide range of activities and provides high-quality 3D meshes and joint data, supporting integration with other datasets like Human3.6M and HumanEva-I for cross-dataset research.
Implementation details. Human3.6M features 7 subjects performing 15 motion types, where five subjects S1, S5, S6, S7, and S8 are used for training and two (S9 and S11) for evaluation. The data are processed at 50 Hz with a 17-joint skeleton excluding the root joint. Our model observes 25 frames, 0.5 s, to predict 100 future frames, 2 s. HumanEva-I includes three subjects performing five actions. It is processed at 60 Hz with a 15-joint skeleton, using 15 observation frames, 0.25 s, to predict 60 future frames, 1 s. For AMASS, we retarget the skeleton to match the Human3.6M format, enabling direct inference using the model trained on Human3.6M. The model is trained using PyTorch on an RTX 4090 GPU with 120 GB of memory. A batch size of 64 is used, with training lasting 500 epochs for the first two datasets and 50 epochs for AMASS. The frame rates are 25 Hz for Human3.6M and 30 Hz for HumanEva-I and AMASS.
The training setup includes an Adam optimizer with a learning rate of 3 × 10−4 and a decay factor of 0.9. A dropout rate of 0.2 is applied. The DCT/iDCT operation uses 20 coefficients and 10 coefficients. The latent dimension is set to 512, with eight self-attention heads. The modulation ratio is set to 1.0 and 0.5. To mitigate error accumulation and improve motion-switching capabilities, the final 20 steps of DCT completion are replaced with simpler denoising steps. Data preprocessing includes applying a Gaussian filter to reduce noise in body translation and joint angles, ensuring smoother predictions.
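The stated hyperparameters map onto a straightforward PyTorch training setup; the sketch below wires them together (Adam with learning rate 3 × 10−4, decay factor 0.9, dropout 0.2, batch size 64, 500 epochs), with the placeholder network, the once-per-epoch decay schedule, and the dummy data being assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder denoiser standing in for the LDPM network described in Section 3
model = nn.Sequential(nn.Flatten(), nn.Linear(25 * 51, 512), nn.ReLU(),
                      nn.Dropout(p=0.2), nn.Linear(512, 25 * 51))

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# decay factor 0.9; how often it is applied is not stated, so once per epoch is assumed here
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

dataset = TensorDataset(torch.randn(1024, 25, 51))    # dummy clips: 25 frames x 3J (J = 17)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(500):                               # 500 epochs for Human3.6M / HumanEva-I
    for (clip,) in loader:
        pred = model(clip)                             # stand-in forward pass
        loss = ((pred - clip.flatten(1)) ** 2).mean()  # stand-in loss for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```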
Evaluation metrics. Following previous work [28], we report five metrics to measure prediction quality: average pairwise distance (APD), average displacement error (ADE), final displacement error (FDE), multi-modal ADE (MMADE), and multi-modal FDE (MMFDE). APD is the average L2 distance between all pairs of generated samples. ADE is defined as the minimum average L2 distance between the ground-truth and predicted motions. FDE is the L2 distance between the predicted last frame and the ground truth. MMADE and MMFDE are multi-modal versions of ADE and FDE, respectively, with ground-truth future motions grouped by similar observations.
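A sketch of how ADE, FDE, and APD can be computed for a set of sampled predictions, under the common convention that ADE and FDE take the minimum over samples; normalization details may differ from the authors' evaluation code.

```python
import torch

def motion_metrics(pred, gt):
    """pred: (S, N, 3J) sampled predictions over N frames; gt: (N, 3J) ground truth."""
    S, N, D = pred.shape
    J = D // 3
    per_joint = torch.linalg.norm(pred.view(S, N, J, 3) - gt.view(1, N, J, 3), dim=-1)
    ade = per_joint.mean(dim=(1, 2)).min().item()      # best sample's average joint error
    fde = per_joint[:, -1].mean(dim=-1).min().item()   # best sample's last-frame error
    flat = pred.reshape(S, -1)
    # APD: average pairwise L2 distance between the S samples (diversity measure)
    apd = torch.cdist(flat, flat).sum().item() / (S * (S - 1))
    return ade, fde, apd

ade, fde, apd = motion_metrics(torch.randn(50, 100, 51), torch.randn(100, 51))
```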

4.1. Comparison to the State of the Art

Quantitative results. Table 1 shows the quantitative results, highlighting that our method achieves the best ADE, 0.319, and FDE, 0.410, on the Human3.6M dataset, and is competitive on HumanEva-I, with an ADE of 0.234 and FDE of 0.369. These results demonstrate the superior accuracy of our method in both general trajectory prediction (ADE) and final position prediction (FDE), which are critical for real-world motion prediction tasks. The results in Table 2 provide insights into the efficiency of the training process at different stages. While the focus of this section is to compare the performance of the proposed method with the state of the art in terms of motion prediction accuracy (ADE and FDE), these time measurements provide further context about the computational efficiency. Although our method achieves high accuracy, our approach also aims to ensure efficient training at a reasonable computational cost.
Visualization results. As shown in Figure 5, our method performs well on the Human3.6M dataset. Red, black, and gray denote the contextual part of the initial action state, and purple and green denote the predicted action sequences. Over long horizons, our model predicts based on the physical constraints of the human body, and the predicted values remain close to the ground truth overall. As shown in Figure 6, on the HumanEva-I dataset, our method fuses spatio-temporal information to predict standing, arm-swinging, and jumping movements, and the prediction accuracy is significantly improved after 500 epochs of training. In Figure 7, the training error is always lower than the validation error, indicating that the model fits the training set well. The validation and test errors show that the model has limited generalization ability on the validation set but performs well on the test set.

4.2. Ablation Study

LDPM modeling setup. Here, we examine the effects of the denoising length and variance scheduler on the experiments. Table 3 presents the best choice, which is the cosine scheduler for training on the Human3.6M and HumanEva-I datasets. The evaluation indicators include FDE, MMADE, and MMFDE. The cosine scheduler on the Human3.6M dataset achieves a minimum FDE error of 0.410, which is better than linear and sqrt. Compared to the sqrt scheduler, the cosine scheduler’s error is reduced by 44.1%, which shows a significant performance improvement and provides higher prediction accuracy in terms of final position error. On the HumanEva-I dataset, the cosine scheduler achieves a minimum FDE error of 0.229, which is 11.2% and 54.9% lower than linear and sqrt, respectively. The MMADE of the cosine scheduler is 0.369, which is also the lowest among the three schedulers, indicating that it is the most stable in terms of average prediction error.
By approximating the values of the DCT and iDCT to improve the efficiency, the E choices of 5 and 10 have a significantly superior performance. Table 4 and Table 5 show the model’s performance on the Human3.6M and HumanEva-I datasets, respectively. On Human3.6M, E = 5 gives the best results, with the lowest FDE, while on HumanEva-I, E = 10 provides the best overall accuracy and efficiency.
Prediction Networks. The TransLinear layer in the diffusion processing unit (DPU) we proposed performs skip connections. In Table 6, the skip connection results are better and the generated motion is more realistic. Table 7 shows that the best results on Human3.6M occur at eight layers, while on HumanEva-I, the best results are at four layers. Table 8 shows that increasing the noise steps from 100 to 1000 significantly improves performance, with notable reductions in FDE, MMADE, and MMFDE on both datasets.
Moreover, as shown in Figure 8, we use 28.4M and 14.45M parameters on Human3.6M and HumanEva-I, respectively. Figure 9 compares the performance of various motion prediction methods in two plots. In the left plot, the motion value decreases as the number of predicted frames increases; the GT line remains constant while the other lines show a decrease in predicted motion. Our method (the purple line) maintains the lowest motion value in most predicted frames, with better consistency with the GT than other methods such as HumanMAC and MotionDiff. In the right plot, the cumulative error increases with the number of predicted frames. Our method again shows superior performance, keeping the cumulative error lower than all other methods, especially in the early predicted frames. As the predicted frames progress, the error of the other methods grows faster than ours.
Table 9 shows the performance of our proposed method. A value of 0 indicates the absence of motion-switching capability or partial body controllability, and 1 indicates its presence. MotionDiff and Deligan have relatively limited performance, with 0 values for both. Our proposed method has the most comprehensive capabilities: it supports switching motions, together with HumanMAC and BeLFusion, and can independently control parts of the body. Table 10 compares the performance metrics of various motion prediction models on the Human3.6M dataset. The table evaluates the models based on three criteria: whether they use a single-stage approach (Y for yes and N for no), the loss function used, and the frames per second (FPS) performance. Our model uses a single-stage approach, as indicated by "Y", while other models such as MotionDiff, BeLFusion, and Diversampling do not. The loss function used by our method is comparable to that of models such as HumanMAC, but significantly better than that of MotionDiff. Compared to other models, our model achieves the highest FPS and performs well in real time.
Table 11 shows that our method has a significantly lower runtime than other methods, with MotionDiff being the closest. Our method has a parameter size of 25.9M, which is larger than HumanMAC but smaller than Gsps. Our method has the lowest ADE metric, outperforming all other methods. Similarly, our method achieves the lowest FDE of 0.410, outperforming other models, including HumanMAC at 0.480. Our method achieves the lowest ADE and FDE values, indicating higher motion prediction accuracy compared to existing methods. As shown in Table 12, the proposed hybrid model outperforms both the data-driven model and the physics-based model in all evaluation metrics on the Human3.6M dataset. Compared with other models, it has a higher APD of 13.158 and performs best in ADE, FDE, MMADE, and MMFDE, with values of 0.319, 0.410, 0.430, and 0.473, respectively. This shows that the hybrid approach effectively combines the advantages of the data-driven model and the physics-based model to achieve more accurate motion prediction.

5. Discussion

Recent advancements in person re-identification (Re-ID) have tackled challenges like occlusion, viewpoint changes, and feature consistency. For instance, [55] shows how addressing depth variations can boost Re-ID performance. Our model, which combines latent diffusion with physical motion prediction, could benefit from similar depth adaptation to handle occlusion better. In [56], the authors explore deep learning methods for handling occlusion, which is a key issue our model addresses using Euler–Lagrange equations for motion prediction. This approach aims to improve prediction reliability in such situations. The 3D Re-ID technique in [57] uses both global guidance and local feature aggregation, which can be applied to our model to improve its prediction accuracy. Additionally, multi-view learning in [58] enhances 3D shape understanding, and we see potential to use this approach to boost long-term motion prediction accuracy. Finally, [59] focuses on feature consistency and contrast enhancement to improve Re-ID. Our model already reduces long-term prediction errors, and adopting these techniques could reduce errors even further.
In summary, our model significantly improves long-term human motion prediction and can be combined with existing Re-ID techniques to address occlusion, depth, and feature consistency, offering new directions for future research in both human motion prediction and pedestrian Re-ID.

6. Conclusions

In this paper, we proposed a model based on latent diffusion modeling and physically driven components, using PPM and PPU modules to address the problems of low long-term prediction accuracy and severe error accumulation. The framework combines motion generated by random noise with physical control derived from the Euler–Lagrange equations to provide accurate predictions for future frames of human motion. Our approach significantly improves the performance over existing methods, achieving error reduction of 15.7% in specific metrics such as ADE. Looking ahead, we plan to further investigate the integration of physical principles with deep learning techniques for a wider range of spatio-temporal tasks, with the potential to improve performance in areas such as human motion prediction.

Author Contributions

Z.R.: Conceptualization, investigation, methodology, validation, visualization, writing—original draft preparation, writing—review and editing. M.J.: Conceptualization, investigation, methodology, validation, visualization. H.N.: Conceptualization, investigation, visualization, writing—review and editing. J.S.: Conceptualization, investigation, methodology, validation, visualization, writing—review and editing. A.D.: Conceptualization, investigation, methodology, validation, visualization, writing—review and editing. Q.Z.: Conceptualization, investigation, methodology, validation, visualization, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Yunnan provincial major science and technology special plan projects under Grant 202302AD08008, Natural Science Foundation of Guangdong Province (Nos. 2022A1515140119, 2023A1515011307), Dongguan Science and Technology Development Project (No. CWSQ20230613782), Characteristic Innovation Project of Ordinary Universities in Guangdong Province (2021KTSCX189).

Data Availability Statement

The datasets analyzed during the current study are available at https://github.com/jimm77777/LDPM, accessed on 16 December 2024. The datasets used are all public datasets, and this manuscript does not contain any information about the study participants.

Acknowledgments

The authors thank everyone who contributed to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, W.; Song, R.; Guo, X.; Zhang, C.; Chen, L. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2025; pp. 87–104. [Google Scholar]
  2. Lu, G.; Zhang, S.; Wang, Z.; Liu, C.; Lu, J.; Tang, Y. Manigaussian: Dynamic gaussian splatting for multi-task robotic manipulation. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2025; pp. 349–366. [Google Scholar]
  3. Hu, L. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 8153–8163. [Google Scholar]
  4. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D gaussian splatting for real-time radiance field rendering. Acm Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  5. Goto, T.; Ohzeki, M. Online calibration scheme for training restricted Boltzmann machines with quantum annealing. arXiv 2023, arXiv:2307.09785. [Google Scholar]
  6. Wei, D.; Sun, H.; Li, B.; Lu, J.; Li, W.; Sun, X.; Hu, S. Human joint kinematics diffusion-refinement for stochastic motion prediction. AAAI Conf. Artif. Intell. 2023, 37, 6110–6118. [Google Scholar] [CrossRef]
  7. Pearce, T.; Rashid, T.; Kanervisto, A.; Bignell, D.; Sun, M.; Georgescu, R.; Macua, S.V.; Tan, S.Z.; Momennejad, I.; Hofmann, K.; et al. Imitating human behaviour with diffusion models. arXiv 2023, arXiv:2301.10677. [Google Scholar]
  8. Adeli, V.; Ehsanpour, M.; Reid, I.; Niebles, J.C.; Savarese, S.; Adeli, E.; Rezatofighi, H. Tripod: Human trajectory and pose dynamics forecasting in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 13390–13400. [Google Scholar]
  9. Cai, Z.; Ren, D.; Zeng, A.; Lin, Z.; Yu, T.; Wang, W.; Liu, Z. Humman: Multi-modal 4D human dataset for versatile sensing and modeling. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 557–577. [Google Scholar]
  10. Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2255–2264. [Google Scholar]
  11. Aliakbarian, S.; Saleh, F.S.; Salzmann, M.; Petersson, L.; Gould, S. A stochastic conditioning scheme for diverse human motion prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5223–5232. [Google Scholar]
  12. Bouazizi, A.; Holzbock, A.; Kressel, U.; Dietmayer, K.; Belagiannis, V. Motionmixer: Mlp-based 3D human body pose forecasting. arXiv 2022, arXiv:2207.00499. [Google Scholar]
  13. Li, B.; Zhao, Y.; Zhelun, S.; Sheng, L. Danceformer: Music conditioned 3D dance generation with parametric motion transformer. AAAI Conf. Artif. Intell. 2022, 36, 1272–1279. [Google Scholar] [CrossRef]
  14. Alexanderson, S.; Nagy, R.; Beskow, J.; Henter, G.E. Listen, denoise, action! audio-driven motion synthesis with diffusion models. Acm Trans. Graph. 2023, 42, 1–20. [Google Scholar] [CrossRef]
  15. Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Adv. Neural Inf. Process. Syst. 2022, 35, 5775–5787. [Google Scholar]
  16. Gurumurthy, S.; Kiran Sarvadevabhatla, R.; Venkatesh Babu, R. Deligan: Generative adversarial networks for diverse and limited data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 166–174. [Google Scholar]
  17. Hong, F.; Zhang, M.; Pan, L.; Cai, Z.; Yang, L.; Liu, Z. Avatarclip: Zero-shot text-driven generation and animation of 3D avatars. arXiv 2022, arXiv:2205.08535. [Google Scholar] [CrossRef]
  18. Sun, B.; Yang, Y.; Zhang, L.; Cheng, M.M.; Hou, Q. Corrmatch: Label propagation via correlation matching for semi-supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 3097–3107. [Google Scholar]
  19. Starke, S.; Mason, I.; Komura, T. Deepphase: Periodic autoencoders for learning motion phase manifolds. ACM Trans. Graph. 2022, 41, 1–13. [Google Scholar] [CrossRef]
  20. Ju, X.; Zeng, A.; Zhao, C.; Wang, J.; Zhang, L.; Xu, Q. Humansd: A native skeleton-guided diffusion model for human image generation. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 15988–15998. [Google Scholar]
  21. Tevet, G.; Gordon, B.; Hertz, A.; Bermano, A.H.; Cohen-Or, D.; Liu, Z. Motionclip: Exposing human motion generation to clip space. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 358–374. [Google Scholar]
  22. Cai, X.; Cheng, P.; Liu, S.; Zhang, H.; Sun, H. Human Motion Prediction Based on a Multi-Scale Hypergraph for Intangible Cultural Heritage Dance Videos. Electronics 2023, 12, 1–21. [Google Scholar] [CrossRef]
  23. Maeda, T.; Ukita, N. Motionaug: Augmentation with physical correction for human motion prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6427–6436. [Google Scholar]
  24. Xie, K.; Wang, T.; Iqbal, U.; Guo, Y.; Fidler, S.; Shkurti, F. Physics-based human motion estimation and synthesis from videos. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11532–11541. [Google Scholar]
  25. Yuan, Y.; Song, J.; Iqbal, U.; Vahdat, A.; Kautz, J. Physdiff: Physics-guided human motion diffusion model. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 16010–16021. [Google Scholar]
  26. Zhang, Z.; Zhu, Y.; Rai, R.; Doermann, D. Pimnet: Physics-infused neural network for human motion prediction. IEEE Robot. Autom. Lett. 2022, 7, 8949–8955. [Google Scholar] [CrossRef]
  27. Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 2021, 3, 422–440. [Google Scholar] [CrossRef]
  28. Chen, L.H.; Zhang, J.; Li, Y.; Pang, Y.; Xia, X.; Liu, T. Humanmac: Masked motion completion for human motion prediction. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 9544–9555. [Google Scholar]
  29. Zhang, Y.; Kephart, J.O.; Ji, Q. Incorporating physics principles for precise human motion prediction. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6164–6174. [Google Scholar]
  30. Barquero, G.; Escalera, S.; Palmero, C. Belfusion: Latent diffusion for behavior-driven human motion prediction. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 2317–2327. [Google Scholar]
  31. Zhong, C.; Hu, L.; Zhang, Z.; Ye, Y.; Xia, S. Spatio-temporal gating-adjacency gcn for human motion prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6447–6456. [Google Scholar]
  32. Salzmann, T.; Pavone, M.; Ryll, M. Motron: Multimodal probabilistic human motion forecasting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6457–6466. [Google Scholar]
  33. Lucas, T.; Baradel, F.; Weinzaepfel, P.; Rogez, G. Posegpt: Quantization-based 3D human motion generation and forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 417–435. [Google Scholar]
  34. Blattmann, A.; Milbich, T.; Dorkenwald, M.; Ommer, B. Behavior-driven synthesis of human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12236–12246. [Google Scholar]
  35. Dang, L.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. Diverse human motion prediction via gumbel-softmax sampling from an auxiliary space. In Proceedings of the ACM International Conference on Multimedia (ACM MM), Lisboa, Portugal, 10–14 October 2022; pp. 5162–5171. [Google Scholar]
  36. Xu, S.; Wang, Y.X.; Gui, L.Y. Diverse human motion prediction guided by multi-level spatial-temporal anchors. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 251–269. [Google Scholar]
  37. Dang, L.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. Msr-gcn: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11467–11476. [Google Scholar]
  38. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Symbiotic graph neural networks for 3D skeleton-based human action recognition and motion prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3316–3333. [Google Scholar] [CrossRef]
  39. Li, M.; Chen, S.; Zhang, Z.; Xie, L.; Tian, Q.; Zhang, Y. Skeleton-parted graph scattering networks for 3D human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 18–36. [Google Scholar]
  40. Guo, W.; Du, Y.; Shen, X.; Lepetit, V.; Alameda-Pineda, X.; Moreno-Noguer, X. Back to mlp: A simple baseline for human motion prediction. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 4809–4819. [Google Scholar]
  41. Xu, C.; Tan, R.T.; Tan, Y.; Chen, S.; Wang, Y.G.; Wang, X.; Wang, Y. Eqmotion: Equivariant multi-agent motion prediction with invariant interaction reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1410–1420. [Google Scholar]
  42. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  43. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  44. Yu, Z.; Yin, Z.; Zhou, D.; Wang, D.; Wong, F.; Wang, B. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 7645–7655. [Google Scholar]
  45. Ho, J.; Gritsenko, A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video diffusion models. Adv. Neural Inf. Process. Syst. 2022, 35, 8633–8646. [Google Scholar]
  46. Xu, M.; Yu, L.; Song, Y.; Shi, C.; Ermon, S.; Tang, J. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv 2022, arXiv:2203.02923. [Google Scholar]
  47. Xu, J.; Wang, X.; Cheng, W.; Cao, Y.P.; Shan, Y.; Qie, X.; Cao, S. Dream3d: Zero-shot text-to-3D synthesis using 3D shape prior and text-to-image diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 20908–20918. [Google Scholar]
  48. Popov, V.; Vovk, I.; Gogoryan, V.; Sadekova, T.; Kudinov, M. Grad-tts: A diffusion probabilistic model for text-to-speech. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8599–8608. [Google Scholar]
  49. Yang, J.; Wang, C.; Li, Z.; Wang, J.; Zhang, R. Semantic human parsing via scalable semantic transfer over multiple label domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 19424–19433. [Google Scholar]
  50. Mao, W.; Liu, M.; Salzmann, M. Generating smooth pose sequences for diverse human motion prediction. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 13309–13318. [Google Scholar]
  51. Zhang, Y.; Black, M.J.; Tang, S. We are more than our joints: Predicting how 3D bodies move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3372–3382. [Google Scholar]
  52. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef]
  53. Sigal, L.; Balan, A.O.; Black, M.J. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vis. 2010, 87, 4–27. [Google Scholar] [CrossRef]
  54. Mahmood, N.; Ghorbani, N.; Troje, N.F.; Pons-Moll, G.; Black, M.J. AMASS: Archive of motion capture as surface shapes. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5442–5451. [Google Scholar]
  55. Zhang, H.; Ning, X.; Wang, C.; Ning, E.; Li, L. Deformation depth decoupling network for point cloud domain adaptation. Neural Netw. 2024, 180, 106626. [Google Scholar] [CrossRef]
  56. Ning, E.; Wang, C.; Zhang, H.; Ning, X.; Tiwari, P. Occluded person re-identification with deep learning: A survey and perspectives. Expert Syst. Appl. 2024, 239, 122419. [Google Scholar] [CrossRef]
  57. Wang, C.; Ning, X.; Li, W.; Bai, X.; Gao, X. 3D person re-identification based on global semantic guidance and local feature aggregation. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 4698–4712. [Google Scholar] [CrossRef]
  58. Ning, E.; Wang, C.; Zhang, H.; Ning, X.; Tiwari, P. Pedestrian 3D shape understanding for person re-identification via multi-view learning. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5589–5602. [Google Scholar]
  59. Ning, E.; Zhang, C.; Wang, C.; Ning, X.; Chen, H.; Bai, X. Pedestrian Re-ID based on feature consistency and contrast enhancement. Displays 2023, 79, 102467. [Google Scholar] [CrossRef]
Figure 1. Comparison of motion prediction approaches: GCN-based framework and the proposed Latent Diffusion and Physical Principles Model (LDPM). (a) GCN method. (b) The proposed LDPM.
Figure 2. Overview of the proposed LDPM; "⊕" denotes weighted addition.
Figure 3. Core components of physics-informed and data-driven motion prediction: (a) Prediction processing unit (PPU) and (b) prediction process module (PPM).
Figure 4. MLP architectures for the physics-based branch of motion prediction: (a) parameterization of the mass (M), (b) the damping coefficient (C), and (c) the gravitational constant (G).
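To make the parameterization in Figure 4 concrete, the following is a minimal PyTorch sketch of how three small MLPs could output the mass matrix M(q), the damping term C(q, q̇), and the gravity term G(q) of the Euler–Lagrange equation M(q)·q̈ + C(q, q̇)·q̇ + G(q) = τ, with the next pose obtained by a simple Euler update. This is our own illustration rather than the authors' released code; the layer widths, the 48 degrees of freedom, the 25 fps time step, and the semi-implicit integration scheme are all assumptions.

```python
# Illustrative sketch only (our reading of Figure 4, not the authors' code).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class PhysicsStep(nn.Module):
    def __init__(self, dof=48, dt=1 / 25):
        super().__init__()
        self.dof, self.dt = dof, dt
        self.m_net = mlp(dof, dof * dof)        # (a) mass matrix M(q)
        self.c_net = mlp(2 * dof, dof * dof)    # (b) damping term C(q, q_dot)
        self.g_net = mlp(dof, dof)              # (c) gravity term G(q)

    def forward(self, q, q_dot, tau):
        B = q.shape[0]
        M = self.m_net(q).view(B, self.dof, self.dof)
        # Symmetrize and add a small diagonal so M stays well conditioned.
        M = 0.5 * (M + M.transpose(1, 2)) + 1e-3 * torch.eye(self.dof, device=q.device)
        C = self.c_net(torch.cat([q, q_dot], dim=-1)).view(B, self.dof, self.dof)
        G = self.g_net(q)
        # Solve M q_ddot = tau - C q_dot - G for the joint accelerations.
        rhs = tau - torch.bmm(C, q_dot.unsqueeze(-1)).squeeze(-1) - G
        q_ddot = torch.linalg.solve(M, rhs.unsqueeze(-1)).squeeze(-1)
        q_dot_next = q_dot + self.dt * q_ddot
        q_next = q + self.dt * q_dot_next       # semi-implicit Euler update
        return q_next, q_dot_next

# Hypothetical usage: advance a batch of two poses by one frame.
step = PhysicsStep()
q, q_dot, tau = torch.zeros(2, 48), torch.zeros(2, 48), torch.zeros(2, 48)
q_next, q_dot_next = step(q, q_dot, tau)
```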
Figure 5. Visualization of accurate human pose predictions over 2 s using the Human3.6M dataset. (a) presents the predicted poses for shorter durations, 0.25 s, 0.5 s, and 0.75 s, demonstrating strong alignment with the ground truth (GT). (b) extends the prediction timeline to longer durations, 1.0 s, 1.5 s, and 2.0 s, showing consistent performance in capturing the end poses with minimal deviation from the ground truth.
Figure 6. Visualization of accurate 2 s pose predictions on the HumanEva-I dataset. (a) shows the predicted poses at shorter intervals, 0.25 s, 0.5 s, and 0.75 s, with the predicted poses (purple and green) closely aligning with the ground truth (GT) poses (black). (b) extends the prediction timeline to longer intervals, 1.0 s, 1.5 s, and 2.0 s, demonstrating the model’s ability to maintain high accuracy in pose estimation even at extended durations.
Figure 7. Error progression across training, validation, and testing phases over 50 epochs on the AMASS dataset. The training error decreases and stabilizes at a low level, the validation error decreases initially but stabilizes at a higher level, and the testing error remains low and constant throughout 50 epochs, indicating effective training but limited generalization.
Figure 8. Comparison of model performance on the Human3.6M and HumanEva-I datasets under the configuration epochs = 5, layers = 6, and the cosine scheduler.
Figure 9. Comparison of motion prediction methods: average predicted motion vs. cumulative distribution function (CDF) of average absolute motion error on Human3.6M.
Table 1. Quantitative results of motion prediction methods on Human3.6M and HumanEva-I datasets.

Method               Human3.6M                                HumanEva-I
                     APD      ADE    FDE    MMADE  MMFDE      APD      ADE    FDE    MMADE  MMFDE
MotionDiff [6]       15.353   0.411  0.509  0.536  0.536      5.931    0.232  0.236  0.352  0.320
Deligan [16]         6.509    0.483  0.520  0.545  0.520      2.177    0.306  0.322  0.385  0.371
HumanMAC [28]        6.301    0.369  0.480  0.509  0.545      6.554    0.209  0.223  0.342  0.335
BeLFusion [30]       7.602    0.420  0.472  0.474  0.507      6.109    0.220  0.234  0.342  0.316
Diversampling [35]   15.310   0.370  0.480  0.482  0.509      6.109    0.220  0.234  0.342  0.316
Gsps [50]            14.757   0.389  0.496  0.476  0.525      5.825    0.233  0.240  0.344  0.331
Mojo [51]            12.579   0.412  0.497  0.497  0.538      4.181    0.234  0.244  0.369  0.347
Ours                 13.158   0.319  0.410  0.430  0.473      10.119   0.234  0.229  0.369  0.372
The bold number represents the best results.
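For reference, the accuracy metrics in Table 1 follow the usual best-of-K convention for stochastic motion prediction, while APD measures the diversity of the K predicted futures (MMADE/MMFDE are multimodal variants computed against similar ground-truth futures). Below is a minimal NumPy sketch of APD, ADE, and FDE; it is our own illustration, not the evaluation code of any compared method, and the sample count, frame count, and joint count in the usage example are assumptions.

```python
# Illustrative metric definitions; `samples` holds K predicted futures of shape
# [K, T, J, 3] (K samples, T frames, J joints, 3D coordinates), `gt` is [T, J, 3].
import numpy as np

def ade(samples, gt):
    # Average Displacement Error: mean per-frame joint error of the best sample.
    err = np.linalg.norm(samples - gt[None], axis=-1)       # [K, T, J]
    return err.mean(axis=(1, 2)).min()                      # best-of-K

def fde(samples, gt):
    # Final Displacement Error: error of the last predicted frame, best sample.
    err = np.linalg.norm(samples[:, -1] - gt[None, -1], axis=-1)  # [K, J]
    return err.mean(axis=1).min()

def apd(samples):
    # Average Pairwise Distance: diversity among the K predicted futures.
    K = samples.shape[0]
    flat = samples.reshape(K, -1)
    dists = np.linalg.norm(flat[:, None] - flat[None], axis=-1)   # [K, K]
    return dists[~np.eye(K, dtype=bool)].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(50, 100, 17, 3))   # hypothetical 50 samples, 100 frames, 17 joints
    gt = rng.normal(size=(100, 17, 3))
    print(ade(samples, gt), fde(samples, gt), apd(samples))
```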
Table 2. Time details for different training steps and their corresponding batch and total time measurements for the AMASS dataset.

Step         Batch Time (s)   Total Time (s)
1/2598       0.25             1
1001/2598    0.257            70
2001/2598    0.243            138
1/59         0.024            0
1/1141       0.103            0
1001/1141    0.149            25
Table 3. Results of different schedulers on the Human3.6M and HumanEva-I datasets.

Scheduler   Human3.6M                HumanEva-I
            FDE    MMADE  MMFDE      FDE    MMADE  MMFDE
Linear      0.521  0.518  0.563      0.258  0.408  0.422
Sqrt        0.733  0.665  0.768      0.508  0.545  0.612
Cosine      0.410  0.430  0.473      0.229  0.369  0.372
The bold number represents the best results.
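The schedulers compared in Table 3 differ only in how the per-step noise variances β_t are laid out over the diffusion process. The snippet below is a minimal sketch of one common formulation of each schedule (the cosine schedule follows Nichol and Dhariwal's squared-cosine ᾱ curve); the β ranges, the offset s, and the exact sqrt variant are assumptions rather than the paper's settings.

```python
# Illustrative noise schedules; each returns the per-step betas for `steps` noising steps.
import numpy as np

def linear_schedule(steps, beta_start=1e-4, beta_end=2e-2):
    # Betas increase linearly from beta_start to beta_end.
    return np.linspace(beta_start, beta_end, steps)

def sqrt_schedule(steps, beta_start=1e-4, beta_end=2e-2):
    # One common "sqrt" variant: betas grow with the square root of the normalized step index.
    return beta_start + (beta_end - beta_start) * np.sqrt(np.linspace(0.0, 1.0, steps))

def cosine_schedule(steps, s=8e-3):
    # Cosine schedule: betas derived from a squared-cosine alpha_bar curve.
    t = np.linspace(0, steps, steps + 1) / steps
    alpha_bar = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, 0.999)

# Hypothetical usage: the cumulative product of (1 - beta_t) controls how quickly
# a clean pose sequence is destroyed during the forward noising process.
betas = cosine_schedule(1000)
alphas_bar = np.cumprod(1.0 - betas)
print(betas[:3], alphas_bar[-1])
```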
Table 4. Experimental results on Human3.6M for varying values of E.

E     Human3.6M
      FDE    MMADE  MMFDE
5     0.410  0.390  0.423
10    0.534  0.524  0.572
20    0.410  0.430  0.473
The bold number represents the best results.
Table 5. Experimental results on HumanEva-I for varying values of E.

E     HumanEva-I
      FDE    MMADE  MMFDE
5     0.265  0.408  0.438
10    0.229  0.369  0.372
20    0.248  0.401  0.403
The bold number represents the best results.
Table 6. Results of different network design configurations with and without skip connections on Human3.6M and HumanEva-I datasets.

Skip Connection   Human3.6M                HumanEva-I
                  APD     ADE    FDE       APD     ADE    FDE
N                 15.236  0.400  0.423     9.927   0.226  0.217
Y                 13.158  0.319  0.410     10.119  0.234  0.229
The bold number represents the best results.
Table 7. Comparison of model performance when varying the number of layers on the Human3.6M and HumanEva-I datasets.

Layers   Human3.6M                HumanEva-I
         FDE    MMADE  MMFDE      FDE    MMADE  MMFDE
2        0.580  0.546  0.609      0.508  0.545  0.612
4        0.551  0.529  0.584      0.229  0.369  0.372
6        0.532  0.523  0.571      0.239  0.410  0.425
8        0.410  0.430  0.473      0.255  0.417  0.439
10       0.510  0.516  0.554      0.256  0.422  0.450
The bold number represents the best results.
Table 8. Results of different noising steps on Human3.6M and HumanEva-I datasets.

Noising Steps   Human3.6M                HumanEva-I
                FDE    MMADE  MMFDE      FDE    MMADE  MMFDE
100             0.528  0.522  0.565      0.246  0.402  0.419
1000            0.410  0.430  0.473      0.229  0.369  0.372
The bold number represents the best results.
Table 9. The motion editing ability of different methods.

Method               Switch Ability   Independent Body Part Control
MotionDiff [6]       0                0
Deligan [16]         0                0
HumanMAC [28]        1                1
BeLFusion [30]       1                0
Diversampling [35]   0                0
Gsps [50]            0                1
Mojo [51]            0                0
Ours                 1                1
Table 10. Comparison of performance metrics and design choices across various motion prediction models on the Human3.6M dataset.

            MotionDiff [6]   Deligan [16]   HumanMAC [28]   BeLFusion [30]   Diversampling [35]   Gsps [50]   Ours
One Stage   N                Y              Y               N                N                    N           Y
Loss        4                1              1               4                3                    5           1
FPS         15               22             25              18               20                   12          30
Table 11. Comparison of model performance in motion prediction across various methods in terms of runtime, parameter size, and accuracy metrics on the Human3.6M dataset.

                     MotionDiff [6]   HumanMAC [28]   BeLFusion [30]   Diversampling [35]   Gsps [50]   Ours
Runtime (ms)         15.2             12.3            18.1             14.5                 16.7        10.8
Parameter Size (M)   45.3             30.1            35.8             28.7                 40.2        25.9
ADE                  0.411            0.369           0.420            0.370                0.389       0.319
FDE                  0.509            0.480           0.472            0.480                0.496       0.410
The bold number represents the best results.
Table 12. Quantitative comparison of different models across multiple evaluation metrics on the Human3.6M dataset.

Model             APD      ADE    FDE    MMADE  MMFDE
Data-driven       12.406   0.395  0.517  0.512  0.580
Physics-based     11.238   0.329  0.534  0.487  0.543
Proposed hybrid   13.158   0.319  0.410  0.430  0.473
The bold number represents the best results.
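Table 12 contrasts the purely data-driven branch, the purely physics-based branch, and their combination. As a minimal sketch of how such a combination could be realized (consistent with the weighted addition "⊕" in Figure 2, but not necessarily the authors' exact fusion rule), a single learnable weight can blend the two predicted poses; the sigmoid parameterization and the pose dimensionality are assumptions.

```python
# Illustrative fusion of the diffusion-branch and physics-branch pose predictions.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learned blending weight

    def forward(self, pose_diffusion, pose_physics):
        w = torch.sigmoid(self.alpha)                   # keep the weight in (0, 1)
        return w * pose_diffusion + (1 - w) * pose_physics

# Hypothetical usage on a batch of two 48-dimensional poses.
fusion = WeightedFusion()
blended = fusion(torch.zeros(2, 48), torch.zeros(2, 48))
```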