1. Introduction
Human action recognition is a cornerstone of computer vision with wide-ranging applications, from intelligent surveillance systems to human–computer interaction [1]. While recent advancements in deep learning have led to significant progress in this field, the performance of these models is heavily dependent on the quality and quantity of training data. In particular, multi-view data, captured simultaneously from various perspectives, is considered crucial for improving generalization by enabling models to learn view-invariant features [2,3]. Multi-view data can mitigate the self-occlusion and object occlusion that arise in single-view recordings, and it allows complex movements to be captured unambiguously in 3D space. Indeed, developing robust methods to handle such occluded video scenarios remains a key challenge in video analysis [4].
However, constructing such datasets entails substantial cost and effort, as it requires synchronized multi-camera setups and controlled environments [5]. Furthermore, the post-processing required to synchronize and calibrate the video streams from each camera is exceedingly complex. To circumvent these challenges, some studies have turned to generating synthetic data in virtual environments [6,7] or through generative models in various domains [8]. A persistent issue with this approach is the domain gap, as synthetic data often fails to capture the complexity and subtlety of real-world motion. Consequently, the action recognition research community continues to face difficulties in obtaining accessible, high-quality multi-view data.
Against this background, this study aims to address two key research questions:
RQ1: How can realistic, temporally consistent, multi-view 3D action data be generated from a single monocular video without the need for expensive equipment?
RQ2: To what extent does the proposed data generation pipeline—encompassing 3D human mesh recovery, temporal refinement, and scene reconstruction—accurately preserve the nuanced motion of the original video?
To answer these questions, this paper proposes the PSEW (Pose Scene Everywhere) framework, a practical solution for automatically generating temporally consistent, multi-view 3D human action data from a single, accessible monocular video. This study demonstrates that by fusing state-of-the-art 3D Human Mesh Recovery and 3D scene reconstruction techniques, it is feasible to stabilize dynamic human meshes from a video via temporal post-processing and reconstruct them within a virtual 3D space to synthesize novel viewpoints. The main contribution is an integrated framework that organically combines 3D human mesh recovery, temporal consistency correction, and 3D scene reconstruction. Through this process, the framework preserves the dynamic motion of the original video with high fidelity, as demonstrated in our evaluations. Ultimately, this work offers a low-cost, scalable solution to the data scarcity problem in action recognition, thereby enhancing the accessibility of research in this field.
2. Related Work
2.1. Three-Dimensional Human Mesh Recovery
3D Human Mesh Recovery, the task of recovering the 3D pose and shape of a human from a single 2D image or video, has long been a major challenge in computer vision. Early research primarily employed a two-stage approach, first detecting 2D body joints and then lifting them into 3D space [9,10]. However, this approach has the drawback that errors in the 2D predictions propagate directly into the 3D results. With recent advances in deep learning, the mainstream approach has shifted to end-to-end methods that directly predict the parameters of a 3D parametric model, such as SMPL (Skinned Multi-Person Linear Model) [11] or its extended version, SMPL-X (Skinned Multi-Person Linear Model with Articulated Hands and Expressive Face) [12], from image features. These models regress tens of parameters, including pose, shape, and expression, to generate realistic 3D human meshes (Figure 1).
The framework proposed in this study adopts this modern approach. The implemented model uses a powerful image feature extractor as its backbone and directly predicts the human model’s parameters through an attention-based transformer decoder. However, single-frame-based predictions can lead to temporal inconsistencies and jittering artifacts across a video. To address this, the framework introduces a multi-stage post-processing pipeline. A smoothing filter based on a state-space model, in particular, plays a key role by considering the dynamic characteristics of each parameter to smooth their temporal evolution, thereby restoring natural and stable movements.
2.2. Three-Dimensional Scene Reconstruction
Reconstructing 3D scenes from 2D images is essential for fields like augmented reality and robotics. Traditional approaches include Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), which simultaneously estimate camera poses and 3D structure by matching feature points across multiple views [13]. More recently, deep learning-based techniques such as Neural Radiance Fields (NeRF) [14] have gained significant attention for their ability to synthesize highly realistic novel-view images from a set of input images.
However, these technologies are mostly optimized for static scenes, and their complexity increases significantly in dynamic environments with moving objects like people. The framework takes a pragmatic approach tailored to its goal of generating action recognition data, rather than reconstructing the entire scene. It focuses on placing the reconstructed 3D human meshes in a spatio-temporally consistent manner. The core of this process is the floor plane estimation algorithm. This algorithm statistically determines the most stable 3D floor plane by analyzing the foot positions of all individuals across multiple frames. By aligning all human meshes to this plane, it creates a stable and realistic 3D environment where all characters appear to be moving on the same ground surface.
2.3. Data Augmentation for Action Recognition
Large-scale, diverse data is essential for enhancing the performance of deep learning-based action recognition models. While collecting more real-world data is ideal, data augmentation techniques are widely used to expand existing datasets due to cost and time constraints. In addition to common 2D augmentation methods, there is active research into using 3D technology to generate data with a higher degree of freedom. A representative example is the use of game engines to create virtual characters and environments to generate desired action data [5,6].
While this approach offers high flexibility in data generation, it often struggles to perfectly replicate the subtle and natural movements of real humans and can suffer from a domain gap due to visual differences from real-world data. The framework addresses this issue by directly capturing and reconstructing human movements from real videos in 3D. As a result, the generated data retains the naturalness of real actions. Furthermore, through a camera configuration file, users can freely adjust the virtual camera’s position, angle, and field of view to generate a virtually infinite amount of multi-view data. This provides an effective solution that surpasses the limitations of 2D data augmentation, enabling the training of action recognition models that are highly robust to viewpoint changes.
3. Methodology
This section elaborates on the core technical processes of the proposed framework. This process encompasses the entire pipeline for generating spatio-temporally consistent, multi-view 3D action data from the limited input of a single monocular video. This transformation process is analogous to having a virtual film director capture a single actor’s performance from multiple angles simultaneously. To achieve this, the process begins with the initial 3D human parameter prediction stage, where the 3D shape and motion of each individual are estimated from the original video. However, since these initial predictions are generated independently for each frame, they lack temporal continuity and often contain unnatural jittering artifacts. To resolve this issue and restore realistic motion, a crucial step of motion refinement through temporal consistency correction is required, which meticulously refines the predicted parameters. Finally, the generated 3D human meshes are placed within a stable virtual space, and the final goal of creating multi-view data is accomplished through the 3D scene reconstruction and virtual camera rendering stage, where they are rendered from desired viewpoints. Each stage is organically linked, forming a sequential process that expands limited 2D information into rich 3D spatio-temporal data (Figure 2).
The overall pipeline of this framework consists of three core stages. First, in the Initial 3D Human Parameter Prediction stage, the parameters of a 3D human model are estimated from each frame of the input monocular video. Second, the Motion Refinement through Temporal Consistency Correction stage corrects the motion between frames using tracking, interpolation, and Kalman filtering to generate natural movements. Finally, in the 3D Scene Reconstruction and Virtual Camera Rendering stage, the refined 3D human meshes are placed within a stable virtual space, and novel-view videos are rendered from various user-defined viewpoints to produce the final multi-view data. The following sections will detail each of these stages.
3.1. Implementation Environment
The PSEW framework proposed in this study was developed and tested in a specific hardware and software environment to ensure the reproducibility of our experiments. The specifications of the operating system, hardware components, and key software libraries used in this study are detailed in Table 1.
3.2. Initial 3D Human Parameter Prediction
The first step of the data generation process begins by decomposing the input monocular video into individual frame images. Each frame is then sequentially fed into a deep learning-based Human Mesh Recovery (HMR) model to predict the 3D parameters for each person present in the image. This is a technically demanding task, as the complex 3D shape and pose of the human body must be inferred from 2D pixel information alone.
The model used in this framework is designed based on two key technologies for robust performance. First, it uses the DINOv2 (Self-distillation with no labels v2) [22] Vision Transformer (ViT) as its backbone network. DINOv2 is pre-trained on a large-scale unlabeled image dataset using a self-supervised learning method, making it highly effective at extracting rich visual features that capture the context and semantics of an image. Second, these extracted features are passed to a transformer decoder head based on an attention mechanism. This decoder comprehensively analyzes both global and local features to regress the parameters of a 3D human model for each individual.
The target of this prediction is the SMPL-X model [12]. SMPL-X is a statistical model that represents the 3D human body using hundreds of parameters that control its shape, full-body pose, articulated hands, and facial expressions. By using this model, the complex 3D structure of the human body can be handled efficiently through a compact set of parameters.
Before predicting the 3D parameters, the model must first detect the location of people within the frame. A detection threshold of 0.3 is applied to the probability of a location being identified as a person. This value was empirically set to strike an appropriate balance between false positives (incorrectly detecting objects as people) and false negatives (missing actual people). In other words, the threshold is neither so strict that real people are missed nor so lenient that parts of the background are misinterpreted as a person. Subsequent 3D parameter prediction is performed only for locations that pass this threshold. As a result, each frame contains a set of SMPL-X parameters for every detected person. These initial predictions serve as the foundational data for the subsequent temporal consistency correction stage.
3.2.1. Data Preprocessing
As illustrated in Figure 3, each frame of the video is converted into a standardized format that the model can process. As original images have unique resolutions and aspect ratios, they are resized to fit the model’s fixed input size while maintaining the original aspect ratio. The image is scaled based on its shorter side, and the remaining space is filled with black pixels (padding) to create a square shape. Subsequently, the image’s pixel values are normalized to a range between 0 and 1 and are further standardized using the statistics from a large-scale image dataset, which allows the model to learn features more effectively. Concurrently, based on camera information such as the user-defined field of view (FOV), the intrinsic parameter matrix of a virtual camera is generated, which defines the relationship for projecting a 3D space onto a 2D image plane.
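To make the preprocessing concrete, the sketch below shows one way to implement the aspect-preserving resize with padding, normalization, and FOV-derived intrinsic matrix described above. The input size, normalization statistics, and letterboxing convention are illustrative assumptions rather than the exact values used by PSEW.

```python
import numpy as np
import cv2

def preprocess_frame(frame_bgr, target_size=448, fov_deg=60.0):
    """Minimal sketch: letterbox a frame into a square model input and build a
    pinhole intrinsic matrix from a user-defined FOV. Values are illustrative."""
    h, w = frame_bgr.shape[:2]
    scale = target_size / max(h, w)          # one common letterboxing convention; PSEW's may differ
    resized = cv2.resize(frame_bgr, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.zeros((target_size, target_size, 3), dtype=np.uint8)   # black padding
    canvas[:resized.shape[0], :resized.shape[1]] = resized

    img = canvas[..., ::-1].astype(np.float32) / 255.0                 # BGR -> RGB, scale to [0, 1]
    mean = np.array([0.485, 0.456, 0.406]); std = np.array([0.229, 0.224, 0.225])
    img = (img - mean) / std                                           # ImageNet-style standardization

    # Pinhole intrinsics derived from the field of view.
    f = 0.5 * target_size / np.tan(0.5 * np.deg2rad(fov_deg))
    K = np.array([[f, 0, target_size / 2],
                  [0, f, target_size / 2],
                  [0, 0, 1]], dtype=np.float32)
    return img.transpose(2, 0, 1), K       # CHW layout expected by most backbones
```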
3.2.2. Feature Extraction
The standardized image is passed to a powerful Vision Transformer backbone network to extract deep semantic information. As shown in Figure 4, this network divides the image into multiple small, square patches and processes each patch as an independent input unit. Through the transformer’s self-attention mechanism, the model learns the relationships between each patch and all other patches within the image. Through this process, the original image is transformed into a set of high-dimensional feature vectors that richly encode the visual content and surrounding contextual information of each patch. These feature vectors encompass a deep understanding of the object’s shape, texture, and spatial arrangement, going beyond simple pixel information.
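The minimal PyTorch sketch below illustrates the patchify-and-attend flow described above. The dimensions, depth, and learned positional encoding are illustrative; PSEW itself relies on a pre-trained DINOv2 backbone rather than a model trained from scratch.

```python
import torch
import torch.nn as nn

class TinyViTBackbone(nn.Module):
    """Toy ViT-style feature extractor: patch embedding -> transformer encoder."""
    def __init__(self, img_size=448, patch=14, dim=384, depth=4, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # non-overlapping patches
        n_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))          # positional encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                            # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                 # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, dim), one token per patch
        return self.encoder(tokens + self.pos_embed) # contextualized patch features

feats = TinyViTBackbone()(torch.randn(1, 3, 448, 448))   # -> (1, 1024, 384)
```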
3.2.3. Human Detection
As illustrated in Figure 5, the feature vectors extracted from the entire image are then used to locate where people are present within the image. The feature vector corresponding to each patch is passed through a small neural network classifier, which converts it into a score representing the probability of a person’s presence at that location. In the resulting score map, multiple high scores may appear adjacently for a single person. To eliminate these duplicate detections, a Non-Maximal Suppression process is applied, which retains only the location with the highest score within a specific area and suppresses the others. Finally, only the locations with scores exceeding a pre-defined confidence threshold (0.3) are confirmed as positions where a person is present.
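A compact sketch of this thresholding and suppression logic is given below. Only the 0.3 threshold is taken from the paper; the max-pooling style of NMS and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def detect_people(score_map, thresh=0.3, nms_kernel=3):
    """Minimal sketch: keep local maxima of a per-patch person-probability map
    that exceed the detection threshold."""
    # score_map: (H, W) sigmoid probabilities, one value per patch location
    pooled = F.max_pool2d(score_map[None, None], nms_kernel, stride=1,
                          padding=nms_kernel // 2)[0, 0]
    keep = (score_map == pooled) & (score_map > thresh)   # NMS + confidence threshold
    ys, xs = torch.nonzero(keep, as_tuple=True)
    return list(zip(ys.tolist(), xs.tolist()))            # patch indices of detected people
```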
3.2.4. Parameter Regression
For each detected individual, the feature vector at that location becomes a “query” and is used to predict 3D parameters. This query refers to the entire image’s feature map as “context” to gather the global information needed to estimate its pose. To enhance the accuracy of this process, camera information representing the 3D viewing direction of each patch is additionally fused with the feature vectors, helping the model to perceive the 3D space. An attention-based transformer decoder allows each query to interact with the context, refining its own features. Finally, this refined feature vector is passed to several specialized output modules to simultaneously regress each parameter of the 3D human model—such as pose, shape, and expression—as well as an offset value for correcting the 2D position error (Figure 6).
The refined feature vector, which encapsulates all the 3D information for each individual after passing through the Transformer Decoder, is then forwarded to the final parameter prediction stage. A single, monolithic neural network predicting all parameters with disparate characteristics—such as pose, shape, and expression—simultaneously could lead to an inefficient and difficult learning process. To address this, the framework adopts a multi-head architecture. In this approach, a single refined feature vector is passed to multiple smaller, specialized neural networks, known as Regression Heads, which divide the respective tasks among themselves. Each head is designed to focus exclusively on predicting its assigned type of parameter. The outputs, predicted concurrently by these specialized heads, are consolidated into a single structured dictionary to complete the initial 3D parameter set for that individual. The specific role of each regression head is detailed in Table 2 below.
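The multi-head design can be sketched as follows. The head names and output dimensions follow common SMPL-X conventions and are assumptions rather than the exact configuration used in PSEW.

```python
import torch
import torch.nn as nn

class RegressionHeads(nn.Module):
    """Minimal sketch of multi-head parameter regression from a refined query vector."""
    def __init__(self, feat_dim=384):
        super().__init__()
        self.heads = nn.ModuleDict({
            "body_pose":  nn.Linear(feat_dim, 21 * 3),   # per-joint axis-angle rotations
            "global_rot": nn.Linear(feat_dim, 3),
            "betas":      nn.Linear(feat_dim, 10),       # body shape coefficients
            "expression": nn.Linear(feat_dim, 10),
            "transl":     nn.Linear(feat_dim, 3),        # 3D position
            "offset_2d":  nn.Linear(feat_dim, 2),        # 2D position correction
        })

    def forward(self, query):                            # query: (N_people, feat_dim)
        return {name: head(query) for name, head in self.heads.items()}

params = RegressionHeads()(torch.randn(2, 384))          # one structured dictionary for two people
```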
3.3. Motion Refinement Through Temporal Consistency Correction
The 3D parameters predicted independently for each frame do not guarantee continuity between frames. When used as is, this can lead to issues such as unnatural jittering or abrupt pose changes in the resulting video. To resolve these temporal inconsistencies and generate physically plausible and visually smooth motion, this framework sequentially applies a multi-stage refinement process. This process consists of object tracking, data interpolation, and smoothing using a Kalman filter.
3.3.1. Object Tracking
Identifying the same person across consecutive video frames and maintaining their identity is a crucial first step for analyzing motion continuity. The system defines this task as an assignment problem and solves it by finding the optimal pairing between individuals detected in the previous and current frames.
Specifically, assume there are $N$ individuals in the previous frame ($t-1$) and $M$ individuals in the current frame ($t$). Let the 2D position of the $i$-th person in the previous frame be $\mathbf{p}_i^{\,t-1}$, and the 2D position of the $j$-th person in the current frame be $\mathbf{p}_j^{\,t}$. The cost $c_{ij}$ is defined as the Euclidean distance between the two individuals:

$$c_{ij} = \left\lVert \mathbf{p}_i^{\,t-1} - \mathbf{p}_j^{\,t} \right\rVert_2$$

These costs are collected to form an $N \times M$ cost matrix $C = [c_{ij}]$. The objective is to assign each person to a unique counterpart in a way that minimizes the total sum of costs for all assigned pairs. The optimal solution to this problem can be efficiently found using the Hungarian Algorithm, which finds an assignment matrix $X = [x_{ij}]$ that minimizes the following objective function:

$$\min_{X} \sum_{i=1}^{N} \sum_{j=1}^{M} c_{ij}\, x_{ij}$$

Here, $x_{ij}$ is a binary variable that is 1 if the $i$-th person and the $j$-th person are paired, and 0 otherwise. Through this process, the identity of each individual is consistently maintained across all frames, enabling the generation of individual motion trajectories.
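In practice, this assignment can be solved with an off-the-shelf Hungarian solver, as in the hedged sketch below; the distance gating value is an assumption not specified in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_people(prev_xy, curr_xy, max_dist=100.0):
    """Minimal sketch: Euclidean cost matrix + Hungarian matching between frames."""
    # prev_xy: (N, 2) positions at frame t-1; curr_xy: (M, 2) positions at frame t
    cost = np.linalg.norm(prev_xy[:, None, :] - curr_xy[None, :, :], axis=-1)  # (N, M)
    rows, cols = linear_sum_assignment(cost)       # optimal assignment minimizing total cost
    # Keep only plausible pairs; unmatched people start new tracks or are marked missing.
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_dist]

matches = match_people(np.array([[10., 20.], [300., 40.]]),
                       np.array([[305., 42.], [12., 19.]]))   # -> [(0, 1), (1, 0)]
```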
3.3.2. Data Interpolation
Despite the tracking process, there may be cases where the model fails to detect a person in a particular frame or where some parameters are missing due to occlusion. Such data gaps can cause discontinuities in motion. In this stage, linear interpolation is used to fill these gaps.
Assume a parameter set $P$ for a certain individual is missing between frames $t_1$ and $t_2$. The parameter $P(t)$ at an arbitrary time $t$ ($t_1 < t < t_2$) within the missing interval is calculated as follows:

$$P(t) = (1 - \alpha)\,P(t_1) + \alpha\,P(t_2)$$

Here, $P(t_1)$ and $P(t_2)$ are the valid parameter sets immediately preceding and succeeding the missing interval, respectively. $\alpha$ is a parameter representing the normalized temporal position, which functions as a weight in the interpolation formula and is defined as follows:

$$\alpha = \frac{t - t_1}{t_2 - t_1}$$
This linear interpolation allows for a smooth transition without abrupt data changes, which contributes to the stability of the subsequent smoothing stage.
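A minimal sketch of this gap filling for a single parameter track is shown below, using NaN to mark frames where the person was not detected; organizing the interpolation per parameter dimension in this way is an assumption.

```python
import numpy as np

def fill_gaps(values):
    """Linearly interpolate missing (NaN) entries of one parameter track over time."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values))
    valid = ~np.isnan(values)
    values[~valid] = np.interp(t[~valid], t[valid], values[valid])
    return values

# Frames 2-3 were missed by the detector and are filled from their neighbors.
print(fill_gaps([0.0, 1.0, np.nan, np.nan, 4.0]))   # -> [0. 1. 2. 3. 4.]
```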
3.3.3. Smoothing with Kalman Filter
Even after filling the data gaps, the accumulation of minor prediction errors from each independent frame can cause a jittering effect in the overall motion. In this stage, a Kalman Filter is applied to remove this high-frequency noise and generate physically plausible, smooth movements. The Kalman Filter is a recursive filter that statistically combines a prediction based on the previous state with the current, uncertain observation to estimate the optimal state.
The system uses a Constant Velocity Model as its state-space model. The state vector $\mathbf{x}_t$ for a parameter to be filtered consists of its position $p_t$ and velocity $v_t$:

$$\mathbf{x}_t = \begin{bmatrix} p_t \\ v_t \end{bmatrix}, \qquad F = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \qquad H = \begin{bmatrix} 1 & 0 \end{bmatrix}$$

The Kalman Filter process is divided into a prediction step and an update step:

$$\text{Prediction:} \quad \hat{\mathbf{x}}_{t|t-1} = F\,\hat{\mathbf{x}}_{t-1|t-1}, \qquad P_{t|t-1} = F P_{t-1|t-1} F^{\top} + Q$$

$$\text{Update:} \quad K_t = P_{t|t-1} H^{\top}\left(H P_{t|t-1} H^{\top} + R\right)^{-1}, \qquad \hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t\left(z_t - H\hat{\mathbf{x}}_{t|t-1}\right), \qquad P_{t|t} = \left(I - K_t H\right) P_{t|t-1}$$

Here, $F$ is the state transition matrix, $P$ is the error covariance, $Q$ is the process noise covariance, $H$ is the observation matrix, $R$ is the observation noise covariance, $z_t$ is the observed parameter value at the current frame, and $K_t$ is the Kalman gain.
A key feature of this framework is its ability to adjust the filter’s responsiveness by setting different values for the noise covariance matrices Q and R based on the dynamic characteristics of the parameter being filtered.
Pose (Rotation) Parameters ($\theta$): Since human pose can change rapidly, the process noise ($Q$) is set to $1 \times 10^{-3}$ and the observation noise ($R$) to $1 \times 10^{-1}$. This configuration places a relatively high trust in the current observation, allowing the filter to respond agilely to changes. For rotational data, the process includes normalizing the angular innovation ($z_t - H\hat{\mathbf{x}}_{t|t-1}$) to the range $[-\pi, \pi]$ to prevent anomalous spinning artifacts caused by value wrapping near 360 degrees.
Position Parameters: As position changes more smoothly than pose, the process noise is set to $1 \times 10^{-2}$ and the observation noise to $1 \times 10^{-1}$.
Shape and Expression Parameters ($\beta$, $\psi$): The human body shape and facial expressions are low-frequency signals that change very little or slowly within a video. Therefore, the process noise is set to $1 \times 10^{-4}$ and the observation noise to $1 \times 10^{-2}$. This low setting increases the reliance on the previous state, resulting in strong smoothing that maintains stable values.
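The sketch below implements a one-dimensional constant-velocity Kalman filter with the pose noise settings listed above and the angular wrapping of the innovation; treating each parameter dimension independently and using a unit frame interval are assumptions.

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal 1D constant-velocity Kalman filter sketch, applied per parameter dimension."""
    def __init__(self, q, r, dt=1.0):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
        self.H = np.array([[1.0, 0.0]])              # only the position is observed
        self.Q = q * np.eye(2)                       # process noise covariance (simplified structure)
        self.R = np.array([[r]])                     # observation noise covariance
        self.x = np.zeros((2, 1))                    # state estimate [p, v]^T
        self.P = np.eye(2)                           # error covariance

    def step(self, z, angular=False):
        # Prediction step
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update step
        y = z - (self.H @ self.x)[0, 0]              # innovation
        if angular:                                  # wrap rotational residuals into [-pi, pi]
            y = (y + np.pi) % (2 * np.pi) - np.pi
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K * y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0, 0]                          # smoothed parameter value

kf = ConstantVelocityKalman(q=1e-3, r=1e-1)          # pose (rotation) setting from the paper
smoothed = [kf.step(z, angular=True) for z in [0.10, 0.12, 0.50, 0.13]]
```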
3.4. Three-Dimensional Scene Reconstruction and Virtual Camera Rendering
The 3D human parameters, having achieved spatio-temporal consistency through the motion refinement process, are finally used to construct a virtual 3D space and generate novel-view videos. This final stage is broadly divided into the process of constructing a stable 3D scene and the process of rendering that scene from various user-defined viewpoints.
3.4.1. Three-Dimensional Scene Construction
To build a stable and realistic virtual environment, the framework first generates 3D meshes for every person in each frame using the refined parameter sets. Since these generated meshes are not yet aligned in the world coordinate system, a process is needed to establish a reference floor plane and reconstruct the scene based on it, making all individuals appear to interact on the same ground surface.
Floor Plane Estimation: To estimate a stable floor plane, the system analyzes the 3D vertex data of all reconstructed human meshes across every frame of the video. It collects the locations corresponding to the feet, which are the points with the lowest y-coordinate in the 3D coordinates of each mesh. Using all the collected foot position coordinates, a single 3D plane that best represents these points is statistically calculated. This method is less sensitive to errors in specific frames or individuals and allows for the estimation of the most probable ground surface that persists throughout the entire video.
Scene Alignment: Once the floor plane is estimated, a single transformation matrix (rotation and translation) is calculated to be applied to the entire scene, aligning this plane with the horizontal plane of the world coordinate system (e.g., Y = 0). Subsequently, this transformation matrix is applied to all 3D human meshes within each frame. Through this process, all individuals are aligned on the same virtual ground, maintaining a stable spatial relationship even as they assume different positions and poses.
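One possible implementation of the floor plane estimation and alignment steps is sketched below: foot points gathered over the whole video are fitted with a plane via SVD, and a single rigid transform maps that plane onto Y = 0. The robustness details of the statistical fit used in PSEW (e.g., outlier handling) are not reproduced here.

```python
import numpy as np

def estimate_floor_transform(foot_points):
    """Minimal sketch: least-squares plane fit + rigid transform onto the world plane Y = 0."""
    pts = np.asarray(foot_points, dtype=float)        # (K, 3) lowest vertices from all frames
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)          # smallest singular vector = plane normal
    normal = vt[-1] / np.linalg.norm(vt[-1])
    if normal[1] < 0:                                 # make the normal point "up" (+Y)
        normal = -normal

    up = np.array([0.0, 1.0, 0.0])
    v = np.cross(normal, up); s = np.linalg.norm(v); c = float(normal @ up)
    if s < 1e-8:
        R = np.eye(3)
    else:                                             # Rodrigues formula rotating the normal onto +Y
        vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        R = np.eye(3) + vx + vx @ vx * ((1 - c) / s**2)

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = -R @ centroid                          # plane points map to Y = 0 after the transform
    return T                                          # apply to every mesh vertex (homogeneous coords)
```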
Scene Saving: Finally, the aligned 3D human meshes for each frame, along with lighting information, are integrated into a single 3D scene and saved in the .glb file format, which is widely used in 3D modeling and rendering. These files are then used as input for the subsequent rendering stage.
3.4.2. Virtual Camera Rendering
Once the 3D scene construction is complete, the user can render the scene from any desired viewpoint to generate new videos. This process is controlled via a separate JSON-formatted configuration file, which the user can modify to obtain various outputs.
Virtual Camera Parameter Settings: The characteristics of the virtual camera used for rendering are defined in the configuration file. The main parameters and their default values are detailed in Table 3.
Table 3. Virtual Camera Parameter Settings.

| Parameter | Description | Example Value |
|---|---|---|
| size | The resolution (width, height) of the final generated video. | (1920, 1080) |
| fov | The field of view of the camera lens in degrees. | 130 |
| distance | The distance between the camera and the scene’s center in virtual units. | 5 |
| rotation | The range and step for the rendering camera’s angles in degrees. | Horizontal: (0, 360, 30); Vertical: (0, 90, 30) |
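A hypothetical configuration matching Table 3 might look like the following snippet, which writes the JSON consumed by the rendering stage; the key names and file layout are assumptions, since the actual schema is defined by the PSEW repository.

```python
import json

# Hypothetical camera configuration mirroring Table 3; actual key names may differ.
camera_config = {
    "size": [1920, 1080],               # output resolution (width, height)
    "fov": 130,                         # field of view in degrees
    "distance": 5,                      # camera-to-scene-center distance (virtual units)
    "rotation": {
        "horizontal": [0, 360, 30],     # start, end, step in degrees -> 12 azimuth views
        "vertical": [0, 90, 30]         # start, end, step in degrees -> elevation views
    },
    "draw_outline": 1                   # optional outline-rendering flag mentioned in the text
}

with open("camera_config.json", "w") as f:
    json.dump(camera_config, f, indent=2)
```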
Rendering and Post-processing: Based on the configuration file, the rendering engine places a virtual camera at each specified angle and captures the 3D scene of each frame as a 2D image. During this process, additional options can be applied to enhance visual quality. For instance, setting the draw-outline argument to 1 will draw an outline around the rendered human meshes, creating a cartoon-like effect or more clearly defining the characters’ boundaries. Furthermore, realism is enhanced by adding shadow effects based on depth through the lighting and material settings of the 3D scene.
Video Generation: The image sequences generated from each viewpoint are finally consolidated into a single video file using FFmpeg [21]. This completes the generation of the user-desired number of multi-view videos from a single video input.
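For illustration, an image sequence for one viewpoint could be assembled into a video by invoking FFmpeg as below; the frame naming pattern and encoding settings are assumptions, while the flags themselves are standard FFmpeg options.

```python
import subprocess

# Hypothetical frame naming; standard FFmpeg options for turning an image sequence
# into an H.264 video. Frame rate and codec settings are illustrative.
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "30",                         # input frame rate
    "-i", "renders/view_030deg/frame_%05d.png",
    "-c:v", "libx264", "-pix_fmt", "yuv420p",   # widely compatible encoding
    "renders/view_030deg.mp4",
], check=True)
```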
4. Experimental Results
To validate the performance of the proposed PSEW framework, the quality of the generated multi-view data was assessed from both qualitative and quantitative perspectives. The qualitative evaluation focused on the visual realism and consistency of the generated 3D scenes and motions. The quantitative evaluation centered on objectively measuring how accurately the motion from the original video was preserved.
4.1. Experimental Design
To objectively evaluate the performance of the proposed PSEW framework, we conducted experiments utilizing high-quality public datasets provided by AI-Hub (www.aihub.or.kr) [23,24,25,26]. All frame images used for the visual materials in this paper, including Figures 7, 9 and 10, were generated based on data downloaded from this platform, and our use of the data complies with all of AI-Hub’s terms and conditions. Importantly, to protect the privacy of the individuals in the videos, personally identifiable features such as faces have been de-identified through blurring. We used videos containing complex and dynamic movements, including “Squat”, “Dance #1”, “Dance #2”, and “Yoga”, to assess performance on a variety of motion characteristics.
To quantitatively measure how accurately the generated data preserves the motion of the original video, we used two standard metrics. First, Root Mean Square Error (RMSE) measures the overall structural similarity of the pose. Second, Mean Per Joint Position Error (MPJPE) calculates the average positional error of each body joint to assess pose accuracy. The evaluation was conducted in both the 2D image space and the 3D world coordinate space.
The experimental procedure involved feeding each original video into the proposed PSEW framework to generate a temporally consistent 3D human mesh sequence. Subsequently, we measured the 2D error by comparing the 2D keypoints from the rendered output (at the same viewpoint as the original) with the 2D keypoints extracted from the original video on a frame-by-frame basis. Additionally, we calculated the 3D error by comparing the 3D joint positions from the source data with the joint positions of the generated 3D meshes in 3D space.
4.2. Evaluation Metrics
To comprehensively evaluate the performance of the proposed framework, we used both qualitative criteria and quantitative metrics.
4.2.1. Qualitative Assessment Criteria
The qualitative evaluation was conducted by visually inspecting the quality and realism of the generated data. The assessment was centered on the following key criteria.
Temporal Consistency: We evaluated whether the character’s motion in the generated video sequence was smooth and natural, without any jittering artifacts.
Spatial Consistency: In scenes with multiple individuals, we checked whether all characters were stably positioned on the single estimated floor plane and whether their spatial relationships remained coherent when rendered from novel viewpoints.
Pose Accuracy: We verified the visual similarity between the pose of the generated 3D mesh and the pose of the person in the original video from various angles, including front, side, and rear views.
Motion Fidelity: We compared and analyzed whether key action sequences from the original video, such as walking or dancing, were semantically reproduced in the generated video.
4.2.2. Quantitative Metrics
The quantitative evaluation was performed in both 2D and 3D space using two standard metrics to numerically measure the error between the generated and original poses.
RMSE (Root Mean Square Error): This metric measures the overall structural similarity of the pose and is calculated as the square root of the average of the squared errors in joint positions:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left\lVert \hat{\mathbf{p}}_i - \mathbf{p}_i \right\rVert^2}$$

MPJPE (Mean Per Joint Position Error): This metric assesses pose accuracy as the mean Euclidean distance between each estimated joint position $\hat{\mathbf{p}}_i$ and the corresponding reference joint position $\mathbf{p}_i$ over all $N$ joints:

$$\mathrm{MPJPE} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert \hat{\mathbf{p}}_i - \mathbf{p}_i \right\rVert$$
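Under the (assumed) convention that both metrics are computed over per-joint Euclidean errors, they can be implemented in a few lines; the exact joint set, alignment, and normalization used in the paper are not specified here.

```python
import numpy as np

def rmse(pred, gt):
    """pred, gt: (F, J, D) arrays of joint positions (frames, joints, 2 or 3 dims)."""
    return float(np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=-1))))

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint, averaged over joints and frames."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```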
4.3. Qualitative Evaluation
Before presenting a detailed analysis of the results, we first demonstrate the step-by-step prediction process of the proposed PSEW framework, showing how a single frame from the original video is transformed into a final multi-view output in Figure 7. Panel (a) shows the input frame from the original video. Panel (b) displays the initial 3D mesh directly predicted by the HMR model as described in Section 3.2, which has not yet been temporally processed. Panel (c) is the result after applying the temporal consistency correction from Section 3.3, where motion jitter has been removed and the pose is smoothed by the Kalman filter. In panel (e), the mesh is shown stably positioned in a virtual 3D space, aligned to the estimated floor plane using the scene reconstruction technique from Section 3.4. Finally, panel (f) is the rendered image of this completed 3D scene from a novel viewpoint. As illustrated, our framework sequentially processes the input, with each stage organically contributing to the expansion of limited 2D information into rich 3D spatio-temporal data.
The qualitative evaluation involves a direct visual inspection of the quality of the generated data. This assessment was conducted in two ways.
First, the core capability of the proposed framework—reconstructing a complete 3D scene from a single 2D image—was visually validated. This evaluation focuses on demonstrating how a single frame, capturing a single moment from the original video, can be expanded into a multi-angle scene with a sense of three-dimensional space.
Figure 8 below clearly illustrates this transformation process. The multiple images presented in the figure were all generated based on the same single frame from the original video, with each image being the result of rendering from a virtual camera placed at a different position. This result presents several key indicators of success. A three-dimensional space, conveying depth and perspective, was successfully generated from a planar 2D image. Despite the complexity of a scene with multiple individuals, not only the 3D pose of each person but also the relative spatial relationships between them were expressed consistently and naturally from all viewpoints, signifying that the 3D scene arrangement was effectively inferred from the limited information of a single view. Furthermore, it can be confirmed that all individuals are stably standing on the same virtual ground, which visually demonstrates the successful operation of the framework’s scene reconstruction stage, where a single floor plane is estimated from the entire video and used as a basis for scene alignment.
This qualitative evaluation can be further extended to demonstrate the framework’s robustness across various action types.
Figure 9 illustrates the reconstruction and rendering results for five different actions: Squat, Dance #1, Dance #2, Yoga, and Figure Skating. The top image in each column is the original input frame, while the images below are the renderings of that frame’s 3D scene from different horizontal angles. It can be observed that the rendered result at the 0° (Front) view closely matches the pose in the original image, indicating a high degree of 3D reconstruction accuracy. Furthermore, the views from other angles, such as the side and rear, are also three-dimensionally consistent and plausible. These results confirm that PSEW possesses the generalization capability to generate high-quality, multi-view data not only for static poses but also for dynamic and complex movements.
Second, to evaluate how well the generated video reproduces the actions of the original video, we compared the results rendered from the same viewpoint as the original.
Figure 10 below presents a side-by-side comparison of a specific action sequence from the original video and the corresponding sequence from the synthesized video. The comparison confirms that key actions from the original video, such as walking or waving, were reproduced with high fidelity through the generated 3D human meshes. This suggests that the human mesh recovery and temporal consistency correction stages of the proposed framework effectively preserve the essential features of the original motion.
4.4. Quantitative Evaluation
To comprehensively analyze the visual reproducibility in the 2D image space and the structural accuracy in the 3D physical space, we quantitatively evaluated the similarity between the generated and original motions using the previously defined RMSE and MPJPE metrics.
The evaluation in 2D space reveals how well the model mimics the visual movements of the original video (Figure 11). According to the 2D RMSE analysis, DANCE1 (0.1727) exhibited the highest fidelity with the lowest error, followed by DANCE2 (0.1849) and YOGA (0.1894). In contrast, SQUAT (0.2720) showed a noticeably higher error. This trend was even more pronounced in the 2D MPJPE results, where the error for SQUAT (0.3187) was substantially higher than that of the other motions (0.20–0.21). This suggests that the large vertical displacements and self-occlusion from limb crossing in the squat motion are primary factors that induce larger errors, particularly in the 2D projected space.
The 3D space evaluation assesses how accurately the model understands and generates the actual three-dimensional structure of the motion (Figure 12). In the 3D RMSE results, the errors for all motions were generally lower compared to their 2D counterparts, with DANCE1 (0.1454) once again demonstrating the best performance. The 3D MPJPE analysis yielded more intriguing results. While DANCE1 (0.2057) remained the most accurate, SQUAT (0.2335), which had the highest error in 2D, actually performed better than both YOGA (0.2675) and DANCE2 (0.2660). This indicates that the model comprehends the 3D structure of the squat motion with relative accuracy, but it is more susceptible to the information loss and distortion that occur when this structure is projected onto a 2D plane.
In summary, the quantitative evaluation robustly demonstrates that the proposed model successfully reproduces diverse human motions with very high accuracy in both 2D and 3D spaces. It consistently showed the strongest performance on fluid and continuous movements like DANCE1. Furthermore, by comparing the 2D and 3D results, we confirmed that the model’s performance is influenced not only by the intrinsic complexity of a motion but also by the characteristics that arise when it is represented in a specific dimension (2D or 3D). In conclusion, the model possesses an exceptional motion generation capability, highlighted by a strong understanding of structural dynamics in 3D space.
Additionally, a statistical analysis was conducted to verify whether the observed quantitative performance differences among the five action types were statistically significant. A Shapiro–Wilk test confirmed that the error data for all action groups did not follow a normal distribution (p < 0.001); therefore, the non-parametric Kruskal–Wallis test was applied.
The analysis revealed a statistically significant difference among the action types across all evaluation metrics (for 3D MPJPE, H = 391.23, p < 0.001; for 3D RMSE, H = 482.29, p < 0.001). To identify specific differences, a Dunn’s post hoc test with Bonferroni correction was performed. In the 3D space analysis, the “SQUAT” action exhibited a statistically significantly higher error compared to “DANCE1”, “DANCE2”, and “YOGA” (all p < 0.001). Furthermore, while “DANCE1” also showed significant differences from all other actions, “DANCE2” and “YOGA” were not significantly different from each other in terms of 3D spatial error (p = 1.00). These results quantitatively support that the large vertical displacements and severe self-occlusion inherent in the “SQUAT” motion have a statistically significant negative impact on the model’s reconstruction accuracy.
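The reported statistical procedure can be reproduced with standard SciPy routines, as in the sketch below; the per-frame error arrays are synthetic placeholders, and the use of the scikit-posthocs package for Dunn's test is an assumption.

```python
import numpy as np
from scipy.stats import shapiro, kruskal

# Placeholder per-frame 3D errors; in practice these would hold the measured values
# for each action type evaluated in this study.
rng = np.random.default_rng(0)
errors_by_action = {name: rng.gamma(2.0, 0.1, size=300)
                    for name in ["SQUAT", "DANCE1", "DANCE2", "YOGA", "SKATING"]}

for name, errs in errors_by_action.items():
    print(name, "Shapiro-Wilk p =", shapiro(errs).pvalue)    # normality check per group

H, p = kruskal(*errors_by_action.values())                   # non-parametric group comparison
print(f"Kruskal-Wallis H = {H:.2f}, p = {p:.4g}")

# Dunn's post hoc test with Bonferroni correction is available via the
# scikit-posthocs package (posthoc_dunn); we assume a comparable tool was used here.
```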
5. Discussion
This study provides clear answers to the two key research questions posed in the Introduction, supported by the experimental results.
First, regarding RQ1 (“How can realistic, temporally consistent, multi-view 3D action data be generated from a single video without expensive equipment?”), this study presents the PSEW framework as a comprehensive solution. By organically integrating 3D human mesh recovery, temporal consistency correction, and 3D scene reconstruction, this automated pipeline demonstrated its effectiveness at generating spatio-temporally consistent 3D action data without the need for costly multi-camera setups.
Second, in response to RQ2 (“To what extent does the generated data accurately preserve the original motion?”), our quantitative evaluation provides a definitive answer. The proposed method preserves the original video’s dynamic motion with high fidelity, achieving low average errors in both 2D (RMSE: 0.172, MPJPE: 0.202) and 3D (RMSE: 0.145, MPJPE: 0.206) space. This accuracy is a direct result of the framework’s integrated technologies, which effectively remove jitter from initial predictions to restore physically plausible, smooth movements. Furthermore, the 3D scene reconstruction method, which aligns all individuals to a single floor plane, lent stability and realism to the generated scene.
The qualitative and quantitative evaluation results demonstrate that the proposed framework can successfully generate visually consistent multi-view data while effectively preserving the motion of the original video. This is the result of the organic interplay of its constituent technologies. First, a high-performance 3D human mesh recovery model provides a solid foundation for accurately estimating 3D poses from 2D images [27]. However, these initial predictions alone are insufficient to ensure temporal consistency. The multi-stage motion refinement process, consisting of tracking, interpolation, and a Kalman filter with dynamically adjusted parameters, played a decisive role in effectively removing jitter from the initial predictions and restoring physically plausible, smooth movements [28]. Furthermore, the 3D scene reconstruction method, which aligns all individuals to a single floor plane, integrates the individually recovered meshes into a unified 3D space, lending stability and realism to the generated scene.
The most significant contribution of this research is that it opens up a way for anyone to easily generate multi-view action data without expensive equipment or complex shooting environments. This can lower the barrier to entry for action recognition research, enabling more researchers to develop view-invariant models without being constrained by data scarcity issues. The generated data can be usefully applied to augment existing datasets or to test model performance in specific scenarios, such as drone or CCTV viewpoints.
The PSEW framework proposed in this study differs from existing multi-view data generation methods in its goals and approach. A prominent method involves using game engines [5,6], which, despite offering high flexibility, often struggle to perfectly replicate the subtle motions of real humans and can suffer from a “domain gap”. In contrast, PSEW reconstructs 3D motion directly from real videos, effectively preserving the naturalness and realism of the generated data. Other approaches, such as those based on generative models like NeRF [14], excel at static scenes but still face challenges in consistently handling dynamic human motion. PSEW places a greater emphasis on ensuring dynamic consistency across the entire sequence by explicitly using a 3D human model and a temporal correction process. The primary goal of this study was to provide a practical framework for accessible, low-cost data generation rather than to achieve state-of-the-art benchmark scores. Therefore, a direct quantitative performance comparison with previous studies, which often have different objectives, was not within the scope of this work.
Nevertheless, this framework has several limitations. First, the quality of the final output is heavily dependent on the performance of the initial 3D human mesh recovery model. If the model fails in person detection or pose estimation due to severe occlusion or atypical poses, this error is difficult to fully recover from, even after the subsequent refinement process. Second, the current framework only reconstructs dynamic individuals in 3D and does not reconstruct the static background. This results in the generated video having a monochromatic background instead of the original one, which limits the full realism of the scene. Third, it does not explicitly model interactions between individuals and surrounding objects (e.g., sitting on a chair or opening a door). Finally, the assumption of a single, flat floor makes accurate scene reconstruction challenging in complex topographical environments such as stairs or slopes.
Future research could be directed toward overcoming these limitations. For instance, integrating modern scene reconstruction techniques like Neural Radiance Fields (NeRF) or 3D Gaussian Splatting could enable the realistic reconstruction of static backgrounds [29]. Additionally, the framework could be extended to generate more complex and meaningful action data by adding functionality to detect and reconstruct key objects that individuals interact with.
6. Conclusions
In this study, we proposed PSEW, an automated framework for generating spatio-temporally consistent, multi-view 3D action data from a single video input, without the need for a costly multi-camera system or complex 3D modeling work. In the field of action recognition, the performance of deep learning models is heavily dependent on large and diverse training data, yet multi-view data has always been scarce due to the difficulties in its acquisition. This research aimed to provide a practical and scalable solution to address this data scarcity problem.
To achieve this, PSEW predicts 3D human parameters from each frame using state-of-the-art human mesh recovery techniques and ensures temporal continuity and naturalness through a multi-stage refinement process consisting of tracking, interpolation, and Kalman filter-based smoothing. Furthermore, it effectively reconstructs individually recovered subjects into a single, coherent 3D scene by estimating a stable, unified floor plane based on the entire video and aligning all individuals to it. Experimental results confirmed that PSEW can effectively generate visually natural multi-view videos while successfully preserving the motion of the original video. Specifically, the generated data demonstrated a high degree of motion fidelity, achieving low average errors in both 2D (RMSE: 0.172; MPJPE: 0.202) and 3D (RMSE: 0.145; MPJPE: 0.206) space.
This research presents the possibility of democratizing the way training data for action recognition models is generated, allowing more researchers to overcome the barriers of dataset construction and develop models robust to viewpoint changes. The data generated through PSEW is expected to play a crucial role in augmenting existing datasets and enhancing the generalization performance and robustness of deep learning models across various scenarios. Future work could proceed in the direction of creating more realistic and complex datasets by integrating background scene reconstruction techniques and modeling interactions with objects.
Author Contributions
Conceptualization, H.K. and Y.S.; methodology, H.K. and Y.S.; software, H.K.; validation, H.K. and Y.S.; formal analysis, H.K. and Y.S.; investigation, H.K.; resources, H.K.; data curation, H.K.; writing—original draft preparation, H.K.; writing—review and editing, H.K. and Y.S.; visualization, H.K.; supervision, Y.S.; project administration, Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.
Funding
This study was supported by the commercialization promotion agency for R&D outcomes grant funded by the Korea government (MSIT) (2710086167). This work was supported by the Commercialization Promotion Agency for R&D Outcomes (COMPA) grant funded by the Korea government (Ministry of Science and ICT) (RS-2025-02412990). This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2025-2020-0-01789) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The PSEW Framework in this study can be accessed at https://github.com/PLASS-Lab/PSEW (accessed on 18 August 2025). This research used datasets from “The Open AI Dataset Project (AI-Hub, S. Korea)”. All data information can be accessed through AI-Hub (www.aihub.or.kr).
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Al-Faris, M.; Chiverton, J.; Ndzi, D.; Ahmed, A.I. A Review on Computer Vision-Based Methods for Human Action Recognition. J. Imaging 2020, 6, 46. [Google Scholar] [CrossRef] [PubMed]
- Shah, A.; Jalal, A.; Gochoo, M.; Kim, K. Multi-View Action Recognition Using Contrastive Learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 4933–4942. [Google Scholar] [CrossRef]
- Dave, A.; Sharma, M.; Sikka, K.; Divakaran, A.; Chellappa, R. Multi-view Action Recognition using Cross-view Video Prediction. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 444–460. [Google Scholar] [CrossRef]
- Seo, A.; Jeon, H.; Son, Y. Robust prediction method for pedestrian trajectories in occluded video scenarios. Soft Comput. 2025, 29, 4449–4459. [Google Scholar] [CrossRef]
- Xiong, Z.; Li, C.; Liu, K.; Liao, H.; Hu, J.; Zhu, J.; Ning, S.; Qiu, L.; Wang, C.; Wang, S.; et al. MVHumanNet: A Large-Scale Dataset of Multi-View Daily Dressing Human Captures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 19801–19811. [Google Scholar] [CrossRef]
- Lu, K.; Wang, T.; Wang, G.; de Melo, C.M.; Fan, Z. Synthetic-to-Real Adaptation for Complex Action Recognition in Surveillance Applications. Proc. SPIE 2024, 13045, 130450L. [Google Scholar] [CrossRef]
- Hwang, H.; Jang, C.; Park, G.; Kim, I.J. ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications. arXiv 2020, arXiv:2010.08602. [Google Scholar] [CrossRef]
- Alabdulwahab, S.; Kim, Y.-T.; Seo, A.; Son, Y. Generating Synthetic Dataset for ML-Based IDS Using CTGAN and Feature Selection to Protect Smart IoT Environments. Appl. Sci. 2023, 13, 10951. [Google Scholar] [CrossRef]
- Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A Simple yet Effective Baseline for 3D Human Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2640–2649. [Google Scholar] [CrossRef]
- Moreno-Noguer, F. 3D Human Pose Estimation from a Single Image via Distance Matrix Regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 2849–2856. [Google Scholar] [CrossRef]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. 2015, 34, 248. [Google Scholar] [CrossRef]
- Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10975–10985. [Google Scholar] [CrossRef]
- Schönberger, J.L.; Frahm, J.M. Structure-from-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar] [CrossRef]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 405–421. [Google Scholar] [CrossRef]
- Anaconda, Inc. Conda Documentation. Available online: https://docs.conda.io/ (accessed on 10 September 2025).
- Python Software Foundation. Python, Version 3.9. Available online: https://www.python.org (accessed on 10 September 2025).
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
- NVIDIA. CUDA Toolkit Documentation. Available online: https://docs.nvidia.com/cuda/ (accessed on 10 September 2025).
- Dawson-Haggerty, M. Trimesh Library. Available online: https://github.com/mikedh/trimesh (accessed on 10 September 2025).
- Matl, M. Pyrender Library. Available online: https://github.com/mmatl/pyrender (accessed on 10 September 2025).
- FFmpeg. Available online: https://ffmpeg.org (accessed on 10 September 2025).
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Mairal, J.; LeCun, Y.; Joulin, A.; Bojanowski, P. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
- AI-Hub. Figure Skating Motion Data. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=71468 (accessed on 18 August 2025).
- AI-Hub. Cross Fit Motion Data. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=71422 (accessed on 18 August 2025).
- AI-Hub. Breaking Dance Motion Data. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=71315 (accessed on 18 August 2025).
- AI-Hub. Yoga Motion Data. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=71313 (accessed on 18 August 2025).
- Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-end Recovery of 3D Human Shape and Pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7346–7355. [Google Scholar] [CrossRef]
- Welch, G.; Bishop, G. An Introduction to the Kalman Filter; Technical Report TR 95-041; University of North Carolina at Chapel Hill: Chapel Hill, NC, USA, 2006. [Google Scholar]
- Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 117. [Google Scholar] [CrossRef]
Figure 1.
Visualization of the expressive 3D human parametric model, SMPL-X, utilized in this study. This model represents a realistic 3D human body through low-dimensional parameters (shape, pose, facial expression, and hand articulation), serving as a foundational element for the HMR task of recovering 3D meshes from single 2D images.
Figure 2.
The overall architecture of the proposed PSEW framework. The diagram illustrates the entire pipeline where a single monocular video input is sequentially processed through the stages of 3D human parameter prediction, temporal consistency correction, and 3D scene reconstruction and rendering to finally generate multi-view videos.
Figure 3.
Preprocessing for the input video. Each frame is resized, padded, and normalized to create the model input tensor, while the camera intrinsic matrix (K) is generated concurrently.
Figure 4.
The feature extraction process using a Vision Transformer. The input image is divided into patches, and each patch is converted into an embedding vector through a linear layer and the addition of positional encoding. This vector sequence is finally processed by the Transformer Encoder to produce a sequence of contextualized feature vectors.
Figure 5.
The human detection process. Each feature vector is converted into a probability score via an MLP (Multi-Layer Perceptron) and a sigmoid function to form a score map. A list of coordinate indices for detected humans is then output after applying Non-Maximal Suppression (NMS) and a detection threshold.
Figure 6.
The 3D human parameter regression process. Feature vectors (Queries) are extracted using the indices of detected persons and are then augmented with 3D spatial information derived from the camera matrix. These queries are refined by a Transformer Decoder that references the full image features (Context), and are finally passed through a regression head to be converted into a structured parameter dictionary.
Figure 7.
Visualization of the step-by-step prediction process of the proposed PSEW framework. (a) The input frame from the original video. (b) The initial mesh sequence over 10 consecutive frames, showing visible jitter in the raw predictions. (c) The refined mesh sequence for the same duration, illustrating a stabilized and coherent motion flow after temporal correction. (d) A focused comparison of the temporal smoothing effect, where overlaid trajectories show that the jitter from sequence (b) (left side) has been corrected into a smooth path in sequence (c) (right side). (e) A representative mesh from the refined sequence aligned to the estimated floor plane, shown with a coordinate visualization of the person’s position in 3D space (red dot: head, blue dot: pelvis). (f) The final rendered output from a novel viewpoint.
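The exact temporal correction used by PSEW is not reproduced here. Purely to illustrate the jitter-reduction effect shown in panels (b)–(d), the sketch below applies a simple centred moving average to per-frame parameters; in practice, rotations would be smoothed in a continuous representation such as the 6D one listed in Table 2.

```python
# Illustrative temporal smoothing over per-frame parameters (not PSEW's method).
import numpy as np

def smooth_sequence(params: np.ndarray, window: int = 5) -> np.ndarray:
    """params: (T, D) array of per-frame pose/translation parameters.
    Returns a sequence of the same shape with high-frequency jitter reduced."""
    T = params.shape[0]
    half = window // 2
    out = np.empty_like(params)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = params[lo:hi].mean(axis=0)   # centred average, shrunk at sequence ends
    return out
```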
Figure 8.
Visualization of multi-view rendering results generated from a single input video, shown at the same time point as the frame in Figure 3. The figure displays the reconstructed 3D scene from a matrix of virtual camera viewpoints. Each column represents a different horizontal angle, rotating clockwise in 45-degree increments from a top-down perspective. Each row corresponds to a different vertical angle, defined as the inclination from the estimated floor plane. This visualization demonstrates the framework’s ability to produce a consistent 3D representation of the action that can be rendered from diverse and controllable viewpoints.
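The virtual camera grid described in this caption can be generated with Trimesh and Pyrender, both listed in Table 1. The sketch below is an illustrative stand-in: the azimuth step, elevation angles, orbit radius, viewport size, and y-up convention are assumptions, and PSEW's actual camera placement may differ.

```python
# Multi-view rendering sketch with Trimesh/Pyrender (parameters are illustrative).
import numpy as np
import trimesh
import pyrender

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a camera-to-world pose with the camera looking along -z at target."""
    z = eye - target; z /= np.linalg.norm(z)
    x = np.cross(up, z); x /= np.linalg.norm(x)
    y = np.cross(z, x)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = x, y, z, eye
    return pose

def render_views(tm: trimesh.Trimesh, radius=3.0, elevations_deg=(15, 45, 75)):
    scene = pyrender.Scene(ambient_light=0.3 * np.ones(3))
    scene.add(pyrender.Mesh.from_trimesh(tm))
    scene.add(pyrender.DirectionalLight(intensity=3.0),
              pose=look_at(np.array([1.0, 2.0, 2.0]), np.zeros(3)))
    cam = pyrender.PerspectiveCamera(yfov=np.pi / 3)
    renderer = pyrender.OffscreenRenderer(640, 640)
    target = tm.centroid
    images = []
    for elev in np.deg2rad(elevations_deg):            # rows: vertical angles
        for azim in np.deg2rad(range(0, 360, 45)):     # columns: 45-degree steps
            eye = target + radius * np.array([np.cos(elev) * np.sin(azim),
                                              np.sin(elev),
                                              np.cos(elev) * np.cos(azim)])
            node = scene.add(cam, pose=look_at(eye, target))
            color, _ = renderer.render(scene)          # offscreen RGB render
            images.append(color)
            scene.remove_node(node)                    # move the camera to the next view
    renderer.delete()
    return images
```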
Figure 9.
Multi-view rendering results (rows) for various actions (columns). The top row shows the original input frames [23,24,25,26].
Figure 10.
Qualitative comparison between original video sequences and sequences generated by PSEW. In each panel, the horizontal axis represents the progression of time, while the rows show, from top to bottom, the original video, the sequence rendered from the front (0°), and the sequence rendered from the right side (90°). Each panel illustrates a different action sequence: (a) Squat, (b) Dance #1, (c) Dance #2, and (d) Yoga. The front-view renderings consistently reproduce the motion of the original sequence over time, while the side-view renderings provide a naturalistic novel viewpoint not present in the original input [23,24,25,26].
Figure 11.
Results of 2D Quantitative Evaluation. Graphs comparing the frame-wise 2D pose similarity between original and synthesized motions using (a) Root Mean Square Error (RMSE) and (b) Mean Per Joint Position Error (MPJPE). The legend indicates the average error value for each entire motion sequence.
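For reference, one common formulation of the two frame-wise metrics plotted in Figures 11 and 12 is given below, where $p_{t,j}$ and $\hat{p}_{t,j}$ denote the original and synthesized positions of joint $j$ in frame $t$ and $J$ is the number of joints; the exact joint set and any normalization used in the paper are not restated here.

\[
\mathrm{MPJPE}(t) = \frac{1}{J}\sum_{j=1}^{J} \left\lVert \hat{p}_{t,j} - p_{t,j} \right\rVert_2,
\qquad
\mathrm{RMSE}(t) = \sqrt{\frac{1}{J}\sum_{j=1}^{J} \left\lVert \hat{p}_{t,j} - p_{t,j} \right\rVert_2^{2}}
\]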
Figure 12.
Results of 3D Quantitative Evaluation. Graphs comparing the frame-wise 3D pose similarity between original and synthesized motions using (a) Root Mean Square Error (RMSE) and (b) Mean Per Joint Position Error (MPJPE). The legend indicates the average error value for each entire motion sequence.
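A direct NumPy implementation of the same per-frame metrics, applicable to both the 2D (pixel) and 3D (metric) evaluations, might look as follows; the joint layout and units are placeholders.

```python
# Frame-wise metric sketch matching the formulation above (D = 2 or 3).
import numpy as np

def frame_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (T, J, D) arrays of joint positions.
    Returns per-frame MPJPE and RMSE arrays of shape (T,)."""
    dist = np.linalg.norm(pred - gt, axis=-1)      # (T, J) per-joint Euclidean errors
    mpjpe = dist.mean(axis=1)                      # mean per joint position error
    rmse = np.sqrt((dist ** 2).mean(axis=1))       # root mean square error
    return mpjpe, rmse
```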
Table 1.
Specifications of the Implementation Environment.
| Category | Specification |
|---|---|
| Hardware | |
| Operating System | Ubuntu 24.04 LTS |
| CPU | Intel Core i9-13900KF |
| RAM | 64 GB |
| GPU | NVIDIA GeForce RTX 4090 |
| Software | |
| Virtual Environment | Conda 25.5.1 [15] |
| Key Libraries & Frameworks | Python 3.9 [16], PyTorch 2.5 [17], CUDA 12.4 [18], Trimesh [19], Pyrender [20] |
| Video Conversion | FFmpeg 4.3.2 [21] |
Table 2.
Composition and Roles of Regression Heads.
| Head | Role | Output |
|---|---|---|
| dec_pose | Predicts the pose of the full body, hands, and jaw using a stable 6D rotation representation instead of unstable 3D angles. | 6D rotation vector for each joint |
| dec_shape | Predicts the 10 principal component coefficients that determine the unique body shape (e.g., height, weight). | 10D shape parameter vector |
| dec_cam | Predicts values related to the 3D distance (depth) between the camera and the person. | 3D camera-related parameters |
| dec_expression | Predicts the 10 principal component coefficients that control facial expressions. | 10D expression parameter vector |
| mlp_offset | Predicts a fine-grained 2D displacement vector from the center of the detected patch to the actual person’s center to correct the initial coarse detection. | 2D positional offset vector (x, y) |
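For completeness, the 6D rotation vectors produced by the pose head are conventionally mapped back to rotation matrices by Gram–Schmidt orthogonalization of the two predicted 3D vectors, as in the sketch below (shown for reference; not copied from the PSEW implementation).

```python
# Standard conversion from a 6D rotation vector to a 3x3 rotation matrix.
import torch
import torch.nn.functional as F

def rot6d_to_matrix(x: torch.Tensor) -> torch.Tensor:
    """x: (..., 6) -> rotation matrices of shape (..., 3, 3)."""
    a1, a2 = x[..., 0:3], x[..., 3:6]
    b1 = F.normalize(a1, dim=-1)                                      # first column
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)  # orthogonalize
    b3 = torch.cross(b1, b2, dim=-1)                                  # right-handed frame
    return torch.stack([b1, b2, b3], dim=-1)                          # stack as columns
```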
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).