Open Access Article

Peer-Review Record

SSR-HMR: Skeleton-Aware Sparse Node-Based Real-Time Human Motion Reconstruction

Electronics 2025, 14(18), 3664; https://doi.org/10.3390/electronics14183664

by Linhai Li¹, Jiayi Lin² and Wenhui Zhang³,*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Yongkang Xing

Submission received: 17 August 2025 / Revised: 11 September 2025 / Accepted: 15 September 2025 / Published: 16 September 2025

(This article belongs to the Special Issue AI Models for Human-Centered Computer Vision and Signal Analysis)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper introduces SSR-HMR, a skeleton-aware sparse node-based framework for real-time human motion reconstruction.
It leverages a lightweight spatiotemporal graph convolutional (ST-GCN) module for full-body motion generation from sparse inputs, together with a torso pose refinement module to reduce drift in torso and head orientation. The authors also present a pose optimization module with hierarchical kinematic tree modeling and a multi-scale velocity loss to ensure natural transitions and accurate end-effector positioning.

The presented method achieves centimeter-level accuracy (MPJPE ≈ 10 mm, MPEEPE ≈ 5 mm) and runs at 267 FPS on CPU, significantly outperforming state-of-the-art methods such as SparsePoser and HMD-Poser on the AMASS and Xsens datasets. It is also robust to different body proportions without retraining.

Strengths of the paper
1. Novel Contributions
- Skeleton-aware integration into both data representation and network architecture.
- Dual quaternion–based pose representation, improving accuracy and continuity.
- Torso refinement and multi-scale velocity loss - innovative steps to reduce drift and jitter.
2. Performance
- Outstanding accuracy: lowest position and end-effector errors compared to baselines (Table 1, p. 13).
- Real-time inference: >260 FPS on CPU with a compact 12 MB model (Table 4, p. 18).
- Scalability to different body types without retraining (Figure 10, p. 18).

3. Evaluation
- Comprehensive: the authors use both synthetic datasets (AMASS) and real-world data (Xsens).
- Ablation study (Table 2, p. 15) clearly shows the contribution of each module.
- Visual comparisons (Figures 7–9) confirm qualitative superiority.
Weaknesses and Limitations
1. Intermediate joint errors. The authors acknowledge deviations in elbows and knees (Figure 7b, d) due to input sparsity. This could reduce realism in fine-grained motion tasks.
2. Scalability. The study is currently optimized only for six 6-DoF nodes. Flexibility for other tracker configurations (e.g., fewer or differently placed sensors) is not fully tested.
3. Comparisons. Some baselines (e.g., DragPoser) are not fully evaluated due to missing code/data. Reliance only on reported numbers limits the fairness of the study.
4. Application scope. Evaluation focuses on VR/AR and synthetic datasets. More application-driven validation (rehabilitation, sports analytics) would strengthen the claims.
Suggestions for Improvement
1. Extend the benchmarking. Include more diverse real-world datasets (sports, medical rehab scenarios). Consider evaluating under noisy or missing sensor conditions to assess robustness.
2. Generalization studies. Explore training with variable numbers/placements of sparse nodes. It could be important to test adaptability to lower-quality consumer hardware.
3. Perceptual user studies. Since VR immersion depends on subjective experience, add user evaluation of motion naturalness and sense of presence.
4. Future Work. Integrating physics-based priors (as in PIP/PNP) could further improve joint realism. Address elbow/knee uncertainties with hybrid IMU + tracker setups.
Recommendation
Overall recommendation: accept with minor revisions.
The paper provides clear methodological novelty, comprehensive experiments, and state-of-the-art results. Minor concerns (scope of evaluation, joint uncertainties, limited tracker configurations) should be addressed in revisions.

Author Response

We sincerely thank the reviewer for the thorough and insightful analysis of our work, as well as the constructive suggestions provided. Your detailed evaluation of both the strengths and limitations of SSR-HMR has given us a deeper understanding of how our method is perceived and its potential areas for improvement. We greatly appreciate your comments, which have guided us in clarifying, refining, and supplementing our manuscript to better highlight the contributions, address limitations, and present the experimental results more clearly.

Comparisons. Some baselines (e.g., DragPoser) are not fully evaluated due to missing code/data. Reliance only on reported numbers limits the fairness of the study.

We thank the reviewer for highlighting the issue regarding baseline comparisons. In the revised manuscript, we have restructured this section to more clearly present the comparability of different methods. Although the AMASS-based evaluation for certain baselines, such as DragPoser, is partially missing, we ensured a fair comparison by using the same datasets and following the training and evaluation configurations reported in the corresponding papers. Table~\ref{tab:model_comparison} has also been updated to include DragPoser alongside other baselines, providing a reference for its reported performance. These revisions aim to present the results in a clearer and more interpretable way, allowing readers to assess relative performance despite the dataset-related limitations.
Extend the benchmarking. Include more diverse real-world datasets (sports, medical rehab scenarios). Consider evaluating under noisy or missing sensor conditions to assess robustness.

We appreciate the reviewer’s suggestion to extend the benchmarking. In Section 4.1, we have clarified the evaluation datasets. DanceDB is used for training, and HUMAN4D and SOMA for evaluation. These datasets collectively include ten subjects performing over 200 actions, covering daily activities, sports training (e.g., gymnastics, weightlifting), dance movements, and interactive actions, totaling more than 300,000 frames. While the datasets do not include dedicated medical rehabilitation scenarios, the walking and exercise sequences provide partial coverage of motions relevant to such applications. We apologize that specialized medical rehabilitation data are currently unavailable and plan to incorporate such datasets in future work to further extend benchmarking and applicability.

We thank the reviewer for the suggestion. In the revised conclusion, we explicitly discuss robustness under noisy or missing sensor conditions. Preliminary tests with missing nodes (without re-training) showed noticeable drops in reconstruction accuracy, highlighting the limitations of the current six-node setup. We also clarify that future work will focus on improving resilience through multimodal fusion (e.g., integrating IMUs) and incorporating physics-based priors to maintain accurate and smooth motion reconstruction even when sensor inputs are sparse or noisy.
Generalization studies. Explore training with variable numbers/placements of sparse nodes. It could be important to test adaptability to lower-quality consumer hardware.

In response to the reviewer’s comment regarding generalization, we conducted preliminary tests under node-missing conditions and have updated the Conclusion to reflect these observations. The revised text highlights current limitations with variable node numbers and placements, discusses robustness under reduced input scenarios, and outlines future work on improving scalability and adaptability to different hardware setups. This modification ensures a clearer and fairer presentation of the method’s generalization potential.
Perceptual user studies. Since VR immersion depends on subjective experience, add user evaluation of motion naturalness and sense of presence.

In response to the reviewer’s comment on generalization, we conducted preliminary tests under node-missing conditions and updated the Conclusion accordingly. While SSR-HMR currently shows some performance degradation with fewer or differently placed nodes, it maintains high accuracy and real-time performance under the standard six-node setup. The revised text emphasizes both these strengths and current limitations, and outlines future work on improving scalability, robustness to variable node configurations, and adaptability to lower-quality or consumer-grade hardware.
Future Work. Integrating physics-based priors (as in PIP/PNP) could further improve joint realism. Address elbow/knee uncertainties with hybrid IMU + tracker setups.

In response to the reviewer’s comment regarding future work, the conclusion has been updated to explicitly highlight strategies for further improving joint realism and addressing uncertainties in elbows and knees. Specifically, the revised text discusses incorporating physics-based priors, as used in PIP/PNP, to better constrain joint motion and enhance realism, as well as leveraging hybrid setups that combine IMUs with sparse trackers to mitigate ambiguities in challenging joints. These additions clarify how SSR-HMR can be extended to handle complex motions and sparse inputs more robustly in future research.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents an integrated lightweight spatio-temporal graph convolution module to enhance the position accuracy of an end-effector. While the work is interesting and original, the authors need to address the following comments before it can be considered for publication in a reputed journal like 'Electronics':
- Critically analyse and comment on the computational complexity of the proposed approach.
- The claim of "real-time efficiency" deserves further investigation. Include quantitative experimental results to evidence this claim.
- What about the scalability of the proposed approach? The work considers a 6 DOF robotic arm. What if the DOF is less or more than 6?
- Update the literature review by including other relevant works such as 'Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective'.
- It would be useful to briefly mention details on the AMASS dataset.
- Include a paper outline at the end of Section 1.
- The first sentence of the Introduction, highlighting the pivotal role of HCI, could benefit from a work from the literature such as 'Design of a wearable direct-driven optimized hand exoskeleton device', 4th International Conference on Advances in Computer-Human Interactions.
- Please define all the abbreviations at their first occurrence. For example, IMU should be defined on Line 29.
- Please thoroughly proofread the paper for typos and other linguistic improvements. Also, search for and fix every '?', for example the [?] on Line 399.
- Also, include high-resolution images, particularly in the results section.

Author Response

We sincerely thank the reviewer for the thorough and constructive feedback. The comments provided were highly insightful and have helped improve the clarity, completeness, and technical rigor of the manuscript. In particular, the suggestions regarding computational complexity, scalability, literature coverage, dataset description, paper organization, and presentation details have been carefully addressed in the revised version.

Critically analyse and comment on computational complexity of the proposed approach. The claim on "real-time efficiency" deserves further investigation. Include quantitative experimental results to evidence this claim.

The manuscript has been updated with a dedicated subsection on inference efficiency, including quantitative results that analyze computational complexity and real-time performance. Table~\ref{tab:model_comparison} reports model size, number of parameters, and frame rates (FPS) on both CPU and GPU for the proposed method and baseline approaches. The results demonstrate that the SSR-HMR model is highly compact (12.43 MB in PyTorch, 7.31 MB in ONNX) and achieves real-time performance of 267 FPS on CPU and 121 FPS on GPU. The discussion also explains performance differences between CPU and GPU inference, attributing them to ONNX input data transfer overhead. These additions provide quantitative evidence supporting the claim of real-time efficiency while highlighting the method’s suitability for resource-constrained environments.
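
As an illustration of how a frame-rate figure of this kind can be obtained, the following minimal Python sketch times repeated calls to an inference function and reports frames per second. The names `measure_fps`, `infer`, and `dummy_infer`, as well as the warm-up and iteration counts, are assumptions made for this example; it is not the benchmarking code used in the manuscript.

```python
import time


def measure_fps(infer, sample, warmup=10, iters=200):
    """Return frames per second for a callable that processes one frame per call.

    `infer` and `sample` are hypothetical stand-ins for a model's forward
    pass and a single input frame.
    """
    # Warm-up calls are excluded from timing so one-off setup costs do not skew the result.
    for _ in range(warmup):
        infer(sample)

    start = time.perf_counter()
    for _ in range(iters):
        infer(sample)
    elapsed = time.perf_counter() - start

    return iters / elapsed


def dummy_infer(frame):
    # Placeholder workload standing in for a real model.
    return sum(v * v for v in frame)


fps = measure_fps(dummy_infer, [0.1] * 1000)
print(f"{fps:.1f} FPS")
```

Timing a whole batch of iterations rather than a single call reduces the influence of timer resolution. For GPU inference, host-to-device transfers and asynchronous execution would also need to be included in the timed region for the number to be comparable with a CPU measurement.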
What about the scalability of the proposed approach? The work considers a 6 DOF robotic arm. What if the DOF is less or more than 6?

The conclusion has been updated to explicitly discuss the scalability of the proposed approach. SSR-HMR is currently evaluated with a fixed six 6-DoF node configuration, which enables a lightweight design and high accuracy. We acknowledge that performance may degrade when fewer nodes are available, as preliminary experiments under node-missing conditions without re-training showed noticeable drops in reconstruction quality. Conversely, the framework is conceptually extensible to additional nodes or different sensor layouts; future work will focus on improving scalability to variable numbers and placements of nodes. Potential strategies include retraining or fine-tuning the model for different configurations, integrating multimodal inputs (e.g., IMUs), and leveraging physics-based or kinematic priors to maintain accuracy with sparse or heterogeneous inputs. These considerations ensure that SSR-HMR can be adapted to scenarios with both lower and higher DOF sensor setups beyond the currently tested six-node configuration.
Update the literature review by including other relevant works such as 'Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective'

The literature review has been updated to include the relevant work by Yao et al.~\cite{graph}, Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective. This study introduces a Body Pose Graph (BPG) to explicitly model spatial and temporal relationships between joints from sparse sensors, highlighting the benefits of graph-based representations for capturing joint dependencies. The discussion has been integrated into the subsection on sparse 6-DoF tracker-based methods, providing context for the design of the spatiotemporal graph module in SSR-HMR.
It would be useful to briefly mention details on the AMASS dataset.

Details on the AMASS dataset have been added in Section 4.1. The updated text briefly describes AMASS as a large-scale, SMPL-compatible motion capture dataset integrating multiple publicly available sources, providing diverse subjects, body shapes, and motion styles. In addition, information about the evaluation protocol has been included, specifying the use of DanceDB for training and HUMAN4D and SOMA for evaluation, covering over 200 actions and more than 300,000 frames in total. These additions clarify the dataset’s scope and the experimental setup used to benchmark SSR-HMR.
Include paper outline at the end of Section 1.

We understand that the reviewer’s intent is to provide readers with a concise overview of the paper’s structure at the end of the Introduction. In response, we have added the following brief outline:

“The remainder of this paper is organized as follows. Section~\ref{Section:Related_Work} reviews related work on sparse-node-based and learning-based human motion reconstruction methods. Section~\ref{sec:method} details the proposed SSR-HMR framework, including the spatiotemporal graph convolution module, torso pose refinement, and kinematic tree-based pose optimization. Section~\ref{sec:experiments} presents quantitative and qualitative experiments, including comparisons with existing methods, ablation studies, inference efficiency, and evaluations across different user body sizes. Finally, Section~\ref{sec:conclusion} concludes the paper and discusses limitations and future directions.”
The first sentence of the Introduction, highlighting the pivotal role of HCI, could benefit from a work from the literature such as 'Design of a wearable direct-driven optimized hand exoskeleton device', 4th International Conference on Advances in Computer-Human Interactions.

We thank the reviewer for the suggestion to reinforce the motivation of our Introduction. We have cited Makled et al.~\cite{makled_evaluating_2025}, a recent 2025 study published in The Visual Computer, which systematically evaluates behavioral realism in AR and VR by comparing single-point inverse kinematics (IK) with full-body motion capture. Their results show that full-body motion capture significantly improves perceived realism, highlighting the critical role of accurate full-body motion reconstruction for immersive and natural user interactions. This supports our motivation for studying real-time full-body motion capture in VR/AR applications.
Please define all the abbreviations at their first occurrence. For example, IMU should be defined on Line 29.

The manuscript has been carefully checked, and all abbreviations are now defined at their first occurrence. In particular, inertial measurement unit (IMU), head-mounted display (HMD), and six-degree-of-freedom (6DoF) are introduced with their full forms when first mentioned, and the abbreviations are used consistently thereafter.
Please thoroughly proofread the paper for typos and other linguistic improvements. Also, search and fix ?. For example [?] on Line 399.

We thank the reviewer for pointing out the typographical and reference issues. The “[?]” on Line 399 was caused by an incorrect BibTeX entry for an online reference type, which has now been corrected. We have thoroughly proofread the entire manuscript to fix typos, inconsistencies, and other minor linguistic issues, and verified that all citations are properly resolved.
Also, include high-resolution images, particularly in the results section.

We have updated the figures to ensure high resolution, providing either widths exceeding 2700 pixels or vector-format PDFs. If any details still appear unclear, this may be due to the complexity of the visual elements or the relatively small size of certain annotations, rather than the intrinsic resolution of the images. We believe the current versions support zooming and high-quality printing.

Reviewer 3 Report

Comments and Suggestions for Authors

Nice idea: a lightweight spatiotemporal GCN + skeleton-aware IK refinement to reconstruct full-body motion from six 6-DoF nodes. Results look strong on AMASS and Xsens; claimed real-time CPU throughput is impressive. However, it requires some revision before publication:

1. Clarify and standardize metrics, units, and table entries.
Table 1 / Table 3 mix units and have suspicious entries (e.g., Root = 0.00 for SparsePoser in Table 1, and abstract states “mean per-joint position error of 10 mm” while Table 1 reports 1.06 cm). Please (a) define every metric formally (MPJPE, MPJRE, MPEEPE, jitter — with units), (b) ensure units are consistent across abstract, tables and text (mm vs cm), and (c) explain how “Root = 0.00” occurs (is root anchored for some methods?). Right now readers will be confused about whether numbers are apples-to-apples.

2. Minor errors: Final IK is cited as “Final IK [? ]” — missing reference. Please add the correct citation and explain how it was run for the experiments.

3. For DragPoser and PNP you state you relied on reported numbers because processing details weren’t released. That’s acceptable only if you clearly mark which numbers were reimplemented vs taken from papers, and if you provide error bars or statistical tests.

4. Right now, your claim that “SSR-HMR significantly outperforms all existing methods” is too strong without clarifying which results were reimplemented and which came from the original papers. Add paired significance tests (or at least confidence intervals) when claiming state-of-the-art.

5. Ablation study: add error bars, training curves, and failure examples.

6. Missing ethics/usability discussion.
The paper targets VR/AR/Metaverse applications; please add a short discussion on privacy, possible failure modes in safety-critical settings (rehab / medical), and limits of applicability (e.g., extreme clothing, occlusions, very fast motions). This is important for a paper that claims deployment readiness.

Author Response

We sincerely thank the reviewer for the careful and insightful review of our manuscript. The detailed comments and suggestions are extremely helpful for improving the clarity, rigor, and completeness of our work. Below, we provide point-by-point responses to each comment and describe the corresponding revisions made in the manuscript.

Reviewer Comment 1:

Clarify and standardize metrics, units, and table entries. Table 1 / Table 3 mix units and have suspicious entries (e.g., Root = 0.00 for SparsePoser in Table 1, and abstract states “mean per-joint position error of 10 mm” while Table 1 reports 1.06 cm). Please (a) define every metric formally (MPJPE, MPJRE, MPEEPE, jitter — with units), (b) ensure units are consistent across abstract, tables and text (mm vs cm), and (c) explain how “Root = 0.00” occurs (is root anchored for some methods?). Right now readers will be confused about whether numbers are apples-to-apples.

Author Response:

We appreciate the reviewer’s observation regarding the inconsistency of metrics and units across the abstract, tables, and text. To address this:

Abstract revision: The previous phrasing “mean per-joint position error of 10 mm” has been updated to: “Experiments show that SSR-HMR achieves high-accuracy full-body motion reconstruction, with mean joint and end-effector position errors of 1.06 cm and 0.52 cm, respectively, while running at 267 FPS on a CPU.” This ensures precise numerical values and consistent units with the tables.
Evaluation Metrics (Section 4.1): Formal definitions of all evaluation metrics (MPJPE, MPJRE, MPEEPE, jitter, etc.) are provided, and units are indicated in the table headers (centimeters for positional errors, degrees for rotational error, cm/s for velocity error, etc.).
Table standardization: All tables have been updated so that metric names match the definitions in Section 4.1, and units are consistently included in the headers.
Root column removal: The “Root” column has been removed from the tables. SparsePoser, DragPoser, and our method all adopt root-centered alignment; reporting “Root = 0.00” previously caused unfair comparisons and potential confusion.

These revisions resolve the reviewer’s concerns and improve the clarity and consistency of the manuscript.
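
To make the unit and root-alignment points concrete, the sketch below computes a mean per-joint position error in centimeters after root-centered alignment. It is an illustration under stated assumptions, not the evaluation code of the manuscript: joint positions are assumed to be given in meters, the root is assumed to be joint 0, and the function names and example poses are hypothetical.

```python
import math


def root_align(pose, root_index=0):
    """Translate a pose so that its root joint sits at the origin."""
    rx, ry, rz = pose[root_index]
    return [(x - rx, y - ry, z - rz) for (x, y, z) in pose]


def mpjpe_cm(pred, target, joints=None, root_index=0):
    """Mean Euclidean joint error in centimeters after root alignment.

    `pred` and `target` are lists of (x, y, z) positions in meters (assumed).
    `joints` optionally restricts the mean to a subset of joint indices,
    e.g. end effectors for an MPEEPE-style metric.
    """
    pred = root_align(pred, root_index)
    target = root_align(target, root_index)
    idx = range(len(pred)) if joints is None else joints
    dists = [math.dist(pred[j], target[j]) for j in idx]
    return 100.0 * sum(dists) / len(dists)  # meters -> centimeters


# Hypothetical three-joint poses; the target is offset by 1 m along x.
pred = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.3, 0.5, 0.0)]
target = [(1.0, 0.0, 0.0), (1.0, 0.5, 0.01), (1.3, 0.52, 0.0)]

print(mpjpe_cm(pred, target))              # mean error over all joints, in cm
print(mpjpe_cm(pred, target, joints=[0]))  # root-only error after alignment
```

Because both poses are translated so that the root lies at the origin, the root-only error is zero by construction for any method evaluated this way, which is why a "Root" column carries no comparative information under root-centered alignment.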

Reviewer Comment 2:

Minor errors: Final IK is cited as “Final IK [? ]” — missing reference. Please add the correct citation and explain how it was run for the experiments.

Response: Thank you for pointing out the missing reference. This issue was caused by a BibTeX incompatibility with online entries and has now been corrected.

In Section 4.1.1, a description of the experimental setup for Final IK has been added:

"Notably, Final IK is a method based on inverse kinematics (IK), specifically designed for animating full-body VR characters with sparse 6DoF trackers, while the other methods are data-driven deep learning algorithms representing cutting-edge advancements in their respective domains. In our experiments, the Final IK plugin for Unity was employed to estimate full-body motion using 6DoF positional inputs from the headset, hand controllers, hip trackers, and foot trackers. The plugin's inverse kinematics solver generated real-time joint rotations and global positions, the outputs of which were used for subsequent evaluation."

This addition clarifies the setup and ensures proper citation of the method used.

Reviewer Comment 3:

For DragPoser and PNP you state you relied on reported numbers because processing details weren’t released. That’s acceptable only if you clearly mark which numbers were reimplemented vs taken from papers, and if you provide error bars or statistical tests.

Response: Thank you for this suggestion. The manuscript has been updated to clarify which results were taken directly from the original papers and which were re-implemented:

Results for PIP, SparsePoser, and DragPoser are taken from the respective original publications.
Experiments for Final IK, HMD-Poser, and the re-implemented strong baseline SparsePoser* were performed as part of this work.

Section 4.1 has been revised to further describe the evaluation datasets and their reliability:

“In line with SparsePoser and DragPoser~\cite{DragPoser}, which reconstruct full-body motion from sparse input using six 6DoF devices, DanceDB~\cite{DanceDB:2019} is used for training, while HUMAN4D~\cite{HUMAN4D:2020} and SOMA~\cite{SOMA:2021} are used for evaluation. Together, these two datasets include ten subjects performing over 200 actions, covering daily activities, sports training (e.g., gymnastics, weightlifting), dance movements, and interactive actions, totaling more than 300,000 frames.”

Additionally, the tables now report not only mean values but also standard deviations to reflect variability and statistical reliability, ensuring readers can clearly distinguish re-implemented results from those taken from prior work.

Reviewer Comment 4:

Right now, your claim that “SSR-HMR significantly outperforms all existing methods” is too strong without clarifying which results were reimplemented and which came from the original papers. Add paired significance tests (or at least confidence intervals) when claiming state-of-the-art.

Response: The manuscript has been revised to address this concern. First, the language describing SSR-HMR’s performance has been moderated to avoid overstatements. Quantitative results now indicate that SSR-HMR demonstrates competitive or superior performance in key metrics, rather than claiming universal superiority.

Second, Section 4.1 has been updated to clarify the sources of comparison data: PIP, SparsePoser, and DragPoser metrics are taken from the original papers, while Final IK, HMD-Poser, and the reimplemented SparsePoser* were evaluated directly under the same experimental settings. The evaluation datasets, including DanceDB, HUMAN4D, and SOMA, are explicitly described to support the reliability of these measurements.

Finally, Table 1 now reports both the mean and standard deviation for each metric, providing additional insight into result stability and reproducibility. These revisions ensure a clearer, more rigorous presentation of comparative performance and address the reviewer’s concerns regarding the credibility and clarity of the quantitative analysis.
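
For readers who want to see what the requested statistics involve, the following sketch computes a mean and standard deviation for a set of per-sequence errors and an approximate 95% confidence interval for the paired difference between two methods. The error values are invented purely for illustration and are not results from the manuscript; the function names are hypothetical, and the interval uses a normal approximation (z = 1.96), which is only rough for small samples.

```python
import math
import statistics


def mean_std(values):
    """Return the sample mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def paired_diff_ci(errors_a, errors_b, z=1.96):
    """Approximate confidence interval for the mean of paired differences (a - b).

    Both lists must hold errors measured on the same sequences in the same order.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.mean(diffs)
    half_width = z * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff - half_width, mean_diff + half_width


# Illustrative per-sequence errors in cm (made-up numbers).
baseline = [1.2, 1.4, 1.1, 1.3]
proposed = [1.0, 1.1, 1.0, 1.1]

print(mean_std(proposed))
low, high = paired_diff_ci(baseline, proposed)
print(low, high)
```

An interval for the paired difference that excludes zero suggests a consistent gap between the two methods on the same test sequences. Such a paired analysis is only possible when both methods are run on identical data, so it cannot be applied to baselines whose numbers are taken from the original publications.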

Reviewer comment 5:

Ablation study: add error bars, training curves, and failure examples.

Response: We understand the reviewer’s concern regarding the need for more comprehensive evidence to support the ablation study. To address this, Table~\ref{tab:ablation} now reports both the mean and standard deviation for all metrics, providing a clear measure of variability and stability across the ablation variants. The evaluations use the same datasets as in Section~\ref{sec:comparison}, and Section~4.1 further details the diversity, size, and reliability of the datasets, ensuring the reported results are representative.

Additionally, the convergence behavior of each ablation variant is illustrated in Figure 9, measured by the sum of MPJPE and MPEEPE. Panel (a) shows the full training process (Epoch 0–800), while panel (b) focuses on the convergence stage (Epoch 500–800), with the learning rate at Epoch 800 indicated. The curves demonstrate stable training and convergence trends for all variants, with noticeable drops corresponding to learning rate reductions using the ReduceLROnPlateau scheduler.
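
The step-like drops mentioned above follow from how a plateau-based schedule behaves. The sketch below is a simplified, pure-Python re-implementation of the core rule, written only to illustrate the idea: it omits the threshold, cooldown, and minimum learning rate options of PyTorch's `ReduceLROnPlateau`, and the class name, `factor`, `patience`, and loss values are assumptions rather than settings reported in the manuscript.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` when the metric stops improving.

    Simplified illustration of a reduce-on-plateau rule; not the PyTorch class.
    """

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric < self.best:
            # Improvement: remember it and reset the counter.
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                # No improvement for more than `patience` epochs: lower the rate.
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Hypothetical validation losses that stall after the second epoch.
scheduler = PlateauScheduler(lr=1e-3, factor=0.5, patience=2)
lrs = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9]]
print(lrs)
```

Each reduction typically lets the optimizer settle into a lower-error region, which appears as a visible drop in the training curve shortly after the learning rate changes.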

Finally, representative failure cases are included in the qualitative results section, allowing for intuitive visualization of limitations in complex motions without disrupting the quantitative ablation study. We believe these revisions address the reviewer’s concerns by providing thorough quantitative and qualitative evidence for the impact of each component.

Reviewer comment 6:

Missing ethics/usability discussion. The paper targets VR/AR/Metaverse applications; please add a short discussion on privacy, possible failure modes in safety-critical settings (rehab / medical), and limits of applicability (e.g., extreme clothing, occlusions, very fast motions). This is important for a paper that claims deployment readiness.

Response: We appreciate the reviewer’s suggestion regarding ethics and usability considerations. In the revised manuscript, the Conclusion section now explicitly discusses privacy, safety, and usability aspects, as well as limitations of applicability. Specifically, it is noted that noisy or missing signals may lead to unreliable feedback in safety-critical settings such as rehabilitation or medical applications, and that extreme clothing, severe occlusions, or highly dynamic motions remain outside the tested domain. These statements provide a concise discussion of potential failure modes, privacy concerns, and deployment considerations, addressing the reviewer’s request for responsible evaluation of the method’s practical use.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed all the comments suggested. The revised version of the paper has been improved significantly and is recommended for acceptance.

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript is suitable for publication, and all issues have been resolved.
