LGSIK-Poser: Skeleton-Aware Full-Body Motion Reconstruction from Sparse Inputs
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
General comments:
This work is about LGSIK-Poser, which introduces a lightweight and modular framework for full-body human motion reconstruction using sparse VR/AR inputs, integrating temporally grouped motion modeling, graph-based spatial reasoning, and inverse kinematics refinement for high-fidelity end-effector localization. It supports flexible sensor setups and uses anthropometric priors for personalized shape estimation, achieving state-of-the-art accuracy and efficiency with significant gains over existing methods like HMD-Poser. This work is well-written, but I noted some issues as indicated below:
Specific comments:
- Although Figure 1 presents an overview of the work, it would be helpful if the authors added 3D models to help readers understand how the information is transferred and processed through the workflow.
- Figure 2 requires more labels to help identify each component that is to be discussed.
- On the analysis of the inference time and delay, I think this is highly dependent on the system/GPU model used in each test. How did the authors ensure that the analysis in Table 5 is comparable? I suggest adding the type of system/GPU used in each work for a fairer comparison.
- Also, the authors should discuss (in the introduction or conclusion) how this approach can be employed in various applications such as sports (doi.org/10.3390/s23249759) and robotics (doi.org/10.1007/978-3-031-22216-0_3). I suggest adding more works to enrich the discussion.
Author Response
Comment:
Although figure one presents an overview of the work, it will be helpful if the authors add 3D models to help readers understand how the information is transferred and processed through the workflow.
Response:
Thank you for the insightful suggestion. To enhance clarity and illustrate how information flows through the pipeline, we have revised Figure 1 by adding a bottom row of 3D visualizations. These visual elements represent key stages of the model, including the sparse input, intermediate outputs from each module, and the final reconstructed mesh. We have also connected each 3D visualization to its corresponding module in the pipeline using dashed lines. Furthermore, we updated the corresponding paragraph in Section 3 to explicitly describe the purpose of these visualizations and how they assist in understanding the transformation of motion information through the framework.
Comment:
Figure 2 requires more labels to help identify each component that is to be discussed.
Response:
We thank the reviewer for this suggestion. We have updated Figure 2 by explicitly labeling all semantic groups, including the newly added shape group, to improve clarity. The figure serves as a visual interpretation of the anatomical adjacency mask M. To maintain visual clarity, edges from the shape group to other nodes are omitted in the figure but are implemented in the model. We have revised both the caption and the main text accordingly (see Section 3.1 and Figure 2).
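For illustration, the masking scheme can be sketched in a few lines. The group names, edge list, and feature sizes below are assumptions for the sketch, not the paper's exact topology; the point is that a binary mask M restricts message passing to anatomically adjacent groups, while the shape group influences others unidirectionally.

```python
import numpy as np

# Hypothetical semantic groups (names are illustrative assumptions).
groups = ["torso", "head", "l_arm", "r_arm", "l_leg", "r_leg",
          "upper", "lower", "shape"]
idx = {g: i for i, g in enumerate(groups)}

M = np.eye(len(groups))  # self-message passing for every group
edges = [("torso", "head"), ("torso", "l_arm"), ("torso", "r_arm"),
         ("torso", "l_leg"), ("torso", "r_leg"),
         ("upper", "head"), ("upper", "l_arm"), ("upper", "r_arm"),
         ("lower", "l_leg"), ("lower", "r_leg")]
for a, b in edges:
    M[idx[a], idx[b]] = M[idx[b], idx[a]] = 1.0

# Unidirectional influence: every group may read shape features,
# but the shape node is never updated from other groups.
M[:, idx["shape"]] = 1.0
M[idx["shape"], :] = 0.0
M[idx["shape"], idx["shape"]] = 1.0

# One masked graph-convolution step: features mix only along allowed edges.
F = np.random.randn(len(groups), 16)   # per-group feature vectors
W = np.random.randn(16, 16) * 0.1
F_next = np.maximum(0.0, (M @ F) @ W)  # ReLU(M · F · W)
```

A learned per-edge weight matrix could replace the binary entries of M without changing the structural constraint.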
Comment:
On the analysis of inference time and delay, I think this is highly dependent on the system/GPU model used in each test. How did the authors ensure that the analysis in Table 5 is comparable? Suggest adding the type of system/GPU used in each work for a fairer comparison.
Response:
Thank you for this valuable comment. To ensure a fair and consistent comparison, all evaluated methods—including AvatarPoser, AvatarJLM, HMD-Poser, and our proposed method—were implemented and tested within the same PyTorch framework on identical hardware, specifically an NVIDIA RTX 3090 GPU. We standardized the input data rate (60 FPS) and used a uniform sliding window of 40 frames during inference for all models. This setup eliminates variability due to differing GPU architectures or software environments. We have explicitly added this information to Section 4.3 to clarify the testing environment and reinforce the fairness of the runtime comparisons presented in Table 5.
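For readers interested in the measurement protocol, a minimal latency-benchmark sketch follows. The stand-in model here is a plain matrix multiply, and the warm-up/run counts are arbitrary choices; the actual experiments time PyTorch networks on the GPU, where one would additionally synchronize the device before reading each timestamp.

```python
import time
import numpy as np

def benchmark(model_fn, window, warmup=10, runs=100):
    """Median per-window latency in milliseconds under a fixed protocol,
    so different models are compared on identical data and hardware."""
    for _ in range(warmup):          # warm-up excludes one-off setup costs
        model_fn(window)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model_fn(window)
        times.append((time.perf_counter() - t0) * 1e3)
    return float(np.median(times))

# Hypothetical stand-in for a reconstruction model: a single matmul over
# a 40-frame sliding window of per-frame features (dimensions assumed).
window = np.random.randn(40, 66)
weights = np.random.randn(66, 66)
latency_ms = benchmark(lambda x: x @ weights, window)
```

Reporting the median rather than the mean makes the figure robust to occasional scheduler hiccups during timing runs.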
Comment:
Also, the authors should discuss (in the introduction or conclusion) how this approach can also be employed in various applications such as sports (doi.org/10.3390/s23249759) and robotics (doi.org/10.1007/978-3-031-22216-0_3). Suggest adding more works to enrich the discussion.
Response:
We thank the reviewer for the valuable suggestion to discuss additional application areas such as sports and robotics. In the revised manuscript, we have broadened the introduction to emphasize that full-body human motion reconstruction and editing are of significant interest across a wide range of domains, including entertainment, sports performance analysis, training, education, and collaborative work. We note that our consumer-level approach, which leverages sparse sensor configurations, not only supports VR/AR applications but also holds promise for various other fields that require accurate and accessible motion capture solutions. This expansion enriches the context of our work and highlights its potential impact beyond immersive experiences.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper introduces a lightweight, real-time framework for full-body motion reconstruction from minimal and varied sensor inputs, tailored for VR/AR environments.
The paper introduces an innovative idea and follows a well-structured methodology.
To improve clarity, the abstract could focus more on its key contributions and performance metrics.
The introduction clearly defines the problem of full-body motion reconstruction and effectively describes the motivation for the suggested solution. An effective comparison of existing approaches is provided. However, more emphasis could be given to the novelty of LGSIK-Poser.
The related work section categorizes the methods by sensor type and highlights key trade-offs. I would suggest emphasizing more on how LGSIK-Poser differs in structure and efficiency.
The methodology is structured and effectively supported by Figures 1 and 2, which improve understanding of the system’s modular flow. The grouping of inputs and the use of anatomical partitions allow for localized motion modeling and efficient spatial reasoning. The inclusion of shape priors and the unidirectional influence of shape features are well-justified and suitable for real-world deployment. However, the connection between group-level and partition-level features should be explained more explicitly in the text. The IK refinement process and the use of raw inputs as skip connections would also benefit from clearer visual or textual clarification. Lastly, the loss weighting strategy and the per-frame versus per-sequence behavior of the shape fitting module should be discussed for full transparency and reproducibility.
In the following section, you could better explain why errors decrease over time in the results and why certain settings in their calculations were chosen.
The discussion could better contextualize trade-offs and explicitly link the graph module’s "interpretability" to concrete design choices. The conclusion should more sharply contrast its configurable sensor support with rigid baselines. Future work directions should prioritize not just hardware expansions but also algorithmic solutions to address lower-limb ambiguities under sparse data.
Author Response
To improve clarity, the abstract could focus more on its key contributions and performance metrics.
Thank you for the valuable suggestion. The abstract has been revised to clearly emphasize the key contributions of LGSIK-Poser, including its unified framework supporting diverse sensor inputs, anatomically informed modeling, and region-specific refinement. Performance metrics such as up to 48% improvement in hand localization, 60% model size reduction, and 22% latency decrease compared to the baseline are explicitly highlighted to showcase the practical impact. This revision aims to enhance clarity and better communicate the novelty and effectiveness of the proposed method.
The introduction clearly defines the problem of full-body motion reconstruction and effectively describes the motivation for the suggested solution. An effective comparison of existing approaches is provided. However, more emphasis could be given to the novelty of LGSIK-Poser.
Thank you for your valuable feedback. In response, the Introduction has been revised to better highlight the novelty of LGSIK-Poser. The updated text clearly describes how the proposed framework extends the scalable design of HMD-Poser by integrating grouped temporal modeling via LSTM, masked graph convolution for anatomically informed spatial reasoning, and region-specific inverse kinematics refinement to enhance end-effector localization. Furthermore, the inclusion of user-specific anthropometric priors for personalized SMPL shape estimation is emphasized. These revisions improve the clarity of the key contributions and distinctly differentiate LGSIK-Poser from existing approaches, while demonstrating its suitability for real-time deployment with sparse and practical sensor inputs.
Comment:
The related work section categorizes the methods by sensor type and highlights key trade-offs. I would suggest emphasizing more on how LGSIK-Poser differs in structure and efficiency.
Response:
Thank you for the constructive suggestion. The Related Work section has been revised to better highlight the structural and efficiency advantages of LGSIK-Poser. In particular, we explicitly contrast LGSIK-Poser with HMD-Poser and other Transformer-heavy architectures, emphasizing LGSIK-Poser's modular design, anatomically structured modeling, and reduced computational overhead. We also clarify that the model avoids costly iterative optimization by adopting a non-iterative, region-specific IK module for efficient end-effector refinement. These enhancements help delineate the novelty and practicality of the proposed method within the landscape of sparse-input human motion reconstruction.
Reviewer Comment:
However, the connection between group-level and partition-level features should be explained more explicitly in the text. The IK refinement process and the use of raw inputs as skip connections would also benefit from clearer visual or textual clarification.
Response:
Thank you for the constructive suggestion. We have carefully revised the Partition-Based Pose Optimization section to explicitly describe how group-level features produced by the GConv module transition into partition-level representations. Specifically, the final GConv layer outputs six anatomically meaningful region features (root, head, and four limbs), which serve as high-level spatial embeddings for subsequent local modeling. These partition-level features are then independently processed by dedicated LSTM networks to incorporate temporal dynamics. This hierarchical transition is now clearly explained to distinguish the roles of group-level context integration and partition-level local refinement.
In addition, the textual description of the IK refinement process has been enhanced by detailing how each partition-specific MLP predicts joint rotations based on (1) LSTM-refined temporal features, (2) the corresponding raw input as a skip connection, and (3) the root rotation prediction for global coordination. The role of skip connections is explicitly mentioned as a mechanism to preserve local sensor cues during optimization.
Figure references (Figure 1) have also been added at appropriate locations to guide readers through the visual flow of the pipeline and clarify information exchange between modules. These revisions collectively improve the clarity of our architecture and address concerns regarding the connection between modules and the structure of the IK refinement stage.
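As a rough sketch of one partition-specific prediction head described above (all dimensions, the 6D rotation parameterization, and the two-layer MLP depth are assumptions for illustration, not the exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU; biases omitted for brevity.
    return np.maximum(0.0, x @ w1) @ w2

# Hypothetical inputs for one partition: LSTM-refined temporal feature,
# raw sensor input (skip connection preserving local cues), and the
# predicted root rotation for global coordination.
temporal_feat = rng.standard_normal(64)
raw_input = rng.standard_normal(18)
root_rot = rng.standard_normal(6)

x = np.concatenate([temporal_feat, raw_input, root_rot])  # (88,)
w1 = rng.standard_normal((88, 32)) * 0.1
w2 = rng.standard_normal((32, 6)) * 0.1  # 6D rotation for one joint
joint_rot = mlp(x, w1, w2)
```

The skip connection simply means the raw measurement is concatenated into the head's input rather than being consumed only by earlier modules.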
Comment:
Lastly, the per-frame versus per-sequence behavior of the shape fitting module should be discussed for full transparency and reproducibility.
Response:
Thank you for this valuable comment. The revised manuscript clarifies that the shape fitting module predicts per-frame shape parameters to support online inference in real-time applications. However, during training, the model operates on sliding windows of consecutive frames. Within each window, a shape consistency loss is applied to enforce temporal stability of the predicted shape parameters across frames. This encourages temporally coherent and smooth shape estimates, mitigating frame-wise fluctuations. These design choices are now explicitly discussed in the method section to improve transparency and facilitate reproducibility.
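A minimal sketch of the window-level consistency idea follows. The exact form of the paper's loss is not specified here, so a mean squared temporal difference over per-frame SMPL shape parameters is assumed for illustration.

```python
import numpy as np

def shape_consistency_loss(betas):
    """Penalize frame-to-frame variation of per-frame shape predictions
    within one training window (betas: [T, 10] SMPL shape parameters).
    Assumed form: mean squared first-order temporal difference."""
    return float(np.mean((betas[1:] - betas[:-1]) ** 2))

T = 40  # sliding-window length used during training
# A constant shape across the window incurs zero penalty...
stable = np.tile(np.linspace(-1.0, 1.0, 10), (T, 1))
# ...while frame-wise fluctuations are penalized.
noisy = stable + 0.1 * np.random.default_rng(0).standard_normal((T, 10))
```

At inference the module still emits a shape per frame, so the loss only shapes training behavior and adds no runtime cost.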
Comment:
The loss weighting strategy and why certain settings in their calculations were chosen.
Response:
Thank you for the valuable comment regarding the loss weighting strategy and the rationale behind the chosen settings. The Experiments section, specifically the Implementation details subsection, has been expanded to explain the motivation and design of the loss weights.
We clarify that the weights α for each loss component were empirically chosen to balance the different scales of the losses and their relative importance in achieving accurate and temporally stable reconstruction. Larger weights are assigned to joint position losses, including sparse joint positions (root and hands), to emphasize precise localization of sensor-observed keypoints essential for VR rendering and interaction. Orientation and smoothness losses have moderate weights to ensure natural, smooth articulation without dominating spatial accuracy.
The explicit listing of all loss weights and this additional explanation improve transparency and support reproducibility of the training process. This addresses the reviewer’s concern thoroughly.
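The weighting described above amounts to a fixed linear combination. The sketch below uses the seven reported values (10, 50, 10, 1000, 1000, 10, 5); the term names and the mapping of each weight to a specific term are assumptions for illustration only.

```python
# Hypothetical term names; only the seven weight values come from the text.
weights = {
    "ori": 10.0, "rot": 50.0, "hand": 10.0,
    "joint": 1000.0, "sparse_joint": 1000.0,  # sensor-observed keypoints
    "vel": 10.0, "smooth": 5.0,
}

def total_loss(terms):
    """Fixed empirical weighted sum of per-term loss values."""
    return sum(weights[k] * v for k, v in terms.items())

terms = {k: 0.01 for k in weights}  # dummy per-term loss values
```

Because the weights are constants, retuning for a new sensor setup only requires re-running this combination, not changing the architecture.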
Comment:
You could better explain why errors decrease over time in the results.
Response:
Thank you for the valuable comment. To address this suggestion, a detailed clarification has been added in Section 4.1 (“Comparisons”), subsection 4.1.1 (“Quantitative Results”), specifically within the paragraph titled “Online sliding window evaluation.”
The revision explicitly states that the model utilizes a unidirectional (causal) LSTM, meaning each frame’s prediction depends only on the current and past frames without access to future frames. As more past temporal context accumulates within the sliding window, prediction accuracy progressively improves. Early frames exhibit higher errors due to limited historical information, while errors decrease and stabilize in later frames. This behavior is illustrated in Figure 3, which visualizes the frame-wise MPJPE across the sliding window.
This addition clarifies the temporal dynamics underlying the error trend, enhancing interpretability and completeness of the results discussion.
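The mechanism can be illustrated with a toy causal estimator (not the paper's LSTM): a running mean that, at frame t, may only use frames 0..t. Its per-frame error shrinks as past context accumulates, mirroring the trend in the heatmap.

```python
import numpy as np

rng = np.random.default_rng(0)
T, trials = 40, 2000                   # 40-frame window, many toy sequences
signal = 1.0                           # constant ground truth
obs = signal + rng.standard_normal((trials, T))  # noisy causal observations

# Causal running mean: frame t averages only frames 0..t.
est = np.cumsum(obs, axis=1) / np.arange(1, T + 1)

# Mean absolute error per frame position within the window.
framewise_err = np.mean(np.abs(est - signal), axis=0)
```

Early window positions average few observations and thus show larger error; later positions benefit from the accumulated context, just as the later frames of the sliding window do in the reported evaluation.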
Comment:
The discussion could better contextualize trade-offs and explicitly link the graph module’s "interpretability" to concrete design choices.
Response:
Thank you for the valuable suggestion. The discussion section has been revised to explicitly link the graph module’s interpretability to its concrete design choices. Specifically, it is clarified that the module uses a hand-crafted adjacency matrix aligned with actual skeletal topology, which restricts feature interactions to anatomically relevant neighbors. This structural constraint enhances transparency in information flow, making the model’s learned representations more interpretable compared to Transformer-based approaches with dense, entangled attention.
Additionally, the design leads to a lightweight architecture with fewer parameters and lower computational cost, supporting efficient, real-time deployment on resource-limited platforms. These additions provide clearer context on the rationale behind the graph module’s design and its benefits.
Comment:
The conclusion should more sharply contrast its configurable sensor support with rigid baselines. Future work directions should prioritize not just hardware expansions but also algorithmic solutions to address lower-limb ambiguities under sparse data.
Response:
Thank you for the constructive feedback regarding the conclusion section. The revised manuscript’s conclusion has been updated to more clearly emphasize LGSIK-Poser’s flexible support for configurable sensor setups, contrasting it with traditional rigid baselines that require fixed sensor configurations or architectural changes. Furthermore, challenges associated with accurately reconstructing complex lower-limb motions under sparse input conditions are explicitly discussed, highlighting specific ambiguities such as fast kicking motions and subtle tremors.
To address these issues, future work directions are detailed with a focus not only on potential hardware expansions but also on algorithmic solutions. These include the integration of task-specific constraints, scene-aware priors—particularly foot-ground contact modeling—and the exploration of diffusion-based generative models to better capture lower-body dynamics and resolve under-constrained ambiguities. These additions enhance transparency regarding the model’s limitations and provide a clearer roadmap for advancing its capabilities in sparse data scenarios.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors propose LGSIK-Poser, a unified and lightweight framework for real-time full-body human motion reconstruction from sparse and heterogeneous inputs.
The method utilizes three types of sensors: HMDs, handheld controllers, and up to three optional IMUs.
There are several aspects that need improvement, as indicated in the following comments.
1.
The authors review only sensor-based (IMU, HMD, tracker) HMR.
However, there are two more prominent approaches: 3D human pose estimation using RGB cameras and using depth cameras.
These are becoming increasingly popular and widely adopted.
Therefore, they should be considered together.
Nevertheless, if the authors decide to exclude all camera-based approaches, then 'HMR from Sparse 6DoF Tracker Inputs' should be included for comparison, since 6DoF trackers are the most promising and advanced method among sensor-based approaches.
2.
In the third paragraph of Subsection 3.1: 'Each supernode represents one of the nine semantic groups...'.
It is not clearly stated what the nine groups refer to.
Also, the two yellow nodes in the left panel of Fig. 2 are vague, and it is unclear what is intended.
3.
Fig. 1 shows the overview of the proposed framework, and details of the three main modules (grouped spatio-temporal modeling, partition-based pose optimization, and shape fitting) are described in Subsections 3.1–3.3.
However, the description is mostly verbal and rough, lacking clarity and specificity, which makes it difficult to understand how the implementation should be done.
It would be helpful if the explanation were detailed enough, perhaps using data types, formulas, or modular decomposition in an algorithmic style, so that the specific internal structure of each module can be clearly understood.
4.
Eq. 1 contains seven coefficients, alpha_ori, ..., alpha_smooth, each of which plays a crucial role in the loss function.
However, it is stated that these values were simply set to fixed values (10, 50, 10, 1000, 1000, 10, 5).
It would be necessary to include a discussion on the impact of varying these values, methods for determining optimal settings, and conditions under which adjustments might be required.
5.
In Section 4, which presents the experimental results, the complete lack of visual outcomes makes it difficult to understand how the proposed method is applied to the experimental input data and what the intermediate data look like as processing progresses.
It would be beneficial to add experimental results that graphically show the visual appearance of the data used from the AMASS dataset, the positions of the joints obtained from them, and the changes in joint movement over a timeline.
Author Response
Reviewer Comment:
The authors review only sensor-based (IMU, HMD, tracker) HMR. However, there are two more prominent approaches: 3D human pose estimation using RGB cameras and using depth cameras. These are becoming increasingly popular and widely adopted. Therefore, they should be considered together. Nevertheless, if the authors decide to exclude all camera-based approaches, then 'HMR from Sparse 6DoF Tracker Inputs' should be included for comparison, since 6DoF trackers are the most promising and advanced method among sensor-based approaches.
Author Response:
We appreciate the reviewer’s suggestion regarding the inclusion of camera-based 3D human pose estimation methods using RGB and depth cameras. While these approaches are indeed prominent and widely studied, this work focuses specifically on sensor-based human motion reconstruction (HMR), whose hardware requirements and suitability for VR/AR applications differ substantially from those of camera-based systems. Therefore, camera-based methods fall outside the scope of this review.
Regarding the inclusion of “HMR from Sparse 6DoF Tracker Inputs,” this category is thoroughly covered in the revised Related Work section. The survey highlights state-of-the-art methods such as SparsePoser and DragPoser, emphasizing the advantages and limitations of 6DoF tracker-based approaches as some of the most promising and advanced sensor-based solutions. This ensures a comprehensive overview of relevant sensor modalities.
Reviewer Comment:
In the third paragraph of Subsection 3.1: 'Each supernode represents one of the nine semantic groups...'. It is not clearly understood what the nine groups refer to. Also, the two yellow nodes in the left panel of Fig. 2 are vague, and it is unclear what is intended.
Author Response:
Thank you for highlighting the ambiguity regarding the nine semantic groups and the yellow nodes in Fig. 2. To clarify, we have explicitly defined the nine anatomical groups—torso, head, left/right arms, left/right legs, upper/lower body, and shape—at the beginning of Subsection 3.1, along with their corresponding notations in Eq. (1). We also revised the caption and annotations in Fig. 2 to clearly indicate that each supernode corresponds to one of these groups. The two yellow nodes on the left represent the upper and lower body groups, and their roles are now explicitly described in both the text and the figure caption. These clarifications aim to improve understanding of the group-wise topology and message passing structure.
Reviewer Comment:
Fig. 1 shows the overview of the proposed framework, and details of the three main modules (grouped spatio-temporal modeling, partition-based pose optimization, and shape fitting) are described in Subsections 3.1–3.3.
However, the description is mostly verbal and rough, lacking clarity and specificity, which makes it difficult to understand how the implementation should be done.
It would be helpful if the explanation were detailed enough, perhaps using data types, formulas, or modular decomposition in an algorithmic style, so that the specific internal structure of each module can be clearly understood.
Author Response:
Thank you for pointing out the lack of clarity in our original method description. In response, we have substantially revised Section 3 to include more formal notation, equations, and structured module breakdowns. Specifically, we now describe the grouped spatio-temporal modeling module with explicit formulas for group-wise input encoding, LSTM processing, and graph convolution (Eqs. 1–3), and we provide detailed anatomical grouping strategies and the rationale behind the message-passing mask (Fig. 2). For the partition-based pose optimization and shape fitting modules, we have clarified the inputs and outputs of each sub-component (e.g., MLPs, concatenation with priors) and added corresponding references to Fig. 1 for visual guidance. These revisions aim to make the internal structure and implementation of each module more transparent and reproducible.
Reviewer Comment:
Eq. 1 contains seven coefficients, alpha_ori, ..., alpha_smooth, each of which plays a crucial role in the loss function. However, it is stated that these values were simply set to fixed values (10, 50, 10, 1000, 1000, 10, 5). It would be necessary to include a discussion on the impact of varying these values, methods for determining optimal settings, and conditions under which adjustments might be required.
Author Response:
Thank you for the helpful suggestion. We have added a paragraph to the implementation details section to explain the choice of the seven loss weights in Eq.1. These values were empirically set to balance the relative magnitudes of each term under accurate reconstruction, and to reflect their importance for the final performance. In particular, higher weights are assigned to joint-related terms to emphasize localization accuracy, while orientation and smoothness terms help enforce temporal coherence. The final configuration was chosen through preliminary experiments. We also note that finer adjustments may be needed for specific tasks or input setups, which could be explored in future work.
Reviewer Comment:
In Section 4, which presents the experimental results, the complete lack of visual outcomes makes it difficult to understand how the proposed method is applied to the experimental input data and what the intermediate data look like as processing progresses. It would be beneficial to add experimental results that graphically show the visual appearance of the data used from the AMASS dataset, the positions of the joints obtained from them, and the changes in joint movement over a timeline.
Author Response:
We sincerely appreciate the reviewer’s valuable suggestion regarding the addition of visual outcomes to better illustrate the effectiveness and interpretability of our proposed method. In the revised manuscript, we have addressed this point thoroughly by adding three sets of qualitative visualizations and enhancing the presentation of our framework.
- Qualitative Comparison (Figure 5):
We provide a side-by-side comparison between our method (LGSIK-Poser) and the baseline (HMD-Poser) across four challenging scenarios, including fast upper- and lower-limb motions, occluded poses, and body shape ambiguity. Each case is visualized with three vertically stacked sub-views: ground-truth (GT), HMD-Poser prediction vs. GT, and LGSIK-Poser prediction vs. GT. Different colors (blue for GT, yellow for HMD-Poser, and green for LGSIK-Poser) help distinguish the outputs. These visualizations highlight the improved plausibility and accuracy of our method in terms of joint location, pose consistency, and shape correctness.
- Temporal Sequence Visualization (Figure 6):
To demonstrate temporal coherence and continuity, we visualize a short sequence from the AMASS badminton dataset. The top row shows the ground-truth motion, while the middle and bottom rows overlay the GT with HMD-Poser and LGSIK-Poser predictions, respectively. Our method consistently maintains smoother transitions and fewer prediction artifacts, particularly for lower-limb and wrist trajectories.
- Shape-Aware Evaluation (Figure 7):
To validate the effectiveness of our shape-aware modeling, we include visualizations involving subjects of diverse body shapes and genders. Each row displays results from different actions, and each column represents a different subject. The overlay of predictions and GT meshes shows that our model generalizes well across body types, maintaining shape consistency and motion realism.
- Updated Framework Figure (Figure 1):
We revised the framework figure to include 3D visualizations of intermediate outputs alongside the corresponding processing stages. These visuals help readers understand how raw inputs evolve through grouped temporal modeling, graph-based fusion, local optimization, and final mesh reconstruction.
All figures are visualized using aitviewer, a dedicated open-source tool for SMPL-compatible 3D human model rendering, which ensures clarity, reproducibility, and accurate joint visualization.
We believe these additions significantly enhance the interpretability of our method and thank the reviewer again for the helpful suggestion.
Reviewer 4 Report
Comments and Suggestions for Authors
Review ai-3779658-peer-review-v1
The aim of the paper, "LGSIK-Poser: Skeleton-Aware Full-Body Motion Reconstruction from Sparse Inputs," was to introduce a real-time, skeleton-aware framework for full-body human motion reconstruction using sparse Virtual Reality (VR) and Augmented Reality (AR) inputs. The authors claim their model can accommodate diverse sensor configurations without requiring architecture modifications and achieve accurate, consistent, and interpretable pose reconstruction with improved runtime efficiency compared to state-of-the-art baselines.
The paper is well written and logically organized, guiding the reader through motivation, methodology, and experimental validation. Diagrams such as Figures 1 and 2 effectively illustrate the model architecture and feature grouping. The mathematical descriptions of loss functions and network design are appropriately detailed. I suggest eliminating phrases such as 'we present' and 'we categorize' from the paper. Most academic papers are written in the passive voice. Please revise accordingly.
Novelty and Contribution: The paper introduces a novel integration of grouped spatio-temporal modeling, skeleton-aware graph convolution, region-specific inverse kinematics refinement, and shape personalization using anthropometric priors. These components, while based on known techniques, are combined in a new and efficient way that improves interpretability and real-time performance over prior methods like HMD-Poser and DragPoser.
Literature Review and Citations: The related work is well-structured by input type and cites relevant and mostly recent literature. It clearly differentiates LGSIK-Poser from prior models and situates it within both physics-based and learning-based approaches. One improvement would be to include recent advances in diffusion-based motion modeling to reflect the latest developments in the field. Overall, citations are appropriate, with no noticeable gaps or overuse of self-citation.
Below you will find the comments for each section.
Abstract:
- Lines 8–10: Add a brief explanation of what the SMPL model is (e.g., “...SMPL model, a standard parametric human body model...”) for readers unfamiliar with it.
- Line 13: Clarify "HMD-Poser" by briefly stating its limitations (e.g., “a previous method that suffers from high latency and complexity”).
- Introduction
- Lines 31–33: The critique of HMD-Poser is a bit vague. Specify which components of the “Transformer-heavy design” are problematic. Interpretability? Memory usage? Inference speed?
- Line 38: Briefly describe what “constraint satisfaction” means in the context of motion modeling (for readers outside the field).
- Related Work:
- Line 77: Reference to HMD-Poser is repeated. Try consolidating prior mentions for better flow.
- Line 84–89: Briefly explain why SparsePoser and DragPoser require “external base stations” and how that impacts scalability - currently implicit.
- Lines 91–96: Explain how LGSIK's IK module differs from DragPoser’s iterative optimization (i.e., LGSIK’s is embedded and non-iterative).
- Method
3.1 Grouped Spatio-Temporal Modeling
- Line 108–112: Provide an example of the types of features included in each group (e.g., “the left leg group includes hip, knee, ankle rotations”).
- Lines 129–132: The phrase “self-message passing” could be clarified - what does it mean operationally? Perhaps add one sentence explaining its implementation in GNN terms.
3.2 Partition-Based Pose Optimization
- Lines 147–153: Use visual or math notation to show input/output of each MLP module, or refer to an architecture diagram section.
3.3 Shape Fitting
- Line 160: Explain why anthropometric priors (height, gender) are more robust than using SMPL annotations—add a short justification for this design.
3.4 Training (Lines 173–189)
- Line 176–178: Consider briefly describing the intuition behind “sparse joint loss”—e.g., “helps enforce accuracy where sensors provide direct observations.”
- Line 183–185: The velocity loss equation is not labeled clearly (Equation 2). It would benefit from an in-line explanation of each variable again (e.g., clarify P̂_{t+n} as the predicted positions).
- Experiments
- Line 226: Define what is meant by VR-only vs. VR+IMU input setup more clearly—perhaps in a footnote or table caption.
- Line 244–254: A visual bar chart or table highlighting hand error gains over HMD-Poser (especially the 48% improvement) would make this result more compelling.
- Lines 273–282: Emphasize that performance improves under online inference—this is key and should be highlighted with bold or underlined terms.
- Line 290–299: The heatmap (Figure 3) is described but not interpreted quantitatively. Add average numerical MPJPE at the final 5 frames to show the concrete error drop.
4.2 Ablation Studies
- Lines 306–313: More explanation needed on the “noG” result—why does removing graph convolutions improve smoothness but degrade accuracy?
- Line 317–322: Clarify that “hploss” corresponds to HMD-Poser’s original loss—this isn’t clear without re-reading earlier sections.
4.3 Inference Efficiency
- Lines 344–354: The claim about HMD-Poser's reported FPS discrepancy should be toned down or footnoted as speculative. Suggest something like: “may stem from implementation differences in FK stages...”
- Line 339: Provide a short rationale for why AvatarPoser is faster despite lower generalization (e.g., “fewer parameters and simpler pipeline”).
- Discussion
- Lines 377–382: Consider breaking this long sentence into two. Currently hard to parse. Suggest separating the discussion of inter-limb coupling and foot-ground constraints.
- Line 386: Add a mention of potential learning-based solutions (e.g., incorporating generative priors or diffusion models) as future work.
- Conclusion
- Line 399–404: Reiterate the most critical application domains (e.g., VR fitness, virtual meetings, or gaming) to reinforce relevance.
- Line 409–411: Consider being more specific in the “task-specific constraints” and “scene-aware priors” — one example would make this stronger.
Final Suggestions
- Add qualitative results: Include figures showing reconstructed human poses alongside ground-truth, especially in challenging motions (e.g., turning, jumping).
- Once again, I suggest eliminating phrases such as 'we present' and 'we categorize' from the paper. Most academic papers are written in the passive voice. Please revise accordingly.
Author Response
We sincerely thank the reviewer for their thorough, detailed, and constructive comments. Many of the suggestions have substantially improved the clarity, completeness, and overall presentation quality of the manuscript. The following is a concise, point-by-point response arranged in the order of the comments.
For a more detailed explanation and supporting materials, please refer to the attached supplementary documents.
General Writing Style
Reviewer Comment:
I suggest eliminating phrases such as "we present" and "we categorize" from the paper. Most academic papers are written in the passive voice. Please revise accordingly.
Response:
Thank you for this stylistic recommendation. The manuscript has been carefully revised to eliminate first-person expressions such as “we present” and “we categorize,” adopting passive voice constructions where appropriate to align with academic writing conventions.
Abstract
Comment (Lines 8–10):
Add a brief explanation of what the SMPL model is.
Response:
We appreciate this suggestion. The phrase “SMPL model” has been revised to “SMPL model, a widely used parametric human body representation” to aid understanding for readers unfamiliar with the term.
Comment (Line 13):
Clarify "HMD-Poser" by briefly stating its limitations.
Response:
Thank you. While we understand the intent to provide more context for HMD-Poser, we believe that a balanced and detailed critique is more appropriately placed in the body of the paper. To maintain brevity and fairness in the abstract, we chose not to explicitly list its limitations there.
Introduction
Comment (Lines 31–33):
The critique of HMD-Poser is vague. Specify which components of the “Transformer-heavy design” are problematic.
Response:
We have revised the text to explain that HMD-Poser applies global self-attention across all spatiotemporal tokens, which increases computational cost and entangles spatial semantics. This design hinders modular modeling and anatomical prior integration—especially problematic under sparse input settings.
Comment (Line 38):
Briefly explain “constraint satisfaction” in the context of motion modeling.
Response:
The term has been clarified as referring to hard kinematic relationships, such as precise end-effector placements via optimization. This concept is now explained in the context of DragPoser and our region-aware refinement.
Related Work
Comment (Line 77):
Reference to HMD-Poser is repeated. Try consolidating for better flow.
Response:
We have restructured the relevant paragraph to consolidate HMD-Poser mentions, now summarizing its contributions and limitations in one concise description.
Comment (Lines 84–89):
Briefly explain the need for external base stations in SparsePoser and DragPoser.
Response:
We clarified that these methods rely on outside-in tracking using base stations like Lighthouse, which require structured light or laser sweeps. This increases setup complexity and reduces portability.
Comment (Lines 91–96):
Explain how LGSIK’s IK module differs from DragPoser’s optimization.
Response:
We now emphasize that LGSIK-Poser employs a feedforward, embedded IK module, unlike DragPoser’s iterative optimization. This allows real-time performance while preserving end-effector accuracy.
Section 3: Method
3.1 Grouped Spatio-Temporal Modeling
Comment (Lines 108–112):
Provide examples of features in each group.
Response:
We now specify that the left leg group includes pelvis and left knee IMU rotations and accelerations, while the right arm group includes head and right hand 6DoF data.
Comment (Lines 129–132):
Clarify “self-message passing.”
Response:
We added an explanation that this involves disabling incoming edges to a node, preventing it from being updated during graph convolution. The shape node performs one-way propagation of static priors.
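The "disabling incoming edges" idea can be made concrete in adjacency-matrix terms. The following is a minimal illustrative sketch, not the paper's actual GNN layer: function names, the toy graph, and the feature values are all hypothetical.

```python
# Sketch: "self-message passing" as masking incoming edges in a dense
# adjacency-matrix message-passing step. Illustrative only; the paper's
# actual graph-convolution layer may differ.

def propagate(adj, feats):
    """One naive message-passing step: each node sums the features of its
    in-neighbours (adj[i][j] == 1 means an edge from node j to node i)."""
    n = len(feats)
    d = len(feats[0])
    return [
        [sum(adj[i][j] * feats[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]

def freeze_node(adj, node):
    """Disable all incoming edges to `node` except its self-loop, so its
    features are never updated by neighbours (one-way propagation)."""
    out = [row[:] for row in adj]
    for j in range(len(out)):
        out[node][j] = 1 if j == node else 0
    return out

# Toy 3-node fully connected graph: node 0 plays the role of a static
# "shape" node that should only broadcast its prior, never receive messages.
adj = [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]]
feats = [[1.0], [2.0], [3.0]]
frozen = freeze_node(adj, 0)
out = propagate(frozen, feats)
# Node 0 keeps its own feature; nodes 1 and 2 still aggregate it.
```

After freezing, node 0 retains its value across propagation steps while the other nodes continue to receive its contribution, which is the one-way behaviour described above.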
3.2 Partition-Based Pose Optimization
Comment (Lines 147–153):
Use visual or math notation or refer to an architecture figure.
Response:
We have added a direct reference to Figure 1 (the framework overview), which illustrates the inputs and outputs of each MLP module in this stage.
3.3 Shape Fitting
Comment (Line 160):
Explain why anthropometric priors are more robust.
Response:
The revised text contrasts anthropometric priors (e.g., height, gender) with SMPL annotations that require complex offline pipelines. Our approach is more practical and scalable, especially in consumer-grade VR settings.
3.4 Training
Comment (Lines 176–178):
Briefly describe the intuition behind sparse joint loss.
Response:
We clarified that this loss focuses on joints directly observed via sensors (head and hands), helping to reduce ambiguity and anchor predictions.
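The intuition of penalising error only where sensors provide direct observations can be sketched as a masked position loss. This is an illustrative toy example, not the paper's implementation: the joint indices, positions, and weighting are hypothetical.

```python
# Sketch: a "sparse joint loss" that penalises error only on joints with
# direct sensor observations (e.g. head and hands). Indices and values
# are illustrative, not taken from the paper.

def sparse_joint_loss(pred, gt, observed):
    """Mean squared position error over the observed joint subset.
    pred/gt: lists of (x, y, z) joint positions for one frame."""
    total = 0.0
    for j in observed:
        total += sum((p - g) ** 2 for p, g in zip(pred[j], gt[j]))
    return total / len(observed)

pred = [(0.0, 1.7, 0.0), (0.3, 1.4, 0.1), (-0.3, 1.4, 0.1), (0.0, 0.9, 0.0)]
gt   = [(0.0, 1.7, 0.0), (0.3, 1.5, 0.1), (-0.3, 1.4, 0.1), (0.1, 0.9, 0.0)]
OBSERVED = [0, 1, 2]  # hypothetical indices for head + two hands
loss = sparse_joint_loss(pred, gt, OBSERVED)
```

Errors at the unobserved joint (index 3) do not enter the loss, which is exactly how such a term anchors predictions at the sensor-observed joints.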
Comment (Lines 183–185):
Add inline definitions for the velocity loss variables.
Response:
We have revised the equation explanation to clearly define $\hat{\bm{P}}_t$ as the predicted joint positions and $\bm{P}_t$ as the ground-truth positions.
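For readers outside the field, one common formulation of such a velocity loss compares frame-to-frame displacements of predicted and ground-truth positions, i.e. a term of the form $\|(\hat{P}_{t+n} - \hat{P}_t) - (P_{t+n} - P_t)\|^2$. The sketch below assumes this formulation (the paper's exact definition may differ) and uses scalar positions for a single joint for brevity.

```python
# Sketch of a frame-difference velocity loss of the form
# || (P̂_{t+n} - P̂_t) - (P_{t+n} - P_t) ||^2, averaged over time.
# Assumed formulation, shown for one joint with scalar coordinates.

def velocity_loss(pred, gt, n=1):
    """pred/gt: lists of scalar positions over time; n: frame offset."""
    total, count = 0.0, 0
    for t in range(len(pred) - n):
        pred_vel = pred[t + n] - pred[t]  # predicted displacement
        gt_vel = gt[t + n] - gt[t]        # ground-truth displacement
        total += (pred_vel - gt_vel) ** 2
        count += 1
    return total / count

pred = [0.0, 0.2, 0.5, 0.9]
gt   = [0.0, 0.25, 0.5, 0.85]
loss = velocity_loss(pred, gt)
```

Matching displacements rather than absolute positions is what makes such a term penalise jitter and encourage temporally smooth motion.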
Experiments
Comment (Line 226):
Define VR-only vs. VR+IMU setups.
Response:
Definitions for the three sensor configurations (VR-only, VR+2IMUs, VR+3IMUs) have been added at the end of Section 4.1 for immediate clarity.
Comments (Lines 273–299):
Highlight online inference benefits and interpret heatmap quantitatively.
Response:
We have bolded the performance improvements and added concrete MPJPE values (3.07 cm → 2.78 cm) to quantify error reduction across time in the heatmap.
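For context, MPJPE (Mean Per-Joint Position Error), the metric behind the reported centimetre values, is the Euclidean joint-position error averaged over joints (and, in practice, over frames). A minimal single-frame sketch with hypothetical positions:

```python
# Sketch: MPJPE for one frame -- the mean Euclidean distance between
# predicted and ground-truth joint positions. Values are illustrative.
import math

def mpjpe(pred, gt):
    """pred/gt: lists of (x, y, z) joint positions for one frame."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt   = [(0.0, 0.03, 0.0), (1.0, 0.0, 0.04)]
err = mpjpe(pred, gt)  # in metres here
```

Averaging this per-frame value over a window of frames yields figures directly comparable to the 3.07 cm and 2.78 cm numbers quoted above.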
Ablation Study (Section 4.2)
Comments (Lines 306–322):
Clarify “noG” and “hploss.”
Response:
The “noG” variant is now explained as removing structural constraints, enabling smoother but less accurate predictions. “hploss” is clarified as HMD-Poser’s original training loss.
Runtime and Efficiency
Comment (Line 339):
Add rationale for AvatarPoser’s higher FPS.
Response:
We note that AvatarPoser uses a simpler pipeline with fewer parameters, enabling faster but less generalizable inference.
Comment (Lines 344–354):
Tone down FPS speculation on HMD-Poser.
Response:
The claim has been softened to state: “The discrepancy may stem from differences in the FK stage implementation or other undocumented variations.”
Discussion & Conclusion
Comment (Lines 377–382):
Split long sentence on foot constraints.
Response:
We split the sentence into two: one focusing on leg IMU limitations, the other on weak root constraints.
Comment (Line 386):
Add mention of generative methods as future work.
Response:
We have added a reference to potential use of generative models and contact priors to improve lower-body fidelity under sparse inputs.
Comment (Lines 399–411):
Reinforce application relevance and be specific on future work.
Response:
The conclusion now emphasizes key application domains (VR fitness, meetings, gaming) and provides a specific future direction involving foot-ground contact priors.
Final Suggestions – Qualitative Results
Comment:
Add qualitative comparisons and visualizations.
Response:
We have added the following:
- Figure 5: Side-by-side pose comparisons between LGSIK-Poser and HMD-Poser under challenging motions.
- Figure 6: Temporal sequence comparison showing smoother transitions.
- Figure 7: Evaluation across body shapes and genders.
- Figure 1 (updated): Now includes visualized intermediate outputs in the framework.
All visualizations were generated using the open-source aitviewer tool for consistent and reproducible rendering.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript has been appropriately revised, reflecting the reviewer's suggestions and comments.