AFJ-PoseNet: Enhancing Simple Baselines with Attention-Guided Fusion and Joint-Aware Positional Encoding
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper addresses two key limitations of the Simple Baseline by designing the AFM and JAPE modules, with good results. However, there are several issues that need to be addressed:
- What is the computational complexity of the AFM module? Has it increased the model's inference time and parameter count?
- The improvement on MPII is small, with some parts even showing a decline. A brief explanation is recommended.
- Inference speed is an important metric. It is recommended to add FPS or inference time.
- Please provide complete training details, such as system, CPU, batch size, etc.
- It is suggested to add some success and failure cases in the Discussion section to show the strengths and limitations of the model.
- Specific definitions of some evaluation metrics, such as Mean@0.1, should be provided.
- It is recommended to include experiments with other backbone networks, or add relevant discussion on extensibility.
Author Response
Dear Reviewer #1,
Thank you for your time and for providing such detailed and constructive feedback on our manuscript. We have found your comments to be incredibly helpful and have revised our paper accordingly. We believe these changes have significantly strengthened the manuscript.
Below are our point-by-point responses to your comments.
Comment 1: "What is the computational complexity of the AFM and JAPE modules? Has it increased the model's inference time and parameter count?"
Our Response: Thank you for raising this critical point. We agree that a thorough analysis of computational cost is essential for evaluating the practicality of our model.
- Action Taken: We have conducted extensive benchmarking and added a comprehensive analysis of model complexity. We have added two new columns, "Params (M)" (model parameters) and "FPS" (frames per second), to Table 1. Furthermore, we have included a new detailed analysis in Section 4.1 to discuss the trade-offs, showing that our model achieves significant accuracy gains with only a modest and acceptable increase in computational overhead.
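For readers unfamiliar with the two quantities named above, the following is a minimal sketch of how a parameter count and an FPS figure are commonly obtained. It is not the authors' benchmarking code: the helper names `conv2d_params` and `measure_fps`, the warm-up count, and the iteration count are all assumptions chosen for illustration.

```python
import time


def conv2d_params(c_in, c_out, kernel_size, bias=True):
    """Learnable parameters of one 2D convolution with a square kernel."""
    weights = c_in * c_out * kernel_size * kernel_size
    return weights + (c_out if bias else 0)


def measure_fps(run_once, n_warmup=10, n_iters=100):
    """Average frames per second of `run_once`, a callable that processes one frame.

    Warm-up calls are excluded from timing so that one-off start-up costs
    (caching, lazy initialisation) do not distort the average.
    """
    for _ in range(n_warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(n_iters):
        run_once()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed


# Hypothetical usage: a 1x1 convolution mapping 256 channels to 1 channel.
print(conv2d_params(256, 1, 1))
# A stand-in workload; a real benchmark would call the model's forward pass here.
print(measure_fps(lambda: sum(range(1000)), n_warmup=2, n_iters=20) > 0)
```

When timing on a GPU, the device typically has to be synchronised before each clock reading, because kernel launches are asynchronous; the sketch above omits that step.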
Comment 2: "The improvement on MPII is small, with some parts even showing a decline. A brief explanation is recommended."
Our Response: We appreciate you pointing this out. Your insight prompted us to look deeper into the results and provide a more nuanced analysis.
- Action Taken: We have added a detailed explanation in the first paragraph of Section 5 (Discussion). We now discuss the phenomenon where adding only the AFM module can introduce noise for less challenging joints (like Shoulder and Elbow), leading to a slight performance drop. Crucially, we then explain how the subsequent addition of the JAPE module acts as a powerful regularizer to mitigate this issue and boost overall performance, thus demonstrating the vital synergy between our two contributions.
Comment 3: "Inference speed is an important metric. It is recommended to add FPS or inference time."
Our Response: We completely agree.
- Action Taken: As addressed in our response to your first comment, we have now included FPS measurements in Table 1 and provided a corresponding analysis in Section 4.1.
Comment 4: "Please provide complete training details, such as system, CPU, batch size, etc."
Our Response: Thank you for this suggestion. We have updated the manuscript with more detailed information to enhance the reproducibility of our work.
- Action Taken: In Section 3.3.2 (Implementation Details), we have added specific information about the GPU used (NVIDIA Tesla V100), the batch size (128), and more precise data augmentation parameters, including the ranges for random rotation and scaling.
Comment 5: "It is suggested to add some success and failure cases in the Discussion section to show the strengths and limitations of the model."
Our Response: This is an excellent point. To provide a more balanced perspective, we have expanded our discussion to conceptually cover the model's strengths and limitations.
- Action Taken: In the second paragraph of Section 5 (Discussion), we now explicitly discuss the trade-off observed between Average Precision (AP) and Average Recall (AR) in our COCO results. We elaborate that our model's strength lies in improving precision by pruning ambiguous detections, a desirable trait for high-fidelity applications, while also acknowledging the slight decrease in recall as a limitation.
Comment 6: "Specific definitions of some evaluation metrics, such as Mean@0.1, should be provided."
Our Response: Thank you for the feedback. We have clarified the metrics used.
- Action Taken: In Section 3.3.1, we have explicitly mentioned that we use "PCK@0.1 for a stricter evaluation" on the MPII dataset, making it clear what this metric refers to.
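As a point of reference for the metric discussed in this exchange, the sketch below shows one common way a head-normalised PCK score is computed: a joint counts as correct when its prediction error is at most a fraction `alpha` of the person's head size. This is an illustration under stated assumptions, not the evaluation script used in the manuscript; the function name `pckh`, the array layout, and the way the mean is aggregated over all visible joints are hypothetical choices, and the official MPII protocol may differ in detail.

```python
import numpy as np


def pckh(pred, gt, head_sizes, visible, alpha=0.1):
    """Head-normalised Percentage of Correct Keypoints.

    pred, gt:    (N, J, 2) predicted / ground-truth (x, y) coordinates
    head_sizes:  (N,) head segment length per person, in pixels
    visible:     (N, J) boolean mask of annotated joints
    alpha:       threshold as a fraction of head size (0.1 is stricter than 0.5)
    Returns the per-joint accuracy (J,) and the mean over all visible joints.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)       # (N, J) pixel error
    norm_dist = dist / head_sizes[:, None]          # normalise per person
    correct = (norm_dist <= alpha) & visible
    per_joint = correct.sum(axis=0) / np.maximum(visible.sum(axis=0), 1)
    mean = correct.sum() / max(visible.sum(), 1)
    return per_joint, mean


# One person, three joints, head size 20 px: the 0.1 threshold is 2 px.
gt = np.array([[[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]]])
pred = np.array([[[1.0, 0.0], [10.0, 5.0], [20.0, 0.0]]])
head = np.array([20.0])
vis = np.ones((1, 3), dtype=bool)

_, mean_strict = pckh(pred, gt, head, vis, alpha=0.1)   # errors 1, 5, 0 px -> 2 of 3 correct
_, mean_loose = pckh(pred, gt, head, vis, alpha=0.5)    # 10 px threshold -> all correct
print(mean_strict, mean_loose)
```

The example shows why the 0.1 threshold is described as stricter: the same predictions score lower than under the conventional 0.5 threshold.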
Comment 7: "It is recommended to include experiments with other backbone networks, or add relevant discussion on extensibility."
Our Response: We agree that discussing the potential for generalization is important.
- Action Taken: We have added a forward-looking statement in the final paragraph of Section 5 (Discussion). We now explicitly mention that applying our AFM and JAPE principles to other, more lightweight backbones is a promising avenue for future work, thereby addressing the extensibility of our method.
Thank you once again for your valuable contributions to improving our manuscript.
Sincerely,
The Authors
Reviewer 2 Report
Comments and Suggestions for Authors
1. I think it would be easier to understand if you describe the concept and features of PoseNet.
2. This paper does not seem to contain any content on Pose Estimation. Please add this content.
3. What is the difference between U-Net-like Multi-scale Fusion Decoder with AFM and MS-UNet from the previous paper?
4. You said that it is a new architecture that greatly improves the Simple Baseline framework. Please explain this architecture in detail.
5. The Joint Aware Position Encoding (JAPE) module seems too simple. Please express it considering other requirements.
6. The experiments in this paper are too short and there are few. Please present the experimental results considering the specific experimental results and the types of each element and measurement used in the experiment.
Author Response
Dear Reviewer #2,
We would like to express our sincere gratitude for your time and effort in reviewing our manuscript. Your feedback has been instrumental in helping us improve the clarity and depth of our paper. We have carefully addressed all your concerns, and the manuscript has been substantially revised.
Below are our point-by-point responses.
Comments 1, 2, & 4: "I think it would be easier to understand if you describe the concept and features of PoseNet. This paper does not seem to contain any content on Pose Estimation. Please add this content. ... Please explain this architecture in detail."
Our Response: We sincerely apologize if the initial manuscript was not clear enough. The entire paper is indeed focused on enhancing a well-known Human Pose Estimation model, Simple Baseline. We have thoroughly revised the manuscript to make the context, motivation, and architectural details much clearer.
- Action Taken: The Introduction has been restructured to first define the two core limitations of the Simple Baseline model. Our proposed architecture, AFJ-PoseNet, is then presented as a direct solution to these problems. We have also completely rewritten Section 3.1 with formal equations and a detailed textual description to thoroughly explain the AFM's architecture, ensuring it aligns perfectly with Figure 2.
Comment 3: "What is the difference between U-Net-like Multi-scale Fusion Decoder with AFM and MS-UNet from the previous paper?"
Our Response: Thank you for this question, which highlights a need for better differentiation. Our work is a novel architecture built upon Simple Baseline, and the proposed AFM is specifically designed for this context, not derived from MS-UNet.
- Action Taken: The revised Section 3.1 now provides a highly detailed, self-contained explanation of our AFM's design and mechanism, clarifying its unique role in our U-Net-like decoder. We believe this new description clearly establishes the novelty and specifics of our approach.
Comment 5: "The Joint Aware Position Encoding (JAPE) module seems too simple. Please express it considering other requirements."
Our Response: We agree that the initial description lacked sufficient theoretical motivation for JAPE's design. We have revised the text to better justify its architecture.
- Action Taken: We have rewritten the introductory paragraph of Section 3.2 to frame the JAPE module as a principled solution to the fundamental problem of "translation invariance" in CNNs. This provides a stronger rationale for its dual-path design, showing that its simplicity is a result of an elegant and targeted approach to a complex problem.
Comment 6: "The experiments in this paper are too short and there are few. Please present the experimental results considering the specific experimental results and the types of each element and measurement used in the experiment."
Our Response: We appreciate this feedback and have significantly expanded our experimental analysis to provide a more comprehensive evaluation.
- Action Taken: We have enriched our ablation study in Section 4.1 by adding a full computational analysis, including model parameters (Params) and inference speed (FPS) in Table 1. We also added a stricter evaluation metric (PCK@0.1) for the MPII dataset. Furthermore, we have entirely rewritten Section 5 (Discussion) to offer a much deeper and more thorough analysis of the results, the synergy between our modules, and the performance trade-offs.
We hope that our revisions have successfully addressed all your concerns and that the manuscript is now clearer and more comprehensive. Thank you again for your guidance.
Sincerely,
The Authors
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript introduces AFJ-PoseNet, an enhanced model for Human Pose Estimation (HPE) that extends the Simple Baseline framework by incorporating two main components: an Attention Fusion Module (AFM) for multi-scale feature fusion within a U-Net-like decoder, and a Joint-Aware Positional Encoding (JAPE) module that injects both global and joint-specific spatial priors. Experimental results on the MPII and COCO datasets demonstrate that these enhancements yield consistent improvements in keypoint localization accuracy, with a reported 2.2 percentage point AP gain on COCO over a strong baseline. Ablation studies further validate the contribution of each proposed module.
Strengths
The paper provides a systematic architectural improvement over an established baseline without introducing excessive complexity. The integration of AFM into the decoder effectively addresses feature bottlenecks, enhancing fine-grained localization. The JAPE module is presented as a flexible method for introducing spatial priors, and the empirical evaluation is comprehensive, including fair ablations and comparisons with both classic and recent state-of-the-art methods. The work is clearly motivated by the limitations of existing frameworks, and experimental protocols are carefully described, supporting the reproducibility of results.
Weaknesses
- The rationale for introducing absolute positions in the HPE task is insufficiently discussed. While the JAPE module incorporates absolute coordinates, the manuscript would benefit from a clearer explanation or motivation as to why this information is essential for pose estimation, given the translation-invariant nature of CNNs.
- It is recommended that the authors review the study in shape analysis. For example, would incorporating invariant features into the proposed method have the potential to improve its performance and generalization ability? Zhang et al. Differential and integral invariants under Möbius transformation. PRCV 2018. Li et al. A rotation-invariant framework for deep point cloud analysis. IEEE TVCG
- Figure resolution is noted to be low. The manuscript would be improved by providing higher-quality figures to aid in the reader’s understanding of the model architecture and module design.
- There is a lack of discussion on the compatibility of equation (1) in the AFM with the output shapes depicted in Figure 2. Specifically, clarification is needed on how the 1×1 output of the attention map (A) is broadcast or applied to the skip_feature and y_up tensors, as their dimensions must be compatible for the Hadamard product.
Author Response
Dear Reviewer #3,
We are very grateful for your thorough review and the insightful, detailed feedback you have provided. Your comments on the theoretical motivation and technical clarity were particularly helpful, and we have substantially revised our manuscript to address every point you raised.
Please find our detailed responses below.
Weakness 1: "The rationale for introducing absolute positions in the HPE task is insufficiently discussed."
Our Response: This is an excellent point. We agree that a stronger motivation was needed.
- Action Taken: We have significantly revised the beginning of Section 3.2 to provide a solid theoretical foundation for introducing absolute positional information. We now explicitly link this necessity to overcoming the inherent "translation invariance" of CNNs, which is a key limitation for localization-sensitive tasks like Human Pose Estimation. This reframes the JAPE module as a fundamental architectural improvement rather than an incremental addition.
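To make the idea of injecting absolute position concrete, the sketch below appends two normalised coordinate channels to a feature map, so that later layers can distinguish the same local pattern at different image locations. This is a generic coordinate-channel construction shown only for illustration; it is not the JAPE module described in the manuscript, and the function name `add_coord_channels` and the [-1, 1] normalisation are assumptions.

```python
import numpy as np


def add_coord_channels(features):
    """Append normalised x and y coordinate channels to a feature map.

    features: (N, C, H, W) array
    Returns an (N, C + 2, H, W) array whose last two channels hold the
    x and y position of each pixel, scaled to the range [-1, 1].
    """
    n, _, h, w = features.shape
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")      # each (H, W)
    coords = np.stack([xx, yy])[None]                 # (1, 2, H, W)
    coords = np.broadcast_to(coords, (n, 2, h, w))    # share the grid across the batch
    return np.concatenate([features, coords], axis=1)


feats = np.zeros((1, 3, 4, 6))
out = add_coord_channels(feats)
print(out.shape)   # three feature channels plus x and y
```

A convolution applied to `out` sees position as an ordinary input channel, which is one simple way to give an otherwise position-agnostic filter access to absolute location.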
Weaknesses 2 and 4: "It is recommended that the authors review the study in shape analysis... rotation-invariant framework... There is a lack of discussion on the compatibility of equation (1) in the AFM... Specifically, clarification is needed..."
Our Response: Thank you for these two very sharp and valuable comments.
- Action Taken (Rotation Invariance): We agree that rotation invariance is an important research direction. While it is beyond the scope of our current work, we have acknowledged its importance. In the final paragraph of Section 5 (Discussion), we now explicitly identify rotation variance as a remaining challenge and propose exploring rotation-invariant features as a key direction for future work.
- Action Taken (Equation Compatibility): You are absolutely correct to point out the ambiguity in the original Equation (1) and its description. We apologize for this lack of clarity. We have completely rewritten Section 3.1 to resolve this. We removed the old equation and introduced a new, more precise set of equations (Equations 2 and 3) along with a detailed textual description. This new presentation now perfectly and unambiguously matches the data flow shown in Figure 2, clarifying how the attention map is generated from concatenated features and then applied.
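The shape question raised by the reviewer can be illustrated with a small sketch. It assumes a single-channel attention map produced by a 1×1 convolution over the concatenated features, which is then broadcast across the channel dimension for the Hadamard product. The revised Equations 2 and 3 are not reproduced in this record, so the complementary weighting used below and the names `attention_fuse`, `weight`, and `bias` are assumptions for illustration, not the authors' formulation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention_fuse(skip_feature, y_up, weight, bias=0.0):
    """Fuse two feature maps with a broadcast single-channel attention map.

    skip_feature, y_up: (N, C, H, W) arrays of matching shape
    weight:             (2C,) weights of a 1x1 convolution with one output channel
    """
    concat = np.concatenate([skip_feature, y_up], axis=1)       # (N, 2C, H, W)
    logits = np.einsum("nchw,c->nhw", concat, weight) + bias    # 1x1 conv -> (N, H, W)
    attn = sigmoid(logits)[:, None, :, :]                       # (N, 1, H, W)
    # The size-1 channel axis broadcasts against C, so every channel at a
    # given pixel is scaled by the same attention value (assumed fusion rule).
    fused = attn * skip_feature + (1.0 - attn) * y_up           # (N, C, H, W)
    return attn, fused


rng = np.random.default_rng(0)
skip = rng.normal(size=(2, 4, 3, 5))
up = rng.normal(size=(2, 4, 3, 5))
attn, fused = attention_fuse(skip, up, weight=np.zeros(8))
print(attn.shape, fused.shape)
```

With all-zero weights the attention value is 0.5 everywhere, so the output reduces to the average of the two inputs, which gives a quick sanity check on the broadcasting.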
Weakness 3: "Figure resolution is noted to be low."
Our Response: We apologize for the poor quality of the figures in the initial submission.
- Action Taken: We have re-exported all figures in high resolution and have embedded them in the revised manuscript to ensure they are clear and legible.
We are confident that your feedback has allowed us to significantly improve the scientific quality and presentation of our paper. Thank you once again for your expert guidance.
Sincerely,
The Authors
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for your hard work revising this paper.
It's always a pleasure to see people writing papers.
I've confirmed that the issues raised in this paper have been almost completely addressed.
This paper adheres to the required format and procedures, and its quality is satisfactory, so I'm giving it an Acceptance rating.
Thank you.