Open Access Article

Peer-Review Record

Generative Data Augmentation for ArUco-Free RGB-Based 6-DoF Object Pose Estimation

J. Imaging 2026, 12(6), 244; https://doi.org/10.3390/jimaging12060244

by Carmelo Scribano¹, Iacopo Ferrari¹, Giorgia Franchini¹,*, Elena Govi², Davide Sapienza¹, Tobia Poppi¹, Micaela Verucchi² and Marko Bertogna¹

Reviewer 1: Anonymous

Reviewer 2: Arsanchai Sukkuea

Reviewer 3: Anonymous

Submission received: 23 March 2026 / Revised: 21 May 2026 / Accepted: 27 May 2026 / Published: 29 May 2026

(This article belongs to the Special Issue AI-Driven Image and Video Understanding)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Please see major comments attached.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript investigates a highly relevant issue in 6-DoF object pose estimation: the unintended shortcut bias induced by fiducial ArUco markers in the widely used Linemod dataset. While the premise is strong and the qualitative analyses are well-executed, the quantitative evaluation is somewhat limited in scope, and concerns regarding the reproducibility of the generative augmentation process should be addressed before publication.

Major

The quantitative analysis in Section 4.3 is restricted to only two objects: Object 4 (Camera) and Object 8 (Driller). Although Table 1 indicates that the ArUco-Free Linemod (AF-LM) subset covers 12 object classes, evaluating only two of them makes it difficult to ascertain whether the generative augmentation mitigates the bias in general or whether the results are tied to the geometric properties of these specific items. The authors should expand the quantitative evaluation to include all objects processed in the AF-LM dataset.
The methodology relies on a commercial generative model, Nano Banana 2 (Gemini 3.1 Flash Image), for marker removal. To ensure rigorous scientific reproducibility, the authors must clarify how this dataset generation can be replicated. Although the Data Availability Statement mentions the data is "openly available", explicitly linking to the repository containing the final ArUco-Free dataset within the methodology section would strengthen the paper.
In Section 3.2, the adapted formulation replaces the ReLU activation with an absolute value operation. While the justification for retaining negative gradients in regression is sound, taking the absolute value of the final weighted sum might obscure important directional influences in the feature maps. A brief discussion on the implications of this absolute value operation regarding the interpretability of the final saliency map would benefit readers with a specialized interest in interpretable machine learning.
While the visual analysis of the Guided Backpropagation and Generalized Grad-CAM maps is compelling, relying solely on qualitative heatmaps can be subjective. To strengthen the claim that the network's attention has fundamentally shifted away from background biases, consider introducing a quantitative metric for the saliency maps. For example, calculating the Intersection over Union (IoU) or the Pointing Game accuracy between the generated saliency map threshold and the ground-truth segmentation mask of the target object would provide an objective measure of focus.
In Section 2.1.4, the HOPE dataset is explicitly highlighted as being advantageous because it does not utilize markers during acquisition. A true test of the ArUco-Free model's generalization capabilities would be zero-shot testing on a completely different markerless dataset like HOPE or T-LESS. Currently, the model is only evaluated on a synthetically modified subset of Linemod, which proves it can handle the inpainting, but not necessarily that it has learned a universal, domain-agnostic representation of the objects.
The manuscript correctly acknowledges that models trained on the ArUco-Free dataset still rely on spatial patterns to some degree. However, it is important to address whether the generative inpainting process using the Nano Banana 2 model introduced new, high-frequency artifacts. CNNs are highly sensitive to generative noise patterns. Adding a control experiment—such as masking the ArUco markers with simple Gaussian noise or using classical, non-generative inpainting (Navier-Stokes)—would help isolate whether the improvement is strictly due to the realistic generative background or simply the destruction of the marker pattern.
In Table 1, the ArUco-Free (AF) Linemod training splits are significantly smaller than the original Linemod splits (e.g., 180 images for the Camera in AF-LM compared to 1020 in the Original LM). While the manuscript mentions using stratified cross-validation (k=3) to mitigate this, it should be explicitly stated whether the EfficientPose model was trained from scratch on these small subsets or fine-tuned from existing weights. If trained from scratch, the performance degradation observed might be partially attributed to the reduced volume of training data rather than the domain shift alone.
The methodology notes that the single-channel saliency maps are normalized to the interval [0, 255] using min-max scaling. When comparing saliency maps across different domains (Original vs. ArUco-Free), this independent normalization might mask differences in the absolute magnitude of the network's activations. A brief discussion on whether the overall confidence or raw gradient magnitudes dropped in the ArUco-Free scenarios would provide a deeper understanding of the network's uncertainty.
In Equation (3), the inverse cosine function is formatted as standard text. It should be updated to use proper mathematical operator formatting for academic rigor. Additionally, in Equation (2), it is standard practice to denote the Euclidean distance with a subscript.
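
The saliency-overlap metrics suggested in the Major comments above (IoU and Pointing Game against the ground-truth segmentation mask) could be computed along the following lines. This is a minimal sketch and not the authors' implementation: the function names, the quantile-based threshold, and its default value of 0.9 are illustrative assumptions, and the saliency map is assumed to be a 2-D array aligned pixel-for-pixel with a binary object mask.

```python
import numpy as np


def saliency_iou(saliency, mask, quantile=0.9):
    """IoU between the most salient pixels and a binary object mask.

    The threshold is the given quantile of the saliency values
    (an assumed choice; any fixed or adaptive threshold could be used).
    """
    saliency = np.asarray(saliency, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    threshold = np.quantile(saliency, quantile)
    salient = saliency >= threshold
    union = np.logical_or(salient, mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(salient, mask).sum() / union)


def pointing_game_hit(saliency, mask):
    """True if the single most salient pixel lies inside the object mask."""
    saliency = np.asarray(saliency, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(mask[row, col])
```

Pointing Game accuracy over a test set is then the fraction of images for which `pointing_game_hit` returns `True`. Comparing both scores between models trained on the original and on the ArUco-Free data would give a numeric counterpart to the qualitative heatmaps.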

Minor

The caption for Figure 6 states, "The Translation subnetworks share an identical structure", whereas the main text notes, "The translation network on the other side shares a similar structure". Please harmonize this terminology to clarify if the architectures are strictly identical or merely similar.
There is a recurring typographical error throughout the manuscript where the word "Figure" is duplicated in the text. For instance, "in Figure Figure 1", "see Figure Figure 2", "Figure Figure 3", and "Figure Figure 4". Please delete the redundant word in each of these instances.
The unnumbered heading "Data Preparation" in Section 4.1 breaks the structural hierarchy. Change this heading to "4.1.1 Dataset preparation" for better flow and consistency with the rest of the manuscript.
Please review the display equations throughout the manuscript. Ensure standard mathematical formatting by making sure to delete the "." at the end of any equations. Equations should stand cleanly on their own without trailing sentence punctuation.
In Section 4.3.1, Table 2 and Table 3 mention weights trained. It would be helpful to briefly clarify what represents in this context (e.g., is it a hyperparameter for the EfficientPose backbone scaling?), as it is not explicitly defined in the methodology section beforehand.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The article discusses "Generative Data Augmentation for ArUco-free RGB-based 6-DoF Object Pose Estimation." Abbreviations should be given where they first appear. The full form of the abbreviation 6-DoF is not given in the Abstract section. Instead, it is given in the Introduction section. After a brief explanation below the Introduction, subheadings 1.1, 1.2, etc. should be provided. In the Related Works section, subheadings 2.1.1. Linemod and Linemod-Occluded, 2.1.2. T-LESS, 2.1.3. HomebrewedDB, and 2.1.4. HOPE are described superficially and briefly. More detail should be provided. In the middle of subheading 2.3. 6-DoF Pose Estimation Models, there is a paragraph beginning with the sentence, "In this paper, we will focus on the specific category of RGB-Input,...". After reviewing the subheadings, the authors should highlight the superior features of their own work. A subheading titled "Contributions of This Work" could be created for this purpose. The equations are not numbered. All equations should be numbered. For example, equations between lines 340-354 are unnumbered (page 11). A new heading, "Future Works," should be added. In this section, the authors should describe the future directions of the study.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents a timely and highly relevant investigation into shortcut biases within 6-Degrees-of-Freedom object pose estimation, focusing on the widely used Linemod dataset.

Major comments

The experiments exclusively utilize the EfficientPose model with the scaling hyperparameter set to $\phi = 0$, which is the most computationally efficient configuration. While this is appropriate for baseline testing, it would be beneficial for the authors to briefly discuss or hypothesize whether increasing the model capacity might exacerbate or reduce the shortcut learning effect.
The study utilizes a commercial generative engine (Gemini 3.1 Flash Image) accessed via API calls to synthesize the ArUco-Free dataset. Because commercial models undergo continuous updates, exact pixel-for-pixel reproducibility of the dataset generation process may be challenging for future researchers. Providing the exact date the API was accessed or hosting the generated ArUco-Free dataset in a public repository (which is noted in the Data Availability Statement) mitigates this issue well.
The authors astutely observe that generative inpainting does not fully eliminate background reliance, as the network shifts its attention to alternative spatial patterns on the board due to rigid synchronous motion. It would add value to the discussion (Section 4.4 or 5.1) to briefly suggest specific future techniques that could break this rigid spatial correlation.
The qualitative shift shown in the saliency maps (Figures 10, 11, and 12) provides excellent visual proof of the hypothesis. The quantification of these maps (Tables 3 and 4) into relative attention scores is particularly commendable, as it translates subjective visual maps into objective, analyzable metrics.
The quantitative results in Table 6 clearly illustrate the collapse in performance upon domain shift, with the Average Distance (ADD) metric dropping to $0.39 \pm 0.0632$ on the ArUco-Free dataset. The recovery of this metric to $0.92 \pm 0.0090$ when trained on the augmented dataset provides robust statistical validation of the proposed pipeline.
The proposed "Generalized Grad-CAM" for regression tasks replaces the ReLU activation with an absolute value and uses an $L_2$ norm to prevent gradient cancellation. While mathematically sound for capturing the magnitude of sensitivity, this fundamentally discards the directionality of the gradients. The resulting maps show where the model is sensitive, but not whether that sensitivity pushes the translation/rotation estimate closer to or further from the ground truth. Clarifying this limitation in the text would improve the theoretical rigor of the interpretability section.
To definitively prove that the ArUco markers are causing a shortcut bias, the Grad-CAM visualizations could be significantly strengthened by cross-validating them with perturbation-based Explainable AI (XAI) frameworks, such as SHAP (SHapley Additive exPlanations) or LIME. While Grad-CAM highlights the network's spatial focus, applying LIME or SHAP to systematically occlude or perturb the ArUco markers—and subsequently quantifying the exact numerical shift in the predicted 6-DoF vector—would provide a definitive, model-agnostic validation of the shortcut learning hypothesis.
The prompt used for the generative inpainting instructs the model to replace the markers with "consistent objects from the rest of the background" (small tools, cables). The authors should briefly discuss the risk of the generative model introducing its own systemic biases. If the inpainting engine repeatedly generates similar textures or specific tools across the ArUco-Free dataset, the EfficientPose model might simply learn these newly synthesized artifacts as a new contextual shortcut.
Generative inpainting can sometimes struggle with global illumination consistency. If the synthesized background patches do not cast appropriate shadows or reflect the scene's ambient light correctly, the network might learn to identify these photometric inconsistencies rather than focusing strictly on the target object's geometry. A brief note on how photometric consistency was verified (beyond simple pixel-wise differences) would be beneficial.
The authors mention that symmetric ADD (ADD-S) is typically preferred for models with strong symmetries, but state that such models are not considered in this work. However, objects like the "Can" or "Glue" in the Linemod dataset exhibit partial symmetries. A brief justification of why standard ADD remains sufficient for these specific instances would preemptively address reader concerns.
In Equation 9, the formulation for $\Delta\Theta$ relies on the trace of the relative rotation matrix. Ensuring that the rotation matrices are strictly orthogonal with unit determinant, i.e., elements of $SO(3)$, is critical here. It would be helpful to explicitly state whether the output of the EfficientPose refinement module guarantees proper orthonormalization before this metric is calculated.
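
The orthonormalization concern raised in the last Major comment above can be illustrated with a short sketch. It assumes that Equation 9 has the usual trace-based form, $\Delta\Theta = \arccos\big((\mathrm{tr}(R_{pred}^{\top} R_{gt}) - 1)/2\big)$; the manuscript's exact formula is not reproduced in this record, so that form, the function names, and the SVD projection step are all assumptions rather than the authors' method.

```python
import numpy as np


def project_to_so3(matrix):
    """Nearest proper rotation (Frobenius norm) to a 3x3 matrix, via SVD."""
    u, _, vt = np.linalg.svd(np.asarray(matrix, dtype=float))
    # Flip the last singular direction if needed so that det(R) = +1.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt


def rotation_error_deg(r_pred, r_gt):
    """Trace-based angular error in degrees (assumed form of Equation 9)."""
    r_pred = project_to_so3(r_pred)
    r_gt = project_to_so3(r_gt)
    cos_theta = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
```

Without the projection, a predicted matrix that is only approximately orthogonal changes the trace and therefore biases the reported angle; without the clipping, round-off can push the arccos argument outside its domain and yield NaN.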

Minor comments

The text states, "Recovering the 6-Degrees-of-Freedom (often referred to as 6-DoF or 6-DoF) pose...". The repetition of "6-DoF" is redundant and should be corrected.
The phrase "resorting to manual labeling, this is not the case for 6-DoF object pose" could be slightly rephrased for better grammatical flow. (e.g., "resorting to manual labeling; however, this is not the case for 6-DoF object pose estimation...").
The bullet point asks, "Do the network exploit markers and objects surrounding objects as shortcuts for pose estimation?". This should be corrected to "Does the network exploit..." to match the singular subject, and "objects surrounding objects" could be streamlined to "surrounding objects".
The text lists "[26-28]" but in line 231 it lists "[31-35]". Ensure consistent spacing or formatting for citation brackets throughout the manuscript.
The text mentions "Efficien Pose" which is missing the "t". This should be corrected to "EfficientPose".

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
