Lightweight Unsupervised Homography Estimation for Infrared and Visible Images Based on UAV Perspective Enabling Real-Time Processing in Space–Air–Ground Integrated Network
Bei Cheng
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Overall, the paper lacks sufficient methodological novelty, and there are several issues with the writing. For example, there are errors in the figures, incomplete equations, and an insufficient experimental setup.
- In Section 3.2.1, include a schematic diagram to help illustrate the mechanism of the shift module.
- In Section 3.2.2, add an equation to clarify the processing flow of the GSA.
- After Equation 2, provide the complete formula for the CSSA processing pipeline.
- Figure 5 appears to be incorrect: the three images in the "BRISK + MAGSAC++" column are identical, all showing results from the Synthetic Benchmark. The results for the UHBD and NIUHBD datasets are missing.
- In Figure 6, the advantages of LFHomo do not seem to be clearly demonstrated on the UHBD dataset.
- In the experiments section, add an ablation study for the SRCSSA module, comparing the results of the CSSA operation with the standard channel shuffle operation to demonstrate the necessity of grouping.
- The ablation study should include a separate comparison between the LFHomoE homography estimator and other homography estimators to highlight the effectiveness of LFHomoE.
- The number of recent methods used for comparison is too limited. It is recommended to include and discuss some newer methods from 2024-2025 for a more comprehensive comparison.
- It appears that the comparison with feature-based homography estimation methods is presented in Figure 6, but it should correctly belong to Figure 5.
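For reference, the "standard channel shuffle operation" named as the ablation baseline above can be sketched as follows. This is the generic reshape-transpose-reshape shuffle, written here with NumPy; it is not the authors' CSSA, and the helper name `shuffled_channel_order` is introduced only for illustration.

```python
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Standard channel shuffle on an (N, C, H, W) array.

    Channels are viewed as (groups, C // groups), the two axes are
    swapped, and the result is flattened back to C channels, so that
    consecutive output channels come from different groups.
    """
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)

def shuffled_channel_order(channels: int, groups: int) -> list:
    """Hypothetical helper: which input channel ends up at each output index."""
    x = np.arange(channels).reshape(1, channels, 1, 1)
    return channel_shuffle(x, groups)[0, :, 0, 0].tolist()

print(shuffled_channel_order(6, 2))  # [0, 3, 1, 4, 2, 5]
```

With `groups=1` the operation is the identity, which is one way to frame the "no grouping" end of the requested comparison.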
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
I have read your text with attention. The material you intend to publish appears to me to be extremely interesting, and I believe that once published, it will make a significant contribution to the current state of knowledge in the field of image recognition within modern photogrammetry.
The article presents an original solution concerning a new methodology for conducting observations using photogrammetric imagery acquired from a low altitude (from UAV platforms). This approach is important because it can enable the monitoring of changes in various phenomena in near real-time (or even in real-time). This will certainly enrich the currently designed evaluation systems for both areas and individual topographic (or engineering) objects. I consider this approach to be highly valuable and am convinced that the material should, therefore, be published.
Before this happens, however, I would like to express a few suggestions and comments, to which I kindly request a response.
- Regarding the essence of the work, the authors inform the reader about it in multiple places: the Abstract, Introduction, Methodology section, and Conclusion. It seems to me that the reader is well aware of the subject matter of the study, and there is no need to repeatedly state the core problem the authors are solving. In this context, it would be worthwhile to unify the text appropriately so that it presents a cohesive and logical whole.
- The “Introduction” section spans four full pages, which, considering that it is immediately followed by the Literature Review, constitutes an overly extensive scope. I believe that, without detriment to the overall work, you could either shorten the Introduction and move further discussions to the subsequent chapter, or even combine these two chapters into one, so that you proceed directly to the “Materials and Methods” section afterwards. This would significantly improve the cohesion of the current text and make it more transparent.
- In line 247, the authors reintroduce the reader to the subject of further discussion. This approach lengthens the text and slightly hinders its reception. Given the immense volume of scientific material available online, one should strive to synthesize thoughts and present cohesive and logical texts. Please consider my suggestion and incorporate the necessary modifications to the text.
- Figure 5 and Table 4 (“Not a Number” Notation): Figure 5 contains the information “Not a Number”—this seems either unnecessary in this context or requires an additional comment. A similar issue applies to Table 4. If a particular record does not meet the conditions related to the variable type, you should simply insert a dash (hyphen) and then explain the appearance of this type of notation in the text.
- In lines 673–677, there is a further description of the study's subject matter. This relates to my first point regarding repetitions. I therefore urge you to ensure the consistency of the text.
- Among the editorial remarks, I noticed unnecessary spaces in the text (e.g., in line 650, etc.). Please also check the punctuation and potential repetitions. I suggest conducting a thorough proofreading before resubmitting the text for review.
After making the necessary corrections, please send the material for a follow-up and hopefully final check before publication.
I wish you success in your work!
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Article: Lightweight Unsupervised Homography Estimation for Infrared and Visible Images Based on UAV Perspective Enabling Real-Time Processing in Space-Air-Ground Integrated Network
Review Report
1. Introduction (Lines 58–170)
The text is informative but lacks an explicit hypothesis about the expected impact of the proposed method.
The article only describes the “low-altitude homography challenge” but does not define a measurable scientific objective (Lines 72–80).
Improvement: Suggest the clear formulation of a central hypothesis, relating the variables. The authors should demonstrate that the proposed method reduces homography error by x% compared to other state-of-the-art methods.
2. Literature Review (Lines 171–244)
A critical analysis of the trade-offs between GNNs (such as homoViG) and Transformers (LCTrans, FCTrans) should be added, emphasizing the gap that LFHomo aims to fill — robustness and lightness in multimodal UAV applications.
Improvement: Consolidate the discussed methods in a comparative table and emphasize the real gap — the absence of lightweight and robust solutions for multimodal UAV image registration.
3. Methodology (Lines 245–447)
Problem: The description is overly technical but fails to justify the choice of parameters and hyperparameters (e.g., why λ = 1.0 and μ = 0.01 in line 410).
There is no formal analysis of computational complexity to validate the use of the term “lightweight.”
Improvement: Include benchmarks of GPU/memory usage and provide a discussion of each hyperparameter’s influence.
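As a rough illustration of the kind of complexity accounting requested above, the sketch below counts parameters and multiply-accumulate operations (MACs) for a single convolution layer and compares a standard 3x3 convolution with a depthwise-separable one. The layer sizes are arbitrary examples, not values from the manuscript, and biases are ignored.

```python
def conv_cost(c_in: int, c_out: int, k: int, h_out: int, w_out: int, groups: int = 1):
    """Return (parameters, MACs) of a k x k convolution without bias.

    Each output channel sees c_in // groups input channels, and every
    weight is used once per output pixel.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("c_in and c_out must be divisible by groups")
    params = (c_in // groups) * c_out * k * k
    macs = params * h_out * w_out
    return params, macs

# Example sizes (assumed for illustration only).
c_in, c_out, h, w = 64, 128, 32, 32

standard = conv_cost(c_in, c_out, 3, h, w)

# Depthwise 3x3 (one filter per channel) followed by a pointwise 1x1.
depthwise = conv_cost(c_in, c_in, 3, h, w, groups=c_in)
pointwise = conv_cost(c_in, c_out, 1, h, w)
separable = (depthwise[0] + pointwise[0], depthwise[1] + pointwise[1])

print("standard :", standard)   # (73728, 75497472)
print("separable:", separable)  # (8768, 8978432)
```

Counts of this kind complement, rather than replace, measured GPU memory and latency, since memory-bound operations such as shifts and shuffles contribute few MACs but still take time.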
4 and 5. Results and Discussion (Lines 448–671)
Problem: The experiments are well structured, but there is no statistical testing to verify whether the differences between methods are significant.
The paper reports only the Average Corner Error (ACE) without confidence intervals or standard deviations, compromising comparative robustness.
Additionally, the discussion is more descriptive than analytical — it acknowledges limitations (e.g., deployment on UAVs) but does not propose quantitative ways to overcome them.
Improvement Suggestion: Include boxplots and t-tests comparing methods (LFHomo vs. homoViG) to strengthen the statistical analysis.
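A minimal sketch of the statistics suggested above, under the usual definition of ACE as the mean Euclidean distance between the four image corners warped by the estimated and by the ground-truth homography. The function names and the normal-approximation confidence interval are assumptions made for illustration; they do not describe the authors' evaluation code.

```python
import numpy as np

def warp_points(H: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply a 3x3 homography to an (N, 2) array of points."""
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

def average_corner_error(H_est, H_gt, width: int, height: int) -> float:
    """Mean L2 distance between the four corners under H_est and H_gt."""
    corners = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype=float,
    )
    diff = warp_points(np.asarray(H_est, float), corners) - warp_points(
        np.asarray(H_gt, float), corners
    )
    return float(np.linalg.norm(diff, axis=1).mean())

def summarize(errors):
    """Mean, sample standard deviation, and an approximate 95% interval
    for the mean (normal approximation, reasonable for large test sets)."""
    e = np.asarray(errors, dtype=float)
    mean = float(e.mean())
    std = float(e.std(ddof=1))
    half = 1.96 * std / np.sqrt(e.size)
    return mean, std, (mean - half, mean + half)

def paired_t_statistic(errors_a, errors_b) -> float:
    """t statistic of a paired test on per-image errors of two methods
    evaluated on the same image pairs (n - 1 degrees of freedom)."""
    d = np.asarray(errors_a, float) - np.asarray(errors_b, float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(d.size)))
```

The per-image error lists passed to `paired_t_statistic` must be aligned, that is, entry i of both lists refers to the same test pair. If SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with a p-value.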
6. Conclusion (Lines 672–700)
There is confusion between the discussion of results and future work. The conclusion should clearly separate the interpretation of findings from the proposal of future research directions.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The paper presents a novel lightweight unsupervised homography estimation method (LFHomo) for aligning infrared and visible images captured from UAV perspectives. The work addresses practical challenges in low-altitude scenarios, such as motion blur and computational constraints, by integrating an anti-blurring feature extractor and a CNN-GNN hybrid estimator. The contributions include the design of innovative components (e.g., IRSC and SRCSSA) and the creation of a new UAV-based dataset (UHBD and NIUHBD). While the paper demonstrates potential for real-time applications in edge-computing environments, significant issues in methodology clarity, experimental rigor, and presentation hinder its acceptability.
1) Section 3 provides an overview of LFHomo but omits key implementation details. For instance, the IRSC block does not specify the shift magnitude or channel partitioning strategy, and the SRCSSA module lacks clarity on hyperparameters (e.g., the number of groups G). The integration of ViG blocks in LFHomoE is also superficially described, making replication difficult.
Suggestions:
- In Section 3.2.1, add pseudocode or mathematical formulations for the IRSC block, explicitly defining the shift operations (e.g., pixel displacement values) and their interaction with the inverse residual structure.
- In Section 3.2.2, detail the SRCSSA parameters, such as the group size G and spatial reduction ratio, and justify their choices relative to blur handling.
- In Section 3.3, expand Table 2 to include intermediate feature dimensions and computational costs. Discuss how the ViG block’s graph attention mechanism is adapted for homography estimation.
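To make concrete the level of detail being requested for the shift operation, the sketch below shows one generic channel-wise spatial shift: channels are split into five chunks, four of which are displaced by a fixed number of pixels in one direction each, with zero fill at the border. The five-way partition, the one-pixel default, and the zero padding are all assumptions chosen for illustration; the manuscript's IRSC block may differ on every one of these points.

```python
import numpy as np

def spatial_shift(x: np.ndarray, shift: int = 1) -> np.ndarray:
    """Generic channel-wise spatial shift on a (C, H, W) feature map.

    Hypothetical partition: with g = C // 5, the first four chunks of g
    channels are shifted left, right, up, and down respectively, and the
    remaining channels are copied unchanged. Vacated pixels are zero.
    """
    c, h, w = x.shape
    if not 1 <= shift < min(h, w):
        raise ValueError("shift must satisfy 1 <= shift < min(H, W)")
    g = c // 5
    out = np.zeros_like(x)
    out[0 * g:1 * g, :, :-shift] = x[0 * g:1 * g, :, shift:]   # shift left
    out[1 * g:2 * g, :, shift:] = x[1 * g:2 * g, :, :-shift]   # shift right
    out[2 * g:3 * g, :-shift, :] = x[2 * g:3 * g, shift:, :]   # shift up
    out[3 * g:4 * g, shift:, :] = x[3 * g:4 * g, :-shift, :]   # shift down
    out[4 * g:] = x[4 * g:]                                     # unchanged
    return out
```

A specification at this granularity (which channels move, by how many pixels, and how borders are filled) is what a reader would need in order to reproduce the block.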
2) Experiments are limited to synthetic and self-constructed datasets, lacking validation on public large-scale benchmarks (e.g., MS-COCO or KITTI). The comparison with existing methods is insufficient (e.g., omitting state-of-the-art techniques like LoFTR or SuperGlue), and evaluation relies solely on Average Corner Error (ACE), ignoring robustness metrics (e.g., repeatability score or SSIM).
Suggestions:
- In Section 4.1, include tests on additional public datasets to demonstrate generalization beyond UAV-specific scenarios.
- In Section 4.2, introduce complementary metrics such as matching precision-recall or outlier ratios to provide a holistic performance assessment.
- In Sections 4.3–4.4, expand comparisons to include recent methods (e.g., Transformer-based matchers) and perform statistical significance testing (e.g., p-values).
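One of the complementary metrics mentioned above, the outlier ratio of a set of point matches, can be sketched as follows when a ground-truth homography is available. The function name and the 3-pixel threshold are assumptions for illustration; recall would additionally require the set of all ground-truth correspondences, which is not modelled here.

```python
import numpy as np

def match_precision(src_pts, dst_pts, H_gt, threshold: float = 3.0):
    """Return (precision, outlier_ratio) of putative matches.

    A match (src, dst) counts as correct when src, projected by the
    ground-truth homography H_gt, lands within `threshold` pixels of dst.
    """
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    if src.shape != dst.shape or src.ndim != 2 or src.shape[1] != 2:
        raise ValueError("src_pts and dst_pts must both have shape (N, 2)")
    if src.shape[0] == 0:
        raise ValueError("at least one match is required")
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])
    proj = src_h @ np.asarray(H_gt, dtype=float).T
    proj = proj[:, :2] / proj[:, 2:3]
    err = np.linalg.norm(proj - dst, axis=1)
    precision = float((err <= threshold).mean())
    return precision, 1.0 - precision
```

Reporting this value alongside ACE separates failures of the matching stage from failures of the homography fit, which ACE alone cannot distinguish.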
3) Section 2 provides a basic literature review but fails to deeply contrast LFHomo with relevant lightweight models (e.g., MobileViT or EfficientFormer) or blur-handling techniques. The novelty of SRCSSA relative to existing attention mechanisms (e.g., CBAM or CA) is underexplored.
Suggestions:
- Revise Section 2.2 to include a dedicated discussion on how LFHomo differs from other CNN-GNN hybrids in terms of efficiency-accuracy trade-offs.
- In Section 5, add a comparative analysis of SRCSSA with standard attention methods, using ablation results from Table 7 to highlight its unique contributions.
- In Section 4.1, specify the train/validation/test split ratios and class distributions for the datasets to enhance reproducibility.
5) In Section 4.5.2, include a hyperparameter sensitivity analysis for SRCSSA (e.g., varying G) to strengthen the ablation study.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 5 Report
Comments and Suggestions for Authors
Please find my comments in the attachment.
Comments for author File:
Comments.pdf
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Dear authors!
Thank you for preparing a revised version of your article. I have read this new text carefully and I find that you have considered all my comments and suggestions, which I greatly appreciate. I believe that thanks to this, the new version of the article is, above all, more transparent and better conveys the intended content. It can also be seen that a significant portion of the material has been improved. Sometimes, such corrections are necessary, so thank you for putting such significant effort into preparing the manuscript.
Given that I do not see any further significant shortcomings, I recommend that the text be published after making the necessary corrections suggested by the editor.
I wish the authors good luck and further fruitful research!
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have provided responses to all review comments, significantly enhancing the scientific quality and presentation of the manuscript through extensive supplementary experiments and discussions. This work proposes a lightweight method for homography estimation between infrared and visible images from a UAV perspective, demonstrating clear innovation and practical value. The manuscript now essentially meets the requirements for publication, and I recommend its acceptance.
