Open Access Article

Peer-Review Record

Unsupervised Multimodal UAV Image Registration via Style Transfer and Cascade Network

Remote Sens. 2025, 17(13), 2160; https://doi.org/10.3390/rs17132160

by Xiaoye Bi^1,2, Rongkai Qie^3, Chengyang Tao^1,*, Zhaoxiang Zhang^1 and Yuelei Xu^1

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 30 April 2025 / Revised: 12 June 2025 / Accepted: 17 June 2025 / Published: 24 June 2025

(This article belongs to the Special Issue Advances in Deep Learning Approaches: UAV Data Analysis)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

I appreciate the opportunity to review this manuscript and acknowledge the effort put into this work. The paper presents an interesting and relevant topic, and I have provided my comments and suggestions below to help improve its clarity, coherence, and overall contribution. I hope these insights will help strengthen the manuscript.

I suggest stating your findings more specifically and concisely in the abstract.
You mentioned in the Abstract that your method improved the accuracy by 9.27% compared to other methods. Do you consider this improvement sufficient?
Define the abbreviation LLVIP.
Although the Introduction is rich in information, none of its technical statements are supported by relevant references.
It is recommended to use passive voice instead of first-person pronouns like “we” for a more formal tone.
Please define every abbreviation at its first use throughout the manuscript, for example SIFT, SURF, FAST, ORB, BRIEF, and GPU.
References are required to support the statements in the paragraph at lines 109-117.
No section or subsection should start with a figure or table; a description of the section needs to come first. Please relocate Table 1 and Figure 4 so that the descriptive text of the relevant sections precedes them.
Table 1 only names general categories such as “Traditional feature-based methods” or “Methods based on deep learning features”. Please give specific examples of the methods in each category.
In Table 1, relevant references for the statements need to be provided, especially when you are discussing the advantages and limitations of methods.
After giving the advantages and limitations of the methods in Table 1, you mentioned that your research is focused on cross-modal image registration. Please specify whether your method belongs to one of the categories in Table 1 or differs from them. If it differs, this is the place to state clearly what the difference is and how it overcomes the limitations of the other methods.
Section 3 and its subsections form the theoretical basis of the study but lack proper supporting statements and logic. It is strongly recommended to provide references where needed.
Is there a justification for choosing 12025 of the 15488 image pairs for training and the rest for testing? It is obvious why the majority of the data is used for training, but the choice still needs to be explained for readers with different levels of familiarity.
Describing the logic behind the methodological steps between lines 440-450 can add more clarity to the research.
The results in the Conclusion could be presented better. As written, it implies that the only main result is the 9.27% improvement in NCC over state-of-the-art methods; the results can be specified in more detail.
Please revise the conclusion, highlighting the results better. It is suggested that you highlight the main results in bullet points.
Please add a discussion on the application and impact of this research on real cases and industries.
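
For readers less familiar with the NCC figure raised in the comments above, the following is a minimal sketch of global normalized cross-correlation between two images. It is not taken from the manuscript: the function name `ncc` and the `eps` stabilizer are assumptions made for illustration, and the authors' exact NCC variant (global or windowed) is not stated in this record.

```python
import numpy as np


def ncc(a, b, eps=1e-8):
    """Global normalized cross-correlation of two equally shaped images.

    Hypothetical helper for illustration; `eps` only guards against
    division by zero when an image is constant.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    # Remove the mean so the score ignores constant intensity offsets.
    a0 = a - a.mean()
    b0 = b - b.mean()
    # Normalize by the product of the two centered norms.
    denom = np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum())
    return float((a0 * b0).sum() / (denom + eps))
```

The score lies in [-1, 1] and is unchanged by a positive linear rescaling of either image's intensities, which is one reason it is often used to compare images of differing brightness.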

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper proposes a multi-scale cascaded network based on style transfer for cross-modal image registration on UAV platforms. The method transforms visible images into pseudo-infrared images via a Cross-modal Style Transfer Network (CSTNet) to unify modality features and performs spatial alignment across multiple resolution scales using a Multi-scale Cascaded Registration Network (MCRNet). It employs a diffeomorphic deformation model to ensure smooth and invertible transformations and adopts a self-supervised learning paradigm to eliminate the dependence on manually annotated data. However, there are still some issues in the paper that need further revision and improvement.

The Introduction section lacks references to support the content.
Although the paper mentions that the proposed method uses a multi-scale cascaded network and a diffeomorphic deformation model, it lacks an in-depth comparative analysis of how these techniques specifically address the problems existing in current methods, such as inaccurate feature matching and difficulties in handling local deformations.
During the training of CSTNet, the authors applied random cropping and normalization to the input images. However, in the training of MCRNet, in addition to random cropping, random affine transformations and deformable transformations based on Gaussian random fields were also applied to the infrared images. Will this differential preprocessing have an impact on the experimental results?
It is suggested that the authors discuss the potential issues of the diffeomorphic deformation model in terms of computational complexity and propose corresponding solutions or optimization strategies. For example, can the computational cost be reduced by using more efficient numerical integration methods or simplifying the model?
In Exp. IV, although the proposed method achieved the best performance in terms of MSE, NCC, LNCC, and MI, there may be cases where the registration results of some image pairs are inaccurate in local areas. The paper does not provide a detailed analysis of these local anomalies.
The definition of the weight coefficients in the multi-scale similarity loss function and the multi-scale smoothness loss function seems to be empirical. It is suggested to add a theoretical analysis of the selection of the weight coefficients.
Many English abbreviations lack full names.
There are punctuation errors after the formulas. Formulas (9) and (10) should end with periods.
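
As context for the comment above on numerical integration in the diffeomorphic model, the following is a minimal one-dimensional sketch of scaling and squaring, a commonly used scheme for integrating a stationary velocity field. Whether the manuscript uses this scheme is not stated in this record; the function name `integrate_velocity_1d`, the `num_steps` parameter, and the use of linear interpolation are assumptions for illustration only.

```python
import numpy as np


def integrate_velocity_1d(velocity, num_steps=6):
    """Approximate the unit-time flow of a stationary 1D velocity field.

    Returns the displacement u such that the deformation is x -> x + u(x).
    Hypothetical helper; velocities are expressed in grid units.
    """
    v = np.asarray(velocity, dtype=np.float64)
    grid = np.arange(v.size, dtype=np.float64)
    # Scaling: start from a very small displacement v / 2**num_steps.
    disp = v / (2.0 ** num_steps)
    for _ in range(num_steps):
        # Squaring: compose the map with itself,
        # (phi o phi)(x) = x + u(x) + u(x + u(x)).
        # np.interp samples u linearly and clamps outside the grid.
        disp = disp + np.interp(grid + disp, grid, disp)
    return disp
```

With `num_steps = N`, the map is composed N times, which corresponds to 2^N small integration steps; choosing N is therefore the usual trade-off between cost and accuracy in this kind of integrator.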

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

Thanks for addressing my comments. I have no further comments.

Reviewer 2 Report

Comments and Suggestions for Authors

The article has been carefully revised in response to my previous comments.
In my view, it can be published in its current state.
