A Review of Cross-Modal Image–Text Retrieval in Remote Sensing
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper primarily summarizes the methods and key technologies of cross-modal image-text retrieval in remote sensing applications, covering three feature representation methods and three challenges. As a review, it demonstrates a clear research framework and reasonable evaluation methods, but it requires further improvement in the following aspects.
- What is the performance evaluation basis for Tables 2 and 3? Are there any references to support the data? It is also suggested to provide the specific evaluation methods and experimental conditions.
- The review proposes evaluation strategies for small objects and multi-temporal tasks (such as super-resolution reconstruction combined with semantic consistency verification). Have these methods been validated in existing work, and what is known about their reproducibility and generalizability?
Author Response
Dear Reviewer,
We sincerely appreciate your insightful and constructive comments on our manuscript. We have thoroughly considered all the feedback and recognized several key areas for improvement, particularly in enhancing the technical depth, refining the classification logic, and strengthening the performance analysis. We are confident that the revisions made have substantially improved the scientific rigor and clarity of our work.
Below, we provide a detailed point-by-point response to each of the comments raised.
Comments 1: What is the performance evaluation basis for Tables 2 and 3? Are there any references to support the data? It is also suggested to provide the specific evaluation methods and experimental conditions.
Response 1: Thank you for this valuable suggestion. We fully agree that the evaluation basis, data sources, and metrics should be more clearly clarified. Accordingly, we have fully reorganized and refined the evaluation section, and made the following major revisions:
- Data Citation Completion and Table Integration
- In the previous manuscript, the data in Tables 2 and 3 originated from published research in the field, but the references were incomplete and the organization lacked clarity.
- In the revised manuscript, we have integrated the original tables, optimized their structure, and supplemented all relevant citation sources corresponding to each method.
- Revised content can be found in Section 2.3, Table 2 (page 13), and Section 3.2, Table 3 (page 20).
- Addition of Unified Evaluation Metric Explanation
- Following your suggestion, we now explicitly define the evaluation protocol and justify the selected metric.
- We adopt mean Recall (mR) as the primary metric for both image-to-text and text-to-image retrieval tasks. New text has been added to explain the computation and rationale of mR and its superiority over single-cutoff metrics such as R@1 or R@10 (an illustrative computation sketch is given after this list).
- The new methodological explanation appears in Section 2.3, paragraph 6 (page 13), starting with:
“To provide a more rigorous and interpretable comparison of existing methods, we adopt mean recall (mR) as the primary evaluation metric …”
- Clarification of Data Sources and Experimental Basis
- The manuscript now clearly states that the comparison values are directly obtained from the respective papers, computed using the same recall-based metrics on standard datasets (RSICD and RSITMD).
- We emphasize that the purpose of this review is to synthesize and compare published methods rather than conduct new experiments.
These revisions ensure that all performance comparisons are transparent, reproducible, and supported by explicit literature.
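To make the mR protocol concrete, the following is a minimal, illustrative sketch of how R@K and mR are typically computed from an image-text similarity matrix. This is our illustration, not code from the manuscript; the function names are ours, and it assumes exactly one ground-truth caption per image, whereas benchmarks such as RSICD pair each image with five captions, so a full implementation would count a hit if any ground-truth caption appears in the top K.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    # sim[i, j]: similarity between query i and candidate j;
    # by assumption, candidate i is the ground-truth match for query i.
    ranking = np.argsort(-sim, axis=1)   # candidates per query, best first
    hits = [i in ranking[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

def mean_recall(sim_i2t: np.ndarray) -> float:
    # mR averages R@1, R@5, R@10 over both retrieval directions
    # (image-to-text uses sim, text-to-image uses its transpose).
    scores = [recall_at_k(s, k)
              for s in (sim_i2t, sim_i2t.T)
              for k in (1, 5, 10)]
    return sum(scores) / len(scores)

# Toy usage with random similarities for 100 image-caption pairs.
sim = np.random.default_rng(0).normal(size=(100, 100))
print(f"mR = {mean_recall(sim):.3f}")    # near chance level, as expected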
Comments 2: The review proposes evaluation strategies for small objects and multi-temporal tasks (such as super-resolution reconstruction combined with semantic consistency verification). Have these methods been validated in existing work, and what is known about their reproducibility and generalizability?
Response 2: We appreciate the reviewer’s insightful concern regarding the reliability and reproducibility of the proposed evaluation strategies.
After reconsidering this issue, we acknowledge that our originally proposed strategies were prematurely presented as actionable methods despite not having undergone complete experimental validation. To ensure scientific rigor and avoid overclaiming, we have substantially revised the corresponding sections:
- Repositioning the Proposed Strategies as Future Directions
- We moved the previously described evaluation strategies from the main methodological discussion to the Future Trends section.
- They are now explicitly presented as potential research directions, rather than established, validated evaluation protocols.
- The revised discussion appears in Section 3.2, final paragraph (Small Objects), page 20, and in Section 3.3, second-to-last paragraph (Multi-Temporal), page 22.
- Clarification of Current Limitations
- We now explicitly state that the proposed super-resolution-assisted consistency verification for small-object tasks has not yet been demonstrated in the existing RS cross-modal literature.
- This clarification improves transparency and avoids exaggeration of methodological maturity.
- Enhancing the Future Outlook Section
We now emphasize that these strategies are conceptual references intended to inspire future research, especially as large foundation models evolve toward fine-grained perception and temporal reasoning.
These revisions ensure the scientific reliability of the review and properly distinguish between validated techniques and forward-looking perspectives.
We are truly grateful for your valuable time and guidance, which have been instrumental in enhancing the quality of our manuscript. We believe that the revised version now fully aligns with the journal's standards in terms of scholarly rigor, structural coherence, and overall presentation. We eagerly look forward to your final evaluation.
Thank you once again for your support.
Yours sincerely,
Lingxin Xu
Reviewer 2 Report
Comments and Suggestions for Authors
1. Overall Assessment
This manuscript aims to survey the field of remote sensing image-text fusion. While the topic is of significant research value, the manuscript in its current state has fundamental flaws that preclude its acceptance for publication. The core issues lie in its unclear logical framework and a lack of in-depth analysis. The various sections of the survey lack a coherent narrative and organic connections, reading more like a list of works than an insightful synthesis. Furthermore, the manuscript fails to provide a critical discussion and deep exploration of experimental data and evaluation strategies. Therefore, I recommend rejection.
2. Technical Comments
(1) The logic and narrative of the abstract and the entire manuscript are unclear. The sections on feature representation, technological evolution, and research frontiers are not tightly connected, failing to construct a coherent intellectual framework for the reader.
(2) In Chapter 3, there is substantial overlap in the core concepts and related work between Section 3.1 (Multi-scale Image Feature Extraction) and Section 3.2 (Small Object Feature Extraction). The rationale and necessity for treating small object detection as a core section parallel to multi-scale analysis are insufficient; the content should be reorganized.
(3) The experimental results presented in Tables 1, 2, and 3 lack a profound comparative analysis. The manuscript merely presents data without interpreting its implications, comparing the advantages and disadvantages of different methods, or discussing the underlying reasons for the performance.
(4) The "evaluation strategies" mentioned at the end of Sections 3.2 and 3.3 are described too cursorily. The specific design, core principles, relative merits, and applicable scenarios of these strategies are not clearly elaborated. The outcomes of these evaluations are also not presented or discussed.
(5) Chapter 3 lacks a conclusive summary that synthesizes the core findings and technical trends discussed. The conclusion in Chapter 4 is overly superficial, largely restating the content without providing a critical synthesis.
(6) Figure 2 and the corresponding text in Section 2.1 discuss Real-Valued Representation from natural to remote sensing images, and the model architecture in Figure 1 is highly similar to CLIP. Although improvements to CLIP in remote sensing are mentioned in Section 2.3, the seminal CLIP paper itself is not cited in the references.
(7) The presentation in Figure 4 is confusing, suffering from information overload and/or poor layout. It is recommended to redesign this figure using a clearer and more intuitive format (e.g., a flowchart or a layered diagram).
(8) The English expression in the abstract is not fluent and exhibits noticeable repetitive phrasing, such as the overuse of "The," "This," and "These" at the beginning of nearly every sentence. Thorough language polishing is required to enhance its professionalism and readability.
(9) Figures 1, 2, 3, 5, and 6 are disproportionately large within the text and are not properly centered, which detracts from the overall aesthetic and professional presentation of the manuscript.
3. Formatting Comments
(1) Tables 2 and 3 span across pages, and the break is not appropriately marked. It is recommended to adjust the content (e.g., condensing text, adjusting row height) or the page layout to ensure each table is presented on a single page.
(2) The line numbers in the right-hand margin should be removed from the manuscript.
Author Response
Dear Reviewer,
We sincerely thank you for your thorough review and the constructive comments on our manuscript. Following all suggestions, we have carefully revised the entire manuscript, leading to major and comprehensive improvements. These revisions include substantial restructuring, rewriting of key sections with deepened theoretical synthesis, refinement of experimental analysis and discussion, clearer presentation of figures and tables, repositioned evaluation strategies, strengthened academic writing, and a professionally redesigned layout. All suggested changes have been meticulously addressed and incorporated into the revised manuscript. In the attached response letter, we provide a detailed point-by-point response to each of your valuable recommendations, categorized as follows: Overall Assessment, Technical Comments, and Formatting Comments.
We believe your feedback has been invaluable in significantly enhancing the rigor, coherence, and overall quality of this work. The revised version now provides a more valuable contribution to the field of remote sensing cross-modal retrieval, and we hope it meets the standards for publication.
Thank you again for your time and insightful guidance.
Sincerely,
Lingxin Xu
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This manuscript presents a comprehensive review of cross-modal image-text retrieval in remote sensing (RS), emphasizing the semantic alignment challenges between RS imagery and natural language descriptions. The authors systematically discuss two principal categories of feature representation methods—real-valued representations and deep hashing approaches—and provide an extensive analysis of three core challenges in the field: multi-scale semantic modeling, small-object feature extraction, and multi-temporal feature understanding. The manuscript also reviews a wide range of existing approaches, discusses their respective strengths and limitations, and highlights recent research trends, such as self-supervised learning and neural architecture search.
However, several methodological, structural, and presentation issues need to be addressed to improve the scientific rigor and clarity of exposition:
- In Chapter 2, the core technical details of different approaches are insufficiently explained. For example, in the real-valued representation methods, the paper should more clearly describe how image and text features are aligned across modalities, including the specific processes, mathematical formulations, and key parameters;
- The classification logic in Sections 2.1 and 2.2 is not rigorous, with overlapping and redundant layers. For instance, the boundary between real-valued representation and deep hashing is unclear. The former should emphasize continuous feature learning in high-dimensional real space, while the latter should focus on binary encoding for efficient retrieval. Moreover, in the deep hashing section, self-supervised and contrastive learning are introduced; these belong to training paradigms rather than feature representation types, which blurs the methodological distinction;
- In Chapters 2 and 3, the authors describe a large number of methods. It is strongly recommended to include a comprehensive summary table listing, for each method:
- The year of publication,
- The datasets used,
- The evaluation metrics adopted,
- The references, organized according to the framework proposed by the authors. Such a table would provide a clearer and more intuitive overview of the literature.
- Throughout the paper, the descriptions of some technical methods lack sufficient detail. For instance, while self-supervised learning is mentioned, the authors do not further elaborate on its specific applications and current research progress in RS image-text retrieval.
- In Section 2.3, the authors propose an evaluation framework, but the description of how the evaluation is implemented remains vague. The paper should further discuss how specific evaluation metrics (e.g., mAP, IoU) are connected to the challenges addressed, and how these metrics measure the effectiveness and multidimensional performance of the proposed methods.
- In the performance comparison (Tables 1 and 2), although the data presented are informative, the analysis is superficial. The authors should discuss why certain methods perform better, and what trade-offs must be considered when choosing among them (e.g., between computational resources and retrieval accuracy).
- The manuscript is generally well-structured, but the grammar and tense usage could be improved for academic rigor.
- Figures and tables contain formatting and labeling inconsistencies that affect readability and professionalism:
(a) Figure labeling inconsistencies: Figures 5, 6, and 7 are presented with overly brief and vague captions, making it difficult for readers to understand the meaning of the illustrations. Figure captions should clearly indicate the topic, functions of each component, and how the diagram relates to the described technical framework, especially for complex architectures or workflows.
(b) Lack of unified numbering and descriptive titles: Some figures and tables lack explicit descriptive captions. For example, Figures 5–7 do not sufficiently explain their content or relevance to the main theme. The captions merely provide short names without clarifying how the figures support the research discussion.
(c) In-text referencing: Figures 5–7 should be explicitly referenced and explained within the text, illustrating how each visual supports the paper’s key arguments.
- The reference formatting is inconsistent. For instance, references [5], [6], and [7] include DOIs or publication years, while [1], [2], and [3] do not. This inconsistency disrupts the formal structure of the reference list. Additionally, numerical citation mismatches exist: the numbering used in the text and in Tables 1–3 does not accurately correspond to the reference list. This must be corrected according to the journal's reference formatting guidelines.
Author Response
Dear Reviewer,
We would like to express our sincere gratitude for your extensive and critical review of our manuscript. We have carefully studied your comments regarding the logical framework, depth of analysis, and evaluation strategies. We acknowledge that the initial version had shortcomings in structural coherence and data interpretation.
In response, we have performed a comprehensive overhaul of the manuscript. We have restructured the narrative flow, rewritten the abstract and conclusion, deepened the analysis of experimental data using a new metric (mean Recall), and rigorously redefined our discussion of evaluation strategies to ensure scientific accuracy. All modifications are detailed point by point in the attached document.
Your guidance has significantly strengthened the scientific rigor, clarity, and overall structure of our work. We believe the revised manuscript now meets the journal’s standards for rigor, structure, and presentation. We hope our revisions are satisfactory, and we respectfully look forward to your further evaluation.
Thank you once again for your support.
Sincerely,
Lingxin Xu
Author Response File:
Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The article presents a fairly good review and analysis of retrieval methods, but there are still some issues:
1. I think more relevant research from 2025 should be added, and there are also a few relevant studies from 2024.
2. Some good works were not analyzed, for example:
[1] Yuan Z, Zhang W, Tian C, et al. MCRN: A multi-source cross-modal retrieval network for remote sensing[J]. International Journal of Applied Earth Observation and Geoinformation, 2022, 115: 103071.
[2] Yuan Z, Zhang W, Li C, et al. Learning to Evaluate Performance of Multimodal Semantic Localization[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-18.
[3]...
3. The article lacks analysis of the dataset. I think it should compare the current relevant image and text retrieval datasets and provide a table.
4. The author's analysis of the advantages, disadvantages, and innovation of various current methods is insufficient. Please provide more details.
Author Response
Dear Reviewer,
We sincerely appreciate your insightful and constructive comments on our manuscript. We have thoroughly considered all the feedback and recognized several key areas for improvement, particularly in enhancing the technical depth, refining the classification logic, and strengthening the performance analysis. We are confident that the revisions made have substantially improved the scientific rigor and clarity of our work.
Below, we provide a detailed point-by-point response to each of the comments raised.
Comments 1: I think more relevant research from 2025 should be added, and there are also a few relevant studies from 2024.
Response 1: Thank you for this valuable suggestion. Following your recommendation, we have supplemented the literature review with representative works published in 2024 and 2025, including GeoLangBind (2025), iEBAKER (2025), and additional recent RS-VLP-based methods. These studies have been incorporated into Section 2.3, particularly in the updated comparative analysis of mainstream methods.
Location of revision:
Pages 10–13, Section 2.3 (Dominant Paradigm and Performance Comparison). Corresponding references have been updated in the bibliography.
Comments 2: Some good works were not analyzed, for example:
[1] Yuan Z, Zhang W, Tian C, et al. MCRN: A multi-source cross-modal retrieval network for remote sensing[J]. International Journal of Applied Earth Observation and Geoinformation, 2022, 115: 103071.
[2] Yuan Z, Zhang W, Li C, et al. Learning to Evaluate Performance of Multimodal Semantic Localization[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-18.
[3] ...
Response 2: We sincerely thank the reviewer for this insightful comment and for pointing out these highly relevant and excellent works by Yuan et al., which we regrettably overlooked in our initial manuscript. We agree that discussing these works is crucial for a comprehensive literature review. In response, we have thoroughly studied the suggested papers, including MCRN and others. Accordingly, we have revised Section 2.3 on page 11 to incorporate a dedicated discussion of these methods. Specifically, we have added an analysis comparing their core ideas and contributions with the approaches discussed in our work, highlighting the similarities and differences. More importantly, we have cited these works as references [31] and [32] in the updated manuscript. We believe this revision has significantly strengthened the background and contextual positioning of our study. Once again, we are grateful for the reviewer's valuable suggestion.
Comments 3: The article lacks analysis of the dataset. I think it should compare the current relevant image and text retrieval datasets and provide a table.
Response 3: Thank you for highlighting this important omission. In accordance with your suggestion, we have added a new subsection discussing dataset characteristics, summarizing major RS image–text datasets, including UCM-Caption, Sydney-Caption, RSICD, RSITMD, NWPU-Caption, RS5M, SkyScript, and GeoLangBind. We also provide a newly added Table 1, comparing dataset size, image resolution, and captioning method, along with discussion of dataset evolution and its impact on model performance.
Location of revision:
Page 10, Section 2.3, second paragraph.
Comments 4: The author's analysis of the advantages, disadvantages, and innovation of various current methods is insufficient. Please provide more details.
Response 4: Thank you for pointing this out. We have substantially enhanced the comparative analysis of existing methods, focusing on their strengths, limitations, and contributions. The expanded analysis appears throughout Sections 2.1, 2.2, and 2.3.
Major improvements include:
- A more rigorous evaluation metric description:
We added a detailed explanation of why mean Recall (mR) is adopted, discussing its advantages over single-cutoff metrics such as R@1 or R@10.
- Expanded methodological comparison
We have significantly expanded the methodological discussion in our revision. Beyond enriching the related work to include real-valued methods, deep hashing, and other advanced paradigms, we have particularly strengthened the theoretical exposition in Sections 2.1 and 2.2. These sections now provide a systematic explanation and formulaic derivation of the two core representation methods, laying a more solid foundation for the proposed framework (a minimal conceptual sketch contrasting the two representation types is given after this list).
- Clear articulation of performance differences
We strengthened the analysis of Table 2 by explaining why early models underperform, how fine-grained alignment improves results, and why VLP models such as RemoteCLIP, GeoRSCLIP and iEBAKER achieve substantial gains.
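To make the distinction between the two core representation types concrete, here is a minimal conceptual sketch (illustrative only, not code from the manuscript; the function names are ours, and the embeddings are assumed to be already learned): real-valued methods rank candidates by similarity in a continuous embedding space, while deep hashing binarizes embeddings so retrieval reduces to fast Hamming-distance comparisons. The simple sign threshold below stands in for a learned hash layer.

```python
import numpy as np

def realvalued_retrieve(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    # Real-valued representation: cosine similarity in continuous space.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))                 # best match first

def hashing_retrieve(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    # Deep hashing: binary codes (here a naive sign threshold) compared
    # by Hamming distance -- compact storage, fast large-scale lookup.
    bq, bg = query >= 0, gallery >= 0
    hamming = np.count_nonzero(bg != bq, axis=1)
    return np.argsort(hamming)                  # smallest distance first

# Toy usage: one 512-d query embedding against 10,000 gallery embeddings.
rng = np.random.default_rng(0)
query, gallery = rng.normal(size=512), rng.normal(size=(10000, 512))
print(realvalued_retrieve(query, gallery)[:5])
print(hashing_retrieve(query, gallery)[:5])
```

The design trade-off this sketch exposes is the one the review discusses: binarization loses fine-grained similarity information but makes storage and distance computation far cheaper at scale.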
We are truly grateful for your valuable time and guidance, which have been instrumental in enhancing the quality of our manuscript. We believe that the revised version now fully aligns with the journal's standards in terms of scholarly rigor, structural coherence, and overall presentation. We eagerly look forward to your final evaluation.
Thank you once again for your support.
Yours sincerely,
Lingxin Xu
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have substantially improved the manuscript in response to the first-round review. The mathematical formalization, methodological descriptions, figure quality, and reference consistency have all been significantly enhanced. The revised version is clearer, more rigorous, and more readable. Most major concerns have been adequately addressed.
However, several minor issues remain, which, if addressed, would further strengthen the clarity and scientific rigor of the manuscript. My detailed comments are as follows:
1. Explanation of the mR metric remains conceptual
Section 2.3 provides a formal definition of the mean Recall (mR) metric and explains its advantage over R@K. However, the connection between mR and the practical challenges discussed in Section 3 (multi-scale modeling, small objects, multi-temporal features) remains qualitative.
2. Minor reference issues
Most reference formatting inconsistencies have been resolved. However, a few dataset-related citations (e.g., [18], [20]) appear to be missing DOI information.
Author Response
Dear Reviewer,
We sincerely appreciate your insightful and constructive comments on our manuscript. We have thoroughly considered all the feedback and recognized several key areas for improvement, particularly in enhancing the technical depth, refining the classification logic, and strengthening the performance analysis. We are confident that the revisions made have substantially improved the scientific rigor and clarity of our work.
Below, we provide a detailed point-by-point response to each of the comments raised.
Comments 1: Explanation of the mR metric remains conceptual
Section 2.3 provides a formal definition of the mean Recall (mR) metric and explains its advantage over R@K. However, the connection between mR and the practical challenges discussed in Section 3 (multi-scale modeling, small objects, multi-temporal features) remains qualitative.
Response 1: We sincerely thank the reviewer for this insightful comment. Following your suggestion, we have substantially revised the manuscript to strengthen the connection between the mR metric and the three core remote sensing challenges discussed in Section 3. The previous explanation indeed focused mainly on the conceptual advantages of mR, and we agree that a deeper discussion connecting mR to practical RS-specific retrieval difficulties is essential.
To address this, we added a short explanatory paragraph that links the mR formulation to the three core RS challenges and explains how each challenge tends to affect different recall ranks (thus motivating the averaged mR measure). This text was inserted at the end of Section 2.3 (immediately after the paragraph that defines mR and Table 2).
The added paragraph clarifies that:
1. A model's inability to detect small objects typically undermines its R@1 performance, owing to a direct loss of top-rank precision when key fine-grained features are overlooked.
2. Multi-scale issues often perturb mid-rank behavior (R@5), because correct matches may appear slightly lower in the ranking when scale mismatches confuse fine-grained alignment.
3. Multi-temporal ambiguities can spread errors across ranks, causing general degradation (affecting R@1, R@5, and R@10), because temporal changes create partial matches rather than outright negatives.
Averaging across ranks (mR) therefore produces a more stable and diagnostic metric for RS retrieval where multiple heterogeneous failure modes coexist. This makes explicit the connection between the metric choice and the technical difficulties summarized in Section 3.
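As a purely hypothetical illustration of this point (the rank lists below are invented for exposition, not experimental results), one can see how each failure mode moves R@K differently while mR summarizes them on a common scale:

```python
# Hypothetical 1-based ranks of the true match for ten queries under
# the three failure modes discussed above (illustrative numbers only).
failure_modes = {
    "small-object (top-1 lost)":      [2, 3, 2, 4, 1, 3, 2, 5, 2, 3],
    "multi-scale (mid-rank shifted)": [1, 6, 1, 7, 1, 6, 1, 8, 1, 6],
    "multi-temporal (spread errors)": [1, 4, 9, 2, 12, 5, 8, 1, 11, 6],
}

def r_at_k(ranks, k):
    # Fraction of queries whose true match appears within the top k.
    return sum(r <= k for r in ranks) / len(ranks)

for name, ranks in failure_modes.items():
    rs = {k: r_at_k(ranks, k) for k in (1, 5, 10)}
    mr = sum(rs.values()) / 3
    print(f"{name}: R@1={rs[1]:.1f} R@5={rs[5]:.1f} "
          f"R@10={rs[10]:.1f} mR={mr:.2f}")
```

In this toy setting the small-object list collapses R@1 (0.1) while leaving R@5 and R@10 intact, the multi-scale list depresses R@5, and the multi-temporal list degrades all three ranks, yet mR ranks the three regimes on a single comparable scale.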
Comments 2: Minor reference issues
Most reference formatting inconsistencies have been resolved. However, a few dataset-related citations (e.g., [18], [20]) appear to be missing DOI information.
Response 2: We have conducted another comprehensive and detailed review of all references in the manuscript. In particular, we have carefully verified and supplemented the DOI information for dataset-related citations, such as [18] and [20]. Additionally, we have strictly enforced the journal's citation style throughout the text, meticulously cross-referencing in-text citations with the bibliography to ensure correct numbering sequence and complete and consistent metadata, including DOIs and publication years.
We sincerely appreciate the time and effort you have dedicated to improving the quality of our paper. Your guidance has significantly strengthened the scientific rigor, clarity, and overall structure of our work. In particular, we have provided a complete response to your request for a more rigorous, non-qualitative connection between the mR metric and the challenges in Section 3, and we believe these revisions significantly improve the clarity and interpretability of the manuscript. We believe the revised manuscript now meets the journal's standards for rigor, structure, and presentation. We hope our revisions are satisfactory, and we respectfully look forward to your further evaluation.
Thank you once again for your support.
Yours sincerely,
Lingxin Xu
Author Response File:
Author Response.docx

