Article
Peer-Review Record

Optimization of Object Detection Network Architecture for High-Resolution Remote Sensing

Algorithms 2025, 18(9), 537; https://doi.org/10.3390/a18090537
by Hongyan Shi 1, Xiaofeng Bai 2 and Chenshuai Bai 2,*
Submission received: 16 July 2025 / Revised: 16 August 2025 / Accepted: 20 August 2025 / Published: 23 August 2025
(This article belongs to the Section Combinatorial Optimization, Graph, and Network Algorithms)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The study positions the research within the growing domain of remote sensing object detection, highlighting its practical applications in military, urban planning, and environmental monitoring, which immediately affirms the significance of the study. However, I have the following comments:
1. The introduction requires grammatical refinement.
2. The phrase "objected optimization" in line 76 appears to be a typographical error and should be corrected.
3. The shift from general challenges to YOLOv10 could be improved with linking phrases to aid reader navigation.
4. While limitations of YOLOv10 are mentioned, no specific performance metrics are cited to quantify the gaps being addressed.
5. While the literature is well-referenced, there is minimal critical synthesis or comparative analysis.
6. The section often reads like a sequential listing of existing techniques without sufficient integration or synthesis.
7. Some citations are introduced without specific insight into their unique contribution (e.g., [19], [22], [25]).
8. While the hardware and software specifications are provided in Table 1, the experimental setup lacks information on batch size, learning rate, training epochs, and whether identical conditions were used across all models.
9. The experimental section does not provide detail on data augmentation, which is standard practice in training deep learning models for object detection.
10. While performance improvements are reported, there is no mention of statistical validation (e.g., confidence intervals or significance testing) to verify that the improvements are not due to variance in the data.

Comments on the Quality of English Language

The paper requires extensive grammatical refinement and correction of typographical errors.

Author Response

  1. The introduction requires grammatical refinement.

Reply: Thank you for your review and valuable comments. We have carefully revised the Introduction to improve grammar and clarity. All changes are highlighted in blue in the manuscript.

  2. The phrase "objected optimization" in line 76 appears to be a typographical error and should be corrected.

Reply: We sincerely appreciate your careful reading. We have corrected this error to "optimize the YOLOv10x model" and have carefully checked the entire manuscript for similar issues. Thank you again for your valuable feedback.

  3. The shift from general challenges to YOLOv10 could be improved with linking phrases to aid reader navigation.

Reply: We sincerely appreciate the reviewer's valuable suggestion regarding the logical flow of our manuscript. In response, we have added explicit transitional phrases between the discussion of general challenges and our selection of YOLOv10x, including:

  1. Listing and briefly analyzing classic YOLO versions (v5, v8, and v9).
  2. Clear justification for choosing YOLOv10x as our baseline.
  3. Specific limitations of YOLOv10x that motivate our improvements.

Modifications are highlighted in light blue.

  4. While limitations of YOLOv10 are mentioned, no specific performance metrics are cited to quantify the gaps being addressed.

Reply: We sincerely appreciate your valuable feedback. As suggested, we have now added specific performance metrics to quantify the limitations of YOLOv10 in the Introduction section.

  5.–6. While the literature is well-referenced, there is minimal critical synthesis or comparative analysis. The section often reads like a sequential listing of existing techniques without sufficient integration or synthesis.

Reply: We sincerely appreciate your careful reading. We have restructured the originally sequential presentation into a thematic, logical overview. The new version no longer introduces each article in isolation; instead, it groups the works by research topic and provides in-depth comparison and synthesis. After discussing the challenges posed by remote sensing images (complex backgrounds, multiple scales, etc.), rather than directly enumerating solutions, References [23] and [24] are discussed as representative work under the theme of 'multi-scale and context-aware' methods. We explicitly point out that both studies adopt a similar core strategy to address these challenges and compare their specific implementations (for example, the multi-scale guidance module of [23] vs. the feature context aggregation module of [24]). References [25] and [26] are grouped under the theme of 'deep learning network architectures', where we explain how deep learning has pushed networks toward deeper and more specialized designs. This approach not only presents the contribution of each work but also traces the path of technological evolution in this field. In the discussion of attention mechanisms, references [29], [30], and [31] are considered together to reveal their common core idea: handling geometric changes of the target by dynamically adjusting the receptive field and applying attention. This conveys the internal relationships among these works and their shared direction of innovation better than introducing each work in isolation.

  7. Some citations are introduced without specific insight into their unique contribution (e.g., [19], [22], [25]).

Reply: We sincerely appreciate your careful reading. Specific and insightful descriptions have been added for each reference (especially [19], [22], and [25], which you mentioned) so that they are no longer simple citation markers. Unique contribution of [25]: we no longer refer generically to a 'multi-stage deep enhancement network' but elaborate on its distinct approach: 'the positive samples of small targets are increased by a label assignment strategy based on the central region, and the feature representation is selectively enhanced by a gated context aggregation module'. This precisely identifies its unique method for small-target detection and feature enhancement. Use of [19] and [22]: in the new version, we cite [19] and [22] in conjunction with other references rather than in isolation. References [18] and [19] serve as general citations establishing the importance of object recognition in remote sensing images, while [20], [21], and [22] jointly support the discussion of the challenges posed by remote sensing images. This usage conforms to the norms of academic writing: it retains the references while avoiding lengthy commentary on each general citation.

  8. While the hardware and software specifications are provided in Table 1, the experimental setup lacks information on batch size, learning rate, training epochs, and whether identical conditions were used across all models.

Reply: Thank you for pointing out the missing experimental settings in our paper. Based on your valuable comments, we have supplemented the key training parameters shared by all models in the revised version, including batch size, learning rate, and the number of training epochs. These additions are intended to ensure that our research is fully reproducible and further enhance its scientific rigor.

  9. The experimental section does not provide detail on data augmentation, which is standard practice in training deep learning models for object detection.

Reply: Based on your valuable suggestion, we have supplemented the detailed data augmentation strategy in the revised paper. The new Table 1 clearly lists all training parameters, including HSV transformation, translation, scaling, horizontal flipping, and Mosaic augmentation, which are key steps for improving the generalization ability and robustness of the model. Thank you for helping us refine the details of the experimental section, which makes our research more scientifically rigorous and reproducible. A sketch of how such a shared configuration is typically expressed follows below.
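For readers reproducing such a setup, the sketch below illustrates how these shared training hyperparameters and augmentations are commonly configured, assuming an Ultralytics-style training interface; every value and file name is an illustrative placeholder, not the authors' reported setting.

```python
from ultralytics import YOLO

# Load the baseline model (weights file name assumed for illustration)
model = YOLO("yolov10x.pt")

# Train all compared models with one shared configuration.
# Every value below is a placeholder, not the paper's actual setting.
model.train(
    data="rsod.yaml",   # hypothetical dataset config
    epochs=300,         # training epochs
    batch=16,           # batch size
    lr0=0.01,           # initial learning rate
    hsv_h=0.015,        # HSV hue augmentation
    hsv_s=0.7,          # HSV saturation augmentation
    hsv_v=0.4,          # HSV value augmentation
    translate=0.1,      # random translation
    scale=0.5,          # random scaling
    fliplr=0.5,         # horizontal flip probability
    mosaic=1.0,         # Mosaic augmentation probability
)
```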

  10. While performance improvements are reported, there is no mention of statistical validation (e.g., confidence intervals or significance testing) to verify that the improvements are not due to variance in the data.

Reply: Thank you very much for your valuable advice on statistical validation, which is essential to ensuring the rigor of the research results. We have adopted your suggestion and supplemented the relevant analysis in the paper. To verify the reliability of the performance improvement, we conducted three independent experiments for each configuration and report the average performance metrics. In addition, we evaluated the difference between our proposed YOLO-KRM model and the baseline model with a paired t-test. The results show that the performance improvement is statistically significant (p < 0.01), which strongly indicates that these enhancements are not caused by random variance.
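As a minimal sketch of the validation described above, a paired t-test over per-run scores can be computed with SciPy; the mAP values below are made-up placeholders, not the paper's results.

```python
from scipy import stats

# Hypothetical mAP scores from three independent runs per model
baseline_map = [0.842, 0.838, 0.845]   # YOLOv10x (placeholder values)
improved_map = [0.871, 0.868, 0.874]   # YOLO-KRM (placeholder values)

# Paired t-test: runs are paired because each pair shares the same
# seed/data split in a matched experiment.
t_stat, p_value = stats.ttest_rel(improved_map, baseline_map)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.01
```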

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

I have the following comments for your reference:

Comment 1: Line 126: "objects in remote sensing image identification typically …", capitalize "objects" as "Objects".

Comment 2: Figure 1: Make all the text in the figure English.

Comment 3: Line 221: What does the sentence fragment "binary and univariate addition procedures" mean?

Comment 4: Clean up the lines from 219 to 223. There are repeated and fragmented sentences.

Comment 5: Explain or define every acronym in the manuscript, especially in the figures. For example, in Figure 1: the terms “SCDown” and “PSA” were not defined; in Figure 3: I’m not sure whether the terms “LAP” and “GAP” are acronyms or are they the English words lap and gap. Maybe this is the limitation of the English language.

Comment 6: Are the architecture diagrams in Figures 1 and 3 original to this work? If so, I recommend explicitly stating this in the captions. If they are adapted from prior literature, please cite the source and briefly note any modifications. Given the clarity and quality of these figures, such attribution would enhance their academic value while ensuring proper credit.

Comment 7: Section 4.3: Can you add another evaluation indicator, the running time? It would be good to compare the running time of the revised networks with the literature.

Comment 8: Figure 6: While the combined visualization in Figure 6 is comprehensive, its density makes it challenging to discern each algorithm's performance differences. To improve readability, I suggest separating this into four individual bar plots. A dedicated plot for each metric would more effectively highlight the comparative performance of the different algorithms on the RSOD datasets.

Comment 9: To further improve the manuscript's readability, consider splitting Figure 7 into four distinct bar plots. I also noted that the information in Figures 6 and 7 appears to be duplicated in Tables 2 and 3. The authors might consider whether both formats are necessary, as presenting the data in either graphical or tabular form (but not both) often makes for a more concise presentation.

Comment 10: Line 528: correct the section 4.6 title as “Experimental Result”.

Comment 11: Did you evaluate the benchmark model (YOLOv10x) under the same conditions? In addition to the detection results presented in Figure 8, could you provide comparative visualizations (e.g., side-by-side detection outputs from both YOLOv10x and YOLO-KRM)? This would allow readers to directly observe performance differences, particularly in cases where YOLO-KRM corrects detection errors made by the benchmark model. Highlighting such contrasting examples would further strengthen the validation of your proposed method.

Author Response

  1. Line 126: “objects in remote sensing image identification typically …”, make the “object” as “Object”.

Reply: We appreciate your careful reading. We have revised the sentence to capitalize the first letter as it appears at the beginning of the sentence (now: "Objects in remote sensing image identification typically..."). All sentence-starting words in the manuscript have been verified for proper capitalization. The modifications have been highlighted in blue.

  2. Figure 1: Make all the text in the figure English.

Reply: We sincerely appreciate your careful review. We have revised Figure 1 by translating all Chinese annotations into English to ensure international readability.

  3. Line 221: What does the sentence fragment “binary and univariate addition procedures” mean?

Reply: We sincerely appreciate your questions. We have completely rephrased the description of the Kolmogorov-Arnold representation theorem. This revised formulation eliminates the previous ambiguous terminology of "binary and univariate addition procedures" and instead uses mathematically precise language that better aligns with the classical statement of the theorem. The modifications have been highlighted in blue.
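For reference, the classical statement of the Kolmogorov-Arnold representation theorem asserts that every continuous function $f : [0,1]^n \to \mathbb{R}$ admits a representation

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right),$$

where $\Phi_q$ and $\varphi_{q,p}$ are continuous univariate functions, so any multivariate continuous function reduces to sums and compositions of univariate functions.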

  4. Clean up the lines from 219 to 223. There are repeated and fragmented sentences.

Reply: We sincerely appreciate your valuable feedback on the clarity of this section. We have carefully revised the text to address these issues: (1) we removed duplicate sentences; and (2) we improved some split sentences into complete sentences to better maintain the flow of the content. The revised text is now highlighted in blue in the manuscript.

  5. Explain or define every acronym in the manuscript, especially in the figures. For example, in Figure 1: the terms “SCDown” and “PSA” were not defined; in Figure 3: I’m not sure whether the terms “LAP” and “GAP” are acronyms or are they the English words lap and gap. Maybe this is the limitation of the English language.

Reply: We sincerely appreciate the reviewer's meticulous attention to detail regarding acronym definitions. In response to this valuable feedback, we have systematically reviewed all figures and the main text to ensure every acronym is clearly defined upon first use. Specifically: (1) For Figure 1, we have added definitions for "SCDown", "PSA", "C2f", "C2fCIB", "SPPF", and "MHSA"; (2) For Figure 3, we clearly stated that “LAP”, “GAP”, and “UNAP” are technical acronyms and defined them after Figure 3. All additions are marked in blue.

  6. Are the architecture diagrams in Figures 1 and 3 original to this work? If so, I recommend explicitly stating this in the captions. If they are adapted from prior literature, please cite the source and briefly note any modifications. Given the clarity and quality of these figures, such attribution would enhance their academic value while ensuring proper credit.

Reply: We are very grateful for the reviewer's suggestion regarding proper attribution of the figures. In response:

  • For Figure 1, we have clearly stated in Section 3 that this diagram is an improvement based on the original YOLOv10x architecture, with proper citation to the YOLOv10x reference, and provided detailed explanation of our specific algorithmic modifications.
  • For Figure 3, we have properly cited the original MLCA architecture reference in Section 3.2.

All modifications have been highlighted in blue in the revised manuscript.

  7. Section 4.3: Can you add another evaluation indicator, the running time? It would be good to compare the running time of the revised networks with the literature.

Reply: Thank you very much for your valuable advice on adding running time as an evaluation indicator. We have adopted this suggestion and supplemented the relevant analysis in Section 4.4 of the paper. Considering that frames per second (FPS) is a more intuitive and representative performance indicator in computer vision and real-time applications, we use FPS in place of raw running time. FPS is the reciprocal of running time (FPS = 1/Running Time) and directly reflects the model's processing capacity per unit time. We performed FPS evaluation on the modified network and compared it fairly with methods from the existing literature. The detailed results are presented in the paper.
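A minimal sketch of how such an FPS figure is typically measured follows; the model and image tensors here are hypothetical stand-ins, not the authors' benchmarking code.

```python
import time
import torch

def measure_fps(model, images, warmup=10):
    """Average FPS over a sequence of single-image inferences."""
    model.eval()
    with torch.no_grad():
        # Warm-up passes so GPU initialization does not skew the timing
        for img in images[:warmup]:
            _ = model(img)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for img in images:
            _ = model(img)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    avg_time = elapsed / len(images)   # per-image running time
    return 1.0 / avg_time              # FPS = 1 / running time
```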

  8. Figure 6: While the combined visualization in Figure 6 is comprehensive, its density makes it challenging to discern each algorithm's performance differences. To improve readability, I suggest separating this into four individual bar plots. A dedicated plot for each metric would more effectively highlight the comparative performance of the different algorithms on the RSOD datasets.

  9. To further improve the manuscript's readability, consider splitting Figure 7 into four distinct bar plots. I also noted that the information in Figures 6 and 7 appears to be duplicated in Tables 2 and 3. The authors might consider whether both formats are necessary, as presenting the data in either graphical or tabular form (but not both) often makes for a more concise presentation.

Reply: We sincerely thank the reviewer for these constructive suggestions to improve the clarity of the data presentation. After careful consideration, we have made the following changes:

  • We have removed the two visualizations (original Figures 6 and 7) to eliminate redundancy with Tables 2 and 3 and reduce visual clutter.
  • We have revised the corresponding text descriptions to better highlight the key comparative results. These changes are highlighted in blue.

  10. Line 528: correct the section 4.6 title as “Experimental Result”.

Reply: Thank you for pointing out this error. We have corrected the title of Section 4.6 from "Experimental resul" to "Experimental Result." Furthermore, we have thoroughly reviewed all section titles in the manuscript to ensure consistency and correct capitalization. All changes have been highlighted in blue in the revised version.

  11. Did you evaluate the benchmark model (YOLOv10x) under the same conditions? In addition to the detection results presented in Figure 8, could you provide comparative visualizations (e.g., side-by-side detection outputs from both YOLOv10x and YOLO-KRM)? This would allow readers to directly observe performance differences, particularly in cases where YOLO-KRM corrects detection errors made by the benchmark model. Highlighting such contrasting examples would further strengthen the validation of your proposed method.

Reply: Thank you very much for your valuable advice. We fully agree that a side-by-side visual comparison of the baseline model and our proposed YOLO-KRM model demonstrates the effectiveness of our method more intuitively and convincingly. We have ensured that all models (including the baseline YOLOv10x and our YOLO-KRM) were trained and evaluated under exactly the same conditions to guarantee the fairness and comparability of the results. As requested, we provide additional visualization results that highlight scenarios where the YOLO-KRM model corrects detection errors made by the baseline model, comparing the detailed detection results of the YOLOv10x and YOLO-KRM models on the RSOD and NWPU VHR-10 datasets.
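A minimal sketch of such a side-by-side comparison, assuming the annotated detection outputs of both models are already rendered as image arrays (all names here are illustrative, not the authors' code):

```python
import matplotlib.pyplot as plt

def plot_side_by_side(baseline_img, improved_img,
                      titles=("YOLOv10x", "YOLO-KRM")):
    """Show annotated detections from two models next to each other."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    for ax, img, title in zip(axes, (baseline_img, improved_img), titles):
        ax.imshow(img)          # img: detection output with boxes drawn
        ax.set_title(title)
        ax.axis("off")
    fig.tight_layout()
    plt.show()
```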


Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The suggested corrections have been adequately effected by the authors, and the paper is suitable for publication in its current form.

Comments on the Quality of English Language

The language has significantly improved, with all the minor grammatical and typographical errors fixed.
